Monday 2 July 2012

Big data - how big is big?

One of the many buzzwords flying around the data management world is 'Big Data'. It seems that every other day we hear hairy-chested stories about petabyte-sized data sets being churned through by networks of computers, gathering data from weather satellites, climate models and many other massive number-crunching projects. It's easy to get carried away by the amount of data that modern systems have to cope with.

Big data is largely a question of scale. Data should only be called big relative to the infrastructure it resides on and the time available to do the work. In my view, you have big data if any of the following is happening in your organisation.
  • Your processes are exceeding their allotted window in the daily/weekly/monthly etc. schedule
  • Specialist data projects have trouble running their specialist processes
  • Your larger scheduled processes are being cancelled or are falling over on a regular basis
Your data becomes 'big' data when it exceeds or severely challenges the capacity of your systems, or when the time it takes to process the data becomes inconvenient or unfeasible.
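As a rough illustration of that test, here is a minimal Python sketch that flags scheduled jobs which either overran their window or fell over. The job names, windows and durations are made up for the example; in practice you would pull them from your own scheduler's logs.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class JobRun:
    """One execution of a scheduled process (hypothetical structure)."""
    name: str
    window: timedelta    # time allotted in the schedule
    duration: timedelta  # time the run actually took
    succeeded: bool      # False if it was cancelled or fell over


def is_big_data_symptom(run: JobRun) -> bool:
    """A run is a warning sign if it failed or exceeded its window."""
    return (not run.succeeded) or run.duration > run.window


# Illustrative figures only, not real measurements.
runs = [
    JobRun("nightly_sales_load", timedelta(hours=2),
           timedelta(hours=1, minutes=40), True),
    JobRun("weekly_customer_rebuild", timedelta(hours=6),
           timedelta(hours=7, minutes=15), True),
    JobRun("monthly_archive", timedelta(hours=8),
           timedelta(hours=3), False),
]

flagged = [run.name for run in runs if is_big_data_symptom(run)]
print(flagged)
```

If the flagged list keeps growing from one week to the next, the data has become 'big' for your infrastructure, whatever its absolute size.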

Coping with big data on a mature infrastructure is notably more challenging than on those landmark 'blue sky' bespoke projects, and upgrading your servers to cope may not be a practical option. There are, however, other ways to optimise your data and get better performance from large data sets.

  • Reduce extractions to the fields you need, rather than whole tables. Ban "select * " from your code.
  • Be careful how you join tables. Make sure the fields you are joining on are indexed, and that you join on every field that relates the two tables, so that a partial key does not multiply rows.
  • Index your data mart tables on integer and date fields and on primary, foreign and surrogate keys.
  • Build each entity from several tables with as few fields as possible in each. Narrower tables are easier to load and manage.
  • Delta load incrementally - i.e. load a day's worth of data every day, rather than year-to-date's data every day.
  • Monitor your servers and retain stats on how things are running. Look for times where the performance slows and conduct root cause analysis to keep your performance up.
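Several of the points above can be sketched together in a few lines. The example below uses Python's built-in sqlite3 module with an in-memory database purely so that it runs anywhere; the table names, columns and dates (source_sales, mart_sales and so on) are assumptions made for illustration, and the same ideas carry over to whatever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A wide source table and a narrower, indexed data mart table
# (both hypothetical).
conn.executescript("""
CREATE TABLE source_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER,
    sale_date   TEXT,
    amount      REAL,
    notes       TEXT
);
CREATE TABLE mart_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER,
    sale_date   TEXT,
    amount      REAL
);
CREATE INDEX idx_mart_sales_date ON mart_sales (sale_date);
CREATE INDEX idx_mart_sales_customer ON mart_sales (customer_id);
""")

# Made-up sample rows spanning three days.
conn.executemany(
    "INSERT INTO source_sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2012-06-30", 25.00, "free text"),
        (2, 11, "2012-07-01", 40.50, "free text"),
        (3, 10, "2012-07-01", 12.75, "free text"),
        (4, 12, "2012-07-02", 99.99, "free text"),
    ],
)


def delta_load(conn: sqlite3.Connection, load_date: str) -> int:
    """Load one day's rows, naming only the fields the mart needs."""
    conn.execute(
        "INSERT INTO mart_sales (sale_id, customer_id, sale_date, amount) "
        "SELECT sale_id, customer_id, sale_date, amount "  # no SELECT *
        "FROM source_sales WHERE sale_date = ?",
        (load_date,),
    )
    conn.commit()
    # Return the total number of rows now in the mart table.
    return conn.execute("SELECT COUNT(*) FROM mart_sales").fetchone()[0]


rows_after_day1 = delta_load(conn, "2012-07-01")
rows_after_day2 = delta_load(conn, "2012-07-02")
print(rows_after_day1, rows_after_day2)
```

Each run moves only one day's rows and only four of the five source fields, so the work stays roughly constant from day to day instead of growing with the year-to-date volume.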
