Friday 28 June 2013

5 things to consider when joining data

Warning - we are in major geek territory here. This assumes basic understanding of SQL. You have been warned! 

OK, you have been asked to get some data together for a project. The data is scattered across different tables and needs to be joined together before you can get any insight from it. Here are 5 things to help you avoid some of the pitfalls that can happen.

1.  Consider location carefully
Your analysis software is very powerful. Many business intelligence packages can allow you to join a table in one server to a table in another server, without ever considering whether it is a good idea to do so. I once heard of a user in Cheshire who tried to join a couple of million records on a server in London to half a million records in Hong Kong, and was surprised when he brought his entire network down. Don't do it! Extract your data from one server, place it in a temporary area on the other server, and join the data there. Or even better, import both tables to a local server before joining them. It may be a little more inconvenient to code, but don't cross your network administrators else ye will pay a terrible price!!

2. Keep your joins simple
Yes, it's flash and kind of impressive if you join 5 tables together within one SQL statement. But one day, someone else is going to have to examine your data when it is challenged. They are going to get to your large SQL query and realise that the problem is somewhere in the tenuous joining you have done. It's much better to join one table to anther; then use the product of that join to another table etc. It takes longer to code, but means that gaps in referential integrity are much easier to spot.

3.  De-duplicate the keys before joining
Just do it. Make it a habit. Even when the data looks right. One day it might be wrong.

4.  Use indexed fields
Ideally, you should be using indexed fields for the joins and also any other selection criteria. Indexes vastly improve processing times. If the fields are not indexed, you can add an index when you import the data. 

5.  Outer join and coalesce
When joining tables, consider what you want to happen to the records that don't match. It is a lot simpler to 'left outer join' the tables so that the missing records still appear in the query results. You can then use the coalesce function to add in an identifier for your missing data. It makes your reports more transparent if bad data can be categorised. It keeps your organisation honest and makes your results easier to check.

No comments:

Post a Comment