Friday 28 June 2013

5 things to consider when joining data

Warning - we are in major geek territory here. This assumes a basic understanding of SQL. You have been warned!

OK, you have been asked to get some data together for a project. The data is scattered across different tables and needs to be joined together before you can get any insight from it. Here are 5 things to help you avoid some of the pitfalls that can happen.

1.  Consider location carefully
Your analysis software is very powerful. Many business intelligence packages can allow you to join a table in one server to a table in another server, without ever considering whether it is a good idea to do so. I once heard of a user in Cheshire who tried to join a couple of million records on a server in London to half a million records in Hong Kong, and was surprised when he brought his entire network down. Don't do it! Extract your data from one server, place it in a temporary area on the other server, and join the data there. Or even better, import both tables to a local server before joining them. It may be a little more inconvenient to code, but don't cross your network administrators else ye will pay a terrible price!!

2. Keep your joins simple
Yes, it's flash and kind of impressive if you join 5 tables together within one SQL statement. But one day, someone else is going to have to examine your data when it is challenged. They are going to get to your large SQL query and realise that the problem is somewhere in the tenuous joining you have done. It's much better to join one table to another, then join the product of that join to the next table, and so on. It takes longer to code, but means that gaps in referential integrity are much easier to spot.
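As a rough sketch of the step-by-step approach, here is one way it could look using Python's built-in sqlite3 module. The customers, orders and payments tables, and the step1 and step2 names, are made up for illustration. Each join is saved as its own temporary table, so you can count the gaps after every step.

```python
import sqlite3

# Hypothetical tables, invented purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE payments (payment_id INTEGER, order_id INTEGER, amount REAL);

    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
    INSERT INTO payments VALUES (100, 10, 25.0), (101, 11, 40.0);

    -- Step 1: join orders to customers and keep the result.
    CREATE TEMP TABLE step1 AS
    SELECT o.order_id, o.customer_id, c.name
    FROM orders o
    LEFT OUTER JOIN customers c ON c.customer_id = o.customer_id;

    -- Step 2: join the product of step 1 to payments.
    CREATE TEMP TABLE step2 AS
    SELECT s.order_id, s.name, p.amount
    FROM step1 s
    LEFT OUTER JOIN payments p ON p.order_id = s.order_id;
""")

# After each step, count the rows that failed to find a match.
orders_without_customer = conn.execute(
    "SELECT COUNT(*) FROM step1 WHERE name IS NULL"
).fetchone()[0]
orders_without_payment = conn.execute(
    "SELECT COUNT(*) FROM step2 WHERE amount IS NULL"
).fetchone()[0]

print(orders_without_customer, orders_without_payment)
```

Order 12 points at a customer that does not exist, and the check after step 1 catches it before it gets buried in a bigger query.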

3.  De-duplicate the keys before joining
Just do it. Make it a habit. Even when the data looks right. One day it might be wrong.
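To show why, here is a small sketch using Python's built-in sqlite3 module and two made-up tables (customers and orders are assumptions for illustration only). One duplicated key in the lookup table quietly inflates the total. Taking MIN(name) per key is just one possible way to de-duplicate; the right rule depends on your data.

```python
import sqlite3

# Hypothetical tables: customer 1 has been loaded twice by mistake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0);
""")

# Joining straight away counts order 10 twice.
naive_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()[0]

# A quick check for keys that appear more than once.
duplicate_keys = conn.execute("""
    SELECT customer_id, COUNT(*)
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

# De-duplicate the keys first, then join.
deduped_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN (SELECT customer_id, MIN(name) AS name
          FROM customers
          GROUP BY customer_id) c
      ON c.customer_id = o.customer_id
""").fetchone()[0]

print(naive_total, duplicate_keys, deduped_total)
```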

4.  Use indexed fields
Ideally, you should be using indexed fields for the joins and also any other selection criteria. Indexes vastly improve processing times. If the fields are not indexed, you can add an index when you import the data. 
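Here is a minimal sketch of adding an index after import, again with Python's built-in sqlite3 module. The table names and the index name idx_customers_customer_id are made up for illustration. SQLite's EXPLAIN QUERY PLAN shows whether the join picks the index up; other databases have their own equivalents, and the exact wording of the plan varies between versions.

```python
import sqlite3

# Hypothetical imported tables with no indexes yet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
""")

join_sql = """
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
"""

def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = query_plan(join_sql)

# Add an index on the join key after the import.
conn.execute("CREATE INDEX idx_customers_customer_id ON customers (customer_id)")

plan_after = query_plan(join_sql)

print(plan_before)
print(plan_after)
```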

5.  Outer join and coalesce
When joining tables, consider what you want to happen to the records that don't match. It is a lot simpler to 'left outer join' the tables so that the missing records still appear in the query results. You can then use the coalesce function to add in an identifier for your missing data. It makes your reports more transparent if bad data can be categorised. It keeps your organisation honest and makes your results easier to check.
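A small sketch of the idea, using Python's built-in sqlite3 module. The tables and the 'UNKNOWN CUSTOMER' label are made up for illustration. An inner join silently drops the order with no matching customer, whereas the left outer join keeps it and COALESCE labels it.

```python
import sqlite3

# Hypothetical tables: order 12 refers to a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
""")

# Inner join: the unmatched order simply disappears.
inner_count = conn.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()[0]

# Left outer join: every order survives, and COALESCE labels the gaps.
rows = conn.execute("""
    SELECT o.order_id,
           COALESCE(c.name, 'UNKNOWN CUSTOMER') AS customer_name
    FROM orders o
    LEFT OUTER JOIN customers c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

print(inner_count)
for row in rows:
    print(row)
```

The labelled row can then be reported as its own category rather than vanishing from the totals.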

Saturday 22 June 2013

Big Data - Keep calm and carry on?

It is an uncomfortable truth that oil has been a contributing factor of nearly all recent international conflicts. So when our Big Data evangelists tell us that "Data is the new oil," it may be more of a problem than we currently understand.

Julian Assange recently stated that the internet has become a militarised zone. He was responding to the recently leaked revelations about the surveillance system 'Prism'. But is this something new?

Whether we like it or not, the internet and criminality have always been uneasy partners. Everyone gets phishing emails. We all know not to follow links for generic Viagra or bank password resets. We also know that those polite and badly worded emails telling you that you have inherited millions of dollars are a bit too good to be true! You only need to see the number of virus definitions that your anti-virus software updates on a weekly basis to see that computer spying and infiltration have been an accepted way of life for many years.

We have become so accustomed to implementing our own defence systems (firewalls) and counter-measures (anti-virus) that we have become blind to the reality. There is a war on for our data. The Americans haven't just invented spying. It is a burgeoning international business. How could we possibly forget the Leveson revelations of computer hacking by the red-top journalists of Fleet Street? How about the Stuxnet worm that was allegedly developed by an alliance between Israeli and US intelligence? China and Russia have also been implicated at other times.

Stuxnet is widely regarded as the first reported government-sponsored worm to achieve successful military sabotage. It was used to target Iran's uranium enrichment programme by causing major failures in the specialist industrial control software, manufactured by Siemens, that the programme relied on.

So as data becomes increasingly valuable, it becomes an even greater target. With spying comes other activities of warfare - destruction of property and the disabling of capability.

The new "Big data" warehouses being built are of such value and importance that they may be too big and important to fail. We are becoming increasingly dependent on our data. Distributing computer operations over large arrays of nodes decreases risk by spreading operations over a collection of cores, but this complexity also increases the possible points of failure and the ease with which sabotage can happen.

With the rapidly growing rewards of big data, we must take great care to understand the risks. We are now building ubiquitous systems of such importance that they become targets on a military-industrial scale.

Saturday 15 June 2013

The end of social media part 2

In a previous blog article, I mentioned how social media was being increasingly levered by corporate interests, as users were facing complex privacy options and more intrusive advertising.

I now want to address the effect of America's Prism surveillance system and its possible implications for the social media business.

For Facebook, this is particularly damaging, as it undermines the credibility of their already unpopular privacy settings. There are still many fundamental questions that need to be answered. The Prism documents refer to data being directly acquired from internet companies. Facebook and Microsoft both deny culpability, quoting their stats about formal information requests. While it is possible to intercept information between computers, it becomes far easier if they are complicit in the operation.

Facebook has been troubled by pro-child-abuse pages popping up. At times it has appeared that Facebook does not have the resources to bring them down fast enough. Cyber bullying is on the rise, too. Children across America are being shot in school. These are problems that Prism would be excellent at confronting. It is clear from Facebook's continuing problems and the recent escalation of shootings in schools that the NSA is not protecting children in the USA.

This suggests to me that they are either not getting the results they wanted, or they are only picking projects that cannot be traced back to Prism.

So if Prism isn't being used to improve the social media experience, suspicions will run high as to the intentions of the NSA. This does not bode well for the reputations of the social media providers that fall within Prism's reach. It also raises questions about the increasing militarisation of the internet.

Will people turn away from social media? Do people want to be monitored 24/7? What are the guarantees that Prism won't be used for 'special political interests'? Two things are clear: 1. It is a tough time to be a social media provider. 2. The full truth has yet to come out.

Friday 14 June 2013

5 steps to valuable data quality measurement

Kaplan and Norton's Balanced Scorecard is a way of defining and measuring strategic performance. It is probably the most used tool in management today. A data quality department may wish to have its progress measured on one or more of the quadrants.

But the full corporate scorecard itself can provide great guidance as to the strategic direction of data quality measurement and remediation for the whole of your organisation.

1.  Identify key data items
For each goal in your balanced scorecard, identify the fields, tables and databases that are critical to the delivery of the scorecard objectives. If you can, also include the data items that are being used for the scorecard measures.

2. Prioritise each data item
The key to this part of the exercise is to ensure that the most important data items for each measure get the full attention they deserve. You will possibly have a lengthy list of data items for each goal within your balanced scorecard. Cut out the items that are not important. If you still have a large number of fields, try to give priorities and weightings to them.

3.  Agree business validation rules
Once you have a comprehensive list of fields and tables, go to your business and agree the business rules that you will use to validate each field. 

4.  Measure the quality of your data
Apply the business rules to all fields and tables as per above. Roll all the scores up into the items that you originally started with. You now have a scorecard of data quality in relation to your corporate Balanced Scorecard.

5.  Take it onwards with actions
What you have developed is a powerful baseline that informs your colleagues exactly how well the quality of data underpins your strategic corporate values and goals. Next steps are to prioritise any remedial action that is required, and agree all targets for improvements.
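To make steps 3 and 4 a little more concrete, here is a rough Python sketch. The sample records, the validation rules, the goal names and the weightings are all invented for illustration; your own would come from the business rules and priorities you agreed above. Each field is scored as the share of records that pass its rule, and the field scores are then rolled up into a weighted score per scorecard goal.

```python
import re

# Hypothetical sample records, invented for this illustration.
records = [
    {"email": "ann@example.com", "postcode": "CW1 2AB", "balance": 120.0},
    {"email": "bob-at-example.com", "postcode": "CW1 2AB", "balance": 35.5},
    {"email": "cat@example.com", "postcode": "", "balance": -10.0},
    {"email": "dan@example.com", "postcode": "", "balance": 0.0},
]

# Step 3: one agreed business rule per field (assumed rules, kept simple).
rules = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "postcode": lambda v: bool(v.strip()),
    "balance": lambda v: v >= 0,
}

# Steps 1 and 2: the fields behind each scorecard goal, with weightings.
goals = {
    "Customer satisfaction": {"email": 2, "postcode": 1},
    "Financial performance": {"balance": 1},
}

def field_score(field):
    """Share of records whose value passes the rule for this field."""
    passed = sum(1 for record in records if rules[field](record[field]))
    return passed / len(records)

def goal_score(weights):
    """Weighted average of the field scores behind one goal."""
    total_weight = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total_weight

# Step 4: measure each field, then roll the scores up to the goals.
field_scores = {field: field_score(field) for field in rules}
goal_scores = {goal: goal_score(weights) for goal, weights in goals.items()}

for goal, score in goal_scores.items():
    print(f"{goal}: {score:.0%}")
```

The goal-level scores are the baseline described in step 5, and the field-level scores show where remedial action would move them the most.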

Saturday 8 June 2013

The best form of defence

Recently the Guardian newspaper broke a story that the National Security Agency (NSA) in the United States of America had implemented an extensive program to acquire and monitor all on-line communications. The project has been active for the past 6 years, and acquires data from Google, Apple, Microsoft, Facebook, Paltalk, AOL and YouTube.

When you put all of these services together, you will realise that they also cover FaceTime, Google Talk and Skype, which are all video and internet telephone conferencing facilities, as well as email and social media. Couple this with the recent discovery that the NSA also acquired access to Verizon's telephone call records, and you have the largest, most insidious and far-reaching national and international communications surveillance program of all time.

This level of surveillance displays paranoia on such an industrial scale as to make the cold-war McCarthy witch-hunts seem like a storm in a tea-cup. It is quite right for everyone to be extremely concerned. As an immediate ally of the United States, the UK government is quite rightly under extreme pressure to disclose its involvement. The internet service providers are at present denying culpability.

Which brings me neatly to my conclusion. The best defence is trust. If your customers know you are doing the right thing with their data, they will stick with you. If your data is wrong, or you are doing unethical things with it, expect trouble. Good data governance ensures that you honour your obligations to your customers, and prompts you to question when your government asks for too much.

Thursday 6 June 2013

Size vs value

It is a widely quoted stat that business data is expected to more than double every year. This is due to a perfect storm of the increasing proliferation of data generating devices, and the rapidly decreasing costs of collection and storage. 

The rise in the use of cloud technology offers enormous advantages to businesses everywhere. This cannot be overstated. Simply put, using 3rd parties to store information means that companies no longer need to pay large amounts for server space that they may never use. Instead, they pay their cloud provider just for the space they use.

While this is a fantastic idea, we can easily get sucked into storing data, just because we can. Because data is measured and more importantly charged by size, it creates a market demand for data, not because it is useful, but because it is an asset that provides an income.

A recent Digital Universe study found that only 0.5% of all data is actually analysed. So is the 99.5% useful, wrong or just waiting for technology to catch up?

The much promised 'Big Data' solutions have not achieved critical mass within the IT industry, with people talking about them more than implementing anything. So what is happening to all this excess data? The truth is, we are all paying for it in one way or another, in the price of our goods and services or the taxes we pay to our governments.

Data size is only important to cloud providers, as that is how they choose to charge people. The challenge is for everyone to find a better way to assign value to their data.  Only then can we keep the data that can take our lives forward and reject the waste that is clogging servers all over the world.