Tuesday 31 July 2012

When is a problem not a problem?

One of your colleagues comes to you with a problem. There is a field in one of the marts that is not being populated. The data is being collected in your production database; it is just not being passed into your mart.

An analysis of your mart produces interesting results: the field has never been populated. This is where we enter the world of the defect versus the feature.

I have heard the term 'feature' politely used for something that ought to work, but was either never tested, was overlooked during testing, or was discovered so late that a decision was made to implement without it. There is significantly less desire within your IS department to correct a 'feature' than a true deviation from agreed process ('if it ain't broke, don't fix it'). The longer the time span between implementation and discovery of the feature, the more reluctant your colleagues will be to fix it.

It is very easy to become cynical when faced with vigorous challenges from your technical areas over remediating features. You may have tried to fix the problem before, but with the advent of new disciplines like data quality, there is much more support available for getting these problems remediated.

So the answer is: a problem is always a problem, no matter how Orwellian the description. Some may choose to reframe the issue, but the truth is that it never worked properly. The 'feature' is still a problem. Screw your courage to the sticking-place. Assert yourself. Engage your data quality section. Get it fixed.

Monday 30 July 2012

4 fundamental questions for business intelligence

When setting up a business intelligence project, most people tend to start with systems, people and processes. It is very common to consider the 'hows' before the 'whys'. Taking a step back and asking some more fundamental questions may be more beneficial.

Cost
Any piece of information can be captured, but is it financially feasible to do so? Compare the costs to the anticipated benefits of acquiring the information. What people generally miss from a good cost-benefit analysis is the cost of acquiring the money itself. This depends on how your organisation is funded: if you have shareholders, the cost of acquiring money (dividend payments) is probably far higher than borrowing from a bank (interest). Once you factor in these financing costs, your net benefit may be far smaller than you thought. As new technologies become established, their cost comes down. Could you wait until this happens, or would you be losing a potential opportunity to gain an advantage over competitors?
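As a back-of-the-envelope illustration (every figure and rate below is hypothetical, not a benchmark), here is how factoring in the cost of the money changes the sums:

    # All figures are hypothetical, for illustration only.
    anticipated_benefit = 250_000   # yearly value of the new information
    project_cost = 150_000          # cost to capture and store it

    # The cost of acquiring the money depends on how you are funded.
    dividend_rate = 0.12            # equity-funded: expected dividend yield
    interest_rate = 0.06            # debt-funded: bank interest

    for label, rate in [("equity", dividend_rate), ("debt", interest_rate)]:
        financing_cost = project_cost * rate
        net_benefit = anticipated_benefit - project_cost - financing_cost
        print(f"Funded by {label}: net benefit = {net_benefit:,.0f}")

The same project looks noticeably less attractive funded by equity than by debt, which is exactly the gap a naive cost-benefit analysis misses.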

Ethics
Is it ethically appropriate to collect the information? Do your customers or colleagues know you are capturing it? Do they have a right to know? If they found out, what would their reaction be? What are the rules governing this information, and what are your regulatory obligations and constraints?

Ownership
Who will own this new data? Ownership and accountability should be established at project inception. I guarantee that if something can be built without an accountable owner agreed beforehand, your colleagues will head for the hills; if they can get something built without being held accountable, they will. Without an accountable business owner, it becomes much harder to get cooperation from the rest of the organisation when the data requires remediation.

Access
This is like accountability, only the other way round. A new report or data set becomes available and EVERYONE wants access. Do they really need it? Is the information politically sensitive? Could the performance of other colleagues be derived from this information? What could be the repercussions of general circulation? How valuable would the insight be to an external company? How vulnerable is this data to theft?

Consider carefully before you start. Collecting new data carries inherent risks - both financial and human - that need to be weighed up front.

Saturday 28 July 2012

Security considerations for data quality tooling

One of the more unexpected sources of resistance to implementing a data quality toolset could be your own internal IT department. If they take security and system performance seriously, they will want to know all about your new software and how it interacts with all the databases and marts.

Data quality tooling typically uses ODBC or JDBC connections to access your enterprise data sources, and each connection requires a user ID and password. If you are planning to implement a desktop-only solution, prepare for a long fight. Desktop versions of data quality systems rely on your PC having these connections set up with the passwords embedded. This causes problems if other people gain access to your desktop: they may be able to use your connections with other programs - like MS Access - to query your data and save it anywhere. Even if the ODBC/JDBC connections are not accessible outside the data quality suite, the embedded ETL capabilities of the data quality software may pose a security risk that proves a step too far for some system administrators.
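To make the risk concrete, here is a minimal sketch, assuming a hypothetical system DSN named 'MartDSN' with credentials embedded in its configuration (the table and column names are illustrative too). Any ODBC-capable program on the desktop - not just the data quality suite - can open it:

    import pyodbc  # any ODBC-capable program would do; pyodbc is one example

    # 'MartDSN' is a hypothetical DSN; because the user ID and password are
    # embedded in its configuration, no credentials are needed here at all.
    conn = pyodbc.connect("DSN=MartDSN")
    cursor = conn.cursor()

    # Whoever sits at this desktop can now query - and export - the mart.
    cursor.execute("SELECT customer_id, surname, postcode FROM customers")
    for row in cursor.fetchmany(10):
        print(row)
    conn.close()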

The solution is to implement a server version of your data quality software. This involves installing the software on a central server, with the ODBC/JDBC connections similarly centralised. Users then run a desktop client that interfaces with the server and cannot access it until a password is keyed in. This is far more secure, but will effectively treble the set-up costs, especially if you build a failover solution. A failover option may be mandatory if you work in a highly governed business like pharma or banking.

In this information age, the value of data has never been higher. With current crime trends pointing to insiders stealing data and selling it on to other companies as the biggest data security threat, it is important that the capability of data quality tooling is not perceived as too great a risk. All of these risks can be successfully mitigated with the correct infrastructure and governance controls around appropriate access to data.

Thursday 26 July 2012

Trust, a bridge to better quality

Remember my post on organisational entropy? One of the key secrets to lowering the entropy in your organisation is building trust. You can tell when trust is lost with any individual, because every interaction becomes almost impossible. Working in the quality and governance space is a highly sensitive business, for who will trust you with their business problems if they believe you are not trustworthy?

You can take the role of a critical friend; you can become a master of body language and unspoken communication; you can even learn NLP. But these are all superfluous without the core principles of building trust. Here is my take on them:

Deliver to your promise
Say what you are going to do, say when you will do it, and deliver. When you can do this consistently, you are well on the way to building trust. Broken promises damage it. If you find that you cannot deliver on something, go immediately to your customer, apologise, and let them know what you can do. The earlier you can tell your customer about a potential problem, the better they can manage expectations elsewhere.

Keep things private and personal, appropriately
Nothing destroys trust more than blurting out gossip that was told to you in confidence. This is particularly damaging if you talk about one department's troubles to another. Word gets around that you can't be discreet, and that causes problems. However, this is not the same as keeping secrets from colleagues who should be informed. If someone tells you of criminal activities, report them to the appropriate officials, not to your friends over coffee.

Delegate to educate
When you delegate something, you are also telling your colleague that you trust them. But don't just drop it and run. Use it as an opportunity to coach them in your area of expertise. Share the decisions you would make if you were doing it yourself. Build that rapport.

Deliver together
If you have worked on something with someone, present the results together. Make sure they know they are being recognised as a key contributor. 

Take responsibility appropriately 
Don't blame others when things go wrong. At the same time, don't accept blame when it is not your fault. 

When dealing with data quality or governance issues, you are in a position of trust. Therefore, be worthy of that trust. It is one of the most important things you can build with your colleagues. Be consistent, be effective, be reliable and fair. The rewards are great. As Emerson said:
"Trust men and they will be true to you; treat them greatly and they will show themselves great."

Friday 20 July 2012

Is your SCV really an SCV?

My business is full of jargon. You can't avoid it. And along with the jargon comes a whole rack of TLAs. In this case, TLA stands for "Three Letter Acronym". Things like Treating Customers Fairly (TCF) and Extract, Transform and Load (ETL), to name a couple.

But the one TLA that causes more furrowed brows than any other I have seen is SCV: the Single Customer View.

As businesses grow, they acquire other companies and build separate databases to service different customers. As time goes by, you can end up with a single customer whose personal details sit on a bewildering array of platforms and marts, with a significant risk that your marketing data and customer insight become fragmented.

The best solution is to build a separate data mart with all of your customer insight data merged. There are a number of master data management tools that can help you do this; they are purpose-built for the job and can cut down a lot of the legwork. With these solutions proliferating, consider that you may not be the only one in your organisation building a single customer view.
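To give a flavour of the legwork these tools automate, here is a minimal matching sketch. The record layouts, the match key (normalised name plus date of birth) and the survivorship rule are illustrative assumptions, not how any particular MDM tool works:

    # Hypothetical customer records drawn from two source systems.
    records = [
        {"source": "CRM",     "name": "Jon  Smith", "dob": "1970-04-01", "email": "jon@example.com"},
        {"source": "Billing", "name": "JON SMITH",  "dob": "1970-04-01", "email": None},
        {"source": "CRM",     "name": "Ann Jones",  "dob": "1985-11-23", "email": "ann@example.com"},
    ]

    def match_key(rec):
        # Normalise case and runs of whitespace, then pair with date of birth.
        return (" ".join(rec["name"].lower().split()), rec["dob"])

    merged = {}
    for rec in records:
        view = merged.setdefault(match_key(rec), {"sources": []})
        view["sources"].append(rec["source"])
        # Survivorship rule: keep the first non-null value seen per field.
        for field in ("name", "dob", "email"):
            if view.get(field) is None and rec[field] is not None:
                view[field] = rec[field]

    for view in merged.values():
        print(view)

The two Jon Smith records collapse into one view that remembers both of its sources, which is exactly the reconciliation you want to happen once, not five or six times.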

This brings me to the crux of my point. A single customer view should be just that - single. Having 5 or 6 different single customer views throughout your organisation is bound to create reconciliation problems further downstream.

So when someone asks you to build an SCV, your first priority is to make sure it hasn't already been done somewhere else before you start. And the second is that it should be the last single customer view you create.

Thursday 19 July 2012

Privacy - sharing the responsibility

Business intelligence is right at the cutting edge of all our concerns about privacy. Facebook and Google are prime examples of organisations that collect an enormous amount of personal information. One of the main challenges to business intelligence is the ethics of collecting information, and where it crosses into intrusion and espionage.

We now know that not only did Google unethically collect personal data (its Street View cars harvesting payload data from unsecured Wi-Fi networks), but it also failed to delete the information correctly. It makes you wonder what information is being collected that is not currently being used for anything. This goes even further when you look at the amount of personal information collected by mobile devices like smartphones. Steve Jobs had an interesting take on the problem.

I would take this a step further and say that sharing of information and total transparency are the key. There are significant advantages to be gained from giving customers control of the information you hold about them: it makes them accountable for its quality. Wouldn't it be great if, every six months, we sent our customers an email asking them to check their contact data and revise their privacy agreement and preferred contact methods? Completing the process could entitle them to discounts or extra services.
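A minimal sketch of how that six-monthly prompt might be driven (the customers table, its columns and the six-month window are all assumptions):

    import sqlite3
    from datetime import date, timedelta

    def customers_due_review(conn):
        # Pick customers whose details have not been confirmed for ~6 months.
        cutoff = (date.today() - timedelta(days=182)).isoformat()
        cur = conn.execute(
            "SELECT customer_id, email FROM customers "
            "WHERE last_confirmed IS NULL OR last_confirmed < ?",
            (cutoff,),
        )
        return cur.fetchall()  # hand these rows to your email campaign tool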

More importantly, when everyone works together, trust is built. Customers get back the control and businesses get a feedback loop to keep details updated and quality high.

Wednesday 18 July 2012

What gets fixed first?

Dogs played a special role in ancient Greece. There were guard dogs, hunting dogs and herding dogs; they even put a mythical guard dog, Cerberus, at the gates of the dead. Choosing the right puppy from a litter was an important skill. Making sure you had the very best of the litter could mean the difference between life and death.

One hunting dog expert advocated a special test to find out which puppy was the best. He decided to let the mother choose: he took the pups away from her, placed an oil-soaked string in a circle around them, and set it on fire. The mother would jump over the ring and rescue each pup... one by one... in order of merit. The first pup saved was obviously the best one.

Many people build data quality reporting tools with some kind of severity measure to establish priority. However, you will always get colleagues who inflate the severity just to get their issue fixed first, and others who understate the risk to avoid repercussions. Naturally, some errors stand out as the most important. But when you have a massive list of data quality errors, go back to your consumers and ask them: which puppy gets saved first?
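One hedged way to put that question, sketched below with purely illustrative issue names: let each consumer rank the issues best-first and aggregate the rankings, rather than trusting self-reported severity:

    from collections import Counter

    # Each consumer ranks the issues they care about, best-first:
    # "which puppy gets saved first?" Issue names are illustrative.
    rankings = [
        ["missing_postcodes", "duplicate_customers", "stale_prices"],
        ["duplicate_customers", "missing_postcodes"],
        ["duplicate_customers", "stale_prices", "missing_postcodes"],
    ]

    # Simple Borda-style count: an issue scores more points the nearer
    # the front of each consumer's list it appears.
    scores = Counter()
    for ranking in rankings:
        for position, issue in enumerate(ranking):
            scores[issue] += len(ranking) - position

    for issue, score in scores.most_common():
        print(issue, score)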

Sunday 15 July 2012

Retail and Corporate Customer Data

Companies generally hold two types of customer data: retail and corporate. Your retail customers are private individuals and households; corporate customers are businesses. There are subtle differences between the two.

Unless you deal exclusively in one or the other, your corporate customer information will have fewer rows than your retail. There are far more ways to check your retail customers, against files like the electoral roll or the national deceased register. There are no equivalent lists of contacts within businesses that you can reference to check whether your contacts still work there.

So when approaching the quality of these data sources, the following assumptions should be tested:
  • Corporate customers generally bring in more revenue, so more funds may be available for remediation, and there may be a greater desire within the business to keep these customers delighted. However, corporate contact data is much harder to check without going back to the customer.
  • Retail customers have higher volumes of data, and there may be less funding available to fix problems, so scale must be considered wisely. Retail data is easier to fix, because many more tools are available to check and remediate it.
When developing a single customer view, there are significant differences in the structure of the data. Your retail customer has a national insurance number, a date of birth, a home address and maybe a correspondence address. Your corporate customer has a name, a job title, a role, perhaps even a level of authority for placing orders, a company name, a registered office address, an invoice address and delivery addresses. The list is not exhaustive.
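A minimal sketch of that structural difference, with illustrative (and certainly not exhaustive) field names:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class RetailCustomer:
        national_insurance_number: str
        date_of_birth: str
        home_address: str
        correspondence_address: Optional[str] = None  # not everyone has one

    @dataclass
    class CorporateContact:
        name: str
        job_title: str
        role: str
        order_authority_limit: Optional[float]  # authority to place orders
        company_name: str
        registered_office_address: str
        invoice_address: str
        delivery_addresses: List[str] = field(default_factory=list)  # often several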

Segmenting your data quality approach for corporate and retail customers is one dynamic that must be considered if you are to make an outstanding contribution to your organisation. 

Friday 13 July 2012

Managing the mavericks

Centralising specialist functions within a large organisation is generally the best way of ensuring that operations are run in a consistent and professional manner. For IT related functions, this is vital. Departments that are entirely focused on non-data goals like sales, manufacturing or customer services may easily become disillusioned by the IT department and what they see as excessive time, cost and red tape. This gives rise to a phenomenon I have named "the data maverick."

Data mavericks are empowered individuals who often start as average employees doing the same job as everyone else. They see an opportunity to develop their own data services for their bosses outside the formal support of the IT infrastructure. Very often the services they provide are niche activities that your IS department is reluctant to undertake. This can be a good thing in the short term, but you have to examine how they achieve their objectives: development times are slashed by ignoring IT fundamentals like backing up the data, failover servers, data security and documentation. The department boss is delighted, because all he or she sees is results. But if (heaven forbid) your data maverick were to have an unfortunate accident, their systems would not be supported by the rest of the organisation.

The rise of data mavericks must be taken seriously. They are a warning sign that your technical operations are not meeting the demands of your internal customers, or that budgets are not in line with departmental ambition. This could be due to high internal charging, bureaucratic project control, resource allocation issues or infrastructure constraints. It has been known for IT departments to overprice projects that they do not want to undertake. Addressing these problems is the first step, but the second must be to bring maverick processes under the support of qualified colleagues.

Your maverick may not see your point of view. They have autonomy and a certain satisfaction from being able to achieve quick, short-term wins. But they are operating without the support of the rest of the organisation, and as such are shouldering much more responsibility than the average colleague. They are likely to be on a significantly lower wage than your IT professionals, without any formal training progression.

If you can, try to include your maverick in the migration process. You have a colleague who is able to learn for themselves and develop on their own. These people can be hard to find. Move them into your IT or business intelligence function. Give them the training and the development they need to use your organisation and technical infrastructure correctly. With the right support, your maverick can become an outstanding asset.

Tuesday 10 July 2012

Data Amnesty

I recently walked into our Management Information section. One of my colleagues (whom I have known for years) joked, "Quick, it's the data police... hide your laptops."

It made me think, though....

In the UK, when the police want to clear an area of dangerous weapons, they announce an amnesty.

The police make it known that, for a limited period, anyone with a weapon can hand it in at the station without fear of prosecution.

So why not announce a data quality or IT 'amnesty'? These are the things you could include:
  • Data quality issues
  • Data security issues
  • User IDs that are no longer in use
  • Computer equipment that is no longer in use
  • Software that is not recognised by the IT department
A strategy like this works well when there is an uneasy relationship between management and colleagues. It is a good way to start a new data quality or governance department, or simply to re-launch your services. If you have a good relationship, make it humorous: turn it into a gimmick, with everyone who submits something entered into a prize draw.

If you think a strategy like this won't work, it's worth noting that UK police amnesties took nearly 40,000 weapons off our streets in 2012. Imagine what a similar amnesty might do for your organisation.

Sunday 8 July 2012

Data Quality - S.E.P.?

Douglas Adams' book Life, the Universe and Everything waxed lyrical about a fantasy piece of technology called an S.E.P. field. This extremely useful piece of kit was used like camouflage: it enabled the characters to land on Earth in a massive UFO without anyone noticing.

The S.E.P. field did not make you invisible. Everyone could still see you perfectly. It simply made them totally ignore you, by convincing them that you were somebody else's problem.

Amusing it may be, but it is based on a large body of philosophical and psychological studies suggesting that people, organisations and whole societies can completely disconnect themselves from a critical issue - even one staring them in the face. Here are some well-known reasons why you may be the last to hear about a data problem in your organisation.

Optimism bias - A self-serving trait where people spot a potential risk but ignore it by assuming it will not happen, because doing so allows them to continue on their chosen path. Smokers, for example, are less likely to think that cigarettes will give them cancer. This bias is most often seen in high-pressure areas where risk-taking is rewarded.

Diffusion of responsibility - The colleague is less likely to report a problem because they do not see it as their job to raise it. This is most common in hierarchical organisations where roles are very sharply delineated or processes are very mechanistic and rigid. It is perceived as too much trouble to deviate from the norm.

De-motivation - The colleague does not see the benefit in reporting or fixing the problem. It is far too much effort for something they believe will not help them in the long run; they may even think fixing it is impossible. De-motivation is a symptom of low self-esteem, cynicism or excessive workloads.

Herd mentality or bystander syndrome - If no-one else is dealing with the problem, then why should the individual? I call this "collective irresponsibility syndrome". It is usually driven by fear of being punished for standing out from the crowd.

Your data quality measures can't be everywhere. Sometimes you have to rely on your colleagues to keep their eyes and ears open and report what they know. Be aware that it can be a challenge for people to come forward and say something has gone wrong - even if they know it wasn't their fault. Things you can do:
  • Start a no-blame culture
  • Encourage colleagues to take personal and social responsibility
  • Build a customer centric ethos
  • Allow people to report the problems they discover in different ways, so they can choose the most comfortable method of reporting
  • Recognise colleagues for spotting data quality issues

So is there a S.E.P. field around your quality issues? 

Thursday 5 July 2012

When change causes problems

Organisations change for a number of reasons. They may be downsizing, merging or exploring new opportunities. The regulatory landscape may suddenly change. The whole company may restructure to align to new practices and processes.

Your data marts may be totally fit for purpose. But what if that fundamental purpose changes? Four defences will help you weather the change:

Document your data - Build a complete metadata model. Keep it relevant by including any changes to your systems. Tell everyone about it. And when you think everyone knows about it, tell them again and again. A good metadata model is only valuable when everyone knows about it and uses it.
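Even a very small catalogue is a start. Here is a minimal sketch of one possible shape for an entry (the fields, and the example values, are assumptions rather than any standard):

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class MetadataEntry:
        mart: str           # which mart or database the field lives in
        table: str
        column: str
        description: str    # what the field means in business terms
        owner: str          # the accountable business owner
        last_updated: date  # bump this whenever the system changes

    catalogue = [
        MetadataEntry("sales_mart", "mart_sales", "amount",
                      "Gross sale value in GBP, including VAT",
                      "Head of Sales Operations", date(2012, 7, 5)),
    ]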

Govern your data - Have a team of people who can advise all change projects on how the data is to be governed. Good data governance people will make sure that data is valued as much as the infrastructure.

Profile your data - Keeping up-to-date profiles of your data can alert you to compatibility issues between the data on your legacy systems and the target systems to which it is to be migrated.
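A minimal profiling sketch, assuming extracted rows arrive as a list of dictionaries: per-column completeness and distinct-value counts, run against both legacy and target extracts, will make many incompatibilities obvious early.

    def profile(rows):
        # Per-column completeness (non-null rate) and distinct-value counts.
        columns = set()
        for row in rows:
            columns.update(row)
        result = {}
        for col in columns:
            values = [row.get(col) for row in rows]
            non_null = [v for v in values if v is not None]
            result[col] = {
                "completeness": len(non_null) / len(rows),
                "distinct_values": len(set(non_null)),
            }
        return result

    # Example: 'dob' is only 50% complete here; a mismatch like that between
    # legacy and target extracts is worth investigating before migration.
    print(profile([{"postcode": "AB1 2CD", "dob": None},
                   {"postcode": "EF3 4GH", "dob": "1970-04-01"}]))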

Standardise your data quality - By this, I mean the tools you are using. Bespoke reports on legacy systems have all kinds of business and technical assumptions built into them; if you change the use of the underlying data, these reports will be compromised. A central toolset makes changes in quality requirements far easier to manage.

This four-pronged attack will lay the foundation for managing data through fundamental, wholesale changes in regulations, roles, strategy and corporate direction.

Monday 2 July 2012

Big data - how big is big?

One of the many buzzwords flying around the data management world is 'big data'. It seems that every other day we hear hairy-chested stories about petabyte-sized data sets being churned by networks of computers gathering data from weather satellites, climate models and other massive number-crunching projects. It is easy to get carried away by the amount of data that modern systems are having to cope with.

Big data is largely a question of scale: data should only be called big when compared to the infrastructure it resides on and the time available to do the work. In my view, you have big data if any of the following is happening in your organisation:
  • Your processes are exceeding their allotted window in the daily/weekly/monthly etc. schedule
  • Specialist data projects have trouble running their processes
  • Your larger scheduled processes are being cancelled or fall over on a regular basis
Your data becomes 'big' data when it exceeds or severely challenges the capacity of your systems, or when the time it takes to process becomes inconvenient or infeasible.

Having big data on a mature infrastructure is notably more challenging than those landmark 'blue sky' bespoke projects, because upgrading your servers to cope may not be a practical option. There are other ways to optimise your data and get better performance from large data sets:

  • Reduce extractions to the fields you need, rather than whole tables. Ban "SELECT *" from your code.
  • Be careful how you join tables. Make sure the fields you join on are indexed, and join on all the fields that relate the tables.
  • Index your data mart tables - integers, dates, and primary, foreign and surrogate keys.
  • Model each entity as several narrow tables with as few fields as possible in each; narrower tables are easier to load and manage.
  • Delta load incrementally - i.e. load a day's worth of data every day, rather than reloading the year-to-date every day (see the sketch after this list).
  • Monitor your servers and retain stats on how things are running. Look for times when performance slows, and conduct root cause analysis to keep performance up.
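To illustrate the explicit-column and delta-load points, here is a minimal sketch using SQLite. The sales, mart_sales and load_watermark tables, and their columns, are hypothetical:

    import sqlite3
    from datetime import date

    def delta_load(conn):
        cur = conn.cursor()
        # Read the high-water mark: the most recent day already loaded.
        cur.execute("SELECT COALESCE(MAX(loaded_to), '1900-01-01') "
                    "FROM load_watermark")
        loaded_to = cur.fetchone()[0]

        # Extract only the new rows, naming the columns we need: no SELECT *.
        cur.execute(
            "INSERT INTO mart_sales (sale_id, customer_id, sale_date, amount) "
            "SELECT sale_id, customer_id, sale_date, amount "
            "FROM sales WHERE sale_date > ?",  # only the delta since last load
            (loaded_to,),
        )

        # Advance the watermark so tomorrow's run starts where this one ended.
        cur.execute("INSERT INTO load_watermark (loaded_to) VALUES (?)",
                    (date.today().isoformat(),))
        conn.commit()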