
Bad Data Always Exists. What You Do About It Is What Matters!

Michelle Buchecker Our Thots

As with any company, we have projects where we need to grapple with bad data during an analysis exercise. Recently, we had to discuss a number of issues caused by inconsistent naming conventions that were complicating a set of reports. For instance, a word was sometimes spelled two different ways, the Queen’s English and American English, resulting in two separate data values where there should have been one. In healthcare data, the same problem pops up all the time in the various ways that patient names can be coded.
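To make the spelling problem concrete, here is a minimal sketch of how one category quietly becomes two in a report. The department names are hypothetical, invented purely for illustration; they are not from our actual project.

```python
from collections import Counter

# Hypothetical report column: one category entered with both
# British and American spellings.
departments = ["Paediatrics", "Pediatrics", "Pediatrics", "Oncology", "Paediatrics"]

# A plain count treats the two spellings as unrelated values,
# so the single department is split across two rows.
counts = Counter(departments)
for name, total in sorted(counts.items()):
    print(f"{name}: {total}")
```

A grouped report built on this column would show three departments instead of two, and neither spelling would carry the department's true total.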

So there will always be difficult data. This article shows ways to tackle some of the most common problems.

Many of the solutions prescribe fixing the data at the source. While that is certainly the right answer, it is equally true that I should exercise more and eat fewer Oreos. Neither of those things is likely to happen each and every time bad data (or a tasty Oreo) pops up, as much as I wish it would.

So outside of a perfect world, your choices are as follows:

  1. Ignorance is bliss: Decide the source system ‘ain’t your problem’ and let the bad data propagate to your analysis and reports
  2. Go to war: Push to fix, and possibly delete, the data at the source so it flows cleanly to all analysis projects
  3. “Rube Goldberg” it: Fix the data programmatically in your own code before the analysis runs

Choice 1 is the easiest, but the outcome is the worst. So unless you are trying to make a passive-aggressive point to your organization, we don’t recommend it. Choice 2 is straightforward in principle, but may take a lot of time if you are not the owner of the data. In fact, the data owner may not appreciate or care about the problem. Choice 3 is the choice of the underdog: take it on the chin and do the extra, time-consuming work to fix the data programmatically before triggering the analysis routines. Of course, this is all well and good until someone else tries to use your code (or you forget why you did all of this manipulation!).
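If you do go with Choice 3, one way to soften the "why did I do this?" problem is to keep every fix in a single, commented lookup rather than scattering it through the analysis. The sketch below is only an illustration under assumed names: `CANONICAL_SPELLINGS`, `clean_value`, and the sample values are all hypothetical, not the code from our project.

```python
from collections import Counter

# Hypothetical mapping of known variants (lowercased) to one canonical
# spelling. Each entry carries a note so the next reader knows why it exists.
CANONICAL_SPELLINGS = {
    "paediatrics": "Pediatrics",  # British spelling of the same department
    "pediatrics": "Pediatrics",   # also catches differences in capitalization
}

def clean_value(raw: str) -> str:
    """Return the canonical spelling for a known variant, else the trimmed input."""
    trimmed = raw.strip()
    return CANONICAL_SPELLINGS.get(trimmed.lower(), trimmed)

# Hypothetical raw values as they might arrive from the source system.
raw_values = ["Paediatrics", "pediatrics ", "Pediatrics", "Oncology"]

# Clean before the analysis step, so every report sees one value per category.
cleaned = [clean_value(value) for value in raw_values]
counts = Counter(cleaned)
print(counts)
```

Values that are not in the mapping pass through unchanged, so the cleanup never silently rewrites data it does not know about. The mapping also doubles as a ready-made list of issues to hand to the data owner when Choice 2 finally gets its turn.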

Your data, your choice, as perfect worlds don’t exist. But I do think that what you do defines the type of analyst you are and how your organization behaves. In our case, I went with Choice 3, but my colleague advocated for Choice 2. So we struck a compromise: Choice 3 for now (to meet our deadline), and Choice 2 goes on a future sprint.

What would you do?