Drawn In Perspective

Falsehoods Data Engineers Believe About Data

  1. The data you want exists
  2. The sample data you were given is representative
  3. Cuts and live feeds of data will be in the same format, or even have all the same columns
  4. You were given all the data you asked for / that was available
  5. You'll be notified when the underlying schema / format of your source data changes
  6. There is one row per entity
  7. Primary keys always exist
  8. Primary keys are always unique
  9. Primary keys are always stable
  10. Primary keys always uniquely identify rows, or even entities
  11. The people who created the dataset you're using care that this is not what "primary key" is supposed to mean 1
  12. These two similarly formatted columns are safe to join on
  13. Fake data is easy to generate
  14. Humans will enter the data they’re told to enter into the fields they’re told to enter it in
  15. Columns in general are being used for what they are supposed to be used for
  16. Humans won’t find ways around your clever validation rules
  17. I will remember all the edge cases my regex was designed to handle when I come back to it in a year’s time
  18. There is a single right value for every property of an object
  19. It's possible in general to decide when two objects are the same
  20. Aggregated data is no more or less sensitive than the individual rows that make it up
  21. Data always has a single identifiable "owner" who can make all the decisions about how it should look and who it should be shared with 2
  22. There are always general rules about who can access what data
  23. Data separated by tabs or commas is in valid tsv or csv format
  24. The underlying data infrastructure I'm using is by default optimised for my use-case
  25. You always need to get everything perfect first time 3
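
As a rough illustration of items 7 to 10, here is a minimal sketch of the kind of check that is worth running before trusting a "primary key" column. The function name check_primary_key, the customer_id column and the sample rows are all made up for this example, and treating an empty string as a missing key is an assumption that will not suit every dataset.

```python
from collections import Counter


def check_primary_key(rows, key):
    """Report rows with a missing key and key values that appear more than once.

    Assumption: None and the empty string both count as "missing".
    """
    # Positions of rows where the key is absent, None, or empty.
    missing = [i for i, row in enumerate(rows) if row.get(key) in (None, "")]
    # Count how often each present key value occurs.
    counts = Counter(row[key] for row in rows if row.get(key) not in (None, ""))
    duplicates = {value: n for value, n in counts.items() if n > 1}
    return {"missing_rows": missing, "duplicates": duplicates}


# Hypothetical sample data, not taken from any real system.
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Grace"},
    {"customer_id": "c2", "name": "Grace H."},
    {"customer_id": "", "name": "Unknown"},
    {"name": "No key at all"},
]

report = check_primary_key(customers, "customer_id")
print(report)
# {'missing_rows': [3, 4], 'duplicates': {'c2': 2}}
```

A check like this only covers existence and uniqueness within one cut of the data. Stability (item 9) needs a comparison across cuts, and no amount of code will tell you whether a key identifies a row or an entity.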

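For item 23, a small sketch of why splitting on commas is not the same as parsing CSV, using only Python's standard csv module. The sample text and the helper ragged_rows are invented for illustration, and the sketch assumes the first row is a header that defines the expected number of columns.

```python
import csv
import io

# Hypothetical file contents: a quoted comma, an escaped quote, and a short row.
raw = 'id,name,notes\n1,"Smith, Jane","said ""hi"""\n2,Bob\n'

# Naive approach: split every line on commas.
naive = [line.split(",") for line in raw.splitlines()]

# Standard-library parser, which understands quoting.
parsed = list(csv.reader(io.StringIO(raw)))


def ragged_rows(rows):
    """Return (row_index, field_count) for rows whose width differs from the first row."""
    expected = len(rows[0])
    return [(i, len(r)) for i, r in enumerate(rows) if len(r) != expected]


print(naive[1])             # ['1', '"Smith', ' Jane"', '"said ""hi"""']
print(parsed[1])            # ['1', 'Smith, Jane', 'said "hi"']
print(ragged_rows(parsed))  # [(2, 2)]
```

Even with a real parser the last row is still one field short, so a width check like ragged_rows is a sensible first pass before loading anything.
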
"Falsehoods Data Engineers Believe About Data" was the title of a talk I gave to my colleagues earlier this year. A better title would probably have been "Falsehoods Mahmoud Believed About Data And Came To Regret Believing". This kind of talk is fun to give to a small audience because it's an excuse to rant about times things have gone wrong in unexpected ways, and you can then open up the floor for other people to share their own "war stories" too.

If this was helpful to you, this page has an anthology of similar lists (normally about more concrete domain areas) that you'll probably find useful too.


1 They often do care deep down, but have other things to fix or fires to put out.

2 Many organisations do have a role like this, and for some categories of information and some jurisdictions it is mandated by law. In my experience, though, unless the organisation is very small or very hierarchical, or the data very simple, the best any single person can do is build consensus among, or on behalf of, all the stakeholders necessary for processing.

3 The key here is to be mindful of the ways your work is imperfect and communicate it clearly, especially when it might be used for Important Stuff downstream.
