- The data you want exists
- The sample data you were given is representative
- Cuts and live feeds of data will be in the same format, or even have all the same columns
- You were given all the data you asked for / that was available
- You'll be notified when the underlying schema / format of your source data changes
- There is one row per entity
- Primary keys always exist
- Primary keys are always unique
- Primary keys are always stable
- Primary keys always uniquely identify rows, or even entities
- The people who created the dataset you're using care that this is not what "primary key" is supposed to mean 1
- These two similarly formatted columns are safe to join on
- Fake data is easy to generate
- Humans will enter the data they’re told to enter into the fields they’re told to enter it in
- Columns in general are being used for what they are supposed to be used for
- Humans won’t find ways around your clever validation rules
- I will remember all the edge cases my regex was designed to handle when I come back to it in a year’s time
- There is a single right value for every property of an object
- It's possible in general to decide when two objects are the same
- Aggregated data is no more or less sensitive than the individual rows that make it up
- Data always has a single identifiable "owner" who can make all the decisions about how it should look and who it should be shared with 2
- There are always general rules about who can access what data
- Data separated by tabs or commas is in valid tsv or csv format
- The underlying data infrastructure I'm using is by default optimised for my use-case
- You always need to get everything perfect first time 3
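Several of the primary key falsehoods above are cheap to test before you build anything on top of a dataset. Here is a minimal sketch, assuming the rows have already been loaded as a list of dictionaries. The `check_primary_key` helper, the `customer_id` column and the sample rows are all made up for illustration, and the check says nothing about whether keys are stable over time.

```python
from collections import Counter


def check_primary_key(rows, key):
    """Report rows with a missing key and key values that occur more than once.

    `rows` is a list of dicts; `key` is the name of the supposed primary key.
    Both the function and its return shape are hypothetical, not a library API.
    """
    # Indexes of rows where the key column is absent, None, or an empty string.
    missing = [i for i, row in enumerate(rows) if row.get(key) in (None, "")]

    # Count how often each non-missing key value appears.
    counts = Counter(
        row[key] for row in rows if row.get(key) not in (None, "")
    )
    duplicates = {value: n for value, n in counts.items() if n > 1}

    return {"missing_rows": missing, "duplicate_keys": duplicates}


# Made-up sample data that breaks two of the assumptions at once.
rows = [
    {"customer_id": "C1", "name": "Ada"},
    {"customer_id": "C2", "name": "Grace"},
    {"customer_id": "C2", "name": "Grace H."},   # duplicate key
    {"customer_id": "", "name": "Unknown"},      # empty key
    {"name": "No key column at all"},            # key missing entirely
]

report = check_primary_key(rows, "customer_id")
print(report)
# {'missing_rows': [3, 4], 'duplicate_keys': {'C2': 2}}
```

Running something like this on every new cut of the data, not just the first sample, is one way to notice when an assumption that used to hold has quietly stopped holding.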
This was the title of a talk I gave to my colleagues earlier this year. A better title would probably have been Falsehoods ~~Data Engineers~~ Mahmoud Believed About Data And Came To Regret Believing. This kind of talk is fun to give to a small audience because it's an excuse to rant about times when things have gone wrong in unexpected ways - and you can then open up the floor for other people to share their similar "war stories" too.
If this was helpful to you, this page has an anthology of similar lists (normally about more concrete domain areas) that you'll probably find useful too.
1 They often do care deep down, but have other things to fix or fires to put out.
2 Many organisations do have a role like this, and for some categories of information and some jurisdictions it is mandated by law - but in my experience, unless the organisation is very small or hierarchical, or the data very simple, the best any single person can do is build consensus among, or on behalf of, all the necessary stakeholders for processing.
3 The key here is to be mindful of the ways your work is imperfect and communicate it clearly, especially when it might be used for Important Stuff downstream.