Insights for Action

Conflicts between Agile and Data Modelling

Does Agile worsen inertia through a failure to anticipate data models?

Photo by Christina Morillo from Pexels

Having worked in various Agile and Waterfall environments, I have become fascinated by the potential mismatch between Data Modelling and Agile. Searching for discussion pieces on this, I found an interesting post that reviews a conference discussion. The discussion dates back to Enterprise Data World 2015, but it draws on a lot of expertise from enterprise-level people who have had to reconcile Data Modelling and Agile. One of the statements made about Agile development was: “There are numerous distinctions between the objectives, approaches, and needs of developers and data modelers that provide various points of conflict”. The post goes on to explore several speakers’ experiences of these conflicts.

This seems like an important topic, and while I think that discussion draws attention to some of the issues, it doesn’t entirely address them. I’m sure that Data Modelling doesn’t fit the “fail-fast-and-iterate” imperative of Agile. On the developers’ side, you can increase the functionality, elegance, performance, and even “pythonicness” of your code. Under Agile, you do this to meet a customer need, and an entire conceptual and infrastructural support structure exists to help you do it efficiently and safely. Even if substantial changes in code-base design are required, you will have accrued a good framework of unit tests to help you with any refactoring.
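To make that safety net concrete, here is a minimal sketch (the order_total function and its pricing rule are invented for the example) of the kind of unit test that pins down behaviour, so a later refactor of the internals can be checked against it:

```python
import unittest


def order_total(prices, discount=0.0):
    """Sum the line prices and apply a fractional discount."""
    subtotal = sum(prices)
    return subtotal * (1 - discount)


class TestOrderTotal(unittest.TestCase):
    def test_discount_applied(self):
        # This assertion stays valid however we later reshape
        # the internals of order_total.
        self.assertAlmostEqual(order_total([10.0, 5.0], discount=0.1), 13.5)


if __name__ == "__main__":
    unittest.main()
```

The test constrains behaviour rather than implementation, which is exactly why it keeps its value while the code underneath it changes shape.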

However, if you get the data model wrong, inertia takes over. There may be some end-to-end integration tests which could reassure you that you have remodelled correctly, but I don’t believe the link between the data model and the code is covered by a relevant level of testing. I’ve heard the statistical joke “if you beat the data hard enough it will confess to anything”. With a mis-specified data model, however, you end up using your code to torture the data model into fitting all the use cases.
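Here is a minimal sketch of what that torture tends to look like (the single free-text address field and the parsing rules are invented for the example): rather than fixing the model, workaround code accumulates around it.

```python
# Hypothetical example: the data model stores a customer's address as one
# free-text field, but downstream use cases need city and postcode separately.
# Instead of changing the model, the code contorts around the existing field.

def split_address(raw_address: str) -> dict:
    """Best-effort parse of a single free-text address field."""
    parts = [p.strip() for p in raw_address.split(",")]
    return {
        "street": parts[0] if len(parts) > 0 else "",
        "city": parts[1] if len(parts) > 1 else "",
        # Guess: the last comma-separated chunk is the postcode.
        "postcode": parts[-1] if len(parts) > 2 else "",
    }


# Every new use case adds another fragile parse like this, rather than the
# data model gaining explicit street/city/postcode columns.
print(split_address("10 Downing Street, London, SW1A 2AA"))
```

Each parse works for the case in front of you, and each one makes the eventual remodelling more expensive.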

I have always been struck by a Linus Torvalds quote: “Bad programmers worry about the code. Good programmers worry about data structures and their relationships”. So it’s not as if we are unaware that we may be better at contemplating code quality in the absence of a tight linkage to user data. That might give us elegant and performant code as measured by the tests we apply to it, but it doesn’t necessarily translate to performant code in a real-world application. Yet this problem hasn’t been enough to drive substantial development work on refactoring tools that could guide us through a change of data model. It may be, for example, that many of our unit tests are wrong after we change our data model and need updating. Which ones? Even in this simple case, how many linters exist to help us identify the deprecated data structures in our unit tests?
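To make the idea concrete, here is a minimal sketch, not a real tool, of what such a check might look like: a plain ast walk over test files that flags references to field names we have deprecated. The DEPRECATED_FIELDS set, and the assumption that the old fields show up as attribute access or string keys, are inventions for the example.

```python
import ast
import sys

# Hypothetical: field names removed or renamed in the new data model.
DEPRECATED_FIELDS = {"customer_address", "order_status_code"}


def find_deprecated_references(source):
    """Return (line, field) pairs where a deprecated field name is referenced."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Catches obj.customer_address style attribute access ...
        if isinstance(node, ast.Attribute) and node.attr in DEPRECATED_FIELDS:
            hits.append((node.lineno, node.attr))
        # ... and row["customer_address"] style string keys.
        elif (
            isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.value in DEPRECATED_FIELDS
        ):
            hits.append((node.lineno, node.value))
    return hits


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for lineno, field in find_deprecated_references(f.read()):
                print(f"{path}:{lineno}: references deprecated field '{field}'")
```

Run against a test directory, this would at least list the tests that still mention the old data structures; it says nothing about whether their logic is still meaningful, which is the harder part of the problem.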

Previously, I’d come to the conclusion that the answer had to be some kind of Waterfall-style careful specification of the data models ahead of time. Some of the discussion reported from the Enterprise Data World 2015 conference seems to provide a degree of support for this: that Data Modellers need to be involved early in an Agile process. I’m not sure how that fits with an Agile process where you are developing customer requirements as you go, but then I wonder about the viability of developing a product with such a large set of unknowns that you are unable even to specify the strategic data model. The main conclusion, however, seems to be that it should be possible (using branches, for example) to try out variations in the data model and see whether they solve a particular problem. After that, you trust your CI/CD to identify everything the change breaks, and you fix it. That still feels like a recipe for inertia. If the choice is between a good solution to your immediate problem plus a lot of fixing work across the rest of your code, or a mediocre solution with no fixing work, which is going to win under the time pressure of a sprint? And as these mediocre solutions accumulate, at what point do you regret not making the change?
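For the CI/CD to catch those breaks, something has to assert the data model explicitly. A minimal sketch of that idea, assuming a hypothetical orders table and column set, is a schema contract test that fails loudly on any branch where the trial data model drifts from what the rest of the code assumes:

```python
import sqlite3
import unittest

# Hypothetical contract: the columns the rest of the code base assumes exist.
EXPECTED_ORDER_COLUMNS = {"id", "customer_id", "total", "created_at"}


class TestOrdersSchemaContract(unittest.TestCase):
    def test_orders_table_matches_contract(self):
        # In a real pipeline this would connect to a test database built by CI;
        # an in-memory SQLite table stands in for it here.
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE orders (id INTEGER, customer_id INTEGER, "
            "total REAL, created_at TEXT)"
        )
        actual = {row[1] for row in conn.execute("PRAGMA table_info(orders)")}
        self.assertEqual(actual, EXPECTED_ORDER_COLUMNS)


if __name__ == "__main__":
    unittest.main()
```

On a branch that changes the model, this is the first test to fail, which at least makes the fixing work visible before the merge decision is made; it doesn’t make that work any smaller.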

More profoundly, how can this work with a data pipeline, as used for statistical analysis or machine learning? There are numerous excellent tools (Airflow, Dagster) that help coordinate a complex data pipeline. Here, the Data Model is far more than a well-specified database (or warehouse, or lake): it’s a dynamic thing that doesn’t exist until the data pipeline is realised. How do we ensure we have optimal data models in this context?

In conclusion, I don’t have the answers, and repeated web searching isn’t helping much. But I believe it is an important topic, and one I will revisit regularly.
