Nowadays we rarely ask the question “do we have data?” Often we have lots of data, and the first barrier we hit may force us to ask instead, “do we have the right data?” Even when the answer is no, we often need to do the best we can with the data available. It has become so easy to collect vast quantities of electronic data as a byproduct of other processes that we expect data to be on tap and free. Yet it is still the case that well-designed, purpose-specific data collection can yield better answers than a “free” data source. In practice, we need to combine insights from different data sources: some purposefully collected, some byproducts of another process. This blog will illustrate the caveats and cautions we must observe in order to learn from data.
(I would have used the acronym AI for Actionable Insight, but that acronym has been taken.)
There are two key things to say about modern Data Science.
The first concerns “Continuous Integration/Continuous Delivery” (CI/CD), a modern software development practice. The advantages of scripted analyses (reproducibility, testing, auditability) are well established; it is a small leap to see these scripts as simple pieces of software and to realise that their entire architecture needs to be specified and tested continuously. Indeed, I have published in Annals of Operations Research on the pitfalls of spreadsheets for data analysis. Using CI/CD is a well-established way of minimising technical debt in software projects; in data science there is a human tendency to focus on producing beautifully crafted scripts that work well today, without considering every stage of the analysis or its software dependencies. For example, software versions need to be updated regularly, for bugfixes and needed feature enhancements.
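To make the idea concrete, here is a minimal sketch (not from any particular project) of what treating an analysis script as software looks like: the analysis logic lives in a plain function, and a test accompanies it so that a CI service can re-run the checks on every commit. The function `summarise` and its test are hypothetical illustrations.

```python
# Hypothetical example: an analysis step written as a testable function,
# so a CI pipeline can verify it on every commit and after every
# dependency upgrade.

def summarise(values):
    """Return count, mean and range for a list of numeric observations."""
    if not values:
        raise ValueError("no data supplied")
    return {
        "n": len(values),
        "mean": sum(values) / len(values),
        "range": max(values) - min(values),
    }

def test_summarise():
    # The kind of check CI would run automatically.
    result = summarise([2.0, 4.0, 6.0])
    assert result["n"] == 3
    assert result["mean"] == 4.0
    assert result["range"] == 4.0

if __name__ == "__main__":
    test_summarise()
    print("all checks passed")
```

The point is not the arithmetic: it is that when a library upgrade silently changes behaviour, a test like this fails in CI today rather than in a report next year.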
The second concerns up-to-date knowledge of what is possible in statistical science. The field of statistical science is advancing just as rapidly as any other science, yet many statistics courses are based on techniques invented before 1930, and training has not been updated. Given the availability of powerful computers, convenient analytical methods are no longer needed to approximate a solution: we can look for the right solution if we allow a little computer time. Given the price of computer time, this rarely seems like a barrier to using the best methods available. There is an old adage in English: “if the only tool you have is a hammer, then every problem you see is a nail”. That is the problem with statistical training: we are only given the hammer, and a hammer from the 1930s at that. This blog will provide case studies of using modern methods to answer real questions. Yes, this does require deeper statistical expertise, but in turn it requires domain specialists to think harder about the problem they want to solve, rather than just “hitting it with a hammer”.
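As one illustration of trading computer time for analytical convenience, consider the bootstrap, a resampling method that replaces a textbook normal-approximation formula with brute-force recomputation. This is a generic sketch, not a method from the post itself; the function name and data are invented for the example.

```python
# Illustrative sketch: a percentile bootstrap confidence interval for
# the mean. Instead of assuming a formula derived under pre-computer
# approximations, we simply resample the data many times and look at
# the spread of the recomputed statistic.
import random

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(data)
    # Resample with replacement and recompute the mean each time.
    means = sorted(
        sum(rng.choice(data) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

data = [4.1, 5.2, 3.9, 6.8, 5.5, 4.7, 5.0, 6.1]  # made-up observations
low, high = bootstrap_ci(data)
```

Ten thousand resamples complete in well under a second on an ordinary laptop, which is the broader point: the “little computer time” that modern methods demand is now essentially free.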