Unreasonable effectiveness of data
Unpacking the claim that data are cheaper than models
Saturday, 14 Nov 2020, by Paul Hewson.
In 1960, Eugene Wigner published “The unreasonable effectiveness of mathematics in the natural sciences”. Many commentators have agreed that it is surprising how much of the natural world can be modelled by relatively simple maths. Wigner argued that mathematically modelling a scientific theory can lead to further advances in the theory, and that a mathematical model can make predictions which allow the theory to be tested empirically. Playing on the title of this paper, in 2009 researchers at Google presented “The unreasonable effectiveness of data”. This is often quoted by tech-evangelists in support of the idea that “building models is expensive, training data is cheap”. I would argue it is not quite that simple.
Firstly, it should be noted that Google were training Machine Learning (ML) algorithms in the context of language translation. This was possible because they had accumulated vast quantities of training data: they had been digitising books at scale, and so held a corpus with corresponding human-translated material. Their translation algorithm therefore does little to help us understand language or further our understanding of linguistics. It is still not capable of an artistic rendering of poetry from one language to another, where considerable leeway must be taken to preserve the structure and deeper meaning of poetry across languages. Language also evolves. I would argue that there is at the very least a simple model underlying Google’s translation efforts: literary prose. And, as a beginner learning linear regression is warned, extrapolating beyond the range of your data can be extremely risky.
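That extrapolation risk is easy to show with a toy example. The sketch below uses numbers invented purely for illustration: a straight line is fitted, by ordinary least squares, to data whose true relationship is quadratic. Inside the observed range the line does fine; far outside it, the prediction falls apart.

```python
# Invented illustrative data: the true relationship is y = x^2,
# but over the narrow observed range [0, 3] a line fits it quite well.
xs = [i * 3.0 / 19 for i in range(20)]   # observed range: 0 to 3
ys = [x ** 2 for x in xs]                # true relationship: quadratic

# Ordinary least-squares fit of a straight line, by hand.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    return slope * x + intercept

# Interpolating inside the observed range is tolerable...
inside = predict(2.0)    # truth: 4.0
# ...but extrapolating far beyond it goes badly wrong.
outside = predict(10.0)  # truth: 100.0
print(f"x=2:  predicted {inside:.1f} (truth 4.0)")
print(f"x=10: predicted {outside:.1f} (truth 100.0)")
```

The same caution applies to a translation model trained only on digitised literary prose: within that range it performs well, but asking it to handle poetry, or language that has since evolved, is extrapolation.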
Of course, Wigner’s arguments were themselves immediately challenged. Richard Hamming argued, for example, that we tend to find whatever it is we are looking for, and that we create and select the mathematics to fit a particular situation. Perhaps more importantly, the nature that can be described by mathematics is only a limited part of human experience. Significantly, many tech-evangelists have taken “The unreasonable effectiveness of data” beyond its context. Image recognition is currently one of the hot topics in Machine Learning, and again various cloud companies have acquired large image banks which make for great training data: Facebook, for example, has names and other data to associate with facial images. Again, though, predictive use may well be limited to interpolation rather than extrapolation. Implicitly, there is a model here: the training set contains images of interest to those posting to Facebook.
The problem I’m alluding to is the ethical bias that arises when we have a biased data set. Facial recognition software is notoriously poor at recognising black faces. If we reject the use of models, we have no mechanism to correct this bias. That may not matter if your interest in Big Data is to increase sales, and sales do increase. But as public services become increasingly Big Tech-ified, there is a mindset gap here that needs to be filled.