There’s a famous saying in the world of (computer) models:

“All models are wrong, some are useful.”
George Box

There’s an obvious corollary to that – namely that if some are useful, the others are useless. There are a bunch of tests that can be made to determine whether a model is useful or not. These include back-testing it against historical data, testing the predictions it makes against future events and so on. One of the key attributes of a useful model is that it should be repeatable and vary its output only when its inputs vary. This applies even to models with (pseudo)random elements in them, because the seeds to the PRNG are some of the controllable inputs. Sometimes you want to use different seeds for the pseudo-random number generation because you need to see whether a series of runs (an ensemble) clusters around a particular set of outcomes or whether it diverges for some reason. At other times (e.g. when trying to reproduce the work of the original researcher) you need to start with the same seeds and confirm that you get exactly the same answer.
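To make the two uses of a seed concrete, here is a minimal Python sketch. The function run_toy_model and its parameters (growth, initial_cases and so on) are made up for illustration and have nothing to do with any real epidemic model; the only point is that the seed is treated as an explicit input.

```python
import random

def run_toy_model(seed, days=30, initial_cases=10, growth=1.2):
    """Hypothetical toy model: daily cases grow by a noisy factor."""
    rng = random.Random(seed)  # private generator, so the seed is a controllable input
    cases = float(initial_cases)
    for _ in range(days):
        cases *= growth * rng.uniform(0.9, 1.1)
    return cases

# Reproduction: same seed and same inputs must give exactly the same answer.
first = run_toy_model(seed=42)
second = run_toy_model(seed=42)
assert first == second

# Ensemble: different seeds, so we can see how tightly the runs cluster.
ensemble = [run_toy_model(seed=s) for s in range(100)]
print("lowest run:", min(ensemble))
print("highest run:", max(ensemble))
```

If the first assertion ever failed for a model like this, something other than the declared inputs would be influencing the output, which is exactly the failure discussed further down.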

Clearly models that are intended to be used for important purposes, such as to set government policies, should be of the useful kind and not the useless variety. Well, you’d have thought that was the case. Unfortunately we know that when it comes to climate science the models are almost all badly wrong and do a poor job of predicting the future.

The one good thing about the climate change models is that they meet the fundamental test of models – i.e. they are repeatable. They also generally now meet another basic requirement – having the code available in some form so that anyone who wants to can download it, compile it, run it and verify that they get the same results. Note that many used not to meet this requirement, but over the last decade or so that has been rectified. In addition they are also pretty consistently wrong in one direction (too hot) and they produce results that change in reasonably predictable and apparently accurate ways with changes in inputs. So even though they produce wrong results they do at least produce predictably wrong results, which means a cunning policy-maker could take those results, apply a bugger factor and come up with data that he or she can then use to make policy choices. They still aren’t close to perfect and they make certain predictions that are not seen in reality, but they are gradually improving and may at some point be directly useful.

But compared to the model that has caused the great 2020 depression, they are a wonder of clarity and utility.

Prof Neil Ferguson’s Covid-19 model fails the basic test of usefulness – repeatability.

Non-deterministic outputs. Due to bugs, the code can produce very different results given identical inputs. They routinely act as if this is unimportant.

This problem makes the code unusable for scientific purposes, given that a key part of the scientific method is the ability to replicate results. Without replication, the findings might not be real at all – as the field of psychology has been finding out to its cost. Even if their original code was released, it’s apparent that the same numbers as in Report 9 might not come out of it.

… the documentation wants us to think that, given a starting seed, the model will always produce the same results.

Investigation reveals the truth: the code produces critically different results, even for identical starting seeds and parameters.

I’ll illustrate with a few bugs. In issue 116 a UK “red team” at Edinburgh University reports that they tried to use a mode that stores data tables in a more efficient format for faster loading, and discovered – to their surprise – that the resulting predictions varied by around 80,000 deaths after 80 days.

This sounds terribly familiar to those of us who paid attention to ‘climategate’ and the infamous HARRY_READ_ME (the link goes to my old blog site, which contains links to specific criticisms of the code we discovered).

It also fails the “source code availability” test: the code that has been released is a cleaned-up version, not the version used to produce the report that caused governments to lock down their countries and put their economies into free-fall. What we see as various code-archaeologists dig their way through the released code repo is some pretty hairy bugfixes in that code:

In fact the second change in the restored history is a fix for a critical error in the random number generator. Other changes fix data corruption bugs (another one), algorithmic errors, fixing the fact that someone on the team can’t spell household, and whilst this was taking place other Imperial academics continued to add new features related to contact tracing apps.

The released code at the end of this process was not merely reorganised but contained fixes for severe bugs that would corrupt the internal state of the calculations. That is very different from “essentially the same functionally”.

As “Sue Denim”, the author of the linked posts, points out, this sort of thing actually matters because policy-makers need reliable models on which to make decisions:

Imagine you want to explore the effects of some policy, like compulsory mask wearing. You change the code and rerun the model with the same seed as before. The number of projected deaths goes up rather than down. Is that because:

  • The simulation is telling you something important?
  • You made a coding error?
  • The operating system decided to check for updates at some critical moment, changing the thread scheduling, the consequent ordering of floating point additions and thus changing the results?

You have absolutely no idea what happened.

In a correctly written model this situation can’t occur. A change in the outputs means something real and can be investigated. It’s either intentional or a bug. Once you’re satisfied you can explain the changes, you can then run the simulation more times with new seeds to estimate some uncertainty intervals.

In an uncontrollable model like ICL’s you can’t get repeatable results and if the expected size of the change is less than the arbitrary variations, you can’t conclude anything from the model. And exactly because the variations are arbitrary, you don’t actually know how large they can get, which means there’s no way to conclude anything at all.
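The third possibility in that list rests on a property of floating-point arithmetic that is easy to check for yourself: addition is not associative, so adding the same numbers in a different order can give a different total. A small Python sketch follows; the two lists are made-up values standing in for two different thread schedules, and nothing here is taken from the ICL code.

```python
def add_in_order(values):
    """Add floats strictly left to right, as a plain accumulation loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same four numbers, handed over in two different orders
# (standing in for two different thread schedules).
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]

result_a = add_in_order(order_a)  # the first 1.0 is lost to rounding next to 1e16
result_b = add_in_order(order_b)  # the large values cancel first, so both 1.0s survive

print(result_a, result_b)  # 1.0 2.0
assert result_a != result_b
```

The explicit loop is deliberate: recent versions of Python's built-in sum() use a compensated algorithm for floats that can hide this effect. In a model that accumulates many values across threads, pinning the order of the additions is one way to get the repeatability the quoted author is asking for.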

And the author continues to explain why this sort of thing matters:

Taking the average of a lot of faulty measurements doesn’t give a correct measurement. And though it would be convenient for the computer industry if it were true, you can’t fix data corruption by averaging.
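The point about averaging can be shown with a few lines of Python. In this sketch the fault is modelled as a constant bias plus random noise, which is my own simplifying assumption (real data corruption bugs are far less well behaved), and the names and numbers are invented for illustration.

```python
import random
import statistics

def faulty_measurements(true_value, bias, noise, n, seed):
    """Hypothetical instrument: every reading is off by `bias` plus Gaussian noise."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise) for _ in range(n)]

true_value = 100.0
readings = faulty_measurements(true_value, bias=5.0, noise=2.0, n=10_000, seed=1)
average = statistics.fmean(readings)

print("true value:", true_value)
print("average of faulty readings:", average)

# Averaging shrinks the random noise but leaves the systematic error untouched.
assert abs(average - (true_value + 5.0)) < 0.5
```

Taking more readings only makes the average settle more firmly on the wrong number; it never moves it towards the true value.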

This, again, is dreadfully reminiscent of climate ‘science’. As seen in the graphic above averaging a load of wrong answers still fails to get you to the right one. Science in general has known this for years. There are endless papers (and corrections to papers) regarding getting this sort of thing wrong.

The point here is that the model is actually less than useful.

One of the interesting questions that has yet to be answered is why governments decided to rely on the Ferguson model and report. Did they ask anyone in related commercial fields (say health insurance underwriters) for their opinions on the accuracy of infection models? One suspects the answer is “no” based on this reddit post:

I work with natural catastrophe models, mainly the output, and develop analytical systems and tools for a major multinational reinsurer. I’m a bit of a jack of all trades so I also get involved in reviewing validation documentation for the models we use as well as development work. I said in a comment in r/covid19 ages ago that myself and some colleagues look over the Ferguson paper and called it ‘amateur hour’. The results of our analysis don’t really determine lives, they determine the risk tolerance of the company and ensure that we will be able to pay all of our claims when natural disasters happen. But even then, we always use 2 different internal models plus for major decisions an external, independent view normally from a broker. It’s unbelievable that a decision of this magnitude was based off a single model, even if it was a good, properly validated model, which this clearly never was. I might have a dig around in this code but to be honest, you tell me 15,000 lines of C++ in a single file written by rank amateurs, and I’m thinking I’d rather shoot myself in the head than try and figure out how that shit is actually working. God save the brave souls who have plunged into that mess in order to perform these kinds of reviews.

As a side note, we don’t actually license any pandemic models (we use deterministic events for that rather than stochastic simulations), but both of the major vendors in the market, AIR and RMS, have pandemic models which we have been receiving estimates from since day 1. The AIR model never predicted anything in the region that Ferguson’s model did, which is the only one I have had any access to. I’m not sure why governments would not have looked at using these models. However, we did determine as a company that the accuracy of even these models was questionable enough for us not to bother with licensing them.


I should note that this isn’t the only covid-19 model that seems opaque. The IHME models in the US also seem to be lacking in source code and source code audits, so we have no idea how they generate their numbers. Over at powerlineblog there have been a number of criticisms of their outputs, which vary dramatically from day to day, but I haven’t seen anyone comment on the code that creates them. One model that is entirely open source and which attempts to do things that the Ferguson model doesn’t (e.g. model for different ages) is this one. I don’t know how good it is, but the source is on github and it is written in python and node.js (javascript), which means it will tend to avoid some of the more obvious memory/pointer initialization bugs, and it’s broken down into sensible bits. It has also had quite a number of people work on it, which suggests it may be producing useful predictions.

Finally, a nice BBC report that explicitly makes the point that epidemic models are wrong early on and that they should be ignored until they become reasonably accurate.