Data mining of Covid numbers

Today I want to dig a little into the Covid data, because I see that there is definitely some confusion.

The main reason I started this project was that lockdown measures were tied to the "R-value", and scientists in the public media said that the R-value could only be estimated, which I found unbelievable. So let's see what we can find out.

What data do we have at our disposal?

I'm going to take data from Johns Hopkins University, as they seem to have the most accurate and most timely data so far. The two most important values we have are the total number of infected and the number of people currently infected. From these two metrics we will generate everything else.

The picture shows charts of totally infected and currently infected people

Doesn't look like much - yet - but let's find some simple things to pull out of there:

Preparation of the data

Picture shows Excel table with some numbers

First of all, we need something that we can easily work with. For this I will use Excel, because it is fast and easy. Programming something with Python could also work, but for a short project that might be too much work.

The data we have are inserted into two columns as shown above. Each row represents a day in the pandemic. The pandemic began in Switzerland on 25 February.

Daily new infections

One value that is likely to be useful is new infections. I will not collect this data from Johns Hopkins University, but will derive it from total infections. This will save copying time in the future.

We just need to subtract the previous day's total infections from the current day's total infections.
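
For anyone who would rather script this step than use Excel, here is a minimal Python sketch of the same subtraction. The function name `daily_new` and the sample numbers are made up for illustration; they are not the real Swiss figures.

```python
def daily_new(totals):
    """Turn a cumulative series into day-over-day differences."""
    if not totals:
        return []
    # The first day has no predecessor, so its total counts as new cases.
    return [totals[0]] + [today - yesterday
                          for yesterday, today in zip(totals, totals[1:])]


# Made-up cumulative totals, one value per day.
total_infected = [1, 1, 8, 8, 18, 27, 42]
new_infections = daily_new(total_infected)
print(new_infections)  # [1, 0, 7, 0, 10, 9, 15]
```

The same helper works for any cumulative column, so it can be reused for the cured totals later on.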

the image shows new infections per day and 5d median

What we can see is a small increase in infections over the last few days, but more on that later.

Total healings and average duration of infection

The total number of cured can be obtained by subtracting the currently infected from the total infections. For illustration purposes, I contrast the graph with the total infected.
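
As a rough sketch, the same column can be computed in Python. The function name `total_cured` and the sample values are assumptions for illustration; like the spreadsheet, it simply treats everyone who is no longer counted as currently infected as cured.

```python
def total_cured(total_infected, currently_infected):
    """Cumulative cured = cumulative infected minus currently infected, per day."""
    if len(total_infected) != len(currently_infected):
        raise ValueError("both series need one value per day")
    return [total - current
            for total, current in zip(total_infected, currently_infected)]


# Made-up sample values, one per day.
total_infected = [10, 25, 40, 60]
currently_infected = [10, 24, 35, 48]
print(total_cured(total_infected, currently_infected))  # [0, 1, 5, 12]
```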

The graph shows the total infected and the total cured. Both curves look similar except for a horizontal offset

Well, that may not show that much yet, except that our hospitals are far from full. Additionally, we can read the infection time from this. If we shift the cured curve to the left until it overlaps the infected curve, the size of that shift is the average infection time. For Switzerland this is 16 days, for Germany 15 days. The two values agree within what I would expect from the accuracy of the data.
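
I did the shifting by eye in the chart, but it could also be automated. The sketch below is one possible way, not the method used for the numbers above: it tries every shift up to a hypothetical `max_shift` and keeps the one with the smallest average gap between the two curves. The function name `best_shift` and the synthetic test curve are assumptions.

```python
def best_shift(total_infected, total_cured, max_shift=30):
    """Return the left shift (in days) of the cured curve that best
    matches the infected curve, measured by mean absolute difference."""
    n = len(total_infected)
    best_lag, best_error = 0, float("inf")
    for lag in range(0, min(max_shift, n - 1) + 1):
        # Compare infected on day i with cured on day i + lag.
        pairs = list(zip(total_infected[:n - lag], total_cured[lag:]))
        error = sum(abs(inf - cur) for inf, cur in pairs) / len(pairs)
        if error < best_error:
            best_lag, best_error = lag, error
    return best_lag


# Synthetic check: the cured curve is the infected curve delayed by 5 days.
infected = [day * day for day in range(40)]
cured = [0] * 5 + infected[:-5]
print(best_shift(infected, cured))  # 5
```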

The chart shows the total number of infected and the total number of cured. This time shifted by the number of days so that they overlap

The infection time will later prove useful for the R-value.

Healed per day

Screenshot of the strange findings

From the total healed, we can generate the daily healed. Similar to the daily new infections, we subtract yesterday's total healed from today's total healed.

This is where I got some question marks about how Johns Hopkins University collects this data. Apparently, since May 16, the cured cases have been reported in batches of 100, which seems very odd to me:

Unfortunately, we have to assume that the values are correct, but I will have to use the less robust mean instead of the median. I find the median more reliable as it effectively removes outliers. But since our data here consists almost entirely of outliers, the median would throw away the actual signal, so we have to use the mean.
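
To show why the median breaks down here, this small Python sketch compares a 5-day rolling median with a 5-day rolling mean on batch-style data. The helper `rolling` and the sample series are made up for illustration; only the batch size of 100 comes from the observation above.

```python
from statistics import mean, median


def rolling(values, window, func):
    """Apply func to each trailing window of up to `window` values."""
    return [func(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]


# Made-up daily cured counts, reported in occasional batches of 100.
daily_healed = [0, 0, 100, 0, 0, 0, 0, 100, 0, 0]

rolling_median = rolling(daily_healed, 5, median)
rolling_mean = rolling(daily_healed, 5, mean)

print(max(rolling_median))  # the median never sees the batches
print(rolling_mean)         # the mean spreads each batch over the window
```

With this data the rolling median stays at zero the whole time, because a single batch never makes up the majority of a 5-day window, while the mean still shows roughly 20 cured per day.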

In addition, we can now overlay daily cured and daily infected patients.

the graph shows daily cured versus newly infected patients. the cured patients are obviously lagging behind the infected ones

What we can clearly see here is that from about April 1 onward, the daily cured consistently outnumbered the newly infected. This basically means that the hospitals had less and less work from that day on.

The mystic R-Value and the future

To obtain the R-value, one must first know the definition of the R-value. The definition is as follows:

"The expected number of cases directly generated by one case in a population where all individuals are susceptible to infection."

How do we get there?

We already have the most important information:

- How long does an infection last on average? 16 days

- How many infected people are there currently?

- How many new infections are there per day?

The daily R-value can then be calculated as follows:

(new infections * duration of an infection) / current number of infected persons
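
Written as code, the formula above could look like the following sketch. The function name `r_value` and the choice to return `None` when nobody is currently infected are my own assumptions; the default of 16 days is the infection time found earlier for Switzerland.

```python
def r_value(new_infections, currently_infected, infection_days=16):
    """Daily R-value: (new infections * infection duration) / currently infected."""
    if currently_infected <= 0:
        # No active cases means the ratio is undefined.
        return None
    return new_infections * infection_days / currently_infected


def r_series(new_per_day, active_per_day, infection_days=16):
    """Apply r_value to every day of two equally long series."""
    return [r_value(new, active, infection_days)
            for new, active in zip(new_per_day, active_per_day)]


# Made-up sample: three days of new and active cases.
print(r_series([50, 40, 20], [400, 400, 400]))  # [2.0, 1.6, 0.8]
```

A value above 1 means each infected person passes the virus on to more than one other person on average, and a value below 1 means the outbreak is shrinking.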

This is the result for Switzerland:

chart of the R-value

Additionally, I created my own metric to highlight whether things are getting worse or better. More green than red means things are getting better. More red than green means things are getting worse.

The higher or lower the values, the more intense the effect.

the graph shows a metric that I personally found more useful than the r-value

Uh oh - the second wave will be upon us!

But not so fast. Let's first look at these two metrics for Germany as well, with the Robert Koch Institute's guesstimated R-value in the background:

R-value Germany

Infection development Germany

Germany has a similar curve. The rising R-value just had a little bump. Also, the infection activity jumps around like crazy at the end.

Why is that? Let's look at the data around June 15, just before the peak:

We have a total of 292 active cases and ~15 new cases per day in Switzerland. In fact, the numbers are so small that a single hot spot is already enough to drastically skew the values. A single short spike is therefore not enough to predict that the second wave will come.
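
To put a number on that, here is a small worked example using the R-value formula from above. The 292 active cases, ~15 new cases per day and the 16-day infection time come from this article; the hot spot with 30 extra cases is purely hypothetical.

```python
INFECTION_DAYS = 16   # average infection duration found earlier
active_cases = 292    # active cases in Switzerland around June 15
new_per_day = 15      # approximate daily new cases
hot_spot_cases = 30   # hypothetical extra cases from a single hot spot

# R on a normal day.
baseline_r = new_per_day * INFECTION_DAYS / active_cases

# R on the day the hot spot is reported; the extra cases are also active.
hot_spot_r = ((new_per_day + hot_spot_cases) * INFECTION_DAYS
              / (active_cases + hot_spot_cases))

print(f"baseline R: {baseline_r:.2f}")
print(f"R on the hot-spot day: {hot_spot_r:.2f}")
```

Under these assumptions, R jumps from roughly 0.82 to roughly 2.24 for that one day, even though nothing about the general situation has changed.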

As some might suspect, a virus does not come and then disappear forever. For example, cases of swine flu have recently been detected again.

I doubt we can completely eradicate the virus (which we are currently trying to do). Rather, it will evolve into something we humans will have to live with. There will be hotspots in the near future.

However, I don't expect to see a second wave similar to the first anytime soon. What happens next year in January/February, only time will tell.