Mathematical background

Note: In this section we use the shorthand $R$ instead of the more rigorous $R_{\mathrm{eff}}$ to arrive at (hopefully) more readable mathematical expressions below.

Estimating R

The effective reproduction number $R$ is frequently mentioned in Austrian media. $R$ gives the average number of people an infected individual infects.

Typically, we do not know who infects whom. Therefore, the straightforward idea of averaging the total number of secondary infections caused by sufficiently many infected individuals is not practically feasible.

The estimation of $R$ would be simple if we assumed that an individual is infectious for only a single day. Let us assume that a person becomes infectious on the 5th day after contracting the virus (i.e., after the incubation period) and is infectious for 1 day. In such a simplified example, we could infer $R$ simply by dividing the number of new cases today by the number of new cases from 5 days ago.
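The simplified single-day rule can be sketched in a few lines of Python. This is only an illustration: the function name `naive_r` and the case counts are made up and are not part of the actual estimation pipeline.

```python
def naive_r(new_cases, lag=5):
    """Ratio of today's new cases to the new cases `lag` days earlier.

    `new_cases` is a list of daily counts, oldest first; the last entry is today.
    """
    if len(new_cases) <= lag:
        raise ValueError("need more than `lag` days of data")
    earlier = new_cases[-1 - lag]
    if earlier == 0:
        raise ValueError("no cases on the reference day; ratio is undefined")
    return new_cases[-1] / earlier


# Hypothetical daily new-case counts (made-up numbers), oldest first.
cases = [100, 104, 98, 110, 107, 112, 120]
print(naive_r(cases))  # today's 120 cases divided by the 104 cases five days ago
```

A single noisy day in either the numerator or the denominator changes this ratio directly, which is one more reason the simplified rule is not used in practice.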

In reality, however, infected individuals are infectious for several days, and on each of those infectious days a different number of secondary infections may occur.

In order to arrive at an estimation procedure for $R$, typically the following line of thought is followed:

First, let us denote as $\beta_s$ the number of people that an infected individual infects on the $s$-th day after their own infection. Then this person simply infects $R = \beta_1 + \beta_2 + \dots + \beta_{20}$ individuals, assuming that after 20 days no further infections can occur. We can also write this using the summation symbol as

$$R = \sum_{s=1}^{20} \beta_s.$$

If every $\beta_s$ is divided by $R$, then we get the fraction of infections that occur on the $s$-th post-infection day. Let us call this fraction $w_s$, i.e. $w_s = \beta_s / R$. Even if $R$ may change, we assume that the fractions $w_s$ remain constant. (Clearly, one should question this assumption.) These fractions can be estimated from samples of patient pairs where it is known that one patient was infected by the other. The number of days between the first and the second infection in such a pair is called the length of the serial interval. (In our implementation of EpiEstim we assume that the distribution of the serial interval is given by a discretized Gamma distribution with mean 4.46 and standard deviation 2.63, based on estimates of AGES.)
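To make the serial-interval fractions concrete, the following sketch turns a Gamma distribution with mean 4.46 and standard deviation 2.63 into 20 daily fractions. The discretization used here (probability mass of the interval from day s-1 to day s, renormalized over 20 days) is an assumption chosen for simplicity; EpiEstim uses its own discretization, so the numbers will not match it exactly. The function names are hypothetical.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution at x."""
    if x <= 0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def discretized_gamma(mean, sd, max_days=20, steps=200):
    """Daily fractions for days 1..max_days (assumed, simplified discretization)."""
    shape = (mean / sd) ** 2   # moment matching: mean = shape * scale
    scale = sd ** 2 / mean     #                  variance = shape * scale^2
    h = 1.0 / steps
    masses = []
    for s in range(1, max_days + 1):
        # Midpoint rule for the probability mass on the interval [s - 1, s].
        mass = h * sum(gamma_pdf(s - 1 + (i + 0.5) * h, shape, scale) for i in range(steps))
        masses.append(mass)
    total = sum(masses)
    # Renormalize so that the fractions sum to 1 despite the cut-off after max_days.
    return [m / total for m in masses]


w = discretized_gamma(4.46, 2.63)
for day, fraction in enumerate(w[:7], start=1):
    print(day, round(fraction, 3))
```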

Next, let $I_t$ be the number of people infected on day $t$. Knowing $R$ and $w_s$ for $s = 1, \dots, 20$, we can estimate the number of people that were infected on day $t$ by

$$I_t = \sum_{s=1}^{20} R \, w_s \, I_{t-s}.$$

If we allow $R$ to vary over time, we need to account for this. Naively, one would rewrite the formula for $I_t$ as

$$I_t = \sum_{s=1}^{20} R_{t-s} \, w_s \, I_{t-s}.$$

Given that, one would attribute a decrease in the infections caused today, on day $t$, by the individuals infected on day $t-s$ to a decreased reproduction number on the past day $t-s$. But since these infections occur today, on day $t$, and we want to describe the disease spread today, one would rather update the above formula again to

$$I_t = \sum_{s=1}^{20} R_t \, w_s \, I_{t-s}.$$

(Given such a definition of $R_t$, one frequently also speaks about the effective reproduction number, which is written explicitly as $R_{\mathrm{eff}}$.) The updated formula also allows an easier estimation of $R_t$: we can factor $R_t$ out of the sum, and then divide by the remaining sum, to obtain

$$R_t = \frac{I_t}{\sum_{s=1}^{20} w_s \, I_{t-s}}.$$
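The ratio of today's cases to the weighted sum of past cases can be computed directly. The sketch below uses a toy serial interval of only three days and made-up case counts to keep the arithmetic easy to follow; the function name `estimate_r` is hypothetical.

```python
def estimate_r(cases, w, t):
    """Point estimate: cases[t] divided by sum over s of w[s-1] * cases[t-s].

    `cases` is indexed by day (oldest first); `w[s-1]` is the fraction of
    infections occurring on the s-th day after infection.
    """
    if t < len(w):
        raise ValueError("not enough past days for the given serial interval")
    weighted_past = sum(w[s - 1] * cases[t - s] for s in range(1, len(w) + 1))
    if weighted_past == 0:
        raise ValueError("no past infections; the ratio is undefined")
    return cases[t] / weighted_past


# Toy inputs (assumptions for illustration only).
w = [0.2, 0.5, 0.3]            # three-day serial interval instead of 20 days
cases = [10, 12, 15, 18, 22]   # made-up daily infections

# Denominator for t = 4: 0.2 * 18 + 0.5 * 15 + 0.3 * 12 = 14.7
print(estimate_r(cases, w, t=4))
```

With constant case numbers the weighted sum equals the daily count, so the estimate is 1, as one would expect for a stable epidemic.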

At its core, this is already the fundamental formula on which many estimation procedures for $R_t$ are based. Notably, this estimation procedure becomes more involved if we consider that the infections on day $t$ are not fully deterministic. Instead, we know that many random factors affect $I_t$.

In order to capture this, the mathematical model considers $I_t$ to be a (Poisson) random variable and assumes that only its expectation is given by the quantity

$$\mathbb{E}[I_t] = R_t \sum_{s=1}^{20} w_s \, I_{t-s}.$$

Based on this, one uses a Bayesian model to infer $R_t$ and credible intervals. This method was introduced by Cori et al. (2013) and is implemented in the software package EpiEstim. Our graphs present the estimates of $R_t$ according to this method, as well as estimates of an extension which was developed by the London School of Hygiene & Tropical Medicine (LSHTM) / epiforecasts.io.

In Bayesian statistics, observed data (e.g., case numbers) are used to assess the plausibility of different values of a parameter. As a result, a so-called posterior distribution of the parameter of interest, namely $R_t$, is computed. To do this, we require a prior distribution of the parameter, which summarizes our previous knowledge about its value. If a model contains further parameters which are not fully known, then one can treat those similarly with their own prior distributions. Essentially this means that for the estimate of $R_t$, one takes appropriate averages over many different values of every additional parameter.

In the software package EpiEstim one can do this for the serial interval. The method used by epiforecasts.io is also based on a Bayesian estimator. However, it also considers additional sources of uncertainty, such as the delays between infection, onset of symptoms, and the reporting of a case.

A credible interval is a region in which the parameter falls with a given probability according to its posterior distribution. Here, we show the 50% and 90% credible intervals for the method developed by LSHTM / epiforecasts.io.
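As a rough illustration of how a posterior and its credible intervals arise, the sketch below follows the conjugate Gamma-Poisson update described by Cori et al. (2013): with a Gamma prior (shape a, scale b) and a reproduction number assumed constant over a window of days, the posterior is again a Gamma distribution with shape a plus the cases in the window, and scale 1 / (1/b + the summed weighted past cases). The prior values, the 7-day window, the toy serial interval, and the case counts are assumptions for illustration, and the intervals are approximated by sampling rather than by exact quantiles. This is not the EpiEstim or epiforecasts.io implementation.

```python
import random


def posterior_params(cases, w, t, window=7, prior_shape=1.0, prior_scale=5.0):
    """Shape and scale of the Gamma posterior for R over the last `window` days."""
    first_day = t - window + 1
    if first_day - len(w) < 0:
        raise ValueError("not enough past days for this window and serial interval")
    days = range(first_day, t + 1)
    total_cases = sum(cases[d] for d in days)
    # Weighted past infections, summed over all days in the window.
    total_pressure = sum(
        sum(w[s - 1] * cases[d - s] for s in range(1, len(w) + 1)) for d in days
    )
    shape = prior_shape + total_cases
    scale = 1.0 / (1.0 / prior_scale + total_pressure)
    return shape, scale


def credible_interval(shape, scale, level=0.9, draws=20000, seed=0):
    """Equal-tailed credible interval, approximated from posterior samples."""
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(shape, scale) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


# Toy inputs (made-up numbers, three-day serial interval).
w = [0.2, 0.5, 0.3]
cases = [10, 12, 15, 18, 22, 25, 27, 30, 34, 37, 41, 45]

shape, scale = posterior_params(cases, w, t=11)
print("posterior mean:", shape * scale)
print("50% interval:", credible_interval(shape, scale, level=0.5))
print("90% interval:", credible_interval(shape, scale, level=0.9))
```

The 90% interval always contains the 50% interval, since both are cut from the same posterior distribution.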

Hypothesis test

We are primarily interested in whether the disease is spreading more rapidly or slowing down, i.e., whether $R$ is greater or smaller than 1, respectively. In order to answer this question, we employ a hypothesis test (which is conceptually simpler than the estimator described in the previous section). This hypothesis test is the basis for the plausibility assessment of $R \le 1$ versus $R > 1$ based on case numbers.

If we assume that $R = 1$ and treat the case numbers for days that are further than one week in the past as fixed, then we can use the mathematical model which was introduced in the previous section:

$$\mathbb{E}[I_t] = R \sum_{s=1}^{20} w_s \, I_{t-s}, \qquad R = 1.$$

This model allows us to calculate the probability $p$ that the model-based prediction for the sum of cases for the current week is larger than or equal to the observed number of cases $n$. Since a smaller $R$ yields fewer cases and therefore a smaller probability to observe $n$ or more cases, we know that for every value $R \le 1$ the probability to observe $n$ or more cases is at most $p$.

Therefore, if $p$ is very small, we have evidence that $R$ is not less than or equal to 1.

The value of $p$ can be determined numerically. Using $R = 1$ and the model above, one simulates many (e.g. 1000) different, virtual series of case data for the previous week. One then simply determines the proportion of simulations for which the simulated weekly case number is greater than or equal to $n$.
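The simulation described above can be sketched as follows. Everything here is a simplified illustration under stated assumptions: a toy three-day serial interval, made-up case counts, and a basic Poisson sampler (Knuth's multiplication method) that is only suitable for small daily case numbers. The function names are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw from a Poisson distribution (Knuth's method; small means only)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_week_total(rng, history, w, r=1.0, days=7):
    """Simulate `days` new days after the fixed `history`; return their sum."""
    series = list(history)
    for _ in range(days):
        # Expected cases today: r times the weighted sum of earlier cases,
        # where earlier days of the simulated week are included.
        lam = r * sum(w[s - 1] * series[-s] for s in range(1, len(w) + 1))
        series.append(poisson_draw(rng, lam))
    return sum(series[-days:])


def p_value(history, w, observed_total, n_sim=1000, seed=0):
    """Share of simulations under R = 1 with a weekly total >= observed_total."""
    rng = random.Random(seed)
    hits = sum(
        simulate_week_total(rng, history, w) >= observed_total for _ in range(n_sim)
    )
    return hits / n_sim


# Toy inputs (assumptions for illustration only).
w = [0.2, 0.5, 0.3]
history = [20, 22, 19, 21, 20]   # fixed cases from before the tested week
observed_week_total = 190        # made-up observed sum for the tested week

print(p_value(history, w, observed_week_total))
```

A small returned proportion means that weekly totals as large as the observed one rarely occur when the reproduction number equals 1.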