Our estimator on actual data

During the course of a pandemic, the spread of the disease within a population manifests itself in the number of new infections each day. While this number is, in general, unknown to us, it can be estimated from observables such as the number of cases that test positive each day. The aim of this project is to build an estimator for the unknown time series, I(t), formed by the daily number of new COVID-19 infections, based on the observed time series, P(t), of the number of hospitalized cases that test positive each day.

This simulator has two components: (1) a generative model ("ground truth") for the underlying time series of daily new infections and for the observed time series of hospitalized cases that test positive; and (2) our estimator for the underlying time series of daily new infections.

1. The generative model:

The built-in generative model is meant purely to serve as a tool for validating the estimator.

The generative model for new infections takes as input the size of the population (n), the initial fraction (ini) of the population that has been infected on day 0 of the timeline, and the model of interactions that lead to the spread of infection. The I(t) time series is generated from these inputs.

From I(t), to generate P(t), the time series of hospitalized cases that test positive, we require an additional input: the distribution of the incubation period, i.e., the time, in days, it takes for an infected individual to show symptoms. It is assumed that there is a maximum number of days (symp_max) within which symptoms show; beyond that an infected individual is taken to be asymptomatic, and the fraction of such individuals can be specified as an input to the model.
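The step from I(t) to P(t) can be sketched as a small simulation. This is only one plausible reading of the description above, not the page's actual generative code: the function name simulate_positives, the per-individual sampling, and the convention that list index t holds day t are all assumptions made for illustration.

```python
import random

def simulate_positives(I, q, asymp_frac, seed=0):
    """Simulate daily positive tests from daily new infections.

    I[d-1] is the number of new infections on day d (d = 1..T).
    q[k-1] is the probability that symptoms start exactly k days after
    infection (k = 1..symp_max), among individuals who do develop symptoms.
    Returns a list P where P[t] is the number of positives on day t
    (index 0 is unused padding).
    """
    rng = random.Random(seed)
    symp_max = len(q)
    days = list(range(1, symp_max + 1))
    P = [0] * (len(I) + symp_max + 1)
    for d, new_infections in enumerate(I, start=1):
        for _ in range(new_infections):
            if rng.random() < asymp_frac:
                continue  # asymptomatic: never shows up for testing in this model
            k = rng.choices(days, weights=q)[0]  # incubation period in days
            P[d + k] += 1  # tests positive on the day symptoms start
    return P

# Hypothetical toy inputs, chosen only to exercise the function.
q_demo = [0.2, 0.5, 0.3]
I_demo = [5, 8, 13, 21]
P_demo = simulate_positives(I_demo, q_demo, asymp_frac=0.2)
```

Because every symptomatic individual is counted exactly once, the total of P_demo can never exceed the total of I_demo, and the two totals coincide when asymp_frac is 0.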

2. The estimator:

The estimator only has access to the observed time series P(t). It does not have access to I(t), and crucially, it makes no assumptions whatsoever about the model according to which I(t) is generated. It also does not know the values of the input parameters, such as the precise incubation period distribution, used to generate P(t) from I(t). The user is allowed to provide as input what they believe to be the distribution of the incubation period, which could be different from the ground truth. Finally, there is a smoothing (regularization) parameter that can be used to control wild swings in estimates from one day to the next; the larger the value of this parameter, the smoother the estimator output.

The technical details:

The basis upon which our estimator stands is the model that relates the unknown time series I(t) to the observed time series P(t). We begin the description by establishing some terminology relevant to our model. An individual is "infected" if they are carrying the SARS-CoV-2 virus that is responsible for COVID-19. An infected individual is "symptomatic" if they are showing the symptoms of COVID-19. An infected individual is "pre-symptomatic" if they will eventually turn symptomatic. Finally, an infected individual is "asymptomatic" if they will never show symptoms. We assume that a newly infected individual is either "pre-symptomatic" or "asymptomatic" (i.e., on the day that they get infected, they cannot be "symptomatic").

Our model is parameterized by the incubation period distribution q: for k = 1,2, ..., symp_max, q(k) is the fraction of pre-symptomatic individuals for whom the onset of symptoms occurs exactly k days after getting infected. symp_max is the maximum number of days it takes for the onset of symptoms to occur in pre-symptomatic individuals. Based on the study [1], we take symp_max to be 14, and the incubation period distribution to be log-normal with a median of 5.1 days and a mean of 5.5 days.
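As a rough illustration, a discrete q(k) can be derived from a log-normal distribution with the stated median and mean. The discretization rule used here (day k collects the probability mass on the interval (k-1, k], followed by truncation at symp_max and renormalization) is an assumption, as are the function names; the page does not say how it discretizes the distribution. Here σ is derived from the median and mean, whereas the page's note on adjustable parameters fixes σ at 1/√6, which gives a slightly different value.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of Y = exp(X), where X is Gaussian with mean mu and std sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def incubation_pmf(median=5.1, mean=5.5, symp_max=14):
    """Return [q(1), ..., q(symp_max)] as a list that sums to 1."""
    mu = math.log(median)  # median of a log-normal is exp(mu)
    # mean of a log-normal is exp(mu + sigma^2 / 2); solve for sigma
    sigma = math.sqrt(2.0 * math.log(mean / median))
    # Assumed discretization: mass on (k-1, k] is assigned to day k.
    raw = [lognormal_cdf(k, mu, sigma) - lognormal_cdf(k - 1, mu, sigma)
           for k in range(1, symp_max + 1)]
    total = sum(raw)  # mass within symp_max days; renormalize after truncation
    return [r / total for r in raw]

q = incubation_pmf()  # q[k-1] holds q(k)
```

With these defaults, the largest entry of q falls on day 5, in line with the 5.1-day median.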

We define asymp_frac to be the fraction of infected individuals that are asymptomatic. This fraction is not easy to obtain, as it requires widespread testing. Different sources put this fraction anywhere between 0.01 and 0.25 (i.e., between 1% and 25% of all infected individuals). We leave this as an input that the user may specify to the estimator, based on specific knowledge of the target population.

The underlying assumption in our model is that when a pre-symptomatic individual starts to develop symptoms, they will go to a hospital or testing centre, where they will get tested. Since this individual is infected, the test will be positive. Now, recall that the total number of newly infected individuals on day t is given by I(t). Of these, on average, (1-asymp_frac)*I(t) are pre-symptomatic, and the remaining (asymp_frac)*I(t) are asymptomatic. Thus, the _average_ number of positive tests on day t is given by (1-asymp_frac)*[I(t-K)*q(K) + I(t-K+1)*q(K-1) + ... + I(t-1)*q(1)], where we set K:=symp_max. We simply equate this to P(t), justifying this by appealing to an empirical "law of large numbers". In summary, we have the relation

P(t) = (1-asymp_frac)*[I(t-K)*q(K) + I(t-K+1)*q(K-1) + ... + I(t-1)*q(1)],

where, again, K = symp_max is the maximum number of days it takes for the onset of symptoms to occur in pre-symptomatic individuals.
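The relation can be evaluated directly as a truncated convolution. The following is a minimal sketch; the function name expected_positives and the list indexing (I[d-1] holds I(d), q[k-1] holds q(k)) are illustrative assumptions, and the numbers in the example are made up.

```python
def expected_positives(I, q, asymp_frac):
    """Evaluate P(t) = (1-asymp_frac) * sum_{k=1..K} I(t-k) * q(k).

    I[d-1] holds I(d) for d = 1..T and q[k-1] holds q(k) for k = 1..K.
    Returns [P(K+1), ..., P(T+1)], the days for which every term
    I(t-K), ..., I(t-1) of the sum is available.
    """
    K = len(q)
    T = len(I)
    P = []
    for t in range(K + 1, T + 2):
        total = sum(I[t - k - 1] * q[k - 1] for k in range(1, K + 1))
        P.append((1.0 - asymp_frac) * total)
    return P

# Toy example with K = 3 and four days of infections.
P_example = expected_positives([10, 20, 30, 40], [0.5, 0.3, 0.2], asymp_frac=0.2)
# P(4) = 0.8 * (30*0.5 + 20*0.3 + 10*0.2) = 18.4 (up to rounding)
# P(5) = 0.8 * (40*0.5 + 30*0.3 + 20*0.2) = 26.4 (up to rounding)
```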

Thus, the vector P = [P(symp_max+1) P(symp_max+2) ... P(t)]' is obtained by pre-multiplying the vector I = [I(1) I(2) ... I(t-1)]' by a Toeplitz matrix M, yielding the relation P = M * I. The observed vector P is known to us, and the matrix M is obtained from the parameters of our model (which can be entered by the user). The goal is to estimate the vector I. For this purpose, we use a non-negative least-squares estimator with a regularizer to control unwanted oscillations:

Î = argmin_{I >= 0} [ ||P-M*I||^2 + λ * { [I(2)-I(1)]^2 + [I(3)-I(2)]^2 + ... + [I(t-1)-I(t-2)]^2 } ]

The regularization parameter λ>0 should be chosen carefully. As λ is increased, the resulting estimates of I will tend not to oscillate wildly from one day to the next. But too large a value of λ will cause the desired cost function ||P-M*I||^2 to have little effect on the minimization. We recommend choosing a λ in the range 0.1-0.2.
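To make the estimator concrete, here is a minimal sketch that builds the Toeplitz matrix M and minimizes the regularized cost subject to I >= 0. The solver (projected gradient descent with a fixed step) is a stand-in chosen to keep the example dependency-free; the page does not state which solver it uses. The names build_M, objective and estimate_infections, and the toy data, are assumptions for illustration.

```python
def build_M(q, asymp_frac, T):
    """Toeplitz matrix mapping [I(1)..I(T)] to [P(K+1)..P(T+1)], K = len(q)."""
    K = len(q)
    M = [[0.0] * T for _ in range(T - K + 1)]
    for r in range(T - K + 1):            # row r corresponds to day t = K + 1 + r
        for k in range(1, K + 1):
            M[r][K + r - k] = (1.0 - asymp_frac) * q[k - 1]  # coefficient of I(t-k)
    return M

def objective(I, M, P, lam):
    """||P - M*I||^2 + lam * sum of squared day-to-day differences of I."""
    fit = sum((p - sum(m * x for m, x in zip(row, I))) ** 2 for row, p in zip(M, P))
    rough = sum((I[j + 1] - I[j]) ** 2 for j in range(len(I) - 1))
    return fit + lam * rough

def estimate_infections(P, q, asymp_frac, lam=0.1, iters=2000):
    """Projected gradient descent for the non-negative regularized least squares."""
    T = len(P) + len(q) - 1
    M = build_M(q, asymp_frac, T)
    # Step size from an upper bound on the gradient's Lipschitz constant:
    # ||M||_2^2 <= (max row sum) * (max column sum); the difference term adds <= 4*lam.
    max_row = max(sum(abs(v) for v in row) for row in M)
    max_col = max(sum(abs(row[j]) for row in M) for j in range(T))
    step = 1.0 / (2.0 * (max_row * max_col + 4.0 * lam))
    I = [0.0] * T
    for _ in range(iters):
        resid = [sum(m * x for m, x in zip(row, I)) - p for row, p in zip(M, P)]
        grad = [2.0 * sum(M[r][j] * resid[r] for r in range(len(P))) for j in range(T)]
        for j in range(T - 1):            # gradient of the smoothness penalty
            d = 2.0 * lam * (I[j + 1] - I[j])
            grad[j] -= d
            grad[j + 1] += d
        I = [max(0.0, x - step * g) for x, g in zip(I, grad)]  # project onto I >= 0
    return I, M

# Toy check: generate P from a known I, then estimate I back from P alone.
q_toy = [0.5, 0.3, 0.2]
true_I = [10, 12, 15, 19, 24, 30, 37, 45, 54, 64, 75, 87]
M_true = build_M(q_toy, 0.2, len(true_I))
P_toy = [sum(m * x for m, x in zip(row, true_I)) for row in M_true]
est, M_est = estimate_infections(P_toy, q_toy, 0.2, lam=0.1)
```

A fixed-step projected gradient method converges slowly on ill-conditioned deconvolution problems such as this one, so a dedicated non-negative least-squares routine would likely be preferable in practice.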

Finally, note that to obtain any estimate of the number of infections on day t, we need the test results from day t+1. In other words, test results up to day t can only give us estimates of the number of infections up to day t-1. In fact, the estimates based on P(1), ..., P(t) are trustworthy only up to about symp_max/2 days before day t, because more recent infections have so far contributed only part of their eventual positive tests.


[1] Lauer et al. (2020), The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of Internal Medicine, 10th March 2020. https://annals.org/aim/fullarticle/2762808/incubation-period-coronavirus-disease-2019-covid-19-from-publicly-reported


DISCLAIMER: These plots are merely indicative and are NOT claimed to be estimates of the actual number of infected cases. This is because the data on the number of positive tests do not satisfy the assumptions on which our current estimation model is based. The model requires that the count of positive tests include only those individuals who come to hospitals on their own because they have developed symptoms consistent with COVID-19. In particular, it does not at present account for contact tracing and other methods used to actively identify infected individuals. Moreover, the exact asymptomatic fraction is not known and does not seem easy to estimate without widespread testing in the community.

Adjustable parameters


NOTE: A log-normal distribution is usually used for the incubation period distribution (see here). A log-normal random variable is Y = e^X, where X ~ Gaussian(μ, σ). The median of the log-normal distribution is e^μ. Here, σ is fixed to be 1/√6 (see Wiki).
Data for these plots are taken from the following sources. We thank them for making the data available for public use. We are also grateful to the Ministry of Health and Family Welfare and the Karnataka Govt. Media Bulletins.