After months of following a stay-at-home confinement strategy to slow the progression of COVID-19, many countries are now attempting a return to normal by implementing a cautious strategy of relaxing the rules of confinement. A key ingredient for the success of such a strategy is the ability to do extensive testing to detect asymptomatic carriers and to identify any new outbreak in real time while it is still localized. The enormous number of tests required for this strategy is a challenge for many countries, especially because the number of positive tests is expected to be very low.
Suppose we mix the samples from a group of individuals and test the pooled sample. If the test is negative, then, with only one test, we know that all individuals in the group are negative! If the test is positive, we need to refine our search: for instance, we may divide the group into two subgroups, test each subgroup, and so on. This is the principle of group testing.
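To make this halving strategy concrete, here is a minimal sketch in Python. The `test` oracle is hypothetical: it stands in for an actual pooled lab test and simply reports whether a subgroup contains at least one infected individual.

```python
def find_positives(group, test):
    """Recursively locate all positive individuals in `group`,
    where test(subgroup) returns True iff the pooled sample of
    `subgroup` is positive (a hypothetical oracle)."""
    if not test(group):
        return []            # one test clears the whole group
    if len(group) == 1:
        return group         # a positive singleton is identified
    mid = len(group) // 2
    return (find_positives(group[:mid], test) +
            find_positives(group[mid:], test))

# Example: individuals 3 and 17 are infected in a group of 20.
infected = {3, 17}
tests = 0
def test(subgroup):
    global tests
    tests += 1
    return any(i in infected for i in subgroup)

print(find_positives(list(range(20)), test), "using", tests, "tests")
```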
The idea is not new. It was introduced by Robert Dorfman in 1943, during World War II, when men called up for induction were tested for syphilis. Since then, group testing has become a large field of research with many applications. In theoretical computer science, a common problem is to determine the minimum number of tests necessary to detect all positive cases and to design practical, efficient algorithms. Both the minimum number of tests and the performance of the algorithms depend on the percentage of infected individuals, as well as on the reliability of the test, that is, the rates of false negatives and false positives. In practice, the rate of false negatives can increase if the group of mixed samples is too large.
There are two large classes of group-testing algorithms: adaptive and non-adaptive. An example of an adaptive algorithm would be to divide the individuals into groups of 20 and to test a mix of each group. If a group tests positive, then each individual in the group is tested. If the percentage of infected people is of the order of 1%, a mean of 25 tests per 100 individuals is required (for a single infected individual among 100: five group tests, plus 20 individual tests for the one positive group), hence a ratio of 1/4. If the percentage of infected people is of the order of 2%, the same strategy requires a mean of 39.4 tests per 100 individuals. Of course, in the latter case one can do better by, for instance, testing groups of 10 individuals instead. But to reach the most economical strategies, one needs to increase the number of rounds of tests.
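The article does not spell out its probability model, but under the common assumption that each individual is independently infected with probability $p$, the expected cost of this two-stage strategy is easy to compute. The sketch below yields figures close to, though not identical with, the ones quoted above, which may rest on a slightly different model.

```python
def dorfman_tests_per_100(p, group_size):
    """Expected number of tests per 100 individuals for two-stage
    testing: one pooled test per group, plus an individual test for
    every member of each positive group.  Assumes independent
    infections with probability p (a modelling assumption; the exact
    model behind the figures quoted above is not stated)."""
    p_group_positive = 1 - (1 - p) ** group_size
    expected_tests_per_person = 1 / group_size + p_group_positive
    return 100 * expected_tests_per_person

print(round(dorfman_tests_per_100(0.01, 20), 1))  # about 23.2
print(round(dorfman_tests_per_100(0.02, 20), 1))  # about 38.2
```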
An important drawback of adaptive group testing is that it requires several rounds of tests. This is something to be avoided when testing is time-consuming and results are urgently needed.
In non-adaptive group testing, all the tests are group tests, which are decided in advance and can be performed in parallel. The techniques are similar to those of error-correcting codes. Each individual can be assigned a 1 if infected and 0 otherwise. Thus, we can represent the state of a population of size $n$ as a word $X$ with $n$ letters, each letter a 0 or a 1. If the number of 1 entries is small, they can be viewed as “errors” in the transmission of an initial word with all 0 entries.
Error-correcting codes are designed to locate and correct errors, provided not too many errors have occurred. The technique is to multiply $X$ on the left by a $T \times n$ matrix $M=(m_{ij})$, the code matrix, and to localize the errors from the vector $Y=MX$. The rows of $M$ represent the different tests to be made; the nonzero entries of the $i$th row indicate the individuals whose samples are mixed in the $i$th test. In the matrix multiplication, we replace the addition $+$ by the Boolean sum $\vee$, so that $y_i= \bigvee_{j=1}^n m_{ij}x_j$; that is, $y_i=1$ if and only if there is at least one infected individual in the subgroup tested in the $i$th test. The code matrix $M$ is constructed so that, given $Y$, we can uniquely recover $X$ under the assumption that at most $k$ individuals are infected among the whole population of $n$. Such code matrices can be constructed with either probabilistic or algebraic methods. A sufficient condition for uniquely determining the infected individuals is that no column of the matrix is included in the Boolean sum of any $k$ other columns.
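As an illustration, here is a sketch of the Boolean encoding together with the standard decoding rule: any individual who appears in at least one negative test is surely healthy, and the column condition above guarantees that the remaining individuals are exactly the infected ones. The $4 \times 6$ matrix below is a toy example for $k=1$ (no column is contained in any other), not a code matrix taken from the literature.

```python
import numpy as np

def encode(M, x):
    """Compute y_i = OR_j (m_ij AND x_j): test i is positive
    exactly when it pools at least one infected sample."""
    return (M @ x > 0).astype(int)

def decode(M, y):
    """Clear every individual who appears in at least one negative
    test; under the column condition, with at most k infected, the
    survivors are exactly the infected individuals."""
    in_negative_test = (M * (1 - y)[:, None]).sum(axis=0) > 0
    return np.where(~in_negative_test)[0]

# Toy k=1 example: T=4 tests, n=6 individuals; every column has two
# nonzero entries and no column is contained in another.
M = np.array([[1, 1, 1, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])
x = np.array([0, 0, 1, 0, 0, 0])     # individual 2 is infected
y = encode(M, x)
print(y)             # [1 0 0 1]: tests 0 and 3 come back positive
print(decode(M, y))  # [2]
```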
One can generate random matrices $M$ whose entries $m_{ij}$ are independent Bernoulli variables with probability $p$ of taking the value 1. For adequately chosen $p$ and $T$, a random matrix satisfies the sufficient condition with positive probability. Algebraic methods borrowed from Reed-Solomon codes provide better code matrices; it is possible to generate code matrices with a given number $r$ (resp. $c$) of nonzero entries in each row (resp. column). This means that the number of samples mixed in each test is $r$ and that the sample from each individual is used in $c$ tests.
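A sketch of the probabilistic construction, with illustrative parameters of my choosing: draw Bernoulli entries and retry until the sufficient condition holds. Checking the condition exhaustively costs on the order of $n\binom{n-1}{k}$ column comparisons, so this brute-force version is only feasible at toy scale, where random matrices need $T$ comparable to $n$; their economy, like that of the algebraic constructions, appears only for large populations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def is_k_disjunct(M, k):
    """Check the sufficient condition: no column of M is included
    in the Boolean sum of any k other columns (brute force)."""
    n = M.shape[1]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for S in itertools.combinations(others, k):
            union = M[:, list(S)].max(axis=1)  # Boolean sum of k columns
            if np.all(M[:, j] <= union):       # column j is covered: fail
                return False
    return True

def random_code_matrix(T, n, k, p, max_tries=1000):
    """Rejection sampling: i.i.d. Bernoulli(p) entries, retried until
    the matrix satisfies the condition (toy scale only)."""
    for _ in range(max_tries):
        M = (rng.random((T, n)) < p).astype(int)
        if is_k_disjunct(M, k):
            return M
    raise RuntimeError("no suitable matrix found; increase T or adjust p")

# Illustrative, unoptimized parameters: T=16 tests, n=12 people, k=1.
M = random_code_matrix(T=16, n=12, k=1, p=0.5)
print(M.shape)
```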
The drawback of non-adaptive group testing is that the tests are designed for a maximum number $k$ of infected individuals; if more are infected, the tests may fail to report some of them. Hence, the frequency $k/n$ of infected individuals has to be estimated in advance, and the test has to be designed for a reasonable upper bound on that frequency.
The technique has been adapted and tested for COVID-19 (see [1]) with $n=384$, $T=48$, $c=6$, $r=48$, and $k=5$ (that is, a frequency of about 1.3% infected individuals). Hence, 48 tests are performed in parallel, each on a mix of 48 samples. Each individual's sample is split into six equal parts, which go into six different tests. A robot can easily be programmed to prepare the tests. This method achieves an excellent ratio of $T/n=1/8$.
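As a quick consistency check on these parameters (simple double counting, not taken from [1]): counting the nonzero entries of the code matrix by rows and by columns must give the same total, and indeed

$$T \, r = n \, c: \qquad 48 \times 48 = 384 \times 6 = 2304.$$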
[1] N. Shental, S. Levy, V. Wuvshet, S. Skorniakov, Y. Shemer-Avni, A. Porgador, and T. Hertz, Efficient high throughput SARS-CoV-2 testing to detect asymptomatic carriers, medRxiv preprint, doi: https://doi.org/10.1101/2020.04.14.20064618