This vignette provides an overview of the functions that can be used to estimate the sample size needed to detect a pathogen variant or estimate its frequency in a population, given molecular characterization of a single cross-sectional sample of pathogen infections.

When the goal is detecting the presence or absence of a specific
variant in a population using a single cross-sectional sample
(**Figure 1**), we can use the
`vartrack_samplesize_detect()`

function with the
`sampling_freq = "xsect"`

option. Applying this function,
however, requires knowledge of the coefficient of detection ratio
between two pathogen variants (or, more commonly, one variant and the
rest of the pathogen population). The coefficient of detection ratio for
two variants can be calculated using the
`vartrack_cod_ratio()`

function (see *Estimating bias in observed
variant prevalence* for more details). Since we are only
interested in the ratio of the coefficients of detection, applying this
function only requires providing parameters which are expected to differ
between variants. The ratio between any variants not provided is assumed
to be equal to one.

Once we have an estimate of the coefficient of detection ratio, we can calculate the sample size needed for detection from the following parameters:

Param | Variable Name | Description |
---|---|---|

\(P_{V_1}\) |
p_v1 | the desired minimum variant prevalence to be able to detect |

\(p\) |
prob | the desired probability of detection |

\(\omega\) |
omega | the sequencing success rate |

\(\frac{C_{V_1}}{C_{V_2}}\) |
c_ratio | the coefficient of detection ratio, calculated as the ratio of the
coefficients of variant 1 to variant 2 (can be calculated using
`vartrack_cod_ratio()` ) |

We then apply this sample size calculation function as follows:

`library(phylosamp)`

```
vartrack_samplesize_detect(p_v1=0.02, prob=0.95, omega=0.8,
c_ratio=1.368, sampling_freq="xsect")
```

`## Calculating sample size for variant detection assuming single cross-sectional sample`

`## [1] 135.9928`

In other words, 136 samples are needed to detect a variant at 2% (or higher) in a population with 95% of detection, given a coefficient of detection ratio (\(\frac{C_{V_1}}{C_{V_2}}\)) of 1.368 and a single, cross-sectional sample. This takes into account the fact that not all samples sequenced (or otherwise characterized) will be successful. We assume an 80% success rate (\(\omega=0.8\)), which ensures a selection of 136 samples will produce the 109 high quality data points that can be used to detect the presence of a pathogen variant.

In some cases, it may not be enough to simply *detect* a
variant—we may want to estimate its frequency in the population
(**Figure 2**). In that case, we can use the
`vartrack_samplesize_prev()`

function to determine the sample
size needed to estimate the prevalence of a variant in a population
given some desired precision. This function requires the user to
specific a slightly different set of parameters:

Param | Variable Name | Description |
---|---|---|

\(P_{V_1}\) |
p_v1 | the desired minimum variant prevalence to be able to detect |

\(p\) |
prob | the desired probability of detection |

\(d\) |
precision | the desired precision in the prevalence estimate |

\(\omega\) |
omega | the sequencing success rate |

\(\frac{C_{V_1}}{C_{V_2}}\) |
c_ratio | the coefficient of detection ratio, calculated as the ratio of the
coefficients of variant 1 to variant 2 (can be calculated using
`vartrack_cod_ratio()` ) |

We then can calculate sample size as follows:

```
<- vartrack_cod_ratio(psi_v1=0.25, psi_v2=0.4, tau_a=0.05, tau_s=0.3)
c1_c2 vartrack_samplesize_prev(p_v1=0.1,prob=0.95,precision=0.25,
omega=0.8, c_ratio=c1_c2, sampling_freq="xsect")
```

`## Calculating sample size for variant prevalence estimation assuming single cross-sectional sample`

`## [1] 582.2843`

In other words, 583 samples must be processed in order to estimate the population prevalence of a variant with at least 10% prevalence in the population, with a precision of 25% of the true value, and a confidence of 95% in the prevalence estimate. Again, we assume an 80% success rate, which ensures successful characterization of 466 of the 583 samples selected for sequencing so they can be used to detect the presence of a pathogen variant.

For information on functions that can be used to estimate the sample
size given a periodic sampling approach, see *Estimating the sample
size needed for variant monitoring:* *periodic sampling*. These
functions can also be used in “reverse”, to calculate the probability of
detection given some sampling scheme, as in the *Estimating the
probability of detecting a variant* *cross-sectional* and *periodic* vignettes.