Why The Uniformity Of A DNA Library Matters

Image of playing cards in a swirling circle pattern, representing the coupon collector’s problem. Uniformity is an important concept for baseball card collectors and molecular biologists.

What do baseball card collectors, CRISPR researchers, and antibody engineers have in common?

They’re all battling a statistical hurdle known as the coupon collector’s problem. In essence, the problem describes how difficult it will be to collect all of a set of items given a finite number of collection attempts. To understand this concept, consider the case of a baseball card collector.

The trading card market is worth billions of dollars. Though the cards are little more than shiny paper rectangles with a material value of less than a penny, some of the rarest baseball cards have sold at auction for millions of dollars. The market thrives in part because manufacturers rely on the coupon collector’s problem to increase sales.

Here’s how it works: Let’s say there are 100 unique baseball cards that can be collected, and each pack of cards sold contains one card. In this scenario, collectors would need to buy far more than 100 packs to collect all 100 cards because they are likely to get duplicates. When you first set out, every pack is likely to have a new card for your collection. But every time you get a new one, it increases the odds that your next card will be one you’ve already seen. Because of this, you’ll be motivated to buy more packs until you complete your collection.
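The arithmetic behind this intuition can be sketched in a few lines of Python. This is a minimal illustration that assumes every card is equally likely in every pack; the function name `expected_packs` is ours, chosen for the example.

```python
from fractions import Fraction


def expected_packs(n_cards: int) -> float:
    """Average number of one-card packs needed to see every card at least once,
    assuming all cards are equally likely (an idealized, uniform set)."""
    # Once you hold k distinct cards, a pack is "new" with probability (n - k) / n,
    # so the next new card takes n / (n - k) packs on average.
    # Summing over k = 0 .. n-1 gives n * (1 + 1/2 + ... + 1/n).
    harmonic = sum(Fraction(1, k) for k in range(1, n_cards + 1))
    return float(n_cards * harmonic)


print(round(expected_packs(100)))  # 519
```

The last few cards dominate the total: with 99 of 100 cards in hand, each pack has only a 1% chance of finishing the set.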

The coupon collector’s problem gives us a way to estimate how many attempts we’d need to collect all of the cards. If every card is equally likely, it will take approximately 520 packs on average to collect the complete set of 100. To make some cards more desirable, manufacturers assign each card a rarity: they saturate the pool with “common” cards that cover only a small fraction of the players in the set of 100, and they print only a few cards for other players, making those cards quite “rare”.

Another way to say this is that the pool of baseball cards is non-uniformly distributed. When this happens, the actual number of packs needed to complete the set, and therefore the cost to complete the set, is significantly higher.
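A small simulation makes the cost of non-uniformity concrete. The sketch below is illustrative only: the rarity scheme (90 common cards and 10 cards printed one-tenth as often) is a made-up assumption, not data from any real card set.

```python
import random


def packs_to_complete(weights, rng):
    """Open one-card packs until every card has been seen; return the pack count."""
    cards = range(len(weights))
    seen = set()
    packs = 0
    while len(seen) < len(weights):
        # Draw packs in small batches, counting each one as it is opened.
        for card in rng.choices(cards, weights=weights, k=64):
            packs += 1
            seen.add(card)
            if len(seen) == len(weights):
                break
    return packs


def mean_packs(weights, trials=200, seed=0):
    """Average pack count over repeated simulated collections."""
    rng = random.Random(seed)
    return sum(packs_to_complete(weights, rng) for _ in range(trials)) / trials


uniform = [1.0] * 100              # every card equally likely
skewed = [1.0] * 90 + [0.1] * 10   # hypothetical: 10 cards printed 10x less often

print("uniform:", round(mean_packs(uniform)))
print("skewed: ", round(mean_packs(skewed)))
```

Under these assumed weights, the skewed set takes several times as many packs to complete as the uniform one, even though only 10 of the 100 cards are rare.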

Honus Wagner Rookie Baseball Card Image — *A Honus Wagner baseball card from 1909 sold for over $6 million at auction in August 2021.*

In statistical terms, the number of samples (packs) needed to complete the set depends on two key factors: the number of unique individuals (cards) in the set and the distribution (rarity) of those individuals. The more uniformly the individuals are distributed in the population, the fewer samples you’ll need to complete the set.

The coupon collector’s problem in molecular biology

Molecular biologists, antibody engineers, and geneticists all have to contend with the coupon collector’s problem as well.

Take, for instance, a researcher who wants to perform a large-scale CRISPR knockout screen. In these experiments, a pool of thousands of unique sgRNAs is synthesized and transduced into cells, where the sgRNAs direct edits to the genome. After editing, researchers typically separate the cells based on a reporter signal (such as GFP expression) and then sequence a subset of the population of interest. If the pool of sgRNAs is not uniformly synthesized (meaning a few sgRNAs are common and many others are rare), then researchers have a problem. When they sequence the sorted cell population, most of the cells will likely have received the same few sgRNAs and thus add no new information to the data set. To truly capture all of the sgRNAs in the population, the researchers will need to pay to sequence more cells in an effort to detect the “rare” sgRNAs in their pool.

This same principle applies to antibody engineers who may screen libraries containing billions of candidate antibodies. If their antibody library is not uniformly distributed, they will have to spend more to ensure that all potential candidates are represented in their data.

As with card collectors, the time and money researchers must spend oversampling depends directly on the uniformity of the library they are screening. With a more uniform library, researchers can obtain more hits at a given oversampling rate.
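To see how uniformity changes what a fixed sampling budget can detect, consider the rough calculation below. It assumes reads are drawn independently in proportion to each library member's abundance, and the library sizes and abundances are hypothetical values chosen for illustration.

```python
def expected_fraction_seen(abundances, n_reads):
    """Expected fraction of library members observed at least once in n_reads,
    assuming each read is an independent draw proportional to abundance."""
    total = sum(abundances)
    # A member with probability p is missed by every read with probability (1 - p) ** n_reads.
    seen = sum(1 - (1 - a / total) ** n_reads for a in abundances)
    return seen / len(abundances)


library_size = 1000
n_reads = 5 * library_size  # 5x oversampling

uniform = [1.0] * library_size
# Hypothetical skew: half of the members are 20x less abundant than the rest.
skewed = [1.0] * 500 + [0.05] * 500

print(f"uniform library: {expected_fraction_seen(uniform, n_reads):.1%} of members seen")
print(f"skewed library:  {expected_fraction_seen(skewed, n_reads):.1%} of members seen")
```

At the same 5x oversampling, nearly every member of the uniform library is expected to appear, while a large share of the low-abundance members in the skewed library are never observed.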

Combatting the coupon collector’s problem with Twist Bioscience

The importance of uniformity in the coupon collector’s problem was highlighted in a 2020 paper from the California Institute of Technology. Researchers screened a library of adeno-associated virus (AAV) capsids for variants capable of gene delivery to the mouse brain. Developing delivery vehicles this precise is an essential step in producing effective gene therapies, and it requires engineering the AAV capsid and selecting for high specificity for a cell type of interest.

Billions of candidate capsids were initially generated as a DNA library and subjected to positive selective pressure for targeting brain cells. Thousands of candidates were identified and taken forward to a second round of screening. The researchers compared two methods of generating this round-two library: PCR amplification from positive round-one samples, or working with Twist Bioscience to synthesize an oligo pool containing the positive sequences from round one.

Distribution curve showing superior uniformity for the DNA library synthesized by Twist compared to a library prepared by PCR amplification. — *Figure adapted from Kumar et al., 2020, figure 2c. An AAV capsid library derived from Twist oligo pool synthesis was far more uniform than an equivalent library derived from PCR amplification.*

The results showed that the PCR-generated library was highly skewed, with a small number of sequences constituting the majority of the pool and a large number of sequences barely represented. By comparison, the synthetic pool was highly uniform. This difference carried through to the second screen, where the PCR-generated library yielded 700 hits and the synthetic Twist library yielded 1,700.

Lorenz Curve comparing the uniformity of two libraries, one prepped by Twist synthesis and one with PCR. Twist's library is very close (0.17) to the theoretical perfect uniformity (0.0), the PCR library is not (0.63). — Figure adapted from Kumar et al., 2020. An AAV capsid library derived from Twist oligo pool synthesis demonstrates significantly better uniformity relative to the PCR-derived library. Shown here is a Lorenz curve demonstrating the theoretical perfect uniformity wherein every oligonucleotide (oligo) is equally represented (black line).
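For readers who want to compute a similar summary from their own sequencing counts, here is a minimal sketch of a Lorenz curve and a Gini-style score, where 0.0 means every sequence is equally represented. We are assuming the scores quoted for the figure are of this general kind; the exact method used in the paper is not described here, and the read counts below are invented for illustration.

```python
def lorenz_curve(counts):
    """Cumulative (fraction of sequences, fraction of reads) points, rarest first."""
    ordered = sorted(counts)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, count in enumerate(ordered, start=1):
        running += count
        points.append((i / len(ordered), running / total))
    return points


def gini(counts):
    """Gini-style score: 1 minus twice the area under the Lorenz curve.
    0.0 is perfect uniformity; values near 1.0 mean a few sequences dominate."""
    pts = lorenz_curve(counts)
    # Trapezoid rule between consecutive Lorenz points.
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


even_pool = [100, 95, 105, 98, 102]  # hypothetical read counts per sequence
skewed_pool = [480, 10, 5, 3, 2]     # hypothetical read counts per sequence

print(f"even pool:   {gini(even_pool):.2f}")
print(f"skewed pool: {gini(skewed_pool):.2f}")
```

The closer a library's Lorenz curve sits to the diagonal, the smaller this score and the less oversampling is needed to see every member.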

From the oligo pool-derived library, the researchers identified several AAV capsids that could cross the blood-brain barrier and target brain cells without targeting other tissues. The researchers state that many of the sequences taken forward for further validation from the oligo pool library were missing from the PCR-generated library.

Highly uniform DNA libraries allow researchers to generate more hits from their screens, saving both time and money over the course of an experiment. Twist Bioscience specializes in precise and uniform DNA synthesis. Twist’s silicon-based platform synthesizes millions of oligonucleotides simultaneously with high uniformity and accuracy. These oligos are then converted into custom CRISPR libraries, protein variant libraries, and target capture NGS panels for high-fidelity screening experiments. Using highly uniform libraries keeps the oversampling needed to represent a dataset to a minimum, so researchers can screen with confidence while saving resources.