Library construction for next-generation sequencing
Library construction for next-generation sequencing
Abstract
High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed.
Keywords: deep sequencing, DNA, RNA, library preparation, next-generation sequencing, RNA-seq, DNA-seq, ChIP-seq, RIP-seq
Over the past five years, next-generation sequencing (NGS) technology has become widely available to life scientists. During this time, as sequencing technologies have improved and evolved, so too have methods for preparing nucleic acids for sequencing and constructing NGS libraries (1,2). For example, NGS library preparation has now been successfully demonstrated for sequencing RNA and DNA from single cells (311).
Fundamental to NGS library construction is the preparation of the nucleic acid target, RNA or DNA, into a form that is compatible with the sequencing system to be used (Figure 1). Here, we compare and contrast various library preparation strategies and NGS applications, focusing primarily on those compatible with Illumina sequencing technology. However, it should be noted that almost all of the principles discussed in this review can be applied with minimal modification to NGS platforms developed by Life Technologies, Roche, and Pacific Biosciences.
Fragmentation/Size selection
In general, the core steps in preparing RNA or DNA for NGS analysis are: (i) fragmenting and/or sizing the target sequences to a desired length, (ii) converting target to double-stranded DNA, (iii) attaching oligonucleotide adapters to the ends of target fragments, and (iv) quantitating the final library product for sequencing.
The size of the target DNA fragments in the final library is a key parameter for NGS library construction. Three approaches are available to fragment nucleic acid chains: physical, enzymatic, and chemical. DNA fragmentation is typically done by physical methods (i.e., acoustic shearing and sonication) or enzymatic methods (i.e., non-specific endonuclease cocktails and transposase tagmentation reactions)(12). In our laboratory, acoustic shearing with a Covaris instrument (Covaris, Woburn, MA) is typically done to obtain DNA fragments in the 100 bp range, while Covaris g-TUBEs are employed for the 620 Kbp range necessary for mate-pair libraries. Enzymatic methods include digestion by DNase I or Fragmentase, a two enzyme mix (New England Biolabs, Ipswich MA). Comparisons of NGS libraries constructed with acoustic shearing/sonication versus Fragmentase found both to be effective (13). However, Fragmentase produced a greater number of artifactual indels compared with the physical methods. An alternative enzymatic method for fragmenting DNA is Illuminas Nextera tagmentation technology (Illumina, San Diego, CA) in which a transposase enzyme simultaneously fragments and inserts adapter sequences into dsDNA. This method has several advantages, including reduced sample handling and preparation time (12).
Desired library size is determined by the desired insert size (referring to the library portion between the adapter sequences), because the length of the adaptor sequences is a constant. In turn, optimal insert size is determined by the limitations of the NGS instrumentation and by the specific sequencing application. For example, when using Illumina technology, optimal insert size is impacted by the process of cluster generation in which libraries are denatured, diluted and distributed on the two-dimensional surface of the flow-cell and then amplified. While shorter products amplify more efficiently than longer products, longer library inserts generate larger, more diffuse clusters than short inserts. We have successfully sequenced libraries with Illumina instruments up to bases in length.
Optimal library size is also dictated by the sequencing application. For exome sequencing, more than 80% of human exomes are under 200 bases in length (14). We run 2 × 100 paired-end reads and our exome sequencing libraries typically contain insert sizes of approximately 250 bases in length as a compromise to match the average size of most exons while sequencing without overlapping read pairs. The size of an RNA-Seq library is also determined by the applications. We typically do basic gene expression analysis using single-end 100 base reads. However, for analysis of alternative splicing or determination of transcription start and stop sites, we employ 2 × 100 base paired-end reads. In most instances, the RNA will be fragmented before conversion into cDNA. This is typically done through the use of controlled heated digestion of the RNA with a divalent metal cation (magnesium or zinc). The desired length of the library insert can be adjusted by increasing or decreasing the time of the digestion reaction with good reproducibility.
In a recent study of seven different RNA-seq library preparation methods (15), the majority involve some sort of fragmentation of the mRNA prior to adapter attachment. The two that do not use a hexamer priming method (16) or in the case of the SMARTer Ultra Low RNA Kit (Clontech, Mountain View, CA)(17), a full length cDNA is synthesized with a fixed 3 and 5 sequence added so that the entire cDNA library (average 2 kb in length) can be amplified in long distance PCR (LD-PCR). This amplified double-stranded cDNA is then fragmented by acoustic shearing to the appropriate size and used in a standard Illumina library preparation (involving end-repair and kination, A-tailing and adapter ligation, followed by additional amplification by PCR).
A second post-library construction sizing step is commonly used to refine library size and remove adaptor dimers or other library preparation artifacts. Adapter dimers are the result of self-ligation of the adapters without a library insert sequence. These dimers form clusters very efficiently and consume valuable space on the flow cell without generating any useful data. Thus, we typically use either magnetic bead-based clean up, or we purify the products on agarose gels. The first works in most instances for samples where sufficient starting material is available. When sample input is limiting, more adapter dimer products are often generated. In our experience, bead-based methods may not perform optimally in this situation and combining bead-based with agarose gel purifications may be necessary.
In the case of microRNA (miRNA)/ small RNA library preparation, the desired product is only 2030 bases larger than the 120 bp adaptor dimers. Therefore, it is critical to perform a gel size selection to enrich the libraries as much as possible for the desired product. This resolution of separation is not feasible using beads. Alternatively, we often create large library inserts (1 kb) combined with longer reads (2 × 300 base paired-end) and no PCR amplification for de novo assembly of bacterial genomes. To optimize the value of the data generated for de novo assembly, it is necessary to do careful gel-based size selections to ensure uniform insert size.
NGS library construction using fragmented/size selected DNA
There are several important considerations when preparing libraries from DNA samples, including the amount of starting material and whether the application is for resequencing (in which a reference sequence is available to align reads to) or de novo sequencing (in which the reads will need to be assembled to create a new reference sequence). Library preparations can be susceptible to bias resulting from genomes that contain unusually high or low GC content and approaches have been developed to address these situations through careful selection of polymerases for PCR amplification, thermocycling, conditions and buffers (18, 1921).
Library preparation from DNA samples for sequencing whole genomes, targeted regions within genomes (for example exome sequencing), ChIP-seq experiments, or PCR amplicons (see below) follows the same general workflow. Ultimately, for any application, the goal is to make the libraries as complex as possible (see below).
Numerous kits for making sequencing libraries from DNA are available commercially from a variety of vendors. Competition has driven prices steadily down and quality up. Kits are available for making libraries from microgram down to picogram quantities of starting material. However, one should keep in mind the general principle that more starting material means less amplification and thus better library complexity.
With the exception of Illuminas Nextera prep, library preparation generally entails: (i) fragmentation, (ii) end-repair, (iii) phosphorylation of the 5 prime ends, (iv) A-tailing of the 3 ends to facilitate ligation to sequencing adapters, (v) ligation of adapters, and (vi) some number of PCR cycles to enrich for product that has adapters ligated to both ends (1) (Figure 1). The primary differences in an Ion Torrent workflow are the use of blunt-end ligation to different adapter sequences.
Once the starting DNA has been fragmented, the fragment ends are blunted and 5 phosphorylated using a mixture of three enzymes: T4 polynucleotide kinase, T4 DNA polymerase, and Klenow Large Fragment. Next, the 3 ends are A-tailed using either Taq polymerase or Klenow Fragment (exo-). Taq is more efficient at A-tailing, but Klenow (exo-) can be used for applications where heating is not desired, such as preparing mate-pair libraries. During the adapter ligation reaction the optimal adapter:fragment ratio is ~10:1, calculated on the basis of copy number or molarity. Too much adapter favors formation of adapter dimers that can be difficult to separate and dominate in the subsequent PCR amplification. Bead or column-based cleanups can be performed after end repair and A-tail reactions, but after ligation we find bead-based cleanups are more effective at removing excess adapter dimers.
To facilitate multiplexing, different barcoded adapters can be used with each sample. Alternatively, barcodes can be introduced at the PCR amplification step by using different barcoded PCR primers to amplify different samples. High quality reagents with barcoded adapters and PCR primers are readily available in kits from many vendors. However, all the components of DNA library construction are now well documented, from adapters to enzymes, and can readily be assembled into home-brew library preparation kits.
An alternative method is the Nextera DNA Sample Prep Kit (Illumina), which prepares genomic DNA libraries by using a transposase enzyme to simultaneously fragment and tag DNA in a single-tube reaction termed tagmentation (Figure 2)(22). The engineered enzyme has dual activity; it fragments the DNA and simultaneously adds specific adapters to both ends of the fragments. These adapter sequences are used to amplify the insert DNA by PCR. The PCR reaction also adds index (barcode) sequences. The preparation procedure improves on traditional protocols by combining DNA fragmentation, end-repair, and adaptor-ligation into a single step. This protocol is very sensitive to the amount of DNA input compared with mechanical fragmentation methods. In order to obtain transposition events separated by the appropriate distances, the ratio of transposase complexes to sample DNA is critical. Because the fragment size is also dependent on the reaction efficiency, all reaction parameters, such as temperatures and reaction time, are critical and must be tightly controlled.
Sequencing the genomes of single cells has been recently reported by several group (11,2326). The current strategy utilizes whole genome amplification with multiple displacement amplification (MDA). MDA is based on the use of random primers with phi29, a highly processive strand displacing polymerase (27). While this technique is capable of generating enough amplified material to construct sequencing libraries, it suffers from considerable bias, created by nonlinear amplification. A recent report demonstrated a significantly improved method of MDA by adding a quasi-linear preamplification step that reduced bias (10). A technology platform based on small compartmentalization and microfluidics can be used to facilitate library preparation from up to 96 single cells per run is offered by Fluidigm (South San Francisco, CA).
NGS library construction using RNA
It is important to consider the primary objective of an RNA sequencing experiment before making a decision on the best library protocol. If the objective is discovery of complex and global transcriptional events, the library should capture the entire transcriptome, including coding, noncoding, anti-sense and intergenic RNAs, with as much integrity as possible. However, in many cases the objective is to study only the coding mRNA transcripts that are translated into the proteins. Yet another objective might be to profile only small RNAs, most commonly miRNA, but also small nucleolar RNA (snoRNA), piwi-interacting RNA (piRNA), small nuclear RNA (snRNA), and transfer RNA (tRNA). While we will endeavor to describe the principles of RNA sequencing libraries in this review, it is not possible to explain all of the different protocols available. Interested readers should research the many options (Table 1) themselves.
Table 1.
Objective Principles of approach References Gene expression Target poly(A) mRNAs (enrich or selectively amplify). To quantify expression new methods are available based on 3 sequence tags or combinatorial barcodes to remove duplicate reads. Short read runs (50100 bp) concentrating on 3 sequence can be sufficient and save considerable resources. One option is to spike in the ERCC synthetic standards for quantification (110). (36,111,112) Alternative splicing Target exon/intron boundaries by either doing long read sequencing (>300 bp) or paired end read sequencing ( 2 × 100). In the case of paired end sequencing, the insert size is typically larger and/or variable in size. (113,114) miRNA (or small RNAs) Target short reads using size selection purification because miRNAs are in the 1823 bp range. piRNAs, snoRNAs, tRNAs are all under 100 bps. (115) Non-coding RNA Directional RNA sequencing is critical. (116,117) Anti-sense RNA Consider combining mRNA expression with directional sequencing to reveal the subset of transcripts representing the anti-sense orientation and correlate these with gene expression changes. (30,118) Single cell sequencing Requires special strategies to start with picogram quantities of input RNA and allow extensive whole transcriptome amplification.Critical challenge is the technical noise created by amplification. (3,5,7,10,11)
One of the first and earliest successes in applying NGS to RNA-seq was in the case of miRNA (28,29). The protocols for preparing miRNA sequencing libraries are surprisingly simple and are usually performed in a one-pot reaction (Figure 3). The fact that miRNAs are found in their native state with a 5 terminal phosphate allows the use of ligases to selectively target miRNAs.
In the first step of the Illumina protocol (Figure 3A), an adenylated DNA adapter with a blocked 3 end is ligated to the RNA sample using a truncated T4 RNA ligase 2. This enzyme is modified to require the 3 adapter substrate to be adenylated. The result is that fragments of other RNA species in the total RNA sample are not ligated together in this reaction; only the pre-adenylated oligo-nucleotide can be ligated to free 3 RNA ends. Moreover, since the adapter is 3 blocked, it cannot serve as a substrate for self-ligation. In the next step, a 5 RNA adapter is added along with ATP and RNA ligase 1. Only RNA molecules whose 5 ends are phosphorylated will be effective substrates for the ligation reaction. After this second ligation, a reverse transcription (RT) primer is hybridized to the 3 adapter and a RT-PCR amplification is performed (usually 12 cycles). Due to the small but predictable size of the miRNA library (120 bases of adapter sequence plus the miRNA insert of ~2030 bases), the library or a pooled sample composed of multiple barcoded libraries can be run on a gel and size selected. The gel size selection is critical due to the presence of adapter dimer side products created during the ligation reaction as well as higher molecular weight products generated from ligation of other non-miRNA RNA fragments containing 5 phosphate groups (e.g., tRNA and snoRNA). This library preparation method results in an oriented library such that the sequencing always reads from the 5 end to the 3 end of the original RNA species. The principle of miRNA sequencing on the Ion Torrent platform is similar (Figure 3B). Ion Torrent uses dual duplex adapters that ligate to the miRNAs 3 and 5 ends in a single reaction, followed by RT-PCR. This general library prep approach can also be used to create a directional RNA-seq library from any RNA substrate.
One major limitation in miRNA library construction arises when the amount of input RNA is low (e.g., <200 ng total RNA); short adapter dimers compete in the RT-PCR reaction with the desired product, adapters, and miRNA inserts. When too many adapter dimers are present they stream up the gel during the size selection step and contaminate the product bands. To minimize this problem, many commercial miRNA library preparation kits now incorporate various strategies to suppress adapter dimer formation.
For mRNA sequencing libraries, methods have been developed based on cDNA synthesis using random primers, oligo-dT primers, or by attaching adapters to mRNA fragments followed by some form of amplification. mRNA can be primed by random oligomers or by an anchored oligo-dT to generate first strand cDNA. If random priming is used, the rRNA must first be removed or reduced. rRNA can be removed using oligonucleotide probe-based reagents, such as Ribo-Zero (Epicenter, Madison, WI) and RiboMinus (Life Technologies, Carlsbad, CA). Alternatively, poly-adenylated RNA can be positively selected using oligo-dT beads.
It is often desirable to create libraries that retain the strand orientation of the original RNA targets. For example, in some cases transcription creates anti-sense RNA constructs that may play a role in regulating gene expression (30). In fact, long noncoding RNA (lncRNA) analysis depends on directional RNA sequencing (31). Methods for preparing directional RNA-seq libraries are now readily available (15). The concept is to perform the cDNA reaction and remove one of the two strands selectively, by incorporating dUTP into the second strand cDNA synthesis reaction. The uracil-containing strand can then be removed enzymatically (32) (NEBNext Ultra Directional RNA Library Prep Kit for Illumina) or prevented from further amplification with a PCR polymerase that cannot recognize uracil in the template strand (Illumina TruSeq Stranded Total RNA kit). In addition, actinomycin D is frequently added to the first strand cDNA synthesis reaction to reduce spurious antisense synthesis during the first strand synthesis reaction (33).
An alternative and hybrid method utilizes random or anchored oligo-dT primers with an adapter sequence on the 5 end of the primer to initiate first strand cDNA synthesis. Next, in a procedure called template switching (shown in Figure 4B), a 3 adapter sequence is added to the cDNA molecule (17). This method has a distinct advantage in that the first strand cDNA molecule can be PCR amplified directly without second strand synthesis using the unique sequence tag put on the 3 end by the template switching reaction. A 5 unique sequence tag is also introduced by standard priming in the first strand synthesis.
The strategic design of the primers used for cDNA synthesis is a powerful strategy for making RNA-seq libraries. For example, rRNA sequences can be avoided by including strategically designed primers that target rRNA but do not allow subsequent amplification. A commercial kit (NuGEN Ovation RNA-seq; San Carlos, CA) combines SPIA nucleic acid amplification technology (34) with primers used in the first strand cDNA synthesis that are designed to suppress amplification of rRNA sequences. Another method was reported in which all possible hexamer sequences were screened against rRNA sequences to identify and eliminate perfect matches. A pool of 749 hexamers remained that was then used to prime the first strand cDNA synthesis reaction. The result was a drop in rRNA reads from 78% to 13% in the sequencing data (16). Finally, a method called DP-seq (7) was developed, in which the amplification of a majority of the mouse transcriptome was accomplished using a defined set of 44 heptamer primers. This primer sequence design selectively suppressed the amplification of highly expressed transcripts, including rRNA, and provided a reliable estimation of low abundance transcripts in a model of embryonic development.
Recently methods for preparing RNA-seq libraries from single cells have been reported (Figure 4)(35,8,9). One strategy utilizes polynucleotide tailing of the first strand cDNA (Figure 4A)(5,8), which can be combined with a template switching reaction (Figure 4B)(4,9). The end result is a first strand cDNA product that can be amplified by universal PCR primers. The version shown in Figure 4B has been incorporated into a commercially available kit (SMARTer Ultra Low RNA Kit; Clontech). An alternative approach called CEL-Seq incorporates a T7 promoter sequence at the 5 end of the cDNA, followed by linear amplification using in vitro transcription (Figure 4C)(3).
A typical cell has approximately 10 pg of total RNA and may contain only 0.1 pg of poly-adenylated RNA. Thus, these approaches all require some sort of whole-transcript amplification to generate enough material to make a sequencing library (5). The downside of such extensive amplification is the generation of significant technical noise, and this problem has yet not been solved (35).
Finally, ribosomal footprinting can reveal the pool of cellular mRNA transcripts undergoing translation at any point in time (36,37). The protocol involves treating cell lysates with RNase, leaving behind only the 30-nucleotide region protected by each ribosome. Ribosomes are then purified by sucrose density gradient centrifugation, and the co-purified mRNA fragments are extracted from the ribosomes. Another novel application of RNA sequencing is SHAPE-Seq (Selective 2-hydroxyl acylation analyzed by primer extension)(38), which is used to probe the secondary structure of RNA via acylating reagents that preferentially modify unpaired bases. When the modified RNA and an unmodified control undergo RT using specific primers, the resulting cDNA fragments can be sequenced and compared to reveal nucleotide level base pairing information.
ConsiderationsinNGSlibrary preparation: Complexity, bias, and batch effects
The main objective when preparing a sequencing library is to create as little bias as possible. Bias can be defined as the systematic distortion of data due to the experimental design. Since it is impossible to eliminate all sources of experimental bias, the best strategies are: (i) know where bias occurs and take all practical steps to minimize it and (ii) pay attention to experimental design so that the sources of bias that cannot be eliminated have a minimal impact on the final analysis.
The complexity of an NGS library can reflect the amount of bias created by a given experimental design. In terms of library complexity, the ideal is a highly complex library that reflects with high fidelity the original complexity of the source material. The technological challenge is that any amount of amplification can reduce this fidelity. Library complexity can be measured by the number or percentage of duplicate reads that are present in the sequencing data (39). Duplicate reads are generally defined as reads that are exactly identical or have the exact same start positions when aligned to a reference sequence (40). One caveat is that the frequency of duplicate reads that occur by chance (and represent truly independent sampling from the original sample source) increases with increasing depth of sequencing. Thus, it is critical to understand under what conditions duplicate read rates represent an accurate measure of library complexity.
Using duplicate read rates as a measure of library complexity works well when doing genomic DNA sequencing, because the nucleic acid sequences in the starting pool are roughly in equimolar ratios. However, RNA-seq is considerably more complex, because by definition the starting pool of sequences represents a complex mix of different numbers of mRNA transcripts reflecting the biology of differential expression. In the case of ChIP-seq the complexity is created by both the differential affinity of target proteins for specific DNA sequences (i.e., high versus low). These biologically significant differences mean that the number of sequences ending up in the final pool are not equimolar.
However, the point is the samethe goal in preparing a library is to prepare it in such a way as to maximize complexity and minimize PCR or other amplification-based clonal bias. This is a significant challenge for libraries with low input, such as with many ChIP-seq experiments or RNA/DNA samples derived from a limited number of cells. It is now technologically possible to perform genomic DNA and RNA sequencing from single cells. The key point is that the level of extensive amplification required creates bias in the form of preferential amplification of different sequences, and this bias remains a serious issue in the analysis of the resulting data. One approach to address the challenge is a method of digital sequencing that uses multiple combinations of indexed adapters to enable the differentiation of biological and PCR-derived duplicate reads in RNA-seq applications (41,42). A version of this method is now commercially available as a kit from Bioo Scientific (Austin, TX).
When preparing libraries for NGS sequencing, it is also critical to give consideration to the mitigation of batch effects (4345). It is also important to acknowledge the impact of systematic bias resulting from the molecular manipulations required to generate NGS data; for example, the bias introduced by sequence-dependent differences in adaptor ligation efficiencies in miRNA-seq library preparations. Batch effects can result from variability in day-to-day sample processing, such as reaction conditions, reagent batches, pipetting accuracy, and even different technicians. Additionally, batch effects may be observed between sequencing runs and between different lanes on an Illumina flow-cell. Mitigating batch affects can be fairly simple or quite complex. When in doubt, consulting a statistician during the experimental design process can save an enormous amount of wasted money and time.
There are many ways to minimize bias during library preparation. Within a single experiment, we aim to start with samples of similar quality and quantity. We also use master mixes of reagents whenever possible. One particularly egregious source of bias is from amplification reactions such as PCR; it is well documented that GC content has a substantial impact on PCR amplification efficiency. We recommend PCR enzymes such as Kapa HiFi (Kapa Biosystems, Wilmington, MA) or AccuPrime Taq DNA Polymerase High Fidelity (Life Technologies) that have been shown to minimize amplification bias resulting from extremes of GC content. It was recently reported that, for particularly high GC targets, a 3 min initial denaturation time with subsequent PCR melt cycles extended to 80 s can significantly reduce amplification bias (18). We use as few amplification cycles as necessary, but it is critical that every sample within an experiment is amplified the same number of cycles. In miRNA library preparation protocols, ligase enzymes have been shown to contribute a high level of sequence-dependent bias (46,47). One group found that addition of three degenerate bases to the 5 end of the 3 adapter and the 3 end of the 5 adapter significantly reduced this ligation bias (48). A miRNA library prep kit that incorporates three degenerate bases on the 5 adapter is commercially available through Gnomegen (San Diego, CA).
In addition to enzymatic steps, bias can be reduced in purification steps by pooling barcoded samples before gel or bead purification. In the case of miRNA-seq libraries, we first run the individual libraries on an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) to quantitate the miRNA peaks. We use this information to create barcoded library pools of up to 24 samples and then perform gel purification in a single lane of an agarose gel to avoid sizing variation between samples.
Sample preparation for NGS applications: Targeted and amplicon sequencing
Targeted sequencing allows investigators to study a selected set of genes or specific genomic elements; for example, CpG islands and promoter/enhancer regions (reviewed in References 49). A common application of targeted sequencing is exome sequencing and high quality kits are commercially available; SureSelect (Agilent Technologies), SeqCap (Roche NimbleGen, Madison, WI) and TruSeq Exome Enrichment Kit (Illumina). All three capture methods are based on probe hybridization to enrich sequencing libraries made from whole genome samples (51,52). Life Technologies has commercialized an alternative approach based on highly multiplexed, PCR-based AmpliSeq technology. There are options to customize all these products and investigators can design capture or PCR probes for target regions covering from thousands to millions of bases within a genome.
Hybridization capture approaches generally work well but can suffer from off-target capture and struggle to effectively capture sequences with high levels of repetition or low complexity (i.e., the Human Histocompatibility Locus region). The PCR-based AmpliSeq method is more efficient with lower amounts of DNA (53). It should also be noted that probes are based on a reference sequence, and variations that substantially deviate from the reference, as well as significant insertion/deletion mutations, are not always going to be identified.
Another targeted sequencing method, developed by Raindance (Billerica, MA) uses microdroplet PCR and custom-designed droplet libraries (54,55). The nature of micro-droplet emulsion PCR significantly decreases PCR amplification bias (56). Microdroplet PCR allows the user to set up 1.5 × 106 micro-droplet amplifications in a single tube in under an hour. The droplet libraries are designed based on 500 bp amplicons, and a single custom library can target from to 10,000 different amplicons covering up to 5 × 106 bases.
Amplicon sequencing involves making NGS libraries from PCR products. This form of targeted sequencing is more appropriate for applications such as microbiomic experiments where community composition is analyzed by surveying 16S rRNA sequences in complex bacterial mixtures (57), analysis of antibody diversity (58) and T cell receptor gene repertoires (50), and facilitating the process of identifying and selecting high value aptamers in a SELEX protocol (59). To highlight the flexibility of amplicon sequencing, a recent study used the method to analyze the incorporation of unnatural nucleotides during DNA synthesis (60).
Sequencing short amplicons also makes it possible to obtain the entire amplicon sequence in either a single read or a paired-end read design. Here, adapters can be added directly to the ends of the amplicons and sequenced to retain the haplotype information essential for reconstructing antibody or T cell receptor gene sequences, as well as for identifying species in microbiome projects.
However, it is often necessary to design longer amplicons for targeted sequencing applications. In this case, the PCR products need to be fragmented for sequencing. Amplicons can be fragmented as-is using acoustic shearing, sonication, or enzymatic digestion. Alternatively, they can first be concatenated into longer fragments by ligation and then fragmented. One problem associated with amplicon sequencing is the presence of chimeric amplicons generated during PCR by PCR-mediated recombination (61). This problem is exacerbated in low complexity libraries and by overamplification. A recent study identified up to 8% of raw sequence reads as chimeric (62). However, the authors were able to decrease the chimera rate to 1% by quality filtering the reads and applying the bioinformatic tool UCHIME (63). The presence of PCR primer sequences or other highly conserved sequences presents a technical limitation on sequencing platforms that utilize fluorescent detection (i.e., Illumina). This can occur with amplicon-based sequencing such as microbiome studies using 16S rRNA for species identification. In this situation, the PCR primer sequences at the beginning of the read generate the exact same base with each cycle of sequencing, creating problems for the signal detection hardware and software. This limitation is not an issue with Ion Torrent systems (which are not fluorescence-based) and can be addressed on Illumina systems by sequencing multiple different amplicons in the same lane whenever possible. An alternative strategy we employ is to use several PCR primers during PCR of a specific amplicon. Each primer has a different number of bases (typically 1–3 random bases) added to the 5′ end to offset/stagger the order of sequencing when adapters are ligated to the amplicons.
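The staggering strategy above can be sketched in code. The helper below is a hypothetical illustration (not part of the cited protocol): it enumerates every 1-, 2-, or 3-base 5′ extension of one primer, and pooling these variants shifts the conserved primer sequence across sequencing cycles so that each cycle sees a mix of bases.

```python
import itertools

def staggered_primers(primer, max_offset=3, bases="ACGT"):
    """Generate primer variants with 0..max_offset extra 5' bases.

    Pooling the variants offsets the conserved primer sequence across
    sequencing cycles, restoring per-cycle base diversity on platforms
    that struggle with low-complexity starts.
    """
    variants = [primer]  # offset 0: the unmodified primer
    for n in range(1, max_offset + 1):
        for pad in itertools.product(bases, repeat=n):
            variants.append("".join(pad) + primer)
    return variants
```

In practice one would synthesize only a few variants per offset length rather than all combinations; the enumeration here simply makes the offset idea concrete.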
Sample preparation for NGS applications: Mate pair sequencing and other strategies
The objective of de novo sequencing is to use algorithms to produce a novel genome assembly that can serve as a reference for future experiments. Closing contigs and scaffolds into a cohesive genome map can be a remarkably challenging task. Because of this, de novo assemblies require some of the highest quality (i.e., least biased, most representative) sequencing libraries of any NGS application.
We routinely use three library preparation strategies to maximize assembly efficiency: (i) libraries comprised of long inserts (~1 kb insert sizes), (ii) no PCR amplification during library preparation, and (iii) mate-pair libraries with long-distance spacing (5–20 kb) between reads. While it has so far proven impossible to build mate-pair libraries without PCR amplification, long insert libraries can easily be constructed without PCR if sufficient DNA is available (2). Such long insert libraries are created by careful shearing of genomic DNA. We find that the final data quality is greatly improved if the sheared ~1 kb DNA is first size-selected on a 1% agarose gel to narrow the size distribution as much as possible. This step minimizes the chance that small fragments concatenate during adapter ligation, which would produce chimeric read pairs that impede the assembly process.
Mate-pair libraries are constructed by circularization of input DNA that has been fragmented to a size of >2 kb. Typically, insert size measures between 2 and 20 kb. We developed a mate-pair protocol using Cre-Lox recombination instead of blunt end circularization (64). In this method, a biotin-labeled LoxP sequence is created at the junction site from the end ligation of two LoxP adapters. This strategy allows junctions to be identified without using a reference genome. The location of the LoxP sequence in the reads distinguishes true mate-paired reads from spurious paired-end reads using the bioinformatics tool, Deloxer (64). A similar approach improves upon this method by allowing longer insert sizes (up to 22 kb)(65). Illumina also provides a transposome-based protocol that requires only a small amount of input material (~1 μg) and allows barcoded multiplexing of up to 12 samples per lane.
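The junction-finding step behind the Deloxer approach can be illustrated with a minimal sketch. This is not the published tool, just the core idea, assuming reads are plain strings and using the canonical 34-bp LoxP sequence; the real software also handles sequencing errors, partial sites, and read orientation.

```python
LOXP = "ATAACTTCGTATAATGTATGCTATACGAAGTTAT"  # canonical 34-bp LoxP site

def split_mate_pair(read, min_arm=20):
    """Split a read at an internal LoxP junction into its two mates.

    Returns (left_arm, right_arm) if the junction is found with enough
    flanking sequence on both sides; returns None otherwise, flagging
    the read as a spurious paired-end rather than a true mate pair.
    """
    i = read.find(LOXP)
    if i == -1:
        return None
    left, right = read[:i], read[i + len(LOXP):]
    if len(left) < min_arm or len(right) < min_arm:
        return None
    return left, right
```

Because the LoxP sequence itself marks the circularization junction, this classification needs no reference genome, which is what makes the method attractive for de novo projects.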
A significantly more complicated protocol generates mate-pair reads with approximately 40 kb spacing using a unique fosmid vector design (Lucigen NxSeq 40 kb Mate-Pair Cloning Kit; Middleton, WI). The phage packaging mechanism selects for DNA fragments of ~40 kb, which are packaged into phage particles in vitro by bacteriophage Lambda packaging extract followed by transfection into Escherichia coli for replication. Experience in fosmid preparation and replication is a definite plus before taking on this protocol.
Sample preparation for NGS applications: ChIP-seq
Chromatin immunoprecipitation sequencing (ChIP-seq) is now a well-established method for evaluating the presence of histone modifications and/or transcription factors on a genome-wide scale. Histone modifications are an important part of the epigenomic landscape and are thought to help regulate the recruitment of transcription factors and other DNA modifying enzymes. The precise biological role of histone modifications is still poorly understood, but genome-wide studies using ChIP-seq are beginning to provide important insights into their patterns and purpose.
Originally developed as a low-throughput PCR-based assay, the introduction of NGS technology has allowed ChIP-seq to be efficiently applied on a genome-wide scale (Figure 5). The general principle of this assay involves immunoprecipitation of specific proteins along with their associated DNA. The procedure usually requires DNA-protein crosslinking with formaldehyde followed by fragmentation of the chromatin using micrococcal nuclease (MNase) and/or sonication. Specific antibodies are used to target the protein or histone modification of interest, at which point the DNA is purified and subjected to high throughput sequencing. The sequencing results should be compared with a proper control. Data from a successful ChIP-seq should be enriched for the sequences that were crosslinked to the targeted protein/modified histone.
There has been some discussion on the best controls for ChIP-seq. Rabbit IgG has been used as a control for non-specific antibody binding, but these antisera typically don't control well for the non-specific cross-reactivity present with the use of affinity-purified antibodies. Thus, an aliquot of the input DNA pool taken after fragmentation but before immunoprecipitation has become the more commonplace control for ChIP-seq. Additionally, input controls appear to give a better estimation of biases that result from chromatin fragmentation and sequencing (66).
ChIP-seq has a number of technical challenges that require consideration and more standardization to facilitate cross-study analysis. In particular, antibody quality is a large factor affecting the outcome of ChIP-seq experiments. The ENCODE (Encyclopedia Of DNA Elements; www.genome.gov/) and Roadmap consortia (NIH Roadmap Epigenomics Mapping Consortium) have set forth procedures for assessing antibody quality, including dot blot immunoassays against histone tail peptides to evaluate binding specificity and cross-reactivity (67). Some of the technical procedures used in ChIP-seq studies have a direct impact on downstream ChIP-seq library preparation and the resulting sequencing data (40,66,68,69). For example, the formaldehyde crosslinking typically used in ChIP-seq experiments is particularly important for studying transcription factors, but it appears to lower resolution and increase the likelihood of non-specific interactions (40). Resolution was recently addressed for DNA binding proteins with the use of lambda exonuclease to digest the 5′ ends at a fixed distance from the crosslinked protein, thus greatly reducing contaminating non-specific DNA (66). Additionally, formaldehyde crosslinking has been shown to protect DNA from micrococcal nuclease digestion, so sonication is now the preferred method of fragmentation when using ChIP-seq to assess DNA binding proteins. Conversely, micrococcal nuclease is known to digest the linker regions between nucleosomes, so it remains the preferred method of chromatin fragmentation when studying histone modifications (68). Regardless of fragmentation method, if the procedure is successful the DNA insert plus the sequencing adapters should total ~300 bp. We routinely perform bead-based purifications after sequencing adapter ligation and again after the PCR step in the library protocol in order to minimize sample losses.
One of the greatest technical issues in ChIP-seq has been the requirement for large amounts of starting material (68). Typically, 1 million to 20 million cells are required per IP in order to acquire sufficient material for sequencing. These amounts are particularly difficult to achieve for primary cells, progenitor cells, and clinical samples. This remains an area that will benefit greatly from improved sequencing library preparation methods from very small quantities of relatively short fragments of DNA. To date, most methods attempting to ameliorate the large amount of starting material required for ChIP-seq have required whole genome amplification or extensive PCR amplification. However, the recently introduced Nano-ChIP-seq method allows starting amounts down to 10,000 cells by using custom primers with hairpin structures and an internal BciVI restriction site (66,70). In another recent development, ChIP-seq for the transcription factor ERα was successfully performed with a very low cell input by using single-tube linear amplification (LinDA). This approach uses an optimized T7 RNA polymerase IVT-based protocol, which was demonstrated to be robust and reduced amplification bias due to GC content (66).
It is especially challenging to study a novel DNA binding protein or histone modification for which there are no commercial antibodies. The approach required in these cases usually entails the use of transient or stable expression of the protein of interest with a tag that can be targeted (such as a His or FLAG tag). The drawback of this approach is the need for extensive controls to ensure that the fusion protein is localized properly and that interactions are not affected by steric hindrance or non-endogenous expression levels (67).
Sample preparation for NGS applications: RIP-seq/CLIP-seq
Transcription of primary RNAs initiates a complex process involving the recognition of intron/exon junctions, splicing and alternative splicing, addition of poly(A) tails, transport to the cytoplasm, entry into ribosomes, processing of various non-coding RNAs, and the generation of signals for RNA degradation. One powerful tool for studying these events, and the proteins that control them, is RIP-seq, where protein complexes assembled at different sites on the RNA molecules are immunoprecipitated and then the RNA bound to them is purified and sequenced (Figure 6)(71).
RNA binding proteins (RBPs) recognize ribonucleic acid motifs including specific sequences, single-stranded backbones, secondary structures, and double-stranded RNA (72,73). These interactions involve all types of RNAs and occur at every step from transcription to degradation (74). Many steps in the post-transcriptional processing of messenger RNA overlap, resulting in multiple RBP complexes bound to a transcript at any given moment in its existence (75). RIP-seq can be done with protein-specific antibodies or by expressing tagged versions of the RBPs of interest. Furthermore, RIP-seq provides the ability to characterize the function of an RBP in a specific cell type and/or cell state based on the population of bound RNAs (7678).
The amount of starting total RNA needed for a successful RIP-seq experiment is significantly greater than that required for RNA-seq. First, the amount of RNA bound by any given RBP is highly variable but always only a fraction of the original pool and often a very minor fraction. Second, depending on the target RBP, a nuclear lysate may be required, necessitating an even greater amount of starting material (79). Another technical challenge is the tendency of RNA to non-specifically bind proteins. We address this limitation by preclearing the lysate with an isotype control antibody bound to beads. Non-specific DNA binding is also a challenge. DNase I treatment should be performed multiple times throughout the protocol (i.e., during lysate preparation, post-TRIzol separation, and library preparation). The duration of the IP step can vary from 2 h to overnight. Longer incubation times can increase the percentage of pulled-down protein; however, non-specific RNA binding is also increased, resulting in additional noise. RIP-purified RNA can be taken directly into standard library protocols suitable for low input, short fragment samples. We have had good success with the ScriptSeq-v2 RNA-Seq Library Preparation Kit (Epicentre) for our RIP-seq samples.
A variation of RIP-seq is crosslinking and immunoprecipitation (CLIP-seq) followed by digestion of the RNA sequences not protected by the RBP complexes. This procedure is used to identify the specific binding sites and flanking sequences of RBPs. In the original CLIP protocol, the starting material was crosslinked by exposure to UV radiation (80). Prior to immunoprecipitation, the prepared lysate is digested with RNase, limiting the RNA populations to those regions protected by the bound RBPs. Next, there is a multistep protocol to radiolabel the RBP-bound RNA, separate the samples by SDS-PAGE, visualize the RNA-protein complex by radiography, and excise the desired region (~5–30 kDa above the target RBP's molecular weight). Finally, the RBP is digested with proteinase K, linkers are ligated to the remaining RNA fragments, and a library is constructed for sequencing (81,82). Control samples are required to account for crosslinking efficiency, RNase digestion, and non-specific RNA binding (83).
Recent modifications to the CLIP-seq protocol include individual-nucleotide resolution CLIP (iCLIP)(84) and photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP)(85). In iCLIP, an adapter ligation step is replaced with an intramolecular circularization step that has increased reaction efficiency and the added ability to identify the site of crosslinking (individual nucleotide resolution)(84). In PAR-CLIP, a ribonucleoside analog (4-SU or 6-SG) is added to the media prior to UV crosslinking. The irradiation step crosslinks the ribonucleoside analog to the RBP in addition to changing the base's identity. Following the standard CLIP-seq protocol, the photoactivated crosslinked sites can be identified by locating single base mismatches or indels when compared with the whole RNA-seq data (86).
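The mismatch-based site calling in PAR-CLIP can be made concrete with a toy scan. Assuming 4-SU labeling (which produces characteristic T-to-C transitions in the sequenced cDNA) and reads already aligned to a reference as equal-length strings, the sketch below tallies candidate crosslink positions; real pipelines work on alignment files and model sequencing error.

```python
from collections import Counter

def tc_transition_sites(reference, read):
    """Return 0-based positions where a reference T reads as C.

    With 4-SU PAR-CLIP, such T->C transitions mark the crosslinked
    nucleotide in the aligned read.
    """
    return [i for i, (r, q) in enumerate(zip(reference, read))
            if r == "T" and q == "C"]

def crosslink_profile(reference, reads):
    """Tally T->C transitions per position across many aligned reads;
    positions with recurrent transitions are candidate contact sites."""
    hits = Counter()
    for read in reads:
        hits.update(tc_transition_sites(reference, read))
    return hits
```

Aggregating across reads is the key step: a sequencing error produces an isolated mismatch, whereas a genuine crosslink site accumulates transitions at the same position.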
Sample preparation for NGS applications: Methylseq
A fundamental mechanism of the epigenetic regulation of gene activity is DNA methylation, which is rapidly being recognized as a critical feature of disease states in which simple genetic inheritance cannot explain the complexity of the phenotypes encountered in clinical medicine. In principle, DNA methylation changes also reflect the history of the organism, not just its genetic inheritance.
Methylation of the 5 position of cytosine (5mC) is the most common form of DNA methylation, with 60%–80% of the 28 million CpG dinucleotides in the human genome being methylated (87,88). While genome-wide hypomethylation has been linked to increased rates of mutation and chromosomal instability, hypermethylation of promoters inhibits gene transcription (89). DNA methylation is also essential for genetic imprinting, suppression of transposable elements, and X chromosome inactivation (90). Aberrant DNA methylation is associated with many diseases including cancer, autoimmune diseases, inflammatory diseases, and metabolic disorders (91–94).
Early studies were limited to investigating DNA methylation in a few genes at a time or to generating a non-specific but global estimate of methylation. Recent advances in high throughput sequencing have dramatically increased both the throughput and resolution of such studies. There are three major methods for studying DNA methylation with NGS platforms: (i) restriction enzyme (RE) based, (ii) targeted enrichment, and (iii) bisulfite sequencing (Figure 7). Each of these methods has advantages and disadvantages that must be weighed according to the researcher's needs and budget.
Methylation-sensitive restriction enzyme sequencing (MRE-seq) relies on restriction enzymes that are sensitive to CpG methylation (Figure 7A)(95,96). The most commonly used REs are the methylation-sensitive HpaII and its methylation-insensitive isoschizomer MspI (97). A method called HELP-seq (HpaII tiny fragment enrichment by ligation-mediated PCR) utilizes both of these enzymes to analyze genome-wide methylation profiles (98). A sample is digested with each enzyme, and the resulting fragments are sequenced separately. The MspI-digested reference sample not only provides a point of comparison for methylation but also controls for misinterpretation of HpaII failing to cut due to single nucleotide polymorphisms (SNPs)(97). Other RE-based methods, such as methyl-sensitive cut counting (MSCC), methylation-specific digital sequencing (MSDS), and modified methylation-sensitive digital karyotyping (MMSDK), rely on other methylation-sensitive REs (97). RE-based methods are limited in scope by the fixed number of digestion sites in the genome, which skews the view of CpG methylation toward these particular sites, and their accuracy depends on complete, high-fidelity digestion (67).
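The HpaII/MspI comparison described above reduces to simple count arithmetic per cut site. The sketch below is a hedged illustration, not the published HELP-seq pipeline: site identifiers and the count dictionaries are hypothetical, and real analyses normalize for library size and fragment length before comparing.

```python
def methylation_score(hpaII_counts, mspI_counts):
    """Per-site methylation estimate from paired digests.

    HpaII cuts only unmethylated CCGG; MspI cuts regardless of
    methylation. A site well represented in the MspI library but
    depleted in the HpaII library was therefore protected, i.e.,
    methylated. Score = 1 - (HpaII reads / MspI reads), floored at 0.
    """
    scores = {}
    for site, msp in mspI_counts.items():
        if msp == 0:
            continue  # no MspI signal: likely SNP or dropout, uninterpretable
        hpa = hpaII_counts.get(site, 0)
        scores[site] = max(0.0, 1.0 - hpa / msp)
    return scores
```

Skipping sites with no MspI signal is the in-code analog of the SNP control described in the text: without the methylation-insensitive reference, an uncut HpaII site cannot be attributed to methylation.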
Affinity enrichment of methylated DNA requires either antibodies specific for methylated DNA (MeDIP) or other proteins capable of binding methylated DNA (MBD-seq) (Figure 7B)(95,97,98). Specifically, the methyl binding domain (MBD)-containing proteins MeCP2, MBD1, MBD2, and their binding partner MBD3L1 have been used to immunoprecipitate methylated DNA (98). While such immunoprecipitation methods are not limited by sequence specificity, they tend to preferentially pull down regions that are heavily methylated and miss genomic areas with sparse methylation. Moreover, sequencing of the recovered material indicates which regions are methylated, but does not reveal which individual bases are methylated.
Treatment of DNA with sodium bisulfite results in the chemical conversion of unmethylated cytosine to uracil, while methylated cytosines are protected (Figure 7C)(99). Bisulfite conversion coupled with shotgun sequencing was first performed in Arabidopsis thaliana by two research groups, who coined the methods BS-seq (100) and MethylC-seq (101). MethylC-seq was also used to create the first single-base resolution map of DNA methylation in human (87). While BS-seq/MethylC-seq is widely considered the gold standard in methylome analysis, it requires significant read depth (30× coverage)(67). It remains expensive and not easily applied to the large sample sizes needed for clinical investigations. Recently, it was shown that only ~20% of CpGs are differentially methylated across 30 human cell and tissue types, suggesting that ~80% of the CpG methylation data from whole-genome sequencing is not informative (88). To reduce the cost and data complexity associated with whole-genome bisulfite sequencing, recent methods have sought to couple enrichment methods with bisulfite sequencing. The capture and targeted sequencing of specific genomic regions enriched for CpG methylation sites, such as islands, shores, gene promoters, and differentially methylated regions (DMRs), can be accomplished using a commercially available kit from Agilent Technologies (SureSelectXT Methyl-Seq Target Enrichment). Alternatively, bisulfite conversion of DNA isolated by MeDIP or MBD pull-downs allows these methods to achieve single-base resolution. Sequence-specific binding to beads (51) followed by bisulfite treatment, or binding of bisulfite-converted DNA to bisulfite padlock probes (BSPPs)(102), has also been demonstrated to be an effective method for enriching potentially methylated regions. Our group developed a method for targeted bisulfite sequencing using microdroplet PCR with custom-designed droplet libraries (55).
This technique relies on the unbiased amplification of bisulfite treated DNA with region-specific primers. All of these enrichment methods retain the single base pair resolution that is so advantageous for bisulfite sequencing while vastly reducing the amount of sequencing required. However, it is important to note that bisulfite treatment of DNA leads to DNA instability and loss of product; thus, many of these methods require more input DNA than the non-bisulfite conversion-based methods.
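The conversion chemistry underlying all of these bisulfite methods can be simulated in a few lines. The helpers below are hypothetical illustrations (top strand only, perfect conversion assumed): one mimics bisulfite treatment followed by PCR, where unmethylated C is read as T, and the other makes the resulting methylation call.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment of the top strand.

    Unmethylated C -> U (read as T after PCR amplification); methylated
    C is protected. `methylated_positions` are 0-based 5mC indices.
    """
    return "".join(
        base if base != "C" or i in methylated_positions else "T"
        for i, base in enumerate(seq)
    )

def call_methylation(original, converted):
    """A C that survives conversion was methylated; a C read as T was not.
    Returns {position: was_methylated} for every C in the original."""
    return {i: converted[i] == "C"
            for i, b in enumerate(original) if b == "C"}
```

This also shows why bisulfite data need dedicated aligners: after conversion the read no longer matches the reference at unmethylated C positions, so naive alignment would miscall them as C-to-T variants.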
The recent discovery that 5-hydroxymethylcytosine (5hmC)(103) is an intermediate in the demethylation of 5mC to cytosine has opened a whole new area of study into the mechanics of DNA methylation and epigenetic regulation. Studies revealed that the Ten-Eleven Translocation (TET) family of proteins facilitates demethylation of 5mC to cytosine through three intermediates: 5hmC, 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC). Bisulfite treatment converts 5fC and 5caC to uracil, but cannot convert 5mC or 5hmC. Thus, bisulfite sequencing cannot distinguish between 5mC and 5hmC (67). In order to detect these novel methylation intermediates, new techniques have been developed. The first efforts involved either antibodies specific for 5hmC (hMeDIP-seq) or chemical modification of 5hmC (67). More recent advances toward single-base resolution sequencing of 5hmC are oxidative bisulfite sequencing (oxBS-seq)(104) and TET-assisted bisulfite sequencing (TAB-seq)(105). Single-molecule real-time (SMRT) DNA sequencing (Pacific Biosciences, Menlo Park, CA) has been introduced as another method to sequence 5hmC (106). SMRT sequencing relies on the kinetics of polymerase incorporation of individual nucleotides, allowing for direct detection of these modified cytosines (106). Most recently, antibody-based immunoprecipitation methods (107,108) and chemical modification methods have been developed to allow for sequencing of 5fC (109).
The tremendous and rapid evolution of NGS technologies and protocols has generated both amazing opportunities for science and significant challenges. We believe that the transformational power of deep sequencing has already been clearly demonstrated in basic science. It is poised to advance into clinical medicine, creating a new generation of molecular diagnostics based on DNA sequencing, RNA sequencing, and epigenetics.
Acknowledgments
This research was supported by NIH grants: R24 GM (SRH), U19 A (DRS, SRH, PO, HKK), U01 GM (DRS, TW), U01 AI (DRS, SRH) and U54 AI (SALM), and by a JDRF postdoctoral fellowship (HKK). This paper is subject to the NIH Public Access Policy.
Footnotes
Competing Interests
The authors DRS, SRH, and PO are founding scientists and consultants to Transplant Genomics Inc.
Author contributions
SRH, PO, and DRS wrote and edited the paper. HKK contributed to the Methylseq section, SALM contributed to the ChIP-seq section, TW contributed to the RIP-seq section, and FVN contributed to the sections on Nextera and mate-pair library sections.
Preparation of DNA Sequencing Libraries for Illumina ...
Preparation of libraries for DNA sequencing for Illumina systems involves multiple steps. In a general workflow, purified DNA is fragmented, end-repaired, and A-tailed; adapters are ligated to the DNA fragments; libraries are amplified if necessary; and the prepared libraries are cleaned, quantitated, and normalized before loading onto a flow cell (Figure 1). Since library preparation plays a critical role in obtaining high-quality data [1], researchers should understand the underlying principles and considerations for the key steps in the workflow.
Figure 1. DNA sequencing library preparation.
1. DNA sequencing methods
Common DNA sequencing methods include whole-genome sequencing, de novo sequencing, targeted sequencing, and exome sequencing (discussed below) (Figure 2). DNA may also be sequenced for epigenetic studies, e.g., methylation analysis (also known as bisulfite sequencing or Bis-Seq) and DNA-protein interaction sequencing (commonly known as ChIP-Seq), which are not covered in this section. The method of choice depends on the research goals and biological questions to address [2-4].
Figure 2. Common DNA sequencing methods. Exome and gene panel sequencing are considered targeted methods, since they only include subsets of the whole genome. Some gene panels may include promoter sequences.
a. Whole-genome sequencing
Whole-genome sequencing, or WGS, is performed to sequence the entire genome of an organism using the total genomic DNA. WGS data from a sample are then compared to a reference sample or control (for instance, cancer cells compared with normal cells) to identify small and large genetic variations. Examples of these genetic variations include single nucleotide polymorphisms (SNPs); single nucleotide variations (SNVs); nucleotide insertions and deletions (indels); structural rearrangements such as inversions, duplications, and translocations; and copy number variations (CNVs) (Figure 3).
Figure 3. Common genetic variations.
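At its core, detecting the simplest of these variations (SNPs/SNVs) amounts to comparing aligned sample bases against the reference. The sketch below is deliberately naive, assuming perfectly aligned, equal-length strings with `-` marking a deletion in the sample; production variant callers additionally model read depth, base quality, sequencing error, and ploidy.

```python
def call_snvs(reference, sample):
    """Report positions where an aligned sample base differs from the
    reference (candidate SNVs). Positions deleted in the sample ('-')
    are skipped here; indel calling needs gapped alignment logic."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample))
            if r != s and s != "-"]
```

For example, comparing `ACGTAC` against `ACGAAC` flags position 3 as a T-to-A substitution.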
WGS is useful for uncovering genetic mutations in an unbiased and detailed manner. However, it requires a large amount of sample input and involves extensive data processing, especially when analyzing the human genome, which is large and complex.
b. De novo sequencing
When genomic data for a particular organism are either unavailable or of insufficient quality, de novo sequencing (meaning "from the beginning") is a method of building or updating the reference genome. Although a whole-genome sample may be used in sequencing, the lack of a reference sequence necessitates assembling overlapping short sequencing reads into longer contiguous sequences (contigs) (Figure 4A) using computational tools. The main goal is to generate an overall physical map that represents the whole genome without (large) gaps.
De novo sequencing usually relies on a hybrid approach for assembling the genome: reads from long-insert paired-end sequencing, referred to as mate-pair sequencing (with higher error rate), are used to build a scaffold, and reads from short-insert paired-end sequencing (with lower error rate) are used to fill in and improve the quality of a new genome map [5] (Figure 4B).
Figure 4. De novo sequencing and assembly. (A) Alignment of contiguous sequences. (B) Assembly of short-insert and long-insert paired-end reads into a reference genome.
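The contig-building step in Figure 4A can be illustrated with a toy greedy overlap merge. This is only a sketch of the idea: real assemblers use overlap/de Bruijn graphs, error correction, and mate-pair information for scaffolding rather than exact suffix-prefix matching.

```python
def merge_overlap(a, b, min_overlap=3):
    """Merge two reads if a suffix of `a` exactly matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the first overlapping pair until no merge remains,
    yielding contigs; a toy overlap-layout-consensus assembler."""
    contigs = list(reads)
    merged = True
    while merged:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i == j:
                    continue
                m = merge_overlap(contigs[i], contigs[j], min_overlap)
                if m is not None:
                    contigs = [c for k, c in enumerate(contigs)
                               if k not in (i, j)]
                    contigs.append(m)
                    merged = True
                    break
            if merged:
                break
    return contigs
```

Three overlapping reads such as `ATTAGAC`, `GACCTGA`, and `CTGACCA` collapse into a single contig; repeats longer than the overlap length are exactly where such greedy merging fails, which is why long-insert mate pairs are needed to order and orient contigs into scaffolds.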
c. Targeted sequencing
Targeted sequencing (instead of WGS) is used when the goal of the experiment is to sequence specific genes, sets of related genes, or targeted regions of a genome. An example of targeted sequencing is screening for known cancer genes in different types of cancer cells. Therefore, targeted sequencing is hypothesis-driven and requires knowledge of the sequence of the reference genes or genomic regions. Since targeted sequencing does not require analysis of the whole genome (e.g., 3.2 × 10⁹ base pairs for human), it allows more reads, better coverage, and higher depth, and therefore improved detection of rare variants at a lower cost than WGS.
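The depth advantage follows from simple arithmetic: mean depth is roughly the number of usable sequenced bases divided by the target size. The sketch below makes that concrete; the run sizes and the 70% on-target fraction are illustrative assumptions, not figures from the text.

```python
def mean_depth(n_reads, read_len, target_bases, on_target=1.0):
    """Approximate mean coverage depth = usable sequenced bases / target size."""
    return n_reads * read_len * on_target / target_bases

# The same hypothetical run (4e8 reads of 150 bp) spread over the whole
# human genome versus concentrated on a 1 Mb panel (70% on-target):
wgs_depth = mean_depth(4e8, 150, 3.2e9)          # ~19x across the genome
panel_depth = mean_depth(4e8, 150, 1.0e6, 0.7)   # ~42,000x on the panel
```

In practice one solves the same equation the other way, choosing how many reads to buy for a desired depth, which is why small panels detect rare variants so much more cheaply than WGS.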
To perform targeted sequencing, samples are enriched for the sequences of interest. Among methods available for enrichment of target sequences, the two most common approaches are hybrid capture and PCR amplification [6].
- The hybrid capture strategy utilizes a set of oligonucleotide probes that are complementary to the target sequences. Probes are usually coupled to magnetic or biotinylated beads so that target sequences hybridized to the probes can be selected from the mixture. After removing the unbound sequences, target sequences are released from the probes and prepared as a sequencing library (Figure 5). Target enrichment by hybrid capture usually requires higher sample input and a longer workflow, but it may yield more uniform coverage and higher data quality over the PCR enrichment method.
Figure 5. Target enrichment by hybrid capture. Blue = desired sequences, red = magnetic beadbound probes.
- Target enrichment by PCR, also known as amplicon sequencing, relies on highly multiplexed PCR to amplify DNA sequences corresponding to target regions. As many as 24,000 primer pairs, each pair designed to amplify a specific region, may be used to capture hundreds of sequences in one PCR run (Figure 6). PCR enrichment works with limited sample input and offers a faster workflow. However, the quality and coverage of the data obtained may be affected by primer design, PCR efficiency, amplification bias, etc.
Figure 6. Target enrichment by PCR amplification.
d. Exome sequencing
Exome sequencing is a special type of targeted method to sequence the protein-coding regions of the genome, called the exome [7]. While making up only about 1–2% of the human genome, the exome harbors approximately 85% of known disease-causing mutations. Therefore, whole-exome sequencing (WES) enables researchers to focus on identifying genetic mutations and variations that are significantly implicated in diseases.
2. DNA fragmentation strategies
The first step in NGS library preparation for Illumina systems is fragmentation of DNA into the desired size range, typically 300–600 bp depending on the application. Traditionally, two methods have been employed for DNA fragmentation: mechanical shearing and enzymatic digestion. Typically, 1–5 μg of input DNA is required for fragmentation, but often less is needed for enzymatic fragmentation approaches.
Between the two methods, mechanical shearing is more widely used because of its unbiased fragmentation and ability to obtain more consistent fragment sizes (Figure 7). On the other hand, enzymatic digestion requires lower DNA input and offers a more streamlined library preparation workflow.
Figure 7. Comparison of percentage of each base at each position in sequencing of samples prepared by mechanical shearing vs. enzymatic digestion. Mechanical shearing shows very little bias in base representation at the beginning of reads, but enzymatic digestion shows some base imbalance at this stage.
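The comparison in Figure 7 boils down to computing the fraction of each base at each read position. A minimal sketch, assuming reads are plain uppercase strings:

```python
from collections import Counter

def base_fractions(reads, position):
    """Fraction of A/C/G/T observed at one read position.

    Plotting these fractions across positions exposes fragmentation
    bias: mechanical shearing should give roughly flat lines, while
    enzymatic digestion can skew the first positions toward the
    enzymes' preferred cut-site motifs.
    """
    counts = Counter(read[position] for read in reads if len(read) > position)
    total = sum(counts.values())
    return {b: counts.get(b, 0) / total for b in "ACGT"}
```

Quality-control tools report exactly this per-position composition, so a skew confined to the first few cycles is a quick diagnostic for enzymatic fragmentation bias.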
a. Mechanical shearing
Mechanical shearing involves breakage of phosphodiester linkages of DNA molecules by applying shear force. Widely used methods include high-power unfocused sonication, nebulization, and focused high-frequency acoustic shearing.
- Sonication is the simplest method among the three and uses a sonicator (probe- or waterbath-based) to emit low-frequency acoustic waves for shearing. Although probe-based sonication delivers more focused energy towards the sample, the samples are in an open container, directly in contact with the probe, and thus are at a high risk of contamination. Waterbath-based sonication, on the other hand, keeps the samples within a closed system but usually requires higher energy due to energy dissipation/dispersion and low output. In either approach, optimization is needed to obtain the desired fragment lengths (Figure 8). Resting/cooling periods between sonication cycles should be incorporated to keep the samples from overheating, which necessitates a longer workflow and wait time.
Figure 8. Dependence of average fragment length distribution on number of sonication cycles (1 sonication cycle = 30 sec).
- Nebulization creates shear force with compressed gas, forcing a nucleic acid solution through a small hole in a nebulizer. The aerosolized sample with fragmented DNA is then collected. The level of fragmentation can be controlled by the compressed gas pressure and can also be affected by the solution's viscosity and temperature. This method requires a large sample input and often results in high sample loss (low recovery).
- The focused acoustic method (developed by Covaris) uses high-frequency ultrasonic waves to shear DNA. High-frequency waves concentrate high energy on the sample within a small enclosed tube while minimizing heat generation. It has become a preferred method of mechanical shearing among NGS users because of its advantages over traditional sonication and nebulization, such as minimal sample loss, low risk of contamination, and better control over uniform fragmentation. However, the special equipment needed and the associated cost often limit its usage.
b. Enzymatic digestion
Enzymatic digestion is an effective alternative to the mechanical shearing methods. Endonucleases and nicking enzymes are usually employed to cleave both strands of DNA or nick individual strands to generate double-stranded breakage. To avoid sequence bias, enzymes with less cleavage specificity and/or cocktails of enzymes are used for fragmentation. The enzymatic digestion approach typically requires lower DNA input than mechanical shearing and thus is a method of choice when sample material is limited. In addition, enzymatic digestion and the subsequent library preparation steps can be performed in the same tube, thus enabling automation, streamlining the workflow, minimizing sample loss, reducing contamination risks, and decreasing hands-on time.
c. Transposon-based fragmentation
Some users may follow transposon-based library preparation as an alternative to mechanical shearing and enzymatic digestion (Figure 9) [8]. Using transposons, this approach fragments DNA templates and simultaneously tags them with transposon sequences, generating blunt DNA fragments with transposed sequences at both ends. Adapters (and indexes) are added via adapter-addition PCR. Therefore, some steps of the conventional workflow, such as traditional DNA fragmentation, end conversion, and adapter ligation, are circumvented when following this approach.
Figure 9. Fragmentation and tagging by transposons.
3. End repair and adapter ligation
a. End repair
Following the fragmentation step, DNA samples are subjected to end repair (also called end conversion). DNA fragments produced by mechanical shearing or enzymatic digestion have a mix of 5′ and 3′ protruding ends that need repair or conversion for ligation with the adapters. The following are key steps in the process to blunt, phosphorylate, and adenylate the termini (Figure 10) [2].
- 5′ overhangs are filled in by the 5′→3′ polymerase activity of an enzyme such as T4 DNA polymerase or Klenow fragment
- 3′ overhangs are removed by the 3′→5′ exonuclease activity of an enzyme such as T4 DNA polymerase
- 5′ ends of the blunted DNA fragments are phosphorylated (for efficient subsequent ligation) by an enzyme such as T4 polynucleotide kinase
- 3′ ends of the blunted DNA fragments are adenylated (A tailing), which is required for TA ligation with Illumina adapters, by an enzyme such as Klenow fragment (exo–) or Taq DNA polymerase
Figure 10. End conversion process.
The end conversion process involves a number of enzymatic steps, but some commercially available kits are designed to run all these reactions in a single tube, saving time and reducing sample loss.
b. Adapter ligation
Adapters are a pair of annealed oligonucleotides that facilitate clonal amplification and sequencing reactions. Identical duplex adapters are ligated to both ends of the library fragments so that oligos on the flow cell can recognize them for sequencing. In library preparation, a stoichiometric excess of adapters relative to sample DNA is used to help drive the ligation reaction to completion. Ligation efficiency is critical for conversion of DNA fragments into sequenceable molecules and thus impacts conversion rate and yield of the libraries. Because library fragments are flanked by adapters, they are sometimes called inserts.
During formation of the adapter duplexes, two strands of oligos called P5 and P7 are annealed. The P5 and P7 adapters are named after their sites of binding to the flow cell oligos. The adapters are noncomplementary at their ends to prevent their self-ligation and thus form a Y shape after annealing. This Y shape is no longer maintained if library amplification is subsequently performed (Figure 11).
Figure 11. Adapter ligation.
Looking more closely, the library adapters are usually 50–60 nucleotides long and often consist of the features described below (Figure 12) [9-10].
Figure 12. Sequencing adapters. (* = phosphorothioate linkage)
- Sites of binding to P5 or P7 oligos on the flow cells and to the sequencing primers
- Index sequences composed of specific 6–8 nucleotides to distinguish one sample from another. Index sequences enable multiplexing, a process of sequencing multiple libraries in one flow cell, and dual-indexed libraries are commonly employed for multiplex sequencing (Figure 13).
- Additional T on the 3′ end of the P5 adapter to prevent formation of adapter dimers and facilitate ligation with the 3′ A of library fragments (similar to TA cloning). Since a missing 3′ T would lead to adapter dimer formation, the more stable phosphorothioate linkage (instead of phosphodiester) is usually used to attach the 3′ T to the adapters.
- Phosphate on the 5′ end of the P7 adapter for ligation with the 3′ end of library fragments
For PCR-amplified libraries and RNA-Seq libraries, unique molecular identifiers (UMIs) may be included to enable tracking of every library fragment and monitoring of deviations introduced during library amplification [11].
Figure 13. Multiplex sequencing with pooled libraries. (Solid and striated red and green bars = different index sequences)
c. Index hopping and unique dual indexes
Index hopping is a phenomenon associated with multiplexing or pooling of library samples. When two or more libraries are sequenced together in the same flow cell, one of the indexes assigned to one library may become swapped with that of another library (Figure 14). Index hopping has always affected multiplex libraries (e.g., from cross-contamination of indexes) but has become more prominent when sequencing is performed on patterned flow cells with exclusion amplification chemistry [12]. Index hopping has serious implications for subsequent data analysis, such as incorrect assignment of sequencing data from one sample (library) to another.
Figure 14. Index hopping. (* = mutation of interest from Library 1)
Two main strategies have been employed to minimize the effect of index hopping during sequencing.
- Using unique dual indexes (UDIs) instead of combinatorial dual indexes (CDIs) (Figure 15) [13-14]. Assigning a set of UDIs to each library in the sequencing pool helps ensure that each index 1 and index 2 sequence is used only once during sample pooling prior to loading of the sequencer.
- Minimizing the amount of free, unligated adapter in the samples. Removal of unligated adapters from the libraries helps minimize index hopping. Possibly for that reason, PCR-free libraries are reported to be more susceptible to index hopping than PCR-amplified libraries [12], because fewer cleanup steps are usually performed to remove unligated adapters. The amount of unligated adapters can be measured by microfluidics-based electrophoresis.
Figure 15. Combinatorial dual indexes (CDI) vs. unique dual indexes (UDI).
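The benefit of UDIs can be enforced at the demultiplexing step: a read is assigned to a library only when both of its indexes match one expected pair, so hopped (mixed) combinations are rejected. A simplified sketch, using hypothetical index sequences and library names:

```python
# Hypothetical UDI sample sheet: each library owns a unique (index 1, index 2) pair.
UDI_PAIRS = {
    ("ATCACGAC", "AGGTTATA"): "library_1",
    ("CGATGTTT", "GAACCGCG"): "library_2",
}

def demultiplex(index1, index2):
    """Assign a read to a library only if BOTH indexes match one UDI pair.

    A hopped read carries a mixed combination (e.g., index 1 from library_1
    with index 2 from library_2) and falls through to 'undetermined'."""
    return UDI_PAIRS.get((index1, index2), "undetermined")
```

With combinatorial dual indexes, by contrast, a hopped combination can coincide with a valid library assignment and cannot be filtered this way.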
4. Library amplification considerations
Depending on the need for amplification, DNA library preparation methods can be categorized as PCR-free or PCR-based. In either case, care should be taken to follow protocols that yield diverse libraries representative of the input samples, across a range of input amounts, to help generate high-quality data.
a. PCR-free libraries
Since PCR amplification can contribute to GC bias, PCR-free library preparation is usually the preferred method to create libraries covering high-GC or high-AT sequences, to help ensure library diversity [1,15]. Note that even with PCR-free library preparation methods, bias can be introduced during cluster generation and from the chemistry of the sequencing step itself.
Compared to PCR-based methods, PCR-free libraries require higher input amounts of starting material (although improvements have been made in lowering the input requirements). This can be challenging in scenarios such as using limited or precious samples and highly degraded nucleic acids. With PCR-free libraries, accurate assessment of library quality and quantity may be difficult, compared to PCR-amplified libraries [16].
Nevertheless, better representation and balanced coverage offered by PCR-free libraries make them attractive for the following applications:
- Studies of population-scale genomics and molecular basis of a disease
- Investigation of promoters and regulatory regions in the genome, which often are high in GC or AT content
- Whole-genome sequencing analysis and variant calling for single-nucleotide polymorphisms (SNPs) and small insertions or deletions (indels)
b. PCR-based libraries
The PCR-based method is a popular strategy for constructing NGS libraries, since it allows lower sample input and selective amplification of inserts with adapters at both ends. However, PCR can introduce GC bias, leading to challenges in data analysis. For example, GC bias may hinder de novo genome assembly and single-nucleotide polymorphism (SNP) discovery.
A number of factors can impact GC bias, and the following factors should be considered to achieve balanced library coverage [17]:
- PCR enzyme and master mix used (Figure 16)
- Number of PCR cycles run, and cycling conditions
- PCR additives or enhancers in the reaction
Figure 16. Varying levels of GC bias in libraries amplified with different PCR enzyme master mixes.
With a given PCR enzyme or master mix, an increase in the number of PCR cycles usually increases GC bias. Therefore, a general recommendation is to run the minimum number of cycles (e.g., 4–8) that generates sufficient library yields for sequencing.
Decreasing the number of PCR cycles also reduces PCR duplicates and improves library complexity. PCR duplicates are defined as sequencing reads resulting from two or more PCR amplicons of the same DNA molecule. Although bioinformatic tools are available to identify and remove PCR duplicates during data analysis [18], minimizing PCR duplicates is important for efficient use of the flow cell in sequencing.
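Positional duplicate marking, as performed by tools such as Picard MarkDuplicates or samtools markdup, boils down to flagging reads that share the same mapping coordinates. A simplified sketch (ignoring mate pairs, soft clipping, and quality-based selection of which copy to keep):

```python
def mark_duplicates(alignments):
    """Flag reads sharing the same (chromosome, position, strand) as
    presumed PCR duplicates, keeping the first occurrence unflagged.

    `alignments` is an iterable of (read_id, chrom, pos, strand) tuples.
    Returns a list of (read_id, is_duplicate) pairs."""
    seen = set()
    flagged = []
    for read_id, chrom, pos, strand in alignments:
        key = (chrom, pos, strand)
        flagged.append((read_id, key in seen))
        seen.add(key)
    return flagged
```

When UMIs are used, the UMI sequence is typically added to the key, so that two reads at the same position with different UMIs are correctly kept as independent molecules.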
Other PCR artifacts can also result in reduced library quality and complexity. These artifacts include amplification bias (due to PCR stochasticity), nucleotide errors (from limited enzyme fidelity), and PCR chimeras (due to the enzyme's template switching) (Figure 17) [19].
Figure 17. Common PCR artifacts.
5. Size selection and cleanup
An important step in NGS library preparation is size selection and/or cleanup. Depending on the library preparation protocol, it may be performed following fragmentation, adapter ligation, or PCR amplification. As its name implies, the process selects the desired fragment size range, while removing unwanted components such as excess adapters, adapter dimers, and primers.
a. Importance of size selection and cleanup
In NGS libraries, uniformity of fragment sizes is critical to enable maximum data output and reliable data analysis because there are limitations to sequencing read length as dictated by NGS applications. If DNA inserts are much longer than recommended, some portions of the inserts remain unsequenced. On the other hand, inserts shorter than recommended result in suboptimal use of sequencing reagents and resources. A mix of short and long inserts could lower sequencing efficiency and pose challenges in data analysis.
Removal of unligated adapters and adapter dimers (two adapters ligated to each other) is crucial to improve data output and quality. Excess adapters often compete with library fragments in binding to the flow cell, lowering data output. Even worse, adapter dimers can also clonally amplify and generate sequencing noise, which must be filtered out during the data analysis. With the introduction of patterned flow cells, excess unligated adapters make the libraries more prone to index hopping during sequencing [12].
b. Methods for size selection and cleanup
Among methods used for size selection, agarose gel-based and magnetic bead-based selection are two of the most popular. Sample amounts, sample throughput, protocol time, and size range of the libraries may determine the suitability of either method [20].
Size selection from agarose gels is essentially a gel purification process in which DNA fragments separated through the gel according to size are collected (Figure 18). In addition to being simple and effective, the method allows flexibility in gel percentages for separation and collection of fragments in a narrow range. However, it requires large amounts of sample and a long processing time, although specialized gels are available to simplify the process [21-22].
Figure 18. Size selection by agarose gel.
Size selection by magnetic beads is widely used in NGS library preparation. This method relies on binding and unbinding of DNA fragments of different lengths to the magnetic beads, which is controlled by the ratio of beads to DNA and by buffer composition (Figure 19) [20,23]. Suitability with low sample amounts, high recovery of DNA, ability to automate, and flexibility to select the desired fragment size range make this method attractive to NGS users. Nevertheless, the method may not be suitable for separating fragments that are very close in size.
Figure 19. Size selection by magnetic beads. (A) Size distribution of library fragments with respect to their size cutoff. Above the graph is a description of the basic principle of a two-sided size selection protocol. (B) Schematic of two-sided size selection workflow.
6. Library quantification approaches
Before NGS libraries are loaded onto the sequencer, they should be quantified and normalized so that each library is sequenced to the desired depth with the required number of reads. Concentrations of prepared NGS libraries can vary widely because of differences in the amount and quality of nucleic acid input, as well as the target enrichment method that may be used. While underclustering due to overestimated library concentrations can result in diminished data output, overclustering can result in low quality scores and problematic downstream analysis (Figure 20).
Figure 20. Library clustering on a flow cell.
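Normalization to an equimolar pool is a straightforward dilution calculation. The sketch below (library names and concentrations are illustrative) computes the volume of each quantified library to combine for a pool of a given final concentration and volume:

```python
def equimolar_pool(conc_nM, final_conc_nM, final_volume_uL):
    """Volume (µL) of each library needed for an equimolar pool.

    `conc_nM` maps library name -> measured concentration in nM.
    Each of the n libraries contributes final_conc_nM / n to the pool, so
    V_i = (final_conc_nM / n) * final_volume_uL / conc_i."""
    per_library_nM = final_conc_nM / len(conc_nM)
    return {lib: per_library_nM * final_volume_uL / c
            for lib, c in conc_nM.items()}
```

For instance, pooling a 10 nM and a 20 nM library to 100 µL at 4 nM total calls for 20 µL and 10 µL respectively, with the remainder made up with buffer.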
a. Microfluidics-based quantitation
Microfluidic electrophoresis separates fragments in NGS libraries based on size and can estimate the quantity of different size ranges using a reference standard (Figure 21). More commonly, however, the results of fragment analysis obtained by this method are used in conjunction with the two other methods listed below for more accurate quantitation of NGS libraries.
Figure 21. Fragment analysis of libraries by microfluidics-based electrophoresis.
b. Fluorometry-based quantitation
The fluorometric assay uses fluorescent dyes that bind specifically to double-stranded DNA (dsDNA) to determine library concentration [24]. After a short incubation of samples with a dye, the samples are read in a fluorometer, and library concentrations are calculated by (built-in) analysis software. Although the workflow is simple and takes only a few minutes per sample, this method may not scale well above 20–30 samples because samples are often read one at a time. Nevertheless, flexible input volumes and short incubation times allow for quick and easy testing of prepared libraries for concentrations. Since the measured concentration is for total dsDNA, the average size distribution of the libraries should be taken into account for accurate quantitation.
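The size adjustment mentioned above is the standard mass-to-molarity conversion for dsDNA, assuming an average molecular weight of about 660 g/mol per base pair:

```python
def library_molarity_nM(conc_ng_per_uL, avg_size_bp, mw_per_bp=660.0):
    """Convert a fluorometric dsDNA concentration (ng/µL) to molarity (nM).

    nM = (ng/µL × 1e6) / (average fragment size in bp × ~660 g/mol per bp)"""
    return conc_ng_per_uL * 1e6 / (avg_size_bp * mw_per_bp)
```

For example, a library measured at 2.0 ng/µL with a 400 bp average fragment size (taken from microfluidics-based fragment analysis) corresponds to roughly 7.6 nM.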
c. qPCR-based quantitation
The qPCR-based assay quantifies NGS libraries by amplifying DNA fragments with the P5 and P7 adapters (Figure 22) [25]. A qPCR standard curve is used to determine a broad range of library concentrations, even as low as femtomolar. Since the PCR primers are designed specifically to bind to the adapter sequences, the qPCR assays detect only properly adapted, amplifiable libraries that can form clusters during sequencing. Note, though, that qPCR can also amplify adapter dimers; therefore, melting curve analysis and/or fragment size analysis should be performed to assess specificity and accuracy of quantitation by qPCR. The final library concentration is calculated based on the following formula.
Figure 22. Schematic of primer binding in library quantitation by qPCR.
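This calculation generally takes a size-corrected form; a typical version used by standards-based qPCR quantitation kits (the exact terms and the reference length vary by kit and protocol) is:

```latex
C_{\text{library}} = C_{\text{qPCR}} \times D \times \frac{L_{\text{standard}}}{L_{\text{library}}}
```

where \(C_{\text{qPCR}}\) is the concentration interpolated from the standard curve, \(D\) is the dilution factor of the library aliquot that was assayed, and \(L_{\text{standard}}\) and \(L_{\text{library}}\) are the average fragment lengths of the DNA standard and the library, respectively.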
After preparation and quantitation, libraries of desired quantity and quality are ready to load on a flow cell for subsequent clonal amplification and sequencing.