By the Bioscience Core Lab and Caitlin Clark
Life on Earth contains a blueprint of the self called the genome, which carries all genetic information. Genomics, the field focusing on the genome, has grown rapidly over the last few decades due to the advent of genome determination technology.
Today, genomics is a meeting point for biology and data science, and it is also becoming a hotspot in the healthcare field, as it enables researchers to see what happens during different biological processes. For example, through genomics, scientists can observe how cancer cells differ from cells in healthy tissues. And, as the world continues to battle SARS-CoV-2, genomics allows researchers to observe and assess if an individual is infected with the virus and how the virus evolves and spreads in communities. Genomics has now become part of our daily lives.
An important tool in genomics is DNA sequencing, which reveals the order of DNA’s four chemical building blocks, or bases. Two commonly used sequencing methods are Sanger sequencing and next-generation sequencing (NGS). Sanger sequencing uses the chain termination sequencing method and computer technology to generate up to 1,000 base sequences at a time and reads only one strand of DNA. In contrast, NGS technologies are newer techniques that allow for rapid and high-throughput DNA sequencing. They carry out millions of sequencing reactions at once in parallel and can sequence entire genomes at one time, usually generating single sequences ranging from around 100 to 600 bases.
Output from NGS can be 106 times more or even higher than that of Sanger sequencing because of parallelization, compelling users to apply software and statistics for even quality control (QC).
Third generation sequencing: A constant evolution
Third generation sequencing (TGS), a more recent platform, provides scientists with the ability to sequence a single molecule of DNA in a high-throughput “long read.” With long reads, researchers can examine an entire genome and in particular look at repetitive regions in genomes, which are difficult to examine when DNA is read in short fragments. Several currently used TGS technologies came into existence around the early to mid-2000s, such as PacBio SMRT sequencing and Oxford Nanopore.
“One of the missions in the KAUST Core Labs is to provide high-quality data and a wide range of facilities to KAUST researchers and external collaborators,” said Yoshinori Fukasawa, team lead in bioinformatics from the University’s Bioscience Core Lab (BCL). “BCL genomic scientists offer their expertise in Sanger, NGS and TGS on a daily basis.”
“In genomics, a very rapid evolution of technologies is ongoing and drives the field,” he continued. “Development of TGS has been a remarkable achievement, but it is still relatively new in the scientific community.”
Despite TGS’s successes, challenges exist with the new technologies. Although reads are long and processing times are short, error rates can be high compared to previous sequencing technologies. Standardized QC methods exist for short read sequencers, but these techniques don’t work well for long reads. The special properties of long read outputs need different QC statistics to describe their characteristics.
“Even assessment of the data quality was not fully established for TGS,” Fukasawa continued. “People have tried to apply pre-existing methods; however, my colleagues and I from BCL found that these methods could not spot failures well, and we decided to develop a new method to overcome the situation.”
Collaboration for LongQC
Fukasawa and BCL team members developed LongQC, a new automated data QC tool, to evaluate sequencing data from third generation sequencers. LongQC was completed as part of a Core Labs’ Impact Project and in a collaboration between BCL’s bioinformatics team and the TGS team. It is the first of its kind in data QC; is able to generate statistics optimized for TGS; is platform independent; and operates without reference genomes.
“The tool visualizes statistics designed for erroneous long read data to highlight potential problems and is well-suited as a first step analysis to highlight various issues prior to any further deep-dive downstream analysis,” explained Fukasawa. “This helps users to have greater understanding of the data, which will lead to more accurate and effective interpretation.”
Details of LongQC’s method and resulting data were published as the cover article of the April 2020 issue of G3: Genes, Genomes, Genetics, a journal covering issues ranging from genetic experiments to theory.
Contributing to genomics worldwide
“Generally speaking, the quality of data depends on multiple factors, and each step in an experiment affects the quality of data for a final outcome,” Fukasawa noted. “By screening issues in a dataset at an early stage of the analysis—such as through the use of LongQC—we can avoid wasting time or even making unreliable conclusions.”
“The theoretical aspect of LongQC was developed and implemented by the BCL bioinformatics team, but there were still some missing pieces for proof of concept: good data, and more importantly, what we called challenging data, or low-quality sequences,” explained Luca Ermini, BCL team lead, Sanger and TGS. “To validate the tool, we first generated high-quality data to test LongQC, but then we also needed to produce low-quality sequences to assess the performance of LongQC for the challenging data.”
However, having low-quality data is not particularly easy, as this kind of data is not shared and is hardly reproducible.
“We modified an experimental protocol to mimic a failure and succeeded in generating the also low-quality sequencing data,” added Karen Carty, BCL senior technical specialist, TGS.
They tested LongQC on a wide range of datasets covering both DNA and RNA sequencing data, finding that the difference between LongQC-estimated and actual mapping results was negligible. The researchers also expect that LongQC could be used with single cell analysis.
The researchers stated that because of the improvements in long-read technologies, these technologies may become “the mainstay of sequencing in the next decade, and [they are] increasingly popular in various applications to address a diverse range of biological problems. LongQC could be applied on new applications.”
For KAUST TGS users, the BCL team now provides LongQC as part of their TGS expertise, and the team plans to give a training course in the near future that discusses the interpretation of LongQC results. There are plans to present LongQC globally at upcoming conferences, and, since the publication of the G3 issue, Fukasawa continues to receive questions from around the world about the tool’s use.
“LongQC is valuable, easy-to-use and efficient for present and future TGS data, enabling better understanding of data before downstream analysis,” Fukasawa said.
“I’ve shared LongQC within the Nanopore user community, and many users do find it helpful,” stated Hai Wang, a former BCL employee who is the third author of the paper and who now works as a strategic account manager at Oxford Nanopore Technologies. “We already included the paper on our official website’s resource center.”
“The BCL team assesses and provides LongQC results to aim at higher quality, and we believe our results will contribute to the worldwide genomics community,” noted Fukasawa.
“I have been so impressed with the BCL genomics staff,” stated Heiko Langner, facilities director, Analytical Chemistry and Bioscience Core Laboratories. “Over the past couple of years, they managed not only to stay abreast in a rapidly developing field, but they have also been able to make a number of great contributions that are truly innovative and give our faculty some tools that are undoubtedly unique in the Kingdom and the region.”
“Their LongQC publication is only one example of a string of great achievements,” he continued. “Last fall, the TGS team was able to secure the status as the first and only Certified Service Provider of PacBio Sequel technology in the Middle East. The BCL was front and center of multiple projects by KAUST faculty to tackle COVID-19 related projects—for example, by sequencing the complete genome of the coronavirus. Finally, our NGS team just commissioned the latest sequencer on the market, and its capabilities are already heavily used by researchers throughout the KAUST Biological and Environmental Science and Engineering division.”