Genomics is ideally suited to the inclusion of Artificial Intelligence (AI) in the clinical workflow: large Next Generation Sequencing (NGS) panels such as Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) uncover thousands to millions of variants in an individual patient, writes Dr Sunil Tadepalli, Founder & CEO at Labnetworx.
CENTOGENE & Big Data
With big data being a key enabler of any successful artificial intelligence effort, CENTOGENE is ideally suited to employ artificial intelligence (AI) in its diagnostic workflow. We believe that we have the world's largest curated mutation database for rare diseases, and in particular the biggest database of causal variants. Since the predictive power and accuracy of artificial intelligence depend heavily on how good and comprehensive the underlying data are, the size of our database combined with our AI expertise is creating a paradigm shift in our diagnostic approach and capability.
The Need for Variant Prioritization
Next-generation sequencing (NGS) has revolutionized disease diagnosis and treatment, approaching the point of providing a personal genome sequence for every patient. A typical NGS analysis of a patient reveals tens of thousands of non-reference coding variants, but only one or very few are expected to be significant for the relevant disorder. In a filtering stage, one employs family segregation, population frequency, quality, predicted protein impact and evolutionary conservation to shorten the variant list.
However, narrowing down further to the culprit disease genes usually entails a laborious process of seeking gene-phenotype relationships by consulting numerous separate databases. Thus, a major challenge that variant prioritization addresses is the transition from the few hundred shortlisted genes to the most viable disease-causing candidates.
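The hard-filtering stage described above can be sketched in a few lines. This is a minimal illustration only: the field names, thresholds and impact categories are assumptions chosen for demonstration, not the actual parameters of any production pipeline.

```python
# Hypothetical sketch of the variant hard-filtering stage.
# Field names and thresholds are illustrative assumptions.

HIGH_IMPACT = {"frameshift", "stop_gained", "missense", "splice_site"}

def passes_filters(variant, max_pop_freq=0.01, min_qual=30):
    """Keep a variant only if it is rare, well-supported by the
    sequencing data, and predicted to alter the protein."""
    if variant["pop_freq"] > max_pop_freq:    # too common in the population
        return False
    if variant["qual"] < min_qual:            # low genotype quality
        return False
    if variant["impact"] not in HIGH_IMPACT:  # unlikely protein effect
        return False
    return True

variants = [
    {"gene": "GBA",  "pop_freq": 0.0001, "qual": 60, "impact": "missense"},
    {"gene": "ABC1", "pop_freq": 0.2500, "qual": 55, "impact": "synonymous"},
]
shortlist = [v for v in variants if passes_filters(v)]
print([v["gene"] for v in shortlist])  # → ['GBA']
```

In practice each criterion (segregation, conservation, and so on) contributes its own filter, and the surviving list still typically contains hundreds of candidates, which is exactly why the prioritization step discussed next is needed.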
CENTOGENE Variant Prioritization Tool
We developed a new variant prioritization tool to address the need for identifying the most viable disease-causing genes – a solution that leverages CENTOGENE's unique data repository. By deploying our in-house Artificial Intelligence capability, we devised a strategy to prioritize candidate sequence variants based on their segregation patterns, their prevalence in internal and external population databases, their impact and quality, and deep phenotype similarities. Importantly, the tool has accelerated CENTOGENE's Whole Exome and Clinical Exome Sequencing diagnostics process.
We were able to demonstrate that, by combining what we believe to be the world's largest database of genetic information with an AI-based variant prioritization solution, CENTOGENE's clinical score outperforms other commercially available tools.
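To make the idea of a combined prioritization score concrete, the evidence types named above can be merged into a single ranking value. The weights and feature names below are purely illustrative assumptions for a sketch – the actual clinical score is a proprietary AI model, not a fixed weighted sum.

```python
# Illustrative weighted score combining the evidence types mentioned
# in the text. Weights are invented for demonstration purposes only.

WEIGHTS = {
    "segregation": 0.30,      # variant tracks with disease in the family
    "rarity": 0.25,           # rare in internal and external databases
    "impact": 0.25,           # predicted effect on the protein
    "quality": 0.05,          # sequencing/genotype confidence
    "phenotype_match": 0.15,  # deep phenotype similarity to known cases
}

def clinical_score(features):
    """Each feature is pre-normalized to [0, 1]; missing features
    contribute zero. Returns a weighted sum in [0, 1]."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)

candidate = {"segregation": 1.0, "rarity": 0.9, "impact": 0.8,
             "quality": 1.0, "phenotype_match": 0.7}
print(round(clinical_score(candidate), 2))  # → 0.88
```

Ranking the shortlisted variants by such a score lets the analyst inspect the strongest candidates first instead of working through hundreds of variants in arbitrary order.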
Big Data & Artificial Intelligence in Diagnostics
In the field of clinical diagnostics, Big Data and Artificial Intelligence (AI) are transforming the sector like nothing before. The availability of large and diverse biomedical data sets (big data), combined with huge increases in computing power, is enabling previously unheard-of analysis possibilities, resulting in discoveries and diagnostic insights via AI.
Importantly, Big Data is the key enabler of any artificial intelligence effort. AI systems need enormous data sets to train their algorithms: the better and more comprehensive the data, the higher the predictive power and accuracy of the resulting models, because machines can identify complex patterns and relationships in large amounts of unstructured data drawn from a variety of sources. AI derives insights and makes predictions that are simply not possible by human analysis alone.
In the diagnosis of rare disease patients, variant prioritization is a vital step in discovering causal variants and identifying disease-causing mutations. This is because the results of next generation sequencing (NGS) technologies and applications, such as Whole Exome Sequencing (WES) or Whole Genome Sequencing (WGS), often consist of a list of several thousand variants of unknown significance, many of which prove to be benign (even though any rare variant has the potential to be pathogenic).
In simple terms, variant prioritization accelerates and simplifies variant interpretation: through filtering, it identifies which variants found via genetic testing are likely to affect the function of a gene. The resulting prioritization scores support the diagnosis of a patient, and rare disease diagnosis relies heavily on them to determine which variants are likely to be clinically relevant.
There are many academic and commercial tools available today for filtering out variants that are deemed unlikely to cause disease. These tools prioritize variants on the basis of segregation, minor allele frequency, predicted pathogenicity, text-mining, dbSNP information, and genotype quality. However, these methods used in a standalone fashion have typically not been able to identify the causative variants underlying a patient's phenotype, and require additional investigation, such as consulting external databases and identifying shared rare variants in unrelated individuals with similar diseases.
CENTOGENE’s Data Repository
At the core of CENTOGENE's platform is our data repository CentoMD®, which includes epidemiologic, phenotypic and heterogeneous genetic data, and allows us to assemble an extensive knowledge base in rare hereditary diseases. As of May 31, 2019, our CentoMD® database included curated data from over 420,000 patients. CentoMD® brings rationality to the interpretation of global genetic data, and we believe it is the world's largest curated mutation database for rare diseases – and in particular the biggest database of causal variants – comprising more than 7.3 million variants, including a significant number of unpublished variants.
Importantly, before adding information into CentoMD®, our experts use evidence-based criteria to validate the interpretation of the data. We use a combination of computer-based tools and manual curation by professional scientists with strong backgrounds in human genetics. Our team of scientists collects, annotates and reviews the phenotypic, genetic and epidemiologic data of patient samples to ensure the highest medical validity of each sample. We also employ Human Phenotype Ontology ('HPO') coding to accurately track and standardize sample phenotype and genotype data.
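Standardized HPO coding is what makes patient phenotypes machine-comparable. As a minimal sketch, two patients coded with HPO terms can be compared with a simple set-overlap (Jaccard) measure; the term IDs below are real HPO identifiers used purely as examples, and a production system would use a more sophisticated, ontology-aware similarity.

```python
# Minimal sketch: comparing two HPO-coded phenotypes with a Jaccard
# index. Real systems weight terms by specificity and ontology depth.

def hpo_similarity(terms_a, terms_b):
    """Jaccard index of two sets of HPO term IDs, in [0, 1]."""
    a, b = set(terms_a), set(terms_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Example HPO terms: seizures, developmental delay, muscle weakness.
patient    = {"HP:0001250", "HP:0001263", "HP:0001324"}
known_case = {"HP:0001250", "HP:0001263"}
print(round(hpo_similarity(patient, known_case), 2))  # → 0.67
```

Because every patient in the repository is coded against the same vocabulary, such comparisons scale across hundreds of thousands of cases, which is what enables the "deep phenotype similarity" feature of the prioritization tool.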
Our methodological approach to information curation ensures we provide highly accurate data relevant to clinical diagnoses and decision-making by humans, and indeed for the development of artificial intelligence solutions.
CENTOGENE's data repository integrates all relevant structured and unstructured patient data: metabolomic, proteomic and genetic data, health records and clinical information, and especially longitudinal data such as biomarker measurements and patient-reported outcomes (also called 'real-world data'), as well as diagnostic workflow data. However, the sheer size and continued expansion of the repository mean that a purely manual, human-based approach to utilizing the data is no longer adequate.
As a result, CENTOGENE is increasingly placing AI at the heart of its diagnostic and pharmaceutical solutions. The application of AI enables us to find statistical relationships faster (speed), draw more exact conclusions about relationships in the data (accuracy), and discover patterns that cannot be found with traditional methods (complexity).