Bioinformatics and Genomics
Historically, bioinformatics emerged with the understanding that biology used sequences at different levels: nucleic acid sequences for DNA and RNA, amino acid sequences for proteins.
Understanding, analyzing and comparing sequences are part of the fundamentals of biology and require the development of computer tools.
The development of sequencing technologies in recent years, such as next-generation sequencing (NGS), has led to the production of very large volumes of data. Bioinformatics serves genomics in the following areas:
Sequencing techniques produce short sequences (reads). The shotgun sequencing technique, for example, generates fragments of 35 to 900 nucleotides, which must then be assembled. Aligning reads against a known reference genome, such as the human genome, requires significant computing resources, although advances in computing are making this faster. Gaps in the assembled genome are common and require more targeted work in a second step.
In the case of unknown genomes (de novo sequencing), assembly is more complex because there is no reference genome to align against, and some regions may be very difficult to sequence.
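To make the assembly step concrete, here is a minimal sketch of a greedy overlap assembler in Python. It is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats) and not how production assemblers work; the function names and the `min_len` parameter are invented for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest suffix/prefix overlap.

    Returns the remaining contigs once no overlap of at least `min_len`
    nucleotides is left. More than one contig means the assembly has gaps.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads for illustration only.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

When the reads do not overlap enough, the function returns several contigs, which corresponds to the gaps mentioned above.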
Annotation is the process of labeling the features of a DNA sequence: exons (which carry the coding sequences) and introns, regulatory sequences, methylation profiles, etc.
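As a rough illustration, annotations can be stored as labeled intervals on the sequence. The sketch below assumes 0-based, end-exclusive coordinates; the `Feature` class, its fields and the example sequence are hypothetical and do not follow any particular annotation file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str   # e.g. "exon", "intron", "regulatory"
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def spliced_sequence(dna: str, features: list[Feature]) -> str:
    """Concatenate the exons in genomic order, skipping everything else."""
    exons = sorted((f for f in features if f.kind == "exon"),
                   key=lambda f: f.start)
    return "".join(dna[f.start:f.end] for f in exons)


# Made-up sequence: exon, intron, exon.
dna = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
features = [
    Feature("exon", 0, 6),
    Feature("intron", 6, 18),
    Feature("exon", 18, 24),
]
print(spliced_sequence(dna, features))
```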
Sequence analysis can reveal links between species, which is the domain of evolutionary biology. The phenomena studied are typically gene duplications, horizontal transfers, and large-scale comparisons of genomes, which make it possible to consolidate, or compare against, the taxonomic or physiological methods used so far for the classification of species.
Bioinformatics tools also make it possible to build population models that predict how a system evolves over the long term.
Sequence comparison starts with the comparison of two gene sequences from two different organisms.
The differences observed, from point mutations in a nucleotide to changes in chromosomal segments such as duplications, transfers, inversions etc., allow the complexity of evolution to be understood.
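A pairwise comparison of this kind can be sketched with the classic Needleman-Wunsch global alignment algorithm. The scoring values below (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for illustration; real analyses use substitution matrices and tuned gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment of `a` and `b`; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def differences(aligned_a: str, aligned_b: str) -> list[int]:
    """Alignment columns where the two sequences differ (substitution or gap)."""
    return [k for k, (x, y) in enumerate(zip(aligned_a, aligned_b)) if x != y]


# Made-up sequences for illustration only.
s, x, y = needleman_wunsch("GATTACA", "GATCACA")
print(s, x, y, differences(x, y))
```

Columns reported by `differences` correspond to point mutations or small insertions and deletions; larger rearrangements such as inversions need dedicated methods.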
In certain diseases such as cancers, the genomes of the affected cells are extensively altered: rearrangements, point mutations, etc.
Bioinformatics enables two types of comparative analysis based on sequencing data: between the cancer cells and the normal cells of one organism, and between the cancer cells of one organism and those of other organisms. This type of study makes it possible to classify and catalog the changes found in the genomes of cancer patients, in order ultimately to speed up diagnosis and propose the best treatments.
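The two comparisons can be sketched with plain set operations, assuming variants have already been called and are represented as `(chromosome, position, reference, alternate)` tuples. The representation, the function names and all data below are made up for illustration; real pipelines also have to handle sequencing errors, coverage and tumor purity.

```python
from collections import Counter

Variant = tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def somatic_variants(tumor: set[Variant], normal: set[Variant]) -> set[Variant]:
    """Variants seen in the tumor sample but not in the matched normal sample."""
    return set(tumor) - set(normal)


def recurrent_variants(per_patient: list[set[Variant]],
                       min_patients: int = 2) -> dict[Variant, int]:
    """Variants found in at least `min_patients` different patients."""
    counts = Counter(v for variants in per_patient for v in set(variants))
    return {v: n for v, n in counts.items() if n >= min_patients}


# Hypothetical variant calls.
tumor = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
normal = {("chr2", 200, "G", "C")}
print(somatic_variants(tumor, normal))
```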
For more information on the tools available, the Open Bioinformatics Foundation lists tools such as Biopython, BioJS or BioPerl.
Artificial Intelligence at the service of Bioinformatics
Developments in artificial intelligence in recent years, including machine learning and deep learning, have been applied in the field of bioinformatics, particularly in the prediction of protein structure.
A protein is a chain of amino acids whose structure is described at four levels:
- Primary structure: a sequence of amino acids
- Secondary structure: local folding into alpha helices and beta sheets
- Tertiary structure: three-dimensional folding stabilized by covalent bonds (such as disulfide bridges) or non-covalent bonds
- Quaternary structure: integration in a protein complex
With the rise of AI, bioinformatics tools allow us to go much further in the study and prediction of protein structures.
Classifying proteins into new superfamilies
AI tools analyze primary protein sequences and extract the essential information (typically residues that are essential for structure, or highly conserved). This leads to the prediction of pseudo-proteins, which serve as references for classifying unknown proteins into superfamilies.
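The idea of "highly conserved" positions can be illustrated without any machine learning. The sketch below is a simple frequency count over sequences that are assumed to be already aligned and of equal length; it is a stand-in for intuition, not the AI method described above, and the sequences are invented.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[tuple[str, float]]:
    """For each alignment column, the most common residue and its frequency."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


def conserved_positions(alignment: list[str], threshold: float = 1.0) -> list[int]:
    """Column indices whose most common residue reaches `threshold` frequency."""
    return [i for i, (_, freq) in enumerate(column_conservation(alignment))
            if freq >= threshold]


# Made-up aligned protein fragments.
aligned = ["MKVLA", "MKILA", "MRVLA"]
print(conserved_positions(aligned))
```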
Generating models of protein structures
The Generative Adversarial Network (GAN), introduced in 2014, is a powerful machine learning tool. It pairs a generator, which produces new data resembling the original data, with a discriminator, which tries to tell generated data from real data. This is particularly relevant for generating models of tertiary protein structures that are "similar" to, or consistent with, reference models. One paper used GANs to generate structures, which are checked for consistency, with the result fed back into the generator. This makes it possible to propose robust structure solutions, especially in cases where part of a protein's structure is missing or corrupted.
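For reference, the standard GAN training objective from the original 2014 formulation can be written as below. Mapping the symbols onto proteins is an illustrative assumption: x would be a real structure from the reference set, z a random noise vector, G the generator and D the network that scores a structure as real or generated. The paper mentioned above may use a modified objective.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The first term rewards D for recognizing real data, and the second rewards it for rejecting generated data, while G is trained to make the second term as small as possible.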
AI studies can also be applied to modeling antibody-antigen interaction domains, in order to minimize development steps in animals or by phage display.
The Importance of Big Data in Health
During the course of a clinical trial, different types of data are collected, transformed into analyzable data sets to answer specific research questions, and used to generate various publications and reports for different audiences. Biostatistics is used to collect, analyze and interpret the results.
The investigators are assisted by a biostatistician in the following steps:
- Definition of the hypothesis
- Choice of statistical tests and determination of their power
- Sample size
- Definition of risk and influence factors
- Understanding correlation and regression
- Explanation of the phenomena of multiplicity
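Two of the steps above, sample size and multiplicity, lend themselves to a short worked example. The sketch below uses the usual normal-approximation formula for comparing two group means, n = 2 * ((z_(1-alpha/2) + z_power) / d)^2 per group, where d is the standardized effect size, and a Bonferroni correction for multiple tests. The function names and default values are assumptions for illustration; a real trial design would be set by the biostatistician and may use exact methods.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate participants per group for a two-sided, two-group comparison.

    `effect_size` is the standardized difference between means (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at `alpha`."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# A medium effect (d = 0.5) at alpha = 0.05 and 80% power.
print(sample_size_per_group(0.5))
# The same design when five endpoints are tested with a Bonferroni correction.
print(sample_size_per_group(0.5, alpha=bonferroni_alpha(0.05, 5)))
```

Testing several endpoints lowers the per-test significance level, which in turn raises the required sample size. This is one reason multiplicity has to be addressed at the design stage.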
Real-world patient data
Real-world data (RWD) are data on healthy and treated patients, collected from a variety of sources and typically generated directly by the patient.
The processing of this data will generate "real-world evidence" (RWE), useful for the following areas:
The possible applications of these data are economically and socially critical, and they require specialized tools and skilled statisticians.