Bioinformatics and Biostatistics: Analysis of Biological Data
Understanding bioinformatics and its role in biotechnology
Computer analyses of study results are often required to verify the distribution of the data obtained. These statistical analyses help to assess the statistical significance of the results.
Applied to biology, computer models can also provide a better representation of biological functions. Bioinformatics is interdisciplinary in nature: it allows biologists to understand more complex systems, and computer scientists to develop software tools for interpreting biological data.
Bioinformatics combines biology, computer science, information engineering, mathematics and statistics.
What are the advantages of using a service provider for bioinformatics or biostatistical analysis?
- Getting the most information from experimental data with advanced bioinformatics tools
- Implementing the best statistical study
Discover bioinformatics and biostatistics services and get in touch with the best service providers
Bioinformatics and Genomics
Historically, bioinformatics emerged from the realization that biology relies on sequences at different levels: nucleic acid sequences for DNA and RNA, and amino acid sequences for proteins.
Understanding, analyzing and comparing sequences are part of the fundamentals of biology and require the development of computer tools.
The development of sequencing technologies in recent years (next-generation sequencing, or NGS, for example) has led to the production of very large volumes of data. The main applications of bioinformatics in genomics are as follows:
Sequence assembly
Sequencing techniques produce short sequences (reads) that must then be assembled. The shotgun sequencing technique, for example, generates fragments of 35 to 900 nucleotides. For a known genome, such as the human genome, assembling the reads by aligning them against the reference requires significant computing resources, although advances in computing keep making the process faster. "Gaps" in the assembled genome are common and require more targeted work in a second step.
For unknown genomes (de novo sequencing), no reference is available, so assembly is more complex, and some regions may prove very difficult to sequence.
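As a rough illustration of the assembly step, the sketch below merges short reads by repeatedly joining the two reads with the longest suffix-prefix overlap. It is a toy greedy approach, not the algorithm used by production assemblers; the function names, the `min_len` threshold and the example reads are all assumptions made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_len:
                        best_len, best_i, best_j = size, i, j
        if best_len == 0:
            break  # no overlap left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads for illustration only.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTAAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

In this example the first three reads chain into a single contig, while the fourth shares no overlap with the others and remains separate, which is how a gap shows up in an assembly.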
Genome annotation
Annotation is the process of labeling the features of a DNA sequence: exons (coding sequences) and introns, regulatory sequences, methylation profiles, etc.
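One of the simplest annotation tasks is locating candidate coding regions. The sketch below scans the three forward reading frames for open reading frames (an ATG start codon followed by an in-frame stop codon). It is a deliberately naive example: real annotation pipelines also handle the reverse strand, introns and experimental evidence, and the function name, the `min_codons` threshold and the test sequence are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand open reading frames.

    Coordinates are 0-based with an exclusive end; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # keep the ORF only if it has enough codons before the stop
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return sorted(orfs)


# Made-up sequence for illustration only.
print(find_orfs("CCATGAAACCCGGGTAACC"))
```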
Evolutionary biology
Sequence analysis can reveal relationships between species, which is the domain of evolutionary biology. The phenomena studied are typically gene duplications, horizontal gene transfers, and large-scale genome comparisons, which make it possible to consolidate or re-evaluate the taxonomic or physiological methods used so far for the classification of species.
Bioinformatics tools also make it possible to build population models that predict how a system will evolve over the long term.
Comparative genomics
Sequence comparison starts with the comparison of two gene sequences from two different organisms.
The differences observed, from point mutations in a nucleotide to changes in chromosomal segments such as duplications, transfers, inversions etc., allow the complexity of evolution to be understood.
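To make the idea of comparing two gene sequences concrete, here is a minimal sketch of global pairwise alignment using the classic Needleman-Wunsch dynamic programming scheme, followed by a helper that lists the columns where the two aligned sequences differ. The scoring values (match, mismatch, gap) and the example sequences are arbitrary assumptions; dedicated tools use tuned substitution matrices and far more efficient implementations.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align `a` and `b`; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


def differences(top: str, bottom: str) -> list[tuple[int, str, str]]:
    """Alignment columns where the two sequences differ ('-' marks a gap)."""
    return [(k, x, y) for k, (x, y) in enumerate(zip(top, bottom)) if x != y]


# Made-up sequences for illustration only.
aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATGACA")
print(aligned_a, aligned_b, total, differences(aligned_a, aligned_b))
```

A substitution appears as a column with two different letters, while an insertion or deletion appears as a column containing a gap character.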
Mutation analysis
In certain diseases such as cancers, the genomes of the affected cells are extensively altered: rearrangements, point mutations, etc.
Bioinformatics enables two types of comparative analyses based on sequencing data: between the cancer cells and the normal cells of the same individual, and between the cancer cells of different individuals. This type of study makes it possible to classify and catalog the genomic changes found in cancer patients, in order to ultimately save time in diagnosis and propose the best treatments.
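The two comparisons can be sketched with plain set operations once variants have been called. In the example below, a variant is represented as a hypothetical (chromosome, position, reference base, alternate base) tuple; the representation, the function names and all coordinates are invented for illustration and do not correspond to real patient data or to any particular variant-calling tool.

```python
from collections import Counter

# Hypothetical variant representation: (chromosome, position, ref base, alt base)
Variant = tuple[str, int, str, str]


def somatic_variants(tumor: set[Variant], normal: set[Variant]) -> set[Variant]:
    """Variants seen in the tumor sample but absent from the matched normal sample."""
    return tumor - normal


def recurrent_variants(per_patient: dict[str, set[Variant]],
                       min_patients: int = 2) -> dict[Variant, int]:
    """Variants found in the tumors of at least `min_patients` patients."""
    counts = Counter(v for variants in per_patient.values() for v in variants)
    return {v: n for v, n in counts.items() if n >= min_patients}


# Invented data for illustration only.
normal = {("chr1", 100, "A", "G")}
tumor = {("chr1", 100, "A", "G"), ("chr2", 250, "C", "T")}
somatic = somatic_variants(tumor, normal)

cohort = {
    "patient_1": {("chr2", 250, "C", "T"), ("chr3", 75, "G", "A")},
    "patient_2": {("chr2", 250, "C", "T")},
}
recurrent = recurrent_variants(cohort)
print(somatic, recurrent)
```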
For more information on the tools available, the Open Bioinformatics Foundation lists tools such as Biopython, BioJS or BioPerl.
Artificial Intelligence at the service of Bioinformatics
Developments in artificial intelligence in recent years, including machine learning and deep learning, have been applied in the field of bioinformatics, particularly in the prediction of protein structure.
A protein is a chain of amino acids whose structure is described at four levels:
- Primary structure: the sequence of amino acids
- Secondary structure: local folding into alpha helices and beta sheets
- Tertiary structure: three-dimensional folding stabilized by covalent and non-covalent bonds
- Quaternary structure: assembly of several chains into a protein complex
With the rise of AI, bioinformatics tools allow us to go much further in the study and prediction of protein structures.
Classifying proteins into new superfamilies
AI tools analyze primary protein sequences and extract the essential information they contain (typically residues that are essential for the structure, or highly conserved). This leads to the prediction of "pseudo-proteins", which serve as references for classifying unknown proteins into superfamilies.
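The idea of a reference built from conserved positions can be illustrated without any machine learning. The sketch below is a simplified stand-in, not the AI method described above: it derives a consensus sequence for each family from pre-aligned members (playing the role of a "pseudo-protein") and assigns an unknown sequence to the family whose consensus it matches best. The family names, sequences and function names are all invented for this example.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Most frequent residue in each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify(seq: str, references: dict[str, str]) -> str:
    """Name of the reference sequence most similar to `seq`."""
    return max(references, key=lambda name: identity(seq, references[name]))


# Toy, made-up families of already aligned sequences.
families = {
    "family_a": ["MKVLA", "MKVLG", "MRVLA"],
    "family_b": ["GSPEW", "GSPDW", "GTPEW"],
}
references = {name: consensus(members) for name, members in families.items()}
prediction = classify("MKILA", references)
print(references, prediction)
```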
Generating models of protein structures
One of the powerful machine learning tools, introduced in 2014, is the Generative Adversarial Network (GAN). A GAN pairs a generator, which produces new data, with a discriminator, which tries to tell the generated data apart from the original data, so that the generated data gradually come to resemble the originals. This is particularly relevant for generating models of tertiary protein structures that are consistent with reference models. One paper used GANs to generate structures that are checked for consistency, with the outcome of that check fed back into the generator. This makes it possible to propose robust structure solutions, especially in cases where part of a protein's structure is missing or corrupted.
AI studies can also be applied to modeling antibody-antigen interaction domains, in order to reduce the number of development steps carried out in animals or by phage display.
The different types of providers
Freelancers and sole proprietors
Many researchers or physicians may conduct bioinformatics and biostatistics studies on an ad hoc basis, or over the duration of research or clinical projects.
Specialist companies and CROs
There are firms specializing in clinical studies that offer biostatistical and bioinformatics analysis.
Companies that conduct clinical trials (contract research organizations, or CROs) generally have the internal capacity to process the data collected in order to compile regulatory dossiers.
The Importance of Big Data in Health
Clinical Data
During the course of a clinical trial, different types of data are collected, transformed into analyzable data sets to answer specific research questions, and used to generate various publications and reports for different audiences. Biostatistics are used to collect, analyze and interpret the results.
Investigators are assisted by a biostatistician in the following steps:
- Definition of the hypothesis
- Choice of statistical tests and determination of their power
- Determination of the sample size
- Definition of risk and influence factors
- Understanding of correlation and regression
- Handling of multiplicity (multiple comparisons)
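Two of these steps, sample size and multiplicity, lend themselves to a short worked example. The sketch below uses the standard normal-approximation formula for comparing the means of two equal-sized groups, n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2 per group, and a Bonferroni correction for multiple tests. The effect size, standard deviation, number of tests and function names are assumptions chosen for illustration; a real trial design would be adapted by the biostatistician to the actual endpoint and test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta: float, sigma: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants per group for a two-sided, two-sample comparison of means
    (normal approximation, equal group sizes, common standard deviation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at `alpha`."""
    return alpha / n_tests


# Assumed scenario: detect a difference of half a standard deviation.
n_single = sample_size_per_group(delta=0.5, sigma=1.0)

# Same scenario, but with 5 hypotheses tested and a Bonferroni-adjusted threshold.
alpha_adj = bonferroni_alpha(0.05, 5)
n_adjusted = sample_size_per_group(delta=0.5, sigma=1.0, alpha=alpha_adj)
print(n_single, alpha_adj, n_adjusted)
```

With these assumptions a single comparison needs 63 participants per group, and the stricter threshold imposed by the correction increases that number, which is why multiplicity has to be considered when the study is designed rather than afterwards.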
Real-world patient data
Real-world data (RWD) are data on healthy and treated individuals, collected from a variety of sources and typically generated directly by the patient.
The processing of this data will generate "real-world evidence" (RWE), useful for the following areas:
- Sub-categorization of patients
- Use of health resources
- Diagnosis of rare diseases
- Follow-up of treatments
- Mortality rate study
- Treatment cost and impact study
The possible applications of these data are economically and socially critical, which calls for specialized tools and skilled statisticians.
The technologies used
Bioinformatics tools
Machine learning and artificial intelligence applications
Estimated rates for this type of service
A sequence alignment costs about €100 to €500.
A sequence annotation costs about €100 to €500.
The cost of a biostatistical study varies significantly depending on the type of data, the number of variables, and the statistical power required of the model.