Bioinformatics and Biostatistics: Analysis of Biological Data

Understanding bioinformatics and its role in biotechnology

Study results often require computational analysis to check how the data obtained are distributed. These statistical analyses help establish how reliable the results are.

Applied to biology, computer models can provide a better representation of biological functions. Bioinformatics is interdisciplinary by nature: it lets biologists understand more complex systems, and computer scientists develop software tools for interpreting biological data.

Bioinformatics combines biology, computer science, information engineering, mathematics and statistics.

What are the advantages of using a service provider for bioinformatics or biostatistical analysis?

Getting the most information from experimental data with advanced bioinformatics tools


Implementing the best statistical study

Bioinformatics and Genomics

Historically, bioinformatics emerged with the recognition that biology works with sequences at several levels: nucleotide sequences for DNA and RNA, and amino acid sequences for proteins.

Understanding, analyzing and comparing sequences are part of the fundamentals of biology and require the development of computer tools.

The development of sequencing technologies in recent years (NGS technologies, for example) has produced very large volumes of data. Bioinformatics supports genomics in the following areas:

Sequence assembly

Sequencing techniques produce short sequences (reads), which must then be assembled. The shotgun sequencing technique, for example, generates fragments of 35 to 900 nucleotides. For a known genome, such as the human genome, the reads are aligned against the reference sequence, which requires significant computing resources, although advances in computing are making this faster. "Gaps" in the genome are common and call for more targeted work in a second step.

In the case of unknown genomes (de novo sequencing), assembly is more complex, and some regions may prove very difficult to sequence.
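As a rough illustration of the assembly step, the following Python sketch merges reads greedily by their longest suffix-prefix overlap. It is a toy under stated assumptions: the reads are invented, error-free and all on the same strand, and the names `overlap`, `greedy_assemble` and the `min_len` threshold are hypothetical. Production assemblers rely on far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two contigs that share the largest overlap.

    Returns the remaining contigs; more than one contig means gaps remain.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlap left: remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Invented reads: three overlap, the fourth is isolated.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTAAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads, the three overlapping fragments collapse into a single contig, while the fourth shares no overlap with the others and stays separate, which mirrors the "gaps" that have to be closed in a second step.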

Genome annotation

Annotation is the process of labeling the features of a DNA sequence: exons (coding sequences) and introns, regulatory sequences, methylation profiles, etc.
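One very simple form of annotation is locating candidate coding regions. The sketch below, written in Python, scans the three forward reading frames for open reading frames (an ATG start codon followed in frame by a stop codon). It is only an illustration: the function name `find_orfs`, the `min_codons` threshold and the example sequence are assumptions, and real annotation pipelines also handle the reverse strand, introns, regulatory elements and experimental evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of forward-strand ORF candidates.

    `start` is the index of the ATG; `end` is exclusive and includes
    the stop codon. `min_codons` counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # close it and look for the next start
    return orfs


sequence = "CCATGGCTTTTGGATAACC"  # invented example sequence
for start, end in find_orfs(sequence):
    print(f"ORF candidate at {start}-{end}: {sequence[start:end]}")
```

Each tuple returned can be read as a minimal annotation record: a labeled interval on the sequence.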

Evolutionary biology

Sequence analysis can reveal relationships between species; this is the domain of evolutionary biology. The phenomena studied are typically gene duplications, horizontal gene transfers, and large-scale genome comparisons, which make it possible to confirm or challenge the taxonomic or physiological methods used so far to classify species.

Bioinformatics tools also make it possible to build model populations and predict how the system will evolve over the long term.

Comparative genomics

Sequence comparison starts with the comparison of two gene sequences from two different organisms.

The differences observed, from point mutations in a nucleotide to changes in chromosomal segments such as duplications, transfers, inversions etc., allow the complexity of evolution to be understood.
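To make the pairwise comparison concrete, here is a minimal Python sketch of a global alignment in the style of Needleman-Wunsch, followed by a listing of the point differences it reveals. The scoring values (match +1, mismatch -1, gap -2), the function name `needleman_wunsch` and the two short sequences are assumptions chosen for illustration; real comparisons use tuned scoring schemes and dedicated tools.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Globally align `a` and `b`; return (aligned_a, aligned_b, score)."""
    def sub(x: str, y: str) -> int:
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to rebuild one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


# Invented sequences differing by a single substitution.
aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATGACA")
differences = [(pos, x, y)
               for pos, (x, y) in enumerate(zip(aligned_a, aligned_b))
               if x != y]
print(aligned_a, aligned_b, total, differences)
```

Columns where one sequence shows "-" correspond to insertions or deletions, and columns with two different letters correspond to point mutations. Larger rearrangements such as inversions are not captured by this kind of alignment and need other methods.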

Mutation analysis

In certain diseases such as cancers, the genomes of the affected cells are extensively altered: rearrangements, point mutations, etc.

Bioinformatics enables two types of comparative analysis based on sequencing data: between cancer cells and normal cells of the same organism, and between the cancer cells of one organism and those of others. Such studies make it possible to catalog and classify the genomic changes found in cancer patients, with the ultimate aim of speeding up diagnosis and proposing the best treatments.
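Both comparisons can be sketched with simple set operations once variants have been called. In the Python example below, a variant is represented as a (chromosome, position, reference, alternate) tuple; that representation, the function names `somatic_variants` and `recurrence`, and all of the data are invented for illustration and do not come from any real patient or study.

```python
from collections import Counter

# A variant as (chromosome, position, reference base, alternate base).
Variant = tuple[str, int, str, str]


def somatic_variants(tumor: set[Variant], normal: set[Variant]) -> set[Variant]:
    """Variants seen in the tumor but not in the same patient's normal cells."""
    return tumor - normal


def recurrence(somatic_by_patient: dict[str, set[Variant]]) -> Counter:
    """Count in how many patients each somatic variant appears."""
    counts: Counter = Counter()
    for variants in somatic_by_patient.values():
        counts.update(variants)
    return counts


# Invented toy data for two hypothetical patients.
samples = {
    "patient_a": {
        "normal": {("chr1", 100, "A", "G")},
        "tumor": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"),
                  ("chr3", 300, "G", "A")},
    },
    "patient_b": {
        "normal": {("chr4", 400, "T", "C")},
        "tumor": {("chr4", 400, "T", "C"), ("chr2", 200, "C", "T")},
    },
}

# Comparison 1: tumor versus normal cells within each patient.
somatic = {name: somatic_variants(s["tumor"], s["normal"])
           for name, s in samples.items()}

# Comparison 2: tumor versus tumor across patients.
counts = recurrence(somatic)
shared = [variant for variant, n in counts.items() if n >= 2]
print(shared)
```

The first step removes inherited variants that are also present in normal cells; the second highlights changes that recur across patients, which is the kind of list that can later support classification and diagnosis.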

 

For more information on available tools, see the Open Bioinformatics Foundation, which lists projects such as Biopython, BioJS and BioPerl.

Artificial Intelligence at the service of Bioinformatics

Developments in artificial intelligence in recent years, including machine learning and deep learning, have been applied in the field of bioinformatics, particularly in the prediction of protein structure.

A protein is a chain of amino acids organized at four structural levels:

  • Primary structure: the sequence of amino acids
  • Secondary structure: local folding into alpha helices and beta sheets
  • Tertiary structure: three-dimensional folding stabilized by covalent or non-covalent bonds
  • Quaternary structure: assembly into a protein complex

 

With the rise of AI, bioinformatics tools allow us to go much further in the study and prediction of protein structures.

Classifying proteins into new superfamilies

AI tools analyze primary protein sequences and extract the key information they contain (typically residues that are essential to the structure or highly conserved). This leads to the prediction of pseudo-proteins, which serve as references for classifying unknown proteins into superfamilies.

Generating models of protein structures

The Generative Adversarial Network (GAN), introduced in 2014, is one of the most powerful machine learning tools. It generates new data that resemble the original data. This is particularly relevant for generating models of tertiary protein structures that are "similar" to, or consistent with, reference models. One paper used GANs to generate structures that are evaluated as consistent or inconsistent, with the result fed back into the generator. This makes it possible to propose robust structure solutions, especially when part of a protein's structure is missing or corrupted.

AI studies can also be applied to modeling antibody-antigen interaction domains, in order to reduce the number of development steps carried out in animals or by phage display.

The different types of providers

 

Freelancers and one-man companies

Many researchers or physicians may conduct bioinformatics and biostatistics studies on an ad hoc basis, or over the duration of research or clinical projects.

Specialist companies and CROs

Some firms specializing in clinical studies offer biostatistical and bioinformatics analysis.
Companies that conduct clinical trials (contract research organizations, or CROs) generally have the in-house capacity to process the collected data and compile regulatory dossiers.

The Importance of Big Data in Health

Clinical Data

During a clinical trial, different types of data are collected, transformed into analyzable data sets to answer specific research questions, and used to generate publications and reports for different audiences. Biostatistics is used to collect, analyze and interpret the results.
The study team is assisted by a biostatistician in the following steps:

  • Definition of the hypothesis
  • Choice of statistical tests and determination of their power
  • Calculation of the sample size
  • Definition of risk and influence factors
  • Understanding correlation and regression
  • Explanation of multiplicity (multiple testing) issues
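Several of the steps above (choice of test, power, sample size, multiplicity) are linked, and a short calculation shows how. The Python sketch below estimates the number of subjects per group for a two-sided comparison of two means, using the normal approximation n = 2 × ((z for 1 - alpha/2 + z for the power) × sigma / delta)². The function name `sample_size_per_group`, the effect size, the standard deviation and the number of comparisons are all assumptions chosen for illustration; an actual study design should be validated by a biostatistician.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta: float, sigma: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Subjects per group to detect a mean difference `delta`.

    Assumes two groups of equal size, a common standard deviation
    `sigma`, a two-sided test and the normal approximation.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return ceil(n)


# Hypothetical design: detect a difference of half a standard deviation.
n_single = sample_size_per_group(delta=0.5, sigma=1.0)
print(n_single)  # 63 per group

# Multiplicity: with 5 comparisons, a Bonferroni correction tests each
# one at alpha / 5, which raises the required sample size.
n_bonferroni = sample_size_per_group(delta=0.5, sigma=1.0, alpha=0.05 / 5)
print(n_bonferroni)  # 94 per group
```

The second call illustrates why multiplicity has to be addressed at the design stage: testing more hypotheses at a stricter threshold requires more subjects to keep the same power.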

 

Real-world patient data

Real-world data (RWD) are data on healthy and treated patients, drawn from a variety of sources and typically generated directly by the patient.

The processing of this data will generate "real-world evidence" (RWE), useful for the following areas:

 

The possible applications of these data are economically and socially critical, which calls for specialized tools and skilled statisticians.

The technologies used

Bioinformatics tools

Machine learning and artificial intelligence applications

Estimated rates for this type of service

A sequence alignment costs about €100 to €500.
A sequence annotation costs about €100 to €500.
The cost of a biostatistical study varies significantly depending on the type of data, the number of variables, and the statistical power required of the model.

Need help?