5 years in development, AI for predicting disease

Predicting phenotypes, such as the expression of a disease (e.g. a cancer type), from biomarkers such as the genome, with all of its mechanisms, pathways, and interactions, together with personal histories and demographics, is beyond human capability to interpret.

To make genomic medicine a reality, machine learning algorithms need to interpret the genome of the cell and its relation to disease, linking the effects of genetic variations to potential treatments in a way that is quicker, cheaper, and more accurate than laboratory experiments.

The enormous complexity of the relationship between a full genotype and its phenotype can only be understood using machine learning, and deep learning will play a critical role as biology moves toward high-throughput experiments.

Genome Risk Analysis Model with AI Validation

Diagnosis (1)

Implementing comparative genomics

Linking genetic variants to disease risks by association is enhanced by comparative genomics. These methods work by comparing large genome and phenotype data sets from both healthy individuals and cancer patients.

Deep learning algorithms model the genotype-to-phenotype relationship together with cell variables in order to produce a disease risk model. The resulting comparative model establishes the statistical significance of a potentially causal variant for a particular disease by comparing the affected group of individuals to a control group of unaffected individuals.
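As a minimal sketch of how such case-control significance testing works, the following computes a Pearson chi-square statistic for a single variant's 2x2 contingency table. The helper name and all counts are illustrative assumptions, not real patient data or the platform's actual method.

```python
# Minimal sketch of a case-control association test for a single variant.
# All counts below are illustrative, not real data.

def chi_square_2x2(carriers_cases, noncarriers_cases,
                   carriers_controls, noncarriers_controls):
    """Pearson chi-square statistic for a 2x2 case/control contingency table."""
    a, b = carriers_cases, noncarriers_cases
    c, d = carriers_controls, noncarriers_controls
    n = a + b + c + d
    # chi2 = n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts: variant carriers among 1000 cases vs 1000 controls.
stat = chi_square_2x2(300, 700, 200, 800)
print(round(stat, 2))
# 26.67
```

A large statistic (relative to the chi-square distribution with one degree of freedom) suggests the variant's frequency differs between the affected and control groups.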

Deciphering genomics via Big Data

One of the main difficulties with genome-wide association studies is establishing statistical significance for predicting risk, as they output correlations, not causal relationships.

Some of the reasons for failing to find causal variants are undetected differences between subpopulation groups and factors such as location, demographics, ancestry, health data, and lifestyle. The statistical problem is made worse by the fact that some variants have weak effects, while those with strong effects are rare and represent a low percentage of the population group.

The obvious solution is to enlarge the breadth and depth of the data sets available for analysis, as deep learning thrives on data. Instead of studying data on several thousand patients, expanding the data universe to hundreds of thousands exposes enough data for deep learning to find causal variants. Additionally, enriching the data with personal profile data, health data from wearables, lifestyle habits, and demographics makes prediction even more accurate.

Deciphering the genomic instructions of the cell and the impact of biological mechanisms requires an exponential growth in data, something that the Blockchain is ideally suited to achieve.

Scalable processing frameworks

The Blockchain Genomics platform is a cloud service designed to collect and aggregate consumer data. Each user, such as a researcher, is provided a dedicated workspace based on the Hadoop architecture. Hadoop is a distributed data store that provides a platform for implementing powerful parallel processing frameworks. Its reliability lies in storing massive volumes of data, coupled with its flexibility in running multiple processing frameworks.

There are five main building blocks inside this Hadoop runtime environment (from bottom to top):

  • The cluster is the set of host machines (nodes). Nodes may be partitioned into racks. This is the hardware part of the infrastructure.
  • The YARN infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs, memory) needed for application execution.
  • The HDFS Federation is the framework responsible for providing permanent, reliable, and distributed storage. This is typically used for storing inputs and outputs (but not intermediate data).
  • Alternative storage solutions may also be used; for instance, Amazon uses the Simple Storage Service (S3).
  • The MapReduce framework is the software layer implementing the MapReduce paradigm.
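The MapReduce paradigm itself can be illustrated with a minimal single-machine sketch in plain Python, here a word-count style job. In Hadoop, the map, shuffle, and reduce phases run distributed across the cluster's nodes; the toy records below are illustrative only.

```python
# Minimal single-machine sketch of the MapReduce paradigm:
# map emits (key, 1) pairs, shuffle groups values by key,
# reduce sums each group. Real Hadoop jobs distribute these
# phases across many nodes.
from collections import defaultdict

def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

records = ["gene variant", "variant risk", "gene gene"]
print(reduce_phase(shuffle(map_phase(records))))
# {'gene': 3, 'variant': 2, 'risk': 1}
```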

How Deep Learning works

Block23 AI includes automatic discovery of behavioral patterns based on intelligent processing of all collected data using different data mining techniques (hierarchical clustering, classification using SVMs and ANNs/DNNs, decision trees, probabilistic networks).

Logic patterns are extracted from the data using the Sequential Covering algorithm. This algorithm does not need to generate a decision tree first. Each pattern extracted for a given class covers many of the data rows of that class. Following the general strategy, logic patterns are learned one at a time. Each time a new pattern is learned, the data rows covered by that pattern are removed and the process continues with the remaining rows. An example of a pattern that can be extracted from data is the following:
P1: (age = youth) ^ (student = yes) => (buys computer = yes)

The following is the sequential covering algorithm, where logic patterns are learned for one class at a time. When learning a pattern for a class Ci, we want the pattern to cover data rows from class Ci only, and no data rows from any other class.

Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled data rows;
    Att_vals, the set of all attributes and their possible values.

Output: a set of logic patterns.

Pattern_set = { };   // the initial set of learned patterns is empty
for each class c do
    repeat
        Pattern = Learn_One_Pattern(D, Att_vals, c);
        remove data rows covered by Pattern from D;
        Pattern_set = Pattern_set + Pattern;   // add the new pattern to the set
    until termination condition;
end for
return Pattern_set;
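The pseudocode can be turned into a small runnable sketch. The row format (dicts with a "class" label), the greedy Learn_One_Pattern, and the toy data are all illustrative assumptions; real implementations typically use beam search with a quality measure such as FOIL gain.

```python
# Runnable sketch of sequential covering. Rows are dicts of attribute
# values plus a "class" label; Learn_One_Pattern is simplified to a
# greedy search that adds the (attribute = value) conjunct with the
# best precision on the target class.

def covers(pattern, row):
    return all(row.get(attr) == val for attr, val in pattern.items())

def learn_one_pattern(rows, target):
    """Greedily add conjuncts until the pattern covers only `target` rows."""
    pattern = {}
    while True:
        covered = [r for r in rows if covers(pattern, r)]
        if covered and all(r["class"] == target for r in covered):
            return pattern
        best, best_prec = None, -1.0
        # Candidate conjuncts come from rows of the target class.
        for row in rows:
            if row["class"] != target:
                continue
            for attr, val in row.items():
                if attr == "class" or attr in pattern:
                    continue
                trial = dict(pattern, **{attr: val})
                hits = [r for r in rows if covers(trial, r)]
                if not hits:
                    continue
                prec = sum(r["class"] == target for r in hits) / len(hits)
                if prec > best_prec:
                    best, best_prec = trial, prec
        if best is None:          # no conjunct left to add
            return pattern
        pattern = best

def sequential_covering(rows, classes):
    """Learn patterns one class at a time, removing covered rows as we go."""
    patterns, remaining = [], list(rows)
    for c in classes:
        while any(r["class"] == c for r in remaining):
            p = learn_one_pattern(remaining, c)
            covered = [r for r in remaining if covers(p, r)]
            if not covered or any(r["class"] != c for r in covered):
                break  # termination condition: no pure pattern found
            patterns.append((c, p))
            remaining = [r for r in remaining if not covers(p, r)]
    return patterns

data = [
    {"age": "youth",  "student": "yes", "class": "buys"},
    {"age": "youth",  "student": "no",  "class": "no_buy"},
    {"age": "senior", "student": "yes", "class": "buys"},
    {"age": "senior", "student": "no",  "class": "no_buy"},
]
print(sequential_covering(data, ["buys"]))
# [('buys', {'student': 'yes'})]
```

On this toy data, the single conjunct (student = yes) already covers all and only the "buys" rows, so one pattern suffices.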

After the initial process of learning patterns is finished, the learned pattern set must be pruned. The quality assessment is made on the original set of training data, so a pattern may perform well on the training data but less well on subsequent data. This is why pattern pruning is required.

A pattern is pruned by removing a conjunct. A pattern P is pruned if the pruned version of P has greater quality, as assessed on an independent set of data rows.
FOIL is one of the simplest and most effective methods for pruning. For a given pattern P,
FOIL_Prune(P) = (pos - neg) / (pos + neg), where pos and neg are the numbers of positive and negative data rows covered by P, respectively.

This value increases with the accuracy of P on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of P, we prune P.
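This pruning rule can be sketched as follows, assuming a dict-based {attribute: value} pattern representation and an illustrative held-out pruning set; both are assumptions for the example, not the platform's actual format.

```python
# Sketch of FOIL-based pruning: a conjunct is dropped whenever the
# pruned pattern scores a higher FOIL_Prune value on an independent
# pruning set. Patterns are dicts of {attribute: value} conjuncts.

def covers(pattern, row):
    return all(row.get(attr) == val for attr, val in pattern.items())

def foil_prune(pattern, pruning_set, target):
    """FOIL_Prune = (pos - neg) / (pos + neg) over rows covered by the pattern."""
    covered = [r for r in pruning_set if covers(pattern, r)]
    if not covered:
        return float("-inf")
    pos = sum(r["class"] == target for r in covered)
    neg = len(covered) - pos
    return (pos - neg) / (pos + neg)

def prune(pattern, pruning_set, target):
    """Greedily remove conjuncts while the FOIL_Prune score improves."""
    best = dict(pattern)
    best_score = foil_prune(best, pruning_set, target)
    improved = True
    while improved:
        improved = False
        for attr in list(best):
            trial = {a: v for a, v in best.items() if a != attr}
            if trial and foil_prune(trial, pruning_set, target) > best_score:
                best = trial
                best_score = foil_prune(trial, pruning_set, target)
                improved = True
    return best

# The overfitted conjunct (age = youth) is dropped because (student = yes)
# alone scores higher on this illustrative pruning set.
pruning_set = [
    {"age": "youth",  "student": "yes", "class": "buys"},
    {"age": "youth",  "student": "yes", "class": "no_buy"},
    {"age": "senior", "student": "yes", "class": "buys"},
    {"age": "senior", "student": "yes", "class": "buys"},
    {"age": "youth",  "student": "no",  "class": "no_buy"},
]
print(prune({"age": "youth", "student": "yes"}, pruning_set, "buys"))
# {'student': 'yes'}
```

Here the full pattern covers one positive and one negative row (score 0), while dropping the age conjunct covers three positives and one negative (score 0.5), so the shorter pattern is kept.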

Once all available consumer data is aggregated, it can be analyzed using dashboards assembled from over 25 different types of customizable visualization components.

The application’s visual controls enable users to rapidly view the data the way they want, challenging established BI solutions, which are typically rigid in the level of information detail and in the ways users can navigate.

Together with highly interactive capabilities, users can quickly and easily see patterns, trends, and unforeseen relationships and dependencies in their data. As a result, users are able to draw insights, inferences, and conclusions about causal relationships.