Project Overview

The project is carried out by an interdisciplinary team of experts in data integration, systems medicine and systems biology, machine learning in genetics, biomedical text mining, oncology, cancer biology, and bioinformatics. The team is small and highly focused on the particular applications and cancer entities under study in the project. The work is organized into five work packages; each work package combines the expertise of several PIs, of whom the first listed has overall responsibility.

WP 1: Integrated Disease Knowledge Base (Ulf Leser, Nils Blüthgen, Christine Sers)

This work package will develop the cancer-specific disease knowledge base (DKB) that serves as the basis for the three applications described in WP3 to WP5. The DKB will consist of a database engine storing information on mutations, drugs, treatments, profiles, etc.; a middleware that caches mutation profiles in main memory to allow fast searching and comparison; and a REST API that grants authorized access to its functionality and content. The main part of the work package is concerned with obtaining, restructuring, and integrating relevant data from public and in-house data sources.
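The in-memory caching of mutation profiles for fast searching and comparison could be sketched as follows. This is only an illustration under stated assumptions: the class name ProfileCache, its methods, the mutation labels, and the choice of Jaccard similarity as the comparison measure are all hypothetical and are not specified by the project.

```python
from collections import defaultdict


class ProfileCache:
    """Hypothetical in-memory cache of mutation profiles (illustrative only)."""

    def __init__(self):
        self._profiles = {}                    # profile id -> frozenset of mutation labels
        self._by_mutation = defaultdict(set)   # mutation label -> ids of profiles carrying it

    def remove(self, profile_id):
        """Drop a profile and its index entries; unknown ids are ignored."""
        for mutation in self._profiles.pop(profile_id, frozenset()):
            self._by_mutation[mutation].discard(profile_id)

    def add(self, profile_id, mutations):
        """Insert a profile, replacing any earlier profile with the same id."""
        self.remove(profile_id)
        profile = frozenset(mutations)
        self._profiles[profile_id] = profile
        for mutation in profile:
            self._by_mutation[mutation].add(profile_id)

    def with_mutation(self, mutation):
        """Return the ids of all cached profiles that carry the given mutation."""
        return set(self._by_mutation.get(mutation, set()))

    def most_similar(self, mutations, top_k=3):
        """Rank cached profiles by Jaccard similarity to the query profile."""
        query = frozenset(mutations)
        # Only profiles sharing at least one mutation can score above zero.
        candidates = set()
        for mutation in query:
            candidates |= self._by_mutation.get(mutation, set())
        scored = []
        for pid in candidates:
            profile = self._profiles[pid]
            scored.append((len(query & profile) / len(query | profile), pid))
        # Highest similarity first; ties broken by profile id for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(pid, score) for score, pid in scored[:top_k]]


# Made-up example profiles.
cache = ProfileCache()
cache.add("P1", ["KRAS G12D", "TP53 R175H"])
cache.add("P2", ["KRAS G12D", "BRAF V600E"])
cache.add("P3", ["EGFR L858R"])
similar = cache.most_similar(["KRAS G12D", "TP53 R175H"])
```

The inverted index from mutation to profile ids keeps single-mutation lookups cheap and restricts similarity computations to profiles that overlap with the query at all. A REST layer would sit on top of such a structure.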

WP 2: Text Mining of Scientific Articles (Ulf Leser, Marius Kloft, Reinhold Schäfer, Ulrich Keilholz)

In this work package, we will adapt and develop advanced techniques for the computational analysis of scientific English texts. All methods will be applied to the whole of PubMed (approx. 23 million citations and abstracts), the PLoS text mining collection (approx. 600K full texts), free-text reports from the databases integrated into the cancer knowledge base (see WP1), and the Elsevier corpus (Elsevier has offered UL free access for text mining). Texts will be preprocessed using Natural Language Processing (sentence splitting, tokenization, part-of-speech tagging, lemmatization), and then advanced text mining routines will be applied to identify genes, miRNAs, proteins, diseases, drugs (chemical and product names), and different types of mutations (SNPs, structural variations, copy number variations). Based on these entities, machine-learning-based relationship identification will find relations between diseases and mutations, genes and diseases, and drugs and mutations or diseases. All extracted information will be ranked by confidence and by relevance to the two tumor entities studied in this project; relevance will be computed by a text classification algorithm (MK) trained on a set of texts provided by experts (RS, UK). The ranked results will then be presented in batches to medical and biological experts, who judge their correctness and usefulness. This curation serves two purposes: first, it ensures that only confirmed information enters the cancer knowledge base (see WP1); second, it generates novel training data, which will be used to improve the text mining models iteratively. Curation will be supported by specialized software that lets experts review a large number of extracted facts quickly and conveniently.
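The flow from preprocessing to entity and relation candidates can be illustrated with a deliberately simplified sketch. It substitutes naive stand-ins for the methods named above: regular-expression sentence splitting and tokenization, a toy dictionary lookup in place of trained entity taggers, and sentence-level co-occurrence in place of machine-learning-based relationship identification. The lexicon, the function names, and the example text are all made up for illustration.

```python
import re
from itertools import product

# Hypothetical toy lexicon; a real system would use curated resources or trained taggers.
LEXICON = {
    "gene": {"KRAS", "BRAF", "EGFR"},
    "drug": {"cetuximab", "vemurafenib"},
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Extract alphanumeric tokens, allowing hyphens inside a token."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)


def tag_entities(tokens):
    """Return (entity_type, lexicon_name) for every token found in the lexicon."""
    entities = []
    for token in tokens:
        for entity_type, names in LEXICON.items():
            # Try the token as written, then lowercased (drug names may be capitalized).
            for candidate in (token, token.lower()):
                if candidate in names:
                    entities.append((entity_type, candidate))
                    break
    return entities


def candidate_relations(text):
    """Pair every gene with every drug mentioned in the same sentence."""
    relations = []
    for sentence in split_sentences(text):
        entities = tag_entities(tokenize(sentence))
        genes = [name for etype, name in entities if etype == "gene"]
        drugs = [name for etype, name in entities if etype == "drug"]
        for gene, drug in product(genes, drugs):
            relations.append((gene, drug, sentence))
    return relations


# Made-up example text.
text = "KRAS mutations predict resistance to cetuximab. Vemurafenib targets BRAF V600E."
relations = candidate_relations(text)
```

Each candidate keeps its source sentence, which is the kind of evidence a curator would need when judging an extracted fact. Co-occurrence yields many false positives, which is one reason a learned relation classifier and expert curation are planned.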

WP 3: Cancer Mutation Profile Assessment Workbench (Nils Blüthgen, Ulrich Keilholz, Ulf Leser)

The main aim of this work package is to make the data in the core knowledge base (WP1), including text mining data (WP2) and the predictions from WP5, available for clinical decision making. The central framework in which clinical treatment decisions are made within the CCCC is the tumor board. Here, clinicians from different disciplines discuss current cases and decide on treatment based on the patients' data and history. Genomic data currently plays only a minor role in these boards, largely because of the complexity of the data and the overwhelming mass of unstructured knowledge, which makes it impossible for clinicians to assess the consequences of individual genomic alterations, especially given the large number of cases that have to be dealt with in a short time. To address the urgent clinical need of supporting the tumor board with well-structured, easily digestible information about the genomic alterations of a given tumor and their possible consequences, we will develop a Cancer Mutation Profile Assessment Workbench. Given a patient’s mutation profile obtained through genome, exome, or panel sequencing, this workbench will make available a wealth of relevant information: drugs reported or known to target one of the mutated genes; current or past clinical trials that include patients with the (completely or partially) same genomic profile for this particular tumor; trials for other tumor types that may suggest that a certain drug can or cannot be used in this tumor; selected publications dealing with the mutation and the tumor type (WP2), potentially in combination with already administered drugs; results of the prediction of drug sensitivity (WP5); pathways affected by the given mutations; and Systems Biology models that contain genes affected by a mutation.

WP 4: Data Mining Software for Design and Evaluation of Umbrella and Basket Studies (Nils Blüthgen, Ulrich Keilholz, Christine Sers, Reinhold Schäfer)

This work package aims at developing a software pipeline that supports a clinical researcher in the design of new “arms” of an umbrella or basket study. In umbrella trials, patients with the same tumor entity are assigned to small trial cohorts with different treatments based on a particular molecular profile. In basket trials, groups of patients with different tumor entities but identical driver mutations (or driver profiles) are included in a single trial cohort. One of the main challenges in developing a new arm of an umbrella study is to define the patient subgroup that is most likely to respond to the tested treatment and the subgroups that are most likely not to respond. In this work package we will develop software that (a) helps researchers find properties (especially biomarkers) of patients that characterize potential members of the study, and (b) helps them find patients carrying these properties.
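The difference between the two trial designs, and step (b) of finding patients who carry given properties, can be sketched in a few lines. The Patient record, the function names, the arm names, and the example data below are hypothetical; the sketch treats a biomarker profile simply as a set of mutation labels and uses exact set containment as the matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical minimal patient record used only for this illustration."""
    patient_id: str
    tumor_entity: str
    mutations: frozenset


def basket_cohort(patients, driver_mutations):
    """Basket design: patients of any tumor entity carrying all given driver mutations."""
    required = frozenset(driver_mutations)
    return [p.patient_id for p in patients if required <= p.mutations]


def umbrella_arms(patients, tumor_entity, arm_profiles):
    """Umbrella design: patients of one tumor entity are assigned to the first arm
    whose molecular profile they carry; the rest are collected under 'unassigned'."""
    arms = {name: [] for name in arm_profiles}
    arms["unassigned"] = []
    for p in patients:
        if p.tumor_entity != tumor_entity:
            continue
        for name, profile in arm_profiles.items():
            if frozenset(profile) <= p.mutations:
                arms[name].append(p.patient_id)
                break
        else:
            # No arm profile matched this patient.
            arms["unassigned"].append(p.patient_id)
    return arms


# Made-up example data; tumor entities and mutations are placeholders.
patients = [
    Patient("A", "colorectal", frozenset({"KRAS G12D"})),
    Patient("B", "colorectal", frozenset({"BRAF V600E"})),
    Patient("C", "melanoma", frozenset({"BRAF V600E"})),
    Patient("D", "colorectal", frozenset({"TP53 R175H"})),
]
basket = basket_cohort(patients, ["BRAF V600E"])
arms = umbrella_arms(
    patients,
    "colorectal",
    {"kras_arm": ["KRAS G12D"], "braf_arm": ["BRAF V600E"]},
)
```

A real pipeline would have to go beyond exact set containment, for example by ranking candidate biomarkers by how well they separate likely responders from likely non-responders, which is step (a) above.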

WP 5: Model-Based Prediction of Drug Responsiveness (Marius Kloft, Nils Blüthgen, Christine Sers)

We will develop an effective machine-learning methodology for predicting the expected efficacy of drugs in cancer treatment from genomic subtypes, and for identifying the associated biomarkers that vary most significantly. These predictions are essential for trial design in WP4, and will partly be verified experimentally in WP4. As training and initial validation data, we will use the drug-screening data integrated into the PREDICT cancer knowledge base (see WP1). Screening results will be connected to mutation profiles by integrating genomic information on the cell lines from the Cancer Cell Line Encyclopedia. In this project we generally strive for the maximal possible prediction accuracy, because inaccurate predictions may lead to suboptimal treatment of patients. However, given the wealth of potential drugs that could be considered for umbrella trials in WP4, our focus will be on high-accuracy predictions for ineffective drugs; that is, our central aim is to exclude interventions for which we predict no effect with high certainty.
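One way to read the focus on excluding ineffective drugs with high certainty is as a precision constraint on the "ineffective" class. The sketch below assumes, hypothetically, that some trained model outputs a score between 0 and 1 for "this drug is ineffective for this profile" and that labeled validation data is available; it then picks the lowest score threshold at which the exclusion rule reaches a target precision. The function names, the target value, and all numbers are made up, and this is not the project's stated method.

```python
def precision_at(threshold, scores, is_ineffective):
    """Precision of the rule 'score >= threshold means ineffective' on labeled data.

    Returns None if the rule flags nothing.
    """
    flagged = [truth for score, truth in zip(scores, is_ineffective) if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def pick_threshold(scores, is_ineffective, min_precision=0.95):
    """Lowest observed score whose exclusion rule reaches the target precision, else None."""
    for threshold in sorted(set(scores)):
        precision = precision_at(threshold, scores, is_ineffective)
        if precision is not None and precision >= min_precision:
            return threshold
    return None


def drugs_to_exclude(predictions, threshold):
    """Names of drugs whose predicted ineffectiveness score reaches the threshold."""
    return sorted(name for name, score in predictions.items() if score >= threshold)


# Made-up validation scores and labels (True = drug was in fact ineffective).
val_scores = [0.1, 0.4, 0.6, 0.8, 0.9, 0.95]
val_labels = [False, False, True, False, True, True]
threshold = pick_threshold(val_scores, val_labels, min_precision=0.95)

# Made-up predictions for new candidate drugs.
excluded = drugs_to_exclude({"drug_a": 0.97, "drug_b": 0.5}, threshold)
```

Choosing the lowest qualifying threshold excludes as many drugs as possible while keeping the share of wrongly excluded drugs on the validation data below the chosen bound. With little validation data, the precision estimate itself is uncertain, so a confidence bound would be more appropriate than the point estimate used here.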