“Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).
Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature. We need your help to develop a Machine Learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.”
~ Kaggle
_For the implementation, please follow the link:_ https://github.com/vedanshsharma/Personalized-Cancer-Diagnosis
Identifying the type of a gene variation is typically a three-step process. Our task is to use a model to automate the third step, which is the most time-consuming one for a molecular pathologist: analyzing the evidence related to each variation in order to classify it. More formally, our problem statement is -
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462
There are nine different classes a genetic variation/mutation can be classified into. This makes our problem a **multi-class classification** problem. The performance metrics that we will be using are -
So our final objective is-
Objective: Predict the probability of each data-point belonging to each of the nine classes.
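Because the objective asks for a probability per class rather than a single label, a probability-aware metric is a natural fit. The snippet below is a minimal sketch of multi-class log loss in plain Python, assuming labels are integers 1 to 9 and each prediction is a list of nine probabilities. The function name `multiclass_log_loss` and the toy inputs are illustrative and not taken from the linked repository.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class labels in 1..9 (assumed encoding)
    y_prob: list of 9-element probability lists, one per data point
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so that a confident but wrong prediction never hits log(0).
        p = min(max(probs[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A "model" that knows nothing assigns 1/9 to every class.
uniform = [[1 / 9] * 9 for _ in range(3)]
baseline_loss = multiclass_log_loss([1, 4, 9], uniform)
```

The uniform prediction gives a loss of ln(9), which is a useful reference point: a trained model should score below it.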
And our final constraints are-
Our data is not temporal in nature, i.e. it doesn't change with time. Hence we will randomly split our data set into train, CV and test sets with sizes of 64% (80% of 80%), 16% (20% of 80%) and 20% respectively. We will split the data in such a way that the distribution of class labels is preserved in each set.
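The two-stage stratified split described above can be sketched as follows. This is a dependency-free illustration on hypothetical toy labels, not the code from the linked repository. In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job in one call per stage.

```python
import random
from collections import defaultdict

def stratified_split(labels, held_out_frac, seed=0):
    """Return (rest_idx, held_out_idx), holding out the same fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rest, held_out = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * held_out_frac)
        held_out.extend(idxs[:cut])
        rest.extend(idxs[cut:])
    return rest, held_out

# Toy labels: 9 classes with 100 points each (hypothetical data).
labels = [c for c in range(1, 10) for _ in range(100)]

# First cut: 80% train+CV, 20% test.
train_cv_idx, test_idx = stratified_split(labels, 0.20, seed=42)

# Second cut: 20% of the remaining 80% becomes CV (16% overall).
sub_labels = [labels[i] for i in train_cv_idx]
train_pos, cv_pos = stratified_split(sub_labels, 0.20, seed=42)
train_idx = [train_cv_idx[i] for i in train_pos]
cv_idx = [train_cv_idx[i] for i in cv_pos]
```

On this toy set the three parts contain 576, 144 and 180 points, i.e. 64%, 16% and 20%, and every class keeps the same share in each part.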
We will be doing some basic text preprocessing which includes-
There are a few null values present in the text.
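As a rough illustration of what basic preprocessing and null handling can look like, here is a minimal sketch. The specific cleaning steps (lowercasing, dropping special characters, collapsing whitespace), the column names `Gene`, `Variation` and `TEXT`, and the fallback of filling an empty text with the gene and variation names are all assumptions made for this example. The linked repository may do this differently.

```python
import re

def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", str(text))
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

def preprocess_rows(rows):
    """rows: list of dicts with (assumed) keys 'Gene', 'Variation', 'TEXT'."""
    cleaned = []
    for row in rows:
        text = row.get("TEXT")
        if text is None or not str(text).strip():
            # Assumed fallback for null text: use the gene and variation names.
            text = f"{row['Gene']} {row['Variation']}"
        cleaned.append({**row, "TEXT": clean_text(text)})
    return cleaned

# Hypothetical rows, one of them with a null text field.
rows = [
    {"Gene": "BRCA1", "Variation": "T1773I", "TEXT": "Germline  mutations, in BRCA1!"},
    {"Gene": "TP53", "Variation": "R175H", "TEXT": None},
]
result = preprocess_rows(rows)
```

Filling nulls this way keeps every row usable by a text vectorizer. Dropping the affected rows is the simpler alternative when there are only a handful of them.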
#cancer #machine-learning #random-forest #data analysis