Introduction

This article is aimed at people who are looking to “break into” the bioinformatics realm and have experience with R (ideally using the tidyverse). Bioinformatics can be a scary-sounding concept (at least it is for me) because it is such a vast and fast-developing field that it can be difficult to define exactly what it is. I’ve always thought that bioinformatics was a highly advanced field beyond what I was capable of doing, and that I would need years of technical training to begin actually doing it. But as with everything, it doesn’t actually take much to _begin_ doing something (it goes without saying that it does take years to _master_ something).

Acknowledging that I’m oversimplifying, bioinformatics is essentially the _in silico_ (or data-based) approach to answering biological questions. With the advent of more advanced sequencing technology and accompanying developments in statistical algorithms, we now have access to biological data at a scale and price that were previously unheard of, as well as the tools to extract insights from this data.

In this article, I aim to provide an example of an easy way that anyone who likes data, likes to work with R, and has an interest in this field can start _doing_ analyses in the bioinformatics realm.

There are loads of different types of questions and different types of biological data, so for this article, we will be performing a differential gene expression analysis (DGEA) using RNA-Seq data. Briefly, if you take a trip down memory lane to your high school/college biology classes, DGEA answers the question of whether the expression levels of genes change between different experimental conditions, and RNA-Seq (short for “RNA sequencing”) uses next-generation sequencing to quantify the amount of RNA in a given transcriptome (the set of all RNA transcripts). We’ll be getting the data from _recount2_. Essentially, it is a database maintained by Johns Hopkins that interfaces with R through its accompanying package, and the user can query RNA-Seq read count data from it. Counts are the number of sequenced reads that align to a particular region of a gene.
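To make the idea of “counts” and “differential expression” a little more concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions: the count matrix is made up (it is not recount2 data), the gene and sample names are hypothetical, and a plain log2 fold change of counts per million stands in for the proper statistical modeling that dedicated packages such as DESeq2, edgeR, or limma perform.

```r
# Toy read-count matrix (made-up numbers, NOT real recount2 data).
# Rows are genes, columns are samples; all names are hypothetical.
counts <- matrix(
  c(100, 120, 400, 380,
     50,  60,  55,  45,
    300, 280, 290, 310),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("ctrl_1", "ctrl_2", "treat_1", "treat_2")
  )
)

# Experimental condition of each sample, in column order
condition <- c("control", "control", "treated", "treated")

# Library size: total number of reads counted in each sample
lib_size <- colSums(counts)

# Counts per million (CPM): divide each column by its library size
# so that samples sequenced to different depths become comparable
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Average CPM of each gene within each condition
mean_ctrl  <- rowMeans(cpm[, condition == "control", drop = FALSE])
mean_treat <- rowMeans(cpm[, condition == "treated", drop = FALSE])

# log2 fold change (treated vs. control); the +1 pseudocount
# avoids taking the log of zero for unexpressed genes
log2_fc <- log2((mean_treat + 1) / (mean_ctrl + 1))

print(round(log2_fc, 2))
```

In this toy data, geneA gets a positive log2 fold change, meaning it looks more highly expressed in the treated samples. Because CPM is a relative measure, the other genes shift downward even though their raw counts barely move, which is one reason a real analysis relies on more robust normalization and a statistical test rather than a raw fold change.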


How to Start Learning Bioinformatics and Not Get Intimidated (With R)