Background

While a New York subway station bustles every day with swarms of businessmen, students, artists, and millions of other city-goers, its floors, railings, stairways, toilets, walls, kiosks, and benches are teeming with non-human life. The microbial ecosystem, the complex web of relationships that microorganisms have with one another and with their environment, is omnipresent in shared public transportation spaces. PathoMap, a 2013 project led by Cornell Professor Christopher Mason, was the first of an annual series of DNA collection projects at various locations around the world. A sampling of NYC subways found that only half of the DNA matched known organisms. This initial project confirmed that the urban microbiome was still a relatively unexplored field, virtually begging for researchers to seize the opportunity. The success of the project led to the creation of the MetaSUB international consortium, and ever since, intensive studies have been carried out on microbial samples from urban locations around the globe. The field is still relatively new, and the potential applications of such projects are wide-ranging.

But what happens after the samples are collected? As I learned this summer, there is no magical one-step formula that outputs clean, categorized, analyzed, and graphed data. I had the opportunity to work with MetaSUB microbial data and learn the painstaking yet satisfying process of biological data manipulation and visualization.

Collecting & Cleaning the Data in Linux Bash Terminal

Before the analysis is run, swab samples are collected from specified locations, and DNA libraries are prepared for paired-end sequencing. Because both ends of each fragment are sequenced, this type of sequencing allows for more precise reading of the DNA, results in better alignment, and helps detect rearrangements.

While the biological sampling yields a plethora of data, not all of it is relevant for analysis. Adapter sequences, low-quality bases, and human DNA are all extraneous data in the set. For my project, the first goal was to clean and categorize this data in the Linux Bash terminal so that visual analyses could then be run in RStudio.

All reading and cleaning of the data is done in the Linux Bash terminal. Read files are stored in FASTQ format (as seen in Figure 1), in which each read is recorded on four lines:

  • Line 1 begins with the @ symbol and contains the identifier of the sequence, which summarizes where the sequence of bases is found
  • Line 2 provides the actual raw letters of the nitrogenous base sequence
  • Line 3 starts with the plus sign and may be followed by the sequence identifier again
  • Line 4 is the last line of the record and contains the quality score of each base as an ASCII symbol, each symbol corresponding to a Phred Quality (Q) score
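To make the four-line structure concrete, here is a minimal Python sketch that reads FASTQ records and decodes the quality line. It is not part of the project's pipeline; the names `read_fastq`, `phred_scores`, and `error_probability` are my own, and the offset of 33 assumes the common Phred+33 (Sanger) encoding, which may not match every file.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, the raw bases
    quality: str     # line 4, one ASCII symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality symbols to Phred Q scores (assumes Phred+33)."""
    return [ord(symbol) - offset for symbol in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical one-record example, made up for illustration.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, phred_scores(record.quality))
```

In this made-up record, the symbol `I` decodes to Q40 and `!` to Q0, so the last base would be a clear candidate for trimming.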

Figure 1

However, the raw files still contain unwanted sequence, including adapters (short synthetic sequences ligated to the DNA fragments during library preparation) and low-quality bases (which can be identified from the ASCII-encoded Phred Quality scores). The AdapterRemoval tool, run from the Linux terminal, removes these adapter sequences and trims low-quality bases from the reads, as seen in Figure 2.

Figure 2

New FASTQ files of trimmed reads are created once the sample reads are identified, AdapterRemoval is run, the reads are filtered against a minimum quality score, and the files are unzipped (see the first line of code in Figure 2).
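To illustrate what trimming does to a single read, here is a simplified Python sketch. It is not how AdapterRemoval works internally (the real tool handles mismatches, paired reads, and more); the functions `trim_adapter` and `trim_low_quality_tail`, the minimum overlap of 3, the Q20 threshold, and the Phred+33 offset are all assumptions chosen for illustration.

```python
from typing import Tuple


def trim_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Cut the read where the adapter starts (exact matches only).

    Looks first for the full adapter anywhere in the read, then for a
    prefix of the adapter hanging off the 3' end of the read.
    """
    cut = sequence.find(adapter)
    if cut == -1:
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found
    return sequence[:cut], quality[:cut]


def trim_low_quality_tail(sequence: str, quality: str, min_q: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read whose last 7 bases are the start of an adapter.
adapter = "AGATCGGAAGAGC"        # illustrative adapter sequence
read = "ACGTACGTAGATCGG"
qual = "IIIIII##IIIIIII"

read, qual = trim_adapter(read, qual, adapter)
read, qual = trim_low_quality_tail(read, qual)
print(read, qual)
```

The adapter is removed before the quality trim so that low-quality bases hidden behind the adapter are exposed at the end of the read and can be trimmed as well.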

Although these sequences have been removed, the reads are still not thoroughly cleaned. For this project, only microbial data is needed. However, DNA samples from a public transportation space will contain mostly (close to 99%) human DNA since the bacterial genome is more than 1600 times smaller than the human one. Fortunately, most of this human DNA had already been stripped in earlier stages of the project, though some may remain. To remove what is left, the reads are aligned against the human reference genome with the bowtie2 aligner, and only the reads that do not align are kept.
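As a rough sketch of how such a filtering step might be scripted, the Python snippet below only builds a bowtie2 command line and prints it. The index prefix and all file names are hypothetical placeholders, not the ones used in this project, and the helper `bowtie2_host_filter_command` is my own. The options shown (`-x`, `-1`, `-2`, `-p`, `-S`, `--un-conc-gz`) are standard bowtie2 options, but the exact settings used in the project are not given here.

```python
from typing import List


def bowtie2_host_filter_command(index_prefix: str, reads_1: str, reads_2: str,
                                unaligned_path: str,
                                threads: int = 4) -> List[str]:
    """Build a bowtie2 call that keeps read pairs not aligning to the host."""
    return [
        "bowtie2",
        "-p", str(threads),             # number of alignment threads
        "-x", index_prefix,             # prefix of a prebuilt human genome index
        "-1", reads_1,                  # trimmed forward reads
        "-2", reads_2,                  # trimmed reverse reads
        "--un-conc-gz", unaligned_path, # gzipped pairs that did not align concordantly
        "-S", "/dev/null",              # discard the alignments themselves
    ]


# All paths below are placeholders for illustration only.
command = bowtie2_host_filter_command(
    index_prefix="human_index",
    reads_1="sample.trimmed.R1.fastq",
    reads_2="sample.trimmed.R2.fastq",
    unaligned_path="sample.nonhuman.fastq.gz",
)
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where bowtie2 and a human index are installed; bowtie2 derives the two mate file names from the `--un-conc-gz` path. Pairs that fail to align concordantly may still include a human mate, so a stricter filter could be worth considering.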
