Bioinformatics 2: Bit-Encoding for DNA Sequences

In the bioinformatics domain, people often deal with different data sequences such as DNA, RNA, and protein. One common aspect (which is also a significant challenge) regardless of the sequence type is the amount of data that needs to be processed. In this article, we will narrow down our discussion to the DNA sequences.

In the bioinformatics domain, people often deal with different data sequences such as DNA, RNA, and protein. One common aspect (which is also a significant challenge) regardless of the sequence type is the amount of data that needs to be processed. In this article, we will narrow down our discussion to the DNA sequences.

Compared to the past, we have access to massive amounts of DNA sequence data due to the advancement in high-throughput sequencing technologies. For example, Next Generation Sequencing techniques produce hundreds and thousands of gigabytes worth data for a single organism. Processing these enormous amounts of data creates an interesting yet challenging computer science problem.

A Computer Science Perspective

1. Bit-Encoding for DNA Sequences

Bit-encoding is a simple yet efficient data compression technique. The principal idea is to reduce the number of bits used to represent a single nucleotide.

In standard C, a single character is represented by 8 bits or 1 byte(`char` data type). Using eight bits we can represent 2⁸ different characters. However, DNA sequences consist of only 4 alphabets {A, C, G, and T}, which can be represented using only 2 bits. We can leverage this to our advantage.

It would reduce memory consumption by 75%. For example, if the original sequence consumes 100GB, the encoded version would consume 25GB only, which is a significant reduction.

How can we do this?

Suppose the length of the DNA sequence is n,

Step 1: Allocate memory of size (n * 2)/8 bytes [n characters require n*2 number of bits. Divide by 8 for byte conversion]

Step 2: For each character in the given DNA sequence, set the corresponding bits in the allocated memory.

Data Science Course in Dallas

Become a data analysis expert using the R programming language in this [data science](https://360digitmg.com/usa/data-science-using-python-and-r-programming-in-dallas "data science") certification training in Dallas, TX. You will master data...

50 Data Science Jobs That Opened Just Last Week

Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments. Our latest survey report suggests that as the overall Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments, data scientists and AI practitioners should be aware of the skills and tools that the broader community is working on. A good grip in these skills will further help data science enthusiasts to get the best jobs that various industries in their data science functions are offering.

Applications Of Data Science On 3D Imagery Data

The agenda of the talk included an introduction to 3D data, its applications and case studies, 3D data alignment and more.

Data Science With Python Training | Python Data Science Course | Intellipaat

🔵 Intellipaat Data Science with Python course: https://intellipaat.com/python-for-data-science-training/In this Data Science With Python Training video, you...

77 Programming Language Q&A (P4)

Check the bottom of the page for links to the other questions and answers I’ve come up with to make you a great Computer Scientist (when it comes to Programming Languages).