Visualizing High-Dimensional Microbiome Data

This article is part of a tutorial series on applying machine learning to bioinformatics data:

Part 1 — Data Acquisition & Preprocessing

Part 2 — Dimensionality Reduction

To follow along, you can either download our Jupyter notebook here, or continue reading and type in the code as you proceed through the walkthrough.


Unsupervised machine learning methods allow us to understand and explore data in situations where we are not given explicit labels. One family of such methods is clustering. Identifying groups, or clusters, of similar data points can reveal underlying structural patterns in our data, such as geography, functional similarity, or community membership, when we would not otherwise know this information beforehand.

We will be applying our dimensionality reduction techniques to microbiome data acquired from UCSD's Qiita platform. If you haven't already done so, see Part 1 of this tutorial series for how to acquire and preprocess your data, or alternatively download our notebook for that section here. We will need this before moving on. In brief, the columns of our microbiome dataset represent counts of bacterial DNA sequences, and the rows represent samples of individual bacterial communities. A data table like this can be produced from Illumina NGS sequencing data after a series of bioinformatic cleaning and transformation steps. Because bacterial communities are shaped by their environment, we expect samples from different environments to have distinct microbial signatures. The data we work with in this article are samples taken from Toronto, Flagstaff, and San Diego, which should differ from one another. We hope to visualize this difference, which is hidden somewhere in their bacterial composition.
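To make the shape of this table concrete, here is a toy stand-in built by hand. The sample IDs and sequence names are invented for illustration; the real feature table from Part 1 has the same layout but roughly 25,000 sequence columns.

```python
import pandas as pd

# Toy stand-in for the real feature table from Part 1: rows are
# samples (individual bacterial communities), columns are counts of
# unique bacterial DNA sequences. All names here are hypothetical.
counts = pd.DataFrame(
    {"seq_ACGT": [12, 0, 3], "seq_TTGA": [0, 45, 1], "seq_GGCA": [7, 2, 0]},
    index=["toronto_01", "flagstaff_01", "san_diego_01"],
)

print(counts.shape)  # (3, 3): three samples by three sequence features
```

Note how sparse even this tiny table is: most sequences appear in only some samples, which is typical of microbiome count data.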

Bacterial communities are expected to be unique across the three different locations, and we hope to visualize that through the high-dimensional microbiome data. Image Source: Pexels, modified by user.

To visualize this complex, sparse, and high-dimensional metagenomic data as something our eyes can interpret on a two-dimensional computer screen, we will need to drastically reduce the number of dimensions, in other words, the number of features in our data. Rather than plotting the roughly 25,000 columns of our dataset, each of which currently represents the count of a portion of an organism's genetic sequence in our microbiome, we would like a notion of "the most important features" to plot. This article explores three dimensionality reduction and visualization techniques applied to microbiome data and explains what these visualizations can tell us about the structure inherent in the data.
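As a sketch of the basic idea, here is one common dimensionality reduction technique, PCA, projecting a wide table down to two coordinates per sample. This is an illustrative assumption, not necessarily one of the three techniques covered later, and the data here are random stand-ins for the real 25,000-column table.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random count-like data standing in for the real feature table:
# 30 samples, 500 sequence-count features (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(30, 500)).astype(float)

# Project every sample down to the two directions that capture the
# most variance, giving one (x, y) point per sample to plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

print(coords.shape)  # (30, 2)
```

Whatever technique is used, the output has this shape: one low-dimensional point per sample, ready for a scatter plot.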

All visualizations were produced in Python with the Matplotlib and Seaborn plotting packages; pandas was used for data frame construction.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

For demonstration purposes, since we actually do have labels for this dataset, we can confirm whether our visualizations separate the data well by assigning a different color to each point according to its geographic location. In practice, you will often not have such labels, which is why you would be taking the unsupervised machine learning approach in the first place.
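Coloring points by a known label is straightforward with Seaborn's `hue` argument. The embedding coordinates below are random placeholders; in the walkthrough they would come from the dimensionality reduction step, with one row per sample and its known collection site.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical 2-D embedding: "PC1"/"PC2" stand in for the two
# coordinates produced by a dimensionality reduction technique.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "PC1": rng.normal(size=30),
    "PC2": rng.normal(size=30),
    "location": np.repeat(["Toronto", "Flagstaff", "San Diego"], 10),
})

# One color per geographic location, since here we do know the labels.
ax = sns.scatterplot(data=df, x="PC1", y="PC2", hue="location")
plt.savefig("embedding.png")
```

If the reduction worked well, the three colors should form visibly separated groups rather than one mixed cloud.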

#genomics #machine-learning #bioinformatics #microbiome #data-science
