One of the key skills a data scientist should have is being able to wrangle data from a variety of sources. In this blog, I will discuss three unorthodox data types and how you can get started working with them.
We live in the age of information, a time when there is more data at our fingertips than at any other point in history, and it's growing.
IDC predicts the world’s data will grow to 175 zettabytes in 2025 … If you attempted to download 175 zettabytes at the average current internet connection speed, it would take you 1.8 billion years to download.
That is a lot of data. So why are people still using the same airline CSVs or soccer player statistics? This is a trap I myself have fallen into on occasion, and I think it happens mainly for two reasons: laziness and familiarity. Laziness, because we know how easy it is to use pre-cleaned CSVs. Familiarity, because we already know how to access that data and can get straight to work on our projects rather than messing around unpacking and importing less familiar data types.
In this blog, I am going to deviate from this status quo and explore three alternative types of data to work with, and how to get started with them.
WAV File Data (Easy)
JPEG File Data (Medium)
APK Code Data (Advanced)
Many people new to the field of data science and signal processing assume that opening and analyzing sound data is a complicated and advanced process. Although some of the theory behind signal processing can get a bit advanced, actually opening and working with sound files is astonishingly easy, thanks to SciPy’s scipy.io.wavfile module.
Waveform Audio File Format, or WAV (file extension .wav), is a common format for storing an audio bitstream on PCs. If you work with signal processing or audio and sound engineering, you will regularly encounter .wav files and have to perform various transformations on them.
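To see what a .wav file stores alongside the raw samples, here is a minimal sketch using Python's standard-library wave module (not the SciPy reader used in the rest of this post). It writes a one-second stereo test tone to a temporary file and then reads the header back. The file name tone.wav and the 440 Hz tone are assumptions made purely for illustration.

```python
import math
import os
import struct
import tempfile
import wave

# Write a one-second stereo test tone so the example needs no external file.
# The file name and tone parameters are hypothetical.
sample_rate = 44100
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(2)            # stereo: left and right
    wf.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
    wf.setframerate(sample_rate)  # samples per second, per channel
    frames = bytearray()
    for n in range(sample_rate):
        value = int(10000 * math.sin(2 * math.pi * 440 * n / sample_rate))
        frames += struct.pack("<hh", value, value)  # same sample on both channels
    wf.writeframes(bytes(frames))

# Read the header back: these fields describe how to interpret the raw samples.
with wave.open(path, "rb") as wf:
    n_channels = wf.getnchannels()
    sample_width = wf.getsampwidth()
    frame_rate = wf.getframerate()
    n_frames = wf.getnframes()

duration = n_frames / frame_rate  # frames divided by frames per second
print(n_channels, sample_width * 8, frame_rate, n_frames, duration)
```

The header fields (channel count, bit depth, sampling rate, frame count) are all you need to interpret the samples that follow, which is a large part of why the format is so easy to work with.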
Working with .wav files is easy and quick. Just pick a sound file you are interested in taking a closer look at. I have chosen to use Benjamin Tissot’s Actionable, a royalty-free rock song. Let’s import scipy.io.wavfile and open up the song:
import numpy as np                    # for data transformation
import matplotlib.pyplot as plt       # for visualizing the data
import scipy.io.wavfile as wavfile    # for opening the data

Fs, aud = wavfile.read('Actionable-BenjaminTissot.wav')
In this case, “Fs” is the sampling rate, the number of audio samples carried per second, and “aud” is the audio itself: an array of sample amplitudes, which track the sound pressure over time. Most modern sound files have two audio channels, a left and a right for stereo sound. Let’s focus on just one channel and see what we can learn about the audio file:
aud = aud[:, 0]  # pick just the left channel
print("Sample rate: " + str(Fs))
# Duration is just total samples over sampling rate
print("Duration: " + str(aud.shape[0] / Fs) + " seconds")

Output:
Sample rate: 44100
Duration: 122.80163265306122 seconds
As we can see, the sampling rate is 44100 Hz or 44.1 kHz, the standard for most high-quality audio files. By dividing the total samples by the sampling rate we get the length of the song, which is roughly two minutes. Let's visualize the first 1.5 seconds of this audio file:
first = aud[:int(Fs * 1.5)]  # snip just the first 1.5 seconds
plt.plot(first)
plt.ylabel("Amplitude (raw sample value)")  # wavfile.read returns raw samples, not pascals
plt.xlabel("Sample Count")
plt.title("Original Audio (1.5s)")
plt.show()
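The slice above is plotted against sample count. If you would rather read the x-axis in seconds, or you do not have the song on disk, the sketch below may help. It synthesizes a three-second stereo tone, round-trips it through a temporary .wav file with scipy.io.wavfile, and converts sample indices to time by dividing by the sampling rate. The synthetic 440 Hz tone, the file name tone.wav, and the variable names left and seconds are assumptions for illustration and are not part of the original walkthrough.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile as wavfile

# Synthesize 3 seconds of a 440 Hz stereo tone as 16-bit PCM (hypothetical test signal).
Fs = 44100
t_full = np.arange(3 * Fs) / Fs
tone = (10000 * np.sin(2 * np.pi * 440 * t_full)).astype(np.int16)
stereo = np.column_stack([tone, tone])  # shape: (samples, 2 channels)

# Round-trip through a temporary WAV file, then read it as in the walkthrough.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, Fs, stereo)
Fs_read, aud = wavfile.read(path)

left = aud[:, 0]                            # left channel only
first = left[: int(Fs_read * 1.5)]          # first 1.5 seconds
seconds = np.arange(len(first)) / Fs_read   # sample index -> time in seconds

print(Fs_read, aud.shape, len(first), seconds[-1])
```

Passing both arrays to the plot call, as in plt.plot(seconds, first) with an x-axis label of "Time (s)", would then give the same picture with a time axis instead of a sample count.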