Three Often Overlooked Sources of Data for your Next Passion Project

We live in the age of information, a time when there is more data at our fingertips than at any other point in history, and it’s growing.

IDC predicts the world’s data will grow to 175 zettabytes in 2025 … If you attempted to download 175 zettabytes at the average current internet connection speed, it would take you 1.8 billion years to download.

That is a lot of data. So why are people still using the same airline CSVs or soccer player statistics? This is a trap I myself have fallen into on occasion and I think it happens mainly for two reasons: laziness and familiarity. Laziness, because we know how easy it is to use precleaned CSVs. Familiarity, because we know how to access the data and get to work on our projects rather than messing around unpacking and importing less familiar data types.

In this blog, I am going to deviate from this status quo and explore three alternative types of data to work with, and how to get started with them.

WAV File Data (Easy)

JPEG File Data (Medium)

APK Code Data (Advanced)

(EASY) Unlocking the Data of WAV Sound Files

Many people new to the field of data science and signal processing assume that opening and analyzing sound data is a complicated and advanced process. Although some of the theory behind signal processing can get a bit advanced, actually opening and working with sound files is astonishingly easy, thanks to SciPy’s scipy.io.wavfile module.

Waveform Audio File Format, or WAV (.wav), is a common format for storing an audio bitstream on PCs. If you work in signal processing or in audio and sound engineering, you will regularly encounter .wav files and have to perform various transformations on them.
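To make the format a little more concrete, here is a minimal sketch that reads the header fields of a .wav file with Python's standard-library wave module. The file name and tone parameters are hypothetical: the script writes its own one-second test tone so it runs without any downloaded audio.

```python
import math
import os
import struct
import tempfile
import wave

# Hypothetical test file: a one-second 440 Hz mono tone, written here so the
# example is self-contained. Swap in the path to any PCM .wav file instead.
path = os.path.join(tempfile.gettempdir(), "example_tone.wav")
rate = 8000
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]

with wave.open(path, "wb") as w:
    w.setnchannels(1)    # mono
    w.setsampwidth(2)    # 2 bytes per sample, i.e. 16-bit PCM
    w.setframerate(rate)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read the header back without decoding the audio itself.
with wave.open(path, "rb") as w:
    header = {
        "channels": w.getnchannels(),
        "bytes_per_sample": w.getsampwidth(),
        "sample_rate": w.getframerate(),
        "frames": w.getnframes(),
    }

print(header)
```

The number of frames divided by the sample rate gives the length of the clip in seconds, which is one second for this test tone.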

Working with .wav files is easy and quick. Just pick a sound file you are interested in taking a closer look at. I have chosen to use Benjamin Tissot’s Actionable, a royalty-free rock song. Let’s import scipy.io.wavfile and open up the song:

import numpy as np ## for data transformation
import matplotlib.pyplot as plt ## for visualizing the data
import scipy.io.wavfile as wavfile ## for opening the data

Fs, aud = wavfile.read('Actionable-BenjaminTissot.wav')

In this case, “Fs” is the sampling rate, the number of audio samples carried per second, and “aud” is the actual audio: an array of sample amplitudes. These values track sound pressure, but they are raw PCM numbers rather than calibrated pascals. Most modern sound files have two audio channels, a left and a right for stereo sound. Let’s focus on just one channel and see what we can learn about the audio file:

aud = aud[:,0] #Pick just the left channel

print("Sample rate: " + str(Fs))
#Duration is just total samples over sampling rate
print("Duration: " + str(aud.shape[0]/Fs) + " seconds")
[1] Sample rate: 44100
[2] Duration: 122.80163265306122 seconds
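One caveat about the channel selection above: indexing with [:,0] only works for files with more than one channel, because scipy.io.wavfile.read returns a 1-D array for mono audio. The sketch below is one way to handle both cases. The helper names first_channel and duration_seconds are hypothetical, and the stereo clip is synthetic so the example runs without the song file.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile as wavfile


def first_channel(audio):
    """Return a 1-D array: the mono signal itself, or the first (left) channel."""
    return audio if audio.ndim == 1 else audio[:, 0]


def duration_seconds(audio, sample_rate):
    """Total samples divided by samples per second."""
    return audio.shape[0] / sample_rate


# Synthetic two-second stereo clip (hypothetical stand-in for a real song).
demo_rate = 44100
t = np.arange(demo_rate * 2) / demo_rate
left = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
right = (0.5 * 32767 * np.sin(2 * np.pi * 660 * t)).astype(np.int16)
demo_path = os.path.join(tempfile.gettempdir(), "demo_stereo.wav")
wavfile.write(demo_path, demo_rate, np.column_stack([left, right]))

demo_fs, demo_aud = wavfile.read(demo_path)
demo_left = first_channel(demo_aud)

print("dtype:", demo_aud.dtype, "shape:", demo_aud.shape)
print("Duration:", duration_seconds(demo_left, demo_fs), "seconds")
```

Printing the dtype is also a quick way to confirm what the sample values are: a 16-bit PCM file comes back as int16, which is why the amplitudes are integers in the tens of thousands.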

As we can see, the sampling rate is 44100 Hz, or 44.1 kHz, the standard sampling rate for CD-quality audio. By dividing the total samples by the sampling rate we get the length of the song, which is roughly two minutes. Let’s visualize the first 1.5 seconds of this audio file:

first = aud[:int(Fs*1.5)] #Snip just the first 1.5 seconds
plt.plot(first)
plt.ylabel("Amplitude (raw sample value)") #wavfile.read returns raw PCM samples, not pascals
plt.xlabel("Sample Count")
plt.title("Original Audio (1.5s)")
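If you would rather read the plot in seconds than in sample counts, dividing each sample index by the sampling rate gives a time axis. Here is a minimal sketch of that idea. It uses a synthetic tone in place of the song, assumes 16-bit PCM when rescaling the amplitudes, and selects the Agg backend and a temporary output path only so the script runs without a display. All names ending in _demo are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, figure is saved instead of shown
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the article's Fs and aud (two seconds of a 220 Hz tone).
Fs_demo = 44100
n = np.arange(Fs_demo * 2)
aud_demo = (0.8 * 32767 * np.sin(2 * np.pi * 220 * n / Fs_demo)).astype(np.int16)

clip = aud_demo[: int(Fs_demo * 1.5)]          # first 1.5 seconds
time_s = np.arange(clip.shape[0]) / Fs_demo    # sample index -> seconds
scaled = clip.astype(np.float64) / 32768.0     # assumes int16 samples; maps to about [-1, 1)

fig, ax = plt.subplots()
ax.plot(time_s, scaled)
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude (normalized)")
ax.set_title("Original Audio (1.5s)")

out_path = os.path.join(tempfile.gettempdir(), "original_audio_1_5s.png")
fig.savefig(out_path)
```

The shape of the waveform is identical to the sample-count version; only the axis units change.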

