Many data analysis tasks are still performed on a laptop. That speeds up the analysis, as you have your familiar working environment prepared with all of your tools. But chances are your laptop is not "the latest beast" with x GB of main memory.
Then a MemoryError surprises you!
What should you do? Use Dask? You have never worked with it, and such tools usually have their quirks. Should you request a Spark cluster? Or is Spark an exaggerated choice at this point?
Calm down… breathe.
Before you think about using another tool, ask yourself the following question:
Do I need all the rows and columns to perform the analysis?
In case you don't need all the rows, you can read the dataset in chunks and filter out the unnecessary rows to reduce memory usage:
import pandas as pd

# read the CSV in chunks of 1000 rows and keep only the rows that pass the filter
iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
Reading a dataset in chunks is slower than reading it all at once, so I would recommend using this approach only with datasets that don't fit into memory.
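If you are not sure whether a dataset really exceeds your memory, you can estimate its in-memory footprint from a sample before committing to chunked reading. A minimal sketch using pandas' memory_usage (the 100,000-row sample size is just an illustrative choice):

import pandas as pd

# load a sample and measure how much memory it occupies
sample = pd.read_csv('dataset.csv', nrows=100_000)
print(sample.memory_usage(deep=True).sum() / 1024 ** 2, 'MB for 100k rows')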
In case you don't need all the columns, you can specify the required columns with the usecols argument when reading the dataset:
# read only the columns required for the analysis
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])
This approach generally speeds up reading and reduces memory consumption, so I would recommend using it with every dataset.
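The two tricks can also be combined: read only the columns you need and filter the rows chunk by chunk. A rough sketch reusing the placeholder names from the snippets above (the column list, 'field', and constant are assumptions about your data):

import pandas as pd

# read only the needed columns, in chunks, and keep only the matching rows
iter_csv = pd.read_csv('dataset.csv', usecols=['field', 'col1'],
                       iterator=True, chunksize=1000)
df = pd.concat(chunk[chunk['field'] > constant] for chunk in iter_csv)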
#analysis #data-science #python #programming #big-data