1590434520

When you encounter a new set of data, you need to spend some time getting to know it. The process is not unlike meeting a new person for the first time. Instead of asking questions like “What’s your name?”, “Where are you from?”, and “Katy Perry or Taylor Swift?” you ask “What does your distribution look like?”, “Do you contain any missing or unexpected data?”, and “How do you correlate to other pieces of data?”.

A typical Data Scientist spends 80% of their day preparing data for processing¹. Tools like Pandas have made the process more efficient by adding a powerful set of features to explore and munge data. Pandas can turn a vanilla CSV file into insightful aggregations and charts. Plus, Pandas’ number one feature is that it keeps me out of Excel.
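As a quick illustration of that CSV-to-aggregation workflow, here is a minimal sketch (the column names and values are made up; `io.StringIO` stands in for a real file on disk):

```python
import io
import pandas as pd

# A stand-in for a "vanilla CSV file" (hypothetical trip data)
csv_data = io.StringIO(
    "borough,fare\n"
    "Manhattan,12.5\n"
    "Brooklyn,9.0\n"
    "Manhattan,20.0\n"
)

df = pd.read_csv(csv_data)

# One line turns the raw rows into an insightful aggregation
fare_by_borough = df.groupby("borough")["fare"].mean()
print(fare_by_borough)
```

From here, `fare_by_borough.plot.bar()` would give a chart with no spreadsheet in sight.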

Pandas is not all roses and sunshine, however. Since DataFrames (the foundation of Pandas) are kept in memory, there are limits to how much data can be processed at a time. Analyzing datasets the size of the New York Taxi data (1+ billion rows spanning 10 years) can cause out-of-memory exceptions while trying to pack those rows into Pandas. Most Pandas-related tutorials only work with 6 months of data to avoid that scenario.
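One built-in escape hatch is the `chunksize` parameter of `read_csv`, which streams the file in pieces instead of packing every row into memory at once. A minimal sketch (the tiny in-memory CSV is a stand-in for a file far larger than RAM):

```python
import io
import pandas as pd

# Hypothetical large CSV; in practice this would be a path like "taxi.csv"
big_csv = io.StringIO("fare\n" + "\n".join(str(i) for i in range(1000)))

total = 0.0
rows = 0
# chunksize makes read_csv yield DataFrames of at most 250 rows each,
# so only one chunk is ever held in memory at a time
for chunk in pd.read_csv(big_csv, chunksize=250):
    total += chunk["fare"].sum()
    rows += len(chunk)

# A running aggregate computed without ever loading all rows at once
print(total / rows)
```

This works for reductions like sums and counts; operations that need the whole dataset at once (sorts, joins) are where tools like Dask come in.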

#pandas #big-data #data-science #database #python

1623142193

Pandas is a popular Python package for data science, as it offers powerful, expressive, and flexible data structures for data exploration and visualization. But when it comes to handling large datasets it falls short, as it cannot process larger-than-memory data.

Pandas offers a vast set of APIs for data exploration and visualization, which makes it popular among the data science community. **Dask**, **Modin**, and **Vaex** are some of the open-source packages that can scale up the performance of the Pandas library and handle large datasets.

When the dataset is considerably larger than memory, using such libraries is preferred; but when the dataset is comparable to or smaller than memory, we can optimize memory usage while reading it. In this article, we will discuss how to optimize memory usage while loading a dataset using the **pandas.read_csv()** or **pandas.read_excel()** functions.
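The main lever is the `dtype` parameter of `read_csv`: telling Pandas up front to use smaller numeric types and the `category` dtype for repetitive strings, instead of its default `int64`/`float64`/`object` choices. A minimal sketch with made-up data:

```python
import io
import pandas as pd

csv_data = "id,category,price\n1,A,9.99\n2,B,4.50\n3,A,7.25\n"

# Default dtypes: int64 / object / float64
df_default = pd.read_csv(io.StringIO(csv_data))

# Downcast the numeric columns and store the repeated strings as a category
df_small = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"id": "int32", "category": "category", "price": "float32"},
)

# deep=True counts the actual bytes behind object/string columns
print(df_default.memory_usage(deep=True).sum())
print(df_small.memory_usage(deep=True).sum())
```

On real datasets with millions of rows and low-cardinality string columns, this kind of dtype tuning routinely cuts memory usage by a large factor.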

#machine-learning #education #pandas #optimize pandas memory usage for large datasets #pandas memory #datasets

1586702221

In this post, we will learn about Pandas’ data structures/objects. Pandas provides two types of data structures:

A Pandas Series is one-dimensional indexed data that can hold datatypes like integer, string, boolean, float, Python object, etc. A Series can hold only one data type at a time. The axis labels of the data are called the index of the series. The labels need not be unique but must be of a hashable type. The index can hold integers, strings, and even time-series data. In essence, a Pandas Series is like a single column of an Excel sheet, with the row index serving as the index of the series.
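A short sketch of these properties (the labels and values here are invented):

```python
import pandas as pd

# A Series with string labels as its index; all values share one dtype
# (float64 here, since mixing the int 1 with 2.5 upcasts everything to float)
s = pd.Series([1, 2.5, 3], index=["a", "b", "c"])

print(s["b"])             # label-based lookup
print(s.index.tolist())   # the axis labels
print(s.dtype)            # the single data type the Series holds
```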

The Pandas DataFrame is the primary data structure of Pandas. It is a two-dimensional, size-mutable array with both flexible row indices and flexible column names. In general, it is just like an Excel sheet or SQL table. It can also be seen as a Python dict-like container for Series objects.
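The "dict-like container for Series" view can be sketched directly (the city names and figures are illustrative only):

```python
import pandas as pd

# Each dict key becomes a column name, and the Series labels are
# aligned to form the DataFrame's row index
population = pd.Series({"Oslo": 700_000, "Bergen": 290_000})
area = pd.Series({"Oslo": 454, "Bergen": 465})

df = pd.DataFrame({"population": population, "area_km2": area})

print(df)
print(df.loc["Oslo", "population"])  # flexible row index + column name
```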

#python #python-pandas #pandas-dataframe #pandas-series #pandas-tutorial

1602550800

Pandas is used for data manipulation, analysis, and cleaning.

**What are Data Frames and Series?**

**Dataframe** is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

It contains rows and columns, and arithmetic operations can be applied to both.

**Series** is a one-dimensional labeled array capable of holding data of any type: integer, float, string, Python objects, etc. A Pandas Series is essentially a single column of an Excel sheet.

import pandas as pd

import numpy as np

s = pd.Series([1, 2, 3, 4, 56, np.nan, 7, 8, 90])

print(s)

**How to create a dataframe by passing a numpy array?**

- d = pd.date_range('20200809', periods=15)
- print(d)
- df = pd.DataFrame(np.random.randn(15, 4), index=d, columns=['A', 'B', 'C', 'D'])
- print(df)

#pandas-series #pandas #pandas-in-python #pandas-dataframe #python

1616050935

In my last post, I covered the groupby technique in the Pandas library. After creating a groupby object, calculations on the grouped data are limited to groupby’s own functions. For example, in the last lesson we were able to use a few functions such as mean or sum on the object we created with groupby. But with the aggregate() method, we can use both functions we have written ourselves and the methods that come with groupby. I will show how to work with aggregate() in this post.
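A small preview of the idea (the team names and scores are made up): `agg` — the short alias for `aggregate` — accepts both the built-in method names and our own functions side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 5, 15],
})

# Built-in reductions work directly on the groupby object...
print(df.groupby("team")["score"].mean())

# ...while aggregate()/agg() also accepts functions we wrote ourselves
def score_range(s):
    return s.max() - s.min()

result = df.groupby("team")["score"].agg(["mean", score_range])
print(result)
```

Each entry in the list becomes a column in the result, with custom functions labeled by their name.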

#pandas-groupby #python-pandas #pandas #data-preprocessing #pandas-tutorial

1616395265

In my last post, I covered summarizing and computing descriptive statistics with the Pandas library. To work with data in Pandas, it is necessary to load the dataset first. Reading the dataset is one of the important stages of data analysis. In this post, I will talk about reading and writing data.

Before starting the topic, note that our Medium page includes posts on data science, artificial intelligence, machine learning, and deep learning. Please don’t forget to follow us on **Medium** 🌱 to see these and our latest posts.

Let’s get started.
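As a taste of what's ahead, the basic read/write round trip looks like this (the file name and sample data are invented; a temporary directory keeps the sketch self-contained):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 52]})

# Round trip: write the frame out as CSV, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")
    df.to_csv(path, index=False)   # index=False skips writing the row labels
    loaded = pd.read_csv(path)

print(loaded)
```

The same pattern applies to the other formats Pandas supports, via pairs like `read_excel`/`to_excel` and `read_json`/`to_json`.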

#python-pandas-tutorial #pandas-read #pandas #python-pandas