In this article, we will discuss the Dirty Read concurrency issue and also learn the details of the Read Uncommitted Isolation Level.
A transaction is the smallest unit of work that performs CRUD (Create, Read, Update, and Delete) actions in relational database systems. To provide database consistency, transactions must have certain characteristics. Four features constitute the major principles of transactions and ensure the validity of the data stored by database systems: atomicity, consistency, durability, and isolation.
These four properties are also known as the ACID principles. Let's briefly explain each of them.
Atomicity
This property is also known as the "all or nothing" principle. According to this property, a transaction cannot be completed partially: if a transaction gets an error at any point, the entire transaction must be aborted and rolled back. Otherwise, all the actions contained in the transaction must be completed successfully.
Consistency
According to this property, the saved data must not damage data integrity. This means that the modified data must satisfy the constraints and other requirements that are defined in the database.
Durability
According to this property, committed data will not be lost, even after a system or power failure.
Isolation
Database transactions must complete their tasks independently of other transactions. This property enables us to execute transactions concurrently on database systems, so the data changes made by a transaction are not visible to other transactions until the transaction completes (commits) its actions. The SQL standard describes three read phenomena, which can be experienced when more than one transaction tries to read and write the same resources.
Dirty Read
The simplest explanation of a dirty read is the state of reading uncommitted data. In this circumstance, we cannot be sure about the consistency of the data that is read, because we don't know the outcome of the open transaction(s): after the uncommitted data has been read, the open transaction may be rolled back, or it may complete its actions successfully. Data read in this ambiguous way is referred to as dirty data. We will now explain this issue with a scenario:
Assume we have a table, as shown below, that stores the clients' bank account details.
Perl script converts PDF files to Gerber format
Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.
The general workflow is as follows:
Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).
See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.
#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;
use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)
##############################################################################################
#configurable settings:
#change values here instead of in the main pdf2gerb.pl file
use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call
#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software. \nGerber files MAY CONTAIN ERRORS. Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG
use constant METRIC => FALSE; #set to TRUE for metric units (only affects final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC
use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)
#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1);
#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
.010, -.001, #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
.031, -.014, #used for vias
.041, -.020, #smallest non-filled plated hole
.051, -.025,
.056, -.029, #useful for IC pins
.070, -.033,
.075, -.040, #heavier leads
# .090, -.043, #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
.100, -.046,
.115, -.052,
.130, -.061,
.140, -.067,
.150, -.079,
.175, -.088,
.190, -.093,
.200, -.100,
.220, -.110,
.160, -.125, #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
.090, -.040, #want a .090 pad option, but use dummy hole size
.065, -.040, #.065 x .065 rect pad
.035, -.040, #.035 x .065 rect pad
#traces:
.001, #too thin for real traces; use only for board outlines
.006, #minimum real trace width; mainly used for text
.008, #mainly used for mid-sized text, not traces
.010, #minimum recommended trace width for low-current signals
.012,
.015, #moderate low-voltage current
.020, #heavier trace for power, ground (even if a lighter one is adequate)
.025,
.030, #heavy-current traces; be careful with these ones!
.040,
.050,
.060,
.080,
.100,
.120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);
#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size: parsed PDF diameter: error:
# .014 .016 +.002
# .020 .02267 +.00267
# .025 .026 +.001
# .029 .03167 +.00267
# .033 .036 +.003
# .040 .04267 +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};
#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
CIRCLE_ADJUST_MINX => 0,
CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
CIRCLE_ADJUST_MAXY => 0,
SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};
#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches
#line join/cap styles:
use constant
{
CAP_NONE => 0, #butt (none); line is exact length
CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
#number of elements in each shape type:
use constant
{
RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
CIRCLE_SHAPELEN => 5, #x, y, radius, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
rect => RECT_SHAPELEN,
line => LINE_SHAPELEN,
curve => CURVE_SHAPELEN,
circle => CIRCLE_SHAPELEN,
);
#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions
# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?
#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes.
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes
#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (72 points/inch / .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches
# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)
# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time
# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const
use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool
my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time
print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load
#############################################################################################
#junk/experiment:
#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html
#my $caller = "pdf2gerb::";
#sub cfg
#{
# my $proto = shift;
# my $class = ref($proto) || $proto;
# my $settings =
# {
# $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
# };
# bless($settings, $class);
# return $settings;
#}
#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;
#print STDERR "read cfg file\n";
#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names
#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }
Author: swannman
Source Code: https://github.com/swannman/pdf2gerb
License: GPL-3.0 license
BDF.jl is a Julia module to read/write BioSemi 24-bit BDF files (used for storing electroencephalographic recordings).
Usage:
bdfHeader = readBDFHeader("res1.bdf") #read the bdf header
sampRate = bdfHeader["sampRate"][1] #get the sampling rate
#read the data, the event table, the trigger channel and the status channel
dats, evtTab, trigs, statusChan = readBDF("res1.bdf")
Load the module
using BDF
To read an entire BDF recording
dats, evtTab, trigChan, sysCodeChan = readBDF("res1.bdf")
dats is the nChannels-by-nSamples matrix containing the data. Note that the triggers are not contained in the dats matrix. The triggers can be retrieved either through the event table (evtTab) or through the raw trigger channel (trigChan). The event table is a dictionary containing the trigger codes (evtTab["code"]), the trigger indexes (evtTab["idx"], i.e. the sample numbers at which triggers occurred in the recording), and the trigger durations (evtTab["dur"], in seconds). The raw trigger channel returned in trigChan contains the trigger code for each recording sample. Additional BioSemi status codes (like CM in/out-of range, battery low/OK) are returned in sysCodeChan.
You can also read only part of a recording, the following code will read the first 10 seconds of the recording:
dats, evtTab, trigChan, statChan = readBDF("res1.bdf", from=0, to=10)
The readBDFHeader function can be used to get information on the BDF recording:
bdfInfo = readBDFHeader("res1.bdf")
Get the duration of the recording:
bdfInfo["duration"]
Get the sampling rate of each channel:
bdfInfo["sampRate"]
Get the channel labels:
bdfInfo["chanLabels"]
To read the information stored in the status channel, you can use the decodeStatusChannel function:
statusChanInfo = decodeStatusChannel(sysCodeChan)
This will return a dictionary with several arrays that indicate, for each sample of the recording, whether the CMS was in range, whether the battery charge was low, the speed mode of the system, and other information stored in the status channel.
Beware that BDF.jl does not check that you have sufficient RAM to read all the data in a BDF file. If you try to read a file that is too big for your hardware, your system may become slow or unresponsive. Initially try reading only a small amount of data, and check how much RAM that uses.
Documentation is available here:
http://samcarcagno.altervista.org/BDF/index.html
Author: Sam81
Source Code: https://github.com/sam81/BDF.jl
License: MIT license
In this Python article, we will learn what Pandas in Python is. According to a 2019 Forbes magazine report, it was a record year for enterprises' interest in data science, AI, and machine learning features in their business strategies and goals. This means businesses around the world have started making corporate decisions based on the data they have collected over the years, using machine and deep learning methods.
Programming languages like Python and R provide a great platform for anyone starting out in machine learning and AI to analyze data and extract useful insights for businesses. Dealing with data for analysis and visualization is an imperative process in machine learning and artificial intelligence.
In short, Pandas is a software library written for the Python programming language to do data analysis and manipulation.
We all know that Python is primarily a programming language. However, with the introduction of data handling libraries like NumPy and Pandas, data visualization libraries like Seaborn and Matplotlib, and its simple, easy-to-understand syntax, Python is rapidly gaining popularity among data science and ML professionals. A Google Trends comparison of Python and R searches over the past 15 years makes it evident that Python is seeing exponential growth, while R is dropping down.
In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis in a given dataset.
Artificial intelligence is about executing machine learning algorithms in products that we use every day. For any ML algorithm to be effective, a few prerequisite data-preparation steps need to be done first.
“Pandas” – short for “Panel Data” (A panel is a 3D container of data) – is a library in python which contains in-built functions to clean, transform, manipulate, visualize and analyze data.
Getting started…
NumPy (Numerical Python) forms the basis of what Pandas is all about. While NumPy deals with "arrays" and "matrices", Pandas deals with "Series" and "Data Frames". To work with Pandas, Python first has to be installed on your system. Download and install Python from https://www.python.org/downloads/windows/
You can verify the Python installation by entering "python" in the command prompt. The command prints the installed version of Python.
Python can also be installed automatically through an application called "Anaconda", which simplifies package/library management when many basic packages need to be installed for a project.
Anaconda installer download – https://www.anaconda.com/distribution/#windows.
Once Anaconda is installed, you can navigate to the ‘lib’ folder within the Anaconda installation to have a look at what are all the packages that got installed by default. One such package is “Pandas”. In order to import Pandas to our command line, we will use a “Jupyter Notebook” in this article.
Jupyter Notebook is basically a web application, mainly used in data science and machine learning to develop and share code. Jupyter Notebook is part of the Anaconda installation, and it can be accessed through Anaconda's UI as shown below.
Click on "Launch" to open the Jupyter Notebook. Each cell in this notebook can hold one or more Python commands. Typing and executing the following command imports "Pandas" into our work environment.
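The import is the standard one-liner, using the conventional pd alias:
import pandas as pd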
Now that we have installed Pandas successfully, let us learn how to do some analysis on data.
A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.
Data type | Use |
---|---|
int | Integer number, e.g. 10, 12 |
float | Floating point number, e.g. 100.2, 3.1415 |
bool | True/False value |
object | Text, non-numeric, or a combination of text and non-numeric values, e.g. Apple |
DateTime | Date and time values |
category | A finite list of values |
What does Pandas deal with?
There are two major data structures that you can come across while doing data analysis in Pandas: Series and Data Frames. The data they hold can be of any data type: character, number, or even an object. A Series in Pandas is one-dimensional data, and data frames are two-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.
In the example shown below, "Types of Vehicles" is a series of the datatype "object", and it is treated as a character array. "Count" is another series of the type "integer". "Number of wheels" is a third series, again of the type "integer". The individual series are one-dimensional and hold only one data type. However, the data frame as a whole combines these series and is heterogeneous in nature.
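The original illustration is not reproduced here; a minimal sketch of such a data frame, with illustrative values, might look like this:
import pandas as pd
df = pd.DataFrame({
    'Types of Vehicles': ['Car', 'Bike', 'Truck'],  # object (text) series
    'Count': [10, 25, 5],                           # integer series
    'Number of wheels': [4, 2, 6],                  # integer series
})
print(df.dtypes)  # Types of Vehicles -> object; Count and Number of wheels -> int64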
This is the reason why Pandas is so powerful and so widely used in the data science world today.
Creating Series & data frames in Python
A series can be created in 3 different ways: converting an array, a list, or a dictionary into a series. We will see an example for each of the categories.
Array: We first create an array using the NumPy package and then convert it into a series using the Series() function. The same can be done for lists as well, as the sketch below shows.
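The original snippets are not shown above; a minimal sketch of both conversions might look like this:
import numpy as np
import pandas as pd
arr = np.array([10, 20, 30])
s_from_array = pd.Series(arr)  # Series from a NumPy array
s_from_list = pd.Series([10, 20, 30])  # Series from a plain Python list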
Creating a data frame can be done using the DataFrame() constructor, as sketched below.
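As a minimal sketch (the values are illustrative):
import pandas as pd
df = pd.DataFrame(['Car', 'Bike', 'Truck'])  # a single-column data frame from a list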
We can also create data frames with multiple series by building a dictionary of series and converting it into a data frame, as in the sketch below.
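Sketched with hypothetical column names:
import pandas as pd
data = {'Vehicle': pd.Series(['Car', 'Bike', 'Truck']),
        'Count': pd.Series([10, 25, 5])}
df = pd.DataFrame(data)  # each dictionary key becomes a column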
Pandas provides built-in functions for descriptive statistics, summarized in the table below.
Function | Description |
---|---|
count() | Returns the number of non-null values |
sum() | Returns the sum of all values |
mean() | Returns the average of all values |
median() | Returns the median of all values |
mode() | Returns the mode |
std() | Returns the standard deviation |
min() | Returns the minimum of all values |
max() | Returns the maximum of all values |
abs() | Returns the absolute value |
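As a quick illustration, here is how a few of these behave on a small Series:
import pandas as pd
s = pd.Series([1, -2, 3, 4])
print(s.count())  # 4 non-null values
print(s.sum())    # 6
print(s.mean())   # 1.5
print(s.abs())    # element-wise absolute values: 1, 2, 3, 4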
Often, you won't be creating data yourself; you will already have it in some form and will want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.
The below table shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.
Input type | Reader | Writer |
---|---|---|
CSV | read_csv | to_csv |
JSON | read_json | to_json |
HTML | read_html | to_html |
Excel | read_excel | to_excel |
SAS | read_sas | – |
Python Pickle Format | read_pickle | to_pickle |
SQL | read_sql | to_sql |
Google Big Query | read_gbq | to_gbq |
In the below example, I have shown how to read a CSV file.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
The example below shows how to save a DataFrame to a CSV file.
import pandas as pd
# initialize a dictionary
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}
# Create DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
# Saving to CSV
df.to_csv("avengers.csv")
An aggregation function can be applied to one or more columns. You can either apply the same aggregate function across various columns or different aggregate functions across various columns.
Commonly used aggregate functions: sum, min, max, mean.
Example: Same aggregate function on all columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate(['sum', 'min', 'max', 'mean'])
Output
Name Salary Age
sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0 180.0
min Brad Pitt 1000.0 33.0
max jennifer Lawrence 93000.0 52.0
mean NaN 63250.0 45.0
Example: Different aggregate functions for different columns.
import pandas as pd
# initialize a dictionary
data = {'Name':['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary':[1000, 80000, 79000, 93000],
        'Age':[33, 50, 45, 52]}
# Create DataFrame
df = pd.DataFrame(data)
df.aggregate({'Salary':['sum', 'mean'],
              'Age':['min', 'max']})
Output
Salary Age
max NaN 52.0
mean 63250.0 NaN
min NaN 33.0
sum 253000.0 NaN
Pandas groupby function is used to split the DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.
import pandas as pd
import numpy as np
#Read input file
df = pd.read_csv('/content/player_data.csv')
df.head()
Output:
name year_start year_end position height weight birth_date college
0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University
1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University
2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles
3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University
4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University
Let's group by the players' college names.
# group the data on college
gd = df.groupby('college')
gd.first()
Output:
name year_start year_end position height weight birth_date
college
Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946
Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989
Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986
Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959
Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950
… … … … … … … …
Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976
Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922
Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965
Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947
Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921
Let's print the values in any one of the groups. Note that a tuple key like the one below only works with the multi-column grouping created in the next example; for the single-column grouping above, you would pass a single college name instead (e.g. gd.get_group('Duke University')).
gd.get_group(('C', 'A.J. Bramlett'))
Output
Year_start year_end height weight birth_date college
435 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
Let’s create groups based on more than one category
# group the data on name and position.
gd = df.groupby(['position', 'name'])
gd.first()
Output
year_start year_end height weight birth_date college
position name
C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona
A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University
Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh
Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University
Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University
… … … … … … … …
G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis
Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco
Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University
Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University
Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN
Before I start with the Pandas join and merge functions, let me introduce you to the four different types of joins: inner join, left join, right join, and outer join.
Merging DataFrames on one unique key:
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on='key')
res
Output
key Name Age
0 K0 Mercy 27
1 K1 Prince 24
2 K2 John 22
3 K3 Cena 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Age Address Qualification
0 K0 Mercy 27 Canada Btech
1 K1 Prince 24 UK B.A
2 K2 John 22 India MS
3 K3 Cena 32 USA Phd
Merging DataFrames using multiple keys:
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head())
print("\n")
print(df2.head())
res = pd.merge(df1, df2, on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
In pd.merge(), I pass the argument how='left' to perform a left merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K1 Prince Australia 24 NaN
2 K2 John India 22 MS
3 K3 Cena Japan 32 NaN
In pd.merge(), I pass the argument how='right' to perform a right merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 NaN UK NaN B.A
2 K2 John India 22.0 MS
3 K3 NaN USA NaN Phd
In pd.merge(), I pass the argument how='outer' to perform an outer merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27.0 Btech
1 K1 Prince Australia 24.0 NaN
2 K2 John India 22.0 MS
3 K3 Cena Japan 32.0 NaN
4 K1 NaN UK NaN B.A
5 K3 NaN USA NaN Phd
In pd.merge(), I pass the argument how='inner' to perform an inner merge.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'key':['K0', 'K1', 'K2', 'K3'],
         'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res
Output
key Name Address Age
0 K0 Mercy Canada 27
1 K1 Prince Australia 24
2 K2 John India 22
3 K3 Cena Japan 32
key Address Qualification
0 K0 Canada Btech
1 K1 UK B.A
2 K2 India MS
3 K3 USA Phd
key Name Address Age Qualification
0 K0 Mercy Canada 27 Btech
1 K2 John India 22 MS
The join function is used to combine DataFrames on their indexes.
Example
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2)
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
Performing a join with the 'how' parameter. The different inputs to the 'how' parameter are: inner, outer, left, right.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2)
print(df1.head(), "\n")
print(df2.head(), "\n")
res = df1.join(df2, how='inner')
res
Output
Name Age
0 Mercy 27
1 Prince 24
2 John 22
3 Cena 32
Address Qualification
0 Canada Btech
1 UK B.A
2 India MS
3 USA Phd
Name Age Address Qualification
0 Mercy 27 Canada Btech
1 Prince 24 UK B.A
2 John 22 India MS
3 Cena 32 USA Phd
The concat function concatenates DataFrames along an axis. In the following example, the two DataFrames are stacked vertically:
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames)
res
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
The resultant DataFrame has a repeated index. If you want the new DataFrame to have its own index, set ignore_index to True.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, ignore_index=True)
res
Output
Name Age Address Qualification
0 Mercy 27.0 NaN NaN
1 Prince 24.0 NaN NaN
2 John 22.0 NaN NaN
3 Cena 32.0 NaN NaN
4 NaN NaN Canada Btech
5 NaN NaN UK B.A
6 NaN NaN India MS
7 NaN NaN USA Phd
The second DataFrame is concatenated below the first one, so the resultant DataFrame gains new rows. If you want the second DataFrame to be added as columns instead, pass the argument axis=1.
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
frames = [df1, df2]
res = pd.concat(frames, axis=1, ignore_index=True)
res
Output
0 1 2 3
K0 Mercy 27 Canada Btech
K1 Prince 24 UK B.A
K2 John 22 India MS
K3 Cena 32 USA Phd
The append function concatenates along axis=0 only. It can take multiple objects as input. (Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; see the equivalent concat call after the output below.)
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Canada', 'UK', 'India', 'USA'],
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']}
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])
df1.append(df2)
Output
Name Age Address Qualification
K0 Mercy 27.0 NaN NaN
K1 Prince 24.0 NaN NaN
K2 John 22.0 NaN NaN
K3 Cena 32.0 NaN NaN
K0 NaN NaN Canada Btech
K1 NaN NaN UK B.A
K2 NaN NaN India MS
K3 NaN NaN USA Phd
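As noted above, DataFrame.append was removed in pandas 2.0; on recent versions, the equivalent result comes from concat:
res = pd.concat([df1, df2])  # same stacking as df1.append(df2)
res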
You will often encounter time data. Pandas is a very useful tool when working with time series data.
In the below code, I am generating a sequence of datetimes at an hourly frequency.
import pandas as pd
# Create dates dataframe with frequency
date = pd.date_range('10/28/2011', periods=5, freq='H')
date
Output
DatetimeIndex(['2011-10-28 00:00:00', '2011-10-28 01:00:00',
               '2011-10-28 02:00:00', '2011-10-28 03:00:00',
               '2011-10-28 04:00:00'],
              dtype='datetime64[ns]', freq='H')
In the below code, I am generating datetimes using a range that has a starting value, an ending value, and a periods argument that specifies how many samples I want.
import pandas as pd
date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
date
Output
DatetimeIndex(['2018-09-28 00:00:00', '2018-10-01 08:00:00',
               '2018-10-04 16:00:00', '2018-10-08 00:00:00',
               '2018-10-11 08:00:00', '2018-10-14 16:00:00',
               '2018-10-18 00:00:00', '2018-10-21 08:00:00',
               '2018-10-24 16:00:00', '2018-10-28 00:00:00'],
              dtype='datetime64[ns]', freq=None)
To convert datetimes into a Pandas Series or a DataFrame, just pass them into the initializer.
You can use the to_datetime function to convert a Pandas Series or a list-like object. When passed a Series, it returns a Series. If you pass a string, it returns a Timestamp.
import pandas as pd
date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']))
date
Output
0 2020-07-04
1 2020-10-28
dtype: datetime64[ns]
In the below code I have specified the format of my input datetime. This speeds up the processing.
import pandas as pd
date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')
date
Output
Timestamp('1994-07-04 00:00:00')
Datetime can be divided into its components using:
pandas.Series.dt.year returns the year.
pandas.Series.dt.month returns the month.
pandas.Series.dt.day returns the day.
pandas.Series.dt.hour returns the hour.
pandas.Series.dt.minute returns the minute.
import pandas as pd
# Create datetime with dataframe
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')
# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute
# Print the dates divided into features
date.head()
Output
date year month day hour minute
0 2020-10-28 00:00:00 2020 10 28 0 0
1 2020-10-28 01:00:00 2020 10 28 1 0
2 2020-10-28 02:00:00 2020 10 28 2 0
3 2020-10-28 03:00:00 2020 10 28 3 0
4 2020-10-28 04:00:00 2020 10 28 4 0
Pandas can also be used to visualize data.
In the below code I am generating a line plot. I am using random normal values generated by NumPy as input.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))
df.plot()
Bar plot can be made by using '.plot.bar()'. Pass the argument 'stacked=True' if you want stacked bars.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),
                  columns=['a', 'b', 'c', 'd'])
df.plot.bar()
# using stacked bars
df.plot.bar(stacked=True)
To generate a horizontal bar graph, use '.plot.barh()'. You can also pass the argument 'stacked=True' if you want the bars to be stacked.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,5),
                  columns=['a', 'b', 'c', 'd', 'e'])
# using stacked bars
df.plot.barh(stacked=True)
To generate a histogram, use 'DataFrame.plot.hist()'. Pass the argument 'bins' specifying how many bins you want.
Example – df.plot.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)-3,
                   'B':np.random.randn(100)+1,
                   'C':np.random.randn(100)+3,
                   'D':np.random.randn(100)-1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.hist(bins=20)
To plot separate histograms for all your inputs, use your DataFrame name followed by '.hist()'. Pass the argument 'bins' specifying how many bins you want.
Example – df.hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)-3,
                   'B':np.random.randn(100)+1,
                   'C':np.random.randn(100)+3,
                   'D':np.random.randn(100)-1},
                  columns=['A', 'B', 'C', 'D'])
df.hist(bins=20)
To plot a single histogram for any of your inputs, pass the input name in square brackets followed by '.hist()'.
Example – df['A'].hist()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)-3,
                   'B':np.random.randn(100)+1,
                   'C':np.random.randn(100)+3,
                   'D':np.random.randn(100)-1},
                  columns=['A', 'B', 'C', 'D'])
df['A'].hist(bins=20)
A scatter plot can be created using the DataFrame.plot.scatter() method.
Example – df.plot.scatter()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)-3,
                   'B':np.random.randn(100)+1,
                   'C':np.random.randn(100)+3,
                   'D':np.random.randn(100)-1},
                  columns=['A', 'B', 'C', 'D'])
df.plot.scatter(x='A', y='B')
To generate a pie chart, use '.plot.pie()'.
Example – df.plot.pie()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])
df.plot.pie(subplots=True)
Data Handling with Pandas
Missing Values – Identification and Imputation
How to identify missing values?
Pandas provides functions such as isnull() and notnull() to find out whether a data frame has missing or null values.
Imputation – Drop or replace?
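The original snippets for this section are not included; the following is a minimal sketch of the usual identification and imputation calls, on a small hypothetical frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Mercy', 'Prince', None],
                   'Age': [27, np.nan, 22]})
print(df.isnull())          # True wherever a value is missing
print(df.isnull().sum())    # count of missing values per column
print(df.dropna())          # drop rows that contain missing values
print(df.fillna({'Age': df['Age'].mean()}))  # impute Age with the column mean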
Indexing & Filtering in Pandas
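Likewise, a brief sketch of common indexing and filtering idioms, again on a hypothetical frame:
import pandas as pd
df = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John', 'Cena'],
                   'Age': [27, 24, 22, 32]})
print(df['Name'])          # select a single column
print(df.loc[0, 'Name'])   # label-based access: row 0, column 'Name'
print(df.iloc[0, 1])       # position-based access: first row, second column
print(df[df['Age'] > 25])  # boolean filtering: rows where Age exceeds 25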
To Summarize
In this article, we discussed the basics of Pandas, including creating data frames, handling missing values, and data retrieval methods. It is said that 80% of a data scientist's job is data handling and manipulation. So, if you choose to go with Python for your ML project, it is very important that you know how Pandas operates.
Original article source at: https://www.mygreatlearning.com
Bash has many environment variables for various purposes. The set command of Bash is used to modify or display the different attributes and parameters of the shell environment. This command has many options to perform the different types of tasks. The uses of set command for various purposes are described in this tutorial.
set [options] [arguments]
This command can be used with different types of options and arguments for different purposes. If no option or argument is used with this command, the shell variables are printed. The minus sign (-) is used with the command’s option to enable that option and the plus sign (+) is used with the command’s option to disable that option.
Three exit values can be returned by this command.
The purposes of the most commonly used options of the set command are described in this part of the tutorial.
Option | Purpose |
---|---|
-a | It marks all variables and functions which are created or modified for export to the environment. |
-b | It reports the termination of background jobs immediately. |
-B | It enables brace expansion. |
-C | It disables overwriting an existing file with output redirection. |
-e | It exits immediately if a command returns a non-zero exit status. |
-f | It disables filename generation (globbing). |
-h | It remembers (hashes) the location of commands as they are looked up. |
-m | It enables job control. |
-n | It reads commands without executing them (useful for syntax-checking a script). |
-t | It exits after reading and executing a single command. |
-u | It treats unset variables as errors during expansion. |
-v | It prints the shell input lines as they are read. |
-x | It displays each command with its arguments before execution. It is mainly used to debug the script. |
The uses of set command with different options are shown in this part of this tutorial.
Example 1: Using the Set Command with -a Option
Create a Bash file with the following script that enables the "set -a" option and initializes three variables named $v1, $v2, and $v3. These variables can be accessed after the script is executed.
#!/bin/bash
#Enable -a option to read the values of the variables
set -a
#Initialize three variables
v1=78
v2=50
v3=35
Run the script in the current shell using the following command, so that the variables remain available afterward:
$ source set1.bash
Read the values of the variable using the “echo” command:
$ echo $v1 $v2 $v3
The following output appears after executing the previous commands:
Example 2: Using the Set Command with -C Option
Run the "cat" command to create a text file named testfile.txt. Next, run the "set -C" command to disable the overwriting feature. Next, run the "cat" command again to overwrite the file to check whether the overwriting feature is disabled or not.
$ cat > testfile.txt
$ set -C
$ cat > testfile.txt
The following output appears after executing the previous commands:
Example 3: Using the Set Command with -x Option
Create a Bash file with the following script that declares a numeric array of 6 elements. The values of the array are printed using for loop.
#!/bin/bash
#Declare an array
arr=(67 3 90 56 2 80)
#iterate the array values
for value in ${arr[@]}
do
echo $value
done
Execute the previous script by the following command:
$ bash set3.bash
Enable the debugging option and run the script again using the following commands:
$ set -x
$ bash set3.bash
The following output appears after executing the provided commands:
Example 4: Using the Set Command with -e Option
Create a Bash file with the following script that reads a file using the "cat" command before and after using the "set -e" command.
#!/bin/bash
#Read a non-existing file without setting set -e
cat myfile.txt
echo "Reading a file..."
#Set the set command with -e option
set -e
#Read a non-existing file after setting set -e
cat myfile.txt
echo "Reading a file..."
The following output appears after executing the provided commands. The first error message is shown because the file does not exist in the current location. The next message is then printed. But after executing the "set -e" command, the execution stops after displaying the error message.
Example 5: Using the Set Command with -u Option
Create a Bash file with the following script that initializes a variable but prints both the initialized and an uninitialized variable before and after using the "set -u" command.
#!/bin/bash
#Assign value to a variable
strvar="Bash Programming"
printf "$strvar $intvar\n"
#Set the set command with -u option
set -u
#Assign value to a variable
strvar="Bash Programming"
printf "\n$strvar $intvar\n"
The following output appears after executing the previous script. Here, the error is printed for the uninitialized variable:
Example 6: Using the Set Command with -f Option
Run the following command to print the list of all text files of the current location:
$ ls *.txt
Run the following command to disable the globbing:
$ set -f
Run the following command again to print the list of all text files of the current location:
$ ls *.txt
The following output appears after executing the previous commands. Based on the output, the "ls *.txt" command did not work after setting the "set -f" option:
Example 7: Split the String Using the Set Command with Variable
Create a Bash file with the following script that splits a string value based on spaces using the "set -- $variable" form of the command. The split values are printed later.
#!/bin/bash
#Define a string variable
myvar="Learn bash programming"
#Set the set command without option and with variable
set -- $myvar
#Print the split value
printf "$1\n$2\n$3\n"
The following output appears after executing the previous script. The string value is divided into three parts based on the space that is printed:
The uses of the different options of the "set" command were shown in this tutorial using multiple examples to illustrate the basic uses of this command.
Original article source at: https://linuxhint.com/