Dirty Reads and the Read Uncommitted Isolation Level

In this article, we will discuss the Dirty Read concurrency issue and also learn the details of the Read Uncommitted Isolation Level.

A transaction is the smallest unit of work that performs the CRUD (Create, Read, Update, and Delete) operations in relational database systems. To preserve database consistency, transactions must have certain characteristics. The following four properties constitute the major principles of transactions and ensure the validity of the data stored by database systems:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

These four properties are also known as the ACID principles. Let's briefly explain each of them.

Atomicity

This property is also known as the all-or-nothing principle. According to it, a transaction cannot complete partially: if an error occurs at any point of the transaction, the entire transaction must be aborted and rolled back; otherwise, all the actions contained in the transaction must complete successfully.

Consistency

According to this property, the saved data must not violate data integrity. This means that the modified data must satisfy the constraints and other rules that are defined in the database.

Durability

According to this property, committed data will not be lost, even in the event of a system or power failure.

Isolation

Database transactions must complete their tasks independently of other transactions. This property enables us to execute transactions concurrently on database systems, so the data changes made by a transaction are not visible to other transactions until that transaction completes (commits) its actions. The SQL standard describes three read phenomena that can occur when more than one transaction reads from and writes to the same resources:

  • Dirty-reads
  • Non-repeatable reads
  • Phantom reads

What is Dirty Read?

Simply put, a dirty read is the act of reading uncommitted data. In this circumstance, we cannot be sure about the consistency of the data that is read because we don't know the outcome of the open transaction(s): after the uncommitted data is read, the open transaction may end with a rollback, or it may complete its actions successfully. Data read in this ambiguous way is called dirty data. Now we will explain this issue with a scenario:

Assume we have a table, shown below, that stores the clients' bank account details.
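The table itself is not reproduced here. As a rough sketch of the scenario, the Python snippet below simulates a dirty read against a SQL Server instance through pyodbc, assuming a hypothetical Accounts(AccountId, Balance) table; the table, database, and connection details are illustrative and not taken from the original article.

import pyodbc

# Hypothetical connection string and Accounts table (adjust for your environment).
conn_str = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=Bank;Trusted_Connection=yes")

writer = pyodbc.connect(conn_str, autocommit=False)  # session 1: open transaction
reader = pyodbc.connect(conn_str, autocommit=True)   # session 2: concurrent reader

# Session 1 modifies a balance but does not commit yet.
writer.execute("UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 1")

# Session 2, running under READ UNCOMMITTED, can see that uncommitted change.
reader.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
dirty_balance = reader.execute(
    "SELECT Balance FROM Accounts WHERE AccountId = 1").fetchone()[0]
print("Dirty (uncommitted) balance:", dirty_balance)

# Session 1 now rolls back, so the value session 2 read never officially existed.
writer.rollback()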

#execution plans #locking #performance #query analysis #data analysis


Chloe Butler

Pdf2gerb: Perl Script Converts PDF Files to Gerber format

pdf2gerb

Perl script converts PDF files to Gerber format

Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

  1. Design the PCB using your favorite CAD or drawing software.
  2. Print the top and bottom copper and top silk screen layers to a PDF file.
  3. Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
  4. Use a Gerber viewer to double-check the output against the original PCB design.
  5. Make adjustments as needed.
  6. Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.


pdf2gerb_cfg.pm

#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;

use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)


##############################################################################################
#configurable settings:
#change values here instead of in main pdf2gerb.pl file

use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call

#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software.  \nGerber files MAY CONTAIN ERRORS.  Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG

use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC

use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)

#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1); 

#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
    .010, -.001,  #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
    .031, -.014,  #used for vias
    .041, -.020,  #smallest non-filled plated hole
    .051, -.025,
    .056, -.029,  #useful for IC pins
    .070, -.033,
    .075, -.040,  #heavier leads
#    .090, -.043,  #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
    .100, -.046,
    .115, -.052,
    .130, -.061,
    .140, -.067,
    .150, -.079,
    .175, -.088,
    .190, -.093,
    .200, -.100,
    .220, -.110,
    .160, -.125,  #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
    .090, -.040,  #want a .090 pad option, but use dummy hole size
    .065, -.040, #.065 x .065 rect pad
    .035, -.040, #.035 x .065 rect pad
#traces:
    .001,  #too thin for real traces; use only for board outlines
    .006,  #minimum real trace width; mainly used for text
    .008,  #mainly used for mid-sized text, not traces
    .010,  #minimum recommended trace width for low-current signals
    .012,
    .015,  #moderate low-voltage current
    .020,  #heavier trace for power, ground (even if a lighter one is adequate)
    .025,
    .030,  #heavy-current traces; be careful with these ones!
    .040,
    .050,
    .060,
    .080,
    .100,
    .120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);

#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size:   parsed PDF diameter:      error:
#  .014                .016                +.002
#  .020                .02267              +.00267
#  .025                .026                +.001
#  .029                .03167              +.00267
#  .033                .036                +.003
#  .040                .04267              +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
    HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
    RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
    SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
    RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
    TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
    REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};

#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
    CIRCLE_ADJUST_MINX => 0,
    CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
    CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
    CIRCLE_ADJUST_MAXY => 0,
    SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
    WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
    RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};

#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches

#line join/cap styles:
use constant
{
    CAP_NONE => 0, #butt (none); line is exact length
    CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
    CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
    CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
    
#number of elements in each shape type:
use constant
{
    RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
    LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
    CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
    CIRCLE_SHAPELEN => 5, #x, y, 5, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
    rect => RECT_SHAPELEN,
    line => LINE_SHAPELEN,
    curve => CURVE_SHAPELEN,
    circle => CIRCLE_SHAPELEN,
);

#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions

# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?

#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes. 
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes

#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (72 / .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches

# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)

# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time

# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const

use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool

my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time

print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load


#############################################################################################
#junk/experiment:

#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html

#my $caller = "pdf2gerb::";

#sub cfg
#{
#    my $proto = shift;
#    my $class = ref($proto) || $proto;
#    my $settings =
#    {
#        $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
#    };
#    bless($settings, $class);
#    return $settings;
#}

#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;

#print STDERR "read cfg file\n";

#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names

#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }

Download Details:

Author: swannman
Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license

#perl 

BDF.jl: Module to Read Biosemi BDF Files with Julia

BDF.jl is a Julia module to read/write BIOSEMI 24-bit BDF files (used for storing electroencephalographic recordings)

Usage:

bdfHeader = readBDFHeader("res1.bdf") #read the bdf header
sampRate = bdfHeader["sampRate"][1] #get the sampling rate
#read the data, the event table, the trigger channel and the status channel
dats, evtTab, trigs, statusChan = readBDF("res1.bdf")

Usage

Load the module

using BDF

To read an entire BDF recording


dats, evtTab, trigChan, sysCodeChan = readBDF("res1.bdf")

dats is the nChannels x nSamples matrix containing the data. Note that the triggers are not contained in the dats matrix. The triggers can be retrieved either through the event table (evtTab) or the raw trigger channel (trigChan). The event table is a dictionary containing the trigger codes evtTab["code"], the trigger indexes evtTab["idx"] (i.e. the sample numbers at which triggers occurred in the recording), and the trigger durations evtTab["dur"] (in seconds). The raw trigger channel returned in trigChan contains the trigger code for each recording sample. Additional Biosemi status codes (like CM in/out-of range, battery low/OK) are returned in sysCodeChan.

You can also read only part of a recording, the following code will read the first 10 seconds of the recording:


dats, evtTab, trigChan, statChan = readBDF("res1.bdf", from=0, to=10)

The readBDFHeader function can be used to get information on the BDF recording:


bdfInfo = readBDFHeader("res1.bdf")

Get the duration of the recording:


bdfInfo["duration"]

Get the sampling rate of each channel:


bdfInfo["sampRate"]

Get the channel labels:


bdfInfo["chanLabels"]

To read the information stored in the status channel you can use the decodeStatusChannel function


statusChanInfo = decodeStatusChannel(sysCodeChan)

this will return a dictionary with several arrays that indicate for each sample of the recordings whether CMS was in range, whether the battery charge was low, the speedmode of the system, and other information stored in the status channel.

Beware that BDF.jl does not check that you have sufficient RAM to read all the data in a BDF file. If you try to read a file that is too big for your hardware, your system may become slow or unresponsive. Initially try reading only a small amount of data, and check how much RAM that uses.

Documentation is available here:

http://samcarcagno.altervista.org/BDF/index.html

Download Details:

Author: Sam81
Source Code: https://github.com/sam81/BDF.jl 
License: MIT license

#julia #modules 

David Keebler

What Is Pandas in Python? Everything You Need to Know

In this Python article, we will learn what Pandas is in Python and everything you need to know about it. According to a Forbes magazine report, 2019 was a record year for enterprises' interest in data science, AI, and machine learning features in their business strategies and goals. This means businesses around the world have started making corporate decisions based on the data they have collected over the years, using machine and deep learning methods.

Programming languages like Python and R provide a great platform for anyone starting out in machine learning and AI to analyze and extract useful insights for businesses. Dealing with data for analysis and visualization is an essential process in machine learning and artificial intelligence.

What is Pandas?

In short, Pandas is a software library written for the Python programming language that is used for data analysis and manipulation.

Also read Python Numpy Tutorial and Fibonacci Series in Python

You can also take up a Python Pandas course designed for beginners.

We all know that Python is primarily a general-purpose programming language. However, with the introduction of data handling libraries like NumPy and Pandas, data visualization libraries like Seaborn and Matplotlib, and its easy-to-understand, simple syntax, Python is rapidly gaining popularity among data science and ML professionals. The picture below shows a Google Trends page comparing the growth (in terms of Google searches) of Python and R over the past 15 years. It is evident that Python is seeing exponential growth, while R is dropping down.
 

In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis in a given dataset.

So, what is Pandas and how is it used in AI?

Artificial Intelligence is about executing machine learning algorithms on products that we use every day. Any ML algorithm, for it to be effective, needs the following prerequisite steps to be done.

  • Data Collection – Conducting opinion Surveys, scraping the internet, etc.
  • Data Handling – Viewing data as a table, performing cleaning activities like checking for spellings, removal of blanks and wrong cases, removal of invalid values from data, etc.
  • Data Visualization – plotting appealing graphs, so anyone who looks at the data can know what story the data tells us.

“Pandas” – short for “Panel Data” (A panel is a 3D container of data) – is a library in python which contains in-built functions to clean, transform, manipulate, visualize and analyze data.
 

Getting started…
 

NumPy – Numerical Python – forms the basis of what Pandas is all about. While NumPy deals with "arrays" and "matrices", Pandas deals with "series" and "data frames". In order to work with Pandas, Python first has to be installed on your system. Download and install Python from here – https://www.python.org/downloads/windows/

You can verify python installation by entering “python” in the command prompt. The command gives the installed version of python.

Python also gets installed automatically through an application called "Anaconda", which simplifies package/library management when many basic packages need to be installed for a project.

Anaconda installer download – https://www.anaconda.com/distribution/#windows.
 

Once Anaconda is installed, you can navigate to the 'lib' folder within the Anaconda installation to have a look at all the packages that were installed by default. One such package is Pandas. In order to import Pandas into our environment, we will use a Jupyter Notebook in this article.

Jupyter Notebook is basically a web application, mainly used in data science and machine learning to develop and share code. It is part of the Anaconda installation and can be accessed through Anaconda's UI as shown below.

Click on "Launch" to open the Jupyter Notebook. Each cell in this notebook can hold one or more Python commands. Typing and executing the following command imports Pandas into our working environment.
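The screenshot of that command is not included here; the statement it refers to is the standard import used throughout the rest of this article:

import pandas as pd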
 

Now that we have installed Pandas successfully, let us learn how to do some analysis on the data.

Data Types

A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.
 

Data type    Use
int          Integer number, e.g. 10, 12
float        Floating-point number, e.g. 100.2, 3.1415
bool         True/False value
object       Text, non-numeric, or a combination of text and non-numeric values, e.g. Apple
DateTime     Date and time values
category     A finite list of values
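As a quick illustration of these types (a minimal sketch with made-up values), you can build a small data frame and inspect the dtype Pandas assigns to each column:

import pandas as pd

df = pd.DataFrame({'Count': [10, 12],                       # int
                   'Price': [100.2, 3.1415],                # float
                   'InStock': [True, False],                # bool
                   'Fruit': ['Apple', 'Mango']})            # object
df['Fruit'] = df['Fruit'].astype('category')                # category
df['Added'] = pd.to_datetime(['2020-10-28', '2020-10-29'])  # DateTime
print(df.dtypes)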

What does Pandas deal with?

There are two major categories of data that you can come across while doing data analysis.

  • One dimensional data
  • Two-dimensional data

This data can be of any data type: character, number, or even an object.

Series in Pandas is one-dimensional data, and data frames are 2-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.

In the example shown below, "Types of Vehicles" is a series of the datatype "object" and is treated as a character array. "Count" is another series of the type "integer". "Number Of wheels" is the third series, again of the type "integer". The individual series are one-dimensional and hold only one data type. However, the data frame as a whole is two-dimensional and heterogeneous in nature.

This is the reason why Pandas is so powerful and so widely used today in the data science world.
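The screenshot of that example is not included here; the sketch below rebuilds an equivalent data frame with made-up values, so you can see one object series and two integer series living together in a single, heterogeneous data frame:

import pandas as pd

df = pd.DataFrame({'Types of Vehicles': ['Car', 'Bike', 'Truck'],  # object series
                   'Count': [120, 300, 45],                        # integer series
                   'Number Of wheels': [4, 2, 6]})                 # integer series
print(df)
print(df.dtypes)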
 

Data Frame

Creating Series & data frames in python
 

A series can be created in three different ways: by converting an array, a list, or a dictionary into a series. We will see an example for each of these categories in the sketch below.

Array: we first create an array using the NumPy package and then convert it into a series using the Series() function.

The same can be done for lists as well.

A data frame can be created using the DataFrame() function, as shown below.

We can also create data frames with multiple series by using dictionaries and converting them into a data frame.
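The screenshots of these examples are not included here; a minimal sketch of the approaches described above (the values are made up) could look like this:

import pandas as pd
import numpy as np

# From a NumPy array
arr = np.array([10, 20, 30])
s_from_array = pd.Series(arr)

# From a list
s_from_list = pd.Series([10, 20, 30])

# From a dictionary (the keys become the index)
s_from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# A data frame from a single series, and one from a dictionary of multiple series
df_single = pd.DataFrame(s_from_list, columns=['Count'])
df_multi = pd.DataFrame({'Count': s_from_list, 'Double': s_from_list * 2})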

All statistical functions

Function    Description
count()     Returns the number of non-null values
sum()       Returns the sum of all values
mean()      Returns the average of all values
median()    Returns the median of all values
mode()      Returns the mode
std()       Returns the standard deviation
min()       Returns the minimum of all values
max()       Returns the maximum of all values
abs()       Returns the absolute value
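As a small illustration (the values are made up), these functions can be called on a whole data frame or on a single column:

import pandas as pd

df = pd.DataFrame({'Salary': [1000, 80000, 79000, 93000],
                   'Age': [33, 50, 45, 52]})

print(df.count())           # non-null values per column
print(df['Salary'].sum())   # 253000
print(df['Age'].mean())     # 45.0
print(df['Age'].median())   # 47.5
print(df['Salary'].std())   # standard deviation of Salary
print(df['Age'].min(), df['Age'].max())   # 33 52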

Input and Output

Often, you won't be creating data yourself; you will already have it in some form and will want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.

The below table shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.
 

Input type              Reader        Writer
CSV                     read_csv      to_csv
JSON                    read_json     to_json
HTML                    read_html     to_html
Excel                   read_excel    to_excel
SAS                     read_sas      (none)
Python Pickle Format    read_pickle   to_pickle
SQL                     read_sql      to_sql
Google Big Query        read_gbq      to_gbq

In the below example, I have shown how to read a CSV file.

import pandas as pd
import numpy as np

# Read the input file
df = pd.read_csv('/content/player_data.csv')

df.head()

Output

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

 

The example below shows how to save a DataFrame to a CSV file.

import pandas as pd

# Initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# Create DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# Save to CSV
df.to_csv("avengers.csv")

 

Aggregation

An aggregate function can be applied to one or more columns. You can either apply the same aggregate function across various columns or different aggregate functions across various columns.
Commonly used aggregate functions: sum, min, max, mean.
 

Example: Same aggregate function on all columns.
 

import pandas as pd

# Initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# Create DataFrame
df = pd.DataFrame(data)

df.aggregate(['sum', 'min', 'max', 'mean'])

Output

                                                                                       Name Salary         Age

sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0     180.0

min Brad Pitt                                                                       1000.0         33.0

max jennifer Lawrence                                                           93000.0        52.0

mean NaN                                                                                   63250.0        45.0

 

Example: Different aggregate functions for different columns.

import pandas as pd

# Initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# Create DataFrame
df = pd.DataFrame(data)

df.aggregate({'Salary': ['sum', 'mean'],
              'Age': ['min', 'max']})

Output

            Salary             Age

max NaN             52.0

mean 63250.0 NaN

min NaN             33.0

sum 253000.0 NaN

 

Groupby

Pandas groupby function is used to split the DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.
 

import pandas as pd
import numpy as np

# Read the input file
df = pd.read_csv('/content/player_data.csv')

df.head()

Output:

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

 

Let’s groupby the players’ college names.

# Group the data on college
gd = df.groupby('college')

gd.first()
 

Output:

name year_start year_end position height weight birth_date

college

Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946

Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989

Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986

Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959

Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950

… … … … … … … …

Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976

Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922

Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965

Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947

Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921

 

Let's print the values in any one of the groups. get_group() takes the group key; here we group the data on the 'position' and 'name' columns first (multi-key grouping is described in the next example), so the key is a tuple:

# Group the data on position and name, then fetch a single group by its key
gd = df.groupby(['position', 'name'])
gd.get_group(('C', 'A.J. Bramlett'))

Output
 

            Year_start  year_end height weight birth_date             college

435 2000       2000        6-10 227.0 January 10, 1977 University of Arizona
 

 

Let’s create groups based on more than one category
 

# Group the data on position and name
gd = df.groupby(['position', 'name'])

gd.first()

Output

year_start year_end height weight birth_date college

position name

C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona

A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University

Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh

Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University

Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University

… … … … … … … …

G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis

Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco

Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University

Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University

Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN

 

Merging, Joining and Concatenation

Before I start with the Pandas join and merge functions, let me introduce you to the four different types of joins: inner join, left join, right join, and outer join.

 

  • Full outer join: Combines results from both DataFrames. The result will have all rows from both DataFrames, with NaN where a row has no match.
  • Inner join: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
  • Right join: Right join uses all records from DataFrame B and matching records from DataFrame A.
  • Left join: Left join uses all records from DataFrame A and matching records from DataFrame B.

Merging

Merging a Dataframe with one unique key.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head())
print("\n")
print(df2.head())

res = pd.merge(df1, df2, on='key')
res

Output

    key  Name  Age

0  K0   Mercy   27

1  K1  Prince   24

2  K2    John   22

3  K3    Cena   32

 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd
 

            key Name Age Address Qualification

0 K0 Mercy 27 Canada Btech

1 K1 Prince 24 UK             B.A

2 K2 John 22 India             MS

3 K3 Cena 32 USA             Phd

 

Merging DataFrames using multiple keys.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head())
print("\n")
print(df2.head())

res = pd.merge(df1, df2, on=['key', 'Address'])
res

Output

    key    Name    Address  Age

0  K0   Mercy     Canada   27

1  K1  Prince  Australia     24

2  K2    John      India        22

3  K3    Cena      Japan     32

 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd
 

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K2 John India             22 MS

 

Left merge

In pd.merge() I pass the argument ‘how = left’ to perform a left merge.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res

Output

    key    Name    Address  Age

0  K0   Mercy     Canada   27

1  K1  Prince    Australia   24

2  K2    John        India      22

3  K3    Cena      Japan     32 
 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK            B.A

2  K2   India            MS

3  K3     USA          Phd 
 

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K1 Prince Australia 24 NaN

2 K2 John India             22 MS

3 K3 Cena Japan             32 NaN

 

Right merge

In pd.merge() I pass the argument 'how = right' to perform a right merge.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res

Output

    key    Name    Address  Age

0  K0     Mercy     Canada   27

1  K1     Prince  Australia    24

2  K2     John      India        22

3  K3     Cena      Japan     32 
 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd 
 

            key Name Address Age Qualification

0 K0 Mercy Canada 27.0 Btech

1 K1 NaN UK             NaN B.A

2 K2 John India             22.0 MS

3 K3 NaN USA             NaN Phd

 

Outer Merge

In pd.merge(), I pass the argument 'how = outer' to perform an outer merge.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res

Output

    key    Name    Address     Age

0  K0     Mercy     Canada     27

1  K1     Prince     Australia   24

2  K2     John       India         22

3  K3     Cena      Japan       32 
 

    key Address  Qualification

0  K0  Canada         Btech

1  K1      UK             B.A

2  K2   India             MS

3  K3     USA           Phd 
 

            key Name Address Age Qualification

0 K0 Mercy Canada 27.0 Btech

1 K1 Prince Australia 24.0 NaN

2 K2 John India             22.0 MS

3 K3 Cena Japan             32.0 NaN

4 K1 NaN UK             NaN B.A

5 K3 NaN USA             NaN Phd

 

Inner Merge

In pd.merge(), I pass the argument 'how = inner' to perform an inner merge.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res

Output

    key    Name    Address   Age

0  K0     Mercy     Canada   27

1  K1     Prince  Australia    24

2  K2     John      India        22

3  K3     Cena      Japan     32 
 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK             B.A

2  K2   India             MS

3  K3     USA           Phd 
 

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K2 John India             22 MS

 

Join

Join is used to combine DataFrames based on their index values.

Example
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = df1.join(df2)
res

Output

     Name  Age

0   Mercy   27

1  Prince   24

2    John   22

3    Cena   32 
 

    Address Qualification

0  Canada         Btech

1      UK            B.A

2   India            MS

3     USA           Phd 
 

            Name Age Address Qualification

0 Mercy 27 Canada Btech

1 Prince 24 UK             B.A

2 John 22 India             MS

3 Cena 32 USA             Phd

 

Performing a join with the 'how' parameter. The different inputs to the 'how' parameter are: inner, outer, left, and right.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = df1.join(df2, how='inner')
res

Output

     Name   Age

0   Mercy   27

1  Prince    24

2    John    22

3    Cena   32 
 

    Address Qualification

0  Canada         Btech

1      UK             B.A

2   India             MS

3     USA           Phd 
 

            Name Age Address Qualification

0 Mercy 27 Canada Btech

1 Prince 24 UK             B.A

2 John 22 India             MS

3 Cena 32 USA             Phd

 

Concatenation

Concatenating using ‘.concat()’ function

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]
res = pd.concat(frames)
res

Output

Name Age Address Qualification

K0 Mercy 27.0 NaN             NaN

K1 Prince 24.0 NaN             NaN

K2 John 22.0 NaN             NaN

K3 Cena 32.0 NaN             NaN

K0 NaN NaN Canada Btech

K1 NaN NaN UK             B.A

K2 NaN NaN India             MS

K3 NaN NaN USA             Phd
 

 

The resultant DataFrame has a repeated index. If you want the new DataFrame to have its own index, set 'ignore_index' to True.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]
res = pd.concat(frames, ignore_index=True)
res

Output

            Name Age Address Qualification

0 Mercy 27.0 NaN             NaN

1 Prince 24.0 NaN             NaN

2 John 22.0 NaN             NaN

3 Cena 32.0 NaN             NaN

4 NaN NaN Canada Btech

5 NaN NaN UK             B.A

6 NaN NaN India             MS

7 NaN NaN USA             Phd

 

The second DataFrame is concatenated below the first one, so the resultant DataFrame gains new rows. If you want the second DataFrame to be added as columns instead, pass the argument axis=1.
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]
res = pd.concat(frames, axis=1, ignore_index=True)
res

Output

            0 1 2             3

K0 Mercy 27 Canada Btech

K1 Prince 24 UK             B.A

K2 John 22 India             MS

K3 Cena 32 USA             Phd
 

 

Concatenating using ‘.append()’ function

The append function concatenates along axis = 0 only, and it can take multiple objects as input. (Note that DataFrame.append has been deprecated in recent pandas versions in favor of pd.concat.)
 

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

df1.append(df2)
 

Output

            Name Age Address Qualification

K0 Mercy 27.0 NaN             NaN

K1 Prince 24.0 NaN             NaN

K2 John 22.0 NaN             NaN

K3 Cena 32.0 NaN             NaN

K0 NaN NaN Canada Btech

K1 NaN NaN UK             B.A

K2 NaN NaN India             MS

K3 NaN NaN USA             Phd
 

 

Date Time

You will often encounter time data. Pandas is a very useful tool when working with time series data. 

Generating a datetime range

In the code below, I generate a range of datetimes with an hourly frequency.

import pandas as pd

# Create a datetime index with an hourly frequency
date = pd.date_range('10/28/2011', periods=5, freq='H')

date

Output

DatetimeIndex(['2011-10-28 00:00:00', '2011-10-28 01:00:00',
               '2011-10-28 02:00:00', '2011-10-28 03:00:00',
               '2011-10-28 04:00:00'],
              dtype='datetime64[ns]', freq='H')
 

 

In the code below, I generate datetimes using a range which has a starting value, an ending value, and periods, which specifies how many samples I want:
 

import pandas as pd

date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)

date

Output

DatetimeIndex(['2018-09-28 00:00:00', '2018-10-01 08:00:00',
               '2018-10-04 16:00:00', '2018-10-08 00:00:00',
               '2018-10-11 08:00:00', '2018-10-14 16:00:00',
               '2018-10-18 00:00:00', '2018-10-21 08:00:00',
               '2018-10-24 16:00:00', '2018-10-28 00:00:00'],
              dtype='datetime64[ns]', freq=None)

 

To convert the datetimes into a Pandas Series or a DataFrame, just pass them into the corresponding initializer.

Converting to timestamps

You can use the ‘to_datetime’ function to convert a Pandas Series or list-like object. When passed a Series, it returns a Series. If you pass a string, it returns a timestamp.
 

import pandas as pd

date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']))

date

Output

0   2020-07-04

1   2020-10-28

dtype: datetime64[ns]

 

In the below code I have specified the format of my input datetime. This speeds up the processing.
 

import pandas as pd

date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')

date
 

Output

Timestamp('1994-07-04 00:00:00')

 

Dividing datetime into its features

Datetime can be divided into its components using-
 

pandas.Series.dt.year returns the year.

pandas.Series.dt.month returns the month.

pandas.Series.dt.day returns the day.

pandas.Series.dt.hour returns the hour.

pandas.Series.dt.minute returns the minute.
 

import pandas as pd

# Create a DataFrame with a datetime column
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')

# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute

# Print the dates divided into features
date.head()

Output
 

                                      date year month day hour minute

0 2020-10-28 00:00:00 2020 10 28 0 0

1 2020-10-28 01:00:00 2020 10 28 1 0

2 2020-10-28 02:00:00 2020 10 28 2 0

3 2020-10-28 03:00:00 2020 10 28 3 0

4 2020-10-28 04:00:00 2020 10 28 4 0

 

Visualization

Pandas can also be used to visualize data. 

Line plot

In the below code I am generating a line plot. I am using random normal values generated by NumPy as input.
 

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))

df.plot()

 

Bar/Horizontal Bar plot

Bar plot can be made by using ‘.plot.bar()’. Pass the argument ‘stacked = True’ if you want stacked bars.
 

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])

df.plot.bar()
# Using stacked bars
df.plot.bar(stacked=True)
 

 

To generate a horizontal bar graph, use ‘.plot.barh()’. You can also pass the argument ‘stacked = True’ if you want the bars to be stacked.
 

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Using stacked bars
df.plot.barh(stacked=True)

 

Histograms

To generate a histogram use ‘DataFrame.plot.hist()’. Pass the argument ‘bins’ specifying how many bins you want.

Example – df.plot.hist()
 

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.plot.hist(bins=20)

 

To plot separate histograms for all your inputs, use your DataFrame name followed by ‘.hist()’. Pass the argument ‘bins’ specifying how many bins you want.

Example- df.hist()
 

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.hist(bins=20)

 

To plot a single histogram for any one of your columns, pass the column name in square brackets followed by '.hist()'.

Example- df[‘A’].hist()
 

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df['A'].hist(bins=20)

 

Scatter plot

Scatter plot can be created using DataFrame.plot.scatter() method.

Example- df.plot.scatter()
 

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.plot.scatter(x='A', y='B')

 

Pie chart

To generate a pie chart use ‘.plot.pie()’

Example – df.plot.pie()
 

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])
df.plot.pie(subplots=True)

 

 

Data Handling with Pandas..

  1. Reading from a CSV or an Excel file – Pandas provides two functions, read_csv() and read_excel(), to read data from a CSV and an Excel file respectively. The commands can be used as in the sketch below.
  2. Viewing data – Viewing data from a data frame can be done in three ways:
    • using the data frame's name – returns the top and bottom 5 rows of the data frame.
    • using the dataframe.head() function
    • using the dataframe.tail() function
  3. To see more details on the data frame, the info() function can be used. info() gives an idea of what datatype each series in a data frame holds.
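The screenshots for these commands are not included here; a minimal sketch (the file name is illustrative) could be:

import pandas as pd

df = pd.read_csv('player_data.csv')    # or pd.read_excel('player_data.xlsx')

df          # in a notebook, shows the top and bottom 5 rows
df.head()   # first 5 rows
df.tail()   # last 5 rows
df.info()   # column names, non-null counts and dtypes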

 

  1. The following functions are used to find the unique entries within a series/column in a data frame (see the sketch after this list).
    • dataframe.unique() – returns the unique values
    • dataframe.nunique() – returns the count of unique values
    • dataframe.value_counts() – returns the frequency of each of the categories in the column
  2. In our example, the titanic dataset contains a column called "Survived" which tells whether a particular passenger survived the tragedy. Since this value can only be either 0 or 1, we can convert its data type from integer to object.
    • dataframe.astype() is the function which lets us do the conversion
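A short sketch of these calls, assuming a local copy of the titanic dataset (the file name titanic.csv is illustrative) with the usual 'Survived' column:

import pandas as pd

df = pd.read_csv('titanic.csv')   # hypothetical path to the titanic dataset

df['Survived'].unique()           # e.g. array([0, 1])
df['Survived'].nunique()          # 2
df['Survived'].value_counts()     # frequency of each category

# Since 'Survived' can only be 0 or 1, convert it from integer to object
df['Survived'] = df['Survived'].astype('object')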

Missing Values – Identification and Imputation..
 

How to identify missing values? 

Pandas provides the following functions to find out whether the data frame has missing or null values (a short sketch follows the list).

  • dataframe.isna()
  • dataframe.isnull()
  • dataframe.isna().sum() – gives the count of NAs in each column/series of the dataframe.
  • dataframe.isna().sum().sum() – gives the count of NAs in the whole dataframe.
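A minimal, self-contained sketch of these checks (the small data frame is made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 30],
                   'Cabin': [np.nan, np.nan, 'C85']})

df.isna()               # same as df.isnull(): True where a value is missing
df.isna().sum()         # NA count per column -> Age 1, Cabin 2
df.isna().sum().sum()   # NA count in the whole dataframe -> 3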

Imputation – Drop or replace?

  • Pandas provides the following functions to deal with imputation (sketched below).
    • dataframe.dropna()
    • dataframe.fillna()
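Continuing the same made-up example, dropping or filling the missing values might look like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 30],
                   'Cabin': [np.nan, np.nan, 'C85']})

df.dropna()                          # drop every row that contains a missing value
df.dropna(subset=['Age'])            # drop rows only where 'Age' is missing
df.fillna({'Age': df['Age'].mean(),  # replace missing ages with the mean age
           'Cabin': 'Unknown'})      # replace missing cabins with a placeholder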

Indexing & Filtering in pandas

  1. We can access any row in a dataframe using the following functions:
    • dataframe.loc() – returns the row based on the label of the index.
    • dataframe.iloc() – returns the row based on the integer position of the index.
  2. We can filter out the required data with the help of '[]', as shown in the sketch below.
    • '&' is used when the dataframe has to be filtered by multiple conditions.
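A minimal sketch of these access patterns (the data is made up):

import pandas as pd

df = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John', 'Cena'],
                   'Age': [27, 24, 22, 32]},
                  index=['a', 'b', 'c', 'd'])

df.loc['b']      # row whose index label is 'b'
df.iloc[1]       # row at integer position 1 (the same row here)
df[df['Age'] > 25]                               # filter on a single condition
df[(df['Age'] > 25) & (df['Name'] != 'Cena')]    # multiple conditions joined with &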

 

To Summarize

In this article, we discussed the basics of Pandas, including creating data frames, handling missing values, and data retrieval methods. It is said that 80% of a data scientist's job is data handling and manipulation. So, if you choose to go with Python for your ML project, it is very important that you know how Pandas operates.


Original article source at: https://www.mygreatlearning.com

#python #pandas 

Lawson Wehner

How to Use Bash Set Command

Bash has many environment variables for various purposes. The set command of Bash is used to modify or display the different attributes and parameters of the shell environment. This command has many options for performing different types of tasks. The uses of the set command for various purposes are described in this tutorial.

Syntax

set [options] [arguments]

This command can be used with different types of options and arguments for different purposes. If no option or argument is used with this command, the shell variables are printed. A minus sign (-) before an option enables that option, and a plus sign (+) before an option disables it.

Exit Values of Set Command

Three exit values can be returned by this command, which are mentioned in the following:

  1. Zero (0) is returned when the task completes successfully.
  2. One (1) is returned if a failure occurs because of an invalid argument.
  3. One (1) is returned if a failure occurs because of a missing argument.

Different Options of Set Command

The purposes of the most commonly used options of the set command are described in this part of the tutorial.

Option   Purpose
-a       It marks the variables and functions which are created or modified for export.
-b       It reports job termination.
-B       It performs brace expansion.
-C       It disables overwriting of existing files via redirection.
-e       It exits if a command returns a non-zero exit status.
-f       It disables filename generation (globbing).
-h       It saves (hashes) the location of commands as they are looked up.
-m       It enables job control.
-n       It reads commands without executing them.
-t       It exits after executing a single command.
-u       It treats unset variables as errors.
-v       It prints the shell input lines.
-x       It displays the commands and their arguments sequentially. It is mainly used to debug the script.

Different Examples of the Set Command

The uses of set command with different options are shown in this part of this tutorial.

Example 1: Using the Set Command with -a Option

Create a Bash file with the following script that enables the "set -a" option and initializes three variables named $v1, $v2, and $v3. These variables can be accessed after executing the script.

#!/bin/bash
#Enable -a option to read the values of the variables
set -a
#Initialize three variables
v1=78
v2=50
v3=35

Run the script using the following command:

$ bash set1.bash

Read the values of the variable using the “echo” command:

$ echo $v1 $v2 $v3

The following output appears after executing the previous commands:

Example 2: Using the Set Command with -C Option

Run the "cat" command to create a text file named testfile.txt. Next, run the "set -C" command to disable the overwriting feature. Then run the "cat" command again to overwrite the file and check whether the overwriting feature is disabled or not.

$ cat > testfile.txt
$ set -C
$ cat > testfile.txt

The following output appears after executing the previous commands:

Example 3: Using the Set Command with -x Option

Create a Bash file with the following script that declares a numeric array of 6 elements. The values of the array are printed using a for loop.

#!/bin/bash
#Declare an array
arr=(67 3 90 56 2 80)
#iterate the array values
for value in ${arr[@]}
do
   echo $value
done

Execute the previous script by the following command:

$ bash set3.bash

Enable the debugging option using the following command:

$ set -x

The following output appears after executing the provided commands:

Example 4: Using the Set Command with -e Option

Create a Bash file with the following script that reads a file using the "cat" command before and after using the "set -e" command.

#!/bin/bash
#Read a non-existing file without setting set -e
cat myfile.txt
echo "Reading a file..."
#Set the set command with -e option
set -e
#Read a non-existing file after setting set -e
cat myfile.txt
echo "Reading a file..."

The following output appears after executing the provided commands. The first error message is shown because the file does not exist in the current location. The next message is then printed. But after executing the "set -e" command, the execution stops after displaying the error message.

Example 5: Using the Set Command with -u Option

Create a Bash file with the following script that initializes a variable but prints both an initialized and an uninitialized variable before and after using the "set -u" command.

#!/bin/bash
#Assign value to a variable
strvar="Bash Programming"
printf "$strvar $intvar\n"
#Set the set command with -u option
set -u
#Assign value to a variable
strvar="Bash Programming"
printf "\n$strvar $intvar\n"

The following output appears after executing the previous script. Here, the error is printed for the uninitialized variable:

Example 6: Using the Set Command with -f Option

Run the following command to print the list of all text files of the current location:

$ ls *.txt

Run the following command to disable the globbing:

$ set -f

Run the following command again to print the list of all text files of the current location:

$ ls *.txt

The following output appears after executing the previous commands. Based on the output, the "ls *.txt" command did not work after setting the "set -f" option:

Example 7: Split the String Using the Set Command with Variable

Create a Bash file with the following script that splits a string value on spaces using the "set --" command with a variable. The split values are printed later.

#!/bin/bash
#Define a string variable
myvar="Learn bash programming"
#Set the set command without option and with variable
set -- $myvar
#Print the split value
printf "$1\n$2\n$3\n"

The following output appears after executing the previous script. The string value is divided into three parts based on the spaces, and these parts are printed:

Conclusion

The uses of the different options of the "set" command have been shown in this tutorial using multiple examples to help you learn the basic uses of this command.

Original article source at: https://linuxhint.com/

#bash #set #command