
Windless Makes It Easy to Implement Invisible Layout Loading View

Windless

Windless makes it easy to implement an invisible layout loading view.

[Demo GIFs: basic, table, collection]

Requirements

  • iOS 8.0+
  • Xcode 9.0+
  • Swift 4.0+

Installation

CocoaPods

CocoaPods is a dependency manager for Cocoa projects. You can install it with the following command:

$ gem install cocoapods

CocoaPods 1.1+ is required to build Windless.

To integrate Windless into your Xcode project using CocoaPods, specify it in your Podfile:

source 'https://github.com/CocoaPods/Specs.git'
platform :ios, '8.0'
use_frameworks!

target '<Your Target Name>' do
    pod 'Windless', '~> 0.1.5'
end

Then, run the following command:

$ pod install

Carthage

Carthage is a decentralized dependency manager that builds your dependencies and provides you with binary frameworks.

You can install Carthage with Homebrew using the following command:

$ brew update
$ brew install carthage

To integrate Windless into your Xcode project using Carthage, specify it in your Cartfile:

github "Interactive-Studio/Windless" ~> 0.1.5

Run carthage update to build the framework and drag the built Windless.framework into your Xcode project.

Manually

If you prefer not to use either of the aforementioned dependency managers, you can integrate Windless into your project manually.


Usage

Code

import Windless

class ViewController: UIViewController {

    lazy var contentsView = UIView()
    
    var subView1 = UIView()
    var subView2 = UIView()

    override func viewDidLoad() {
        super.viewDidLoad()

        self.view.addSubview(contentsView)
        contentsView.addSubview(subView1)
        contentsView.addSubview(subView2)
        
        // start
        contentsView.windless
                .setupWindlessableViews([subView1, subView2])
                .start()
                
        // stop
        contentsView.windless.end()
    }

}

Storyboard, Xib

If you use a Storyboard or XIB, you only need to set the isWindlessable flag to true in the view inspector for the views you want to show as placeholders; you do not have to pass those views to the setupWindlessableViews method.

import Windless

class ViewController: UIViewController {

    @IBOutlet weak var contentsView: UIView!

    override func viewDidLoad() {
        super.viewDidLoad()

        contentsView.windless.start()
    }
}

Multiline

Depending on the lineHeight and spacing values, UILabel and UITextView reconstruct their layout when the windless animation runs.

public protocol CanBeMultipleLines {

    var lineHeight: CGFloat { get set }

    var spacing: CGFloat { get set }
}
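For illustration, here is a minimal sketch of how these properties might be used, assuming UILabel conforms to CanBeMultipleLines and that lineHeight and spacing control the placeholder line metrics (the concrete values are arbitrary):

let label = UILabel()
label.numberOfLines = 3 // the placeholder is drawn as multiple fake lines
label.lineHeight = 12   // assumed: height of each fake text line
label.spacing = 8       // assumed: gap between fake text lines

contentsView.windless
    .setupWindlessableViews([label])
    .start()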
[Images: multiline configuration and result]

Custom Options

There are several customizable options in Windless.

public class WindlessConfiguration {
    
    /// The direction of windless animation. Defaults to rightDiagonal.
    public var direction: WindlessDirection = .rightDiagonal
    
    /// The speed of windless animation. Defaults to 1.
    public var speed: Float = 1
    
    /// The duration of the fade used when windless begins. Defaults to 0.
    public var beginTime: CFTimeInterval = 0
    
    /// The duration of one windless animation cycle in seconds. Defaults to 4.
    public var duration: CFTimeInterval = 4
    
    /// The pause between windless animation cycles in seconds. Defaults to 2.
    public var pauseDuration: CFTimeInterval = 2
    
    /// The timing function of the gradient animation. Defaults to .easeOut.
    public var timingFuction: CAMediaTimingFunction = .easeOut
    
    /// The center color of the gradient layer. Defaults to .lightGray.
    public var animationLayerColor: UIColor = .lightGray
    
    /// The background color of the mask layer. Defaults to .groupTableViewBackground.
    public var animationBackgroundColor: UIColor = .groupTableViewBackground
    
    /// The opacity of the content while it is windless. Defaults to 0.8.
    public var animationLayerOpacity: CGFloat = 0.8
}

To set the options, use the apply method as shown below.

import Windless

class ViewController: UIViewController {

    @IBOutlet weak var contentsView: UIView!

    override func viewDidLoad() {
        super.viewDidLoad()

        contentsView.windless
            .apply {
                $0.beginTime = 1
                $0.pauseDuration = 2
                $0.duration = 3
                $0.animationLayerOpacity = 0.5
            }
            .start()
    }
}

For more detailed usage, please refer to the Example project.

Looks

The isWindlessable value determines how the loading view looks. The images below show how the loading screen will look according to the isWindlessable value.

[Images: how the loading view looks for isWindlessable = true / false]

Communication

  • If you found a bug, open an issue.
  • If you have a feature request, open an issue.
  • If you want to contribute, submit a pull request.

Credits

Download Details:

Author: ParkGwangBeom
Source Code: https://github.com/ParkGwangBeom/Windless 
License: MIT license

#swift #ios #skeleton #animation #loading 


Dotnet Script: Run C# Scripts From The .NET CLI

dotnet script

Run C# scripts from the .NET CLI, define NuGet packages inline and edit/debug them in VS Code - all of that with full language services support from OmniSharp.

NuGet Packages

Name | Version | Framework(s)
dotnet-script (global tool) | Nuget | net6.0, net5.0, netcoreapp3.1
Dotnet.Script (CLI as Nuget) | Nuget | net6.0, net5.0, netcoreapp3.1
Dotnet.Script.Core | Nuget | netcoreapp3.1, netstandard2.0
Dotnet.Script.DependencyModel | Nuget | netstandard2.0
Dotnet.Script.DependencyModel.Nuget | Nuget | netstandard2.0

Installing

Prerequisites

The only thing we need to install is the .NET Core 3.1 SDK or the .NET 5.0 SDK.

.NET Core Global Tool

.NET Core 2.1 introduced the concept of global tools, meaning that you can install dotnet-script using nothing but the .NET CLI.

dotnet tool install -g dotnet-script

Tool 'dotnet-script' (version '0.22.0') was successfully installed.

You can then invoke the tool using the following command: dotnet-script

The advantage of this approach is that you can use the same command for installation across all platforms. .NET Core SDK also supports viewing a list of installed tools and their uninstallation.

dotnet tool list -g

Package Id         Version      Commands
---------------------------------------------
dotnet-script      0.22.0       dotnet-script
dotnet tool uninstall dotnet-script -g

Tool 'dotnet-script' (version '0.22.0') was successfully uninstalled.

Windows

choco install dotnet.script

We also provide a PowerShell script for installation.

(new-object Net.WebClient).DownloadString("https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.ps1") | iex

Linux and Mac

curl -s https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.sh | bash

If permission is denied we can try with sudo

curl -s https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.sh | sudo bash

Docker

A Dockerfile for running dotnet-script in a Linux container is available. Build:

cd build
docker build -t dotnet-script -f Dockerfile ..

And run:

docker run -it dotnet-script --version

GitHub

You can manually download all the releases in zip format from the GitHub releases page.

Usage

Our typical helloworld.csx might look like this:

Console.WriteLine("Hello world!");

That is all it takes and we can execute the script. Args are accessible via the global Args array.

dotnet script helloworld.csx

Scaffolding

Simply create a folder somewhere on your system and issue the following command.

dotnet script init

This will create main.csx along with the launch configuration needed to debug the script in VS Code.

.
├── .vscode
│   └── launch.json
├── main.csx
└── omnisharp.json

We can also initialize a folder using a custom filename.

dotnet script init custom.csx

Instead of main.csx which is the default, we now have a file named custom.csx.

.
├── .vscode
│   └── launch.json
├── custom.csx
└── omnisharp.json

Note: Executing dotnet script init inside a folder that already contains one or more script files will not create the main.csx file.

Running scripts

Scripts can be executed directly from the shell as if they were executables.

foo.csx arg1 arg2 arg3

OSX/Linux

Just like all scripts, on OSX/Linux you need to have a #! and mark the file as executable via chmod +x foo.csx. If you use dotnet script init to create your csx it will automatically have the #! directive and be marked as executable.

The OSX/Linux shebang directive should be #!/usr/bin/env dotnet-script

#!/usr/bin/env dotnet-script
Console.WriteLine("Hello world");

You can execute your script using dotnet script or dotnet-script, which allows you to pass arguments to control your script execution.

foo.csx arg1 arg2 arg3
dotnet script foo.csx -- arg1 arg2 arg3
dotnet-script foo.csx -- arg1 arg2 arg3

Passing arguments to scripts

All arguments after -- are passed to the script in the following way:

dotnet script foo.csx -- arg1 arg2 arg3

Then you can access the arguments in the script context using the global Args collection:

foreach (var arg in Args)
{
    Console.WriteLine(arg);
}

All arguments before -- are processed by dotnet script. For example, the following command-line

dotnet script -d foo.csx -- -d

will pass the -d before -- to dotnet script and enable the debug mode whereas the -d after -- is passed to script for its own interpretation of the argument.

NuGet Packages

dotnet script has built-in support for referencing NuGet packages directly from within the script.

#r "nuget: AutoMapper, 6.1.0"


Note: OmniSharp needs to be restarted after adding a new package reference.

Package Sources

We can define package sources using a NuGet.Config file in the script root folder. In addition to being used during execution of the script, it will also be used by OmniSharp that provides language services for packages resolved from these package sources.
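For reference, a minimal NuGet.Config might look like this (the feeds shown are illustrative; nuget.org is the public default source):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <!-- <add key="myFeed" value="https://my.company.example/v3/index.json" /> -->
  </packageSources>
</configuration>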

As an alternative to maintaining a local NuGet.Config file, we can define these package sources globally, either at the user level or at the computer level, as described in Configuring NuGet Behaviour.

It is also possible to specify packages sources when executing the script.

dotnet script foo.csx -s https://SomePackageSource

Multiple packages sources can be specified like this:

dotnet script foo.csx -s https://SomePackageSource -s https://AnotherPackageSource

Creating DLLs or Exes from a CSX file

Dotnet-Script can create a standalone executable or DLL for your script.

Switch | Long switch | Description
-o | --output | Directory where the published executable should be placed. Defaults to a 'publish' folder in the current directory.
-n | --name | The name for the generated DLL (executable not supported at this time). Defaults to the name of the script.
 | --dll | Publish to a .dll instead of an executable.
-c | --configuration | Configuration to use for publishing the script [Release/Debug]. Default is "Debug".
-d | --debug | Enables debug output.
-r | --runtime | The runtime used when publishing the self contained executable. Defaults to your current runtime.

The executable can be run directly, without requiring a .NET installation on the machine, while the DLL can be run using the dotnet CLI like this:

dotnet script exec {path_to_dll} -- arg1 arg2

Caching

We provide two types of caching, the dependency cache and the execution cache which is explained in detail below. In order for any of these caches to be enabled, it is required that all NuGet package references are specified using an exact version number. The reason for this constraint is that we need to make sure that we don't execute a script with a stale dependency graph.
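As a sketch, a cache-friendly script pins every package reference to an exact version (the package and version below are just an illustrative choice):

#r "nuget: Newtonsoft.Json, 13.0.1" // exact version, so both caches can be used

using Newtonsoft.Json;

Console.WriteLine(JsonConvert.SerializeObject(new { Cached = true }));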

Dependency Cache

In order to resolve the dependencies for a script, a dotnet restore is executed under the hood to produce a project.assets.json file from which we can figure out all the dependencies we need to add to the compilation. This is an out-of-process operation and represents a significant overhead to the script execution. So this cache works by looking at all the dependencies specified in the script(s) either in the form of NuGet package references or assembly file references. If these dependencies matches the dependencies from the last script execution, we skip the restore and read the dependencies from the already generated project.assets.json file. If any of the dependencies has changed, we must restore again to obtain the new dependency graph.

Execution cache

In order to execute a script it needs to be compiled first and since that is a CPU and time consuming operation, we make sure that we only compile when the source code has changed. This works by creating a SHA256 hash from all the script files involved in the execution. This hash is written to a temporary location along with the DLL that represents the result of the script compilation. When a script is executed the hash is computed and compared with the hash from the previous compilation. If they match there is no need to recompile and we run from the already compiled DLL. If the hashes don't match, the cache is invalidated and we recompile.
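The mechanism can be sketched roughly as follows (an illustration of the idea, not dotnet-script's actual implementation):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hash all script sources in a stable order; recompile only when the hash changes.
static string ComputeScriptHash(params string[] scriptFiles)
{
    var combined = string.Concat(scriptFiles.OrderBy(f => f).Select(File.ReadAllText));
    using var sha256 = SHA256.Create();
    var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(combined));
    // Compare this value with the hash stored next to the cached DLL.
    return Convert.ToHexString(hash);
}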

You can override this automatic caching by passing the --no-cache flag, which will bypass both caches and cause dependency resolution and script compilation to happen every time we execute the script.

Cache Location

The temporary location used for caches is a sub-directory named dotnet-script under (in order of priority):

  1. The path specified for the value of the environment variable named DOTNET_SCRIPT_CACHE_LOCATION, if defined and value is not empty.
  2. Linux distributions only: $XDG_CACHE_HOME if defined otherwise $HOME/.cache
  3. macOS only: ~/Library/Caches
  4. The value returned by Path.GetTempPath for the platform.

 

Debugging

The days of debugging scripts using Console.WriteLine are over. One major feature of dotnet script is the ability to debug scripts directly in VS Code. Just set a breakpoint anywhere in your script file(s) and hit F5 (Start Debugging).


Script Packages

Script packages are a way of organizing reusable scripts into NuGet packages that can be consumed by other scripts. This means that we now can leverage scripting infrastructure without the need for any kind of bootstrapping.

Creating a script package

A script package is just a regular NuGet package that contains script files inside the content or contentFiles folder.

The following example shows how the scripts are laid out inside the NuGet package according to the standard convention.

└── contentFiles
    └── csx
        └── netstandard2.0
            └── main.csx

This example contains just the main.csx file in the root folder, but packages may have multiple script files either in the root folder or in subfolders below the root folder.

When loading a script package we will look for an entry point script to be loaded. This entry point script is identified by one of the following.

  • A script called main.csx in the root folder
  • A single script file in the root folder

If the entry point script cannot be determined, we will simply load all the script files in the package.

The advantage of using an entry point script is that we can control loading other scripts from the package.

Consuming a script package

To consume a script package, all we need to do is specify the NuGet package in the #load directive.

The following example loads the simple-targets package that contains script files to be included in our script.

#load "nuget:simple-targets-csx, 6.0.0"

using static SimpleTargets;
var targets = new TargetDictionary();

targets.Add("default", () => Console.WriteLine("Hello, world!"));

Run(Args, targets);

Note: Debugging also works for script packages so that we can easily step into the scripts that are brought in using the #load directive.

Remote Scripts

Scripts don't actually have to exist locally on the machine. We can also execute scripts that are made available on an http(s) endpoint.

This means that we can create a Gist on Github and execute it just by providing the URL to the Gist.

This Gist contains a script that prints out "Hello World"

We can execute the script like this

dotnet script https://gist.githubusercontent.com/seesharper/5d6859509ea8364a1fdf66bbf5b7923d/raw/0a32bac2c3ea807f9379a38e251d93e39c8131cb/HelloWorld.csx

That is a pretty long URL, so why not make it a TinyURL, like this:

dotnet script https://tinyurl.com/y8cda9zt

Script Location

A pretty common scenario is logic that is relative to the script path. We don't want to require the user to be in a certain directory for these paths to resolve correctly, so here is how to obtain the script path and the script folder regardless of the current working directory.

public static string GetScriptPath([CallerFilePath] string path = null) => path;
public static string GetScriptFolder([CallerFilePath] string path = null) => Path.GetDirectoryName(path);

Tip: Put these methods as top level methods in a separate script file and #load that file wherever access to the script path and/or folder is needed.
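For example, a script can then resolve a data file relative to itself rather than the current working directory (the file names here are hypothetical):

#load "scriptLocation.csx" // a file containing the two helper methods above

var dataFile = Path.Combine(GetScriptFolder(), "data", "input.txt");
Console.WriteLine($"Reading {dataFile}");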

REPL

This release contains a C# REPL (Read-Evaluate-Print-Loop). The REPL mode ("interactive mode") is started by executing dotnet-script without any arguments.

The interactive mode allows you to supply individual C# code blocks and have them executed as soon as you press Enter. The REPL is configured with the same default set of assembly references and using statements as regular CSX script execution.

Basic usage

Once dotnet-script starts you will see a prompt for input. You can start typing C# code there.

~$ dotnet script
> var x = 1;
> x+x
2

If you submit an unterminated expression into the REPL (no ; at the end), it will be evaluated and the result will be serialized using a formatter and printed in the output. This is a bit more interesting than just calling ToString() on the object, because it attempts to capture the actual structure of the object. For example:

~$ dotnet script
> var x = new List<string>();
> x.Add("foo");
> x
List<string>(1) { "foo" }
> x.Add("bar");
> x
List<string>(2) { "foo", "bar" }
>

Inline Nuget packages

The REPL also supports inline NuGet packages, meaning that NuGet packages can be installed into the REPL from within the REPL. This is done via the #r and #load NuGet support described above and uses identical syntax.

~$ dotnet script
> #r "nuget: Automapper, 6.1.1"
> using AutoMapper;
> typeof(MapperConfiguration)
[AutoMapper.MapperConfiguration]
> #load "nuget: simple-targets-csx, 6.0.0";
> using static SimpleTargets;
> typeof(TargetDictionary)
[Submission#0+SimpleTargets+TargetDictionary]

Multiline mode

Using Roslyn syntax parsing, we also support a multiline REPL mode. This means that if you have an incomplete code block and press Enter, we will automatically enter multiline mode. The mode is indicated by the * character. This is particularly useful for declaring classes and other more complex constructs.

~$ dotnet script
> class Foo {
* public string Bar {get; set;}
* }
> var foo = new Foo();

REPL commands

Aside from the regular C# script code, you can invoke the following commands (directives) from within the REPL:

Command | Description
#load | Load a script into the REPL (same as #load usage in CSX)
#r | Load an assembly into the REPL (same as #r usage in CSX)
#reset | Reset the REPL back to initial state (without restarting it)
#cls | Clear the console screen without resetting the REPL state
#exit | Exits the REPL

Seeding REPL with a script

You can execute a CSX script and, at the end of it, drop yourself into the context of the REPL. This way, the REPL becomes "seeded" with your code: all the classes, methods and variables are available in the REPL context. This is achieved by running the script with the -i flag.

For example, given the following CSX script:

var msg = "Hello World";
Console.WriteLine(msg);

When you run this with the -i flag, Hello World is printed, the REPL starts, and the msg variable is available in the REPL context.

~$ dotnet script foo.csx -i
Hello World
>

You can also seed the REPL from inside the REPL - at any point - by invoking a #load directive pointed at a specific file. For example:

~$ dotnet script
> #load "foo.csx"
Hello World
>

Piping

The following example shows how we can pipe data in and out of a script.

The UpperCase.csx script simply converts the standard input to upper case and writes it back out to standard output.

using (var streamReader = new StreamReader(Console.OpenStandardInput()))
{
    Write(streamReader.ReadToEnd().ToUpper());
}

We can now simply pipe the output from one command into our script like this.

echo "This is some text" | dotnet script UpperCase.csx
THIS IS SOME TEXT

Debugging

The first thing we need to do is add the following to the launch.json file, which allows VS Code to debug a running process.

{
    "name": ".NET Core Attach",
    "type": "coreclr",
    "request": "attach",
    "processId": "${command:pickProcess}"
}

To debug this script we need a way to attach the debugger in VS Code and the simplest thing we can do here is to wait for the debugger to attach by adding this method somewhere.

public static void WaitForDebugger()
{
    Console.WriteLine("Attach Debugger (VS Code)");
    while(!Debugger.IsAttached)
    {
    }
}

To debug the script when executing it from the command line, we can do something like this:

WaitForDebugger();
using (var streamReader = new StreamReader(Console.OpenStandardInput()))
{
    Write(streamReader.ReadToEnd().ToUpper()); // <- SET BREAKPOINT HERE
}

Now when we run the script from the command line we will get

$ echo "This is some text" | dotnet script UpperCase.csx
Attach Debugger (VS Code)

This now gives us a chance to attach the debugger before stepping into the script: in VS Code, select the .NET Core Attach debugger and pick the process that represents the executing script.

Once that is done we should see our breakpoint being hit.

Configuration(Debug/Release)

By default, scripts will be compiled using the debug configuration. This is to ensure that we can debug a script in VS Code as well as attach a debugger to long-running scripts.

There are however situations where we might need to execute a script that is compiled with the release configuration. For instance, running benchmarks using BenchmarkDotNet is not possible unless the script is compiled with the release configuration.

We can specify this when executing the script.

dotnet script foo.csx -c release

 

Nullable reference types

Starting from version 0.50.0, dotnet-script supports .NET Core 3.0 and all the C# 8 features. The way we deal with nullable reference types in dotnet-script is that we turn every warning related to nullable reference types into a compiler error. This means every warning between CS8600 and CS8655 is treated as an error when compiling the script.

Nullable reference types are turned off by default, and the way to enable them is the #nullable enable compiler directive. This means that existing scripts will continue to work, but we can now opt in to this new feature.

#!/usr/bin/env dotnet-script

#nullable enable

string name = null;

Trying to execute the script will result in the following error:

main.csx(5,15): error CS8625: Cannot convert null literal to non-nullable reference type.

We will also see this when working with scripts in VS Code under the problems panel.


Download Details:
Author: filipw
Source Code: https://github.com/filipw/dotnet-script
License: MIT License

#dotnet  #aspdotnet  #csharp 



Matplotlib Cheat Sheet: Plotting in Python

This Matplotlib cheat sheet introduces you to the basics that you need to plot your data with Python and includes code samples.

Data visualization and storytelling with your data are essential skills that every data scientist needs to communicate insights gained from analyses effectively to any audience out there. 

For most beginners, the first package that they use to get in touch with data visualization and storytelling is, naturally, Matplotlib: it is a Python 2D plotting library that enables users to make publication-quality figures. But, what might be even more convincing is the fact that other packages, such as Pandas, intend to build more plotting integration with Matplotlib as time goes on.

However, what might slow down beginners is the fact that this package is pretty extensive. There is so much that you can do with it and it might be hard to still keep a structure when you're learning how to work with Matplotlib.   

DataCamp has created a Matplotlib cheat sheet for those who might already know how to use the package to their advantage to make beautiful plots in Python, but who still want to keep a one-page reference handy. Of course, for those who don't know how to work with Matplotlib yet, this might be the extra push needed to finally get started with data visualization in Python.

You'll see that this cheat sheet presents you with the six basic steps that you can go through to make beautiful plots. 

Check out the infographic by clicking on the button below:

Python Matplotlib cheat sheet

With this handy reference, you'll familiarize yourself in no time with the basics of Matplotlib: you'll learn how you can prepare your data, create a new plot, use some basic plotting routines to your advantage, add customizations to your plots, and save, show and close the plots that you make.

What might have looked difficult before will definitely be more clear once you start using this cheat sheet! Use it in combination with the Matplotlib Gallery and the documentation.

Matplotlib 

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Prepare the Data 

1D Data 

>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x)
>>> z = np.sin(x)

2D Data or Images 

>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))

Create Plot

>>> import matplotlib.pyplot as plt

Figure 

>>> fig = plt.figure()
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))

Axes 

>>> fig.add_axes()
>>> ax1 = fig.add_subplot(221) #row-col-num
>>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2,ncols=2)
>>> fig4, axes2 = plt.subplots(ncols=3)

Save Plot 

>>> plt.savefig('foo.png') #Save figures
>>> plt.savefig('foo.png', transparent=True) #Save transparent figures

Show Plot

>>> plt.show()

Plotting Routines 

1D Data 

>>> fig, ax = plt.subplots()
>>> lines = ax.plot(x,y) #Draw points with lines or markers connecting them
>>> ax.scatter(x,y) #Draw unconnected points, scaled or colored
>>> axes[0,0].bar([1,2,3],[3,4,5]) #Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) #Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) #Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) #Draw a vertical line across axes
>>> ax.fill(x,y,color='blue') #Draw filled polygons
>>> ax.fill_between(x,y,color='yellow') #Fill between y values and 0

2D Data 

>>> fig, ax = plt.subplots()
>>> im = ax.imshow(img, #Colormapped or RGB arrays
      cmap='gist_earth',
      interpolation='nearest',
      vmin=-2,
      vmax=2)
>>> axes2[0].pcolor(data2) #Pseudocolor plot of 2D array
>>> axes2[0].pcolormesh(data) #Pseudocolor plot of 2D array
>>> CS = plt.contour(Y,X,U) #Plot contours
>>> axes2[2].contourf(data2) #Plot filled contours
>>> axes2[2]= ax.clabel(CS) #Label a contour plot

Vector Fields 

>>> axes[0,1].arrow(0,0,0.5,0.5) #Add an arrow to the axes
>>> axes[1,1].quiver(y,z) #Plot a 2D field of arrows
>>> axes[0,1].streamplot(X,Y,U,V) #Plot a 2D field of arrows

Data Distributions 

>>> ax1.hist(y) #Plot a histogram
>>> ax3.boxplot(y) #Make a box and whisker plot
>>> ax3.violinplot(z)  #Make a violin plot
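Pulling a few of these routines together, here is a small self-contained example (illustrative, not part of the original cheat sheet):

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> y = np.random.randn(500)
>>> fig, (ax1, ax2) = plt.subplots(ncols=2)
>>> ax1.hist(y, bins=30) #Histogram of the samples
>>> ax2.boxplot(y) #Box-and-whisker plot of the same data
>>> plt.show()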

Plot Anatomy & Workflow 

Plot Anatomy 

[Figure: plot anatomy, showing the figure, axes, y-axis and x-axis]

Workflow 

The basic steps to creating plots with matplotlib are:

1. Prepare data
2. Create plot
3. Plot
4. Customize plot
5. Save plot
6. Show plot

>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] #Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() #Step 2
>>> ax = fig.add_subplot(111) #Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) #Steps 3, 4
>>> ax.scatter([2,4,6],
          [5,15,25],
          color='darkgreen',
          marker='^')
>>> ax.set_xlim(1, 6.5)
>>> plt.savefig('foo.png') #Step 5
>>> plt.show() #Step 6

Close and Clear 

>>> plt.cla() #Clear an axis
>>> plt.clf() #Clear the entire figure
>>> plt.close() #Close a window

Customize Plot 

Colors, Color Bars & Color Maps 

>>> plt.plot(x, x, x, x**2, x, x**3)
>>> ax.plot(x, y, alpha=0.4)
>>> ax.plot(x, y, c='k')
>>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img,
            cmap='seismic')

Markers 

>>> fig, ax = plt.subplots()
>>> ax.scatter(x,y,marker=".")
>>> ax.plot(x,y,marker="o")

Linestyles 

>>> plt.plot(x,y,linewidth=4.0)
>>> plt.plot(x,y,ls='solid')
>>> plt.plot(x,y,ls='--')
>>> plt.plot(x,y,'--',x**2,y**2,'-.')
>>> plt.setp(lines,color='r',linewidth=4.0)

Text & Annotations 

>>> ax.text(1,
           -2.1,
           'Example Graph',
           style='italic')
>>> ax.annotate("Sine",
           xy=(8, 0),
           xycoords='data',
           xytext=(10.5, 0),
           textcoords='data',
           arrowprops=dict(arrowstyle="->",
           connectionstyle="arc3"),)

Mathtext 

>>> plt.title(r'$\sigma_i=15$', fontsize=20)

Limits, Legends and Layouts 

Limits & Autoscaling 

>>> ax.margins(x=0.0,y=0.1) #Add padding to a plot
>>> ax.axis('equal')  #Set the aspect ratio of the plot to 1
>>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5])  #Set limits for x-and y-axis
>>> ax.set_xlim(0,10.5) #Set limits for x-axis

Legends 

>>> ax.set(title='An Example Axes', #Set a title and x- and y-axis labels
            ylabel='Y-Axis',
            xlabel='X-Axis')
>>> ax.legend(loc='best') #No overlapping plot elements

Ticks 

>>> ax.xaxis.set(ticks=range(1,5), #Manually set x-ticks
             ticklabels=[3,100,12,"foo"])
>>> ax.tick_params(axis='y', #Make y-ticks longer and go in and out
             direction='inout',
             length=10)

Subplot Spacing 

>>> fig3.subplots_adjust(wspace=0.5,   #Adjust the spacing between subplots
             hspace=0.3,
             left=0.125,
             right=0.9,
             top=0.9,
             bottom=0.1)
>>> fig.tight_layout() #Fit subplot(s) in to the figure area

Axis Spines 

>>> ax1.spines['top'].set_visible(False) #Make the top axis line for a plot invisible
>>> ax1.spines['bottom'].set_position(('outward',10)) #Move the bottom axis line outward

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#matplotlib #cheatsheet #python


Bokeh Plotting Backend for Pandas and GeoPandas

Pandas-Bokeh provides a Bokeh plotting backend for Pandas, GeoPandas and Pyspark DataFrames, similar to the already existing Visualization feature of Pandas. Importing the library adds a complementary plotting method plot_bokeh() on DataFrames and Series.

With Pandas-Bokeh, creating stunning, interactive, HTML-based visualizations is as easy as calling:

df.plot_bokeh()

Pandas-Bokeh also provides native support as a Pandas plotting backend for Pandas >= 0.25. When Pandas-Bokeh is installed, switching the default Pandas plotting backend to Bokeh can be done via:

pd.set_option('plotting.backend', 'pandas_bokeh')

More details about the new Pandas backend can be found below.


Interactive Documentation

Please visit:

https://patrikhlobil.github.io/Pandas-Bokeh/

for an interactive version of the documentation below, where you can play with the dynamic Bokeh plots.


For more information, have a look at the examples below or at the notebooks on the GitHub repository of this project.

Startimage


 

Installation

You can install Pandas-Bokeh from PyPI via pip

pip install pandas-bokeh

or conda:

conda install -c patrikhlobil pandas-bokeh

With the current release 0.5.5, Pandas-Bokeh officially supports Python 3.6 and newer. For more details, see Release Notes.

How To Use

Classical Use

The Pandas-Bokeh library should be imported after Pandas, GeoPandas and/or Pyspark. After the import, one should define the plotting output, which can be:

pandas_bokeh.output_notebook(): Embeds the Plots in the cell outputs of the notebook. Ideal when working in Jupyter Notebooks.

pandas_bokeh.output_file(filename): Exports the plot to the provided filename as an HTML.

For more details about the plotting outputs, see the reference here or the Bokeh documentation.

Notebook output (see also bokeh.io.output_notebook)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

File output to "Interactive Plot.html" (see also bokeh.io.output_file)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_file("Interactive Plot.html")

Pandas-Bokeh as native Pandas plotting backend

For Pandas >= 0.25, a plotting backend switch is natively supported. It can be achieved by calling:

import pandas as pd
pd.set_option('plotting.backend', 'pandas_bokeh')

Now, the plotting API is accessible for a Pandas DataFrame via:

df.plot(...)

All additional functionalities of Pandas-Bokeh are then accessible at pd.plotting. So, setting the output to notebook is:

pd.plotting.output_notebook()

or calling the grid layout functionality:

pd.plotting.plot_grid(...)

Note: Backwards compatibility is kept, since the df.plot_bokeh(...) methods will still be available on a DataFrame.
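A short sketch showing both entry points side by side (the DataFrame here is dummy data for illustration):

import pandas as pd
import pandas_bokeh

pd.set_option('plotting.backend', 'pandas_bokeh')

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 1]})
df.plot(kind="line")  # routed through Pandas-Bokeh via the backend option
df.plot_bokeh()       # the backwards-compatible accessor still works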


Plot types

The plot types currently supported are line, point, step, scatter, bar, histogram, area, pie and map plots; each is described in its own section below.

Also, check out the complementary chapter Outputs, Formatting & Layouts.


Lineplot

Basic Lineplot

This simple lineplot in Pandas-Bokeh already contains various interactive elements:

  • a pannable and zoomable plot (zoom in the plot area and on the axes)
  • by clicking on the legend elements, one can hide and show the individual lines
  • a Hovertool for the plotted lines

Consider the following simple example:

import numpy as np

np.random.seed(42)
df = pd.DataFrame({"Google": np.random.randn(1000)+0.2, 
                   "Apple": np.random.randn(1000)+0.17}, 
                   index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
df.plot_bokeh(kind="line")       #equivalent to df.plot_bokeh.line()

ApplevsGoogle_1

Note that, similar to the regular pandas.DataFrame.plot method, there are also additional accessors to directly access the different plotting types, like:

  • df.plot_bokeh(kind="line", ...)df.plot_bokeh.line(...)
  • df.plot_bokeh(kind="bar", ...)df.plot_bokeh.bar(...)
  • df.plot_bokeh(kind="hist", ...)df.plot_bokeh.hist(...)
  • ...

Advanced Lineplot

There are various optional parameters to tune the plots, for example:

kind: Which kind of plot should be produced. Currently supported are: "line", "point", "scatter", "bar" and "histogram". In the near future many more will be implemented, such as horizontal barplots, boxplots, pie charts, etc.

x: Name of the column to use for the horizontal x-axis. If the x parameter is not specified, the index is used for the x-values of the plot. Alternatively, an array of values can be passed that has the same number of elements as the DataFrame.

y: Name of column or list of names of columns to use for the vertical y-axis.

figsize: Choose width & height of the plot

title: Sets title of the plot

xlim/ylim: Set the visible range of the plot for the x- and y-axis (also works for datetime x-axes)

xlabel/ylabel: Set x- and y-labels

logx/logy: Set log-scale on x-/y-axis

xticks/yticks: Explicitly set the ticks on the axes

color: Defines a single color for a plot.

colormap: Can be used to specify multiple colors to plot. Can be either a list of colors or the name of a Bokeh color palette

hovertool: If True a Hovertool is active, else if False no Hovertool is drawn.

hovertool_string: If specified, this string will be used for the hovertool (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation and here)

toolbar_location: Specify the position of the toolbar location (None, "above", "below", "left" or "right"). Default: "right"

zooming: Enables/Disables zooming. Default: True

panning: Enables/Disables panning. Default: True

fontsize_label/fontsize_ticks/fontsize_title/fontsize_legend: Set fontsize of labels, ticks, title or legend (int or string of form "15pt")

rangetool: Enables a range tool scroller. Default: False

**kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Try them out to get a feeling for the effects. Let us consider now:

df.plot_bokeh.line(
    figsize=(800, 450),
    y="Apple",
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0, 100, 200, 300, 400],
    ylim=(0, 400),
    toolbar_location=None,
    colormap=["red", "blue"],
    hovertool_string=r"""<img
                        src='https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Apple_logo_black.svg/170px-Apple_logo_black.svg.png' 
                        height="42" alt="@imgs" width="42"
                        style="float: left; margin: 0px 15px 15px 0px;"
                        border="2"></img> Apple 
                        
                        <h4> Stock Price: </h4> @{Apple}""",
    panning=False,
    zooming=False)

ApplevsGoogle_2

Lineplot with data points

For lineplots, as for many other plot-kinds, there are some special keyword arguments that only work for this plotting type. For lineplots, these are:

plot_data_points: Plot also the data points on the lines

plot_data_points_size: Determines the size of the data points

marker: Defines the point type (Default: "circle"). Possible values are: 'circle', 'square', 'triangle', 'asterisk', 'circle_x', 'square_x', 'inverted_triangle', 'x', 'circle_cross', 'square_cross', 'diamond', 'cross'

**kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Let us use this information to have another version of the same plot:

df.plot_bokeh.line(
    figsize=(800, 450),
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0, 100, 200, 300, 400],
    ylim=(100, 200),
    xlim=("2001-01-01", "2001-02-01"),
    colormap=["red", "blue"],
    plot_data_points=True,
    plot_data_points_size=10,
    marker="asterisk")

ApplevsGoogle_3

Lineplot with rangetool

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()

df.plot_bokeh(rangetool=True)

rangetool

Pointplot

If you just wish to draw the data points for curves, the pointplot option is the right choice. It also accepts the kwargs of bokeh.plotting.figure.scatter like marker or size:

import numpy as np

x = np.arange(-3, 3, 0.1)
y2 = x**2
y3 = x**3
df = pd.DataFrame({"x": x, "Parabula": y2, "Cube": y3})
df.plot_bokeh.point(
    x="x",
    xticks=range(-3, 4),
    size=5,
    colormap=["#009933", "#ff3399"],
    title="Pointplot (Parabula vs. Cube)",
    marker="x")

Pointplot

Stepplot

With a similar API as the line- & pointplots, one can generate a stepplot. Additional keyword arguments for this plot type are passed to bokeh.plotting.figure.step, e.g. mode (before, after, center); see the following example:

import numpy as np

x = np.arange(-3, 3, 1)
y2 = x**2
y3 = x**3
df = pd.DataFrame({"x": x, "Parabula": y2, "Cube": y3})
df.plot_bokeh.step(
    x="x",
    xticks=range(-1, 1),
    colormap=["#009933", "#ff3399"],
    title="Pointplot (Parabula vs. Cube)",
    figsize=(800,300),
    fontsize_title=30,
    fontsize_label=25,
    fontsize_ticks=15,
    fontsize_legend=5,
    )

df.plot_bokeh.step(
    x="x",
    xticks=range(-1, 1),
    colormap=["#009933", "#ff3399"],
    title="Pointplot (Parabula vs. Cube)",
    mode="after",
    figsize=(800,300)
    )

Stepplot

Note that the step-plot API of Bokeh does not yet support a hovertool functionality.

Scatterplot

A basic scatterplot can be created using the kind="scatter" option. For scatterplots, the x and y parameters have to be specified, and the following optional keyword arguments are allowed:

category: Determines the category column to use for coloring the scatter points

**kwargs: Optional keyword arguments of bokeh.plotting.figure.scatter

Note that the pandas.DataFrame.plot_bokeh() method returns, by default, a Bokeh figure, which can be embedded in dashboard layouts with other figures and Bokeh objects (for more details about (sub)plot layouts and embedding the resulting Bokeh plots as HTML, click here).

In the example below, we use the built-in grid layout support of Pandas-Bokeh to display both the DataFrame (using a Bokeh DataTable) and the resulting scatterplot:

# Load Iris Dataset:
df = pd.read_csv(
    r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/iris/iris.csv"
)
df = df.sample(frac=1)

# Create Bokeh-Table with DataFrame:
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.models import ColumnDataSource

data_table = DataTable(
    columns=[TableColumn(field=Ci, title=Ci) for Ci in df.columns],
    source=ColumnDataSource(df),
    height=300,
)

# Create Scatterplot:
p_scatter = df.plot_bokeh.scatter(
    x="petal length (cm)",
    y="sepal width (cm)",
    category="species",
    title="Iris DataSet Visualization",
    show_figure=False,
)

# Combine Table and Scatterplot via grid layout:
pandas_bokeh.plot_grid([[data_table, p_scatter]], plot_width=400, plot_height=350)

 

Scatterplot

A possible optional keyword parameter that can be passed to bokeh.plotting.figure.scatter is size. Below, we use the sepal length of the Iris data as reference for the size:

#Change one value to clearly see the effect of the size keyword
df.loc[13, "sepal length (cm)"] = 15

#Make scatterplot:
p_scatter = df.plot_bokeh.scatter(
    x="petal length (cm)",
    y="sepal width (cm)",
    category="species",
    title="Iris DataSet Visualization with Size Keyword",
    size="sepal length (cm)")

Scatterplot2

In this example you can see that the additional dimension sepal length cannot be used to clearly differentiate between the virginica and versicolor species.

Barplot

The barplot API has no special keyword arguments, but accepts optional kwargs of bokeh.plotting.figure.vbar like alpha. By default it uses the index for the bar categories (however, columns can also be used as the x-axis category via the x argument).

data = {
    'fruits':
    ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
    '2015': [2, 1, 4, 3, 2, 4],
    '2016': [5, 3, 3, 2, 4, 6],
    '2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")

p_bar = df.plot_bokeh.bar(
    ylabel="Price per Unit [€]", 
    title="Fruit prices per Year", 
    alpha=0.6)

Barplot

Using the stacked keyword argument, you can also make stacked barplots:

p_stacked_bar = df.plot_bokeh.bar(
    ylabel="Price per Unit [€]",
    title="Fruit prices per Year",
    stacked=True,
    alpha=0.6)

Barplot2

Also horizontal versions of the above barplot are supported with the keyword kind="barh" or the accessor plot_bokeh.barh. You can still specify a column of the DataFrame as the bar category via the x argument if you do not wish to use the index.

#Reset index, such that "fruits" is now a column of the DataFrame:
df.reset_index(inplace=True)

#Create horizontal bar (via kind keyword):
p_hbar = df.plot_bokeh(
    kind="barh",
    x="fruits",
    xlabel="Price per Unit [€]",
    title="Fruit prices per Year",
    alpha=0.6,
    legend = "bottom_right",
    show_figure=False)

#Create stacked horizontal bar (via barh accessor):
p_stacked_hbar = df.plot_bokeh.barh(
    x="fruits",
    stacked=True,
    xlabel="Price per Unit [€]",
    title="Fruit prices per Year",
    alpha=0.6,
    legend = "bottom_right",
    show_figure=False)

#Plot all barplot examples in a grid:
pandas_bokeh.plot_grid([[p_bar, p_stacked_bar],
                        [p_hbar, p_stacked_hbar]], 
                       plot_width=450)

Barplot3

Histogram

For drawing histograms (kind="hist"), Pandas-Bokeh has a lot of customization features. Optional keyword arguments for histogram plots are:

bins: Determines bins to use for the histogram. If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths. If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges.

histogram_type: Either "sidebyside", "topontop" or "stacked". Default: "topontop"

stacked: Boolean that overrides the histogram_type as "stacked" if given. Default: False

**kwargs: Optional keyword arguments of bokeh.plotting.figure.quad

Below are examples of the different histogram types:

import numpy as np

df_hist = pd.DataFrame({
    'a': np.random.randn(1000) + 1,
    'b': np.random.randn(1000),
    'c': np.random.randn(1000) - 1
    },
    columns=['a', 'b', 'c'])

#Top-on-Top Histogram (Default):
df_hist.plot_bokeh.hist(
    bins=np.linspace(-5, 5, 41),
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Top-on-Top)",
    line_color="black")

#Side-by-Side Histogram (multiple bars share bin side-by-side) also accessible via
#kind="hist":
df_hist.plot_bokeh(
    kind="hist",
    bins=np.linspace(-5, 5, 41),
    histogram_type="sidebyside",
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Side-by-Side)",
    line_color="black")

#Stacked histogram:
df_hist.plot_bokeh.hist(
    bins=np.linspace(-5, 5, 41),
    histogram_type="stacked",
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Stacked)",
    line_color="black")

Histogram

Further advanced keyword arguments for histograms are:

  • weights: A column of the DataFrame that is used as weight for the histogram aggregation (see also numpy.histogram)
  • normed: If True, histogram values are normed to 1 (sum of histogram values=1). It is also possible to pass an integer, e.g. normed=100 would result in a histogram with percentage y-axis (sum of histogram values=100). Default: False
  • cumulative: If True, a cumulative histogram is shown. Default: False
  • show_average: If True, the average of the histogram is also shown. Default: False

Their usage is shown in these examples:

p_hist = df_hist.plot_bokeh.hist(
    y=["a", "b"],
    bins=np.arange(-4, 6.5, 0.5),
    normed=100,
    vertical_xlabel=True,
    ylabel="Share[%]",
    title="Normal distributions (normed)",
    show_average=True,
    xlim=(-4, 6),
    ylim=(0, 30),
    show_figure=False)

p_hist_cum = df_hist.plot_bokeh.hist(
    y=["a", "b"],
    bins=np.arange(-4, 6.5, 0.5),
    normed=100,
    cumulative=True,
    vertical_xlabel=True,
    ylabel="Share[%]",
    title="Normal distributions (normed & cumulative)",
    show_figure=False)

pandas_bokeh.plot_grid([[p_hist, p_hist_cum]], plot_width=450, plot_height=300)

Histogram2


 

Areaplot

Areaplots (kind="area") can either be drawn on top of each other or stacked. The important parameters are:

stacked: If True, the areaplots are stacked. If False, plots are drawn on top of each other. Default: False

**kwargs: Optional keyword arguments of bokeh.plotting.figure.patch


Let us consider the energy consumption split by source, which can be downloaded as a DataFrame via:

df_energy = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/energy/energy.csv", 
parse_dates=["Year"])
df_energy.head()
Year | Oil | Gas | Coal | Nuclear Energy | Hydroelectricity | Other Renewable
1970-01-01 | 2291.5 | 826.7 | 1467.3 | 17.7 | 265.8 | 5.8
1971-01-01 | 2427.7 | 884.8 | 1459.2 | 24.9 | 276.4 | 6.3
1972-01-01 | 2613.9 | 933.7 | 1475.7 | 34.1 | 288.9 | 6.8
1973-01-01 | 2818.1 | 978.0 | 1519.6 | 45.9 | 292.5 | 7.3
1974-01-01 | 2777.3 | 1001.9 | 1520.9 | 59.6 | 321.1 | 7.7


Creating the Areaplot can be achieved via:

df_energy.plot_bokeh.area(
    x="Year",
    stacked=True,
    legend="top_left",
    colormap=["brown", "orange", "black", "grey", "blue", "green"],
    title="Worldwide energy consumption split by energy source",
    ylabel="Million tonnes oil equivalent",
    ylim=(0, 16000))

areaplot

Note that the consumption of fossil energy sources is still increasing and renewable energy sources are still small in comparison 😢! However, when we norm the plot using the normed keyword, there is a clear trend towards renewable energies in the last decade:

df_energy.plot_bokeh.area(
    x="Year",
    stacked=True,
    normed=100,
    legend="bottom_left",
    colormap=["brown", "orange", "black", "grey", "blue", "green"],
    title="Worldwide energy consumption split by energy source",
    ylabel="Million tonnes oil equivalent")

areaplot2

Pieplot

For Pieplots, let us consider a dataset showing the results of all Bundestag elections in Germany since 2002:

df_pie = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/Bundestagswahl/Bundestagswahl.csv")
df_pie
Partei | 2002 | 2005 | 2009 | 2013 | 2017
CDU/CSU | 38.5 | 35.2 | 33.8 | 41.5 | 32.9
SPD | 38.5 | 34.2 | 23.0 | 25.7 | 20.5
FDP | 7.4 | 9.8 | 14.6 | 4.8 | 10.7
Grünen | 8.6 | 8.1 | 10.7 | 8.4 | 8.9
Linke/PDS | 4.0 | 8.7 | 11.9 | 8.6 | 9.2
AfD | 0.0 | 0.0 | 0.0 | 0.0 | 12.6
Sonstige | 3.0 | 4.0 | 6.0 | 11.0 | 5.0

We can create a Pieplot of the last election in 2017 by specifying the "Partei" (German for party) column as the x column and the "2017" column as the y column for values:

df_pie.plot_bokeh.pie(
    x="Partei",
    y="2017",
    colormap=["blue", "red", "yellow", "green", "purple", "orange", "grey"],
    title="Results of German Bundestag Election 2017",
    )

pieplot

When you pass several columns to the y parameter (not providing the y-parameter assumes you plot all columns), multiple nested pieplots will be shown in one plot:

df_pie.plot_bokeh.pie(
    x="Partei",
    colormap=["blue", "red", "yellow", "green", "purple", "orange", "grey"],
    title="Results of German Bundestag Elections [2002-2017]",
    line_color="grey")

pieplot2

Mapplot

The mapplot method of Pandas-Bokeh allows for plotting geographic points stored in a Pandas DataFrame on an interactive map. For more advanced Geoplots for line and polygon shapes have a look at the Geoplots examples for the GeoPandas API of Pandas-Bokeh.

For mapplots, only (latitude, longitude) pairs in geographic projection (WGS84) can be plotted on a map. The basic API has the following 2 base parameters:

  • x: name of the longitude column of the DataFrame
  • y: name of the latitude column of the DataFrame

The other optional keyword arguments are discussed in the section about the GeoPandas API, e.g. category for coloring the points.

Below is an example of plotting all cities with more than 1 million inhabitants:

df_mapplot = pd.read_csv(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/populated_places.csv")
df_mapplot.head()
name | pop_max | latitude | longitude | size
Mesa | 1085394 | 33.423915 | -111.736084 | 1.085394
Sharjah | 1103027 | 25.371383 | 55.406478 | 1.103027
Changwon | 1081499 | 35.219102 | 128.583562 | 1.081499
Sheffield | 1292900 | 53.366677 | -1.499997 | 1.292900
Abbottabad | 1183647 | 34.149503 | 73.199501 | 1.183647
df_mapplot["size"] = df_mapplot["pop_max"] / 1000000
df_mapplot.plot_bokeh.map(
    x="longitude",
    y="latitude",
    hovertool_string="""<h2> @{name} </h2> 
    
                        <h3> Population: @{pop_max} </h3>""",
    tile_provider="STAMEN_TERRAIN_RETINA",
    size="size", 
    figsize=(900, 600),
    title="World cities with more than 1.000.000 inhabitants")

 

Mapplot

Geoplots

Pandas-Bokeh also allows for interactive plotting of maps using GeoPandas by providing a geopandas.GeoDataFrame.plot_bokeh() method. It allows you to plot the following geodata on a map:

  • Points/MultiPoints
  • Lines/MultiLines
  • Polygons/MultiPolygons

Note: It is not possible to mix geometry types, i.e. a GeoDataFrame containing both Points and Lines is not allowed.

Let us start with a simple example using the "World Borders Dataset". Let us first import all necessary libraries and read the shapefile:

import geopandas as gpd
import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

#Read in GeoJSON from URL:
df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")
df_states.head()
STATE_NAME   | REGION | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | geometry
Hawaii       | 4 | 1363817 | 1378323 | 1392772 | 1408038 | 1417710 | 1426320 | 1428683 | 1427538 | (POLYGON ((-160.0738033454681 22.0041773479577...
Washington   | 4 | 6741386 | 6819155 | 6890899 | 6963410 | 7046931 | 7152818 | 7280934 | 7405743 | (POLYGON ((-122.4020153103835 48.2252163723779...
Montana      | 4 | 990507  | 996866  | 1003522 | 1011921 | 1019931 | 1028317 | 1038656 | 1050493 | POLYGON ((-111.4754253002074 44.70216236909688...
Maine        | 1 | 1327568 | 1327968 | 1328101 | 1327975 | 1328903 | 1327787 | 1330232 | 1335907 | (POLYGON ((-69.77727626137293 44.0741483685119...
North Dakota | 2 | 674518  | 684830  | 701380  | 722908  | 738658  | 754859  | 755548  | 755393  | POLYGON ((-98.73043728833767 45.93827137024809...

Plotting the data on a map is as simple as calling:

df_states.plot_bokeh(simplify_shapes=10000)

US_States_1

We also passed the optional parameter simplify_shapes (~meter) to improve plotting performance (for a reference see shapely.object.simplify). The above geolayer thus has an accuracy of about 10km.

Many keyword arguments like xlabel, ylabel, xlim, ylim, title, colormap, hovertool, zooming, panning, ... for customizing the plot are also available for the geoplotting API and can be used as in the examples shown above. There are, however, also many other options especially for plotting geodata (a short combined example follows the list below):

  • geometry_column: Specify the column that stores the geometry-information (default: "geometry")
  • hovertool_columns: Specify column names, for which values should be shown in hovertool
  • hovertool_string: If specified, this string will be used for the hovertool (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation)
  • colormap_uselog: If set True, the colormapper is using a logscale. Default: False
  • colormap_range: Specify the value range of the colormapper via (min, max) tuple
  • tile_provider: Define build-in tile provider for background maps. Possible values: None, 'CARTODBPOSITRON', 'CARTODBPOSITRON_RETINA', 'STAMEN_TERRAIN', 'STAMEN_TERRAIN_RETINA', 'STAMEN_TONER', 'STAMEN_TONER_BACKGROUND', 'STAMEN_TONER_LABELS'. Default: CARTODBPOSITRON_RETINA
  • tile_provider_url: An arbitrary tile_provider_url of the form '/{Z}/{X}/{Y}*.png' can be passed to be used as background map.
  • tile_attribution: String (also HTML accepted) for showing attribution for tile source in the lower right corner
  • tile_alpha: Sets the alpha value of the background tile between [0, 1]. Default: 1
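To give a feeling for how these options combine, here is a hypothetical sketch reusing the df_states GeoDataFrame from the example above; the colormap_range and tile values are arbitrary choices for illustration, and the category keyword is explained in the "Categories" section below:

df_states.plot_bokeh(
    simplify_shapes=10000,
    category="POPESTIMATE2017",          # choropleth column (see "Categories" below)
    colormap_uselog=True,                # log-scale colormapper
    colormap_range=(500000, 40000000),   # arbitrary (min, max) clip, for illustration
    hovertool_columns=["STATE_NAME", "POPESTIMATE2017"],
    tile_provider="STAMEN_TONER",        # one of the built-in tile providers
    tile_alpha=0.5)                      # semi-transparent background tiles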

One of the most common uses of map plots is the choropleth map, where the color of each object is determined by a property of the object itself. There are 3 ways of drawing choropleth maps using Pandas-Bokeh, which are described below.

Categories

This is the simplest way. Just provide the category keyword for the selection of the property column:

  • category: Specifies the column of the GeoDataFrame that should be used to draw a choropleth map
  • show_colorbar: Whether or not to show a colorbar for categorical plots. Default: True

Let us now draw the regions as a choropleth plot using the category keyword (at the moment, only numerical columns are supported for choropleth plots):

df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    category="REGION",
    show_colorbar=False,
    colormap=["blue", "yellow", "green", "red"],
    hovertool_columns=["STATE_NAME", "REGION"],
    tile_provider="STAMEN_TERRAIN_RETINA")

When hovering over the states, the state-name and the region are shown as specified in the hovertool_columns argument.

US_States_2

 

Dropdown

By passing a list of column names of the GeoDataFrame as the dropdown keyword argument, a dropdown menu is shown above the map. This dropdown menu can be used by the user to select the choropleth layer:

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    dropdown=["POPESTIMATE2010", "POPESTIMATE2017"],
    colormap="Viridis",
    hovertool_string="""
                        <img
                        src="https://www.states101.com/img/flags/gif/small/@STATE_NAME_SMALL.gif" 
                        height="42" alt="@imgs" width="42"
                        style="float: left; margin: 0px 15px 15px 0px;"
                        border="2"></img>
                
                        <h2>  @STATE_NAME </h2>
                        <h3> 2010: @POPESTIMATE2010 </h3>
                        <h3> 2017: @POPESTIMATE2017 </h3>""",
    tile_provider_url=r"http://c.tile.stamen.com/watercolor/{Z}/{X}/{Y}.jpg",
    tile_attribution='Map tiles by <a href="http://stamen.com">Stamen Design</a>, under <a href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a>. Data by <a href="http://openstreetmap.org">OpenStreetMap</a>, under <a href="http://www.openstreetmap.org/copyright">ODbL</a>.'
    )

US_States_3

Using hovertool_string, one can pass a string that can contain arbitrary HTML elements (including divs, images, ...) that is shown when hovering over the geographies (@{column} will be replaced by the value of the column for the element the mouse hovers over, see also Bokeh documentation).

Here, we also used an OSM tile server with watercolor style via tile_provider_url and added the attribution via tile_attribution.

Sliders

Another option for interactive choropleth maps is the slider implementation of Pandas-Bokeh. The possible keyword arguments are listed here:

  • slider: By passing a list of column names of the GeoDataFrame, a slider is shown above the map that can be used by the user to select the choropleth layer.
  • slider_range: Pass a range (or numpy.arange) object to relate the slider's values to the slider columns. By passing range(0,10), the slider will have the values [0, 1, 2, ..., 9]; when passing numpy.arange(3,5,0.5), the slider will have the values [3, 3.5, 4, 4.5]. Default: range(0, len(slider))
  • slider_name: Specifies the title of the slider. Default is an empty string.

This can be used to display the change in population relative to the year 2010:


#Calculate change of population relative to 2010:
for i in range(8):
    df_states["Delta_Population_201%d"%i] = ((df_states["POPESTIMATE201%d"%i] / df_states["POPESTIMATE2010"]) -1 ) * 100

#Specify slider columns:
slider_columns = ["Delta_Population_201%d"%i for i in range(8)]

#Specify slider-range (Maps "Delta_Population_2010" -> 2010, 
#                           "Delta_Population_2011" -> 2011, ...):
slider_range = range(2010, 2018)

#Make slider plot:
df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    slider=slider_columns,
    slider_range=slider_range,
    slider_name="Year", 
    colormap="Inferno",
    hovertool_columns=["STATE_NAME"] + slider_columns,
    title="Change of Population [%]")

US_States_4



 

Plot multiple geolayers

If you wish to display multiple geolayers, you can pass the Bokeh figure of a Pandas-Bokeh plot via the figure keyword to the next plot_bokeh() call:

import geopandas as gpd
import pandas_bokeh
pandas_bokeh.output_notebook()

# Read in GeoJSONs from URL:
df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")
df_cities = gpd.read_file(
    r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/ne_10m_populated_places_simple_bigcities.geojson"
)
df_cities["size"] = df_cities.pop_max / 400000

#Plot shapes of US states (pass figure options to this initial plot):
figure = df_states.plot_bokeh(
    figsize=(800, 450),
    simplify_shapes=10000,
    show_figure=False,
    xlim=[-170, -80],
    ylim=[10, 70],
    category="REGION",
    colormap="Dark2",
    legend="States",
    show_colorbar=False,
)

#Plot cities as points on top of the US states layer by passing the figure:
df_cities.plot_bokeh(
    figure=figure,         # <== pass figure here!
    category="pop_max",
    colormap="Viridis",
    colormap_uselog=True,
    size="size",
    hovertool_string="""<h1>@name</h1>
                        <h3>Population: @pop_max </h3>""",
    marker="inverted_triangle",
    legend="Cities",
)

Multiple Geolayers


Point & Line plots:

Below, you can see an example that uses Pandas-Bokeh to plot point data on a map. The plot shows all cities with a population larger than 1.000.000. For point plots, you can select the marker as a keyword argument (since it is passed to bokeh.plotting.figure.scatter). Here is an overview of all available marker types:

gdf = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/populated%20places/ne_10m_populated_places_simple_bigcities.geojson")
gdf["size"] = gdf.pop_max / 400000

gdf.plot_bokeh(
    category="pop_max",
    colormap="Viridis",
    colormap_uselog=True,
    size="size",
    hovertool_string="""<h1>@name</h1>
                        <h3>Population: @pop_max </h3>""",
    xlim=[-15, 35],
    ylim=[30,60],
    marker="inverted_triangle");

Pointmap

In a similar way, GeoDataFrames with (multi)line shapes can also be drawn using Pandas-Bokeh.


 


Colorbar formatting:

If you want to display the numerical labels on your colorbar with an alternative to the scientific format, you can pass one of the bokeh number string formats, or an instance of one of the bokeh.models.formatters, to the colorbar_tick_format argument in the geoplot.

An example of using the string format argument:

df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

# pass in a string format to colorbar_tick_format to display the ticks as 10m rather than 1e7
df_states.plot_bokeh(
    figsize=(900, 600),
    category="POPESTIMATE2017",
    simplify_shapes=5000,    
    colormap="Inferno",
    colormap_uselog=True,
    colorbar_tick_format="0.0a")

colorbar_tick_format with string argument

An example of using the bokeh PrintfTickFormatter:

from bokeh.models import PrintfTickFormatter

df_states = gpd.read_file(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/docs/Testdata/states/states.geojson")

df_states["STATE_NAME_SMALL"] = df_states["STATE_NAME"].str.lower()

for i in range(8):
    df_states["Delta_Population_201%d"%i] = ((df_states["POPESTIMATE201%d"%i] / df_states["POPESTIMATE2010"]) -1 ) * 100

# pass in a PrintfTickFormatter instance colorbar_tick_format to display the ticks with 2 decimal places  
df_states.plot_bokeh(
    figsize=(900, 600),
    category="Delta_Population_2017",
    simplify_shapes=5000,    
    colormap="Inferno",
    colorbar_tick_format=PrintfTickFormatter(format="%4.2f"))

colorbar_tick_format with bokeh.models.formatter_instance


Outputs, Formatting & Layouts

Output options

The pandas.DataFrame.plot_bokeh API has the following additional keyword arguments:

  • show_figure: If True, the resulting figure is shown (either in the notebook, or exported and shown as an HTML file, see Basics). If False, None is returned. Default: True
  • return_html: If True, the method call returns an HTML string that contains all Bokeh CSS&JS resources and the figure embedded in a div. This HTML representation of the plot can be used for embedding the plot in an HTML document. Default: False

If you have a Bokeh figure or layout, you can also use the pandas_bokeh.embedded_html function to generate an embeddable HTML representation of the plot. This can be included into any valid HTML (note that this is not possible directly with the HTML generated by the pandas_bokeh.output_file output option, because it includes an HTML header). Let us consider the following simple example:

#Import Pandas and Pandas-Bokeh (if you do not specify an output option, the standard is
#output_file):
import pandas as pd
import pandas_bokeh

#Create DataFrame to Plot:
import numpy as np
x = np.arange(-10, 10, 0.1)
sin = np.sin(x)
cos = np.cos(x)
tan = np.tan(x)
df = pd.DataFrame({"x": x, "sin(x)": sin, "cos(x)": cos, "tan(x)": tan})

#Make Bokeh plot from DataFrame using Pandas-Bokeh. Do not show the plot, but export
#it to an embeddable HTML string:
html_plot = df.plot_bokeh(
    kind="line",
    x="x",
    y=["sin(x)", "cos(x)", "tan(x)"],
    xticks=range(-20, 20),
    title="Trigonometric functions",
    show_figure=False,
    return_html=True,
    ylim=(-1.5, 1.5))

#Write some HTML and embed the HTML plot below it. For production use, please use
#Templates and the awesome Jinja library.
html = r"""
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<h1> Trigonometric functions </h1>

<p> The basic trigonometric functions are:</p>

<p>$ sin(x) $</p>
<p>$ cos(x) $</p>
<p>$ tan(x) = \frac{sin(x)}{cos(x)}$</p>

<p>Below is a plot that shows them</p>

""" + html_plot

#Export the HTML string to an external HTML file and show it:
with open("test.html" , "w") as f:
    f.write(html)
    
import webbrowser
webbrowser.open("test.html")

This code will open a web browser and show the following page. As you can see, the interactive Bokeh plot is embedded nicely into the HTML layout. The return_html option is ideal for use in a templating engine like Jinja.

Embedded HTML

Auto Scaling Plots

For single plots that have a large number of x-axis values, or for larger monitors, you can auto-scale the figure to the width of the entire Jupyter cell by setting the sizing_mode parameter.

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot_bokeh(kind="bar", figsize=(500, 200), sizing_mode="scale_width")

Scaled Plot

The figsize parameter can be used to change the height and width as well as act as a scaling multiplier against the axis that is not being scaled.

 

Number formats

To change the format of numbers in the hovertool, use the number_format keyword argument. For documentation about the format to pass, have a look at the Bokeh documentation. Let us consider some examples for the number 3.141592653589793:

Format | Output
0      | 3
0.000  | 3.141
0.00 $ | 3.14 $

This number format will be applied to all numeric columns of the hovertool. If you want to make a very custom or complicated hovertool, you should probably use the hovertool_string keyword argument, see e.g. this example. Below, we use the number_format parameter to specify the "Stock Price" format to 2 decimal digits and an additional $ sign.

import numpy as np

#Lineplot:
np.random.seed(42)
df = pd.DataFrame({
    "Google": np.random.randn(1000) + 0.2,
    "Apple": np.random.randn(1000) + 0.17
},
                  index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
df.plot_bokeh(
    kind="line",
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0, 100, 200, 300, 400],
    ylim=(0, 400),
    colormap=["red", "blue"],
    number_format="1.00 $")

Number format

Suppress scientific notation for axes

If you want to suppress the scientific notation for axes, you can use the disable_scientific_axes parameter, which accepts one of "x", "y", "xy":

df = pd.DataFrame({"Animal": ["Mouse", "Rabbit", "Dog", "Tiger", "Elefant", "Wale"],
                   "Weight [g]": [19, 3000, 40000, 200000, 6000000, 50000000]})
p_scientific = df.plot_bokeh(x="Animal", y="Weight [g]", show_figure=False)
p_non_scientific = df.plot_bokeh(x="Animal", y="Weight [g]", disable_scientific_axes="y", show_figure=False,)
pandas_bokeh.plot_grid([[p_scientific, p_non_scientific]], plot_width = 450)

Number format

 

Dashboard Layouts

As shown in the Scatterplot Example, combining plots with plots or other HTML elements is straightforward in Pandas-Bokeh due to the layout capabilities of Bokeh. The easiest way to generate a dashboard layout is using the pandas_bokeh.plot_grid method (which is an extension of bokeh.layouts.gridplot):

import pandas as pd
import numpy as np
import pandas_bokeh
pandas_bokeh.output_notebook()

#Barplot:
data = {
    'fruits':
    ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
    '2015': [2, 1, 4, 3, 2, 4],
    '2016': [5, 3, 3, 2, 4, 6],
    '2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")
p_bar = df.plot_bokeh(
    kind="bar",
    ylabel="Price per Unit [€]",
    title="Fruit prices per Year",
    show_figure=False)

#Lineplot:
np.random.seed(42)
df = pd.DataFrame({
    "Google": np.random.randn(1000) + 0.2,
    "Apple": np.random.randn(1000) + 0.17
},
                  index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
p_line = df.plot_bokeh(
    kind="line",
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0, 100, 200, 300, 400],
    ylim=(0, 400),
    colormap=["red", "blue"],
    show_figure=False)

#Scatterplot:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris["data"])
df.columns = iris["feature_names"]
df["species"] = iris["target"]
df["species"] = df["species"].map(dict(zip(range(3), iris["target_names"])))
p_scatter = df.plot_bokeh(
    kind="scatter",
    x="petal length (cm)",
    y="sepal width (cm)",
    category="species",
    title="Iris DataSet Visualization",
    show_figure=False)

#Histogram:
df_hist = pd.DataFrame({
    'a': np.random.randn(1000) + 1,
    'b': np.random.randn(1000),
    'c': np.random.randn(1000) - 1
},
                       columns=['a', 'b', 'c'])

p_hist = df_hist.plot_bokeh(
    kind="hist",
    bins=np.arange(-6, 6.5, 0.5),
    vertical_xlabel=True,
    normed=100,
    hovertool=False,
    title="Normal distributions",
    show_figure=False)

#Make Dashboard with Grid Layout:
pandas_bokeh.plot_grid([[p_line, p_bar], 
                        [p_scatter, p_hist]], plot_width=450)

Dashboard Layout

Using a combination of row and column elements (see also Bokeh Layouts) allows for a very easy general arrangement of elements. An alternative layout to the one above is:

p_line.plot_width = 900
p_hist.plot_width = 900

layout = pandas_bokeh.column(p_line,
                pandas_bokeh.row(p_scatter, p_bar),
                p_hist)

pandas_bokeh.show(layout)

Alternative Dashboard Layout

Release Notes

Release Notes can be found here.

Contributing to Pandas-Bokeh

If you wish to contribute to the development of Pandas-Bokeh you can follow the instructions on the CONTRIBUTING.md.

 

Author: PatrikHlobil
Source Code: https://github.com/PatrikHlobil/Pandas-Bokeh 
License: MIT License

#machine-learning  #datavisualizations #python 

Rylan  Becker

Rylan Becker

1668563924

Machine Learning Tutorial: Step By Step for Beginners

In this Machine Learning article, we learn about Machine Learning step by step for beginners. This Machine Learning tutorial provides both the basics and intermediate concepts of machine learning. It is designed for students and working professionals who are complete beginners. At the end of this tutorial, you will be able to make machine learning models that can perform complex tasks such as predicting the price of a house or recognizing the species of an iris from the dimensions of its petal and sepal lengths. If you are not a complete beginner and are a bit familiar with Machine Learning, I would suggest starting with subtopic eight, i.e., Types of Machine Learning.

Before we deep dive further, if you are keen to explore a course in Artificial Intelligence & Machine Learning do check out our Artificial Intelligence Courses available at Great Learning. Anyone could expect an average Salary Hike of 48% from this course. Participate in Great Learning’s career accelerate programs and placement drives and get hired by our pool of 500+ Hiring companies through our programs.

Before jumping into the tutorial, you should be familiar with Pandas and NumPy. This is important for understanding the implementation part. There are no prerequisites for understanding the theory. The subtopics we are going to discuss are covered in the sections below.

What is Machine Learning?

Arthur Samuel coined the term Machine Learning in the year 1959. He was a pioneer in Artificial Intelligence and computer gaming, and defined Machine Learning as “Field of study that gives computers the capability to learn without being explicitly programmed”.

In simple terms, Machine Learning is an application of Artificial Intelligence (AI) which enables a program (software) to learn from experience and improve itself at a task without being explicitly programmed. For example, how would you write a program that can identify fruits based on their various properties, such as colour, shape, size or any other property?

One approach is to hardcode everything: make some rules and use them to identify the fruits. This may seem like the only workable way, but one can never write perfect rules that apply to all cases. This problem can be easily solved using machine learning, without any rules, which makes it more robust and practical. You will see how we will use machine learning to do this task in the coming sections.

Thus, we can say that Machine Learning is the study of making machines more human-like in their behaviour and decision making by giving them the ability to learn with minimum human intervention, i.e., no explicit programming. Now the question arises, how can a program attain any experience and from where does it learn? The answer is data. Data is also called the fuel for Machine Learning and we can safely say that there is no machine learning without data.

You may be wondering: if the term Machine Learning was introduced back in 1959, why was there so little mention of it until recent years? Note that Machine Learning needs huge computational power, a lot of data, and devices capable of storing such vast data. We have only recently reached a point where we have all these requirements and can practice Machine Learning.

How is it different from traditional programming?

Are you wondering how Machine Learning is different from traditional programming? Well, in traditional programming, we feed the input data and a well-written and tested program into a machine to generate output. In machine learning, input data along with the output associated with that data is fed into the machine during the learning phase, and it works out a program for itself.

Why do we need Machine Learning?

Machine Learning today has all the attention it needs. Machine Learning can automate many tasks, especially the ones that only humans can perform with their innate intelligence. Replicating this intelligence to machines can be achieved only with the help of machine learning. 

With the help of Machine Learning, businesses can automate routine tasks. It also helps in automating and quickly creating models for data analysis. Various industries depend on vast quantities of data to optimize their operations and make intelligent decisions. Machine Learning helps in creating models that can process and analyze large amounts of complex data to deliver accurate results. These models are precise and scalable and function with less turnaround time. By building such precise Machine Learning models, businesses can leverage profitable opportunities and avoid unknown risks.

Image recognition, text generation, and many other use-cases are finding applications in the real world. This is increasing the scope for machine learning experts to shine as sought-after professionals.

How Does Machine Learning Work?

A machine learning model learns from the historical data fed to it and then builds prediction algorithms to predict the output for the new set of data that comes in as input to the system. The accuracy of these models depends on the quality and amount of input data. A large amount of data will help build a better model which predicts the output more accurately.

Suppose we have a complex problem at hand that requires us to perform some predictions. Now, instead of writing code, this problem can be solved by feeding the given data to generic machine learning algorithms. With the help of these algorithms, the machine will develop logic and predict the output. Machine learning has transformed the way we approach business and social problems. Below is a diagram that briefly explains the working of a machine learning model/algorithm.

History of Machine Learning

Nowadays, we can see some amazing applications of ML such as in self-driving cars, Natural Language Processing and many more. But Machine learning has been here for over 70 years now. It all started in 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper about neurons, and how they work. They decided to create a model of this using an electrical circuit, and therefore, the neural network was born.

In 1950, Alan Turing created the “Turing Test” to determine if a computer has real intelligence. To pass the test, a computer must be able to fool a human into believing it is also human. In 1952, Arthur Samuel wrote the first computer learning program. The program was the game of checkers, and the IBM computer improved at the game the more it played, studying which moves made up winning strategies and incorporating those moves into its program.

Just after a few years, in 1957, Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates the thought processes of the human brain. Later, in 1967, the “nearest neighbor” algorithm was written, allowing computers to begin using very basic pattern recognition. This could be used to map a route for travelling salesmen, starting at a random city but ensuring they visit all cities during a short tour.

But we can say that in the 1990s we saw a big change. Now work on machine learning shifted from a knowledge-driven approach to a data-driven approach.  Scientists began to create programs for computers to analyze large amounts of data and draw conclusions or “learn” from the results.

In 1997, IBM’s Deep Blue became the first computer chess-playing system to beat a reigning world chess champion. Deep Blue used the computing power of the 1990s to perform large-scale searches of potential moves and select the best move. About a decade later, in 2006, Geoffrey Hinton coined the term “deep learning” to explain new algorithms that help computers distinguish objects and text in images and videos.

Machine Learning at Present

The year 2012 saw the publication of an influential research paper by Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever, describing a model that can dramatically reduce the error rate in image recognition systems. Meanwhile, Google’s X Lab developed a machine learning algorithm capable of autonomously browsing YouTube videos to identify the videos that contain cats. In 2016, AlphaGo (created by researchers at Google DeepMind to play the ancient Chinese game of Go) won four out of five matches against Lee Sedol, who had been the world’s top Go player for over a decade.

And now in 2020, OpenAI released GPT-3 which is the most powerful language model ever. It can write creative fiction, generate functioning code, compose thoughtful business memos and much more. Its possible use cases are limited only by our imaginations.

Features of Machine Learning

1. Automation: Nowadays in your Gmail account, there is a spam folder that contains all the spam emails. You might be wondering how Gmail knows that all these emails are spam. This is the work of Machine Learning. It recognizes the spam emails, and thus it is easy to automate this process. The ability to automate repetitive tasks is one of the biggest characteristics of machine learning. A huge number of organizations are already using machine learning-powered paperwork and email automation. In the financial sector, for example, a huge number of repetitive, data-heavy and predictable tasks need to be performed. Because of this, this sector uses different types of machine learning solutions to a great extent.

2. Improved customer experience: For any business, one of the most crucial ways to drive engagement, promote brand loyalty and establish long-lasting customer relationships is by providing a customized experience and better services. Machine Learning helps us achieve both. Have you ever noticed that whenever you open any shopping site or see ads on the internet, they are mostly about something that you recently searched for? This is because machine learning has enabled us to make amazingly accurate recommendation systems. They help us customize the user experience. Now coming to the service side, most companies nowadays have a chatbot that is available 24×7. An example of this is Eva from AirAsia airlines. These bots provide intelligent answers, and sometimes you might not even notice that you are having a conversation with a bot. These bots use Machine Learning, which helps them provide a good user experience.

3. Automated data visualization: In the past, we have seen a huge amount of data being generated by companies and individuals. Take an example of companies like Google, Twitter, Facebook. How much data are they generating per day? We can use this data and visualize the notable relationships, thus giving businesses the ability to make better decisions that can actually benefit both companies as well as customers. With the help of user-friendly automated data visualization platforms such as AutoViz, businesses can obtain a wealth of new insights in an effort to increase productivity in their processes.

4. Business intelligence: Machine learning characteristics, when merged with big data analytics can help companies to find solutions to the problems that can help the businesses to grow and generate more profit. From retail to financial services to healthcare, and many more, ML has already become one of the most effective technologies to boost business operations.

Python provides flexibility in choosing between object-oriented programming or scripting. There is also no need to recompile the code; developers can implement any changes and instantly see the results. You can use Python along with other languages to achieve the desired functionality and results.

Python is a versatile programming language and can run on any platform including Windows, MacOS, Linux, Unix, and others. While migrating from one platform to another, the code needs only some minor adaptations and changes, and it is ready to work on the new platform. To build a strong foundation and cover basic concepts, you can enroll in a Python machine learning course that will help you power ahead in your career.

Here is a summary of the benefits of using Python for Machine Learning problems:

machine learning tutorial

Types of Machine Learning

Machine learning has been broadly categorized into three categories:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

What is Supervised Learning?

Let us start with an easy example, say you are teaching a kid to differentiate dogs from cats. How would you do it? 

You may show him/her a dog and say “here is a dog” and when you encounter a cat you would point it out as a cat. When you show the kid enough dogs and cats, he may learn to differentiate between them. If he is trained well, he may be able to recognize different breeds of dogs which he hasn’t even seen. 

Similarly, in Supervised Learning, we have two sets of variables. One is called the target variable, or label (the variable we want to predict), and the other is called features (variables that help us predict the target variable). We show the program (model) the features and the label associated with these features, and then the program is able to find the underlying pattern in the data. Take this example of a dataset where we want to predict the price of a house given its size. The price, which is the target variable, depends upon the size, which is a feature.

Number of rooms | Price
1               | $100
3               | $300
5               | $500

In a real dataset, we will have many more rows and more than one feature, such as size, location, number of floors and many more.

Thus, we can say that the supervised learning model has a set of input variables (x), and an output variable (y). An algorithm identifies the mapping function between the input and output variables. The relationship is y = f(x).

The learning is monitored or supervised in the sense that we already know the output, and the algorithm is corrected each time to optimize its results. The algorithm is trained over the data set and amended until it achieves an acceptable level of performance.
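To make this concrete, here is a minimal sketch that fits the toy rooms/price table above with scikit-learn's LinearRegression (the same library we use in the implementation part below); the model recovers the pattern price = 100 × rooms:

import numpy as np
from sklearn.linear_model import LinearRegression

# Features (number of rooms) and target/label (price) from the toy table above
X = np.array([[1], [3], [5]])
y = np.array([100, 300, 500])

# The algorithm identifies the mapping function y = f(x)
model = LinearRegression()
model.fit(X, y)
print(model.predict([[4]]))   # -> [400.], following the learned pattern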

We can group the supervised learning problems as:

Regression problems – Used to predict future values and the model is trained with the historical data. E.g., Predicting the future price of a house.

Classification problems – Various labels train the algorithm to identify items within a specific category. E.g., dog or cat (as mentioned in the above example), apple or orange, beer or wine or water.

What is Unsupervised Learning?

This approach is the one where we have no target variable, and we have only the input variables (features) at hand. The algorithm learns by itself and discovers interesting structure in the data.

The goal is to decipher the underlying distribution in the data to gain more knowledge about the data. 

We can group the unsupervised learning problems as:

Clustering: This means bundling input variables with the same characteristics together. E.g., grouping users based on search history (a small clustering sketch follows after this list).

Association: Here, we discover the rules that govern meaningful associations among the data set. E.g., People who watch ‘X’ will also watch ‘Y’.
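As a small illustration of clustering, the following sketch groups synthetic 2-D points with scikit-learn's KMeans; the data is made up purely for demonstration:

import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points centred around (0, 0) and (5, 5)
rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

# Ask KMeans for two clusters; labels_ assigns each point to a cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # the two groups get different labels
print(kmeans.cluster_centers_)                   # centres near (0, 0) and (5, 5)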

What is Reinforcement Learning?

In this approach, machine learning models are trained to make a series of decisions based on the rewards and feedback they receive for their actions. The machine learns to achieve a goal in complex and uncertain situations and is rewarded each time it achieves it during the learning period. 

Reinforcement learning is different from supervised learning in the sense that there is no answer available, so the reinforcement agent decides the steps to perform a task. The machine learns from its own experiences when there is no training data set present.

In this tutorial, we are going to mainly focus on Supervised Learning and Unsupervised learning as these are quite easy to understand and implement.

Machine learning Algorithms

This may be the most time-consuming and difficult process in your journey of Machine Learning. There are many algorithms in Machine Learning, and you don't need to know them all in order to get started. But I would suggest that, once you start practising Machine Learning, you start learning about the most popular algorithms out there.

Here, I am going to give a brief overview of one of the simplest algorithms in Machine learning, the K-nearest neighbor Algorithm (which is a Supervised learning algorithm) and show how we can use it for Regression as well as for classification. I would highly recommend checking the Linear Regression and Logistic Regression as we are going to implement them and compare the results with KNN(K-nearest neighbor) algorithm in the implementation part.

You may want to note that there are usually separate algorithms for regression problems and classification problems. But by modifying an algorithm, we can use it for both classification as well as regression, as you will see below.

K-Nearest Neighbor Algorithm

KNN belongs to a group of lazy learners. As opposed to eager learners such as logistic regression, SVM and neural nets, lazy learners just store the training data in memory. During the training phase, KNN arranges the data (a sort of indexing process) in order to find the closest neighbours efficiently during the inference phase. Otherwise, it would have to compare each new case during inference with the whole dataset, making it quite inefficient.

So if you are wondering what a training phase, eager learners and lazy learners are: for now, just remember that the training phase is when an algorithm learns from the data provided to it. For example, if you have gone through the Linear Regression algorithm linked above, during the training phase the algorithm tries to find the best-fit line, a process that involves a lot of computation and hence takes a lot of time; this type of algorithm is called an eager learner. On the other hand, lazy learners like KNN do not involve much computation during training and hence train faster.

K-NN for Classification Problem

Now let us see how we can use K-NN for classification. Here is a hypothetical dataset which tries to predict whether a person is male or female (label) on the basis of height and weight (features).

Height (cm) - feature | Weight (kg) - feature | Gender (label)
187 | 80 | Male
165 | 50 | Female
199 | 99 | Male
145 | 70 | Female
180 | 87 | Male
178 | 65 | Female
187 | 60 | Male

Now let us plot these points:

K-NN algorithm

Now we have a new point that we want to classify, given that its height is 190 cm and weight is 100 Kg. Here is how K-NN will classify this point:

  1. Select the value of K, which the user chooses based on what they think will work best after analysing the data.
  2. Measure the distance of the new point from its K nearest points. There are various methods for calculating this distance, of which the most commonly known are Euclidean and Manhattan distance (for continuous data points, i.e. regression problems) and Hamming distance (for categorical data, i.e. classification problems).
  3. Identify the class of the points that are closest to the new point and label the new point accordingly. So if the majority of points close to our new point belong to a certain class "a", then our new point is predicted to be from class "a". (A short code sketch of these steps follows this list.)
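Here is a minimal from-scratch sketch of these three steps, using Euclidean distance on the height/weight table above (for illustration only; in practice scikit-learn's KNeighborsClassifier, used later in this tutorial, does the same job):

from collections import Counter
import math

# (height, weight) -> gender, taken from the table above
data = [(187, 80, "Male"), (165, 50, "Female"), (199, 99, "Male"),
        (145, 70, "Female"), (180, 87, "Male"), (178, 65, "Female"),
        (187, 60, "Male")]

def knn_classify(height, weight, k):
    # Step 2: measure the Euclidean distance to every known point
    by_distance = sorted(data, key=lambda p: math.hypot(p[0] - height, p[1] - weight))
    # Step 3: take the majority label among the k nearest neighbours
    labels = [label for _, _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

print(knn_classify(190, 100, k=3))   # -> "Male", as in the walkthrough below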

Now let us apply this algorithm to our own dataset. Let us first plot the new data point.

K-NN algorithm

Now let us take k=3 i.e, we will see the three closest points to the new point:

K-NN algorithm

Therefore, it is classified as Male:

K-NN algorithm

Now let us take the value of k=5 and see what happens:

K-NN algorithm

As we can see, four of the points closest to our new data point are males and just one point is female, so we go with the majority and classify it as Male again. When doing classification with two classes, you should select an odd value of K so that there can be no ties in the majority vote.

K-NN for a Regression problem

We have seen how we can use K-NN for classification. Now let us see what changes are needed to use it for regression. The algorithm is almost the same; there is just one difference. In classification, we checked for the majority class among all the nearest points. Here, we take the average of all the nearest points and use that as the predicted value. Let us again take the same example, but here we have to predict the weight (label) of a person given their height (feature).

Height (cm) - feature | Weight (kg) - label
187 | 80
165 | 50
199 | 99
145 | 70
180 | 87
178 | 65
187 | 60

Now we have a new data point with a height of 160 cm; we will predict its weight by taking the values of K as 1, 2 and 4.

When K=1: The closest point to 160cm in our data is 165cm which has a weight of 50, so we conclude that the predicted weight is 50 itself.

When K=2: The two closest points are 165 and 145 which have weights equal to 50 and 70 respectively. Taking average we say that the predicted weight is (50+70)/2=60.

When K=4: Repeating the same process, we now take the 4 closest points (165, 145, 178 and 180 cm) and hence get (50+70+65+87)/4 = 68 as the predicted weight.
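These averages can be checked with a few lines of code; a small sketch over the table above:

# Heights and weights from the table above
points = [(187, 80), (165, 50), (199, 99), (145, 70), (180, 87), (178, 65), (187, 60)]

def knn_regress(height, k):
    # Sort by distance to the query height and average the k nearest weights
    nearest = sorted(points, key=lambda p: abs(p[0] - height))[:k]
    return sum(weight for _, weight in nearest) / k

for k in (1, 2, 4):
    print(k, knn_regress(160, k))   # -> 50.0, 60.0, 68.0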

You might be thinking that this is really simple and there is nothing so special about Machine learning, it is just basic Mathematics. But remember this is the simplest algorithm and you will see much more complex algorithms once you move ahead in this journey.

At this stage, you must have a vague idea of how machine learning works; don't worry if you are still confused. Also, if you want to go a bit deeper now, here is an excellent article – Gradient Descent in Machine Learning, which discusses how we use an optimization technique called gradient descent to find a best-fit line in linear regression.

How To Choose Machine Learning Algorithm?

There are plenty of machine learning algorithms and it could be a tough task to decide which algorithm to choose for a specific application. The choice of the algorithm will depend on the objective of the problem you are trying to solve.

Let us take an example of a task to predict the type of fruit among three varieties, i.e., apple, banana, and orange. The predictions are based on the colour of the fruit. The picture depicts the results of ten different algorithms. The picture on the top left is the dataset. The data is classified into three categories: red, light blue and dark blue. There are some groupings. For instance, from the second image, everything in the upper left belongs to the red category; in the middle part there is a mixture of uncertainty and light blue, while the bottom corresponds to the dark category. The other images show different algorithms and how they try to classify the data.

Steps in Machine Learning

I wish machine learning were just applying algorithms to your data and getting back the predicted values, but it is not that simple. There are several steps in Machine Learning which are a must for each project.

  1. Gathering Data: This is perhaps the most important and time-consuming process. In this step, we need to collect data that can help us to solve our problem. For example, if you want to predict the prices of houses, we need an appropriate dataset that contains all the information about past house sales, and then form a tabular structure. We are going to solve a similar problem in the implementation part.
  2. Preparing that data: Once we have the data, we need to bring it into the proper format and preprocess it. There are various steps involved in pre-processing, such as data cleaning. For example, if your dataset has some empty values or abnormal values (e.g., a string instead of a number), how are you going to deal with them? There are various ways, but one simple way is to just drop the rows that have empty values. Also, sometimes the dataset has columns that have no impact on our results, such as IDs; we remove those columns as well. We usually use Data Visualization to visualize our data through graphs and diagrams, and after analyzing the graphs we decide which features are important. Data preprocessing is a vast topic and I would suggest checking out this article to know more about it.
  3. Choosing a model: Now our data is ready to be fed into a Machine Learning algorithm. In case you are wondering what a Model is: often "machine learning algorithm" is used interchangeably with "machine learning model." A model is the output of a machine learning algorithm run on data. In simple terms, when we implement the algorithm on all our data, we get an output which contains all the rules, numbers, and any other algorithm-specific data structures required to make predictions. For example, after implementing Linear Regression on our data we get an equation of the best-fit line, and this equation is termed a model. The next step is usually training the model, in case we don't want to tune hyperparameters and select the default ones.
  4. Hyperparameter Tuning: Hyperparameters are crucial as they control the overall behavior of a machine learning model. The ultimate goal is to find an optimal combination of hyperparameters that gives us the best results. But what are these hyperparameters? Remember the variable K in our K-NN algorithm. We got different results when we set different values of K. The best value for K is not predefined and is different for different datasets. There is no method to know the best value for K in advance, but you can try different values and check for which value we get the best results. Here, K is a hyperparameter, and each algorithm has its own hyperparameters whose values we need to tune to get the best results. To get more information about it, check out this article – Hyperparameter Tuning Explained.
  5. Evaluation: You may be wondering how you can know if the model is performing well or badly. What better way than to test the model on some data? This data is known as testing data, and it must not be a subset of the data (training data) on which we trained the algorithm. The objective of training the model is not for it to learn all the values in the training dataset, but to identify the underlying pattern in the data and, based on that, make predictions on data it has never seen before. There are various evaluation methods, such as K-fold cross-validation and many more. We are going to discuss this step in detail in the coming section.
  6. Prediction: Now that our model has performed well on the testing set as well, we can use it in the real world and hope it is going to perform well on real-world data.

machine learning tutorial

Evaluation of Machine learning Model

For evaluating the model, we hold out a portion of data called test data and do not use this data to train the model. Later, we use test data to evaluate various metrics.

The results of predictive models can be viewed in various forms such as by using confusion matrix, root-mean-squared error(RMSE), AUC-ROC etc.

TP (True Positive) is the number of values predicted to be positive by the algorithm that were actually positive in the dataset. TN (True Negative) represents the number of values predicted not to belong to the positive class that actually do not belong to it. FP (False Positive) depicts the number of instances misclassified as belonging to the positive class that are actually part of the negative class. FN (False Negative) shows the number of instances classified as the negative class that should belong to the positive class.
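These four counts can be read off directly with scikit-learn, as in this small sketch (the labels here are made up for illustration):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth (1 = positive class)
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # model output

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(tp, tn, fp, fn)   # -> 3 3 1 1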

Now, in regression problems, we usually use RMSE as the evaluation metric. In this evaluation technique, we use the error term.

Let’s say you feed a model some input X and the model predicts 10, but the actual value is 5. This difference between your prediction (10) and the actual observation (5) is the error term (prediction − actual). The formula to calculate RMSE is given by:

RMSE = √( Σᵢ (predictionᵢ − actualᵢ)² / N )

Where N is the total number of samples for which we are calculating RMSE.

In a good model, the RMSE should be as low as possible and there should not be much difference between RMSE calculated over training data and RMSE calculated over the testing set. 
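In code, RMSE boils down to a one-liner; a small sketch with made-up predictions:

import numpy as np

actual      = np.array([5, 12, 7, 9])
predictions = np.array([10, 11, 6, 10])   # model output (made-up values)

# Root of the mean of squared error terms (prediction - actual)
rmse = np.sqrt(np.mean((predictions - actual) ** 2))
print(rmse)   # ~2.65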

Python for Machine Learning

Although there are many languages that can be used for machine learning, in my opinion Python is hands down the best programming language for Machine Learning applications. This is due to the various benefits mentioned in the section below. Other programming languages that could be used for Machine Learning applications are R, C++, JavaScript, Java, C#, Julia, Shell, TypeScript, and Scala. R is also a really good language to get started with machine learning.

Python is famous for its readability and relatively lower complexity compared to other programming languages. Machine Learning applications involve complex concepts like calculus and linear algebra, which take a lot of effort and time to implement. Python helps in reducing this burden with quick implementation, allowing the Machine Learning engineer to validate an idea. You can check out the Python Tutorial to get a basic understanding of the language. Another benefit of using Python for Machine Learning is the pre-built libraries. There are different packages for different types of applications, as mentioned below:

  1. Numpy, OpenCV, and Scikit are used when working with images
  2. NLTK along with Numpy and Scikit again when working with text
  3. Librosa for audio applications
  4. Matplotlib, Seaborn, and Scikit for data representation
  5. TensorFlow and Pytorch for Deep Learning applications
  6. Scipy for Scientific Computing
  7. Django for integrating web applications
  8. Pandas for high-level data structures and analysis

Implementation of algorithms in Machine Learning with Python

Before moving on to the implementation of machine learning with Python, you need to download some important software and libraries. Anaconda is an open-source distribution that makes it easy to perform Python/R data science and machine learning on a single machine. It contains almost all the libraries that we need. In this tutorial, we are mostly going to use the scikit-learn library, which is a free machine learning library for the Python programming language.

Now, we are going to implement all that we have learnt so far. We will solve a regression problem and then a classification problem, using the steps mentioned above.

Implementation of a Regression problem

We have a problem of predicting the prices of houses given some features such as size, number of rooms and many more. So let us get started:

  1. Gathering data: We don’t need to manually collect the data for past sales of houses. Luckily, there are some good people who do it for us and make these datasets available to use. Also, let me mention that not all datasets are free, but for practice you will find most datasets free to use on the internet.

The dataset we are using is called the Boston Housing dataset. Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository).

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: the proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centers
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town 
  12. B: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town 
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in $1000s

Here is a link to download this dataset.

Now, after opening the file, you can see the data about house sales. This dataset is not in a proper tabular form; in fact, there are no column names, and each value is separated by spaces. We are going to use Pandas to put it in proper tabular form. We will provide it with a list containing the column names and also use the delimiter '\s+', which tells Pandas that entries are separated by one or more spaces.

We are going to import all the necessary libraries such as Pandas and NumPy. Next, we will import the data file which is in CSV format into a pandas DataFrame.

import numpy as np
import pandas as pd
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT', 'MEDV']
bos1 = pd.read_csv('housing.csv', delimiter=r"\s+", names=column_names)

machine learning tutorial

2. Preprocess Data: The next step is to pre-process the data. For this dataset, we can see that there are no NaN (missing) values, and all the data is numeric rather than strings, so we won't face any errors when training the model. So let us just divide our data into training data and testing data such that 70% of the data is training data and the rest is testing data. We could also scale our data to make the predictions more accurate, but for now let us keep it simple.

bos1.isna().sum()

machine learning tutorial

from sklearn.model_selection import train_test_split
X=np.array(bos1.iloc[:,0:13])
Y=np.array(bos1["MEDV"])
#testing data size is of 30% of entire data
x_train, x_test, y_train, y_test =train_test_split(X,Y, test_size = 0.30, random_state =5)

3. Choose a Model: For this particular problem, we are going to use two supervised learning algorithms that can solve regression problems and later compare their results. One algorithm is K-NN (K-nearest Neighbor), which is explained above, and the other is Linear Regression. I would highly recommend checking it out in case you haven't already.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
#load our first model 
lr = LinearRegression()
#train the model on training data
lr.fit(x_train,y_train)
#predict the testing data so that we can later evaluate the model
pred_lr = lr.predict(x_test)
#load the second model
Nn=KNeighborsRegressor(3)
Nn.fit(x_train,y_train)
pred_Nn = Nn.predict(x_test)

4. Hyperparameter Tuning: Since this is a beginners' tutorial, I am only going to tune the value of K in the K-NN model. I will just use a for loop and check the results for k ranging from 1 to 50. K-NN is extremely fast on a small dataset like ours, so it won't take much time. There are much more advanced methods of doing this, which you can find linked in the steps of Machine Learning section above.

import sklearn.metrics
for i in range(1,50):
    model=KNeighborsRegressor(i)
    model.fit(x_train,y_train)
    pred_y = model.predict(x_test)
    mse = sklearn.metrics.mean_squared_error(y_test, pred_y,squared=False)
    print("{} error for k = {}".format(mse,i))

Output:

machine learning tutorial

From the output, we can see that the error is least for k=3, which justifies why I set the value K=3 while training the model.

5. Evaluating the model: For evaluating the model we are going to use the mean_squared_error() method from the scikit-learn library. Remember to set the parameter ‘squared’ as False, to get the RMSE error.

#error for linear regression
mse_lr= sklearn.metrics.mean_squared_error(y_test, pred_lr,squared=False)
print("error for Linear Regression = {}".format(mse_lr))
#error for K-NN
mse_Nn= sklearn.metrics.mean_squared_error(y_test, pred_Nn,squared=False)
print("error for K-NN = {}".format(mse_Nn))

Now, from the results, we can conclude that Linear Regression performs better than K-NN for this particular dataset. But it is not necessarily the case that Linear Regression will always perform better than K-NN; it completely depends upon the data we are working with.

6. Prediction: Now we can use the models to predict the prices of houses using the predict function, as we did above. Make sure that when predicting the prices we provide all the features that were present when training the model.

Here is the whole script:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
import sklearn.metrics
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
bos1 = pd.read_csv('housing.csv', delimiter=r"\s+", names=column_names)
X=np.array(bos1.iloc[:,0:13])
Y=np.array(bos1["MEDV"])
#testing data size is of 30% of entire data
x_train, x_test, y_train, y_test =train_test_split(X,Y, test_size = 0.30, random_state =5)
#load our first model 
lr = LinearRegression()
#train the model on training data
lr.fit(x_train,y_train)
#predict the testing data so that we can later evaluate the model
pred_lr = lr.predict(x_test)
#load the second model
Nn=KNeighborsRegressor(3)
Nn.fit(x_train,y_train)
pred_Nn = Nn.predict(x_test)
#error for linear regression
mse_lr= sklearn.metrics.mean_squared_error(y_test, pred_lr,squared=False)
print("error for Linear Regression = {}".format(mse_lr))
#error for linear regression
mse_Nn= sklearn.metrics.mean_squared_error(y_test, pred_Nn,squared=False)
print("error for K-NN = {}".format(mse_Nn))

Implementation of a Classification Problem

In this section, we will solve the popular classification problem known as the Iris classification problem. The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The columns in this dataset are:

[Image: the different species of iris]

  • SepalLengthCm
  • SepalWidthCm
  • PetalLengthCm
  • PetalWidthCm
  • Species

We don't need to download this dataset, as the scikit-learn library already contains it and we can simply import it from there. So let us start coding this up:

from sklearn.datasets import load_iris
iris = load_iris()
X=iris.data
Y=iris.target
print(X)
print(Y)

As we can see, each sample's features are in a list of four items, and at the bottom we get a list of labels that have been transformed into numbers, because the model cannot understand names that are strings, so we encode each name as a number. This has already been done by the scikit-learn developers.
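
As an aside, this numeric mapping is the same kind you would produce yourself with scikit-learn's LabelEncoder if the labels arrived as strings. Here is a minimal sketch with made-up labels:

from sklearn.preprocessing import LabelEncoder
#made-up string labels, as they might appear in a raw CSV file
labels = ["setosa", "versicolor", "virginica", "setosa"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoded)          #[0 1 2 0]
print(encoder.classes_) #['setosa' 'versicolor' 'virginica']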

from sklearn.model_selection import train_test_split
#testing data size is of 30% of entire data
x_train, x_test, y_train, y_test =train_test_split(X,Y, test_size = 0.3, random_state =5)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
#fit the K-NN model on the training data
Nn = KNeighborsClassifier(8)
Nn.fit(x_train,y_train)
#the score() method calculates the accuracy of the model
print("Accuracy for K-NN is ",Nn.score(x_test,y_test))
Lr = LogisticRegression()
Lr.fit(x_train,y_train)
print("Accuracy for Logistic Regression is ",Lr.score(x_test,y_test))

Advantages of Machine Learning

1. Easily identifies trends and patterns

Machine Learning can review large volumes of data and discover specific trends and patterns that would not be apparent to humans. For instance, e-commerce websites like Amazon and Flipkart use it to understand the browsing behavior and purchase histories of their users so they can serve the right products, deals, and reminders, and they use the results to show users relevant advertisements.

2. Continuous Improvement

We are continuously generating new data, and when we provide this data to a Machine Learning model, it can keep upgrading itself over time and increase its performance and accuracy. We can say it is like gaining experience: models keep improving in accuracy and efficiency, which lets them make better decisions.

3. Handling multidimensional and multi-variety data

Machine Learning algorithms are good at handling data that are multidimensional and multi-variety, and they can do this in dynamic or uncertain environments.

4. Wide Applications

You could be an e-tailer or a healthcare provider and make Machine Learning work for you. Where it does apply, it holds the capability to help deliver a much more personal experience to customers while also targeting the right customers.

Disadvantages of Machine Learning

1. Data Acquisition

Machine Learning requires massive data sets to train on, and these should be inclusive/unbiased and of good quality. There can also be times when we must wait for new data to be generated.

2. Time and Resources

Machine Learning needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a considerable amount of accuracy and relevancy. It also needs massive resources to function. This can mean additional requirements of computer power for you.

3. Interpretation of Results

Another major challenge is the ability to accurately interpret the results generated by the algorithms. You must also carefully choose the algorithms for your purpose. Sometimes, based on some preliminary analysis, you might select an algorithm, but it is not guaranteed that this model will be best for the problem.

4. High error-susceptibility

Machine Learning is autonomous but highly susceptible to errors. Suppose you train an algorithm with data sets small enough to not be inclusive. You end up with biased predictions coming from a biased training set. This leads to irrelevant advertisements being displayed to customers. In the case of Machine Learning, such blunders can set off a chain of errors that can go undetected for long periods of time. And when they do get noticed, it takes quite some time to recognize the source of the issue, and even longer to correct it.

Future of Machine Learning

Machine Learning can be a competitive advantage to any company, be it a top MNC or a startup, as things that are currently done manually will be done by machines tomorrow. With the introduction of projects such as self-driving cars and Sophia (a humanoid robot developed by the Hong Kong-based company Hanson Robotics), we have already caught a glimpse of what the future can be. The Machine Learning revolution will stay with us for a long time, and so will the future of Machine Learning.

Machine Learning Tutorial FAQs

How do I start learning Machine Learning?

You first need to start with the basics. You need to understand the prerequisites, which include learning Linear Algebra and Multivariate Calculus, Statistics, and Python. Then you need to learn several ML concepts, which include terminology of Machine Learning, types of Machine Learning, and Resources of Machine Learning. The third step is taking part in competitions. You can also take up a free online statistics for machine learning course and understand the foundational concepts.

Is Machine Learning easy for beginners? 

Machine Learning is not the easiest subject. The main difficulty in learning Machine Learning is debugging. However, if you study the right resources, you will be able to learn Machine Learning without any hassles.

What is a simple example of Machine Learning? 

Recommendation Engines (Netflix); Sorting, tagging and categorizing photos (Yelp); Customer Lifetime Value (Asos); Self-Driving Cars (Waymo); Education (Duolingo); Determining Credit Worthiness (Deserve); Patient Sickness Predictions (KenSci); and Targeted Emails (Optimail).

Can I learn Machine Learning in 3 months? 

Machine Learning is vast and consists of several things. Therefore, it will take you around six months to learn it, provided you spend at least 5-6 hours every day. Also, the time taken to learn Machine Learning depends a lot on your mathematical and analytical skills.

Does Machine Learning require coding? 

If you are learning traditional Machine Learning, it would require you to know software programming as it will help you to write machine learning algorithms. However, through some online educational platforms, you do not need to know coding to learn Machine Learning.

Is Machine Learning a good career? 

Machine Learning is one of the best careers at present. Whether it is for the current demand, job, and salary growth, Machine Learning Engineer is one of the best profiles. You need to be very good at data, automation, and algorithms.

Can I learn Machine Learning without Python? 

To learn Machine Learning, you need to have some basic knowledge of Python. Anaconda, a Python distribution supported by all major operating systems such as Windows and Linux, offers an overall package for machine learning, including matplotlib, scikit-learn, and NumPy.

Where can I practice Machine Learning? 

The online platforms where you can practice Machine Learning include CloudXLab, Google Colab, Kaggle, MachineHack, and OpenML.

Where can I learn Machine Learning for free?

You can learn the basics of Machine Learning from online platforms like Great Learning. You can enroll in the Beginners Machine Learning course and get the certificate for free. The course is easy and perfect for beginners to start with.


Original article source at: https://www.mygreatlearning.com

#machine-learning