Rahul Gandhi

Using NumPy to Speed Up K-Means Clustering by 70x

In Part 1 of our series on how to write efficient code using NumPy, we covered the important topics of vectorization and broadcasting. In this part we will put these concepts into practice by implementing an efficient version of the K-Means clustering algorithm using NumPy. We will benchmark it against a naive version implemented entirely using looping in Python. In the end we’ll see that the NumPy version is about 70 times faster than the simple loop version.

To be exact, in this post we will cover:

  1. Understanding K-Means Clustering
  2. Implementing K-Means using loops
  3. Using cProfile to find bottlenecks in the code
  4. Optimizing K-Means using NumPy

Let’s get started!

Understanding K-Means Clustering

In this post we will be optimizing an implementation of the k-means clustering algorithm. It is therefore imperative that we at least have a basic understanding of how the algorithm works. Of course, a detailed discussion is beyond the scope of this post; if you want to dig deeper into k-means, you can find several recommended links below.

What Does the K-Means Clustering Algorithm Do?

In a nutshell, k-means is an unsupervised learning algorithm that separates data into groups based on similarity. Being unsupervised, it requires no labels for the data.

The most important hyperparameter for the k-means algorithm is the number of clusters, or k. Once we have decided upon our value for k, the algorithm works as follows.

  1. Initialize k points (corresponding to k clusters) randomly from the data. We call these points centroids.
  2. For each data point, measure the L2 distance from each centroid. Assign each data point to the centroid for which it has the shortest distance. In other words, assign the closest centroid to each data point.
  3. Now each data point assigned to a centroid forms an individual cluster. For k centroids, we will have k clusters. Update the value of the centroid of each cluster by the mean of all the data points present in that particular cluster.
  4. Repeat steps 2 and 3 until the maximum change in the centroids between iterations falls below a threshold value, or the clustering error converges.

Here’s the pseudo-code for the algorithm.

Pseudo-code for the K-Means Clustering Algorithm
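
In outline, the procedure is roughly the following (a sketch consistent with the steps above):

initialize k centroids by sampling k points from the data
repeat until the centroids stop moving (or for a fixed number of iterations):
    for each data point:
        assign the point to the centroid with the smallest L2 distance
    for each centroid:
        set the centroid to the mean of all points assigned to it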

I’m going to leave K-Means at that. This is enough to help us code the algorithm. However, there is much more to it, such as how to choose a good value of k, how to evaluate the performance, which distance metrics can be used, preprocessing steps, and theory. In case you wish to dig deeper, here are a few links for you to study it further.

Now, let’s proceed with the implementation of the algorithm.

Implementing K-Means Using Loops

In this section we will implement the K-Means algorithm using plain Python and loops. Apart from generating the data, we will not use NumPy. This code will serve as the benchmark for our optimized version.

Generating the Data

To perform clustering, we first need our data. While we can choose from multiple datasets online, let’s keep things rather simple and intuitive. We are going to synthesize a dataset by sampling from multiple Gaussian distributions, so that visualizing clusters is easy for us.

In case you don’t know what a Gaussian distribution is, check it out!

We will create data from four Gaussians with different means and standard deviations.

import numpy as np 
## Size of dataset to be generated. The final size is 4 * data_size
data_size = 1000
num_iters = 50
num_clusters = 4

## sample from Gaussians 
data1 = np.random.normal((5,5,5), (4, 4, 4), (data_size,3))
data2 = np.random.normal((4,20,20), (3,3,3), (data_size, 3))
data3 = np.random.normal((25, 20, 5), (5, 5, 5), (data_size,3))
data4 = np.random.normal((30, 30, 30), (5, 5, 5), (data_size,3))

## Combine the data to create the final dataset
data = np.concatenate((data1,data2, data3, data4), axis = 0)

## Shuffle the data
np.random.shuffle(data)

In order to aid our visualization, let’s plot this data in a 3-D space.

3-D Visualization of the Dataset
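
The article does not show the plotting code; a minimal sketch that would produce such a plot with Matplotlib (an assumption, using the data array generated above) is:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  ## registers the 3-D projection on older Matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data[:, 0], data[:, 1], data[:, 2], s=2)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()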

It’s very easy to see the four clusters of data in the plot above. This, for one, makes it easy for us to pick a suitable value of k for our implementation. This is in the spirit of keeping the algorithmic details as simple as possible, so that we can focus on the implementation.

Helper Functions

We begin by initializing our centroids, as well as a list to keep track of which centroid each data point is assigned to.

import random

## Set random seed for reproducibility
random.seed(0)

## Initialize the list to store centroids
centroids = []

## Sample initial centroids
random_indices = random.sample(range(data.shape[0]), 4)
for i in random_indices:
    centroids.append(data[i])

## Create a list to store which centroid is assigned to each data point
assigned_centroids = [0] * len(data)

Before we implement our loop, we will first implement a few helper functions.

compute_l2_distance takes two points (say [0, 1, 0] and [4, 2, 3]) and computes the L2 distance between them (strictly, the squared L2 distance; skipping the square root does not change which centroid is closest), according to the following formula.

L2(X_1, X_2) = \sum_{i=1}^{\mathrm{dim}(X_1)} (X_1[i] - X_2[i])^2

def compute_l2_distance(x, centroid):
    ## Initialize the distance to 0
    dist = 0

    ## Loop over the dimensions. Take squared difference and add to 'dist' 
    for i in range(len(x)):
        dist += (centroid[i] - x[i])**2

    return dist

The other helper function we implement is called get_closest_centroid, the name being pretty self-explanatory. The function takes an input x and a list, centroids, and returns the index of the list centroids corresponding to the centroid closest to x.

def get_closest_centroid(x, centroids):
    ## Initialize the list to keep distances from each centroid
    centroid_distances = []

    ## Loop over each centroid and compute the distance from data point.
    for centroid in centroids:
        dist = compute_l2_distance(x, centroid)
        centroid_distances.append(dist)

    ## Get the index of the centroid with the smallest distance to the data point 
    closest_centroid_index =  min(range(len(centroid_distances)), key=lambda x: centroid_distances[x])

    return closest_centroid_index

Then we implement the function compute_sse, which computes the SSE or Sum of Squared Errors. This metric is used to guide how many iterations we have to do. Once this value converges, we can stop training.

def compute_sse(data, centroids, assigned_centroids):
    ## Initialise SSE 
    sse = 0

    ## Compute the squared distance for each data point and add. 
    for i,x in enumerate(data):
        ## Get the associated centroid for data point
        centroid = centroids[assigned_centroids[i]]

        ## Compute the distance to the centroid
        dist = compute_l2_distance(x, centroid)

        ## Add to the total distance
        sse += dist

    sse /= len(data)
    return sse

Main Loop

Now, let’s write the main loop. Refer to the pseudo-code above if needed. Instead of looping until convergence, we merely loop for 50 iterations.

## Number of dimensions in centroid
num_centroid_dims = data.shape[1]

## List to store SSE for each iteration 
sse_list = []

tic = time.time()

## Loop over iterations
for n in range(num_iters):

    ## Loop over each data point
    for i in range(len(data)):
        x = data[i]

        ## Get the closest centroid
        closest_centroid = get_closest_centroid(x, centroids)

        ## Assign the centroid to the data point.
        assigned_centroids[i] = closest_centroid

    ## Loop over centroids and compute the new ones.
    for c in range(len(centroids)):
        ## Get all the data points belonging to a particular cluster
        cluster_data = [data[i] for i in range(len(data)) if assigned_centroids[i] == c]

        ## Initialize the list to hold the new centroid
        new_centroid = [0] * len(centroids[0])

        ## Compute the average of cluster members to compute new centroid
        ## Loop over dimensions of data
        for dim in range(num_centroid_dims): 
            dim_sum = [x[dim] for x in cluster_data]
            dim_sum = sum(dim_sum) / len(dim_sum)
            new_centroid[dim] = dim_sum

        ## assign the new centroid
        centroids[c] = new_centroid

    ## Compute the SSE for the iteration
    sse = compute_sse(data, centroids, assigned_centroids)
    sse_list.append(sse)

The entire code can be viewed below.

import numpy as np 
import matplotlib.pyplot as plt 
import random
import time

## Size of dataset to be generated. The final size is 4 * data_size
data_size = 1000
num_iters = 50
num_clusters = 4

## sample from Gaussians 
data1 = np.random.normal((5,5,5), (4, 4, 4), (data_size,3))
data2 = np.random.normal((4,20,20), (3,3,3), (data_size, 3))
data3 = np.random.normal((25, 20, 5), (5, 5, 5), (data_size,3))
data4 = np.random.normal((30, 30, 30), (5, 5, 5), (data_size,3))

## Combine the data to create the final dataset
data = np.concatenate((data1,data2, data3, data4), axis = 0)

## Shuffle the data
np.random.shuffle(data)

## Set random seed for reproducibility 
random.seed(0)

## Initialise the list to store centroids
centroids = []

## Sample initial centroids
random_indices = random.sample(range(data.shape[0]), 4)
for i in random_indices:
    centroids.append(data[i])

## Create a list to store which centroid is assigned to each data point
assigned_centroids = [0] * len(data)

def compute_l2_distance(x, centroid):
    ## Initialise the distance to 0
    dist = 0

    ## Loop over the dimensions. Take squared difference and add to `dist`
    for i in range(len(x)):
        dist += (centroid[i] - x[i])**2

    return dist

def get_closest_centroid(x, centroids):
    ## Initialise the list to keep distances from each centroid
    centroid_distances = []

    ## Loop over each centroid and compute the distance from data point.
    for centroid in centroids:
        dist = compute_l2_distance(x, centroid)
        centroid_distances.append(dist)

    ## Get the index of the centroid with the smallest distance to the data point 
    closest_centroid_index =  min(range(len(centroid_distances)), key=lambda x: centroid_distances[x])

    return closest_centroid_index

def compute_sse(data, centroids, assigned_centroids):
    ## Initialise SSE 
    sse = 0

    ## Compute the squared distance for each data point and add. 
    for i,x in enumerate(data):
        ## Get the associated centroid for data point
        centroid = centroids[assigned_centroids[i]]

        ## Compute the Distance to the centroid
        dist = compute_l2_distance(x, centroid)

        ## Add to the total distance
        sse += dist

    sse /= len(data)
    return sse

## Number of dimensions in centroid
num_centroid_dims = data.shape[1]

## List to store SSE for each iteration 
sse_list = []

tic = time.time()

## Loop over iterations
for n in range(num_iters):

    ## Loop over each data point
    for i in range(len(data)):
        x = data[i]

        ## Get the closest centroid
        closest_centroid = get_closest_centroid(x, centroids)

        ## Assign the centroid to the data point.
        assigned_centroids[i] = closest_centroid

    ## Loop over centroids and compute the new ones.
    for c in range(len(centroids)):
        ## Get all the data points belonging to a particular cluster
        cluster_data = [data[i] for i in range(len(data)) if assigned_centroids[i] == c]

        ## Initialise the list to hold the new centroid
        new_centroid = [0] * len(centroids[0])

        ## Compute the average of cluster members to compute new centroid
        ## Loop over dimensions of data
        for dim in range(num_centroid_dims): 
            dim_sum = [x[dim] for x in cluster_data]
            dim_sum = sum(dim_sum) / len(dim_sum)
            new_centroid[dim] = dim_sum

        ## assign the new centroid
        centroids[c] = new_centroid

    ## Compute the SSE for the iteration
    sse = compute_sse(data, centroids, assigned_centroids)
    sse_list.append(sse)

toc = time.time()
print((toc - tic)/50)
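
Running this gives the per-iteration time of the pure-loop version, which is the baseline we want to beat. The optimized version promised in the introduction rests on computing all point-to-centroid distances at once; a minimal sketch of that broadcasting idea (an illustration only, not the article's full optimized implementation) looks like this:

## Illustrative sketch (assumption): vectorize the assignment step with broadcasting
centroids_arr = np.array(centroids)                    ## shape (k, 3)
diffs = data[:, None, :] - centroids_arr[None, :, :]   ## shape (N, k, 3)
dists = (diffs ** 2).sum(axis=2)                       ## squared L2 distances, shape (N, k)
assigned = dists.argmin(axis=1)                        ## index of the closest centroid per point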


Chloe Butler

Pdf2gerb: Perl Script Converts PDF Files to Gerber format

pdf2gerb

Perl script converts PDF files to Gerber format

Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

  1. Design the PCB using your favorite CAD or drawing software.
  2. Print the top and bottom copper and top silk screen layers to a PDF file.
  3. Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
  4. Use a Gerber viewer to double-check the output against the original PCB design.
  5. Make adjustments as needed.
  6. Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.


pdf2gerb_cfg.pm

#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;

use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)


##############################################################################################
#configurable settings:
#change values here instead of in main pfg2gerb.pl file

use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call

#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software.  \nGerber files MAY CONTAIN ERRORS.  Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG

use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC

use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)

#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1); 

#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
    .010, -.001,  #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
    .031, -.014,  #used for vias
    .041, -.020,  #smallest non-filled plated hole
    .051, -.025,
    .056, -.029,  #useful for IC pins
    .070, -.033,
    .075, -.040,  #heavier leads
#    .090, -.043,  #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
    .100, -.046,
    .115, -.052,
    .130, -.061,
    .140, -.067,
    .150, -.079,
    .175, -.088,
    .190, -.093,
    .200, -.100,
    .220, -.110,
    .160, -.125,  #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
    .090, -.040,  #want a .090 pad option, but use dummy hole size
    .065, -.040, #.065 x .065 rect pad
    .035, -.040, #.035 x .065 rect pad
#traces:
    .001,  #too thin for real traces; use only for board outlines
    .006,  #minimum real trace width; mainly used for text
    .008,  #mainly used for mid-sized text, not traces
    .010,  #minimum recommended trace width for low-current signals
    .012,
    .015,  #moderate low-voltage current
    .020,  #heavier trace for power, ground (even if a lighter one is adequate)
    .025,
    .030,  #heavy-current traces; be careful with these ones!
    .040,
    .050,
    .060,
    .080,
    .100,
    .120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);

#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size:   parsed PDF diameter:      error:
#  .014                .016                +.002
#  .020                .02267              +.00267
#  .025                .026                +.001
#  .029                .03167              +.00267
#  .033                .036                +.003
#  .040                .04267              +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .0004 inch, 2 at 600 dpi = .0033"
use constant
{
    HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
    RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
    SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
    RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
    TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
    REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};

#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
    CIRCLE_ADJUST_MINX => 0,
    CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
    CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
    CIRCLE_ADJUST_MAXY => 0,
    SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
    WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
    RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};

#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches

#line join/cap styles:
use constant
{
    CAP_NONE => 0, #butt (none); line is exact length
    CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
    CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
    CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
    
#number of elements in each shape type:
use constant
{
    RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
    LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
    CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
    CIRCLE_SHAPELEN => 5, #x, y, 5, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
    rect => RECT_SHAPELEN,
    line => LINE_SHAPELEN,
    curve => CURVE_SHAPELEN,
    circle => CIRCLE_SHAPELEN,
);

#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions

# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?

#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes. 
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes

#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (1/72 * .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches

# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)

# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time

# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const

use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool

my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time

print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load


#############################################################################################
#junk/experiment:

#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html

#my $caller = "pdf2gerb::";

#sub cfg
#{
#    my $proto = shift;
#    my $class = ref($proto) || $proto;
#    my $settings =
#    {
#        $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
#    };
#    bless($settings, $class);
#    return $settings;
#}

#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;

#print STDERR "read cfg file\n";

#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names

#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }

Download Details:

Author: swannman
Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license


Elton Bogan

SciPy Cluster - K-Means Clustering and Hierarchical Clustering

SciPy is a widely used open-source library in Python whose main purpose is to solve mathematical and scientific problems. Its many sub-packages further extend its functionality, and it is a very important package for data analysis. With its clustering tools, we can segregate clusters from a data set, and we can perform clustering with a single cluster or multiple clusters. Initially, we generate the data set; then we perform clustering on it. Let us learn more about SciPy clusters.

K-means Clustering

It is a method that can be employed to determine clusters and their centers, and we can use it directly on a raw data set. A cluster is defined such that the points inside it are closer to the cluster center than to any other center. Given an initial set of k centers, the k-means method operates in two steps:

  • We define the cluster data points for the given cluster center: the points that are closer to this cluster center than to any other center.
  • We then calculate the mean of all the data points in the cluster. The mean value becomes the new cluster center.

The process iterates until the center values stop changing, at which point we fix and assign the center values. SciPy provides an accurate and convenient implementation of this process, sketched below.
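
For illustration, here is a minimal sketch using SciPy's scipy.cluster.vq module (the toy data and the choice of k = 2 are assumptions made for the example):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Toy data for illustration: two loose blobs in 2-D
points = np.vstack([np.random.normal(0, 1, size=(100, 2)),
                    np.random.normal(5, 1, size=(100, 2))])

# whiten() rescales each feature to unit variance, as SciPy's kmeans expects
obs = whiten(points)

# kmeans() returns the cluster centers and the mean distortion
centers, distortion = kmeans(obs, 2)

# vq() assigns each observation to its nearest center
labels, _ = vq(obs, centers)
print(centers)
print(distortion)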


NumPy Applications - Uses of Numpy

In this Numpy tutorial, we will learn Numpy applications.

NumPy is a basic-level external library in Python used for complex mathematical operations. NumPy overcomes slower execution with the use of multi-dimensional array objects. It has built-in functions for manipulating arrays, and we can convert different algorithms into functions that can be applied to arrays. NumPy's applications are not limited to itself: it is a very diverse library with a wide range of applications in other sectors. NumPy can be put to use along with Data Science, Data Analysis, and Machine Learning. It is also a base for other Python libraries, which use the functionality in NumPy to increase their own capabilities.


Numpy Applications

1. An alternative for lists and arrays in Python

Arrays in NumPy serve the same purpose as lists in Python, but unlike lists, NumPy arrays are homogeneous sets of elements. This homogeneity is the most important feature of NumPy arrays and is what differentiates them from Python lists and arrays. It maintains the uniformity needed for mathematical operations that would not be possible with heterogeneous elements. Another benefit of using NumPy arrays is the large number of functions that can be applied to them; these functions cannot be applied to plain Python sequences because of their heterogeneous nature.

2. NumPy maintains minimal memory

Arrays in NumPy are objects, and Python creates and deletes these objects continually as required. Hence, the memory allocation is smaller compared to Python lists. NumPy also has features to avoid memory wastage in the data buffer: it provides mechanisms such as copies, views, and indexing that help save a lot of memory. Indexing can return a view of the original array, which enables reuse of the data. An array also specifies the data type of its elements, which leads to code optimization.

3. Using NumPy for multi-dimensional arrays

We can also create multi-dimensional arrays in NumPy. These arrays have multiple rows and columns, and having more than one axis is what makes them multi-dimensional. Multi-dimensional arrays enable the creation of matrices, which are easy to work with, and with the use of matrices the code also becomes memory efficient. NumPy has a matrix module to perform various operations on these matrices; a small example of working with a 2-D array follows.
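
A minimal sketch (the values here are made up for illustration):

import numpy as np

# A 2-D (multi-dimensional) array with 2 rows and 3 columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.shape)   # (2, 3)

# Matrix-style operations work directly on 2-D arrays
print(m.T)       # transpose, shape (3, 2)
print(m @ m.T)   # matrix product, shape (2, 2)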

4. Mathematical operations with NumPy

Working with NumPy also includes easy-to-use functions for mathematical computations on the array data set. NumPy has many modules for performing basic and special mathematical functions. There are functions for linear algebra, bitwise operations, the Fourier transform, arithmetic operations, string operations, and more; a few are shown below.
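
A short, illustrative sketch of some of these routines (the input values are made up for the example):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Element-wise arithmetic
print(a + 10)
print(a * 2)

# Linear algebra, Fourier transform, and bitwise operations
print(np.dot(a, a))                            # inner product
print(np.fft.fft(a))                           # discrete Fourier transform
print(np.bitwise_and(np.array([12, 10]), 6))   # bitwise ops (integer arrays only)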


NumPy Features - Why we should use Numpy?

Welcome to DataFlair! In this tutorial, we will learn about NumPy features and their importance.

NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

NumPy (Numerical Python) is an open-source core Python library for scientific computation. It is a general-purpose array- and matrix-processing package. Python is slower than Fortran and other languages at performing loops; to overcome this, we use NumPy, which replaces repetitive looping code with operations that run in compiled form.


NumPy Features

These are the important features of NumPy:

1. High-performance N-dimensional array object

This is the most important feature of the NumPy library: the homogeneous array object. We perform all operations on the elements of this array. Arrays in NumPy can be one-dimensional or multidimensional.

a. One dimensional array

The one-dimensional array is an array consisting of a single row or column. The elements of the array are of homogeneous nature.

b. Multidimensional array

In this case, we have multiple rows and columns, and each axis counts as a dimension. The structure is similar to an Excel sheet. The elements are homogeneous. Both kinds of array are shown in the short sketch below.
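
A minimal sketch (the values are made up for illustration):

import numpy as np

# a. One-dimensional array: a single row of homogeneous elements
one_d = np.array([1, 2, 3, 4])
print(one_d.ndim, one_d.shape)   # 1 (4,)

# b. Multidimensional array: rows and columns, like a spreadsheet
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])
print(two_d.ndim, two_d.shape)   # 2 (2, 3)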

2. It contains tools for integrating code from C/C++ and Fortran

We can use the functions in NumPy to work with code written in other languages. We can hence integrate the functionalities available in various programming languages. This helps implement inter-platform functions.


Using Singular Value Decomposition in Python and NumPy (linalg.svd)

In this Python/NumPy tutorial we will learn about numpy.linalg.svd: singular value decomposition in Python. In mathematics, a singular value decomposition (SVD) of a matrix refers to the factorization of a matrix into three separate matrices. It is a more general version of the eigenvalue decomposition of matrices and is further related to the polar decomposition.

In Python, it is easy to calculate the singular value decomposition of a complex or a real matrix using the numerical Python (NumPy) library. The NumPy library contains various linear algebra functions, including one for calculating the singular value decomposition of a matrix.

In machine learning models, singular value decomposition is widely used to train models and in neural networks. It helps in improving accuracy and in reducing the noise in data. Singular value decomposition transforms one vector into another without them necessarily having the same dimension. Hence, it makes matrix manipulation in vector spaces easier and efficient. It is also used in regression analysis.

Syntax of Numpy linalg.svd() function

The function that calculates the singular value decomposition of a matrix in Python belongs to the numpy.linalg module and is named linalg.svd().

The syntax of numpy.linalg.svd() is as follows:

numpy.linalg.svd(A, full_matrices=True, compute_uv=True, hermitian=False)

You can customize the true and false boolean values based on your requirements.

The parameters of the function are given below:

  • A->array_like: This is the required matrix whose singular value decomposition is being calculated. It can be real or complex as required. Its dimension should be >= 2.
  • full_matrices->boolean value (optional): If set to True, the returned u and v matrices have full, square shapes; if False, their shapes are reduced to match the number of singular values.
  • compute_uv->boolean value (optional): Determines whether the matrices u and v are computed in addition to the singular values.
  • hermitian->boolean value (optional): If True, the given matrix is assumed to be Hermitian (symmetric, if real-valued), which can allow a more efficient method of computation.

The function returns the following arrays, depending on the parameters mentioned above:

  • s->array_like: The vector containing the singular values, sorted in descending order.
  • u->array_like: A unitary array whose columns are the left singular vectors; only returned when compute_uv is set to True.
  • v->array_like: A unitary array whose rows are the right singular vectors (the conjugate transpose of V); only returned when compute_uv is set to True.

It raises a LinAlgError if the SVD computation does not converge.

Prerequisites for setup

Before we dive into the examples, make sure you have the numpy module installed in your local system. This is required for using linear algebraic functions like the one discussed in this article. Run the following command in your terminal.

pip install numpy

That’s all you need right now, let’s look at how we will implement the code in the next section.

To calculate the singular value decomposition (SVD) in Python, use the NumPy library’s linalg.svd() function. Its syntax is numpy.linalg.svd(A, full_matrices=True, compute_uv=True, hermitian=False), where A is the matrix for which SVD is being calculated. It returns U, the vector of singular values S, and V.

Example 1: Calculating the Singular Value Decomposition of a 3×3 Matrix

In this first example we will take a 3X3 matrix and compute its singular value decomposition in the following way:

#importing the numpy module
import numpy as np
#using the numpy.array() function to create an array
A=np.array([[2,4,6],
       [8,10,12],
       [14,16,18]])
#calculating all three matrices for the output
#using the numpy linalg.svd function
u,s,v=np.linalg.svd(A, compute_uv=True)
#displaying the result
print("the output is=")
print('s(the singular value) = ',s)
print('u = ',u)
print('v = ',v)

The output will be:

the output is=
s(the singular value) =  [3.36962067e+01 2.13673903e+00 8.83684950e-16]
u =  [[-0.21483724  0.88723069  0.40824829]
 [-0.52058739  0.24964395 -0.81649658]
 [-0.82633754 -0.38794278  0.40824829]]
v =  [[-0.47967118 -0.57236779 -0.66506441]
 [-0.77669099 -0.07568647  0.62531805]
 [-0.40824829  0.81649658 -0.40824829]]


Example 2: Calculating the Singular Value Decomposition of a Random Matrix

In this example, we will be using the numpy.random.randint() function to create a random matrix. Let’s get into it!

#importing the numpy module
import numpy as np
#using the numpy.random.randint() function to create a random matrix
A=np.random.randint(5, 200, size=(3,3))
#display the created matrix
print("The input matrix is=",A)
#calculating all three matrices for the output
#using the numpy linalg.svd function
u,s,v=np.linalg.svd(A, compute_uv=True)
#displaying the result
print("the output is=")
print('s(the singular value) = ',s)
print('u = ',u)
print('v = ',v)

The output will be as follows:

The input matrix is= [[ 36  74 101]
 [104 129 185]
 [139 121 112]]
the output is=
s(the singular value) =  [348.32979681  61.03199722  10.12165841]
u =  [[-0.3635535  -0.48363012 -0.79619769]
 [-0.70916514 -0.41054007  0.57318554]
 [-0.60408084  0.77301925 -0.19372034]]
v =  [[-0.49036384 -0.54970618 -0.67628871]
 [ 0.77570499  0.0784348  -0.62620264]
 [ 0.39727203 -0.83166766  0.38794824]]


Suggested: Numpy linalg.eigvalsh: A Guide to Eigenvalue Computation.

Wrapping Up

In this article, we explored the concept of singular value decomposition in mathematics and how to calculate it using Python’s numpy module. We used the linalg.svd() function to compute the singular value decomposition of both given and random matrices. Numpy provides an efficient and easy-to-use method for performing linear algebra operations, making it highly valuable in machine learning, neural networks, and regression analysis. Keep exploring other linear algebraic functions in numpy to enhance your mathematical toolset in Python.

Article source at: https://www.askpython.com
