Eric Bukenya

How to use target typing and covariant returns in C# 9

C# 9 introduced a number of features that allow us to write more efficient and flexible code. In previous articles we examined record types, static anonymous functions, relational and logical patterns, and top-level programs. In this article we’ll look at two more useful features in C# 9: new target typing capabilities and covariant returns.

Target typing refers to using an expression that gets its type from the context in which it is used, rather than specifying the type explicitly. With C# 9, target typing can now be used in new expressions and with conditional operators. Support for covariant return types in C# 9 allows the override of a method to declare a “more derived” (i.e., more specific) return type than the method it overrides.

Create a console application project in Visual Studio

First off, let’s create a .NET Core console application project in Visual Studio. Assuming Visual Studio 2019 is installed on your system, follow the steps outlined below to create a new .NET Core console application project in Visual Studio.

  1. Launch the Visual Studio IDE.
  2. Click on “Create new project.”
  3. In the “Create new project” window, select “Console App (.NET Core)” from the list of templates displayed.
  4. Click Next.
  5. In the “Configure your new project” window, specify the name and location for the new project.
  6. Click Create.

We’ll use this .NET Core console application project to work with target typing and covariant returns in the subsequent sections of this article.
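
As a quick preview, here is a minimal, self-contained sketch of both features. The Person, Employee, and PersonFactory types below are illustrative examples invented for this sketch, not part of any library:

using System;

public class Person
{
    public string Name { get; set; } = "";
}

public class Employee : Person
{
    public string Department { get; set; } = "";
}

public abstract class PersonFactory
{
    //The base class declares the less derived return type.
    public abstract Person Create();
}

public class EmployeeFactory : PersonFactory
{
    //Covariant return type: the override declares the more derived
    //type Employee instead of Person (allowed as of C# 9).
    public override Employee Create() =>
        new() { Name = "Jane", Department = "Engineering" };
}

public static class Program
{
    public static void Main()
    {
        //Target-typed new: the constructed type is inferred from the declaration.
        Person person = new() { Name = "John" };

        //Target-typed conditional: neither branch converts to the other,
        //but both branches convert to the target type int?.
        bool ageUnknown = false;
        int? age = ageUnknown ? null : 42;

        //No cast needed, because Create() is declared to return Employee.
        Employee employee = new EmployeeFactory().Create();

        Console.WriteLine($"{person.Name}, {employee.Name}, age {age}");
    }
}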

#csharp

Edward Jackson

PySpark Cheat Sheet: Spark in Python

This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.

Apache Spark is generally known as a fast, general, open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It can speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through PySpark, the Spark Python API that exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough to get familiar with all the functions that you can use to query, transform, and inspect your data. What's more, if you've never worked with any other programming language, or if you're new to the field, it might be hard to distinguish between the various RDD operations.

Let's face it: map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. And what about other functions, like reduce() and reduceByKey()?
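
To make that distinction concrete, here is a small illustration (assuming a local SparkContext named sc, as created in the cheat sheet below; the sample data is made up):

>>> lines = sc.parallelize(["hello world", "hi"])
>>> lines.map(lambda l: l.split(" ")).collect() #one output element per input element
[['hello', 'world'], ['hi']]
>>> lines.flatMap(lambda l: l.split(" ")).collect() #results flattened into a single list
['hello', 'world', 'hi']
>>> pairs = sc.parallelize([("a",1),("a",2),("b",3)])
>>> pairs.reduceByKey(lambda x,y: x+y).collect() #merge the values for each key
[('a', 3), ('b', 3)]
>>> pairs.map(lambda x: x[1]).reduce(lambda x,y: x+y) #merge all values into one result
6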

PySpark cheat sheet

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark 

SparkContext 

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext 

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration 

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
     .setMaster("local")
     .setAppName("My app")
     .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using the Shell 

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg, or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data 

Parallelized Collections 

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd = sc.parallelize([("a",["x","y","z"]),
               ("b" ["p","r,"])])

External Data 

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles(). 

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
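
Note that the element types differ: textFile() yields an RDD of individual lines, while wholeTextFiles() yields an RDD of (filename, content) pairs. A quick sketch, with hypothetical file contents:

>>> textFile.first() #one string per line
'first line of one of the files'
>>> textFile2.first() #one (filename, content) pair per file
('file:/my/directory/a.txt', 'entire contents of a.txt\n')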

Retrieving RDD Information 

Basic Information 

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary 

>>> rdd3.max() #Maximum value of RDD elements 
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements 
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements 
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements 
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)

Applying Functions 

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements 
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements 
[('a', 7),  ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements 
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
     [3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a',  'a',  'b']

Iterating 

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data 

Reducing

>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

 

Grouping by

#Return RDD of grouped values
>>> rdd3.groupBy(lambda x: x % 2) \
          .mapValues(list) \
          .collect()
#Group rdd by key
>>> rdd.groupByKey() \
          .mapValues(list) \
          .collect()
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp) 
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
     [('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
     4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a' ,9), ('b' ,2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()

Mathematical Operations 

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b' ,2), ('a' ,7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2

Sort 

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a' ,2), ('b' ,1), ('d' ,1)]

Repartitioning 

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1

Saving 

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
               'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext 

>>> sc.stop()

Execution 

$ ./bin/spark-submit examples/src/main/python/pi.py

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #python

Chloe Butler

Pdf2gerb: Perl Script Converts PDF Files to Gerber format

pdf2gerb

Perl script converts PDF files to Gerber format

Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

  1. Design the PCB using your favorite CAD or drawing software.
  2. Print the top and bottom copper and top silk screen layers to a PDF file.
  3. Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
  4. Use a Gerber viewer to double-check the output against the original PCB design.
  5. Make adjustments as needed.
  6. Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.


pdf2gerb_cfg.pm

#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;

use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)


##############################################################################################
#configurable settings:
#change values here instead of in main pdf2gerb.pl file

use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set before the first DebugPrint() call

#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software.  \nGerber files MAY CONTAIN ERRORS.  Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG

use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC

use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)

#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1); 

#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
    .010, -.001,  #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
    .031, -.014,  #used for vias
    .041, -.020,  #smallest non-filled plated hole
    .051, -.025,
    .056, -.029,  #useful for IC pins
    .070, -.033,
    .075, -.040,  #heavier leads
#    .090, -.043,  #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
    .100, -.046,
    .115, -.052,
    .130, -.061,
    .140, -.067,
    .150, -.079,
    .175, -.088,
    .190, -.093,
    .200, -.100,
    .220, -.110,
    .160, -.125,  #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
    .090, -.040,  #want a .090 pad option, but use dummy hole size
    .065, -.040, #.065 x .065 rect pad
    .035, -.040, #.035 x .065 rect pad
#traces:
    .001,  #too thin for real traces; use only for board outlines
    .006,  #minimum real trace width; mainly used for text
    .008,  #mainly used for mid-sized text, not traces
    .010,  #minimum recommended trace width for low-current signals
    .012,
    .015,  #moderate low-voltage current
    .020,  #heavier trace for power, ground (even if a lighter one is adequate)
    .025,
    .030,  #heavy-current traces; be careful with these ones!
    .040,
    .050,
    .060,
    .080,
    .100,
    .120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);

#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size:   parsed PDF diameter:      error:
#  .014                .016                +.002
#  .020                .02267              +.00267
#  .025                .026                +.001
#  .029                .03167              +.00267
#  .033                .036                +.003
#  .040                .04267              +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
    HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
    RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
    SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
    RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
    TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
    REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};

#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
    CIRCLE_ADJUST_MINX => 0,
    CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
    CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
    CIRCLE_ADJUST_MAXY => 0,
    SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
    WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
    RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};

#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches

#line join/cap styles:
use constant
{
    CAP_NONE => 0, #butt (none); line is exact length
    CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
    CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
    CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
    
#number of elements in each shape type:
use constant
{
    RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
    LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
    CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
    CIRCLE_SHAPELEN => 5, #x, y, r, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
    rect => RECT_SHAPELEN,
    line => LINE_SHAPELEN,
    curve => CURVE_SHAPELEN,
    circle => CIRCLE_SHAPELEN,
);

#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions

# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?

#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes. 
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes

#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (1/72 * .12 = 1/600 inch per pixel, i.e. 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches

# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)

# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time

# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const

use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool

my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time

print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load


#############################################################################################
#junk/experiment:

#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html

#my $caller = "pdf2gerb::";

#sub cfg
#{
#    my $proto = shift;
#    my $class = ref($proto) || $proto;
#    my $settings =
#    {
#        $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
#    };
#    bless($settings, $class);
#    return $settings;
#}

#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;

#print STDERR "read cfg file\n";

#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names

#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }

Download Details:

Author: swannman
Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license

#perl 

Callum Slater

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You'll probably already know about Apache Spark, the fast, general, open-source engine for big data processing; it has built-in modules for streaming, SQL, machine learning, and graph processing. Spark can speed up analytic applications by up to 100 times compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, and grouping, filtering or sorting data. You'll also see how to run SQL queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
     .builder\
     .appName("Python Spark SQL basic example") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()

Creating DataFrames
 

From RDDs

>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>> df = spark.read.json("customer.json")
>>> df.show()

>>> df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import explode

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane","Boris")].collect() #Show firstName if in the given options

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
#Show last names ending in th
>>> df.select(df.lastName.endswith("th")) \
      .show()

Substring 

#Return substrings of firstName
>>> df.select(df.firstName.substr(1, 3) \
              .alias("name")) \
      .collect()

Between 

#Show age: values are TRUE if between 22 and 24
>>> df.select(df.age.between(22, 24)) \
      .show()

Add, Update & Remove Columns 

Adding Columns

 >>> df = df.withColumn('city',df.address.city) \
            .withColumn('postalCode',df.address.postalCode) \
            .withColumn('state',df.address.state) \
            .withColumn('streetAddress',df.address.streetAddress) \
            .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
            .withColumn('telePhoneType', explode(df.phoneNumber.type)) 

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
#Return new df replacing one value with another
>>> df.na.replace(10, 20).show()

GroupBy 

#Group by age, count the members in the groups
>>> df.groupBy("age") \
      .count() \
      .show()

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
     .collect()

Repartitioning 

#df with 10 partitions
>>> df.repartition(10) \
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
               .show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into an RDD of strings
>>> df.toPandas() #Return the contents of df as a Pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city")\
       .write \
       .save("nameAndCity.parquet")
 >>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json",format="json")

Stopping SparkSession 

>>> spark.stop()

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata

Royce Reinger

Calculates Edit Distance using Damerau-Levenshtein Algorithm

damerau-levenshtein

The damerau-levenshtein gem allows you to find the edit distance between two UTF-8 or ASCII encoded strings with O(N*M) efficiency.

This gem implements the pure Levenshtein algorithm and the Damerau modification of it (where a transposition of 2 characters counts as 1 edit distance). It also includes the Boehmer & Rees 2008 modification of the Damerau algorithm, where transposition of blocks bigger than 1 character is taken into account as well (Rees 2014).

require "damerau-levenshtein"
DamerauLevenshtein.distance("Something", "Smoething") #returns 1

It also returns a diff between two strings according to the Levenshtein algorithm. The diff is expressed by the tags <ins>, <del>, and <subst>. Such tags make it possible to highlight the difference between strings in a flexible way.

require "damerau-levenshtein"
differ = DamerauLevenshtein::Differ.new
differ.run("corn", "cron")
# output: ["c<subst>or</subst>n", "c<subst>ro</subst>n"]

Dependencies

sudo apt-get install build-essential libgmp3-dev

Installation

gem install damerau-levenshtein

Examples

require "damerau-levenshtein"
dl = DamerauLevenshtein
  • compare using Damerau Levenshtein algorithm
dl.distance("Something", "Smoething") #returns 1
  • compare using the Levenshtein algorithm
dl.distance("Something", "Smoething", 0) #returns 2
  • compare using Boehmer & Rees modification
dl.distance("Something", "meSothing", 2) #returns 2 instead of 4
  • comparison of words with UTF-8 characters should work fine:
dl.distance("Sjöstedt", "Sjostedt") #returns 1
  • compare two arrays
dl.array_distance([1,2,3,5], [1,2,3,4]) #returns 1
  • return diff between two strings
differ = DamerauLevenshtein::Differ.new
differ.run("Something", "smthg")
  • return diff between two strings in raw format
differ = DamerauLevenshtein::Differ.new
differ.format = :raw
differ.run("Something", "smthg")

API Description

Methods

DamerauLevenshtein.version

DamerauLevenshtein.version
#returns version number of the gem

DamerauLevenshtein.distance

DamerauLevenshtein.distance(string1, string2, block_size, max_distance)
#returns edit distance between 2 strings

DamerauLevenshtein.string_distance(string1, string2, block_size, max_distance)
# an alias for .distance

DamerauLevenshtein.array_distance(array1, array2, block_size, max_distance)
# returns edit distance between 2 arrays of integers

DamerauLevenshtein.distance and .array_distance take 4 arguments:

  • string1 (array1 for .array_distance)
  • string2 (array2 for .array_distance)
  • block_size (default is 1)
  • max_distance (default is 10)

block_size determines maximum number of characters in a transposition block:

block_size = 0
(transposition does not count -- it is a pure Levenshtein algorithm)

block_size = 1
(transposition between 2 adjacent characters --
it is pure Damerau-Levenshtein algorithm)

block_size = 2
(transposition between blocks as big as 2 characters -- so abcd and cdab
count as edit distance 2, not 4)

block_size = 3
(transposition between blocks as big as 3 characters --
so abcdef and defabc count as edit distance 3, not 6)

etc.

max_distance is a threshold after which the algorithm gives up and returns max_distance instead of the real edit distance.

The Levenshtein algorithm is expensive, so it makes sense to give up when the edit distance is becoming too big. The max_distance argument does just that.


DamerauLevenshtein.distance("abcdefg", "1234567", 0, 3)
# output: 4 -- it gave up when edit distance exceeded 3

DamerauLevenshtein::Differ

differ = DamerauLevenshtein::Differ.new creates an instance of new differ class to return difference between two strings

differ.format shows current format for diff. Default is :tag format

differ.format = :raw changes current format for diffs. Possible values are :tag and :raw

differ.run("String1", "String2") returns difference between two strings.

For example:

differ = DamerauLevenshtein::Differ.new
differ.run("Something", "smthng")
# output: ["<ins>S</ins><subst>o</subst>m<ins>e</ins>th<ins>i</ins>ng",
#          "<del>S</del><subst>s</subst>m<del>e</del>th<del>i</del>ng"]

Or with parsing:

require "damerau-levenshtein"
require "nokogiri"

differ = DamerauLevenshtein::Differ.new
res = differ.run("Something", "Smothing!")
nodes = Nokogiri::XML("<root>#{res.first}</root>")

markup = nodes.root.children.map do |n|
  case n.name
  when "text"
    n.text
  when "del"
    "~~#{n.children.first.text}~~"
  when "ins"
    "*#{n.children.first.text}*"
  when "subst"
    "**#{n.children.first.text}**"
  end
end.join("")

puts markup

Output

S*o*m**e**thing~~!~~

Contributing to damerau-levenshtein

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
  • Fork the project
  • Start a feature/bugfix branch
  • Commit and push until you are happy with your contribution
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.

Versioning

This gem follows the practices of Semantic Versioning.

Download Details: 

Author: GlobalNamesArchitecture
Source Code: https://github.com/GlobalNamesArchitecture/damerau-levenshtein 
License: MIT license

#ruby #algorithm 

Josefa Corwin

Mailboxer: A Rails Gem to Send Messages inside A Web Application

Mailboxer

This project is based on the need for a private message system for ging / social_stream. Instead of creating our core message system heavily dependent on our development, we are trying to implement a generic and potent messaging gem.

After looking for a good gem to use we noticed the lack of messaging gems and functionality in them. Mailboxer tries to fill this void delivering a powerful and flexible message system. It supports the use of conversations with two or more participants, sending notifications to recipients (intended to be used as system notifications “Your picture has new comments”, “John Doe has updated his document”, etc.), and emailing the messageable model (if configured to do so). It has a complete implementation of a Mailbox object for each messageable with inbox, sentbox and trash.

The gem is constantly growing and improving its functionality. As it is used with our parallel development ging / social_stream, we are finding and fixing bugs continuously. If you want some functionality not supported yet or marked as TODO, you can create an issue to ask for it. It will be great feedback for us, and we will know what you may find useful in the gem.

Mailboxer was born from the great, but outdated, code from lpsergi / acts_as_messageable.

We are now working on exhaustive documentation and some wiki pages in order to make it even easier to use the gem to its full potential. Please give us some time if you find something missing, or ask for it. You can also find us in the Gitter room for this repo. Join us there to talk.

Installation

Add to your Gemfile:

gem 'mailboxer'

Then run:

$ bundle install

Run install script:

$ rails g mailboxer:install

And don't forget to migrate your database:

$ rake db:migrate

You can also generate email views:

$ rails g mailboxer:views

Upgrading

If upgrading from 0.11.0 to 0.12.0, run the following generators:

$ rails generate mailboxer:namespacing_compatibility
$ rails generate mailboxer:install -s

Then, migrate your database:

$ rake db:migrate

Requirements & Settings

Emails

We are now adding support for sending emails when a Notification or a Message is sent to one or more recipients. You should modify the mailboxer initializer (/config/initializers/mailboxer.rb) to edit these settings:

Mailboxer.setup do |config|
  #Enables or disables email sending for Notifications and Messages
  config.uses_emails = true
  #Configures the default `from` address for the email sent for Messages and Notifications of Mailboxer
  config.default_from = "no-reply@dit.upm.es"
  ...
end

You can change the way in which emails are delivered by specifying a custom implementation of notification and message mailers:

Mailboxer.setup do |config|
  config.notification_mailer = CustomNotificationMailer
  config.message_mailer = CustomMessageMailer
  ...
end

If you have subclassed the Mailboxer::Notification class, you can specify the mailers using a member method:

class NewDocumentNotification < Mailboxer::Notification
  def mailer_class
    NewDocumentNotificationMailer
  end
end

class NewCommentNotification < Mailboxer::Notification
  def mailer_class
    NewDocumentNotificationMailer
  end
end

Otherwise, the mailer class will be determined by appending 'Mailer' to the mailable class name.
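
For instance, with no mailer_class override and no config override, a hypothetical notification class resolves its mailer purely by name:

class NewReplyNotification < Mailboxer::Notification
  #no mailer_class method defined here
end

#Mailboxer appends 'Mailer' to the class name, so it
#will look for a NewReplyNotificationMailer class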

User identities

Users must have an identity defined by a name and an email. We must ensure that Messageable models have some specific methods. These methods are:

#Returning any kind of identification you want for the model
def name
  return "You should add method :name in your Messageable model"
end
#Returning the email address of the model if an email should be sent for this object (Message or Notification).
#If no mail has to be sent, return nil.
def mailboxer_email(object)
  #Check if an email should be sent for that object
  #if true
  return "define_email@on_your.model"
  #if false
  #return nil
end

These names are explicit enough to avoid colliding with other methods, but if you need to change them you can do so in the mailboxer initializer (/config/initializers/mailboxer.rb). Just add or uncomment the following lines:

Mailboxer.setup do |config|
  # ...
  #Configures the methods needed by mailboxer
  config.email_method = :mailboxer_email
  config.name_method = :name
  config.notify_method = :notify
  # ...
end

You may change whatever you want or need. For example:

config.email_method = :notification_email
config.name_method = :display_name
config.notify_method = :notify_mailboxer

Will use the method notification_email(object) instead of mailboxer_email(object), display_name for name and notify_mailboxer for notify.

Using default or custom method names, if your model doesn't implement them, Mailboxer will use dummy methods so as to notify you of missing methods rather than crashing.

Preparing your models

In your model:

class User < ActiveRecord::Base
  acts_as_messageable
end

You are not limited to the User model. You can use Mailboxer in any other model and use it in several different models. If you have ducks and cylons in your application and you want to exchange messages as if they were the same, just add acts_as_messageable to each one and you will be able to send duck-duck, duck-cylon, cylon-duck and cylon-cylon messages. Of course, you can extend it for as many classes as you need.

Example:

class Duck < ActiveRecord::Base
  acts_as_messageable
end
class Cylon < ActiveRecord::Base
  acts_as_messageable
end

Mailboxer API

Warning for version 0.8.0

Version 0.8.0 sees Messageable#read and Messageable#unread renamed to mark_as_(un)read, and Receipt#read and Receipt#unread to is_(un)read. This may break existing applications, but read is a reserved name for Active Record, and the best practice in this case is simply to avoid using it.
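
A quick sketch of the renames, using the alfa and receipt objects from the examples below:

#before 0.8.0:
#  alfa.read(conversation)
#  receipt.unread
#since 0.8.0:
alfa.mark_as_read(conversation)
receipt.is_unread?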

How can I send a message?

#alfa wants to send a message to beta
alfa.send_message(beta, "Body", "subject")

How can I read the messages of a conversation?

As a messageable, what you receive are receipts, which are associated with the message itself. You should retrieve your receipts for the conversation and get the message associated with them.

This is done this way because receipts save the information about the relation between the messageable and the messages: whether it is read, whether it is trashed, etc.

#alfa gets the last conversation (chronologically, the first in the inbox)
conversation = alfa.mailbox.inbox.first

#alfa gets its receipts chronologically ordered.
receipts = conversation.receipts_for alfa

#using the receipts (i.e. in the view)
receipts.each do |receipt|
  ...
  message = receipt.message
  unread = receipt.is_unread? #or message.is_unread?(alfa)
  ...
end

How can I reply to a message?

#alfa wants to reply to all in a conversation
#using a receipt
alfa.reply_to_all(receipt, "Reply body")

#using a conversation
alfa.reply_to_conversation(conversation, "Reply body")
#alfa wants to reply to the sender of a message (and ONLY the sender)
#using a receipt
alfa.reply_to_sender(receipt, "Reply body")

How can I delete a message from trash?

#delete conversations forever for one receipt (still in database)
receipt.mark_as_deleted

#you can mark conversation as deleted for one participant
conversation.mark_as_deleted participant

#Mark the object as deleted for messageable
#Object can be:
  #* A Receipt
  #* A Conversation
  #* A Notification
  #* A Message
  #* An array with any of them
alfa.mark_as_deleted conversation

# get available message for specific user
conversation.messages_for(alfa)

How can I retrieve my conversations?

#alfa wants to retrieve all his conversations
alfa.mailbox.conversations

#alfa wants to retrieve his inbox
alfa.mailbox.inbox

#alfa wants to retrieve his sent conversations
alfa.mailbox.sentbox

#alfa wants to retrieve his trashed conversations
alfa.mailbox.trash

How can I paginate conversations?

You can use Kaminari to paginate the conversations as normal. Please make sure you use the latest version, as mailboxer uses select('DISTINCT conversations.*'), which was not respected before Kaminari 0.12.4 according to its changelog. It works correctly on Kaminari 0.13.0.

#Paginating all conversations using :page parameter and 9 per page
conversations = alfa.mailbox.conversations.page(params[:page]).per(9)

#Paginating received conversations using :page parameter and 9 per page
conversations = alfa.mailbox.inbox.page(params[:page]).per(9)

#Paginating sent conversations using :page parameter and 9 per page
conversations = alfa.mailbox.sentbox.page(params[:page]).per(9)

#Paginating trashed conversations using :page parameter and 9 per page
conversations = alfa.mailbox.trash.page(params[:page]).per(9)

You can take a look at the full documentation for Mailboxer in rubydoc.info.

Do you want to test Mailboxer?

Thanks to Roman Kushnir (@RKushnir) you can test Mailboxer with this sample app.

I need a GUI!

If you need a GUI you should take a look at these links:

Contributors


Author: mailboxer
Source code: https://github.com/mailboxer/mailboxer
License: MIT license

#ruby #ruby-on-rails