Python Program: Check If a Function Returns True (with an Example)

Learn how to check whether a function returns True in Python, with an example.

Using an if statement, you can check whether a function returned True in Python. This check is sometimes needed before proceeding to the next step, for example before calling a separate function.
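
A minimal sketch of this pattern (the function names is_even and handle_even are purely illustrative):

def is_even(number):
    """Return True if the number is even, otherwise False."""
    return number % 2 == 0

def handle_even(number):
    """A separate function to call only when the check succeeds."""
    print(f"{number} is even, moving on to the next step")

value = 10

# Check whether the function returned True before proceeding
if is_even(value):
    handle_even(value)
else:
    print(f"{value} is odd, skipping the next step")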

#Python 

Edward Jackson

PySpark Cheat Sheet: Spark in Python

This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API that exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform and inspect your data. What's more, if you've never worked with any other programming language, or if you're new to the field, it might be hard to distinguish between the various RDD operations.

Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
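
As a quick illustration of those differences, here is a small sketch (it assumes a SparkContext named sc, initialized as shown further down in this cheat sheet):

>>> pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
>>> pairs.map(lambda kv: [kv[0], kv[1]]).collect() #One output element per input element
[['a', 1], ['b', 2], ['a', 3]]
>>> pairs.flatMap(lambda kv: [kv[0], kv[1]]).collect() #Each returned list is flattened into single elements
['a', 1, 'b', 2, 'a', 3]
>>> sc.parallelize([1, 2, 3, 4]).reduce(lambda a, b: a + b) #Merge all elements into a single value
10
>>> pairs.reduceByKey(lambda a, b: a + b).collect() #Merge values per key, returning an RDD of pairs
[('a', 4), ('b', 2)]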

PySpark cheat sheet

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark 

SparkContext 

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext 

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration 

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
     .setMaster("local")
     .setAppName("My app")
     .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using the Shell 

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data 

Parallelized Collections 

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd = sc.parallelize([("a",["x","y","z"]),
               ("b" ["p","r,"])])

External Data 

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles(). 

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information 

Basic Information 

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary 

>>> rdd3.max() #Maximum value of RDD elements 
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements 
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements 
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements 
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)

Applying Functions 

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements 
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements 
[('a', 7),  ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements 
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
     [3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a',  'a',  'b']

Iterating 

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data 

Reducing

>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a+ b) #Merge the rdd values
('a', 7, 'a', 2, 'b', 2)

 

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp) 
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)),('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a',9),('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()

Mathematical Operations 

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d',1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2

Sort 

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
[('a',2),('b',1),('d',1)]

Repartitioning 

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1

Saving 

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
               'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext 

>>> sc.stop()

Execution 

$ ./bin/spark-submit examples/src/main/python/pi.py

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #python

Lawrence Lesch

TS-mockito: Mocking Library for TypeScript

TS-mockito

Mocking library for TypeScript inspired by http://mockito.org/

1.x to 2.x migration guide

Main features

  • Strongly typed
  • IDE autocomplete
  • Mock creation (mock) (also abstract classes) #example
  • Spying on real objects (spy) #example
  • Changing mock behavior (when) via:
    • thenReturn, thenThrow, thenCall, thenResolve, thenReject (shown in the examples below)
  • Checking if methods were called with given arguments (verify)
    • anything, notNull, anyString, anyOfClass etc. - for more flexible comparison
    • once, twice, times, atLeast etc. - allows call count verification #example
    • calledBefore, calledAfter - allows call order verification #example
  • Resetting mock (reset, resetCalls) #example, #example
  • Capturing arguments passed to method (capture) #example
  • Recording multiple behaviors #example
  • Readable error messages (ex. 'Expected "convertNumberToString(strictEqual(3))" to be called 2 time(s). But has been called 1 time(s).')

Installation

npm install ts-mockito --save-dev

Usage

Basics

// Creating mock
let mockedFoo:Foo = mock(Foo);

// Getting instance from mock
let foo:Foo = instance(mockedFoo);

// Using instance in source code
foo.getBar(3);
foo.getBar(5);

// Explicit, readable verification
verify(mockedFoo.getBar(3)).called();
verify(mockedFoo.getBar(anything())).called();

Stubbing method calls

// Creating mock
let mockedFoo:Foo = mock(Foo);

// stub method before execution
when(mockedFoo.getBar(3)).thenReturn('three');

// Getting instance
let foo:Foo = instance(mockedFoo);

// prints three
console.log(foo.getBar(3));

// prints null, because "getBar(999)" was not stubbed
console.log(foo.getBar(999));

Stubbing getter value

// Creating mock
let mockedFoo:Foo = mock(Foo);

// stub getter before execution
when(mockedFoo.sampleGetter).thenReturn('three');

// Getting instance
let foo:Foo = instance(mockedFoo);

// prints three
console.log(foo.sampleGetter);

Stubbing property values that have no getters

Syntax is the same as with getter values.

Please note that stubbing properties that don't have getters only works if a Proxy object is available (ES6).

Call count verification

// Creating mock
let mockedFoo:Foo = mock(Foo);

// Getting instance
let foo:Foo = instance(mockedFoo);

// Some calls
foo.getBar(1);
foo.getBar(2);
foo.getBar(2);
foo.getBar(3);

// Call count verification
verify(mockedFoo.getBar(1)).once();               // was called with arg === 1 only once
verify(mockedFoo.getBar(2)).twice();              // was called with arg === 2 exactly two times
verify(mockedFoo.getBar(between(2, 3))).thrice(); // was called with arg between 2-3 exactly three times
verify(mockedFoo.getBar(anyNumber())).times(4);   // was called with any number arg exactly four times
verify(mockedFoo.getBar(2)).atLeast(2);           // was called with arg === 2 min two times
verify(mockedFoo.getBar(anything())).atMost(4);   // was called with any argument max four times
verify(mockedFoo.getBar(4)).never();              // was never called with arg === 4

Call order verification

// Creating mock
let mockedFoo:Foo = mock(Foo);
let mockedBar:Bar = mock(Bar);

// Getting instance
let foo:Foo = instance(mockedFoo);
let bar:Bar = instance(mockedBar);

// Some calls
foo.getBar(1);
bar.getFoo(2);

// Call order verification
verify(mockedFoo.getBar(1)).calledBefore(mockedBar.getFoo(2));    // foo.getBar(1) has been called before bar.getFoo(2)
verify(mockedBar.getFoo(2)).calledAfter(mockedFoo.getBar(1));    // bar.getFoo(2) has been called after foo.getBar(1)
verify(mockedFoo.getBar(1)).calledBefore(mockedBar.getFoo(999999));    // throws error (mockedBar.getFoo(999999) has never been called)

Throwing errors

let mockedFoo:Foo = mock(Foo);

when(mockedFoo.getBar(10)).thenThrow(new Error('fatal error'));

let foo:Foo = instance(mockedFoo);
try {
    foo.getBar(10);
} catch (error) {
    console.log(error.message); // 'fatal error'
}

Custom function

You can also stub a method with your own implementation:

let mockedFoo:Foo = mock(Foo);
let foo:Foo = instance(mockedFoo);

when(mockedFoo.sumTwoNumbers(anyNumber(), anyNumber())).thenCall((arg1:number, arg2:number) => {
    return arg1 * arg2; 
});

// prints '50' because we've changed sum method implementation to multiply!
console.log(foo.sumTwoNumbers(5, 10));

Resolving / rejecting promises

You can also stub a method to resolve or reject a promise:

let mockedFoo:Foo = mock(Foo);

when(mockedFoo.fetchData("a")).thenResolve({id: "a", value: "Hello world"});
when(mockedFoo.fetchData("b")).thenReject(new Error("b does not exist"));

Resetting mock calls

You can reset just the mock call counter:

// Creating mock
let mockedFoo:Foo = mock(Foo);

// Getting instance
let foo:Foo = instance(mockedFoo);

// Some calls
foo.getBar(1);
foo.getBar(1);
verify(mockedFoo.getBar(1)).twice();      // getBar with arg "1" has been called twice

// Reset mock
resetCalls(mockedFoo);

// Call count verification
verify(mockedFoo.getBar(1)).never();      // has never been called after reset

You can also reset calls of multiple mocks at once resetCalls(firstMock, secondMock, thirdMock)

Resetting mock

Or reset the mock call counter together with all stubs:

// Creating mock
let mockedFoo:Foo = mock(Foo);
when(mockedFoo.getBar(1)).thenReturn("one");

// Getting instance
let foo:Foo = instance(mockedFoo);

// Some calls
console.log(foo.getBar(1));               // "one" - as defined in stub
console.log(foo.getBar(1));               // "one" - as defined in stub
verify(mockedFoo.getBar(1)).twice();      // getBar with arg "1" has been called twice

// Reset mock
reset(mockedFoo);

// Call count verification
verify(mockedFoo.getBar(1)).never();      // has never been called after reset
console.log(foo.getBar(1));               // null - previously added stub has been removed

You can also reset multiple mocks at once reset(firstMock, secondMock, thirdMock)

Capturing method arguments

let mockedFoo:Foo = mock(Foo);
let foo:Foo = instance(mockedFoo);

// Call method
foo.sumTwoNumbers(1, 2);

// Check first arg captor values
const [firstArg, secondArg] = capture(mockedFoo.sumTwoNumbers).last();
console.log(firstArg);    // prints 1
console.log(secondArg);    // prints 2

You can also get other calls using first(), second(), byCallIndex(3) and more...

Recording multiple behaviors

You can set multiple return values for the same matching values:

const mockedFoo:Foo = mock(Foo);

when(mockedFoo.getBar(anyNumber())).thenReturn('one').thenReturn('two').thenReturn('three');

const foo:Foo = instance(mockedFoo);

console.log(foo.getBar(1));    // one
console.log(foo.getBar(1));    // two
console.log(foo.getBar(1));    // three
console.log(foo.getBar(1));    // three - last defined behavior will be repeated infinitely

Another example with specific values

let mockedFoo:Foo = mock(Foo);

when(mockedFoo.getBar(1)).thenReturn('one').thenReturn('another one');
when(mockedFoo.getBar(2)).thenReturn('two');

let foo:Foo = instance(mockedFoo);

console.log(foo.getBar(1));    // one
console.log(foo.getBar(2));    // two
console.log(foo.getBar(1));    // another one
console.log(foo.getBar(1));    // another one - this is last defined behavior for arg '1' so it will be repeated
console.log(foo.getBar(2));    // two
console.log(foo.getBar(2));    // two - this is last defined behavior for arg '2' so it will be repeated

Short notation:

const mockedFoo:Foo = mock(Foo);

// You can specify return values as multiple thenReturn args
when(mockedFoo.getBar(anyNumber())).thenReturn('one', 'two', 'three');

const foo:Foo = instance(mockedFoo);

console.log(foo.getBar(1));    // one
console.log(foo.getBar(1));    // two
console.log(foo.getBar(1));    // three
console.log(foo.getBar(1));    // three - last defined behavior will be repeated infinitely

Possible errors:

const mockedFoo:Foo = mock(Foo);

// When multiple matchers match the same call:
when(mockedFoo.getBar(anyNumber())).thenReturn('one');
when(mockedFoo.getBar(3)).thenReturn('one');

const foo:Foo = instance(mockedFoo);
foo.getBar(3); // MultipleMatchersMatchSameStubError will be thrown, two matchers match same method call

Mocking interfaces

You can mock interfaces too; instead of passing the type to the mock function, set the mock function's generic type. Mocking interfaces requires a Proxy implementation.

let mockedFoo: FooInterface = mock<FooInterface>(); // instead of mock(FooInterface)
const foo: FooInterface = instance(mockedFoo);

Mocking types

You can mock abstract classes

const mockedFoo: SampleAbstractClass = mock(SampleAbstractClass);
const foo: SampleAbstractClass = instance(mockedFoo);

You can also mock generic classes, but note that the generic type is only needed by the mock's type definition:

const mockedFoo: SampleGeneric<SampleInterface> = mock(SampleGeneric);
const foo: SampleGeneric<SampleInterface> = instance(mockedFoo);

Spying on real objects

You can partially mock an existing instance:

const foo: Foo = new Foo();
const spiedFoo = spy(foo);

when(spiedFoo.getBar(3)).thenReturn('one');

console.log(foo.getBar(3)); // 'one'
console.log(foo.getBaz()); // call to a real method

You can spy on plain objects too:

const foo = { bar: () => 42 };
const spiedFoo = spy(foo);

foo.bar();

console.log(capture(spiedFoo.bar).last()); // [42] 

Thanks


Download Details:

Author: NagRock
Source Code: https://github.com/NagRock/ts-mockito 
License: MIT license

#typescript #testing #mock 

Callum Slater

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You'll probably already know about Apache Spark, the fast, general and open-source engine for big data processing; it has built-in modules for streaming, SQL, machine learning and graph processing. Spark allows you to speed up analytic applications by up to 100 times compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You'll also see that this cheat sheet covers how to run SQL queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
     .builder\
     .appName("Python Spark SQL basic example") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()

Creating DataFrames
 

From RDDs

>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>>  schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>>  df = spark.read.json("customer.json")
>>> df.show()

>>>  df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import explode

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane","Boris")].collect() #Show firstName if in the given options

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th")).show() #Show last names ending in th

Substring 

>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
                          .alias("name")) \
        .collect()

Between 

>>> df.select(df.age.between(22, 24)).show() #Show age: values are TRUE if between 22 and 24

Add, Update & Remove Columns 

Adding Columns

 >>> df = df.withColumn('city',df.address.city) \
            .withColumn('postalCode',df.address.postalCode) \
            .withColumn('state',df.address.state) \
            .withColumn('streetAddress',df.address.streetAddress) \
            .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
            .withColumn('telePhoneType', explode(df.phoneNumber.type)) 

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

  >>> df = df.drop("address", "phoneNumber")
 >>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
>>> df.na.replace(10, 20).show() #Return new df replacing one value with another

GroupBy 

>>> df.groupBy("age")\ #Group by age, count the members in the groups
      .count() \
      .show()

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
     .collect()

Repartitioning 

>>> df.repartition(10).rdd.getNumPartitions() #df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
               .show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

 >>> rdd1 = df.rdd #Convert df into an RDD
 >>> df.toJSON().first() #Convert df into a RDD of string
 >>> df.toPandas() #Return the contents of df as Pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city")\
       .write \
       .save("nameAndCity.parquet")
 >>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json",format="json")

Stopping SparkSession 

>>> spark.stop()

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata

Royce Reinger

Calculates Edit Distance using Damerau-Levenshtein Algorithm

damerau-levenshtein

The damerau-levenshtein gem allows you to find the edit distance between two UTF-8 or ASCII encoded strings with O(N*M) efficiency.

This gem implements the pure Levenshtein algorithm and the Damerau modification of it (where a transposition of 2 adjacent characters counts as 1 edit distance). It also includes the Boehmer & Rees 2008 modification of the Damerau algorithm, where transposition of blocks bigger than 1 character is taken into account as well (Rees 2014).

require "damerau-levenshtein"
DamerauLevenshtein.distance("Something", "Smoething") #returns 1

It also returns a diff between two strings according to the Levenshtein algorithm. The diff is expressed by the tags <ins>, <del>, and <subst>. Such tags make it possible to highlight the difference between strings in a flexible way.

require "damerau-levenshtein"
differ = DamerauLevenshtein::Differ.new
differ.run("corn", "cron")
# output: ["c<subst>or</subst>n", "c<subst>ro</subst>n"]

Dependencies

sudo apt-get install build-essential libgmp3-dev

Installation

gem install damerau-levenshtein

Examples

require "damerau-levenshtein"
dl = DamerauLevenshtein
  • compare using Damerau Levenshtein algorithm
dl.distance("Something", "Smoething") #returns 1
  • compare using Levenshtein algorithm
dl.distance("Something", "Smoething", 0) #returns 2
  • compare using Boehmer & Rees modification
dl.distance("Something", "meSothing", 2) #returns 2 instead of 4
  • comparison of words with UTF-8 characters should work fine:
dl.distance("Sjöstedt", "Sjostedt") #returns 1
  • compare two arrays
dl.array_distance([1,2,3,5], [1,2,3,4]) #returns 1
  • return diff between two strings
differ = DamerauLevenshtein::Differ.new
differ.run("Something", "smthg")
  • return diff between two strings in raw format
differ = DamerauLevenshtein::Differ.new
differ.format = :raw
differ.run("Something", "smthg")

API Description

Methods

DamerauLevenshtein.version

DamerauLevenshtein.version
#returns version number of the gem

DamerauLevenshtein.distance

DamerauLevenshtein.distance(string1, string2, block_size, max_distance)
#returns edit distance between 2 strings

DamerauLevenshtein.string_distance(string1, string2, block_size, max_distance)
# an alias for .distance

DamerauLevenshtein.array_distance(array1, array2, block_size, max_distance)
# returns edit distance between 2 arrays of integers

DamerauLevenshtein.distance and .array_distance take 4 arguments:

  • string1 (array1 for .array_distance)
  • string2 (array2 for .array_distance)
  • block_size (default is 1)
  • max_distance (default is 10)

block_size determines maximum number of characters in a transposition block:

block_size = 0
(transposition does not count -- it is a pure Levenshtein algorithm)

block_size = 1
(transposition between 2 adjacent characters --
it is pure Damerau-Levenshtein algorithm)

block_size = 2
(transposition between blocks as big as 2 characters -- so abcd and cdab
counts as edit distance 2, not 4)

block_size = 3
(transposition between blocks as big as 3 characters --
so abcdef and defabc counts as edit distance 3, not 6)

etc.

max_distance is a threshold after which the algorithm gives up and returns max_distance instead of the real edit distance.

The Levenshtein algorithm is expensive, so it makes sense to give up when the edit distance is becoming too big. The max_distance argument does just that.


DamerauLevenshtein.distance("abcdefg", "1234567", 0, 3)
# output: 4 -- it gave up when edit distance exceeded 3

DamerauLevenshtein::Differ

differ = DamerauLevenshtein::Differ.new creates an instance of a new differ class to return the difference between two strings.

differ.format shows the current format for diffs. The default is the :tag format.

differ.format = :raw changes the current format for diffs. Possible values are :tag and :raw.

differ.run("String1", "String2") returns the difference between two strings.

For example:

differ = DamerauLevenshtein::Differ.new
differ.run("Something", "smthng")
# output: ["<ins>S</ins><subst>o</subst>m<ins>e</ins>th<ins>i</ins>ng",
#          "<del>S</del><subst>s</subst>m<del>e</del>th<del>i</del>ng"]

Or with parsing:

require "damerau-levenshtein"
require "nokogiri"

differ = DamerauLevenshtein::Differ.new
res = differ.run("Something", "Smothing!")
nodes = Nokogiri::XML("<root>#{res.first}</root>")

markup = nodes.root.children.map do |n|
  case n.name
  when "text"
    n.text
  when "del"
    "~~#{n.children.first.text}~~"
  when "ins"
    "*#{n.children.first.text}*"
  when "subst"
    "**#{n.children.first.text}**"
  end
end.join("")

puts markup

Output

S*o*m**e**thing~~!~~

Contributing to damerau-levenshtein

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
  • Fork the project
  • Start a feature/bugfix branch
  • Commit and push until you are happy with your contribution
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or if it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.

Versioning

This gem follows the practices of Semantic Versioning

Download Details: 

Author: GlobalNamesArchitecture
Source Code: https://github.com/GlobalNamesArchitecture/damerau-levenshtein 
License: MIT license

#ruby #algorithm 

August Larson

Automating WhatsApp Web with Alright and Python

Alright is a Python wrapper that helps you automate WhatsApp Web using Python, giving you the capability to send messages, images, videos, and files to both saved and unsaved contacts without having to rescan the QR code every time.
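
As a rough sketch of what that looks like in practice (based on the project's documented WhatsApp class with its find_user() and send_message() methods; the phone number below is a placeholder):

from alright import WhatsApp

# Opens WhatsApp Web in a browser; you scan the QR code once per session
messenger = WhatsApp()

# Select a contact by phone number, country code included (placeholder shown here)
messenger.find_user("2557xxxxxxx")

# Send a text message to the selected contact
messenger.send_message("Hello from alright!")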

Why Alright?

I was looking for a way to control and automate WhatsApp Web with Python; I came across some very nice libraries and wrapper implementations, including:

  1. pywhatkit
  2. pywhatsapp
  3. PyWhatsapp
  4. WebWhatsapp-Wrapper

So I tried

pywhatkit, which is well crafted and easy to use, but its implementation requires you to open a new browser tab and scan the QR code every time you send a message, no matter if it's the same person, which was a deal-breaker for using it.

I then tried

pywhatsapp, which is based on yowsup and requires you to do some registration with yowsup before using it; after a bit of googling, I got scared of having my number blocked. So I went for the next option.

I then went for WebWhatsapp-Wrapper. It has some good documentation and recent commits, so I had hoped it was going to work. But it didn't for me, and after having a couple of errors, I abandoned it to look for the next alternative.

PyWhatsapp by shauryauppal, which was more of a CLI tool than a wrapper, surprisingly worked. Its approach allows you to dynamically send WhatsApp messages to unsaved contacts without rescanning the QR code every time.

So what I did was refactor the implementation of that tool into more of a wrapper, to easily allow people to run different scripts on top of it. Instead of just using it as a tool, I then thought of sharing the codebase with people who might struggle to do this as I did.

#python #python-programming #python-tutorials #python-programming-lists #selenium #python-dev-tips #python-developers #programming #web-monetization