Mitchel  Carter

1597755600

Count of permutations such that sum of K numbers from given range is even

Given a range [low, high], both inclusive, and an integer K, the task is to select K numbers from the range (a number can be chosen multiple times) such that the sum of those K numbers is even. Print the number of all such permutations.

Examples:

Input: low = 4, high = 5, k = 3

Output: 4

Explanation:

There are 4 valid permutations: {4, 4, 4}, {4, 5, 5}, {5, 4, 5} and {5, 5, 4}, each of which sums to an even number.

Input: low = 1, high = 10, k = 2

Output: 50

Explanation:

There are 50 valid permutations: {1, 1}, {1, 3}, …, {1, 9}, {2, 2}, {2, 4}, …, {2, 10}, …, {10, 2}, {10, 4}, …, {10, 10}.

Each of these 50 permutations sums to an even number.

Naive Approach: The idea is to find all subsets of size K whose sum is even and to count the permutations of each such subset.

Time Complexity: O(K * 2^K)

Auxiliary Space: O(K)

Efficient Approach: The idea is to use the fact that the sum of two even numbers or of two odd numbers is always even, while the sum of an even and an odd number is always odd. Follow the steps below to solve the problem:

  1. Find the total count of even and odd numbers in the given range [low, high].
  2. Initialize the variables even_sum = 1 and odd_sum = 0 to store the number of ways to get an even sum and an odd sum respectively.
  3. Iterate a loop K times. In each iteration, store the previous even sum as prev_even = even_sum and the previous odd sum as prev_odd = odd_sum, then update even_sum = (prev_even * even_count) + (prev_odd * odd_count) and odd_sum = (prev_even * odd_count) + (prev_odd * even_count).
  4. Print even_sum at the end. The odd_sum count has to be maintained as well, because the previous odd_sum contributes to the next even_sum.

Below is the implementation of the above approach:

  • Java

// Java program for the above approach
import java.util.*;

class GFG {

    // Function to print the number
    // of all permutations such that
    // sum of K numbers in range is even
    public static void countEvenSum(int low, int high, int k)
    {
        // Find total count of even and
        // odd numbers in the given range
        int even_count = high / 2 - (low - 1) / 2;
        int odd_count = (high + 1) / 2 - low / 2;

        long even_sum = 1;
        long odd_sum = 0;

        // Iterate loop k times and update
        // even_sum & odd_sum using
        // previous values
        for (int i = 0; i < k; i++) {

            // Save the previous even_sum
            // and odd_sum
            long prev_even = even_sum;
            long prev_odd = odd_sum;

            // Even sum
            even_sum = (prev_even * even_count)
                       + (prev_odd * odd_count);

            // Odd sum
            odd_sum = (prev_even * odd_count)
                      + (prev_odd * even_count);
        }

        // Print even_sum
        System.out.println(even_sum);
    }

    // Driver Code
    public static void main(String[] args)
    {
        // Given range
        int low = 4;
        int high = 5;

        // Length of permutation
        int K = 3;

        // Function call
        countEvenSum(low, high, K);
    }
}

Output:

4

Time Complexity: O(K)

Auxiliary Space: O(1)
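As a cross-check on the recurrence, the following is a minimal Python sketch (not part of the original Java solution). It ports the same update rule and compares it against a brute-force enumeration for small inputs. The function names count_even_sum and brute_force are illustrative, and the bounds are assumed to be positive as in the examples.

```python
from itertools import product


def count_even_sum(low: int, high: int, k: int) -> int:
    """Count ordered selections of k numbers from [low, high] with an even sum."""
    # Same counting formulas as the Java listing (bounds assumed positive).
    even_count = high // 2 - (low - 1) // 2
    odd_count = (high + 1) // 2 - low // 2

    # Zero numbers chosen: one way to reach an even sum, none to reach an odd sum.
    even_sum, odd_sum = 1, 0
    for _ in range(k):
        # Tuple assignment evaluates the right-hand side with the previous values.
        even_sum, odd_sum = (
            even_sum * even_count + odd_sum * odd_count,
            even_sum * odd_count + odd_sum * even_count,
        )
    return even_sum


def brute_force(low: int, high: int, k: int) -> int:
    """Enumerate every ordered selection; only feasible for tiny inputs."""
    values = range(low, high + 1)
    return sum(1 for combo in product(values, repeat=k) if sum(combo) % 2 == 0)


if __name__ == "__main__":
    print(count_even_sum(4, 5, 3))   # first example from the article
    print(count_even_sum(1, 10, 2))  # second example from the article
```

The brute-force helper grows as (high - low + 1)^K, so it is only useful for validating the O(K) recurrence on small ranges.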

#arrays #dynamic programming #greedy #mathematical #adobe #array-range-queries #permutation #permutation and combination

Reid  Rohan

1684293566

Find the Array Length in Bash

Bash supports both numeric and associative arrays. The total number of elements in either type of array can be calculated in several ways. The length of the array can be counted using the “#” symbol, a loop, or a command like “wc” or “grep”. This tutorial shows the different ways of counting the array length in Bash.

Find the Array Length Using “#”

Using the “#” symbol is the simplest way to calculate the array length. This part of the tutorial shows how to count the total number of elements of numeric and associative arrays.

Example 1: Count the Length of a Numeric Array Using “#”

Create a Bash file with the following script that counts and prints the length of a numeric array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.

#Declare a numeric array
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")

#Count array length using '#'
echo "Array length using '#' with '@':  ${#items[@]}"
echo "Array length using '#' with '*':  ${#items[*]}"

The array contains five string values, so executing the script prints a length of 5 for both the “@” and “*” symbols.

Example 2: Count the Length of an Associative Array Using “#”

Create a Bash file with the following script that counts and prints the length of an associative array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.

#Declare an associative array
declare -A items=([6745]="Shirt (M)" [2345]="Shirt (L)" [4566]="Pant (36)")

#Count array length using '#'
echo "Associative array length using '#' with '@':  ${#items[@]}"
echo "Associative array length using '#' with '*':  ${#items[*]}"

The array contains three string values, so executing the script prints a length of 3 for both the “@” and “*” symbols.

Find the Array Length Using a Loop

Using a loop is another way to count the total number of elements in the array. The length of an array is counted using a “for” loop in the following example:

Example: Count the Length of an Array Using a Loop

Create a Bash file with the following script that counts the total number of elements using a “for” loop. A numeric array of four string values is declared in the script using the “declare” command. The “for” loop iterates over the array and prints each value. The $counter variable is incremented in each iteration of the loop, so it holds the length of the array when the loop ends.

#Declare an array
declare -a items=("Shirt(M)" "Shirt(L)" "Panjabi(42)" "Pant(38)")

echo "Array values are:"

#Count array length using loop
counter=0

#Iterate the array values; the quotes keep each element intact
for val in "${items[@]}"
do
    #Print the array value
    echo "$val"
    #Increment the counter once per element
    ((counter++))
done

echo "The array length using loop is $counter."

Executing the script prints the four array values followed by the length of the array, which is 4.

Find the Array Length Using the “wc” Command

The length of the array can also be counted using commands, and “wc” is one of them. However, this command counts words rather than elements, so it does not return the correct length if any element contains more than one word. The following example counts the elements of two arrays and compares the length reported by the “#” symbol with the one reported by the “wc” command.

Example: Count the Length of an Array Using the “wc” Command

Create a Bash file with the following script that counts the total number of elements using the “wc” command. Two numeric arrays of five string values each are declared in the script. Every element of the first array is a single word, and every element of the second array consists of two words. The “wc” command with the -w option is used to count the length of both arrays, and the length of the second array is also counted using the “#” symbol for comparison.

#Declare a numeric array of single-word strings
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
echo "Array values: ${items[@]}"

#Count array length using 'wc'
len=$(echo ${items[@]} | wc -w)
echo "Array length using 'wc' command: $len"

#Declare a numeric array of two-word strings
items2=("Shirt (XL)" "T-Shirt (L)" "Pant (34)" "Panjabi (38)" "Shoe (9)")
echo "Array values: ${items2[@]}"
echo "Array length using '#': ${#items2[@]}"

#Count array length using 'wc'
len=$(echo ${items2[@]} | wc -w)
echo "Array length using 'wc' command: $len"

According to the output of the script, the “wc” command reports the wrong length for the array whose elements contain two words: it counts 10 words, while the “#” symbol reports the actual length of 5.
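The mismatch comes from counting words instead of elements. The Python sketch below is only an analogy, not Bash: it models the array as a list, with the variable names element_count and word_count chosen for illustration. Joining the elements with spaces and splitting on whitespace mimics what “echo ... | wc -w” does.

```python
# The second array from the script, modelled as a Python list.
items2 = ["Shirt (XL)", "T-Shirt (L)", "Pant (34)", "Panjabi (38)", "Shoe (9)"]

# Counting elements is analogous to ${#items2[@]} in Bash.
element_count = len(items2)

# Joining and splitting on whitespace is analogous to `echo ... | wc -w`:
# each two-word element contributes two words.
word_count = len(" ".join(items2).split())

print(f"elements: {element_count}, words: {word_count}")
```

The two counts agree only when every element is a single word, which is why the first array in the script looks correct under “wc -w”.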

Conclusion

The methods of counting the length of an array using the “#” symbol, loop, and the “wc” command are shown in this tutorial.

Original article source at: https://linuxhint.com/

#bash #array 

Edward Jackson

1653377002

PySpark Cheat Sheet: Spark in Python

This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It can speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API that exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.

Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
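To make those distinctions concrete, here is a plain-Python sketch that needs no Spark installation. It only imitates the semantics on an ordinary list, so the helper name widen and the variables below are illustrative and are not part of the PySpark API. On a real RDD the element order of some results can depend on partitioning.

```python
from functools import reduce
from itertools import chain

pairs = [("a", 7), ("a", 2), ("b", 2)]


def widen(pair):
    """Append the reversed pair, as the cheat sheet's lambda does."""
    return pair + (pair[1], pair[0])


# map-like: exactly one output element per input element.
mapped = [widen(p) for p in pairs]

# flatMap-like: each output is unpacked, so the result is one flat list.
flat_mapped = list(chain.from_iterable(widen(p) for p in pairs))

# reduce-like: the whole collection collapses into a single value.
reduced = reduce(lambda a, b: a + b, pairs)

# reduceByKey-like: values are merged separately for each key.
reduced_by_key = {}
for key, value in pairs:
    reduced_by_key[key] = reduced_by_key.get(key, 0) + value

print(mapped)
print(flat_mapped)
print(reduced)
print(reduced_by_key)
```

In short, map() keeps the nesting and flatMap() removes one level of it, while reduce() returns one value for the whole dataset and reduceByKey() returns one value per key.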

PySpark cheat sheet

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet. 

Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark 

SparkContext 

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext 

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration 

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...      .setMaster("local")
...      .setAppName("My app")
...      .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using the Shell 

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Loading Data 

Parallelized Collections 

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
...                ("b",["p","r"])])

External Data 

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles(). 

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information 

Basic Information 

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary 

>>> rdd3.max() #Maximum value of RDD elements 
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements 
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements 
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements 
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
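The summary values above can be reproduced without Spark. The sketch below uses Python's standard statistics module on the same numbers 0 to 99 as rdd3. It is only a cross-check, and it assumes the population (divide-by-N) formulas, which is what the outputs shown above are consistent with.

```python
import statistics

values = list(range(100))  # same contents as rdd3

maximum = max(values)
minimum = min(values)
mean = statistics.mean(values)
# Population variance and standard deviation (divide by N, not N - 1).
variance = statistics.pvariance(values)
stdev = statistics.pstdev(values)

print(maximum, minimum, mean, variance, stdev)
```

If the sample (N - 1) statistics are needed instead, the standard library offers statistics.variance and statistics.stdev for the plain-Python case.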

Applying Functions 

#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a',7,7,'a'), ('a',2,2,'a'), ('b',2,2,'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a', 7, 7, 'a', 'a', 2, 2, 'a', 'b', 2, 2, 'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements 
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements 
[('a', 7),  ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements 
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
     [3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a' ,2, 'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a',  'a',  'b']

Iterating 

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data 

Reducing

>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a+ b) #Merge the rdd values
('a', 7, 'a' , 2 , 'b' , 2)

 

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> from operator import add
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Fold the elements of each partition, and then the results, using a zero value
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a',9), ('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
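The two-step nature of aggregate() can be imitated in plain Python. In this sketch, seq_op and comb_op mirror seqOp and combOp above, while simulate_aggregate and the two-way partition split are hypothetical stand-ins for what Spark does internally. It is not the Spark implementation.

```python
from functools import reduce


def seq_op(acc, value):
    """Fold one element into a (sum, count) accumulator within a partition."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    """Merge the (sum, count) accumulators of two partitions."""
    return (left[0] + right[0], left[1] + right[1])


def simulate_aggregate(partitions, zero, seq, comb):
    """Aggregate each partition from the zero value, then combine the results."""
    per_partition = [reduce(seq, part, zero) for part in partitions]
    return reduce(comb, per_partition, zero)


# Hypothetical split of the numbers 0..99 into two partitions.
partitions = [list(range(0, 50)), list(range(50, 100))]
total, count = simulate_aggregate(partitions, (0, 0), seq_op, comb_op)

print(total, count, total / count)
```

Carrying a (sum, count) pair is what lets the mean be computed in a single pass: each partition is summarised independently and only the small accumulators are merged.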

Mathematical Operations 

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2), ('a',7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2

Sort 

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a' ,2), ('b' ,1), ('d' ,1)]

Repartitioning 

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1

Saving 

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
...                'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext 

>>> sc.stop()

Execution 

$ ./bin/spark-submit examples/src/main/python/pi.py

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #python