1597755600
Given a range [low, high], both inclusive, and an integer K, the task is to select K numbers from the range (a number can be chosen multiple times) such that the sum of those K numbers is even. Print the number of all such permutations, that is, ordered selections.
Examples:
Input: low = 4, high = 5, k = 3
Output: 4
Explanation:
There are 4 valid permutations: {4, 4, 4}, {4, 5, 5}, {5, 4, 5} and {5, 5, 4}, each of which sums to an even number.
Input: low = 1, high = 10, k = 2
Output: 50
Explanation:
There are 50 valid permutations: {1, 1}, {1, 3}, …, {1, 9}, {2, 2}, {2, 4}, …, {2, 10}, …, {10, 2}, {10, 4}, …, {10, 10}.
Each of these 50 permutations sums to an even number.
Naive Approach: The idea is to find all subsets of size K whose sum is even and then count the permutations of each such subset.
Time Complexity: O(K * 2^K)
Auxiliary Space: O(K)
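For small inputs, the expected answers can be checked by brute force. The sketch below is not the subset-based procedure described above; it simply enumerates every ordered selection of K values with itertools.product and keeps those with an even sum. The function name count_even_sum_brute is an illustrative choice, not part of the original article.

```python
from itertools import product


def count_even_sum_brute(low, high, k):
    """Count ordered selections of k values from [low, high] whose sum is even."""
    values = range(low, high + 1)
    # product(values, repeat=k) yields every ordered k-tuple, repetition allowed
    return sum(1 for combo in product(values, repeat=k) if sum(combo) % 2 == 0)


print(count_even_sum_brute(4, 5, 3))   # first example: prints 4
print(count_even_sum_brute(1, 10, 2))  # second example: prints 50
```

This enumerates (high - low + 1)^K tuples, so it is only practical as a correctness check on small ranges.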
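Efficient Approach: The idea is to use the fact that the sum of two even numbers, or of two odd numbers, is always even, while the sum of one even and one odd number is always odd. Follow the steps below to solve the problem:
Count the even and odd numbers in [low, high] as even_count and odd_count.
Initialize even_sum = 1 and odd_sum = 0, the number of ways to reach an even or an odd sum before any number is chosen.
Repeat K times, using the values from the previous iteration: the new even_sum is even_sum * even_count + odd_sum * odd_count, and the new odd_sum is even_sum * odd_count + odd_sum * even_count.
After K iterations, even_sum is the required count.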
Below is the implementation of the above approach:
// Java program for the above approach
import java.util.*;

class GFG {

    // Function to print the number
    // of all permutations such that
    // sum of K numbers in range is even
    public static void countEvenSum(int low, int high, int k)
    {
        // Find total count of even and
        // odd numbers in given range
        int even_count = high / 2 - (low - 1) / 2;
        int odd_count = (high + 1) / 2 - low / 2;

        long even_sum = 1;
        long odd_sum = 0;

        // Iterate loop k times and update
        // even_sum & odd_sum using
        // previous values
        for (int i = 0; i < k; i++) {

            // Save the previous even_sum
            // and odd_sum
            long prev_even = even_sum;
            long prev_odd = odd_sum;

            // Even sum
            even_sum = (prev_even * even_count)
                       + (prev_odd * odd_count);

            // Odd sum
            odd_sum = (prev_even * odd_count)
                      + (prev_odd * even_count);
        }

        // Print even_sum
        System.out.println(even_sum);
    }

    // Driver Code
    public static void main(String[] args)
    {
        // Given range
        int low = 4;
        int high = 5;

        // Length of permutation
        int K = 3;

        // Function call
        countEvenSum(low, high, K);
    }
}
Output:
4
Time Complexity: O(K)
Auxiliary Space: O(1)
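The same recurrence can be sketched in Python and checked against both examples. This is a direct port of the Java logic above, assuming low is at least 1 (Python's floor division and Java's truncating division agree for non-negative operands); the function returns the count instead of printing it, which is a small change made here for testability.

```python
def count_even_sum(low, high, k):
    """Count ordered selections of k numbers from [low, high] with an even sum."""
    # Number of even and odd integers in the inclusive range
    even_count = high // 2 - (low - 1) // 2
    odd_count = (high + 1) // 2 - low // 2

    # With zero numbers chosen there is one way to have an even sum (the empty sum)
    even_ways, odd_ways = 1, 0
    for _ in range(k):
        # Adding an even number keeps the parity, adding an odd number flips it
        even_ways, odd_ways = (
            even_ways * even_count + odd_ways * odd_count,
            even_ways * odd_count + odd_ways * even_count,
        )
    return even_ways


print(count_even_sum(4, 5, 3))   # prints 4
print(count_even_sum(1, 10, 2))  # prints 50
```

Because Python integers do not overflow, this version does not need the long type used in the Java code.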
#arrays #dynamic programming #greedy #mathematical #adobe #array-range-queries #permutation #permutation and combination
1684293566
Bash supports both numeric (indexed) and associative arrays. The total number of elements in either type of array can be calculated in several ways: with the “#” symbol, with a loop, or with a command such as “wc” or “grep”. This tutorial shows the different ways of counting the array length in Bash.
Using the “#” symbol is the simplest way to calculate the array length. This part of the tutorial shows how to count the total number of elements of a numeric array and of an associative array.
Create a Bash file with the following script that counts and prints the length of a numeric array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.
#Declare a numeric array
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
#Count array length using '#'
echo "Array length using '#' with '@': ${#items[@]}"
echo "Array length using '#' with '*': ${#items[*]}"
The following output appears after executing the script. The array contains five string values, so both “@” and “*” report a length of 5.
Create a Bash file with the following script that counts and prints the length of an associative array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.
#Declare an associative array
declare -A items=([6745]="Shirt (M)" [2345]="Shirt (L)" [4566]="Pant (36)")
#Count array length using '#'
echo "Associative array length using '#' with '@': ${#items[@]}"
echo "Associative array length using '#' with '*': ${#items[*]}"
The following output appears after executing the script. The array contains three string values, so both “@” and “*” report a length of 3.
Using a loop is another way to count the total number of elements in the array. The length of an array is counted using a for loop in the following example:
Create a Bash file with the following script that counts the total number of elements using a “for” loop. A numeric array of four string values is declared in the script using the “declare” command. The “for” loop iterates over the array and prints each value. Here, the $counter variable is incremented in each iteration of the loop, so it holds the length of the array when the loop ends.
#Declare an array
declare -a items=("Shirt(M)" "Shirt(L)" "Panjabi(42)" "Pant(38)")
echo "Array values are:"
#Count array length using loop
counter=0
#Iterate the array values
for val in "${items[@]}"
do
#Print the array value
echo "$val"
((counter++))
done
echo "The array length using loop is $counter."
The following output appears after executing the script. The array values are printed first, followed by the array length of 4.
The length of the array can also be counted using some commands. The “wc” command is one of them. But this command does not return the correct count if any array element is a string of multiple words. The following example counts the elements of an array with both the “#” symbol and the “wc” command and compares the two results.
Create a Bash file with the following script that counts the total number of elements using the “wc” command. The “wc” command with the -w option is used to count the length of two numeric arrays of five elements each. The elements of the first array are single-word strings and the elements of the second array are two-word strings. The length of the second array is counted using both the “#” symbol and the “wc” command.
#Declare a numeric array of a single word of the string
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
echo "Array values: ${items[@]}"
#Count array length using 'wc'
len=`echo ${items[@]} | wc -w`
echo "Array length using 'wc' command: $len"
#Declare a numeric array of multiple words of the string
items2=("Shirt (XL)" "T-Shirt (L)" "Pant (34)" "Panjabi (38)" "Shoe (9)")
echo "Array values: ${items2[@]}"
echo "Array length using '#': ${#items2[@]}"
#Count array length using 'wc'
len=`echo ${items2[@]} | wc -w`
echo "Array length using 'wc' command: $len"
The following output appears after executing the script. According to the output, the “wc” command generates the wrong result for the array of two-word strings: it reports 10 words, while the “#” symbol reports the correct length of 5.
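The reason for the mismatch is that “wc -w” counts whitespace-separated words in the echoed text, not array elements. The sketch below reproduces the same effect in Python purely for comparison; it does not run Bash, and the helper names element_count and word_count are illustrative assumptions.

```python
items = ["Shirt", "T-Shirt", "Pant", "Panjabi", "Shoe"]
items2 = ["Shirt (XL)", "T-Shirt (L)", "Pant (34)", "Panjabi (38)", "Shoe (9)"]


def element_count(values):
    # Comparable to ${#array[@]}: the number of elements
    return len(values)


def word_count(values):
    # Comparable to echo ${array[@]} | wc -w: join with spaces, then count words
    return len(" ".join(values).split())


print(element_count(items), word_count(items))    # prints 5 5
print(element_count(items2), word_count(items2))  # prints 5 10
```

The two counts agree only when every element is a single word, which is why the “#” symbol is the reliable choice.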
The methods of counting the length of an array using the “#” symbol, loop, and the “wc” command are shown in this tutorial.
Original article source at: https://linuxhint.com/
1684297323
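Bash supports both numeric (indexed) and associative arrays. The total number of elements in either type of array can be calculated in several ways: with the “#” symbol, with a loop, or with a command such as “wc” or “grep”. This tutorial shows the different ways of counting the array length in Bash.
Using the “#” symbol is the simplest way to calculate the array length. This part of the tutorial shows how to count the total number of elements of a numeric array and of an associative array.
Create a Bash file with the following script that counts and prints the length of a numeric array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.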
#Declare a numeric array
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
#Count array length using '#'
echo "Array length using '#' with '@': ${#items[@]}"
echo "Array length using '#' with '*': ${#items[*]}"
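The following output appears after executing the script. The array contains five string values, so both “@” and “*” report a length of 5.
Create a Bash file with the following script that counts and prints the length of an associative array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.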
#Declare an associative array
declare -A items=([6745]="Shirt (M)" [2345]="Shirt (L)" [4566]="Pant (36)")
#Count array length using '#'
echo "Associative array length using '#' with '@': ${#items[@]}"
echo "Associative array length using '#' with '*': ${#items[*]}"
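The following output appears after executing the script. The array contains three string values, so both “@” and “*” report a length of 3.
Using a loop is another way to count the total number of elements in the array. The length of an array is counted using a for loop in the following example:
Create a Bash file with the following script that counts the total number of elements using a “for” loop. A numeric array of four string values is declared in the script using the “declare” command. The “for” loop iterates over the array and prints each value. Here, the $counter variable is incremented in each iteration of the loop, so it holds the length of the array when the loop ends.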
#Declare an array
declare -a items=("Shirt(M)" "Shirt(L)" "Panjabi(42)" "Pant(38)")
echo "Array values are:"
#Count array length using loop
counter=0
#Iterate the array values
for val in "${items[@]}"
do
#Print the array value
echo "$val"
((counter++))
done
echo "The array length using loop is $counter."
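The following output appears after executing the script. The array values are printed first, followed by the array length of 4.
The length of the array can also be counted using some commands. The “wc” command is one of them. But this command does not return the correct count if any array element is a string of multiple words. The following example counts the elements of an array with both the “#” symbol and the “wc” command and compares the two results.
Create a Bash file with the following script that counts the total number of elements using the “wc” command. The “wc” command with the -w option is used to count the length of two numeric arrays of five elements each. The elements of the first array are single-word strings and the elements of the second array are two-word strings. The length of the second array is counted using both the “#” symbol and the “wc” command.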
#Declare a numeric array of a single word of the string
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
echo "Array values: ${items[@]}"
#Count array length using 'wc'
len=`echo ${items[@]} | wc -w`
echo "Array length using 'wc' command: $len"
#Declare a numeric array of multiple words of the string
items2=("Shirt (XL)" "T-Shirt (L)" "Pant (34)" "Panjabi (38)" "Shoe (9)")
echo "Array values: ${items2[@]}"
echo "Array length using '#': ${#items2[@]}"
#Count array length using 'wc'
len=`echo ${items2[@]} | wc -w`
echo "Array length using 'wc' command: $len"
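The following output appears after executing the script. According to the output, the “wc” command generates the wrong result for the array of two-word strings: it reports 10 words, while the “#” symbol reports the correct length of 5.
The methods of counting the length of an array using the “#” symbol, a loop, and the “wc” command are shown in this tutorial.
Original article source at: https://linuxhint.com/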
1684301100
Bash supports both numeric (indexed) and associative arrays. The total number of elements in either type of array can be calculated in several ways: with the “#” symbol, with a loop, or with a command such as “wc” or “grep”. This tutorial shows the different ways of counting the array length in Bash.
Using the “#” symbol is the simplest way to calculate the array length. This part of the tutorial shows how to count the total number of elements of a numeric array and of an associative array.
Create a Bash file with the following script that counts and prints the length of a numeric array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.
#Declare a numeric array
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
#Count array length using '#'
echo "Array length using '#' with '@': ${#items[@]}"
echo "Array length using '#' with '*': ${#items[*]}"
The following output appears after executing the script. The array contains five string values, so both “@” and “*” report a length of 5.
Create a Bash file with the following script that counts and prints the length of an associative array using the “#” symbol. The “@” and “*” symbols are used here to denote all elements of the array.
#Declare an associative array
declare -A items=([6745]="Shirt (M)" [2345]="Shirt (L)" [4566]="Pant (36)")
#Count array length using '#'
echo "Associative array length using '#' with '@': ${#items[@]}"
echo "Associative array length using '#' with '*': ${#items[*]}"
The following output appears after executing the script. The array contains three string values, so both “@” and “*” report a length of 3.
Using a loop is another way to count the total number of elements in the array. The length of an array is counted using a for loop in the following example:
Create a Bash file with the following script that counts the total number of elements using a “for” loop. A numeric array of four string values is declared in the script using the “declare” command. The “for” loop iterates over the array and prints each value. Here, the $counter variable is incremented in each iteration of the loop, so it holds the length of the array when the loop ends.
#Declare an array
declare -a items=("Shirt(M)" "Shirt(L)" "Panjabi(42)" "Pant(38)")
echo "Array values are:"
#Count array length using loop
counter=0
#Iterate the array values
for val in "${items[@]}"
do
#Print the array value
echo "$val"
((counter++))
done
echo "The array length using loop is $counter."
The following output appears after executing the script. The array values are printed first, followed by the array length of 4.
The length of the array can also be counted using some commands. The “wc” command is one of them. But this command does not return the correct count if any array element is a string of multiple words. The following example counts the elements of an array with both the “#” symbol and the “wc” command and compares the two results.
Create a Bash file with the following script that counts the total number of elements using the “wc” command. The “wc” command with the -w option is used to count the length of two numeric arrays of five elements each. The elements of the first array are single-word strings and the elements of the second array are two-word strings. The length of the second array is counted using both the “#” symbol and the “wc” command.
#Declare a numeric array of a single word of the string
items=("Shirt" "T-Shirt" "Pant" "Panjabi" "Shoe")
echo "Array values: ${items[@]}"
#Count array length using 'wc'
len=`echo ${items[@]} | wc -w`
echo "Array length using 'wc' command: $len"
#Declare a numeric array of multiple words of the string
items2=("Shirt (XL)" "T-Shirt (L)" "Pant (34)" "Panjabi (38)" "Shoe (9)")
echo "Array values: ${items2[@]}"
echo "Array length using '#': ${#items2[@]}"
#Count array length using 'wc'
len=`echo ${items2[@]} | wc -w`
echo "Array length using 'wc' command: $len"
The following output appears after executing the script. According to the output, the “wc” command generates the wrong result for the array of two-word strings: it reports 10 words, while the “#” symbol reports the correct length of 5.
The methods of counting the length of an array using the “#” symbol, a loop, and the “wc” command are shown in this tutorial.
Original article source at: https://linuxhint.com/
1653377002
This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.
Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It can speed up analytic applications by up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API that exposes the Spark programming model to Python.
Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, and inspect your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.
Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.
This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet.
Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.
PySpark is the Spark Python API that exposes the Spark programming model to Python.
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')
>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True
>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
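The summary numbers shown for rdd3 can be reproduced without Spark. The sketch below uses Python's statistics module on range(100) and suggests that the stdev() and variance() results above correspond to the population formulas (dividing by n); PySpark exposes sampleStdev() and sampleVariance() for the sample versions. This is a plain-Python check, not Spark code.

```python
import statistics

data = list(range(100))  # same values as rdd3 = sc.parallelize(range(100))

mean = statistics.mean(data)                    # 49.5
variance = statistics.pvariance(data)           # population variance, 833.25
stdev = statistics.pstdev(data)                 # population standard deviation
sample_stdev = statistics.stdev(data)           # sample standard deviation (n - 1)

print(mean, variance, stdev)
print(sample_stdev > stdev)  # the sample estimate is slightly larger
```

The population standard deviation matches the 28.866... value listed above up to floating-point rounding.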
#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a' ,7,7, 'a'),('a' ,2,2, 'a'), ('b' ,2,2, 'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7 , 7 , 'a' , 'a' , 2, 2, 'a', 'b', 2 , 2, 'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'),('b', 'p'),('b', 'r')]
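To make the difference between map(), flatMap() and flatMapValues() concrete, the following sketch mimics their behaviour on ordinary Python lists. It is an analogy only and does not use Spark; the variable names pairs and nested stand in for the rdd and rdd4 examples above.

```python
from itertools import chain

pairs = [('a', 7), ('a', 2), ('b', 2)]                    # stands in for rdd
nested = [('a', ['x', 'y', 'z']), ('b', ['p', 'r'])]      # stands in for rdd4


def f(x):
    # Same function as in the map/flatMap examples: append the swapped pair
    return x + (x[1], x[0])


# map: one output element per input element
mapped = [f(x) for x in pairs]

# flatMap: apply the function, then flatten the results into one sequence
flat_mapped = list(chain.from_iterable(f(x) for x in pairs))

# flatMapValues: flatten only the values, keeping each key
flat_map_values = [(key, value) for key, values in nested for value in values]

print(mapped)
print(flat_mapped)
print(flat_map_values)
```

The map result keeps three tuples, while the flatMap result is a single flat list of twelve items.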
Getting
>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]
Sampling
>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
Filtering
>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a' ,2, 'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
Reducing
>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a+ b) #Merge the rdd values
('a', 7, 'a' , 2 , 'b' , 2)
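The two results above differ because reduceByKey() merges values separately for each key, while reduce() folds the whole elements together. A plain-Python sketch of both behaviours, assuming a hypothetical helper reduce_by_key written for this illustration:

```python
from functools import reduce

pairs = [('a', 7), ('a', 2), ('b', 2)]  # stands in for rdd


def reduce_by_key(func, kv_pairs):
    """Merge the values of each key with func (illustrative helper, not a Spark API)."""
    merged = {}
    for key, value in kv_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


by_key = reduce_by_key(lambda x, y: x + y, pairs)
whole = reduce(lambda a, b: a + b, pairs)  # tuple + tuple concatenates

print(by_key)  # prints [('a', 9), ('b', 2)]
print(whole)   # prints ('a', 7, 'a', 2, 'b', 2)
```

In the reduce() case the elements are tuples, so the + operator concatenates them rather than adding numbers.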
Grouping by
>>> (rdd3.groupBy(lambda x: x % 2) #Return RDD of grouped values
         .mapValues(list)
         .collect())
>>> (rdd.groupByKey() #Group rdd by key
        .mapValues(list)
        .collect())
[('a',[7,2]),('b',[2])]
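As a rough mental model, groupBy() derives a key from each element with a function, while groupByKey() uses the key that is already in each (key, value) pair. The sketch below imitates both with a dictionary of lists; the helpers group_by and group_by_key are illustrative assumptions, not Spark functions.

```python
from collections import defaultdict


def group_by(values, key_func):
    """Group values under the key computed by key_func."""
    groups = defaultdict(list)
    for value in values:
        groups[key_func(value)].append(value)
    return dict(groups)


def group_by_key(kv_pairs):
    """Group the values of (key, value) pairs under their existing key."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return list(groups.items())


parity = group_by(range(100), lambda x: x % 2)
by_key = group_by_key([('a', 7), ('a', 2), ('b', 2)])

print(parity[0][:5], parity[1][:5])
print(by_key)  # prints [('a', [7, 2]), ('b', [2])]
```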
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a' ,9), ('b' ,2)]
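The two-function form of aggregate() is easier to follow when the partitions are made explicit. In the sketch below, seq_op folds the elements inside one partition and comb_op merges the per-partition results. The 4-way split and the aggregate helper are assumptions made for illustration; Spark decides the real partitioning.

```python
from functools import reduce


def seq_op(acc, value):
    # Fold one element into a (sum, count) accumulator
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    # Merge two (sum, count) accumulators
    return (left[0] + right[0], left[1] + right[1])


def aggregate(partitions, zero, seq, comb):
    """Illustrative stand-in for RDD.aggregate on a list of partitions."""
    per_partition = [reduce(seq, part, zero) for part in partitions]
    return reduce(comb, per_partition, zero)


data = list(range(100))
partitions = [data[i::4] for i in range(4)]  # hypothetical 4 partitions

print(aggregate(partitions, (0, 0), seq_op, comb_op))  # prints (4950, 100)
```

The result does not depend on how the data is split, as long as the zero value is neutral for both functions.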
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b' ,2), ('a' ,7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
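The three operations above compare whole elements, keys only, or every pairing. A plain-Python approximation on lists, with left and right standing in for rdd and rdd2 (element order in Spark results is not guaranteed, so only the contents are comparable):

```python
from itertools import product

left = [('a', 7), ('a', 2), ('b', 2)]    # stands in for rdd
right = [('a', 2), ('d', 1), ('b', 1)]   # stands in for rdd2

# subtract: elements of left that do not appear in right (whole-element match)
subtract = [x for x in left if x not in right]

# subtractByKey: pairs of right whose key does not appear in left
left_keys = {key for key, _ in left}
subtract_by_key = [(key, value) for key, value in right if key not in left_keys]

# cartesian: every pairing of a left element with a right element
cartesian = list(product(left, right))

print(subtract)
print(subtract_by_key)
print(len(cartesian))
```

With three elements on each side, the Cartesian product contains nine pairs.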
>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a' ,2), ('b' ,1), ('d' ,1)]
>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
>>> sc.stop()
$ ./bin/spark-submit examples/src/main/python/pi.py
Original article source at https://www.datacamp.com
#pyspark #cheatsheet #spark #python