In this short Python video you will learn the difference between returning a value from a function with a return statement and calling the print() function to display the value.
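As a minimal sketch of that difference (the function names here are just for illustration):
#A function with return hands the value back to the caller for further use
def add_with_return(a, b):
    return a + b
#A function that only prints displays the value but implicitly returns None
def add_with_print(a, b):
    print(a + b)
total = add_with_return(2, 3)
print(total + 10)             #15: the returned value can be reused
result = add_with_print(2, 3) #prints 5
print(result)                 #None: nothing was returned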
#python
In this Python article, let's learn about Mutable and Immutable in Python.
Mutable is a fancy way of saying that the internal state of an object can be changed or mutated. So, the simplest definition is: an object whose internal state can be changed is mutable. On the other hand, immutable doesn't allow any change in the object once it has been created.
Both of these states are integral to Python data structures.
Mutable is when something is changeable or has the ability to change. In Python, ‘mutable’ is the ability of objects to change their values. These are often the objects that store a collection of data.
Immutable is when no change is possible over time. In Python, if the value of an object cannot be changed over time, then it is known as immutable. Once created, the value of these objects is permanent.
Objects of built-in types that are mutable are: list, dictionary, set.
Objects of built-in types that are immutable are: int, float, decimal, bool, string, tuple, range.
Object mutability is one of the characteristics that makes Python a dynamically typed language. Though Mutable and Immutable in Python is a very basic concept, it can at times be a little confusing due to the non-transitive nature of immutability.
In Python, everything is treated as an object. Every object has these three attributes:
While the ID and type cannot be changed once the object is created, the value can be changed for mutable objects.
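A quick sketch of these three attributes using a list (the printed addresses will differ on your machine):
#Identity, type and value of an object
nums = [1, 2, 3]
print(id(nums))    #Identity: unique for this object (its memory address in CPython)
print(type(nums))  #Type: <class 'list'>
print(nums)        #Value: [1, 2, 3]
#Mutating the object changes the value but keeps the same identity and type
nums.append(4)
print(id(nums))    #Same id as before
print(nums)        #[1, 2, 3, 4]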
I believe, rather than diving deep into the theory aspects of mutable and immutable in Python, a simple code would be the best way to depict what it means in Python. Hence, let us discuss the below code step-by-step:
#Creating a list which contains name of Indian cities
cities = ['Delhi', 'Mumbai', 'Kolkata']
# Printing the elements from the list cities, separated by a comma & space
for city in cities:
print(city, end=', ')
Output [1]: Delhi, Mumbai, Kolkata
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(cities)))
Output [2]: 0x1691d7de8c8
#Adding a new city to the list cities
cities.append('Chennai')
#Printing the elements from the list cities, separated by a comma & space
for city in cities:
print(city, end=', ')
Output [3]: Delhi, Mumbai, Kolkata, Chennai
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(cities)))
Output [4]: 0x1691d7de8c8
The above example shows us that we were able to change the internal state of the object ‘cities’ by adding one more city ‘Chennai’ to it, yet, the memory address of the object did not change. This confirms that we did not create a new object, rather, the same object was changed or mutated. Hence, we can say that the object which is a type of list with reference variable name ‘cities’ is a MUTABLE OBJECT.
Let us now discuss the term IMMUTABLE. Considering that we understood what mutable stands for, it is obvious that the definition of immutable will have ‘NOT’ included in it. Here is the simplest definition of immutable– An object whose internal state can NOT be changed is IMMUTABLE.
Again, if you pay attention to the different error messages you have encountered, thrown by the interpreter or IDE you use, you will be able to identify the immutable objects in Python. For instance, consider the code below and the associated error message raised while trying to change the value of a tuple at index 0.
#Creating a Tuple with variable name ‘foo’
foo = (1, 2)
#Changing the index[0] value from 1 to 3
foo[0] = 3
TypeError: 'tuple' object does not support item assignment
Once again, a simple code would be the best way to depict what immutable stands for. Hence, let us discuss the below code step-by-step:
#Creating a Tuple which contains English name of weekdays
weekdays = 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'
# Printing the elements of tuple weekdays
print(weekdays)
Output [1]: ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(weekdays)))
Output [2]: 0x1691cc35090
#tuples are immutable, so you cannot add new elements; hence, merging tuples with the + operator to add a new imaginary day to the tuple 'weekdays'
weekdays += 'Pythonday',
#Printing the elements of tuple weekdays
print(weekdays)
Output [3]: ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Pythonday')
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(weekdays)))
Output [4]: 0x1691cc8ad68
The above example shows that we were able to keep the same variable name referencing an object which is a type of tuple, first with seven elements in it. However, the ID or the memory location of the old and new tuple is not the same. We were not able to change the internal state of the object 'weekdays'. The Python memory manager created a new object at a new memory address, and the variable name 'weekdays' started referencing that new object with eight elements in it. Hence, we can say that the object which is a type of tuple with reference variable name 'weekdays' is an IMMUTABLE OBJECT.
Where can you use mutable and immutable objects:
Mutable objects can be used where you want to allow for updates. For example, you have a list of employee names in your organization that needs to be updated every time a new member is hired. You can create a mutable list, and it can be updated easily.
Immutability offers a lot of useful applications for sensitive tasks in a network-centred environment where we allow for parallel processing. By creating immutable objects, you seal the values and ensure that no thread can overwrite or update your data. This is also useful in situations where you would like to write a piece of code that cannot be modified, for example, debug code that attempts to read the value of an immutable object.
Watch out: the non-transitive nature of immutability
OK! Now we understand what mutable and immutable objects in Python are. Let's go ahead and discuss the combination of the two and explore the possibilities. How will it behave if you have an immutable object which contains mutable object(s), or vice versa? Let us again use code to understand this behaviour:
#creating a tuple (immutable object) which contains 2 lists (mutable) as its elements
#The elements (lists) contain the name, age & gender
person = (['Ayaan', 5, 'Male'], ['Aaradhya', 8, 'Female'])
#printing the tuple
print(person)
Output [1]: (['Ayaan', 5, 'Male'], ['Aaradhya', 8, 'Female'])
#printing the location of the object created in the memory address in hexadecimal format
print(hex(id(person)))
Output [2]: 0x1691ef47f88
#Changing the age for the 1st element. Selecting 1st element of tuple by using indexing [0] then 2nd element of the list by using indexing [1] and assigning a new value for age as 4
person[0][1] = 4
#printing the updated tuple
print(person)
Output [3]: (['Ayaan', 4, 'Male'], ['Aaradhya', 8, 'Female'])
#printing the location of the object created in the memory address in hexadecimal format
print(hex(id(person)))
Output [4]: 0x1691ef47f88
In the above code, you can see that the object 'person' is immutable since it is a type of tuple. However, it has two lists as its elements, and we can change the state of those lists (lists being mutable). So, here we did not change the object references inside the tuple, but the referenced objects were mutated.
In the same way, let's explore how it behaves if you have a mutable object which contains immutable objects. Let us again use code to understand the behaviour:
#creating a list (mutable object) which contains tuples (immutable) as its elements
list1 = [(1, 2, 3), (4, 5, 6)]
#printing the list
print(list1)
Output [1]: [(1, 2, 3), (4, 5, 6)]
#printing the location of the object created in the memory address in hexadecimal format
print(hex(id(list1)))
Output [2]: 0x1691d5b13c8
#changing object reference at index 0
list1[0] = (7, 8, 9)
#printing the list
print(list1)
Output [3]: [(7, 8, 9), (4, 5, 6)]
#printing the location of the object created in the memory address in hexadecimal format
print(hex(id(list1)))
Output [4]: 0x1691d5b13c8
As an individual, it completely depends upon you and your requirements as to what kind of data structure you would like to create with a combination of mutable & immutable objects. I hope that this information will help you while deciding the type of object you would like to select going forward.
Before I end our discussion on IMMUTABILITY, allow me to add a CAVEAT regarding strings and integers. There is an exception, and you may see some surprising results while checking the truthiness of immutability. For instance:
#creating an object of integer type with value 10 and reference variable name ‘x’
x = 10
#printing the value of ‘x’
print(x)
Output [1]: 10
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(x)))
Output [2]: 0x538fb560
#creating an object of integer type with value 10 and reference variable name ‘y’
y = 10
#printing the value of ‘y’
print(y)
Output [3]: 10
#Printing the location of the object created in the memory address in hexadecimal format
print(hex(id(y)))
Output [4]: 0x538fb560
As per our discussion and understanding so far, the memory addresses for x and y should have been different, since 10 is an instance of the int class, which is immutable. However, as shown in the above code, both have the same memory address. This is not something that we expected. It seems that what we have understood and discussed has an exception as well.
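The reason is an optimization in the CPython interpreter: small integers (roughly -5 to 256) are cached, so every reference to the value 10 points to the same shared object. Below is a minimal sketch; the exact behaviour outside that range can vary between Python implementations and between running code in a script versus a REPL, so treat it as illustrative:
x = 10
y = 10
print(x is y)    #True in CPython: 10 is a cached small integer shared by both names
a = 100000
b = 100000
print(a == b)    #True: the values are equal
print(a is b)    #Often False when typed as separate REPL statements: two distinct objects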
Tuples are immutable and hence cannot have any changes in them once they are created in Python. They support the same sequence operations as strings: the index operator selects an element from a tuple just like in a string. However, just as with strings (which we all know are immutable), trying to assign to an index fails, because tuples do not support item assignment.
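For example, indexing works the same way on tuples and strings, while item assignment fails on both:
week = ('Mon', 'Tue', 'Wed')
print(week[1])      #Tue - indexing works just like 'abc'[1]
#week[1] = 'Thu'    #Would raise TypeError: 'tuple' object does not support item assignment
#'abc'[1] = 'x'     #Would raise TypeError: 'str' object does not support item assignment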
As with most rules, there are exceptions to immutability in Python too. Not all immutable objects are really immutable. This can lead to some confusion, so let us take an example to understand it.
Consider a tuple tup = ('GreatLearning', [4, 3, 1, 2]).
We see that the tuple has elements of different data types. The first element is a string, which, as we all know, is immutable. The second element is a list, which we all know is mutable. Now, the tuple itself is an immutable data type: it cannot change its contents. But the list inside it can change its contents. So, the value of an immutable object cannot be changed, but the values of its constituent mutable objects can.
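A short sketch with the same tuple:
tup = ('GreatLearning', [4, 3, 1, 2])
#tup[0] = 'GL'        #Would raise TypeError: 'tuple' object does not support item assignment
tup[1].append(5)      #Allowed: we mutate the list inside the tuple, not the tuple itself
print(tup)            #('GreatLearning', [4, 3, 1, 2, 5])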
Mutable Object | Immutable Object |
---|---|
State of the object can be modified after it is created. | State of the object can't be modified once it is created. |
They are not thread safe. | They are thread safe. |
Mutable classes are not final. | It is important to make the class final before creating an immutable object. |
Mutable data types in Python: list, dictionary, set, user-defined classes.
Immutable data types in Python: int, float, decimal, bool, string, tuple, range.
Lists in Python are mutable data types as the elements of the list can be modified, individual elements can be replaced, and the order of elements can be changed even after the list has been created.
(Examples related to lists have been discussed earlier in this blog.)
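Beyond append(), a quick sketch of the other in-place changes mentioned above:
nums = [3, 1, 2]
nums[0] = 10     #Replace an individual element
nums.sort()      #Reorder the elements in place
print(nums)      #[1, 2, 10]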
Tuple and list data structures are very similar, but one big difference between them is that lists are mutable, whereas tuples are immutable. The reason for the tuple's immutability is that once the elements are added and the tuple has been created, it remains unchanged.
A programmer would always prefer building code that can be reused instead of recreating the whole data object. Still, even though tuples are immutable, like lists, they can contain any Python object, including mutable objects.
A set is an iterable unordered collection of data type which can be used to perform mathematical operations (like union, intersection, difference etc.). Every element in a set is unique and immutable, i.e. no duplicate values should be there, and the values can’t be changed. However, we can add or remove items from the set as the set itself is mutable.
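A small sketch of this behaviour:
colors = {'red', 'green', 'blue'}
colors.add('yellow')       #The set itself is mutable: items can be added...
colors.remove('red')       #...and removed
print(colors)              #e.g. {'green', 'blue', 'yellow'} - sets are unordered
#colors.add(['a', 'b'])    #Would raise TypeError: lists are unhashable (mutable) and cannot be set elements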
Strings are not mutable in Python. Strings are an immutable data type, which means that their value cannot be updated.
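For example:
name = 'python'
#name[0] = 'P'        #Would raise TypeError: 'str' object does not support item assignment
name = name.upper()   #String methods return a new string; the original object is unchanged
print(name)           #PYTHON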
Original article source at: https://www.mygreatlearning.com
This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.
Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed up analytic applications up to 100 times compared to other technologies on the market today. You can interface Spark with Python through "PySpark". This is the Spark Python API, which exposes the Spark programming model to Python.
Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.
Let's face it, map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
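As a quick reminder of those differences, here is a small sketch in the cheat sheet's style; it assumes a SparkContext sc like the one created below, and the sample data is made up for illustration (result ordering may vary):
>>> pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
>>> pairs.map(lambda kv: kv[1]).collect() #map(): exactly one output element per input element
[1, 2, 3]
>>> pairs.flatMap(lambda kv: [kv[0]] * kv[1]).collect() #flatMap(): flattens the returned iterables
['a', 'b', 'b', 'a', 'a', 'a']
>>> pairs.map(lambda kv: kv[1]).reduce(lambda a, b: a + b) #reduce(): merges all elements into one result
6
>>> pairs.reduceByKey(lambda a, b: a + b).collect() #reduceByKey(): merges values separately per key
[('a', 4), ('b', 2)]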
Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.
This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet.
Note that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.
PySpark is the Spark Python API that exposes the Spark programming model to Python.
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')
>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
.setMaster("local")
.setAppName("My app")
.set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
("b",["p","r"])])
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True
>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
#Apply a function to each RDD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a', 7, 7, 'a'), ('a', 2, 2, 'a'), ('b', 2, 2, 'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a', 7, 7, 'a', 'a', 2, 2, 'a', 'b', 2, 2, 'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdd4.flatMapValues(lambda x: x).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'),('b', 'p'),('b', 'r')]
Getting
>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]
Sampling
>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
Filtering
>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a', 2, 'b', 7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
Reducing
>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a+ b) #Merge the rdd values
('a', 7, 'a', 2, 'b', 2)
Grouping by
>>> rdd3.groupBy(lambda x: x % 2) #Return RDD of grouped values
.mapValues(list)
.collect()
>>> rdd.groupByKey() #Group rdd by key
.mapValues(list)
.collect()
[('a',[7,2]),('b',[2])]
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqOp,combOp).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> from operator import add
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a' ,9), ('b' ,2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b', 2), ('a', 7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a', 2), ('b', 1), ('d', 1)]
>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs:// namenodehost/parent/child",
'org.apache.hadoop.mapred.TextOutputFormat')
>>> sc.stop()
$ ./bin/spark-submit examples/src/main/python/pi.py
Original article source at https://www.datacamp.com
#pyspark #cheatsheet #spark #python
Bash has no built-in function to take the user’s input from the terminal. The read command of Bash is used to take the user’s input from the terminal. This command has different options to take an input from the user in different ways. Multiple inputs can be taken using the single read command. Different ways of using this command in the Bash script are described in this tutorial.
read [options] [var1 var2 var3 ...]
The read command can be used without any argument or option. Many types of options can be used with this command to take input of a particular data type. It can take multiple inputs from the user by defining multiple variables with the command.
Some options of the read command require an additional parameter to use. The most commonly used options of the read command are mentioned in the following:
Option | Purpose |
---|---|
-d <delimiter> | It is used to take the input until the delimiter value is provided. |
-n <number> | It is used to take the input of a particular number of characters from the terminal and stop taking the input earlier based on the delimiter. |
-N <number> | It is used to take the input of the particular number of characters from the terminal, ignoring the delimiter. |
-p <prompt> | It is used to print the output of the prompt message before taking the input. |
-s | It is used to take the input without an echo. This option is mainly used to take the input for the password input. |
-a | It is used to take the input for the indexed array. |
-t <time> | It is used to set a time limit for taking the input. |
-u <file descriptor> | It is used to take the input from the file. |
-r | It is used to disable backslash escaping so that backslashes are read literally. |
The uses of read command with different options are shown in this part of this tutorial.
Example 1: Using Read Command Without Any Option or Variable
Create a Bash file with the following script that takes the input from the terminal using the read command without any option and variable. If no variable is used with the read command, the input value is stored in the $REPLY variable. The value of this variable is printed later after taking the input.
#!/bin/bash
#Print the prompt message
echo "Enter your favorite color: "
#Take the input
read
#Print the input value
echo "Your favorite color is $REPLY"
Output:
The following output appears if the “Blue” value is taken as an input:
Example 2: Using Read Command with a Variable
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable. The method of taking the single or multiple variables using a read command is shown in this example. The values of all variables are printed later.
#!/bin/bash
#Print the prompt message
echo "Enter the product name: "
#Take the input with a single variable
read item
#Print the prompt message
echo "Enter the color variations of the product: "
#Take three input values in three variables
read color1 color2 color3
#Print the input value
echo "The product name is $item."
#Print the input values
echo "Available colors are $color1, $color2, and $color3."
Output:
The following output appears after taking a single input first and three inputs later:
Example 3: Using Read Command with -p Option
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable and the -p option. The input value is printed later.
#!/bin/bash
#Take the input with the prompt message
read -p "Enter the book name: " book
#Print the input value
echo "Book name: $book"
Output:
The following output appears after taking the input:
Example 4: Using Read Command with -s Option
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable and the -s option. The input value of the password will not be displayed for the -s option. The input values are checked later for authentication. A success or failure message is also printed.
#!/bin/bash
#Take the input with the prompt message
read -p "Enter your email: " email
#Take the secret input with the prompt message
read -sp "Enter your password: " password
#Add newline
echo ""
#Check the email and password for authentication
if [[ $email == "admin@example.com" && $password == "secret" ]]
then
#Print the success message
echo "Authenticated."
else
#Print the failure message
echo "Not authenticated."
fi
Output:
The following output appears after taking the valid and invalid input values:
Example 5: Using Read Command with -a Option
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable and the -a option. The array values are printed later after taking the input values from the terminal.
#!/bin/bash
echo "Enter the country names: "
#Take multiple inputs using an array
read -a countries
echo "Country names are:"
#Read the array values
for country in ${countries[@]}
do
echo $country
done
Output:
The following output appears after taking the array values:
Example 6: Using Read Command with -n Option
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable and the -n option.
#!/bin/bash
#Print the prompt message
echo "Enter the product code: "
#Take the input of five characters
read -n 5 code
#Add newline
echo ""
#Print the input value
echo "The product code is $code"
Output:
The following output appears if the “78342” value is taken as input:
Example 7: Using Read Command with -t Option
Create a Bash file with the following script that takes the input from the terminal using the read command with a variable and the -t option.
#!/bin/bash
#Print the prompt message
echo -n "Write the result of 10-6: "
#Take the input within 3 seconds
read -t 3 answer
#Check the input value
if [[ $answer == "4" ]]
then
echo "Correct answer."
else
echo "Incorrect answer."
fi
Output:
The following output appears after taking the correct and incorrect input values:
The uses of some useful options of the read command have been explained in this tutorial with multiple examples to demonstrate its basic usage.
Original article source at: https://linuxhint.com/
This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.
You'll probably already know about Apache Spark, the fast, general and open-source engine for big data processing; It has built-in modules for streaming, SQL, machine learning and graph processing. Spark allows you to speed analytic applications up to 100 times faster compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.
Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.
Without further ado, here's the cheat sheet:
This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You'll also see that this cheat sheet covers how to run SQL queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.
Spark SQL is Apache Spark's module for working with structured data.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
.builder\
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
>>> from pyspark.sql.types import *
Infer Schema
>>> from pyspark.sql import Row
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
From Spark Data Sources
JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")
Parquet files
>>> df3 = spark.read.load("users.parquet")
TXT files
>>> df4 = spark.read.text("people.txt")
#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()
>>> df = df.dropDuplicates()
>>> from pyspark.sql import functions as F
Select
>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
.show()
>>> df.select("firstName", #Show all entries in firstName, age and type
"age",
explode("phoneNumber") \
.alias("contactInfo")) \
.select("contactInfo.type",
"firstName",
"age") \
.show()
>>> df.select(df["firstName"], df["age"] + 1).show() #Show all entries in firstName and age, add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24
When
>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
F.when(df.age > 30, 1) \
.otherwise(0)) \
.show()
>>> df[df.firstName.isin("Jane","Boris")] #Show firstName if in the given options
.collect()
Like
>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
df.lastName.like("Smith")) \
.show()
Startswith - Endswith
>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
df.lastName \
.startswith("Sm")) \
.show()
>>> df.select(df.lastName.endswith("th"))\ #Show last names ending in th
.show()
Substring
>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
.alias("name")) \
.collect()
Between
>>> df.select(df.age.between(22, 24)) \ #Show age: values are TRUE if between 22 and 24
.show()
Adding Columns
>>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
.withColumn('state',df.address.state) \
.withColumn('streetAddress',df.address.streetAddress) \
.withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
.withColumn('telePhoneType', explode(df.phoneNumber.type))
Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')
Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)
>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
>>> df.na \ #Return new df replacing one value with another
.replace(10, 20) \
.show()
>>> df.groupBy("age")\ #Group by age, count the members in the groups
.count() \
.show()
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
.collect()
>>> df.repartition(10)\ #df with 10 partitions
.rdd \
.getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition
Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
Query Views
>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
.show()
>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans
Data Structures
>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into a RDD of string
>>> df.toPandas() #Return the contents of df as Pandas DataFrame
Write & Save to Files
>>> df.select("firstName", "city")\
.write \
.save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
>>> spark.stop()
Original article source at https://www.datacamp.com
#pyspark #cheatsheet #spark #dataframes #python #bigdata
Learn how to use the Huggingface transformers library to generate conversational responses with the pretrained DialoGPT model in Python.
Chatbots have gained a lot of popularity in recent years, and as interest in using chatbots for business grows, researchers have also done great work on advancing conversational AI chatbots.
In this tutorial, we will use the Huggingface transformers library to employ the pretrained DialoGPT model for conversational response generation.
DialoGPT is a large-scale, tunable neural conversational response generation model trained on 147 million conversations extracted from Reddit, and the good thing is that you can fine-tune it on your own dataset to achieve better performance than training from scratch.
To get started, let's install transformers:
$ pip3 install transformers
Open a new Python file or notebook and do the following:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# model_name = "microsoft/DialoGPT-large"
model_name = "microsoft/DialoGPT-medium"
# model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
There are three versions of DialoGPT: small, medium, and large. Of course, the larger, the better, but if you run this on your machine, I think the small or medium size fits in memory with no problems. You can also use Google Colab to try out the large one.
In this section, we will use the greedy search algorithm to generate responses. That is, at each time step we select the chatbot response that has the highest probability of being selected.
Let's write some code to chat with our AI using greedy search:
# chatting 5 times with greedy search
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
pad_token_id=tokenizer.eos_token_id,
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print(f"DialoGPT: {output}")
Let's explain the core of this code:
- We first take input from the user for chatting.
- We encode the text into input_ids using the DialoGPT tokenizer, append the end-of-string token, and return it as a PyTorch tensor.
- If this is the first time chatting with the bot, we feed input_ids directly to our model for generation. Otherwise, we append the chat history using concatenation with the help of the torch.cat() method.
- After that, we use the model.generate() method to generate the chatbot response.
- Lastly, we decode the returned sequence with tokenizer.decode() and set skip_special_tokens to True to make sure we don't see any annoying special tokens such as <|endoftext|>. Also, since the model returns the whole sequence, we skip the previous chat history and print only the newly generated chatbot answer.
Below is a sample conversation with the bot:
>> You:How can you be rich so quickly?
DialoGPT: I'm not rich, I'm just a rich man.
>> You:Ok, you don't make sense, how old are you ?
DialoGPT: I'm not a rich man, I'm just a rich man.
>> You:How can you be happy ?
DialoGPT: I'm not happy, I'm just a rich man.
>> You:But they say money buys happiness!
DialoGPT: Money buys happiness, but not happiness itself.
>> You:You don't make sense again!
DialoGPT: I'm not happy, I'm just a rich man.
You will notice that the model repeats a lot of responses, since those have the highest probability and it picks them every time.
By default, model.generate() uses a greedy search algorithm when no other parameters are set; in the following sections, we will add some arguments to this method to see if we can improve the generation.
Beam search allows us to reduce the risk of missing high-probability sequences by keeping the num_beams most likely hypotheses at each time step and then taking the sequences that have the overall highest probability. The code below generates chatbot responses with beam search:
# chatting 5 times with beam search
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
num_beams=3,
early_stopping=True,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print(f"DialoGPT: {output}")
When we set num_beams to 3 in the model.generate() method, we select 3 words at each time step and develop them to find the overall highest-probability sequence; setting num_beams to 1 is the same as greedy search.
Below is a sample conversation with the chatbot using beam search:
>> You:How can I be rich?
DialoGPT: You can't.
>> You:Why not?
DialoGPT: Because you can't.
>> You:Then how am I supposed to live well?
DialoGPT: You can't.
>> You:So basically, you want me to be a poor guy?
DialoGPT: No, I want you to be a rich guy.
>> You:Ok how to be rich, then?
DialoGPT: I don't know.
In the previous sections, we used beam and greedy search to generate the highest-probability sequence. That is great for tasks such as machine translation or text summarization where the output is predictable. However, it is not the best option for open-ended generation, as in chatbots.
For better generation, we need to introduce some randomness by sampling from a wide range of candidate sequences based on their probabilities:
# chatting 5 times with sampling
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_k=0,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print(f"DialoGPT: {output}")
This time, we set do_sample to True for sampling, and we set top_k to 0, indicating that we are selecting from all possible probabilities; we will discuss the top_k parameter later.
Here is a chat with these parameters:
>> You:Hello, how can I be rich?
DialoGPT: You can't
>> You:Really? then how others get rich?
DialoGPT: They work as money artists
>> You:How to be a money artist?
DialoGPT: I don't know, I don't have enough knowledge on artists
>> You:Alright, thank you!
DialoGPT: No problem, if you need tips pm me.
>> You:Great, you're my savior!
DialoGPT: Thank you for helping a stranger out like me!
There are clearly some improvements. However, sampling over a wide list of sequences with low probabilities can lead to random generation (as you can see in the last sentence).
To improve it further, we can:
- Lower the sampling temperature, which helps us decrease the likelihood of picking low-probability words and increase the likelihood of picking high-probability words.
- Use Top-K sampling instead of sampling over all possible words, so that low-probability words are excluded from being picked.
# chatting 5 times with Top K sampling & tweaking temperature
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_k=100,
temperature=0.75,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print(f"DialoGPT: {output}")
Now, we set top_k to 100 to sample from the top 100 words sorted descending by probability. We also set temperature to 0.75 (the default is 1.0) to give a higher chance of picking high-probability words; setting the temperature to 0.0 is the same as greedy search, while setting it to infinity is the same as sampling completely at random.
Here is a conversation with these parameters:
>> You:How can I get rich?
DialoGPT: I'd also like to know this.
>> You:So you don't know it?
DialoGPT: I know it, but it doesn't matter.
>> You:Really? money matters, man!
DialoGPT: Well, money is money and I gotta get my money.
>> You:Yeah how to get them?
DialoGPT: Well, that's a question. I'd like to know more, just for curiosity's sake, to know what other people have said about how I should get rich.
>> You:Some robots like you told me you have to buy gold, silver or stocks.
DialoGPT: I think my wallet is empty.
As you can see, it is much better now; feel free to tweak the temperature and top_k parameters and see if you can improve the output.
Nucleus sampling, or Top-p sampling, chooses from the smallest possible set of words whose cumulative probability exceeds the parameter p that we set.
Below is an example using Top-p sampling:
# chatting 5 times with nucleus sampling & tweaking temperature
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_p=0.95,
top_k=0,
temperature=0.75,
pad_token_id=tokenizer.eos_token_id
)
#print the output
output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print(f"DialoGPT: {output}")
We set top_k to 0 to disable Top-K sampling, but you can use both methods together, which tends to work better. Here is a chat:
>> You:How can I be rich fast?
DialoGPT: Inflation and free money
>> You:lol, but that takes time, I need fast money.
DialoGPT: Your post makes no sense. You just need a lot of money to be rich fast
>> You:Ok how to get a lot of money?
DialoGPT: Buy gold and then trade it for money.
>> You:Hmm, I got your point, thanks.
DialoGPT: No problem. You have to have money to buy gold, but its only good for trading it.
>> You:Yeah, I understand, thank you so much.
DialoGPT: Ok, let me know if you need help
Now the chatbot clearly makes sense in many cases.
Now let's add some code to generate more than one chatbot response, and then we choose which response to include in the next input:
# chatting 5 times with nucleus & top-k sampling & tweaking temperature & multiple
# sentences
for step in range(5):
# take user input
text = input(">> You:")
# encode the input and add end of string token
input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
# concatenate new user input with chat history (if there is)
bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
# generate a bot response
chat_history_ids_list = model.generate(
bot_input_ids,
max_length=1000,
do_sample=True,
top_p=0.95,
top_k=50,
temperature=0.75,
num_return_sequences=5,
pad_token_id=tokenizer.eos_token_id
)
#print the outputs
for i in range(len(chat_history_ids_list)):
output = tokenizer.decode(chat_history_ids_list[i][bot_input_ids.shape[-1]:], skip_special_tokens=True)
print(f"DialoGPT {i}: {output}")
choice_index = int(input("Choose the response you want for the next input: "))
chat_history_ids = torch.unsqueeze(chat_history_ids_list[choice_index], dim=0)
I set num_return_sequences to 5 to return 5 sentences at a time; we then have to choose the one to include in the next sequence. Here is how it went:
>> You:Hello, friend.
DialoGPT 0: Hey there!
DialoGPT 1: I am your friend, and I am also your friend.
DialoGPT 2: Hello. :
DialoGPT 3: Hiya! How are you?
DialoGPT 4: Hello. I like your username.
Choose the response you want for the next input: 3
>> You:Fine, thanks, what about you?
DialoGPT 0: Good. Not much to do.
DialoGPT 1: Not much, you?
DialoGPT 2: Fine, thanks.
DialoGPT 3: I'm okay, thank you for asking.
DialoGPT 4: I'm fine too.
Choose the response you want for the next input: 2
>> You:Can you tell me how can I get rich quickly?
DialoGPT 0: Yes, I can!
DialoGPT 1: Yes, just look up how to get rich on Google.
DialoGPT 2: Just keep doing your thing. You'll get rich.
DialoGPT 3: No, I can't.
DialoGPT 4: no, i have no money
Choose the response you want for the next input: 1
>> You:Yeah, but I thought you have an answer...
DialoGPT 0: I'm getting married in the morning.
DialoGPT 1: How do you get rich?
DialoGPT 2: I think there are many good answers, but I could send you one, if you want.
DialoGPT 3: I will, thanks!
DialoGPT 4: I thought you have an answer.
Choose the response you want for the next input: 2
>> You:Yes, please send me.
DialoGPT 0: You can send me link.
DialoGPT 1: I am very interested
DialoGPT 2: I sent you a PM
DialoGPT 3: I'll send you a PM
DialoGPT 4: I am always interested in new ideas.
Choose the response you want for the next input: 2
And there you have it. I hope this tutorial helped you generate text with DialoGPT and similar models. For more information on text generation, I recommend reading the How to Generate Text with Transformers guide.
I'll leave you to tweak the parameters and see if you can make the bot perform better.
You can also combine this with text-to-speech and speech-to-text tutorials to build a virtual assistant like Alexa, Siri, Cortana, etc.
#python #chatbot #ai