1598792160

If you are reading this, then you probably tried to predict who will survive the Titanic shipwreck. This Kaggle competition is a canonical example of machine learning, and a right of passage for any aspiring data scientist. What if instead of predicting *who* will survive, you only had to predict *how many* will survive? Or, what if you had to predict the *average age* of survivors, or the *sum of fare* that the survivors paid?

There are many applications where classification predictions need to be aggregated. For example, a customer churn model may generate probabilities that a customer will churn but the business may be interested in *how many* customers are predicted to churn, or *how much revenue* will be lost. Similarly, a model may give a probability that a flight will be delayed but we may want to know how many flights will be delayed, or how many passengers are affected. Hong (2013) lists a number of other examples from actuarial assessment to warranty claims.

Most binary classification algorithms estimate probabilities that an example belongs to the positive class. If we treat these probabilities as known values (rather than estimates), then the number of positive cases is a random variable with Poisson Binomial probability distribution. (If the probabilities were all the same, the distribution would be Binomial.) Similarly, the sum of a two-value random variables where one value is a zero and the other value some other number (e.g. age, revenue) is distributed as Generalized Poisson Binomial. Under these assumptions we can report mean values as well as prediction intervals. In summary, if we had the true classification probabilities, then we could construct the probability distributions of any aggregate outcome (number of survivors, age, revenue, etc.).

Of course, the classification probabilities we obtain from machine learning models are just estimates. Therefore, treating the probabilities as known values may not be appropriate. (Essentially, we would be ignoring the sampling error in estimating these probabilities.) However, if we are interested only in the aggregate characteristics of survivors, perhaps we should focus on estimating parameters that describe the probability distributions of these aggregate characteristics. In other words, we should recognize that we have a *numerical prediction* problem rather than a *classification* problem.

In this note I compare two approaches to getting aggregate characteristics of Titanic survivors. The first is to *classify and then aggregate*. I estimate three popular classification models and then aggregate the resulting probabilities to get aggregate characteristics of survivors. The second approach is *a regression model* to estimate how aggregate characteristics of a group of passengers affect the share that survives. I evaluate each approach using many random splits of test and train data. The conclusion is that many classification models do poorly when the classification probabilities are aggregated.

#machine-learning #titanic #classification #aggregation #predictions

1598792160

If you are reading this, then you probably tried to predict who will survive the Titanic shipwreck. This Kaggle competition is a canonical example of machine learning, and a right of passage for any aspiring data scientist. What if instead of predicting *who* will survive, you only had to predict *how many* will survive? Or, what if you had to predict the *average age* of survivors, or the *sum of fare* that the survivors paid?

There are many applications where classification predictions need to be aggregated. For example, a customer churn model may generate probabilities that a customer will churn but the business may be interested in *how many* customers are predicted to churn, or *how much revenue* will be lost. Similarly, a model may give a probability that a flight will be delayed but we may want to know how many flights will be delayed, or how many passengers are affected. Hong (2013) lists a number of other examples from actuarial assessment to warranty claims.

Most binary classification algorithms estimate probabilities that an example belongs to the positive class. If we treat these probabilities as known values (rather than estimates), then the number of positive cases is a random variable with Poisson Binomial probability distribution. (If the probabilities were all the same, the distribution would be Binomial.) Similarly, the sum of a two-value random variables where one value is a zero and the other value some other number (e.g. age, revenue) is distributed as Generalized Poisson Binomial. Under these assumptions we can report mean values as well as prediction intervals. In summary, if we had the true classification probabilities, then we could construct the probability distributions of any aggregate outcome (number of survivors, age, revenue, etc.).

Of course, the classification probabilities we obtain from machine learning models are just estimates. Therefore, treating the probabilities as known values may not be appropriate. (Essentially, we would be ignoring the sampling error in estimating these probabilities.) However, if we are interested only in the aggregate characteristics of survivors, perhaps we should focus on estimating parameters that describe the probability distributions of these aggregate characteristics. In other words, we should recognize that we have a *numerical prediction* problem rather than a *classification* problem.

In this note I compare two approaches to getting aggregate characteristics of Titanic survivors. The first is to *classify and then aggregate*. I estimate three popular classification models and then aggregate the resulting probabilities to get aggregate characteristics of survivors. The second approach is *a regression model* to estimate how aggregate characteristics of a group of passengers affect the share that survives. I evaluate each approach using many random splits of test and train data. The conclusion is that many classification models do poorly when the classification probabilities are aggregated.

#machine-learning #titanic #classification #aggregation #predictions

1623223443

Predictive modeling in data science is used to answer the question “What is going to happen in the future, based on known past behaviors?” Modeling is an essential part of data science, and it is mainly divided into predictive and preventive modeling. Predictive modeling, also known as predictive analytics, is the process of using data and statistical algorithms to predict outcomes with data models. Anything from sports outcomes, television ratings to technological advances, and corporate economies can be predicted using these models.

**Classification Model:**It is the simplest of all predictive analytics models. It puts data in categories based on its historical data. Classification models are best to answer “yes or no” types of questions.**Clustering Model:**This model groups data points into separate groups, based on similar behavior.- **Forecast Model: **One of the most widely used predictive analytics models. It deals with metric value prediction, and this model can be applied wherever historical numerical data is available.
**Outliers Model:**This model, as the name suggests, is oriented around exceptional data entries within a dataset. It can identify exceptional figures either by themselves or in concurrence with other numbers and categories.**Time Series Model:**This predictive model consists of a series of data points captured, using time as the input limit. It uses the data from previous years to develop a numerical metric and predicts the next three to six weeks of data using that metric.

#big data #data science #predictive analytics #predictive analysis #predictive modeling #predictive models

1617419868

As AI becomes more ubiquitous, it’s also become more autonomous — able to act on its own without human supervision. This demonstrates progress, but it also introduces concerns around control over AI. The AI Arms Race has driven organizations everywhere to deliver the most sophisticated algorithms around, but this can come at a price, ignoring cultural and ethical values that are critical to responsible AI. Here are five predictions on what we should expect to see in AI in 2021:

- Something’s going to give around AI governance
- Most consumers will continue to be sceptical of AI
- Digital transformation (DX) finds its moment
- Organizations will increasingly push AI to the edge
- ModelOps will become the “go-to” approach for AI deployment.

#opinions #2021 ai predictions #ai predictions for 2021 #artificial intelligence predictions #five artificial intelligence predictions for 2021

1653377002

This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning.

Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed analytic applications up to 100 times faster compared to technologies on the market today. You can interface Spark with Python through "PySpark". This is the Spark Python API exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with **Pandas DataFrames**, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform, inspect, ... your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations.

Let's face it, `map()`

and `flatMap()`

are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like `reduce()`

and `reduceByKey()`

?

Even though the **documentation** is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it.

This PySpark cheat sheet covers the basics, from initializing Spark and loading your data, to retrieving RDD information, sorting, filtering and sampling your data. But that's not all. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included in the cheat sheet.

**Note** that the examples in the document take small data sets to illustrate the effect of specific functions on your data. In real life data analysis, you'll be using Spark to analyze big data.

**PySpark is the Spark Python API that exposes the Spark programming model to Python.**

```
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')
```

```
>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationld #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs
```

```
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
.setMaster("local")
.setAppName("My app")
. set ("spark. executor.memory", "lg"))
>>> sc = SparkContext(conf = conf)
```

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

```
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[s] --py-files code.py
```

Set which master the context connects to with the --master argument, and add Python .zip..egg or.py files to the

runtime path by passing a comma-separated list to --py-files.

```
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd = sc.parallelize([("a",["x","y","z"]),
("b" ["p","r,"])])
```

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

```
>>> textFile = sc.textFile("/my/directory/•.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
```

```
>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances 3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements 4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True
```

```
>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
0
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
```

```
#Apply a function to each RFD element
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()
[('a' ,7,7, 'a'),('a' ,2,2, 'a'), ('b' ,2,2, 'b')]
#Apply a function to each RDD element and flatten the result
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))
>>> rdd5.collect()
['a',7 , 7 , 'a' , 'a' , 2, 2, 'a', 'b', 2 , 2, 'b']
#Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
>>> rdds.flatMapValues(lambda x: x).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'),('b', 'p'),('b', 'r')]
```

**Getting**

```
>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]
```

**Sampling**

```
>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
```

**Filtering**

```
>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a' ,2, 'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
```

```
>>> def g (x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
```

**Reducing**

```
>>> rdd.reduceByKey(lambda x,y : x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a+ b) #Merge the rdd values
('a', 7, 'a' , 2 , 'b' , 2)
```

** **

**Grouping by**

```
>>> rdd3.groupBy(lambda x: x % 2) #Return RDD of grouped values
.mapValues(list)
.collect()
>>> rdd.groupByKey() #Group rdd by key
.mapValues(list)
.collect()
[('a',[7,2]),('b',[2])]
```

**Aggregating**

```
>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0),seqOp,combOp)
(4950,100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0),seqop,combop).collect()
[('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> rdd3.fold(0,add)
4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
[('a' ,9), ('b' ,2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
```

```
>>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b' ,2), ('a' ,7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
[('d', 1)1
>>>rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
```

```
>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) ROD by key
[('a' ,2), ('b' ,1), ('d' ,1)]
```

```
>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
```

```
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs:// namenodehost/parent/child",
'org.apache.hadoop.mapred.TextOutputFormat')
```

`>>> sc.stop()`

`$ ./bin/spark-submit examples/src/main/python/pi.py`

**Have this Cheat Sheet at your fingertips**

Original article source athttps://www.datacamp.com

**#pyspark #cheatsheet #spark #python**

1601269980

In this analysis i’ll build a model that will predict whether a tumor is malignant or benign, based on data from a study on breast cancer. Classification algorithms will be used in the modelling process.

*The dataset*

**The data for this analysis refer to 569 patients from a study on breast cancer. The actual data can be found at UCI (Machine Learning Repository): **https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The variables were computed from a digitized image of a breast mass and describe characteristics of the cell nucleus present in the image. In particular the variables are the following:

- **radius **(mean of distances from center to points on the perimeter)
- **texture **(standard deviation of gray-scale values)
**perimeter****area**- **smoothness **(local variation in radius lengths)
- **compactness **(perimeter^² / area — 1.0)
- **concavity **(severity of concave portions of the contour)
- **concave points **(number of concave portions of the contour)
**symmetry****fractal dimension**(“coastline approximation” — 1)- **type **(tumor can be either malignant -M- or benign -B-)

#predictive-analytics #logistic-regression #machine-learning #classification #decision-tree-classifier