1623373480

# A Friendly Introduction to Linear Algebra for ML (ML Tech Talks)

In this session of Machine Learning Tech Talks, Tai-Danae Bradley, Postdoc at X, the Moonshot Factory, will share a few ideas for linear algebra that appear in the context of Machine Learning.

Chapters:
0:00 - Introduction
1:37 - Data Representations
15:02 - Vector Embeddings
31:52 - Dimensionality Reduction
37:11 - Conclusion

Resources:

#machine-learning #data-science

1624447260

## Course Review: Python for Linear Algebra

Because I am continuously endeavouring to improve my knowledge and skill of the Python programming language, I decided to take some free courses in an attempt to improve upon my knowledge base. I found one such course on linear algebra, which I found on YouTube. I decided to watch the video and undertake the course work because it focused on the Python programming language, something that I wanted to improve my skill in. Youtube video this course review was taken from:- (4) Python for linear algebra (for absolute beginners) — YouTube

The course is for absolute beginners, which is good because I have never studied linear algebra and had no idea what the terms I would be working with were.

Linear algebra is the branch of mathematics concerning linear equations, such as linear maps and their representations in vector spaces and through matrices. Linear algebra is central to almost all areas of mathematics.

Whilst studying linear algebra, I have learned a few topics that I had not previously known. For example:-

A scalar is simply a number, being an integer or a float. Scalers are convenient in applications that don’t need to be concerned with all the ways that data can be represented in a computer.

A vector is a one dimensional array of numbers. The difference between a vector is that it is mutable, being known as dynamic arrays.

A matrix is similar to a two dimensional rectangular array of data stored in rows and columns. The data stored in the matrix can be strings, numbers, etcetera.

In addition to the basic components of linear algebra, being a scalar, vector and matrix, there are several ways the vectors and matrix can be manipulated to make it suitable for machine learning.

I used Google Colab to code the programming examples and the assignments that were given in the 1 hour 51 minute video. It took a while to get into writing the code of the various subjects that were studied because, as the video stated, it is a course for absolute beginners.

The two main libraries that were used for this course were numpy and matplotlib. Numpy is the library that is used to carry out algebraic operations and matplotlib is used to graphically plot the points that are created in the program.

#numpy #matplotlib #python #linear-algebra #course review: python for linear algebra #linear algebra

1623373480

## A Friendly Introduction to Linear Algebra for ML (ML Tech Talks)

In this session of Machine Learning Tech Talks, Tai-Danae Bradley, Postdoc at X, the Moonshot Factory, will share a few ideas for linear algebra that appear in the context of Machine Learning.

Chapters:
0:00 - Introduction
1:37 - Data Representations
15:02 - Vector Embeddings
31:52 - Dimensionality Reduction
37:11 - Conclusion

Resources:

#machine-learning #data-science

1595377260

## Linear Algebra: The hidden engine of machine learning

Algebra is firstly taken from a book, written by Khwarizmi(780-850 CE), which is about calculation and equations. It is a branch of mathematics in which letters are used instead of numbers. Each letter can represent a specific number in one place, and a completely different number in another. Notations and symbols are also used in algebra to show the relationship between numbers. I remember about 17 years ago when I was an ordinary student in applied mathematics(ordinary graduate today!), I was so curious about some research in algebra, done by Maryam Mirzakhani(1977–2017), at Harvard University about analogous counting problem. This science has evolved a lot throughout history and now includes many branches.

Elementary algebra includes basic operations on four main operations. After defining the signs by which fixed numbers and variables are separated, methods are used to solve the equations. A polynomial is an expression that is the sum of a finite number of non-zero terms, each term consisting of the product of a constant and a finite number of variables raised to whole number powers.

Abstract algebra or modern algebra is a group in the algebra family that studies advanced algebraic structures such as groups, rings, and fields. Algebraic structures, with their associated homomorphisms, form mathematical categories. Category theory is a formalism that allows a unified way for expressing properties and constructions that are similar for various structures. Abstract algebra is so popular and used in many fields of mathematics and engineering sciences. For instance, algebraic topology uses algebraic objects to study topologies. The Poincaré conjecture proved in 2003, asserts that the fundamental group of a manifold, which encodes information about connectedness, can be used to determine whether a manifold is a sphere or not. Algebraic number theory studies various number rings that generalize the set of integers.

I believe that the most influencing branch of algebra in other sciences is linear algebra. Let’s suppose that you went out for jogging, that can’t be easy these days with Covid-19 lockdown, and suddenly a beautiful flower catches all your attention. Please don’t be rush and pick it, just take a picture, others can enjoy it, as well. After a while when you look at this picture, you can recognize the flower in the image, because the human brain has evolved over millions of years and able to detect such a thing. We are unaware of the operations that take place in the background of our brains and enable us to recognize the colors in the image, they are trained to do this for us automatically. But, it’s not easy to do such a thing with a machine, that’s why this is one of the most active research areas in machine learning and deep learning. Actually, the fundamental question is: “How does a machine store this image?” You probably know that today’s computers are designed to process only two numbers, 0 and 1. Now, how can an image like this with different features be stored? This is done by storing the pixel intensity in a structure called a “matrix”.

The main topics in linear algebra are vectors and matrices. Vectors are geometric objects that have length and direction. For example, we can mention speed and force, both of which are vector quantities. Each vector is represented by an arrow whose length and direction indicate the size and direction of the vector.

The addition of two or more vectors can be done based on ease of use using parallelogram methods or the method of images in which each vector is decomposed into its components along the coordinate axes. A vector space is a collection of vectors, which may be added together and multiplied by scalars. Scalars generally can be picked up from any field but normally are real numbers.

#matrice #machine-learning #linear-algebra #algebra #deep-learning #deep learning

1650643200

## tech.ml.dataset

`tech.ml.dataset` is a Clojure library for data processing and machine learning. Datasets are currently in-memory columnwise databases and we support parsing from file or input-stream. We support these formats: raw/gzipped csv/tsv, xls, xlsx, json, and sequences of maps as input sources. SQL and Clojurescript bindings are provided as separate libraries.

Data size in memory is minimized (primitive arrays), datetime types are often converted to an integer representation and strings are loaded into string tables. These features together dramatically decrease the working set size in memory. Because data is stored in columnar fashion columnwise operations on the dataset are very fast.

Conversion back into sequences of maps is very efficient and we have support for writing the dataset back out to csv, tsv, and gzipped varieties of those.

We have upgraded support for Apache Arrow. We have full support including mmap support for JDK-8->JDK-17 although if you are on an M-1 Mac you will need to use JDK-17. We also support per-column compression (LZ4, ZSTD) across all supported platforms. The official Arrow SDK does not support mmap, JDK-17, and has no user-accessible way to save a compressed streaming format file.

Large aggregations of potentially out-of-memory datasets are represented by a sequence of datasets. This is consistent with the design of the parquet and arrow data storage systems and aggregation operations involving large-scale datasets are efficiently implemented in the tech.v3.dataset.reductions namespace. We have started to integrate algorithms from the Apache Data Sketches system in the apache-data-sketch namespace. Summations/means in this area are implemented using the Kahan compensated summation algorithm.

## Mini Walkthrough

``````user> (require '[tech.v3.dataset :as ds])
nil
;; We support many file formats
user> (def csv-data (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"))
#'user/csv-data
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |

;; tech.v3.libs.poi registers xls, tech.v3.libs.fastexcel registers xlsx.  If you want
;; to use poi for everything use workbook->datasets in the tech.v3.libs.poi namespace.
user> (require '[tech.v3.libs.poi])
nil
user> (def xls-data (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls"))
#'user/xls-data
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/file_example_XLS_1000.xls [5 8]:

| column-0 | First Name | Last Name | Gender |       Country |  Age |       Date |     Id |
|----------|------------|-----------|--------|---------------|------|------------|--------|
|      1.0 |      Dulce |     Abril | Female | United States | 32.0 | 15/10/2017 | 1562.0 |
|      2.0 |       Mara | Hashimoto | Female | Great Britain | 25.0 | 16/08/2016 | 1582.0 |
|      3.0 |     Philip |      Gent |   Male |        France | 36.0 | 21/05/2015 | 2587.0 |
|      4.0 |   Kathleen |    Hanner | Female | United States | 25.0 | 15/10/2017 | 3549.0 |
|      5.0 |    Nereida |   Magwood | Female | United States | 58.0 | 16/08/2016 | 2468.0 |

;;And you have fine grained control over parsing

{:parser-fn {"Date" [:local-date "dd/MM/yyyy"]}}))
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/file_example_XLS_1000.xls [5 8]:

| column-0 | First Name | Last Name | Gender |       Country |  Age |       Date |     Id |
|----------|------------|-----------|--------|---------------|------|------------|--------|
|      1.0 |      Dulce |     Abril | Female | United States | 32.0 | 2017-10-15 | 1562.0 |
|      2.0 |       Mara | Hashimoto | Female | Great Britain | 25.0 | 2016-08-16 | 1582.0 |
|      3.0 |     Philip |      Gent |   Male |        France | 36.0 | 2015-05-21 | 2587.0 |
|      4.0 |   Kathleen |    Hanner | Female | United States | 25.0 | 2017-10-15 | 3549.0 |
|      5.0 |    Nereida |   Magwood | Female | United States | 58.0 | 2016-08-16 | 2468.0 |
user>

user>
user> (def airports (ds/->dataset "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"
#'user/airports
https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat [5 14]:

| column-0 |                                    column-1 |     column-2 |         column-3 | column-4 | column-5 |    column-6 |     column-7 | column-8 | column-9 | column-10 |            column-11 | column-12 |   column-13 |
|----------|---------------------------------------------|--------------|------------------|----------|----------|-------------|--------------|----------|----------|-----------|----------------------|-----------|-------------|
|        1 |                              Goroka Airport |       Goroka | Papua New Guinea |      GKA |     AYGA | -6.08168983 | 145.39199829 |     5282 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        2 |                              Madang Airport |       Madang | Papua New Guinea |      MAG |     AYMD | -5.20707989 | 145.78900147 |       20 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        3 |                Mount Hagen Kagamuga Airport |  Mount Hagen | Papua New Guinea |      HGU |     AYMH | -5.82678986 | 144.29600525 |     5388 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        4 |                              Nadzab Airport |       Nadzab | Papua New Guinea |      LAE |     AYNZ | -6.56980300 | 146.72597700 |      239 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |
|        5 | Port Moresby Jacksons International Airport | Port Moresby | Papua New Guinea |      POM |     AYPY | -9.44338036 | 147.22000122 |      146 |     10.0 |         U | Pacific/Port_Moresby |   airport | OurAirports |

;;At any point you can get a sequence of maps back.  We implement a special version
;;of Clojure's APersistentMap that is much more efficient than even records and shares
;;the backing store with the dataset.

({"date" #object[java.time.LocalDate 0x4a998af0 "2000-01-01"],
"symbol" "MSFT",
"price" 39.81}
{"date" #object[java.time.LocalDate 0x6d8c0bcd "2000-02-01"],
"symbol" "MSFT",
"price" 36.35})

;;Datasets are comprised of named columns, and provide a Clojure hashmap-compatible
;;collection.  Datasets allow reading and updating column data associated with a column name,
;;and provide a sequential view of [column-name column] entries.

;;You can look up columns via `get`, keyword lookup, and invoking the dataset as a function on
;;a key (a column name). `keys` and `vals` retrieve respective sequences of column names and columns.
;;The functions `assoc` and `dissoc` work to define new associations to conveniently

;;Column data is stored in primitive arrays (even most datetimes!) and strings are stored
;;in string tables.  You can load really large datasets with this thing!

;;Columns themselves are sequences of their entries.
user> (csv-data "symbol")
#tech.v3.dataset.column<string>[560]
symbol
[MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, ...]
user> (xls-data "Gender")
#tech.v3.dataset.column<string>[1000]
Gender
[Female, Female, Male, Female, Female, Male, Female, Female, Female, Female, Female, Male, Female, Male, Female, Female, Female, Female, Female, Female, ...]
user> (take 5 (xls-data "Gender"))
("Female" "Female" "Male" "Female" "Female")

;;Datasets and columns implement the clojure metadata interfaces (`meta`, `with-meta`, `vary-meta`)

;;You can access a sequence of columns of a dataset with `ds/columns`, or `vals` like a map,
;;and access the metadata with `meta`:

user> (->> csv-data
vals  ;synonymous with ds/columns
(map (fn [column]
(meta column))))
({:categorical? true, :name "symbol", :size 560, :datatype :string}
{:name "date", :size 560, :datatype :packed-local-date}
{:name "price", :size 560, :datatype :float32})

;;We can similarly destructure datasets like normal clojure
;;maps:

user> (for [[k column] csv-data]
[k (meta column)])
(["symbol" {:categorical? true, :name "symbol", :size 560, :datatype :string}]
["date" {:name "date", :size 560, :datatype :packed-local-date}]
["price" {:name "price", :size 560, :datatype :float64}])

user> (let [{:strs [symbol date]} csv-data]
[symbol (meta date)])
[#tech.v3.dataset.column<string>[560]
symbol
[MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, MSFT, ...]
{:name "date", :size 560, :datatype :packed-local-date}]

;;We can get a brief description of the dataset:

user> (ds/brief csv-data)
({:min #object[java.time.LocalDate 0x5b2ea1d5 "2000-01-01"],
:n-missing 0,
:col-name "date",
:mean #object[java.time.LocalDate 0x729b7395 "2005-05-12"],
:datatype :packed-local-date,
:quartile-3 #object[java.time.LocalDate 0x6c75fa43 "2007-11-23"],
:n-valid 560,
:quartile-1 #object[java.time.LocalDate 0x13d9aabe "2002-11-08"],
:max #object[java.time.LocalDate 0x493bf7ef "2010-03-01"]}
{:min 5.97,
:n-missing 0,
:col-name "price",
:mean 100.7342857142857,
:datatype :float64,
:skew 2.4130946430619233,
:standard-deviation 132.55477114107083,
:quartile-3 100.88,
:n-valid 560,
:quartile-1 24.169999999999998,
:max 707.0}
{:mode "MSFT",
:values ["MSFT" "AMZN" "IBM" "AAPL" "GOOG"],
:n-values 5,
:n-valid 560,
:col-name "symbol",
:n-missing 0,
:datatype :string,
:histogram (["MSFT" 123] ["AMZN" 123] ["IBM" 123] ["AAPL" 123] ["GOOG" 68])})

;;Another view of that brief:

user> (ds/descriptive-stats csv-data)
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/stocks.csv: descriptive-stats [3 10]:

| :col-name |          :datatype | :n-valid | :n-missing |       :min |      :mean | :mode |       :max | :standard-deviation |      :skew |
|-----------|--------------------|----------|------------|------------|------------|-------|------------|---------------------|------------|
|      date | :packed-local-date |      560 |          0 | 2000-01-01 | 2005-05-12 |       | 2010-03-01 |                     |            |
|     price |           :float64 |      560 |          0 |      5.970 |      100.7 |       |      707.0 |        132.55477114 | 2.41309464 |
|    symbol |            :string |      560 |          0 |            |            |  MSFT |            |                     |            |

;;There are analogues of the clojure.core functions that apply to dataset:
;;filter, group-by, sort-by.  These are all implemented efficiently.

;;You can add/remove/update columns, or use the map idioms of `assoc` and `dissoc`

user> (-> csv-data
(assoc "always-ten" 10) ;scalar values are expanded as needed
(assoc "random"   (repeatedly (ds/row-count csv-data) #(rand-int 100)))
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/stocks.csv [5 5]:

| symbol |       date | price | always-ten | random |
|--------|------------|-------|------------|--------|
|   MSFT | 2000-01-01 | 39.81 |         10 |     47 |
|   MSFT | 2000-02-01 | 36.35 |         10 |     35 |
|   MSFT | 2000-03-01 | 43.22 |         10 |     54 |
|   MSFT | 2000-04-01 | 28.37 |         10 |      6 |
|   MSFT | 2000-05-01 | 25.45 |         10 |     52 |

user> (-> csv-data
(dissoc "price")
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/stocks.csv [5 2]:

| symbol |       date |
|--------|------------|
|   MSFT | 2000-01-01 |
|   MSFT | 2000-02-01 |
|   MSFT | 2000-03-01 |
|   MSFT | 2000-04-01 |
|   MSFT | 2000-05-01 |

;;since `conj` works as with clojure maps and sequences of map-entries or pairs,
;;you can use idioms like `reduce conj` or `into` to construct new datasets on the
;;fly with familiar clojure idioms:

user> (let [new-cols [["always-ten" 10] ["new-price" (map inc (csv-data "price"))]]
new-data (into (dissoc csv-data "price") new-cols)]
https://github.com/techascent/tech.v3.dataset/raw/master/test/data/stocks.csv [5 4]:

| symbol |       date | always-ten | new-price |
|--------|------------|------------|-----------|
|   MSFT | 2000-01-01 |         10 |     40.81 |
|   MSFT | 2000-02-01 |         10 |     37.35 |
|   MSFT | 2000-03-01 |         10 |     44.22 |
|   MSFT | 2000-04-01 |         10 |     29.37 |
|   MSFT | 2000-05-01 |         10 |     26.45 |

;;You can write out the result back to csv, tsv, and gzipped variations of those.

;;Joins (left, right, inner) are all implemented.

;;Columnwise arithmetic manipulations (+,-, and many more) are provided via the
;;tech.v2.datatype.functional namespace.

;;Datetime columns can be operated on - plus,minus, get-years, get-days, and
;;many more - uniformly via the tech.v2.datatype.datetime.operations namespace.

;;There is much more.  Please checkout the walkthough and try it out!
``````

### Arrow Support

JDK-17, compression and memory mapping are supported - Arrow api.

### Parquet Support

Parquet now has first class support. That means we should be able to load most Parquet files and support their full range of datatypes.

## Questions, Community

Author: techascent
Source Code: https://github.com/techascent/tech.ml.dataset

1593865080

## Linear Regression with Knime

### Exploring the Dataset

LEGO is a popular brand of toy building bricks. They are often sold in sets to build a specific object. Each set is designed for a particular age-group, with a theme in mind and containing a different number of pieces. Each set has a different rating and price. Using this data, we want to design a Linear Regression model with Knime that can predict the price of a given Lego set.

The Lego Dataset we are using looks like this:

The different features in the dataset are:

FeatureDescriptionDataTypeageWhich age categories it belongs toStringlist_priceprice of the set (in \$)Doublenum_reviewsnumber of reviews per setIntegerpiece_countnumber of pieces in that lego setIntegerplay_star___ratingratingsDoublereview_difficultydifficulty level of the setStringstar_ratingratingsDoubletheme_namewhich theme it belongsStringval___star_ratingratingsDoublecountrycountry nameString

### Pre-Processing and Cleaning the data

Having a look at the data, you may notice that some of the features in the dataset are textual in nature. Thus, they don’t add value to the prediction model.

So, once the file is read into Knime using a **File Reader **node, we need to apply the first pre-processing step to the data. We will read the features with nominal values and map every category in that feature to an integer. Knime’s Category to Number nodes does the job for us.

Now, our complete dataset is in a numerical format. So, the next step is to remove any numeric outliers that may exist. Outliers are extreme values in a feature that deviate from other observations on data. They might exist due to experimental errors or variability in measurement. They need to be removed as they may have an effect on the statistics involved in the data. Knime’s Numeric Outliers node gives us an option to remove the rows with outliers.

After the outliers are removed, the next step is to use Knime’s Missing Value node that allows us to replace all missing values in a feature with a fixed value, the feature’s mean, or any other statistic.

### Removing Multi-Collinearity

Linear Regression model works under the assumption that there is no relation between independent features. Correlation should exist only between the independent features and the target feature. If multi-collinearityexists, then the overall performance of the model is affected.

To calculate the correlation between the independent features, we configure the Rank Correlation node to use Spearman’s Rank Correlation. The output of the node is a** correlation matrix**.

From the output, it is clear that there are some independent features that are highly correlated to each other. To filter these columns out, we use Knime’s Correlation Filter node that allows us to set a threshold value on the correlation value of the output matrix. It filters out the columns with correlation more than the threshold value.

From the output of the above node, it is clear that we don’t want to keep star_rating, theme_name, and val_star_rating features. So, using the Column Filter node to our cleaned data, we filter out the unwanted features.

### Train – Test Split

Finally, we have our dataset in a form that can be used for training a linear regressor and testing it. Before that, the last step we need to do is split the complete data into Train and Test data. To do so, we use Knime’s **Partitioning **node. In its configuration, we specify to split the data randomly with 70 % as our train data and the remaining as our test data.

### Training and Testing the model

Knime provides a Linear Regression Learner and **Regression Predictor **node for creating a Linear Regression Learner and Predictor. We feed the train data from partitioning node to the Learner node, and it produces a Predictor Model.

Then, we feed the output model and Test data to the Predictor node that churns out the predicted values for lego set prices.

#ml # ai and data engineering #tech blogs #data science #gui analytics #knime #linear regression #regression models