# Improve person re-identification with face detection (FaceBoxes)

Person re-identification is an interesting and not completely solved task. It involves finding (localizing) a person in an image and creating a digital description (a vector, or embedding) for the photo of that person, such that the vectors for different photos of the same person lie closer to each other than to the vectors generated for photos of other people.

Person re-identification is used in many tasks, including visitor flow analysis in a shopping center, tracking people across cameras, and finding a certain person in a huge collection of photos.

Many effective models and approaches have been created recently to address re-identification tasks. A full list of those models can be found here. But even the best models still face many problems, such as variations in pose and viewpoint: the embeddings for photos of a person taken from different angles can end up too far from each other, and the system may decide that these are photos of different people.

The latest state-of-the-art models, such as Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification, are designed to deal with the problems mentioned above, but we at ai-labs.org came up with a lightweight approach that greatly simplifies the task of re-identification in some situations. I will describe this approach in more detail.

Let’s start by explaining how most re-id frameworks detect photos of a particular person in an image. Commonly used object detection models, such as Faster R-CNN or EfficientDet, produce a bounding box for the entire human body. After the photo of the entire body is extracted, an embedding is created for it.
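
To make this pipeline concrete, here is a minimal sketch of the detect-crop-embed-compare flow. The detect_people and embed functions are placeholders for any detector and any re-id embedding network; only the distance logic is spelled out.

#minimal sketch of the detect-crop-embed-compare flow;
#detect_people and embed are placeholder callables, not a specific library API
import numpy as np

def reid_distance(img_a, img_b, detect_people, embed):
    #detect_people(img) is assumed to return bounding boxes as (x, y, w, h)
    (x, y, w, h) = detect_people(img_a)[0]
    crop_a = img_a[y:y+h, x:x+w]
    (x, y, w, h) = detect_people(img_b)[0]
    crop_b = img_b[y:y+h, x:x+w]
    #embed(crop) is assumed to map a body crop to a fixed-length vector
    a, b = embed(crop_a), embed(crop_b)
    #cosine distance: small for the same person, large for different people
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))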

The problem is that object detection models often work almost too well: they find photos of people from a wide variety of viewpoints, and not always of the best quality. Embeddings based on such photos often do not allow correct re-identification of a person, or they are generated in such a way that the embedding for a photo of a person from one viewpoint is close only to embeddings for photos from the same viewpoint, but not to embeddings for photos of the same person taken from a different viewpoint or distance.

#computer-vision #deep-learning #machine-learning #ai

## deepface

Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for Python. It is a hybrid face recognition framework wrapping state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib.

Experiments show that human beings have 97.53% accuracy on facial recognition tasks, whereas these models have already reached and surpassed that accuracy level.

## Installation

The easiest way to install deepface is to download it from PyPI. It's going to install the library itself and its prerequisites as well. The library is mainly powered by TensorFlow and Keras.

pip install deepface


Then you will be able to import the library and use its functionalities.

from deepface import DeepFace


Facial Recognition - Demo

A modern face recognition pipeline consists of 5 common stages: detect, align, normalize, represent and verify. While Deepface handles all these common stages in the background, you don’t need to acquire in-depth knowledge about all the processes behind it. You can just call its verification, find or analysis function with a single line of code.

Face Verification - Demo

This function verifies face pairs as the same person or different persons. It expects exact image paths as inputs. Passing numpy or base64-encoded images is also welcome. It returns a dictionary, and you should check just its verified key.

result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg")


Face recognition - Demo

Face recognition requires applying face verification many times. Herein, deepface has an out-of-the-box find function to handle this action. It looks for the identity of the input image in the database path and returns a pandas data frame as output.

df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db")


Face recognition models - Demo

Deepface is a hybrid face recognition package. It currently wraps many state-of-the-art face recognition models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib. The default configuration uses the VGG-Face model.

models = ["VGG-Face", "Facenet", "Facenet512", "OpenFace", "DeepFace", "DeepID", "ArcFace", "Dlib"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", model_name = models[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", model_name = models[1])


FaceNet, VGG-Face, ArcFace and Dlib outperform the others based on experiments. You can find the scores of those models on both the Labeled Faces in the Wild and YouTube Faces in the Wild data sets, as declared by their creators.

Similarity

Face recognition models are regular convolutional neural networks, and they are responsible for representing faces as vectors. We expect a face pair of the same person to be more similar than a face pair of different persons.

Similarity can be calculated with different metrics, such as cosine similarity, Euclidean distance and the Euclidean L2 form. The default configuration uses cosine similarity.

metrics = ["cosine", "euclidean", "euclidean_l2"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", distance_metric = metrics[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", distance_metric = metrics[1])


The Euclidean L2 form seems to be more stable than cosine and regular Euclidean distance, based on experiments.
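
For reference, the three metrics can be written out in a few lines of numpy. This is a sketch of the standard definitions (reading euclidean_l2 as Euclidean distance between L2-normalised vectors), not a copy of deepface's internal code.

import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def euclidean_l2_distance(a, b):
    #Euclidean distance after L2-normalising each embedding
    a, b = np.asarray(a), np.asarray(b)
    return np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b))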

Facial Attribute Analysis - Demo

Deepface also comes with a strong facial attribute analysis module, covering age, gender, facial expression (angry, fear, neutral, sad, disgust, happy and surprise) and race (asian, white, middle eastern, indian, latino and black) predictions.

obj = DeepFace.analyze(img_path = "img4.jpg", actions = ['age', 'gender', 'race', 'emotion'])


The age model has a mean absolute error of ±4.65; the gender model has 97.44% accuracy, 96.29% precision and 95.05% recall, as mentioned in its tutorial.

Streaming and Real Time Analysis - Demo

You can run deepface on real-time video as well. The stream function accesses your webcam and applies both face recognition and facial attribute analysis. The function starts to analyze a frame once it can focus on a face for 5 sequential frames. It then shows the results for 5 seconds.

DeepFace.stream(db_path = "C:/User/Sefik/Desktop/database")


Even though face recognition is based on one-shot learning, you can use multiple face pictures of a person as well. You should rearrange your directory structure as illustrated below.

user
├── database
│   ├── Alice
│   │   ├── Alice1.jpg
│   │   ├── Alice2.jpg
│   ├── Bob
│   │   ├── Bob.jpg


Face Detectors - Demo

Face detection and alignment are important early stages of a modern face recognition pipeline. Experiments show that alignment alone increases face recognition accuracy by almost 1%. OpenCV, SSD, Dlib, MTCNN, RetinaFace and MediaPipe detectors are wrapped in deepface.

All deepface functions accept an optional detector backend input argument. You can switch among those detectors with this argument. OpenCV is the default detector.

backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface', 'mediapipe']

#face verification
obj = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", detector_backend = backends[4])

#face recognition
df = DeepFace.find(img_path = "img.jpg", db_path = "my_db", detector_backend = backends[4])

#facial analysis
demography = DeepFace.analyze(img_path = "img4.jpg", detector_backend = backends[4])

#face detection and alignment
face = DeepFace.detectFace(img_path = "img.jpg", target_size = (224, 224), detector_backend = backends[4])


Face recognition models are actually CNN models, and they expect standard-sized inputs. So, resizing is required before representation. To avoid deformation, deepface adds black padding pixels according to the target size argument after detection and alignment.
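
As an illustration of that padding step (a sketch of the idea, not deepface's exact code), one can resize while preserving the aspect ratio and fill the remainder with black pixels:

import cv2
import numpy as np

def pad_to_target(face, target_size=(224, 224)):
    #resize while keeping the aspect ratio, then pad with black pixels
    h, w = face.shape[:2]
    scale = min(target_size[0] / h, target_size[1] / w)
    face = cv2.resize(face, (int(w * scale), int(h * scale)))
    pad_h = target_size[0] - face.shape[0]
    pad_w = target_size[1] - face.shape[1]
    return cv2.copyMakeBorder(face, pad_h // 2, pad_h - pad_h // 2,
                              pad_w // 2, pad_w - pad_w // 2,
                              cv2.BORDER_CONSTANT, value=0)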

RetinaFace and MTCNN seem to outperform the others in the detection and alignment stages, but they are much slower. If the speed of your pipeline is more important, you should use opencv or ssd. On the other hand, if accuracy matters more, you should use retinaface or mtcnn.

The performance of RetinaFace is very satisfactory even in crowds, as seen in the following illustration. Besides, it comes with incredible facial landmark detection performance. The highlighted red points show some facial landmarks, such as the eyes, nose and mouth. That is why the alignment score of RetinaFace is high as well.

You can find out more about RetinaFace on this repo.

API - Demo

Deepface serves an API as well. You can clone /api/api.py and run it with the python command. This will bring up a REST service. In this way, you can call deepface from an external system such as a mobile app or the web.

python api.py


Face recognition, facial attribute analysis and vector representation functions are covered in the API. You are expected to call these functions as HTTP POST requests. The service endpoints are http://127.0.0.1:5000/verify for face recognition, http://127.0.0.1:5000/analyze for facial attribute analysis, and http://127.0.0.1:5000/represent for vector representation. You should pass input images as base64-encoded strings in this case. Here, you can find a postman project.
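
As a sketch, a client call could look like the following. The payload field names here are illustrative only; check the postman project mentioned above for the exact JSON schema your deepface version expects.

import base64
import requests

with open("img1.jpg", "rb") as f:
    img_b64 = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

#field names below are hypothetical; consult the postman project for the real schema
payload = {"img": img_b64}
response = requests.post("http://127.0.0.1:5000/analyze", json=payload)
print(response.json())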

Tech Stack - Vlog, Tutorial

Face recognition models represent facial images as vector embeddings. The idea behind facial recognition is that vectors should be more similar for the same person than for different persons. The question is where and how to store facial embeddings in a large-scale system. Herein, deepface offers a representation function to find vector embeddings from facial images.

embedding = DeepFace.represent(img_path = "img.jpg", model_name = 'Facenet')


There is a vast tech stack for storing vector embeddings. To determine the right tool, you should consider your task (face verification or face recognition), your priority (speed or confidence), and also your data size.
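
Whatever store you pick, the lookup itself reduces to a nearest-neighbour search over the stored vectors. Here is a minimal in-memory sketch, assuming represent returns a plain vector as in the call shown above (file names are illustrative):

import numpy as np
from deepface import DeepFace

#build a small in-memory index of known identities
db = {name: np.asarray(DeepFace.represent(img_path = name + ".jpg", model_name = 'Facenet'))
      for name in ["alice", "bob"]}

query = np.asarray(DeepFace.represent(img_path = "img.jpg", model_name = 'Facenet'))

#cosine distance to every stored identity; the smallest distance wins
dists = {name: 1 - np.dot(v, query) / (np.linalg.norm(v) * np.linalg.norm(query))
         for name, v in db.items()}
print(min(dists, key=dists.get))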

## Contribution

Pull requests are welcome. You should run the unit tests locally by running test/unit_tests.py. Please share the unit test result logs in the PR. Deepface is currently compatible with both TF 1 and TF 2 versions. Change requests should satisfy both requirements.

## Support

There are many ways to support a project - starring⭐️ the GitHub repo is just one 🙏

You can also support this work on Patreon

## Citation

@inproceedings{serengil2020lightface,
title        = {LightFace: A Hybrid Deep Face Recognition Framework},
author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
booktitle    = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
pages        = {23-27},
year         = {2020},
doi          = {10.1109/ASYU50717.2020.9259802},
url          = {https://doi.org/10.1109/ASYU50717.2020.9259802},
organization = {IEEE}
}

@inproceedings{serengil2021lightface,
title        = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
booktitle    = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
pages        = {1-4},
year         = {2021},
doi          = {10.1109/ICEET53442.2021.9659697},
url          = {https://doi.org/10.1109/ICEET53442.2021.9659697},
organization = {IEEE}
}


Author: serengil
Source Code: https://github.com/serengil/deepface

## Introduction

What is face recognition? Or what is recognition? When you look at an apple, your mind immediately tells you that this is an apple. This process, your mind telling you that this is an apple, is recognition in simple words. So what is face recognition then? I am sure you have guessed it right. When you look at your friend walking down the street, or at a picture of him, you recognize that he is your friend Paulo. Interestingly, when you look at your friend or a picture of him, you look at his face first before looking at anything else. Ever wondered why you do that? It is so that you can recognize him by looking at his face. Well, this is you doing face recognition.

But the real question is: how does face recognition work? It is quite simple and intuitive. Take a real-life example: when you meet someone for the first time in your life, you don't recognize him, right? While he talks or shakes hands with you, you look at his face, eyes, nose, mouth, color and overall look. This is your mind learning, or training, for the face recognition of that person by gathering face data. Then he tells you that his name is Paulo. At this point your mind knows that the face data it just learned belongs to Paulo. Now your mind is trained and ready to do face recognition on Paulo's face. Next time you see Paulo or his face in a picture, you will immediately recognize him. This is how face recognition works. The more you meet Paulo, the more data your mind will collect about Paulo, especially his face, and the better you will become at recognizing him.

Now the next question is how to code face recognition with OpenCV; after all, this is the only reason you are reading this article, right? OK then. You might say that our mind can do these things easily, but coding them into a computer must be difficult? Don't worry, it is not. Thanks to OpenCV, coding face recognition is as easy as it feels. The coding steps for face recognition are the same as we discussed in the real-life example above.

• Training Data Gathering: Gather the face data (face images in this case) of the persons you want to recognize.
• Training of Recognizer: Feed that face data (and the respective name of each face) to the face recognizer so that it can learn.
• Recognition: Feed new faces of those persons and see if the face recognizer you just trained recognizes them.

OpenCV comes equipped with a built-in face recognizer; all you have to do is feed it the face data. It's that simple, and this is how it will look once we are done coding it.

## OpenCV Face Recognizers

OpenCV has three built-in face recognizers and, thanks to OpenCV's clean coding, you can use any of them by changing just a single line of code. Below are the names of those face recognizers and their OpenCV calls.

1. EigenFaces Face Recognizer - cv2.face.createEigenFaceRecognizer()
2. FisherFaces Face Recognizer - cv2.face.createFisherFaceRecognizer()
3. Local Binary Patterns Histograms (LBPH) Face Recognizer - cv2.face.createLBPHFaceRecognizer()

We have got three face recognizers, but do you know which one to use and when? Or which one is better? I guess not. So why not go through a brief summary of each, what do you say? I am assuming you said yes :) So let's dive into the theory of each.

### EigenFaces Face Recognizer

This algorithm considers the fact that not all parts of a face are equally important and equally useful. When you look at someone, you recognize him/her by distinct features like the eyes, nose, cheeks and forehead, and how they vary with respect to each other. So you are actually focusing on the areas of maximum change (mathematically speaking, this change is variance) of the face. For example, from the eyes to the nose there is a significant change, and the same is the case from the nose to the mouth. When you look at multiple faces, you compare them by looking at these parts of the faces, because these parts are the most useful and important components of a face. Important because they catch the maximum change among faces, the change that helps you differentiate one face from the other. This is exactly how the EigenFaces face recognizer works.

The EigenFaces face recognizer looks at all the training images of all the persons as a whole and tries to extract the components which are important and useful (the components that catch the maximum variance/change), discarding the rest. This way it not only extracts the important components from the training data but also saves memory by discarding the less important ones. The important components it extracts are called principal components. Below is an image showing the principal components extracted from a list of faces.

Principal Components source

You can see that the principal components actually represent faces; these faces are called eigenfaces, hence the name of the algorithm.

So this is how the EigenFaces face recognizer trains itself (by extracting principal components). Remember, it also keeps a record of which principal components belong to which person. One thing to note in the above image is that the Eigenfaces algorithm also considers illumination as an important component.

Later, during recognition, when you feed a new image to the algorithm, it repeats the same process on that image. It extracts the principal components of the new image, compares them with the components it stored during training, finds the best match, and returns the person label associated with that best-match component.
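
The whole train-then-match cycle can be sketched in a few lines with OpenCV's PCA helper. This is only an illustration of the idea (project faces onto principal components, pick the nearest training face); the random arrays stand in for real flattened face images, and this is not the recognizer's internal code.

import cv2
import numpy as np

#rows of X are flattened, equally sized grayscale face images (random stand-ins here)
X = np.random.rand(20, 64 * 64).astype(np.float32)
labels = np.arange(20) % 2  #stand-in person labels

#compute the mean face and the top principal components (the "eigenfaces")
mean, eigenvectors = cv2.PCACompute(X, mean=None, maxComponents=10)
train_proj = (X - mean) @ eigenvectors.T  #each face as 10 principal components

#project a new face and return the label of the nearest training face
new_face = np.random.rand(1, 64 * 64).astype(np.float32)
new_proj = (new_face - mean) @ eigenvectors.T
best = np.argmin(np.linalg.norm(train_proj - new_proj, axis=1))
print("predicted label:", labels[best])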

Easy peasy, right? Next one is easier than this one.

### FisherFaces Face Recognizer

This algorithm is an improved version of the EigenFaces face recognizer. The Eigenfaces face recognizer looks at all the training faces of all the persons at once and finds the principal components from all of them combined. By capturing principal components from all of them combined, you are not focusing on the features that discriminate one person from another, but on the features that represent all the persons in the training data as a whole.

This approach has drawbacks. For example, images with sharp changes (like lighting changes, which are not a useful feature at all) may dominate the rest of the images, and you may end up with features that come from an external source like light and are not useful for discrimination at all. In the end, your principal components will represent lighting changes and not the actual face features.

The Fisherfaces algorithm, instead of extracting useful features that represent all the faces of all the persons, extracts useful features that discriminate one person from the others. This way, the features of one person do not dominate over the others, and you have the features that discriminate one person from the others.

Below is an image of features extracted using Fisherfaces algorithm.

Fisher Faces source

You can see that the extracted features actually represent faces; these faces are called fisherfaces, hence the name of the algorithm.

One thing to note here is that even in the Fisherfaces algorithm, if multiple persons have images with sharp changes due to external sources like light, those changes will dominate over other features and affect recognition accuracy.

Getting bored with this theory? Don't worry, only one face recognizer is left and then we will dive deep into the coding part.

### Local Binary Patterns Histograms (LBPH) Face Recognizer

I wrote a detailed explanation of Local Binary Patterns Histograms in my previous article on face detection using local binary patterns histograms. So here I will just give a brief overview of how it works.

We know that Eigenfaces and Fisherfaces are both affected by light, and in real life we can't guarantee perfect lighting conditions. The LBPH face recognizer is an improvement that overcomes this drawback.

The idea is not to look at the image as a whole but instead to find the local features of an image. The LBPH algorithm tries to find the local structure of an image, and it does that by comparing each pixel with its neighboring pixels.

Take a 3x3 window and move it over the whole image; at each move (each local part of the image), compare the pixel at the center with its neighbor pixels. The neighbors with an intensity value less than or equal to the center pixel are denoted by 1, and the others by 0. Then you read these 0/1 values under the 3x3 window in clockwise order, and you get a binary pattern like 11100011 that is local to some area of the image. Doing this over the whole image gives you a list of local binary patterns.

LBP Labeling
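
Here is a small numpy sketch of exactly that computation for one 3x3 window, using the thresholding convention described above; a real implementation simply repeats it for every pixel of the image.

import numpy as np

window = np.array([[90, 80, 70],
                   [60, 50, 40],
                   [30, 20, 10]])  #one 3x3 patch of a grayscale image
center = window[1, 1]

#clockwise neighbours starting at the top-left corner
neighbours = [window[0, 0], window[0, 1], window[0, 2], window[1, 2],
              window[2, 2], window[2, 1], window[2, 0], window[1, 0]]

#neighbours less than or equal to the centre become 1, others 0
bits = ''.join('1' if n <= center else '0' for n in neighbours)
print(bits, '->', int(bits, 2))  #prints 00011110 -> 30 for this window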

Now you get why this algorithm has Local Binary Patterns in its name: because you get a list of local binary patterns. Now you may be wondering, what about the Histograms part of LBPH? Well, after you get a list of local binary patterns, you convert each binary pattern into a decimal number (as shown in the above image) and then you make a histogram of all of those decimal values. A sample histogram looks like this.

Sample Histogram

I guess this answers the question about the histogram part. So in the end, you will have one histogram for each face image in the training data set. That means that if there were 100 images in the training data set, LBPH will extract 100 histograms after training and store them for later recognition. Remember, the algorithm also keeps track of which histogram belongs to which person.

Later, during recognition, when you feed a new image to the recognizer, it generates a histogram for that new image, compares that histogram with the histograms it already has, finds the best-matching histogram, and returns the person label associated with it.
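
Put differently, recognition is a histogram lookup. Below is a minimal sketch with random stand-in LBP images, comparing histograms with OpenCV's chi-square distance (the choice of distance here is an assumption for illustration; implementations differ).

import cv2
import numpy as np

def lbp_histogram(lbp_img):
    #256-bin histogram of LBP codes, as float32 for cv2.compareHist
    hist, _ = np.histogram(lbp_img, bins=256, range=(0, 256))
    return hist.astype(np.float32)

#random stand-ins for the LBP images of the training faces and a test face
train_hists = [lbp_histogram(np.random.randint(0, 256, (100, 100))) for _ in range(3)]
test_hist = lbp_histogram(np.random.randint(0, 256, (100, 100)))

#a smaller chi-square distance means a better match
scores = [cv2.compareHist(test_hist, h, cv2.HISTCMP_CHISQR) for h in train_hists]
print("best matching training image:", int(np.argmin(scores)))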

Below is a list of faces and their respective local binary patterns images. You can see that the LBP images are not affected by changes in light conditions.

LBP Faces source

The theory part is over and now comes the coding part! Ready to dive into coding? Let's get into it then.

## Coding Face Recognition with OpenCV

The Face Recognition process in this tutorial is divided into three steps.

1. Prepare training data: In this step we will read the training images for each person/subject along with their labels, detect the face in each image, and assign each detected face an integer label of the person it belongs to.
2. Train the Face Recognizer: In this step we will train OpenCV's LBPH face recognizer by feeding it the data we prepared in step 1.
3. Testing: In this step we will pass some test images to the face recognizer and see if it predicts them correctly.

[There should be a visualization diagram for above steps here]

To detect faces, I will use the code from my previous article on face detection. So if you have not read it, I encourage you to do so to understand how face detection works and its Python coding.

### Import Required Modules

Before starting the actual coding we need to import the required modules for coding. So let's import them first.

• cv2: the OpenCV module for Python, which we will use for face detection and face recognition.
• os: we will use this Python module to read our training directories and file names.
• numpy: we will use this module to convert Python lists to numpy arrays, as OpenCV face recognizers expect numpy arrays.

#import OpenCV module
import cv2
#import os module for reading training data directories and paths
import os
#import numpy to convert python lists to numpy arrays as
#it is needed by OpenCV face recognizers
import numpy as np

#matplotlib for displaying our images
import matplotlib.pyplot as plt
%matplotlib inline


### Training Data

The more images used in training, the better. Normally many images are used to train a face recognizer so that it can learn different looks of the same person: for example, with glasses, without glasses, laughing, sad, happy, crying, with a beard, without a beard, etc. To keep this tutorial simple, we are going to use only 12 images per person.

So our training data consists of a total of 2 persons, with 12 images each. All training data is inside the training-data folder, which contains one folder for each person, named with the format sLabel (e.g. s1, s2), where Label is the integer label assigned to that person. For example, the folder named s1 contains the images of person 1. The directory structure tree for the training data is as follows:

training-data
|-------------- s1
|               |-- 1.jpg
|               |-- ...
|               |-- 12.jpg
|-------------- s2
|               |-- 1.jpg
|               |-- ...
|               |-- 12.jpg


The test-data folder contains images that we will use to test our face recognizer after it has been successfully trained.

As the OpenCV face recognizer accepts labels as integers, we need to define a mapping between integer labels and the persons' actual names, so below I am defining a mapping of the persons' integer labels to their respective names.

Note: As we have not assigned label 0 to any person, the mapping for label 0 is empty.

#there is no label 0 in our training data so subject name for index/label 0 is empty
subjects = ["", "Tom Cruise", "Shahrukh Khan"]


### Prepare training data

You may be wondering why data preparation, right? Well, the OpenCV face recognizer accepts data in a specific format. It accepts two vectors: one vector of the faces of all the persons and a second vector of integer labels for each face, so that when processing a face the face recognizer knows which person that particular face belongs to.

For example, suppose we have 2 persons and 2 images for each person.

PERSON-1    PERSON-2

img1        img1
img2        img2


Then the prepare-data step will produce the following face and label vectors.

FACES                        LABELS

person1_img1_face              1
person1_img2_face              1
person2_img1_face              2
person2_img2_face              2


The prepare-data step can be further divided into the following sub-steps.

1. Read all the folder names of the subjects/persons provided in the training data folder. For example, in this tutorial we have the folder names s1 and s2.
2. For each subject, extract the label number. Do you remember that our folders have a special naming convention? Folder names follow the format sLabel, where Label is an integer representing the label we have assigned to that subject. So, for example, the folder name s1 means that the subject has label 1, s2 means the subject's label is 2, and so on. The label extracted in this step is assigned to each face detected in the next step.
3. Read all the images of the subject and detect the face in each image.
4. Add each face to the faces list, with the corresponding subject label (extracted in the step above) added to the labels list.

[There should be a visualization for above steps here]

Did you read my last article on face detection? No? Then you had better do so right now, because to detect faces I am going to use the code from that article. If you have not read it, I encourage you to do so to understand how face detection works and how it is coded. Below is the same code.

#function to detect face using OpenCV
def detect_face(img):
    #convert the test image to grayscale, as the OpenCV face detector expects gray images
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    #load the OpenCV face detector; I am using LBP which is fast,
    #there is also a more accurate but slow Haar classifier
    #(cascade file path as shipped with this tutorial's repository)
    face_cascade = cv2.CascadeClassifier('opencv-files/lbpcascade_frontalface.xml')

    #let's detect multiscale images (some faces may be closer to the camera than others);
    #the result is a list of face rectangles
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)

    #if no faces are detected then return the original img
    if (len(faces) == 0):
        return None, None

    #under the assumption that there will be only one face,
    #extract the face area
    (x, y, w, h) = faces[0]

    #return only the face part of the image
    return gray[y:y+h, x:x+w], faces[0]


I am using OpenCV's LBP face detector. First, I convert the image to grayscale because most operations in OpenCV are performed on grayscale images; then I load the LBP face detector using the cv2.CascadeClassifier class. After that I use the detectMultiScale method of cv2.CascadeClassifier to detect all the faces in the image. From the detected faces I pick only the first one, under the assumption that there will be only one prominent face per image. As the faces returned by the detectMultiScale method are actually rectangles (x, y, width, height) and not actual face images, we have to extract the face area from the main image. So finally I slice the face area out of the gray image and return both the face image area and the face rectangle.

Now you have got a face detector and you know the 4 steps to prepare the data, so are you ready to code the prepare data step? Yes? So let's do it.

#this function will read all persons' training images, detect the face in each image
#and will return two lists of exactly the same size: one list
#of faces and another list of labels for each face
def prepare_training_data(data_folder_path):

    #------STEP-1--------
    #get the directories (one directory for each subject) in the data folder
    dirs = os.listdir(data_folder_path)

    #list to hold all subject faces
    faces = []
    #list to hold labels for all subjects
    labels = []

    #let's go through each directory and read the images within it
    for dir_name in dirs:

        #ignore any non-relevant directories, if any
        if not dir_name.startswith("s"):
            continue

        #------STEP-2--------
        #extract the label number of the subject from dir_name
        #format of dir name = sLabel,
        #so removing the letter 's' from dir_name will give us the label
        label = int(dir_name.replace("s", ""))

        #build the path of the directory containing images for the current subject
        #sample subject_dir_path = "training-data/s1"
        subject_dir_path = data_folder_path + "/" + dir_name

        #get the names of the images inside the given subject directory
        subject_images_names = os.listdir(subject_dir_path)

        #------STEP-3--------
        #go through each image name, read the image,
        #detect the face and add the face to the list of faces
        for image_name in subject_images_names:

            #ignore system files like .DS_Store
            if image_name.startswith("."):
                continue

            #build the image path
            #sample image path = training-data/s1/1.pgm
            image_path = subject_dir_path + "/" + image_name

            #read the image
            image = cv2.imread(image_path)

            #display an image window to show the image
            cv2.imshow("Training on image...", image)
            cv2.waitKey(100)

            #detect face
            face, rect = detect_face(image)

            #------STEP-4--------
            #for the purpose of this tutorial
            #we will ignore faces that are not detected
            if face is not None:
                #add face to list of faces
                faces.append(face)
                #add label for this face
                labels.append(label)

    cv2.destroyAllWindows()
    cv2.waitKey(1)
    cv2.destroyAllWindows()

    return faces, labels


I have defined a function that takes, as a parameter, the path where the training subjects' folders are stored. This function follows the same 4 prepare-data sub-steps mentioned above.

(step-1) I use the os.listdir method to read the names of all folders stored at the path passed to the function as a parameter, and I define the faces and labels lists.

(step-2) After that I traverse the subjects' folder names and extract the label information from each of them. As folder names follow the sLabel naming convention, removing the letter s from a folder name gives us the label assigned to that subject.

(step-3) I then read all the image names of the current subject being traversed and process those images one by one. For each one, I use OpenCV's imshow(window_title, image) along with the waitKey(interval) method to display the image currently being traversed. The waitKey(interval) method pauses the code flow for the given interval in milliseconds; I am using it with a 100 ms interval so that we can view each image window for 100 ms. I then detect the face in the current image.

(step-4) Finally, I add the detected face and its label to their respective lists.

But a function can't do anything unless we call it on some data that it has to prepare, right? Don't worry, I have got data of two beautiful and famous celebrities. I am sure you will recognize them!

Let's call this function on images of these beautiful celebrities to prepare data for training of our Face Recognizer. Below is a simple code to do that.

#let's first prepare our training data
#data will be in two lists of same size
#one list will contain all the faces
#and other list will contain respective labels for each face
print("Preparing data...")
faces, labels = prepare_training_data("training-data")
print("Data prepared")

#print total faces and labels
print("Total faces: ", len(faces))
print("Total labels: ", len(labels))

Preparing data...
Data prepared
Total faces:  23
Total labels:  23


This was probably the boring part, right? Don't worry, the fun stuff is coming up next. It's time to train our own face recognizer so that, once trained, it can recognize new faces of the persons it was trained on. Ready? OK, then let's train our face recognizer.

### Train Face Recognizer

As we know, OpenCV comes equipped with three face recognizers.

1. EigenFace Recognizer: This can be created with cv2.face.createEigenFaceRecognizer()
2. FisherFace Recognizer: This can be created with cv2.face.createFisherFaceRecognizer()
3. Local Binary Patterns Histograms (LBPH) Recognizer: This can be created with cv2.face.createLBPHFaceRecognizer()

I am going to use the LBPH face recognizer, but you can use any face recognizer of your choice. No matter which of OpenCV's face recognizers you use, the code will remain the same. You just have to change one line: the face recognizer initialization line given below.

#create our LBPH face recognizer
face_recognizer = cv2.face.createLBPHFaceRecognizer()

#or use EigenFaceRecognizer by replacing above line with
#face_recognizer = cv2.face.createEigenFaceRecognizer()

#or use FisherFaceRecognizer by replacing above line with
#face_recognizer = cv2.face.createFisherFaceRecognizer()


Now that we have initialized our face recognizer and we also have prepared our training data, it's time to train the face recognizer. We will do that by calling the train(faces-vector, labels-vector) method of face recognizer.

#train our face recognizer of our training faces
face_recognizer.train(faces, np.array(labels))


Did you notice that instead of passing the labels vector directly to the face recognizer, I first convert it to a numpy array? This is because OpenCV expects the labels vector to be a numpy array.

Still not satisfied? Want to see some action? Next step is the real action, I promise!

### Prediction

Now comes my favorite part: prediction. This is where we actually get to see whether our algorithm is recognizing our trained subjects' faces or not. We will take two test images of our celebrities, detect the face in each of them, and then pass those faces to our trained face recognizer to see if it recognizes them.

Below are some utility functions that we will use to draw a bounding box (rectangle) around a face and to put the celebrity's name near the face bounding box.

#function to draw a rectangle on an image
#according to the given (x, y) coordinates and
#the given width and height
def draw_rectangle(img, rect):
    (x, y, w, h) = rect
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)

#function to draw text on a given image, starting from the
#passed (x, y) coordinates
def draw_text(img, text, x, y):
    cv2.putText(img, text, (x, y), cv2.FONT_HERSHEY_PLAIN, 1.5, (0, 255, 0), 2)


The first function, draw_rectangle, draws a rectangle on an image based on the passed rectangle coordinates. It uses OpenCV's built-in function cv2.rectangle(img, topLeftPoint, bottomRightPoint, color, lineWidth) to draw the rectangle (note that the color tuple is in BGR order). We will use it to draw a rectangle around the face detected in a test image.

The second function, draw_text, uses OpenCV's built-in function cv2.putText(img, text, startPoint, font, fontSize, color, lineWidth) to draw text on an image.

Now that we have the drawing functions, we just need to call the face recognizer's predict(face) method to test our face recognizer on test images. The following function does the prediction for us.

#this function recognizes the person in the image passed to it
#and draws a rectangle around the detected face with the name of the
#subject
def predict(test_img):
    #make a copy of the image, as we don't want to change the original image
    img = test_img.copy()
    #detect the face in the image
    face, rect = detect_face(img)

    #predict the image using our face recognizer
    #(note: in OpenCV 3.1+ predict returns a (label, confidence) tuple; unpack accordingly)
    label = face_recognizer.predict(face)
    #get the name of the respective label returned by the face recognizer
    label_text = subjects[label]

    #draw a rectangle around the detected face
    draw_rectangle(img, rect)
    #draw the name of the predicted person
    draw_text(img, label_text, rect[0], rect[1]-5)

    return img

• make a copy of the test image so we don't change the original
• detect the face in the test image
• recognize the face by calling the face recognizer's predict(face) method, which returns a label
• get the name associated with the label
• draw a rectangle around the detected face
• draw the name of the predicted subject above the face rectangle

Now that we have the prediction function well defined, the next step is to actually call this function on our test images and display those test images to see if our face recognizer correctly recognized them. So let's do it. This is what we have been waiting for.

print("Predicting images...")

#perform a prediction
predicted_img1 = predict(test_img1)
predicted_img2 = predict(test_img2)
print("Prediction complete")

#create a figure of 2 plots (one for each test image)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

#display test image1 result
ax1.imshow(cv2.cvtColor(predicted_img1, cv2.COLOR_BGR2RGB))

#display test image2 result
ax2.imshow(cv2.cvtColor(predicted_img2, cv2.COLOR_BGR2RGB))

#display both images
cv2.imshow("Tom cruise test", predicted_img1)
cv2.imshow("Shahrukh Khan test", predicted_img2)
cv2.waitKey(0)
cv2.destroyAllWindows()
cv2.waitKey(1)
cv2.destroyAllWindows()

Predicting images...
Prediction complete


Woohoo! Isn't it beautiful? Indeed, it is!

## End Notes

Face Recognition is a fascinating idea to work on and OpenCV has made it extremely simple and easy for us to code it. It just takes a few lines of code to have a fully working face recognition application and we can switch between all three face recognizers with a single line of code change. It's that simple.

Although the EigenFaces, FisherFaces and LBPH face recognizers are good, there are even better ways to perform face recognition, like using Histograms of Oriented Gradients (HOGs) and neural networks. So the more advanced face recognition algorithms are nowadays implemented using a combination of OpenCV and machine learning. I have plans to write some articles on those more advanced methods as well, so stay tuned!

Author: informramiz
Source Code: https://github.com/informramiz/opencv-face-recognition-python

## Top 6 Alternatives To Hugging Face

With Hugging Face raising $40 million in funding, NLP has the potential to provide us with a smarter world ahead. In recent news, the US-based NLP startup Hugging Face has raised a whopping $40 million in funding. The company is building a large open-source community to help the NLP ecosystem grow. Its transformers library is a Python-based library that exposes an API for using a variety of well-known transformer architectures such as BERT, RoBERTa, GPT-2, and DistilBERT. Here is a list of the top alternatives to Hugging Face.

#### Dialogflow

#opinions #alternatives to hugging face #chatbot #hugging face #hugging face ai #hugging face chatbot #hugging face gpt-2 #hugging face nlp #hugging face transformer #ibm watson #nlp ai #nlp models #transformers

## Finance Quant Machine Learning

### Introduction

Tabular augmentation is a new experimental space that makes use of novel and traditional data generation and synthesisation techniques to improve model prediction success. It is, in essence, a process of modular feature engineering and observation engineering that emphasises the order of augmentation to achieve the best predicted outcome from a given information set. DeltaPy was created with finance applications in mind, but it can be broadly applied to any data-rich environment.

To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: (1) transforming, (2) interacting, (3) mapping, (4) extracting, and (5) synthesising. What follows is a practical example of how the above methodology can be used. The purpose here is to establish a framework for table augmentation and to point and guide the user to existing packages.

For most readers, the Colab Notebook format might be preferred. I have enabled comments if you want to ask questions or address any issues you uncover. For anything pressing, use the issues tab. Also have a look at the SSRN report for more succinct insights.

Data augmentation can be defined as any method that increases the size or improves the quality of a dataset by generating new features or instances without the collection of additional data points. Data augmentation is of particular importance in image classification tasks, where additional data can be created by cropping, padding, or flipping existing images, as sketched below.
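
For intuition, the image variants mentioned above are one-liners in numpy; here is a quick sketch on a random stand-in image:

import numpy as np

img = np.random.rand(32, 32, 3)  #stand-in for a real image
flipped = img[:, ::-1, :]        #horizontal flip
cropped = img[2:30, 2:30, :]     #crop
padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))  #zero padding around the borders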

Tabular cross-sectional and time-series prediction tasks can also benefit from augmentation. Here we divide tabular augmentation into column-wise and row-wise methods. Row-wise methods are further divided into extraction and data synthesisation techniques, whereas column-wise methods are divided into transformation, interaction, and mapping methods.

See the Skeleton Example for a combination of multiple methods that leads to a halving of the mean squared error.

#### Installation & Citation

pip install deltapy

@software{deltapy,
title = {{DeltaPy}: Tabular Data Augmentation},
author = {Snow, Derek},
url = {https://github.com/firmai/deltapy/},
version = {0.1.0},
date = {2020-04-11},
}

 Snow, Derek, DeltaPy: A Framework for Tabular Data Augmentation in Python (April 22, 2020). Available at SSRN: https://ssrn.com/abstract=3582219


### Function Glossary

Transformation

df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()
df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0);
df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()


Interaction

df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()


Mapping

df_out = mapper.pca_feature(df.copy(),variance_or_components=0.80,drop_cols=["Close_1"]); df_out.head()
df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()


Extraction

extract.abs_energy(df["Close"])
extract.cid_ce(df["Close"], True)
extract.mean_abs_change(df["Close"])
extract.mean_second_derivative_central(df["Close"])
extract.variance_larger_than_standard_deviation(df["Close"])
extract.var_index(df["Close"].values,var_index_param)
extract.symmetry_looking(df["Close"])
extract.has_duplicate_max(df["Close"])
extract.partial_autocorrelation(df["Close"])
extract.augmented_dickey_fuller(df["Close"])
extract.gskew(df["Close"])
extract.stetson_mean(df["Close"])
extract.length(df["Close"])
extract.count_above_mean(df["Close"])
extract.longest_strike_below_mean(df["Close"])
extract.wozniak(df["Close"])
extract.last_location_of_maximum(df["Close"])
extract.fft_coefficient(df["Close"])
extract.ar_coefficient(df["Close"])
extract.index_mass_quantile(df["Close"])
extract.number_cwt_peaks(df["Close"])
extract.spkt_welch_density(df["Close"])
extract.linear_trend_timewise(df["Close"])
extract.c3(df["Close"])
extract.binned_entropy(df["Close"])
extract.svd_entropy(df["Close"].values)
extract.hjorth_complexity(df["Close"])
extract.max_langevin_fixed_point(df["Close"])
extract.percent_amplitude(df["Close"])
extract.zero_crossing_derivative(df["Close"])
extract.detrended_fluctuation_analysis(df["Close"])
extract.fisher_information(df["Close"])
extract.higuchi_fractal_dimension(df["Close"])
extract.petrosian_fractal_dimension(df["Close"])
extract.hurst_exponent(df["Close"])
extract.largest_lyauponov_exponent(df["Close"])
extract.whelch_method(df["Close"])
extract.find_freq(df["Close"])
extract.flux_perc(df["Close"])
extract.range_cum_s(df["Close"])
extract.structure_func(df["Close"])
extract.kurtosis(df["Close"])
extract.stetson_k(df["Close"])


Test sets should ideally not be preprocessed together with the training data, since in that way one could be peeking ahead by letting test information influence the training data. The preprocessing parameters should be identified on the training set and then applied to the test set; i.e., the test set should not have an impact on the transformation applied. As an example, you would learn the parameters of a PCA decomposition on the training set and then apply those parameters to both the train and the test set.
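
In code, "learn on train, apply to both" looks like this sketch of the robust-scaler parameters (column name and data are illustrative):

import numpy as np
import pandas as pd

train = pd.DataFrame({"Close": np.random.rand(100)})
test = pd.DataFrame({"Close": np.random.rand(20)})

#learn the scaling parameters on the training set only...
center = train.median()
scale = train.quantile(0.75) - train.quantile(0.25)

#...then apply the same parameters to both sets
train_scaled = (train - center) / scale
test_scaled = (test - center) / scale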

The benefit of pipelines becomes clear when one wants to apply multiple augmentation methods: they make it easy to learn the parameters and then apply them widely. For the most part, this notebook does not concern itself with 'peeking ahead' or pipelines; for some functions, one might have to restructure the code and make use of open-source packages to create a preferred solution.

## Documentation by Example

Notebook Dependencies

pip install deltapy

pip install pykalman
pip install tsaug
pip install ta
pip install pandasvault
pip install gplearn
pip install seasonal


import pandas as pd
import numpy as np
from deltapy import transform, interact, mapper, extract
import warnings
warnings.filterwarnings('ignore')

#helper to prepare the example data; assumes a DataFrame `df` of price data with
#"Date" and "Close" columns has already been loaded (e.g. via pd.read_csv)
def data_copy():
    df_copy = df.copy()
    #next day's close as the prediction target
    df_copy["Close_1"] = df_copy["Close"].shift(-1)
    df_copy = df_copy.dropna()
    df_copy["Date"] = pd.to_datetime(df_copy["Date"])
    df_copy = df_copy.set_index("Date")
    return df_copy


Some of these categories are fluid, and some techniques could fit into multiple buckets. This is an attempt to find an exhaustive number of techniques, but not an exhaustive list of implementations of those techniques. For example, there are thousands of ways to smooth a time series, but we have only included 1-2 techniques of interest under each category.

### (1) Transformation:

1. Scaling/Normalisation
2. Standardisation
3. Differencing
4. Capping
5. Operations
6. Smoothing
7. Decomposing
8. Filtering
9. Spectral Analysis
10. Waveforms
11. Modifications
12. Rolling
13. Lagging
14. Forecast Model

### (2) Interaction:

1. Regressions
2. Operators
3. Discretising
4. Normalising
5. Distance
6. Speciality
7. Genetic

### (3) Mapping:

1. Eigen Decomposition
2. Cross Decomposition
3. Kernel Approximation
4. Autoencoder
5. Manifold Learning
6. Clustering
7. Neighbouring

### (4) Extraction:

1. Energy
2. Distance
3. Differencing
4. Derivative
5. Volatility
6. Shape
7. Occurrence
8. Autocorrelation
9. Stochasticity
10. Averages
11. Size
12. Count
13. Streaks
14. Location
15. Model Coefficients
16. Quantile
17. Peaks
18. Density
19. Linearity
20. Non-linearity
21. Entropy
22. Fixed Points
23. Amplitude
24. Probability
25. Crossings
26. Fluctuation
27. Information
28. Fractals
29. Exponent
30. Spectral Analysis
31. Percentile
32. Range
33. Structural
34. Distribution

## (1) Transformation

Here, a transformation is any method that takes only one feature as input to produce one or more new features. Transformations can be applied to cross-sectional and time-series data. Some transformations are exclusive to time-series data (smoothing, filtering), but a handful of functions apply to both.

Where the time-series methods have a centred mean, or are forward-looking, there is a need to recalculate the output time series on a running basis to ensure that information from the future does not leak into the model. The last value of this recalculated series, or a feature extracted from it, can then be used as a running value that is only backward-looking, satisfying the no-'peeking'-ahead rule.
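
A sketch of that recalculation for a centred rolling mean: at every cut-off t, the series up to t is re-smoothed and only the last defined value is kept, so nothing beyond t can leak in.

import pandas as pd

s = pd.Series(range(20), dtype=float)  #stand-in for a price series

#recompute the centred smoother on an expanding window, keeping only
#the last value that is defined at each cut-off t
running = [s.iloc[:t].rolling(5, center=True).mean().dropna().iloc[-1]
           for t in range(5, len(s) + 1)]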

There are some packages in Python that dynamically create time series and extract their features, but none that incorporates the dynamic creation of a time series in combination with a wide application of a pre-specified list of extractions. Because this technique is expensive, we have a preference for models that only take historical data into account.

In this section we include a list of all types of transformations: those that only use present information (operations), those that incorporate all values (interpolation methods), those that only include past values (smoothing functions), and those that incorporate a subset window of lagging and leading values (select filters). Only those that use historical values, or are turned into prediction methods, can be used out of the box. The entire time series can be used in the model development process for historical-value methods, while only the forecasted values can be used for prediction models.

Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. When using an interpolation method, you are taking future information into account, e.g., a cubic spline. You can use interpolation methods to forecast into the future (extrapolation), and then use those forecasts in a training set. Or you could recalculate the interpolation for each time step and then extract features out of that series (extraction method). Interpolation and other forward-looking methods can be used if they are turned into prediction problems; then the forecasted values can be trained and tested on, and the fitted data can be disregarded. In the list presented below, the first five methods can be used for cross-sectional and time-series data; after that, the time-series-only methods follow.

#### (1) Scaling/Normalisation

There are a multitude of scaling methods available. Scaling generally gets applied to the entire dataset and is especially necessary for certain algorithms. K-means makes use of Euclidean distance, hence the need for scaling. For PCA, because we are trying to identify the feature with maximum variance, we also need scaling. Similarly, we need scaled features for gradient descent. Any algorithm that is not based on a distance measure is not affected by feature scaling. Some of the methods include range scalers like the minimum-maximum scaler and the maximum absolute scaler; standardisation methods like the standard scaler can also be used for scaling. The example used here is the robust scaler. Normalisation is a good technique when you don't know the distribution of the data. Scaling looks into the future, so parameters have to be trained on a training set and applied to a test set.

(i) Robust Scaler

Scaling according to the interquartile range, making it robust to outliers.

def robust_scaler(df, drop=None, quantile_range=(25, 75)):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    center = np.median(df, axis=0)
    quantiles = np.percentile(df, quantile_range, axis=0)
    scale = quantiles[1] - quantiles[0]
    df = (df - center) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df
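
For reference, a hypothetical call in the style used elsewhere in this notebook (assuming the packaged version is exposed as transform.robust_scaler):

df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()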



#### (2) Standardisation

When using a standardisation method, it is often more effective when the attribute itself is Gaussian. It is also useful to apply the technique when the model you want to use makes assumptions of Gaussian distributions, like linear regression, logistic regression, and linear discriminant analysis. For most applications, standardisation is recommended.

(i) Standard Scaler

Standardize features by removing the mean and scaling to unit variance

def standard_scaler(df, drop=None):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    mean = np.mean(df, axis=0)
    scale = np.std(df, axis=0)
    df = (df - mean) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df



#### (3) Differencing

Computing the differences between consecutive observations, normally used to obtain a stationary time series.
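
Plain integer differencing is a pandas one-liner; the fractional variant below generalises it:

df["Close_diff"] = df["Close"].diff()  # first-order (integer) differencing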

(i) Fractional Differencing

Fractional differencing allows us to achieve stationarity while maintaining the maximum amount of memory, compared to integer differencing.

import pylab as pl

def fast_fracdiff(x, cols, d):
    for col in cols:
        T = len(x[col])
        np2 = int(2 ** np.ceil(np.log2(2 * T - 1)))
        k = np.arange(1, T)
        b = (1,) + tuple(np.cumprod((k - d - 1) / k))
        z = (0,) * (np2 - T)
        z1 = b + z
        z2 = tuple(x[col]) + z
        dx = pl.ifft(pl.fft(z1) * pl.fft(z2))
        x[col + "_frac"] = np.real(dx[0:T])
    return x
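
A hypothetical call (the differencing order d=0.5 is illustrative):

df_out = fast_fracdiff(df.copy(), ["Close"], 0.5); df_out.head()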



#### (4) Capping

Any method that sets a floor and a cap on a feature's values. Capping can affect the distribution of the data, so it should not be exaggerated. One can cap values by using the average, by using the max and min values, or by an arbitrary extreme value.

(i) Winsorisation

The transformation of features by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers, replacing them with a certain percentile value.

def outlier_detect(data,col,threshold=1,method="IQR"):

if method == "IQR":
IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
if method == "STD":
Upper_fence = data[col].mean() + threshold * data[col].std()
Lower_fence = data[col].mean() - threshold * data[col].std()
if method == "OWN":
Upper_fence = data[col].mean() + threshold * data[col].std()
Lower_fence = data[col].mean() - threshold * data[col].std()
median = data[col].median()
median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
outlier_index = np.abs(modified_z_scores) > threshold
print('Num of outlier detected:',outlier_index.value_counts()[1])
print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))
return outlier_index, (median_absolute_deviation, median_absolute_deviation)

para = (Upper_fence, Lower_fence)
tmp = pd.concat([data[col]>Upper_fence,data[col]<Lower_fence],axis=1)
outlier_index = tmp.any(axis=1)
print('Num of outlier detected:',outlier_index.value_counts()[1])
print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))

return outlier_index, para

def windsorization(data, col, para, strategy='both'):
    """
    Top-coding & bottom-coding (capping the maximum of a distribution at an arbitrarily set value, and vice versa).
    """
    data_copy = data.copy(deep=True)
    if strategy == 'both':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col] > para[0], col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col] < para[1], col] = para[1]
    return data_copy

_, para = transform.outlier_detect(df, "Close")
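
The detected fences can then be passed on; a hypothetical follow-up call:

df_out = transform.windsorization(df.copy(), "Close", para, strategy='both'); df_out.head()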


#### (5) Operations

Operations here are treated like traditional transformations. It is the replacement of a variable by a function of that variable. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship.

(i) Power, Log, Reciprocal, Square Root

def operations(df, features):
    df_new = df[features]
    df_new = df_new - df_new.min()

    sqr_name = [str(fa) + "_POWER_2" for fa in df_new.columns]
    log_p_name = [str(fa) + "_LOG_p_one_abs" for fa in df_new.columns]
    rec_p_name = [str(fa) + "_RECIP_p_one" for fa in df_new.columns]
    sqrt_name = [str(fa) + "_SQRT_p_one" for fa in df_new.columns]

    df_sqr = pd.DataFrame(np.power(df_new.values, 2), columns=sqr_name, index=df.index)
    df_log = pd.DataFrame(np.log(df_new.add(1).abs().values), columns=log_p_name, index=df.index)
    df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values), columns=rec_p_name, index=df.index)
    df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values), columns=sqrt_name, index=df.index)

    dfs = [df, df_sqr, df_log, df_rec, df_sqrt]

    df = pd.concat(dfs, axis=1)

    return df



#### (6) Smoothing

Here we maintain that any method that has a component of historical averaging is a smoothing method, such as a simple moving average and the single, double and triple exponential smoothing methods. These kinds of causal filters are also popular in signal processing, where exponential smoothing is called an IIR filter and a moving average an FIR filter with equal weighting factors.

(i) Triple Exponential Smoothing (Holt-Winters Exponential Smoothing)

The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level $\ell_t$, one for the trend $b_t$, and one for the seasonal component $s_t$. This particular version looks at the last 12 periods. For that reason, the first 12 records should be disregarded, because they can't make use of the required window size for a fair calculation. The calculation is such that values are still provided for those periods based on whatever data is available.

def initial_trend(series, slen):
    total = 0.0
    for i in range(slen):
        total += float(series[i + slen] - series[i]) / slen
    return total / slen

def initial_seasonal_components(series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(series) / slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(series[slen * j:slen * j + slen]) / float(slen))
    # compute initial values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += series[slen * j + i] - season_averages[j]
        seasonals[i] = sum_of_vals_over_avg / n_seasons
    return seasonals

def triple_exponential_smoothing(df, cols, slen, alpha, beta, gamma, n_preds):
    for col in cols:
        result = []
        seasonals = initial_seasonal_components(df[col], slen)
        for i in range(len(df[col]) + n_preds):
            if i == 0:  # initial values
                smooth = df[col][0]
                trend = initial_trend(df[col], slen)
                result.append(df[col][0])
                continue
            if i >= len(df[col]):  # we are forecasting
                m = i - len(df[col]) + 1
                result.append((smooth + m * trend) + seasonals[i % slen])
            else:
                val = df[col][i]
                last_smooth, smooth = smooth, alpha * (val - seasonals[i % slen]) + (1 - alpha) * (smooth + trend)
                trend = beta * (smooth - last_smooth) + (1 - beta) * trend
                seasonals[i % slen] = gamma * (val - smooth) + (1 - gamma) * seasonals[i % slen]
                result.append(smooth + trend + seasonals[i % slen])
        df[col + "_TES"] = result
    return df
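
A hypothetical call (a 12-period season and the 0.2 smoothing constants are illustrative):

df_out = triple_exponential_smoothing(df.copy(), ["Close"], 12, .2, .2, .2, 0); df_out.head()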



#### (7) Decomposing

Decomposition procedures are used in time series to describe the trend and seasonal factors in a time series. More extensive decompositions might also include long-run cycles, holiday effects, day-of-week effects and so on. Here, we'll only consider trend and seasonal decompositions. A naive decomposition makes use of moving averages; other decomposition methods are available that make use of LOESS.

(i) Naive Decomposition

The base trend takes historical information into account and establishes moving averages; it does not have to be linear. To estimate the seasonal component for each season, simply average the detrended values for that season. If the seasonal variation looks constant, we should use the additive model; if the magnitude is increasing as a function of time, we should use the multiplicative model. Here, because it is predictive in nature, we are using a one-sided moving average, as opposed to a two-sided centred average.

import statsmodels.api as sm

def naive_dec(df, columns, freq=2):
    for col in columns:
        decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', freq=freq, two_sided=False)
        df[col + "_NDDT"] = decomposition.trend      # trend component
        df[col + "_NDDS"] = decomposition.seasonal   # seasonal component
        df[col + "_NDDR"] = decomposition.resid      # residual component
    return df



#### (8) Filtering

It is often useful to either low-pass filter (smooth) time series in order to reveal low-frequency features and trends, or to high-pass filter (detrend) time series in order to isolate high-frequency transients (e.g. storms). Low-pass filters use historical values; high-pass filters detrend with low-pass filters, so they also indirectly use historical values.

There are a few filters available, closely associated with decompositions and smoothing functions. The Hodrick-Prescott filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$. The Christiano-Fitzgerald filter is a generalisation of the Baxter-King filter and can be seen as a weighted moving average.

(i) Baxter-King Bandpass

The Baxter-King filter is intended to explicitly deal with the periodicity of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at higher or lower than those of the business cycle. The parameters are arbitrarily chosen. This method uses a centred moving average that has to be changed to a lagged moving average before it can be used as an input feature. The maximum period of oscillation should be used as the point to truncate the dataset, as that part of the time series does not incorporate all the required datapoints.

import statsmodels.api as sm

def bkb(df, cols):
    for col in cols:
        df[col + "_BPF"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df) - 1)
    return df



(ii) Butter Lowpass (IIR Filter Design)

The Butterworth filter is a type of signal-processing filter designed to have a frequency response as flat as possible in the passband. Like other filters, the first few values have to be disregarded for accurate downstream prediction. Instead of disregarding these values on a per-case basis, they can be disregarded in one chunk once the database of transformed features has been developed.

from scipy import signal

def butter_lowpass(cutoff, fs=20, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(df, cols, cutoff, fs=20, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    for col in cols:
        df[col + "_BUTTER"] = signal.lfilter(b, a, df[col])
    return df
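
A hypothetical call (the cutoff of 4 is illustrative):

df_out = butter_lowpass_filter(df.copy(), ["Close"], 4); df_out.head()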



(iii) Hilbert Transform Angle

The Hilbert transform is a time-domain to time-domain transformation which shifts the phase of a signal by 90 degrees. It is also a centred measure and would be difficult to use in a time-series prediction setting, unless it is recalculated on a per-step basis or transformed to be based on historical values only.

from scipy import signal
import numpy as np

def instantaneous_phases(df, cols):
    for col in cols:
        df[col + "_HILLB"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0)
    return df



(iv) Unscented Kalman Filter

The Kalman filter is better suited for estimating things that change over time. The most tangible example is tracking moving objects: a Kalman filter will be very close to the actual trajectory, because it says the most recent measurement is more important than the older ones. The Unscented Kalman Filter (UKF) is a model-based technique that recursively estimates the states (and with some modifications also the parameters) of a nonlinear, dynamic, discrete-time system. The UKF is based on the typical prediction-correction style methods. The Kalman smoother incorporates future values; the filter doesn't, and can therefore be used for online prediction. The normal Kalman filter is a forward filter in the sense that it makes a forecast of the current state using only current and past observations, whereas the smoother is based on computing a suitable linear combination of two filters, which are run in forward and backward directions.

from pykalman import UnscentedKalmanFilter

def kalman_feat(df, cols):
    for col in cols:
        ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1)
        (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col])
        (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col])
        df[col + "_UKFSMOOTH"] = smoothed_state_means.flatten()
        df[col + "_UKFFILTER"] = filtered_state_means.flatten()
    return df



#### (9) Spectral Analysis

There is a range of functions for spectral analysis. You can use periodograms and Welch's method to estimate the power spectral density, and Welch's method can also estimate the cross power spectral density. Other techniques include spectrograms, Lomb-Scargle periodograms and the short-time Fourier transform.

(i) Periodogram

This returns an array of sample frequencies and the power spectrum of x, or the power spectral density of x.

from scipy import signal

def perd_feat(df, cols):
    for col in cols:
        sig = signal.periodogram(df[col], fs=1, return_onesided=False)
        df[col + "_FREQ"] = sig[0]
        df[col + "_POWER"] = sig[1]
    return df



(ii) Fast Fourier Transform

The FFT, or fast Fourier transform, is an algorithm that essentially uses convolution techniques to efficiently find the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal) in order to obtain a smoothed version of the underlying signal. This takes the entire signal into account and, as a result, has to be recalculated on a running basis to avoid peeking into the future.

def fft_feat(df, cols):
    for col in cols:
        fft_df = np.fft.fft(np.asarray(df[col].tolist()))
        fft_df = pd.DataFrame({'fft': fft_df})
        df[col + '_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values
        df[col + '_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values
    return df



#### (10) Waveforms

The waveform of a signal is the shape of its graph as a function of time.

(i) Continuous Wave Radar

from scipy import signal

def cw(df, cols, fc=10, fs=8000):
    for col in cols:
        # input: a continuous sine wave driven by the series
        t = df[col]
        tx = np.sin(2 * np.pi * fc * t)
        _, Pxx = signal.welch(tx, fs)
        # diode: half-wave rectification with a square wave
        d = signal.square(2 * np.pi * fc * t)
        d[d < 0] = 0.
        # output of the diode
        df[col + "_CW"] = tx * d
    return df



(ii) Saw Tooth

Return a periodic sawtooth or triangle waveform.

def saw(df, cols):
    for col in cols:
        df[col + " SAW"] = signal.sawtooth(df[col])
    return df



#### (11) Modifications

A range of modifications usually applied to images; these values would have to be recalculated for each time series.

(i) Various Techniques

from tsaug import *

def modify(df, cols):
    for col in cols:
        series = df[col].values
        df[col + "_magnify"], _ = magnify(series, series)
        df[col + "_affine"], _ = affine(series, series)
        df[col + "_crop"], _ = crop(series, series)
        df[col + "_cross_sum"], _ = cross_sum(series, series)
        df[col + "_resample"], _ = resample(series, series)
        df[col + "_trend"], _ = trend(series, series)

        df[col + "_random_affine"], _ = random_time_warp(series, series)
        df[col + "_random_crop"], _ = random_crop(series, series)
        df[col + "_random_cross_sum"], _ = random_cross_sum(series, series)
        df[col + "_random_sidetrack"], _ = random_sidetrack(series, series)
        df[col + "_random_time_warp"], _ = random_time_warp(series, series)
        df[col + "_random_magnify"], _ = random_magnify(series, series)
        df[col + "_random_jitter"], _ = random_jitter(series, series)
        df[col + "_random_trend"], _ = random_trend(series, series)
    return df



#### (12) Rolling

Features that are calculated on a rolling basis over a fixed window size.

(i) Mean, Standard Deviation

def multiple_rolling(df, windows=[1, 2], functions=["mean", "std"], columns=None):
    windows = [1 + a for a in windows]
    if not columns:
        columns = df.columns.to_list()
    rolling_dfs = (df[columns].rolling(i)                      # create the window
                   .agg(functions)                             # aggregate
                   .rename({col: '{0}_{1:d}'.format(col, i)
                            for col in columns}, axis=1)       # rename columns
                   for i in windows)                           # for each window
    df_out = pd.concat((df, *rolling_dfs), axis=1)             # concatenate dataframes
    da = df_out.iloc[:, len(df.columns):]
    da = [col[0] + "_" + col[1] for col in da.columns.to_list()]
    df_out.columns = df.columns.to_list() + da

    return df_out



#### (13) Lagging

Lagged values from existing features.

(i) Single Steps

def multiple_lags(df, start=1, end=3, columns=None):
    if not columns:
        columns = df.columns.to_list()
    lags = range(start, end + 1)  # lags from start through end, inclusive

    df = df.assign(**{
        '{}_t_{}'.format(col, t): df[col].shift(t)
        for t in lags
        for col in columns
    })
    return df

df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()


#### (14) Forecast Model

There is a range of time-series models that can be implemented, like AR, MA, ARMA, ARIMA, SARIMA, SARIMAX, VAR, VARMA, VARMAX, SES, and HWES. The models can be divided into autoregressive models and smoothing models. In an autoregressive model, we forecast the variable of interest using a linear combination of past values of the variable. Each method might require specific tuning and parameters to suit your prediction task. You need to drop a certain amount of the historical data that you use during the fitting stage. Models that take seasonality into account need more training data.

(i) Prophet

Prophet is a procedure for forecasting time-series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality. You can apply additive models to your training data, but also interactive models like deep learning models. The problem is that because these models have learned from future observations, there would thus be a need to recalculate the time series on a running basis, or to only include the predicted, as opposed to fitted, values in future training and test sets. In this example, I train on 150 data points to illustrate how the remaining roughly 100 data points can be used in a new prediction problem. You can plot the resulting _PROPHET column to see the effect.

from fbprophet import Prophet

def prophet_feat(df, cols, date, freq, train_size=150):
    def prophet_dataframe(df):
        df.columns = ['ds', 'y']
        return df

    def original_dataframe(df, freq, name):
        prophet_pred = pd.DataFrame({"Date": df['ds'], name: df["yhat"]})
        prophet_pred = prophet_pred.set_index("Date")
        return prophet_pred[name].values

    for col in cols:
        model = Prophet(daily_seasonality=True)
        # fit on the first train_size observations only
        model.fit(prophet_dataframe(df[[date, col]].head(train_size)))
        forecast_len = len(df) - train_size
        future = model.make_future_dataframe(periods=forecast_len, freq=freq)
        future_pred = model.predict(future)
        df[col + "_PROPHET"] = list(original_dataframe(future_pred, freq, col))
    return df



## (2) Interaction

Interactions are defined as methods that require more than one feature to create an additional feature. Here we include normalising and discretising techniques that are non-feature-specific. Almost all of these methods can be applied to cross-sectional data. The only methods that are time-specific are the technical features in the speciality section and the autoregression model.

#### (1) Regression

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables.

(i) Lowess Smoother

The lowess smoother is a robust locally weighted regression. The function fits a nonparametric regression curve to a scatterplot.

from math import ceil
import numpy as np
from scipy import linalg

def lowess(df, cols, y, f=2. / 3., iter=3):
    for col in cols:
        x = df[col].values          # work on the raw arrays
        y_arr = np.asarray(y)
        n = len(x)
        r = int(ceil(f * n))
        h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
        w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
        w = (1 - w ** 3) ** 3
        yest = np.zeros(n)
        delta = np.ones(n)
        for iteration in range(iter):
            for i in range(n):
                weights = delta * w[:, i]
                b = np.array([np.sum(weights * y_arr), np.sum(weights * y_arr * x)])
                A = np.array([[np.sum(weights), np.sum(weights * x)],
                              [np.sum(weights * x), np.sum(weights * x * x)]])
                beta = linalg.solve(A, b)
                yest[i] = beta[0] + beta[1] * x[i]

            residuals = y_arr - yest
            s = np.median(np.abs(residuals))
            delta = np.clip(residuals / (6.0 * s), -1, 1)
            delta = (1 - delta ** 2) ** 2
        df[col + "_LOWESS"] = yest

    return df

df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()


(ii) Autoregression

Autoregression is a time-series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.

from statsmodels.tsa.ar_model import AR
from timeit import default_timer as timer

def autoregression(df, drop=None, settings={"autoreg_lag": 4}):

    autoreg_lag = settings["autoreg_lag"]
    if drop:
        keep = df[drop]
        df = df.drop([drop], axis=1)

    n_channels = df.shape[0]
    t = timer()
    channels_regg = np.zeros((n_channels, autoreg_lag + 1))
    for i in range(0, n_channels):
        fitted_model = AR(df.values[i, :]).fit(autoreg_lag)
        channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params)

    for i in range(channels_regg.shape[1]):
        df["LAG_" + str(i + 1)] = channels_regg[:, i]

    if drop:
        df = pd.concat((keep, df), axis=1)

    t = timer() - t
    return df



#### (2) Operators

Looking at interaction between different features. Here the methods employed are multiplication and division.

(i) Multiplication and Division

def muldiv(df, feature_list):
    for feat in feature_list:
        for feat_two in feature_list:
            if feat == feat_two:
                continue
            else:
                df[feat + "/" + feat_two] = df[feat] / (df[feat_two] - df[feat_two].min())  # zero-division guard
                df[feat + "_X_" + feat_two] = df[feat] * (df[feat_two])

    return df
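
A hypothetical call in the interact.* style used elsewhere in this notebook:

df_out = interact.muldiv(df.copy(), ["Close", "Open"]); df_out.head()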



#### (3) Discretising

In statistics and machine learning, discretisation refers to the process of converting or partitioning continuous attributes, features or variables into discretised or nominal attributes.

(i) Decision Tree Discretiser

The first method that will be applied here is a supervised discretiser. Discretisation with decision trees consists of using a decision tree to identify the optimal splitting points that would determine the bins or contiguous intervals.

from sklearn.tree import DecisionTreeRegressor

def decision_tree_disc(df, cols, depth=4):
    for col in cols:
        df[col + "_m1"] = df[col].shift(1)
        df = df.iloc[1:, :]
        tree_model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree_model.fit(df[col + "_m1"].to_frame(), df[col])
        df[col + "_Disc"] = tree_model.predict(df[col + "_m1"].to_frame())
    return df
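
A hypothetical call:

df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head()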



#### (4) Normalising

Normalising normally pertains to the scaling of data. There are many methods available; interaction-based normalising methods make use of all the features' attributes to do the scaling.

(i) Quantile Normalisation

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties.

import numpy as np
import pandas as pd

def quantile_normalize(df, drop):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    # compute ranks
    dic = {}
    for col in df:
        dic.update({col: sorted(df[col])})
    sorted_df = pd.DataFrame(dic)
    rank = sorted_df.mean(axis=1).tolist()
    # sort
    for col in df:
        t = np.searchsorted(np.sort(df[col]), df[col])
        df[col] = [rank[i] for i in t]

    if drop:
        df = pd.concat((keep, df), axis=1)
    return df
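
A hypothetical call:

df_out = interact.quantile_normalize(df.copy(), drop=["Close_1"]); df_out.head()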



#### (5) Distance

There are multiple types of distance functions, like the Euclidean, Mahalanobis, and Minkowski distances. Here we use a contrived example: a location-based Haversine distance.

(i) Haversine Distance

The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere.

from math import sin, cos, sqrt, atan2, radians

def haversine_distance(row, lon="Open", lat="Close"):
    c_lat, c_long = radians(52.5200), radians(13.4050)  # restored reference point (here: Berlin); any fixed coordinate works
    R = 6373.0
    long = radians(row[lon])
    lat = radians(row[lat])

    dlon = long - c_long
    dlat = lat - c_lat
    a = sin(dlat / 2) ** 2 + cos(lat) * cos(c_lat) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c
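
A hypothetical row-wise application (the column names stand in for coordinates in this contrived example):

df["distance_central"] = df.apply(haversine_distance, axis=1); df.head()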



#### (6) Speciality

(i) Technical Features

Technical indicators are heuristic or mathematical calculations based on the price, volume, or open interest of a security or contract used by traders who follow technical analysis. By analyzing historical data, technical analysts use indicators to predict future price movements.

import ta

def tech(df):
    return ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume")



#### (7) Genetic

Genetic programming has shown promise in constructing features, by using original features to form high-level ones that can help algorithms achieve better performance.

(i) Symbolic Transformer

A symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship.


from gplearn.genetic import SymbolicTransformer

def genetic_feat(df, num_gen=20, num_comp=10):
    function_set = ['add', 'sub', 'mul', 'div',
                    'sqrt', 'log', 'abs', 'neg', 'inv', 'tan']

    gp = SymbolicTransformer(generations=num_gen, population_size=200,
                             hall_of_fame=100, n_components=num_comp,
                             function_set=function_set,
                             parsimony_coefficient=0.0005,
                             max_samples=0.9, verbose=1,
                             random_state=0, n_jobs=6)

    gen_feats = gp.fit_transform(df.drop("Close_1", axis=1), df["Close_1"])
    gen_feats = pd.DataFrame(gen_feats, columns=["gen_" + str(a) for a in range(gen_feats.shape[1])])
    gen_feats.index = df.index
    return pd.concat((df, gen_feats), axis=1)



## (3) Mapping

Methods that help with the summarisation of features by remapping them to achieve some aim, like the maximisation of variability or class separability. These methods tend to be unsupervised, but can also take a supervised form.

#### (1) Eigen Decomposition

Eigendecomposition or sometimes spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Some examples are LDA and PCA.

(i) Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

def pca_feature(df, memory_issues=False, mem_iss_component=False, variance_or_components=0.80, n_components=5, drop_cols=None, non_linear=True):

    if non_linear:
        pca = KernelPCA(n_components=n_components, kernel='rbf', fit_inverse_transform=True, random_state=33, remove_zero_eig=True)
    else:
        if memory_issues:
            if not mem_iss_component:
                raise ValueError("If you have memory issues, you have to preselect mem_iss_component")
            pca = IncrementalPCA(mem_iss_component)
        else:
            if variance_or_components > 1:
                pca = PCA(n_components=variance_or_components)
            else:  # automated selection based on variance
                pca = PCA(n_components=variance_or_components, svd_solver="full")

    if drop_cols:
        X_pca = pca.fit_transform(df.drop(drop_cols, axis=1))
        return pd.concat((df[drop_cols], pd.DataFrame(X_pca, columns=["PCA_" + str(i + 1) for i in range(X_pca.shape[1])], index=df.index)), axis=1)
    else:
        X_pca = pca.fit_transform(df)
        return pd.DataFrame(X_pca, columns=["PCA_" + str(i + 1) for i in range(X_pca.shape[1])], index=df.index)

df_out = mapper.pca_feature(df.copy(), variance_or_components=0.9, n_components=8,non_linear=False)


#### (2) Cross Decomposition

These families of algorithms are useful to find linear relations between two multivariate datasets.

(i) Canonical Correlation Analysis

Canonical-correlation analysis (CCA) is a way of inferring information from cross-covariance matrices.

from sklearn.cross_decomposition import CCA

def cross_lag(df, drop=None, lags=1, components=4):

    if drop:
        keep = df[drop]
        df = df.drop([drop], axis=1)

    df_2 = df.shift(lags)
    df = df.iloc[lags:, :]
    df_2 = df_2.dropna().reset_index(drop=True)

    cca = CCA(n_components=components)
    cca.fit(df_2, df)

    X_c, df_2 = cca.transform(df_2, df)
    df_2 = pd.DataFrame(df_2, index=df.index)

    if drop:
        df = pd.concat([keep, df, df_2], axis=1)
    else:
        df = pd.concat([df, df_2], axis=1)
    return df



#### (3) Kernel Approximation

Functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines.

(i) Additive Chi-Squared Kernel

Computes the additive chi-squared kernel between observations in X and Y. The chi-squared kernel is computed between each pair of rows in X and Y; X and Y have to be non-negative.

from sklearn.kernel_approximation import AdditiveChi2Sampler

def a_chi(df, drop=None, lags=1, sample_steps=2):

    if drop:
        keep = df[drop]
        df = df.drop([drop], axis=1)

    df_2 = df.shift(lags)
    df = df.iloc[lags:, :]
    df_2 = df_2.dropna().reset_index(drop=True)

    chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps)  # restored: the sampler itself
    df_2 = chi2sampler.fit_transform(df_2, df["Close"])

    df_2 = pd.DataFrame(df_2, index=df.index)

    if drop:
        df = pd.concat([keep, df, df_2], axis=1)
    else:
        df = pd.concat([df, df_2], axis=1)
    return df



#### (4) Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore noise.

(i) Feed Forward

The simplest form of an autoencoder is a feedforward, non-recurrent neural network, similar to the single-layer perceptrons that make up multilayer perceptrons.

from sklearn.preprocessing import minmax_scale
import tensorflow as tf
import numpy as np

def encoder_dataset(df, drop=None, dimesions=20):

    if drop:
        train_scaled = minmax_scale(df.drop(drop, axis=1).values, axis=0)
    else:
        train_scaled = minmax_scale(df.values, axis=0)

    # define the number of encoding dimensions
    encoding_dim = dimesions
    # define the number of features
    ncol = train_scaled.shape[1]
    input_dim = tf.keras.Input(shape=(ncol,))

    # encoder layers
    encoded1 = tf.keras.layers.Dense(3000, activation='relu')(input_dim)
    encoded2 = tf.keras.layers.Dense(2750, activation='relu')(encoded1)
    encoded3 = tf.keras.layers.Dense(2500, activation='relu')(encoded2)
    encoded4 = tf.keras.layers.Dense(750, activation='relu')(encoded3)
    encoded5 = tf.keras.layers.Dense(500, activation='relu')(encoded4)
    encoded6 = tf.keras.layers.Dense(250, activation='relu')(encoded5)
    encoded7 = tf.keras.layers.Dense(encoding_dim, activation='relu')(encoded6)

    encoder = tf.keras.Model(inputs=input_dim, outputs=encoded7)

    # note: as written, the encoder is an untrained random projection; pair it
    # with a decoder and fit it for a proper reconstruction objective
    encoded_train = pd.DataFrame(encoder.predict(train_scaled), index=df.index)
    if drop:
        encoded_train = pd.concat((df[drop], encoded_train), axis=1)

    return encoded_train

df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()



#### (5) Manifold Learning

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data.

(i) Local Linear Embedding

Locally linear embedding is a method of non-linear dimensionality reduction. It tries to reduce the number of dimensions while preserving the geometric features of the original non-linear feature structure.

from sklearn.manifold import LocallyLinearEmbedding

def lle_feat(df, drop=None, components=4):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    embedding = LocallyLinearEmbedding(n_components=components)
    em = embedding.fit_transform(df)
    df = pd.DataFrame(em, index=df.index)
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df



#### (6) Clustering

Most clustering techniques start with a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together under some measure. Although these clustering techniques are typically used for observations, they can also be used for feature dimensionality reduction, especially hierarchical clustering techniques.

(i) Feature Agglomeration

Feature agglomeration uses clustering to group together features that look very similar, thus decreasing the number of features.

import numpy as np
from sklearn import cluster

def feature_agg(df, drop=None, components=4):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    components = min(df.shape[1] - 1, components)
    agglo = cluster.FeatureAgglomeration(n_clusters=components)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df), index=df.index)

    if drop:
        return pd.concat((keep, df), axis=1)
    else:
        return df



#### (7) Neighbouring

Neighbouring points can be calculated using distance metrics like the Hamming, Manhattan and Minkowski distances. The principle behind nearest-neighbour methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.

(i) Nearest Neighbours

Unsupervised learner for implementing neighbor searches.

from sklearn.neighbors import NearestNeighbors

def neigh_feat(df, drop, neighbors=6):

    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)

    neighbors = min(df.shape[0] - 1, neighbors)  # clamp to the number of available samples
    neigh = NearestNeighbors(n_neighbors=neighbors)
    neigh.fit(df)
    neigh = neigh.kneighbors()[0]
    df = pd.DataFrame(neigh, index=df.index)

    if drop:
        return pd.concat((keep, df), axis=1)
    else:
        return df



## (4) Extraction

When working with extraction, you have to decide the size of the time-series history to take into account when calculating a collection of walk-forward feature values. To facilitate our extraction, we use an excellent package called tsfresh, along with some of its default features. For completeness, we also include a dozen or so custom features to be added to the extraction pipeline.

The time-series methods in the transformation section and the interaction section are similar to the methods we will uncover in the extraction section; however, for transformation and interaction methods the output is an entirely new time series, whereas extraction methods take multiple constructed time series as input and extract a singular value from each to reconstruct an entirely new time series.

Some methods naturally fit better in one format than another, e.g., lags are too expensive for extraction; time-series decomposition only has to be performed once, because it has a low level of 'leakage', so it is better suited to transformation; and forecast methods attempt to predict multiple future training samples, so they won't work with extraction, which only delivers one value per time series. Furthermore, all non-time-series (cross-sectional) techniques cannot make use of extraction, as it is solely a time-series method.

Lastly, when we want to double-apply specific functions, we can apply one as a transformation/interaction, and then all the extraction methods can be applied to that feature as well. For example, if we calculate a smoothing function (transformation), then all other extraction functions (median, entropy, linearity etc.) can be applied to that smoothed series, including the smoothing function itself, e.g., a double smooth, double lag, double filter etc. Separating these methods out therefore gives us great flexibility.
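
As a minimal sketch of this walk-forward pattern, assuming a 50-observation history and using absolute energy (defined further below) as the single-valued extractor:

window = 50
df["Close_absenergy_50"] = (
    df["Close"]
    .rolling(window)                           # one sub-series per time step
    .apply(lambda w: np.dot(w, w), raw=True)   # one extracted value per window
)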

Decorator

def set_property(key, value):
    """
    This method returns a decorator that sets the property key of the function to value
    """
    def decorate_func(func):
        setattr(func, key, value)
        if func.__doc__ and key == "fctype":
            func.__doc__ = func.__doc__ + "\n\n    *This function is of type: " + value + "*\n"
        return func
    return decorate_func


#### (1) Energy

You can calculate the linear, non-linear and absolute energy of a time series. In signal processing, the energy $E_S$ of a continuous-time signal $x(t)$ is defined as the area under the squared magnitude of the considered signal. Mathematically, $E_{s}=\langle x(t), x(t)\rangle=\int_{-\infty}^{\infty}|x(t)|^{2} d t$

(i) Absolute Energy

Returns the absolute energy of the time series which is the sum over the squared values

#-> In Package
def abs_energy(x):

    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.dot(x, x)

extract.abs_energy(df["Close"])


#### (2) Distance

Here we widely define distance measures as those that take a difference between attributes or series of datapoints.

(i) Complexity-Invariant Distance

This feature calculator provides an estimate of a time series' complexity.

#-> In Package
def cid_ce(x, normalize):

    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if normalize:
        s = np.std(x)
        if s != 0:
            x = (x - np.mean(x)) / s
        else:
            return 0.0

    x = np.diff(x)
    return np.sqrt(np.dot(x, x))

extract.cid_ce(df["Close"], True)


#### (3) Differencing

Many alternatives to differencing exist; one can, for example, take the difference of every other value, take the squared difference, take the fractional difference, or, like our example, take the mean absolute difference.

(i) Mean Absolute Change

Returns the mean over the absolute differences between subsequent time series values.

#-> In Package
def mean_abs_change(x):
    return np.mean(np.abs(np.diff(x)))

extract.mean_abs_change(df["Close"])


#### (4) Derivative

Features where the emphasis is on the rate of change.

(i) Mean Central Second Derivative

Returns the mean value of a central approximation of the second derivative

#-> In Package
def _roll(a, shift):
    if not isinstance(a, np.ndarray):
        a = np.asarray(a)
    idx = shift % len(a)
    return np.concatenate([a[-idx:], a[:-idx]])

def mean_second_derivative_central(x):

    diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0
    return np.mean(diff[1:-1])

extract.mean_second_derivative_central(df["Close"])


#### (5) Volatility

Volatility is a statistical measure of the dispersion of a time-series.

(i) Variance Larger than Standard Deviation

#-> In Package
def variance_larger_than_standard_deviation(x):

    y = np.var(x)
    return y > np.sqrt(y)

extract.variance_larger_than_standard_deviation(df["Close"])


(ii) Variability Index

Variability Index is a way to measure how smooth or 'variable' a time series is.

var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}

@set_property("fctype", "combiner")
@set_property("custom", True)
def var_index(time,param=var_index_param):
final = []
keys = []
for key, magnitude in param.items():
w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2)
w_mean = np.mean(w)

N = len(time)
sigma2 = np.var(magnitude)

S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2)
S2 = sum(w)

eta_e = (w_mean * np.power(time[N - 1] -
time[0], 2) * S1 / (sigma2 * S2 * N ** 2))
final.append(eta_e)
keys.append(key)
return {"Interact__{}".format(k): eta_e for eta_e, k in zip(final,keys) }

extract.var_index(df["Close"].values,var_index_param)


#### (6) Shape

Features that emphasise a particular shape not ordinarily considered a distribution statistic. This extends to derivations of the original time series too, for example a feature looking at the sinusoidal shape of an autocorrelation plot.

(i) Symmetrical

Boolean variable denoting if the distribution of x looks symmetric.

#-> In Package
def symmetry_looking(x, param=[{"r": 0.2}]):

    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    mean_median_difference = np.abs(np.mean(x) - np.median(x))
    max_min_difference = np.max(x) - np.min(x)
    return [("r_{}".format(r["r"]), mean_median_difference < (r["r"] * max_min_difference))
            for r in param]

extract.symmetry_looking(df["Close"])


#### (7) Occurrence

Looking at the occurrence, and reoccurrence, of defined values.

(i) Has Duplicate Max

#-> In Package
def has_duplicate_max(x):
    """
    Checks if the maximum value of x is observed more than once

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: bool
    """
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.sum(x == np.max(x)) >= 2

extract.has_duplicate_max(df["Close"])


#### (8) Autocorrelation

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay.

(i) Partial Autocorrelation

Partial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.

#-> In Package
from statsmodels.tsa.stattools import acf, adfuller, pacf

def partial_autocorrelation(x, param=[{"lag": 1}]):

    # Check the difference between demanded lags by param and possible lags to calculate (depends on len(x))
    max_demanded_lag = max([lag["lag"] for lag in param])
    n = len(x)

    # Check if the list is too short to make calculations
    if n <= 1:
        pacf_coeffs = [np.nan] * (max_demanded_lag + 1)
    else:
        if n <= max_demanded_lag:
            max_lag = n - 1
        else:
            max_lag = max_demanded_lag
        pacf_coeffs = list(pacf(x, method="ld", nlags=max_lag))
        pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag))

    return [("lag_{}".format(lag["lag"]), pacf_coeffs[lag["lag"]]) for lag in param]

extract.partial_autocorrelation(df["Close"])


#### (9) Stochasticity

Stochastic refers to a randomly determined process. Any features trying to capture stochasticity by degree or type are included under this branch.

(i) Augmented Dickey Fuller

The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.

#-> In Package
from numpy.linalg import LinAlgError
from statsmodels.tools.sm_exceptions import MissingDataError

def augmented_dickey_fuller(x, param=[{"attr": "teststat"}]):

    res = None
    try:
        res = adfuller(x)  # restored: the test itself
    except LinAlgError:
        res = np.NaN, np.NaN, np.NaN
    except ValueError:  # occurs if sample size is too small
        res = np.NaN, np.NaN, np.NaN
    except MissingDataError:  # is thrown for e.g. inf or nan in the data
        res = np.NaN, np.NaN, np.NaN

    return [('attr_"{}"'.format(config["attr"]),
             res[0] if config["attr"] == "teststat"
             else res[1] if config["attr"] == "pvalue"
             else res[2] if config["attr"] == "usedlag" else np.NaN)
            for config in param]

extract.augmented_dickey_fuller(df["Close"])


#### (10) Averages

(i) Median of Magnitudes Skew

@set_property("fctype", "simple")
@set_property("custom", True)
def gskew(x):
interpolation="nearest"
median_mag = np.median(x)
F_3_value = np.percentile(x, 3, interpolation=interpolation)
F_97_value = np.percentile(x, 97, interpolation=interpolation)

skew = (np.median(x[x <= F_3_value]) +
np.median(x[x >= F_97_value]) - 2 * median_mag)

return skew

extract.gskew(df["Close"])


(ii) Stetson Mean

An iteratively weighted mean used in the Stetson variability index.

stetson_param = {"weight": 100., "alpha": 2., "beta": 2., "tol": 1.e-6, "nmax": 20}

@set_property("fctype", "combiner")
@set_property("custom", True)
def stetson_mean(x, param=stetson_param):

    weight = param["weight"]
    alpha = param["alpha"]
    beta = param["beta"]
    tol = param["tol"]
    nmax = param["nmax"]

    mu = np.median(x)
    for i in range(nmax):
        resid = x - mu
        resid_err = np.abs(resid) * np.sqrt(weight)
        weight1 = weight / (1. + (resid_err / alpha) ** beta)
        weight1 /= weight1.mean()
        diff = np.mean(x * weight1) - mu
        mu += diff
        if (np.abs(diff) < tol * np.abs(mu) or np.abs(diff) < tol):
            break

    return mu

extract.stetson_mean(df["Close"])


#### (11) Size

(i) Length

#-> In Package
def length(x):
    return len(x)

extract.length(df["Close"])


#### (12) Count

(i) Count Above Mean

Returns the number of values in x that are higher than the mean of x

#-> In Package
def count_above_mean(x):
    m = np.mean(x)
    return np.where(x > m)[0].size

extract.count_above_mean(df["Close"])


#### (13) Streaks

(i) Longest Strike Below Mean

Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x

#-> In Package
import itertools

def get_length_sequences_where(x):

    if len(x) == 0:
        return [0]
    else:
        res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1]
        return res if len(res) > 0 else [0]

def longest_strike_below_mean(x):

    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.max(get_length_sequences_where(x <= np.mean(x))) if x.size > 0 else 0

extract.longest_strike_below_mean(df["Close"])


(ii) Wozniak

This is an astronomical feature; we count the number of three consecutive data points that are brighter or fainter than $2\sigma$ and normalise the number by $N-2$.

woz_param = [{"consecutiveStar": n} for n in [2, 4]]

@set_property("fctype", "combiner")
@set_property("custom", True)
def wozniak(magnitude, param=woz_param):

    iters = []
    for consecutiveStar in [stars["consecutiveStar"] for stars in param]:
        N = len(magnitude)
        if N < consecutiveStar:
            return 0
        sigma = np.std(magnitude)
        m = np.mean(magnitude)
        count = 0

        for i in range(N - consecutiveStar + 1):
            flag = 0
            for j in range(consecutiveStar):
                if (magnitude[i + j] > m + 2 * sigma or
                        magnitude[i + j] < m - 2 * sigma):
                    flag = 1
                else:
                    flag = 0
                    break
            if flag:
                count = count + 1
        iters.append(count * 1.0 / (N - consecutiveStar + 1))

    return [("consecutiveStar_{}".format(config["consecutiveStar"]), iters[en]) for en, config in enumerate(param)]

extract.wozniak(df["Close"])


#### (14) Location

(i) Last location of Maximum

Returns the relative last location of the maximum value of x. A related feature is last_location_of_minimum(x).

#-> In Package
def last_location_of_maximum(x):

    x = np.asarray(x)
    return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.NaN

extract.last_location_of_maximum(df["Close"])


#### (15) Model Coefficients

Any coefficients obtained from a model that might help in the prediction problem. For example, here we might include the coefficients of a polynomial $h(x)$ which has been fitted to the deterministic dynamics of a Langevin model.

(i) FFT Coefficient

Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input.

#-> In Package
def fft_coefficient(x, param=[{"coeff": 10, "attr": "real"}]):

    assert min([config["coeff"] for config in param]) >= 0, "Coefficients must be positive or zero."
    assert set([config["attr"] for config in param]) <= set(["imag", "real", "abs", "angle"]), \
        'Attribute must be "real", "imag", "angle" or "abs"'

    fft = np.fft.rfft(x)

    def complex_agg(x, agg):
        if agg == "real":
            return x.real
        elif agg == "imag":
            return x.imag
        elif agg == "abs":
            return np.abs(x)
        elif agg == "angle":
            return np.angle(x, deg=True)

    res = [complex_agg(fft[config["coeff"]], config["attr"]) if config["coeff"] < len(fft)
           else np.NaN for config in param]
    index = ['coeff_{}__attr_"{}"'.format(config["coeff"], config["attr"]) for config in param]
    return list(zip(index, res))  # pair each requested coefficient with its own result

extract.fft_coefficient(df["Close"])


(ii) AR Coefficient

This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.

#-> In Package
from statsmodels.tsa.ar_model import AR
from numpy.linalg import LinAlgError

def ar_coefficient(x, param=[{"coeff": 5, "k": 5}]):

    calculated_ar_params = {}

    x_as_list = list(x)
    calculated_AR = AR(x_as_list)

    res = {}

    for parameter_combination in param:
        k = parameter_combination["k"]
        p = parameter_combination["coeff"]

        column_name = "k_{}__coeff_{}".format(k, p)

        if k not in calculated_ar_params:
            try:
                calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params
            except (LinAlgError, ValueError):
                calculated_ar_params[k] = [np.NaN] * k

        mod = calculated_ar_params[k]

        if p <= k:
            try:
                res[column_name] = mod[p]
            except IndexError:
                res[column_name] = 0
        else:
            res[column_name] = np.NaN

    return [(key, value) for key, value in res.items()]

extract.ar_coefficient(df["Close"])


#### (16) Quantiles

This includes finding normal quantile values in the series, but also quantile derived measures like change quantiles and index max quantiles.

(i) Index Mass Quantile

The relative index $i$ where $q\%$ of the mass of the time series $x$ lies left of $i$.

#-> In Package
def index_mass_quantile(x, param=[{"q": 0.3}]):

    x = np.asarray(x)
    abs_x = np.abs(x)
    s = sum(abs_x)

    if s == 0:
        # all values in x are zero or it has length 0
        return [("q_{}".format(config["q"]), np.NaN) for config in param]
    else:
        # at least one value is not zero
        mass_centralized = np.cumsum(abs_x) / s
        return [("q_{}".format(config["q"]), (np.argmax(mass_centralized >= config["q"]) + 1) / len(x)) for config in param]

extract.index_mass_quantile(df["Close"])


#### (17) Peaks

(i) Number of CWT Peaks

This feature calculator searches for different peaks in x.

from scipy.signal import cwt, find_peaks_cwt, ricker, welch

cwt_param = [ka for ka in [2, 6, 9]]

@set_property("fctype", "combiner")
@set_property("custom", True)
def number_cwt_peaks(x, param=cwt_param):

    return [("CWTPeak_{}".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param]

extract.number_cwt_peaks(df["Close"])


#### (18) Density

The density, and more specifically the power spectral density, of a signal describes the power present in the signal as a function of frequency, per unit frequency.

(i) Cross Power Spectral Density

This feature calculator estimates the cross power spectral density of the time series $x$ at different frequencies.

#-> In Package
def spkt_welch_density(x, param=[{"coeff": 5}]):
    freq, pxx = welch(x, nperseg=min(len(x), 256))
    coeff = [config["coeff"] for config in param]
    indices = ["coeff_{}".format(i) for i in coeff]

    if len(pxx) <= np.max(coeff):  # fewer data points in the time series than requested coefficients
        # filter coefficients that are not contained in pxx
        reduced_coeff = [coefficient for coefficient in coeff if len(pxx) > coefficient]
        not_calculated_coefficients = [coefficient for coefficient in coeff
                                       if coefficient not in reduced_coeff]

        # fill up the rest of the requested coefficients with np.NaNs
        return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients))
    else:
        return pxx[coeff].ravel()[0]

extract.spkt_welch_density(df["Close"])


#### (19) Linearity

Any measure of linearity that might make use of something like a linear least-squares regression for the values of the time series. This can be fitted against the sequence from zero to the length of the time series minus one, among many other alternatives.

(i) Linear Trend Time Wise

Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one.

from scipy.stats import linregress

#-> In Package
def linear_trend_timewise(x, param=[{"attr": "pvalue"}]):

    ix = x.index

    # get differences between each timestamp and the first timestamp in seconds,
    # then convert to hours for the linear regression
    times_seconds = (ix - ix[0]).total_seconds()
    times_hours = np.asarray(times_seconds / float(3600))

    linReg = linregress(times_hours, x.values)

    return [("attr_\"{}\"".format(config["attr"]), getattr(linReg, config["attr"]))
            for config in param]

extract.linear_trend_timewise(df["Close"])


#### (20) Non-Linearity

(i) Schreiber Non-Linearity

#-> In Package
def c3(x, lag=3):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    n = x.size
    if 2 * lag >= n:
        return 0
    else:
        return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)])

extract.c3(df["Close"])


#### (21) Entropy

Any feature looking at the complexity of a time series. This is typically used in medical signal disciplines (EEG, EMG). There are multiple types of measures, like spectral entropy, permutation entropy, sample entropy, approximate entropy, Lempel-Ziv complexity and others. This includes entropy measures and their derivations.

(i) Binned Entropy

Bins the values of x into max_bins equidistant bins.

#-> In Package
def binned_entropy(x, max_bins=10):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    hist, bin_edges = np.histogram(x, bins=max_bins)
    probs = hist / x.size
    return -np.sum([p * np.log(p) for p in probs if p != 0])

extract.binned_entropy(df["Close"])


(ii) SVD Entropy

SVD entropy is an indicator of the number of eigenvectors that are needed for an adequate explanation of the data set.

svd_param = [{"Tau": ta, "DE": de}
             for ta in [4]
             for de in [3, 6]]

def _embed_seq(X, Tau, D):
    N = len(X)
    if D * Tau > N:
        print("Cannot build such a matrix, because D * Tau > N")
        exit()
    if Tau < 1:
        print("Tau has to be at least 1")
        exit()
    Y = np.zeros((N - (D - 1) * Tau, D))

    for i in range(0, N - (D - 1) * Tau):
        for j in range(0, D):
            Y[i][j] = X[i + j * Tau]
    return Y

@set_property("fctype", "combiner")
@set_property("custom", True)
def svd_entropy(epochs, param=svd_param):
axis=0

final = []
for par in param:

def svd_entropy_1d(X, Tau, DE):
Y = _embed_seq(X, Tau, DE)
W = np.linalg.svd(Y, compute_uv=0)
W /= sum(W)  # normalize singular values
return -1 * np.sum(W * np.log(W))

Tau = par["Tau"]
DE = par["DE"]

final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0])

return [("Tau_\"{}\"__De_{}\"".format(par["Tau"], par["DE"]), final[en]) for en, par in enumerate(param)]

extract.svd_entropy(df["Close"].values)


(iii) Hjorth

The Hjorth complexity parameter represents the change in frequency. The parameter compares the signal's similarity to a pure sine wave; the value converges to 1 the more similar the signal is.

def _hjorth_mobility(epochs):
    diff = np.diff(epochs, axis=0)
    sigma0 = np.std(epochs, axis=0)
    sigma1 = np.std(diff, axis=0)
    return np.divide(sigma1, sigma0)

@set_property("fctype", "simple")
@set_property("custom", True)
def hjorth_complexity(epochs):
    diff1 = np.diff(epochs, axis=0)
    diff2 = np.diff(diff1, axis=0)
    sigma1 = np.std(diff1, axis=0)
    sigma2 = np.std(diff2, axis=0)
    return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs))

extract.hjorth_complexity(df["Close"])


#### (22) Fixed Points

Fixed points and equilibria as identified from fitted models.

(i) Langevin Fixed Points

Largest fixed point of the dynamics, $\max\{x : h(x)=0\}$, estimated from a polynomial $h(x)$ fitted to the deterministic dynamics of the Langevin model.

#-> In Package
def _estimate_friedrich_coefficients(x, m, r):
    assert m > 0, "Order of polynomial need to be positive integer, found {}".format(m)
    df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)})
    try:
        df['quantiles'] = pd.qcut(df.signal, r)
    except ValueError:
        return [np.NaN] * (m + 1)

    quantiles = df.groupby('quantiles')

    result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()})
    result.dropna(inplace=True)

    try:
        return np.polyfit(result.x_mean, result.y_mean, deg=m)
    except (np.linalg.LinAlgError, ValueError):
        return [np.NaN] * (m + 1)

def max_langevin_fixed_point(x, r=3, m=30):
    coeff = _estimate_friedrich_coefficients(x, m, r)

    try:
        max_fixed_point = np.max(np.real(np.roots(coeff)))
    except (np.linalg.LinAlgError, ValueError):
        return np.nan

    return max_fixed_point

extract.max_langevin_fixed_point(df["Close"])


#### (23) Amplitude

Features derived from peak values in either the positive or negative direction.

(i) Willison Amplitude

This feature is defined as the number of times that the change in signal amplitude exceeds a threshold.

will_param = [ka for ka in [0.2, 3]]

@set_property("fctype", "combiner")
@set_property("custom", True)
def willison_amplitude(X, param=will_param):
    return [("Thresh_{}".format(n), np.sum(np.abs(np.diff(X)) >= n)) for n in param]

extract.willison_amplitude(df["Close"])


(ii) Percent Amplitude

Returns the largest distance from the median value, measured as a percentage of the median.

perc_param = [{"base":ba, "exponent":exp} for ba in [3,5] for exp in [-0.1,-0.2]]

@set_property("fctype", "combiner")
@set_property("custom", True)
def percent_amplitude(x, param =perc_param):
final = []
for par in param:
linear_scale_data = par["base"] ** (par["exponent"] * x)
y_max = np.max(linear_scale_data)
y_min = np.min(linear_scale_data)
y_med = np.median(linear_scale_data)
final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med)))

return [("Base_{}__Exp{}".format(pa["base"],pa["exponent"]),fin) for fin, pa in zip(final,param)]

extract.percent_amplitude(df["Close"])


#### (24) Probability

Given the observed distribution of time lags `cads`, compute the probability that the next observation occurs within `time` minutes of an arbitrary epoch.

#-> fixes required
import scipy.stats as stats

# NOTE: the wrapper below is a reconstruction; the name `cad_prob` and the
# thresholds in `cad_param` are assumptions, not confirmed package API.
cad_param = [0.5, 5, 30]

@set_property("fctype", "combiner")
@set_property("custom", True)
def cad_prob(cads, param=cad_param):
    return [("time_{}".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for time in param]



#### (25) Crossings

Calculates the crossings of the series with other defined values or series.

(i) Zero Crossing Derivative

The edge point is positioned at the zero crossing of the first derivative of the filter.

zero_param = [0.01, 8]

@set_property("fctype", "combiner")
@set_property("custom", True)
def zero_crossing_derivative(epochs, param=zero_param):
    diff = np.diff(epochs)
    norm = diff - diff.mean()
    return [("e_{}".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] <= e) & (epoch[5:] > e))), 0, norm).ravel()[0]) for e in param]

extract.zero_crossing_derivative(df["Close"])


#### (26) Fluctuations

These features again come from the medical signal sciences, but under this category we would include values such as fluctuation-based entropy measures, fluctuation of correlation dynamics, and co-fluctuations.

(i) Detrended Fluctuation Analysis (DFA)

Calculate the Hurst exponent using detrended fluctuation analysis (DFA).

import numpy as np

@set_property("fctype", "simple")
@set_property("custom", True)
def detrended_fluctuation_analysis(epochs):
    def dfa_1d(X, Ave=None, L=None):
        X = np.array(X)

        if Ave is None:
            Ave = np.mean(X)

        Y = np.cumsum(X)
        Y -= Ave

        if L is None:
            # Default box lengths: len(X)/2, len(X)/4, ... over dyadic scales
            L = np.floor(len(X) * 1 / (
                2 ** np.array(list(range(1, int(np.log2(len(X))) - 4))))
            )

        F = np.zeros(len(L))  # F(n) of different given box length n

        for i in range(0, len(L)):
            n = int(L[i])  # for each box length L[i]
            if n == 0:
                raise ValueError("time series is too short while the box length is too big")
            for j in range(0, len(X), n):  # for each box
                if j + n < len(X):
                    c = list(range(j, j + n))
                    # coordinates of time in the box
                    c = np.vstack([c, np.ones(n)]).T
                    # the value of data in the box
                    y = Y[j:j + n]
                    # add residue in this box
                    F[i] += np.linalg.lstsq(c, y, rcond=None)[1]
            F[i] /= ((len(X) / n) * n)
        F = np.sqrt(F)

        stacked = np.vstack([np.log(L), np.ones(len(L))])
        stacked_t = stacked.T
        Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None)

        return Alpha[0][0]

    return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0]

extract.detrended_fluctuation_analysis(df["Close"])


#### (27) Information

Closely related to entropy and complexity measures. Any measure that attempts to quantify the amount of information in an observable variable is included here.

(i) Fisher Information

Fisher information is a statistical information concept distinct from, and earlier than, Shannon information in communication theory.

def _embed_seq(X, Tau, D):
    # Strided (copy-free) version of the delay-embedding used above
    shape = (X.size - Tau * (D - 1), D)
    strides = (X.itemsize, Tau * X.itemsize)
    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)

fisher_param = [{"Tau":ta, "DE":de} for ta in [3,15] for de in [10,5]]

@set_property("fctype", "combiner")
@set_property("custom", True)
def fisher_information(epochs, param=fisher_param):
def fisher_info_1d(a, tau, de):
# taken from pyeeg improvements

mat = _embed_seq(a, tau, de)
W = np.linalg.svd(mat, compute_uv=False)
W /= sum(W)  # normalize singular values
FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1]
return np.sum(FI_v)

return [("Tau_{}__DE_{}".format(par["Tau"], par["DE"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par["Tau"], par["DE"]).ravel()[0]) for par in param]

extract.fisher_information(df["Close"])


#### (28) Fractals

In mathematics, more specifically in fractal geometry, a fractal dimension is a ratio providing a statistical index of complexity comparing how detail in a pattern (strictly speaking, a fractal pattern) changes with the scale at which it is measured.

(i) Higuchi Fractal

Compute the Higuchi fractal dimension of a time series.

hig_para = [{"Kmax": 3},{"Kmax": 5}]

@set_property("fctype", "combiner")
@set_property("custom", True)
def higuchi_fractal_dimension(epochs, param=hig_para):
def hfd_1d(X, Kmax):

L = []
x = []
N = len(X)
for k in range(1, Kmax):
Lk = []
for m in range(0, k):
Lmk = 0
for i in range(1, int(np.floor((N - m) / k))):
Lmk += abs(X[m + i * k] - X[m + i * k - k])
Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k
Lk.append(Lmk)
L.append(np.log(np.mean(Lk)))
x.append([np.log(float(1) / k), 1])

(p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None)
return p[0]

return [("Kmax_{}".format(config["Kmax"]), np.apply_along_axis(hfd_1d, 0, epochs, config["Kmax"]).ravel()[0] ) for  config in param]

extract.higuchi_fractal_dimension(df["Close"])


(ii) Petrosian Fractal

Compute a Petrosian Fractal Dimension of a time series.

@set_property("fctype", "simple")
@set_property("custom", True)
def petrosian_fractal_dimension(epochs):
def pfd_1d(X, D=None):
# taken from pyeeg
"""Compute Petrosian Fractal Dimension of a time series from either two
cases below:
1. X, the time series of type list (default)
2. D, the first order differential sequence of X (if D is provided,
recommended to speed up)
In case 1, D is computed using Numpy's difference function.
To speed up, it is recommended to compute D before calling this function
because D may also be used by other functions whereas computing it here
again will slow down.
"""
if D is None:
D = np.diff(X)
D = D.tolist()
N_delta = 0  # number of sign changes in derivative of the signal
for i in range(1, len(D)):
if D[i] * D[i - 1] < 0:
N_delta += 1
n = len(X)
return np.log10(n) / (np.log10(n) + np.log10(n / n + 0.4 * N_delta))
return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0]

extract.petrosian_fractal_dimension(df["Close"])


#### (29) Exponent

(i) Hurst Exponent

The Hurst exponent is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.

@set_property("fctype", "simple")
@set_property("custom", True)
def hurst_exponent(epochs):
def hurst_1d(X):

X = np.array(X)
N = X.size
T = np.arange(1, N + 1)
Y = np.cumsum(X)
Ave_T = Y / T

S_T = np.zeros(N)
R_T = np.zeros(N)
for i in range(N):
S_T[i] = np.std(X[:i + 1])
X_T = Y - T * Ave_T[i]
R_T[i] = np.ptp(X_T[:i + 1])

for i in range(1, len(S_T)):
if np.diff(S_T)[i - 1] != 0:
break
for j in range(1, len(R_T)):
if np.diff(R_T)[j - 1] != 0:
break
k = max(i, j)
assert k < 10, "rethink it!"

R_S = R_T[k:] / S_T[k:]
R_S = np.log(R_S)

n = np.log(T)[k:]
A = np.column_stack((n, np.ones(n.size)))
[m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0]
H = m
return H
return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0]

extract.hurst_exponent(df["Close"])


(ii) Largest Lyapunov Exponent

In mathematics the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories.

def _embed_seq(X, Tau, D):
    shape = (X.size - Tau * (D - 1), D)
    strides = (X.itemsize, Tau * X.itemsize)
    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)

lyaup_param = [{"Tau": 4, "n": 3, "T": 10, "fs": 9}, {"Tau": 8, "n": 7, "T": 15, "fs": 6}]

@set_property("fctype", "combiner")
@set_property("custom", True)
def largest_lyauponov_exponent(epochs, param=lyaup_param):
    def LLE_1d(x, tau, n, T, fs):

        Em = _embed_seq(x, tau, n)
        M = len(Em)
        A = np.tile(Em, (len(Em), 1, 1))
        B = np.transpose(A, [1, 0, 2])
        square_dists = (A - B) ** 2  # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2
        D = np.sqrt(square_dists[:, :, :].sum(axis=2))  # D[i,j] = ||Em[i]-Em[j]||_2

        # Exclude elements within T of the diagonal
        band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1)
        band[band == 1] = np.inf
        neighbors = (D + band).argmin(axis=0)  # nearest neighbors more than T steps away

        # in_bounds[i,j] = (i+j <= M-1 and i+neighbors[j] <= M-1)
        inc = np.tile(np.arange(M), (M, 1))
        row_inds = (np.tile(np.arange(M), (M, 1)).T + inc)
        col_inds = (np.tile(neighbors, (M, 1)) + inc.T)
        in_bounds = np.logical_and(row_inds <= M - 1, col_inds <= M - 1)
        # Uncomment for old (miscounted) version
        # in_bounds = numpy.logical_and(row_inds < M - 1, col_inds < M - 1)
        row_inds[~in_bounds] = 0
        col_inds[~in_bounds] = 0

        # neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2
        # (reconstructed from the pyeeg-style reference implementation; treat as an assumption)
        neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds)
        J = (~neighbor_dists.mask).sum(axis=1)  # number of in-bounds indices by row
        # Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged

        neighbor_dists[neighbor_dists == 0] = 1

        # !!! this fixes the divide by zero in log error !!!
        neighbor_dists.data[neighbor_dists.data == 0] = 1

        d_ij = np.sum(np.log(neighbor_dists.data), axis=1)
        mean_d = d_ij[J > 0] / J[J > 0]

        x = np.arange(len(mean_d))
        X = np.vstack((x, np.ones(len(mean_d)))).T
        [m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0]
        Lexp = fs * m
        return Lexp

    return [("Tau_{}__n_{}__T_{}__fs_{}".format(par["Tau"], par["n"], par["T"], par["fs"]), np.apply_along_axis(LLE_1d, 0, epochs, par["Tau"], par["n"], par["T"], par["fs"]).ravel()[0]) for par in param]

extract.largest_lyauponov_exponent(df["Close"])


#### (30) Spectral Analysis

Spectral analysis is analysis in terms of a spectrum of frequencies or related quantities such as energies, eigenvalues, etc.

(i) Welch Method

The Welch method is an approach for spectral density estimation. It is used in physics, engineering, and applied mathematics to estimate the power of a signal at different frequencies.

from scipy import signal, integrate

whelch_param = [100, 200]

@set_property("fctype", "combiner")
@set_property("custom", True)
def whelch_method(data, param=whelch_param):

    final = []
    for Fs in param:
        f, pxx = signal.welch(data, fs=Fs, nperseg=1024)
        d = {'psd': pxx, 'freqs': f}
        df = pd.DataFrame(data=d)
        dfs = df.sort_values(['psd'], ascending=False)
        rows = dfs.iloc[:10]
        final.append(rows['freqs'].mean())

    return [("Fs_{}".format(pa), fin) for pa, fin in zip(param, final)]

extract.whelch_method(df["Close"])

#-> Basically same as above
freq_param = [{"fs": 50, "sel": 15}, {"fs": 200, "sel": 20}]

@set_property("fctype", "combiner")
@set_property("custom", True)
def find_freq(serie, param=freq_param):

    final = []
    for par in param:
        fft0 = np.fft.rfft(serie * np.hanning(len(serie)))
        freqs = np.fft.rfftfreq(len(serie), d=1.0 / par["fs"])
        fftmod = np.array([np.sqrt(fft0[i].real ** 2 + fft0[i].imag ** 2) for i in range(0, len(fft0))])
        d = {'fft': fftmod, 'freq': freqs}
        df = pd.DataFrame(d)
        hop = df.sort_values(['fft'], ascending=False)
        rows = hop.iloc[:par["sel"]]
        final.append(rows['freq'].mean())

    return [("Fs_{}__sel{}".format(pa["fs"], pa["sel"]), fin) for pa, fin in zip(param, final)]

extract.find_freq(df["Close"])


#### (31) Percentile

(i) Flux Percentile

Flux (or radiant flux) is the total amount of energy that crosses a unit area per unit time. Flux is an astronomical value, measured in joules per square metre per second (J/m²/s), or watts per square metre. Here we provide the ratio of flux percentiles.

#-> In Package

import math

def flux_perc(magnitude):
    sorted_data = np.sort(magnitude)
    lc_length = len(sorted_data)

    F_60_index = int(math.ceil(0.60 * lc_length))
    F_40_index = int(math.ceil(0.40 * lc_length))
    F_5_index = int(math.ceil(0.05 * lc_length))
    F_95_index = int(math.ceil(0.95 * lc_length))

    F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index]
    F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index]
    F_mid20 = F_40_60 / F_5_95

    return {"FluxPercentileRatioMid20": F_mid20}

extract.flux_perc(df["Close"])


#### (32) Range

(i) Range of Cumulative Sum

@set_property("fctype", "simple")
@set_property("custom", True)
def range_cum_s(magnitude):
sigma = np.std(magnitude)
N = len(magnitude)
m = np.mean(magnitude)
s = np.cumsum(magnitude - m) * 1.0 / (N * sigma)
R = np.max(s) - np.min(s)
return {"Rcs": R}

extract.range_cum_s(df["Close"])


#### (33) Structural

Structural features, potential placeholders for future research.

(i) Structure Function

The structure function of rotation measures (RMs) contains information on electron density and magnetic field fluctuations when used in astronomy. It becomes a custom feature when used with your own unique time series data.

from scipy.interpolate import interp1d

struct_param = {"Volume": df["Volume"].values, "Open": df["Open"].values}

@set_property("fctype", "combiner")
@set_property("custom", True)
def structure_func(time, param=struct_param):

    dict_final = {}
    for key, magnitude in param.items():
        dict_final[key] = []
        Nsf, Np = 100, 100
        sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf)
        f = interp1d(time, magnitude)

        time_int = np.linspace(np.min(time), np.max(time), Np)
        mag_int = f(time_int)

        for tau in np.arange(1, Nsf):
            sf1[tau - 1] = np.mean(
                np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0))
            sf2[tau - 1] = np.mean(
                np.abs(np.power(
                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0)))
            sf3[tau - 1] = np.mean(
                np.abs(np.power(
                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0)))
        sf1_log = np.log10(np.trim_zeros(sf1))
        sf2_log = np.log10(np.trim_zeros(sf2))
        sf3_log = np.log10(np.trim_zeros(sf3))

        if len(sf1_log) and len(sf2_log):
            m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1)
        else:
            m_21 = np.nan

        if len(sf1_log) and len(sf3_log):
            m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1)
        else:
            m_31 = np.nan

        if len(sf2_log) and len(sf3_log):
            m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1)
        else:
            m_32 = np.nan

        dict_final[key].append(m_21)
        dict_final[key].append(m_31)
        dict_final[key].append(m_32)

    return [("StructureFunction_{}__m_{}".format(key, name), li) for key, lis in dict_final.items() for name, li in zip([21, 31, 32], lis)]

struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}

extract.structure_func(df["Close"],struct_param)


#### (34) Distribution

(i) Kurtosis

#-> In Package
def kurtosis(x):

    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return pd.Series.kurtosis(x)

extract.kurtosis(df["Close"])


(ii) Stetson Kurtosis

@set_property("fctype", "simple")
@set_property("custom", True)
def stetson_k(x):
"""A robust kurtosis statistic."""
n = len(x)
x0 = stetson_mean(x, 1./20**2)
delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20
ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2))
return ta

extract.stetson_k(df["Close"])


## (5) Synthesise

Time-Series Synthesisation (TSS) happens before the feature extraction step, and Cross-Sectional Synthesisation (CSS) happens after the feature extraction step. Currently I only include a CSS package; in the future I will develop this section out further. This area still has a lot of performance and stability issues, but it might eventually become a more viable candidate for improving predictions.

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

def model(df_final):
    # `test` is a global hold-out slice defined earlier in the notebook
    model = LGBMRegressor()
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"], axis=1), train["Close_1"])
    preds = model.predict(test.drop(["Close_1"], axis=1))
    val = mean_squared_error(test["Close_1"], preds)
    return val

pip install ctgan

from ctgan import CTGANSynthesizer

#discrete_columns = [""]
ctgan = CTGANSynthesizer()
ctgan.fit(df,epochs=10) #15


Random Benchmark

np.random.seed(1)
df_in = df.copy()
df_in["Close_1"] = np.random.permutation(df_in["Close_1"].values)
model(df_in)


Generated Performance

df_gen = ctgan.sample(len(df_in)*100)
model(df_gen)


As expected, a cross-sectional technique does not work well on time-series data; other methods will be investigated in the future.

## (6) Skeleton Example

Here I will perform tabular augmentation methods on a small dataset with single-digit features and around 250 instances. This is not necessarily the best-sized dataset to highlight the performance of tabular augmentation, as some methods like extraction would be overkill and would lead to dimensionality problems. It is also good to know that there is a close to infinite number of ways to perform these augmentation methods. In the future, automated augmentation methods can guide the experiment process.

The approach taken in this skeleton is to develop running models that are tested after each augmentation, to highlight which methods might work well on this particular dataset. The metric we will use is mean squared error. In this implementation we do not have special hold-out sets; a sketch of the augment-then-evaluate loop follows.

The framework of implementation above will be consulted, but one still has to be strategic about when to apply which function, and make sure the data is processed with appropriate techniques (dropping null values, filling null values) at the appropriate time.
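
A minimal sketch of that loop, assuming the `transform` module and the `model` function defined under Validation below; the `steps` list itself is illustrative, not a recommendation:

steps = [
    ("naive_dec", lambda d: transform.naive_dec(d.copy(), ["Close", "Open"], freq=5)),
    ("bkb", lambda d: transform.bkb(d, ["Close", "Low"])),
]

scores = {"baseline": model(df)}
df_step = df.copy()
for name, fn in steps:
    df_step = fn(df_step)
    scores[name] = model(df_step.dropna())  # keep an augmentation only if the MSE drops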

#### Validation

Develop Model and Define Metric

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

def model(df_final):
    # `test` is a global hold-out slice defined earlier in the notebook
    model = LGBMRegressor()
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"], axis=1), train["Close_1"])
    preds = model.predict(test.drop(["Close_1"], axis=1))
    val = mean_squared_error(test["Close_1"], preds)
    return val


df = data_copy()

model(df)

302.61676570345287


(1) (7) (i) Transformation - Decomposition - Naive

## If Inferred Seasonality is Too Large Default to Five
seasons = transform.infer_seasonality(df["Close"],index=0)
df_out = transform.naive_dec(df.copy(), ["Close","Open"], freq=5)
model(df_out) #improvement

274.34477082783525


(1) (8) (i) Transformation - Filter - Baxter-King-Bandpass

df_out = transform.bkb(df_out, ["Close","Low"])
df_best = df_out.copy()
model(df_out) #improvement

267.1826850968307


(1) (3) (i) Transformation - Differentiation - Fractional

df_out = transform.fast_fracdiff(df_out, ["Close_BPF"],0.5)
model(df_out) #null

267.7083192402742


(1) (1) (i) Transformation - Scaling - Robust Scaler

df_out = df_out.dropna()
df_out = transform.robust_scaler(df_out, drop=["Close_1"])
model(df_out) #noisy

270.96980399571214


(2) (2) (i) Interactions - Operator - Multiplication/Division

df_out.head()

df_out = interact.muldiv(df_out, ["Close","Open_NDDS","Low_BPF"])
model(df_out) #noisy

285.6420643864313

df_r = df_out.copy()


(2) (6) (i) Interactions - Speciality - Technical

import ta
df = interact.tech(df)
df_out = pd.merge(df_out,  df.iloc[:,7:], left_index=True, right_index=True, how="left")


Clean Dataframe and Metric

"""Droping column where missing values are above a threshold"""
df_out = df_out.dropna(thresh = len(df_out)*0.95, axis = "columns")
df_out = df_out.dropna()
df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
close = df_out["Close"].copy()
df_d = df_out.copy()
model(df_out) #improve

592.52971755184


(3) (1) (i) Mapping - Eigen Decomposition - PCA

from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

df_out = transform.robust_scaler(df_out, drop=["Close_1"])

df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
df_out = mapper.pca_feature(df_out, drop_cols=["Close_1"], variance_or_components=0.9, n_components=8,non_linear=False)

model(df_out) #noisy but not too bad given the 10 fold dimensionality reduction

687.158330455884


(4) Extracting

Here I first show the functions that have been added to the DeltaPy fork of tsfresh. You have to add your own personal adjustments based on the features you would like to construct. I am using self-developed features, but you can also use tsfresh's community functions.

The following files have been appropriately amended (get in contact for advice).

(4) (10) (i) Extracting - Averages - GSkew

extract.gskew(df_out["PCA_1"])

-0.7903067336449059


(4) (21) (ii) Extracting - Entropy - SVD Entropy

svd_param = [{"Tau": ta, "DE": de}
for ta in [4]
for de in [3,6]]

extract.svd_entropy(df_out["PCA_1"],svd_param)

[('Tau_"4"__De_3"', 0.7234823323374294),
('Tau_"4"__De_6"', 1.3014347840145244)]


(4) (13) (ii) Extracting - Streaks - Wozniak

woz_param = [{"consecutiveStar": n} for n in [2, 4]]

extract.wozniak(df_out["PCA_1"],woz_param)

[('consecutiveStar_2', 0.012658227848101266), ('consecutiveStar_4', 0.0)]


(4) (28) (i) Extracting - Fractal - Higuchi

hig_param = [{"Kmax": 3},{"Kmax": 5}]

extract.higuchi_fractal_dimension(df_out["PCA_1"],hig_param)

[('Kmax_3', 0.577913816027104), ('Kmax_5', 0.8176960510304725)]


(4) (5) (ii) Extracting - Volatility - Variability Index

var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}

extract.var_index(df["Close"].values,var_index_param)

{'Interact__Open': 0.00396022538846289,
'Interact__Volume': 0.20550155114176533}


Time Series Extraction

pip install git+git://github.com/firmai/tsfresh.git

#Construct the preferred input dataframe.
from tsfresh.utilities.dataframe_functions import roll_time_series
df_out["ID"] = 0
periods = 30
df_out = df_out.reset_index()
df_ts = roll_time_series(df_out,"ID","Date",None,1,periods)
counts = df_ts['ID'].value_counts()
df_ts = df_ts[df_ts['ID'].isin(counts[counts > periods].index)]

#Perform extraction
from tsfresh.feature_extraction import extract_features, CustomFCParameters
settings_dict = CustomFCParameters()
settings_dict["var_index"] = {"PCA_1":None, "PCA_2": None}
df_feat = extract_features(df_ts.drop(["Close_1"],axis=1),default_fc_parameters=settings_dict,column_id="ID",column_sort="Date")

Feature Extraction: 100%|██████████| 5/5 [00:10<00:00,  2.14s/it]

# Cleaning operations
import pandasvault as pv
df_feat2 = df_feat.copy()
df_feat = df_feat.dropna(thresh = len(df_feat)*0.50, axis = "columns")
df_feat_cons = pv.constant_feature_detect(data=df_feat,threshold=0.9)
df_feat = df_feat.drop(df_feat_cons, axis=1)
df_feat = df_feat.ffill()
df_feat = pd.merge(df_feat,df[["Close_1"]],left_index=True,right_index=True,how="left")
print(df_feat.shape)
model(df_feat) #noisy

7  variables are found to be almost constant
(208, 48)
2064.7813982935995

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

impute(df_feat)
df_feat_2 = select_features(df_feat.drop(["Close_1"],axis=1),df_feat["Close_1"],fdr_level=0.05)
df_feat_2["Close_1"] = df_feat["Close_1"]
model(df_feat_2) #improvement (b/ not an augmentation method)

1577.5273071299482


(3) (6) (i) Feature Agglomeration;   (1)(2)(i) Standard Scaler.

As in this step, after (1), (2), (3), (4) and (5) you can often circle back to the initial steps to normalise the data and dimensionally reduce it for the final model.

import numpy as np
from sklearn import cluster

def feature_agg(df, drop, components):
    components = min(df.shape[1] - 1, components)
    agglo = cluster.FeatureAgglomeration(n_clusters=components)
    df = df.drop(drop, axis=1)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df))

    return df

df_final = transform.standard_scaler(df_feat_2, drop=["Close_1"])
df_final = mapper.feature_agg(df_final,["Close_1"],4)
df_final.index = df_feat.index
df_final["Close_1"] = df_feat["Close_1"]
model(df_final) #noisy

1949.89085894338


Final Model After Applying 13 Arbitrary Augmentation Techniques

model(df_final) #improvement

1949.89085894338


Original Model Before Augmentation

df_org = df.iloc[:,:7][df.index.isin(df_final.index)]
model(df_org)

389.783990984133


Best Model After Developing 8 Augmenting Features

df_best = df_best.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
model(df_best)

267.1826850968307


Commentary

There are countless ways in which the current model can be improved, and this can become an automated process in which all techniques are tested against a hold-out set; even where a technique improves the score here, more robust tests are needed. The skeleton example above is not meant to highlight the performance of the package. It simply serves as an example of how one can go about applying augmentation methods.

Quite naturally, this example suffers from dimensionality issues, with array shapes reaching (208, 48); furthermore, you would need a sample that is at least 50-100 times larger before machine learning methods start to make sense.

Nonetheless, in this example, Transformation, Interactions and Mappings (applied to extraction output) performed fairly well. Extraction augmentation was overkill, but created a reasonable model when dimensionally reduced. A better selection of one of the 50+ augmentation methods and the order of augmentation could further help improve the outcome if robustly tested against development sets.

[1] DeltaPy Development

Author: firmai
Source Code: https://github.com/firmai/deltapy

#engineering