
In this series, we will solve several interesting problems and try to decode the computational logic behind the tricky solutions.

Code to Contribute:

You must all be aware of the rising COVID-19 situation in India; as a result, we hear about many casualties every day. Some of them are also a result of poverty and hunger caused by the pandemic.

GeeksforGeeks, in association with GiveIndia, has come up with a contest through which anyone can help people in need for free.

Details of the contest:

- Contest is free for all.
- On every participation, GeeksforGeeks in association with GiveIndia donates a meal for a family in need.
- For every participant who finishes in the top 50, GeeksforGeeks in association with GiveIndia donates a week’s meals for a family in need.
- A personalized certificate for each participant.
- Find more details here

https://practice.geeksforgeeks.org/contest/code-to-contribute/


The AWS IoT Device SDK for Embedded C (C-SDK) is a collection of C source files under the MIT open source license that can be used in embedded applications to securely connect IoT devices to AWS IoT Core. It contains MQTT client, HTTP client, JSON Parser, AWS IoT Device Shadow, AWS IoT Jobs, and AWS IoT Device Defender libraries. This SDK is distributed in source form, and can be built into customer firmware along with application code, other libraries and an operating system (OS) of your choice. These libraries are only dependent on standard C libraries, so they can be ported to various OS's - from embedded Real Time Operating Systems (RTOS) to Linux/Mac/Windows. You can find sample usage of C-SDK libraries on POSIX systems using OpenSSL (e.g. Linux demos in this repository), and on FreeRTOS using mbedTLS (e.g. FreeRTOS demos in FreeRTOS repository).

For the latest release of C-SDK, please see the section for Releases and Documentation.

**C-SDK includes libraries that are part of the FreeRTOS 202012.01 LTS release. Learn more about the FreeRTOS 202012.01 LTS libraries by clicking here.**

The C-SDK libraries are licensed under the MIT open source license.

C-SDK simplifies access to various AWS IoT services. C-SDK has been tested to work with AWS IoT Core and an open source MQTT broker to ensure interoperability. The AWS IoT Device Shadow, AWS IoT Jobs, and AWS IoT Device Defender libraries are flexible to work with any MQTT client and JSON parser. The MQTT client and JSON parser libraries are offered as choices without being tightly coupled with the rest of the SDK. C-SDK contains the following libraries:

The coreMQTT library provides the ability to establish an MQTT connection with a broker over a customer-implemented transport layer, which can either be a secure channel like a TLS session (mutually authenticated or server-only authentication) or a non-secure channel like a plaintext TCP connection. This MQTT connection can be used for performing publish operations to MQTT topics and subscribing to MQTT topics. The library provides a mechanism to register customer-defined callbacks for receiving incoming PUBLISH, acknowledgement and keep-alive response events from the broker. The library has been refactored for memory optimization and is compliant with the MQTT 3.1.1 standard. It has no dependencies on any additional libraries other than the standard C library, a customer-implemented network transport interface, and optionally a customer-implemented platform time function. The refactored design embraces different use-cases, ranging from resource-constrained platforms using only QoS 0 MQTT PUBLISH messages to resource-rich platforms using QoS 2 MQTT PUBLISH over TLS connections.

See memory requirements for the latest release here.
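The "customer-implemented transport layer" mentioned above can be pictured as a pair of send and receive callbacks plus an opaque context. The following is a minimal sketch under that assumption, using an in-memory loopback instead of a real socket. All names here (`SketchTransport_t`, `LoopbackContext_t`, `loopbackSend`, `loopbackRecv`) are hypothetical stand-ins for illustration and are not the coreMQTT API; consult the library's own transport interface documentation for the real types and return-value contract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context: bytes queued between a send and a later receive. */
typedef struct LoopbackContext
{
    uint8_t buffer[ 64 ]; /* Queued bytes. */
    size_t length;        /* Number of valid bytes in buffer. */
} LoopbackContext_t;

/* Hypothetical callback shapes: return the number of bytes handled. */
typedef int32_t ( * SketchSend_t )( void * pContext, const void * pData, size_t bytesToSend );
typedef int32_t ( * SketchRecv_t )( void * pContext, void * pData, size_t bytesToRecv );

/* Hypothetical bundle that a protocol library would call through. */
typedef struct SketchTransport
{
    SketchSend_t send;
    SketchRecv_t recv;
    void * pContext;
} SketchTransport_t;

/* Queue as many bytes as fit; return the count actually queued. */
static int32_t loopbackSend( void * pContext, const void * pData, size_t bytesToSend )
{
    LoopbackContext_t * pCtx = ( LoopbackContext_t * ) pContext;
    size_t space = sizeof( pCtx->buffer ) - pCtx->length;
    size_t count = ( bytesToSend < space ) ? bytesToSend : space;

    memcpy( &pCtx->buffer[ pCtx->length ], pData, count );
    pCtx->length += count;
    return ( int32_t ) count;
}

/* Hand back up to bytesToRecv queued bytes; 0 means nothing is available. */
static int32_t loopbackRecv( void * pContext, void * pData, size_t bytesToRecv )
{
    LoopbackContext_t * pCtx = ( LoopbackContext_t * ) pContext;
    size_t count = ( bytesToRecv < pCtx->length ) ? bytesToRecv : pCtx->length;

    memcpy( pData, pCtx->buffer, count );
    memmove( pCtx->buffer, &pCtx->buffer[ count ], pCtx->length - count );
    pCtx->length -= count;
    return ( int32_t ) count;
}
```

In a real port, the two callbacks would wrap a TLS session or a plaintext TCP socket, and the context would hold the socket or session handle rather than a byte queue.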

The coreHTTP library provides the ability to establish an HTTP connection with a server over a customer-implemented transport layer, which can either be a secure channel like a TLS session (mutually authenticated or server-only authentication) or a non-secure channel like a plaintext TCP connection. The HTTP connection can be used to make "GET" (include range requests), "PUT", "POST" and "HEAD" requests. The library provides a mechanism to register a customer-defined callback for receiving parsed header fields in an HTTP response. The library has been refactored for memory optimization, and is a client implementation of a subset of the HTTP/1.1 standard.

See memory requirements for the latest release here.

The coreJSON library is a JSON parser that strictly enforces the ECMA-404 JSON standard. It provides a function to validate a JSON document, and a function to search for a key and return its value. A search can descend into nested structures using a compound query key. A JSON document validation also checks for illegal UTF8 encodings and illegal Unicode escape sequences.

See memory requirements for the latest release here.

The corePKCS11 library is an implementation of the PKCS #11 interface (API) that makes it easier to develop applications that rely on cryptographic operations. Only a subset of the PKCS #11 v2.4 standard has been implemented, with a focus on operations involving asymmetric keys, random number generation, and hashing.

The Cryptoki or PKCS #11 standard defines a platform-independent API to manage and use cryptographic tokens. The name, "PKCS #11", is used interchangeably to refer to the API itself and the standard which defines it.

The PKCS #11 API is useful for writing software without taking a dependency on any particular implementation or hardware. By writing against the PKCS #11 standard interface, code can be used interchangeably with multiple algorithms, implementations and hardware.

Generally vendors for secure cryptoprocessors such as Trusted Platform Module (TPM), Hardware Security Module (HSM), Secure Element, or any other type of secure hardware enclave, distribute a PKCS #11 implementation with the hardware. The purpose of corePKCS11 mock is therefore to provide a PKCS #11 implementation that allows for rapid prototyping and development before switching to a cryptoprocessor specific PKCS #11 implementation in production devices.

Since the PKCS #11 interface is defined as part of the PKCS #11 specification replacing corePKCS11 with another implementation should require little porting effort, as the interface will not change. The system tests distributed in corePKCS11 repository can be leveraged to verify the behavior of a different implementation is similar to corePKCS11.

See memory requirements for the latest release here.

The AWS IoT Device Shadow library enables you to store and retrieve the current state of one or more shadows of every registered device. A device’s shadow is a persistent, virtual representation of your device that you can interact with from AWS IoT Core even if the device is offline. The device state captured in its "shadow" is represented as a JSON document. The device can send commands over MQTT to get, update and delete its latest state as well as receive notifications over MQTT about changes in its state. The device’s shadow(s) are uniquely identified by the name of the corresponding "thing", a representation of a specific device or logical entity on the AWS Cloud. See Managing Devices with AWS IoT for more information on IoT "thing". This library supports named shadows, a feature of the AWS IoT Device Shadow service that allows you to create multiple shadows for a single IoT device. More details about AWS IoT Device Shadow can be found in AWS IoT documentation.

The AWS IoT Device Shadow library has no dependencies on additional libraries other than the standard C library. It also doesn’t have any platform dependencies, such as threading or synchronization. It can be used with any MQTT library and any JSON library (see demos with coreMQTT and coreJSON).

See memory requirements for the latest release here.
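To make the difference between a classic shadow and a named shadow concrete, the sketch below composes the MQTT "update" topic for each case. It assumes the topic layout documented for the AWS IoT Device Shadow service (`$aws/things/<thingName>/shadow/update` and `$aws/things/<thingName>/shadow/name/<shadowName>/update`). The helper `buildShadowUpdateTopic` is a hypothetical name for illustration, not part of the Shadow library, which ships its own macros and functions for this purpose. It also uses `snprintf`, which requires a C99 standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the Shadow library API.
 * Writes the "update" topic for a classic shadow (pShadowName == NULL)
 * or for a named shadow. Returns the topic length, or 0 if the buffer
 * is too small or formatting fails. */
static size_t buildShadowUpdateTopic( char * pBuffer,
                                      size_t bufferSize,
                                      const char * pThingName,
                                      const char * pShadowName )
{
    int written;

    if( pShadowName == NULL )
    {
        written = snprintf( pBuffer, bufferSize,
                            "$aws/things/%s/shadow/update",
                            pThingName );
    }
    else
    {
        written = snprintf( pBuffer, bufferSize,
                            "$aws/things/%s/shadow/name/%s/update",
                            pThingName, pShadowName );
    }

    /* snprintf reports the length it wanted; reject truncated output. */
    if( ( written < 0 ) || ( ( size_t ) written >= bufferSize ) )
    {
        return 0U;
    }

    return ( size_t ) written;
}
```

The resulting string can be handed to whichever MQTT library is in use as the publish topic, since the Shadow library itself does not depend on a particular MQTT client.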

The AWS IoT Jobs library enables you to interact with the AWS IoT Jobs service which notifies one or more connected devices of a pending “Job”. A Job can be used to manage your fleet of devices, update firmware and security certificates on your devices, or perform administrative tasks such as restarting devices and performing diagnostics. For documentation of the service, please see the AWS IoT Developer Guide. Interactions with the Jobs service use the MQTT protocol. This library provides an API to compose and recognize the MQTT topic strings used by the Jobs service.

The AWS IoT Jobs library has no dependencies on additional libraries other than the standard C library. It also doesn’t have any platform dependencies, such as threading or synchronization. It can be used with any MQTT library and any JSON library (see demos with libmosquitto and coreJSON).

See memory requirements for the latest release here.
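"Recognizing" a Jobs topic amounts to checking that an incoming topic string sits in the Jobs namespace of a given thing. The sketch below shows one way to do that with plain string comparison, assuming the `$aws/things/<thingName>/jobs/...` layout used by the AWS IoT Jobs service. `matchJobsTopic` is a hypothetical helper for illustration and is not the Jobs library API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the Jobs library API.
 * If pTopic belongs to the Jobs namespace of pThingName, return a pointer
 * to the part after "$aws/things/<thingName>/jobs/"; otherwise return NULL. */
static const char * matchJobsTopic( const char * pTopic, const char * pThingName )
{
    static const char prefix[] = "$aws/things/";
    static const char bridge[] = "/jobs/";
    size_t prefixLength = sizeof( prefix ) - 1U;
    size_t bridgeLength = sizeof( bridge ) - 1U;
    size_t thingLength = strlen( pThingName );

    /* The topic must start with the reserved AWS prefix. */
    if( strncmp( pTopic, prefix, prefixLength ) != 0 )
    {
        return NULL;
    }

    pTopic += prefixLength;

    /* The thing name must match exactly... */
    if( strncmp( pTopic, pThingName, thingLength ) != 0 )
    {
        return NULL;
    }

    pTopic += thingLength;

    /* ...and be followed directly by the Jobs segment. */
    if( strncmp( pTopic, bridge, bridgeLength ) != 0 )
    {
        return NULL;
    }

    return pTopic + bridgeLength;
}
```

For an incoming topic such as `$aws/things/myThing/jobs/$next/get/accepted`, the returned remainder (`$next/get/accepted`) is what an application would then dispatch on.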

The AWS IoT Device Defender library enables you to interact with the AWS IoT Device Defender service to continuously monitor security metrics from devices for deviations from what you have defined as appropriate behavior for each device. If something doesn’t look right, AWS IoT Device Defender sends out an alert so you can take action to remediate the issue. More details about Device Defender can be found in AWS IoT Device Defender documentation. This library supports custom metrics, a feature that helps you monitor operational health metrics that are unique to your fleet or use case. For example, you can define a new metric to monitor the memory usage or CPU usage on your devices.

The AWS IoT Device Defender library has no dependencies on additional libraries other than the standard C library. It also doesn’t have any platform dependencies, such as threading or synchronization. It can be used with any MQTT library and any JSON library (see demos with coreMQTT and coreJSON).

See memory requirements for the latest release here.

The AWS IoT Over-the-air Update (OTA) library enables you to manage the notification of a newly available update, download the update, and perform cryptographic verification of the firmware update. Using the OTA library, you can logically separate firmware updates from the application running on your devices. You can also use the library to send other files (e.g. images, certificates) to one or more devices registered with AWS IoT. More details about OTA library can be found in AWS IoT Over-the-air Update documentation.

The AWS IoT Over-the-air Update library has a dependency on coreJSON for parsing of JSON job document and tinyCBOR for decoding encoded data streams, other than the standard C library. It can be used with any MQTT library, HTTP library, and operating system (e.g. Linux, FreeRTOS) (see demos with coreMQTT and coreHTTP over Linux).

See memory requirements for the latest release here.

The AWS IoT Fleet Provisioning library enables you to interact with the AWS IoT Fleet Provisioning MQTT APIs in order to provision IoT devices without preexisting device certificates. With AWS IoT Fleet Provisioning, devices can securely receive unique device certificates from AWS IoT when they connect for the first time. For an overview of all provisioning options offered by AWS IoT, see device provisioning documentation. For details about Fleet Provisioning, refer to the AWS IoT Fleet Provisioning documentation.

See memory requirements for the latest release here.

The AWS SigV4 library enables you to sign HTTP requests with Signature Version 4 Signing Process. Signature Version 4 (SigV4) is the process to add authentication information to HTTP requests to AWS services. For security, most requests to AWS must be signed with an access key. The access key consists of an access key ID and secret access key.

See memory requirements for the latest release here.

The backoffAlgorithm library is a utility library to calculate backoff period using an exponential backoff with jitter algorithm for retrying network operations (like failed network connection with server). This library uses the "Full Jitter" strategy for the exponential backoff with jitter algorithm. More information about the algorithm can be seen in the Exponential Backoff and Jitter AWS blog.

Exponential backoff with jitter is typically used when retrying a failed connection or network request to the server. It helps mitigate failed network operations caused by network congestion or high load on the server by spreading out retry requests across multiple devices attempting network operations. In addition, in an environment with poor connectivity, a client can get disconnected at any time. A backoff strategy helps the client conserve battery by not repeatedly attempting reconnections when they are unlikely to succeed.

The backoffAlgorithm library has no dependencies on libraries other than the standard C library.

See memory requirements for the latest release here.
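As a rough illustration of the "Full Jitter" strategy described above, the sketch below picks a delay uniformly between zero and an exponentially growing ceiling that is capped at a maximum. This is a standalone sketch and not the backoffAlgorithm library's code or API: the function name `fullJitterDelay` and its parameters are hypothetical, and the caller is assumed to supply `randomValue` from its own random number source.

```c
#include <assert.h>
#include <stdint.h>

/* Standalone sketch of "Full Jitter"; not the backoffAlgorithm library API.
 * Returns a delay in [0, min(maxDelay, baseDelay * 2^attempt)].
 * randomValue must come from the caller's random number source. */
static uint32_t fullJitterDelay( uint32_t baseDelay,
                                 uint32_t maxDelay,
                                 uint32_t attempt,
                                 uint32_t randomValue )
{
    uint32_t ceiling = baseDelay;
    uint32_t i;

    /* Double the ceiling once per attempt, saturating at maxDelay
     * so the multiplication can never overflow. */
    for( i = 0U; ( i < attempt ) && ( ceiling < maxDelay ); i++ )
    {
        ceiling = ( ceiling > ( maxDelay / 2U ) ) ? maxDelay : ( ceiling * 2U );
    }

    /* Covers the case where baseDelay itself exceeds maxDelay. */
    if( ceiling > maxDelay )
    {
        ceiling = maxDelay;
    }

    /* Map the random value onto [0, ceiling] (ignoring modulo bias). */
    return ( uint32_t ) ( ( uint64_t ) randomValue % ( ( uint64_t ) ceiling + 1U ) );
}
```

With a base of 500 ms and a cap of 5000 ms, the ceiling grows as 500, 1000, 2000, 4000, 5000 and then stays at 5000; each retry waits a random time below the current ceiling, which is what spreads retries from many devices apart.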

When establishing a connection with AWS IoT, users can optionally report the Operating System, Hardware Platform and MQTT client version information of their device to AWS. This information can help AWS IoT provide faster issue resolution and technical support. If users want to report this information, they can send a specially formatted string (see below) in the username field of the MQTT CONNECT packet.

Format

The format of the username string with metrics is:

```
<Actual_Username>?SDK=<OS_Name>&Version=<OS_Version>&Platform=<Hardware_Platform>&MQTTLib=<MQTT_Library_name>@<MQTT_Library_version>
```

Where

- `<Actual_Username>` is the actual username used for authentication, if username and password are used for authentication. When username and password based authentication is not used, this is an empty value.
- `<OS_Name>` is the Operating System the application is running on (e.g. Ubuntu)
- `<OS_Version>` is the version number of the Operating System (e.g. 20.10)
- `<Hardware_Platform>` is the Hardware Platform the application is running on (e.g. RaspberryPi)
- `<MQTT_Library_name>` is the MQTT Client library being used (e.g. coreMQTT)
- `<MQTT_Library_version>` is the version of the MQTT Client library being used (e.g. 1.1.0)

Example

- Actual_Username = “iotuser”, OS_Name = Ubuntu, OS_Version = 20.10, Hardware_Platform = RaspberryPi, MQTT_Library_name = coremqtt, MQTT_Library_version = 1.1.0. If a username is not used, then “iotuser” can be removed.

```
/* Username string:
* iotuser?SDK=Ubuntu&Version=20.10&Platform=RaspberryPi&MQTTLib=coremqtt@1.1.0
*/
#define OS_NAME "Ubuntu"
#define OS_VERSION "20.10"
#define HARDWARE_PLATFORM_NAME "RaspberryPi"
#define MQTT_LIB "coremqtt@1.1.0"
#define USERNAME_STRING "iotuser?SDK=" OS_NAME "&Version=" OS_VERSION "&Platform=" HARDWARE_PLATFORM_NAME "&MQTTLib=" MQTT_LIB
#define USERNAME_STRING_LENGTH ( ( uint16_t ) ( sizeof( USERNAME_STRING ) - 1 ) )
MQTTConnectInfo_t connectInfo;
connectInfo.pUserName = USERNAME_STRING;
connectInfo.userNameLength = USERNAME_STRING_LENGTH;
mqttStatus = MQTT_Connect( pMqttContext, &connectInfo, NULL, CONNACK_RECV_TIMEOUT_MS, pSessionPresent );
```
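When the operating system or library details are only known at run time, the same username string can be assembled with `snprintf` instead of compile-time string concatenation. The following is a minimal sketch of that variant; `buildMetricsUsername` and its parameter names are hypothetical and not part of the SDK, and `snprintf` assumes a C99 standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the SDK.
 * Formats "<username>?SDK=<os>&Version=<ver>&Platform=<hw>&MQTTLib=<lib>@<libver>".
 * Pass pUsername == NULL when username/password authentication is not used.
 * Returns the string length, or 0 if the buffer is too small. */
static size_t buildMetricsUsername( char * pBuffer,
                                    size_t bufferSize,
                                    const char * pUsername,
                                    const char * pOsName,
                                    const char * pOsVersion,
                                    const char * pPlatform,
                                    const char * pMqttLib,
                                    const char * pMqttLibVersion )
{
    int written = snprintf( pBuffer, bufferSize,
                            "%s?SDK=%s&Version=%s&Platform=%s&MQTTLib=%s@%s",
                            ( pUsername != NULL ) ? pUsername : "",
                            pOsName, pOsVersion, pPlatform,
                            pMqttLib, pMqttLibVersion );

    /* Reject formatting errors and truncated output. */
    if( ( written < 0 ) || ( ( size_t ) written >= bufferSize ) )
    {
        return 0U;
    }

    return ( size_t ) written;
}
```

The buffer and returned length can then be assigned to the `pUserName` and `userNameLength` fields shown in the snippet above; because `userNameLength` is a `uint16_t`, the length should be checked against `UINT16_MAX` before casting.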

C-SDK releases will now follow a date based versioning scheme with the format YYYYMM.NN, where:

- Y represents the year.
- M represents the month.
- N represents the release order within the designated month (00 being the first release).

For example, a second release in June 2021 would be 202106.01. Although the SDK releases have moved to date-based versioning, each library within the SDK will still retain semantic versioning. In semantic versioning, the version number itself (X.Y.Z) indicates whether the release is a major, minor, or point release. You can use the semantic version of a library to assess the scope and impact of a new release on your application.
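For tooling that needs to compare or sort SDK releases, the YYYYMM.NN format can be split into its three fields. The sketch below is one possible parser, written only from the format description above; `ReleaseVersion_t` and `parseReleaseVersion` are hypothetical names and are not provided by the SDK.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type holding the three fields of a YYYYMM.NN version. */
typedef struct ReleaseVersion
{
    unsigned int year;  /* YYYY */
    unsigned int month; /* MM, 1 to 12 */
    unsigned int order; /* NN, 0 is the first release of the month */
} ReleaseVersion_t;

/* Parse "YYYYMM.NN" into pOut. Returns 1 on success, 0 if malformed. */
static int parseReleaseVersion( const char * pText, ReleaseVersion_t * pOut )
{
    size_t i;

    /* Exactly nine characters with the dot in the seventh position. */
    if( ( strlen( pText ) != 9U ) || ( pText[ 6 ] != '.' ) )
    {
        return 0;
    }

    /* Every other character must be a decimal digit. */
    for( i = 0U; i < 9U; i++ )
    {
        if( ( i != 6U ) && ( isdigit( ( unsigned char ) pText[ i ] ) == 0 ) )
        {
            return 0;
        }
    }

    pOut->year = 0U;

    for( i = 0U; i < 4U; i++ )
    {
        pOut->year = ( pOut->year * 10U ) + ( unsigned int ) ( pText[ i ] - '0' );
    }

    pOut->month = ( unsigned int ) ( ( ( pText[ 4 ] - '0' ) * 10 ) + ( pText[ 5 ] - '0' ) );
    pOut->order = ( unsigned int ) ( ( ( pText[ 7 ] - '0' ) * 10 ) + ( pText[ 8 ] - '0' ) );

    /* Reject month values outside the calendar range. */
    return ( ( pOut->month >= 1U ) && ( pOut->month <= 12U ) ) ? 1 : 0;
}
```

Parsing `202106.01` this way yields year 2021, month 6 and release order 1, that is, the second release of June 2021. Library versions such as `3.1.2` follow semantic versioning and are intentionally rejected by this parser.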

All of the released versions of the C-SDK libraries are available as git tags. For example, the last release of the v3 SDK version is available at tag 3.1.2.

API documentation of 202108.00 release

This release introduces the refactored AWS IoT Fleet Provisioning library and the new AWS SigV4 library.

Additionally, this release brings minor version updates in the AWS IoT Over-the-Air Update and corePKCS11 libraries.

API documentation of 202103.00 release

This release includes a major update to the APIs of the AWS IoT Over-the-air Update library.

Additionally, AWS IoT Device Shadow library introduces a minor update by adding support for named shadow, a feature of the AWS IoT Device Shadow service that allows you to create multiple shadows for a single IoT device. AWS IoT Jobs library introduces a minor update by introducing macros for `$next` job ID and compile-time generation of topic strings. AWS IoT Device Defender library introduces a minor update that adds macros to API for custom metrics feature of AWS IoT Device Defender service.

corePKCS11 also introduces a patch update by removing the `pkcs11configPAL_DESTROY_SUPPORTED` config and mbedTLS platform abstraction layer of `DestroyObject`. Lastly, no code changes are introduced for backoffAlgorithm, coreHTTP, coreMQTT, and coreJSON; however, patch updates are made to improve documentation and CI.

API documentation of 202012.01 release

This release includes AWS IoT Over-the-air Update (Release Candidate), backoffAlgorithm, and PKCS #11 libraries. Additionally, there is a major update to the coreJSON and coreHTTP APIs. All libraries continue to undergo code quality checks (e.g. MISRA-C compliance), and Coverity static analysis. In addition, all libraries except AWS IoT Over-the-air Update and backoffAlgorithm undergo validation of memory safety with the C Bounded Model Checker (CBMC) automated reasoning tool.

API documentation of 202011.00 release

This release includes refactored HTTP client, AWS IoT Device Defender, and AWS IoT Jobs libraries. Additionally, there is a major update to the coreJSON API. All libraries continue to undergo code quality checks (e.g. MISRA-C compliance), Coverity static analysis, and validation of memory safety with the C Bounded Model Checker (CBMC) automated reasoning tool.

API documentation of 202009.00 release

This release includes refactored MQTT, JSON Parser, and AWS IoT Device Shadow libraries for optimized memory usage and modularity. These libraries are included in the SDK via Git submoduling. These libraries have gone through code quality checks including verification that no function has a GNU Complexity score over 8, and checks against deviations from mandatory rules in the MISRA coding standard. Deviations from the MISRA C:2012 guidelines are documented under MISRA Deviations. These libraries have also undergone both static code analysis from Coverity static analysis, and validation of memory safety and data structure invariance through the CBMC automated reasoning tool.

If you are upgrading from v3.x API of the C-SDK to the 202009.00 release, please refer to Migration guide from v3.1.2 to 202009.00 and newer releases. If you are using the C-SDK v4_beta_deprecated branch, note that we will continue to maintain this branch for critical bug fixes and security patches but will not add new features to it. See the C-SDK v4_beta_deprecated branch README for additional details.

Details available here.

All libraries depend on the ISO C90 standard library and additionally on the `stdint.h` library for fixed-width integers, including `uint8_t`, `int8_t`, `uint16_t`, `uint32_t` and `int32_t`, and constant macros like `UINT16_MAX`. If your platform does not support the `stdint.h` library, definitions of the mentioned fixed-width integer types will be required for porting any C-SDK library to your platform.

Guide for porting coreMQTT library to your platform is available here.

Guide for porting coreHTTP library is available here.

Guide for porting AWS IoT Device Shadow library is available here.

Guide for porting AWS IoT Device Defender library is available here.

Guide for porting OTA library to your platform is available here.

Migration guide for MQTT library is available here.

Migration guide for Shadow library is available here.

Migration guide for Jobs library is available here.

The main branch hosts the continuous development of the AWS IoT Embedded C SDK (C-SDK) libraries. Please be aware that the development at the tip of the main branch is continuously in progress, and may have bugs. Consider using the tagged releases of the C-SDK for production ready software.

The v4_beta_deprecated branch contains a beta version of the C-SDK libraries, which is now deprecated. This branch was earlier named as v4_beta, and was renamed to v4_beta_deprecated. The libraries in this branch will not be released. However, critical bugs will be fixed and tested. No new features will be added to this branch.

This repository uses Git Submodules to bring in the C-SDK libraries (e.g. MQTT) and third-party dependencies (e.g. mbedtls for POSIX platform transport layer). Note: If you download the ZIP file provided by the GitHub UI, you will not get the contents of the submodules (the ZIP file is also not a valid git repository). If you download from the 202012.00 Release Page, you will get the entire repository (including the submodules) in the ZIP file, aws-iot-device-sdk-embedded-c-202012.00.zip. To clone the latest commit to main branch using HTTPS:

```
git clone --recurse-submodules https://github.com/aws/aws-iot-device-sdk-embedded-C.git
```

Using SSH:

```
git clone --recurse-submodules git@github.com:aws/aws-iot-device-sdk-embedded-C.git
```

If you have downloaded the repo without using the `--recurse-submodules` argument, you need to run:

```
git submodule update --init --recursive
```

When building with CMake, submodules are also recursively cloned automatically. However, `-DBUILD_CLONE_SUBMODULES=0` can be passed as a CMake flag to disable this functionality. This is useful when you'd like to build with CMake while using a different commit from a submodule.

The libraries in this SDK are not dependent on any operating system. However, the demos for the libraries in this SDK are built and tested on a Linux platform. The demos build with CMake, a cross-platform build tool.

- CMake 3.2.0 or any newer version for utilizing the build system of the repository.
- C90 compiler such as gcc
    - Due to the use of mbedtls in corePKCS11, a C99 compiler is required if building the PKCS11 demos or the CMake install target.
- Although not a part of the ISO C90 standard, `stdint.h` is required for fixed-width integer types that include `uint8_t`, `int8_t`, `uint16_t`, `uint32_t` and `int32_t`, and constant macros like `UINT16_MAX`, while `stdbool.h` is required for boolean parameters in coreMQTT. For compilers that do not provide these header files, coreMQTT provides the files stdint.readme and stdbool.readme, which can be renamed to `stdint.h` and `stdbool.h`, respectively, to provide the required type definitions.
- A supported operating system. The ports provided with this repo are expected to work with all recent versions of the following operating systems, although we cannot guarantee the behavior on all systems.
    - Linux system with POSIX sockets, threads, RT, and timer APIs. (We have tested on Ubuntu 18.04).

Build Dependencies

The following table shows libraries that need to be installed in your system to run certain demos. If a dependency is not installed and cannot be built from source, demos that require that dependency will be excluded from the default `all` target.

Dependency | Version | Usage |
---|---|---|
OpenSSL | 1.1.0 or later | All TLS demos and tests with the exception of PKCS11 |
Mosquitto Client | 1.4.10 or later | AWS IoT Jobs Mosquitto demo |

You need to set up an AWS account and access the AWS IoT console to run the AWS IoT Device Shadow library, AWS IoT Device Defender library, AWS IoT Jobs library, AWS IoT OTA library and coreHTTP S3 download demos. The AWS account can also be used to run the MQTT mutual auth demo against the AWS IoT broker. Note that running the AWS IoT Device Defender, AWS IoT Jobs and AWS IoT Device Shadow library demos requires the setup of a Thing resource for the device running the demo. Follow the links to:

- Set up an AWS account.
- Sign in to the AWS IoT Console after setting up the AWS account.
- Create a Thing resource.

The MQTT Mutual Authentication and AWS IoT Shadow demos include example AWS IoT policy documents to run each respective demo with AWS IoT. You may use the MQTT Mutual auth and Shadow example policies by replacing `[AWS_REGION]` and `[AWS_ACCOUNT_ID]` with the strings of your region and account identifier. While the IoT Thing name and MQTT client identifier do not need to match for the demos to run, the example policies have the Thing name and client identifier identical as per AWS IoT best practices.

It can be very helpful to also have the AWS Command Line Interface tooling installed.

You can pass the following configuration settings as command line options in order to run the mutual auth demos. Make sure to run the following command in the root directory of the C-SDK:

```
## optionally find your-aws-iot-endpoint from the command line
aws iot describe-endpoint --endpoint-type iot:Data-ATS
cmake -S . -Bbuild \
-DAWS_IOT_ENDPOINT="<your-aws-iot-endpoint>" -DCLIENT_CERT_PATH="<your-client-certificate-path>" -DCLIENT_PRIVATE_KEY_PATH="<your-client-private-key-path>"
```

In order to set these configurations manually, edit `demo_config.h` in `demos/mqtt/mqtt_demo_mutual_auth/` and `demos/http/http_demo_mutual_auth/` to `#define` the following:

- Set `AWS_IOT_ENDPOINT` to your custom endpoint. This is found on the *Settings* page of the AWS IoT Console and has a format of `ABCDEFG1234567.iot.<aws-region>.amazonaws.com` where `<aws-region>` can be an AWS region like `us-east-2`.
    - Optionally, it can also be found with the AWS CLI command `aws iot describe-endpoint --endpoint-type iot:Data-ATS`.
- Set `CLIENT_CERT_PATH` to the path of the client certificate downloaded when setting up the device certificate in AWS IoT Account Setup.
- Set `CLIENT_PRIVATE_KEY_PATH` to the path of the private key downloaded when setting up the device certificate in AWS IoT Account Setup.

It is possible to configure `ROOT_CA_CERT_PATH` to any PEM-encoded Root CA Certificate. However, this is optional because CMake will download and set it to AmazonRootCA1.pem when unspecified.

To build the AWS IoT Device Defender and AWS IoT Device Shadow demos, you can pass the following configuration settings as command line options. Make sure to run the following command in the root directory of the C-SDK:

```
cmake -S . -Bbuild -DAWS_IOT_ENDPOINT="<your-aws-iot-endpoint>" -DROOT_CA_CERT_PATH="<your-path-to-amazon-root-ca>" -DCLIENT_CERT_PATH="<your-client-certificate-path>" -DCLIENT_PRIVATE_KEY_PATH="<your-client-private-key-path>" -DTHING_NAME="<your-registered-thing-name>"
```

An Amazon Root CA certificate can be downloaded from here.

In order to set these configurations manually, edit `demo_config.h` in the demo folder to `#define` the following:

- Set `AWS_IOT_ENDPOINT` to your custom endpoint. This is found on the *Settings* page of the AWS IoT Console and has a format of `ABCDEFG1234567.iot.us-east-2.amazonaws.com`.
- Set `ROOT_CA_CERT_PATH` to the path of the root CA certificate downloaded when setting up the device certificate in AWS IoT Account Setup.
- Set `CLIENT_CERT_PATH` to the path of the client certificate downloaded when setting up the device certificate in AWS IoT Account Setup.
- Set `CLIENT_PRIVATE_KEY_PATH` to the path of the private key downloaded when setting up the device certificate in AWS IoT Account Setup.
- Set `THING_NAME` to the name of the Thing created in AWS IoT Account Setup.

To build the AWS IoT Fleet Provisioning Demo, you can pass the following configuration settings as command line options. Make sure to run the following command in the root directory of the C-SDK:

```
cmake -S . -Bbuild -DAWS_IOT_ENDPOINT="<your-aws-iot-endpoint>" -DROOT_CA_CERT_PATH="<your-path-to-amazon-root-ca>" -DCLAIM_CERT_PATH="<your-claim-certificate-path>" -DCLAIM_PRIVATE_KEY_PATH="<your-claim-private-key-path>" -DPROVISIONING_TEMPLATE_NAME="<your-template-name>" -DDEVICE_SERIAL_NUMBER="<your-serial-number>"
```

An Amazon Root CA certificate can be downloaded from here.

To create a provisioning template and claim credentials, sign into your AWS account and visit here. Make sure to enable the "Use the AWS IoT registry to manage your device fleet" option. Once you have created the template and credentials, modify the claim certificate's policy to match the sample policy.

In order to set these configurations manually, edit `demo_config.h` in the demo folder to `#define` the following:

- Set `AWS_IOT_ENDPOINT` to your custom endpoint. This is found on the *Settings* page of the AWS IoT Console and has a format of `ABCDEFG1234567.iot.us-east-2.amazonaws.com`.
- Set `ROOT_CA_CERT_PATH` to the path of the root CA certificate downloaded when setting up the device certificate in AWS IoT Account Setup.
- Set `CLAIM_CERT_PATH` to the path of the claim certificate downloaded when setting up the template and claim credentials.
- Set `CLAIM_PRIVATE_KEY_PATH` to the path of the private key downloaded when setting up the template and claim credentials.
- Set `PROVISIONING_TEMPLATE_NAME` to the name of the provisioning template created.
- Set `DEVICE_SERIAL_NUMBER` to an arbitrary string representing a device identifier.

You can pass the following configuration settings as command line options in order to run the S3 demos. Make sure to run the following command in the root directory of the C-SDK:

```
cmake -S . -Bbuild -DS3_PRESIGNED_GET_URL="s3-get-url" -DS3_PRESIGNED_PUT_URL="s3-put-url"
```

`S3_PRESIGNED_PUT_URL` is only needed for the S3 upload demo.

In order to set these configurations manually, edit `demo_config.h` in `demos/http/http_demo_s3_download_multithreaded`, and `demos/http/http_demo_s3_upload` to `#define` the following:

- Set `S3_PRESIGNED_GET_URL` to a S3 presigned URL with GET access.
- Set `S3_PRESIGNED_PUT_URL` to a S3 presigned URL with PUT access.

You can generate the presigned URLs using demos/http/common/src/presigned_urls_gen.py. More info can be found here.

Refer to demos/http/http_demo_s3_download/README.md for the steps needed to configure and run the S3 Download HTTP Demo, which uses the SigV4 library to generate the authorization HTTP header needed to authenticate the HTTP requests sent to S3.

- The demo requires the Linux platform to contain curl and libmosquitto. On a Debian platform, these dependencies can be installed with:

```
apt install curl libmosquitto-dev
```

If the platform does not contain the `libmosquitto` library, the demo will build the library from source.

`libmosquitto` 1.4.10 or any later version of the first major release is required to run this demo.

- A job that specifies the URL to download for the demo needs to be created on the AWS account for the Thing resource that will be used by the demo.

The job can be created directly from the AWS IoT console or using the aws cli tool.

The following creates a job that specifies a Linux Kernel link for downloading.

```
aws iot create-job \
--job-id 'job_1' \
--targets arn:aws:iot:us-west-2:<account-id>:thing/<thing-name> \
--document '{"url":"https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz"}'
```

- To perform a successful OTA update, you need to complete the prerequisites mentioned here.
- A code signing certificate is required to authenticate the update. A code signing certificate based on the SHA-256 ECDSA algorithm will work with the current demos. An example of how to generate this kind of certificate can be found here.

After you build and run the initial executable you will have to create another executable and schedule an OTA update job with this image.

- Increase the version of the application by setting the macro `APP_VERSION_BUILD` in `demos/ota/ota_demo_core_[mqtt/http]/demo_config.h` to a different version than the one that is running.
- Rebuild the application using the build steps below into a different directory, say `build-dir-2`.
- Rename the demo executable to reflect the change, e.g. `mv ota_demo_core_mqtt ota_demo_core_mqtt2`

- Create an OTA job:
- Go to the AWS IoT Core console.
- Manage → Jobs → Create → Create a FreeRTOS OTA update job → Select the corresponding name for your device from the thing list.
- Sign a new firmware → Create a new profile → Select any SHA-ECDSA signing platform → Upload the code signing certificate (from prerequisites) and provide its path on the device.
- Select the image → Select the bucket you created during the prerequisite steps → Upload the binary `build-dir-2/bin/ota_demo2`.
- The path on device should be the absolute path to place the executable and the binary name: e.g. `/home/ubuntu/aws-iot-device-sdk-embedded-C-staging/build-dir/bin/ota_demo_core_mqtt2`.
- Select the IAM role created during the prerequisite steps.
- Create the Job.

- Run the initial executable again with the following command: `sudo ./ota_demo_core_mqtt` or `sudo ./ota_demo_core_http`.
- After the initial executable has finished running, go to the directory where the downloaded firmware image resides, which is the path name used when creating the OTA job.
- Change the permissions of the downloaded firmware to make it executable, as it may be downloaded with read (user default) permissions only: `chmod 775 ota_demo_core_mqtt2`
- Run the downloaded firmware image with the following command: `sudo ./ota_demo_core_mqtt2`

Before building the demos, ensure you have installed the prerequisite software. On Ubuntu 18.04 and 20.04, `gcc`, `cmake`, and OpenSSL can be installed with:

```
sudo apt install build-essential cmake libssl-dev
```

- Go to the root directory of the C-SDK.
- Run *cmake* to generate the Makefiles: `cmake -S . -Bbuild && cd build`
- Choose a demo from the list below or alternatively, run `make help | grep demo`:

```
defender_demo
http_demo_basic_tls
http_demo_mutual_auth
http_demo_plaintext
http_demo_s3_download
http_demo_s3_download_multithreaded
http_demo_s3_upload
jobs_demo_mosquitto
mqtt_demo_basic_tls
mqtt_demo_mutual_auth
mqtt_demo_plaintext
mqtt_demo_serializer
mqtt_demo_subscription_manager
ota_demo_core_http
ota_demo_core_mqtt
pkcs11_demo_management_and_rng
pkcs11_demo_mechanisms_and_digests
pkcs11_demo_objects
pkcs11_demo_sign_and_verify
shadow_demo_main
```

- Replace `demo_name` with your desired demo, then build it: `make demo_name`
- Go to the `build/bin` directory and run any demo executables from there.

- Go to the root directory of the C-SDK.
- Run *cmake* to generate the Makefiles: `cmake -S . -Bbuild && cd build`
- Run this command to build all configured demos: `make`
- Go to the `build/bin` directory and run any demo executables from there.

The corePKCS11 demos do not require any AWS IoT resources setup, and are standalone. The demos build upon each other to introduce concepts in PKCS #11 sequentially. Below is the recommended order.

`pkcs11_demo_management_and_rng`

`pkcs11_demo_mechanisms_and_digests`

`pkcs11_demo_objects`

`pkcs11_demo_sign_and_verify`

- Please note that this demo requires the private and public key generated from `pkcs11_demo_objects` to be in the directory the demo is executed from.


Install Docker:

```
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
```

Installing Mosquitto to run MQTT demos locally

The following instructions have been tested on an Ubuntu 18.04 environment with Docker and OpenSSL installed.

Download the official Docker image for Mosquitto 1.6.14. This version is deliberately chosen so that the Docker container can load certificates from the host system. Any version after 1.6.14 will drop privileges as soon as the configuration file has been read (before TLS certificates are loaded).

`docker pull eclipse-mosquitto:1.6.14`

If a Mosquitto broker with TLS communication needs to be run, ignore this step and proceed to the next step. A Mosquitto broker with plain text communication can be run by executing the command below.

`docker run -it -p 1883:1883 --name mosquitto-plain-text eclipse-mosquitto:1.6.14`

Set `BROKER_ENDPOINT` defined in `demos/mqtt/mqtt_demo_plaintext/demo_config.h` to `localhost`.

Ignore the remaining steps unless a Mosquitto broker with TLS communication also needs to be run.

For TLS communication with Mosquitto broker, server and CA credentials need to be created. Use OpenSSL commands to generate the credentials for the Mosquitto server.

```
# Generate CA key and certificate. Provide the Subject field information as appropriate for CA certificate.
openssl req -x509 -nodes -sha256 -days 365 -newkey rsa:2048 -keyout ca.key -out ca.crt
```

```
# Generate server key and certificate.
# Provide the Subject field information as appropriate for Server certificate. Make sure the Common Name (CN) field is different from the root CA certificate.
openssl req -nodes -sha256 -new -keyout server.key -out server.csr
# Sign with the CA cert.
openssl x509 -req -sha256 -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 365
```

Note: Make sure to use a different Common Name (CN) for the CA and server certificates; the SSL handshake fails if both certificates carry exactly the same Common Name (CN).

Create a mosquitto.conf file that uses port 8883 (for TLS communication) and provides the paths to the generated credentials.

```
port 8883
cafile /mosquitto/config/ca.crt
certfile /mosquitto/config/server.crt
keyfile /mosquitto/config/server.key
# Use this option for TLS mutual authentication (where client will provide CA signed certificate)
#require_certificate true
tls_version tlsv1.2
#use_identity_as_username true
```

Run the docker container from the local directory containing the generated credential and mosquitto.conf files.

```
docker run -it -p 8883:8883 -v $(pwd):/mosquitto/config/ --name mosquitto-basic-tls eclipse-mosquitto:1.6.14
```

Update `demos/mqtt/mqtt_demo_basic_tls/demo_config.h` as follows:

Set `BROKER_ENDPOINT` to `localhost`.

Set `ROOT_CA_CERT_PATH` to the absolute path of the CA certificate created earlier for the local Mosquitto server.

Installing httpbin to run HTTP demos locally

Run httpbin through port 80:

```
docker pull kennethreitz/httpbin
docker run -p 80:80 kennethreitz/httpbin
```

`SERVER_HOST` defined in `demos/http/http_demo_plaintext/demo_config.h` can now be set to `localhost`.

To run `http_demo_basic_tls`, download ngrok in order to create an HTTPS tunnel to the httpbin server currently hosted on port 80:

```
./ngrok http 80 # May have to use ./ngrok.exe depending on OS or filename of the executable
```

`ngrok` will provide an https link that can be substituted in `demos/http/http_demo_basic_tls/demo_config.h` and has a format of `https://ABCDEFG12345.ngrok.io`.

Set `SERVER_HOST` in `demos/http/http_demo_basic_tls/demo_config.h` to the https link provided by ngrok, without `https://` preceding it.

You must also download the Root CA certificate provided by the ngrok https link and set `ROOT_CA_CERT_PATH` in `demos/http/http_demo_basic_tls/demo_config.h` to the file path of the downloaded certificate.

The C-SDK libraries and platform abstractions can be installed to a file system through CMake. To do so, run the following command in the root directory of the C-SDK. Note that installation is not required to run any of the demos.

```
cmake -S . -Bbuild -DBUILD_DEMOS=0 -DBUILD_TESTS=0
cd build
sudo make install
```

Note that because `make install` will automatically build the `all` target, it may be useful to disable building demos and tests with `-DBUILD_DEMOS=0 -DBUILD_TESTS=0` unless they have already been configured. Super-user permissions may be needed if installing to a system include or system library path.

To install only a subset of the libraries, pass `-DINSTALL_LIBS` with the libraries you need. By default, all libraries will be installed, but you may exclude any library that you don't need from this list:

```
-DINSTALL_LIBS="DEFENDER;SHADOW;JOBS;OTA;OTA_HTTP;OTA_MQTT;BACKOFF_ALGORITHM;HTTP;JSON;MQTT;PKCS"
```

By default, the install path will be in the `project` directory of the SDK. You can also set `-DINSTALL_TO_SYSTEM=1` to install to the system path for headers and libraries in your OS (e.g. `/usr/local/include` & `/usr/local/lib` for Linux).

Upon entering `make install`, the location of each library will be specified first, followed by the location of all installed headers:

```
-- Installing: /usr/local/lib/libaws_iot_defender.so
-- Installing: /usr/local/lib/libaws_iot_shadow.so
...
-- Installing: /usr/local/include/aws/defender.h
-- Installing: /usr/local/include/aws/defender_config_defaults.h
-- Installing: /usr/local/include/aws/shadow.h
-- Installing: /usr/local/include/aws/shadow_config_defaults.h
```

You may also set an installation path of your choice by passing the following flags through CMake. Make sure to run the following command in the root directory of the C-SDK:

```
cmake -S . -Bbuild -DBUILD_DEMOS=0 -DBUILD_TESTS=0 \
-DCSDK_HEADER_INSTALL_PATH="/header/path" -DCSDK_LIB_INSTALL_PATH="/lib/path"
cd build
sudo make install
```

POSIX platform abstractions are used together with the C-SDK libraries in the demos. By default, these abstractions are also installed but can be excluded by passing the flag: `-DINSTALL_PLATFORM_ABSTRACTIONS=0`.

Lastly, a custom config path for any specific library can also be specified through the following CMake flags, allowing libraries to be compiled with a config of your choice:

```
-DDEFENDER_CUSTOM_CONFIG_DIR="defender-config-directory"
-DSHADOW_CUSTOM_CONFIG_DIR="shadow-config-directory"
-DJOBS_CUSTOM_CONFIG_DIR="jobs-config-directory"
-DOTA_CUSTOM_CONFIG_DIR="ota-config-directory"
-DHTTP_CUSTOM_CONFIG_DIR="http-config-directory"
-DJSON_CUSTOM_CONFIG_DIR="json-config-directory"
-DMQTT_CUSTOM_CONFIG_DIR="mqtt-config-directory"
-DPKCS_CUSTOM_CONFIG_DIR="pkcs-config-directory"
```

Note that the file name of the header should not be included in the directory.

Note: For pre-generated documentation, please visit the Releases and Documentation section.

The Doxygen references were created using Doxygen version 1.9.2. To generate the Doxygen pages, use the provided Python script at tools/doxygen/generate_docs.py. Please ensure that each of the library submodules under `libraries/standard/` and `libraries/aws/` is cloned before using this script.

```
cd <CSDK_ROOT>
git submodule update --init --recursive --checkout
python3 tools/doxygen/generate_docs.py
```

The generated documentation landing page is located at `docs/doxygen/output/html/index.html`.

Author: aws

Source code: https://github.com/aws/aws-iot-device-sdk-embedded-C

License: MIT license

1605176864

In this video, I will be talking about problem-solving as a developer.

#problem solving skills #problem solving how to #problem solving strategies #problem solving #developer

1641276000

- ML-Quant.com - Automated Research Repository

Tabular augmentation is a new experimental space that makes use of novel and traditional data generation and synthesisation techniques to improve model prediction success. It is in essence a process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set. DeltaPy was created with finance applications in mind, but it can be broadly applied to any data-rich environment.

To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: **(1) transforming**, **(2) interacting**, **(3) mapping**, **(4) extracting**, and **(5) synthesising**. What follows is a practical example of how the above methodology can be used. The purpose here is to establish a framework for table augmentation and to point and guide the user to existing packages.

For most users, the Colab Notebook format might be preferred. I have enabled comments if you want to ask questions or address any issues you uncover. For anything pressing, use the issues tab. Also have a look at the SSRN report for more succinct insights.

Data augmentation can be defined as any method that could increase the size or improve the quality of a dataset by generating new features or instances without the collection of additional data-points. Data augmentation is of particular importance in image classification tasks where additional data can be created by cropping, padding, or flipping existing images.

Tabular cross-sectional and time-series prediction tasks can also benefit from augmentation. Here we divide tabular augmentation into columnar and row-wise methods. Row-wise methods are further divided into extraction and data synthesisation techniques, whereas columnar methods are divided into transformation, interaction, and mapping methods.

See the Skeleton Example for a combination of multiple methods that leads to a halving of the mean squared error.

```
pip install deltapy
```

```
@software{deltapy,
title = {{DeltaPy}: Tabular Data Augmentation},
author = {Snow, Derek},
url = {https://github.com/firmai/deltapy/},
version = {0.1.0},
date = {2020-04-11},
}
```

```
Snow, Derek, DeltaPy: A Framework for Tabular Data Augmentation in Python (April 22, 2020). Available at SSRN: https://ssrn.com/abstract=3582219
```

**Transformation**

```
df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()
df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head()
df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head()
df_out = transform.windsorization(df.copy(),"Close",para,strategy='both'); df_out.head()
df_out = transform.operations(df.copy(),["Close"]); df_out.head()
df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0);
df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head()
df_out = transform.bkb(df.copy(), ["Close"]); df_out.head()
df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head()
df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head()
df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head()
df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head()
df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head()
df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head()
df_out = transform.modify(df.copy(),["Close"]); df_out.head()
df_out = transform.multiple_rolling(df, columns=["Close"]); df_out.head()
df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()
df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head()
```

**Interaction**

```
df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()
df_out = interact.autoregression(df.copy()); df_out.head()
df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head()
df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head()
df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head()
df_out = interact.tech(df.copy()); df_out.head()
df_out = interact.genetic_feat(df.copy()); df_out.head()
```

**Mapping**

```
df_out = mapper.pca_feature(df.copy(),variance_or_components=0.80,drop_cols=["Close_1"]); df_out.head()
df_out = mapper.cross_lag(df.copy()); df_out.head()
df_out = mapper.a_chi(df.copy()); df_out.head()
df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()
df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head()
df_out = mapper.feature_agg(df.copy(),["Close_1"],4 ); df_out.head()
df_out = mapper.neigh_feat(df.copy(),["Close_1"],4 ); df_out.head()
```

**Extraction**

```
extract.abs_energy(df["Close"])
extract.cid_ce(df["Close"], True)
extract.mean_abs_change(df["Close"])
extract.mean_second_derivative_central(df["Close"])
extract.variance_larger_than_standard_deviation(df["Close"])
extract.var_index(df["Close"].values,var_index_param)
extract.symmetry_looking(df["Close"])
extract.has_duplicate_max(df["Close"])
extract.partial_autocorrelation(df["Close"])
extract.augmented_dickey_fuller(df["Close"])
extract.gskew(df["Close"])
extract.stetson_mean(df["Close"])
extract.length(df["Close"])
extract.count_above_mean(df["Close"])
extract.longest_strike_below_mean(df["Close"])
extract.wozniak(df["Close"])
extract.last_location_of_maximum(df["Close"])
extract.fft_coefficient(df["Close"])
extract.ar_coefficient(df["Close"])
extract.index_mass_quantile(df["Close"])
extract.number_cwt_peaks(df["Close"])
extract.spkt_welch_density(df["Close"])
extract.linear_trend_timewise(df["Close"])
extract.c3(df["Close"])
extract.binned_entropy(df["Close"])
extract.svd_entropy(df["Close"].values)
extract.hjorth_complexity(df["Close"])
extract.max_langevin_fixed_point(df["Close"])
extract.percent_amplitude(df["Close"])
extract.cad_prob(df["Close"])
extract.zero_crossing_derivative(df["Close"])
extract.detrended_fluctuation_analysis(df["Close"])
extract.fisher_information(df["Close"])
extract.higuchi_fractal_dimension(df["Close"])
extract.petrosian_fractal_dimension(df["Close"])
extract.hurst_exponent(df["Close"])
extract.largest_lyauponov_exponent(df["Close"])
extract.whelch_method(df["Close"])
extract.find_freq(df["Close"])
extract.flux_perc(df["Close"])
extract.range_cum_s(df["Close"])
extract.structure_func(df["Close"])
extract.kurtosis(df["Close"])
extract.stetson_k(df["Close"])
```

Test sets should ideally not be preprocessed together with the training data, as this would amount to peeking ahead during training. The preprocessing parameters should be identified on the training set and then applied to the test set, i.e., the test set should not have an impact on the transformation applied. As an example, you would learn the parameters of PCA decomposition on the training set and then apply the parameters to both the train and the test set.

The benefit of pipelines becomes clear when one wants to apply multiple augmentation methods. It makes it easy to learn the parameters and then apply them widely. For the most part, this notebook does not concern itself with 'peeking ahead' or pipelines; for some functions, you might have to restructure the code and make use of open source packages to create your preferred solution.
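To make the train/test rule concrete, here is a minimal sketch that learns scaling parameters on the training split only and then applies them to both splits. It uses standardisation rather than PCA to stay short and relies only on the standard library; `fit_scaler`, `apply_scaler`, and the price values are hypothetical names and data, not part of DeltaPy.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Learn the scaling parameters from the training values only."""
    return {"mean": mean(train), "std": pstdev(train)}

def apply_scaler(values, params):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - params["mean"]) / params["std"] for v in values]

# Hypothetical closing prices, split in time order (no shuffling).
prices = [10.0, 12.0, 14.0, 16.0, 30.0, 32.0]
train, test = prices[:4], prices[4:]

params = fit_scaler(train)                   # parameters come from the training set
train_scaled = apply_scaler(train, params)   # training data is centred on its own mean
test_scaled = apply_scaler(test, params)     # test data is transformed, never fitted
```

Because the test values never enter `fit_scaler`, changing them cannot alter the transformation, which is the property a pipeline is meant to guarantee.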

**Notebook Dependencies**

```
pip install deltapy
```

```
pip install pykalman
pip install tsaug
pip install ta
pip install pandasvault
pip install gplearn
pip install seasonal
```

```
import pandas as pd
import numpy as np
from deltapy import transform, interact, mapper, extract
import warnings
warnings.filterwarnings('ignore')

def data_copy():
    df = pd.read_csv("https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv")
    df["Close_1"] = df["Close"].shift(-1)  # next row's Close
    df = df.dropna()
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.set_index("Date")
    return df

df = data_copy(); df.head()
```

Some of these categories are fluid and some techniques could fit into multiple buckets. This is an attempt to find an exhaustive number of techniques, but not an exhaustive list of implementations of the techniques. For example, there are thousands of ways to smooth a time-series, but we have only included 1-2 techniques of interest under each category.

- Scaling/Normalisation
- Standardisation
- Differencing
- Capping
- Operations
- Smoothing
- Decomposing
- Filtering
- Spectral Analysis
- Waveforms
- Modifications
- Rolling
- Lagging
- Forecast Model

- Regressions
- Operators
- Discretising
- Normalising
- Distance
- Speciality
- Genetic

- Eigen Decomposition
- Cross Decomposition
- Kernel Approximation
- Autoencoder
- Manifold Learning
- Clustering
- Neighbouring

- Energy
- Distance
- Differencing
- Derivative
- Volatility
- Shape
- Occurrence
- Autocorrelation
- Stochasticity
- Averages
- Size
- Count
- Streaks
- Location
- Model Coefficients
- Quantile
- Peaks
- Density
- Linearity
- Non-linearity
- Entropy
- Fixed Points
- Amplitude
- Probability
- Crossings
- Fluctuation
- Information
- Fractals
- Exponent
- Spectral Analysis
- Percentile
- Range
- Structural
- Distribution

Here transformation is any method that includes only one feature as an input to produce a new feature/s. Transformations can be applied to cross-section and time-series data. Some transformations are exclusive to time-series data (smoothing, filtering), but a handful of functions apply to both.

Where the time series methods have a centred mean, or are forward-looking, there is a need to recalculate the outputted time series on a running basis to ensure that information from the future does not leak into the model. The last value of this recalculated series, or an extracted feature from this series, can then be used as a running value that is only backward-looking, satisfying the no 'peeking ahead' rule.
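The running recalculation described above can be sketched as follows, assuming a centred moving average as the forward-looking method. The helper names `centred_mean` and `running_last_value` and the sample prices are hypothetical and not part of DeltaPy; the sketch uses only the standard library.

```python
from statistics import mean

def centred_mean(values, window=3):
    """Centred moving average; None where the window does not fit."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = mean(values[i - half:i + half + 1])
    return out

def running_last_value(values, window=3):
    """At each step, recompute the centred average on the data known so far
    and keep only its last defined value."""
    feature = []
    for t in range(len(values)):
        known = values[:t + 1]  # nothing after time t is visible
        smoothed = [v for v in centred_mean(known, window) if v is not None]
        feature.append(smoothed[-1] if smoothed else None)
    return feature

prices = [10.0, 11.0, 13.0, 12.0, 15.0]
leaky = centred_mean(prices)       # value at t uses the observation at t + 1
safe = running_last_value(prices)  # value at t uses observations up to t only
```

The cost is visible in the nested recomputation: the centred average is rebuilt once per time step, which is why the text calls this technique expensive.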

There are some packages in Python that dynamically create time series and extract their features, but none that incorporate the dynamic creation of a time series in combination with a wide application of a prespecified list of extractions. Because this technique is expensive, we have a preference for models that only take historical data into account.

In this section we will include a list of all types of transformations, those that only use present information (operations), those that incorporate all values (interpolation methods), those that only include past values (smoothing functions), and those that incorporate a subset window of lagging and leading values (select filters). Only those that use historical values or are turned into prediction methods can be used out of the box. The entire time series can be used in the model development process for historical value methods, and only the forecasted values can be used for prediction models.

Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. When using an interpolation method, e.g., a cubic spline, you are taking future information into account. You can use interpolation methods to forecast into the future (extrapolation), and then use those forecasts in a training set. Or you could recalculate the interpolation for each time step and then extract features out of that series (extraction method). Interpolation and other forward-looking methods can be used if they are turned into prediction problems; the forecasted values can then be trained and tested on, and the fitted data can be disregarded. In the list presented below, the first five methods can be used for cross-section and time series data; after that, the time-series only methods follow.

There are a multitude of scaling methods available. Scaling generally gets applied to the entire dataset and is especially necessary for certain algorithms. K-means makes use of Euclidean distance, hence the need for scaling. For PCA, because we are trying to identify the features with maximum variance, we also need scaling. Similarly, we need scaled features for gradient descent. Algorithms that rely on neither a distance measure nor gradient-based optimisation, such as tree-based models, are largely unaffected by feature scaling. Available methods include range scalers like the minimum-maximum scaler and the maximum absolute scaler, as well as standardisation methods like the standard scaler. The example used here is the robust scaler. Normalisation is a good technique when you don't know the distribution of the data. Scaling looks into the future, so parameters have to be trained on a training set and applied to a test set.

(i) Robust Scaler

Scaling according to the interquartile range, making it robust to outliers.

```
def robust_scaler(df, drop=None, quantile_range=(25, 75)):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    center = np.median(df, axis=0)
    quantiles = np.percentile(df, quantile_range, axis=0)
    scale = quantiles[1] - quantiles[0]
    df = (df - center) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df

df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()
```

When using a standardisation method, it is often more effective when the attribute itself is Gaussian. It is also useful to apply the technique when the model you want to use makes assumptions of Gaussian distributions, like linear regression, logistic regression, and linear discriminant analysis. For most applications, standardisation is recommended.

(i) Standard Scaler

Standardize features by removing the mean and scaling to unit variance

```
def standard_scaler(df, drop):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    mean = np.mean(df, axis=0)
    scale = np.std(df, axis=0)
    df = (df - mean) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df

df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head()
```

Computing the differences between consecutive observations, normally used to obtain a stationary time series.

(i) Fractional Differencing

Fractional differencing allows us to achieve stationarity while maintaining the maximum amount of memory compared to integer differencing.

```
def fast_fracdiff(x, cols, d):
    for col in cols:
        T = len(x[col])
        np2 = int(2 ** np.ceil(np.log2(2 * T - 1)))
        k = np.arange(1, T)
        b = (1,) + tuple(np.cumprod((k - d - 1) / k))  # differencing weights
        z = (0,) * (np2 - T)                            # zero padding for the FFT
        z1 = b + z
        z2 = tuple(x[col]) + z
        # use NumPy's FFT directly instead of going through pylab
        dx = np.fft.ifft(np.fft.fft(z1) * np.fft.fft(z2))
        x[col+"_frac"] = np.real(dx[0:T])
    return x

df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head()
```
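The weights that `fast_fracdiff` builds with `np.cumprod((k - d - 1) / k)` follow the recursion $w_0 = 1$, $w_k = w_{k-1}(k - 1 - d)/k$. The sketch below computes them directly and applies them with a plain, slow convolution, which makes it easy to check that `d = 1` reduces to ordinary first differences. The helper names `fracdiff_weights` and `fracdiff` are hypothetical and not part of DeltaPy.

```python
def fracdiff_weights(d, n):
    """First n weights of (1 - B)^d via w_k = w_{k-1} * (k - 1 - d) / k."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights

def fracdiff(series, d):
    """Direct (slow) convolution of the series with the weights."""
    w = fracdiff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]

half = fracdiff_weights(0.5, 4)             # slowly decaying weights keep long memory
full = fracdiff([1.0, 3.0, 6.0, 10.0], 1)   # d = 1: ordinary first differences
```

With `d = 0.5` the weights shrink gradually instead of cutting off after one lag, which is how fractional differencing retains more of the series' memory than integer differencing.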

Any method that sets a floor and a cap on a feature's value. Capping can affect the distribution of data, so it should not be overdone. One can cap values by using the average, by using the max and min values, or by an arbitrary extreme value.

(i) Winsorisation

The transformation of features by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers by replacing them with a certain percentile value.

```
def outlier_detect(data,col,threshold=1,method="IQR"):
    if method == "IQR":
        IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
        Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
        Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
    if method == "STD":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "OWN":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method =="MAD":
        median = data[col].median()
        median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
        modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
        outlier_index = np.abs(modified_z_scores) > threshold
        print('Num of outlier detected:',outlier_index.value_counts()[1])
        print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))
        return outlier_index, (median_absolute_deviation, median_absolute_deviation)
    para = (Upper_fence, Lower_fence)
    tmp = pd.concat([data[col]>Upper_fence,data[col]<Lower_fence],axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:',outlier_index.value_counts()[1])
    print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))
    return outlier_index, para

def windsorization(data,col,para,strategy='both'):
    """
    top-coding & bottom coding (capping the maximum of a distribution at an arbitrarily set value,vice versa)
    """
    data_copy = data.copy(deep=True)
    if strategy == 'both':
        data_copy.loc[data_copy[col]>para[0],col] = para[0]
        data_copy.loc[data_copy[col]<para[1],col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col]>para[0],col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col]<para[1],col] = para[1]
    return data_copy

_, para = transform.outlier_detect(df, "Close")
df_out = transform.windsorization(df.copy(),"Close",para,strategy='both'); df_out.head()
```

Operations here are treated like traditional transformations. It is the replacement of a variable by a function of that variable. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship.

(i) Power, Log, Reciprocal, Square Root

```
def operations(df,features):
    df_new = df[features]
    df_new = df_new - df_new.min()  # shift so the minimum is zero
    sqr_name = [str(fa)+"_POWER_2" for fa in df_new.columns]
    log_p_name = [str(fa)+"_LOG_p_one_abs" for fa in df_new.columns]
    rec_p_name = [str(fa)+"_RECIP_p_one" for fa in df_new.columns]
    sqrt_name = [str(fa)+"_SQRT_p_one" for fa in df_new.columns]
    df_sqr = pd.DataFrame(np.power(df_new.values, 2),columns=sqr_name, index=df.index)
    df_log = pd.DataFrame(np.log(df_new.add(1).abs().values),columns=log_p_name, index=df.index)
    df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values),columns=rec_p_name, index=df.index)
    df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values),columns=sqrt_name, index=df.index)
    dfs = [df, df_sqr, df_log, df_rec, df_sqrt]
    df= pd.concat(dfs, axis=1)
    return df

df_out = transform.operations(df.copy(),["Close"]); df_out.head()
```

Here we maintain that any method that has a component of historical averaging is a smoothing method, such as a simple moving average and single, double and triple exponential smoothing methods. These forms of causal filters are also popular in signal processing, where exponential smoothing is called an IIR filter and a moving average an FIR filter with equal weighting factors.
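As a small illustration of the two filter types named above, the sketch below implements a trailing simple moving average (equal weights over a finite window) and single exponential smoothing (a recursive update whose weights decay geometrically). Both use only past and present values. The function names and the sample data are hypothetical and not part of DeltaPy.

```python
def simple_moving_average(values, window):
    """Trailing average over the last `window` points; None until the window is full."""
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[t - window + 1:t + 1]) / window)
    return out

def exponential_smoothing(values, alpha):
    """Single exponential smoothing: s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

sma = simple_moving_average([1.0, 2.0, 3.0, 4.0], window=2)
ses = exponential_smoothing([1.0, 3.0, 3.0], alpha=0.5)
```

The moving average forgets a point entirely once it leaves the window, whereas the exponential version never fully forgets it; that is the practical difference between the finite and infinite impulse responses.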

(i) Triple Exponential Smoothing (Holt-Winters Exponential Smoothing)

The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level $\ell_t$, one for the trend $b_t$, and one for the seasonal component $s_t$. This particular version is performed by looking at the last 12 periods. For that reason, the first 12 records should be disregarded because they can't make use of the required window size for a fair calculation. The calculation is such that values are still provided for those periods based on whatever data might be available.

```
def initial_trend(series, slen):
    total = 0.0  # avoids shadowing the built-in sum()
    for i in range(slen):
        total += float(series[i+slen] - series[i]) / slen
    return total / slen

def initial_seasonal_components(series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(series)/slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))
    # compute initial values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += series[slen*j+i]-season_averages[j]
        seasonals[i] = sum_of_vals_over_avg/n_seasons
    return seasonals

def triple_exponential_smoothing(df,cols, slen, alpha, beta, gamma, n_preds):
    for col in cols:
        series = df[col].values  # plain array, so integer positions work for any index
        result = []
        seasonals = initial_seasonal_components(series, slen)
        for i in range(len(series)+n_preds):
            if i == 0: # initial values
                smooth = series[0]
                trend = initial_trend(series, slen)
                result.append(series[0])
                continue
            if i >= len(series): # we are forecasting
                m = i - len(series) + 1
                result.append((smooth + m*trend) + seasonals[i%slen])
            else:
                val = series[i]
                last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)
                trend = beta * (smooth-last_smooth) + (1-beta)*trend
                seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]
                result.append(smooth+trend+seasonals[i%slen])
        df[col+"_TES"] = result
    return df

df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0); df_out.head()
```
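
For reference, the recursions that the code above follows can be written out as below. This is a sketch read off the code rather than a definitive statement of the method: $m$ is the season length (`slen`), $h \le m$ is the forecast horizon, and $\alpha$, $\beta$, $\gamma$ are the three smoothing parameters. Note that the seasonal update here uses $y_t - \ell_t$, whereas some textbook formulations use $y_t - \ell_{t-1} - b_{t-1}$.

$$
\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-m} \\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-m}
\end{aligned}
$$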

Decomposition procedures are used in time series to describe the trend and seasonal factors in a time series. More extensive decompositions might also include long-run cycles, holiday effects, day of week effects and so on. Here, we’ll only consider trend and seasonal decompositions. A naive decomposition makes use of moving averages; other decomposition methods make use of LOESS.

(i) Naive Decomposition

The base trend takes historical information into account and establishes moving averages; it does not have to be linear. To estimate the seasonal component for each season, simply average the detrended values for that season. If the seasonal variation looks constant, we should use the additive model. If the magnitude is increasing as a function of time, we will use the multiplicative model. Here, because the feature is predictive in nature, we use a one-sided moving average, as opposed to a two-sided centred average.

```
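import statsmodels.api as sm

def naive_dec(df, columns, freq=2):
    for col in columns:
        # one-sided (trailing) moving average, so no future values are used;
        # recent statsmodels releases name the seasonal length `period` (older ones used `freq`)
        decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', period=freq, two_sided=False)
        # separate suffixes so trend, seasonal and residual do not overwrite each other
        df[col+"_NDDT"] = decomposition.trend
        df[col+"_NDDS"] = decomposition.seasonal
        df[col+"_NDDR"] = decomposition.resid
    return df

df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head()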
```
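
To make the steps concrete, here is a minimal, dependency-free sketch of an additive decomposition with a one-sided moving average. It is an illustration under simplifying assumptions, not the statsmodels implementation: the helper names `trailing_moving_average` and `additive_decompose` are hypothetical, and positions without enough history are returned as `None`.

```python
def trailing_moving_average(values, window):
    """One-sided moving average: entry i averages values[i-window+1 .. i]."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def additive_decompose(values, period):
    """Split values into trend + seasonal + residual (additive model)."""
    trend = trailing_moving_average(values, period)
    detrended = [None if t is None else v - t for v, t in zip(values, trend)]
    # Average the detrended values that share a position within the season.
    season_means = []
    for pos in range(period):
        obs = [d for i, d in enumerate(detrended)
               if d is not None and i % period == pos]
        season_means.append(sum(obs) / len(obs) if obs else 0.0)
    # Centre the seasonal effects so they sum to zero over one period.
    offset = sum(season_means) / period
    season_means = [s - offset for s in season_means]
    seasonal = [season_means[i % period] for i in range(len(values))]
    resid = [None if t is None else v - t - s
             for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, resid


series = [10, 14, 11, 15, 12, 16, 13, 17]
trend, seasonal, resid = additive_decompose(series, period=2)
```

Because the average only looks backwards, each trend value depends on the current and earlier observations, which is the property needed for a predictive feature.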

It is often useful to either low-pass filter (smooth) time series in order to reveal low-frequency features and trends, or to high-pass filter (detrend) time series in order to isolate high-frequency transients (e.g. storms). Low-pass filters use historical values; high-pass filters detrend with low-pass filters, so they also indirectly use historical values.

There are a few filters available, closely associated with decompositions and smoothing functions. The Hodrick-Prescott filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$. The Christiano-Fitzgerald filter is a generalization of the Baxter-King filter and can be seen as a weighted moving average.

(i) Baxter-King Bandpass

The Baxter-King filter is intended to explicitly deal with the periodicity of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at frequencies higher or lower than those of the business cycle. The parameters are arbitrarily chosen. This method uses a centred moving average that has to be changed to a lagged moving average before it can be used as an input feature. The maximum period of oscillation should be used as the point at which to truncate the dataset, as that part of the time series does not incorporate all the required data points.

```
import statsmodels.api as sm

def bkb(df, cols):
    for col in cols:
        df[col+"_BPF"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df)-1)
    return df

df_out = transform.bkb(df.copy(), ["Close"]); df_out.head()
```

(ii) Butter Lowpass (IIR Filter Design)

The Butterworth filter is a type of signal processing filter designed to have a frequency response that is as flat as possible in the passband. Like other filters, the first few values have to be disregarded for accurate downstream prediction. Instead of disregarding these values on a per-case basis, they can be disregarded in one chunk once the database of transformed features has been developed.

```
from scipy import signal, integrate

def butter_lowpass(cutoff, fs=20, order=5):
    nyq = 0.5 * fs                 # Nyquist frequency
    normal_cutoff = cutoff / nyq   # cutoff as a fraction of Nyquist
    b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(df,cols, cutoff, fs=20, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    for col in cols:
        df[col+"_BUTTER"] = signal.lfilter(b, a, df[col])
    return df

df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
```

(iii) Hilbert Transform Angle

The Hilbert transform is a time-domain to time-domain transformation which shifts the phase of a signal by 90 degrees. It is also a centred measure and would be difficult to use in a time series prediction setting, unless it is recalculated on a per step basis or transformed to be based on historical values only.

```
from scipy import signal
import numpy as np

def instantaneous_phases(df,cols):
    for col in cols:
        df[col+"_HILLB"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0)
    return df

df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head()
```

(iv) Unscented Kalman Filter

The Kalman filter is better suited for estimating things that change over time. The most tangible example is tracking moving objects. A Kalman filter will be very close to the actual trajectory because it says the most recent measurement is more important than the older ones. The Unscented Kalman Filter (UKF) is a model-based technique that recursively estimates the states (and, with some modifications, also the parameters) of a nonlinear, dynamic, discrete-time system. The UKF is based on the typical prediction-correction style of methods. The Kalman Smoother incorporates future values; the Filter does not, and can be used for online prediction. The normal Kalman filter is a forward filter in the sense that it makes a forecast of the current state using only current and past observations, whereas the smoother is based on computing a suitable linear combination of two filters, which are run in the forward and backward directions.

```
import numpy as np
from pykalman import UnscentedKalmanFilter

def kalman_feat(df, cols):
    for col in cols:
        ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1)
        # filter: past and current observations only; smooth: also uses future observations
        (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col])
        (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col])
        df[col+"_UKFSMOOTH"] = smoothed_state_means.flatten()
        df[col+"_UKFFILTER"] = filtered_state_means.flatten()
    return df

df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head()
```
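
The prediction-correction loop is easier to see in a stripped-down form. The sketch below is a plain linear, one-dimensional Kalman filter with a random-walk state model, not the unscented variant used above; the function name `kalman_filter_1d` and the default variances are assumptions chosen for illustration only.

```python
def kalman_filter_1d(observations, process_var=1e-2, obs_var=0.1,
                     init_mean=0.0, init_var=1.0):
    """Forward (causal) Kalman filter for a scalar random-walk state."""
    mean, var = init_mean, init_var
    filtered = []
    for z in observations:
        # Predict: a random walk keeps the mean and inflates the uncertainty.
        var = var + process_var
        # Correct: blend the prediction with the new measurement.
        gain = var / (var + obs_var)
        mean = mean + gain * (z - mean)
        var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered


observations = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
filtered = kalman_filter_1d(observations)
```

Each filtered value depends only on observations up to that step, so truncating the input never changes the earlier outputs. That is what makes the filter, unlike the smoother, usable for online prediction.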

There is a range of functions for spectral analysis. You can use periodograms and the Welch method to estimate the power spectral density. You can also use the Welch method to estimate the cross power spectral density. Other techniques include spectrograms, Lomb-Scargle periodograms and the short-time Fourier transform.

(i) Periodogram

This returns an array of sample frequencies and the power spectrum of x, or the power spectral density of x.

```
from scipy import signal

def perd_feat(df, cols):
    for col in cols:
        sig = signal.periodogram(df[col],fs=1, return_onesided=False)
        df[col+"_FREQ"] = sig[0]   # sample frequencies
        df[col+"_POWER"] = sig[1]  # power spectral density
    return df

df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head()
```

(ii) Fast Fourier Transform

The FFT, or fast Fourier transform, is an algorithm that efficiently computes the discrete Fourier transform, giving the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum, by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal), in order to obtain a smoothed version of the underlying signal. This takes the entire signal into account, and as a result has to be recalculated on a running basis to avoid peeking into the future.

```
import numpy as np
import pandas as pd

def fft_feat(df, cols):
    for col in cols:
        fft_df = np.fft.fft(np.asarray(df[col].tolist()))
        fft_df = pd.DataFrame({'fft':fft_df})
        df[col+'_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values      # magnitude
        df[col+'_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values  # phase
    return df

df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head()
```
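
As a small illustration of what the magnitude and angle columns represent, the sketch below computes a naive discrete Fourier transform in pure Python. It is the slow O(n²) definition that the FFT evaluates efficiently, not the FFT algorithm itself; the names `dft` and `magnitude_and_phase` are hypothetical helpers.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (the quantity an FFT computes quickly)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def magnitude_and_phase(values):
    spectrum = dft(values)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


# A single tone with 3 cycles over 16 samples.
n = 16
tone = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes, phases = magnitude_and_phase(tone)
# The strongest bin in the first half of the spectrum locates the tone.
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
```

The magnitude is concentrated in bin 3 (and its mirror image, bin 13), which is the "location" of the tone; every bin is a sum over the whole signal, which is why the feature has to be recomputed on a running basis.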

The waveform of a signal is the shape of its graph as a function of time.

(i) Continuous Wave Radar

```
import numpy as np
from scipy import signal

def harmonicradar_cw(df, cols, fs,fc):
    for col in cols:
        ttxt = f'CW: {fc} Hz'
        #%% input
        t = df[col]
        tx = np.sin(2*np.pi*fc*t)
        _,Pxx = signal.welch(tx,fs)
        #%% diode
        d = (signal.square(2*np.pi*fc*t))
        d[d<0] = 0.
        #%% output of diode
        rx = tx * d
        df[col+"_HARRAD"] = rx.values
    return df

df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head()
```

(ii) Saw Tooth

Return a periodic sawtooth or triangle waveform.

```
from scipy import signal

def saw(df, cols):
    for col in cols:
        df[col+" SAW"] = signal.sawtooth(df[col])
    return df

df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head()
```

**(9) Modifications**

A range of modifications usually applied to images; these values would have to be recalculated for each time series.

(i) Various Techniques

```
from tsaug import *

def modify(df, cols):
    for col in cols:
        series = df[col].values
        df[col+"_magnify"], _ = magnify(series, series)
        df[col+"_affine"], _ = affine(series, series)
        df[col+"_crop"], _ = crop(series, series)
        df[col+"_cross_sum"], _ = cross_sum(series, series)
        df[col+"_resample"], _ = resample(series, series)
        df[col+"_trend"], _ = trend(series, series)
        df[col+"_random_affine"], _ = random_time_warp(series, series)
        df[col+"_random_crop"], _ = random_crop(series, series)
        df[col+"_random_cross_sum"], _ = random_cross_sum(series, series)
        df[col+"_random_sidetrack"], _ = random_sidetrack(series, series)
        df[col+"_random_time_warp"], _ = random_time_warp(series, series)
        df[col+"_random_magnify"], _ = random_magnify(series, series)
        df[col+"_random_jitter"], _ = random_jitter(series, series)
        df[col+"_random_trend"], _ = random_trend(series, series)
    return df

df_out = transform.modify(df.copy(),["Close"]); df_out.head()
```

Features that are calculated on a rolling basis over a fixed window size.

(i) Mean, Standard Deviation

```
import pandas as pd

def multiple_rolling(df, windows = [1,2], functions=["mean","std"], columns=None):
    windows = [1+a for a in windows]
    if not columns:
        columns = df.columns.to_list()
    rolling_dfs = (df[columns].rolling(i)                       # 1. Create window
                   .agg(functions)                              # 2. Aggregate
                   .rename({col: '{0}_{1:d}'.format(col, i)
                            for col in columns}, axis=1)        # 3. Rename columns
                   for i in windows)                            # For each window
    df_out = pd.concat((df, *rolling_dfs), axis=1)              # 4. Concatenate dataframes
    da = df_out.iloc[:,len(df.columns):]
    # flatten the (column, function) MultiIndex into single names
    da = [col[0] + "_" + col[1] for col in da.columns.to_list()]
    df_out.columns = df.columns.to_list() + da
    return df_out

df_out = transform.multiple_rolling(df, columns=["Close"]); df_out.head()
```
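
For readers who want to see what a single window computes, here is a minimal pure-Python sketch of a rolling mean and standard deviation. The helper name `rolling_features` is hypothetical, and the sketch assumes a window of at least two points; it uses the sample standard deviation, which matches the pandas default for `rolling(...).std()`.

```python
import statistics


def rolling_features(values, window):
    """Trailing-window mean and sample standard deviation (window >= 2)."""
    means, stds = [], []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet: leave a gap, as pandas leaves NaN.
            means.append(None)
            stds.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            means.append(statistics.mean(chunk))
            stds.append(statistics.stdev(chunk))
    return means, stds


means, stds = rolling_features([1.0, 2.0, 4.0, 7.0], window=2)
```

The leading gaps are the reason the first few rows of a rolling feature have to be dropped before training.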

Lagged values from existing features.

(i) Single Steps

```
def multiple_lags(df, start=1, end=3,columns=None):
    if not columns:
        columns = df.columns.to_list()
    lags = range(start, end+1)  # lags from start to end, inclusive
    df = df.assign(**{
        '{}_t_{}'.format(col, t): df[col].shift(t)
        for t in lags
        for col in columns
    })
    return df

df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()
```

There is a range of time series models that can be implemented, such as AR, MA, ARMA, ARIMA, SARIMA, SARIMAX, VAR, VARMA, VARMAX, SES, and HWES. The models can be divided into autoregressive models and smoothing models. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. Each method might require specific tuning and parameters to suit your prediction task. You need to drop a certain amount of historical data that you use during the fitting stage. Models that take seasonality into account need more training data.

(i) Prophet

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality. You can apply additive models to your training data but also interactive models like deep learning models. The problem is that, because these models have learned from future observations, there would thus be a need to recalculate the time series on a running basis, or to only include the predicted as opposed to fitted values in future training and test sets. In this example, I train on 150 data points to illustrate how the remaining 100 or so data points can be used in a new prediction problem. You can plot with `df_out["Close_PROPHET"].plot()` to see the effect.

```
import pandas as pd
from fbprophet import Prophet

def prophet_feat(df, cols,date, freq,train_size=150):
    def prophet_dataframe(df):
        # Prophet expects the columns to be named 'ds' (date) and 'y' (value)
        df.columns = ['ds','y']
        return df

    def original_dataframe(df, freq, name):
        prophet_pred = pd.DataFrame({"Date" : df['ds'], name : df["yhat"]})
        prophet_pred = prophet_pred.set_index("Date")
        #prophet_pred.index.freq = pd.tseries.frequencies.to_offset(freq)
        return prophet_pred[name].values

    for col in cols:
        model = Prophet(daily_seasonality=True)
        # fit on the first train_size rows only, then forecast the remainder
        fb = model.fit(prophet_dataframe(df[[date, col]].head(train_size)))
        forecast_len = len(df) - train_size
        future = model.make_future_dataframe(periods=forecast_len,freq=freq)
        future_pred = model.predict(future)
        df[col+"_PROPHET"] = list(original_dataframe(future_pred,freq,col))
    return df

df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head()
```

Interactions are defined as methods that require more than one feature to create an additional feature. Here we include normalising and discretising techniques that are non-feature specific. Almost all of these methods can be applied to cross-sectional data. The only methods that are time specific are the technical features in the speciality section and the autoregression model.

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables.

(i) Lowess Smoother

The lowess smoother is a robust locally weighted regression. The function fits a nonparametric regression curve to a scatterplot.

```
from math import ceil
import numpy as np
from scipy import linalg

def lowess(df, cols, y, f=2. / 3., iter=3):
    y = np.asarray(y, dtype=float)
    for col in cols:
        x = df[col].to_numpy(dtype=float)  # plain arrays avoid 2-D indexing on a Series
        n = len(x)
        r = int(ceil(f * n))
        # bandwidth per point: distance to its r-th nearest neighbour
        h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
        w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
        w = (1 - w ** 3) ** 3  # tricube weights
        yest = np.zeros(n)
        delta = np.ones(n)
        for iteration in range(iter):
            for i in range(n):
                # weighted least-squares line around point i
                weights = delta * w[:, i]
                b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
                A = np.array([[np.sum(weights), np.sum(weights * x)],
                              [np.sum(weights * x), np.sum(weights * x * x)]])
                beta = linalg.solve(A, b)
                yest[i] = beta[0] + beta[1] * x[i]
            residuals = y - yest
            s = np.median(np.abs(residuals))
            delta = np.clip(residuals / (6.0 * s), -1, 1)
            delta = (1 - delta ** 2) ** 2  # bisquare robustness weights
        df[col+"_LOWESS"] = yest
    return df

df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()
```

(ii) Autoregression

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.

```
import numpy as np
import pandas as pd
# note: AR is deprecated in favour of AutoReg in newer statsmodels releases
from statsmodels.tsa.ar_model import AR
from timeit import default_timer as timer

def autoregression(df, drop=None, settings={"autoreg_lag":4}):
    autoreg_lag = settings["autoreg_lag"]
    if drop:
        keep = df[drop]
        df = df.drop([drop],axis=1)  # keep a DataFrame; df.values is used below
    n_channels = df.shape[0]
    t = timer()
    channels_regg = np.zeros((n_channels, autoreg_lag + 1))
    for i in range(0, n_channels):
        fitted_model = AR(df.values[i, :]).fit(autoreg_lag)
        # TODO: This is not the same as Matlab's for some reasons!
        # kk = ARMAResults(fitted_model)
        # autore_vals, dummy1, dummy2 = arburg(x[i, :], autoreg_lag) # This looks like Matlab's but slow
        channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params)
    for i in range(channels_regg.shape[1]):
        df["LAG_"+str(i+1)] = channels_regg[:,i]
    if drop:
        df = pd.concat((keep,df),axis=1)
    t = timer() - t
    return df

df_out = interact.autoregression(df.copy()); df_out.head()
```
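
The regression equation behind autoregression is easiest to see for a single lag. The following is a minimal sketch of an AR(1) fit by ordinary least squares in pure Python; it is not the statsmodels estimator, and `fit_ar1` and `predict_next` are hypothetical helper names.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]."""
    x = series[:-1]  # previous values
    y = series[1:]   # next values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def predict_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * series[-1]


# Noise-free series generated by y[t] = 2 + 0.5 * y[t-1].
series = [0.0]
for _ in range(10):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = predict_next(series, c, phi)
```

On this noise-free series the fit recovers the generating coefficients; with more lags the same idea becomes a multiple regression on several shifted copies of the series.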

Looking at interaction between different features. Here the methods employed are multiplication and division.

(i) Multiplication and Division

```
def muldiv(df, feature_list):
    for feat in feature_list:
        for feat_two in feature_list:
            if feat==feat_two:
                continue
            else:
                # the denominator is shifted by its minimum; note that the row holding
                # the minimum still has a zero denominator and yields inf
                df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min())
                df[feat+"_X_"+feat_two] = df[feat]*(df[feat_two])
    return df

df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head()
```

In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables to discretized or nominal attributes.

(i) Decision Tree Discretiser

The first method that will be applied here is a supervised discretiser. Discretisation with decision trees consists of using a decision tree to identify the optimal splitting points that determine the bins or contiguous intervals.

```
from sklearn.tree import DecisionTreeRegressor

def decision_tree_disc(df, cols, depth=4 ):
    for col in cols:
        df[col +"_m1"] = df[col].shift(1)
        df = df.iloc[1:,:].copy()  # drop the row that has no lagged value
        tree_model = DecisionTreeRegressor(max_depth=depth,random_state=0)
        tree_model.fit(df[col +"_m1"].to_frame(), df[col])
        # each leaf of the tree acts as one bin
        df[col+"_Disc"] = tree_model.predict(df[col +"_m1"].to_frame())
    return df

df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head()
```

Normalising normally pertains to the scaling of data. There are many methods available; interacting normalising methods make use of all the features' attributes to do the scaling.

(i) Quantile Normalisation

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties.

```
import numpy as np
import pandas as pd

def quantile_normalize(df, drop):
    if drop:
        keep = df[drop]
        df = df.drop(drop,axis=1)
    #compute rank
    dic = {}
    for col in df:
        dic.update({col : sorted(df[col])})
    sorted_df = pd.DataFrame(dic)
    rank = sorted_df.mean(axis = 1).tolist()
    #sort
    for col in df:
        t = np.searchsorted(np.sort(df[col]), df[col])
        df[col] = [rank[i] for i in t]
    if drop:
        df = pd.concat((keep,df),axis=1)
    return df

df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head()
```
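
A tiny worked example may help show what the procedure does. The sketch below is a pure-Python version under the simplifying assumption that no column contains ties; `quantile_normalize_columns` is a hypothetical helper, not part of the package.

```python
def quantile_normalize_columns(columns):
    """columns: dict of name -> equal-length list of numbers (no ties assumed)."""
    names = list(columns)
    n = len(columns[names[0]])
    sorted_cols = [sorted(columns[name]) for name in names]
    # Mean of the k-th smallest values across all columns.
    rank_means = [sum(col[k] for col in sorted_cols) / len(names) for k in range(n)]
    out = {}
    for name in names:
        # Indices of this column's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: columns[name][i])
        normalised = [None] * n
        for rank, idx in enumerate(order):
            normalised[idx] = rank_means[rank]
        out[name] = normalised
    return out


data = {"a": [5, 2, 3, 4], "b": [4, 1, 6, 2]}
result = quantile_normalize_columns(data)
```

After the transformation both columns contain exactly the same set of values, in the order implied by each column's original ranking, which is what "identical in statistical properties" means here.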

There are multiple types of distance functions, such as Euclidean, Mahalanobis, and Minkowski distance. Here we use a contrived example based on the location-based haversine distance.

(i) Haversine Distance

The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere.

```
from math import sin, cos, sqrt, atan2, radians

def haversine_distance(row, lon="Open", lat="Close"):
    # reference point (latitude, longitude), converted to radians
    c_lat,c_long = radians(52.5200), radians(13.4050)
    R = 6373.0  # approximate Earth radius in km
    # use the column names passed in rather than hard-coded ones
    long_r = radians(row[lon])
    lat_r = radians(row[lat])
    dlon = long_r - c_long
    dlat = lat_r - c_lat
    a = sin(dlat / 2)**2 + cos(lat_r) * cos(c_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

df_out['distance_central'] = df.apply(interact.haversine_distance,axis=1); df_out.head()
```

(i) Technical Features

Technical indicators are heuristic or mathematical calculations based on the price, volume, or open interest of a security or contract used by traders who follow technical analysis. By analyzing historical data, technical analysts use indicators to predict future price movements.

```
import ta

def tech(df):
    return ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume")

df_out = interact.tech(df.copy()); df_out.head()
```

Genetic programming has shown promise in constructing features by using original features to form high-level ones that can help algorithms achieve better performance.

(i) Symbolic Transformer

A symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship.

```
df.head()
```

```
import pandas as pd
from gplearn.genetic import SymbolicTransformer

def genetic_feat(df, num_gen=20, num_comp=10):
    function_set = ['add', 'sub', 'mul', 'div',
                    'sqrt', 'log', 'abs', 'neg', 'inv','tan']
    gp = SymbolicTransformer(generations=num_gen, population_size=200,
                             hall_of_fame=100, n_components=num_comp,
                             function_set=function_set,
                             parsimony_coefficient=0.0005,
                             max_samples=0.9, verbose=1,
                             random_state=0, n_jobs=6)
    # "Close_1" is the supervised target; all other columns are inputs
    gen_feats = gp.fit_transform(df.drop("Close_1", axis=1), df["Close_1"])
    gen_feats = pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])])
    gen_feats.index = df.index
    return pd.concat((df,gen_feats),axis=1)

df_out = interact.genetic_feat(df.copy()); df_out.head()
```

Methods that help with the summarisation of features by remapping them to achieve some aim, like the maximisation of variability or class separability. These methods tend to be unsupervised, but can also take a supervised form.

Eigendecomposition or sometimes spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Some examples are LDA and PCA.

(i) Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

```
import pandas as pd
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA

def pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,n_components=5 ,drop_cols=None, non_linear=True):
    if non_linear:
        pca = KernelPCA(n_components = n_components, kernel='rbf', fit_inverse_transform=True, random_state = 33, remove_zero_eig= True)
    else:
        if memory_issues:
            if not mem_iss_component:
                raise ValueError("If you have memory issues, you have to preselect mem_iss_component")
            pca = IncrementalPCA(mem_iss_component)
        else:
            if variance_or_components>1:
                pca = PCA(n_components=variance_or_components)
            else: # automated selection based on variance
                pca = PCA(n_components=variance_or_components,svd_solver="full")
    if drop_cols:
        X_pca = pca.fit_transform(df.drop(drop_cols,axis=1))
        return pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)),axis=1)
    else:
        X_pca = pca.fit_transform(df)
        return pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)

df_out = mapper.pca_feature(df.copy(), variance_or_components=0.9, n_components=8,non_linear=False)
```

These families of algorithms are useful to find linear relations between two multivariate datasets.

(i) Canonical Correlation Analysis

Canonical-correlation analysis (CCA) is a way of inferring information from cross-covariance matrices.

```
import pandas as pd
from sklearn.cross_decomposition import CCA

def cross_lag(df, drop=None, lags=1, components=4 ):
    if drop:
        keep = df[drop]
        df = df.drop([drop],axis=1)
    df_2 = df.shift(lags)
    df = df.iloc[lags:,:]
    df_2 = df_2.dropna().reset_index(drop=True)
    cca = CCA(n_components=components)
    cca.fit(df_2, df)
    X_c, df_2 = cca.transform(df_2, df)
    df_2 = pd.DataFrame(df_2, index=df.index)
    df_2 = df_2.add_prefix('crd_')  # prefix the CCA scores, not a copy of the inputs
    if drop:
        df = pd.concat([keep,df,df_2],axis=1)
    else:
        df = pd.concat([df,df_2],axis=1)
    return df

df_out = mapper.cross_lag(df.copy()); df_out.head()
```

Functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines.

(i) Additive Chi2 Kernel

Computes the additive chi-squared kernel between observations in X and Y. The chi-squared kernel is computed between each pair of rows in X and Y. X and Y have to be non-negative.

```
import pandas as pd
from sklearn.kernel_approximation import AdditiveChi2Sampler

def a_chi(df, drop=None, lags=1, sample_steps=2 ):
    if drop:
        keep = df[drop]
        df = df.drop([drop],axis=1)
    df_2 = df.shift(lags)
    df = df.iloc[lags:,:]
    df_2 = df_2.dropna().reset_index(drop=True)
    chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps)
    df_2 = chi2sampler.fit_transform(df_2, df["Close"])
    df_2 = pd.DataFrame(df_2, index=df.index)
    df_2 = df_2.add_prefix('achi_')  # prefix the kernel features, not a copy of the inputs
    if drop:
        df = pd.concat([keep,df,df_2],axis=1)
    else:
        df = pd.concat([df,df_2],axis=1)
    return df

df_out = mapper.a_chi(df.copy()); df_out.head()
```

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore noise.

(i) Feed Forward

The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to the single-layer perceptrons that participate in multilayer perceptrons.

```
from sklearn.preprocessing import minmax_scale
import tensorflow as tf
import numpy as np
import pandas as pd

def encoder_dataset(df, drop=None, dimesions=20):
    if drop:
        train_scaled = minmax_scale(df.drop(drop,axis=1).values, axis = 0)
    else:
        train_scaled = minmax_scale(df.values, axis = 0)
    # define the number of encoding dimensions
    encoding_dim = dimesions
    # define the number of features
    ncol = train_scaled.shape[1]
    input_dim = tf.keras.Input(shape = (ncol, ))
    # Encoder Layers
    encoded1 = tf.keras.layers.Dense(3000, activation = 'relu')(input_dim)
    encoded2 = tf.keras.layers.Dense(2750, activation = 'relu')(encoded1)
    encoded3 = tf.keras.layers.Dense(2500, activation = 'relu')(encoded2)
    encoded4 = tf.keras.layers.Dense(750, activation = 'relu')(encoded3)
    encoded5 = tf.keras.layers.Dense(500, activation = 'relu')(encoded4)
    encoded6 = tf.keras.layers.Dense(250, activation = 'relu')(encoded5)
    encoded7 = tf.keras.layers.Dense(encoding_dim, activation = 'relu')(encoded6)
    encoder = tf.keras.Model(inputs = input_dim, outputs = encoded7)
    encoded_input = tf.keras.Input(shape = (encoding_dim, ))
    encoded_train = pd.DataFrame(encoder.predict(train_scaled),index=df.index)
    encoded_train = encoded_train.add_prefix('encoded_')
    if drop:
        encoded_train = pd.concat((df[drop],encoded_train),axis=1)
    return encoded_train

df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()
```

```
df_out.head()
```

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data.

(i) Local Linear Embedding

Locally Linear Embedding is a method of non-linear dimensionality reduction. It tries to reduce the n dimensions while preserving the geometric features of the original non-linear feature structure.

```
import pandas as pd
from sklearn.manifold import LocallyLinearEmbedding

def lle_feat(df, drop=None, components=4):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    embedding = LocallyLinearEmbedding(n_components=components)
    em = embedding.fit_transform(df)
    df = pd.DataFrame(em,index=df.index)
    df = df.add_prefix('lle_')
    if drop:
        df = pd.concat((keep,df),axis=1)
    return df

df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head()
```

Most clustering techniques start with a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together according to some measure. Although these clustering techniques are typically used for observations, they can also be used for feature dimensionality reduction, especially hierarchical clustering techniques.

(i) Feature Agglomeration

Feature agglomeration uses clustering to group together features that look very similar, thus decreasing the number of features.

```
import numpy as np
import pandas as pd
from sklearn import datasets, cluster

def feature_agg(df, drop=None, components=4):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    components = min(df.shape[1]-1,components)  # cannot have more clusters than features
    agglo = cluster.FeatureAgglomeration(n_clusters=components)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df),index=df.index)
    df = df.add_prefix('feagg_')
    if drop:
        return pd.concat((keep,df),axis=1)
    else:
        return df

df_out = mapper.feature_agg(df.copy(),["Close_1"],4 ); df_out.head()
```

Neighbouring points can be calculated using distance metrics like Hamming, Manhattan, and Minkowski distance. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.

(i) Nearest Neighbours

Unsupervised learner for implementing neighbor searches.

```
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def neigh_feat(df, drop, neighbors=6):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    # cap the neighbour count at the number of other samples available
    components = min(df.shape[0]-1,neighbors)
    neigh = NearestNeighbors(n_neighbors=components)
    neigh.fit(df)
    neigh = neigh.kneighbors()[0]  # distances to the nearest neighbours
    df = pd.DataFrame(neigh, index=df.index)
    df = df.add_prefix('neigh_')
    if drop:
        return pd.concat((keep,df),axis=1)
    else:
        return df

df_out = mapper.neigh_feat(df.copy(),["Close_1"],4 ); df_out.head()
```

When working with extraction, you have to decide the size of the time series history to take into account when calculating a collection of walk-forward feature values. To facilitate our extraction, we use an excellent package called TSfresh, and also some of their default features. For completeness, we also include 12 or so custom features to be added to the extraction pipeline.

The *time series* methods in the transformation section and the interaction section are similar to the methods we will uncover in the extraction section. However, for transformation and interaction methods the output is an entirely new time series, whereas extraction methods take as input multiple constructed time series and extract a single value from each to reconstruct an entirely new time series.

Some methods naturally fit better in one format than another, e.g., lags are too expensive for extraction; time series decomposition only has to be performed once, because it has a low level of 'leakage', so it is better suited to transformation; and forecast methods attempt to predict multiple future training samples, so they won't work with extraction, which only delivers one value per time series. Furthermore, all non-time-series (cross-sectional) transformation and extraction techniques cannot make use of extraction, as it is solely a time-series method.

Lastly, when we want to apply specific functions twice, we can apply one as a transformation/interaction, after which all the extraction methods can be applied to that feature as well. For example, if we calculate a smoothing function (transformation), then all other extraction functions (median, entropy, linearity etc.) can be applied to that smoothed series, including the smoothing function itself, e.g., a double smooth, double lag, double filter etc. So separating these methods out gives us great flexibility.
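
The walk-forward idea can be sketched in a few lines of pure Python. This is an illustration only, not the TSfresh pipeline: `walk_forward_extract` is a hypothetical helper that slides a fixed-size history window over a series and reduces each window to one value, and `mean_abs_change` is a stand-in for any extraction function.

```python
def mean_abs_change(window):
    """Mean of the absolute differences between consecutive values."""
    diffs = [abs(b - a) for a, b in zip(window, window[1:])]
    return sum(diffs) / len(diffs)


def walk_forward_extract(series, window, func):
    """One value per step, each computed from the `window` most recent points only."""
    return [func(series[end - window:end])
            for end in range(window, len(series) + 1)]


prices = [1.0, 2.0, 4.0, 7.0, 11.0]
feature = walk_forward_extract(prices, 3, mean_abs_change)
# The extracted feature is itself a time series, so it can be extracted from again.
double_feature = walk_forward_extract(feature, 2, mean_abs_change)
```

The output is shorter than the input by `window - 1` points, and the second call shows the "double application" described above: the same extraction function applied to an already extracted series.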

Decorator

```
def set_property(key, value):
    """
    This method returns a decorator that sets the property key of the function to value
    """
    def decorate_func(func):
        setattr(func, key, value)
        if func.__doc__ and key == "fctype":
            func.__doc__ = func.__doc__ + "\n\n    *This function is of type: " + value + "*\n"
        return func
    return decorate_func
```

You can calculate the linear, non-linear and absolute energy of a time series. In signal processing, the energy $E_S$ of a continuous-time signal $x(t)$ is defined as the area under the squared magnitude of the considered signal. Mathematically, $E_{s}=\langle x(t), x(t)\rangle=\int_{-\infty}^{\infty}|x(t)|^{2} d t$

(i) Absolute Energy

Returns the absolute energy of the time series, which is the sum over the squared values.

```
#-> In Package
def abs_energy(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.dot(x, x)

extract.abs_energy(df["Close"])
```

Here we widely define distance measures as those that take a difference between attributes or series of datapoints.

(i) Complexity-Invariant Distance

This function calculates an estimate of the complexity of a time series.

```
#-> In Package
def cid_ce(x, normalize):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if normalize:
        s = np.std(x)
        if s!=0:
            x = (x - np.mean(x))/s
        else:
            return 0.0
    x = np.diff(x)
    return np.sqrt(np.dot(x, x))

extract.cid_ce(df["Close"], True)
```
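
Read off the code above, the complexity estimate for a series $x_1, \dots, x_n$ is the Euclidean length of its first differences. When `normalize` is true, the series is first standardised to zero mean and unit standard deviation (and the estimate is defined as $0$ for a constant series).

$$
CE(x) = \sqrt{\sum_{i=1}^{n-1} \left(x_{i+1} - x_i\right)^2}
$$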

Many alternatives to differencing exist. One can, for example, take the difference of every other value, take the squared difference, take the fractional difference, or, like our example, take the mean absolute difference.

(i) Mean Absolute Change

Returns the mean over the absolute differences between subsequent time series values.

```
#-> In Package
def mean_abs_change(x):
    return np.mean(np.abs(np.diff(x)))
extract.mean_abs_change(df["Close"])
```
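
The differencing variants mentioned above can be sketched on a small toy array. Here "the difference of every other value" is interpreted as a lag-2 difference, which is an assumption about the intended meaning; fractional differencing is left out.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 7.0, 3.0])    # illustrative values only

first_diff = np.diff(x)                    # consecutive differences
every_other = x[2:] - x[:-2]               # lag-2 differences (assumed meaning)
mean_sq = float(np.mean(first_diff ** 2))
mean_abs = float(np.mean(np.abs(first_diff)))
print(first_diff, every_other, mean_sq, mean_abs)
```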

Features where the emphasis is on the rate of change.

(i) Mean Central Second Derivative

Returns the mean value of a central approximation of the second derivative

```
#-> In Package
def _roll(a, shift):
    if not isinstance(a, np.ndarray):
        a = np.asarray(a)
    idx = shift % len(a)
    return np.concatenate([a[-idx:], a[:-idx]])
def mean_second_derivative_central(x):
    diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0
    return np.mean(diff[1:-1])
extract.mean_second_derivative_central(df["Close"])
```
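
The roll-based expression is equivalent to the central stencil $(x_{i+1} - 2x_i + x_{i-1})/2$ evaluated at the interior points. The sketch below writes the same stencil with slices and checks it on two toy series whose curvature is known.

```python
import numpy as np


def mean_second_derivative_central(x):
    # Central stencil on interior points, written with slices instead of rolls.
    x = np.asarray(x, dtype=float)
    return float(np.mean((x[2:] - 2 * x[1:-1] + x[:-2]) / 2.0))


t = np.arange(6)
quadratic = mean_second_derivative_central(t ** 2)    # constant curvature
linear = mean_second_derivative_central(3 * t + 1)    # no curvature
print(quadratic, linear)
```

Note the division by 2 in the stencil: for $x = t^2$ the feature is 1, not the analytic second derivative 2.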

Volatility is a statistical measure of the dispersion of a time-series.

(i) Variance Larger than Standard Deviation

```
#-> In Package
def variance_larger_than_standard_deviation(x):
    y = np.var(x)
    return y > np.sqrt(y)
extract.variance_larger_than_standard_deviation(df["Close"])
```

(ii) Variability Index

Variability Index is a way to measure how smooth or 'variable' a time series is.

```
var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
@set_property("fctype", "combiner")
@set_property("custom", True)
def var_index(time,param=var_index_param):
    final = []
    keys = []
    for key, magnitude in param.items():
        w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2)
        w_mean = np.mean(w)
        N = len(time)
        sigma2 = np.var(magnitude)
        S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2)
        S2 = sum(w)
        eta_e = (w_mean * np.power(time[N - 1] -
                 time[0], 2) * S1 / (sigma2 * S2 * N ** 2))
        final.append(eta_e)
        keys.append(key)
    return {"Interact__{}".format(k): eta_e for eta_e, k in zip(final,keys) }
extract.var_index(df["Close"].values,var_index_param)
```

Features that emphasise a particular shape not ordinarily considered a distribution statistic. This extends to derivations of the original time series too, for example a feature looking at the sinusoidal shape of an autocorrelation plot.

(i) Symmetrical

Boolean variable denoting if the distribution of x looks symmetric.

```
#-> In Package
def symmetry_looking(x, param=[{"r": 0.2}]):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    mean_median_difference = np.abs(np.mean(x) - np.median(x))
    max_min_difference = np.max(x) - np.min(x)
    return [("r_{}".format(r["r"]), mean_median_difference < (r["r"] * max_min_difference))
            for r in param]
extract.symmetry_looking(df["Close"])
```
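
The criterion is that the mean and median differ by less than a fraction `r` of the range. A small sketch with a single threshold and invented data shows one series that passes and one that fails; `looks_symmetric` is a simplified stand-in for the function above.

```python
import numpy as np


def looks_symmetric(x, r=0.2):
    # True when |mean - median| is below r times the range of the data.
    x = np.asarray(x, dtype=float)
    return bool(abs(x.mean() - np.median(x)) < r * (x.max() - x.min()))


sym = looks_symmetric([1, 2, 3, 4, 5])      # mean equals median
skew = looks_symmetric([0, 0, 0, 10, 10])   # mean 4, median 0, range 10
print(sym, skew)
```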

Looking at the occurrence and reoccurrence of defined values.

(i) Has Duplicate Max

```
#-> In Package
def has_duplicate_max(x):
    """
    Checks if the maximum value of x is observed more than once
    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: bool
    """
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.sum(x == np.max(x)) >= 2
extract.has_duplicate_max(df["Close"])
```

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay.

(i) Partial Autocorrelation

Partial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.

```
#-> In Package
from statsmodels.tsa.stattools import acf, adfuller, pacf
def partial_autocorrelation(x, param=[{"lag": 1}]):
    # Check the difference between demanded lags by param and possible lags to calculate (depends on len(x))
    max_demanded_lag = max([lag["lag"] for lag in param])
    n = len(x)
    # Check if list is too short to make calculations
    if n <= 1:
        pacf_coeffs = [np.nan] * (max_demanded_lag + 1)
    else:
        if (n <= max_demanded_lag):
            max_lag = n - 1
        else:
            max_lag = max_demanded_lag
        pacf_coeffs = list(pacf(x, method="ld", nlags=max_lag))
        pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag))
    return [("lag_{}".format(lag["lag"]), pacf_coeffs[lag["lag"]]) for lag in param]
extract.partial_autocorrelation(df["Close"])
```

Stochastic refers to a randomly determined process. Any features trying to capture stochasticity by degree or type are included under this branch.

(i) Augmented Dickey Fuller

The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.

```
#-> In Package
# Exception types used below (adfuller is imported in the previous block)
from numpy.linalg import LinAlgError
from statsmodels.tools.sm_exceptions import MissingDataError
def augmented_dickey_fuller(x, param=[{"attr": "teststat"}]):
    res = None
    try:
        res = adfuller(x)
    except LinAlgError:
        res = np.NaN, np.NaN, np.NaN
    except ValueError: # occurs if sample size is too small
        res = np.NaN, np.NaN, np.NaN
    except MissingDataError: # is thrown for e.g. inf or nan in the data
        res = np.NaN, np.NaN, np.NaN
    return [('attr_"{}"'.format(config["attr"]),
             res[0] if config["attr"] == "teststat"
             else res[1] if config["attr"] == "pvalue"
             else res[2] if config["attr"] == "usedlag" else np.NaN)
            for config in param]
extract.augmented_dickey_fuller(df["Close"])
```

(i) Median of Magnitudes Skew

```
@set_property("fctype", "simple")
@set_property("custom", True)
def gskew(x):
    interpolation="nearest"
    median_mag = np.median(x)
    F_3_value = np.percentile(x, 3, interpolation=interpolation)
    F_97_value = np.percentile(x, 97, interpolation=interpolation)
    skew = (np.median(x[x <= F_3_value]) +
            np.median(x[x >= F_97_value]) - 2 * median_mag)
    return skew
extract.gskew(df["Close"])
```

(ii) Stetson Mean

An iteratively weighted mean used in the Stetson variability index

```
stestson_param = {"weight":100., "alpha":2., "beta":2., "tol":1.e-6, "nmax":20}
@set_property("fctype", "combiner")
@set_property("custom", True)
def stetson_mean(x, param=stestson_param):
    weight= stestson_param["weight"]
    alpha= stestson_param["alpha"]
    beta = stestson_param["beta"]
    tol= stestson_param["tol"]
    nmax= stestson_param["nmax"]
    mu = np.median(x)
    for i in range(nmax):
        resid = x - mu
        resid_err = np.abs(resid) * np.sqrt(weight)
        weight1 = weight / (1. + (resid_err / alpha)**beta)
        weight1 /= weight1.mean()
        diff = np.mean(x * weight1) - mu
        mu += diff
        if (np.abs(diff) < tol*np.abs(mu) or np.abs(diff) < tol):
            break
    return mu
extract.stetson_mean(df["Close"])
```

(i) Length

```
#-> In Package
def length(x):
    return len(x)
extract.length(df["Close"])
```

(i) Count Above Mean

Returns the number of values in x that are higher than the mean of x

```
#-> In Package
def count_above_mean(x):
    m = np.mean(x)
    return np.where(x > m)[0].size
extract.count_above_mean(df["Close"])
```

(i) Longest Strike Below Mean

Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x

```
#-> In Package
import itertools
def get_length_sequences_where(x):
    if len(x) == 0:
        return [0]
    else:
        res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1]
        return res if len(res) > 0 else [0]
def longest_strike_below_mean(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.max(get_length_sequences_where(x <= np.mean(x))) if x.size > 0 else 0
extract.longest_strike_below_mean(df["Close"])
```
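
The run-length logic relies on `itertools.groupby`, which groups consecutive equal values. The sketch below applies the same idea to a toy array so the intermediate mask and the longest run can be inspected; `longest_run` is a hypothetical helper name.

```python
import itertools

import numpy as np


def longest_run(mask):
    # Lengths of consecutive runs of True values; 0 if there are none.
    runs = [len(list(group)) for value, group in itertools.groupby(mask) if value]
    return max(runs) if runs else 0


x = np.array([1, 2, 1, 1, 1, 5, 6, 1])   # illustrative values, mean 2.25
mask = x <= x.mean()                     # True where the value is not above the mean
longest = longest_run(mask)
print(mask.tolist(), longest)
```

As in the function above, values equal to the mean count as "below" because the comparison is `<=`.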

(ii) Wozniak

This is an astronomical feature: we count the number of three consecutive data points that are brighter or fainter than $2\sigma$ and normalize the number by $N-2$.

```
woz_param = [{"consecutiveStar": n} for n in [2, 4]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def wozniak(magnitude, param=woz_param):
    iters = []
    for consecutiveStar in [stars["consecutiveStar"] for stars in param]:
        N = len(magnitude)
        if N < consecutiveStar:
            return 0
        sigma = np.std(magnitude)
        m = np.mean(magnitude)
        count = 0
        for i in range(N - consecutiveStar + 1):
            flag = 0
            for j in range(consecutiveStar):
                if(magnitude[i + j] > m + 2 * sigma or
                        magnitude[i + j] < m - 2 * sigma):
                    flag = 1
                else:
                    flag = 0
                    break
            if flag:
                count = count + 1
        iters.append(count * 1.0 / (N - consecutiveStar + 1))
    return [("consecutiveStar_{}".format(config["consecutiveStar"]), iters[en] ) for en, config in enumerate(param)]
extract.wozniak(df["Close"])
```

(i) Last location of Maximum

Returns the relative last location of the maximum value of x.

```
#-> In Package
def last_location_of_maximum(x):
    x = np.asarray(x)
    return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.NaN
extract.last_location_of_maximum(df["Close"])
```
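
The reversal trick can be checked by hand on a short series with a repeated maximum. In the toy example below the maximum 5 occurs at positions 1 and 3 (zero-based), so the last occurrence sits at relative position 4/5.

```python
import numpy as np


def last_location_of_maximum(x):
    # argmax on the reversed array finds the last maximum; 1 - (.)/n makes it relative.
    x = np.asarray(x)
    return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.nan


loc = last_location_of_maximum([1, 5, 2, 5, 3])
print(loc)
```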

Any coefficients obtained from a model that might help in the prediction problem. For example, here we might include the coefficients of a polynomial $h(x)$ that has been fitted to the deterministic dynamics of a Langevin model.

(i) FFT Coefficient

Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input.

```
#-> In Package
def fft_coefficient(x, param = [{"coeff": 10, "attr": "real"}]):
    assert min([config["coeff"] for config in param]) >= 0, "Coefficients must be positive or zero."
    assert set([config["attr"] for config in param]) <= set(["imag", "real", "abs", "angle"]), \
        'Attribute must be "real", "imag", "angle" or "abs"'
    fft = np.fft.rfft(x)
    def complex_agg(x, agg):
        if agg == "real":
            return x.real
        elif agg == "imag":
            return x.imag
        elif agg == "abs":
            return np.abs(x)
        elif agg == "angle":
            return np.angle(x, deg=True)
    res = [complex_agg(fft[config["coeff"]], config["attr"]) if config["coeff"] < len(fft)
           else np.NaN for config in param]
    # Pair every requested configuration with its own result (not always res[0])
    index = [('coeff_{}__attr_"{}"'.format(config["coeff"], config["attr"]), value)
             for config, value in zip(param, res)]
    return index
extract.fft_coefficient(df["Close"])
```
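
A quick way to sanity-check the coefficients is to feed in a pure cosine with a known number of cycles. In the sketch below (toy signal, three cycles over 32 samples) the magnitude of coefficient 3 is $N/2 = 16$ and the other coefficients are numerically zero; the label format mirrors the one used above.

```python
import numpy as np

n = 32
t = np.arange(n)
x = np.cos(2 * np.pi * 3 * t / n)   # exactly three cycles in the window

fft = np.fft.rfft(x)
features = [
    ('coeff_{}__attr_"{}"'.format(k, "abs"), float(np.abs(fft[k])))
    for k in (0, 3, 5)
]
print(features)
```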

(ii) AR Coefficient

This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.

```
#-> In Package
from statsmodels.tsa.ar_model import AR
def ar_coefficient(x, param=[{"coeff": 5, "k": 5}]):
    calculated_ar_params = {}
    x_as_list = list(x)
    calculated_AR = AR(x_as_list)
    res = {}
    for parameter_combination in param:
        k = parameter_combination["k"]
        p = parameter_combination["coeff"]
        column_name = "k_{}__coeff_{}".format(k, p)
        if k not in calculated_ar_params:
            try:
                calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params
            except (LinAlgError, ValueError):
                calculated_ar_params[k] = [np.NaN]*k
        mod = calculated_ar_params[k]
        if p <= k:
            try:
                res[column_name] = mod[p]
            except IndexError:
                res[column_name] = 0
        else:
            res[column_name] = np.NaN
    return [(key, value) for key, value in res.items()]
extract.ar_coefficient(df["Close"])
```

This includes finding normal quantile values in the series, but also quantile derived measures like change quantiles and index max quantiles.

(i) Index Mass Quantile

The relative index $i$ where $q\%$ of the mass of the time series $x$ lies left of $i$.

```
#-> In Package
def index_mass_quantile(x, param=[{"q": 0.3}]):
    x = np.asarray(x)
    abs_x = np.abs(x)
    s = sum(abs_x)
    if s == 0:
        # all values in x are zero or it has length 0
        return [("q_{}".format(config["q"]), np.NaN) for config in param]
    else:
        # at least one value is not zero
        mass_centralized = np.cumsum(abs_x) / s
        return [("q_{}".format(config["q"]), (np.argmax(mass_centralized >= config["q"])+1)/len(x)) for config in param]
extract.index_mass_quantile(df["Close"])
```
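
The feature can be followed by hand on a toy series whose mass is concentrated at the end. The sketch below is a single-`q` simplification of the function above; note that `q` is passed as a fraction (0.25), not as a percentage.

```python
import numpy as np


def index_mass_quantile(x, q):
    # Relative index at which the cumulative absolute mass first reaches q.
    x = np.abs(np.asarray(x, dtype=float))
    mass = np.cumsum(x) / np.sum(x)
    return (np.argmax(mass >= q) + 1) / len(x)


x = [1, 1, 1, 1, 6]                      # cumulative mass: 0.1, 0.2, 0.3, 0.4, 1.0
early = index_mass_quantile(x, 0.25)     # reached at the third point
late = index_mass_quantile(x, 0.5)       # only reached at the last point
print(early, late)
```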

(i) Number of CWT Peaks

This feature calculator searches for different peaks in x.

```
from scipy.signal import cwt, find_peaks_cwt, ricker, welch
cwt_param = [ka for ka in [2,6,9]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def number_cwt_peaks(x, param=cwt_param):
    return [("CWTPeak_{}".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param]
extract.number_cwt_peaks(df["Close"])
```

The density, and more specifically the power spectral density of the signal describes the power present in the signal as a function of frequency, per unit frequency.

(i) Cross Power Spectral Density

This feature calculator estimates the cross power spectral density of the time series $x$ at different frequencies.

```
#-> In Package
def spkt_welch_density(x, param=[{"coeff": 5}]):
    freq, pxx = welch(x, nperseg=min(len(x), 256))
    coeff = [config["coeff"] for config in param]
    indices = ["coeff_{}".format(i) for i in coeff]
    if len(pxx) <= np.max(coeff): # There are fewer data points in the time series than requested coefficients
        # filter coefficients that are not contained in pxx
        reduced_coeff = [coefficient for coefficient in coeff if len(pxx) > coefficient]
        not_calculated_coefficients = [coefficient for coefficient in coeff
                                       if coefficient not in reduced_coeff]
        # Fill up the rest of the requested coefficients with np.NaNs
        return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients))
    else:
        return pxx[coeff].ravel()[0]
extract.spkt_welch_density(df["Close"])
```

Any measure of linearity that might make use of something like a linear least-squares regression on the values of the time series. The regression can be against the sequence from 0 to the length of the time series minus one, or against many other alternatives.

(i) Linear Trend Time Wise

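Calculate a linear least-squares regression for the values of the time series. The basic version regresses against the sequence from 0 to the length of the time series minus one; the time-wise version below regresses against the datetime index of the series, expressed as hours elapsed since the first timestamp.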

```
from scipy.stats import linregress
#-> In Package
def linear_trend_timewise(x, param= [{"attr": "pvalue"}]):
    ix = x.index
    # Get differences between each timestamp and the first timestamp in seconds.
    # Then convert to hours and reshape for linear regression
    times_seconds = (ix - ix[0]).total_seconds()
    times_hours = np.asarray(times_seconds / float(3600))
    linReg = linregress(times_hours, x.values)
    return [("attr_\"{}\"".format(config["attr"]), getattr(linReg, config["attr"]))
            for config in param]
extract.linear_trend_timewise(df["Close"])
```
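
The point of the time-wise variant is that irregular sampling is handled correctly because the regressor is elapsed time rather than position. The sketch below illustrates this with NumPy only: it swaps `linregress` for `np.polyfit` (so only slope and intercept are available, not the p-value) and uses invented timestamps.

```python
import numpy as np

# Hypothetical irregularly sampled timestamps (not from the article's data).
times = np.array(
    ["2021-01-01T00:00", "2021-01-01T01:00", "2021-01-01T03:00", "2021-01-01T06:00"],
    dtype="datetime64[s]",
)
# Hours elapsed since the first observation, mirroring times_hours above.
hours = (times - times[0]) / np.timedelta64(1, "h")
values = 5.0 + 2.0 * hours               # series that grows by 2 per hour

slope, intercept = np.polyfit(hours, values, 1)
print(hours, slope, intercept)
```

Regressing the same values against the positions 0, 1, 2, 3 would give a different slope, which is why the index matters here.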

(i) Schreiber Non-Linearity

```
#-> In Package
def c3(x, lag=3):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    n = x.size
    if 2 * lag >= n:
        return 0
    else:
        return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)])
extract.c3(df["Close"])
```

Any feature looking at the complexity of a time series. This is typically used in medical signal disciplines (EEG, EMG). There are multiple types of measures, such as spectral entropy, permutation entropy, sample entropy, approximate entropy, Lempel-Ziv complexity and others. This includes entropy measures and their derivations.

(i) Binned Entropy

Bins the values of x into max_bins equidistant bins.

```
#-> In Package
def binned_entropy(x, max_bins=10):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    hist, bin_edges = np.histogram(x, bins=max_bins)
    probs = hist / x.size
    # Drop empty bins, then sum -p * log(p) with vectorized NumPy calls
    probs = probs[probs != 0]
    return - np.sum(probs * np.log(probs))
extract.binned_entropy(df["Close"])
```
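
With two bins the result is easy to verify: an even split gives $\ln 2$, and any uneven split gives less. The sketch below uses invented four-point series and a self-contained copy of the calculation.

```python
import numpy as np


def binned_entropy(x, max_bins=10):
    # Shannon entropy (natural log) of the histogram of x.
    x = np.asarray(x, dtype=float)
    hist, _ = np.histogram(x, bins=max_bins)
    probs = hist / x.size
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log(probs)))


balanced = binned_entropy([0, 0, 1, 1], max_bins=2)   # probabilities 0.5 / 0.5
skewed = binned_entropy([0, 0, 0, 1], max_bins=2)     # probabilities 0.75 / 0.25
print(balanced, skewed)
```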

(ii) SVD Entropy

SVD entropy is an indicator of the number of eigenvectors that are needed for an adequate explanation of the data set.

```
svd_param = [{"Tau": ta, "DE": de}
             for ta in [4]
             for de in [3,6]]
def _embed_seq(X,Tau,D):
    N =len(X)
    if D * Tau > N:
        print("Cannot build such a matrix, because D * Tau > N")
        exit()
    if Tau<1:
        print("Tau has to be at least 1")
        exit()
    Y= np.zeros((N - (D - 1) * Tau, D))
    for i in range(0, N - (D - 1) * Tau):
        for j in range(0, D):
            Y[i][j] = X[i + j * Tau]
    return Y
@set_property("fctype", "combiner")
@set_property("custom", True)
def svd_entropy(epochs, param=svd_param):
    axis=0
    final = []
    for par in param:
        def svd_entropy_1d(X, Tau, DE):
            Y = _embed_seq(X, Tau, DE)
            W = np.linalg.svd(Y, compute_uv=0)
            W /= sum(W) # normalize singular values
            return -1 * np.sum(W * np.log(W))
        Tau = par["Tau"]
        DE = par["DE"]
        final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0])
    return [("Tau_\"{}\"__De_{}\"".format(par["Tau"], par["DE"]), final[en]) for en, par in enumerate(param)]
extract.svd_entropy(df["Close"].values)
```

(iii) Hjorth

The Complexity parameter represents the change in frequency. The parameter compares the signal's similarity to a pure sine wave, where the value converges to 1 if the signal is more similar.

```
def _hjorth_mobility(epochs):
    diff = np.diff(epochs, axis=0)
    sigma0 = np.std(epochs, axis=0)
    sigma1 = np.std(diff, axis=0)
    return np.divide(sigma1, sigma0)
@set_property("fctype", "simple")
@set_property("custom", True)
def hjorth_complexity(epochs):
    diff1 = np.diff(epochs, axis=0)
    diff2 = np.diff(diff1, axis=0)
    sigma1 = np.std(diff1, axis=0)
    sigma2 = np.std(diff2, axis=0)
    return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs))
extract.hjorth_complexity(df["Close"])
```
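
The claim that the complexity approaches 1 for a sine wave can be checked numerically. The sketch below is a one-dimensional simplification of the functions above, applied to a synthetic sine and to seeded Gaussian noise; both signals are invented for illustration.

```python
import numpy as np


def hjorth_mobility(x):
    # Ratio of the standard deviation of the first difference to that of the signal.
    return np.std(np.diff(x)) / np.std(x)


def hjorth_complexity(x):
    # Mobility of the first difference divided by mobility of the signal.
    return hjorth_mobility(np.diff(x)) / hjorth_mobility(x)


t = np.linspace(0.0, 20.0 * np.pi, 2000)            # ten periods of a toy sine
sine = np.sin(t)
noise = np.random.default_rng(0).normal(size=2000)  # seeded white noise

c_sine = hjorth_complexity(sine)
c_noise = hjorth_complexity(noise)
print(c_sine, c_noise)
```

The sine stays close to 1, while white noise lands noticeably higher because differencing amplifies its high-frequency content.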

Fixed points and equilibria as identified from fitted models.

(i) Langevin Fixed Points

Largest fixed point of dynamics $\max \{h(x)=0\}$ estimated from polynomial $h(x)$, which has been fitted to the deterministic dynamics of the Langevin model.

```
#-> In Package
def _estimate_friedrich_coefficients(x, m, r):
    assert m > 0, "Order of polynomial need to be positive integer, found {}".format(m)
    df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)})
    try:
        df['quantiles'] = pd.qcut(df.signal, r)
    except ValueError:
        return [np.NaN] * (m + 1)
    quantiles = df.groupby('quantiles')
    result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()})
    result.dropna(inplace=True)
    try:
        return np.polyfit(result.x_mean, result.y_mean, deg=m)
    except (np.linalg.LinAlgError, ValueError):
        return [np.NaN] * (m + 1)
def max_langevin_fixed_point(x, r=3, m=30):
    coeff = _estimate_friedrich_coefficients(x, m, r)
    try:
        max_fixed_point = np.max(np.real(np.roots(coeff)))
    except (np.linalg.LinAlgError, ValueError):
        return np.nan
    return max_fixed_point
extract.max_langevin_fixed_point(df["Close"])
```

Features derived from peaked values in either the positive or negative direction.

(i) Willison Amplitude

This feature is defined as the number of times that the change in the signal amplitude exceeds a threshold.

```
will_param = [ka for ka in [0.2,3]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def willison_amplitude(X, param=will_param):
    return [("Thresh_{}".format(n),np.sum(np.abs(np.diff(X)) >= n)) for n in param]
extract.willison_amplitude(df["Close"])
```
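
Counting threshold crossings is easiest to follow on a handful of points. The sketch below repeats the calculation without the decorators on an invented five-point series, using the same two thresholds as `will_param`.

```python
import numpy as np


def willison_amplitude(x, thresholds=(0.2, 3)):
    # For each threshold, count the steps whose absolute change reaches it.
    steps = np.abs(np.diff(np.asarray(x, dtype=float)))
    return [("Thresh_{}".format(n), int(np.sum(steps >= n))) for n in thresholds]


toy = [1.0, 1.1, 3.0, 3.1, 0.0]     # absolute steps: about 0.1, 1.9, 0.1, 3.1
counts = willison_amplitude(toy)
print(counts)
```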

(ii) Percent Amplitude

Returns the largest distance from the median value, measured as a percentage of the median

```
perc_param = [{"base":ba, "exponent":exp} for ba in [3,5] for exp in [-0.1,-0.2]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def percent_amplitude(x, param =perc_param):
    final = []
    for par in param:
        linear_scale_data = par["base"] ** (par["exponent"] * x)
        y_max = np.max(linear_scale_data)
        y_min = np.min(linear_scale_data)
        y_med = np.median(linear_scale_data)
        final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med)))
    return [("Base_{}__Exp{}".format(pa["base"],pa["exponent"]),fin) for fin, pa in zip(final,param)]
extract.percent_amplitude(df["Close"])
```

(i) Cadence Probability

Given the observed distribution of time lags cads, compute the probability that the next observation occurs within time minutes of an arbitrary epoch.

```
#-> fixes required
import scipy.stats as stats
cad_param = [0.1,1000, -234]
@set_property("fctype", "combiner")
@set_property("custom", True)
def cad_prob(cads, param=cad_param):
    return [("time_{}".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for time in param]
extract.cad_prob(df["Close"])
```

Calculates the crossing of the series with other defined values or series.

(i) Zero Crossing Derivative

The positioning of the edge point is located at the zero crossing of the first derivative of the filter.

```
zero_param = [0.01, 8]
@set_property("fctype", "combiner")
@set_property("custom", True)
def zero_crossing_derivative(epochs, param=zero_param):
    diff = np.diff(epochs)
    norm = diff-diff.mean()
    return [("e_{}".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] <= e) & (epoch[5:] > e))), 0, norm).ravel()[0]) for e in param]
extract.zero_crossing_derivative(df["Close"])
```

These features are again from medical signal sciences, but under this category we would include values such as fluctuation based entropy measures, fluctuation of correlation dynamics, and co-fluctuations.

(i) Detrended Fluctuation Analysis (DFA)

Calculates the Hurst exponent using detrended fluctuation analysis (DFA).

```
from scipy.stats import kurtosis as _kurt
from scipy.stats import skew as _skew
import numpy as np
@set_property("fctype", "simple")
@set_property("custom", True)
def detrended_fluctuation_analysis(epochs):
    def dfa_1d(X, Ave=None, L=None):
        X = np.array(X)
        if Ave is None:
            Ave = np.mean(X)
        Y = np.cumsum(X)
        Y -= Ave
        if L is None:
            L = np.floor(len(X) * 1 / (
                2 ** np.array(list(range(1, int(np.log2(len(X))) - 4))))
            )
        F = np.zeros(len(L)) # F(n) of different given box length n
        for i in range(0, len(L)):
            n = int(L[i]) # for each box length L[i]
            if n == 0:
                print("time series is too short while the box length is too big")
                print("abort")
                exit()
            for j in range(0, len(X), n): # for each box
                if j + n < len(X):
                    c = list(range(j, j + n))
                    # coordinates of time in the box
                    c = np.vstack([c, np.ones(n)]).T
                    # the value of data in the box
                    y = Y[j:j + n]
                    # add residue in this box
                    F[i] += np.linalg.lstsq(c, y, rcond=None)[1]
            F[i] /= ((len(X) / n) * n)
        F = np.sqrt(F)
        stacked = np.vstack([np.log(L), np.ones(len(L))])
        stacked_t = stacked.T
        Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None)
        return Alpha[0][0]
    return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0]
extract.detrended_fluctuation_analysis(df["Close"])
```

Closely related to entropy and complexity measures. Any measure that attempts to measure the amount of information from an observable variable is included here.

(i) Fisher Information

Fisher information is a statistical information concept distinct from, and earlier than, Shannon information in communication theory.

```
def _embed_seq(X, Tau, D):
    shape = (X.size - Tau * (D - 1), D)
    strides = (X.itemsize, Tau * X.itemsize)
    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)
fisher_param = [{"Tau":ta, "DE":de} for ta in [3,15] for de in [10,5]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def fisher_information(epochs, param=fisher_param):
    def fisher_info_1d(a, tau, de):
        # taken from pyeeg improvements
        mat = _embed_seq(a, tau, de)
        W = np.linalg.svd(mat, compute_uv=False)
        W /= sum(W) # normalize singular values
        FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1]
        return np.sum(FI_v)
    return [("Tau_{}__DE_{}".format(par["Tau"], par["DE"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par["Tau"], par["DE"]).ravel()[0]) for par in param]
extract.fisher_information(df["Close"])
```

In mathematics, more specifically in fractal geometry, a fractal dimension is a ratio providing a statistical index of complexity comparing how detail in a pattern (strictly speaking, a fractal pattern) changes with the scale at which it is measured.

(i) Higuchi Fractal

Compute a Higuchi Fractal Dimension of a time series

```
hig_para = [{"Kmax": 3},{"Kmax": 5}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def higuchi_fractal_dimension(epochs, param=hig_para):
    def hfd_1d(X, Kmax):
        L = []
        x = []
        N = len(X)
        for k in range(1, Kmax):
            Lk = []
            for m in range(0, k):
                Lmk = 0
                for i in range(1, int(np.floor((N - m) / k))):
                    Lmk += abs(X[m + i * k] - X[m + i * k - k])
                Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k
                Lk.append(Lmk)
            L.append(np.log(np.mean(Lk)))
            x.append([np.log(float(1) / k), 1])
        (p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None)
        return p[0]
    return [("Kmax_{}".format(config["Kmax"]), np.apply_along_axis(hfd_1d, 0, epochs, config["Kmax"]).ravel()[0] ) for config in param]
extract.higuchi_fractal_dimension(df["Close"])
```

(ii) Petrosian Fractal

Compute a Petrosian Fractal Dimension of a time series.

```
@set_property("fctype", "simple")
@set_property("custom", True)
def petrosian_fractal_dimension(epochs):
    def pfd_1d(X, D=None):
        # taken from pyeeg
        """Compute Petrosian Fractal Dimension of a time series from either two
        cases below:
        1. X, the time series of type list (default)
        2. D, the first order differential sequence of X (if D is provided,
        recommended to speed up)
        In case 1, D is computed using Numpy's difference function.
        To speed up, it is recommended to compute D before calling this function
        because D may also be used by other functions whereas computing it here
        again will slow down.
        """
        if D is None:
            D = np.diff(X)
            D = D.tolist()
        N_delta = 0 # number of sign changes in derivative of the signal
        for i in range(1, len(D)):
            if D[i] * D[i - 1] < 0:
                N_delta += 1
        n = len(X)
        # Parenthesize the denominator: the ratio is n / (n + 0.4 * N_delta)
        return np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * N_delta)))
    return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0]
extract.petrosian_fractal_dimension(df["Close"])
```
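
As a rough check of the Petrosian measure, the sketch below assumes the usual form $\log_{10} n / (\log_{10} n + \log_{10}(n / (n + 0.4 N_\Delta)))$, where $N_\Delta$ counts sign changes in the first difference. Under that assumption a monotone series has no sign changes and scores exactly 1, while an oscillating one scores higher. `petrosian_fd` is a vectorized stand-in, and the inputs are toy data.

```python
import numpy as np


def petrosian_fd(x):
    # Count sign changes of the first difference, then apply the assumed formula.
    x = np.asarray(x, dtype=float)
    d = np.diff(x)
    n_delta = int(np.sum(d[1:] * d[:-1] < 0))
    n = len(x)
    return float(np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * n_delta))))


monotone = petrosian_fd([1, 2, 3, 4, 5, 6])   # no sign changes
zigzag = petrosian_fd([0, 1, 0, 1, 0, 1])     # four sign changes
print(monotone, zigzag)
```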

(i) Hurst Exponent

The Hurst exponent is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.

```
@set_property("fctype", "simple")
@set_property("custom", True)
def hurst_exponent(epochs):
    def hurst_1d(X):
        X = np.array(X)
        N = X.size
        T = np.arange(1, N + 1)
        Y = np.cumsum(X)
        Ave_T = Y / T
        S_T = np.zeros(N)
        R_T = np.zeros(N)
        for i in range(N):
            S_T[i] = np.std(X[:i + 1])
            X_T = Y - T * Ave_T[i]
            R_T[i] = np.ptp(X_T[:i + 1])
        for i in range(1, len(S_T)):
            if np.diff(S_T)[i - 1] != 0:
                break
        for j in range(1, len(R_T)):
            if np.diff(R_T)[j - 1] != 0:
                break
        k = max(i, j)
        assert k < 10, "rethink it!"
        R_S = R_T[k:] / S_T[k:]
        R_S = np.log(R_S)
        n = np.log(T)[k:]
        A = np.column_stack((n, np.ones(n.size)))
        [m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0]
        H = m
        return H
    return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0]
extract.hurst_exponent(df["Close"])
```

(ii) Largest Lyapunov Exponent

In mathematics the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories.

```
def _embed_seq(X, Tau, D):
    shape = (X.size - Tau * (D - 1), D)
    strides = (X.itemsize, Tau * X.itemsize)
    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)
lyaup_param = [{"Tau":4, "n":3, "T":10, "fs":9},{"Tau":8, "n":7, "T":15, "fs":6}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def largest_lyauponov_exponent(epochs, param=lyaup_param):
    def LLE_1d(x, tau, n, T, fs):
        Em = _embed_seq(x, tau, n)
        M = len(Em)
        A = np.tile(Em, (len(Em), 1, 1))
        B = np.transpose(A, [1, 0, 2])
        square_dists = (A - B) ** 2 # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2
        D = np.sqrt(square_dists[:, :, :].sum(axis=2)) # D[i,j] = ||Em[i]-Em[j]||_2
        # Exclude elements within T of the diagonal
        band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1)
        band[band == 1] = np.inf
        neighbors = (D + band).argmin(axis=0) # nearest neighbors more than T steps away
        # in_bounds[i,j] = (i+j <= M-1 and i+neighbors[j] <= M-1)
        inc = np.tile(np.arange(M), (M, 1))
        row_inds = (np.tile(np.arange(M), (M, 1)).T + inc)
        col_inds = (np.tile(neighbors, (M, 1)) + inc.T)
        in_bounds = np.logical_and(row_inds <= M - 1, col_inds <= M - 1)
        # Uncomment for old (miscounted) version
        # in_bounds = numpy.logical_and(row_inds < M - 1, col_inds < M - 1)
        row_inds[~in_bounds] = 0
        col_inds[~in_bounds] = 0
        # neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2
        neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds)
        J = (~neighbor_dists.mask).sum(axis=1) # number of in-bounds indices by row
        # Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged
        neighbor_dists[neighbor_dists == 0] = 1
        # !!! this fixes the divide by zero in log error !!!
        neighbor_dists.data[neighbor_dists.data == 0] = 1
        d_ij = np.sum(np.log(neighbor_dists.data), axis=1)
        mean_d = d_ij[J > 0] / J[J > 0]
        x = np.arange(len(mean_d))
        X = np.vstack((x, np.ones(len(mean_d)))).T
        [m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0]
        Lexp = fs * m
        return Lexp
    return [("Tau_{}__n_{}__T_{}__fs_{}".format(par["Tau"], par["n"], par["T"], par["fs"]), np.apply_along_axis(LLE_1d, 0, epochs, par["Tau"], par["n"], par["T"], par["fs"]).ravel()[0]) for par in param]
extract.largest_lyauponov_exponent(df["Close"])
```

Spectral analysis is analysis in terms of a spectrum of frequencies or related quantities such as energies, eigenvalues, etc.

(i) Welch Method

Welch's method is an approach for spectral density estimation. It is used in physics, engineering, and applied mathematics for estimating the power of a signal at different frequencies.

```
from scipy import signal, integrate
whelch_param = [100,200]
@set_property("fctype", "combiner")
@set_property("custom", True)
def whelch_method(data, param=whelch_param):
    final = []
    for Fs in param:
        f, pxx = signal.welch(data, fs=Fs, nperseg=1024)
        d = {'psd': pxx, 'freqs': f}
        df = pd.DataFrame(data=d)
        dfs = df.sort_values(['psd'], ascending=False)
        rows = dfs.iloc[:10]
        final.append(rows['freqs'].mean())
    return [("Fs_{}".format(pa),fin) for pa, fin in zip(param,final)]
extract.whelch_method(df["Close"])
```

```
#-> Basically same as above
freq_param = [{"fs":50, "sel":15},{"fs":200, "sel":20}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def find_freq(serie, param=freq_param):
    final = []
    for par in param:
        fft0 = np.fft.rfft(serie*np.hanning(len(serie)))
        freqs = np.fft.rfftfreq(len(serie), d=1.0/par["fs"])
        fftmod = np.array([np.sqrt(fft0[i].real**2 + fft0[i].imag**2) for i in range(0, len(fft0))])
        d = {'fft': fftmod, 'freq': freqs}
        df = pd.DataFrame(d)
        hop = df.sort_values(['fft'], ascending=False)
        rows = hop.iloc[:par["sel"]]
        final.append(rows['freq'].mean())
    return [("Fs_{}__sel{}".format(pa["fs"],pa["sel"]),fin) for pa, fin in zip(param,final)]
extract.find_freq(df["Close"])
```

(i) Flux Percentile

Flux (or radiant flux) is the total amount of energy that crosses a unit area per unit time. Flux is an astronomical value, measured in joules per square metre per second (joules/m2/s), or watts per square metre. Here we provide the ratio of flux percentiles.

```
#-> In Package
import math
def flux_perc(magnitude):
    sorted_data = np.sort(magnitude)
    lc_length = len(sorted_data)
    F_60_index = int(math.ceil(0.60 * lc_length))
    F_40_index = int(math.ceil(0.40 * lc_length))
    F_5_index = int(math.ceil(0.05 * lc_length))
    F_95_index = int(math.ceil(0.95 * lc_length))
    F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index]
    F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index]
    F_mid20 = F_40_60 / F_5_95
    return {"FluxPercentileRatioMid20": F_mid20}
extract.flux_perc(df["Close"])
```

(i) Range of Cumulative Sum

```
@set_property("fctype", "simple")
@set_property("custom", True)
def range_cum_s(magnitude):
    sigma = np.std(magnitude)
    N = len(magnitude)
    m = np.mean(magnitude)
    s = np.cumsum(magnitude - m) * 1.0 / (N * sigma)
    R = np.max(s) - np.min(s)
    return {"Rcs": R}
extract.range_cum_s(df["Close"])
```
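
The range of the normalized cumulative sum grows when deviations from the mean persist in one direction. The sketch below compares two toy series with identical values in a different order; it returns the bare number rather than the `{"Rcs": R}` dictionary used above.

```python
import numpy as np


def range_cum_s(x):
    # Range of the cumulative sum of mean-centred values, scaled by N * sigma.
    x = np.asarray(x, dtype=float)
    s = np.cumsum(x - x.mean()) / (len(x) * x.std())
    return float(s.max() - s.min())


alternating = range_cum_s([1, -1, 1, -1])    # deviations cancel immediately
persistent = range_cum_s([-1, -1, 1, 1])     # deviations accumulate first
print(alternating, persistent)
```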

Structural features, potential placeholders for future research.

(i) Structure Function

The structure function of rotation measures (RMs) contains information on electron density and magnetic field fluctuations when used in astronomy. It becomes a custom feature when used with your own unique time series data.

```
from scipy.interpolate import interp1d
struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
@set_property("fctype", "combiner")
@set_property("custom", True)
def structure_func(time, param=struct_param):
    dict_final = {}
    for key, magnitude in param.items():
        dict_final[key] = []
        Nsf, Np = 100, 100
        sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf)
        f = interp1d(time, magnitude)
        time_int = np.linspace(np.min(time), np.max(time), Np)
        mag_int = f(time_int)
        for tau in np.arange(1, Nsf):
            sf1[tau - 1] = np.mean(
                np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0))
            sf2[tau - 1] = np.mean(
                np.abs(np.power(
                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0)))
            sf3[tau - 1] = np.mean(
                np.abs(np.power(
                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0)))
        sf1_log = np.log10(np.trim_zeros(sf1))
        sf2_log = np.log10(np.trim_zeros(sf2))
        sf3_log = np.log10(np.trim_zeros(sf3))
        if len(sf1_log) and len(sf2_log):
            m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1)
        else:
            m_21 = np.nan
        if len(sf1_log) and len(sf3_log):
            m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1)
        else:
            m_31 = np.nan
        if len(sf2_log) and len(sf3_log):
            m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1)
        else:
            m_32 = np.nan
        dict_final[key].append(m_21)
        dict_final[key].append(m_31)
        dict_final[key].append(m_32)
    return [("StructureFunction_{}__m_{}".format(key, name), li) for key, lis in dict_final.items() for name, li in zip([21,31,32], lis)]
struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
extract.structure_func(df["Close"],struct_param)
```
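The `m_21` style outputs are slopes between structure functions of different orders on a log-log scale. The sketch below illustrates that idea on a synthetic random walk using NumPy only. It skips the interpolation step, and `structure_functions` is a hypothetical helper, not the package function.

```python
import numpy as np

def structure_functions(values, max_lag, order):
    """Mean absolute increment raised to `order`, for lags 1..max_lag."""
    values = np.asarray(values, dtype=float)
    return np.array([
        np.mean(np.abs(values[lag:] - values[:-lag]) ** order)
        for lag in range(1, max_lag + 1)
    ])

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=20_000))  # Gaussian random walk

sf1 = structure_functions(walk, 50, 1)
sf2 = structure_functions(walk, 50, 2)

# Slope of log(SF2) against log(SF1), the analogue of m_21 above.
slope_21, _ = np.polyfit(np.log10(sf1), np.log10(sf2), 1)
print(slope_21)
```

For a Gaussian random walk the second-order function grows as the square of the first-order one, so the fitted slope should land close to 2. Departures from that value on real data hint at a different scaling structure.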

(i) Kurtosis

```
#-> In Package
def kurtosis(x):
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return pd.Series.kurtosis(x)
extract.kurtosis(df["Close"])
```
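Note that `pd.Series.kurtosis` returns the bias-corrected excess kurtosis (a normal distribution scores 0), which differs from the plain population formula on small samples. A small sketch on toy data, chosen only for illustration:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# pandas: bias-corrected (sample) excess kurtosis.
pandas_kurt = x.kurtosis()

# Population excess kurtosis: fourth central moment over squared variance, minus 3.
centred = x.to_numpy() - x.mean()
population_kurt = np.mean(centred**4) / np.mean(centred**2) ** 2 - 3.0

print(pandas_kurt, population_kurt)  # approximately -1.2 and -1.3
```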

(ii) Stetson Kurtosis

```
@set_property("fctype", "simple")
@set_property("custom", True)
def stetson_k(x):
    """A robust kurtosis statistic."""
    n = len(x)
    x0 = stetson_mean(x, 1./20**2)
    delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20
    ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2))
    return ta
extract.stetson_k(df["Close"])
```
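The division by 0.798 normalises the statistic so that Gaussian data scores about 1, since the mean absolute deviation of a normal variable is roughly 0.798 of its standard deviation. The sketch below checks this on synthetic data. As a simplifying assumption it centres on the plain mean instead of `stetson_mean`, and it drops the constant error scale, which cancels in the ratio anyway; `stetson_k_simple` is a hypothetical name.

```python
import numpy as np

def stetson_k_simple(x):
    """Stetson K centred on the plain mean (an assumption of this sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    delta = np.sqrt(n / (n - 1.0)) * (x - x.mean())
    return np.mean(np.abs(delta)) / np.sqrt(np.mean(delta**2)) / 0.798

rng = np.random.default_rng(0)
k_gauss = stetson_k_simple(rng.normal(size=100_000))
k_heavy = stetson_k_simple(rng.standard_t(df=3, size=100_000))
print(k_gauss, k_heavy)
```

Heavy-tailed data should score below the Gaussian value, because large outliers inflate the root mean square more than the mean absolute deviation.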

Time-series synthesisation (TSS) happens before the feature extraction step, and cross-sectional synthesisation (CSS) happens after it. For now I only include a CSS package; I plan to develop this section further in the future. This area still has many performance and stability issues, but it may become a more viable way to improve prediction.

```
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
def model(df_final):
    model = LGBMRegressor()
    # Hold out the first 40% of rows as the test set; train on the rest.
    test = df_final.head(int(len(df_final)*0.4))
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"])
    preds = model.predict(test.drop(["Close_1"],axis=1))
    # Mean squared error on the held-out rows.
    val = mean_squared_error(test["Close_1"],preds)
    return val
```

```
pip install ctgan
```

```
from ctgan import CTGANSynthesizer
#discrete_columns = [""]
ctgan = CTGANSynthesizer()
ctgan.fit(df,epochs=10) #15
```

Random Benchmark

```
np.random.seed(1)
df_in = df.copy()
df_in["Close_1"] = np.random.permutation(df_in["Close_1"].values)
model(df_in)
```

Generated Performance

```
df_gen = ctgan.sample(len(df_in)*100)
model(df_gen)
```

As expected, a cross-sectional technique does not work well on time-series data. Other methods will be investigated in the future.

Here I will apply tabular augmentation methods to a small dataset with a single-digit number of features and around 250 instances. This is not necessarily the best-sized dataset for highlighting the performance of tabular augmentation, as some methods, such as extraction, would be overkill and lead to dimensionality problems. It is also worth knowing that there is a near-infinite number of ways to perform these augmentation methods. In the future, automated augmentation methods can guide the experiment process.

The approach taken in this skeleton is to develop running models that are tested after each augmentation to highlight what methods might work well on this particular dataset. The metric we will use is mean squared error. In this implementation we do not have special hold-out sets.

The above framework of implementation will be consulted, but one still has to be strategic about when to apply each function, and you have to make sure that you process your data with appropriate techniques (drop null values, fill null values) at the appropriate time.

Develop Model and Define Metric

```
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
def model(df_final):
    model = LGBMRegressor()
    # Hold out the first 40% of rows as the test set; train on the rest.
    test = df_final.head(int(len(df_final)*0.4))
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"])
    preds = model.predict(test.drop(["Close_1"],axis=1))
    # Mean squared error on the held-out rows.
    val = mean_squared_error(test["Close_1"],preds)
    return val
```
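The split inside `model` is easy to misread, so here is a minimal pandas sketch of what the `head` / `isin` / `dropna` combination does on a toy frame. The frame `df_demo` is made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "feature": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Close_1": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# The first 40% of rows become the test set.
test = df_demo.head(int(len(df_demo) * 0.4))

# Rows that appear in `test` are masked to NaN and then dropped,
# which leaves the remaining 60% of rows for training.
train = df_demo[~df_demo.isin(test)].dropna()

print(list(test.index))   # [0, 1]
print(list(train.index))  # [2, 3, 4]
```

Two things follow from this: the model is tested on the earliest rows and trained on later ones, and the `dropna` call also silently removes any training row that already contained a missing value.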

Reload Data

```
df = data_copy()
```

```
model(df)
```

```
302.61676570345287
```

**(1) (7) (i) Transformation - Decomposition - Naive**

```
## If Inferred Seasonality is Too Large Default to Five
seasons = transform.infer_seasonality(df["Close"],index=0)
df_out = transform.naive_dec(df.copy(), ["Close","Open"], freq=5)
model(df_out) #improvement
```

```
274.34477082783525
```
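A naive decomposition splits a series into trend, seasonal and residual parts. The sketch below is a pandas-only illustration of the classical additive version, not `transform.naive_dec` itself. The function `naive_decompose`, its column names and the synthetic series are hypothetical, and it assumes an odd `freq`.

```python
import numpy as np
import pandas as pd

def naive_decompose(series, freq=5):
    """Classical additive decomposition; assumes an odd freq."""
    # Trend: centred moving average over one full cycle.
    trend = series.rolling(window=freq, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the cycle.
    position = np.arange(len(series)) % freq
    seasonal = detrended.groupby(position).transform("mean")
    resid = series - trend - seasonal
    return pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid})

pattern = np.array([2.0, -1.0, 0.0, -2.0, 1.0])  # one cycle, sums to zero
t = np.arange(50, dtype=float)
series = pd.Series(t + np.tile(pattern, 10))     # linear trend plus cycle

parts = naive_decompose(series, freq=5)
print(parts.iloc[2:7])
```

On this synthetic input the seasonal column recovers the repeating pattern and the residual is zero wherever the moving average is defined. The first and last two rows are NaN, which is one reason null handling matters after this step.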

**(1) (8) (i) Transformation - Filter - Baxter-King-Bandpass**

```
df_out = transform.bkb(df_out, ["Close","Low"])
df_best = df_out.copy()
model(df_out) #improvement
```

```
267.1826850968307
```

**(1) (3) (i) Transformation - Differentiation - Fractional**

```
df_out = transform.fast_fracdiff(df_out, ["Close_BPF"],0.5)
model(df_out) #null
```

```
267.7083192402742
```
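Fractional differencing applies the weights from the binomial expansion of (1 - B)^d, where B is the lag operator and d need not be an integer. The sketch below uses a direct convolution with an expanding window. It is an illustration of the idea and not the implementation behind `transform.fast_fracdiff`; both function names are hypothetical.

```python
import numpy as np

def frac_diff_weights(d, size):
    """First `size` weights of the expansion of (1 - B)^d."""
    w = np.ones(size)
    for k in range(1, size):
        w[k] = -w[k - 1] * (d - k + 1) / k
    return w

def frac_diff(x, d):
    """Fractionally difference x using all available past values."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    # Entry t of the full convolution is sum_k w[k] * x[t - k].
    return np.convolve(x, w)[: len(x)]

weights_half = frac_diff_weights(0.5, 4)   # 1, -0.5, -0.125, -0.0625
first_diff = frac_diff([1.0, 3.0, 6.0, 10.0], 1.0)
print(weights_half, first_diff)
```

With `d=1` the weights collapse to `1, -1, 0, ...`, which is the ordinary first difference. With `d=0.5` the weights decay slowly, so the output keeps more memory of the original level.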

**(1) (1) (i) Transformation - Scaling - Robust Scaler**

```
df_out = df_out.dropna()
df_out = transform.robust_scaler(df_out, drop=["Close_1"])
model(df_out) #noisy
```

```
270.96980399571214
```
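Robust scaling centres each feature on its median and divides by its interquartile range, so a single extreme value does not dominate the scale. A pandas-only sketch follows, assuming the target column is left untouched as in the call above. `robust_scale` and the toy frame are hypothetical and are not the package function.

```python
import pandas as pd

def robust_scale(df, drop):
    """Scale every column except those in `drop` by median and IQR."""
    features = df.drop(columns=drop)
    iqr = features.quantile(0.75) - features.quantile(0.25)
    scaled = (features - features.median()) / iqr
    # Re-attach the untouched columns (for example the target).
    return scaled.join(df[drop])

df_demo = pd.DataFrame(
    {
        "Close": [1.0, 2.0, 3.0, 4.0, 100.0],
        "Close_1": [2.0, 3.0, 4.0, 100.0, 101.0],
    }
)
scaled = robust_scale(df_demo, drop=["Close_1"])
print(scaled)
```

The outlier `100.0` stays an outlier after scaling, but the other four values keep a sensible spread between -1 and 0.5, which would not be the case with mean and standard deviation scaling.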

**(2) (2) (i) Interactions - Operator - Multiplication/Division**

```
df_out.head()
```

| Date | Close_1 | High | Low | Open | Close | Volume | Adj Close | Close_NDDT | Close_NDDS | Close_NDDR | Open_NDDT | Open_NDDS | Open_NDDR | Close_BPF | Low_BPF | Close_BPF_frac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-08 | 338.529999 | 1.018413 | 0.964048 | 1.096600 | 1.001175 | -0.162616 | 1.001175 | 0.832297 | 0.834964 | 1.335433 | 0.758743 | 0.691596 | 2.259884 | -2.534142 | -2.249135 | -3.593612 |
| 2019-01-09 | 344.970001 | 1.012068 | 1.023302 | 1.011466 | 1.042689 | -0.501798 | 1.042689 | 0.908963 | -0.165036 | 1.111346 | 0.835786 | 0.333361 | 1.129783 | -3.081959 | -2.776302 | -2.523465 |
| 2019-01-10 | 347.260010 | 1.035581 | 1.027563 | 0.996969 | 1.126762 | -0.367576 | 1.126762 | 1.029347 | 2.120026 | 0.853697 | 0.907588 | 0.000000 | 0.533777 | -2.052768 | -2.543449 | -0.747382 |
| 2019-01-11 | 334.399994 | 1.073153 | 1.120506 | 1.098313 | 1.156658 | -0.586571 | 1.156658 | 1.109144 | -5.156051 | 0.591990 | 1.002162 | -0.666639 | 0.608516 | -0.694642 | -0.831670 | 0.414063 |
| 2019-01-14 | 344.429993 | 0.999627 | 1.056991 | 1.102135 | 0.988773 | -0.541752 | 0.988773 | 1.107633 | 0.000000 | -0.660350 | 1.056302 | -0.915491 | 0.263025 | -0.645590 | -0.116166 | -0.118012 |

```
df_out = interact.muldiv(df_out, ["Close","Open_NDDS","Low_BPF"])
model(df_out) #noisy
```

```
285.6420643864313
```
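Operator interactions build new columns from pairwise products and ratios of existing ones. The sketch below shows one way to do that in pandas. The function `mul_div_interactions` and the `_X_` / `_DIV_` column naming are assumptions for illustration and may differ from what `interact.muldiv` produces.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def mul_div_interactions(df, cols):
    """Add the product and ratio of every pair of columns in `cols`."""
    out = df.copy()
    for a, b in combinations(cols, 2):
        out[f"{a}_X_{b}"] = out[a] * out[b]
        # Division by zero yields inf in pandas; store it as NaN instead.
        ratio = out[a] / out[b]
        out[f"{a}_DIV_{b}"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out

df_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 0.0], "c": [4.0, 5.0]})
df_inter = mul_div_interactions(df_demo, ["a", "b", "c"])
print(df_inter.columns.tolist())
```

Three input columns give three pairs and therefore six new columns, so the feature count grows quadratically with the number of inputs. On a dataset of around 250 rows that growth is a plausible source of the noisier score seen above.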

```
df_r = df_out.copy()
```

**(2) (6) (i) Interactions - Speciality - Technical**

```
import ta
df = interact.tech(df)
df_out = pd.merge(df_out, df.iloc[:,7:], left_index=True, right_index=True, how="left")
```

**Clean Dataframe and Metric**

```
"""Dropping columns where missing values are above a threshold"""
df_out = df_out.dropna(thresh = len(df_out)*0.95, axis = "columns")
df_out = df_out.dropna()
df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
close = df_out["Close"].copy()
df_d = df_out.copy()
model(df_out) #improve
```

```
592.52971755184
```

**(3) (1) (i) Mapping - Eigen Decomposition - PCA**

```
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
df_out = transform.robust_scaler(df_out, drop=["Close_1"])
```

```
df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
df_out = mapper.pca_feature(df_out, drop_cols=["Close_1"], variance_or_components=0.9, n_components=8,non_linear=False)
```

```
model(df_out) #noisy but not too bad given the 10 fold dimensionality reduction
```

```
687.158330455884
```
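Choosing components by explained variance means keeping the smallest number of principal components whose cumulative variance share reaches a target such as 0.9. The NumPy sketch below illustrates that selection rule through an SVD. It is not the code behind `mapper.pca_feature`, and `pca_by_variance` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_by_variance(X, variance=0.9):
    """Project X onto the fewest components reaching the variance target."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)               # variance share per component
    # First index where the cumulative share reaches the target, plus one.
    k = int(np.searchsorted(np.cumsum(explained), variance) + 1)
    return Xc @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([
    z,                                        # signal
    2.0 * z + 0.01 * rng.normal(size=200),    # almost a copy of the signal
    0.1 * rng.normal(size=200),               # low-variance noise
])

scores, explained = pca_by_variance(X, variance=0.9)
print(scores.shape, explained)
```

Because two of the three columns are nearly collinear, a single component carries well over 90% of the variance here. Highly correlated price columns such as open, high, low and close behave similarly, which is why a large reduction in dimensionality can cost relatively little.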

**(4) Extracting**

Here I first show the functions that have been added to the DeltaPy fork of tsfresh. You have to make your own adjustments based on the features you would like to construct. I am using self-developed features, but you can also use tsfresh's community functions.

*The following files have been appropriately amended (get in contact for advice):*

- https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/settings.py
- https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/feature_calculators.py
- https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/extraction.py

**(4) (10) (i) Extracting - Averages - GSkew**

```
extract.gskew(df_out["PCA_1"])
```

```
-0.7903067336449059
```

**(4) (21) (ii) Extracting - Entropy - SVD Entropy**

```
svd_param = [{"Tau": ta, "DE": de}
for ta in [4]
for de in [3,6]]
extract.svd_entropy(df_out["PCA_1"],svd_param)
```

```
[('Tau_"4"__De_3"', 0.7234823323374294),
('Tau_"4"__De_6"', 1.3014347840145244)]
```

**(4) (13) (ii) Extracting - Streaks - Wozniak**

```
woz_param = [{"consecutiveStar": n} for n in [2, 4]]
extract.wozniak(df_out["PCA_1"],woz_param)
```

```
[('consecutiveStar_2', 0.012658227848101266), ('consecutiveStar_4', 0.0)]
```

**(4) (28) (i) Extracting - Fractal - Higuchi**

```
hig_param = [{"Kmax": 3},{"Kmax": 5}]
extract.higuchi_fractal_dimension(df_out["PCA_1"],hig_param)
```

```
[('Kmax_3', 0.577913816027104), ('Kmax_5', 0.8176960510304725)]
```

**(4) (5) (ii) Extracting - Volatility - Variability Index**

```
var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
extract.var_index(df["Close"].values,var_index_param)
```

```
{'Interact__Open': 0.00396022538846289,
'Interact__Volume': 0.20550155114176533}
```

**Time Series Extraction**

```
pip install git+https://github.com/firmai/tsfresh.git
```

```
#Construct the preferred input dataframe.
from tsfresh.utilities.dataframe_functions import roll_time_series
df_out["ID"] = 0
periods = 30
df_out = df_out.reset_index()
df_ts = roll_time_series(df_out,"ID","Date",None,1,periods)
counts = df_ts['ID'].value_counts()
df_ts = df_ts[df_ts['ID'].isin(counts[counts > periods].index)]
```

```
#Perform extraction
from tsfresh.feature_extraction import extract_features, CustomFCParameters
settings_dict = CustomFCParameters()
settings_dict["var_index"] = {"PCA_1":None, "PCA_2": None}
df_feat = extract_features(df_ts.drop(["Close_1"],axis=1),default_fc_parameters=settings_dict,column_id="ID",column_sort="Date")
```

```
Feature Extraction: 100%|██████████| 5/5 [00:10<00:00, 2.14s/it]
```

```
# Cleaning operations
import pandasvault as pv
df_feat2 = df_feat.copy()
df_feat = df_feat.dropna(thresh = len(df_feat)*0.50, axis = "columns")
df_feat_cons = pv.constant_feature_detect(data=df_feat,threshold=0.9)
df_feat = df_feat.drop(df_feat_cons, axis=1)
df_feat = df_feat.ffill()
df_feat = pd.merge(df_feat,df[["Close_1"]],left_index=True,right_index=True,how="left")
print(df_feat.shape)
model(df_feat) #noisy
```

```
7 variables are found to be almost constant
(208, 48)
2064.7813982935995
```

```
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(df_feat)
df_feat_2 = select_features(df_feat.drop(["Close_1"],axis=1),df_feat["Close_1"],fdr_level=0.05)
df_feat_2["Close_1"] = df_feat["Close_1"]
model(df_feat_2) #improvement (b/ not an augmentation method)
```

```
1577.5273071299482
```

**(3) (6) (i) Feature Agglomeration; (1)(2)(i) Standard Scaler.**

As in this step, after (1), (2), (3), (4) and (5) you can often circle back to the initial steps to normalise the data and reduce its dimensionality for the final model.

```
import numpy as np
from sklearn import datasets, cluster
def feature_agg(df, drop, components):
    components = min(df.shape[1]-1,components)
    agglo = cluster.FeatureAgglomeration(n_clusters=components,)
    df = df.drop(drop,axis=1)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df))
    df = df.add_prefix('fe_agg_')
    return df
df_final = transform.standard_scaler(df_feat_2, drop=["Close_1"])
df_final = mapper.feature_agg(df_final,["Close_1"],4)
df_final.index = df_feat.index
df_final["Close_1"] = df_feat["Close_1"]
model(df_final) #noisy
```

```
1949.89085894338
```

**Final Model** After Applying 13 Arbitrary Augmentation Techniques

```
model(df_final) #improvement
```

```
1949.89085894338
```

**Original Model** Before Augmentation

```
df_org = df.iloc[:,:7][df.index.isin(df_final.index)]
model(df_org)
```

```
389.783990984133
```

**Best Model** After Developing 8 Augmenting Features

```
df_best = df_best.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
model(df_best)
```

```
267.1826850968307
```

**Commentary**

There are countless ways in which the current model can be improved. This could become an automated process in which all techniques are tested against a hold-out set. For example, we can perform the operation below, and even though it improves the score here, more robust tests are needed. The skeleton example above is not meant to highlight the performance of the package. It simply serves as an example of how one can go about applying augmentation methods.

Quite naturally, this example suffers from dimensionality issues, with array shapes reaching `(208, 48)`. Furthermore, you would need a sample that is at least 50-100 times larger before machine learning methods start to make sense.

Nonetheless, in this example, *Transformation, Interactions* and *Mappings* (applied to extraction output) performed fairly well. *Extraction* augmentation was overkill, but created a reasonable model when dimensionally reduced. A better selection of one of the 50+ augmentation methods and the order of augmentation could further help improve the outcome if robustly tested against development sets.

[1] DeltaPy Development

Author: firmai

Source Code: https://github.com/firmai/deltapy

#engineering

1605176646

In this video, I will be talking about how to ask for help as a developer.

#problem solving skills #problem solving how to #coding interviews #problem solving #how to ask a good question
