Find peaks in an array based on "Improved peak detection" [1]
[1] Du, Pan, Warren A. Kibbe, and Simon M. Lin. "Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching." Bioinformatics 22.17 (2006): 2059-2065.
If you use NPM, run npm install d3-peaks. Otherwise, download the latest release.
# d3_peaks.findPeaks([signal])
If signal is specified, returns an array of points that represent the peaks in the signal. Otherwise, returns a function to find peaks. An example of the returned array is:
[{
index: 10,
width: 2,
snr: 1.5
}]
Where index represents the index of the peak in the original signal, width is the width of the peak, and snr is the signal to noise ratio.
# widths([w])
If specified, [w] is an array of expected peak widths that the algorithm should find. Otherwise, returns the current values.
var findPeaks = d3_peaks.findPeaks().widths([1, 2, 10]);
# kernel(kernel)
If specified, changes the kernel function or "smoother". Otherwise, returns the current value.
var ricker = d3_peaks.ricker;
var findPeaks = d3_peaks.findPeaks().kernel(ricker);
# gapThreshold(gap)
If specified, gap represents the maximum allowed number of gaps in the ridgeline. The higher this number, the more connected peaks we will find. Otherwise, returns the current value.
var findPeaks = d3_peaks.findPeaks().gapThreshold(3);
# minLineLength(length)
If specified, length represents the minimum ridgeline length. The higher this number, the more constrained the lines are and the fewer peaks we will find. Otherwise, returns the current value.
var findPeaks = d3_peaks.findPeaks().minLineLength(2);
# minSNR(snr)
If specified, snr represents the minimum signal to noise ratio the ridge lines should have. Otherwise, returns the current value. By default the minimum snr is 1.0 for peaks of width 1. This number should be higher for bigger widths.
var findPeaks = d3_peaks.findPeaks().minSNR(1.5);
# d3_peaks.convolve([signal])
If specified, convolves the signal array with the smoother. Otherwise, returns a function to convolve a signal with the smoother.
# kernel(kernel)
If specified, changes the kernel function or "smoother". Otherwise, returns the current kernel.
var convolve = d3_peaks.convolve()
.kernel(ricker);
var signal = convolve([1,2,3,2.5,0,1,4,5,3,-1,-2]);
# d3_peaks.ricker(x)
If x is specified, returns φ(x). Otherwise, returns a function to compute the ricker wavelet with default standard deviation 1.0.
# std(value)
If specified, it sets the standard deviation of the curve to value. Otherwise, returns the "width" or standard deviation of the wavelet.
# reach()
Returns the range value reach such that φ(reach) ~ 0.
var y = d3_peaks.ricker()
.std(2);
var output = y(3.5);
var reach = y.reach();
For examples, please see:
Author: Efekarakus
Source Code: https://github.com/efekarakus/d3-peaks
License: BSD-3-Clause license
ActiveInteraction manages application-specific business logic. It's an implementation of service objects designed to blend seamlessly into Rails.
ActiveInteraction gives you a place to put your business logic. It also helps you write safer code by validating that your inputs conform to your expectations. If ActiveModel deals with your nouns, then ActiveInteraction handles your verbs.
Add it to your Gemfile:
gem 'active_interaction', '~> 5.1'
Or install it manually:
$ gem install active_interaction --version '~> 5.1'
This project uses Semantic Versioning. Check out GitHub releases for a detailed list of changes.
To define an interaction, create a subclass of ActiveInteraction::Base. Then you need to do two things:
Define your inputs. Use class filter methods to define what you expect your inputs to look like. For instance, if you need a boolean flag for pepperoni, use boolean :pepperoni. Check out the filters section for all the available options.
Define your business logic. Do this by implementing the #execute method. Each input you defined will be available as the type you specified. If any of the inputs are invalid, #execute won't be run. Filters are responsible for checking your inputs. Check out the validations section if you need more than that.
That covers the basics. Let's put it all together into a simple example that squares a number.
require 'active_interaction'
class Square < ActiveInteraction::Base
float :x
def execute
x**2
end
end
Call .run on your interaction to execute it. You must pass a single hash to .run. It will return an instance of your interaction. By convention, we call this an outcome. You can use the #valid? method to ask the outcome if it's valid. If it's invalid, take a look at its errors with #errors. In either case, the value returned from #execute will be stored in #result.
outcome = Square.run(x: 'two point one')
outcome.valid?
# => nil
outcome.errors.messages
# => {:x=>["is not a valid float"]}
outcome = Square.run(x: 2.1)
outcome.valid?
# => true
outcome.result
# => 4.41
You can also use .run! to execute interactions. It's like .run but more dangerous. It doesn't return an outcome. If the outcome would be invalid, it will instead raise an error. But if the outcome would be valid, it simply returns the result.
Square.run!(x: 'two point one')
# ActiveInteraction::InvalidInteractionError: X is not a valid float
Square.run!(x: 2.1)
# => 4.41
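To make the difference between .run and .run! concrete, here is a minimal plain-Ruby sketch of the same contract. MiniSquare and its InvalidError are hypothetical names invented for this illustration; this is not how ActiveInteraction is implemented, only a stand-in that behaves the way the text above describes.

```ruby
# Hypothetical, heavily simplified stand-in for the run / run! contract.
class MiniSquare
  InvalidError = Class.new(StandardError)

  attr_reader :errors, :result

  # Always returns an outcome, valid or not.
  def self.run(inputs)
    new(inputs).tap(&:call)
  end

  # Returns the bare result, or raises if the outcome would be invalid.
  def self.run!(inputs)
    outcome = run(inputs)
    raise InvalidError, outcome.errors.join(', ') unless outcome.valid?

    outcome.result
  end

  def initialize(inputs)
    @x = inputs[:x]
    @errors = []
  end

  def valid?
    errors.empty?
  end

  def call
    errors << 'X is not a valid float' unless @x.is_a?(Float)
    @result = @x**2 if valid? # the business logic only runs for valid inputs
  end
end

outcome = MiniSquare.run(x: 'two point one')
outcome.valid?          # false, and outcome.errors explains why
MiniSquare.run!(x: 3.0) # => 9.0
```

The outcome object is convenient when you want to branch on validity; the bang variant suits call sites where invalid input would be a programming error.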
ActiveInteraction checks your inputs. Often you'll want more than that. For instance, you may want an input to be a string with at least one non-whitespace character. Instead of writing your own validation for that, you can use validations from ActiveModel.
These validations aren't provided by ActiveInteraction. They're from ActiveModel. You can also use any custom validations you wrote yourself in your interactions.
class SayHello < ActiveInteraction::Base
string :name
validates :name,
presence: true
def execute
"Hello, #{name}!"
end
end
When you run this interaction, two things will happen. First ActiveInteraction will check your inputs. Then ActiveModel will validate them. If both of those are happy, it will be executed.
SayHello.run!(name: nil)
# ActiveInteraction::InvalidInteractionError: Name is required
SayHello.run!(name: '')
# ActiveInteraction::InvalidInteractionError: Name can't be blank
SayHello.run!(name: 'Taylor')
# => "Hello, Taylor!"
You can define filters inside an interaction using the appropriate class method. Each method has the same signature:
Some symbolic names. These are the attributes to create.
An optional hash of options. Each filter supports at least these two options:
default is the fallback value to use if nil is given. To make a filter optional, set default: nil.
desc is a human-readable description of the input. This can be useful for generating documentation. For more information about this, read the descriptions section.
An optional block of sub-filters. Only array and hash filters support this. Other filters will ignore blocks when given to them.
Let's take a look at an example filter. It defines three inputs: x, y, and z. Those inputs are optional and they all share the same description ("an example filter").
array :x, :y, :z,
default: nil,
desc: 'an example filter' do
# Some filters support sub-filters here.
end
In general, filters accept values of the type they correspond to, plus a few alternatives that can be reasonably coerced. Typically the coercions come from Rails, so "1" can be interpreted as the boolean value true, the string "1", or the number 1.
In addition to accepting arrays, array inputs will convert ActiveRecord::Relations into arrays.
class ArrayInteraction < ActiveInteraction::Base
array :toppings
def execute
toppings.size
end
end
ArrayInteraction.run!(toppings: 'everything')
# ActiveInteraction::InvalidInteractionError: Toppings is not a valid array
ArrayInteraction.run!(toppings: [:cheese, 'pepperoni'])
# => 2
Use a block to constrain the types of elements an array can contain. Note that you can only have one filter inside an array block, and it must not have a name.
array :birthdays do
date
end
For interface, object, and record filters, the name of the array filter will be singularized and used to determine the type of value passed. In the example below, the objects passed would need to be of type Cow.
array :cows do
object
end
You can override this by passing the necessary information to the inner filter.
array :managers do
object class: People
end
Errors that occur will be indexed based on the Rails configuration setting index_nested_attribute_errors. You can also manually override this setting with the :index_errors option. In this state it is possible to get multiple errors from a single filter.
class ArrayInteraction < ActiveInteraction::Base
array :favorite_numbers, index_errors: true do
integer
end
def execute
favorite_numbers
end
end
ArrayInteraction.run(favorite_numbers: [8, 'bazillion']).errors.details
=> {:"favorite_numbers[1]"=>[{:error=>:invalid_type, :type=>"array"}]}
With :index_errors set to false the error would have been:
{:favorite_numbers=>[{:error=>:invalid_type, :type=>"array"}]}
Boolean filters convert the strings "1", "true", and "on" (case-insensitive) into true. They also convert "0", "false", and "off" into false. Blank strings will be treated as nil.
class BooleanInteraction < ActiveInteraction::Base
boolean :kool_aid
def execute
'Oh yeah!' if kool_aid
end
end
BooleanInteraction.run!(kool_aid: 1)
# ActiveInteraction::InvalidInteractionError: Kool aid is not a valid boolean
BooleanInteraction.run!(kool_aid: true)
# => "Oh yeah!"
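The string coercion rules for boolean filters can be sketched in plain Ruby. The coerce_boolean helper and its two constants below are hypothetical and are not the gem's code; the sketch also assumes the "false" strings are matched case-insensitively, which the text only states explicitly for the "true" strings.

```ruby
# Hypothetical helper, not part of ActiveInteraction: mirrors the string
# coercion rules for boolean filters in plain Ruby.
TRUE_STRINGS = %w[1 true on].freeze
FALSE_STRINGS = %w[0 false off].freeze

def coerce_boolean(value)
  return value if value == true || value == false
  return nil if value.nil?
  raise ArgumentError, 'is not a valid boolean' unless value.is_a?(String)

  normalized = value.strip.downcase
  return nil if normalized.empty? # blank strings are treated as nil
  return true if TRUE_STRINGS.include?(normalized)
  return false if FALSE_STRINGS.include?(normalized)

  raise ArgumentError, 'is not a valid boolean'
end

coerce_boolean('ON')  # => true
coerce_boolean('off') # => false
coerce_boolean('  ')  # => nil
```

As in the example above, the integer 1 is rejected here: only the listed strings are coerced.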
File filters accept Files. They also accept Tempfiles and anything that responds to #rewind. That means that you can pass the params from uploading files via forms in Rails.
class FileInteraction < ActiveInteraction::Base
file :readme
def execute
readme.size
end
end
FileInteraction.run!(readme: 'README.md')
# ActiveInteraction::InvalidInteractionError: Readme is not a valid file
FileInteraction.run!(readme: File.open('README.md'))
# => 21563
Hash filters accept hashes. The expected value types are given by passing a block and nesting other filters. You can have any number of filters inside a hash, including other hashes.
class HashInteraction < ActiveInteraction::Base
hash :preferences do
boolean :newsletter
boolean :sweepstakes
end
def execute
puts 'Thanks for joining the newsletter!' if preferences[:newsletter]
puts 'Good luck in the sweepstakes!' if preferences[:sweepstakes]
end
end
HashInteraction.run!(preferences: 'yes, no')
# ActiveInteraction::InvalidInteractionError: Preferences is not a valid hash
HashInteraction.run!(preferences: { newsletter: true, 'sweepstakes' => false })
# Thanks for joining the newsletter!
# => nil
Setting default hash values can be tricky. The default value has to be either nil or {}. Use nil to make the hash optional. Use {} if you want to set some defaults for values inside the hash.
hash :optional,
default: nil
# => {:optional=>nil}
hash :with_defaults,
default: {} do
boolean :likes_cookies,
default: true
end
# => {:with_defaults=>{:likes_cookies=>true}}
By default, hashes remove any keys that aren't given as nested filters. To allow all hash keys, set strip: false. In general we don't recommend doing this, but it's sometimes necessary.
hash :stuff,
strip: false
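The effect of stripping can be pictured with plain Ruby hashes. The strip_unknown_keys helper below is hypothetical and only approximates the behavior described above; it assumes string keys are treated like their symbol equivalents, as the earlier 'sweepstakes' example suggests.

```ruby
# Hypothetical helper: keep only the keys that have a nested filter.
def strip_unknown_keys(input, allowed)
  input.transform_keys(&:to_sym).slice(*allowed)
end

raw = { newsletter: true, 'sweepstakes' => false, admin: true }
strip_unknown_keys(raw, %i[newsletter sweepstakes])
# => { newsletter: true, sweepstakes: false }
```

With strip: false, the extra admin key would be passed through instead of dropped.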
String filters define inputs that only accept strings.
class StringInteraction < ActiveInteraction::Base
string :name
def execute
"Hello, #{name}!"
end
end
StringInteraction.run!(name: 0xDEADBEEF)
# ActiveInteraction::InvalidInteractionError: Name is not a valid string
StringInteraction.run!(name: 'Taylor')
# => "Hello, Taylor!"
String filters strip leading and trailing whitespace by default. To disable this, set the strip option to false.
string :comment,
strip: false
Symbol filters define inputs that accept symbols. Strings will be converted into symbols.
class SymbolInteraction < ActiveInteraction::Base
symbol :method
def execute
method.to_proc
end
end
SymbolInteraction.run!(method: -> {})
# ActiveInteraction::InvalidInteractionError: Method is not a valid symbol
SymbolInteraction.run!(method: :object_id)
# => #<Proc:0x007fdc9ba94118>
Filters that work with dates and times behave similarly. By default, they all convert strings into their expected data types using .parse. Blank strings will be treated as nil. If you give the format option, they will instead convert strings using .strptime. Note that formats won't work with DateTime and Time filters if a time zone is set.
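The difference between .parse and .strptime is easiest to see with the standard library directly. The parse_date helper below is a hypothetical sketch of the behavior described above (blank strings become nil, a format switches to .strptime); it is not the gem's implementation.

```ruby
require 'date'

# Hypothetical helper approximating how a date filter treats strings.
def parse_date(string, format: nil)
  return nil if string.strip.empty? # blank strings are treated as nil

  # With a format the string must match it exactly; without one,
  # Date.parse guesses the layout heuristically.
  format ? Date.strptime(string, format) : Date.parse(string)
end

parse_date('2015-03-11')                     # parsed heuristically
parse_date('11/03/2015', format: '%d/%m/%Y') # parsed strictly as day/month/year
parse_date('')                               # => nil
```

A string that does not match the given format makes Date.strptime raise, which is the kind of failure a filter would report as an invalid date.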
Date
class DateInteraction < ActiveInteraction::Base
date :birthday
def execute
birthday + (18 * 365)
end
end
DateInteraction.run!(birthday: 'yesterday')
# ActiveInteraction::InvalidInteractionError: Birthday is not a valid date
DateInteraction.run!(birthday: Date.new(1989, 9, 1))
# => #<Date: 2007-08-28 ((2454341j,0s,0n),+0s,2299161j)>
date :birthday,
format: '%Y-%m-%d'
DateTime
class DateTimeInteraction < ActiveInteraction::Base
date_time :now
def execute
now.iso8601
end
end
DateTimeInteraction.run!(now: 'now')
# ActiveInteraction::InvalidInteractionError: Now is not a valid date time
DateTimeInteraction.run!(now: DateTime.now)
# => "2015-03-11T11:04:40-05:00"
date_time :start,
format: '%Y-%m-%dT%H:%M:%S'
Time
In addition to converting strings with .parse (or .strptime), time filters convert numbers with .at.
class TimeInteraction < ActiveInteraction::Base
time :epoch
def execute
Time.now - epoch
end
end
TimeInteraction.run!(epoch: 'a long, long time ago')
# ActiveInteraction::InvalidInteractionError: Epoch is not a valid time
TimeInteraction.run!(epoch: Time.new(1970))
# => 1426068362.5136619
time :start,
format: '%Y-%m-%dT%H:%M:%S'
All numeric filters accept numeric input. They will also convert strings using the appropriate method from Kernel (like .Float). Blank strings will be treated as nil.
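Kernel's conversion methods are strict, which is why they suit input checking. The coerce_float helper below is a hypothetical illustration of the rules above, not the gem's code.

```ruby
# Hypothetical helper approximating how a float filter treats its input.
def coerce_float(value)
  return value.to_f if value.is_a?(Numeric)
  return nil if value.strip.empty? # blank strings are treated as nil

  Float(value) # raises ArgumentError unless the whole string is a number
end

coerce_float('2.1') # => 2.1
coerce_float(3)     # => 3.0
coerce_float(' ')   # => nil
```

Unlike String#to_f, which silently returns 0.0 for 'two point one', Kernel#Float raises, so bad input cannot slip through as zero.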
Decimal
class DecimalInteraction < ActiveInteraction::Base
decimal :price
def execute
price * 1.0825
end
end
DecimalInteraction.run!(price: 'one ninety-nine')
# ActiveInteraction::InvalidInteractionError: Price is not a valid decimal
DecimalInteraction.run!(price: BigDecimal(1.99, 2))
# => #<BigDecimal:7fe792a42028,'0.2165E1',18(45)>
To specify the number of significant digits, use the digits option.
decimal :dollars,
digits: 2
Float
class FloatInteraction < ActiveInteraction::Base
float :x
def execute
x**2
end
end
FloatInteraction.run!(x: 'two point one')
# ActiveInteraction::InvalidInteractionError: X is not a valid float
FloatInteraction.run!(x: 2.1)
# => 4.41
Integer
class IntegerInteraction < ActiveInteraction::Base
integer :limit
def execute
limit.downto(0).to_a
end
end
IntegerInteraction.run!(limit: 'ten')
# ActiveInteraction::InvalidInteractionError: Limit is not a valid integer
IntegerInteraction.run!(limit: 10)
# => [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
When a String is passed into an integer input, the value will be coerced. A default base of 10 is used though it may be overridden with the base option. If a base of 0 is provided, the coercion will respect radix indicators present in the string.
class IntegerInteraction < ActiveInteraction::Base
integer :limit1
integer :limit2, base: 8
integer :limit3, base: 0
def execute
[limit1, limit2, limit3]
end
end
IntegerInteraction.run!(limit1: 71, limit2: 71, limit3: 71)
# => [71, 71, 71]
IntegerInteraction.run!(limit1: "071", limit2: "071", limit3: "0x71")
# => [71, 57, 113]
IntegerInteraction.run!(limit1: "08", limit2: "08", limit3: "08")
# ActiveInteraction::InvalidInteractionError: Limit2 is not a valid integer, Limit3 is not a valid integer
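The results above line up with how Ruby's own Kernel#Integer treats a base argument. Assuming the integer filter delegates to that method (the text says numeric filters use the appropriate Kernel method), the same values can be reproduced with the standard library alone:

```ruby
Integer('071', 10) # => 71   (base 10: the leading zero is just a digit)
Integer('071', 8)  # => 57   (base 8: 7 * 8 + 1)
Integer('0x71', 0) # => 113  (base 0: the 0x prefix selects hexadecimal)
Integer('08', 10)  # => 8

# '08' is not valid octal, so both of these raise ArgumentError:
[8, 0].each do |base|
  begin
    Integer('08', base)
  rescue ArgumentError => e
    puts "base #{base}: #{e.message}"
  end
end
```

With base 0, a leading zero is itself a radix indicator for octal, which is why "08" fails for limit3 even though no base was forced.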
Interface filters allow you to specify an interface that the passed value must meet in order to pass. The name of the interface is used to look for a constant inside the ancestor listing for the passed value. This allows for a variety of checks depending on what's passed. Class instances are checked for an included module or an inherited ancestor class. Classes are checked for an extended module or an inherited ancestor class. Modules are checked for an extended module.
class InterfaceInteraction < ActiveInteraction::Base
interface :exception
def execute
exception
end
end
InterfaceInteraction.run!(exception: Exception)
# ActiveInteraction::InvalidInteractionError: Exception is not a valid interface
InterfaceInteraction.run!(exception: NameError) # a subclass of Exception
# => NameError
You can use :from to specify a class or module. This would be the equivalent of what's above.
class InterfaceInteraction < ActiveInteraction::Base
interface :error,
from: Exception
def execute
error
end
end
You can also create an anonymous interface on the fly by passing the methods option.
class InterfaceInteraction < ActiveInteraction::Base
interface :serializer,
methods: %i[dump load]
def execute
input = '{ "is_json" : true }'
object = serializer.load(input)
output = serializer.dump(object)
output
end
end
require 'json'
InterfaceInteraction.run!(serializer: Object.new)
# ActiveInteraction::InvalidInteractionError: Serializer is not a valid interface
InterfaceInteraction.run!(serializer: JSON)
# => "{\"is_json\":true}"
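An anonymous interface amounts to duck typing: the value passes if it responds to every listed method. The implements? helper below is a hypothetical plain-Ruby approximation of that check, not the gem's own code.

```ruby
require 'json'

# Hypothetical helper: true when the object responds to every method name.
def implements?(object, method_names)
  method_names.all? { |name| object.respond_to?(name) }
end

implements?(JSON, %i[dump load])       # => true
implements?(Object.new, %i[dump load]) # => false
```

Because the check is by method name only, any serializer exposing dump and load would be accepted, whatever its class.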
Object filters allow you to require an instance of a particular class or one of its subclasses.
class Cow
def moo
'Moo!'
end
end
class ObjectInteraction < ActiveInteraction::Base
object :cow
def execute
cow.moo
end
end
ObjectInteraction.run!(cow: Object.new)
# ActiveInteraction::InvalidInteractionError: Cow is not a valid object
ObjectInteraction.run!(cow: Cow.new)
# => "Moo!"
The class name is automatically determined by the filter name. If your filter name is different than your class name, use the class option. It can be either the class, a string, or a symbol.
object :dolly1,
class: Sheep
object :dolly2,
class: 'Sheep'
object :dolly3,
class: :Sheep
If you have value objects or you would like to build one object from another, you can use the converter option. It is only called if the value provided is not an instance of the class or one of its subclasses. The converter option accepts either a proc or a symbol that names a class method on the object class. Both will be passed the value, and any errors raised inside the converter will cause the value to be considered invalid. Any returned value that is not the correct class will also be treated as invalid. Any default that is not an instance of the class or subclass and is not nil will also be converted.
class ObjectInteraction < ActiveInteraction::Base
object :ip_address,
class: IPAddr,
converter: :new
def execute
ip_address
end
end
ObjectInteraction.run!(ip_address: '192.168.1.1')
# #<IPAddr: IPv4:192.168.1.1/255.255.255.255>
ObjectInteraction.run!(ip_address: 1)
# ActiveInteraction::InvalidInteractionError: Ip address is not a valid object
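The converter rules can be sketched in plain Ruby. The convert helper below is hypothetical and simplified (it returns nil for an invalid value, where the gem would record an error instead), but it follows the steps described above.

```ruby
require 'ipaddr'

# Hypothetical helper illustrating the converter rules.
def convert(value, klass, converter)
  return value if value.is_a?(klass) # already the right type: no conversion

  converted =
    case converter
    when Symbol then klass.public_send(converter, value) # class method name
    when Proc then converter.call(value)
    end
  converted.is_a?(klass) ? converted : nil # wrong return type is invalid
rescue StandardError
  nil # any error raised inside the converter marks the value as invalid
end

convert('192.168.1.1', IPAddr, :new)       # an IPAddr for 192.168.1.1
convert(1, IPAddr, :new)                   # => nil (IPAddr.new raises)
convert('42', Integer, ->(v) { Integer(v) }) # => 42
```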
Record filters allow you to require an instance of a particular class (or one of its subclasses) or a value that can be used to locate an instance of the object. If the value does not match, it will call find on the class of the record. This is particularly useful when working with ActiveRecord objects. Like an object filter, the class is derived from the name passed but can be specified with the class option. Any default that is not an instance of the class or subclass and is not nil will also be found. Blank strings passed in will be treated as nil.
class RecordInteraction < ActiveInteraction::Base
record :encoding
def execute
encoding
end
end
> RecordInteraction.run!(encoding: Encoding::US_ASCII)
=> #<Encoding:US-ASCII>
> RecordInteraction.run!(encoding: 'ascii')
=> #<Encoding:US-ASCII>
A different method can be specified by providing a symbol to the finder option.
ActiveInteraction plays nicely with Rails. You can use interactions to handle your business logic instead of models or controllers. To see how it all works, let's take a look at a complete example of a controller with the typical resourceful actions.
We recommend putting your interactions in app/interactions. It's also very helpful to group them by model. That way you can look in app/interactions/accounts for all the ways you can interact with accounts.
- app/
  - controllers/
    - accounts_controller.rb
  - interactions/
    - accounts/
      - create_account.rb
      - destroy_account.rb
      - find_account.rb
      - list_accounts.rb
      - update_account.rb
  - models/
    - account.rb
  - views/
    - account/
      - edit.html.erb
      - index.html.erb
      - new.html.erb
      - show.html.erb
# GET /accounts
def index
@accounts = ListAccounts.run!
end
Since we're not passing any inputs to ListAccounts, it makes sense to use .run! instead of .run. If it failed, that would mean we probably messed up writing the interaction.
class ListAccounts < ActiveInteraction::Base
def execute
Account.not_deleted.order(last_name: :asc, first_name: :asc)
end
end
Up next is the show action. For this one we'll define a helper method to handle raising the correct errors. We have to do this because calling .run! would raise an ActiveInteraction::InvalidInteractionError instead of an ActiveRecord::RecordNotFound. That means Rails would render a 500 instead of a 404.
# GET /accounts/:id
def show
@account = find_account!
end
private
def find_account!
outcome = FindAccount.run(params)
if outcome.valid?
outcome.result
else
fail ActiveRecord::RecordNotFound, outcome.errors.full_messages.to_sentence
end
end
This probably looks a little different than you're used to. Rails commonly handles this with a before_action (called before_filter in older Rails versions) that sets the @account instance variable. Why is all this interaction code better? Two reasons: One, you can reuse the FindAccount interaction in other places, like your API controller or a Resque task. And two, if you want to change how accounts are found, you only have to change one place.
Inside the interaction, we could use #find instead of #find_by_id. That way we wouldn't need the #find_account! helper method in the controller because the error would bubble all the way up. However, you should try to avoid raising errors from interactions. If you do, you'll have to deal with raised exceptions as well as the validity of the outcome.
class FindAccount < ActiveInteraction::Base
integer :id
def execute
account = Account.not_deleted.find_by_id(id)
if account
account
else
errors.add(:id, 'does not exist')
end
end
end
Note that it's perfectly fine to add errors during execution. Not all errors have to come from checking or validation.
The new action will be a little different than the ones we've looked at so far. Instead of calling .run or .run!, it's going to initialize a new interaction. This is possible because interactions behave like ActiveModels.
# GET /accounts/new
def new
@account = CreateAccount.new
end
Since interactions behave like ActiveModels, we can use ActiveModel validations with them. We'll use validations here to make sure that the first and last names are not blank. The validations section goes into more detail about this.
class CreateAccount < ActiveInteraction::Base
string :first_name, :last_name
validates :first_name, :last_name,
presence: true
def to_model
Account.new
end
def execute
account = Account.new(inputs)
unless account.save
errors.merge!(account.errors)
end
account
end
end
We used a couple of advanced features here. The #to_model method helps determine the correct form to use in the view. Check out the section on forms for more about that. Inside #execute, we merge errors. This is a convenient way to move errors from one object to another. Read more about it in the errors section.
The create action has a lot in common with the new action. Both of them use the CreateAccount interaction. And if creating the account fails, this action falls back to rendering the new action.
# POST /accounts
def create
outcome = CreateAccount.run(params.fetch(:account, {}))
if outcome.valid?
redirect_to(outcome.result)
else
@account = outcome
render(:new)
end
end
Note that we have to pass a hash to .run. Passing nil is an error.
Since we're using an interaction, we don't need strong parameters. The interaction will ignore any inputs that weren't defined by filters. So you can forget about params.require and params.permit because interactions handle that for you.
The destroy action will reuse the #find_account! helper method we wrote earlier.
# DELETE /accounts/:id
def destroy
DestroyAccount.run!(account: find_account!)
redirect_to(accounts_url)
end
In this simple example, the destroy interaction doesn't do much. It's not clear that you gain anything by putting it in an interaction. But in the future, when you need to do more than account.destroy, you'll only have to update one spot.
class DestroyAccount < ActiveInteraction::Base
object :account
def execute
account.destroy
end
end
Just like the destroy action, editing uses the #find_account! helper. Then it creates a new interaction instance to use as a form object.
# GET /accounts/:id/edit
def edit
account = find_account!
@account = UpdateAccount.new(
account: account,
first_name: account.first_name,
last_name: account.last_name)
end
The interaction that updates accounts is more complicated than the others. It requires an account to update, but the other inputs are optional. If they're missing, it'll ignore those attributes. If they're present, it'll update them.
class UpdateAccount < ActiveInteraction::Base
object :account
string :first_name, :last_name,
default: nil
validates :first_name,
presence: true,
unless: -> { first_name.nil? }
validates :last_name,
presence: true,
unless: -> { last_name.nil? }
def execute
account.first_name = first_name if first_name.present?
account.last_name = last_name if last_name.present?
unless account.save
errors.merge!(account.errors)
end
account
end
end
Hopefully you've gotten the hang of this by now. We'll use #find_account! to get the account. Then we'll build up the inputs for UpdateAccount. Then we'll run the interaction and either redirect to the updated account or back to the edit page.
# PUT /accounts/:id
def update
inputs = { account: find_account! }.reverse_merge(params[:account])
outcome = UpdateAccount.run(inputs)
if outcome.valid?
redirect_to(outcome.result)
else
@account = outcome
render(:edit)
end
end
ActiveSupport::Callbacks provides a powerful framework for defining callbacks. ActiveInteraction uses that framework to allow hooking into various parts of an interaction's lifecycle.
class Increment < ActiveInteraction::Base
set_callback :filter, :before, -> { puts 'before filter' }
integer :x
set_callback :validate, :after, -> { puts 'after validate' }
validates :x,
numericality: { greater_than_or_equal_to: 0 }
set_callback :execute, :around, lambda { |_interaction, block|
puts '>>>'
block.call
puts '<<<'
}
def execute
puts 'executing'
x + 1
end
end
Increment.run!(x: 1)
# before filter
# after validate
# >>>
# executing
# <<<
# => 2
In order, the available callbacks are filter, validate, and execute. You can set before, after, or around on any of them.
You can run interactions from within other interactions with #compose. If the interaction is successful, it'll return the result (just like if you had called it with .run!). If something went wrong, execution will halt immediately and the errors will be moved onto the caller.
class Add < ActiveInteraction::Base
integer :x, :y
def execute
x + y
end
end
class AddThree < ActiveInteraction::Base
integer :x
def execute
compose(Add, x: x, y: 3)
end
end
AddThree.run!(x: 5)
# => 8
To bring in filters from another interaction, use .import_filters. Combined with inputs, delegating to another interaction is a piece of cake.
class AddAndDouble < ActiveInteraction::Base
import_filters Add
def execute
compose(Add, inputs) * 2
end
end
Note that errors in composed interactions have a few tricky cases. See the errors section for more information about them.
The default value for an input can take on many different forms. Setting the default to nil makes the input optional. Setting it to some value makes that the default value for that input. Setting it to a lambda will lazily set the default value for that input. That means the value will be computed when the interaction is run, as opposed to when it is defined.
Lambda defaults are evaluated in the context of the interaction, so you can use the values of other inputs in them.
# This input is optional.
time :a, default: nil
# This input defaults to `Time.at(123)`.
time :b, default: Time.at(123)
# This input lazily defaults to `Time.now`.
time :c, default: -> { Time.now }
# This input defaults to the value of `c` plus 10 seconds.
time :d, default: -> { c + 10 }
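The difference between eager and lazy defaults comes down to when the expression is evaluated and against which object. The LazyDefaults class below is a hypothetical plain-Ruby sketch of that idea, not the gem's implementation; it uses instance_exec so that a lambda default can read other inputs.

```ruby
# Hypothetical sketch of eager versus lazy default values.
class LazyDefaults
  DEFAULTS = {
    b: Time.at(123),    # evaluated once, when the class body is loaded
    c: -> { Time.now }, # evaluated each time the default is needed
    d: -> { c + 10 }    # evaluated in the instance, so it can read c
  }.freeze

  def initialize(inputs = {})
    @inputs = inputs
  end

  DEFAULTS.each_key do |name|
    define_method(name) do
      # Use the provided input if present; otherwise resolve the default.
      @inputs.fetch(name) do
        default = DEFAULTS[name]
        default.is_a?(Proc) ? instance_exec(&default) : default
      end
    end
  end
end

LazyDefaults.new.b                # => Time.at(123)
LazyDefaults.new(c: Time.at(0)).d # => Time.at(10)
```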
Use the desc option to provide human-readable descriptions of filters. You should prefer these to comments because they can be used to generate documentation. The interaction class has a .filters method that returns a hash of filters. Each filter has a #desc method that returns the description.
class Descriptive < ActiveInteraction::Base
string :first_name,
desc: 'your first name'
string :last_name,
desc: 'your last name'
end
Descriptive.filters.each do |name, filter|
puts "#{name}: #{filter.desc}"
end
# first_name: your first name
# last_name: your last name
ActiveInteraction provides detailed errors for easier introspection and testing of errors. Detailed errors improve on regular errors by adding a symbol that represents the type of error that has occurred. Let's look at an example where an item is purchased using a credit card.
class BuyItem < ActiveInteraction::Base
object :credit_card, :item
hash :options do
boolean :gift_wrapped
end
def execute
order = credit_card.purchase(item)
notify(credit_card.account)
order
end
private def notify(account)
# ...
end
end
Having missing or invalid inputs causes the interaction to fail and return errors.
outcome = BuyItem.run(item: 'Thing', options: { gift_wrapped: 'yes' })
outcome.errors.messages
# => {:credit_card=>["is required"], :item=>["is not a valid object"], :"options.gift_wrapped"=>["is not a valid boolean"]}
Determining the type of error based on the string is difficult if not impossible. Calling #details instead of #messages on errors gives you the same list of errors with a testable label representing the error.
outcome.errors.details
# => {:credit_card=>[{:error=>:missing}], :item=>[{:error=>:invalid_type, :type=>"object"}], :"options.gift_wrapped"=>[{:error=>:invalid_type, :type=>"boolean"}]}
Detailed errors can also be manually added during the execute call by passing a symbol to #add instead of a string.
def execute
errors.add(:monster, :no_passage)
end
ActiveInteraction also supports merging errors. This is useful if you want to delegate validation to some other object. For example, if you have an interaction that updates a record, you might want that record to validate itself. By using the #merge! helper on errors, you can do exactly that.
class UpdateThing < ActiveInteraction::Base
object :thing
def execute
unless thing.save
errors.merge!(thing.errors)
end
thing
end
end
When a composed interaction fails, its errors are merged onto the caller. This generally produces good error messages, but there are a few cases to look out for.
class Inner < ActiveInteraction::Base
boolean :x, :y
end
class Outer < ActiveInteraction::Base
string :x
boolean :z, default: nil
def execute
compose(Inner, x: x, y: z)
end
end
outcome = Outer.run(x: 'yes')
outcome.errors.details
# => { :x => [{ :error => :invalid_type, :type => "boolean" }],
# :base => [{ :error => "Y is required" }] }
outcome.errors.full_messages.join(' and ')
# => "X is not a valid boolean and Y is required"
Since both interactions have an input called x, the inner error for that input is moved to the x error on the outer interaction. This results in a misleading error that claims the input x is not a valid boolean even though it's a string on the outer interaction.
Since only the inner interaction has an input called y, the inner error for that input is moved to the base error on the outer interaction. This results in a confusing error that claims the input y is required even though it's not present on the outer interaction.
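The two cases can be summarized as one rule: an inner error stays on its attribute when the outer interaction has an input of the same name, and otherwise becomes a full message on base. The merge_composed_errors helper below is a hypothetical plain-Ruby sketch of that rule, not the gem's code.

```ruby
# Hypothetical helper showing where a composed interaction's errors end up.
def merge_composed_errors(inner_errors, outer_inputs)
  merged = Hash.new { |hash, key| hash[key] = [] }
  inner_errors.each do |attribute, messages|
    messages.each do |message|
      if outer_inputs.include?(attribute)
        merged[attribute] << message # same name: kept on that input
      else
        merged[:base] << "#{attribute.to_s.capitalize} #{message}"
      end
    end
  end
  merged
end

inner = { x: ['is not a valid boolean'], y: ['is required'] }
merge_composed_errors(inner, %i[x z])
# => { x: ["is not a valid boolean"], base: ["Y is required"] }
```

Seen this way, giving inner and outer inputs distinct, meaningful names is the simplest way to avoid the misleading messages.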
The outcome returned by .run can be used in forms as though it were an ActiveModel object. You can also create a form object by calling .new on the interaction.
Given an application with an Account model, we'll create a new Account using the CreateAccount interaction.
# GET /accounts/new
def new
@account = CreateAccount.new
end
# POST /accounts
def create
outcome = CreateAccount.run(params.fetch(:account, {}))
if outcome.valid?
redirect_to(outcome.result)
else
@account = outcome
render(:new)
end
end
The form used to create a new Account has slightly more information on the form_for call than you might expect.
<%= form_for @account, as: :account, url: accounts_path do |f| %>
<%= f.text_field :first_name %>
<%= f.text_field :last_name %>
<%= f.submit 'Create' %>
<% end %>
This is necessary because we want the form to act like it is creating a new Account. Defining to_model on the CreateAccount interaction tells the form to treat our interaction like an Account.
class CreateAccount < ActiveInteraction::Base
# ...
def to_model
Account.new
end
end
Now our form_for call knows how to generate the correct URL and param name (i.e. params[:account]).
# app/views/accounts/new.html.erb
<%= form_for @account do |f| %>
<%# ... %>
<% end %>
If you have an interaction that updates an Account, you can define to_model to return the object you're updating.
class UpdateAccount < ActiveInteraction::Base
# ...
object :account
def to_model
account
end
end
ActiveInteraction also supports formtastic and simple_form. The filters used to define the inputs on your interaction will relay type information to these gems. As a result, form fields will automatically use the appropriate input type.
It can be convenient to apply the same options to a bunch of inputs. One common use case is making many inputs optional. Instead of setting default: nil on each one of them, you can use with_options to reduce duplication.
with_options default: nil do
date :birthday
string :name
boolean :wants_cake
end
Optional inputs can be defined by using the :default option as described in the filters section. Within the interaction, provided and default values are merged to create inputs. There are times when it is useful to know whether a value was passed to run or is the result of a filter default. In particular, it is useful when nil is an acceptable value. For example, you may optionally track your users' birthdays. You can use the inputs.given? predicate to see if an input was even passed to run. With inputs.given? you can also check the input of a hash or array filter by passing a series of keys or indexes to check.
class UpdateUser < ActiveInteraction::Base
object :user
date :birthday,
default: nil
def execute
user.birthday = birthday if inputs.given?(:birthday)
errors.merge!(user.errors) unless user.save
user
end
end
Now you have a few options. If you don't want to update their birthday, leave it out of the hash. If you want to remove their birthday, set birthday: nil. And if you want to update it, pass in the new value as usual.
user = User.find(...)
# Don't update their birthday.
UpdateUser.run!(user: user)
# Remove their birthday.
UpdateUser.run!(user: user, birthday: nil)
# Update their birthday.
UpdateUser.run!(user: user, birthday: Date.new(2000, 1, 2))
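The distinction that inputs.given? draws is between a key that is absent and a key that is present with a nil value. The plain-Ruby sketch below imitates that check on a raw hash, including nested keys and array indexes. It is a stand-in for illustration, not the library's code; the free-standing `given?` method is hypothetical.

```ruby
# Hypothetical stand-in for inputs.given?: reports whether a key (or a nested
# path of keys and indexes) is present in the raw inputs, even if its value is nil.
def given?(raw_inputs, *path)
  current = raw_inputs

  path.each do |key|
    case current
    when Hash
      return false unless current.key?(key)
    when Array
      return false unless key.is_a?(Integer) && key >= 0 && key < current.size
    else
      # Reached a scalar before the path was exhausted.
      return false
    end
    current = current[key]
  end

  true
end

given?({ birthday: nil }, :birthday)                # => true
given?({}, :birthday)                               # => false
given?({ address: { city: nil } }, :address, :city) # => true
given?({ tags: ["a"] }, :tags, 1)                   # => false
```

The first two calls are the birthday case from above: an explicit nil counts as given, so it can mean "remove the value", while a missing key means "leave it alone".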
ActiveInteraction is i18n aware out of the box! All you have to do is add translations to your project. In Rails, these typically go into config/locales. For example, let's say that for some reason you want to print everything out backwards. Simply add translations for ActiveInteraction to your hsilgne locale.
# config/locales/hsilgne.yml
hsilgne:
active_interaction:
types:
array: yarra
boolean: naeloob
date: etad
date_time: emit etad
decimal: lamiced
file: elif
float: taolf
hash: hsah
integer: regetni
interface: ecafretni
object: tcejbo
string: gnirts
symbol: lobmys
time: emit
errors:
messages:
invalid: dilavni si
invalid_type: '%{type} dilav a ton si'
missing: deriuqer si
Then set your locale and run interactions like normal.
class I18nInteraction < ActiveInteraction::Base
string :name
end
I18nInteraction.run(name: false).errors.messages[:name]
# => ["is not a valid string"]
I18n.locale = :hsilgne
I18nInteraction.run(name: false).errors.messages[:name]
# => ["gnirts dilav a ton si"]
Everything else works like an activerecord entry. For example, to rename an attribute you can use attributes.
Here we'll rename the num attribute on an interaction named product:
en:
active_interaction:
attributes:
product:
num: 'Number'
ActiveInteraction is brought to you by Aaron Lasseigne. Along with Aaron, Taylor Fausak helped create and maintain ActiveInteraction but has since moved on.
If you want to contribute to ActiveInteraction, please read our contribution guidelines. A complete list of contributors is available on GitHub.
ActiveInteraction is licensed under the MIT License.
Author: AaronLasseigne
Source code: https://github.com/AaronLasseigne/active_interaction
License: MIT license
In this article, we will learn what face recognition is and how it differs from face detection. We will briefly go over the theory of face recognition and then jump to the coding section. By the end of this article, you will be able to write a face recognition program that recognises faces in images as well as in a live webcam feed.
In computer vision, one essential problem we are trying to figure out is to automatically detect objects in an image without human intervention. Face detection can be thought of as such a problem where we detect human faces in an image. There may be slight differences in the faces of humans but overall, it is safe to say that there are certain features that are associated with all the human faces. There are various face detection algorithms but Viola-Jones Algorithm is one of the oldest methods that is also used today and we will use the same later in the article. You can go through the Viola-Jones Algorithm after completing this article as I’ll link it at the end of this article.
Face detection is usually the first step towards many face-related technologies, such as face recognition or verification. However, face detection can have very useful applications. The most successful application of face detection would probably be photo taking. When you take a photo of your friends, the face detection algorithm built into your digital camera detects where the faces are and adjusts the focus accordingly.
Now that we have algorithms that can detect faces, can we also recognise whose faces they are?
Face recognition is a method of identifying or verifying the identity of an individual using their face. There are various algorithms that can do face recognition but their accuracy might vary. Here I am going to describe how we do face recognition using deep learning.
So now let us understand how we recognise faces using deep learning. We make use of face embedding in which each face is converted into a vector and this technique is called deep metric learning. Let me further divide this process into three simple steps for easy understanding:
Face Detection: The very first task we perform is detecting faces in the image or video stream. Once we know the exact location/coordinates of the face, we extract this face for further processing.
Feature Extraction: Now that we have cropped the face out of the image, we extract features from it. Here we are going to use face embeddings to extract the features of the face. A neural network takes an image of the person’s face as input and outputs a vector which represents the most important features of the face. In machine learning, this vector is called an embedding, and thus we call this vector a face embedding. Now how does this help in recognizing faces of different persons?
While training the neural network, the network learns to output similar vectors for faces that look similar. For example, if I have multiple images of my face taken at different times, some of the features of my face might change, but not by much. So in this case the vectors associated with the faces are similar or, in short, they are very close in the vector space. Take a look at the diagram below for a rough idea:
Now after training the network, the network learns to output vectors that are closer to each other(similar) for faces of the same person(looking similar). The above vectors now transform into:
We are not going to train such a network here as it takes a significant amount of data and computation power to train such networks. We will use a pre-trained network trained by Davis King on a dataset of ~3 million images. The network outputs a vector of 128 numbers which represent the most important features of a face.
Now that we know how this network works, let us see how we use this network on our own data. We pass all the images in our data to this pre-trained network to get the respective embeddings and save these embeddings in a file for the next step.
Comparing faces: Now that we have face embeddings for every face in our data saved in a file, the next step is to recognise a new image that is not in our data. So the first step is to compute the face embedding for the image using the same network we used above and then compare this embedding with the rest of the embeddings we have. We recognise the face if the generated embedding is close or similar to any other embedding, as shown below:
So we passed two images, one of the images is of Vladimir Putin and other of George W. Bush. In our example above, we did not save the embeddings for Putin but we saved the embeddings of Bush. Thus when we compared the two new embeddings with the existing ones, the vector for Bush is closer to the other face embeddings of Bush whereas the face embeddings of Putin are not closer to any other embedding and thus the program cannot recognise him.
In the field of Artificial Intelligence, Computer Vision is one of the most interesting and Challenging tasks. Computer Vision acts like a bridge between Computer Software and visualizations around us. It allows computer software to understand and learn about the visualizations in the surroundings. For Example: Based on the color, shape and size determining the fruit. This task can be very easy for the human brain however in the Computer Vision pipeline, first we gather the data, then we perform the data processing activities and then we train and teach the model to understand how to distinguish between the fruits based on size, shape and color of fruit.
Currently, various packages are available to perform machine learning, deep learning and computer vision tasks. OpenCV is one of the most widely used libraries for such complex tasks. It is open source, offers bindings for several programming languages such as Python, C++ and Java, and runs on most platforms, including Windows, Linux and macOS.
Advantages of OpenCV:
Installation:
Here we will be focusing on installing OpenCV for python only. We can install OpenCV using pip or conda(for anaconda environment).
Using pip, the installation process of openCV can be done by using the following command in the command prompt.
pip install opencv-python
If you are using anaconda environment, either you can execute the above code in anaconda prompt or you can execute the following code in anaconda prompt.
conda install -c conda-forge opencv
In this section, we shall implement face recognition using OpenCV and Python. First, let us see the libraries we will need and how to install them:
OpenCV is an image and video processing library and is used for image and video analysis, like facial detection, license plate reading, photo editing, advanced robotic vision, optical character recognition, and a whole lot more.
The dlib library, maintained by Davis King, contains an implementation of “deep metric learning” which is used to construct the face embeddings used for the actual recognition process.
The face_recognition library, created by Adam Geitgey, wraps around dlib’s facial recognition functionality, and this library is super easy to work with and we will be using this in our code. Remember to install dlib library first before you install face_recognition.
To install OpenCV, type in command prompt
pip install opencv-python
I have tried various ways to install dlib on Windows but the easiest of all of them is via Anaconda. First, install Anaconda (here is a guide to install it) and then use this command in your command prompt:
conda install -c conda-forge dlib
Next to install face_recognition, type in command prompt
pip install face_recognition
Now that we have all the dependencies installed, let us start coding. We will have to create three files. The first will take our dataset, extract a face embedding for each face using dlib, and save these embeddings in a file.
The second file will compare faces with the saved embeddings to recognise faces in images, and the third will do the same for a live webcam feed.
First, you need to get a dataset or even create one of your own. Just make sure to arrange all images in folders, with each folder containing images of just one person.
Next, save the dataset in the same folder in which you are going to create the script. Now here is the code:
Now that we have stored the embedding in a file named “face_enc”, we can use them to recognise faces in images or live video stream.
Here is the script to recognise faces on a live webcam feed:
https://www.youtube.com/watch?v=fLnGdkZxRkg
Although in the example above we have used a Haar cascade to detect faces, you can also use face_recognition.face_locations to detect a face, as we did in the previous script.
The script for detecting and recognising faces in images is almost the same as what you saw above. Try it yourself, and if you get stuck, take a look at the code below:
This brings us to the end of this article where we learned about face recognition.
Original article source at: https://www.mygreatlearning.com
Warning: This gem is now in passive maintenance mode.
Making HTML emails comfortable for the Ruby rockstars
Roadie tries to make sending HTML emails a little less painful by inlining stylesheets and rewriting relative URLs for you inside your emails.
Email clients have bad support for stylesheets, and some of them block stylesheets from downloading. The easiest way to handle this is to work with inline styles (style="..."), but that is error prone and hard to work with as you cannot use classes and/or reuse styling over your HTML.
This gem makes this easier by automatically inlining stylesheets into the document. You give Roadie your CSS, or let it find it by itself from the <link> and <style> tags in the markup, and it will go through all of the selectors assigning the styles to the matching elements. Careful attention has been put into selectors being applied in the correct order, so it should behave just like in the browser.
"Dynamic" selectors (:hover, :visited, :focus, etc.), or selectors not understood by Nokogiri will be inlined into a single <style> element for those email clients that support it. This changes specificity a great deal for these rules, so it might not work 100% out of the box. (See more about this below)
Roadie also rewrites all relative URLs in the email to an absolute counterpart, making images you insert and those referenced in your stylesheets work. No more headaches about how to write the stylesheets while still having them work with emails from your acceptance environments. You can disable this on specific elements using a data-roadie-ignore marker.
Respects !important styles.
Does not overwrite styles already present in the style attribute of tags.
Keeps :hover, @media { ... } and friends around in a separate <style> element.
Makes link hrefs and img srcs absolute.
Removes data-roadie-ignore markers before finishing the HTML.
Add this gem to your Gemfile as recommended by Rubygems and run bundle install.
gem 'roadie', '~> 4.0'
Your document instance can be configured with several options:
url_options - Dictates how absolute URLs should be built.
keep_uninlinable_css - Set to false to skip CSS that cannot be inlined.
merge_media_queries - Set to false to not group media queries. Some users might prefer to not group rules within media queries because it will result in rules getting reordered. For example:
@media(max-width: 600px) { .col-6 { display: block; } }
@media(max-width: 400px) { .col-12 { display: inline-block; } }
@media(max-width: 600px) { .col-12 { display: block; } }
will become
@media(max-width: 600px) { .col-6 { display: block; } .col-12 { display: block; } }
@media(max-width: 400px) { .col-12 { display: inline-block; } }
asset_providers - A list of asset providers that are invoked when CSS files are referenced. See below.
external_asset_providers - A list of asset providers that are invoked when absolute CSS URLs are referenced. See below.
before_transformation - A callback run before transformation starts.
after_transformation - A callback run after transformation is completed.
In order to make URLs absolute you need to first configure the URL options of the document.
html = '... <a href="/about-us">Read more!</a> ...'
document = Roadie::Document.new html
document.url_options = {host: "myapp.com", protocol: "https"}
document.transform
# => "... <a href=\"https://myapp.com/about-us\">Read more!</a> ..."
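To see what "making a URL absolute" amounts to, the resolution step can be sketched with Ruby's standard URI library alone. This is an illustration of the idea rather than Roadie's own code; the helper name `absolute_url` and its keyword arguments are assumptions chosen to mirror the url_options keys above.

```ruby
require "uri"

# Hypothetical helper: resolve a path against a configured host and protocol.
# URLs that already carry a scheme and host are returned unchanged.
def absolute_url(path, host:, protocol: "http")
  URI.join("#{protocol}://#{host}", path).to_s
end

absolute_url("/about-us", host: "myapp.com", protocol: "https")
# => "https://myapp.com/about-us"
absolute_url("https://cdn.example.com/logo.png", host: "myapp.com", protocol: "https")
# => "https://cdn.example.com/logo.png"
```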
The following URLs will be rewritten for you:
a[href] (HTML)
img[src] (HTML)
url() (CSS)
You can disable individual elements by adding a data-roadie-ignore marker on them. CSS will still be inlined on those elements, but URLs will not be rewritten.
<a href="|UNSUBSCRIBE_URL|" data-roadie-ignore>Unsubscribe</a>
By default, style and link elements in the email document's head are processed along with the stylesheets and removed from the head.
You can set a special data-roadie-ignore attribute on style and link tags that you want to ignore (the attribute will be removed, however). This is the place to put things like :hover selectors that you want to have for email clients allowing them.
Style and link elements with media="print" are also ignored.
<head>
<link rel="stylesheet" type="text/css" href="/assets/emails/rock.css"> <!-- Will be inlined with normal providers -->
<link rel="stylesheet" type="text/css" href="http://www.metal.org/metal.css"> <!-- Will be inlined with external providers, *IF* specified; otherwise ignored. -->
<link rel="stylesheet" type="text/css" href="/assets/jazz.css" media="print"> <!-- Will NOT be inlined; print style -->
<link rel="stylesheet" type="text/css" href="/ambient.css" data-roadie-ignore> <!-- Will NOT be inlined; ignored -->
<style></style> <!-- Will be inlined -->
<style data-roadie-ignore></style> <!-- Will NOT be inlined; ignored -->
</head>
Roadie will use the given asset providers to look for the actual CSS that is referenced. If you don't change the default, it will use the Roadie::FilesystemProvider which looks for stylesheets on the filesystem, relative to the current working directory.
Example:
# /home/user/foo/stylesheets/primary.css
body { color: green; }
# /home/user/foo/script.rb
html = <<-HTML
<html>
<head>
<link rel="stylesheet" type="text/css" href="/stylesheets/primary.css">
</head>
<body>
</body>
</html>
HTML
Dir.pwd # => "/home/user/foo"
document = Roadie::Document.new html
document.transform # =>
# <!DOCTYPE html>
# <html>
# <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# <body style="color:green;"></body>
# </html>
If a referenced stylesheet cannot be found, the #transform method will raise a Roadie::CssNotFound error. If you instead want to ignore missing stylesheets, you can use the NullProvider.
You can write your own providers if you need very specific behavior for your app, or you can use the built-in providers. Providers come in two groups: normal and external. Normal providers handle paths without host information (/style/foo.css) while external providers handle URLs with host information (//example.com/foo.css, localhost:3001/bar.css, and so on).
The default configuration is to not have any external providers configured, which will cause those referenced stylesheets to be ignored. Adding one or more providers for external assets causes all of them to be searched and inlined, so if you only want this to happen to specific stylesheets you need to add ignore markers to every other stylesheet (see above).
Included providers:
FilesystemProvider – Looks for files on the filesystem, relative to the given directory unless otherwise specified.
ProviderList – Wraps a list of other providers and searches them in order. The asset_providers setting is an instance of this. It behaves a lot like an array, so you can push, pop, shift and unshift to it.
NullProvider – Does not actually provide anything, it always finds empty stylesheets. Use this in tests or if you want to ignore stylesheets that cannot be found by your other providers (or if you want to force the other providers to never run).
NetHttpProvider – Downloads stylesheets using Net::HTTP. Can be given a whitelist of hosts to download from.
CachedProvider – Wraps another provider (or ProviderList) and caches responses inside the provided cache store.
PathRewriterProvider – Rewrites the passed path and then passes it on to another provider (or ProviderList).
If you want to search several locations on the filesystem, you can declare that:
document.asset_providers = [
Roadie::FilesystemProvider.new(App.root.join("resources", "stylesheets")),
Roadie::FilesystemProvider.new(App.root.join("system", "uploads", "stylesheets")),
]
NullProvider
If you want to ignore stylesheets that cannot be found instead of crashing, push the NullProvider to the end:
# Don't crash on missing assets
document.asset_providers << Roadie::NullProvider.new
# Don't download assets in tests
document.external_asset_providers.unshift Roadie::NullProvider.new
Note: This will cause the referenced stylesheet to be removed from the source code, so email clients will never see it either.
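Why the position of the NullProvider matters (pushed to the end versus unshifted to the front) follows from the list being searched in order. The sketch below models that lookup with stand-in classes. `HashProvider`, `EmptyProvider` and `first_stylesheet` are hypothetical names for illustration and are not part of Roadie, and stylesheets are represented as plain strings.

```ruby
# Stand-in provider backed by a Hash of path => CSS (illustration only).
class HashProvider
  def initialize(files)
    @files = files
  end

  # Returns the CSS for the path, or nil when it is unknown.
  def find_stylesheet(name)
    @files[name]
  end
end

# Stand-in for a provider that always "finds" an empty stylesheet.
class EmptyProvider
  def find_stylesheet(_name)
    ""
  end
end

# Ask each provider in turn and stop at the first one that answers.
def first_stylesheet(providers, name)
  providers.each do |provider|
    css = provider.find_stylesheet(name)
    return css unless css.nil?
  end
  nil
end

files = HashProvider.new({ "/a.css" => "body { color: green; }" })

first_stylesheet([files, EmptyProvider.new], "/a.css")       # => "body { color: green; }"
first_stylesheet([files, EmptyProvider.new], "/missing.css") # => ""
first_stylesheet([EmptyProvider.new, files], "/a.css")       # => ""
```

With the empty provider last it only acts as a fallback; with it first, no other provider is ever consulted, which is the effect wanted in tests.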
NetHttpProvider
The NetHttpProvider will download the URLs that it is given using Ruby's standard Net::HTTP library.
You can give it a whitelist of hosts that downloads are allowed from:
document.external_asset_providers << Roadie::NetHttpProvider.new(
whitelist: ["myapp.com", "assets.myapp.com", "cdn.cdnnetwork.co.jp"],
)
document.external_asset_providers << Roadie::NetHttpProvider.new # Allows every host
CachedProvider
You might want to cache providers to keep them from doing the same work several times. If you are sending several emails quickly from the same process, this might also save a lot of time on parsing the stylesheets if you use in-memory storage such as a hash.
You can wrap any other kind of provider with it, even a ProviderList:
document.external_asset_providers = Roadie::CachedProvider.new(document.external_asset_providers, my_cache)
If you don't pass a cache backend, it will use a normal Hash. The cache store must follow this protocol:
my_cache["key"] = some_stylesheet_instance # => #<Roadie::Stylesheet instance>
my_cache["key"] # => #<Roadie::Stylesheet instance>
my_cache["missing"] # => nil
Warning: The default Hash store will never be cleared, so make sure you don't allow the number of unique asset paths to grow too large in a single run. This is especially important if you run Roadie in a daemon that accepts arbitrary documents, and/or if you use hash digests in your filenames. Making a new instance of CachedProvider will use a new Hash instance.
You can implement your own custom cache store by implementing the [] and []= methods.
class MyRoadieMemcacheStore
  def initialize(memcache)
    @memcache = memcache
  end
  # Returns the cached stylesheet for the path, or nil on a cache miss.
  def [](path)
    css = @memcache.read("assets/#{path}/css")
    if css
      name = @memcache.read("assets/#{path}/name") || "cached #{path}"
      Roadie::Stylesheet.new(name, css)
    end
  end
  # Stores the stylesheet's CSS and name under separate keys.
  def []=(path, stylesheet)
    @memcache.write("assets/#{path}/css", stylesheet.to_s)
    @memcache.write("assets/#{path}/name", stylesheet.name)
    stylesheet # You need to return the set Stylesheet
  end
end
document.external_asset_providers = Roadie::CachedProvider.new(
document.external_asset_providers,
MyRoadieMemcacheStore.new(MemcacheClient.instance)
)
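Because the default Hash store is never cleared, a long-running process may prefer a store with an upper bound. The following is a minimal sketch of one, assuming only the [] and []= protocol described above. The class name `BoundedCacheStore` and its least-recently-used eviction policy are illustrative choices, not something Roadie ships.

```ruby
# Hypothetical bounded cache store: evicts the least recently used entry once
# max_size is exceeded. It stores whatever value it is given, so it can be
# tried out without Roadie loaded.
class BoundedCacheStore
  def initialize(max_size)
    @max_size = max_size
    @entries = {}
  end

  def [](path)
    return nil unless @entries.key?(path)

    # Re-insert the entry so it becomes the most recently used one.
    value = @entries.delete(path)
    @entries[path] = value
  end

  def []=(path, stylesheet)
    @entries.delete(path)
    @entries[path] = stylesheet
    # Hash#shift drops the oldest insertion, i.e. the least recently used entry.
    @entries.shift while @entries.size > @max_size
    stylesheet
  end

  def size
    @entries.size
  end
end

store = BoundedCacheStore.new(2)
store["a.css"] = "A"
store["b.css"] = "B"
store["a.css"]        # touch a.css so b.css becomes the eviction candidate
store["c.css"] = "C"
store["b.css"]        # => nil
store.size            # => 2
```

Such a store could then be passed as the second argument to Roadie::CachedProvider.new in place of the default Hash.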
If you are using RSpec, you can test your implementation by using the shared examples for the "roadie cache store" role:
require "roadie/rspec"
describe MyRoadieMemcacheStore do
let(:memcache_client) { MemcacheClient.instance }
subject { MyRoadieMemcacheStore.new(memcache_client) }
it_behaves_like "roadie cache store" do
before { memcache_client.clear }
end
end
PathRewriterProvider
With this provider, you can rewrite the paths that are searched in order to more easily support another provider. Examples could include rewriting absolute URLs into something that can be found on the filesystem, or to access internal hosts instead of external ones.
filesystem = Roadie::FilesystemProvider.new("assets")
document.asset_providers << Roadie::PathRewriterProvider.new(filesystem) do |path|
path.sub('stylesheets', 'css').downcase
end
document.external_asset_providers = Roadie::PathRewriterProvider.new(filesystem) do |url|
if url =~ /myapp\.com/
URI.parse(url).path.sub(%r{^/assets}, '')
else
url
end
end
You can also wrap a list, for example to implement external_asset_providers by composing the normal asset_providers:
document.external_asset_providers =
Roadie::PathRewriterProvider.new(document.asset_providers) do |url|
URI.parse(url).path
end
Writing your own provider is also easy. You need to provide:
#find_stylesheet(name), returning either a Roadie::Stylesheet or nil.
#find_stylesheet!(name), returning either a Roadie::Stylesheet or raising Roadie::CssNotFound.
class UserAssetsProvider
def initialize(user_collection)
@user_collection = user_collection
end
def find_stylesheet(name)
if name =~ %r{^/users/(\d+)\.css$}
user = @user_collection.find_user($1)
Roadie::Stylesheet.new("user #{user.id} stylesheet", user.stylesheet)
end
end
def find_stylesheet!(name)
find_stylesheet(name) or
raise Roadie::CssNotFound.new(
css_name: name, message: "does not match a user stylesheet", provider: self
)
end
# Instead of implementing #find_stylesheet!, you could also:
# include Roadie::AssetProvider
# That will give you a default implementation without any error message. If
# you have multiple error cases, it's recommended that you implement
# #find_stylesheet! without #find_stylesheet and raise with an explanatory
# error message.
end
# Try to look for a user stylesheet first, then fall back to normal filesystem lookup.
document.asset_providers = [
UserAssetsProvider.new(app),
Roadie::FilesystemProvider.new('./stylesheets'),
]
You can test for compliance by using the built-in RSpec examples:
require 'spec_helper'
require 'roadie/rspec'
describe MyOwnProvider do
# Will use the default `subject` (MyOwnProvider.new)
it_behaves_like "roadie asset provider", valid_name: "found.css", invalid_name: "does_not_exist.css"
# Extra setup just for these tests:
it_behaves_like "roadie asset provider", valid_name: "found.css", invalid_name: "does_not_exist.css" do
subject { MyOwnProvider.new(...) }
before { stub_dependencies }
end
end
Some CSS is impossible to inline properly. :hover and ::after come to mind. Roadie tries its best to keep these around by injecting them inside a new <style> element in the <head> (or at the beginning of the partial if transforming a partial document).
The problem here is that Roadie cannot possibly adjust the specificity for you, so they will not apply the same way as they did before the styles were inlined.
Another caveat is that a lot of email clients do not support this (which is the entire point of inlining in the first place), so don't put anything important in here. Always handle the case of these selectors not being part of the email.
Inlined styles will have much higher specificity than styles in a <style>. Here's an example:
<style>p:hover { color: blue; }</style>
<p style="color: green;">Hello world</p>
When hovering over this <p>, the color will not change as the color: green rule takes precedence. You can get it to work by adding !important to the :hover rule.
It would be foolish to try to inject !important on every rule automatically, so this is a manual process.
If you'd rather skip this and have the styles that cannot be inlined disappear, you can turn off this feature by setting the keep_uninlinable_css option to false.
document.keep_uninlinable_css = false
Callbacks allow you to do custom work on documents before they are transformed. The Nokogiri document tree is passed to the callable along with the Roadie::Document instance:
class TrackNewsletterLinks
def call(dom, document)
dom.css("a").each { |link| fix_link(link) }
end
def fix_link(link)
divider = (link['href'].include?('?') ? '&' : '?')
link['href'] = link['href'] + divider + 'source=newsletter'
end
end
document.before_transformation = ->(dom, document) {
logger.debug "Inlining document with title #{dom.at_css('head > title')&.text}"
}
document.after_transformation = TrackNewsletterLinks.new
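String concatenation like the one in fix_link works for simple links but puts the parameter in the wrong place when the URL has a fragment. As a hedged alternative, the tracking parameter can be appended with Ruby's standard URI library. The helper name `add_query_param` is hypothetical and not part of Roadie.

```ruby
require "uri"

# Hypothetical helper: append a query parameter while keeping existing
# parameters and any #fragment in place.
def add_query_param(href, key, value)
  uri = URI.parse(href)
  params = URI.decode_www_form(uri.query || "")
  params << [key, value]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

add_query_param("https://example.com/news", "source", "newsletter")
# => "https://example.com/news?source=newsletter"
add_query_param("https://example.com/news?id=7", "source", "newsletter")
# => "https://example.com/news?id=7&source=newsletter"
add_query_param("https://example.com/news#top", "source", "newsletter")
# => "https://example.com/news?source=newsletter#top"
```

Note that URI.parse raises on placeholder values such as template tokens, so a callback built on this would need to skip or rescue those links.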
You can configure the underlying HTML/XML engine to output XHTML or HTML (which is the default). One use case for this is that { tokens usually get escaped to &#123;, which would be a problem if you then pass the resulting HTML on to some other templating engine that uses those tokens (like Handlebars or Mustache).
document.mode = :xhtml
This will also affect the emitted <!DOCTYPE> if transforming a full document. Partial documents do not have a <!DOCTYPE>.
Tested with Github CI using:
Let me know if you want any other runtime supported officially.
This project follows Semantic Versioning and has been since version 1.0.0.
Roadie uses Nokogiri to parse and regenerate the HTML of your email, which means that some unintentional changes might show up.
One example would be that Nokogiri might remove your &nbsp;s in some cases.
Another example is Nokogiri's lack of HTML5 support, so certain new elements might have spaces removed. I recommend you don't use HTML5 in emails anyway because of bad email client support (that includes web mail!).
Roadie uses Nokogiri to parse the HTML of your email, so any C-like problems like segfaults are likely on that end. The best way to fix this is to first upgrade libxml2 on your system and then reinstall Nokogiri. For instructions on how to do this on most platforms, see Nokogiri's official install guide.
What happened to my @keyframes?
The CSS Parser used in Roadie does not handle keyframes. I don't think any email clients do either, but if you want to keep on trying you can add them manually to a <style> element (or a separate referenced stylesheet) and tell Roadie not to touch them.
My @media queries are reordered, how can I fix this?
Different @media query blocks with the same conditions are merged by default, which will change the order in some cases. You can disable this by setting merge_media_queries to false. (See Install & Usage section above).
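To make the reordering concrete, here is a small plain-Ruby model of grouping rules by their media query. It is a sketch of the general idea and not Roadie's implementation; the pair-based input format and the method name `merge_media_queries` are assumptions for illustration.

```ruby
# Each entry pairs a media query with one rule, in source order.
rules = [
  ["(max-width: 600px)", ".col-6 { display: block; }"],
  ["(max-width: 400px)", ".col-12 { display: inline-block; }"],
  ["(max-width: 600px)", ".col-12 { display: block; }"],
]

# Group rules that share a query. Ruby hashes keep first-insertion order, so
# each query stays where it first appeared and later rules move up to join it.
def merge_media_queries(rules)
  rules.group_by(&:first).map do |query, entries|
    "@media#{query} { #{entries.map(&:last).join(' ')} }"
  end
end

merge_media_queries(rules)
# => ["@media(max-width: 600px) { .col-6 { display: block; } .col-12 { display: block; } }",
#     "@media(max-width: 400px) { .col-12 { display: inline-block; } }"]
```

Before merging, the 600px rule for .col-12 came last and so won on narrow screens; after merging, the 400px rule comes last and takes over. That change in cascade order is what disabling merge_media_queries avoids.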
How do I get rid of the <body> elements that are added?
It sounds like you want to transform a partial document. Maybe you are building partials or template fragments to later place in other documents. Use Document#transform_partial instead of Document#transform in order to treat the HTML as a partial document.
If you add the data-roadie-ignore attribute on an element, URL rewriting will not be performed on that element. This could be really useful for you if you intend to send the email through some other rendering pipeline that replaces some placeholders/variables.
<a href="/about-us">About us</a>
<a href="|UNSUBSCRIBE_URL|" data-roadie-ignore>Unsubscribe</a>
Note that this will not skip CSS inlining on the element; it will still get the correct styles applied.
If the URL is invalid on purpose, see Can I skip URL rewriting on a specific element? above. Otherwise, you can try to parse it yourself using Ruby's URI
class and see if you can figure it out.
require "uri"
URI.parse("https://example.com/best image.jpg") # raises
URI.parse("https://example.com/best%20image.jpg") # Works!
bundle install
rake
Roadie is set up with the assumption that all CSS and HTML passing through it is under your control. It is not recommended to run arbitrary HTML with the default settings.
Care has been given to try to secure all file system accesses, but it is never guaranteed that someone cannot access something they should not be able to access.
In order to secure Roadie against file system access, only use your own asset providers that you yourself can secure against your particular environment.
If you have found any security vulnerability, please email me at magnus.bergmark+security@gmail.com
to disclose it. For very sensitive issues, please use my public GPG key. You can also encrypt your message with my public key and open an issue if you do not want to email me directly. Thank you.
This gem was previously tied to Rails. It is now framework-agnostic and supports any type of HTML documents. If you want to use it with Rails, check out roadie-rails.
Major contributors to Roadie:
You can see all contributors on GitHub.
(The MIT License)
Copyright (c) 2009-2022 Magnus Bergmark, Jim Neath / Purify, and contributors.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Author: Mange
Source code: https://github.com/Mange/roadie
License: MIT license
Tabular augmentation is a new experimental space that makes use of novel and traditional data generation and synthesisation techniques to improve model prediction success. It is in essence a process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set. DeltaPy was created with finance applications in mind, but it can be broadly applied to any data-rich environment.
To take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: (1) transforming, (2) interacting, (3) mapping, (4) extracting, and (5) synthesising. What follows is a practical example of how the above methodology can be used. The purpose here is to establish a framework for table augmentation and to point and guide the user to existing packages.
For most users, the Colab Notebook format might be preferred. I have enabled comments if you want to ask questions or address any issues you uncover. For anything pressing, use the issues tab. Also have a look at the SSRN report for more succinct insights.
Data augmentation can be defined as any method that could increase the size or improve the quality of a dataset by generating new features or instances without the collection of additional data-points. Data augmentation is of particular importance in image classification tasks where additional data can be created by cropping, padding, or flipping existing images.
Tabular cross-sectional and time-series prediction tasks can also benefit from augmentation. Here we divide tabular augmentation into columnar and row-wise methods. Row-wise methods are further divided into extraction and data synthesisation techniques, whereas columnar methods are divided into transformation, interaction, and mapping methods.
See the Skeleton Example for a combination of multiple methods that leads to a halving of the mean squared error.
pip install deltapy
@software{deltapy,
title = {{DeltaPy}: Tabular Data Augmentation},
author = {Snow, Derek},
url = {https://github.com/firmai/deltapy/},
version = {0.1.0},
date = {2020-04-11},
}
Snow, Derek, DeltaPy: A Framework for Tabular Data Augmentation in Python (April 22, 2020). Available at SSRN: https://ssrn.com/abstract=3582219
Transformation
df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()
df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head()
df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head()
df_out = transform.windsorization(df.copy(),"Close",para,strategy='both'); df_out.head()
df_out = transform.operations(df.copy(),["Close"]); df_out.head()
df_out = transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0);
df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head()
df_out = transform.bkb(df.copy(), ["Close"]); df_out.head()
df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head()
df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head()
df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head()
df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head()
df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head()
df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head()
df_out = transform.modify(df.copy(),["Close"]); df_out.head()
df_out = transform.multiple_rolling(df, columns=["Close"]); df_out.head()
df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()
df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head()
Interaction
df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()
df_out = interact.autoregression(df.copy()); df_out.head()
df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head()
df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head()
df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head()
df_out = interact.tech(df.copy()); df_out.head()
df_out = interact.genetic_feat(df.copy()); df_out.head()
Mapping
df_out = mapper.pca_feature(df.copy(),variance_or_components=0.80,drop_cols=["Close_1"]); df_out.head()
df_out = mapper.cross_lag(df.copy()); df_out.head()
df_out = mapper.a_chi(df.copy()); df_out.head()
df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()
df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head()
df_out = mapper.feature_agg(df.copy(),["Close_1"],4 ); df_out.head()
df_out = mapper.neigh_feat(df.copy(),["Close_1"],4 ); df_out.head()
Extraction
extract.abs_energy(df["Close"])
extract.cid_ce(df["Close"], True)
extract.mean_abs_change(df["Close"])
extract.mean_second_derivative_central(df["Close"])
extract.variance_larger_than_standard_deviation(df["Close"])
extract.var_index(df["Close"].values,var_index_param)
extract.symmetry_looking(df["Close"])
extract.has_duplicate_max(df["Close"])
extract.partial_autocorrelation(df["Close"])
extract.augmented_dickey_fuller(df["Close"])
extract.gskew(df["Close"])
extract.stetson_mean(df["Close"])
extract.length(df["Close"])
extract.count_above_mean(df["Close"])
extract.longest_strike_below_mean(df["Close"])
extract.wozniak(df["Close"])
extract.last_location_of_maximum(df["Close"])
extract.fft_coefficient(df["Close"])
extract.ar_coefficient(df["Close"])
extract.index_mass_quantile(df["Close"])
extract.number_cwt_peaks(df["Close"])
extract.spkt_welch_density(df["Close"])
extract.linear_trend_timewise(df["Close"])
extract.c3(df["Close"])
extract.binned_entropy(df["Close"])
extract.svd_entropy(df["Close"].values)
extract.hjorth_complexity(df["Close"])
extract.max_langevin_fixed_point(df["Close"])
extract.percent_amplitude(df["Close"])
extract.cad_prob(df["Close"])
extract.zero_crossing_derivative(df["Close"])
extract.detrended_fluctuation_analysis(df["Close"])
extract.fisher_information(df["Close"])
extract.higuchi_fractal_dimension(df["Close"])
extract.petrosian_fractal_dimension(df["Close"])
extract.hurst_exponent(df["Close"])
extract.largest_lyauponov_exponent(df["Close"])
extract.whelch_method(df["Close"])
extract.find_freq(df["Close"])
extract.flux_perc(df["Close"])
extract.range_cum_s(df["Close"])
extract.structure_func(df["Close"])
extract.kurtosis(df["Close"])
extract.stetson_k(df["Close"])
Test sets should ideally not be preprocessed together with the training data, as doing so amounts to peeking ahead into data the model should not see. The preprocessing parameters should be identified on the training set and then applied to the test set, i.e., the test set should not have an impact on the transformation applied. As an example, you would learn the parameters of a PCA decomposition on the training set and then apply them to both the training and the test set.
The benefit of pipelines becomes clear when one wants to apply multiple augmentation methods: they make it easy to learn the parameters and then apply them widely. For the most part, this notebook does not concern itself with 'peeking ahead' or pipelines; for some functions, you might have to restructure the code and make use of open source packages to create your preferred solution.
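As a minimal sketch of the rule above, the parameters of a robust scaler can be learned on the training rows only and then reused on the test rows. The helper names `fit_robust_scaler` and `apply_robust_scaler` and the toy data are hypothetical and are not part of DeltaPy.

```python
import numpy as np

def fit_robust_scaler(train, quantile_range=(25, 75)):
    """Learn the median and interquartile range from the training rows only."""
    center = np.median(train, axis=0)
    low, high = np.percentile(train, quantile_range, axis=0)
    return center, high - low

def apply_robust_scaler(data, center, scale):
    """Apply previously learned parameters to any split."""
    return (data - center) / scale

# Hypothetical toy series: the first 8 rows form the training set, the rest the test set.
values = np.arange(1.0, 13.0).reshape(-1, 1)
train, test = values[:8], values[8:]

center, scale = fit_robust_scaler(train)                 # parameters come from train only
train_scaled = apply_robust_scaler(train, center, scale)
test_scaled = apply_robust_scaler(test, center, scale)   # test rows never influence them
```

The same fit-then-apply split carries over to any transformation with learned parameters, such as PCA loadings or winsorisation fences.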
Notebook Dependencies
pip install deltapy
pip install pykalman
pip install tsaug
pip install ta
pip install pandasvault
pip install gplearn
pip install seasonal
import pandas as pd
import numpy as np
from deltapy import transform, interact, mapper, extract
import warnings
warnings.filterwarnings('ignore')
def data_copy():
    df = pd.read_csv("https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv")
    df["Close_1"] = df["Close"].shift(-1)
    df = df.dropna()
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.set_index("Date")
    return df
df = data_copy(); df.head()
Some of these categories are fluid and some techniques could fit into multiple buckets. This is an attempt to find an exhaustive number of techniques, but not an exhaustive list of implementations of the techniques. For example, there are thousands of ways to smooth a time series, but we have only included one or two techniques of interest under each category.
Here transformation is any method that includes only one feature as an input to produce a new feature/s. Transformations can be applied to cross-section and time-series data. Some transformations are exclusive to time-series data (smoothing, filtering), but a handful of functions apply to both.
Where a time series method uses a centred mean, or is otherwise forward-looking, the outputted time series has to be recalculated on a running basis to ensure that information from the future does not leak into the model. The last value of this recalculated series, or a feature extracted from it, can then be used as a running value that is only backward-looking, satisfying the no 'peeking ahead' rule.
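The running recalculation described above can be sketched as follows. This is an illustrative example only: `centred_mean` and `running_last_value` are hypothetical helpers, not DeltaPy functions, and the centred moving average stands in for any forward-looking method.

```python
import numpy as np

def centred_mean(series, window=3):
    """Centred moving average: each point uses values on both sides of it."""
    half = window // 2
    padded = np.pad(series, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

def running_last_value(series, func, min_obs=3):
    """Recompute `func` on the data available at each step and keep only its last value."""
    out = np.full(len(series), np.nan)
    for t in range(min_obs - 1, len(series)):
        out[t] = func(series[: t + 1])[-1]
    return out

prices = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
leaky = centred_mean(prices)                      # value at t depends on t + 1
safe = running_last_value(prices, centred_mean)   # value at t depends on <= t only
```

The loop reruns the method once per time step, which is why the text calls this approach expensive.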
There are some packages in Python that dynamically create time series and extract their features, but none that incorporates the dynamic creation of a time series in combination with a wide application of a prespecified list of extractions. Because this technique is expensive, we have a preference for models that only take historical data into account.
In this section we will include a list of all types of transformations, those that only use present information (operations), those that incorporate all values (interpolation methods), those that only include past values (smoothing functions), and those that incorporate a subset window of lagging and leading values (select filters). Only those that use historical values or are turned into prediction methods can be used out of the box. The entire time series can be used in the model development process for historical value methods, and only the forecasted values can be used for prediction models.
Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. When using an interpolation method (e.g., a cubic spline), you are taking future information into account. You can use interpolation methods to forecast into the future (extrapolation), and then use those forecasts in a training set. Or you could recalculate the interpolation for each time step and then extract features out of that series (extraction method). Interpolation and other forward-looking methods can be used if they are turned into prediction problems; the forecasted values can then be trained and tested on, and the fitted data can be disregarded. In the list presented below, the first five methods can be used for cross-section and time series data; after that the time-series-only methods follow.
There are a multitude of scaling methods available. Scaling generally gets applied to the entire dataset and is especially necessary for certain algorithms. K-means makes use of Euclidean distance, hence the need for scaling. For PCA, because we are trying to identify the feature with maximum variance, we also need scaling. Similarly, we need scaled features for gradient descent. Any algorithm that is not based on a distance measure is not affected by feature scaling. Available methods include range scalers like the minimum-maximum scaler and the maximum absolute scaler, as well as standardisation methods like the standard scaler. The example used here is the robust scaler. Normalisation is a good technique when you don't know the distribution of the data. Scaling looks into the future, so parameters have to be learned on a training set and applied to a test set.
(i) Robust Scaler
Scaling according to the interquartile range, making it robust to outliers.
def robust_scaler(df, drop=None, quantile_range=(25, 75)):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    center = np.median(df, axis=0)
    quantiles = np.percentile(df, quantile_range, axis=0)
    scale = quantiles[1] - quantiles[0]
    df = (df - center) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df
df_out = transform.robust_scaler(df.copy(), drop=["Close_1"]); df_out.head()
When using a standardisation method, it is often more effective when the attribute itself is Gaussian. It is also useful to apply the technique when the model you want to use makes assumptions of Gaussian distributions, like linear regression, logistic regression, and linear discriminant analysis. For most applications, standardisation is recommended.
(i) Standard Scaler
Standardize features by removing the mean and scaling to unit variance
def standard_scaler(df, drop):
    if drop:
        keep = df[drop]
        df = df.drop(drop, axis=1)
    mean = np.mean(df, axis=0)
    scale = np.std(df, axis=0)
    df = (df - mean) / scale
    if drop:
        df = pd.concat((keep, df), axis=1)
    return df
df_out = transform.standard_scaler(df.copy(), drop=["Close"]); df_out.head()
Computing the differences between consecutive observations, normally used to obtain a stationary time series.
(i) Fractional Differencing
Fractional differencing allows us to achieve stationarity while maintaining the maximum amount of memory compared to integer differencing.
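import numpy as np
def fast_fracdiff(x, cols, d):
    for col in cols:
        T = len(x[col])
        # Pad to the next power of two so the FFT product equals a linear convolution.
        np2 = int(2 ** np.ceil(np.log2(2 * T - 1)))
        k = np.arange(1, T)
        # Fractional differencing weights of (1 - B)^d.
        b = (1,) + tuple(np.cumprod((k - d - 1) / k))
        z = (0,) * (np2 - T)
        z1 = b + z
        z2 = tuple(x[col]) + z
        # numpy's FFT replaces the pylab aliases; the computation is unchanged.
        dx = np.fft.ifft(np.fft.fft(z1) * np.fft.fft(z2))
        x[col+"_frac"] = np.real(dx[0:T])
    return x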
df_out = transform.fast_fracdiff(df.copy(), ["Close","Open"],0.5); df_out.head()
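To make the weights used by fast_fracdiff more tangible, here is a small sketch that builds the same weight sequence and applies it directly, without the FFT. The function names `fracdiff_weights` and `fracdiff_direct` are hypothetical and the input series is a toy example; the direct version is quadratic in the series length and is only meant for illustration.

```python
import numpy as np

def fracdiff_weights(d, n):
    """First n weights of (1 - B)^d, where B is the backshift operator."""
    k = np.arange(1, n)
    return np.concatenate(([1.0], np.cumprod((k - d - 1) / k)))

def fracdiff_direct(x, d):
    """Direct fractional difference: a weighted sum of the current and past values."""
    w = fracdiff_weights(d, len(x))
    return np.array([np.dot(w[: t + 1], x[t::-1]) for t in range(len(x))])

x = np.array([1.0, 3.0, 6.0, 10.0])
half_weights = fracdiff_weights(0.5, 4)   # weights decay slowly, so memory is retained
first_diff = fracdiff_direct(x, 1.0)      # d = 1 reduces to ordinary differencing
half_diff = fracdiff_direct(x, 0.5)
```

With d = 1 the weights collapse to 1, -1, 0, 0, ..., which is the usual first difference; fractional values of d keep small non-zero weights on older observations.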
Any method that sets a floor and a cap on a feature's value. Capping can affect the distribution of data, so it should not be exaggerated. One can cap values by using the average, by using the max and min values, or by an arbitrary extreme value.
(i) Winsorisation
The transformation of features by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers by replacing them with a certain percentile value.
def outlier_detect(data, col, threshold=1, method="IQR"):
    if method == "IQR":
        IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
        Lower_fence = data[col].quantile(0.25) - (IQR * threshold)
        Upper_fence = data[col].quantile(0.75) + (IQR * threshold)
    if method == "STD":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "OWN":
        Upper_fence = data[col].mean() + threshold * data[col].std()
        Lower_fence = data[col].mean() - threshold * data[col].std()
    if method == "MAD":
        median = data[col].median()
        median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])
        modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])
        outlier_index = np.abs(modified_z_scores) > threshold
        # sum() counts the True entries and also works when no outlier is found
        print('Num of outlier detected:', outlier_index.sum())
        print('Proportion of outlier detected', outlier_index.sum()/len(outlier_index))
        return outlier_index, (median_absolute_deviation, median_absolute_deviation)
    # para holds the (upper, lower) fences used by windsorization below
    para = (Upper_fence, Lower_fence)
    tmp = pd.concat([data[col]>Upper_fence, data[col]<Lower_fence], axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:', outlier_index.sum())
    print('Proportion of outlier detected', outlier_index.sum()/len(outlier_index))
    return outlier_index, para
def windsorization(data, col, para, strategy='both'):
    """
    top-coding & bottom coding (capping the maximum of a distribution at an arbitrarily set value, vice versa)
    """
    data_copy = data.copy(deep=True)
    if strategy == 'both':
        data_copy.loc[data_copy[col]>para[0], col] = para[0]
        data_copy.loc[data_copy[col]<para[1], col] = para[1]
    elif strategy == 'top':
        data_copy.loc[data_copy[col]>para[0], col] = para[0]
    elif strategy == 'bottom':
        data_copy.loc[data_copy[col]<para[1], col] = para[1]
    return data_copy
_, para = transform.outlier_detect(df, "Close")
df_out = transform.windsorization(df.copy(),"Close",para,strategy='both'); df_out.head()
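For comparison, a percentile-based variant of the same idea can be sketched with NumPy alone. The function `winsorize_percentile` and its default 5th/95th percentile fences are assumptions chosen for illustration; they are not part of DeltaPy. The returned fences use the same (upper, lower) order as `para` above.

```python
import numpy as np

def winsorize_percentile(values, lower_pct=5, upper_pct=95):
    """Cap values at the chosen percentiles and return the fences as (upper, lower)."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lower, upper), (upper, lower)

# Hypothetical toy data: the integers 0..100.
clipped, fences = winsorize_percentile(np.arange(101))
```

In a prediction setting the fences would be learned on the training rows and then reused on the test rows, in line with the train/test rule discussed earlier.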
Operations here are treated like traditional transformations. It is the replacement of a variable by a function of that variable. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship.
(i) Power, Log, Reciprocal, Square Root
def operations(df, features):
    df_new = df[features]
    df_new = df_new - df_new.min()
    sqr_name = [str(fa)+"_POWER_2" for fa in df_new.columns]
    log_p_name = [str(fa)+"_LOG_p_one_abs" for fa in df_new.columns]
    rec_p_name = [str(fa)+"_RECIP_p_one" for fa in df_new.columns]
    sqrt_name = [str(fa)+"_SQRT_p_one" for fa in df_new.columns]
    df_sqr = pd.DataFrame(np.power(df_new.values, 2), columns=sqr_name, index=df.index)
    df_log = pd.DataFrame(np.log(df_new.add(1).abs().values), columns=log_p_name, index=df.index)
    df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values), columns=rec_p_name, index=df.index)
    df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values), columns=sqrt_name, index=df.index)
    dfs = [df, df_sqr, df_log, df_rec, df_sqrt]
    df = pd.concat(dfs, axis=1)
    return df
df_out = transform.operations(df.copy(),["Close"]); df_out.head()
Here we maintain that any method that has a component of historical averaging is a smoothing method, such as a simple moving average and single, double and triple exponential smoothing methods. These forms of causal filters are also popular in signal processing, where exponential smoothing is called an IIR filter and a moving average a FIR filter with equal weighting factors.
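The IIR versus FIR distinction can be illustrated with a short sketch. Both helpers below are hypothetical and written only for this illustration; they use past and present values only.

```python
import numpy as np

def simple_exponential_smoothing(x, alpha):
    """Recursive (IIR) smoother: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    s = np.empty(len(x))
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def trailing_moving_average(x, window):
    """FIR smoother with equal weights over the last `window` observations."""
    out = np.full(len(x), np.nan)
    kernel = np.ones(window) / window
    out[window - 1:] = np.convolve(x, kernel, mode="valid")
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ses = simple_exponential_smoothing(x, alpha=0.5)   # every past value keeps some weight
sma = trailing_moving_average(x, window=2)         # only the last two values are used
```

The exponential smoother feeds its own previous output back in, so its response to a single observation never fully dies out, whereas the moving average forgets an observation as soon as it leaves the window.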
(i) Triple Exponential Smoothing (Holt-Winters Exponential Smoothing)
The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level $\ell_t$, one for the trend $b_t$, and one for the seasonal component $s_t$. This particular version is performed by looking at the last 12 periods. For that reason, the first 12 records should be disregarded because they can't make use of the required window size for a fair calculation. The calculation is such that values are still provided for those periods based on whatever data might be available.
def initial_trend(series, slen):
    # average per-step change between the first two seasons
    total = 0.0
    for i in range(slen):
        total += float(series[i+slen] - series[i]) / slen
    return total / slen
def initial_seasonal_components(series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(series)/slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))
    # compute initial values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += series[slen*j+i]-season_averages[j]
        seasonals[i] = sum_of_vals_over_avg/n_seasons
    return seasonals
def triple_exponential_smoothing(df, cols, slen, alpha, beta, gamma, n_preds):
    for col in cols:
        # work on a plain array so integer positions do not depend on the index type
        series = df[col].to_numpy()
        result = []
        seasonals = initial_seasonal_components(series, slen)
        for i in range(len(series)+n_preds):
            if i == 0: # initial values
                smooth = series[0]
                trend = initial_trend(series, slen)
                result.append(series[0])
                continue
            if i >= len(series): # we are forecasting
                m = i - len(series) + 1
                result.append((smooth + m*trend) + seasonals[i%slen])
            else:
                val = series[i]
                last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)
                trend = beta * (smooth-last_smooth) + (1-beta)*trend
                seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]
                result.append(smooth+trend+seasonals[i%slen])
        df[col+"_TES"] = result
    return df
df_out= transform.triple_exponential_smoothing(df.copy(),["Close"], 12, .2,.2,.2,0); df_out.head()
Decomposition procedures are used in time series to describe the trend and seasonal factors in a time series. More extensive decompositions might also include long-run cycles, holiday effects, day of week effects and so on. Here, we’ll only consider trend and seasonal decompositions. A naive decomposition makes use of moving averages; other decomposition methods are available that make use of LOESS.
(i) Naive Decomposition
The base trend takes historical information into account and establishes moving averages; it does not have to be linear. To estimate the seasonal component for each season, simply average the detrended values for that season. If the seasonal variation looks constant, we should use the additive model. If the magnitude is increasing as a function of time, we will use the multiplicative model. Here, because it is predictive in nature, we are using a one-sided moving average, as opposed to a two-sided centred average.
import statsmodels.api as sm
def naive_dec(df, columns, freq=2):
    for col in columns:
        # recent statsmodels releases name this argument `period` (formerly `freq`)
        decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', period=freq, two_sided=False)
        # distinct suffixes so the three components do not overwrite one another
        df[col+"_NDDT"] = decomposition.trend
        df[col+"_NDDS"] = decomposition.seasonal
        df[col+"_NDDR"] = decomposition.resid
    return df
df_out = transform.naive_dec(df.copy(), ["Close","Open"]); df_out.head()
It is often useful to either low-pass filter (smooth) time series in order to reveal low-frequency features and trends, or to high-pass filter (detrend) time series in order to isolate high-frequency transients (e.g. storms). Low-pass filters use historical values; high-pass filters detrend with low-pass filters, so they also indirectly use historical values.
There are a few filters available, closely associated with decompositions and smoothing functions. The Hodrick-Prescott filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$. The Christiano-Fitzgerald filter is a generalization of the Baxter-King filter and can be seen as a weighted moving average.
(i) Baxter-King Bandpass
The Baxter-King filter is intended to explicitly deal with the periodicity of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at frequencies higher or lower than those of the business cycle. The parameters are arbitrarily chosen. This method uses a centred moving average that has to be changed to a lagged moving average before it can be used as an input feature. The maximum period of oscillation should be used as the point to truncate the dataset, as that part of the time series does not incorporate all the required data points.
import statsmodels.api as sm
def bkb(df, cols):
    for col in cols:
        df[col+"_BPF"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df)-1)
    return df
df_out = transform.bkb(df.copy(), ["Close"]); df_out.head()
(ii) Butter Lowpass (IIR Filter Design)
The Butterworth filter is a type of signal processing filter designed to have a frequency response as flat as possible in the passband. Like other filters, the first few values have to be disregarded for accurate downstream prediction. Instead of disregarding these values on a per-case basis, they can be disregarded in one chunk once the database of transformed features has been developed.
from scipy import signal, integrate
def butter_lowpass(cutoff, fs=20, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)
    return b, a
def butter_lowpass_filter(df, cols, cutoff, fs=20, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    for col in cols:
        df[col+"_BUTTER"] = signal.lfilter(b, a, df[col])
    return df
df_out = transform.butter_lowpass_filter(df.copy(),["Close"],4); df_out.head()
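The warm-up effect mentioned above, and the fact that `signal.lfilter` only looks backwards, can be checked with a small sketch. The constant input series and the change point at index 150 are made-up values for illustration; the filter design reuses the cutoff of 4, sampling rate of 20 and order of 5 from the example.

```python
import numpy as np
from scipy import signal

# Same design as butter_lowpass: cutoff 4 at a sampling rate of 20, order 5.
b, a = signal.butter(5, 4 / (0.5 * 20), btype="low", analog=False)

# A constant input shows the start-up transient: the output begins far below 1
# and only settles at the input level after a number of steps.
constant = np.ones(200)
filtered = signal.lfilter(b, a, constant)

# Changing future observations leaves all earlier outputs untouched.
changed = constant.copy()
changed[150:] = 5.0
filtered_changed = signal.lfilter(b, a, changed)
```

Because earlier outputs never change when later data arrives, the filtered column can be used as a feature once the initial transient rows are dropped.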
(iii) Hilbert Transform Angle
The Hilbert transform is a time-domain to time-domain transformation which shifts the phase of a signal by 90 degrees. It is also a centred measure and would be difficult to use in a time series prediction setting, unless it is recalculated on a per step basis or transformed to be based on historical values only.
from scipy import signal
import numpy as np
def instantaneous_phases(df, cols):
    for col in cols:
        df[col+"_HILLB"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0)
    return df
df_out = transform.instantaneous_phases(df.copy(), ["Close"]); df_out.head()
(iv) Unscented Kalman Filter
The Kalman filter is better suited for estimating things that change over time. The most tangible example is tracking moving objects. A Kalman filter will be very close to the actual trajectory because it treats the most recent measurement as more important than the older ones. The Unscented Kalman Filter (UKF) is a model-based technique that recursively estimates the states (and with some modifications also the parameters) of a nonlinear, dynamic, discrete-time system. The UKF is based on the typical prediction-correction style of methods. The Kalman Smoother incorporates future values; the Filter doesn't and can be used for online prediction. The normal Kalman filter is a forward filter in the sense that it makes a forecast of the current state using only current and past observations, whereas the smoother is based on computing a suitable linear combination of two filters, which are run in the forward and backward directions.
from pykalman import UnscentedKalmanFilter
def kalman_feat(df, cols):
    for col in cols:
        ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1)
        (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col])
        (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col])
        df[col+"_UKFSMOOTH"] = smoothed_state_means.flatten()
        df[col+"_UKFFILTER"] = filtered_state_means.flatten()
    return df
df_out = transform.kalman_feat(df.copy(), ["Close"]); df_out.head()
There are a range of functions for spectral analysis. You can use periodograms and the Welch method to estimate the power spectral density. You can also use the Welch method to estimate the cross power spectral density. Other techniques include spectrograms, Lomb-Scargle periodograms, and the short-time Fourier transform.
(i) Periodogram
This returns an array of sample frequencies and the power spectrum of x, or the power spectral density of x.
from scipy import signal
def perd_feat(df, cols):
    for col in cols:
        sig = signal.periodogram(df[col], fs=1, return_onesided=False)
        df[col+"_FREQ"] = sig[0]
        df[col+"_POWER"] = sig[1]
    return df
df_out = transform.perd_feat(df.copy(),["Close"]); df_out.head()
(ii) Fast Fourier Transform
The FFT, or fast Fourier transform, is an algorithm that efficiently finds the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum, by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal), in order to obtain a smoothed version of the underlying signal. This takes the entire signal into account, and as a result has to be recalculated on a running basis to avoid peeking into the future.
def fft_feat(df, cols):
    for col in cols:
        fft_df = np.fft.fft(np.asarray(df[col].tolist()))
        fft_df = pd.DataFrame({'fft': fft_df})
        df[col+'_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values
        df[col+'_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values
    return df
df_out = transform.fft_feat(df.copy(), ["Close"]); df_out.head()
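Removing tones from the spectrum to obtain a smoothed signal, as described above, can be sketched as follows. The helper `fft_lowpass` and the synthetic two-tone signal are assumptions made for this illustration and are not part of DeltaPy.

```python
import numpy as np

def fft_lowpass(x, keep):
    """Keep the mean and the `keep` lowest-frequency tones, and drop the rest."""
    spectrum = np.fft.rfft(x)
    spectrum[keep + 1:] = 0
    return np.fft.irfft(spectrum, n=len(x))

t = np.arange(64)
slow = np.sin(2 * np.pi * t / 64)              # one cycle over the window
fast = 0.5 * np.sin(2 * np.pi * 8 * t / 64)    # eight cycles over the window
smoothed = fft_lowpass(slow + fast, keep=2)    # the fast tone is removed
```

Since every output value depends on the whole input window, this smoother would have to be rerun at each time step and only its last value kept if it is to serve as a backward-looking feature.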
The waveform of a signal is the shape of its graph as a function of time.
(i) Continuous Wave Radar
from scipy import signal
def harmonicradar_cw(df, cols, fs, fc):
    for col in cols:
        ttxt = f'CW: {fc} Hz'
        #%% input
        t = df[col]
        tx = np.sin(2*np.pi*fc*t)
        _, Pxx = signal.welch(tx, fs)
        #%% diode
        d = (signal.square(2*np.pi*fc*t))
        d[d<0] = 0.
        #%% output of diode
        rx = tx * d
        df[col+"_HARRAD"] = rx.values
    return df
df_out = transform.harmonicradar_cw(df.copy(), ["Close"],0.3,0.2); df_out.head()
(ii) Saw Tooth
Return a periodic sawtooth or triangle waveform.
def saw(df, cols):
    for col in cols:
        df[col+" SAW"] = signal.sawtooth(df[col])
    return df
df_out = transform.saw(df.copy(),["Close","Open"]); df_out.head()
(9) Modifications
A range of modifications usually applied to images; these values would have to be recalculated for each time series.
(i) Various Techniques
from tsaug import *
def modify(df, cols):
    for col in cols:
        series = df[col].values
        df[col+"_magnify"], _ = magnify(series, series)
        df[col+"_affine"], _ = affine(series, series)
        df[col+"_crop"], _ = crop(series, series)
        df[col+"_cross_sum"], _ = cross_sum(series, series)
        df[col+"_resample"], _ = resample(series, series)
        df[col+"_trend"], _ = trend(series, series)
        df[col+"_random_affine"], _ = random_time_warp(series, series)
        df[col+"_random_crop"], _ = random_crop(series, series)
        df[col+"_random_cross_sum"], _ = random_cross_sum(series, series)
        df[col+"_random_sidetrack"], _ = random_sidetrack(series, series)
        df[col+"_random_time_warp"], _ = random_time_warp(series, series)
        df[col+"_random_magnify"], _ = random_magnify(series, series)
        df[col+"_random_jitter"], _ = random_jitter(series, series)
        df[col+"_random_trend"], _ = random_trend(series, series)
    return df
df_out = transform.modify(df.copy(),["Close"]); df_out.head()
Features that are calculated on a rolling basis over a fixed window size.
(i) Mean, Standard Deviation
def multiple_rolling(df, windows = [1,2], functions=["mean","std"], columns=None):
windows = [1+a for a in windows]
if not columns:
columns = df.columns.to_list()
rolling_dfs = (df[columns].rolling(i) # 1. Create window
.agg(functions) # 1. Aggregate
.rename({col: '{0}_{1:d}'.format(col, i)
for col in columns}, axis=1) # 2. Rename columns
for i in windows) # For each window
df_out = pd.concat((df, *rolling_dfs), axis=1)
da = df_out.iloc[:,len(df.columns):]
da = [col[0] + "_" + col[1] for col in da.columns.to_list()]
df_out.columns = df.columns.to_list() + da
return df_out # 3. Concatenate dataframes
df_out = transform.multiple_rolling(df, columns=["Close"]); df_out.head()
Lagged values from existing features.
(i) Single Steps
def multiple_lags(df, start=1, end=3,columns=None):
if not columns:
columns = df.columns.to_list()
lags = range(start, end+1) # Lags from start to end inclusive (three lags with the defaults).
df = df.assign(**{
'{}_t_{}'.format(col, t): df[col].shift(t)
for t in lags
for col in columns
})
return df
df_out = transform.multiple_lags(df, start=1, end=3, columns=["Close"]); df_out.head()
There is a range of time series models that can be implemented, such as AR, MA, ARMA, ARIMA, SARIMA, SARIMAX, VAR, VARMA, VARMAX, SES, and HWES. The models can be divided into autoregressive models and smoothing models. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. Each method might require specific tuning and parameters to suit your prediction task. You need to drop a certain amount of historical data that you use during the fitting stage. Models that take seasonality into account need more training data.
(i) Prophet
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality. You can apply additive models to your training data but also interactive models like deep learning models. The problem is that, because these models have learned from future observations, the time series would have to be recalculated on a running basis, or only the predicted (as opposed to fitted) values should be included in future training and test sets. In this example, I train on 150 data points to illustrate how the remaining 100 or so data points can be used in a new prediction problem. You can plot the result with df_out["Close_PROPHET"].plot() to see the effect.
from fbprophet import Prophet
def prophet_feat(df, cols,date, freq,train_size=150):
def prophet_dataframe(df):
df.columns = ['ds','y']
return df
def original_dataframe(df, freq, name):
prophet_pred = pd.DataFrame({"Date" : df['ds'], name : df["yhat"]})
prophet_pred = prophet_pred.set_index("Date")
#prophet_pred.index.freq = pd.tseries.frequencies.to_offset(freq)
return prophet_pred[name].values
for col in cols:
model = Prophet(daily_seasonality=True)
fb = model.fit(prophet_dataframe(df[[date, col]].head(train_size)))
forecast_len = len(df) - train_size
future = model.make_future_dataframe(periods=forecast_len,freq=freq)
future_pred = model.predict(future)
df[col+"_PROPHET"] = list(original_dataframe(future_pred,freq,col))
return df
df_out = transform.prophet_feat(df.copy().reset_index(),["Close","Open"],"Date", "D"); df_out.head()
Interactions are defined as methods that require more than one feature to create an additional feature. Here we include normalising and discretising techniques that are non-feature specific. Almost all of these methods can be applied to cross-sectional data. The only methods that are time specific are the technical features in the speciality section and the autoregression model.
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables.
(i) Lowess Smoother
The lowess smoother is a robust locally weighted regression. The function fits a nonparametric regression curve to a scatterplot.
from math import ceil
import numpy as np
from scipy import linalg

def lowess(df, cols, y, f=2. / 3., iter=3):
    y = np.asarray(y, dtype=float)
    for col in cols:
        # Work on a plain array so indexing is positional, whatever the DataFrame index is
        x = df[col].to_numpy(dtype=float)
        n = len(x)
        r = int(ceil(f * n))
        # Bandwidth per point: distance to its r-th nearest neighbour
        h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
        w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
        w = (1 - w ** 3) ** 3  # tricube weights
        yest = np.zeros(n)
        delta = np.ones(n)
        for iteration in range(iter):
            for i in range(n):
                # Weighted least-squares line fitted around point i
                weights = delta * w[:, i]
                b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
                A = np.array([[np.sum(weights), np.sum(weights * x)],
                              [np.sum(weights * x), np.sum(weights * x * x)]])
                beta = linalg.solve(A, b)
                yest[i] = beta[0] + beta[1] * x[i]
            residuals = y - yest
            s = np.median(np.abs(residuals))
            delta = np.clip(residuals / (6.0 * s), -1, 1)
            delta = (1 - delta ** 2) ** 2  # bisquare robustness weights
        df[col+"_LOWESS"] = yest
    return df
df_out = interact.lowess(df.copy(), ["Open","Volume"], df["Close"], f=0.25, iter=3); df_out.head()
Autoregression
Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.
from statsmodels.tsa.ar_model import AR
from timeit import default_timer as timer
def autoregression(df, drop=None, settings={"autoreg_lag":4}):
autoreg_lag = settings["autoreg_lag"]
if drop:
keep = df[drop]
df = df.drop([drop],axis=1)
n_channels = df.shape[0]
t = timer()
channels_regg = np.zeros((n_channels, autoreg_lag + 1))
for i in range(0, n_channels):
fitted_model = AR(df.values[i, :]).fit(autoreg_lag)
# TODO: This is not the same as Matlab's for some reasons!
# kk = ARMAResults(fitted_model)
# autore_vals, dummy1, dummy2 = arburg(x[i, :], autoreg_lag) # This looks like Matlab's but slow
channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params)
for i in range(channels_regg.shape[1]):
df["LAG_"+str(i+1)] = channels_regg[:,i]
if drop:
df = pd.concat((keep,df),axis=1)
t = timer() - t
return df
df_out = interact.autoregression(df.copy()); df_out.head()
Looking at interaction between different features. Here the methods employed are multiplication and division.
(i) Multiplication and Division
def muldiv(df, feature_list):
for feat in feature_list:
for feat_two in feature_list:
if feat==feat_two:
continue
else:
df[feat+"/"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min()) # shifted denominator; the row holding the minimum still divides by zero
df[feat+"_X_"+feat_two] = df[feat]*(df[feat_two])
return df
df_out = interact.muldiv(df.copy(), ["Close","Open"]); df_out.head()
In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes.
(i) Decision Tree Discretiser
The first method that will be applied here is a supervised discretiser. Discretisation with decision trees consists of using a decision tree to identify the optimal splitting points that determine the bins or contiguous intervals.
from sklearn.tree import DecisionTreeRegressor
def decision_tree_disc(df, cols, depth=4 ):
for col in cols:
df[col +"_m1"] = df[col].shift(1)
df = df.iloc[1:,:]
tree_model = DecisionTreeRegressor(max_depth=depth,random_state=0)
tree_model.fit(df[col +"_m1"].to_frame(), df[col])
df[col+"_Disc"] = tree_model.predict(df[col +"_m1"].to_frame())
return df
df_out = interact.decision_tree_disc(df.copy(), ["Close"]); df_out.head()
Normalising normally pertains to the scaling of data. There are many methods available; interacting normalising methods make use of all the features' attributes to do the scaling.
(i) Quantile Normalisation
In statistics, quantile normalization is a technique for making two distributions identical in statistical properties.
import numpy as np
import pandas as pd
def quantile_normalize(df, drop):
if drop:
keep = df[drop]
df = df.drop(drop,axis=1)
#compute rank
dic = {}
for col in df:
dic.update({col : sorted(df[col])})
sorted_df = pd.DataFrame(dic)
rank = sorted_df.mean(axis = 1).tolist()
#sort
for col in df:
t = np.searchsorted(np.sort(df[col]), df[col])
df[col] = [rank[i] for i in t]
if drop:
df = pd.concat((keep,df),axis=1)
return df
df_out = interact.quantile_normalize(df.copy(), drop=["Close"]); df_out.head()
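The rank-and-average idea behind quantile normalisation can be traced on a tiny example. The following is a dependency-free sketch, not the function above: it takes a hypothetical dict of equal-length lists instead of a DataFrame, and it assumes there are no tied values within a column (ties are handled differently by the `searchsorted` approach above).

```python
def quantile_normalize(columns):
    """columns: dict mapping a name to a list of numbers (all the same length, no ties)."""
    n = len(next(iter(columns.values())))
    sorted_cols = [sorted(vals) for vals in columns.values()]
    # Mean of the values that share the same rank across all columns
    rank_means = [sum(col[i] for col in sorted_cols) / len(sorted_cols) for i in range(n)]
    out = {}
    for name, vals in columns.items():
        order = sorted(range(n), key=lambda i: vals[i])  # positions from smallest to largest
        normalised = [0.0] * n
        for rank, pos in enumerate(order):
            normalised[pos] = rank_means[rank]
        out[name] = normalised
    return out

result = quantile_normalize({"a": [5, 2, 3], "b": [4, 1, 6]})
print(result)
```

After normalisation both columns contain exactly the same set of values (the rank means), only arranged according to each column's original ordering.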
There are multiple types of distance functions like Euclidean, Mahalanobis, and Minkowski distance. Here we are using a contrived example in a location based haversine distance.
(i) Haversine Distance
The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere.
from math import sin, cos, sqrt, atan2, radians
def haversine_distance(row, lon="Open", lat="Close"):
c_lat,c_long = radians(52.5200), radians(13.4050)
R = 6373.0
long = radians(row[lon])
lat = radians(row[lat])
dlon = long - c_long
dlat = lat - c_lat
a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
return R * c
df_out['distance_central'] = df.apply(interact.haversine_distance,axis=1); df_out.head()
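To sanity-check the haversine formula outside of the price data, it can be applied to two real coordinates. This is a minimal standard-library sketch; the function name `haversine_km` and the second point (Paris) are illustrative assumptions, while the reference point and the radius of 6373 km are taken from the function above.

```python
from math import radians, sin, cos, sqrt, atan2

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))

berlin = (52.5200, 13.4050)  # reference point used in the function above
paris = (48.8566, 2.3522)    # hypothetical second point, for illustration only
distance = haversine_km(*berlin, *paris)
print(round(distance), "km")
```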
(i) Technical Features
Technical indicators are heuristic or mathematical calculations based on the price, volume, or open interest of a security or contract used by traders who follow technical analysis. By analyzing historical data, technical analysts use indicators to predict future price movements.
import ta
def tech(df):
return ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume")
df_out = interact.tech(df.copy()); df_out.head()
Genetic programming has shown promise in constructing features by using original features to form high-level ones that can help algorithms achieve better performance.
(i) Symbolic Transformer
A symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship.
df.head()
from gplearn.genetic import SymbolicTransformer
def genetic_feat(df, num_gen=20, num_comp=10):
function_set = ['add', 'sub', 'mul', 'div',
'sqrt', 'log', 'abs', 'neg', 'inv','tan']
gp = SymbolicTransformer(generations=num_gen, population_size=200,
hall_of_fame=100, n_components=num_comp,
function_set=function_set,
parsimony_coefficient=0.0005,
max_samples=0.9, verbose=1,
random_state=0, n_jobs=6)
gen_feats = gp.fit_transform(df.drop("Close_1", axis=1), df["Close_1"])
gen_feats = pd.DataFrame(gen_feats, columns=["gen_"+str(a) for a in range(gen_feats.shape[1])])
gen_feats.index = df.index
return pd.concat((df,gen_feats),axis=1)
df_out = interact.genetic_feat(df.copy()); df_out.head()
Methods that help with the summarisation of features by remapping them to achieve some aim like the maximisation of variability or class separability. These methods tend to be unsupervised, but can also take a supervised form.
Eigendecomposition or sometimes spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Some examples are LDA and PCA.
(i) Principal Component Analysis
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA
def pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,n_components=5 ,drop_cols=None, non_linear=True):
if non_linear:
pca = KernelPCA(n_components = n_components, kernel='rbf', fit_inverse_transform=True, random_state = 33, remove_zero_eig= True)
else:
if memory_issues:
if not mem_iss_component:
raise ValueError("If you have memory issues, you have to preselect mem_iss_component")
pca = IncrementalPCA(mem_iss_component)
else:
if variance_or_components>1:
pca = PCA(n_components=variance_or_components)
else: # automated selection based on variance
pca = PCA(n_components=variance_or_components,svd_solver="full")
if drop_cols:
X_pca = pca.fit_transform(df.drop(drop_cols,axis=1))
return pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)),axis=1)
else:
X_pca = pca.fit_transform(df)
return pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)
return df
df_out = mapper.pca_feature(df.copy(), variance_or_components=0.9, n_components=8,non_linear=False)
These families of algorithms are useful to find linear relations between two multivariate datasets.
(i) Canonical Correlation Analysis
Canonical-correlation analysis (CCA) is a way of inferring information from cross-covariance matrices.
from sklearn.cross_decomposition import CCA
def cross_lag(df, drop=None, lags=1, components=4 ):
if drop:
keep = df[drop]
df = df.drop([drop],axis=1)
df_2 = df.shift(lags)
df = df.iloc[lags:,:]
df_2 = df_2.dropna().reset_index(drop=True)
cca = CCA(n_components=components)
cca.fit(df_2, df)
X_c, df_2 = cca.transform(df_2, df)
df_2 = pd.DataFrame(df_2, index=df.index)
df_2 = df_2.add_prefix('crd_')
if drop:
df = pd.concat([keep,df,df_2],axis=1)
else:
df = pd.concat([df,df_2],axis=1)
return df
df_out = mapper.cross_lag(df.copy()); df_out.head()
Functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines.
(i) Additive Chi2 Kernel
Computes the additive chi-squared kernel between observations in X and Y. The chi-squared kernel is computed between each pair of rows in X and Y. X and Y have to be non-negative.
from sklearn.kernel_approximation import AdditiveChi2Sampler
def a_chi(df, drop=None, lags=1, sample_steps=2 ):
if drop:
keep = df[drop]
df = df.drop([drop],axis=1)
df_2 = df.shift(lags)
df = df.iloc[lags:,:]
df_2 = df_2.dropna().reset_index(drop=True)
chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps)
df_2 = chi2sampler.fit_transform(df_2, df["Close"])
df_2 = pd.DataFrame(df_2, index=df.index)
df_2 = df_2.add_prefix('achi_')
if drop:
df = pd.concat([keep,df,df_2],axis=1)
else:
df = pd.concat([df,df_2],axis=1)
return df
df_out = mapper.a_chi(df.copy()); df_out.head()
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore noise.
(i) Feed Forward
The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to the single layer perceptrons that participate in multilayer perceptrons.
from sklearn.preprocessing import minmax_scale
import tensorflow as tf
import numpy as np
def encoder_dataset(df, drop=None, dimesions=20):
if drop:
train_scaled = minmax_scale(df.drop(drop,axis=1).values, axis = 0)
else:
train_scaled = minmax_scale(df.values, axis = 0)
# define the number of encoding dimensions
encoding_dim = dimesions
# define the number of features
ncol = train_scaled.shape[1]
input_dim = tf.keras.Input(shape = (ncol, ))
# Encoder Layers
encoded1 = tf.keras.layers.Dense(3000, activation = 'relu')(input_dim)
encoded2 = tf.keras.layers.Dense(2750, activation = 'relu')(encoded1)
encoded3 = tf.keras.layers.Dense(2500, activation = 'relu')(encoded2)
encoded4 = tf.keras.layers.Dense(750, activation = 'relu')(encoded3)
encoded5 = tf.keras.layers.Dense(500, activation = 'relu')(encoded4)
encoded6 = tf.keras.layers.Dense(250, activation = 'relu')(encoded5)
encoded7 = tf.keras.layers.Dense(encoding_dim, activation = 'relu')(encoded6)
encoder = tf.keras.Model(inputs = input_dim, outputs = encoded7)
encoded_input = tf.keras.Input(shape = (encoding_dim, ))
encoded_train = pd.DataFrame(encoder.predict(train_scaled),index=df.index)
encoded_train = encoded_train.add_prefix('encoded_')
if drop:
encoded_train = pd.concat((df[drop],encoded_train),axis=1)
return encoded_train
df_out = mapper.encoder_dataset(df.copy(), ["Close_1"], 15); df_out.head()
Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data.
(i) Local Linear Embedding
Locally Linear Embedding is a method of non-linear dimensionality reduction. It tries to reduce these n-Dimensions while trying to preserve the geometric features of the original non-linear feature structure.
from sklearn.manifold import LocallyLinearEmbedding
def lle_feat(df, drop=None, components=4):
if drop:
keep = df[drop]
df = df.drop(drop, axis=1)
embedding = LocallyLinearEmbedding(n_components=components)
em = embedding.fit_transform(df)
df = pd.DataFrame(em,index=df.index)
df = df.add_prefix('lle_')
if drop:
df = pd.concat((keep,df),axis=1)
return df
df_out = mapper.lle_feat(df.copy(),["Close_1"],4); df_out.head()
Most clustering techniques start with a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together according to some measure. Although these clustering techniques are typically used for observations, they can also be used for feature dimensionality reduction, especially hierarchical clustering techniques.
(i) Feature Agglomeration
Feature agglomeration uses clustering to group together features that look very similar, thus decreasing the number of features.
import numpy as np
from sklearn import datasets, cluster
def feature_agg(df, drop=None, components=4):
if drop:
keep = df[drop]
df = df.drop(drop, axis=1)
components = min(df.shape[1]-1,components)
agglo = cluster.FeatureAgglomeration(n_clusters=components)
agglo.fit(df)
df = pd.DataFrame(agglo.transform(df),index=df.index)
df = df.add_prefix('feagg_')
if drop:
return pd.concat((keep,df),axis=1)
else:
return df
df_out = mapper.feature_agg(df.copy(),["Close_1"],4 ); df_out.head()
Neighbouring points can be calculated using distance metrics like Hamming, Manhattan, Minkowski distance. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.
(i) Nearest Neighbours
Unsupervised learner for implementing neighbor searches.
from sklearn.neighbors import NearestNeighbors
def neigh_feat(df, drop, neighbors=6):
if drop:
keep = df[drop]
df = df.drop(drop, axis=1)
components = min(df.shape[0]-1,neighbors)
neigh = NearestNeighbors(n_neighbors=components)
neigh.fit(df)
neigh = neigh.kneighbors()[0]
df = pd.DataFrame(neigh, index=df.index)
df = df.add_prefix('neigh_')
if drop:
return pd.concat((keep,df),axis=1)
else:
return df
return df
df_out = mapper.neigh_feat(df.copy(),["Close_1"],4 ); df_out.head()
When working with extraction, you have to decide the size of the time series history to take into account when calculating a collection of walk-forward feature values. To facilitate our extraction, we use an excellent package called TSfresh, and also some of their default features. For completeness, we also include 12 or so custom features to be added to the extraction pipeline.
The time series methods in the transformation section and the interaction section are similar to the methods we will uncover in the extraction section. However, for transformation and interaction methods the output is an entirely new time series, whereas extraction methods take as input multiple constructed time series and extract a single value from each time series to reconstruct an entirely new time series.
Some methods naturally fit better in one format than another, e.g., lags are too expensive for extraction; time series decomposition only has to be performed once, because it has a low level of 'leakage', so it is better suited to transformation; and forecast methods attempt to predict multiple future training samples, so they won't work with extraction, which only delivers one value per time series. Furthermore, all non time-series (cross-sectional) transformation and extraction techniques cannot make use of extraction, as it is solely a time-series method.
Lastly, when we want to double apply specific functions, we can apply a function as a transformation/interaction, and then all the extraction methods can be applied to this feature as well. For example, if we calculate a smoothing function (transformation), then all other extraction functions (median, entropy, linearity etc.) can now be applied to that smoothing function, including the application of the smoothing function itself, e.g., a double smooth, double lag, double filter etc. So separating these methods out gives us great flexibility.
Decorator
def set_property(key, value):
"""
This method returns a decorator that sets the property key of the function to value
"""
def decorate_func(func):
setattr(func, key, value)
if func.__doc__ and key == "fctype":
func.__doc__ = func.__doc__ + "\n\n *This function is of type: " + value + "*\n"
return func
return decorate_func
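A short usage sketch may help show what the decorator does to a function. The decorator is repeated here so the block runs on its own; the feature `value_range` is a hypothetical example and not part of the package.

```python
def set_property(key, value):
    """Return a decorator that sets attribute `key` of the decorated function to `value`."""
    def decorate_func(func):
        setattr(func, key, value)
        if func.__doc__ and key == "fctype":
            func.__doc__ = func.__doc__ + "\n\n    *This function is of type: " + value + "*\n"
        return func
    return decorate_func

@set_property("fctype", "simple")
@set_property("custom", True)
def value_range(x):  # hypothetical feature, for illustration only
    """Difference between the largest and smallest value."""
    return max(x) - min(x)

# The function still behaves normally, but now carries metadata attributes
print(value_range.fctype, value_range.custom, value_range([3, 1, 4]))
```

The attributes are plain function attributes, so an extraction pipeline can inspect them (for example `func.fctype`) to decide how to call each feature calculator.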
You can calculate the linear, non-linear and absolute energy of a time series. In signal processing, the energy $E_{s}$ of a continuous-time signal $x(t)$ is defined as the area under the squared magnitude of the considered signal. Mathematically, $E_{s}=\langle x(t), x(t)\rangle=\int_{-\infty}^{\infty}|x(t)|^{2} d t$.
(i) Absolute Energy
Returns the absolute energy of the time series, which is the sum over the squared values.
#-> In Package
def abs_energy(x):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
return np.dot(x, x)
extract.abs_energy(df["Close"])
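For a sampled series, the energy integral reduces to a sum of squares, $E = \sum_i x_i^2$. A tiny worked example, written without NumPy purely for illustration:

```python
def abs_energy(x):
    """Sum of squared values: the discrete counterpart of the energy integral."""
    return sum(v * v for v in x)

samples = [1.0, -2.0, 3.0]
energy = abs_energy(samples)  # 1 + 4 + 9
print(energy)
```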
Here we widely define distance measures as those that take a difference between attributes or series of datapoints.
(i) Complexity-Invariant Distance
This feature calculator is an estimate of the complexity of a time series.
#-> In Package
def cid_ce(x, normalize):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
if normalize:
s = np.std(x)
if s!=0:
x = (x - np.mean(x))/s
else:
return 0.0
x = np.diff(x)
return np.sqrt(np.dot(x, x))
extract.cid_ce(df["Close"], True)
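The estimate is the Euclidean length of the first-difference series, so a series with more ups and downs scores higher. A minimal sketch without the optional normalisation step, using two made-up series that share the same start value and a similar range:

```python
from math import sqrt

def cid_ce(x):
    """Complexity estimate: square root of the summed squared first differences."""
    return sqrt(sum((b - a) ** 2 for a, b in zip(x, x[1:])))

smooth = [0, 1, 2, 3, 4]  # steady ramp
jagged = [0, 4, 0, 4, 0]  # repeated direction changes
print(cid_ce(smooth), cid_ce(jagged))
```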
Many alternatives to differencing exist; one can, for example, take the difference of every other value, take the squared difference, take the fractional difference, or, like our example, take the mean absolute difference.
(i) Mean Absolute Change
Returns the mean over the absolute differences between subsequent time series values.
#-> In Package
def mean_abs_change(x):
return np.mean(np.abs(np.diff(x)))
extract.mean_abs_change(df["Close"])
Features where the emphasis is on the rate of change.
(i) Mean Central Second Derivative
Returns the mean value of a central approximation of the second derivative
#-> In Package
def _roll(a, shift):
if not isinstance(a, np.ndarray):
a = np.asarray(a)
idx = shift % len(a)
return np.concatenate([a[-idx:], a[:-idx]])
def mean_second_derivative_central(x):
diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0
return np.mean(diff[1:-1])
extract.mean_second_derivative_central(df["Close"])
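The central approximation averages $(x_{i+1} - 2x_i + x_{i-1})/2$ over the interior points. A dependency-free sketch on two synthetic series makes the behaviour easy to verify by hand: a straight line has no curvature, while $x_i = i^2$ gives a constant value.

```python
def mean_second_derivative_central(x):
    """Mean of (x[i+1] - 2*x[i] + x[i-1]) / 2 over the interior points of x."""
    terms = [(x[i + 1] - 2 * x[i] + x[i - 1]) / 2.0 for i in range(1, len(x) - 1)]
    return sum(terms) / len(terms)

linear = [3 * t + 1 for t in range(6)]   # no curvature
quadratic = [t ** 2 for t in range(6)]   # constant curvature
print(mean_second_derivative_central(linear), mean_second_derivative_central(quadratic))
```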
Volatility is a statistical measure of the dispersion of a time-series.
(i) Variance Larger than Standard Deviation
#-> In Package
def variance_larger_than_standard_deviation(x):
y = np.var(x)
return y > np.sqrt(y)
extract.variance_larger_than_standard_deviation(df["Close"])
(ii) Variability Index
Variability Index is a way to measure how smooth or 'variable' a time series is.
var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
@set_property("fctype", "combiner")
@set_property("custom", True)
def var_index(time,param=var_index_param):
final = []
keys = []
for key, magnitude in param.items():
w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2)
w_mean = np.mean(w)
N = len(time)
sigma2 = np.var(magnitude)
S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2)
S2 = sum(w)
eta_e = (w_mean * np.power(time[N - 1] -
time[0], 2) * S1 / (sigma2 * S2 * N ** 2))
final.append(eta_e)
keys.append(key)
return {"Interact__{}".format(k): eta_e for eta_e, k in zip(final,keys) }
extract.var_index(df["Close"].values,var_index_param)
Features that emphasise a particular shape not ordinarily considered as a distribution statistic. This extends to derivations of the original time series too, for example a feature looking at the sinusoidal shape of an autocorrelation plot.
(i) Symmetrical
Boolean variable denoting if the distribution of x looks symmetric.
#-> In Package
def symmetry_looking(x, param=[{"r": 0.2}]):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
mean_median_difference = np.abs(np.mean(x) - np.median(x))
max_min_difference = np.max(x) - np.min(x)
return [("r_{}".format(r["r"]), mean_median_difference < (r["r"] * max_min_difference))
for r in param]
extract.symmetry_looking(df["Close"])
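The criterion is $|\text{mean}(x) - \text{median}(x)| < r \cdot (\max(x) - \min(x))$. The sketch below applies it to two made-up series with a single hypothetical threshold `r`, rather than the list of parameter dicts used above:

```python
from statistics import mean, median

def symmetry_looking(x, r=0.1):
    """True when mean and median are closer than a fraction r of the value range."""
    return abs(mean(x) - median(x)) < r * (max(x) - min(x))

balanced = [1, 2, 3, 4, 5]   # mean equals median
skewed = [1, 1, 1, 1, 11]    # one large value pulls the mean away from the median
print(symmetry_looking(balanced), symmetry_looking(skewed))
```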
Looking at the occurrence and recurrence of defined values.
(i) Has Duplicate Max
#-> In Package
def has_duplicate_max(x):
"""
Checks if the maximum value of x is observed more than once
:param x: the time series to calculate the feature of
:type x: numpy.ndarray
:return: the value of this feature
:return type: bool
"""
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
return np.sum(x == np.max(x)) >= 2
extract.has_duplicate_max(df["Close"])
Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay.
(i) Partial Autocorrelation
Partial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.
#-> In Package
from statsmodels.tsa.stattools import acf, adfuller, pacf
def partial_autocorrelation(x, param=[{"lag": 1}]):
# Check the difference between demanded lags by param and possible lags to calculate (depends on len(x))
max_demanded_lag = max([lag["lag"] for lag in param])
n = len(x)
# Check if list is too short to make calculations
if n <= 1:
pacf_coeffs = [np.nan] * (max_demanded_lag + 1)
else:
if (n <= max_demanded_lag):
max_lag = n - 1
else:
max_lag = max_demanded_lag
pacf_coeffs = list(pacf(x, method="ld", nlags=max_lag))
pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag))
return [("lag_{}".format(lag["lag"]), pacf_coeffs[lag["lag"]]) for lag in param]
extract.partial_autocorrelation(df["Close"])
Stochastic refers to a randomly determined process. Any features trying to capture stochasticity by degree or type are included under this branch.
(i) Augmented Dickey Fuller
The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.
#-> In Package
from numpy.linalg import LinAlgError
from statsmodels.tools.sm_exceptions import MissingDataError
def augmented_dickey_fuller(x, param=[{"attr": "teststat"}]):
res = None
try:
res = adfuller(x)
except LinAlgError:
res = np.NaN, np.NaN, np.NaN
except ValueError: # occurs if sample size is too small
res = np.NaN, np.NaN, np.NaN
except MissingDataError: # is thrown for e.g. inf or nan in the data
res = np.NaN, np.NaN, np.NaN
return [('attr_"{}"'.format(config["attr"]),
res[0] if config["attr"] == "teststat"
else res[1] if config["attr"] == "pvalue"
else res[2] if config["attr"] == "usedlag" else np.NaN)
for config in param]
extract.augmented_dickey_fuller(df["Close"])
(i) Median of Magnitudes Skew
@set_property("fctype", "simple")
@set_property("custom", True)
def gskew(x):
interpolation="nearest"
median_mag = np.median(x)
F_3_value = np.percentile(x, 3, interpolation=interpolation)
F_97_value = np.percentile(x, 97, interpolation=interpolation)
skew = (np.median(x[x <= F_3_value]) +
np.median(x[x >= F_97_value]) - 2 * median_mag)
return skew
extract.gskew(df["Close"])
(ii) Stetson Mean
An iteratively weighted mean used in the Stetson variability index
stestson_param = {"weight":100., "alpha":2., "beta":2., "tol":1.e-6, "nmax":20}
@set_property("fctype", "combiner")
@set_property("custom", True)
def stetson_mean(x, param=stestson_param):
weight= param["weight"]
alpha= param["alpha"]
beta = param["beta"]
tol= param["tol"]
nmax= param["nmax"]
mu = np.median(x)
for i in range(nmax):
resid = x - mu
resid_err = np.abs(resid) * np.sqrt(weight)
weight1 = weight / (1. + (resid_err / alpha)**beta)
weight1 /= weight1.mean()
diff = np.mean(x * weight1) - mu
mu += diff
if (np.abs(diff) < tol*np.abs(mu) or np.abs(diff) < tol):
break
return mu
extract.stetson_mean(df["Close"])
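The effect of the iterative re-weighting is easiest to see on data with an outlier. The following sketch mirrors the loop above using only the standard library; the sample data are made up, and the keyword defaults simply copy the parameter dictionary shown earlier.

```python
from math import sqrt
from statistics import median

def stetson_mean(x, weight=100.0, alpha=2.0, beta=2.0, tol=1e-6, nmax=20):
    """Iteratively re-weighted mean that down-weights points far from the current estimate."""
    mu = median(x)
    for _ in range(nmax):
        resid_err = [abs(v - mu) * sqrt(weight) for v in x]
        w = [weight / (1.0 + (e / alpha) ** beta) for e in resid_err]
        w_mean = sum(w) / len(w)
        w = [wi / w_mean for wi in w]  # normalise so the weights average to one
        diff = sum(v * wi for v, wi in zip(x, w)) / len(x) - mu
        mu += diff
        if abs(diff) < tol * abs(mu) or abs(diff) < tol:
            break
    return mu

data = [1.0, 1.0, 1.0, 1.0, 100.0]  # hypothetical series with one outlier
robust = stetson_mean(data)
plain = sum(data) / len(data)
print(robust, plain)
```

The plain mean is dragged towards the outlier, while the weighted estimate stays close to the bulk of the data.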
(i) Length
#-> In Package
def length(x):
return len(x)
extract.length(df["Close"])
(i) Count Above Mean
Returns the number of values in x that are higher than the mean of x
#-> In Package
def count_above_mean(x):
m = np.mean(x)
return np.where(x > m)[0].size
extract.count_above_mean(df["Close"])
(i) Longest Strike Below Mean
Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x
#-> In Package
import itertools
def get_length_sequences_where(x):
if len(x) == 0:
return [0]
else:
res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1]
return res if len(res) > 0 else [0]
def longest_strike_below_mean(x):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
return np.max(get_length_sequences_where(x <= np.mean(x))) if x.size > 0 else 0
extract.longest_strike_below_mean(df["Close"])
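The run-length logic relies on `itertools.groupby`, which groups consecutive equal flags. A compact sketch of the same idea on a made-up list, keeping the `<=` comparison used above:

```python
from itertools import groupby

def longest_strike_below_mean(x):
    """Length of the longest run of consecutive values that are <= the mean of x."""
    if not x:
        return 0
    mean = sum(x) / len(x)
    runs = [len(list(group)) for flag, group in groupby(v <= mean for v in x) if flag]
    return max(runs, default=0)

values = [5, 1, 2, 9, 1, 1, 1, 8]  # mean is 3.5
print(longest_strike_below_mean(values))
```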
(ii) Wozniak
This is an astronomical feature: we count the number of three consecutive data points that are brighter or fainter than $2σ$ and normalize the number by $N−2$.
woz_param = [{"consecutiveStar": n} for n in [2, 4]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def wozniak(magnitude, param=woz_param):
iters = []
for consecutiveStar in [stars["consecutiveStar"] for stars in param]:
N = len(magnitude)
if N < consecutiveStar:
return 0
sigma = np.std(magnitude)
m = np.mean(magnitude)
count = 0
for i in range(N - consecutiveStar + 1):
flag = 0
for j in range(consecutiveStar):
if(magnitude[i + j] > m + 2 * sigma or
magnitude[i + j] < m - 2 * sigma):
flag = 1
else:
flag = 0
break
if flag:
count = count + 1
iters.append(count * 1.0 / (N - consecutiveStar + 1))
return [("consecutiveStar_{}".format(config["consecutiveStar"]), iters[en] ) for en, config in enumerate(param)]
extract.wozniak(df["Close"])
(i) Last location of Maximum
Returns the relative last location of the maximum value of x.
#-> In Package
def last_location_of_maximum(x):
x = np.asarray(x)
return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.NaN
extract.last_location_of_maximum(df["Close"])
Any coefficients that are obtained from a model that might help in the prediction problem. For example, here we might include the coefficients of the polynomial $h(x)$, which has been fitted to the deterministic dynamics of a Langevin model.
(i) FFT Coefficient
Calculates the fourier coefficients of the one-dimensional discrete Fourier Transform for real input.
#-> In Package
def fft_coefficient(x, param = [{"coeff": 10, "attr": "real"}]):
assert min([config["coeff"] for config in param]) >= 0, "Coefficients must be positive or zero."
assert set([config["attr"] for config in param]) <= set(["imag", "real", "abs", "angle"]), \
'Attribute must be "real", "imag", "angle" or "abs"'
fft = np.fft.rfft(x)
def complex_agg(x, agg):
if agg == "real":
return x.real
elif agg == "imag":
return x.imag
elif agg == "abs":
return np.abs(x)
elif agg == "angle":
return np.angle(x, deg=True)
res = [complex_agg(fft[config["coeff"]], config["attr"]) if config["coeff"] < len(fft)
else np.NaN for config in param]
index = [('coeff_{}__attr_"{}"'.format(config["coeff"], config["attr"]), value) for config, value in zip(param, res)]
return index
extract.fft_coefficient(df["Close"])
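To see what a single Fourier coefficient captures, the discrete Fourier sum $X_k = \sum_n x_n e^{-2\pi i k n / N}$ can be evaluated directly on a synthetic wave. This sketch uses a naive sum from the standard library rather than `np.fft.rfft`; the helper name `dft_coefficient` and the test signal are assumptions made for illustration.

```python
import cmath
from math import cos, pi

def dft_coefficient(x, k):
    """k-th coefficient of the discrete Fourier transform, computed by direct summation."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))

n = 8
wave = [cos(2 * pi * 2 * i / n) for i in range(n)]  # two full cycles over eight samples
c1 = dft_coefficient(wave, 1)
c2 = dft_coefficient(wave, 2)
print(abs(c1), abs(c2))
```

Only the coefficient whose index matches the number of cycles is non-zero (it equals N/2 for a unit-amplitude cosine), which is why individual coefficients can serve as features for periodicity at a given frequency.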
(ii) AR Coefficient
This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.
#-> In Package
from numpy.linalg import LinAlgError
from statsmodels.tsa.ar_model import AR
def ar_coefficient(x, param=[{"coeff": 5, "k": 5}]):
calculated_ar_params = {}
x_as_list = list(x)
calculated_AR = AR(x_as_list)
res = {}
for parameter_combination in param:
k = parameter_combination["k"]
p = parameter_combination["coeff"]
column_name = "k_{}__coeff_{}".format(k, p)
if k not in calculated_ar_params:
try:
calculated_ar_params[k] = calculated_AR.fit(maxlag=k, solver="mle").params
except (np.linalg.LinAlgError, ValueError):
calculated_ar_params[k] = [np.NaN]*k
mod = calculated_ar_params[k]
if p <= k:
try:
res[column_name] = mod[p]
except IndexError:
res[column_name] = 0
else:
res[column_name] = np.NaN
return [(key, value) for key, value in res.items()]
extract.ar_coefficient(df["Close"])
This includes finding normal quantile values in the series, but also quantile derived measures like change quantiles and index max quantiles.
(i) Index Mass Quantile
The relative index $i$ where $q\%$ of the mass of the time series $x$ lies to the left of $i$.
#-> In Package
def index_mass_quantile(x, param=[{"q": 0.3}]):
x = np.asarray(x)
abs_x = np.abs(x)
s = sum(abs_x)
if s == 0:
# all values in x are zero or it has length 0
return [("q_{}".format(config["q"]), np.NaN) for config in param]
else:
# at least one value is not zero
mass_centralized = np.cumsum(abs_x) / s
return [("q_{}".format(config["q"]), (np.argmax(mass_centralized >= config["q"])+1)/len(x)) for config in param]
extract.index_mass_quantile(df["Close"])
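The cumulative-mass idea can be checked by hand on a short series. The sketch below is a single-quantile variant written for illustration; the helper name and the toy data are assumptions, not part of the package.

```python
import numpy as np

def index_mass_quantile_single(x, q):
    """Relative index at which a fraction q of the absolute mass is reached."""
    mass = np.abs(np.asarray(x, dtype=float))
    total = mass.sum()
    if total == 0:
        return np.nan
    cumulative = np.cumsum(mass) / total
    # argmax returns the first position where the condition becomes True
    return (np.argmax(cumulative >= q) + 1) / len(mass)

toy = [1, 1, 1, 1, 6]   # cumulative mass: 0.1, 0.2, 0.3, 0.4, 1.0
early = index_mass_quantile_single(toy, 0.25)
late = index_mass_quantile_single(toy, 0.5)
print(early, late)
```

Because most of the mass sits in the final observation, the 50% quantile is only reached at the very end of the series.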
(i) Number of CWT Peaks
This feature calculator searches for different peaks in x.
from scipy.signal import cwt, find_peaks_cwt, ricker, welch
cwt_param = [ka for ka in [2,6,9]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def number_cwt_peaks(x, param=cwt_param):
return [("CWTPeak_{}".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param]
extract.number_cwt_peaks(df["Close"])
The density, and more specifically the power spectral density of the signal describes the power present in the signal as a function of frequency, per unit frequency.
(i) Cross Power Spectral Density
This feature calculator estimates the cross power spectral density of the time series $x$ at different frequencies.
#-> In Package
def spkt_welch_density(x, param=[{"coeff": 5}]):
freq, pxx = welch(x, nperseg=min(len(x), 256))
coeff = [config["coeff"] for config in param]
indices = ["coeff_{}".format(i) for i in coeff]
if len(pxx) <= np.max(coeff): # There are fewer data points in the time series than requested coefficients
# filter coefficients that are not contained in pxx
reduced_coeff = [coefficient for coefficient in coeff if len(pxx) > coefficient]
not_calculated_coefficients = [coefficient for coefficient in coeff
if coefficient not in reduced_coeff]
# Fill up the rest of the requested coefficients with np.NaNs
return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients))
else:
return pxx[coeff].ravel()[0]
extract.spkt_welch_density(df["Close"])
Any measure of linearity that might make use of something like the linear least-squares regression for the values of the time series. The regression can be run against the sequence from 0 to the length of the time series minus one, among many other alternatives.
(i) Linear Trend Time Wise
Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one.
from scipy.stats import linregress
#-> In Package
def linear_trend_timewise(x, param= [{"attr": "pvalue"}]):
ix = x.index
# Get differences between each timestamp and the first timestamp in seconds.
# Then convert to hours and reshape for linear regression
times_seconds = (ix - ix[0]).total_seconds()
times_hours = np.asarray(times_seconds / float(3600))
linReg = linregress(times_hours, x.values)
return [("attr_\"{}\"".format(config["attr"]), getattr(linReg, config["attr"]))
for config in param]
extract.linear_trend_timewise(df["Close"])
(i) Schreiber Non-Linearity
#-> In Package
def c3(x, lag=3):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
n = x.size
if 2 * lag >= n:
return 0
else:
return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)])
extract.c3(df["Close"])
Any feature looking at the complexity of a time series. This is typically used in medical signal disciplines (EEG, EMG). There are multiple types of measures, such as spectral entropy, permutation entropy, sample entropy, approximate entropy, Lempel-Ziv complexity and others. This includes entropy measures and their derivations.
(i) Binned Entropy
Bins the values of x into max_bins equidistant bins.
#-> In Package
def binned_entropy(x, max_bins=10):
if not isinstance(x, (np.ndarray, pd.Series)):
x = np.asarray(x)
hist, bin_edges = np.histogram(x, bins=max_bins)
probs = hist / x.size
return -np.sum(probs[probs != 0] * np.log(probs[probs != 0]))
extract.binned_entropy(df["Close"])
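Two extreme cases help to read the entropy value: a constant series falls into a single bin and gives zero entropy, while values spread evenly over the bins give the maximum, the logarithm of the number of bins. The sketch below illustrates this on synthetic data; the function name and inputs are assumptions for illustration.

```python
import numpy as np

def binned_entropy_sketch(x, max_bins=10):
    """Shannon entropy (natural log) of the histogram of x."""
    x = np.asarray(x, dtype=float)
    hist, _ = np.histogram(x, bins=max_bins)
    probs = hist / x.size
    probs = probs[probs > 0]            # empty bins contribute nothing
    return float(-np.sum(probs * np.log(probs)))

flat = binned_entropy_sketch(np.zeros(100), max_bins=4)
spread = binned_entropy_sketch(np.repeat([0.0, 1.0, 2.0, 3.0], 25), max_bins=4)
print(flat, spread, np.log(4))
```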
(ii) SVD Entropy
SVD entropy is an indicator of the number of eigenvectors that are needed for an adequate explanation of the data set.
svd_param = [{"Tau": ta, "DE": de}
for ta in [4]
for de in [3,6]]
def _embed_seq(X,Tau,D):
N =len(X)
if D * Tau > N:
print("Cannot build such a matrix, because D * Tau > N")
exit()
if Tau<1:
print("Tau has to be at least 1")
exit()
Y= np.zeros((N - (D - 1) * Tau, D))
for i in range(0, N - (D - 1) * Tau):
for j in range(0, D):
Y[i][j] = X[i + j * Tau]
return Y
@set_property("fctype", "combiner")
@set_property("custom", True)
def svd_entropy(epochs, param=svd_param):
axis=0
final = []
for par in param:
def svd_entropy_1d(X, Tau, DE):
Y = _embed_seq(X, Tau, DE)
W = np.linalg.svd(Y, compute_uv=0)
W /= sum(W) # normalize singular values
return -1 * np.sum(W * np.log(W))
Tau = par["Tau"]
DE = par["DE"]
final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0])
return [("Tau_\"{}\"__De_{}\"".format(par["Tau"], par["DE"]), final[en]) for en, par in enumerate(param)]
extract.svd_entropy(df["Close"].values)
(iii) Hjorth
The Complexity parameter represents the change in frequency. The parameter compares the signal's similarity to a pure sine wave, where the value converges to 1 if the signal is more similar.
def _hjorth_mobility(epochs):
diff = np.diff(epochs, axis=0)
sigma0 = np.std(epochs, axis=0)
sigma1 = np.std(diff, axis=0)
return np.divide(sigma1, sigma0)
@set_property("fctype", "simple")
@set_property("custom", True)
def hjorth_complexity(epochs):
diff1 = np.diff(epochs, axis=0)
diff2 = np.diff(diff1, axis=0)
sigma1 = np.std(diff1, axis=0)
sigma2 = np.std(diff2, axis=0)
return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs))
extract.hjorth_complexity(df["Close"])
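The claim that the complexity approaches 1 for a sine-like signal can be checked numerically. The following is a minimal one-dimensional sketch on synthetic data, assuming a finely sampled sine and Gaussian white noise as the two test signals; neither comes from the package.

```python
import numpy as np

def hjorth_mobility(x):
    return np.std(np.diff(x)) / np.std(x)

def hjorth_complexity(x):
    # Mobility of the first difference divided by mobility of the signal itself.
    return hjorth_mobility(np.diff(x)) / hjorth_mobility(x)

t = np.linspace(0.0, 20.0 * np.pi, 2000, endpoint=False)  # ten full periods
sine = np.sin(t)
noise = np.random.default_rng(0).normal(size=2000)

c_sine = hjorth_complexity(sine)
c_noise = hjorth_complexity(noise)
print(c_sine, c_noise)
```

The sine lands very close to 1, while white noise sits noticeably higher.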
Fixed points and equilibria as identified from fitted models.
(i) Langevin Fixed Points
Largest fixed point of the dynamics, $\max\{x : h(x)=0\}$, estimated from the polynomial $h(x)$ which has been fitted to the deterministic dynamics of the Langevin model.
#-> In Package
def _estimate_friedrich_coefficients(x, m, r):
assert m > 0, "Order of polynomial need to be positive integer, found {}".format(m)
df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)})
try:
df['quantiles'] = pd.qcut(df.signal, r)
except ValueError:
return [np.NaN] * (m + 1)
quantiles = df.groupby('quantiles')
result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()})
result.dropna(inplace=True)
try:
return np.polyfit(result.x_mean, result.y_mean, deg=m)
except (np.linalg.LinAlgError, ValueError):
return [np.NaN] * (m + 1)
def max_langevin_fixed_point(x, r=3, m=30):
coeff = _estimate_friedrich_coefficients(x, m, r)
try:
max_fixed_point = np.max(np.real(np.roots(coeff)))
except (np.linalg.LinAlgError, ValueError):
return np.nan
return max_fixed_point
extract.max_langevin_fixed_point(df["Close"])
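The final step, taking the largest root of the fitted polynomial, can be illustrated in isolation. The drift polynomial below is a hypothetical example chosen for its known roots; it is not fitted to any data.

```python
import numpy as np

# Hypothetical drift polynomial h(x) = -x**3 + x, highest power first.
coeff = [-1.0, 0.0, 1.0, 0.0]
roots = np.roots(coeff)                 # fixed points satisfy h(x) = 0
max_fixed_point = float(np.max(np.real(roots)))
print(sorted(np.real(roots)), max_fixed_point)
```

Note that taking the real part of every root also admits complex roots, so for a noisy fit it may be worth checking that the selected root is actually real.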
Features derived from peaked values in either the positive or negative direction.
(i) Willison Amplitude
This feature is defined as the number of times that the change in the signal amplitude exceeds a threshold.
will_param = [ka for ka in [0.2,3]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def willison_amplitude(X, param=will_param):
return [("Thresh_{}".format(n),np.sum(np.abs(np.diff(X)) >= n)) for n in param]
extract.willison_amplitude(df["Close"])
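A short hand-checkable case shows how the threshold changes the count. This sketch mirrors the calculation above on toy data; the values and thresholds are assumptions chosen for illustration.

```python
import numpy as np

def willison_amplitude(x, thresholds):
    """Count absolute step changes that reach each threshold."""
    steps = np.abs(np.diff(np.asarray(x, dtype=float)))
    return [("Thresh_{}".format(th), int(np.sum(steps >= th))) for th in thresholds]

toy = [1.0, 1.1, 3.0, 2.9, 0.0]   # absolute steps: 0.1, 1.9, 0.1, 2.9
counts = willison_amplitude(toy, [0.2, 2])
print(counts)
```

Two steps reach the 0.2 threshold, and only the final drop reaches the threshold of 2.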
(ii) Percent Amplitude
Returns the largest distance from the median value, measured as a percentage of the median
perc_param = [{"base":ba, "exponent":exp} for ba in [3,5] for exp in [-0.1,-0.2]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def percent_amplitude(x, param =perc_param):
final = []
for par in param:
linear_scale_data = par["base"] ** (par["exponent"] * x)
y_max = np.max(linear_scale_data)
y_min = np.min(linear_scale_data)
y_med = np.median(linear_scale_data)
final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med)))
return [("Base_{}__Exp{}".format(pa["base"],pa["exponent"]),fin) for fin, pa in zip(final,param)]
extract.percent_amplitude(df["Close"])
(i) Cadence Probability
Given the observed distribution of time lags cads, compute the probability that the next observation occurs within time minutes of an arbitrary epoch.
#-> fixes required
import scipy.stats as stats
cad_param = [0.1,1000, -234]
@set_property("fctype", "combiner")
@set_property("custom", True)
def cad_prob(cads, param=cad_param):
return [("time_{}".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for time in param]
extract.cad_prob(df["Close"])
Calculates the crossing of the series with other defined values or series.
(i) Zero Crossing Derivative
The positioning of the edge point is located at the zero crossing of the first derivative of the filter.
zero_param = [0.01, 8]
@set_property("fctype", "combiner")
@set_property("custom", True)
def zero_crossing_derivative(epochs, param=zero_param):
diff = np.diff(epochs)
norm = diff-diff.mean()
return [("e_{}".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] <= e) & (epoch[5:] > e))), 0, norm).ravel()[0]) for e in param]
extract.zero_crossing_derivative(df["Close"])
These features are again from medical signal sciences, but under this category we would include values such as fluctuation based entropy measures, fluctuation of correlation dynamics, and co-fluctuations.
(i) Detrended Fluctuation Analysis (DFA)
Calculates the Hurst exponent using detrended fluctuation analysis (DFA).
from scipy.stats import kurtosis as _kurt
from scipy.stats import skew as _skew
import numpy as np
@set_property("fctype", "simple")
@set_property("custom", True)
def detrended_fluctuation_analysis(epochs):
def dfa_1d(X, Ave=None, L=None):
X = np.array(X)
if Ave is None:
Ave = np.mean(X)
Y = np.cumsum(X)
Y -= Ave
if L is None:
L = np.floor(len(X) * 1 / (
2 ** np.array(list(range(1, int(np.log2(len(X))) - 4))))
)
F = np.zeros(len(L)) # F(n) of different given box length n
for i in range(0, len(L)):
n = int(L[i]) # for each box length L[i]
if n == 0:
print("time series is too short while the box length is too big")
print("abort")
exit()
for j in range(0, len(X), n): # for each box
if j + n < len(X):
c = list(range(j, j + n))
# coordinates of time in the box
c = np.vstack([c, np.ones(n)]).T
# the value of data in the box
y = Y[j:j + n]
# add residue in this box
F[i] += np.linalg.lstsq(c, y, rcond=None)[1]
F[i] /= ((len(X) / n) * n)
F = np.sqrt(F)
stacked = np.vstack([np.log(L), np.ones(len(L))])
stacked_t = stacked.T
Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None)
return Alpha[0][0]
return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0]
extract.detrended_fluctuation_analysis(df["Close"])
Closely related to entropy and complexity measures. Any measure that attempts to measure the amount of information from an observable variable is included here.
(i) Fisher Information
Fisher information is a statistical information concept distinct from, and earlier than, Shannon information in communication theory.
def _embed_seq(X, Tau, D):
shape = (X.size - Tau * (D - 1), D)
strides = (X.itemsize, Tau * X.itemsize)
return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)
fisher_param = [{"Tau":ta, "DE":de} for ta in [3,15] for de in [10,5]]
@set_property("fctype", "combiner")
@set_property("custom", True)
def fisher_information(epochs, param=fisher_param):
def fisher_info_1d(a, tau, de):
# taken from pyeeg improvements
mat = _embed_seq(a, tau, de)
W = np.linalg.svd(mat, compute_uv=False)
W /= sum(W) # normalize singular values
FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1]
return np.sum(FI_v)
return [("Tau_{}__DE_{}".format(par["Tau"], par["DE"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par["Tau"], par["DE"]).ravel()[0]) for par in param]
extract.fisher_information(df["Close"])
In mathematics, more specifically in fractal geometry, a fractal dimension is a ratio providing a statistical index of complexity comparing how detail in a pattern (strictly speaking, a fractal pattern) changes with the scale at which it is measured.
(i) Higuchi Fractal
Compute a Higuchi Fractal Dimension of a time series.
hig_para = [{"Kmax": 3},{"Kmax": 5}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def higuchi_fractal_dimension(epochs, param=hig_para):
def hfd_1d(X, Kmax):
L = []
x = []
N = len(X)
for k in range(1, Kmax):
Lk = []
for m in range(0, k):
Lmk = 0
for i in range(1, int(np.floor((N - m) / k))):
Lmk += abs(X[m + i * k] - X[m + i * k - k])
Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k
Lk.append(Lmk)
L.append(np.log(np.mean(Lk)))
x.append([np.log(float(1) / k), 1])
(p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None)
return p[0]
return [("Kmax_{}".format(config["Kmax"]), np.apply_along_axis(hfd_1d, 0, epochs, config["Kmax"]).ravel()[0] ) for config in param]
extract.higuchi_fractal_dimension(df["Close"])
(ii) Petrosian Fractal
Compute a Petrosian Fractal Dimension of a time series.
@set_property("fctype", "simple")
@set_property("custom", True)
def petrosian_fractal_dimension(epochs):
def pfd_1d(X, D=None):
# taken from pyeeg
"""Compute Petrosian Fractal Dimension of a time series from either two
cases below:
1. X, the time series of type list (default)
2. D, the first order differential sequence of X (if D is provided,
recommended to speed up)
In case 1, D is computed using Numpy's difference function.
To speed up, it is recommended to compute D before calling this function
because D may also be used by other functions whereas computing it here
again will slow down.
"""
if D is None:
D = np.diff(X)
D = D.tolist()
N_delta = 0 # number of sign changes in derivative of the signal
for i in range(1, len(D)):
if D[i] * D[i - 1] < 0:
N_delta += 1
n = len(X)
return np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * N_delta)))
return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0]
extract.petrosian_fractal_dimension(df["Close"])
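For reference, the commonly cited Petrosian definition is $\log_{10} n / (\log_{10} n + \log_{10}(n / (n + 0.4 N_\delta)))$, where $N_\delta$ is the number of sign changes in the first difference. The sketch below is a vectorised one-dimensional version of that definition, run on two synthetic signals that are assumptions for illustration.

```python
import numpy as np

def petrosian_fd(x):
    """Petrosian fractal dimension of a 1-D series."""
    x = np.asarray(x, dtype=float)
    diff = np.diff(x)
    # Count sign changes between consecutive first differences.
    n_delta = int(np.sum(diff[1:] * diff[:-1] < 0))
    n = len(x)
    return float(np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * n_delta))))

monotone = petrosian_fd(np.arange(100))            # no sign changes
zigzag = petrosian_fd(np.tile([0.0, 1.0], 50))     # sign changes at every step
print(monotone, zigzag)
```

A monotone series has no sign changes and gives exactly 1, while a series that reverses direction at every step gives a value above 1.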
(i) Hurst Exponent
The Hurst exponent is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.
@set_property("fctype", "simple")
@set_property("custom", True)
def hurst_exponent(epochs):
def hurst_1d(X):
X = np.array(X)
N = X.size
T = np.arange(1, N + 1)
Y = np.cumsum(X)
Ave_T = Y / T
S_T = np.zeros(N)
R_T = np.zeros(N)
for i in range(N):
S_T[i] = np.std(X[:i + 1])
X_T = Y - T * Ave_T[i]
R_T[i] = np.ptp(X_T[:i + 1])
for i in range(1, len(S_T)):
if np.diff(S_T)[i - 1] != 0:
break
for j in range(1, len(R_T)):
if np.diff(R_T)[j - 1] != 0:
break
k = max(i, j)
assert k < 10, "rethink it!"
R_S = R_T[k:] / S_T[k:]
R_S = np.log(R_S)
n = np.log(T)[k:]
A = np.column_stack((n, np.ones(n.size)))
[m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0]
H = m
return H
return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0]
extract.hurst_exponent(df["Close"])
(ii) Largest Lyapunov Exponent
In mathematics the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories.
def _embed_seq(X, Tau, D):
shape = (X.size - Tau * (D - 1), D)
strides = (X.itemsize, Tau * X.itemsize)
return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)
lyaup_param = [{"Tau":4, "n":3, "T":10, "fs":9},{"Tau":8, "n":7, "T":15, "fs":6}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def largest_lyauponov_exponent(epochs, param=lyaup_param):
def LLE_1d(x, tau, n, T, fs):
Em = _embed_seq(x, tau, n)
M = len(Em)
A = np.tile(Em, (len(Em), 1, 1))
B = np.transpose(A, [1, 0, 2])
square_dists = (A - B) ** 2 # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2
D = np.sqrt(square_dists[:, :, :].sum(axis=2)) # D[i,j] = ||Em[i]-Em[j]||_2
# Exclude elements within T of the diagonal
band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1)
band[band == 1] = np.inf
neighbors = (D + band).argmin(axis=0) # nearest neighbors more than T steps away
# in_bounds[i,j] = (i+j <= M-1 and i+neighbors[j] <= M-1)
inc = np.tile(np.arange(M), (M, 1))
row_inds = (np.tile(np.arange(M), (M, 1)).T + inc)
col_inds = (np.tile(neighbors, (M, 1)) + inc.T)
in_bounds = np.logical_and(row_inds <= M - 1, col_inds <= M - 1)
# Uncomment for old (miscounted) version
# in_bounds = numpy.logical_and(row_inds < M - 1, col_inds < M - 1)
row_inds[~in_bounds] = 0
col_inds[~in_bounds] = 0
# neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2
neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds)
J = (~neighbor_dists.mask).sum(axis=1) # number of in-bounds indices by row
# Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged
neighbor_dists[neighbor_dists == 0] = 1
# !!! this fixes the divide by zero in log error !!!
neighbor_dists.data[neighbor_dists.data == 0] = 1
d_ij = np.sum(np.log(neighbor_dists.data), axis=1)
mean_d = d_ij[J > 0] / J[J > 0]
x = np.arange(len(mean_d))
X = np.vstack((x, np.ones(len(mean_d)))).T
[m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0]
Lexp = fs * m
return Lexp
return [("Tau_{}__n_{}__T_{}__fs_{}".format(par["Tau"], par["n"], par["T"], par["fs"]), np.apply_along_axis(LLE_1d, 0, epochs, par["Tau"], par["n"], par["T"], par["fs"]).ravel()[0]) for par in param]
extract.largest_lyauponov_exponent(df["Close"])
Spectral analysis is analysis in terms of a spectrum of frequencies or related quantities such as energies, eigenvalues, etc.
(i) Welch Method
The Welch method is an approach for spectral density estimation. It is used in physics, engineering, and applied mathematics for estimating the power of a signal at different frequencies.
from scipy import signal, integrate
whelch_param = [100,200]
@set_property("fctype", "combiner")
@set_property("custom", True)
def whelch_method(data, param=whelch_param):
final = []
for Fs in param:
f, pxx = signal.welch(data, fs=Fs, nperseg=1024)
d = {'psd': pxx, 'freqs': f}
df = pd.DataFrame(data=d)
dfs = df.sort_values(['psd'], ascending=False)
rows = dfs.iloc[:10]
final.append(rows['freqs'].mean())
return [("Fs_{}".format(pa),fin) for pa, fin in zip(param,final)]
extract.whelch_method(df["Close"])
#-> Basically same as above
freq_param = [{"fs":50, "sel":15},{"fs":200, "sel":20}]
@set_property("fctype", "combiner")
@set_property("custom", True)
def find_freq(serie, param=freq_param):
final = []
for par in param:
fft0 = np.fft.rfft(serie*np.hanning(len(serie)))
freqs = np.fft.rfftfreq(len(serie), d=1.0/par["fs"])
fftmod = np.array([np.sqrt(fft0[i].real**2 + fft0[i].imag**2) for i in range(0, len(fft0))])
d = {'fft': fftmod, 'freq': freqs}
df = pd.DataFrame(d)
hop = df.sort_values(['fft'], ascending=False)
rows = hop.iloc[:par["sel"]]
final.append(rows['freq'].mean())
return [("Fs_{}__sel{}".format(pa["fs"],pa["sel"]),fin) for pa, fin in zip(param,final)]
extract.find_freq(df["Close"])
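To see what the frequency features above report, it helps to run the same windowed-FFT steps on a signal whose dominant frequency is known. The sketch below assumes a synthetic 5 Hz tone sampled at 50 Hz; the sampling rate only makes sense when the series really is sampled at a known, regular rate.

```python
import numpy as np

fs = 50.0                                   # assumed sampling rate in Hz
t = np.arange(200) / fs                     # four seconds of samples
series = np.sin(2.0 * np.pi * 5.0 * t)      # synthetic 5 Hz tone

# Hann window reduces leakage before taking the magnitude spectrum.
spectrum = np.abs(np.fft.rfft(series * np.hanning(len(series))))
freqs = np.fft.rfftfreq(len(series), d=1.0 / fs)

top = np.argsort(spectrum)[::-1][:3]        # indices of the three largest magnitudes
dominant = float(freqs[top[0]])
mean_top3 = float(freqs[top].mean())
print(dominant, mean_top3)
```

Averaging the frequencies of the strongest bins, as the feature does, stays close to the true tone here because the largest magnitudes cluster around it.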
(i) Flux Percentile
Flux (or radiant flux) is the total amount of energy that crosses a unit area per unit time. Flux is an astronomical value, measured in joules per square metre per second (joules/m2/s), or watts per square metre. Here we provide the ratio of flux percentiles.
#-> In Package
import math
def flux_perc(magnitude):
sorted_data = np.sort(magnitude)
lc_length = len(sorted_data)
F_60_index = int(math.ceil(0.60 * lc_length))
F_40_index = int(math.ceil(0.40 * lc_length))
F_5_index = int(math.ceil(0.05 * lc_length))
F_95_index = int(math.ceil(0.95 * lc_length))
F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index]
F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index]
F_mid20 = F_40_60 / F_5_95
return {"FluxPercentileRatioMid20": F_mid20}
extract.flux_perc(df["Close"])
(i) Range of Cumulative Sum
@set_property("fctype", "simple")
@set_property("custom", True)
def range_cum_s(magnitude):
sigma = np.std(magnitude)
N = len(magnitude)
m = np.mean(magnitude)
s = np.cumsum(magnitude - m) * 1.0 / (N * sigma)
R = np.max(s) - np.min(s)
return {"Rcs": R}
extract.range_cum_s(df["Close"])
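The statistic is the range of the normalised cumulative sum of deviations from the mean, so a persistent trend produces a larger value than a series that keeps crossing its mean. A minimal sketch on two toy series (assumptions for illustration) shows the difference.

```python
import numpy as np

def range_cum_sum(magnitude):
    """Range of the cumulative sum of mean deviations, scaled by N * sigma."""
    magnitude = np.asarray(magnitude, dtype=float)
    n = len(magnitude)
    sigma = np.std(magnitude)
    s = np.cumsum(magnitude - magnitude.mean()) / (n * sigma)
    return float(np.max(s) - np.min(s))

trend = range_cum_sum([1, 2, 3, 4])     # steadily increasing
zigzag = range_cum_sum([1, 4, 2, 3])    # same values, alternating around the mean
print(trend, zigzag)
```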
Structural features, potential placeholders for future research.
(i) Structure Function
The structure function of rotation measures (RMs) contains information on electron density and magnetic field fluctuations when used in astronomy. It becomes a custom feature when used with your own unique time series data.
from scipy.interpolate import interp1d
struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
@set_property("fctype", "combiner")
@set_property("custom", True)
def structure_func(time, param=struct_param):
dict_final = {}
for key, magnitude in param.items():
dict_final[key] = []
Nsf, Np = 100, 100
sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf)
f = interp1d(time, magnitude)
time_int = np.linspace(np.min(time), np.max(time), Np)
mag_int = f(time_int)
for tau in np.arange(1, Nsf):
sf1[tau - 1] = np.mean(
np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0))
sf2[tau - 1] = np.mean(
np.abs(np.power(
np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0)))
sf3[tau - 1] = np.mean(
np.abs(np.power(
np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0)))
sf1_log = np.log10(np.trim_zeros(sf1))
sf2_log = np.log10(np.trim_zeros(sf2))
sf3_log = np.log10(np.trim_zeros(sf3))
if len(sf1_log) and len(sf2_log):
m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1)
else:
m_21 = np.nan
if len(sf1_log) and len(sf3_log):
m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1)
else:
m_31 = np.nan
if len(sf2_log) and len(sf3_log):
m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1)
else:
m_32 = np.nan
dict_final[key].append(m_21)
dict_final[key].append(m_31)
dict_final[key].append(m_32)
return [("StructureFunction_{}__m_{}".format(key, name), li) for key, lis in dict_final.items() for name, li in zip([21,31,32], lis)]
struct_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
extract.structure_func(df["Close"],struct_param)
(i) Kurtosis
#-> In Package
def kurtosis(x):
if not isinstance(x, pd.Series):
x = pd.Series(x)
return pd.Series.kurtosis(x)
extract.kurtosis(df["Close"])
(ii) Stetson Kurtosis
@set_property("fctype", "simple")
@set_property("custom", True)
def stetson_k(x):
"""A robust kurtosis statistic."""
n = len(x)
x0 = stetson_mean(x, 1./20**2)
delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20
ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2))
return ta
extract.stetson_k(df["Close"])
Time-series synthesisation (TSS) happens before the feature extraction step, and cross-sectional synthesisation (CSS) happens after it. For now I only include a CSS package; I plan to develop this section further in the future. This area still has many performance and stability issues, but it may eventually become a more viable way to improve prediction.
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
def model(df_final):
    """Fit a LightGBM regressor and return the MSE on the first 40% of rows."""
    model = LGBMRegressor()
    # The first 40% of rows form the test set; the remaining rows form the training set.
    test = df_final.head(int(len(df_final)*0.4))
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"])
    preds = model.predict(test.drop(["Close_1"],axis=1))
    val = mean_squared_error(test["Close_1"],preds)
    return val
pip install ctgan
from ctgan import CTGANSynthesizer
#discrete_columns = [""]
ctgan = CTGANSynthesizer()
ctgan.fit(df,epochs=10) #15
Random Benchmark
np.random.seed(1)
df_in = df.copy()
df_in["Close_1"] = np.random.permutation(df_in["Close_1"].values)
model(df_in)
Generated Performance
df_gen = ctgan.sample(len(df_in)*100)
model(df_gen)
As expected, a cross-sectional technique does not work well on time-series data. Other methods will be investigated in the future.
Here I will perform tabular augmentation methods on a small dataset with a single-digit number of features and around 250 instances. This is not necessarily the best-sized dataset to highlight the performance of tabular augmentation, as some methods, like extraction, would be overkill and would lead to dimensionality problems. It is also good to know that there is a nearly unlimited number of ways to perform these augmentation methods. In the future, automated augmentation methods can guide the experiment process.
The approach taken in this skeleton is to develop running models that are tested after each augmentation to highlight which methods might work well on this particular dataset. The metric we will use is mean squared error. In this implementation we do not have special hold-out sets.
The above framework of implementation will be consulted, but one still has to be strategic about when to apply which function, and you have to make sure that you process your data with appropriate techniques (drop null values, fill null values) at the appropriate time.
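The test-after-each-augmentation approach described above can be sketched as a small loop. This is an illustration only: it assumes an ordinary least-squares stand-in rather than LightGBM, synthetic data, and two hypothetical augmentation steps (`add_product`, `add_noise_column`), none of which belong to the package.

```python
import numpy as np

def holdout_mse(features, target, test_frac=0.4):
    """Fit least squares on the later rows and score MSE on the first rows."""
    n_test = int(len(target) * test_frac)
    X = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X[n_test:], target[n_test:], rcond=None)
    preds = X[:n_test] @ coef
    return float(np.mean((target[:n_test] - preds) ** 2))

rng = np.random.default_rng(0)
base = rng.normal(size=(250, 2))
target = base[:, 0] * base[:, 1] + 0.1 * rng.normal(size=250)

# Hypothetical augmentation steps: each maps a feature matrix to a new one.
steps = {
    "add_product": lambda f: np.column_stack([f, f[:, 0] * f[:, 1]]),
    "add_noise_column": lambda f: np.column_stack([f, rng.normal(size=len(f))]),
}

current, best_score = base, holdout_mse(base, target)
history = [("baseline", best_score)]
for name, step in steps.items():
    candidate = step(current)
    score = holdout_mse(candidate, target)
    history.append((name, score))
    if score < best_score:          # keep the step only if it lowers the error
        current, best_score = candidate, score
print(history)
```

Recording the score after every step makes it easy to see which augmentations helped and to roll back the ones that only added noise.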
Develop Model and Define Metric
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
def model(df_final):
    """Fit a LightGBM regressor and return the MSE on the first 40% of rows."""
    model = LGBMRegressor()
    # The first 40% of rows form the test set; the remaining rows form the training set.
    test = df_final.head(int(len(df_final)*0.4))
    train = df_final[~df_final.isin(test)].dropna()
    model = model.fit(train.drop(["Close_1"],axis=1),train["Close_1"])
    preds = model.predict(test.drop(["Close_1"],axis=1))
    val = mean_squared_error(test["Close_1"],preds)
    return val
Reload Data
df = data_copy()
model(df)
302.61676570345287
(1) (7) (i) Transformation - Decomposition - Naive
## If Inferred Seasonality is Too Large Default to Five
seasons = transform.infer_seasonality(df["Close"],index=0)
df_out = transform.naive_dec(df.copy(), ["Close","Open"], freq=5)
model(df_out) #improvement
274.34477082783525
(1) (8) (i) Transformation - Filter - Baxter-King-Bandpass
df_out = transform.bkb(df_out, ["Close","Low"])
df_best = df_out.copy()
model(df_out) #improvement
267.1826850968307
(1) (3) (i) Transformation - Differentiation - Fractional
df_out = transform.fast_fracdiff(df_out, ["Close_BPF"],0.5)
model(df_out) #null
267.7083192402742
(1) (1) (i) Transformation - Scaling - Robust Scaler
df_out = df_out.dropna()
df_out = transform.robust_scaler(df_out, drop=["Close_1"])
model(df_out) #noisy
270.96980399571214
(2) (2) (i) Interactions - Operator - Multiplication/Division
df_out.head()
Date | Close_1 | High | Low | Open | Close | Volume | Adj Close | Close_NDDT | Close_NDDS | Close_NDDR | Open_NDDT | Open_NDDS | Open_NDDR | Close_BPF | Low_BPF | Close_BPF_frac |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-01-08 | 338.529999 | 1.018413 | 0.964048 | 1.096600 | 1.001175 | -0.162616 | 1.001175 | 0.832297 | 0.834964 | 1.335433 | 0.758743 | 0.691596 | 2.259884 | -2.534142 | -2.249135 | -3.593612 |
2019-01-09 | 344.970001 | 1.012068 | 1.023302 | 1.011466 | 1.042689 | -0.501798 | 1.042689 | 0.908963 | -0.165036 | 1.111346 | 0.835786 | 0.333361 | 1.129783 | -3.081959 | -2.776302 | -2.523465 |
2019-01-10 | 347.260010 | 1.035581 | 1.027563 | 0.996969 | 1.126762 | -0.367576 | 1.126762 | 1.029347 | 2.120026 | 0.853697 | 0.907588 | 0.000000 | 0.533777 | -2.052768 | -2.543449 | -0.747382 |
2019-01-11 | 334.399994 | 1.073153 | 1.120506 | 1.098313 | 1.156658 | -0.586571 | 1.156658 | 1.109144 | -5.156051 | 0.591990 | 1.002162 | -0.666639 | 0.608516 | -0.694642 | -0.831670 | 0.414063 |
2019-01-14 | 344.429993 | 0.999627 | 1.056991 | 1.102135 | 0.988773 | -0.541752 | 0.988773 | 1.107633 | 0.000000 | -0.660350 | 1.056302 | -0.915491 | 0.263025 | -0.645590 | -0.116166 | -0.118012 |
df_out = interact.muldiv(df_out, ["Close","Open_NDDS","Low_BPF"])
model(df_out) #noisy
285.6420643864313
df_r = df_out.copy()
(2) (6) (i) Interactions - Speciality - Technical
import ta
df = interact.tech(df)
df_out = pd.merge(df_out, df.iloc[:,7:], left_index=True, right_index=True, how="left")
Clean Dataframe and Metric
"""Dropping columns where missing values are above a threshold"""
df_out = df_out.dropna(thresh = len(df_out)*0.95, axis = "columns")
df_out = df_out.dropna()
df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
close = df_out["Close"].copy()
df_d = df_out.copy()
model(df_out) #improve
592.52971755184
(3) (1) (i) Mapping - Eigen Decomposition - PCA
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
df_out = transform.robust_scaler(df_out, drop=["Close_1"])
df_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
df_out = mapper.pca_feature(df_out, drop_cols=["Close_1"], variance_or_components=0.9, n_components=8,non_linear=False)
model(df_out) #noisy but not too bad given the 10 fold dimensionality reduction
687.158330455884
(4) Extracting
First, I show the functions that have been added to the DeltaPy fork of tsfresh. You have to add your own adjustments based on the features you would like to construct. I am using self-developed features, but you can also use TSFresh's community functions.
The following files have been appropriately amended (get in contact for advice)
(4) (10) (i) Extracting - Averages - GSkew
extract.gskew(df_out["PCA_1"])
-0.7903067336449059
(4) (21) (ii) Extracting - Entropy - SVD Entropy
svd_param = [{"Tau": ta, "DE": de}
for ta in [4]
for de in [3,6]]
extract.svd_entropy(df_out["PCA_1"],svd_param)
[('Tau_"4"__De_3"', 0.7234823323374294),
('Tau_"4"__De_6"', 1.3014347840145244)]
(4) (13) (ii) Extracting - Streaks - Wozniak
woz_param = [{"consecutiveStar": n} for n in [2, 4]]
extract.wozniak(df_out["PCA_1"],woz_param)
[('consecutiveStar_2', 0.012658227848101266), ('consecutiveStar_4', 0.0)]
(4) (28) (i) Extracting - Fractal - Higuchi
hig_param = [{"Kmax": 3},{"Kmax": 5}]
extract.higuchi_fractal_dimension(df_out["PCA_1"],hig_param)
[('Kmax_3', 0.577913816027104), ('Kmax_5', 0.8176960510304725)]
(4) (5) (ii) Extracting - Volatility - Variability Index
var_index_param = {"Volume":df["Volume"].values, "Open": df["Open"].values}
extract.var_index(df["Close"].values,var_index_param)
{'Interact__Open': 0.00396022538846289,
'Interact__Volume': 0.20550155114176533}
Time Series Extraction
pip install git+git://github.com/firmai/tsfresh.git
#Construct the preferred input dataframe.
from tsfresh.utilities.dataframe_functions import roll_time_series
df_out["ID"] = 0
periods = 30
df_out = df_out.reset_index()
df_ts = roll_time_series(df_out,"ID","Date",None,1,periods)
counts = df_ts['ID'].value_counts()
df_ts = df_ts[df_ts['ID'].isin(counts[counts > periods].index)]
#Perform extraction
from tsfresh.feature_extraction import extract_features, CustomFCParameters
settings_dict = CustomFCParameters()
settings_dict["var_index"] = {"PCA_1":None, "PCA_2": None}
df_feat = extract_features(df_ts.drop(["Close_1"],axis=1),default_fc_parameters=settings_dict,column_id="ID",column_sort="Date")
Feature Extraction: 100%|██████████| 5/5 [00:10<00:00, 2.14s/it]
# Cleaning operations
import pandasvault as pv
df_feat2 = df_feat.copy()
df_feat = df_feat.dropna(thresh = len(df_feat)*0.50, axis = "columns")
df_feat_cons = pv.constant_feature_detect(data=df_feat,threshold=0.9)
df_feat = df_feat.drop(df_feat_cons, axis=1)
df_feat = df_feat.ffill()
df_feat = pd.merge(df_feat,df[["Close_1"]],left_index=True,right_index=True,how="left")
print(df_feat.shape)
model(df_feat) #noisy
7 variables are found to be almost constant
(208, 48)
2064.7813982935995
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(df_feat)
df_feat_2 = select_features(df_feat.drop(["Close_1"],axis=1),df_feat["Close_1"],fdr_level=0.05)
df_feat_2["Close_1"] = df_feat["Close_1"]
model(df_feat_2) #improvement (b/ not an augmentation method)
1577.5273071299482
(3) (6) (i) Feature Agglomeration; (1)(2)(i) Standard Scaler.
As in this step, after (1), (2), (3), (4) and (5) you can often circle back to the initial steps to normalise the data and reduce its dimensionality for the final model.
import numpy as np
import pandas as pd
from sklearn import datasets, cluster
def feature_agg(df, drop, components):
    # Never ask for more clusters than there are columns besides the dropped one
    components = min(df.shape[1]-1,components)
    agglo = cluster.FeatureAgglomeration(n_clusters=components,)
    # Remove the target column(s) before clustering the features
    df = df.drop(drop,axis=1)
    agglo.fit(df)
    df = pd.DataFrame(agglo.transform(df))
    df = df.add_prefix('fe_agg_')
    return df
df_final = transform.standard_scaler(df_feat_2, drop=["Close_1"])
df_final = mapper.feature_agg(df_final,["Close_1"],4)
df_final.index = df_feat.index
df_final["Close_1"] = df_feat["Close_1"]
model(df_final) #noisy
1949.89085894338
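To make the normalisation half of this step concrete, here is a minimal sketch of what a standard scaler does to a single column. It is not DeltaPy's `transform.standard_scaler`; the helper name `standard_scale` and the sample values are made up for illustration, and the sketch assumes the usual z-score definition (subtract the mean, divide by the population standard deviation).

```python
from statistics import fmean, pstdev


def standard_scale(column):
    """Z-score a list of numbers (hypothetical helper, not part of DeltaPy)."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (ddof = 0)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Made-up sample values with mean 5 and population standard deviation 2
scaled = standard_scale([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After scaling, every column has zero mean and unit variance, so a distance-based step such as feature agglomeration is not dominated by whichever column happens to have the largest raw values.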
Final Model After Applying 13 Arbitrary Augmentation Techniques
model(df_final) #improvement
1949.89085894338
Original Model Before Augmentation
df_org = df.iloc[:,:7][df.index.isin(df_final.index)]
model(df_org)
389.783990984133
Best Model After Developing 8 Augmenting Features
df_best = df_best.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)
model(df_best)
267.1826850968307
Commentary
There are countless ways in which the current model can be improved. This can take the form of an automated process in which every technique is tested against a hold-out set. For example, we can perform the operation below; even though it improves the score here, more robust tests are needed. The skeleton example above is not meant to highlight the performance of the package. It simply serves as an example of how one can go about applying augmentation methods.
Quite naturally, this example suffers from dimensionality issues, with array shapes reaching (208, 48). Furthermore, you would need a sample that is at least 50-100 times larger before machine learning methods start to make sense.
Nonetheless, in this example, Transformation, Interactions and Mappings (applied to extraction output) performed fairly well. Extraction augmentation was overkill, but created a reasonable model when dimensionally reduced. A better selection of one of the 50+ augmentation methods and the order of augmentation could further help improve the outcome if robustly tested against development sets.
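The idea of testing each technique against a hold-out set can be sketched as follows. This is only an illustration under stated assumptions: the helpers `chronological_split`, `fit_line` and `holdout_mse` are hypothetical, the model is a one-feature least-squares line rather than the `model` function used above, and the data is a made-up toy series.

```python
def chronological_split(rows, holdout_fraction=0.2):
    """Split time-ordered rows without shuffling; the last rows are held out."""
    n_test = max(1, int(round(len(rows) * holdout_fraction)))
    return rows[:-n_test], rows[-n_test:]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def holdout_mse(feature, target, holdout_fraction=0.2):
    """Fit on the early rows, then score mean squared error on the held-out tail."""
    train, test = chronological_split(list(zip(feature, target)), holdout_fraction)
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    return sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)


# Toy data (made up): the target is a linear function of time
target = [2.0 * t + 1.0 for t in range(20)]
candidates = {
    "informative": [float(t) for t in range(20)],
    "noise": [float((7 * t) % 5) for t in range(20)],
}
scores = {name: holdout_mse(col, target) for name, col in candidates.items()}
best = min(scores, key=scores.get)
```

Splitting by position rather than at random keeps future rows out of the training set, which matters for time series where neighbouring rows are strongly correlated.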
[1] DeltaPy Development
Author: firmai
Source Code: https://github.com/firmai/deltapy
Find peaks in an array based on "Improved peak detection" [1]
[1] Du, Pan, Warren A. Kibbe, and Simon M. Lin. "Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching." Bioinformatics 22.17 (2006): 2059-2065.
If you use NPM, npm install d3-peaks. Otherwise, download the latest release.
# d3_peaks.findPeaks([signal])
If signal is specified, returns an array of points that represent the peaks in the signal. Otherwise, returns a function to find peaks. An example point returned is:
[{
index: 10,
width: 2,
snr: 1.5
}]
Where index represents the index of the peak in the original signal, width is the width of the peak, and snr is the signal to noise ratio.
# widths([w])
If specified, [w] is an array of expected peak widths that the algorithm should find. Otherwise, returns the current values.
var findPeaks = d3_peaks.findPeaks().widths([1, 2, 10]);
# kernel(kernel)
If specified, changes the kernel function or "smoother". Otherwise, returns the current value.
var ricker = d3_peaks.ricker;
var findPeaks = d3_peaks.findPeaks().kernel(ricker);
# gapThreshold(gap)
If specified, gap represents the maximum allowed number of gaps in the ridgeline. The higher this number, the more connected peaks we will find. Otherwise, returns the current value.
var findPeaks = d3_peaks.findPeaks().gapThreshold(3);
# minLineLength(length)
If specified, length represents the minimum ridgeline length. The higher this number, the more constrained the lines are and the fewer peaks we will find. Otherwise, returns the current value.
var findPeaks = d3_peaks.findPeaks().minLineLength(2);
# minSNR(snr)
If specified, snr represents the minimum signal to noise ratio the ridge lines should have. Otherwise, returns the current value. By default the minimum snr is 1.0 for peaks of width 1. This number should be higher for bigger widths.
var findPeaks = d3_peaks.findPeaks().minSNR(1.5);
# d3_peaks.convolve([signal])
If specified, convolves the signal array with the smoother. Otherwise, returns a function to convolve a signal with the smoother.
# kernel(kernel)
If specified, changes the kernel function or "smoother". Otherwise, returns the current kernel.
var convolve = d3_peaks.convolve()
.kernel(ricker);
var signal = convolve([1,2,3,2.5,0,1,4,5,3,-1,-2]);
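To show what convolving a signal with the smoother means numerically, here is a rough sketch written in Python rather than the library's JavaScript. It is not d3-peaks' implementation: the function names, the kernel cutoff of five standard deviations, and the treatment of samples beyond the array edges as zero are all assumptions made for illustration.

```python
import math


def ricker(x, std=1.0):
    """Ricker ("Mexican hat") wavelet evaluated at x."""
    norm = 2.0 / (math.sqrt(3.0 * std) * math.pi ** 0.25)
    u2 = (x / std) ** 2
    return norm * (1.0 - u2) * math.exp(-u2 / 2.0)


def convolve(signal, std=1.0, reach=None):
    """Discrete convolution of signal with the sampled wavelet (zero-padded edges)."""
    if reach is None:
        reach = int(math.ceil(5 * std))  # assumed cutoff where the kernel is ~0
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k in range(-reach, reach + 1):
            j = i - k
            if 0 <= j < len(signal):  # samples outside the array count as zero
                acc += signal[j] * ricker(k, std)
        out.append(acc)
    return out


signal = [1, 2, 3, 2.5, 0, 1, 4, 5, 3, -1, -2]
smoothed = convolve(signal)
```

The output has the same length as the input. Convolving a single unit impulse reproduces the sampled kernel centred on the impulse, which is a quick way to sanity-check any convolution routine.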
# d3_peaks.ricker(x)
If x is specified, returns φ(x). Otherwise, returns a function to compute the Ricker wavelet with default standard deviation 1.0.
# std(value)
If specified, it sets the standard deviation of the curve to value. Otherwise, returns the "width" or standard deviation of the wavelet.
# reach()
Returns the range value reach such that φ(reach) ~ 0.
var y = d3_peaks.ricker()
.std(2);
var output = y(3.5);
var reach = y.reach();
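For readers who want the formula behind φ, the textbook Ricker wavelet with standard deviation σ is φ(x) = 2 / (√(3σ) · π^(1/4)) · (1 − x²/σ²) · exp(−x²/(2σ²)). The Python sketch below mirrors the `std` and `reach` idea from the usage above but is not the library's code: `make_ricker` is a hypothetical name, and taking the reach to be 5σ is an assumption chosen so that φ(reach) is close to zero.

```python
import math


def make_ricker(std=1.0):
    """Build a Ricker wavelet function for the given standard deviation."""
    norm = 2.0 / (math.sqrt(3.0 * std) * math.pi ** 0.25)

    def ricker(x):
        u2 = (x / std) ** 2
        return norm * (1.0 - u2) * math.exp(-u2 / 2.0)

    # Assumed cutoff: beyond 5 standard deviations the wavelet is nearly zero
    ricker.reach = lambda: 5.0 * std
    return ricker


phi = make_ricker(std=2.0)
center = phi(0.0)            # maximum of the wavelet
zero_crossing = phi(2.0)     # the wavelet crosses zero at x = std
tail = phi(phi.reach())      # value at the assumed reach
```

The wavelet is symmetric, positive between −σ and σ, and negative outside that interval before decaying to zero, which is why it responds strongly to bumps whose width is comparable to σ.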
For examples, please see:
Author: Efekarakus
Source Code: https://github.com/efekarakus/d3-peaks
License: BSD-3-Clause license