Chewy is an ODM (Object Document Mapper), built on top of the official Elasticsearch client.
In this section we'll cover why you might want to use Chewy instead of the official elasticsearch-ruby
client gem.
Every index is observable by all the related models.
Most indexed models are related to each other, and sometimes it is necessary to denormalize this related data and put it in the same object. For example, you may need to index an array of tags together with an article. Chewy allows you to specify an updateable index for every model separately, so the corresponding articles will be reindexed on any tag update.
Bulk import everywhere.
Chewy utilizes the bulk ES API for full reindexing or index updates. It also uses atomic updates. All the changed objects are collected inside the atomic block and the index is updated once at the end with all the collected objects. See Chewy.strategy(:atomic) for more details.
Powerful querying DSL.
Chewy has an ActiveRecord-style query DSL. It is chainable, mergeable and lazy, so you can produce queries in the most efficient way. It also has object-oriented query and filter builders.
Support for ActiveRecord.
Add this line to your application's Gemfile:
gem 'chewy'
And then execute:
$ bundle
Or install it yourself as:
$ gem install chewy
Chewy is compatible with MRI 2.6-3.0¹.
¹ Ruby 3 is only supported with Rails 6.1
Chewy version | Elasticsearch version |
---|---|
7.2.x | 7.x |
7.1.x | 7.x |
7.0.x | 6.8, 7.x |
6.0.0 | 5.x, 6.x |
5.x | 5.x, limited support for 1.x & 2.x |
Important: Chewy doesn't follow SemVer, so you should always check the release notes before upgrading. The major version is linked to the newest supported Elasticsearch and the minor version bumps may include breaking changes.
See our migration guide for detailed upgrade instructions between various Chewy versions.
5.2, 6.0, 6.1 Active Record versions are supported by all Chewy versions.
Chewy provides functionality for Elasticsearch index handling, document import mappings, index update strategies and a chainable query DSL.
Create config/initializers/chewy.rb with this line:
Chewy.settings = {host: 'localhost:9250'}
And run rails g chewy:install to generate chewy.yml:
# config/chewy.yml
# separate environment configs
test:
  host: 'localhost:9250'
  prefix: 'test'
development:
  host: 'localhost:9200'
Make sure you have Elasticsearch up and running. You can install it locally, but the easiest way is to use Docker:
$ docker run --rm --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.11.1
Create app/chewy/users_index.rb with User Index:
class UsersIndex < Chewy::Index
settings analysis: {
analyzer: {
email: {
tokenizer: 'keyword',
filter: ['lowercase']
}
}
}
index_scope User
field :first_name
field :last_name
field :email, analyzer: 'email'
end
Add User model, table and migrate it:
$ bundle exec rails g model User first_name last_name email
$ bundle exec rails db:migrate
Add update_index to app/models/user.rb:
class User < ApplicationRecord
update_index('users') { self }
end
User.create(
first_name: "test1",
last_name: "test1",
email: 'test1@example.com',
# other fields
)
# UsersIndex Import (355.3ms) {:index=>1}
# => #<User id: 1, first_name: "test1", last_name: "test1", email: "test1@example.com", # other fields>
UsersController:
def search
@users = UsersIndex.query(query_string: { fields: [:first_name, :last_name, :email, ...], query: search_params[:query], default_operator: 'and' })
render json: @users.to_json, status: :ok
end
private
def search_params
params.permit(:query, :page, :per)
end
A request to http://localhost:3000/users/search?query=test1@example.com returns a response like:
[
{
"attributes":{
"id":"1",
"first_name":"test1",
"last_name":"test1",
"email":"test1@example.com",
...
"_score":0.9808291,
"_explanation":null
},
"_data":{
"_index":"users",
"_type":"_doc",
"_id":"1",
"_score":0.9808291,
"_source":{
"first_name":"test1",
"last_name":"test1",
"email":"test1@example.com",
...
}
}
}
]
To configure the Chewy client you need to add a chewy.rb file with a Chewy.settings hash:
# config/initializers/chewy.rb
Chewy.settings = {host: 'localhost:9250'} # do not use environments
And add a chewy.yml configuration file.
You can create chewy.yml manually or run rails g chewy:install to generate it:
# config/chewy.yml
# separate environment configs
test:
  host: 'localhost:9250'
  prefix: 'test'
development:
  host: 'localhost:9200'
The resulting config merges both hashes. Client options are passed as is to Elasticsearch::Transport::Client except for the :prefix, which is used internally by Chewy to create prefixed index names:
Chewy.settings = {prefix: 'test'}
UsersIndex.index_name # => 'test_users'
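As a rough illustration of the merge and the prefix handling described above, here is a plain-Ruby sketch that does not use Chewy itself. The helpers `merged_settings` and `prefixed_index_name` are hypothetical, and the rule that initializer values win over YAML values is an assumption made for the sketch.

```ruby
require 'yaml'

# Hypothetical stand-in for the contents of config/chewy.yml.
YAML_CONFIG = <<~YML
  test:
    host: 'localhost:9250'
    prefix: 'test'
  development:
    host: 'localhost:9200'
YML

# Assumption: settings from the initializer override the YAML values.
def merged_settings(initializer_settings, env)
  yaml_settings = YAML.safe_load(YAML_CONFIG).fetch(env, {})
  yaml_settings.transform_keys(&:to_sym).merge(initializer_settings)
end

# Hypothetical helper: prepend the prefix, if any, to the base index name.
def prefixed_index_name(base_name, settings)
  [settings[:prefix], base_name].compact.join('_')
end

settings = merged_settings({ host: 'localhost:9250' }, 'test')
puts prefixed_index_name('users', settings)
```

With the `development` section, which has no prefix, the same helper returns the bare index name.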
The logger may be set explicitly:
Chewy.logger = Logger.new(STDOUT)
See config.rb for more details.
If you would like to use AWS's Elasticsearch with an IAM user policy, you will need to sign your requests for the es:* action by injecting the appropriate headers, passing a proc to transport_options. You'll need an additional gem for the Faraday middleware: add gem 'faraday_middleware-aws-sigv4' to your Gemfile.
require 'faraday_middleware/aws_sigv4'
Chewy.settings = {
host: 'http://my-es-instance-on-aws.us-east-1.es.amazonaws.com:80',
port: 80, # 443 for https host
transport_options: {
headers: { content_type: 'application/json' },
proc: -> (f) do
f.request :aws_sigv4,
service: 'es',
region: 'us-east-1',
access_key_id: ENV['AWS_ACCESS_KEY'],
secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
end
}
}
/app/chewy/users_index.rb
class UsersIndex < Chewy::Index
end
class UsersIndex < Chewy::Index
index_scope User.active # or just a model instead of a scope: index_scope User
end
class UsersIndex < Chewy::Index
index_scope User.active.includes(:country, :badges, :projects)
field :first_name, :last_name # multiple fields without additional options
field :email, analyzer: 'email' # Elasticsearch-related options
field :country, value: ->(user) { user.country.name } # custom value proc
field :badges, value: ->(user) { user.badges.map(&:name) } # passing array values to index
field :projects do # the same block syntax for multi_field, if `:type` is specified
field :title
field :description # default data type is `text`
# additional top-level objects passed to value proc:
field :categories, value: ->(project, user) { project.categories.map(&:name) if user.active? }
end
field :rating, type: 'integer' # custom data type
field :created, type: 'date', include_in_all: false,
value: ->{ created_at } # value proc for source object context
end
See here for mapping definitions.
See Chewy::Index.settings docs for details:
class UsersIndex < Chewy::Index
settings analysis: {
analyzer: {
email: {
tokenizer: 'keyword',
filter: ['lowercase']
}
}
}
index_scope User.active.includes(:country, :badges, :projects)
root date_detection: false do
template 'about_translations.*', type: 'text', analyzer: 'standard'
field :first_name, :last_name
field :email, analyzer: 'email'
field :country, value: ->(user) { user.country.name }
field :badges, value: ->(user) { user.badges.map(&:name) }
field :projects do
field :title
field :description
end
field :about_translations, type: 'object' # pass object type explicitly if necessary
field :rating, type: 'integer'
field :created, type: 'date', include_in_all: false,
value: ->{ created_at }
end
end
See index settings here. See root object settings here.
See mapping.rb for more details.
class User < ActiveRecord::Base
update_index('users') { self } # specifying index and back-reference
# for updating after user save or destroy
end
class Country < ActiveRecord::Base
has_many :users
update_index('users') { users } # return single object or collection
end
class Project < ActiveRecord::Base
update_index('users') { user if user.active? } # you can return even `nil` from the back-reference
end
class Book < ActiveRecord::Base
update_index(->(book) {"books_#{book.language}"}) { self } # dynamic index name with proc.
# For book with language == "en"
# this code will generate `books_en`
end
Also, you can use the second argument for method name passing:
update_index('users', :self)
update_index('users', :users)
In the case of a belongs_to association you may need to update both associated objects, previous and current:
class City < ActiveRecord::Base
belongs_to :country
update_index('cities') { self }
update_index 'countries' do
previous_changes['country_id'] || country
end
end
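The back-reference above relies on `previous_changes['country_id']` holding the old and new ids after a save. A minimal plain-Ruby sketch of that logic follows. `countries_to_reindex` is a hypothetical helper, and the hash only imitates the `attribute => [old, new]` shape of ActiveRecord's `previous_changes`.

```ruby
# Hypothetical helper: which country ids need reindexing after a city is saved.
# `previous_changes` imitates ActiveRecord's hash of attribute => [old, new].
def countries_to_reindex(previous_changes, current_country_id)
  Array(previous_changes['country_id'] || current_country_id).compact
end

# The city moved from country 1 to country 2: both must be reindexed.
p countries_to_reindex({ 'country_id' => [1, 2] }, 2)

# Only the name changed: the current country is enough.
p countries_to_reindex({ 'name' => %w[Old New] }, 2)
```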
Every index has a default_import_options configuration to specify default import options:
class ProductsIndex < Chewy::Index
index_scope Post.includes(:tags)
default_import_options batch_size: 100, bulk_size: 10.megabytes, refresh: false
field :name
field :tags, value: -> { tags.map(&:name) }
end
See import.rb for available options.
To define an object field you can simply nest fields in the DSL:
field :projects do
field :title
field :description
end
This will automatically set the type of the root field to object. You may also specify type: 'object' explicitly.
To define a multi field you have to specify any type except for object or nested in the root field:
field :full_name, type: 'text', value: ->{ full_name.strip } do
field :ordered, analyzer: 'ordered'
field :untouched, type: 'keyword'
end
The value: option for internal fields will no longer be effective.
You can use Elasticsearch's geo mapping with the geo_point field type, allowing you to query, filter and order by latitude and longitude. You can use the following hash format:
field :coordinates, type: 'geo_point', value: ->{ {lat: latitude, lon: longitude} }
or by using nested fields:
field :coordinates, type: 'geo_point' do
field :lat, value: ->{ latitude }
field :long, value: ->{ longitude }
end
See the section on Script fields for details on calculating distance in a search.
You can use a join field to implement parent-child relationships between documents. It replaces the old parent_id based parent-child mapping.
To use it, you need to pass relations and join (with type and id) options:
field :hierarchy_link, type: :join, relations: {question: %i[answer comment], answer: :vote, vote: :subvote}, join: {type: :comment_type, id: :commented_id}
assuming you have comment_type and commented_id fields in your model.
Note that when you reindex a parent, its children and grandchildren will be reindexed as well. This may require additional queries to the primary database and to Elasticsearch.
Also note that the join field doesn't support crutches (it should be a field directly defined on the model).
Assume you are defining your index like this (product has_many categories through product_categories):
class ProductsIndex < Chewy::Index
index_scope Product.includes(:categories)
field :name
field :category_names, value: ->(product) { product.categories.map(&:name) } # or shorter just -> { categories.map(&:name) }
end
Then the Chewy reindexing flow will look like the following pseudo-code:
Product.includes(:categories).find_in_batches(batch_size: 1000) do |batch|
bulk_body = batch.map do |object|
{name: object.name, category_names: object.categories.map(&:name)}.to_json
end
# here we are sending every batch of data to ES
Chewy.client.bulk bulk_body
end
If you meet complicated cases when associations are not applicable you can replace Rails associations with Chewy Crutches™ technology:
class ProductsIndex < Chewy::Index
index_scope Product
crutch :categories do |collection| # collection here is a current batch of products
# data is fetched with a lightweight query without objects initialization
data = ProductCategory.joins(:category).where(product_id: collection.map(&:id)).pluck(:product_id, 'categories.name')
# then we have to convert fetched data to appropriate format
# this will return our data in structure like:
# {123 => ['sweets', 'juices'], 456 => ['meat']}
data.each.with_object({}) { |(id, name), result| (result[id] ||= []).push(name) }
end
field :name
# simply use crutch-fetched data as a value:
field :category_names, value: ->(product, crutches) { crutches.categories[product.id] }
end
An example flow will look like this:
Product.find_in_batches(batch_size: 1000) do |batch|
crutches[:categories] = ProductCategory.joins(:category).where(product_id: batch.map(&:id)).pluck(:product_id, 'categories.name')
.each.with_object({}) { |(id, name), result| (result[id] ||= []).push(name) }
bulk_body = batch.map do |object|
{name: object.name, category_names: crutches[:categories][object.id]}.to_json
end
Chewy.client.bulk bulk_body
end
So Chewy Crutches™ technology is able to increase your indexing performance in some cases up to a hundredfold or even more depending on your associations complexity.
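The grouping step inside the crutch block can be tried in isolation. The sketch below uses plain arrays and hashes in place of database rows and products. The sample rows, the product hashes and the helper name `group_names_by_id` are all made up for illustration.

```ruby
# Rows as they might come back from `pluck(:product_id, 'categories.name')`.
rows = [[123, 'sweets'], [123, 'juices'], [456, 'meat']]

# Group category names by product id, as the crutch block does.
def group_names_by_id(rows)
  rows.each.with_object({}) { |(id, name), result| (result[id] ||= []).push(name) }
end

categories = group_names_by_id(rows)

# Hypothetical batch of products represented as plain hashes.
batch = [{ id: 123, name: 'Gift box' }, { id: 456, name: 'Steak' }, { id: 789, name: 'Water' }]

# One hash lookup per product replaces one association load per product.
documents = batch.map do |product|
  { name: product[:name], category_names: categories[product[:id]] }
end

p documents
```

A product without categories simply gets `nil` for `category_names` here, since the lookup hash has no entry for its id.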
One more experimental technology to increase import performance. As you know, chewy defines a value proc for every imported field in the mapping, so at import time each of these procs is executed on the imported object to extract the resulting document. It would be great for performance to use one huge whole-document-returning proc instead. So basically the idea of Witchcraft™ technology is to compile a single document-returning proc from the index definition.
index_scope Product
witchcraft!
field :title
field :tags, value: -> { tags.map(&:name) }
field :categories do
field :name, value: -> (product, category) { category.name }
field :type, value: -> (product, category, crutch) { crutch.types[category.name] }
end
The index definition above will be compiled to something close to:
-> (object, crutches) do
{
title: object.title,
tags: object.tags.map(&:name),
categories: object.categories.map do |object2|
{
name: object2.name,
type: crutches.types[object2.name]
}
end
}
end
And don't even ask how it is possible, it is witchcraft. Obviously not every type of definition can be compiled. There are some restrictions:
Use reasonable formatting so that method_source is able to extract field value proc sources.
If you are generating fields dynamically, use value procs with arguments:
[:first_name, :last_name].each do |name|
field name, value: -> (o) { o.send(name) }
end
However, it is quite possible that your index definition will be supported by Witchcraft™ technology out of the box in most cases.
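To make the difference concrete, the following plain-Ruby sketch contrasts calling one proc per field with calling a single whole-document proc. It only imitates the idea. The `Product` and `Tag` structs, `FIELD_PROCS` and `COMPILED` are hypothetical names, and nothing here reflects how Chewy actually compiles a definition.

```ruby
Product = Struct.new(:title, :tags)
Tag = Struct.new(:name)

# Per-field value procs: one proc call per field per object.
FIELD_PROCS = {
  title: ->(product) { product.title },
  tags: ->(product) { product.tags.map(&:name) }
}.freeze

def compose_with_field_procs(product)
  FIELD_PROCS.transform_values { |value_proc| value_proc.call(product) }
end

# A single whole-document proc, similar in spirit to the compiled result.
COMPILED = ->(object) { { title: object.title, tags: object.tags.map(&:name) } }

product = Product.new('Tea', [Tag.new('drinks'), Tag.new('hot')])
p compose_with_field_procs(product)
p COMPILED.call(product)
```

Both paths produce the same document, but the second one does it in a single call per object.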
Another way to speed up import time is Raw Imports. This technology is only available in ActiveRecord adapter. Very often, ActiveRecord model instantiation is what consumes most of the CPU and RAM resources. Precious time is wasted on converting, say, timestamps from strings and then serializing them back to strings. Chewy can operate on raw hashes of data directly obtained from the database. All you need is to provide a way to convert that hash to a lightweight object that mimics the behaviour of the normal ActiveRecord object.
class LightweightProduct
def initialize(attributes)
@attributes = attributes
end
# Depending on the database, `created_at` might
# be in different formats. In PostgreSQL, for example,
# you might see the following format:
# "2016-03-22 16:23:22"
#
# Taking into account that Elastic expects something different,
# one might do something like the following, just to avoid
# unnecessary String -> DateTime -> String conversion.
#
# "2016-03-22 16:23:22" -> "2016-03-22T16:23:22Z"
def created_at
@attributes['created_at'].tr(' ', 'T') << 'Z'
end
end
index_scope Product
default_import_options raw_import: ->(hash) {
LightweightProduct.new(hash)
}
field :created_at, type: 'date'
Also, you can pass the :raw_import option to the import method explicitly.
By default, when you perform an import Chewy checks whether an index exists and creates it if it's absent. You can turn off this feature to decrease the number of Elasticsearch hits. To do so you need to set the skip_index_creation_on_import parameter to true in your config/chewy.yml.
You can use ignore_blank: true to skip fields that return true for the .blank? method:
index_scope Country
field :id
field :cities, ignore_blank: true do
field :id
field :name
field :surname, ignore_blank: true
field :description
end
By default ignore_blank is false on every type except geo_point.
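The effect of the option can be pictured as a filter over the composed document. The sketch below is a plain-Ruby approximation, not Chewy's code. `blank_value?` is a simplified stand-in for ActiveSupport's `blank?`, and `compact_document` is a hypothetical helper.

```ruby
# Simplified stand-in for ActiveSupport's `blank?`, for this sketch only.
def blank_value?(value)
  return true if value.nil? || value == false
  return value.strip.empty? if value.is_a?(String)

  value.respond_to?(:empty?) && value.empty?
end

# Hypothetical helper: drop the fields flagged with ignore_blank when blank.
def compact_document(document, ignore_blank_fields)
  document.reject { |key, value| ignore_blank_fields.include?(key) && blank_value?(value) }
end

document = { id: 1, cities: [], description: '' }

# Only :cities is flagged, so the blank :description is kept.
p compact_document(document, [:cities])
```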
You can record all performed actions in a separate journal index in Elasticsearch. When you create/update/destroy your documents, the action will be saved in this special index. If you do something with a batch of documents (e.g. during index reset) it will be saved as one record, including the primary keys of each affected document. A common journal record looks like this:
{
"action": "index",
"object_id": [1, 2, 3],
"index_name": "...",
"created_at": "<timestamp>"
}
This feature is turned off by default, but you can turn it on by setting the journal setting to true in config/chewy.yml. Also, you can specify the journal index name. For example:
# config/chewy.yml
production:
  journal: true
  journal_name: my_super_journal
Also, you can provide this option while you're importing some index:
CityIndex.import journal: true
Or as a default import option for an index:
class CityIndex < Chewy::Index
index_scope City
default_import_options journal: true
end
You may be wondering why you need it. The answer is simple: not to lose the data.
Imagine that you reset your index in a zero-downtime manner (to a separate index), and in the meantime somebody keeps updating the data frequently (in the old index). All these actions will be written to the journal index, and you'll be able to apply them after the index reset using the Chewy::Journal interface.
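To show what applying the journal amounts to, here is a plain-Ruby sketch that works on records in the shape shown earlier. It does not use Chewy::Journal. The entries, timestamps and the helper `ids_to_reapply` are invented for illustration.

```ruby
require 'time'

# Hypothetical journal entries in the record shape shown above.
JOURNAL = [
  { 'action' => 'index',  'object_id' => [1, 2], 'index_name' => 'cities', 'created_at' => '2024-01-01T10:00:00Z' },
  { 'action' => 'delete', 'object_id' => [2],    'index_name' => 'cities', 'created_at' => '2024-01-01T11:00:00Z' },
  { 'action' => 'index',  'object_id' => [7],    'index_name' => 'users',  'created_at' => '2024-01-01T12:00:00Z' }
].freeze

# Collect, per index, the ids touched since the given time.
def ids_to_reapply(entries, since)
  entries
    .select { |entry| Time.iso8601(entry['created_at']) >= since }
    .group_by { |entry| entry['index_name'] }
    .transform_values { |group| group.flat_map { |entry| entry['object_id'] }.uniq }
end

# Suppose the reset started at 10:30; everything journaled later is replayed.
p ids_to_reapply(JOURNAL, Time.iso8601('2024-01-01T10:30:00Z'))
```

The sketch ignores the recorded action and only collects ids, on the assumption that re-importing an id is enough to bring the new index up to date.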
UsersIndex.delete # destroy index if it exists
UsersIndex.delete!
UsersIndex.create
UsersIndex.create! # use bang or non-bang methods
UsersIndex.purge
UsersIndex.purge! # deletes then creates index
UsersIndex.import # import with 0 arguments process all the data specified in index_scope definition
UsersIndex.import User.where('rating > 100') # or import specified users scope
UsersIndex.import User.where('rating > 100').to_a # or import specified users array
UsersIndex.import [1, 2, 42] # pass even ids for import, it will be handled in the most effective way
UsersIndex.import User.where('rating > 100'), update_fields: [:email] # if update fields are specified - it will update their values only with the `update` bulk action
UsersIndex.import! # raises an exception in case of any import errors
UsersIndex.reset! # purges index and imports default data for all types
If the passed user is #destroyed?, or satisfies a delete_if index_scope option, or the specified id does not exist in the database, import will perform a delete from index action for this object.
index_scope User, delete_if: :deleted_at
index_scope User, delete_if: -> { deleted_at }
index_scope User, delete_if: ->(user) { user.deleted_at }
See actions.rb for more details.
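The index-or-delete decision described above can be sketched in plain Ruby. `Record` and `bulk_action_for` are hypothetical names, and a `nil` record stands in for an id that was not found in the database.

```ruby
# Minimal record double with the two attributes the decision depends on.
Record = Struct.new(:id, :deleted_at, :destroyed) do
  def destroyed?
    destroyed
  end
end

# Hypothetical helper: choose the bulk action for one record.
# `nil` stands in for an id that does not exist in the database.
def bulk_action_for(record, delete_if: nil)
  return :delete if record.nil? || record.destroyed?
  return :delete if delete_if&.call(record)

  :index
end

soft_deleted = Record.new(2, Time.now, false)
p bulk_action_for(Record.new(1, nil, false))
p bulk_action_for(soft_deleted, delete_if: ->(user) { user.deleted_at })
```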
Assume you've got the following code:
class City < ActiveRecord::Base
update_index 'cities', :self
end
class CitiesIndex < Chewy::Index
index_scope City
field :name
end
If you do something like City.first.save! you'll get an UndefinedUpdateStrategy exception instead of the object saving and index updating. This exception forces you to choose an appropriate update strategy for the current context.
If you want to return to the pre-0.7.0 behavior, just set Chewy.root_strategy = :bypass.
:atomic
The main strategy here is :atomic. Assume you have to update a lot of records in the db.
Chewy.strategy(:atomic) do
City.popular.map(&:do_some_update_action!)
end
Using this strategy delays the index update request until the end of the block. Updated records are aggregated and the index update happens with the bulk API. So this strategy is highly optimized.
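The collect-then-flush behavior can be imitated with a few lines of plain Ruby. `AtomicCollector` is a toy class written for this illustration and is not part of Chewy. It records what would be sent instead of talking to Elasticsearch.

```ruby
# Toy collector: gathers changed ids during a block and flushes them once.
class AtomicCollector
  attr_reader :bulk_requests

  def initialize
    @bulk_requests = []
    @pending = nil
  end

  # Everything updated inside the block ends up in a single bulk request.
  def atomic
    @pending = []
    yield
    @bulk_requests << @pending.uniq unless @pending.empty?
  ensure
    @pending = nil
  end

  # Called on every save: queue inside a block, send immediately otherwise.
  def update(id)
    if @pending
      @pending << id
    else
      @bulk_requests << [id]
    end
  end
end

collector = AtomicCollector.new
collector.atomic { [1, 2, 1].each { |id| collector.update(id) } }
p collector.bulk_requests
```

Three updates inside the block, two of them for the same id, result in one request with two ids.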
:sidekiq
This does the same thing as :atomic, but asynchronously using sidekiq. Patch Chewy::Strategy::Sidekiq::Worker to improve index updating.
Chewy.strategy(:sidekiq) do
City.popular.map(&:do_some_update_action!)
end
The default queue name is chewy. You can customize it with the queue key of the sidekiq settings:
Chewy.settings[:sidekiq] = {queue: :low}
:lazy_sidekiq
This does the same thing as :sidekiq, but with lazy evaluation. Beware that it does not allow you to use any non-persistent record state for indices and conditions, because the record will be re-fetched from the database asynchronously using sidekiq. For destroyed records, however, the strategy will fall back to :sidekiq, because it's not possible to re-fetch deleted records from the database.
The purpose of this strategy is to improve the response time of the code that should update indexes, as it defers not only the actual ES calls to a background job but also the evaluation of update_index callbacks (for created and updated objects). Similar to :sidekiq, the index update is asynchronous, so this strategy cannot be used when data and index synchronization is required.
Chewy.strategy(:lazy_sidekiq) do
City.popular.map(&:do_some_update_action!)
end
The default queue name is chewy. You can customize it with the queue key of the sidekiq settings:
Chewy.settings[:sidekiq] = {queue: :low}
:active_job
This does the same thing as :atomic, but using ActiveJob. This will inherit the ActiveJob configuration settings, including the active_job.queue_adapter setting for the environment. Patch Chewy::Strategy::ActiveJob::Worker to improve index updating.
Chewy.strategy(:active_job) do
City.popular.map(&:do_some_update_action!)
end
The default queue name is chewy. You can customize it with the queue key of the active_job settings:
Chewy.settings[:active_job] = {queue: :low}
:urgent
The following strategy is convenient if you are going to update documents in your index one by one.
Chewy.strategy(:urgent) do
City.popular.map(&:do_some_update_action!)
end
This code will perform City.popular.count requests for ES documents update.
It is convenient for use in e.g. the Rails console with non-block notation:
> Chewy.strategy(:urgent)
> City.popular.map(&:do_some_update_action!)
:bypass
The bypass strategy simply silences index updates.
Strategies are designed to allow nesting, so it is possible to redefine it for nested contexts.
Chewy.strategy(:atomic) do
city1.do_update!
Chewy.strategy(:urgent) do
city2.do_update!
city3.do_update!
# there will be 2 update index requests for city2 and city3
end
city4.do_update!
# city1 and city4 will be grouped in one index update request
end
It is possible to nest strategies without blocks:
Chewy.strategy(:urgent)
city1.do_update! # index updated
Chewy.strategy(:bypass)
city2.do_update! # update bypassed
Chewy.strategy.pop
city3.do_update! # index updated again
See strategy/base.rb for more details. See strategy/atomic.rb for an example.
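Nesting, with and without blocks, behaves like a stack where the innermost entry decides what happens. The toy class below imitates that in plain Ruby. `StrategyStack` is hypothetical, and starting with `:bypass` at the bottom of the stack is an assumption made to keep the sketch short.

```ruby
# Toy strategy stack: the innermost strategy handles each update.
class StrategyStack
  attr_reader :log

  # Assumption for the sketch: the root strategy is :bypass.
  def initialize(root = :bypass)
    @stack = [root]
    @log = []
  end

  def current
    @stack.last
  end

  # With a block: push, run, pop. Without a block: just push.
  def strategy(name)
    @stack.push(name)
    return unless block_given?

    begin
      yield
    ensure
      @stack.pop
    end
  end

  # Never pops the root strategy.
  def pop
    @stack.pop if @stack.size > 1
  end

  def update(id)
    @log << [current, id]
  end
end

stack = StrategyStack.new
stack.strategy(:atomic) do
  stack.update(1)
  stack.strategy(:urgent) { stack.update(2) }
  stack.update(4)
end
p stack.log
```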
There are a couple of predefined strategies for your Rails application. Initially, the Rails console uses the :urgent strategy by default, except in the sandbox case. When you are running sandbox it switches to the :bypass strategy to avoid polluting the index.
Migrations are wrapped with the :bypass strategy. Because the main behavior implies that indices are reset after migration, there is no need for extra index updates. Also indexing might be broken during migrations because of the outdated schema.
Controller actions are wrapped with the configurable value of Chewy.request_strategy, which defaults to :atomic. This is done at the middleware level to reduce the number of index update requests inside actions.
It is also a good idea to set up the :bypass strategy inside your test suite, import objects manually only when needed, and use Chewy.massacre when needed to flush test ES indices before every example. This will allow you to minimize unnecessary ES requests and reduce overhead.
RSpec.configure do |config|
config.before(:suite) do
Chewy.strategy(:bypass)
end
end
All connection options, except the :prefix, are passed to Elasticsearch::Client.new (chewy/lib/chewy.rb).
Here's the relevant Elasticsearch documentation on the subject: https://rubydoc.info/gems/elasticsearch-transport#setting-hosts
ActiveSupport::Notifications support
Chewy notifies the following events:
search_query.chewy payload
payload[:index]: requested index class
payload[:request]: request hash
import_objects.chewy payload
payload[:index]: currently imported index name
payload[:import]: import stats, total imported and deleted objects count:
{index: 30, delete: 5}
payload[:errors]: might not exist. Contains grouped errors with lists of object ids:
{index: {
'error 1 text' => ['1', '2', '3'],
'error 2 text' => ['4']
}, delete: {
'delete error text' => ['10', '12']
}}
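As an example of consuming such a payload, the sketch below turns an import payload into a one-line summary. `summarize_import` is a hypothetical helper. It is plain Ruby and could be called from a subscriber block, though wiring it to ActiveSupport::Notifications is left out here.

```ruby
# Hypothetical formatter for an `import_objects.chewy` payload.
def summarize_import(payload)
  stats = payload.fetch(:import, {})
  # Errors are grouped by action, then by message, each with a list of ids.
  error_count = payload.fetch(:errors, {}).values.sum do |errors_by_message|
    errors_by_message.values.sum(&:size)
  end
  "#{payload[:index]}: indexed=#{stats.fetch(:index, 0)} " \
    "deleted=#{stats.fetch(:delete, 0)} errors=#{error_count}"
end

payload = {
  index: 'users',
  import: { index: 30, delete: 5 },
  errors: {
    index: { 'error 1 text' => %w[1 2 3], 'error 2 text' => ['4'] },
    delete: { 'delete error text' => %w[10 12] }
  }
}

puts summarize_import(payload)
```

Because `payload[:errors]` may be absent, the helper falls back to an empty hash and reports zero errors in that case.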
To integrate with NewRelic you may use the following example source (config/initializers/chewy.rb):
require 'new_relic/agent/instrumentation/evented_subscriber'
class ChewySubscriber < NewRelic::Agent::Instrumentation::EventedSubscriber
def start(name, id, payload)
event = ChewyEvent.new(name, Time.current, nil, id, payload)
push_event(event)
end
def finish(_name, id, _payload)
pop_event(id).finish
end
class ChewyEvent < NewRelic::Agent::Instrumentation::Event
OPERATIONS = {
'import_objects.chewy' => 'import',
'search_query.chewy' => 'search',
'delete_query.chewy' => 'delete'
}.freeze
def initialize(*args)
super
@segment = start_segment
end
def start_segment
segment = NewRelic::Agent::Transaction::DatastoreSegment.new product, operation, collection, host, port
if (txn = state.current_transaction)
segment.transaction = txn
end
segment.notice_sql @payload[:request].to_s
segment.start
segment
end
def finish
if (txn = state.current_transaction)
txn.add_segment @segment
end
@segment.finish
end
private
def state
@state ||= NewRelic::Agent::TransactionState.tl_get
end
def product
'Elasticsearch'
end
def operation
OPERATIONS[name]
end
def collection
payload.values_at(:type, :index)
.reject { |value| value.try(:empty?) }
.first
.to_s
end
def host
Chewy.client.transport.hosts.first[:host]
end
def port
Chewy.client.transport.hosts.first[:port]
end
end
end
ActiveSupport::Notifications.subscribe(/\.chewy$/, ChewySubscriber.new)
Quick introduction.
The request DSL has the same chainable nature as AR. The main class is Chewy::Search::Request.
CitiesIndex.query(match: {name: 'London'})
The main methods of the request DSL are query, filter and post_filter. It is possible to pass pure query hashes or use elasticsearch-dsl.
CitiesIndex
.filter(term: {name: 'Bangkok'})
.query(match: {name: 'London'})
.query.not(range: {population: {gt: 1_000_000}})
You can query a set of indexes at once:
CitiesIndex.indices(CountriesIndex).query(match: {name: 'Some'})
See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html and https://github.com/elastic/elasticsearch-dsl-ruby for more details.
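The chainable and lazy nature of such a DSL can be pictured with a toy class: every call returns a new object, and the request body is only built when asked for. `ToyRequest` is hypothetical, and the `bool` body it renders is just one plausible shape, not necessarily what Chewy::Search::Request produces.

```ruby
# Toy immutable request: every call returns a new object, nothing is sent.
class ToyRequest
  attr_reader :parts

  def initialize(parts = { query: [], filter: [] })
    @parts = parts
  end

  def query(clause)
    with(:query, clause)
  end

  def filter(clause)
    with(:filter, clause)
  end

  # Build the request body only when it is actually needed.
  def render
    { query: { bool: { must: parts[:query], filter: parts[:filter] } } }
  end

  private

  # Copy the parts with one more clause instead of mutating the receiver.
  def with(key, clause)
    self.class.new(parts.merge(key => parts[key] + [clause]))
  end
end

base = ToyRequest.new
scoped = base.filter({ term: { name: 'Bangkok' } }).query({ match: { name: 'London' } })
p scoped.render
```

Since `base` is never mutated, it can be reused as the starting point for other chains.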
An important part of request manipulation is merging. There are 4 methods to perform it: merge, and, or, not. See Chewy::Search::QueryProxy for details. Also, the only and except methods help to remove unneeded parts of the request.
Every other request part is covered by a bunch of additional methods, see Chewy::Search::Request for details:
CitiesIndex.limit(10).offset(30).order(:name, {population: {order: :desc}})
The request DSL also provides additional scope actions, like delete_all, exists?, count, pluck, etc.
The request DSL supports pagination with Kaminari. An extension is enabled on initialization if Kaminari is available. See Chewy::Search and Chewy::Search::Pagination::Kaminari for details.
Chewy supports named scopes functionality. There is no specialized DSL for named scopes definition, it is simply about defining class methods.
See Chewy::Search::Scoping for details.
The Elasticsearch scroll API is utilized by a bunch of methods: scroll_batches, scroll_hits, scroll_wrappers and scroll_objects.
See Chewy::Search::Scrolling for details.
It is possible to load ORM/ODM source objects with the objects method. To provide additional loading options use the load method:
CitiesIndex.load(scope: -> { active }).to_a # to_a returns `Chewy::Index` wrappers.
CitiesIndex.load(scope: -> { active }).objects # An array of AR source objects.
See Chewy::Search::Loader for more details.
When it is necessary to iterate through both the wrappers and the objects simultaneously, the object_hash method helps a lot:
scope = CitiesIndex.load(scope: -> { active })
scope.each do |wrapper|
scope.object_hash[wrapper]
end
For a Rails application, some index-maintaining rake tasks are defined.
chewy:reset
Performs zero-downtime reindexing as described here. So the rake task creates a new index with a unique suffix and then simply aliases it to the common index name. The previous index is deleted afterwards (see Chewy::Index.reset! for more details).
rake chewy:reset # resets all the existing indices
rake chewy:reset[users] # resets UsersIndex only
rake chewy:reset[users,cities] # resets UsersIndex and CitiesIndex
rake chewy:reset[-users,cities] # resets every index in the application except specified ones
chewy:upgrade
Performs reset exactly the same way as chewy:reset does, but only when the index specification (setting or mapping) was changed.
It works only when the index specification is locked in the Chewy::Stash::Specification index. The first run will reset all indexes and lock their specifications.
See Chewy::Stash::Specification and Chewy::Index::Specification for more details.
rake chewy:upgrade # upgrades all the existing indices
rake chewy:upgrade[users] # upgrades UsersIndex only
rake chewy:upgrade[users,cities] # upgrades UsersIndex and CitiesIndex
rake chewy:upgrade[-users,cities] # upgrades every index in the application except specified ones
chewy:update
It doesn't create indexes, it simply imports everything to the existing ones and fails if the index was not created before.
rake chewy:update # updates all the existing indices
rake chewy:update[users] # updates UsersIndex only
rake chewy:update[users,cities] # updates UsersIndex and CitiesIndex
rake chewy:update[-users,cities] # updates every index in the application except UsersIndex and CitiesIndex
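The argument convention used by these tasks, where a leading dash turns the whole list into an exclusion list, can be sketched as follows. `parse_index_args` is a hypothetical helper written for this illustration and is not the parser the rake tasks use.

```ruby
# Hypothetical parser for arguments such as `users,cities` or `-users,cities`.
# A leading dash on the first name marks the whole list as an exclusion list.
def parse_index_args(args)
  names = args.map(&:to_s)
  return { except: names.map { |name| name.delete_prefix('-') } } if names.first&.start_with?('-')

  { only: names }
end

p parse_index_args(%w[users cities])
p parse_index_args(%w[-users cities])
```

In this sketch an empty argument list yields an empty `:only` list, which a caller would have to treat as "all indexes".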
chewy:sync
Provides a way to synchronize outdated indexes with the source quickly and without doing a full reset. By default the updated_at field is used to find outdated records, but this can be customized with outdated_sync_field as described at Chewy::Index::Syncer.
Arguments are similar to the ones taken by the chewy:update task.
See Chewy::Index::Syncer for more details.
rake chewy:sync # synchronizes all the existing indices
rake chewy:sync[users] # synchronizes UsersIndex only
rake chewy:sync[users,cities] # synchronizes UsersIndex and CitiesIndex
rake chewy:sync[-users,cities] # synchronizes every index in the application except UsersIndex and CitiesIndex
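The general idea of synchronizing by a timestamp field can be sketched in plain Ruby. This is a toy approximation, not the Chewy::Index::Syncer algorithm. `sync_plan` and the sample data are invented, and both inputs map record ids to their `updated_at` values as strings.

```ruby
# Compare source timestamps with indexed ones and classify the differences.
def sync_plan(source, indexed)
  missing  = source.keys - indexed.keys
  orphaned = indexed.keys - source.keys
  outdated = (source.keys & indexed.keys).reject { |id| source[id] == indexed[id] }
  { reimport: (missing + outdated).sort, delete: orphaned.sort }
end

# id => updated_at in the database and in the index (made-up sample data).
source  = { 1 => '2024-01-01T10:00:00Z', 2 => '2024-01-02T09:00:00Z', 3 => '2024-01-03T08:00:00Z' }
indexed = { 1 => '2024-01-01T10:00:00Z', 2 => '2024-01-01T00:00:00Z', 4 => '2024-01-01T00:00:00Z' }

p sync_plan(source, indexed)
```

Only the records that are missing or differ get re-imported, which is why this is cheaper than a full reset.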
chewy:deploy
This rake task is especially useful during the production deploy. It is a combination of chewy:upgrade and chewy:sync, and the latter is called only for the indexes that were not reset during the first stage.
It is not possible to specify any particular indexes for this task as it doesn't make much sense.
Right now the approach is that if some data had been updated, but index definition was not changed (no changes satisfying the synchronization algorithm were done), it would be much faster to perform manual partial index update inside data migrations or even manually after the deploy.
Also, there is always the full reset alternative with rake chewy:reset.
Every task described above has its own parallel version. Every parallel rake task takes the number of processes for execution as the first argument, and the rest of the arguments are exactly the same as for the non-parallel task version.
https://github.com/grosser/parallel gem is required to use these tasks.
If the number of processes is not specified explicitly, the parallel gem tries to automatically derive the number of processes to use.
rake chewy:parallel:reset
rake chewy:parallel:upgrade[4]
rake chewy:parallel:update[4,cities]
rake chewy:parallel:sync[4,-users]
rake chewy:parallel:deploy[4] # performs parallel upgrade and parallel sync afterwards
chewy:journal
This namespace contains two tasks for journal manipulation: chewy:journal:apply and chewy:journal:clean. Both take a time as the first argument (optional for clean) and a list of indexes exactly as the tasks above. The time can be in any format parsable by ActiveSupport.
rake chewy:journal:apply["$(date -v-1H -u +%FT%TZ)"] # apply journaled changes for the past hour
rake chewy:journal:apply["$(date -v-1H -u +%FT%TZ)",users] # apply journaled changes for the past hour on UsersIndex only
Just add require 'chewy/rspec' to your spec_helper.rb and you will get additional features:
update_index helper
mock_elasticsearch_response helper to mock elasticsearch response
mock_elasticsearch_response_sources helper to mock elasticsearch response sources
build_query matcher to compare request and expected query (returns true/false)
To use the mock_elasticsearch_response and mock_elasticsearch_response_sources helpers, add include Chewy::Rspec::Helpers to your tests.
See chewy/rspec/ for more details.
Add require 'chewy/minitest' to your test_helper.rb, and then, for tests in which you'd like indexing test hooks, include Chewy::Minitest::Helpers.
You can set the :bypass strategy for test suites, handle import for the index manually, and flush test indices manually using Chewy.massacre. This helps reduce unnecessary ES requests.
But if you require chewy to index/update models regularly in your test suite, you can specify the :urgent strategy for document indexing. Add Chewy.strategy(:urgent) to test_helper.rb.
Also, you can use additional helpers:
mock_elasticsearch_response to mock elasticsearch response
mock_elasticsearch_response_sources to mock elasticsearch response sources
assert_elasticsearch_query to compare request and expected query (returns true/false)
See chewy/minitest/ for more details.
If you use DatabaseCleaner in your tests with the transaction strategy, you may find that ActiveRecord models are not indexed automatically on save, even though you set the callbacks to do this with the update_index method. The issue arises because chewy indexes data in after_commit callbacks by default, but after_commit callbacks are not run with DatabaseCleaner's transaction strategy. You can solve this by changing the Chewy.use_after_commit_callbacks option. Just add the following initializer to your Rails application:
#config/initializers/chewy.rb
Chewy.use_after_commit_callbacks = !Rails.env.test?
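To see why the callbacks never fire, here is a toy model of the mechanism in plain Ruby. `ToyTransaction` is a hypothetical class written for illustration; it is not ActiveRecord or DatabaseCleaner code, and it only mimics the idea that after_commit hooks are discarded when a transaction is rolled back.

```ruby
# Toy model (not ActiveRecord): after_commit hooks run only when the
# surrounding transaction commits. A test wrapped in a transaction
# that is always rolled back therefore never triggers them.
class ToyTransaction
  def initialize
    @after_commit = []
  end

  # Registers a block to run once the transaction commits.
  def after_commit(&block)
    @after_commit << block
  end

  def commit
    @after_commit.each(&:call)
  end

  def rollback
    @after_commit.clear # callbacks are discarded, never run
  end
end

indexed = []

tx = ToyTransaction.new
tx.after_commit { indexed << :user }
tx.rollback # what a transactional test cleanup does
p indexed   # => []

tx = ToyTransaction.new
tx.after_commit { indexed << :user }
tx.commit
p indexed   # => [:user]
```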
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Use the following Rake tasks to control the Elasticsearch cluster while developing, if you prefer a native Elasticsearch installation over the dockerized one:
rake elasticsearch:start # start Elasticsearch cluster on 9250 port for tests
rake elasticsearch:stop # stop Elasticsearch
Author: toptal
Source code: https://github.com/toptal/chewy
License: MIT license
Port of deeplearning4j to clojure
Contact info
If you have any questions,
NOT YET RELEASED TO CLOJARS
If using Maven add the following repository definition to your pom.xml:
<repository>
<id>clojars.org</id>
<url>http://clojars.org/repo</url>
</repository>
With Leiningen:
n/a
With Maven:
n/a
<dependency>
<groupId>_</groupId>
<artifactId>_</artifactId>
<version>_</version>
</dependency>
All functions for creating dl4j objects return code by default.
API functions return code when all args are provided as code.
API functions return the value of calling the wrapped method when args are provided as a mixture of objects and code, or as objects only.
The tests are there to help clarify behavior. If you are unsure how to use a fn, search the tests.
(ns my.ns
(:require [dl4clj.nn.conf.builders.layers :as l]))
;; as code (the default)
(l/dense-layer-builder
:activation-fn :relu
:learning-rate 0.006
:weight-init :xavier
:layer-name "example layer"
:n-in 10
:n-out 1)
;; =>
(doto
(org.deeplearning4j.nn.conf.layers.DenseLayer$Builder.)
(.nOut 1)
(.activation (dl4clj.constants/value-of {:activation-fn :relu}))
(.weightInit (dl4clj.constants/value-of {:weight-init :xavier}))
(.nIn 10)
(.name "example layer")
(.learningRate 0.006))
;; as an object
(l/dense-layer-builder
:activation-fn :relu
:learning-rate 0.006
:weight-init :xavier
:layer-name "example layer"
:n-in 10
:n-out 1
:as-code? false)
;; =>
#object[org.deeplearning4j.nn.conf.layers.DenseLayer 0x69d7d160 "DenseLayer(super=FeedForwardLayer(super=Layer(layerName=example layer, activationFn=relu, weightInit=XAVIER, biasInit=NaN, dist=null, learningRate=0.006, biasLearningRate=NaN, learningRateSchedule=null, momentum=NaN, momentumSchedule=null, l1=NaN, l2=NaN, l1Bias=NaN, l2Bias=NaN, dropOut=NaN, updater=null, rho=NaN, epsilon=NaN, rmsDecay=NaN, adamMeanDecay=NaN, adamVarDecay=NaN, gradientNormalization=null, gradientNormalizationThreshold=NaN), nIn=10, nOut=1))"]
Loading data from a file (here it's a CSV)
(ns my.ns
(:require [dl4clj.datasets.input-splits :as s]
[dl4clj.datasets.record-readers :as rr]
[dl4clj.datasets.api.record-readers :refer :all]
[dl4clj.datasets.iterators :as ds-iter]
[dl4clj.datasets.api.iterators :refer :all]
[dl4clj.helpers :refer [data-from-iter]]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; file splits (convert the data to records)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def poker-path "resources/poker-hand-training.csv")
;; this is not a complete dataset, it is just here to serve as an example
(def file-split (s/new-filesplit :path poker-path))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; record readers, (read the records created by the file split)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def csv-rr (initialize-rr! :rr (rr/new-csv-record-reader :skip-n-lines 0 :delimiter ",")
:input-split file-split))
;; let's look at some data
(println (next-record! :rr csv-rr :as-code? false))
;; => #object[java.util.ArrayList 0x2473e02d [1, 10, 1, 11, 1, 13, 1, 12, 1, 1, 9]]
;; this is our first line from the csv
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; record readers dataset iterators (turn our writables into a dataset)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def rr-ds-iter (ds-iter/new-record-reader-dataset-iterator
:record-reader csv-rr
:batch-size 1
:label-idx 10
:n-possible-labels 10))
;; we use our record reader created above
;; we want to see one example per dataset obj returned (:batch-size = 1)
;; we know our label is at the last index, so :label-idx = 10
;; there are 10 possible types of poker hands so :n-possible-labels = 10
;; you can also set :label-idx to -1 to use the last index no matter the size of the seq
(def other-rr-ds-iter (ds-iter/new-record-reader-dataset-iterator
:record-reader csv-rr
:batch-size 1
:label-idx -1
:n-possible-labels 10))
(str (next-example! :iter rr-ds-iter :as-code? false))
;; =>
;;===========INPUT===================
;;[1.00, 10.00, 1.00, 11.00, 1.00, 13.00, 1.00, 12.00, 1.00, 1.00]
;;=================OUTPUT==================
;;[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00]
;; and to show that :label-idx = -1 gives us the same output
(= (next-example! :iter rr-ds-iter :as-code? false)
(next-example! :iter other-rr-ds-iter :as-code? false)) ;; => true
(ns my.ns
(:require [nd4clj.linalg.factory.nd4j :refer [vec->indarray matrix->indarray
indarray-of-zeros indarray-of-ones
indarray-of-rand vec-or-matrix->indarray]]
[dl4clj.datasets.new-datasets :refer [new-ds]]
[dl4clj.datasets.api.datasets :refer [as-list]]
[dl4clj.datasets.iterators :refer [new-existing-dataset-iterator]]
[dl4clj.datasets.api.iterators :refer :all]
[dl4clj.datasets.pre-processors :as ds-pp]
[dl4clj.datasets.api.pre-processors :refer :all]
[dl4clj.core :as c]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; INDArray creation
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;TODO: consider defaulting to code
;; can create from a vector
(vec->indarray [1 2 3 4])
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x269df212 [1.00, 2.00, 3.00, 4.00]]
;; or from a matrix
(matrix->indarray [[1 2 3 4] [2 4 6 8]])
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x20aa7fe1
;; [[1.00, 2.00, 3.00, 4.00], [2.00, 4.00, 6.00, 8.00]]]
;; will fill in sparseness (rows that are too short) with zeros
(matrix->indarray [[1 2 3 4] [2 4 6 8] [10 12]])
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x8b7796c
;;[[1.00, 2.00, 3.00, 4.00],
;; [2.00, 4.00, 6.00, 8.00],
;; [10.00, 12.00, 0.00, 0.00]]]
;; can create an indarray of all zeros with specified shape
;; defaults to :rows = 1 :columns = 1
(indarray-of-zeros :rows 3 :columns 2)
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x6f586a7e
;;[[0.00, 0.00],
;; [0.00, 0.00],
;; [0.00, 0.00]]]
(indarray-of-zeros) ;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0xe59ffec 0.00]
;; and if only one is supplied, will get a vector of specified length
(indarray-of-zeros :rows 2)
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x2899d974 [0.00, 0.00]]
(indarray-of-zeros :columns 2)
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0xa5b9782 [0.00, 0.00]]
;; same considerations/defaults for indarray-of-ones and indarray-of-rand
(indarray-of-ones :rows 2 :columns 3)
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x54f08662 [[1.00, 1.00, 1.00], [1.00, 1.00, 1.00]]]
(indarray-of-rand :rows 2 :columns 3)
;; all values are greater than 0 but less than 1
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x2f20293b [[0.85, 0.86, 0.13], [0.94, 0.04, 0.36]]]
;; vec-or-matrix->indarray is built into all functions which require INDArrays
;; so that you can use clojure data structures
;; but you still have the option of passing existing INDArrays
(def example-array (vec-or-matrix->indarray [1 2 3 4]))
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x5c44c71f [1.00, 2.00, 3.00, 4.00]]
(vec-or-matrix->indarray example-array)
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x607b03b0 [1.00, 2.00, 3.00, 4.00]]
(vec-or-matrix->indarray (indarray-of-rand :rows 2))
;; => #object[org.nd4j.linalg.cpu.nativecpu.NDArray 0x49143b08 [0.76, 0.92]]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; data-set creation
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def ds-with-single-example (new-ds :input [1 2 3 4]
:output [0.0 1.0 0.0]))
(as-list :ds ds-with-single-example :as-code? false)
;; =>
;; #object[java.util.ArrayList 0x5d703d12
;;[===========INPUT===================
;;[1.00, 2.00, 3.00, 4.00]
;;=================OUTPUT==================
;;[0.00, 1.00, 0.00]]]
(def ds-with-multiple-examples (new-ds
:input [[1 2 3 4] [2 4 6 8]]
:output [[0.0 1.0 0.0] [0.0 0.0 1.0]]))
(as-list :ds ds-with-multiple-examples :as-code? false)
;; =>
;;#object[java.util.ArrayList 0x29c7a9e2
;;[===========INPUT===================
;;[1.00, 2.00, 3.00, 4.00]
;;=================OUTPUT==================
;;[0.00, 1.00, 0.00],
;;===========INPUT===================
;;[2.00, 4.00, 6.00, 8.00]
;;=================OUTPUT==================
;;[0.00, 0.00, 1.00]]]
;; we can create a dataset iterator from the code which creates datasets
;; and set the labels for our outputs (optional)
(def ds-with-multiple-examples
(new-ds
:input [[1 2 3 4] [2 4 6 8]]
:output [[0.0 1.0 0.0] [0.0 0.0 1.0]]))
;; iterator
(def training-rr-ds-iter
(new-existing-dataset-iterator
:dataset ds-with-multiple-examples
:labels ["foo" "baz" "foobaz"]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; data-set normalization
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; this gathers statistics on the dataset and normalizes the data
;; and applies the transformation to all dataset objects in the iterator
(def train-iter-normalized
(c/normalize-iter! :iter training-rr-ds-iter
:normalizer (ds-pp/new-standardize-normalization-ds-preprocessor)
:as-code? false))
;; above returns the normalized iterator
;; to get fit normalizer
(def the-normalizer
(get-pre-processor train-iter-normalized))
Creating a neural network configuration with single and multiple layers
(ns my.ns
(:require [dl4clj.nn.conf.builders.layers :as l]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.nn.conf.distributions :as dist]
[dl4clj.nn.conf.input-pre-processor :as pp]
[dl4clj.nn.conf.step-fns :as s-fn]))
;; nn/builder has 3 types of args
;; 1) args which set network configuration params
;; 2) args which set default values for layers
;; 3) args which set multi layer network configuration params
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; single layer nn configuration
;; here we are setting network configuration
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(nn/builder :optimization-algo :stochastic-gradient-descent
:seed 123
:iterations 1
:minimize? true
:use-drop-connect? false
:lr-score-based-decay-rate 0.002
:regularization? false
:step-fn :default-step-fn
:layers {:dense-layer {:activation-fn :relu
:updater :adam
:adam-mean-decay 0.2
:adam-var-decay 0.1
:learning-rate 0.006
:weight-init :xavier
:layer-name "single layer model example"
:n-in 10
:n-out 20}})
;; there are several options within a nn-conf map which can be configuration maps
;; or calls to fns
;; It doesn't matter which option you choose and you don't have to stay consistent
;; the list of params which can be passed as config maps or fn calls will
;; be enumerated at a later date
(nn/builder :optimization-algo :stochastic-gradient-descent
:seed 123
:iterations 1
:minimize? true
:use-drop-connect? false
:lr-score-based-decay-rate 0.002
:regularization? false
:step-fn (s-fn/new-default-step-fn)
:build? true
;; don't need to specify layer order, there's only one
:layers (l/dense-layer-builder
:activation-fn :relu
:updater :adam
:adam-mean-decay 0.2
:adam-var-decay 0.1
:dist (dist/new-normal-distribution :mean 0 :std 1)
:learning-rate 0.006
:weight-init :xavier
:layer-name "single layer model example"
:n-in 10
:n-out 20))
;; these configurations are the same
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; multi-layer configuration
;; here we are also setting layer defaults
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; defaults will apply to layers which do not specify those values in their config
(nn/builder
:optimization-algo :stochastic-gradient-descent
:seed 123
:iterations 1
:minimize? true
:use-drop-connect? false
:lr-score-based-decay-rate 0.002
:regularization? false
:default-activation-fn :sigmoid
:default-weight-init :uniform
;; we need to specify the layer order
:layers {0 (l/activation-layer-builder
:activation-fn :relu
:updater :adam
:adam-mean-decay 0.2
:adam-var-decay 0.1
:learning-rate 0.006
:weight-init :xavier
:layer-name "example first layer"
:n-in 10
:n-out 20)
1 {:output-layer {:n-in 20
:n-out 2
:loss-fn :mse
:layer-name "example output layer"}}})
;; specifying multi-layer config params
(nn/builder
;; network args
:optimization-algo :stochastic-gradient-descent
:seed 123
:iterations 1
:minimize? true
:use-drop-connect? false
:lr-score-based-decay-rate 0.002
:regularization? false
;; layer defaults
:default-activation-fn :sigmoid
:default-weight-init :uniform
;; the layers
:layers {0 (l/activation-layer-builder
:activation-fn :relu
:updater :adam
:adam-mean-decay 0.2
:adam-var-decay 0.1
:learning-rate 0.006
:weight-init :xavier
:layer-name "example first layer"
:n-in 10
:n-out 20)
1 {:output-layer {:n-in 20
:n-out 2
:loss-fn :mse
:layer-name "example output layer"}}}
;; multi layer network args
:backprop? true
:backprop-type :standard
:pretrain? false
:input-pre-processors {0 (pp/new-zero-mean-pre-pre-processor)
1 {:unit-variance-processor {}}})
Multi Layer models
(ns my.ns
(:require [dl4clj.datasets.iterators :as iter]
[dl4clj.datasets.input-splits :as split]
[dl4clj.datasets.record-readers :as rr]
[dl4clj.optimize.listeners :as listener]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.nn.multilayer.multi-layer-network :as mln]
[dl4clj.nn.api.model :refer [init! set-listeners!]]
[dl4clj.nn.api.multi-layer-network :refer [evaluate-classification]]
[dl4clj.datasets.api.record-readers :refer [initialize-rr!]]
[dl4clj.eval.api.eval :refer [get-stats get-accuracy]]
[dl4clj.core :as c]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; nn-conf -> multi-layer-network
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def nn-conf
(nn/builder
;; network args
:optimization-algo :stochastic-gradient-descent
:seed 123 :iterations 1 :regularization? true
;; setting layer defaults
:default-activation-fn :relu :default-l2 7.5e-6
:default-weight-init :xavier :default-learning-rate 0.0015
:default-updater :nesterovs :default-momentum 0.98
;; setting layer configuration
:layers {0 {:dense-layer
{:layer-name "example first layer"
:n-in 784 :n-out 500}}
1 {:dense-layer
{:layer-name "example second layer"
:n-in 500 :n-out 100}}
2 {:output-layer
{:n-in 100 :n-out 10
;; layer specific params
:loss-fn :negativeloglikelihood
:activation-fn :softmax
:layer-name "example output layer"}}}
;; multi layer args
:backprop? true
:pretrain? false))
(def multi-layer-network (c/model-from-conf nn-conf))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; local cpu training with dl4j pre-built iterators
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; lets use the pre-built Mnist data set iterator
(def train-mnist-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? true
:seed 123))
(def test-mnist-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? false
:seed 123))
;; and lets set a listener so we can know how training is going
(def score-listener (listener/new-score-iteration-listener :print-every-n 5))
;; and attach it to our model
;; TODO: listeners are broken, look into log4j warning
(def mln-with-listener (set-listeners! :model multi-layer-network
:listeners [score-listener]))
(def trained-mln (mln/train-mln-with-ds-iter! :mln mln-with-listener
:iter train-mnist-iter
:n-epochs 15
:as-code? false))
;; training happens because :as-code? = false
;; if it was true, we would still just have a data structure
;; we now have a trained model that has seen the training dataset 15 times
;; time to evaluate our model
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;Create an evaluation object
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def eval-obj (evaluate-classification :mln trained-mln
:iter test-mnist-iter))
;; always remember that these objects are stateful, don't use the same eval-obj
;; to eval two different networks
;; we trained the model on a training dataset. We evaluate on a test set
(println (get-stats :evaler eval-obj))
;; this will print the stats to standard out for each feature/label pair
;;Examples labeled as 0 classified by model as 0: 968 times
;;Examples labeled as 0 classified by model as 1: 1 times
;;Examples labeled as 0 classified by model as 2: 1 times
;;Examples labeled as 0 classified by model as 3: 1 times
;;Examples labeled as 0 classified by model as 5: 1 times
;;Examples labeled as 0 classified by model as 6: 3 times
;;Examples labeled as 0 classified by model as 7: 1 times
;;Examples labeled as 0 classified by model as 8: 2 times
;;Examples labeled as 0 classified by model as 9: 2 times
;;Examples labeled as 1 classified by model as 1: 1126 times
;;Examples labeled as 1 classified by model as 2: 2 times
;;Examples labeled as 1 classified by model as 3: 1 times
;;Examples labeled as 1 classified by model as 5: 1 times
;;Examples labeled as 1 classified by model as 6: 2 times
;;Examples labeled as 1 classified by model as 7: 1 times
;;Examples labeled as 1 classified by model as 8: 2 times
;;Examples labeled as 2 classified by model as 0: 3 times
;;Examples labeled as 2 classified by model as 1: 2 times
;;Examples labeled as 2 classified by model as 2: 1006 times
;;Examples labeled as 2 classified by model as 3: 2 times
;;Examples labeled as 2 classified by model as 4: 3 times
;;Examples labeled as 2 classified by model as 6: 3 times
;;Examples labeled as 2 classified by model as 7: 7 times
;;Examples labeled as 2 classified by model as 8: 6 times
;;Examples labeled as 3 classified by model as 2: 4 times
;;Examples labeled as 3 classified by model as 3: 990 times
;;Examples labeled as 3 classified by model as 5: 3 times
;;Examples labeled as 3 classified by model as 7: 3 times
;;Examples labeled as 3 classified by model as 8: 3 times
;;Examples labeled as 3 classified by model as 9: 7 times
;;Examples labeled as 4 classified by model as 2: 2 times
;;Examples labeled as 4 classified by model as 3: 1 times
;;Examples labeled as 4 classified by model as 4: 967 times
;;Examples labeled as 4 classified by model as 6: 4 times
;;Examples labeled as 4 classified by model as 7: 1 times
;;Examples labeled as 4 classified by model as 9: 7 times
;;Examples labeled as 5 classified by model as 0: 2 times
;;Examples labeled as 5 classified by model as 3: 6 times
;;Examples labeled as 5 classified by model as 4: 1 times
;;Examples labeled as 5 classified by model as 5: 874 times
;;Examples labeled as 5 classified by model as 6: 3 times
;;Examples labeled as 5 classified by model as 7: 1 times
;;Examples labeled as 5 classified by model as 8: 3 times
;;Examples labeled as 5 classified by model as 9: 2 times
;;Examples labeled as 6 classified by model as 0: 4 times
;;Examples labeled as 6 classified by model as 1: 3 times
;;Examples labeled as 6 classified by model as 3: 2 times
;;Examples labeled as 6 classified by model as 4: 4 times
;;Examples labeled as 6 classified by model as 5: 4 times
;;Examples labeled as 6 classified by model as 6: 939 times
;;Examples labeled as 6 classified by model as 7: 1 times
;;Examples labeled as 6 classified by model as 8: 1 times
;;Examples labeled as 7 classified by model as 1: 7 times
;;Examples labeled as 7 classified by model as 2: 4 times
;;Examples labeled as 7 classified by model as 3: 3 times
;;Examples labeled as 7 classified by model as 7: 1005 times
;;Examples labeled as 7 classified by model as 8: 2 times
;;Examples labeled as 7 classified by model as 9: 7 times
;;Examples labeled as 8 classified by model as 0: 3 times
;;Examples labeled as 8 classified by model as 2: 3 times
;;Examples labeled as 8 classified by model as 3: 2 times
;;Examples labeled as 8 classified by model as 4: 4 times
;;Examples labeled as 8 classified by model as 5: 3 times
;;Examples labeled as 8 classified by model as 6: 2 times
;;Examples labeled as 8 classified by model as 7: 4 times
;;Examples labeled as 8 classified by model as 8: 947 times
;;Examples labeled as 8 classified by model as 9: 6 times
;;Examples labeled as 9 classified by model as 0: 2 times
;;Examples labeled as 9 classified by model as 1: 2 times
;;Examples labeled as 9 classified by model as 3: 4 times
;;Examples labeled as 9 classified by model as 4: 8 times
;;Examples labeled as 9 classified by model as 6: 1 times
;;Examples labeled as 9 classified by model as 7: 4 times
;;Examples labeled as 9 classified by model as 8: 2 times
;;Examples labeled as 9 classified by model as 9: 986 times
;;==========================Scores========================================
;; Accuracy: 0.9808
;; Precision: 0.9808
;; Recall: 0.9807
;; F1 Score: 0.9807
;;========================================================================
;; can get the stats that are printed via fns in the evaluation namespace
;; after running evaluate-classification
(get-accuracy :evaler eval-obj) ;; => 0.9808
Early Stopping (controlling training)
It is recommended that you start here when designing models
using dl4clj.core
(ns my.ns
(:require [dl4clj.earlystopping.termination-conditions :refer :all]
[dl4clj.earlystopping.model-saver :refer [new-in-memory-saver]]
[dl4clj.nn.api.multi-layer-network :refer [evaluate-classification]]
[dl4clj.eval.api.eval :refer [get-stats]]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.datasets.iterators :as iter]
[dl4clj.core :as c]))
(def nn-conf
(nn/builder
;; network args
:optimization-algo :stochastic-gradient-descent
:seed 123
:iterations 1
:regularization? true
;; setting layer defaults
:default-activation-fn :relu
:default-l2 7.5e-6
:default-weight-init :xavier
:default-learning-rate 0.0015
:default-updater :nesterovs
:default-momentum 0.98
;; setting layer configuration
:layers {0 {:dense-layer
{:layer-name "example first layer"
:n-in 784 :n-out 500}}
1 {:dense-layer
{:layer-name "example second layer"
:n-in 500 :n-out 100}}
2 {:output-layer
{:n-in 100 :n-out 10
;; layer specific params
:loss-fn :negativeloglikelihood
:activation-fn :softmax
:layer-name "example output layer"}}}
;; multi layer args
:backprop? true
:pretrain? false))
(def train-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? true
:seed 123))
(def test-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? false
:seed 123))
(def invalid-score-condition (new-invalid-score-iteration-termination-condition))
(def max-score-condition (new-max-score-iteration-termination-condition
:max-score 20.0))
(def max-time-condition (new-max-time-iteration-termination-condition
:max-time-val 10
:max-time-unit :minutes))
(def score-doesnt-improve-condition (new-score-improvement-epoch-termination-condition
:max-n-epoch-no-improve 5))
(def target-score-condition (new-best-score-epoch-termination-condition
:best-expected-score 0.009))
(def max-number-epochs-condition (new-max-epochs-termination-condition :max-n 20))
(def in-mem-saver (new-in-memory-saver))
(def trained-mln
;; defaults to returning the model
(c/train-with-early-stopping
:nn-conf nn-conf
:training-iter train-iter
:testing-iter test-iter
:eval-every-n-epochs 1
:iteration-termination-conditions [invalid-score-condition
max-score-condition
max-time-condition]
:epoch-termination-conditions [score-doesnt-improve-condition
target-score-condition
max-number-epochs-condition]
:save-last-model? true
:model-saver in-mem-saver
:as-code? false))
(def model-evaler
(evaluate-classification :mln trained-mln :iter test-iter))
(println (get-stats :evaler model-evaler))
(ns my.ns
(:require [dl4clj.earlystopping.early-stopping-config :refer [new-early-stopping-config]]
[dl4clj.earlystopping.termination-conditions :refer :all]
[dl4clj.earlystopping.model-saver :refer [new-in-memory-saver new-local-file-model-saver]]
[dl4clj.earlystopping.score-calc :refer [new-ds-loss-calculator]]
[dl4clj.earlystopping.early-stopping-trainer :refer [new-early-stopping-trainer]]
[dl4clj.earlystopping.api.early-stopping-trainer :refer [fit-trainer!]]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.nn.multilayer.multi-layer-network :as mln]
[dl4clj.utils :refer [load-model!]]
[dl4clj.datasets.iterators :as iter]
[dl4clj.core :as c]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; start with our network config
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def nn-conf
(nn/builder
;; network args
:optimization-algo :stochastic-gradient-descent
:seed 123 :iterations 1 :regularization? true
;; setting layer defaults
:default-activation-fn :relu :default-l2 7.5e-6
:default-weight-init :xavier :default-learning-rate 0.0015
:default-updater :nesterovs :default-momentum 0.98
;; setting layer configuration
:layers {0 {:dense-layer
{:layer-name "example first layer"
:n-in 784 :n-out 500}}
1 {:dense-layer
{:layer-name "example second layer"
:n-in 500 :n-out 100}}
2 {:output-layer
{:n-in 100 :n-out 10
;; layer specific params
:loss-fn :negativeloglikelihood
:activation-fn :softmax
:layer-name "example output layer"}}}
;; multi layer args
:backprop? true
:pretrain? false))
(def mln (c/model-from-conf nn-conf))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; the training/testing data
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def train-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? true
:seed 123))
(def test-iter
(iter/new-mnist-data-set-iterator
:batch-size 64
:train? false
:seed 123))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; we are going to need termination conditions
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; these allow us to control when we exit training
;; this can be based off of iterations or epochs
;; iteration termination conditions
(def invalid-score-condition (new-invalid-score-iteration-termination-condition))
(def max-score-condition (new-max-score-iteration-termination-condition
:max-score 20.0))
(def max-time-condition (new-max-time-iteration-termination-condition
:max-time-val 10
:max-time-unit :minutes))
;; epoch termination conditions
(def score-doesnt-improve-condition (new-score-improvement-epoch-termination-condition
:max-n-epoch-no-improve 5))
(def target-score-condition (new-best-score-epoch-termination-condition :best-expected-score 0.009))
(def max-number-epochs-condition (new-max-epochs-termination-condition :max-n 20))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; we also need a way to save our model
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; can be in memory or to a local directory
(def in-mem-saver (new-in-memory-saver))
(def local-file-saver (new-local-file-model-saver :directory "resources/tmp/readme/"))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; set up your score calculator
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def score-calcer (new-ds-loss-calculator :iter test-iter
:average? true))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; create an early stopping configuration
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; termination conditions
;; a way to save our model
;; a way to calculate the score of our model on the dataset
(def early-stopping-conf
(new-early-stopping-config
:epoch-termination-conditions [score-doesnt-improve-condition
target-score-condition
max-number-epochs-condition]
:iteration-termination-conditions [invalid-score-condition
max-score-condition
max-time-condition]
:eval-every-n-epochs 5
:model-saver local-file-saver
:save-last-model? true
:score-calculator score-calcer))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; create an early stopping trainer from our data, model and early stopping conf
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def es-trainer (new-early-stopping-trainer :early-stopping-conf early-stopping-conf
:mln mln
:iter train-iter))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; fit and use our early stopping trainer
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def es-trainer-fitted (fit-trainer! es-trainer :as-code? false))
;; when the trainer terminates, you will see something like this
;;[nREPL-worker-24] BaseEarlyStoppingTrainer INFO Completed training epoch 14
;;[nREPL-worker-24] BaseEarlyStoppingTrainer INFO New best model: score = 0.005225599372851298,
;; epoch = 14 (previous: score = 0.018243224899038346, epoch = 7)
;;[nREPL-worker-24] BaseEarlyStoppingTrainer INFO Hit epoch termination condition at epoch 14.
;; Details: BestScoreEpochTerminationCondition(0.009)
;; and if we look at the es-trainer-fitted object we see
;;#object[org.deeplearning4j.earlystopping.EarlyStoppingResult 0x5ab74f27 EarlyStoppingResult
;;(terminationReason=EpochTerminationCondition,details=BestScoreEpochTerminationCondition(0.009),
;; bestModelEpoch=14,bestModelScore=0.005225599372851298,totalEpochs=15)]
;; and our model has been saved to resources/tmp/readme/bestModel.bin
;; there we have our model config, model params and our updater state
;; we can then load this model to use it or continue refining it
(def loaded-model (load-model! :path "resources/tmp/readme/bestModel.bin"
:load-updater? true))
Transfer Learning (freezing layers)
;; TODO: need to write up examples
dl4j Spark usage
How it is done in dl4clj
(ns my.ns
(:require [dl4clj.nn.conf.builders.layers :as l]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.datasets.iterators :refer [new-iris-data-set-iterator]]
[dl4clj.eval.api.eval :refer [get-stats]]
[dl4clj.spark.masters.param-avg :as master]
[dl4clj.spark.data.java-rdd :refer [new-java-spark-context
java-rdd-from-iter]]
[dl4clj.spark.api.dl4j-multi-layer :refer [eval-classification-spark-mln
get-spark-context]]
[dl4clj.core :as c]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 1, create your model config
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def mln-conf
(nn/builder
:optimization-algo :stochastic-gradient-descent
:default-learning-rate 0.006
:layers {0 (l/dense-layer-builder :n-in 4 :n-out 2 :activation-fn :relu)
1 {:output-layer
{:loss-fn :negativeloglikelihood
:n-in 2 :n-out 3
:activation-fn :soft-max
:weight-init :xavier}}}
:backprop? true
:backprop-type :standard))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 2, training master
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def training-master
(master/new-parameter-averaging-training-master
:build? true
:rdd-n-examples 10
:n-workers 4
:averaging-freq 10
:batch-size-per-worker 2
:export-dir "resources/spark/master/"
:rdd-training-approach :direct
:repartition-data :always
:repartition-strategy :balanced
:seed 1234
:save-updater? true
:storage-level :none))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 3, spark context
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def your-spark-context
(new-java-spark-context :app-name "example app"))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 4, training data
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def iris-iter
(new-iris-data-set-iterator
:batch-size 1
:n-examples 5))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 5, spark mln
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def fitted-spark-mln
(c/train-with-spark :spark-context your-spark-context
:mln-conf mln-conf
:training-master training-master
:iter iris-iter
:n-epochs 1
:as-code? false))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 6, use spark context from spark-mln to create rdd
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; TODO: eliminate this step
(def our-rdd
(let [sc (get-spark-context fitted-spark-mln :as-code? false)]
(java-rdd-from-iter :spark-context sc
:iter iris-iter)))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 7, evaluate the model and print stats (poor performance of model expected)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def eval-obj
(eval-classification-spark-mln
:spark-mln fitted-spark-mln
:rdd our-rdd))
(println (get-stats :evaler eval-obj))
(ns my.ns
(:require [dl4clj.nn.conf.builders.layers :as l]
[dl4clj.nn.conf.builders.nn :as nn]
[dl4clj.datasets.iterators :refer [new-iris-data-set-iterator]]
[dl4clj.eval.api.eval :refer [get-stats]]
[dl4clj.spark.masters.param-avg :as master]
[dl4clj.spark.data.java-rdd :refer [new-java-spark-context java-rdd-from-iter]]
[dl4clj.spark.dl4j-multi-layer :as spark-mln]
[dl4clj.spark.api.dl4j-multi-layer :refer [fit-spark-mln!
eval-classification-spark-mln]]))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 1, create your model
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def mln-conf
(nn/builder
:optimization-algo :stochastic-gradient-descent
:default-learning-rate 0.006
:layers {0 (l/dense-layer-builder :n-in 4 :n-out 2 :activation-fn :relu)
1 {:output-layer
{:loss-fn :negativeloglikelihood
:n-in 2 :n-out 3
:activation-fn :soft-max
:weight-init :xavier}}}
:backprop? true
:as-code? false
:backprop-type :standard))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 2, create a training master
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; not all options specified, but most are
(def training-master
(master/new-parameter-averaging-training-master
:build? true
:rdd-n-examples 10
:n-workers 4
:averaging-freq 10
:batch-size-per-worker 2
:export-dir "resources/spark/master/"
:rdd-training-approach :direct
:repartition-data :always
:repartition-strategy :balanced
:seed 1234
:as-code? false
:save-updater? true
:storage-level :none))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 3, create a Spark Multi Layer Network
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def your-spark-context
(new-java-spark-context :app-name "example app" :as-code? false))
;; new-java-spark-context will turn an existing spark-configuration into a java spark context
;; or create a new java spark context with master set to "local[*]" and the app name
;; set to :app-name
(def spark-mln
(spark-mln/new-spark-multi-layer-network
:spark-context your-spark-context
:mln mln-conf
:training-master training-master
:as-code? false))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 4, load your data
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; one way is via a dataset-iterator
;; can make one directly from a dataset (iterator data-set)
;; see: nd4clj.linalg.dataset.api.data-set and nd4clj.linalg.dataset.data-set
;; we are going to use a pre-built one
(def iris-iter
(new-iris-data-set-iterator
:batch-size 1
:n-examples 5
:as-code? false))
;; now let's convert the data into a JavaRDD
(def our-rdd
(java-rdd-from-iter :spark-context your-spark-context
:iter iris-iter))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Step 5, fit and evaluate the model
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(def fitted-spark-mln
(fit-spark-mln!
:spark-mln spark-mln
:rdd our-rdd
:n-epochs 1))
;; this fn also has the option to supply :path-to-data instead of :rdd
;; that path should point to a directory containing a number of dataset objects
(def eval-obj
(eval-classification-spark-mln
:spark-mln fitted-spark-mln
:rdd our-rdd))
;; normally we would use separate training and testing RDDs, but here we evaluate
;; on the same data we trained on
;; let's get the stats for how our model performed
(println (get-stats :evaler eval-obj))
Coming soon
Implement ComputationGraphs and the classes which use them
NLP
Parallelism
TSNE
UI
Author: yetanalytics
Source Code: https://github.com/yetanalytics/dl4clj
License: BSD-2-Clause License
1591611780
How can I find the correct ulimit values for a user account or process on Linux systems?
For proper operation, we must ensure that the correct ulimit values are set after installing various software. The Linux system provides a means of restricting the amount of resources that can be used. Limits are set for each Linux user account. However, system limits are also applied separately to each process running for that user. For example, if certain thresholds are too low, the system might not be able to serve web pages using Nginx/Apache or a PHP/Python app. System resource limits are viewed or set with the ulimit command. Let us see how to use ulimit, which provides control over the resources available to the shell and its processes.
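As a minimal sketch of checking limits, assuming a bash-like shell on Linux, the commands below print the soft and hard limits on open files for the current shell, list every limit, and then read the limits of a running process from /proc. The variable names soft_nofile and hard_nofile are illustrative only.

```shell
# Soft limit: the value currently enforced; hard limit: the ceiling the
# soft limit may be raised to without root privileges.
soft_nofile=$(ulimit -S -n)
hard_nofile=$(ulimit -H -n)
echo "open files: soft=${soft_nofile} hard=${hard_nofile}"

# Show every resource limit that applies to the current shell.
ulimit -a

# Limits of a running process are exposed under /proc on Linux.
# "$$" is the PID of this shell; substitute the PID of another process
# (for example a web server worker) to inspect that process instead.
if [ -r "/proc/$$/limits" ]; then
    grep -i 'open files' "/proc/$$/limits"
fi
```

Comparing the shell's values with the /proc entry of a long-running service is a quick way to see whether that service was started with different limits than an interactive login.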
1591993440
We are going to build a full-stack Todo App using the MEAN stack (MongoDB, ExpressJS, AngularJS and NodeJS). This is the last part of a three-post tutorial series.
MEAN Stack tutorial series:
AngularJS tutorial for beginners (Part I)
Creating RESTful APIs with NodeJS and MongoDB Tutorial (Part II)
MEAN Stack Tutorial: MongoDB, ExpressJS, AngularJS and NodeJS (Part III) 👈 you are here
Before completing the app, let’s cover some background about this stack. If you would rather jump to the hands-on part, click here to get started.
1592610180
CentOS Linux 8.2 (2004) has been released. It is a Linux distribution derived from the RHEL (Red Hat Enterprise Linux) 8.2 source code. CentOS was created when Red Hat stopped providing RHEL free. CentOS 8.2 gives complete control of its open-source software packages and can be fully customized for research needs or for running a high-performance website without the need for license fees. Let us see what’s new in CentOS 8.2 (2004) and how to upgrade an existing CentOS 8.1.1911 server to 8.2.2004 using the command line.
1598195340
How do I configure Amazon SES with the Postfix mail server to send email from a CentOS/RHEL/Fedora/Ubuntu/Debian Linux server?
Amazon Simple Email Service (SES) is a hosted email service for sending and receiving email using your own email addresses and domains. SES is typically used for sending bulk email or routing emails without hosting an MTA. We can use Perl/Python/PHP APIs to send an email via SES. Another option is to configure a Linux or Unix box running Postfix to route all outgoing emails via SES.
Before getting started with Amazon SES and Postfix, you need to sign up for AWS, including SES. You need to verify your email address and other settings. Make sure you create a user for SES access and download credentials too.
If sendmail is installed, remove it. Debian/Ubuntu Linux users type the following apt command/apt-get command:
$ sudo apt --purge remove sendmail
CentOS/RHEL users type the following yum command, or the dnf command on Fedora/CentOS/RHEL 8.x:
$ sudo yum remove sendmail
$ sudo dnf remove sendmail
Sample outputs from CentOS 8 server:
Dependencies resolved.
===============================================================================
Package Architecture Version Repository Size
===============================================================================
Removing:
sendmail x86_64 8.15.2-32.el8 @AppStream 2.4 M
Removing unused dependencies:
cyrus-sasl x86_64 2.1.27-1.el8 @BaseOS 160 k
procmail x86_64 3.22-47.el8 @AppStream 369 k
Transaction Summary
===============================================================================
Remove 3 Packages
Freed space: 2.9 M
Is this ok [y/N]: y