Awesome Python: Libraries for Machine Learning

Machine Learning

Libraries for Machine Learning. Also see awesome-machine-learning.

  • gym - A toolkit for developing and comparing reinforcement learning algorithms.
  • H2O - Open source, fast, and scalable machine learning platform.
  • Metrics - Machine learning evaluation metrics.
  • NuPIC - Numenta Platform for Intelligent Computing.
  • scikit-learn - The most popular Python library for Machine Learning.
  • Spark ML - Apache Spark's scalable Machine Learning library.
  • vowpal_porpoise - A lightweight Python wrapper for Vowpal Wabbit.
  • xgboost - A scalable, portable, and distributed gradient boosting library.
  • MindsDB - MindsDB is an open source AI layer for existing databases that allows you to effortlessly develop, train and deploy state-of-the-art machine learning models using standard queries.

Author: vinta
Source Code: https://github.com/vinta/awesome-python
License: View license

#python #machine-learning 


MindsDB: In-Database Machine Learning

MindsDB enables you to use ML predictions in your database using SQL.

  • Developers can quickly add AI capabilities to their applications.
  • Data Scientists can streamline MLOps by deploying ML models as AI Tables.
  • Data Analysts can easily make forecasts on complex data (like multivariate time-series with high cardinality) and visualize them in BI tools like Tableau.

If you like our project then we would really appreciate a Star ⭐!

Also, check out the rewards and community programs.


Installation - Overview - Features - Database Integrations - Quickstart - Documentation - Support - Contributing - Mailing lists - License


Machine Learning using SQL 

MindsDB

Installation

To install the latest version of MindsDB, pull the following Docker image:

docker pull mindsdb/mindsdb

Or, use PyPI:

pip install mindsdb

Overview

MindsDB automates and abstracts machine learning models through virtual AI Tables.

Apart from abstracting ML models as AI Tables inside databases, MindsDB has a set of unique capabilities, such as:

  • Easily making predictions over very complex multivariate time-series data with high cardinality
  • An open JSON-AI syntax to tune ML models and optimize ML pipelines in a declarative way

How it works:

  1. Let MindsDB connect to your database.
  2. Train a Predictor using a single SQL statement (make MindsDB learn from historical data automatically), or import your own ML model into a Predictor via JSON-AI.
  3. Make predictions with SQL statements (the Predictor is exposed as virtual AI Tables). There's no need to deploy models since they are already part of the data layer.
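As a hedged illustration of steps 2 and 3: MindsDB speaks the MySQL wire protocol, so a Predictor can be created and queried from Python with any MySQL client. The connection details, database name, and home_rentals table below are hypothetical placeholders.

import mysql.connector  # any MySQL-compatible client works

# Hypothetical connection to a local MindsDB instance (MySQL API port).
conn = mysql.connector.connect(host="127.0.0.1", port=47335,
                               user="mindsdb", password="")
cur = conn.cursor()

# Step 2: train a Predictor from historical data with one SQL statement.
cur.execute("""
    CREATE PREDICTOR mindsdb.rentals_model
    FROM example_db (SELECT * FROM home_rentals)
    PREDICT rental_price
""")

# Training runs in the background; query the Predictor once it is ready.
# Step 3: query the Predictor like a table to get a prediction.
cur.execute("""
    SELECT rental_price
    FROM mindsdb.rentals_model
    WHERE sqft = 800 AND location = 'great'
""")
print(cur.fetchall())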

Check our docs and blog for tutorials and use case examples.

Features

  • Automatic data pre-processing, feature engineering and encoding
  • Classification, regression, time-series tasks
  • Bring models to production without “traditional deployment” as AI Tables
  • Get models' accuracy scoring and confidence intervals for each prediction
  • Join ML models with existing data
  • Anomaly detection
  • Model explainability analysis
  • GPU support for models’ training
  • Open JSON-AI syntax to build models and bring your own ML blocks in a declarative way
  • REST API available as well

Database Integrations

MindsDB works with most SQL and NoSQL databases and data streams for real-time ML.

Connect your Data
Connect Apache Kafka
Connect Amazon Redshift
Connect Cassandra
Connect Clickhouse
Connect CockroachDB
Connect MariaDB
Connect SQL Server
Connect MongoDB
Connect MySQL
Connect PostgreSQL
Connect Redis
Connect ScyllaDB
Connect Singlestore
Connect Snowflake
Connect Trino

❓ 👋 Missing integration?

Quickstart

To get your hands on MindsDB, we recommend using the Docker image or simply signing up for a free cloud account. Feel free to browse the documentation for other installation methods and tutorials.

Documentation

You can find the complete documentation of MindsDB at docs.mindsdb.com. Documentation for our HTTP API can be found at apidocs.mindsdb.com.

Support

If you found a bug, please submit an issue on GitHub.

To get community support, you can:

If you need commercial support, please contact the MindsDB team.

Contributing

A great place to start contributing to MindsDB is our GitHub projects. :checkered_flag:

Also, we are always open to suggestions so feel free to open new issues with your ideas and we can give you guidance!

Being part of the core team is accessible to anyone who is motivated and wants to be part of that journey! If you'd like to contribute to the project, refer to the contributing documentation.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project, you agree to abide by its terms.

Made with contributors-img.

Mailing lists

Subscribe to MindsDB Monthly Community Newsletter to get general announcements, release notes, information about MindsDB events, and the latest blog posts. You may also join our beta-users group, and get access to new beta features.

License

MindsDB is licensed under GNU General Public License v3.0

Author: mindsdb
Source Code: https://github.com/mindsdb/mindsdb
License: GPL-3.0 License

#python #mindsdb #machine-learning #database 


XGBoost: An Optimized Distributed Gradient Boosting Library

eXtreme Gradient Boosting

Community | Documentation | Resources | Contributors | Release Notes

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
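As a quick, minimal sketch of what this looks like in Python (the dataset and parameters are illustrative, using the scikit-learn compatible interface):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosted trees: 100 shallow trees, each correcting the last.
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on held-out data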

License

© Contributors, 2021. Licensed under an Apache-2 license.

Contribute to XGBoost

XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. Check out the Community Page.

Reference

  • Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
  • XGBoost originates from a research project at the University of Washington.

Sponsors

Become a sponsor and get a logo here. See details at Sponsoring the XGBoost Project. The funds are used to defray the cost of continuous integration and testing infrastructure (https://xgboost-ci.net).

Open Source Collective sponsors 

Sponsors

[Become a sponsor]

NVIDIA

Backers

[Become a backer]

Other sponsors

The sponsors in this list are donating cloud hours in lieu of cash donation.

Author: dmlc
Source Code: https://github.com/dmlc/xgboost
License: Apache-2.0 License

#python #xgboost #machine-learning 


Vowpal Wabbit: Machine Learning System Which Pushes The Frontier Of ML

This is the Vowpal Wabbit fast online learning code.

Why Vowpal Wabbit?

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. There is a specific focus on reinforcement learning, with several contextual bandit algorithms implemented and the online nature of the system lending itself well to the problem. Vowpal Wabbit is a destination for implementing and maturing state-of-the-art algorithms with performance in mind.

  • Input Format. The input format for the learning algorithm is substantially more flexible than might be expected. Examples can have features consisting of free-form text, which is interpreted in a bag-of-words way. There can even be multiple sets of free-form text in different namespaces. (A sketch of this format follows the list below.)
  • Speed. The learning algorithm is fast -- similar to the few other online algorithm implementations out there. There are several optimization algorithms available with the baseline being sparse gradient descent (GD) on a loss function.
  • Scalability. This is not the same as fast. Instead, the important characteristic here is that the memory footprint of the program is bounded independent of data. This means the training set is not loaded into main memory before learning starts. In addition, the size of the set of features is bounded independent of the amount of training data using the hashing trick.
  • Feature Interaction. Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for ranking problems. The alternative of explicitly expanding the features before feeding them into the learning algorithm can be both computation and space intensive, depending on how it's handled.
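To make the input format concrete, here is a small Python sketch that assembles VW-style example lines; the helper and namespace names are made up for illustration.

def vw_example(label, **namespaces):
    # Build one VW-format line: "<label> |ns1 words ... |ns2 words ...".
    parts = [str(label)]
    for ns, text in namespaces.items():
        parts.append(f"|{ns} {text}")
    return " ".join(parts)

# Two namespaces of free-form text, each interpreted as a bag of words.
print(vw_example(1, title="big red square", body="a large crimson shape"))
# -> 1 |title big red square |body a large crimson shape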

Visit the wiki to learn more.

Getting Started

For the most up-to-date instructions for getting started on Windows, MacOS, or Linux, please see the wiki.

Author: VowpalWabbit
Source Code: https://github.com/VowpalWabbit/vowpal_wabbit
License: View license

#python #machine-learning 


Vowpal Porpoise: Lightweight Python Wrapper for Vowpal Wabbit

vowpal_porpoise

Lightweight python wrapper for vowpal_wabbit.

Why: Scalable, blazingly fast machine learning.

Install

  1. Install vowpal_wabbit: clone the repository and run make.
  2. Install Cython: pip install cython
  3. Clone vowpal_porpoise.
  4. Run python setup.py install to install.

Now you can do import vowpal_porpoise from Python.

Examples

Standard Interface

Linear regression with l1 penalty:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test',    # a name for the model
        passes=10,         # vw arg: passes
        loss='quadratic',  # vw arg: loss
        learning_rate=10,  # vw arg: learning_rate
        l1=0.01)           # vw arg: l1

# Inside the with training() block a vw process will be
# open for communication
with vw.training():
    for instance in ['1 |big red square',\
                      '0 |small blue circle']:
        vw.push_instance(instance)

    # here stdin will close
# here the vw process will have finished

# Inside the with predicting() block we can stream instances and 
# acquire their labels
with vw.predicting():
    for instance in ['1 |large burnt sienna rhombus',\
                      '0 |little teal oval']:
        vw.push_instance(instance)

# Read the predictions like this:
predictions = list(vw.read_predictions_())

L-BFGS with a rank-5 approximation:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lbfgs', # a name for the model
        passes=10,            # vw arg: passes
        lbfgs=True,           # turn on lbfgs
        mem=5)                # lbfgs rank

Latent Dirichlet Allocation with 100 topics:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lda',  # a name for the model
        passes=10,           # vw arg: passes
        lda=100,             # turn on lda
        minibatch=100)       # set the minibatch size

Scikit-learn Interface

vowpal_porpoise also ships with an interface into scikit-learn, which allows awesome experiment-level stuff like cross-validation:

from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
from vowpal_porpoise.sklearn import VW_Classifier

# Assumes X_train, y_train, and a `parameters` grid are already defined.
GridSearchCV(
        VW_Classifier(loss='logistic', moniker='example_sklearn',
                      passes=10, silent=True, learning_rate=10),
        param_grid=parameters,
        score_func=f1_score,
        cv=StratifiedKFold(y_train),
).fit(X_train, y_train)

Check out example_sklearn.py for more details.

Library Interface (DISABLED as of 2013-08-12)

Via the VW interface:

with vw.predicting_library():
    for instance in ['1 |large burnt sienna rhombus', \
                      '1 |little teal oval']:
        prediction = vw.push_instance(instance)

Now the predictions are returned directly to the parent process, rather than having to read from disk. See examples/example1.py for more details.

Alternatively you can use the raw library interface:

import vw_c
vw = vw_c.VW("--loss=quadratic --l1=0.01 -f model")
vw.learn("1 |this is a positive example")
vw.learn("0 |this is a negative example")
vw.finish()

Currently does not support passes due to some limitations in the underlying vw C code.

Need more examples?

  • example1.py: SimpleModel class wrapper around VP (both standard and library flavors)
  • example_library.py: Demonstrates the low-level vw library wrapper, classifying lines of Alice in Wonderland vs. Through the Looking-Glass.

Why

vowpal_wabbit is insanely fast and scalable. vowpal_porpoise is slower, but only during the initial training pass. Once the data has been properly cached it will idle while vowpal_wabbit does all the heavy lifting. Furthermore, vowpal_porpoise was designed to be lightweight and not to get in the way of vowpal_wabbit's scalability, e.g. it allows distributed learning via --nodes and does not require data to be batched in memory. In our research work we use vowpal_porpoise on an 80-node cluster running over multiple terabytes of data.

The main benefit of vowpal_porpoise is allowing rapid prototyping of new models and feature extractors. We found that we had been doing this in an ad-hoc way using python scripts to shuffle around massive gzipped text files, so we just closed the loop and made vowpal_wabbit a python library.

How it works

Wraps the vw binary in a subprocess and uses stdin to push data and temporary files to pull predictions. Why not use the prediction labels vw provides on stdout? It turns out that the Python GIL basically makes streaming in and out of a process (even asynchronously) painfully difficult. If you know of a clever way to get around this, please email me. In other languages (e.g. in a forthcoming Scala wrapper) this is not an issue.
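A minimal sketch of that pattern (assuming a vw binary on PATH; flags and paths are illustrative):

import os
import subprocess
import tempfile

# Stream examples to vw over stdin; read predictions from a temp file.
fd, pred_path = tempfile.mkstemp(suffix=".preds")
os.close(fd)

proc = subprocess.Popen(["vw", "--quiet", "-p", pred_path],
                        stdin=subprocess.PIPE, text=True)
for line in ["1 |f big red square", "0 |f small blue circle"]:
    proc.stdin.write(line + "\n")
proc.stdin.close()  # vw finishes once stdin closes
proc.wait()

with open(pred_path) as f:
    predictions = [float(x.split()[0]) for x in f if x.strip()]
os.remove(pred_path)
print(predictions)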

Alternatively, you can use a pure api call (vw_c, wrapping libvw) for prediction.

Contact

Joseph Reisinger @josephreisinger

Contributors

License

Apache 2.0

Author: josephreisinger
Source Code: https://github.com/josephreisinger/vowpal_porpoise
License: View license

#python #machine-learning 


Scikit Learn: Machine Learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.8)
  • NumPy (>= 1.17.3)
  • SciPy (>= 1.3.2)
  • joblib (>= 1.0.0)
  • threadpoolctl (>= 2.0.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with "Display") require Matplotlib (>= 3.1.2). For running the examples, Matplotlib >= 3.1.2 is required. A few examples require scikit-image >= 0.14.5, a few require pandas >= 1.0.5, and some require seaborn >= 0.9.0.

User installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.
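As a quick sanity check that the installation works, here is a minimal sketch that trains and scores a simple classifier on a bundled dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split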

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 5.0.1 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

Author: scikit-learn
Source Code: https://github.com/scikit-learn/scikit-learn
License: BSD-3-Clause License

#python #scikitlearn #machine-learning 


Metrics: Machine Learning Evaluation Metrics, Implemented in Python

Note: the current releases of this toolbox are beta releases, intended to test the workflow across the Haskell, Python, and R code repositories.

Metrics provides implementations of various supervised machine learning evaluation metrics in the following languages:

  • Python: easy_install ml_metrics
  • R: install.packages("Metrics") from the R prompt
  • Haskell: cabal install Metrics
  • MATLAB / Octave: clone the repo and run setup from the MATLAB command line

For more detailed installation instructions, see the README for each implementation.
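To make two of the listed metrics concrete, here is a plain-Python sketch of their definitions; the toolbox's own function names and signatures may differ.

import math

def mae(actual, predicted):
    # Mean Absolute Error: average magnitude of the errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: square root of the mean squared error.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

print(mae([3, -0.5, 2], [2.5, 0.0, 2]))   # 0.333...
print(rmse([3, -0.5, 2], [2.5, 0.0, 2]))  # 0.408...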

EVALUATION METRICS

The toolbox covers the following metrics; availability varies by language, so see each implementation's README for details:

  • Absolute Error (AE)
  • Average Precision at K (APK, AP@K)
  • Area Under the ROC (AUC)
  • Classification Error (CE)
  • F1 Score (F1)
  • Gini
  • Levenshtein
  • Log Loss (LL)
  • Mean Log Loss (LogLoss)
  • Mean Absolute Error (MAE)
  • Mean Average Precision at K (MAPK, MAP@K)
  • Mean Quadratic Weighted Kappa
  • Mean Squared Error (MSE)
  • Mean Squared Log Error (MSLE)
  • Normalized Gini
  • Quadratic Weighted Kappa
  • Relative Absolute Error (RAE)
  • Root Mean Squared Error (RMSE)
  • Relative Squared Error (RSE)
  • Root Relative Squared Error (RRSE)
  • Root Mean Squared Log Error (RMSLE)
  • Squared Error (SE)
  • Squared Log Error (SLE)

TO IMPLEMENT

  • F1 score
  • Multiclass log loss
  • Lift
  • Average Precision for binary classification
  • precision / recall break-even point
  • cross-entropy
  • True Pos / False Pos / True Neg / False Neg rates
  • precision / recall / sensitivity / specificity
  • mutual information

HIGHER LEVEL TRANSFORMATIONS TO HANDLE

  • GroupBy / Reduce
  • Weight individual samples or groups

PROPERTIES METRICS CAN HAVE

(Nonexhaustive and to be added in the future)

  • Min or Max (optimize through minimization or maximization)
  • Binary Classification
    • Scores predicted class labels
    • Scores predicted ranking (most likely to least likely for being in one class)
    • Scores predicted probabilities
  • Multiclass Classification
    • Scores predicted class labels
    • Scores predicted probabilities
  • Regression
  • Discrete Rater Comparison (confusion matrix)

Author: benhamner
Source Code: https://github.com/benhamner/Metrics
License: View license

#python #machine-learning 


H2O: Open Source, Distributed, Fast & Scalable ML Platform

H2O

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

H2O is extensible so that developers can add data transformations and custom algorithms of their choice and access them through all of those clients. H2O models can be downloaded and loaded into H2O memory for scoring, or exported into POJO or MOJO format for extremely fast scoring in production. More information can be found in the H2O User Guide.

H2O-3 (this repository) is the third incarnation of H2O, and the successor to H2O-2.

1. Downloading H2O-3

While most of this README is written for developers who do their own builds, most H2O users just download and use a pre-built version. If you are a Python or R user, the easiest way to install H2O is via PyPI or Anaconda (for Python) or CRAN (for R):

Python

pip install h2o

R

install.packages("h2o")

For the latest stable, nightly, Hadoop (or Spark / Sparkling Water) releases, or the stand-alone H2O jar, please visit: https://h2o.ai/download

More info on downloading & installing H2O is available in the H2O User Guide.
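After installing the Python package, a minimal sketch of starting a local cluster and training a model looks like this (the tiny in-memory frame is illustrative; real workflows import files or database tables):

import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()  # starts (or connects to) a local H2O cluster

frame = h2o.H2OFrame({"x1": [1, 2, 3, 4], "x2": [0, 1, 0, 1],
                      "y":  [1.0, 2.1, 2.9, 4.2]})

model = H2OGeneralizedLinearEstimator(family="gaussian")
model.train(x=["x1", "x2"], y="y", training_frame=frame)
print(model.predict(frame))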

2. Open Source Resources

Most people interact with three or four primary open source resources: GitHub (which you've already found), JIRA (for bug reports and issue tracking), Stack Overflow for H2O code/software-specific questions, and h2ostream (a Google Group / email discussion forum) for questions not suitable for Stack Overflow. There is also a Gitter H2O developer chat group, however for archival purposes & to maximize accessibility, we'd prefer that standard H2O Q&A be conducted on Stack Overflow.

2.1 Issue Tracking and Feature Requests

(Note: There is only one issue tracking system for the project. GitHub issues are not enabled; you must use JIRA.)

You can browse and create new issues in our open source JIRA: http://jira.h2o.ai

  • You can browse and search for issues without logging in to JIRA:
    1. Click the Issues menu
    2. Click Search for issues
  • To create an issue (either a bug or a feature request), please create yourself an account first:
    1. Click the Log In button on the top right of the screen
    2. Click Create an account near the bottom of the login box
    3. Once you have created an account and logged in, use the Create button on the menu to create an issue
    4. Create H2O-3 issues in the PUBDEV project. (Note: Sparkling Water questions should be filed under the SW project.)
  • You can also vote for feature requests and/or other issues. Voting can help H2O prioritize the features that are included in each release.
    1. Go to the H2O JIRA page.
    2. Click Log In to either log in or create an account if you do not already have one.
    3. Search for the feature that you want to prioritize, or create a new feature.
    4. Click on the Vote for this issue link. This is located on the right side of the issue under the People section.

2.2 List of H2O Resources

GitHub

JIRA -- file bug reports / track issues here

  • The PUBDEV project contains issues for the current H2O-3 project.

Stack Overflow -- ask all code/software questions here

Cross Validated (Stack Exchange) -- ask algorithm/theory questions here

h2ostream Google Group -- ask non-code related questions here

Gitter H2O Developer Chat

Documentation

Download (pre-built packages)

Jenkins (H2O build and test system)

Website

Twitter -- follow us for updates and H2O news!

Awesome H2O -- share your H2O-powered creations with us

3. Using H2O-3 Artifacts

Every nightly build publishes R, Python, Java, and Scala artifacts to a build-specific repository. In particular, you can find Java artifacts in the maven/repo directory.

Here is an example snippet of a gradle build file using h2o-3 as a dependency. Replace x, y, z, and nnnn with valid numbers.

// h2o-3 dependency information
def h2oBranch = 'master'
def h2oBuildNumber = 'nnnn'
def h2oProjectVersion = "x.y.z.${h2oBuildNumber}"

repositories {
  // h2o-3 dependencies
  maven {
    url "https://s3.amazonaws.com/h2o-release/h2o-3/${h2oBranch}/${h2oBuildNumber}/maven/repo/"
  }
}

dependencies {
  compile "ai.h2o:h2o-core:${h2oProjectVersion}"
  compile "ai.h2o:h2o-algos:${h2oProjectVersion}"
  compile "ai.h2o:h2o-web:${h2oProjectVersion}"
  compile "ai.h2o:h2o-app:${h2oProjectVersion}"
}

Refer to the latest H2O-3 bleeding edge nightly build page for information about installing nightly build artifacts.

Refer to the h2o-droplets GitHub repository for a working example of how to use Java artifacts with gradle.

Note: Stable H2O-3 artifacts are periodically published to Maven Central (click here to search) but may substantially lag behind H2O-3 Bleeding Edge nightly builds.

4. Building H2O-3

Getting started with H2O development requires JDK 1.8+, Node.js, Gradle, Python and R. We use the Gradle wrapper (called gradlew) to ensure up-to-date local versions of Gradle and other dependencies are installed in your development directory.

4.1. Before building

Building h2o requires a properly set up R environment with the required packages and a Python environment with the following packages:

grip
future
tabulate
requests
wheel

To install these packages you can use pip or conda. If you have trouble installing them on Windows, please follow the Setup on Windows section of this guide.

(Note: It is recommended to use a virtual environment, such as VirtualEnv, to install all packages.)

4.2. Building from the command line (Quick Start)

To build H2O from the repository, perform the following steps.

Recipe 1: Clone fresh, build, skip tests, and run H2O

# Build H2O
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew build -x test

# You may encounter problems, e.g. npm missing. Install it:
brew install npm

# Start H2O
java -jar build/h2o.jar

# Point browser to http://localhost:54321

Recipe 2: Clone fresh, build, and run tests (requires a working install of R)

git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build

Notes:

  • Running tests starts five test JVMs that form an H2O cluster and requires at least 8GB of RAM (preferably 16GB of RAM).
  • Running ./gradlew syncRPackages is supported on Windows, OS X, and Linux, and is strongly recommended but not required. ./gradlew syncRPackages ensures a complete and consistent environment with pre-approved versions of the packages required for tests and builds. The packages can be installed manually, but we recommend setting an ENV variable and using ./gradlew syncRPackages. To set the ENV variable, use the following format (where ${WORKSPACE} can be any path):
mkdir -p ${WORKSPACE}/Rlibrary
export R_LIBS_USER=${WORKSPACE}/Rlibrary

Recipe 3: Pull, clean, build, and run tests

git pull
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew clean
./gradlew build

Notes

We recommend using ./gradlew clean after each git pull.

Skip tests by adding -x test at the end of the gradle build command line. Tests typically run for 7-10 minutes on a MacBook Pro laptop with 4 CPUs (8 hyperthreads) and 16 GB of RAM.

Syncing smalldata is not required after each pull, but if tests fail due to missing data files, then try ./gradlew syncSmalldata as the first troubleshooting step. Syncing smalldata downloads data files from AWS S3 to the smalldata directory in your workspace. The sync is incremental. Do not check in these files. The smalldata directory is in .gitignore. If you do not run any tests, you do not need the smalldata directory.

Running ./gradlew syncRPackages is supported on Windows, OS X, and Linux, and is strongly recommended but not required. ./gradlew syncRPackages ensures a complete and consistent environment with pre-approved versions of the packages required for tests and builds. The packages can be installed manually, but we recommend setting an ENV variable and using ./gradlew syncRPackages. To set the ENV variable, use the following format (where ${WORKSPACE} can be any path):

mkdir -p ${WORKSPACE}/Rlibrary
export R_LIBS_USER=${WORKSPACE}/Rlibrary

Recipe 4: Just building the docs

./gradlew clean && ./gradlew build -x test && (export DO_FAST=1; ./gradlew dist)
open target/docs-website/h2o-docs/index.html

4.3. Setup on Windows

Step 1: Download and install WinPython.

From the command line, validate that python is using the newly installed package by running which python (or sudo which python). Update the Environment variable with the WinPython path.

Step 2: Install required Python packages:

pip install grip future tabulate wheel

Step 3: Install JDK

Install Java 1.8+ and add the appropriate directory (e.g. C:\Program Files\Java\jdk1.7.0_65\bin, which contains java.exe) to PATH in Environment Variables. To make sure the command prompt is detecting the correct Java version, run:

javac -version

The CLASSPATH variable also needs to be set to the lib subfolder of the JDK:

CLASSPATH=/<path>/<to>/<jdk>/lib

Step 4. Install Node.js

Install Node.js and add the installed directory C:\Program Files\nodejs, which must include node.exe and npm.cmd, to PATH if not already prepended.

Step 5. Install R, the required packages, and Rtools:

Install R and add the bin directory to your PATH if not already included.

Install the following R packages:

To install these packages from within an R session:

pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

Note that libcurl is required for installation of the RCurl R package.

Note that these packages don't cover running tests; they are for building H2O only.

Finally, install Rtools, which is a collection of command line tools to facilitate R development on Windows.

NOTE: During Rtools installation, do not install Cygwin.dll.

Step 6. Install Cygwin

NOTE: During installation of Cygwin, deselect the Python packages to avoid a conflict with the Python.org package.

Step 6b. Validate Cygwin

If Cygwin is already installed, remove the Python packages or ensure that Native Python is before Cygwin in the PATH variable.

Step 7. Update or validate the Windows PATH variable to include R, Java JDK, Cygwin.

Step 8. Git Clone h2o-3

If you don't already have a Git client, please install one. The default one can be found here: http://git-scm.com/downloads. Make sure that command prompt support is enabled before the installation.

Download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3

Step 9. Run the top-level gradle build:

cd h2o-3
./gradlew.bat build

If you encounter errors run again with --stacktrace for more instructions on missing dependencies.

4.4. Setup on OS X

If you don't have Homebrew, we recommend installing it. It makes package management for OS X easy.

Step 1. Install JDK

Install Java 1.8+. To make sure the command prompt is detecting the correct Java version, run:

javac -version

Step 2. Install Node.js:

Using Homebrew:

brew install node

Otherwise, install from the NodeJS website.

Step 3. Install R and the required packages:

Install R and add the bin directory to your PATH if not already included.

Install the following R packages:

To install these packages from within an R session:

pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

Note that libcurl is required for installation of the RCurl R package.

Note that these packages don't cover running tests; they are for building H2O only.

Step 4. Install python and the required packages:

Install python:

brew install python

Install pip package manager:

sudo easy_install pip

Next install required packages:

sudo pip install wheel requests future tabulate  

Step 5. Git Clone h2o-3

OS X should already have Git installed. To download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3

Step 6. Run the top-level gradle build:

cd h2o-3
./gradlew build

Note: on a regular machine it may take a very long time (about an hour) to run all the tests.

If you encounter errors run again with --stacktrace for more instructions on missing dependencies.

4.5. Setup on Ubuntu 14.04

Step 1. Install Node.js

curl -sL https://deb.nodesource.com/setup_0.12 | sudo bash -
sudo apt-get install -y nodejs

Step 2. Install JDK:

Install Java 8. Installation instructions can be found here JDK installation. To make sure the command prompt is detecting the correct Java version, run:

javac -version

Step 3. Install R and the required packages:

Installation instructions can be found here R installation. Click “Download R for Linux”. Click “ubuntu”. Follow the given instructions.

To install the required packages, follow the same instructions as for OS X above.

Note: If the process fails to install RStudio Server on Linux, run one of the following:

sudo apt-get install libcurl4-openssl-dev

or

sudo apt-get install libcurl4-gnutls-dev

Step 4. Git Clone h2o-3

If you don't already have a Git client:

sudo apt-get install git

Download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3

Step 5. Run the top-level gradle build:

cd h2o-3
./gradlew build

If you encounter errors, run again using --stacktrace for more instructions on missing dependencies.

Make sure that you are not running as root, since bower will reject such a run.

4.6. Setup on Ubuntu 13.10

Step 1. Install Node.js

curl -sL https://deb.nodesource.com/setup_16.x | sudo bash -
sudo apt-get install -y nodejs

Steps 2-4. Follow steps 2-4 for Ubuntu 14.04 (above)

4.7. Setup on CentOS 7

cd /opt
sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.tar.gz"

sudo tar xzf jdk-7u79-linux-x64.tar.gz
cd jdk1.7.0_79

sudo alternatives --install /usr/bin/java java /opt/jdk1.7.0_79/bin/java 2

sudo alternatives --install /usr/bin/jar jar /opt/jdk1.7.0_79/bin/jar 2
sudo alternatives --install /usr/bin/javac javac /opt/jdk1.7.0_79/bin/javac 2
sudo alternatives --set jar /opt/jdk1.7.0_79/bin/jar
sudo alternatives --set javac /opt/jdk1.7.0_79/bin/javac

cd /opt

sudo wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
sudo rpm -ivh epel-release-7-5.noarch.rpm

echo "multilib_policy=best" | sudo tee -a /etc/yum.conf
sudo yum -y update

sudo yum -y install R R-devel git python-pip openssl-devel libxml2-devel libcurl-devel gcc gcc-c++ make openssl-devel kernel-devel texlive texinfo texlive-latex-fonts libX11-devel mesa-libGL-devel mesa-libGL nodejs npm python-devel numpy scipy python-pandas

sudo pip install scikit-learn grip tabulate statsmodels wheel

mkdir ~/Rlibrary
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=/opt/jdk1.7.0_79/jre
export PATH=$PATH:/opt/jdk1.7.0_79/bin:/opt/jdk1.7.0_79/jre/bin
export R_LIBS_USER=~/Rlibrary

# install local R packages
R -e 'install.packages(c("RCurl","jsonlite","statmod","devtools","roxygen2","testthat"), dependencies=TRUE, repos="http://cran.rstudio.com/")'

cd
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3

# Build H2O
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build -x test

5. Launching H2O after Building

To start the H2O cluster locally, execute the following on the command line:

java -jar build/h2o.jar

A list of available start-up JVM and H2O options (e.g. -Xmx, -nthreads, -ip) is available in the H2O User Guide.

6. Building H2O on Hadoop

Pre-built H2O-on-Hadoop zip files are available on the download page. Each Hadoop distribution version has a separate zip file in h2o-3.

To build H2O with Hadoop support yourself, first install sphinx for Python: pip install sphinx. Then start the build by entering the following from the top-level h2o-3 directory:

(export BUILD_HADOOP=1; ./gradlew build -x test)
./gradlew dist

This will create a directory called 'target' and generate zip files there. Note that BUILD_HADOOP is the default behavior when the username is jenkins (refer to settings.gradle); otherwise you have to request it, as shown above.

Adding support for a new version of Hadoop

In the h2o-hadoop directory, each Hadoop version has a build directory for the driver and an assembly directory for the fatjar.

You need to:

  1. Add a new driver directory and assembly directory (each with a build.gradle file) in h2o-hadoop
  2. Add these new projects to h2o-3/settings.gradle
  3. Add the new Hadoop version to HADOOP_VERSIONS in make-dist.sh
  4. Add the new Hadoop version to the list in h2o-dist/buildinfo.json

Secure user impersonation

Hadoop supports secure user impersonation through its Java API. A kerberos-authenticated user can be allowed to proxy any username that meets specified criteria entered in the NameNode's core-site.xml file. This impersonation only applies to interactions with the Hadoop API or the APIs of Hadoop-related services that support it (this is not the same as switching to that user on the machine of origin).

Setting up secure user impersonation (for h2o):

  1. Create or find an id to use as proxy which has limited-to-no access to HDFS or related services; the proxy user need only be used to impersonate a user
  2. (Required if not using h2odriver) If you are not using the driver (e.g. you wrote your own code against h2o's API using Hadoop), make the necessary code changes to impersonate users (see org.apache.hadoop.security.UserGroupInformation)
  3. In either of Ambari/Cloudera Manager or directly on the NameNode's core-site.xml file, add 2 or 3 properties for the user we wish to use as a proxy (replace <proxyusername> with the simple user name, not the fully-qualified principal name):
     • hadoop.proxyuser.<proxyusername>.hosts: the hosts the proxy user is allowed to perform impersonated actions on behalf of a valid user from
     • hadoop.proxyuser.<proxyusername>.groups: the groups an impersonated user must belong to for impersonation to work with that proxy user
     • hadoop.proxyuser.<proxyusername>.users: the users a proxy user is allowed to impersonate
     Example:
     <property>
       <name>hadoop.proxyuser.myproxyuser.hosts</name>
       <value>host1,host2</value>
     </property>
     <property>
       <name>hadoop.proxyuser.myproxyuser.groups</name>
       <value>group1,group2</value>
     </property>
     <property>
       <name>hadoop.proxyuser.myproxyuser.users</name>
       <value>user1,user2</value>
     </property>
  4. Restart core services such as HDFS & YARN for the changes to take effect

Impersonated HDFS actions can be viewed in the hdfs audit log ('auth:PROXY' should appear in the ugi= field in entries where this is applicable). YARN similarly should show 'auth:PROXY' somewhere in the Resource Manager UI.

To use secure impersonation with h2o's Hadoop driver:

Before this is attempted, see Risks with secure impersonation, below.

When using the h2odriver (e.g. when running with hadoop jar ...), specify -principal <proxy user kerberos principal>, -keytab <proxy user keytab path>, and -run_as_user <hadoop username to impersonate>, in addition to any other arguments needed. If the configuration was successful, the proxy user will log in and impersonate the -run_as_user as long as that user is allowed by either the users or groups configuration property (configured above); this is enforced by HDFS & YARN, not h2o's code. The driver effectively sets its security context as the impersonated user so all supported Hadoop actions will be performed as that user (e.g. YARN, HDFS APIs support securely impersonated users, but others may not).

Precautions to take when leveraging secure impersonation

  • The target use case for secure impersonation is applications or services that pre-authenticate a user and then use (in this case) the h2odriver on behalf of that user. H2O's Steam is a perfect example: auth user in web app over SSL, impersonate that user when creating the h2o YARN container.
  • The proxy user should have limited permissions in the Hadoop cluster; this means no permissions to access data or make API calls. In this way, if it's compromised it would only have the power to impersonate a specific subset of the users in the cluster and only from specific machines.
  • Use the hadoop.proxyuser.<proxyusername>.hosts property whenever possible or practical.
  • Don't give the proxyusername's password or keytab to any user that you don't want to be able to impersonate another user (this is generally any user). The point of impersonation is not to allow users to impersonate each other. See the first bullet for the typical use case.
  • Limit user logon to the machine the proxying is occurring from whenever practical.
  • Make sure the keytab used to login the proxy user is properly secured and that users can't login as that id (via su, for instance)
  • Never set hadoop.proxyuser.<proxyusername>.{users,groups} to '*' or 'hdfs', 'yarn', etc. Allowing any user to impersonate hdfs, yarn, or any other important user/group should be done with extreme caution and strongly analyzed before it's allowed.

Risks with secure impersonation

  • The id performing the impersonation can be compromised like any other user id.
  • Setting any hadoop.proxyuser.<proxyusername>.{hosts,groups,users} property to '*' can greatly increase exposure to security risk.
  • When users aren't authenticated before being used with the driver (e.g. like Steam does via a secure web app/API), auditability of the process/system is difficult.
For example, the following git diff switches the build to a different Hadoop client (here a MapR build) by modifying the h2o-app and h2o-persist-hdfs gradle files:

$ git diff
diff --git a/h2o-app/build.gradle b/h2o-app/build.gradle
index af3b929..097af85 100644
--- a/h2o-app/build.gradle
+++ b/h2o-app/build.gradle
@@ -8,5 +8,6 @@ dependencies {
   compile project(":h2o-algos")
   compile project(":h2o-core")
   compile project(":h2o-genmodel")
+  compile project(":h2o-persist-hdfs")
 }

diff --git a/h2o-persist-hdfs/build.gradle b/h2o-persist-hdfs/build.gradle
index 41b96b2..6368ea9 100644
--- a/h2o-persist-hdfs/build.gradle
+++ b/h2o-persist-hdfs/build.gradle
@@ -2,5 +2,6 @@ description = "H2O Persist HDFS"

 dependencies {
   compile project(":h2o-core")
-  compile("org.apache.hadoop:hadoop-client:2.0.0-cdh4.3.0")
+  compile("org.apache.hadoop:hadoop-client:2.4.1-mapr-1408")
+  compile("org.json:org.json:chargebee-1.0")
 }

7. Sparkling Water

Sparkling Water combines two open-source technologies: Apache Spark and the H2O Machine Learning platform. It makes H2O’s library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, and Distributed Random Forest, accessible from Spark workflows. Spark users can select the best features from either platform to meet their Machine Learning needs. Users can combine Spark's RDD API and Spark MLLib with H2O’s machine learning algorithms, or use H2O independently of Spark for the model building process and post-process the results in Spark.

Sparkling Water Resources:

8. Documentation

Documentation Homepage

The main H2O documentation is the H2O User Guide. Visit http://docs.h2o.ai for the top-level introduction to documentation on H2O projects.

Generate REST API documentation

To generate the REST API documentation, use the following commands:

cd ~/h2o-3
cd py
python ./generate_rest_api_docs.py  # to generate Markdown only
python ./generate_rest_api_docs.py --generate_html  --github_user GITHUB_USER --github_password GITHUB_PASSWORD # to generate Markdown and HTML

The default location for the generated documentation is build/docs/REST.

If the build fails, try gradlew clean, then git clean -f.

Bleeding edge build documentation

Documentation for each bleeding edge nightly build is available on the nightly build page.

9. Citing H2O

If you use H2O as part of your workflow in a publication, please cite your H2O resource(s) using the following BibTex entry:

H2O Software

@Manual{h2o_package_or_module,
    title = {package_or_module_title},
    author = {H2O.ai},
    year = {year},
    month = {month},
    note = {version_information},
    url = {resource_url},
}

Formatted H2O Software citation examples:

H2O Booklets

H2O algorithm booklets are available at the Documentation Homepage.

@Manual{h2o_booklet_name,
    title = {booklet_title},
    author = {list_of_authors},
    year = {year},
    month = {month},
    url = {link_url},
}

Formatted booklet citation examples:

Arora, A., Candel, A., Lanford, J., LeDell, E., and Parmar, V. (Oct. 2016). Deep Learning with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf.

Click, C., Lanford, J., Malohlava, M., Parmar, V., and Roark, H. (Oct. 2016). Gradient Boosted Models with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GBMBooklet.pdf.

10. Roadmap

H2O 3.36.01 - Winter 2021

  • [PUBDEV-4940] Uplift Trees
  • [PUBDEV-8074] Admissible ML - Infogram
  • RuleFit improvements (multinomial support, rule deduplication and consolidation)
  • Backward elimination in MAXR
  • Improved support for CDP (S3A with IDBroker)
  • Support for Java 16 and 17, Python 3.8

H2O 3.38.01 - Spring 2022

  • [PUBDEV-8074] Admissible ML - stage 2 (algos)
  • Multi-Output Regression in Deep Learning
  • GAM Improvements (support for Monotonic Splines)
  • XGBoost Upgrade
  • Data Ingest Improvements (Secured Hive in Standalone/K8S)
  • Extended Isolation Forest MOJO
  • Uplift MOJO
  • New features: ICE plots

11. Community

H2O has been built by a great many contributors over the years, both within H2O.ai (the company) and the greater open source community. You can begin to contribute to H2O by answering Stack Overflow questions or filing bug reports. Please join us!

Team & Committers

SriSatish Ambati
Cliff Click
Tom Kraljevic
Tomas Nykodym
Michal Malohlava
Kevin Normoyle
Spencer Aiello
Anqi Fu
Nidhi Mehta
Arno Candel
Josephine Wang
Amy Wang
Max Schloemer
Ray Peck
Prithvi Prabhu
Brandon Hill
Jeff Gambera
Ariel Rao
Viraj Parmar
Kendall Harris
Anand Avati
Jessica Lanford
Alex Tellez
Allison Washburn
Amy Wang
Erik Eckstrand
Neeraja Madabhushi
Sebastian Vidrio
Ben Sabrin
Matt Dowle
Mark Landry
Erin LeDell
Andrey Spiridonov
Oleg Rogynskyy
Nick Martin
Nancy Jordan
Nishant Kalonia
Nadine Hussami
Jeff Cramer
Stacie Spreitzer
Vinod Iyengar
Charlene Windom
Parag Sanghavi
Navdeep Gill
Lauren DiPerna
Anmol Bal
Mark Chan
Nick Karpov
Avni Wadhwa
Ashrith Barthur
Karen Hayrapetyan
Jo-fai Chow
Dmitry Larko
Branden Murray
Jakub Hava
Wen Phan
Magnus Stensmo
Pasha Stetsenko
Angela Bartz
Mateusz Dymczyk
Micah Stubbs
Ivy Wang
Terone Ward
Leland Wilkinson
Wendy Wong
Nikhil Shekhar
Pavel Pscheidl
Michal Kurka
Veronika Maurerova
Jan Sterba
Jan Jendrusak
Sebastien Poirier
Tomáš Frýda
Ard Kelmendi

Advisors

Scientific Advisory Council

Stephen Boyd
Rob Tibshirani
Trevor Hastie

Systems, Data, FileSystems and Hadoop

Doug Lea
Chris Pouliot
Dhruba Borthakur

Investors

Jishnu Bhattacharjee, Nexus Venture Partners
Anand Babu Periasamy
Anand Rajaraman
Ash Bhardwaj
Rakesh Mathur
Michael Marks
Egbert Bierman
Rajesh Ambati

Author: h2oai
Source Code: https://github.com/h2oai/h2o-3
License: Apache-2.0 License

#python #machine-learning 


Gym: Open Source Python Library for Developing and Comparing Reinforcement Learning Algorithms

Gym

Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the field standard for doing this.

The Gym documentation website is at https://www.gymlibrary.ml/, and you can propose fixes and changes here.

Installation

To install the base Gym library, use pip install gym.

This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install the dependencies for one family with pip install gym[atari], or use pip install gym[all] to install all dependencies.

We support Python 3.7, 3.8, 3.9 and 3.10 on Linux and macOS. We will accept PRs related to Windows, but do not officially support it.

API

The Gym API models environments as simple Python env classes. Creating environment instances and interacting with them is very simple. Here's an example using the "CartPole-v1" environment:

import gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42, return_info=True)

for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)

    if done:
        observation, info = env.reset(return_info=True)
env.close()

Notable Related Libraries

  • Stable Baselines 3 is a learning library based on the Gym API. It is designed to cater to complete beginners in the field who want to start learning things quickly.
  • RL Baselines3 Zoo builds upon SB3, containing optimal hyperparameters for Gym environments as well as code to easily find new ones.
  • Tianshou is a learning library that's geared towards very experienced users and is designed to allow for ease in complex algorithm modifications.
  • RLlib is a learning library that allows for distributed training and inferencing and supports an extraordinarily large number of features throughout the reinforcement learning space.
  • PettingZoo is like Gym, but for environments with multiple agents.

Environment Versioning

Gym keeps strict versioning for reproducibility reasons. All environment ids end in a suffix like "-v0". When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion.
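For example:

import gym

# Environment ids carry an explicit version suffix; bumping the number
# signals a change that may affect learning results.
env = gym.make("CartPole-v1")   # current ruleset
# gym.make("CartPole-v0")       # older ruleset of the same task
print(env.spec.id)              # -> CartPole-v1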

Citation

A whitepaper from when Gym first came out is available at https://arxiv.org/pdf/1606.01540, and can be cited with the following bibtex entry:

@misc{1606.01540,
  Author = {Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba},
  Title = {OpenAI Gym},
  Year = {2016},
  Eprint = {arXiv:1606.01540},
}

Release Notes

There used to be release notes for all the new Gym versions here. New release notes are being moved to the releases page on GitHub, like most other libraries do. Old notes can be viewed here.

Author: openai
Source Code: https://github.com/openai/gym
License: View license

#python #machine-learning 


M2cgen: A CLI tool for Porting Classic ML Models

m2cgen

m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#, Rust, Elixir).

Installation

The supported Python version is >= 3.7.

pip install m2cgen

Supported Languages

  • C
  • C#
  • Dart
  • F#
  • Go
  • Haskell
  • Java
  • JavaScript
  • PHP
  • PowerShell
  • Python
  • R
  • Ruby
  • Rust
  • Visual Basic (VBA-compatible)
  • Elixir

Supported Models

Supported models are grouped by family below; within each family, classification models are listed first, then regression models.

Linear

Classification:
  • scikit-learn
    • LogisticRegression
    • LogisticRegressionCV
    • PassiveAggressiveClassifier
    • Perceptron
    • RidgeClassifier
    • RidgeClassifierCV
    • SGDClassifier
  • lightning
    • AdaGradClassifier
    • CDClassifier
    • FistaClassifier
    • SAGAClassifier
    • SAGClassifier
    • SDCAClassifier
    • SGDClassifier
Regression:

  • scikit-learn
    • ARDRegression
    • BayesianRidge
    • ElasticNet
    • ElasticNetCV
    • GammaRegressor
    • HuberRegressor
    • Lars
    • LarsCV
    • Lasso
    • LassoCV
    • LassoLars
    • LassoLarsCV
    • LassoLarsIC
    • LinearRegression
    • OrthogonalMatchingPursuit
    • OrthogonalMatchingPursuitCV
    • PassiveAggressiveRegressor
    • PoissonRegressor
    • RANSACRegressor (only supported regression estimators can be used as a base estimator)
    • Ridge
    • RidgeCV
    • SGDRegressor
    • TheilSenRegressor
    • TweedieRegressor
  • StatsModels
    • Generalized Least Squares (GLS)
    • Generalized Least Squares with AR Errors (GLSAR)
    • Generalized Linear Models (GLM)
    • Ordinary Least Squares (OLS)
    • [Gaussian] Process Regression Using Maximum Likelihood-based Estimation (ProcessMLE)
    • Quantile Regression (QuantReg)
    • Weighted Least Squares (WLS)
  • lightning
    • AdaGradRegressor
    • CDRegressor
    • FistaRegressor
    • SAGARegressor
    • SAGRegressor
    • SDCARegressor
    • SGDRegressor
SVM

Classification:
  • scikit-learn
    • LinearSVC
    • NuSVC
    • OneClassSVM
    • SVC
  • lightning
    • KernelSVC
    • LinearSVC
Regression:

  • scikit-learn
    • LinearSVR
    • NuSVR
    • SVR
  • lightning
    • LinearSVR
Tree

Classification:
  • DecisionTreeClassifier
  • ExtraTreeClassifier

Regression:
  • DecisionTreeRegressor
  • ExtraTreeRegressor
Random Forest

Classification:
  • ExtraTreesClassifier
  • LGBMClassifier (rf booster only)
  • RandomForestClassifier
  • XGBRFClassifier

Regression:
  • ExtraTreesRegressor
  • LGBMRegressor (rf booster only)
  • RandomForestRegressor
  • XGBRFRegressor
Boosting

Classification:
  • LGBMClassifier (gbdt/dart/goss booster only)
  • XGBClassifier (gbtree (including boosted forests) / gblinear booster only)

Regression:
  • LGBMRegressor (gbdt/dart/goss booster only)
  • XGBRegressor (gbtree (including boosted forests) / gblinear booster only)

You can find versions of packages with which compatibility is guaranteed by CI tests here. Other versions can also be supported but they are untested.

Classification Output

Linear / Linear SVM / Kernel SVM

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; signed distance of the sample to the hyperplane per each class.

Comment

The output is consistent with the output of LinearClassifierMixin.decision_function.

SVM

Outlier detection

Scalar value; signed distance of the sample to the separating hyperplane: positive for an inlier and negative for an outlier.

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; one-vs-one score for each class, shape (n_samples, n_classes * (n_classes-1) / 2).

Comment

The output is consistent with the output of BaseSVC.decision_function when the decision_function_shape is set to ovo.

Tree / Random Forest / Boosting

Binary

Vector value; class probabilities.

Multiclass

Vector value; class probabilities.

Comment

The output is consistent with the output of the predict_proba method of DecisionTreeClassifier / ExtraTreeClassifier / ExtraTreesClassifier / RandomForestClassifier / XGBRFClassifier / XGBClassifier / LGBMClassifier.
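To see the linear-model convention above concretely, here is a minimal sketch (dataset and model choice are illustrative) that transpiles a classifier to pure Python and compares its output with scikit-learn's decision_function:

import m2cgen as m2c
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Transpile to Python source and load the generated score() function.
ns = {}
exec(m2c.export_to_python(clf), ns)

# For a multiclass linear model, the generated output should line up
# with decision_function (a vector of per-class scores).
print(ns["score"](list(X[0])))
print(clf.decision_function([X[0]]))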

Usage

Here's a simple example of how a linear model trained in a Python environment can be represented in Java code:

from sklearn.datasets import load_diabetes
from sklearn import linear_model
import m2cgen as m2c

X, y = load_diabetes(return_X_y=True)

estimator = linear_model.LinearRegression()
estimator.fit(X, y)

code = m2c.export_to_java(estimator)

Generated Java code:

public class Model {
    public static double score(double[] input) {
        return ((((((((((152.1334841628965) + ((input[0]) * (-10.012197817470472))) + ((input[1]) * (-239.81908936565458))) + ((input[2]) * (519.8397867901342))) + ((input[3]) * (324.39042768937657))) + ((input[4]) * (-792.1841616283054))) + ((input[5]) * (476.74583782366153))) + ((input[6]) * (101.04457032134408))) + ((input[7]) * (177.06417623225025))) + ((input[8]) * (751.2793210873945))) + ((input[9]) * (67.62538639104406));
    }
}

You can find more examples of generated code for different models/languages here.

CLI

m2cgen can be used as a CLI tool to generate code using serialized model objects (pickle protocol):

$ m2cgen <pickle_file> --language <language> [--indent <indent>] [--function_name <function_name>]
         [--class_name <class_name>] [--module_name <module_name>] [--package_name <package_name>]
         [--namespace <namespace>] [--recursion-limit <recursion_limit>]

Don't forget that for unpickling serialized model objects their classes must be defined in the top level of an importable module in the unpickling environment.

Piping is also supported:

$ cat <pickle_file> | m2cgen --language <language>

FAQ

Q: Generation fails with RecursionError: maximum recursion depth exceeded error.

A: If this error occurs while generating code for an ensemble model, try reducing the number of trained estimators in that model. Alternatively, you can increase the maximum recursion depth with sys.setrecursionlimit(<new_depth>).

Q: Generation fails with an ImportError: No module named <module_name_here> error while transpiling a model from a serialized model object.

A: This error indicates that the pickle protocol cannot deserialize the model object. For unpickling serialized model objects, their classes must be defined in the top level of an importable module in the unpickling environment. Installing the package that provides the model's class definition should solve the problem.

Q: Code generated by m2cgen produces different results for some inputs compared to the original Python model from which the code was obtained.

A: Some models force input data to a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the float64 (double) data type. Try casting your input data to float64 manually and check the results again. Also, small differences can arise from the specific implementation of floating-point arithmetic in the target language.


Author: BayesWitnesses
Source Code: https://github.com/BayesWitnesses/m2cgen
License: MIT License

#machine-learning #python #dart 


Notespace: Notebook Experience in Your Clojure Namespace

notespace

Notebook experience in your Clojure namespace

What is it?

This tool is an attempt to answer the following question: can we have a notebook-like experience in Clojure without leaving one's favourite editor?

See this recorded Overview.

Status

Version 4 is actively developed and in alpha stage. If you are new to Notespace, this is the version we recommend trying out.

Version 3 and Version 2 have been used in some projects. We are not planning to develop them further, but please reach out if you need any support.

Versions

Notespace has been evolving gradually, incorporating lessons from its use in the Scicloj study groups, in individual research projects, and in documenting some of the Scicloj libraries.

Version 4 -- please go here if you are new to Notespace

Version 3

Version 2

Setup and Usage

See details in the dedicated version pages linked above.

Discussion

Hearing your comments, opinions and wishes will help!

#notespace-dev at the Clojurians Zulip.

Relation to other projects

There are several magnificent existing options for literate programming in Clojure: Marginalia, Org-Babel, Gorilla REPL, Oz, Saite, Clojupyter, Nextjournal, Pink-Gorilla/Goldly, Clerk. Most of them are actively developed.

Creating a separate alternative would be the least desired outcome of the current project. Rather, the hope is to compose and integrate well with some of the other projects.


Author: scicloj
Source Code: https://github.com/scicloj/notespace
License: EPL-2.0 License

#machine-learning 


Clojupyter: A Jupyter Kernel for Clojure

A Jupyter kernel for Clojure - run Clojure code in Jupyter Lab, Notebook and Console.

clojupyter

Getting Started

In the examples folder of the repository there are 3 example notebooks showing some of the features of clojupyter. See this notebook for examples of how you can display HTML and use external JavaScript:

There are 3 example notebooks because Jupyter offers several distinct user interfaces - Jupyter Lab, Jupyter Notebook and Jupyter Console - which have different feature sets, for which clojupyter offers different support. There is one example notebook showing the features shared by Lab and Notebook, and one for each showing its distinct features. According to the Jupyter development roadmaps, Jupyter Notebook will eventually be phased out and completely replaced by Jupyter Lab.

You can also use existing JVM charting libraries since you can render any Java BufferedImage.

Installation

Clojupyter can be used in several ways; please read Usage Scenarios to find out which type of use model best fits your needs, and how to install Clojupyter in that scenario.

Running Jupyter with Clojupyter

Jupyter Notebook

To start Jupyter Notebook do:

jupyter notebook

then choose 'New' in the top right corner and select the 'Clojure (clojupyter...)' kernel.

Jupyter Lab

To start Jupyter Lab do:

jupyter lab

Jupyter Console

You can also start the Jupyter Console by:

jupyter-console --kernel=<clojupyter-kernel-name>

Use jupyter-kernelspec list to list all available kernels. For example, if you installed clojupyter using conda, the start command is:

jupyter-console --kernel=conda-clojupyter

Command Line Interface

If you are using Clojupyter as a library, you can use Clojupyter's command line interface to perform operations such as listing, installing, and removing Clojupyter kernels.

For example, in a Clojure repository which includes Clojupyter, you can get the list of available commands:

bash> clj -m clojupyter.cmdline list-commands
Clojupyter v0.2.3 - List commands

    Clojupyter commands:

       - help
       - install
       - list-commands
       - list-installs
       - list-installs-matching
       - remove-installs-matching
       - remove-install
       - version

    You can invoke Clojupyter commands like this:

       clj -m clojupyter.cmdline <command>

    or, if you have set up lein configuration, like this:

       lein clojupyter <command>

    See documentation for details.

exit(0)

See Command Line Interface for more details.

TODO

Development progress is based on voluntary efforts so we can't make any promises, but the wish list for clojupyter development looks something like this:

  •  Front-end: Support reindentation, Parinfer, syntax highlighting in code blocks
  •  Connect running kernel to running Clojure instances
  •  Clarify/simplify external access to rendering - eliminate dependency from Oz to clojupyter
  •  Support interactive Jupyter Widgets

Feedback is welcome; use the discussions page to ask questions, give suggestions, or just say hi 👋.

If you have issues with Clojupyter, check the issues page to see if your problem is already reported and open a new issue if needed.


Author: clojupyter
Source Code: https://github.com/clojupyter/clojupyter
License: MIT License

#machine-learning #jupyter 


A Clojure/Clojurescript Notebook Application/-library Based on Gorilla

Pink Gorilla Notebook 

Pink Gorilla Notebook is a rich browser-based notebook REPL for Clojure and ClojureScript, which aims at extensibility (development- and runtime) and user experience while staying very lightweight. Extensibility primarily revolves around UI visualisations and data.

Use cases

  • Data science
  • Persistent experiments and demos (Clojure/ClojureScript libraries)
  • Courses and education on all matters related to clojure

Web Interface

Whichever method you use to start the notebook, you should reach it at http://localhost:8000/.

Run Notebook standalone

The easiest way to run the notebook locally is to leverage the Clojure CLI:

clojure -Sdeps '{:deps {org.pinkgorilla/notebook-bundel {:mvn/version "RELEASE"}}}' -m pinkgorilla.notebook-bundel

Run Notebook with default bundel

Since the default bundel ships with many default UI extensions, you want to use the notebook-bundel artefact: its JavaScript frontend app has already been precompiled, which results in faster startup time.

in your deps.edn project

We recommend using tools.deps over Leiningen for two reasons:

  • no dependency conflicts: tools.deps resolves to the highest dependency version, whereas Leiningen depends on the position in project.clj
  • you can use RELEASE so you always get the most recent notebook

One way to configure the notebook is to pass it an edn configuration file. An example is notebook edn config.

In your deps.edn add this alias:

:notebook {:extra-deps {org.pinkgorilla/notebook-bundel {:mvn/version "RELEASE"}}
           :exec-fn pinkgorilla.notebook-bundel/run
           :exec-args {:config "notebook-config.edn"}}

then run it with clojure -X:notebook.

trateg uses notebook-bundel with deps.edn: Clone trateg and run clojure -X:notebook

in your leiningen project

**We don't recommend using Leiningen with the notebook, as Leiningen does not use the highest version of dependencies.**

Run Notebook with custom bundel

If you define your own UI extensions, you need to compile the JavaScript bundel yourself. This requires some extra initial compilation time.

in your deps.edn project

ui-quil uses deps.edn to build a custom notebook bundel (one that includes the library being built).

in your leiningen project

gorilla-ui and ui-vega use Leiningen to run notebooks with a custom-built bundel and a custom notebook folder.

Run Notebook in Docker Image

Documentation has been moved over here

Run Notebook from cloned git repo

This option exists mainly for development of the notebook itself. For regular use, the long compile times make it impractical.

Run clojure -X:notebook to run the notebook.

This runs the notebook with ui libraries bundled:

  • gorilla ui
  • gorilla plot

Run Development UI

Run clojure -X:develop to run the develop ui.


Author: pink-gorilla
Source Code: https://github.com/pink-gorilla/notebook
License: 

#machine-learning 


Clojure Data Visualisation Library, Based on Statistiker and D3

Envision

Envision is a small, easy to use Clojure library for data processing, cleanup and visualisation. If you've heard about Incanter, you may see a couple of things that we do in a similar way.

You can check out a couple of rendered examples here.

Project Maturity

Envision is a relatively young project. Since it's not meant to be used in hard production (e.g. it will never be something user-facing), and is intended for people who'd like to glean information from their data, it should be stable enough from the very early releases.

Dependency Information (Artifacts)

Envision artifacts are released to Clojars. If you are using Maven, add the following repository definition to your pom.xml:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

The Most Recent Version

With Leiningen:

[clojurewerkz/envision "0.1.0-SNAPSHOT"]

With Maven:

<dependency>
  <groupId>clojurewerkz</groupId>
  <artifactId>envision</artifactId>
  <version>0.1.0-SNAPSHOT</version>
</dependency>

General Approach

The main idea of this library is to make exploratory analysis more interactive and visual, although in a programmer's way. Envision creates a "throwaway environment" every time you, for example, make a line chart. You can modify the chart the way you want, change all the possible configuration parameters, filter data, and extend it in ways we wouldn't be able to program for you.

We concluded that visual environments are often constraining, and that creating an API for every single feature would make it amazingly big and bloated. So we do the bare minimum, which is already helpful by default through the API, and let you configure everything you could possibly imagine yourself: adding interactivity, combining charts, customizing layouts and so on.

Usage

The main entrypoint is clojurewerkz.envision.core/render. It creates a temporary directory with all the required dependencies and returns the path to it. For example, let's generate some data and render line and area charts:

(ns my-ns
  (:require [clojurewerkz.envision.core         :as envision]
            [clojurewerkz.envision.chart-config :as cfg]))

;; NOTE: the snippet below also assumes a `distribution` alias for a
;; namespace providing sample-data generators (e.g. normal-distribution).

(envision/render
   [(envision/histogram 10 (take 100 (distribution/normal-distribution 5 10))
               {:tick-format "s"})

    (envision/linear-regression
     (flatten (for [i (range 0 20)]
                [{:year (+ 2000 i)
                  :income (+ 10 i (rand-int 10))
                  :series "series-1"}
                 {:year (+ 2000 i)
                  :income (+ 10 i (rand-int 20))
                  :series "series-2"}]
                ))
     :year
     :income
     [:year :income :series])
    (cfg/make-chart-config
     {:id            "line"
      :headline      "Line Chart"
      :x             "year"
      :y             "income"
      :x-config      {:order-rule "year"}
      :series-type   "line"
      :data          (flatten (for [i (range 0 20)]
                                [{:year (+ 2000 i)
                                  :income (+ 10 i (rand-int 10))
                                  :series "series-1"}
                                 {:year (+ 2000 i)
                                  :income (+ 10 i (rand-int 20))
                                  :series "series-2"}]
                                ))
      :series        "series"
      :interpolation :cardinal
      })
    (cfg/make-chart-config
     {:id            "area"
      :headline      "Area Chart"
      :x             "year"
      :y             "income"
      :x-config      {:order-rule "year"}
      :series-type   "area"
      :data          (into [] (for [i (range 0 20)] {:year (+ 2000 i) :income (+ 10 i (rand-int 10))}))
      :interpolation :cardinal
      })
    ])

The function will return a temporary folder path, like:

/var/folders/1y/xr7zvp2j035bpq09whg7th5w0000gn/T/envision-1402385765815-3502705781

cd into this path and start an HTTP server; on most systems you'll have Python 2.7 installed:

python -m SimpleHTTPServer

After that you can point your browser to

http://localhost:8000/templates/index.html

If you don't want to start an HTTP server, or don't have Python installed, just open the templates/index_file.html static file in your browser.

You can check out a couple of example graphs rendered as static files here.

We decided to use a simple HTTP server by default, since d3 sometimes doesn't like the file:// protocol. However, you can always just open templates/index_file.html in your browser and get pretty much the same result.

Chart configuration

In order to configure a chart, you have to specify:

  • id, a unique string literal identifying the chart
  • data, a sequence of maps, where each map represents an entry to be displayed
  • x, the key to be taken as the x value for each rendered point
  • y, the key to be taken as the y value for each rendered point
  • series-type, one of line, bubble, area or bar for line charts, scatterplots, area charts and bar charts, respectively

Optionally, you can specify:

  • series, which will split your data, grouping or color-coding charts by the given keys; keys should be given either as a string or a vector of strings.
  • interpolation, the interpolation type to be used in an area or line chart; usually you want linear, basis, or step-after, but there are more options, mentioned in the corresponding section.
  • x-config, which specifies a configuration for the X axis

x-config options:

  • order-rule specifies a key to sort data points on the x axis, if it's not x
  • override-min overrides the minimum for the axis
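
Pulling these options together, a configured chart might look something like the following sketch (it reuses cfg/make-chart-config from the usage example above; the dummy data has :year, :income and :series keys):

(cfg/make-chart-config
 {:id            "income-by-series"   ; unique chart id
  :headline      "Income by Series"
  :x             "year"
  :y             "income"
  :x-config      {:order-rule "year"} ; sort points on the x axis by year
  :series        "series"             ; group/color-code by the series key
  :series-type   "line"
  :interpolation :linear
  :data          [{:year 2000 :income 10 :series "series-1"}
                  {:year 2000 :income 12 :series "series-2"}
                  {:year 2001 :income 13 :series "series-1"}
                  {:year 2001 :income 15 :series "series-2"}]})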

Features:

  • Histograms
  • Scatterplots
  • Boxplots
  • Barcharts
  • Regression lines
  • Cluster visualisation

Supported Clojure Versions

Envision supports Clojure 1.4+.

Community

To subscribe for announcements of releases, important changes and so on, please follow @ClojureWerkz on Twitter.

Envision Is a ClojureWerkz Project

Envision is part of the group of libraries known as ClojureWerkz, together with Monger, Elastisch, Langohr, Welle, Titanium and several others.

Development

Envision uses Leiningen 2. Make sure you have it installed and then run tests against all supported Clojure versions using

lein2 all test

Then create a branch and make your changes on it. Once you are done with your changes and all tests pass, submit a pull request on GitHub.


Author: clojurewerkz
Source Code: https://github.com/clojurewerkz/envision
License: 

#machine-learning #DataVisualisation


Great And Powerful Scientific Documents & Data Visualizations

Overview

Oz is a data visualization and scientific document processing library for Clojure built around Vega-Lite & Vega.

Vega-Lite & Vega are declarative grammars for describing interactive data visualizations. Of note, they are based on the Grammar of Graphics, which served as the guiding light for the popular R ggplot2 viz library. With Vega & Vega-Lite, we define visualizations by declaratively specifying how attributes of our data map to aesthetic properties of a visualization. Vega-Lite in particular focuses on maximal productivity and leverage for day to day usage (and is the place to start), while Vega (to which Vega-Lite compiles) is ideal for more nuanced control.

About oz specifically...

Oz itself provides:

  • view!: Clojure REPL API for pushing Vega-Lite & Vega (+ hiccup) data to a browser window over a websocket
  • vega, vega-lite: Reagent component API for dynamic client side ClojureScript apps
  • publish!: create a GitHub gist with Vega-Lite & Vega (+ hiccup), and print a link to visualize it with either the IDL's live Vega Editor or ozviz.io
  • load: load markdown, hiccup or Vega/Vega-Lite files (+ combinations) from disk as EDN or JSON
  • export!: write out self-contained html files with live/interactive visualizations embedded
  • oz.notebook.<kernel>: embed Vega-Lite & Vega data (+ hiccup) in Jupyter notebooks via the Clojupyter & IClojure kernels
  • live-reload!: live clj code reloading (à la Figwheel), tuned for data-science hackery (only reruns from first changed form, for a pleasant, performant live-coding experience)
  • live-view!: similar Figwheel-inspired live-view! function for watching and view!ing .md, .edn and .json files with Vega-Lite & Vega (+ (or markdown hiccup))
  • build!: generate a static website from directories of markdown, hiccup &/or interactive Vega-Lite & Vega visualizations, while being able to see changes live (as with live-view!)

Learning Vega, Vega-Lite & Oz

To take full advantage of the data visualization capabilities of Oz, it pays to understand the core Vega & Vega-Lite concepts. If you're new to the scene, it's worth taking a few minutes to orient yourself with this mindblowing talk/demo from the creators at the Interactive Data Lab (IDL) at the University of Washington.

Vega & Vega-Lite talk from IDL

Watched the IDL talk and hungry for more content? Here's another which focuses on the philosophical ideas behind Vega & Vega-Lite, how they relate to Clojure, and how you can use the tools from Clojure using Oz.

Seajure Clojure + Vega/Vega-Lite talk

This Readme is the canonical entry point for learning about Oz. You may also want to check out the cljdoc page (if you're not there already) for API & other docs, and look at the examples directory of this project (referenced occasionally below).

Ecosystem

Some other things in the Vega/Vega-Lite ecosystem you may want to look at for getting started or learning more

  • Vega Editor - Wonderful editing tool (as mentioned above) for editing and sharing Vega/Vega-Lite data visualizations.
  • Ozviz - Sister project to Oz: A Vega Editor like tool for sharing (and soon editing) hiccup with embedded Vega/Vega-Lite visualizations, as used with the view! function.
  • Voyager - Also from the IDL, Voyager is a wonderful Tableau like (drag and drop) tool for exploring data and constructing exportable Vega/Vega-Lite visualizations.
  • Vega Examples & Vega-Lite Examples - A robust showcase of visualizations from which to draw inspiration and code.
  • Vega home - More great stuff from the IDL folks.

REPL Usage

If you clone this repository and open up the dev/user.clj file, you can follow along by executing the commented out code block at the end of the file.

Assuming you're starting from scratch, first add oz to your Leiningen project dependencies:

[metasoarous/oz "x.x.x"]

Next, require oz and start the plot server as follows:

(require '[oz.core :as oz])

(oz/start-server!)

This will fire up a browser window with a websocket connection for funneling view data back and forth. If you forget to call this function, it will be called for you when you create your first plot, but be aware that this will delay the first display, and on a slower computer you may have to resend the plot.

Next we'll define a function for generating some dummy data

(defn play-data [& names]
  (for [n names
        i (range 20)]
    {:time i :item n :quantity (+ (Math/pow (* i (count n)) 0.8) (rand-int (count n)))}))

oz/view!

The main function for displaying vega or vega-lite is oz/view!.

For example, a simple line plot:

(def line-plot
  {:data {:values (play-data "monkey" "slipper" "broom")}
   :encoding {:x {:field "time" :type "quantitative"}
              :y {:field "quantity" :type "quantitative"}
              :color {:field "item" :type "nominal"}}
   :mark "line"})

;; Render the plot
(oz/view! line-plot)

Should render something like:

lines plot

Another example:

(def stacked-bar
  {:data {:values (play-data "munchkin" "witch" "dog" "lion" "tiger" "bear")}
   :mark "bar"
   :encoding {:x {:field "time"
                  :type "ordinal"}
              :y {:aggregate "sum"
                  :field "quantity"
                  :type "quantitative"}
              :color {:field "item"
                      :type "nominal"}}})

(oz/view! stacked-bar)

This should render something like:

bars plot

vega support

For vega instead of vega-lite, you can also specify :mode :vega to oz/view!:

;; load some example vega (this may only work from within a checkout of oz; haven't checked)

(require '[cheshire.core :as json])

(def contour-plot (oz/load "examples/contour-lines.vega.json"))
(oz/view! contour-plot :mode :vega)

This should render like:

contours plot

Hiccup

We can also embed Vega-Lite & Vega visualizations within hiccup documents:

(def viz
  [:div
    [:h1 "Look ye and behold"]
    [:p "A couple of small charts"]
    [:div {:style {:display "flex" :flex-direction "row"}}
      [:vega-lite line-plot]
      [:vega-lite stacked-bar]]
    [:p "A wider, more expansive chart"]
    [:vega contour-plot]
    [:h2 "If ever, oh ever a viz there was, the vizard of oz is one because, because, because..."]
    [:p "Because of the wonderful things it does"]])

(oz/view! viz)

Note that the Vega-Lite & Vega specs are embedded in the hiccup using the :vega-lite and :vega keys.

You should now see something like this:

composite view

Note that vega/vega-lite already have very powerful and impressive plot concatenation features which allow for coupling of interactivity between plots in a viz. However, combining things through hiccup like this is nice for expedience, and gives one the ability to combine such visualizations in the context of HTML documents.

Also note that while not illustrated above, you can specify multiple maps in these vectors, and they will be merged into one. So for example, you can do [:vega-lite stacked-bar {:width 100}] to override the width.
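
For instance, reusing stacked-bar from above, a quick sketch of this merging behavior:

(oz/view!
  [:div
   ;; the extra map is merged into the stacked-bar spec, overriding its width
   [:vega-lite stacked-bar {:width 100}]])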

As client side reagent components

If you like, you may also use the Reagent components found in oz.core to render vega and/or vega-lite specs you construct client side.

[:div
 [oz.core/vega { ... }]
 [oz.core/vega-lite { ... }]]

At present, these components do not take a second argument. The merging of spec maps described above applies prior to application of this reagent component.

Eventually we'll be adding options for hooking into the signal dataflow graphs within these visualizations so that interactions in a Vega/Vega-Lite visualization can be used to inform other Reagent components in your app.

Please note that when using oz.core client side, the :data entry in your vega spec map should not be nil (for example, if you're loading data into a Reagent atom which has not been populated yet). Instead, prefer an empty sequence () to avoid hard-to-diagnose errors in the browser.
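
Here's a minimal sketch of that advice (the atom and component names are hypothetical); note the empty sequence () rather than nil as the initial :data value:

(require '[reagent.core :as r])

(defonce chart-data (r/atom ())) ; start with (), not nil

(defn quantity-chart []
  [oz.core/vega-lite
   {:data {:values @chart-data} ; safe to render even before data arrives
    :mark "line"
    :encoding {:x {:field "time" :type "quantitative"}
               :y {:field "quantity" :type "quantitative"}}}])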

Loading specs

Oz now features a load function which accepts the following formats:

  • edn, json, yaml: directly parse into hiccup &/or Vega/Vega-Lite representations
  • md: loads a markdown file, with a notation for specifying Vega/Vega-Lite in code blocks tagged with the vega, vega-lite or oz class

As example of the markdown syntax:

# An example markdown file

```edn vega-lite
{:data {:url "data/cars.json"}
 :mark "point"
 :encoding {
   :x {:field "Horsepower", :type "quantitative"}
   :y {:field "Miles_per_Gallon", :type "quantitative"}
   :color {:field "Origin", :type "nominal"}}}
```

The real magic here is in the code class specification edn vega-lite. It's possible to replace edn with json or yaml, and vega with vega-lite as appropriate. Additionally, these classes can be hyphenated for compatibility with editors/parsers that have problems with multiple class specifications (e.g. edn-vega-lite)

Note that embedding all of your data into a vega/vega-lite spec directly as :values may be untenable for larger data sets. In these cases, the recommended solution is to post your data to a GitHub gist, or elsewhere online where you can refer to it using the :url syntax (e.g. {:data {:url "https://your.data.url/path"} ...}).

One final note: in lieu of vega or vega-lite you can specify hiccup in order to embed oz-style hiccup forms which may or may not contain [:vega ...] or [:vega-lite ...] blocks. This allows you to embed nontrivial HTML in your markdown files as hiccup, when basic markdown just doesn't cut it, without having to resort to manually writing HTML.

Export

We can also export static HTML files which use Vega-Embed to render interactive Vega/Vega-Lite visualizations using the oz/export! function.

(oz/export! spec "test.html")

Notebook support

Oz now also features Jupyter support for both the Clojupyter and IClojure kernels. See the view! method in the namespaces oz.notebook.clojupyter and oz.notebook.iclojure for usage.

example notebook

Requiring in Clojupyter

Take a look at the example clojupyter notebook.

If you have docker installed you can run the following to build and run a jupyter container with clojupyter installed.

docker run --rm -p 8888:8888 kxxoling/jupyter-clojure-docker

Note that if you get a permission related error, you may need to run this command like sudo docker run ....

Once you have a notebook up and running you can either import the example clojupyter notebook or manually add something like:

(require '[clojupyter.misc.helper :as helper])
(helper/add-dependencies '[metasoarous/oz "x.x.x"])
(require '[oz.notebook.clojupyter :as oz])

;; Create spec

;; then...
(oz/view! spec)

Based on my own tinkering and the reports of other users, the functionality of this integration is somewhat sensitive to version/environment details, so running from the docker image is the recommended way of getting things running for the moment.

Requiring in IClojure

If you have docker installed you can get an IClojure environment up and running using:

docker run -p 8888:8888 cgrand/iclojure

As with Clojupyter, note that if you get a permission related error, you may need to run this command like sudo docker run ....

Once you have that running, you can:

/cp {:deps {metasoarous/oz {:mvn/version "x.x.x"}}}
(require '[oz.notebook.iclojure :as oz])

;; Create spec

;; then...
(oz/view! spec)

Live code reloading

Oz now features Figwheel-like hot code reloading for Clojure-based data science workflows. To start this functionality, you specify from the REPL a file you would like to watch for changes, like so:

(oz/live-reload! "live-reload-test.clj")

As soon as you run this, the code in the file will be executed in its entirety. Thereafter, if you save changes to the file, all forms starting from the first form with material changes will be re-evaluated. Additionally, whitespace changes are ignored, and namespace changes only trigger a recompile if there were other code changes in flight, or if there was an error during the last execution. We also try to do a good job of logging notifications as things run, so that you know what is running and how long long-running forms are taking to execute.

Collectively all of these features give you the same magic of Figwheel's hot-code reloading experience, but geared towards the specific demands of a data scientist, or really anyone who needs to quickly hack together potentially long running jobs.

Here's a quick video of this in action: https://www.youtube.com/watch?v=yUTxm29fjT4

Of import: Because the code evaluated with live-reload! is evaluated in a separate thread, you can't include any code which might try to set root bindings of a dynamic var. Fortunately, setting root var bindings isn't something I've ever needed to do in my data science workflow (nor should you), but of course, it's possible there are libraries out there that do this. Just be aware that it might come up. This seems to be a pretty fundamental Clojure limitation, but I'd be interested to hear from the oracles whether there's any chance of this being supported in a future version of Clojure.

There's also a related function, oz/live-view! which will similarly watch a file for changes, oz/load! it, then oz/view! it.
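
For example (the file name here is hypothetical):

(oz/live-view! "my-report.md")

Saving changes to my-report.md will then re-render the document in the browser window.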

Sharing features

Looking to share your cool plots or hiccup documents with someone? We've got you covered via the publish! utility function.

This will post the plot content to a GitHub Gist, and use the gist uuid to create a Vega Editor link which prints to the screen. When you visit the Vega Editor link, it will load the gist in question and place the content in the editor. It renders the plot, and updates in real time as you tinker with the code, making it a wonderful yet simple tool for sharing and prototyping.

user=> (oz/publish! stacked-bar)
Gist url: https://gist.github.com/87a5621b0dbec648b2b54f68b3354c3a
Raw gist url: https://api.github.com/gists/87a5621b0dbec648b2b54f68b3354c3a
Vega editor url: https://vega.github.io/editor/#/gist/vega-lite/metasoarous/87a5621b0dbec648b2b54f68b3354c3a/e1d471b5a5619a1f6f94e38b2673feff15056146/vega-viz.json

Following the Vega Editor url will take you here (click on image to follow):

vega-editor

As mentioned above, we can also share our hiccup documents/dashboards. Since Vega Editor knows nothing about hiccup, we've created ozviz.io as a tool for loading these documents.

user=> (oz/publish! viz)
Gist url: https://gist.github.com/305fb42fa03e3be2a2c78597b240d30e
Raw gist url: https://api.github.com/gists/305fb42fa03e3be2a2c78597b240d30e
Ozviz url: http://ozviz.io/#/gist/305fb42fa03e3be2a2c78597b240d30e

Try it out: http://ozviz.io/#/gist/305fb42fa03e3be2a2c78597b240d30e

Authentication

In order to use the oz/publish! function, you must provide authentication.

The easiest way is to pass :auth "username:password" to the oz/publish! function. However, this can be problematic in that you don't want these credentials accidentally strewn throughout your code or ./.lein-repl-history.
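
As a sketch, assuming :auth is passed as a keyword-style argument (check the publish! docstring for the exact convention):

;; assumption: keyword-style argument; consult the publish! docstring
(oz/publish! stacked-bar :auth "username:password")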

To address this issue, oz/publish! will by default try to read authorization parameters from a file at ~/.oz/github-creds.edn. The contents should be a map of authorization arguments, as passed to the tentacles api. While you can use {:auth "username:password"} in this file, as above, it's far better from a security standpoint to use OAuth tokens.

  • First, generate a new token (Settings > Developer settings > Personal access tokens):
    • Enter a description like "Oz api token"
    • Select the "[ ] gist" scope checkbox, to grant gisting permissions for this token
    • Click "Generate token" to finish
  • Copy the token and paste it into your ~/.oz/github-creds.edn file as {:oauth-token "xxxxxxxxxxxxxx"}

When you're finished, it's a good idea to run chmod 600 ~/.oz/github-creds.edn so that only your user can read the credential file.

And that's it! Your calls to (oz/publish! spec) should now be authenticated.

Sadly, GitHub used to allow posting anonymous gists without requiring authentication, which saved us from all this hassle. However, they've since deprecated this. If you like, you can submit a comment asking that GitHub consider enabling auto-expiring anonymous gists, which would avoid this setup.

Static site generation

If you've ever thought "man, I wish there was a static site generation toolkit which had live code reloading of whatever page you're currently editing, and it would be great if it was in Clojure and let me embed data visualizations and math formulas via LaTeX in Markdown & Hiccup documents", boy, are you in for a treat!

Oz now features exactly that, in the form of oz/build!. A very simple site might be generated with:

(build!
  [{:from "examples/static-site/src/"
    :to "examples/static-site/build/"}])

The input formats currently supported by oz/build! are

  • md: As described above, markdown with embedded Vega-Lite or Vega visualizations, Latex, and hiccup
  • json, edn: You can directly supply hiccup data for more control over layout and content
  • clj: Will live-reload! Clojure files (as described above), and render the last form evaluated as hiccup

Oz should handle image and css files it comes across by simply copying them over. However, if you have any json or edn assets (datasets perhaps) which need to pass through unchanged, you can separate these into their own build specification, like so:

(defn site-template
  [spec]
  [:div {:style {:max-width 900 :margin-left "auto" :margin-right "auto"}}
   spec])

(build!
  [{:from "examples/static-site/src/site/"
    :to "examples/static-site/build/"
    :template-fn site-template}
   ;; If you have static assets, like datasets or images, which simply need to be copied over
   {:from "examples/static-site/src/assets/"
    :to "examples/static-site/build/"
    :as-assets? true}])

This can be a good way to separate document code from other static assets.

Specifying multiple builds like this can be used to do other things as well. For example, if you want to render a particular set of pages using a different template function (say, so that your blog posts are styled differently than the main pages), you can do that easily:

(defn blog-template
  [spec]
  (site-template
    (let [{:as spec-meta :keys [title published-at tags]} (meta spec)]
      [:div
       [:h1 {:style {:line-height 1.35}} title]
       [:p "Published on: " published-at]
       [:p "Tags: " (string/join ", " tags)]
       spec])))

(build!
  [{:from "examples/static-site/src/site/"
    :to "examples/static-site/build/"
    :template-fn site-template}
   {:from "examples/static-site/src/blog/"
    :to "examples/static-site/build/blog/"
    :template-fn blog-template}
   ;; If you have static assets, like datasets or images, which simply need to be copied over
   {:from "examples/static-site/src/assets/"
    :to "examples/static-site/build/"
    :as-assets? true}])

Note that the blog-template above uses metadata about the spec to inform how it renders. This metadata can be written into Markdown files using a YAML markdown metadata header (see /examples/static-site/src/):

---
title: Oz static websites rock
tags: oz, dataviz
---

# Oz static websites!

Some markdown content...

The title in particular here will wind its way into the Title metadata tag of your output HTML document, and thus will be visible at the top of your browser window when you view the file. This is a pattern that Jekyll and some other blogging engines use, and markdown-clj now supports extracting this data.

Again, as you edit and save these files, the outputs automatically update for you, both as compiled HTML files and in the live-view window, which lets you see your changes as you make them. If you need to change a template, or some other detail of the specs, you can simply rerun build! with the modified arguments, and the most recently edited page will update before your eyes. This provides a lovely live-view editing experience from the comfort of your favorite editor.

When you're done, one of the easiest ways to deploy is with the excellent surge.sh toolkit, which makes static site deployment a breeze. You can also use GitHub Pages or S3 or really whatever if you prefer. The great thing about static sites is that they are easy and cheap to deploy and scale, so you have plenty of options at your disposal.

EDN translation caveats in expression strings

In general, it's pretty easy to translate specs between EDN (Clojure data) and JSON. However, there is one place where you can get a little tripped up if you don't know what to do, and that's in expressions (as used in calculate and filter transforms).

The expressions you see in the Vega docs typically look like {"calculate": "datum.attr * 2", "as": "attr2"} (as JSON). However, in Clojure we often use kebab-cased keywords for data map keys (e.g. :cool-attr). For these attributes, you obviously can't use datum.cool-attr, since this will be interpreted as datum.cool - attr, and will either error out or not produce the desired result. Instead, you'll need to use datum['cool-attr'] in your expressions when your keys are kebab-cased.

This may be easy to miss, since most of the docs assume you're working with camel- or snake-cased keys. It is mentioned in there somewhere if you look, but it tends to bite us Clojurists more frequently than practitioners of other languages, and so isn't particularly front and center. Once you know the trick, though, you should be on your way.
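
As a short sketch, here's the calculate transform from above written as EDN against a kebab-cased key (the field names are hypothetical):

;; bracket access -- datum.cool-attr would parse as (datum.cool - attr)
{:data {:values [{:cool-attr 1} {:cool-attr 2}]}
 :transform [{:calculate "datum['cool-attr'] * 2" :as "cool-attr-2"}]
 :mark "point"
 :encoding {:x {:field "cool-attr" :type "quantitative"}
            :y {:field "cool-attr-2" :type "quantitative"}}}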

Local CLJS development

Oz is now compiled (on the cljs side) with Shadow-CLJS, together with the Clojure CLI tooling. A typical workflow involves running clj -M:shadow-cljs watch devcards app (note, older versions of clj use -A instead of -M; consider updating). This will watch your cljs files for changes, and immediately compile both the app.js and devcards.js targets (to resources/oz/public/js/).

In general, the best way to develop is to visit http://localhost:7125/devcards.html, which will pull up a live view of a set of example Reagent components defined at src/cljs/oz/core_devcards.cljs. This is the easiest way to tweak functionality and test new features, as editing src/cljs/oz/core.cljs will trigger updates to the devcards views.

If it's necessary or desirable to test the app (live-view, etc) functionality "in-situ", you can also use the normal Clj REPL utilities to feed plots to the app.js target using oz/view!, etc. Note that if you do this, you will need to use whatever port is passed to oz/view! (by default, 10666) and not the one printed out when you start clj -M:shadow-cljs.

See documentation for your specific editing environment if you'd like your editor to be able to connect to the Shadow-CLJS repl. For vim-fireplace, the initial Clj connection should establish itself automatically when you attempt to evaluate your first form. From there simply execute the vim command :CljEval (shadow/repl :app), and you should be able to evaluate code in the *.cljs files from vim. Code in *.clj files should also continue to evaluate as before as well.

IMPORTANT NOTE: If you end up deploying a version of Oz to Clojars or elsewhere, make sure you stop your clj -M:shadow-cljs watch process before running make release. If you don't, shadow will continue watching files and rebuild js compilation targets with dev time configuration (shadow, less minification, etc), that shouldn't be in the final release build. If however you are simply making changes and pushing up for me to release, please just leave any compiled changes to the js targets out of your commits.

Debugging & updating Vega/Vega-Lite versions

I'm frequently (and pleasantly) shocked at how often, when I find I'm unable to do something in Vega or Vega-Lite that I think I should be able to, updating the Vega or Vega-Lite version fixes the problem. As a side note, I think this speaks volumes about the stellar job (pun intended) the IDL has been doing developing these tools. More to the point though, if you find yourself unable to do something you expect to be able to do, it's not a bad idea to try:

  1. Make sure your Oz version is up to date, in case a more recent Vega/Vega-Lite version required there fixes the problem.
  2. Check npm to see if there's a more recent version of the Vega/Vega-Lite (or Vega-Embed or Vega-Tooltip, as appropriate).
  3. Clone Oz, update the package.json file, and attempt to rebuild Oz as described above.
  4. If this still doesn't solve your problem, file an issue on the appropriate Vega GitHub project. I've found the developers super responsive to issues.

Author: metasoarous
Source Code:  https://github.com/metasoarous/oz
License: EPL-1.0 License

#datavisualization #machine-learning 
