Libraries for Machine Learning. Also see awesome-machine-learning.
Author: vinta
Source Code: https://github.com/vinta/awesome-python
License: View license
MindsDB enables you to use ML predictions in your database using SQL.
If you like our project then we would really appreciate a Star ⭐!
Also, check out the rewards and community programs.
Installation - Overview - Features - Database Integrations - Quickstart - Documentation - Support - Contributing - Mailing lists - License
To install the latest version of MindsDB, please pull the following Docker image:
docker pull mindsdb/mindsdb
Or, use PyPI:
pip install mindsdb
MindsDB automates and abstracts machine learning models through virtual AI Tables:
Apart from abstracting ML models as AI Tables inside databases, MindsDB has a set of unique capabilities, such as:
Easily make predictions over very complex multivariate time-series data with high cardinality
An open JSON-AI syntax to tune ML models and optimize ML pipelines in a declarative way
Let MindsDB connect to your database.
Train a Predictor using a single SQL statement (MindsDB learns from historical data automatically), or import your own ML model to a Predictor via JSON-AI.
Make predictions with SQL statements (Predictor is exposed as virtual AI Tables). There’s no need to deploy models since they are already part of the data layer.
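The train-and-predict workflow above can be sketched as plain SQL statements issued from Python. The data source, table, and column names below (example_db, home_rentals, rental_price) are hypothetical placeholders, and the connection snippet is indicative only; MindsDB speaks the MySQL wire protocol, so any MySQL client can send these statements.

```python
# Hypothetical names throughout: the data source (example_db), table
# (demo_data.home_rentals), and target column (rental_price) are placeholders.

create_predictor = """
CREATE PREDICTOR mindsdb.home_rentals_model
FROM example_db (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price;
"""

query_predictor = """
SELECT rental_price
FROM mindsdb.home_rentals_model
WHERE sqft = 823 AND location = 'good';
"""

# MindsDB exposes a MySQL-compatible API, so any MySQL client works
# (connection details are placeholders; requires a running instance):
# import mysql.connector
# conn = mysql.connector.connect(host="127.0.0.1", port=47335,
#                                user="mindsdb", password="")
# conn.cursor().execute(create_predictor)
```

Once the Predictor is trained, querying it is an ordinary SELECT against the virtual AI Table; no separate model deployment step is needed.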
Check our docs and blog for tutorials and use case examples.
MindsDB works with most SQL and NoSQL databases, as well as data streams for real-time ML.
To get your hands on MindsDB, we recommend using the Docker image or simply sign up for a free cloud account. Feel free to browse documentation for other installation methods and tutorials.
You can find the complete documentation of MindsDB at docs.mindsdb.com. Documentation for our HTTP API can be found at apidocs.mindsdb.com.
If you found a bug, please submit an issue on Github.
To get community support, you can:
If you need commercial support, please contact the MindsDB team.
A great place to start contributing to MindsDB is our GitHub projects.
Also, we are always open to suggestions so feel free to open new issues with your ideas and we can give you guidance!
Being part of the core team is accessible to anyone who is motivated and wants to be part of that journey! If you'd like to contribute to the project, refer to the contributing documentation.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project, you agree to abide by its terms.
Made with contributors-img.
Subscribe to MindsDB Monthly Community Newsletter to get general announcements, release notes, information about MindsDB events, and the latest blog posts. You may also join our beta-users group, and get access to new beta features.
MindsDB is licensed under GNU General Public License v3.0
Author: mindsdb
Source Code: https://github.com/mindsdb/mindsdb
License: GPL-3.0 License
eXtreme Gradient Boosting
Community | Documentation | Resources | Contributors | Release Notes
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
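XGBoost itself should be used through its own APIs; purely as an illustration of the gradient-boosting idea it optimizes (fit a weak learner to the current residuals each round and add it with shrinkage), here is a minimal pure-Python sketch on 1-D data. XGBoost adds regularization, second-order gradients, sparsity handling, and parallel tree construction on top of this.

```python
# Minimal pure-Python sketch of gradient boosting with regression stumps.
# Illustration only -- not how XGBoost is implemented.

def fit_stump(xs, residuals):
    """Find the 1-D split minimizing squared error against the residuals."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def predict(stumps, base, lr, x):
    """Ensemble prediction: base value plus shrunken stump corrections."""
    return base + lr * sum(s(x) for s in stumps)

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        preds = [predict(stumps, base, lr, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 7.1, 6.8, 7.0]   # a step function plus noise
base, stumps = gradient_boost(xs, ys)
```

After 50 rounds the ensemble's predictions closely track the two plateaus in the training data.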
© Contributors, 2021. Licensed under an Apache-2 license.
XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. Check out the Community Page.
Become a sponsor and get a logo here. See details at Sponsoring the XGBoost Project. The funds are used to defray the cost of continuous integration and testing infrastructure (https://xgboost-ci.net).
The sponsors in this list are donating cloud hours in lieu of cash donation.
Author: dmlc
Source Code: https://github.com/dmlc/xgboost
License: Apache-2.0 License
This is the Vowpal Wabbit fast online learning code.
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. There is a particular focus on reinforcement learning, with several contextual bandit algorithms implemented and an online nature that lends itself well to these problems. Vowpal Wabbit is a destination for implementing and maturing state-of-the-art algorithms with performance in mind.
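VW consumes plain-text examples of the form "label |namespace feature[:value] ...". As a small illustration, a helper that renders a dict of features into that format (the helper itself and the namespace name are our own, not part of VW):

```python
# Render a label and feature dict into VW's plain-text example format.
# Features with value 1 are emitted bare; others as name:value.

def to_vw(label, features, namespace="f"):
    parts = []
    for name, value in sorted(features.items()):
        if value == 1:
            parts.append(name)
        else:
            parts.append(f"{name}:{value}")
    return f"{label} |{namespace} " + " ".join(parts)

line = to_vw(1, {"big": 1, "red": 1, "square": 1})
```

Lines produced this way can be written to a file or streamed to the vw binary.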
For the most up to date instructions for getting started on Windows, MacOS or Linux please see the wiki. This includes:
Author: VowpalWabbit
Source Code: https://github.com/VowpalWabbit/vowpal_wabbit
License: View license
vowpal_porpoise
Lightweight python wrapper for vowpal_wabbit.
Why: Scalable, blazingly fast machine learning.
make
pip install cython
python setup.py install
to install. Now you can do import vowpal_porpoise from Python.
Linear regression with l1 penalty:
from vowpal_porpoise import VW
# Initialize the model
vw = VW(moniker='test',       # a name for the model
        passes=10,            # vw arg: passes
        loss='quadratic',     # vw arg: loss
        learning_rate=10,     # vw arg: learning_rate
        l1=0.01)              # vw arg: l1
# Inside the with training() block a vw process will be
# open to communication
with vw.training():
    for instance in ['1 |big red square',
                     '0 |small blue circle']:
        vw.push_instance(instance)
    # here stdin will close
# here the vw process will have finished
# Inside the with predicting() block we can stream instances and
# acquire their labels
with vw.predicting():
    for instance in ['1 |large burnt sienna rhombus',
                     '0 |little teal oval']:
        vw.push_instance(instance)
# Read the predictions like this:
predictions = list(vw.read_predictions_())
L-BFGS with a rank-5 approximation:
from vowpal_porpoise import VW
# Initialize the model
vw = VW(moniker='test_lbfgs', # a name for the model
        passes=10,            # vw arg: passes
        lbfgs=True,           # turn on lbfgs
        mem=5)                # lbfgs rank
Latent Dirichlet Allocation with 100 topics:
from vowpal_porpoise import VW
# Initialize the model
vw = VW(moniker='test_lda',   # a name for the model
        passes=10,            # vw arg: passes
        lda=100,              # turn on lda
        minibatch=100)        # set the minibatch size
vowpal_porpoise also ships with an interface into scikit-learn, which allows awesome experiment-level stuff like cross-validation:
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from vowpal_porpoise.sklearn import VW_Classifier
GridSearchCV(
    VW_Classifier(loss='logistic', moniker='example_sklearn',
                  passes=10, silent=True, learning_rate=10),
    param_grid=parameters,
    score_func=f1_score,
    cv=StratifiedKFold(y_train),
).fit(X_train, y_train)
Check out example_sklearn.py for more details.
Via the VW interface:
with vw.predicting_library():
    for instance in ['1 |large burnt sienna rhombus',
                     '1 |little teal oval']:
        prediction = vw.push_instance(instance)
Now the predictions are returned directly to the parent process, rather than having to read from disk. See examples/example1.py for more details.
Alternatively you can use the raw library interface:
import vw_c
vw = vw_c.VW("--loss=quadratic --l1=0.01 -f model")
vw.learn("1 |this is a positive example")
vw.learn("0 |this is a negative example")
vw.finish()
The raw interface currently does not support multiple passes due to limitations in the underlying vw C code.
vowpal_wabbit is insanely fast and scalable. vowpal_porpoise is slower, but only during the initial training pass. Once the data has been properly cached it will idle while vowpal_wabbit does all the heavy lifting. Furthermore, vowpal_porpoise was designed to be lightweight and not to get in the way of vowpal_wabbit's scalability, e.g. it allows distributed learning via --nodes
and does not require data to be batched in memory. In our research work we use vowpal_porpoise on an 80-node cluster running over multiple terabytes of data.
The main benefit of vowpal_porpoise is allowing rapid prototyping of new models and feature extractors. We found that we had been doing this in an ad-hoc way using python scripts to shuffle around massive gzipped text files, so we just closed the loop and made vowpal_wabbit a python library.
Wraps the vw binary in a subprocess and uses stdin to push data and temporary files to pull predictions. Why not use the prediction labels vw provides on stdout? It turns out that the Python GIL basically makes streaming in and out of a process (even asynchronously) painfully difficult. If you know of a clever way to get around this, please email me. In other languages (e.g. in a forthcoming Scala wrapper) this is not an issue.
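The wrapper pattern described above can be sketched with a trivial stand-in child process (the inline child script below is a placeholder for the vw binary, and the "predictions" are just token counts): the parent streams instances over stdin, the child writes one prediction per line to a temporary file, and the parent reads the file back after the child exits.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pred_path = tmp.name

# Inline stand-in for the vw binary: reads instances on stdin and writes one
# "prediction" (here, the token count) per line to the file named in argv[1].
child_src = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as out:\n"
    "    for line in sys.stdin:\n"
    "        out.write(str(len(line.split())) + '\\n')\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src, pred_path],
                        stdin=subprocess.PIPE, text=True)
for instance in ["1 |big red square", "0 |small blue circle"]:
    proc.stdin.write(instance + "\n")   # analogous to vw.push_instance
proc.stdin.close()                      # child sees EOF and flushes
proc.wait()

with open(pred_path) as f:
    predictions = [line.strip() for line in f]
os.remove(pred_path)
```

Pulling results from a file after the child exits sidesteps the concurrent read/write coordination that makes streaming stdout back under the GIL painful.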
Alternatively, you can use a pure API call (vw_c, wrapping libvw) for prediction.
Joseph Reisinger @josephreisinger
Apache 2.0
Author: josephreisinger
Source Code: https://github.com/josephreisinger/vowpal_porpoise
License: View license
scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.
The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.
It is currently maintained by a team of volunteers.
Website: https://scikit-learn.org
scikit-learn requires:
Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.
Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with "Display") require Matplotlib (>= 3.1.2). For running the examples Matplotlib >= 3.1.2 is required. A few examples require scikit-image >= 0.14.5, a few require pandas >= 1.0.5, and some require seaborn >= 0.9.0.
If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
or conda:
conda install -c conda-forge scikit-learn
The documentation includes more detailed installation instructions.
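A minimal end-to-end check that the installation works (assuming scikit-learn is installed): fit a classifier on the bundled iris dataset and score it on a held-out split.

```python
# Quick sanity check: train and evaluate a classifier on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = clf.score(X_test, y_test)   # mean accuracy on the held-out split
```

The fit/predict/score pattern shown here is shared by all scikit-learn estimators.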
See the changelog for a history of notable changes to scikit-learn.
We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.
You can check the latest sources with the command:
git clone https://github.com/scikit-learn/scikit-learn.git
To learn more about making a contribution to scikit-learn, please see our Contributing guide.
After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 5.0.1 installed):
pytest sklearn
See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing for more information.
Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html
Note: scikit-learn was previously referred to as scikits.learn.
If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn
Author: scikit-learn
Source Code: https://github.com/scikit-learn/scikit-learn
License: BSD-3-Clause License
Note: the current releases of this toolbox are beta releases, used to test distribution through Haskell's, Python's, and R's package repositories.
Metrics provides implementations of various supervised machine learning evaluation metrics in the following languages:
Python: easy_install ml_metrics
R: install.packages("Metrics") (from the R prompt)
Haskell: cabal install Metrics
For more detailed installation instructions, see the README for each implementation.
Evaluation Metric | Python | R | Haskell | MATLAB / Octave |
---|---|---|---|---|
Absolute Error (AE) | ✓ | ✓ | ✓ | ✓ |
Average Precision at K (APK, AP@K) | ✓ | ✓ | ✓ | ✓ |
Area Under the ROC (AUC) | ✓ | ✓ | ✓ | ✓ |
Classification Error (CE) | ✓ | ✓ | ✓ | ✓ |
F1 Score (F1) | ✓ | | | |
Gini | ✓ | | | |
Levenshtein | ✓ | ✓ | ✓ | |
Log Loss (LL) | ✓ | ✓ | ✓ | ✓ |
Mean Log Loss (LogLoss) | ✓ | ✓ | ✓ | ✓ |
Mean Absolute Error (MAE) | ✓ | ✓ | ✓ | ✓ |
Mean Average Precision at K (MAPK, MAP@K) | ✓ | ✓ | ✓ | ✓ |
Mean Quadratic Weighted Kappa | ✓ | ✓ | ✓ | |
Mean Squared Error (MSE) | ✓ | ✓ | ✓ | ✓ |
Mean Squared Log Error (MSLE) | ✓ | ✓ | ✓ | ✓ |
Normalized Gini | ✓ | | | |
Quadratic Weighted Kappa | ✓ | ✓ | ✓ | |
Relative Absolute Error (RAE) | ✓ | | | |
Root Mean Squared Error (RMSE) | ✓ | ✓ | ✓ | ✓ |
Relative Squared Error (RSE) | ✓ | | | |
Root Relative Squared Error (RRSE) | ✓ | | | |
Root Mean Squared Log Error (RMSLE) | ✓ | ✓ | ✓ | ✓ |
Squared Error (SE) | ✓ | ✓ | ✓ | ✓ |
Squared Log Error (SLE) | ✓ | ✓ | ✓ | ✓ |
(Nonexhaustive and to be added in the future)
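For reference, here are pure-Python versions of two of the listed metrics; the ml_metrics package provides equivalents (e.g. rmse and log_loss), so these definitions are for illustration only.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def log_loss(actual, predicted, eps=1e-15):
    """Mean Log Loss for binary labels; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)
```

Lower is better for both: rmse penalizes large errors quadratically, while log loss heavily penalizes confident wrong probabilities.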
Author: benhamner
Source Code: https://github.com/benhamner/Metrics
License: View license
H2O
H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).
H2O is extensible so that developers can add data transformations and custom algorithms of their choice and access them through all of those clients. H2O models can be downloaded and loaded into H2O memory for scoring, or exported into POJO or MOJO format for extremely fast scoring in production. More information can be found in the H2O User Guide.
H2O-3 (this repository) is the third incarnation of H2O, and the successor to H2O-2.
While most of this README is written for developers who do their own builds, most H2O users just download and use a pre-built version. If you are a Python or R user, the easiest way to install H2O is via PyPI or Anaconda (for Python) or CRAN (for R):
pip install h2o
install.packages("h2o")
For the latest stable, nightly, Hadoop (or Spark / Sparkling Water) releases, or the stand-alone H2O jar, please visit: https://h2o.ai/download
More info on downloading & installing H2O is available in the H2O User Guide.
Most people interact with three or four primary open source resources: GitHub (which you've already found), JIRA (for bug reports and issue tracking), Stack Overflow for H2O code/software-specific questions, and h2ostream (a Google Group / email discussion forum) for questions not suitable for Stack Overflow. There is also a Gitter H2O developer chat group; however, for archival purposes and to maximize accessibility, we'd prefer that standard H2O Q&A be conducted on Stack Overflow.
(Note: There is only one issue tracking system for the project. GitHub issues are not enabled; you must use JIRA.)
You can browse and create new issues in our open source JIRA: http://jira.h2o.ai
- Click the Issues menu
- Search for issues
- Click the Log In button on the top right of the screen
- Click Create an account near the bottom of the login box
- Click the Create button on the menu to create an issue

GitHub
JIRA -- file bug reports / track issues here
Stack Overflow -- ask all code/software questions here
Cross Validated (Stack Exchange) -- ask algorithm/theory questions here
h2ostream Google Group -- ask non-code related questions here
Gitter H2O Developer Chat
Documentation
Download (pre-built packages)
Jenkins (H2O build and test system)
Website
Twitter -- follow us for updates and H2O news!
Awesome H2O -- share your H2O-powered creations with us
Every nightly build publishes R, Python, Java, and Scala artifacts to a build-specific repository. In particular, you can find Java artifacts in the maven/repo directory.
Here is an example snippet of a gradle build file using h2o-3 as a dependency. Replace x, y, z, and nnnn with valid numbers.
// h2o-3 dependency information
def h2oBranch = 'master'
def h2oBuildNumber = 'nnnn'
def h2oProjectVersion = "x.y.z.${h2oBuildNumber}"
repositories {
// h2o-3 dependencies
maven {
url "https://s3.amazonaws.com/h2o-release/h2o-3/${h2oBranch}/${h2oBuildNumber}/maven/repo/"
}
}
dependencies {
compile "ai.h2o:h2o-core:${h2oProjectVersion}"
compile "ai.h2o:h2o-algos:${h2oProjectVersion}"
compile "ai.h2o:h2o-web:${h2oProjectVersion}"
compile "ai.h2o:h2o-app:${h2oProjectVersion}"
}
Refer to the latest H2O-3 bleeding edge nightly build page for information about installing nightly build artifacts.
Refer to the h2o-droplets GitHub repository for a working example of how to use Java artifacts with gradle.
Note: Stable H2O-3 artifacts are periodically published to Maven Central (click here to search) but may substantially lag behind H2O-3 Bleeding Edge nightly builds.
Getting started with H2O development requires JDK 1.8+, Node.js, Gradle, Python, and R. We use the Gradle wrapper (called gradlew) to ensure up-to-date local versions of Gradle and other dependencies are installed in your development directory.
Building h2o requires a properly set up R environment with required packages, and a Python environment with the following packages:
grip
future
tabulate
requests
wheel
To install these packages you can use pip or conda. If you have trouble installing these packages on Windows, please follow the Setup on Windows section of this guide.
(Note: It is recommended to use a virtual environment, such as VirtualEnv, to install all packages.)
To build H2O from the repository, perform the following steps.
# Build H2O
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew build -x test
You may encounter problems, e.g. npm missing; install it:
brew install npm
# Start H2O
java -jar build/h2o.jar
# Point browser to http://localhost:54321
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build
Notes:
- Running tests starts five test JVMs that form an H2O cluster and requires at least 8GB of RAM (preferably 16GB of RAM).
- Running ./gradlew syncRPackages is supported on Windows, OS X, and Linux, and is strongly recommended but not required. ./gradlew syncRPackages ensures a complete and consistent environment with pre-approved versions of the packages required for tests and builds. The packages can be installed manually, but we recommend setting an ENV variable and using ./gradlew syncRPackages. To set the ENV variable, use the following format (where ${WORKSPACE} can be any path):

mkdir -p ${WORKSPACE}/Rlibrary
export R_LIBS_USER=${WORKSPACE}/Rlibrary
git pull
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew clean
./gradlew build
We recommend using ./gradlew clean after each git pull.
Skip tests by adding -x test at the end of the gradle build command line. Tests typically run for 7-10 minutes on a Macbook Pro laptop with 4 CPUs (8 hyperthreads) and 16 GB of RAM.
Syncing smalldata is not required after each pull, but if tests fail due to missing data files, then try ./gradlew syncSmalldata as the first troubleshooting step. Syncing smalldata downloads data files from AWS S3 to the smalldata directory in your workspace. The sync is incremental. Do not check in these files. The smalldata directory is in .gitignore. If you do not run any tests, you do not need the smalldata directory.
./gradlew clean && ./gradlew build -x test && (export DO_FAST=1; ./gradlew dist)
open target/docs-website/h2o-docs/index.html
Step 1: Download and install WinPython.
From the command line, validate that python is using the newly installed package by running which python (or sudo which python). Update the Environment variable with the WinPython path.
Step 2: Install required Python packages:
pip install grip future tabulate wheel
Step 3: Install JDK
Install Java 1.8+ and add the appropriate directory C:\Program Files\Java\jdk1.7.0_65\bin with java.exe to PATH in Environment Variables. To make sure the command prompt is detecting the correct Java version, run:
javac -version
The CLASSPATH variable also needs to be set to the lib subfolder of the JDK:
CLASSPATH=/<path>/<to>/<jdk>/lib
Step 4. Install Node.js
Install Node.js and add the installed directory C:\Program Files\nodejs, which must include node.exe and npm.cmd, to PATH if not already prepended.
Step 5. Install R, the required packages, and Rtools:
Install R and add the bin directory to your PATH if not already included.
Install the following R packages:
To install these packages from within an R session:
pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}
Note that libcurl is required for installation of the RCurl R package.
Note that these packages don't cover running tests; they are for building H2O only.
Finally, install Rtools, which is a collection of command line tools to facilitate R development on Windows.
NOTE: During Rtools installation, do not install Cygwin.dll.
Step 6. Install Cygwin
NOTE: During installation of Cygwin, deselect the Python packages to avoid a conflict with the Python.org package.
Step 6b. Validate Cygwin
If Cygwin is already installed, remove the Python packages or ensure that Native Python is before Cygwin in the PATH variable.
Step 7. Update or validate the Windows PATH variable to include R, Java JDK, Cygwin.
Step 8. Git Clone h2o-3
If you don't already have a Git client, please install one. The default one can be found at http://git-scm.com/downloads. Make sure that command prompt support is enabled before the installation.
Download and update the h2o-3 source code:
git clone https://github.com/h2oai/h2o-3
Step 9. Run the top-level gradle build:
cd h2o-3
./gradlew.bat build
If you encounter errors, run again with --stacktrace for more instructions on missing dependencies.
If you don't have Homebrew, we recommend installing it. It makes package management for OS X easy.
Step 1. Install JDK
Install Java 1.8+. To make sure the command prompt is detecting the correct Java version, run:
javac -version
Step 2. Install Node.js:
Using Homebrew:
brew install node
Otherwise, install from the NodeJS website.
Step 3. Install R and the required packages:
Install R and add the bin directory to your PATH if not already included.
Install the following R packages:
To install these packages from within an R session:
pkgs <- c("RCurl", "jsonlite", "statmod", "devtools", "roxygen2", "testthat")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) install.packages(pkg)
}
Note that libcurl is required for installation of the RCurl R package.
Note that these packages don't cover running tests; they are for building H2O only.
Step 4. Install python and the required packages:
Install python:
brew install python
Install pip package manager:
sudo easy_install pip
Next install required packages:
sudo pip install wheel requests future tabulate
Step 5. Git Clone h2o-3
OS X should already have Git installed. To download and update the h2o-3 source code:
git clone https://github.com/h2oai/h2o-3
Step 6. Run the top-level gradle build:
cd h2o-3
./gradlew build
Note: on a regular machine it may take a very long time (about an hour) to run all the tests.
If you encounter errors, run again with --stacktrace for more instructions on missing dependencies.
Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_0.12 | sudo bash -
sudo apt-get install -y nodejs
Step 2. Install JDK:
Install Java 8. Installation instructions can be found here JDK installation. To make sure the command prompt is detecting the correct Java version, run:
javac -version
Step 3. Install R and the required packages:
Installation instructions can be found here R installation. Click “Download R for Linux”. Click “ubuntu”. Follow the given instructions.
To install the required packages, follow the same instructions as for OS X above.
Note: If the process fails to install RStudio Server on Linux, run one of the following:
sudo apt-get install libcurl4-openssl-dev
or
sudo apt-get install libcurl4-gnutls-dev
Step 4. Git Clone h2o-3
If you don't already have a Git client:
sudo apt-get install git
Download and update the h2o-3 source code:
git clone https://github.com/h2oai/h2o-3
Step 5. Run the top-level gradle build:
cd h2o-3
./gradlew build
If you encounter errors, run again using --stacktrace for more instructions on missing dependencies.
Make sure that you are not running as root, since bower will reject such a run.
Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_16.x | sudo bash -
sudo apt-get install -y nodejs
Steps 2-4. Follow steps 2-4 for Ubuntu 14.04 (above)
cd /opt
sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-x64.tar.gz"
sudo tar xzf jdk-7u79-linux-x64.tar.gz
cd jdk1.7.0_79
sudo alternatives --install /usr/bin/java java /opt/jdk1.7.0_79/bin/java 2
sudo alternatives --install /usr/bin/jar jar /opt/jdk1.7.0_79/bin/jar 2
sudo alternatives --install /usr/bin/javac javac /opt/jdk1.7.0_79/bin/javac 2
sudo alternatives --set jar /opt/jdk1.7.0_79/bin/jar
sudo alternatives --set javac /opt/jdk1.7.0_79/bin/javac
cd /opt
sudo wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
sudo rpm -ivh epel-release-7-5.noarch.rpm
sudo echo "multilib_policy=best" >> /etc/yum.conf
sudo yum -y update
sudo yum -y install R R-devel git python-pip openssl-devel libxml2-devel libcurl-devel gcc gcc-c++ make openssl-devel kernel-devel texlive texinfo texlive-latex-fonts libX11-devel mesa-libGL-devel mesa-libGL nodejs npm python-devel numpy scipy python-pandas
sudo pip install scikit-learn grip tabulate statsmodels wheel
mkdir ~/Rlibrary
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=/opt/jdk1.7.0_79/jre
export PATH=$PATH:/opt/jdk1.7.0_79/bin:/opt/jdk1.7.0_79/jre/bin
export R_LIBS_USER=~/Rlibrary
# install local R packages
R -e 'install.packages(c("RCurl","jsonlite","statmod","devtools","roxygen2","testthat"), dependencies=TRUE, repos="http://cran.rstudio.com/")'
cd
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
# Build H2O
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build -x test
To start the H2O cluster locally, execute the following on the command line:
java -jar build/h2o.jar
A list of available start-up JVM and H2O options (e.g. -Xmx, -nthreads, -ip) is available in the H2O User Guide.
Pre-built H2O-on-Hadoop zip files are available on the download page. Each Hadoop distribution version has a separate zip file in h2o-3.
To build H2O with Hadoop support yourself, first install sphinx for python: pip install sphinx
Then start the build by entering the following from the top-level h2o-3 directory:
(export BUILD_HADOOP=1; ./gradlew build -x test)
./gradlew dist
This will create a directory called 'target' and generate zip files there. Note that BUILD_HADOOP is the default behavior when the username is jenkins (refer to settings.gradle); otherwise you have to request it, as shown above.
In the h2o-hadoop directory, each Hadoop version has a build directory for the driver and an assembly directory for the fatjar.
To add support for a new Hadoop version, you need to:
- add a new driver directory and assembly directory (each with a build.gradle file) in h2o-hadoop
- add these new projects to h2o-3/settings.gradle
- add the new Hadoop version to HADOOP_VERSIONS in make-dist.sh
- add the new Hadoop version to the list in h2o-dist/buildinfo.json
Hadoop supports secure user impersonation through its Java API. A kerberos-authenticated user can be allowed to proxy any username that meets specified criteria entered in the NameNode's core-site.xml file. This impersonation only applies to interactions with the Hadoop API or the APIs of Hadoop-related services that support it (this is not the same as switching to that user on the machine of origin).
Setting up secure user impersonation (for h2o):
Impersonated HDFS actions can be viewed in the hdfs audit log ('auth:PROXY' should appear in the ugi=
field in entries where this is applicable). YARN similarly should show 'auth:PROXY' somewhere in the Resource Manager UI.
To use secure impersonation with h2o's Hadoop driver:
Before this is attempted, see Risks with impersonation, below
When using the h2odriver (e.g. when running with hadoop jar ...), specify -principal <proxy user kerberos principal>, -keytab <proxy user keytab path>, and -run_as_user <hadoop username to impersonate>, in addition to any other arguments needed. If the configuration was successful, the proxy user will log in and impersonate the -run_as_user as long as that user is allowed by either the users or groups configuration property (configured above); this is enforced by HDFS & YARN, not h2o's code. The driver effectively sets its security context as the impersonated user, so all supported Hadoop actions will be performed as that user (e.g. YARN and HDFS APIs support securely impersonated users, but others may not).
Risks with impersonation:
- Limit the hosts this role may be impersonated from using the hadoop.proxyuser.<proxyusername>.hosts property whenever possible or practical.
- Do not give the proxy user's password or keytab to any user you don't want to be able to impersonate (do not switch to this role with su, for instance).
- Setting any hadoop.proxyuser.<proxyusername>.{hosts,groups,users} property to '*' can greatly increase exposure to security risk.

$ git diff
diff --git a/h2o-app/build.gradle b/h2o-app/build.gradle
index af3b929..097af85 100644
--- a/h2o-app/build.gradle
+++ b/h2o-app/build.gradle
@@ -8,5 +8,6 @@ dependencies {
compile project(":h2o-algos")
compile project(":h2o-core")
compile project(":h2o-genmodel")
+ compile project(":h2o-persist-hdfs")
}
diff --git a/h2o-persist-hdfs/build.gradle b/h2o-persist-hdfs/build.gradle
index 41b96b2..6368ea9 100644
--- a/h2o-persist-hdfs/build.gradle
+++ b/h2o-persist-hdfs/build.gradle
@@ -2,5 +2,6 @@ description = "H2O Persist HDFS"
dependencies {
compile project(":h2o-core")
- compile("org.apache.hadoop:hadoop-client:2.0.0-cdh4.3.0")
+ compile("org.apache.hadoop:hadoop-client:2.4.1-mapr-1408")
+ compile("org.json:org.json:chargebee-1.0")
}
Sparkling Water combines two open-source technologies: Apache Spark and the H2O Machine Learning platform. It makes H2O’s library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, and Distributed Random Forest, accessible from Spark workflows. Spark users can select the best features from either platform to meet their Machine Learning needs. Users can combine Spark's RDD API and Spark MLLib with H2O’s machine learning algorithms, or use H2O independently of Spark for the model building process and post-process the results in Spark.
Sparkling Water Resources:
The main H2O documentation is the H2O User Guide. Visit http://docs.h2o.ai for the top-level introduction to documentation on H2O projects.
To generate the REST API documentation, use the following commands:
cd ~/h2o-3
cd py
python ./generate_rest_api_docs.py # to generate Markdown only
python ./generate_rest_api_docs.py --generate_html --github_user GITHUB_USER --github_password GITHUB_PASSWORD # to generate Markdown and HTML
The default location for the generated documentation is build/docs/REST.
If the build fails, try gradlew clean, then git clean -f.
Documentation for each bleeding edge nightly build is available on the nightly build page.
If you use H2O as part of your workflow in a publication, please cite your H2O resource(s) using the following BibTex entry:
@Manual{h2o_package_or_module,
title = {package_or_module_title},
author = {H2O.ai},
year = {year},
month = {month},
note = {version_information},
url = {resource_url},
}
Formatted H2O Software citation examples:
H2O algorithm booklets are available at the Documentation Homepage.
@Manual{h2o_booklet_name,
title = {booklet_title},
author = {list_of_authors},
year = {year},
month = {month},
url = {link_url},
}
Formatted booklet citation examples:
Arora, A., Candel, A., Lanford, J., LeDell, E., and Parmar, V. (Oct. 2016). Deep Learning with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf.
Click, C., Lanford, J., Malohlava, M., Parmar, V., and Roark, H. (Oct. 2016). Gradient Boosted Models with H2O. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GBMBooklet.pdf.
H2O has been built by a great many contributors over the years, both within H2O.ai (the company) and the greater open source community. You can begin to contribute to H2O by answering Stack Overflow questions or filing bug reports. Please join us!
SriSatish Ambati
Cliff Click
Tom Kraljevic
Tomas Nykodym
Michal Malohlava
Kevin Normoyle
Spencer Aiello
Anqi Fu
Nidhi Mehta
Arno Candel
Josephine Wang
Amy Wang
Max Schloemer
Ray Peck
Prithvi Prabhu
Brandon Hill
Jeff Gambera
Ariel Rao
Viraj Parmar
Kendall Harris
Anand Avati
Jessica Lanford
Alex Tellez
Allison Washburn
Amy Wang
Erik Eckstrand
Neeraja Madabhushi
Sebastian Vidrio
Ben Sabrin
Matt Dowle
Mark Landry
Erin LeDell
Andrey Spiridonov
Oleg Rogynskyy
Nick Martin
Nancy Jordan
Nishant Kalonia
Nadine Hussami
Jeff Cramer
Stacie Spreitzer
Vinod Iyengar
Charlene Windom
Parag Sanghavi
Navdeep Gill
Lauren DiPerna
Anmol Bal
Mark Chan
Nick Karpov
Avni Wadhwa
Ashrith Barthur
Karen Hayrapetyan
Jo-fai Chow
Dmitry Larko
Branden Murray
Jakub Hava
Wen Phan
Magnus Stensmo
Pasha Stetsenko
Angela Bartz
Mateusz Dymczyk
Micah Stubbs
Ivy Wang
Terone Ward
Leland Wilkinson
Wendy Wong
Nikhil Shekhar
Pavel Pscheidl
Michal Kurka
Veronika Maurerova
Jan Sterba
Jan Jendrusak
Sebastien Poirier
Tomáš Frýda
Ard Kelmendi
Scientific Advisory Council
Stephen Boyd
Rob Tibshirani
Trevor Hastie
Systems, Data, FileSystems and Hadoop
Doug Lea
Chris Pouliot
Dhruba Borthakur
Jishnu Bhattacharjee, Nexus Venture Partners
Anand Babu Periasamy
Anand Rajaraman
Ash Bhardwaj
Rakesh Mathur
Michael Marks
Egbert Bierman
Rajesh Ambati
- hadoop.proxyuser.<proxyusername>.hosts: the hosts the proxy user is allowed to perform impersonated actions on behalf of a valid user from
- hadoop.proxyuser.<proxyusername>.groups: the groups an impersonated user must belong to for impersonation to work with that proxy user
- hadoop.proxyuser.<proxyusername>.users: the users a proxy user is allowed to impersonate

<property>
  <name>hadoop.proxyuser.myproxyuser.hosts</name>
  <value>host1,host2</value>
</property>
<property>
  <name>hadoop.proxyuser.myproxyuser.groups</name>
  <value>group1,group2</value>
</property>
<property>
  <name>hadoop.proxyuser.myproxyuser.users</name>
  <value>user1,user2</value>
</property>
mkdir -p ${WORKSPACE}/Rlibrary
export R_LIBS_USER=${WORKSPACE}/Rlibrary
Author: h2oai
Source Code: https://github.com/h2oai/h2o-3
License: Apache-2.0 License
1651593480
Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the field standard for doing this.
Gym documentation website is at https://www.gymlibrary.ml/, and you can propose fixes and changes here.
To install the base Gym library, use pip install gym.
This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install the dependencies for one family like pip install gym[atari], or use pip install gym[all] to install all dependencies.
We support Python 3.7, 3.8, 3.9 and 3.10 on Linux and macOS. We will accept PRs related to Windows, but do not officially support it.
The Gym API models environments as simple Python env classes. Creating environment instances and interacting with them is very simple. Here's an example using the "CartPole-v1" environment:
import gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42, return_info=True)
for _ in range(1000):
action = env.action_space.sample()
observation, reward, done, info = env.step(action)
if done:
observation, info = env.reset(return_info=True)
env.close()
Gym keeps strict versioning for reproducibility reasons. All environments end in a suffix like "-v0". When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion.
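The "<Name>-v<N>" convention above can be sketched with a small stdlib-only helper. Note this helper is hypothetical, for illustration only, and not part of Gym's API:

```python
import re

# Hypothetical helper illustrating Gym's "<Name>-v<N>" id convention;
# not part of Gym's public API.
def split_env_id(env_id):
    m = re.fullmatch(r"(.+)-v(\d+)", env_id)
    if m is None:
        raise ValueError(f"not a versioned env id: {env_id!r}")
    return m.group(1), int(m.group(2))

print(split_env_id("CartPole-v1"))  # ('CartPole', 1)
```

Bumping the trailing number (e.g. "CartPole-v0" to "CartPole-v1") signals that results obtained on the two versions are not comparable.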
A whitepaper from when Gym just came out is available at https://arxiv.org/pdf/1606.01540, and can be cited with the following bibtex entry:
@misc{1606.01540,
Author = {Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba},
Title = {OpenAI Gym},
Year = {2016},
Eprint = {arXiv:1606.01540},
}
There used to be release notes for all the new Gym versions here. New release notes are being moved to releases page on GitHub, like most other libraries do. Old notes can be viewed here.
Author: openai
Source Code: https://github.com/openai/gym
License: View license
1650823200
m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#, Rust, Elixir).
Supported Python version is >= 3.7.
pip install m2cgen
| | Classification | Regression |
|---|---|---|
| Linear | | |
| SVM | | |
| Tree | | |
| Random Forest | | |
| Boosting | | |
You can find versions of packages with which compatibility is guaranteed by CI tests here. Other versions can also be supported but they are untested.
Scalar value; signed distance of the sample to the hyperplane for the second class.
Vector value; signed distance of the sample to the hyperplane per each class.
The output is consistent with the output of LinearClassifierMixin.decision_function.
Scalar value; signed distance of the sample to the separating hyperplane: positive for an inlier and negative for an outlier.
Scalar value; signed distance of the sample to the hyperplane for the second class.
Vector value; one-vs-one score for each class, shape (n_samples, n_classes * (n_classes-1) / 2).
The output is consistent with the output of BaseSVC.decision_function when the decision_function_shape is set to ovo.
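The ovo shape mentioned above comes from training one binary classifier per unordered pair of classes; a quick sketch of that count:

```python
# Number of one-vs-one decision values produced for an n-class SVC:
# one binary classifier per unordered pair of classes.
def n_ovo_scores(n_classes):
    return n_classes * (n_classes - 1) // 2

print(n_ovo_scores(4))  # 6
```

So a 4-class SVC with decision_function_shape="ovo" yields 6 scores per sample.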
Vector value; class probabilities.
Vector value; class probabilities.
The output is consistent with the output of the predict_proba method of DecisionTreeClassifier / ExtraTreeClassifier / ExtraTreesClassifier / RandomForestClassifier / XGBRFClassifier / XGBClassifier / LGBMClassifier.
Here's a simple example of how a linear model trained in a Python environment can be represented in Java code:
from sklearn.datasets import load_diabetes
from sklearn import linear_model
import m2cgen as m2c
X, y = load_diabetes(return_X_y=True)
estimator = linear_model.LinearRegression()
estimator.fit(X, y)
code = m2c.export_to_java(estimator)
Generated Java code:
public class Model {
public static double score(double[] input) {
return ((((((((((152.1334841628965) + ((input[0]) * (-10.012197817470472))) + ((input[1]) * (-239.81908936565458))) + ((input[2]) * (519.8397867901342))) + ((input[3]) * (324.39042768937657))) + ((input[4]) * (-792.1841616283054))) + ((input[5]) * (476.74583782366153))) + ((input[6]) * (101.04457032134408))) + ((input[7]) * (177.06417623225025))) + ((input[8]) * (751.2793210873945))) + ((input[9]) * (67.62538639104406));
}
}
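To see that the generated function is just the model's intercept plus a dot product, here is a plain-Python sketch using the coefficients copied from the Java snippet above:

```python
# Coefficients and intercept copied from the generated Java code above.
COEF = [-10.012197817470472, -239.81908936565458, 519.8397867901342,
        324.39042768937657, -792.1841616283054, 476.74583782366153,
        101.04457032134408, 177.06417623225025, 751.2793210873945,
        67.62538639104406]
INTERCEPT = 152.1334841628965

def score(inputs):
    # Equivalent of the generated Model.score: intercept + dot(COEF, inputs)
    return INTERCEPT + sum(c * x for c, x in zip(COEF, inputs))

print(score([0.0] * 10))  # all-zero input leaves only the intercept
```

This is why the transpiled code has no runtime dependencies: a fitted linear model reduces to constants and arithmetic.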
You can find more examples of generated code for different models/languages here.
m2cgen can be used as a CLI tool to generate code using serialized model objects (pickle protocol):
$ m2cgen <pickle_file> --language <language> [--indent <indent>] [--function_name <function_name>]
[--class_name <class_name>] [--module_name <module_name>] [--package_name <package_name>]
[--namespace <namespace>] [--recursion-limit <recursion_limit>]
Don't forget that for unpickling serialized model objects their classes must be defined in the top level of an importable module in the unpickling environment.
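The top-level requirement can be demonstrated with a stdlib-only sketch; the model classes here are hypothetical stand-ins, not real estimators:

```python
import pickle

class TopLevelModel:
    # Picklable: pickle records the importable path to this class
    # and looks it up again when deserializing.
    def predict(self, x):
        return 2 * x

def make_local_model():
    class LocalModel:  # defined inside a function: not importable by path
        def predict(self, x):
            return 2 * x
    return LocalModel()

restored = pickle.loads(pickle.dumps(TopLevelModel()))
print(restored.predict(21))  # 42

try:
    pickle.dumps(make_local_model())
except (AttributeError, pickle.PicklingError) as err:
    print("cannot pickle local class:", type(err).__name__)
```

The same rule applies to any custom transformers or wrapper classes saved alongside a model: they must live at module top level in the environment where m2cgen unpickles them.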
Piping is also supported:
$ cat <pickle_file> | m2cgen --language <language>
Q: Generation fails with a RecursionError: maximum recursion depth exceeded error.
A: If this error occurs while generating code using an ensemble model, try to reduce the number of trained estimators within that model. Alternatively, you can increase the maximum recursion depth with sys.setrecursionlimit(<new_depth>).
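A minimal sketch of that workaround; the new depth of 10000 is an arbitrary example value:

```python
import sys

# Inspect, then raise, the interpreter's recursion limit before generation.
print(sys.getrecursionlimit())  # commonly 1000 by default
sys.setrecursionlimit(10000)
```

Raise the limit before calling the export function (or before invoking the CLI via a small wrapper script), since deep ensembles produce deeply nested expression trees during code generation.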
Q: Generation fails with an ImportError: No module named <module_name_here> error while transpiling a model from a serialized model object.
A: This error indicates that the pickle protocol cannot deserialize the model object. For unpickling serialized model objects, their classes must be defined in the top level of an importable module in the unpickling environment. Installing the package which provides the model's class definition should solve the problem.
Q: Code generated by m2cgen provides different results for some inputs compared to the original Python model from which the code was obtained.
A: Some models force input data to a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the float64 (double) data type. You can try to cast your input data to this type manually and check the results again. Also, some small differences can happen due to the specific implementation of floating-point arithmetic in the target language.
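The kind of discrepancy described above can be reproduced with the stdlib alone by round-tripping a value through 32-bit precision:

```python
import struct

x = 0.1  # Python floats are 64-bit (float64 / "double")
# Round-trip through a 32-bit float, as a lower-precision target might do.
x32 = struct.unpack("f", struct.pack("f", x))[0]
print(x, x32)  # 0.1 0.10000000149011612
```

If a target language (or your input pipeline) quietly narrows values like this, predictions near decision boundaries can flip even though both implementations are "correct".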
Author: BayesWitnesses
Source Code: https://github.com/BayesWitnesses/m2cgen
License: MIT License
1650736800
Notebook experience in your Clojure namespace
This tool is an attempt to answer the following question: can we have a notebook-like experience in Clojure without leaving one's favourite editor?
See this recorded Overview.
Version 4 is actively developed and in alpha stage. If you are new to Notespace, this is the version we recommend trying out.
Version 3 and Version 2 have been used in some projects. We are not planning to develop them further, but please reach out if you need any support.
Notespace has been evolving gradually, slowly realizing some lessons of usage in the Scicloj study groups, in individual research projects, and in documenting some of the Scicloj libraries.
Version 4 -- please go here if you are new to Notespace
See details in the dedicated version pages linked above.
Hearing your comments, opinions and wishes will help!
#notespace-dev at the Clojurians Zulip.
There are several magnificent existing options for literate programming in Clojure: Marginalia, Org-Babel, Gorilla REPL, Oz, Saite, Clojupyter, Nextjournal, Pink-Gorilla/Goldly, Clerk. Most of them are actively developed.
Creating a separate alternative would be the least desired outcome of the current project. Rather, the hope is to compose and integrate well with some of the other projects.
Author: scicloj
Source Code: https://github.com/scicloj/notespace
License: EPL-2.0 License
1650715200
A Jupyter kernel for Clojure - run Clojure code in Jupyter Lab, Notebook and Console.
In the examples folder of the repository there are 3 example notebooks showing some of the features of clojupyter. See this notebook showing examples of how you can display HTML and use external Javascript:
There are 3 example notebooks because Jupyter offers several distinct user interfaces - Jupyter Lab, Jupyter Notebook and Jupyter Console - which have different feature sets, for which clojupyter offers different support. We have one example notebook showing the features shared by Lab and Notebook, and one for each showing their distinct features. According to the Jupyter development roadmaps, Jupyter Notebook will eventually be phased out and completely replaced by Jupyter Lab.
You can also use existing JVM charting libraries since you can render any Java BufferedImage.
Clojupyter can be used in several ways; please read Usage Scenarios to find out which type of use model best fits your needs, and how to install Clojupyter in that scenario.
To start Jupyter Notebook do:
jupyter notebook
and choose 'New' in the top right corner and select 'Clojure (clojupyter...)' kernel.
To start Jupyter Lab do:
jupyter lab
You can also start the Jupyter Console by:
jupyter-console --kernel=<clojupyter-kernel-name>
Use jupyter-kernelspec list
to list all available kernels. For example, if you installed clojupyter using conda, the start command is:
jupyter-console --kernel=conda-clojupyter
If you are using Clojupyter as a library, you can use Clojupyter's command line interface to perform operations such as listing, installing, and removing Clojupyter kernels.
For example, in a Clojure repository which includes Clojupyter, you can get the list of available commands:
bash> clj -m clojupyter.cmdline list-commands
Clojupyter v0.2.3 - List commands
Clojupyter commands:
- help
- install
- list-commands
- list-installs
- list-installs-matching
- remove-installs-matching
- remove-install
- version
You can invoke Clojupyter commands like this:
clj -m clojupyter.cmdline <command>
or, if you have set up lein configuration, like this:
lein clojupyter <command>
See documentation for details.
See Command Line Interface for more details.
Development progress is based on voluntary efforts so we can't make any promises, but the wish list for clojupyter development looks something like this:
Feed-back is welcomed, use the discussions page to ask questions, give suggestions or just to say hi 👋.
If you have issues with Clojupyter, check the issues page to see if your problem is already reported and open a new issue if needed.
Author: clojupyter
Source Code: https://github.com/clojupyter/clojupyter
License: MIT License
1650708000
Pink Gorilla Notebook is a rich browser based notebook REPL for Clojure and ClojureScript, which aims at extensibility (development- and runtime) and user experience while being very lightweight. Extensibility primarily revolves around UI visualisations and data.
Whichever method you use to start the notebook, you should reach it at http://localhost:8000/.
The easiest way to run the notebook locally is leveraging the clojure cli:
clojure -Sdeps '{:deps {org.pinkgorilla/notebook-bundel {:mvn/version "RELEASE"}}}' -m pinkgorilla.notebook-bundel
Since the default bundel ships many default ui extensions, you want to use the notebook-bundel artefact: the javascript frontend app has already been precompiled, which results in faster startup time.
We recommend using tools.deps over leiningen for two reasons:
One way to configure the notebook is to pass it an edn configuration file. An example is notebook edn config
In your deps.edn add this alias:
:notebook {:extra-deps {org.pinkgorilla/notebook-bundel {:mvn/version "RELEASE"}}
:exec-fn pinkgorilla.notebook-bundel/run
:exec-args {:config "notebook-config.edn"}}
then run it with clojure -X:notebook
.
trateg uses notebook-bundel with deps.edn: Clone trateg and run clojure -X:notebook
**We don't recommend using leiningen with notebook, as leiningen does not use the highest version of dependencies.**
If you define your own ui extensions, you need to compile the javascript bundel. This requires some extra initial compilation time.
ui-quil use deps.edn to build a custom notebook bundel (that includes the library that gets built).
gorilla-ui and ui-vega use leiningen to run notebooks with a custom build bundel, and with custom notebook folder.
Documentation has been moved over here
This option is mainly there for development of the notebook itself. For regular use, the long compile times make it impractical.
Run clojure -X:notebook
to run the notebook.
This runs the notebook with ui libraries bundled:
Run clojure -X:develop
to run the develop ui.
Author: pink-gorilla
Source Code: https://github.com/pink-gorilla/notebook
License:
1650700800
Envision is a small, easy to use Clojure library for data processing, cleanup and visualisation. If you've heard about Incanter, you may see a couple of things that we do in a similar way.
You can check out a couple of rendered examples here.
Envision is a relatively young project. Since it's never meant to be used in hard production (e.g. it will never be something user-facing), and is intended to be used by people who'd like to yield some information from their data, it should be stable enough from the very early releases.
Envision artifacts are released to Clojars. If you are using Maven, add the following repository definition to your pom.xml
:
<repository>
<id>clojars.org</id>
<url>http://clojars.org/repo</url>
</repository>
With Leiningen:
[clojurewerkz/envision "0.1.0-SNAPSHOT"]
With Maven:
<dependency>
<groupId>clojurewerkz</groupId>
<artifactId>envision</artifactId>
<version>0.1.0-SNAPSHOT</version>
</dependency>
The main idea of this library is to make exploratory analysis more interactive and visual, although in a programmer's way. Envision creates a "throwaway environment" every time you, for example, make a line chart. You can modify the chart the way you want, change all the possible configuration parameters, filter data, and add exponents in ways we wouldn't be able to program for you.
We concluded that visual environments are often constraining, and creating an API for every single feature would make it amazingly big and bloated. So we do a bare minimum, which is already helpful by default through the API, and let you configure everything you could've possibly imagined yourself: adding interactivity, combining charts, customizing layouts and so on.
The main entrypoint is clojurewerkz.envision.core/render. It creates a temporary directory with all the required dependencies and returns you a path to it. For example, let's generate some data and render line and area charts:
(ns my-ns
  (:require [clojurewerkz.envision.core :as envision]
            [clojurewerkz.envision.chart-config :as cfg]))
(envision/render
[(envision/histogram 10 (take 100 (distribution/normal-distribution 5 10))
{:tick-format "s"})
(envision/linear-regression
(flatten (for [i (range 0 20)]
[{:year (+ 2000 i)
:income (+ 10 i (rand-int 10))
:series "series-1"}
{:year (+ 2000 i)
:income (+ 10 i (rand-int 20))
:series "series-2"}]
))
:year
:income
[:year :income :series])
(cfg/make-chart-config
{:id "line"
:headline "Line Chart"
:x "year"
:y "income"
:x-config {:order-rule "year"}
:series-type "line"
:data (flatten (for [i (range 0 20)]
[{:year (+ 2000 i)
:income (+ 10 i (rand-int 10))
:series "series-1"}
{:year (+ 2000 i)
:income (+ 10 i (rand-int 20))
:series "series-2"}]
))
:series "series"
:interpolation :cardinal
})
(cfg/make-chart-config
{:id "area"
:headline "Area Chart"
:x "year"
:y "income"
:x-config {:order-rule "year"}
:series-type "area"
:data (into [] (for [i (range 0 20)] {:year (+ 2000 i) :income (+ 10 i (rand-int 10))}))
:interpolation :cardinal
})
])
The function will return a tmp folder path, like:
/var/folders/1y/xr7zvp2j035bpq09whg7th5w0000gn/T/envision-1402385765815-3502705781
cd into this path and start an HTTP server. On most systems you'd have Python 2.7 installed:
python -m SimpleHTTPServer
(on Python 3, use python -m http.server instead)
After that you can point your browser to
http://localhost:4000/templates/index.html
If you don't want to start an HTTP server, or don't have Python installed, just open templates/index_file.html
static file in your browser.
You can check out a couple of example graphs rendered as static files here.
We decided to use a simple HTTP server by default, since sometimes d3 doesn't like the file:// protocol. However, you can always just open templates/index_file.html in your browser and get pretty much the same result.
In order to configure a chart, you have to specify:

- id, a unique string literal identifying the chart
- data, a sequence of maps, where each map represents an entry to be displayed
- x, the key that should be taken as the x value for each rendered point
- y, the key that should be taken as the y value for each rendered point
- series-type, one of line, bubble, area and bar for line charts, scatterplots, area charts and bar charts, correspondingly

Optionally, you can specify:

- series, which will split your data, grouping or color-coding charts by the given keys; keys should be given either as a string or a vector of strings
- interpolation, the interpolation type to be used in an area or line chart; usually you want to use linear, basis, or step-after, but there are more options, which will be mentioned in a corresponding section
- x-config, which specifies a configuration for the X axis

x-config options:

- order-rule specifies a key to sort data points on the x axis, if it's not x
- override-min overrides the minimum for an axis

Envision supports Clojure 1.4+.
To subscribe for announcements of releases, important changes and so on, please follow @ClojureWerkz on Twitter.
Envision is part of the group of libraries known as ClojureWerkz, together with Monger, Elastisch, Langohr, Welle, Titanium and several others.
Envision uses Leiningen 2. Make sure you have it installed and then run tests against all supported Clojure versions using
lein2 all test
Then create a branch and make your changes on it. Once you are done with your changes and all tests pass, submit a pull request on Github.
Author: clojurewerkz
Source Code: https://github.com/clojurewerkz/envision
License:
#machine-learning #DataVisualisation
1650693600
Oz is a data visualization and scientific document processing library for Clojure built around Vega-Lite & Vega.
Vega-Lite & Vega are declarative grammars for describing interactive data visualizations. Of note, they are based on the Grammar of Graphics, which served as the guiding light for the popular R ggplot2
viz library. With Vega & Vega-Lite, we define visualizations by declaratively specifying how attributes of our data map to aesthetic properties of a visualization. Vega-Lite in particular focuses on maximal productivity and leverage for day to day usage (and is the place to start), while Vega (to which Vega-Lite compiles) is ideal for more nuanced control.
Oz itself provides:
- view!: Clojure REPL API for pushing Vega-Lite & Vega (+ hiccup) data to a browser window over a websocket
- vega, vega-lite: Reagent component API for dynamic client side ClojureScript apps
- publish!: create a GitHub gist with Vega-Lite & Vega (+ hiccup), and print a link to visualize it with either the IDL's live vega editor or ozviz.io
- load: load markdown, hiccup or Vega/Vega-Lite files (+ combinations) from disk as EDN or JSON
- export!: write out self-contained html files with live/interactive visualizations embedded
- oz.notebook.<kernel>: embed Vega-Lite & Vega data (+ hiccup) in Jupyter notebooks via the Clojupyter & IClojure kernels
- live-reload!: live clj code reloading (à la Figwheel), tuned for data-science hackery (only reruns from the first changed form, for a pleasant, performant live-coding experience)
- live-view!: a similar Figwheel-inspired function for watching and view!ing .md, .edn and .json files with Vega-Lite & Vega (+ (or markdown hiccup))
- build!: generate a static website from directories of markdown, hiccup &/or interactive Vega-Lite & Vega visualizations, while being able to see changes live (as with live-view!)

To take full advantage of the data visualization capabilities of Oz, it pays to understand the core Vega & Vega-Lite concepts. If you're new to the scene, it's worth taking a few minutes to orient yourself with this mindblowing talk/demo from the creators at the Interactive Data Lab (IDL) at University of Washington.
Watched the IDL talk and hungry for more content? Here's another which focuses on the philosophical ideas behind Vega & Vega-Lite, how they relate to Clojure, and how you can use the tools from Clojure using Oz.
This Readme is the canonical entry point for learning about Oz. You may also want to check out the cljdoc page (if you're not there already) for API & other docs, and look at the examples directory of this project (referenced occasionally below).
There are some other things in the Vega/Vega-Lite ecosystem that you may want to look at for getting started or learning more.

If you clone this repository and open up the dev/user.clj file, you can follow along by executing the commented out code block at the end of the file.
Assuming you're starting from scratch, first add oz to your leiningen project dependencies
Next, require oz and start the plot server as follows:
(require '[oz.core :as oz])
(oz/start-server!)
This will fire up a browser window with a websocket connection for funneling view data back and forth. If you forget to call this function, it will be called for you when you create your first plot, but be aware that it will delay the first display, and it's possible you'll have to resend the plot on a slower computer.
Next we'll define a function for generating some dummy data
(defn play-data [& names]
(for [n names
i (range 20)]
{:time i :item n :quantity (+ (Math/pow (* i (count n)) 0.8) (rand-int (count n)))}))
oz/view!
The main function for displaying vega or vega-lite is oz/view!.
For example, a simple line plot:
(def line-plot
{:data {:values (play-data "monkey" "slipper" "broom")}
:encoding {:x {:field "time" :type "quantitative"}
:y {:field "quantity" :type "quantitative"}
:color {:field "item" :type "nominal"}}
:mark "line"})
;; Render the plot
(oz/view! line-plot)
Should render something like:
Another example:
(def stacked-bar
{:data {:values (play-data "munchkin" "witch" "dog" "lion" "tiger" "bear")}
:mark "bar"
:encoding {:x {:field "time"
:type "ordinal"}
:y {:aggregate "sum"
:field "quantity"
:type "quantitative"}
:color {:field "item"
:type "nominal"}}})
(oz/view! stacked-bar)
This should render something like:
For vega instead of vega-lite, you can also specify :mode :vega to oz/view!:
;; load some example vega (this may only work from within a checkout of oz; haven't checked)
(require '[cheshire.core :as json])
(def contour-plot (oz/load "examples/contour-lines.vega.json"))
(oz/view! contour-plot :mode :vega)
This should render like:
We can also embed Vega-Lite & Vega visualizations within hiccup documents:
(def viz
[:div
[:h1 "Look ye and behold"]
[:p "A couple of small charts"]
[:div {:style {:display "flex" :flex-direction "row"}}
[:vega-lite line-plot]
[:vega-lite stacked-bar]]
[:p "A wider, more expansive chart"]
[:vega contour-plot]
[:h2 "If ever, oh ever a viz there was, the vizard of oz is one because, because, because..."]
[:p "Because of the wonderful things it does"]])
(oz/view! viz)
Note that the Vega-Lite & Vega specs are described in the output vega as using the :vega and :vega-lite keys.
You should now see something like this:
Note that vega/vega-lite already have very powerful and impressive plot concatenation features which allow for coupling of interactivity between plots in a viz. However, combining things through hiccup like this is nice for expedience, and gives one the ability to combine such visualizations in the context of HTML documents.
Also note that while not illustrated above, you can specify multiple maps in these vectors, and they will be merged into one. So for example, you can do [:vega-lite stacked-bar {:width 100}]
to override the width.
If you like, you may also use the Reagent components found at oz.core
to render vega and/or vega-lite you construct client side.
[:div
[oz.core/vega { ... }]
[oz.core/vega-lite { ... }]]
At present, these components do not take a second argument. The merging of spec maps described above applies prior to application of this reagent component.
Eventually we'll be adding options for hooking into the signal dataflow graphs within these visualizations so that interactions in a Vega/Vega-Lite visualization can be used to inform other Reagent components in your app.
Please note that when using oz.core client side, the :data
entry in your vega spec map should not be nil
(for example you're loading data into a reagent atom which has not been populated yet). Instead prefer an empty sequence ()
to avoid hard to diagnose errors in the browser.
Oz now features a load function which accepts the following formats:

- edn, json, yaml: directly parsed into hiccup &/or Vega/Vega-Lite representations
- md: loads a markdown file, with a notation for specifying Vega/Vega-Lite in code blocks tagged with the vega, vega-lite or oz class

As an example of the markdown syntax:
# An example markdown file
```edn vega-lite
{:data {:url "data/cars.json"}
:mark "point"
:encoding {
:x {:field "Horsepower", :type "quantitative"}
:y {:field "Miles_per_Gallon", :type "quantitative"}
:color {:field "Origin", :type "nominal"}}}
```
The real magic here is in the code class specification edn vega-lite. It's possible to replace edn with json or yaml, and vega with vega-lite, as appropriate. Additionally, these classes can be hyphenated for compatibility with editors/parsers that have problems with multiple class specifications (e.g. edn-vega-lite).
Note that embedding all of your data into a vega/vega-lite spec directly as :values may be untenable for larger data sets. In these cases, the recommended solution is to post your data to a GitHub gist, or elsewhere online where you can refer to it using the :url syntax (e.g. {:data {:url "https://your.data.url/path"} ...}).
One final note: in lieu of vega or vega-lite you can specify hiccup in order to embed oz-style hiccup forms, which may or may not contain [:vega ...] or [:vega-lite ...] blocks. This allows you to embed nontrivial html in your markdown files as hiccup, when basic markdown just doesn't cut it, without having to resort to manually writing html.
We can also export static HTML files which use Vega-Embed to render interactive Vega/Vega-Lite visualizations, using the oz/export! function.
(oz/export! spec "test.html")
Oz now also features Jupyter support for both the Clojupyter and IClojure kernels. See the view! method in the namespaces oz.notebook.clojupyter and oz.notebook.iclojure for usage.
Take a look at the example clojupyter notebook.
If you have docker installed you can run the following to build and run a jupyter container with clojupyter installed.
docker run --rm -p 8888:8888 kxxoling/jupyter-clojure-docker
Note that if you get a permission related error, you may need to run this command like sudo docker run ....
Once you have a notebook up and running you can either import the example clojupyter notebook or manually add something like:
(require '[clojupyter.misc.helper :as helper])
(helper/add-dependencies '[metasoarous/oz "x.x.x"])
(require '[oz.notebook.clojupyter :as oz])
;; Create spec
;; then...
(oz/view! spec)
Based on my own tinkering and the reports of other users, the functionality of this integration is somewhat sensitive to version/environment details, so running from the docker image is the recommended way of getting things running for the moment.
If you have docker installed you can get an IClojure environment up and running using:
docker run -p 8888:8888 cgrand/iclojure
As with Clojupyter, note that if you get a permission related error, you may need to run this command like sudo docker run ....
Once you have that running, you can:
/cp {:deps {metasoarous/oz {:mvn/version "x.x.x"}}}
(require '[oz.notebook.iclojure :as oz])
;; Create spec
;; then...
(oz/view! spec)
Oz now features Figwheel-like hot code reloading for Clojure-based data science workflows. To start this functionality, you specify from the REPL a file you would like to watch for changes, like so:

```clojure
(oz/live-reload! "live-reload-test.clj")
```
As soon as you run this, the code in the file will be executed in its entirety. Thereafter, whenever you save changes to the file, all forms starting from the first form with material changes will be re-evaluated. Whitespace changes are ignored, and namespace changes only trigger a recompile if there were other code changes in flight, or if there was an error during the last execution. Oz also logs notifications as things run, so you know what is executing and how long long-running forms are taking.
Collectively all of these features give you the same magic of Figwheel's hot-code reloading experience, but geared towards the specific demands of a data scientist, or really anyone who needs to quickly hack together potentially long running jobs.
Here's a quick video of this in action: https://www.youtube.com/watch?v=yUTxm29fjT4
One important caveat: because the code evaluated with `live-reload!` runs in a separate thread, you can't include any code that tries to set root bindings of a dynamic var. Fortunately, setting root var bindings isn't something I've ever needed to do in my data science workflow (nor should you), but it's possible there are libraries out there that do this, so be aware that it might come up. This seems to be a fairly fundamental Clojure limitation, but I'd be interested to hear from the oracles whether there's any chance of it being supported in a future version of Clojure.
There's also a related function, `oz/live-view!`, which will similarly watch a file for changes, `oz/load!` it, then `oz/view!` it.
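For example, to watch, load, and view a file in one step, a minimal sketch (reusing the example file name from above) might look like:

```clojure
;; Watch the file, (re)load it on change, and view the result
(oz/live-view! "live-reload-test.clj")
```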
Looking to share your cool plots or hiccup documents with someone? We've got you covered via the `publish!` utility function.
This will post the plot content to a GitHub Gist, and use the gist uuid to create a vega-editor link which prints to the screen. When you visit the vega-editor link, it will load the gist in question and place the content in the editor. It renders the plot, and updates in real time as you tinker with the code, making it a wonderful yet simple tool for sharing and prototyping.
```
user=> (oz/publish! stacked-bar)
Gist url: https://gist.github.com/87a5621b0dbec648b2b54f68b3354c3a
Raw gist url: https://api.github.com/gists/87a5621b0dbec648b2b54f68b3354c3a
Vega editor url: https://vega.github.io/editor/#/gist/vega-lite/metasoarous/87a5621b0dbec648b2b54f68b3354c3a/e1d471b5a5619a1f6f94e38b2673feff15056146/vega-viz.json
```
Following the Vega editor url will take you here (click on the image to follow):
As mentioned above, we can also share our hiccup documents/dashboards. Since Vega Editor knows nothing about hiccup, we've created ozviz.io as a tool for loading these documents.
```
user=> (oz/publish! viz)
Gist url: https://gist.github.com/305fb42fa03e3be2a2c78597b240d30e
Raw gist url: https://api.github.com/gists/305fb42fa03e3be2a2c78597b240d30e
Ozviz url: http://ozviz.io/#/gist/305fb42fa03e3be2a2c78597b240d30e
```
Try it out: http://ozviz.io/#/gist/305fb42fa03e3be2a2c78597b240d30e
In order to use the `oz/publish!` function, you must provide authentication. The easiest way is to pass `:auth "username:password"` to `oz/publish!`. However, this can be problematic in that you don't want these credentials accidentally strewn throughout your code or `./.lein-repl-history`.
To address this issue, `oz/publish!` will by default try to read authorization parameters from a file at `~/.oz/github-creds.edn`. The contents should be a map of authorization arguments, as passed to the Tentacles API. While you can use `{:auth "username:password"}` in this file, as above, it's far better from a security standpoint to use an OAuth token, stored in `~/.oz/github-creds.edn` as `{:oauth-token "xxxxxxxxxxxxxx"}`. When you're finished, it's a good idea to run `chmod 600 ~/.oz/github-creds.edn` so that only your user can read the credential file.
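Put together, a token-based credentials file might look like this (the token value shown is a placeholder):

```edn
;; ~/.oz/github-creds.edn -- keep this readable only by you (chmod 600)
{:oauth-token "xxxxxxxxxxxxxx"}
```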
And that's it! Your calls to `(oz/publish! spec)` should now be authenticated.
Sadly, GitHub used to allow the posting of anonymous gists, without the requirement of authentication, which saved us from all this hassle. However, they've since deprecated this. If you like, you can submit a comment asking that GitHub consider enabling auto-expiring anonymous gists, which would avoid this setup.
If you've ever thought "man, I wish there was a static site generation toolkit which had live code reloading of whatever page you're currently editing, and it would be great if it was in Clojure and let me embed data visualizations and math formulas via LaTeX in Markdown & Hiccup documents", boy, are you in for a treat!
Oz now offers exactly that, in the form of the `oz/build!` function. A very simple site might be generated with:

```clojure
(build!
  [{:from "examples/static-site/src/"
    :to "examples/static-site/build/"}])
```
The input formats currently supported by `oz/build!` are:

- `md`: As described above, Markdown with embedded Vega-Lite or Vega visualizations, LaTeX, and hiccup
- `json`, `edn`: You can directly supply hiccup data for more control over layout and content
- `clj`: Will `live-reload!` Clojure files (as described above), and render the last form evaluated as hiccup

Oz should handle image and CSS files it comes across by simply copying them over. However, if you have any `json` or `edn` assets (datasets perhaps) which need to pass through unchanged, you can separate these into their own build specification, like so:
```clojure
(defn site-template
  [spec]
  [:div {:style {:max-width 900 :margin-left "auto" :margin-right "auto"}}
   spec])

(build!
  [{:from "examples/static-site/src/site/"
    :to "examples/static-site/build/"
    :template-fn site-template}
   ;; If you have static assets, like datasets or images, which simply need to be copied over
   {:from "examples/static-site/src/assets/"
    :to "examples/static-site/build/"
    :as-assets? true}])
```
This can be a good way to separate document code from other static assets.
Specifying multiple builds like this can be used to do other things as well. For example, if you wanted to render a particular set of pages using a different template function (say, so that your blog posts are styled differently than the main pages), you can do that easily:
```clojure
(defn blog-template
  [spec]
  (site-template
    (let [{:as spec-meta :keys [title published-at tags]} (meta spec)]
      [:div
       [:h1 {:style {:line-height 1.35}} title]
       [:p "Published on: " published-at]
       [:p "Tags: " (string/join ", " tags)]
       spec])))

(build!
  [{:from "examples/static-site/src/site/"
    :to "examples/static-site/build/"
    :template-fn site-template}
   {:from "examples/static-site/src/blog/"
    :to "examples/static-site/build/blog/"
    :template-fn blog-template}
   ;; If you have static assets, like datasets or images, which simply need to be copied over
   {:from "examples/static-site/src/assets/"
    :to "examples/static-site/build/"
    :as-assets? true}])
```
Note that the `blog-template` above uses metadata about the spec to inform how it renders. This metadata can be written into Markdown files using a YAML metadata header (see `examples/static-site/src/`):

```markdown
---
title: Oz static websites rock
tags: oz, dataviz
---

# Oz static websites!

Some markdown content...
```
The title in particular will wind its way into the `title` metadata tag of your output HTML document, and thus will be visible at the top of your browser window when you view the file. This is a pattern that Jekyll and some other blogging engines use, and `markdown-clj` now supports extracting this data.
Again, as you edit and save these files, the outputs automatically update for you, both as compiled HTML files and in the live-view window, which lets you see your changes as you make them. If you need to change a template or some other detail of the specs, you can simply rerun `build!` with the modified arguments, and the most recently edited page will update before your eyes. This makes for a lovely live-view editing experience from the comfort of your favorite editor.
When you're done, one of the easiest ways to deploy is with the excellent surge.sh toolkit, which makes static-site deployment a breeze. You can also use GitHub Pages, S3, or really whatever you prefer. The great thing about static sites is that they're easy and cheap to deploy and scale, so you have plenty of options at your disposal.
In general, it's pretty easy to translate specs between EDN (Clojure data) and JSON. However, there is one place where you can get a little tripped up if you don't know what to do, and that's in expressions (as used in calculate and filter transforms).
The expressions you see in the Vega docs typically look like `{"calculate": "datum.attr * 2", "as": "attr2"}` (as JSON). However, in Clojure we often use kebab-cased keywords for data map keys (e.g. `:cool-attr`). For these attributes, you obviously can't use `datum.cool-attr`, since this will be interpreted as `datum.cool - attr`, and will either error out or not produce the desired result. Instead, use `datum['cool-attr']` in your expressions when your keys are kebab-cased.
This may be easy to miss, since most of the docs assume that you're working with camel or snake cased keys. It is mentioned somewhere in there if you look, but tends to bite us Clojurists more frequently than practitioners of other languages, and so isn't particularly front and center. Once you know the trick though, you should be on your way.
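As a quick illustration, a calculate transform over a kebab-cased key might look like this in an EDN Vega-Lite spec (a minimal sketch; the data and field names here are hypothetical):

```clojure
;; Use bracket access for the kebab-cased key in the expression;
;; datum.cool-attr would parse as subtraction (datum.cool - attr).
(def spec
  {:data {:values [{:cool-attr 1} {:cool-attr 2} {:cool-attr 3}]}
   :transform [{:calculate "datum['cool-attr'] * 2" :as "doubled"}]
   :mark "bar"
   :encoding {:x {:field "cool-attr" :type "ordinal"}
              :y {:field "doubled" :type "quantitative"}}})
```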
Oz is now compiled (on the cljs side) with Shadow-CLJS, together with the Clojure CLI tooling. A typical workflow involves running `clj -M:shadow-cljs watch devcards app` (note: older versions of `clj` use `-A` instead of `-M`; consider updating). This will watch your cljs files for changes and immediately compile both the `app.js` and `devcards.js` targets (to `resources/oz/public/js/`).
In general, the best way to develop is to visit http://localhost:7125/devcards.html, which will pull up a live view of a set of example Reagent components defined at `src/cljs/oz/core_devcards.cljs`. This is the easiest way to tweak functionality and test new features, as editing `src/cljs/oz/core.cljs` will trigger updates to the devcards views.
If it's necessary or desirable to test the app (live-view, etc.) functionality in situ, you can also use the normal Clj REPL utilities to feed plots to the `app.js` target using `oz/view!`, etc. Note that if you do this, you will need to use whatever port is passed to `oz/view!` (by default, 10666), and not the one printed when you start `clj -M:shadow-cljs`.
See the documentation for your specific editing environment if you'd like your editor to connect to the Shadow-CLJS REPL. For `vim-fireplace`, the initial Clj connection should establish itself automatically when you attempt to evaluate your first form. From there, simply execute the vim command `:CljEval (shadow/repl :app)`, and you should be able to evaluate code in `*.cljs` files from vim. Code in `*.clj` files should continue to evaluate as before.
IMPORTANT NOTE: If you end up deploying a version of Oz to Clojars or elsewhere, make sure you stop your `clj -M:shadow-cljs watch` process before running `make release`. If you don't, Shadow will continue watching files and rebuild the js compilation targets with dev-time configuration (shadow, less minification, etc.) that shouldn't be in the final release build. If, however, you are simply making changes and pushing them up for me to release, please leave any compiled changes to the js targets out of your commits.
I'm frequently (pleasantly) shocked at how often, when I find I'm unable to do something in Vega or Vega-Lite that I think I should be able to, updating the Vega or Vega-Lite version fixes the problem. As a side note, I think this speaks volumes about the stellar job (pun intended) the IDL has been doing developing these tools. More to the point, if you find yourself unable to do something you expect to be able to do, it's not a bad idea to try updating the Vega/Vega-Lite versions in the `package.json` file and rebuilding Oz as described above.

Author: metasoarous
Source Code: https://github.com/metasoarous/oz
License: EPL-1.0 License