Royce Reinger

BigARTM: Fast Topic Modeling Platform

BigARTM

The state-of-the-art platform for topic modeling.


What is BigARTM?

BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models (ARTM). This technique builds multi-objective models by adding weighted sums of regularizers to the optimization criterion. BigARTM is known to combine very different objectives well, including sparsing, smoothing, topic decorrelation, and many others. Such a combination of regularizers significantly improves several quality measures at once with almost no loss of perplexity.
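
As a rough sketch of the idea (the notation below follows the standard ARTM papers, not anything specific to BigARTM's documentation), the model maximizes the collection log-likelihood plus a weighted sum of regularizers R_i over the topic matrices Phi and Theta:

```latex
% ARTM criterion (sketch): log-likelihood plus a weighted sum of regularizers
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i} \tau_i\, R_i(\Phi, \Theta) \;\longrightarrow\; \max_{\Phi,\,\Theta}
```

Each regularizer R_i encodes one objective (sparsing, smoothing, decorrelation, and so on), and its weight tau_i controls how strongly it influences the model.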

Related Software Packages

  • TopicNet is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.
  • David Blei's List of Open Source topic modeling software
  • MALLET: Java-based toolkit for language processing with topic modeling package
  • Gensim: Python topic modeling library
  • Vowpal Wabbit has an implementation of Online-LDA algorithm

Installation

Installing with pip (Linux only)

We have a PyPi release for Linux:

$ pip install bigartm

or

$ pip install bigartm10

Installing on Windows

We suggest using pre-built binaries.

It is also possible to compile the C++ code on Windows if you want the latest development version.

Installing on Linux / MacOS

Download a binary release or build from source using cmake:

$ mkdir build && cd build
$ cmake ..
$ make install

See here for detailed instructions.

How to Use

Command-line interface

Check out the documentation for the bigartm command-line utility.

Examples:

  • Basic model (20 topics, output to a CSV file, inferred in 10 passes)

```bash
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20
```

  • Basic model with fewer tokens (extreme values filtered out by token frequency)

```bash
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
```

  • Simple regularized model (increases sparsity up to 60-70%)

```bash
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
```

  • More advanced regularized model, with 10 sparse objective topics and 2 smooth background topics

```bash
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj" --regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background" --regularizer "0.25 SmoothTheta #background"
```


Interactive Python interface

BigARTM provides a full-featured and clear Python API (see the Installation guide at http://docs.bigartm.org/en/latest/installation/index.html to configure the Python API for your OS).

Example:

```python
import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
vocabulary = cv.get_feature_names_out()  # on scikit-learn < 1.0, use cv.get_feature_names()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          collection_name='kos',
                          target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()
```

Refer to the tutorials for details on how to start using BigARTM from Python; the user's guide covers more advanced features and use cases.
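
For reference, here is a rough sketch of what the more advanced artm.ARTM model mentioned in the comment above might look like with regularizers and scores attached; the specific tau values and the score/regularizer names chosen here are illustrative, not settings recommended by the project.

```python
import artm

# Assumes `bv` is the BatchVectorizer built above; taus and names below are illustrative.
model = artm.ARTM(num_topics=15, dictionary=bv.dictionary, cache_theta=True)

# Regularizers: negative tau sparses Phi/Theta, as in the CLI examples above
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.05))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='sparse_theta', tau=-0.05))

# Scores to track during fitting
model.scores.add(artm.PerplexityScore(name='perplexity', dictionary=bv.dictionary))
model.scores.add(artm.TopTokensScore(name='top_tokens', num_tokens=10))

model.fit_offline(batch_vectorizer=bv, num_collection_passes=20)

print(model.score_tracker['perplexity'].last_value)
print(model.score_tracker['top_tokens'].last_tokens)
```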

Low-level API

Contributing

Refer to the Developer's Guide and follow the Code Style.

To report a bug, use the issue tracker. To ask a question, use our mailing list. Feel free to open a pull request.



Download Details:

Author: Bigartm
Source Code: https://github.com/bigartm/bigartm 
License: View license

#machinelearning #python #bigdata 

Aiyana Miller

Creating PySpark Dataframe Scalar UDFs

In this video, you learn how to create PySpark dataframe user-defined functions (UDFs) to perform distributed transformations on each row. You will learn how to use Apache Arrow to get optimal performance and how to call these functions from Spark SQL and dataframes.
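
As a minimal sketch of the kind of scalar (Pandas/Arrow-backed) UDF the video describes; the column names, data, and view name here are illustrative, not taken from the video:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(100.0, 0.07), (250.0, 0.05)], ["price", "tax_rate"])

# Pandas UDF: executed on batches of rows as pandas Series, exchanged via Apache Arrow
@F.pandas_udf("double")
def with_tax(price: pd.Series, tax_rate: pd.Series) -> pd.Series:
    return price * (1.0 + tax_rate)

# Use it from the dataframe API ...
df.select(with_tax("price", "tax_rate").alias("total")).show()

# ... or register it and call it from Spark SQL
spark.udf.register("with_tax", with_tax)
df.createOrReplaceTempView("sales")
spark.sql("SELECT with_tax(price, tax_rate) AS total FROM sales").show()
```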

#bigdata #apache-spark 

Aiyana Miller

Using PySpark Dataframe Methods on Databricks

In this video, you learn how to use PySpark dataframe methods on Databricks to perform data analysis and engineering at scale. This is the core of using Python on Spark, and you need to learn both its power and its nuances.
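
A minimal sketch of the kind of dataframe method chaining the video covers; the data and column names are illustrative:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [("books", 12.50), ("books", 7.99), ("games", 59.99)],
    ["category", "amount"],
)

# Typical dataframe methods: filter, withColumn, groupBy/agg, orderBy
summary = (
    orders
    .filter(F.col("amount") > 5)
    .withColumn("amount_with_tax", F.col("amount") * 1.07)
    .groupBy("category")
    .agg(F.count("*").alias("n_orders"),
         F.round(F.sum("amount_with_tax"), 2).alias("total"))
    .orderBy(F.col("total").desc())
)
summary.show()
```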

#bigdata #apache-spark 

Aiyana Miller

How To Use PySpark Leveraging SQL

In this video, we use PySpark to load Spark dataframes from queries and perform data analysis at scale.  You'll learn why using SQL with Python is so important and how it jump-starts your productivity on Databricks.
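
A minimal sketch of loading a dataframe from a SQL query; the view and column names are illustrative:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Register some example data as a view so SQL can see it
spark.createDataFrame(
    [("2021-08-01", 120.0), ("2021-08-02", 75.5)], ["order_date", "amount"]
).createOrReplaceTempView("sales")

# spark.sql returns a regular dataframe, so SQL and Python compose freely
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS total
    FROM sales
    GROUP BY order_date
""")

daily.filter(F.col("total") > 100).show()
```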

#bigdata #apache-spark 

Aiyana Miller

Use PySpark to analyze Data with RDDs

In this video, we use PySpark to analyze data with Resilient Distributed Datasets (RDD).  RDDs are the foundation of Spark. You learn what RDDs are, what Lazy Evaluation is and why it matters, and how to use Transformations and Actions.  Everything is demonstrated using a Databricks notebook.
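
A minimal sketch of the RDD concepts mentioned above (the numbers are illustrative): transformations such as filter and map are lazy and only describe the computation, while an action such as count or collect triggers it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing runs yet, Spark just records the lineage
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger the actual distributed computation
print(squares_of_evens.count())    # 5
print(squares_of_evens.collect())  # [4, 16, 36, 64, 100]
```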

#apache-spark #bigdata 

Aiyana Miller

Why You Need to Know PySpark?

In this video, we introduce PySpark and explain important concepts you need to understand to be productive with Python on Spark. Specific questions include: Can Python do everything? Does it perform well? Why do you need to know PySpark? And more.

#bigdata #python #apache-spark 

Aiyana Miller

Open-source Spark SQL with SQL Zeppelin Notebook Catch Up

In this video, we catch up and review the open-source Spark SQL with the remaining Zeppelin notebooks. You'll learn what the new notebooks are and how to get them.  You'll also learn the specific differences between coding for open-source Spark vs. Databricks.

#bigdata #apache-spark 

Aiyana Miller

Saving Query Results to Tables on Spark

In this video, you learn how to query Spark tables with SQL and how to save the results permanently to tables, which is a great way to build a platform of data for future analysis and ML pipelines. This video demonstrates using open-source Apache Spark via a Zeppelin notebook.
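
A minimal sketch of saving query results to a table; the table names are illustrative and assume a Spark setup with a metastore (Databricks has one built in, open-source Spark needs Hive support):

```python
from pyspark.sql import SparkSession

# enableHiveSupport is only needed on open-source Spark with a Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.createDataFrame(
    [("books", 12.50), ("games", 59.99)], ["category", "amount"]
).createOrReplaceTempView("orders")

# Option 1: pure SQL, create-table-as-select
spark.sql("""
    CREATE TABLE IF NOT EXISTS category_totals AS
    SELECT category, SUM(amount) AS total
    FROM orders
    GROUP BY category
""")

# Option 2: save a dataframe as a managed table
spark.table("orders").write.mode("overwrite").saveAsTable("orders_copy")
```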

#bigdata #apache-spark 

Aiyana Miller

How To Use SQL Window Functions

In this video, you learn how to use Spark Structured Query Language (SQL) window functions. Spark SQL is the most performant way to do data engineering on Databricks, and window functions expand SQL functionality to include things like cumulative totals, ranking values, and aggregations alongside detail rows. They can save you a lot of work. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
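
A minimal sketch of the window functions described above, showing a running total and a per-group rank; the view and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("books", "2021-08-01", 12.50),
     ("books", "2021-08-02", 7.99),
     ("games", "2021-08-01", 59.99)],
    ["category", "order_date", "amount"],
).createOrReplaceTempView("orders")

# SUM ... OVER with ORDER BY gives a cumulative total; RANK ranks rows within each category
spark.sql("""
    SELECT category,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY category ORDER BY order_date) AS running_total,
           RANK()      OVER (PARTITION BY category ORDER BY amount DESC) AS amount_rank
    FROM orders
""").show()
```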

#apache-spark #bigdata 

Aiyana Miller

How to Use Spark Structured Query Language (SQL) Scalar

In this video, you learn how to use Spark Structured Query Language (SQL) scalar and aggregate functions. Spark SQL is the most performant way to do data engineering on Databricks and Spark, and you want to leverage SQL functions as much as possible rather than writing custom code. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
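
A minimal sketch contrasting scalar functions (applied per row) with aggregate functions (applied per group); the data is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("books", " Spark Guide ", 12.50),
     ("books", "SQL Primer", 7.99),
     ("games", "Chess", 59.99)],
    ["category", "title", "amount"],
).createOrReplaceTempView("orders")

# Scalar functions (upper, trim, round) run once per row;
# aggregate functions (count, avg, max) collapse a group of rows into one value.
spark.sql("""
    SELECT upper(category)       AS category,
           count(*)              AS n_items,
           round(avg(amount), 2) AS avg_amount,
           max(trim(title))      AS sample_title
    FROM orders
    GROUP BY category
""").show()
```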

#bigdata #sql 

Aiyana Miller

Set Operators in Spark Structured Query Language

In this video, you learn how to use set operators in Spark Structured Query Language (SQL), i.e. UNION, INTERSECT, and EXCEPT. Spark SQL is the most performant way to do data engineering on Databricks and Spark. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
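
A minimal sketch of the three set operators on two illustrative single-column views:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,), (2,), (3,)], ["id"]).createOrReplaceTempView("a")
spark.createDataFrame([(2,), (3,), (4,)], ["id"]).createOrReplaceTempView("b")

# UNION combines rows from both sides (deduplicated unless UNION ALL is used)
spark.sql("SELECT id FROM a UNION SELECT id FROM b").show()      # 1, 2, 3, 4

# INTERSECT keeps only rows present in both
spark.sql("SELECT id FROM a INTERSECT SELECT id FROM b").show()  # 2, 3

# EXCEPT keeps rows from the left side that are not in the right
spark.sql("SELECT id FROM a EXCEPT SELECT id FROM b").show()     # 1
```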

#bigdata #apache-spark 

 

Aiyana Miller

How To Query Perform Joins using Spark SQL

In this video, you learn how to perform joins using Spark Structured Query Language (SQL). Spark SQL is the most performant way to do data engineering on Databricks and Spark. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
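
A minimal sketch of a join between two illustrative views:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [(1, "Ada"), (2, "Grace"), (3, "Alan")], ["customer_id", "name"]
).createOrReplaceTempView("customers")
spark.createDataFrame(
    [(1, 12.50), (1, 7.99), (3, 59.99)], ["customer_id", "amount"]
).createOrReplaceTempView("orders")

# INNER JOIN keeps only matching rows; LEFT JOIN keeps every customer,
# filling NULLs where there is no matching order.
spark.sql("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o
      ON c.customer_id = o.customer_id
    ORDER BY c.name
""").show()
```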

#bigdata #apache-spark 

Aiyana Miller

How to Query Save SQL Queries

In this video, you learn how to save SQL queries as views so they can be re-used in your data analysis and pipelines.
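
A minimal sketch of saving a query as a view so it can be reused; the names are illustrative. A temporary view lives only for the session, while CREATE VIEW on a metastore-backed setup persists it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("books", 12.50), ("books", 7.99), ("games", 59.99)], ["category", "amount"]
).createOrReplaceTempView("orders")

# Save a query as a (temporary) view ...
spark.sql("""
    CREATE OR REPLACE TEMP VIEW category_totals AS
    SELECT category, SUM(amount) AS total
    FROM orders
    GROUP BY category
""")

# ... and reuse it like any other table in later queries
spark.sql("SELECT * FROM category_totals WHERE total > 20").show()
```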

#apache-spark #bigdata 

Aiyana Miller

How to Save The Results Permanently To Tables On

In this video, you learn how to query Spark tables with SQL and how to save the results permanently to tables, which is a great way to build a platform of data for future analysis and ML pipelines.


#apache-spark  #bigdata 

Aiyana Miller

Create and Load The Project CSV Files into SQL Tables

In this video, you will learn how to create and load the project CSV files into SQL tables on open-source Apache Spark using Zeppelin Notebook.  The prior video, Lesson 9, showed you how to create the tables using Databricks.  The files are available with the notebook and slides at the link below.  This video lays the foundation for the ones that follow so make sure you watch it and create your own database.  
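
A minimal sketch of loading a CSV file into a SQL table; the file path, read options, and table name are illustrative (the actual project files come with the notebook linked in the video):

```python
from pyspark.sql import SparkSession

# enableHiveSupport is needed on open-source Spark so saveAsTable persists to the metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the CSV with a header row, letting Spark infer column types
df = spark.read.csv("/data/project/products.csv", header=True, inferSchema=True)

# Persist it as a SQL table so later notebooks can simply query it
df.write.mode("overwrite").saveAsTable("products")

spark.sql("SELECT * FROM products LIMIT 10").show()
```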

#sql #apache-spark #bigdata 

 
