Configurable Data Processing Engine with React Components

This project provides a collection of components for executing processing pipelines, oriented particularly toward data wrangling. Detailed documentation is provided in subfolders, with an overview of high-level goals and concepts here. Most of the documentation within individual packages is tailored to developers who need to understand how the code is organized and executed. Higher-level material about the project as a whole, such as how to construct workflows, is in the root docs folder.

Motivation

There are four primary goals of the project:

Create a shareable client/server schema for describing data processing steps. This is in the schema folder. TypeScript types and JSONSchema generation are in javascript/schema, and published schemas are copied out to schema along with test cases that are executed by both the JavaScript and Python builds to ensure parity. Stable released versions of DataShaper schemas are hosted on github.io for permanent reference (described below).
Maintain an implementation of a basic client-side wrangling engine (largely based on Arquero). This is in the javascript/workflow folder. This contains a reactive execution engine, along with individual verb implementations.
Maintain a Python implementation using common wrangling libraries (e.g., pandas) for backend or data science deployments. This is in the python folder. The execution engine is less complete than the JavaScript one, but it has complete verb implementations and test suite parity. A fuller-featured generalized pipeline execution engine is forthcoming.
Provide an application framework along with some reusable React components so wrangling operations can be incorporated into web applications easily. This is in the javascript/app-framework and javascript/react folders.

Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder.

We currently have seven primary JavaScript packages:

app-framework - this provides web application infrastructure for creating data-driven apps with minimal boilerplate.
react - this is a set of React components for each verb that you can include in web apps that enable transformation pipeline building.
schema - this is a set of core types and associated JSONSchema definitions for formalizing our data package and resource models (including the definitions for table parsing, Codebooks, and Workflows).
tables - this is the primary set of functions for loading and parsing data tables, using Arquero under the hood.
utilities - this is a set of helpers for working with files, etc., to ease building data wrangling applications.
webapp - this is the deployable DataShaper webapp that includes all of the verb components and allows creation, execution, and saving of pipeline JSON files. We also rely on this to demonstrate example code, including a TestApp profile. If you're wondering how to build an app with DataShaper components, start here!
workflow - this is the primary engine for pipeline execution. It includes low-level operational primitives to execute a wide variety of relational algebra transformations over Arquero tables.

Also note that each JavaScript package has a generated docs folder containing Markdown API documentation extracted from code comments using api-extractor.

The Python packages are much simpler, because there is no associated web application or component code.

engine - contains the core verb implementations.
workflow.py - this is the primary execution engine that loads and interprets pipelines, and iterates through the steps to produce outputs.
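
To make the role of workflow.py more concrete, here is a minimal sketch of an engine that loads a pipeline description and iterates through its steps. It is illustrative only: the step fields (verb, args), the two toy verbs, and the use of lists of dicts in place of pandas tables are assumptions made for this example and do not reflect the actual DataShaper schema or API.

```python
import json
from typing import Any, Callable

# A table is modeled as a list of row dicts to keep the sketch dependency-free.
Table = list[dict[str, Any]]


def filter_rows(table: Table, column: str, value: Any) -> Table:
    """Toy 'filter' verb: keep rows whose column equals value."""
    return [row for row in table if row[column] == value]


def select_columns(table: Table, columns: list[str]) -> Table:
    """Toy 'select' verb: keep only the named columns."""
    return [{name: row[name] for name in columns} for row in table]


# Hypothetical verb table mapping verb names to implementations.
VERBS: dict[str, Callable[..., Table]] = {
    "filter": filter_rows,
    "select": select_columns,
}


def run_pipeline(pipeline_json: str, table: Table) -> Table:
    """Load a pipeline description and apply each step in order."""
    pipeline = json.loads(pipeline_json)
    for step in pipeline["steps"]:
        verb = VERBS[step["verb"]]
        # The output of each step becomes the input of the next one.
        table = verb(table, **step.get("args", {}))
    return table


# Hypothetical pipeline description; the field names are assumptions.
PIPELINE = """
{
  "steps": [
    {"verb": "filter", "args": {"column": "city", "value": "Oslo"}},
    {"verb": "select", "args": {"columns": ["name"]}}
  ]
}
"""

people = [
    {"name": "Ada", "city": "Oslo"},
    {"name": "Bo", "city": "Lima"},
]
result = run_pipeline(PIPELINE, people)
```

Keeping the verbs in a lookup table means the loop itself never changes when a verb is added, which is one plausible reason to separate verb implementations from the execution engine.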

Schema management

We generate JSONSchema for formal project artifacts including resource definitions and workflow specifications. This allows validation by any consumer and/or implementor. Schema versions are published on github.io for permanent reference. Each variant of a schema is hosted in perpetuity with semantic versioning. Aliases to the most recent (unversioned latest) and major revisions are also published. Here are direct links to the latest versions of our primary schemas:
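
As a rough illustration of how "latest" and major-revision aliases relate to semantically versioned releases, the sketch below computes which published version each alias would point to. The alias names (latest, v1, v2) and the version strings are made up for the example; the actual layout of the hosted schemas is not described here.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions sort numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def build_aliases(versions: list[str]) -> dict[str, str]:
    """Map each alias to the newest published version it covers."""
    aliases: dict[str, str] = {}
    # Walk versions from oldest to newest so newer ones overwrite older ones.
    for version in sorted(versions, key=parse_version):
        major = parse_version(version)[0]
        aliases[f"v{major}"] = version
        aliases["latest"] = version
    return aliases


# Hypothetical set of published schema versions.
published = ["1.0.0", "1.2.0", "1.10.1", "2.0.0", "2.1.3"]
aliases = build_aliases(published)
```

Sorting on parsed integers rather than raw strings matters here: a plain string sort would place 1.10.1 before 1.2.0 and leave the v1 alias on the wrong release.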

Bundle (types) (published schema)
Codebook (types) (published schema)
Data Package (types) (published schema)
Data Table (types) (published schema)
Table Bundle (types) (published schema)
Workflow (types) (published schema)

Note that for the purposes of pipeline development, the workflow schema is primary. The rest are largely used for package management and table bundling in the web application.

Creating new verbs

For new verbs within the DataShaper toolkit, you must first determine whether JavaScript and Python parity is desired. For operations that should be configurable via a UX, a JavaScript implementation is necessary. However, if the verb is primarily useful for data science workflows and has potentially complicated parameters, a Python-only implementation may be fine. We prefer parity, because it reduces confusion and allows cross-platform execution of any pipeline created with the tool, but we also recognize the value of the Python-based execution engine for configuring data science and ETL workflows that will only ever be run server-side.

Core verbs

Core verbs are built into the toolkit, and should generally have JavaScript and Python parity. Creating these verbs involves the following steps:

Schema definition - this is done by authoring TypeScript types in the javascript/schema folder, which are then generated as JSONSchema during a build step.
Cross-platform tests - these are defined in schema/fixtures, primarily in the workflow folder. Each fixture includes a workflow.json and an expected output csv file. Executors run in both JavaScript and Python to confirm that outputs match the expected table.
JavaScript implementation - verbs are implemented in javascript/workflow/verbs
Python implementation - verbs are implemented in python/verbs
Verb UX - individual verb UX components are in javascript/react
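
The cross-platform test step above can be sketched as a fixture check: run the workflow against an input table and compare the result with the expected CSV. This is only an outline of the idea. The run_workflow function is a stand-in executor that understands a single hypothetical rename step, and the fixture contents are inlined as strings rather than read from schema/fixtures.

```python
import csv
import io
import json

Row = dict[str, str]


def read_csv(text: str) -> list[Row]:
    """Parse CSV text into a list of row dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def run_workflow(workflow: dict, table: list[Row]) -> list[Row]:
    """Stand-in executor that understands only a hypothetical 'rename' step."""
    for step in workflow["steps"]:
        if step["verb"] != "rename":
            raise ValueError(f"unsupported verb: {step['verb']}")
        mapping = step["args"]["columns"]
        # Rename mapped columns and pass every other column through unchanged.
        table = [{mapping.get(k, k): v for k, v in row.items()} for row in table]
    return table


def fixture_passes(workflow_json: str, input_csv: str, expected_csv: str) -> bool:
    """Run the workflow and compare its output to the expected table."""
    actual = run_workflow(json.loads(workflow_json), read_csv(input_csv))
    return actual == read_csv(expected_csv)


# Inlined stand-ins for a fixture's workflow.json, input table, and expected csv.
WORKFLOW_JSON = '{"steps": [{"verb": "rename", "args": {"columns": {"id": "key"}}}]}'
INPUT_CSV = "id,value\n1,a\n2,b\n"
EXPECTED_CSV = "key,value\n1,a\n2,b\n"

passed = fixture_passes(WORKFLOW_JSON, INPUT_CSV, EXPECTED_CSV)
```

Because the fixture is plain JSON plus CSV, the same files can be fed to an executor written in any language, which is what lets a single set of fixtures check both implementations.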

Custom verbs

The Python implementation supports the use of custom verbs supplied by your application - this allows arbitrary processing pipelines to be built that contain custom logic and processing steps.
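
The custom verb format is not documented here, so the following is only a generic sketch of how an application might register its own processing step by name and have an engine dispatch to it. The custom_verb decorator, the registry, and the add_total verb are all hypothetical and are not part of the DataShaper API.

```python
from typing import Any, Callable

# A table is modeled as a list of row dicts to keep the sketch dependency-free.
Table = list[dict[str, Any]]
VerbFn = Callable[..., Table]

# Hypothetical registry of application-supplied verbs, keyed by verb name.
_CUSTOM_VERBS: dict[str, VerbFn] = {}


def custom_verb(name: str) -> Callable[[VerbFn], VerbFn]:
    """Decorator that registers a function under a verb name."""
    def register(fn: VerbFn) -> VerbFn:
        if name in _CUSTOM_VERBS:
            raise ValueError(f"verb already registered: {name}")
        _CUSTOM_VERBS[name] = fn
        return fn
    return register


@custom_verb("add_total")
def add_total(table: Table, columns: list[str], to: str) -> Table:
    """Application-specific step: sum the given columns into a new column."""
    return [{**row, to: sum(row[c] for c in columns)} for row in table]


def run_step(table: Table, verb: str, **args: Any) -> Table:
    """Look up a registered verb by name and apply it to the table."""
    return _CUSTOM_VERBS[verb](table, **args)


orders = [{"price": 10, "tax": 2}, {"price": 5, "tax": 1}]
with_totals = run_step(orders, "add_total", columns=["price", "tax"], to="total")
```

Rejecting duplicate names at registration time is a design choice in this sketch; it keeps a custom verb from silently replacing another verb with the same name.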

TODO: document custom verb format

Build and test

JavaScript

You need node and yarn installed
Operate from project root
Run: yarn
Then: yarn build
Run the webapp locally: yarn start

Python

You need Python and poetry installed
Operate from python/datashaper folder
Run: poetry install
Then: poetry run poe test

Download Details:

Author: Microsoft

Official Github: https://github.com/microsoft/datashaper

License: MIT
