Royce Reinger


Marquez: Collect, Aggregate, and Visualize A Data Ecosystem's Metadata


Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtimes and the frequency of dataset access, centralizes dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.


Marquez is an LF AI & Data Foundation incubation project under active development, and we'd love your help!



Try it!

Open in Gitpod


Marquez provides a simple way to collect and view dataset, job, and run metadata using OpenLineage. The easiest way to get up and running is with Docker. From the base of the Marquez repository, run:

$ ./docker/

Tip: Use the --build flag to build images from source, and/or --seed to start Marquez with sample lineage metadata. For a more complete example using the sample metadata, please follow our quickstart guide.

Note: Port 5000 is now reserved on macOS. If running locally on macOS, you can run ./docker/ --api-port 9000 to configure the API to listen on port 9000 instead. Keep in mind that you will need to update the URLs below with the appropriate port number.


You can open http://localhost:3000 to begin exploring the Marquez Web UI. The UI enables you to discover dependencies between jobs and the datasets they produce and consume via the lineage graph, view run metadata of current and previous job runs, and much more!



The Marquez HTTP API listens on port 5000 for all calls and port 5001 for the admin interface. The admin interface exposes helpful endpoints like /healthcheck and /metrics. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:5001. To begin collecting lineage metadata as OpenLineage events, use the Lineage API or an OpenLineage integration.
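As a sketch of what sending an OpenLineage event looks like, the snippet below builds a minimal RunEvent and shows how it could be posted to the lineage endpoint using only the standard library. The namespace, job name, and producer URI are illustrative placeholders, and the endpoint path assumes the default local setup described above.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

MARQUEZ_URL = "http://localhost:5000"  # adjust if you remapped the API port


def make_run_event(event_type: str, namespace: str, job_name: str, run_id: str) -> dict:
    """Build a minimal OpenLineage RunEvent."""
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "producer": "https://example.com/my-scheduler",  # identifies the emitter
    }


def post_event(event: dict) -> int:
    """POST the event to Marquez's OpenLineage endpoint; returns the HTTP status."""
    req = request.Request(
        f"{MARQUEZ_URL}/api/v1/lineage",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status


run_id = str(uuid.uuid4())
start = make_run_event("START", "my-namespace", "my-job", run_id)
# post_event(start)  # uncomment with a running Marquez instance
```

A matching COMPLETE event with the same runId would close out the run.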

Note: By default, the HTTP API does not require any form of authentication or authorization.


To explore metadata via GraphQL, browse to http://localhost:5000/graphql-playground. The GraphQL endpoint is currently in beta and is located at http://localhost:5000/api/v1-beta/graphql.


We invite everyone to help us improve and keep documentation up to date. Documentation is maintained in this repository and can be found under docs/.

Note: To begin collecting metadata with Marquez, follow our quickstart guide. Below you will find the steps to get up and running from source.


Marquez uses a multi-project structure and contains the following modules:

  • api: core API used to collect metadata
  • web: web UI used to view metadata
  • clients: clients that implement the HTTP API
  • chart: helm chart

Note: The integrations module was removed in 0.21.0, so please use an OpenLineage integration to collect lineage events easily.


Note: To connect to your running PostgreSQL instance, you will need the standard psql tool.


To build the entire project run:

./gradlew build

The executable can be found under api/build/libs/.


To run Marquez, you will have to define marquez.yml. The configuration file is passed to the application and used to specify your database connection. The configuration file creation steps are outlined below.

Step 1: Create Database

When creating your database using createdb, we recommend calling it marquez:

$ createdb marquez

Step 2: Create marquez.yml

With your database created, you can now copy marquez.example.yml:

$ cp marquez.example.yml marquez.yml

You will then need to set the following environment variables (we recommend adding them to your .bashrc): POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD. The environment variables override the equivalent option in the configuration file.
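For reference, a marquez.yml for this setup might look like the sketch below. Marquez is a Dropwizard application, so the server section follows Dropwizard conventions; treat the exact field names as assumptions and check marquez.example.yml for the authoritative layout.

```yaml
server:
  applicationConnectors:
    - type: http
      port: 8080          # HTTP API
  adminConnectors:
    - type: http
      port: 8081          # admin interface

db:
  driverClass: org.postgresql.Driver
  url: jdbc:postgresql://localhost:5432/marquez
  user: ${POSTGRES_USER}        # overridden by the environment variables above
  password: ${POSTGRES_PASSWORD}
```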

By default, Marquez uses the following ports:

  • TCP port 8080 is available for the HTTP API server.
  • TCP port 8081 is available for the admin interface.

Note: All of the configuration settings in marquez.yml can be specified either in the configuration file or in an environment variable.

Running the HTTP API Server

$ ./gradlew :api:runShadow

Marquez listens on port 8080 for all API calls and port 8081 for the admin interface. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:8081. We encourage you to familiarize yourself with the data model and APIs of Marquez. To run the web UI, please follow the steps outlined here.

Note: By default, the HTTP API does not require any form of authentication or authorization.

Related Projects

  • OpenLineage: an open standard for metadata and lineage collection

Getting Involved


See the project's contributing guide for more details about how to contribute.

Reporting a Vulnerability

If you discover a vulnerability in the project, please open an issue and attach the "security" label.

Download Details:

Author: MarquezProject
Source Code: 
License: Apache-2.0 license

#machinelearning #metadata #data #discovery 

Royce Reinger


Intake: A General Interface for Loading Data


Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps you:

  • Load data from a variety of formats (see the current list of known plugins) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
  • Convert boilerplate data loading code into reusable Intake plugins
  • Describe data sets in catalog files for easy reuse and sharing between projects and with others.
  • Share catalog information (and data sets) over the network with the Intake server
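As an illustration of the catalog idea, a minimal catalog file might look like the following sketch; the source name, description, and CSV path are invented for the example.

```yaml
sources:
  daily_temps:
    driver: csv
    description: Daily temperature readings (example dataset)
    args:
      urlpath: '{{ CATALOG_DIR }}/data/daily_temps.csv'
```

A catalog like this can then be opened with intake.open_catalog("catalog.yml") and the source read into a dataframe with .daily_temps.read().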


Recommended method using conda:

conda install -c conda-forge intake

You can also install with pip, in which case you can choose how many of the optional dependencies to install; the simplest installation has the fewest requirements:

pip install intake

Optional extras can be added via the [server], [plot], and [dataframe] sections, or include everything with:

pip install intake[complete]

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.


  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g., conda env create -f scripts/ci/environment-py38.yml and then conda activate test_env
  • Install Intake with pip install -e .[complete]
  • Use pytest to run tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

Documentation is available at Read the Docs.

The status of Intake and related packages is available on the Status Dashboard.

Weekly news about this repo and other related projects can be found on the wiki.

Download Details:

Author: intake
Source Code: 
License: BSD-2-Clause license

#machinelearning #python #data 

Royce Reinger


CKAN: The Open Source Data Portal Software


CKAN is the world’s leading open-source data portal platform. CKAN makes it easy to publish, share and work with data. It's a data management system that provides a powerful platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. Read more at


See the CKAN Documentation for installation instructions.


If you need help with CKAN or want to ask a question, use either the ckan-dev mailing list, the CKAN chat on Gitter, or the CKAN tag on Stack Overflow (try searching the Stack Overflow and ckan-dev archives for an answer to your question first).

If you've found a bug in CKAN, open a new issue on CKAN's GitHub Issues (try searching first to see if there's already an issue for your bug).

If you find a potential security vulnerability, please email the maintainers rather than creating a public issue on GitHub.

Contributing to CKAN

For contributing to CKAN or its documentation, see CONTRIBUTING.

Mailing List

Subscribe to the ckan-dev mailing list to receive news about upcoming releases and future plans as well as questions and discussions about CKAN development, deployment, etc.

Community Chat

If you want to talk about CKAN development say hi to the CKAN developers and members of the CKAN community on the public CKAN chat on Gitter. Gitter is free and open-source; you can sign in with your GitHub, GitLab, or Twitter account.

The logs for the old #ckan IRC channel (2014 to 2018) can be found here:


If you've figured out how to do something with CKAN and want to document it for others, make a new page on the CKAN wiki and tell us about it on the ckan-dev mailing list or on Gitter.

Download Details:

Author: ckan
Source Code: 
License: View license

#machinelearning #python #api #data 

emily joe


Data Analytics Projects for Beginners

After learning about the fundamentals of data analytics, it’s time to apply your knowledge and skills to projects. This blog will discuss the types of data analytics projects beginners should include in their data analytics portfolio. 

#data-analysis #data-science #big-data #python 

Oral Brekke


Learn Benefits Of Using Azure To Store Data


There are several benefits of using Microsoft Azure to store data:


Scalability

Azure can easily scale up or down as needed to accommodate changing data storage needs.


Reliability

Azure provides built-in redundancy and disaster recovery options to ensure data is always available and protected.


Security

Azure provides robust security features and follows strict security compliance standards to keep data safe.


Cost Savings

Azure offers flexible pricing options and can save money compared to on-premises data storage solutions.


Integration

Azure integrates with a wide range of platforms and technologies, making it easy to integrate with existing systems.

Global Access

Azure has a global network of data centers, providing fast and reliable access to data from anywhere in the world.

Data Management

Azure provides a range of data management tools and services to help manage and analyze data, including big data and IoT.


Compliance

Azure meets strict compliance standards and offers certifications, such as HIPAA, SOC, and PCI, to help meet regulatory requirements.

Hybrid Capabilities

Azure provides hybrid capabilities, enabling organizations to easily move data between on-premises and cloud-based systems.

In conclusion, Azure provides a reliable, secure, and cost-effective solution for storing and managing data, with a range of tools and services to help organizations gain insights and make data-driven decisions.

Original article source at:

#azure #data 

Monty Boehm


Using a Cloud Data Platform

The global cloud computing market is a rapidly growing one, valued at over 405.65 billion USD in 2021. According to predictions from Fortune Business Insights, professionals in the cloud computing industry are expected to enjoy its impressive compound annual growth rate (CAGR) of 19.9% and increasing demand across many regions.

To beginner developers and those who are looking into digitally transforming their businesses, the concept of cloud data platforms might be a bit difficult to grasp. But worry not – in this article, we’ll be tackling all you need to know about the basics of cloud data platforms.

1. What is a cloud data platform?

A post on ‘What is a Data Platform?’ by MongoDB defines it as a set of technologies that can completely meet an organization’s end-to-end data needs. ‘End-to-end’ often refers to everything from the collection, storage, preparation, delivery, and safekeeping of data.

Cloud data platforms were essentially created as a solution to the problems that big data posed over 20 years ago.

Now, cloud data platforms and tools are the new norm, and have been developed to handle large data volumes at incredible speeds.

2. Why run your database on the cloud?

One of the major considerations for those who are looking into cloud platforms is how they differ from running databases on-premises.

An article by TechTarget explains that using IaaS or DBaaS is similar to running an on-premises database. Traditionally, organizations build data centers and servers with everything needed to manage data. Instead, major cloud platform providers can supply these services and tools, minimizing the work needed from developers. Time and resources can then be spent on the development process itself.

3. How do I build a cloud database?

First, make sure to look into database management systems that are compatible with your OS. Most of the top providers can be installed on Windows and macOS, but some are more suited to specific operating systems. There are also three different types of cloud databases to choose from: self-managed, autonomous, and automated cloud databases. We recommend doing your research to choose the best database category for the type of program or application you are creating.

Next, you will need to learn about the process of data normalization, which refers to the structured approach to organizing a database. This reduces the chances of data redundancy and ensures that the database is easy to navigate.

  1. Add a primary key to each database table: Each database row is identified by a key, which is used to build relationships within the database. A key can be any arrangement of unique characters or numbers.
  2. Create smaller tables: Split your database into smaller tables, each with its own primary key.
  3. Configure relationships: Now that you have separate tables holding different information, you can start building relationships between them. For instance, a customer table can serve as a parent table, and a pending-orders table can be its child table.
  4. Choose relationship types: Relationships can be one-to-one, one-to-many, or many-to-many. Although this is fairly self-explanatory, it might take a bit of trial and error to figure out what works best for the database you're building.
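The steps above can be sketched with Python's built-in sqlite3 module; the customers/orders tables and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce parent/child relationships

# Steps 1-2: smaller tables, each with its own primary key
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")

# Step 3: a child table referencing the parent via a foreign key
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        status      TEXT NOT NULL
    )""")

# Step 4: a one-to-many relationship -- one customer, many pending orders
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 'pending'), (11, 1, 'pending')])

rows = conn.execute("""
    SELECT c.name, COUNT(*) FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id""").fetchall()
print(rows)  # [('Ada', 2)]
```

Because orders only stores a customer_id rather than repeating customer details, the customer's name lives in exactly one place, which is the redundancy reduction normalization aims for.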

4. More advanced information to keep in mind

Managed cloud database services can handle aspects like the necessary hardware, automated backups, and storage capacity management.

Most providers also offer client libraries in popular languages such as JavaScript, Python, and Ruby, which makes them accessible even to beginners. Regardless of your experience as a developer, cloud database building is essentially the same at the core. Most of the work lies in understanding the structure of your cloud database and keeping your records and attributes organized.

5. Conclusion

In the next decade, we will continue to see a rise in the usage of cloud data platforms and their impact on industries like retail, finance, manufacturing, healthcare, and even for government use. It’s a great time to learn about this technology and its many uses.

Original article source at:

#cloud #data #platform 

Hunter Krajcik


Data Mining Functionalities

The use of data by companies to understand business patterns and predict future occurrences has been on the rise. With the availability of new technologies like machine learning, it has become easy for experts to analyse vast quantities of information to find patterns that will help establishments make better decisions. Data mining is a method that has proven very successful in discovering hidden insights in the available information, insights that were not attainable with earlier methods of data exploration. Through this article, we shall understand the process and the various data mining functionalities.

One can learn more about the use and functions of data mining through the Executive Development Programme In General Management offered by reputed institutions. You can learn more about this course on our website. Before we go on to see the various data mining functionalities, we must know the process and why companies are showing keen interest in it. 

What Is Data Mining?

Data mining is the process of analysing large volumes of data, available in the company's storage systems or outside them, to find patterns that help improve the business. The process uses powerful computers and algorithms to execute statistical analysis of data. It helps companies make sense of scattered data and find correlations within it, and lets firms answer questions that could not be answered earlier because of the time manual methods take. Understanding data mining functionalities is possible only once we understand the process clearly.

The data mining process starts with clearly defining the questions for which the organisation seeks answers. It is very important because the exercise can prove futile unless there is a clear focus on the business outcome. Once experts identify the problem, they start collecting relevant data from various sources. These are pooled in a central data lake or warehouse and prepared for analysis. Companies use various data mining functionalities to arrive at the solution they desire. For the success of this process, companies follow the below six steps. 

  1. Business Understanding – For the project’s success, there must be a clear understanding of the business situation, the current aim of the project and the criteria for success. 
  2. Data Understanding – Companies must identify the data needed for the project and collect them from all available sources. 
  3. Data Preparation – This is a very important step in preparing the data for analysis. The company must ensure that the data is in the correct format, making it ready to solve the issue. At this stage, quality issues like missing or duplicate data are addressed.
  4. Modelling – Data mining experts identify patterns in the data and apply them to the predictive models they have created. 
  5. Evaluation – The process is evaluated for its effectiveness. The data experts assess whether the model will deliver the desired business outcome. They fine-tune the algorithm at this stage to get the best results. 
  6. Deployment – Data analysts run the analysis and submit the results to the decision-makers for further action. 

Types Of Data Mining


Descriptive Data Mining

Descriptive data mining aims to transform raw data into information that can be used for analysis and preparing reports. In this type of data mining, the patterns and similarities in the available information are identified and separated. This method also helps to isolate interesting groupings in the analysed data. This method will help the companies find information about the data like count, average, etc. It also brings out the common features of the data. The experts can find out the general properties of the data present in the datasets.
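As a small illustration of descriptive summaries like count and average, the following sketch uses Python's standard library on an invented list of purchase amounts:

```python
from statistics import mean, median
from collections import Counter

# invented purchase amounts for illustration
purchases = [120.0, 80.0, 80.0, 200.0, 60.0, 80.0]

summary = {
    "count": len(purchases),                              # how many records
    "average": mean(purchases),                           # central tendency
    "median": median(purchases),                          # robust to outliers
    "most_common": Counter(purchases).most_common(1)[0][0],  # dominant value
}
print(summary)
```

These are exactly the kinds of general properties (count, average, common features) that descriptive mining surfaces from a dataset.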

Predictive Data Mining

Instead of just understanding the patterns, this type of data mining helps to make predictions using past information. It uses the target-prediction capabilities that are acquired through supervised learning. The subsets of data mining techniques that fall under this method are classification, time-series analysis and regression. Using this method, developers can understand the characteristics that are not explicitly mentioned. Companies can use this method to predict sales in the future using present and past data. 
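As a toy illustration of predictive mining via regression, the sketch below fits a least-squares line to invented monthly sales figures and extrapolates one month ahead:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

months = [1, 2, 3, 4, 5, 6]
sales = [100, 112, 119, 131, 140, 152]  # hypothetical past sales

a, b = fit_line(months, sales)
forecast = a + b * 7  # predict month 7 from the fitted trend
print(round(forecast, 1))
```

Real predictive pipelines use richer models (classification, time-series analysis), but the principle is the same: learn a relationship from past data, then apply it to unseen inputs.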

Advantages Of Using Data Mining

Companies spend a huge amount of money on this process. It is because they gain a lot of benefits by performing data mining. Before we see the various data mining functionalities, let us see how companies benefit from this process. In a competitive business world, companies need to make the right decisions. It is possible only if the company uses available data to discover insights and incorporate them into the decision-making process quickly. The knowledge of the past and present helps optimise the future. 

Data mining helps the company achieve the below objectives

  • Increased Revenues
  • Understanding customer preferences
  • New customer acquisition
  • Improving up-selling and cross-selling
  • Retaining customers and improving their loyalty
  • Increasing the ROI on marketing campaigns
  • Preventing fraud by early detection
  • Identifying credit risks
  • Assessing operational performance

By joining the Executive Development Programme In General Management, you can learn the benefits of using this process in detail. The details of this course are available on our website. Before understanding the data mining functionalities, let us look at examples of how this exercise helps various business areas. 

Examples Of Data Mining

  1. Marketing

One of the biggest examples of data mining can be seen in a company’s marketing activities. They use it for exploring large datasets to improve market segmentation. Many firms use it to understand the relationship between various parameters like the age, gender and preferences of a customer. It helps to create personalised marketing campaigns. One of the other uses of data mining is to predict the customers who are likely to unsubscribe from a service. 

  2. Retail

The placement of products in a supermarket is very important for improving sales. Retailers need to know how to entice customers to purchase products they may not have planned on buying, and for this it is crucial to place those items where they can attract customers. Data mining helps identify product associations and decide where items must be placed on the shelf and in the aisle. It is possible to use different data mining functionalities to know the offers that customers value the most. 

  3. Banking

Avoiding risks is very important for banks. Data mining helps banks to know financial transactions, card usage, purchase patterns and customer financial data. It helps them identify risky customers and reduce risks. Data mining also enables banks to learn more about customers’ online habits. It enables the banks to customise their marketing messages according to customers’ preferences and get better results. 

  4. Medicine

Medical specialists can make more accurate diagnostics using data mining. They can prescribe better treatment methods when they have all information about the patient, like medical records, treatment patterns and physical examination details. Data mining also helps in providing better healthcare by identifying risks and predicting illness in certain sections of the population. It helps the state make better use of available resources. One can also forecast the length of hospitalisation required using this process. 

  5. Media

Television and radio channels use various data mining functionalities to measure their audience on a real-time basis. They collect and analyse information from channel views, programming and broadcasts as they happen. Another big example of the use of data mining is the personalised recommendations the audience receives based on their preferences. The media industry is also able to give valuable information to advertisers about the likes and dislikes of the audience, which helps them target their potential customers more accurately. 

Most Important Data Mining Functionalities


Class Description 

This is one of the data mining functionalities that is used to associate data with a class or concept. One of the best examples is the release of the same model of mobile phone in different variants. This helps companies to satisfy the needs of different customer segments. Data characterisation is one of the methods used in the class description. This helps to connect data with certain sets of customers. The other method, called data discrimination, is used to compare the characteristics of two different classes of customers. 


Classification

Classification is one of the most important data mining functionalities; it uses models to predict trends in the available data. The spending patterns discovered from customers' internet or mobile banking data are one example of classification, helping businesses decide the risk of giving a customer a new loan or credit facility. This method uses "if-then" rules, decision trees, mathematical formulae, or neural networks to analyse a model. This functionality uses training data to create new instances and compare them with existing ones. 


Prediction

Finding missing data in a database is very important for the accuracy of the analysis. Prediction is one of the data mining functionalities that helps the analyst find missing numeric values; if a class label is missing, it is found using classification. Prediction is very important in business intelligence and is very popular. One method is to predict the missing or unavailable data using prediction analysis; the other is to use previously built class models to find the missing class label. 


Association Analysis

This is one of the data mining functionalities that enables the analyst to relate two or more attributes of the data. It provides a way to find the relationship between data items and the rules that keep them together, and it finds much use in retail sales. A classic example is the message "customers who bought this also bought…" that we usually see on online platforms. It relates two transactions of similar items and finds the probability of the same happening again, which helps companies improve their sales of various items. 
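The "customers who bought this also bought…" idea can be sketched as a confidence calculation over a handful of invented shopping baskets:

```python
# invented transaction data: each set is one customer's basket
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"bread", "butter"},
]

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    with_a = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_a if consequent in b]
    return len(with_both) / len(with_a)

# 3 of the 4 baskets containing bread also contain butter
print(confidence("bread", "butter"))  # 0.75
```

Association-rule algorithms such as Apriori essentially search for all antecedent/consequent pairs whose support and confidence exceed chosen thresholds.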

Cluster Analysis

This data mining functionality is similar to classification, but in this case the class label is unknown. It groups items together based on clustering algorithms: similar objects are placed in one cluster, and there are vast differences between one cluster and another. It is applied in fields like machine learning, image processing, pattern recognition, and bioinformatics. There are different types of clustering algorithms, such as K-Means, Gaussian Mixture, and Mean-Shift; each uses different factors to group objects in the data. 
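A minimal K-Means sketch on one-dimensional invented points (plain Python rather than a real library) shows how similar objects end up in the same cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny 1-D K-Means: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.7, 10.1, 10.3]
print(kmeans(points, 2))  # two centers, one near each group
```

The two tight groups in the data are recovered as two cluster centers, near 1.0 and 10.0, without any labels being supplied.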

Outlier Analysis

Outlier analysis is one of the data mining functionalities used to handle data that do not fall under any class. Data that have no similarity with the attributes of other classes or general modules are called outliers. Such occurrences are considered noise or exceptions, and their analysis is termed outlier mining. In many cases they are discarded as noise, but in some cases they can reveal associations, which is the reason for identifying them. They are identified using statistical tests that calculate probabilities. 
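One common statistical test flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold; the sketch below uses invented sensor readings:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# invented readings: five normal values and one anomaly
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
print(outliers(readings))  # [25.0]
```

Whether 25.0 is discarded as noise or investigated as a meaningful exception is exactly the judgment call outlier mining is concerned with.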

Correlation Analysis

This is another data mining functionality that experts use to calculate the strength of the association between two attributes. It is used to determine how well two numerically measured, continuous variables are related to each other. One of the most common examples of such attributes is height and weight: researchers often use these two variables to find whether there is any relationship between them. Correlation analysis can also be combined with association analysis over item sets or subsequences. 
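The height/weight example can be made concrete with a small Pearson correlation sketch over invented measurements:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

heights = [150, 160, 165, 170, 180]  # cm (invented)
weights = [50, 58, 63, 68, 80]       # kg (invented)
print(round(pearson(heights, weights), 3))
```

A result near +1 indicates a strong positive linear relationship, near -1 a strong negative one, and near 0 little linear association.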

An in-depth knowledge of the different data mining functionalities can be gained through the Executive Development Programme In General Management offered by reputed institutions. You can learn more about this programme on our website. 

Summing Up

The use of data mining has increased considerably in the recent past as companies try to gain useful knowledge from the raw data available to them. The development of data warehousing technology and the growth of big data have contributed greatly to this phenomenon. Data mining is interesting because you can get useful information without asking specific questions. It is a predictive process that uses algorithms and statistics to predict future trends. This technology is used heavily in retail and e-commerce companies to understand customer purchase patterns.

Original article source at:

#data #mining 

Royce Reinger


Stats: Golang Statistics Package


A well tested and comprehensive Golang statistics library / package / module with no dependencies.

If you have any suggestions, problems, or bug reports, please create an issue and I'll do my best to accommodate you. In addition, simply starring the repo would show your support for the project and be very much appreciated!


go get

Example Usage

All the functions can be seen in examples/main.go but here's a little taste:

// start with some source data to use
data := []float64{1.0, 2.1, 3.2, 4.823, 4.1, 5.8}

// you could also use different types like this
// data := stats.LoadRawData([]int{1, 2, 3, 4, 5})
// data := stats.LoadRawData([]interface{}{1.1, "2", 3})
// etc...

median, _ := stats.Median(data)
fmt.Println(median) // 3.65

roundedMedian, _ := stats.Round(median, 0)
fmt.Println(roundedMedian) // 4


The entire API documentation is available on or

You can also view docs offline with the following commands:

# Command line
godoc .              # show all exported apis
godoc . Median       # show a single function
godoc -ex . Round    # show function with example
godoc . Float64Data  # show the type and methods

# Local website
godoc -http=:4444    # start the godoc server on port 4444
open http://localhost:4444/pkg/

The exported API is as follows:

var (
    ErrEmptyInput = statsError{"Input must not be empty."}
    ErrNaN        = statsError{"Not a number."}
    ErrNegative   = statsError{"Must not contain negative values."}
    ErrZero       = statsError{"Must not contain zero values."}
    ErrBounds     = statsError{"Input is outside of range."}
    ErrSize       = statsError{"Must be the same length."}
    ErrInfValue   = statsError{"Value is infinite."}
    ErrYCoord     = statsError{"Y Value must be greater than zero."}
)

func Round(input float64, places int) (rounded float64, err error) {}

type Float64Data []float64

func LoadRawData(raw interface{}) (f Float64Data) {}

func AutoCorrelation(data Float64Data, lags int) (float64, error) {}
func ChebyshevDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func Correlation(data1, data2 Float64Data) (float64, error) {}
func Covariance(data1, data2 Float64Data) (float64, error) {}
func CovariancePopulation(data1, data2 Float64Data) (float64, error) {}
func CumulativeSum(input Float64Data) ([]float64, error) {}
func Entropy(input Float64Data) (float64, error) {}
func EuclideanDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func GeometricMean(input Float64Data) (float64, error) {}
func HarmonicMean(input Float64Data) (float64, error) {}
func InterQuartileRange(input Float64Data) (float64, error) {}
func ManhattanDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func Max(input Float64Data) (max float64, err error) {}
func Mean(input Float64Data) (float64, error) {}
func Median(input Float64Data) (median float64, err error) {}
func MedianAbsoluteDeviation(input Float64Data) (mad float64, err error) {}
func MedianAbsoluteDeviationPopulation(input Float64Data) (mad float64, err error) {}
func Midhinge(input Float64Data) (float64, error) {}
func Min(input Float64Data) (min float64, err error) {}
func MinkowskiDistance(dataPointX, dataPointY Float64Data, lambda float64) (distance float64, err error) {}
func Mode(input Float64Data) (mode []float64, err error) {}
func NormBoxMullerRvs(loc float64, scale float64, size int) []float64 {}
func NormCdf(x float64, loc float64, scale float64) float64 {}
func NormEntropy(loc float64, scale float64) float64 {}
func NormFit(data []float64) [2]float64 {}
func NormInterval(alpha float64, loc float64, scale float64) [2]float64 {}
func NormIsf(p float64, loc float64, scale float64) (x float64) {}
func NormLogCdf(x float64, loc float64, scale float64) float64 {}
func NormLogPdf(x float64, loc float64, scale float64) float64 {}
func NormLogSf(x float64, loc float64, scale float64) float64 {}
func NormMean(loc float64, scale float64) float64 {}
func NormMedian(loc float64, scale float64) float64 {}
func NormMoment(n int, loc float64, scale float64) float64 {}
func NormPdf(x float64, loc float64, scale float64) float64 {}
func NormPpf(p float64, loc float64, scale float64) (x float64) {}
func NormPpfRvs(loc float64, scale float64, size int) []float64 {}
func NormSf(x float64, loc float64, scale float64) float64 {}
func NormStats(loc float64, scale float64, moments string) []float64 {}
func NormStd(loc float64, scale float64) float64 {}
func NormVar(loc float64, scale float64) float64 {}
func Pearson(data1, data2 Float64Data) (float64, error) {}
func Percentile(input Float64Data, percent float64) (percentile float64, err error) {}
func PercentileNearestRank(input Float64Data, percent float64) (percentile float64, err error) {}
func PopulationVariance(input Float64Data) (pvar float64, err error) {}
func Sample(input Float64Data, takenum int, replacement bool) ([]float64, error) {}
func SampleVariance(input Float64Data) (svar float64, err error) {}
func Sigmoid(input Float64Data) ([]float64, error) {}
func SoftMax(input Float64Data) ([]float64, error) {}
func StableSample(input Float64Data, takenum int) ([]float64, error) {}
func StandardDeviation(input Float64Data) (sdev float64, err error) {}
func StandardDeviationPopulation(input Float64Data) (sdev float64, err error) {}
func StandardDeviationSample(input Float64Data) (sdev float64, err error) {}
func StdDevP(input Float64Data) (sdev float64, err error) {}
func StdDevS(input Float64Data) (sdev float64, err error) {}
func Sum(input Float64Data) (sum float64, err error) {}
func Trimean(input Float64Data) (float64, error) {}
func VarP(input Float64Data) (sdev float64, err error) {}
func VarS(input Float64Data) (sdev float64, err error) {}
func Variance(input Float64Data) (sdev float64, err error) {}
func ProbGeom(a int, b int, p float64) (prob float64, err error) {}
func ExpGeom(p float64) (exp float64, err error) {}
func VarGeom(p float64) (exp float64, err error) {}

type Coordinate struct {
    X, Y float64
}

type Series []Coordinate

func ExponentialRegression(s Series) (regressions Series, err error) {}
func LinearRegression(s Series) (regressions Series, err error) {}
func LogarithmicRegression(s Series) (regressions Series, err error) {}

type Outliers struct {
    Mild    Float64Data
    Extreme Float64Data
}

type Quartiles struct {
    Q1 float64
    Q2 float64
    Q3 float64
}

func Quartile(input Float64Data) (Quartiles, error) {}
func QuartileOutliers(input Float64Data) (Outliers, error) {}


Pull requests are always welcome, no matter how big or small. I've included a Makefile that has a lot of helper targets for common actions such as linting, testing, code coverage reporting, and more.

  1. Fork the repo and clone your fork
  2. Create a new branch (git checkout -b some-thing)
  3. Make the desired changes
  4. Ensure tests pass (go test -cover or make test)
  5. Run lint and fix problems (go vet . or make lint)
  6. Commit changes (git commit -am 'Did something')
  7. Push branch (git push origin some-thing)
  8. Submit pull request

To make things as seamless as possible please also consider the following steps:

  • Update examples/main.go with a simple example of the new feature
  • Update documentation section with any new exported API
  • Keep 100% code coverage (you can check with make coverage)
  • Squash commits into single units of work with git rebase -i new-feature


This is not required of contributors and is mostly here as a reminder to myself as the maintainer of this repo. To release a new version we should update the changelog and documentation.

First install the tools used to generate the markdown files and release:

go install
go install
brew tap git-chglog/git-chglog
brew install gnu-sed hub git-chglog

Then you can run these make directives:

# Generate
make docs

Then we can create a new git tag and a GitHub release:

make release TAG=v0.x.x

To authenticate hub for the release you will need to create a personal access token and use it as the password when it's requested.

Download Details:

Author: Montanaflynn
Source Code: 
License: MIT license

#machinelearning #go #data #statistics #math #analytics 

Stats: Golang Statistics Package
Bongani Ngema


How to Build Enterprise Data Lake with AWS Cloud

Data Lake

A Data Lake is a place to store enterprise data in one common place, where it can be accessed by data wranglers with analytical needs. A data lake is different from a normal database: a data lake can store current and historical data from different systems in its raw form for analysis, whereas a database stores the current, updated data for an application. The data organisations preserve can be in any shape or format (structured, unstructured, or semi-structured) and can be saved in any desired format, such as CSV, Apache Parquet, XML, or JSON. Since this data can have no limit on size, we need a mechanism in place to ingest it, usually by batch or stream processing. Potential users of this data also expect the data lake to be secure, with proper security and controls governing access, and to ensure data governance. This should be independent of the data access methods.

Data Lake Benefits

  • Accessibility of data by storing it in one common place, accessible to everyone based on privileges set by data custodians (who manage and own this data).
  • Store raw data at scale for a low cost.
  • Unlock data from different domains in just a few clicks.
  • Provide a leading industry experience to different data personas.
  • Ensure the value associated with each piece of data stored in the lake, to provide a valuable experience and a competitive edge.
  • Make it comprehensive, with search, filtering, and navigation capabilities, so it works like a search engine: a Google for your organisation.

Now, to make this data lake accessible to users, we need a web-based application. A data catalog can address this need, acting as a persistent metadata store that facilitates data exploration across different data stores.

Data Lake (ELT Tool) vs. Data Warehouse (ETL Tool)

Let’s try to understand how a data lake is different from a data warehouse. ETL (Extract, Transform, and Load) is what happens within a data warehouse, while ELT (Extract, Load, and Transform) happens within a data lake. A DWH (data warehouse) serves as an integration platform for data from different data sources: it creates structured data during ETL, which can be used for various analytical needs. A DL (data lake), on the other hand, can preserve data in structured, unstructured, or semi-structured format without a specific purpose or need; this data gains value over time through gradual transformation and other analytical processes. The schema of data in a lake is defined at the time of processing or reading, so the data remains highly configurable and agile as requirements change. Data lakes work well with real-time and big data needs. Hence, when a business has drastically changing data needs, it should build a data lake, whereas for slowly changing, structured data needs, one can go with building a data warehouse.

Data Lake for Big Data

In this age of big data, several millions of rows of data per second, in any format, can be stored and used with a data lake. Another addition to this is the Data Vault methodology and modelling: a governed data lake that addresses some of the limitations of a DWH. A vault provides durability and accelerates business value.

Deploying Data Lakes on Cloud

A data lake is considered an ideal workload to deploy in the cloud for scalability, reliability, availability, performance, and analytics purposes. Users also perceive the cloud as a benefit when deploying a data lake, for better security, faster deployment time, elasticity, a pay-as-you-use model, and more coverage across different geographies.

Build a Data Lake via AWS Cloud

Now let’s discuss the final part of this discussion – how can we build a data lake on cloud using different AWS services.

Data Collection: Collect and extract data from different sources, including flat files, APIs, any SQL or NoSQL database, or cloud storage like S3.

Data Load: Load this raw unprocessed data into AWS S3 bucket for storage. This bucket will act as a landing bucket.

Data Transformation: Then use ETL tool like AWS Glue for various data processing and transformations.

Data Governance: We can further enable security settings and access controls on this data, ensuring data governance on top of the transformed, processed data. A data catalog can be built for storing metadata and for further exploration across different data stores.

Data Curation: We can curate this processed data in another target S3 bucket or in AWS Redshift (as a DWH).

Data Notification & Monitoring: AWS SNS can be used for intermediate notifications and alerting mechanism for various jobs. AWS cloudwatch can be used for monitoring and logging.

Data Analytics: From the second S3 bucket or Redshift, where the transformed data was curated, we can query and analyse data for various business requirements via AWS Athena and QuickSight. Data scientists can also use this data for building and training various ML models.
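The flow above can be sketched with the AWS CLI. The bucket, job, crawler, and table names below are hypothetical placeholders, not part of the original architecture:

```shell
# Load: land the raw extract in the S3 landing bucket.
aws s3 cp ./export/orders.csv s3://my-landing-bucket/raw/orders.csv

# Transform: kick off a Glue ETL job that processes the landed data.
aws glue start-job-run --job-name my-etl-job

# Govern: a Glue crawler populates the Data Catalog with table metadata.
aws glue start-crawler --name my-curated-crawler

# Analyze: query the curated data in place with Athena.
aws athena start-query-execution \
  --query-string "SELECT product, SUM(quantity) FROM curated.orders GROUP BY product" \
  --result-configuration OutputLocation=s3://my-curated-bucket/athena-results/
```

Each command maps to one stage of the pipeline; in practice the Glue job and crawler would be scheduled or triggered by S3 events rather than run by hand.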

Original article source at:

#aws #cloud #data 

Rupert Beatty


A Block-based API for NSValueTransformer, with A Growing Collection


A block-based API for NSValueTransformer, with a growing collection of useful examples.

NSValueTransformer, while perhaps obscure to most iOS programmers, remains a staple of OS X development. Before Objective-C APIs got in the habit of flinging block parameters hither and thither with reckless abandon, NSValueTransformer was the go-to way to encapsulate mutation functionality --- especially when it came to Bindings.

NSValueTransformer is convenient to use but a pain to set up. To create a value transformer you have to create a subclass, implement a handful of required methods, and register a singleton instance by name.

TransformerKit breathes new life into NSValueTransformer by making value transformers dead-simple to define and register:

NSString * const TTTCapitalizedStringTransformerName = @"TTTCapitalizedStringTransformerName";

[NSValueTransformer registerValueTransformerWithName:TTTCapitalizedStringTransformerName
                               transformedValueClass:[NSString class]
                  returningTransformedValueWithBlock:^id(id value) {
  return [value capitalizedString];
}];
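Once registered, a transformer is retrieved and applied through the standard NSValueTransformer API (the result comment assumes -capitalizedString behavior on the sample string):

```objc
NSValueTransformer *transformer =
    [NSValueTransformer valueTransformerForName:TTTCapitalizedStringTransformerName];

// -transformedValue: runs the block registered above.
NSString *result = [transformer transformedValue:@"hello world"]; // @"Hello World"
```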

TransformerKit pairs nicely with InflectorKit and FormatterKit, providing well-designed APIs for manipulating user-facing content.

TransformerKit also contains a growing number of convenient transformers that your apps will love and cherish:

String Transformers

  • Capitalized
  • lowercase
  • CamelCase
  • llamaCase
  • snake_case
  • train-case
  • esreveR* (Reverse)
  • Rémövê Dîaçritics (Remove accents and combining marks)
  • ट्रांस्लितेराते स्ट्रिंग (Transliterate to Latin)
  • Any Valid ICU Transform*

Image Transformers

  • PNG Representation*
  • JPEG Representation*
  • GIF Representation (macOS)
  • TIFF Representation (macOS)

Date Transformers

JSON Data Transformers

  • JSON Transformer*

Data Transformers (macOS)

  • Base16 String Encode / Decode
  • Base32 String Encode / Decode
  • Base64 String Encode / Decode
  • Base85 String Encode / Decode

Cryptographic Transformers (macOS)

  • MD5, SHA-1, SHA-256, et al. Digests

* - Reversible


Mattt (@mattt)

Download Details:

Author: Mattt
Source Code: 
License: MIT license

#swift #objective-c #data #transform 

Rupert Beatty


Graph: A Semantic Database That Is Used to Create Data-driven Apps

Welcome to Graph

Graph is a semantic database that is used to create data-driven applications.


  •  iCloud Support
  •  Multi Local & Cloud Graphs
  •  Thread Safe
  •  Store Any Data Type, Including Binary Data
  •  Relationship Modeling
  •  Action Modeling For Analytics
  •  Model With Graph Theory and Set Theory
  •  Asynchronous / Synchronous Search
  •  Asynchronous / Synchronous Saving
  •  Data-Driven Architecture
  •  Data Model Observation
  •  Comprehensive Unit Test Coverage
  •  Example Projects


  • iOS 8.0+ / Mac OS X 10.10+
  • Xcode 8.0+


  • If you need help, use Stack Overflow. (Tag 'cosmicmind')
  • If you'd like to ask a general question, use Stack Overflow.
  • If you found a bug, and can provide steps to reliably reproduce it, open an issue.
  • If you have a feature request, open an issue.
  • If you want to contribute, submit a pull request.


Embedded frameworks require a minimum deployment target of iOS 8.


CocoaPods is a dependency manager for Cocoa projects. You can install it with the following command:

$ gem install cocoapods

To integrate Graph's core features into your Xcode project using CocoaPods, specify it in your Podfile:

source ''
platform :ios, '8.0'

pod 'Graph', '~> 3.1.0'

Then, run the following command:

$ pod install


Carthage is a decentralized dependency manager that builds your dependencies and provides you with binary frameworks.

You can install Carthage with Homebrew using the following command:

$ brew update
$ brew install carthage

To integrate Graph into your Xcode project using Carthage, specify it in your Cartfile:

github "CosmicMind/Graph"

Run carthage update to build the framework and drag the built Graph.framework into your Xcode project.


Graph is a growing project and will encounter changes throughout its development. It is recommended that the Changelog be reviewed prior to updating versions.


The following are samples to see how Graph may be used within your applications.

  • Visit the Samples repo to see example projects using Graph.

Creating an Entity for an ImageCard

An Entity is a model (data) object that represents a person, place, or thing. It may store property values, be a member of groups, and can be tagged.

In the following example, we create an ImageCard view using Material and populate its properties with an Entity that stores the data for that view.

Material ImageCard

Creating data

let graph = Graph()

let entity = Entity(type: "ImageCard")
entity["title"] = "Graph"
entity["detail"] = "Build Data-Driven Software"
entity["content"] = "Graph is a semantic database that is used to create data-driven applications."
entity["author"] = "CosmicMind"
entity["image"] = UIImage.load(contentsOfFile: "frontier", ofType: "jpg")
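The Entity above is only staged in memory; saving is a separate step. A minimal sketch, assuming Graph's synchronous save method (the feature list above advertises synchronous and asynchronous saving):

```swift
// Persist all staged changes, including the Entity above.
graph.sync()
```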


Setting the view's properties

imageCard.toolbar?.title = entity["title"] as? String
imageCard.toolbar?.detail = entity["detail"] as? String
imageCard.imageView?.image = entity["image"] as? UIImage

let contentLabel = UILabel()
contentLabel.text = entity["content"] as? String
imageCard.contentView = contentLabel

let authorLabel = UILabel()
authorLabel.text = entity["author"] as? String
imageCard.bottomBar?.centerViews = [authorLabel]

Searching a list of users in realtime

Using the Search API is incredibly flexible. In the following example, Search is used to create a live search on user names with a dynamic UI provided by Material's SearchBar.

Preparing the search criteria

let graph = Graph()

let search = Search<Entity>(graph: graph).for(types: "User").where(properties: "name")

Asynchronously searching graph

search.async { [weak self, pattern = pattern] (users) in

    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return
    }

    var data = [Entity]()

    for user in users {
        if let name = user["name"] as? String {
            let matches = regex.matches(in: name, range: NSRange(location: 0, length: name.utf16.count))

            if 0 < matches.count {
                data.append(user)
            }
        }
    }

    // Update the backing data source property (named `data` here) with the matches.
    self?.data = data
}

Download Details:

Author: CosmicMind
Source Code: 
License: MIT license

#swift #data #database #graph 

Gordon Murray


Create and Pass Data to Template in CakePHP 4

A template in CakePHP 4 is a .php file that defines the HTML layout of the page. Template files are loaded automatically, but for this to work they need to be created following a specific pattern.

From Controller, you can pass data to the template file.

In this tutorial, I show how you can create template files and pass values from the controller to the template in CakePHP 4 project.

1. Create Controller

Create a HomeController.php file in src/Controller/ folder.

Create HomeController Class that extends AppController.

In the class create 2 methods –

  • index() – Assigns string values to the $page and $content variables. To pass these values to the template, use $this->set(), which takes an array. The compact() function converts the passed variable names to an array in which each key is the same as the variable name.
  • aboutus() – Assigns a string value to the $page variable. Initializes $data['page'] with $page and passes $data to $this->set(). Reading the value in the template works the same as the way specified above.

CakePHP automatically maps the method names to request URLs.

This will create the following requests –

  • index – http://localhost:8765/home or http://localhost:8765/home/index
  • aboutus – http://localhost:8765/home/aboutus

Completed Code


<?php
namespace App\Controller;

class HomeController extends AppController
{
     public function index(){
          $page = "Homepage";
          $content = "Welcome to Makitweb";

          // Pass values to template
          $this->set(compact('page', 'content'));
     }

     public function aboutus(){
          $page = "About US";

          // Pass value to template
          $data['page'] = $page;
          $this->set($data);
     }
}

2. Create Template

Create a new Home folder in templates/ folder. Now in the Home folder create index.php and aboutus.php file.

Here, make sure the folder name is the same as the controller name – Home and the file name is the same as the method names created in the controller – index(), and aboutus().

CakePHP 4 template file structure


In index.php, just create <h1> and <p> tags. You can read the values passed from the controller using either <?php echo $page ?> or <?= $page ?>.

Completed Code

<h1><?= $page ?></h1>
<p><?= $content ?></p>


In aboutus.php, also create <h1> and <p> tags. I displayed a static value in the <p> tag and the passed value in the <h1> tag using <?= $page ?>.

Completed Code

<h1><?= $page ?></h1>
<p>About us page content</p>

4. Output

Home page (http://localhost:8765/home/index)

CakePHP 4 example index page

About us page (http://localhost:8765/home/aboutus)

CakePHP 4 example aboutus page

5. Conclusion

You don’t need to call template files explicitly from the controller, so the naming of the files and folders is important when creating a template.

If you found this tutorial helpful then don't forget to share.

Original article source at:

#cakephp #pass #data #php 

Gordon Murray


How to Load and Index Data in MarkLogic

With MarkLogic being a document-oriented database, data is commonly stored in a JSON or XML document format.

If the data to bring into MarkLogic is not already structured as JSON or XML, meaning, for instance, it is currently in a relational database, there are various ways to export or transform it from the source.

For example, many relational databases provide an option to export relational data in XML or JSON format, or a SQL script could be written to fetch the data from the database, outputting it in an XML or JSON structure. Or, using MarkLogic, rows from a .csv file can be imported as XML or JSON documents.

In any case, it is normal to first denormalize the data being exported from the relational database, to put the content back together in its original state. Denormalization, which occurs naturally when working with documents in their original form, greatly reduces the need for joins and accelerates performance.

Schema Agnostic

As we know, a schema is a set of rules for a particular structure of the database. When we talk about data quality, schemas are helpful, since quality matters a lot for the reliability of the database and the actions taken on it.

Schema-agnostic means the database is not bound by any schema, although it is aware of one when present. Schemas are optional in MarkLogic, and data is loaded in its original form. To address a group of documents within a database, directories, collections, and the internal structure of documents can be used. This is how MarkLogic easily supports data from disparate systems, all in the same database.

Required Document Size and Structure

When loading documents, it is best to have one document per entity. MarkLogic is most performant with many small documents, rather than one large document. The target document size is 1KB to 100KB, but documents can be larger.

For example, rather than loading a bunch of students all as one document, have each student be a document.

When defining a document, use meaningful XML element and attribute names or JSON property names. Make names human-readable rather than generic; this convention helps keep indexes efficient.



<product> Mouse </product>
<price> 1000 </price>
<quantity> 3 </quantity>

<product> Keyboard </product>
<price> 2000 </price>
<quantity> 2 </quantity>



Indexing Documents

As documents are loaded, all the words in each document, and the structure of each document, are indexed, so documents are easily searchable.

Documents can be loaded into MarkLogic in many ways:

  • MarkLogic Content Pump (MLCP)
  • Data Movement SDK
  • REST APIs
  • Java API or Node.js API
  • XQuery
  • JavaScript functions

Reading a Document

To read a document, the URI of the document is used.

XQuery Example : fn:doc("college/course-101.json")

JavaScript Example : fn.doc("account/order-202.json")

REST API Example : curl --anyauth --user admin:admin -X GET "http://localhost:8055/v1/documents?uri=/accounting/order-10072.json"

Splitting feature of MLCP

MLCP has the feature of splitting the long XML documents, where each occurrence of a designated element becomes an individual XML document in the database. This is useful when multiple records are all contained within one large XML file. Such as a list of students, courses, details, etc.

The -input_file_type aggregates option is used to split a large file into individual documents. The -aggregate_record_element option designates the element that begins each new document. The -uri_id option is used to create a URI for each document.
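As an illustration of the options above (the host, credentials, file path, and element names here are hypothetical placeholders), an MLCP import that splits an aggregate XML file of students might look like:

```shell
# Split students.xml so that each <student> element becomes its own document,
# using each record's <id> value to build the document URI.
mlcp.sh import -host localhost -port 8000 \
  -username admin -password admin \
  -input_file_path /data/students.xml \
  -input_file_type aggregates \
  -aggregate_record_element student \
  -uri_id id
```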

While it is fine to have a mix of XML and JSON documents in the same database, it is also possible to transform content from one format to the other. You can easily transform the files by following the steps below.

xquery version "1.0-ml";

import module namespace json = "" at "abc/json/json.xqy";

json:transform-to-json(fn:doc("doc-01.xml"), json:config("custom"))

The MarkLogic Content Pump can be used to import the rows from a .csv file into a MarkLogic database. We are able to transform the data during the import process or afterward in the database. Ways to modify content once it is already in the database include using the Data Movement SDK, XQuery, JavaScript, etc.


As we have seen, MarkLogic is a database that facilitates many things: loading data, indexing data, transforming data, and splitting data.


Original article source at:

#loading #data #index 

Gordon Murray


How to Filter Data with PHP

In this article, we’ll look at why it’s so important to filter anything that’s incorporated into our applications. In particular, we’ll look at how to validate and sanitize foreign data in PHP.

Never (ever!) trust foreign input in your application. That’s one of the most important lessons to learn for anyone developing a web application.

Foreign input can be anything — from $_GET and $_POST form input data, some elements on the HTTP request body, or even some values on the $_SERVER superglobal. Cookies, session values, and uploaded and downloaded document files are also considered foreign input.

Every time we process, output, include or concatenate foreign data into our code, there’s a potential vector for attackers to inject code into our application (the so-called injection attacks). Because of this, we need to make sure every piece of foreign data is properly filtered so it can be safely incorporated into the application.

When it comes to filtering, there are two main types: validation and sanitization.


Validation ensures that foreign input is what we expect it to be. For example, we might be expecting an email address, so we are expecting something in the user@domain.tld format. For that, we can use the FILTER_VALIDATE_EMAIL filter. Or, if we’re expecting a Boolean, we can use PHP’s FILTER_VALIDATE_BOOL filter.

Amongst the most useful filters are FILTER_VALIDATE_BOOL, FILTER_VALIDATE_INT, and FILTER_VALIDATE_FLOAT to filter for basic types and the FILTER_VALIDATE_EMAIL and FILTER_VALIDATE_DOMAIN to filter for emails and domain names respectively.

Another very important filter is the FILTER_VALIDATE_REGEXP that allows us to filter against a regular expression. With this filter, we can create our custom filters by changing the regular expression we’re filtering against.

All the available filters for validation in PHP can be found here.


Sanitization is the process of removing illegal or unsafe characters from foreign input.

The best example of this is when we sanitize database inputs before inserting them into a raw SQL query.

Again, some of the most useful sanitization filters include the ones for basic types, like FILTER_SANITIZE_STRING, FILTER_SANITIZE_SPECIAL_CHARS, and FILTER_SANITIZE_NUMBER_INT, but also FILTER_SANITIZE_URL and FILTER_SANITIZE_EMAIL to sanitize URLs and emails.

All PHP sanitization filters can be found here.

filter_var() and filter_input()

Now that we know PHP has an entire selection of filters available, we need to know how to use them.

Filter application is done via the filter_var() and filter_input() functions.

The filter_var() function applies a specified filter to a variable. It will take the value to filter, the filter to apply, and an optional array of options. For example, if we’re trying to validate an email address we can use this:


$email = "user@example.com"; // example address to check

if ( filter_var( $email, FILTER_VALIDATE_EMAIL ) ) {
    echo ("This email is valid");
}

If the goal was to sanitize a string, we could use this:

$string = "<h1>Hello World</h1>";

$sanitized_string = filter_var ( $string, FILTER_SANITIZE_STRING);
echo $sanitized_string;

The filter_input() function gets a foreign input from a form input and filters it.

It works just like the filter_var() function, but it takes a type of input (we can choose from GET, POST, COOKIE, SERVER, or ENV), the variable to filter, and the filter. Optionally, it can also take an array of options.

Once again, if we want to check if the external input variable “email” is being sent via GET to our application, we can use this:


if ( filter_input( INPUT_GET, "email", FILTER_VALIDATE_EMAIL ) ) {
    echo "The email is being sent and is valid.";
}
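The optional options array mentioned earlier lets some filters take constraints. A small sketch with FILTER_VALIDATE_INT (the variable names are illustrative):

```php
<?php
// Validate an age submitted as a string, accepting only
// integers between 0 and 120 via the "options" array.
$age = "25";

$valid = filter_var($age, FILTER_VALIDATE_INT, [
    "options" => ["min_range" => 0, "max_range" => 120],
]);

if ($valid === false) {
    echo "Invalid age";
} else {
    echo "Valid age: " . $valid; // prints "Valid age: 25"
}
```

On success filter_var() returns the filtered value (here the integer 25), and false on failure, which is why the strict comparison against false is used.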


And these are the basics of data filtering in PHP. Other techniques might be used to filter foreign data, like applying regex, but the techniques we’ve seen in this article are more than enough for most use cases.

Make sure you understand the difference between validation and sanitization and how to use the filter functions. With this knowledge, your PHP applications will be more reliable and secure!

Original article source at:

#php #data 


What is a Logistic Regression? - Simply explained

What is a Logistic Regression? How is it calculated? And most importantly, how are the logistic regression results interpreted? In a logistic regression, the dependent variable is a dichotomous variable. Dichotomous variables are variables with only two values. For example: Whether a person buys or does not buy a particular product.

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
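Concretely, while a linear regression predicts the dependent variable directly, a logistic regression passes the linear combination of the independent variables through the logistic function, so the prediction is always a probability between 0 and 1:

```latex
P(y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_k x_k)}}
```

Here b_0 is the intercept and b_1 through b_k are the coefficients of the independent variables x_1 through x_k; the odds ratio reported for each coefficient b is e^b.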

00:00 What is a Regression
00:45 Difference  between Linear Regression and Logistic Regression
01:24 Example Logistic Regression
02:23 Why do we need Logistic Regression?
03:31 Logistic Function and the Logistic Regression equation
05:01 How to interpret the results of a Logistic Regression?
07:58 Logistic Regression: Results Table
08:21 Logistic Regression: Classification Table
09:19 Logistic Regression: and Chi Square Test
10:22 Logistic Regression: Model Summary
11:24 Logistic Regression: Coefficient B, Standard error, p-Value and odds Ratio
13:56 ROC Curve (receiver operating characteristic curve)

📁 Load Example Dataset 
💻 Online Logistic Regression Calculator 
🎓Tutorial Logistic Regression 


#datascience #data-analysis  #machinelearning 

