Royce Reinger


Marquez: Collect, Aggregate, and Visualize A Data Ecosystem's Metadata


Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtimes and the frequency of dataset access, centralizes dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.


Marquez is an LF AI & Data Foundation incubation project under active development, and we'd love your help!



Try it!

Open in Gitpod


Marquez provides a simple way to collect and view dataset, job, and run metadata using OpenLineage. The easiest way to get up and running is with Docker. From the base of the Marquez repository, run:

$ ./docker/

Tip: Use the --build flag to build images from source, and/or --seed to start Marquez with sample lineage metadata. For a more complete example using the sample metadata, please follow our quickstart guide.

Note: Port 5000 is now reserved on macOS. If running locally on macOS, you can run ./docker/ --api-port 9000 to configure the API to listen on port 9000 instead. Keep in mind that you will need to update the URLs below with the appropriate port number.


You can open http://localhost:3000 to begin exploring the Marquez Web UI. The UI enables you to discover dependencies between jobs and the datasets they produce and consume via the lineage graph, view run metadata of current and previous job runs, and much more!



The Marquez HTTP API listens on port 5000 for all calls and port 5001 for the admin interface. The admin interface exposes helpful endpoints like /healthcheck and /metrics. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:5001. To begin collecting lineage metadata as OpenLineage events, use the Lineage API or an OpenLineage integration.
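As a sketch of what sending an OpenLineage event looks like, the snippet below builds a minimal RunEvent and shows how it could be posted to the lineage endpoint using only the standard library. The namespace, job name, and producer URI are illustrative placeholders, and the endpoint path assumes the default local setup described above.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

MARQUEZ_URL = "http://localhost:5000"  # adjust if you remapped the API port


def make_run_event(event_type: str, namespace: str, job_name: str, run_id: str) -> dict:
    """Build a minimal OpenLineage RunEvent."""
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "producer": "https://example.com/my-scheduler",  # identifies the emitter
    }


def post_event(event: dict) -> int:
    """POST the event to Marquez's OpenLineage endpoint; returns the HTTP status."""
    req = request.Request(
        f"{MARQUEZ_URL}/api/v1/lineage",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status


run_id = str(uuid.uuid4())
start = make_run_event("START", "my-namespace", "my-job", run_id)
# post_event(start)  # uncomment with a running Marquez instance
```

A matching COMPLETE event with the same runId would close out the run.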

Note: By default, the HTTP API does not require any form of authentication or authorization.


To explore metadata via GraphQL, browse to http://localhost:5000/graphql-playground. The GraphQL endpoint is currently in beta and is located at http://localhost:5000/api/v1-beta/graphql.


We invite everyone to help us improve and keep documentation up to date. Documentation is maintained in this repository and can be found under docs/.

Note: To begin collecting metadata with Marquez, follow our quickstart guide. Below you will find the steps to get up and running from source.


Marquez uses a multi-project structure and contains the following modules:

  • api: core API used to collect metadata
  • web: web UI used to view metadata
  • clients: clients that implement the HTTP API
  • chart: helm chart

Note: The integrations module was removed in 0.21.0, so please use an OpenLineage integration to collect lineage events easily.


Note: To connect to your running PostgreSQL instance, you will need the standard psql tool.


To build the entire project run:

./gradlew build

The executable can be found under api/build/libs/.


To run Marquez, you will have to define marquez.yml. The configuration file is passed to the application and used to specify your database connection. The configuration file creation steps are outlined below.

Step 1: Create Database

When creating your database using createdb, we recommend calling it marquez:

$ createdb marquez

Step 2: Create marquez.yml

With your database created, you can now copy marquez.example.yml:

$ cp marquez.example.yml marquez.yml

You will then need to set the following environment variables (we recommend adding them to your .bashrc): POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD. The environment variables override the equivalent option in the configuration file.
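For reference, a marquez.yml for this setup might look like the sketch below. Marquez is a Dropwizard application, so the server section follows Dropwizard conventions; treat the exact field names as assumptions and check marquez.example.yml for the authoritative layout.

```yaml
server:
  applicationConnectors:
    - type: http
      port: 8080          # HTTP API
  adminConnectors:
    - type: http
      port: 8081          # admin interface

db:
  driverClass: org.postgresql.Driver
  url: jdbc:postgresql://localhost:5432/marquez
  user: ${POSTGRES_USER}        # overridden by the environment variables above
  password: ${POSTGRES_PASSWORD}
```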

By default, Marquez uses the following ports:

  • TCP port 8080 is available for the HTTP API server.
  • TCP port 8081 is available for the admin interface.

Note: All of the configuration settings in marquez.yml can be specified either in the configuration file or in an environment variable.

Running the HTTP API Server

$ ./gradlew :api:runShadow

Marquez listens on port 8080 for all API calls and port 8081 for the admin interface. To verify the HTTP API server is running and listening on localhost, browse to http://localhost:8081. We encourage you to familiarize yourself with the data model and APIs of Marquez. To run the web UI, please follow the steps outlined here.

Note: By default, the HTTP API does not require any form of authentication or authorization.

Related Projects

  • OpenLineage: an open standard for metadata and lineage collection

Getting Involved


See the project's contributing guide for more details about how to contribute.

Reporting a Vulnerability

If you discover a vulnerability in the project, please open an issue and attach the "security" label.

Download Details:

Author: MarquezProject
Source Code: 
License: Apache-2.0 license

#machinelearning #metadata #data #discovery 

Royce Reinger


Intake: A General Interface for Loading Data


Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps you:

  • Load data from a variety of formats (see the current list of known plugins) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
  • Convert boilerplate data loading code into reusable Intake plugins
  • Describe data sets in catalog files for easy reuse and sharing between projects and with others.
  • Share catalog information (and data sets) over the network with the Intake server
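As an illustration of the catalog idea, a minimal catalog file might look like the following sketch; the source name, description, and CSV path are invented for the example.

```yaml
sources:
  daily_temps:
    driver: csv
    description: Daily temperature readings (example dataset)
    args:
      urlpath: '{{ CATALOG_DIR }}/data/daily_temps.csv'
```

A catalog like this can then be opened with intake.open_catalog("catalog.yml") and the source read into a dataframe with .daily_temps.read().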


Recommended method using conda:

conda install -c conda-forge intake

You can also install with pip, in which case you can choose how many of the optional dependencies to install; the simplest installation has the fewest requirements:

pip install intake

Optional extras can be added via the [server], [plot], and [dataframe] sections, or include everything with:

pip install intake[complete]

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.


  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g., conda env create -f scripts/ci/environment-py38.yml and then conda activate test_env
  • Install Intake with pip install -e .[complete]
  • Use pytest to run tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

Documentation is available at Read the Docs.

The status of Intake and related packages is available on the Status Dashboard.

Weekly news about this repo and other related projects can be found on the wiki.

Download Details:

Author: intake
Source Code: 
License: BSD-2-Clause license

#machinelearning #python #data 

Royce Reinger


CKAN: The Open Source Data Portal Software


CKAN is the world’s leading open-source data portal platform. CKAN makes it easy to publish, share and work with data. It's a data management system that provides a powerful platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. Read more at


See the CKAN Documentation for installation instructions.


If you need help with CKAN or want to ask a question, use either the ckan-dev mailing list, the CKAN chat on Gitter, or the CKAN tag on Stack Overflow (try searching the Stack Overflow and ckan-dev archives for an answer to your question first).

If you've found a bug in CKAN, open a new issue on CKAN's GitHub Issues (try searching first to see if there's already an issue for your bug).

If you find a potential security vulnerability, please email the maintainers rather than creating a public issue on GitHub.

Contributing to CKAN

For contributing to CKAN or its documentation, see CONTRIBUTING.

Mailing List

Subscribe to the ckan-dev mailing list to receive news about upcoming releases and future plans as well as questions and discussions about CKAN development, deployment, etc.

Community Chat

If you want to talk about CKAN development say hi to the CKAN developers and members of the CKAN community on the public CKAN chat on Gitter. Gitter is free and open-source; you can sign in with your GitHub, GitLab, or Twitter account.

The logs for the old #ckan IRC channel (2014 to 2018) can be found here:


If you've figured out how to do something with CKAN and want to document it for others, make a new page on the CKAN wiki and tell us about it on the ckan-dev mailing list or on Gitter.

Download Details:

Author: ckan
Source Code: 
License: View license

#machinelearning #python #api #data 

emily joe


Data Analytics Projects for Beginners

After learning about the fundamentals of data analytics, it’s time to apply your knowledge and skills to projects. This blog will discuss the types of data analytics projects beginners should include in their data analytics portfolio. 

#data-analysis #data-science #big-data #python 

Oral Brekke


Learn Benefits Of Using Azure To Store Data


There are several benefits of using Microsoft Azure to store data:


Scalability

Azure can easily scale up or down as needed to accommodate changing data storage needs.


Reliability

Azure provides built-in redundancy and disaster recovery options to ensure data is always available and protected.


Security

Azure provides robust security features and follows strict security compliance standards to keep data safe.


Cost Savings

Azure offers flexible pricing options and can save money compared to on-premises data storage solutions.


Integration

Azure integrates with a wide range of platforms and technologies, making it easy to integrate with existing systems.

Global Access

Azure has a global network of data centers, providing fast and reliable access to data from anywhere in the world.

Data Management

Azure provides a range of data management tools and services to help manage and analyze data, including big data and IoT.


Compliance

Azure meets strict compliance standards and offers certifications, such as HIPAA, SOC, and PCI, to help meet regulatory requirements.

Hybrid Capabilities

Azure provides hybrid capabilities, enabling organizations to easily move data between on-premises and cloud-based systems.

In conclusion, Azure provides a reliable, secure, and cost-effective solution for storing and managing data, with a range of tools and services to help organizations gain insights and make data-driven decisions.

Original article source at:

#azure #data 

Monty Boehm


Using a Cloud Data Platform

The global cloud computing market is a rapidly growing one, valued at over 405.65 billion USD in 2021. According to predictions from Fortune Business Insights, professionals in the cloud computing industry are expected to enjoy its impressive compound annual growth rate (CAGR) of 19.9% and increasing demand across many regions.

To beginner developers and those who are looking into digitally transforming their businesses, the concept of cloud data platforms might be a bit difficult to grasp. But worry not – in this article, we’ll be tackling all you need to know about the basics of cloud data platforms.

1. What is a cloud data platform?

A post on ‘What is a Data Platform?’ by MongoDB defines it as a set of technologies that can completely meet an organization’s end-to-end data needs. ‘End-to-end’ often refers to everything from the collection, storage, preparation, delivery, and safekeeping of data.

Cloud data platforms were essentially created as a solution to the problems that big data posed over 20 years ago.

Now, cloud data platforms and tools are the new norm, and have been developed to handle large data volumes at incredible speeds.

2. Why run your database on the cloud?

One of the major considerations for those who are looking into cloud platforms is how they differ from running databases on-premises.

An article by TechTarget explains that using IaaS or DBaaS is similar to running an on-premises database. Traditionally, organizations build data centers and servers with everything needed to manage data. Instead, major cloud platform providers can supply these services and tools, minimizing the work needed from developers. Time and resources can then be spent on the development process itself.

3. How do I build a cloud database?

First, make sure to look into database management systems that are compatible with your OS. Most of the top providers can be installed on Windows and macOS, but some are more suited to specific operating systems. There are also three different types of cloud databases to choose from: self-managed, autonomous, and automated cloud databases. We recommend doing your research to choose the best database category for the type of program or application you are creating.

Next, you will need to learn about the process of data normalization, which refers to the structured approach to organizing a database. This reduces the chances of data redundancy and ensures that the database is easy to navigate.

  1. Add a primary key to each database table: Each database row is identified by a key, which is used to build relationships within the database. A key can be any arrangement of unique characters or numbers.
  2. Create smaller tables: Split your database into smaller tables, each with its own primary key.
  3. Configure relationships: Now that you have separate tables holding different information, you can start building relationships between them. For instance, a customer table can serve as a parent table, and a pending-orders table can be its child table.
  4. Choose relationship types: Relationships can be one-to-one, one-to-many, or many-to-many. Although this is fairly self-explanatory, it might take a bit of trial and error to figure out what works best for the database you're building.
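The steps above can be sketched with Python's built-in sqlite3 module; the customers/orders tables and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce parent/child relationships

# Steps 1-2: smaller tables, each with its own primary key
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")

# Step 3: a child table referencing the parent via a foreign key
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        status      TEXT NOT NULL
    )""")

# Step 4: a one-to-many relationship -- one customer, many pending orders
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 'pending'), (11, 1, 'pending')])

rows = conn.execute("""
    SELECT c.name, COUNT(*) FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id""").fetchall()
print(rows)  # [('Ada', 2)]
```

Because orders only stores a customer_id rather than repeating customer details, the customer's name lives in exactly one place, which is the redundancy reduction normalization aims for.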

4. More advanced information to keep in mind

Managed cloud database services can handle aspects like the necessary hardware, automated backups, and storage capacity management.

Most providers also offer client libraries in popular languages such as JavaScript, Python, and Ruby, which makes them accessible even to beginners. Regardless of your experience as a developer, cloud database building is essentially the same at the core. Most of the work lies in understanding the structure of your cloud database and keeping your records and attributes organized.

5. Conclusion

In the next decade, we will continue to see a rise in the usage of cloud data platforms and their impact on industries like retail, finance, manufacturing, healthcare, and even for government use. It’s a great time to learn about this technology and its many uses.

Original article source at:

#cloud #data #platform 

Hunter Krajcik


Data Mining Functionalities

The use of data by companies to understand business patterns and predict future occurrences has been on the rise. With the availability of new technologies like machine learning, it has become easy for experts to analyse vast quantities of information to find patterns that will help establishments make better decisions. Data mining is a method that has proven very successful in discovering hidden insights in the available information, insights that were not attainable with earlier methods of data exploration. Through this article, we shall understand the process and the various data mining functionalities.

One can learn more about the use and functions of data mining through the Executive Development Programme In General Management offered by reputed institutions. You can learn more about this course on our website. Before we go on to see the various data mining functionalities, we must know the process and why companies are showing keen interest in it. 

What Is Data Mining?

Data mining is the process of analysing large volumes of data, available in the company's storage systems or outside them, to find patterns that help improve the business. The process uses powerful computers and algorithms to execute statistical analysis of data. It helps companies make sense of scattered data and find correlations within it, and lets firms answer questions that could not be answered earlier because of the time manual methods take. Understanding data mining functionalities is possible only once we understand the process clearly.

The data mining process starts with clearly defining the questions for which the organisation seeks answers. It is very important because the exercise can prove futile unless there is a clear focus on the business outcome. Once experts identify the problem, they start collecting relevant data from various sources. These are pooled in a central data lake or warehouse and prepared for analysis. Companies use various data mining functionalities to arrive at the solution they desire. For the success of this process, companies follow the below six steps. 

  1. Business Understanding – For the project’s success, there must be a clear understanding of the business situation, the current aim of the project and the criteria for success. 
  2. Data Understanding – Companies must identify the data needed for the project and collect them from all available sources. 
  3. Data Preparation – This is a very important step in preparing the data for analysis. The company must ensure that the data is in the correct format, making it ready to solve the issue. At this stage, quality issues like missing or duplicate data are addressed.
  4. Modelling – Data mining experts identify patterns in the data and apply them to the predictive models they have created. 
  5. Evaluation – The process is evaluated for its effectiveness. The data experts assess whether the model will deliver the desired business outcome. They fine-tune the algorithm at this stage to get the best results. 
  6. Deployment – Data analysts run the analysis and submit the results to the decision-makers for further action. 

Types Of Data Mining


Descriptive Data Mining

Descriptive data mining aims to transform raw data into information that can be used for analysis and preparing reports. In this type of data mining, the patterns and similarities in the available information are identified and separated. This method also helps to isolate interesting groupings in the analysed data. This method will help the companies find information about the data like count, average, etc. It also brings out the common features of the data. The experts can find out the general properties of the data present in the datasets.
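As a small illustration of descriptive summaries like count and average, the following sketch uses Python's standard library on an invented list of purchase amounts:

```python
from statistics import mean, median
from collections import Counter

# invented purchase amounts for illustration
purchases = [120.0, 80.0, 80.0, 200.0, 60.0, 80.0]

summary = {
    "count": len(purchases),                              # how many records
    "average": mean(purchases),                           # central tendency
    "median": median(purchases),                          # robust to outliers
    "most_common": Counter(purchases).most_common(1)[0][0],  # dominant value
}
print(summary)
```

These are exactly the kinds of general properties (count, average, common features) that descriptive mining surfaces from a dataset.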

Predictive Data Mining

Instead of just understanding the patterns, this type of data mining helps to make predictions using past information. It uses the target-prediction capabilities that are acquired through supervised learning. The subsets of data mining techniques that fall under this method are classification, time-series analysis and regression. Using this method, developers can understand the characteristics that are not explicitly mentioned. Companies can use this method to predict sales in the future using present and past data. 
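As a toy illustration of predictive mining via regression, the sketch below fits a least-squares line to invented monthly sales figures and extrapolates one month ahead:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

months = [1, 2, 3, 4, 5, 6]
sales = [100, 112, 119, 131, 140, 152]  # hypothetical past sales

a, b = fit_line(months, sales)
forecast = a + b * 7  # predict month 7 from the fitted trend
print(round(forecast, 1))
```

Real predictive pipelines use richer models (classification, time-series analysis), but the principle is the same: learn a relationship from past data, then apply it to unseen inputs.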

Advantages Of Using Data Mining

Companies spend a huge amount of money on this process. It is because they gain a lot of benefits by performing data mining. Before we see the various data mining functionalities, let us see how companies benefit from this process. In a competitive business world, companies need to make the right decisions. It is possible only if the company uses available data to discover insights and incorporate them into the decision-making process quickly. The knowledge of the past and present helps optimise the future. 

Data mining helps the company achieve the below objectives

  • Increased Revenues
  • Understanding customer preferences
  • New customer acquisition
  • Improving up-selling and cross-selling
  • Retaining customers and improving their loyalty
  • Increasing the ROI on marketing campaigns
  • Preventing fraud by early detection
  • Identifying credit risks
  • Assessing operational performance

By joining the Executive Development Programme In General Management, you can learn the benefits of using this process in detail. The details of this course are available on our website. Before understanding the data mining functionalities, let us look at examples of how this exercise helps various business areas. 

Examples Of Data Mining

  1. Marketing

One of the biggest examples of data mining can be seen in a company’s marketing activities. They use it for exploring large datasets to improve market segmentation. Many firms use it to understand the relationship between various parameters like the age, gender and preferences of a customer. It helps to create personalised marketing campaigns. One of the other uses of data mining is to predict the customers who are likely to unsubscribe from a service. 

  2. Retail

The placement of products in a supermarket is very important for improving sales. Retailers need to know how to entice customers to purchase products they may not have planned on buying, and for this it is crucial to place those items where they can attract customers. Data mining helps identify product associations and decide where items must be placed on the shelf and in the aisle. It is possible to use different data mining functionalities to know the offers that customers value the most. 

  3. Banking

Avoiding risks is very important for banks. Data mining helps banks to know financial transactions, card usage, purchase patterns and customer financial data. It helps them identify risky customers and reduce risks. Data mining also enables banks to learn more about customers’ online habits. It enables the banks to customise their marketing messages according to customers’ preferences and get better results. 

  4. Medicine

Medical specialists can make more accurate diagnostics using data mining. They can prescribe better treatment methods when they have all information about the patient, like medical records, treatment patterns and physical examination details. Data mining also helps in providing better healthcare by identifying risks and predicting illness in certain sections of the population. It helps the state make better use of available resources. One can also forecast the length of hospitalisation required using this process. 

  5. Media

Television and radio channels use various data mining functionalities to measure their audience on a real-time basis. They collect and analyse information from channel views, programming and broadcasts as they happen. Another big example of the use of data mining is the personalised recommendations the audience receives based on their preferences. The media industry is also able to give valuable information to advertisers about the likes and dislikes of the audience, which helps them target their potential customers more accurately. 

Most Important Data Mining Functionalities


Class Description 

This is one of the data mining functionalities that is used to associate data with a class or concept. One of the best examples is the release of the same model of mobile phone in different variants. This helps companies to satisfy the needs of different customer segments. Data characterisation is one of the methods used in the class description. This helps to connect data with certain sets of customers. The other method, called data discrimination, is used to compare the characteristics of two different classes of customers. 


Classification

Classification is one of the most important data mining functionalities; it uses models to predict trends in the available data. The spending patterns discovered from customers' internet or mobile banking data are one example of classification, helping businesses decide the risk of giving a customer a new loan or credit facility. This method uses "if-then" rules, decision trees, mathematical formulae, or neural networks to analyse a model. This functionality uses training data to create new instances and compare them with existing ones. 


Prediction

Finding missing data in a database is very important for the accuracy of the analysis. Prediction is one of the data mining functionalities that helps the analyst find missing numeric values; if a class label is missing, it is found using classification. Prediction is very important in business intelligence and is very popular. One method is to predict the missing or unavailable data using prediction analysis; the other is to use previously built class models to find the missing class label. 


Association Analysis

This is one of the data mining functionalities that enables the analyst to relate two or more attributes of the data. It provides a way to find the relationship between data items and the rules that keep them together, and it finds much use in retail sales. A classic example is the message "customers who bought this also bought…" that we usually see on online platforms. It relates two transactions of similar items and finds the probability of the same happening again, which helps companies improve their sales of various items. 
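The "customers who bought this also bought…" idea can be sketched as a confidence calculation over a handful of invented shopping baskets:

```python
# invented transaction data: each set is one customer's basket
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"bread", "butter"},
]

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    with_a = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_a if consequent in b]
    return len(with_both) / len(with_a)

# 3 of the 4 baskets containing bread also contain butter
print(confidence("bread", "butter"))  # 0.75
```

Association-rule algorithms such as Apriori essentially search for all antecedent/consequent pairs whose support and confidence exceed chosen thresholds.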

Cluster Analysis

This data mining functionality is similar to classification, but in this case the class label is unknown. It groups items together based on clustering algorithms: similar objects are placed in one cluster, and there are vast differences between one cluster and another. It is applied in fields like machine learning, image processing, pattern recognition, and bioinformatics. There are different types of clustering algorithms, such as K-Means, Gaussian Mixture, and Mean-Shift; each uses different factors to group objects in the data. 
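A minimal K-Means sketch on one-dimensional invented points (plain Python rather than a real library) shows how similar objects end up in the same cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny 1-D K-Means: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.7, 10.1, 10.3]
print(kmeans(points, 2))  # two centers, one near each group
```

The two tight groups in the data are recovered as two cluster centers, near 1.0 and 10.0, without any labels being supplied.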

Outlier Analysis

Outlier analysis is one of the data mining functionalities used to handle data that do not fall under any class. Data that have no similarity with the attributes of other classes or general modules are called outliers. Such occurrences are considered noise or exceptions, and their analysis is termed outlier mining. In many cases they are discarded as noise, but in some cases they can reveal associations, which is the reason for identifying them. They are identified using statistical tests that calculate probabilities. 
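One common statistical test flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold; the sketch below uses invented sensor readings:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# invented readings: five normal values and one anomaly
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
print(outliers(readings))  # [25.0]
```

Whether 25.0 is discarded as noise or investigated as a meaningful exception is exactly the judgment call outlier mining is concerned with.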

Correlation Analysis

This is another data mining functionality that experts use to calculate the strength of the association between two attributes. It is used to determine how well two numerically measured, continuous variables are related to each other. One of the most common examples of such attributes is height and weight: researchers often use these two variables to find whether there is any relationship between them. Correlation analysis can also be combined with association analysis over item sets or subsequences. 
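The height/weight example can be made concrete with a small Pearson correlation sketch over invented measurements:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

heights = [150, 160, 165, 170, 180]  # cm (invented)
weights = [50, 58, 63, 68, 80]       # kg (invented)
print(round(pearson(heights, weights), 3))
```

A result near +1 indicates a strong positive linear relationship, near -1 a strong negative one, and near 0 little linear association.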

An in-depth knowledge of the different data mining functionalities can be gained through the Executive Development Programme In General Management offered by reputed institutions. You can learn more about this programme on our website. 

Summing Up

The use of data mining has increased considerably in the recent past as companies try to gain useful knowledge from the raw data available to them. The development of data warehousing technology and the growth of big data have contributed greatly to this phenomenon. Data mining is interesting because you can get useful information without asking specific questions. It is a predictive process that uses algorithms and statistics to predict future trends. This technology is used heavily in retail and e-commerce companies to understand customer purchase patterns.

Original article source at:

#data #mining 

Royce Reinger


Stats: Golang Statistics Package


A well tested and comprehensive Golang statistics library / package / module with no dependencies.

If you have any suggestions, problems, or bug reports, please create an issue and I'll do my best to accommodate you. In addition, simply starring the repo would show your support for the project and be very much appreciated!


go get

Example Usage

All the functions can be seen in examples/main.go but here's a little taste:

// start with some source data to use
data := []float64{1.0, 2.1, 3.2, 4.823, 4.1, 5.8}

// you could also use different types like this
// data := stats.LoadRawData([]int{1, 2, 3, 4, 5})
// data := stats.LoadRawData([]interface{}{1.1, "2", 3})
// etc...

median, _ := stats.Median(data)
fmt.Println(median) // 3.65

roundedMedian, _ := stats.Round(median, 0)
fmt.Println(roundedMedian) // 4


The entire API documentation is available on or

You can also view docs offline with the following commands:

# Command line
godoc .              # show all exported apis
godoc . Median       # show a single function
godoc -ex . Round    # show function with example
godoc . Float64Data  # show the type and methods

# Local website
godoc -http=:4444    # start the godoc server on port 4444
open http://localhost:4444/pkg/

The exported API is as follows:

var (
    ErrEmptyInput = statsError{"Input must not be empty."}
    ErrNaN        = statsError{"Not a number."}
    ErrNegative   = statsError{"Must not contain negative values."}
    ErrZero       = statsError{"Must not contain zero values."}
    ErrBounds     = statsError{"Input is outside of range."}
    ErrSize       = statsError{"Must be the same length."}
    ErrInfValue   = statsError{"Value is infinite."}
    ErrYCoord     = statsError{"Y Value must be greater than zero."}
)

func Round(input float64, places int) (rounded float64, err error) {}

type Float64Data []float64

func LoadRawData(raw interface{}) (f Float64Data) {}

func AutoCorrelation(data Float64Data, lags int) (float64, error) {}
func ChebyshevDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func Correlation(data1, data2 Float64Data) (float64, error) {}
func Covariance(data1, data2 Float64Data) (float64, error) {}
func CovariancePopulation(data1, data2 Float64Data) (float64, error) {}
func CumulativeSum(input Float64Data) ([]float64, error) {}
func Entropy(input Float64Data) (float64, error) {}
func EuclideanDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func GeometricMean(input Float64Data) (float64, error) {}
func HarmonicMean(input Float64Data) (float64, error) {}
func InterQuartileRange(input Float64Data) (float64, error) {}
func ManhattanDistance(dataPointX, dataPointY Float64Data) (distance float64, err error) {}
func Max(input Float64Data) (max float64, err error) {}
func Mean(input Float64Data) (float64, error) {}
func Median(input Float64Data) (median float64, err error) {}
func MedianAbsoluteDeviation(input Float64Data) (mad float64, err error) {}
func MedianAbsoluteDeviationPopulation(input Float64Data) (mad float64, err error) {}
func Midhinge(input Float64Data) (float64, error) {}
func Min(input Float64Data) (min float64, err error) {}
func MinkowskiDistance(dataPointX, dataPointY Float64Data, lambda float64) (distance float64, err error) {}
func Mode(input Float64Data) (mode []float64, err error) {}
func NormBoxMullerRvs(loc float64, scale float64, size int) []float64 {}
func NormCdf(x float64, loc float64, scale float64) float64 {}
func NormEntropy(loc float64, scale float64) float64 {}
func NormFit(data []float64) [2]float64 {}
func NormInterval(alpha float64, loc float64, scale float64) [2]float64 {}
func NormIsf(p float64, loc float64, scale float64) (x float64) {}
func NormLogCdf(x float64, loc float64, scale float64) float64 {}
func NormLogPdf(x float64, loc float64, scale float64) float64 {}
func NormLogSf(x float64, loc float64, scale float64) float64 {}
func NormMean(loc float64, scale float64) float64 {}
func NormMedian(loc float64, scale float64) float64 {}
func NormMoment(n int, loc float64, scale float64) float64 {}
func NormPdf(x float64, loc float64, scale float64) float64 {}
func NormPpf(p float64, loc float64, scale float64) (x float64) {}
func NormPpfRvs(loc float64, scale float64, size int) []float64 {}
func NormSf(x float64, loc float64, scale float64) float64 {}
func NormStats(loc float64, scale float64, moments string) []float64 {}
func NormStd(loc float64, scale float64) float64 {}
func NormVar(loc float64, scale float64) float64 {}
func Pearson(data1, data2 Float64Data) (float64, error) {}
func Percentile(input Float64Data, percent float64) (percentile float64, err error) {}
func PercentileNearestRank(input Float64Data, percent float64) (percentile float64, err error) {}
func PopulationVariance(input Float64Data) (pvar float64, err error) {}
func Sample(input Float64Data, takenum int, replacement bool) ([]float64, error) {}
func SampleVariance(input Float64Data) (svar float64, err error) {}
func Sigmoid(input Float64Data) ([]float64, error) {}
func SoftMax(input Float64Data) ([]float64, error) {}
func StableSample(input Float64Data, takenum int) ([]float64, error) {}
func StandardDeviation(input Float64Data) (sdev float64, err error) {}
func StandardDeviationPopulation(input Float64Data) (sdev float64, err error) {}
func StandardDeviationSample(input Float64Data) (sdev float64, err error) {}
func StdDevP(input Float64Data) (sdev float64, err error) {}
func StdDevS(input Float64Data) (sdev float64, err error) {}
func Sum(input Float64Data) (sum float64, err error) {}
func Trimean(input Float64Data) (float64, error) {}
func VarP(input Float64Data) (sdev float64, err error) {}
func VarS(input Float64Data) (sdev float64, err error) {}
func Variance(input Float64Data) (sdev float64, err error) {}
func ProbGeom(a int, b int, p float64) (prob float64, err error) {}
func ExpGeom(p float64) (exp float64, err error) {}
func VarGeom(p float64) (exp float64, err error) {}

type Coordinate struct {
    X, Y float64
}

type Series []Coordinate

func ExponentialRegression(s Series) (regressions Series, err error) {}
func LinearRegression(s Series) (regressions Series, err error) {}
func LogarithmicRegression(s Series) (regressions Series, err error) {}

type Outliers struct {
    Mild    Float64Data
    Extreme Float64Data
}

type Quartiles struct {
    Q1 float64
    Q2 float64
    Q3 float64
}

func Quartile(input Float64Data) (Quartiles, error) {}
func QuartileOutliers(input Float64Data) (Outliers, error) {}


Pull requests are always welcome, no matter how big or small. I've included a Makefile that has a lot of helper targets for common actions such as linting, testing, code coverage reporting, and more.

  1. Fork the repo and clone your fork
  2. Create a new branch (git checkout -b some-thing)
  3. Make the desired changes
  4. Ensure tests pass (go test -cover or make test)
  5. Run lint and fix problems (go vet . or make lint)
  6. Commit changes (git commit -am 'Did something')
  7. Push branch (git push origin some-thing)
  8. Submit pull request

To make things as seamless as possible please also consider the following steps:

  • Update examples/main.go with a simple example of the new feature
  • Update documentation section with any new exported API
  • Keep 100% code coverage (you can check with make coverage)
  • Squash commits into single units of work with git rebase -i new-feature


This is not required of contributors and is mostly here as a reminder to myself as the maintainer of this repo. To release a new version we should update the changelog and documentation.

First install the tools used to generate the markdown files and release:

go install
go install
brew tap git-chglog/git-chglog
brew install gnu-sed hub git-chglog

Then you can run these make directives:

# Generate
make docs

Then we can create a new git tag and a GitHub release:

make release TAG=v0.x.x

To authenticate hub for the release you will need to create a personal access token and use it as the password when it's requested.

Download Details:

Author: Montanaflynn
Source Code: 
License: MIT license

#machinelearning #go #data #statistics #math #analytics 

Stats: Golang Statistics Package
Bongani Ngema


How to Build Enterprise Data Lake with AWS Cloud

Data Lake

A Data Lake is a place to store enterprise data in one common place, where it can be accessed by data wranglers with analytical needs. A data lake is different from a normal database: a data lake can store current and historical data from different systems in its raw form for analysis, whereas a database stores the current, updated data for an application. The data organisations preserve can be in any shape or format (structured, unstructured, or semi-structured) and can be saved in any desired format, such as CSV, Apache Parquet, XML, or JSON. Since this data can have no limit on size, we need a mechanism in place to ingest it, usually by batch or stream processing. Potential users of this data also expect the data lake to be secure, with proper security and controls governing access, and to ensure data governance. This should be independent of the data access methods.

Data Lake Benefits

  • Accessibility of data by storing it in one common place, accessible to everyone based on privileges set by data custodians (who manage and own this data).
  • Store raw data at scale for a low cost.
  • Unlock data from different domains in just a few clicks.
  • Provide a leading industry experience to different data personas.
  • Ensure the value associated with each piece of data stored in the lake, to provide a valuable experience and a competitive edge.
  • Make it comprehensive, with search, filtering, and navigation capabilities, so it works like a search engine: a Google for your organisation.

Now, to make this data lake accessible to users, we need a web-based application. A data catalog can address this need, acting as a persistent metadata store that facilitates data exploration across different data stores.

Data Lake (ELT Tool) vs. Data Warehouse (ETL Tool)

Let’s try to understand how a data lake is different from a data warehouse. ETL (Extract, Transform, and Load) is what happens within a data warehouse, while ELT (Extract, Load, and Transform) happens within a data lake. A DWH (data warehouse) serves as an integration platform for data from different data sources: it creates structured data during ETL, which can be used for various analytical needs. A DL (data lake), on the other hand, can preserve data in structured, unstructured, or semi-structured format without a specific purpose or need; this data gains value over time through gradual transformation and other analytical processes. The schema of data in a lake is defined at the time of processing or reading, so the data remains highly configurable and agile as requirements change. Data lakes work well with real-time and big data needs. Hence, when a business has drastically changing data needs, it should build a data lake, whereas for slowly changing, structured data needs, one can go with building a data warehouse.

Data Lake for Big Data

In this age of big data, several millions of rows of data per second, in any format, can be stored and used with a data lake. Another addition to this is the Data Vault methodology and modelling: a governed data lake that addresses some of the limitations of a DWH. A vault provides durability and accelerates business value.

Deploying Data Lakes on Cloud

A data lake is considered an ideal workload to deploy in the cloud for scalability, reliability, availability, performance, and analytics purposes. Users also perceive the cloud as a benefit when deploying a data lake, for better security, faster deployment time, elasticity, a pay-as-you-use model, and more coverage across different geographies.

Build a Data Lake via AWS Cloud

Now let’s discuss the final part of this discussion – how can we build a data lake on cloud using different AWS services.

Data Collection: Collect and extract data from different sources, including flat files, APIs, any SQL or NoSQL database, or cloud storage like S3.

Data Load: Load this raw unprocessed data into AWS S3 bucket for storage. This bucket will act as a landing bucket.

Data Transformation: Then use ETL tool like AWS Glue for various data processing and transformations.

Data Governance: We can further enable security settings and access controls on this data, ensuring data governance on top of the transformed, processed data. A data catalog can be built for storing metadata and for further exploration across different data stores.

Data Curation: We can curate this processed data in another target S3 bucket or in AWS Redshift (as a DWH).

Data Notification & Monitoring: AWS SNS can be used for intermediate notifications and alerting mechanism for various jobs. AWS cloudwatch can be used for monitoring and logging.

Data Analytics: From the second S3 bucket or Redshift, where the transformed data was curated, we can query and analyse data for various business requirements via AWS Athena and QuickSight. Data scientists can also use this data for building and training various ML models.
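The flow above can be sketched with the AWS CLI. The bucket, job, crawler, and table names below are hypothetical placeholders, not part of the original architecture:

```shell
# Load: land the raw extract in the S3 landing bucket.
aws s3 cp ./export/orders.csv s3://my-landing-bucket/raw/orders.csv

# Transform: kick off a Glue ETL job that processes the landed data.
aws glue start-job-run --job-name my-etl-job

# Govern: a Glue crawler populates the Data Catalog with table metadata.
aws glue start-crawler --name my-curated-crawler

# Analyze: query the curated data in place with Athena.
aws athena start-query-execution \
  --query-string "SELECT product, SUM(quantity) FROM curated.orders GROUP BY product" \
  --result-configuration OutputLocation=s3://my-curated-bucket/athena-results/
```

Each command maps to one stage of the pipeline; in practice the Glue job and crawler would be scheduled or triggered by S3 events rather than run by hand.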

Original article source at:

#aws #cloud #data 

Rupert Beatty


A Block-based API for NSValueTransformer, with A Growing Collection


A block-based API for NSValueTransformer, with a growing collection of useful examples.

NSValueTransformer, while perhaps obscure to most iOS programmers, remains a staple of OS X development. Before Objective-C APIs got in the habit of flinging block parameters hither and thither with reckless abandon, NSValueTransformer was the go-to way to encapsulate mutation functionality --- especially when it came to Bindings.

NSValueTransformer is convenient to use but a pain to set up. To create a value transformer you have to create a subclass, implement a handful of required methods, and register a singleton instance by name.

TransformerKit breathes new life into NSValueTransformer by making value transformers dead-simple to define and register:

NSString * const TTTCapitalizedStringTransformerName = @"TTTCapitalizedStringTransformerName";

[NSValueTransformer registerValueTransformerWithName:TTTCapitalizedStringTransformerName
                               transformedValueClass:[NSString class]
                  returningTransformedValueWithBlock:^id(id value) {
  return [value capitalizedString];
}];
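Once registered, a transformer is retrieved and applied through the standard NSValueTransformer API (the result comment assumes -capitalizedString behavior on the sample string):

```objc
NSValueTransformer *transformer =
    [NSValueTransformer valueTransformerForName:TTTCapitalizedStringTransformerName];

// -transformedValue: runs the block registered above.
NSString *result = [transformer transformedValue:@"hello world"]; // @"Hello World"
```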

TransformerKit pairs nicely with InflectorKit and FormatterKit, providing well-designed APIs for manipulating user-facing content.

TransformerKit also contains a growing number of convenient transformers that your apps will love and cherish:

String Transformers

  • Capitalized
  • lowercase
  • CamelCase
  • llamaCase
  • snake_case
  • train-case
  • esreveR* (Reverse)
  • Rémövê Dîaçritics (Remove accents and combining marks)
  • ट्रांस्लितेराते स्ट्रिंग (Transliterate to Latin)
  • Any Valid ICU Transform*

Image Transformers

  • PNG Representation*
  • JPEG Representation*
  • GIF Representation (macOS)
  • TIFF Representation (macOS)

Date Transformers

JSON Data Transformers

  • JSON Transformer*

Data Transformers (macOS)

  • Base16 String Encode / Decode
  • Base32 String Encode / Decode
  • Base64 String Encode / Decode
  • Base85 String Encode / Decode

Cryptographic Transformers (macOS)

  • MD5, SHA-1, SHA-256, et al. Digests

* - Reversible


Mattt (@mattt)

Download Details:

Author: Mattt
Source Code: 
License: MIT license

#swift #objective-c #data #transform 

Rupert Beatty


Graph: A Semantic Database That Is Used to Create Data-driven Apps

Welcome to Graph

Graph is a semantic database that is used to create data-driven applications.


  •  iCloud Support
  •  Multi Local & Cloud Graphs
  •  Thread Safe
  •  Store Any Data Type, Including Binary Data
  •  Relationship Modeling
  •  Action Modeling For Analytics
  •  Model With Graph Theory and Set Theory
  •  Asynchronous / Synchronous Search
  •  Asynchronous / Synchronous Saving
  •  Data-Driven Architecture
  •  Data Model Observation
  •  Comprehensive Unit Test Coverage
  •  Example Projects


  • iOS 8.0+ / Mac OS X 10.10+
  • Xcode 8.0+


  • If you need help, use Stack Overflow. (Tag 'cosmicmind')
  • If you'd like to ask a general question, use Stack Overflow.
  • If you found a bug, and can provide steps to reliably reproduce it, open an issue.
  • If you have a feature request, open an issue.
  • If you want to contribute, submit a pull request.


Embedded frameworks require a minimum deployment target of iOS 8.


CocoaPods is a dependency manager for Cocoa projects. You can install it with the following command:

$ gem install cocoapods

To integrate Graph's core features into your Xcode project using CocoaPods, specify it in your Podfile:

source ''
platform :ios, '8.0'

pod 'Graph', '~> 3.1.0'

Then, run the following command:

$ pod install


Carthage is a decentralized dependency manager that builds your dependencies and provides you with binary frameworks.

You can install Carthage with Homebrew using the following command:

$ brew update
$ brew install carthage

To integrate Graph into your Xcode project using Carthage, specify it in your Cartfile:

github "CosmicMind/Graph"

Run carthage update to build the framework and drag the built Graph.framework into your Xcode project.


Graph is a growing project and will encounter changes throughout its development. It is recommended that the Changelog be reviewed prior to updating versions.


The following are samples to see how Graph may be used within your applications.

  • Visit the Samples repo to see example projects using Graph.

Creating an Entity for an ImageCard

An Entity is a model (data) object that represents a person, place, or thing. It may store property values, be a member of groups, and can be tagged.

In the following example, we create an ImageCard view using Material and populate its properties with an Entity that stores the data for that view.

Material ImageCard

Creating data

let graph = Graph()

let entity = Entity(type: "ImageCard")
entity["title"] = "Graph"
entity["detail"] = "Build Data-Driven Software"
entity["content"] = "Graph is a semantic database that is used to create data-driven applications."
entity["author"] = "CosmicMind"
entity["image"] = UIImage.load(contentsOfFile: "frontier", ofType: "jpg")
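The Entity above is only staged in memory; saving is a separate step. A minimal sketch, assuming Graph's synchronous save method (the feature list above advertises synchronous and asynchronous saving):

```swift
// Persist all staged changes, including the Entity above.
graph.sync()
```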


Setting the view's properties

imageCard.toolbar?.title = entity["title"] as? String
imageCard.toolbar?.detail = entity["detail"] as? String
imageCard.imageView?.image = entity["image"] as? UIImage

let contentLabel = UILabel()
contentLabel.text = entity["content"] as? String
imageCard.contentView = contentLabel

let authorLabel = UILabel()
authorLabel.text = entity["author"] as? String
imageCard.bottomBar?.centerViews = [authorLabel]

Searching a list of users in realtime

Using the Search API is incredibly flexible. In the following example, Search is used to create a live search on user names with a dynamic UI provided by Material's SearchBar.

Preparing the search criteria

let graph = Graph()

let search = Search<Entity>(graph: graph).for(types: "User").where(properties: "name")

Asynchronously searching graph

search.async { [weak self, pattern = pattern] (users) in

    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return
    }

    var data = [Entity]()

    for user in users {
        if let name = user["name"] as? String {
            let matches = regex.matches(in: name, range: NSRange(location: 0, length: name.utf16.count))

            if 0 < matches.count {
                data.append(user)
            }
        }
    }

    // Update the backing data source property (named `data` here) with the matches.
    self?.data = data
}

Download Details:

Author: CosmicMind
Source Code: 
License: MIT license

#swift #data #database #graph 

Gordon Murray


Create and Pass Data to Template in CakePHP 4

A template in CakePHP 4 is a .php file that defines the HTML layout of the page. Template files are loaded automatically, but for this to work they need to be created following a specific pattern.

From Controller, you can pass data to the template file.

In this tutorial, I show how you can create template files and pass values from the controller to the template in CakePHP 4 project.

1. Create Controller

Create a HomeController.php file in src/Controller/ folder.

Create HomeController Class that extends AppController.

In the class create 2 methods –

  • index() – Assigns string values to the $page and $content variables. To pass these values to the template, use $this->set(), which takes an array. The compact() function converts the passed variable names to an array in which each key is the same as the variable name.
  • aboutus() – Assigns a string value to the $page variable. Initializes $data['page'] with $page and passes $data to $this->set(). Reading the value in the template works the same as the way specified above.

CakePHP automatically maps the method names to request URLs.

This will create the following requests –

  • index – http://localhost:8765/home or http://localhost:8765/home/index
  • aboutus – http://localhost:8765/home/aboutus

Completed Code


<?php
namespace App\Controller;

class HomeController extends AppController
{
     public function index(){
          $page = "Homepage";
          $content = "Welcome to Makitweb";

          // Pass values to template
          $this->set(compact('page', 'content'));
     }

     public function aboutus(){
          $page = "About US";

          // Pass value to template
          $data['page'] = $page;
          $this->set($data);
     }
}

2. Create Template

Create a new Home folder in templates/ folder. Now in the Home folder create index.php and aboutus.php file.

Here, make sure the folder name is the same as the controller name – Home and the file name is the same as the method names created in the controller – index(), and aboutus().

CakePHP 4 template file structure


In index.php, just create <h1> and <p> tags. You can read the values passed from the controller using either <?php echo $page ?> or <?= $page ?>.

Completed Code

<h1><?= $page ?></h1>
<p><?= $content ?></p>


In aboutus.php, also create <h1> and <p> tags. I displayed a static value in the <p> tag and the passed value in the <h1> tag using <?= $page ?>.

Completed Code

<h1><?= $page ?></h1>
<p>About us page content</p>

4. Output

Home page (http://localhost:8765/home/index)

CakePHP 4 example index page

About us page (http://localhost:8765/home/aboutus)

CakePHP 4 example aboutus page

5. Conclusion

You don’t need to call template files explicitly from the controller, so the naming of the files and folders is important when creating a template.

If you found this tutorial helpful then don't forget to share.

Original article source at:

#cakephp #pass #data #php 

Gordon Murray


How to Load and Index Data in MarkLogic

With MarkLogic being a document-oriented database, data is commonly stored in a JSON or XML document format.

If the data to bring into MarkLogic is not already structured as JSON or XML, meaning, for instance, it is currently in a relational database, there are various ways to export or transform it from the source.

For example, many relational databases provide an option to export relational data in XML or JSON format, or a SQL script could be written to fetch the data from the database, outputting it in an XML or JSON structure. Or, using MarkLogic, rows from a .csv file can be imported as XML or JSON documents.

In any case, it is normal to first denormalize the data being exported from the relational database, to put the content back together in its original state. Denormalization, which occurs naturally when working with documents in their original form, greatly reduces the need for joins and accelerates performance.

Schema Agnostic

As we know, a schema is a set of rules for a particular structure of the database. When we talk about data quality, schemas are helpful, since quality matters a lot for the reliability of the database and the actions taken on it.

Schema-agnostic means the database is not bound by any schema, although it is aware of one when present. Schemas are optional in MarkLogic, and data is loaded in its original form. To address a group of documents within a database, directories, collections, and the internal structure of documents can be used. This is how MarkLogic easily supports data from disparate systems, all in the same database.

Required Document Size and Structure

When loading documents, it is best to have one document per entity. MarkLogic is most performant with many small documents, rather than one large document. The target document size is 1KB to 100KB, but documents can be larger.

For example, rather than loading a bunch of students all as one document, have each student be a document.

When defining a document, use meaningful XML element and attribute names or JSON property names. Make names human-readable rather than generic; this convention helps keep indexes efficient.



<product> Mouse </product>
<price> 1000 </price>
<quantity> 3 </quantity>

<product> Keyboard </product>
<price> 2000 </price>
<quantity> 2 </quantity>



Indexing Documents

As documents are loaded, all the words in each document, and the structure of each document, are indexed, so documents are easily searchable.

Documents can be loaded into MarkLogic in many ways:

  • MarkLogic Content Pump (MLCP)
  • Data Movement SDK
  • REST APIs
  • Java API or Node.js API
  • XQuery
  • JavaScript functions

Reading a Document

To read a document, the URI of the document is used.

XQuery Example : fn:doc("college/course-101.json")

JavaScript Example : fn.doc("account/order-202.json")

REST API Example : curl --anyauth --user admin:admin -X GET "http://localhost:8055/v1/documents?uri=/accounting/order-10072.json"

Splitting feature of MLCP

MLCP has the feature of splitting the long XML documents, where each occurrence of a designated element becomes an individual XML document in the database. This is useful when multiple records are all contained within one large XML file. Such as a list of students, courses, details, etc.

The -input_file_type aggregates option is used to split a large file into individual documents. The -aggregate_record_element option designates the element that begins each new document. The -uri_id option is used to create a URI for each document.
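As an illustration of the options above (the host, credentials, file path, and element names here are hypothetical placeholders), an MLCP import that splits an aggregate XML file of students might look like:

```shell
# Split students.xml so that each <student> element becomes its own document,
# using each record's <id> value to build the document URI.
mlcp.sh import -host localhost -port 8000 \
  -username admin -password admin \
  -input_file_path /data/students.xml \
  -input_file_type aggregates \
  -aggregate_record_element student \
  -uri_id id
```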

While it is fine to have a mix of XML and JSON documents in the same database, it is also possible to transform content from one format to the other. You can easily transform the files by following the steps below.

xquery version "1.0-ml";

import module namespace json = "" at "abc/json/json.xqy";

json:transform-to-json(fn:doc("doc-01.xml"), json:config("custom"))

The MarkLogic Content Pump can be used to import the rows from a .csv file into a MarkLogic database. We are able to transform the data during the import process or afterward in the database. Ways to modify content once it is already in the database include using the Data Movement SDK, XQuery, JavaScript, etc.


As we have seen, MarkLogic is a database that facilitates many things: loading data, indexing data, transforming data, and splitting data.


Original article source at:

#loading #data #index 

Gordon Murray


How to Filter Data with PHP

In this article, we’ll look at why it’s so important to filter anything that’s incorporated into our applications. In particular, we’ll look at how to validate and sanitize foreign data in PHP.

Never (ever!) trust foreign input in your application. That’s one of the most important lessons to learn for anyone developing a web application.

Foreign input can be anything — from $_GET and $_POST form input data, some elements on the HTTP request body, or even some values on the $_SERVER superglobal. Cookies, session values, and uploaded and downloaded document files are also considered foreign input.

Every time we process, output, include or concatenate foreign data into our code, there’s a potential vector for attackers to inject code into our application (the so-called injection attacks). Because of this, we need to make sure every piece of foreign data is properly filtered so it can be safely incorporated into the application.

When it comes to filtering, there are two main types: validation and sanitization.


Validation ensures that foreign input is what we expect it to be. For example, we might be expecting an email address, so we are expecting something in the user@domain.tld format. For that, we can use the FILTER_VALIDATE_EMAIL filter. Or, if we’re expecting a Boolean, we can use PHP’s FILTER_VALIDATE_BOOL filter.

Amongst the most useful filters are FILTER_VALIDATE_BOOL, FILTER_VALIDATE_INT, and FILTER_VALIDATE_FLOAT to filter for basic types and the FILTER_VALIDATE_EMAIL and FILTER_VALIDATE_DOMAIN to filter for emails and domain names respectively.

Another very important filter is the FILTER_VALIDATE_REGEXP that allows us to filter against a regular expression. With this filter, we can create our custom filters by changing the regular expression we’re filtering against.

All the available filters for validation in PHP can be found here.


Sanitization is the process of removing illegal or unsafe characters from foreign input.

The best example of this is when we sanitize database inputs before inserting them into a raw SQL query.

Again, some of the most useful sanitization filters include the ones for basic types, like FILTER_SANITIZE_STRING, FILTER_SANITIZE_SPECIAL_CHARS, and FILTER_SANITIZE_NUMBER_INT, but also FILTER_SANITIZE_URL and FILTER_SANITIZE_EMAIL to sanitize URLs and emails.

All PHP sanitization filters can be found here.

filter_var() and filter_input()

Now that we know PHP has an entire selection of filters available, we need to know how to use them.

Filter application is done via the filter_var() and filter_input() functions.

The filter_var() function applies a specified filter to a variable. It will take the value to filter, the filter to apply, and an optional array of options. For example, if we’re trying to validate an email address we can use this:


$email = "user@example.com"; // example address to check

if ( filter_var( $email, FILTER_VALIDATE_EMAIL ) ) {
    echo ("This email is valid");
}

If the goal was to sanitize a string, we could use this:

$string = "<h1>Hello World</h1>";

$sanitized_string = filter_var ( $string, FILTER_SANITIZE_STRING);
echo $sanitized_string;

The filter_input() function gets a foreign input from a form input and filters it.

It works just like the filter_var() function, but it takes a type of input (we can choose from GET, POST, COOKIE, SERVER, or ENV), the variable to filter, and the filter. Optionally, it can also take an array of options.

Once again, if we want to check if the external input variable “email” is being sent via GET to our application, we can use this:


if ( filter_input( INPUT_GET, "email", FILTER_VALIDATE_EMAIL ) ) {
    echo "The email is being sent and is valid.";
}
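The optional options array mentioned earlier lets some filters take constraints. A small sketch with FILTER_VALIDATE_INT (the variable names are illustrative):

```php
<?php
// Validate an age submitted as a string, accepting only
// integers between 0 and 120 via the "options" array.
$age = "25";

$valid = filter_var($age, FILTER_VALIDATE_INT, [
    "options" => ["min_range" => 0, "max_range" => 120],
]);

if ($valid === false) {
    echo "Invalid age";
} else {
    echo "Valid age: " . $valid; // prints "Valid age: 25"
}
```

On success filter_var() returns the filtered value (here the integer 25), and false on failure, which is why the strict comparison against false is used.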


And these are the basics of data filtering in PHP. Other techniques might be used to filter foreign data, like applying regex, but the techniques we’ve seen in this article are more than enough for most use cases.

Make sure you understand the difference between validation and sanitization and how to use the filter functions. With this knowledge, your PHP applications will be more reliable and secure!

Original article source at:

#php #data 


What is a Logistic Regression? - Simply explained

What is a Logistic Regression? How is it calculated? And most importantly, how are the logistic regression results interpreted? In a logistic regression, the dependent variable is a dichotomous variable. Dichotomous variables are variables with only two values. For example: Whether a person buys or does not buy a particular product.

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
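Concretely, while a linear regression predicts the dependent variable directly, a logistic regression passes the linear combination of the independent variables through the logistic function, so the prediction is always a probability between 0 and 1:

```latex
P(y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_k x_k)}}
```

Here b_0 is the intercept and b_1 through b_k are the coefficients of the independent variables x_1 through x_k; the odds ratio reported for each coefficient b is e^b.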

00:00 What is a Regression
00:45 Difference  between Linear Regression and Logistic Regression
01:24 Example Logistic Regression
02:23 Why do we need Logistic Regression?
03:31 Logistic Function and the Logistic Regression equation
05:01 How to interpret the results of a Logistic Regression?
07:58 Logistic Regression: Results Table
08:21 Logistic Regression: Classification Table
09:19 Logistic Regression: and Chi Square Test
10:22 Logistic Regression: Model Summary
11:24 Logistic Regression: Coefficient B, Standard error, p-Value and odds Ratio
13:56 ROC Curve (receiver operating characteristic curve)

📁 Load Example Dataset 
💻 Online Logistic Regression Calculator 
🎓Tutorial Logistic Regression 


#datascience #data-analysis  #machinelearning 

