Find and Kill A Zombie Process on Linux

In movies and novels, a zombie is a creature that was human, died, and then, through a virus or some other cause, rose again: already dead, yet still walking. Linux borrows the concept. A zombie process is a process that has terminated and is marked “defunct”, but whose entry still lingers in the system's process table. Every child process becomes a zombie first, until it is removed from the process table.

A zombie is also called a process in the terminated state. It is cleaned from memory by its parent process: whenever a child process terminates, the kernel notifies the parent, and the dead child stays in the process table until the parent acknowledges the notification. If the parent never collects it, the child remains a zombie and its table entry is never released.

This means that a dead process is not removed from the process table immediately; it lingers there as a zombie. To remove a zombie process, the parent process calls the wait() function. Once wait() collects the child's exit status, the zombie is completely removed from the system.
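The lifecycle described above can be observed with a short Python sketch (Linux-only, since it reads /proc): a forked child exits immediately and stays in state “Z” until the parent calls wait():

```python
import os
import time

# Fork a child that exits immediately. Until the parent calls wait(),
# the terminated child stays in the process table as a zombie (state "Z").
pid = os.fork()
if pid == 0:
    os._exit(0)          # child: terminate right away

time.sleep(0.5)          # give the child time to die

# Read the child's state from /proc (the field after the command name)
with open(f"/proc/{pid}/stat") as stat:
    state = stat.read().split(")")[-1].split()[0]
print("state before wait():", state)  # → Z

os.waitpid(pid, 0)       # reap: the zombie disappears from the table
print(os.path.exists(f"/proc/{pid}"))  # → False
```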

Killing a Zombie Process

Before moving on to killing zombie processes, we will first discuss the risks they pose and the causes that create them. This background makes the killing procedure easier to understand.

What is the Cause of the Zombie Process

There are two major causes of zombie processes. The first is that the parent process fails to call the wait() function while its child is running, effectively ignoring the SIGCHLD signal, so the child is never reaped. The second is that another application interferes with the parent process's execution, due to bad coding or malicious content, so the parent never gets the chance to reap its children.

In other words, a zombie process arises when the parent process ignores, or cannot check, the child process's state changes; when the child ends, its process control block (PCB) is never cleared.

Does the Zombie Process Pose a Risk

Zombie processes do not pose much risk; they only occupy a small amount of memory. The process table, however, is of limited size, and the table slot holding a zombie process cannot be reused until the zombie is reaped. If many zombie processes accumulate and fill the process table, no slots are left and new processes cannot start.

Finding a Zombie Process

Before killing the zombie process it is necessary to find them. To find the zombie process we will run the following command in the terminal:

linux@linux-VirtualBox:~$ ps aux | egrep "Z|defunct"

In the command above, “ps” stands for “process status”; it is used to view the state of the processes running on the system. We passed the flags “aux”, in which “a” lists the processes of all users, “u” adds user-oriented details, and “x” includes processes that were not started from a terminal. In combination, the command prints all running processes on the system.

The second part, “egrep”, is a pattern-matching tool that fetches lines matching an expression. The pattern “Z|defunct” matches processes whose state is “Z” or that are marked “defunct”, i.e., zombies. When we execute the command, we get output like the following, which lists each matching process along with its “PID”. Note that the grep process itself may also appear in the output, as it does here:


linux      33819  0.0  0.0  18008   724 pts/0    S+   20:22   0:00 grep -E --color=auto Z|defunct

A zombie process is already dead; it lingers only because its parent has not read its exit status, so it cannot be released from memory. Since a dead process cannot be killed again, all we can do is get the parent process to read the child's state so the entry can be removed from the process table. To do that, we first find the parent with the command below:

linux@linux-VirtualBox:~$ ps -o ppid= -p 33819

In the above command, we tried to get the parent id of the zombie process. After getting the parent id, we will run the following command to kill the zombie process by sending the SIGCHLD to the parent process which enables the parent process to read the child state:

linux@linux-VirtualBox:~$ kill -s SIGCHLD Parent_PID

In the command above, we send the SIGCHLD signal to the parent process ID so that the parent reaps its zombie child. If no such parent ID exists, the command simply moves to the next line without printing any output. To verify whether the zombie process is gone, run the same command we used earlier to find zombie processes.
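The mechanism behind this can be sketched in Python: a parent that reaps terminated children in a SIGCHLD handler never accumulates zombies. This is an illustrative sketch, separate from the commands above:

```python
import os
import signal
import time

# A parent that reaps terminated children in a SIGCHLD handler never
# accumulates zombies: the handler calls waitpid() until no child is left.
reaped = []

def on_sigchld(signum, frame):
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                 # no children left at all
        if pid == 0:
            break                 # remaining children are still running
        reaped.append(pid)

signal.signal(signal.SIGCHLD, on_sigchld)

child = os.fork()
if child == 0:
    os._exit(0)                   # child terminates immediately

time.sleep(0.5)                   # the handler has run by now
print(child in reaped)            # → True
```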

Let us try another way to deal with a zombie process: killing the parent process itself. This is the more drastic but effective approach, because it removes the whole process tree and prevents the zombie from arising again. For that, we run the command shown below:

linux@linux-VirtualBox:~$ kill -9 Parent_PID

After running the above command, the parent process is killed. Its orphaned zombie children are then adopted by init, which reaps them and removes them from the process table.


We have briefly discussed zombie processes: what causes them, how to identify them with simple commands, and the procedures to remove them.

Original article source at:

#linux #process 

Gordon Murray


How to Check Which Process is using A Port on Linux

What Is a Port?

In computer networking, a port represents a logical entry and exit point for a connection. Ports are based on software and are entirely virtual. These ports on a computer are managed by the operating system.

What Will We Talk About?

This quick tutorial demonstrates the various methods to determine which Linux process or service is currently listening on a specific port. Let’s talk about ports and their purpose.

How Are Ports Analogous to Physical Ports?

Just as physical ports help to interact with various peripheral devices connected to a computer, ports help the different services to communicate with each other. These services can be on the same computer or on different computers.

A Bit About Port of a Service

To listen for incoming connection requests, a process associates itself with a port number. Most processes are set up with a default port, and they have to use that port as per their specification. They do not automatically switch to the other port unless their configuration is explicitly modified.

A few examples of protocols and their default ports include the Secure Shell (SSH) protocol (port 22), the Apache HTTP server (port 80), the MySQL database server (port 3306), and so forth. You may use this information to discover which default port a service utilizes.

The config file of these services can be edited to use some other port as well.
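The binding behavior described above can be illustrated with a small Python sketch (the loopback address and the OS-assigned port are arbitrary choices for this demo):

```python
import socket

# A process "uses" a port by binding a socket to it and listening.
# Port 0 asks the OS for any free port (an arbitrary choice for this demo).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen()
port = srv.getsockname()[1]
print("listening on port", port)

# A second socket cannot bind the same address/port while the first holds it
other = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
in_use = False
try:
    other.bind(("127.0.0.1", port))
except OSError:
    in_use = True
print("port already in use:", in_use)  # → port already in use: True
other.close()
srv.close()
```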

Checking the Ports on Linux

Let’s now see how to check what port/ports a process is using on Linux. Here, we will show you the different commands for this purpose.

1. Lsof Command

The lsof utility is helpful to obtain a list of the ports used by your system. Consider the following example, which gets information about the process (or processes) using TCP port 22:

$ sudo lsof -i TCP:22

The lsof command gives more information like the user’s name and what process IDs are linked to each process. It works with both TCP and UDP ports.

2. SS Command

The ss command is another way to find out which processes are linked to a certain port. Although lsof is more widely used, some people may find ss handier.

Let’s look for the processes or services that listen on port 3306:

$ sudo ss -tunap | grep :3306

Let’s break down this command:

1. t: It tells the ss command to display TCP sockets.

2. u: It tells the ss command to display UDP sockets.

3. n: It is used to display the port numbers instead of their service names.

4. a: It is used to display the listening as well as non-listening sockets of all types.

5. p: It is used to display the processes that utilize a socket.

The result of the previous command shows which process is utilizing which port. You may also issue the following command:

$ sudo ss -tup -a sport = :80

Here, sport signifies the source port.

These two approaches may help you find the IDs of the processes that are connected to different ports.

3. Netstat Command

The netstat command shows the information about your network and can be used to fix the problems or change the way that your network is set up. It can also keep a close watch on your network connections.

This command is often used to see information about inbound and outbound connections, routing tables, listening ports, and usage statistics. Although considered obsolete in recent years (largely superseded by ss), netstat is still a useful tool for analyzing networks.

With the grep command, netstat can determine which process or service is using a certain port (by mentioning the port):

$ sudo netstat -ltnp | grep -w ':80'

The options used here can be classified as follows:

1. t: It only shows TCP connections.

2. l: It shows only listening sockets.

3. n: It displays addresses and port numbers in numerical format.

4. p: It displays the PID and program name which are associated with each socket.

4. Fuser Command

The fuser command determines the processes that utilize the files or sockets. You can use it to list the services which run on a specific port. Let’s take the example of port 3306 and see what services are running here:

$ sudo fuser 3306/tcp

This provides us with the process numbers using this port. You can use this process number to find the corresponding process names. For example, if the process number is 15809, the command to use here is as follows:

$ ps -p 15809 -o comm=

However, identifying the processes that use a non-standard port requires a broader view. lsof can also list every listening service and the port it uses. Consider the following example, which lists the UDP and TCP listening ports:

$ sudo lsof -Pni | egrep "(UDP|LISTEN)"

The following is a description of the options that are used here:

1. P: It suppresses the port service name lookup.

2. n: It displays the numeric network addresses.

3. i: It lists the IP sockets.

Both the ports and the associated processes are shown in the output. This approach is particularly useful for processes that use non-default ports.
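For the curious, the lookup these tools perform can be approximated in a few lines of Python by reading /proc directly: find the socket inode for the port in /proc/net/tcp, then find which process holds that inode among its file descriptors. This is a simplified, Linux-only sketch covering listening TCP sockets only:

```python
import glob
import os
import socket

def pid_for_port(port):
    """Return the set of PIDs with a TCP socket listening on `port`,
    by matching socket inodes from /proc/net/tcp{,6} against the
    file descriptors in /proc/<pid>/fd (roughly what lsof does)."""
    inodes = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            lines = open(path).read().splitlines()[1:]
        except FileNotFoundError:
            continue
        for line in lines:
            parts = line.split()
            local, state, inode = parts[1], parts[3], parts[9]
            # 0A is the LISTEN state; the port is hex after the colon
            if state == "0A" and int(local.split(":")[1], 16) == port:
                inodes.add(f"socket:[{inode}]")
    pids = set()
    for fd in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            if os.readlink(fd) in inodes:
                pids.add(int(fd.split("/")[2]))
        except OSError:
            pass  # process exited or fd not readable
    return pids

# Demo: listen on an ephemeral port and find our own PID behind it
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen()
port = srv.getsockname()[1]
found = pid_for_port(port)
print(os.getpid() in found)  # → True
srv.close()
```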


In this article, we talked about four possible Linux command-line tools and provided the examples on how to use them to find out which process is listening on a certain port.

Original article source at:

#linux #process 

Sheldon Grant


All you need to Simplify ETL process with AWS Glue

The ETL process has been designed specifically for the purpose of transferring data from its source database into a data warehouse. However, the challenges and complexities of ETL can make it hard to implement successfully for all of your enterprise data. For this reason, Amazon has introduced AWS Glue. You can learn more about the Amazon web services with the AWS Training and Certification.

In this article, the pointers that we are going to cover are as follows:

  • What is AWS Glue?
  • When should I use AWS Glue?
  • AWS Glue Benefits
  • The AWS Glue Concepts
  • AWS Glue Terminology
  • How does AWS Glue work?

So let us begin with our first topic.

What is AWS Glue?

AWS Glue is a fully managed ETL service. This service makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it swiftly and reliably between various data stores.

It comprises components such as a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

AWS Glue is serverless, which means that there's no infrastructure to set up or manage.

When Should I Use AWS Glue?

1. To build a data warehouse to organize, cleanse, validate, and format data. 

You can transform as well as move AWS Cloud data into your data store.

You can also load data from disparate sources into your data warehouse for regular reporting and analysis.

By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.

2. When you run serverless queries against your Amazon S3 data lake. 

AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.

With crawlers, your metadata stays in synchronization with the underlying data. Athena and Redshift Spectrum can directly query your Amazon S3 data lake with the help of the AWS Glue Data Catalog.

With AWS Glue, you access as well as analyze data through one unified interface without loading it into multiple data silos.

3. When you want to create event-driven ETL pipelines 

You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function.

You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.

4.  To understand your data assets. 

You can store your data using various AWS services and still maintain a unified view of your data using the AWS Glue Data Catalog.

View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.

The Data Catalog also serves as a drop-in replacement for your external Apache Hive Metastore.

AWS Glue Benefits

1. Less hassle

AWS Glue is integrated across a very wide range of AWS services. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, along with common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.


2. Cost-effective

AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources that you use while your jobs are running.



3. More power

AWS Glue automates a significant amount of the effort in building, maintaining, and running ETL jobs. It crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.

AWS Glue Concepts

You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and load (ETL) data from a data source to a data target. You typically perform the following actions:


Figure: AWS Glue architecture.

Firstly, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog. In addition to table definitions, the Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.

  • AWS Glue can generate a script to transform your data or you can also provide the script in the AWS Glue console or API.

You can run your job on-demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.

When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. This script runs in an Apache Spark environment in AWS Glue.

You can learn more about AWS and its services from the AWS Cloud Course.

AWS Glue Terminology

Data Catalog

The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment.

Classifier

Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others.

Connection

It contains the properties that are required to connect to your data store.

Crawler

A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the Data Catalog.

Database

A set of associated Data Catalog table definitions organized into a logical group in AWS Glue.

Data Store, Data Source, Data Target

A data store is a repository for persistently storing your data. A data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to.

Development Endpoint

An environment that you can use to develop and test your AWS Glue ETL scripts.

Job

The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets.

Notebook Server

A web-based environment that you can use to run your PySpark statements. PySpark is a Python dialect for ETL programming.

Script

Code that extracts data from sources, transforms it, and loads it into targets. AWS Glue generates PySpark or Scala scripts.

Table

The metadata definition that represents your data. A table defines the schema of your data.

Transform

The code logic that you use to manipulate your data into a different format.

Trigger

Initiates an ETL job. You can define triggers based on a scheduled time or event.

How does AWS Glue work?

Here I am going to demonstrate an example where I will create a transformation script with Python and Spark. I will also cover some basic Glue concepts such as crawler, database, table, and job.

1. Create a data source for AWS Glue:

Glue can read data from a database or an S3 bucket. For example, I have created an S3 bucket called glue-bucket-edureka. Create two folders from the S3 console and name them read and write. Now create a text file with the following data and upload it to the read folder of the S3 bucket.

1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Godfather: Part II,1974,9.0
4,The Dark Knight,2008,9.0
5,12 Angry Men,1957,8.9
6,Schindler’s List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,Fight Club,1999,8.8

2. Crawl the data source to the data catalog:

In this step, we will create a crawler. The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema. In Glue crawler terminology, the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON, and Parquet. Our sample file is in the CSV format and will be recognized automatically.


In the left panel of the Glue management console click Crawlers.

Click the blue Add crawler button.

Give the crawler a name such as glue-demo-edureka-crawler.

In Add a data store menu choose S3 and select the bucket you created. Drill down to select the read folder.

In Choose an IAM role, select Create new. Name the role, for example, glue-demo-edureka-iam-role.

In Configure the crawler’s output add a database called glue-demo-edureka-db.

When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.

3. The crawled metadata in Glue tables:

Once the data has been crawled, the crawler creates a metadata table from it. You find the results from the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables. Glue tables don’t contain the data but only the instructions on how to access the data.

4. AWS Glue jobs for data transformations:

From the Glue console left panel go to Jobs and click blue Add job button. Follow these instructions to create the Glue job:

  • Name the job as glue-demo-edureka-job.
  • Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
  • Type: Spark.
  • Glue version: Spark 2.4, Python 3.
  • This job runs: A new script to be authored by you.
  • Security configuration, script libraries, and job parameters:
    • Maximum capacity: 2. This is the minimum and costs about $0.15 per run.
    • Job timeout: 10. Prevents the job from running longer than expected.
  • Click Next and then Save job and edit the script.

5. Editing the Glue script to transform the data with Python and Spark:

Copy the following code to your Glue script editor. Remember to change the bucket name for the s3_write_path variable. Save the code in the editor and click Run job.

#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
#Parameters
glue_db = "glue-demo-edureka-db"
glue_tbl = "read"
s3_write_path = "s3://glue-demo-bucket-edureka/write"
#Log starting time
dt_start ="%Y-%m-%d %H:%M:%S")
print("Start time:", dt_start)
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
#Create a decade column from year
decade_col = f.floor(data_frame["year"]/10)*10
data_frame = data_frame.withColumn("decade", decade_col)
#Group by decade: count movies, get average rating
data_frame_aggregated = data_frame.groupby("decade").agg(
    f.count(f.col("movie_title")).alias("movie_count"),
    f.mean(f.col("rating")).alias("rating_mean"),
)
#Sort by the number of movies per decade
data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("movie_count"))
#Print result table
#Note: show() is an action. Actions force the execution of the data frame plan.
#With big data the slowdown would be significant without caching.
#Create just 1 partition, because there is so little data
data_frame_aggregated = data_frame_aggregated.repartition(1)
#Convert back to dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")
#Write data back to S3
    frame = dynamic_frame_write,
    connection_type = "s3",
    connection_options = {
        "path": s3_write_path,
        #Here you could create S3 prefixes according to the values in specified columns
        #"partitionKeys": ["decade"],
    },
    format = "csv"
)
#Log end time
dt_end ="%Y-%m-%d %H:%M:%S")
print("End time:", dt_end)

The detailed explanations are commented in the code. Here is the high-level description:

Read the movie data from S3

Get movie count and rating average for each decade

Write aggregated data back to S3

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. A relatively long duration is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, 2000 decade has 3 movies in IMDB top 10 with average rating 8.9. You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.
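The figures above can be verified with plain Python on the sample data, as a quick sanity check independent of Glue:

```python
from collections import defaultdict

# (title, year, rating) rows from the sample file above
movies = [
    ("The Shawshank Redemption", 1994, 9.2),
    ("The Godfather", 1972, 9.2),
    ("The Godfather: Part II", 1974, 9.0),
    ("The Dark Knight", 2008, 9.0),
    ("12 Angry Men", 1957, 8.9),
    ("Schindler's List", 1993, 8.9),
    ("The Lord of the Rings: The Return of the King", 2003, 8.9),
    ("Pulp Fiction", 1994, 8.9),
    ("The Lord of the Rings: The Fellowship of the Ring", 2001, 8.8),
    ("Fight Club", 1999, 8.8),
]

by_decade = defaultdict(list)
for _, year, rating in movies:
    by_decade[year // 10 * 10].append(rating)

# Sort by movie count per decade, like the Glue job does
for decade in sorted(by_decade, key=lambda d: -len(by_decade[d])):
    ratings = by_decade[decade]
    print(decade, len(ratings), round(sum(ratings) / len(ratings), 1))
```

The 2000s row indeed shows 3 movies with an average rating of 8.9.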

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of the output files.

With this, we have come to the end of this article on AWS Glue. I hope you have understood everything that I have explained here.

If you found this AWS Glue article relevant, you can check out Edureka’s live and instructor-led course on AWS training in Chennai, co-created by industry practitioners.

Got a question for us? Please mention it in the comments section of this article and we will get back to you.


Original article source at:

#aws #process 


Process: An Async Process Dispatcher for Amp


This package provides an asynchronous process dispatcher that works on all major platforms (including Windows).

As Windows pipes are file handles and do not allow non-blocking access, this package makes use of a process wrapper that provides access to these pipes via sockets. On Unix-like systems it uses the standard pipes, as these can be accessed without blocking. Concurrency is managed by the Amp event loop.


Installation

This package can be installed as a Composer dependency.

composer require amphp/process


Requirements

  • PHP 7.0+


Versioning

amphp/process follows the semver semantic versioning specification, like all other amphp packages.


Security

If you discover any security-related issues, please report them by email instead of using the issue tracker.

Download Details:

Author: Amphp
Source Code: 
License: MIT license

#php #async #process 

Bongani Ngema


Implement a Data Quality Process

Data quality is a crucial element of any successful data warehouse solution. As the complexity of data warehouses increases, so does the need for data quality processes. In this article, Toptal Data Quality Developer Alexander Hauskrecht outlines how you can ensure a high degree of data quality and why this process is so important.

Data Quality (DQ) in data warehouse systems is getting more and more important. Increasing regulatory requirements, but also the growing complexity of data warehouse solutions, force companies to intensify (or start) a data quality initiative.

This article’s main focus will be on “traditional” data warehousing, but data quality is also an issue in more “modern” concepts such as data lakes. It will show some main points to consider and also some common pitfalls to avoid when implementing a data quality strategy. It does not cover the part on choosing the right technology/tool to build a DQ framework.

One of the most obstructive problems of a DQ project is the fact that at first sight, it creates a lot of work for the business units without providing any extra functionality. A data quality initiative usually only has strong proponents if:

  • There are data quality issues with a severe impact on the business.
  • Regulatory bodies enforce data quality standards (e.g., BCBS 239 in the finance industry).

DQ’s treatment is similar to that of testing in software development—if a project runs out of time and/or budget, this part tends to be reduced first.

This, of course, is not the whole truth. A good data quality system helps detect errors early, thus speeding up the process of delivering data of “good enough” quality to the users.

Definition of Terms

Before discussing the topic, a common understanding of the terms used is important.

Data Warehouse (DWH)

A data warehouse (DWH) is a non-operational system mainly used for decision support. It consolidates the data of the operational systems (all of them or a smaller subset) and provides query-optimized data for the users of the DWH system. The data warehouse should provide “a single version of truth” within the enterprise. A data warehouse is usually built of stages/layers:


Common data warehouse layers

Figure 1: Common data warehouse layers.


The operational data is stored mostly unchanged into a staging layer. The core layer contains consolidated and unified data. The next optional stage is a derivation area, providing derived data (for example, a customer score for sales) and aggregations. The data mart layer contains data optimized for a given group of users. Data marts often contain aggregations and lots of derived metrics. Data warehouse users often work only with the data mart layer.

Between each stage, some kind of data transformation takes place. Usually, a data warehouse is periodically loaded with delta extractions of the operational data and contains algorithms to keep historical data.

Data Quality

Data quality is usually defined as a metric on how well a product meets user requirements. Different users might have different requirements for a product so the implementation depends on the user’s perspective, and it is important to identify these needs.

Data quality does not mean the data has to be completely or almost error-free: it depends on the users' requirements. A "good enough" approach is a good choice to start with. Nowadays, bigger companies have a data (or information) governance policy, and data quality is a part of it. A data governance policy should describe how your company deals with data and how it makes sure that data has the right quality and that data privacy rules are not violated.

Data quality is an ongoing topic. A DQ circuit loop has to be implemented (see next chapter). Regulatory requirements and compliance rules also have an impact on the data quality needed, such as TCPA (US Telephone Consumer Protection Act) or GDPR in Europe for privacy issues, but also industry-specific rules like Solvency II for insurances in the EU, BCBS 239 and others for banking, and so on.

Data Quality Circuit Loop

As with all quality topics, DQ is an ongoing activity designed to maintain satisfactory quality. As a result of a DQ project, a circuit loop similar to the one below has to be implemented:


Data quality circuit loop

Figure 2: Data quality circuit loop.


The steps within this loop will be described in the next chapters.

Data Quality Roles

To implement a successful DQ initiative, the following roles are needed:

  • Data Owner. A data owner is responsible for data quality, but also for data privacy protection. The data owner “owns” a data domain, controls access, and is responsible for assuring data quality and taking action to fix findings. In larger organizations, it’s common to find several data owners. Data domains could be, for example, marketing data, controlling data, etc. If more than one data owner exists in a company, there should be one person (a data owner or someone else) responsible for the overall data quality process. A data owner should have a strong authority to enforce data quality and support the DQ process; therefore, data owners are often senior stakeholders. A good understanding of the business domain along with good communication skills are important.
  • Data Steward. A data steward helps implement data quality within an enterprise, supporting data users on questions about how to interpret data/the data model, data quality issues, etc. Data stewards are often the data owner’s staff or can be organized in a data quality competence center or a DQ team. Data stewards can have an IT or business background but should know both sides. Analytical skills along with a good understanding of the business domain they support, combined with strong communication skills, are chief prerequisites for a successful data steward.
  • Data User. These are data warehouse users working with data. Data users typically work with the data mart layer and are responsible for the results of their work with the data. Data users make sure there are adequate data quality checks for the quality level they need. They need a strong understanding of their data and business domain, along with the analytical skills required to interpret the data. It is reasonable to designate a few data users in every business unit to be responsible for data quality issues.

To ensure success, it is important to have these roles clearly defined and widely accepted within your organization in the early stages of your DQ project. It is equally important to find competent data specialists for these roles who support the project.

Define the Rules

Find and implement useful DQ checks/rules. Defining DQ rules requires a good understanding of your data warehouse and its use.

How to Find DQ Rules?

As discussed earlier, data users (and the data owner) are responsible for data use and therefore also for the needed level of data quality. Data users should have a good understanding of their data so they can give the best input for useful data quality rules.

They are also the ones who analyze the results of the data quality rules, so it is always a good idea to let them define their own rules. This also increases their acceptance of checking and rating the results of the DQ rules assigned to their unit (see “Analyze” chapter).

The drawback of this approach is that data users normally only know the data mart layer, not the earlier layers of the data warehouse. If data was corrupted in the “lower” stages, this won’t be detected by checking just the “top” layer of your data warehouse.

Tackling Errors

What kind of known errors might occur in a data warehouse?

  • Wrong transformation logic in the data warehouse
    • The more complex your IT landscape, the more complex the transformation logic tends to be. These are the most common DQ problems, and the effect of such errors can be “lost” data, duplicates, incorrect values, etc.
  • Unstable load process or wrong handling of loads
    • The load of a data warehouse can be a complex process that might include errors in the definition of the job orchestration (jobs starting too early or too late, jobs not executed, etc.). Errors due to manual intervention (e.g., jobs skipped, jobs started with the wrong due date or with yesterday’s data files) happen often when the load process is run out of band due to some disruption.
  • Wrong data transfer of data sources
    • Data transfer is often implemented as a task of the source system. Anomalies or disruption in the job flows might cause the delivery of empty or incomplete data.
  • Wrong operational data
    • The data in the operational system contains errors that have not been recognized so far. It may sound strange, but it is a common experience in data warehouse projects that the quality of operational data only becomes visible once the data is included in a DWH.
  • Misinterpretation of data
    • The data is correct, but users don’t know how to interpret it right. This is a very common “error” that is not strictly a data quality issue but something that has to do with data governance and is a task for the data stewards.

These problems are often caused by people lacking the appropriate know-how and skills to define, implement, run, and work with a data warehouse solution.

Data Quality Dimensions

DQ dimensions are a common way to identify and cluster DQ checks. There are many definitions, and the number of dimensions varies considerably: you might find 16 or even more dimensions. From a practical perspective, it is less confusing to start with a few dimensions and establish a general understanding of them among your users.

  • Completeness: Is all the data required available and accessible? Are all sources needed available and loaded? Was data lost between stages?
  • Consistency: Is there erroneous/conflicting/inconsistent data? For example, the termination date of a contract in a “Terminated” state must contain a valid date higher than or equal to the start date of the contract.
  • Uniqueness: Are there any duplicates?
  • Integrity: Is all data linked correctly? For example, are there orders linking to nonexistent customer IDs (a classic referential integrity problem)?
  • Timeliness: Is the data current? For example, in a data warehouse with daily updates, I would expect yesterday’s data to be available today.
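Dimensions like these map naturally onto queries. The following sketch uses an in-memory SQLite database with invented table and column names to show what a uniqueness and an integrity check might look like; it is illustrative only, not a prescribed implementation:

```python
import sqlite3

# Hypothetical mini-schema standing in for a data mart layer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (11, 1), (12, 99);
""")

# Uniqueness: duplicate order rows.
dups = conn.execute("""
    SELECT order_id, COUNT(*) FROM orders
    GROUP BY order_id, customer_id HAVING COUNT(*) > 1
""").fetchall()

# Integrity: orders linking to nonexistent customer IDs.
orphans = conn.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print(len(dups), len(orphans))
```

Each query returns the offending rows, which can then be counted and rated as described below.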

Data generated by the data warehouse load process can be helpful as well.

  • Tables with discarded data. Your data warehouse might have processes to skip/delay data that can’t be loaded due to technical issues (e.g., format conversion, missing mandatory values, etc.).
  • Logging information. Noticeable problems might be written into logging tables or log files.
  • Bill of delivery. Some systems use “bills of delivery” for data provided by operational systems (e.g., number of records, number of distinct keys, sums of values). These can be used for reconciliation checks (see below) between the data warehouse and the operational systems.

Keep in mind that each data quality check has to be analyzed by at least one data user (see “Analyze” chapter) in case errors are found; you’ll need someone responsible and available to look after every check implemented.

Within a complex data warehouse, you might end up with many (sometimes thousands) DQ rules. The process to execute data quality rules should be robust and fast enough to handle this.

Don’t check facts that are guaranteed by technical implementation. For example, if the data is stored in a relational DBMS, it is not necessary to check if:

  • Columns defined as mandatory contain NULL values.
  • The primary key field(s) values are unique in a table.
  • Foreign key values reference existing rows, when relational integrity checks are enabled.

That said, always keep in mind that a data warehouse is in constant change and that the data definition of fields and tables might change over time.

Housekeeping is very important. Rules defined by different data user units might overlap and should be consolidated. The more complex your organization, the more housekeeping will be needed. Data owners should implement a process of rules consolidation as a kind of “data quality for data quality rules.” Also, data quality checks might become useless if the data is no longer used or if its definition has changed.

Classes of Data Quality Rules

Data quality rules can be classified based on the type of test.

  • Data quality check. The “normal” case, checking data within one data warehouse layer (see Figure 1) either within one table or a set of tables.
  • Reconciliation. Rules that check if data was transported correctly between data warehouse layers (see Figure 1). These rules mostly check the DQ dimension of “Completeness.” Reconciliation can use a single-row or a summarized approach. Checking single rows is much more granular, but you’ll have to reproduce the transformation steps (data filtering, changes in field values, denormalization, joins, etc.) between the compared layers. The more layers you skip, the more complex the transformation logic that must be implemented. Therefore, it is a good choice to do reconciliation between each layer and its predecessor instead of comparing the staging layer to the data mart layer. If transformations have to be implemented in reconciliation rules, use the specification, not the data warehouse code! For summarized reconciliation, find meaningful fields (e.g., summarization, count of distinct values, etc.).
  • Monitoring. A data warehouse usually contains historical data and is loaded with delta extracts of operational data. There is the danger of a slowly increasing gap between the data warehouse and the operational data. Building summarized time series of data helps identify issues like this (e.g., comparing last month’s data with the data of the current month). Data users with a good knowledge of their data can provide useful measures and thresholds for monitoring rules.
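A summarized reconciliation of the kind described above can be sketched in a few lines. The layer extracts, keys, and measures below are invented for illustration; in practice they would come from the staging and mart tables:

```python
# Two layers, each reduced to (business_key, amount) rows.
staging = [("c1", 100.0), ("c2", 250.0), ("c3", 75.0)]
mart    = [("c1", 100.0), ("c2", 250.0)]   # "c3" was lost between layers

def summarize(rows):
    # Meaningful summarized measures: row count, distinct keys, value sum.
    return {"count": len(rows),
            "distinct_keys": len({k for k, _ in rows}),
            "total": sum(v for _, v in rows)}

s, m = summarize(staging), summarize(mart)
# Any differing measure signals a completeness problem between the layers.
diffs = {k: (s[k], m[k]) for k in s if s[k] != m[k]}
print(diffs)
```

Here all three measures differ, pointing to the lost "c3" row.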

How to Quantify a Data Quality Issue

Once you have defined what to check, you’ll have to specify how to quantify the identified issues. Information such as “five data rows violate the DQ rule with ID 15” is of little use on its own.

The following parts are missing:

  • How to quantify/count the detected errors. You might count “number of rows,” but you also might use a monetary scale (for example, exposure). Keep in mind that monetary values might have different signs, so you’ll have to investigate how to meaningfully summarize them. You might consider using both quantification units (count of rows and summarization) for a data quality rule.
  • Population. What is the number of units checked by the data quality rule? “Five data rows out of five” has a different quality from “five out of 5 million.” The population should be measured using the same quantification(s) as for the errors. It is common to show the result of a data quality rule as a percentage. The population need not be identical to the number of rows in a table. If a DQ rule checks only a subset of the data (e.g., only terminated contracts in the contracts table), the same filter should be applied to measure the population.
  • Definition of the result. Even if a data quality check finds issues, this does not always have to cause an error. For data quality, a traffic light system (red, yellow, green) using threshold values to rate findings is very helpful. For example, green: 0-2%, yellow: 2-5%, red: above 5%. Keep in mind that if data user units share the same rules, they might have very different thresholds for a given rule. A marketing business unit might not mind a loss of a few orders, whereas an accounting unit might mind even cents. It should be possible to define thresholds on percentage or on absolute figures.
  • Collect sample error rows. It helps if a data quality rule provides a sample of the detected errors—normally, the (business!) keys and the checked data values are sufficient to help examine the error. It is a good idea to limit the number of written error rows for a data quality rule.
  • Sometimes, you might find “known errors” in the data that won’t be fixed but are found by useful data quality checks. For these cases, the use of whitelists (keys of records that should be skipped by a data quality check) is recommended.
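Putting the quantification pieces together (error count, population, percentage, traffic-light thresholds, and a whitelist of known errors), a minimal sketch might look like the following; all keys and threshold values are invented:

```python
def rate(error_count, population, yellow=2.0, red=5.0):
    """Return (percentage, traffic light) for a rule result.
    Thresholds are per data user unit and given as percentages."""
    pct = 100.0 * error_count / population if population else 0.0
    light = "green" if pct < yellow else ("yellow" if pct < red else "red")
    return pct, light

# Business keys of detected errors; "K-17" is a known error on the whitelist.
whitelist = {"K-17"}
errors = ["K-17", "K-42", "K-99"]
effective = [k for k in errors if k not in whitelist]

pct, light = rate(len(effective), population=100)
print(pct, light)
```

With 2 effective errors out of a population of 100, the rule lands exactly on the yellow threshold.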

Other Metadata

Metadata is important for routing findings in the “Analyze” phase and for monitoring the data quality control loop.

  • Checked items. It helps to assign the checked table(s) and field(s) to a data quality rule. If you have an enhanced metadata system, this might help to automatically assign data users and a data owner to this rule. For regulatory reasons (such as BCBS 239), it is also necessary to prove how the data is checked by DQ. However, assigning rules automatically to data users/data owners via data lineage (*) might be a double-edged sword (see below).
  • Data user. Every DQ rule must have at least one data user/data user unit assigned to check the result during the “Analyze” phase and decide if and how a finding influences their work with the data.
  • Data owner. Every DQ rule must have a data owner assigned.

(*) Data lineage shows the flow of data between two points. With data lineage, you can find all data elements influencing a given target field within your warehouse.

Using data lineage to assign users to rules can be problematic. As mentioned before, business users usually know only the data mart layer (and the operational system), but not the lower levels of the data warehouse. By mapping via data lineage, data users will be assigned rules they’re not familiar with. For the lower levels, IT staff may be needed to evaluate a data quality finding. In many cases, a manual mapping or a mixed approach (mapping via data lineage only within the data mart) can help.

Measuring Data Quality

Measuring data quality means executing the available data quality rules, which should be done automatically, triggered by the load processes of the data warehouse. As we’ve seen before, there might be a remarkable number of data quality rules, so the checks will be time-consuming.

In a perfect world, a data warehouse would be loaded only if all data is error-free. In the real world, this is seldom (realistically, almost never) the case. Depending on the overall loading strategy of your data warehouse, the data quality process may or may not (the latter is far more likely) govern the load process. It is a good design to have data quality processes (job networks) running in parallel and linked to the “regular” data warehouse load processes.

If there are defined service-level agreements, make sure not to thwart the data warehouse loads with the data quality checks. Errors/abends in data quality processes should not stop the regular load process. Unexpected errors within the data quality processes should be reported and surfaced for the “Analyze” phase (see next chapter).

Keep in mind that a data quality rule might crash because of unexpected errors (maybe the rule itself was wrongly implemented, or the underlying data structure changed over time). It would help if your data quality system provided a mechanism to deactivate such rules, especially if your company has few releases per year.

DQ processes should be executed and reported as early as possible—ideally, right after the data checked was loaded. This helps detect errors as early as possible during the load of the data warehouse (some complex warehouse system loads have a duration of several days).


Analyze

In this context, “analyze” means reacting to data quality findings. This is a task for the assigned data users and the data owner.

The way to react should be clearly defined by your data quality project. Data users should be obligated to comment on a rule with findings (at least rules with a red light), explaining what measures are being taken to handle the finding. The data owner needs to be informed and should decide together with the data user(s).

The following actions are possible:

  • Serious problem: The problem has to be fixed and the data load repeated.
  • Problem is acceptable: Try to fix it for future data loads and handle the problem within the data warehouse or the reporting.
  • Defective DQ rule: Fix the problematic DQ rule.

In a perfect world, every data quality problem would be fixed. However, lack of resources and/or time often results in workarounds.

To be able to react in time, the DQ system must inform the data users about “their” rules with findings. Using a data quality dashboard (maybe with sending messages that something came up) is a good idea. The earlier the users are informed about findings, the better.

The data quality dashboard should contain:

  • All rules assigned to a given role
  • The rules’ results (traffic light, measures, and example rows) with the ability to filter rules by result and data domain
  • A mandatory comment that data users have to enter for findings
  • A feature to optionally “overrule” the result (if the data quality rule reports errors due to a defective implementation, for example). If more than one business unit has the same data quality rule assigned, “overruling” is only valid for the data user’s business unit (not the whole company).
  • Showing rules that were not executed or that abended

The dashboard should also show the current status of the most recent data warehouse load, giving users a 360-degree view of the load process.

The data owner is responsible for making sure that every finding was commented on and the status of the data quality (original or overruled) is at least yellow for all data users.

For a quick overview, it helps to build simple KPIs (key performance indicators) for data users/the data owner. Having an overall traffic light for all associated rules’ results is quite easy if each rule is given the same weight.

Personally, I think computing an overall value of data quality for a given data domain is rather complex and tends to be cabalistic, but you could at least show the number of overall rules grouped by result for a data domain (e.g., “100 DQ rules with 90% green, 5% yellow, and 5% red results”).
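An equal-weight roll-up like the one just described can be sketched briefly; the result distribution below mirrors the 90/5/5 example and is purely illustrative:

```python
from collections import Counter

# Severity ordering for the traffic lights.
ORDER = {"green": 0, "yellow": 1, "red": 2}

def overall_light(results):
    """With equal weights, the overall light is the worst individual
    result; the counts per result give a quick KPI."""
    counts = Counter(results)
    worst = max(results, key=lambda r: ORDER[r])
    return worst, counts

results = ["green"] * 90 + ["yellow"] * 5 + ["red"] * 5
worst, counts = overall_light(results)
print(worst, dict(counts))
```

One red rule is enough to turn the overall light red, which is exactly the conservative behavior an equal-weight scheme implies.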

It is the data owner’s task to ensure that the findings will be fixed and data quality improved.

Improving Processes

As the data warehouse processes often change, the data quality mechanism also needs maintenance.

A data owner should always take care of the following points:

  • Keep it up to date. Changes in the data warehouse need to be caught in the data quality system.
  • Enhance. Implement new rules for errors that are not covered by data quality rules yet.
  • Streamline. Disable data quality rules that are no longer needed. Consolidate overlapping rules.

Monitoring Data Quality Processes

Monitoring the entire data quality process helps to improve it over time.

Things worth watching would be:

  • The coverage of your data with data quality rules
  • The percentage of data quality findings within the active rules over time
  • The number of active data quality rules (Keep an eye on it—I have seen data users solving their findings by simply disabling more and more data quality rules.)
  • The time needed within a data load to have all findings rated and fixed


Tips for a Successful DQ Project

Many of the following points are important in any kind of project.

Anticipate resistance. As we have seen, if there is no urgent quality issue, data quality is often viewed as an additional burden without offering new functionality. Keep in mind that it might create additional workload for the data users. In many cases, compliance and regulatory demands can help you to convince the users to see it as an unavoidable requirement.

Find a sponsor. As noted above, DQ is not a fast-selling item, so a powerful sponsor/stakeholder is needed—the higher in the management, the better.

Find allies. As with the sponsor, anyone who shares the idea of strong data quality would be most helpful. The DQ circuit loop is an ongoing process and needs people to keep the circuit loop alive.

Start small. If there’s been no DQ strategy so far, look for a business unit that needs better data quality. Build a prototype to show them the benefit of better data. If your task is to improve or even replace a given data quality strategy, look at things working well/being accepted in the organization, and keep them.

Don’t lose sight of the whole picture. Although starting small, keep in mind that some points, especially the roles, are prerequisites for a successful DQ strategy.

Once implemented, don’t let go. The data quality process needs to be part of data warehouse use. Over time, focus on data quality tends to get a bit lost, and it’s up to you to maintain it.

#data #process #database 

Implement a Data Quality Process
Sheldon Grant


Requirements Elicitation Processes and its Benefits

Introduction to Requirements Elicitation

Elicitation? It must be an easy thing!

Let me try. Okay, so first, I have to do what? Wait, this isn't that easy…

Exactly! It looks like elicitation is easy, but it's not, so let's understand first what requirement elicitation is in Project management.

Many processes and techniques can be used to manage projects effectively in the software development world. One of the most challenging tasks in any project is eliciting requirements. With so many different ways to go about it, it's easy to get lost and end up with a document that's nothing more than fragmented ideas instead of precise specifications. Eliciting requirements is an art and not a science. It requires you to use your instincts and engage with the client on a different level. Having said that, you can use various techniques and methodologies as a software engineer or project manager to streamline the process of getting useful information from your stakeholders. With so many software development methodologies available in the market today, it can take time for new entrants to choose which will work best for their specific needs.

What is Requirements Elicitation?

The first step in the software development process is requirements engineering: turning a business or organizational problem into a set of requirements for a new software application. In this step, a problem is identified and a set of requirements is developed. The requirements include information about the problem, the stakeholders, their needs, and the reason for building the application. The problem statement is the primary input for this process.

The crucial part is that we need to gather information, and not just any information but the correct information, connecting with stakeholders to understand precisely what they are looking for.

What are the benefits and importance of Requirements Elicitation?

  • Requirements Analysis: It helps identify and understand the problem to be solved, along with the user needs behind it. It is the first step toward designing and developing a solution.
  • Requirements Documentation: Requirements documentation is necessary to make sure that all stakeholders are on the same page and agree on what is required to be done, how it will be done, and what technology will be used to deliver the results.
  • Requirements Traceability: It ensures that each requirement can be traced back to its owner and links to the problem statement and other related requirements. This is important when changes need to be made to the requirements.
  • Requirements Prioritization: This is where the business and domain experts will come together to determine which requirements are most important and need to be implemented first.
  • A good understanding of the problem will help you to deliver a solution that is more likely to be successful.
  • Clear and well-documented requirements will help stakeholders and team members stay on the same page and agree on what needs to be done.
  • Documented requirements will also help minimize the number of defects in the resulting product.
  • A proper requirements engineering process can help the organization save money and increase productivity.
  • A proper requirements engineering process will help you be more successful in the job hunt because it will showcase your ability to understand the problem and deliver a solution to meet the stakeholders' requirements.
  • It can help you to decide which software development methodologies will work best for your specific needs.

What are the best ways to do Requirements Elicitation?

  • Build Rapport: This is the first step toward eliciting requirements. Building rapport with your stakeholders will help them feel more comfortable sharing their problems and needs, making them more likely to respond to your questions and suggestions.
  • Ask Open-Ended Questions: Closed-ended questions are more likely to result in "yes" or "no" answers. These are not very helpful when you are trying to get a good grasp on the problem and get a good set of requirements.
  • Write Down Everything: This is an essential part of the process. You will want to write down everything your stakeholders say, even if it sounds silly or unrelated. You can sort through the information and determine what is useful and what isn't once you have finished asking questions.
  • Get as Many Stakeholders as Possible: It's not enough to talk to just one person. You will want to talk to as many stakeholders as possible to get a well-rounded picture of the problem and a good set of requirements.
  • Take Your Time: You don't want to be in too much of a rush. This is not a quick process. It will take time to meet with stakeholders, ask questions, and get a good grasp of the problem and the requirements.

What are the processes of Requirements Elicitation?

The significant steps involved in this procedure are:

  1. Identify all the stakeholders, e.g., users, developers, customers, etc.
  2. List out all the requirements of the customers.
  3. Assign a value indicating the degree of importance to each requirement.
  4. Finally, categorize the final list of requirements as:
     • What is possible to achieve
     • What should be deferred, and the reason for it
     • What is impossible to achieve and should be dropped
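The steps above can be sketched in code. The requirement names, importance scores, and feasibility flags below are invented purely to illustrate the prioritization and categorization steps:

```python
# Step 2-3: listed requirements, each with an assigned importance value.
requirements = [
    {"name": "login page",     "importance": 9, "feasible": True},
    {"name": "offline mode",   "importance": 6, "feasible": True},
    {"name": "teleport users", "importance": 1, "feasible": False},
]

# Step 4: categorize - achievable items ordered by importance,
# impossible items dropped from the final list.
achievable = sorted((r for r in requirements if r["feasible"]),
                    key=lambda r: r["importance"], reverse=True)
dropped = [r["name"] for r in requirements if not r["feasible"]]

print([r["name"] for r in achievable], dropped)
```

The deferred category would work the same way, with an extra field recording the reason for deferral.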


Conclusion

Requirements engineering is turning a business or organizational problem into a set of requirements for a new software application, and it is the first step toward designing and developing a solution. It is essential to understand the problem statement and the user's needs: requirements elicitation helps in identifying and understanding the problem to be solved and the needs of the users. It is equally essential to document requirements so that stakeholders are on the same page and agree on what needs to be done, how it will be done, and what technology will be used to deliver the results.

#process #benefits 

Requirements Elicitation Processes and its Benefits
Dexter Goodwin


Process-warning: A Small Utility for Creating Warnings & Emitting Them


A small utility for generating consistent warning objects across your codebase. It also exposes a utility for emitting those warnings, guaranteeing that they are issued only once.

This module is used by the Fastify framework and it was called fastify-warning prior to version 1.0.0.


npm i process-warning


The module exports a builder function that returns a utility for creating warnings and emitting them.

const warning = require('process-warning')()


warning.create(name, code, message)
  • name (string, required) - The error name; you can access it later with error.name. For consistency, we recommend prefixing module error names with {YourModule}Warning
  • code (string, required) - The warning code, you can access it later with error.code. For consistency, we recommend prefixing plugin error codes with {ThreeLetterModuleName}_, e.g. FST_. NOTE: codes should be all uppercase.
  • message (string, required) - The warning message. You can also use interpolated strings for formatting the message.

The utility also contains an emit function that you can use for emitting the warnings you have previously created by passing their respective code. A warning is guaranteed to be emitted only once.

warning.emit(code [, a [, b [, c]]])
  • code (string, required) - The warning code you intend to emit.
  • [, a [, b [, c]]] (any, optional) - Parameters for string interpolation.
const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'message')

How to use an interpolated string:

const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'Hello %s')
warning.emit('FST_ERROR_CODE', 'world')

The module also exports a warning.emitted Map, which contains all the warnings already emitted. This is useful for testing.

const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'Hello %s')
console.log(warning.emitted.get('FST_ERROR_CODE')) // false
warning.emit('FST_ERROR_CODE', 'world')
console.log(warning.emitted.get('FST_ERROR_CODE')) // true

Download Details:

Author: Fastify
License: MIT license

#javascript #fastify #process

Process-warning: A Small Utility for Creating Warnings & Emitting Them
Bongani Ngema


PM2: Advanced Process Manager

P(rocess) M(anager) 2
Runtime Edition  

PM2 is a production process manager for Node.js applications with a built-in load balancer. It allows you to keep applications alive forever, to reload them without downtime and to facilitate common system admin tasks.

Starting an application in production mode is as easy as:

$ pm2 start app.js


Works on Linux (stable), macOS (stable) & Windows (stable). All Node.js versions are supported, starting from Node.js 12.X.

Installing PM2

With NPM:

$ npm install pm2 -g

You can install Node.js easily with NVM or ASDF.

Start an application

You can start any application (Node.js, Python, Ruby, binaries in $PATH...) like this:

$ pm2 start app.js

Your app is now daemonized, monitored and kept alive forever.

Managing Applications

Once applications are started you can manage them easily:

Process listing

To list all running applications:

$ pm2 list

Managing apps is straightforward:

$ pm2 stop     <app_name|namespace|id|'all'|json_conf>
$ pm2 restart  <app_name|namespace|id|'all'|json_conf>
$ pm2 delete   <app_name|namespace|id|'all'|json_conf>

To have more details on a specific application:

$ pm2 describe <id|app_name>

To monitor logs, custom metrics, application information:

$ pm2 monit

More about Process Management

Cluster Mode: Node.js Load Balancing & Zero Downtime Reload

The cluster mode is a special mode for starting a Node.js application: it starts multiple processes and load-balances HTTP/TCP/UDP queries between them. This increases overall performance (by a factor of x10 on 16-core machines) and reliability (faster socket re-balancing in case of unhandled errors).


Starting a Node.js application in cluster mode that will leverage all CPUs available:

$ pm2 start api.js -i <processes>

<processes> can be 'max', -1 (all CPUs minus 1) or a specified number of instances to start.

Zero Downtime Reload

Hot Reload allows you to update an application without any downtime:

$ pm2 reload all

More information about how PM2 makes clustering easy

Container Support

With the drop-in replacement command for node, called pm2-runtime, run your Node.js application in a hardened production environment. Using it is seamless:

RUN npm install pm2 -g
CMD [ "pm2-runtime", "npm", "--", "start" ]

Read More about the dedicated integration

Host monitoring speedbar

PM2 allows you to monitor your host/server vitals with a monitoring speedbar.

To enable host monitoring:

$ pm2 set pm2:sysmonit true
$ pm2 update


Terminal Based Monitoring


Monitor all processes launched straight from the command line:

$ pm2 monit

Log Management

To consult logs just type the command:

$ pm2 logs

Standard, raw, JSON and formatted output are available.


$ pm2 logs APP-NAME       # Display APP-NAME logs
$ pm2 logs --json         # JSON output
$ pm2 logs --format       # Formatted output

$ pm2 flush               # Flush all logs
$ pm2 reloadLogs          # Reload all logs

To enable log rotation, install the following module:

$ pm2 install pm2-logrotate

More about log management

Startup Scripts Generation

PM2 can generate and configure a Startup Script to keep PM2 and your processes alive at every server restart.

Init Systems Supported: systemd, upstart, launchd, rc.d

# Generate Startup Script
$ pm2 startup

# Freeze your process list across server restart
$ pm2 save

# Remove Startup Script
$ pm2 unstartup

More about Startup Scripts Generation

Updating PM2

# Install latest PM2 version
$ npm install pm2@latest -g
# Save process list, exit old PM2 & restore all processes
$ pm2 update

PM2 updates are seamless

PM2+ Monitoring

If you manage your apps with PM2, PM2+ makes it easy to monitor and manage apps across servers.

Feel free to try it:

Discover the monitoring dashboard for PM2

Thanks in advance and we hope that you like PM2!





PM2 is constantly assailed by more than 1800 tests.

Author: Unitech

#node #process 

PM2: Advanced Process Manager
Nat Grady


Continuous Machine Learning project integration with DVC


Data Version Control or DVC is an open-source tool for data science and machine learning projects. It allows for different versioning and management of datasets and Machine Learning models.

Using GitHub Actions, we can generate ML results for every pull request and have the information we need right there.

DVC Command Cheat Sheet

git init
dvc init
git commit -m "initial commit"

#set dvc remote

dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA/dvcstore
git commit -m "sets dvc remote"

#process which repeats after any modification of the data (new version)

#adds file to dvc and .gitignore

dvc add path_to_data
git add .gitignore path_to_data.dvc
git commit -m "data:track"

#tags data version on git

git tag -a 'v1' -m "raw data"
dvc push

Go ahead and delete the local copy of the data; it may also remain in .dvc/cache.

To get the data back: dvc pull
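To restore an earlier data version later, a sketch under the cheat sheet's assumptions (the `v1` tag and `path_to_data` from above; requires `dvc` installed and the remote reachable):

```shell
# Check out the .dvc pointer file as it was at tag v1,
# then let DVC restore the matching data from cache or remote.
git checkout v1 -- path_to_data.dvc
dvc checkout
```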

Download Details: 
Author: Azariagmt
Source Code: 
License: BSD-2-Clause License


Continuous Machine Learning project integration with DVC

NodeJS with TypeScript - Process, Modules and Writing To File #2

Welcome to the NodeJS series with TypeScript. In this series I will try to introduce you to NodeJS and show how to build a simple backend with the http module. We will learn the ins and outs of file operations in NodeJS. All in a type-safe way thanks to TypeScript.

In the second episode we will take a look at processes in NodeJS. I will try to explain the basics of creating modules in NodeJS and how to get user input into your program!

Table of contents:
00:00 Process in NodeJS
19:40 Basics of Modules And Writing To Files
22:33 Reading the user input
30:33 Using inquirer package
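As a taste of what the episode covers, here is a minimal sketch (file name and contents are illustrative, not the episode's actual code) of the `process` global and synchronous file writing in Node:

```typescript
import { writeFileSync, readFileSync } from "fs";

// The process global: command-line arguments and platform info.
// process.argv is [node path, script path, ...user args].
const args: string[] = process.argv.slice(2);
console.log("args:", args, "on", process.platform);

// Writing to a file, then reading it back synchronously.
writeFileSync("note.txt", "hello from node\n");
const text: string = readFileSync("note.txt", "utf8");
console.log(text.trim());
```

Run it with ts-node, or compile with tsc and run the emitted JavaScript with node.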

You can find me here:



#nodejs #typescript #process

NodeJS with TypeScript - Process,  Modules and Writing To File #2

In Unix/Linux, what are process states?

How to find the process state?
You can find the process state from the following sources:
a. The Unix/Linux command-line tool ‘top’ reports the process state in the column ‘S’, as a single character.
b. You can use web-based root cause analysis tools like yCrash, which report the process states.
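As a concrete illustration (assuming a Linux host with a procps-style `ps`; the column names follow the `ps` man page), the single-character state is also visible outside of ‘top’:

```shell
# PID, single-character state and command for the first few processes.
# Common states: R (running), S (sleeping), D (uninterruptible sleep),
# Z (zombie/defunct), T (stopped).
ps -eo pid,stat,comm | head -n 5

# State of the current shell only:
ps -o stat= -p $$
```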

#linux #process #unix/linux

In Unix/Linux, what are process states?

Automation Software landscape

We have mapped over 120 companies; new and old, large and small, according to their subsegment and the precise type of automation they provide. Find an intro & explanation to the map below.

The process automation space (RPA, Robotic Process Automation) continued to grow strongly in 2020, despite — or perhaps thanks to — the global shift to working from home. Software companies automating business processes raised over $1.9 billion in 2020; the broader automation software space raised over $11 billion, according to PitchBook data.

Fittingly for 2020, the most funding for a single company went to Olive, an AI-driven workflow automation platform for hospitals and healthcare systems. They raised over $380 million last year across three separate rounds, with investors including General Catalyst, Tiger Global Management and Sequoia Capital.

#process #intelligent-automation #rpa #artificial-intelligence #software

Automation Software landscape
Dexter  Goodwin

Dexter Goodwin


Use Your Own Template With Create-React-App

Start all your projects quickly, nicely, and cleanly

create-react-app is an amazing tool, described as **the best way to start building a new single-page application in React** in the official documentation.

By making use of react-scripts, it offers a modern build setup with no configuration.

Enhance it

While it may well be a fantastic tool, chances are your projects always start with a big overhead. If your first step after running **CRA** is to spend a day setting up default functionality and configuration, creating your own template will definitely boost your productivity.


Using Git the right way can resolve a lot of headaches around creating and maintaining your custom template.

Using a fork of create-react-app, and thereby keeping its Git history, allows you to later sync your fork with the official repository.

If you are a GitHub user, you only need to use the **Fork** button on the homepage of create-react-app.

If you use any other Git repository hosting service, the approach is slightly different. It consists of creating your own repository and adding an upstream remote:

$ mkdir create-react-app
$ cd create-react-app
$ git init
$ git remote add origin <YOUR_REMOTE_REPOSITORY>
$ git remote add upstream
$ git pull upstream master
$ git push origin master
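Once the upstream remote exists, syncing your fork later is a sketch like the following (assuming the fork's default branch is `master`, as in the commands above):

```shell
# Pull the latest official changes into your fork
$ git fetch upstream
$ git merge upstream/master
$ git push origin master
```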

#process #javascript #typescript #productivity

Use Your Own Template With Create-React-App

Salman Ahmad


Two Advanced Tips for Event Logs in Power BI

It’s an old trope, but the companies of the world are sitting on a goldmine of insights locked away in data that no one has ever looked at. This hidden data is nowhere more common than in server or machine event logs, and over the past decade there has been an explosion in full-text search and log aggregation tools. Software like Elasticsearch and Splunk have internet-scale superpowers at ingesting and dashboarding real-time events.

For the humble analyst these new methods allow the indexing of literally everything. This means that counting the number of weird log-in attempts or the average temperature on one of 10,000 industrial sensors can be as easy as a Google search.

#logs #process #data-science #regex #power-bi

Two Advanced Tips for Event Logs in Power BI