A zombie is a creature that was once human and died, but, because of a virus or some other cause, woke up again: it is already dead, yet it keeps walking and moving. That is the concept of a zombie described in movies and novels. In the same way, in Linux a zombie process is a process that has already terminated and is marked "defunct", yet still keeps an entry in the system's process table. Every child process passes through this zombie state briefly before it is removed from the process table.
A zombie process is also called a process in the terminated state. It is cleaned up with the help of its parent process. When the parent process is not notified of the change (or ignores the notification), the child becomes a zombie: it never receives the acknowledgment it needs for its entry to leave the process table. In Linux, whenever a child process terminates, its parent process is informed of the termination, and the child's entry stays in the process table until the parent has collected its exit status.
This means that the dead process is not removed from the process table immediately; its leftover entry lingers in the system, and that entry is the zombie process. To remove a zombie process, the parent process calls the wait() function. Once wait() has been called and the exit status collected, the zombie process is completely removed from the system.
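To see this behavior concretely, here is a minimal Python sketch (not from the original article, Unix only) that forks a child, lets it exit, shows the defunct entry with ps, and then reaps it with waitpid():
import os
import subprocess
import time

pid = os.fork()
if pid == 0:
    # Child: exit right away. Until the parent calls wait(), the kernel
    # keeps this PID in the process table as a zombie ("Z"/defunct).
    os._exit(0)

time.sleep(1)  # give the child a moment to terminate
subprocess.run(["ps", "-o", "pid,stat,cmd", "-p", str(pid)])  # STAT column shows "Z"
os.waitpid(pid, 0)  # the parent reaps the child; the zombie entry disappears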
Before moving to kill the zombie process, we will first discuss the zombie process risks and the causes of the zombie process taking place. Also, we will learn more about the zombie process to make it easy to understand the killing process.
There are two major causes of zombie processes. The first is a parent process that fails to call the wait() function while its child is running, so the SIGCHLD signal sent at the child's exit is ignored and the child becomes a zombie. The second is another application interfering with the parent process's execution, whether through bad coding or malicious content.
In other words, a zombie process appears when the parent process ignores the child's state changes or cannot check the child's state; when the child process ends, its process control block (PCB) is not cleared.
A zombie process does not pose much risk on its own; it only occupies a small entry in the system. However, the process table has a limited size, and the table slot (and PID) held by a zombie cannot be reused until it is released. If many zombie processes pile up and reserve these entries, there may be no room left for new processes, and it becomes difficult for other processes to run.
Before killing zombie processes, it is necessary to find them. To find zombie processes, we run the following command in the terminal:
linux@linux-VirtualBox:~$ ps aux | egrep "Z|defunct"
In the command above, “ps” stands for process status; it is used to view the state of the processes running on the system. We passed the flags aux, in which “a” lists the processes of all users, “u” adds user-oriented details for each process, and “x” includes processes that were not started from a terminal. In combination, the command prints all of the running processes known to the system.
The output is piped to “egrep”, a pattern-matching tool used to filter lines against an expression. Lastly, we passed the pattern “Z|defunct”, which matches the “Z” state flag or the “defunct” label that marks zombie processes. When we execute the command, we get the following output, which shows any zombie processes in the system along with their “PID”.
Output:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
linux 33819 0.0 0.0 18008 724 pts/0 S+ 20:22 0:00 grep -E --color=auto Z|defunct
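If you prefer to script this check instead of parsing ps output, a rough Python equivalent using the third-party psutil package (an assumption on our part, not something the article relies on) could look like this:
import psutil  # third-party package: pip install psutil

# Same idea as `ps aux | egrep "Z|defunct"`: list processes in the zombie state.
for proc in psutil.process_iter(["pid", "name", "status", "ppid"]):
    if proc.info["status"] == psutil.STATUS_ZOMBIE:
        print(proc.info["pid"], proc.info["name"], "parent:", proc.info["ppid"])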
Zombie processes are already dead; their parent has simply not read their exit status, so their entries cannot be released. A dead process cannot be killed again. All we can do is prompt the parent process to read the child's state so the entry can be reaped and removed from the process table. For this, we will run the command mentioned below.
linux@linux-VirtualBox:~$ ps -o ppid= -p 33819
In the above command, we tried to get the parent id of the zombie process. After getting the parent id, we will run the following command to kill the zombie process by sending the SIGCHLD to the parent process which enables the parent process to read the child state:
linux@linux-VirtualBox:~$ kill -s SIGCHLD Parent_PID
In the command above, we send the SIGCHLD signal to the parent PID so that the parent reaps its zombie child. After the command is executed, it simply moves to the next line without printing any output; if the given parent PID does not exist, nothing happens. To check whether the zombie process has actually been removed, you can rerun the command we used earlier to find zombie processes.
Let us try another way to kill the zombie process: killing the parent process itself. This is the more forceful way to get rid of a zombie, because once the parent is gone the orphaned zombie is adopted and reaped by init (or systemd), so it cannot linger again. For that, we run the command shown below:
linux@linux-VirtualBox:~$ kill -9 Parent_PID
After running the above command, we allow the system to kill the parent process.
We have briefly discussed zombie processes, their causes, and the procedure to get rid of them. Before getting to the removal steps, we explained why zombies appear and how to identify them using simple commands.
Original article source at: https://linuxhint.com/
In computer networking, a port represents a logical entry and exit point for a connection. Ports are based on software and are entirely virtual. These ports on a computer are managed by the operating system.
This quick tutorial demonstrates the various methods to determine which Linux process or service is currently listening on a specific port. Let’s talk about ports and their purpose.
Just as physical ports help to interact with various peripheral devices connected to a computer, ports help the different services to communicate with each other. These services can be on the same computer or on different computers.
To listen for incoming connection requests, a process associates itself with a port number. Most services have a default port defined in their specification, and they keep using that port; they do not switch to another port unless their configuration is explicitly modified.
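As an illustration of what "associating with a port" means in code, here is a small Python sketch; the port number 8080 is a hypothetical choice for the example:
import socket

PORT = 8080  # hypothetical port chosen for this example

# Associate this process with a port and listen; while the socket is open,
# tools such as lsof or ss will show this process as the listener on 8080.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", PORT))
srv.listen()
print(f"listening on port {PORT}; try: sudo lsof -i TCP:{PORT}")
srv.accept()  # blocks until a client connects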
A few examples of protocols and their associated default ports include the Secure Shell (SSH) protocol (port 22), the Apache HTTP server (port 80), the MySQL database server (port 3306), and so forth. You may use this information to discover which default port a service utilizes.
The config file of these services can be edited to use some other port as well.
Let’s now see how to check what port/ports a process is using on Linux. Here, we will show you the different commands for this purpose.
The lsof utility is helpful for obtaining a list of the ports used by your system. Let’s consider the following example to get information about the process (or processes) using TCP port 22:
$ sudo lsof -i TCP:22
The lsof command gives more information like the user’s name and what process IDs are linked to each process. It works with both TCP and UDP ports.
The ss command is another way to find out which processes are linked to a certain port. Although lsof is the more commonly used tool, some people may find ss handier.
Let’s look for the processes or services that listen on port 3306:
$ sudo ss -tunap | grep :3306
Let’s break down this command:
1. t: It tells the ss command to display the TCP sockets.
2. u: It tells the ss command to display the UDP sockets.
3. n: It is used to display the port numbers instead of their translations.
4. a: It is used to display the listening as well as non-listening sockets of all types.
5. p: It is used to display the processes that utilize a socket.
The result of the previous command shows which process is utilizing which port. You may also issue the following command:
$ sudo ss -tup -a sport = :80
Here, sport signifies the source port.
These two approaches may help you find the IDs of the processes that are connected to different ports.
The netstat command shows the information about your network and can be used to fix the problems or change the way that your network is set up. It can also keep a close watch on your network connections.
This command is often used to see information about inbound and outbound connections, routing tables, port listening, and usage statistics. Although it has been considered obsolete in recent years, netstat is still a useful tool for analyzing networks.
With the grep command, netstat can determine which process or service is using a certain port (by mentioning the port):
$ sudo netstat -ltnp | grep -w ':80'
The options used here can be classified as follows:
1. t: It shows only the TCP connections.
2. l: It shows only the listening sockets.
3. n: It displays addresses and port numbers in numerical format.
4. p: It displays the PID and program name which are associated with each socket.
The fuser command determines the processes that utilize the files or sockets. You can use it to list the services which run on a specific port. Let’s take the example of port 3306 and see what services are running here:
$ sudo fuser 3306/tcp
This provides us with the process numbers using this port. You can use this process number to find the corresponding process names. For example, if the process number is 15809, the command to use here is as follows:
$ ps -p 15809 -o comm=
Sometimes, however, you need to identify processes that listen on non-standard ports. lsof is well suited for discovering which services are listening and which ports they use. Consider the following example, which lists the UDP and TCP listening ports:
$ sudo lsof -Pni | egrep "(UDP|LISTEN)"
The following is a description of the options that are used here:
1. P: It suppresses the port service name lookup.
2. n: It displays the numeric network addresses.
3. i: It lists the IP sockets.
Both the ports and the associated processes are shown in the resulting output. This approach is particularly useful for processes listening on non-default ports.
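If you want the same kind of listing from a script rather than the command line, a rough Python sketch using the third-party psutil package (assumed to be installed; run it with sudo to see processes owned by other users) might look like this:
import socket
import psutil  # third-party package: pip install psutil

# Listening TCP sockets plus UDP sockets with the owning PID and process name --
# roughly what `sudo lsof -Pni | egrep "(UDP|LISTEN)"` reports.
for conn in psutil.net_connections(kind="inet"):
    if conn.status != psutil.CONN_LISTEN and conn.type != socket.SOCK_DGRAM:
        continue
    name = psutil.Process(conn.pid).name() if conn.pid else "?"
    print(f"{conn.laddr.ip}:{conn.laddr.port}  pid={conn.pid}  {name}")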
In this article, we talked about four possible Linux command-line tools and provided the examples on how to use them to find out which process is listening on a certain port.
Original article source at: https://linuxhint.com/
The ETL process has been designed specifically for the purpose of transferring data from its source database into a data warehouse. However, the challenges and complexities of ETL can make it hard to implement successfully for all of your enterprise data. For this reason, Amazon has introduced AWS Glue. You can learn more about the Amazon web services with the AWS Training and Certification.
In this article, the pointers that we are going to cover are as follows:
So let us begin with our first topic.
AWS Glue is a fully managed ETL service. This service makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it swiftly and reliably between various data stores.
It comprises components such as a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
AWS Glue is serverless, which means that there’s no infrastructure to set up or manage.
You can transform as well as move AWS Cloud data into your data store.
You can also load data from disparate sources into your data warehouse for regular reporting and analysis.
By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.
AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
With crawlers, your metadata stays in synchronization with the underlying data. Athena and Redshift Spectrum can directly query your Amazon S3 data lake with the help of the AWS Glue Data Catalog.
With AWS Glue, you access as well as analyze data through one unified interface without loading it into multiple data silos.
You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function.
You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
4. To understand your data assets.
You can store your data using various AWS services and still maintain a unified view of your data using the AWS Glue Data Catalog.
View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
The Data Catalog also serves as a drop-in replacement for your external Apache Hive Metastore.
AWS Glue is integrated across a very wide range of AWS services. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, along with common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources that you use while your jobs are running.
AWS Glue automates a significant amount of effort in building, maintaining, and running ETL jobs. It crawls your data sources, identifies data formats as well as suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
AWS Glue Concepts
You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and load (ETL) data from a data source to a data target. You typically perform the following actions:
Firstly, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog. In addition to table definitions, the Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
You can run your job on-demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. This script runs in an Apache Spark environment in AWS Glue.
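If you drive this workflow from code rather than the console, a minimal boto3 sketch could start the crawler and then the job on demand; the job name below is hypothetical, and AWS credentials and a default region are assumed to be configured:
import boto3

glue = boto3.client("glue")  # assumes AWS credentials and region are configured

# The crawler name matches the demo later in this article; the job name is hypothetical.
glue.start_crawler(Name="glue-demo-edureka-crawler")       # populate the Data Catalog
run = glue.start_job_run(JobName="glue-demo-edureka-job")  # run the ETL script on demand
print("Started job run:", run["JobRunId"])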
You can learn more about AWS and its services from the AWS Cloud Course.
Terminology | Description |
Data Catalog | The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment. |
Classifier | Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others. |
Connection | It contains the properties that are required to connect to your data store. |
Crawler | A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data and then creates metadata tables in the Data Catalog. |
Database | A set of associated Data Catalog table definitions organized into a logical group in AWS Glue. |
Data Store, Data Source, Data Target | A data store is a repository for persistently storing your data. Data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to. |
Development Endpoint | An environment that you can use to develop and test your AWS Glue ETL scripts. |
Job | The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. |
Notebook Server | A web-based environment that you can use to run your PySpark statements. PySpark is a Python dialect for ETL programming. |
Script | Code that extracts data from sources, transforms it and loads it into targets. AWS Glue generates PySpark or Scala scripts. |
Table | It is the metadata definition that represents your data. A table defines the schema of your data. |
Transform | You use the code logic to manipulate your data into a different format. |
Trigger | Initiates an ETL job. You can define triggers based on a scheduled time or event. |
How does AWS Glue work?
Here I am going to demonstrate an example where I will create a transformation script with Python and Spark. I will also cover some basic Glue concepts such as crawler, database, table, and job.
Glue can read data from a database or S3 bucket. For example, I have created an S3 bucket called glue-bucket-edureka. Create two folders from S3 console and name them read and write. Now create a text file with the following data and upload it to the read folder of S3 bucket.
rank,movie_title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Godfather: Part II,1974,9.0
4,The Dark Knight,2008,9.0
5,12 Angry Men,1957,8.9
6,Schindler’s List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,Fight Club,1999,8.8
In this step, we will create a crawler. The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema. In Glue crawler terminology the file format is known as a classifier. The crawler identifies the most common classifiers automatically including CSV, json and parquet. Our sample file is in the CSV format and will be recognized automatically.
In the left panel of the Glue management console click Crawlers.
Click the blue Add crawler button.
Give the crawler a name such as glue-demo-edureka-crawler.
In Add a data store menu choose S3 and select the bucket you created. Drill down to select the read folder.
In Choose an IAM role, create a new role. Name it, for example, glue-demo-edureka-iam-role.
In Configure the crawler’s output add a database called glue-demo-edureka-db.
When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.
Once the data has been crawled, the crawler creates a metadata table from it. You find the results from the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables. Glue tables don’t contain the data but only the instructions on how to access the data.
From the Glue console left panel go to Jobs and click blue Add job button. Follow these instructions to create the Glue job:
Copy the following code to your Glue script editor. Remember to change the bucket name for the s3_write_path variable. Save the code in the editor and click Run job.
#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################
#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
#Parameters
glue_db = "glue-demo-edureka-db"
glue_tbl = "read"
s3_write_path = "s3://glue-demo-bucket-edureka/write"
#########################################
### EXTRACT (READ DATA)
#########################################
#Log starting time
dt_start = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Start time:", dt_start)
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
#########################################
### TRANSFORM (MODIFY DATA)
#########################################
#Create a decade column from year
decade_col = f.floor(data_frame["year"]/10)*10
data_frame = data_frame.withColumn("decade", decade_col)
#Group by decade: Count movies, get average rating
data_frame_aggregated = data_frame.groupby("decade").agg(
f.count(f.col("movie_title")).alias('movie_count'),
f.mean(f.col("rating")).alias('rating_mean'),
)
#Sort by the number of movies per the decade
data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("movie_count"))
#Print result table
#Note: Show function is an action. Actions force the execution of the data frame plan.
#With big data the slowdown would be significant without caching.
data_frame_aggregated.show(10)
#########################################
### LOAD (WRITE DATA)
#########################################
#Create just 1 partition, because there is so little data
data_frame_aggregated = data_frame_aggregated.repartition(1)
#Convert back to dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")
#Write data back to S3
glue_context.write_dynamic_frame.from_options(
frame = dynamic_frame_write,
connection_type = "s3",
connection_options = {
"path": s3_write_path,
#Here you could create S3 prefixes according to the values in specified columns
#"partitionKeys": ["decade"]
},
format = "csv"
)
#Log end time
dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Start time:", dt_end)
The detailed explanations are commented in the code. Here is the high-level description:
Read the movie data from S3
Get movie count and rating average for each decade
Write aggregated data back to S3
The execution time with 2 Data Processing Units (DPU) was around 40 seconds; the relatively long duration is explained by the start-up overhead.
The data transformation script creates summarized movie data. For example, the 2000s decade has three movies in the IMDB top 10 with an average rating of 8.9. You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.
The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of the output files.
With this, we have come to the end of this article on AWS Glue. I hope you have understood everything that I have explained here.
Got a question for us? Please mention it in the comments section of this article and we will get back to you.
Original article source at: https://www.edureka.co/
This package provides an asynchronous process dispatcher that works on all major platforms (including Windows).
As Windows pipes are file handles and do not allow non-blocking access, this package makes use of a process wrapper, that provides access to these pipes via sockets. On Unix-like systems it uses the standard pipes, as these can be accessed without blocking there. Concurrency is managed by the Amp event loop.
This package can be installed as a Composer dependency.
composer require amphp/process
amphp/process
follows the semver semantic versioning specification like all other amphp
packages.
If you discover any security related issues, please email me@kelunik.com
instead of using the issue tracker.
Author: Amphp
Source Code: https://github.com/amphp/process
License: MIT license
Data quality is a crucial element of any successful data warehouse solution. As the complexity of data warehouses increases, so does the need for data quality processes. In this article, Toptal Data Quality Developer Alexander Hauskrecht outlines how you can ensure a high degree of data quality and why this process is so important.
Data Quality (DQ) in data warehouse systems is getting more and more important. Increasing regulatory requirements, but also the growing complexity of data warehouse solutions, force companies to intensify (or start) a data quality initiative.
This article’s main focus will be on “traditional” data warehousing, but data quality is also an issue in more “modern” concepts such as data lakes. It will show some main points to consider and also some common pitfalls to avoid when implementing a data quality strategy. It does not cover the part on choosing the right technology/tool to build a DQ framework.
One of the most obstructive problems of a DQ project is the fact that at first sight, it creates a lot of work for the business units without providing any extra functionality. A data quality initiative usually only has strong proponents if:
DQ’s treatment is similar to that of testing in software development—if a project runs out of time and/or budget, this part tends to be reduced first.
This, of course, is not the whole truth. A good data quality system helps detect errors early, thus speeding up the process of delivering data of “good enough” quality to the users.
Before discussing the topic, a common understanding of the terms used is important.
A data warehouse (DWH) is a non-operational system mainly used for decision support. It consolidates the data of the operational systems (all of them or a smaller subset) and provides query-optimized data for the users of the DWH system. The data warehouse should provide “a single version of truth” within the enterprise. A data warehouse is usually built of stages/layers:
Figure 1: Common data warehouse layers.
The operational data is stored mostly unchanged into a staging layer. The core layer contains consolidated and unified data. The next optional stage is a derivation area, providing derived data (for example, a customer score for sales) and aggregations. The data mart layer contains data optimized for a given group of users. Data marts often contain aggregations and lots of derived metrics. Data warehouse users often work only with the data mart layer.
Between each stage, some kind of data transformation takes place. Usually, a data warehouse is periodically loaded with delta extractions of the operational data and contains algorithms to keep historical data.
Data quality is usually defined as a metric on how well a product meets user requirements. Different users might have different requirements for a product so the implementation depends on the user’s perspective, and it is important to identify these needs.
Data quality does not mean the data has to be completely or almost error-free—it depends on the users’ requirements. A “good enough” approach is a good choice to start with. Nowadays, bigger companies have “a data (or information) governance policy,” and data quality is a part of it. A data governance policy should describe how your company deals with data and how it makes sure that data has the right quality and that data privacy rules are not violated.
Data quality is an ongoing topic. A DQ circuit loop has to be implemented (see next chapter). Regulatory requirements and compliance rules also have an impact on the data quality needed, such as TCPA (US Telephone Consumer Protection Act) or GDPR in Europe for privacy issues, but also industry-specific rules like Solvency II for insurances in the EU, BCBS 239 and others for banking, and so on.
As with all quality topics, DQ is an ongoing activity designed to maintain satisfactory quality. As a result of a DQ project, a circuit loop similar to the one below has to be implemented:
Figure 2: Data quality circuit loop.
The steps within this loop will be described in the next chapters.
To implement a successful DQ initiative, the following roles are needed:
To ensure success, it is important to have these roles clearly defined and widely accepted within your organization in the early stages of your DQ project. It is equally important to find competent data specialists for these roles who support the project.
Find and implement useful DQ checks/rules. Defining DQ rules requires a good understanding of your data warehouse and its use.
As discussed earlier, data users (and the data owner) are responsible for data use and therefore also for the needed level of data quality. Data users should have a good understanding of their data so they can give the best input for useful data quality rules.
They are also the ones who analyze the results of the data quality rules, so it is always a good idea to let them define their own rules. This further increases their acceptance of having to check and rate the results of the DQ rules assigned to their data user unit (see the “Analyze” chapter).
The drawback of this approach is that data users normally only know the data mart layer, not the earlier layers of the data warehouse. If data was corrupted in the “lower” stages, this won’t be detected by checking just the “top” layer of your data warehouse.
What kind of known errors might occur in a data warehouse?
These problems are often caused by people lacking the appropriate know-how and skills to define, implement, run, and work with a data warehouse solution.
DQ dimensions are a common way to identify and cluster DQ checks. There are many definitions, and the number of dimensions varies considerably: You might find 16, or even more dimensions. From a practical perspective, it is less confusing to start with a few dimensions and find a general understanding of them among your users.
Data generated by the data warehouse load process can be helpful as well.
Keep in mind that each data quality check has to be analyzed by at least one data user (see “Analyze” chapter) in case errors are found, for which you’ll need someone responsible and available to look after every check implemented.
Within a complex data warehouse, you might end up with many (sometimes thousands) DQ rules. The process to execute data quality rules should be robust and fast enough to handle this.
Don’t check facts that are guaranteed by technical implementation. For example, if the data is stored in a relational DBMS, it is not necessary to check if:
That said, always keep in mind that a data warehouse is in constant change and that the data definition of fields and tables might change over time.
Housekeeping is very important. Rules defined by different data user units might overlap and should be consolidated. The more complex your organization, the more housekeeping will be needed. Data owners should implement a process of rules consolidation as a kind of “data quality for data quality rules.” Also, data quality checks might become useless if the data is no longer used or if its definition has changed.
Data quality rules can be classified based on the type of test.
Once you have defined what to check, you’ll have to specify how to quantify the identified issues. Information such as “five data rows violate the DQ rule with ID 15” makes little sense for data quality.
The following parts are missing:
Metadata is important for routing the “Analyze” phase and for monitoring the phases of the data quality control loop.
(*) Data lineage shows the flow of data between two points. With data lineage, you can find all data elements influencing a given target field within your warehouse.
Using data lineage to assign users to rules can be problematic. As mentioned before, business users usually know only the data mart layer (and the operating system), but not the lower levels of the data warehouse. By mapping via data lineage, data users will be assigned rules they’re not familiar with. For the lower levels, IT staff may be needed to evaluate a data quality finding. In many cases, a manual mapping or a mixed approach (mapping via data lineage only within the data mart) can help.
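To make the earlier point about quantified, metadata-rich findings more concrete, here is a minimal Python sketch of a rule result; the field names and thresholds are purely illustrative and not taken from any particular DQ tool:
from dataclasses import dataclass
from datetime import date

@dataclass
class DQResult:
    rule_id: str       # which rule produced the finding
    data_domain: str   # used to route the finding to the responsible data users
    checked_rows: int
    failed_rows: int
    load_date: date

    def traffic_light(self, yellow=0.01, red=0.05):
        # Rate the finding relative to the volume checked, not as a raw count.
        ratio = self.failed_rows / max(self.checked_rows, 1)
        return "red" if ratio >= red else "yellow" if ratio >= yellow else "green"

result = DQResult("DQ_15", "customer", checked_rows=120_000, failed_rows=36, load_date=date.today())
print(result.rule_id, result.traffic_light())  # DQ_15 green (0.03% of the rows failed)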
Measuring data quality means executing the available data quality rules, which should be done automatically, triggered by the load processes of the data warehouse. As we’ve seen before, there might be a remarkable number of data quality rules, so the checks will be time-consuming.
In a perfect world, a data warehouse would be loaded only if all data is error-free. In the real world, this is seldom the case (realistically, it is almost never the case). Depending on the overall loading strategy of your data warehouse, the data quality process should or should not (the latter is far more likely) rule the load process. It is a good design to have data quality processes (job networks) parallel and linked to the “regular” data warehouse load processes.
If there are defined service-level agreements, make sure not to thwart the data warehouse loads with the data quality checks. Errors/abends in data quality processes should not stop the regular load process. Unexpected errors within the data quality processes should be reported and shown up for the “Analyze” phase (see next chapter).
Keep in mind that a data quality rule might crash because of unexpected errors (maybe the rule itself was wrongly implemented, or the underlying data structure changed over time). It would help if your data quality system provided a mechanism to deactivate such rules, especially if your company has few releases per year.
DQ processes should be executed and reported as early as possible—ideally, right after the data checked was loaded. This helps detect errors as early as possible during the load of the data warehouse (some complex warehouse system loads have a duration of several days).
In this context, “analyze” means reacting to data quality findings. This is a task for the assigned data users and the data owner.
The way to react should be clearly defined by your data quality project. Data users should be obligated to comment on a rule with findings (at least rules with a red light), explaining what measures are being taken to handle the finding. The data owner needs to be informed and should decide together with the data user(s).
The following actions are possible:
In a perfect world, every data quality problem would be fixed. However, lack of resources and/or time often results in workarounds.
To be able to react in time, the DQ system must inform the data users about “their” rules with findings. Using a data quality dashboard (maybe with sending messages that something came up) is a good idea. The earlier the users are informed about findings, the better.
The data quality dashboard should contain:
The dashboard should also show the current status of the recent data warehouse load process, giving the users a 360-degree view of the data warehouse load process.
The data owner is responsible for making sure that every finding was commented on and the status of the data quality (original or overruled) is at least yellow for all data users.
For a quick overview, it would help to build a kind of simple KPIs (key performance indicators) for data users/data owner. Having an overall traffic light for all associated rules’ results is quite easy if each rule is given the same weight.
Personally, I think computing an overall value of data quality for a given data domain is rather complex and tends to be cabalistic, but you could at least show the number of overall rules grouped by result for a data domain (e.g., “100 DQ rules with 90% green, 5% yellow, and 5% red results”).
It is the data owner’s task to ensure that the findings will be fixed and data quality improved.
As the data warehouse processes often change, the data quality mechanism also needs maintenance.
A data owner should always take care of the following points:
Monitoring the entire data quality process helps to improve it over time.
Things worth watching would be:
Many of the following points are important in any kind of project.
Anticipate resistance. As we have seen, if there is no urgent quality issue, data quality is often viewed as an additional burden without offering new functionality. Keep in mind that it might create additional workload for the data users. In many cases, compliance and regulatory demands can help you to convince the users to see it as an unavoidable requirement.
Find a sponsor. As noted above, DQ is not a fast-selling item, so a powerful sponsor/stakeholder is needed—the higher in the management, the better.
Find allies. As with the sponsor, anyone who shares the idea of strong data quality would be most helpful. The DQ circuit loop is an ongoing process and needs people to keep the circuit loop alive.
Start small. If there’s been no DQ strategy so far, look for a business unit that needs better data quality. Build a prototype to show them the benefit of better data. If your task is to improve or even replace a given data quality strategy, look at things working well/being accepted in the organization, and keep them.
Don’t lose sight of the whole picture. Although starting small, keep in mind that some points, especially the roles, are prerequisites for a successful DQ strategy.
Once implemented, don’t let go. The data quality process needs to be part of data warehouse use. Over time, focus on data quality tends to get a bit lost, and it’s up to you to maintain it.
Original article source at: https://www.toptal.com/
Elicitation? It must be an easy thing!
Let me try. Okay, so first, I have to do what? Wait, this isn't that easy…
Exactly! It looks like elicitation is easy, but it's not, so let's understand first what requirement elicitation is in Project management.
Many processes and techniques can be used to manage projects effectively in the software development world. One of the most challenging tasks in any project is eliciting requirements. With so many different ways to go about it, it's easy to get lost and end up with a document that's nothing more than fragmented ideas instead of precise specifications. Eliciting requirements is an art and not a science. It requires you to use your instincts and engage with the client on a different level. Having said that, you can use various techniques and methodologies as a software engineer or project manager to streamline the process of getting useful information from your stakeholders. With so many software development methodologies available in the market today, it can take time for new entrants to choose which will work best for their specific needs.
The first step in the software development process is requirements engineering. This is turning a business or organizational problem into a set of requirements for a new software application. This step identifies a problem, and a set of requirements is developed. It will include information about the problem, the stakeholders, their needs, and the reason for building the application. The problem statement is the primary input for this process.
The crucial part is that we need to gather information, and not just any information but the correct information, by connecting with stakeholders to understand precisely what they are looking for.
The significant steps which should be involved in this procedure are –
Requirements engineering turns a business or organizational problem into a set of requirements for a new software application, and it is the first step towards designing and developing a solution. It is essential to understand the problem statement and the user's needs: this helps in identifying and understanding the problem to be solved as well as the needs of the users. It is also essential to document the requirements so that stakeholders are on the same page and agree on what needs to be done, how it will be done, and what technology will be used to deliver the results.
Original article source at: https://www.xenonstack.com/
A small utility for generating consistent warning objects across your codebase. It also exposes a utility for emitting those warnings, guaranteeing that they are issued only once.
This module is used by the Fastify framework and it was called fastify-warning prior to version 1.0.0.
npm i process-warning
The module exports a builder function that returns a utility for creating warnings and emitting them.
const warning = require('process-warning')()
warning.create(name, code, message)
1. name (string, required) - The error name; you can access it later with error.name. For consistency, we recommend prefixing module error names with {YourModule}Warning.
2. code (string, required) - The warning code; you can access it later with error.code. For consistency, we recommend prefixing plugin error codes with {ThreeLetterModuleName}_, e.g. FST_. NOTE: codes should be all uppercase.
3. message (string, required) - The warning message. You can also use interpolated strings for formatting the message.
The utility also contains an emit function that you can use for emitting the warnings you have previously created by passing their respective code. A warning is guaranteed to be emitted only once.
warning.emit(code [, a [, b [, c]]])
1. code (string, required) - The warning code you intend to emit.
2. [, a [, b [, c]]] (any, optional) - Parameters for string interpolation.
const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'message')
warning.emit('FST_ERROR_CODE')
How to use an interpolated string:
const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'Hello %s')
warning.emit('FST_ERROR_CODE', 'world')
The module also exports a warning.emitted Map, which contains all the warnings already emitted. Useful for testing.
const warning = require('process-warning')()
warning.create('FastifyWarning', 'FST_ERROR_CODE', 'Hello %s')
console.log(warning.emitted.get('FST_ERROR_CODE')) // false
warning.emit('FST_ERROR_CODE', 'world')
console.log(warning.emitted.get('FST_ERROR_CODE')) // true
Author: Fastify
Source Code: https://github.com/fastify/process-warning
License: MIT license
P(rocess) M(anager) 2
Runtime Edition
PM2 is a production process manager for Node.js applications with a built-in load balancer. It allows you to keep applications alive forever, to reload them without downtime and to facilitate common system admin tasks.
Starting an application in production mode is as easy as:
$ pm2 start app.js
Works on Linux (stable) & macOS (stable) & Windows (stable). All Node.js versions are supported starting Node.js 12.X.
With NPM:
$ npm install pm2 -g
You can install Node.js easily with NVM or ASDF.
You can start any application (Node.js, Python, Ruby, binaries in $PATH...) like that:
$ pm2 start app.js
Your app is now daemonized, monitored and kept alive forever.
Once applications are started you can manage them easily:
To list all running applications:
$ pm2 list
Managing apps is straightforward:
$ pm2 stop <app_name|namespace|id|'all'|json_conf>
$ pm2 restart <app_name|namespace|id|'all'|json_conf>
$ pm2 delete <app_name|namespace|id|'all'|json_conf>
To have more details on a specific application:
$ pm2 describe <id|app_name>
To monitor logs, custom metrics, application information:
$ pm2 monit
The Cluster mode is a special mode for starting a Node.js application: it starts multiple processes and load-balances HTTP/TCP/UDP queries between them. This increases overall performance (by a factor of x10 on 16-core machines) and reliability (faster socket re-balancing in case of unhandled errors).
Starting a Node.js application in cluster mode that will leverage all CPUs available:
$ pm2 start api.js -i <processes>
<processes> can be 'max', -1 (all CPUs minus one) or a specified number of instances to start.
Zero Downtime Reload
Hot Reload allows to update an application without any downtime:
$ pm2 reload all
More information about how PM2 makes clustering easy
With the drop-in replacement command for node, called pm2-runtime, run your Node.js application in a hardened production environment. Using it is seamless:
RUN npm install pm2 -g
CMD [ "pm2-runtime", "npm", "--", "start" ]
Read More about the dedicated integration
PM2 allows you to monitor your host/server vitals with a monitoring speedbar.
To enable host monitoring:
$ pm2 set pm2:sysmonit true
$ pm2 update
Monitor all processes launched straight from the command line:
$ pm2 monit
To consult logs just type the command:
$ pm2 logs
Standard, Raw, JSON and formatted output are available.
Examples:
$ pm2 logs APP-NAME # Display APP-NAME logs
$ pm2 logs --json # JSON output
$ pm2 logs --format # Formatted output
$ pm2 flush # Flush all logs
$ pm2 reloadLogs # Reload all logs
To enable log rotation install the following module
$ pm2 install pm2-logrotate
PM2 can generate and configure a Startup Script to keep PM2 and your processes alive at every server restart.
Init Systems Supported: systemd, upstart, launchd, rc.d
# Generate Startup Script
$ pm2 startup
# Freeze your process list across server restart
$ pm2 save
# Remove Startup Script
$ pm2 unstartup
More about Startup Scripts Generation
# Install latest PM2 version
$ npm install pm2@latest -g
# Save process list, exit old PM2 & restore all processes
$ pm2 update
PM2 updates are seamless
If you manage your apps with PM2, PM2+ makes it easy to monitor and manage apps across servers.
Feel free to try it:
Discover the monitoring dashboard for PM2
Thanks in advance and we hope that you like PM2!
PM2 is constantly assailed by more than 1800 tests.
Official website: https://pm2.keymetrics.io/
Author: Unitech
Source Code: https://github.com/Unitech/pm2
License: View license
Data Version Control or DVC is an open-source tool for data science and machine learning projects. It allows for different versioning and management of datasets and Machine Learning models.
Using GitHub Actions, we can generate ML results for every pull request and have the info we need right there:
git init
dvc init
git commit -m "initial commit"
#set dvc remote
dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA/dvcstore
git commit -m "sets dvc remote"
#process which repeats after any modification of data(new version)
#adds file to dvc and .gitignore
dvc add path_to_data
git add .gitignore path_to_data.dvc
git commit -m "data:track"
#tags data version on git
git tag -a 'v1' -m "raw data"
dvc push
Now go ahead and delete the local copy of the data (it may also appear in .dvc/cache).
To get the data back, run: dvc pull
Download Details:
Author: Azariagmt
Source Code: https://github.com/Azariagmt/abtest-mlops
License: BSD-2-Clause License
Welcome to NodeJS series with TypeScript. In this series I will try to introduce us to NodeJS and how to build simple backend with http module. We will learn ins and outs of file operations of NodeJS. All in a type safe way thanks to TypeScript.
In the second episode we will take a look at processes in NodeJS. I will try to explain the basics of creating modules in NodeJS and how to get user input into your program!
Table of content:
00:00 Process in NodeJS
19:40 Basics of Modules And Writing To Files
22:33 Reading the user input
30:33 Using inquirer package
You can find me here:
https://twitter.com/wojciech_bilick
https://medium.com/@wojciech.bilicki
https://github.com/wojciech-bilicki
#nodejs #typescript #process
How to find the process state?
You can find the process state from the following source:
a. The Unix/Linux command-line tool ‘top’ will report the process state in the column ‘S’. The process status is reported with a single character.
R – RUNNING/RUNNABLE
S – INTERRUPTIBLE_SLEEP
D – UNINTERRUPTIBLE_SLEEP
T – STOPPED
Z – ZOMBIE
b. You can use web-based root cause analysis tools like yCrash, which will report the process states.
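c. You can also read the state directly from the proc filesystem. The following small Python sketch (an illustrative addition, Linux only) parses /proc/<pid>/stat:
import os

def process_state(pid):
    # Field 3 of /proc/<pid>/stat is the single-character state (R, S, D, T, Z, ...).
    # The command name (field 2) is wrapped in parentheses, so split after the
    # closing parenthesis to parse robustly even if the name contains spaces.
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

print(process_state(os.getpid()))  # usually "R" for the interpreter reading itself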
#linux #process #unix/linux
https://phpcoder.tech/how-to-check-process-running-in-windows/
#php #check #process #windows #running
We have mapped over 120 companies; new and old, large and small, according to their subsegment and the precise type of automation they provide. Find an intro & explanation to the map below.
The process automation space (RPA, Robotic Process Automation) continued to grow strongly in 2020, despite — or perhaps thanks to — the global shift to working from home. Software companies automating business processes raised over $1.9 billion in 2020; the broader automation software space raised over $11 billion, according to PitchBook data.
Fittingly for 2020, the most funding for a single company went to Olive, an AI-driven workflow automation platform for hospitals and healthcare systems. They raised over $380 million last year across three separate rounds, with investors including General Catalyst, Tiger Global Management and Sequoia Capital.
#process #intelligent-automation #rpa #artificial-intelligence #software
Start all your projects quickly, nicely, and cleanly
create-react-app is an amazing tool, described as **the best way to start building a new single-page application in React** inside the official documentation.
By making use of react-scripts, it offers a modern build setup with no configuration.
While it may well be a fantastic tool, chances are your projects always start with a big overhead. If your first step after running **CRA** is to spend a day setting up default functionalities and configuration, creating your own template will definitely boost your productivity.
Using GIT the right way can resolve a lot of headaches regarding the creation and maintenance of your custom template.
Using a fork of create-react-app and by doing so, keeping its GIT history, allow you to later sync your fork with the official repository.
If you are a GitHub user, you only need to use the **Fork** button on the homepage of create-react-app.
If you use any other GIT repository hosting service, the approach is slightly different. It consists of creating your own repository and adding an upstream remote:
$ mkdir create-react-app
$ cd create-react-app
$ git init
$ git remote add origin <YOUR_REMOTE_REPOSITORY>
$ git remote add upstream https://github.com/facebook/create-react-app
$ git pull upstream master
$ git push origin master
#process #javascript #typescript #productivity
It’s an old trope, but the companies of the world are sitting on a goldmine of insights locked away in data that no-one has ever looked at. This hidden data is nowhere more common than in server or machine event logs and over the past decade there has been an explosion in full-text search and log aggregation tools. Software like Elasticsearch and Splunk have internet-scale superpowers at ingesting and dashboarding real-time events.
For the humble analyst these new methods allow the indexing of literally everything. This means that counting the number of weird log-in attempts or the average temperature on one of 10,000 industrial sensors can be as easy as a Google search.
#logs #process #data-science #regex #power-bi