Building a ETL pipeline

An ETL (Data Extraction, Transformation, Loading) pipeline is a set of processes used to Extract, Transform, and Load data from a source to a target.
The source of the data can be from one or many sources, such as from an API call, CSV files, information within a database, and many more.
We take these sources of information then transform it in a way where it can be used immediately by another client, user or developer within some target storage.
In practice, an ETL pipeline is run infrequently. Generally there are large amounts of information that is unfiltered and difficult to process and clean, often taking large amounts of time and resources to transform.

#request #pandas #python #covid19 #etl

What is GEEK

Buddha Community

Building a ETL pipeline

Anil Cynix

1592470812

Before we learn anything about ETL Testing its important to learn about Business Intelligence and Dataware. Let’s get started –

What is BI?
Business Intelligence is the process of collecting raw data or business data and turning it into information that is useful and more meaningful. The raw data is the records of the daily transaction of an organization such as interactions with customers, administration of finance, and management of employee and so on. These data’s will be used for “Reporting, Analysis, Data mining, Data quality and Interpretation, Predictive Analysis”.

What is Data Warehouse?
A data warehouse is a database that is designed for query and analysis rather than for transaction processing. The data warehouse is constructed by integrating the data from multiple heterogeneous sources.It enables the company or organization to consolidate data from several sources and separates analysis workload from transaction workload. Data is turned into high quality information to meet all enterprise reporting requirements for all levels of users.

What is ETL?
ETL stands for Extract-Transform-Load and it is a process of how data is loaded from the source system to the data warehouse. Data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems and spreadsheets.

Let see how it works

For example, there is a retail store which has different departments like sales, marketing, logistics etc. Each of them is handling the customer information independently, and the way they store that data is quite different. The sales department have stored it by customer’s name, while marketing department by customer id.

Now if they want to check the history of the customer and want to know what the different products he/she bought owing to different marketing campaigns; it would be very tedious.

The solution is to use a Datawarehouse to store information from different sources in a uniform structure using ETL. ETL can transform dissimilar data sets into an unified structure.Later use BI tools to derive meaningful insights and reports from this data.

The following diagram gives you the ROAD MAP of the ETL process

ETL Testing or Datawarehouse Testing : Ultimate Guide

1.Extract

  • Extract relevant data
    2.Transform
  • Transform data to DW (Data Warehouse) format
  • Build keys - A key is one or more data attributes that uniquely identify an entity. Various types of keys are primary key, alternate key, foreign key, composite key, surrogate key. The datawarehouse owns these keys and never allows any other entity to assign them.
  • Cleansing of data :After the data is extracted, it will move into the next phase, of cleaning and conforming of data. Cleaning does the omission in the data as well as identifying and fixing the errors. Conforming means resolving the conflicts between those data’s that is incompatible, so that they can be used in an enterprise data warehouse. In addition to these, this system creates meta-data that is used to diagnose source system problems and improves data quality.

3.Load

  • Load data into DW ( Data Warehouse)
  • Build aggregates - Creating an aggregate is summarizing and storing data which is available in fact table in order to improve the performance of end-user queries.

What is ETL Testing?
ETL testing is done to ensure that the data that has been loaded from a source to the destination after business transformation is accurate. It also involves the verification of data at various middle stages that are being used between source and destination. ETL stands for Extract-Transform-Load.

ETL Testing Process
Similar to other Testing Process, ETL also go through different phases. The different phases of ETL testing process is as follows

ETL Testing or Datawarehouse Testing : Ultimate Guide

ETL testing is performed in five stages

1.Identifying data sources and requirements
2.Data acquisition
3.Implement business logics and dimensional Modelling
4.Build and populate data
5.Build Reports
ETL Testing or Datawarehouse Testing : Ultimate Guide.

These are the basics of etl testing course if u want to know more plz visit Online IT Guru website.

#etl testing #etl testing course #etl testing online #etl testing training #online etl testing training #etl testing online training

Anthony  Dach

Anthony Dach

1623649980

Building An Automated Testing Pipeline with GoCD [Tutorial]

CI/CD enables developers, engineers and DevOps team to create a fast and effective process of packaging the product to market, thereby allowing them to stay ahead of the competition. When Selenium automation testing joins force with an effective CI/CD tool, it does wonders for the product delivery. GoCD is one such open-source Continuous Integration (CI) and Continuous Delivery (CD) tool developed by ThoughtWorks that supports the software development life cycle by enabling automation for the entire process. Right from development –. test –> deployment, GoCD ensures that your delivery cycles are on time, reliable, and efficient.

Ok. I know what you are thinking!

We have Jenkins for CI & CD ! Why a new tool ?

In this GoCD pipeline tutorial, we will deep dive into all the information you would need to set up a GoCD pipeline and in conjunction with the underlying concepts. You will also learn how to perform automation testing using Selenium in GoCD Pipeline through an Online Selenium Grid .

Why GoCD ?

Setting up GoCD

Setting up GoCD

Setting Up the GoCD Pipeline

#selenium testing #ci/cd #building an automated testing pipeline with gocd #gocd pipeline #build-test #gocd

The Best Way to Build a Chatbot in 2021

A useful tool several businesses implement for answering questions that potential customers may have is a chatbot. Many programming languages give web designers several ways on how to make a chatbot for their websites. They are capable of answering basic questions for visitors and offer innovation for businesses.

With the help of programming languages, it is possible to create a chatbot from the ground up to satisfy someone’s needs.

Plan Out the Chatbot’s Purpose

Before building a chatbot, it is ideal for web designers to determine how it will function on a website. Several chatbot duties center around fulfilling customers’ needs and questions or compiling and optimizing data via transactions.

Some benefits of implementing chatbots include:

  • Generating leads for marketing products and services
  • Improve work capacity when employees cannot answer questions or during non-business hours
  • Reducing errors while providing accurate information to customers or visitors
  • Meeting customer demands through instant communication
  • Alerting customers about their online transactions

Some programmers may choose to design a chatbox to function through predefined answers based on the questions customers may input or function by adapting and learning via human input.

#chatbots #latest news #the best way to build a chatbot in 2021 #build #build a chatbot #best way to build a chatbot

Riyad Amin

Riyad Amin

1571046022

Build Your Own Cryptocurrency Blockchain in Python

Cryptocurrency is a decentralized digital currency that uses encryption techniques to regulate the generation of currency units and to verify the transfer of funds. Anonymity, decentralization, and security are among its main features. Cryptocurrency is not regulated or tracked by any centralized authority, government, or bank.

Blockchain, a decentralized peer-to-peer (P2P) network, which is comprised of data blocks, is an integral part of cryptocurrency. These blocks chronologically store information about transactions and adhere to a protocol for inter-node communication and validating new blocks. The data recorded in blocks cannot be altered without the alteration of all subsequent blocks.

In this article, we are going to explain how you can create a simple blockchain using the Python programming language.

Here is the basic blueprint of the Python class we’ll use for creating the blockchain:

class Block(object):
    def __init__():
        pass
    #initial structure of the block class 
    def compute_hash():
        pass
    #producing the cryptographic hash of each block 
  class BlockChain(object):
    def __init__(self):
    #building the chain
    def build_genesis(self):
        pass
    #creating the initial block
    def build_block(self, proof_number, previous_hash):
        pass
    #builds new block and adds to the chain
   @staticmethod
    def confirm_validity(block, previous_block):
        pass
    #checks whether the blockchain is valid
    def get_data(self, sender, receiver, amount):
        pass
    # declares data of transactions
    @staticmethod
    def proof_of_work(last_proof):
        pass
    #adds to the security of the blockchain
    @property
    def latest_block(self):
        pass
    #returns the last block in the chain

Now, let’s explain how the blockchain class works.

Initial Structure of the Block Class

Here is the code for our initial block class:

import hashlib
import time
class Block(object):
    def __init__(self, index, proof_number, previous_hash, data, timestamp=None):
        self.index = index
        self.proof_number = proof_number
        self.previous_hash = previous_hash
        self.data = data
        self.timestamp = timestamp or time.time()
    @property
    def compute_hash(self):
        string_block = "{}{}{}{}{}".format(self.index, self.proof_number, self.previous_hash, self.data, self.timestamp)
        return hashlib.sha256(string_block.encode()).hexdigest()

As you can see above, the class constructor or initiation method ( init()) above takes the following parameters:

self — just like any other Python class, this parameter is used to refer to the class itself. Any variable associated with the class can be accessed using it.

index — it’s used to track the position of a block within the blockchain.

previous_hash — it used to reference the hash of the previous block within the blockchain.

data—it gives details of the transactions done, for example, the amount bought.

timestamp—it inserts a timestamp for all the transactions performed.

The second method in the class, compute_hash , is used to produce the cryptographic hash of each block based on the above values.

As you can see, we imported the SHA-256 algorithm into the cryptocurrency blockchain project to help in getting the hashes of the blocks.

Once the values have been placed inside the hashing module, the algorithm will return a 256-bit string denoting the contents of the block.

So, this is what gives the blockchain immutability. Since each block will be represented by a hash, which will be computed from the hash of the previous block, corrupting any block in the chain will make the other blocks have invalid hashes, resulting in breakage of the whole blockchain network.

Building the Chain

The whole concept of a blockchain is based on the fact that the blocks are “chained” to each other. Now, we’ll create a blockchain class that will play the critical role of managing the entire chain.

It will keep the transactions data and include other helper methods for completing various roles, such as adding new blocks.

Let’s talk about the helper methods.

Adding the Constructor Method

Here is the code:

class BlockChain(object):
    def __init__(self):
        self.chain = []
        self.current_data = []
        self.nodes = set()
        self.build_genesis()

The init() constructor method is what instantiates the blockchain.

Here are the roles of its attributes:

self.chain — this variable stores all the blocks.

self.current_data — this variable stores information about the transactions in the block.

self.build_genesis() — this method is used to create the initial block in the chain.

Building the Genesis Block

The build_genesis() method is used for creating the initial block in the chain, that is, a block without any predecessors. The genesis block is what represents the beginning of the blockchain.

To create it, we’ll call the build_block() method and give it some default values. The parameters proof_number and previous_hash are both given a value of zero, though you can give them any value you desire.

Here is the code:

def build_genesis(self):
        self.build_block(proof_number=0, previous_hash=0)
 def build_block(self, proof_number, previous_hash):
        block = Block(
            index=len(self.chain),
            proof_number=proof_number,
            previous_hash=previous_hash,
            data=self.current_data
        )
        self.current_data = []  
        self.chain.append(block)
        return block

Confirming Validity of the Blockchain

The confirm_validity method is critical in examining the integrity of the blockchain and making sure inconsistencies are lacking.

As explained earlier, hashes are pivotal for realizing the security of the cryptocurrency blockchain, because any slight alteration in an object will result in the creation of an entirely different hash.

Thus, the confirm_validity method utilizes a series of if statements to assess whether the hash of each block has been compromised.

Furthermore, it also compares the hash values of every two successive blocks to identify any anomalies. If the chain is working properly, it returns true; otherwise, it returns false.

Here is the code:

def confirm_validity(block, previous_block):
        if previous_block.index + 1 != block.index:
            return False
        elif previous_block.compute_hash != block.previous_hash:
            return False
        elif block.timestamp <= previous_block.timestamp:
            return False
        return True

Declaring Data of Transactions

The get_data method is important in declaring the data of transactions on a block. This method takes three parameters (sender’s information, receiver’s information, and amount) and adds the transaction data to the self.current_data list.

Here is the code:

def get_data(self, sender, receiver, amount):
        self.current_data.append({
            'sender': sender,
            'receiver': receiver,
            'amount': amount
        })
        return True

Effecting the Proof of Work

In blockchain technology, Proof of Work (PoW) refers to the complexity involved in mining or generating new blocks on the blockchain.

For example, the PoW can be implemented by identifying a number that solves a problem whenever a user completes some computing work. Anyone on the blockchain network should find the number complex to identify but easy to verify — this is the main concept of PoW.

This way, it discourages spamming and compromising the integrity of the network.

In this article, we’ll illustrate how to include a Proof of Work algorithm in a blockchain cryptocurrency project.

Finalizing With the Last Block

Finally, the latest_block() helper method is used for retrieving the last block on the network, which is actually the current block.

Here is the code:

def latest_block(self):
        return self.chain[-1]

Implementing Blockchain Mining

Now, this is the most exciting section!

Initially, the transactions are kept in a list of unverified transactions. Mining refers to the process of placing the unverified transactions in a block and solving the PoW problem. It can be referred to as the computing work involved in verifying the transactions.

If everything has been figured out correctly, a block is created or mined and joined together with the others in the blockchain. If users have successfully mined a block, they are often rewarded for using their computing resources to solve the PoW problem.

Here is the mining method in this simple cryptocurrency blockchain project:

def block_mining(self, details_miner):
            self.get_data(
            sender="0", #it implies that this node has created a new block
            receiver=details_miner,
            quantity=1, #creating a new block (or identifying the proof number) is awarded with 1
        )
        last_block = self.latest_block
        last_proof_number = last_block.proof_number
        proof_number = self.proof_of_work(last_proof_number)
        last_hash = last_block.compute_hash
        block = self.build_block(proof_number, last_hash)
        return vars(block)

Summary

Here is the whole code for our crypto blockchain class in Python:

import hashlib
import time
class Block(object):
    def __init__(self, index, proof_number, previous_hash, data, timestamp=None):
        self.index = index
        self.proof_number = proof_number
        self.previous_hash = previous_hash
        self.data = data
        self.timestamp = timestamp or time.time()
    @property
    def compute_hash(self):
        string_block = "{}{}{}{}{}".format(self.index, self.proof_number, self.previous_hash, self.data, self.timestamp)
        return hashlib.sha256(string_block.encode()).hexdigest()
    def __repr__(self):
        return "{} - {} - {} - {} - {}".format(self.index, self.proof_number, self.previous_hash, self.data, self.timestamp)
class BlockChain(object):
    def __init__(self):
        self.chain = []
        self.current_data = []
        self.nodes = set()
        self.build_genesis()
    def build_genesis(self):
        self.build_block(proof_number=0, previous_hash=0)
    def build_block(self, proof_number, previous_hash):
        block = Block(
            index=len(self.chain),
            proof_number=proof_number,
            previous_hash=previous_hash,
            data=self.current_data
        )
        self.current_data = []  
        self.chain.append(block)
        return block
    @staticmethod
    def confirm_validity(block, previous_block):
        if previous_block.index + 1 != block.index:
            return False
        elif previous_block.compute_hash != block.previous_hash:
            return False
        elif block.timestamp <= previous_block.timestamp:
            return False
        return True
    def get_data(self, sender, receiver, amount):
        self.current_data.append({
            'sender': sender,
            'receiver': receiver,
            'amount': amount
        })
        return True        
    @staticmethod
    def proof_of_work(last_proof):
        pass
    @property
    def latest_block(self):
        return self.chain[-1]
    def chain_validity(self):
        pass        
    def block_mining(self, details_miner):       
        self.get_data(
            sender="0", #it implies that this node has created a new block
            receiver=details_miner,
            quantity=1, #creating a new block (or identifying the proof number) is awared with 1
        )
        last_block = self.latest_block
        last_proof_number = last_block.proof_number
        proof_number = self.proof_of_work(last_proof_number)
        last_hash = last_block.compute_hash
        block = self.build_block(proof_number, last_hash)
        return vars(block)  
    def create_node(self, address):
        self.nodes.add(address)
        return True
    @staticmethod
    def get_block_object(block_data):        
        return Block(
            block_data['index'],
            block_data['proof_number'],
            block_data['previous_hash'],
            block_data['data'],
            timestamp=block_data['timestamp']
        )
blockchain = BlockChain()
print("GET READY MINING ABOUT TO START")
print(blockchain.chain)
last_block = blockchain.latest_block
last_proof_number = last_block.proof_number
proof_number = blockchain.proof_of_work(last_proof_number)
blockchain.get_data(
    sender="0", #this means that this node has constructed another block
    receiver="LiveEdu.tv", 
    amount=1, #building a new block (or figuring out the proof number) is awarded with 1
)
last_hash = last_block.compute_hash
block = blockchain.build_block(proof_number, last_hash)
print("WOW, MINING HAS BEEN SUCCESSFUL!")
print(blockchain.chain)

Now, let’s try to run our code to see if we can generate some digital coins…

Wow, it worked!

Conclusion

That is it!

We hope that this article has assisted you to understand the underlying technology that powers cryptocurrencies such as Bitcoin and Ethereum.

We just illustrated the basic ideas for making your feet wet in the innovative blockchain technology. The project above can still be enhanced by incorporating other features to make it more useful and robust.

Learn More

Thanks for reading !

Do you have any comments or questions? Please share them below.

#python #cryptocurrency

ETL Data Pipeline In AWS

ETL (Extract, Transform, and Load) is an emerging topic among all the IT Industries. Industries often looking for some easy solution and Open source tools and technology to do ETL on their valuable data without spending much effort on other things.

There is AWS Glue for you, it’s a feature of Amazon Web Services to create a simple ETL pipeline.

AWS Glue Introduction

AWS Glue is another offering from AWS and is a serverless ETL (Extract, Transform, and Load) service on the cloud. It is fully managed, cost-effective service to categorize your data, clean and enrich it and finally move it from source systems to target systems.

#etl #aws #aws-glue #etl