DataStage Tutorial: Beginner's Training

What is DataStage?

DataStage is an ETL tool that extracts data from a source, transforms it, and loads it into a target. Data sources can include sequential files, indexed files, relational databases, external data sources, archives, and enterprise applications. DataStage supports business analysis by supplying quality data for business intelligence.
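To make the extract, transform, and load steps concrete, here is a minimal sketch in plain Python. It is not DataStage code (DataStage jobs are built graphically in the Designer); the CSV content, the `customers` table, and the function names are all hypothetical, chosen only to illustrate the idea of moving data from a sequential file into a relational target.

```python
import csv
import io
import sqlite3

# Hypothetical source: a small "sequential file" held in memory.
SOURCE_CSV = """customer_id,name,country
1, alice ,us
2,Bob,DE
3,,fr
"""

def extract(text):
    """Extract: read rows from the delimited source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize case, drop rows with no name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue  # a simple data validation rule: name is required
        cleaned.append(
            (int(row["customer_id"]), name.title(), row["country"].strip().upper())
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text=SOURCE_CSV):
    """Run the three steps against an in-memory SQLite target."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT id, name, country FROM customers ORDER BY id"
    ).fetchall()

print(run_etl())
```

The row with the empty name is rejected during the transform step, so only two rows reach the target table.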

DataStage is used in large organizations as an interface between different systems. It handles the extraction, transformation, and loading of data from source to target. It was first launched by VMark in the mid-1990s. After IBM acquired DataStage, it was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage.

The versions of DataStage released so far include Enterprise Edition (PX), Server Edition, MVS Edition, and DataStage for PeopleSoft. The latest edition is IBM InfoSphere DataStage.

IBM Information Server includes the following products:

• IBM InfoSphere DataStage
• IBM InfoSphere QualityStage
• IBM InfoSphere Information Services Director
• IBM InfoSphere Information Analyzer
• IBM Information Server FastTrack
• IBM InfoSphere Business Glossary

DataStage Overview
DataStage has the following capabilities:

• Integrates data from a wide range of enterprise and external data sources
• Implements data validation rules
• Processes and transforms large amounts of data
• Uses a scalable parallel processing approach
• Handles complex transformations and manages multiple integration processes
• Connects directly to enterprise applications as sources or targets
• Uses metadata for analysis and maintenance
• Operates in batch, in real time, or as a web service

The following sections briefly describe these aspects of IBM InfoSphere DataStage:

• Data transformation
• Jobs
• Parallel processing

InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:

• Relational databases
• Mainframe databases
• Business and analytic applications
• Enterprise resource planning (ERP) or customer relationship management (CRM) databases
• Online analytical processing (OLAP) or performance management databases

Processing Stage Types

An InfoSphere DataStage job consists of individual stages that are linked together. The job describes the flow of data from a data source to a data target. A stage usually has at least one data input, one data output, or both. Some stages can accept more than one data input and send output to more than one stage.
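The idea of a job as linked stages can be sketched as a small graph in Python. This is only a conceptual model under stated assumptions, not how DataStage represents jobs internally: the `Stage` class, `link_to`, and the stage names are hypothetical. The example shows one stage sending its output to two downstream stages.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Stage:
    """One processing step; `func` maps a list of rows to a list of rows."""
    name: str
    func: Callable
    outputs: List["Stage"] = field(default_factory=list)  # outgoing links

    def link_to(self, other):
        """Add a link from this stage to a downstream stage."""
        self.outputs.append(other)
        return other

    def run(self, rows, results):
        """Process rows, then pass the output along every outgoing link."""
        out = self.func(rows)
        if not self.outputs:
            results[self.name] = out  # a stage with no outputs acts as a target
        for stage in self.outputs:
            stage.run(out, results)

def build_job():
    """Hypothetical job: source -> add_cents -> two targets."""
    source = Stage("source", lambda _: [
        {"id": 1, "amount": 50},
        {"id": 2, "amount": 250},
    ])
    add_cents = Stage("add_cents", lambda rows: [
        {**r, "amount_cents": r["amount"] * 100} for r in rows
    ])
    all_rows = Stage("all_rows_target", lambda rows: rows)
    large = Stage("large_orders_target", lambda rows: [
        r for r in rows if r["amount"] > 100
    ])
    source.link_to(add_cents)
    add_cents.link_to(all_rows)   # first output link
    add_cents.link_to(large)      # second output link
    return source

def run_job():
    results = {}
    build_job().run(None, results)  # the source stage ignores its input
    return results

print(run_job())
```

Each target collects the rows that reached it, so the two targets can end up with different row sets even though they share the same upstream stage.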

The stages you can use in job design include:

• Transformer stage
• Filter stage
• Aggregator stage
• Remove duplicates stage
• Join stage
• Lookup stage
• Copy stage
• Sort stage
• Containers
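As a rough illustration of what several of these stage types do to a stream of rows, the sketch below reproduces their general behavior in plain Python. The functions, field names, and sample data are hypothetical and greatly simplified; the real stages are configured in the Designer and have many more options.

```python
from collections import defaultdict

# Hypothetical input rows and reference data.
orders = [
    {"order_id": 1, "cust": "A", "amount": 30},
    {"order_id": 2, "cust": "B", "amount": 70},
    {"order_id": 2, "cust": "B", "amount": 70},   # duplicate row
    {"order_id": 3, "cust": "A", "amount": 120},
]
customers = {"A": "Alice", "B": "Bob"}

def remove_duplicates(rows, key):
    """Remove duplicates: keep the first row seen for each key value."""
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def filter_rows(rows, predicate):
    """Filter: pass only the rows that satisfy a condition."""
    return [r for r in rows if predicate(r)]

def lookup(rows, reference, key, new_field):
    """Lookup: enrich each row with a value from reference data."""
    return [{**r, new_field: reference.get(r[key])} for r in rows]

def aggregate(rows, group_key, value_key):
    """Aggregator: sum a column per group."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group_key]] += r[value_key]
    return dict(totals)

def sort_rows(rows, key, descending=False):
    """Sort: order rows by a column."""
    return sorted(rows, key=lambda r: r[key], reverse=descending)

deduped = remove_duplicates(orders, "order_id")
big = filter_rows(deduped, lambda r: r["amount"] >= 50)
named = lookup(big, customers, "cust", "cust_name")
totals = aggregate(deduped, "cust", "amount")
ranked = sort_rows(deduped, "amount", descending=True)

print(named)
print(totals)
print([r["order_id"] for r in ranked])
```

Chaining the functions this way mirrors how stages are linked in a job: the output of one step becomes the input of the next.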

DataStage Components and Architecture

DataStage has four main components:

1. Administrator: Used for administration tasks, including setting up DataStage users, setting purging criteria, and creating and moving projects.
2. Manager: The main interface to the DataStage Repository. It is used to store and manage reusable metadata. Through the DataStage Manager, you can view and edit the contents of the Repository.
3. Designer: A design interface used to create DataStage applications, or jobs. A job specifies the data source, the required transformations, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server.
4. Director: Used to validate, schedule, execute, and monitor DataStage server jobs and parallel jobs.


The above image explains how IBM InfoSphere DataStage interacts with other elements of the IBM Information Server platform. DataStage is divided into two sections: Shared Components and Runtime Architecture.