What is ETL?
In the 1970s, limited resource capacity, mainly memory (RAM), CPU and storage, made it difficult to store data and generate transactional reports on the same system. ETL was introduced to overcome this obstacle.
While working with databases, it is essential to properly format and prepare data, and to load it into target data storage systems. ETL combines three separate but crucial functions into a single programming tool that helps in preparing and transferring data, and also in managing databases.
Each word in the acronym, ‘Extract’, ‘Transform’ and ‘Load’, denotes a step in the movement of data from its source to a data storage system, often referred to as a data warehouse. ETL stands for:
Extract: Reading data from one or more sources, which may include multiple, different types of systems
Transform: Converting the extracted data to the desired form
Load: Writing the transformed data to the target system (a minimal code sketch of all three steps follows this list)
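To make the three steps concrete, here is a minimal hand-written sketch in Java, the language Talend's generated jobs run in. It is illustrative only: the file names and the two-column (item, price) layout are assumptions, not part of any real system.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class MiniEtl {
        public static void main(String[] args) throws Exception {
            // Extract: read rows from an assumed source file, items.csv
            try (BufferedReader src = new BufferedReader(new FileReader("items.csv"));
                 // Load target: an assumed output file standing in for a warehouse table
                 PrintWriter target = new PrintWriter("items_clean.csv")) {
                String line;
                while ((line = src.readLine()) != null) {
                    String[] cols = line.split(",");
                    // Transform: trim the item name and round the price to two decimals
                    String item = cols[0].trim();
                    double mrp = Math.round(Double.parseDouble(cols[1].trim()) * 100) / 100.0;
                    target.println(item + "," + mrp);
                }
            }
        }
    }

An ETL tool lets you assemble this same pipeline graphically instead of writing and maintaining such code by hand.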
ETL is much easier and faster than the traditional approach of moving data by writing complex, conventional computer programs. It is performed by ETL tools, which provide graphical interfaces that speed up the mapping of tables and columns between the source and target databases, and which come with several functions to facilitate the ETL workflow.
With the growing popularity of ETL, the data warehousing market has seen the emergence of several commercial and open-source ETL tools. The most popular ETL tools in the market are Informatica, IBM InfoSphere Information Server, Oracle Data Integrator, Microsoft SQL Server Integration Services (SSIS), Talend Open Studio for Data Integration, etc.
At OneGlobe, we have implemented analytics projects with Talend as the ETL tool, and in this blog I will elaborate on it.
“The future belongs to those who can control their data”
Talend Open Studio
Talend was one of the first providers of open-source data integration software, and the company offers diversified software and services for data integration and data management. Talend is a leading player when it comes to Big Data. Talend Open Studio (TOS), currently known as Talend Open Studio for Data Integration, was launched in October 2006. It is one of the most powerful data integration ETL tools available in the market. Apart from Talend Open Studio, Talend also provides numerous other services such as:
Cloud Integration
Talend Data Integration
Talend Big Data Integration
Talend Data Quality
Talend Data Preparation and many more
TOS is an open-source ETL tool that lets you easily manage all the steps involved in the ETL process, from the initial ETL design to the execution of the data load into the target system. You can leverage the TOS graphical user interface to drag and drop the desired components and achieve the mapping between the source and the target system. It even allows you to transform data columns with the help of a wide range of pre-built formulae. TOS is mainly used for integration between operational systems, for ETL (Extract, Transform, Load) in Business Intelligence and Data Warehousing, and for data migration. It is built on the Eclipse environment and generates native Java code, so a job can run on any platform that supports Java.
Getting to know the Tool
The TOS interface is divided into three panes, namely:
Repository
Design Workspace
Component Palette
The ‘Repository’ is found on the left side of the screen. Data related to the technical items used to design jobs is gathered in the Repository, which is why this pane is known as the ‘heart of TOS’. Once you begin creating them, you can manage metadata (database connections, database tables and columns) and jobs in this pane.
A ‘Component’ is a preconfigured connector used to perform a specific data integration operation. It is dragged and dropped into the centermost pane, known as the ‘Design Workspace’, where you lay out and design jobs (or “flows”). You can view a job graphically in the ‘Designer’ tab, or use the ‘Code’ tab to see the generated code and identify any possible errors. The ‘Component’ tab lets you specify the properties of each component, while the ‘Run’ tab lets you execute the job once the design is complete. To the right is the ‘Component Palette’, which contains hundreds of different technical components used to build your jobs, grouped together in families.
Executing an ETL Job
You can execute a basic job, which transfers data from one system to another, in the following way.
The ETL process described below involves some manipulations on a Big Mart Excel data source and the transfer of the transformed data to the target system, an Oracle database.
The Excel sheet contains data such as the item, its MRP (maximum retail price), the item type, the outlet in which the item is available, etc. You can design the job for the data transfer as follows:
Create a new job, and from the palette drag and drop the following components:
tFileExcelInput
tDBOutput_2
tMap
Connect the components as shown below, so that data flows from tFileExcelInput through tMap into tDBOutput_2.
Go to the ‘Component’ tab of tFileExcelInput and specify the properties of the file and other details, such as whether the file includes a header, which sheets of the Excel file to read, and the schema of the file.
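For a sense of what this component saves you from hand-coding, here is a hypothetical plain-Java equivalent using the Apache POI library. This is not Talend's actual generated code, and the file name, sheet position and column layout are assumptions:

    import java.io.File;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class ReadBigMartSheet {
        public static void main(String[] args) throws Exception {
            try (Workbook wb = WorkbookFactory.create(new File("BigMart.xlsx"))) {
                Sheet sheet = wb.getSheetAt(0);   // assumed: data is on the first sheet
                boolean isHeader = true;
                for (Row row : sheet) {
                    if (isHeader) { isHeader = false; continue; }        // skip the header row
                    String item = row.getCell(0).getStringCellValue();   // assumed column 0: item
                    double mrp  = row.getCell(1).getNumericCellValue();  // assumed column 1: MRP
                    System.out.println(item + " | " + mrp);
                }
            }
        }
    }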
Connect the file component and the tMap component with a row (Main) connection. In the tMap, you can perform operations such as creating new columns, manipulating existing columns, joining two or more sources, etc. Here, you can drop the unnecessary columns from the Excel file.
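tMap expressions are written as plain Java. As a purely illustrative example (the column names below follow the Big Mart sheet described earlier and are assumed, not taken from the actual job), output-column expressions in the tMap editor might look like:

    // Derive a new category column from the MRP value
    row1.MRP > 200 ? "Premium" : "Standard"

    // Normalize an existing text column, guarding against nulls
    row1.Item_Type == null ? "Unknown" : row1.Item_Type.trim()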
Go to the ‘Component’ tab of tDBOutput_2, choose the DB type and fill in the database credentials such as host, port, username, password and schema. Specify the target table name and the action on the table, such as creating a new table, dropping the existing table and re-creating it, or updating the data. Once the details are filled in, connect the tMap component to the output component using a row (Main) connection.
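Conceptually, tDBOutput_2 wraps the JDBC calls you would otherwise write yourself. Here is a hedged plain-Java sketch of the equivalent load step; the connection URL, credentials, table and column names are all assumptions, and the Oracle JDBC driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class LoadToTarget {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; in TOS these come from the component
            // settings or the Metadata Repository (see the tip below)
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "secret")) {
                try (Statement ddl = conn.createStatement()) {
                    // The "create table" action, written out by hand (table and columns assumed)
                    ddl.execute("CREATE TABLE BIG_MART_SALES (ITEM VARCHAR2(100), MRP NUMBER(10,2))");
                }
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO BIG_MART_SALES (ITEM, MRP) VALUES (?, ?)")) {
                    ins.setString(1, "Soft Drinks");   // sample row for illustration
                    ins.setDouble(2, 48.27);
                    ins.executeUpdate();
                }
            }
        }
    }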
Tip: To avoid entering the database credentials again and again, you can store the details in the Metadata Repository and reuse them whenever required by simply dragging and dropping them into the Design Workspace.
Once done, go to the ‘Run’ tab and execute the job.
This is the basic flow of how data is extracted, transformed and loaded from the source system to the target system. Talend Open Studio provides a wide variety of components to connect to various databases and platforms such as AWS, Azure, Google Cloud, Salesforce, etc.
Advantages
Talend Open Studio offers many connectors for integrating with different data sources such as files, databases, SAP, Salesforce, FTP, etc. By simply dragging and dropping the essential components, you can quickly complete ETL processes such as reading data from a text file and inserting it into various databases like Azure SQL, Oracle or MS SQL Server. Since the design layout of the job is automatically coded in the back end, there is no need to hand-code any process; even if you need to make changes, you can implement them by modifying the generated code.
Talend is highly suitable for any sector that wants seamless integration between the source and the target, and to mash up multiple tables and park them in the target system.
Our Use Case
In our case, the customer requirement was to extract data from Oracle EBS and transform/load it into their MS SQL Server. The UI of Talend Open Studio made our work easier while mashing up tables and views within the Talend environment. A simple drag and drop of objects from different sources also minimized the need for writing custom SQL in the source systems and significantly reduced the time taken during the ETL design phase. The scheduling option in Talend, which comes in the Enterprise Edition, played a vital role in bringing data onto MS SQL Server at the scheduled time, enabling us to apply all the transformation logic on a huge volume of data and provide real-time visibility to the customer.
For more information on Data Warehouse Modernization, please visit https://bit.ly/2M6FkqJ