Getting Started with Data Engineering

{tocify} $title={Table of Contents}

What is Data Engineering

Let's segregate and understand, then understand it together - a kind of  Beginner's guide to Data Engineering

Data -  Some fact or information
Engineering - Basically, engineering is a systematic application of knowledge and experience to solve problems, protect, and improve lives.

Data Engineering - Data engineering is the process of designing and building systems to collect data, store the collected data and further processing on it to make it suitable for performing Data analysis and applying data science.

What is Data Analysis

In Layman terms, it is bringing out some sense out of the available data and use it to make some decision.

Following are the basic techniques which can be used for analysis 
descriptive - deducing What has happened so far from the available data
diagnostic - deducing Why it happened so far from the available data
predictive - deducing What is likely to happen in future from the available data
prescriptive - deducing What do we need to do in future from the available data

Are there more analysis technique - Yes, you will find that there are many based on different perspective, the way analysis is done etc.

You might ask, 

Why the need of Data engineering? 

If you consider the traditional systems, mostly there is some RDBMS(Relational Management Database System) used to store the application data and for implementing the business logic/processes, systems are built using java,.net etc.

It works absolutely fine, serves its purpose and will be there as long as we have businesses and their requirement for applications catering various business need.

So can this RDBMS be used for the purpose of extracting some insights ?

Yes it can be used - you can do analytical processing(programs/formula/algorithm)  on it and get the results out of it. 

But.... Wait there is a problem .. And the problem is ...

                                     BIG DATA  

What is Big Data

To simply put primarily -  Big Data is data which is BIG 

As per IBM, any data which can be characterized by three V's is considered to be a Big Data 

3 V's of Big Data are

  1. Volume - The amount of the data that is generated
  2. Variety - The types of the data generated
  3. Velocity - The speed at which the data is generated
Earlier businesses where only taking the core applications data into consideration for gaining the insight or carving out the decision making facts, but with the advancement in the technology - it has changed.

And it's all because we are trying to be more smart 😉.

The change is due to the fact that we are into era where we want every thing to be smart.  

Not only us(the humans), but also the devices and the apps which we use on day to day basis. 

What makes devices and apps smart?

In simple terms, it is the ability of connecting and communicating with each other with the help of internet.

And where there is a communication happening, there is data involved.

Just think of any networking site, any video sharing sites, any IOT devices - they generate data(3 v's of Big data). 

And Traditional system aren't designed to handle this scenario i.e. large and complex data - thus the need of BIG Data systems came into existence(clustered processing engines, data lake, machine learning, AI etc.)

Hope, you got the glimpse of BIG DATA and the industries are trying to gather data as much as possible from all the related sources so as to make most of the information.

In a nutshell and laymen terms, 

Big Data is the medium through which industries are trying to increase there Income 

Whenever you shop for something online, have you seen "people who buy this also bought that"

Let's get back to technicalities

Types of data 

Basically data can be contained in any of the forms like text, drawing, pictures, audio, video etc.

But, there are basically 3 categories in which they can be segregated into

i. Unstructured Data

No Predefined structure, i.e., no adherence to any schema or model 

Examples are Text data, images, audio, videos, phone calls transcriptions, social media comments, logs files etc

ii. Semi Structured Data

It is same as unstructured data, i.e., no adherence to any schema or model but has some structure to it.

Examples are XML, JSON, Emails etc

iii. Structured Data

Data is stored as per the predefined structure, adhering to the schema or the model. 

If data represented in rows and columns (structured), it is easy to traverse through it.(RDBMS)

An easy search is primary and notable feature of Structured data 

Examples are SQL, transaction history, excel, google sheets, customer data, phone records etc

Stages in Data Engineering

Every project has some vision and tied to it, is the requirement gathering followed by series of steps

1. Understand the Business, Identifying the key data and it's supporting data

When dealing with data project, it is of utmost importance that you
 i. Understand what the business is about
ii. What all entities are involved
iii. What are the key data 
iv. What are the key systems and what are the supporting systems

Knowing above all helps tremendously in all stages . 

2. Collect the Raw Data

Based on the intention, you decide which data to collect. 

Say you want sales reports to be generated, for that you will first identify the system which has the sales related data and the systems which has sales supporting data (data on which sales data is dependent/related).

Now the next is to see what(how) type of data is to be collected -  data  already generated and stored?  or it's real time data?

Batched Data

Here the data is already created and is stored over a period of time. e.g., Sales data generated over the period of years 

Streaming Data

Streaming data is data that is continuously generated by different sources

3. Clean the Raw Data

So whatever data you collected above is termed as Raw Data, the data that is collected directly from the data source, while clean data is
 processed raw data. 

That is, clean data is a modification of raw data, which includes the removal of irrelevances and inaccuracies.

Cleaning data basically comprise of
i.  Removal of Unwanted data
ii. Duplicate data 
iii.Fixing Data Structures etc

4. Enrich or Merge the data

Once you have clean raw data, the next step is to make the data as usable data. Many times your clean raw data will be usable and serve the purpose.

But sometimes you need to enrich it, for that you can refer to some existing operational data, merge some data.

While doing so there might be some business rules/policies which also needs to be taken into consideration.

5. Store it in Warehouse

So once we are done with above steps, the data is stored in Data warehouse which is  a central repository and also commonly known as an online analytical processing system (OLAP).

The primary goal(apart from storing) of data warehouse is to make the querying on data easy and fast.


This is a subset of a data warehouse used to support a specific region, business unit, function area or department (i.e., Finance).

6. Using the refined data for Reporting, Artificial Intelligence, Machine Learning

All the data from various sources after all the processing is pushed to data warehouse to serve the purpose of providing data which can be used for decision making with the help of Data analysis and applying data science.

In Layman term it is a data source for visualization (reports), Machine learning and for Data Science.

Top executives of organization largely depends on this to improve the shortcomings, improve the services -- ultimately to reduce losses and increase the revenue.

Strategies in Data Engineering ( Basically designing your ETL Process)

i. How to pull full load(Historical data)

When we talk about historical(any data which is before today/now)  data, expect it to be heavy i.e., big size as data accumulated can span for several years or decades etc .

Of course  the amount of data depends on the organization's business volume.
Can the whole data be pulled in one go? Or should the data be pulled as batches(based on size, date etc)?

ii. How to get delta(Incremental data)

CDC(capture data change) is what comes in picture first, as to how we know that what is changed at source? 

Not all system have out of box feature to notify about data change thus custom ways need to be applied.

a. Like using a flag denoting whether the data is processed or not - only pull data which has flag as not processed
b. Having a Timestamp value as a reference, so any data beyond this timestamp can be pulled and after that update the timestamp

iii. How to merge the data (Historical and Incremental data)

Once we have Historical data pulled and stored, the next is pulling Incremental data. How to get it, we saw in above point.

But how to merge it with Historical - should it be done after every Incremental data pull or should it be done once in a day or every 12 hrs etc

iv. How/Where to Store  merge the data (Historical and Incremental data)

Here you decide where you want to store the collected raw data, where should be the container of it be- on premise or on cloud .

Basically, where you want to stage the data, which supports multi format data(format agnostic), supports scalable storage and support for transformation(data transformation and processing programs).

v. Archiving the Raw Data

It is not mandatory, but it should be part of solution. Basically we archive the raw data which we have processed, so that if there is need to process again the same data due to some reason - then we shouldn't go always to the actual source of the data.

vi. Adding security on every touchpoint

While designing the solution there will be lot of systems and application involved - some residing on-premise, some may be in public cloud etc.

Thus it becomes primitive task to add Security at every touchpoint/hop.

At high level, from Data lifecycle perspective following categorization can be done

a. Securing Data in Transit
    Here the data is on move from one network to another, one storage to another - here we need to secure both the channel and the data.

Applying/using security on protocol (HTTPS,FTPS, SSL/TLS etc) and encrypting the data should be done.

b. Securing Data in Store(at rest)
    In most of the big data solution we store data at two places
    i. Stage (Raw/Merge data)
    ii. Data warehouse (Processed/ready to use data)

Applying encryption mechanism/solutions along with the firewalls, antivirus etc protects from outside invasion but to protect from inside invasion - implementing Role based access should be helpful.

c. Securing Data in Use 
    Here we should be focusing on limiting the access. Say we don't want someone from sales department to get access to finance department data. Or someone from one region shouldn't be able to see another region etc.

Basically the only intended user should be able to access the data relevant to them -  implementing Role based access at this layer.

At low level we can make use of 

a. Centralized key store ( to store keys, certificates and secrets)
b. Use of certificates for encryption/authentication
c. Use of Service principal/Service account/Managed Identity etc. 

vii. Audit  and Logging

This area often gets subsided but it helps a lot in managing the solution, in Identifying the issues and gaining insights.

You can have a centralized approach where audit data from all the systems/application/phase/layer are captured and stored collectively.

Or you can have them segregated per systems/application/phase/layer.

So here we can capture details like - 

When data was moved (date, time)
What data was moved (based on module, department etc)
How much data moved (row count, size etc)  
From where to where (Source and destination)
Process success/failure status etc

It depends how granular you want to go.

Technologies  used in designing your ETL Process

Below are examples of high level technologies/tools used,

1. Extraction/Data collection

   i. Azure Data Factory for Batched data
  ii. Kafka for streaming data

2. Data Store (Staging - Raw/Merge)

  i. Azure Data Lake
 ii. Azure blob storage
iii. Amazon S3

3. Processing

 i. Azure Databricks

4. Data Warehouse

  i. Snowflake
 ii. SQL Datawarehouse
iii. Oracle

5. Reporting

 i. PowerBI
ii. QlickView
iii. Tableau

Reference Architecture

Below are couple of references to summarize the topic 

Microsoft Data Lake Architecture

                 Microsoft Data Lake Architecture -- reference from Here

AWS Data lake architecture

                 AWS Data Lake Architecture -- reference from Here


If you have any suggestions or questions or want to share something then please drop a comment

Previous Post Next Post