Getting Started with Azure Data Factory - CopyData from CosmosDB to SQL | Azure data factory tutorial | Basics of Azure Data Factory

byMaheshkumar Tiwari •5:38 PM

0

{tocify} $title={Table of Contents}

Why and What is Azure Data Factory

Day by day the amount of data collected by any business/enterprise is growing - both the historical and transactional. But is there any need to keep them? Is it useful to business in any way? -- Yes, the data which is collected so far or would be collected on going can be used to deduce value by process of analyzing data and presenting actionable information to help executives, managers and other corporate end users to make informed business decisions.(assisting Business Intelligence)

Like it is said Information when processed on turns into Knowledge.

To make this possible Raw Data(information) --> Transformed data(knowledge), Microsoft introduced data integration service i.e. Azure Data Factory which is capable of extracting-transformation-loading (ETL) or extracting-loading-transformation(ELT), for transformation activities either of following can be used - HDInsight, Data Bricks, Azure Data lake U-SQL, Machine Learning, stored procedures etc and is managed by Microsoft.

Each business/entity have there own way of storing the data (data stores), some prefer to store on premises, some on cloud, some on file system, some in different database etc.

There is currently support for more than 72 data store to which ADF can connect to and extract data.

Like Logic Apps, where we have provision to connect services which are cloud based or also on premises (integrating various services) like wise Azure Data factory can connect to various types of data stores which are cloud based or also on premises.

You get a browser based designer (also available in visual studio), where you can design the Pipeline (data driven workflow) by setting the Trigger (the way to start the workflow) and then selecting the appropriate activity or series of activities.

You have provision to debug and validate the workflow, once done you publish the workflow, it gets deployed – ready to use. you are charged only when it is executed - based on the number of activities that are run, the volume of data that is moved etc.

Building Blocks of Azure Data Factory

TRIGGERS in ADF

There is three way you can trigger pipeline - manually or Scheduled based and Event Based(example -the deletion of a file, in your Azure Storage account) .
Scheduled based will invoke automatically based on your scheduled time. It can be set to execute on daily, weekly, monthly basis (scheduled ) or can be set on time intervals of fixed size (Tumbling window)

CONNECTIONS in ADF

The way to connect to data store and establishing a way to allow data movement to and fro is done via Integration Runtime and Linked Services.

Integration runtime acts as a Data gateway/medium (on premises to cloud, cloud to cloud) whereas Linked Services are connection to data sources and destinations.

Linked services are connection strings much like bindings in function App and Connectors config in Logic App. Thus Linked Service runs in scope of Integration Runtime.

Currently three types of IR are supported

Azure (Within Azure)
Self-hosted (On premises)
Azure-SSIS (dedicated for SSIS package)

ACTIVITIES in ADF

The steps in pipelines are termed as Activities and can be Data movement activities, Data transformation activities (Data bricks),Control activities (Lookup activity, if conditions, wait activity,for each etc).

You can initiate with any of the activity based on requirement. Below is how they are categorized

DATASET in ADF

Datasets represent the way data is present inside the data store i.e. data structures within the data stores .

Thus an input dataset represents the input which will be used and an output dataset represents the output which will be produced by an activity.

PIPELINE in ADF

It is here where all the components discussed above are used to design the data workflow with help of an activity or group of activities, like from where data is to be collected, how it is to be collected, what is to be done with data.

So it is a logical grouping of activities or a scope of the tasks.

MONITORING PIPELINE in ADF

Inbuilt support for monitoring the pipeline runs is provided out of box, also details on Activity level can be checked and alerts can be set on top of it.

Below is small demonstration to help getting started with Azure data Factory.

Scenario

For demo I am using simple scenario - say a business wants to archive the transactional data from the application database (where data is retained for few days only and then after gets deleted). As business needs the data for further processing, it wants it to be stored somewhere i.e. move data from application database to archive database periodically.

Here I have used Cosmos db as application database and Azure SQL database as the archive database.

Steps in creating solution

Create Azure Cosmos DB -- Source

To keep it simple have created collection with four records only (In reality there would be lot of data, upto millions of records )

Create Azure SQL DB -- Destination

To keep it simple have created table with column names matching to that of fields in Records of cosmos db. I made small mistake here which led to failure in Pipeline run.

Create Azure Data Factory

Click on +Create a resource --> Data Factories --> click on +Add button

Provide unique name to Data Factory instance to be created as per purpose , select the subscription, resource group, version and location.

Note that currently only limited locations are available to choose from.

After creating an instance of Data Factory, you need to click on Author and Monitor - which will lead to ADF designer portal (dev env) which opens in separate tab.

The new tab opens with options to get started with. Either of first two can be used to create pipeline and but as we need to create a pipeline to copy data so selected the Copy Data Wizard(it creates a pipeline with Copy data activity)

Give a descriptive name and select the Task schedule, as I want to run it manually once , Run once now is selected.

Now we need to provide the data store from where data is to be extracted and for that we need to create new connection (Linked Service)

As data source is Cosmos DB which is SQL API based, create linked service by selecting Azure Cosmos DB (SQL API)

Next provide name to the Linked Service, select the Runtime . As my data store is in Cloud itself , AutoResolveIntegrationRuntime is selected.

If source data store would have been on Premises, then Integration Runtime agent had to be installed on the onPremises server and that would have been used as gateway.

For Connection String, as it is same subscription, From Azure Subscription is selected(it automatically fetches the keys/connection string), else we can also enter manually.

Select the CosmosDB account and Database which is to be connected, and click on Test connection and Finish.

Once connection is tested, create a dataset

Here Table data is to be fetched directly so EmployeeRecords table is selected from Existing tables, else query can also be used to fetch the data

Next is to create connection with Destination , select Azure SQL database

Provide the details like server name, database name, login details -- here too it is on Azure and same subscription

EmployeeArchive table is selected as destination as data is to be copied to it

With presents with column mapping for creating Destination Dataset, you can make changes select/unselect here or continue.

As for this demo, I kept it simple and same - one to one mapping

Select what is to be done if any exception occurs in Pipeline -- You can abort and fail the copy activity when incompatible data is encountered (default behavior) or You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows (also errors can be logged).

In summary page, verify the details and if any changes are to be done then Edit option can be used.

Click on Finish, that's it, pipeline is ready, save it , publish it and test.

Testing

For testing I triggered the pipeline manually.

After the pipeline run succeeded, entries were found in the SQL DB

Summary

It is the cloud-based data integration service (ETL)
It allows you to create and schedule data-driven workflows(pipelines) for orchestrating data movement and transforming data at scale, that can ingest data from disparate data stores .

If you have questions or suggestions, feel free to do in comments section below !!!

Do share if you find this helpful .......

Knowledge Sharing is Caring !!!!!!

Learn More about Azure Data Factory

Tags: Azure Data Factory