Interview Questions and Answers on Azure Data Factory (ADF) for Azure Developers



This post covers answers to the questions an Azure developer may face in an interview, i.e. Azure developer interview questions and answers on Data Factory (ADF).


1. What is the Parquet file format?
- Parquet is an open-source, compressed, binary file format designed for efficient storage and processing of flat columnar data (also called columnar storage). Some of the features and benefits of Parquet are listed below, followed by a short example:

It supports various data types, including complex nested structures and arrays.

It uses various compression and encoding techniques to reduce the size of data and improve query performance.

It stores metadata, such as schema and statistics, within the file, which enables self-describing data and faster filtering.

It is compatible with most of the Hadoop ecosystem tools and frameworks, such as Spark, Hive, Pig, and Impala.

It is language independent and has a binary representation that can be read by any system.
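As a quick, ADF-agnostic illustration of the columnar layout, here is a minimal Python sketch (assuming the pandas and pyarrow packages are installed) that writes a small table to a compressed Parquet file and reads back only the columns it needs:

import pandas as pd

# Build a small example table and write it to a Snappy-compressed Parquet file.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.5, 20.0, 7.25],
})
df.to_parquet("sales.parquet", engine="pyarrow", compression="snappy")

# Columnar storage lets a reader load only the columns it needs.
subset = pd.read_parquet("sales.parquet", columns=["id", "amount"])
print(subset)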

2. How do you create a Parquet file from SQL table data? 
- Simply use the Copy Data activity in a pipeline and provide the details of the source (i.e. which table to read) and the sink (e.g. an ADLS Gen2 folder location with a Parquet dataset). A sketch of such a Copy activity is shown below.
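For illustration only, here is a minimal sketch of what such a Copy activity definition might look like, written as a Python dict that mirrors the pipeline JSON. The dataset names (SqlTableDs, ParquetOutputDs) are placeholders, and the exact source and sink type names depend on your connectors.

import json

# Hypothetical Copy activity: SQL table source -> Parquet sink in ADLS Gen2 (names are placeholders).
copy_activity = {
    "name": "CopySqlToParquet",
    "type": "Copy",
    "inputs": [{"referenceName": "SqlTableDs", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "ParquetOutputDs", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "ParquetSink"},
    },
}
print(json.dumps(copy_activity, indent=2))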

3. What is item() in ForEach in ADF? 
- item() is the current object/record of the ForEach activity's items array.

Foreach is an ADF (Azure Data Factory) activity that defines an iterative flow of control within a pipeline. It is used to iterate over a collection of items and perform a specific activity within a loop. 

The Foreach's item property is the current item that the Foreach activity iterates over from the input collection. An @item() expression can be used to refer to the current item anywhere a property value can be specified using a dynamic expression. 

For example, if the input collection is an array of numbers [1, 2, 3], @item() returns 1 on the first iteration, 2 on the second, and 3 on the third. You can pass @item() to other activities inside the ForEach loop, such as a Copy activity or a Web activity, as in the sketch below.
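For illustration, here is a minimal sketch of a ForEach activity that iterates over a pipeline parameter and passes @item() into an inner activity, written as a Python dict mirroring the pipeline JSON; the fileNames parameter and the Web activity endpoint are hypothetical.

import json

# Hypothetical ForEach activity: iterate over a pipeline parameter and hand @item() to an inner activity.
foreach_activity = {
    "name": "ForEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.fileNames", "type": "Expression"},
        "activities": [
            {
                "name": "NotifyPerFile",
                "type": "WebActivity",
                "typeProperties": {
                    "url": "https://example.com/notify",  # placeholder endpoint
                    "method": "POST",
                    "body": {"value": "@item()", "type": "Expression"},
                },
            }
        ],
    },
}
print(json.dumps(foreach_activity, indent=2))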

4. Difference between Logic Apps and Data Factory?
- Both Logic Apps and Data Factory are Azure services that can be used for data integration and orchestration. However, they have different purposes and functions. Some of the key differences:

Focused on application integration, Logic App lets you create automated, scalable workflows that integrate apps and data across cloud and on-premises services. Logic Apps support various connectors, triggers, and actions that allow you to interact with various systems and services.

Focused on data processing, Data Factory lets you create data-driven workflows in the cloud to orchestrate and automate data movement and data transformation. Data Factory supports various sources, sinks, activities, and pipelines that enable you to ingest, transform, and load data.

Logic Apps are well suited for scenarios involving simple or complex business processes, such as approval workflows, event-driven processing, or service orchestration. Data Factory is well suited for scenarios involving large-scale data integration, such as data warehousing, data lakes, or data migration.

Logic Apps has a pay-as-you-go pricing model, where you only pay for the number of runs and connectors you use. Data Factory uses a usage-based pricing model, where you pay according to the number of activities and pipelines you run and the compute resources you consume. 

In some cases, Logic App and Data Factory can be used together to leverage the strengths of each and optimize cost and efficiency. 

For example, you can use Logic App to trigger Data Factory pipelines based on events or schedules, or use Data Factory to invoke Logic App as part of your data processing workflow. 

5. What is the integration runtime responsible for in Azure Data Factory?
- Integration runtime (IR) is the compute infrastructure used by Azure Data Factory (ADF) to provide data integration capabilities across different network environments. IR is responsible for the following tasks in ADF:

Data movement: Copy data across data stores in public or private networks. IR supports built-in connectors, format conversion, column mapping, and scalable data transfer.

Activity dispatch: Dispatch and monitor transformation activities running on various compute services, such as Azure Databricks, Azure HDInsight, Azure SQL Database, SQL Server, and more.

Data flow execution: Execute a data flow in a managed Azure compute environment. IR supports various sources, sinks, transformations, and expressions for data processing.

SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment. IR supports various sources, destinations, tasks, and components for SSIS.

6. How does Integration Runtime handle data security and compliance?
- IR (Integration Runtime) handles data security and compliance by using various features and practices to protect the confidentiality, integrity, availability, and security of the service and customer data.

Some of the ways that IR handles data security and compliance are:

Encryption:
IR uses industry-standard encryption algorithms and protocols to encrypt data in transit and at rest. IR also supports encryption key management and rotation for customer-managed keys.

Network security:
IR supports network policies, firewall rules, private connectivity, proxy settings, and private link/private endpoint to secure network access to IR and prevent unauthorized access from or to a network.

Authentication:
IR supports various authentication methods, such as passwords, federated authentication, multi-factor authentication, key pair authentication, and OAuth to verify the identity of users or systems accessing IR.

Authorization:
IR supports various authorization mechanisms, such as roles, privileges, grants, ownership, and secure views to grant or deny access to resources based on predefined rules.

Data masking:
IR supports dynamic data masking, applying masking policies at query time based on user role to hide sensitive data or replace it with fictitious data.

Compliance audit:
IR relies on certified third-party auditors to ensure that its technical infrastructure and data processing practices meet or exceed global cybersecurity and privacy standards such as NIST 800-53, PCI DSS, ITAR, and FISMA.

7. How many types of triggers are supported by Azure Data Factory?
- Azure Data Factory (ADF) supports three types of triggers that can be used to run pipelines without manual intervention:


Schedule trigger:
This trigger runs the pipeline on a fixed schedule, e.g. hourly, daily, weekly, or monthly. You can specify the trigger's start date, end date, recurrence, and time zone.

Tumbling window trigger:
This trigger runs the pipeline over fixed-size, contiguous time intervals, such as every 15 minutes or every 2 hours. You can specify the trigger's start time, end time, interval, and time zone. Unlike the schedule trigger, it retains pipeline state and supports dependencies and chaining across windows.

Event-based trigger:
This trigger runs the pipeline in response to events, such as the creation or deletion of a blob in Azure Blob Storage or Azure Data Lake Storage Gen2, or a custom event. You can specify the storage account, container, folder path, file name, and event type for the trigger. A sketch of a schedule trigger definition is shown below.
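For illustration, here is a minimal sketch of what a daily schedule trigger definition might look like, written as a Python dict mirroring the trigger JSON; the pipeline name and start time are placeholders.

import json

# Hypothetical daily schedule trigger attached to a pipeline named "MyPipeline".
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T06:00:00Z",  # placeholder start time
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "MyPipeline", "type": "PipelineReference"}}
        ],
    },
}
print(json.dumps(schedule_trigger, indent=2))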


8. What is the difference between a data flow and a pipeline in ADF?
- Data flows and pipelines are two different concepts in ADF (Azure Data Factory). Some of the key differences:

A data flow is a graphical, code-free transformation layer that uses Azure-managed Spark clusters behind the scenes to process data from a variety of sources and sinks. Data flows support transformations such as derived columns, aggregates, filters, joins, and pivots.

A pipeline is an orchestration layer that defines a series of one or more activities, such as copying data, executing stored procedures, or running data flows. Pipelines support a wide variety of activities, triggers, parameters, variables, and expressions. 

Dataflow can run as an activity within a pipeline, but not the other way around. A pipeline can run without a data flow, but a data flow cannot run without a pipeline.

Data flows are well suited for scenarios involving complex data transformation and processing, such as parsing values, calculating derived values, adding/renaming/removing columns, or adding and removing rows. 

Pipelines are suitable for scenarios involving simple or complex data integration and orchestration. For example, copying data, executing stored procedures, firing events, scheduling jobs, and so on. 

9. Can we share a self-hosted integration runtime?
A self-hosted integration runtime (SHIR) is a locally installed and managed instance of the Azure Data Factory integration runtime (IR). It establishes a secure connection between your on-premises resources and the cloud, allowing data to be transferred between the two.

A single SHIR can be used for multiple on-premises data sources.
It can also be shared with another data factory within the same Azure Active Directory (Azure AD) tenant.

However, Synapse workspaces do not support integration runtime sharing. Only one instance of SHIR can be installed on each individual computer. You can have multiple SHIRs on different computers that connect to the same local data source. 

10. What are the differences between the integration runtimes in Azure?
Azure Data Factory has three types of integration runtimes (IRs):
Azure IR, self-hosted IR, and Azure-SSIS IR.

Azure IR is a runtime fully managed by Azure itself. It can perform data flow execution, data movement, and activity dispatch over public or private networks. You can also use Private Link to securely connect to data sources within your virtual network.

A self-hosted IR is a runtime that you install and manage yourself, on-premises or on your own VMs. It can perform data movement and activity dispatch (but not data flow execution) and can access on-premises data sources or data sources inside your virtual network.

Azure-SSIS IR is a runtime that allows you to natively run SQL Server Integration Services (SSIS) packages in a managed Azure compute environment. You can also use Private Link to securely connect to data sources within your virtual network. 

11. What are the key components of ADF in Azure?
Some of the key components of Azure Data Factory (ADF) are:

Activities: These are the tasks that ADF performs, such as copying data, running a data flow, executing a SQL query, etc.

Pipelines: These are the logical grouping of activities that perform one unit of work. Pipelines can be scheduled and monitored using ADF’s web-based interface or REST APIs.

Datasets: These are the named references to the data sources that ADF can access, such as Azure Blob Storage, Azure SQL Database, Oracle, etc.

Linked services: These are the connection definitions (connection strings and credentials) that ADF uses to connect to the data sources referenced by datasets.

Data flows: These are graphical representations of data transformations that can be performed on data sources using Spark clusters in Azure.

Integration runtimes: These are the compute infrastructure that ADF uses to provide data integration capabilities across different network environments. There are three types of integration runtimes: Azure, Self-hosted, and Azure-SSIS.


12. How do you use parameters in Azure Data Factory?
- Parameters in Azure Data Factory let you pass external values to pipelines, datasets, linked services, and data flows. They can be used individually or as part of an expression.

Parameters can be defined at the pipeline level or within the data flow definition. You can also access system variables that provide information about pipeline runs, such as the trigger time, run ID, etc.

Here are some examples of using parameters in Azure Data Factory.


Use pipeline parameters to pass the connection details of a dataset or the path of the file to process.

Use the Set Variable activity to set and modify pipeline variables within a pipeline. Use parameters to control data flow behavior, such as filtering, merging, aggregating, or transforming data.

Pass trigger information to the pipeline, for example the window start and end times for tumbling window triggers, or custom data fields for custom event triggers. A sketch of passing parameter values when starting a pipeline run is shown below.
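A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder resource names, of starting a pipeline run from Python and passing values for its parameters:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers - replace with your own subscription, resource group, factory, and pipeline.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start a run of a (hypothetical) pipeline, supplying values for its parameters.
run = client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "CopyFilesPipeline",  # hypothetical pipeline name
    parameters={"sourcePath": "input/2024/", "sinkTable": "dbo.Staging"},
)
print("Started run:", run.run_id)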

13. What is the difference between parameters and variables in ADF?
The main difference between parameters and variables in ADF is that parameters are external values passed to pipelines, datasets, linked services, and data flows, and they cannot be changed while the pipeline is running. 

Variables are internal values that exist within a pipeline and can be set and changed during pipeline execution using the Set Variable activity.

Parameters let you control the behavior of a pipeline and its activities, for example by passing the connection details of a dataset or the path of a file to process. 

Variables can be used to store and manipulate data during pipeline execution, such as storing computation results or the current state of a process. A sketch of a Set Variable activity is shown below.
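For illustration, a minimal sketch of a Set Variable activity that stores an upstream Copy activity's row count in a pipeline variable, written as a Python dict mirroring the activity JSON; the activity and variable names are hypothetical.

import json

# Hypothetical Set Variable activity: capture the rows copied by an upstream "CopyData" activity.
set_variable_activity = {
    "name": "StoreRowsCopied",
    "type": "SetVariable",
    "dependsOn": [{"activity": "CopyData", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        "variableName": "rowsCopied",
        "value": {"value": "@string(activity('CopyData').output.rowsCopied)", "type": "Expression"},
    },
}
print(json.dumps(set_variable_activity, indent=2))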

14. What are the parameters of a dataset?
The parameters of a dataset in ADF are the named references to the external values that can be passed into the dataset definition. They can be used to specify dynamic properties of the dataset, such as the connection details, the path of a file, or the query to be executed.

To define a dataset parameter, you can follow these steps:

Click on your dataset to view its configuration tabs.
Select the “Parameters” tab, and click on the “+ New” button to define a new parameter.
Enter a name and description for the parameter, and select its data type from the dropdown menu. Data types can be String, Int, Float, Bool, Array, Object, or SecureString.
Optionally, you can also assign a default value to the parameter.
After defining a dataset parameter, you can access its value in a data flow or a pipeline activity by using the @dataset().parameters.<parameterName> expression. 

For example, if you defined a parameter called pageNum of type Int, you can access its value in a data flow by using the expression @dataset().parameters.pageNum. You can also pass values to the dataset parameters from the pipeline or data flow level by using dynamic content, as in the sketch below.
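For illustration, a minimal sketch of a parameterized delimited-text dataset whose file name is supplied at runtime, written as a Python dict mirroring the dataset JSON; the linked service name and container are placeholders.

import json

# Hypothetical parameterized dataset: the file name is supplied by the caller at runtime.
dataset = {
    "name": "ParamCsvDs",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "MyBlobLinkedService", "type": "LinkedServiceReference"},
        "parameters": {"fileName": {"type": "String", "defaultValue": "sample.csv"}},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": {"value": "@dataset().parameters.fileName", "type": "Expression"},
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}
print(json.dumps(dataset, indent=2))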

15. How do we monitor and manage Azure Data Factory pipelines?
There are several ways to monitor and manage your Azure Data Factory (ADF) pipelines:

From the Azure portal, you can view your data factory as a graph, view the activities in a pipeline, inspect input and output datasets, pause and resume pipelines, debug pipelines, re-run failed pipeline runs, and create alerts.

Azure PowerShell lets you create, update, delete, start, stop, and monitor pipelines using cmdlets.

Use the Monitoring and Management app, a web-based application that provides many features for monitoring and managing Data Factory pipelines. You can use system views, sort and filter activity windows, perform batch actions, and access detailed activity-specific monitoring experiences. A sketch of querying recent pipeline runs programmatically follows.
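A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder resource names, that queries the last 24 hours of pipeline runs and prints their status:

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "my-rg"               # placeholder
FACTORY_NAME = "my-adf"                # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Query all pipeline runs updated in the last 24 hours.
now = datetime.now(timezone.utc)
filters = RunFilterParameters(last_updated_after=now - timedelta(days=1), last_updated_before=now)
runs = client.pipeline_runs.query_by_factory(RESOURCE_GROUP, FACTORY_NAME, filters)

for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status)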

16. How do you send alerts from Azure Data Factory?
There are various ways to send alerts from Azure Data Factory (ADF), for example:

Azure Monitor allows you to create alerts based on metrics and logs emitted by ADF. You can also configure action groups and notification types for alerts, such as email, SMS, phone call, or Azure app push notification.

With Azure Logic Apps, you can create workflows that send emails through Office 365 Outlook or other connectors. You can use web activities to trigger logic app workflows from pipelines and pass dynamic messages using system variables and expressions.

Data Factory Analytics, a service pack available in the Azure Marketplace, provides a rich set of Power BI reports for monitoring your ADF pipelines. You can also create alerts and configure notification settings based on report data. 


17. How do you format the input and output data in Azure Data Factory?
You can format the input and output data in Azure Data Factory (ADF) by using datasets and data flows. Datasets are the named references to the data sources that ADF can access, such as Azure Blob Storage, Azure SQL Database, Oracle, etc. Data flows are the graphical representations of data transformations that can be performed on data sources using Spark clusters in Azure.

You can specify the format of the input and output data in a dataset by using properties such as type, columnDelimiter, rowDelimiter, quoteChar, escapeChar, firstRowAsHeader, etc. 

For example, you can use the DelimitedText type to define a dataset that uses comma-separated values (CSV) or another delimiter. You can also use other format types such as Json, Avro, Orc, and Parquet.

You can also use data flows to format the input and output data by using transformations such as source, sink, select, derive column, filter, join, aggregate, etc. 

For example, you can use the source transformation to read data from a dataset and specify the schema and format of the input data. You can also use the sink transformation to write data to a dataset and specify the schema and format of the output data.

18. How do I back up my Azure Data Factory pipeline?
Using the Export Template option in the ADF UI, you can download a zip file containing JSON files of your pipeline and its dependencies. You can also import the template and restore the pipeline to the same or another data factory.

Azure Resource Manager (ARM) templates allow you to export and import pipelines and their dependencies as JSON files using PowerShell or Azure CLI commands. You can also manage your ARM templates using source control tools such as GitHub or Azure Repos.

Azure Data Factory's built-in disaster recovery provides automatic recovery for your pipelines in the event of a region-wide outage: if the primary region is unavailable, ADF automatically fails over to the paired region.

19. How do I list all pipelines in Azure Data Factory?
You can list all pipelines in Azure Data Factory (ADF) using one of the following methods:

From the Azure portal, you can view your data factory as a diagram and see all your pipelines and their dependencies. You can also filter and sort pipelines by name, type, status, etc.

Using Azure PowerShell, you can get all pipelines in a data factory using the Get-AzDataFactoryV2Pipeline cmdlet. You can also use the Get-AzDataFactoryV2PipelineRun cmdlet and pass the pipeline run ID or date range to get the run status of each pipeline.

Using the REST API, you can retrieve all pipelines in a data factory using the Pipelines - List By Factory operation. You can also use the Pipeline Runs - Query By Factory operation to get the run status of each pipeline by passing a filter parameter. A Python SDK equivalent of listing pipelines is sketched below.
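A minimal Python sketch of the same idea, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder resource names:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "my-rg"               # placeholder
FACTORY_NAME = "my-adf"                # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Enumerate every pipeline defined in the data factory.
for pipeline in client.pipelines.list_by_factory(RESOURCE_GROUP, FACTORY_NAME):
    print(pipeline.name)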

20. What is a sink in ADF?
-An ADF sink is a transformation that writes data to target storage after the data transformation is complete. Every data flow needs at least one sink transformation. However, you can write to as many sinks as you need to complete the transformation flow.

You can use a dataset object or an inline dataset to specify the format and properties of the output data in your sink. Dataset objects are reusable entities that can be used in other data flows and activities. Inline datasets are format-specific entities that can be used for flexible schemas, one-off sink instances, or parameterized sinks.

You can also use various settings in your sink to optimize data flow performance and error handling. For example, batch size, staging, partitioning, schema drift, schema validation, etc.  

21. How do you debug an ADF pipeline?
- You can debug an ADF pipeline by following these steps:

Open the pipeline under the Author page and click on the Debug button.
The pipeline will be deployed to the debug environment and run with the output shown in the Output tab.

You can view the input, output, status and duration of each activity in the pipeline.

You can also set breakpoints on specific activities to debug until that point.

You can cancel a debug run while it is in progress or view the historical debug runs in the Monitor page.

22. Difference between Azure Data Factory and Databricks?
- Azure Data Factory and Databricks are two different cloud services that share some similarities but also have important differences. Below are some key points:

Azure Data Factory is a data integration and orchestration service that runs ETL and ELT workflows and scales data movement across many sources. 

Databricks is an integrated platform for data, analytics and AI that simplifies data architecture and enables data engineering, data science, machine learning and BI in a single platform.

Azure Data Factory has a graphical user interface that allows users to create pipelines and activities without coding.

Databricks has a notebook-based interface that supports multiple programming languages and frameworks, such as Python, Scala, SQL, R, and Spark. 

Azure Data Factory supports over 100 connectors to various data sources and targets including Azure services, databases, files, web services, and more. 

Databricks supports a smaller number of connectors, but offers more flexibility and power when processing data on Spark and Delta Lake.

Azure Data Factory can integrate with Databricks by running notebooks, JARs, or Python code as activities in pipelines. This allows users to leverage the power of both services for data collection, transformation and analysis.  

23. Difference between full load and incremental load in ADF?
- Full load and incremental load are two types of data loading methods in the ETL process.

Full load means that the entire data is loaded from source to target without any filtering or conditions.

Incremental load means that only new or changed data is loaded from the source to the target based on timestamps or key columns.

A full load is typically performed when a data source is loaded into the data warehouse for the first time, or when the entire dataset needs to be refreshed.

Incremental loads are typically run periodically to keep the data warehouse up-to-date with the latest changes in the source. 

Full load is simpler and easier to implement than incremental load, but requires more time and resources to execute. 

Incremental loading is faster and more efficient than full loading, but requires more logic and complexity to handle errors and dependencies. A sketch of a watermark-based incremental source query is shown below.
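For illustration, a minimal sketch of a watermark-based incremental copy: the Copy activity's source query selects only the rows modified after a watermark value passed in as a pipeline parameter. Written as a Python dict mirroring the source JSON; the table, column, and parameter names are hypothetical.

import json

# Hypothetical incremental Copy source: read only rows changed since the last watermark.
incremental_source = {
    "type": "AzureSqlSource",
    "sqlReaderQuery": {
        "value": (
            "SELECT * FROM dbo.Orders "
            "WHERE LastModifiedDate > '@{pipeline().parameters.watermark}'"
        ),
        "type": "Expression",
    },
}
print(json.dumps(incremental_source, indent=2))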

24. What is CDC and why is it important in ADF?
- CDC stands for Change Data Capture, which is a set of technologies and techniques that enables data engineers to discover and extract changed records from the source data as they occur. 

CDC is important in ADF because it allows users to:

Perform incremental data loading from various sources to the destination without requiring timestamp or ID columns.

Use a graphical user interface to configure sources, targets and transformations without coding or designing pipelines or data flows.

Run continuous data integration processes with preferred latency and pay only for the data processing time.

Synchronize data between two tables or databases by applying the changes (inserts, updates, deletes) that occurred in the source to the destination.


25. What is Delta Lake and how does it differ from a data lake?
- Delta Lake is an open-source storage layer that brings reliability and performance to your data lake. A data lake is a centralized repository for storing all your structured and unstructured data at any scale. 

Delta Lake differs from Data Lake in several ways:

Delta Lake supports schema enforcement and evolution. This means you can validate and control the quality and structure of your data before it is written to storage. A data lake does not require schema definition or validation, which can lead to data inconsistency or corruption.

Delta Lake supports ACID transactions. This means that data integrity and parallelism can be guaranteed across multiple concurrent reads and writes. Data Lake does not support ACID transactions, which can result in data loss or duplication.

Delta Lake supports time travel. That is, it tracks change history and allows users to access previous versions of data for auditing and rollback purposes. Data lakes don't support time travel, making it difficult to recover from errors and analyze historical data.

Delta Lake supports optimized query performance. This means you can leverage the Spark and Delta Lake engines to speed up data processing and analytics. A plain data lake relies on external query engines or frameworks, which can be slow or inefficient. A short PySpark sketch of Delta's writes and time travel follows.
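A minimal PySpark sketch, assuming a Spark installation with the delta-spark package configured and a writable local path, showing Delta's transactional writes and time travel:

from pyspark.sql import SparkSession

# Assumes the delta-spark package is available and registered as the SQL extension/catalog.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"  # placeholder path

# Version 0: initial write; version 1: an append. Each write is an ACID transaction.
spark.createDataFrame([(1, "open")], ["id", "status"]).write.format("delta").mode("overwrite").save(path)
spark.createDataFrame([(2, "open")], ["id", "status"]).write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()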

26. How do I create a dynamic pipeline in Azure Data Factory?
- You can create dynamic pipelines in Azure Data Factory using parameters, variables, and expressions. 

Here are the steps you need to follow:

Define parameters for pipelines and activities that can accept values from user input or other sources, e.g. parameters for the source file name or the sink table name.

Define variables for pipelines and activities that can store intermediate values and the results of other activities, e.g. variables for the number of rows copied or the execution status.

Use expressions to dynamically assign values to activity parameters, variables, and properties based on logic or conditions, e.g. an expression that concatenates the source file name with the current date, or one that checks whether the number of rows is greater than 0.

Use dynamic content to reference parameters, variables, and expressions in pipelines and activities, e.g. to specify the source file path or the sink table name. A short sketch of such expressions follows.
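For illustration, a few examples of ADF dynamic-content expressions of the kind described above, held in a Python dict; the parameter and activity names are placeholders.

# Hypothetical ADF dynamic-content expressions; parameter and activity names are placeholders.
expressions = {
    # Append the current date to a file name supplied as a pipeline parameter.
    "datedFileName": "@concat(pipeline().parameters.fileName, '_', formatDateTime(utcnow(), 'yyyyMMdd'), '.csv')",
    # Check whether an upstream Copy activity moved any rows.
    "hasRows": "@greater(activity('CopyData').output.rowsCopied, 0)",
    # Build a folder path from the trigger time.
    "folderPath": "@concat('landing/', formatDateTime(pipeline().TriggerTime, 'yyyy/MM/dd'))",
}

for name, expr in expressions.items():
    print(name, "->", expr)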

27. How can we handle exceptions in Azure Data Factory?
There are various ways to implement exception handling in Azure Data Factory, for example:

A retry policy allows you to automatically retry an activity or trigger if it fails due to transient errors or network issues. You can configure the retry count and retry interval for activities or triggers that support a retry policy (see the sketch after this list).

Use conditional paths to branch your pipeline logic based on activity results. The success, failure, completion, and skipped outputs can be used to direct the pipeline to different activities based on their execution status, e.g. you can use the failure output to perform cleanup or send notifications when an activity fails. 

Use stored procedures to log errors and exceptions to a table or file. You can create a stored procedure that accepts parameters such as activity name, error code, error message, timestamp, etc. and inserts them into a table or file for testing or troubleshooting. You can then call this stored procedure from any activity that can raise errors or exceptions.

Use webhooks or logic apps to send alerts or notifications when pipelines or activities fail. You can create a webhook or logic app that triggers an email, text message, or push notification when an HTTP request with specific parameters such as pipeline name, run ID, status, or error details is received. You can then call that webhook or logic app from any activity that can raise an error or exception. 
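For illustration, a minimal sketch of an activity policy with retries plus an error-logging activity wired to the failure path, written as Python dicts mirroring the activity JSON; the activity names and stored procedure are hypothetical.

import json

# Hypothetical Copy activity with a retry policy, and a logging activity on its failure path.
copy_with_retry = {
    "name": "CopyData",
    "type": "Copy",
    "policy": {"timeout": "0.01:00:00", "retry": 3, "retryIntervalInSeconds": 60},
    "typeProperties": {"source": {"type": "AzureSqlSource"}, "sink": {"type": "ParquetSink"}},
}

log_failure = {
    "name": "LogFailure",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [{"activity": "CopyData", "dependencyConditions": ["Failed"]}],
    "typeProperties": {
        "storedProcedureName": "dbo.usp_LogError",  # hypothetical logging procedure
        "storedProcedureParameters": {
            "ErrorMessage": {
                "value": {"value": "@activity('CopyData').error.message", "type": "Expression"},
                "type": "String",
            }
        },
    },
}
print(json.dumps([copy_with_retry, log_failure], indent=2))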

28. What are the stages in Azure Data Factory?
- The stages in Azure Data Factory are:


Connect and Collect:
This phase connects to various data sources such as databases, files, and web services, and copies the data to central locations such as Azure Blob Storage and Azure Data Lake Storage.

Transform and enrich:
At this stage, data is processed and transformed using compute services such as Azure HDInsight, Azure Databricks, and Azure SQL Database to perform tasks such as cleansing, filtering, aggregation, and joining.

Publish:
During this phase, the transformed data is loaded into a target data store such as Azure Synapse Analytics, Azure Cosmos DB, or Azure SQL Database, where it can be consumed by business intelligence and analytics applications.

Monitor:
During this phase, tools such as Azure Monitor, Azure Data Factory portal, PowerShell, and REST APIs are used to track and manage pipeline and activity execution and performance. 




If you have any suggestions or questions, or want to share something, please drop a comment.
