Data transformation is an essential step in the data engineering process: raw data must be cleaned, shaped, and structured before it is ready for analysis. Azure Data Factory offers two main approaches to transforming data: Mapping data flows and Wrangling data flows. This article demonstrates how to use Wrangling data flows to transform data in a simple, logical manner.
Wrangling data flows in Azure Data Factory let you clean, transform, reshape, and repurpose your data through an interactive, code-free user interface. They support data wrangling operations on a variety of data sources, including delimited text files, JSON files, and Excel spreadsheets.
If you'd like to see the big picture of data transformation in Azure Data Factory, check out Data Transformation in Azure Data Factory – An overview.
Before we dive into wrangling data flows in Azure Data Factory
Before diving deeper, it is important to understand some key concepts and terms that we will be using in this article. In wrangling data flows, you can use a wide range of data transformations to manipulate your data, such as:
- Filter: This transformation filters rows based on a condition.
- Split: This transformation splits a column into multiple columns based on a delimiter.
- Extract: This transformation extracts a substring from a column based on a pattern.
- Replace: This transformation replaces a value in a column with another value.
- Merge: This transformation merges multiple columns into one column based on a separator.
- Pivot: This transformation pivots data from a long format to a wide format.
- Unpivot: This transformation unpivots data from a wide format to a long format.
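These operations run code-free in the UI, but conceptually they map onto familiar row-and-column manipulations. As a rough illustration only (plain Python standing in for the UI steps; the column names and sample values are made up), filter, split, replace, merge, pivot, and unpivot behave like this:

```python
# Illustrative only: plain-Python equivalents of the wrangling
# transformations listed above, applied to rows modeled as dicts.
rows = [
    {"name": "Ada Lovelace", "country": "UK", "score": 78},
    {"name": "Grace Hopper", "country": "US", "score": 91},
    {"name": "Alan Turing",  "country": "UK", "score": 85},
]

# Filter: keep only the rows matching a condition.
uk_rows = [r for r in rows if r["country"] == "UK"]

# Split: break one column into several on a delimiter.
for r in rows:
    r["first"], r["last"] = r["name"].split(" ", 1)

# Replace: substitute one value in a column for another.
for r in rows:
    r["country"] = r["country"].replace("UK", "United Kingdom")

# Merge: combine multiple columns into one using a separator.
for r in rows:
    r["label"] = "-".join([r["last"], r["country"]])

# Pivot: turn a long table (one row per country/score pair) into a
# wide one (one column per country). Unpivot is the reverse.
long_rows = [{"country": "UK", "score": 78}, {"country": "US", "score": 91}]
wide = {r["country"]: r["score"] for r in long_rows}
back = [{"country": c, "score": s} for c, s in wide.items()]

print(uk_rows[0]["name"])   # Ada Lovelace
print(rows[0]["label"])     # Lovelace-United Kingdom
```

In the wrangling UI each of these is a point-and-click step; the sketch just shows what each step does to the underlying rows.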
Now let us walk through, step by step, how to use a Wrangling data flow to transform data in Azure Data Factory.
1. Create a Wrangling data flow
The first step is to create a Wrangling data flow. To do this, follow these steps:
- Log in to the Azure portal (https://portal.azure.com) and navigate to your Data Factory instance.
- Click on the “Author & Monitor” button to open the Data Factory authoring interface.
- Click on the “Author” tab to open the authoring workspace.
- In the left-hand panel, click on “Data flows” and then click on the “New data flow” button.
- Give your data flow a name and select the data source you want to use.
- Click on the “Create” button to create your data flow.
2. Add a data source
The next step is to add a data source to your data flow. To do this, follow these steps:
- In the data flow canvas, click on the “Source” icon in the toolbar on the left.
- In the “Source settings” pane on the right, click on the “New source” button.
- Select the type of source you want to use from the list of available sources. For example, if you want to use a CSV file, select “Delimited text” from the list.
- Enter the connection information for your source. This will depend on the type of source you are using.
- For instance: if you are using a CSV file, you will need to enter the file path and delimiter.
- Once you have entered the connection information, click on the “OK” button to add the source to your data flow.
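For a delimited text source, the path and delimiter you enter play the same role they would in any CSV reader. A minimal sketch (the inline data and column layout are made-up stand-ins for a real file in Blob Storage or ADLS):

```python
import csv
import io

# Stand-in for a delimited text file; in Data Factory you would
# point the source at a real file path instead.
raw = "id;name\n1;Ada\n2;Grace\n"

# The delimiter configured on the source determines how each line
# is split into columns.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
records = list(reader)
print(records[0]["name"])  # Ada
```

Getting the delimiter wrong at this stage is a common reason a source preview shows a single mangled column, so it is worth checking the data preview right after adding the source.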
3. Explore and clean your data
The next step is to explore and clean your data using the Wrangling Data Flows UI. The UI provides a rich set of features to explore and clean your data, including:
- Data preview: View a sample of your data to see how it looks.
- Column profile: View statistics and metadata for each column in your data.
- Data transformations: Apply transformations to your data using a drag-and-drop interface.
- Data type conversion: Convert data types for columns.
- Missing value imputation: Impute missing values using various techniques.
- Outlier detection: Identify outliers in your data.
- Data sampling: Sample your data to test transformations.
Use the Wrangling Data Flows UI to explore and clean your data according to your business requirements.
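To make the profiling and cleaning features above concrete, here is a rough sketch of what a column profile and missing-value imputation amount to, in plain Python. The column name is hypothetical, and mean imputation is just one of the "various techniques" the UI offers:

```python
import statistics

# A numeric column with missing values (None), as you might see
# in the data preview.
ages = [34, None, 29, 41, None, 38]

# Column profile: basic statistics over the non-missing values.
present = [v for v in ages if v is not None]
profile = {
    "count": len(ages),
    "missing": len(ages) - len(present),
    "min": min(present),
    "max": max(present),
    "mean": statistics.mean(present),
}

# Missing value imputation: fill gaps with the column mean.
filled = [v if v is not None else profile["mean"] for v in ages]
print(profile["missing"])  # 2
```

The column profile pane surfaces exactly this kind of summary per column, which is what makes it easy to spot columns that need imputation or type conversion before you transform them.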
4. Create transformations
Once you have explored and cleaned your data, the next step is to create transformations to shape and structure your data. To do this, follow these steps:
- In the data flow canvas, select the column you want to transform.
- Click on the “Add column” button in the toolbar.
- Select the type of transformation you want to apply from the list of available transformations.
- For instance: if you want to split a column into two columns, select “Split column” from the list.
- Configure the transformation settings in the pane on the right.
- For instance: if you are adding a split column transformation, you will need to specify the delimiter.
- Once you have configured the transformation settings, click on the “OK” button to apply the transformation to your data.
- Repeat these steps for each transformation you want to apply to your data.
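As a concrete sketch of what the split and extract transformations configured in these steps do (the "order_id" column and its format are hypothetical), splitting on a delimiter and extracting by pattern look like this in plain Python:

```python
import re

rows = [{"order_id": "ORD-2023-0042"}, {"order_id": "ORD-2023-0107"}]

# Split column: break "order_id" on the "-" delimiter; here we keep
# the middle part as a new "year" column.
for row in rows:
    row["year"] = row["order_id"].split("-")[1]

# Extract: pull a substring matching a pattern (the trailing digits)
# into a new "sequence" column.
for row in rows:
    match = re.search(r"(\d+)$", row["order_id"])
    row["sequence"] = match.group(1) if match else None

print(rows[0]["year"], rows[0]["sequence"])  # 2023 0042
```

In the UI you configure the same delimiter or pattern in the transformation settings pane rather than writing any code.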
5. Save and run your data flow
Once you have created your transformations, the final step is to save and run your data flow. To do this, follow these steps:
- Click on the “Publish all” button in the toolbar to save your data flow.
- Click on the “Debug” button in the toolbar to run your data flow.
- In the “Debug settings” pane on the right, select the data integration runtime you want to use to run your data flow.
- Click on the “Create” button to create a new debug session.
- Wait for the debug session to start running. You can monitor the progress of your data flow in the “Output” pane on the right.
- Once your data flow has finished running, you can view the output in the “Data preview” pane on the right.
Additional tips for using Wrangling data flows in Azure Data Factory
Here are some additional details and tips to keep in mind when using wrangling data flows in Azure Data Factory:
- The Wrangling data flows tool supports data profiling, which lets you analyze the quality and characteristics of your data, such as its distribution, data types, and patterns.
- Using dynamic expressions and parameters in your wrangling data flows can help you build a more flexible and customizable data transformation process.
- To reduce duplication and improve consistency, you can reuse wrangling data flows across multiple pipelines and data factories.
- When you design your wrangling data flow, consider the size and complexity of your data and the resources required to process it. With Azure Monitor, you can monitor and optimize the performance and cost of your data preparation process.
- Wrangling data flows can be used in conjunction with other Azure services, such as Azure Databricks and Azure Machine Learning, to perform advanced analytics and machine learning tasks on your prepared data.
- The “Add to Data Flow” feature lets you add transformations from your wrangling data flow to a mapping data flow, creating a new mapping data flow.
- The data flow lineage capabilities of Wrangling data flows let you track how your data has been transformed, from its point of origin to its destination, and what changes were made along the way.
Wrangling data flows provide a powerful, code-free way to transform data in Azure Data Factory. The Wrangling data flows UI offers a rich set of features to help you explore, clean, and transform your data in a variety of ways, and the steps in this article should give you a solid understanding of how to put them to use. With this knowledge, you can start using Wrangling data flows to transform your own data in Azure Data Factory.