The data cleansing tool replaces and removes inconsistent or improperly formatted data in your inputs. I know that sounds a bit abstract, so let's go through it with an example. As you can see with the illustration on the right, the first picture shows in red boxes, all the anomalies we have with our data set. employee ID has random white spaces between the numbers. First name has punctuation issues, h has a couple of random tabs, and favorite coffee has two null values. What we're going to do in this exercise is import our example HR sheet, Dragon Age a data cleansing tool to clean it, and then run our workflow to view the results of our cleaned data set.
Let's start a new workflow by importing spreadsheet 2.1. We'll go to the input data tool and connect to our spreadsheet. In here we have our example HR five from chapter one, but this time it needs to be cleansed. To get a better view of our data will add a browse tool to our input data tool by using the following keyboard shortcut Ctrl Shift V will run our workflow with Ctrl R. And we can see at the bottom in our preview pane that with employee ID, there's a couple of white spaces. First name has punctuation issues. Age has a couple of random tabs, and under favorite coffee, there are two null values.
A quick way to identify issues with our data set is to check the color of each column. If it's anything but green, there's potentially something wrong with it. under age ultrix tells us that 40% of our records in age aren't okay. And under favorite coffee. There are 20% null values having these anomalies in your data might not seem like a big deal, but it can easily throw your data set off. From an ETL or data ingestion point of view, not cleaning the data before ingesting it can corrupt the data set.
And as a user of it can skew the results of your queries. So cleaning the data set is very important before using it. Let's go to the preparation tab and drag in a data cleansing tool into our workflow. In the configuration pane on the left, we have several options for selecting which fields we want to clean and how to clean them. The fields in scope for this exercise where employee ID, first name, age, and favorite coffee. Under replace Knowles, let's replace null string values with an empty piece of string and replace no numeric fields with zero.
Just to quickly cover off nulls and blank values are two different things. means that nothing is stored in that field whilst blank means a blank value is stored in that field. Under remove unwanted characters, we can untick leading and trailing white spaces. Since we didn't have any, we can remove tabs line breaks and duplicate white spaces to fixed values that we had in age. Under all white spaces, we'll take that to fix employee ID, which had white spaces between the numbers. We can leave letters and numbers alone since we actually want those in our data set.
And we can tick punctuation to remove exclamation marks in first name. Let's run our workflow with Ctrl R. And we can see that our data set has been cleansed. All our columns show a green color and we can see that the whitespace has been removed from this record. First name has the punctuation is removed The tabs in age are gone and our null values and favorite coffee have been removed.