Excel for Data Cleaning

Excel is a powerful tool for data cleaning, the process of preparing "dirty data", that is, data that must be corrected before it can be analyzed. Raw data often contains errors that reduce accuracy, which can lead to incorrect predictions and poor business outcomes.


To enhance the quality of company information, consider the following strategies:

  1. Convert data to table format
  2. Separate first and last names
  3. Identify duplicate values
  4. Ensure correct capitalization (orthography)
  5. Remove extra spaces or tab characters
  6. Correct date format
  7. Convert decimal values to percentages (if needed)
  8. Format values as currency (if required)
  9. Identify cells with blank spaces or null values

DESCRIPTION


The following example cleans up a fictitious database, using formulas and built-in Excel tools to correct the issues listed above.

GOAL


Identify errors that can distort a company's strategy and the development of new products or services, and that ultimately lead to inaccurate sales forecasts.

RESULT


Applying the formulas and steps in this exercise cleaned the entire table and left the data in a more professional, readable format.

Company's Employee Database

Raw database:

RawDatabase
Formula    Description
CLEAN()    Removes all non-printable characters from the text.
TRIM()     Removes all spaces from the text except single spaces between words.
LEFT()     Returns the specified number of characters from the beginning of a text string (used to create employee IDs).
UPPER()    Converts a text string to uppercase letters.
PROPER()   Capitalizes the first letter of each word in a text string and converts the remaining letters to lowercase.
IF()       Returns one value if a condition is true and another if it is false (used to rename boolean values).
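
For readers who want to check the expected results outside Excel, the behavior of these functions can be approximated in a few lines of Python. This is only a rough sketch, not part of the original workbook: the helper names, the sample name, the three-letter ID scheme, and the "Yes"/"No" labels are all hypothetical.

```python
def excel_clean(text: str) -> str:
    """Drop ASCII control characters (codes 0-31), roughly like CLEAN()."""
    return "".join(ch for ch in text if ord(ch) >= 32)


def excel_trim(text: str) -> str:
    """Strip leading/trailing spaces and collapse repeated spaces, like TRIM()."""
    return " ".join(part for part in text.split(" ") if part)


def excel_left(text: str, num_chars: int = 1) -> str:
    """Return the first num_chars characters, like LEFT()."""
    return text[:num_chars]


def excel_proper(text: str) -> str:
    """Capitalize each word and lowercase the rest, approximating PROPER()."""
    return text.title()


def yes_no(flag: bool) -> str:
    """Rename a boolean, like IF(cell, "Yes", "No") (labels are assumed)."""
    return "Yes" if flag else "No"


# Hypothetical raw cell value with stray spaces, a tab, and a line break.
raw_name = "  jOHN \t  smith\n"
name = excel_proper(excel_trim(excel_clean(raw_name)))  # "John Smith"
employee_id = excel_left(name, 3).upper()               # "JOH" (assumed ID scheme)
```

The order matters in the same way it does in Excel: removing control characters first leaves extra spaces behind, which the trimming step then collapses.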

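Excel takes its decimal separator from the system's regional settings. In the Microsoft Excel 2019 installation used for this exercise, the separator is a comma, not a period. Values written with a period (and a "$" symbol) are therefore treated as text, and mathematical operations on them return an error.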


Steps to remove currency format:

  1. Select the column
  2. Navigate to the "Find and Replace" option (Ctrl + F)
  3. Choose the "Replace" tab
  4. In the "Find what" field, type the "$" symbol
  5. Leave the "Replace with" field empty
  6. Finally, select the "Replace All" option

Steps to replace the decimal point with a comma:

  1. Select the column
  2. Navigate to the "Find and Replace" option (Ctrl + F)
  3. Choose the "Replace" tab
  4. In the "Find what" field, type the "." symbol
  5. In the "Replace with" field, type the "," symbol
  6. Finally, select the "Replace All" option
CurrencyFormat
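
The two replacements above amount to plain string substitution. A minimal Python sketch of the same idea follows, assuming values shaped like "$1250.50" with no thousands separators; the function names and sample values are hypothetical and not taken from the workbook.

```python
def strip_currency(raw: str) -> str:
    """Remove the "$" symbol, like replacing "$" with nothing in Excel."""
    return raw.replace("$", "").strip()


def to_comma_decimal(raw: str) -> str:
    """Swap the decimal point for a comma, like replacing "." with ","."""
    return strip_currency(raw).replace(".", ",")


def to_number(raw: str) -> float:
    """Parse the cleaned text as a number so arithmetic works on it."""
    return float(strip_currency(raw))


salary_text = "$1250.50"                        # hypothetical raw cell value
salary_for_excel = to_comma_decimal(salary_text)  # "1250,50"
total = to_number("$1250.50") + to_number("$100.25")
```

As with "Replace All" in Excel, a blanket substitution would also change any thousands separators, so it is worth checking the column for them first.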

In this exercise, the start date column has no defined date format and also contains tab characters.


Steps to fix date format:

  1. Select the column
  2. Open the "Data" tab on the ribbon
  3. Select the "Text to Columns" option
  4. In the Convert Text to Columns Wizard, choose the file type "Delimited" and click Next
  5. On the Delimiters screen, uncheck the "Tab" and "Space" options and click Next
  6. On the Column Data Format screen, check the "Date" option, set the date format to DMY (Day/Month/Year), and click Finish.
DateFormat
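
The effect of these steps (drop the stray tab characters, then read the text as day/month/year) can be sketched in Python. The sample value and the "DD/MM/YYYY" layout are assumptions for illustration, since the raw column's exact layout is not shown here.

```python
from datetime import date, datetime


def parse_dmy(raw: str) -> date:
    """Strip surrounding tabs/spaces and parse the text as day/month/year."""
    cleaned = raw.strip()
    return datetime.strptime(cleaned, "%d/%m/%Y").date()


# Hypothetical raw cell: a leading tab and a trailing space around the date.
start_date = parse_dmy("\t05/03/2021 ")
```

Reading the value as DMY matters: the same text interpreted as month/day/year would give 3 May instead of 5 March.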

Steps to identify and remove duplicate values:

  1. Select all table data (except headers)
  2. Open the "Home" tab on the ribbon
  3. Select "Conditional Formatting"
  4. Choose "Highlight Cells Rules"
  5. Select "Duplicate Values" (you can choose how to highlight your data)
  6. From the table's filter options, choose "Sort by color"
  7. To remove duplicates, open the "Data" tab on the ribbon and select "Remove Duplicates"
DuplicateValues
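
The same two stages, flagging duplicates and then removing them, can be sketched in Python. This is a simplified illustration that compares whole rows and keeps the first occurrence of each; the sample rows and function names are hypothetical.

```python
from collections import Counter


def find_duplicates(rows):
    """Return every row that appears more than once (the rows to highlight)."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] > 1]


def remove_duplicates(rows):
    """Keep the first occurrence of each row and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical employee rows as (ID, name) tuples.
employees = [("E01", "Ana"), ("E02", "Luis"), ("E01", "Ana")]
highlighted = find_duplicates(employees)
deduplicated = remove_duplicates(employees)
```

Note that the "Duplicate Values" highlight in Excel compares individual cell values, so reviewing the highlighted rows before removing anything is a sensible precaution.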

The last step of the exercise indicates that empty cells should be removed because they represent null values.


Steps to identify and remove empty cells:

  1. Select all table data (except headers)
  2. Open the "Home" tab on the ribbon
  3. Select "Conditional Formatting"
  4. Choose "Highlight Cells Rules"
  5. Select "More Rules"
  6. In the "New Formatting Rule" dialog, under "Edit the Rule Description" > "Format only cells with", choose the option "Blanks" (you can choose how to highlight your data)
  7. To remove the empty cells, select the affected columns, open the "Home" tab on the ribbon, and choose "Delete Cells" from the "Delete" menu
EmptyCells
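
As a rough cross-check, blank detection can also be sketched in Python. This sketch assumes that a cell counts as blank when it is missing or contains only whitespace, and that cleanup means dropping the whole row; both are assumptions, and the sample rows are hypothetical.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only text as a blank cell."""
    return value is None or str(value).strip() == ""


def blank_positions(rows):
    """Return (row_index, column_index) for every blank cell."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if is_blank(value)
    ]


def drop_rows_with_blanks(rows):
    """Keep only rows in which every cell has a value."""
    return [row for row in rows if not any(is_blank(v) for v in row)]


# Hypothetical rows: (ID, name, department).
table = [("E01", "Ana", "Sales"), ("E02", "", "IT"), ("E03", "Luis", None)]
blanks = blank_positions(table)
complete_rows = drop_rows_with_blanks(table)
```

Dropping the whole row keeps the remaining columns aligned, whereas deleting individual cells shifts the values around them.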

Final result of the cleaning process:

CleanDatabase

You can visit the GitHub repository for more information on the documents used in the project.