This course will guide you through the basics of using ChatGPT for automating the tedious tasks of data cleaning and formatting.
First, what is data cleaning? Data cleaning involves detecting and correcting (or removing) corrupt or inaccurate records from a dataset. For instance, a field may be missing (an empty cell in Excel) or contain an error (an email address without a domain extension such as .com or .net).
In this tutorial, you will learn how to:
- Upload your data files
- Remove duplicate data
- Remove blanks
- Insert placeholder values
- Standardize formatting
- Label data based on specific rules
Uploading and preparing your data
First, click the icon to the left of the input bar.

Select your file (it must be a CSV, XLS, or XLSX file) to upload it to ChatGPT, or simply drag and drop the file into the window.
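If you want to check what your file contains before uploading it, a minimal Python sketch like the one below can help. It uses only the standard library. The sample data, the column names (id, name, email), and the helper name load_rows are assumptions made for illustration; they are not part of the tutorial's dataset.

```python
import csv
import io

# Hypothetical sample standing in for the contents of your CSV file.
SAMPLE_CSV = """id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.net
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)

# Quick preview: number of rows and the column names.
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

For a real file, you could pass the result of open("your_file.csv", newline="", encoding="utf-8").read() to load_rows instead of the sample string.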
Removing duplicate data
Duplicate records can distort your analysis (for example, by counting the same person twice), so they should be removed.
Let's start by asking ChatGPT to remove them.
Prompt:
Here's my data file. Can you remove any duplicate data? Duplicates will have the same id or email. Please provide a preview before generating an updated data file.
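To verify the result yourself, here is a minimal sketch of the same rule in Python: a row counts as a duplicate if its id or its email already appeared in an earlier row. The sample rows, the helper name remove_duplicates, and the choice to compare emails case-insensitively are assumptions for illustration; ChatGPT may apply the rule differently.

```python
def remove_duplicates(rows):
    """Keep the first row for each id and each email; drop later repeats."""
    seen_ids, seen_emails = set(), set()
    unique = []
    for row in rows:
        row_id = (row.get("id") or "").strip()
        # Assumption: emails are compared case-insensitively.
        email = (row.get("email") or "").strip().lower()
        # Empty values are never treated as duplicates of each other.
        if (row_id and row_id in seen_ids) or (email and email in seen_emails):
            continue
        if row_id:
            seen_ids.add(row_id)
        if email:
            seen_emails.add(email)
        unique.append(row)
    return unique


# Hypothetical sample data.
rows = [
    {"id": "1", "email": "alice@example.com"},
    {"id": "2", "email": "bob@example.net"},
    {"id": "1", "email": "alice2@example.com"},  # repeats id 1
    {"id": "3", "email": "BOB@example.net"},     # repeats Bob's email
]

deduped = remove_duplicates(rows)
print([row["id"] for row in deduped])
```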

Cleaning blank rows
Blank rows can disrupt data processing. Now it's time to remove them!
Prompt:
Are there any blank rows? If yes, could you remove them?

See how easy it is?
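For reference, a blank-row filter can be sketched in a few lines of Python. This version assumes that a row is blank when every cell is empty or contains only whitespace; the sample rows and the helper names are hypothetical.

```python
def is_blank_row(row):
    """Return True when every cell is empty, None, or only whitespace."""
    return all((value or "").strip() == "" for value in row.values())


def remove_blank_rows(rows):
    """Return only the rows that contain at least one non-empty cell."""
    return [row for row in rows if not is_blank_row(row)]


# Hypothetical sample data.
rows = [
    {"id": "1", "email": "alice@example.com"},
    {"id": "", "email": "   "},    # blank row: only whitespace
    {"id": None, "email": None},   # blank row: missing values
    {"id": "2", "email": ""},      # not blank: id is present
]

cleaned = remove_blank_rows(rows)
print(len(rows) - len(cleaned), "blank rows removed")
```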
Inserting placeholder values in blank cells
Depending on the software and tools you use, missing values can take several forms, such as NaN or NULL.
Here, we're going to ask ChatGPT to treat them as blank cells and standardize them.
Prompt:
Are there any blank cells (consider a cell whose value is NaN or NULL as also blank)? If yes, could you insert "N/A" in these cells and provide a preview before generating an updated data file?
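The same rule can be sketched in Python as follows. The set of markers treated as blank (empty string, NaN, NULL, matched case-insensitively), the sample rows, and the helper name fill_blanks are assumptions for illustration.

```python
# Assumption: these markers, compared case-insensitively, mean "blank".
BLANK_MARKERS = {"", "nan", "null"}


def fill_blanks(rows, placeholder="N/A"):
    """Return copies of the rows with blank cells replaced by the placeholder."""
    filled = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            text = "" if value is None else str(value).strip()
            new_row[key] = placeholder if text.lower() in BLANK_MARKERS else value
        filled.append(new_row)
    return filled


# Hypothetical sample data.
rows = [
    {"id": "1", "email": "NaN"},
    {"id": "2", "email": ""},
    {"id": "NULL", "email": "bob@example.net"},
]

filled = fill_blanks(rows)
for row in filled:
    print(row)
```

The function builds new rows rather than changing the originals, so you can compare the two versions before saving anything, much like the preview requested in the prompt.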

Ensuring data is properly formatted
Ensuring that data is properly formatted is essential when dealing with huge volumes of data. Poorly formatted data can lead to errors in your analyses and projections, which can have serious consequences for your decisions and therefore your business.
Let's take the example of email addresses. A badly formatted email address is unusable and will cause errors in any process that relies on it.
Prompt:
Could you please check whether the email addresses are correctly formatted? If you find any that are not formatted correctly, please list them for me.

Note that the same kind of format check can be applied to any column!
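As an illustration of what such a format check might look like, here is a minimal Python sketch using a regular expression. The pattern is a deliberately simplified assumption (some text, one @, a domain, and an extension of at least two letters); it is not a complete email validator, and the check ChatGPT performs may differ. The sample addresses and helper names are hypothetical.

```python
import re

# Simplified pattern: local part, one "@", domain, dot, 2+ letter extension.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")


def is_valid_email(value):
    """Return True when the whole value matches the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None


def find_invalid_emails(rows, key="email"):
    """List the values in the given column that fail the format check."""
    return [row[key] for row in rows if not is_valid_email(row[key])]


# Hypothetical sample data.
rows = [
    {"email": "alice@example.com"},
    {"email": "bob@example"},          # no domain extension
    {"email": "carol@@example.com"},   # two @ signs
    {"email": "dave@example.net"},
]

invalid = find_invalid_emails(rows)
print(invalid)
```

Swapping in a different pattern is enough to reuse the same two functions on another column.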
Creating new columns
You can derive new columns from existing ones (e.g., by extracting the domain from email addresses).
This can be useful for understanding the distribution of email providers or for data analysis.
Prompt:
Create a new column that extracts the domain from email addresses.
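A minimal Python sketch of this derivation is shown below. It takes everything after the last @ as the domain and then counts how often each domain appears. The new column name email_domain, the sample rows, and the helper name add_domain_column are assumptions for illustration.

```python
from collections import Counter


def add_domain_column(rows, email_key="email", domain_key="email_domain"):
    """Return copies of the rows with a lowercase domain column added."""
    result = []
    for row in rows:
        email = (row.get(email_key) or "").strip()
        local, sep, domain = email.rpartition("@")
        new_row = dict(row)
        # Leave the domain empty when there is no "@" or no local part.
        new_row[domain_key] = domain.lower() if sep and local else ""
        result.append(new_row)
    return result


# Hypothetical sample data.
rows = [
    {"email": "alice@Example.com"},
    {"email": "bob@example.net"},
    {"email": "carol@example.com"},
    {"email": "not-an-email"},
]

with_domain = add_domain_column(rows)

# Distribution of email providers, ignoring rows without a domain.
domain_counts = Counter(r["email_domain"] for r in with_domain if r["email_domain"])
print(domain_counts.most_common())
```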

Now you've seen the main features for cleaning and formatting data with ChatGPT. Of course, don't hesitate to modify the prompts and apply them to new use cases.
As you've seen, it's possible to save hours of work in Excel by having ChatGPT manipulate the data for you, without having to write functions yourself.