Usage of Pandas for Basic Data Cleaning and Processing of Structured Datasets

Srinath Sridharan
Dec 15, 2023


“Data is a precious thing and will last longer than the systems themselves.” — D J Patil

The quote above underscores the enduring value of data and, implicitly, the importance of data cleaning. By ensuring the accuracy and quality of data through meticulous cleaning, we preserve its value for future analysis and decision-making.

Just as a chef carefully selects and prepares their ingredients to create a delightful meal, a data scientist must meticulously clean and prepare their data for analysis. This process is crucial in ensuring that the insights derived from the data are accurate and reliable.

Image created using OpenAI’s DALL-E generative AI model.

Similarly, consider a gardener tending to their garden. Just as they prune, water, and nurture their plants, a data scientist must tend to their data, nurturing and refining it to reveal its true potential in analysis.

Image created using OpenAI’s DALL-E generative AI model.

As an experienced data science educator, I often emphasize to junior colleagues and beginners the significant role data cleaning plays in the data analysis process. In real-world projects, data scientists spend approximately 80% of their time on data cleaning and preprocessing, because the quality of the data profoundly influences the outcome of any analysis. To demonstrate this, let's explore how Pandas, a pivotal Python library, can be used for data cleaning and processing on a dummy dataset.

Introduction to Our Enhanced Data Cleaning Process

We have created a simulated dataset of employee records for a company. The dataset comprises columns like ‘ID’, ‘Name’, ‘Age’, ‘Salary’, ‘Department’, and ‘Joining_Date’, and it intentionally includes common data issues such as missing values, outliers, and inconsistent formats. In this blog, we will walk through 10 common data-cleaning tasks. These tasks are by no means exhaustive, but they represent some of the most common and essential ones. Please find this dummy dataset here. Here’s a glimpse of the dataset:

print(df.head())
   ID            Name   Age    Salary Department Joining_Date
0   1    Melissa Mayo  35.0  51042.69      Sales   2014-01-13
1   2    Olivia Stone  45.0  45460.11    Finance   2017-03-27
2   3  Benjamin Lucas  55.0  49459.79    Finance   2021-09-17
3   4   Lori Martinez  25.0  49435.06  Marketing   2015-04-03
4   5  Kaitlyn Garcia  30.0  49503.83      Sales   2018-12-04
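
As a quick aside, here is a minimal sketch of how such a dataset might be loaded, assuming it has been saved locally as a CSV file (the file name below is just a placeholder):

# Minimal loading sketch; 'employee_records.csv' is a placeholder file name
import pandas as pd

df = pd.read_csv('employee_records.csv')
print(df.shape)   # number of rows and columns
print(df.dtypes)  # data type of each column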

Task 1: Handling Missing Data

Missing data can lead to skewed analysis. Identifying and addressing these gaps is crucial for data integrity.

Here’s how we handle it:

# Identifying missing values
missing_values = df.isnull().sum()
print(missing_values)
ID              0
Name            0
Age             5
Salary          2
Department      4
Joining_Date    4
dtype: int64
# Replacing missing values with the mean, or dropping rows where key fields are missing
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df.dropna(subset=['Department', 'Joining_Date'], inplace=True)

Task 2: Data Type Conversion and Cleaning

Accurate data type conversion ensures proper data manipulation and reduces errors in analysis.

# Converting 'Joining_Date' to datetime and filtering out unrealistic salaries
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'])
df = df[df['Salary'].between(30000, 150000)]
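
If some joining dates arrive in inconsistent or unparseable formats, pd.to_datetime can be told to turn them into NaT instead of raising an error. This is an optional variation, not part of the pipeline above:

# Variation: coerce unparseable dates to NaT instead of raising an error
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'], errors='coerce')
print(df['Joining_Date'].isnull().sum())  # how many dates could not be parsed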

Task 3: Data Transformation

Data transformation can enhance readability and compatibility for further analysis.

# Adding a column for years with the company
df['Years_with_Company'] = (pd.Timestamp.now() - df['Joining_Date']).dt.days / 365
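
As a further, purely illustrative transformation (not part of the original steps), the continuous tenure value could be bucketed into bands with pd.cut; the bin edges and labels below are arbitrary examples:

# Illustrative only: bucket tenure into bands (bin edges and labels are arbitrary)
df['Tenure_Band'] = pd.cut(
    df['Years_with_Company'],
    bins=[0, 2, 5, 10, 50],
    labels=['<2 yrs', '2-5 yrs', '5-10 yrs', '10+ yrs']
)
print(df['Tenure_Band'].value_counts())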

Task 4: Aggregating Data for Insights

Data aggregation provides a summarized view, essential for spotting trends and making decisions. This is akin to a Pivot Table in Excel.

# Average salary by department
average_salary_by_dept = df.groupby('Department')['Salary'].mean()
print(average_salary_by_dept)

Output:

Department
Finance      49830.94
HR           50559.72
IT           50266.97
Marketing    50872.83
Sales        49207.03
Name: Salary, dtype: float64
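
Since the Excel Pivot Table analogy fits well here, a closely related tool is pd.pivot_table, which can aggregate several columns at once. The choice of aggregations below is only an example:

# Example: a pivot-table-style summary (aggregation choices are illustrative)
summary = pd.pivot_table(
    df,
    index='Department',
    values=['Age', 'Salary'],
    aggfunc={'Age': 'median', 'Salary': 'mean'}
)
print(summary)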

Task 5: Removing Duplicates

Eliminating duplicates is crucial for maintaining data accuracy and preventing redundancy.

# Removing duplicate rows
df_dropped_duplicates = df.drop_duplicates()
duplicates_removed_output = len(df) - len(df_dropped_duplicates)
print(duplicates_removed_output)

Output:

0
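
No exact duplicate rows were found in this dataset. If duplicates instead needed to be checked on specific columns (for example 'ID', which should be unique), a sketch along these lines could be used:

# Sketch: check for duplicate IDs rather than fully identical rows
duplicate_ids = df.duplicated(subset=['ID']).sum()
print(duplicate_ids)

# Keep only the first occurrence of each ID
df = df.drop_duplicates(subset=['ID'], keep='first')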

Task 6: Normalizing Data

Normalization rescales data to a common range or scale, making values easier to understand and compare. The snippet below uses z-score standardization, which expresses each salary in terms of standard deviations from the mean.

# Standardizing the Salary column (z-score)
df['Salary_Normalized'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
normalized_salary_output = df[['Salary', 'Salary_Normalized']].head()
print(normalized_salary_output)

Output:

     Salary  Salary_Normalized
0  49290.00          -0.157826
1  50599.49           0.501151
2  50637.94           0.520501
3  49513.51          -0.045349
4  49861.35           0.129695
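
The snippet above applies z-score standardization. If a 0-to-1 range is preferred instead, min-max scaling is a common alternative; this is a sketch, not part of the original workflow:

# Alternative sketch: min-max scaling of Salary to the [0, 1] range
salary_min, salary_max = df['Salary'].min(), df['Salary'].max()
df['Salary_MinMax'] = (df['Salary'] - salary_min) / (salary_max - salary_min)
print(df[['Salary', 'Salary_MinMax']].head())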

Task 7: Encoding Categorical Variables

Categorical encoding converts text data into a numerical format that algorithms can understand. This is especially useful when working on classification problems (refer to my earlier article on classification, if interested).

# Encoding categorical variables as one-hot (dummy) columns
df_encoded = pd.get_dummies(df, columns=['Department'])
encoded_columns_output = df_encoded.columns
print(encoded_columns_output)

Output:

Index(['ID', 'Name', 'Age', 'Salary', 'Joining_Date', 'Salary_Normalized',
       'Department_Finance', 'Department_HR', 'Department_IT',
       'Department_Marketing', 'Department_Sales'],
      dtype='object')
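
One common variation (an option, not shown in the original code) is to drop the first dummy level, which avoids a redundant, perfectly correlated column when the encoded data feeds into linear models:

# Variation: drop the first dummy level to avoid a redundant indicator column
df_encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)
print(df_encoded.filter(like='Department_').columns.tolist())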

Task 8: Handling Outliers

Identifying and managing outliers is essential to prevent them from distorting statistical analysis.

# Removing rows where Age is above the 99th percentile
df = df[df['Age'] < df['Age'].quantile(0.99)]
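
The 99th-percentile cutoff above is one simple rule. Another widely used option, sketched here as an alternative rather than as the method used in this article, is the interquartile range (IQR) rule:

# Alternative sketch: keep only ages within 1.5 * IQR of the quartiles
q1 = df['Age'].quantile(0.25)
q3 = df['Age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[df['Age'].between(lower, upper)]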

Task 9: Data Imputation

Data imputation replaces missing or erroneous data with substituted values, maintaining the dataset’s integrity.

# Imputing missing values in Age based on median
df['Age'] = df['Age'].fillna(df['Age'].median())
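
If a single overall median feels too coarse, one optional refinement (a sketch, not the approach used above) is to impute within each department:

# Optional sketch: fill missing ages with the median age of each department
df['Age'] = df.groupby('Department')['Age'].transform(lambda s: s.fillna(s.median()))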

Task 10: String Manipulation

Proper string manipulation ensures consistency and accuracy in text data.

# Formatting the Name column
df['Name'] = df['Name'].str.title()
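
Beyond title-casing names, text columns often benefit from whitespace and case cleanup as well. A small additional sketch for the 'Department' column might look like this:

# Sketch: trim stray whitespace and standardize capitalization in Department
df['Department'] = df['Department'].str.strip().str.title()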

Conclusion

Data cleaning, although time-consuming, is a vital step in the data analysis process. Using Pandas in Python, we can efficiently perform a wide range of data cleaning tasks, ensuring the data is well prepared for analysis. This process, much like laying the foundation of a building, sets the stage for accurate and insightful outcomes.

Remember, “Garbage in, garbage out.” Quality data cleaning is the key to quality data analysis.

Happy Data Cleaning!


Srinath Sridharan

Data Enthusiast | Healthcare Aficionado | Digital Consultant