Usage of Pandas for Basic Data Cleaning and Processing of Structured Datasets
“Data is a precious thing and will last longer than the systems themselves.” — DJ Patil
The above quote underscores the enduring value of data and, implicitly, the importance of data cleaning: by ensuring accuracy and quality through meticulous cleaning, we preserve data's value for future analysis and decision-making.
Just as a chef carefully selects and prepares their ingredients to create a delightful meal, a data scientist must meticulously clean and prepare their data for analysis. This process is crucial in ensuring that the insights derived from the data are accurate and reliable.
Similarly, consider a gardener tending to their garden. Just as they prune, water, and nurture their plants, a data scientist must tend to their data, nurturing and refining it to reveal its true potential in analysis.
As an experienced data science educator, deeply entrenched in the world of data science, I often emphasize to junior colleagues and beginners the significant role data cleaning plays in the data analysis process. In real-world projects, data scientists are commonly estimated to spend around 80% of their time on data cleaning and preprocessing, because the outcome of any analysis is only as good as the quality of its input data. To demonstrate this, let's explore the usage of Pandas, a pivotal Python library, for data cleaning and processing using a dummy dataset.
Introduction to Our Enhanced Data Cleaning Process
We have created a simulated dataset of employee records for a company. The dataset comprises columns like ‘ID’, ‘Name’, ‘Age’, ‘Salary’, ‘Department’, and ‘Joining_Date’. It intentionally includes common data issues such as missing values, outliers, and inconsistent formats. We are going to explore the 10 most common data-cleaning tasks in this blog. It’s important to note that these 10 data-cleaning tasks are by no means exhaustive but represent some of the most common and essential ones. Please find this dummy dataset here. Here’s a glimpse of the dataset:
print(df.head())
ID Name Age Salary Department Joining_Date
0 1 Melissa Mayo 35.0 51042.69 Sales 2014-01-13
1 2 Olivia Stone 45.0 45460.11 Finance 2017-03-27
2 3 Benjamin Lucas 55.0 49459.79 Finance 2021-09-17
3 4 Lori Martinez 25.0 49435.06 Marketing 2015-04-03
4 5 Kaitlyn Garcia 30.0 49503.83 Sales 2018-12-04
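If you'd like to follow along without downloading the file, here is a minimal stand-in you can build yourself. Note that these values are illustrative, not the actual dataset, but they reproduce the same kinds of issues (missing ages, salaries, and departments):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the dummy dataset, with deliberate gaps
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Melissa Mayo', 'Olivia Stone', 'Benjamin Lucas',
             'Lori Martinez', 'Kaitlyn Garcia'],
    'Age': [35.0, np.nan, 55.0, 25.0, 30.0],              # one missing age
    'Salary': [51042.69, 45460.11, np.nan, 49435.06, 49503.83],
    'Department': ['Sales', 'Finance', 'Finance', None, 'Sales'],
    'Joining_Date': ['2014-01-13', '2017-03-27', '2021-09-17',
                     '2015-04-03', '2018-12-04'],
})
print(df.isnull().sum())
```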
Task 1: Handling Missing Data
Missing data can lead to skewed analysis. Identifying and addressing these gaps is crucial for data integrity.
Here’s how we handle it:
# Identifying missing values
missing_values = df.isnull().sum()
print(missing_values)
ID 0
Name 0
Age 5
Salary 2
Department 4
Joining_Date 4
dtype: int64
# Replacing missing numeric values with the mean, and dropping rows
# missing a Department or Joining_Date
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df = df.dropna(subset=['Department', 'Joining_Date'])
Task 2: Data Type Conversion and Cleaning
Accurate data type conversion ensures proper data manipulation and reduces errors in analysis.
# Converting 'Joining_Date' to datetime and filtering out unrealistic salaries
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'])
df = df[df['Salary'].between(30000, 150000)]
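One gotcha worth knowing: if even a single date string is malformed, `pd.to_datetime` raises an error by default. A small sketch (with made-up values) of the `errors='coerce'` option, which converts unparseable entries to `NaT` so you can handle them like any other missing data:

```python
import pandas as pd

# A series with one deliberately malformed date
s = pd.Series(['2014-01-13', 'not a date', '2021-09-17'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(s, errors='coerce')
print(parsed.isna().sum(), 'unparseable date(s)')
```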
Task 3: Data Transformation
Data transformation can enhance readability and compatibility for further analysis.
# Adding a column for approximate years with the company
# (365.25 roughly accounts for leap years)
df['Years_with_Company'] = (pd.Timestamp.now() - df['Joining_Date']).dt.days / 365.25
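Another transformation that often comes in handy (not part of the original workflow, just a sketch with illustrative numbers) is bucketing a continuous column into labelled bands with `pd.cut`, which makes reports and group-bys easier to read:

```python
import pandas as pd

df = pd.DataFrame({'Years_with_Company': [0.5, 3.2, 7.8, 11.4]})

# Bucket tenure into labelled bands for easier reporting
df['Tenure_Band'] = pd.cut(df['Years_with_Company'],
                           bins=[0, 2, 5, 10, 50],
                           labels=['<2y', '2-5y', '5-10y', '10y+'])
print(df)
```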
Task 4: Aggregating Data for Insights
Data aggregation provides a summarized view, essential for spotting trends and making decisions. This is akin to a Pivot Table in Excel.
# Average salary by department
average_salary_by_dept = df.groupby('Department')['Salary'].mean()
print(average_salary_by_dept)
Output:
Department
Finance 49830.94
HR 50559.72
IT 50266.97
Marketing 50872.83
Sales 49207.03
Name: Salary, dtype: float64
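Since the Excel analogy fits so well, pandas also has a literal `pivot_table` method. A small sketch (with made-up salaries) showing the mean and headcount per department in one call:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Finance', 'Finance'],
    'Salary': [50000.0, 52000.0, 48000.0, 46000.0],
})

# Excel-style pivot: mean salary and headcount per department
summary = df.pivot_table(index='Department', values='Salary',
                         aggfunc=['mean', 'count'])
print(summary)
```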
Task 5: Removing Duplicates
Eliminating duplicates is crucial for maintaining data accuracy and preventing redundancy.
# Removing duplicate rows
df_dropped_duplicates = df.drop_duplicates()
duplicates_removed_output = len(df) - len(df_dropped_duplicates)
print(duplicates_removed_output)
Output:
0
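By default, `drop_duplicates` only removes rows that match on every column. In practice you often want to deduplicate on a key column instead, such as `ID`. A quick sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Name': ['Ann', 'Ann', 'Bob'],
})

# Treat rows with the same ID as duplicates, keeping the first occurrence
deduped = df.drop_duplicates(subset=['ID'], keep='first')
print(len(df) - len(deduped), 'duplicate(s) removed')
```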
Task 6: Normalizing Data
Normalization puts values on a common scale, making columns easier to compare. Here we use z-score standardization, which expresses each salary as a number of standard deviations from the mean.
# Z-score standardizing the Salary column
df['Salary_Normalized'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
normalized_salary_output = df[['Salary', 'Salary_Normalized']].head()
print(normalized_salary_output)
Output:
Salary Salary_Normalized
0 49290.00 -0.157826
1 50599.49 0.501151
2 50637.94 0.520501
3 49513.51 -0.045349
4 49861.35 0.129695
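An alternative worth knowing is min-max scaling, which squeezes values into the [0, 1] range instead of centering them around zero. This is often preferred when a bounded range matters, for example before feeding data to certain machine-learning models. A minimal sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [30000.0, 60000.0, 90000.0]})

# Min-max scaling maps the smallest value to 0 and the largest to 1
df['Salary_MinMax'] = (df['Salary'] - df['Salary'].min()) / \
                      (df['Salary'].max() - df['Salary'].min())
print(df)
```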
Task 7: Encoding Categorical Variables
Categorical encoding converts text data into a numerical format that algorithms can work with. This is especially useful for classification problems (refer to my earlier article on classification, if interested).
# One-hot encoding the Department column into dummy indicator columns
df_encoded = pd.get_dummies(df, columns=['Department'])
encoded_columns_output = df_encoded.columns
print(encoded_columns_output)
Output:
Index(['ID', 'Name', 'Age', 'Salary', 'Joining_Date', 'Salary_Normalized', 'Department_Finance', 'Department_HR', 'Department_IT', 'Department_Marketing', 'Department_Sales'],
dtype='object')
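One refinement to be aware of: for linear models, keeping a dummy column for every category introduces redundancy (the "dummy variable trap"). The `drop_first=True` option drops one category per column; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Department': ['Sales', 'Finance', 'HR']})

# drop_first=True drops the alphabetically first category (here, Finance)
# to avoid perfectly collinear dummy columns
encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)
print(encoded.columns.tolist())
```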
Task 8: Handling Outliers
Identifying and managing outliers is essential to prevent them from distorting statistical analysis.
# Removing rows whose Age falls in the top 1% (above the 99th percentile)
df = df[df['Age'] < df['Age'].quantile(0.99)]
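A percentile cutoff is simple but always discards a fixed share of the data, even when there are no true outliers. A common alternative is the interquartile-range (IQR) rule, which only flags values outside the 1.5 × IQR fences. A sketch with an obviously implausible value:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 120]})  # 120 is implausible

q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only rows within the standard 1.5 * IQR fences
mask = df['Age'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
print(df_clean)
```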
Task 9: Data Imputation
Data imputation replaces missing or erroneous data with substituted values, maintaining the dataset’s integrity.
# Imputing any remaining missing Age values with the median,
# which is more robust to outliers than the mean
df['Age'] = df['Age'].fillna(df['Age'].median())
Task 10: String Manipulation
Proper string manipulation ensures consistency and accuracy in text data.
# Formatting the Name column
df['Name'] = df['Name'].str.title()
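Title-casing is usually only part of the job: real text data also carries stray whitespace and inconsistent capitalization. A sketch (with deliberately messy made-up names) chaining a few common string fixes:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['  melissa mayo ', 'OLIVIA   STONE']})

# Strip stray whitespace, collapse internal runs of spaces, then title-case
df['Name'] = (df['Name'].str.strip()
                        .str.replace(r'\s+', ' ', regex=True)
                        .str.title())
print(df['Name'].tolist())
```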
Conclusion
Data cleaning, although time-consuming, is a vital step in the data analysis process. Using Pandas in Python, we can efficiently perform various data cleaning tasks, thereby ensuring the data is well-prepared for analysis. This process, akin to preparing the foundation in any construction, sets the stage for accurate and insightful outcomes.
Remember, “Garbage in, garbage out.” Quality data cleaning is the key to quality data analysis.
Happy Data Cleaning!