todd.lindsay85 Jan 18, 2026

Sample Code for Data Cleaning with Pandas in Python

Hey there! 👋 Data cleaning can be a bit of a headache, but Pandas in Python makes it so much easier. I always struggled with messy datasets until I learned a few simple tricks. I'm sharing some sample code to get you started. Let's make your data sparkle ✨!

1 Answer

✅ Best Answer
jane.mccoy Dec 31, 2025

📚 What is Data Cleaning?

Data cleaning, also known as data cleansing, is the process of identifying and correcting inaccurate, incomplete, irrelevant, or inconsistent data. It's a crucial step in data analysis and machine learning because the quality of your results depends directly on the quality of your input data. Think of it like this: garbage in, garbage out!

  • 🔍 Accuracy: Ensuring the data reflects real-world values.
  • 🧩 Completeness: Addressing missing values appropriately.
  • 🎯 Consistency: Resolving conflicting data entries.
  • ✅ Validity: Confirming that data conforms to defined formats and rules.

📜 A Brief History of Data Cleaning

The need for data cleaning emerged alongside the increasing volume and complexity of data in the latter half of the 20th century. Early data cleaning techniques were manual and time-consuming. With the advent of powerful programming languages like Python and libraries like Pandas, the process has become increasingly automated and efficient. Pandas, whose development began in 2008 and which was open-sourced in 2009, revolutionized data manipulation and cleaning by providing intuitive tools for handling tabular data.

🔑 Key Principles of Data Cleaning with Pandas

Pandas provides a rich set of functions for tackling common data cleaning challenges.

  • 🧹 Handling Missing Values: Identifying and dealing with `NaN` (Not a Number) values.
  • 📊 Data Type Conversion: Converting columns to the correct data type (e.g., string to numeric).
  • ✂️ Removing Duplicates: Identifying and removing duplicate rows.
  • 📝 Text Cleaning: Standardizing text data by removing whitespace, punctuation, and applying case conversions.
  • 🔢 Outlier Detection: Identifying and handling extreme values that deviate significantly from the norm.
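Before fixing anything, it usually helps to measure how messy the data is. Here's a minimal sketch of a quick audit covering the first three points above. The frame `audit_df` and its column names are made up for illustration and are not part of the sample data used later.

```python
import pandas as pd

# Hypothetical toy frame (illustrative names only).
audit_df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cy"],
    "age": ["31", "45", "45", None],  # numbers stored as strings, one gap
})

missing_per_column = audit_df.isna().sum()          # missing values per column
duplicate_rows = int(audit_df.duplicated().sum())   # rows that repeat an earlier row
column_dtypes = audit_df.dtypes                     # reveals columns with the wrong type

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(column_dtypes)
```

The counts tell you which of the cleaning steps are actually needed for a given dataset.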

💻 Real-World Examples of Data Cleaning with Pandas

Let's dive into some practical examples. We'll use a sample DataFrame for demonstration.

First, import Pandas and NumPy:

import pandas as pd
import numpy as np

Create a sample DataFrame:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Bob'],
    'Age': [25, 30, None, 35, 28, 30],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'London'],
    'Salary': [60000, 75000, 80000, 90000, None, 75000],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male']
}
df = pd.DataFrame(data)
print(df)

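📝 Handling Missing Values

We can fill missing 'Age' values with the mean age. Assigning the result back to the column is more reliable than calling `fillna(..., inplace=True)` on a selected column, which newer Pandas versions warn about and may silently ignore:

df['Age'] = df['Age'].fillna(df['Age'].mean())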
print(df)

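We can also fill the missing 'Salary' values with 0 (a simple placeholder; keep in mind that a zero will drag down any statistics computed on this column later):

df['Salary'] = df['Salary'].fillna(0)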
print(df)
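Filling with the mean or with 0 isn't the only option. As a rough sketch, here are two common alternatives: dropping incomplete rows and filling with the median. The `scores` frame is a hypothetical example, separate from `df` above.

```python
import pandas as pd

# Hypothetical frame with one gap (illustrative names only).
scores = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "score": [70.0, None, 90.0, 100.0],
})

# Option 1: drop every row that has no score.
dropped = scores.dropna(subset=["score"])

# Option 2: fill with the median, which is less sensitive to extreme values than the mean.
median_filled = scores.assign(
    score=scores["score"].fillna(scores["score"].median())
)

print(dropped)
print(median_filled)
```

Both calls return new frames and leave `scores` untouched, so you can compare strategies before committing to one.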

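🔢 Data Type Conversion

Now that 'Age' has no missing values, let's convert it to an integer type. Note that `astype(int)` raises an error if any `NaN` remains, and it truncates fractions, so the filled-in mean of 29.6 becomes 29: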

df['Age'] = df['Age'].astype(int)
print(df)
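For the string-to-numeric case, here's a small sketch under the assumption that some entries can't be parsed. It uses `pd.to_numeric` with `errors="coerce"`, plus the nullable `Int64` dtype for integer columns that still contain gaps. The `raw` frame and `ages` series are invented for the example.

```python
import pandas as pd

# Hypothetical column of numbers stored as text, with one unparseable entry.
raw = pd.DataFrame({"price": ["10", "12.5", "n/a", "7"]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
raw["price_num"] = pd.to_numeric(raw["price"], errors="coerce")
print(raw)

# The nullable Int64 dtype stores integers and still allows missing values.
ages = pd.Series([25, None, 35])
ages_int = ages.astype("Int64")
print(ages_int)
```

After coercing, you can treat the resulting `NaN` values with whichever missing-value strategy fits your data.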

✂️ Removing Duplicates

Remove duplicate rows based on all columns:

df.drop_duplicates(inplace=True)
print(df)

Remove duplicates based on 'Name' and 'City':

df.drop_duplicates(subset=['Name', 'City'], inplace=True)
print(df)
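It's often worth looking at the duplicates before dropping them. This sketch, using a made-up `people` frame, flags every copy with `duplicated(keep=False)` and then keeps the last occurrence rather than the first.

```python
import pandas as pd

# Hypothetical frame: Bob appears twice with different salaries.
people = pd.DataFrame({
    "Name": ["Bob", "Eve", "Bob"],
    "City": ["London", "Sydney", "London"],
    "Salary": [75000, 62000, 78000],
})

# keep=False marks all copies, not just the later ones.
dupe_mask = people.duplicated(subset=["Name", "City"], keep=False)
print(people[dupe_mask])

# Keep the last occurrence, then rebuild a clean 0..n-1 index.
latest = (
    people.drop_duplicates(subset=["Name", "City"], keep="last")
    .reset_index(drop=True)
)
print(latest)
```

Which copy to keep depends on your data; `keep="last"` makes sense when later rows are more recent.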

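📝 Text Cleaning

Let's strip leading and trailing spaces from the 'City' column. The sample values happen to be clean already, so the output doesn't change here, but the same call tidies up real-world input: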

df['City'] = df['City'].str.strip()
print(df)
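To see whitespace, punctuation, and case handled together, here's a minimal sketch on deliberately messy values. The `cities` frame is hypothetical, and the punctuation pattern is just one reasonable choice.

```python
import pandas as pd

# Hypothetical messy input (not the sample df above).
cities = pd.DataFrame({"City": ["  new york ", "LONDON!", "paris.", " Tokyo"]})

cities["City_clean"] = (
    cities["City"]
    .str.strip()                              # drop leading/trailing whitespace
    .str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation
    .str.title()                              # consistent capitalization
)
print(cities)
```

Chaining the `.str` methods keeps each step visible, and writing to a new column lets you compare before and after.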

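📈 Outlier Detection (Example)

While a detailed outlier analysis requires more complex techniques, here's a basic example that flags salaries more than two standard deviations from the mean, in either direction. With only five rows left after removing duplicates, no value can sit that far from the mean, so expect an empty result on the sample data: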

mean_salary = df['Salary'].mean()
std_salary = df['Salary'].std()
threshold = 2  # Adjust this threshold as needed
outliers = df[abs((df['Salary'] - mean_salary) / std_salary) > threshold]
print(outliers)
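On small or skewed data, the interquartile range (IQR) rule is a common alternative to z-scores. This is a sketch with invented salary figures and the conventional 1.5 multiplier, which you may want to tune.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
salaries = pd.Series([60000, 75000, 80000, 90000, 72000, 500000], name="Salary")

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = salaries[(salaries < lower) | (salaries > upper)]
print(iqr_outliers)
```

Because quartiles barely move when one value is extreme, this rule is less likely to have its own threshold distorted by the outlier it's trying to find.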

🧪 Advanced Techniques

For more complex cleaning tasks, you can combine `apply` with your own functions. Regular expressions are also incredibly useful for pattern-based text cleaning.

  • 🔬 Using Apply: Applying a custom function to each row or column.
  • 🧮 Regular Expressions: Powerful pattern matching for complex text manipulation.
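Here's a short sketch of both ideas together. The `contacts` frame and the `tidy_name` helper are hypothetical, made up purely to illustrate the pattern.

```python
import re
import pandas as pd

# Hypothetical contact data (illustrative only).
contacts = pd.DataFrame({
    "Name": ["alice SMITH", "bob  jones"],
    "Phone": ["(555) 123-4567", "555.987.6543"],
})

def tidy_name(name: str) -> str:
    """Collapse repeated spaces and capitalize each word."""
    return re.sub(r"\s+", " ", name).strip().title()

# Series.apply calls the custom function once per value.
contacts["Name"] = contacts["Name"].apply(tidy_name)

# Regex replace: keep digits only, dropping brackets, dots, dashes, and spaces.
contacts["Phone"] = contacts["Phone"].str.replace(r"\D", "", regex=True)
print(contacts)
```

When a built-in `.str` method can do the job, it's usually the simpler choice; `apply` is handy once the logic no longer fits in a single call.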

🧠 Conclusion

Data cleaning with Pandas is an essential skill for anyone working with data. By mastering these techniques, you can ensure your data is accurate, consistent, and ready for analysis. Keep practicing and exploring the vast capabilities of Pandas!
