In the world of data science, Exploratory Data Analysis (EDA) is the critical first step after collecting data. It helps you understand your data — its structure, patterns, anomalies, and relationships — before applying models or making decisions.
This blog will walk you through:
Table of Contents
- What is EDA?
- Why is EDA important?
- Let’s Begin with a Dataset!
- Step-by-Step EDA Process
- Final Summary and Conclusion
What is EDA?
Exploratory Data Analysis (EDA) is the process of visually and statistically examining datasets to summarize their main characteristics. It’s about exploring the unknowns and building intuition for the data.
Why is EDA important?
- Identifies patterns and relationships
- Detects missing or incorrect data
- Informs feature engineering and model selection
- Saves time and effort in later stages of a project
Let’s Begin with a Dataset!
We’ll use the Titanic dataset, which contains data about passengers: age, sex, class, fare, and survival status.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
df = sns.load_dataset('titanic')
df.head()
Step-by-Step EDA Process
1. Understand the Data Structure
df.shape # (rows, columns)
df.info() # Column dtypes and non-null counts
df.describe() # Summary statistics for the numeric columns
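To see types, distinct values, and gaps side by side, the three calls above can be condensed into one table. This is only a minimal sketch: the helper name `column_overview` and the toy rows are made up for illustration, and in the post you would pass `df` instead.

```python
import pandas as pd


def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct non-null values, missing count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),      # NaN is not counted as a value
        "n_missing": frame.isna().sum(),
    })


# Hypothetical toy rows standing in for the Titanic data
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 4.0],
    "sex": ["male", "female", "female", "male"],
    "survived": [0, 1, 1, 1],
})

overview = column_overview(toy)
print(overview)
```

A column with very few distinct values (like `sex`) is usually categorical, even when it is stored as a number.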
2. Check for Missing Values
df.isnull().sum()
We found missing values in age, embarked, embark_town, and deck. The embark_town column is derived from embarked, so it is missing in the same rows.
df['age'] = df['age'].fillna(df['age'].median()) # Fill with the median; assign back instead of using inplace on a column selection
df.drop(columns=['deck'], inplace=True) # Too sparse
df.dropna(subset=['embarked'], inplace=True)
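Whether a column is "too sparse" is easier to judge from the share of missing values than from raw counts. A small sketch of such a report follows; `missing_report` is a hypothetical helper and the toy values are invented, so run it on the freshly loaded `df` (before cleaning) to see the real numbers.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, most affected first."""
    report = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })
    # Keep only columns that actually have gaps
    report = report[report["n_missing"] > 0]
    return report.sort_values("pct_missing", ascending=False)


# Hypothetical toy rows; the real dataset has many more columns
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 4.0],
    "deck": [None, "C", None, None],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 21.08],
})

report = missing_report(toy)
print(report)
```

A mostly empty column is a candidate for dropping, a moderately affected numeric column for imputation, and a column with only a handful of gaps for dropping the affected rows. The cut-offs between these cases are a judgment call, not a fixed rule.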
3. Univariate Analysis
Numerical: Age
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
Categorical: Class
sns.countplot(x='class', data=df)
plt.title('Passenger Class Count')
plt.show()
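A count plot shows the shape, but exact numbers are handy when you write up results. Here is a sketch that tabulates counts and percentage shares for a categorical column; `category_shares` is a name chosen for this example and the toy series is invented. With the real data you could call it on `df['class']`.

```python
import pandas as pd


def category_shares(series: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category."""
    counts = series.value_counts(dropna=False)  # keep NaN visible if present
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": shares})


# Hypothetical toy values standing in for the passenger class column
toy_class = pd.Series(["Third", "First", "Third", "Second", "Third"])

table = category_shares(toy_class)
print(table)
```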
4. Bivariate Analysis
Survival by Gender
sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Gender')
plt.show()
Survival by Class
sns.barplot(x='class', y='survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.show()
Age vs Survival
sns.boxplot(x='survived', y='age', data=df)
plt.title('Age vs Survival')
plt.show()
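By default `sns.barplot` draws the mean of `y` for each group, so with a 0/1 `survived` column the bar heights are survival rates. The same numbers can be read off with a `groupby`. The sketch below uses a hypothetical helper, `survival_rate`, and invented toy rows; it assumes the frame has a 0/1 `survived` column like the one used in this post.

```python
import pandas as pd


def survival_rate(frame: pd.DataFrame, by) -> pd.DataFrame:
    """Survival rate and group size for each value (or combination) of `by`."""
    # observed=True skips unused categories when a grouping column is categorical
    grouped = frame.groupby(by, observed=True)["survived"]
    return grouped.agg(rate="mean", n="size")


# Hypothetical toy rows, not real passenger records
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "class": ["Third", "First", "Second", "Third", "Third", "First"],
    "survived": [0, 1, 1, 0, 0, 1],
})

rates = survival_rate(toy, "sex")
print(rates)

# Passing a list of columns breaks the rate down by two variables at once
print(survival_rate(toy, ["sex", "class"]))
```

Reporting the group size `n` next to each rate helps you spot rates that rest on only a handful of passengers.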
5. Correlation Matrix
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
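A full heatmap can be hard to scan when you mostly care about one target. As a sketch, the snippet below pulls out the column of correlations with `survived` and ranks it by absolute strength. The helper name `correlations_with` and the toy numbers are assumptions for illustration only.

```python
import pandas as pd


def correlations_with(frame: pd.DataFrame, target: str = "survived") -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Rank by magnitude so strong negative correlations are not buried
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy rows, not real passenger records
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "fare": [7.0, 70.0, 50.0, 8.0, 30.0, 10.0],
    "age": [30, 25, 40, 35, 20, 28],
})

ranked = correlations_with(toy)
print(ranked)
```

Keep in mind that Pearson correlation only captures linear relationships, and a correlation with a 0/1 target says nothing about causation.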
Final Summary and Conclusion
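- Gender: Women survived at a higher rate than men.
- Class: 1st class had the highest survival rate, and 3rd class the lowest.
- Age: Children appear to have had an edge in survival, although the boxplot shows only a small age difference between survivors and non-survivors.
- Fare: Passengers who paid higher fares had better chances of survival, which is consistent with the class pattern.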
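The age point is the hardest one to read off a boxplot, so it is worth checking directly. One possible way, sketched below, is to split passengers into age bands and compare survival rates. The helper `survival_by_age_band`, the cut-off of 13 years, and the toy rows are all assumptions made for this example, not part of the original analysis.

```python
import pandas as pd


def survival_by_age_band(frame: pd.DataFrame, child_cutoff: int = 13) -> pd.Series:
    """Survival rate for children (age < child_cutoff) versus everyone older."""
    bands = pd.cut(
        frame["age"],
        bins=[0, child_cutoff, 120],
        labels=["child", "adult"],
        right=False,  # intervals are [0, cutoff) and [cutoff, 120)
    )
    return frame.groupby(bands, observed=True)["survived"].mean()


# Hypothetical toy rows, not real passenger records
toy = pd.DataFrame({
    "age": [4, 30, 8, 45, 12, 60],
    "survived": [1, 0, 1, 0, 0, 1],
})

by_band = survival_by_age_band(toy)
print(by_band)
```

If you run this on `df`, remember that missing ages were filled with the median earlier, so those passengers all land in the adult band.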