
Unlocking Data Insights: A Complete Guide to Exploratory Data Analysis (EDA) with Python

In the world of data science, Exploratory Data Analysis (EDA) is the critical first step after collecting data. It helps you understand your data — its structure, patterns, anomalies, and relationships — before applying models or making decisions.

This blog walks through what EDA is, why it matters, and a step-by-step EDA of the Titanic dataset in Python, ending with a summary of the findings.

What is EDA?

Exploratory Data Analysis (EDA) is the process of visually and statistically examining datasets to summarize their main characteristics. It’s about exploring the unknowns and building intuition for the data.


Why is EDA important?

  • Identifies patterns and relationships
  • Detects missing or incorrect data
  • Informs feature engineering and model selection
  • Saves time and effort in later stages of a project

Let’s Begin with a Dataset!

We’ll use the Titanic dataset, which contains data about passengers: age, sex, class, fare, and survival status.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset that ships with seaborn (fetched online on first use)
df = sns.load_dataset('titanic')
df.head()  # Preview the first five rows

Step-by-Step EDA Process

1. Understand the Data Structure

df.shape       # (rows, columns)
df.info()      # Column dtypes and non-null counts
df.describe()  # Summary statistics for the numeric columns
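For the non-numeric columns, which describe() skips by default, it can help to see each column's dtype next to its number of distinct values. The following is a minimal sketch: column_overview is a hypothetical helper name, and the toy rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: its dtype and its number of distinct non-null values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
    })

# Toy rows, made up for illustration (not real Titanic records)
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, None],
    "survived": [0, 1, 1],
})
overview = column_overview(toy)
print(overview)
```

Columns with only a few distinct values (such as sex or class) are usually candidates for count plots, while columns with many distinct values suit histograms.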

2. Check for Missing Values

df.isnull().sum()

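The output shows missing values in age, embarked, and deck (and in embark_town, which mirrors embarked). We fill age with its median, drop deck because most of its values are missing, and drop the few rows that have no embarked value.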

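# Assign the result back: inplace=True on a column selection triggers a warning
# in recent pandas versions and may leave df unchanged
df['age'] = df['age'].fillna(df['age'].median())
df.drop(columns=['deck'], inplace=True)       # Too sparse to impute
df.dropna(subset=['embarked'], inplace=True)  # Only a few rows are affected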
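Whether to impute or drop depends on how much of a column is missing, so a percentage is often more useful than a raw count. Here is a small sketch of that idea. The helper name missing_report is an assumption, and the toy frame is invented for illustration.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data, made up for illustration (not the real Titanic values)
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "deck": [None, None, None, "C"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})
report = missing_report(toy)
print(report)
```

Run on the original df before cleaning, a report like this makes the reasoning explicit: a mostly empty column is a candidate for dropping, while a column with a moderate share of gaps can be imputed.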

3. Univariate Analysis

Numerical: Age

sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()

Categorical: Class

sns.countplot(x='class', data=df)
plt.title('Passenger Class Count')
plt.show()
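A count plot shows absolute numbers; the same information as percentages is sometimes easier to compare. A minimal sketch using value_counts(normalize=True) follows. The helper name category_shares and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of rows in each category of `column`, largest first."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)

# Toy rows, made up for illustration
toy = pd.DataFrame({"class": ["Third", "Third", "First", "Second"]})
shares = category_shares(toy, "class")
print(shares)
```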

4. Bivariate Analysis

Survival by Gender

sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Gender')
plt.show()

Survival by Class

sns.barplot(x='class', y='survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.show()
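Because survived is coded as 0 or 1, each bar in these plots is the mean of survived within a group, that is, the group's survival rate. The numbers behind the bars can be read off with a groupby. This is a sketch under that assumption; survival_rate is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd

def survival_rate(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Survival rate (mean of the 0/1 'survived' column) and group size per group."""
    return frame.groupby(by, observed=True)["survived"].agg(rate="mean", n="size")

# Toy rows, made up for illustration (not real Titanic records)
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "female"],
    "survived": [0, 1, 1, 1, 0],
})
rates = survival_rate(toy, "sex")
print(rates)
```

On the real data, survival_rate(df, 'sex') and survival_rate(df, 'class') would give the values plotted above. Reporting the group size alongside the rate helps judge how much weight a rate deserves.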

Age vs Survival

sns.boxplot(x='survived', y='age', data=df)
plt.title('Age vs Survival')
plt.show()
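A box plot compares the overall age distributions of the two groups, but it does not isolate children. One way to check a specific age group is to bin age with pd.cut and compute the survival rate per band. The sketch below assumes arbitrary band edges (under 12, 12 to 17, 18 to 59, 60 and over) chosen only for illustration, and uses invented toy rows.

```python
import pandas as pd

# Band edges and labels are assumptions for this example, not a standard
AGE_BINS = [0, 12, 18, 60, 120]
AGE_LABELS = ["child", "teen", "adult", "senior"]

def survival_by_age_band(frame: pd.DataFrame) -> pd.Series:
    """Survival rate per age band; bands are left-closed, e.g. child = [0, 12)."""
    bands = pd.cut(frame["age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    return frame.groupby(bands, observed=True)["survived"].mean()

# Toy rows, made up for illustration (not real Titanic records)
toy = pd.DataFrame({
    "age": [4, 10, 30, 45, 70],
    "survived": [1, 1, 0, 1, 0],
})
by_band = survival_by_age_band(toy)
print(by_band)
```

One caveat: if missing ages were filled with the median beforehand, all of those passengers fall into the same band, which can blur that band's rate.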

5. Correlation Matrix

plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
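When the question is which features move with survival, a single column of the matrix is often enough. The sketch below pulls out the correlations with one target column and orders them by strength. The helper name correlation_with is hypothetical, and the toy data is constructed by hand, so its values say nothing about the real dataset.

```python
import pandas as pd

def correlation_with(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data, constructed by hand for illustration
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "pclass": [3, 3, 1, 1],
    "fare": [1.0, 2.0, 3.0, 4.0],
    "sex": ["male", "male", "female", "female"],  # non-numeric, ignored by corr
})
ranked = correlation_with(toy, "survived")
print(ranked)
```

Keep in mind that Pearson correlation only captures linear relationships between numeric columns, so categorical features such as sex need the group comparisons shown earlier.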

Final Summary and Conclusion

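  • Gender: Women had a clearly higher survival rate than men.
  • Class: 1st class had the highest survival rate and 3rd class the lowest.
  • Age: The box plot hints that younger passengers, children in particular, had an edge in survival.
  • Fare: Fare correlates positively with survival in the correlation matrix, consistent with higher-paying passengers travelling mostly in the higher classes.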
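The fare finding rests on the correlation matrix alone. A quick way to check it more directly is to split fare into quartiles with pd.qcut and compare survival rates across them. This is a sketch, not part of the analysis above: the helper name, the quartile labels, and the toy rows are all assumptions.

```python
import pandas as pd

def survival_by_fare_quartile(frame: pd.DataFrame) -> pd.Series:
    """Survival rate within each fare quartile (Q1 = cheapest tickets)."""
    quartile = pd.qcut(frame["fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    return frame.groupby(quartile, observed=True)["survived"].mean()

# Toy rows, made up for illustration (not real Titanic records)
toy = pd.DataFrame({
    "fare": [5, 7, 8, 10, 15, 20, 50, 80],
    "survived": [0, 0, 0, 1, 0, 1, 1, 1],
})
by_quartile = survival_by_fare_quartile(toy)
print(by_quartile)
```

If the rates rise steadily from Q1 to Q4 on the real data, that supports the fare conclusion; because fare and class are related, comparing within a single class would help separate the two effects.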
