Exploratory Data Analysis (EDA) in Python: Ultimate Guide

In this guide, we will walk through the Exploratory Data Analysis (EDA) process in Python with a step-by-step implementation, showing how to explore and understand your data more effectively.

We will also draw on examples from 15 popular Kaggle case studies, with code snippets and explanations for each step.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is the process of investigating datasets to summarize their main characteristics, often using visual methods.

It is an important initial step in the data analysis process, where analysts use various statistical tools and graphical techniques to understand a dataset. 

This involves summarizing the main features of the data to discover patterns, identify outliers, test hypotheses, and verify assumptions. 

EDA helps identify the underlying structure of the data and guides further analysis, providing insights that shape the choice of models and methods for detailed investigation. 

EDA allows data analysts to ensure that data is clean and suitable for further processing, ultimately leading to more accurate and understandable conclusions.

Steps for Exploratory Data Analysis in Python

Here are the steps to implement Exploratory Data Analysis in Python so you can extract maximum insight from your data and better understand its trends:

1. Setting Up the Environment

First, install the necessary libraries. You can install them using pip if you don't have them already:
pip install pandas numpy matplotlib seaborn scikit-learn

2. Importing Libraries

We will import the required libraries for Exploratory Data Analysis in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

3. Loading the Dataset

For this example, we'll use the "Titanic" dataset from Kaggle, a classic case study for EDA.
titanic = pd.read_csv('titanic.csv')
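
If you don't have the CSV locally, seaborn bundles a similar copy of the Titanic data. Note that its column names are lowercase ('age', 'embarked', and so on), so the examples below would need adjusting:
# Optional alternative: seaborn's built-in copy of the Titanic data
# (lowercase column names, so adjust the examples below accordingly)
# titanic = sns.load_dataset('titanic')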

4. Understanding the Dataset

Understanding the data is a crucial step before any statistical analysis; we need to see what the data contains, which columns it has, how many rows, and so on.

Functions such as head(), tail(), info(), and describe() are used in Python to inspect the data precisely.
# Display the first few rows of the dataset
print(titanic.head())

# Summary of the dataset
print(titanic.info())

# Summary statistics
print(titanic.describe())
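
A few more quick checks are often useful at this stage; here is a minimal sketch using standard pandas attributes:
# Shape (rows, columns) and column data types
print(titanic.shape)
print(titanic.dtypes)

# Number of unique values per column (useful for spotting categorical features)
print(titanic.nunique())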

5. Handling Missing Values

Handling missing values in Python is a critical task in data preprocessing, ensuring that the dataset is clean and reliable for analysis.

Python, with its powerful libraries like pandas and NumPy, offers various methods to address missing data. Common techniques include:

  1. Removing Missing Values: Dropping rows or columns that contain missing values using dropna(). This method is straightforward but can lead to significant data loss if many values are missing.
  2. Imputing Missing Values: Filling in missing data with substitute values. Methods include replacing with the mean, median, or mode of the column using fillna(), or using more advanced techniques like forward fill and backward fill (ffill() and bfill()).
  3. Interpolating Missing Values: Using interpolation methods to estimate missing values based on the data’s trends. This can be done using the interpolate() function.
  4. Using Algorithms: For more sophisticated imputation, machine learning algorithms like K-Nearest Neighbors (KNN) can predict missing values based on the dataset’s structure.

By appropriately handling missing values, analysts ensure that the integrity and accuracy of their data are maintained, leading to more robust and reliable analysis.

In the Python code below, we treat missing values in the Titanic dataset: checking for them, filling NA values with the median and mode, and dropping a column with too many NAs. A sketch of interpolation and KNN imputation follows the pandas examples.
# Check for missing values
print(titanic.isnull().sum())

# Fill missing values for 'Age' with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

# Fill missing values for 'Embarked' with the mode
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

# Drop the 'Cabin' column because it has too many missing values
titanic.drop(columns='Cabin', inplace=True)
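
Techniques 3 and 4 from the list above (interpolation and KNN imputation) are sketched below. This is illustrative only, and runs on a fresh copy of the data since 'Age' has already been filled with the median:
# Work on a fresh copy, since 'Age' was already imputed above
raw = pd.read_csv('titanic.csv')

# 3. Interpolation: estimate missing values from neighbouring rows
raw['Age'] = raw['Age'].interpolate()

# 4. KNN imputation: predict missing values from the k most similar rows
from sklearn.impute import KNNImputer

numeric_cols = raw.select_dtypes(include='number')
raw[numeric_cols.columns] = KNNImputer(n_neighbors=5).fit_transform(numeric_cols)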

6. Data Visualization

Data visualization in exploratory data analysis (EDA) is a powerful technique for representing data graphically, making it easier to identify patterns, trends, and anomalies. 

By transforming complex data into visual formats such as histograms, scatter plots, box plots, and bar charts, analysts can quickly understand the underlying structures and relationships in the data. 

Visualizations help communicate results effectively by allowing both analysts and stakeholders to see and understand the data overview at a glance. 

This step is important in EDA because it not only reveals hidden patterns but also helps validate assumptions and guides the direction of further analysis.

A. Univariate Analysis

Univariate analysis is a fundamental type of data analysis that focuses on examining and summarizing the characteristics of a single variable.

The primary goal is to understand the distribution and central tendency of the variable, as well as its spread and any patterns it may exhibit.

Common techniques used in univariate analysis include:

  1. Descriptive Statistics: Calculating measures such as mean, median, mode, variance, standard deviation, and range to describe the central tendency and dispersion of the variable.
  2. Frequency Distribution: Creating frequency tables to show the number of occurrences of each unique value or range of values.
  3. Visualization Methods: Using graphs like histograms, box plots, and bar charts to visually inspect the distribution, identify skewness, detect outliers, and observe the overall shape of the data.
Here we will explore the Age and Embarked columns of the Titanic dataset using univariate analysis; a sketch of descriptive statistics and a frequency table follows the plots.
# Distribution of Age
plt.figure(figsize=(10,6))
sns.histplot(titanic['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()

# Count plot of Embarked
plt.figure(figsize=(10,6))
sns.countplot(x='Embarked', data=titanic)
plt.title('Count of Embarked')
plt.show()
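
The list above also mentions descriptive statistics and frequency tables; here is a quick sketch of both for the same two columns:
# Descriptive statistics for Age
print(titanic['Age'].describe())

# Frequency table for Embarked
print(titanic['Embarked'].value_counts())
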
B. Bivariate Analysis

Bivariate analysis is a statistical method that involves the simultaneous analysis of two variables to explore the potential relationship between them.

This type of analysis is crucial for understanding how one variable might affect or be related to another. Key techniques in bivariate analysis include:

  1. Scatter Plots: These graphs plot two variables against each other, helping to visualize any correlations or patterns. For instance, a positive or negative linear relationship can often be spotted in a scatter plot.
  2. Correlation Coefficient: This statistical measure, often denoted as Pearson’s r, quantifies the strength and direction of the linear relationship between two variables. A value close to 1 or -1 indicates a strong relationship, while a value around 0 suggests no linear correlation.
  3. Cross-tabulations: Also known as contingency tables, these are used for categorical variables to examine the frequency distribution across different categories.
  4. Regression Analysis: Simple linear regression can be used to model the relationship between two variables, providing an equation that best fits the observed data.
  5. Comparative Analysis: Techniques like t-tests or chi-square tests compare the means or proportions between two groups to determine if differences are statistically significant.

Bivariate analysis is essential for identifying and understanding relationships between variables, which can inform more complex multivariate analyses and guide decision-making processes.

Here we will perform bivariate analysis on variable pairs such as (Survived, Sex) and (Survived, Age) in the Titanic dataset; a sketch of a cross-tabulation and a correlation follows the plots.
# Survival rate by gender
plt.figure(figsize=(10,6))
sns.countplot(x='Survived', hue='Sex', data=titanic)
plt.title('Survival by Gender')
plt.show()


# Age distribution by survival
plt.figure(figsize=(10,6))
sns.boxplot(x='Survived', y='Age', data=titanic)
plt.title('Age Distribution by Survival')
plt.show()
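
To complement the plots, here is a small sketch of a cross-tabulation and a Pearson correlation from the techniques listed above:
# Cross-tabulation: survival proportions by gender
print(pd.crosstab(titanic['Sex'], titanic['Survived'], normalize='index'))

# Pearson's r between Fare and Survived
print(titanic['Fare'].corr(titanic['Survived']))
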
C. Multivariate Analysis

Multivariate analysis is a sophisticated statistical technique used to examine the relationships between three or more variables simultaneously.

This type of analysis allows researchers to understand the complex interactions and dependencies among multiple variables, providing a deeper and more comprehensive understanding of the data.

Key methods in multivariate analysis include:
  1. Multiple Regression Analysis: Extends simple regression by modeling the relationship between a dependent variable and multiple independent variables. It helps in predicting the outcome and assessing the influence of each predictor.
  2. Principal Component Analysis (PCA): Reduces the dimensionality of the data by transforming original variables into a smaller set of uncorrelated variables called principal components. This is useful for simplifying data while retaining most of the variability.
  3. Factor Analysis: Identifies underlying factors that explain the pattern of correlations within a set of observed variables. It helps in data reduction and identifying latent constructs.
  4. Cluster Analysis: Groups observations into clusters based on similarities across multiple variables. This technique is useful for market segmentation, pattern recognition, and anomaly detection.
  5. Multivariate Analysis of Variance (MANOVA): Extends ANOVA by examining the influence of one or more categorical independent variables on multiple continuous dependent variables. It helps in understanding the effect of factors on several outcomes simultaneously.
  6. Discriminant Analysis: Used to classify observations into predefined categories based on predictor variables. It is often applied in cases where the goal is to predict group membership.

Multivariate analysis provides a comprehensive framework for analyzing complex datasets, uncovering hidden patterns, and making informed decisions based on the interactions and effects of multiple variables. A PCA sketch follows the plots below.

# Pairplot to see the interactions between features
sns.pairplot(titanic[['Survived', 'Pclass', 'Age', 'Fare', 'SibSp', 'Parch']])
plt.show()

# A heatmap of pairwise correlations helps identify patterns and anomalies across the dataset
titanic1 = titanic.select_dtypes(exclude=['object'])
plt.figure(figsize=(12, 7))
sns.heatmap(titanic1.corr(),annot=True)
plt.show()
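
As a minimal sketch of technique 2 (PCA) from the list above, we can reduce the numeric columns to two components with scikit-learn. The data is standardized first, since PCA is sensitive to feature scale:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns, dropping any remaining missing rows
numeric = titanic.select_dtypes(exclude=['object']).dropna()
scaled = StandardScaler().fit_transform(numeric)

# Project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component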

7. Feature Engineering

Feature engineering is a machine learning technique that transforms raw data into features that can be used for training and prediction. 

The purpose of feature engineering is to improve model accuracy, simplify and speed up data transformations, and create more informative inputs for machine learning models.

Creating new features to improve the performance of models:
# Create a new feature 'FamilySize'
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

# Create a new feature 'IsAlone'
titanic['IsAlone'] = 1 # Initialize to yes/1 is alone
titanic.loc[titanic['FamilySize'] > 1, 'IsAlone'] = 0 # Now update to no/0 if family size is greater than 1

# Create a new feature 'Title' from the 'Name'
titanic['Title'] = titanic['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Simplify the titles
titanic['Title'] = titanic['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 
                                             'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')
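
A quick sanity check on the engineered features, using the columns created above, helps confirm they behave as expected:
# Survival rate by engineered title
print(titanic.groupby('Title')['Survived'].mean())

# Distribution of the new FamilySize and IsAlone features
print(titanic['FamilySize'].value_counts().sort_index())
print(titanic['IsAlone'].value_counts())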

8. Correlation Analysis

In the Exploratory data analysis process, Correlation analysis plays a crucial role in analyzing data.

It is an important part of exploratory data analysis (EDA) because it helps identify relationships between different variables in a data set.

Correlation coefficients such as Pearson’s, Spearman’s, and Kendall’s measure the strength and direction of these relationships. Here we use the Pearson correlation coefficient, the most common type; a Spearman and Kendall sketch follows the heatmap.

# Correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
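
Pearson is pandas' default; the other coefficients mentioned above can be computed by passing method= to corr(). A small sketch:
# Spearman (rank-based) and Kendall correlations with survival
numeric = titanic.select_dtypes(include='number')
print(numeric.corr(method='spearman')['Survived'].sort_values(ascending=False))
print(numeric.corr(method='kendall')['Survived'].sort_values(ascending=False))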

Top 15 Case Studies from Kaggle

Here are 15 notable Kaggle datasets for practicing EDA. For each one, follow steps similar to those outlined above:

1. Titanic: Machine Learning from Disaster

Dataset: Titanic: Machine Learning from Disaster

# Load the dataset
titanic = pd.read_csv('titanic.csv')
# Follow the steps as mentioned in the guide above

2. House Prices: Advanced Regression Techniques

Dataset: House Prices: Advanced Regression Techniques

# Load the dataset
house_prices = pd.read_csv('house_prices.csv')

# Basic Information and Cleaning
print(house_prices.info())
print(house_prices.describe())
print(house_prices.isnull().sum())

# Fill missing values (example)
house_prices['LotFrontage'] = house_prices['LotFrontage'].fillna(house_prices['LotFrontage'].median())

3. Iris Species

Dataset: Iris Species

# Load the dataset
iris = pd.read_csv('iris.csv')

# Basic Information
print(iris.info())
print(iris.describe())

# Visualization
sns.pairplot(iris, hue='species')
plt.show()

4. Heart Disease UCI

Dataset: Heart Disease UCI

# Load the dataset
heart = pd.read_csv('heart.csv')

# Basic Information
print(heart.info())
print(heart.describe())

# Visualization
sns.heatmap(heart.corr(), annot=True, cmap='coolwarm')
plt.show()

5. COVID-19 Open Research Dataset Challenge (CORD-19)

Dataset: COVID-19 Open Research Dataset Challenge (CORD-19)

# Due to the complexity and size of this dataset, advanced EDA techniques and NLP are required.
# Refer to Kaggle kernels for detailed analysis: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/notebooks

6. Wine Quality Dataset

Dataset: Wine Quality Dataset

# Load the dataset
wine = pd.read_csv('winequality-red.csv')

# Basic Information
print(wine.info())
print(wine.describe())

# Visualization
sns.heatmap(wine.corr(), annot=True, cmap='coolwarm')
plt.show()

7. IMDB 5000 Movie Dataset

Dataset: IMDB 5000 Movie Dataset

# Load the dataset
imdb = pd.read_csv('movie_metadata.csv')

# Basic Information
print(imdb.info())
print(imdb.describe())

# Visualization
sns.pairplot(imdb[['gross', 'budget', 'imdb_score', 'movie_facebook_likes']])
plt.show()

8. New York City Taxi Fare Prediction

Dataset: New York City Taxi Fare Prediction

# Load the dataset
taxi = pd.read_csv('train.csv', nrows=100_000)  # the full file is several GB; sample it for quick EDA

# Basic Information
print(taxi.info())
print(taxi.describe())

# Visualization
sns.scatterplot(x='pickup_longitude', y='pickup_latitude', data=taxi)
plt.show()

9. Retail Analysis with Walmart Data

Dataset: Retail Analysis with Walmart Data

# Load the dataset
walmart = pd.read_csv('train.csv')

# Basic Information
print(walmart.info())
print(walmart.describe())

# Visualization
walmart['Date'] = pd.to_datetime(walmart['Date'])  # parse dates so the time axis is ordered
sns.lineplot(x='Date', y='Weekly_Sales', data=walmart)
plt.show()

10. Customer Segmentation (Mall Customer Data)

Dataset: Customer Segmentation (Mall Customer Segmentation Data)

# Load the dataset
mall = pd.read_csv('Mall_Customers.csv')

# Basic Information
print(mall.info())
print(mall.describe())

# Visualization
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=mall)
plt.show()
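
Since this case study is about segmentation, here is a minimal KMeans sketch on the two columns plotted above (the cluster count of 5 is a common choice for this dataset, but an assumption here):
from sklearn.cluster import KMeans

# Cluster customers on income and spending score (k=5 is an assumption)
X = mall[['Annual Income (k$)', 'Spending Score (1-100)']]
mall['Cluster'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Cluster', data=mall, palette='tab10')
plt.show()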

11. Airbnb New User Bookings

Dataset: Airbnb New User Bookings

# Load the dataset
airbnb = pd.read_csv('train_users_2.csv')

# Basic Information
print(airbnb.info())
print(airbnb.describe())

# Visualization
sns.countplot(x='country_destination', data=airbnb)
plt.show()

12. Bike Sharing Demand

Dataset: Bike Sharing Demand

# Load the dataset
bike = pd.read_csv('train.csv')

# Basic Information
print(bike.info())
print(bike.describe())

# Visualization
bike['datetime'] = pd.to_datetime(bike['datetime'])  # parse timestamps for a proper time axis
sns.lineplot(x='datetime', y='count', data=bike)
plt.show()

13. Rossmann Store Sales

Dataset: Rossmann Store Sales

# Load the dataset
rossmann = pd.read_csv('train.csv')

# Basic Information
print(rossmann.info())
print(rossmann.describe())

# Visualization
rossmann['Date'] = pd.to_datetime(rossmann['Date'])  # parse dates for a proper time axis
sns.lineplot(x='Date', y='Sales', data=rossmann)
plt.show()

14. Breast Cancer Wisconsin (Diagnostic) Data Set

Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set

# Load the dataset
cancer = pd.read_csv('data.csv')

# Basic Information
print(cancer.info())
print(cancer.describe())

# Visualization
sns.heatmap(cancer.corr(numeric_only=True), annot=True, cmap='coolwarm')  # 'diagnosis' is non-numeric
plt.show()

15. World Happiness Report

Dataset: World Happiness Report

# Load the dataset
happiness = pd.read_csv('2019.csv')

# Basic Information
print(happiness.info())
print(happiness.describe())

# Visualization
sns.scatterplot(x='GDP per capita', y='Score', data=happiness)
plt.show()

Conclusion

EDA is an iterative and detailed process that helps in understanding the dataset better and preparing it for modeling.

The steps and techniques described above can be applied to various datasets, including those available on Kaggle.

Each dataset may require specific handling based on its unique characteristics, but the fundamental principles of EDA remain consistent.