In this guide, we will learn the Exploratory Data Analysis (EDA) process in Python with a step-by-step implementation guide.
This comprehensive guide on Exploratory Data Analysis (EDA) in Python will show you how to explore and understand your data more effectively.
We will also work through examples drawn from 15 popular Kaggle case studies, including code snippets and explanations for each step.
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is the process of investigating datasets to summarize their main characteristics, often using visual methods.
It is an important initial step in the data analysis process, where analysts use various statistical tools and graphical techniques to understand a dataset.
Summarizing the data this way helps analysts discover patterns, identify outliers, test hypotheses, and verify assumptions.
EDA helps identify the underlying structure of the data and guides further analysis, providing insights that shape the choice of models and methods for detailed investigation.
EDA allows data analysts to ensure that data is clean and suitable for further processing, ultimately leading to more accurate and understandable conclusions.
Steps for Exploratory Data Analysis in Python
Here are the steps to implement Exploratory Data Analysis in Python so you can extract maximum insight from a dataset and understand its trends more clearly:
1. Setting Up the Environment
First, install the necessary libraries. You can install them using pip if you don't have them already:
pip install pandas numpy matplotlib seaborn scikit-learn
2. Importing Libraries
We start by importing the libraries required for Exploratory Data Analysis in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
3. Loading the Dataset
For this example, we'll use the "Titanic" dataset from Kaggle, a classic case study for EDA.
titanic = pd.read_csv('titanic.csv')
4. Understanding the Dataset
Understanding the data is a crucial step before any statistical analysis: we need to see what the data contains, which columns and rows it has, how many records there are, and so on.
Functions such as head(), tail(), info(), and describe() are used in Python to inspect the data precisely.
# Display the first few rows of the dataset
print(titanic.head())

# Summary of the dataset
print(titanic.info())

# Summary statistics
print(titanic.describe())
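Since tail() is mentioned above but not shown, here is a quick sketch of it along with shape, which reports the dataset's dimensions:

# Display the last few rows of the dataset
print(titanic.tail())

# Dimensions of the dataset as (rows, columns)
print(titanic.shape)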
5. Handling Missing Values
Handling missing values in Python is a critical task in data preprocessing, ensuring that the dataset is clean and reliable for analysis.
Python, with its powerful libraries like pandas and NumPy, offers various methods to address missing data. Common techniques include:
- Removing Missing Values: Dropping rows or columns that contain missing values using dropna(). This method is straightforward but can lead to significant data loss if many values are missing.
- Imputing Missing Values: Filling in missing data with substitute values. Methods include replacing with the mean, median, or mode of the column using fillna(), or using more advanced techniques like forward fill and backward fill (ffill() and bfill()).
- Interpolating Missing Values: Using interpolation methods to estimate missing values based on the data’s trends. This can be done using the interpolate() function.
- Using Algorithms: For more sophisticated imputation, machine learning algorithms like K-Nearest Neighbors (KNN) can predict missing values based on the dataset’s structure.
By appropriately handling missing values, analysts ensure that the integrity and accuracy of their data are maintained, leading to more robust and reliable analysis.
In the Python code below, we treat missing values in the Titanic dataset: checking for missing values, filling NA values with the median and mode, and dropping a column that has too many NA values.
# Check for missing values
print(titanic.isnull().sum())

# Fill missing values for 'Age' with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

# Fill missing values for 'Embarked' with the mode
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

# Drop the 'Cabin' column because it has too many missing values
titanic.drop(columns='Cabin', inplace=True)
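The interpolation and KNN-based techniques listed above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed (it was included in the pip command earlier); 'Age' and 'Fare' are used purely as example columns, and since 'Age' was already filled above, treat this as a reusable pattern rather than a required step:

from sklearn.impute import KNNImputer

# Estimate missing values from the trend of neighboring rows
titanic['Age'] = titanic['Age'].interpolate()

# KNN imputation: fill each missing value using the 5 most similar rows
imputer = KNNImputer(n_neighbors=5)
titanic[['Age', 'Fare']] = imputer.fit_transform(titanic[['Age', 'Fare']])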
6. Data Visualization
Data visualization in exploratory data analysis (EDA) is a powerful technique for representing data graphically in Python, making it easier to identify patterns, trends, and anomalies.
By transforming complex data into visual formats such as histograms, scatter plots, box plots, and bar charts, analysts can quickly understand the underlying structures and relationships in the data.
Visualizations help communicate results effectively by allowing both analysts and stakeholders to see and understand the data overview at a glance.
This step is important in EDA because it not only helps reveal hidden patterns, but also strengthens assumptions and guides the direction of further analysis.
A. Univariate Analysis
Univariate analysis is a fundamental type of data analysis that focuses on examining and summarizing the characteristics of a single variable.
The primary goal is to understand the distribution and central tendency of the variable, as well as its spread and any patterns it may exhibit.
Common techniques used in univariate analysis include:
- Descriptive Statistics: Calculating measures such as mean, median, mode, variance, standard deviation, and range to describe the central tendency and dispersion of the variable.
- Frequency Distribution: Creating frequency tables to show the number of occurrences of each unique value or range of values.
- Visualization Methods: Using graphs like histograms, box plots, and bar charts to visually inspect the distribution, identify skewness, detect outliers, and observe the overall shape of the data.
Here we explore the Age and Embarked variables in the Titanic dataset using univariate analysis:
# Distribution of Age
plt.figure(figsize=(10,6))
sns.histplot(titanic['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()

# Count plot of Embarked
plt.figure(figsize=(10,6))
sns.countplot(x='Embarked', data=titanic)
plt.title('Count of Embarked')
plt.show()
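The descriptive statistics and frequency tables mentioned above can be produced directly in pandas; a minimal sketch on the same two variables:

# Descriptive statistics for 'Age': mean, std, quartiles, etc.
print(titanic['Age'].describe())

# Skewness of the 'Age' distribution (0 means symmetric)
print('Skewness of Age:', titanic['Age'].skew())

# Frequency table for 'Embarked'
print(titanic['Embarked'].value_counts())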
B. Bivariate Analysis
Bivariate analysis is a statistical method that involves the simultaneous analysis of two variables to explore the potential relationship between them.
This type of analysis is crucial for understanding how one variable might affect or be related to another. Key techniques in bivariate analysis include:
- Scatter Plots: These graphs plot two variables against each other, helping to visualize any correlations or patterns. For instance, a positive or negative linear relationship can often be spotted in a scatter plot.
- Correlation Coefficient: This statistical measure, often denoted as Pearson’s r, quantifies the strength and direction of the linear relationship between two variables. A value close to 1 or -1 indicates a strong relationship, while a value around 0 suggests no linear correlation.
- Cross-tabulations: Also known as contingency tables, these are used for categorical variables to examine the frequency distribution across different categories.
- Regression Analysis: Simple linear regression can be used to model the relationship between two variables, providing an equation that best fits the observed data.
- Comparative Analysis: Techniques like t-tests or chi-square tests compare the means or proportions between two groups to determine if differences are statistically significant.
Bivariate analysis is essential for identifying and understanding relationships between variables, which can inform more complex multivariate analyses and guide decision-making processes.
Here we perform bivariate analysis on the Titanic dataset using variable pairs such as (Survived, Sex) and (Survived, Age):
# Survival rate by gender
plt.figure(figsize=(10,6))
sns.countplot(x='Survived', hue='Sex', data=titanic)
plt.title('Survival by Gender')
plt.show()

# Age distribution by survival
plt.figure(figsize=(10,6))
sns.boxplot(x='Survived', y='Age', data=titanic)
plt.title('Age Distribution by Survival')
plt.show()
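The correlation coefficient, cross-tabulation, and significance-test techniques listed above can be sketched as follows. This assumes SciPy is available (it is installed as a dependency of scikit-learn) and is meant as a pattern, not a definitive analysis:

from scipy.stats import chi2_contingency

# Pearson's r between two numeric variables
print("Pearson's r (Age vs Fare):", titanic['Age'].corr(titanic['Fare']))

# Cross-tabulation (contingency table) of two categorical variables
ct = pd.crosstab(titanic['Sex'], titanic['Survived'])
print(ct)

# Chi-square test: is survival independent of gender?
chi2, p, dof, expected = chi2_contingency(ct)
print('Chi-square p-value:', p)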
C. Multivariate Analysis
Multivariate analysis is a sophisticated statistical technique used to examine the relationships between three or more variables simultaneously.
This type of analysis allows researchers to understand the complex interactions and dependencies among multiple variables, providing a deeper and more comprehensive understanding of the data.
Key methods in multivariate analysis include:
- Multiple Regression Analysis: Extends simple regression by modeling the relationship between a dependent variable and multiple independent variables. It helps in predicting the outcome and assessing the influence of each predictor.
- Principal Component Analysis (PCA): Reduces the dimensionality of the data by transforming original variables into a smaller set of uncorrelated variables called principal components. This is useful for simplifying data while retaining most of the variability.
- Factor Analysis: Identifies underlying factors that explain the pattern of correlations within a set of observed variables. It helps in data reduction and identifying latent constructs.
- Cluster Analysis: Groups observations into clusters based on similarities across multiple variables. This technique is useful for market segmentation, pattern recognition, and anomaly detection.
- Multivariate Analysis of Variance (MANOVA): Extends ANOVA by examining the influence of one or more categorical independent variables on multiple continuous dependent variables. It helps in understanding the effect of factors on several outcomes simultaneously.
- Discriminant Analysis: Used to classify observations into predefined categories based on predictor variables. It is often applied in cases where the goal is to predict group membership.
Multivariate analysis provides a comprehensive framework for analyzing complex datasets, uncovering hidden patterns, and making informed decisions based on multiple variables’ interactions and effects.
# Pairplot to see the interactions between features
sns.pairplot(titanic[['Survived', 'Pclass', 'Age', 'Fare', 'SibSp', 'Parch']])
plt.show()

# A heatmap helps spot patterns and anomalies across the numeric columns of the dataset
titanic1 = titanic.select_dtypes(exclude=['object'])
plt.figure(figsize=(12, 7))
sns.heatmap(titanic1.corr(), annot=True)
plt.show()
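The PCA technique described above can be sketched with scikit-learn; a minimal example that standardizes a few numeric Titanic columns first (the choice of columns and two components is illustrative only):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features, then project onto 2 principal components
numeric = titanic[['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
scaled = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# How much of the total variance each component retains
print('Explained variance ratio:', pca.explained_variance_ratio_)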
7. Feature Engineering
Feature engineering is a machine learning technique that involves transforming raw data into features that can be used for training and prediction.
Its goals are to improve model accuracy, simplify and speed up data transformations, and create new, more informative inputs for machine learning models.
Below we create new features to improve model performance:
# Create a new feature 'FamilySize'
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

# Create a new feature 'IsAlone': initialize to yes/1, then set to no/0 if family size is greater than 1
titanic['IsAlone'] = 1
titanic.loc[titanic['FamilySize'] > 1, 'IsAlone'] = 0

# Create a new feature 'Title' from the 'Name'
titanic['Title'] = titanic['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Simplify the titles
titanic['Title'] = titanic['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')
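Engineered categorical features such as 'Title' usually need to be encoded before they can feed a model; a brief sketch using one-hot encoding (this encoding step is an illustration, not part of the original walkthrough):

# One-hot encode 'Title' into binary indicator columns
title_dummies = pd.get_dummies(titanic['Title'], prefix='Title')
titanic = pd.concat([titanic, title_dummies], axis=1)
print(titanic.filter(like='Title_').head())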
8. Correlation Analysis
Correlation analysis plays a crucial role in the exploratory data analysis process.
It helps identify relationships between the different variables in a dataset.
Correlation coefficients such as Pearson's, Spearman's, and Kendall's measure the strength and direction of these relationships. Here we use the Pearson correlation coefficient, the most common type.
# Correlation matrix (numeric columns only)
plt.figure(figsize=(10,8))
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
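Spearman's and Kendall's coefficients, mentioned above, follow the same pattern; a quick sketch showing each variable's correlation with 'Survived':

# Rank-based correlations for comparison with Pearson
print(titanic.corr(method='spearman', numeric_only=True)['Survived'])
print(titanic.corr(method='kendall', numeric_only=True)['Survived'])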
Top 15 Case Studies from Kaggle
Here are 15 notable Kaggle datasets for practicing EDA. For each dataset, follow the same steps outlined above:
1. Titanic: Machine Learning from Disaster
Dataset: Titanic: Machine Learning from Disaster
# Load the dataset
titanic = pd.read_csv('titanic.csv')

# Follow the steps as mentioned in the guide above
2. House Prices: Advanced Regression Techniques
Dataset: House Prices: Advanced Regression Techniques
# Load the dataset
house_prices = pd.read_csv('house_prices.csv')

# Basic Information and Cleaning
print(house_prices.info())
print(house_prices.describe())
print(house_prices.isnull().sum())

# Fill missing values (example)
house_prices['LotFrontage'] = house_prices['LotFrontage'].fillna(house_prices['LotFrontage'].median())
3. Iris Species
Dataset: Iris Species
# Load the dataset
iris = pd.read_csv('iris.csv')

# Basic Information
print(iris.info())
print(iris.describe())

# Visualization
sns.pairplot(iris, hue='species')
plt.show()
4. Heart Disease UCI
Dataset: Heart Disease UCI
# Load the dataset
heart = pd.read_csv('heart.csv')

# Basic Information
print(heart.info())
print(heart.describe())

# Visualization
sns.heatmap(heart.corr(), annot=True, cmap='coolwarm')
plt.show()
5. COVID-19 Open Research Dataset Challenge (CORD-19)
Dataset: COVID-19 Open Research Dataset Challenge (CORD-19)
# Due to the complexity and size of this dataset, advanced EDA techniques and NLP are required.
# Refer to Kaggle kernels for detailed analysis:
# https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/notebooks
6. Wine Quality Dataset
Dataset: Wine Quality Dataset
# Load the dataset
wine = pd.read_csv('winequality-red.csv')

# Basic Information
print(wine.info())
print(wine.describe())

# Visualization
sns.heatmap(wine.corr(), annot=True, cmap='coolwarm')
plt.show()
7. IMDB 5000 Movie Dataset
Dataset: IMDB 5000 Movie Dataset
# Load the dataset
imdb = pd.read_csv('movie_metadata.csv')

# Basic Information
print(imdb.info())
print(imdb.describe())

# Visualization
sns.pairplot(imdb[['gross', 'budget', 'imdb_score', 'movie_facebook_likes']])
plt.show()
8. New York City Taxi Fare Prediction
Dataset: New York City Taxi Fare Prediction
# Load the dataset
taxi = pd.read_csv('train.csv')

# Basic Information
print(taxi.info())
print(taxi.describe())

# Visualization
sns.scatterplot(x='pickup_longitude', y='pickup_latitude', data=taxi)
plt.show()
9. Retail Analysis with Walmart Data
Dataset: Retail Analysis with Walmart Data
# Load the dataset
walmart = pd.read_csv('train.csv')

# Basic Information
print(walmart.info())
print(walmart.describe())

# Visualization
sns.lineplot(x='Date', y='Weekly_Sales', data=walmart)
plt.show()
10. Customer Segmentation (Mall Customer Data)
Dataset: Customer Segmentation (Mall Customer Segmentation Data)
# Load the dataset
mall = pd.read_csv('Mall_Customers.csv')

# Basic Information
print(mall.info())
print(mall.describe())

# Visualization
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=mall)
plt.show()
11. Airbnb New User Bookings
Dataset: Airbnb New User Bookings
# Load the dataset
airbnb = pd.read_csv('train_users_2.csv')

# Basic Information
print(airbnb.info())
print(airbnb.describe())

# Visualization
sns.countplot(x='country_destination', data=airbnb)
plt.show()
12. Bike Sharing Demand
Dataset: Bike Sharing Demand
# Load the dataset
bike = pd.read_csv('train.csv')

# Basic Information
print(bike.info())
print(bike.describe())

# Visualization
sns.lineplot(x='datetime', y='count', data=bike)
plt.show()
13. Rossmann Store Sales
Dataset: Rossmann Store Sales
# Load the dataset
rossmann = pd.read_csv('train.csv')

# Basic Information
print(rossmann.info())
print(rossmann.describe())

# Visualization
sns.lineplot(x='Date', y='Sales', data=rossmann)
plt.show()
14. Breast Cancer Wisconsin (Diagnostic) Data Set
Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set
# Load the dataset
cancer = pd.read_csv('data.csv')

# Basic Information
print(cancer.info())
print(cancer.describe())

# Visualization (numeric columns only, since 'diagnosis' is categorical)
sns.heatmap(cancer.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
15. World Happiness Report
Dataset: World Happiness Report
# Load the dataset
happiness = pd.read_csv('2019.csv')

# Basic Information
print(happiness.info())
print(happiness.describe())

# Visualization
sns.scatterplot(x='GDP per capita', y='Score', data=happiness)
plt.show()
Conclusion
EDA is an iterative and detailed process that helps in understanding the dataset better and preparing it for modeling.
The steps and techniques described above can be applied to various datasets, including those available on Kaggle.
Each dataset may require specific handling based on its unique characteristics, but the fundamental principles of EDA remain consistent.