In this blog, we are going to explore PySpark's when() function, along with a range of practical examples that show how it can be used.
Apache Spark, with its Python API PySpark, is a versatile platform for big data processing and analytics.
Among the many functions PySpark offers, the “when” function stands out as a powerful tool for conditional transformations.
In this comprehensive guide, we will delve into the “when” function, explore its capabilities, and provide practical examples to showcase its utility in data transformation and analysis.
Introduction to PySpark’s “when” Function
The “when” function in PySpark is part of the pyspark.sql.functions
module. It allows you to apply conditional logic to your DataFrame columns.
This function is incredibly useful for data cleansing, feature engineering, and creating new columns based on conditions.
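At its core, when() takes a boolean condition and a value, and can be chained with further when() calls and an optional otherwise() fallback. Here is a minimal sketch of that general pattern (the column name some_column is just a placeholder):

from pyspark.sql.functions import when, col

# General pattern: one or more conditions, then an optional fallback.
# The result is a Column expression you can pass to withColumn() or select().
new_col = when(col("some_column") > 0, "positive").otherwise("non-positive")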
To get started, let’s set up a PySpark environment:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PySparkWhenExample").getOrCreate()
With our Spark environment ready, let’s dive into the “when” function with a series of practical examples.
Related Article: PySpark Filter: Comprehensive Guide
Creating a New Column with a Condition
In this basic example, we’ll use the “when” function to create a new column, “category,” based on a condition.
from pyspark.sql.functions import when

# Sample data
data = [(1,), (5,), (10,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use the "when" function to create a new column based on a condition
df_with_category = df.withColumn("category", when(df["value"] < 5, "Low").otherwise("High"))

df_with_category.show()
Here, we use the “when” function to categorize values as “Low” or “High” based on a condition.
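For the sample values above, the resulting DataFrame should look like this:

+-----+--------+
|value|category|
+-----+--------+
|    1|     Low|
|    5|    High|
|   10|    High|
+-----+--------+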
Multiple Conditions with “when” and “otherwise”
You can use the “when” function with multiple conditions and the “otherwise” clause. In this example, we’ll categorize values as “Low,” “Medium,” or “High.”
# Use the "when" function with multiple conditions and "otherwise" df_with_category = df.withColumn("category", when(df["value"] < 5, "Low") .when((df["value"] >= 5) & (df["value"] <= 8), "Medium") .otherwise("High") ) df_with_category.show()
In this example, we categorize values into “Low,” “Medium,” or “High” based on multiple conditions.
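Note that otherwise() is optional. If it is omitted and none of the conditions match, the new column simply contains null for those rows, as in this small sketch:

# Without otherwise(), rows that match no condition get null
df_partial = df.withColumn("category", when(df["value"] < 5, "Low"))
df_partial.show()  # values >= 5 appear as null in "category"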
Using “when” for String Data
The “when” function is not limited to numerical data. In this example, we’ll categorize names based on the letter they start with.
# Sample data data = [("Alice",), ("Bob",), ("Carol",)] schema = ["name"] # Create a DataFrame df = spark.createDataFrame(data, schema=schema) # Use "when" for string data df_with_length_category = df.withColumn("name_length", when(df["name"].rlike("^[A-M]"), "Short") .when(df["name"].rlike("^[N-Z]"), "Long") .otherwise("Medium") ) df_with_length_category.show()
Here, we categorize names as “Short,” “Medium,” or “Long” based on their first letter.
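If you want to bucket names by their actual character count instead of their first letter, the built-in length() function combines naturally with when(). A quick sketch (the thresholds and column name are just illustrative):

from pyspark.sql.functions import length, when

# Categorize by the number of characters in the name
df_by_length = df.withColumn("length_category",
    when(length(df["name"]) <= 3, "Short")
    .when(length(df["name"]) <= 5, "Medium")
    .otherwise("Long")
)

df_by_length.show()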
Handling Missing Data
The “when” function can also be used to handle missing or null values. In this example, we replace missing values with a default value.
from pyspark.sql.functions import lit

# Sample data with missing values
data = [(1,), (None,), (3,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" to replace missing values with a default
df_filled = df.withColumn("value", when(df["value"].isNotNull(), df["value"]).otherwise(lit(0)))

df_filled.show()
Here, we replace missing values with a default value of 0.
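For straightforward null replacement like this, coalesce() or DataFrame.fillna() achieve the same result with less code; a brief sketch of both alternatives:

from pyspark.sql.functions import coalesce, lit

# Equivalent null handling without "when"
df_coalesced = df.withColumn("value", coalesce(df["value"], lit(0)))
df_filled_alt = df.fillna({"value": 0})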
Combining “when” with Mathematical Expressions
You can combine the “when” function with mathematical expressions. In this example, we categorize values as “Even” or “Odd” based on their parity.
from pyspark.sql.functions import col

# Sample data
data = [(2,), (5,), (8,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" with a mathematical expression
df_with_parity = df.withColumn("parity",
    when((col("value") % 2) == 0, "Even")
    .otherwise("Odd")
)

df_with_parity.show()
In this example, we categorize values as “Even” or “Odd” based on their parity.
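The same logic can also be expressed as a SQL-style CASE WHEN via expr(), which some readers find easier to scan; a sketch:

from pyspark.sql.functions import expr

# SQL CASE WHEN equivalent of the when()/otherwise() chain above
df_with_parity_sql = df.withColumn(
    "parity",
    expr("CASE WHEN value % 2 = 0 THEN 'Even' ELSE 'Odd' END")
)

df_with_parity_sql.show()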
Handling Multiple Columns
You can apply the “when” function to multiple columns simultaneously. In this example, we categorize students based on both their score and attendance.
from pyspark.sql.functions import col

# Sample data
data = [("Alice", 85, 90), ("Bob", 60, 75), ("Carol", 95, 80)]
schema = ["name", "score", "attendance"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" to categorize students based on score and attendance
df_with_student_category = df.withColumn("category",
    when((col("score") >= 90) & (col("attendance") >= 90), "Excellent Student")
    .when((col("score") >= 70) & (col("attendance") >= 70), "Good Student")
    .otherwise("Needs Improvement")
)

df_with_student_category.show()
In this example, students are categorized as “Excellent Student,” “Good Student,” or “Needs Improvement” based on both their score and attendance.
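Keep in mind that chained conditions are evaluated in order and the first match wins, so the stricter check should come first. As an illustration, putting the broader condition first means no row can ever be labeled "Excellent Student":

# Broader condition first: every qualifying row matches it before the stricter check
df_wrong_order = df.withColumn("category",
    when((col("score") >= 70) & (col("attendance") >= 70), "Good Student")
    .when((col("score") >= 90) & (col("attendance") >= 90), "Excellent Student")
    .otherwise("Needs Improvement")
)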
Handling Multiple Conditions with “&” and “|”
You can combine conditions in the “when” function with the “&” (and) and “|” (or) operators to handle more complex logic. In this example, we categorize values based on a combination of conditions.
# Sample data
data = [(1,), (5,), (10,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" with "&" (and) and "|" (or) operators
df_with_complex_category = df.withColumn("category",
    when((df["value"] < 5) | (df["value"] > 8), "Outlier")
    .when((df["value"] >= 5) & (df["value"] <= 8), "Normal")
    .otherwise("Other")
)

df_with_complex_category.show()

In this example, values below 5 or above 8 are flagged as “Outlier,” while values within the 5–8 range are labeled “Normal.”
Applying “when” with Pattern Matching
The “when” function can be employed for pattern matching. In this example, we categorize products based on their names.
# Sample data data = [("Laptop",), ("Mobile Phone",), ("Tablet",)] schema = ["product"] # Create a DataFrame df = spark.createDataFrame(data, schema=schema) # Use "when" with pattern matching df_with_product_type = df.withColumn("product_type", when(df["product"].rlike("Laptop"), "Electronics") .when(df["product"].rlike("Mobile Phone|Tablet"), "Portable Devices") .otherwise("Other") ) df_with_product_type.show()
Here, we categorize products as “Electronics,” “Portable Devices,” or “Other” based on their names using pattern matching.
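Because rlike() treats its argument as a regular expression, simple substring checks can also be written with Column.contains(), which matches a literal substring; a sketch:

# contains() checks for a literal substring, no regex needed
df_with_product_type_alt = df.withColumn("product_type",
    when(df["product"].contains("Laptop"), "Electronics")
    .when(df["product"].contains("Phone") | df["product"].contains("Tablet"), "Portable Devices")
    .otherwise("Other")
)

df_with_product_type_alt.show()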
Combining “when” with Date Operations
The “when” function can also work with date-related operations. In this example, we categorize events based on their date.
from pyspark.sql.functions import to_date

# Sample data
data = [("Event 1", "2023-04-15"), ("Event 2", "2023-06-30"), ("Event 3", "2023-08-20")]
schema = ["event", "date"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Convert the date string to a DateType
df = df.withColumn("date", to_date(df["date"]))

# Use "when" with date comparisons
df_with_event_category = df.withColumn("event_category",
    when(df["date"] < "2023-05-01", "Early Event")
    .when((df["date"] >= "2023-05-01") & (df["date"] < "2023-07-01"), "Mid Event")
    .otherwise("Late Event")
)

df_with_event_category.show()
In this example, we categorize events as “Early Event,” “Mid Event,” or “Late Event” based on their dates.
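Comparing a DateType column to a plain string such as "2023-05-01" relies on Spark's implicit casting; if you prefer to make the comparison explicit, you can wrap the literal in to_date(lit(...)), as in this sketch:

from pyspark.sql.functions import lit, to_date, when

# Explicit date literals instead of relying on string-to-date casting
df_with_event_category_explicit = df.withColumn("event_category",
    when(df["date"] < to_date(lit("2023-05-01")), "Early Event")
    .when(df["date"] < to_date(lit("2023-07-01")), "Mid Event")
    .otherwise("Late Event")
)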
Chaining Multiple “when” Conditions
The “when” function allows you to chain multiple conditions, making it highly flexible. In this example, we categorize products based on both their name and price.
# Sample data data = [("Laptop", 1000), ("Mobile Phone", 500), ("Tablet", 300)] schema = ["product", "price"] # Create a DataFrame df = spark.createDataFrame(data, schema=schema) # Use "when" to categorize products based on name and price df_with_product_category = df.withColumn("product_category", when(df["product"].rlike("Laptop") & (df["price"] > 800), "High-end Laptop") .when(df["product"].rlike("Mobile Phone") & (df["price"] > 400), "High-end Mobile") .when(df["product"].rlike("Tablet") & (df["price"] <= 400), "Budget Tablet") .otherwise("Other") ) df_with_product_category.show()
In this example, products are categorized based on their name and price into different categories.
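Since a when() chain is just a Column expression, you can also build it once, assign it to a variable, and reuse it wherever it is needed; using col() instead of df[...] keeps the expression independent of any particular DataFrame. A sketch:

from pyspark.sql.functions import col, when

# Build the expression once and reuse it
product_category_expr = (
    when(col("product").rlike("Laptop") & (col("price") > 800), "High-end Laptop")
    .when(col("product").rlike("Mobile Phone") & (col("price") > 400), "High-end Mobile")
    .otherwise("Other")
)

df_reused = df.withColumn("product_category", product_category_expr)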
Conclusion
The “when” function in PySpark is a versatile and powerful tool for conditional transformations and data categorization.
These 10 practical examples demonstrate the flexibility and utility of the “when” function for a wide range of data transformation tasks.
With this knowledge, you can effectively harness the power of PySpark’s “when” function in your data processing and analysis projects.
Related Article: PySpark SQL: Ultimate Guide