PySpark “when” Function: Comprehensive Guide

In this blog, we explore PySpark's when() function and walk through a series of practical examples that show the different ways it can be used.

Apache Spark, with its Python API PySpark, is a versatile platform for big data processing and analytics.

Among the many functions PySpark offers, the “when” function stands out as a powerful tool for conditional transformations.

In this comprehensive guide, we will delve into the “when” function, explore its capabilities, and provide practical examples to showcase its utility in data transformation and analysis.

Introduction to PySpark’s “when” Function

The “when” function in PySpark is part of the pyspark.sql.functions module. It allows you to apply conditional logic to your DataFrame columns.

This function is incredibly useful for data cleansing, feature engineering, and creating new columns based on conditions.

To get started, let’s set up a PySpark environment:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PySparkWhenExample").getOrCreate()

With our Spark environment ready, let’s dive into the “when” function with a series of practical examples.

Related Article: PySpark Filter: Comprehensive Guide

Creating a New Column with a Condition

In this basic example, we’ll use the “when” function to create a new column, “category,” based on a condition.

from pyspark.sql.functions import when

# Sample data
data = [(1,), (5,), (10,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use the "when" function to create a new column based on a condition
df_with_category = df.withColumn("category", when(df["value"] < 5, "Low").otherwise("High"))
df_with_category.show()

Here, we use the “when” function to categorize values as “Low” or “High” based on a condition.
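
For the sample data above, the output of show() should look like this:

+-----+--------+
|value|category|
+-----+--------+
|    1|     Low|
|    5|    High|
|   10|    High|
+-----+--------+

Note that if you omit the .otherwise() clause, any row that matches none of the conditions gets null in the new column instead of a default value.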

Multiple Conditions with “when” and “otherwise”

You can use the “when” function with multiple conditions and the “otherwise” clause. In this example, we’ll categorize values as “Low,” “Medium,” or “High.”

# Use the "when" function with multiple conditions and "otherwise"
df_with_category = df.withColumn("category",
    when(df["value"] < 5, "Low")
    .when((df["value"] >= 5) & (df["value"] <= 8), "Medium")
    .otherwise("High")
)
df_with_category.show()

In this example, we categorize values into “Low,” “Medium,” or “High” based on multiple conditions.
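
Chained conditions are evaluated from top to bottom, and the first condition that matches wins, so the order of the .when() calls matters whenever conditions overlap. Here is a minimal sketch using the same df (the "bucket" column name and its labels are purely illustrative):

# Overlapping conditions: the first matching branch determines the result
df_buckets = df.withColumn("bucket",
    when(df["value"] < 5, "small")   # 1 -> "small"
    .when(df["value"] < 8, "mid")    # 5 -> "mid" (1 already matched the branch above)
    .otherwise("large")              # 10 -> "large"
)
df_buckets.show()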

Using “when” for String Data

The “when” function is not limited to numerical data. In this example, we’ll categorize names based on their length.

# Sample data
data = [("Alice",), ("Bob",), ("Carol",)]
schema = ["name"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

from pyspark.sql.functions import length

# Use "when" with the length of each name to assign a category
df_with_length_category = df.withColumn("length_category",
    when(length(df["name"]) <= 3, "Short")
    .when(length(df["name"]) <= 5, "Medium")
    .otherwise("Long")
)
df_with_length_category.show()

Here, we categorize names as "Short," "Medium," or "Long" based on the number of characters in each name.

Handling Missing Data

The “when” function can also be used to handle missing or null values. In this example, we replace missing values with a default value.

from pyspark.sql.functions import lit

# Sample data with missing values
data = [(1,), (None,), (3,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" to replace missing values with a default
df_filled = df.withColumn("value", when(df["value"].isNotNull(), df["value"]).otherwise(lit(0)))
df_filled.show()

Here, we replace missing values with a default value of 0.
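
For this particular null-replacement pattern, coalesce() from pyspark.sql.functions is a shorter equivalent (a sketch using the same df):

from pyspark.sql.functions import coalesce, lit

# coalesce returns the first non-null value, so nulls in "value" fall back to 0
df_filled_alt = df.withColumn("value", coalesce(df["value"], lit(0)))
df_filled_alt.show()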

Combining “when” with Mathematical Expressions

You can combine the “when” function with mathematical expressions. In this example, we categorize values as “Even” or “Odd” based on their parity.

from pyspark.sql.functions import col

# Sample data
data = [(2,), (5,), (8,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" and mathematical expressions
df_with_parity = df.withColumn("parity",
    when((col("value") % 2) == 0, "Even")
    .otherwise("Odd")
)
df_with_parity.show()

In this example, we categorize values as “Even” or “Odd” based on their parity.
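
The modulo operator works directly on Column objects, as shown above. If you prefer SQL-style syntax, the same condition can also be expressed with expr() (a sketch using the same df):

from pyspark.sql.functions import expr

# expr() parses a SQL expression and returns a Column usable inside when()
df_with_parity_alt = df.withColumn("parity",
    when(expr("value % 2 = 0"), "Even")
    .otherwise("Odd")
)
df_with_parity_alt.show()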

Handling Multiple Columns

You can reference multiple columns inside a single "when" expression. In this example, we categorize students based on both their score and attendance.

from pyspark.sql.functions import col

# Sample data
data = [("Alice", 85, 90), ("Bob", 60, 75), ("Carol", 95, 80)]
schema = ["name", "score", "attendance"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" to categorize students based on score and attendance
df_with_student_category = df.withColumn("category",
    when((col("score") >= 90) & (col("attendance") >= 90), "Excellent Student")
    .when((col("score") >= 70) & (col("attendance") >= 70), "Good Student")
    .otherwise("Needs Improvement")
)
df_with_student_category.show()

In this example, students are categorized as “Excellent Student,” “Good Student,” or “Needs Improvement” based on both their score and attendance.
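
For the sample data above, the output should look like this. Note that neither Alice (85, 90) nor Carol (95, 80) reaches the "Excellent Student" branch, because that branch requires both score and attendance to be at least 90:

+-----+-----+----------+-----------------+
| name|score|attendance|         category|
+-----+-----+----------+-----------------+
|Alice|   85|        90|     Good Student|
|  Bob|   60|        75|Needs Improvement|
|Carol|   95|        80|     Good Student|
+-----+-----+----------+-----------------+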

Handling Multiple Conditions with “and” and “or”

PySpark Columns do not work with Python's "and" and "or" keywords; instead, you combine conditions with the "&" (and) and "|" (or) operators, wrapping each comparison in parentheses. In this example, we categorize values based on a combination of conditions.

# Sample data
data = [(1,), (5,), (10,)]
schema = ["value"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" with the "&" (and) and "|" (or) operators
df_with_complex_category = df.withColumn("category",
    when((df["value"] < 5) | (df["value"] > 8), "Outlier")
    .when((df["value"] >= 5) & (df["value"] <= 8), "Normal")
    .otherwise("Unknown")
)
df_with_complex_category.show()

In this example, values below 5 or above 8 are flagged as "Outlier," while values from 5 through 8 are labeled "Normal."

Applying “when” with Pattern Matching

The “when” function can be employed for pattern matching. In this example, we categorize products based on their names.

# Sample data
data = [("Laptop",), ("Mobile Phone",), ("Tablet",)]
schema = ["product"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" with pattern matching
df_with_product_type = df.withColumn("product_type",
    when(df["product"].rlike("Laptop"), "Electronics")
    .when(df["product"].rlike("Mobile Phone|Tablet"), "Portable Devices")
    .otherwise("Other")
)
df_with_product_type.show()

Here, we categorize products as “Electronics,” “Portable Devices,” or “Other” based on their names using pattern matching.
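
Since rlike() treats its argument as a regular expression, "Mobile Phone|Tablet" matches either product name. If you only need a literal substring test, Column.contains() is a simpler alternative (a sketch using the same df):

# contains() checks for a literal substring instead of a regex match
df_with_product_type_alt = df.withColumn("product_type",
    when(df["product"].contains("Laptop"), "Electronics")
    .when(df["product"].contains("Phone") | df["product"].contains("Tablet"), "Portable Devices")
    .otherwise("Other")
)
df_with_product_type_alt.show()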

Combining “when” with Date Operations

The “when” function can also work with date-related operations. In this example, we categorize events based on their date.

from pyspark.sql.functions import to_date

# Sample data
data = [("Event 1", "2023-04-15"), ("Event 2", "2023-06-30"), ("Event 3", "2023-08-20")]
schema = ["event", "date"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Convert the date string to a DateType
df = df.withColumn("date", to_date(df["date"]))

# Use "when" with date operations
df_with_event_category = df.withColumn("event_category",
    when(df["date"] < "2023-05-01", "Early Event")
    .when((df["date"] >= "2023-05-01") & (df["date"] < "2023-07-01"), "Mid Event")
    .otherwise("Late Event")
)
df_with_event_category.show()

In this example, we categorize events as “Early Event,” “Mid Event,” or “Late Event” based on their dates.
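
Comparing the DateType column against strings such as "2023-05-01" works here because Spark can cast the ISO-formatted literal to a date. For the sample data above, the output should look like this:

+-------+----------+--------------+
|  event|      date|event_category|
+-------+----------+--------------+
|Event 1|2023-04-15|   Early Event|
|Event 2|2023-06-30|     Mid Event|
|Event 3|2023-08-20|    Late Event|
+-------+----------+--------------+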

Chaining Multiple “when” Conditions

The “when” function allows you to chain multiple conditions, making it highly flexible. In this example, we categorize products based on both their name and price.

# Sample data
data = [("Laptop", 1000), ("Mobile Phone", 500), ("Tablet", 300)]
schema = ["product", "price"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Use "when" to categorize products based on name and price
df_with_product_category = df.withColumn("product_category",
    when(df["product"].rlike("Laptop") & (df["price"] > 800), "High-end Laptop")
    .when(df["product"].rlike("Mobile Phone") & (df["price"] > 400), "High-end Mobile")
    .when(df["product"].rlike("Tablet") & (df["price"] <= 400), "Budget Tablet")
    .otherwise("Other")
)
df_with_product_category.show()

In this example, products are categorized based on their name and price into different categories.
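
With the sample prices above, each product falls into a different branch, and the output should look like this:

+------------+-----+----------------+
|     product|price|product_category|
+------------+-----+----------------+
|      Laptop| 1000| High-end Laptop|
|Mobile Phone|  500| High-end Mobile|
|      Tablet|  300|   Budget Tablet|
+------------+-----+----------------+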

Conclusion

The “when” function in PySpark is a versatile and powerful tool for conditional transformations and data categorization.

These 10 practical examples demonstrate the flexibility and utility of the “when” function for a wide range of data transformation tasks.

With this knowledge, you can effectively harness the power of PySpark’s “when” function in your data processing and analysis projects.

Related Article: PySpark SQL: Ultimate Guide
