PySpark Filter: Comprehensive Guide

In this blog, we will discuss what the PySpark filter operation is and explore it through different examples of filtering data with PySpark.

In the era of big data, filtering and processing vast datasets efficiently is a critical skill for data engineers and data scientists.

Apache Spark, a powerful framework for distributed data processing, offers the PySpark Filter operation as a versatile tool to selectively extract and manipulate data.

In this article, we will explore PySpark Filter, delve into its capabilities, and provide various examples to help you master the art of data filtering with PySpark.

Related Article: PySpark DataFrames: Ultimate Guide

Introduction to PySpark Filter

PySpark Filter is a transformation operation that allows you to select a subset of rows from a DataFrame based on specific conditions.

It is a fundamental tool for data preprocessing, cleansing, and analysis.

By applying the PySpark Filter operation, you can focus on the data that meets your criteria, making it easier to derive meaningful insights and perform subsequent analysis.

Let’s begin by setting up a PySpark environment and dive into various examples of using the Filter operation.

Setting Up the PySpark Environment

Before we start with examples, it’s essential to set up a PySpark environment.

Ensure you have Apache Spark installed and the pyspark Python package.
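If the package isn’t installed yet, it can typically be added with pip, which also bundles a local Spark runtime that is sufficient for the examples in this guide:

pip install pyspark

With the package in place, here’s how to create a basic PySpark session: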

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PySparkFilterExample").getOrCreate()

With your environment ready, let’s explore various examples of using the PySpark Filter operation.

Different Examples of PySpark Filter

Example 1: Basic Filter

In this example, we’ll filter a DataFrame to select rows where a specific column meets a condition.

# Sample data
data = [("Alice", 25), ("Bob", 30), ("Carol", 35), ("David", 28)]
schema = ["name", "age"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Filter rows where age is greater than 30
filtered_df = df.filter(df["age"] > 30)
filtered_df.show()

In this example, we create a DataFrame df with “name” and “age” columns. We then use the filter operation to select rows where the “age” is greater than 30.
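With the sample data, only Carol (35) satisfies the condition, so show() should print something like:

+-----+---+
| name|age|
+-----+---+
|Carol| 35|
+-----+---+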

Example 2: Combining Filters

You can combine multiple conditions using logical operators to create complex filters.

# Filter rows where age is between 25 and 35
filtered_df = df.filter((df["age"] >= 25) & (df["age"] <= 35))
filtered_df.show()

In this example, we use the logical “and” operator (&) to filter rows where the “age” falls between 25 and 35. Note that each condition must be wrapped in parentheses, because & binds more tightly than the comparison operators in Python.

Example 3: Filtering Strings

Filtering is not limited to numeric data; you can filter string data as well.

# Filter rows where the name starts with "A"
filtered_df = df.filter(df["name"].startswith("A"))
filtered_df.show()

In this example, we filter rows where the “name” column starts with the letter “A.”

Example 4: Filtering with Functions

You can use functions from the pyspark.sql.functions module to create more complex filters.

from pyspark.sql.functions import col

# Filter rows where name is not equal to "Bob"
filtered_df = df.filter(col("name") != "Bob")
filtered_df.show()

Here, we use the col function to reference the “name” column and filter rows where it is not equal to “Bob.”

Example 5: Combining Filters with OR

You can use the logical “or” operator (|) to combine multiple filters.

# Filter rows where age is either less than 30 or greater than 35
filtered_df = df.filter((df["age"] < 30) | (df["age"] > 35))
filtered_df.show()

In this example, we use the logical “or” operator (|) to filter rows where the “age” is either less than 30 or greater than 35.
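With the sample data, Bob (30) and Carol (35) satisfy neither condition, so the output should look something like:

+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|David| 28|
+-----+---+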

Example 6: Using ISIN

The isin function allows you to filter rows based on a list of values.

# Filter rows where the name is in the list ["Alice", "Carol"]
filtered_df = df.filter(df["name"].isin(["Alice", "Carol"]))
filtered_df.show()

Here, we filter rows where the “name” is either “Alice” or “Carol.”

Example 7: Filtering NULL Values

Filtering NULL values is straightforward with PySpark.

# Filter rows where age is not NULL
filtered_df = df.filter(df["age"].isNotNull())
filtered_df.show()

In this example, we keep only the rows where “age” is not NULL. The sample DataFrame used here has no missing ages, so every row is returned; on real data, this removes rows with missing age values.
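As a minimal sketch, here is a small, hypothetical DataFrame (df_nulls, introduced only for this illustration) that does contain a missing age, so the effect of isNotNull is visible:

# A tiny DataFrame with a missing age value (illustration only)
data_with_nulls = [("Alice", 25), ("Eve", None)]
df_nulls = spark.createDataFrame(data_with_nulls, schema=["name", "age"])

# Keep only rows where age is not NULL; the "Eve" row is dropped
df_nulls.filter(df_nulls["age"].isNotNull()).show()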

Example 8: Complex Filtering

You can combine multiple conditions, use functions, and work with multiple columns to create complex filters.

# Filter rows where age is greater than 30 and name starts with "C"
filtered_df = df.filter((df["age"] > 30) & (df["name"].startswith("C")))
filtered_df.show()

In this example, we create a complex filter to select rows where the “age” is greater than 30 and the “name” starts with “C.”

PySpark Filter on DataFrame

Here are more examples of DataFrame filtering in PySpark, showcasing a variety of scenarios and conditions:

Example 1: Filtering with Multiple Conditions

You can filter a DataFrame based on multiple conditions using logical operators. For example, you can filter for rows where the age is greater than 30 and the name starts with “C,” this time using col() references.

from pyspark.sql.functions import col

# Filter rows where age > 30 and name starts with "C"
filtered_df = df.filter((col("age") > 30) & (col("name").startswith("C")))
filtered_df.show()

Example 2: Filtering with LIKE

Filtering with the like function is useful for finding rows whose text partially matches a pattern; the % wildcard matches any sequence of characters, as in SQL LIKE. In this example, we filter for rows where the “name” contains the letter “o.”

# Filter rows where name contains the letter "o"
filtered_df = df.filter(col("name").like("%o%"))
filtered_df.show()
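
For a simple substring check like this one, the Column.contains method is an equivalent and slightly more readable alternative:

# Equivalent substring filter using contains
filtered_df = df.filter(col("name").contains("o"))
filtered_df.show()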

Example 3: Filtering with IN

You can use the isin function to filter rows where a column’s value matches any value in a given list. For example, we filter rows where the “age” is either 25 or 35.

# Filter rows where age is either 25 or 35
filtered_df = df.filter(col("age").isin([25, 35]))
filtered_df.show()

Example 4: Filtering with NOT

You can use the ~ operator to negate a condition and filter rows where that condition is not met. In this example, we filter rows where the “name” is not “Bob.”

# Filter rows where name is not "Bob"
filtered_df = df.filter(~(col("name") == "Bob"))
filtered_df.show()

Example 5: Filtering with Regular Expressions

PySpark allows you to use regular expressions for advanced filtering via rlike. In this example, we filter rows where the “name” starts with a vowel (A, E, I, O, or U).

# Filter rows where name starts with a vowel
filtered_df = df.filter(col("name").rlike("^[AEIOU]"))
filtered_df.show()
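
Because rlike uses Java regular expression syntax, you can also make the match case-insensitive with an inline flag, so that names starting with a lowercase vowel are matched as well:

# Case-insensitive version using an inline (?i) flag
filtered_df = df.filter(col("name").rlike("(?i)^[aeiou]"))
filtered_df.show()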

Example 6: Filtering with a Custom Function

You can define a custom Python function, wrap it in a user-defined function (UDF), and use it for filtering. In this example, we create a function that keeps rows matching a specific condition, such as even ages.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Define a custom filtering function
def is_even_age(age):
    return age % 2 == 0

# Wrap the function in a UDF and use it for filtering
udf_is_even_age = udf(is_even_age, BooleanType())
filtered_df = df.filter(udf_is_even_age(col("age")))
filtered_df.show()
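
Keep in mind that Python UDFs add serialization overhead between the JVM and the Python interpreter, so for a simple condition like this a native column expression is usually faster and gives the same result:

# Same even-age filter expressed with a native column expression
filtered_df = df.filter(col("age") % 2 == 0)
filtered_df.show()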

These additional examples demonstrate the versatility of PySpark’s DataFrame filtering capabilities, allowing you to handle a wide range of filtering requirements when working with large datasets and complex conditions.

Related Article: Top 50 PySpark Interview Questions and Answers

Conclusion

The PySpark Filter operation is a powerful tool for data manipulation, allowing you to extract specific subsets of data based on conditions and criteria.

By applying filters, you can streamline your data preprocessing and analysis, enabling you to focus on the data that matters most for your tasks.

In this comprehensive guide, we’ve covered various examples of using PySpark Filter, from basic filters to complex filtering scenarios.

Armed with this knowledge, you’ll be better equipped to handle real-world data filtering challenges and extract valuable insights from your datasets.

PySpark’s filtering capabilities empower data professionals to perform efficient data preprocessing and make more informed decisions in the age of big data.

Related Article: PySpark Join: Comprehensive Guide
