In this blog, we will discuss what the PySpark Filter operation is and explore it through different examples of PySpark data filtering.
In the era of big data, filtering and processing vast datasets efficiently is a critical skill for data engineers and data scientists.
Apache Spark, a powerful framework for distributed data processing, offers the PySpark Filter operation as a versatile tool to selectively extract and manipulate data.
In this article, we will explore PySpark Filter, delve into its capabilities, and provide various examples to help you master the art of data filtering with PySpark.
Related Article: PySpark DataFrames: Ultimate Guide
Introduction to PySpark Filter
PySpark Filter is a transformation operation that allows you to select a subset of rows from a DataFrame or Dataset based on specific conditions.
It is a fundamental tool for data preprocessing, cleansing, and analysis.
By applying the PySpark Filter operation, you can focus on the data that meets your criteria, making it easier to derive meaningful insights and perform subsequent analysis.
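As a quick preview before we set anything up, the basic pattern looks like the sketch below; df here is a hypothetical DataFrame with an "age" column, and where() is simply an alias for filter(), so the two calls are interchangeable.

# df is a hypothetical DataFrame with an "age" column
adults_df = df.filter(df["age"] >= 18)

# where() is an alias for filter(), so this is equivalent
adults_df = df.where(df["age"] >= 18)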
Let’s begin by setting up a PySpark environment and dive into various examples of using the Filter operation.
Setting Up the PySpark Environment
Before we start with examples, it’s essential to set up a PySpark environment.
Ensure you have Apache Spark and the pyspark Python package installed. Here’s how to create a basic PySpark session:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PySparkFilterExample").getOrCreate()
With your environment ready, let’s explore various examples of using the PySpark Filter operation.
Different Examples of PySpark Filter
Example 1: Basic Filter
In this example, we’ll filter a DataFrame to select rows where a specific column meets a condition.
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Carol", 35), ("David", 28)]
schema = ["name", "age"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Filter rows where age is greater than 30
filtered_df = df.filter(df["age"] > 30)
filtered_df.show()
In this example, we create a DataFrame df with “name” and “age” columns. We then use the filter operation to select rows where the “age” is greater than 30.
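With the sample data above, only Carol (age 35) satisfies the condition, so the output of show() should look roughly like this:

+-----+---+
| name|age|
+-----+---+
|Carol| 35|
+-----+---+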
Example 2: Combining Filters
You can combine multiple conditions using logical operators to create complex filters.
# Filter rows where age is between 25 and 35
filtered_df = df.filter((df["age"] >= 25) & (df["age"] <= 35))
filtered_df.show()
In this example, we use the logical “and” operator (&) to filter rows where the “age” falls between 25 and 35.
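As a side note, a more concise way to express an inclusive range check is the between() method on a column, or an equivalent SQL expression string passed directly to filter(); a minimal sketch:

# Equivalent range filter using Column.between (inclusive on both ends)
filtered_df = df.filter(df["age"].between(25, 35))

# Or pass a SQL expression string to filter()
filtered_df = df.filter("age >= 25 AND age <= 35")
filtered_df.show()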
Example 3: Filtering Strings
Filtering is not limited to numeric data; you can filter string data as well.
# Filter rows where the name starts with "A"
filtered_df = df.filter(df["name"].startswith("A"))
filtered_df.show()
In this example, we filter rows where the “name” column starts with the letter “A.”
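Other string helpers on columns, such as contains() and endswith(), work the same way; a brief sketch using the same sample data:

# Keep rows where the name contains the substring "li" (e.g. "Alice")
filtered_df = df.filter(df["name"].contains("li"))

# Keep rows where the name ends with "d" (e.g. "David")
filtered_df = df.filter(df["name"].endswith("d"))
filtered_df.show()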
Example 4: Filtering with Functions
You can use functions from the pyspark.sql.functions module to create more complex filters.
from pyspark.sql.functions import col

# Filter rows where name is not equal to "Bob"
filtered_df = df.filter(col("name") != "Bob")
filtered_df.show()
Here, we use the col function to reference the column name and filter rows where the “name” is not equal to “Bob.”
Example 5: Combining Filters with OR
You can use the logical “or” operator (|) to combine multiple filters.
# Filter rows where age is either less than 30 or greater than 35
filtered_df = df.filter((df["age"] < 30) | (df["age"] > 35))
filtered_df.show()
In this example, we use the logical “or” operator (|) to filter rows where the “age” is either less than 30 or greater than 35.
Example 6: Using ISIN
The isin function allows you to filter rows based on a list of values.
# Filter rows where the name is in the list ["Alice", "Carol"]
filtered_df = df.filter(df["name"].isin(["Alice", "Carol"]))
filtered_df.show()
Here, we filter rows where the “name” is either “Alice” or “Carol.”
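Conversely, you can negate isin with the ~ operator to keep every row whose value is not in the list; a minimal sketch:

# Keep rows where the name is NOT in the list ["Alice", "Carol"]
filtered_df = df.filter(~df["name"].isin(["Alice", "Carol"]))
filtered_df.show()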
Example 7: Filtering NULL Values
Filtering NULL values is straightforward with PySpark.
# Filter rows where age is not NULL
filtered_df = df.filter(df["age"].isNotNull())
filtered_df.show()
In this example, we filter rows where the “age” is not NULL, effectively removing rows with missing age values.
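Because the sample data above has no missing values, here is a small sketch with a hypothetical DataFrame that does contain a NULL age, so the effect of isNotNull() is visible:

# Hypothetical data containing a missing age value
data_with_nulls = [("Alice", 25), ("Eve", None)]
df_nulls = spark.createDataFrame(data_with_nulls, schema=["name", "age"])

# Only the "Alice" row survives; the row with a NULL age is dropped
df_nulls.filter(df_nulls["age"].isNotNull()).show()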
Example 8: Complex Filtering
You can combine multiple conditions, use functions, and work with multiple columns to create complex filters.
# Filter rows where age is greater than 30 and name starts with "C"
filtered_df = df.filter((df["age"] > 30) & (df["name"].startswith("C")))
filtered_df.show()
In this example, we create a complex filter to select rows where the “age” is greater than 30 and the “name” starts with “C.”
PySpark Filter on DataFrame
Here are more examples of DataFrame filtering in PySpark, showcasing a variety of scenarios and conditions:
Example 1: Filtering with Multiple Conditions
You can filter a DataFrame based on multiple conditions using logical operators. For example, you can filter for rows where both age is greater than 30 and the name starts with “C.”
from pyspark.sql.functions import col

# Filter rows where age > 30 and name starts with "C"
filtered_df = df.filter((col("age") > 30) & (col("name").startswith("C")))
filtered_df.show()
Example 2: Filtering with LIKE
Filtering with the like function is useful for finding rows with text that partially matches a pattern. In this example, we filter for rows where the “name” contains the letter “o.”
# Filter rows where name contains the letter "o"
filtered_df = df.filter(col("name").like("%o%"))
filtered_df.show()
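Note that like() is case-sensitive; one simple way to get a case-insensitive match is to lowercase the column first with the lower() function; a minimal sketch:

from pyspark.sql.functions import lower

# Case-insensitive check for names containing "o"
filtered_df = df.filter(lower(col("name")).like("%o%"))
filtered_df.show()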
Example 3: Filtering with IN
You can use the isin function to filter rows where a column’s value matches any value in a given list. For example, filter rows where the “age” is either 25 or 35.
# Filter rows where age is either 25 or 35
filtered_df = df.filter(col("age").isin([25, 35]))
filtered_df.show()
Example 4: Filtering with NOT
You can use the ~ operator to negate a condition and filter rows where a condition is not met. In this example, filter rows where the “name” is not “Bob.”
# Filter rows where name is not "Bob"
filtered_df = df.filter(~(col("name") == "Bob"))
filtered_df.show()
Example 5: Filtering with Regular Expressions
PySpark allows you to use regular expressions for advanced filtering. In this example, filter rows where the “name” starts with a vowel (A, E, I, O, or U).
# Filter rows where name starts with a vowel
filtered_df = df.filter(col("name").rlike("^[AEIOU]"))
filtered_df.show()
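rlike() uses Java regular expressions, so inline flags work as well; for example, the (?i) flag makes the match case-insensitive:

# Case-insensitive: names starting with any vowel, upper or lower case
filtered_df = df.filter(col("name").rlike("(?i)^[aeiou]"))
filtered_df.show()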
Example 6: Filtering with a Custom Function
You can define a custom Python function and use it for filtering. In this example, create a function that filters rows based on a specific condition, such as even ages.
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Define a custom filtering function
def is_even_age(age):
    return age % 2 == 0

# Wrap the function in a UDF and use it for filtering
udf_is_even_age = udf(is_even_age, BooleanType())
filtered_df = df.filter(udf_is_even_age(col("age")))
filtered_df.show()
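Keep in mind that Python UDFs carry serialization overhead; for a simple condition like this, the same filter can usually be expressed with built-in column arithmetic, which Spark can optimize; a minimal sketch:

# Equivalent filter using a built-in column expression (no Python UDF)
filtered_df = df.filter((col("age") % 2) == 0)
filtered_df.show()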
These additional examples demonstrate the versatility of PySpark’s DataFrame filtering capabilities, allowing you to handle a wide range of filtering requirements when working with large datasets and complex conditions.
Related Article: Top 50 PySpark Interview Questions and Answers
Conclusion
The PySpark Filter operation is a powerful tool for data manipulation, allowing you to extract specific subsets of data based on conditions and criteria.
By applying filters, you can streamline your data preprocessing and analysis, enabling you to focus on the data that matters most for your tasks.
In this comprehensive guide, we’ve covered various examples of using PySpark Filter, from basic filters to complex filtering scenarios.
Armed with this knowledge, you’ll be better equipped to handle real-world data filtering challenges and extract valuable insights from your datasets.
PySpark’s filtering capabilities empower data professionals to perform efficient data preprocessing and make more informed decisions in the age of big data.
Related Article: PySpark Join: Comprehensive Guide