Data wrangling (also called data munging) is straightforward in R, and the dplyr package provides a powerful set of tools for manipulating and subsetting data.
The dplyr package implements a family of functions such as select(), filter(), mutate(), group_by(), and summarise() for common, recurring data manipulation tasks.
In this tutorial, we explore the dplyr functions by applying them to hospital data in R.
For this analysis, you need the raw hospital data file, which you can download from this link.
Install the package
install.packages("dplyr")
Load the library
library("dplyr")
Read the data frame
You can read the data with a function that matches the file format. Here we read a CSV file using the read.csv() function.
df <- read.csv("READMISSION REDUCTION.csv")
The summary() function gives brief summary statistics for each column of a data set. Here we apply it to the healthcare data.
summary(df)
The names() function returns all the column names of the data frame, which we display using the print() function.
print(names(df))
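The hospital file marks missing values with the text "Not Available", which makes read.csv() treat those columns as text. The sketch below shows one way to handle this at read time with the na.strings argument of read.csv(). The inline CSV and its values are made up for illustration and only stand in for the real file.

```r
# Hypothetical inline CSV standing in for the real file; all values are made up.
csv_text <- "Hospital.Name,State,Number.of.Discharges
HOSPITAL A,AL,410
HOSPITAL B,IL,Not Available
HOSPITAL C,IL,125"

# na.strings lists the text values that read.csv() should treat as missing (NA),
# so the column is parsed as a number instead of text.
toy <- read.csv(text = csv_text, na.strings = "Not Available")

str(toy)                              # column types and first values
sum(is.na(toy$Number.of.Discharges))  # number of missing entries
```

With the column already numeric, later comparisons and sorts behave numerically without an extra conversion step.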
select() Function in dplyr
The select() function in dplyr picks specific columns from a data frame, which makes it the main tool for subsetting data by column.
sample <- select(df, Hospital.Name, State, Provider.Number, Number.of.Discharges, Measure.Name)
print(sample)
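To see the behaviour without the full file, here is a minimal sketch on a small made-up data frame. The name toy and all of its values are hypothetical and only mimic a few columns of the hospital data.

```r
library(dplyr)

# Made-up data frame that mimics a few columns of the hospital data.
toy <- data.frame(
  Hospital.Name        = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  State                = c("AL", "IL", "IL"),
  Provider.Number      = c(10001, 10005, 14002),
  Number.of.Discharges = c(410, 125, 380),
  stringsAsFactors     = FALSE
)

# Keep only the two named columns, in the order they are listed.
picked <- select(toy, State, Hospital.Name)
names(picked)
```

Note that select() returns the columns in the order you name them, not in their original order.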
Subsetting or selecting data using the colon (:)
We can select a range of columns using the colon (:) in two ways: by column index or by column name.
Select the columns by index, using ":" to set the range. Here we take columns 1 to 3.
sample1 <- select(df,1:3)
print(sample1)
This creates a subset of the data from column Hospital.Name through Measure.Name using the colon (:), and prints the first five rows.
sample2 <- select(df, Hospital.Name:Measure.Name)
print(head(sample2,5))
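Both forms of range selection can be compared on a small made-up data frame. The name toy and its values are hypothetical.

```r
library(dplyr)

# Made-up data frame; the column order matters for range selection.
toy <- data.frame(
  Hospital.Name        = c("HOSPITAL A", "HOSPITAL B"),
  State                = c("AL", "IL"),
  Provider.Number      = c(10001, 10005),
  Number.of.Discharges = c(410, 125),
  stringsAsFactors     = FALSE
)

# Range by position: the first two columns.
by_index <- select(toy, 1:2)

# Range by name: every column from Hospital.Name through Provider.Number.
by_name <- select(toy, Hospital.Name:Provider.Number)

names(by_index)
names(by_name)
```

Selecting by name is usually more robust, because it keeps working if columns are added at the front of the file.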
starts_with() & ends_with() Functions
The starts_with() function in dplyr selects only the columns in df whose names start with the given string.
The ends_with() function selects only the columns in df whose names end with the given string. By default, both helpers ignore case.
sample3 <- select(df, starts_with("H"), ends_with("S"))
print(sample3)
Select all columns except Hospital.Name and Provider.Number
sample4 <- select(df,-c(Hospital.Name,Provider.Number))
print(sample4)
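The helpers and the minus sign can be tried on a small made-up data frame. The name toy and its values are hypothetical.

```r
library(dplyr)

# Made-up data frame that mimics a few columns of the hospital data.
toy <- data.frame(
  Hospital.Name        = c("HOSPITAL A", "HOSPITAL B"),
  State                = c("AL", "IL"),
  Provider.Number      = c(10001, 10005),
  Number.of.Discharges = c(410, 125),
  stringsAsFactors     = FALSE
)

# Columns whose names start with "H" or end with "s" (matching ignores case by default).
helpers <- select(toy, starts_with("H"), ends_with("s"))

# A minus sign drops the listed columns and keeps everything else.
dropped <- select(toy, -c(Hospital.Name, Provider.Number))

names(helpers)
names(dropped)
```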
filter() Function in dplyr
The filter() function in dplyr keeps only the rows that satisfy one or more conditions.
Here we return the subset where Number.of.Discharges == "Not Available" and, separately, the subset where Number.of.Readmissions == "Not Available".
filter_sample1 <- filter(df, Number.of.Discharges == "Not Available")
print(filter_sample1)
filter_sample2 <- filter(df,Number.of.Readmissions == "Not Available")
filter_sample2
Filter on numeric data, and use the class() function to check the data type of a column.
## Number of discharges is >= 350
# Convert the columns to numeric before comparing; "Not Available" becomes NA (with a warning)
df_num <- df
df_num$Number.of.Discharges <- as.numeric(as.character(df_num$Number.of.Discharges))
df_num$Number.of.Readmissions <- as.numeric(as.character(df_num$Number.of.Readmissions))
filter_sample3 <- filter(df_num, Number.of.Discharges >= 350)
# Get the class of the converted column
class(filter_sample3$Number.of.Discharges)
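The order of the two steps matters: a column that contains text such as "Not Available" is not numeric, so it should be converted before a numeric comparison is applied. A minimal sketch on made-up data (the name toy and its values are hypothetical):

```r
library(dplyr)

# Made-up data; the discharges column is text because of "Not Available".
toy <- data.frame(
  Hospital.Name        = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  Number.of.Discharges = c("410", "Not Available", "90", "1200"),
  stringsAsFactors     = FALSE
)

# Entries that are not numbers become NA; suppressWarnings() hides the
# expected "NAs introduced by coercion" warning.
toy$Number.of.Discharges <- suppressWarnings(as.numeric(toy$Number.of.Discharges))

# filter() drops rows where the condition evaluates to NA, so HOSPITAL B is excluded.
big <- filter(toy, Number.of.Discharges >= 350)
big$Hospital.Name
```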
Multiple Conditions in filter()
Here we combine two filter conditions, Number.of.Discharges >= 350 or Number.of.Readmissions >= 100, using the OR operator (|).
# Convert inside filter() so that both comparisons are numeric
filter_sample4 <- filter(df, as.numeric(as.character(Number.of.Discharges)) >= 350 | as.numeric(as.character(Number.of.Readmissions)) >= 100)
filter_sample4$Number.of.Readmissions <- as.numeric(as.character(filter_sample4$Number.of.Readmissions))
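The difference between OR (|) and AND (&) is easiest to see side by side. This is a small sketch on made-up numeric data (the name toy and its values are hypothetical):

```r
library(dplyr)

# Made-up data with numeric columns.
toy <- data.frame(
  Hospital.Name          = c("A", "B", "C", "D"),
  Number.of.Discharges   = c(410, 120, 380, 90),
  Number.of.Readmissions = c(60, 130, 150, 20),
  stringsAsFactors       = FALSE
)

# OR: a row is kept if at least one condition is TRUE.
either <- filter(toy, Number.of.Discharges >= 350 | Number.of.Readmissions >= 100)

# AND: a row is kept only if both conditions are TRUE.
both <- filter(toy, Number.of.Discharges >= 350 & Number.of.Readmissions >= 100)

either$Hospital.Name
both$Hospital.Name
```

Separating conditions with a comma inside filter() has the same effect as &.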
Filter rows with the %in% operator and c() (the combine function)
Here we keep the rows for "SOUTHEAST ALABAMA MEDICAL CENTER" and "MARSHALL MEDICAL CENTER SOUTH" by passing both names to %in% as a vector built with c().
filter_sample5 <- filter(df, Hospital.Name %in% c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH"))
print(filter_sample5)
## Convert the column to numeric using the as.numeric() function
filter_sample5$Number.of.Discharges = as.numeric(as.character(filter_sample5$Number.of.Discharges))
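The %in% operator tests each value against a set, and it can be negated to exclude the set instead. A minimal sketch on made-up data (toy, wanted, and the hospital names are hypothetical):

```r
library(dplyr)

# Made-up data.
toy <- data.frame(
  Hospital.Name        = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  Number.of.Discharges = c(410, 125, 380),
  stringsAsFactors     = FALSE
)

# The set of names to match.
wanted <- c("HOSPITAL A", "HOSPITAL C")

# Keep rows whose name is in the set.
kept <- filter(toy, Hospital.Name %in% wanted)

# Negate with ! to keep rows whose name is NOT in the set.
excluded <- filter(toy, !Hospital.Name %in% wanted)

kept$Hospital.Name
excluded$Hospital.Name
```

The match must be exact, so a single mistyped character in a name returns no rows for that hospital.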
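Pipe (%>%) operator in dplyr
We can chain multiple functions such as select(), filter(), and arrange() into a single expression using the pipe (%>%) operator, which passes the result on its left as the first argument of the function on its right.
# Example 1: select columns, keep the Illinois rows, and sort by discharges (largest first)
pipeline_data1 <- select(df, Hospital.Name, Provider.Number, State, Number.of.Discharges) %>%
  filter(State == 'IL') %>%
  arrange(desc(as.numeric(as.character(Number.of.Discharges))))
# Example 2: the same filter and sort, keeping all columns
pipeline_data2 <- df %>%
  filter(State == 'IL') %>%
  arrange(desc(as.numeric(as.character(Number.of.Discharges))))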
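A piped chain is just another way of writing nested function calls. The sketch below writes the same steps both ways on made-up data (the name toy and its values are hypothetical) and checks that the results agree.

```r
library(dplyr)

# Made-up data.
toy <- data.frame(
  Hospital.Name        = c("A", "B", "C", "D"),
  State                = c("IL", "AL", "IL", "IL"),
  Number.of.Discharges = c(125, 410, 380, 90),
  stringsAsFactors     = FALSE
)

# Nested form: read from the inside out.
nested <- arrange(filter(toy, State == "IL"), desc(Number.of.Discharges))

# Piped form: read from top to bottom.
piped <- toy %>%
  filter(State == "IL") %>%
  arrange(desc(Number.of.Discharges))

all.equal(nested, piped)
piped$Hospital.Name
```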
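arrange() function in dplyr
The arrange() function in dplyr sorts the rows of a data frame by one or more columns. It sorts in ascending order by default; wrap a column in desc() to sort in descending order.
# Sort by number of discharges, largest first (converted to numeric so the sort is not alphabetical)
arrange_sample = arrange(df, desc(as.numeric(as.character(Number.of.Discharges))))
# Move Number.of.Discharges to the first column; everything() keeps the remaining columns
arrange_sample2 = select(arrange_sample, Number.of.Discharges, everything())
print(arrange_sample2)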
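arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. A minimal sketch on made-up data (the name toy and its values are hypothetical):

```r
library(dplyr)

# Made-up data.
toy <- data.frame(
  Hospital.Name        = c("A", "B", "C"),
  State                = c("IL", "AL", "IL"),
  Number.of.Discharges = c(125, 410, 380),
  stringsAsFactors     = FALSE
)

# Default: ascending order on a single column.
ascending <- arrange(toy, Number.of.Discharges)

# Two keys: State ascending, then discharges descending within each state.
two_keys <- arrange(toy, State, desc(Number.of.Discharges))

ascending$Hospital.Name
two_keys$Hospital.Name
```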
Renaming the column using rename() in dplyr
new_df = rename(df, P_number = Provider.Number)
print(new_df)
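The argument order of rename() is easy to get backwards: the new name goes on the left and the existing name on the right. A small sketch on made-up data (the name toy and its values are hypothetical):

```r
library(dplyr)

# Made-up data.
toy <- data.frame(
  Hospital.Name   = c("A", "B"),
  Provider.Number = c(10001, 10005),
  stringsAsFactors = FALSE
)

# Syntax: rename(data, new_name = old_name). Other columns are left untouched.
renamed <- rename(toy, P_number = Provider.Number)
names(renamed)
```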
mutate() function in dplyr
mutate() in dplyr with statistical operations
Here we create new columns with the mutate() function: New_Col1 holds the sum of the Number.of.Readmissions column and New_Col2 holds its mean.
The same pattern works for other statistics, such as the mean of Excess.Readmission.Ratio or the standard deviation of Predicted.Readmission.Rate.
mutate_sample1 = mutate(df, New_Col1 = sum(as.numeric(as.character(Number.of.Readmissions))))
mutate_sample2 = mutate(mutate_sample1, New_Col2 = mean(as.numeric(as.character(Number.of.Readmissions))))
print(mutate_sample2)
Because the column contains many NA values (the "Not Available" entries become NA when converted to numeric), sum() and mean() return NA, so we need to handle the NA values before these operations give a usable result.
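One common way to treat the NA values is the na.rm = TRUE argument of sum() and mean(), which skips missing entries. This is a minimal sketch on made-up data. The names toy, Total, Average, and Above.Average are hypothetical.

```r
library(dplyr)

# Made-up data with one missing value.
toy <- data.frame(
  Hospital.Name          = c("A", "B", "C"),
  Number.of.Readmissions = c(60, NA, 150),
  stringsAsFactors       = FALSE
)

out <- mutate(
  toy,
  Total         = sum(Number.of.Readmissions, na.rm = TRUE),   # 60 + 150
  Average       = mean(Number.of.Readmissions, na.rm = TRUE),  # mean of 60 and 150
  Above.Average = Number.of.Readmissions > Average             # can use the column created just above
)
print(out)
```

Because sum() and mean() return a single value, mutate() repeats that value on every row. The row-wise comparison stays NA for the hospital with a missing count.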
summarise() in dplyr
Here we use the group_by() function to group the data by Hospital.Name and the summarise() function to get the mean number of readmissions for each hospital.
We then sort the result by Hospital.Name using the arrange() function, connecting all the functions with the pipe (%>%) operator.
df %>%
  group_by(Hospital.Name) %>%
  summarise(Number.of.Read_meanvalue = mean(as.numeric(as.character(Number.of.Readmissions)), na.rm = TRUE)) %>%
  arrange(Hospital.Name)
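Unlike mutate(), summarise() collapses each group to a single row. The sketch below groups made-up data by state and also counts the rows per group with n(). The names toy, Mean.Readmissions, and Rows are hypothetical.

```r
library(dplyr)

# Made-up data with one missing value.
toy <- data.frame(
  State                  = c("AL", "IL", "IL", "AL"),
  Number.of.Readmissions = c(60, 130, NA, 40),
  stringsAsFactors       = FALSE
)

res <- toy %>%
  group_by(State) %>%
  summarise(
    Mean.Readmissions = mean(Number.of.Readmissions, na.rm = TRUE),  # NA values are skipped
    Rows              = n()                                          # rows per group, including NA rows
  ) %>%
  arrange(State)

print(res)
```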
Conclusion
In this case study, you saw practical applications of the dplyr package and its most important functions for subsetting and analyzing data.
You can use the same dplyr functions to analyze healthcare data or any other data in R.