Dplyr In R programming | Complete Tutorial

Data wrangling (or munging) can be easy in R: the dplyr package provides a rich set of functions for manipulating and subsetting data.

The dplyr package implements functions such as select(), filter(), mutate(), group_by(), and summarise() for common, recurring data-manipulation tasks.

In this tutorial, we explore the dplyr functions by analyzing hospital data in R.

For this analysis, you need the raw hospital data file, which you can download from this link.

Install the package

install.packages("dplyr")

Load the library

library("dplyr")

Read the data frame

How you read the data depends on the file format. Here we read a CSV file using the read.csv() function.

df <- read.csv("READMISSION REDUCTION.csv")

The summary() function gives brief summary statistics for each column of a data set. Here we apply it to the healthcare data.

summary(df)

The names() function returns all the column names of the data frame; here we print them with the print() function.

print(names(df))
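It also helps to check how each column was read, since one text entry such as "Not Available" is enough to make a whole column non-numeric. The following is a minimal sketch on a made-up data frame (toy and its values are invented for illustration and are not taken from the hospital file):

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C"),
  Number.of.Discharges = c("120", "Not Available", "410"),
  stringsAsFactors = FALSE
)

# A single non-numeric entry makes the whole column character
col_classes <- sapply(toy, class)
print(col_classes)

# str() shows the type and the first values of every column
str(toy)
```

Running the same two calls on df should show which columns need converting before any numeric comparison.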

select() Function In Dplyr

The select() function in dplyr selects specific columns from a data frame, that is, it subsets the data by column.

sample <- select(df, Hospital.Name, State, Provider.Number, Number.of.Discharges, Measure.Name)

print(sample)

Subsetting or selecting data using the colon (:)

We can select a range of columns using the colon (:) in two ways: by column index or by column name.

Select columns by index, using “:” to set the range

sample1 <- select(df,1:3)

print(sample1)

This creates a subset with the columns from Hospital.Name through Measure.Name, using the colon (:)

sample2 <- select(df, Hospital.Name:Measure.Name)

print(head(sample2,5))
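To see that the two forms of range selection give the same result, here is a small sketch on a made-up data frame (toy and its values are assumptions for illustration; the column names only mimic the hospital data):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B"),
  State = c("AL", "IL"),
  Provider.Number = c(1, 2),
  Measure.Name = c("M1", "M2")
)

# Range by position: columns 1 to 3
by_index <- select(toy, 1:3)

# Range by name: every column from Hospital.Name through Provider.Number
by_name <- select(toy, Hospital.Name:Provider.Number)

print(names(by_index))
print(names(by_name))
```

Selecting by name is usually the safer choice, because it keeps working if the column order of the file changes.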

starts_with() & ends_with() Functions

The starts_with() function selects only the columns in df whose names start with the given string.

The ends_with() function selects only the columns in df whose names end with the given string.

sample3 <- select(df,starts_with("H"),ends_with("S"))

print(sample3)
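Both helpers ignore case by default, which is why ends_with("S") can match a column name ending in a lowercase "s". A minimal sketch on a made-up data frame (toy is an assumption for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B"),
  State = c("AL", "IL"),
  Number.of.Discharges = c(120, 410),
  Number.of.Readmissions = c(15, 60)
)

# Default: matching ignores case, so "S" also matches a trailing "s"
matched <- select(toy, starts_with("H"), ends_with("S"))
print(names(matched))

# With ignore.case = FALSE, no column name ends in an uppercase "S"
strict <- select(toy, ends_with("S", ignore.case = FALSE))
print(ncol(strict))
```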

Select all columns except Hospital.Name and Provider.Number

sample4 <- select(df,-c(Hospital.Name,Provider.Number))

print(sample4)
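As a quick check of what the minus sign does, this sketch drops columns from a made-up data frame (toy is an assumption for illustration only) and lists what remains:

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B"),
  State = c("AL", "IL"),
  Provider.Number = c(1, 2)
)

# Drop several columns at once with -c(...)
dropped_two <- select(toy, -c(Hospital.Name, Provider.Number))

# Drop a single column with a leading minus
dropped_one <- select(toy, -State)

print(names(dropped_two))
print(names(dropped_one))
```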

filter() Function In Dplyr

The filter() function in dplyr keeps the rows that satisfy one or more conditions.

Here we return, separately, the subset where Number.of.Discharges == "Not Available" and the subset where Number.of.Readmissions == "Not Available".

filter_sample1 <- filter(df, Number.of.Discharges == "Not Available")

print(filter_sample1)

filter_sample2 <- filter(df,Number.of.Readmissions == "Not Available")

filter_sample2
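The same kind of filter can be used to count how many rows carry the "Not Available" marker. A minimal sketch on a made-up data frame (toy and its values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C", "D"),
  Number.of.Discharges = c("120", "Not Available", "410", "Not Available"),
  stringsAsFactors = FALSE
)

# Rows where the value is the text marker
missing_rows <- filter(toy, Number.of.Discharges == "Not Available")

# Rows where a real value was reported
reported_rows <- filter(toy, Number.of.Discharges != "Not Available")

print(nrow(missing_rows))
print(nrow(reported_rows))
```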

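Filter on a numeric condition, and use the class() function to check the column's type.

## Number of discharges is >= 350
# The column contains the text "Not Available", so it is not read as numeric.
# Convert it inside filter(); non-numeric entries become NA (with a warning) and are dropped.
filter_sample3 <- filter(df, as.numeric(as.character(Number.of.Discharges)) >= 350)

# Convert the columns to numeric data
filter_sample3$Number.of.Discharges = as.numeric(as.character(filter_sample3$Number.of.Discharges))

filter_sample3$Number.of.Readmissions = as.numeric(as.character(filter_sample3$Number.of.Readmissions))

# Get the class of the converted column
class(filter_sample3$Number.of.Discharges)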
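The order of conversion and filtering matters: comparing a text column with a number compares the values as strings, not as numbers. The sketch below illustrates this on a made-up data frame (toy and its values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C", "D"),
  Number.of.Discharges = c("90", "1200", "Not Available", "400"),
  stringsAsFactors = FALSE
)

# Text comparison: 350 is coerced to "350" and compared character by character,
# so "90" passes (because "9" sorts after "3") while "1200" does not
as_text <- filter(toy, Number.of.Discharges >= 350)

# Numeric comparison: convert first; "Not Available" becomes NA and is dropped
as_number <- filter(toy, suppressWarnings(as.numeric(Number.of.Discharges)) >= 350)

print(as_text$Hospital.Name)
print(as_number$Hospital.Name)
```

Only the second result matches the intended question, "which hospitals had at least 350 discharges".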

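Multiple Conditions in filter()

Here we combine two conditions, Number.of.Discharges >= 350 or Number.of.Readmissions >= 100, using the or operator (|). Because these columns also contain the text "Not Available", they are converted to numeric inside filter().

filter_sample4 = filter(df, as.numeric(as.character(Number.of.Discharges)) >= 350 | as.numeric(as.character(Number.of.Readmissions)) >= 100)

filter_sample4$Number.of.Readmissions = as.numeric(as.character(filter_sample4$Number.of.Readmissions))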
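To contrast "or" with "and", here is a small sketch on a made-up, already numeric data frame (toy and its values are assumptions for illustration only). Conditions separated by a comma inside filter() are combined with "and":

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C", "D"),
  Number.of.Discharges = c(400, 100, 500, 100),
  Number.of.Readmissions = c(50, 150, 200, 10)
)

# OR: a row is kept if at least one condition is TRUE
either <- filter(toy, Number.of.Discharges >= 350 | Number.of.Readmissions >= 100)

# AND: a row is kept only if both conditions are TRUE
both <- filter(toy, Number.of.Discharges >= 350, Number.of.Readmissions >= 100)

print(either$Hospital.Name)
print(both$Hospital.Name)
```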

Filter rows with the %in% operator and c() (combine function)

Here we keep the rows for “SOUTHEAST ALABAMA MEDICAL CENTER” and “MARSHALL MEDICAL CENTER SOUTH” by using %in% together with c() inside filter().

filter_sample5 <- filter(df, Hospital.Name %in% c("SOUTHEAST ALABAMA MEDICAL CENTER","MARSHALL MEDICAL CENTER SOUTH"))

print(filter_sample5)
## Convert the column to numeric using the as.numeric() function

filter_sample5$Number.of.Discharges = as.numeric(as.character(filter_sample5$Number.of.Discharges))
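The %in% test can also be negated to keep every row that is not in the list. A minimal sketch, using made-up hospital names (toy, wanted, and the names are hypothetical and not taken from the hospital file):

```r
library(dplyr)

# Hypothetical toy data; the names are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("NORTH CLINIC", "SOUTH CLINIC", "EAST CLINIC"),
  State = c("AL", "IL", "AL")
)

wanted <- c("NORTH CLINIC", "EAST CLINIC")

# Keep rows whose name appears in the vector
kept <- filter(toy, Hospital.Name %in% wanted)

# Keep rows whose name does not appear in the vector
excluded <- filter(toy, !(Hospital.Name %in% wanted))

print(kept$Hospital.Name)
print(excluded$Hospital.Name)
```

Note that %in% matches whole strings exactly, so a misspelled name in the vector silently matches nothing.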

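Pipeline (%>%) operator in Dplyr

We can chain multiple functions such as select(), filter(), and arrange() in a single statement using the pipeline (%>%) operator.

# Example 1: select columns, keep Illinois rows, sort by discharges (largest first)
# as.numeric() makes the sort numeric, since the column is read as text
pipeline_data1 <- select(df, Hospital.Name, Provider.Number, State, Number.of.Discharges) %>%
  filter(State == 'IL') %>%
  arrange(desc(as.numeric(as.character(Number.of.Discharges))))


# Example 2: the same filter and sort, keeping all columns
pipeline_data2 <- df %>%
  filter(State == 'IL') %>%
  arrange(desc(as.numeric(as.character(Number.of.Discharges))))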
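The pipe passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to nested function calls. The sketch below compares the two on a made-up data frame (toy and its values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C", "D"),
  State = c("IL", "AL", "IL", "IL"),
  Number.of.Discharges = c(120, 410, 900, 300)
)

# Nested calls: read from the inside out
nested <- arrange(
  filter(select(toy, Hospital.Name, State, Number.of.Discharges), State == "IL"),
  desc(Number.of.Discharges)
)

# Piped calls: read from top to bottom
piped <- toy %>%
  select(Hospital.Name, State, Number.of.Discharges) %>%
  filter(State == "IL") %>%
  arrange(desc(Number.of.Discharges))

print(piped)
```

Both versions produce the same rows in the same order; the piped form simply lists the steps in the order they run.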

arrange() function in dplyr

The arrange() function in dplyr sorts the rows of a data frame by one or more columns. It sorts in ascending order by default; wrap a column in desc() to sort in descending order.

# Sort by number of discharges, largest first (converted so the sort is numeric, not alphabetical)
arrange_sample = arrange(df, desc(as.numeric(as.character(Number.of.Discharges))))

# Move Number.of.Discharges to the first column
arrange_sample2 = select(arrange_sample, Number.of.Discharges, everything())

print(arrange_sample2)
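arrange() also accepts several columns, using later ones to break ties in earlier ones. A minimal sketch on a made-up data frame (toy and its values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C", "D"),
  State = c("AL", "IL", "AL", "IL"),
  Number.of.Discharges = c(100, 300, 250, 50)
)

# Sort by State (ascending), then by discharges within each state (descending)
sorted <- arrange(toy, State, desc(Number.of.Discharges))

print(sorted)
```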

Renaming the column using rename() in dplyr

new_df = rename(df, P_number = Provider.Number)

print(new_df)
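In rename(), the new name goes on the left of the equals sign and the existing name on the right, and several columns can be renamed in one call. A small sketch on a made-up data frame (toy and the new names are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B"),
  Provider.Number = c(1, 2),
  State = c("AL", "IL")
)

# Syntax: new_name = old_name
renamed <- rename(toy, P_number = Provider.Number, Hospital = Hospital.Name)

print(names(renamed))
```

Column order and contents are unchanged; only the names differ.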

mutate() function in Dplyr

mutate() in Dplyr with statistical operations

Here we create new columns with the mutate() function: New_Col1 holds the sum of the Number.of.Readmissions column and New_Col2 holds its mean.

mutate_sample1 = mutate(df, New_Col1 = sum(as.numeric(as.character(Number.of.Readmissions))))

mutate_sample2 = mutate(mutate_sample1, New_Col2 = mean(as.numeric(as.character(Number.of.Readmissions))))

print(mutate_sample2)

Because the data contains many NA values, both results come out as NA, so the NA values need to be handled before these operations give a useful answer.
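One common way to handle the NA values is to pass na.rm = TRUE to sum() and mean(), which skips them. The following is a minimal sketch on a made-up data frame (toy, the new column names, and the values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "B", "C"),
  Number.of.Readmissions = c("10", "Not Available", "30"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  mutate(
    # "Not Available" cannot be converted and becomes NA
    Readmissions = suppressWarnings(as.numeric(Number.of.Readmissions)),
    # na.rm = TRUE leaves the NA values out of the calculation
    Total = sum(Readmissions, na.rm = TRUE),
    Mean = mean(Readmissions, na.rm = TRUE)
  )

print(clean)
```

Whether dropping the missing rows is appropriate depends on the analysis; this only shows the mechanics.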

summarise() in Dplyr

Here we group the data by Hospital.Name using the group_by() function and compute the mean number of readmissions per hospital using the summarise() function.

We then sort the result by Hospital.Name using the arrange() function, connecting all the functions with the pipeline (%>%) operator.

df %>%
  group_by(Hospital.Name) %>%
  # convert to numeric and ignore NA values so the mean is defined
  summarise(Number.of.Read_meanvalue = mean(as.numeric(as.character(Number.of.Readmissions)), na.rm = TRUE)) %>%
  arrange(Hospital.Name)
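summarise() returns one row per group, and it can compute several summaries at once. A minimal sketch on a made-up data frame with repeated hospital names (toy, the summary column names, and the values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Hospital.Name = c("A", "A", "B", "B"),
  Number.of.Readmissions = c(10, 30, NA, 50)
)

by_hospital <- toy %>%
  group_by(Hospital.Name) %>%
  summarise(
    Rows = n(),                                                   # rows in the group
    Mean_Readmissions = mean(Number.of.Readmissions, na.rm = TRUE) # mean of non-missing values
  ) %>%
  arrange(Hospital.Name)

print(by_hospital)
```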

Conclusion

In this case study, you saw a practical application of the dplyr package and its most important functions for subsetting and analyzing data.

You can analyze healthcare data, or any other data, in the same way using the dplyr functions in R.
