- 1 select() Function In Dplyr
- 2 starts_with() & ends_with() Functions
- 3 filter() Function In Dplyr
- 4 Pipeline (%>%) operator in Dplyr
- 5 arrange() function in dplyr
- 6 mutate() function in Dplyr
- 7 summarise() in Dplyr
- 8 Conclusion
Data wrangling (or munging) can be easy in R: the dplyr package offers a rich set of tools for manipulating and subsetting data.
The dplyr package provides a family of functions such as select(), filter(), mutate(), group_by(), and summarise() for common, recurring data-manipulation tasks.
In this tutorial, we explore the dplyr functions by applying them to hospital data in R.
For this analysis, you need the raw hospital data file, which you can download from this link.
Install the package
Load the library
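The two steps above can be sketched as follows. This is a minimal sketch that assumes dplyr is installed from CRAN; the check with requireNamespace() is an addition here so that the package is only installed when it is missing.

```r
# Install dplyr from CRAN only if it is not already available
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so select(), filter(), %>% and friends can be called directly
library(dplyr)
```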
Read the data frame
How you read the data depends on the file format. Here we read a CSV file using the read.csv() function.
df <- read.csv("READMISSION REDUCTION.csv")
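To see what read.csv() does without the real file, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the name toy are hypothetical. By default read.csv() turns characters that are not valid in R names, such as spaces, into dots, which may be why the columns in this tutorial have names like Hospital.Name (an assumption about the raw file's headers).

```r
# Write a small, made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Hospital Name,State,Number of Discharges",
  "HOSPITAL A,AL,242",
  "HOSPITAL B,IL,Not Available"
), csv_path)

# Read it back; spaces in the header become dots in the column names
toy <- read.csv(csv_path, stringsAsFactors = FALSE)
names(toy)

# A column that mixes numbers with text such as "Not Available" is read as character
class(toy$Number.of.Discharges)
```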
The summary() function gives brief summary statistics for each column of a data set; here we apply it to the healthcare data.
The names() function returns all the column names of the data, which can then be displayed with the print() function.
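The calls themselves are not shown above, so here is a minimal sketch on a toy data frame. The data frame toy and its values are made up for illustration; with the real data you would pass df instead.

```r
# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  State = c("AL", "IL", "IL"),
  Number.of.Discharges = c("242", "Not Available", "410"),
  stringsAsFactors = FALSE
)

# Brief per-column summary (length and class for text columns)
summary(toy)

# Column names of the data frame
col_names <- names(toy)
print(col_names)
```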
select() Function In Dplyr
The select() function in dplyr selects specific columns from a data frame, which makes it a convenient way to subset the data by column.
sample <- select(df, Hospital.Name, State, Provider.Number, Number.of.Discharges, Measure.Name)
print(sample)
Subsetting or selecting data using the colon (:)
We can select a range of columns using the colon (:), in two ways: by column index or by column name.
Select the columns by index, using ":" to set the range:
sample1 <- select(df, 1:3)
print(sample1)
The following creates a subset of the data containing every column from Hospital.Name through Measure.Name, using the colon (:):
sample2 <- select(df, Hospital.Name:Measure.Name)
print(head(sample2, 5))
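The three ways of selecting columns described above can be tried on a small data frame. This is a sketch only: toy and its values are hypothetical, and the column names simply mirror the ones used in this tutorial.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B"),
  Provider.Number = c(10001, 10005),
  State = c("AL", "IL"),
  Measure.Name = c("MEASURE 1", "MEASURE 2"),
  Number.of.Discharges = c("242", "410"),
  stringsAsFactors = FALSE
)

by_name  <- select(toy, Hospital.Name, State)        # pick columns by name
by_index <- select(toy, 1:3)                         # first three columns by position
by_range <- select(toy, Hospital.Name:Measure.Name)  # every column from Hospital.Name to Measure.Name

names(by_range)
```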
starts_with() & ends_with() Functions
The starts_with() helper, used inside select(), picks only the columns of df whose names start with the given string.
The ends_with() helper picks only the columns of df whose names end with the given string.
sample3 <- select(df, starts_with("H"), ends_with("S"))
print(sample3)
Select all columns except Hospital.Name and Provider.Number:
sample4 <- select(df, -c(Hospital.Name, Provider.Number))
print(sample4)
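A small sketch of the selection helpers and of dropping columns, on a hypothetical toy data frame. Note that starts_with() and ends_with() ignore case by default (ignore.case = TRUE), which is why ends_with("S") also matches names ending in a lowercase "s".

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B"),
  State = c("AL", "IL"),
  Provider.Number = c(10001, 10005),
  Number.of.Discharges = c(242, 410),
  Number.of.Readmissions = c(30, 55),
  stringsAsFactors = FALSE
)

# Columns starting with "H" plus columns ending with "S" (case-insensitive)
helpers <- select(toy, starts_with("H"), ends_with("S"))

# Everything except the two named columns
dropped <- select(toy, -c(Hospital.Name, Provider.Number))

names(helpers)
names(dropped)
```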
filter() Function In Dplyr
The filter() function in dplyr returns the rows of the data that satisfy one or more conditions.
Here we return, separately, the subset where Number.of.Discharges == “Not Available” and the subset where Number.of.Readmissions == “Not Available”.
filter_sample1 <- filter(df, Number.of.Discharges == "Not Available")
print(filter_sample1)
filter_sample2 <- filter(df, Number.of.Readmissions == "Not Available")
filter_sample2
Filter on a numeric condition, and use the class() function to check the type of a column.
## Number of discharges >= 350
# Convert the columns to numeric before comparing, so the comparison is numeric
# rather than text-based; "Not Available" becomes NA (with a warning)
filter_sample3 <- df
filter_sample3$Number.of.Discharges = as.numeric(as.character(filter_sample3$Number.of.Discharges))
filter_sample3$Number.of.Readmissions = as.numeric(as.character(filter_sample3$Number.of.Readmissions))
filter_sample3 <- filter(filter_sample3, Number.of.Discharges >= 350)
# Check the class of the converted column
class(filter_sample3$Number.of.Discharges)
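Because the discharge column contains the text "Not Available", it is likely stored as text rather than as numbers, and comparing text with a number gives misleading results. The sketch below, on a hypothetical toy data frame, shows the problem and one way around it: convert first, then filter.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  Number.of.Discharges = c("242", "Not Available", "410", "1200"),
  stringsAsFactors = FALSE
)

# As text, "1200" sorts before "350", so a text comparison is misleading
text_compare <- "1200" >= "350"

# Convert to numeric; "Not Available" becomes NA (warning suppressed on purpose)
toy_num <- mutate(toy, Number.of.Discharges = suppressWarnings(as.numeric(Number.of.Discharges)))

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped
large <- filter(toy_num, Number.of.Discharges >= 350)
class(large$Number.of.Discharges)
```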
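Multiple Conditions in filter()
Here we combine two filter conditions, Number.of.Discharges >= 350 or Number.of.Readmissions >= 100, using the OR operator (|).
# Convert inside filter() so both comparisons are numeric;
# rows where the combined condition evaluates to NA are dropped
filter_sample4 = filter(df, as.numeric(as.character(Number.of.Discharges)) >= 350 | as.numeric(as.character(Number.of.Readmissions)) >= 100)
filter_sample4$Number.of.Readmissions = as.numeric(as.character(filter_sample4$Number.of.Readmissions))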
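A small sketch of combining conditions, on a hypothetical toy data frame whose columns are already numeric. It contrasts | (either condition) with & (both conditions), the latter being an addition here for comparison.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  Number.of.Discharges = c(242, 410, 120, 90),
  Number.of.Readmissions = c(30, 55, 130, 12),
  stringsAsFactors = FALSE
)

# Rows that satisfy at least one of the two conditions
either <- filter(toy, Number.of.Discharges >= 350 | Number.of.Readmissions >= 100)

# Rows that satisfy both conditions at once
both <- filter(toy, Number.of.Discharges >= 350 & Number.of.Readmissions >= 100)

print(either)
```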
Filter rows with the %in% operator and c() (combine function)
Here we filter the rows for “SOUTHEAST ALABAMA MEDICAL CENTER” and “MARSHALL MEDICAL CENTER SOUTH” in a single filter condition, with the help of %in% and c().
filter_sample5 <- filter(df, Hospital.Name %in% c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH"))
print(filter_sample5)
## Convert the column to numeric using the as.numeric() function
filter_sample5$Number.of.Discharges = as.numeric(as.character(filter_sample5$Number.of.Discharges))
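The %in% operator tests, for each row, whether the value appears anywhere in a vector, which is shorter than chaining several == comparisons with |. A minimal sketch with hypothetical hospital names:

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  Number.of.Discharges = c("242", "410", "120", "90"),
  stringsAsFactors = FALSE
)

# Names to keep, built with c()
wanted <- c("HOSPITAL A", "HOSPITAL D")

# Equivalent to: Hospital.Name == "HOSPITAL A" | Hospital.Name == "HOSPITAL D"
by_name <- filter(toy, Hospital.Name %in% wanted)

# as.character() first guards against factor columns being converted to their codes
by_name$Number.of.Discharges <- as.numeric(as.character(by_name$Number.of.Discharges))
print(by_name)
```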
Pipeline (%>%) operator in Dplyr
We can chain multiple functions such as select(), filter(), and arrange() into a single expression using the pipe (%>%) operator.
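# Note: if Number.of.Discharges is still stored as text, convert it with
# as.numeric() first so that the sort below is numeric rather than alphabetical
# Example 1
pipeline_data1 <- select(df, Hospital.Name, Provider.Number, State, Number.of.Discharges) %>%
  filter(State == 'IL') %>%
  arrange(desc(Number.of.Discharges))
# Example 2
pipeline_data2 <- df %>%
  filter(State == 'IL') %>%
  arrange(desc(Number.of.Discharges))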
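The pipe passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to nesting the calls inside each other. A small sketch on a hypothetical toy data frame compares the two forms:

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  State = c("IL", "AL", "IL", "IL"),
  Number.of.Discharges = c(242, 410, 1200, 90),
  stringsAsFactors = FALSE
)

# Nested form: read from the inside out
nested <- arrange(
  filter(select(toy, Hospital.Name, State, Number.of.Discharges), State == "IL"),
  desc(Number.of.Discharges)
)

# Piped form: read from top to bottom
piped <- toy %>%
  select(Hospital.Name, State, Number.of.Discharges) %>%
  filter(State == "IL") %>%
  arrange(desc(Number.of.Discharges))

print(piped)
```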
arrange() function in dplyr
The arrange() function in dplyr sorts the rows of the data by one or more columns. It sorts in ascending order by default; wrapping a column in desc() sorts it in descending order.
arrange_sample = arrange(df, desc(Number.of.Discharges))
arrange_sample2 = select(arrange_sample, Number.of.Discharges, everything())
print(arrange_sample2)
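A minimal sketch of ascending and descending sorts, and of using everything() inside select() to move one column to the front while keeping the rest. The toy data frame and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  State = c("IL", "AL", "IL", "IL"),
  Number.of.Discharges = c(242, 410, 1200, 90),
  stringsAsFactors = FALSE
)

ascending  <- arrange(toy, Number.of.Discharges)        # default: smallest first
descending <- arrange(toy, desc(Number.of.Discharges))  # largest first

# Put Number.of.Discharges first, followed by all remaining columns
front <- select(descending, Number.of.Discharges, everything())
print(front)
```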
Renaming a column using rename() in dplyr
new_df = rename(df, P_number = Provider.Number)
print(new_df)
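In rename() the new name goes on the left of the equals sign and the existing name on the right; all other columns and their order are left unchanged. A small sketch on a hypothetical toy data frame:

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B"),
  Provider.Number = c(10001, 10005),
  State = c("AL", "IL"),
  stringsAsFactors = FALSE
)

# new_name = old_name
renamed <- rename(toy, P_number = Provider.Number)
names(renamed)
```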
mutate() function in Dplyr
mutate() in Dplyr with statistical operations
The mutate() function adds new columns to the data, for example a column holding the mean of the Excess.Readmission.Ratio column or the standard deviation of the Predicted.Readmission.Rate column.
The code below applies the same idea to the Number.of.Readmissions column, using sum() and mean().
mutate_sample1 = mutate(df, New_Col1 = sum(as.numeric(Number.of.Readmissions)))
mutate_sample2 = mutate(mutate_sample1, New_Col2 = mean(as.numeric(Number.of.Readmissions)))
print(mutate_sample2)
Because the data contains many NA values, these operations return NA, so we need to handle the NA values first, for example by passing na.rm = TRUE to sum() and mean().
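One way to treat the NA values is to convert the column to numeric and pass na.rm = TRUE to the summary functions. This is a sketch on a hypothetical toy data frame; the new column names (Readmissions, Total_NA, Total, Mean, SD) are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  Number.of.Readmissions = c("30", "Not Available", "50"),
  stringsAsFactors = FALSE
)

with_stats <- toy %>%
  # "Not Available" becomes NA (warning suppressed on purpose)
  mutate(Readmissions = suppressWarnings(as.numeric(Number.of.Readmissions))) %>%
  mutate(
    Total_NA = sum(Readmissions),                # NA, because one value is missing
    Total    = sum(Readmissions, na.rm = TRUE),  # missing values are skipped
    Mean     = mean(Readmissions, na.rm = TRUE),
    SD       = sd(Readmissions, na.rm = TRUE)
  )

print(with_stats)
```

Each statistic is a single value, so mutate() repeats it on every row of the new column.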
summarise() in Dplyr
Here we use the group_by() function to group the data by Hospital.Name, and the summarise() function to compute the mean for each group.
We then sort the result by Hospital.Name using the arrange() function, connecting all the functions with the pipe (%>%) operator.
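# Convert to numeric and skip NA values so that the mean is not NA
df %>%
  group_by(Hospital.Name) %>%
  summarise(Number.of.Read_meanvalue = mean(as.numeric(as.character(Number.of.Readmissions)), na.rm = TRUE)) %>%
  arrange(Hospital.Name)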
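The grouped summary can be checked by hand on a small example. The sketch below uses a hypothetical toy data frame in which each hospital appears twice, so summarise() returns one row per hospital.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up for illustration
toy <- data.frame(
  Hospital.Name = c("HOSPITAL B", "HOSPITAL A", "HOSPITAL B", "HOSPITAL A"),
  Number.of.Readmissions = c("30", "Not Available", "50", "20"),
  stringsAsFactors = FALSE
)

means <- toy %>%
  # "Not Available" becomes NA (warning suppressed on purpose)
  mutate(Number.of.Readmissions = suppressWarnings(as.numeric(Number.of.Readmissions))) %>%
  group_by(Hospital.Name) %>%
  summarise(Number.of.Read_meanvalue = mean(Number.of.Readmissions, na.rm = TRUE)) %>%
  arrange(Hospital.Name)

print(means)
```

HOSPITAL A has one usable value (20) and HOSPITAL B has two (30 and 50), so the group means are 20 and 40.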
Conclusion
In this case study, you saw a practical application of the dplyr package and its most important functions for subsetting and analyzing data. The same dplyr functions can be used to analyze healthcare data or any other data in R.
Analytics Teams creates useful content related to data science, analytics, and AI. It is a team of skilled data scientists and analysts, some of whom work full time and some part time.