Data wrangling (also called data munging) can be easy in R: the dplyr package provides a powerful set of tools for manipulating and subsetting data.

The dplyr package provides functions such as select(), filter(), mutate(), group_by(), and summarize() for common, recurring data manipulation tasks.

In this tutorial, we explore the dplyr functions by analyzing hospital data in R.

For this analysis, you need the raw hospital data file, which you can download from this link.

#### Install the package

`install.packages("dplyr")`

#### Load the library

`library("dplyr")`

#### Read the data frame

How you read the data depends on the file format. Here we read a CSV file using the read.csv() function.

`df <- read.csv("READMISSION REDUCTION.csv")`
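
If the file stores missing values as the text "Not Available" (as the filters later in this tutorial suggest), one option is to tell read.csv() to treat that text as NA, so numeric columns are read as numbers. The sketch below illustrates this with a tiny made-up CSV written to a temporary file; the rows are invented and only the column names follow the tutorial.

```r
# A tiny, made-up CSV written to a temporary file (not the real hospital data)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Hospital.Name,State,Number.of.Discharges",
  "HOSPITAL A,IL,400",
  "HOSPITAL B,AL,Not Available"
), tmp)

# Plain read: the "Not Available" text keeps the column from being numeric
raw <- read.csv(tmp, stringsAsFactors = FALSE)
class(raw$Number.of.Discharges)

# Treat "Not Available" as NA so the column is read as a number
clean <- read.csv(tmp, stringsAsFactors = FALSE,
                  na.strings = c("NA", "Not Available"))
class(clean$Number.of.Discharges)
```

Whether this suits the real file depends on how its missing values are actually encoded, so check a few rows first.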

The summary() function gives brief summary statistics for each column of a data frame. Here we apply it to the healthcare data.

`summary(df)`

The names() function returns all the column names of the data frame; here we display them with the print() function.

`print(names(df))`

## select() Function In dplyr

The **select()** function in dplyr selects specific columns from a data frame, giving you a subset of the data with only the columns you need. Note that R is case-sensitive, so the function must be written in lowercase as `select()`.

```
sample <- select(df, Hospital.Name, State, Provider.Number, Number.of.Discharges, Measure.Name)
print(sample)
```

**Subsetting or selecting data using a colon (:)**

We can select a range of columns using a colon (**:**) in two ways: by column index or by column name.

Select columns by index, using ":" to set the range:

```
sample1 <- select(df,1:3)
print(sample1)
```

This creates a subset of the data from column **Hospital.Name** to **Measure.Name** using a colon (:):

```
sample2 <- select(df, Hospital.Name:Measure.Name)
print(head(sample2,5))
```
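
The two ways of selecting a range can be tried on a small data frame. The sketch below uses made-up rows; only the column names mirror the tutorial, and their order here is an assumption.

```r
library(dplyr)

# Made-up data frame; the values and the column order are assumptions
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B"),
  Provider.Number = c(1001, 1002),
  State = c("IL", "AL"),
  Measure.Name = c("MEASURE X", "MEASURE Y"),
  Number.of.Discharges = c(400, 120)
)

# Range by position: columns 1 through 3
by_index <- select(toy, 1:3)
names(by_index)

# Range by name: every column from Hospital.Name through Measure.Name
by_name <- select(toy, Hospital.Name:Measure.Name)
names(by_name)
```

Selecting by name is usually the safer choice, because it keeps working if columns are added or reordered in the source file.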

## starts_with() & ends_with() Functions

The **starts_with()** function in dplyr selects only the columns in df whose names start with the given string.

The **ends_with()** function selects only the columns in df whose names end with the given string.

```
sample3 <- select(df,starts_with("H"),ends_with("S"))
print(sample3)
```

Select all columns except **Hospital.Name** and **Provider.Number**:

```
sample4 <- select(df,-c(Hospital.Name,Provider.Number))
print(sample4)
```
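
To see which columns these helpers pick up, here is a small sketch on a made-up one-row data frame (the values are invented; the column names follow the tutorial). Note that starts_with() and ends_with() ignore case by default, which is why "S" matches names ending in a lowercase "s".

```r
library(dplyr)

# Made-up one-row data frame for illustration only
toy <- data.frame(
  Hospital.Name = "HOSPITAL A",
  Provider.Number = 1001,
  State = "IL",
  Number.of.Discharges = 400,
  Number.of.Readmissions = 90
)

# Columns starting with "H" plus columns ending with "s" (case is ignored)
picked <- select(toy, starts_with("H"), ends_with("S"))
names(picked)

# Drop columns by wrapping them in -c()
dropped <- select(toy, -c(Hospital.Name, Provider.Number))
names(dropped)
```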

## filter() Function In Dplyr

The filter() function in dplyr returns the rows of a data frame that satisfy one or more conditions.

Here we return the subsets where **Number.of.Discharges** == "Not Available" and where **Number.of.Readmissions** == "Not Available", separately:

```
filter_sample1 <- filter(df, Number.of.Discharges == "Not Available")
print(filter_sample1)
filter_sample2 <- filter(df,Number.of.Readmissions == "Not Available")
filter_sample2
```

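Next we filter on a numeric condition and use the class() function to check the data type of a column. Because these columns contain the text "Not Available", we convert them to numeric before comparing; otherwise the values would be compared as text.

```
# Convert the columns to numeric first; "Not Available" becomes NA (with a warning).
# as.character() guards against the columns having been read as factors.
df_numeric <- mutate(df,
                     Number.of.Discharges = as.numeric(as.character(Number.of.Discharges)),
                     Number.of.Readmissions = as.numeric(as.character(Number.of.Readmissions)))
# Keep rows where the number of discharges is at least 350
filter_sample3 <- filter(df_numeric, Number.of.Discharges >= 350)
# Check the class (data type) of the converted column
class(filter_sample3$Number.of.Discharges)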
```
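
A comparison such as `>= 350` behaves differently on text and on numbers, which is easy to miss. The sketch below uses a made-up data frame in which the discharge counts are stored as text; it is only an illustration, not the real hospital data.

```r
library(dplyr)

# Made-up data with the counts stored as text
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  Number.of.Discharges = c("1000", "40", "Not Available"),
  stringsAsFactors = FALSE
)

# Text comparison: "1000" is compared character by character with "350",
# so HOSPITAL A is dropped even though 1000 is larger than 350
as_text <- filter(toy, Number.of.Discharges >= 350)
as_text$Hospital.Name

# Numeric comparison: convert first; text that cannot be converted becomes NA
as_number <- toy %>%
  mutate(Number.of.Discharges = suppressWarnings(as.numeric(Number.of.Discharges))) %>%
  filter(Number.of.Discharges >= 350)
as_number$Hospital.Name
```

filter() drops rows where the condition evaluates to NA, so the "Not Available" row disappears after conversion without any extra step.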

#### Multiple Conditions in filter()

Here we combine two conditions, Number.of.Discharges >= 350 or Number.of.Readmissions >= 100, using the OR operator (|):

```
# num() is a small helper defined here to convert text or factor columns to numbers
num <- function(x) as.numeric(as.character(x))
filter_sample4 <- filter(df, num(Number.of.Discharges) >= 350 | num(Number.of.Readmissions) >= 100)
filter_sample4$Number.of.Readmissions <- num(filter_sample4$Number.of.Readmissions)
```
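
The difference between OR (|) and AND (&) is easiest to see side by side. This is a minimal sketch on made-up numeric data, so no type conversion is needed here.

```r
library(dplyr)

# Made-up numeric data for illustration only
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C", "HOSPITAL D"),
  Number.of.Discharges = c(400, 120, 500, 100),
  Number.of.Readmissions = c(50, 150, 200, 20)
)

# OR: keep rows where at least one condition holds
either <- filter(toy, Number.of.Discharges >= 350 | Number.of.Readmissions >= 100)
either$Hospital.Name

# AND: keep rows where both conditions hold
both <- filter(toy, Number.of.Discharges >= 350 & Number.of.Readmissions >= 100)
both$Hospital.Name
```

Separating conditions with a comma inside filter() has the same effect as combining them with &.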

#### Filter rows with the %in% operator and c() (combine function)

Here we keep the rows for "SOUTHEAST ALABAMA MEDICAL CENTER" and "MARSHALL MEDICAL CENTER SOUTH" by combining filter() with %in% and c().

```
filter_sample5 <- filter(df, Hospital.Name %in% c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH"))
print(filter_sample5)
```
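
The same pattern works with any vector of values, and it can be negated with `!`. A small sketch on made-up hospital names (the `wanted` vector is a name chosen here for illustration):

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  State = c("IL", "AL", "IL")
)

# The values to match, built with c()
wanted <- c("HOSPITAL A", "HOSPITAL C")

# Keep rows whose name appears in the vector
matched <- filter(toy, Hospital.Name %in% wanted)
matched$Hospital.Name

# Negate with ! to keep every other row
others <- filter(toy, !(Hospital.Name %in% wanted))
others$Hospital.Name
```

Matching with %in% is exact, so a single misspelled character in a name silently returns no rows for that hospital.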

```
# Convert the column to numeric; as.character() comes first in case the column is a factor
filter_sample5$Number.of.Discharges <- as.numeric(as.character(filter_sample5$Number.of.Discharges))
```
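
The as.character() step matters when a column is a factor, which is how read.csv() stored text columns by default before R 4.0. A minimal base R sketch with made-up values shows the difference:

```r
# A factor holding numbers as labels (made-up values)
f <- factor(c("10", "200", "30"))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(f)
codes

# Going through as.character() recovers the original numbers
values <- as.numeric(as.character(f))
values
```

If the column is already a character vector, the extra as.character() call is harmless, so it is a safe habit either way.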

## Pipe (%>%) operator in dplyr

We can chain multiple functions such as select(), filter(), and arrange() into a single expression using the pipe (%>%) operator, which passes the result on its left as the first argument of the function on its right.

```
# Example 1: start from a select() call and pipe its result onward
pipeline_data1 <- select(df, Hospital.Name, Provider.Number, State, Number.of.Discharges) %>%
  filter(State == 'IL') %>%
  arrange(desc(Number.of.Discharges))
# Example 2: start from the data frame itself
pipeline_data2 <- df %>%
  filter(State == 'IL') %>%
  arrange(desc(Number.of.Discharges))
```
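
A piped chain is just another way of writing nested function calls. The sketch below, on made-up data, builds the same result both ways and checks that they match.

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  State = c("IL", "AL", "IL"),
  Number.of.Discharges = c(400, 120, 500)
)

# Nested calls: read from the inside out
nested <- arrange(
  filter(select(toy, Hospital.Name, State, Number.of.Discharges), State == "IL"),
  desc(Number.of.Discharges)
)

# Piped calls: read from top to bottom
piped <- toy %>%
  select(Hospital.Name, State, Number.of.Discharges) %>%
  filter(State == "IL") %>%
  arrange(desc(Number.of.Discharges))

# Both approaches give the same data frame
identical(nested, piped)
```

The piped version lists the steps in the order they run, which tends to be easier to read as the chain grows.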

**arrange() function in dplyr**

The arrange() function in dplyr sorts the rows of a data frame by one or more columns. It sorts in ascending order by default; wrap a column in desc() to sort it in descending order. Note that a column stored as text is sorted alphabetically, so convert it to numeric first if you need a numeric sort.

```
arrange_sample = arrange(df, desc(Number.of.Discharges))
arrange_sample2 = select(arrange_sample, Number.of.Discharges, everything())
print(arrange_sample2)
```
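
arrange() also accepts several sort keys, applied from left to right. A small sketch on made-up data, showing the default ascending order and a two-key sort:

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  State = c("IL", "AL", "IL"),
  Number.of.Discharges = c(400, 120, 500)
)

# Default: ascending order
ascending <- arrange(toy, Number.of.Discharges)
ascending$Hospital.Name

# Two keys: State ascending, then discharges descending within each state
by_state <- arrange(toy, State, desc(Number.of.Discharges))
by_state$Hospital.Name
```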

#### Renaming the column using rename() in dplyr

```
new_df = rename(df, P_number = Provider.Number)
print(new_df)
```

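## mutate() function in dplyr

#### mutate() in dplyr with statistical operations

Here we create new columns in the data using the **mutate()** function: New_Col1 holds the sum of the **Number.of.Readmissions** column and New_Col2 holds its mean. Because each of these is a single value, it is repeated in every row.

The same pattern applies to other statistics, such as the mean of the **Excess.Readmission.Ratio** column or the standard deviation of the **Predicted.Readmission.Rate** column.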

```
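# Convert to numeric before aggregating; the results are NA if any value fails to convert
mutate_sample1 <- mutate(df, New_Col1 = sum(as.numeric(as.character(Number.of.Readmissions))))
mutate_sample2 <- mutate(mutate_sample1, New_Col2 = mean(as.numeric(as.character(Number.of.Readmissions))))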
print(mutate_sample2)
```

Because the data contains many NA values, these operations return NA, so the NA values need to be handled first.
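
One common way to handle them is the `na.rm = TRUE` argument of sum() and mean(), which skips NA values. The sketch below shows the effect on made-up data; the new column names (`Readmissions`, `Total`, `Mean_With_NA`, `Mean`) are chosen here for illustration. Whether ignoring NA is appropriate depends on the analysis.

```r
library(dplyr)

# Made-up data with one missing value stored as text
toy <- data.frame(
  Hospital.Name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
  Number.of.Readmissions = c("50", "Not Available", "150"),
  stringsAsFactors = FALSE
)

result <- toy %>%
  # "Not Available" cannot be converted, so it becomes NA
  mutate(Readmissions = suppressWarnings(as.numeric(Number.of.Readmissions))) %>%
  mutate(
    Total = sum(Readmissions, na.rm = TRUE),   # NA values are skipped
    Mean_With_NA = mean(Readmissions),         # NA because one value is missing
    Mean = mean(Readmissions, na.rm = TRUE)    # mean of the remaining values
  )
print(result)
```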

## summarise() in dplyr

Here we use the **group_by()** function to group the data by **Hospital.Name**, and the **summarise()** function to compute the mean of **Number.of.Readmissions** for each group.

We then sort the result by Hospital.Name using the **arrange()** function, connecting all the functions with the pipe (**%>%**) operator.

```
df %>%
  group_by(Hospital.Name) %>%
  # Convert to numeric and ignore NA values when averaging
  summarise(Number.of.Read_meanvalue = mean(as.numeric(as.character(Number.of.Readmissions)), na.rm = TRUE)) %>%
  arrange(Hospital.Name)
```
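
Unlike mutate(), summarise() returns one row per group. A small sketch on made-up data makes this visible; it also counts the rows in each group with n(). The output column names `Rows` and `Mean_Readmissions` are chosen here for illustration.

```r
library(dplyr)

# Made-up data with two rows per hospital and one missing value
toy <- data.frame(
  Hospital.Name = c("HOSPITAL B", "HOSPITAL A", "HOSPITAL A", "HOSPITAL B"),
  Number.of.Readmissions = c(100, 50, NA, 200)
)

by_hospital <- toy %>%
  group_by(Hospital.Name) %>%
  summarise(
    Rows = n(),                                                   # rows in the group
    Mean_Readmissions = mean(Number.of.Readmissions, na.rm = TRUE) # mean, skipping NA
  ) %>%
  arrange(Hospital.Name)
print(by_hospital)
```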

## Conclusion

In this case study, you saw a practical application of the dplyr package and its most important functions for subsetting and analyzing data.

You can apply the same dplyr functions to analyze healthcare data or any other data in R.

