Data Analysis with R — Part 5 (Data Processing using DPLYR)

10 min readJan 16, 2022

According to a survey, data preparation accounts for about 80% of a project’s time, and 76% of data scientists view data preparation as the least enjoyable part of their work.

This is the fifth part of a series of posts on the topic of Data Analysis with R. Here are the links to the previous parts.

Data Analysis with R — Part 1 (Getting Started)

Data Analysis with R — Part 2 (Basics of R Programming)

Data Analysis with R — Part 3 (Control Flow)

Data Analysis with R — Part 4 (Reading data from Excel/CSV)

This post is dedicated to data processing using the DPLYR package. You will be able to do the following reading this post

subsetting columns in a data frame
adding new columns to a data frame, based on condition(s)
filtering rows using a condition or a set of conditions
grouping rows based on unique identifiers and summarising
joining two data frames with a common column (key)

Introduction

This post is what I’ve been waiting for so long to write. The survey mentioned above indicates how painful data processing can be and how much time one needs to dedicate to it. DPLYR is one of my favorite packages that makes data processing a breeze in R.

DPLYR

DPLYR is one of the key packages within the tidyverse that has a collection of packages sharing an underlying design philosophy, grammar, and data structures. You can find more information about tidyverse here.

DPLYR was developed by Hadley Wickham, Chief Data Scientist at RStudio. There are rich resources available on DPLYR — here and here, to name a couple. If you a data buff and have some time over a weekend, listen to this super data science podcast, where Hadley Wickham details how he developed DPLYR and other packages he developed/in the pipeline.

DPYLR is a powerful R-package that can aid in manipulating, cleaning, and summarizing data. It can handle operations such as applying filters, selecting specific columns, sorting data, adding or deleting columns, and aggregating data. The best part about DPLYR is that the commands are based on common verbs like — select, filter, mutate, arrange, etc. The following table summarizes the key commands within DPLYR.

Fig. 1. Summary of DPLYR commands and their description

Let us look at each of these functions with an example

Exercise 1— select, filter, group_by & summarize

For this exercise, I would be using the 4 datasets that you may download from this google drive link. This is a synthetic patients’ dataset covering demographics, medications, creatinine levels, and glucose levels.

Let’s load the datasets and see how the data looks like.

# 0. Loading the necessary libraries and datasets
rm(list = ls())
library(dplyr)setwd("C:/Users/.../Selecting_Filtering_Grouping/")
demo_data = read.csv("T_demo.csv")
meds_data =  read.csv("T_meds.csv")
creatinine_data = read.csv("T_creatinine.csv")
glucose_data = read.csv("T_glucose.csv")# 1. Let's check the information about each dataset# a. Demographics data
str(demo_data)
'data.frame': 300 obs. of  4 variables:
 $ id    : int  0 1 2 3 4 5 6 7 8 9 ...
 $ race  : chr  "Unknown" "White" "White" "White" ...
 $ gender: chr  "Male" "Female" "Female" "Male" ...
 $ age   : int  70 71 57 74 50 73 60 62 68 80 ...# b. Medication data
str(meds_data)
'data.frame': 2181 obs. of  5 variables:
 $ id          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ drug        : chr  "atorvastatin" "atorvastatin" "losartan" "losartan" ...
 $ daily_dosage: num  10 10 100 100 100 1000 1000 1000 1000 1000 ...
 $ start_day   : int  19 117 19 403 587 19 131 309 312 467 ...
 $ end_day     : int  109 207 289 493 677 109 281 399 462 557 ...# c. Creatinine data
str(creatinine_data)
'data.frame': 1439 obs. of  3 variables:
 $ id   : int  0 0 0 0 0 0 1 1 1 1 ...
 $ value: num  1.29 1.15 1.44 1.23 1.08 1.23 1.74 1.83 1.81 1.61 ...
 $ time : int  0 107 286 382 580 688 0 68 289 387 ...# d. Glucose data
dim(glucose_data) # dim gives the dimensions of a dataframe
[1] 1556    3

Next, let's subset a few columns from the medication dataset using the DPLYR’s select(), as well as R’s base commands. Remember that the select() operation manipulates the columns, while the rows remain intact

# 2. In the medication data, select a subset of columns - Only ID and Drug from meds data# 2a. Using DPLYR
meds_data_dplyr = meds_data %>% select(id, drug)#2b. Equivalent Base command
meds_data_base = meds_data[,1:2]## Notes
# a. The difference between 2a and 2b is that 2a requires invoking DPLYR libray while 2b requires no additional library.
# b. The symbols '%>%' is called piping operator and will be extensively used in DPLYR. To understand the history, usage and applications of this operator check this excellent artice on DataCamp
# c. Before the pipiing operator(%>%) you need to have the dataframe that needs to be processed - in our case its "meds_data"
# d. The DPLYR command is to select a subset of columns is so succinct and intuitive that it's easy to remember and use.# 3. In the medication data, let's drop the 'start_day' and 'end_day' columns
# 3a. Using DPLYR
meds_data_dplyr = meds_data %>% select(-c(id, drug))#3b. Equivalent Base command
meds_data_base = meds_data[,-c(4,5)]## Notes
# a. The symbol '-' is used to deselect or exclude one or multiple colummns.

Next, let’s focus on filtering the rows of the demographics data using the filter() command within DPLYR and using base commands of R.

# 4. Filtering only to Patients>60 years and NOT of 'Unknown' Race
# 4a. Using DPLYR
demo_data_dplyr = demo_data %>% filter(age > 60 & race != "Unknown")#4b. Equivalent Base command
demo_data_base = demo_data[which(demo_data$age > 60 & demo_data$race != "Unknown"),]## Notes
# a. At first glance, you can see how easy DPLYR makes filtering or rows based on multiple conditions. It's quite intuitive, like we filter in MS Excel
# b. Within filter we can have multiple conditions, like above.Here I'm including only those patients who are above 60 years and do not have "Unknown" as their "Race"

Let’s look in more detail at the medication data. As you can see in the code snippet below, there is more than one row for Patients 0 and 1.

head(meds_data,15)head(meds_data,15) id         drug daily_dosage start_day end_day
1   0 atorvastatin           10        19     109
2   0 atorvastatin           10       117     207
3   0     losartan          100        19     289
4   0     losartan          100       403     493
5   0     losartan          100       587     677
6   0    metformin         1000        19     109
7   0    metformin         1000       131     281
8   0    metformin         1000       309     399
9   0    metformin         1000       312     462
10  0    metformin         1000       467     557
11  0    metformin         1000       551     641
12  1 atorvastatin           20        43     133
13  1 atorvastatin           20        96     126
14  1 atorvastatin           40       295     385
15  1 atorvastatin           40       394     484

Our next task is to use group_by and summarize() on our dataset. These two commands always come together, with summarize() always following group_by.

## 5. Grouping the meds_data by the drug, i.e., only one row per patient and summarizing the number of drugs every patient had been prescribed
medication_data = meds_data_dplyr %>%
                  group_by(id) %>% 
                  summarise(n_of_meds = n()) 
head(medication_data)
     id   n_of_meds
  <int>     <int>
1     0        11
2     1         4
3     2         2
4     3         4
5     4        18
6     5        11## Notes
# a. Here the dataframe that we are trying to process is "meds_data_dplyr". Hence it is seen before the first piping operator
# b. Then we pipe this dataframe into the group_by command. Here we need to group our meds_data by patients id. So, we have group_by(id)
# c. Then we pipe this group_by reslult into summarise. Here n() would count the number of instance of each "id" and provide the sum. "n_of_meds" would be the column name.

Let’s now group Glucose data and extract the latest glucose reading (assuming the data is sorted reverse chronologically)

## 6. Group the Glucose data and pick the latest reading
new_glucose_data = glucose_data %>% 
                   group_by(id) %>% 
                   summarise(latest_reading = first(value))
head(new_glucose_data)     id       latest_reading
  <int>          <dbl>
1     0           6.24
2     1           9.96
3     2           7.24
4     3           6.43
5     4           8.76
6     5           5.97## Notes
# a. The only change from the previous grouping example is the use of "first" here. "first" picks the first datapoint in the column "value" for every patient id. Similarly "last" would pick the last datapoint.

To wrap-up the group_by, let’s do one final exercise — find the least creatinine reading for every patient

## 7. Group the Creatinine data and pick the least reading
new_creat_data = creatinine_data %>%
                 group_by(id) %>% 
                 summarise(least_reading = min(value))
head(new_creat_data)     id    least_reading
  <int>         <dbl>
1     0          1.08
2     1          1.61
3     2          1.89
4     3          0.74
5     4          0.93
6     5          1.52## Notes
# a. The only change from the previous grouping examples is the use of "min" here. "min" picks the least datapoint in the column "value" for every patient id. Similarly "max" would pick the maximum datapoint

Depending on your need, you can use any of these commonly used functions to summarize your data.

Center — mean(), median(), sum()
Spread —sd(), IQR(), mad(), fivenum()
Range — min(), max, quantile()
Position — first(), last(), nth()
Count — n(), n_distinct()
Logical — any(), all()

Exercise 2— join, mutate, & arrange

For this exercise, I would be using the 4 datasets that you may download from this google drive link. As before, let’s explore the datasets before diving into the DPLYR commands.

# 0. Loading necessary libraries and datasets
rm(list = ls())
library(dplyr)
library(openxlsx)
library(tidyr)
setwd("C:/Users/.../Joining_Mutating/")
patient_data = read.csv("patient.demographics.cost.csv")
movement_data = read.xlsx("patient.movement.xlsx", detectDates = T)
disease_data = read.xlsx("patient.disease.xlsx", detectDates = T)# 1a. Let's explore the patient data
str(patient_data)'data.frame': 2500 obs. of  5 variables:
 $ ID        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ total.cost: num  29.1 417.1 856.1 261.9 755.3 ...
 $ birth.year: int  1937 1972 1971 1972 1988 1973 1930 1945 1975 1964 ...
 $ gender    : chr  "Female" "Male" "Male" "Male" ...
 $ ethnicity : chr  "Chinese" "Chinese" "Others" "Others" ...# 1b. Let's find unique number of patients (ID) in each of the datasets
length(unique(patient_data$ID))
[1] 2500 length(unique(movement_data$ID))
[1] 2069length(unique(disease_data$ID))
[1] 875

The patient data seems to be the master dataset with the most number of patients — 2500, while the other two datasets have only a subset of these 2500 patients. Next, let’s add a few new columns to the movement dataset and one new column to the patient data.

# 2. Add a new column in the movement_data dataframe - LOS
movement_data = movement_data %>% 
                mutate(LOS = discharge.date - admission.date)
## Notes
# a. 'mutate' is the DPLYR function that will add a new column-LOS- to the data frame 'movement_data'. LOS is computed by subtracting the admission date and the discharge date.# 3. Add 2 new columns - Outpatient and Inpatient
movement_data = movement_data %>% 
                mutate(IP = ifelse(type == "Inpatient", 1, 0),
                       OP = ifelse(type == "Outpatient", 1, 0))
## Notes
# a. Here we are adding 2 new columns-IP & OP. Computing these columns are not as straightforward as LOS above
# b. A new column 'IP'  is assigned as 1, if the column 'type' is equal to 'Inpatient', else column 'IP would be assinged as 0. 
# c. As mentioned in Part 3 (Control flow) of this series, this is another way of succinctly using 'ifelse'# 4. Add a new column in the patient_data dataframe - Age
patient_data = patient_data %>%
               mutate(Age = 2022 - birth.year)
## Notes
# a. A new column named 'Age' is computed as the difference between the current year and the 'birth.year' column# 5.  Sort the patient data based on the Age - younger to older
patient_data = patient_data %>% arrange(Age)
## Notes
# a. Arrange will sort the data frame based on the column mentioned-in this case the 'Age'

The final operation to learn within DPLYR is the join(). This is nothing but merging two data frames — say A and B — based on a common column. One column on both the data frames we are trying to merge should have the same name. If you’re familiar with SQL, then you would be familiar with the common types of joins mentioned below.

left join: pick all the values of the left data frame — A and match them to the records of the data frame on the right- — B
right join: this is the opposite of right join. pick all the values of the right data frame — B and match them to the records of the data frame on the left- — A
inner join: this join will create a new table based on the common records in A & B.
full join: this join will create a new table based on all the records in A & all the records in B.

To visually understand how each of these joins work, take a look at this Github page.

Let’s get back to our datasets and try to implement left and full joins of movement and disease data frames

# 6. Before joining, let's convert the movement data into patient wise, i.e. only ONE ROW per patient
movement_pw = movement_data %>% group_by(ID) %>% 
              summarise(ALOS = mean(LOS),
              n_ip = sum(IP),
               n_op = sum(OP))# 7. Left joining patient data with the movement_pw data frame
patient_movement_data = left_join(patient_data, movement_pw,
                                  by = "ID")
head(patient_movement_data)  ID total.cost birth.year gender ethnicity Age    ALOS n_ip n_op
1  39    19143.3       1997 Female    Others  25 5.166667 days    5    1
2 123    42965.9       1997 Female   Chinese  25 4.200000 days    5    0
3 156    22161.9       1997   Male    Others  25 5.500000 days    4    0
4 158     8801.0       1997   Male   Chinese  25 4.833333 days    6    0
5 192     1897.4       1997 Female   Chinese  25 5.500000 days    5    1
6 210      307.8       1997   Male    Others  25 2.500000 days    4    0## Notes
# a. Any join command has 3 arguments - dataframe A, dataframe B and by. 
# b. The "by" argument is the common columns in data frames A & B.

Conclusions

This has been a very long post that is deserving of the DPLYR package. We learned all the operations within this package that will aid us in various data processing tasks. I hope you practice these operations to gain a better understanding and appreciation of this package.

In the next post, we will learn strings — how to handle them, process them, and manipulate them.

Data Analysis with R — Part 5 (Data Processing using DPLYR)

Written by Srinath Sridharan

Responses (1)