Data Analysis with R — Part 2 (Basics of R programming)
Get the fundamentals down and the level of everything you do will rise —
Michael Jordan
This is the second part of a series of posts on the topic of Data Analysis with R. To check out the first part of this series, please click below.
Data Analysis with R — Part 1 (Getting Started)
These are the things you can expect to learn from this post.
- Roadmap to learning languages
- Important data types and data structures in R
Roadmap to learning languages
As with any language, the first step is learning the alphabet, that is, the basics. We then learn to form words from the letters and finally combine those words into sentences using the appropriate grammar.
This is exactly the route we will follow to learn R as well (as illustrated in Fig. 1). In R (or any programming language),
- the letters of the alphabet are the variables,
- words are vectors/data frames/lists, and
- the grammar is the syntax.
A variable is a named container that stores a value so that it can be reused and manipulated later. We give every variable a descriptive name so that the program is easy to understand, both for other readers and for ourselves.
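As a minimal sketch of the idea above, the snippet below stores two values under descriptive names and reuses one of them. The names `planet_count`, `favourite_language` and `doubled_count` are hypothetical and chosen only for illustration.

```r
# Store a number and a piece of text under descriptive (hypothetical) names
planet_count <- 8
favourite_language <- "R"

# A variable can be reused in later expressions
doubled_count <- planet_count * 2

print(doubled_count)              # [1] 16
print(class(favourite_language))  # [1] "character"
```

Both `<-` and `=` assign a value at the top level; `<-` is the more common convention in R code.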
The key to learning any language is practice. So, practice every day and keep using the language until it becomes second nature to you (like your mother tongue). — Yours truly
Important datatypes
The 5 important data types in R are
- Character — text values (strings): single letters, words, or whole sentences
character_vector = c("R", "Python", "Java", "My name is Srinath")
# Any text after a hash (#) is a comment and is not executed by R. Commenting your code is good practice.
- Numeric — positive or negative numbers, with or without decimals
numeric_vector = c(1, 1.34, -1902, 23)
- Logical/Boolean — consists of either TRUE or FALSE (R requires these in upper case)
boolean_vector = c(TRUE, FALSE, FALSE, TRUE)
- Dates — refers to a date, a time, or both. as.Date() stores calendar dates only. Date handling in R deserves a separate post of its own, and I will definitely write one after finishing this series.
christmas_2020 = as.Date("2020-12-25")
my_graduation = as.Date("2012-10-31")
- Factors — refers to a statistical data type used to store categorical variables. These can be either numeric or character. It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models treat both these types differently.
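One way to check which data type R has assigned to a value is the `class()` function. The sketch below reuses the vectors defined in the list above; `size_factor` is a hypothetical example added only to illustrate factors.

```r
# One small example of each data type
character_vector <- c("R", "Python", "Java", "My name is Srinath")
numeric_vector   <- c(1, 1.34, -1902, 23)
boolean_vector   <- c(TRUE, FALSE, FALSE, TRUE)
christmas_2020   <- as.Date("2020-12-25")
my_graduation    <- as.Date("2012-10-31")
size_factor      <- factor(c("small", "large", "small"))  # hypothetical example

class(character_vector)  # "character"
class(numeric_vector)    # "numeric"
class(boolean_vector)    # "logical"
class(christmas_2020)    # "Date"
class(size_factor)       # "factor"

# Dates can be compared like numbers
later <- christmas_2020 > my_graduation   # TRUE

# By default, factor levels are the sorted unique values
size_levels <- levels(size_factor)        # "large" "small"
```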
Exercise — FACTORS
The following is a mock-up exercise to better understand factors. (Adapted from this GitHub page).
Note: Bold text represents R code, which you can copy and paste into your RStudio console. Italicized text within a code snippet is either the output of that command or, when it follows a ‘#’, a comment.
# Sex vector - right now this would be a "character" vector
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
class(sex_vector)
[1] "character"
# 1. Convert sex_vector to a factor
factor_sex_vector <- factor(sex_vector)
class(factor_sex_vector)
[1] "factor" #-> This line is the output
# Define survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
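# Convert survey_vector to factor
factor_survey_vector <- factor(survey_vector)
# 2. Change the factor levels of factor_survey_vector to c("Female", "Male").
# Mind the order of the vector elements here: the existing levels are sorted alphabetically ("F", "M").
levels(factor_survey_vector) = c("Female", "Male")
# relevel() returns a copy with "Male" as the first (reference) level. The result is not assigned here, so factor_survey_vector itself is unchanged.
relevel(factor_survey_vector, ref = "Male")
# 3. Generate summary for survey_vector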
summary(survey_vector)
Length Class Mode
5 character character
# 4. Generate summary for factor_survey_vector
summary(factor_survey_vector)
Female Male
2 3
# Notes
# 1. The summary of a factor gives the number of observations in each level.
# 5. Compare the sexes
# Male
male <- factor_survey_vector[1]
# Female
female <- factor_survey_vector[2]
male > female
[1] NA
# Notes
# 1. This comparison is not possible because no order is defined for the levels of factor_survey_vector. R returns NA along with a warning that '>' is not meaningful for factors. Comparisons work only for ordered factors.
# 6. Ordered factors
# Assign speed_vector with 5 entries, one for each analyst.
# Each entry should be either "slow", "medium", or "fast". Use the list below:
# Analyst 1 is medium,
# Analyst 2 is slow,
# Analyst 3 is slow,
# Analyst 4 is medium and
# Analyst 5 is fast.
# a. Create speed_vector
speed_vector = c("medium", "slow", "slow", "medium", "fast")
# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "medium", "fast"))
# Notes
# 1. The levels argument is where we provide the order of the levels, from lowest to highest.
# b. Factor value for second data analyst
da2 <- factor_speed_vector[2]
# c. Factor value for fifth data analyst
da5 <- factor_speed_vector[5]
# d. Is data analyst 2 faster than data analyst 5?
da2 > da5
[1] FALSE
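Once a factor is ordered, a few more base R functions become useful. The following sketch rebuilds the ordered speed factor from the exercise and shows some of them; the result names (`speed_levels`, `speed_codes`, `fastest`, `at_least_medium`) are hypothetical.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# The levels, from lowest to highest
speed_levels <- levels(factor_speed_vector)      # "slow" "medium" "fast"

# The integer codes R stores internally (position of each value in the levels)
speed_codes <- as.integer(factor_speed_vector)   # 2 1 1 2 3

# Count the observations in each level
table(factor_speed_vector)

# max() and comparisons are defined only because the factor is ordered
fastest <- max(factor_speed_vector)                  # fast
at_least_medium <- factor_speed_vector >= "medium"   # TRUE FALSE FALSE TRUE TRUE
```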
Important data structures
The above-mentioned data types (character, numeric, etc.) can be arranged in quite a number of different structures like vector, data frame, matrix, list, etc. In this post, I’m going to focus only on data frames, as these are the most commonly used, especially when the data is loaded from an Excel file or a database. A data frame has the variables/features of a dataset as columns and the observations as rows. This will be a familiar concept for those who have used different statistical software packages such as SAS or SPSS.
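To make the differences between these structures concrete, here is a small sketch that builds one of each. All names and values (`diameter_vector`, `number_matrix`, `planet_info`, `planets`) are hypothetical and used only for illustration.

```r
# A vector: one dimension, a single data type
diameter_vector <- c(0.382, 0.949, 1)

# A matrix: two dimensions, a single data type
number_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# A list: elements may differ in type and length
planet_info <- list(name = "Earth", moons = 1, has_rings = FALSE)

# A data frame: columns are equal-length vectors, each with its own type
planets <- data.frame(
  name = c("Mercury", "Venus", "Earth"),
  diameter = diameter_vector,
  stringsAsFactors = FALSE
)

dim(number_matrix)   # 2 3
planet_info$name     # "Earth"
nrow(planets)        # 3
ncol(planets)        # 2
```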
Exercise — DATA FRAMES
The following exercise will help in understanding how data frames are constructed from vectors and how their rows and columns can be accessed. (Adapted from this GitHub page)
# Define a few vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# 1. Create a data frame using all of the above vectors
df = data.frame(name, type, diameter, rotation, rings)
# 1a. Check the dimension of the dataframe
dim(df)
[1] 8 5
# Notes
# 1. The first number in the output (8) corresponds to the number of rows and the second number (5) corresponds to the number of columns.
# 1b. Check the structure of the dataframe
str(df)
'data.frame': 8 obs. of 5 variables:
$ name : chr "Mercury" "Venus" "Earth" "Mars" ...
$ type : chr "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
$ diameter: num 0.382 0.949 1 0.532 11.209 ...
$ rotation: num 58.64 -243.02 1 1.03 0.41 ...
$ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
# Notes
# 1. Each entry above corresponds to one column.
# 2. For example, the "name" column is of the character data type, the "diameter" column is numeric, and so on.
# 2. Print out diameter of Mercury (row 1, column 3)
df[1,3]
[1] 0.382
# Notes
# 1. When a data frame is indexed, the first entry is always the row (1 in this case) and the second entry is the column (3 in this case).
# 2. Rows and columns are separated by a comma.
# 3. Print out data for Mars (entire fourth row)
df[4,]
  name               type diameter rotation rings
4 Mars Terrestrial planet    0.532     1.03 FALSE
# 4. Select first 5 values of diameter column
df[1:5, "diameter"]
[1] 0.382 0.949 1.000 0.532 11.209
# 5. Select the rings variable from df
rings_vector <- df[, 5]
# 6. Select planets with diameter < 1
subset(df, subset = diameter<1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
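Besides numeric positions and `subset()`, base R offers two other common ways of reaching into a data frame: the `$` operator and logical indexing. The sketch below uses a shortened, hypothetical version of the planets data frame (`small_df`) so that it runs on its own.

```r
# A shortened version of the planets data, for illustration only
small_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1, 0.532),
  stringsAsFactors = FALSE
)

# Select a column by name with $
diameters <- small_df$diameter

# Logical indexing: keep the rows where the condition is TRUE
small_planets <- small_df[small_df$diameter < 1, ]

# Combine a row filter with a column selection
small_names <- small_df[small_df$diameter < 1, "name"]
small_names   # "Mercury" "Venus" "Mars"
```

The row filter goes before the comma and the column selection after it, following the same row-then-column rule as in the exercise.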
Conclusion
In this post, I’ve introduced the building blocks of the language we are trying to learn — R. These data frames will be back in Part 4 when we learn to import data from Excel/CSV files into R.
In the next post, Part 3, we will look at logical operators and control flow, with examples. These will come in handy when we perform operations like filtering and slicing on data frames. In the meantime, as illustrated in the picture at the beginning of this post, keep practicing this new language, R, to become familiar with it. 😃