Machine Learning for Noobs using R and RStudio.
By Farisch Hanoeman, June 11, 2019. LAST UPDATE: July 15, 2019
In this article I will explain how to use R in the RStudio environment and how to use it to do some basic machine learning. I will begin by explaining some of the basics of the R language and the RStudio environment. Then I will cover some of the statistical functionality in R and explain how to do linear regression and cluster analysis. Before we start, make sure to download and install both R and RStudio.
A brief and practical overview of how to use RStudio.
In this article we will use R in the RStudio environment. The image above gives you a brief and practical overview of how to use RStudio. The environment will have different color settings when you first use the program. To get the same color setting go to:
Tools > Global Options… > Appearance > Editor Theme > Cobalt > Apply
To get a new R script, press the button with the white square and a green plus icon in the upper left corner. Assigning variables can be done in two ways: using the = operator or using <-. The R community prefers <-, so we will use that. At the top level of a script the two behave the same; they differ mainly inside function calls, where = names an argument while <- still assigns a variable. Now, let’s create your first variable by typing
a <- 10
in the newly created script and pressing cmd/ctrl + enter. If everything went correctly, you should see a copy of your command in the console (the lower left pane) and the variable a with value 10 in the Environment (the upper right pane).
If you want to comment parts of your code, you can use the # character. The text after this character will not be interpreted by the program as code.
A handy feature in RStudio is the use of code sections. To create a code section, type #, followed by the name of your section, followed by at least four dashes.
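As a small sketch of comments and code sections together (the section names and the variables radius and area are made up for this illustration):

```r
# Set up input ----
# The line above is a code section header: a comment ending in four dashes.
radius <- 2              # text after the # character is ignored by R

# Compute area ----
area <- pi * radius^2    # only the part before the # is executed
print(area)
```

In RStudio, sections like these should show up in the jump-to menu at the bottom of the script editor, which makes longer scripts easier to navigate.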
If you want to know more about a function, type ? followed by the function's name to open its help page.
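A minimal sketch of looking up help, using mean() only as an arbitrary example function:

```r
?mean           # opens the help page for the mean() function
help("mean")    # the same request written as a regular function call
```

In RStudio the help page appears in the Help tab; when the code is run outside RStudio, the help text is shown in the console or a pager instead.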
2. VARIABLES AND BASIC DATA TYPES
The variable a has a number with value 10 assigned to it. Can we use any name instead of a? No, there are some rules:
1) Variable names can be a combination of letters, digits, periods (.) and underscores (_).
2) A name must start with a letter or a period. If it starts with a period, it cannot be followed by a digit.
3) Reserved words in R cannot be used as identifiers.
Some examples of valid names:
AmountCandies, userAddress, .fine.with.dot, this_is_acceptable.
Some invalid names:
@l, 5um, _fine, TRUE, .0ne.
Question: Can you see why the invalid names are invalid?
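If you want to check your answers, one possible approach (a sketch, not the only way) is to use the base R function make.names(), which rewrites any string into a syntactically valid name. A string that comes back unchanged was already valid. The variable names candidates and isValid are chosen just for this example.

```r
# Candidate names taken from the valid and invalid examples above
candidates <- c("AmountCandies", "userAddress", ".fine.with.dot",
                "this_is_acceptable", "@l", "5um", "_fine", "TRUE", ".0ne")

# make.names() leaves valid names alone and repairs invalid ones
isValid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = isValid))
```

This only tells you which names are invalid; the rules above tell you why.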
The number we assigned to a is stored as numeric, one of the basic data types in R. The others are character, integer, logical and complex. Here are some values for each data type.
myName <- "Farisch" #character
integerExample <- 5L #integer, note the use of L
logicalExample <- TRUE #logical
complexExample <- 5 + 6i #complex
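To see which type a value has, we can ask R with the class() function. A short sketch using the same kind of values as above:

```r
class(10)          # "numeric"
class(5L)          # "integer"
class("Farisch")   # "character"
class(TRUE)        # "logical"
class(5 + 6i)      # "complex"
```

Note that a plain 10 is numeric, not integer; only the L suffix makes it an integer.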
3. ARITHMETIC OPERATORS
To use data, we need some operators. We will start by looking at some of the basic arithmetic operators in R.
4 - 2 #minus operator
50/4 #division operator
60%%4 #modulo operator
3^2 #exponent operator
16%/%5 #integer division operator
Question: Can you find out what each operator does?
%%: Modulus (it returns the remainder after the division).
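Running the operators from this section makes the differences easy to see, in particular between /, %% and %/%:

```r
4 - 2      # 2: subtraction
50 / 4     # 12.5: ordinary division
60 %% 4    # 0: the remainder after dividing 60 by 4
3^2        # 9: 3 raised to the power 2
16 %/% 5   # 3: how many whole times 5 fits into 16
16 %% 5    # 1: what is left over after that
```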
4A. DATA STRUCTURES: VECTORS AND LISTS.
The basic data types that we discussed in part 2 can be combined to form data structures. The first data structure that we will discuss is the list. A list is a collection of data in which the elements can be of different types. To create a list, we use the list() function. When all the elements are of the same type, we call the collection a vector. We can use the c() function to create a vector.
vectorExample <- c(1,2,3,4,5)
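The difference between a list and a vector becomes visible when we mix types. In this sketch (mixedList and mixedVector are names made up for the example), the list keeps each element as it is, while c() converts everything to one common type:

```r
# A list keeps each element's own type
mixedList <- list(1, "a", TRUE)
str(mixedList)

# c() builds a vector, so mixed inputs are converted to a single type
mixedVector <- c(1, "a", TRUE)
class(mixedVector)   # "character"
```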
In the code below you see several examples of how to create vectors.
x2 <- c(1, 2, 3) #combines values 1, 2 and 3 into a vector.
x3 <- c(TRUE, FALSE, TRUE) #combines logical values into a vector.
x4 <- c("Anne", "Bertha", "Cynthia") #combines character values into a vector.
If you want to create a sequence of numbers, you can use, for example, the seq() function:
x6 <- seq(1, 100, by=2) #creates a vector ranging from 1 to 99 with steps of 2
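Here is a short sketch of a few ways to build sequences. The colon operator and the length.out argument are standard base R; the variable names are only for illustration.

```r
oneToTen   <- 1:10                        # the colon operator steps by 1
oddNumbers <- seq(1, 100, by = 2)         # 1, 3, 5, ..., 99
fivePoints <- seq(0, 1, length.out = 5)   # 0, 0.25, 0.5, 0.75, 1

length(oddNumbers)    # 50
tail(oddNumbers, 1)   # 99, because 100 cannot be reached in steps of 2
```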
The combine function c() can also be used to add values to existing vectors.
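A minimal sketch of adding values to an existing vector (scores is a hypothetical name):

```r
scores <- c(1, 2, 3)
scores <- c(scores, 4, 5)   # append two values at the end
scores <- c(0, scores)      # values can also be added in front
scores                      # 0 1 2 3 4 5
```

Note that c() returns a new vector, so the result has to be assigned back to the variable.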
To examine vectors you can use the following functions:
length(x4) #returns the length of the vector
str(x4) #returns the structure of the vector
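To give an idea of what these functions return, here is a sketch on a character vector with the same values as x4 (stored under the hypothetical name names4 so the example runs on its own):

```r
names4 <- c("Anne", "Bertha", "Cynthia")
length(names4)   # 3
class(names4)    # "character"
str(names4)      # prints the type, the index range and the first values
```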
We can also apply arithmetic operators to our vectors. Here’s an example:
v1 <- c(1, 2, 3, 4, 5) #the vector used in the examples below
v1 + v1 #output: 2, 4, 6, 8, 10
v1 - v1 #output: 0, 0, 0, 0, 0
v1 / v1 #output: 1, 1, 1, 1, 1
2 * v1 #output: 2, 4, 6, 8, 10
v1 * v1 #output: 1, 4, 9, 16, 25
v1^2 #output: 1, 4, 9, 16, 25
Question: How would you create the vector "1, 8, 27, 64, 125" from v1?
4B. DATA STRUCTURES: DATA FRAMES
Data frames in R are used as a structure for a collection of vectors of equal length. The next example shows you how to create a data frame with three columns named “Person”, “Age” and “hasChildren”.
persons <- c("Alice", "Bob", "Celine", "David")
age <- c(27, 30, 29, 28)
hasChildren <- c(TRUE, TRUE, FALSE, FALSE)
ChildData <- data.frame(Person = persons, Age = age, hasChildren = hasChildren)
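Once a data frame exists, a few base R functions help to check that it looks the way we expect. The sketch below rebuilds the same data under the hypothetical name people so that it runs on its own:

```r
people <- data.frame(
  Person = c("Alice", "Bob", "Celine", "David"),
  Age = c(27, 30, 29, 28),
  hasChildren = c(TRUE, TRUE, FALSE, FALSE)
)

dim(people)        # 4 3: four rows and three columns
names(people)      # "Person" "Age" "hasChildren"
str(people)        # one line per column with its type
mean(people$Age)   # 28.5
```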
It’s also possible to create a data frame from existing data. To import external .csv data we can use the read.csv() function. For example, myData <- read.csv("myData.csv") imports the data from a .csv file called myData.csv and stores it in a data frame with the name myData.
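Since you may not have a myData.csv file at hand, the following sketch first writes a tiny, made-up CSV file to a temporary folder and then reads it back. The file contents and the csvPath variable are assumptions for the sake of a runnable example.

```r
# Write a small hypothetical CSV file to a temporary location
csvPath <- file.path(tempdir(), "myData.csv")
writeLines(c("Person,Age", "Alice,27", "Bob,30"), csvPath)

# read.csv() treats the first line as column names by default
myData <- read.csv(csvPath)
str(myData)
```

With your own data, you would pass the path to your file instead of csvPath.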
R also has some built-in data frames, which we can use. For this example we will use the mtcars dataset.
To view the first 6 rows of the dataset, we use the head() function. To get a statistical summary of the data, we use the summary() function. To plot all the numerical data, we use the plot() function.
head(mtcars) #shows the first 6 rows of the dataset
summary(mtcars) #gives you a statistical summary of the dataset
plot(mtcars) #plots all the numerical data
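To retrieve data from the dataset, we can use the square bracket operator “[ ]”. Within the square brackets we specify the row position first and the column position second, separated by a comma, as in mtcars[row, column]. Leaving one of the two empty selects all rows or all columns.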
It is also possible to use row and column names instead of numeric values.
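A few sketched examples of bracket indexing on mtcars, by position and by name:

```r
mtcars[1, 2]                 # row 1, column 2: the cyl value of the first car
mtcars[1:3, ]                # rows 1 to 3, all columns
mtcars[, 1:2]                # all rows, columns 1 and 2
mtcars["Mazda RX4", "mpg"]   # the same idea with a row name and a column name
```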
There are several methods to retrieve one column from the data frame. Let’s say for example that we want the 9th column from the mtcars dataset with name am.
mtcars[["am"]] #retrieve by name
mtcars$am #retrieve by using the “$” operator
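The different methods return the same vector, which we can verify with identical(). In this sketch, byPosition, byName and byDollar are names chosen only for the comparison:

```r
byPosition <- mtcars[, 9]       # the 9th column, by position
byName     <- mtcars[["am"]]    # by name
byDollar   <- mtcars$am         # by the $ operator

names(mtcars)[9]                # "am"
identical(byPosition, byName)   # TRUE
identical(byName, byDollar)     # TRUE
```

Which one to use is mostly a matter of taste; the $ form is the shortest to type, while [[ ]] also works when the column name is stored in a variable.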
We can also use the bracket operator to retrieve data with conditional statements. Here, it’s also possible to use the & operator (the AND operator) or the | operator (the OR operator) to combine statements. For example, mtcars[mtcars$mpg > 21 & mtcars$cyl > 4, ] gives us all the cars with mpg greater than 21 and cyl greater than 4.
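As a sketch of both operators side by side (fastAndBig and eitherOne are made-up names), the condition goes in the row position and the column position is left empty so that all columns are kept:

```r
# Rows where both conditions hold (AND)
fastAndBig <- mtcars[mtcars$mpg > 21 & mtcars$cyl > 4, ]
fastAndBig

# Rows where at least one of the conditions holds (OR)
eitherOne <- mtcars[mtcars$mpg > 21 | mtcars$cyl > 4, ]
nrow(eitherOne)
```

The AND version is the stricter filter, so it can never return more rows than the OR version.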