Machine Learning for Noobs using R and RStudio.

By Farisch Hanoeman, June 11, 2019. LAST UPDATE: July 15, 2019

In this article I will explain how to use R in the RStudio environment to do some basic machine learning. I will begin with the basics of the R language and the RStudio environment. Then I will cover some of the statistical functionality in R and show how to do linear regression and cluster analysis. Before we start, make sure to download and install both R and RStudio.

A brief and practical overview of how to use RStudio.

1. INTRODUCTION

In this article we will use R in the RStudio environment. The image above gives you a brief and practical overview of how to use RStudio. The environment will have different color settings when you first use the program. To get the same color setting go to:

Tools > Global Options… > Appearance > Editor Theme > Cobalt > Apply

To create a new R script, press the button with the white square and the green plus icon in the upper left corner. Variables can be assigned in two ways: with the = operator or with the <- operator. The R community prefers <-, so we will use that. The difference between = and <- has to do with variable scope. Now, let’s create your first variable by typing

a <- 10

in the newly created script and pressing cmd/ctrl + enter. If everything went correctly, you should see a copy of your command in the console (the lower left tab) and the variable a with value 10 in the Environment (the upper right tab).
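The scope difference between = and <- can be seen in a small sketch. It assumes a fresh session in which no variable named x exists yet, and it uses median() only as an arbitrary function that happens to have an argument called x.

```r
# Inside a function call, = only names the argument; no variable is created.
m1 <- median(x = 1:10)
xAfterEquals <- exists("x", inherits = FALSE) # FALSE in a fresh session

# With <-, the value is first assigned to x in the calling environment
# and then passed on to the function.
m2 <- median(x <- 1:10)
xAfterArrow <- exists("x", inherits = FALSE) # TRUE
```

For plain assignments at the top of a script, such as a <- 10, both operators behave the same.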

If you want to comment parts of your code, you can use the # character. The text after this character will not be interpreted by the program as code.

#This is a comment

A handy feature of RStudio is code sections. To create a code section, write #, followed by the name of your section, followed by at least four dashes.

# Section 1 ----

If you want to know more about a function, you can use the ? operator to open its help page. For example:

?c #gives you information about the combine function

2. VARIABLES AND BASIC DATA TYPES

The variable a has a number with value 10 assigned to it. Can we use any name instead of a? No, there are some rules:

1) A variable name can be a combination of letters, digits, periods (.) and underscores (_).
2) A name must start with a letter or a period. If it starts with a period, the period cannot be followed by a digit.
3) Reserved words in R cannot be used as names.

Some examples of valid names:
AmountCandies, userAddress, .fine.with.dot, this_is_acceptable, .

Some invalid names:
@l, 5um, _fine, TRUE, .0ne.

Question: Can you see why the invalid names are invalid?
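One way to check your answers is sketched below. It relies on the base R function make.names(), which rewrites an invalid name into a valid one, so a name that comes back unchanged was already valid. The variable names candidates and isValid are only chosen for this illustration.

```r
candidates <- c("AmountCandies", "userAddress", ".fine.with.dot",
                "this_is_acceptable", "@l", "5um", "_fine", "TRUE", ".0ne")

# A name is valid when make.names() does not have to change it.
isValid <- make.names(candidates) == candidates

# Show each name next to the result of the check.
data.frame(name = candidates, valid = isValid)
```

The first four names come back as valid and the last five as invalid, which matches the two lists above.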

Numeric is one of the basic data types in R. The others are character, integer, logical and complex. Here are some example values for each data type.

numberExample <- 5.6 #numeric
myName <- "Farisch" #character
integerExample <- 5L #integer, note the use of L
logicalExample <- TRUE #logical
complexExample <- 5 + 6i #complex
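If you are unsure which type a value has, you can ask R. The following is a minimal sketch using the base functions class() and typeof(); the variable names examples and classes are only chosen for this illustration.

```r
# One example value per basic data type.
examples <- list(5.6, "Farisch", 5L, TRUE, 5 + 6i)

# class() reports the type of each value.
classes <- sapply(examples, class)
classes # "numeric" "character" "integer" "logical" "complex"

# Without the L suffix a whole number is still stored as a double.
typeof(5)  # "double"
typeof(5L) # "integer"
```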

3. ARITHMETIC OPERATORS

To use data, we need some operators. We will start by looking at some of the basic arithmetic operators in R.

4 + 4 #plus operator
4 - 2 #minus operator
5*3 #multiplication operator
50/4 #division operator
60%%4 #modulo operator
3^2 #exponent operator
16%/%5 #integer division operator

Question: Can you find out what each operator does?

+: Addition
-: Subtraction
*: Multiplication
/: Division
^: Exponent
%%: Modulus (it returns the remainder after the division).
%/%: Integer division (it returns the quotient rounded down to a whole number).
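The modulus and integer division operators belong together: the quotient times the divisor plus the remainder gives back the original number. Here is a small worked sketch; the variable names are only chosen for this illustration.

```r
a <- 16
b <- 5

quotient  <- a %/% b # 3, because 5 fits into 16 three times
remainder <- a %% b  # 1, because 16 - 3 * 5 = 1

# Putting both parts together gives back the original number.
quotient * b + remainder == a # TRUE

# With negative numbers the quotient is rounded down, not towards zero.
negQuotient  <- (-7) %/% 2 # -4
negRemainder <- (-7) %% 2  # 1
```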

4A. DATA STRUCTURES: VECTORS AND LISTS.

The basic data types that we discussed in part 2 can be combined to form data structures. The first data structure that we will discuss is the list. A list is a collection of data in which the elements can be of different types. To create a list, we use the list() function. When all the elements in the collection are of the same type, we call it a vector. We can use the c() function to create a vector.

listExample <- list("Hello", 8, TRUE)

vectorExample <- c(1,2,3,4,5)
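The difference between a list and a vector becomes visible when you feed both the same mixed values. In this small sketch (the names mixedList and mixedVector are only chosen for this illustration) the list keeps every type, while c() converts everything to a single type.

```r
# A list keeps the original type of every element.
mixedList <- list("Hello", 8, TRUE)
listClasses <- sapply(mixedList, class)
listClasses # "character" "numeric" "logical"

# A vector can hold only one type, so c() converts all values to character.
mixedVector <- c("Hello", 8, TRUE)
typeof(mixedVector) # "character"
mixedVector         # "Hello" "8" "TRUE"
```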

In the code below you see several examples of how to create vectors.

x1 <- vector() #creates an empty vector

x2 <- c(1, 2, 3) #combines values 1, 2 and 3 into a vector.

x3 <- c(TRUE, FALSE, TRUE) #combines logical values into a vector.

x4 <- c("Anne", "Bertha", "Cynthia") #combines character values into a vector.

If you want to create a sequence of numbers, you can use the following functions:

x5 <- c(1:100) #creates a vector ranging from 1 to 100

x6 <- seq(1, 100, by=2) #creates a vector ranging from 1 to 99 with steps of 2
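A quick check confirms that the last value of x6 is 99 and not 100. The sketch below also shows two other ways to build a sequence with base R; the names quarters and countdown are only chosen for this illustration.

```r
x6 <- seq(1, 100, by = 2)
length(x6)  # 50
tail(x6, 1) # 99, the last value that does not exceed 100

# Ask for a number of values instead of a step size.
quarters <- seq(0, 1, length.out = 5) # 0.00 0.25 0.50 0.75 1.00

# The colon operator can also count down.
countdown <- 5:1 # 5 4 3 2 1
```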

The combine function c() can also be used to add values to existing vectors.

x2 <- c(x2, 4) #add value 4 to existing vector x2

To examine vectors you can use the following functions:

typeof(x4) #returns the data type within the vector
length(x4) #returns the length of the vector
str(x4) #displays the structure of the vector

We can also apply arithmetic operators to our vectors. Here’s an example:

v1 <- c(1:5)
v1 + v1 #output: 2, 4, 6, 8, 10
v1 - v1 #output: 0, 0, 0, 0, 0
v1 / v1 #output: 1, 1, 1, 1, 1
2 * v1 #output: 2, 4, 6, 8, 10
v1 * v1 #output: 1, 4, 9, 16, 25
v1^2 #output: 1, 4, 9, 16, 25

Question: How would you create the vector "1, 8, 27, 64, 125" from v1?

v1^3
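An expression such as 2 * v1 works because R repeats the shorter operand until it matches the length of the longer one. A minimal sketch of this behavior follows; the names shifted, v2 and alternating are only chosen for this illustration.

```r
v1 <- c(1:5)

# The single value 10 is repeated for every element of v1.
shifted <- v1 + 10 # 11 12 13 14 15

# A shorter vector is repeated as well: here c(1, -1) is used three times.
v2 <- c(1:6)
alternating <- v2 * c(1, -1) # 1 -2 3 -4 5 -6
```

R gives a warning when the length of the longer vector is not a multiple of the length of the shorter one.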

4B. DATA STRUCTURES: DATA FRAMES

Data frames in R are used as a structure for a collection of vectors of equal length. The next example shows you how to create a data frame with three columns named “Person”, “Age” and “hasChildren”.

persons <- c("Alice", "Bob", "Celine", "David")
age <- c(27, 30, 29, 28)
hasChildren <- c(TRUE, TRUE, FALSE, FALSE)

ChildData <- data.frame(Person = persons, Age = age, hasChildren = hasChildren)
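Once the data frame exists, a few base R functions help you check that it looks the way you expect. The sketch below rebuilds ChildData so that it can be run on its own; the names averageAge and parentCount are only chosen for this illustration.

```r
persons <- c("Alice", "Bob", "Celine", "David")
age <- c(27, 30, 29, 28)
hasChildren <- c(TRUE, TRUE, FALSE, FALSE)
ChildData <- data.frame(Person = persons, Age = age, hasChildren = hasChildren)

dim(ChildData)   # 4 3, so four rows and three columns
names(ChildData) # "Person" "Age" "hasChildren"
str(ChildData)   # shows the type of every column

# Each column is a vector, so the vector functions work on it.
averageAge <- mean(ChildData$Age)         # 28.5
parentCount <- sum(ChildData$hasChildren) # 2, TRUE counts as 1
```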

It’s also possible to create a data frame from existing data. To import external .csv data, we can use the read.csv() function. The next example imports data from a file called myData.csv and stores it in a data frame named myData.

myData <- read.csv("myData.csv") #import myData.csv into a data frame myData
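If you do not have a .csv file at hand, you can try the import with a file that you write yourself first. This is a minimal sketch that uses a temporary file; the names csvPath, original and imported are only chosen for this illustration.

```r
# Create a small data frame and write it to a temporary .csv file.
csvPath <- tempfile(fileext = ".csv")
original <- data.frame(Person = c("Alice", "Bob"), Age = c(27, 30))
write.csv(original, csvPath, row.names = FALSE)

# Import the file again; the first line of the file becomes the column names.
imported <- read.csv(csvPath)
str(imported)
```

Setting row.names = FALSE keeps write.csv() from adding an extra column with row numbers to the file.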

R also has some built-in data frames, which we can use. For this example we will use the mtcars dataset.

help(mtcars) #the help function gives us more information about the dataset

To view the first 6 rows of the dataset, we use the head() function. To get a statistical summary of the data, we use the summary() function. To plot all the numerical data, we use the plot() function.

head(mtcars) #gives you the first 6 rows of the dataset
summary(mtcars) #gives you a statistical summary of the dataset
plot(mtcars) #plots every column against every other column

To retrieve data from the dataset, we can use the square bracket operator “[ ]”. Within the square brackets we can specify the row and the column position as follows:

mtcars[1,2] #row 1, column 2 has 6 as value

It is also possible to use row and column names instead of numeric values.

mtcars["Mazda RX4", "cyl"] #same cell as above with value 6

There are several methods to retrieve one column from the data frame. Say, for example, that we want the 9th column of the mtcars dataset, which is named am.

mtcars[[9]] #retrieve by position
mtcars[["am"]] #retrieve by name
mtcars$am #retrieve by using the $ operator
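The three methods return exactly the same vector, which the sketch below checks with identical(). It also shows, as a contrast, that single square brackets keep the result a data frame, and that you can select several rows and columns at once. The names amColumn and firstRows are only chosen for this illustration.

```r
# All three methods return the same vector.
identical(mtcars[[9]], mtcars$am)    # TRUE
identical(mtcars[["am"]], mtcars$am) # TRUE

# Single brackets with only a column name return a data frame with one column.
amColumn <- mtcars["am"]
class(amColumn)  # "data.frame"
class(mtcars$am) # "numeric"

# Rows 1 to 3 of the columns mpg and cyl.
firstRows <- mtcars[1:3, c("mpg", "cyl")]
```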

We can also use the bracket operator to retrieve data with conditional statements. Here, it’s also possible to use the & operator (the AND operator) or the | operator (the OR operator) to combine statements. For example, the following statement gives us all the cars with mpg greater than 21 and cyl greater than 4.

mtcars[mtcars$mpg > 21 & mtcars$cyl > 4,]
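The same selection can also be written with the base R function subset(), which lets you use the column names without the mtcars$ prefix. The sketch below compares both forms and adds an example of the | operator; the thresholds in the OR example and the variable names are only chosen for this illustration.

```r
# The bracket form from above and the equivalent subset() form.
bracketResult <- mtcars[mtcars$mpg > 21 & mtcars$cyl > 4, ]
subsetResult <- subset(mtcars, mpg > 21 & cyl > 4)
rownames(bracketResult) # "Hornet 4 Drive"

# With | a row is kept when at least one of the conditions is TRUE.
orResult <- mtcars[mtcars$mpg > 30 | mtcars$hp > 300, ]
rownames(orResult)
```

Do not forget the comma before the closing bracket: the condition selects the rows, and the empty position after the comma means that all columns are kept.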