# Machine Learning for Noobs using R and RStudio.

By Farisch Hanoeman, June 11, 2019. LAST UPDATE: July 15, 2019

In this article I will explain how to use R in the RStudio environment to do some basic machine learning. I will begin with some of the basics of the R language and the RStudio environment. Then I will cover some statistical functionality in R and explain how to do logistic regression, principal component analysis and cluster analysis. Before we start, make sure to download and install both R and RStudio.

A brief and practical overview of how to use RStudio.

# R and RStudio

## 1. INTRODUCTION

In this article we will use R in the RStudio environment. The image above gives you a brief and practical overview of how to use RStudio. The environment will have different color settings when you first use the program. To get the same color setting go to: **Tools > Global Options… > Appearance > Editor Theme > Cobalt > Apply**

To get a new R Script, press the button with the white square and a green plus icon in the upper left corner.

Assigning variables can be done in two ways: using the **=** operator or using **<-**. The R community prefers **<-**, so we will use that. The difference between **=** and **<-** has to do with variable scope and you can read more about it here.

Now, let’s create your first variable by typing **a <- 10** in the script and pressing **cmd/ctrl + enter**. If everything went correctly, you should see a copy of your command in the console (the lower left tab) and the variable **a** with value 10 in the **Environment** (the upper right tab).

If you want to comment parts of your code, you can use the **#** character. The text after this character will not be interpreted by the program as code. To divide your script into sections, type **#**, followed by the name of your section, followed by four dashes.

If you want to know more about a function, put **?** in front of its name to get more information.
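The basics from this introduction can be tried out in one short script. This is only a minimal sketch: the section names and the variables **b** and **total** are made up for illustration.

```r
# My first script ----
# The line above is a section header: "#", a name, then four dashes.

a <- 10   # assign 10 to a with the preferred <- operator
b = 5     # = also assigns at the top level of a script

# Everything after "#" on a line is ignored by R.
total <- a + b
print(total)

# Getting help ----
# Typing ?mean in the RStudio console opens the help page for mean().
```

Select the lines and press cmd/ctrl + enter to run them; **a**, **b** and **total** should then show up in the Environment tab.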

## 2. VARIABLES AND BASIC DATA TYPES

The *variable* **a** has a *number* with value **10** assigned to it. Can we use any name instead of **a**? No, there are some rules:

1) Variable names can be a combination of letters, digits, periods (.) and underscores (_).

2) It must start with a letter or a period. If it starts with a period, it cannot be followed by a digit.

3) Reserved words in R cannot be used as identifiers.

Some examples of valid names:

**AmountCandies**, **userAddress**, **.fine.with.dot**, **this_is_acceptable**.

Some invalid names:

**@l**, **5um**, **_fine**, **TRUE**, **.0ne**.

Question: Can you see why the invalid names are invalid?
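If you want R to check the names for you, one possible approach is the base function **make.names()**, which leaves a valid name untouched and rewrites an invalid one. The sketch below compares each candidate with its rewritten version; the variable names **candidates** and **isValid** are only chosen for this illustration.

```r
# Candidate names taken from the valid and invalid examples above
candidates <- c("AmountCandies", "userAddress", ".fine.with.dot",
                "this_is_acceptable", "@l", "5um", "_fine", "TRUE", ".0ne")

# make.names() returns a syntactically valid version of each name,
# so a name is valid exactly when it comes back unchanged
isValid <- candidates == make.names(candidates)

# Show each name next to the result of the check
data.frame(name = candidates, valid = isValid)
```

The first four names come back as valid and the last five as invalid, which matches the lists above.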

The *number* (R calls this type *numeric*) is one of the basic data types that R has. The others are *character*, *integer*, *logical* and *complex*. Here are some values for each data type.

myName <- "Farisch" #character

integerExample <- 5L #integer, note the use of L

logicalExample <- TRUE #logical

complexExample <- 5 + 6i #complex
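To see which data type a value has, you can ask R with the **class()** function. The sketch below repeats the examples from this section; **numberExample** is an extra variable added here for the numeric type.

```r
numberExample  <- 10         # numeric
myName         <- "Farisch"  # character
integerExample <- 5L         # integer
logicalExample <- TRUE       # logical
complexExample <- 5 + 6i     # complex

class(numberExample)   # "numeric"
class(myName)          # "character"
class(integerExample)  # "integer"
class(logicalExample)  # "logical"
class(complexExample)  # "complex"
```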

## 3. ARITHMETIC OPERATORS

To use data, we need some operators. We will start by looking at some of the basic arithmetic operators in R.

4 - 2 #minus operator

5 * 3 #multiplication operator

50 / 4 #division operator

60 %% 4 #modulo operator

3 ^ 2 #exponent operator

16 %/% 5 #integer division operator

##### Question: Can you find out what each operator does?

+: Addition; -: Subtraction; *: Multiplication; /: Division; ^: Exponent; %%: Modulus (it returns the remainder after the division); %/%: Integer division (it returns the quotient without the remainder).
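You can check the answers by running the operators and looking at the results. The sketch below collects them in a named vector; the names and the variable **results** are only there to make the output readable.

```r
# Each operator applied to the same numbers as in the examples above
results <- c(
  subtraction     = 4 - 2,     # 2
  multiplication  = 5 * 3,     # 15
  division        = 50 / 4,    # 12.5
  modulo          = 60 %% 4,   # 0, because 60 is divisible by 4
  exponent        = 3 ^ 2,     # 9
  integerDivision = 16 %/% 5   # 3, because 5 fits three times into 16
)
print(results)

# Integer division and modulo together give back the original number
(16 %/% 5) * 5 + 16 %% 5 == 16   # TRUE
```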

## 4A. DATA STRUCTURES: VECTORS AND LISTS.

The basic data types that we discussed in part 2 can be combined to form **data structures**. The first data structure that we will discuss is the **list**. A list is a collection of data that can consist of several types. To create a list, we use the **list()** function. When all the data in the collection is of the same type, we call it a **vector**. We can use the **c()** function to create a vector.

vectorExample <- c(1,2,3,4,5)
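The difference between a list and a vector shows up when you mix types. In the sketch below (the names **mixedList** and **mixedVector** are made up for this example), the list keeps each value's own type, while **c()** converts everything to one common type.

```r
# A list can hold values of different types side by side
mixedList <- list(10, "Farisch", TRUE)
sapply(mixedList, class)   # "numeric" "character" "logical"

# A vector holds one type only, so c() converts all values to character here
mixedVector <- c(10, "Farisch", TRUE)
class(mixedVector)         # "character"
mixedVector                # "10" "Farisch" "TRUE"
```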

In the code below you see several examples of how to create vectors.

x2 <- c(1, 2, 3) #combines values 1, 2 and 3 into a vector.

x3 <- c(TRUE, FALSE, TRUE) #combines logical values into a vector.

x4 <- c("Anne", "Bertha", "Cynthia") #combines character values into a vector.

If you want to create a sequence of numbers, you can use the following functions:

x6 <- seq(1, 100, by=2) #creates a vector ranging from 1 to 99 with steps of 2

The **combine function c()** can also be used to add values to existing vectors.
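A minimal sketch of adding values with **c()**, reusing the vector **x2** from above: the existing vector is passed as one of the arguments and the result is assigned back to the same name.

```r
x2 <- c(1, 2, 3)

# Add values at the end of the existing vector
x2 <- c(x2, 4, 5)

# Add a value at the front of the existing vector
x2 <- c(0, x2)

x2          # 0 1 2 3 4 5
length(x2)  # 6
```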

To examine vectors you can use the following functions:

length(x4) #returns the length of the vector

str(x4) #returns the structure of the vector

We can also apply arithmetic operators to our vectors. Here’s an example, where **v1** is the vector 1, 2, 3, 4, 5:

v1 <- c(1, 2, 3, 4, 5)

v1 + v1 #output: 2, 4, 6, 8, 10

v1 - v1 #output: 0, 0, 0, 0, 0

v1 / v1 #output: 1, 1, 1, 1, 1

2 * v1 #output: 2, 4, 6, 8, 10

v1 * v1 #output: 1, 4, 9, 16, 25

v1^2 #output: 1, 4, 9, 16, 25

##### Question: How would you create the vector "1, 8, 27, 64, 125" from v1?

v1^3
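As a quick check of this answer, the sketch below assumes **v1** is the vector 1, 2, 3, 4, 5 and compares the exponent with repeated multiplication. The variable **cubes** is only used for this illustration.

```r
v1 <- c(1, 2, 3, 4, 5)

# Operators work element by element, so each value is cubed separately
cubes <- v1^3
print(cubes)

# Multiplying the vector with itself twice gives the same values
all.equal(cubes, v1 * v1 * v1)   # TRUE
```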

## 4B. DATA STRUCTURES: DATA FRAMES

Data frames in R are used as a structure for a collection of vectors of equal length. The next example shows you how to create a data frame with three columns named “Person”, “Age” and “hasChildren”.

persons <- c("Alice", "Bob", "Celine", "David")

age <- c(27, 30, 29, 28)

hasChildren <- c(TRUE, TRUE, FALSE, FALSE)

ChildData <- data.frame(Person = persons, Age = age, hasChildren = hasChildren)
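After creating a data frame it helps to check that it looks the way you expect. The sketch below rebuilds the same data frame and inspects it with a few base functions.

```r
persons <- c("Alice", "Bob", "Celine", "David")
age <- c(27, 30, 29, 28)
hasChildren <- c(TRUE, TRUE, FALSE, FALSE)

ChildData <- data.frame(Person = persons, Age = age, hasChildren = hasChildren)

print(ChildData)  # shows the data frame as a table
dim(ChildData)    # 4 3: four rows (one per person) and three columns
names(ChildData)  # "Person" "Age" "hasChildren"
str(ChildData)    # shows the data type of each column
```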

It’s also possible to create a data frame from existing data. To import external .csv data we can use the **read.csv()** function. The next example imports data from a .csv file called myData.csv. After the data is imported, it’s stored in a data frame with name myData.

myData <- read.csv("myData.csv") #imports myData.csv into a data frame named myData
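If you don't have a .csv file at hand, you can try the import with a file you write yourself first. The sketch below is only an illustration: the columns **id** and **score**, the variable names **exampleData** and **csvPath**, and the choice to write the file into R's temporary folder are all assumptions made for this example.

```r
# Build a small data frame and write it to a .csv file in the temporary folder
exampleData <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
csvPath <- file.path(tempdir(), "myData.csv")
write.csv(exampleData, csvPath, row.names = FALSE)

# Import the file again; the result is a data frame
myData <- read.csv(csvPath)
str(myData)
```

Setting **row.names = FALSE** keeps the row numbers out of the file, so the imported data frame has the same two columns as the original.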

R also has some built-in data frames, which we can use. For this example we will use the mtcars dataset.

To view the first 6 rows of the dataset, we use the **head()** function. To get a statistical summary of the data, we use the **summary()** function. To plot all the numerical data, we use the **plot()** function.

head(mtcars) #shows the first 6 rows of the dataset

summary(mtcars) #gives you a statistical summary of the dataset

plot(mtcars) #plots all the numerical data
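A few more ways to get a first impression of mtcars are sketched below. The argument **n** of **head()** and the functions **dim()** and **names()** are part of base R, but they are an addition to the functions discussed above.

```r
head(mtcars)         # first 6 rows (the default)
head(mtcars, n = 3)  # first 3 rows only

dim(mtcars)          # 32 11: 32 cars (rows) and 11 variables (columns)
names(mtcars)        # the column names, starting with "mpg" and "cyl"
```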

To retrieve data from the dataset, we can use the **square bracket operator “[ ]”**. Within the square brackets we specify the row position first and the column position second, separated by a comma: **mtcars[row, column]**.

It is also possible to use row and column names instead of numeric values.
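The sketch below shows a few ways to use the square brackets on mtcars, both with positions and with names. Leaving the row or the column part empty selects all rows or all columns.

```r
mtcars[1, 2]                # row 1, column 2: the cyl value of the first car
mtcars["Mazda RX4", "cyl"]  # the same cell, selected by row and column name

mtcars[1:3, ]               # first three rows, all columns
mtcars[, c("mpg", "cyl")]   # all rows, only the columns mpg and cyl
```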

There are several methods to retrieve one column from the data frame. Let’s say for example that we want the 9th column from the mtcars dataset with name **am**.

mtcars[["am"]] #retrieve by name

mtcars$am #retrieve by using the "$" operator
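The different methods return the same values. The sketch below checks this for the **am** column; the variable names are only chosen for this illustration, and retrieving by position or with single brackets are additional base R options.

```r
byPosition <- mtcars[[9]]      # 9th column, retrieved by position
byName     <- mtcars[["am"]]   # retrieved by name
byDollar   <- mtcars$am        # retrieved with the $ operator
byBracket  <- mtcars[, "am"]   # all rows of the column am

identical(byPosition, byName)  # TRUE
identical(byName, byDollar)    # TRUE
identical(byDollar, byBracket) # TRUE
```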

We can also use the bracket operator to retrieve data with conditional statements. Here, it’s also possible to use the **&** operator (the AND operator) or the **|** operator (the OR operator) to combine statements. For example, the following statement gives us all the cars with mpg greater than 21 and cyl greater than 4.
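A sketch of how such conditional statements can look is shown below. The condition goes in the row position of the square brackets, and the empty column position keeps all columns. The variable names **selected** and **eitherCondition**, and the second condition with **|**, are made up for this example.

```r
# AND: cars with mpg greater than 21 and cyl greater than 4
selected <- mtcars[mtcars$mpg > 21 & mtcars$cyl > 4, ]
print(selected)

# OR: cars with mpg greater than 30 or with exactly 8 cylinders
eitherCondition <- mtcars[mtcars$mpg > 30 | mtcars$cyl == 8, ]
nrow(eitherCondition)
```

Each comparison gives one TRUE or FALSE per row, and only the rows where the combined result is TRUE are returned.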