August 29, 2022

Learning R for Data Science

This is a guide for learning R for data science.

RStudio Tutorial
Creating Functions
Doing Math in R
Doing Statistics in R

RStudio Tutorial

RStudio is a free, open source IDE (integrated development environment) for R. It runs R commands and scripts and adds a lot of functionality on top of plain R, which makes it very useful for beginners.

Opening View

When you first open RStudio, you should see 4 rectangles. That is your interface, and each panel does something different.

The top left panel is the source panel, where you will write, save, and run scripts. You can save from this panel and easily access scripts without having to leave RStudio. The bottom left is the R console, where you can see the output of your scripts or commands.

The panel on the upper right is the environment panel. It shows you the variables that are available and what is in your working memory. The panel on the lower right has various functions including help, plots, and accessing your files.

Configuring the Interface

You can configure the look of RStudio from the main menu. Go to Tools, then Global Options, and click on “Pane Layout” to dictate what you want to see. Under each of the 4 sections you can select the different tabs that will appear. This will depend on your needs. If you do not know what you want right now, just leave it on the default. It will always be there if you want to play with it later. You can also drag the borders between the 4 panels so you can get the size and appearance you want.

Importing Data

One of the first and easiest ways to open a dataset is to look in the top right panel, click the “environment” tab, then select “import dataset”. You will see that you can open up text files, Excel sheets, and other types of data sets.

First Tasks

If you create a couple of vectors and plot them, you will see several nice things that RStudio does for you.

x=1:50

y=51:100

plot(x,y)

Under the “environment” tab at top right, you will see your variables that are in memory. This is great because sometimes we are doing many things at once and want to know we have the right variables.

In the bottom right, we see the plot we just made. We do not have to have a separate window or go anywhere else for it. We can save the plot by clicking the “export” tab and selecting the option we want.

Script Panel

The script panel is in the upper left. It lets you load and save scripts in one place.

RStudio can help with creating and managing R scripts. To create a new script, go to

File
New File
R Script

Enter some commands, then go to File and click Save.

We can also create a Project. This allows you to manage all your files and scripts related to this project in one place.

Console Panel

The console is the bottom left panel. This lets you interactively enter commands and work with variables.

Environment Panel

The environment panel is the upper right panel. It automatically shows the variables you have created, along with their current values. You can click the “broom” icon to clear the environment.
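You can get the same information from the console. Here is a minimal sketch using the base R functions ls() and rm(). The variable names are just examples.

```r
# Create two example variables
x = 1:50
y = 51:100

# List the names of the variables currently in memory
ls()

# Remove a single variable by name
rm(y)

# Check that y is gone and x is still there
"y" %in% ls()
"x" %in% ls()

# rm(list = ls()) would remove everything, much like the broom icon
```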

File and Plot Panel

This is on your bottom right. It has several tabs by default. It will show you a file tree, the plots you make on the console, and a help section to give you hints.

Creating Functions

Vectorized functions are very useful in R programming. They work on every value in a vector at once, which makes them a quick and easy way to analyze data. Here I want to go through them in some depth.

Vectorizing Your Functions

Your first step in using vectorized functions is to create a vector of values. Let us make a vector using the high temperatures of Cincinnati, OH for the next few days. I don’t live there, but the temperatures are so nice, I wish I did. We need to make a vector name and then use the c() function. It looks like this:

Temps = c()

Now I will use my weather program to find the highs for the next few days and input those into the function. Use commas to separate values.

Temps = c(77,79,79,82,68,72)

With these values in hand, we can always call the vector name to see them again. This is useful because the values could change. Do it like this:

Temps

We could get the sum of the temperatures if we wanted, though with temperatures that is not exactly useful. However, you should know how to use the sum function for other kinds of numerical data.

sum(Temps)

That will give you a sum of values.
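Because these functions are vectorized, the same vector can go straight into other functions or into plain arithmetic. This is a short sketch that assumes the temperatures are in Fahrenheit. The name Celsius is made up for this example.

```r
Temps = c(77, 79, 79, 82, 68, 72)

# mean() is usually more meaningful than sum() for temperatures
mean(Temps)

# Arithmetic is applied to every value in the vector at once
Celsius = (Temps - 32) * 5 / 9
round(Celsius, 1)
```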

We can also use functions that work with string data, such as names, and then pass the result of one function to another function.

Let me use some Magic the Gathering cards I have for examples.

What I will do here is list the first part of the card name for a few different cards. Then I will do the same for the last part of the card name. Lastly, I will paste the second part to the first part.

first=c("Anointed", "Shivan", "Llanowar")

second=c("Peacekeeper", "Reef", "Wastes")

Now we can use the paste() function on our vector.

paste(first, second)

Those are the actual names of the cards I am looking at. The point is that paste() works through both vectors one position at a time, and that the result of one function can be handed to another function.
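Here is a small sketch of that idea. The sep argument and the nchar() function are both part of base R. The variable name full is just something I picked for the example.

```r
first = c("Anointed", "Shivan", "Llanowar")
second = c("Peacekeeper", "Reef", "Wastes")

# paste() joins the two vectors position by position, with a space by default
full = paste(first, second)
full

# sep changes what goes between the two parts
paste(first, second, sep = "_")

# The result of paste() can be passed to another function, such as nchar()
nchar(full)
```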

Arguments

Most functions will allow you to use arguments. They let you tell the function exactly how to behave. This is called passing a value to a function. If you know other programming languages, this is not a foreign concept. Most functions across languages work like this. You can have arguments with default values and some without default values.
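To make this concrete, here is a short sketch. round() is a base R function whose digits argument defaults to 0. The function repeat_value() is made up for this example and is not part of R.

```r
# digits has a default value, so it can be left out
round(3.14159)
round(3.14159, digits = 2)

# A made-up function: x has no default, times has a default of 2
repeat_value = function(x, times = 2) {
  rep(x, times)
}

repeat_value(5)             # uses the default for times
repeat_value(5, times = 3)  # overrides the default
```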

Command History

R keeps track of all the commands you use in a particular session. The purpose of this functionality is to let you see what you have done and let you easily repeat something. You can look at the history that you have typed by using the up arrow within your console. You can hit enter when you see something you want to repeat. This runs the same command again.

You can save the history of your commands with the “savehistory()” function. This will automatically save them in a file called .Rhistory, but if you want to specify a different file you can do this:

savehistory(file="modernmagic.Rhistory")

You can look at the file with any text editor. You can also load a previous history file.

loadhistory("legacymagic.Rhistory")

Adding Comments

Comments are good to use because they help with readability. You can use them to indicate who wrote a piece of code and what they were intending. It is also good to explain why something is there in the first place. This is done with the pound symbol (#).

# this line is a comment and does nothing else

You can place comments at the beginning of a line or any place in the line itself. To do multiple lines of comments, every line will need a pound symbol.
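Here is a small sketch showing both placements. The variable names are made up for this example.

```r
# Convert a temperature from Fahrenheit to Celsius.
# Every comment line needs its own pound symbol.
fahrenheit = 77
celsius = (fahrenheit - 32) * 5 / 9  # a comment can also follow code on the same line
celsius
```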

Packages

Anyone can write functions in R and share them with others. A bundle of shared functions is called a package. There are a few different website repositories that contain collections of these packages. The most important is CRAN: https://cran.r-project.org

You install packages by using the command:

install.packages()

The name of the package is the argument of the function.

install.packages("magic")

Once the package is installed, you have to load it in order to use it. You do it like this:

library("magic")

After this step, you can officially use the commands of this package. The library is the directory where your packages are installed.
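If you are not sure whether a package is already installed, you can check before loading it. This is only a sketch, using the base R function requireNamespace(). The variable name pkg is made up for this example.

```r
pkg = "magic"

if (requireNamespace(pkg, quietly = TRUE)) {
  # The package is installed, so load it
  library(pkg, character.only = TRUE)
} else {
  # The package is missing, so say what to do instead of failing
  message("Package '", pkg, "' is not installed. Run install.packages(\"", pkg, "\") first.")
}
```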

Doing Math in R

Working in R means doing a lot of calculations. This is what R is for and why it is called a statistical programming language. You can do simple math and work with vectors. In fact, there are a few different categories. They are:

Arithmetic
Functions
Vectors
Matrices

The arithmetic operators should be familiar to everyone. These are the basic math operators everyone learned when they were kids.

x + y adds y to x
x - y subtracts y from x
x * y multiplies x by y
x / y divides x by y
x ^ y raises x to the power of y
x %% y gives the remainder of x divided by y
x %/% y divides x by y and rounds the result down
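Here is each operator in action with the numbers 7 and 2, as a quick sketch you can paste into the console:

```r
7 + 2    # 9
7 - 2    # 5
7 * 2    # 14
7 / 2    # 3.5
7 ^ 2    # 49
7 %% 2   # 1, the remainder
7 %/% 2  # 3, the division rounded down
```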

Let’s move to the next section, mathematical functions. These are the traditional algebraic functions and they work the same way.

abs(x) takes the absolute value of x
log(x, base=y) takes the logarithm of x with base y
exp(x) returns the exponential of x
sqrt(x) returns the square root of x
factorial(x) returns the factorial of x (written x!)
choose(x,y) returns the number of possible combinations when drawing y elements from x possibilities

You can take the log of a number like this:

log(1.5)

You can take the log of a series of numbers:

log(2:4)

That function takes the natural log of the numbers 2,3,4

You can specify a base:

log(2:4, base=4)

The other functions work similarly. I will get into them more when we need to.
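For reference, here is a quick sketch that calls each of the functions from the list once, with numbers I picked because the answers are easy to check by hand:

```r
abs(-4.2)         # 4.2
sqrt(81)          # 9
factorial(5)      # 120
choose(5, 2)      # 10
log(8, base = 2)  # 3
exp(log(10))      # 10, because exp() undoes the natural log
```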

You can round numbers easily in R. You just use the ‘round()’ function.

round(454567.2333445, digits=3)

Significant digits can be done just as easily.

signif(333.334455, digits=3)

Trig functions are also available. R expects the angle to be in radians. So, if your angle is in degrees, you will have to convert it first. I will show you how, though.

cos(120)

This treats 120 as an angle in radians.

cos(95 * pi / 180)

This converts 95 degrees to radians first, so it gives the cosine of 95 degrees.
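R's trig functions work in radians, so if you convert often it can help to wrap the conversion in small helper functions. This is just a sketch. The names deg_to_rad() and rad_to_deg() are made up and are not part of R.

```r
# Made-up helpers for converting between degrees and radians
deg_to_rad = function(degrees) degrees * pi / 180
rad_to_deg = function(radians) radians * 180 / pi

# Cosine of 60 degrees, which is approximately 0.5
cos(deg_to_rad(60))

# acos() returns radians, so convert the answer back to degrees
rad_to_deg(acos(0.5))
```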

Working With Vectors

A vector is a one-dimensional set of values. It looks like this:

x=c(1,2,3,4,5,6,7,8)

All the values in a vector have to be the same type, such as integers.

There is a function, ‘str()’, that lets you look at any particular vector and see its properties. Use it like this:

str(x)

To see the length of a vector:

length(x)

Vectors can be several different types:

Numeric
Integer
Logical
Character
Datetime
Factors

You can test a vector to see what kind it is:

is.numeric(x)

is.character(x)

is.logical(x)

Those are all separate tests to determine what kind of vector you have. You will get the output TRUE when you have a match.

To create vectors you can enter in numbers or use a sequence of numbers. I will show you how to do both. It is common to assign your data to a variable so it is easy to work on it. The variable here is ‘x’.
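Here is a short sketch of those tests, along with what happens when you mix types in one vector. class() is a base R function. The name mixed is made up for this example.

```r
x = c(1, 2, 3)
class(x)         # "numeric"
is.numeric(x)    # TRUE
is.character(x)  # FALSE

# Mixing types forces every value into one common type
mixed = c(1, "two", TRUE)
class(mixed)     # "character"
mixed
```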

x=c(22,33,44,55,66,77,88,99)

The ‘c’ is a function itself and combines the numbers in the parentheses to make a vector.

You can also use the colon operator to create a sequence of numbers.

x=c(2:7)

This creates a vector with the numbers 2, 3, 4, 5, 6, and 7.

You can include negative numbers too.

x=c(7:-2)

R lets you combine vectors when you need to. If we have:

x=c(1:9)

y=c(11:16)

We can combine them like this:

total=c(x,y)

We can repeat vectors too. We do this with the ‘rep()’ function.

If we want to repeat a vector a set number of times, we do this:

rep(c(1:9), times=4)

When we want to repeat every value:

rep(c(1:9), each=3)

We can also tell R how many times to repeat each individual value:

rep(c(1,9), times=c(3,4))
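To see the difference between the three forms, here they are side by side on short vectors, with the results as comments:

```r
rep(1:3, times = 2)            # 1 2 3 1 2 3
rep(1:3, each = 2)             # 1 1 2 2 3 3
rep(c(1, 9), times = c(3, 4))  # 1 1 1 9 9 9 9
```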

Looking At Vector Values

Once we have a vector, R lets us look at and work with individual values. The square brackets let us extract a value from the vector. We just indicate the position we want inside of the square brackets.

x[3]

This gives us the 3rd number from the start of the vector.

We can get more than one position value at once:

x[c(1,2,3)]

This will give us the first 3 positions of the vector.

You can change the value of a vector.

x=c(1,2,3,4,5)

Let us change the last value from 5 to 3

x[5] = 3

Now, our vector holds the values we want.

Making Copies of Vectors

Before working with an important vector set of data, make a copy of it. You do not want to accidentally change it without knowing it. Do it like this:

x.copy = x

Now you can do your work with a little less worry.

Comparing Values

To compare values in a vector:

x > 5

This gives us logical values. Any time there is a value greater than 5, the output is TRUE.

We can also check the positions of the values that are greater than 5.

which(x > 5)

This shows us which positions in the vector hold values greater than 5.
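A comparison can also go inside the square brackets to pull out the matching values themselves. Here is a small sketch with a made-up vector:

```r
x = c(2, 9, 4, 7, 5, 8)

x > 5         # FALSE TRUE FALSE TRUE FALSE TRUE
which(x > 5)  # 2 4 6
x[x > 5]      # 9 7 8
sum(x > 5)    # 3, because each TRUE counts as 1
```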

These are the comparison and logical operators in R:

x == y
x != y
x > y
x >= y
x < y
x <= y
x & y
x | y
!x
xor(x,y)
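The last four operators combine logical values. Here is a sketch of how they behave on a small made-up vector:

```r
x = c(1, 4, 6, 9)

x > 2 & x < 8      # FALSE TRUE TRUE FALSE   (both must be true)
x < 2 | x > 8      # TRUE FALSE FALSE TRUE   (either can be true)
!(x > 5)           # TRUE TRUE FALSE FALSE   (flips the result)
xor(x > 2, x > 5)  # FALSE TRUE FALSE FALSE  (exactly one is true)
```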

More Arithmetic Operations

Once we have a vector set up and kind of know how it works, we can start doing more with it. I recently took a statistics class and I used many of these functions to great effect. It really speeds things up. The idea of a vectorized function is to take each value in a vector and do something with it. Here are the arithmetic functions that are pretty useful:

sum(x) calculates the sum of values in the vector x
prod(x) calculates the product of all the values in the vector x
min(x) gives the minimum of all values in x
max(x) gives the max of all the values in the vector x
cumsum(x) gives the cumulative sum of all the values in the vector x
cumprod(x) gives the cumulative product of all the values in the vector x
cummin(x) gives the minimum for all values in x from the start of the vector to the position indicated
cummax(x) gives the maximum for all values in x from the start of the vector until the position indicated
diff(x) gives for every value the difference between that value and the next value in the vector
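Here are those functions run on one short made-up vector, so you can compare the results against the descriptions:

```r
x = c(3, 1, 4, 1, 5)

sum(x)      # 14
prod(x)     # 60
min(x)      # 1
max(x)      # 5
cumsum(x)   # 3 4 8 9 14
cumprod(x)  # 3 3 12 12 60
cummin(x)   # 3 1 1 1 1
cummax(x)   # 3 3 4 4 5
diff(x)     # -2 3 -3 4
```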

Doing Statistics in R

One of the first things you need to learn to do is create vectors. We are often given a data set of numbers to deal with. Those numbers need to be in a vector to take advantage of the power of R programming.

Squaring A Number

Squaring a number looks like this in your R console:

11.17^2

Square Root Of A Number

To take the square root of a number, use this function in your R console:

sqrt(36)

Creating Vectors

Enter the code I am giving you into your Rgui or RStudio or whatever else you are using and press enter.

Here is how to create an empty vector:

x=c()

We are usually given a data set to work with. Let us start with something basic. I will just enter some numbers between one and nine so we have something to work with.

x=c(2,2,1,3,3,3,4,5,6,7,7,8,8,8,9,9,9,9,2,3,4,1,1,2,2,6,6,7,7,8,8,9,9)

Now, we have a vector with data.

Ordering Data

Let us now order the data.

order(x)

The order() function does not return the values themselves. It returns the positions of the values, from the smallest value to the largest.

[1] 3 22 23 1 2 19 24 25 4 5 6 20 7 21 8 9 26 27 10 11 28 29 12 13 14 30 31 15 16 17 18 32 33

The smallest value is 1 and it is in the 3rd, 22nd, and 23rd positions.

If you want the numbers in ascending order, do this:

x[order(x)]

[1] 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 6 6 6 7 7 7 7 8 8 8 8 8 9 9 9 9 9 9

This version of the order function sorts the numbers in ascending order.

If you want the data set in descending order then do this:

x[order(x,decreasing=TRUE)]

You will get this:

[1] 9 9 9 9 9 9 8 8 8 8 8 7 7 7 7 6 6 6 5 4 4 3 3 3 3 2 2 2 2 2 1 1 1
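If you only need the sorted values and not the positions, base R also has the sort() function. Here is a short sketch with a made-up vector:

```r
x = c(5, 2, 9, 1, 7)

sort(x)                     # 1 2 5 7 9
sort(x, decreasing = TRUE)  # 9 7 5 2 1

# sort(x) gives the same result as x[order(x)]
identical(sort(x), x[order(x)])
```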

Calculating A Sum

To calculate a sum of some numbers, put them into a vector like we did above.

Importing or copying/pasting works just fine.

x=c(1,2,3,4,5)

Then just use the sum() function.

sum(x)

Calculating The Range

First, we need to get some data into a vector.

x=c(55,22,87,14,64,62,94,91,61,44,11)

Next, we order the data.

x[order(x)]

To find the range, subtract the min from the max.

\(94-11=83\)
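You can also let R do the subtraction. This sketch uses the base R functions min(), max(), range(), and diff() on the same data:

```r
x = c(55, 22, 87, 14, 64, 62, 94, 91, 61, 44, 11)

min(x)           # 11
max(x)           # 94
max(x) - min(x)  # 83

# range() returns the min and the max together, and diff() subtracts them
range(x)         # 11 94
diff(range(x))   # 83
```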

Calculating the Mean

To calculate the Mean of a data set we use the mean() function.

It accepts a vector as an input.

We will work on the 33-value data set we created and ordered earlier. If you have assigned something else to x since then, enter that vector again first.

To use the mean() function, we use the variable name we created as an argument to the mean() function. So:

mean(x)

5.393939

Finding the Median

If we had 3 numbers then finding the median would be pretty quick. However, when we have 3000 numbers it is a different story.

The median() function also accepts a vector as input. We use it in the same way as above.

To find the median of a data set quickly we do this:

median(x)
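Here is a quick sketch with two small made-up vectors, one with an odd number of values and one with an even number:

```r
# Odd count: the median is the middle value after sorting
median(c(7, 1, 4))      # 4

# Even count: the median is the average of the two middle values
median(c(7, 1, 4, 10))  # 5.5
```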

Finding the Mode

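R does not have a built-in function for the statistical mode. The built-in mode() function reports how an object is stored (for example “numeric”), so we will write our own function.

Our function accepts a vector as an input.

It returns the most frequently occurring value in the data set.

In any data set, there can be no mode, one mode, or multiple modes.

First, we have to create our own function to find the mode. Keep in mind that naming it mode hides R's built-in mode() function for the rest of the session.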

Type it like this:

mode=function(x)

{

u=unique(x)

tab=tabulate(match(x,u))

u[tab==max(tab)]

}

Then we just do:

mode(x)

This also works on a character vector.

If we have:

letter=c('a','s','s','s','d','d','d','f','f','f','g','g','h','h','h','h','h','j','j','j','j','k','k','k','k','k','k','k','k','l','l','l','l','l','l')

mode(letter)

“k”

The mode of this character vector is “k” because it occurs more than any other letter.
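If you would rather not hide R's built-in mode() function, the same code works under a different name. This is a sketch only. The name stat_mode() is made up and is not part of R.

```r
# Same logic as above, under a made-up name that does not clash with base R
stat_mode = function(x) {
  u = unique(x)                  # the distinct values
  tab = tabulate(match(x, u))    # how often each distinct value occurs
  u[tab == max(tab)]             # every value that reaches the highest count
}

stat_mode(c(1, 1, 2, 2, 3))  # 1 2, because there are two modes
stat_mode(c("a", "b", "b"))  # "b"

# The built-in function answers a different question
base::mode(c(1, 1, 2))       # "numeric"
```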

Calculating The Standard Deviation

To calculate standard deviation, use the sd() function.

Get data into a vector.

x=c(55,22,87,14,64,62,94,91,61,44,11)

Use the sd() function.

sd(x)

Calculating The Variance

To calculate the variance of some numbers, we square the standard deviation.

Get data into a vector.

x=c(55,22,87,14,64,62,94,91,61,44,11)

Use the sd() function to find the standard deviation.

sd(x)

= 29.7

Square this number.

29.7^2

= 882.1
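R also has a built-in var() function that does this in one step. Because it does not round the standard deviation first, its answer is slightly different from squaring 29.7 by hand.

```r
x = c(55, 22, 87, 14, 64, 62, 94, 91, 61, 44, 11)

# Variance directly
var(x)    # 883.4

# Squaring the unrounded standard deviation gives the same value
sd(x)^2
```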

Calculating The Coefficient Of Variation

The coefficient of variation is:

\[cv = \frac{\text{standard deviation}}{\text{mean}}\]

Let us start with getting data into a vector.

x=c(86,70,62,68,69,54,66,55,81,68,61,62,98,54,62)

Use the mean() function to find the mean of the data.

mean(x)

Now, use the sd() function to find the standard deviation

sd(x)

Next, use the formula from above.

cv = sd(x) / mean(x)
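The coefficient of variation is often reported as a percentage. Here is a sketch that wraps the formula in a function. The name coef_var() is made up for this example.

```r
x = c(86, 70, 62, 68, 69, 54, 66, 55, 81, 68, 61, 62, 98, 54, 62)

# Made-up helper: standard deviation divided by the mean
coef_var = function(values) {
  sd(values) / mean(values)
}

cv = coef_var(x)
cv

# As a percentage, rounded to one decimal place
round(cv * 100, 1)
```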

Calculating the Z-Score of a Data Set

A z score is the number of standard deviations that a given value is above or below the mean.

First, we need to import some data. Let’s use the previous data set.

x=c(86,70,62,68,69,54,66,55,81,68,61,62,98,54,62)

Next, we find the mean.

mean(x)

67.73

Now, find the standard deviation.

sd(x)

12.30

Then:

\[z = \frac{x - \mu}{s}\]

Let us use the value of 54 that is in the data set.

\[z = \frac{54-67.73}{12.30}\]

z = -1.12

This tells us that the value of 54 is slightly more than 1 standard deviation away from the mean. It is negative because 54 is less than the mean of 67.73.
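Since R is vectorized, the same formula gives the z score of every value in the data set at once. This is a sketch using the sample mean and the sample standard deviation:

```r
x = c(86, 70, 62, 68, 69, 54, 66, 55, 81, 68, 61, 62, 98, 54, 62)

# z score for every value in x
z = (x - mean(x)) / sd(x)
round(z, 2)

# z scores for the values equal to 54
round(z[x == 54], 2)
```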

Calculating Percentiles

A data set’s three quartiles are its quarter points. These are the 25th, 50th, and 75th percentiles.

The interquartile range is the distance between the 25th and 75th percentiles.

Let’s use our data set.

x=c(86,70,62,68,69,54,66,55,81,68,61,62,98,54,62)

We can use the quantile() function to analyze this data set.

quantile(x)

We can even set the percentages we want to see from the data set.

quantile(x, probs=c(0.125,0.375,0.625,0.875))

By default, the quantile() function gives you the minimum, the three quartiles, and the maximum of a data set.
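To get only the quartiles, or the interquartile range itself, you can do something like the following sketch. IQR() is a base R function. The name q is made up for this example.

```r
x = c(86, 70, 62, 68, 69, 54, 66, 55, 81, 68, 61, 62, 98, 54, 62)

# Only the three quartiles
q = quantile(x, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the 75th percentile minus the 25th percentile
IQR(x)
```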

Calculating Relative Frequency

Relative frequency tells you how often a value occurs within a dataset. It is most often reported as a percentage. Let's get to some examples.

# Create a vector of data
data = c('r', 'r', 'u', 'u', 'u', 'v', 'v', 'v', 'v')
# Create table
table(data)/length(data)

The table you see means 'r' happens 22.2% of the time, 'u' is 33.3%, and 'v' is 44.4%.
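Base R also has the prop.table() function, which does the division for you. This sketch turns the result into percentages. The name freq is made up for this example.

```r
data = c('r', 'r', 'u', 'u', 'u', 'v', 'v', 'v', 'v')

# Relative frequency of each value
freq = prop.table(table(data))

# As percentages, rounded to one decimal place
round(freq * 100, 1)
```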

If you are asked to find the relative frequency of some total, we divide the number given by the total. Here is an example.

Of the 1806 qualified applicants, 966 were accepted. What percentage of the applicants were accepted?

966/1806 = 53.5%

Boxplots

A boxplot is a simple graph and one of the first that you will learn to use.

It uses the boxplot() function.

Let’s load our data set.

x=c(86,70,62,68,69,54,66,55,81,68,61,62,98,54,62)

Now, apply the boxplot() function.

boxplot(x)

You can see the concentration of data and any outliers at a glance.

This function has several options. Let us look at some of them.

boxplot(x,main="Random Data Set", xlab="x",ylab="y",col="blue",border="brown",horizontal=TRUE,notch=TRUE)

These options will spice up your boxplot quite well.

You can also do side by side boxplots.

If you have three different variables, x,y,z...

boxplot(x,y,z)
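Here is a sketch of a side by side boxplot with labels. The three vectors are small made-up groups for illustration only. The names argument puts a label under each box.

```r
# Made-up values for three groups
x = c(86, 70, 62, 68, 69)
y = c(54, 66, 55, 81, 68)
z = c(61, 62, 98, 54, 62)

# One box per vector, labeled with the names argument
b = boxplot(x, y, z, names = c("x", "y", "z"),
            main = "Three Groups", ylab = "Value")
```

The boxplot() function also returns the numbers it used to draw the boxes, which is why the result can be stored in a variable like b.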

Bar Charts

To create a bar chart in R, we will first do it the simplest way to show how it is done.

I am assuming this is being done in RStudio or in a blank script.

Create a vector.

x=c(1,3,5,7,9,22,33,44,55,66,77,88,99,2,4,6,8,10)

Then run:

barplot(x)

Pie Charts

We can make a pie chart the same way.

pie(x)

Obviously, we can do a lot to make it pretty and more informative but I wanted to show how the basic chart is done, first.

Normal Probability Plots

To make a Normal probability plot for x:

qqnorm(x)

When you want to standardize a variable, use:

z=(x - mean(x)) / sd(x)

By default, R places the ordered data on the vertical axis and the Normal scores on the horizontal axis, but that can be reversed by setting datax=TRUE inside qqnorm.
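A reference line makes the plot easier to read. This sketch uses the base R function qqline(), which also accepts datax. The data set is the one from the statistics examples above.

```r
x = c(86, 70, 62, 68, 69, 54, 66, 55, 81, 68, 61, 62, 98, 54, 62)

# Default layout: data on the vertical axis
q = qqnorm(x)
qqline(x)

# Reversed layout: data on the horizontal axis
qqnorm(x, datax = TRUE)
qqline(x, datax = TRUE)
```

If the points fall close to the line, the data is roughly Normal.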
