Each activity will come with some sort of narrative context. This is to help you operationalize the statistical concepts and see the applicability. For this assignment, the context is:
You are tasked with doing some research on just how much people are spending at a particular grocery store. One day, you wait by the exit door with a clipboard and ask people how much they spent as they’re exiting. You jot down how much they spent and give them an ID number to know when you’ve reached your goal of 200.
This is your first R-based assignment after your “Introduction to R” project (remember, you’re just exploring that one, not actually submitting anything). It may still feel entirely new to you. Here we’ll walk through getting set up to do this assignment (and later ones) as pain-free as possible. Remember, we’re using Posit Cloud for this! See the Posit Cloud instruction page if you’re still unsure about this.
(Note that while you’ll normally have to deal with missing data, in this project we’re generating data that is clean and has no missing values.)
There is some overlap between what you explored in the Introduction to R project and what you’re seeing below. This is intentional!
The mean is the sum of all items in the set divided by the number of items in the set. This is represented as:
\[ \mu = \frac{\sum_{i=1}^N x_i}{N} \]
In R, you’ll be using the mean() function. For example, if we have a set of numbers, 14, 10, 3, 67, 32, 1, 9, 2, and wanted to find the average of them, we could write:
mean(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 17.25
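Note that using c() turns the items into a vector by combining them. This is key! If you just used mean(10, 3, 67, 32, 1, 9), R would not raise an error. Instead, it would treat only the first number as the data and quietly return 10, which is not the mean of the set.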
For more on calculating the mean in R, see here.
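A quick way to see why c() matters is to run both forms side by side. This is a minimal sketch; the object names without_c and with_c are made up for illustration. Without c(), mean() uses only its first argument as the data and matches the remaining numbers to its other arguments, so no error appears and the result is simply wrong.

```r
# Without c(), mean() treats only the first argument (10) as the data;
# the remaining numbers are matched to other arguments, not averaged.
without_c <- mean(10, 3, 67, 32, 1, 9)

# With c(), the six numbers are combined into one vector first.
with_c <- mean(c(10, 3, 67, 32, 1, 9))

print(without_c)  # 10
print(with_c)     # 20.33333
```

Because the first form fails silently, it is worth checking that an average looks plausible for the numbers you gave it.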
The median is, you’ll remember, the middle value of a set when the set is ordered by value. So, the median of the set 1, 3, 5 will be 3. But what about a set with an even number of values, like the one from above? You’ll take the average of the two middle values, like so:
median(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 9.5
In this case, since 9 and 10 are the two middle values when the set is in order, we take \(9 + 10\), which gives us 19, then divide by two, which gives us 9.5, as demonstrated above. For more on calculating the median in R, see here.
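The same steps can be sketched by hand in R to confirm what median() is doing. The names values, sorted_values, middle_two, and manual_median are made up for this illustration, and the indexing assumes an even number of values.

```r
values <- c(14, 10, 3, 67, 32, 1, 9, 2)

sorted_values <- sort(values)                     # 1 2 3 9 10 14 32 67
n <- length(sorted_values)                        # 8 values, an even count
middle_two <- sorted_values[c(n / 2, n / 2 + 1)]  # 4th and 5th values: 9 and 10
manual_median <- sum(middle_two) / 2              # (9 + 10) / 2

print(manual_median)   # 9.5
print(median(values))  # 9.5
```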
Now we’re getting to more complicated mathematics, but the R function works much the same as those for the mean and median. You’ll recall the standard deviation is essentially a measure of how spread out a set of numbers is. Basically, it can be written as:
\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2} \]
In this case, \(\sigma\) (sigma) refers to the standard deviation of the population and \(\mu\) (mu) to the population mean. Note that this is generally a theoretical formula rather than a practical one since, in the vast majority of cases, you are actually working with a sample of the population and not the population as a whole. When finding the standard deviation for a sample, you will use the following slightly different formula, which divides by \(N - 1\) instead of \(N\) (this adjustment is referred to as a correction) and uses the sample mean \(\overline{x}\) in place of \(\mu\):
\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} \]
We can calculate the standard deviation rather easily in R by using the sd() function. Keep in mind that R will always calculate the standard deviation with the sample correction, as it won’t treat your data as a full population. Again using our little 8-number vector, we can simply calculate:
sd(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 22.43562
If we wanted R to treat the data as an entire population, we could manually undo the sample correction ourselves by multiplying by \(\sqrt{(N-1)/N}\):
x <- c(14, 10, 3, 67, 32, 1, 9, 2) # Let's call it x to make the calculation easier
sd(x) * sqrt((length(x) - 1) / length(x))
[1] 20.9866
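To connect the two formulas to the numbers above, here is a minimal sketch that computes both versions directly from their definitions. The names squared_deviations, sample_sd, and population_sd are made up for illustration. The only difference between the two is the divisor.

```r
x <- c(14, 10, 3, 67, 32, 1, 9, 2)
n <- length(x)

# Squared distance of each value from the mean
squared_deviations <- (x - mean(x))^2

# Sample formula: divide by n - 1 (this is what sd() does)
sample_sd <- sqrt(sum(squared_deviations) / (n - 1))

# Population formula: divide by n
population_sd <- sqrt(sum(squared_deviations) / n)

print(sample_sd)      # 22.43562, same as sd(x)
print(population_sd)  # 20.9866, same as the manual adjustment
```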
Essentially, the interquartile range (IQR) is the width of the middle 50% of the data (the 75th percentile minus the 25th percentile; Q3 - Q1). If we think about the entire data set as broken into quarters, the median is the center point, so you’d then take the median of each of the two halves it creates and subtract the lower one from the upper one. We can demonstrate this by finding the 75th percentile and the 25th percentile, saving them as objects in R, and manually finding the difference. You’ll then see below that the IQR() function does the same.
Also, now that we’ve named our vector x, we can just use that instead. Remember that in R you can assign virtually anything to a name. If you’re typing something more than twice, it’s good practice to save it as an object in your environment.
Q3 <- quantile(x, 0.75) # Calculate the 75th percentile
Q1 <- quantile(x, 0.25) # Calculate the 25th percentile
manualIQR <- Q3 - Q1 # Find the difference
manualIQR # Show the number
  75% 
15.75 
IQR(x) # And the function doing the same
[1] 15.75
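One caveat is worth seeing in code. By default, quantile() (and therefore IQR()) interpolates between neighboring values, so its quartiles will not always match the "median of each half" method you might use by hand. The sketch below compares the two on our vector; the names sorted_x, lower_half, upper_half, and halves_iqr are made up for illustration, and the split assumes exactly 8 values.

```r
x <- c(14, 10, 3, 67, 32, 1, 9, 2)
sorted_x <- sort(x)           # 1 2 3 9 10 14 32 67

# Hand method: split at the median, then take the median of each half
lower_half <- sorted_x[1:4]   # 1 2 3 9
upper_half <- sorted_x[5:8]   # 10 14 32 67
halves_iqr <- median(upper_half) - median(lower_half)

print(halves_iqr)                  # 20.5
print(IQR(x))                      # 15.75
print(quantile(x, c(0.25, 0.75)))  # the quartiles IQR() is based on
```

Both approaches describe the middle half of the data. When you report an IQR from R, the value from IQR() is the one to use, since it is what the function above produces.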
(Again, the content below is to help you understand what’s going on. Your template for this assignment is in the Activity A project in Posit Cloud.)
Let’s generate some data for this assignment, shall we? Since this is generated data, everyone’s values will be (mostly) slightly different.
Why are we calling it dfa? Well, df is often used as an abbreviation for data frame, which is what a table is called in R. It’s dfa because it’s the data frame for Activity A. It’s important to name data and variables meaningfully (not like how we named a vector x above!).
Note: if you want to reference the content in the value column you’ll need to reference the data frame AND the variable: dfa$value. Remember: dfa refers to the data frame, value refers to the column/variable in that data frame, and the $ tells R the relationship (i.e., value is a variable in the dfa data frame). Simple as that!
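As a small illustration of the $ notation, the sketch below builds a hypothetical three-row stand-in called dfa_demo (not the real assignment data, and the amounts are made up) with the same name and value columns, then pulls out the value column and passes it to a function.

```r
# Hypothetical three-row stand-in for dfa, for illustration only
dfa_demo <- data.frame(
  name = 1:3,
  value = c(12.50, 30.25, 8.75)
)

# dfa_demo$value returns the value column as an ordinary vector
print(dfa_demo$value)

# So it can be handed to any function that expects a vector
demo_mean <- mean(dfa_demo$value)
print(demo_mean)
```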
This is the code that loads the data. It’s already in the template file you’re being provided. Just remember to change set.seed() to a random number! You shouldn’t leave it as 123456. You can pick anything, from 42 to 8675309, both of which are pretty common.
library(plyr) # To round the numbers to currency
library(tidyverse) # To turn our vector into a dataframe with enframe()
library(DT) # To create the table
set.seed(123456) # Pick a random 6 digit number - this sets your randomness
dfa <- rnorm(200, mean = 25, sd = 12) |> # Let's create some prices
  abs() |> # Take the absolute value to get rid of any negative numbers
  round_any(0.01) |> # And round them to pennies
  enframe() # And turn it into a dataframe so we have customer IDs/names
datatable(dfa, caption = "Spending") # And let's see them!
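To see what set.seed() is doing here, this minimal base R sketch draws three numbers several times. The object names first_draw, second_draw, and other_draw are made up for illustration. Reusing a seed reproduces the same "random" numbers, while a different seed gives a different set, which is why your values will differ from a classmate’s once you change the seed.

```r
set.seed(42)
first_draw <- rnorm(3, mean = 25, sd = 12)

set.seed(42)  # same seed again
second_draw <- rnorm(3, mean = 25, sd = 12)

set.seed(8675309)  # a different seed
other_draw <- rnorm(3, mean = 25, sd = 12)

print(identical(first_draw, second_draw))  # TRUE
print(identical(first_draw, other_draw))   # FALSE
```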
Use the Activity A template in Posit Cloud to complete the following requirements.
dfa$value variable. Is it normally distributed? Skewed? What does the kurtosis look like? Describe the value of being able to see the histogram.
.R file (called an R script file) so you can easily test it while working. Then, when you’ve got everything above taken care of, transfer it to your .qmd file and use that to present your data rather than simply turning in code and the results.
You’ll see an R script file in your Activity A Posit Cloud project called sandbox.R. Use this to work on your code. Just remember that content in an R script file (.R) is treated as code by default, while code in a Quarto (.qmd) file needs to be in a code chunk!
qmd and your rendered HTML to the Activity A dropbox in the LMS by the stated due date and time. The submission must contain the requirements listed here or you will not receive credit for the assignment!
embed-resources: true YAML line!