Activity A: Descriptives

Activity A serves as an introduction to using Quarto and R while giving you an opportunity to explore basic descriptive statistics.

Each activity will come with some sort of narrative context. This is to help you operationalize the statistical concepts and see the applicability. For this assignment, the context is:

You are tasked with doing some research on just how much people are spending at a particular grocery store. One day, you wait by the exit door with a clipboard and ask people how much they spent as they’re exiting. You jot down how much they spent and give them an ID number to know when you’ve reached your goal of 200.

Relevant R

This is your first R-based assignment after your “Introduction to R” project (remember, you’re just exploring that one, not actually submitting anything). It may still feel entirely new to you. Here we’ll walk through getting set up to do this assignment (and later ones) as pain-free as possible. Remember, we’re using Posit Cloud for this! See the Posit Cloud instruction page if you’re still unsure about this.

(Note that while you’ll normally have to deal with missing data, in this project we’re generating data that is clean and has no missing values.)

There is some overlap between what you explored in the Introduction to R project and what you’re seeing below. This is intentional!

Mean

The mean is the sum of all items in the set divided by the number of items in the set. This is represented as:

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

In R, you’ll be using the mean() function. For example, if we have a set of numbers, 14, 10, 3, 67, 32, 1, 9, 2, and want to find their average, we could write:

mean(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 17.25

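Note that c() combines the items into a single vector. This is key! If you just used mean(10, 3, 67, 32, 1, 9), R would not throw an error, but it would not give you the mean either: it treats only the first value (10) as the data and the rest as other arguments, so it returns 10.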

For more on calculating the mean in R, see here.
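As a quick check on the formula, here is a minimal sketch that computes the mean by hand and compares it with mean(). The object names x_demo and manual_mean are only for illustration and are not part of the assignment template.

```r
# The same eight numbers used in the example above
x_demo <- c(14, 10, 3, 67, 32, 1, 9, 2)

# Sum of the values divided by how many values there are
manual_mean <- sum(x_demo) / length(x_demo)

manual_mean                          # the hand-calculated mean
mean(x_demo)                         # the built-in function
all.equal(manual_mean, mean(x_demo)) # TRUE when the two agree
```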

Median

The median is, you’ll remember, the middle value of a set when the set is ordered by value. So, the median value of the set 1, 3, 5 will be 3. But what about a set with an even number of values, like the one from above? You’ll take the average of the two middle values, like so:

median(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 9.5

In this case, once the values are in order, 9 and 10 are the two middle values, so we take \(9 + 10\), which gives us 19, then divide by two, which gives us 9.5, as demonstrated above. For more on calculating the median in R, see here.
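The same steps can be sketched by hand in R. This is only an illustration of what median() does for an even-sized set; the object names are made up for the example.

```r
x_demo <- c(14, 10, 3, 67, 32, 1, 9, 2)

sorted <- sort(x_demo)   # put the values in order
n <- length(sorted)      # 8 values, so there is no single middle value

# Pick the two middle positions (4th and 5th) and average them
middle_two <- sorted[c(n / 2, n / 2 + 1)]
middle_two
mean(middle_two)

median(x_demo)           # the built-in function gives the same answer
```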

Standard Deviation

Now we’re getting to more complicated mathematics, but the R function works much the same as those for the mean and median. You’ll recall the standard deviation is essentially a measure of how spread out a set of numbers is. Basically, it can be written as:

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2} \]

In this case, \(\sigma\) (sigma) refers to the standard deviation of the population and \(\mu\) (mu) refers to the population mean. Note that this is generally a mathematical formula as opposed to a real-world statistical one because, in the vast majority of cases, you are actually working with a sample of the population and not the population as a whole. When finding the standard deviation for a sample, you will use the following slightly different formula, where \(\overline{x}\) is the sample mean (note the \(N-1\) in the denominator, referred to as Bessel’s correction):

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} \]

We can calculate the standard deviation rather easily in R by using the sd() function. Keep in mind that sd() always calculates the standard deviation with the sample correction, as it won’t treat your data as a full population. Again using our little 8-value vector, we can simply calculate:

sd(c(14, 10, 3, 67, 32, 1, 9, 2))
[1] 22.43562

If we wanted R to treat the data as an entire population, we could manually undo the sample correction ourselves:

x <- c(14, 10, 3, 67, 32, 1, 9, 2) # Let's call it x to make the calculation easier

sd(x)*(sqrt((length(x)-1)/length(x)))
[1] 20.9866
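To connect the two formulas to the numbers above, here is a minimal sketch that builds both versions of the standard deviation step by step. The object names are hypothetical and chosen only for readability.

```r
x_demo <- c(14, 10, 3, 67, 32, 1, 9, 2)
n <- length(x_demo)

# Squared distance of each value from the mean
squared_dev <- (x_demo - mean(x_demo))^2

# Sample version: divide by n - 1 (this is what sd() does)
sample_sd <- sqrt(sum(squared_dev) / (n - 1))

# Population version: divide by n
population_sd <- sqrt(sum(squared_dev) / n)

sample_sd       # should match sd(x_demo)
sd(x_demo)
population_sd   # should match the manually corrected value above
```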

Interquartile Range

Essentially, the interquartile range (IQR) is the width of the middle 50% of the data: the 75th percentile minus the 25th percentile (Q3 - Q1). If we think about the entire data set as broken into quarters, the median is the center point, so by hand you’d take the median of each of the two halves it creates and subtract the lower from the upper. R’s quantile() and IQR() functions interpolate between values by default, so their results can differ slightly from that hand method on small data sets. We can demonstrate R’s approach by finding the 75th percentile and the 25th percentile, saving them as objects in R, and manually finding the difference. You’ll then see below that the IQR() function does the same.

Also, now that we’ve saved our vector as x, we can just use that instead. Remember that in R you can assign virtually any value to a name. If you’re typing something more than twice, it’s good practice to save it as an object in your environment.

Q3 <- quantile(x, 0.75) # Calculate the 75th percentile
Q1 <- quantile(x, 0.25) # Calculate the 25th percentile
manualIQR <- Q3 - Q1 # Find the difference

manualIQR # Show the number
  75% 
15.75 
IQR(x) # And the function doing the same
[1] 15.75
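If you work out the quartiles by hand using the median of each half, you may notice the answer does not match R’s on a set this small. The sketch below compares the two approaches; it assumes R’s default quantile settings, and the object names are made up for the example.

```r
x_demo <- c(14, 10, 3, 67, 32, 1, 9, 2)
sorted <- sort(x_demo)

# Hand method: split the ordered data in half and take each half's median
lower_half <- sorted[1:4]    # 1, 2, 3, 9
upper_half <- sorted[5:8]    # 10, 14, 32, 67
hand_iqr <- median(upper_half) - median(lower_half)
hand_iqr                     # 23 - 2.5 = 20.5

# R's default: interpolate between neighboring values
quantile(x_demo, c(0.25, 0.75))
IQR(x_demo)                  # 15.75, as shown above
```

With 200 values, as in this assignment, the difference between the two methods is usually much smaller than it is with eight.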

The Data

(Again, the content below is to help you understand what’s going on. Your template for this assignment is in the Activity A project in Posit Cloud.)

Let’s generate some data for this assignment, shall we? Since this is generated data, everyone’s values will be (mostly) slightly different.

Why are we calling it dfa? Well, df is often used as an abbreviation for data frame, which is what a table is called in R. It’s dfa because it’s the data frame for Activity A. It’s important to name data and variables meaningfully (not like how we named a vector x above!).

Note: if you want to reference the content in the value column, you’ll need to reference the data frame AND the variable: dfa$value. Remember: dfa refers to the data frame, value refers to the column/variable in that data frame, and the $ tells R the relationship (i.e., value is a variable in the dfa data frame). Simple as that!

This is the code that generates the data. It’s already in the template file you’re being provided. Just remember to change the number inside set.seed() to one of your own choosing! You shouldn’t leave it as 123456. You can pick anything, from 42 to 8675309, both of which are pretty common.

library(plyr) # To round the numbers to currency
library(tidyverse) # To turn our vector into a dataframe with enframe()
library(DT) # To create the table

set.seed(123456) # Pick a random 6 digit number - this sets your randomness
dfa <- rnorm(200, mean=25, sd=12) |> # Let's create some prices
  abs() |> # Take the absolute value to get rid of any negative numbers
  round_any(0.01) |> # And round them to pennies
  enframe() # And turn it into a dataframe so we have customer IDs/names

datatable(dfa, caption = "Spending") # And let's see them!
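Two ideas from this section can be tried out in isolation: what set.seed() does, and how $ pulls a column out of a data frame. The sketch below uses only base R and a tiny made-up data frame called toy, which is a hypothetical stand-in and not the real dfa.

```r
# Same seed, same "random" numbers: this is what makes your work reproducible
set.seed(42)
first_draw <- rnorm(3)
set.seed(42)
second_draw <- rnorm(3)
identical(first_draw, second_draw)   # TRUE

# A tiny stand-in data frame with the same column names enframe() produces
toy <- data.frame(name = 1:3, value = c(12.50, 30.25, 8.99))

toy$value         # the value column, pulled out as a plain vector
mean(toy$value)   # functions accept that column like any other vector
```

The same pattern applies to the real data, for example mean(dfa$value).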

Assignment

Use the Activity A template in Posit Cloud to complete the following requirements.

  1. Summarize the data by creating (via R code) and describing (via written narrative) the following descriptive statistics and what they mean/why they’re important:
    1. mean
    2. median
    3. standard deviation
    4. interquartile range
    5. at least one other descriptive statistic you find interesting about the data
  2. Display a histogram of the dfa$value variable. Is it normally distributed? Skewed? What does the kurtosis look like? Describe the value of being able to see the histogram.
  3. It’s easiest to write your code in a .R file (called an R script file) so you can easily test it while working. Then, when you’ve got everything above taken care of, transfer it to your .qmd file and use that to present your data rather than simply turning in code and the results. You’ll see an R script file in your Activity A Posit Cloud project called sandbox.R. Use this to work on your code. Just remember that content in an R script file (.R) is treated as code by default, while code in a Quarto (.qmd) file needs to be in a code chunk!
    1. For this assignment and all others, having gone through the Getting Started with Quarto decks is absolutely key.
    2. This is very likely going to take some trial and error. Set aside 2-3 times the amount of time you think this will take to account for fixing errors and debugging. R code is relatively straightforward and easy to use, but it can be somewhat intimidating to the beginner. You’re encouraged to read through most of the Quarto Guide as it will make things much easier on you in the long run. When in doubt: copy example code that works and tweak it to your specifications.
  4. Submitting the assignment:
    1. Complete the grading declaration quiz in the LMS. Note that this is not the same content that goes into your assignment submission! Your grading declaration should include, in addition to what the declaration quiz item describes:
      1. how you accomplished the steps in item 1 above
      2. why you think descriptive statistics are important
    2. Submit both your .qmd and your rendered HTML to the Activity A dropbox in the LMS by the stated due date and time. The submission must contain the requirements listed here or you will not receive credit for the assignment!
    3. Remember: the point of using this file system is reproducibility. Ensure you haven’t removed the embed-resources: true YAML line!
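
For the histogram in item 2, base R’s hist() function is one option. The sketch below is only a starting point on stand-in data: toy_spend is a hypothetical vector built with base R’s round() rather than the template’s pipeline, and the seed, title, axis label, and number of breaks are arbitrary choices. In your own work you would pass dfa$value instead.

```r
set.seed(2024)  # arbitrary seed, used only for this illustration

# Stand-in spending values, roughly mirroring the template's settings
toy_spend <- round(abs(rnorm(200, mean = 25, sd = 12)), 2)

# Draw the histogram; breaks is a suggestion for the number of bars
hist(toy_spend,
     breaks = 20,
     main = "Spending per customer",
     xlab = "Amount spent ($)")
```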