APCV 302: Statistics in the Information Age

R, RStudio, and Quarto

We’re using an incredibly powerful statistical software package in this class. Unfortunately, it does come with a bit of a learning curve. You won’t be required to learn the nitty-gritty ins and outs of the R scripting language but you will need to know the basics of what R can do. Luckily, we have RStudio to really help out with that. Below are some introductory articles and videos.

Connect to Posit Cloud

We’ll be using Posit Cloud in this class. See the Posit Cloud page for instructions on how to get started with it. If you also want RStudio on your local machine, you’ll need to do the following. (Remember, for this class, all you need is https://rstudio.cloud!)
If you wish to also have the Desktop version to use after class is over: Download R and RStudio

Pick the version for your operating system. If you’re using a Chromebook or can’t otherwise get RStudio to work on your machine, you’ll need to use Posit Cloud anyway. When your class is over, your access to the unrestricted version of Posit Cloud will be reduced, so having the desktop version is highly encouraged.
Quick-R R Tutorial

A quick (obviously) introduction to some of the most basic aspects of R.
The built-in help in RStudio

Inside RStudio, type help.start() in the console and, voilà, the manuals and reference materials appear in the Help pane.
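A few other built-in help commands are worth knowing alongside help.start(). The sketch below uses base R's mean() purely as an example topic; any function name works the same way.

```r
# Open the full set of manuals (only when R is running interactively).
if (interactive()) help.start()

# Show the documentation page for one function.
help("mean")

# Run the examples listed at the bottom of that help page.
example("mean")

# List every object whose name contains "mean".
matches <- apropos("mean")
print(matches)
```

In the console, ?mean is a shorthand for help("mean"), and ??regression searches all installed help pages for a keyword.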
R for Cats

In case you like cats.
The art of R programming: a tour of statistical software design

Made by the same group that created the Manga guide, this text walks you through the vast majority of the basics to get you well on your way to being an R guru.
Cheatsheets

Print them out and keep them at your desk. The RStudio IDE and RMarkdown cheatsheets are particularly useful.
swirl

swirl is a fantastic collection of courses (ranging between 10 and 20 minutes each) designed to help you learn R programming while immersed in R!
Get Started with Quarto

This workshop is designed for those who have little or no prior experience with R Markdown and who want to learn Quarto. Quarto is the next generation of R Markdown for publishing, including dynamic and static documents and support for multiple programming languages. With Quarto you can create documents, books, presentations, blogs or other online resources.
Quarto Guide

Comprehensive guide to using Quarto. If you are just starting out, you may want to explore the tutorials to learn the basics.
Pandoc

When using Posit Cloud, you don’t need to install this. RStudio now comes with a version built-in, as well. That said, if you need to convert files from one markup format into another, pandoc is your swiss-army knife. (It converts from everything to everything and you’ll never need to touch it; everything happens through RStudio. —Dr S)
LaTeX (variations)

Again, we’re using Posit Cloud, so this is pre-installed there. You will want to save your R results as PDFs at some point. To do this on your own machine, you will need to install LaTeX, a mathematical typesetting system. It’s required to convert code into symbols. Here’s a good but brief introduction to LaTeX with RStudio. Ideally you would install TinyTeX by Yihui, the brain behind much of what you see in RStudio. Failing this, you can install MiKTeX on Windows or MacTeX on a Mac. That said, you should absolutely start with the TinyTeX R package!
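If you do install RStudio on your own machine, setting up TinyTeX from the R console typically looks like the sketch below. It assumes a working internet connection and is not needed on Posit Cloud.

```r
# Run these once in the RStudio console on your own machine.
install.packages("tinytex")   # the R package that manages the LaTeX distribution
tinytex::install_tinytex()    # downloads and installs the TinyTeX distribution itself

# After restarting RStudio, check whether TinyTeX is the LaTeX in use.
tinytex::is_tinytex()
```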

You can include this kind of math inline (NOT to be confused with including R code inline) by using code like this: $\sigma = 0$. This will display as the rendered symbols σ = 0. Note the lack of spaces between the $ and the code!

Likewise, you can write something like the following and have it appear on its own line:

$$ y_{ij} = b_{ij} + \beta_{0} + \beta_{1} $$

gives you the following displayed in your document:

\[ y_{ij} = b_{ij} + \beta_{0} + \beta_{1} \]

Big Data

What Is Big Data? A Super Simple Explanation For Everyone

The term “Big Data” may have been around for some time now, but there is still quite a lot of confusion about what it actually means. In truth, the concept is continually evolving and being reconsidered, as it remains the driving force behind many ongoing waves of digital transformation, including artificial intelligence, data science and the Internet of Things. But what exactly is Big Data and how is it changing our world?
What is Big Data?

Big data encompasses a wide range of analytics and data-gathering strategies. Essentially, it’s the ability to capture, store and analyze data on a mass scale to inform business decisions. It follows basic logic: The more you know about a problem or issue, the more reliable the solution.

Data Mining

What is Data Mining?

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as “big data”) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
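The three stages can be sketched in a few lines of base R. This is only an illustration: the built-in mtcars data, the linear model, the 8-row hold-out set and the two made-up cars are all assumptions chosen for brevity, not part of the linked article.

```r
# Stage 1: initial exploration of a built-in data set.
data(mtcars)
print(summary(mtcars[, c("mpg", "wt", "hp")]))

# Hold out some rows so the model can be validated on data it has not seen.
set.seed(302)                                # arbitrary seed, for reproducibility
test_rows <- sample(nrow(mtcars), size = 8)  # 8 of the 32 cars
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Stage 2: model building, then validation on the held-out rows.
fit <- lm(mpg ~ wt + hp, data = train)
test_pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - test_pred)^2))
cat("Validation RMSE (mpg):", round(rmse, 2), "\n")

# Stage 3: deployment, i.e. predictions for new cases.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))  # made-up cars
new_pred <- predict(fit, newdata = new_cars)
print(new_pred)
```

Validating on rows the model never saw is what separates stage 2 from simply fitting a curve to the data you already have.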
Revealing Online Learning Behaviors and Activity Patterns and Making Predictions with Data Mining Techniques in Online Teaching

(Abstract) This study was conducted with data mining (DM) techniques to analyze various patterns of online learning behaviors, and to make predictions on learning outcomes. Statistical models and machine learning DM techniques were conducted to analyze 17,934 server logs to investigate 98 undergraduate students’ learning behaviors in an online business course in Taiwan. The study scientifically identified students’ behavioral patterns and preferences in the online learning processes, differentiated active and passive learners, and found important parameters for performance prediction. The results also demonstrated how data mining techniques might be utilized to help improve online teaching and learning with suggestions for online instructors, instructional designers and courseware developers.
How to Catch a Liar on the Internet

Technology makes it easier than ever to play fast and loose with the truth—but easier than ever to get caught.
R and Data Mining

This website presents documents, examples, tutorials and resources on R and data mining.

Text Mining

Text Mining with R

This book serves as an introduction of text mining using the tidytext package and other tidy tools in R. The functions provided by the tidytext package are relatively simple; what is important are the possible applications. Thus, this book provides compelling examples of real text mining problems.
Text Mining (Big Data, Unstructured Data)

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will “turn text into numbers” (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in great detail in the comprehensive overview work by Manning and Schütze (2002), and for an in-depth treatment of these and related topics as well as the history of this approach to text mining, we highly recommend that source.
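To make "turn text into numbers" concrete, here is a minimal base R sketch that counts words per document. The three tiny documents are invented for illustration, and a real project would more likely use a package such as tidytext than build the matrix by hand.

```r
# Three tiny made-up "documents".
docs <- c(
  doc1 = "Big data is big",
  doc2 = "Text mining turns text into numbers",
  doc3 = "Mining data for patterns"
)

# Lower-case each document and split it into words.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Build a document-term matrix: one row per document, one column per word.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(words) {
  tabulate(match(words, vocab), nbins = length(vocab))
}))
colnames(dtm) <- vocab
print(dtm)

# The counts can now feed other analyses, e.g. overall word frequencies.
word_totals <- colSums(dtm)
print(sort(word_totals, decreasing = TRUE))
```

Each row of dtm is now a numeric description of one document, which is the form that clustering or predictive models expect.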
Why Text Mining May Be the Next Big Thing

“Big Data” is a hot topic in the business world these days. But there’s a subset of this broad field that has yet to take a turn in the spotlight. It’s called “text mining,” and you’re probably going to be hearing a lot more about it over the coming months and years. Basically, text mining is the process of combing through countless pages of plain-language digitized text to find useful information that’s been hiding in plain sight.
Where to start with text mining

This post is an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). I’m putting it on the blog since some of the links might be useful for a broader audience.
Text mining: what do publishers have against this hi-tech research tool?

Researchers push for end to publishers’ default ban on computer scanning of tens of thousands of papers to find links between genes and diseases

Cluster Analysis

Clustering: An Introduction

Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.
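As a small illustration of grouping unlabeled data, the sketch below runs k-means on R's built-in iris measurements. k-means is just one of many clustering methods, and the choice of three clusters and the seed are assumptions made for the example.

```r
# Use the four numeric measurements from iris, ignoring the species labels.
measurements <- scale(iris[, 1:4])   # standardise so no variable dominates the distances

set.seed(302)                        # k-means starts from random centres
km <- kmeans(measurements, centers = 3, nstart = 25)

# How many flowers landed in each cluster?
print(table(km$cluster))

# Compare the discovered groups with the species labels that were withheld.
print(table(cluster = km$cluster, species = iris$Species))
```

The algorithm never sees the species column; the final table only checks how well the "similar" groups it found line up with the known labels.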
Cluster Analysis Introduction (StatSoft)

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies.
Hierarchical Clustering Algorithms

An introduction to hierarchical clustering algorithms.
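A hierarchical clustering can be tried in base R with hclust(). The sketch below is only an illustration: the use of the first 20 rows of the built-in USArrests data, average linkage and a cut into four groups are all assumptions made to keep it short.

```r
# Standardise the first 20 rows of the built-in USArrests data.
states <- scale(USArrests[1:20, ])
d <- dist(states)                    # Euclidean distance by default

# Agglomerative clustering: repeatedly merge the two closest groups.
hc <- hclust(d, method = "average")

# Draw the tree (dendrogram) and cut it into 4 groups.
plot(hc, main = "Hierarchical clustering of 20 US states")
groups <- cutree(hc, k = 4)
print(groups)
```

Unlike k-means, the number of groups is chosen after the tree is built, by deciding where to cut it.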

Analytics and Business Intelligence

What is Predictive Analytics?

Every business has a treasure trove of data, from customer and transaction information to manufacturing and shipping statistics. The key is figuring out how to use past data to better the business’ future.
Strategy for building a “good” predictive model

Step-by-step guide.
Google Analytics

The one and only. (It’s even used on this site. Note that the Analytics site is not accessible while using some VPNs.)

Privacy, Ethics, and Social Issues

Ethics, Big Data, and Analytics: A Model for Application

The use of big data and analytics to predict student success presents unique ethical questions for higher education administrators relating to the nature of knowledge; in education, “to know” entails an obligation to act on behalf of the student. The Potter Box framework can help administrators address these questions and provide a framework for action.
The Promise of Big Data in Public Safety and Justice

Making data easier to digest for more law enforcement users.
How the NSA Spied on Americans Before the Internet

In May 1984 — an apt year for columns about “Big Brother” — The Post’s Michael Schrage warned of a future in which the government could snoop on unsuspecting citizens by subpoenaing their floppy discs. Personal computers were new, expensive and not particularly common; the first dot-com domain wasn’t even registered until the following year.
NSA gathered thousands of Americans’ e-mails before court ordered it to revise its tactics

For several years, the National Security Agency unlawfully gathered tens of thousands of e-mails and other electronic communications between Americans as part of a now-revised collection method, according to a 2011 secret court opinion.

302-specific

302 specific