Training Manual for SDAL
Welcome
The Social and Decision Analytics Laboratory
Overview: Our Mission & Methods
Innovation: Our Research Focus Areas
Collaboration: Our Research Partnerships
Data Science for the Public Good Program
Problem
Methods
Impact
Program Sponsors: Who We Work With
Previous Projects
Lesson Materials
Software-Carpenty
Data-Carpentry
The Carpentries
DataCamp
R for Data Science
Setup
The Bash Shell
Windows
macOS
Linux
Chromebook
Git
Windows
macOS
Linux
Chromebook
Text Editor
Windows
macOS
Linux
R
Windows
macOS
Linux
Post installation settings
SQLite
Windows
macOS
Linux
SSH Keys
GitHub
Lightfoot
GitLab (devlab)
LaTeX
Syllabus
Lesson Breakdown
Lesson 1
Lesson 2
Lesson 3
Lesson 4
Lesson 5
Lesson 6
Lesson 7
Lesson 8
Lesson 9
Lesson 10
Lesson 11
I Presentations and Talks
DSPG Introduction
Ethics
II SDAL Specific Components
1
Infrastructure
1.1
Components
1.2
Accessing Servers
1.2.1
RStudio
1.2.2
Database
1.2.3
Python
1.3
Project Template
1.4
Getting a project repository
1.4.1
Using Terminal
1.4.2
Using RStudio
2
Documentation and Metadata
2.1
Code documentaion
2.1.1
lintr
2.2
Metadata
2.3
README files
2.3.1
Example File
III Version Control
3
Version Control
3.1
Git
3.2
Git setup
3.3
Git on your own
3.4
Working with remotes
3.5
Git with branches
3.6
Collaborating with Git
3.7
Protecting branches
3.8
Help! (FAQ)
3.8.1
General workflow
3.8.2
Git push rejected (master)
3.8.3
Accidently did work on
master
:
3.8.4
Get changes from master on your branch
3.8.5
Remote server (e.g., GitLab, GitHub, Bitbucket, etc) shows merge conflict
3.8.6
Remove data/files from history
IV RStudio
4
RStudio
4.1
Restarting RStudio Session
4.1.1
Within RStudio
4.1.2
From the terminal
4.2
Accessing folders
4.2.1
Outside of Home
V R
5
Data Visualization
5.1
Loading ggplot2
5.2
Creating a ggplot
5.3
Aesthetic Mapings
5.4
Facets
5.5
Geometic Objects
5.6
Statistical Transformations
5.7
Position Adjustments
5.8
Coordinate Systems
6
Importing Data
6.1
Loading datasets
6.2
Writing to a files
7
Data Transformations
7.1
dplyr
library
7.2
Filter
7.3
Comparison (operators)
7.4
Logical operators
7.5
Arrange
7.6
Select
7.7
Mutate
7.8
Summarize (summarise)
7.9
Groupby
8
Tidy data
8.1
tidyr
8.2
Gather
8.3
Primer to joins
8.4
Spread
8.5
Separate
9
Relational Data in R
9.1
Create simple datasets
9.2
one-to-one
9.3
many to one OR one to many
9.4
many to many
9.5
Multiple keys
9.6
Different column names
10
Strings
10.1
Regular Expressions
10.1.1
Intro
10.1.2
Basic Web Scraping Example
10.1.3
Lookaround Examples
11
Factors
11.1
forcats
11.2
Sorting
11.3
Counting
11.4
Re-ording factors
11.5
Modifying factor values
11.5.1
Double check your work
12
Dates and Times
12.1
lubridate
12.2
Making datetimes from data
12.3
Non-standard date formatting
12.4
strptime format variables
12.5
Datetime arithmetic
13
SQL Databases
13.1
Databases in R
13.1.1
CREATE REUSABLE CONNECTION TO DATABASE
13.1.2
WRITE DATA TO DATABASE TABLES
13.2
Joins
13.2.1
JOIN TABLES IN SQL WITH DBI
13.2.2
JOIN TABLES IN SQL IN SQL CHUNK
13.2.3
JOIN TABLES IN SQL IN SQL CHUNK WITH OUTPUT VARIABLE
13.2.4
JOIN TABLES IN SQL IN SQL CHUNK APPLY WHERE CLAUSE AND FUZZY SEARCH
13.2.5
JOIN TABLES IN SQL IN SQL CHUNK APPLY WHERE CLAUSE WITH ‘AND’ AND FUZZY SEARCH
13.2.6
JOIN TABLES IN SQL IN SQL CHUNK APPLY SELECT AND WHERE CLAUSE WITH ‘AND’ AND FUZZY SEARCH AND ORDER
13.2.7
JOIN TABLES IN SQL IN SQL CHUNK APPLY SELECT AND WHERE CLAUSE AND FUZZY SEARCH AND ORDER AND GROUP BY WITH AGGREGATE FUNCTION
13.3
dplyr and data.table
14
Vectors
14.1
vectors (c)
14.2
Subsetting vectors
14.3
Negative indicies drop by position
14.4
Lists
14.5
Named vectors
14.6
Named lists
14.7
vectorization / broadcasting
15
Functions
15.1
Writing Functions
15.1.1
Fahrenheit to Kelvin
15.1.2
Kelvin to Celsius
15.1.3
Fahrenheit to Celsius
16
Conditionals
16.1
if statements
16.2
If else statements
16.3
Dealing with NA
17
Iteration
17.1
Broadcasting
17.2
For loops
17.2.1
Pre allocating in a loop
17.3
purrr (map)
17.4
Apply (in base R)
17.4.1
lapply
17.4.2
sapply
17.4.3
mapply
17.4.4
apply (2-dimensions)
18
R Dialects
18.1
data.frame
18.1.1
Convert to dataframe
18.1.2
Select columns
18.1.3
Select column as dataframe
18.1.4
Filtering rows
18.1.5
Filter rows and select columns
18.1.6
Works with dplyr
VI R Special Topics
19
data.table
19.1
Read into data.table
19.2
Convert to data.table
19.3
Working with data.table
19.3.1
Select columns list() .()
19.3.2
Functions on columns
19.3.3
.N (counts)
19.3.4
Using dplyr
20
Spatial Data
21
APIs and Web Scraping
21.1
Tables from websites (wikipedia)
21.2
Websites (rvest)
21.3
Selenium
22
Communication
23
Shiny
24
Text Mining
24.1
One-token-per-row (unnest_tokens)
24.2
Jane Austen
24.3
Stop words and Frequencies
24.4
gutenbergr package
24.5
Sentiment Analytis
24.6
tf-idf: Term frequency, inverse document frequency
24.7
bind_tf_idf
24.8
Relationships (n-grams)
24.9
Stemming
VII Modeling
25
Random Forests
25.1
CART
25.2
Bootstrapping
25.3
Bagging
25.4
Viewing Random Forests
VIII Exercises
IX Back Matter
Build Details
References
Published with bookdown
Training Manual for SDAL
Chapter 21
APIs and Web Scraping
Selector Gadget:
http://selectorgadget.com/
DataCamp:
https://www.datacamp.com/courses/working-with-web-data-in-r
Additional Resources:
https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/