• Training Manual for SDAL
  • Welcome
    • The Social and Decision Analytics Laboratory
      • Overview: Our Mission & Methods
      • Innovation: Our Research Focus Areas
      • Collaboration: Our Research Partnerships
    • Data Science for the Public Good Program
      • Problem
      • Methods
      • Impact
      • Program Sponsors: Who We Work With
      • Previous Projects
    • Lesson Materials
      • Software-Carpenty
      • Data-Carpentry
      • The Carpentries
      • DataCamp
      • R for Data Science
  • Setup
    • The Bash Shell
      • Windows
      • macOS
      • Linux
      • Chromebook
    • Git
      • Windows
      • macOS
      • Linux
      • Chromebook
    • Text Editor
      • Windows
      • macOS
      • Linux
    • R
      • Windows
      • macOS
      • Linux
      • Post installation settings
    • SQLite
      • Windows
      • macOS
      • Linux
    • SSH Keys
      • GitHub
      • Lightfoot
      • GitLab (devlab)
    • LaTeX
  • Syllabus
    • Lesson Breakdown
      • Lesson 1
      • Lesson 2
      • Lesson 3
      • Lesson 4
      • Lesson 5
      • Lesson 6
      • Lesson 7
      • Lesson 8
      • Lesson 9
      • Lesson 10
      • Lesson 11
  • I Presentations and Talks
  • DSPG Introduction
  • Ethics
  • II SDAL Specific Components
  • 1 Infrastructure
    • 1.1 Components
    • 1.2 Accessing Servers
      • 1.2.1 RStudio
      • 1.2.2 Database
      • 1.2.3 Python
    • 1.3 Project Template
    • 1.4 Getting a project repository
      • 1.4.1 Using Terminal
      • 1.4.2 Using RStudio
  • 2 Documentation and Metadata
    • 2.1 Code documentaion
      • 2.1.1 lintr
    • 2.2 Metadata
    • 2.3 README files
      • 2.3.1 Example File
  • III Version Control
  • 3 Version Control
    • 3.1 Git
    • 3.2 Git setup
    • 3.3 Git on your own
    • 3.4 Working with remotes
    • 3.5 Git with branches
    • 3.6 Collaborating with Git
    • 3.7 Protecting branches
    • 3.8 Help! (FAQ)
      • 3.8.1 General workflow
      • 3.8.2 Git push rejected (master)
      • 3.8.3 Accidently did work on master:
      • 3.8.4 Get changes from master on your branch
      • 3.8.5 Remote server (e.g., GitLab, GitHub, Bitbucket, etc) shows merge conflict
      • 3.8.6 Remove data/files from history
  • IV RStudio
  • 4 RStudio
    • 4.1 Restarting RStudio Session
      • 4.1.1 Within RStudio
      • 4.1.2 From the terminal
    • 4.2 Accessing folders
      • 4.2.1 Outside of Home
  • V R
  • 5 Data Visualization
    • 5.1 Loading ggplot2
    • 5.2 Creating a ggplot
    • 5.3 Aesthetic Mapings
    • 5.4 Facets
    • 5.5 Geometic Objects
    • 5.6 Statistical Transformations
    • 5.7 Position Adjustments
    • 5.8 Coordinate Systems
  • 6 Importing Data
    • 6.1 Loading datasets
    • 6.2 Writing to a files
  • 7 Data Transformations
    • 7.1 dplyr library
    • 7.2 Filter
    • 7.3 Comparison (operators)
    • 7.4 Logical operators
    • 7.5 Arrange
    • 7.6 Select
    • 7.7 Mutate
    • 7.8 Summarize (summarise)
    • 7.9 Groupby
  • 8 Tidy data
    • 8.1 tidyr
    • 8.2 Gather
    • 8.3 Primer to joins
    • 8.4 Spread
    • 8.5 Separate
  • 9 Relational Data in R
    • 9.1 Create simple datasets
    • 9.2 one-to-one
    • 9.3 many to one OR one to many
    • 9.4 many to many
    • 9.5 Multiple keys
    • 9.6 Different column names
  • 10 Strings
    • 10.1 Regular Expressions
      • 10.1.1 Intro
      • 10.1.2 Basic Web Scraping Example
      • 10.1.3 Lookaround Examples
  • 11 Factors
    • 11.1 forcats
    • 11.2 Sorting
    • 11.3 Counting
    • 11.4 Re-ording factors
    • 11.5 Modifying factor values
      • 11.5.1 Double check your work
  • 12 Dates and Times
    • 12.1 lubridate
    • 12.2 Making datetimes from data
    • 12.3 Non-standard date formatting
    • 12.4 strptime format variables
    • 12.5 Datetime arithmetic
  • 13 SQL Databases
    • 13.1 Databases in R
      • 13.1.1 CREATE REUSABLE CONNECTION TO DATABASE
      • 13.1.2 WRITE DATA TO DATABASE TABLES
    • 13.2 Joins
      • 13.2.1 JOIN TABLES IN SQL WITH DBI
      • 13.2.2 JOIN TABLES IN SQL IN SQL CHUNK
      • 13.2.3 JOIN TABLES IN SQL IN SQL CHUNK WITH OUTPUT VARIABLE
      • 13.2.4 JOIN TABLES IN SQL IN SQL CHUNK APPLY WHERE CLAUSE AND FUZZY SEARCH
      • 13.2.5 JOIN TABLES IN SQL IN SQL CHUNK APPLY WHERE CLAUSE WITH ‘AND’ AND FUZZY SEARCH
      • 13.2.6 JOIN TABLES IN SQL IN SQL CHUNK APPLY SELECT AND WHERE CLAUSE WITH ‘AND’ AND FUZZY SEARCH AND ORDER
      • 13.2.7 JOIN TABLES IN SQL IN SQL CHUNK APPLY SELECT AND WHERE CLAUSE AND FUZZY SEARCH AND ORDER AND GROUP BY WITH AGGREGATE FUNCTION
    • 13.3 dplyr and data.table
  • 14 Vectors
    • 14.1 vectors (c)
    • 14.2 Subsetting vectors
    • 14.3 Negative indicies drop by position
    • 14.4 Lists
    • 14.5 Named vectors
    • 14.6 Named lists
    • 14.7 vectorization / broadcasting
  • 15 Functions
    • 15.1 Writing Functions
      • 15.1.1 Fahrenheit to Kelvin
      • 15.1.2 Kelvin to Celsius
      • 15.1.3 Fahrenheit to Celsius
  • 16 Conditionals
    • 16.1 if statements
    • 16.2 If else statements
    • 16.3 Dealing with NA
  • 17 Iteration
    • 17.1 Broadcasting
    • 17.2 For loops
      • 17.2.1 Pre allocating in a loop
    • 17.3 purrr (map)
    • 17.4 Apply (in base R)
      • 17.4.1 lapply
      • 17.4.2 sapply
      • 17.4.3 mapply
      • 17.4.4 apply (2-dimensions)
  • 18 R Dialects
    • 18.1 data.frame
      • 18.1.1 Convert to dataframe
      • 18.1.2 Select columns
      • 18.1.3 Select column as dataframe
      • 18.1.4 Filtering rows
      • 18.1.5 Filter rows and select columns
      • 18.1.6 Works with dplyr
  • VI R Special Topics
  • 19 data.table
    • 19.1 Read into data.table
    • 19.2 Convert to data.table
    • 19.3 Working with data.table
      • 19.3.1 Select columns list() .()
      • 19.3.2 Functions on columns
      • 19.3.3 .N (counts)
      • 19.3.4 Using dplyr
  • 20 Spatial Data
  • 21 APIs and Web Scraping
    • 21.1 Tables from websites (wikipedia)
    • 21.2 Websites (rvest)
    • 21.3 Selenium
  • 22 Communication
  • 23 Shiny
  • 24 Text Mining
    • 24.1 One-token-per-row (unnest_tokens)
    • 24.2 Jane Austen
    • 24.3 Stop words and Frequencies
    • 24.4 gutenbergr package
    • 24.5 Sentiment Analytis
    • 24.6 tf-idf: Term frequency, inverse document frequency
    • 24.7 bind_tf_idf
    • 24.8 Relationships (n-grams)
    • 24.9 Stemming
  • VII Modeling
  • 25 Random Forests
    • 25.1 CART
    • 25.2 Bootstrapping
    • 25.3 Bagging
    • 25.4 Viewing Random Forests
  • VIII Exercises
  • IX Back Matter
  • Build Details
  • References
  • Published with bookdown

Training Manual for SDAL

9.2 one-to-one

When you perform a join, the function and code do not change. The output of the join will be determined by whether or not you have duplicate values in your keys, and whether or not both tables you are joining have duplicate values in the keys.

In a one-to-one merge, there are no duplicate keys in both tables.

x %>% 
    inner_join(y, by = 'key')
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
x %>% 
    left_join(y, by = 'key')
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA>
x %>% 
    right_join(y, by = 'key')
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     4 <NA>  y3
x %>% 
    full_join(y, by = 'key')
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA> 
## 4     4 <NA>  y3