Ingesting text

This follows Section 19.3 of Modern Data Science with R (2nd edition).

Using rvest

Take a look at the Wikipedia List of songs recorded by the Beatles.

The book uses the second list on that page, Other songs. I have used the Main songs list.

A great reference for regular expressions (used by functions like gsub()) is the R for Data Science (r4ds) book; see Chapter 14 on strings.
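As a small illustration of the kind of regex work that comes up with this data, here is a minimal sketch that strips the literal double quotes wrapped around scraped song titles. The `titles` vector is hand-typed for the example (the values match titles that appear in the scraped table), and the variable names are just placeholders.

```r
library(stringr)

# Song titles as they come out of the scrape: wrapped in literal double quotes.
titles <- c("\"12-Bar Original\"", "\"Bad to Me\"", "\"Carol\"")

# Base R: gsub(pattern, replacement, x) replaces every match of the pattern.
clean_base <- gsub("\"", "", titles)

# stringr: anchor the pattern so only a leading or trailing quote is removed.
clean_stringr <- str_remove_all(titles, "^\"|\"$")

clean_base
clean_stringr
```

Both calls give the same result here; the anchored version is the safer choice if a title could contain a quotation mark in the middle.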

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest) 

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(tidyr) 
library(methods) 
library(mdsr)
library(tm)
Loading required package: NLP

Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate
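# Use https directly; Wikipedia redirects plain http requests anyway.
url <- "https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles"
tables <- url %>%
  read_html() %>%
  # html_elements() is the current name for html_nodes() in rvest >= 1.0.0
  html_elements(css = "table")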
tables
{xml_nodeset (11)}
 [1] <table id="toc" class="toc" summary="Class" align="center" style="text-a ...
 [2] <table class="wikitable" style="font-size:90%;">\n<caption>Key\n</captio ...
 [3] <table class="wikitable sortable plainrowheaders" style="text-align:cent ...
 [4] <table class="wikitable" style="font-size:90%;">\n<caption>Key\n</captio ...
 [5] <table class="wikitable sortable plainrowheaders" style="text-align:cent ...
 [6] <table class="nowraplinks vcard hlist mw-collapsible autocollapse navbox ...
 [7] <table class="nowraplinks vcard hlist mw-collapsible autocollapse navbox ...
 [8] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
 [9] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
[10] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
[11] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
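The listing above mixes song tables with a table of contents, legends, and navigation boxes. One way to narrow it down is a more specific CSS selector. The sketch below runs against a tiny hand-written page, not Wikipedia itself: the markup is hypothetical and only borrows the class names visible in the listing, so it works offline.

```r
library(rvest)

# A tiny stand-in page (hypothetical markup, modeled on the classes listed above).
page <- minimal_html('
  <table class="toc"><tr><td>Contents</td></tr></table>
  <table class="wikitable"><caption>Key</caption><tr><td>x</td><td>Legend</td></tr></table>
  <table class="wikitable sortable plainrowheaders">
    <tr><th>Song</th><th>Year</th></tr>
    <tr><td>Carol</td><td>1963</td></tr>
  </table>
')

# Every <table> on the page, as in the code above.
all_tables <- html_elements(page, "table")

# Only tables that carry both the "wikitable" and "sortable" classes.
song_tables <- html_elements(page, "table.wikitable.sortable")

length(all_tables)
length(song_tables)
html_table(song_tables[[1]])
```

On the real page the same selector would presumably keep only the two sortable song tables, but that is an assumption based on the class names printed above.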
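# Caution: tables[[4]] turns out to be the "Key" legend, not a song table
# (see the output below). Judging by the listing above, the Main songs
# table is probably tables[[3]].
songs <- html_table(tables[[4]])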
glimpse(songs)
Rows: 2
Columns: 2
$ X1 <lgl> NA, NA
$ X2 <chr> "Indicates song not written by the members of the Beatles", "Indica…
songs
# A tibble: 2 × 2
  X1    X2                                                      
  <lgl> <chr>                                                   
1 NA    Indicates song not written by the members of the Beatles
2 NA    Indicates live recording                                
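Since the fourth table is only a two-row legend, it can help to check the size of every table before choosing an index. This is a minimal sketch on a hand-written stand-in page (hypothetical markup, with placeholder symbols in the legend cells); `sizes` and `parsed` are names made up for the example.

```r
library(rvest)

# Stand-in page: one legend table and one song table (hypothetical markup).
page <- minimal_html('
  <table class="wikitable"><caption>Key</caption>
    <tr><td>*</td><td>Indicates song not written by the members of the Beatles</td></tr>
    <tr><td>+</td><td>Indicates live recording</td></tr>
  </table>
  <table class="wikitable sortable">
    <tr><th>Song</th><th>Year</th></tr>
    <tr><td>Carol</td><td>1963</td></tr>
    <tr><td>Bad to Me</td><td>1963</td></tr>
    <tr><td>12-Bar Original</td><td>1965</td></tr>
  </table>
')

# html_table() on a whole node set returns a list with one tibble per table.
parsed <- html_table(html_elements(page, "table"))

# Row and column counts per table make legends easy to tell from data tables.
sizes <- data.frame(
  index = seq_along(parsed),
  rows  = vapply(parsed, nrow, integer(1)),
  cols  = vapply(parsed, ncol, integer(1))
)
sizes

# The table with the most rows is the likeliest data table.
which.max(sizes$rows)
```

Applied to the `tables` object from the real page, the same summary should make the legends and navigation boxes stand out from the long song lists.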
other <- html_table(tables[[5]])
glimpse(other)
Rows: 82
Columns: 7
$ Song               <chr> "\"12-Bar Original\"", "\"Ain't She Sweet\"", "\"Ai…
$ `Release(s)`       <chr> "Anthology 2", "Anthology 1", "Anthology 3", "Antho…
$ `Songwriter(s)`    <chr> "LennonMcCartneyHarrisonStarkey", "Jack YellenMilto…
$ `Lead vocal(s)[d]` <chr> "Instrumental", "Lennon", "McCartney", "Harrison", …
$ Yearrecorded       <int> 1965, 1961, 1969, 1969, 1963, 1963, 1962, 1968, 196…
$ Yearreleased       <int> 1996, 1995, 1996, 1996, 2013, 2013, 1995, 2018, 201…
$ `Ref(s)`           <chr> "[98][99]", "[84][100]", "[101][102]", "[101][103]"…
other
# A tibble: 82 × 7
   Song             `Release(s)` `Songwriter(s)` `Lead vocal(s)[d]` Yearrecorded
   <chr>            <chr>        <chr>           <chr>                     <int>
 1 "\"12-Bar Origi… "Anthology … LennonMcCartne… Instrumental               1965
 2 "\"Ain't She Sw… "Anthology … Jack YellenMil… Lennon                     1961
 3 "\"Ain't She Sw… "Anthology … Jack YellenMil… McCartney                  1969
 4 "\"All Things M… "Anthology … Harrison        Harrison                   1969
 5 "\"Bad to Me\""  "The Beatle… LennonMcCartney Lennon                     1963
 6 "\"Beautiful Dr… "On Air – L… Stephen Foster… McCartney                  1963
 7 "\"Bésame Mucho… "Anthology … Consuelo Veláz… McCartney                  1962
 8 "\"Blue Moon\""  "The Beatle… Richard Rodger… McCartney                  1968
 9 "\"Can You Take… "The Beatle… LennonMcCartney McCartney                  1968
10 "\"Carol\""      "Live at th… Chuck Berry     Lennon                     1963
# ℹ 72 more rows
# ℹ 2 more variables: Yearreleased <int>, `Ref(s)` <chr>
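The scraped columns still need cleaning: titles carry literal quotes, column names contain footnote markers, and co-writers are run together ("LennonMcCartneyHarrisonStarkey"). The sketch below shows one possible cleanup on a hand-copied subset of the rows printed above; `other_demo`, `other_clean`, and the new column names are placeholders. The songwriter split is a heuristic, not a general solution.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hand-copied subset of the rows shown above (only fully visible values).
other_demo <- tribble(
  ~Song,                 ~`Songwriter(s)`,                 ~`Lead vocal(s)[d]`, ~Yearrecorded,
  "\"12-Bar Original\"", "LennonMcCartneyHarrisonStarkey", "Instrumental",      1965L,
  "\"Bad to Me\"",       "LennonMcCartney",                "Lennon",            1963L,
  "\"Carol\"",           "Chuck Berry",                    "Lennon",            1963L
)

other_clean <- other_demo %>%
  rename(
    song          = Song,
    songwriters   = `Songwriter(s)`,
    lead_vocal    = `Lead vocal(s)[d]`,
    year_recorded = Yearrecorded
  ) %>%
  mutate(
    # Drop the literal double quotes around each title.
    song = str_remove_all(song, "\""),
    # Heuristic: a lowercase letter followed directly by an uppercase letter
    # marks a boundary between two names. The (?<!M) lookbehind skips the
    # "cC" inside "McCartney" so that name is not split.
    songwriters = str_replace_all(songwriters, "(?<!M)([a-z])([A-Z])", "\\1, \\2")
  )

other_clean
```

The lookbehind only protects names starting with "Mc"; other names with internal capitals would still be split and would need their own handling.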