This is from Section 19.3 of Modern Data Science with R (2nd edition).
Using rvest
Take a look at the Wikipedia List of songs recorded by the Beatles.
The book uses the second list, Other songs. I have used the Main songs list.
A great reference for regex (functions like gsub) is the r4ds book; see Chapter 14 on strings.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
    guess_encoding
library(tidyr)
library(methods)
library(mdsr)
library(tm)
Loading required package: NLP
Attaching package: 'NLP'
The following object is masked from 'package:ggplot2':
annotate
url <- "http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles"
tables <- url %>%
read_html() %>%
html_nodes(css = "table")
tables
{xml_nodeset (11)}
[1] <table id="toc" class="toc" summary="Class" align="center" style="text-a ...
[2] <table class="wikitable" style="font-size:90%;">\n<caption>Key\n</captio ...
[3] <table class="wikitable sortable plainrowheaders" style="text-align:cent ...
[4] <table class="wikitable" style="font-size:90%;">\n<caption>Key\n</captio ...
[5] <table class="wikitable sortable plainrowheaders" style="text-align:cent ...
[6] <table class="nowraplinks vcard hlist mw-collapsible autocollapse navbox ...
[7] <table class="nowraplinks vcard hlist mw-collapsible autocollapse navbox ...
[8] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
[9] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
[10] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
[11] <table class="nowraplinks vcard navbox-subgroup" style="border-spacing:0 ...
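Picking a table by position (tables[[4]], tables[[5]]) breaks whenever Wikipedia adds or removes a table. A possible alternative, sketched here, is to select by CSS class. In the listing above, the two song tables carry the classes "wikitable sortable" and the Key tables carry only "wikitable". The tiny page below is a made-up stand-in so the example runs offline; html_elements() is the newer rvest name for html_nodes().

```r
library(rvest)

# Hypothetical miniature page standing in for the Wikipedia article
page <- read_html('<html><body>
  <table class="wikitable"><caption>Key</caption>
    <tr><td>x</td><td>Indicates live recording</td></tr>
  </table>
  <table class="wikitable sortable plainrowheaders">
    <tr><th>Song</th><th>Yearrecorded</th></tr>
    <tr><td>Carol</td><td>1963</td></tr>
    <tr><td>Blue Moon</td><td>1968</td></tr>
  </table>
</body></html>')

# Keep only tables that have both the "wikitable" and "sortable" classes
song_tables <- html_elements(page, css = "table.wikitable.sortable")

# Convert the first matching table to a tibble
demo <- html_table(song_tables[[1]])
demo
```

On the real page the same selector should skip the Key and navbox tables, but I have not checked it against the live article, so inspect the result before relying on it.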
songs <- html_table(tables[[4]])
glimpse(songs)
Rows: 2
Columns: 2
$ X1 <lgl> NA, NA
$ X2 <chr> "Indicates song not written by the members of the Beatles", "Indica…
songs
# A tibble: 2 × 2
  X1    X2
  <lgl> <chr>
1 NA    Indicates song not written by the members of the Beatles
2 NA    Indicates live recording
other <- html_table(tables[[5]])
glimpse(other)
Rows: 82
Columns: 7
$ Song <chr> "\"12-Bar Original\"", "\"Ain't She Sweet\"", "\"Ai…
$ `Release(s)` <chr> "Anthology 2", "Anthology 1", "Anthology 3", "Antho…
$ `Songwriter(s)` <chr> "LennonMcCartneyHarrisonStarkey", "Jack YellenMilto…
$ `Lead vocal(s)[d]` <chr> "Instrumental", "Lennon", "McCartney", "Harrison", …
$ Yearrecorded <int> 1965, 1961, 1969, 1969, 1963, 1963, 1962, 1968, 196…
$ Yearreleased <int> 1996, 1995, 1996, 1996, 2013, 2013, 1995, 2018, 201…
$ `Ref(s)` <chr> "[98][99]", "[84][100]", "[101][102]", "[101][103]"…
other
# A tibble: 82 × 7
   Song             `Release(s)` `Songwriter(s)` `Lead vocal(s)[d]` Yearrecorded
   <chr>            <chr>        <chr>           <chr>                     <int>
 1 "\"12-Bar Origi… "Anthology … LennonMcCartne… Instrumental               1965
 2 "\"Ain't She Sw… "Anthology … Jack YellenMil… Lennon                     1961
 3 "\"Ain't She Sw… "Anthology … Jack YellenMil… McCartney                  1969
 4 "\"All Things M… "Anthology … Harrison        Harrison                   1969
 5 "\"Bad to Me\""  "The Beatle… LennonMcCartney Lennon                     1963
 6 "\"Beautiful Dr… "On Air – L… Stephen Foster… McCartney                  1963
 7 "\"Bésame Mucho… "Anthology … Consuelo Veláz… McCartney                  1962
 8 "\"Blue Moon\""  "The Beatle… Richard Rodger… McCartney                  1968
 9 "\"Can You Take… "The Beatle… LennonMcCartney McCartney                  1968
10 "\"Carol\""      "Live at th… Chuck Berry     Lennon                     1963
# ℹ 72 more rows
# ℹ 2 more variables: Yearreleased <int>, `Ref(s)` <chr>
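The scraped values still carry some clutter: each Song title is wrapped in literal double quotes, and the `Lead vocal(s)[d]` column name ends in a footnote marker. A minimal sketch of cleaning both with gsub follows. It uses two titles and three column names copied from the glimpse() output above rather than the live table, and the names song_clean and col_clean are mine.

```r
# Sample values copied from the glimpse() output above
song <- c("\"12-Bar Original\"", "\"Ain't She Sweet\"")
col_names <- c("Song", "Lead vocal(s)[d]", "Ref(s)")

# Drop the literal double quotes wrapped around each title
# (fixed = TRUE treats the pattern as plain text, not a regex)
song_clean <- gsub("\"", "", song, fixed = TRUE)

# Drop a trailing single-letter footnote marker such as "[d]"
col_clean <- gsub("\\[[a-z]\\]$", "", col_names)

song_clean
col_clean
```

The same patterns could be applied inside mutate() and rename_with() on the full tibble; stringr::str_remove_all() would be the tidyverse equivalent of the first call.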