---
title: "Ingesting text"
output:
  pdf_document: default
  html_notebook: default
---

This is from Section 15.3 of the Modern Data Science with R book.

# Using *rvest*

Take a look at the Wikipedia [List of songs recorded by the Beatles](http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles). The book uses the second list, Other songs; I have used the Main Songs list instead.

A great reference for regular expressions (used by functions such as `gsub`) is the [r4ds](https://r4ds.had.co.nz) book; see Chapter 14 on strings.

```{r}
library(rvest)
library(tidyr)
library(methods)
library(mdsr)
library(tm)

url <- "http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles"

# Download the page and collect every table node
tables <- url %>%
  read_html() %>%
  html_nodes(css = "table")
tables

# Main Songs list
songs <- html_table(tables[[4]])
glimpse(songs)
songs

# Other songs list
other <- html_table(tables[[5]])
glimpse(other)
other
```

```{r}
# Strip double quotes from the titles and convert the year to a number
songs <- songs %>%
  mutate(Song = gsub('\\"', "", Song), Year = as.numeric(Year)) %>%
  rename(songwriters = `Songwriter(s)`)
songs

other <- other %>%
  mutate(Song = gsub('\\"', "", Song), Yearrecorded = as.numeric(Yearrecorded)) %>%
  rename(songwriters = `Songwriter(s)`)
other
```

```{r}
# Most frequent songwriter credits
tally(~songwriters, data = songs) %>%
  sort(decreasing = TRUE) %>%
  head()
```

```{r}
# Credits that mention McCartney
length(grep("McCartney", songs$songwriters))
# Credits that mention Lennon
length(grep("Lennon", songs$songwriters))
# Credits that mention at least one of the two names
length(grep("(McCartney|Lennon)", songs$songwriters))
# Credits with two occurrences of either name, such as a joint credit
length(grep("(McCartney|Lennon).*(McCartney|Lennon)", songs$songwriters))
```

```{r}
# First few songs whose credit matches the two-name pattern
songs %>%
  filter(grepl("(McCartney|Lennon).*(McCartney|Lennon)", songwriters)) %>%
  select(Song) %>%
  head()
```

```{r}
# Build a TF-IDF weighted document-term matrix from the song titles,
# after removing English stop words
song_titles <- VCorpus(VectorSource(songs$Song)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  DocumentTermMatrix(control = list(weighting = weightTfIdf))
findFreqTerms(song_titles, 14)
```

# Using *httr*

The following code is from Exercise 15.10. The site [stackexchange.com](https://stackexchange.com) displays questions and answers on technical topics.

```{r}
library(httr)
# Find the most recent questions tagged dplyr on Stack Overflow
getresult <- GET("http://api.stackexchange.com",
  path = "questions",
  query = list(site = "stackoverflow.com", tagged = "dplyr"))
stop_for_status(getresult)  # Ensure returned without error
questions <- content(getresult)  # Grab content
names(questions$items[[1]])  # What does the returned data look like?
```

```{r}
length(questions$items)
```

```{r}
substr(questions$items[[1]]$title, 1, 68)
```

```{r}
substr(questions$items[[2]]$title, 1, 68)
```

```{r}
substr(questions$items[[3]]$title, 1, 68)
```

The exercise asks: How many questions were returned? Without using jargon, describe in words what is being displayed and how it might be used.

The next exercise, 15.11, asks for the same thing, again with dplyr. Try a different tag, such as ggplot2.

```{r}
library(httr)
# Find the most recent questions tagged ggplot2 on Stack Overflow
getresult <- GET("http://api.stackexchange.com",
  path = "questions",
  query = list(site = "stackoverflow.com", tagged = "ggplot2"))
stop_for_status(getresult)  # Ensure returned without error
questions <- content(getresult)  # Grab content
names(questions$items[[1]])  # What does the returned data look like?
```

```{r}
# First 68 characters of the first five question titles
substr(questions$items[[1]]$title, 1, 68)
substr(questions$items[[2]]$title, 1, 68)
substr(questions$items[[3]]$title, 1, 68)
substr(questions$items[[4]]$title, 1, 68)
substr(questions$items[[5]]$title, 1, 68)
```
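
The repeated `substr()` calls above can be collapsed into a single pass over the items. The chunk below is a minimal sketch of that idea. It assumes only that the parsed response is a list with an `items` element whose entries each carry a `title`, as in the chunks above. The object `mock_questions` and its three titles are hypothetical stand-ins, invented for illustration so that the chunk runs without a network call.

```{r}
# Hypothetical stand-in for content(getresult): a list with an `items`
# element, where each item is a list that has a `title` field.
mock_questions <- list(
  items = list(
    list(title = "How do I group by two columns in dplyr?"),
    list(title = "Filtering rows with multiple conditions"),
    list(title = "Why does summarise() return one row per group?")
  )
)

# Number of questions in the (mock) response
n_questions <- length(mock_questions$items)

# First 68 characters of every title, returned as a character vector
titles <- vapply(
  mock_questions$items,
  function(q) substr(q$title, 1, 68),
  character(1)
)

n_questions
titles
```

Replacing `mock_questions` with the real `questions` object should give the same kind of result for however many items the API returned, which also answers the "how many questions" part of the exercise.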