---
title: "Ingesting text"
output:
  pdf_document: default
  html_notebook: default
---

This material follows Section 15.3 of the book *Modern Data Science with R*.

# Using *rvest*

Take a look at the Wikipedia [List of songs recorded by the Beatles](http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles).

The book uses the second table on that page, Other songs. Here I use the Main songs table instead.

A great reference for regular expressions (used by functions like `gsub()`) is the [r4ds](https://r4ds.had.co.nz) book; see Chapter 14 on strings.
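
As a quick illustration of the kind of pattern matching used in the chunks below, here is a minimal sketch with `gsub()` and `grepl()`. The character vectors are made up for the example and are not taken from the Wikipedia table.

```r
# Made-up strings for illustration only; not scraped data
titles  <- c('"Yesterday"', '"Help!"')
writers <- c("Lennon McCartney", "Harrison", "McCartney")

# gsub() replaces every match of the pattern; here it strips double quotes
clean_titles <- gsub('"', "", titles)
clean_titles

# grepl() returns TRUE where the pattern matches; | means "or"
either_name <- grepl("McCartney|Lennon", writers)
either_name

# A name, then anything, then a name again: TRUE only when two names appear
two_names <- grepl("(McCartney|Lennon).*(McCartney|Lennon)", writers)
two_names
```

`grep()` returns the positions of the matches rather than a logical vector, which is why the later chunks wrap it in `length()` to count matching rows.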

```{r}
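library(rvest)    # read_html(), html_nodes(), html_table()
library(tidyr) 
library(methods) 
library(mdsr)
library(dplyr)    # mutate(), rename(), filter(); attached explicitly in case mdsr does not attach it
library(mosaic)   # formula interface to tally(); attached explicitly for the same reason
library(tm)       # VCorpus(), tm_map(), DocumentTermMatrix()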

url <- "http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles" 
tables <- url %>%
  read_html() %>%
  html_nodes(css = "table") 
tables
songs <- html_table(tables[[4]])
glimpse(songs)
songs

other <- html_table(tables[[5]])
glimpse(other)
other
```

```{r}
songs <- songs %>% mutate(Song = gsub('\\"', "", Song), Year = as.numeric(Year)) %>% 
  rename(songwriters = `Songwriter(s)`)
songs

other <- other %>% mutate(Song = gsub('\\"', "", Song), Yearrecorded = as.numeric(Yearrecorded)) %>% 
  rename(songwriters = `Songwriter(s)`)
other
```


```{r}
tally(~songwriters, data = songs) %>% 
  sort(decreasing = TRUE) %>% 
  head()

```

```{r}
length(grep("McCartney", songs$songwriters))
length(grep("Lennon", songs$songwriters))
length(grep("(McCartney|Lennon)", songs$songwriters))
length(grep("(McCartney|Lennon).*(McCartney|Lennon)", songs$songwriters))
```

```{r}
songs %>% filter(grepl("(McCartney|Lennon).*(McCartney|Lennon)", songwriters)) %>% 
  select(Song) %>% 
  head()
```

```{r}
song_titles <- VCorpus(VectorSource(songs$Song)) %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  DocumentTermMatrix(control = list(weighting = weightTfIdf))
findFreqTerms(song_titles, 14)
```

# Using *httr*

The following code is from Exercise 15.10.  The site [stackexchange.com](https://stackexchange.com) displays questions and answers on technical topics.

```{r}
library(httr) 
# Find the most recent questions tagged dplyr on Stack Overflow
getresult <- GET("http://api.stackexchange.com",
                 path = "questions",
                 query = list(site = "stackoverflow.com", tagged = "dplyr")) 
stop_for_status(getresult) # Ensure returned without error 
questions <- content(getresult) # Grab content 
names(questions$items[[1]])	# What does the returned data look like?
```

```{r}
length(questions$items)  # use the full element name rather than relying on partial matching
```

```{r}
substr(questions$items[[1]]$title, 1, 68)
```


```{r}
substr(questions$items[[2]]$title, 1, 68)
```

```{r}
substr(questions$items[[3]]$title, 1, 68)
```


The exercise asks: How many questions were returned? Without using jargon, describe in words what is being displayed and how it might be used.
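
To make the shape of the returned data easier to reason about, here is a small sketch that uses a hand-built list in place of a real response. The object `mock` and its titles are invented for illustration; only the layout (an `items` list whose elements each carry a `title`) mirrors what the chunks above access.

```r
# Hypothetical stand-in for a parsed response: a list with an `items` element,
# where each item is itself a list of fields. Titles are invented.
mock <- list(items = list(
  list(title = "First example question"),
  list(title = "Second example question")
))

# Number of questions in the response
n_questions <- length(mock$items)
n_questions

# Pull the title out of every item into one character vector
all_titles <- vapply(mock$items, function(item) item$title, character(1))
all_titles
```

Applied to the real `questions` object, the same `vapply()` call would replace the repeated `substr(questions$items[[i]]$title, ...)` lines with a single vector of titles.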

The next exercise, 15.11, asks for the same thing, again with dplyr.  Try a different tag instead, such as ggplot2.

```{r}
library(httr) 
# Find the most recent questions tagged ggplot2 on Stack Overflow
getresult <- GET("http://api.stackexchange.com",
                 path = "questions",
                 query = list(site = "stackoverflow.com", tagged = "ggplot2")) 
stop_for_status(getresult) # Ensure returned without error 
questions <- content(getresult) # Grab content 
names(questions$items[[1]])	# What does the returned data look like?
```

```{r}
substr(questions$items[[1]]$title, 1, 68)
substr(questions$items[[2]]$title, 1, 68)
substr(questions$items[[3]]$title, 1, 68)
substr(questions$items[[4]]$title, 1, 68)
substr(questions$items[[5]]$title, 1, 68)
```
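
To try other tags without copying the request each time, the steps above could be wrapped in a function. This is only a sketch: the name `get_question_titles` and its `width` argument are hypothetical, and it assumes the *httr* package is installed and the API responds as it does in the chunks above.

```r
# Hypothetical helper: fetch the questions for one tag and return their titles.
# Uses the same request as the chunks above, with the tag as a parameter.
get_question_titles <- function(tag, width = 68) {
  result <- httr::GET("http://api.stackexchange.com",
                      path = "questions",
                      query = list(site = "stackoverflow.com", tagged = tag))
  httr::stop_for_status(result)          # stop if the request failed
  items <- httr::content(result)$items   # parsed list of questions
  titles <- vapply(items, function(item) item$title, character(1))
  substr(titles, 1, width)               # truncate long titles for display
}

# Example call (requires network access):
# head(get_question_titles("ggplot2"), 5)
```

Defining the function does not contact the API; the request is only sent when the function is called.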