---
title: "aRxiv"
output:
  html_notebook: default
  pdf_document: default
---

This is R code from *Modern Data Science with R*, Chapter 15, "Text as data".

Section 15.2, "Analyzing textual data", includes an example in which research papers related to Data Science are downloaded from arXiv (using the *aRxiv* package) and summarized.

```{r}
library(tidyverse)
library(mdsr)
library(mosaic)  # formula interface to tally(), used below
library(aRxiv)
```

```{r}
DataSciencePapers <- arxiv_search(query = '"Data Science"', limit = 200)
head(DataSciencePapers)
```


```{r}
library(lubridate)

# Parse the timestamp strings returned by arXiv into date-time objects
DataSciencePapers <- DataSciencePapers %>%
  mutate(submitted = ymd_hms(submitted), updated = ymd_hms(updated))
glimpse(DataSciencePapers)
```

```{r}
tally(~ year(submitted), data = DataSciencePapers)
```


```{r}
DataSciencePapers %>%
  filter(year(submitted) == 2007) %>%
  glimpse()
```

```{r}
tally(~ primary_category, data = DataSciencePapers)
```

```{r}
# Reduce each category (e.g. "stat.ML") to its top-level field ("stat")
DataSciencePapers %>%
  mutate(field = str_extract(primary_category, "^[a-z-]+")) %>%
  tally(x = ~field) %>%
  sort()
```
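
To see what the field extraction does, here is a minimal sketch on a handful of category codes. The codes are made up for illustration and are not query results; the pattern keeps the leading run of lowercase letters and hyphens, i.e. everything before the first `.`.

```{r}
library(stringr)

# Hypothetical category codes, chosen for illustration only
example_categories <- c("cs.AI", "stat.ML", "astro-ph.CO", "physics.soc-ph", "math.ST")

# Keep the leading lowercase letters and hyphens (the part before the first ".")
example_fields <- str_extract(example_categories, "^[a-z-]+")
example_fields

# Count the codes per field with base R, smallest counts first
sort(table(example_fields))
```

Note that the hyphen has to be allowed in the pattern, otherwise a field such as `astro-ph` would be cut off at `astro`.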

Now use the *tm* package to convert the abstracts in the data frame into a corpus.

```{r}
library(tm) 

Corpus <- with(DataSciencePapers, VCorpus(VectorSource(abstract))) 
Corpus[[1]] %>% as.character() %>% 
  strwrap()
```

```{r}

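# Normalize each abstract: collapse whitespace, drop digits and punctuation,
# lowercase, then remove common English stop words
Corpus <- Corpus %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english"))
strwrap(as.character(Corpus[[1]]))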
```
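
The effect of each cleaning step is easier to follow on a single sentence. The sketch below imitates the steps with base R `gsub()` calls; it is an approximation for illustration, not the *tm* implementation. The sentence and the two-word stop list are made up, and the final tidy-up step is an addition that the pipeline above does not perform.

```{r}
# Made-up sentence for illustration (not an arXiv abstract)
x <- "In 2015,   the Data Science team analyzed 3 large data sets!"

x <- gsub("[[:space:]]+", " ", x)            # like stripWhitespace
x <- gsub("[[:digit:]]+", "", x)             # like removeNumbers
x <- gsub("[[:punct:]]+", "", x)             # like removePunctuation
x <- tolower(x)                              # like content_transformer(tolower)

# Hypothetical, very short stop list; stopwords("english") is much longer
toy_stopwords <- c("in", "the")
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)       # like removeWords

# Extra tidy-up (not in the pipeline above): removals leave gaps behind
cleaned <- trimws(gsub("[[:space:]]+", " ", x))
cleaned
```

Lowercasing before removing stop words matters here: the stop list is lowercase and the match is case-sensitive, so the capitalized "In" would otherwise survive.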

Now use the *wordcloud* package to visualize the data. Do you see "data science"?

```{r}
library(wordcloud)

wordcloud(Corpus, max.words = 30, scale = c(8, 1),
          colors = topo.colors(n = 30), random.color = TRUE)
```

Create a document-term matrix weighted by tf-idf (term frequency-inverse document frequency).

```{r}
DTM <- DocumentTermMatrix(Corpus, control = list(weighting = weightTfIdf)) 
DTM
```
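
To make the weighting concrete, here is a small worked example in base R. It assumes the common definition in which term frequency is the count divided by the document length and inverse document frequency is log2(number of documents / number of documents containing the term), which is believed to match `weightTfIdf` with its default normalization. The counts are made up.

```{r}
# Toy term counts: rows are documents, columns are terms (made-up numbers)
counts <- matrix(
  c(2, 1, 0,
    1, 0, 3,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("data", "science", "network"))
)

# Term frequency: each count divided by the number of terms in its document
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(documents / documents containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# tf-idf: scale each term's column by that term's idf
tf_idf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = "*")
round(tf_idf, 3)
```

The term "data" appears in every toy document, so its idf is log2(3/3) = 0 and its weight vanishes everywhere. This is why a word that dominates the word cloud can rank low under tf-idf.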

```{r}
# With tf-idf weighting, lowfreq is a threshold on each term's summed weight,
# not on a raw count
findFreqTerms(DTM, lowfreq = 0.8)
```

```{r}
# Top nine terms by total tf-idf weight across all abstracts
DTM %>% as.matrix() %>%
  apply(MARGIN = 2, sum) %>%
  sort(decreasing = TRUE) %>%
  head(9)
```


```{r}
findAssocs(DTM, terms = "statistics", corlimit = 0.5)
```

```{r}
findAssocs(DTM, terms = "mathematics", corlimit = 0.5)
```
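
The associations above are correlations between columns of the document-term matrix. The base R sketch below reproduces the idea on a toy matrix, assuming a Pearson correlation with an inclusive lower limit; the terms and weights are made up, and `findAssocs()` may differ in details such as rounding.

```{r}
# Toy weights: rows are documents, columns are terms (made-up numbers)
weights <- cbind(
  statistics = c(0.9, 0.1, 0.8, 0.0, 0.7),
  inference  = c(0.8, 0.0, 0.9, 0.1, 0.6),
  galaxy     = c(0.0, 0.9, 0.1, 0.8, 0.0)
)

# Correlate the "statistics" column with every other term's column
others <- weights[, colnames(weights) != "statistics", drop = FALSE]
assoc <- cor(weights[, "statistics"], others)[1, ]

# Keep terms at or above the threshold, strongest first
corlimit <- 0.5
strong <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
strong
```

Here "inference" rises and falls with "statistics" across the toy documents and is kept, while "galaxy" is negatively correlated and is dropped.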