aRxiv

This is R code from Modern Data Science with R, Chapter 15, "Text as data".

Section 15.2, "Analyzing textual data", includes an example in which research papers related to data science are downloaded from arXiv with the aRxiv package and then summarized.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(aRxiv)
DataSciencePapers <- arxiv_search(query = '"Data Science"', limit = 200)
retrieved batch 1
retrieved batch 2
head(DataSciencePapers)
                  id           submitted             updated
1 astro-ph/0701361v1 2007-01-12 03:28:11 2007-01-12 03:28:11
2        0901.2805v1 2009-01-19 10:38:33 2009-01-19 10:38:33
3        0901.3118v2 2009-01-20 18:48:59 2009-01-24 19:23:47
4        0909.3895v1 2009-09-22 02:55:14 2009-09-22 02:55:14
5        1106.2503v5 2011-06-13 17:42:32 2013-06-23 21:21:41
6        1106.3305v1 2011-06-16 18:45:32 2011-06-16 18:45:32
                                                                                           title
1                               How to Make the Dream Come True: The Astronomers' Data Manifesto
2 Safeguarding Old and New Journal Tables for the VO: Status for\n  Extragalactic and Radio Data
3                                               The CATS Service: an Astrophysical Research Tool
4                             The Revolution in Astronomy Education: Data Science for the Masses
5                                         A Large-Scale Community Structure Analysis In Facebook
6                                                                        The Art of Data Science
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         abstract
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Astronomy is one of the most data-intensive of the sciences. Data technology\nis accelerating the quality and effectiveness of its research, and the rate of\nastronomical discovery is higher than ever. As a result, many view astronomy as\nbeing in a 'Golden Age', and projects such as the Virtual Observatory are\namongst the most ambitious data projects in any field of science. But these\npowerful tools will be impotent unless the data on which they operate are of\nmatching quality. Astronomy, like other fields of science, therefore needs to\nestablish and agree on a set of guiding principles for the management of\nastronomical data. To focus this process, we are constructing a 'data\nmanifesto', which proposes guidelines to maximise the rate and\ncost-effectiveness of scientific discovery.\n
2                                                                                                                                                                                                                                                                                                                                 Independent of established data centers, and partly for my own research,\nsince 1989 I have been collecting the tabular data from over 2600 articles\nconcerned with radio sources and extragalactic objects in general. Optical\ncharacter recognition (OCR) was used to recover tables from 740 papers. Tables\nfrom only 41 percent of the 2600 articles are available in the CDS or CATS\ncatalog collections, and only slightly better coverage is estimated for the NED\ndatabase. This fraction is not better for articles published electronically\nsince 2001. Both object databases (NED, SIMBAD, LEDA) as well as catalog\nbrowsers (VizieR, CATS) need to be consulted to obtain the most complete\ninformation on astronomical objects. More human resources at the data centers\nand better collaboration between authors, referees, editors, publishers, and\ndata centers are required to improve data coverage and accessibility. The\ncurrent efforts within the Virtual Observatory (VO) project, to provide\nretrieval and analysis tools for different types of published and archival data\nstored at various sites, should be balanced by an equal effort to recover and\ninclude large amounts of published data not currently available in this way.\n
3                                                                                                                                                                                                                                                     We describe the current status of CATS (astrophysical CATalogs Support\nsystem), a publicly accessible tool maintained at Special Astrophysical\nObservatory of the Russian Academy of Sciences (SAO RAS) (http://cats.sao.ru)\nallowing one to search hundreds of catalogs of astronomical objects discovered\nall along the electromagnetic spectrum. Our emphasis is mainly on catalogs of\nradio continuum sources observed from 10 MHz to 245 GHz, and secondly on\ncatalogs of objects such as radio and active stars, X-ray binaries, planetary\nnebulae, HII regions, supernova remnants, pulsars, nearby and radio galaxies,\nAGN and quasars. CATS also includes the catalogs from the largest extragalactic\nsurveys with non-radio waves. In 2008 CATS comprised a total of about 10e9\nrecords from over 400 catalogs in the radio, IR, optical and X-ray windows,\nincluding most source catalogs deriving from observations with the Russian\nradio telescope RATAN-600. CATS offers several search tools through different\nways of access, e.g. via web interface and e-mail. Since its creation in 1997\nCATS has managed about 10,000 requests. Currently CATS is used by external\nusers about 1500 times per day and since its opening to the public in 1997 has\nreceived about 4000 requests for its selection and matching tasks.\n
4   As our capacity to study ever-expanding domains of our science has increased\n(including the time domain, non-electromagnetic phenomena, magnetized plasmas,\nand numerous sky surveys in multiple wavebands with broad spatial coverage and\nunprecedented depths), so have the horizons of our understanding of the\nUniverse been similarly expanding. This expansion is coupled to the exponential\ndata deluge from multiple sky surveys, which have grown from gigabytes into\nterabytes during the past decade, and will grow from terabytes into Petabytes\n(even hundreds of Petabytes) in the next decade. With this increased vastness\nof information, there is a growing gap between our awareness of that\ninformation and our understanding of it. Training the next generation in the\nfine art of deriving intelligent understanding from data is needed for the\nsuccess of sciences, communities, projects, agencies, businesses, and\neconomies. This is true for both specialists (scientists) and non-specialists\n(everyone else: the public, educators and students, workforce). Specialists\nmust learn and apply new data science research techniques in order to advance\nour understanding of the Universe. Non-specialists require information literacy\nskills as productive members of the 21st century workforce, integrating\nfoundational skills for lifelong learning in a world increasingly dominated by\ndata. We address the impact of the emerging discipline of data science on\nastronomy education within two contexts: formal education and lifelong\nlearners.\n
5                                                                                                                                                                                    Understanding social dynamics that govern human phenomena, such as\ncommunications and social relationships is a major problem in current\ncomputational social sciences. In particular, given the unprecedented success\nof online social networks (OSNs), in this paper we are concerned with the\nanalysis of aggregation patterns and social dynamics occurring among users of\nthe largest OSN as the date: Facebook. In detail, we discuss the mesoscopic\nfeatures of the community structure of this network, considering the\nperspective of the communities, which has not yet been studied on such a large\nscale. To this purpose, we acquired a sample of this network containing\nmillions of users and their social relationships; then, we unveiled the\ncommunities representing the aggregation units among which users gather and\ninteract; finally, we analyzed the statistical features of such a network of\ncommunities, discovering and characterizing some specific organization patterns\nfollowed by individuals interacting in online social networks, that emerge\nconsidering different sampling techniques and clustering methodologies. This\nstudy provides some clues of the tendency of individuals to establish social\ninteractions in online social networks that eventually contribute to building a\nwell-connected social structure, and opens space for further social studies.\n
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        To flourish in the new data-intensive environment of 21st century science, we\nneed to evolve new skills. These can be expressed in terms of the systemized\nframework that formed the basis of mediaeval education - the trivium (logic,\ngrammar, and rhetoric) and quadrivium (arithmetic, geometry, music, and\nastronomy). However, rather than focusing on number, data is the new keystone.\nWe need to understand what rules it obeys, how it is symbolized and\ncommunicated and what its relationship to physical space and time is. In this\npaper, we will review this understanding in terms of the technologies and\nprocesses that it requires. We contend that, at least, an appreciation of all\nthese aspects is crucial to enable us to extract scientific information and\nknowledge from the data sets which threaten to engulf and overwhelm us.\n
                                                                                            authors
1                                                                                      Ray P Norris
2                                                                                   Heinz Andernach
3                                    O. V. Verkhodanov|S. A. Trushkin|H. Andernach|V. N. Chernenkov
4 Kirk D. Borne|Suzanne Jacoby|K. Carney|A. Connolly|T. Eastman|M. J. Raddick|J. A. Tyson|J. Wallin
5                                                                                    Emilio Ferrara
6                                                                                 Matthew J. Graham
                                                                                                                                                                                                                                                                                                                                                                                 affiliations
1                                                                                                                                                                                                                                                                                                                                                                                            
2                                                                                                                                                                                                                                                                                                                                                                                            
3 Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;|Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;|Argelander-Institut fuer Astronomie, Universitaet Bonn, Bonn, Germany; on leave of absence from Depto. de Astronomia, Univ. Guanajuato, Mexico|Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;
4                                                                                                                                                                                                                                                 George Mason University|LSST Corporation|Adler Planetarium|U. Washington|Wyle Information Systems|JHU/SDSS|UC Davis|George Mason University
5                                                                                                                                                                                                                                                                                                                                                                                            
6                                                                                                                                                                                                                                                                                                                                                                                            
                            link_abstract
1 http://arxiv.org/abs/astro-ph/0701361v1
2        http://arxiv.org/abs/0901.2805v1
3        http://arxiv.org/abs/0901.3118v2
4        http://arxiv.org/abs/0909.3895v1
5        http://arxiv.org/abs/1106.2503v5
6        http://arxiv.org/abs/1106.3305v1
                                 link_pdf
1 http://arxiv.org/pdf/astro-ph/0701361v1
2        http://arxiv.org/pdf/0901.2805v1
3        http://arxiv.org/pdf/0901.3118v2
4        http://arxiv.org/pdf/0909.3895v1
5        http://arxiv.org/pdf/1106.2503v5
6        http://arxiv.org/pdf/1106.3305v1
                                       link_doi
1                                              
2            http://dx.doi.org/10.2481/dsj.8.41
3            http://dx.doi.org/10.2481/dsj.8.34
4                                              
5              http://dx.doi.org/10.1140/epjds9
6 http://dx.doi.org/10.1007/978-1-4614-3323-1_4
                                                                                                                                                                                                                                                                                                                                      comment
1                                                                                                                                                                                                                                                             Submitted to Data Science Journal Presented at CODATA, Beijing,\n  October 2006
2                                                                       11 pages, 4 figures; accepted for publication in Data Science\n  Journal, vol. 8 (2009), http://dsj.codataweb.org; presented at Special\n  Session "Astronomical Data and the Virtual Observatory" on the conference\n  "CODATA 21", Kiev, Ukraine, October 5-8, 2008
3 8 pages, no figures; accepted for publication in Data Science\n  Journal, vol. 8 (2009), http://dsj.codataweb.org; presented at Special\n  Session "Astronomical Data and the Virtual Observatory" on the conference\n  "CODATA 21", Kiev, Ukraine, October 5-8, 2008; replaced incorrect reference\n  arXiv:0901.2085 with arXiv:0901.2805
4                                                                                                                                                                      12 pages total: 1 cover page, 1 page of co-signers, plus 10 pages,\n  State of the Profession Position Paper submitted to the Astro2010 Decadal\n  Survey (March 2009)
5                                                                                                                                                                                                           30 pages, 13 Figures - Published on: EPJ Data Science, 1:9, 2012 -\n  open access at: http://www.epjdatascience.com/content/1/1/9
6                                                                                                                                            12 pages, invited talk at Astrostatistics and Data Mining in Large\n  Astronomical Databases workshop, La Palma, Spain, 30 May - 3 June 2011, to\n  appear in Springer Series on Astrostatistics
                  journal_ref                         doi primary_category
1                                                                 astro-ph
2                                        10.2481/dsj.8.41      astro-ph.IM
3                                        10.2481/dsj.8.34      astro-ph.IM
4                                                              astro-ph.IM
5 EPJ Data Science, 1:9, 2012              10.1140/epjds9            cs.SI
6                             10.1007/978-1-4614-3323-1_4      astro-ph.IM
                                                                 categories
1                                                                  astro-ph
2                                                   astro-ph.IM|astro-ph.CO
3                                                   astro-ph.IM|astro-ph.CO
4                               astro-ph.IM|cs.DB|cs.DL|cs.IR|physics.ed-ph
5 cs.SI|cs.CY|physics.soc-ph|91D30, 05C82, 68R10, 90B10, 90C35|H.2.8; D.2.8
6                                                         astro-ph.IM|cs.DL
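
The printed result shows a few things that usually need tidying before any summary: `submitted` is printed as plain text, some titles contain embedded line breaks (`\n` followed by spaces), and `authors` packs several names into one string separated by `|`. The following is a minimal sketch of that clean-up, not the book's own code. To run without querying arXiv it uses a hypothetical three-row stand-in called `papers`, with values copied from the output above; the names `papers`, `papers_clean`, `papers_by_year`, `authors_long`, and `n_authors` are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical stand-in for DataSciencePapers: three rows copied from the
# head() output above, so the sketch runs offline.
papers <- tibble(
  id = c("0901.2805v1", "0901.3118v2", "1106.2503v5"),
  submitted = c(
    "2009-01-19 10:38:33",
    "2009-01-20 18:48:59",
    "2011-06-13 17:42:32"
  ),
  title = c(
    "Safeguarding Old and New Journal Tables for the VO: Status for\n  Extragalactic and Radio Data",
    "The CATS Service: an Astrophysical Research Tool",
    "A Large-Scale Community Structure Analysis In Facebook"
  ),
  authors = c(
    "Heinz Andernach",
    "O. V. Verkhodanov|S. A. Trushkin|H. Andernach|V. N. Chernenkov",
    "Emilio Ferrara"
  ),
  primary_category = c("astro-ph.IM", "astro-ph.IM", "cs.SI")
)

papers_clean <- papers |>
  mutate(
    # Parse the timestamp text into a date-time
    submitted = ymd_hms(submitted),
    # Collapse embedded line breaks and repeated spaces in titles
    title = str_squish(title),
    # Number of authors = number of "|" separators plus one
    n_authors = str_count(authors, fixed("|")) + 1
  )

# Number of papers submitted per year
papers_by_year <- papers_clean |>
  count(year = year(submitted))

# One row per (paper, author) pair; the delimiter is matched as a fixed string
authors_long <- papers_clean |>
  select(id, authors) |>
  separate_longer_delim(authors, delim = "|")

papers_by_year
authors_long
```

With the full download, the same steps could be applied by replacing `papers` with `DataSciencePapers`, assuming its columns have the names shown in the printed output.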