R web scraper is not outputting PDF text in one row

I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. Every article has a link to its PDF, so I've been trying to pull the PDF link and scrape the entire text into a CSV. The full text should fit into a single row, but in the output CSV one article is spread across 11 rows. How can I fix this?
The code is below:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>%
    html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink, endLink, sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

Looking at your last function, if I understand correctly, you want to take the URL, scrape all the text into a data frame/tibble, and then export it to a CSV. The 11 rows come from the fact that pdf_text() returns a character vector with one element per page, and paste(..., sep = "\n") does not collapse that vector into a single string (you would need collapse =). Here is how you can do it with just 1 article; with a little manipulation you should be able to loop through several links as well (see the sketch after the code below). Apologies if I am misunderstanding:
library(tidyverse)
library(rvest)
# read in html link
document_link <- read_html("https://doi.org/10.1093/dnares/dsm026")
# get the text, and put it into a tibble with only 1 row
text_tibble <- document_link %>%
  html_nodes('.chapter-para') %>%
  html_text() %>%
  as_tibble() %>%
  summarize(full_text = paste(value, collapse = " ")) ## this will collapse to 1 row
# now write to csv
## write_csv(text_tibble, file = "")
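For looping over several articles, here is a minimal sketch along the same lines (the doi_links vector is a hypothetical list of DOI URLs you would supply; the '.chapter-para' selector is the same one used above):
library(tidyverse)
library(rvest)
# hypothetical vector of article DOI links -- replace with your own
doi_links <- c("https://doi.org/10.1093/dnares/dsm026")
all_text <- map_dfr(doi_links, function(link) {
  text <- read_html(link) %>%
    html_nodes(".chapter-para") %>%
    html_text() %>%
    paste(collapse = " ") # collapse all paragraphs into one string per article
  tibble(url = link, full_text = text)
})
# write_csv(all_text, "full_text.csv")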

Related

R scrape multiple png files

I have looked at other questions, but they do not seem pertinent to mine. I am trying to scrape multiple .png plots with R from the 'Indicator' section of https://tradingeconomics.com/
For any indicator there are data for multiple countries, and each country's page includes a plot. I would like to find a way to scrape the png file for each country through a single routine.
I have tried the first indicator ('growth rate'), and my code so far is the following:
library(stringr)
library(dplyr)
library(rvest)
tradeec <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
tradeec_countries <- tradeec %>%
  html_nodes("td:nth-child(1)") %>%
  html_text()
tradeec_countries <- str_replace_all(tradeec_countries, "[\r\n]" , "")
tradeec_countries <- as.data.frame(tradeec_countries)
tradeec_countries <- tradeec_countries[-c(91:95), ]
tradeec_plots <- paste0("https://d3fy651gv2fhd3.cloudfront.net/charts", tradeec_countries, "-gdp-growth.png?s=", i)
Nonetheless I am not reaching my goal.
Any hint?
Updated answer
For example, all the figures in the World column of the linked page can be obtained using the following code. Other columns, such as Europe, America, Asia, Australia, and G20, can be obtained similarly.
page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
url_init <- "https://tradingeconomics.com"
country_list <- html_nodes(page,"td a") %>% html_attr("href")
world_list <- paste(url_init,country_list,sep = "")
page_list <- vector(mode = "list")
for(page_index in 1:length(world_list)) {
page_list[[page_index]] = read_html(world_list[page_index])
}
for (i in 1:length(page_list)) {
figure_link <- html_nodes(page_list[[i]],"#ImageChart") %>% html_attr("src")
figure_name <- gsub(".*charts/(.*png).*","\\1",figure_link,perl = TRUE)
figure_name <- paste(i,"_",figure_name)
download.file(figure_link,figure_name)
}
Original answer
The following code can get the figure's link and name.
tradeec <- read_html("https://tradingeconomics.com/south-africa/gdp-growth")
figure_link <- html_nodes(tradeec, "#ImageChart") %>% html_attr("src")
figure_name <- gsub(".*charts/(.*png).*", "\\1", figure_link, perl = T)
download.file(figure_link,figure_name)
Then you can replace south-africa in the link with each of the countries you want.
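If you want to loop the original answer over several countries, here is a minimal sketch, assuming each country page follows the same /<country>/gdp-growth URL pattern (the country slugs below are just examples):
countries <- c("south-africa", "brazil", "japan") # example country slugs
for (country in countries) {
  country_page <- read_html(paste0("https://tradingeconomics.com/", country, "/gdp-growth"))
  figure_link <- html_nodes(country_page, "#ImageChart") %>% html_attr("src")
  figure_name <- gsub(".*charts/(.*png).*", "\\1", figure_link, perl = TRUE)
  # prefix the country so files from different pages don't overwrite each other
  download.file(figure_link, paste0(country, "_", figure_name), mode = "wb")
}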

R code for logging in to an online newspaper while scraping URLs?

I have a list of URLs, and I am trying to scrape their content for my research in R. I read each page with read_html in a for loop. The problem is that I need to log in to the newspaper's site to see the content, so I am trying to log in with my ID and password so that I can scrape the news text and date for each URL found in the search results.
Can I somehow include the login step in (or before) the for loop so that it can access the content of the news articles?
library(rvest)
library(stringr)
library(purrr)
library(readbulk)
library(dplyr)
#Read URLs
urls <- read_bulk("C:/Users/XXXX", extension = ".csv") %>%
  dplyr::distinct(link) # removing duplicates
#For Loop
titles <- c()
text <- c()
url <- c()
date <- c()
for (i in 1:nrow(urls)) {
  data <- read_html(paste0(urls$link[i]))
  body <- data %>%
    html_nodes("p") %>%
    html_text() %>%
    str_c(collapse = " ", sep = "")
  text = append(text, body)
  data <- read_html(paste0(urls$link[i]))
  header <- data %>%
    html_node("title") %>%
    html_text()
  titles = append(titles, rep(header, each = length(body)))
  data <- read_html(paste0(urls$link[i]))
  time <- data %>%
    html_nodes("time") %>% # see HTML source code for data within this tag
    html_text() %>%
    str_c(collapse = " ", sep = "")
  date = append(date, rep(time, each = length(time)))
  url = append(url, rep(paste0(urls$link[i]), each = length(body)))
  print(i)
}
data <- data.frame(Headline=titles, Body=text, Date=date, Url=url) # As Dataframe
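One common way to handle the login itself with rvest is to authenticate once with a session object and then reuse that session inside the loop instead of calling read_html() directly. The sketch below assumes the newspaper exposes a standard HTML login form and uses rvest's session helpers (rvest >= 1.0); the login URL and form field names are placeholders you would need to replace with the real ones:
library(rvest)
login_url <- "https://example-newspaper.com/login" # placeholder login page
sess <- session(login_url)
login_form <- html_form(sess)[[1]] # assumes the first form on the page is the login form
filled_form <- html_form_set(login_form,
                             username = "your_id",        # field names are site-specific placeholders
                             password = "your_password")
sess <- session_submit(sess, filled_form)
# inside the loop, navigate with the authenticated session
for (i in 1:nrow(urls)) {
  page <- session_jump_to(sess, urls$link[i])
  body <- page %>%
    html_nodes("p") %>%
    html_text() %>%
    str_c(collapse = " ") # stringr is assumed loaded, as in the loop above
  # ...collect titles, dates, etc. as in the loop above
}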

How to download multiple files with the same name from an HTML page?

I want to download all the files named "listings.csv.gz" that refer to US cities from http://insideairbnb.com/get-the-data.html. I can do it by writing out each link, but is it possible to do it in a loop?
In the end I'll keep only a few columns from each file and merge them into one file.
Since the problem was solved thanks to @CodeNoob, I'd like to share how it all worked out:
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
html_nodes("a") %>%
html_attr("href")
# Filter for listings.csv.gz, USA cities, data for March 2019
wanted <- grep('listings.csv.gz', links)
USA <- grep('united-states', links)
wanted.USA = wanted[wanted %in% USA]
wanted.links <- links[wanted.USA]
wanted.links = grep('2019-03', wanted.links, value = TRUE)
wanted.cols = c("host_is_superhost", "summary", "host_identity_verified", "street",
"city", "property_type", "room_type", "bathrooms",
"bedrooms", "beds", "price", "security_deposit", "cleaning_fee",
"guests_included", "number_of_reviews", "instant_bookable",
"host_response_rate", "host_neighbourhood",
"review_scores_rating", "review_scores_accuracy","review_scores_cleanliness",
"review_scores_checkin" ,"review_scores_communication",
"review_scores_location", "review_scores_value", "space",
"description", "host_id", "state", "latitude", "longitude")
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(wanted.cols) %>%
    mutate(source.url = link)
  df
}
all.df = list()
for (i in seq_along(wanted.links)) {
  all.df[[i]] = read.gz.url(wanted.links[i])
}
all.df = map(all.df, as_tibble)
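To finish the merge described above (the list produced by map() still needs to be bound into one data frame), a short sketch; the output filename is arbitrary:
merged <- bind_rows(all.df) # stack all cities into one data frame
write.csv(merged, "listings_usa_2019-03.csv", row.names = FALSE)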
You can actually extract all links, filter for the ones containing listings.csv.gz and then download these in a loop:
library(rvest)
library(dplyr)
# Get all download links
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")
# Filter for listings.csv.gz
wanted <- grep('listings.csv.gz', links)
wanted.links <- links[wanted]
for (link in wanted.links) {
  con <- gzcon(url(link))
  txt <- readLines(con)
  df <- read.csv(textConnection(txt))
  # Do what you want
}
Example: Download and combine the files
To get the result you want, I would suggest writing a download function that filters for the columns you want and then combining the results into a single data frame, for example something like this:
read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df <- read.csv(textConnection(readLines(con)))
  close(con)
  df <- df %>%
    select(c('calculated_host_listings_count_shared_rooms', 'cancellation_policy')) %>% # random columns I chose
    mutate(source.url = link) # you may need to remember the origin of each row
  df
}
all.df <- do.call('rbind', lapply(head(wanted.links,2), read.gz.url))
Note: I only tested this on the first two files, since they are pretty large.

R readHTMLTable Pro Football Reference Team Offense

I am trying to get the "Team Offense" table into R. I have tried multiple techniques and I cannot seem to get it to work. It looks like R is only reading the first two tables. The link is below.
https://www.pro-football-reference.com/years/2018/index.htm
This is what I have tried...
library(XML)
library(RCurl)
TeamData = 'https://www.pro-football-reference.com/years/2018/index.htm'
URL = TeamData
URLdata = getURL(URL)
table = readHTMLTable(URLdata, stringsAsFactors = F, which = 5)
Scraping Sports Reference sites can be tricky but they are great sources:
library(rvest)
library(httr)
link <- "https://www.pro-football-reference.com/years/2018/index.htm"
doc <- GET(link)
cont <- content(doc, "text") %>%
  gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>% # the later tables are wrapped in HTML comments, so strip the comment markers first
  read_html %>%
  html_nodes(".table_outer_container table") %>%
  html_table()
# Team Offense table is the fifth one
df <- cont[[5]]

Need help extracting the first Google search result using html_node in R

I have a list of hospital names for which I need to extract the first Google search URL. Here is the code I'm using:
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url = URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}
websites <- data.frame(Website = sapply(c,getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long it appears in R truncated with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html) and ends up in the data frame the same way, with "...". How can I extract the actual URLs without the "..."? Appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
  html_nodes(".r a") %>% # get the a nodes with an r class
  html_attr("href")      # get the href attributes
# clean the text
links = gsub('/url\\?q=', '', sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
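To plug this back into your getWebsite() function, one option is to return the first cleaned href instead of the truncated cite text. A sketch along those lines (untested against Google's current markup, which changes often and may block automated queries; the ".r a" selector is taken from the answer above):
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  links <- page %>%
    html_nodes(".r a") %>%
    html_attr("href")
  # keep only the redirect links, strip the "/url?q=" prefix and the trailing "&..." parameters
  links <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
  as.character(links[1]) # first full, untruncated URL
}
# websites <- data.frame(Website = sapply(hospital_names, getWebsite)) # hospital_names is your vector of names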
