opening PDF from a webpage in R - r

I'm trying to practice text analysis with the Fed FOMC minutes.
I was able to obtain all links to the appropriate pdf files from the link below.
https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").
The download was successful; however, when I click on the downloaded file, it outputs "There was an error opening this document. The file is damaged and could not be repaired."
What are some ways to fix this? Is this a way of preventing web scraping on Fed's side?
I have 44 links (PDF files) to download and read in R. Is there a way to do this without physically downloading the files?

library(stringr)
library(rvest)
library(pdftools)
# Scrape the website with rvest for all href links
p <-
rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")
# Filter selected fomcminute paths and reconstruct html links
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov/", pdfs)
# Scrape minutes as list of text files
pdf_data <- lapply(paths, pdftools::pdf_text)

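If you do want local copies, a likely cause of the "file is damaged" message is that download.file wrote the PDF in text mode, which can happen on Windows when mode is left at its default. Here is a minimal sketch that forces binary mode; download_binary is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: download one file in binary mode and return the local path.
# mode = "wb" stops R from translating bytes as if the PDF were a text file.
download_binary <- function(url, dest = tempfile(fileext = ".pdf")) {
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest
}

# Example call (needs network access and the pdftools package):
# f <- download_binary("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf")
# txt <- pdftools::pdf_text(f)
```

Writing to tempfile() by default keeps the working directory clean when the files are only needed for text extraction.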
Related

Scraping data in Excel format from URL into R

I am trying to download data from this page:
https://migration.iom.int/datasets/europe-%E2%80%94-mixed-migration-flows-europe-quarterly-overview-april-june-2021
The page offers the dataset as an Excel file, and the download link is https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261
I want to read this Excel data directly into R.
library(rvest)
URL <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
pg <- read_html(URL)
html_attr(html_nodes(pg, "download"), "href")
But I made a mistake somewhere and nothing gets downloaded. Can anybody help me download this data into R?
I personally would go about it in the following way.
Download the data into a specified destination, read the excel file from that location. An idea would be:
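# mode = "wb" keeps the binary .xlsx file intact on Windows
download.file(url, destinationFile, mode = "wb")
# read.table() cannot parse .xlsx, so use readxl::read_excel() instead
fileInR <- readxl::read_excel(destinationFile)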
However, a simple google search for both (downloading and reading in an excel file in R) should provide you with plenty more options.

Download US census files from the web using R

I am trying to download all 1980 US Census files from the URL https://www2.census.gov/census_1980/ and store in my computer using R.
I already tried download.file and the package downloader, but the usual commands download only one file with no format.
Is there an easy way to download all files (including subfolders, etc) at once in R?
You can check whether the data you are interested in is available from FRED under U.S. Census Bureau: https://fred.stlouisfed.org/source?soid=19
If you are interested in something specific, it is easy to get the data with
# install.packages("quantmod")
library(quantmod)
retail_sales_total <- getSymbols('MRTSSM44X72USS', src = 'FRED', auto.assign = FALSE)
But if you want to get all files it is possible using xml2 and rvest packages.
library(rvest)
URL <- "https://www2.census.gov/census_1980/"
# Read the HTML of the directory listing
page <- read_html(URL)
# Extract the href attribute of every anchor to get the download links
links <- html_attr(html_nodes(page, "a"), "href")
Then download them all in a loop.
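The loop could look something like the sketch below. keep_file_links and download_all are hypothetical helper names, and the filtering rule (drop sort links, absolute paths, and subfolders ending in "/") is an assumption about how the directory listing is laid out, so check it against the real page. Subfolders are skipped rather than followed.

```r
# Hypothetical filter: keep only hrefs that look like files in this folder
keep_file_links <- function(links) {
  links <- links[!is.na(links)]
  # Drop "?C=N;O=D"-style sort links, absolute paths, and subfolders
  links[!grepl("^\\?|^/|/$", links)]
}

# Hypothetical loop: download every file link into dest_dir, one at a time
download_all <- function(links, base_url, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  for (link in keep_file_links(links)) {
    file_url <- paste0(base_url, link)
    dest <- file.path(dest_dir, basename(link))
    # Report a failed download and carry on with the next file
    tryCatch(
      download.file(file_url, dest, mode = "wb"),
      error = function(e) message("Failed: ", file_url)
    )
    Sys.sleep(1)  # short pause between requests
  }
}

# Example call (needs network access):
# download_all(links, "https://www2.census.gov/census_1980/", "census_1980")
```

To include subfolders, you would repeat the read_html and html_attr step on each link that ends in "/" and call download_all on the result.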

how to download pdf file with R from web (encode issue)

I am trying to download a PDF file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. However, if I pass the same URL to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct URL.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage,".borderTD a") %>%
html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls - they contain spaces and other special characters. In order to convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path like this: "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path like this: path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
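Two points are worth checking separately. The message "cannot open destfile ... No such file or directory" usually means the destination folder does not exist yet, and base R's utils::URLencode is an alternative encoder that leaves "/" and ":" untouched. Here is a minimal sketch of both; the URL is made up for illustration and safe_dest is a hypothetical helper name.

```r
# Percent-encode spaces and other unsafe characters, keeping "/" and ":" intact.
# The URL below is a made-up example, not a real file on the site.
encoded <- utils::URLencode(
  "http://www.ouvidoriageral.sp.gov.br/some folder/arquivo teste.pdf"
)

# Hypothetical helper: create the destination folder if needed, then return the path
safe_dest <- function(dir, file_name) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  file.path(dir, file_name)
}

# Example call (needs network access):
# download.file(encoded, safe_dest("downloaded", "teste.pdf"), mode = "wb")
```

With more than a thousand files, creating the folder once up front and reusing the same encoding step for every link keeps the loop simple.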

Web scraping pdf files from HTML

How can I scrape the pdf documents from HTML?
I am using R, and so far I can only extract the text from the HTML. An example of the website I want to scrape is as follows.
https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx
When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.
library(XML)
library(RCurl)
url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page <- getURL(url)
parsed <- htmlParse(page)
links <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
# Match a literal ".pdf"; the original pattern "*.pdf" is not a valid way to express this
inds <- grep("\\.pdf", links)
links <- links[inds]
links contains all the URLs to the PDF-files you are trying to download.
Beware: many websites don't like it when you automatically scrape their documents, and you may get blocked.
With the links in place, you can loop through them, download the files one by one, and save each in your working directory under the name held in destination. I decided to extract reasonable document names for your PDFs from the links (extracting the final piece after the last / in the URLs):
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
To avoid overloading the website's servers, I have heard it is friendly to pause your scraping every once in a while, so I use Sys.sleep() to pause for a random time between 1 and 5 seconds:
for(i in seq_along(links)){
download.file(links[i], destfile=destination[i])
Sys.sleep(runif(1, 1, 5))
}

Scraping PDF from iframe into R

I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be do-able.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
The pdf you are looking to download is in an iframe in the main page, so the link you are downloading only contains html.
You need to follow the link in the iframe to get the actual link to the pdf. You need to jump to several pages to get cookies/temporary urls before getting to the direct link to download the pdf.
Here's an example for the link you posted:
rm(list=ls())
library(rvest)
library(pdftools)
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
#get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>% html_nodes(xpath="//frame[@name='mainFrame']") %>%
html_attr("src")
#go to that link
s <- s %>% jump_to(url=frame_link)
#there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
html_nodes("meta") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(url=temp_url)
#get the LtpaToken cookie then come back
s %>% jump_to(url="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
back()
#get the pdf link and download it
pdf_link <- s %>% read_html() %>%
html_nodes(xpath="//meta[@http-equiv='refresh']") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
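Because the original failure came from saving an HTML page under a PDF name, a quick sanity check before calling pdf_text can save some confusion. Here is a minimal sketch, assuming a well-formed PDF starts with the bytes "%PDF"; is_pdf_file is a hypothetical helper name.

```r
# Hypothetical check: TRUE when the file starts with the PDF signature "%PDF"
is_pdf_file <- function(path) {
  header <- readBin(path, what = "raw", n = 4)
  identical(header, charToRaw("%PDF"))
}

# Example use with the tmp file from the code above:
# if (!is_pdf_file(tmp)) stop("Downloaded content is not a PDF (probably HTML)")
```

If the check fails, opening the file in a text editor usually shows the HTML that was served instead, which hints at the next redirect or frame to follow.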
