I am trying to download a PDF file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. If I pass the same URL to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct URL.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimal reproducible example:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, some of the links are not properly formatted as URLs: they contain spaces and other special characters. To escape them you can use url_escape(), which should be available to you because loading rvest also loads xml2, where url_escape() lives.
Secondly, the path you are saving to is interpreted relative to R's working directory, and the target folder must already exist. You either need the full path, like "C://Users/Manoel/Documents/downloaded/testes.pdf", or a path expanded from your home directory, like path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
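Since the goal is to automate more than a thousand downloads, a minimal sketch of a loop over all the escaped links might look like the following; the destination folder ("~/downloaded"), the use of basename() for file names, and the tryCatch() wrapper are illustrative assumptions rather than part of the answer above:
# a sketch, assuming `full_links` from the code above
dir.create(path.expand("~/downloaded"), showWarnings = FALSE)
for (link in full_links) {
  dest <- path.expand(file.path("~/downloaded", basename(link)))
  # tryCatch so one dead link doesn't stop the whole batch
  tryCatch(
    download.file(link, dest, mode = "wb"),
    error = function(e) message("failed: ", link)
  )
}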
Related
I would like to download the last archive (meteorological data) that has been added to this website, using RStudio:
https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/
Do you know how to do it? I am able to download one file in particular, but then I have to write out its exact name, which would have to be changed manually every time; I do not want that, I want it detected automatically.
Thanks.
The function download_CDC() below downloads the files for you. Passing 1 as the input downloads the latest one, saved under the name provided by the website.
library(tidyverse)
library(rvest)
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
files <- base_url %>%
  read_html() %>%
  html_elements("a+ a") %>%
  html_attr("href")

download_CDC <- function(item_number) {
  base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
  download.file(paste0(base_url, files[item_number]),
                destfile = files[item_number],
                mode = "wb")
}
download_CDC(1)
It's a bit naïve (no error checking; it blindly takes the last link from the file-list page), but it works with that particular listing.
Most web scraping in R happens through rvest. html_element("a:last-of-type") extracts the last element of type <a> through a CSS selector (your last archive), and html_attr('href') extracts the href attribute from that last <a> element, the actual link to the file.
library(rvest)
last_link <- function(url) {
  last_href <- read_html(url) |>
    html_element("a:last-of-type") |>
    html_attr('href')
  paste0(url, last_href)
}
url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"
Created on 2022-07-21 by the reprex package (v2.0.1)
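To actually fetch that newest archive rather than only print its URL, a minimal follow-up sketch (assuming the last_link() function above and simply reusing the server's own file name) could be:
# sketch: download the newest archive under the name the server uses
newest <- last_link(url)
download.file(newest, destfile = basename(newest), mode = "wb")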
I am using the R programming language for NLP (natural language processing) analysis; for this, I need to "webscrape" publicly available information on the internet.
Recently, I learned how to "webscrape" a single PDF file from the website I am using:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)
#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
article_words <- article_words %>%
  anti_join(stop_words, by = "word")
#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w , https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
The above code works fine if you want to manually access a single page and then "webscrape" it. Now, I want to try to automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 PDFs from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto
I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto ")
raw_list <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf") %>%
  str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
  map(read_html) %>%
  map(html_node, "#raw-url") %>%
  map(html_attr, "href") %>%
  str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
  walk2(., basename(.), download.file, mode = "wb")
But this produces the following error:
Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'
Can someone please show me what I am doing wrong? Is it possible to download the first 10 pdf files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?
Thanks
I see some people suggesting that you use RSelenium, which is a way to simulate browser actions so that the web server renders the page as if a human were visiting the site. From my experience it is almost never necessary to go down that route. The JavaScript part of the website is interacting with an API, and we can utilize that to circumvent the JavaScript part and get the raw JSON data directly.
In Firefox (and I assume Chrome is similar in that regard) you can right-click on the website and select “Inspect Element (Q)”, go to the “Network” tab and click on reload. You’ll see that each request the browser makes to the web server is listed after a few seconds or less. We are interested in the ones that have the “Type” json.
When you right-click on an entry you can select “Open in New Tab”. One of the requests that returns json has the following URL attached to it: https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1
Opening that URL in Firefox gets you to a GUI that lets you explore the JSON data structure, and you’ll see that there is a “results” entry which contains the data for the first 25 results of your search. Each one has a “path” entry that leads to the page displaying the embedded PDF. It turns out that if you replace the “.html” part with “.pdf”, that path leads directly to the PDF file. The code below utilizes all this information.
library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)
base_url <- "https://www.canlii.org"
json_url_search_p1 <-
  "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"
This downloads the JSON for page 1 (results 1 to 25).
results_p1 <-
  GET(json_url_search_p1, encode = "json") %>%
  content()
For each result we extract the path only.
result_html_paths_p1 <-
  map_chr(results_p1$results,
          ~ .$path)
We replace “.html” with “.pdf” and combine the base URL with each path to generate the full URLs pointing to the PDFs. Lastly, we pipe them into purrr::map() and pdftools::pdf_text() in order to extract the text from all 25 PDFs.
pdf_texts_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .) %>%
  map(pdf_text)
If you want to do this for more than just the first page, you might want to wrap the above code in a function that lets you switch out the “&page=” parameter. You could also make the “&text=” parameter an argument of the function in order to automatically scrape results for other searches, as sketched below.
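A minimal sketch of such a wrapper, assuming the httr functions loaded above; the function name, the URLencode() call, and the default page value are illustrative choices, not part of the original code:
# illustrative helper: builds the ajaxSearch URL for a given search text and page
get_search_results <- function(text, page = 1) {
  json_url <- paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision",
    "&text=", URLencode(text, reserved = TRUE),
    "&page=", page
  )
  content(GET(json_url))
}
# e.g. results_p2 <- get_search_results("dogs toronto", page = 2)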
For the remaining part of the task we can build on the code you already have. We make it a function that can be applied to any article, and apply that function to each PDF text, again using purrr::map().
extract_article_summary <- function(article) {
  article_sentences <- tibble(text = article) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    mutate(sentence_id = row_number()) %>%
    select(sentence_id, sentence)
  article_words <- article_sentences %>%
    unnest_tokens(word, sentence)
  article_words <- article_words %>%
    anti_join(stop_words, by = "word")
  textrank_sentences(data = article_sentences, terminology = article_words)
}
This will now take a really long time!
article_summaries_p1 <-
  map(pdf_texts_p1, extract_article_summary)
Alternatively, you could use furrr::future_map() instead to utilize all the CPU cores in your machine and speed up the process.
library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <-
  future_map(pdf_texts_p1, extract_article_summary)
Disclaimer
The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading large amounts of data. CanLII offers API access on an individual-request basis, see documentation here. This would be the correct and safest way to access their data.
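If you want to check this programmatically before scraping, one option is the robotstxt package; this is an illustrative sketch, not something the answer above relies on:
# a sketch, assuming the robotstxt package is installed
library(robotstxt)
# returns FALSE for paths that the site's robots.txt disallows for bots
paths_allowed(paths = "/en/search/ajaxSearch.do", domain = "www.canlii.org")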
I am trying to download all of the Excel files at: https://www.grants.gov.au/reports/gaweeklyexport
Using the code below I get errors similar to the following for each link (77 in total):
[[1]]$error
<simpleError in download.file(.x, .y, mode = "wb"): scheme not supported in URL '/Reports/GaWeeklyExportDownload?GaWeeklyExportUuid=0db183a2-11c6-42f8-bf52-379aafe0d21b'>
I get this error when trying to iterate over the full list, but when I call download.file on an individual list item it works fine.
I would be grateful if someone could tell me what I have done wrong or suggest a better way of doing it.
The code that produces the error:
library(tidyverse)
library(rvest)
# Reading links to the Excel files to be downloaded
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- html_attr(html_nodes(webpage, '.u'), "href")
# Creating names for the files to supply to the download.file function
wb_names = str_c(1:77, ".xlsx")
# Defining a function that uses purrr's safely() so a dead link doesn't make the whole run fail
safe_download <- safely(~ download.file(.x, .y, mode = "wb"))
# Mapping the function over the links and file names returns an error
map2(links, wb_names, safe_download)
You need to prepend 'https://www.grants.gov.au/' to each href to get the absolute URL of the file, which can then be used to download it.
library(rvest)
library(purrr)
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- paste0('https://www.grants.gov.au/', html_attr(html_nodes(webpage, '.u'), "href"))
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Creating names for the files to supply to the download.file function
wb_names = paste0(1:77, ".xlsx")
map2(links, wb_names, safe_download)
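As an aside, instead of pasting the site prefix by hand, xml2::url_absolute() can resolve relative hrefs against the page URL; a small sketch under that assumption:
# alternative: let xml2 resolve the relative hrefs against the base URL
links <- xml2::url_absolute(html_attr(html_nodes(webpage, '.u'), "href"), url)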
I'm new to programming and trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives will be greatly appreciated.
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
  html_node("dogruns_wrapper") %>%
  html_text()
View(tab)
I have tried with XPath, with the same result, and using html_table() instead of html_text() returns the error: no applicable method for 'html_table' applied to an object of class "xml_missing".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
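Once the table has been extracted, you may also want to close the browser session; a small optional addition:
# optional cleanup: close the remote browser
remDr$close()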
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or interacting with the Web Site for your own personal use or as otherwise indicated on the Web Site or these Terms and Conditions, you must not copy, reproduce, modify, communicate to the public, adapt, transfer, distribute, download or store any of the contents of the Web Site (including Race Information as described below), or incorporate any part of the Web Site into another web site without GRV’s written consent.
I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be do-able.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
The PDF you are looking to download is inside an iframe on the main page, so the link you are downloading only contains HTML.
You need to follow the link in the iframe to get the actual link to the PDF, and you need to jump through several pages to pick up cookies and temporary URLs before reaching the direct download link for the PDF.
Here's an example for the link you posted:
rm(list=ls())
library(rvest)
library(pdftools)
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
#get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>%
  html_nodes(xpath = "//frame[@name='mainFrame']") %>%
  html_attr("src")
#go to that link
s <- s %>% jump_to(url=frame_link)
#there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
  html_nodes("meta") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(url=temp_url)
#get the LtpaToken cookie then come back
s %>% jump_to(url="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
  back()
#get the pdf link and download it
pdf_link <- s %>% read_html() %>%
  html_nodes(xpath = "//meta[@http-equiv='refresh']") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
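If you also want to keep a copy of the PDF on disk rather than only in a tempfile, a small optional addition (the file name here is just an example) would be:
# optional: write the same bytes to a named file as well
writeBin(s$response$content, "S_RES_2341_2017.pdf")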