The lodown package works great for me for the most part: I was able to download ACS and CES data without issue. But when I try to use it to access CPS data, I get the following output:
lodown( "cpsbasic" , output_dir = file.path( path.expand( "~" ) , "CPSBASIC" ) )
building catalog for cpsbasic
Error in rvest::html_table(xml2::read_html(cps_ftp), fill = TRUE)[[2]] :
subscript out of bounds
Tried a fresh install of R and the packages involved, but I still get the same error. I think it has something to do with the Census updating their website since the package was last updated, but I'm not clear on what the specific problem is.
I did dig up the install files for the package. The specific lines of code at issue are below:
cps_ftp <- "https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html"
cps_table <- rvest::html_table( xml2::read_html( cps_ftp ) , fill = TRUE )[[2]]
Not sure how active the developer of the package is in updating anymore, so I don't know that an update will be coming anytime soon. Any ideas?
We can download both .csv files in cps_ftp by,
library(rvest)
library(stringr)
library(readr) # provides read_csv(), used below
#get links of csv files
links = 'https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html' %>% read_html() %>%
html_nodes('.uscb-layout-align-start-start') %>% html_nodes('a') %>% html_attr('href')
#filter the links
csv_links= links %>% str_subset('csv') %>% paste0('https:', .)
#read the csv files
csv_files = lapply(csv_links, read_csv)
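As for why the original call failed: "subscript out of bounds" on `[[2]]` means the list returned by html_table() held fewer than two tables. A minimal sketch of a guard around that kind of indexing is below; `pick_table()` and the stand-in data are hypothetical names made up for illustration, not part of lodown or rvest.

```r
# Hypothetical helper: index into a list of scraped tables, but fail with a
# descriptive message when the page holds fewer tables than expected.
pick_table <- function(tables, index) {
  if (length(tables) < index) {
    stop(sprintf("expected at least %d table(s), found %d",
                 index, length(tables)),
         call. = FALSE)
  }
  tables[[index]]
}

# Stand-in for the result of rvest::html_table(): a list with one data frame
# (the contents are invented for the example).
tables <- list(data.frame(id = 1, file = "a.csv"))

first <- pick_table(tables, 1)  # works: there is one table

# Asking for a second table now gives a readable error instead of
# "subscript out of bounds".
result <- tryCatch(pick_table(tables, 2),
                   error = function(e) conditionMessage(e))
print(result)
```

This does not repair the package, but it makes it obvious when a page layout has changed underneath a scraper.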
I would like to download the latest archive (meteorological data) that has been added to this website, using RStudio:
https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/
Do you know how to do it? I am able to download one file in particular, but then I have to write out the exact file name and change it manually every time. I do not want that; I want the latest file to be detected automatically.
Thanks.
The function download_CDC() downloads the files for you. Passing 1 as the input downloads the latest file, saved under the name provided by the website.
library(tidyverse)
library(rvest)
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
files <- base_url %>%
read_html() %>%
html_elements("a+ a") %>%
html_attr("href")
download_CDC <- function(item_number) {
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
download.file(paste0(base_url, files[item_number]),
destfile = files[item_number],
mode = "wb")
}
download_CDC(1)
It's a bit naïve (no error checking, blindly takes the last link from the file list page), but works with that particular listing.
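If you want a little of the missing error checking, one option is to pick the file by name rather than by position. This is only a sketch under an assumption: that the archives keep the RW-YYYYMMDD.tar.gz naming seen in the listing, so that sorting the names also sorts them by date. `latest_archive()` is a made-up helper and the hrefs vector stands in for what html_attr("href") would return.

```r
# Hypothetical helper: keep only archive links and return the newest one.
# Assumes file names embed the date as YYYYMMDD, so name order == date order.
latest_archive <- function(hrefs, pattern = "\\.tar\\.gz$") {
  archives <- grep(pattern, hrefs, value = TRUE)
  if (length(archives) == 0) {
    stop("no archive links found in the listing", call. = FALSE)
  }
  # radix sorting is locale-independent, which keeps the result predictable
  sort(archives, decreasing = TRUE, method = "radix")[1]
}

# Stand-in for the scraped hrefs (the first two entries are invented).
hrefs <- c("../", "RW-20220719.tar.gz", "RW-20220720.tar.gz")

newest <- latest_archive(hrefs)
print(newest)
```

The returned name can then be pasted onto the base URL and handed to download.file() with mode = "wb".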
Most web scraping in R happens through rvest. html_element("a:last-of-type") uses a CSS selector to extract the last element of type <a>, which is your latest archive, and html_attr('href') extracts the href attribute from that last <a> element, which is the actual link to the file.
library(rvest)
last_link <- function(url) {
last_href <- read_html(url) |>
html_element("a:last-of-type") |>
html_attr('href')
paste0(url,last_href)
}
url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"
Created on 2022-07-21 by the reprex package (v2.0.1)
The organization I currently work for blocks the CRAN Repository in R Studio. So in order to install packages I need to go to http://cran.rstudio.com/bin/windows/contrib/3.6/ and manually download each one and their dependencies and install them in RStudio. It gets quite tedious.
Is there a way for me to download all of the zip files on this page at once and put them in a folder on my desktop? And from there, is there code to install/load all the zip file packages at once in RStudio?
Thank you in advance!
Here is a possible example using the package rvest. The rvest functions are used to get the list of packages to be downloaded.
Note that the Sys.sleep(1L) call pauses the execution for one second between downloads. You can obviously change that or remove it altogether.
library(rvest)
url <- 'https://cran.rstudio.com/bin/windows/contrib/3.6'
packages <- rvest::read_html(url) %>%
rvest::html_nodes("a") %>%
rvest::html_text() %>%
grep('\\.zip$', ., value = TRUE) %>%
sort()
for (pkg in packages) {
Sys.sleep(1L)
cat('Downloading', pkg, '...')
pkg_url <- file.path(url, pkg)
download.file(pkg_url, destfile = pkg, quiet = TRUE)
cat(' DONE.\n')
}
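For the second half of the question (installing everything that was downloaded), a rough sketch follows. It assumes the zips are Windows binary packages sitting in one folder; `install_local_zips()` and its `dry_run` argument are hypothetical names invented here, and the demo only lists files in a temporary folder rather than installing anything.

```r
# Hypothetical helper: find every .zip in `dir` and install it from disk.
# With dry_run = TRUE it only reports what it would install.
install_local_zips <- function(dir, dry_run = FALSE) {
  zips <- list.files(dir, pattern = "\\.zip$", full.names = TRUE)
  if (length(zips) == 0) {
    stop("no .zip files found in ", dir, call. = FALSE)
  }
  if (!dry_run) {
    # repos = NULL tells install.packages() to treat the inputs as local files
    install.packages(zips, repos = NULL, type = "win.binary")
  }
  invisible(zips)
}

# Dry-run demo with empty placeholder files (names are invented).
demo_dir <- file.path(tempdir(), "cran_zips")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("pkgA_1.0.zip", "pkgB_2.1.zip", "notes.txt")))

found <- install_local_zips(demo_dir, dry_run = TRUE)
print(basename(found))
```

On a real folder you would call it without dry_run. Note that installing from local files does not resolve dependencies for you, so the dependencies have to be among the downloaded zips.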
I am using the R programming language for NLP (natural language processing) analysis - for this, I need to "webscrape" publicly available information on the internet.
Recently, I learned how to "webscrape" a single pdf file from the website I am using :
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)
#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w , https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
The above code works fine if you want to manually access a single website and then "webscrape" this website. Now, I want to try and automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 pdf's from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto
I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto ")
raw_list <- page %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.pdf") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .)
map(read_html) %>%
map(html_node, "#raw-url") %>%
map(html_attr, "href") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
walk2(., basename(.), download.file, mode = "wb")
But this produces the following error:
Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'
Can someone please show me what I am doing wrong? Is it possible to download the first 10 pdf files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?
Thanks
I see some people suggesting that you use rselenium, which is a way to
simulate browser actions, so that the web server renders the page as
if a human was visiting the site. From my experience it is almost never
necessary to go down that route. The JavaScript part of the website is
interacting with an API, and we can use that to bypass the JavaScript
part and get the raw JSON data directly. In Firefox (and I assume Chrome is similar
in that regard) you can right-click on the website and select “Inspect Element (Q)”,
go to the “Network” tab and click on reload. You’ll see that each request
the browser makes to the webserver is being listed after a few seconds or less.
We are interested in the ones that have the “Type” json.
When you right click on an entry you can select “Open in New Tab”. One of the
requests that returns json has the following URL attached to it https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1
Opening that URL in Firefox gets you to a GUI that lets you explore the
json data structure and you’ll see that there is a “results” entry which
contains the data for the 25 first results of your search. Each one has a
“path” entry, that leads to the page that will display the embedded PDF.
It turns out that if you replace the “.html” part with “.pdf” that path
leads directly to the PDF file. The code below utilizes all this information.
library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)
base_url <- "https://www.canlii.org"
json_url_search_p1 <-
"https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"
This downloads the json for page 1 / results 1 to 25
results_p1 <-
GET(json_url_search_p1, encode = "json") %>%
content()
For each result we extract the path only.
result_html_paths_p1 <-
map_chr(results_p1$results,
~ .$path)
We replace “.html” with “.pdf”, combine the base URL with the path to
generate the full URLs pointing to the PDFs. Last we pipe it into purrr::map()
and pdftools::pdf_text in order to extract the text from all 25 PDFs.
pdf_texts_p1 <-
gsub("\\.html$", ".pdf", result_html_paths_p1) %>%
paste0(base_url, .) %>%
map(pdf_text)
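Since the question also asked for the first 10 files saved as "pdf1", "pdf2", and so on, here is a small sketch of just the URL and file-name step. The example path is an assumption: it is derived from the single-PDF URL in the question, on the guess that the JSON "path" entries look like that. `paths_to_pdf_urls()` is a made-up helper name.

```r
base_url <- "https://www.canlii.org"

# Hypothetical helper: turn result paths ending in .html into full PDF URLs.
paths_to_pdf_urls <- function(paths, base_url) {
  paste0(base_url, sub("\\.html$", ".pdf", paths))
}

# Assumed shape of one "path" entry (based on the PDF URL in the question).
example_paths <- "/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.html"

# Keep at most the first 10 results and number the output files.
pdf_urls   <- paths_to_pdf_urls(head(example_paths, 10), base_url)
dest_files <- sprintf("pdf%d.pdf", seq_along(pdf_urls))

print(pdf_urls)
print(dest_files)

# The download itself is left commented out so the sketch runs offline:
# for (i in seq_along(pdf_urls)) {
#   download.file(pdf_urls[i], dest_files[i], mode = "wb")
# }
```

With the real data you would pass result_html_paths_p1 instead of example_paths.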
If you want to do this for more than just the first page you might want to
wrap the above code in a function that lets you switch out the “&page=”
parameter. You could also make the “&text=” parameter an argument of the
function in order to automatically scrape results for other searches.
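A minimal sketch of such a wrapper for the URL part could look like this. `build_search_url()` is a name made up for illustration; it only assembles the address and assumes the endpoint keeps accepting the same type, text and page parameters seen in the request above.

```r
# Hypothetical helper: assemble the search URL for a given query and page.
build_search_url <- function(text, page = 1) {
  paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=",
    utils::URLencode(text, reserved = TRUE),  # "dogs toronto" -> "dogs%20toronto"
    "&page=", page
  )
}

url_p1 <- build_search_url("dogs toronto", page = 1)
print(url_p1)

# `page` can also be a vector, which gives one URL per page.
urls_p1_to_p3 <- build_search_url("dogs toronto", page = 1:3)
print(urls_p1_to_p3)
```

Each of these URLs can then go through the same GET() and content() steps shown earlier.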
For the remaining part of the task we can build on the code you already have.
We make it a function that can be applied to any article and apply that function
to each PDF text again using purrr::map().
extract_article_summary <-
function(article) {
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
textrank_sentences(data = article_sentences, terminology = article_words)
}
This will now take a really long time!
article_summaries_p1 <-
map(pdf_texts_p1, extract_article_summary)
Alternatively you could use furrr::future_map() instead to utilize all the CPU
cores in your machine and speed up the process.
library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <-
future_map(pdf_texts_p1, extract_article_summary)
Disclaimer
The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. Its robots.txt explicitly disallows bots from accessing the /search path. It is therefore recommended to get in contact with the site owner before downloading large amounts of data. CanLII offers API access on an individual request basis (see its API documentation). This would be the correct and safest way to access their data.
I am trying to download a pdf file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. However, if I pass the same url to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", which I take to mean that it can't find the correct url.
How do I correct the encoding, so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage,".borderTD a") %>%
html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls - they contain spaces and other special characters. In order to convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape().
Secondly, the path you are saving to is a relative path, which R resolves against its current working directory, and the "downloaded" folder has to exist there already. You either need the full path, like this: "C:/Users/Manoel/Documents/downloaded/teste.pdf", or a path anchored at your home directory, like this: path.expand("~/downloaded/teste.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists (mode = "wb" keeps the binary file intact on Windows)
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"), mode = "wb")
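To see the two fixes in isolation, here is a small offline sketch. It swaps in base R's utils::URLencode(), which leaves "/" alone when reserved = FALSE (the default), instead of url_escape(); the hrefs are invented stand-ins for the scraped links, and `encode_hrefs()` is a hypothetical helper name.

```r
# Hypothetical helper: percent-encode spaces and other unsafe characters in
# each href while keeping path separators intact.
encode_hrefs <- function(hrefs) {
  vapply(hrefs, utils::URLencode, character(1), USE.NAMES = FALSE)
}

# Invented examples of scraped hrefs, one of them containing a space.
hrefs <- c("decisoes/some file.pdf", "decisoes/plain.pdf")
full_links <- paste0("http://www.ouvidoriageral.sp.gov.br/", encode_hrefs(hrefs))
print(full_links)

# Make sure the destination folder exists before writing into it; a missing
# folder is what produces "cannot open destfile ... No such file or directory".
dest_dir <- file.path(tempdir(), "downloaded")
dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
dest_file <- file.path(dest_dir, "teste.pdf")

# The download is commented out so the sketch runs without network access:
# download.file(full_links[1], dest_file, mode = "wb")
```

For the real job you would replace tempdir() with the folder you actually want and loop over all the links.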
I have the following R script for downloading data but it gives me an error. How can I fix this error?
rm(list=ls(all=TRUE))
library('purrr')
years <- c(1980:1981)
days <- c(001:002)
walk(years, function(x) {
map(x, ~sprintf("https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%s/.grb", years, days)) %>%
flatten_chr() -> urls
download.file(urls, basename(urls), method="libcurl")
})
Error:
Error in download.file(urls, basename(urls), method = "libcurl") :
download.file(method = "libcurl") is not supported on this platform
That means that libcurl may not be installed or available for your operating system. Please note that the method argument has other options, and that the available methods vary across operating systems (more or less what "platform" means in the error message). I would try other methods (e.g., wget, curl...).
From the help of download.file...
The supported ‘method’s do change: method ‘libcurl’ was introduced
in R 3.2.0 and is still optional on Windows - use
‘capabilities("libcurl")’ in a program to see if it is available.
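A short sketch of that check, falling back to the default method when libcurl is missing; `pick_download_method()` is a made-up helper name, and "auto" is simply download.file()'s documented default.

```r
# Hypothetical helper: use libcurl only if this R build supports it.
pick_download_method <- function() {
  has_libcurl <- isTRUE(unname(capabilities("libcurl")))
  if (has_libcurl) "libcurl" else "auto"
}

method <- pick_download_method()
print(method)

# It would then be passed along like this (commented out to stay offline):
# download.file(url, destfile, method = method)
```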
I had started to do a light edit to @gballench's answer (since I don't rly need the pts) but it's more complex than you have it since you're not going to get to the files you need with that idiom (which I'm 99% sure is from an answer of mine :-) for a whole host of reasons.
First, days needs to be padded to length 3 with 0s, but the way you did it won't do that. Second, you likely want to download all the .grb files from each year/00x combo, so you need a way to get those. Finally, that site requires authentication, so you need to register and use basic authentication for it.
Something like this:
library(purrr)
library(httr)
library(rvest)
years <- c(1980:1981)
days <- sprintf("%03d", 1:2)
sprintf("http://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%%s/", years) %>%
map(~sprintf(.x, days)) %>%
flatten_chr() %>%
map(~{
base_url <- .x
sprintf("%s/%s", base_url, read_html(.x) %>%
html_nodes(xpath=".//a[contains(@href, '.grb')]") %>%
html_attr("href"))
}) %>%
flatten_chr() %>%
discard(~grepl("xml$", .)) %>%
walk(~{
output_path <- file.path("FULL DIRECTORY PATH", basename(.x))
if (!file.exists(output_path)) {
message(.x)
GET(
url = .x,
config = httr::config(ssl_verifypeer = FALSE),
write_disk(output_path, overwrite=TRUE),
authenticate(user = "me@example.com", password = "xldjkdjfid8y83"),
progress()
)
}
})
You'll need to install the httr package which will install the curl package and ultimately make libcurl available for simpler batch downloads in the future.
I remembered that I had an account so I linked it with this app & tested this (killed it at 30 downloads) and it works. I added progress() to the GET() call so you can see it downloading individual files. It skips over already downloaded files (so you can kill it and restart it at any time). If you need to re-download any, just remove the file you want to re-download.
If you also need the .xml files, then remove the discard() call.
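In case the padding point is unclear, it can be checked on its own without touching the server. The sketch below only builds the directory URLs; `build_dir_urls()` is a hypothetical helper, and the root is the one from the question.

```r
# c(001:002) is just the integer vector 1:2; the leading zeros are lost.
days_unpadded <- c(001:002)
print(days_unpadded)

# sprintf("%03d", ...) gives the zero-padded strings the directory names need.
days_padded <- sprintf("%03d", 1:2)
print(days_padded)

# Hypothetical helper: one directory URL per year/day combination.
build_dir_urls <- function(years, days) {
  grid <- expand.grid(day = days, year = years, stringsAsFactors = FALSE)
  sprintf(
    "https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%s/",
    grid$year, grid$day
  )
}

dir_urls <- build_dir_urls(1980:1981, days_padded)
print(dir_urls)
```

Note that the original sprintf() call recycled years and days together, which pairs 1980 with day 1 and 1981 with day 2 rather than producing every combination.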