I have the following R script for downloading data but it gives me an error. How can I fix this error?
rm(list=ls(all=TRUE))
library('purrr')
years <- c(1980:1981)
days <- c(001:002)
walk(years, function(x) {
  map(x, ~sprintf("https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%s/.grb", years, days)) %>%
    flatten_chr() -> urls
  download.file(urls, basename(urls), method="libcurl")
})
Error:
Error in download.file(urls, basename(urls), method = "libcurl") :
download.file(method = "libcurl") is not supported on this platform
That means that libcurl may not be installed or available for your operating system. Please note that the method argument has other options and that the available methods vary across operating systems (more or less the same as "platform" in the error message). I would try other methods (e.g., wget, curl, ...).
From the help of download.file...
The supported ‘method’s do change: method ‘libcurl’ was introduced
in R 3.2.0 and is still optional on Windows - use
‘capabilities("libcurl")’ in a program to see if it is available.
I had started to do a light edit to @gballench's answer (since I don't really need the points) but it's more complex than you have it, since you're not going to get to the files you need with that idiom (which I'm 99% sure is from an answer of mine :-) for a whole host of reasons.
First, days needs to be zero-padded to length 3, and the way you did it won't do that. Second, you likely want to download all the .grb files from each year/00x combo, so you need a way to get those. Finally, that site requires authentication, so you need to register and use basic authentication for it.
Something like this:
library(purrr)
library(httr)
library(rvest)
years <- c(1980:1981)
days <- sprintf("%03d", 1:2)
sprintf("http://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%%s/", years) %>%
map(~sprintf(.x, days)) %>%
flatten_chr() %>%
map(~{
base_url <- .x
sprintf("%s/%s", base_url, read_html(.x) %>%
html_nodes(xpath=".//a[contains(#href, '.grb')]") %>%
html_attr("href"))
}) %>%
flatten_chr() %>%
discard(~grepl("xml$", .)) %>%
walk(~{
output_path <- file.path("FULL DIRECTORY PATH", basename(.x))
if (!file.exists(output_path)) {
message(.x)
GET(
url = .x,
config = httr::config(ssl_verifypeer = FALSE),
write_disk(output_path, overwrite=TRUE),
authenticate(user = "me#example.com", password = "xldjkdjfid8y83"),
progress()
)
}
})
You'll need to install the httr package which will install the curl package and ultimately make libcurl available for simpler batch downloads in the future.
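For instance (just a sketch; curl_download() comes from the curl package that installing httr pulls in, and the URL is a placeholder):
install.packages("httr")   # also installs the curl package

# curl_download() is a libcurl-backed alternative to download.file()
curl::curl_download("https://example.com/somefile.grb", "somefile.grb")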
I remembered that I had an account so I linked it with this app & tested this (killed it at 30 downloads) and it works. I added progress() to the GET() call so you can see it downloading individual files. It skips over already downloaded files (so you can kill it and restart it at any time). If you need to re-download any, just remove the file you want to re-download.
If you also need the .xml files, then remove the discard() call.
I am trying to automate a process in R which involves downloading a zipped folder from an API* which contains a few .csv/.xml files, accessing its contents, and then extracting the .csv/.xml that I actually care about into a dataframe (or something else that is workable). However, I am having some problems accessing the contents of the API pull. From what I gather, the proper process for pulling from an API is to use GET() from the httr package to access the API's files, then the jsonlite package to process it. The second step in this process is failing me. The code I have been trying to use is roughly as follows:
library(httr)
library(jsonlite)
req <- "http://request.path.com/thisisanapi/SingleZip?option1=yes&option2=no"
res <- GET(url = req)
#this works as expected, with res$status_code == 200
#OPTION 1:
api_char <- rawToChar(res$content)
api_call <- fromJSON(api_char, flatten=T)
#OPTION 2:
api_char2 <- content(res, "text")
api_call2 <- fromJSON(api_char2, flatten=T)
In option 1, the first line fails with an "embedded nul in string" error. In option 2, the second line fails with a "lexical error: invalid char in json text" error.
I did some reading and found a few related threads. First, this person looks to be doing a very similar thing to me, but did not experience this error (which suggests that maybe the files are zipped/stored differently between the APIs the two of us are using, and that I have set up the GET() incorrectly?). Second, this person seems to be experiencing a similar problem with converting the raw data from the API. I attempted the fix from that thread, but it did not work. With option 1, the first line ran but the second line gave a similar "lexical error: invalid char in json text" as before, and with option 2, the second line gave an "if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed" error, which I am not quite sure how to interpret. This may be because the content_type differs between our API pulls: mine is application/x-zip-compressed and theirs is text/tab-separated-values; charset=utf-16le, so maybe removing the null characters is altogether inappropriate here.
There is some documentation on usage of the API I am using*, but a lot of it is a few years old now and seems to focus more on manual usage rather than integration with large automated downloads like I am working on (my end goal is a loop which executes the process described many times over slightly varying urls). I am most certainly a beginner to using APIs like this, and would really appreciate some insight!
* = specifically, I am pulling from CAISO's OASIS API. If you want to follow along with some real files, replace "http://request.path.com/thisisanapi/SingleZip?option1=yes&option2=no" with "http://oasis.caiso.com/oasisapi/SingleZip?resultformat=6&queryname=PRC_INTVL_LMP&version=3&startdatetime=20201225T09:00-0000&enddatetime=20201226T9:10-0000&market_run_id=RTM&grp_type=ALL"
I think the main issue here is that you don't have a JSON return from the API. You have a .zip file being returned, as binary (I think?) data. Your challenge is to process that data. I don't think fromJSON() will help you, as the data from the API isn't in JSON format.
Here's how I would do it. I prefer to use the httr2 package. The process below makes it nice and clear what the parameters of the query are.
library(httr2)
req <- httr2::request("http://oasis.caiso.com/oasisapi")
query <- req %>%
  httr2::req_url_path_append("SingleZip") %>%
  httr2::req_url_query(resultformat = 6) %>%
  httr2::req_url_query(queryname = "PRC_INTVL_LMP") %>%
  httr2::req_url_query(version = 3) %>%
  httr2::req_url_query(startdatetime = "20201225T09:00-0000") %>%
  httr2::req_url_query(enddatetime = "20201226T9:10-0000") %>%
  httr2::req_url_query(market_run_id = "RTM") %>%
  httr2::req_url_query(grp_type = "ALL")
# Check what our query looks like
query
#> <httr2_request>
#> GET
#> http://oasis.caiso.com/oasisapi/SingleZip?resultformat=6&queryname=PRC_INTVL_LMP&version=3&startdatetime=20201225T09%3A00-0000&enddatetime=20201226T9%3A10-0000&market_run_id=RTM&grp_type=ALL
#> Body: empty
resp <- query %>%
httr2::req_perform()
# Check what content type and encoding we have
# All looks good
resp %>%
httr2::resp_content_type()
#> [1] "application/x-zip-compressed"
resp %>%
httr2::resp_encoding()
#> [1] "UTF-8"
Created on 2022-08-30 with reprex v2.0.2
Then you have a choice what to do, if you want to write the data to a zip file.
I discovered that the brio package will write raw data to a file nicely. Alternatively, you can just use download.file to download the .zip from the URL (without all the httr stuff above); you need to use mode = "wb".
resp %>%
httr2::resp_body_raw() %>%
brio::write_file_raw(path = "out.zip")
# alternative using your original URL or query$url
download.file(query$url, "out.zip", mode = "wb")
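From there, a possible next step (just a sketch, assuming the archive contains a single CSV you care about) is to list the zip's contents and read that file directly:
# See what the archive contains
unzip("out.zip", list = TRUE)

# Read the first CSV inside the zip without extracting the whole archive
csv_name <- unzip("out.zip", list = TRUE)$Name[1]
dat <- read.csv(unz("out.zip", csv_name))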
I wish to download an online folder using Windows 10 on my Dell laptop. In this example the folder I wish to download is named Targetfolder. I am trying to use the Command Window but also am wondering whether there is a simple solution in R. I have included an image at the bottom of this post showing the target folder. I should add that Targetfolder includes a file and multiple subfolders containing files. Not all files have the same extension. Also, please note this is a hypothetical site. I did not want to include the real site for privacy issues.
EDIT
Here is a real site that can serve as a functional, reproducible example. The folder rel2020 can take the place of the hypothetical Targetfolder:
https://www2.census.gov/geo/docs/maps-data/data/rel2020/
None of the answers here seem to work with Targetfolder:
How to download HTTP directory with all files and sub-directories as they appear on the online files/folders list?
Below are my attempts based on answers posted at the link above and the result I obtained:
Attempt One
lftp -c 'mirror --parallel=300 https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/ ;exit'
Returned:
lftp is not recognized as an internal or external command, operable program or batch file.
Attempt Two
wget -r -np -nH --cut-dirs=3 -R index.html https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/
Returned:
wget is not recognized as an internal or external command, operable program or batch file.
Attempt Three
https://sourceforge.net/projects/visualwget/files/latest/download
VisualWget returned Unsupported scheme next to the url.
Here is a way with packages httr and rvest.
First, get the folders where the files are located from the link.
Then loop through the folders with Map, getting the filenames and downloading them in a lapply loop.
If errors such as time-out conditions occur, they will be trapped by tryCatch. The last code lines will tell if and where there were errors.
Note: I only downloaded from folders[1:2]; in the Map below, change this to folders.
suppressPackageStartupMessages({
library(httr)
library(rvest)
library(dplyr)
})
link <- "https://www2.census.gov/geo/docs/maps-data/data/rel2020/"
page <- read_html(link)
folders <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  .[8:14] %>%
  paste0(link, .)
files_txt <- Map(\(x) {
  x %>%
    read_html() %>%
    html_elements("a") %>%
    html_attr("href") %>%
    grep("\\.txt$", ., value = TRUE) %>%
    paste0(x, .) %>%
    lapply(\(y) {
      tryCatch(
        download.file(y, destfile = file.path("~/Temp", basename(y))),
        error = function(e) e
      )
    })
}, folders[1:2])
err <- sapply(unlist(files_txt, recursive = FALSE), inherits, "error")
lapply(unlist(files_txt, recursive = FALSE)[err], simpleError)
The lodown package works great for me for the most part - I was able to download ACS and CES data without issue. But when I try to use it to access CPS data, I get the following output:
lodown( "cpsbasic" , output_dir = file.path( path.expand( "~" ) , "CPSBASIC" ) )
building catalog for cpsbasic
Error in rvest::html_table(xml2::read_html(cps_ftp), fill = TRUE)[[2]] :
subscript out of bounds
Tried a fresh install of R and the packages involved, but I still get the same error. I think it has something to do with the Census updating their website since the package was last updated, but I'm not clear on what the specific problem is.
I did dig up the install files for the package. The specific lines of code at issue are below:
cps_ftp <- "https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html"
cps_table <- rvest::html_table( xml2::read_html( cps_ftp ) , fill = TRUE )[[2]]
Not sure how active the developer of the package is in updating anymore, so I don't know that an update will be coming anytime soon. Any ideas?
We can download both .csv files listed at cps_ftp like this:
library(rvest)
library(stringr)
library(readr)
#get links of csv files
links = 'https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html' %>% read_html() %>%
html_nodes('.uscb-layout-align-start-start') %>% html_nodes('a') %>% html_attr('href')
#filter the links
csv_links= links %>% str_subset('csv') %>% paste0('https:', .)
#read the csv files
csv_files = lapply(csv_links, read_csv)
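If you would rather save the files to disk than read them straight into memory, a small sketch along the same lines (the destination folder is an assumption):
# download the csv files to a local folder instead of reading them directly
dest_files = file.path(tempdir(), basename(csv_links))
Map(function(u, d) download.file(u, d, mode = "wb"), csv_links, dest_files)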
I'm doing some scraping, but as I'm parsing approximately 4000 URLs, the website eventually detects my IP and blocks me every 20 iterations.
I've written a bunch of Sys.sleep(5) and a tryCatch so I'm not blocked too soon.
I use a VPN but I have to manually disconnect and reconnect it every now and then to change my IP. That's not a suitable solution with such a scraper supposed to run all night long.
I think rotating a proxy should do the job.
Here's my current code (part of it, at least):
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(site_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
Interesting question. I think the first thing to note is that, as mentioned on this Github issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.
Using a proxy with httr
The following code chunk shows how to use httr to query a url using a proxy and extract the html content.
page <- httr::content(
  httr::GET(
    url,
    httr::use_proxy(ip, port, username, password)
  )
)
If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call.
In short, you can replace the page = read_html(site_url) with the code chunk above.
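For example, a minimal sketch of that replacement (the URL, proxy IP, and port below are placeholders, not working values):
library(httr)
library(rvest)

site_url <- "https://www.website.com/test/some-suffix"   # placeholder URL

page <- httr::content(
  httr::GET(
    site_url,
    httr::use_proxy("203.0.113.10", 8080)   # placeholder proxy
  )
)

# the parsed page works with rvest selectors just like read_html() output
links <- page %>%
  html_nodes(".seo-directory-doctor-link") %>%
  html_attr("href")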
Rotating the Proxies
One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format with object name proxies:
ip                 port
64.235.204.107     8080
167.71.190.253     80
185.156.172.122    3128
With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:
library(dplyr)
library(httr)
library(rvest)
scraped_data = data.frame()
for (i in 1:length(urlsuffixes$suffix))
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/",
urlsuffixes$suffix[[i]])
# The number of urls is longer than the proxy list -- which proxy to use
# I know this isn't the greatest, but it works so whatever
proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))
page <- httr::content(
httr::GET(
doctolib_url,
httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
)
)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
This may not be enough
You might want to go a few steps further and add elements to the httr request, such as a user-agent header. However, one of the big problems with a package like httr is that it can't render dynamic html content, such as JavaScript-rendered html, and any website that really cares about blocking scrapers is going to detect this. To conquer that problem there are tools such as Headless Chrome that are meant to address specifically stuff like this. There's a package for headless Chrome in R that you might want to look into (note: still in development).
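As a small illustration of the user-agent idea, a sketch that slots into the GET() call from the loop above (the header string is just an example):
# hypothetical user-agent string; swap in whatever you want to present
ua <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

page <- httr::content(
  httr::GET(
    doctolib_url,
    ua,
    httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
  )
)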
Disclaimer
Obviously, I think this code will work but since there's no reproducible data to test with, it may not.
As already said by @Daniel-Molitor, headless Chrome gives stunning results.
Another cheap option in RStudio is looping over a list of proxies, although you have to start a new R process afterwards:
Sys.setenv(http_proxy=proxy)
.rs.restartR()
Sys.sleep(1) can even be omitted afterwards ;-)
I'm trying to adopt the Reproducible Research paradigm while also meeting people who prefer looking at Excel rather than text data files halfway, by using Dropbox to host Excel files which I can then access using the xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different; I managed to get it sort of working using the code below.
I actually don't think you need to use RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
download_file_url <- function(url, outfile, ..., sha1 = NULL)
{
  require(RCurl)
  require(devtools)
  require(repmis)
  require(httr)
  require(digest)
  stopifnot(is.character(url), length(url) == 1)
  filetag <- file(outfile, "wb")
  request <- GET(url)
  stop_for_status(request)
  writeBin(content(request, type = "raw"), filetag)
  close(filetag)
}
This seems to work for producing local versions of binary files - Excel included. Nicer, neater, smarter improvements on this are gratefully received.
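For example, a hedged usage sketch (the output file name, sheet index, and Dropbox placeholders are illustrative only):
download_file_url(
  url     = "https://dl.dropboxusercontent.com/s/{THE SHA-1 KEY}/{THE FILE NAME}",
  outfile = "local_copy.xlsx"
)
dat <- xlsx::read.xlsx("local_copy.xlsx", sheetIndex = 1)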