Download all R zip packages from a webpage

The organization I currently work for blocks the CRAN repository in RStudio. So in order to install packages I need to go to http://cran.rstudio.com/bin/windows/contrib/3.6/, manually download each package and its dependencies, and install them in RStudio. It gets quite tedious.
Is there a way for me to download all of the zip files on this page at once and put them in a folder on my desktop? And from there, is there code to install and load all of the zip-file packages at once in RStudio?
Thank you in advance!

Here is a possible example using the package rvest. The rvest functions are used to get the list of packages to be downloaded.
Note that the Sys.sleep(1L) call pauses the execution for one second between downloads. You can obviously change that or remove it altogether.
library(rvest)

url <- 'https://cran.rstudio.com/bin/windows/contrib/3.6'

# Scrape the page and keep only the links that end in .zip
packages <- rvest::read_html(url) %>%
  rvest::html_nodes("a") %>%
  rvest::html_text() %>%
  grep('\\.zip$', ., value = TRUE) %>%
  sort()

for (pkg in packages) {
  Sys.sleep(1L)
  cat('Downloading', pkg, '...')
  pkg_url <- file.path(url, pkg)
  # mode = 'wb' keeps the binary .zip files intact on Windows
  download.file(pkg_url, destfile = pkg, quiet = TRUE, mode = 'wb')
  cat(' DONE.\n')
}
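To answer the second part of the question: once the zip files sit in a local folder, something along these lines should install and load them all. This is only a sketch, assuming the files were downloaded into the current working directory (adjust the path otherwise):
# Install every downloaded Windows binary (.zip) from the local folder.
# repos = NULL tells install.packages() to use local files instead of a repository.
zips <- list.files(pattern = "\\.zip$")
install.packages(zips, repos = NULL, type = "win.binary")

# The package name is the part of the file name before the first underscore
pkg_names <- sub("_.*$", "", zips)
invisible(lapply(pkg_names, library, character.only = TRUE))
Note that with repos = NULL, install.packages() does not resolve dependencies, so the dependencies have to be present in the same folder (which they are if the whole contrib page was downloaded).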

Related

R - Can't access CPS data using lodown package

The lodown package works great for me for the most part - I was able to download ACS and CES data without issue. But when I try to use it to access CPS data, I get the following output:
lodown( "cpsbasic" , output_dir = file.path( path.expand( "~" ) , "CPSBASIC" ) )
building catalog for cpsbasic
Error in rvest::html_table(xml2::read_html(cps_ftp), fill = TRUE)[[2]] :
subscript out of bounds
Tried a fresh install of R and the packages involved, but I still get the same error. I think it has something to do with the Census updating their website since the package was last updated, but I'm not clear on what the specific problem is.
I did dig up the install files for the package. The specific lines of code at issue are below:
cps_ftp <- "https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html"
cps_table <- rvest::html_table( xml2::read_html( cps_ftp ) , fill = TRUE )[[2]]
I'm not sure how actively the developer maintains the package anymore, so I don't know that an update will be coming anytime soon. Any ideas?
We can download both .csv files referenced by cps_ftp as follows:
library(rvest)
library(stringr)
library(readr)

# get the links to the csv files
links <- 'https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html' %>%
  read_html() %>%
  html_nodes('.uscb-layout-align-start-start') %>%
  html_nodes('a') %>%
  html_attr('href')

# keep only the csv links and make them absolute URLs
csv_links <- links %>% str_subset('csv') %>% paste0('https:', .)

# read the csv files
csv_files <- lapply(csv_links, read_csv)
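If you would rather keep the files on disk instead of reading them straight into memory, a small sketch (the destination here is simply the working directory):
# save each csv locally instead of reading it into R
for (link in csv_links) {
  download.file(link, destfile = basename(link), mode = "wb")
}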

Problem with saving kable table (install_phantomjs)

Let's consider a very simple table created by kable:
library(knitr)
library(kableExtra)
x <- data.frame(1:3, 2:4, 3:5)
x <- kable(x, format = "pipe", col.names = c("X_1", "X_2", "X_3"), caption = "My_table")
I want to save this table in .pdf format:
x %>% save_kable("My_table.pdf")
But I get an error:
PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
However, when trying to install it with the proposed command:
webshot::install_phantomjs()
I get an error:
Error in utils::download.file(url, method = method, ...) :
cannot open URL 'https://github.com/wch/webshot/releases/download/v0.3.1/phantomjs-2.1.1-windows.zip'
So my question is: is there any possibility to save a kable table without using PhantomJS?
The command works for me and the URL is also available.
I suspect that the file (it's a .zip file) is being blocked by your firewall or anti-virus software.
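If installing PhantomJS stays blocked, one possible workaround (untested here, and it assumes a LaTeX distribution such as TinyTeX is available) is to build the table in "latex" format; saving a LaTeX kable to PDF should then go through a LaTeX compilation rather than a PhantomJS webshot:
library(knitr)
library(kableExtra)

x <- data.frame(1:3, 2:4, 3:5)

# Build the table as a LaTeX kable instead of a "pipe" one; save_kable()
# then compiles it with LaTeX (needs a TeX installation, not PhantomJS)
kable(x, format = "latex", booktabs = TRUE,
      col.names = c("X_1", "X_2", "X_3"), caption = "My_table") %>%
  save_kable("My_table.pdf")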

R, edit movie files or avi files?

I have a bunch of .avi files I would like to load into R, break down each frame as an individual image, arrange the images, and save them as a separate image. In spite of a sincere effort to find a package that loads .avi files, I can't find anything.
1) Is it possible to load and work with .avi files in R?
2) How is this done?
3) Is there a specific package for this?
I've seen several examples using Linux, such as this post, but I'm hoping for an R solution.
Converting AVI Frames to JPGs on Linux
I figured this out. As indicated in the comments section, ffmpeg is the external tool that various R packages call to load videos into the R environment. The imager package has a function called load.video.internal that works well and uses ffmpeg. I had to download the package from GitHub because this function was not available in the version I installed with install.packages(). I ended up copying the function from the package source and commenting out the reference to the has.ffmpeg() function because it kept hanging at that step. Using paste(), I was able to pass the file path and load an .avi file successfully.
Modified load.video.internal function:
load.video.internal <- function(fname, maxSize = 1, skip.to = 0, frames = NULL,
                                fps = NULL, extra.args = "", verbose = FALSE) {
  # if (!has.ffmpeg()) stop("Can't find ffmpeg. Please install.")
  dd <- paste0(tempdir(), "/vid")
  if (!is.null(frames)) extra.args <- sprintf("%s -vframes %i ", extra.args, frames)
  if (!is.null(fps)) extra.args <- sprintf("%s -vf fps=%.4f ", extra.args, fps)
  # Build the ffmpeg call: dump every frame as a numbered .bmp in a temp dir
  arg <- sprintf("-i %s %s -ss %s ", fname, extra.args, as.character(skip.to)) %>%
    paste0(dd, "/image-%d.bmp")
  tryCatch({
    dir.create(dd)
    system2("ffmpeg", arg, stdout = verbose, stderr = verbose)
    fls <- dir(dd, full.names = TRUE)
    if (length(fls) == 0) stop("No output was generated")
    # Put the extracted frames back into numeric order
    ordr <- stringr::str_extract(fls, "(\\d)+\\.bmp") %>%
      stringr::str_split(stringr::fixed(".")) %>%
      purrr::map_int(~ as.integer(.[[1]])) %>%
      order
    fls <- fls[ordr]
    # Check total size
    imsz <- load.image(fls[[1]]) %>% dim %>% prod
    tsz <- ((imsz * 8) * length(fls)) / (1024^3)
    if (tsz > maxSize) {
      msg <- sprintf("Video exceeds maximum allowed size in memory (%.2f Gb out of %.2f Gb)", tsz, maxSize)
      unlink(dd, recursive = TRUE)
      stop(msg)
    } else {
      out <- map_il(fls, load.image) %>% imappend("z")
      out
    }
  },
  finally = unlink(dd, recursive = TRUE))
}
Example of its use:
vid_in <- load.video.internal(paste("/home/phil/Documents/avi_files/Untitled.avi"))
UPDATED SIMPLER VERSION:
The magick package will read .avi files using the image_read() function.
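A minimal sketch of that approach (the file name is hypothetical, and it assumes the ImageMagick build on your system has the ffmpeg delegate for video):
library(magick)

# Each frame of the video becomes one element of a magick-image vector
frames <- image_read("Untitled.avi")
length(frames)                       # number of frames read

# Frames can then be handled like ordinary images, e.g. save the first one
image_write(frames[1], "frame-001.png")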

Download data from web using R gives libcurl error

I have the following R script for downloading data but it gives me an error. How can I fix this error?
rm(list = ls(all = TRUE))
library('purrr')
years <- c(1980:1981)
days <- c(001:002)
walk(years, function(x) {
  map(x, ~sprintf("https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%s/.grb", years, days)) %>%
    flatten_chr() -> urls
  download.file(urls, basename(urls), method = "libcurl")
})
Error:
Error in download.file(urls, basename(urls), method = "libcurl") :
download.file(method = "libcurl") is not supported on this platform
That means that libcurl may not be installed or available for your operating system. Please note that the method argument has other options and that the default method varies across operating systems (more or less the same as "platform" in the error message). I would try other methods (e.g., wget, curl, ...).
From the help of download.file:
The supported ‘method’s do change: method ‘libcurl’ was introduced
in R 3.2.0 and is still optional on Windows - use
‘capabilities("libcurl")’ in a program to see if it is available.
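A quick check along those lines (the URL is only a placeholder):
# See whether this R build supports libcurl downloads
capabilities("libcurl")

# If not, fall back to another method, e.g. "curl" or "wget"
# (the corresponding tool must be installed on the system)
url <- "https://example.com/file.grb"  # placeholder URL
method <- if (capabilities("libcurl")) "libcurl" else "curl"
download.file(url, destfile = basename(url), method = method, mode = "wb")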
I had started to do a light edit to @gballench's answer (since I don't really need the points), but it's more complex than you have it, since you're not going to get to the files you need with that idiom (which I'm 99% sure is from an answer of mine :-) for a whole host of reasons.
First, days needs to be padded to length 3 with 0s, but the way you did it won't do that. Second, you likely want to download all the .grb files from each year/00x combo, so you need a way to get those. Finally, that site requires authentication, so you need to register and use basic authentication for it.
Something like this:
library(purrr)
library(httr)
library(rvest)

years <- c(1980:1981)
days <- sprintf("%03d", 1:2)

sprintf("http://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%%s/", years) %>%
  map(~sprintf(.x, days)) %>%
  flatten_chr() %>%
  map(~{
    base_url <- .x
    sprintf("%s/%s", base_url, read_html(.x) %>%
              html_nodes(xpath = ".//a[contains(@href, '.grb')]") %>%
              html_attr("href"))
  }) %>%
  flatten_chr() %>%
  discard(~grepl("xml$", .)) %>%
  walk(~{
    output_path <- file.path("FULL DIRECTORY PATH", basename(.x))
    if (!file.exists(output_path)) {
      message(.x)
      GET(
        url = .x,
        config = httr::config(ssl_verifypeer = FALSE),
        write_disk(output_path, overwrite = TRUE),
        authenticate(user = "me@example.com", password = "xldjkdjfid8y83"),
        progress()
      )
    }
  })
You'll need to install the httr package which will install the curl package and ultimately make libcurl available for simpler batch downloads in the future.
I remembered that I had an account so I linked it with this app & tested this (killed it at 30 downloads) and it works. I added progress() to the GET() call so you can see it downloading individual files. It skips over already downloaded files (so you can kill it and restart it at any time). If you need to re-download any, just remove the file you want to re-download.
If you also need the .xml files, then remove the discard() call.

Reading pdf with TM package

I am trying to read PDF files with the tm package. I have gone through successfully in most of the attempts, but one. I have several folders with hundreds of documents each, and I have read all of them but one. The problem is that the PDFs in that specific folder have a sequence of images at the bottom of the first page that prevents me from reading them. I get the following error:
Error in strptime(d, fmt) : input string is too long
If I remove the first page, I manage to read them. I could do it without much loss of relevant information, but it is too much work.
I tried with xpdf and Ghostscript, but both give me the same error.
My code is as follows:
library(rvest)
library(tm)

url <- paste0("http://www.tjrj.jus.br/search?q=acidente+de+transito+crianca+atropelamento&btnG=Pesquisar&processType=cnj&site=juris&client=juris&output=xml_no_dtd&proxystylesheet=juris&entqrm=0&oe=UTF-8&ie=UTF-8&ud=1&filter=0&getfields=*&partialfields=(ctd:1)&exclude_apps=1&ulang=en&lr=lang_pt&sort=date:D:S:d1&as_q=+&access=p&entqr=3&start=", seq(0, 462, 10))
css <- sprintf(".margin-top-10:nth-child(%d) .outros .featured", 1:10)

for (j in 1:1) {  # there are 47 pages, but I only put one here
  page <- read_html(url[j])
  for (i in 1:10) {  # there are 10 files per page
    a <- html_node(page, css = css[i]) %>%
      html_attr("href")
    download.file(a, paste0("doc", j, i, ".pdf"))
  }
}

files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
docs <- Corpus(URISource(files, encoding = "UTF-8"),
               readerControl = list(reader = Rpdf, language = "portuguese"))
Does someone have a suggestion? I use a Mac.
Late answer: I recently discovered that with the current version of tm (0.7-4), readPDF uses pdftools as the default engine to read PDFs.
library(tm)

directory <- getwd()  # change this to the directory where the pdf files are located

# read the pdfs with readPDF; the default engine is pdftools, see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                     readerControl = list(reader = readPDF))
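To quickly check that the PDFs were read, something like this (just a sketch) can be used:
# number of documents in the corpus and the first lines of the first document
length(my_corpus)
head(content(my_corpus[[1]]))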
