Download multiple files using "download.file" function

Download multiple files using "download.file" function - r

I am trying to download PDFs from a website using R.
I have a vector of the PDF-URLs (pdfurls) and a vector of destination file names (destinations):
e.g.:
pdfurls <- c("http://website/name1.pdf", "http://website/name2.pdf")
destinations <- c("C:/username/name1.pdf", "C:/username/name2.pdf")
The code I am using is:
for(i in 1:length(urls)){
download.file(urls, destinations, mode="wb")}
However, when I run the code, R accesses the URL, downloads the first PDF, and repeats downloading the same PDF over and over again.
I have read this post: for loop on R function and was wondering if this has something to do with the function itself or is there a problem with my loop?
The code is similar to the post here: How to download multiple files using loop in R? so I was wondering why it is not working and if there is a better way to download multiple files using R.

I think your loop is mostly fine, except you forgot to index the urls and destinations objects.
Tangentially, I would recommend getting in the habit of using seq_along instead of 1:length() when defining for loops.
for(i in seq_along(urls)){
download.file(urls[i], destinations[i], mode="wb")
}
Or using Map as suggested by #docendodiscimus :
Map(function(u, d) download.file(u, d, mode="wb"), urls, destinations)

Related

Looping through nc files in R

Good morning everyone,
I am currently using the code written by Antonio Olinto Avila-da-Silva on this link: https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?tid=5954
It allows me to extract data of type sst/chlor_a from nc file. It uses a loop to create an excel file with all the data. Unfortunately, I noticed that the function only takes the first data file in the loop. Thus, I find myself with 20 times the same data in a row in my excel file.
Does anyone have a solution to make this loop work properly?

I would first check out that these two lines contain all the files you are expecting:
(f <- list.files(".", pattern="*.L3m_MO_SST_sst_9km.nc",full.names=F))
(lf<-length(f))
And then there's a bug in the for-loop. This line:
data<-nc_open(f)
Needs to reference the iterator i, so change it to something like this:
data<-nc_open(f[[i]])
It appears both scripts have this same bug.

How can I perform the same set of commands separately for each file in the same file path?

I am trying to deal with extracting a subset from multiple .grb2 files in the same file path, and write them in a csv. I am able to do it for one (or a few) by using the following set of commands:
GRIB <- brick("tmp2m.1989102800.time.grb2")
GRIB <- as.array(GRIB)
readGDAL("tmp2m.1989102800.time.grb2")
tmp2m.6hr <- GRIB[51,27,c(261:1232)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,"tmp1.csv")
The above set of commands extract, in csv, temperature values for specific latitude "51" and longitude "27", as well as for a specific time range "c(261:1232)".
Now I have hundreds of these files (with different file names, of course) in the same directory and I want to do the same for all. As you know, better than me, I cannot do this to one by one, changing the file name each time.
I have struggled a lot with this, but so far I did not manage to do it. Since I am new in R, and my knowledge is limited, I would very much appreciate any possible help with this.

The simplest way would be to use a normal for loop:
path <- "your file path here"
input.file.names <- dir(path, pattern =".grb2")
output.file.names <- paste0(tools::file_path_sans_ext(file.names),".csv")
for(i in 1:length(file.names)){
GRIB <- brick(input.file.names[i])
GRIB <- as.array(GRIB)
readGDAL(input.file.names[i]) # edited line
tmp2m.6hr <- GRIB[51,27,c(261:1232)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,output.file.names[i])
}
You could of course create the body of the for loop into a function and then use the standard lapply or the map function from purrr.
Note that this code will print out different CSV files. If you want to append the data to a single file then you should check out write.table

Reading multiple offline html files to a list in R

I have rawdata as 20 offline html files stored in following format
../rawdata/1999_table.html
../rawdata/2000_table.html
../rawdata/2001_table.html
../rawdata/2002_table.html
.
.
../rawdata/2017_table.html
These files contain tables that I am extracting and reshaping to a particular format.
I want to read these files at once to a list and process them one by one through a function that I have written.
What I tried:
I put the names of these files into an Excel file called filestoread.xlsx and used a for loop to load these files using the names mentioned in the sheet. But it doesn't seem to work
filestoread <- fread("../rawdata/filestoread.csv")
x <- list()
for (i in nrow(filestoread)) {
x[[i]] <- read_html(paste0("../rawdata/", filestoread[i]))
}
How can this be done?
Also, after reading the HTML files I want to extract the tables from them and reshape them using a function I wrote after converting it to a data table.
My final objective is to rbind all the tables and have a single data table with year wise entries of the tables in the html file.

First save path of your data on one of the following ways.
Either, hardcoded
filestoread <- paste0("../rawdata/", 1999:2017, "_table.html")
or reading all html files in the directory
filestoread <- list.files(path = "../rawdata/", pattern="\\.html$")
Then use lapply()
library(rvest)
lapply(filestoread, function(x) try(read_html(x)))
Note: try() runs the code even when there is a file missing (throwing error).
The second part of your question is a little broad, depends on the content of your files, and there are already some answers, you could consider e.g. this answer. In principle you use a combination of ?html_nodes and ?html_table.

Using a for loop to write in multiple .grd files

I am working with very large data layers for a SDM class and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory restraint. These blocks were written out as .grd files, and now I need to get them read back into R and merged together. I am extremely new to R an programming in general so any help would be appreciated. What I have been trying so far looks like this:
merge.coarse=raster("coarseBlock1.grd")
for ("" in 2:nBlocks){
merge.coarse=merge(merge.coarse,raster(paste("coarseBlock", ".grd", sep="")))
}
where my files are in coarseBlock.grd and are sequentially numbered from 1 to nBlocks (259)
Any feed back would be greatly appreciated.

Using for loops is generally slow in R. Also, using functions like merge and rbind in a for loop eat up a lot of memory because of the way R passes values to these functions.
A more efficient way to do this task would be to call lapply (see this tutorial on apply functions for details) to load the files into R. This will result in a list which can then be collapsed using the rbind function:
rasters <- lapply(list.files(GRDFolder), FUN = raster)
merge.coarse <- do.call(rbind, rasters)

I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I will refer to as GRDFolder), then you can try this:
merge.coarse <- raster("coarseBlock1.grd")
for(filename in list.files(GRDFolder))
{
temp <- raster(filename)
merge.coarse <- rbind(merge.coarse, temp)
}

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (class(t.sf) == 'try-error')
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.

In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.

There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".

I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!

There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like it's predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)

There is package developed by RStudio called pins that might address this problem.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Download multiple files using "download.file" function - r

Related

Looping through nc files in R

How can I perform the same set of commands separately for each file in the same file path?

Reading multiple offline html files to a list in R

Using a for loop to write in multiple .grd files

R: Improving workflow and keeping track of output

Categories

Resources