downloading data in a loop, how often to save that information - r

I'm using R, but this question isn't specific to it. Suppose you've written a loop which downloads a URL at each iteration. You want to save this data, so you could either write it out on every iteration or hold the results and save them every nth iteration. Are there any general rules of thumb for this? How slow is it to open and close a file for writing all the time? What I have in mind is
for (i in 1:1000) {
  data <- url("http://...i")
  write.table(data, file = "file")
}
versus something like this
data <- list()
length(data) <- 20
j <- 1
for (i in 1:1000) {
  data[[j]] <- url("http://...i")
  j <- j + 1
  if (i %% 20 == 0) {
    write.table(data, file = "file")
    j <- 1
  }
}

If all your downloaded data are of the same form, you might want to append them to a single file, in which case you can write at each iteration. Here is a short example:
sites <- c("714A", "715A", "716A")
for (i in 1:length(sites)) {
  # In this example I downloaded paleomagnetic data from deep-sea drilling sites.
  data <- read.table(file = paste("http://www.ngdc.noaa.gov/mgg/geology/odp/data/115/",
                                  sites[i], "paleomag.txt", sep = "/"),
                     sep = "\t", header = TRUE)
  h <- (i == 1)  # output the column names only the first time
  write.table(data, file = "paleomag_leg115.txt", sep = "\t",
              append = !h, col.names = h, row.names = FALSE)
}
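If instead you hold results in memory and write less often (as in the second sketch in the question), a common pattern is to collect each result in a list and flush a batch to disk every nth iteration. A minimal sketch of that idea; the URL pattern, batch size, and output file are placeholders, not from the original answer:
# Sketch: accumulate results in a list and append a batch every 20 iterations.
results <- vector("list", 1000)
for (i in 1:1000) {
  results[[i]] <- read.table(url(paste0("http://example.com/data_", i, ".txt")),
                             sep = "\t", header = TRUE)
  if (i %% 20 == 0) {
    batch <- do.call(rbind, results[(i - 19):i])
    write.table(batch, file = "batched_output.txt", sep = "\t",
                append = (i > 20), col.names = (i == 20), row.names = FALSE)
  }
}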

Related

problem with loops: creating names based on increasing values for reading in CSV file

Okay, so I needed to split a larger file into a bunch of CSVs to run through a non-R program. I used this loop to do it:
library(dplyr)  # for slice()
library(readr)  # for write_csv()
for (k in 1:14) {
  inner_edge <- 2000000L * (k - 1) + 1
  outter_edge <- 2000000L * k
  part <- slice(nc_tu_CLEAN, inner_edge:outter_edge)
  out_name <- paste0("geo/geo_CLEAN", k, ".csv")
  write_csv(part, out_name)
  Sys.time()
}
which worked great. Except I'm having a problem in this other program, and need to read a bunch of these back in to troubleshoot. I tried to write this loop for it, and get the following error:
for (k in 1:6) {
  csv_name <- paste0("geo_CLEAN", k, ".csv")
  geo_CLEAN_(k) <- fread(file = csv_name)
}
|--------------------------------------------------|
|==================================================|
Error in geo_CLEAN_(k) <- fread(file = csv_name) :
  could not find function "geo_CLEAN_<-"
I know I could do this line by line, but I'd like it to be a loop if possible. What I want is for geo_CLEAN_1 to hold the result of fread on geo_CLEAN1.csv, geo_CLEAN_2 the result for geo_CLEAN2.csv, etc.
We need assign if we are interested in creating objects whose names are built at run time:
library(data.table)  # for fread()
for (k in 1:6) {
  csv_name <- paste0("geo_CLEAN", k, ".csv")
  assign(sub("\\.csv", "", csv_name), fread(file = csv_name))
}
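If creating many separate objects is not essential, a named list is often easier to work with afterwards. This is a sketch of that alternative, not part of the original answer:
# Sketch: read the six files into a named list instead of separate objects.
library(data.table)
geo_list <- lapply(1:6, function(k) fread(paste0("geo_CLEAN", k, ".csv")))
names(geo_list) <- paste0("geo_CLEAN_", 1:6)
# A single piece is then available as geo_list[["geo_CLEAN_1"]]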

How to parse XML efficiently in R

I use R to parse XML data from a website. I have a list of 20,000 rows with URLs from which I need to extract data. I have code which gets the job done using a for loop, but it's very slow (it takes approx. 12 hours). I thought of using parallel processing (I have access to several CPUs) to speed it up, but I cannot make it work properly. Would it be more efficient to use a data table instead of a data frame? Is there any way to speed the process up? Thanks!
for (i in 1:nrow(list)) {
  t <- xmlToDataFrame(xmlParse(read_xml(list$path[i]))) # Read the data into a file
  t$ID <- list$ID[i]
  emptyDF <- bind_rows(all, t) # Bind all into one file
  if (i / 10 == floor(i / 10)) {
    print(i)
  } # print every 10th value to monitor progress of the loop
}
This script should point you in the correct direction:
library(XML)    # xmlParse(), xmlToDataFrame()
library(dplyr)  # bind_rows()
t <- list()
for (i in 1:nrow(list)) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i])) # read the data for one URL into a data frame
  tempdf$ID <- list$ID[i]
  t[[i]] <- tempdf
  if (i %% 10 == 0) {
    print(i)
  } # print every 10th value to monitor progress of the loop
}
answer <- bind_rows(t) # bind all into one data frame
Instead of a for loop, an lapply would also work here. Without any sample data, this is untested.
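For completeness, the lapply version might look roughly like this; it makes the same assumptions about list$path and list$ID and is equally untested without sample data:
library(XML)
library(dplyr)
answer <- bind_rows(lapply(seq_len(nrow(list)), function(i) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i]))
  tempdf$ID <- list$ID[i]
  tempdf
}))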

saving dataframe output to multiple folders

Pardon my newbieness, but I feel like I've arrived at my wit's end.
I have a (spatial polygon) dataframe (tri.extract) that houses all of my data. Every row in this dataframe corresponds to an image. Every image in the dataframe belongs to a parcel and thus has an attribute parcel_id, which denotes which parcel the image belongs to. I wish to save all the images in sub-folders so that each image ends up in the folder of its respective parcel.
parcels <- data.frame(unique(tri.extract@data$parcel_id))
save.dir <- "/home/iordamo/Documents/GIS_Workload/bbox/DemoGrasslandTIMED_END_ImagesMapillary/"
# create sub-folders named after parcel_ids
for (i in 1:nrow(parcels)) {
  dir.create(paste0(save.dir, parcels[i, ]))
}
# the save loop itself
for (i in 1:nrow(tri.extract@data)) {
  # generate URLs for each image in the dataframe
  img_url <- paste0('https://d1cuyjsrcm0gby.cloudfront.net/',
                    tri.extract@data$key[i],
                    '/thumb-2048.jpg')
  # create a dataframe of all the folder names within save.dir - the parcels
  dirs.to.save1 <- data.frame(list.files(save.dir, recursive = F))
  dirs.to.save1[] <- lapply(dirs.to.save1, as.character)
  for (g in 1:nrow(dirs.to.save1)) {
    if (g == 1) {
      row <- dirs.to.save1[g, ]
      #print(row)
      img_path <- file.path(paste0(save.dir, row, "/"), paste0("i_", tri.extract@data$key[i], ".jpg"))
      download.file(img_url, img_path, quiet = TRUE, mode = "wb")
      #next
    } else if (g > 1) {
      row <- dirs.to.save1[g, ]
      #print(row)
      img_path <- file.path(paste0(save.dir, row, "/"), paste0("i_", tri.extract@data$key[i], ".jpg"))
      download.file(img_url, img_path, quiet = TRUE, mode = "wb")
      #next
    }
  }
}
With the code in its current form, all of the images get saved in every sub-folder. Can anyone explain why? To my understanding I am looping through each record of the dataframe (tri.extract), generating a URL, then (in the nested loop) looping through each parcel, building a file.path from save.dir, the current row of the dirs.to.save1 dataframe, and the respective image id (tri.extract@data$key[i]). I expected this to write into each respective folder because I am looping through them in the nested loop. Can someone explain where my logic fails to translate into execution?
Ok, that wasn't too hard.
The solution turned out to be, as usual, simpler than what I originally conjured up:
for (i in 1:nrow(tri.extract@data)) {
  img_url <- paste0('https://d1cuyjsrcm0gby.cloudfront.net/',
                    tri.extract@data$key[i],
                    '/thumb-2048.jpg')
  for (g in 1:nrow(parcels)) {
    row <- droplevels(parcels[g, ])
    if (tri.extract@data$parcel_id[i] == parcels[g, ]) {
      img_path <- file.path(paste0(save.dir, row), paste0("i_", tri.extract@data$key[i], ".jpg"))
      download.file(img_url, img_path, quiet = TRUE, mode = "wb")
    }
  }
}
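A side note, not part of the original answer: since the sub-folders are named after parcel_id, the inner loop could arguably be dropped and the destination path built directly from each row's own parcel_id. A sketch under that assumption:
# Sketch: build the destination path directly from the row's parcel_id,
# assuming each sub-folder is named exactly after a parcel_id.
for (i in 1:nrow(tri.extract@data)) {
  img_url  <- paste0('https://d1cuyjsrcm0gby.cloudfront.net/',
                     tri.extract@data$key[i], '/thumb-2048.jpg')
  img_path <- file.path(paste0(save.dir, tri.extract@data$parcel_id[i]),
                        paste0("i_", tri.extract@data$key[i], ".jpg"))
  download.file(img_url, img_path, quiet = TRUE, mode = "wb")
}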

reading and processing files in parallel in R

I am using the parallel library in R to process a large data set on which I am applying complex operations.
For the sake of reproducibility, here is a simpler example:
# data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for (i in 1:800) {
  my.matrix <- matrix(runif(100), ncol = 10, nrow = 10)
  saveRDS(my.matrix, file = paste0(dir, "/matrix", i))
}
# worker function
worker.function <- function(files) {
  files.length <- length(files)
  partial.results <- vector('list', files.length)
  for (i in 1:files.length) {
    matrix <- readRDS(files[i])
    partial.results[[i]] <- sum(diag(matrix))
  }
  Reduce('+', partial.results)
}
# master part
library(parallel)
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path = dir, recursive = FALSE, full.names = TRUE)
part <- clusterSplit(cl, seq_along(file_list))
files.partitioned <- lapply(part, function(p) file_list[p])
results <- clusterApply(cl, files.partitioned, worker.function)
result <- Reduce('+', results)
Essentially, I am wondering whether reading the files in parallel ends up interleaved at the disk and, as a result, whether that I/O bottleneck cuts into the expected performance gain of running tasks in parallel.
Would it be better if I first read all matrices into a list at once and then sent chunks of that list to each core for processing? And if these matrices were much larger, would I still be able to load all of them into a list at once?
Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function looks like:
worker.function <- function(file) {
  matrix_list <- readRDS(file)
  partial.results <- lapply(matrix_list, function(mat) sum(diag(mat)))
  Reduce('+', partial.results)
}
You should save some time on I/O, and maybe even on computation, by replacing the for loop with an lapply.
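To make that suggestion concrete, the data-generation side might look roughly like the sketch below; the chunk layout and file names are illustrative, not from the original answer:
# Sketch: write one RDS file per worker, each holding a list of matrices,
# so that each worker reads a single file. Chunk layout is illustrative.
library(parallel)
n.workers <- detectCores()
all.matrices <- lapply(1:800, function(i) matrix(runif(100), ncol = 10, nrow = 10))
chunks <- split(all.matrices, cut(seq_along(all.matrices), n.workers, labels = FALSE))
for (k in seq_along(chunks)) {
  saveRDS(chunks[[k]], file = paste0(dir, "chunk", k, ".rds"))
}
# Each chunk file can then be handed to one worker running the modified worker.function.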

file.show and user input in a loop

I have a dataframe data with information on tiffs, including one column txt describing the content of each tiff. Unfortunately, txt is not always correct and we need to correct it by hand. Therefore I want to loop over each row in data, show the tiff, and ask for feedback, which is then put into data$txt.cor.
setwd(file.choose())
Some test tiffs (with nonsense inside, but to show the idea...):
txt <- sample(100:199, 5)
for (i in 1:length(txt)) {
  tiff(paste0(i, ".tif"))
  plot(txt[i], ylim = c(100, 200))
  dev.off()
}
and the dataframe:
pix.files <- list.files(getwd(), pattern = "*.tif", full.names = TRUE)
pix.file.info <- file.info(pix.files)
data <- cbind(txt, pix.file.info)
data$file <- row.names(pix.file.info)
data$txt.cor <- ""
data$txt[5] <- 200 # wrong one
My feedback function (error handling stripped):
read.number <- function() {
  n <- readline(prompt = "Enter the value: ")
  n <- as.character(n) # Yes, character. Sometimes we have alphanumerical data or leading zeros
}
Now the loop, for which help would be very much appreciated:
for (i in nrow(data)) {
  file.show(data[i, "file"]) # show the image file
  data[i, "txt.cor"] <- read.number() # ask for the feedback and put it back into the dataframe
}
In my very first attempts I was thinking of the plot.lm idea, where you go through the diagnostic plots after pressing return. I suspect that plot and tiffs are not big friends. file.show turned out to be easier. But now I am having a hard time with that loop...
Your problem is that you don't loop over the data; you only evaluate the last row. Simply write 1:nrow(data) to iterate over all rows.
To display your tiff images in R you can use the package rtiff:
library(rtiff)
for (i in 1:nrow(data)) {
  tif <- readTiff(data[i, "file"]) # read in the tiff data
  plot(tif) # plot the image
  data[i, "txt.cor"] <- read.number() # ask for the feedback and put it back into the dataframe
}
