I've written a script to automate the processing of my raster data, but it fails partway through. The loop looks in the directory unzipped for subdirectories that contain .tif files and stacks those with some basemaps in /base_layers. It's been failing with:
Error in matrix(unlist(ini), ncol = 2, byrow = TRUE) : 'data' must be of a vector type, was 'NULL'
and
In writeBin(v, x@file@con, size = x@file@dsize) :
problem writing to connection
and
Error in .local(.Object, ...) :
GDAL Error 3: Free disk space available is 875151360 bytes, whereas 2473962400 are at least necessary. You can disable this check by defining the CHECK_DISK_FREE_SPACE configuration option to FALSE.
FIXED I think, see answers
Ran in a kubernetes pod with Ubuntu, 24 gigs of ram, and 50 gigs of storage, and 4 vCPUs and also in a VM instance running Ubuntu 18.04 with 2 vCPUs, 8 gigs of ram and 50 gigs of storage. The pods were failing part way through because of memory issues, and the VM didn't complete either. Code typically fails when it reaches the mask raster stage but only after running through a few iterations of the loop (so it all works at the start).
If anyone can point out if memory fragmentation is likely to occur in this script, or ways to clear the OS memory I would be forever grateful!
Script:
library(raster)
library(rgdal)
input = "/mnt/nfs/data/unzipped"
output = "/mnt/nfs/data/training" #path to where the data should go
#get paths to basemaps
DEM = raster("/mnt/nfs/data/base_layers/Clipped_filled_dem.tif")
flow_accum = raster("/mnt/nfs/data/base_layers/accumulation.tif")
slope = raster("/mnt/nfs/data/base_layers/Clipped_slope.tif")
aspect = raster("/mnt/nfs/data/base_layers/Clipped_filled_aspect.tif")
ruggedness = raster("/mnt/nfs/data/base_layers/Clipped_TRI.tif")
#get directories that have data that needs to be processed
directory = list.dirs(path = input, recursive = FALSE)
for(direct in directory) {
subdirect = list.dirs(path = direct,recursive = FALSE)
for(sub in subdirect){
files_for_raster <- list.files(path = sub, pattern = "*.tif$", full.names = TRUE)
rasterstack = stack(files_for_raster)
# name of datapull
name = gsub(paste(direct,"/",sep=""),"",sub)
print(c("working in",name))
# crop DEM to the extent of the satellite image
DEMcrop = crop(DEM,rasterstack) #extent can be a raster
flow_accumcrop = crop(flow_accum,rasterstack)
slopecrop = crop(slope,rasterstack)
aspectcrop = crop(aspect,rasterstack)
ruggednesscrop = crop(ruggedness,rasterstack)
print(c("cropped"))
print(object.size(DEMcrop))
print(object.size(rasterstack))
# resample rasters, this will take a bit
DEMcrop = resample(DEMcrop,rasterstack)
flow_accumcrop = resample(flow_accumcrop,rasterstack)
slopecrop = resample(slopecrop,rasterstack)
aspectcrop = resample(aspectcrop,rasterstack)
ruggednesscrop = resample(ruggednesscrop,rasterstack)
print(c("resampled"))
print(object.size(DEMcrop))
print(object.size(rasterstack))
# mask layers
DEMcrop = mask(DEMcrop,raster::subset(rasterstack,1))
flow_accumcrop = mask(flow_accumcrop,raster::subset(rasterstack,1))
slopecrop = mask(slopecrop,raster::subset(rasterstack,1))
aspectcrop = mask(aspectcrop,raster::subset(rasterstack,1))
ruggednesscrop = mask(ruggednesscrop,raster::subset(rasterstack,1))
print(c("masked"))
print(object.size(DEMcrop))
print(object.size(rasterstack))
# add baselayers to the raster stack
finalstack = addLayer(rasterstack,DEMcrop,flow_accumcrop,slopecrop,aspectcrop,ruggednesscrop)
print(names(finalstack))
print(nlayers(finalstack))
bands<-c("band1","band2","band3","band4","band5","band6","band7")
type<-c("quality","sr_ndvi","DEM","flow_accum","slope","aspect","TRI")
band_info<-data.frame(bands,type)
print("finalstack")
# create new output directory and save raster there
output_subdirect = gsub(paste(input,"/",sep=""),"",sub)
dir.create(file.path(output,output_subdirect), recursive = TRUE)
Sys.chmod(file.path(output,output_subdirect), mode = "777", use_umask = FALSE)
print("created directory")
write = file.path(output,output_subdirect)
writeRaster(finalstack, format="GTiff", filename=file.path(write,name,fsep="/"), options=c("INTERLEAVE=BAND","COMPRESS=NONE"), overwrite=TRUE)
write.csv(band_info, file = paste(file.path(write,name,fsep="/"),".csv",sep=""))
print("done processing")
rm(rasterstack,DEMcrop,flow_accumcrop,slopecrop,aspectcrop,ruggednesscrop)
gc()
print(gc())
system("sudo sysctl -w vm.drop_caches=3")
}
}
# useful functions
#mystack = stack("path/to/multilayer.tif") # multilayer.tif is an existing raster stack
#band1 = subset(mystack,subset=1) # subsets bands from raster stack
#removeTmpFiles(h=0) # removes temp files, can be used after writing raster stacks to delete all temp raster files
#rasterbrick<-brick(rasterstack) #can make a raster brick from a raster stack
I've included the print(object.size()) to try to see if my objects are growing in size throughout the code's execution.
The script was writing large rasters to disk in the /tmp folder, over 20 GB each time it executed, which filled up the disk pretty quickly. I'm still concerned about why it was writing rasters to existing files rather than clearing them. The tmp directory could be cleared during the script.
This was a pretty good explanation of memory issues with the raster package. https://discuss.ropensci.org/t/how-to-avoid-space-hogging-raster-tempfiles/864
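As a rough sketch of clearing the temp directory from inside the loop: the raster package has its own removeTmpFiles(h=0) (already listed under "useful functions" in the script) and rasterOptions(tmpdir = ...) to redirect temp files. The base R helper below is a hypothetical alternative that assumes the temp rasters are .grd/.gri files sitting in a dedicated scratch directory; the directory name in the usage comment is made up for illustration.

```r
# Hypothetical helper: delete leftover raster temp files (.grd/.gri pairs)
# from a scratch directory and return the number of bytes freed.
clear_tmp_rasters <- function(tmp_dir) {
  files <- list.files(tmp_dir, pattern = "\\.(grd|gri)$", full.names = TRUE)
  freed <- sum(file.size(files))
  unlink(files)
  freed
}

# Possible usage (paths are assumptions, not from the original script):
#   raster::rasterOptions(tmpdir = "/mnt/nfs/data/raster_tmp")  # once, before the loop
#   clear_tmp_rasters("/mnt/nfs/data/raster_tmp")               # end of each iteration
```

Calling it only after writeRaster() has finished matters, since the stack may still point at temp files until the final GeoTIFF is on disk.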
I would like to include -co options to compress output raster using gdalwarp from gdalUtilities in R.
I have tried some options (commented in the code), but I have not been successful in generating the compressed raster.
gdalUtilities::gdalwarp(srcfile = paste0(source_path,"/mask_30.tif"),
dstfile = paste0(writing_path,"/mask_30_gdalwarp.tif"),
cutline = paste0(source_path,"/amazon.shp"),
crop_to_cutline = TRUE,
multi = TRUE,
wo = "NUM_THREADS = 32",
co = "COMPRESS = DEFLATE")
# co = c("COMPRESS = DEFLATE","ZLEVEL = 9"))
# co COMPRESS = DEFLATE,
# co ZLEVEL = 9),
# co = "COMPRESS = DEFLATE",
# co = ZLEVEL = 9")
Additionally, I would like to use the multithreaded warping implementation. I am including the -multi and -wo "NUM_THREADS = 16" options (my computer has 32 cores), but I have not been able to decrease the runtime compared with the default -multi option, which uses two cores by default.
Any suggestions for compression and parallelization?
Many thanks in advance.
1 - COMPRESSION
Please find the solution for the problem of file compression. To be honest, I have already been confronted with the same problem as you and, at the time, I was racking my brains... to finally find the solution which is quite simple (once we know it!): you must not put any spaces (i.e. "COMPRESS=DEFLATE" and not "COMPRESS = DEFLATE")
So, please find below a small reprex.
Reprex
library(gdalUtilities)
library(stars) # Loaded just to have a '.tif' image for the reprex
# Import a '.tif' image from the 'stars' library
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
# Write the image to disk (in your working directory)
write_stars(tif, "image.tif")
# Size of the image on disk (in bytes)
file.size("image.tif")
#> [1] 2950880
# Compress the image
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp.tif",
co = "COMPRESS=DEFLATE")
# Size of the compressed image on disk (in bytes)
file.size("image_gdalwarp.tif")
#> [1] 937920 # The image has been successfully compressed.
As @MarkAdler said, there is not much difference between the default compression level (i.e. 6) and level 9. That said, please find below how you should write the code to be able to apply the desired compression level (i.e. still without spaces and in a list):
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp_Z9.tif",
co = list("COMPRESS=DEFLATE", "ZLEVEL=9"))
file.size("image_gdalwarp_Z9.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
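Since the whole problem came down to stray spaces, one option is to build the option strings programmatically. The helper below is a hypothetical sketch (gdal_options is not part of gdalUtilities): it turns named values into "KEY=VALUE" strings and strips any whitespace.

```r
# Hypothetical helper: turn named values into GDAL "KEY=VALUE" strings,
# removing whitespace so that "COMPRESS = DEFLATE" cannot slip through.
gdal_options <- function(...) {
  opts <- list(...)
  stopifnot(length(opts) > 0, !is.null(names(opts)), all(nzchar(names(opts))))
  kv <- paste0(names(opts), "=", vapply(opts, as.character, character(1)))
  gsub("[[:space:]]+", "", kv)
}

co <- gdal_options(COMPRESS = "DEFLATE", ZLEVEL = 9)
print(co)
```

The result could then be passed as co = as.list(co), matching the list form used in the reprex above.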
2 - PARALLELIZATION
For the problem of parallelization on the cores of the processor, you should not use multi = TRUE. Only the argument wo = "NUM_THREADS=4" (always without spaces ;-)) is enough.
Just a clarification: I suspect you are confusing RAM with the number of cores. Computers are usually equipped with a 4- or 8-core processor. The 32 that you indicate in your code probably refers to the 32 GB of RAM in your computer.
Reprex
library(gdalUtilities)
library(stars)
tif <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
write_stars(tif, "image.tif")
file.size("image.tif")
#> [1] 2950880
gdalUtilities::gdalwarp(srcfile = "image.tif",
dstfile = "image_gdalwarp_Z9_parallel.tif",
co = list("COMPRESS=DEFLATE", "ZLEVEL=9"),
wo = "NUM_THREADS=4") # Replace '4' by '8' if your processor has 8 cores
file.size("image_gdalwarp_Z9_parallel.tif")
#> [1] 901542
Created on 2022-02-09 by the reprex package (v2.0.1)
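Rather than guessing the core count, it can be queried from R. This is a minimal sketch using parallel::detectCores(); the helper name and the choice to leave one core free are assumptions. GDAL's warp options also document NUM_THREADS=ALL_CPUS as an alternative to a fixed number.

```r
library(parallel)

# Hypothetical helper: build the warp option string from the detected core
# count, leaving `reserve` cores free for the rest of the system.
num_threads_option <- function(reserve = 1L) {
  n <- detectCores(logical = FALSE)   # physical cores; can be NA on some platforms
  if (is.na(n)) n <- detectCores()    # fall back to logical cores
  if (is.na(n)) n <- 1L
  paste0("NUM_THREADS=", max(1L, n - reserve))
}

wo <- num_threads_option()
print(wo)
```

The string can be passed as wo = wo in the gdalwarp call from the reprex.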
I am a new R user, and I have a question about writing out multiple files with different names. Let's say that my data has the following structure:
IV_HAR_m1<-matrix(rnorm(1:100), ncol=30, nrow = 2000)
DV_HAR_m1<-matrix(rnorm(1:100), ncol=10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At first, I was storing the iterations in one object called Dinamic_betas. This object was stored in a single file, and it saves the required information each time my code iterates.
To do this I was using stew, which belongs to the pomp package, but the whole process takes 5 or 6 days and I am worried about a power outage or a failure of my computer.
Now I want to save each environment (iteration) in its own .Rnd file. I do not know how to do that, but the code that I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)
space <- 7 #THE NUMBER OF FILES THAT I would WANT TO CREATE
Dinamic_betas<-array(NA, c(10, 31, (nrow(IV_HAR_m1)-space)))
dimnames(Dinamic_betas) <- list(NULL, NULL)
set.seed(12345)
stew( # stew saves the environment in a .Rnd file
file = "Dinamic_LASSO_RD",{ # The name required by stew for creating one file with all information
for (i in 1:dim(Dinamic_betas)[3]) {
tryCatch( # print messages
expr = {
cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space+i-1),],
DV_HAR_m1[i:(space+i-1),], alpha = 1, family = "mgaussian", thresh=1e-08, maxit=10^9)
LASSO_estimation_dinamic<- glmnet(IV_HAR_m1[i:(space+i-1),], DV_HAR_m1[i:(space+i-1),],
alpha = 1, lambda = cv_dinamic$lambda.min, family = "mgaussian")
coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
Dinamic_betas[,,i] <- t(coefs)
},
error = function(e){
message("Caught an error!")
print(e)
},
warning = function(w){
message("Caught a warning!")
print(w)
},
finally = {
message("All done, quitting.")
}
)
if (i%%400==0) {print(i)}
}
}
)
If someone can suggest another package that stores the outputs in different files, I will be grateful.
Try adding this just before the close of your loop
save.image(paste0("Results_iteration_",i,".RData"))
This should save your entire workspace to disk for every iteration. You can then use load() to load the workspace of every environment. Let me know if this works.
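If saving the whole workspace every iteration turns out to be heavy, a lighter variant is to save only each iteration's result with saveRDS() and skip iterations whose file already exists, so a rerun after a crash resumes where it stopped. The sketch below is an assumption-laden illustration: run_with_checkpoints and toy_fit are hypothetical names, and toy_fit stands in for the cv.glmnet/glmnet step.

```r
# Hypothetical helper: run `fit_one(i)` for each i, save each result to its
# own file, and skip iterations that already have a file on disk.
run_with_checkpoints <- function(indices, fit_one, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in indices) {
    path <- file.path(out_dir, sprintf("iteration_%05d.rds", i))
    if (file.exists(path)) next          # already done in an earlier run
    saveRDS(fit_one(i), path)
  }
  invisible(list.files(out_dir, pattern = "\\.rds$", full.names = TRUE))
}

# Toy stand-in for the LASSO step: returns a small matrix.
toy_fit <- function(i) matrix(i, nrow = 2, ncol = 2)

out_dir <- file.path(tempdir(), "lasso_checkpoints")
files <- run_with_checkpoints(1:5, toy_fit, out_dir)

# Reassemble the saved pieces afterwards.
results <- lapply(files, readRDS)
```

Zero-padding the iteration number keeps the files in numeric order when they are listed.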
I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large txt file "vector.txt" (800 MB) that I'm trying to read into R. I have tried various methods but failed to read this file into memory.
So, I was researching ways of reading it in in chunks. I know that the dim of the resulting data frame should be 300K * 300. If I was able to read in the file e.g. 10K lines at a time and then save each chunk as an RDS file I would be able to loop over the results and get what I need, albeit just a little slower with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vector.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving it as a data frame and then to RDS? Or are there any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
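The chunk-and-save idea from the question can also be sketched in base R by keeping a file connection open and reading a fixed number of lines per pass. This is only an illustration under assumptions: split_to_rds is a hypothetical helper, the file is assumed to be whitespace-delimited with a one-line "rows cols" header, and the demo uses a tiny made-up file instead of the real vector.txt.

```r
# Hypothetical helper: read a whitespace-delimited text file `chunk_lines`
# lines at a time and save each chunk as chunk_0001.rds, chunk_0002.rds, ...
split_to_rds <- function(path, out_dir, chunk_lines = 10000L, skip = 1L) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con <- file(path, open = "r")
  on.exit(close(con))
  if (skip > 0L) readLines(con, n = skip)   # drop the header line(s)
  k <- 0L
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    k <- k + 1L
    # quote = "" and comment.char = "" keep tokens such as "#" or "'" intact
    chunk <- read.table(text = lines, quote = "", comment.char = "",
                        stringsAsFactors = FALSE)
    saveRDS(chunk, file.path(out_dir, sprintf("chunk_%04d.rds", k)))
  }
  k   # number of chunks written
}

# Tiny made-up file in the same layout as the word2vec text export.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("5 3",
             "the 0.1 0.2 0.3",
             "of 0.4 0.5 0.6",
             "# 0.7 0.8 0.9",
             "and 1.0 1.1 1.2",
             "to 1.3 1.4 1.5"), demo_file)
out_dir <- file.path(tempdir(), "vector_chunks")
n_chunks <- split_to_rds(demo_file, out_dir, chunk_lines = 2L)
first <- readRDS(file.path(out_dir, "chunk_0001.rds"))
```

Only one chunk is held in memory at a time, so chunk_lines can be tuned down if 10,000 lines is still too much for a 1 GB machine.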
I am trying to understand why I get inconsistent results when downloading CSV files from a website archive. I don't know whether the problem is at my end, at the other end, or just failed communication in between. Any suggestions are welcome.
I am using an R script to automate the downloading of CSV files by month and year from the HYCOM archives for analysis. The script generated the following URL: 'http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=13.875&longitude=-72.25&time_start=2012-05-01T00:00:00Z&time_end=2012-05-31T21:00:00Z&vertCoord=&accept=csv'
Running download.file successfully obtains the file about half the time and otherwise fails. The images below show the failed run. The successful run is shown first.
Successful Log
#download one month of data
MM = '05'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste( as shown in image)
H2 = '-01T00:00:00Z&time_end='
#H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
H3 = 'T21:00:00Z&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
download.file(url =HtmlLink ,destfile=dest,cacheOK=FALSE, method="auto")
trying URL 'as shown in image'
Content type 'text/plain;charset=UTF-8' length unknown
..................................................
................downloaded 666 KB
user system elapsed
28.278 6.605 5201.421
LOG OF FAILED RUN
You can/should turn the following into a function accepting parameters and replace the hardcoded values with said params (I used httr:::parse_query() to make the list):
library(httr)
URL <- "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly"
params <- list(var = "salinity",
var = "water_temp",
var = "water_u",
var = "water_v",
latitude = "13.875",
longitude = "-72.25",
time_start = "2012-05-01T00:00:00Z",
time_end = "2012-05-31T21:00:00Z",
vertCoord = "",
accept = "csv")
dest_file <- "filename"
res <- GET(url=URL,
query=params,
timeout(360),
write_disk(dest_file, overwrite=TRUE),
verbose())
warn_for_status(res)
You can (eventually) remove the verbose() from that GET call, but it's helpful during debugging.
The main issue is that this server is s l o w and times out before the transfer is complete. Even the value of 360 might not be enough (you'll need to experiment).
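As a sketch of the "turn it into a function" suggestion: the first helper builds the query list from parameters, and the second wraps the GET call shown above. Both function names are hypothetical, and the wrapper assumes httr is installed; it is only defined here, not called.

```r
# Hypothetical helper: build the NCSS query list for one point and time window.
hycom_query <- function(lat, lon, time_start, time_end,
                        vars = c("salinity", "water_temp", "water_u", "water_v")) {
  c(setNames(as.list(vars), rep("var", length(vars))),   # repeated "var" entries
    list(latitude = as.character(lat),
         longitude = as.character(lon),
         time_start = time_start,
         time_end = time_end,
         vertCoord = "",
         accept = "csv"))
}

# Hypothetical wrapper around the GET call; `...` is passed to hycom_query().
download_hycom <- function(dest_file, ..., timeout_sec = 360) {
  res <- httr::GET(url = "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly",
                   query = hycom_query(...),
                   httr::timeout(timeout_sec),
                   httr::write_disk(dest_file, overwrite = TRUE))
  httr::warn_for_status(res)
  invisible(res)
}

params <- hycom_query(13.875, -72.25,
                      "2012-05-01T00:00:00Z", "2012-05-31T21:00:00Z")
str(params)
```

A call would then look like download_hycom("201205.csv", 13.875, -72.25, "2012-05-01T00:00:00Z", "2012-05-31T21:00:00Z"), with timeout_sec raised if the server is still too slow.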
Many thanks to all for the help. The suggestion by hrbrmstr appears to be an elegant answer and I look forward to testing it. However, I was unable to install a working copy using the program manager. Installation from a local download also failed, since R complained that the OS X version that I downloaded from CRAN was a Windows version, not OS X. Yes, I repeated the download several times to make sure I had the right package.
As suggested by Cyrus Mohammadian, I tried the procedures in the curl library.
Running the same URL, download.file transfers failed about 50% of the time. Using curl reduced the transfer times from 2000 seconds to 1000 seconds with no failures in 12 tries.
## calculate number of days in month
ndays <- function(d) {
last_days <- 28:31
rev(last_days[which(!is.na(
as.Date( paste( substr(d, 1, 8),
last_days, sep = ''),
'%Y-%m-%d')))])[1] }
nlat = 13.875
elon = -72.25
#download one month of data
year = 2008
MM = '01'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste('http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=',
nlat,'&longitude=', elon,'&time_start=',sep="")
H2 = '-01T00:00:00Z&time_end='
H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
curl::curl_download(url = HtmlLink, destfile = dest, quiet = FALSE, mode = "wb")
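For a server that fails intermittently, it may also help to retry a failed transfer automatically. The helper below is a hypothetical base R sketch, not part of curl; the flaky() function only simulates a download that fails twice before succeeding.

```r
# Hypothetical helper: call `attempt()` until it succeeds or `times` tries
# are used up, pausing `wait_sec` seconds between tries.
with_retries <- function(attempt, times = 3L, wait_sec = 5) {
  for (k in seq_len(times)) {
    result <- tryCatch(attempt(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("attempt ", k, " failed: ", conditionMessage(result))
    if (k < times) Sys.sleep(wait_sec)
  }
  stop("all ", times, " attempts failed")
}

# Toy stand-in for a flaky download: fails twice, then succeeds.
calls <- 0L
flaky <- function() {
  calls <<- calls + 1L
  if (calls < 3L) stop("timeout") else "ok"
}
status <- with_retries(flaky, times = 5L, wait_sec = 0)

# With the script above, the call might look like:
#   with_retries(function() curl::curl_download(HtmlLink, dest, quiet = FALSE))
```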
The following is my code. I am trying to get the list of all the files (~20000) that end with .idat and read each file using the function illuminaio::readIDAT.
library(illuminaio)
library(parallel)
library(data.table)
# number of cores to use
ncores = 8
# this gets all the files with .idat extension ~20000 files
files <- list.files(path = './',
pattern = "*.idat",
full.names = TRUE)
# function to read the idat file and create a data.table of filename, and two more columns
# write out as csv using fwrite
get.chiptype <- function(x)
{
idat <- readIDAT(x)
res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
fwrite(res, file.path = 'output.csv', append = TRUE)
}
# using mclapply call the function get.chiptype on all 20000 files.
# use 8 cores at a time
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)
After reading and writing info about 1200 files, I get the following message:
Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
all scheduled cores encountered errors in user code
How do I resolve it?
Calling mclapply() in some instances requires you to specify a random number generator that allows for multiple streams of random numbers.
R has included an implementation of Pierre L'Ecuyer's multiple-stream pseudo-random number generator since version 2.14.0.
Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':
set.seed( my.seed, kind = "L'Ecuyer-CMRG" );
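Separately from the RNG setting, the warning itself only says that the worker function threw errors, not what they were. One way to find out is to catch the error inside the worker and return its message. This is a sketch under assumptions: safely and toy_reader are hypothetical names, and toy_reader stands in for get.chiptype. It may also be worth checking whether several workers appending to the same output.csv at once is part of the problem; collecting the results in the parent and writing once avoids that.

```r
library(parallel)

# Hypothetical wrapper: return a record with either the value or the error
# message, so a failing input no longer aborts the whole worker.
safely <- function(f) {
  function(x) {
    tryCatch(list(input = x, ok = TRUE, value = f(x), error = NA_character_),
             error = function(e) list(input = x, ok = FALSE, value = NULL,
                                      error = conditionMessage(e)))
  }
}

# Toy stand-in for the file reader: fails on every fourth input.
toy_reader <- function(x) {
  if (x %% 4 == 0) stop("cannot read item ", x)
  x^2
}

# mclapply() cannot fork on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
out <- mclapply(1:8, safely(toy_reader), mc.cores = n_cores)

# Inspect the failures in the parent process.
failed <- Filter(function(r) !r$ok, out)
errors <- vapply(failed, function(r) r$error, character(1))
print(errors)
```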