Dealing with big datasets in R

I'm running into a memory problem in R: the "cannot allocate vector of size XX Gb" error. I have a set of daily files (12784 days) in NetCDF format giving sea surface temperature on a 1305x378 (longitude-latitude) grid. That is 493290 points per day, dropping to about 245000 once NAs (land points) are removed.
My final objective is to build a time series for each of the 245000 points from the daily files and find the temporal trend for each point. My idea was to build a big data frame with one point per row and one day per column (245000x12784) so I could apply the trend calculation to every point. But, as expected, the memory problem appeared while building such a data frame.
First I tried a script I had previously used to read the data and extract a three-column (lon-lat-sst) data frame by reading each nc file and then melting the data. This led to excessive computing time even for a small set of days, and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem, but the CSV output files were too big and the process was very time consuming.
Another strategy I have tried, so far without success, is to sequentially read all the nc files, extract the daily values for each point and find its trend, so I would only need to keep a single 245000-point data frame. But I think this would be time consuming and not the proper R way.
I have been reading about the bigmemory and ff packages, trying to declare a big.matrix or a 3D array (1305 x 378 x 12784), but have had no success so far.
What would be the appropriate strategy to tackle the problem?
Extract single-point time series to calculate individual trends and populate a smaller data frame
Subset the daily files into slices to avoid the memory problem, but end up with a lot of data frames/files
Try to solve the memory problem with the bigmemory or ff packages
Thanks in advance for your help
EDIT 1
Added code to fill the matrix:
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)

# paths
ruta_datos <- "/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball <- "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)

# Read lon, lat and SST from one NetCDF file
sst_data_full <- function(inputfile) {
  sstFile <- nc_open(inputfile)
  sst_read <- list()
  sst_read$lon <- ncvar_get(sstFile, "lon")
  sst_read$lats <- ncvar_get(sstFile, "lat")
  sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
  nc_close(sstFile)
  sst_read
}

# Melt the SST matrix into a long (lon, lat, sst) data frame
melt_sst <- function(L) {
  dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
  melt(L$sst, value.name = "sst")
}

# One-month file list: this ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")

sst.out <- data.frame()
for (i in 1:length(files)) {
  sst <- sst_data_full(paste0(ruta_datos, files[i]))
  msst <- melt_sst(sst)
  msst <- subset(msst, !is.na(msst$sst))
  if (i == 1) {
    sst.out <- msst
  } else {
    sst.out <- cbind(sst.out, msst$sst)
  }
}
EDIT 2
Code used on a previous (smaller) data frame to calculate the temporal trend. The original data was a matrix of time series, with each column being one series.
library(forecast)
data <- read.csv(....)                  # 'fecha' (date) plus one column per series

output <- numeric(length(data) - 1)     # one trend coefficient per series
for (i in 2:length(data)) {
  valor <- data[, i]
  datos2 <- data.frame(fecha = data$fecha, valor = valor)
  datos.ts <- ts(datos2$valor, frequency = 365)
  datos.stl <- stl(datos.ts, s.window = 365)     # seasonal decomposition (kept for inspection)
  datos.tslm <- tslm(datos.ts ~ trend)
  output[i - 1] <- datos.tslm$coefficients[2]    # slope of the trend term
}
fecha is the date variable name.
EDIT 3
Working code from F. Privé's answer:
library(bigstatsr)

tmp <- sst_data_full(paste0(ruta_datos, files[1]))
mat <- FBM(length(tmp$sst), length(files),
           backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL")
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(paste0(ruta_datos, files[i]))$sst
}
With this code, a big matrix was created:
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290

So, to read your data into a Filebacked Big Matrix (FBM), you can do
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
                    full.names = TRUE)
tmp <- sst_data_full(files[1])

library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(files[i])$sst
}
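From there, one possible next step for the original goal (a trend per point) is sketched below. This is not part of the answer above: it simply reads the FBM one block of rows at a time, so memory stays bounded, and computes the ordinary least-squares slope of SST against the day index for each grid point. The block size and object names are arbitrary choices, and all-NA rows (land points) just end up with NA slopes.
n_days <- ncol(mat)
t_c <- seq_len(n_days) - (n_days + 1) / 2     # centered day index, sum(t_c) == 0
denom <- sum(t_c^2)

slopes <- numeric(nrow(mat))
block <- 5000                                 # rows per block, tune to available RAM
for (s in seq(1, nrow(mat), by = block)) {
  ind <- s:min(s + block - 1, nrow(mat))
  X <- mat[ind, ]                             # read one block of rows into memory
  slopes[ind] <- (X %*% t_c) / denom          # OLS slope of sst ~ day index, per point
                                              # (centering y is unneeded since sum(t_c) == 0;
                                              #  rows with NAs, i.e. land points, give NA)
}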

Related

Converting thousands of PRISM .bil data files to .csv

Any help would be greatly appreciated, as this has been a problem I've been trying to solve for quite some time.
I have 13877 days' worth of climate data from PRISM: daily precipitation data from 01/01/1981 to 12/31/2018. The data I need is "climate data: precipitation," with "temporal period: daily data." From there, the data has to be downloaded manually by year as *.bil. The data is floating point, with each .bil file being a raster of dimension 621 x 1405.
I would like to convert and consolidate all these raster files into a single *.csv file using R (or Python). This post seemed close, but I could not get the suggested code to work properly.
With such a large amount of data, I am looking for an efficient solution if possible. My most recent attempt was as follows:
#After setting working directory to the folder with all .bil files
library(prism)
library(raster)
library(sp)
library(dplyr)   # for %>%
library(tidyr)   # for gather()

type = "ppt"
ls_prism_data(name = TRUE)   # lists PRISM files

j = 1
k = 18
final_df = NULL
filepre = paste("PRISM_", type, "_stable_4kmD2_", sep = "")
for (i in 1:771) {
  RS1 <- prism_stack(ls_prism_data()[j:k, 1])
  l = 18
  if (i == 771) {
    l = 17   # 770*18 + 17 = 13877
  }
  j = j + l
  k = k + l
  pr_df <- data.frame(rasterToPoints(RS1))
  long_pr_df <- pr_df %>% gather(Date, type, -x, -y)
  long_pr_df$Date <- gsub(filepre, "", long_pr_df$Date)
  long_pr_df$Date <- gsub("_bil", "", long_pr_df$Date)
  final_df <- rbind(final_df, long_pr_df)
}
My thinking was to convert the data in "batches." Unfortunately, this results in a fatal error that causes R to abort.
Ultimately I would like to get the information into a single file, organized by year and by point, with a column denoting precipitation, so I can repeat this for other climate variables.
I don't need a correction to the above code per se, I'm just looking for a means to an end. Thank you in advance for the help!
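One possible way to do this, just as a sketch: it assumes the .bil files follow the standard PRISM daily naming (PRISM_ppt_stable_4kmD2_YYYYMMDD_bil.bil) and sit in the working directory. The idea is to process one year at a time and append each year's long-format table to a single CSV, so only one year of rasters is ever held in memory.
library(raster)
library(reshape2)

years <- 1981:2018
outfile <- "prism_ppt_daily.csv"
if (file.exists(outfile)) file.remove(outfile)

for (yr in years) {
  files <- list.files(pattern = sprintf("PRISM_ppt_stable_4kmD2_%d.*_bil\\.bil$", yr))
  s <- stack(files)                           # one RasterStack per year (<= 366 layers)
  pr_df <- data.frame(rasterToPoints(s))      # columns: x, y, then one column per day
  long <- melt(pr_df, id.vars = c("x", "y"),
               variable.name = "Date", value.name = "ppt")
  long$Date <- gsub("PRISM_ppt_stable_4kmD2_|_bil", "", long$Date)
  write.table(long, outfile, sep = ",", row.names = FALSE,
              append = file.exists(outfile), col.names = !file.exists(outfile))
}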

How to efficiently parallelize EXTRACT function in raster package R

Given a netcdf file, I am trying to extract all pixels to form a data.frame for later export to .csv
a=brick(mew.nc)
#get coordinates
coord<-xyFromCell(a,1:ncell(a))
I can extract data for all pixels using extract(a,1:ncell(a)). However, I run into memory issues.
Upon reading through various help pages, I found that one can speed up things with:
beginCluster(n=30)
b=extract(a, coord)
endCluster()
But I still run out of memory. Our supercomputer has more than 1000 nodes, each node has 32 cores.
My actual rasterbrick has 400,000 layers
I am not sure how to parallelize this task without running into memory issues.
Thank you for all your suggestions.
Sample data of ~8MB can be found here
You can do something along these lines to avoid memory problems
library(raster)
b <- brick(system.file("external/rlogo.grd", package="raster"))
outfile <- 'out.csv'
if (file.exists(outfile)) file.remove(outfile)
tr <- blockSize(b)
b <- readStart(b)
for (i in 1:tr$n) {
  v <- getValues(b, row = tr$row[i], nrows = tr$nrows[i])
  write.table(v, outfile, sep = ",", row.names = FALSE, append = TRUE,
              col.names = !file.exists(outfile))
}
b <- readStop(b)
To parallelize, you could do this by layer, or groups of layers; and probably all values in one step for each subset of layers. Here for one layer at a time:
f <- function(d) {
  filename <- extension(paste(names(d), collapse = '-'), '.csv')
  x <- values(d)
  x <- matrix(x)            # these two lines only needed when using
  colnames(x) <- names(d)   # a single layer
  write.csv(x, filename, row.names = FALSE)
}
# parallelize this:
for (i in 1:nlayers(b)) {
  f(b[[i]])
}
or
x <- sapply(1:nlayers(b), function(i) f(b[[i]]))
You should not be using extract. The question I have is what you would want such a large csv file for.
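As a rough illustration of the "parallelize this" comment above (a sketch only: the core count is arbitrary and it assumes each worker can read the file that b points to, e.g. on a shared filesystem), the per-layer function f could be run with the parallel package:
library(parallel)

ncores <- 8                                 # adjust to the cores available on one node
cl <- makeCluster(ncores)
clusterEvalQ(cl, library(raster))           # workers need raster for values()/extension()
clusterExport(cl, c("f", "b"))              # ship the function and the file-backed brick
parLapply(cl, 1:nlayers(b), function(i) f(b[[i]]))
stopCluster(cl)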

Export accuracy of multiple timeseries forecasts in r into csv-document

I am using the fpp package to forecast multiple time series of different customers at the same time. I am already able to extract the point forecasts of different easy forecast methods (snaive, meanf, etc.) into a csv document. However, I am still trying to figure out how to extract the measures of the accuracy() command of every time series into a csv file at the same time.
I constructed an example:
# loading of the "fpp"-package into R
install.packages("fpp")
require("fpp")
# Example customers
customer1 <- c(0,3,1,3,0,5,1,4,8,9,1,0,1,2,6,0)
customer2 <- c(1,3,0,1,7,8,2,0,1,3,6,8,2,5,0,0)
customer3 <- c(1,6,9,9,3,1,5,0,5,2,0,3,2,6,4,2)
customer4 <- c(1,4,8,0,3,5,2,3,0,0,0,0,3,2,4,5)
customer5 <- c(0,0,0,0,4,9,0,1,3,0,0,2,0,0,1,3)
#constructing the timeseries
all <- ts(data.frame(customer1,customer2,customer3,customer4,customer5),
f=12, start=2015)
train <- window(all, start=2015, end=2016-0.01)
test <- window(all, start=2016)
CustomerQuantity <- ncol(train)
# Example of extracting easy forecast method into csv-document
horizon <- 4
fc_snaive <- matrix(NA, nrow=horizon, ncol=CustomerQuantity)
for (i in 1:CustomerQuantity) {
  fc_snaive[, i] <- snaive(train[, i], h = horizon)$mean
}
write.csv2(fc_snaive, file ="fc_snaive.csv")
The following part is exactly where I need some help: I would like to extract the accuracy measures into a csv file all at the same time. In my real dataset I have 4000 customers, not only 5! I tried to use loops and lapply(), but unfortunately my code didn't work.
accuracy(fc_snaive[,1], test[,1])
accuracy(fc_snaive[,2], test[,2])
accuracy(fc_snaive[,3], test[,3])
accuracy(fc_snaive[,4], test[,4])
accuracy(fc_snaive[,5], test[,5])
The following uses lapply to run accuracy() on each column of fc_snaive (from 1 to the number of columns) against the corresponding column of test.
Then, with do.call, we bind the results by row (rbind), so we end up with a matrix that we can, in turn, export using write.csv.
new_matrix <- do.call(what = rbind,
                      args = lapply(1:ncol(fc_snaive), function(x) {
                        accuracy(fc_snaive[, x], test[, x])
                      }))
write.csv(x = new_matrix,
          file = "a_filename.csv")

r double loop too slow with large data

I need to read hundreds of .bil files (reproducible example):
d19810101 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(1,10),value=c(11:20))
d19810102 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(2,10),value=c(12:21))
d19820101 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(1,10),value=c(13:22))
d19820102 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(2,10),value=c(14:23))
The code I wrote works OK when testing on a small number of files, but when I tried to run it on the entire set of files it became extremely slow; please let me know if there is any way I can improve it. What I need to do is simply get the average of 33 years of daily data. Here is the code for testing on a small number of files:
years <- c(1981:1982)
days <- substr(as.numeric(format(seq(as.Date("1981/1/1"), as.Date("1981/1/2"), "day"), '%Y%m%d')), 5, 8)
X_Y <- NULL
for (j in days) {
  for (i in years) {
    XYi <- read.table(paste(i, substr(j, 1, 2), substr(j, 3, 4), ".csv", sep = ''),
                      header = T, sep = ",", stringsAsFactors = F)
    X_Y <- rbind(X_Y, XYi)
    cat(paste("Data in ", i, j, " are processing now.", sep = ""), "\n")
  }
  library(plyr)
  X_Y1 <- ddply(X_Y, .(ID, month, day), summarize, mean(value, na.rm = T))
  cat(paste("Data in ", i, j, " are processing now.", sep = ""), "\n")
}
EDIT:
Thank you for all your help! I tried putting the files in a list to read them, but since they are .bil files whose raster characteristics need to be extracted, I got an error; that's why I need to read them one by one. Sorry for not making that clear earlier.
Read.files <- function(file.names, sep = ",") {
  library(raster)
  ldply(file.names, function(fn) data.frame(Filename = fn, layer <- raster(fn, sep = ",")))
}
data1 <- Read.files(paste("filenames here", days, ".bil", sep = ''), sep = ",")
"Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class 'structure("RasterLayer", package = "raster")' into a data.frame.
EDIT 2:
The structure of my data is actually the same as the example data, except that mine is grid data that needs to be extracted (using raster functions instead of read.csv) and then put into a data frame, so I need to do the following steps:
for (i in days) {
  layer <- raster(paste("filename here", i, ".bil", sep = ''))
  projection <- projection(layer)
  cellsize <- res(layer)[1]
  ...
  s <- resample(layer, r, method = 'ngb')
  XY <- data.frame(rasterToPoints(s))
  names(XY) <- c('Long', 'Lat', 'Data')
}
It's hard to tell exactly how you are managing file IO, but I think an easier way to achieve this would be to read the files in, put them into one data.frame (e.g. using rbind()), and then get the summary statistics you need via tapply():
data <- do.call(rbind, mget(ls(pattern = "d[0-9]*"))) # combine data
with(data, tapply(value, list(month, day), mean)) # get mean for each month and day combination
This assumes you have already read in all of the files, to objects named as in your example.
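For the raster version described in EDIT 2, a sketch of the same idea (with a placeholder file-name prefix, and assuming one .bil per day named prefix + YYYYMMDD in the working directory) would be to stack each calendar day across all years and take the cell-wise mean, instead of row-binding large data frames:
library(raster)

days  <- format(seq(as.Date("1981-01-01"), as.Date("1981-12-31"), "day"), "%m%d")
years <- 1981:2013

day_means <- lapply(days, function(d) {
  files <- paste0("filename_here_", years, d, ".bil")   # placeholder prefix
  files <- files[file.exists(files)]                    # drop any missing days
  calc(stack(files), mean, na.rm = TRUE)                # cell-wise mean across the years
})
names(day_means) <- days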

Merging multiple rasters in R

I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I got so far (object 'ast14dmo' representing a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
  if (i == 1) {
    # Merge current with subsequent scene
    ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
  } else if (i > 1 && i < length(ast14dmo.sd)) {
    tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
    ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
  } else {
    # Save merged image
    writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""),
                format = "GTiff", overwrite = TRUE)
  }
}
As you surely guess, the code works. However, merging takes quite long, considering that each single raster object is about 70 MB. I also tried Reduce and do.call, but that failed since I couldn't pass the argument 'tolerance', which works around the different origins of the raster files.
Anybody got an idea of how to speed things up?
You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)
The 'merge' function from the raster package is a little slow. For large projects a faster option is to work with GDAL commands in R.
library(gdalUtils)
library(rgdal)
Build list of all raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this as a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, file="MyBigNastyRasty.tif", format="GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile=all_my_rasts,dst_dataset="MyBigNastyRasty.tif",of="GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first.
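For reference, a sketch of that VRT route with gdalUtils (the calls mirror the gdalbuildvrt and gdal_translate utilities; check the argument names against your installed gdalUtils version, and the output file names are placeholders): build a lightweight virtual mosaic first, then translate it to a single GeoTIFF.
library(gdalUtils)

gdalbuildvrt(all_my_rasts, "mosaic.vrt")                        # virtual mosaic, very fast
gdal_translate("mosaic.vrt", "MyBigMosaic.tif", of = "GTiff")   # materialize as one GeoTIFF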
You can use Reduce, for example like this:
Reduce(function(...) merge(..., tolerance = 1), ast14dmo.sd)
SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default! You only have to translate all rasters/images to SAGA .sgrd format first, then run the command line saga_cmd.
I have tested the solution using gdalUtils as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking the documentation of the mosaic_rasters function here, I found that it works without making a template raster before mosaicking the images. I pasted the example code from the documentation below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if (require(raster) && require(rgdal) && valid_install) {
  layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package = "gdalUtils")
  layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package = "gdalUtils")
  mosaic_rasters(gdalfile = c(layer1, layer2), dst_dataset = file.path(outdir, "test_mosaic.envi"),
                 separate = TRUE, of = "ENVI", verbose = TRUE)
  gdalinfo("test_mosaic.envi")
}
I was faced with this same problem and I used
#Read desired files into R
data_name1<-'file_name1.tif'
r1=raster(data_name1)
data_name2<-'file_name2.tif'
r2=raster(data_name2)
#Merge files
new_data <- raster::merge(r1, r2)
Although it did not produce a new merged raster file, the result was stored in the environment and produced a merged map when plotted.
I ran into the following problem when trying to mosaic several rasters on top of each other
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As #Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
            filename = file.path("Mosaic_raster.tif"),
            format = "GTiff",
            overwrite = TRUE)
