Reading multiple ERA5 netCDF files

I have a collection of ERA5 netCDF files containing hourly air temperature data spanning roughly 40 years over the tropical East Pacific. In a Jupyter notebook, I want to run a bandpass filter on the merged dataset, but I keep running into memory-allocation errors. I read the files with xarray.open_mfdataset(list_of_files), but when I try to load the dataset I get the error:
Unable to allocate X GiB for an array with shape (d1, d2, d3, d4) and data type float32
Are there workarounds or best practices for manipulating large datasets like this in Jupyter?
The full code for the bandpass filter is below. I want to apply a band-pass filter to a large domain over the East Pacific covering about 40 years of ERA5 data:
import os
import glob
import numpy as np
import xarray as xr
import xrft

# Grab dataset
var = 't'
files = glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc'))
files.sort()

# Read files into a dask array
ds = xr.open_mfdataset(files)

# Limit study region
lon_min = -140
lon_max = -80
lat_min = -10
lat_max = 10
ds = ds.sel(latitude=slice(lat_max, lat_min), longitude=slice(lon_min, lon_max))

# Now, load the data from the original dask array
da_T = ds.T.load()

# High-pass filter (remove signal on the seasonal and longer timescales)
freq_threshold = (1/90) * (1/24) * (1/3600)  # 90-day frequency threshold

def high_pass_filter(da, dim, thres):
    ft = xrft.fft(da, dim=dim, true_phase=True, true_amplitude=True)
    ft_new = ft.where(ft.freq_time > thres, other=0)
    ft.close()
    da_new = xrft.ifft(ft_new, dim='freq_time', true_phase=True, true_amplitude=True)
    da_new = da_new + np.tile(da.mean('time'), (da_T.time.shape[0], 1, 1, 1))
    ft_new.close()
    return da_new.real

da_new = high_pass_filter(da_T, 'time', freq_threshold)

# Save filtered dataset
da_new.real.to_netcdf(os.path.join(outdir, 'era5_T.nc'))

When you do this
# Now, load the data from the original dask array
da_T = ds.T.load()
you are loading all of the data in the "T" variable of your dataset into memory at once. Presumably this variable is larger than the amount of RAM available on your system.
There is also a subtlety with that line: da.load() loads the data in place and returns the modified object, so ds.T.load() on its own would have been sufficient. Alternatively, da_T = ds.T.compute() returns a new, loaded DataArray.
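For example, a tiny synthetic illustration of the difference (made-up data, not your ERA5 files):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10.0), dims='time').chunk({'time': 5})
da_loaded = da.load()      # load() fills the array in place and returns the same object
print(da_loaded is da)     # True

da2 = xr.DataArray(np.arange(10.0), dims='time').chunk({'time': 5})
da2_eager = da2.compute()  # compute() returns a new, in-memory copy
print(da2_eager is da2)    # False: da2 itself is left as a dask-backed array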
You could instead use dask to perform your analysis chunk by chunk. Make sure your ds object contains chunked dask arrays before you load/compute anything.
Then express your analysis with xarray methods and the xrft package, as you are already doing, but only call .compute() (or write the result to disk) at the very end. That way the work is carried out one chunk at a time.
I suggest reading about how to use dask and xarray together in order to get this to run smoothly; a rough sketch of the idea is below.
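Here is a minimal, untested sketch of that lazy workflow. It assumes the temperature variable is named 't' in your files, the chunk sizes are purely illustrative, and parent_dir / outdir are the same variables as in your snippet. It also keeps both positive and negative frequencies above the cutoff (via abs()), which differs slightly from the one-sided where() in your original filter:

import os
import glob
import xarray as xr
import xrft

var = 't'
files = sorted(glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc')))

# Chunk in space but keep time in a single chunk: the FFT dimension must not be
# split across chunks. The spatial chunk sizes here are only illustrative.
ds = xr.open_mfdataset(files, chunks={'latitude': 10, 'longitude': 10})
ds = ds.sel(latitude=slice(10, -10), longitude=slice(-140, -80))
da = ds[var].chunk({'time': -1})  # adjust the variable name to match your files

freq_threshold = (1/90) * (1/24) * (1/3600)  # 90-day frequency threshold, as in the original code

# Everything below is lazy: no data is read yet.
ft = xrft.fft(da, dim='time', true_phase=True, true_amplitude=True)
ft_high = ft.where(abs(ft.freq_time) > freq_threshold, other=0)
da_high = xrft.ifft(ft_high, dim='freq_time', true_phase=True,
                    true_amplitude=True).real
da_high = da_high + da.mean('time')  # xarray broadcasting replaces np.tile

# Triggers the computation; dask works through the data chunk by chunk and writes the result.
da_high.to_netcdf(os.path.join(outdir, 'era5_T_highpass.nc'))

Because the time dimension stays whole, each chunk still holds the full 40-year time series for its grid points, so you may need fairly small spatial chunks to keep individual chunks within your available memory.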

Related

xarray chunk dataset PerformanceWarning: Slicing with an out-of-order index is generating more chunks

I am trying to run a simple calculation based on two big gridded datasets in xarray (around 5 GB altogether, daily data from 1850-2100). I keep running out of memory when I try it this way
import xarray as xr

def readin(model):
    observed = xr.open_dataset(var_obs)
    model_sim = xr.open_dataset(var_sim)
    observed = observed.sel(time=slice('1989', '2010'))
    model_hist = model_sim.sel(time=slice('1989', '2010'))
    model_COR = model_sim
    return observed, model_hist, model_COR

def method(model):
    clim_obs = observed.groupby('time.day').mean(dim='time')
    clim_hist = model_hist.groupby('time.day').mean(dim='time')
    diff_scaling = clim_hist - clim_obs
    bc = model_COR.groupby('time.day') - diff_scaling
    bc[var] = bc[var].where(bc[var] > 0, 0)
    bc = bc.reset_coords('day', drop=True)

observed, model_hist, model_COR = readin('model')
method('model')
I tried to chunk the (full) model_COR to split up the memory:
model_COR.chunk(chunks={'lat': 20, 'lon': 20})
or across the time dimension:
model_COR.chunk(chunks={'time': 8030})
but no matter what I tried, it resulted in
PerformanceWarning: Slicing with an out-of-order index is generating xxx times more chunks
which doesn't exactly sound like the outcome I want. Where am I going wrong here? Happy about any help!

Custom spatial processing function consumes a lot of memory as code runs in R

I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will
Subset the rasters to potatoes and tomatoes
Change the raster subsets to polygons, one for potatoes and one for tomatoes
Create centroids from each polygon
Create a SpatialPointsDataFrame based on the centroids
Extract the area measurement for each area of interest with SpatialPointsDataFrame
Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)

dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]

# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use the list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
  r1 = raster(ncol = 1000, nrow = 1000)
  r1[] = sample(1:60, 1000000, replace = T)
  proj4string(r1) = CRS(ae_pr)
  return(r1)
}
rlis <- lapply(1:5, rmk1)

# Pixel values needed
ptto <- c(43, 54)

# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
  # Next line is not necessary with the raster list created above.
  # temras = raster(x)
  now = format(Sys.time(), "%b%d%H%M%S")
  nwnm = paste0(names(x), now)
  rasmatx = match(x = x, table = ptto)
  writeRaster(rasmatx, file.path(dat_dir2, paste0(nwnm, "ras")), format = "GTiff")
  tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4 }, dissolve = T)
  tempol2 = disaggregate(tempol)
  # for potatoes
  tempol2p = tempol2[tempol2$layer == '1', ]
  if (nrow(tempol2p) > 0) {
    temcenp = gCentroid(tempol2p, byid = T)
    temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp), temcenp))
    temcenpdf$pot_p = extract(rasmatx, temcenpdf)
    temcenpdf$areap_m = gArea(tempol2p, byid = T)
    # writeOGR(temcenpdf, dsn = file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
  }
  # for tomatoes
  tempol2t = tempol2[tempol2$layer == '2', ]
  if (nrow(tempol2t) > 0) {
    temcent = gCentroid(tempol2t, byid = T)
    temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent), temcent))
    temcentdf$tom_t = extract(rasmatx, temcentdf)
    temcentdf$areat_m = gArea(tempol2t, byid = T)
    writeOGR(temcentdf, dsn = file.path(dat_dir2), paste0(nwnm, "t"), driver = "ESRI Shapefile")
  }
}

lapply(rlis, pottom)
I know I should provide some toy data and I created some, but I don't know if they exactly recreate my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem: a lot of memory is used when this code runs. Processing the tiflist only gets through the first 4 files before the RAM on my laptop, 16 GB, is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster is staying open after being connected to in the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that, there was no output at all from the function after 10 minutes and by then, I've usually got the first 3 rasters processed.
27/Oct EDIT after comments from RobertHijmans. It seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally to work with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow. I was unaware of this issue. I waited 30 minutes before killing the process with no result in one of my attempts. It works on the conditions I require for rasters for Alabama and Arkansas for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know if it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really figure out the output. And it hung with the process too.

Efficient use of raster functions in R

I have 500+ points in a SpatialPointsDataFrame object; I have a 1.7GB (200,000 rows x 200,000 cols) raster object. I want to have a tabulation of the values of the raster cells within a buffer around each of the 500+ points.
I have managed to achieve that with the code below (I got a lot of inspiration from here). However, it is slow to run and I would like to make it run faster. It actually runs OK for buffers with "small" widths, say 5 km or even 15 km (~1 million cells), but it becomes super slow when the buffer increases to, say, 100 km (~42 million cells).
I could easily improve on the loop below by using something from the apply family and/or a parallel loop, but my suspicion is that it is slow because the raster package writes 400 MB+ temporary files on each iteration of the loop.
# packages
library(rgeos)
library(raster)
library(rgdal)

myPoints = readOGR(points_path, 'myLayer')
myRaster = raster(raster_path)

myFunction = function(polygon_obj, raster_obj) {
  # this function returns a tabulation of the values of raster cells
  # inside a polygon (buffer)

  # crop to extent of polygon
  clip1 = crop(raster_obj, extent(polygon_obj))
  # crops to polygon edge & converts to raster
  clip2 = rasterize(polygon_obj, clip1, mask = TRUE)
  # much faster than extract
  ext = getValues(clip2)
  # tabulates the values of the raster in the polygon
  tab = table(ext)
  return(tab)
}

# loop over the points
ids = unique(myPoints$ID)
for (id in ids) {
  # select point
  myPoint = myPoints[myPoints$ID == id, ]
  # create buffer
  myPolygon = gBuffer(spgeom = myPoint, byid = FALSE, width = myWidth)
  # extract the data I want (projections, etc. are fine)
  tab = myFunction(myPolygon, myRaster)
  # do stuff with tab ...
}
My questions:
Am I right to partially blame the writing operations? If I managed to avoid all those writing operations, would this code run faster? I have access to a machine with 32GB of RAM -- so I guess it is safe to assume I could load the raster to the memory and need not to write temporary files?
What else could I do to improve efficiency in this code?
I think you should approach it like this
library(raster)
library(rgdal)
myPoints <- readOGR(points_path, 'myLayer')
myRaster <- raster(raster_path)
e <- extract(myRaster, myPoints, buffer=myWidth)
And then something like
etab <- sapply(e, table)
It is hard to answer your question #1, as we do not know enough about your data (we do not know how many cells are covered by a "100 km" buffer), but you can set options for when to write to file with the rasterOptions function. You note that getValues is faster than extract, based on the post you link to, but I think that is wrong, or at least not very important. The combination of crop, rasterize and getValues should have similar performance to extract (which does almost exactly that under the hood). If you go this route anyway, you should pass an empty RasterLayer, created with raster(myRaster), for faster cropping.

Opening and reading multiple netCDF files with RNetCDF

Using R, I am trying to open all the netCDF files I have in a single folder (e.g. 20 files), read a single variable, and create a single data.frame combining the values from all files. I have been using RNetCDF to read the netCDF files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timesteps. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following #mdsummer's comment, I have tried a loop instead and have managed to do everything I needed:
# Declare data frame
df = NULL

# Open all files
files = list.files('MY_FOLDER/', pattern = '*.nc', full.names = TRUE)

# Loop over files
for (i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and get the length of the varying dimension
  # (here, the 3rd dimension, specifically time)
  lw = var.get.nc(nc, 'LWdown')
  x = dim(lw)
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc, 'LWdown', start = c(414, 315, 1), count = c(1, 1, x[3]))
  # Add the values from each file to a single data.frame
  df = rbind(df, data.frame(lw))
}
There may be a more elegant way but it works.
You're passing the additional function arguments incorrectly; you should use ... for that. Here's a simple example of how to pass na.rm to mean:
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
edit
For your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
I think this is much easier to do with CDO as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. This would be an example bash script:
# I don't know how your time axis is set up;
# you may need to use a date with a time stamp too if your data is not e.g. daily.
# See the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
    # select the nearest grid point and the date slice desired;
    # ${file%???} strips the .nc from the file name
    cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Then use an R script to read in the resulting files.
It is possible to merge all the new files with CDO, but you would need to be careful if the time stamps are the same. You could try cdo merge or cdo cat; that way you can read a single file into R, rather than having to loop over and open each file separately.

How would you speed up this code? Reducing the resolution of data from a netcdf and then converting it to xyz format for stats

I'm taking a netCDF of maize yields and area harvested, coarsening the resolution from 2.5 arc-minutes to 0.5 degrees, and then converting the whole thing to XYZ format so that I can make it "talk" more easily to data that I've got in that format. (I suppose that I could turn my other data into matrix form, but I like XYZ.)
The data is here.
The code below defines a few functions to calculate total production from area harvested and average yields, makes some "feeder" data to use when querying the netCDFs, and then uses plyr to loop through the feeder, extract the data, apply the functions, and save in XYZ format. It works, but it takes about 30 minutes to run just one of these files, and I've got more than 100. Any suggestions for ways to optimize this code would be great. Would it be faster to extract bigger chunks of data and apply functions to them, like maybe a whole stripe of the earth? How would I know a priori whether this would be faster or not?
rm(list = ls())
library(ncdf)
library(reshape)
library(plyr)
library(sp)
library(maps)
library(rgeos)
library(maptools)
library(rworldmap)

getha = function(lat, size = lat[1] - lat[2]) {
  lat1 = (lat - size/2) * pi/180
  lat2 = (lat + size/2) * pi/180
  lon1 = (0 - size/2) * pi/180  # lon doesn't come in because all longitudes are great circles
  lon2 = (0 + size/2) * pi/180
  6371^2 * abs(sin(lat1) - sin(lat2)) * abs(lon1 - lon2) * 100  # 6371 is the radius of the earth and 100 is the number of ha in a km^2
}

gethamat = function(mat, latvec, blocksize = 6) {
  a = getha(latvec)
  areamat = matrix(rep(a, blocksize), blocksize)
  area = t(mat) * areamat  # The matrix is transposed because the dimensions of Ramankutty's netcdfs are switched
  area
}

getprod = function(yieldblock, areablock, latvec) {
  b = gethamat(areablock, latvec)
  sum(t(yieldblock) * b, na.rm = TRUE)
}

lat = as.matrix(seq(from = 89.75, to = -89.75, by = -.5))
lon = as.matrix(seq(from = -179.75, to = 179.75, by = .5))
lon = seq.int(from = 1, to = 4320, by = 6)
lat = seq.int(from = 1, to = 2160, by = 6)
lat = rep(lat, 720)
lon = t(matrix(lon, 720, 360))
lon = as.data.frame(lon)
l = reshape(lon, direction = "long", varying = list(colnames(lon)), v.names = "V")
coords = data.frame(cbind(l[, 2], lat))
colnames(coords) = c("lng", "lat")
feeder = coords
head(feeder)

maize.nc = open.ncdf('maize_5min.nc')

getcrops = function(feed, netcdf, var = "cropdata") {
  yieldblock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 2, 1), count = c(6, 6, 1, 1))
  areablock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 1, 1), count = c(6, 6, 1, 1))
  lat = get.var.ncdf(netcdf, varid = "latitude", start = feed[2], count = 6)
  prod = getprod(yieldblock, areablock, lat)
  lon = get.var.ncdf(netcdf, varid = "longitude", start = feed[1], count = 6)
  # print(c(mean(lat), mean(lon)))
  data.frame(lat = mean(lat), lon = mean(lon), prod = prod)
}

out = adply(as.matrix(feeder), 1, getcrops, netcdf = maize.nc, .parallel = FALSE)
Thanks in advance.
plyr functions are notoriously slow when the number of chunks grows large. I would really recommend keeping the data in a multi-dimensional array. That lets you use apply to, for example, get the mean for all lat-lon combinations. A multi-dimensional array also takes less RAM, as the metadata is not stored directly as columns but implicitly within the dimensions of the array. In addition, apply is often much faster than plyr. The ncdf package natively reads the data into multi-dimensional arrays, so this also saves you a processing step (e.g. using melt).
After reducing the dataset, I would then use melt to go to what you call XYZ format for plotting; by then the dataset is so small that it does not really matter.
