How would you speed up this code? Reducing the resolution of data from a netCDF and then converting it to XYZ format for stats - r

I'm taking a netCDF of maize yields and area harvested, shrinking the resolution from 5 arc-minutes to 0.5 degrees, and then converting the whole thing to XYZ format so that I can make it "talk" more easily to data that I've got in that format. (I suppose that I could turn my other data into matrix form, but I like XYZ.)
The data is here.
The code below defines a few functions to calculate total production from area harvested and average yields, then builds some "feeder" data to use when querying the netCDF, and then uses plyr to loop through the feeder, extract the data, apply the functions, and save the result in XYZ. It works, but it takes about 30 minutes to run just one of these files, and I've got more than 100. Any suggestions for ways to optimize this code would be great. Would it be faster to extract bigger chunks of data and apply functions to them? Like, maybe a whole stripe of the earth? How would I know a priori whether this would be faster or not?
rm(list=ls())
library(ncdf)
library(reshape)
library(plyr)
library(sp)
library(maps)
library(rgeos)
library(maptools)
library(rworldmap)
getha = function(lat, size = lat[1] - lat[2]) {
  lat1 = (lat - size/2) * pi/180
  lat2 = (lat + size/2) * pi/180
  lon1 = (0 - size/2) * pi/180  # lon doesn't come in because all longitudes are great circles
  lon2 = (0 + size/2) * pi/180
  6371^2 * abs(sin(lat1) - sin(lat2)) * abs(lon1 - lon2) * 100  # 6371 is the radius of the earth in km and 100 is the number of ha in a km^2
}
gethamat = function(mat, latvec, blocksize = 6) {
  a = getha(latvec)
  areamat = matrix(rep(a, blocksize), blocksize)
  area = t(mat) * areamat  # The matrix is transposed because the dimensions of Ramankutty's netCDFs are switched
  area
}
getprod = function(yieldblock, areablock, latvec) {
  b = gethamat(areablock, latvec)
  sum(t(yieldblock) * b, na.rm = TRUE)
}
lat = as.matrix(seq(from=89.75,to=-89.75,by=-.5))
lon = as.matrix(seq(from=-179.75,to=179.75,by=.5))
lon = seq.int(from=1,to=4320,by=6)
lat = seq.int(from=1,to=2160,by=6)
lat = rep(lat,720)
lon = t(matrix(lon,720,360))
lon = as.data.frame(lon)
l = reshape(lon,direction="long",varying=list(colnames(lon)),v.names = "V")
coords = data.frame(cbind(l[,2],lat))
colnames(coords) = c("lng","lat")
feeder = coords
head(feeder)
maize.nc = open.ncdf('maize_5min.nc')
getcrops = function(feed, netcdf, var = "cropdata") {
  yieldblock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 2, 1), count = c(6, 6, 1, 1))
  areablock  = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 1, 1), count = c(6, 6, 1, 1))
  lat = get.var.ncdf(netcdf, varid = "latitude", start = feed[2], count = 6)
  prod = getprod(yieldblock, areablock, lat)
  lon = get.var.ncdf(netcdf, varid = "longitude", start = feed[1], count = 6)
  # print(c(mean(lat), mean(lon)))
  data.frame(lat = mean(lat), lon = mean(lon), prod = prod)
}
out = adply(as.matrix(feeder),1,getcrops,netcdf=maize.nc,.parallel=FALSE)
Thanks in advance.

plyr functions are notoriously slow when the number of chunks grows large. I would really recommend keeping the data in a multi-dimensional array. That lets you use apply to, for example, get the mean for all lat-lon combinations. A multi-dimensional array also takes less RAM, because the metadata is not stored explicitly as columns but implicitly within the dimensions of the array. In addition, apply is often much faster than plyr. The ncdf package natively reads the data into multi-dimensional arrays, so this also saves you a processing step (e.g. using melt).
After reducing the dataset, I would often use melt to go to what you call XYZ format for plotting, but by then the dataset is so small that this does not really matter.
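For concreteness, here is a minimal sketch of that whole-array approach for your file. It reuses your getha() from above; the band order of "cropdata" (1 = harvested-area fraction, 2 = yield) and the 4320 x 2160 grid size are taken from your code, so double-check them against your netCDF:
library(ncdf)
nc    = open.ncdf("maize_5min.nc")
yield = get.var.ncdf(nc, "cropdata", start = c(1, 1, 2, 1), count = c(4320, 2160, 1, 1))
area  = get.var.ncdf(nc, "cropdata", start = c(1, 1, 1, 1), count = c(4320, 2160, 1, 1))
lats  = get.var.ncdf(nc, "latitude")
lons  = get.var.ncdf(nc, "longitude")

# cell area in hectares depends only on latitude (the columns of the lon x lat matrices)
ha       = matrix(getha(lats), nrow = length(lons), ncol = length(lats), byrow = TRUE)
prodcell = yield * area * ha
prodcell[is.na(prodcell)] = 0

# sum every 6 x 6 block in one vectorised step: reshape to 4-D, sum the within-block axes
blocksum = function(m, k) {
  a = array(m, dim = c(k, nrow(m)/k, k, ncol(m)/k))
  apply(a, c(2, 4), sum)
}
prod05 = blocksum(prodcell, 6)  # 720 x 360 at 0.5 degrees

# back to XYZ for merging with your other data
xyz = data.frame(expand.grid(lon = colMeans(matrix(lons, 6)),
                             lat = colMeans(matrix(lats, 6))),
                 prod = as.vector(prod05))
This reads each variable once and does all the arithmetic on whole matrices, which is where the time savings over the per-block plyr loop come from.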

Related

Reading multiple ERA5 netcdf files

I have a collection of ERA5 netCDF files containing hourly air-temperature data that spans roughly 40 years over the tropical East Pacific. In a Jupyter notebook, I want to run a bandpass filter on the merged dataset, but I keep running into memory-allocation errors. I read the files using xarray.open_mfdataset(list_of_files), but when I try to load the dataset I get the error:
Unable to allocate X GiB for an array with shape (d1, d2, d3, d4) and data type float32
Are there workarounds or best practices for manipulating large datasets like this in Jupyter?
The full code for the bandpass filter, which I want to apply over a large East Pacific domain and about 40 years of ERA5 data, is:
import glob
import os

import numpy as np
import xarray as xr

# Grab dataset
var = 't'
files = glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc'))
files.sort()
# Read files into a dask-backed dataset
ds = xr.open_mfdataset(files)
# Limit study region
lon_min = -140
lon_max = -80
lat_min = -10
lat_max = 10
ds = ds.sel(latitude = slice(lat_max, lat_min), longitude = slice(lon_min, lon_max))
# Now, load the data from the original dask array
da_T = ds.T.load()
# High pass filter (remove signal on the seasonal and longer timescales)
import xrft
freq_threshold = (1/90) * (1/24) * (1/3600) # 90-day frequency threshold
def high_pass_filter(da, dim, thres):
    ft = xrft.fft(da, dim=dim, true_phase=True, true_amplitude=True)
    ft_new = ft.where(ft.freq_time > thres, other=0)
    ft.close()
    da_new = xrft.ifft(ft_new, dim='freq_time', true_phase=True, true_amplitude=True)
    da_new = da_new + np.tile(da.mean('time'), (da.time.shape[0], 1, 1, 1))
    ft_new.close()
    return da_new.real
da_new = high_pass_filter(da_T, 'time', freq_threshold)
# Save filtered dataset
da_new.real.to_netcdf(os.path.join(outdir, 'era5_T.nc'))
When you do this
# Now, load the data from the original dask array
da_T = ds.T.load()
you are loading all the data in the "T" variable of your dataset into memory all at once. Presumably the size of this variable is larger than the amount of RAM available on your system.
There is also a subtlety with that line: .load() loads the data in place and returns the modified object, so ds.T.load() on its own would have been sufficient; alternatively, da_T = ds.T.compute() returns a new, loaded object.
You could try using dask to perform your analysis chunk-by-chunk. You want to ensure that your ds object contains chunked dask array objects before you load/compute it.
You then want to specify your analysis using xarray methods / the xrft package as you are doing, but only call .compute() at the end. This should do your analysis in a chunk-by-chunk manner.
I suggest reading about how to use dask + xarray together though in order to get this to run smoothly.

How can I calculate ETP (Hargreaves) from a NetCDF using xarray.apply_ufunc() with GroupBy

The following illustrates how I tried computing the ETP on a netCDF using xarray.apply_ufunc().
# stack the lat and lon dimensions into a new dimension named point, so at each lat/lon
# we'll have a time series for the geospatial point, and group by these points
tmin_da = ds.tmin.stack(point=('lat', 'lon')).groupby('point')
tmax_da = ds.tmax.stack(point=('lat', 'lon')).groupby('point')
tmean_da = ds.tmean.stack(point=('lat', 'lon')).groupby('point')
# apply the Hargreaves ETo function to the grouped data arrays
ds = xr.apply_ufunc(eto.eto_hargreaves, tmin_da, tmax_da, tmean_da, latitude_degrees = np.linspace(23.125, 49.125, 14))
Running this produces the error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any suggestions on how the input data should be structured, or another straightforward approach?

Custom spatial processing function consumes a lot of memory as code runs in R

I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will
Subset the rasters to potatoes and tomatoes
Change the raster subsets to polygons, one for potatoes and one for tomatoes
Create centroids from each polygon
Create a SpatialPointsDataFrame based on the centroids
Extract the area measurement for each area of interest with SpatialPointsDataFrame
Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)
dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]
# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
  r1 = raster(ncol = 1000, nrow = 1000)
  r1[] = sample(1:60, 1000000, replace = T)
  proj4string(r1) = CRS(ae_pr)
  return(r1)
}
rlis <- lapply(1:5, rmk1)
#Pixel values needed
ptto <- c(43, 54)
# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
  # Next line is not necessary with the raster list created above.
  # temras = raster(x)
  now = format(Sys.time(), "%b%d%H%M%S")
  nwnm = paste0(names(x), now)
  rasmatx = match(x = x, table = ptto)
  writeRaster(rasmatx, file.path(dat_dir2, paste0(nwnm, "ras")), format = "GTiff")
  tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4 }, dissolve = T)
  tempol2 = disaggregate(tempol)
  # for potatoes
  tempol2p = tempol2[tempol2$layer == '1', ]
  if (nrow(tempol2p) > 0) {
    temcenp = gCentroid(tempol2p, byid = T)
    temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp), temcenp))
    temcenpdf$pot_p = extract(rasmatx, temcenpdf)
    temcenpdf$areap_m = gArea(tempol2p, byid = T)
    # writeOGR(temcenpdf, dsn = file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
  }
  # for tomatoes
  tempol2t = tempol2[tempol2$layer == '2', ]
  if (nrow(tempol2t) > 0) {
    temcent = gCentroid(tempol2t, byid = T)
    temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent), temcent))
    temcentdf$tom_t = extract(rasmatx, temcentdf)
    temcentdf$areat_m = gArea(tempol2t, byid = T)
    writeOGR(temcentdf, dsn = file.path(dat_dir2), paste0(nwnm, "t"), driver = "ESRI Shapefile")
  }
}
lapply(rlis, pottom)
I know I should provide some toy data and I created some, but I don't know if they exactly recreate my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem. A lot of memory is used when this code runs. The tiflist can only get through the first 4 files of the list and by then RAM, which is 16 GB on my laptop, is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster is staying open after being connected to in the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that there was no output at all from the function after 10 minutes, whereas by that point I would usually have the first 3 rasters processed.
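For reference, a minimal sketch of the kind of between-raster cleanup I have in mind is below; whether it actually releases the Google Drive cache is an assumption I have not verified:
# process one raster at a time, clearing the raster package's temporary
# files and triggering garbage collection before moving on to the next one
for (r in rlis) {
  pottom(r)
  removeTmpFiles(h = 0)  # delete raster's temp files immediately
  gc()                   # ask R to release the freed memory
}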
27/Oct EDIT after comments from RobertHijmans. It seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally to work with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow. I was unaware of this issue. In one of my attempts I waited 30 minutes before killing the process with no result. It works under the conditions I require for rasters like Alabama and Arkansas, for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know if it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really figure out the output. And it hung with the process too.

efficient use of raster functions in r

I have 500+ points in a SpatialPointsDataFrame object; I have a 1.7GB (200,000 rows x 200,000 cols) raster object. I want to have a tabulation of the values of the raster cells within a buffer around each of the 500+ points.
I have managed to achieve that with the code below (I got a lot of inspiration from here). However, it is slow to run and I would like to make it run faster. It actually runs OK for buffers with "small" widths, say 5 km or even 15 km (~1 million cells), but it becomes super slow when the buffer increases to, say, 100 km (~42 million cells).
I could easily improve on the loop below by using something from the apply family and/or a parallel loop. But my suspicion is that it is slow because the raster package writes 400 MB+ temporary files on each iteration of the loop.
# packages
library(rgeos)
library(raster)
library(rgdal)
myPoints = readOGR(points_path, 'myLayer')
myRaster = raster(raster_path)
myFunction = function(polygon_obj, raster_obj) {
  # this function returns a tabulation of the values of raster cells
  # inside a polygon (buffer)

  # crop to extent of polygon
  clip1 = crop(raster_obj, extent(polygon_obj))
  # crops to polygon edge & converts to raster
  clip2 = rasterize(polygon_obj, clip1, mask = TRUE)
  # much faster than extract
  ext = getValues(clip2)
  # tabulates the values of the raster in the polygon
  tab = table(ext)
  return(tab)
}
# loop over the points
ids = unique(myPoints$ID)
for (id in ids) {
  # select point
  myPoint = myPoints[myPoints$ID == id, ]
  # create buffer
  myPolygon = gBuffer(spgeom = myPoint, byid = FALSE, width = myWidth)
  # extract the data I want (projections, etc are fine)
  tab = myFunction(myPolygon, myRaster)
  # do stuff with tab ...
}
My questions:
Am I right to partially blame the writing operations? If I managed to avoid all those writes, would this code run faster? I have access to a machine with 32 GB of RAM, so I guess it is safe to assume I could load the raster into memory and would not need to write temporary files?
What else could I do to improve efficiency in this code?
I think you should approach it like this
library(raster)
library(rgdal)
myPoints <- readOGR(points_path, 'myLayer')
myRaster <- raster(raster_path)
e <- extract(myRaster, myPoints, buffer=myWidth)
And then something like
etab <- sapply(e, table)
It is hard to answer your question #1, as we do not know enough about your data (we do not know how many cells are covered by a "100 km" buffer), but you can control when raster writes to file with the rasterOptions function. You note, based on the post you link to, that getValues is faster than extract, but I think that is wrong, or at least not very important: the combination of crop, rasterize and getValues should have performance similar to extract (which does almost exactly that under the hood). If you go that route anyway, you should pass an empty RasterLayer, created by raster(myRaster), for faster cropping.
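A minimal sketch of the options part (the values below are only placeholders; see ?rasterOptions for the units and the defaults on your system):
library(raster)
# raise the thresholds that decide when raster keeps intermediate results in
# memory instead of writing temporary files to disk
rasterOptions(maxmemory = 1e9, chunksize = 1e8)
rasterOptions()  # prints the current settings so you can confirm them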

How can I specify dimension order when using ncdf4::ncvar_get?

Following a previous question (Faster reading of time series from netCDF?) I have re-permuted my netCDF files to provide fast time-series reads (scripts on github to be cleaned up eventually ...).
In short, to make reads faster, I have rearranged the dimensions from lat, lon, time to time, lat, lon. Now, my existing scripts break because they assume that the dimensions will always be lat, lon, time, following the ncdf4 documentation of ncvar_get, for the 'start' argument:
Order is X-Y-Z-T (i.e., the time dimension is last)
However, this is not the case.
Furthermore, there is a related inconsistency between the dimension order reported by the command-line netCDF utility ncdump -h and by the R function ncdf4::nc_open: the former says the dimensions are in the expected (lat, lon, time) order, while the latter sees time first (time, lat, lon).
For a minimal example, download the file test.nc and run
bash-$ ncdump -h test.nc
bash-$ R
R> library(ncdf4)
R> print(nc_open("test.nc"))
What I want to do is get records 5-15 from the variable "lwdown"
my.nc <- nc_open("test.nc")
But a call that puts time last, e.g. ncvar_get(my.nc, "lwdown", start = c(1, 1, 5), count = c(1, 1, 10)), doesn't work, since R sees the time dimension first, so I must change my scripts to
ncvar_get(my.nc, "lwdown", start = c(5, 1, 1), count = c(10, 1, 1))
It wouldn't be so bad to update my scripts and functions, except that I want to be able to read data from files regardless of the dimension order.
Is there a way to generalize this function so that it works independently of dimension order?
While asking the question, I figured out this solution, though there is still room for improvement:
The closest I can get is to open the file and find the order in this way:
my.nc$var$lwdown$dim[[1]]$name
[1] "time"
my.nc$var$lwdown$dim[[2]]$name
[1] "lon"
my.nc$var$lwdown$dim[[3]]$name
[1] "lat"
which is a bit unsatisfying, although it led me to this solution:
If I want to start at c(lat = 1, lon = 1, time = 5), but ncvar_get expects whatever order the file happens to use, I can say:
start <- c(lat = 1, lon = 1, time = 5)
count <- c(lat = 1, lon = 1, time = 10)
dim.order <- sapply(my.nc$var$lwdown$dim, function(x) x$name)
ncvar_get(my.nc, "lwdown", start = start[dim.order], count = count[dim.order])
I ran into this recently as well. I have a netCDF with data in this format:
nc_in <- nc_open("my.nc")
nc_in$dim[[1]]$name == "time"
nc_in$dim[[2]]$name == "latitude"
nc_in$dim[[3]]$name == "longitude"
nc_in$dim[[1]]$len == 3653 # this is the number of timesteps in my netcdf
nc_in$dim[[2]]$len == 180 # this is the number of latitude cells
nc_in$dim[[3]]$len == 360 # this is the number of longitude cells
The obnoxious part here is that the DIM component of the netCDF is in the order of T,Y,X
If I try to grab time series data for the pr var using the indices in the order they appear in nc_in$dim, I get an error:
ncvar_get(nc_in,"pr")[3653,180,360] # 'subscript out of bounds'
If I instead grab data in X,Y,T order, it works:
ncvar_get(nc_in,"pr")[360,180,3653] # gives me a value
What I don't understand is how the ncvar_get() function knows which dimension represents X, Y and T, especially if you have generated your own netCDF.
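The order ncvar_get() returns is recorded per variable rather than in the file-level nc_in$dim list, so you can check it the same way as in the answer above (the output shown is only illustrative):
# the dimension order ncvar_get() uses for "pr" is the variable's own dim list
sapply(nc_in$var[["pr"]]$dim, function(d) d$name)
# e.g. "longitude" "latitude" "time" -- which is why [360, 180, 3653] indexes correctly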
