How can I calculate ETP (Hargreaves) from a NetCDF using xarray.apply_ufunc() with GroupBy?

The following illustrates how I tried computing the ETP on a netCDF using xarray.apply_ufunc().
# stack the lat and lon dimensions into a new dimension named point, so at each lat/lon
# we'll have a time series for the geospatial point, and group by these points
tmin_da = ds.tmin.stack(point=('lat', 'lon')).groupby('point')
tmax_da = ds.tmax.stack(point=('lat', 'lon')).groupby('point')
tmean_da = ds.tmean.stack(point=('lat', 'lon')).groupby('point')
# apply the ETo (Hargreaves) function to the data arrays
ds = xr.apply_ufunc(eto.eto_hargreaves, tmin_da, tmax_da, tmean_da, latitude_degrees=np.linspace(23.125, 49.125, 14))
Running this code raises the following error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any suggestions on how the input data should be structured, or another straightforward way to do this?
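One possible direction, sketched below under the assumption that eto.eto_hargreaves accepts 1-D tmin/tmax/tmean time series plus a scalar latitude in degrees (check the actual signature of your eto module): skip the manual stack/groupby and let apply_ufunc broadcast over lat/lon with vectorize=True, handing each point its own latitude rather than a separate linspace array.

```python
import xarray as xr

# Sketch only: ds and the eto module are the ones from the question above, and
# eto.eto_hargreaves is assumed to take (tmin_1d, tmax_1d, tmean_1d, latitude_degrees)
# and return a 1-D series of the same length.
etp = xr.apply_ufunc(
    eto.eto_hargreaves,
    ds.tmin, ds.tmax, ds.tmean,
    ds.lat,                                   # broadcast so each point gets its own latitude
    input_core_dims=[['time'], ['time'], ['time'], []],
    output_core_dims=[['time']],
    vectorize=True,                           # loop the per-point function over every lat/lon
)
```

The ambiguous-truth-value error is often a sign that an array (here, the whole latitude_degrees vector) reached a place where the function expects a scalar, so the key change in this sketch is passing one latitude per point and letting apply_ufunc vectorize over the grid.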

Related

How to filter for nearby geometries

I want to create a subset of census data for nearby political boundaries. For example, with zip code 17954, how do I filter a data frame containing all zip codes in the US down to just the zip codes within a 10-mile radius?
I believe I would be using tigris and dplyr for this. From the documentation, it seems I should use the st_filter() function to do so.
states = c(unique(FoodAccessResearchAtlasData2019$State))
state_list = vector("list", length = length(states))
for (i in 1:length(states)) {
  # tigris command cd=TRUE doesn't seem to work with the year argument, so this suffices for that
  dat = zctas(state = states[i], year = 2010)
  state_list[[i]] = dat
}
# combining into one large zip code data frame
zipcodes = do.call(rbind, state_list)
nearby_zips = zipcodes %>%
  st_filter(zipcodes$ZCTA5CE10[17603],       # this is what I imagine is the point to filter from
            .predicate = st_is_within_distance, # honestly just copying from the documentation
            dist = 10)                       # is within 10 miles, I don't know the base unit for the dist argument

Reading multiple ERA5 netcdf files

I have a collection of ERA5 netCDF files containing hourly air-temperature data that span approximately 40 years in the tropical East Pacific. In a Jupyter notebook, I want to run a bandpass filter on the merged dataset, but I keep running into errors concerning memory allocation. I read the files using xarray.open_mfdataset(list_of_files), but when I try to load the dataset I get the error:
Unable to allocate X GiB for an array with shape (d1, d2, d3, d4) and data type float32
Are there workaround solutions or best practices for manipulating large datasets like this in Jupyter?
The full code for the bandpass filter, which I want to apply to a large domain over the East Pacific spanning about 40 years of ERA5 data, goes as follows:
import glob
import os

import numpy as np
import xarray as xr

# Grab dataset
var = 't'
files = glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc'))
files.sort()
# Read files into a dask array
ds = xr.open_mfdataset(files)
# Limit study region
lon_min = -140
lon_max = -80
lat_min = -10
lat_max = 10
ds = ds.sel(latitude = slice(lat_max, lat_min), longitude = slice(lon_min, lon_max))
# Now, load the data from the original dask array
da_T = ds.T.load()
# High pass filter (remove signal on the seasonal and longer timescales)
import xrft
freq_threshold = (1/90) * (1/24) * (1/3600)  # frequency corresponding to a 90-day period, in 1/s
def high_pass_filter(da, dim, thres):
    # FFT along the time dimension
    ft = xrft.fft(da, dim=dim, true_phase=True, true_amplitude=True)
    # keep only frequencies above the threshold, zero out the rest
    ft_new = ft.where(ft.freq_time > thres, other=0)
    ft.close()
    # inverse FFT back to the time domain
    da_new = xrft.ifft(ft_new, dim='freq_time', true_phase=True, true_amplitude=True)
    # add the time mean back, since the zero-frequency component was removed
    da_new = da_new + np.tile(da.mean('time'), (da.time.shape[0], 1, 1, 1))
    ft_new.close()
    return da_new.real
da_new = high_pass_filter(da_T, 'time', freq_threshold)
# Save filtered dataset
da_new.real.to_netcdf(os.path.join(outdir, 'era5_T.nc'))
When you do this
# Now, load the data from the original dask array
da_T = ds.T.load()
you are loading all the data in the "T" variable of your dataset into memory all at once. Presumably the size of this variable is larger than the amount of RAM available on your system.
There is also a second, smaller issue with this line: .load() loads the data in place and returns the modified object, so ds.T.load() alone would have been sufficient. You could also have done da_T = ds.T.compute() instead.
You could try using dask to perform your analysis chunk-by-chunk. You want to ensure that your ds object contains chunked dask array objects before you load/compute it.
You then want to specify your analysis using xarray methods / the xrft package as you are doing, but only call .compute() at the end. This should do your analysis in a chunk-by-chunk manner.
I suggest reading up on how to use dask and xarray together, though, in order to get this to run smoothly.
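A minimal sketch of that chunked workflow, reusing the file list, region bounds, filter function, and threshold from the question; the chunk sizes are illustrative, the variable name ('t' vs 'T') must match your files, and whether every step of the original filter stays lazy on dask arrays (np.tile in particular forces evaluation of the mean) would need checking:

```python
import os
import xarray as xr

# open all files lazily as chunked dask arrays instead of loading them into RAM
ds = xr.open_mfdataset(files, chunks={'latitude': 50, 'longitude': 50})

# subset the study region; still lazy, nothing has been read yet
ds = ds.sel(latitude=slice(10, -10), longitude=slice(-140, -80))

# FFT-based filtering generally needs the transformed dimension in a single chunk
da = ds['t'].chunk({'time': -1})

# build the filtered result lazily, then trigger computation once at the end;
# writing to netCDF streams the result to disk chunk by chunk
da_filtered = high_pass_filter(da, 'time', freq_threshold)
da_filtered.to_netcdf(os.path.join(outdir, 'era5_t_filtered.nc'))
```

With this structure, peak memory is governed by the chunk sizes rather than the full 40-year array, and the final write (or a single .compute()) is the only point where work actually happens.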

List outcomes in loops - gmapsdistance

I am trying to use the R package gmapsdistance to calculate the road distance and travel time between two sets of co-ordinates. I have created the following loop as I have 2524 routes to calculate.
for (i in 1:2524) {
  df$Distance_Time[i] <- as.data.frame(gmapsdistance(
    origin = c(df$OriginCo[i]),
    destination = c(df$DestinationCo[i]),
    combinations = "pairwise",
    mode = "driving",
    shape = "long",
    set.api.key("MY_API_KEY")))
}
An example of my df is as follows:
OriginCo : chr "-39.9+174.99", "-40.27+175.55", "-39.66+176.77"...
DestinationCo : chr "-40.27+175.55", "-39.9+174.99", "-40.27+175.55"...
Distance_Time :
However, because the output of gmapsdistance is a list, the new column in the df (Distance_Time) only shows the travel time for each set of coordinates.
If I run the code on the first set of co-ordinates, instead of a loop, I get $time, $distance and $status.
How can I make another column to show distance and status or have these two variables listed along with the $time?
Many thanks in advance.

How to group repeating sequences of numbers using R

The simplest description of what I am trying to do is that I have a column in a data.frame like 1, 2, 3, ..., n, 1, 2, 3, ..., n, ... and I want to group the first 1...n as 1, the second 1...n as 2, and so on.
The full context is: I am using the R spcosa package to do equal-area stratification composite sampling on parcels of land. I start with a shapefile from a GIS that contains a number of polygons (land parcels). The end result I want is a GIS file containing each of the strata and sample locations, with each stratum and sample location labeled by land parcel, stratum, and sample id. So far I can do all this except one bit, which is identifying the stratum that each sample belongs to and including it in the sample label. The sample label needs to look like "parcel#-strata#-composite#" (where # is the number). In practice I don't need this as an actual label, but rather as separate attributes in the GIS file.
The basic workflow is as follows.
For each individual polygon, I use spcosa::stratify to divide it into a number of equal-area strata:
strata.CSEA <- stratify(poly[i,], nStrata = n, nTry = 1, equalArea = TRUE, nGridCells = x)
Note that spcosa::stratify generates a CompactStratificationEqualArea object. I coerce this to SpatialPixelData, then use rasterToPolygons to be able to output it as a GIS file.
I then generate the sample locations as follows:
samples.SPRC <- spsample(strata.CSEA, n = n, type = "composite")
spcosa::spsample creates a SamplingPatternRandomComposite object. I coerce this to a SpatialPointsDataFrame
samples.SPDF <- as(samples.SPRC, "SpatialPointsDataFrame")
and add two columns to the @data slot
samples.SPDF@data$Strata <- "this is the bit I can't do yet"
samples.SPDF@data$CEA <- poly[i,]$name
I can then write samples.SPDF as a GIS file (i.e. with writeOGR) with all the wanted attributes.
As above, the part I can't sort out is how the sample ids relate to the strata ids. The sample points are a vector like 1, 2, 3...n, 1, 2, 3...n, .... How do I extract which sample goes with which stratum? As the actual strata numbers are arbitrary, I could just group (as per my simple question above), but ideally I would like to use the numbering of the actual strata so everything lines up.
To give any contributors access to a hands-on example, I copy below the code from the spcosa documentation, slightly modified to generate the correct objects.
# Note: the example below requires the 'rgdal' package.
# You may consider the 'maptools' package as an alternative.
if (require(rgdal)) {
  # read a vector representation of the `Farmsum' field
  shpFarmsum <- readOGR(
    dsn = system.file("maps", package = "spcosa"),
    layer = "farmsum"
  )
  # stratify `Farmsum' into 50 strata
  # NB: increase argument 'nTry' to get better results
  set.seed(314)
  myStratification <- stratify(shpFarmsum, nStrata = 50, nTry = 1, equalArea = TRUE)
  # sample two sampling units per stratum
  mySamplingPattern <- spsample(myStratification, n = 2, type = "composite")
  # plot the resulting sampling pattern on
  # top of the stratification
  plot(myStratification, mySamplingPattern)
}
Maybe the order() function can help you:
n <- 10
dat <- data.frame(col1 = rep(1:n, 2), col2 = rnorm(2*n))
head(dat)
dat[order(dat$col1), ]
I did not get where the "ID" (1, 2, 3...n) is to be found, so let's assume you have your SpatialPolygonsDataFrame called shpFarmsum with an attribute data column "ID". You can access this column via shpFarmsum$ID. Therefore, if you want to create individual subsets for each ID, this is one way to go:
for (i in unique(shpFarmsum$ID)) {
  tempSubset <- shpFarmsum[shpFarmsum$ID == i, ]
  writeOGR(tempSubset, ".", paste0("subset_", i), driver = "ESRI Shapefile")
}
I added the writeOGR(...) line so all subsets are written to your working directory. However, you can change this line or add further analysis inside the for loop.
How it works
unique(shpFarmsum$ID) extracts all occurring IDs (comparable to your 1, 2, 3...n).
In each repetition of the for loop, another one of these IDs is used to create a subset of the whole SpatialPolygonsDataFrame, which you can then use for further analysis.

How would you speed up this code? Reducing the resolution of data from a netcdf and then converting it to xyz format for stats

I'm taking a netCDF of maize yields and area harvested, shrinking the resolution from 2.5 arc-minutes to 0.5 degrees, and then converting the whole thing to XYZ format so that I can make it "talk" more easily to data that I've got in that format. (I suppose that I could turn my other data into matrix form, but I like XYZ.)
The data is here.
The code below defines a few functions to calculate total production from area harvested and average yields, builds some "feeder" data to use when querying the netCDFs, and then uses plyr to loop through the feeder, extract the data, apply the functions, and save the result in XYZ format. It works, but it takes about 30 minutes to run only one of these files, and I've got more than 100. Any suggestions for ways to optimize this code would be great. Would it be faster to extract bigger chunks of data and apply functions to them? Like, maybe a whole stripe of the earth? How would I know a priori whether this would be faster or not?
rm(list=ls())
library(ncdf)
library(reshape)
library(plyr)
library(sp)
library(maps)
library(rgeos)
library(maptools)
library(rworldmap)
getha = function(lat, size = lat[1] - lat[2]) {
  lat1 = (lat - size/2) * pi/180
  lat2 = (lat + size/2) * pi/180
  lon1 = (0 - size/2) * pi/180  # lon doesn't come in because all longitudes are great circles
  lon2 = (0 + size/2) * pi/180
  6371^2 * abs(sin(lat1) - sin(lat2)) * abs(lon1 - lon2) * 100  # 6371 is the radius of the earth in km and 100 is the number of ha in a km^2
}
gethamat = function(mat, latvec, blocksize = 6) {
  a = getha(latvec)
  areamat = matrix(rep(a, blocksize), blocksize)
  area = t(mat) * areamat  # the matrix is transposed because the dimensions of Ramankutty's netCDFs are switched
  area
}
getprod = function(yieldblock, areablock, latvec) {
  b = gethamat(areablock, latvec)
  sum(t(yieldblock) * b, na.rm = TRUE)
}
lat = as.matrix(seq(from=89.75,to=-89.75,by=-.5))
lon = as.matrix(seq(from=-179.75,to=179.75,by=.5))
lon = seq.int(from=1,to=4320,by=6)
lat = seq.int(from=1,to=2160,by=6)
lat = rep(lat,720)
lon = t(matrix(lon,720,360))
lon = as.data.frame(lon)
l = reshape(lon,direction="long",varying=list(colnames(lon)),v.names = "V")
coords = data.frame(cbind(l[,2],lat))
colnames(coords) = c("lng","lat")
feeder = coords
head(feeder)
maize.nc = open.ncdf('maize_5min.nc')
getcrops = function(feed, netcdf, var = "cropdata") {
  yieldblock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 2, 1), count = c(6, 6, 1, 1))
  areablock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 1, 1), count = c(6, 6, 1, 1))
  lat = get.var.ncdf(netcdf, varid = "latitude", start = feed[2], count = 6)
  prod = getprod(yieldblock, areablock, lat)
  lon = get.var.ncdf(netcdf, varid = "longitude", start = feed[1], count = 6)
  # print(c(mean(lat), mean(lon)))
  data.frame(lat = mean(lat), lon = mean(lon), prod = prod)
}
out = adply(as.matrix(feeder),1,getcrops,netcdf=maize.nc,.parallel=FALSE)
Thanks in advance.
plyr functions are notoriously slow when the number of chunks grows large. I would really recommend keeping the data in a multi-dimensional array. This allows you to use apply to, e.g., get the mean for all lat-lon combinations. The multi-dimensional array takes less RAM, as the metadata is not stored directly as columns but implicitly within the dimensions of the array. In addition, apply is often much faster than plyr. The ncdf package natively reads the data into multi-dimensional arrays, so this also saves you a processing step (e.g. using melt).
After reducing the dataset, I would often use melt to go to what you call XYZ format for plotting. But by then the dataset is so small that this does not really matter.
