xarray chunk dataset PerformanceWarning: Slicing with an out-of-order index is generating more chunks - bigdata

I am trying to run a simple calculation based on two large gridded datasets in xarray (around 5 GB altogether, daily data from 1850-2100). I keep running out of memory when I try it this way:
import xarray as xr

def readin(model):
    observed = xr.open_dataset(var_obs)
    model_sim = xr.open_dataset(var_sim)
    observed = observed.sel(time=slice('1989', '2010'))
    model_hist = model_sim.sel(time=slice('1989', '2010'))
    model_COR = model_sim
    return observed, model_hist, model_COR

def method(model):
    clim_obs = observed.groupby('time.day').mean(dim='time')
    clim_hist = model_hist.groupby('time.day').mean(dim='time')
    diff_scaling = clim_hist - clim_obs
    bc = model_COR.groupby('time.day') - diff_scaling
    bc[var] = bc[var].where(bc[var] > 0, 0)
    bc = bc.reset_coords('day', drop=True)

observed, model_hist, model_COR = readin('model')
method('model')
I tried chunking the (full) model_COR to split up the memory:

model_COR.chunk(chunks={'lat': 20, 'lon': 20})

or across the time dimension:

model_COR.chunk(chunks={'time': 8030})

but no matter what I tried, the result was

PerformanceWarning: Slicing with an out-of-order index is generating xxx times more chunks

which doesn't exactly sound like the outcome I want. Where am I going wrong here? Happy about any help!
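For reference, here is a minimal sketch of one way to set this workflow up so that dask keeps everything lazy until the very end (var_obs, var_sim and var are the same placeholders used above; the chunk size and the output filename are illustrative, not from the original post):

import xarray as xr

# Open lazily with dask chunks instead of loading into memory
observed = xr.open_dataset(var_obs, chunks={'time': 365})
model_sim = xr.open_dataset(var_sim, chunks={'time': 365})

obs_hist = observed.sel(time=slice('1989', '2010'))
mod_hist = model_sim.sel(time=slice('1989', '2010'))

# Daily climatologies and the scaling difference, still lazy
clim_obs = obs_hist.groupby('time.day').mean(dim='time')
clim_hist = mod_hist.groupby('time.day').mean(dim='time')
diff_scaling = clim_hist - clim_obs

# Apply the correction to the full series and clip negative values
bc = model_sim.groupby('time.day') - diff_scaling
bc[var] = bc[var].where(bc[var] > 0, 0)
bc = bc.reset_coords('day', drop=True)

# Nothing has been computed yet; writing streams the result chunk by chunk
bc.to_netcdf('bias_corrected.nc')  # hypothetical output path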

Related

Reading multiple ERA5 netcdf files

I have a collection of ERA5 netCDF files containing hourly air temperature data that span approximately 40 years in the tropical East Pacific. In a Jupyter notebook, I want to run a bandpass filter on the merged dataset, but I keep running into initial errors concerning memory allocation. I read the files using xarray.open_mfdataset(list_of_files), but when I try to load the dataset I get the error:
Unable to allocate X GiB for an array with shape (d1, d2, d3, d4) and data type float32
Are there workaround solutions or best practices for manipulating large datasets like this in Jupyter?
The full code for the bandpass filter is below. I've been wanting to apply a band-pass filter to a large domain over the East Pacific covering about 40 years of ERA5 data. The code goes as follows:
# Grab dataset
import glob
import os
import numpy as np
import xarray as xr

var = 't'
files = glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc'))
files.sort()

# Read files into a dask array
ds = xr.open_mfdataset(files)
# Limit study region
lon_min = -140
lon_max = -80
lat_min = -10
lat_max = 10
ds = ds.sel(latitude = slice(lat_max, lat_min), longitude = slice(lon_min, lon_max))
# Now, load the data from the original dask array
da_T = ds.T.load()
# High pass filter (remove signal on the seasonal and longer timescales)
import xrft
freq_threshold = (1/90) * (1/24) * (1/3600)  # 90-day frequency threshold

def high_pass_filter(da, dim, thres):
    ft = xrft.fft(da, dim=dim, true_phase=True, true_amplitude=True)
    ft_new = ft.where(ft.freq_time > thres, other=0)
    ft.close()
    da_new = xrft.ifft(ft_new, dim='freq_time', true_phase=True, true_amplitude=True)
    da_new = da_new + np.tile(da.mean('time'), (da_T.time.shape[0], 1, 1, 1))
    ft_new.close()
    return da_new.real
da_new = high_pass_filter(da_T, 'time', freq_threshold)
# Save filtered dataset
da_new.real.to_netcdf(os.path.join(outdir, 'era5_T.nc'))
When you do this
# Now, load the data from the original dask array
da_T = ds.T.load()
you are loading all the data in the "T" variable of your dataset into memory all at once. Presumably the size of this variable is larger than the amount of RAM available on your system.
There is also a redundancy in this line: .load() loads the data in place and returns the modified object, so ds.T.load() on its own would have been sufficient. Alternatively, da_T = ds.T.compute() returns a new, loaded object.
You could try using dask to perform your analysis chunk-by-chunk. You want to ensure that your ds object contains chunked dask array objects before you load/compute it.
You then want to specify your analysis using xarray methods / the xrft package as you are doing, but only call .compute() at the end. This should do your analysis in a chunk-by-chunk manner.
To get this to run smoothly, though, I suggest reading up on how to use dask and xarray together.
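As a rough sketch of that chunk-by-chunk pattern (the chunk sizes and output filename are illustrative, the temperature variable is assumed to be called 't', and the filter function itself may still need adjusting, e.g. the np.tile step forces a load of the time mean):

import os
import xarray as xr

# Open the files lazily as chunked dask arrays instead of loading them
ds = xr.open_mfdataset(files, chunks={'time': 24 * 30})

# Subset the study region while everything is still lazy
ds = ds.sel(latitude=slice(lat_max, lat_min), longitude=slice(lon_min, lon_max))

# An FFT needs the whole time axis in a single chunk, so rechunk time
# into one chunk and keep the spatial chunks small
da_T = ds['t'].chunk({'time': -1, 'latitude': 20, 'longitude': 20})

# Build the filtered result lazily and only compute when writing it out
da_new = high_pass_filter(da_T, 'time', freq_threshold)
da_new.to_netcdf(os.path.join(outdir, 'era5_T_filtered.nc'))  # hypothetical output name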

Sparklyr performance in R compared to other on-disk solutions like SAS: removing duplicates using distinct takes hours in sparklyr, seconds in SAS

I was hoping to get some clarification on optimizing sparklyr performance in R on my local machine.
I have imported a CSV file with 211 million rows (the CSV is 17 gigabytes, so it won't fit in memory) and just a few columns, and I would like to select only the distinct values of one of the columns. To accomplish this I imported the data as "test" using spark_read_csv with Memory = FALSE and a schema generated from the data and saved separately in its own object (the import took a few minutes).
After importing with this function, I ran very basic code to deduplicate one column.
It has been running for 2 hours, so I decided to try SAS instead. I was able to accomplish what I needed there in a few minutes.
This seems very problematic to me; even though I am using a local machine, this does not seem like a very difficult problem.
sc <- spark_connect(master = "local", version = "2.3")

download <- function(datapath, dataname) {
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  # spec_explicit <- c(x = "character", y = "numeric")
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    Memory = FALSE
  ))
  return(dataname)
}

test <- download("./data/metastases17.csv", test)
test2 <- test %>% select(DX) %>% distinct()

R running very slowly when importing and manipulating 100,000 KB dataset

I'm working with a dataset in R with dimensions of about 7000 x 5000. The file size is about 100,000 KB. It takes about half an hour to load it into R. When I try to create a correlation table in order to run a PCA, R freezes. Then I have to reopen it and import the data again.
I'm surprised that it's so slow with a dataset of this size. I thought datasets had to be much larger to affect the speed to this degree. I'm using a Microsoft Surface Pro 3.
Does anyone have any ideas for why this might be happening and what I can do about it? Is it my laptop? Or is this kind of thing common with datasets of this size?
Edit in response to comments: My computer has 8 GB RAM. This is the code I am using:
nlsy_training_set <- read_excel("nlsy training set.xlsx")
df <- nlsy_training_set
full <- df[,2:4886]
corf <- cor(full)
corf <- fill.NAs(full, data = NULL, all.covs = FALSE, contrasts.arg = NULL)
corf <- as.data.frame(corf)
pcaf <- principal(corf, nfactors = 100, rotate = "varimax")$loadings
dfpcaf <- as.data.frame(pcaf)
This was very slow because I was using read_excel on a file I had converted into an Excel workbook format. Once I switched to read.csv with the original CSV file, I was able to import the data into R relatively quickly.
Using read.csv works better than read_excel for large datasets.

Xarray - concatenating slices from multiple files

I'm attempting to concatenate slices of multiple files into one file (initialized as a zeros array) and then write it to a netCDF file. However, I receive the error:
arguments without labels along dimension 'Time' cannot be aligned
because they have different dimension sizes: {365, 30}
I understand the error (isel() changes the size of the dimension to the size of the slice), but I don't know how to correct or circumvent the problem. Am I approaching this task correctly? Here's a simplified version of the first iteration:
import xarray as xr
import numpy as np

i = 0
PRCP = np.zeros((365, 327, 348))
d = xr.open_dataset("/Path")
d = d.isel(Time=slice(0, -1, 24))
P = d['CUMPRCP'].values
DinM = P.shape[0]
PRCP[i:i+DinM, :, :] = P
i = i + DinM
PRCPxr = xr.DataArray(PRCP.astype('float32'),
                      dims=['Time', 'south_north', 'west_east'])
d['DPRCP'] = PRCPxr
The problem was solved by removing the dims= argument from xr.DataArray(), letting xarray assign its own default dimension names instead.
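A minimal sketch of that fix, assuming the default dimension names are acceptable (they can always be renamed afterwards):

# Without dims=, xarray assigns default names (dim_0, dim_1, dim_2),
# so the 365-step array no longer clashes with the dataset's 30-step 'Time'
PRCPxr = xr.DataArray(PRCP.astype('float32'))
d['DPRCP'] = PRCPxr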

How would you speed up this code? Reducing the resolution of data from a netcdf and then converting it to xyz format for stats

I'm taking a netcdf of maize yields and area harvested, shrinking the resolution from 2.5 arc-minutes to 0.5 degrees, and then converting the whole thing to XYZ format so that I can make it "talk" more easily to data that I've got in that format. (I suppose that I could turn my other data into matrix form, but I like xyz.)
The data is here.
The code below defines a few functions to calculate total production from area harvested and average yields, then builds some "feeder" data to use when querying the netcdfs, and then uses plyr to loop through the feeder, extract the data, apply the functions, and save in XYZ. It works, but it takes about 30 minutes to process just one of these files, and I've got more than 100. Any suggestions for optimizing this code would be great. Would it be faster to extract bigger chunks of data and apply functions to them? Like, maybe a whole stripe of the earth? How would I know a priori whether this would be faster or not?
rm(list=ls())
library(ncdf)
library(reshape)
library(plyr)
library(sp)
library(maps)
library(rgeos)
library(maptools)
library(rworldmap)

getha = function(lat, size = lat[1] - lat[2]) {
  lat1 = (lat - size/2) * pi/180
  lat2 = (lat + size/2) * pi/180
  lon1 = (0 - size/2) * pi/180  # lon doesn't come in because all longitudes are great circles
  lon2 = (0 + size/2) * pi/180
  6371^2 * abs(sin(lat1) - sin(lat2)) * abs(lon1 - lon2) * 100  # 6371 is the radius of the earth and 100 is the number of ha in a km^2
}

gethamat = function(mat, latvec, blocksize = 6) {
  a = getha(latvec)
  areamat = matrix(rep(a, blocksize), blocksize)
  area = t(mat) * areamat  # The matrix is transposed because the dimensions of Ramankutty's netcdfs are switched
  area
}

getprod = function(yieldblock, areablock, latvec) {
  b = gethamat(areablock, latvec)
  sum(t(yieldblock) * b, na.rm = TRUE)
}

lat = as.matrix(seq(from = 89.75, to = -89.75, by = -.5))
lon = as.matrix(seq(from = -179.75, to = 179.75, by = .5))
lon = seq.int(from = 1, to = 4320, by = 6)
lat = seq.int(from = 1, to = 2160, by = 6)
lat = rep(lat, 720)
lon = t(matrix(lon, 720, 360))
lon = as.data.frame(lon)
l = reshape(lon, direction = "long", varying = list(colnames(lon)), v.names = "V")
coords = data.frame(cbind(l[, 2], lat))
colnames(coords) = c("lng", "lat")
feeder = coords
head(feeder)

maize.nc = open.ncdf('maize_5min.nc')

getcrops = function(feed, netcdf, var = "cropdata") {
  yieldblock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 2, 1), count = c(6, 6, 1, 1))
  areablock = get.var.ncdf(netcdf, varid = var, start = c(as.numeric(feed[1]), as.numeric(feed[2]), 1, 1), count = c(6, 6, 1, 1))
  lat = get.var.ncdf(netcdf, varid = "latitude", start = feed[2], count = 6)
  prod = getprod(yieldblock, areablock, lat)
  lon = get.var.ncdf(netcdf, varid = "longitude", start = feed[1], count = 6)
  # print(c(mean(lat), mean(lon)))
  data.frame(lat = mean(lat), lon = mean(lon), prod = prod)
}

out = adply(as.matrix(feeder), 1, getcrops, netcdf = maize.nc, .parallel = FALSE)
Thanks in advance.
plyr functions are notoriously slow as the number of chunks grows. I would really recommend keeping the data in a multi-dimensional array. This lets you use apply to, for example, get the mean for all lat-lon combinations. A multi-dimensional array also takes less RAM, as the metadata is not stored directly as columns but is implicit in the dimensions of the array. In addition, apply is often much, much faster than plyr. The ncdf package natively reads the data into multi-dimensional arrays, so this also saves you a processing step (e.g. using melt).
After reducing the dataset, I would then use melt to get to what you call XYZ format for plotting. But by then the dataset is so small that this does not really matter.
