Xarray - concatenating slices from multiple files - netcdf

I'm attempting to concatenate slices from multiple files into one array (initialized with zeros) and then write it to a netCDF file. However, I receive the error:
arguments without labels along dimension 'Time' cannot be aligned
because they have different dimension sizes: {365, 30}
I understand the error (isel() changes the size of the dimension to the size of the slice), but I don't know how to correct or circumvent the problem. Am I approaching this task correctly? Here's a simplified version of the first iteration:
import xarray as xr
import numpy as np
i=0
PRCP = np.zeros((365,327,348))
d = xr.open_dataset("/Path")
d = d.isel(Time=slice(0,-1,24))
P = d['CUMPRCP'].values
DinM = P.shape[0]
PRCP[i:i+DinM,:,:] = P
i = i + DinM
PRCPxr = xr.DataArray(PRCP.astype('float32'),
                      dims=['Time', 'south_north', 'west_east'])
d['DPRCP'] = PRCPxr

The problem was solved by removing the dims=... argument from xr.DataArray(), so that xarray assigned its own arbitrary dimension names instead.
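For illustration, a minimal sketch of that fix using the names from the snippet above (when dims is omitted, xarray falls back to generic names such as dim_0, dim_1, dim_2, so nothing is aligned against the shortened 'Time' axis):
# Same as before, but without the dims= argument:
PRCPxr = xr.DataArray(PRCP.astype('float32'))
d['DPRCP'] = PRCPxr  # no alignment conflict with the sliced 'Time' dimension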

Related

Reading multiple ERA5 netcdf files

I have a collection of ERA5 netcdf files that contain hourly air-temperature data spanning approximately 40 years in the tropical East Pacific. In a Jupyter notebook, I want to run a bandpass filter on the merged dataset, but I keep running into errors concerning memory allocation. I read the files using xarray.open_mfdataset(list_of_files), but when I try to load the dataset I get the error:
Unable to allocate X GiB for an array with shape (d1, d2, d3, d4) and data type float32
Are there workaround solutions or best practices for manipulating large datasets like this in Jupyter?
I've been wanting to apply a band-pass filter to a large domain over the East Pacific spanning about 40 years of ERA5 data. The full code goes as follows:
import glob
import os

import numpy as np
import xarray as xr
import xrft

# Grab dataset (parent_dir and outdir are user-defined directories, not shown)
var = 't'
files = glob.glob(os.path.join(parent_dir, 'era5_' + var + '_daily.nc'))
files.sort()

# Read files into a dask array
ds = xr.open_mfdataset(files)

# Limit study region
lon_min = -140
lon_max = -80
lat_min = -10
lat_max = 10
ds = ds.sel(latitude=slice(lat_max, lat_min), longitude=slice(lon_min, lon_max))

# Now, load the data from the original dask array
da_T = ds.T.load()

# High-pass filter (remove signal on the seasonal and longer timescales)
freq_threshold = (1/90) * (1/24) * (1/3600)  # 90-day frequency threshold

def high_pass_filter(da, dim, thres):
    ft = xrft.fft(da, dim=dim, true_phase=True, true_amplitude=True)
    ft_new = ft.where(ft.freq_time > thres, other=0)
    ft.close()
    da_new = xrft.ifft(ft_new, dim='freq_time', true_phase=True, true_amplitude=True)
    da_new = da_new + np.tile(da.mean('time'), (da_T.time.shape[0], 1, 1, 1))
    ft_new.close()
    return da_new.real

da_new = high_pass_filter(da_T, 'time', freq_threshold)

# Save filtered dataset
da_new.real.to_netcdf(os.path.join(outdir, 'era5_T.nc'))
When you do this
# Now, load the data from the original dask array
da_T = ds.T.load()
you are loading all the data in the "T" variable of your dataset into memory all at once. Presumably the size of this variable is larger than the amount of RAM available on your system.
A side note on this line: da.load() loads the data in place and returns the modified object, so ds.T.load() alone would have been sufficient; alternatively, you could have written da_T = ds.T.compute().
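As a quick illustration of that difference, using the same ds:
# .load() fills in the data in place and returns the same object,
# so the assignment is redundant:
ds.T.load()

# .compute() leaves the original untouched and returns a new,
# loaded object, so the assignment is needed:
da_T = ds.T.compute()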
You could try using dask to perform your analysis chunk-by-chunk. You want to ensure that your ds object contains chunked dask array objects before you load/compute it.
You then want to specify your analysis using xarray methods / the xrft package as you are doing, but only call .compute() at the end. This should do your analysis in a chunk-by-chunk manner.
I suggest reading about how to use dask and xarray together, though, in order to get this to run smoothly.
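As a rough sketch of that chunk-by-chunk pattern (the chunk sizes, the variable name 't', and the output file name are assumptions, and a simple anomaly stands in for the xrft filter):
import glob
import os
import xarray as xr

# parent_dir and outdir as in the question
files = sorted(glob.glob(os.path.join(parent_dir, 'era5_t_daily.nc')))

# Passing chunks= makes the variables dask arrays instead of loading them.
ds = xr.open_mfdataset(files, chunks={'time': 365})
ds = ds.sel(latitude=slice(10, -10), longitude=slice(-140, -80))

# Operations on the chunked data stay lazy...
anomaly = ds['t'] - ds['t'].mean('time')

# ...and only the final write (or an explicit .compute()) triggers
# the computation, chunk by chunk.
anomaly.to_netcdf(os.path.join(outdir, 'era5_t_anomaly.nc'))
If the FFT along time is kept, it is probably safer to chunk along latitude/longitude only, since a Fourier transform needs the whole time axis in a single chunk.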

Partition a large list into chunks with convenient I/O

I have a large list with a size of approx. 1.3 GB. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format so that:
a) every saved file of the chunk is less than 100MB large
b) the original list can be loaded conveniently and fast into a new R workspace
EDIT II: The reason for doing this is to have an R solution that bypasses the GitHub file-size restriction of 100 MB per file. The limitation to R is due to some external non-technical restrictions which I can't comment on.
What is the best solution for this problem?
EDIT I: Since it was mentioned in the comments that some code for the problem is helpful to create a better question:
An R-example of a list with size of 1.3 GB:
li <- list(a = rnorm(10^8),
           b = rnorm(10^7.8))
So, you want to split a file and to reload it in a single dataframe.
There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc.
The function returns the number of files written.
The join.files function takes the base name.
Note:
Play with the rows parameter to find out the correct size to fit in 100 MB. It depends on your data, but for similar datasets a fixed size should do. However, if you are dealing with very different datasets, this approach will likely fail.
When reading, you need to have twice as much memory as is occupied by your dataframe (because a list of the smaller dataframes is read first, then rbinded).
The number is written as 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as the original file.
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m
}

join.files <- function(basename) {
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)

xarray chunk dataset PerformanceWarning: Slicing with an out-of-order index is generating more chunks

I am trying to run a simple calculation based on two big gridded datasets in xarray (around 5 GB altogether; daily data from 1850-2100). I keep running out of memory when I try it this way:
import xarray as xr

def readin(model):
    observed = xr.open_dataset(var_obs)
    model_sim = xr.open_dataset(var_sim)
    observed = observed.sel(time=slice('1989', '2010'))
    model_hist = model_sim.sel(time=slice('1989', '2010'))
    model_COR = model_sim
    return observed, model_hist, model_COR

def method(model):
    clim_obs = observed.groupby('time.day').mean(dim='time')
    clim_hist = model_hist.groupby('time.day').mean(dim='time')
    diff_scaling = clim_hist - clim_obs
    bc = model_COR.groupby('time.day') - diff_scaling
    bc[var] = bc[var].where(bc[var] > 0, 0)
    bc = bc.reset_coords('day', drop=True)

observed, model_hist, model_COR = readin('model')
method('model')
I tried to chunk the (full) model_COR to split up the memory:
model_COR.chunk(chunks={'lat': 20, 'lon': 20})
or across the time dimension
model_COR.chunk(chunks={'time': 8030})
but everything I tried resulted in
PerformanceWarning: Slicing with an out-of-order index is generating xxx times more chunks
This doesn't exactly sound like the outcome I want. Where am I going wrong here? Happy about any help!
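For reference, a minimal sketch of the chunking pattern (the file name and chunk sizes are placeholders); note that .chunk() returns a new, chunked object rather than modifying the dataset in place, so its result has to be kept:
import xarray as xr

# Chunk at open time, so every later step works on dask chunks.
model_sim = xr.open_dataset('model_sim.nc', chunks={'time': 8030})

# .chunk() is not an in-place operation; assign its result.
model_COR = model_sim.chunk({'lat': 20, 'lon': 20})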

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names and use that one file to run read.capthist, then discretize, secr.fit, and derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and have used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")

files <- list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files

for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS',
                  trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running it separately (this code works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!

Using Juno/LT to run Julia code, error: `getindex` has no method matching getindex(::DataFrame, ::ASCIIString)

Below is the first portion of the code I am using. The intention of this code is, when given a file of images in .bmp format, to correctly identify the letter shown in each image.
# Install required packages
Pkg.add("Images")
Pkg.add("DataFrames")

using Images
using DataFrames

# typeData could be either "train" or "test".
# labelsInfo should contain the IDs of each image to be read.
# The images in the trainResized and testResized data files
# are 20x20 pixels, so imageSize is set to 400.
# path should be set to the location of the data files.
function read_data(typeData, labelsInfo, imageSize, path)
    # Initialize x matrix
    x = zeros(size(labelsInfo, 1), imageSize)
    for (index, idImage) in enumerate(labelsInfoTrain["ID"])
        # Read image file
        nameFile = "$(path)/$(typeData)Resized/$(idImage).Bmp"
        img = imread(nameFile)
        # Convert img to float values
        temp = float32sc(img)
        # Convert color images to gray images
        # by taking the average of the color scales.
        if ndims(temp) == 3
            temp = mean(temp.data, 1)
        end
        # Transform the image matrix to a vector and store
        # it in the data matrix
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end

imageSize = 400  # 20 x 20 pixels

# Set location of data files / folders.
# You will probably need to set this path to the folder your files are in.
path = "C:\\Users\\Aaron\\Downloads\\Math512Project"

# Read information about test data (IDs).
labelsInfoTest = readtable("$(path)/sampleSubmissionJulia.csv")

# Read information about training data (IDs).
labelsInfoTrain = readtable("$(path)/trainLabels.csv")

# Read training matrix
xTrain = read_data("train", labelsInfoTrain, imageSize, path)
The error I am facing appears when my program reaches the very last line of code above, which reads:
xTrain = read_data("train", labelsInfoTrain, imageSize, path)
I receive an error saying: getindex has no method matching getindex(::DataFrame, ::ASCIIString) in read_data at benchmarkJeff.jl:18
which refers to this line of code:
for (index, idImage) in enumerate(labelsInfoTrain["ID"])
Some research online has given me the insight that the problem has to do with a conflict between the DataFrames package and the Images package. I was recommended to change the "ID" in my code to [:ID], but this does not solve the problem and instead causes another error. I was wondering if anyone knew how to fix this problem, or what exactly the problem is with my code. I get the same error when running the code from the Julia command line (0.4.0). I look forward to hearing from you.
