Drop duplicate times in xarray - netcdf

I'm reading NetCDF files with open_mfdataset, which contain duplicate times. For each duplicate time I only want to keep the first occurrence, and drop the second (it will never occur more often). The problem is quite similar to this Pandas question, but none of the solutions provided there seem to work with Xarray.
To reproduce the problem:
import numpy as np
import netCDF4 as nc4
import xarray as xr

# Create example NetCDF files
for t in range(2):
    nc = nc4.Dataset('test{}.nc'.format(t), 'w')
    dim_t = nc.createDimension('time', None)
    var_t = nc.createVariable('time', 'f8', ('time',))
    var_s = nc.createVariable('var', 'f8', ('time',))
    var_t.setncattr('units', 'hours since 2001-01-01 00:00:00')
    var_t[:] = t*5 + np.arange(6)
    var_s[:] = t*5 + np.arange(6) + t
    nc.close()
# Read with xarray
f = xr.open_mfdataset(['test0.nc', 'test1.nc'])
The times in the resulting dataset are:
array(['2001-01-01T00:00:00.000000000', '2001-01-01T01:00:00.000000000',
       '2001-01-01T02:00:00.000000000', '2001-01-01T03:00:00.000000000',
       '2001-01-01T04:00:00.000000000', '2001-01-01T05:00:00.000000000',
       '2001-01-01T05:00:00.000000000', '2001-01-01T06:00:00.000000000',
       '2001-01-01T07:00:00.000000000', '2001-01-01T08:00:00.000000000',
       '2001-01-01T09:00:00.000000000', '2001-01-01T10:00:00.000000000'],
      dtype='datetime64[ns]')
Is there an easy way to remove the second occurrence of 2001-01-01T05:00:00.000000000? The real-life problem deals with multi-dimensional NetCDF files, so switching to Pandas is not an option.
[update] The closest I get is following this answer; that works for my simple example as long as Dask is not used. If the files contain Dask arrays, I get the error:
'last' with skipna=True is not yet implemented on dask arrays
but I don't see where I can (or have to) set skipna.

I think xarray does not have its own method for this purpose, but the following works:
In [7]: _, index = np.unique(f['time'], return_index=True)
In [8]: index
Out[8]: array([ 0,  1,  2,  3,  4,  5,  7,  8,  9, 10, 11])
In [9]: f.isel(time=index)
Out[9]:
<xarray.Dataset>
Dimensions:  (time: 11)
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-01-01T01:00:00 ...
Data variables:
    var      (time) float64 dask.array<shape=(11,), chunksize=(6,)>

Apparently Stack Overflow won't let me comment... I wanted to add to Keisuke's answer: you can also use the get_index() method to get a pandas index.
f.sel(time=~f.get_index("time").duplicated())
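As a later, hedged addition to both answers above: recent xarray releases include a built-in method for exactly this, so if your xarray is new enough to have Dataset.drop_duplicates (added around v2022.03), the whole operation reduces to one call:

import xarray as xr

f = xr.open_mfdataset(['test0.nc', 'test1.nc'])
# Keep the first occurrence of each duplicated time label;
# assumes an xarray version that provides drop_duplicates.
f = f.drop_duplicates(dim='time', keep='first')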

Related

xarray chunk dataset PerformanceWarning: Slicing with an out-of-order index is generating more chunks

I am trying to run a simple calculation based on two big gridded datasets in xarray (around 5 GB altogether; daily data from 1850-2100). I keep running out of memory when I try it this way:
import xarray as xr

def readin(model):
    observed = xr.open_dataset(var_obs)
    model_sim = xr.open_dataset(var_sim)
    observed = observed.sel(time=slice('1989', '2010'))
    model_hist = model_sim.sel(time=slice('1989', '2010'))
    model_COR = model_sim
    return observed, model_hist, model_COR

def method(model):
    clim_obs = observed.groupby('time.day').mean(dim='time')
    clim_hist = model_hist.groupby('time.day').mean(dim='time')
    diff_scaling = clim_hist - clim_obs
    bc = model_COR.groupby('time.day') - diff_scaling
    bc[var] = bc[var].where(bc[var] > 0, 0)
    bc = bc.reset_coords('day', drop=True)

observed, model_hist, model_COR = readin('model')
method('model')
I tried to chunk the (full) model_COR to split up the memory:
model_COR.chunk(chunks={'lat': 20, 'lon': 20})
or across the time dimension:
model_COR.chunk(chunks={'time': 8030})
but no matter what I tried, the result was
PerformanceWarning: Slicing with an out-of-order index is generating xxx times more chunks
which doesn't exactly sound like the outcome I want. Where am I going wrong here? Happy about any help!
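A hedged note on the snippet above: chunk() returns a new, lazily evaluated object rather than modifying the dataset in place, so the calls shown have no effect unless the result is reassigned. A minimal sketch of the usual pattern, assuming the same files as in the question (the chunk sizes are illustrative, not tuned):

import xarray as xr

# Chunk at open time so every downstream operation stays lazy
# (var_obs/var_sim are the question's own file paths).
observed = xr.open_dataset(var_obs, chunks={'time': 365})
model_sim = xr.open_dataset(var_sim, chunks={'time': 365})

# Or re-chunk an already open dataset -- note the reassignment,
# since chunk() does not operate in place.
model_COR = model_COR.chunk({'time': 365})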

How to create multi-dimensional structure array in python using dictionary

Is there a way to form a dictionary in Python which opens and appears like the following picture in MATLAB (in structure format)? The saving was done using hdf5storage.savemat.
I would convert the dictionary into a pandas data frame:
import pandas as pd
data = [{'area': 'new-hills', 'rainfall': 100, 'temperature': 20},
        {'area': 'cape-town', 'rainfall': 70, 'temperature': 25},
        {'area': 'mumbai', 'rainfall': 200, 'temperature': 39}]
df = pd.DataFrame.from_dict(data)
Many IDEs (e.g. PyCharm) provide visualization options for data frames.
Lazloo
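A follow-up sketch (hedged): if the goal is specifically a MATLAB struct on disk, the dictionary can be written back out with hdf5storage, which the question already uses; the field names here are illustrative, not prescribed:

import hdf5storage
import numpy as np

# A dict saved via savemat appears in MATLAB as a struct whose
# fields are the dict keys (nested dicts become nested structs).
record = {'area': np.array(['new-hills', 'cape-town', 'mumbai'], dtype=object),
          'rainfall': np.array([100, 70, 200]),
          'temperature': np.array([20, 25, 39])}
hdf5storage.savemat('weather.mat', {'weather': record})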

Xarray - concatenating slices from multiple files

I'm attempting to concatenate slices of multiple files into one file (initialized by a zeros array) and then write the result to a netCDF file. However, I receive the error:
arguments without labels along dimension 'Time' cannot be aligned
because they have different dimension sizes: {365, 30}
I understand the error (the isel() changes the size of the dimension to the size of the slice); however, I don't know how to correct or circumvent the problem. Am I approaching this task correctly? Here's a simplified version of the first iteration:
import xarray as xr
import numpy as np
i=0
PRCP = np.zeros((365,327,348))
d = xr.open_dataset("/Path")
d = d.isel(Time=slice(0,-1,24))
P = d['CUMPRCP'].values
DinM = P.shape[0]
PRCP[i:i+DinM,:,:] = P
i = i + DinM
PRCPxr = xr.DataArray(PRCP.astype('float32'),
                      dims=['Time', 'south_north', 'west_east'])
d['DPRCP'] = PRCPxr
The problem was solved by removing the dims= argument from xr.DataArray() and letting xarray assign default dimension names: naming the first dimension 'Time' had forced alignment against the dataset's smaller 'Time' dimension.
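For the broader task of concatenating slices from several files along 'Time', a hedged sketch of a more idiomatic route, which avoids pre-allocating the zeros array entirely (the file paths are placeholders; 'CUMPRCP' and the slice come from the question):

import xarray as xr

# Slice each file down to the records of interest, then let xarray
# concatenate along 'Time' and write the result out.
paths = ['/Path/file1.nc', '/Path/file2.nc']
slices = [xr.open_dataset(p).isel(Time=slice(0, -1, 24))['CUMPRCP']
          for p in paths]
PRCP = xr.concat(slices, dim='Time')
PRCP.to_netcdf('prcp_concat.nc')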

psycopg2 - acceptable date/datetime values

I'm using psycopg2 and SQLAlchemy to insert data from xls files into a Postgres DB. I've previously been experiencing issues inserting the 'date' columns, which are formatted as a number in Excel. We have defined these columns as date type in Postgres.
I have two issues here:
1. Some of the values in the date columns are empty. Pandas converts those values to NaT or NaN, but sqlalchemy and psycopg2 are not able to parse them.
df = pd.read_excel(full_path, encoding='utf-8')
dict_items = df.to_dict(orient='records')
table = sql.Table(table_name, engine, schema='users')
connection.execute(table.insert().values(dict_items))
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) invalid input syntax for type timestamp: "NaT"
I have converted the numbers into Python dates via the code below, but also had to make sure the dates are not greater than Pandas' timestamp max, because I previously got a 'Range Out of Bounds' error for the timestamp:
max_date = pd.Timestamp.max
for index, row in df.iterrows():
    for col in date_cols:
        date_value = row[col]
        if not np.isnan(date_value):
            year, month, day, hour, minute, sec = xlrd.xldate_as_tuple(date_value, 0)
            py_date = "%02d.%02d.%04d" % (month, day, year)
            if py_date > str(max_date):
                df.loc[index, col] = pd.to_datetime(max_date)
            else:
                df.loc[index, col] = py_date
        if np.isnan(date_value):
            df.loc[index, col] = pd.to_datetime('01.12.2016')
Now I get the following error:
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) integer out of range\n',)<traceback object at>
Could this be related to the last line of code, where I push in the 01.12.2016? Is there some way of tracing where the problem lies?
Thanks in advance.
To fix the issues with the NaNs and NaTs, just change them to None in the dataframe; then they should get inserted without complaint. This solved my issue:
df = df.where(pd.notnull(df), None)
I got this solution from the Postgres message board, where they show a small example of the NaNs getting changed to None.
Another alternative approach that worked for me:
import numpy as np
df = df.replace({np.nan: None})
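A hedged end-to-end sketch tying either replacement back into the question's insert pattern (full_path, table_name, engine, and connection are the question's own names; the Table reflection call assumes SQLAlchemy 1.4+):

import pandas as pd
import sqlalchemy as sql

df = pd.read_excel(full_path)
# Turn every NaN/NaT into None so psycopg2 emits SQL NULL instead
# of trying to parse the literal string "NaT".
df = df.where(pd.notnull(df), None)

dict_items = df.to_dict(orient='records')
table = sql.Table(table_name, sql.MetaData(), autoload_with=engine, schema='users')
connection.execute(table.insert().values(dict_items))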

How can I specify dimension order when using ncdf4::ncvar_get?

Following a previous question (Faster reading of time series from netCDF?) I have re-permuted my netCDF files to provide fast time-series reads (scripts on github to be cleaned up eventually ...).
In short, to make reads faster, I have rearranged the dimensions from lat, lon, time to time, lat, lon. Now, my existing scripts break because they assume that the dimensions will always be lat, lon, time, following the ncdf4 documentation of ncvar_get, for the 'start' argument:
Order is X-Y-Z-T (i.e., the time dimension is last)
However, this is not the case.
Furthermore, there is a related inconsistency in the order of dimensions listed by the command-line netCDF utility ncdump -h and the R function ncdf4::nc_open. The former says that the dimensions are in the expected (lat, lon, time) order, while the latter sees dimensions with time first (time, lat, lon).
For a minimal example, download the file test.nc and run
bash-$ ncdump -h test.nc
bash-$ R
R> library(ncdf4)
R> print(nc_open("test.nc"))
What I want to do is get records 5-15 from the variable "lwdown":
my.nc <- nc_open("test.nc")
ncvar_get(my.nc, "lwdown", start = c(1, 1, 5), count = c(1, 1, 10))
But this doesn't work, since R sees the time dimension first, so I must change my scripts to
ncvar_get(my.nc, "lwdown", start = c(5, 1, 1), count = c(10, 1, 1))
It wouldn't be so bad to update my scripts and functions, except that I want to be able to read data from files regardless of the dimension order.
Is there a way to generalize this function so that it works independent of dimension order?
While asking the question, I figured out this solution, though there is still room for improvement:
The closest I can get is to open the file and find the order in this way:
my.nc$var$lwdown$dim[[1]]$name
[1] "time"
my.nc$var$lwdown$dim[[2]]$name
[1] "lon"
my.nc$var$lwdown$dim[[3]]$name
[1] "lat"
which is a bit unsatisfying, although it led me to this solution:
If I want to start at c(lat = 1, lon = 1, time = 5), but ncvar_get expects an arbitrary order, I can say:
start <- c(lat = 1, lon = 1, time = 5)
count <- c(lat = 1, lon = 1, time = 10)
dim.order <- sapply(my.nc$var$lwdown$dim, function(x) x$name)
ncvar_get(my.nc, "lwdown", start = start[dim.order], count = count[dim.order])
I ran into this recently as well. I have a netCDF with data in this format:
nc_in <- nc_open("my.nc")
nc_in$dim[[1]]$name == "time"
nc_in$dim[[2]]$name == "latitude"
nc_in$dim[[3]]$name == "longitude"
nc_in$dim[[1]]$len == 3653 # this is the number of timesteps in my netcdf
nc_in$dim[[2]]$len == 180 # this is the number of latitude cells
nc_in$dim[[3]]$len == 360 # this is the number of longitude cells
The obnoxious part here is that the DIM component of the netCDF is in the order of T,Y,X
If I try to grab time-series data for the pr variable using the indices in the order they appear in nc_in$dim, I get an error:
ncvar_get(nc_in,"pr")[3653,180,360] # 'subscript out of bounds'
If I instead grab data in X,Y,T order, it works:
ncvar_get(nc_in,"pr")[360,180,3653] # gives me a value
What I don't understand is how the ncvar_get() function knows which dimension represents X, Y, and T, especially if you have generated your own netCDF.
