I am accessing a netCDF file using the xarray Python library. The specific file that I am using is publicly available.
So, the file has several variables, and for most of these variables the dimensions are: time: 4314, x: 700, y: 562. I am using the ET_500m variable, but the behaviour is similar for the other variables as well. The chunking is: 288, 36, 44.
I am retrieving a single cell and printing the value using the following code:
import xarray as xr
ds = xr.open_dataset('./dataset_greece.nc')
print(ds.ET_500m.values[0][0][0])
According to my understanding, xarray should directly locate the chunk on disk that contains the corresponding value and read only that. Since a chunk should not be bigger than a couple of MBs, I would expect this to take a few seconds or even less. Instead, it takes more than 2 minutes.
If, in the same script, I retrieve the value of another cell, even if it is located in a different chunk (e.g. print(ds.ET_500m.values[1000][500][500])), then this second retrieval takes only some milliseconds.
So my question is what exactly causes this overhead in the first retrieval?
EDIT: I just saw that in xarray open_dataset there is the optional parameter cache, which according to the manual:
If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying data- store multiple times. Defaults to True [...]
So, when I set this to False, subsequent fetches are also slow like the first one. But my question remains: why is this so slow when I am only accessing a single cell? I was expecting xarray to locate the chunk on disk directly and read only a couple of MBs.
Rather than selecting from the .values property, subset the array first:
print(ds.ET_500m[0, 0, 0].values)
The problem is that .values coerces the data to a numpy array, so you're loading all of the data and then subsetting the array. There's no way around this for xarray - numpy doesn't have any concept of lazy loading, so as soon as you call .values xarray has no option but to load (or compute) all of your data.
If the data is a dask-backed array, you could use .data rather than .values to access the dask array and use positional indexing on the dask array, e.g. ds.ET_500m.data[0, 0, 0]. But if the data is just a lazy-loaded netCDF variable, .data will have the same load-everything pitfall described above.
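The difference between indexing first and calling .values first can be illustrated with a toy model. This is only a sketch: LazyVariable is a hypothetical stand-in written for this example, not xarray's actual implementation, and it merely counts how many elements would be read.

```python
class LazyVariable:
    """Toy stand-in for a lazily loaded variable (hypothetical, not xarray code)."""

    def __init__(self, shape):
        self.shape = shape
        self.elements_read = 0  # counts simulated reads from disk

    def _read(self, index):
        # Pretend to fetch a single element from disk.
        self.elements_read += 1
        t, x, y = index
        return t * 1_000_000 + x * 1_000 + y

    @property
    def values(self):
        # Materialise every element, as a conversion to an in-memory array must.
        nt, nx, ny = self.shape
        return [[[self._read((t, x, y)) for y in range(ny)]
                 for x in range(nx)]
                for t in range(nt)]

    def __getitem__(self, index):
        # Index first: only the requested element is read.
        return self._read(index)


eager = LazyVariable((4, 5, 6))
first_eager = eager.values[0][0][0]   # loads everything, then subsets

lazy = LazyVariable((4, 5, 6))
first_lazy = lazy[0, 0, 0]            # subsets, then loads

print(eager.elements_read, lazy.elements_read)  # 120 1
```

Both calls return the same value, but the first touches every element of the 4 x 5 x 6 array while the second touches one.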
I am concatenating 1000s of nc-files (outputs from simulations) to allow me to handle them more easily in Matlab. To do this I use ncrcat. The files have different sizes, and the time variable is not unique between files. The concatenate works well and allows me to read the data into Matlab much quicker than individually reading the files. However, I want to be able to identify the original nc-file from which each data point originates. Is it possible to, say, add the source filename as an extra variable so I can trace back the data?
Easiest way: Online indexing
Before we start, I would use an integer index rather than the filename to identify each run, as it is a lot easier to handle, both when writing and later in the Matlab programme. Rather than a simple monotonically increasing index, the identifier can have relevance for your run, or you can even write several separate indices if necessary (e.g. you might have a number for the resolution, the date, the model version, etc.).
So, the obvious way to do this that I can think of would be that each simulation writes an index to the file to identify itself. i.e. the first model run would write a variable
myrun=1
the second
myrun=2
and so on... then when you cat the files the data can be uniquely identified very easily using this index.
Note that if your spatial dimensions are not unique and the number of time steps also changes from run to run, as your question suggests, your index will need to be a function of all the non-unique dimensions, e.g. myrun(x,y,t). If any of your dimensions are unique across all files then that dimension is redundant in the index and can be omitted.
Of course, the only issue with this solution is it means running the simulations again :-D and you might be talking about an expensive model to run or someone else's runs you can't repeat. If rerunning is out of the question you will need to try to add an index offline...
Offline indexing (easy if grids are same, more complex otherwise)
IF your space dimensions were the same across all files, then this is still an easy task as you can add an index offline very easily across all the time steps in each file using nco:
ncap2 -s 'myrun[$time]=array(X,0,$time)' infile.nc outfile.nc
or if you are happy to overwrite the original file (be careful!)
ncap2 -O -s 'myrun[$time]=array(X,0,$time)' infile.nc infile.nc
where X is the run number. This adds a new variable myrun, which is a function of time, and writes X at each time step. After you merge the files you can see which data slice came from which run.
By the way, the second argument (zero) is the increment. Because it is set to zero, the number X is written for all timesteps in a given file. If it were 1, the index would increase by one each timestep, which could be useful in some cases. For example, you might use two indices: one with an increment of zero to identify the run, and a second with an increment of unity to tell you which step of the Xth run the data slice belongs to.
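To picture what the two indices look like once the files are concatenated, here is a small Python sketch. It only mimics array(start, increment, $time) as described above; the helper ramp and the run lengths are hypothetical and chosen for illustration.

```python
def ramp(start, increment, n):
    """Mimic array(start, increment, $time): n values beginning at `start`."""
    return [start + increment * i for i in range(n)]


# Hypothetical runs: run 1 has 3 time steps, run 2 has 2.
run_lengths = {1: 3, 2: 2}

myrun, mystep = [], []
for run_id, n_steps in run_lengths.items():
    myrun += ramp(run_id, 0, n_steps)  # increment 0: constant run number
    mystep += ramp(0, 1, n_steps)      # increment 1: step counter within the run

print(myrun)   # [1, 1, 1, 2, 2]
print(mystep)  # [0, 1, 2, 0, 1]
```

Record 3 of the merged series (counting from zero) therefore belongs to run 2, step 0.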
If your files are for different domains too, then you might want to put them on a common grid before you do that... I think for that
cdo enlarge
might be of help, see this post : https://code.mpimet.mpg.de/boards/2/topics/1459
I agree that an index will be simpler than a filename. I would just add to the above answer that the command to add a unique index X with a time dimension to each input file can be simplified to
ncap2 -s 'myrun[$time]=X' in.nc out.nc
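If you still want to trace an index back to a filename, one option is to keep a lookup table next to the merged file. The sketch below only builds the table and the ncap2 command strings (it does not run them); the file names and the indexed_ output prefix are assumptions made for the example.

```python
import shlex

# Hypothetical file names; replace with the real simulation outputs.
source_files = ["run_a.nc", "run_b.nc", "run_c.nc"]

# Integer index -> original file name, to be stored alongside the merged file.
lookup = {index: name for index, name in enumerate(source_files, start=1)}

commands = []
for index, name in lookup.items():
    script = f"myrun[$time]={index}"
    out_name = f"indexed_{name}"  # assumed naming scheme for the output
    commands.append(
        f"ncap2 -s {shlex.quote(script)} {shlex.quote(name)} {shlex.quote(out_name)}"
    )

for command in commands:
    print(command)
```

The printed lines can be reviewed and then pasted into a shell, and lookup[myrun] gives the source file for any record in the concatenated data.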
I currently have a list object in RStudio, which shows up in the Environment listing as 1.2 GB. However, when I save it with the function saveRDS with compress = FALSE, the size of the saved object shows up as nearly 4 GB.
Is the reporting of the size of my list object wrong or is something else happening? I thought that if an object took up a certain space in R, it should save at that same size without compression? I understand there are a few questions on Stackoverflow similar to this, but none seems to explain why it differs even with no compression.
The calculation of the size of objects in R is complicated by the need for efficient memory management. Your list may contain elements that are not accounted for while in memory as they may be shared resources, but will need to be included when exported. The help file for object.size states that:
Exactly which parts of the memory allocation should be attributed to
which object is not clear-cut. This function merely provides a rough
indication: it should be reasonably accurate for atomic vectors, but
does not detect if elements of a list are shared, for example.
(Sharing amongst elements of a character vector is taken into account,
but not that between character vectors in a single object.)
This is my first post, I hope I have followed convention.
I've found a lot of success with pydicom, but am stuck on one particular application. I would like to do the following:
Read in dicom to numpy array
Reshape to (frames, rows, columns, pixels)
Do some processing including cropping and converting to grayscale
Output as new dicom file
I use
r = ds.Rows
c = ds.Columns
f = ds.NumberOfFrames
s = ds.SamplesPerPixel
imageC = np.reshape(img,(f,r,c,s), order='C')
to get the initial numpy matrix I want and do the processing. I have confirmed that these steps look reasonable.
Prior to saving the new dicom, I update the ds Rows and Columns with the new correct dimensions and set SamplesPerPixel to 1. I then reshape the numpy matrix before reassigning to PixelData with .tostring().
np.reshape(mat, (p, f, r, c), order='C')
The resulting image is nonsensical (green) in my dicom viewer. Are there any obvious logical mistakes? I can provide more code if it would be of use.
I am rather guessing, as I have not used pydicom for writing files. Anyway, if the original image is an RGB one and you convert it to grayscale, then you should change the Media Storage SOP Class UID of the image so that the viewer can interpret it properly. Can you check the value? It is under tag (0002,0002). Here is the list.
It is possible that there are more tags to change. Can you dump both files and show us differences?
By the way, from your post it seems that you import the image by ds.PixelData. Why don't you use ds.pixel_array? Then you wouldn't need to reshape.
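As a rough illustration of the array handling only (no DICOM tags are touched), here is a NumPy sketch on synthetic data. The sizes, the plain channel average used for grayscale, and the crop window are all assumptions for the example, not the processing from the question.

```python
import numpy as np

# Synthetic stand-in for the flat pixel buffer: 2 frames, 4 rows, 6 columns, 3 samples.
f, r, c, s = 2, 4, 6, 3
img = np.arange(f * r * c * s, dtype=np.uint8)

# Same reshape as in the question: (frames, rows, columns, samples).
imageC = np.reshape(img, (f, r, c, s), order='C')

# Grayscale via a plain average over the sample axis (an assumption; the
# question does not say which conversion is used).
gray = imageC.mean(axis=-1).astype(np.uint8)   # shape (f, r, c)

# Hypothetical crop: rows 1..2 and columns 2..4 of every frame.
cropped = gray[:, 1:3, 2:5]                    # shape (f, 2, 3)

# With one sample per pixel the array is already (frames, rows, columns).
payload = np.ascontiguousarray(cropped).tobytes()

print(cropped.shape, len(payload))  # (2, 2, 3) 12
```

After the conversion there is no sample axis left, so the byte length should equal frames x rows x columns for 8-bit data, which is a quick check to run before assigning the buffer back.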
I have an HDF5 file containing arrays that are saved with Python/numpy. When I read them into Julia using HDF5.jl, the axes are in the reverse of the order in which they appear in Python. To reduce the mental gymnastics involved in moving between the Python and Julia codebases, I reverse the axis order when I read the data into Julia. I have written my own function to do this:
function reversedims(ary::Array)
    # collect turns the reversed range into an explicit Vector of axis indices
    permutedims(ary, collect(ndims(ary):-1:1))
end
data = HDF5.read(someh5file, somekey) |> reversedims
This is not ideal because (1) I always have to import reversedims to use this; (2) I have to remember to do this for each Array I read. I am wondering if it is possible to either:
instruct HDF5.jl to read in the arrays with a numpy-style axis order, either through a keyword argument or some kind of global configuration parameter
use a builtin single argument function to reverse the axes
The best approach would be to create a H5py.jl package, modeled on MAT.jl (which reads and writes .mat files created by Matlab). See also https://github.com/timholy/HDF5.jl/issues/180.
It looks to me like permutedims! does what you're looking for; however, it does do an array copy. If you can rewrite the hdf5 files in python, numpy.asfortranarray claims to return your data stored in column-major format, though the numpy internals docs seem to suggest that the data isn't altered, simply the stride is, so I don't know if the hdf5 file output would be any different.
Edit: Sorry, I just saw you are already using permutedims in your function. I couldn't find anything else on the Julia side, but I would still try the numpy.asfortranarray and see if that helps.
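A quick NumPy check of what asfortranarray changes, compared with actually reversing the axes, might look like the sketch below. It uses a small synthetic array and does not write an HDF5 file, so how the result appears in Julia is not verified here.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# asfortranarray changes the memory layout, not the logical shape or contents.
fortran = np.asfortranarray(a)
print(fortran.shape)                  # (2, 3, 4)
print(np.array_equal(a, fortran))     # True
print(fortran.flags["F_CONTIGUOUS"])  # True

# Transposing reverses the axis order, which is what reversedims does in Julia.
reversed_axes = a.T
print(reversed_axes.shape)                     # (4, 3, 2)
print(reversed_axes[3, 2, 1] == a[1, 2, 3])    # True
```

Since the logical array is unchanged by asfortranarray, it seems unlikely to alter the axis order seen by a reader; saving the transposed array would be the Python-side equivalent of reversedims.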
I have a bunch of binary data in N-byte chunks, where each chunk corresponds exactly to one row of a PyTables table.
Right now I am parsing each chunk into fields, writing them to the various fields in the table row, and appending them to the table.
But this seems a little silly since PyTables is going to convert my structured data back into a flat binary form for inclusion in an HDF5 file.
If I need to optimize the CPU time necessary to do this (my data comes in large bursts), is there a more efficient way to load the data into PyTables directly?
PyTables does not currently expose a 'raw' dump mechanism like you describe. However, you can fake it by using UInt8Atom and UInt8Col. You would do something like:
import tables as tb
f = tb.open_file('my_file.h5', 'w')
mytable = f.create_table('/', 'mytable', {'mycol': tb.UInt8Col(shape=(N,))})
mytable.append(myrow)
f.close()
This would likely get you the fastest I/O performance. However, you will miss out on the meaning of the various fields that are part of this binary chunk.
Arguably, raw dumping of the chunks/rows is not what you want to do anyway, which is why it is not explicitly supported. Internally, HDF5 and PyTables handle many kinds of conversion for you. This includes, but is not limited to, endianness and other platform-specific features. Because the data types are managed for you, the resulting HDF5 file and datasets are cross-platform. When you dump raw bytes in the manner you describe, you short-circuit one of the main advantages of using HDF5/PyTables. If you do short-circuit, there is a high probability that the resulting file will look like garbage on anything but the original system that produced it.
So in summary, you should convert the chunks to the appropriate data types in memory and then write them out. Yes, this takes more processing power and time, but in addition to being the right thing to do, it will ultimately save you huge headaches down the road.
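One way to keep the conversion cheap is to parse a whole burst of chunks at once with a NumPy structured dtype instead of field by field in Python. The sketch below assumes a hypothetical 8-byte record (a little-endian uint32 followed by a float32); the real layout must match your binary format.

```python
import struct

import numpy as np

# Hypothetical 8-byte record layout: little-endian uint32 id + float32 value.
record_dtype = np.dtype([("id", "<u4"), ("value", "<f4")])

# Stand-in for a burst of incoming binary data: three back-to-back records.
raw = b"".join(struct.pack("<If", i, i * 0.5) for i in range(3))

# Interpret the whole buffer as typed records in one call, without a Python loop.
records = np.frombuffer(raw, dtype=record_dtype)

print(records["id"].tolist())     # [0, 1, 2]
print(records["value"].tolist())  # [0.0, 0.5, 1.0]
```

A structured array like records can then be handed to the table's append method in one call, provided the table's column names and types match the dtype, so the fields keep their meaning in the file.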