How to write null values in a netCDF file?

Does _FillValue or missing_value still occupy storage space?
If there is a 2-dimensional array with some null values, how can I write it to a netCDF file so as to save storage space?

In netCDF3 every value requires the same amount of disk space. In netCDF4 it is possible to reduce the required disk space using zlib (gzip) compression. The actual compression ratio depends on the data: if there are lots of identical values (for example missing data), you can achieve good results. Here is an example in Python:
import netCDF4
import numpy as np
import os

# Define sample data with all elements masked out
N = 1000
data = np.ma.masked_all((N, N))

# Write the data to a netCDF file using different data formats
for fmt in ('NETCDF3_CLASSIC', 'NETCDF4'):
    fname = 'test.nc'
    ds = netCDF4.Dataset(fname, format=fmt, mode='w')
    xdim = ds.createDimension(dimname='x', size=N)
    ydim = ds.createDimension(dimname='y', size=N)
    var = ds.createVariable(
        varname='data',
        dimensions=(ydim.name, xdim.name),
        fill_value=-999,
        datatype='f4',
        complevel=9,  # set gzip compression level
        zlib=True,    # enable compression
    )
    var[:] = data
    ds.close()

    # Determine the resulting file size
    print(fmt, os.stat(fname).st_size)
See the netCDF4-python documentation, section 9, "Efficient compression of netCDF variables", for details.

Just to add to the excellent answer from Funkensieper, you can copy and compress files from the command line using cdo:
cdo -f nc4c -z zip_9 copy in.nc out.nc
One could compress files simply using gzip or zip, etc., but the disadvantage is that you need to decompress them before reading. Using the netCDF4 compression capabilities avoids this.
You can select your compression level X by using -z zip_X. If your files are very large, you may want to sacrifice a little file size in return for faster access times (e.g. using zip_5 or zip_6 instead of zip_9). In many cases with heterogeneous data, the compression gain is small relative to the uncompressed file.

Or, similarly, with NCO (-7 selects the netCDF-4 classic-model format and -L 9 sets the deflation level):
ncks -7 -L 9 in.nc out.nc
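To illustrate the transparent-decompression point: a deflated netCDF-4 file produced by either command above can be opened and read like any other netCDF file, with the data inflated on the fly. A minimal sketch, assuming the out.nc from above and the netCDF4-python package:
import netCDF4

ds = netCDF4.Dataset('out.nc')        # no explicit decompression step needed
print(ds.data_model)                  # on-disk format, e.g. 'NETCDF4_CLASSIC'
for name, var in ds.variables.items():
    print(name, var.shape)            # values are decompressed as they are read
ds.close()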

Related

Rascal MPL get line count of file without loading its contents

Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc to see whether some of this info might be stored along with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory, counting lines is always linear in the size of the file, so you won't win much time.
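For what the streaming idea looks like in practice, here is a minimal sketch in Python (not Rascal; it only illustrates reading line by line instead of materialising the whole file, and the filename is just a placeholder):
def count_lines(path):
    n = 0
    with open(path, 'rb') as f:   # the file object yields one line at a time
        for _ in f:
            n += 1
    return n

print(count_lines('huge_file.txt'))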

View NetCDF metadata without tripping on large file size / format

Summary
I need help getting NCO tools to be helpful. I'm running into the error
"One or more variable sizes violate format constraints"
... when trying to just view the list of variables in the file with:
ncdump -h isrm_v1.2.1.ncf
It seems odd to trip on this when I'm not asking for any large variables to be read ... just metadata. Are there any flags I should or could be passing to avoid this error?
Reprex
isrm_v1.2.1.ncf (165 GB) is available on Zenodo.
Details
I've just installed the NCO suite via brew install nco --build-from-source on a Mac (I know, I know) running OS X 11.6.5. ncks --version says 5.0.6.
Tips appreciated. I've been trawling through the ncks docs for a couple of hours without much insight. A friend was able to slice the file on a different system running actual Linux, so I'm pretty sure my NCO install is to blame.
How can I dig in deeper to find the root cause? NCO tools don't seem very verbose. I understand there are different sub-formats of NetCDF (3, 4, ...) but I'm not even sure how to verify the version/format of the .nc file that I'm trying to access.
My larger goal is to be able to slice it, like ncks -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc, but if I can't even view metadata, I'm thinking I need to solve that first.
The more-verbose version of the error message, for the record, is:
HINT: NC_EVARSIZE errors occur when attempting to copy or aggregate input files together into an output file that exceeds the per-file capacity of the output file format, and when trying to copy, aggregate, or define individual variables that exceed the per-variable constraints of the output file format. The per-file limit of all netCDF formats is not less than 8 EiB on modern computers, so any NC_EVARSIZE error is almost certainly due to violating a per-variable limit. Relevant limits: netCDF3 NETCDF_CLASSIC format limits fixed variables to sizes smaller than 2^31 B = 2 GiB ~ 2.1 GB, and record variables to that size per record. A single variable may exceed this limit if and only if it is the last defined variable. netCDF3 NETCDF_64BIT_OFFSET format limits fixed variables to sizes smaller than 2^32 B = 4 GiB ~ 4.2 GB, and record variables to that size per record. Any number of variables may reach, though not exceed, this size for fixed variables, or this size per record for record variables. The netCDF3 NETCDF_64BIT_DATA and netCDF4 NETCDF4 formats have no variable size limitations of real-world import. If any variable in your dataset exceeds these limits, alter the output file to a format capacious enough, either netCDF3 classic with 64-bit offsets (with -6 or --64), to PnetCDF/CDF5 with 64-bit data (with -5), or to netCDF4 (with -4 or -7). For more details, see http://nco.sf.net/nco.html#fl_fmt
Tips appreciated!
ncdump is not an NCO program, so I can't help you there, except to say that printing metadata should not cause an error in this case, so try ncks -m in.nc instead of ncdump -h in.nc.
Nevertheless, the hyperslab problem you have experienced is most likely due to trying to shove too much data into a netCDF format that can't hold it. The generic solution to that is to write the data to a more capacious netCDF format:
Try either of these commands:
ncks -5 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
ncks -7 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
Formats are documented here
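As an aside, to answer the "how do I verify the format of the .nc file" part: one option, assuming the netCDF4-python package is installed, is to open the file and inspect its data model (opening only reads the metadata, not the 165 GB of data):
import netCDF4

ds = netCDF4.Dataset('isrm_v1.2.1.ncf')
print(ds.data_model)   # e.g. 'NETCDF3_CLASSIC', 'NETCDF3_64BIT_OFFSET', or 'NETCDF4'
ds.close()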

Can I remap and compress a NetCDF file at the same time?

I've got a huge set of data and I need to remap it to a new pixel size. But this operation generates a big file that fills my hard drive...
I'm using this:
cdo remapnn,r7432x13317 petcomp.nc FINAL.nc
So, can I compress and remap at the same time?
The following modification to your code should work:
cdo -z zip -remapnn,r7432x13317 petcomp.nc FINAL.nc
Read the CDO user guide to see the other compression options if you need the file to be as small as possible.

Extract a given variable from multiple Netcdf files and concatenate to a single file

I am trying to extract a single variable (DUEXTTAU) from multiple NC files, and then combine all the individual files into a single NC file. I am using nco, but have an issue with ncks.
The NC filenames follow:
MERRA2_100.tavgM_2d_aer_Nx.YYYYMM.nc4
Each file has one (monthly) time step, and the time coordinate itself carries no real value; the date information is instead in the units or begin_date attributes. For example, the file MERRA2_100.tavgM_2d_aer_Nx.198001.nc4 has:
int time(time=1);
  :long_name = "time";
  :units = "minutes since 1980-01-01 00:30:00";
  :time_increment = 60000; // int
  :begin_date = 19800101; // int
  :begin_time = 3000; // int
  :vmax = 9.9999999E14f; // float
  :vmin = -9.9999999E14f; // float
  :valid_range = -9.9999999E14f, 9.9999999E14f; // float
  :_ChunkSizes = 1U; // uint
I repeat this step for each file
ncks -v DUEXTTAU MERRA2_100.tavgM_2d_aer_Nx.YYYYMM.nc4 YYYYMM.nc4
and then
ncrcat YYYYMM.nc4 final.nc4
In final.nc4, the time coordinate has the same value for every step (that of the first YYYYMM.nc4). For example, after combining the three files for 198001, 198002 and 198003, the time coordinate reads 198001 for all time steps. How should I deal with this?
Firstly, this command should work:
ncrcat -v DUEXTTAU MERRA2_100.tavgM_2d_aer_Nx.??????.nc4 final.nc4
However, recent versions of NCO fail to correctly reconstruct or re-base the time coordinate when time is an integer, which it is in your case. The fix is in the latest NCO snapshot on GitHub and will be in 4.9.3 to be released hopefully this week. If installing from source is not an option, then manual intervention would be required (e.g., change time to floating point in each input file with ncap2 -s 'time=float(time)' in.nc out.nc). In any case, the time_increment, begin_date, and begin_time attributes are non-standard and will simply be copied from the first file. But time itself should be correctly reconstructed if you use a non-broken version of ncrcat.
You can do this using cdo as well, but you need two steps:
cdo mergetime MERRA2_100.tavgM_2d_aer_Nx.??????.nc4 merged_file.nc
cdo selvar,DUEXTTAU merged_file.nc DUEXTTAU.nc
This should actually work if the begin dates are all set correctly. The problem is that merged_file.nc could be massive, so it may be better to loop through the files, extracting the variable first, and then combine:
for file in MERRA2_100.tavgM_2d_aer_Nx.??????.nc4; do
    cdo selvar,DUEXTTAU $file ${file%.nc4}_duexttau.nc4   # strip the .nc4 extension before appending
done
cdo mergetime MERRA2_100.tavgM_2d_aer_Nx.??????_duexttau.nc4 DUEXTTAU.nc
rm -f MERRA2_100.tavgM_2d_aer_Nx.??????_duexttau.nc4 # clean up

S3: How to do a partial read / seek without downloading the complete file?

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?
S3 files can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte-range argument.
Just add a Range: bytes=0-NN header to your S3 request, where NN is the requested number of bytes to read, and you'll fetch only those bytes rather than read the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. Read the full GET Object docs on Amazon's developer docs.
The AWS .NET SDK only shows fixed-ended ranges as possible (RE: public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at byte 1000 and read to the end", but I do not believe the .NET library allows for this.
The get_object API has an argument for partial reads:
import boto3

s3 = boto3.client('s3')
# the Range header is inclusive at both ends, hence stop_byte - 1
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte-1))
res = resp['Body'].read()
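For the open-ended case raised above, the same Range argument also accepts a range with no end byte; S3 then returns everything from the start offset to the end of the object:
# read from byte 1000 to the end of the object
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes=1000-')
tail = resp['Body'].read()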
Using Python, you can preview the first records of a compressed file.
Connect using boto:
# Connect (imports cover both snippets below)
import csv, io
import boto
from boto.s3.key import Key
from gzip import GzipFile
s3 = boto.connect_s3()
bname = 'my_bucket'
bucket = s3.get_bucket(bname, validate=False)
Read the first 20 lines from the gzip-compressed file:
# Read the first 20 records
limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)  # stream-decompress straight from S3
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for i, line in enumerate(reader):
    if i >= limit:
        break
    print(i, line)
So it's the equivalent of the following Unix command:
zcat my_file.gz | head -20
