Can I remap and compress a NetCDF file at the same time?

I have a huge set of data and I need to remap it to a new pixel size, but this operation generates a big file that fills my hard drive...
I'm using this:
cdo remapnn,r7432x13317 petcomp.nc FINAL.nc
So, can I compress and remap in the same operation?

The following modification to your command should work:
cdo -z zip -remapnn,r7432x13317 petcomp.nc FINAL.nc
Read the CDO user guide for other compression options if you need the output to be as small as possible.
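If you are driving CDO from a script, the same one-liner can be wrapped in Python; a minimal sketch using subprocess (the file names and grid are the ones from the question):
import subprocess

# Run the compressed remap exactly as in the answer above; check=True makes
# Python raise an error if cdo exits with a non-zero status.
subprocess.run(
    ["cdo", "-z", "zip", "-remapnn,r7432x13317", "petcomp.nc", "FINAL.nc"],
    check=True,
)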

Related

Rascal MPL get line count of file without loading its contents

Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc to see whether some of this information might be stored along with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory, counting lines is always linear in the size of the file, so you won't win much time.
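For comparison, here is a minimal sketch (in Python rather than Rascal) of that streaming approach: iterate over the file line by line so only one line is held in memory at a time; the runtime stays linear in the file size. The file path is a placeholder.
# Count lines without loading the whole file into memory.
def count_lines(path):
    n = 0
    with open(path, "rb") as f:
        for _ in f:
            n += 1
    return n

print(count_lines("big_file.txt"))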

View NetCDF metadata without tripping on large file size / format

Summary
I need help getting NCO tools to be helpful. I'm running into the error
"One or more variable sizes violate format constraints"
... when trying to just view the list of variables in the file with:
ncdump -h isrm_v1.2.1.ncf
It seems odd to trip on this when I'm not asking for any large variables to be read ... just metadata. Are there any flags I should or could be passing to avoid this error?
Reprex
isrm_v1.2.1.ncf (165 GB) is available on Zenodo.
Details
I've just installed the NCO suite via brew install nco --build-from-source on a Mac (I know, I know) running OS X 11.6.5. ncks --version says 5.0.6.
Tips appreciated. I've been trawling through the ncks docs for a couple of hours without much insight. A friend was able to slice the file on a different system running actual Linux, so I'm pretty sure my NCO install is to blame.
How can I dig deeper to find the root cause? NCO tools don't seem very verbose. I understand there are different sub-formats of NetCDF (3, 4, ...), but I'm not even sure how to verify the version/format of the .nc file I'm trying to access.
My larger goal is to be able to slice it, like ncks -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc, but if I can't even view metadata, I'm thinking I need to solve that first.
The more-verbose version of the error message, for the record, is:
HINT: NC_EVARSIZE errors occur when attempting to copy or aggregate input files together into an output file that exceeds the per-file capacity of the output file format, and when trying to copy, aggregate, or define individual variables that exceed the per-variable constraints of the output file format. The per-file limit of all netCDF formats is not less than 8 EiB on modern computers, so any NC_EVARSIZE error is almost certainly due to violating a per-variable limit. Relevant limits: netCDF3 NETCDF_CLASSIC format limits fixed variables to sizes smaller than 2^31 B = 2 GiB ~ 2.1 GB, and record variables to that size per record. A single variable may exceed this limit if and only if it is the last defined variable. netCDF3 NETCDF_64BIT_OFFSET format limits fixed variables to sizes smaller than 2^32 B = 4 GiB ~ 4.2 GB, and record variables to that size per record. Any number of variables may reach, though not exceed, this size for fixed variables, or this size per record for record variables. The netCDF3 NETCDF_64BIT_DATA and netCDF4 NETCDF4 formats have no variable size limitations of real-world import. If any variable in your dataset exceeds these limits, alter the output file to a format capacious enough, either netCDF3 classic with 64-bit offsets (with -6 or --64), to PnetCDF/CDF5 with 64-bit data (with -5), or to netCDF4 (with -4 or -7). For more details, see http://nco.sf.net/nco.html#fl_fmt
Tips appreciated!
ncdump is not an NCO program, so I can't help you there, except to say that printing metadata should not cause an error in this case, so try ncks -m in.nc instead of ncdump -h in.nc.
Nevertheless, the hyperslab problem you have experienced is most likely due to trying to shove too much data into a netCDF format that can't hold it. The generic solution to that is to write the data to a more capacious netCDF format:
Try either one of these commands:
ncks -5 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
ncks -7 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
Formats are documented here
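Regarding the side question about verifying the file's format: ncdump -k file.nc prints the format kind, and the netCDF4-python bindings expose the same information. A minimal sketch, assuming the bindings are installed and can open the file (the file name is the one from the question):
import netCDF4

# Open read-only and report which netCDF data model / on-disk format is used.
ds = netCDF4.Dataset("isrm_v1.2.1.ncf", mode="r")
print(ds.data_model)   # e.g. 'NETCDF3_64BIT_OFFSET', 'NETCDF4', ...
print(ds.disk_format)  # underlying on-disk format, e.g. 'NETCDF3' or 'HDF5'
ds.close()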

How can I tell if my dicom files are compressed?

I have been working with DICOM files that are about 4 MB each, but I recently received some that are 280 KB each. I am not sure whether this is because they are from different CT scanners or whether the new DICOMs were compressed before being given to me.
Is there a way to find out, and if they are compressed, is there a way to uncompress them to their original size?
This is a continuation of the other answer from kritzel_sw.
If you see any of the following UIDs in the (0002,0010) Transfer Syntax UID element:
1.2.840.10008.1.2 Implicit VR Little Endian: Default Transfer Syntax for DICOM
1.2.840.10008.1.2.1 Explicit VR Little Endian
1.2.840.10008.1.2.2 Explicit VR Big Endian
then the (7FE0,0010) Pixel Data is uncompressed. You will generally observe a bigger file size in this case.
Not part of your question, but objects other than images (for example a PDF or a Structured Report) can be encapsulated with the following Transfer Syntax:
1.2.840.10008.1.2.1.99 Deflated Explicit VR Little Endian
Other well-known Transfer Syntax values mean that the Pixel Data is compressed.
Note that private Transfer Syntax values are also possible for a data set. Their implementation is generally private to the respective manufacturer.
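If you prefer checking this programmatically, here is a hedged sketch with pydicom (not part of the original answer; the file name is a placeholder) that reads the Transfer Syntax UID from the file meta header and compares it against the uncompressed syntaxes listed above:
import pydicom

# Transfer syntaxes whose Pixel Data is stored uncompressed (see list above).
UNCOMPRESSED = {
    "1.2.840.10008.1.2",    # Implicit VR Little Endian
    "1.2.840.10008.1.2.1",  # Explicit VR Little Endian
    "1.2.840.10008.1.2.2",  # Explicit VR Big Endian
}

ds = pydicom.dcmread("image.dcm")
ts = str(ds.file_meta.TransferSyntaxUID)
print(ts, "-> uncompressed" if ts in UNCOMPRESSED else "-> compressed (or deflated)")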
Yes and yes.
I recommend the binary tools from the OFFIS DICOM toolkit, but you will be able to achieve the same results with different toolkits. You can find the dcmtk here.
How to find out if your files are compressed:
dcmdump <filename>
Have a look at the meta header, in particular the attribute Transfer Syntax UID (0002,0010). dcmdump "translates" the unique identifier into a human-readable transfer syntax name, e.g.
(0002,0010) UI =LittleEndianExplicit # 20, 1 TransferSyntaxUID
The Transfer Syntax tells you whether or not the pixel data in this DICOM file is compressed.
How to decompress compressed images:
dcmdjpeg <compressed DICOM file in> <uncompressed DICOM file out>
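For completeness, the same decompression step can be done in Python with pydicom; this is a sketch, assuming a suitable pixel-data handler (e.g. GDCM or pylibjpeg) is installed, and the file names are placeholders:
import pydicom

# Read a compressed file, decode the Pixel Data, and save an uncompressed copy.
ds = pydicom.dcmread("compressed.dcm")
ds.decompress()
ds.save_as("uncompressed.dcm")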

How to write null values to a netCDF file?

Do _FillValue or missing_value values still occupy storage space?
If there is a 2-dimensional array with some null values, how can I write it to a netCDF file in a way that saves storage space?
In netCDF3 every value requires the same amount of disk space. In netCDF4 it is possible to reduce the required disk space using gzip compression. The actual compression ratio depends on the data. If there are lots of identical values (for example missing data), you can achieve good results. Here is an example in python:
import netCDF4
import numpy as np
import os
# Define sample data with all elements masked out
N = 1000
data = np.ma.masked_all((N, N))
# Write data to netCDF file using different data formats
for fmt in ('NETCDF3_CLASSIC', 'NETCDF4'):
    fname = 'test.nc'
    ds = netCDF4.Dataset(fname, format=fmt, mode='w')
    xdim = ds.createDimension(dimname='x', size=N)
    ydim = ds.createDimension(dimname='y', size=N)
    var = ds.createVariable(
        varname='data',
        dimensions=(ydim.name, xdim.name),
        fill_value=-999,
        datatype='f4',
        complevel=9,  # set gzip compression level
        zlib=True     # enable compression
    )
    var[:] = data
    ds.close()
    # Determine file size
    print(fmt, os.stat(fname).st_size)
See the netCDF4-python documentation, section 9) "Efficient compression of netCDF variables" for details.
Just to add to the excellent answer from Funkensieper, you can copy and compress files from the command line using cdo:
cdo -f nc4c -z zip_9 copy in.nc out.nc
One could compress the files simply with gzip or zip, but the disadvantage is that you would need to decompress them before reading. Using the netCDF4 compression capabilities avoids this.
You can select your compression level X by using -z zip_X. If your files are very large, you may want to sacrifice a little file size in return for faster access times (e.g. using zip_5 or zip_6 instead of zip_9). In many cases with heterogeneous data, the compression gain is small relative to the uncompressed file.
or similarly with NCO
ncks -7 -L 9 in.nc out.nc
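To double-check that the copy is actually compressed, the netCDF4-python bindings can report the per-variable filter settings; a small sketch, assuming the out.nc produced above:
import netCDF4

# Print the compression filters applied to each variable of the compressed copy.
ds = netCDF4.Dataset("out.nc", mode="r")
for name, var in ds.variables.items():
    print(name, var.filters())  # e.g. {'zlib': True, 'complevel': 9, ...}
ds.close()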

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated, but the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check your available disk space with:
df -g
(some systems don't support -g; try -m (megabytes) or -k (kilobytes))
If you have an undersized /tmp partition, do you have another partition with 10-20 GB free? If so, tell your sort to use that directory with:
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... to reach your 10 GB * sortFactor of space or not. My experience was that it only used the last directory in the list, so try to use one directory that is big enough.
Also, note that you can go to whatever directory you are using for sort, and you'll see the activity of the temporary files used for sorting.
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs.
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise.
There are some possible solutions:
1 - Use any text-processing language (Perl, awk) to read each line, saving the line number and a hash of that line, and then compare the hashes (see the sketch after this list).
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? You could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file according to some criterion? Supposing all your lines begin with letters:
- break your original_file into smaller files, one per starting letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- check for duplicates, count them, and do whatever you want.
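As a concrete illustration of option 1, here is a minimal Python sketch (assuming the set of distinct lines, reduced to fixed-size hashes, fits in memory; the file name is a placeholder). It streams the file once, hashes each line, and counts occurrences without ever sorting the full 10 GB.
import hashlib
from collections import Counter

counts = Counter()
with open("big_file.csv", "rb") as f:
    for line in f:
        # Hash each line so memory scales with the number of distinct lines,
        # not with their length.
        counts[hashlib.sha1(line).digest()] += 1

repeated = sum(1 for c in counts.values() if c > 1)
print(len(counts), "distinct lines,", repeated, "of them repeated")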
