View NetCDF metadata without tripping on large file size / format - netcdf

Summary
I need help getting NCO tools to be helpful. I'm running into the error
"One or more variable sizes violate format constraints"
... when trying to just view the list of variables in the file with:
ncdump -h isrm_v1.2.1.ncf
It seems odd to trip on this when I'm not asking for any large variables to be read ... just metadata. Are there any flags I should or could be passing to avoid this error?
Reprex
isrm_v1.2.1.ncf (165 GB) is available on Zenodo.
Details
I've just installed the NCO suite via brew install nco --build-from-source on a Mac (I know, I know) running macOS 11.6.5. ncks --version says 5.0.6.
Tips appreciated. I've been trawling through the ncks docs for a couple of hours without much insight. A friend was able to slice the file on a different system running actual Linux, so I'm pretty sure my NCO install is to blame.
How can I dig deeper to find the root cause? NCO tools don't seem very verbose. I understand there are different sub-formats of NetCDF (3, 4, ...) but I'm not even sure how to verify the version/format of the .nc file that I'm trying to access.
My larger goal is to be able to slice it, like ncks -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc, but if I can't even view metadata, I'm thinking I need to solve that first.
The more-verbose version of the error message, for the record, is:
HINT: NC_EVARSIZE errors occur when attempting to copy or aggregate input files together into an output file that exceeds the per-file capacity of the output file format, and when trying to copy, aggregate, or define individual variables that exceed the per-variable constraints of the output file format. The per-file limit of all netCDF formats is not less than 8 EiB on modern computers, so any NC_EVARSIZE error is almost certainly due to violating a per-variable limit. Relevant limits: netCDF3 NETCDF_CLASSIC format limits fixed variables to sizes smaller than 2^31 B = 2 GiB ~ 2.1 GB, and record variables to that size per record. A single variable may exceed this limit if and only if it is the last defined variable. netCDF3 NETCDF_64BIT_OFFSET format limits fixed variables to sizes smaller than 2^32 B = 4 GiB ~ 4.2 GB, and record variables to that size per record. Any number of variables may reach, though not exceed, this size for fixed variables, or this size per record for record variables. The netCDF3 NETCDF_64BIT_DATA and netCDF4 NETCDF4 formats have no variable size limitations of real-world import. If any variable in your dataset exceeds these limits, alter the output file to a format capacious enough, either netCDF3 classic with 64-bit offsets (with -6 or --64), to PnetCDF/CDF5 with 64-bit data (with -5), or to netCDF4 (with -4 or -7). For more details, see http://nco.sf.net/nco.html#fl_fmt
Tips appreciated!

ncdump is not an NCO program, so I can't help you there, except to say that printing metadata should not cause an error in this case. Try ncks -m in.nc instead of ncdump -h in.nc.
Nevertheless, the hyperslab problem you have experienced is most likely due to trying to shove too much data into a netCDF format that can't hold it. The generic solution to that is to write the data to a more capacious netCDF format:
Try either of these commands:
ncks -5 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
ncks -7 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
Formats are documented here
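Since the question also asks how to verify the version/format of the .nc file, here is a quick sketch; ncdump -k and file(1) are standard netCDF/Unix utilities rather than NCO programs:
ncdump -k isrm_v1.2.1.ncf   # prints the kind, e.g. "classic", "64-bit offset", "cdf5", or "netCDF-4"
file isrm_v1.2.1.ncf        # distinguishes classic netCDF from HDF5-based netCDF-4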

Related

Trouble concatenating netcdf files with ncrcat

I have a list of netcdf files that I am trying to concatenate along the time dimension.
I am attempting to use the steps outlined here, which seem simple enough. However, I am running into some errors (likely some small/stupid oversight on my part...)
When I try to first make time a record dimension, I am using the following command:
ncks -O --mk_rec_dmn time TiMREX_20080526_000001.nc test_out.nc
This, however, gives me the following error:
ncks: invalid option -- '-'
It seems like this is just some simple syntax/typo error on my part, but try as I might I can't find anything wrong.
Just to be sure, when I run an ncdump -h on the file, it confirms that there is indeed a time dimension:
ncdump -h TiMREX_20080526_000001.nc
netcdf TiMREX_20080526_000001 {
dimensions:
time = 1 ;
bounds = 2 ;
x0 = 300 ;
y0 = 300 ;
z0 = 40 ;
Additionally, if I try to skip this step and just go right to the ncrcat part...
ncrcat -O TiMREX_20080526_000001.nc TiMREX_20080526_000733.nc test_out.nc
I get the following error:
ncopen: filename "TiMREX_20080526_000001.nc": Not a netCDF file
Which is especially odd... I'm pretty confident it is indeed a netCDF file (I just ran ncdump on it after all, and have no problem viewing it with ncview...)
Any thoughts? What simple step am I embarrassingly missing?
This is a weird error as your command looks syntactically correct. To be sure, I copied it to my machine where it ran as expected, with no 'invalid option' error. Thus I am unable to reproduce the problem. Based on the error message you report, it seems as though you might (somehow) be using a character that the system does not understand as a dash. In other words, the error you report is what I would expect if ncks received a funky character that looks like a dash but is not really a dash. Maybe when you copied it to Stack Overflow it got converted to a real dash, so it works for me (try copying your own command above back into your console). Make sure the dash character you type is the same as the minus sign on a normal keyboard, and not something else. Some keyboards/character sets produce characters that look similar to dashes but are not ASCII dashes. Good luck.
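If you want to check whether the dash you typed is really an ASCII hyphen, one possible sketch (this assumes od is available, and the echoed string is whatever you actually pasted):
echo '--mk_rec_dmn time' | od -c | head
# An ASCII hyphen shows up as a plain "-" in the output; a Unicode en dash
# shows up as the three-byte sequence 342 200 223 instead.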

How can I download multiple objects from S3 simultaneously?

I have lots (millions) of small log files in S3, with names (date/time) that identify them, i.e. servername-yyyy-mm-dd-HH-MM, e.g.
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
...
s3://my_bucket/uk4339-2015-05-07-19-23.csv
s3://my_bucket/uk4339-2015-05-07-19-24.csv
...
etc
From EC2, using the AWS CLI, I would like to simultaneously download all files that have the minute equal to 16 in 2015, for only servers uk4339 and uk4338.
Is there a clever way to do this?
Also if this is a terrible file structure in s3 to query data, I would be extremely grateful for any advice on how to set this up better.
I can put a relevant aws s3 cp ... command into a loop in a shell/bash script to sequentially download the relevant files, but I was wondering if there was something more efficient.
As an added bonus, I would like to row-bind the results together into one CSV.
A quick example of a mock csv file can be generated in R using this line of R code
R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE)
The csv that is created is uk4339-2015-05-07-19-24.csv. FYI, I will be importing the combined data into R at the end.
As you didn't answer my questions, nor indicate what OS you use, it is somewhat hard to make any concrete suggestions, so I will briefly suggest you use GNU Parallel to parallelise your S3 fetch requests to get around the latency.
Suppose you somehow generate a list of all the S3 files you want and put the resulting list in a file called GrabMe.txt like this (one way to build it is sketched after the list):
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
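One possible way to build such a list, as a sketch: it assumes the files sit at the bucket root and that aws s3 ls prints its usual four columns with the key in the fourth field.
aws s3 ls s3://my_bucket/ \
  | awk '{print $4}' \
  | grep -E '^uk43(38|39)-2015-[0-9]{2}-[0-9]{2}-[0-9]{2}-16\.csv$' \
  | sed 's|^|s3://my_bucket/|' > GrabMe.txt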
Then you can get them in parallel, say 32 at a time, like this:
parallel -j 32 echo aws s3 cp {} . < GrabMe.txt
or if you prefer reading left-to-right
cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} .
You can obviously alter the number of parallel requests from 32 to any other number. At the moment, it just echoes the command it would run, but you can remove the word echo when you see how it works.
There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.
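For the bonus question of row-binding the results into a single CSV, a minimal sketch: it assumes every file has the same one-line header, and it captures the glob before creating combined.csv so the output file never matches itself.
set -- *.csv
head -n 1 "$1" > combined.csv        # header from the first file only
tail -q -n +2 "$@" >> combined.csv   # data rows from every file (GNU tail's -q suppresses filename headers)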

Using the diff command

So I am trying to compare a binary file I make when I compile with gcc to a sample executable that is provided. So I used the diff command like this:
diff asgn2 sample-asgn2
Binary files asgn2 and sample-asgn2 differ
Is there any way to see how they differ, instead of it just displaying that they differ?
Do a hex dump of the two binaries using hexdump. Then you can compare the hex dumps using your favorite diffing tool, like kdiff3, tkdiff, xxdiff, etc.
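A minimal sketch of that approach, assuming hexdump is installed:
hexdump -C asgn2 > asgn2.hex
hexdump -C sample-asgn2 > sample-asgn2.hex
diff asgn2.hex sample-asgn2.hex | less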
Why don't you try Vbindiff? It probably does what you want:
Visual Binary Diff (VBinDiff) displays files in hexadecimal and ASCII (or EBCDIC). It can also display two files at once, and highlight the differences between them. Unlike diff, it works well with large files (up to 4 GB).
Where to get Vbindiff depends on which operating system you are using. If Ubuntu or another Debian derivative, apt-get install vbindiff.
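Basic usage, once installed, is just the two files as arguments, something like:
vbindiff asgn2 sample-asgn2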
I'm using Linux. In my case, I need the -q option to show just what you got:
diff -q file1 file2
Without the -q option it will show which lines differ and display those lines.
You may check man diff to see the right option to use on your UNIX.
vbindiff only does byte-by-byte comparison. If there is just one byte added or deleted, it will mark all subsequent bytes as changed...
Another approach is to transform the binary files into text files so they can be compared with the text diff algorithm.
colorbindiff.pl is a simple, open-source Perl script which uses this method and shows a colored side-by-side comparison, like in a text diff. It highlights byte changes/additions/deletions. It's available on GitHub.

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check out your available disk space with:
df -g
(some systems don't support -g; try -m for megabytes or -k for kilobytes)
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T /dir1 -T /dir2 ... options to get to your 10GB*sortFactor space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go to whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
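Putting it together with the pipeline from the question, a sketch that assumes /alt/dir has 10-20 GB free (line_counts.txt is just a hypothetical output name):
sort -T /alt/dir FILE | uniq -c | sort -T /alt/dir -rn > line_counts.txt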
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise
There are some possible solutions:
1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes
2 - Can / Want to remove the duplicate lines, leaving just one occurrence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile   # prints each line only the first time it is seen
3 - Why not split the file according to some criteria? Supposing all your lines begin with letters:
- break your original_file into smaller files, one per starting letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want (see the sketch below).
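A sketch of option 3 end to end, assuming lines start with lowercase letters, a bash shell, and a hypothetical output file line_counts.txt:
for letter in {a..z}; do
    grep "^$letter" original_file > "${letter}_file"
    sort "${letter}_file" | uniq -c | sort -rn >> line_counts.txt
done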

What is the unix command to see how much disk space there is and how much is remaining?

I'm looking for the equivalent of right clicking on the drive in windows and seeing the disk space used and remaining info.
Look for the commands du (disk usage) and df (disk free)
Use the df command:
df -h
df -g .
The -g option reports sizes in GB blocks, and . means the current working directory.
I love doing du -sh * | sort -hr | less to sort by the largest files first (GNU sort's -h understands the human-readable sizes that du -h prints).
If you want to see how much space each folder occupies:
du -sh *
s – summarize
h – human readable
* – list of folders
Note: The original question was answered already, but I would just like to expand on it with some extras that are relevant to the topic.
Your AIX installation would first be put into volume groups. This is done upon installation.
It will first create rootvg (as in root volume group). This is roughly your actual hard drive mapped out.
This is equivalent to Disk Management in Windows. AIX won't use up all of that space for its file systems the way we tend to on consumer Windows machines. Instead, there will be a good bit of unallocated space.
To check how much space your rootvg has, use the following command.
lsvg rootvg
That would stand for list volume group rootvg. This will give you information like the size of physical partitions (PP), Total PPs assigned to the volume group, Free PPs in the volume group, etc. Regardless, the output should be fairly comprehensive.
The next thing you may be interested in is the file systems in the volume group. Each file system is given a certain amount of space within the volume group it belongs to.
To check what file systems you have in your volume group, use the following command.
lsvgfs rootvg
As in list volume group file systems for rootvg.
You can check how much space each file system has using the following command.
df
I personally like to refine it with flags like -m and -g (in megabytes and gigabytes respectively)
If you have free space available in your volume group, you can assign it to your file systems using the following command.
chfs -a size=+1G /home
As in: change the file system attribute size by adding 1 GB, where the file system is /home. Use man chfs for more instructions. This is a powerful tool; this example only adjusts the size, but you can do more with this command than that.
Sources:
http://www.ibm.com/developerworks/aix/library/au-rootvg/
+ My own experience working with AIX.
All these answers are superficially correct. However, the proper answer is
apropos disk # And pray your admin maintains the whatis database
because asking questions whose answers lie at your fingertips in the manual wastes everybody's time.
du -sm ./*
You can see every file and folder size in the current directory as a list (-sm for MB; -sk for KB). This works in any Unix/Linux environment.
du -sm * => RULLLLLEZ
df -tk
for disk free space in 1024-byte blocks
