Compress EACH LINE of a file individually and independently of one another? (or preserve newlines) - unix

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often each line is repeated, but the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!

Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting usually lives in /tmp or /var/tmp.
So check your available disk space with:
df -g
(some systems don't support -g; try -m (megabytes) or -k (kilobytes))
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T /dr1 -T /dr2 ... options to get to your 10GB*sortFactor of space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go into whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
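Putting it together: assuming /alt/dir is on a partition with enough free space, the counting pipeline from your question might look like this (a sketch, not tested on your data):
sort -T /alt/dir FILE | uniq -c | sort -rn -T /alt/dir > line_counts.txt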
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs.
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, by answering questions in your area of expertise.

There are some possible solutions:
1 - Use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes.
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? You could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file by some criterion? Supposing all your lines begin with letters (a rough sketch follows this list):
- break your original_file into smaller files, one per leading letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
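A rough, untested sketch of option 3, combined with the counting the question asks for (assuming all lines start with lowercase ASCII letters; adjust the bucket patterns to your data):
for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    grep "^$letter" original_file > "part_$letter"           # one smaller bucket per leading letter
    sort "part_$letter" | uniq -c | sort -rn >> line_counts.txt
done
sort -rn line_counts.txt | head                              # most repeated lines overall
Each bucket is much smaller than the original file, so each sort needs far less memory and temporary space.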

Related

View NetCDF metadata without tripping on large file size / format

Summary
I need help getting NCO tools to be helpful. I'm running into the error
"One or more variable sizes violate format constraints"
... when trying to just view the list of variables in the file with:
ncdump -h isrm_v1.2.1.ncf
It seems odd to trip on this when I'm not asking for any large variables to be read ... just metadata. Are there any flags I should or could be passing to avoid this error?
Reprex
isrm_v1.2.1.ncf (165 GB) is available on Zenodo.
Details
I've just installed the NCO suite via brew install nco --build-from-source on a Mac (I know, I know) running OS X 11.6.5. ncks --version says 5.0.6.
Tips appreciated. I've been trawling through the ncks docs for a couple of hours without much insight. A friend was able to slice the file on a different system running actual Linux, so I'm pretty sure my NCO install is to blame.
How can I dig in deeper to find the root cause? NCO tools don't seem very verbose. I understand there are different sub-formats of NetCDF (3, 4, ...) but I'm not even sure how to verify the version/format of the .nc file that I'm trying to access.
My larger goal is to be able to slice it, like ncks -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc, but if I can't even view metadata, I'm thinking I need to solve that first.
The more-verbose version of the error message, for the record, is:
HINT: NC_EVARSIZE errors occur when attempting to copy or aggregate input files together into an output file that exceeds the per-file capacity of the output file format, and when trying to copy, aggregate, or define individual variables that exceed the per-variable constraints of the output file format. The per-file limit of all netCDF formats is not less than 8 EiB on modern computers, so any NC_EVARSIZE error is almost certainly due to violating a per-variable limit. Relevant limits: netCDF3 NETCDF_CLASSIC format limits fixed variables to sizes smaller than 2^31 B = 2 GiB ~ 2.1 GB, and record variables to that size per record. A single variable may exceed this limit if and only if it is the last defined variable. netCDF3 NETCDF_64BIT_OFFSET format limits fixed variables to sizes smaller than 2^32 B = 4 GiB ~ 4.2 GB, and record variables to that size per record. Any number of variables may reach, though not exceed, this size for fixed variables, or this size per record for record variables. The netCDF3 NETCDF_64BIT_DATA and netCDF4 NETCDF4 formats have no variable size limitations of real-world import. If any variable in your dataset exceeds these limits, alter the output file to a format capacious enough, either netCDF3 classic with 64-bit offsets (with -6 or --64), to PnetCDF/CDF5 with 64-bit data (with -5), or to netCDF4 (with -4 or -7). For more details, see http://nco.sf.net/nco.html#fl_fmt
Tips appreciated!
ncdump is not an NCO program, so I can't help you there, except to say that printing metadata should not cause an error in this case, so try ncks -m in.nc instead of ncdump -h in.nc.
Nevertheless, the hyperslab problem you have experienced is most likely due to trying to shove too much data into a netCDF format that can't hold it. The generic solution to that is to write the data to a more capacious netCDF format:
Try either one of these commands:
ncks -5 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
ncks -7 -v pNH4 -d layer,0 isrm_v1.2.1.ncf pNH4L0.nc
Formats are documented here
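As an aside on verifying which netCDF sub-format the file uses (asked above): assuming the standard netCDF utilities are installed alongside NCO, ncdump -k should report the file's kind, for example:
ncdump -k isrm_v1.2.1.ncf     # prints something like "classic", "64-bit offset", "cdf5", or "netCDF-4"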

How to `diff` files to create a "common" file?

I have a slew of CSS files to go through where someone just grunted through making alterations to various core stylesheets on a number of subsites. Obviously if the original developer had had some foresight they would have just included a master stylesheet and overridden the necessary elements…
I first started off with comm thinking that it might do the trick, but quickly found that it needed to receive a sorted input file.
I then switched over to diff and have gotten down to the following through some reading and research:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css
The previous obviously is almost there, but:
A) I need to grep out the ## lines (which should be fine, right? At first glance this appears right, but does diff throw in any other unexpected lines that need to be yanked?) and then
B) I need to create two more files that first is the leftover unique lines from file_1.css and then the leftover unique lines of file_2.css.
Obviously the first "in common" file will go into an include folder and then be included into the two later-created files as a @import url("common.css");
I am thinking that the following simple alteration will create the latter two files to which I'm referring:
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css
diff --unchanged-group-format='' --old-group-format='' --new-group-format="## %dn,%df%c'\012'%<" file_1.css file_2.css
Sample files:
file 1: https://gist.github.com/c13843972c47b5037704
file 2: https://gist.github.com/fff39eae386e8969dc10
So for example, upon executing a test of the following:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > common.css
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > old.css
And then searching for body with egrep "^body" *css, it yielded only one body rule in common.css and none in old.css, even though file_1.css and file_2.css contain two different body entries. So obviously this methodology is flawed.
How would one go about creating these two files that would ultimately become the common include and the override files?
@ylluminate, you have a couple of options:
1) Use BeyondCompare to visually verify the differences. It does a fantastic job comparing similar files, and it allows saving common lines / left-only lines / right-only lines. The only downside is that it is interactive, and if you have a lot of files it will take some time. On the positive side, it looks like you want to build trust first by testing it out a few times.
2) Add formatting text for --changed-group-format and capture the modified code (and the old code, as your command does now). You need to run one more comparison to get what is in the new code but not in the old code (see the sketch at the end of this answer). The downside here is that validation is going to be hard.
3) Save all the lines in a database table and compare columns. Take care to store old and new line numbers. Downsides are that the data lines need to be unique and blank lines will be chopped off.
I would go with option 1 if I had fewer than 50 files.
Hope this helps.
PS: I am not associated with BeyondCompare in any way. Just a happy user of the software.
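For option 2, a rough, untested sketch with GNU diff's group formats (%= selects unchanged lines, %< lines from the first file, %> lines from the second file) might look like this:
# lines common to both files
diff --unchanged-group-format='%=' --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css > common.css
# lines only in file_1.css, including its half of any changed groups
diff --unchanged-group-format='' --old-group-format='%<' --new-group-format='' --changed-group-format='%<' file_1.css file_2.css > only_1.css
# lines only in file_2.css, including its half of any changed groups
diff --unchanged-group-format='' --old-group-format='' --new-group-format='%>' --changed-group-format='%>' file_1.css file_2.css > only_2.css
Because the changed groups are captured by %< and %> rather than discarded, lines that differ between the two files should end up in only_1.css and only_2.css instead of being dropped.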

Using the diff command

So I am trying to compare a binary file I make when I compile with gcc against a sample executable that is provided. So I used the diff command like this:
diff asgn2 sample-asgn2
Binary files asgn2 and sample-asgn2 differ
Is there any way to see how they differ, instead of it just reporting that they differ?
Do a hex dump of the two binaries using hexdump. Then you can compare the hex dumps using your favorite diffing tool, like kdiff3, tkdiff, xxdiff, etc.
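For example, a minimal sketch with hexdump and plain diff (any of the visual tools above does the same thing more comfortably):
hexdump -C asgn2 > asgn2.hex
hexdump -C sample-asgn2 > sample-asgn2.hex
diff asgn2.hex sample-asgn2.hex | head        # show the first differing regions
Note that an inserted or deleted byte shifts every later offset, so this is most useful when the files are the same length.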
Why don't you try Vbindiff? It probably does what you want:
Visual Binary Diff (VBinDiff) displays files in hexadecimal and ASCII (or EBCDIC). It can also display two files at once, and highlight the differences between them. Unlike diff, it works well with large files (up to 4 GB).
Where to get Vbindiff depends on which operating system you are using. If Ubuntu or another Debian derivative, apt-get install vbindiff.
I'm using Linux. In my case, I need the -q option to get just the report you got:
diff -q file1 file2
Without the -q option it will show which lines differ and display them.
You may check man diff to see the right options to use on your Unix.
vbindiff only does byte-by-byte comparison. If there is just one byte added or deleted, it will mark all subsequent bytes as changed...
Another approach is to transform the binary files into text files so they can be compared with the text diff algorithm.
colorbindiff.pl is a simple, open-source perl script which uses this method and shows a colored side-by-side comparison, like a text diff. It highlights byte changes, additions, and deletions. It's available on GitHub.

Potential Dangers of ALIASing a Unix Command Starting with "."?

I'd like to use alias to make some commands for myself when searching through directories for code files, but I'm a little nervous because they start with ".". Here are some examples:
$ alias .cpps="ls -a *.cpp"
$ alias .hs="ls -a *.h"
Should I be worried about encountering any difficulties? Has anyone else done this?
What is the advantage of putting the dot in the names? It seems like an unnecessary extra character. I'd just use the base names (hs and cpps) for the aliases.
I suppose that it might be argued that the dot indicates that the command is an alias - but why is that distinction beneficial? One of the great things about Unix was that it removed the distinction between hallowed commands provided by the O/S and programs written by the user. They are all equal - just located in different places.
I don't see any real dangers with using aliases that start with a dot. It would never have occurred to me to try; I'm mildly surprised that they are allowed. But given that they are allowed, there is no real risk involved that I can see.
I wouldn't use '.' to begin your aliases because it's next to '/' and you could hit the two together by mistake and accidentally run an executable in your current directory (especially if you use tab completion).
I doubt that there's any technical problem though it's likely to be confusing to anyone who has used Unix for a long time. In my world commands don't have dots in them and file names don't have spaces or upper case letters!

What is the unix command to see how much disk space there is and how much is remaining?

I'm looking for the equivalent of right clicking on the drive in windows and seeing the disk space used and remaining info.
Look for the commands du (disk usage) and df (disk free)
Use the df command:
df -h
df -g .
The -g option reports sizes in GB blocks, and . limits the output to the file system containing the current working directory.
I love doing du -sh * | sort -hr | less to list the largest entries first (sort -h understands the human-readable sizes that du -h prints; plain sort -nr would misorder them).
If you want to see how much space each folder occupies:
du -sh *
s – summarize
h – human readable
* – list of folders
Note: The original question was answered already, but I would just like to expand on it with some extras that are relevant to the topic.
Your AIX installation would first be put into volume groups. This is done upon installation.
It will first create rootvg (as in root volume group). This is kind of like a map of your actual hard drive.
This would be equivalent to Disk Management in Windows. AIX won't use up all of that space for its file systems the way we tend to on consumer Windows machines; instead, there will be a good bit of unallocated space.
To check how much space your rootvg has, use the following command.
lsvg rootvg
That would stand for list volume group rootvg. This will give you information like the size of physical partitions (PP), Total PPs assigned to the volume group, Free PPs in the volume group, etc. Regardless, the output should be fairly comprehensive.
The next thing you may be interested in is the file systems on the volume group. Each file system is given a certain amount of space within the volume group it belongs to.
To check what file systems you have on your volume group, use the following command.
lsvgfs rootvg
As in list volume group file systems for rootvg.
You can check how much space each file system has using the following command.
df
I personally like to refine it with flags like -m and -g (for megabytes and gigabytes respectively).
If you have free space available in your volume group, you can assign it to your file systems using the following command.
chfs -a size=+1G /home
As in: change file system attribute size by adding 1 GB, where the file system is /home. Use man chfs for more instructions. This is a powerful tool; this example only adjusts size, but you can do more with this command than that.
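For example, a sketch of the whole workflow on a hypothetical /home file system (the volume group must have free PPs for chfs to draw from):
lsvg rootvg                 # check the Free PPs available in the volume group
df -g /home                 # current size and free space of the file system
chfs -a size=+1G /home      # grow /home by 1 GB
df -g /home                 # confirm the new size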
Sources:
http://www.ibm.com/developerworks/aix/library/au-rootvg/
+ My own experience working with AIX.
All these answers are superficially correct. However, the proper answer is
apropos disk # And pray your admin maintains the whatis database
because asking questions whose answers lie at your fingertips in the manual wastes everybody's time.
du -sm ./*
This lists the size of every file and folder in the current directory (-sm for MB, -sk for KB). It works in any Unix/Linux environment.
du -sm * => RULLLLLEZ
df -tk
for Disk Free size in 1024 byte blocks
