Different results from awk and nawk - unix

I just ran these two commands on a file having around 250 million records.
awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
and
nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
The record length is 482. The first command gave the correct number of records in file2.txt,
i.e. 60 million, but the nawk command gives only 4.2 million.
I am confused and would like to know whether anyone has come across an issue like this. How exactly is this simple command being treated differently internally? Is there a buffer that can hold only up to a certain number of bytes when using nawk?
I would appreciate it if someone could shed some light on this.
My OS details are
SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc

The difference probably lies in the record buffer limit of nawk. One of the records (lines) in your input file has probably exceeded it.
This crucial line can be found in awk.h:
#define RECSIZE (8 * 1024) /* sets limit on records, fields, etc., etc. */
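One way to test this theory is to measure the longest line in the input and compare it with the 8192-byte limit. The following is only a sketch: sample.txt is a hypothetical stand-in for file1.txt, and on Solaris it would make sense to run the check with an awk you trust (for example /usr/xpg4/bin/awk) rather than the one under suspicion.

```shell
# Build a tiny stand-in for file1.txt (hypothetical data):
# two short lines and one 9000-character line.
tmp=$(mktemp -d)
printf 'short\n' > "$tmp/sample.txt"
awk 'BEGIN { for (i = 0; i < 9000; i++) printf "x"; print "" }' >> "$tmp/sample.txt"
printf 'also short\n' >> "$tmp/sample.txt"

# Track the longest line seen; a value above 8192 would exceed RECSIZE.
longest=$(awk '{ if (length($0) > max) max = length($0) }
               END { print max + 0 }' "$tmp/sample.txt")
echo "longest line: $longest"
```

If the reported length stays at or below the expected record length, the buffer limit is probably not the cause.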

Your command can be reduced to just this:
awk 'substr($0,472,1)==9'
On Solaris (which you are on), running awk by default gives you the old, broken awk (/usr/bin/awk), so I suspect that nawk is the one producing the correct result.
Run /usr/xpg4/bin/awk with the same script and arguments and see which of your other results its output agrees with.
Also, check whether your input file was created on Windows: run dos2unix on it and see if its size changes. If it does, re-run your awk commands on the converted file. A file created on Windows will have control-Ms (carriage returns) in it that could be causing chaos.
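A quick way to look for control-Ms without modifying the file is to count the lines that end in a carriage return. This is a minimal sketch using a hypothetical sample.txt in place of the real input:

```shell
# Hypothetical sample: one Unix-style line and two Windows-style (CRLF) lines.
tmp=$(mktemp -d)
printf 'unix line\nwindows line\r\nanother windows line\r\n' > "$tmp/sample.txt"

# Hold a literal carriage return in a variable, then count lines ending in it.
cr=$(printf '\r')
crlf_lines=$(grep -c "${cr}\$" "$tmp/sample.txt")
echo "lines ending in CR: $crlf_lines"
```

A non-zero count suggests the file has Windows line endings, in which case each record is one byte longer than expected.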

Related

nested for loop too slow: 1MN record traversal

I have a large number of file names, around 200,000 records in a file. I have been testing some cases in which I have to figure out whether the naming patterns of the files match some specific strings. Here's how I proceeded:
I stored the test strings in a file (say 10 of them for one case). The actual file contains string records separated by newlines, totaling up to 200,000 records. To check whether the test string patterns are present in the large file, I wrote a small nested for loop.
for i in `cat TestString.txt`
do
for j in `cat LargeFile.txt`
do
if [[ $i == $j ]]
then
echo "Match" >> result.txt
fi
done
done
This nested loop actually has to do the traversal (if I'm not wrong about the concepts) 10 x 200,000 times. Normally I wouldn't consider that too much of a load on the server, but it is taking a very long time. The script has been running for the past 4 hours, with, of course, some "Match" results.
Does anyone have any idea how to speed this up? I've found many answers that use Python or Perl, but I'm honestly looking for something in Unix.
Thanks
Try the following:
grep -f TestString.txt LargeFile.txt >> result.txt
Check out grep
while IFS= read -r line
do
grep "$line" LargeFile.txt >> result.txt
done < TestString.txt
grep will output any matching strings. This may be faster. Note that your TestString.txt file should not have any blank lines or grep will return everything from LargeFile.txt.
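The loop in the question compares whole lines for equality, whereas plain grep -f treats each test string as a regular expression and matches it anywhere in a line. If exact whole-line matches are what is wanted, the -F (fixed strings) and -x (whole line) options can be added. A small sketch with made-up file contents:

```shell
# Hypothetical sample data standing in for the real files.
tmp=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmp/TestString.txt"
printf 'alpha\nalphabet\ngamma\nbeta\n' > "$tmp/LargeFile.txt"

# -F: patterns are fixed strings, -x: a pattern must match the entire line,
# -f: read the patterns from a file. LargeFile.txt is scanned only once.
matches=$(grep -Fxf "$tmp/TestString.txt" "$tmp/LargeFile.txt")
printf '%s\n' "$matches"
```

Here "alphabet" is not reported, because -x requires the whole line to equal a test string.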

How to cat using part of a filename in terminal?

I'm using terminal on OS 10.X. I have some data files of the format:
mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
mbh5.0_mrg4.54545454545_period0.00077271543854.params.dat
mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
mbh5.0_mrg4.59090909091_period-0.000402015664015.params.dat
I know that there will be some files with similar numbers after mbh and mrg, but I won't know ahead of time what the numbers will be or how many similarly numbered ones there will be. My goal is to cat all the data from all the files with similar numbers after mbh and mrg into one data file. So from the above I would want to do something like...
cat mbh5.0_mrg4.54545454545*dat > mbh5.0_mrg4.54545454545.dat
cat mbh5.0_mrg4.5909090909*dat > mbh5.0_mrg4.5909090909.dat
I want to automate this process because there will be many such files.
What would be the best way to do this? I've been looking into sed, but I don't have a solution yet.
for file in *.params.dat; do
prefix=${file%_*}
cat "$file" >> "$prefix.dat"
done
The expansion ${file%_*} removes the last underscore and everything after it from the end of $file, and the result is saved in the prefix variable. (Ref: http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion)
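To see what the expansion does in isolation, it can be applied to one of the file names from the question. The %% form is shown only for contrast; it is not part of the answer above.

```shell
file='mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat'

# % removes the shortest suffix matching _*, i.e. from the last underscore on.
prefix=${file%_*}

# %% removes the longest suffix matching _*, i.e. from the first underscore on.
longest_cut=${file%%_*}

echo "$prefix"        # mbh5.0_mrg4.54545454545
echo "$longest_cut"   # mbh5.0
```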
It's not 100% clear to me what you're trying to achieve here but if you want to aggregate files into a file with the same number after "mbh5.0_mrg4." then you can do the following.
ls -l mbh5.0_mrg4* | awk '{print "cat " $9 " >> mbh5.0_mrg4." substr($9,13,11) ".dat" }' | /bin/bash
The "ls -l" lists the files and the "awk" takes the 9th column (the file name) from each line of the ls output. With some string concatenation, a cat command is built for each file and passed to /bin/bash to be executed.
This is a Linux bash script, so it assumes you have /bin/bash; I'm not 100% familiar with OS X. The script also assumes that the number you're grouping on is always in the same place in the file name. I think you can change /bin/bash to almost any shell you have installed.

How can I remove common occurrences between 2 text files using the unix environment?

Ok so I'm still learning the command line stuff like grep and diff and their uses within the scope of my project, but I can't seem to wrap my head around how to approach this problem.
So I have 2 files, each containing hundreds of 20-character-long strings. Let's call the files A and B. I want to search through A and, using the values in B as keys, locate UNIQUE string entries that occur in A but not in B (there are duplicates, so unique is the key here).
Any Ideas?
Also I'm not opposed to finding the answer myself, but I don't have a good enough understanding of the different command line scripts and their functions to really start thinking of how to use them together.
There are two ways to do this. With comm or with grep, sort, and uniq.
comm
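comm -23 afile bfile
comm compares two sorted files and outputs 3 columns: lines only in afile, lines only in bfile, and lines in common. The -2 and -3 switches tell comm not to print the second and third columns, which leaves the lines found only in afile. Sort both files first (sort -u also drops the duplicates).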
grep sort uniq
grep -F -v -f bfile afile | sort | uniq
or just
grep -F -v -f bfile afile | sort -u
if your sort handles the -u option.
(note: the command fgrep if your system has it, is equivalent to grep -F.)
Look up the comm command (POSIX comm) to do this. See also Unix command to find lines common in two files.
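A small worked example of the comm approach may help. The file contents below are made up, and the intermediate a.sorted and b.sorted names are hypothetical; the sketch assumes the goal is the unique lines of A that do not appear in B.

```shell
# Hypothetical sample data: afile has a duplicate, bfile shares one entry.
tmp=$(mktemp -d)
printf 'cherry\napple\napple\nbanana\n' > "$tmp/afile"
printf 'banana\ndate\n' > "$tmp/bfile"

# comm needs sorted input; sort -u also collapses duplicate lines.
# LC_ALL=C keeps sort and comm agreeing on the collation order.
LC_ALL=C sort -u "$tmp/afile" > "$tmp/a.sorted"
LC_ALL=C sort -u "$tmp/bfile" > "$tmp/b.sorted"

# -2 suppresses lines only in bfile, -3 suppresses lines common to both.
only_in_a=$(LC_ALL=C comm -23 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$only_in_a"
```

With this data the output is apple and cherry: banana is dropped because it is also in bfile, and the duplicate apple appears once.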

grep -f alternative for huge files

grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some sql commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C. It switches the locale from UTF-8 to plain ASCII (the C locale), which speeds the search up by about 140 times. I have a 26 GB file that went from around 12 hours down to a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
I don't think there is an easy solution.
Imagine you write your own program which does what you want and you will end up with a nested loop, where the outer loop iterates over the lines in file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2). This will be a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of not giving the correct result anymore.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
You can try ack. It is said to be faster than grep.
You can try parallel :
parallel --progress -a file1 'grep -F {} file2'
Parallel has got many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings without using regexes, so it could be slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep, but in some awk implementations without bounded quantifiers ({1,7}), and without the GNU extensions for word boundaries (\b) and shorthand character classes (\s, \w, etc.).
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.
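Both variants can be tried on a toy data set before running them on the real files. The sketch below uses made-up file contents. It also swaps in ($0 in query) for the bare query[$0] test in the exact-match case: in awk, merely referencing query[$0] creates an empty entry for every distinct line of file2, so the membership test should keep the array from growing while the large file is scanned.

```shell
# Hypothetical sample data standing in for file1 (queries) and file2 (haystack).
tmp=$(mktemp -d)
printf 'foo\nbar\n' > "$tmp/file1"
printf 'baz\nfoo\nfoobar\nbar\n' > "$tmp/file2"

# Exact whole-line matches: load file1 into a hash, then test membership.
exact=$(awk 'NR == FNR { query[$0] = 1; next }
             ($0 in query)' "$tmp/file1" "$tmp/file2")

# Substring matches: loop over every query for each line of file2.
substring=$(awk 'NR == FNR { query[$0] = 1; next }
                 { for (q in query) if (index($0, q)) { print; next } }' \
                "$tmp/file1" "$tmp/file2")

printf 'exact:\n%s\n' "$exact"
printf 'substring:\n%s\n' "$substring"
```

The exact variant reports foo and bar, while the substring variant also reports foobar.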

unix command to read line from a file by passing line number

I am looking for a unix command to get a single line by passing line number to a big file (with around 5 million records). For example to get 10th line, I want to do something like
command file-name 10
Is there any such command available? We can do this by looping through each record but that will be time consuming process.
This forum entry suggests:
sed -n '52p' (file)
for printing the 52nd line of a file.
There are many other ways to do it, along with related techniques.
If you want multiple lines to be printed,
sed -n -e 'Np' -e 'Mp'
Where N and M are the numbers of the only lines that will be printed. See 10 Awesome Examples for Viewing Huge Log Files in Unix.
command | sed -n '10p'
or
sed -n '10p' file
You could do something like:
head -n<lineno> <file> | tail -n1
That would give you the first <lineno> lines, then only the last line of that output (your line).
Edit: It seems all the solutions here are pretty slow. However, by definition you have to iterate through all the records, because files are byte-oriented and the operating system has no notion of lines. (In some sense, all these programs do is count \n or \r characters.) In lieu of a great answer, here are the timings on my system for several of these commands:
[mjschultz#mawdryn ~]$ time sed -n '145430980p' br.txt
0b10010011111111010001101111010111
real 0m25.871s
user 0m17.315s
sys 0m2.360s
[mjschultz#mawdryn ~]$ time head -n 145430980 br.txt | tail -n1
0b10010011111111010001101111010111
real 0m41.112s
user 0m39.385s
sys 0m4.291s
[mjschultz#mawdryn ~]$ time awk 'NR==145430980{print;exit}' br.txt
0b10010011111111010001101111010111
real 2m8.835s
user 1m38.076s
sys 0m3.337s
So, on my system, it looks like the sed -n '<lineno>p' <file> solution is fastest!
you can use awk
awk 'NR==10{print;exit}' file
Put an exit after printing the 10th line so that awk won't process the rest of the 5-million-record file.
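The same early-exit idea applies to sed: adding the q command after p makes sed stop reading once the requested line has been printed. The sketch below wraps this in a small shell function to get something close to the "command file-name 10" form asked for. The name print_line is hypothetical, and numbers.txt is a made-up test file.

```shell
# print_line FILE N: print line N of FILE, then stop reading.
# (Hypothetical helper name; N is assumed to be a positive integer.)
print_line() {
  sed -n "${2}{p;q;}" "$1"
}

# Build a 100-line test file with awk, since seq is not available everywhere.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 100; i++) print "line " i }' > "$tmp/numbers.txt"

tenth=$(print_line "$tmp/numbers.txt" 10)
echo "$tenth"
```

The file still has to be read up to the requested line, so asking for a line near the end of a large file remains slow.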
