Exact limit to avoid "Argument list too long " - unix

I know that there are some tricks to avoid the shell's limit, which leads to "argument list too long", but I want to understand why the limit hits in my case (even though it should not). As far as I know the limit of chars in an argument for a command should be able to be determined by the following steps:
Get the maximum argument size by getconf ARG_MAX
Subtract the content of you environment retrieved by env|wc -c
On my machine with Fedora 30 and zsh 5.7.1 this should allow me argument lists with a length of up to 2085763 chars. But I already hit the limit with only 1501000 chars. What did I miss in my calculation?
Minimal working example for reproduction:
Setting up files:
$ for i in {10000..100000}; do touch Testfile_${i}".txt"; done
$ ls Testfile*
zsh: argument list too long: ls
No I deleted stepwise (1000 files per step) files to check when the argument line was short enough to be handled again
for i in {10000..100000..1000}; do echo $(ls|wc -l); rm Testfile_{$i..$((i + 1000))}.txt; ls Testfile_*|wc -l; done
The message zsh: argument list too long: ls stops between 79000 and 78000 remaining files. Each filename has a length of 18 chars (19, including the separating whitespace), so in total at this moment the argument line should have a total length of 79000*19=1501000 respectively 78000*19=1482000 chars.
This result is the same magnitude in comparison to the expected value of 2085763 chars but still it's slighty off. What could explain the difference of 500000 chars?
ADDENDUM1:
Like suggested in the comments I ran xargs --show-limits and the output fits round about my expectation.
$ xargs --show-limits
Your environment variables take up 4783 bytes
POSIX upper limit on argument length (this system): 2090321
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085538
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647
ADDENDUM2:
Following the comment of #Jens I now added 9 Bytes additional overhead to the words (8 Bytes for the pointer, 1 for the terminating NUL-Byte). Now I get the following results (I do not know, how the whitespace is handled, for the moment I leave it out):
79000*(18+9)= 2133000
78000*(18+9)= 2106000
Both values are much closer to the theoretical limit than before...indeed, they are even a bit above it. So together with some safety margin I'm more confident to preestimate the maximal argument length.
Further reading:
There are more posts about this topic, of which none answers the question in a satisfying way, but still they provide good material:
https://unix.stackexchange.com/a/120842/211477
https://www.in-ulm.de/~mascheck/various/argmax/

If you where looking to count files, this worked for me on Mac OSX Ventura (13.1)
find . -maxdepth 2 -name "*.zip" | wc -l
I had 1039999 zip files and the standard "ls /.zip | wc -l" just died ("zsh: argument list too long: ls")

Related

/usr/xpg4/bin/grep -q [^0-9] does not always work as expected

I have a Unix ksh script that has been in daily use for years (kicked off at night by the crontab). Recently one function in the script is behaving erratically as never happened before. I tried various ways to find out why, but have no success.
The function validates an input string, which is supposed to be a string of 10 numeric characters. The function checks if the string length is 10, and whether it contains any non-numeric characters:
#! /bin/ksh
# The function:
is_valid_id () {
# Takes one argument, which is the ID being tested.
if [[ $(print ${#1}) -ne 10 ]] || print "$1" | /usr/xpg4/bin/grep -q [^0-9] ; then
return 1
else
return 0
fi
}
cat $input_file | while read line ; do
id=$(print $line | awk -F: '{print $5}')
# Calling the function:
is_valid_id $id
stat=$?
if [[ $stat -eq 1 ]] ; then
print "The ID $id is invalid. Request rejected.\n" >> $ERRLOG
continue
else
...
fi
done
The problem with the function is that, every night, out of scores or hundreds of requests, it finds the IDs in several requests as invalid. I visually inspected the input data and found that all the "invalid" IDs are actually strings of 10 numeric characters as should be. This error seems to be random, because it happens with only some of the requests. However, while the rejected requests persistently come back, it is consistently the same IDs that are picked out as invalid day after day.
I did the following:
The Unix machine has been running for almost a year, therefore might need to be refreshed. The system admin to reboot the machine at my request. But the problem persists after the reboot.
I manually ran exactly the same two tests in the function, at command prompt, and the IDs that have been found invalid at night are all valid.
I know the same commands may behave differently invoked manually or in a script. To see how the function behaves in script, the above code excerpt is the small script I ran to reproduce the problem. And indeed, some (though not all) of the IDs found to be invalid at night are also found invalid by the small trouble-shooting script.
I then modified that troubleshooting script to run the two tests one at a time, and found it is the /usr/xpg4/bin/grep -q [^0-9] test that erroneously finds some of the ID as containing non-numeric character(s). Well, the IDs are all numeric characters, at least visually.
I checked if there is any problem with the xpg4 grep command file (ls -l /usr/xpg4/bin/grep), to see if it is put there recently. But its timestamp is year 2005 (this machine runs Solaris 10).
Knowing that the data comes from a central ERP system, to which data entry is performed from different locations using all kinds of various terminal machines running all kinds of possible operating systems that support various character sets and encodings. The ERP system simply allows them. But can characters from other encodings visually appear as numeric characters but the encoded values are not as the /usr/xpg4/bin/grep command expects to be on our Unix machine? I tried the od (octal dump) command but it does not help me much as I am not familiar with it. Maybe I need to know more about od for solving this problem.
My temporary work-around is omitting the /usr/xpg4/bin/grep -q [^0-9] test. But the problem has not been solved. What can I try next?
Your validity test function happens to be more complicated than it should be. E.g. why do you use a command substitution with print for ${#1}? Why don't you use ${#1} directly? Next, forking grep to test for a non-number is a slow and expensive operation. What about this equivalent function, 100% POSIX and blazingly fast:
is_valid_id () {
# Takes one argument, which is the ID being tested.
if test ${#1} -ne 10; then
return 1 # ID length not exactly 10.
fi
case $1 in
(*[!0-9]*) return 1;; # ID contains a non-digit.
(*) return 0;; # ID is exactly 10 digits.
esac
}
Or even more simple, if you don't mind repeating yourself:
is_valid_id () {
# Takes one argument, which is the ID being tested.
case $1 in
([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) # 10 digits.
return 0;;
(*)
return 1;;
esac
}
This also avoids your unquoted use of a grep pattern, which is error-prone in the presence of one-character file names. Does this work better?

How old is file?

I have a shell script that will check a file is how many days old. I did stat -f "%m%t%Sm %N" "$file" . But I want to store this into a variable and then compare current time and file created time ?
Assuming you're using bash, you can capture the output of commands with something like:
fdate=$(stat -f "%m%t%Sm %N" "$file")
and then do whatever you will with the results:
echo ${fdate}
That's assuming the command itself works in the first place. If you are, you can ignore the text below.
The GNU stat program uses -f to specify you want to query the filesystem rather than a file and the other options you have don't seem to make sense in the context of your question.
Using Gnu stat, you can get the time since the last file update(1) as:
ageInSeconds=$(($(date -u +%s) - $(stat --printf "%Y" "file")))
The subtracts the last modification time of the file from the current time (both expressed as seconds since the epoch) to give you the age in seconds.
To turn that into days, assuming you're not overly concerned about the possible error from leap seconds (an error of, at most, one part in about 15.7 million, or 0.000006%), you can just divide it by 86,400:
ageInDays=$((($(date -u +%s) - $(stat --printf "%Y" "file")) / 86400))
(1) Note that, although stat purports to have a %W format specifier that gives the birth of the file, this doesn't always work (it returns zero). You could check that first if you're really interested in when the file was created rather than last updated but you may have to be prepared to accept the possibility the information is not available. I've used last modification time above since, frequently, it's used for things like detecting changes.

grep -f alternative for huge files

grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some sql commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C . It turns the searching pattern from UTF-8 to ASCII which speeds up by 140 time the original speed. I have a 26G file which would take me around 12 hours to do down to a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
I don't think there is an easy solution.
Imagine you write your own program which does what you want and you will end up with a nested loop, where the outer loop iterates over the lines in file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2). This will be a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of not giving the correct result anymore.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
Learning SQL is certainly a good idea, because learning something is always good. It will hovever, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
You can try ack. They are saying that it is faster than grep.
You can try parallel :
parallel --progress -a file1 'grep -F {} file2'
Parallel has got many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings without using regexes, so it could be slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s,\w, etc).
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.

grep -f maximum number of patterns?

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn't like this (who, knew?). After a day, it didn't produce anything. Smaller lists work almost instantaneously.
I was thinking I might split my long list up and do it a few times. Any idea what a good maximum length for the pattern list might be?
Also, I'm rather new with unix. Alternative approaches are welcome. The list of patterns, or search terms, are in a plaintext file, one per line.
Thank you everyone for your guidance.
From comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will increase the speed of the matching considerably. (Using 479,000 strings to match on an input file with 3 lines using -F takes under 1.5 seconds on a moderately powered machine. Not using -F, that same machine is not yet finished after several minutes.)
i got the same problem with approx. 4 million patterns to search for in a file with 9 million lines. Seems like it is a problem of RAM. so i got this neat little work around which might be slower than splitting and joining but it just need this one line.
while read line; do grep $line fileToSearchIn;done < patternFile
I needed to use the work around since the -F flag is no solution for that large files...
EDIT: This seems to be really slow for large files. After some more research i found 'faSomeRecords' and really other awesome tools from Kent NGS-editing-Tools
I tried it on my own by extracting 2 million fasta-rec from 5.5million records file. Took approx. 30 sec..
cheers
EDIT: direct download link
Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions.
Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity.
Run the script in a terminal window with a command of the form
./timeScript keyFile textFile 100 > outputFile
This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
In a second terminal, run the command
tail -f outputFile
which will keep track of the output of your other process into the file outputFile
I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.
This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.
You could take the output of the script and put it through a grep:
grep entire outputFile
You will end up with just the summaries - block size, and time taken, e.g.
Time for processing entire file with blocksize 800: 4 seconds
If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l which I did just to see how many lines were matched)...
#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
#
# split f1 into chunks of 1, 2, 4, 8... expressions at a time
# and print out how long it took to process all the lines in f2
if (($# < 2 )); then
echo Warning: need at leasttwo parameters.
echo Usage: timeScript keyFile searchFile [initial blocksize]
exit 0
fi
f1_linecount=`cat $1 | wc -l`
echo linecount of file1 is $f1_linecount
f2_linecount=`cat $2 | wc -l`
echo linecount of file2 is $f2_linecount
echo
if (($# < 3 )); then
blockLength=100
else
blockLength=$3
fi
while (($blockLength < f1_linecount))
do
echo Using blocks of $blockLength
#split is a built in command that splits the file
# -l tells it to break after $blockLength lines
# and the block$blockLength parameter is a prefix for the file
split -l $blockLength $1 block$blockLength
Tstart="$(date +%s)"
Tbefore=$Tstart
for fn in block*
do
echo "grep -f $fn $2 | wc -l"
echo number of lines matched: `grep -f $fn $2 | wc -l`
Tnow="$(($(date +%s)))"
echo Time taken: $(($Tnow - $Tbefore)) s
Tbefore=$Tnow
done
echo Time for processing entire file with blocksize $blockLength: $(($Tnow - $Tstart)) seconds
blockLength=$((2*$blockLength))
# remove the split files - no longer needed
rm block*
echo block length is now $blockLength and f1 linecount is $f1_linecount
done
exit 0
You could certainly give sed a try to see whether you get a better result, but it is a lot of work to do either way on a file of any size. You didn't provide any details on your problem, but if you have 10k patterns I would be trying to think about whether there is some way to generalize them into a smaller number of regular expressions.
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key. This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = #ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
die "syntax: match_many.pl infile field"
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
$kcount=0;
while(<STDIN>) {
my $line = $_;
chomp($line);
$keys {$line} = 1;
$kcount++
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while(<INFILE>) {
if(substr($_,0,1) =~ /#/){ #skip comment lines
next;
}
my $line = $_;
chomp($line);
$line =~ s/^\s+//;
my #fields = split(/\s+/, $line);
if(exists($keys{$fields[$test_field]})){
print STDOUT "$line\n";
$emitted++;
$keys{$fields[$test_field]}++;
}
$records++;
}
$kcount=0;
while( my( $key, $value ) = each %keys ){
if($value > 1){
$kcount++;
}
}
close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
exit;

How does grep run so fast?

I am really amazed by the functionality of GREP in shell, earlier I used to use substring method in java but now I use GREP for it and it executes in a matter of seconds, it is blazingly faster than java code that I used to write.(according to my experience I might be wrong though)
That being said I have not been able to figure out how it is happening? there is also not much available on the web.
Can anyone help me with this?
Assuming your question regards GNU grep specifically. Here's a note from the author, Mike Haertel:
GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.
GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH
BYTE that it
does look at.
GNU grep uses the well-known Boyer-Moore algorithm, which looks first
for the final letter of the target string, and uses a lookup table to
tell it how far ahead it can skip in the input whenever it finds a
non-matching character.
GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the
Boyer-Moore delta table entries in such a way that it doesn't need to
do the loop exit test at every unrolled step. The result of this is
that, in the limit, GNU grep averages fewer than 3 x86 instructions
executed for each input byte it actually looks at (and it skips many
bytes entirely).
GNU grep uses raw Unix input system calls and avoids copying data
after reading it. Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO
LINES. Looking for newlines would slow grep down by a factor of
several times, because to find the newlines it would have to look at
every byte!
So instead of using line-oriented input, GNU grep reads raw data into
a large buffer, searches the buffer using Boyer-Moore, and only when
it finds a match does it go and look for the bounding newlines
(Certain command line options like
-n disable this optimization.)
This answer is a subset of the information taken from here.
To add to Steve's excellent answer.
It may not be widely known but grep is almost always faster when grepping for a longer pattern-string than a short one, because in a longer pattern, Boyer-Moore can skip forward in longer strides to achieve even better sublinear speeds:
Example:
# after running these twice to ensure apples-to-apples comparison
# (everything is in the buffer cache)
$ time grep -c 'tg=f_c' 20140910.log
28
0.168u 0.068s 0:00.26
$ time grep -c ' /cc/merchant.json tg=f_c' 20140910.log
28
0.100u 0.056s 0:00.17
The longer form is 35% faster!
How come? Boyer-Moore consructs a skip-forward table from the pattern-string, and whenever there's a mismatch, it picks the longest skip possible (from last char to first) before comparing a single char in the input to the char in the skip table.
Here's a video explaining Boyer Moore (Credit to kommradHomer)
Another common misconception (for GNU grep) is that fgrep is faster than grep. f in fgrep doesn't stand for 'fast', it stands for 'fixed' (see the man page), and since both are the same program, and both use Boyer-Moore, there's no difference in speed between them when searching for fixed-strings without regexp special chars. The only reason I use fgrep is when there's a regexp special char (like ., [], or *) I don't want it to be interpreted as such. And even then the more portable/standard form of grep -F is preferred over fgrep.

Resources