mac unix script problem - unix

I'm trying to write a script that breaks up a VERY large file into smaller pieces that are then sent to a script that runs in the background. The motivation is that if each piece is handled by a script running in the background, the pieces can be processed in parallel.
Here is my code. ./seq works just like the normal seq command (which the Mac doesn't have), and $1 is the huge file to be split.
echo "Splitting and Running Script"
for i in $(./seq 0 14000000 500000)
do
awk ' { if (NR>='$i' && NR<'$(($i+500000))') { print $0 > "xPart'$i'" } }' $1
python FastQ2Seq.py xPart$i &
done
wait
echo "Concatenating"
for k in *.out.seq
do
cat $k >> original.seq
done
for j in *.out.qul
do
cat $j >> original.qul
done
echo "Cleaning"
rm xPart*
My problem is that only xPart0 is made, and it has only 499995 lines in it before the program hangs. I put some debugging echoes in the script and I know the awk statement is what stops the script. I just can't figure out what's going wrong.

Check out the split command --
split -- split a file into pieces
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when
INPUT is -, read standard input.
It should be much faster, more reliable, and cleaner than running awk in a loop!
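A sketch of how the whole workflow might look with split (assuming, as in the question, that FastQ2Seq.py takes the chunk file name as its only argument):
echo "Splitting and Running Script"
# -l 500000: break the input into pieces of 500000 lines each
split -l 500000 "$1" xPart        # produces xPartaa, xPartab, ...
for chunk in xPart*
do
    python FastQ2Seq.py "$chunk" &
done
wait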

echo "Splitting and Running Script"
# split into smaller files of 50000 lines each, if I understand your problem correctly
awk 'NR%50000==1{++c}{print $0 > ("xPart" c ".txt")}' file
# or use split -l 50000
for file in xPart*
do
python FastQ2Seq.py "$file" &
done
wait
echo "Concatenating"
cat *.out.seq >> original.seq
cat *.out.qul >> original.qul

If your seq truly works like the standard seq, you're calling it wrong. The proper command line for seq is:
seq FIRST INCREMENT LAST
So you would need to change your seq command line to:
seq 0 500000 14000000
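For reference, a sketch of the original loop with the corrected argument order (it also tells awk to stop once it passes the end of the chunk, but it still re-reads the file from the start for every chunk, which is why the split-based answers above are preferable):
echo "Splitting and Running Script"
for i in $(./seq 0 500000 14000000)
do
    # keep lines in [i, i+500000) and write them to xPart$i
    awk -v start="$i" -v stop="$((i + 500000))" \
        'NR >= stop { exit } NR >= start { print > ("xPart" start) }' "$1"
    python FastQ2Seq.py "xPart$i" &
done
wait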


How to read 1 symbol in zsh?

I need to get exactly one character from console and not print it.
I've tried to use read -en 1 as I did using bash. But this doesn't work at all.
And vared doesn't seem to have such option.
How to read 1 symbol in zsh? (I'm using zsh v.4.3.11 and v.5.0.2)
read -sk
From the documentation:
-s
Don’t echo back characters if reading from the terminal. Currently does not work with the -q option.
-k [ num ]
Read only one (or num) characters. All are assigned to the first name, without word splitting. This flag is ignored when -q is present. Input is read from the terminal unless one of -u or -p is present. This option may also be used within zle widgets.
Note that despite the mnemonic ‘key’ this option does read full characters, which may consist of multiple bytes if the option MULTIBYTE is set.
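For instance, a minimal zsh sketch that prompts for a single keypress and stores it (the variable name c is arbitrary):
print -n "Press any key: "
read -sk1 c
print ""
print "you typed: $c"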
If you want your script to be a bit more portable you can do something like this:
y=$(bash -c "read -n 1 c; echo \$c")
read reads from the terminal by default:
% date | read -sk1 "?Enter one char: "; echo $REPLY
Enter one char: X
Note above:
The output of date is discarded
The X is printed by the echo, not when the user enters it.
To read from a pipeline, use file descriptor 0:
% echo foobar | read -rk1 -u0; echo $REPLY
f
% echo $ZSH_VERSION
5.5.1
Try something like
read line
c=`echo $line | cut -c1`
echo $c

grep -f maximum number of patterns?

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. It turns out that grep doesn't like this (who knew?). After a day, it hadn't produced anything. Smaller lists work almost instantaneously.
I was thinking I might split my long list up and do it a few times. Any idea what a good maximum length for the pattern list might be?
Also, I'm rather new to Unix. Alternative approaches are welcome. The list of patterns, or search terms, is in a plaintext file, one per line.
Thank you everyone for your guidance.
From comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will increase the speed of the matching considerably. (Using 479,000 strings to match on an input file with 3 lines using -F takes under 1.5 seconds on a moderately powered machine. Not using -F, that same machine is not yet finished after several minutes.)
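A hypothetical invocation with fixed-string matching (the file names are placeholders):
grep -F -f patterns.txt data.txt > matches.txt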
I ran into the same problem with approx. 4 million patterns to search for in a file with 9 million lines. It seems to be a RAM problem, so I came up with this neat little workaround, which might be slower than splitting and joining but needs just this one line.
while read -r line; do grep "$line" fileToSearchIn; done < patternFile
I needed to use the workaround since the -F flag is no solution for files that large...
EDIT: This seems to be really slow for large files. After some more research I found 'faSomeRecords' and other really awesome tools in the Kent NGS-editing-Tools.
I tried it myself by extracting 2 million FASTA records from a 5.5 million record file. It took approx. 30 seconds.
cheers
EDIT: direct download link
Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions.
Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity.
Run the script in a terminal window with a command of the form
./timeScript keyFile textFile 100 > outputFile
This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
In a second terminal, run the command
tail -f outputFile
which will follow the output that your other process writes into outputFile.
I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.
This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.
You could take the output of the script and put it through a grep:
grep entire outputFile
You will end up with just the summaries - block size, and time taken, e.g.
Time for processing entire file with blocksize 800: 4 seconds
If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l which I did just to see how many lines were matched)...
#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
#
# split f1 into chunks of 1, 2, 4, 8... expressions at a time
# and print out how long it took to process all the lines in f2
if (($# < 2 )); then
echo Warning: need at least two parameters.
echo Usage: timeScript keyFile searchFile [initial blocksize]
exit 0
fi
f1_linecount=`cat $1 | wc -l`
echo linecount of file1 is $f1_linecount
f2_linecount=`cat $2 | wc -l`
echo linecount of file2 is $f2_linecount
echo
if (($# < 3 )); then
blockLength=100
else
blockLength=$3
fi
while (($blockLength < f1_linecount))
do
echo Using blocks of $blockLength
#split is a built in command that splits the file
# -l tells it to break after $blockLength lines
# and the block$blockLength parameter is a prefix for the file
split -l $blockLength $1 block$blockLength
Tstart="$(date +%s)"
Tbefore=$Tstart
for fn in block*
do
echo "grep -f $fn $2 | wc -l"
echo number of lines matched: `grep -f $fn $2 | wc -l`
Tnow="$(($(date +%s)))"
echo Time taken: $(($Tnow - $Tbefore)) s
Tbefore=$Tnow
done
echo Time for processing entire file with blocksize $blockLength: $(($Tnow - $Tstart)) seconds
blockLength=$((2*$blockLength))
# remove the split files - no longer needed
rm block*
echo block length is now $blockLength and f1 linecount is $f1_linecount
done
exit 0
You could certainly give sed a try to see whether you get a better result, but it is a lot of work to do either way on a file of any size. You didn't provide any details on your problem, but if you have 10k patterns I would be trying to think about whether there is some way to generalize them into a smaller number of regular expressions.
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key.
This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s.
The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word, the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = @ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
die "syntax: match_many.pl infile field"
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
$kcount=0;
while(<STDIN>) {
my $line = $_;
chomp($line);
$keys {$line} = 1;
$kcount++
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while(<INFILE>) {
if(substr($_,0,1) =~ /#/){ #skip comment lines
next;
}
my $line = $_;
chomp($line);
$line =~ s/^\s+//;
my @fields = split(/\s+/, $line);
if(exists($keys{$fields[$test_field]})){
print STDOUT "$line\n";
$emitted++;
$keys{$fields[$test_field]}++;
}
$records++;
}
$kcount=0;
while( my( $key, $value ) = each %keys ){
if($value > 1){
$kcount++;
}
}
close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
exit;
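A hypothetical invocation, matching keys (one per line on stdin) against field 2 of a whitespace-delimited file (the file names are placeholders):
./match_many.pl records.txt 2 < keys.txt > matched_records.txt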

ksh: how to probe stdin?

I want my ksh script to have different behaviors depending on whether there is something incoming through stdin or not:
(1) cat file.txt | ./script.ksh (then do "cat <&0 >./tmp.dat" and process tmp.dat)
vs. (2) ./script.ksh (then process $1 which must be a readable regular file)
Checking whether stdin is a terminal ([ -t 0 ]) is not helpful, because my script is called from another script.
Doing "cat <&0 >./tmp.dat" to check tmp.dat's size hangs up waiting for an EOF from stdin if stdin is "empty" (2nd case).
How to just check if stdin is "empty" or not?!
EDIT: You are running on HP-UX
Tested [ -t 0 ] on HP-UX and it appears to be working for me. I have used the following setup:
/tmp/x.ksh:
#!/bin/ksh
/tmp/y.ksh
/tmp/y.ksh:
#!/bin/ksh
test -t 0 && echo "terminal!"
Running /tmp/x.ksh prints: terminal!
Could you confirm the above on your platform, and/or provide an alternate test setup more closely reflecting your situation? Is your script ultimately spawned by cron?
EDIT 2
If desperate, and if Perl is available, define:
stdin_ready() {
TIMEOUT=$1; shift
perl -e '
my $rin = "";
vec($rin,fileno(STDIN),1) = 1;
select($rout=$rin, undef, undef, '$TIMEOUT') < 1 && exit 1;
'
}
stdin_ready 1 || echo 'stdin not ready in 1 second, assuming terminal'
EDIT 3
Please note that the timeout may need to be significant if your input comes from sort, ssh etc. (all these programs can spawn and establish the pipe with your script seconds or minutes before producing any data over it.) Also, using a hefty timeout may dramatically penalize your script when there is nothing on the input to begin with (e.g. terminal.)
If potentially large timeouts are a problem, and if you can influence the way in which your script is called, then you may want to force the callers to explicitly instruct your program whether stdin should be used, via a custom option or in the standard GNU or tar manner (e.g. script [options [--]] FILE ..., where FILE can be a file name, a - to denote standard input, or a combination thereof, and your script would only read from standard input if - were passed in as a parameter.)
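A minimal sketch of that convention in ksh (process_file stands in for whatever the script actually does with its input):
# read from stdin only when the caller explicitly passes "-"
if [ "$1" = "-" ]; then
    cat > ./tmp.dat
    set -- ./tmp.dat
fi
process_file "$1"    # hypothetical processing step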
This strategy works for bash, and would likely work for ksh. Poll 'tty':
#!/bin/bash
set -a
if [ "$( tty )" == 'not a tty' ]
then
STDIN_DATA_PRESENT=1
else
STDIN_DATA_PRESENT=0
fi
if [ ${STDIN_DATA_PRESENT} -eq 1 ]
then
echo "Input was found."
else
echo "Input was not found."
fi
Why not solve this in a more traditional way, and use the command line argument to indicate that the data will be coming from stdin?
For an example, consider the difference between:
echo foo | cat -
and
echo foo > /tmp/test.txt
cat /tmp/test.txt

`tee` command equivalent for *input*?

The unix tee command splits the standard input to stdout AND a file.
What I need is something that works the other way around, merging several inputs to one output - I need to concatenate the stdout of two (or more) commands.
Not sure what the semantics of this app should be - let's suppose each argument is a complete command.
Example:
> eet "echo 1" "echo 2" > file.txt
should generate a file that has contents
1
2
I tried
> echo 1 && echo 2 > zz.txt
It doesn't work.
Side note: I know I could just append the outputs of each command to the file, but I want to do this in one go (actually, I want to pipe the merged outputs to another program).
Also, I could roll my own, but I'm lazy whenever I can afford it :-)
Oh yeah, and it would be nice if it worked in Windows (although I guess any bash/linux-flavored solution works, via UnxUtils/msys/etc)
Try
( echo 1; echo 2 ) > file.txt
That spawns a subshell and executes the commands there.
{ echo 1; echo 2; } > file.txt
is possible, too. That does not spawn a subshell (the semicolon after the last command is important).
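Since the goal is to pipe the merged output to another program, the grouped form also works on the left side of a pipe (some_program is a placeholder):
{ echo 1; echo 2; } | some_program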
I guess what you want is to run both commands in parallel, and pipe both outputs merged to another command.
I would do:
( echo 1 & echo 2 ) | cat
Where "echo 1" and "echo 2" are the commands generating the outputs and "cat" is the command that will receive the merged output.
echo 1 > zz.txt && echo 2 >> zz.txt
That should work. All you're really doing is running two commands one after the other: the first redirects its output to a file, and if it succeeds, the second appends its output to the end of that same file.

Breaking out of "tail -f" that's being read by a "while read" loop in HP-UX

I'm trying to write a (sh, i.e. Bourne shell) script that processes lines as they are written to a file. I'm attempting to do this by feeding the output of tail -f into a while read loop. This tactic seems to be proper based on my research on Google as well as this question dealing with a similar issue, but using bash.
From what I've read, it seems that I should be able to break out of the loop when the file being followed ceases to exist. It doesn't. In fact, it seems the only way I can break out of this is to kill the process in another session. tail does seem to be working fine otherwise, as tested with this:
touch file
tail -f file | while read line
do
echo $line
done
Data I append to file in another session appears just fine from the loop processing written above.
This is on HP-UX version B.11.23.
Thanks for any help/insight you can provide!
If you want to break out when your file no longer exists, just do it:
test -f file || break
Placing this in your loop should break out of it.
The remaining problem is how to break out of the read line, as it blocks.
You can do that by applying a timeout, like read -t 5 line. Then the read returns every 5 seconds, and if the file no longer exists, the loop will break. Attention: write your loop so that it can handle the case where the read times out but the file is still present.
EDIT: It seems that on a timeout read returns false, so you can combine the test with the timeout; the result would be:
tail -f test.file | while read -t 3 line || test -f test.file; do
some stuff with $line
done
I don't know about HP-UX tail but GNU tail has the --follow=name option which will follow the file by name (by re-opening the file every few seconds instead of reading from the same file descriptor which will not detect if the file is unlinked) and will exit when the filename used to open the file is unlinked:
tail --follow=name test.txt
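Combined with the loop from the question, that might look like this (GNU tail only; as noted above, tail exits once test.txt is unlinked, which ends the loop):
tail --follow=name test.txt | while read line
do
    echo "$line"
done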
Unless you're using GNU tail, there is no way it'll terminate of its own accord when following a file. The -f option is really only meant for interactive monitoring--indeed, I have a book that says that -f "is unlikely to be of use in shell scripts".
As for a solution to the problem: I'm not wholly sure this isn't an over-engineered way to do it, but I figured you could send the tail to a FIFO, then have a function or script that checks the file for existence and kills off the tail if it has been unlinked.
#!/bin/sh
sentinel ()
{
    while true
    do
        if [ ! -e "$1" ]
        then
            kill "$2"
            rm "/tmp/$1"
            break
        fi
        sleep 1    # avoid spinning the CPU while polling
    done
}
touch "$1"
mkfifo "/tmp/$1"
tail -f "$1" >"/tmp/$1" &
sentinel "$1" $! &
cat "/tmp/$1" | while read line
do
    echo "$line"
done
Did some naïve testing, and it seems to work okay, and not leave any garbage lying around.
I've never been happy with this answer but I have not found an alternative either:
kill $(ps -o pid,cmd --no-headers --ppid $$ | grep tail | awk '{print $1}')
Get all processes that are children of the current process, look for the tail, print out the first column (tail's pid), and kill it. Sin-freaking-ugly indeed, such is life.
The following approach backgrounds the tail -f file command, echoes its process id plus a custom string prefix (here tailpid: ) to the while loop, where the line with the custom string prefix triggers another (backgrounded) while loop that checks every 5 seconds whether file still exists. If not, tail -f file gets killed and the subshell containing the backgrounded while loop exits.
# cf. "The Heirloom Bourne Shell",
# http://heirloom.sourceforge.net/sh.html,
# http://sourceforge.net/projects/heirloom/files/heirloom-sh/ and
# http://freecode.com/projects/bournesh
/usr/local/bin/bournesh -c '
touch file
(tail -f file & echo "tailpid: ${!}" ) | while IFS="" read -r line
do
case "$line" in
tailpid:*) while sleep 5; do
#echo hello;
if [ ! -f file ]; then
IFS=" "; set -- ${line}
kill -HUP "$2"
exit
fi
done &
continue ;;
esac
echo "$line"
done
echo exiting ...
'
