Is there any faster way to truncate a column in Unix?

I want to truncate the 4th column of a TSV file to a given length in Unix. The file has a few million records and is 8 GB in size.
I am trying this, but it seems kind of slow.
awk -F"\t" '{s=substr($4,0,256); print $1"\t"$2"\t"$3"\t"s"\t"$5"\t"$6"\t"$7}' file > newFile
Is there any faster alternative for the same?
Thanks

Your command could be written a little more nicely (assuming you are re-building the record), which may give some performance increases:
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,256) } 1' file > newFile
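As a quick sanity check, here is the same rewrite on a tiny made-up sample, truncating to 5 characters instead of 256 so the effect is visible (the trailing 1 is the pattern that prints the rebuilt record):

```shell
# Hypothetical 7-column TSV line; only field 4 should change
printf 'a\tb\tc\tLONGVALUE\te\tf\tg\n' |
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,5) } 1'
# field 4 becomes "LONGV"; the other fields pass through untouched
```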
If you have access to a multi-core machine (which you probably do), you can use GNU parallel. You may want to vary the number of cores you use (I've set 4 here) and the block size that's fed to awk (I've set this to two megabytes)...
< file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' > newFile
Here's some testing I did on my system using a 2.7G file with 100 million lines and a block size of 2M:
time awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' file >/dev/null
Results:
real 1m59.313s
user 1m57.120s
sys 0m2.190s
With one core:
time < file parallel -j 1 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' >/dev/null
Results:
real 2m28.270s
user 4m3.070s
sys 0m41.560s
With four cores:
time < file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' >/dev/null
Results:
real 0m54.329s
user 2m41.550s
sys 0m31.460s
With twelve cores:
time < file parallel -j 12 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' >/dev/null
Results:
real 0m36.581s
user 2m24.370s
sys 0m32.230s

I’ll assume that your file has exactly one space character between fields and no whitespace at the beginning of the line.  If that is wrong, this can be enhanced. 
Otherwise, this should work:
sed 's/^\([^ ]* [^ ]* [^ ]* [^ ]\{1,256\}\)[^ ]* /\1 /'
I haven’t actually tested it with 256-character-long data (I tested it with \{1,2\}), and I have no idea how its speed compares to that of awk. BTW, on some versions of sed, you might need to leave off the backslashes from the curly braces and use just {1,256}.
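Since the question's file is tab-separated rather than space-separated, a tab-aware variant of the same sed idea may be closer to what's needed. Here's a sketch on made-up data with a limit of 5 instead of 256; a literal tab is stored in $t so the pattern works without GNU-only \t escapes:

```shell
# Capture a real tab character for use inside the bracket expressions
t=$(printf '\t')
printf 'a\tb\tc\tLONGVALUE\te\n' |
sed "s/^\\(\\([^$t]*$t\\)\\{3\\}[^$t]\\{1,5\\}\\)[^$t]*/\\1/"
# keeps the first 5 characters of field 4 and drops the rest
```

Like the original, this assumes field 4 is non-empty (the \{1,5\} interval requires at least one character).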

If Scott or Steve's solutions are still too slow, it may be time to break out the C. Run as ./a.out < file > newFile. Test on a small file with some long fields first; I am not 100% sure I have the math right.
#include <stdio.h>

int
main(void)
{
    int field = 1;      /* current field, counting from 1 */
    int character = 0;  /* characters seen so far in the current field */
    int c;

    while ((c = getchar()) != EOF)
    {
        switch (c)
        {
        case '\n':
            field = 1;
            character = 0;
            break;
        case '\t':
            character = 0;
            field++;
            break;
        default:
            character++;
            break;
        }
        /* drop everything past the 256th character of field 4 */
        if (field != 4 || character <= 256)
            putchar(c);
    }
    if (ferror(stdout) || fflush(stdout) || fclose(stdout))
    {
        perror("write");
        return 1;
    }
    return 0;
}

Related

How to exclude parent Unix processes from grepped output from ps

I have got a file of pids and am using ps -f to get information about the pids.
Here is an example:
ps -eaf | grep -f myfilename
myuser 14216 14215 0 10:00 ? 00:00:00 /usr/bin/ksh /home/myScript.ksh
myuser 14286 14216 0 10:00 ? 00:00:00 /usr/bin/ksh /home/myScript.ksh
where myfilename contains only 14216.
I've got a tiny problem where the output is giving me parent process id's as well as the child. I want to exclude the line for the parent process id.
Does anyone know how I could modify my command to exclude parent process keeping in mind that I could have many process id's in my input file?
Hard to do with just grep but easy to do with awk.
Invoke the awk script below from the following command:
ps -eaf | awk -f script.awk myfilename -
Here's the script:
# process the first file on the command line (aka myfilename);
# this is the list of pids
ARGIND == 1 {
    pids[$0] = 1
}

# second and subsequent files ("-"/stdin in the example)
ARGIND > 1 {
    # is column 2 of the ps -eaf output (i.e. the pid) in the list of
    # desired pids? -- if so, print the entire line
    if ($2 in pids)
        printf("%s\n", $0)
}
UPDATE:
ARGIND is a GNU awk (gawk) extension, so when using gawk the following may be ignored. For versions that lack it, insert the following code at the top:
# work around awks that lack ARGIND
ARGIND == 0 {
    defective_awk_flag = 1
}
defective_awk_flag != 0 {
    if (FILENAME != defective_awk_file) {
        defective_awk_file = FILENAME
        ARGIND += 1
    }
}
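For awks without ARGIND, the portable NR==FNR idiom does the same two-file trick in one line; a minimal sketch with made-up pids and ps lines:

```shell
# NR==FNR is true only while reading the first file (the pid list);
# for later input, print any line whose second column is a listed pid
printf '14216\n' > pids.txt
printf 'myuser 14216 14215 0 10:00 ?\nmyuser 14286 14216 0 10:00 ?\n' |
awk 'NR==FNR { pids[$0]=1; next } $2 in pids' pids.txt -
# prints only the line whose pid (not ppid) is 14216
```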
UPDATE #2:
The above is all fine. Just for fun, here's an alternate way to do the same thing with perl. One of the advantages is that everything can be contained in the script and no pipeline is necessary.
Invoke the script via:
./script.pl myfilename
And, here's script.pl. Note: I don't write idiomatic perl. My style is more akin to what one would expect to see in other languages like C, javascript, etc.:
#!/usr/bin/perl
master(@ARGV);
exit(0);

# master -- master control
sub master
{
    my(@argv) = @_;
    my($xfsrc);
    my($pidfile);
    my($buf);

    # NOTE: "chomp" is a perl function that strips newlines
    # get filename with list of pids (e.g. myfilename)
    $pidfile = shift(@argv);
    open($xfsrc, "<$pidfile") ||
        die("master: unable to open '$pidfile' -- $!\n");

    # create an associative array (a "hash" in perl parlance) of the desired
    # pid numbers
    while ($pid = <$xfsrc>) {
        chomp($pid);
        $pid_desired{$pid} = 1;
    }
    close($xfsrc);

    # run the 'ps' command and capture its output into an array
    @pslist = (`ps -eaf`);

    # process the command output, line-by-line
    foreach $buf (@pslist) {
        chomp($buf);

        # the pid number we want is in the second column
        (undef, $pid) = split(" ", $buf);

        # print the line if the pid is one of the ones we want
        print($buf, "\n")
            if ($pid_desired{$pid});
    }
}
Use this command:
ps -eaf | grep -f myfilename | grep -v grep | grep -f myfilename

HP-UX: C shell: Disk space checking

I have 10 devices running HP-UX, and I want to check the disk space on each device.
My requirement is: if the space usage is more than 90%, the device info and space will be saved to a log.
This is the list of devices and IP addresses in the file ipadd:
lo1 100.45.32.43
lot2 100.45.32.44
lot3 100.45.32.44
lot4 100.45.32.45
lot5 100.45.32.46
and so on..
This is my script so far :
#!/bin/csh -f
set ipaddress = (`awk '{print $2}' "ipadd"`)
set device = (`awk '{print $1}' "ipadd"`)
@ j = 1
while ($j <= $#ipaddress)
echo $ipaddress
set i = 90 # Threshold set at 90%
set max = 100
while ($i <= $max)
rsh $ipaddress[$j] bdf | grep /dev/vg00 | grep $i% \
|awk '{ file=substr($6,index($6,"/") + 1,length($6)); print "WARNING: $device[$j]:/" file " has reached " $5 ". Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space." file ".file"}'
@ i++
end
@ j++
end
The output of bdf:
/dev/vg00/lvol2 15300207 10924582 28566314 79% /
/dev/vg00/lvol4 42529 23786 25510 55% /stand
The output at the terminal after executing the script:
100.45.32.43
100.45.32.44
The output at .file:
WARNING: $device[$j]:/ has reached 79%. Perform HouseKeeping IMMEDIATELY...
My question is: is there something wrong with my looping? It seems to iterate only once, because my .file output shows only one device.
And why does $device[$j] not come out in the .file output?
Or is awk the problem?
Thank you for the advice.
Your code tested for each possible percentage between 90 and 100.
Presumably, you'd be OK with code that checks once and asks "is the device percentage greater than 90%?". Then you don't need the inner loop at all, and you make only one connection per machine. Try:
#!/bin/csh -f
set ipaddress = (`awk '{print $2}' "ipadd"`)
set device = (`awk '{print $1}' "ipadd"`)
@ j = 1
set i = 90 # Threshold set at 90%
while ($j <= $#ipaddress)
echo $ipaddress
echo "#dbg: ipaddress[$j]=${ipaddress[$j]}"
rsh $ipaddress[$j] bdf \
| awk -v thresh="$i" -v dev="$device[$j]" \
'/\/dev\/vg00/ { \
    sub(/%/,"",$5); \
    if ($5 > thresh) { \
        file=substr($6,index($6,"/") + 1,length($6)); \
        print "WARNING: " dev ":/" file " has reached " $5 "%. Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space." file ".file" \
    } \
}'
@ j++
end
Sorry, but I don't have a csh available to double-check for syntax errors.
So here is a one liner that we determined worked in your environment.
rsh $ipaddress[$j] bdf | nawk -v thresh="$i" -v dev="$device[$j]" '/\/dev\/vg00/ { sub(/%/,"",$5) ; if ($5 > thresh) { file=substr($6,index($6,"/") + 1,length($6));print "#dbg:file="file; print "WARNING: " dev ":/" file " has reached " $5 ". Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space.file.TMP" } }'
I don't have a system with bdf available. Change the two references to $5 in the sub() and if test to match the field-number of the output that has the percentage you want to test.
Note that -v var="value" is the standard way to pass a variable value from the shell to an awk script that is enclosed in single-quotes.
Be careful that any '\' chars at the end of a line are the last chars, no trailing space or tabs, or you'll get an indecipherable error msg. ;-)
IHTH
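To see the threshold logic in isolation, here is the awk part run against two simulated bdf lines (the numbers are made up); only the filesystem over 90% triggers the warning:

```shell
# Field 5 is the use percentage; strip '%' and compare numerically
printf '/dev/vg00/lvol2 100 95 5 95%% /var\n/dev/vg00/lvol4 100 55 45 55%% /stand\n' |
awk -v thresh=90 '/\/dev\/vg00/ { p = $5; sub(/%/, "", p); if (p + 0 > thresh) print "WARNING: " $6 " has reached " $5 }'
# emits a warning only for /var at 95%
```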

Cshell and Awk Infinite Running

When I run the program below, I get no output; the program just runs forever until I kill it. Can someone please explain to me why this happens? I am trying to get this complex awk statement to work, but I have been very unsuccessful.
The code I am using in my C shell is (it's all on one line, but I split it here to make it easier to read):
awk '{split($2,b,""); counter = 1; while (counter < 13)
{if (b[counter] == 1 && "'$cmonth'" > counter)
{{printf("%s%s%s\n", $1, "'$letter'","'$year3'")}; counter++;
else if (b[counter] == 1 && "'$cmonth'" <= counter)
{{printf("%s%s%s\n", $1, "'$letter'","'$year2'")}; counter++;}
else echo "fail"}}' fileRead >> $year$month
The text file I am reading from looks like
fff 101010101010
yyy 100100100100
Here $year2 and $year3 represent counters that start from 1987 and go up 1 year for each line read.
$cmonth is just a month counter from 1–12.
$letter is just a ID.
The goal is for the program to read each line and print out the ID, month, and year if the position in the byte code is 1.
You have some mismatched curly braces, I have reformatted to one standard of indentation.
awk '{ \
split($2,b,""); counter = 1 \
while (counter < 13) { \
if (b[counter] == 1 && "'$cmonth'" > counter){ \
printf("%s%s%s\n", $1, "'$letter'","'$year3'") \
counter++ \
} \
else if (b[counter] == 1 && "'$cmonth'" <= counter) { \
printf("%s%s%s\n", $1, "'$letter'","'$year2'") \
counter++ \
} \
else print "fail" \
} # while \
}' fileRead >> $year$month
Also, awk doesn't support echo.
Make sure that the \ is the LAST char on the line (no space or tab chars!!!), or you'll get a syntax error.
Else, you can 'fold' up all of the lines into one line, adding the occasional ';' as needed.
edit
OR you can take the previous version of this awk script (without the \ line-continuation chars), put it in a file (without any of the elements outside of the single quotes), and call it from awk as a file. You'll also need to modify it so you can pass in the variables cmonth, letter, year2 and any others that I've missed.
save as file
edit the file, remove any `\` chars, and change all vars like "'$letter'" to letter
call the program like:
awk -v letter="$letter" -v year2="$year2" -v month="$month" -f myScript fileRead >> $year$month
for example
printf("%s%s%s\n", $1, "'$letter'","'$year2'")
becomes
printf("%s%s%s\n", $1, letter,year2)
IHTH.
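A stripped-down, runnable illustration of the -v approach, with made-up values for letter and year2 and only the first bit of the mask checked:

```shell
# -v makes shell values ordinary awk variables; split($2,b,"") splits
# the bit string into one character per array slot (a gawk extension)
printf 'fff 101010101010\n' |
awk -v letter=A -v year2=1988 '{ split($2, b, ""); if (b[1] == 1) printf("%s%s%s\n", $1, letter, year2) }'
# prints fffA1988
```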

Faster Alternative to Unix Grep

I'm trying to do the following
$ grep ">" file.fasta > output.txt
But it is taking so long when the input fasta file is large.
The input file looks like this:
>seq1
ATCGGTTA
>seq2
ATGGGGGG
Is there a faster alternative?
Use the time command with each of these:
$> time grep ">" file.fasta > output.txt
$> time egrep ">" file.fasta > output.txt
$> time awk '/^>/{print $0}' file.fasta > output.txt    # if ">" is the first character on the line
If you compare the timings, they are almost the same.
In my opinion, if the data is in columnar format, then use awk to search.
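On the question's four-line sample, anchoring the pattern with ^ selects exactly the header lines (and avoids matching a stray > elsewhere in a line):

```shell
# Build the sample from the question and keep only the ">" header lines
printf '>seq1\nATCGGTTA\n>seq2\nATGGGGGG\n' > demo.fasta
grep '^>' demo.fasta
# prints:
# >seq1
# >seq2
```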
Hand-built state machine. If you only want '>' to be accepted at the beginning of the line, you'll need one more state. If you need to recognise '\r' too, you will need a few more states.
#include <stdio.h>

int main(void)
{
    int state, ch;

    for (state = 0; (ch = getc(stdin)) != EOF; ) {
        switch (state) {
        case 0: /* start: looking for '>' */
            if (ch == '>') state = 1;
            else break;
            /* fall through */
        case 1: /* echo the rest of the line */
            fputc(ch, stdout);
            if (ch == '\n') state = 0;
            break;
        }
    }
    if (state == 1) fputc('\n', stdout);
    return 0;
}
If you want real speed, you could replace the fputc() call by its macro equivalent putc() (the input side already uses getc()), but I think trivial programs like this will be I/O bound anyway.
For big files, the fastest possible grep can be accomplished with GNU parallel. An example using parallel and grep can be found here.
For your purposes, you may like to try:
cat file.fasta | parallel -j 4 --pipe --block 10M grep '^>' > output.txt
The above will use four cores and feed 10 MB blocks to grep. The block size is optional, but I find a 10 MB block size quite a bit faster on my system. YMMV.
HTH
Ack is a good alternative to grep for finding strings/regexes in code:
http://beyondgrep.com/

xxd -r without xxd

I'm running on a scaled down version of CentOS 5.5 without many tools available. No xxd, bc, or hd. I can't install any additional utilities, unfortunately. I do have od, dd, awk, and bourne shell (not bash). What I'm trying to do is relatively simple in a normal environment. Basically, I have a number, say 100,000, and I need to store its binary representation in a file. Typically, I'd do something like ...
printf '%x' "100000" | xxd -r -p > file.bin
If you view a hex dump of the file, you'd correctly see the number represented as 186A0.
Is there an equivalent I can cobble together using the limited tools I have available? Pretty much everything I've tried stores the ascii values for the digits.
You can do it with a combination of your printf, awk, and your shell.
#!/usr/bin/awk -f
# ascii_to_bin.awk
{
    # Pad out the incoming integer to a whole number of bytes
    len = length($0);
    if ((len % 2) != 0) {
        str = sprintf("0%s", $0);
        len = len + 1;
    }
    else {
        str = $0;
    }

    # Create your escaped echo string
    printf("echo -n -e \"");
    for (i = 1; i <= len; i = i + 2) {
        printf("\\\\x%s", substr(str, i, 2));
    }
    printf("\"");
}
Then you can just do
$ printf '%x' "100000" | awk -f ascii_to_bin.awk | /bin/sh > output.bin
If you know your target binary length, you can just do a printf "%0NX" (where N is the number of hex digits, i.e. twice the target size in bytes) and remove the (len % 2) logic.
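For the question's value of 100,000 (hex 186a0, padded to 0186a0), the script builds the following echo command; piping that command to /bin/sh then writes the three raw bytes 01 86 a0. Here is the same logic inlined, shown without the final pipe to /bin/sh so the generated command is visible:

```shell
printf '%x' 100000 |
awk '{
  len = length($0); str = $0
  if (len % 2) { str = "0" str; len++ }   # pad to a whole number of bytes
  printf("echo -n -e \"")
  for (i = 1; i <= len; i += 2)
    printf("\\\\x%s", substr(str, i, 2))  # one \xHH escape per byte
  printf("\"\n")
}'
# emits: echo -n -e "\\x01\\x86\\xa0"
```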
