xxd -r without xxd - hex

I'm running on a scaled-down version of CentOS 5.5 without many tools available. No xxd, bc, or hd. I can't install any additional utilities, unfortunately. I do have od, dd, awk, and Bourne shell (not bash). What I'm trying to do is relatively simple in a normal environment. Basically, I have a number, say 100,000, and I need to store its binary representation in a file. Typically, I'd do something like ...
printf '%x' "100000" | xxd -r -p > file.bin
If you view a hex dump of the file, you'd correctly see the number represented as 186A0.
Is there an equivalent I can cobble together using the limited tools I have available? Pretty much everything I've tried stores the ASCII values for the digits.

You can do it with a combination of your printf, awk, and your shell.
#!/usr/bin/awk -f
# ascii_to_bin.awk
{
    # Pad the incoming hex string to an even number of digits,
    # so every byte is a full two-digit pair
    len = length($0);
    if ((len % 2) != 0) {
        str = sprintf("0%s", $0);
        len = len + 1;
    }
    else {
        str = $0;
    }
    # Create the escaped echo command, one \xNN escape per byte
    printf("echo -n -e \"");
    for (i = 1; i <= len; i = i + 2) {
        printf("\\\\x%s", substr(str, i, 2));
    }
    printf("\"");
}
Then you can just do
$ printf '%x' "100000" | awk -f ascii_to_bin.awk | /bin/sh > output.bin
If you know your target binary length, you can just use printf "%0Nx" (where N is the number of hex digits, i.e. twice the byte count) and remove the (len % 2) logic.
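For example, to force a fixed 4-byte result, a sketch using the same ascii_to_bin.awk (od, which is available, checks the output):
printf '%08x' 100000 | awk -f ascii_to_bin.awk | /bin/sh > output.bin
od -A n -t x1 output.bin    # expect: 00 01 86 a0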

Related

Loop over environment variables in POSIX sh

I need to loop over environment variables and get their names and values in POSIX sh (not bash). This is what I have so far.
#!/usr/bin/env sh
# Loop over each line from the env command
while read -r line; do
# Get the string before the first = (the var name)
name="${line%%=*}"
eval value="\$$name"
echo "name: ${name}, value: ${value}"
done <<EOF
$(env)
EOF
It works most of the time, except when an environment variable contains a newline. I need it to work in that case.
I am aware of the -0 flag for env that separates variables with NUL instead of newlines, but if I use that flag, how do I loop over each variable? Edit: @chepner pointed out that POSIX env doesn't support -0, so that's out.
Any solution that uses portable linux utilities is good as long as it works in POSIX sh.
There is no way to parse the output of env with complete confidence; consider this output:
bar=3
baz=9
I can produce that with two different environments:
$ env -i "bar=3" "baz=9"
bar=3
baz=9
$ env -i "bar=3
> baz=9"
bar=3
baz=9
Is that two environment variables, bar and baz, with simple numeric values, or is it one variable bar with the value $'3\nbaz=9' (to use bash's ANSI quoting style)?
You can safely access the environment with POSIX awk, however, using the ENVIRON array. For example (input is redirected from /dev/null so awk reads nothing and goes straight to the END block):
awk 'END {
    for (name in ENVIRON) {
        print "Name is " name;
        print "Value is " ENVIRON[name];
    }
}' < /dev/null
With this command, you can distinguish between the two environments mentioned above.
$ env -i "bar=3" "baz=9" awk 'END { for (name in ENVIRON) { print "Name is "name; print "Value is "ENVIRON[name]; }}' < /dev/null
Name is baz
Value is 9
Name is bar
Value is 3
$ env -i "bar=3
> baz=9" awk 'END { for (name in ENVIRON) { print "Name is "name; print "Value is "ENVIRON[name]; }}' < /dev/null
Name is bar
Value is 3
baz=9
Maybe this would work?
#!/usr/bin/env sh
env | while IFS= read -r line
do
name="${line%%=*}"
indirect_presence="$(eval echo "\${$name+x}")"
[ -z "$name" ] || [ -z "$indirect_presence" ] || echo "name:$name, value:$(eval echo "\$$name")"
done
It is not bullet-proof: if the value of some variable contains a newline followed by text that looks like an assignment, the loop can be confused.
The expansion uses %% to remove the longest suffix match, so if a line contains several = signs, everything from the first = onward is removed, leaving only the variable name from the beginning of the line.
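As a tiny illustration of that expansion (with a hypothetical line such a loop might see):
line='VAR=foo=bar'
printf '%s\n' "${line%%=*}"    # prints: VAR
printf '%s\n' "${line%=*}"     # prints: VAR=foo (shortest match, wrong here)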
Here is an example based on the awk approach:
#!/bin/sh
for NAME in $(awk "END { for (name in ENVIRON) { print name; }}" < /dev/null)
do
VAL="$(awk "END { printf ENVIRON[\"$NAME\"]; }" < /dev/null)"
echo "$NAME=$VAL"
done

HP-UX: C shell: Disk space checking

I have 10 devices running HP-UX and I want to check the disk space on each of them.
My requirement: if usage is more than 90%, the device info and usage should be saved to a log.
This is the list of devices and IP addresses, kept in the file ipadd:
lo1 100.45.32.43
lot2 100.45.32.44
lot3 100.45.32.44
lot4 100.45.32.45
lot5 100.45.32.46
and so on..
This is my script so far :
#!/bin/csh -f
set ipaddress = (`awk '{print $2}' "ipadd"`)
set device = (`awk '{print $1}' "ipadd"`)
@ j = 1
while ($j <= $#ipaddress)
    echo $ipaddress
    set i = 90  # Threshold set at 90%
    set max = 100
    while ($i <= $max)
        rsh $ipaddress[$j] bdf | grep /dev/vg00 | grep $i% \
        | awk '{ file=substr($6,index($6,"/") + 1,length($6)); print "WARNING: $device[$j]:/" file " has reached " $5 ". Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space." file ".file"}'
        @ i++
    end
    @ j++
end
The output of bdf:
/dev/vg00/lvol2 15300207 10924582 28566314 79% /
/dev/vg00/lvol4 42529 23786 25510 55% /stand
The output at the terminal after executing the script:
100.45.32.43
100.45.32.44
The output at .file:
WARNING: $device[$j]:/ has reached 79%. Perform HouseKeeping IMMEDIATELY...
My question is: is there something wrong with my looping, since it seems to iterate only once (my .file output shows only one device)?
And why does $device[$j] not come out in the .file output?
Or is awk the problem?
Thank you for the advice.
Your code tests for each possible percentage between 90 and 100.
Presumably, you'd be OK with code that checks once and asks "is the device's percentage greater than 90%?". Then you don't need the inner loop at all, and you make only one connection per machine. Try:
#!/bin/csh -f
set ipaddress = (`awk '{print $2}' "ipadd"`)
set device = (`awk '{print $1}' "ipadd"`)
@ j = 1
set i = 90  # Threshold set at 90%
while ($j <= $#ipaddress)
    echo $ipaddress
    echo "#dbg: ipaddress[$j]=$ipaddress[$j]"
    rsh $ipaddress[$j] bdf \
    | awk -v thresh="$i" -v dev="$device[$j]" \
    '/\/dev\/vg00/ { \
        sub(/%/,"",$5); \
        if ($5 > thresh) { \
            file=substr($6,index($6,"/") + 1,length($6)); \
            print "WARNING: " dev ":/" file " has reached " $5 ". Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space." file ".file" \
        } \
    }'
    @ j++
end
Sorry, but I don't have a csh available to double-check for syntax errors.
So here is a one-liner that we determined worked in your environment.
rsh $ipaddress[$j] bdf | nawk -v thresh="$i" -v dev="$device[$j]" '/\/dev\/vg00/ { sub(/%/,"",$5) ; if ($5 > thresh) { file=substr($6,index($6,"/") + 1,length($6));print "#dbg:file="file; print "WARNING: " dev ":/" file " has reached " $5 ". Perform HouseKeeping IMMEDIATELY..." >> "/scripts/space.file.TMP" } }'
I don't have a system with bdf available. Change the two references to $5 in the sub() and if test to match the field-number of the output that has the percentage you want to test.
Note that -v var="value" is the standard way to pass a variable value from the shell to an awk script that is enclosed in single-quotes.
Be careful that any '\' chars at the end of a line are the last chars, no trailing space or tabs, or you'll get an indecipherable error msg. ;-)
IHTH
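For reference, a rough Bourne/POSIX sh sketch of the same idea (untested; it assumes the same ipadd file format and bdf output as above, and redirects rsh's stdin from /dev/null so rsh doesn't swallow the loop's input):
#!/bin/sh
thresh=90
while read dev ip; do
    rsh "$ip" bdf < /dev/null | awk -v thresh="$thresh" -v dev="$dev" '
        /\/dev\/vg00/ {
            sub(/%/, "", $5)
            if ($5 > thresh) {
                file = substr($6, index($6, "/") + 1)
                print "WARNING: " dev ":/" file " has reached " $5 "%." >> ("/scripts/space." file ".file")
            }
        }'
done < ipadd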

Is there any faster way to truncate column in Unix

I want to truncate the 4th column of a TSV file to a given length in Unix. The file has a few million records and is 8 GB in size.
I am trying this but it seems to be kind of slow.
awk -F"\t" '{s=substr($4,0,256); print $1"\t"$2"\t"$3"\t"s"\t"$5"\t"$6"\t"$7}' file > newFile
Are there any faster alternatives?
Thanks
Your command could be written a little more nicely (letting awk rebuild the record via OFS), which may give some performance increase:
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,256); print }' file > newFile
If you have access to a multi-core machine (which you probably do), you can use GNU parallel. You may want to vary the number of cores you use (I've set 4 here) and the block size that's fed to awk (I've set this to two megabytes)...
< file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,256); print }' > newFile
Here's some testing I did on my system using a 2.7G file with 100 million lines and a block size of 2M:
time awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2); print }' file >/dev/null
Results:
real 1m59.313s
user 1m57.120s
sys 0m2.190s
With one core:
time < file parallel -j 1 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2); print }' >/dev/null
Results:
real 2m28.270s
user 4m3.070s
sys 0m41.560s
With four cores:
time < file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2); print }' >/dev/null
Results:
real 0m54.329s
user 2m41.550s
sys 0m31.460s
With twelve cores:
time < file parallel -j 12 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2); print }' >/dev/null
Results:
real 0m36.581s
user 2m24.370s
sys 0m32.230s
I’ll assume that your file has exactly one space character between fields and no whitespace at the beginning of the line.  If that is wrong, this can be enhanced. 
Otherwise, this should work:
sed 's/^\([^ ]* [^ ]* [^ ]* [^ ]\{1,256\}\)[^ ]* /\1 /'
I haven’t actually tested it with 256-character-long data (I tested it with \{1,2\}), and I have no idea how its speed compares to that of awk. BTW, on some versions, you might need to leave off the backslashes from the curly braces and use just {1,256}.
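Since the input is actually tab-separated, here is a tab-aware variant of the same idea (an untested sketch; GNU sed understands \t here, elsewhere replace each \t with a literal tab):
sed 's/^\(\([^\t]*\t\)\{3\}[^\t]\{1,256\}\)[^\t]*/\1/' file > newFile
It keeps the first three fields plus up to 256 characters of the fourth and drops the rest of the fourth field; later fields are untouched because they begin with a tab.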
If Scott or Steve's solutions are still too slow, it may be time to break out the C. Run as ./a.out < file > newFile. Test on a small file with some long fields first; I am not 100% sure I have the math right.
#include <stdio.h>

int
main(void)
{
    int field = 1;      /* current field number (tab-separated) */
    int character = 0;  /* position within the current field */
    int c;

    while ((c = getchar()) != EOF)
    {
        switch (c)
        {
        case '\n':
            field = 1;
            character = 0;
            break;
        case '\t':
            character = 0;
            field++;
            break;
        default:
            character++;
            break;
        }
        /* drop characters of field 4 beyond the first 256 */
        if (field != 4 || character <= 256)
            putchar(c);
    }
    if (ferror(stdout) || fflush(stdout) || fclose(stdout))
    {
        perror("write");
        return 1;
    }
    return 0;
}
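To try the C version, a build-and-run sketch (the file name trunc.c is my choice):
cc -O2 -o trunc trunc.c
./trunc < file > newFile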

Faster Alternative to Unix Grep

I'm trying to do the following
$ grep ">" file.fasta > output.txt
But it takes very long when the input FASTA file is large.
The input file looks like this:
>seq1
ATCGGTTA
>seq2
ATGGGGGG
Is there a faster alternative?
Use the time command with each of these:
$ time grep ">" file.fasta > output.txt
$ time egrep ">" file.fasta > output.txt
$ time awk '/^>/{print $0}' file.fasta > output.txt    # if ">" is the first character on the line
If you compare the outputs, the timings are almost the same.
In my opinion, if the data is in columnar format, then use awk to search.
Hand-built state machine. If you only want '>' to be accepted at the beginning of the line, you'll need one more state. If you need to recognise '\r' too, you will need a few more states.
#include <stdio.h>

int main(void)
{
    int state, ch;

    for (state = 0; (ch = getc(stdin)) != EOF; ) {
        switch (state) {
        case 0: /* start: wait for '>' */
            if (ch == '>') state = 1;
            else break;
            /* fall through: echo the '>' just matched */
        case 1: /* echo through end of line */
            fputc(ch, stdout);
            if (ch == '\n') state = 0;
            break;
        }
    }
    if (state == 1) fputc('\n', stdout);
    return 0;
}
If you want real speed, you could replace fputc() with its macro equivalent putc() (getc() is already used above), but I think trivial programs like this will be I/O bound anyway.
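A build-and-run sketch, if you want to try it (the file name is my choice):
cc -O2 -o fasta_headers fasta_headers.c
./fasta_headers < file.fasta > output.txt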
For big files, grep can be sped up considerably with GNU parallel.
For your purposes, you may like to try:
cat file.fasta | parallel -j 4 --pipe --block 10M grep '^>' > output.txt
The above will use four cores and pass 10 MB blocks to grep. The block size is optional, but I find a 10 MB block size quite a bit faster on my system. YMMV.
HTH
Ack is a good alternative to grep for finding strings/regexes in code:
http://beyondgrep.com/

Removing trailing / starting newlines with sed, awk, tr, and friends

I would like to remove all of the empty lines from a file, but only when they are at the end/start of a file (that is, if there are no non-empty lines before them, at the start; and if there are no non-empty lines after them, at the end.)
Is this possible outside of a fully-featured scripting language like Perl or Ruby? I’d prefer to do this with sed or awk if possible. Basically, any light-weight and widely available UNIX-y tool would be fine, especially one I can learn more about quickly (Perl, thus, not included.)
From Useful one-line scripts for sed:
# Delete all leading blank lines at top of file (only).
sed '/./,$!d' file
# Delete all trailing blank lines at end of file (only).
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file
Therefore, to remove both leading and trailing blank lines from a file, you can combine the above commands into:
sed -e :a -e '/./,$!d;/^\n*$/{$d;N;};/\n$/ba' file
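A quick sanity check of the combined command:
printf '\n\nfoo\n\nbar\n\n\n' | sed -e :a -e '/./,$!d;/^\n*$/{$d;N;};/\n$/ba'
This prints foo and bar with the single interior blank line preserved and the leading/trailing blank lines removed.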
So I'm going to borrow part of @dogbane's answer for this, since that sed line for removing the leading blank lines is so short...
tac is part of coreutils, and reverses a file. So do it twice:
tac file | sed -e '/./,$!d' | tac | sed -e '/./,$!d'
It's certainly not the most efficient, but unless you need efficiency, I find it more readable than everything else so far.
Here's a one-pass solution in awk: it does not start printing until it sees a non-empty line, and when it sees an empty line, it remembers it until the next non-empty line.
awk '
    /[[:graph:]]/ {
        # a non-empty line
        # set the flag to begin printing lines
        p=1
        # print the accumulated "interior" empty lines
        for (i=1; i<=n; i++) print ""
        n=0
        # then print this line
        print
    }
    p && /^[[:space:]]*$/ {
        # a potentially "interior" empty line. remember it.
        n++
    }
' filename
Note, due to the mechanism I'm using to consider empty/non-empty lines (with [[:graph:]] and /^[[:space:]]*$/), interior lines with only whitespace will be truncated to become truly empty.
As mentioned in another answer, tac is part of coreutils, and reverses a file. Combining the idea of doing it twice with the fact that command substitution will strip trailing new lines, we get
echo "$(echo "$(tac "$filename")" | tac)"
which doesn't depend on sed. You can use echo -n to strip the remaining trailing newline off.
Here's an adapted sed version, which also considers "empty" those lines with just spaces and tabs on it.
sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'
It's basically the accepted answer's version (considering BryanH's comment), but the dot . in the first command was changed to [^[:blank:]] (anything not blank) and the \n inside the second command's address was changed to [[:space:]] to allow newlines, spaces and tabs.
An alternative version, without using the POSIX classes, but your sed must support inserting \t and \n inside […]. GNU sed does, BSD sed doesn't.
sed -e :a -e '/[^\t ]/,$!d; /^[\n\t ]*$/{ $d; N; ba' -e '}'
Testing:
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n'
foo
foo
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -n l
$
\t $
$
foo$
$
foo$
$
\t $
$
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'
foo
foo
prompt$
Using awk:
awk '{ a[NR]=$0; if ($0 && !s) s=NR; }
END {
    e=NR;
    for (i=NR; i>1; i--)
        if (a[i]) { e=i; break; }
    for (i=s; i<=e; i++)
        print a[i];
}' yourFile
This can be solved easily with GNU sed's -z option:
sed -rz 's/^\n+//; s/\n+$/\n/g' file
Hello
Welcome to
Unix and Linux
For an efficient, non-recursive version of the trailing-newline strip (including "white" characters), I've developed this sed script.
sed -n '/^[[:space:]]*$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^[[:space:]]*$/H'
It uses the hold buffer to store all blank lines and prints them only after it finds a non-blank line. Should someone want only truly empty lines (not whitespace-only ones) treated as blank, it's enough to get rid of the two [[:space:]]* parts:
sed -n '/^$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^$/H'
I've tried a simple performance comparison with the well-known recursive script
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba'
on a 3MB file with 1MB of random blank lines around a random base64 text.
shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M > bigfile
base64 </dev/urandom | dd bs=1 count=1M >> bigfile
shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M >> bigfile
The streaming script took roughly 0.5 second to complete, the recursive didn't end after 15 minutes. Win :)
For completeness' sake, the leading-lines-stripping sed script already streams fine. Use whichever is most suitable for you.
sed '/[^[:blank:]]/,$!d'
sed '/./,$!d'
Using bash ($(<file) already strips trailing newlines; with extglob enabled, the ## expansion below strips all leading ones):
$ shopt -s extglob
$ filecontent=$(<file)
$ echo "${filecontent##+($'\n')}"
In bash, using cat, wc, grep, sed, tail and head:
# number of the first line that contains a non-empty character
i=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | head -1`
# number of the last one
j=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | tail -1`
# overall number of lines:
k=`cat <your_file> | wc -l`
# how many empty lines at the end of the file do we have?
m=$(($k-$j))
# let's strip the last m lines!
cat <your_file> | head -n-$m
# now keep from line i onward (stripping the leading empty lines) and we are done 8-)
cat <your_file> | tail -n+$i
Man, it's definitely worth learning a "real" programming language to avoid that ugliness!
@dogbane has a nice simple answer for removing leading empty lines. Here's a simple awk command which removes just the trailing lines. Use this with @dogbane's sed command to remove both leading and trailing blanks.
awk '{ LINES=LINES $0 "\n"; } /./ { printf "%s", LINES; LINES=""; }'
This is pretty simple in operation.
Add every line to a buffer as we read it.
For every line which contains a character, print the contents of the buffer and then clear it.
So the only things that get buffered and never displayed are any trailing blanks.
I used printf instead of print to avoid the automatic addition of a newline, since I'm using newlines to separate the lines in the buffer already.
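A quick check of its behavior:
printf 'a\n\nb\n\n\n' | awk '{ LINES=LINES $0 "\n"; } /./ { printf "%s", LINES; LINES=""; }'
This prints a, the interior blank line, and b, and drops the two trailing blank lines.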
This AWK script will do the trick:
BEGIN {
    ne=0;
}
/^[[:space:]]*$/ {
    ne++;
}
/[^[:space:]]+/ {
    for (i=0; i < ne; i++)
        print "";
    ne=0;
    print
}
The idea is simple: empty lines do not get echoed immediately. Instead, we wait till we get a non-empty line, and only then do we first echo out as many empty lines as were seen before it, and then echo the new non-empty line.
perl -0pe 's/^\n+|\n+(\n)$/\1/gs'
Here's an awk version that removes trailing blank lines (both empty lines and lines consisting of nothing but white space).
It is memory efficient; it does not read the entire file into memory.
awk '/^[[:space:]]*$/ {b=b $0 "\n"; next;} {printf "%s",b; b=""; print;}'
The b variable buffers up the blank lines; they get printed when a non-blank line is encountered. When EOF is encountered, they don't get printed. That's how it works.
If using gnu awk, [[:space:]] can be replaced with \s. (See full list of gawk-specific Regexp Operators.)
If you want to remove only those trailing lines that are empty, see @AndyMortimer's answer.
A bash solution.
Note: Only useful if the file is small enough to be read into memory at once.
[[ $(<file) =~ ^$'\n'*(.*)$ ]] && echo "${BASH_REMATCH[1]}"
$(<file) reads the entire file and trims trailing newlines, because command substitution ($(....)) implicitly does that.
=~ is bash's regular-expression matching operator, and =~ ^$'\n'*(.*)$ optionally matches any leading newlines (greedily), and captures whatever comes after. Note the potentially confusing $'\n', which inserts a literal newline using ANSI C quoting, because escape sequence \n is not supported.
Note that this particular regex always matches, so the command after && is always executed.
The special array variable BASH_REMATCH contains the results of the most recent regex match, and array element [1] contains what the (first and only) parenthesized subexpression (capture group) captured, which is the input string with any leading newlines stripped. The net effect is that ${BASH_REMATCH[1]} contains the input file content with both leading and trailing newlines stripped.
Note that printing with echo adds a single trailing newline. If you want to avoid that, use echo -n instead (or use the more portable printf '%s').
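For example (building a small test file with printf, as elsewhere in this thread):
printf '\n\nfoo\nbar\n\n' > file
[[ $(<file) =~ ^$'\n'*(.*)$ ]] && echo "${BASH_REMATCH[1]}"
This prints foo and bar with the surrounding blank lines stripped.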
I'd like to introduce another variant for gawk v4.1+
result=($(gawk '
    BEGIN {
        lines_count = 0;
        empty_lines_in_head = 0;
        empty_lines_in_tail = 0;
    }
    /[^[:space:]]/ {
        found_not_empty_line = 1;
        empty_lines_in_tail = 0;
    }
    /^[[:space:]]*$/ {
        if ( found_not_empty_line ) {
            empty_lines_in_tail ++;
        } else {
            empty_lines_in_head ++;
        }
    }
    {
        lines_count ++;
    }
    END {
        print (empty_lines_in_head " " empty_lines_in_tail " " lines_count);
    }
' "$file"))
empty_lines_in_head=${result[0]}
empty_lines_in_tail=${result[1]}
lines_count=${result[2]}
if [ $empty_lines_in_head -gt 0 ] || [ $empty_lines_in_tail -gt 0 ]; then
    echo "Removing whitespace from \"$file\""
    eval "gawk -i inplace '
    {
        if ( NR > $empty_lines_in_head && NR <= $(($lines_count - $empty_lines_in_tail)) ) {
            print
        }
    }
    ' \"$file\""
fi
fi
Because I was writing a bash script anyway containing some functions, I found it convenient to write these:
function strip_leading_empty_lines()
{
    while IFS= read -r line; do
        if [ -n "$line" ]; then
            echo "$line"
            break
        fi
    done
    cat
}

function strip_trailing_empty_lines()
{
    acc=""
    while IFS= read -r line; do
        acc+="$line"$'\n'
        if [ -n "$line" ]; then
            echo -n "$acc"
            acc=""
        fi
    done
}
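With both functions defined in the same script, a usage sketch:
strip_leading_empty_lines < file | strip_trailing_empty_lines > newfile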
