Awk to replace a column with another column if it meets criteria - unix

In psudocode, I'm trying to replace column 4 with column 1, if column 4 equals NULL. Otherwise keep column 4 the same. I'm currently trying this, which isn't changing the line and I'm not sure why:
Example data:
grep NULL matrix.txt
AAGGGCCCGGGGGG 0 0 3 NULL
grep NULL matrix.txt | awk -F/t '{ $34 = ($34 == "NULL" ? $1 : $34) } 1'
Should give me:
AAGGGCCCGGGGGG 0 0 3 AAGGGCCCGGGGGG
Thanks!

modify your awk
replace $34 with $5 as you need to check 5th field not 34th
Also /t is incorrect. It should be \t for tab. BTW it will also work without -F "\t"
echo "AAGGGCCCGGGGGG 0 0 3 NULL" | awk ' $5=="NULL"{$5=$1} 1'

Related

awk $4 column if column = value with characters thereafter

I have a file with the following data within for example:
20 V 70000003d120f88 1 2
20 V 70000003d120f88 2 2
20x00 V 70000003d120f88 2 2
10020 V 70000003d120f88 1 5
I want to get the sum of the 4th column data.
Using the the below command, I can acheive this, however the row 20x00 is excluded. I want to everything to start with 20 must be sumed and nothing before that, so 20* for example:
cat testdata.out | awk '{if ($1 == '20') print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The output value must be:
5
How can I achieve this using awk. The below I attempted also does not work:
cat testdata.out | awk '$1 ~ /'20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
There is no need to use 3 processes, anything can be done by one AWK process. Check it out:
awk '$1 ~ /^20/ { a+=$4 } END { print a }' testdata.out
explanation:
$1 ~ /^20/ checks to see if $1 starts with 20
if yes, we add $4 in the variable a
finally, we print the variable a
result 5
EDIT:
Ed Morton rightly points out that the result should always be of the same type, which can be solved by adding 0 to the result.
You can set the exit status if it is necessary to distinguish whether the result 0 is due to no matches
(output status 0) or matching only zero values ​​(output status 1).
The exit code for different input data can be checked e.g. echo $?
The code would look like this:
awk '$1 ~ /^20/ { a+=$4 } END { print a+0; exit(a!="") }' testdata.out
Figured it out:
cat testdata.out | awk '$1 ~ /'^20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The above might not work for all cases, but below will suffice:
i=20
cat testdata.out | awk '{if ($1 == "'"$i"'" || $1 == ""'"${i}"'"x00") print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'

Splitting text file based on column value in unix

I have a text file:
head train_test_split.txt
1 0
2 1
3 0
4 1
5 1
What I want to do is save the first column values for which second column value is 1 to file train.txt.
So, the corresponding first column value for second column value with 1 are: 2,4,5. So, in my train.txt file i want:
2
4
5
How can I do this easily unix?
You can use awk for this:
awk '$2 == 1 { print $1 }' inputfile
That is,
$2 == 1 is a filter,
matching lines where the 2nd column is 1,
and print $1 means to print the first column.
In Perl:
$ perl -lane 'print "$F[0]" if $F[1]==1' file
Or GNU grep:
$ grep -oP '^(\S+)(?=[ \t]+1$)' file
But awk is the best. Use awk...

Comparing consecutive rows within an file

I have two files with with 1s and 0s in each column, where the field separator is "," :
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
Want I want to do is look at the file in pairs of rows, compare them, and if they are exactly the same output a 1. So for this example the rows 1 & 2 are different so they don't get a 1, rows 3 & 4 are exactly the same so they get a 1, and rows 5&6 differ by 1 column so they don't get a 1, and so on.
So the desired output could be something like :
1
1
1
Because here there are exactly 3 pairs (they are paired by the fact if they are consecutive) of rows that are exactly the same: rows 3&4, 7&8, and 9&10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
where every line that's evenly divisible by two will print a 0 if the current line doesn't match the last value for p or a 1 if it matches.
If you truly only want the 1s, which is throwing away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the counts of all matched pairs:
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c}' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while read line
do
if test "$prevline" = "$line"
then
echo 1
fi
prevline=$line
done
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing inner part of the loop to
if test "$prevline" = "$line"
then
echo 1
line="" # don't reuse a line
fi

How to convert multiple lines into fixed column lengths

To convert rows into tab-delimited, it's easy
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed and substituting "\n" for "\t" if the string preceding is a number doesn't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry alot of numbers to type manually)
sed '
# use a delimiter
s/^/M/
:Next
# put a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not 44 line or end of file, add the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace new line by tab
s/\n/ /g' YourFile
some limite if more than 255 tab on sed (so 44 is ok)
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output it's because they are present in your input so use dos2unix or similar to remove them before running the tool or with GNU awk you could just set -v RS='\n\r'.
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC and should just be tr "\n" " " < input.txt
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr append it to the hold space and then delete it unless it is the last. If the line does start chr or it is the last line, then swap to the hold space and replace all newlines by tabs and print out the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.

How to print the last but one record of a file using awk?

How to print the last but one record of a file using awk?
Something like:
awk '{ prev_line=this_line; this_line=$0 } END { print prev_line }' < file
Essentially, keep a record of the line before the current one, until you hit the end of the file, then print the previous line.
edit to respond to comment:
To just extract the second field in the penultimate line:
awk '{ prev_f2=this_f2; this_f2=$2 } END { print prev_f2 }' < file
You can do it with awk but you may find that:
tail -2 inputfile | head -1
will be a quicker solution - it grabs the last two lines of the complete set then the first of those two.
The following transcript shows how this works:
pax$ echo '1
> 2
> 3
> 4
> 5' | tail -2 | head -1
4
If you must use awk, you can use:
pax$ echo '1
2
3
4
5' | awk '{last = this; this = $0} END {print last}'
4
It works by keeping the last and current line in variables last and this and just printing out last when the file is finished.

Resources