Splitting text file based on column value in unix - unix

I have a text file:
head train_test_split.txt
1 0
2 1
3 0
4 1
5 1
What I want to do is save the first-column values for which the second-column value is 1 to the file train.txt.
The first-column values on rows where the second column is 1 are 2, 4, and 5, so in my train.txt file I want:
2
4
5
How can I do this easily in Unix?

You can use awk for this:
awk '$2 == 1 { print $1 }' inputfile > train.txt
That is,
$2 == 1 is a filter,
matching lines where the 2nd column is 1,
print $1 means to print the first column,
and > train.txt redirects the result into the file you want.
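If the rows with 0 in the second column are needed as well, the same filter idea can route both groups in a single pass. This is only a sketch: the sample data is recreated in a temporary directory, and test.txt is an assumed file name that the question never mentions.

```shell
# Hypothetical sample data, same layout as train_test_split.txt
tmp=$(mktemp -d)
printf '%s\n' '1 0' '2 1' '3 0' '4 1' '5 1' > "$tmp/train_test_split.txt"

# Send column 1 to train.txt when column 2 is 1, otherwise to test.txt
# (test.txt is an assumed name; the question only asks for train.txt).
awk -v train_out="$tmp/train.txt" -v test_out="$tmp/test.txt" \
    '{ print $1 > ($2 == 1 ? train_out : test_out) }' "$tmp/train_test_split.txt"

train_ids=$(cat "$tmp/train.txt")
test_ids=$(cat "$tmp/test.txt")
printf 'train:\n%s\ntest:\n%s\n' "$train_ids" "$test_ids"
```

The output file is chosen by an expression in parentheses after >, which keeps the redirection unambiguous across awk implementations.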

In Perl:
$ perl -lane 'print "$F[0]" if $F[1]==1' file
Or GNU grep:
$ grep -oP '^(\S+)(?=[ \t]+1$)' file
But awk is the best. Use awk...

Related

Finding amount of sequence matches per line

I'm looking to use grep or something similar to count the matches of a 5-letter sequence (AATTC) in every line of a file, and then print the results to a new file. For example:
File 1:
GGGGGAATTCGAATTC
GGGGGAATTCGGGGGG
GGGGGAATTCCAATTC
Then the other file lists the number of matches, line by line:
File 2:
2
1
2
Awk solution:
awk '{ print gsub(/AATTC/,"") }' file1 > file2
The gsub() function returns the number of substitutions made
$ cat file2
2
1
2
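Because gsub() modifies the line it counts in, a variant that replaces each match with itself ("&") can print the count next to the untouched line. A small sketch, with the sample file recreated in a temporary directory:

```shell
tmp=$(mktemp -d)
printf '%s\n' GGGGGAATTCGAATTC GGGGGAATTCGGGGGG GGGGGAATTCCAATTC > "$tmp/file1"

# gsub() edits $0 in place, so replace each match with itself ("&")
# to count matches while leaving the line intact for printing.
counts=$(awk '{ n = gsub(/AATTC/, "&"); print n "\t" $0 }' "$tmp/file1")
printf '%s\n' "$counts"
```

Like the original one-liner, this counts non-overlapping matches only.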
If you have to use grep, then put it in a while loop and redirect the loop's output once, so that file2 is overwritten rather than appended to:
$ while read -r line; do grep -o 'AATTC'<<<"$line"|wc -l; done < file1 > file2
$ cat file2
2
1
2
Another way, using Perl (the + 0 makes a line with no match print 0 instead of an empty line):
$ perl -lne 'print s/AATTC/x/g + 0' file1 > file2

Awk to replace a column with another column if it meets criteria

In pseudocode, I'm trying to replace column 4 with column 1 if column 4 equals NULL, and otherwise keep column 4 the same. I'm currently trying this, which isn't changing the line, and I'm not sure why:
Example data:
grep NULL matrix.txt
AAGGGCCCGGGGGG 0 0 3 NULL
grep NULL matrix.txt | awk -F/t '{ $34 = ($34 == "NULL" ? $1 : $34) } 1'
Should give me:
AAGGGCCCGGGGGG 0 0 3 AAGGGCCCGGGGGG
Thanks!
Modify your awk command:
Replace $34 with $5, since in your sample row the field to check is the 5th, not the 34th.
Also, /t is incorrect; it should be \t for a tab. It will also work without -F "\t", because awk splits on whitespace by default.
echo "AAGGGCCCGGGGGG 0 0 3 NULL" | awk ' $5=="NULL"{$5=$1} 1'
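When the real file is tab-separated and has to stay that way, setting both the input and output separators avoids awk rejoining the fields with spaces after the assignment. A sketch on a hypothetical tab-separated copy of the sample row:

```shell
# Hypothetical tab-separated row matching the example data
row=$(printf 'AAGGGCCCGGGGGG\t0\t0\t3\tNULL')

# -F'\t' splits on tabs; OFS='\t' keeps tabs when awk rebuilds the line
# after the assignment to $5.
fixed=$(printf '%s\n' "$row" | awk -F'\t' -v OFS='\t' '$5 == "NULL" { $5 = $1 } 1')
printf '%s\n' "$fixed"
```

Without OFS, the modified line would come out space-separated, which can break later tab-based processing.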

How do I list specific lines using awk?

I have a file that is sorted like the following:
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
I want to find the highest value in the file and display the lines that contain it:
3 Goodbye
3 Begin
3 Yes
3 No
How would I do this?
awk to the rescue!
$ awk 'FNR==NR{if(max<$1) max=$1; next} $1==max' file{,}
3 Goodbye
3 Begin
3 Yes
3 No
This is a double pass: the first pass finds the maximum, and the second prints only the lines that match it.
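The two-pass idea does not depend on the file being sorted, and the file name can simply be written twice in shells without brace expansion. A sketch on a hypothetical shuffled copy of the data; the NR == 1 test is an addition so that values at or below zero would also be handled:

```shell
tmp=$(mktemp -d)
# Hypothetical unsorted copy of the sample data
printf '%s\n' '3 Goodbye' '2 Good' '3 Begin' '2 Hello' '3 Yes' '3 No' > "$tmp/file"

# Pass 1 (NR == FNR): track the largest first field.
# Pass 2: print the lines whose first field equals it.
top=$(awk 'NR == FNR { if (NR == 1 || $1 > max) max = $1; next } $1 == max' \
    "$tmp/file" "$tmp/file")
printf '%s\n' "$top"
```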
sort -rn file.txt | awk '{if ($1>=prev) {print $0; prev=$1}}'
3 Yes
3 No
3 Goodbye
3 Begin
Assuming file.txt contains
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
First get the highest value in the file into a variable. Since the file is already sorted, pick up the last line of the file, then parse out the number using awk.
highest=$(tail -n 1 file.txt | awk '{print $1}')
Then grep the file using that value.
grep "^${highest} " file.txt
This should do the job. I am only using awk, as required in the question:
awk 'BEGIN {v=0} $1>v {v=$1; l=$0; next} $1==v {l = l "\n" $0} END {print l}' file.txt
The variable v is initialized (before parsing the file) to 0. For each line, if the first field ($1) is greater than v, v is updated and l is reset to that line; if it equals v, the line is appended to l. At the end, the content of l is printed.
If you already know that the highest value is 3, it's easier than you think:
awk '$1 == 3' file
3 Goodbye
3 Begin
3 Yes
3 No

Comparing consecutive rows within a file

I have two files with 1s and 0s in each column, where the field separator is ",":
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
What I want to do is look at the file in pairs of rows, compare them, and output a 1 if they are exactly the same. In this example, rows 1 & 2 are different so they don't get a 1, rows 3 & 4 are exactly the same so they get a 1, rows 5 & 6 differ by one column so they don't get a 1, and so on.
So the desired output could be something like :
1
1
1
Because here there are exactly 3 pairs of rows that are exactly the same (rows are paired by being consecutive): rows 3 & 4, 7 & 8, and 9 & 10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
where every even-numbered line prints a 1 if it matches the previous line (stored in p) and a 0 if it doesn't.
If you truly only want the 1s, which is throwing away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the count of all matched pairs (c+0 prints 0 rather than an empty line when nothing matches):
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c+0 }' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while IFS= read -r line
do
    if test "$prevline" = "$line"
    then
        echo 1
    fi
    prevline=$line
done < data
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing inner part of the loop to
if test "$prevline" = "$line"
then
echo 1
line="" # don't reuse a line
fi
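If the pairs have to be strictly rows 1 & 2, 3 & 4, and so on, one option is to read two lines per loop iteration, so a row can never be shared between pairs. This is a sketch on hypothetical shortened rows, not the answer's own method:

```shell
tmp=$(mktemp -d)
# Hypothetical shortened rows; the layout follows the question's data
printf '%s\n' '1,0,0' '0,1,0' '1,0,1' '1,0,1' '1,1,0' '1,1,1' '0,0,0' '0,0,0' > "$tmp/data"

# Each iteration consumes two lines, so no row is shared between pairs.
# An unpaired last line (odd line count) is ignored.
matches=$(while IFS= read -r first && IFS= read -r second; do
    if [ "$first" = "$second" ]; then
        echo 1
    fi
done < "$tmp/data")
printf '%s\n' "$matches"
```

Here pairs 3 & 4 and 7 & 8 match, so two 1s are printed.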

How to convert multiple lines into fixed column lengths

To convert rows into tab-delimited, it's easy
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess using sed to substitute "\n" with "\t" whenever the preceding string is a number won't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, that's a lot of numbers to type manually)
sed '
# use a delimiter
s/^/M/
:Next
# put a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not 44 line or end of file, add the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace new line by tab
s/\n/\t/g' YourFile
The \t in the replacement is understood by GNU sed. Some sed implementations have a limit when there are more than 255 tabs (so 44 is fine).
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output, it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
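Where neither dos2unix nor GNU awk is available, the carriage return can probably be stripped inside the same awk command with sub(). A sketch using the 4-column example, with a hypothetical CRLF input file created on the fly:

```shell
tmp=$(mktemp -d)
# Hypothetical CRLF input, 4 lines per record as in the example above
printf '%s\r\n' 'chr10_1000103_+' 0.932203 0.956522 1 \
    'chr10_1000104_-' 0.952381 1 1 > "$tmp/file"

# sub() drops a trailing carriage return before the line is joined
rows=$(awk '{ sub(/\r$/, ""); printf "%s%s", $0, (NR % 4 ? "\t" : "\n") }' "$tmp/file")
printf '%s\n' "$rows"
```

As before, change 4 to 44 for the real input.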
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC and should just be tr "\n" " " < input.txt
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr append it to the hold space and then delete it unless it is the last. If the line does start chr or it is the last line, then swap to the hold space and replace all newlines by tabs and print out the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.
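Whichever approach is used, with a file this large it may be worth checking the result before relying on it. A sketch that reports rows with the wrong number of tab-separated fields or a first field not starting with chr; output.txt and the cols variable are assumed names, and the sample uses 4 columns instead of 44:

```shell
tmp=$(mktemp -d)
# Hypothetical reshaped output: one good row, one row that is a column short
printf 'chr10_1000103_+\t0.932203\t0.956522\t1\n' > "$tmp/output.txt"
printf 'chr10_1000104_-\t0.952381\t1\n' >> "$tmp/output.txt"

# Report rows whose field count differs from the expected width
# (cols=4 here; use 44 for the real data) or whose key lacks the chr prefix.
bad=$(awk -F'\t' -v cols=4 \
    'NF != cols || $1 !~ /^chr/ { print NR ": " NF " fields" }' "$tmp/output.txt")
printf '%s\n' "$bad"
```

No output means every row has the expected shape.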
