I have two files with 1s and 0s in each column, where the field separator is ",":
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
What I want to do is look at the file in pairs of rows, compare them, and if they are exactly the same, output a 1. So for this example, rows 1 & 2 are different so they don't get a 1, rows 3 & 4 are exactly the same so they get a 1, rows 5 & 6 differ by one column so they don't get a 1, and so on.
So the desired output could be something like:
1
1
1
Because here there are exactly 3 pairs of rows that are exactly the same (rows are paired by being consecutive): rows 3 & 4, 7 & 8, and 9 & 10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
where every line whose line number is evenly divisible by two prints a 0 if the current line doesn't match the last value of p, or a 1 if it does.
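A labelled variant (not part of the original one-liner, just for illustration) makes it obvious which pair each 0/1 belongs to:
awk -F, '!(NR%2) {print "pair " NR/2 ": " ($0==p ? 1 : 0)} {p=$0}' data
pair 1: 0
pair 2: 1
pair 3: 0
pair 4: 1
pair 5: 1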
If you truly only want the 1s, which throws away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the count of all matched pairs:
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c}' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while read line
do
    if test "$prevline" = "$line"
    then
        echo 1
    fi
    prevline=$line
done
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing the inner part of the loop to
if test "$prevline" = "$line"
then
    echo 1
    line="" # don't reuse a line
fi
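Putting the two pieces together, a runnable sketch might look like this (assuming the data is in a file called data; IFS= and read -r stop the shell from mangling whitespace or backslashes):
#!/bin/sh
# print a 1 for each pair of identical consecutive lines,
# clearing the matched line so it is not reused as the previous line
while IFS= read -r line
do
    if test "$prevline" = "$line"
    then
        echo 1
        line=""    # don't reuse a line
    fi
    prevline=$line
done < data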
To convert rows into tab-delimited output, it's easy:
cat input.txt | tr "\n" "\t"
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess a sed approach that substitutes "\n" with "\t" only when the preceding string is a number won't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
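Before running anything over 84 million lines, it may be worth a quick sanity check (not part of the answers below, and assuming the file is called input.txt) that the input really does consist of 44-line records, each starting with a chr line:
awk 'NR % 44 == 1 && !/^chr/ { print "unexpected record start at line " NR }
     END { print NR " lines / 44 = " NR/44 " records" }' input.txt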
sed '
# add a marker
s/^/M/
:Next
# bump a counter
s/^/i/
# test the counter
/^\(i\)\{44\}/ !{
   $ !{
      # not yet the 44th line and not end of file: append the next line
      N
      # loop
      b Next
      }
   }
# remove counter and marker
s/^i*M//
# replace newlines by tabs
s/\n/\t/g' YourFile
Some sed implementations limit the \{n\} repetition count to 255, so 44 is fine.
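To see the mechanism on a small case, the same sed idea scaled down to 4 columns (assuming GNU sed, which accepts \t in the replacement; other seds need a literal tab) would be:
sed '
s/^/M/
:Next
s/^/i/
/^\(i\)\{4\}/ !{
   $ !{
      N
      b Next
      }
   }
s/^i*M//
s/\n/\t/g' file
On the small 8-line sample shown in the next answer this produces the same two tab-delimited rows.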
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output, it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" "\t" is a UUOC (useless use of cat) and should just be tr "\n" "\t" < input.txt
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr, append it to the hold space and then delete it, unless it is the last line. If the line does start with chr, or it is the last line, swap with the hold space, replace all newlines by tabs and print out the result.
N.B. After the swap, the current chr line (the start of the next record) is left untouched in what is now the hold space, ready for the following lines to be appended.
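Not one of the answers above, but as a cross-check of what the sed is doing, the same record-building idea can be sketched in awk, keying on the chr prefix instead of counting 44 lines:
awk '
# a chr line starts a new record: flush the previous one first
/^chr/ { if (rec != "") print rec; rec = $0; next }
# every other line is a value belonging to the current record
       { rec = rec "\t" $0 }
# flush the last record at end of input
END    { if (rec != "") print rec }
' file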