Subtracting lines in one file from another file - unix

I couldn't find an answer that truly subtracts one file from another.
My goal is to remove lines in one file that occur in another file.
Multiple occurrences should be respected: for example, if a line occurs 4 times in file A and only once in file B, file C should contain 3 copies of that line.
File A:
1
3
3
3
4
4
File B:
1
3
4
File C (desired output)
3
3
4
Thanks in advance

In awk:
$ awk 'NR==FNR{a[$0]--;next} ($0 in a) && ++a[$0] > 0' f2 f1
3
3
4
Explained:
NR==FNR {                  # for each record in the first file listed (f2)
    a[$0]--                # decrement a[record] (starting from 0) once per occurrence
    next
}
($0 in a) && ++a[$0] > 0   # for f1: if the record is in a, increment a[record];
                           # print it once the removals counted from f2 are used up
If you also want to print lines of f1 that do not occur in f2 at all, you can drop the ($0 in a) && test:
$ echo 5 >> f1
$ awk 'NR==FNR{a[$0]--;next} (++a[$0] > 0)' f2 f1
3
3
4
5
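The awk approach above does not need sorted input and keeps the order of the first file. Here is a minimal sketch of that, assuming two hypothetical scratch files a and b (the file names, the array name seen and the variable subtract_out are illustrative only):

```shell
# Hypothetical unsorted sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 4 3 5 1 3 4 3 > "$tmp/a"   # file to subtract from
printf '%s\n' 3 4 1         > "$tmp/b"   # lines to remove, one copy each

# Multiset difference a - b: every line of b cancels one matching line of a.
# Lines of a that never occur in b (here: 5) are printed as well.
subtract_out=$(awk 'NR==FNR { seen[$0]--; next } ++seen[$0] > 0' "$tmp/b" "$tmp/a")
printf '%s\n' "$subtract_out"

rm -rf "$tmp"
```

The earliest occurrences in a are the ones that get cancelled, so the surviving lines come out as 5, 3, 4, 3. One caveat: if the file to remove is empty, NR==FNR also holds while reading the other file, so nothing is printed.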

If the input files are already sorted, as in the sample, comm is better suited:
$ comm -23 f1 f2
3
3
4
option description from man page:
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
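If the files are not sorted, one option is to sort them first. This is a sketch only, assuming the original line order does not matter and using hypothetical scratch file names; LC_ALL=C is set so that sort and comm agree on the collation order:

```shell
# Hypothetical unsorted inputs in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 4 3 1 3 4 3 > "$tmp/f1"
printf '%s\n' 3 4 1       > "$tmp/f2"

# comm expects both inputs sorted in the same collation order.
LC_ALL=C sort "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort "$tmp/f2" > "$tmp/f2.sorted"

# Keep only column 1: lines of f1 left over after pairing with lines of f2.
comm_out=$(LC_ALL=C comm -23 "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$comm_out"

rm -rf "$tmp"
```

Unlike the awk version, the result comes out in sorted order rather than in the order of f1.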

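You can do this (note that it compares the first field, $1, rather than the whole line):
awk 'NR==FNR{++cnt[$1]   # first file listed (f2): count how often each value occurs
next}
cnt[$1]-->0{next}        # f1: skip the line while removals for this value remain
1' f2 f1                 # the pattern 1 prints every line that was not skipped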

Related

Merge and remove redundant lines *among* files

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:
File1.txt
1
2
3
3
4
5
6
File2.txt
6
7
8
8
9
File3.txt
9
10
10
11
The desired output would be:
1
2
3
3
4
5
6
7
8
8
9
10
10
11
I would prefer a solution in awk, bash, or R. I searched the web and, though there were plenty of solutions (please find some examples below), they all removed duplicated lines regardless of whether the duplicates were within a file or across files.
Thanks in advance.
Arturo
Examples of previous solutions removing redundant lines both within and outside files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate
With your shown samples, please try the following. It keeps duplicate lines within a file but drops any line that already appeared in an earlier file.
awk '
FNR==1{
for(key in current){
total[key]
}
delete current
}
!($0 in total)
{
current[$0]
}
' file1.txt file2.txt file3.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if its first line(of each file) then do following.
for(key in current){ ##Traverse through current array here.
total[key] ##placing index of current array into total(for all files) one.
}
delete current ##Deleting current array here.
}
!($0 in total) ##Pattern with no action: print the current line if it is NOT present in total.
{ ##Block with no pattern: runs for every line.
current[$0] ##Place current line into current array.
}
' file1.txt file2.txt file3.txt ##Mentioning Input_file names here.
Here's a trick adding on to https://stackoverflow.com/a/15385080/3358272 using diff and its output format. There is likely a presumption of "sorted" here, untested.
out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "$@" ; do
{ cat "${out}" ;
diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
} > "${tmpout}"
mv "${tmpout}" "${out}"
done
cat "${out}"
Output:
$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11
$ diff <(./question.sh F*) Output.txt
(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)

get the top N counts based on the value of another column

I have a file with two columns that looks like this:
a 3
a 7
b 6
a 6
b 1
b 8
c 1
b 1
For each value in the first column, I'd like to find the top N counts from the second column. Using this example, I'd like to find the top 2 values in column 2 for each string in column 1. The desired output would be:
a 7
a 6
b 8
b 6
c 1
I was trying to do something like this with awk, but I am not that familiar with it. This gives the max, not top N:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}'
Please try the following sort + awk solution.
sort -k2 -s -nr Input_file | awk '++array[$1]<=2' | sort -k1,1 -k2,2nr
OR
sort -k2 -s -nr Input_file | sort -k1,1 -k2,2nr | awk '++array[$1]<=2'
Brief explanation: in both variants the data reaches awk with the 2nd field in descending numeric order, so ++array[$1]<=2 keeps only the 2 largest values for each value of the 1st field. The sort on -k1,1 -k2,2nr puts the lines in the order of the OP's sample output.
$ sort -k1,1 -k2,2rn file | awk -v n=2 '(++cnt[$1])<=n'
a 7
a 6
b 8
b 6
c 1
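To make N a parameter, the sort + awk pipeline can be wrapped in a small function. This is only a sketch: the function name top_n and the scratch file are made up for illustration, and it assumes whitespace-separated input with the key in column 1 and the number in column 2.

```shell
# Hypothetical sample file holding the data from the question.
tmp=$(mktemp -d)
printf '%s\n' 'a 3' 'a 7' 'b 6' 'a 6' 'b 1' 'b 8' 'c 1' 'b 1' > "$tmp/file"

# usage: top_n N FILE
# Sort by key, then by value (descending, numeric); keep the first N rows per key.
top_n() {
  sort -k1,1 -k2,2nr "$2" | awk -v n="$1" '++cnt[$1] <= n'
}

top_out=$(top_n 2 "$tmp/file")
top1_out=$(top_n 1 "$tmp/file")
printf '%s\n' "$top_out"

rm -rf "$tmp"
```

Because awk sees each key's rows largest first, the counter cnt[$1] is enough to decide which rows to keep; no per-key list has to be stored.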

Checking on equal values of 2 different data frame row by row

I have 2 different data frames: one is 5.5 MB and the other is 25 GB. I want to check whether these two data frames have the same values in 2 columns for each row.
For e.g.
x 0 0 a
x 1 2 b
y 1 2 c
z 3 4 d
and
x 0 0 w
x 1 2 m
y 5 6 p
z 8 9 q
I want to check whether the 2nd and 3rd columns are equal for each row; if so, I return the 4th column of both data frames. Then I should have:
a w
b m
c m
The 2 data frames are sorted by the 2nd and 3rd column values. I tried in R, but the 2nd file (25 GB) is too big. How can I obtain this new file in a "faster" way (even a few hours would be fine)?
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR { a[$2,$3][$4]; next }
($2,$3) in a {
for (val in a[$2,$3]) {
print val, $4
}
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
and with any awk (a bit less efficiently):
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] FS $4; next }
($2,$3) in a {
split(a[$2,$3],vals)
for (i in vals) {
print vals[i], $4
}
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
When reading small_file (NR==FNR is only true for the first file read; look up those variables in the awk man page or google), the above creates an associative array a[] that maps an index built from the 2nd and 3rd fields to the list of 4th-field values seen for that combination. Then, when reading large_file, it looks up the current 2nd/3rd field combination in that array and loops through all of the values stored for it in the previous phase, printing each stored value (the $4 from small_file) plus the current $4.
You said your small file is 5.5 MB and the large file is 25 GB. Since 1 MB is 1,048,576 bytes when counted in binary units (see https://www.computerhope.com/issues/chspace.htm) and each of your lines is about 8 characters long, that works out to roughly 131 thousand lines per MB and 134 million lines per GB, so your small file is about 720 thousand lines long and your large one about 3.4 billion lines. Only the small file is held in memory and the large file is read once from start to finish, so I expect the run time to be governed mostly by how fast 25 GB can be read, which should be well within the hours you are prepared to wait.
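The line-count estimate can be worked through with shell arithmetic. This is a rough sketch, assuming binary units (1 MB = 1,048,576 bytes, 1 GB = 1,073,741,824 bytes) and 8 bytes per line (7 characters plus the newline, as in the sample rows); the variable names are illustrative:

```shell
bytes_per_line=8                        # assumed: 7 characters plus a newline

small_bytes=$(( 55 * 1048576 / 10 ))    # 5.5 MB in bytes
large_bytes=$(( 25 * 1073741824 ))      # 25 GB in bytes

small_lines=$(( small_bytes / bytes_per_line ))
large_lines=$(( large_bytes / bytes_per_line ))

echo "small file: about $small_lines lines"
echo "large file: about $large_lines lines"
```

With these assumptions the small file comes to 720,896 lines and the large one to 3,355,443,200 lines. Real rows with longer fields would lower both counts.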
An alternative to the solution of Ed Morton, but with an identical idea:
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] $4 ORS; next }
($2,$3) in a {
s=a[$2,$3]; gsub(ORS,OFS $4 ORS,s)
printf "%s",s;
}
$ awk -f tst.awk small_file large_file
a w
b m
c m

Adding an extra line to a text file after every N lines

Hi, I have a Unix command that produces a list of IP addresses along with other columns of information. I want to add something to the command so that it displays the output as a set of 3 lines, then a blank line or ----, then the next 3 lines, and so on.
How can I achieve this?
for example:
1.2.3.4 xy
1.3.5.7 ab
1.25.7.9 cd
-------------
1.25.7.8 kl
1.3.4.5 mn
1.25.7.8 op
-------------
1.24.5.6 la
1.3.4.5 ka
1.25.7.8 xz
You can use awk to print an extra line after every 40 lines:
awk '{ print $0; if(++i % 40 == 0) printf("-------------\n") }' file
Change % 40 to % 3 to print that extra line after every 3 lines.
$ seq 9 | awk 'NR>1 && (NR%3)==1{print "---"} 1'
1
2
3
---
4
5
6
---
7
8
9
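Both answers can be generalised so that the group size and the separator are parameters. A minimal sketch, assuming a hypothetical helper named every_n that reads standard input; it tests (NR - 1) % n so that n=1 also works, and it prints no separator after the last group:

```shell
# usage: some_command | every_n N [SEPARATOR]
every_n() {
  n=$1
  sep=${2:-'---'}                      # default separator if none is given
  # Print the separator before line n+1, 2n+1, ...; the trailing 1 prints every line.
  awk -v n="$n" -v sep="$sep" 'NR > 1 && (NR - 1) % n == 0 { print sep } 1'
}

every_out=$(printf '%s\n' 1 2 3 4 5 6 7 | every_n 3)
printf '%s\n' "$every_out"
```

Printing the separator before the first line of each new group, rather than after the last line of each group, is what avoids a dangling separator at the end of the output.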

Make a file counting instances in sets of 5

I have a file that looks like this:
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
...with column 1 stretching on to 22, and several thousand entries in total.
I want to create a file that lists bins of 5 million from column 4 which have data, separating by column 1.
Basically, all but column 1 and 4 can be discarded. A simple input would look like this:
InputChr1:
61733
114771
21116837
21127670
21199392
InputChr2:
157112959
157113214
257145829
357156412
457156575
So, for the example above, I would want to get two files that look like this:
OutputChr1.txt
Start End Occurrences
1 5000000 2
20000001 25000000 3
OutputChr2.txt
Start End Occurrences
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
Any ideas? It seems like something that should be doable with lapply in R, but I can't get the for loops to work...
EDIT: Actually, I made this look much harder than it needed to be - basically, I want to split the original file by column 1, extract the data in column 4, and then count the instances in bins of 5 million.
(Apologies for slightly random tags, just trying to think of which tools might be best!)
Well, this turned out to be quite challenging. I couldn't find a way to do it with a single awk command, though.
awk -v const=5000000 -v max=150 '
{a[$1,int($4/const)]++; b[$1]}
END{for (i in b)
{for (j=0; j<max; j++)
print i, j*const +1, (j+1)*const, a[i,j]
}
}' file
And then to get only the results:
awk 'NF==4'
Explanation
-v const=5000000 -v max=150 give the variables. const is the 5 million value to split the results. max is the biggest number up to which we will look for info in the END block.
a[$1,int($4/const)]++ creates an array indexed by (1st field, bin number). The bin number int($4/const) maps 23432 --> 0, 6000000 --> 1, etc. That is, it tells which block of values each 4th-column entry falls in.
b[$1] keeps track of the first-column values that have been processed.
END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}} prints the values.
awk 'NF==4' prints only those lines that have 4 columns. This way it outputs only the bins in which there were matches.
In case you want to store the values into a new file, you can do
awk 'NF==4 {print > ("OutputChr" $1 ".txt")}'
Sample output
$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1
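One boundary detail is worth checking. The desired output uses inclusive ranges such as 1 to 5000000, but int($4/const) puts a position of exactly 5000000 into the next bin. Here is a small sketch, assuming 1-based positions and a hypothetical helper named bin_range, that subtracts 1 before dividing:

```shell
# usage: bin_range POSITION [BIN_SIZE]
# Prints the inclusive start and end of the bin that contains POSITION.
bin_range() {
  awk -v pos="$1" -v size="${2:-5000000}" 'BEGIN {
    b = int((pos - 1) / size)          # 0-based bin index for 1-based positions
    print b * size + 1, (b + 1) * size
  }'
}

bin_range 61733      # a position inside the first bin
bin_range 21116837   # a position inside the fifth bin
bin_range 5000000    # the last position of the first bin
```

Whether this adjustment is wanted depends on how the bins are meant to be defined. For the sample data in the question both formulas give the same counts.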
All in one
awk '{ v=int($4/const) # bin index of column $4
a[$1 FS v]++
if (!($1 in min) || v<min[$1]) min[$1]=v # smallest bin index seen for group $1
if (!($1 in max) || v>max[$1]) max[$1]=v # largest bin index seen for group $1
}END{ for (i in min)
for (j=min[i];j<=max[i];j++) # loop only over the bins between the min and max value.
if ((i FS j) in a) print j*const+1,(j+1)*const,a[i FS j] > ("OutputChr" i ".txt") # if the bin has data, print it to file "OutputChr" i ".txt"
}' const=5000000 file
result:
$ cat OutputChr1.txt
1 5000000 2
20000001 25000000 3
$ cat OutputChr2.txt
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
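The question also asked for a "Start End Occurrences" header in each output file. Here is a sketch of one way to add it, using a few sample rows in a hypothetical scratch directory. The names size, prefix, count, seen and out are illustrative, and the bin index uses ($4 - 1) on the assumption that bins are 1-based and inclusive:

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '1 rs531842 503939 61733 G A' \
  '1 rs10494103 35025 114771 C T' \
  '1 rs17038458 254490 21116837 G A' \
  '2 rs16840461 233620 157112959 A G' > "$tmp/input"

# Step 1: count rows per (chromosome, bin). Step 2: sort. Step 3: write one file per chromosome.
awk -v size=5000000 '
  { count[$1, int(($4 - 1) / size)]++ }   # 1-based bins: 1..size, size+1..2*size, ...
  END {
    for (key in count) {
      split(key, part, SUBSEP)            # part[1] = chromosome, part[2] = bin index
      printf "%s %d %d %d\n", part[1], part[2] * size + 1, (part[2] + 1) * size, count[key]
    }
  }' "$tmp/input" |
sort -k1,1n -k2,2n |
awk -v prefix="$tmp/OutputChr" '
  { out = prefix $1 ".txt" }
  !(out in seen) { seen[out]; print "Start End Occurrences" > out }   # header once per file
  { print $2, $3, $4 > out }'

chr1_out=$(cat "$tmp/OutputChr1.txt")
chr2_out=$(cat "$tmp/OutputChr2.txt")
printf '%s\n' "$chr1_out"

rm -rf "$tmp"
```

Sorting between the two awk steps gives the rows in each file in ascending bin order, which the for (key in count) loop does not guarantee on its own.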

Resources