How to delete duplicated rows in one file based on a common field between two files with AWK?

I have two files:
File 1 contains 3 fields.
File 2 contains 4 fields.
The number of rows in File 1 is much smaller than that of File 2.
I would like to compare the two files based on the 1st field, with the following operation:
If the first field in any row of File 1 appears in the first field of a row in File 2, don't print that row of File 2.
Any advice would be appreciated.
Input File 1
S13109 3739 31082
S45002 3800 31873
S43722 3313 26638
Input File 2
S13109 3738 31081 0
S13109 3737 31080 0
S00033 3008 29985 0
S00033 3007 29984 0
S00022 4130 31838 0
S00022 4129 31837 0
S00188 3317 27372 0
S45002 3759 31832 0
S45002 3758 31831 0
S45002 3757 31830 0
S43722 3020 26345 0
S43722 3019 26344 0
S00371 3737 33636 0
S00371 3736 33635 0
Desired Output
S00033 3008 29985 0
S00033 3007 29984 0
S00022 4130 31838 0
S00022 4129 31837 0
S00188 3317 27372 0
S00371 3737 33636 0
S00371 3736 33635 0

awk 'FNR==NR{a[$1]++;next}!a[$1]' file1 file2
How it works:
FNR==NR
When you give awk two (or more) input files, FNR resets back to 1 on the first line of each new file, whereas NR keeps incrementing from where it left off. By checking FNR==NR we are essentially checking whether we are currently parsing the first file.
a[$1]++
If we are parsing the first file (see above), then record the first field $1 as a key in an associative array and post-increment its value by 1. This essentially lets us build a 'seen' list.
next
This command tells awk not to process any further patterns or actions for this record and to read in the next record and start over. We do this because file1 is only meant to populate the associative array.
!a[$1]
This pattern only executes when FNR==NR is false, i.e. we are not parsing file1 and thus must be parsing file2. We then use the first field $1 of file2 as the key to index into our 'seen' list created earlier. If the value returned is 0, it means we didn't see it in file1 and therefore we should print this line. Conversely, if the value is non-zero, then we did see it in file1 and thus we should not print the line. Note that !a[$1] is equivalent to !a[$1]{print} because the default action, when none is given, is to print the entire line.
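If you prefer to avoid the counter entirely, a minimal equivalent (assuming the same file names as above) uses the in operator, which tests membership without creating zero-valued entries for file2's keys:
awk 'FNR==NR{seen[$1]; next} !($1 in seen)' file1 file2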

If you don't need to preserve the order of the lines, you can use process substitution in Bash, Korn shell or Z shell along with the join and sort utilities:
join -v 2 <(sort file_1) <(sort file_2)
If you're using a shell without process substitution you would have to pre-sort the files.
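For example (hypothetical temporary file names):
sort file_1 > file_1.sorted
sort file_2 > file_2.sorted
join -v 2 file_1.sorted file_2.sorted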

Related

Adding a constant value to unix file whilst keeping tab

I have the following tab separated file
1 879375 879375
1 899892 899892
1 949363 949363
1 949523 949523
1 949696 949696
1 949739 949739
1 955619 955619
1 957605 957605
1 957693 957693
and have used the following unix command to add 1 to each of the values in column 3:
awk '{$3+=1}1' file > new_file
However the new file loses its tab separator and I would like to keep it.
You are on the right path. You need to set FS (the field separator) and OFS (the output field separator) to \t in your code.
awk 'BEGIN{FS=OFS="\t"} {$3+=1}1' Input_file
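The tabs become single spaces because, once a field is assigned, awk rebuilds the record by joining fields with OFS, which defaults to a single space. With the separators set, the question's full command (same file names) becomes:
awk 'BEGIN{FS=OFS="\t"} {$3+=1}1' file > new_file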

csplit in zsh: splitting file based on pattern

I would like to split the following file based on the pattern ABC:
ABC
4
5
6
ABC
1
2
3
ABC
1
2
3
4
ABC
8
2
3
to get file1:
ABC
4
5
6
file2:
ABC
1
2
3
etc.
Looking at the docs (man csplit): csplit my_file /regex/ {num}.
I can split this file using csplit my_file '/^ABC$/' {2}, but this requires me to put in a number for {num}. When I try to match with {*}, which is supposed to repeat the pattern as many times as possible, I get the error:
csplit: *}: bad repetition count
I am using zsh.
To split a file on a pattern like this, I would turn to awk:
awk 'BEGIN { i=0; }
/^ABC/ { ++i; }
{ print >> ("file" i) }' < input
This reads lines from the file named input. Before any lines are read, the BEGIN section explicitly initializes an "i" variable to zero; uninitialized variables in awk default to zero, but it never hurts to be explicit. The "i" variable is our index for the serial filenames.
Subsequently, each line that starts with "ABC" will increment this "i" variable.
Any and every line in the file will then be printed (in append mode) to the file name that's generated from the text "file" and the current value of the "i" variable.
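As an aside, GNU csplit does accept a repeat-forever count; quoting the braces so the shell passes them through untouched may be all that's needed here (a guess at the failure mode; behavior depends on your csplit implementation):
csplit my_file '/^ABC$/' '{*}'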

Merge and remove redundant lines *among* files

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:
File1.txt
1
2
3
3
4
5
6
File2.txt
6
7
8
8
9
File3.txt
9
10
10
11
The desired output would be:
1
2
3
3
4
5
6
7
8
8
9
10
10
11
I would prefer a solution in awk, bash, or R. I searched the web for solutions and, though there were plenty of them (please find some examples below), they all removed duplicated lines regardless of whether the duplicates were located within or across files.
Thanks in advance.
Arturo
Examples of previous solutions removing redundant lines both within and outside files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate
With your shown samples, could you please try the following. This will NOT remove redundant lines within files, but will remove them across files.
awk '
FNR==1{
  for(key in current){
    total[key]
  }
  delete current
}
!($0 in total)
{
  current[$0]
}
' file1.txt file2.txt file3.txt
Explanation: Adding detailed explanation for above.
awk '                      ##Starting awk program from here.
FNR==1{                    ##If this is the first line of a file, do the following.
  for(key in current){     ##Traverse through the current array.
    total[key]             ##Copy each index of current into total (lines seen in all previous files).
  }
  delete current           ##Delete the current array.
}
!($0 in total)             ##If the current line is NOT present in total, print it (default action).
{
  current[$0]              ##Place the current line into the current array.
}
' file1.txt file2.txt file3.txt  ##Mention the input file names here.
Here's a trick building on https://stackoverflow.com/a/15385080/3358272, using diff and its output format. There is likely a presumption of sorted input here; untested.
out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "$@" ; do
{ cat "${out}" ;
diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
} > "${tmpout}"
mv "${tmpout}" "${out}"
done
cat "${out}"
Output:
$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11
$ diff <(./question.sh F*) Output.txt
(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)

Print required lines from 3 files in Unix

There are 3 files in a directory. How can I print the first file's 1st line, the second file's 3rd line and the third file's 4th line using a UNIX command?
I tried cat filename.txt | sed -n 1p but that applies to only one file. How can I do this for all three files at once?
Using awk: at the beginning of each file, f is incremented to track which file we're dealing with; then we just team that up with the required record number within each file (FNR):
$ awk 'FNR==1 {f++} f==1&&FNR==1 || f==2&&FNR==3 || f==3&&FNR==4' 1 2 3
11
23
34
Contents of the first file (the others are similar):
$ cat 1
11
12
13
14
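If the (file, line) pairs change often, a table-driven variant may be easier to extend. A minimal sketch, assuming the same file names 1, 2, 3, where the want array maps each file's ordinal to the line number to print:
awk 'BEGIN { want[1]=1; want[2]=3; want[3]=4 }
FNR==1 { f++ }
FNR==want[f]' 1 2 3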

Modify awk script to add looping logic

I have an awk script to print pids appearing in myfilename, where myfilename contains a list of pids, each on a new line...
ps -eaf | awk -f script.awk myfilename -
And here is the contents of script.awk...
# process the first file on the command line (aka myfilename)
# this is the list of pids
ARGIND == 1 {
pids[$0] = 1
}
# second and subsequent files ("-"/stdin in the example)
ARGIND > 1 {
# is column 2 of the ps -eaf output [i.e.] the pid in the list of desired
# pids? -- if so, print the entire line
if ($2 in pids)
printf("%s\n",$0)
}
At the moment the command prints the pids in the order of the ps -eaf output; however, I would like it to print them in the order in which they appear in myfilename.
I tried to modify the script to loop through the pids array and repeat the same logic, but I couldn't quite get it right.
Appreciate it if someone could help me with this.
thanks
Forgive my rusty AWK. Perhaps this is usable?
ARGIND == 1 {
pids[$0] = NR # capture the order
}
ARGIND > 1 {
if ($2 in pids) {
idx = pids[$2];
matches[idx] = $0; # capture the line and associate it with the ps -eaf order
if (idx > max)
max = idx;
}
}
END {
for(i = 1; i <= max; i++)
if (i in matches)
print matches[i];
}
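Assuming this replaces the contents of script.awk, the invocation from the question stays the same:
ps -eaf | awk -f script.awk myfilename -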
I don't know what the output from ps -eaf looks like or what assumptions might be useful to exploit from its output. When I first read the question I thought OP had more than two inputs to the script. If it's really going to be only two then it probably makes more sense to reverse the inputs, if not then this might be the more general approach.
I would instead do this using the time-honoured NR==FNR construct. It goes a little something like this (as a one-liner):
ps -eaf | awk 'NR==FNR{p[$1]++;next} $2 in p' mypidlist -
The idea of NR==FNR is we look at the current record number (NR), and compare it to the record number within the current file (FNR). If they are the same, we are in the same file, so we store a record and move to the next line of input.
If NR==FNR is not true, then we simply check for $2 being in the array.
So the first expression populates the array p[] with the contents of mypidlist, and the second construct is a condition only, which defaults to {print} as its statement.
Of course, the one-liner above does not address your requirement to print results in the order of your pid input file. To do that, you need to keep an index and record the data in an array for some kind of sort. Of course, it doesn't have to be a real sort, just keeping the index itself should be sufficient. The following is a bit long as a one-liner:
ps -eaf | awk 'NR==FNR{p[$1]++;o[++n]=$1;next} $2 in p {c[$2]=$0} END {for(n=1;n<=length(o);n++){print c[o[n]]}}' mypidlist -
Broken out for easier reading, the awk script looks like this:
# Record the pid list...
NR==FNR {
p[$1]++ # Each pid is an element in this array.
o[++n]=$1 # This array records the order of the pids.
next
}
# If the second+ input source has a matching pid...
$2 in p {
c[$2]=$0 # record the line in a third array, pid as key.
}
END {
# At the end of our input, step through the ordered pid list...
for (n=1;n<=length(o);n++) {
print c[o[n]] # and print the collected line, using our pid index as key.
}
}
Note that in the event a pid from your list is missing from ps output, the result will be to print a blank line, since awk doesn't complain about references to nonexistent array indices.
Note also that the length(arrayname) notation works in GAWK and OneTrueAwk, but may not be universal. If it doesn't work for you, you can add something like the following function to your awk script and replace length(o) with alength(o):
function alength(arrayname, i, n) {
for(i in arrayname)
n++
return n
}
If there is only one pid file, you can flip the order of the inputs and use idiomatic awk as follows:
$ awk 'NR==1; NR==FNR{a[$2]=$0;next} $0 in a{print a[$0]}' <(ps -eaf) <(seq 10)
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 02:36 ? 00:00:03 /sbin/init
root 2 0 0 02:36 ? 00:00:00 [kthreadd]
root 3 2 0 02:36 ? 00:00:00 [ksoftirqd/0]
root 4 2 0 02:36 ? 00:00:00 [kworker/0:0]
root 5 2 0 02:36 ? 00:00:00 [kworker/0:0H]
root 6 2 0 02:36 ? 00:00:00 [kworker/u30:0]
root 7 2 0 02:36 ? 00:00:00 [rcu_sched]
root 8 2 0 02:36 ? 00:00:00 [rcuos/0]
root 9 2 0 02:36 ? 00:00:00 [rcuos/1]
root 10 2 0 02:36 ? 00:00:00 [rcuos/2]
Here, the list of pids is provided by seq for demonstration; substitute your own file.
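With the question's file in place of the seq demo, that would be (assuming mypidlist holds one pid per line, as in the question):
awk 'NR==1; NR==FNR{a[$2]=$0;next} $0 in a{print a[$0]}' <(ps -eaf) mypidlist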
