Append data from 1 file to another using AWK - unix

I have an existing script that finds the records present only in file2 (not in file1) and loads them into a 3rd file. The command is below.
var='FNR == NR {keys[$1 $2]; next} !($1 $2 in keys)'
awk -F\| "$var" file1.dat file2.dat > file3.dat
The requirement is to reuse the same command but simply append the data from file2 to file3, ignoring file1. I tried the command below, but it outputs the data from both file1 and file2. Although 2 file names are passed to awk, only the data from the 2nd file should be appended.
var='{print $0}'
awk -F\| "$var" file1.dat file2.dat > file3.dat
Can anyone help with the exact command?
Below is the data in each file and expected output.
File1 (can have 0 or more records) - we should not look at this file at all
123
456
789
File2:
123
ABC
XYZ
456
Expected output in File3 (All from file2 and just ignore file1 input, but I have to have the file1 name in awk command)
123
ABC
XYZ
456

All from file2 and just ignore file1 input, but I have to have the file1 name in awk command.
If you must use file1 and file2 in arguments to awk command and want to output content from file2 only then you can just use:
awk 'BEGIN {delete ARGV[1]} 1' file1 file2 > file3
123
ABC
XYZ
456
delete ARGV[1] removes the first file argument (file1) from awk's argument list, so awk never opens it.
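A minimal sketch to check this behaviour on the sample data, including the case where file1 has zero records. The temporary directory and the file name empty are made up for the demo.

```shell
# Demo files live in a throwaway directory (names are hypothetical).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '123\n456\n789\n' > "$tmp/file1"
printf '123\nABC\nXYZ\n456\n' > "$tmp/file2"
: > "$tmp/empty"    # stands in for a file1 with zero records

# ARGV[0] is the awk command name; ARGV[1] is the first file operand.
# Deleting it in BEGIN means awk never reads that file.
out=$(awk 'BEGIN { delete ARGV[1] } 1' "$tmp/file1" "$tmp/file2")
out_empty=$(awk 'BEGIN { delete ARGV[1] } 1' "$tmp/empty" "$tmp/file2")

printf '%s\n' "$out"
```

Both runs print only the four lines of file2. To append rather than overwrite, redirect with >> instead of >.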

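With your shown samples and attempts, please try the following awk code, written and tested in GNU awk. It uses nextfile to skip the first input file (file1) and reads from the 2nd file onwards. The test compares FILENAME with ARGV[1] rather than using NR==1, because when file1 is empty the record with NR==1 is the first line of file2, and file2 would be skipped instead.
awk 'FILENAME==ARGV[1]{nextfile} 1' file1 file2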

Also, remember not to waste time splitting fields you don't need. Setting FS to "^$" means the separator is never found inside a non-empty record:
{m,g}awk 'BEGIN { delete ARGV[_^=FS="^$"] }_' file1 file2
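It is also much faster not to read the input one row at a time. The second command below sets RS to "^$" as well, so that the file can be read as a single record: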
mawk2 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:01 [1.11GiB/s] [1.11GiB/s] [ <=>]
f9d2e18d22eb58e5fc2173863cff238e stdin
mawk2 'BEGIN { delete ARGV[_^=RS=FS="^$"] }_^(ORS=__)' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:00 [1.92GiB/s] [1.92GiB/s] [<=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin
and try to avoid the slow default mode of gawk :
gawk 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:03 [ 620MiB/s] [ 620MiB/s] [ <=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin
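For anyone who finds the underscore tricks hard to follow, here is a plain version of what the first one-liner appears to do. This is my reading of it, not the answerer's own code: _^=FS="^$" sets FS to "^$" and leaves _ equal to 1 (0 raised to the power 0), so ARGV[1] is deleted and the trailing _ acts as an always-true pattern. The file names are made up for the demo.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '123\n456\n789\n' > "$tmp/file1"
printf '123\nABC\nXYZ\n456\n' > "$tmp/file2"

# FS = "^$" is never found inside a non-empty record, so records are not split.
# delete ARGV[1] drops file1; the pattern 1 prints every record of file2.
plain=$(awk 'BEGIN { FS = "^$"; delete ARGV[1] } 1' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$plain"
```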

Related

Compare 2nd columns from 2 files unix

Compare the 2nd columns of 2 files and write the unmatched records of the first file to an output file.
Example:
The delimiter is #.
Filename_clientid.txt
RIA00024_MA_plan_BTR_09282022_4.xml#RIA00025
RIA00024_MA_plan_BTR_09282022_5.xml#RIA00024
RIA00026_MA_plan_BTR_09282022_6.xml#RIA00026
Client_id.txt
ramesh#RIA000025
suresh#RIA000024
vamshi#RIA000027
Expected output:
RIA00026_MA_plan_BTR_09282022_6.xml#RIA00026
I used the awk command below but it is not working. Can you help me?
awk -F '#' 'NR==FNR{a[$2]; next} FNR==1 || !($1 in a)' Client_id.txt Filename_clientid.txt
Alternative using join (run on the corrected file1 and file2 inputs shown further down):
$ join -t# -j2 <(sort -t# -k2 file1) <(sort -t# -k2 file2)
RIA000026#RIA000026_MA_plan_BTR_09282022_6.xml#ramesh
The number of zeroes is not the same in both files. If they were the same, you could check that the field 2 value of Filename_clientid.txt does not occur in the array a, which holds the field 2 values of Client_id.txt.
Filename_clientid.txt
RIA00024_MA_plan_BTR_09282022_4.xml#RIA00025
RIA00024_MA_plan_BTR_09282022_5.xml#RIA00024
RIA00026_MA_plan_BTR_09282022_6.xml#RIA00026
Client_id.txt
ramesh#RIA00025
suresh#RIA00024
vamshi#RIA00027
Example
awk -F'#' 'NR==FNR{a[$2]; next} !($2 in a)' Client_id.txt Filename_clientid.txt
Output
RIA00026_MA_plan_BTR_09282022_6.xml#RIA00026
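A minimal sketch of the NR==FNR idiom on the equal-length IDs above. While awk reads the first file, NR equals FNR, so its field 2 values are stored as array keys; for the second file, only records whose field 2 is not a key get printed. The temporary directory and the array name seen are made up for the demo.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 'ramesh#RIA00025' 'suresh#RIA00024' 'vamshi#RIA00027' \
  > "$tmp/Client_id.txt"
printf '%s\n' \
  'RIA00024_MA_plan_BTR_09282022_4.xml#RIA00025' \
  'RIA00024_MA_plan_BTR_09282022_5.xml#RIA00024' \
  'RIA00026_MA_plan_BTR_09282022_6.xml#RIA00026' \
  > "$tmp/Filename_clientid.txt"

unmatched=$(awk -F'#' '
  NR == FNR { seen[$2]; next }   # first file: remember each client id
  !($2 in seen)                  # second file: print records with an unknown id
' "$tmp/Client_id.txt" "$tmp/Filename_clientid.txt")
printf '%s\n' "$unmatched"
```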
With corrected inputs (was wrong with number of zeroes):
file1
RIA00024_MA_plan_BTR_09282022_4.xml#RIA00025
RIA00024_MA_plan_BTR_09282022_5.xml#RIA00024
RIA000026_MA_plan_BTR_09282022_6.xml#RIA000026
file2
ramesh#RIA000025
suresh#RIA000024
vamshi#RIA000027
ramesh#RIA000026
code
awk -F'#' 'NR==FNR{a[$2]=$1;next} $2 in a{print a[$2]}' file1 file2
Output
RIA000026_MA_plan_BTR_09282022_6.xml

Finding amount of sequence matches per line

I'm looking to use grep or something similar to find the total number of matches of a 5-letter sequence (AATTC) in every line of a file, and then print the result in a new file. For example:
File 1:
GGGGGAATTCGAATTC
GGGGGAATTCGGGGGG
GGGGGAATTCCAATTC
Then in another file it prints the matches line by line
File 2:
2
1
2
Awk solution:
awk '{ print gsub(/AATTC/,"") }' file1 > file2
The gsub() function returns the number of substitutions made
$ cat file2
2
1
2
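Note that gsub() changes the string it works on. A small variation, assuming you also want to keep the original line next to its count, is to run gsub() on a copy. The variable names s and n and the third sample line (with no match) are made up for the demo.

```shell
counts=$(printf 'GGGGGAATTCGAATTC\nGGGGGAATTCGGGGGG\nGGGGGGGGGG\n' |
  awk '{
    s = $0                        # work on a copy so $0 stays intact
    n = gsub(/AATTC/, "", s)      # n is the number of replacements made in s
    print n "\t" $0
  }')
printf '%s\n' "$counts"
```

A line with no match gets a count of 0.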
If you have to use grep, then put that in a while loop,
$ while read -r line; do grep -o 'AATTC'<<<"$line"|wc -l >> file2 ; done < file1
$ cat file2
2
1
2
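Another way: using perl. The 0 + forces a numeric 0 for lines with no match, where s///g on its own would return an empty string.
$ perl -ne 'print 0 + s/AATTC/x/g, "\n"' file1 > file2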

Using Awk how to merge fields between files, F2 of file1 plus last 8char of F2 in file 2

I have two files, file1 and file2. I need to replace the F2 value of file1 with F2 of file1 plus the last 8 characters of F2 in file2.
File 1 :
123456|AAAAAAA|BBBBBB|CCCCCCC
444444|kkkkkkk|rrrrrr|NNNNNNN
File 2:
AAAAAAA|DDDDDD12345678
kkkkkkk|987654321aaaaa
Expected Output
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
I have tried the awk command below, but I'm not sure how to fetch the last 8 characters of F2 from file2:
# awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2a[$2];print}' OFS='|' File2 File1
123456|AAAAAAADDDDDD12345678|BBBBBB|CCCCCCC
444444|kkkkkkk987654321aaaaa|rrrrrr|NNNNNNN
In order to get the last 8 characters of a[$2], you need to use substr:
substr(a[$2],length(a[$2])-7)
The above takes the substring of a[$2] starting at position length(a[$2])-7.
With that one change, your code produces your desired output:
$ awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
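The substr() expression can be checked on its own. A minimal sketch, with made-up variable names; the second call shows that a value shorter than 8 characters comes back whole, since a start position below 1 is treated as the start of the string.

```shell
# Last 8 characters of a 14-character value: start at position 14 - 7 = 7.
last8=$(awk 'BEGIN { s = "DDDDDD12345678"; print substr(s, length(s) - 7) }')

# Value shorter than 8 characters: the start position is negative.
short=$(awk 'BEGIN { s = "abc"; print substr(s, length(s) - 7) }')

printf '%s\n%s\n' "$last8" "$short"
```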
As Ghoti points out in the comments, the more usual awk style is to use next so as to avoid the need for the second condition, NR>FNR, as follows:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
When awk encounters next, it skips the rest of the commands and starts over on the next line.
As awk programmers often value conciseness over clarity, it is common to see the print statement replaced with a 1:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7)} 1' OFS='|' File2 File1
In this case, 1 is a condition and it always evaluates to true. Since no command is associated with that condition, the default command is executed which is print.
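A quick way to convince yourself that the two forms are equivalent, using a made-up one-line input:

```shell
# Assigning to a field rebuilds $0 with OFS; both forms then print it.
with_one=$(printf 'x|y\n' | awk -F'|' -v OFS='|' '{ $2 = "z" } 1')
with_print=$(printf 'x|y\n' | awk -F'|' -v OFS='|' '{ $2 = "z"; print }')

printf '%s\n%s\n' "$with_one" "$with_print"
```

Both commands print x|z.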

How to get a pattern from a file and search in another file in unix

I have 2 files File1 and File2.
File1 has some values separated by "|". For example,
A|a
C|c
F|f
File2 also has some values separated by "|". For example,
a|1
b|2
c|3
d|4
e|5
That is, the 2nd column of File1 corresponds to the 1st column of File2.
I have to create 3rd file File3 with expected output
A|a|1
C|c|3
I tried taking each record in a loop and searching for it in File2 using "awk".
It worked, but the problem is that both File1 and File2 have more than 5 million records.
I need an optimized solution.
You can use this awk,
awk -F'|' 'NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' OFS="|" file1 file2 > file3
A clearer way:
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' file1 file2 > file3
As per @Kent's suggestion:
If your file2 has more than two columns and you want all of them in file3, then:
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$0 }' file1 file2 > file3
Here,
FS - Field Separator
OFS - Output Field Separator
This is what join was created to do:
$ join -t '|' -o '1.1,1.2,2.2' -1 2 -2 1 file1 file2
A|a|1
C|c|3
man join for more details and pay particular attention to the files needing to be sorted on the join fields (i.e. 2nd field for file1 and 1st field for file2), as your posted sample input is.
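If the inputs are not already sorted, they can be sorted on the join fields first. A minimal sketch with the sample data in shuffled order; the temporary file names are made up for the demo, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 'F|f' 'A|a' 'C|c' > "$tmp/file1"
printf '%s\n' 'c|3' 'a|1' 'e|5' 'b|2' 'd|4' > "$tmp/file2"

# Sort each file on its own join field: field 2 for file1, field 1 for file2.
LC_ALL=C sort -t '|' -k2,2 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t '|' -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

joined=$(LC_ALL=C join -t '|' -o '1.1,1.2,2.2' -1 2 -2 1 \
  "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"
```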

How to interleave lines from two text files

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
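The same command extends to more than two files, since paste cycles through its delimiter list. A small sketch with three made-up files:

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'line1.1\nline1.2\n' > "$tmp/f1"
printf 'line2.1\nline2.2\n' > "$tmp/f2"
printf 'line3.1\nline3.2\n' > "$tmp/f3"

# One line from each file in turn, separated by newlines.
interleaved=$(paste -d '\n' "$tmp/f1" "$tmp/f2" "$tmp/f3")
printf '%s\n' "$interleaved"
```

If the files have different lengths, paste emits an empty line for each file that has run out.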
Here's a solution using awk:
awk '{print; if((getline < "file2") > 0) print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 has at least as many lines as file2. The getline result is compared with 0 because getline returns -1 on error (for example, if file2 cannot be read), and -1 would otherwise count as true.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if((getline < "file2") > 0) print; else print ""}' file1
or
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0; else print "2: "}' file1
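For the opposite case, where file2 is the longer file, one possible extension (not part of the answer above) is to print whatever is left of file2 in an END block. The variable names f2 and line and the sample data are made up for the demo.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'a1\na2\n' > "$tmp/file1"
printf 'b1\nb2\nb3\nb4\n' > "$tmp/file2"

merged=$(awk -v f2="$tmp/file2" '
  # Print a line of file1, then the next line of file2 if there is one.
  { print; if ((getline line < f2) > 0) print line }
  # file2 is still open here, so reading resumes where the loop stopped.
  END { while ((getline line < f2) > 0) print line }
' "$tmp/file1")
printf '%s\n' "$merged"
```

Reading into a separate variable (line) also leaves $0 untouched.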
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers. The -s option (stable sort, available in GNU sort) keeps lines with the same number in file order; without it, ties are broken by comparing the line contents:
(cat -n file1 ; cat -n file2 ) | sort -s -n | cut -f2-
Note (of interest to me): this needs a little more work to get the ordering right if, instead of static files, you use the output of commands that may run slower or faster than one another. In that case you need to add, sort on, and remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -s -n | cut -f2- | sort -s -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 | sort -t. -k 2.1
Here it is specified that the separator is "." and that the sort key starts at the first character of the second field.
