How to get a pattern from a file and search in another file in unix - unix

I have 2 files File1 and File2.
File1 has some values separated by "|". For example,
A|a
C|c
F|f
File2 also has some values separated by "|". For example,
a|1
b|2
c|3
d|4
e|5
Means 2nd column in File1 is resembled with 1st column of File2.
I have to create 3rd file File3 with expected output
A|a|1
C|c|3
I tried to take each record in loop and searched for that in File2 using "awk".
It worked, but the problem is both File1 and File2 are having more than 5 million records.
I need an optimized solution.

You can use this awk,
awk -F'|' 'NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' OFS="|" file1 file2 > file3
More clearer way:
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' file1 file2 > file3
As per #Kent suggestion:
If your file2 have more than two columns that you want it in file3 then,
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$0 }' file1 file2 > file3
Here,
FS - Field Separator
OFS - Output Field Separator

This is what join was created to do:
$ join -t '|' -o '1.1,1.2,2.2' -1 2 -2 1 file1 file2
A|a|1
C|c|3
man join for more details and pay particular attention to the files needing to be sorted on the join fields (i.e. 2nd field for file1 and 1st field for file2), as your posted sample input is.

Related

Append data from 1 file to another using AWK

I have an already existing script to check the exclusive data between 2 files and load it in 3rd file. The command is below.
var='FNR == NR {keys[$1 $2]; next} !($1 $2 in keys)'
awk -F\| $var file1.dat file2.dat > file3.dat
The requirement is to reuse the same but just append the data from file2 to file3 ignoring file1. I tried to do the below but it is spooling the data from both file1 and file2. All I need is, though there are 2 file names provided in the awk command, only the 2nd file data to be appended.
var='{print $0}'
awk -F\| $var file1.dat file2.dat > file3.dat
Can anyone help with the exact command.
Below is the data in each file and expected output.
File1 (Can have 0 or more) - We should not look at this file at all
123
456
789
File2:
123
ABC
XYZ
456
Expected output in File3 (All from file2 and just ignore file1 input, but I have to have the file1 name in awk command)
123
ABC
XYZ
456
All from file2 and just ignore file1 input, but I have to have the file1 name in awk command.
If you must use file1 and file2 in arguments to awk command and want to output content from file2 only then you can just use:
awk 'BEGIN {delete ARGV[1]} 1' file1 file2 > file3
123
ABC
XYZ
456
delete ARGV[1] will delete first argument from argument list.
With your shown samples and attempts please try following awk code. Written and tested in GNU awk. Simply use nextfile to skip first Input_file named file1 itself and read 2nd file onwards.
awk 'NR==1{nextfile} 1' file1 file2
also remember not to waste time splitting unneeded fields
{m,g}awk 'BEGIN { delete ARGV[_^=FS="^$"] }_' file1 file2
and it's MUUUCH faster not reading it a row at a time :
mawk2 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:01 [1.11GiB/s] [1.11GiB/s] [ <=>]
f9d2e18d22eb58e5fc2173863cff238e stdin
mawk2 'BEGIN { delete ARGV[_^=RS=FS="^$"] }_^(ORS=__)' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:00 [1.92GiB/s] [1.92GiB/s] [<=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin
and try to avoid the slow default mode of gawk :
gawk 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:03 [ 620MiB/s] [ 620MiB/s] [ <=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin

compare two fields from two different files using awk

I have two files where I want to compare certain fields and produce the output
I have a variable as well
echo ${CURR_SNAP}
123
File1
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|RSCNAME1
DOMAIN2|USER2|LE2|ORG2|ACCES2|RSCTYPE2|RSCNAME2
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|RSCNAME3
DOMAIN4|USER4|LE4|ORG4|ACCES4|RSCTYPE4|RSCNAME4
File2
ORG1|PRGPATH1
ORG3|PRGPATH3
ORG5|PRGPATH5
ORG6|PRGPATH6
ORG7|PRGPATH7
The output I am expecting as below where the last column is CURR_SNAP value and the matching will be 4th column of File1 should be matched with 1st column of File2
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
I tried with the below code piece but looks like I am not doing it correctly
awk -v CURRSNAP="${CURR_SNAP}" '{FS="|"} NR==FNR {x[$0];next} {if(x[$1]==$4) print $1"|"$2"|"$3"|"$4"|"$5"|"$6"|"CURRSNAP}' File2 File1
With awk:
#! /bin/bash
CURR_SNAP="123"
awk -F'|' -v OFS='|' -v curr_snap="$CURR_SNAP" '{
if (FNR == NR)
{
# this stores the ORG* as an index
# here you can store other values if needed
orgs_arr[$1]=1
}
else if (orgs_arr[$4] == 1)
{
# overwrite $7 to contain CURR_SNAP value
$7=curr_snap
print
}
}' file2 file1
As in your expected output, you didn't output RSCNAME*, so I have overwritten $7(which is column for RSCNAME*) with $CURR_SNAP. If you want to display RSCNAME* column aswell, remove $7=curr_snap and change print statement to print $0, curr_snap.
I wouldn't use awk at all. This is what join(1) is meant for (Plus sed to append the extra column:
$ join -14 -21 -t'|' -o 1.1,1.2,1.3,1.4,1.5,1.6 File1 File2 | sed "s/$/|${CURR_SNAP}/"
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
It does require that the files be sorted based on the common field, like your examples are.
You can do this with awk with two-rules. For the first file (where NR==FNR), simply use string concatenation to append the fields 1 - (NF-1) assigning the concatenated result to an array indexed by $4. Then for the second file (where NR>FNR) in rule two test if array[$1] has content and if so, output the array and append "|"CURR_SNAP (with CURR_SNAP shortened to c in the example below and array being a), e.g.
CURR_SNAP=123
awk -F'|' -v c="$CURR_SNAP" '
NR==FNR {
for (i=1;i<NF;i++)
a[$4]=i>1?a[$4]"|"$i:a[$4]$1
}
NR>FNR {
if(a[$1])
print a[$1]"|"c
}
' file1 file2
Example Use/Output
After setting the filenames to match yours, you can simply copy/middle-mouse-paste in your console to test, e.g.
$ awk -F'|' -v c="$CURR_SNAP" '
> NR==FNR {
> for (i=1;i<NF;i++)
> a[$4]=i>1?a[$4]"|"$i:a[$4]$1
> }
> NR>FNR {
> if(a[$1])
> print a[$1]"|"c
> }
> ' file1 file2
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
Look things over and let me know if you have further questions.

Finding amount of sequence matches per line

I'm looking to use GREP or something similiar to find the total matches of a 5 letter sequence (AATTC) in every line of a file, and then print the result in a new file. For example:
File 1:
GGGGGAATTCGAATTC
GGGGGAATTCGGGGGG
GGGGGAATTCCAATTC
Then in another file it prints the matches line by line
File 2:
2
1
2
Awk solution:
awk '{ print gsub(/AATTC/,"") }' file1 > file2
The gsub() function returns the number of substitutions made
$ cat file2
2
1
2
If you have to use grep, then put that in a while loop,
$ while read -r line; do grep -o 'AATTC'<<<"$line"|wc -l >> file2 ; done < file1
$ cat file2
2
1
2
Another way: using perl.
$ perl -ne 'print s/AATTC/x/g ."\n"' file1 > file2

Using Awk how to merge fields between files, F2 of file1 plus last 8char of F2 in file 2

I have two files file1 and file2, I need to replace F1 value of file1 by merging F2 of file1 plus last 8char of F2 in file2
File 1 :
123456|AAAAAAA|BBBBBB|CCCCCCC
444444|kkkkkkk|rrrrrr|NNNNNNN
File 2:
AAAAAAA|DDDDDD12345678
kkkkkkk|987654321aaaaa
Expected Output
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
I have tried with Bellow awk function not sure how to fetch last 8 char of F2 from file2
# awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2a[$2];print}' OFS='|' File2 File1
123456|AAAAAAADDDDDD12345678|BBBBBB|CCCCCCC
444444|kkkkkkk987654321aaaaa|rrrrrr|NNNNNNN
In order to get the last 8 characters of a[$2], you need to use substr:
substr(a[$2],length(a[$2])-7)
The above takes the substring of a[$2] starting at position length(a[$2])-7.
With that one change, your code produces your desired output:
$ awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
As Ghoti points out in the comments, the more usual awk style is to use next so as to avoid the need for the second condition, NR>FNR, as follows:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
When awk encounters next, it skips the rest of the commands and starts over on the next line.
As awk programmers often value conciseness over clarity, it is common to see the print statement replaced with a 1:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7)} 1' OFS='|' File2 File1
In this case, 1 is a condition and it always evaluates to true. Since no command is associated with that condition, the default command is executed which is print.

How to interleave lines from two text files

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with fewer code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is of greater than or equal length to file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
#Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 |sort -t. -k 2.1
Here its specified that the separater is "." and that we are sorting on the first character of the second field.

Resources