Find duplicate words in two text files using command line - unix

I have two text files:
f1.txt
boom Boom pow
Lazy dog runs.
The Grass is Green
This is TEST
Welcome
and
f2.txt
Welcome
I am lazy
Welcome, Green
This is my room
Welcome
bye
In Ubuntu Command Line I am trying:
awk 'BEGIN {RS=" "}FNR==NR {a[$1]=NR; next} $1 in a' f1.txt f2.txt
and getting output:
Green
This
is
My desired output is:
lazy
Green
This is
Welcome
Description: I want to compare two text files line by line and output all duplicate words. Matching should be case-insensitive. Also, comparing line by line is preferable to searching the whole of f2.txt for each word from f1.txt. For example, "Welcome" should not appear in the desired output if it were on line 6 of f2.txt instead of line 5.

Well, then. With awk:
awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt
This works as follows:
NR == FNR { # While processing the first file:
for(i = 1; i <= NF; ++i) { # Remember which fields were in
a[NR,tolower($i)] = 1 # each line (lower-cased)
}
next # Do nothing else.
}
{ # After that (when processing the
# second file)
flag = 0 # reset flag so we know we haven't
# printed anything yet
for(i = 1; i <= NF; ++i) { # wade through fields (words)
if(a[FNR,tolower($i)]) { # if this field was in the
# corresponding line in the first
# file, then
printf("%s%s", flag ? OFS : "", $i) # print it (with a separator if it
# isn't the first)
flag = 1 # raise flag
}
}
if(flag) { # and if we printed anything
print "" # add a newline at the end.
}
}
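To sanity-check the one-liner, the two sample files can be recreated inline and the program run as-is (this is the same program as above, split across lines for readability):

```shell
# Recreate the two sample files from the question.
printf 'boom Boom pow\nLazy dog runs.\nThe Grass is Green\nThis is TEST\nWelcome\n' > f1.txt
printf 'Welcome\nI am lazy\nWelcome, Green\nThis is my room\nWelcome\nbye\n' > f2.txt

# Same program as above, reformatted.
awk 'NR == FNR { for(i = 1; i <= NF; ++i) a[NR,tolower($i)] = 1; next }
     {
       flag = 0
       for(i = 1; i <= NF; ++i)
         if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 }
       if(flag) print ""
     }' f1.txt f2.txt
```

This prints the four desired lines (lazy, Green, This is, Welcome). Note that punctuation stays attached to fields ("runs." is stored as runs., not runs), which matters if words need to match across punctuation.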


merging multiple rows into one row per record separated by a blank line in unix

I have a text file in which each record starts with a number and a name and ends with a blank line. I would like to have each record as one row of comma-separated values. I have tried the following code; the code file and text file are linked below:
biosample.txt
sark.awk
The unix command to run the code is:
gawk -f sark.awk biosample.txt
then run:
sed 's/,,/\n/g' <biosample.txt > out.txt
but the out.txt is messy and confusing.
I want each record in one line with the values to be extracted for the following headers only:
record name
Identifiers
Organism
strain
isolate
serovar
isolation source
collected by
collection date
geographic location
host
host disease
Accession
ID
potential_contaminant
sample type
Description
The value for each header should be picked from each record, with records separated by a blank line.
Thanks
Here's a straightforward implementation with awk:
BEGIN { print "record name,Identifiers,Organism,strain,isolate,serovar,"\
"isolation source,collected by,collection date,"\
"geographic location,host,host disease,Accession,ID,"\
"potential_contaminant,sample type,Description"
RS="\r\n"
ORS=""
}
sub(/^[0-9]*: /,"") { r[1] = $0; next }
sub(/^Identifiers: /,""){ r[2] = $0; next }
sub(/^Organism: /,"") { r[3] = $0; next }
/^ / { split($0, a, "=") }
/^ *\/strain=/ { r[4] = a[2] }
/^ *\/isolate=/ { r[5] = a[2] }
/^ *\/serovar=/ { r[6] = a[2] }
/^ *\/isolation source=/{ r[7] = a[2] }
/^ *\/collected by=/ { r[8] = a[2] }
/^ *\/collection date=/ { r[9] = a[2] }
/^ *\/geographic locati/{ r[10] = a[2] }
/^ *\/host=/ { r[11] = a[2] }
/^ *\/host disease=/ { r[12] = a[2] }
/^Accession:/ { r[13] = $2; r[14] = $4 }
/^ *\/potential_contami/{ r[15] = a[2] }
/^ *\/sample type=/ { r[16] = a[2] }
/^Description:/ { getline; r[17] = $0 }
/^$/ { if (r[1]) {
         for (i = 1; i < 17; ++i) print r[i]","
         print r[i]"\n"
         delete r
       }
     }
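As an aside (separate from the script above): for simpler blank-line-separated records, awk's paragraph mode (RS="") can collapse each record to one row on its own. A minimal sketch with made-up input (records.txt and its contents are hypothetical, not the real biosample.txt format):

```shell
# Hypothetical two-record input, records separated by a blank line.
printf '1: alpha\nOrganism: E. coli\n\n2: beta\nOrganism: B. subtilis\n' > records.txt

# RS="" puts awk in paragraph mode: each blank-line-separated block is one
# record, and FS="\n" makes each line a field. $1=$1 rebuilds the record
# with OFS, turning every record into one comma-separated row.
awk 'BEGIN { RS=""; FS="\n"; OFS="," } { $1=$1; print }' records.txt
```

This only joins lines; the script above goes further by selecting and ordering specific headers per record.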

How to jump to next top level loop

If a condition occurs in the inner for loop, I want to break out of the inner loop and advance the outer loop with next.
I could create a flag in the inner loop before the break statement and then evaluate it in the outer loop; this is a silly example:
for (i in 1:3) {
  NEXT <- FALSE
  for (j in 1:3) {
    if (j==2 && i==2) {
      NEXT <- TRUE
      break
    }
  }
  if (NEXT) next
  cat("\n", i, " ... some i stuff ...")
}
Is there an elegant way to do it? Something like:
for (i in 1:3) {
  for (j in 1:3) {
    if (j==2 && i==2) {
      break
      # next (outer)
    }
  }
  cat("\n", i, " ... some i stuff ...")
}
There is a similar/duplicate question (How to jump to next top level loop?), but I don't think it answers mine, because in that question's outer loop nothing happens after the inner loop.
A quick fix could be something like:
for(i in 1:3){
  for(j in 1:3){
    if(i == 2 && j == 2){
      break # exit the j loop (assigning j <- 3 would not stop a for loop in R)
    } else {
      ...code...
    }
  }
  cat ....
}
Obviously, it's not robust but seems to solve your problem.
EDIT:
Per your comment, perhaps you're looking for:
for(i in 1:3){
  cat.ready <- TRUE
  for(j in 1:3){
    if(i == 2 && j == 2){
      cat.ready <- FALSE
      break # exit the j loop
    } else {
      ...code...
    }
  }
  if(cat.ready == TRUE){
    cat(...)
  } else {
    cat.ready <- TRUE
  }
}
This then will remove you from executing code and also from producing a cat() if i and j are both 2, and will reset the condition after that exception has been handled.
I'm sure there is a more elegant solution, however.
Why not invert the problem and only execute the inner loop body if j!=2 | i!=2?
for (i in 1:3) {
  for (j in 1:3) {
    cat("\n\ni=", i, " and j=", j)
    if (j!=2 | i!=2) {
      # will be executed unless j is 2 and i is 2
      cat("\n", j, " ... some j stuff ...")
    }
  }
  cat("\n", i, " ... some i stuff ...")
}
If I am misunderstanding and you want to not execute the combinations j=2/i=2 and j=3/i=2, adjust accordingly:
for (i in 1:3) {
  for (j in 1:3) {
    cat("\n\ni=", i, " and j=", j)
    if (i!=2 | (j!=2 & j!=3)) {
      # will be executed unless i is 2 and j is 2 or 3
      cat("\n", j, " ... some j stuff ...")
    }
  }
  cat("\n", i, " ... some i stuff ...")
}

awk to generate a consecutive sequence

Would like to read the first field, then generate a sequence based on the "&-" and "&&-" delimiters.
Ex: If the Digits field is 210&-3, populate 210 and 213 only.
If the Digits field is 210&&-3, populate 210, 211, 212 and 213.
Input.txt
DIGITS
20
210&-2
2130&&-3&-6&&-8
Desired Output:
DIGITS
20
210
212
2130
2131
2132
2133
2136
2137
2138
Have tried some commands but nothing materialised; any suggestions...
Here's an awk executable script version:
#!/usr/bin/awk -f
BEGIN {FS="[&]"}
{
flen = length($1)
ldigit = substr($1, flen)
prefix = substr($1, 1, flen-1)
if( ldigit !~ /[[:space:]]/ )
print prefix ldigit
doRange=0
for(i=2;i<=NF;i++) {
if( $i == "" ) { doRange=1; continue }
if( !doRange ) { ldigit=-$i; print prefix ldigit }
else {
while( ldigit < -$i ) {
ldigit++
print prefix ldigit
}
doRange=0
}
}
}
Here's the breakdown:
Set the field separator to &
When there are records to parse, find the prefix and the ldigit values
Print out the first value using print prefix ldigit. This will print the header too. The if( ldigit !~ /[[:space:]]/ ) discards the blank lines
When there's no range, set ldigit and then print prefix ldigit
When there is a range, increment ldigit and print prefix ldigit for as long as required.
Using an older gawk version I get output like:
DIGITS
20
210
212
2130
2131
2132
2133
2136
2137
2138
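The run is easy to reproduce; this recreates Input.txt inline and feeds it the same logic as the script above:

```shell
# Recreate the sample Input.txt from the question.
printf 'DIGITS\n20\n210&-2\n2130&&-3&-6&&-8\n' > Input.txt

# Same logic as the script above, inlined.
awk 'BEGIN { FS="[&]" }
{
  flen = length($1)
  ldigit = substr($1, flen)
  prefix = substr($1, 1, flen-1)
  if( ldigit !~ /[[:space:]]/ )
    print prefix ldigit
  doRange = 0
  for(i = 2; i <= NF; i++) {
    if( $i == "" ) { doRange = 1; continue }
    if( !doRange ) { ldigit = -$i; print prefix ldigit }
    else {
      while( ldigit < -$i ) { ldigit++; print prefix ldigit }
      doRange = 0
    }
  }
}' Input.txt
```

This emits the header followed by the ten expanded values, matching the listing above.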
Using GNU awk for patsplit:
gawk '{
n = patsplit($0,patt,/[&][&]-|[&]-/,number);
lastnum = number[0]
print lastnum
if(n > 0) {
for (i=1; i<=n; i++) {
if (patt[i] ~ /^[&]-$/) {
print number[0] + number[i]
lastnum = number[0] + number[i]
}
if (patt[i] ~ /^[&][&]-$/) {
for (num = lastnum + 1; num <= number[0] + number[i]; num++) {
print num
}
lastnum = number[0] + number[i]
}
}
}
}' file
Output
20
210
212
2130
2131
2132
2133
2136
2137
2138
$ cat tst.awk
BEGIN{ FS="&" }
{
for (i=1;i<=NF;i++) {
if ($i == "") {
i++
$i = $1 - $i
for (j=(prev+1);j<$i;j++) {
print j
}
}
else if ($i < 0) {
$i = $1 - $i
}
print $i
prev = $i
}
}
$
$ awk -f tst.awk file
20
210
212
2130
2131
2132
2133
2136
2137
2138

Pivot table in AWK

I need to transform the elements of $2 into column indexes and return the value of $3 under each column index.
I don't have access to gawk 4, so I cannot work with true multidimensional arrays.
Input
Name^Code^Count
Name1^0029^1
Name1^0038^1
Name1^0053^1
Name2^0013^3
Name2^0018^3
Name2^0023^5
Name2^0025^1
Name2^0029^1
Name2^0038^1
Name2^0053^1
Name3^0018^1
Name3^0060^1
Name4^0018^2
Name4^0025^5
Name5^0018^2
Name5^0025^1
Name5^0060^1
Desired output
Name^0013^0018^0023^0025^0029^0038^0053^0060
Name1^^^^^1^1^1^
Name2^3^3^5^1^1^1^1^
Name3^^1^^^^^^1
Name4^^2^^5^^^^
Name5^^2^^1^^^^1
Any suggestions on how to tackle this task without using real multidimensional arrays?
The following solution uses GNU awk v3.2 features for sorting (asorti). It does not use multidimensional arrays; it only simulates one.
awk -F"^" '
NR>1{
map[$1,$2] = $3
name[$1]++
value[$2]++
}
END{
printf "Name"
n = asorti(value, v_s)
for(i=1; i<=n; i++) {
printf "%s%s", FS, v_s[i]
}
print ""
m = asorti(name, n_s)
for(i=1; i<=m; i++) {
printf "%s", n_s[i]
for(j=1; j<=n; j++) {
printf "%s%s", FS, map[n_s[i],v_s[j]]
}
print ""
}
}' file
Name^0013^0018^0023^0025^0029^0038^0053^0060
Name1^^^^^1^1^1^
Name2^3^3^5^1^1^1^1^
Name3^^1^^^^^^1
Name4^^2^^5^^^^
Name5^^2^^1^^^^1
This will work with any awk and will order the output of counts numerically while keeping the names in the order they occur in your input file:
$ cat tst.awk
BEGIN{FS="^"}
NR>1 {
if (!seenNames[$1]++) {
names[++numNames] = $1
}
if (!seenCodes[$2]++) {
# Insertion Sort - start at the end of the existing array and
# move everything greater than the current value down one slot
# leaving open the slot for the current value to be inserted between
# the last value smaller than it and the first value greater than it.
for (j=++numCodes;codes[j-1]>$2+0;j--) {
codes[j] = codes[j-1]
}
codes[j] = $2
}
count[$1,$2] = $3
}
END {
printf "%s", "Name"
for (j=1;j<=numCodes;j++) {
printf "%s%s",FS,codes[j]
}
print ""
for (i=1;i<=numNames;i++) {
printf "%s", names[i]
for (j=1;j<=numCodes;j++) {
printf "%s%s",FS,count[names[i],codes[j]]
}
print ""
}
}
...
$ awk -f tst.awk file
Name^0013^0018^0023^0025^0029^0038^0053^0060
Name1^^^^^1^1^1^
Name2^3^3^5^1^1^1^1^
Name3^^1^^^^^^1
Name4^^2^^5^^^^
Name5^^2^^1^^^^1
Since you only have two "dimensions", it is easy enough to use one array for each dimension and a joining array with a calculated column name. I didn't do the sorting of columns or rows, but the idea is pretty basic.
#!/usr/bin/awk -f
#
BEGIN { FS = "^" }
(NR == 1) {next}
{
rows[$1] = 1
columns[$2] = 1
join_table[$1 "-" $2] = $3
}
END {
printf "Name"
for (col_name in columns) {
printf "^%s", col_name
}
printf "\n"
for (row_name in rows) {
printf "%s", row_name
for (col_name in columns) {
printf "^%s", join_table[row_name "-" col_name]
}
printf "\n"
}
}
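Since for (x in array) iterates in an unspecified order in awk, the columns and rows of this version can come out in any order. A quick run against the sample input therefore only checks the shape of the result (one header plus five name rows); pivot.awk and file are just the local names used here:

```shell
# Recreate the sample input from the question.
printf 'Name^Code^Count\nName1^0029^1\nName1^0038^1\nName1^0053^1\nName2^0013^3\nName2^0018^3\nName2^0023^5\nName2^0025^1\nName2^0029^1\nName2^0038^1\nName2^0053^1\nName3^0018^1\nName3^0060^1\nName4^0018^2\nName4^0025^5\nName5^0018^2\nName5^0025^1\nName5^0060^1\n' > file

# The script above, saved to a file (with printf "%s" for the row name).
cat > pivot.awk <<'EOF'
BEGIN { FS = "^" }
(NR == 1) { next }
{
  rows[$1] = 1
  columns[$2] = 1
  join_table[$1 "-" $2] = $3
}
END {
  printf "Name"
  for (col_name in columns) printf "^%s", col_name
  printf "\n"
  for (row_name in rows) {
    printf "%s", row_name
    for (col_name in columns) printf "^%s", join_table[row_name "-" col_name]
    printf "\n"
  }
}
EOF

awk -f pivot.awk file | wc -l   # 6 lines: 1 header + 5 names
```

To get the sorted layout of the desired output, the rows and columns would still need the explicit ordering used in the other two answers.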

Where does the "newline" (\n) come from? (pattern matching using "flex")

I have an experimental flex source file(lex.l):
%option noyywrap
%{
int chars = 0;
int words = 0;
int lines = 0;
%}
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{letter}+ { words++; chars += strlen(yytext); printf("Word\n"); }
\n { chars++; lines++; printf("Line\n"); }
. { chars++; printf("SomethingElse\n"); }
%%
int main(argc, argv)
int argc;
char **argv;
{
if(argc > 1)
{
if(!(yyin = fopen(argv[1], "r")))
{
perror(argv[1]);
return (1);
}
}
yylex();
printf("lines: %8d\nwords: %8d\nchars: %8d\n", lines, words, chars);
}
I created an input file called "input.txt" with "red apple" written in it. Command line:
$ flex lex.l
$ cc lex.yy.c
$ ./a.out < input.txt
Word
SomethingElse
Word
Line
lines: 1
words: 2
chars: 10
Since there is no newline character in the input file, why is the "\n" pattern in lex.l matched? ("lines" should be 0, and "chars" should be 9.)
(I am using OS X.)
Thanks for your time.
It is very possible that your text editor has automatically inserted a newline at the end of the file. Many editors do this because POSIX defines a line as a sequence of characters terminated by a newline.
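This is easy to confirm from the shell: "red apple" with an editor-added trailing newline is 10 bytes, not 9, which matches the chars: 10 count above. (The file names here are illustrative.)

```shell
# Two versions of the input: without and with a trailing newline.
printf 'red apple'   > no_newline.txt    # 9 bytes
printf 'red apple\n' > with_newline.txt  # 10 bytes, what most editors save

wc -c < no_newline.txt     # 9
wc -c < with_newline.txt   # 10

# od -c shows the final \n explicitly.
tail -c 1 with_newline.txt | od -c
```

If the file really ends in \n, the \n rule fires once, which accounts for lines: 1 and chars: 10 in the output shown.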
