AWK: Comparing the difference of files with different delimiters - unix

1.txt
1|2|3
4|5|6
7|3|6
2.txt (double pipe)
1||2||3
4||5||6
expected
7|3|6
I want to compare 1.txt and 2.txt and print the difference. Note that the number of columns can vary each time.
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' 2.txt 1.txt
How can I modify the code to account for the different delimiters in each file?
The code below works for the first field alone, but I am not sure how it separates the fields on the double pipe:
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt

One simple workaround would be to squeeze the double delimiters in the second file before feeding to awk:
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' <(tr -s '|' < 2.txt) 1.txt
For your sample input, it'd produce:
7|3|6
EDIT: You assert that
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt
works. It doesn't do what you expect. It compares only the first field and not the entire line.
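To see why, consider two hypothetical one-line files whose first fields match but whose remaining fields differ (a.txt and b.txt are illustrative names, not from the question):
$ printf '7|3|6\n' > a.txt
$ printf '7||9||9\n' > b.txt
$ awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' b.txt a.txt
$
Nothing is printed: field 1 matches in both files, so the differing line is never reported.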

You can use this awk:
awk -F"|" 'NR==FNR{gsub(/\|\|/,"|",$0);a[$0]++;next} !(a[$0])' 2.txt 1.txt

I typically use bash features to accomplish this:
diff 1.txt <(sed 's/||/|/g' < 2.txt)
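For the sample files above, the output would be something like:
3d2
< 7|3|6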

You can use a regexp as the field separator in gawk. If you don't mind that the output is unordered (awk arrays are traversed in no particular order), you can do it with a single command:
gawk 'BEGIN {FS="\\|\\|*"} {gsub(FS,"|"); a[$0]++} END {for (k in a) {if (a[k] == 1) {print k}}}' 1.txt 2.txt
BEGIN {FS="\\|\\|*"} ==> The field separator is one or more |
{gsub(FS,"|"); a[$0]++} ==> On every line, normalize any run of | separators to a single | and count the normalized line in an array
END {for (k in a) {if (a[k] == 1) {print k}}} ==> Finally, print every array element that was seen exactly once, i.e. the lines present in only one of the two files.
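As a quick sanity check of the separator pattern (the echoed values are illustrative), a single pipe and a double pipe split into the same three fields:
$ echo '1|2|3' | gawk 'BEGIN{FS="\\|\\|*"} {print NF, $1, $2, $3}'
3 1 2 3
$ echo '1||2||3' | gawk 'BEGIN{FS="\\|\\|*"} {print NF, $1, $2, $3}'
3 1 2 3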

Related

Match function in unix to find if string ends with particular input value?

I need a regular expression that I can use in the match() function to see if the value given as a command-line argument exists at the end of a given string.
I am using awk and trying to use the match function to get the above result:
while read line; do
cat $line | awk -v value="$2.$" '{ if ( match("$1,value) != 0 ) print match("arvind","ind.$") " " "arvind" }'
done < xreffilelist.txt
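For reference, a rough sketch of one way to do the "ends with" test with match(): pass the suffix in with -v and anchor it with $ (suffix, str and the surrounding loop are illustrative and assume the suffix contains no regex metacharacters):
suffix="$2"
while read -r line; do
    awk -v s="$suffix" -v str="$line" 'BEGIN {
        # match() with a dynamic regex: the suffix concatenated with "$"
        if (match(str, s "$")) print str " ends with " s
    }'
done < xreffilelist.txt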

Find the number of occurrences of a word using awk and a variable

I am trying to find the number of fields that contain the word entered by the user. I do not know the syntax for using a shell variable in my awk statement. If I use a literal string, $i == "Washington", it works, but I need it to use the input. When I try this it returns nothing:
read Choice
awk '{
    for (i=1;i<=NF;i++)
        if ( $i == "$Choice")
            c++
}
END{
    print c
}' DC_Area.csv
Shell variables are not visible in awk. That's why the code in the OP doesn't work. Use the -v option to pass on shell variables to awk.
Try
read Choice
awk -v Choice="$Choice" '{
    for (i=1;i<=NF;i++)
        if ( $i == Choice)
            c++
}
END{
    print c
}' DC_Area.csv
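A quick check with some made-up input (the sample lines are illustrative, not from DC_Area.csv):
$ printf 'Washington DC\nSeattle\nWashington State\n' | awk -v Choice="Washington" '{for (i=1;i<=NF;i++) if ($i == Choice) c++} END{print c}'
2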

How do I fetch this substring using awk?

I have a string let's say
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
Now I want to fetch only CUSTOM_executable from the above string. This is what I have tried so far in Unix:
echo $k|awk -F '_' '{print $2}'
Can you explain how I can do this?
Try this:
$ echo "$k"
CHECK_111_CUSTOM_executable.acs
code:
echo "$k" | awk 'BEGIN{FS=OFS="_"}{sub(/.acs/, "");print $3, $4}'
Assume the variable ${SOMETHING} has the value SOMETHING just for simplicity.
The following assignment, therefore,
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
sets the value of k to CHECK_SOMETHING_CUSTOM_executable.acs.
When the string is split into fields on _ by awk -F _ (the single quotes around the separator aren't necessary here), you get the following fields:
$ echo "$k" | awk -F _ '{for (i=0; i<=NF; i++) {print i"="$i}}'
0=CHECK_SOMETHING_CUSTOM_executable.acs
1=CHECK
2=SOMETHING
3=CUSTOM
4=executable.acs
So to get the output you want, simply use:
echo "$k" | awk -F _ -v OFS=_ '{print $3,$4}'
Suppose the SOMETHING variable contains 111_222_333 (or 111_222_333_444).
Use this:
$ k=CHECK_${SOMETHING}_CUSTOM_executable.acs
$ echo $k | awk 'BEGIN{FS=OFS="_"}{ print $(NF-1),$NF }'
(Or)
echo $k | awk -F_ '{ print $(NF-1), $NF }' OFS=_
Explanation:
NF - The number of fields in the current input record.
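To see NF and the last two fields in isolation (a throwaway illustrative string):
$ echo 'a_b_c_d' | awk -F_ '{print NF, $(NF-1), $NF}'
4 c d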
Try this simple awk:
awk -F[._] '{print $3"_"$4}' <<<"$k"
CUSTOM_executable
The -F[._] defines both the dot and the underscore as field separators. awk then prints fields 3 and 4 of $k.
If k contains k='CHECK_${111_111}_CUSTOM_executable.acs', then use field numbers $4 and $5:
awk -F[._] '{print $4"_"$5}' <<<"$k"
The string CHECK_${111_111}_CUSTOM_executable.acs then splits as:
$1=CHECK  $2=${111  $3=111}  $4=CUSTOM  $5=executable  $6=acs
You do not need to use awk; this can be done easily in bash. I assume that $SOMETHING does not contain _ characters (the CUSTOM and executable parts are just some text and also contain no _). Then:
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
l=${k#*_}; l=${l#*_}; l=${l%.*};
This cuts everything from the beginning up to and including the 2nd _ character, and chomps off everything after the last . character. The result ends up in the variable l.
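Traced step by step for the simple case (SOMETHING expanding to the literal text SOMETHING):
k=CHECK_SOMETHING_CUSTOM_executable.acs
l=${k#*_}     # SOMETHING_CUSTOM_executable.acs
l=${l#*_}     # CUSTOM_executable.acs
l=${l%.*}     # CUSTOM_executable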
If $SOMETHING may contain _, then a little more work has to be done (I assume the CUSTOM and executable parts do not contain _):
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
l=${k%_*}; l=${l%_*}; l=${k#${l}_*}; l=${l%.*};
This chomps off everything after the last-but-one _ character, then cuts that prefix off the original string. The last statement chomps off the extension. The result is in the variable l.
Or it can be done using regex:
[[ $k =~ ([^_]+_[^_]+)\.[^.]+$ ]] && l=${BASH_REMATCH[1]}
This matches any string ending in two words separated by _, followed by .<extension>. The extension part is chomped off and the result is in the variable l.
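For example, assuming SOMETHING expands to 111_222 (an illustrative value):
k=CHECK_111_222_CUSTOM_executable.acs
[[ $k =~ ([^_]+_[^_]+)\.[^.]+$ ]] && l=${BASH_REMATCH[1]}
echo "$l"     # CUSTOM_executable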
I hope this helps!

Unix awk Substring string comparison

I want to find out whether a substring is contained in a string using the Unix awk command.
eg, pseudocode:
a= commandline
b=line
if(b is contained in a)
print "success "
$ awk 'BEGIN{a="commandline";b="line";if (a ~ b){print "success"}}'
success
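Note that a ~ b treats b as a regular expression. If the needle might contain regex metacharacters, index() does a purely literal substring test; a minimal sketch:
$ awk 'BEGIN{a="commandline"; b="line"; if (index(a, b) > 0) print "success"}'
success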

Remove duplicates from a large file

I have a ~20GB csv file.
Sample file:
1,a@a.com,M
2,b@b.com,M
1,c@c.com,F
3,d@d.com,F
The primary key in this file is the first column.
I need to write two files, uniq.csv and duplicates.csv.
uniq.csv should contain all non-duplicate records, and duplicates.csv should contain all duplicate records prefixed with the current timestamp.
uniq.csv
1,a@a.com,M
2,b@b.com,M
3,d@d.com,F
duplicates.csv
2012-06-29 01:53:31 PM, 1,c@c.com,F
I am using Unix sort so that I can take advantage of its external R-way merge sort algorithm.
To identify unique records:
tail -n+2 data.txt | sort -t, -k1 -un > uniq.csv
To identify duplicate records:
awk 'x[$1]++' FS="," data.txt | awk '{print d,$1}' "d=$(date +'%F %r')," > duplicates.csv
I was wondering if there is any way to find both the duplicates and the unique records with a single scan of this large file?
Your awk script is nearly there. To find the unique lines, you merely need to use the in operator to test whether the entry is in the associative array or not. This allows you to collect the data in one pass through the data file and to avoid having to call sort.
tail -n +2 data.txt | \
awk '
    BEGIN { OFS=FS="," }
    {
        if (!($1 in x)) {
            print $0 > "/dev/fd/3"
        }
        x[$1]++
    }
    END {
        for (t in x) {
            print d, t, x[t]
        }
    }' d="$(date +'%F %r')" 3> uniq.csv > duplicates.csv
I got this question in an interview, a couple of jobs ago.
One answer is to use uniq with the "-c" (count) option. An entry with a count of "1" is unique, and otherwise not unique.
sort foo | uniq -c | awk '{ if ($1 == 1) { write-to-unique } else { write-to-duplicate } }'
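A concrete version of that sketch (unique.txt and duplicate.txt are placeholder names; the count column that uniq -c prepends is stripped before writing):
sort foo | uniq -c | awk '{
    line = $0
    sub(/^[[:space:]]*[0-9]+[[:space:]]/, "", line)   # drop the leading count
    if ($1 == 1) print line > "unique.txt"
    else         print line > "duplicate.txt"
}'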
If you want to write a special-purpose program and/or avoid the delay caused by sorting, I would use Python.
Read the input file, hashing each entry and incrementing an integer count for each unique key that you encounter. Remember that hash values can collide even when two items are not equal, so keep each key individually along with its count.
At EOF on the input, traverse the hash structure and spit each entry into one of two files.
You seem not to need sorted output, only categorized output, so hashing should be faster. Building the hash is O(1) per record, while sorting is O(N log N).
Here is some Perl code which will do the processing in one scan:
#!/usr/bin/perl
open(FI, "sort -t, -k1 < file.txt |");
open(FD, ">duplicates.txt");
open(FU, ">uniques.txt");
my @prev;
while (<FI>)
{
    my (@cur) = split(',');
    if ($prev[0] && $prev[0] == $cur[0])
    {
        print FD localtime()." $_";
    }
    else
    {
        print FU $_;
    }
    @prev = @cur;
}
