I need a regular expression that I can use in the match() function to check whether a value given as a command-line argument exists at the end of a given string.
I am using awk and then trying to use the match function to get the above result:
while read line; do
    cat $line | awk -v value="$2.$" '{ if ( match("$1,value) != 0 ) print match("arvind","ind.$") " " "arvind" }'
done < xreffilelist.txt
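A minimal, untested sketch of one way this is usually done, assuming the value to look for is the script's second argument and that the test is against the first field of each line (both assumptions on my part); appending "$" to the variable builds a dynamic regex anchored at the end of the string, and any regex metacharacters in the value will be interpreted as such:
while read -r line; do
    # For each file listed in xreffilelist.txt, print the first field of every
    # line whose first field ends with the given value.
    awk -v value="$2" 'match($1, value "$") { print FILENAME ": " $1 }' "$line"
done < xreffilelist.txt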
I am trying to find the number of fields that contain the word entered by the user. I do not know the syntax for using a shell variable in my awk statement. If I just use a literal string, e.g. $i == "Washington", it works, but I need it to use the input. When I try this it returns nothing:
read Choice
awk '{
for (i=1;i<=NF;i++)
if ( $i == "$Choice")
c++
}
END{
print c}' DC_Area.csv
Shell variables are not visible inside awk; that's why the code in the OP doesn't work. Use the -v option to pass shell variables to awk.
Try
read Choice
awk -v Choice="$Choice" '{
for (i=1;i<=NF;i++)
if ( $i == Choice)
c++
}
END{
print c}' DC_Area.csv
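For completeness (this is an alternative, not part of the original answer), awk can also read exported shell variables through its built-in ENVIRON array:
read Choice
export Choice
awk '{
    for (i = 1; i <= NF; i++)
        if ($i == ENVIRON["Choice"])
            c++
}
END { print c }' DC_Area.csv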
1.txt
1|2|3
4|5|6
7|3|6
2.txt (double pipe)
1||2||3
4||5||6
expected
7|3|6
I want to compare 1.txt and 2.txt and print the difference. Note that the number of columns can vary each time.
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' 2.txt 1.txt
How can I modify the code to handle the different delimiters in the two files?
The code below works when comparing the first field alone, but I am not sure how it separates the fields on the double pipe:
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt
One simple workaround would be to squeeze the double delimiters in the second file before feeding it to awk:
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' <(tr -s '|' < 2.txt) 1.txt
For your sample input, it'd produce:
7|3|6
EDIT: You assert that
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt
works. It doesn't do what you expect. It compares only the first field and not the entire line.
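To make that concrete, suppose 1.txt also contained a line 1|9|9 (my own example) that does not occur in 2.txt but shares its first field with a line that does; the first-field version would silently drop it:
$ printf '1|2|3\n1|9|9\n7|3|6\n' > 1b.txt
$ awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1b.txt
7|3|6
The line 1|9|9 is missing from the output even though it never appears in 2.txt, because its first field (1) does.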
You can use this awk:
awk -F"|" 'NR==FNR{gsub(/\|\|/,"|",$0);a[$0]++;next} !(a[$0])' 2.txt 1.txt
I typically use bash features to accomplish this:
diff 1.txt <(sed 's/||/|/g' < 2.txt)
You can use a regexp as the field separator in gawk. If you don't mind the output being unsorted (awk arrays are unordered), you can do it with a single command:
gawk 'BEGIN {FS="\\|\\|*"} {gsub(FS,"|") ; a[$0]++} END {for (k in a) {if ( a[k] == 1 ) { print k } } }' 1.txt 2.txt
BEGIN {FS="\\|\\|*"} ==> the field separator is one or more |
{gsub(FS,"|") ; a[$0]++} ==> on every line, squeeze runs of the separator | down to a single | and count, in an array, how many times the normalized line has been seen
END {for (k in a) {if ( a[k] == 1 ) { print k } } } ==> finally, print every line that was seen exactly once, i.e. lines that appear in only one of the files
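For the sample 1.txt and 2.txt in the question, the expected output of that command would be:
7|3|6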
I want to translate the word "abcd" into upper case "ABCD" using the tr command, then translate the "ABCD" into digits, e.g. 1234.
I want to chain the two translations together (lowercase to uppercase, then uppercase to 1234) using pipes, and also pipe the final output into more.
I'm not able to chain the second part.
echo "abcd" | tr '[:lower:]' '[:upper:]' > file1
Here I'm not sure how to add the second translation in the same command.
You can't do it in a single tr command; you can do it in a single pipeline:
echo "abcd" | tr '[:lower:]' '[:upper:]' | tr 'ABCD' '1234'
Note that your [:lower:] and [:upper:] notation will translate more than abcd to ABCD. If you want to extend the mapping of digits so A-I map to 1-9, that's doable; what maps to 0?
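For instance, assuming J is chosen to map to 0 (that choice is mine, not the question's), the extended mapping could be written as:
echo "abcdefghij" | tr 'a-jA-J' '1-901-90'    # prints 1234567890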
If you want to do it in a single command, then you could write:
echo "abcdABCD" | tr 'abcdABCD' '12341234'
Or, abbreviated slightly:
$ echo 'abecedenarian-DIABOLICALISM' | tr 'a-dA-D' '1-41-4'
12e3e4en1ri1n-4I12OLI31LISM
$
I have a ~20GB csv file.
Sample file:
1,a@a.com,M
2,b@b.com,M
1,c@c.com,F
3,d@d.com,F
The primary key in this file is the first column.
I need to write two files, uniq.csv and duplicates.csv.
uniq.csv should contain all non-duplicate records, and duplicates.csv will contain all duplicate records with the current timestamp.
uniq.csv
1,a@a.com,M
2,b@b.com,M
3,d@d.com,F
duplicates.csv
2012-06-29 01:53:31 PM, 1,c@c.com,F
I am using Unix sort so that I can take advantage of its external R-way merge sorting algorithm.
To identify unique records:
tail -n+2 data.txt | sort -t, -k1 -un > uniq.csv
To identify duplicate records:
awk 'x[$1]++' FS="," data.txt | awk '{print d,$1}' "d=$(date +'%F %r')," > duplicates.csv
I was wondering if there is any way to find both the duplicates and the uniques with a single scan of this large file?
Your awk script is nearly there. To find the unique lines, you merely need to use the in operator to test whether the entry is in the associative array or not. This allows you to collect the data in one pass through the data file and avoid having to call sort.
tail -n +2 data.txt | \
awk '
    BEGIN { OFS=FS="," }
    {
        # First time this key is seen: write the record to uniq.csv (fd 3)
        if (!($1 in x)) {
            print $0 > "/dev/fd/3"
        }
        x[$1]++
    }
    END {
        # Write each key and its occurrence count to duplicates.csv (stdout)
        for (t in x) {
            print d, t, x[t]
        }
    }' d="$(date +'%F %r')" 3> uniq.csv > duplicates.csv
I got this question in an interview, a couple of jobs ago.
One answer is to use uniq with the "-c" (count) option. An entry with a count of "1" is unique, and otherwise not unique.
sort foo | uniq -c | awk '{ if ($1 == 1) { write-to-unique } else { write-to-duplicate } }'
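A concrete version of that pipeline might look like the following sketch (my own; note that it keys on whole lines rather than on the first column, and the file names are placeholders):
# uniq -c prefixes each line with its count; strip the count and route the line
# to one of two files depending on whether it appeared exactly once.
sort data.txt | uniq -c | awk '{
    n = $1
    sub(/^[[:space:]]*[0-9]+[[:space:]]/, "")
    if (n == 1)
        print > "unique.txt"
    else
        print > "duplicate.txt"
}'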
If you want to write a special-purpose program and/or avoid the delay caused by sorting, I would use Python.
Read the input file, hashing each entry and incrementing an integer count for each key that you encounter. Remember that hash values can collide even when the two items are not equal, so keep each key individually along with its count.
At EOF on the input, traverse the hash structure and spit each entry into one of two files.
You seem not to need sorted output, only categorized output, so hashing should be faster. Each hash insertion is O(1) on average, while sorting is O(N log N).
Here is some Perl code which will do the processing in one scan:
#!/usr/bin/perl
# Read the input pre-sorted on the first column so duplicate keys are adjacent.
open(FI, "sort -t, -k1 < file.txt |");
open(FD, ">duplicates.txt");
open(FU, ">uniques.txt");
my @prev;
while (<FI>)
{
    my (@cur) = split(',');
    if ($prev[0] && $prev[0] == $cur[0])
    {
        # Same key as the previous record: it is a duplicate, prepend a timestamp.
        print FD localtime() . " $_";
    }
    else
    {
        print FU $_;
    }
    @prev = @cur;
}