Unix utilities, sum the data under the save entries - unix

I have this little problem that I want to ask:
So I have a file named "quest", which has:
Tom 100 John 10 Tom 100
How do I use awk to output something like:
Tom 200
I'd appreciate your help. I tried to look up online but I am not sure what I am look for. Thanks ahead!!
I do know how to use regular expression /Tom/ to grep the entry, but I am not sure how to proceed from there.

You can try something like:
$ awk '{
for(i=1; i<=NF; i+=2)
names[$i] = ((names[$i]) ? names[$i]+$(i+1) : $(i+1))
}
END{
for (name in names) print name, names[name]
}' quest
Tom 200
John 10
You basically iterate over the fields creating keys for all odd fields and assigning values of even fields to them. If the key already exists, you just add to the existing value.
This expects your file format to have Names on odd fields (for eg. 1, 3, 5 .. etc) and values on even fields (eg 2, 4, 6 .. etc).
In the END block, you just print entire array content.

I guess you need calculate all users' mark, not only Tom, here is the code:
xargs -n2 < file|awk '{a[$1]+=$2}END{for (i in a) print i,a[i]}'
Tom 200
John 10
and one-liner of awk
awk '{for (i=1;i<=NF;i+=2) a[$i]+=$(i+1)}END{for (i in a) print i,a[i]}' file
Tom 200
John 10

$ echo 'Tom 100 John 10 Tom 100' | grep -o '[0-9]*' | paste -sd+ | bc
210
grep -o '[0-9]*' produces
100
10
100
paste -sd+ produces
100+10+100
bc calculates the result.
However, this only works for small input since bc has limitation in input size.
In that case you can use awk '{s+=$0}END{print s}' instead of paste -sd+ | bc.
However note that GNU Awk treats all number as floting point, it produces inaccurate result when number is too large.

awk '/Tom/{
for(i=1;i<=NF;i++)
if($i=="Tom")s+=$(i+1);
print "Tom",s;s=0}' your_file
Test

Here is a way to do it in awk (no loop):
awk -v RS=" " '{n=$1;getline;a[n]+=$1} END {for (i in a) print i,a[i]}' quest
Tom 200
John 10
If there are more than one line like this
cat quest
Tom 100 John 10 Tom 100
Paul 20 Tom 40 John 10
Then do this with gnu awk:
awk -v RS=" |\n" '{n=$1;getline;a[n]+=$1} END {for (i in a) print i,a[i]}' quest
Paul 20
Tom 240
John 20
And if you do not like getline
awk -v RS=" |\n" 'NR%2 {n=$1;next}{a[n]+=$1} END {for (i in a) print i,a[i]}' quest

Related

How do I list specific lines using awk?

I have a file that is sorted like the following:
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
I want to search for the highest value in the file and display what is one the line?
3 Goodbye
3 begin
3 Yes
3 No
How would I do this?
awk to the rescue!
$ awk 'FNR==NR{if(max<$1) max=$1; next} $1==max' file{,}
3 Goodbye
3 Begin
3 Yes
3 No
double-pass, find the maximum and filter out the rest.
cat file.txt | sort -r | awk '{if ($1>=prev) {print $0; prev=$1}}'
3 Yes
3 No
3 Goodbye
3 Begin
Assuming file.txt contains
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
First get the highest value in the file into a variable. Considering the file is already sorted, pickup the last line in the file. Then parse out the number using awk.
highest=`tail -1 file.list|awk '{print $1}'`
Then grep the file using that value.
grep "^${highest} " file.list
This should do the job. I am only using awk as required in the question:
awk 'BEGIN {v=0} {l = l "\n" $0} {if ($1>v) {l = $0; v = $1}} END {print l}' file.txt
The variable v is initialized (before parsing the file) to 0. Then each line is read and kept in memory; if the first field ($1) is greater than v, then update v and empty what is in l. At the end, just print the content of l.
It's easier than you think.
awk '/^3/' file
3 Goodbye
3 Begin
3 Yes
3 No

Compare 2 files in unix file1(2M numbers/rows/lines) , file2(2,000,480 numbers/rows/lines)

How can I compare this 2 big files in unix.
I've already tried using 'grep -Fxvf file1.txt file2.txt | wc -l' but the output is 2,000,480 and when switching file1 and file2 the output is 1,999,999.
How can I get the output of '480' because that's what i am expecting.
I've also tried using diff/cmp commands but the output is too complicated.
I think you want an absolute value of a difference in line numbers in 2 files. You can achieve it easily with awk and get a decent result. You'd read numbers of lines in an array and later subtract the array values in the END block. For pure shell it'd have to get more complex. Imagine you get some test data generated (10 and 14 line files):
$ seq 1 10 > ten
$ seq 1 14 > fourteen
And then you do:
$ ( wc -l ten ; wc -l fourteen ) | awk '{ print $1}' | sort -rn | xargs -J % echo % - p | dc
The result:
4
But much better way would be do just do it in 3 lines (get word count for file1, then file2 and then subtract)

Finding common elements from one file in a column of another file and output the entire row of the latter

I needed to extract all hits from one list (list.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume as 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt" and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated thanks.
Try something like this:
for ids in List.txt
do
grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If it is problem try two step version:
cat List.txt | sed -e "s/.*/[TAB;]\0[TAB;]/g" > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
TAB character can be inserted by combination Ctrl-V following by Tab key in command line, and Tab character in editor. You have to check if your edit does not change tab to series of spaces.
The UNIX tool for general text processing is "awk":
awk '
NR==FNR { list[$0]; next }
{
for (word in list) {
if ($0 ~ "[\t;]" word "[\t;]") {
print
next
}
}
}
' List.txt Data.txt > output.txt

How to use awk script to run through a textfile

I need to run through a textfile, doctors.txt, which is written in the format:
Sarah,Jenny,Charles;Dr. Hampton
Jenny,Lucy,Harry;Dr. Fritz
Ben,Kaitlyn,Connor,Charles;Dr. Hampton
and have it output:
Dr. Hampton: Sarah Jenny Charles Ben Kaitlyn Connor
Dr. Fritz: Jenny Lucy Harry
(if someone is mentioned more than once I can't have them repeat)
I need to do this using awk, I currently am having issues even trying to make it print anything:
My code is:
#!/user/bin/awk -f
awk 'BEGIN {for i in $(doctors.txt) {
split(i,doctors,";");}
END{print doctors[1]}'
When I run it, I get
awk: 3: unexpected character '''
awk: 5: unexpected character '''
Could someone help me with this please?
Try this awk
awk -F\; '{gsub(/,/," ");a[$2]=a[$2]?a[$2]" "$1:$1} END {for (i in a) print i": "a[i]}' doctors.txt
Dr. Fritz: Jenny Lucy Harry
Dr. Hampton: Sarah Jenny Charles Ben Kaitlyn Connor Charles
To use it in a script:
#!/bin/bash
awk -F\; '{gsub(/,/," ");a[$2]=a[$2]?a[$2]" "$1:$1} END {for (i in a) print i": "a[i]}' doctors.txt > doctors2.txt
How does it work:
a[$2]= # give array a[$2] the following value
a[$2] # test if array a[$2] have data already
? # If yes then
a[$2]" "$1 # add $1 to the variable already stored there
: # If no the
$1 " just sett array a[$2] to value in $1
This part a[$2]=a[$2]?a[$2]" "$1:$1 can be replaced by
if (a[$2]) a[$2]=a[$2]" "$1; else a[$2]=$1
Can be shorten some: (do not need the test, since the extra space is ok)
awk -F\; '{gsub(/,/," ");a[$2]=a[$2]" "$1} END {for (i in a) print i":"a[i]}' doctors.txt
May be you can use perl for this:
perl -F";" -lane '#a=split /,/,$F[0];
$x{$F[1]}.="#a";
END{print "$_:$x{$_}" for(keys %x)}' your_file
Tested here
If you insist on awk:
awk -F';' '{
gsub(/,/," ",$1);
a[$2]=a[$2]""$1}
END{for(i in a)print i":"a[i]
}' yourfile
Tested the awk version here
awk -F ";" '{print $1}' doctors.txt

sed/awk or other: one-liner to increment a number by 1 keeping spacing characters

EDIT: I don't know in advance at which "column" my digits are going to be and I'd like to have a one-liner. Apparently sed doesn't do arithmetic, so maybe a one-liner solution based on awk?
I've got a string: (notice the spacing)
eh oh 37
and I want it to become:
eh oh 36
(so I want to keep the spacing)
Using awk I don't find how to do it, so far I have:
echo "eh oh 37" | awk '$3>=0&&$3<=99 {$3--} {print}'
But this gives:
eh oh 36
(the spacing characters where lost, because the field separator is ' ')
Is there a way to ask awk something like "print the output using the exact same field separators as the input had"?
Then I tried yet something else, using awk's sub(..,..) method:
' sub(/[0-9][0-9]/, ...) {print}'
but no cigar yet: I don't know how to reference the regexp and do arithmetic on it in the second argument (which I left with '...' for now).
Then I tried with sed, but got stuck after this:
echo "eh oh 37" | sed -e 's/\([0-9][0-9]\)/.../'
Can I do arithmetic from sed using a reference to the matching digits and have the output not modify the number of spacing characters?
Note that it's related to my question concerning Emacs and how to apply this to some (big) Emacs region (using a replace region with Emacs's shell-command-on-region) but it's not an identical question: this one is specifically about how to "keep spaces" when working with awk/sed/etc.
Here is a variation on ghostdog74's answer that does not require the number to be anchored at the end of the string. This is accomplished using match instead of relying on the number to be in a particular position.
This will replace the first number with its value minus one:
$ echo "eh oh 37 aaa 22 bb" | awk '{n = substr($0, match($0, /[0-9]+/), RLENGTH) - 1; sub(/[0-9]+/, n); print }'
eh oh 36 aaa 22 bb
Using gsub there instead of sub would replace both the "37" and the "22" with "36". If there's only one number on the line, it doesn't matter which you use. By doing it this way, though, it will handle numbers with trailing whitespace plus other non-numeric characters that may be there (after some whitespace).
If you have gawk, you can use gensub like this to pick out an arbitrary number within the string (just set the value of which):
$ echo "eh oh 37 aaa 22 bb 19" |
awk -v which=2 'BEGIN { regex = "([0-9]+)\\>[^0-9]*";
for (i = 1; i < which; i++) {regex = regex"([0-9]+)\\>[^0-9]*"}}
{ match($0, regex, a);
n = a[which] - 1; # do the math
print gensub(/[0-9]+/, n, which) }'
eh oh 37 aaa 21 bb 19
The second (which=2) number went from 22 to 21. And the embedded spaces are preserved.
It's broken out on multiple lines to make it easier to read, but it's copy/pastable.
$ echo "eh oh 37" | awk '{n=$NF+1; gsub(/[0-9]+$/,n) }1'
eh oh 38
or
$ echo "eh oh 37" | awk '{n=$NF+1; gsub(/..$/,n) }1'
eh oh 38
something like
number=`echo "eh oh 37" | grep -o '[0-9]*'`
sed 's/$number/`expr $number + 1`/'
How about:
$ echo "eh oh 37" | awk -F'[ \t]' '{$NF = $NF - 1;} 1'
eh oh 36
The solution will not preserve the number of decimals, so if the number is 10, then the result is 9, even if one would like to have 09.
I did not write the shortest possible code, it should stay readable
Here I construct the printf pattern using RLENGTH so it becomes %02d (2 being the length of the matched pattern)
$ echo "eh oh 10 aaa 22 bb" |
awk '{n = substr($0, match($0, /[0-9]+/), RLENGTH)-1 ;
nn=sprintf("%0" RLENGTH "d", n)
sub(/[0-9]+/, nn);
print
}'
eh oh 09 aaa 22 bb

Resources