I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
customer_id = $1;
mileage = $5;
total_mileage[customer_id] += mileage;
}
END {
for (customer_id in total_mileage) {
print customer_id, total_mileage[customer_id];
}
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
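If an average per customer is wanted as well, the same array idea extends naturally. Below is a minimal sketch, assuming the sample rows from the question are saved in a temporary file; the `summarise` function name and the temp file are my own additions, not part of the question. `sort -n` is added because awk does not guarantee any particular order for `for (id in sum)`.

```shell
# Sample rows copied from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
EOF

# Print "ID sum count average" for field 5, grouped by field 1.
summarise() {
    awk -F, '
        { sum[$1] += $5; n[$1]++ }
        END { for (id in sum) print id, sum[id], n[id], sum[id] / n[id] }
    ' "$1" | sort -n
}

summarise "$data"
```

The second column is the total mileage, the third the number of rows for that ID, and the fourth the average.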
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
echo "${id}:"
awk -v id="${id}" '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Below is an example that more closely matches your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are input prompt and continuation prompt respectively, they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum; print }
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach myself since it minimises the possibility of running out of memory. The trade-off, of course, is that it will take longer since you're processing the file once per unique ID.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41
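A possible middle ground between the two approaches is to sort the input by ID first and then stream it through awk once, keeping only one running total at a time. This is only a sketch: it assumes `sort` can cope with the file size (GNU sort, for instance, spills to temporary files), and the `totals_sorted` name and temp file are made up for illustration.

```shell
# Same test data as above (field 1: ID, field 2: mileage).
input=$(mktemp)
printf '%s\n' 'a 1' 'b 2' 'c 3' 'a 4' 'b 5' 'c 6' \
              'a 7' 'b 8' 'c 9' 'b 10' 'c 11' 'c 12' > "$input"

# Sort by ID so each group is contiguous, then total one group at a time.
totals_sorted() {
    sort -k1,1 "$1" |
    awk '
        NR > 1 && $1 != prev { print prev ": " sum; sum = 0 }  # ID changed: flush the total
        { prev = $1; sum += $2 }
        END { if (NR > 0) print prev ": " sum }                # flush the last group
    '
}

totals_sorted "$input"
```

Because the IDs arrive grouped, awk never holds more than one total in memory, and the file is read only once after sorting.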
I have two files with 1s and 0s in each column, where the field separator is "," :
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
What I want to do is look at the file in pairs of rows, compare them, and output a 1 if they are exactly the same. So for this example, rows 1 & 2 are different so they don't get a 1, rows 3 & 4 are exactly the same so they get a 1, rows 5 & 6 differ by one column so they don't get a 1, and so on.
So the desired output could be something like :
1
1
1
Because here there are exactly 3 pairs of rows that are exactly the same (rows are paired simply by being consecutive): rows 3 & 4, 7 & 8, and 9 & 10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
where every even-numbered line prints a 1 if it is identical to the previous line (saved in p) or a 0 if it is not.
If you truly only want the 1s, which is throwing away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the counts of all matched pairs:
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c}' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while read line
do
if test "$prevline" = "$line"
then
echo 1
fi
prevline=$line
done
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing the inner part of the loop to
if test "$prevline" = "$line"
then
echo 1
line="" # don't reuse a line
fi
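If "not reusing a row" means strictly non-overlapping pairs (rows 1 & 2, then 3 & 4, and so on), one way to sketch that in plain shell is to read two lines per iteration. The `compare_pairs` name and the temp file are hypothetical; the rows are copied from the question.

```shell
# Rows copied from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
EOF

# Read two lines at a time and print 1 when both lines of a pair are identical.
compare_pairs() {
    while IFS= read -r first && IFS= read -r second; do
        if [ "$first" = "$second" ]; then
            echo 1
        fi
    done < "$1"
}

compare_pairs "$data"
```

A trailing unpaired line (odd number of rows) is silently ignored, since the second `read` fails and ends the loop.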
I have a file that looks like this:
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
...with column 1 stretching on to 22, and several thousand entries in total.
I want to create a file that lists bins of 5 million from column 4 which have data, separating by column 1.
Basically, all but columns 1 and 4 can be discarded. A simple input would look like this:
InputChr1:
61733
114771
21116837
21127670
21199392
InputChr2:
157112959
157113214
257145829
357156412
457156575
So, for the example above, I would want to get two files that look like this:
OutputChr1.txt
Start End Occurrences
1 5000000 2
20000001 25000000 3
OutputChr2.txt
Start End Occurrences
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
Any ideas? It seems like something that should be doable with lapply in R, but I can't get the for loops to work...
EDIT: Actually, I made this look much harder than it needed to be - basically, I want to split the original file by column 1, extract the data in column 4, and then count the instances in bins of 5 million.
(Apologies for slightly random tags, just trying to think of which tools might be best!)
Well, this turned out to be quite challenging. I couldn't find a way to do it with a single awk command, though.
awk -v const=5000000 -v max=150 \
'{a[$1,int($4/const)]++; b[$1]}
END{for (i in b)
{for (j=0; j<max; j++)
print i, j*const +1, (j+1)*const, a[i,j]
}
}' file
And then to get only the results:
awk 'NF==4'
Explanation
-v const=5000000 -v max=150 set the variables. const is the bin width of 5 million. max is the number of bins scanned in the END block, so positions up to max*const (750 million here) are covered.
a[$1,int($4/const)]++ creates an array indexed by (1st field, bin number). The bin number int($4/const) maps 23432 --> 0, 6000000 --> 1, etc. That is, it tells which block of values each 4th column falls into.
b[$1] keeps track of the first-column values that have been seen.
END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}} prints the values.
awk 'NF==4' prints only the lines that have 4 fields. This way it outputs only the bins in which there were matches.
In case you want to store the values into a new file, you can do
awk 'NF==4 {print > ("OutputChr" $1 ".txt")}'
Sample output
$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1
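The fixed `max` and the `NF==4` filter can be avoided by looping only over the bins that actually exist. A rough sketch, assuming the sample rows from the question; `bin_counts`, the `count` array and the temp file are my own names, and `sort` is used because awk's `for (key in count)` order is unspecified:

```shell
# Sample rows copied from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
EOF

# Count rows per (chromosome, bin) and print only the bins that occur.
bin_counts() {
    awk -v const=5000000 '
        { count[$1, int($4 / const)]++ }
        END {
            for (key in count) {
                split(key, part, SUBSEP)   # part[1] = chromosome, part[2] = bin index
                print part[1], part[2] * const + 1, (part[2] + 1) * const, count[key]
            }
        }
    ' "$1" | sort -k1,1n -k2,2n
}

bin_counts "$input"
```

One caveat that applies to `int($4/const)` in general: a position that is an exact multiple of `const` (for example 5000000) lands in the next bin, which does not match the 1-based labels. Using `int(($4 - 1) / const)` instead would keep it in the 1 to 5000000 bin.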
All in one
awk '{ v=int($4/const)
a[$1 FS v]++
if (!($1 in min) || v<min[$1]) min[$1]=v # track the lowest bin index for group $1
if (!($1 in max) || v>max[$1]) max[$1]=v # track the highest bin index for group $1
}END{ for (i in min)
for (j=min[i];j<=max[i];j++) # loop over the bins between the lowest and highest index
if (a[i FS j]!="") print j*const+1,(j+1)*const,a[i FS j] > ("OutputChr" i ".txt") # if the bin has data, write it to "OutputChr" i ".txt"
}' const=5000000 file
result:
$ cat OutputChr1.txt
1 5000000 2
20000001 25000000 3
$ cat OutputChr2.txt
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
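The question's desired output also has a "Start End Occurrences" header in each file. One way this might be sketched is to sort by chromosome and position first, then write each bin as soon as it is complete. This is an illustration under assumptions: `write_bins`, `emit` and the temporary paths are hypothetical names, and the output directory is passed in rather than using the current directory.

```shell
input=$(mktemp)
outdir=$(mktemp -d)
cat > "$input" <<'EOF'
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
EOF

# $1 = input file, $2 = directory for the OutputChr<N>.txt files
write_bins() {
    sort -k1,1n -k4,4n "$1" |
    awk -v const=5000000 -v dir="$2" '
        function emit() {                      # write the finished bin, if any
            if (n > 0) print bin * const + 1, (bin + 1) * const, n > out
        }
        {
            b = int($4 / const)
            if (NR == 1 || $1 != chr) {        # new chromosome: start a new file
                emit()
                if (out != "") close(out)
                chr = $1
                out = dir "/OutputChr" chr ".txt"
                print "Start End Occurrences" > out
                bin = b; n = 0
            } else if (b != bin) {             # same chromosome, next bin
                emit()
                bin = b; n = 0
            }
            n++
        }
        END { emit() }
    '
}

write_bins "$input" "$outdir"
cat "$outdir/OutputChr1.txt"
```

Since the input is sorted, only one file is open at a time and the bins in each file come out in ascending order.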