Performing calculations based on customer ID in comma-separated file [duplicate] - unix

This question already has an answer here:
Use awk to sum or average for each unique ID
(1 answer)
Closed 6 years ago.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.

This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
customer_id = $1;
mileage = $5;
total_mileage[customer_id] += mileage;
}
END {
for (customer_id in total_mileage) {
print customer_id, total_mileage[customer_id];
}
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50

While I agree with #wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers

You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
echo "${id}:"
awk -vid=${id} '$1==id {print " "$0)' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. As shown below, an example more closely matching your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are input prompt and continuation prompt respectively, they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum; print }
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single pass awk script, using associative arrays to store the totals then outputting all the data in the END block. I myself tend to prefer the multi-pass approach myself since it minimises the possibility of running out of memory. The trade-off, of course, is that it will no doubt take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41

Related

Extract two consecutive lines that have non-consecutive strings

I have a very large text file with 2 columns and more than 10 mio of lines.
Most lines have in column 2 a number that is the number of column 2 of the previous line +1. However, few thousands of lines behave differently (see example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract the pair of two lines that do not respect the +1 increment in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferentially) an awk command that allows to do that?
I tried several codes comparing column 2 of two consecutive lines but unfortunately I fail until now (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help. Best,
Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
The variables names are similar to yours: p holds the previous line and
p2 holds the second column of the previous line.
The condition NR>1 suppresses to print on the 1st line.
if ($2!=p2+1) print p ORS $0 prints the pairs of two lines which
meet the condition.
The block {p=$0; p2=$2} preserves values of current line for the next iteration.
I like perl for the text processing that needs arithmetic.
$ perl -ane 'print and next if $.<3; print $p and print if $F[3]!=$fp+1; $fp=$F[3]; $p=$_' input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 40 |
This is using -a to autosplit into #F.
Prints first 2 lines: print and next if $.<3
On subsequent lines, prints previous line and current line if the 4th field isn't exactly one more than the prior 4th field: print $p and print if $F[3]!=$fp+1
Saves the 4th field as $fp and the entire line as $p: $fp=$F[3]; $p=$_
Assumptions:
columns are tab-delimited
the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
lines of interest must have the same value in the 1st column (ie, if the values in the 1st column differ then we don't bother with comparing the values in the 2nd column and instead proceed to the next input line)
if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
Setup:
$ cat input.txt
A 1
A 2
A 3 # match
A 10 # match
A 11
A 12 # match
A 23 # match
A 40 # match
A 41
X to Z 101
X to Z 102 # match
X to Z 104 # match
X to Z 105
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1 { if ($1 == prev1 && $2+0 != prev2+1) {
if (prevline) print prevline
print
prevline="" # make sure this line is not printed again if next line also meets criteria
}
else
prevline=$0
}
{ prev1=$1; prev2=$2 }
' input.txt
This generates:
A 3
A 10
A 12
A 23
A 40
X to Z 102
X to Z 104
This might work for you (GNU sed):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file
Open a two line window throughout the length of the file.
Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is equal to the 2nd column of the second line then print both unadulterated lines.
Delete the first line and repeat.
N.B. This may print the second of these lines twice if the following line meets the same criteria.

Merge and remove redundant lines *among* files

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:
File1.txt
1
2
3
3
4
5
6
File2.txt
6
7
8
8
9
File3.txt
9
10
10
11
The desired output would be:
1
2
3
3
4
5
6
7
8
8
9
10
10
11
I would prefer to get a solution either in awk, or in bash or in R language. I searched the web for solutions and, though there were plenty of them* (please find some examples below), there were all removing duplicated lines regardless of the fact that they were located within or outside files.
Thanks in advance.
Arturo
Examples of previous solutions removing redundant lines both within and outside files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate
With your shown samples, could you please try following. This will NOT remove redundant lines within files but will remove them file wise.
awk '
FNR==1{
for(key in current){
total[key]
}
delete current
}
!($0 in total)
{
current[$0]
}
' file1.txt file2.txt file3.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if its first line(of each file) then do following.
for(key in current){ ##Traverse through current array here.
total[key] ##placing index of current array into total(for all files) one.
}
delete current ##Deleting current array here.
}
!($0 in total) ##If current line is NOT present in total then do following.
{
current[$0] ##Place current line into current array.
}
' file1.txt file2.txt file3.txt ##Mentioning Input_file names here.
Here's a trick adding on to https://stackoverflow.com/a/15385080/3358272 using diff and its output format. There is likely a presumption of "sorted" here, untested.
out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in ${#} ; do
{ cat "${out}" ;
diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
} > "${tmpout}"
mv "${tmpout}" "${out}"
done
cat "${out}"
Output:
$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11
$ diff <(./question.sh F*) Output.txt
(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)

get the top N counts based on the value of another column

I have a file with two columns that looks like this:
a 3
a 7
b 6
a 6
b 1
b 8
c 1
b 1
For each value in the first column, I'd like to find the top N counts from the second column. Using this example, I'd like to find the top 2 values in column 2 for each string in column 1. The desired output would be:
a 7
a 6
b 8
b 6
c 1
I was trying to do something like this with awk, but I am not that familiar with it. This gives the max, not top N:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}'
Could you please try following, using sort + awk solution.
sort -k2 -s -nr Input_file | awk '++array[$1]<=2' | sort -k1,1 -k2,2nr
OR
sort -k2 -s -nr Input_file | sort -k1,1 -k2,2nr | awk '++array[$1]<=2'
Logical brief explanation: First 2 sort commands are being used to sort data as per 1st and 2nd fields to get data in right order(as per OP's samples), then passing its output to awk to get only 1st 2 occurrences of each first field only as per ask.
$ sort -k1,1 -k2,2rn file | awk -v n=2 '(++cnt[$1])<=n'
a 7
a 6
b 8
b 6
c 1

Extract data before and after matching (BIG FILE )

I have got a big file ( arounf 80K lines )
my main goal is to find the patterns and pring for example 10 lines before and 10 lines after the pattern .
the pattern accures multiple times across the file .
using the grep command :
grep -i <my_pattern>* -B 10 -A 10 <my_file>
i get only some of the data , i think it must be something related to the buffer size ....
i need a command ( grep , sed , awk ) that will handle all the matching
and will print 10 line before and after the pattern ...
Example :
my patterns hides here :
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
this happens multiple times across the file .
running this command :
grep -i pattern_* -B 3 -A 3 <my_file>
will get he right output :
a
b
c
c
b
a
a
b
c
c
b
it works but not full time
if i have 80 patterns not all the 80 will be shown
awk to the rescue
awk -vn=4 # pass the argument of context line count
'{
for(i=1;i<=n;i++) # store the past n lines in an indexed array
p[i]=p[i+1];
p[n+1]=$0
}
/pattern/ # if pattern matched
{
c=n+1; # set the counter to after match line count
for(i=1;i<=n;i++) # print previously saved entries
print p[i]
}
c-->0' # print the lines after match until counter runs out
will print 4 lines before and 4 lines after the match of pattern, change the value of n as per your need.
if non-symmetric before/after you need two variables
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'

Finding common elements from one file in a column of another file and output the entire row of the latter

I needed to extract all hits from one list (list.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume as 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt" and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated thanks.
Try something like this:
for ids in List.txt
do
grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If it is problem try two step version:
cat List.txt | sed -e "s/.*/[TAB;]\0[TAB;]/g" > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
TAB character can be inserted by combination Ctrl-V following by Tab key in command line, and Tab character in editor. You have to check if your edit does not change tab to series of spaces.
The UNIX tool for general text processing is "awk":
awk '
NR==FNR { list[$0]; next }
{
for (word in list) {
if ($0 ~ "[\t;]" word "[\t;]") {
print
next
}
}
}
' List.txt Data.txt > output.txt

Resources