How to condense a file: uniq occurrences and sum another field - math

I have a very large file that looks something like this:
1,22,A
2,10,A
3,4,B
4,3,B
5,20,B
The second column tells me how many instances of the third column there are. So I want to collapse the third column (so that it is effectively uniqued), but add up the second column values. Desired output would be something like:
32,A
27,B
I can come up with some rather complicated ways to do this, but it seems like it ought to be rather simple...

I'm not sure what kind of "math" answer you would expect...
Given you have a file input.txt with the following content:
1,22,A
2,10,A
3,4,B
4,3,B
5,20,B
Save the following Ruby script as script.rb in the same directory as your input.txt, then run ruby script.rb from the console:
File.open('output.txt', 'w') do |file|
  result = {}
  File.readlines("input.txt").each do |line|
    # chomp drops the trailing newline so it does not become part of the key
    values = line.chomp.split(',')
    letter = values[2]
    letter_value = values[1].to_i
    result[letter] ||= 0
    result[letter] += letter_value
  end
  result.each do |letter, value|
    # write one "sum,letter" pair per line, matching the desired output
    file.puts [value, letter].join(',')
  end
end
Then, look for your result in output.txt in the same directory.
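For comparison, the same grouping can be sketched in Python. This is only an illustrative sketch, not part of the original answer: the function name condense is made up here, and it assumes every non-blank line has exactly three comma-separated fields with an integer in the second one.

```python
def condense(lines):
    """Sum the second field for each distinct value of the third field."""
    totals = {}  # dicts keep insertion order, so keys come out in first-seen order
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _, count, key = line.split(",")
        totals[key] = totals.get(key, 0) + int(count)
    return [f"{total},{key}" for key, total in totals.items()]


if __name__ == "__main__":
    sample = ["1,22,A", "2,10,A", "3,4,B", "4,3,B", "5,20,B"]
    for row in condense(sample):
        print(row)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a very large file is read one line at a time and only the running totals are kept in memory.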

Related

Print sets of lines from multiple folders as rows, not columns?

I have .out files in multiple folders.
Let's say I am in a directory containing folders A, B, C, D. I use the command below to print a specific value from the 8th column of lines containing the keyword VALUE in all .out files in folders A, B, C, D
awk '/VALUE/{print $8}' ./*/*.out
My result would look like:
output1_A
output2_A
output3_A
output1_B
output2_B
output3_B
output1_C
output2_C
output3_C
Is there a way I could get my output to look like what is shown below instead?
output1_A output2_A output3_A
output1_B output2_B output3_B
output1_C output2_C output3_C
In other words, have a space separate outputs from the same folder, and not a linebreak?
Could you please try the following? I don't have your directory structure, so I couldn't test it; if you post the files' contents, we could probably do it in a single awk. Note that xargs -n 3 assumes exactly three matching lines per folder.
awk '/VALUE/{print $8}' ./*/*.out | xargs -n 3
Another:
$ awk '/VALUE/{b=b (FNR==(NR>FNR)?ORS:ofs) $8;ofs=OFS}END{print b}' dir?/file1
output1_A output2_A output3_A
output1_B output2_B output3_B
output1_C output2_C output3_C
Explained:
$ awk '
/VALUE/ {                             # magic keyword
    b=b (FNR==(NR>FNR)?ORS:ofs) $8    # append $8 to the buffer, preceded by ORS or ofs as appropriate
    ofs=OFS                           # ... but at NR==1 we want "" (ofs is still unset then)
}
END {
    print b                           # output buffer
}' dir?/file1
The unexplained two empty records in your sample are not considered but would probably cause extra OFSes in the ends of the output records.
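The same idea (collect the 8th field of every matching line, then print one row per folder) can also be sketched in Python. This is a hedged sketch, not a tested replacement for the awk answers: the helper name pick_values is invented for illustration, the ./*/*.out glob is an assumption about the directory layout, and lines with fewer than eight fields are simply skipped.

```python
import glob
import os


def pick_values(lines, keyword="VALUE", column=8):
    """Return the chosen whitespace-separated column of every line containing keyword."""
    picked = []
    for line in lines:
        fields = line.split()
        # Lines that lack the requested column are skipped in this sketch.
        if keyword in line and len(fields) >= column:
            picked.append(fields[column - 1])
    return picked


if __name__ == "__main__":
    rows = {}  # folder name -> values collected from all its .out files
    for path in sorted(glob.glob("./*/*.out")):
        folder = os.path.dirname(path)
        with open(path) as handle:
            rows.setdefault(folder, []).extend(pick_values(handle))
    for values in rows.values():
        print(" ".join(values))
```

Grouping by folder rather than by a fixed count means the number of matches per folder does not have to be three.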

unix shell scripting to find and remove unwanted string in a pipe delimited file in a particular column

{
I have a requirement, where the file is pipe "|" delimited.
The first row contains the headers, and the count of columns is 5.
I have to delete only the string in the 3rd column if it matches the pattern.
Also note that the 3rd column can contain strings with commas (,), semicolons (;) or colons (:), but it will never contain a pipe (|), which is why we chose the pipe as the delimiter.
Input File:
COL1|COL2|COL3|COL4|COL5
1|CRIC|IPL|CRIC1:IPL_M1;IPL_M2;TEST_M1,CRIC2:ODI_M1;IPL_M3|C1|D1
2|CRIC|TEST|CRIC1:TEST_M2,CRIC2:ODI_M1;IPL_M1;TEST_M2;IPL_M3;T20_M1|C2|D2
Only COL3 should change in the output; no other column should be modified. In COL3, only the strings that match the pattern 'IPL_' should remain.
Any other strings, like "TEST_M1" or "ODI_M1", should be removed (made null).
Any semicolons left over should be removed as well.
eg
Question - CRIC1:IPL_M1;IPL_M2;TEST_M1,CRIC2:ODI_M1;IPL_M3
result - CRIC1:IPL_M1;IPL_M2,CRIC2:IPL_M3
Another scenario where if only strings that do not match "IPL_" are present then
Question - CRIC1:TEST_M1,CRIC2:ODI_M1
Result - CRIC1:,CRIC2:
Output File:
COL1|COL2|COL3|COL4|COL5
1|CRIC|IPL|CRIC1:IPL_M1;IPL_M2,CRIC2:IPL_M3|C1|D1
2|CRIC|TEST|CRIC1:,CRIC2:IPL_M1;IPL_M3|C2|D2
Basic requirement is to find and replace the string,
INPUT
COL1|COL2|COL3|COL4|COL5
1|A1|A12|A13|A14|A15
Replace A13 with B13 in column 3 (A13 can change, I mean we have to find any pattern like A13)
OUTPUT
COL1|COL2|COL3|COL4|COL5
1|A1|A12|B13|A14|A15
Thanks in advance.
Reformulating the scenario in simpler terms, with only 2 columns: I need to search for "IPL_" and keep only those strings; any other string, like "ODI_M3;TEST_M5", should be deleted.
{
I/P:
{
COL1|COL2
CRIC1|IPL_M1;IPL_M2;TEST_M1
CRIC2|ODI_M1;IPL_M3
CRIC3|ODI_M3;TEST_M5
CRIC4|IPL_M5;ODI_M5;IPL_M6
}
O/P:
{
COL1|COL2
CRIC1|IPL_M1;IPL_M2
CRIC2|IPL_M3
CRIC3|
CRIC4|IPL_M5;IPL_M6
}
Awaiting your precious suggestions.
Please help I'm new to this platform.
Thanks,
Saquib
}
If I'm reading this correctly (and I'm not entirely sure I am; I'm going mostly by the provided examples), then this could be done relatively sanely with Perl:
#!/usr/bin/perl
while(<>) {
  if($. > 1) {
    my @F = split /\|/;
    $F[3] = join(",", map {
      my @H = split /:/;
      $H[1] = join(";", grep(/IPL_/, split(";", $H[1])));
      join ":", @H;
    } split(/,/, $F[3]));
    $_ = join "|", @F;
  }
  print;
}
Put this code into a file, say foo.pl; then, if your data is in a file data.txt, you can run
perl foo.pl data.txt
This works as follows:
#!/usr/bin/perl
# Read lines from input (in our case: data.txt)
while(<>) {
  # In all except the first line (the header line):
  if($. > 1) {
    # Apply the transformation. To do this, first split the line into fields
    my @F = split /\|/;
    # Then edit the list field, $F[3] (the question's COL3). This has to be
    # read right-to-left at the top level, which is to say: first the field
    # is split along commas, then the tokens are mapped according to the code
    # in the inner block, then they are joined with commas between them again.
    $F[3] = join(",", map {
      # the map block does a similar thing. The inner tokens (e.g.,
      # "CRIC1:IPL_M1;IPL_M2") are split at the colon into the CRIC# part
      # (which is to be unchanged) and the value list we want to edit.
      my @H = split /:/;
      # This value list is again split along semicolons, filtered so that
      # only those elements that match /IPL_/ remain, and then joined with
      # semicolons again.
      $H[1] = join(";", grep(/IPL_/, split(";", $H[1])));
      # The map result is the CRIC# part joined to the edited list with a colon.
      join ":", @H;
    } split(/,/, $F[3]));
    # When all is done, rejoin the outermost fields with pipe characters
    $_ = join "|", @F;
  }
  # and print the result.
  print;
}
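The nested split, filter and join steps walked through above can also be written out in Python, which may make the three levels easier to follow. This is a sketch under assumptions, not the answer's own code: the names keep_matching and transform_line are hypothetical, the pattern is treated as a plain substring, and the list is assumed to sit in the field at index 3, as in the sample data rows.

```python
def keep_matching(field, pattern="IPL_"):
    """Keep only the ';'-separated values containing pattern in each 'KEY:values' group."""
    groups = []
    for group in field.split(","):
        # partition leaves the key and the colon untouched
        key, colon, values = group.partition(":")
        kept = [value for value in values.split(";") if pattern in value]
        groups.append(key + colon + ";".join(kept))
    return ",".join(groups)


def transform_line(line, index=3):
    """Apply keep_matching to one pipe-delimited field and leave the others alone."""
    fields = line.rstrip("\n").split("|")
    if len(fields) > index:
        fields[index] = keep_matching(fields[index])
    return "|".join(fields)


if __name__ == "__main__":
    print(transform_line("1|CRIC|IPL|CRIC1:IPL_M1;IPL_M2;TEST_M1,CRIC2:ODI_M1;IPL_M3|C1|D1"))
```

As in the Perl version, the header line should be passed through unchanged rather than run through transform_line.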

pattern matching and delete all the lines except the last occurrence

I have a txt file with 100+ lines. I want to search for some patterns and, for each pattern, delete all the matching lines except the last occurrence.
Here are the lines from the txt file.
my pattern search is "string1=" , "string2=", "string3=" , "string4=" and "string5="
string1=hi
string2=hello
string3=welcome
string3=welcome1
string3=
string4=hi
string5=hello
I want to go through each line, keep the empty "string3=" line (the last occurrence) and remove "string3=welcome" and "string3=welcome1".
Please help me.
For a single pattern, you can start with something like this:
grep "string3" input | tail -1
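#!/usr/bin/perl
my %h;
while (<STDIN>) {
  # limit the split to 2 pieces so values containing "=" stay intact
  my ($k, $v) = split /=/, $_, 2;
  # later lines overwrite earlier ones, so only the last occurrence survives
  $h{$k} = $v;
}
foreach my $k ( sort keys %h ) {
  # $h{$k} still carries its trailing newline
  print "$k=$h{$k}";
}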
The Perl script above reads your list from stdin and prints only the last occurrence of each key. It assumes you want the keys (string*) in sorted order.
If you want only the lines that start with string1 to string5, put a match at the beginning of the while loop, like so:
next if ! /^string[1-5]=/;
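The keep-the-last-occurrence idea can be sketched in Python as well. This is an illustrative sketch only: the function name keep_last is made up, lines without an "=" are dropped, and (unlike the Perl script, which sorts the keys) each key is emitted at the position of its last occurrence.

```python
def keep_last(lines):
    """Keep only the last 'key=value' line for each key."""
    last = {}
    for line in lines:
        key, equals, value = line.rstrip("\n").partition("=")
        if not equals:
            continue  # not a key=value line; dropped in this sketch
        # Remove any earlier entry so the key moves to its latest position.
        last.pop(key, None)
        last[key] = value
    return [f"{key}={value}" for key, value in last.items()]


if __name__ == "__main__":
    sample = [
        "string1=hi", "string2=hello", "string3=welcome",
        "string3=welcome1", "string3=", "string4=hi", "string5=hello",
    ]
    print("\n".join(keep_last(sample)))
```

Using partition rather than split keeps any further "=" characters inside the value.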

Function to create the array by reading the file

I am writing scripts that store the contents of a pipe-delimited file, with each column in a separate array. I then read the information from the arrays and process it. There are 20 pipe-delimited files, so I need to write 20 scripts, and the processing that happens after the arrays are filled is different in each one. The number of columns also differs from file to file, but it is never more than 9. Every script has to start by loading the file into arrays, and the way I do that at present is shown below. How can I write a function to do this?
cat > example_file.txt <<End-of-message
some text first row|other text first row|some other text first row
some text nth row|other text nth row|some other text nth row
End-of-message
# Note that example_file.txt will already be available; I create it inside the script only to show the format of the file
OIFS=$IFS
IFS='|'
i=0
while read -r first second third ignore
do
first_arr[$i]=$first
second_arr[$i]=$second
third_arr[$i]=$third
(( i=i+1 ))
done < example_file.txt
IFS=$OIFS
Here is a sort-of minimal change to your script that should get you further...
...
...
while read -r first second third ignore
do
    arr0[$i]=$first
    arr1[$i]=$second
    arr2[$i]=$third
    (( i=i+1 ))
done < example_file.txt
IFS=$OIFS
proc0 () {
    for j in "$@"; do
        echo proc0 : "$j"
    done
}
proc1 () {
    echo proc1
}
proc2 () {
    echo proc2
}
for i in 0 1 2; do
    # build the name "arrN[@]" and expand it indirectly
    t=arr$i'[@]'
    proc$i "${!t}"
done
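If moving away from the shell is an option, loading a pipe-delimited file into one list per column, for any number of columns, can be sketched in Python as follows. This is a sketch under assumptions rather than a drop-in replacement for the bash code: the function name read_columns is hypothetical, and short rows are padded with empty strings so that every column has one entry per row.

```python
def read_columns(lines, delimiter="|"):
    """Return one list per column from delimiter-separated lines."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    columns = [[] for _ in range(width)]
    for row in rows:
        for index in range(width):
            # Pad short rows so all columns stay the same length.
            columns[index].append(row[index] if index < len(row) else "")
    return columns


if __name__ == "__main__":
    sample = [
        "some text first row|other text first row|some other text first row",
        "some text nth row|other text nth row|some other text nth row",
    ]
    for number, column in enumerate(read_columns(sample)):
        print(number, column)
```

Each of the 20 scripts could then call the same function and differ only in how it processes the returned columns.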

Delete a line with a pattern

Hi, I want to delete the lines in a file that match a particular pattern.
The code I am using is:
BEGIN {
FS = "!";
stopDate = "date +%Y%m%d%H%M%S";
deletedLineCtr = 0; #diagnostics counter, unused at this time
}
{
if( $7 < stopDate )
{
deletedLineCtr++;
}
else
print $0
}
The code assumes that the fields in each line are separated by "!" and that the 7th field is a date in yyyymmddhhmmss format. The script is meant to delete every line whose date is earlier than the system date, but it doesn't work. Can anyone tell me the reason?
Is the awk(1) assignment due Tuesday? Really, awk?? :-)
OK, I wasn't sure exactly what you were after, so I made some guesses. This awk program gets the current date and time and then removes every line whose 7th field is earlier than that. I left one debug print in.
BEGIN {
FS = "!"
stopDate = strftime("%Y%m%d%H%M%S")
print "now: ", stopDate
}
{ if ($7 >= stopDate) print $0 }
$ cat t2.data
!!!!!!20080914233848
!!!!!!20090914233848
!!!!!!20100914233848
$ awk -f t2.awk < t2.data
now: 20090914234342
!!!!!!20100914233848
$
Call date first and pass the formatted date in as a parameter:
awk -F'!' -v stopdate="$(date +%Y%m%d%H%M%S)" '
$7 < stopdate { deletedLineCtr++; next }
{ print }
END {
    # do something with deletedLineCtr...
}
'
You would probably need to run the date command - maybe with backticks - to get the date into stopDate. If you printed stopDate with the code as written, it would contain "date +...", not a string of digits. That is the root cause of your problem.
Unfortunately...
I cannot find any evidence that backticks work in any version of awk (old awk, new awk, GNU awk). So, you either need to migrate the code to Perl (Perl was originally designed as an 'awk-killer' - and still includes a2p to convert awk scripts to Perl), or you need to reconsider how the date is set.
Seeing @DigitalRoss's answer, the strftime() function in gawk provides you with the formatting you want (check 'info gawk' as I did).
With that fixed, you should be getting the right lines deleted.
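The underlying fix, comparing the 7th field against a real timestamp instead of the literal string "date +...", can be illustrated with a short Python sketch. The function name drop_expired and its parameters are invented for this example, and it assumes the dates are always 14 digits long, so that plain string comparison orders them correctly.

```python
from datetime import datetime


def drop_expired(lines, stop_date=None, field=7, delimiter="!"):
    """Keep only the lines whose date field is not earlier than stop_date."""
    if stop_date is None:
        # Same yyyymmddhhmmss layout as date +%Y%m%d%H%M%S
        stop_date = datetime.now().strftime("%Y%m%d%H%M%S")
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        stamp = fields[field - 1] if len(fields) >= field else ""
        # Fixed-width digit strings compare the same way as the numbers they spell.
        if stamp >= stop_date:
            kept.append(line)
    return kept


if __name__ == "__main__":
    data = [
        "!!!!!!20080914233848",
        "!!!!!!20090914233848",
        "!!!!!!20100914233848",
    ]
    print(drop_expired(data, stop_date="20090914234342"))
```

Passing stop_date explicitly, as in the usage above, makes the result reproducible; leaving it out falls back to the current system time.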
