Join lines based on a starting value using UNIX commands

Here I am again with another UNIX requirement (my knowledge of UNIX is limited to basic commands).
I have a file that looks like this (it has about 30 million lines):
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The final output should be like this (without the first value repeating in the joined portions):
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
However, if the above output is too complicated to produce, an output like the one below is also fine, because I can load the file into Oracle 11g and get rid of the redundant columns.
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

Using awk is sufficient; this is a control-break report of sorts. Since lines with the same key are grouped together (a very important point), it is fairly simple:
awk -F, '{ if ($1 != saved)
           {
               if (saved != "") print saved "," list
               saved = $1
               list = ""
           }
           # separate this line from material already in list with a comma
           pad = (list == "") ? "" : ","
           for (i = 2; i <= NF; i++) { list = list pad $i; pad = "," }
         }
         END { if (saved != "") print saved "," list }'
You can feed the data as standard input or list the files to be processed after the final single quote.
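For example, assuming the program body is saved in a file join_lines.awk (the file names here are illustrative):
awk -F, -f join_lines.awk data.txt > joined.txt
awk -F, -f join_lines.awk < data.txt > joined.txt    # or via standard input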
Sample output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The code uses saved to keep track of the key column value it is accumulating. When the key column changes, it prints the saved values (if there are any) and resets for the new set of lines. At the end, it prints the saved values (if there are any), which means it also deals with an empty file gracefully.
Perl options
#!/usr/bin/env perl
use strict;
use warnings;

my $saved = "";
my $list  = "";

while (<>)
{
    chomp;
    my($key, $value) = ($_ =~ m/^([^,]+)(,.*)/);
    if ($key ne $saved)
    {
        print "$saved$list\n" if $saved;
        $saved = $key;
        $list  = "";
    }
    $list .= $value;
}
print "$saved$list\n" if $saved;
Or, if you really want to, you can avoid writing the loop (and using strict and warnings) with:
perl -n -e 'chomp;
    ($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
    if ($key ne $saved)
    {
        print "$saved$list\n" if $saved;
        $saved = $key;
        $list = "";
    }
    $list .= $value;
} END {
    print "$saved$list\n" if $saved;'
That could be squished down to a single (rather long) line. The } END { is a piece of Perl weirdness; the -n option creates a loop while (<>) { … } and interpolates the script in the -e argument into it, so the } in } END { terminates that loop and then creates an END block which is ended by the } that Perl provided. Yes, documented and supported; yes, extremely weird (so I wouldn't do it; I'd use the Perl script shown first).
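If you want to see the loop that -n wraps around the -e code, the B::Deparse module that ships with Perl will show it:
perl -MO=Deparse -n -e 'print'
That prints something like the implicit LINE: while (defined($_ = <ARGV>)) { ... } wrapper, which makes the } END { trick easier to follow.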

This awk script does what you want:
BEGIN { FS = OFS = "," }
NR == 1 { a[++n] = $1 }
a[1] != $1 { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS); n = 1 }
{ a[1] = $1; for(i=2;i<=NF;++i) a[++n] = $i }
END { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS) }
It stores all of the fields with the same first column in an array. When the first column differs, it prints out all of the elements of the array. Use it like awk -f join.awk file.
Output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0

Here are some Python options, if you decide to go that route. The first works for multiple input files and for identical indices that do not occur consecutively; the second doesn't read the whole file into memory.
(Note: I know it is not the convention, but I intentionally use UpperCase for variables, to make clear what is a user-defined name and what is a special Python word.)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
catfile.py *.txt > output.dat
"""
import sys

if len(sys.argv) < 2:
    sys.stderr.write(__doc__)
else:
    FileList = sys.argv[1:]
    IndexList = []
    OutDict = {}
    for FileName in FileList:
        with open(FileName, 'rU') as FStream:
            for Line in FStream:
                if "," in Line:   # skip blank or malformed lines
                    Ind, TheRest = Line.rstrip().split(",", 1)
                    if Ind not in IndexList:
                        IndexList.append(Ind)
                    OutDict[Ind] = OutDict.get(Ind, "") + "," + TheRest
    for Ind in IndexList:
        print Ind + OutDict[Ind]
Here is a different version which doesn't load the whole file into memory, but requires that the identical Indices all occur in order, and it only runs on one file:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
catfile.py input.txt > output.dat
"""
import sys

if len(sys.argv) < 2:
    sys.stderr.write(__doc__)
else:
    FileName = sys.argv[1]
    OutString = ''
    PrevInd = ''
    FirstLine = True
    with open(FileName, 'rU') as FStream:
        for Line in FStream:
            if "," in Line:
                Ind, TheRest = Line.rstrip().split(",", 1)
                if Ind != PrevInd:
                    if not FirstLine:
                        print PrevInd + OutString
                    PrevInd = Ind
                    OutString = "," + TheRest   # keep the comma after the key
                    FirstLine = False
                else:
                    OutString += "," + TheRest
    if not FirstLine:   # guard against an empty input file
        print PrevInd + OutString
More generally, you can run these by saving them as, say, catfile.py and then doing python catfile.py inputfile.txt > outputfile.txt. Or, for a longer-term solution, make a scripts directory, add it to your $PATH, and make the scripts executable with chmod u+x catfile.py; then you can just type the name of the script from any directory. But that is another topic you may want to research.
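A minimal sketch of that setup (the directory name and profile file are assumptions; adjust to taste):
mkdir -p "$HOME/scripts"
cp catfile.py "$HOME/scripts/"
chmod u+x "$HOME/scripts/catfile.py"
export PATH="$PATH:$HOME/scripts"    # add this line to ~/.profile to make it permanent
catfile.py inputfile.txt > outputfile.txt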

A way without an array:
BEGIN { FS = OFS = "," ; ORS = "" }
{
    if (lid == $1) { $1 = "" ; print $0 }
    else { print sep $0 ; lid = $1 ; sep = "\n" }
}
END { if (NR) printf "\n" }
Note: if you don't need a newline at the end, remove the END block.
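The trick is that assigning $1 = "" makes awk rebuild $0 with OFS (a comma) between the fields, so printing $0 appends ,rest-of-line to the line already written. To run it, assuming the program is saved as noarray.awk (an illustrative name):
awk -f noarray.awk file > joined.txt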

This might work for you (GNU sed):
sort file | sed -r ':a;$!N;s/^(([^,]*),.*)\n\2/\1/;ta;P;D'
Sort the file (if need be), then repeatedly delete the newline and the duplicated key wherever two adjacent lines share the same first field.
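Since the sample data already has its keys grouped, the sort can probably be dropped:
sed -r ':a;$!N;s/^(([^,]*),.*)\n\2/\1/;ta;P;D' file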

Related

How to resolve this error? Use of uninitialized value $val1 in printf at sham.pl line 112

This is my script; please suggest how to avoid the error. The script worked fine with other input files.
if (($line1 =~ /^[##]/) || ($line2 =~ /^[##]/)) {
    # do nothing, it's a comment - probably not necessary after above cleanup
} else {
    my @data1 = split(" ", $line1);
    my @data2 = split(" ", $line2);
    my $time = $data1[0];
    my $val1 = $data1[$d1];
    my $val2 = $data2[$d2];
    printf OUT "%8.3f\t%f\t%f\n", $time, $val1, $val2;
}
}
close(OUT);
exit;
I have not tried anything yet, since I am new to Perl and know little about it.

Trying to force an entry in an array to be an array

I am trying to create an associative array of associative arrays in gawk, and what I initially tried was:
options[key][subkey] = 1
However, when it got to this line, I unceremoniously received the error fatal: attempt to use scalar 'option["Declaration"]' as an array ("Declaration" being one of the main keys that my program uses, although I presume the exact value is irrelevant. At this particular point in the program, there was no "Declaration" entry assigned, although there were entries which had "Declaration" as a subkey on other entries, which may be meaningful).
So with a bit of googling, I found another stackoverflow question that looked like it answered my issue, so I put the following code immediately above it:
if (typeof(options[key]) != "array") {
    options[key] = 0;
    delete options[key];
    split("", options[key]);
}
However, this does not work either, instead now giving me the error: fatal: split: second argument is not an array
What am I doing wrong?
EDIT: Note that I cannot use a basic two-dimensional array here; for what I am doing, it is important to map one associative array to another, because I need to be able to identify later which subkeys were used on a given key.
Pursuant to requests below, I am posting the relevant functions that use the associative array, which may help clarify what is going on.
function add_concrete(key, concrete) {
    if (key == concrete) {
        return;
    }
    if (length(options[key]) > 0) {
        for (i in options[key]) {
            add_concrete(i, concrete);
        }
    }
    contains[key][concrete] = 1
}
function add_options(name, value) {
    subkey = trim(name);
    if (subkey == "") {
        return;
    }
    if (match(value, ";") > 0) {
        exporting = 0;
    }
    split(value, args, /[ |;]*/);
    for (i in args) {
        key = trim(args[i]);
        if (key != "") {
            print("Adding " name " to " key);
            options[key][subkey] = 1
            if (concrete[key]) {
                add_concrete(subkey, key);
            }
        }
    }
}
Sorry, cooking at the same time. Since you didn't post much, there isn't much to work with, but here it is with no "initialization":
$ awk 'BEGIN {
    options[key] = 0;
    delete options[key];
    # options[key][1] # can't see me
    split("",options[key]);
}'
awk: cmd. line:5: fatal: split: second argument is not an array
But with "initialization":
$ awk 'BEGIN {
    options[key] = 0;
    delete options[key];
    options[key][1] # can see me
    split("",options[key]);
}'
$_ # see this cursor happily blinking without any error
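Distilled down, then, a sketch of the workaround (gawk-specific, since POSIX awk has no arrays of arrays; key stands for whatever subscript you are using):
awk 'BEGIN {
    options[key][1]          # referencing with a subscript types options[key] as an array
    split("", options[key])  # now accepted; this also empties the dummy element again
}'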

How to print number of particular columns in shell script?

I have a text file temp1; say it has more than 20 columns of numerical values, as follows:
1,0,3,0,5........,
1,0,5,0,8........,
3,0,6,0,3........,
5,0,6,0,4........,
.................,
I want to remove the columns whose total (sum) is zero and redirect the remaining columns to a new file.
For example, as above, the 2nd and 4th columns have a total of zero, so I need to remove them and redirect the rest to a separate file.
Can anyone help me, please?
$ cat file
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,0,4
$ awk -f tst.awk file
1,3,5
1,5,8
3,6,3
5,6,4
$ cat tst.awk
BEGIN { FS="," }
{
    for (j=1;j<=NF;j++) {
        val[NR,j] = $j
        sum[j] += val[NR,j]
    }
}
END {
    for (i=1;i<=NR;i++) {
        ofs = ""
        for (j=1;j<=NF;j++) {
            if (sum[j]) {
                printf "%s%s", ofs, val[i,j]
                ofs = FS
            }
        }
        print ""
    }
}
You can use awk. (The following is ugly but, I hope, readable; that's the goal. I'll let better awkists enhance or reduce it further.)
If the data is in file /path/to/zefile:
awk -F',' '
FNR==NR { for (col=1;col<=NF;col++)
{ if ($col != 0)
{wewantthiscolumn[col]=1 }
}
next
}
{ for (col=1;col<=NF;col++)
{ if (wewantthiscolumn[col]==1)
{ printf ("%s,",$col) }
}
print ""
}' /path/to/zefile /path/to/zefile | sed -e 's/,$//'
The idea: we launch awk on /path/to/zefile /path/to/zefile (hence it is read twice).
On the first pass, we create a "wewantthiscolumn" array. This array contains 1 as soon as a column has something different from 0. The "next" ensures we only do this bit while FNR (the record number within the current file) equals NR (the total number of records read so far), which is true only during the first pass.
On the second pass (we go directly to the second { }, as now NR > FNR), we display only the column values $col that have a corresponding wewantthiscolumn[col]==1, each followed by a "," (so there is a little problem: the last column will have a "," after it).
Then we pass that through sed to get rid of the trailing ",".
I am not sure there isn't a better way: can awk delete a field, so that it could drop column col on the second pass? Then it would be much easier to print the resulting $0, setting OFS="," to keep the fields comma-separated.
This would make the 2nd pass:
awk -F',' -v OFS=',' '
FNR==NR { for (col=1;col<=NF;col++)
          { if ($col != 0)
            { wewantthiscolumn[col]=1 }
          }
          next
        }
{ for (col=1;col<=NF;col++)
  { if (wewantthiscolumn[col]==0)
      $col="DELETETHIS"
  }
  # assigning to $col made awk rebuild $0 with OFS, hence -v OFS="," above;
  # gensub() is gawk-specific and returns the result, so assign it back to $0
  $0 = gensub(/,DELETETHIS/, "", "g")
  $0 = gensub(/^DELETETHIS,/, "", "g")
  print $0
}' /path/to/zefile /path/to/zefile
I didn't want to assume that no column could be empty, hence the "DELETETHIS" marker to make sure only the relevant fields are deleted. But this means the first way is in fact simpler: only print the fields you need, then get rid of the "," at the end of each line.
Here's one way using awk. Run like this (file{,} is shell brace expansion that expands to file file, so the same file is read twice, once per pass):
awk -f ./script.awk file{,}
Contents of script.awk:
BEGIN {
    FS=","
}
FNR==NR {
    for(i=1;i<=NF;i++) {
        if ($i != 0) {
            a[i]    # merely referencing a[i] records column i as wanted
        }
    }
    next
}
{
    for(j=1;j<=NF;j++) {
        if (j in a) {
            printf "%s%s", $j, (j==NF ? RS : FS)
        }
    }
}
Alternatively, here's the one-liner:
awk -F, 'FNR==NR { for(i=1;i<=NF;i++) if ($i != 0) a[i]; next } { for(j=1;j<=NF;j++) if (j in a) printf "%s%s", $j, (j==NF ? RS : FS) }' file{,}
Contents of file:
1,0,3,0,5,0
1,0,5,0,8,1
3,0,6,0,3,2
5,0,6,0,4,5
Results:
1,3,5,0
1,5,8,1
3,6,3,2
5,6,4,5
A solution using Python:
#!/usr/bin/env python

def transpose(grid):
    return zip(*grid)

def removeBlankRows(grid):
    # keep only rows that have at least one non-zero entry
    return [list(row) for row in grid if any(map(int, row))]

grid = []
with open("input.csv") as fd:
    for line in fd:
        grid.append(line.strip().split(','))

# transpose so columns become rows, drop the all-zero ones, transpose back
# (the outer call also drops any all-zero rows)
data = removeBlankRows(transpose(removeBlankRows(transpose(grid))))

for i in data:
    print ",".join(i)
input:
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,0,4
output:
1,3,5
1,5,8
3,6,3
5,6,4
input:
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,1,4
output:
1,3,0,5
1,5,0,8
3,6,0,3
5,6,1,4
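To run it, assuming the script is saved as filter_zero_columns.py (an illustrative name; note that the input path input.csv is hard-coded in the script):
python filter_zero_columns.py > output.csv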

Check if the current time falls within defined time range on UNIX

Consider the pseudo-code below:
#!/bin/ksh
rangeStartTime_hr=13
rangeStartTime_min=56
rangeEndTime_hr=15
rangeEndTime_min=05
getCurrentMinute() {
    return `date +%M | sed -e 's/0*//'`;
    # Used sed to remove the padded 0 on the left. On successfully find&replacing
    # the first match it returns the resultant string.
    # date command does not provide minutes in long integer format, on Solaris.
}
getCurrentHour() {
    return `date +%l`; # %l hour ( 1..12)
}
checkIfWithinRange() {
    if [[ getCurrentHour -ge $rangeStartTime_hr &&
          getCurrentMinute -ge $rangeStartTime_min ]]; then
        # Ahead of start time.
        if [[ getCurrentHour -le $rangeEndTime_hr &&
              getCurrentMinute -le $rangeEndTime_min ]]; then
            # Within the time range.
            return 0;
        else
            return 1;
        fi
    else
        return 1;
    fi
}
Is there a better way of implementing checkIfWithinRange()? Are there any built-in UNIX facilities that make this easier? I am new to Korn shell scripting and would appreciate your input.
The return command is used to return an exit status, not an arbitrary string. This is unlike many other languages. You use stdout to pass data:
getCurrentMinute() {
    date +%M | sed -e 's/^0//'
    # make sure sed only removes a zero at the beginning of the line;
    # in the case of "00", don't be too greedy: only remove one 0
}
Also, you need more syntax to invoke the function; as written, you are comparing the literal string "getCurrentMinute" in the if condition:
if [[ $(getCurrentMinute) -ge $rangeStartTime_min && ...
I would do it a bit differently:
start=13:56
end=15:05
checkIfWithinRange() {
    current=$(date +%H:%M)   # gets the current time in the format 05:18
    # [[ ]] has no <= operator, so ( = || < ) emulates it; plain string
    # comparison is safe here because HH:MM is zero-padded and fixed-width
    [[ ($start = $current || $start < $current) && ($current = $end || $current < $end) ]]
}
if checkIfWithinRange; then
    do something
fi
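An alternative is to compare minutes since midnight numerically. A sketch for ksh93/bash (the 10# prefix forces base 10 so that a leading zero, as in "09", is not parsed as octal):
now=$(date +%H:%M)
mins=$(( 10#${now%:*} * 60 + 10#${now#*:} ))
if (( mins >= 13*60 + 56 && mins <= 15*60 + 5 )); then
    echo "within range"
fi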

Log4perl category as log file name

I'm sure I'm being dim and missing the obvious but is there a simple way of using the current category as the filename in a config file without resorting to a subroutine call?
So that in the following one could use ${category}.log instead of repeating bin.nh.tpp in the filename line:
log4perl.logger.bin.nh.tpp=INFO, BIN_NH_TPP_LOGFILE
log4perl.appender.BIN_NH_TPP_LOGFILE=Log::Log4perl::Appender::File
log4perl.appender.BIN_NH_TPP_LOGFILE.filename=${LOGS}/nh/bin.nh.tpp.log
log4perl.appender.BIN_NH_TPP_LOGFILE.mode=append
log4perl.appender.BIN_NH_TPP_LOGFILE.layout=PatternLayout
log4perl.appender.BIN_NH_TPP_LOGFILE.layout.ConversionPattern=[%d] %F{1} %L %c - %m%n
It's somewhat more involved than a subroutine, I'm afraid. Subroutines in l4p conf files allow for including variables known at conf-file parsing time, e.g. the time/date or a user id. You can't modify behavior at log time that way.
The easiest way I can think of right now to accomplish what you want is a custom appender like this:
package FileByCategoryAppender;
use warnings;
use strict;
use base qw( Log::Log4perl::Appender::File );
sub new {
    my( $class, %options ) = @_;
    $options{filename} = "no-category.log";
    my $self = $class->SUPER::new( %options );
    bless $self, $class;
}
sub log {
    my( $self, %params ) = @_;
    my $category = $params{ log4p_category };
    $self->SUPER::file_switch( $category . ".log" );
    $self->SUPER::log( %params );
}
1;
and then use it in your script like this:
use strict;
use warnings;
use Log::Log4perl qw( get_logger );
my $conf = q(
log4perl.category = WARN, Logfile
log4perl.appender.Logfile = FileByCategoryAppender
log4perl.appender.Logfile.create_at_logtime = 1
log4perl.appender.Logfile.layout = \
Log::Log4perl::Layout::PatternLayout
log4perl.appender.Logfile.layout.ConversionPattern = %d %F{1} %L> %m %n
);
Log::Log4perl::init(\$conf);
my $logger = get_logger("Bar::Twix");
$logger->error("twix error");
$logger = get_logger("Bar::Mars");
$logger->error("mars error");
which will result in two log files being created at log time:
# Bar.Mars.log
2012/11/18 11:12:12 t 21> mars error
and
# Bar.Twix.log
2012/11/18 11:12:12 t 21> twix error
