Here I am again, with another UNIX requirement (as my knowledge in UNIX is limited to basic commands).
I have a file that looks like this (and has about 30 million lines)
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The final output should be like this (without the first value repeating in the joined portions)
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
However, if the above output is too complicated, output like the following is also fine, because I can load the file into Oracle 11g and get rid of the redundant columns.
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
Using awk is sufficient; it is a control-break report of sorts. Since the lines with the same key are grouped together — a very important point — it is fairly simple.
awk -F, '{ if ($1 != saved)
           {
               if (saved != 0) print saved "," list
               saved = $1
               list = ""
               pad = ""
           }
           for (i = 2; i <= NF; i++) { list = list pad $i; pad = "," }
         }
         END { if (saved != 0) print saved "," list }'
You can feed the data as standard input or list the files to be processed after the final single quote.
Sample output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The code uses saved to keep track of the key column value whose lines it is accumulating. When the key column changes, it prints the saved values (if there are any) and resets for the new set of lines. At the end, it prints the saved values (if there are any). As a result, the code deals gracefully with an empty file.
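For comparison, the same control-break idea can be sketched in Python with itertools.groupby, which groups consecutive items that share a key. This is only an illustration, not part of the awk answer: it assumes, as above, that lines with the same key are adjacent, and join_by_key is a name made up for this example.

```python
from itertools import groupby


def join_by_key(lines):
    """Yield one joined line per run of consecutive lines sharing a first field.

    join_by_key is a hypothetical helper; it assumes equal keys are adjacent.
    """
    # Strip newlines and skip lines without a comma (for example blank lines).
    records = (line.rstrip("\n").split(",", 1) for line in lines if "," in line)
    for key, group in groupby(records, key=lambda rec: rec[0]):
        # Emit the key once, followed by the remainder of every line in the run.
        yield ",".join([key] + [rest for _, rest in group])


# Small made-up sample in the same shape as the question's data.
sample = [
    "123456789012,PID=1,AID=2\n",
    "123456789012,PID=2,AID=1\n",
    "234567890123,PID=2,AID=1\n",
]
for joined in join_by_key(sample):
    print(joined)
```

Because groupby only looks at neighbouring items, this keeps one group in memory at a time, like the awk script; for the real file you would pass an open file object instead of the sample list.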
Perl options
#!/usr/bin/env perl
use strict;
use warnings;
my $saved = "";
my $list;
while (<>)
{
chomp;
my($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
if ($key ne $saved)
{
print "$saved$list\n" if $saved;
$saved = $key;
$list = "";
}
$list .= $value;
}
print "$saved$list\n" if $saved;
Or, if you really want to, you can save writing the loop (and using strict and warnings) with:
perl -n -e 'chomp;
($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
if ($key ne $saved)
{
print "$saved$list\n" if $saved;
$saved = $key;
$list = "";
}
$list .= $value;
} END {
print "$saved$list\n" if $saved;'
That could be squished down to a single (rather long) line. The } END { is a piece of Perl weirdness; the -n option creates a loop while (<>) { … } and interpolates the script in the -e argument into it, so the } in } END { terminates that loop and then creates an END block which is ended by the } that Perl provided. Yes, documented and supported; yes, extremely weird (so I wouldn't do it; I'd use the Perl script shown first).
This awk script does what you want:
BEGIN { FS = OFS = "," }
NR == 1 { a[++n] = $1 }
a[1] != $1 { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS); n = 1 }
{ a[1] = $1; for(i=2;i<=NF;++i) a[++n] = $i }
END { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS) }
It stores all of the fields with the same first column in an array. When the first column differs, it prints out all of the elements of the array. Use it like awk -f join.awk file.
Output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
Here are some Python options, if you decide to go that route... First will work for multiple input files and non-sequential identical indices. Second doesn't read the whole file into memory.
(Note, I know it is not convention, but I intentionally use UpperCase for variables to make it clear what is a user-defined variable and what is a special python word.)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
    catfile.py *.txt > output.dat
"""
import sys
if len(sys.argv)<2:
    sys.stderr.write(__doc__)
else:
    FileList = sys.argv[1:]
    IndexList = []
    OutDict = {}
    for FileName in FileList:
        with open(FileName,'r') as FStream:
            for Line in FStream:
                # skip blank or malformed lines that have no comma
                if "," in Line:
                    Ind,TheRest = Line.rstrip().split(",",1)
                    # a dictionary lookup is much faster than searching the list
                    if Ind not in OutDict:
                        IndexList.append(Ind)
                    OutDict[Ind] = OutDict.get(Ind,"") + "," + TheRest
    for Ind in IndexList:
        print(Ind + OutDict[Ind])
Here is a different version which doesn't load the whole file into memory, but requires that the identical Indices all occur in order, and it only runs on one file:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
    catfile.py input.txt > output.dat
"""
import sys
if len(sys.argv)<2:
    sys.stderr.write(__doc__)
else:
    FileName = sys.argv[1]
    OutString = ''
    PrevInd = ''
    FirstLine = True
    with open(FileName,'r') as FStream:
        for Line in FStream:
            if "," in Line:
                Ind,TheRest = Line.rstrip().split(",",1)
                if Ind != PrevInd:
                    # key changed: flush the finished group
                    if not FirstLine:
                        print(PrevInd + OutString)
                    PrevInd = Ind
                    OutString = "," + TheRest
                    FirstLine = False
                else:
                    OutString += "," + TheRest
    # flush the last group (nothing to print for an empty file)
    if not FirstLine:
        print(PrevInd + OutString)
More generally, you can run these by saving them as, say, catfile.py and then doing python catfile.py inputfile.txt > outputfile.txt. Or, for a longer-term solution, make a scripts directory, add it to your $PATH, and make the scripts executable with chmod u+x catfile.py; then you can just type the name of the script from any directory. But that is another topic that you would want to research.
A way without array:
BEGIN { FS = OFS = "," ; ORS = "" }
{
if (lid == $1) { $1 = "" ; print $0 }
else { print sep $0 ; lid = $1 ; sep = "\n" }
}
END { if (NR) print "\n" }
Note: if you don't need a newline at the end, remove the END block.
This might work for you (GNU sed):
sort file | sed -r ':a;$!N;s/^(([^,]*),.*)\n\2/\1/;ta;P;D'
Sort the file (if need be) and then delete newline and key where duplicates appear.
I have a text file temp1 and say it has more than 20 columns and it has numerical values like as follows,
1,0,3,0,5........,
1,0,5,0,8........,
3,0,6,0,3........,
5,0,6,0,4........,
.................,
I want to remove the columns whose total (sum) is zero and redirect the remaining columns to a new file.
For example, in the data above the 2nd and 4th columns have a total of zero, so I need to remove the 2nd and 4th columns and redirect the rest to a separate file.
Can anyone help me, please?
$ cat file
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,0,4
$ awk -f tst.awk file
1,3,5
1,5,8
3,6,3
5,6,4
$ cat tst.awk
BEGIN{ FS="," }
{
for (j=1;j<=NF;j++) {
val[NR,j] = $j
sum[j] += val[NR,j]
}
}
END {
for (i=1;i<=NR;i++) {
ofs = ""
for (j=1;j<=NF;j++) {
if (sum[j]) {
printf "%s%s",ofs,val[i,j]
ofs = FS
}
}
print ""
}
}
You can use awk (the following is ugly but, I hope, readable; that's the goal. I'll let better awk users improve or shorten it further):
If the data is in file /path/to/zefile:
awk -F',' '
FNR==NR { for (col=1;col<=NF;col++)
{ if ($col != 0)
{wewantthiscolumn[col]=1 }
}
next
}
{ for (col=1;col<=NF;col++)
{ if (wewantthiscolumn[col]==1)
{ printf ("%s,",$col) }
}
print ""
}' /path/to/zefile /path/to/zefile | sed -e 's/,$//'
The idea: we launch awk on /path/to/zefile /path/to/zefile (hence, it is read twice).
On the first pass, we build the "wewantthiscolumn" array: an entry is set to 1 as soon as that column contains something other than 0. The condition FNR==NR (FNR = record number within the current file, NR = total number of records read so far) is true only during the first pass, and the "next" stops the second block from running on those lines.
On the second pass (NR>FNR now, so we go straight to the 2nd { }), we only print the column values $col for which wewantthiscolumn[col]==1, each followed by a "," (so there is a little problem: the last column will have a "," after it).
Then we pass this through sed to get rid of the trailing ",".
I am not sure whether there is a better way: can awk delete a field, so that it could delete field col on the 2nd pass? Then it would be much easier to print the resulting $0, setting OFS=',' to keep the fields separated with "," ...
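The two-pass idea can also be sketched in Python. This is an illustration under stated assumptions, not a translation of the awk above: it keeps the rows in memory instead of reading the file twice, it assumes every row has the same number of numeric fields, and it tests the column sum (as the question words it) rather than "any non-zero value", which differs when values cancel out (for example 1 and -1). The name drop_zero_sum_columns is made up for this example.

```python
def drop_zero_sum_columns(rows):
    """Return the rows without the columns whose values sum to zero.

    rows is a list of lists of numeric strings, all of the same width.
    drop_zero_sum_columns is a hypothetical helper name.
    """
    if not rows:
        return []
    width = len(rows[0])
    # Pass 1: total each column.
    sums = [0.0] * width
    for row in rows:
        for col in range(width):
            sums[col] += float(row[col])
    keep = [col for col in range(width) if sums[col] != 0]
    # Pass 2: rebuild every row from the surviving columns only.
    return [[row[col] for col in keep] for row in rows]


# The sample data from the question.
lines = ["1,0,3,0,5", "1,0,5,0,8", "3,0,6,0,3", "5,0,6,0,4"]
rows = [line.split(",") for line in lines]
for row in drop_zero_sum_columns(rows):
    # join() puts commas only between fields, so no trailing "," to strip.
    print(",".join(row))
```

Building each output line with join is what removes the need for the sed clean-up step.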
This would make the 2nd pass:
awk -F',' -v OFS=',' '
FNR==NR { for (col=1;col<=NF;col++)
            { if ($col != 0)
                {wewantthiscolumn[col]=1 }
            }
          next
        }
        { for (col=1;col<=NF;col++)
            { if (wewantthiscolumn[col]==0)
                $col="DELETETHIS"
            }
          # gsub edits $0 in place (gensub only returns the new string)
          gsub(/,DELETETHIS/,"")
          gsub(/DELETETHIS,?/,"")
          print $0
        }' /path/to/zefile /path/to/zefile
I didn't want to assume no columns could be empty, hence I use "DELETETHIS" to make sure I only delete relevant fields... But this means the 1st way is in fact simpler ^^ : only print the fields you need, and then get rid of the "," at the end of line.
Here's one way using awk. Run like:
awk -f ./script.awk file{,}
Contents of script.awk:
BEGIN {
FS=","
}
FNR==NR {
for(i=1;i<=NF;i++) {
if ($i != 0) {
a[i]
}
}
next
}
{
for(j=1;j<=NF;j++) {
if (j in a) {
printf "%s%s", $j, (j==NF ? RS : FS)
}
}
}
Alternatively, here's the one-liner:
awk -F, 'FNR==NR { for(i=1;i<=NF;i++) if ($i != 0) a[i]; next } { for(j=1;j<=NF;j++) if (j in a) printf "%s%s", $j, (j==NF ? RS : FS) }' file{,}
Contents of file:
1,0,3,0,5,0
1,0,5,0,8,1
3,0,6,0,3,2
5,0,6,0,4,5
Results:
1,3,5,0
1,5,8,1
3,6,3,2
5,6,4,5
A solution using python:
#!/usr/bin/env python
def transpose(grid):
    return zip(*grid)
def removeBlankRows(grid):
    # keep only the rows that contain at least one non-zero value
    return [list(row) for row in grid if any(map(int,row))]
grid = []
with open("input.csv") as fd:
    for line in fd:
        grid.append(line.strip().split(','))
# transposing turns columns into rows, so the same filter handles both
data = removeBlankRows(transpose(removeBlankRows(transpose(grid))))
for i in data:
    print(",".join(i))
input:
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,0,4
output:
1,3,5
1,5,8
3,6,3
5,6,4
input:
1,0,3,0,5
1,0,5,0,8
3,0,6,0,3
5,0,6,1,4
output:
1,3,0,5
1,5,0,8
3,6,0,3
5,6,1,4
I have one input file which is given below.
Values,series,setupresultcode,nameofresultcode,resultcode
2,9184200,,serviceSetupResultempty,2001
11,9184200,0,successfulReleasedByService,2001
194,9184200,1,successfulDisconnectedByCallingParty,2001
101,9184200,2,successfulDisconnectByCalledParty,2001
2,9184201,0,successfulReleasedByService,2001
78,9184201,1,successfulDisconnectedByCallingParty,2001
32,9184201,2,successfulDisconnectByCalledParty,2001
4,9184202,0,successfulReleasedByService,2001
63,9184202,1,successfulDisconnectedByCallingParty,2001
37,9184202,2,successfulDisconnectByCalledParty,2001
I want output as given below:
Series,successfulReleasedByService,successfulDisconnectedByCallingParty,successfulDisconnectByCalledParty,serviceSetupResultempty
9184200,11,194,101,2
9184201,2,78,32,
9184202,4,63,37,
Keep the series as the common key: print each series in the first column, followed by the values for each result code, identified either by the third column (integer code) or the fourth column (name) of the input file.
For example, the second column of the data contains any number of series; take 9184200. That series has 4 setupresultcode values (empty, 0, 1, 2). The name of each result code is given in the 4th column. If the result code is 0, i.e. successfulReleasedByService, I want to print the value 11 for series 9184200.
Something like this might work although I haven't tested it, regard it as some kind of pseudo code.
#!/bin/awk -f
BEGIN {
    FS = ","
    number_of_series = 0
}
# This part is executed for every line except the header
FNR > 1 && ($3 == "0" || $3 == "1" || $3 == "2") {
    if ($2 in seriesarray) {
        # The series has already been added: concatenate the results
        seriesarray[$2] = seriesarray[$2] "," $1
    } else {
        # It's a new series: remember the order in which it appeared
        order[++number_of_series] = $2
        seriesarray[$2] = $1
    }
}
END {
    # Iterate over the series in input order and print the series id
    # followed by the concatenated results
    for (i = 1; i <= number_of_series; i++)
        print order[i] "," seriesarray[order[i]]
}
This would yield something like
9184200,11,194,101
9184201,2,78,32
9184202,4,63,37
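If the header row and the serviceSetupResultempty column are also wanted, the pivot could be sketched in Python along these lines. This is a hedged illustration rather than a tested solution for the real data: it assumes the input has the header line shown in the question, that the four result-code names in COLUMNS are the only ones of interest, and that each series has at most one line per name. The names COLUMNS and pivot_series are made up for this example.

```python
import csv
import io

# Output column order, taken from the desired header in the question.
COLUMNS = [
    "successfulReleasedByService",
    "successfulDisconnectedByCallingParty",
    "successfulDisconnectByCalledParty",
    "serviceSetupResultempty",
]


def pivot_series(stream):
    """Return a header row followed by one row per series, in first-seen order.

    pivot_series is a hypothetical helper; stream is any iterable of CSV lines
    whose first line is the header used in the question.
    """
    table = {}  # series -> {nameofresultcode: Values}
    for rec in csv.DictReader(stream):
        table.setdefault(rec["series"], {})[rec["nameofresultcode"]] = rec["Values"]
    rows = [["Series"] + COLUMNS]
    for series, counts in table.items():
        # A result code missing for this series becomes an empty field.
        rows.append([series] + [counts.get(name, "") for name in COLUMNS])
    return rows


# A shortened copy of the question's input.
sample = io.StringIO(
    "Values,series,setupresultcode,nameofresultcode,resultcode\n"
    "2,9184200,,serviceSetupResultempty,2001\n"
    "11,9184200,0,successfulReleasedByService,2001\n"
    "4,9184202,0,successfulReleasedByService,2001\n"
)
for row in pivot_series(sample):
    print(",".join(row))
```

Keying on the name column rather than the numeric code lets the empty setupresultcode be handled like any other column. The first-seen ordering relies on dictionaries preserving insertion order, which Python guarantees from version 3.7.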