Here I am again, with another UNIX requirement (as my knowledge in UNIX is limited to basic commands).
I have a file that looks like this (and has about 30 million lines)
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The final output should be like this (without the first value repeating in the joined portions)
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
However, if the above output is a bit complicated, an output like below is also fine. Because I can load the file into Oracle11g and get rid of the redundant columns.
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,123456789012,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,234567890123,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,345678901234,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,567890123456,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
Using awk is sufficient; it is a control-break report of sorts. Since the lines with the same key are grouped together — a very important point — it is fairly simple.
awk -F, '{ if ($1 != saved)
           {
               if (saved != "") print saved list
               saved = $1
               list = ""
           }
           # Prefix every field with a comma so values from successive lines stay separated
           for (i = 2; i <= NF; i++) list = list "," $i
         }
         END { if (saved != "") print saved list }'
You can feed the data as standard input or list the files to be processed after the final single quote.
Sample output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
The code uses saved to keep track of the key column value that it is accumulating. When the key column changes, it prints out the saved values (if there are any) and resets for the new set of lines. At the end, it prints out the saved values (if there are any). As a result, the code deals with an empty file gracefully.
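For comparison, the same control-break idea can be sketched in Python with itertools.groupby, which likewise relies on lines with the same key being adjacent. The function name join_by_key and the shortened sample lines are made up for illustration, and the sketch assumes every useful line contains at least one comma; it is not tuned for a 30-million-line file.

```python
from itertools import groupby


def join_by_key(lines):
    """Yield one joined line per run of adjacent lines sharing the same first field."""
    # Split each line into [key, rest] at the first comma; skip lines without one.
    pairs = (line.rstrip("\n").split(",", 1) for line in lines if "," in line)
    # groupby only merges adjacent items, which is exactly the control-break rule.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key + "," + ",".join(rest for _, rest in group)


if __name__ == "__main__":
    sample = [
        "123456789012,PID=1,AID=2\n",
        "123456789012,PID=2,AID=1\n",
        "234567890123,PID=2,AID=1\n",
    ]
    for joined in join_by_key(sample):
        print(joined)
```

Because the function takes any iterable of lines, an open file object can be passed in directly and the input is still read one line at a time.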
Perl options
#!/usr/bin/env perl
use strict;
use warnings;
my $saved = "";
my $list;
while (<>)
{
chomp;
my($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
if ($key ne $saved)
{
print "$saved$list\n" if $saved;
$saved = $key;
$list = "";
}
$list .= $value;
}
print "$saved$list\n" if $saved;
Or, if you really want to, you can save writing the loop (and skip using strict and warnings) with:
perl -n -e 'chomp;
($key,$value) = ($_ =~ m/^([^,]+)(,.*)/);
if ($key ne $saved)
{
print "$saved$list\n" if $saved;
$saved = $key;
$list = "";
}
$list .= $value;
} END {
print "$saved$list\n" if $saved;'
That could be squished down to a single (rather long) line. The } END { is a piece of Perl weirdness; the -n option creates a loop while (<>) { … } and interpolates the script in the -e argument into it, so the } in } END { terminates that loop and then creates an END block which is ended by the } that Perl provided. Yes, documented and supported; yes, extremely weird (so I wouldn't do it; I'd use the Perl script shown first).
This awk script does what you want:
BEGIN { FS = OFS = "," }
NR == 1 { a[++n] = $1 }
a[1] != $1 { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS); n = 1 }
{ a[1] = $1; for(i=2;i<=NF;++i) a[++n] = $i }
END { for(i=1; i<=n; ++i) printf "%s%s", a[i], (i<n?OFS:ORS) }
It stores all of the fields with the same first column in an array. When the first column differs, it prints out all of the elements of the array. Use it like awk -f join.awk file.
Output:
123456789012,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
234567890123,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
345678901234,PID=1,AID=2,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
456789012345,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
567890123456,PID=2,AID=1,EQOSID=1,PDPTY=IPV4,PDPCH=2-0,PID=3,AID=8,EQOSID=1,PDPTY=IPV4,PDPCH=2-0
Here are some Python options, if you decide to go that route... The first works with multiple input files and with identical indices that are not adjacent. The second doesn't read the whole file into memory.
(Note, I know it is not convention, but I intentionally use UpperCase for variables to make it clear what is a user-defined variable and what is a Python keyword or built-in.)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
catfile.py *.txt > output.dat
"""
import sys

if len(sys.argv) < 2:
    sys.stderr.write(__doc__)
else:
    FileList = sys.argv[1:]
    IndexList = []  # keys in order of first appearance
    OutDict = {}    # key -> ",rest,rest,..." accumulated so far
    for FileName in FileList:
        with open(FileName, 'r') as FStream:
            for Line in FStream:
                if "," in Line:  # skip blank or malformed lines
                    Ind, TheRest = Line.rstrip().split(",", 1)
                    if Ind not in OutDict:  # dict lookup instead of scanning IndexList
                        IndexList.append(Ind)
                    OutDict[Ind] = OutDict.get(Ind, "") + "," + TheRest
    for Ind in IndexList:
        print(Ind + OutDict[Ind])
Here is a different version which doesn't load the whole file into memory, but requires that the identical Indices all occur in order, and it only runs on one file:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
concatenate comma-separated values based on first value
Usage:
catfile.py input.txt > output.dat
"""
import sys

if len(sys.argv) < 2:
    sys.stderr.write(__doc__)
else:
    FileName = sys.argv[1]
    OutString = ''
    PrevInd = ''
    FirstLine = True
    with open(FileName, 'r') as FStream:
        for Line in FStream:
            if "," in Line:
                Ind, TheRest = Line.rstrip().split(",", 1)
                if Ind != PrevInd:
                    # Key changed: flush the previous group, then start a new one
                    if not FirstLine:
                        print(PrevInd + OutString)
                    PrevInd = Ind
                    OutString = "," + TheRest
                    FirstLine = False
                else:
                    OutString += "," + TheRest
    # Flush the last group (skipped if the file had no usable lines)
    if not FirstLine:
        print(PrevInd + OutString)
More generally, you can run these by saving them as, say, catfile.py and then running python catfile.py inputfile.txt > outputfile.txt. For a longer-term solution, make a scripts directory, add it to your $PATH, and make the scripts executable with chmod u+x catfile.py; then you can just type the name of the script from any directory. But that is another topic that you would want to research.
A way without array:
BEGIN { FS = OFS = "," ; ORS = "" }
{
if (lid == $1) { $1 = "" ; print $0 }
else { print sep $0 ; lid = $1 ; sep = "\n" }
}
END { if (NR) print "\n" }
Note: if you don't need a newline at the end, remove the END block.
This might work for you (GNU sed):
sort file | sed -r ':a;$!N;s/^(([^,]*),.*)\n\2/\1/;ta;P;D'
Sort the file (if need be) and then delete newline and key where duplicates appear.
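To show what that substitution does, here is a rough Python sketch of the same regular expression applied to text that is already sorted. The function name merge_sorted_text is made up for illustration, and the (?=,) lookahead is an addition that is not in the sed command: it keeps a key that is only a prefix of the next key from being merged. The sketch holds the whole text in memory, so it is meant as an illustration rather than something to run on 30 million lines.

```python
import re

# When a line is followed by a line starting with the same key, drop the
# newline and the repeated key, leaving ",rest" appended to the first line.
DUP_KEY = re.compile(r"^(([^,\n]*),[^\n]*)\n\2(?=,)", re.MULTILINE)


def merge_sorted_text(text):
    """Repeat the substitution until no adjacent duplicate keys remain."""
    while True:
        text, count = DUP_KEY.subn(r"\1", text)
        if count == 0:
            return text


if __name__ == "__main__":
    print(merge_sorted_text("a,1\na,2\na,3\nb,1"))
```

Each pass merges at most one following line into a given line, which is why the function loops; the sed version gets the same effect with its ta branch.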
We know F4F is Adobe's fragmented MP4 file format for HTTP Dynamic Streaming. A tool called F4F Packager can convert an F4V file to several F4F files and a manifest file (F4M).
My question is, how to convert such F4F files back to an F4V or MP4 file?
We finally found a simple method to merge and convert .f4f files into an .flv file; only the 'mdat' box of each fragment is useful. Here is the PHP code:
<?php
function ReadInt24($str, $pos)
{
return intval(bin2hex(substr($str, $pos, 3)), 16);
}
function ReadInt32($str, $pos)
{
return unpack("N", substr($str, $pos, 4))[1];
}
echo "\nKSV Adobe HDS Downloader\n\n";
$flvHeader = hex2bin("464c5601050000000900000000");
$firstVideoPacket = true;
$prevTagSize = 4;
$fragCount = 0;
isset($argv[1]) ? $baseFilename = $argv[1] : $baseFilename = "";
$baseFilename ? $outputFile = "$baseFilename.flv" : $outputFile = "Joined.flv";
while (true)
{
if (file_exists($baseFilename . ($fragCount + 1) . ".f4f"))
$fragCount++;
else
break;
}
echo "Found $fragCount fragments\n";
$flv = fopen("$outputFile", "wb");
fwrite($flv, $flvHeader, 13);
for ($i = 1; $i <= $fragCount; $i++)
{
$frag = file_get_contents("$baseFilename$i.f4f");
// The s modifier lets . match a 0x0A byte inside the 4-byte box size
preg_match('/(.{4})mdat[\x08\x09\x12]/is', $frag, $mdat, PREG_OFFSET_CAPTURE);
$fragLen = ReadInt32($mdat[1][0], 0) - 8;
$frag = substr($frag, $mdat[1][1] + 8, $fragLen);
$pos = 0;
while ($pos < $fragLen)
{
$packetType = $frag[$pos];
$packetSize = ReadInt24($frag, $pos + 1);
$packetTS = ReadInt24($frag, $pos + 4);
$totalTagLen = 11 + $packetSize + $prevTagSize;
if (($packetType == "\x08" && $packetSize > 4) or ($packetType == "\x09" && $packetSize > 40) or ($packetType == "\x09" && $firstVideoPacket))
{
if ($packetType == "\x09" && $firstVideoPacket)
$firstVideoPacket = false;
fwrite($flv, substr($frag, $pos, $totalTagLen), $totalTagLen);
}
$pos += $totalTagLen;
}
}
fclose($flv);
echo "Finished\n";
?>
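To make the tag arithmetic in the loop above easier to follow, here is a small Python sketch that walks the FLV tags inside an already-extracted mdat payload. It assumes the standard FLV tag layout (1-byte type, 3-byte big-endian data size, 3-byte timestamp plus 1 extension byte, 3-byte stream ID, then the data and a 4-byte previous-tag-size field). The names iter_flv_tags and make_tag and the synthetic test bytes are made up for illustration, and the sketch does not reproduce the packet filtering done by the PHP script.

```python
import struct

TAG_HEADER_LEN = 11    # type (1) + data size (3) + timestamp (3 + 1) + stream ID (3)
PREV_TAG_SIZE_LEN = 4  # "previous tag size" field that follows each tag


def read_uint24(buf, pos):
    """Read a 3-byte big-endian unsigned integer."""
    return int.from_bytes(buf[pos:pos + 3], "big")


def iter_flv_tags(payload):
    """Yield (tag_type, timestamp_ms, raw_tag_bytes) for each complete tag."""
    pos = 0
    while pos + TAG_HEADER_LEN <= len(payload):
        tag_type = payload[pos]
        data_size = read_uint24(payload, pos + 1)
        # The lower 24 bits come first; the extension byte holds the upper 8 bits.
        timestamp = read_uint24(payload, pos + 4) | (payload[pos + 7] << 24)
        total = TAG_HEADER_LEN + data_size + PREV_TAG_SIZE_LEN
        if pos + total > len(payload):
            break  # truncated tag: stop rather than emit a partial one
        yield tag_type, timestamp, payload[pos:pos + total]
        pos += total


def make_tag(tag_type, timestamp, data):
    """Build a synthetic tag (hypothetical helper, used only for the demo)."""
    header = (bytes([tag_type])
              + len(data).to_bytes(3, "big")
              + (timestamp & 0xFFFFFF).to_bytes(3, "big")
              + bytes([(timestamp >> 24) & 0xFF])
              + b"\x00\x00\x00")
    return header + data + struct.pack(">I", len(header) + len(data))


if __name__ == "__main__":
    demo = make_tag(8, 0, b"audio") + make_tag(9, 40, b"video")
    for tag_type, timestamp, raw in iter_flv_tags(demo):
        print(tag_type, timestamp, len(raw))
```

The total length computed per tag corresponds to 11 + $packetSize + $prevTagSize in the PHP loop, where 8 marks an audio tag and 9 a video tag.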
A more comprehensive answer is available here: https://github.com/K-S-V/Scripts/blob/master/AdobeHDS.php.
The serious stuff happens around line 1046.
This script handles more cases than the current top answer. I won't post the whole script here since it's a bit long.
Alas, it's a PHP script too, though I may need to rewrite this in Java in a couple of weeks. If so, I'll post a link to the Java rewrite when it's done.
livestreamer and youtube-dl both support HDS streams. Here's an example of livestreamer:
$ livestreamer -O 'hds://radio_chym-lh.akamaihd.net/z/KIT967_1#183249/manifest.f4m' 48k >out.m4a
This is an internet radio station. For video, only a change in 48k and the file extension of out.m4a should be necessary.