Combine lines with matching first field - unix

For a few years now, I have often needed to combine lines of (sorted) text with a matching first field, and I have never found an elegant (i.e. one-liner unix command line) way to do it. What I want is similar to what's possible with the unix join command, but join expects 2 files, with each key appearing at most once. I want to start with a single file, in which a key might appear multiple times.
I have both a ruby and perl script that do this, but there's no way to shorten my algorithm into a one-liner. After years of unix use, I'm still learning new tricks with comm, paste, uniq, etc, and I suspect there's a smart way to do this.
There are some related questions, like join all lines that have the same first column to the same line; Command line to match lines with matching first field (sed, awk, etc.); and Combine lines with matching keys -- but those solutions never really give a clean and reliable solution.
Here's sample input:
apple:A fruit
apple:Type of: pie
banana:tropical fruit
cherry:small burgundy fruit
cherry:1 for me to eat
cherry:bright red
Here's sample output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Here's my ideal syntax:
merge --inputDelimiter=":" --outputDelimiter=";" --matchfield=1 infile.txt
The "matchfield" is really optional. It could always be the first field. Subsequent appearances of the delimiter should be treated like plain text.
I don't mind a perl, ruby, awk one-liner, if you can think of a short and elegant algorithm. This should be able to handle millions of lines of input. Any ideas?

Using awk one liner
awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ sub($1":",";") } 1' file
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
This sets ORS="" (by default it is \n).
We set ORS="" (Output Record Separator) because we don't want awk to add a newline at the end of each record. We want to handle that ourselves: a newline is inserted at the start of every record whose first field differs from the previous one. As a consequence, the output starts with an empty line and does not end with a newline.
a!=$1 : When variable a (initially empty) doesn't match the first field $1, which is e.g. apple in the first line, set a=$1 and $0=RS $0, i.e. the whole record becomes "\n"$0 (a newline is added at the beginning of the record). Assigning to $0 re-splits the fields, so $1 now starts with a newline and the a==$1 rule does not fire on this same line. a!=$1 is true whenever the first field ($1) differs from the previous line's $1, and is thus the criterion for separating records by first field.
a==$1: If it matches, the current line belongs to the same group as the previous line. In this case, substitute the first occurrence of $1: (note the :), e.g. apple:, with ;. $1":" could also be written as $1 FS, where FS is :. Note that sub() treats $1 as a regular expression, so keys containing regex metacharacters would need escaping.
If you have millions of lines in your file, this approach should be fast, because it doesn't involve any pre-processing and doesn't use any other data structure, such as an array, for storing your keys or records.
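For what it's worth, here is a minimal sketch of the same streaming idea wrapped in a shell function. The name merge_lines is made up for illustration, and it assumes input that is already grouped by key, with : as the delimiter. It takes the rest of the line with substr() instead of sub(), so the key is never interpreted as a regex, and it ends the output with a newline instead of starting it with one.

```shell
# Hypothetical helper: merge consecutive lines that share the first ":"-field.
merge_lines() {
    awk -F: '
        # Same key as the previous line: append the value after the key.
        # The ("" concatenation) forces a string comparison of the keys.
        NR > 1 && ($1 "") == (prev "") {
            printf ";%s", substr($0, length($1) + 2)
            next
        }
        # New key: finish the previous output line first.
        NR > 1 { printf "\n" }
        { printf "%s", $0; prev = $1 }
        END { if (NR) printf "\n" }
    ' "$@"
}

printf 'apple:A fruit\napple:Type of: pie\nbanana:tropical fruit\n' | merge_lines
```

Like the one-liner above, this keeps only one line in memory at a time.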

for F in `cut -f1 -d ':' infile.txt | sort | uniq`; do echo "$F:$(grep "^$F:" infile.txt | cut -f2- -d ':' | paste -s -d ';' - )"; done
Not sure it qualifies as 'elegant', but it works, though I'm sure not quickly for millions of lines - as the number of grep calls increases it would slow significantly. What % of the matching fields do you expect to be unique?

Discover awk language:
awk -F':' '{ v=substr($0, index($0,":")+1); a[$1]=($1 in a? a[$1]";" : "")v }
END{ for(i in a) print i,a[i] }' OFS=':' infile.txt
The output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
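Note that for (i in a) visits the keys in an unspecified order, so the output above is not guaranteed to come out sorted. A sketch that remembers the order in which keys were first seen, assuming : as the delimiter (merge_unsorted, merged and order are names invented here):

```shell
# Hypothetical helper: merge lines by first field, input need not be sorted.
merge_unsorted() {
    awk -F: '
        {
            key = $1
            val = substr($0, length(key) + 2)   # everything after "key:"
            if (!(key in merged)) {
                order[++n] = key                # remember first-seen order
                merged[key] = val
            } else {
                merged[key] = merged[key] ";" val
            }
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i] ":" merged[order[i]]
        }
    ' "$@"
}

printf 'cherry:a\napple:b\ncherry:c\n' | merge_unsorted
```

The trade-off is the same as for the array version: every merged line is held in memory until the end of the input.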

I think this one does the job (a holds the current key, b the merged line built so far):
awk -F':' '$1!=a{if(NR>1)print b; a=$1; b=$0; next}{b=b";"substr($0,length($1)+2)}END{if(NR)print b}' infile

@RahulVerma's awk solution is simple and awesome! However, I've been looking for a sed solution, and I wondered whether sed is capable of solving this sort of problem. Finally, I've found my own solution! The core idea is to treat the hold space as a variable: put what we want to search for in the hold space, and search for it in the current input line. I want to give credit to @chaos, because the core idea is from his answer how-to-search-for-the-word-stored-in-the-hold-space-with-sed#239049.
Using sed one liner
sed -E '1{:OK; h; $b; d}; x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; s/\n.*//' file
or
sed -E 'x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; 1d; s/\n.*//; b; :OK; h; $!d' file
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Explanation: (the 2nd sed)
x exchange the hold space and the pattern space
The hold space would become the current input line.
The pattern space would become the content generated by s/// in the last cycle (see below), or the last input line if s/// fails to replace anything. It is empty for the very beginning, i.e., when the current input line is the line 1.
G append the hold space (the current input line) to the pattern space (the content generated in the last cycle). Thus, the pattern space would be two lines (The hold space is not changed). For example,
If the current input line is line 1, the pattern space would be
# the empty hold space
apple:A fruit # the current input line
If it is line 2, the pattern space would be
apple:A fruit # the last input line, because `s///` fails to replace anything in the last cycle
apple:Type of: pie # the current input line
If it is line 3, the pattern space would be
apple:A fruit;Type of: pie # the resulting content generated by `s///` in the last cycle
banana:tropical fruit # the current input line
s/// uses the first field ^([^:]+) (: is the field separator) of the first line of the pattern space to search the second line. The \n\1: part shows how it works. If the field is found at the start of the second line, concatenate the first line \1\2, a ;, and the remaining fields of the second line \3. Thus, the result is \1\2;\3.
tOK if the s/// is successful (s has replaced something) jump to the label OK, otherwise continue
In case of success
:OK specify the location of the label OK
h put the pattern space which is \1\2;\3, generated by s///, to the hold space. (The first line of the pattern space in the next cycle is made up of this very content)
$!d delete it without printing, if it hasn't reached the last line
In case of failure, (s/// hasn't changed anything, the pattern space still has two lines)
1d delete the pattern space, don't print it, if the current input line is the line 1. At this moment, there is nothing ready to output. Without this, an empty line would be printed.
s/\n.*// delete the second line of the pattern space, which is the current input line.
b jump without a label, means end the sed script, and the remaining content in the pattern space would be printed out before starting the next cycle.
This is the very place where we finish combining the lines starting with the "current" same field and print out the combined line.
The hold space is still holding the current input line for this cycle, and is going to be "the last input line" for the next cycle (see above). What it holds will trigger a brand new combining of the lines starting with a new same field.
In the second sed command line, the 1d; is not strictly necessary. Without it, however, the output differs subtly: an innocuous empty line is printed at the top, like below:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Thoughts at the end
The time I spent figuring out the sed solution, and the final look of it, both indicate that awk is better and easier for problems as complicated as this. sed is not good at searching/replacing in a file according to the file itself. Searching for and replacing constant text is easy, but here what to search for comes from the file itself, which means it might vary from line to line. For this kind of task, sed is not as good as awk, which has full programming facilities such as variables, functions, if-else and for/while/do. (See further in what is the difference between sed and awk)

As an exercise - or even for smallish datasets - the same in just native bash interpreter commands:
declare -A fruit=()
while IFS=: read -r k v; do fruit[$k]="${fruit[$k]:+${fruit[$k]};}$v"; done < test.in
for k in "${!fruit[@]}"; do echo "$k:${fruit[$k]}"; done
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Any required sorting order might need to be explicitly enforced in more complex datasets.

Related

GVim Command / Script to delete lines from a set of 4

This post could count as a duplicate, but I have not found any relevant answer in previous threads. I have a large (6 GB) text file and I wish to remove every 3rd and 4th line in each set of 4 lines. For example, the following
line1
line2
line3
line4
line5
line6
line7
line8
needs to be converted to this
line1
line2
line5
line6
Is there any vim script / command to remove those lines? It could also be done in multiple passes: 1 pass to delete the 3rd lines (in a set of 4 (line1,line2,line3,line4)) and another pass to delete the 3rd lines again (previously the 4th ones, now in a set of 3 (line1,line2,line3)).
The command :g/^/+1 d3 is close to what I want, but it also removes the second lines.
If you have GNU sed, you can filter the buffer through this pipeline:
sed -e '0~4d' | sed '0~3d'
The first sed deletes every 4th line, the second deletes every 3rd line.
This has the desired effect.
To pipe the current buffer through this command, enter this in command mode:
%!sed -e '0~4d' | sed '0~3d'
The % selects the range of lines to pass to a command (% means all lines, the entire buffer), and !cmd is the command to pipe through.
To perform this outside of vim, run these two steps:
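sed -i -e '0~4d' file
sed -i -e '0~3d' file
This will modify the file in place, in two steps. (Written as -ie, GNU sed would read the e as a backup suffix for -i and leave a copy named filee behind.)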
Alternatively you can also use Awk.
awk 'NR%4==3||NR%4==0{next;}1' file.txt > output.txt
To do this via Vim:
%!awk 'NR\%4==3||NR\%4==0{next;}1'
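The same modulo test can be parameterized. Here is a small sketch (keep_leading is a name invented for this example) that keeps the first few lines of every group of n lines, reading from standard input:

```shell
# Hypothetical helper: keep the first $2 lines out of every group of $1 lines.
keep_leading() {
    # (NR - 1) % n is the 0-based position of the line inside its group.
    awk -v n="$1" -v keep="$2" '(NR - 1) % n < keep'
}

printf '%s\n' line1 line2 line3 line4 line5 line6 line7 line8 | keep_leading 4 2
```

With 4 and 2 this keeps line1, line2, line5 and line6, which is the output asked for in the question.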
UPDATE: This is a bad approach for large files; it needs ~3 sec for a 6MB file to perform the substitution.
This approach works in vim. Using a regular expression, you find each group of 4 lines and substitute it with the first two lines of the group. Works for a long file as well. The last 1–3 lines are left untouched if the total number of lines is not a multiple of 4.
:%s#\(^.*\n^.*\)\n^.*\n^.*\n#\1\r#g
Explanation:
:%s - substitute in the whole file, with # used as the delimiter
\(^.*\n^.*\) - \(\) captures two lines that are used later as \1; \n stands for a linebreak; ^ for the beginning of a line; .* for any character repeated as many times as possible before the linebreak
\n - linebreak after the second line
^.*\n^.*\n - the next two lines, to be deleted
\1\r - replace the four lines with the first two and add a linebreak \r
g - replace every match on a line instead of only the first one (the % range is what covers the whole file)

Double spacing a file using awk

I was going through awk and found the two commands below for double spacing a file.
Can someone please explain how these commands actually work?
awk '1;{print ""}' filename
awk 'BEGIN{ORS="\n\n"};1' filename
Thanks
Your first example uses two common awk shortcuts: 1 is just a pattern that is always true, so the default action, which is "print the line", is executed for every line. Then, there is a rule with an empty pattern (which is also always true, but you can't omit both pattern and action), which in its action just prints an empty line.
Your second example changes the Output Record Separator, which is usually just a single end-of-line, to be two, so that just copying every line will be enough. (BEGIN rules are executed before the input file is read.)
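To see that the two forms behave the same, here is a small sketch with each wrapped in a shell function (the function names are made up for this example):

```shell
# Two rules: "1" prints the line, the second rule prints an empty line.
double_space_print() { awk '1; { print "" }' "$@"; }

# One rule: every printed line is followed by two newlines instead of one.
double_space_ors() { awk 'BEGIN { ORS = "\n\n" } 1' "$@"; }

printf 'a\nb\n' | double_space_print
printf 'a\nb\n' | double_space_ors
```

Both calls print a, an empty line, b and another empty line.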

Remove every x lines from text input

I'm looking to grep some log files with a few surrounding lines, but then discard the junk lines from the matches. To make matters worse, the stupid code outputs the same exception twice so I want to junk every other grep match. I don't know that there's a good way to skip every other grep match when also including surrounding lines, so I'm good to do it all in one.
So let's say we have the following results from grep:
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
Obviously the grep match is "MatchExceptionText" (simplified, of course). So I'd like to pipe this to something where I can remove lines 2,5,6,7,8 and then repeat that pattern, so the results look like this:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
The repeating is where things get tricky for me. I know sed can just remove certain line numbers but I don't know how to group them into groups of 8 lines and repeat that cut in all the groups.
Any ideas? Thanks for your help.
awk can do modular arithmetic, so printing conditionally on the number of lines read mod 8 should allow you to repeat the pattern.
awk 'NR%8 ~ /[134]/' file
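The regex test is handy here, but it matches a digit anywhere in the remainder, so it would misfire for cycles of 10 or more lines (a remainder of 11 also matches /[134]/), and the last line of a cycle has remainder 0. A more general sketch, with the cycle length and the wanted 1-based positions passed in (pick_in_cycle is a hypothetical name, and it reads standard input):

```shell
# Hypothetical helper: from every cycle of $1 lines, keep the positions in $2.
pick_in_cycle() {
    awk -v n="$1" -v want="$2" '
        BEGIN {
            # Turn the space-separated list into a lookup table.
            count = split(want, w, " ")
            for (i = 1; i <= count; i++) keep[w[i]] = 1
        }
        { pos = (NR - 1) % n + 1 }   # 1-based position inside the cycle
        pos in keep                  # print the line if its position is wanted
    '
}

grep_output() { printf '%s\n' ctx1 junk match1 -- a b match2 --; }
grep_output | pick_in_cycle 8 "1 3 4"
```

For the question's data the call would be pick_in_cycle 8 "1 3 4" on the grep output.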
Sed can do it:
sed -n 'N;s/\n.*//;N;N;p;N;N;N;N' filename
EDIT:
Come to think of it, this is a little better:
sed -n 'p;n;n;N;p;n;n;n;n' filename
With GNU awk you can split the input at appropriate record separators and print the wanted output, eg.:
awk 'NR%2 {print $1, $3}' RS='--\n' ORS='\n--\n' OFS='\n' infile
Output:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
This might work for you (GNU sed):
sed -n 'p;n;n;p;n;p;n;n;n;n' file
sed -n "s/^.*\n//;x;s/^/²/
/^²\{1\}$/ b print
/^²\{3\}$/ b print
/^²\{4\}$/ b print
/^²\{7\}$/ b print
/^²\{8\}$/ b print
b cycle
: print
x;
# your treatment begin
p
# your treatment stop
x
: cycle
/^²\{8\}$/ s/.*//
x
" YourFile
This is mainly a reference for a kind of "case of" on the relative line number: just change the number in /^²\{YourLineNumber\}$/ to select another relative line position.
Don't forget the last line number, which resets the cycle.
The first part takes the line and prepares the relative line counter.
The second part is the "case of".
The third part is the treatment (here a print).
The last part resets the cycle counter if needed.

sed '$!N;$!D' explanation

I know that
cat foo | sed '$!N;$!D'
will print out the last two lines of the file foo, but I don't understand why.
I have read the man page and know that N joins the next line to the currently processed line etc - but could someone explain in 'good english' that matches the order of operation what is happening here, step by step?
thanks!
Here is what that script looks like when run through the sedsed debugger (by Aurelio Jargas):
$ echo -e 'a\nb\nc\nd' | sed '$!N;$!D'
PATT:^a$
COMM:$ !N
PATT:^a\Nb$
COMM:$ !D
PATT:^b$
COMM:$ !N
PATT:^b\Nc$
COMM:$ !D
PATT:^c$
COMM:$ !N
PATT:^c\Nd$
COMM:$ !D
PATT:^c\Nd$
c
d
I've filtered out the lines that would show hold space ("HOLD") since it's not being used. "PATT" shows what's in pattern space and "COMM" shows the command about to be executed. "\N" indicates an embedded newline. Of course, "^" and "$" indicate the beginning and end of the string.
!N appends the next line and !D deletes the previous line and loops to the beginning of the script without doing an implicit print. When the last line is read, the $! tests fail so there's nothing left to do and the script exits and performs an implicit print of what remains in the pattern space (since -n was not specified in the arguments).
Disclaimer: I am a contributor to the sedsed project and have made a few minor improvements including expanded color support, adding the ^ line-beginning indicator and preliminary support for Python 3. The bleeding edge version (which hasn't been touched lately) is here.
$!N;$!D is a sed program consisting of two statements, $!N and $!D.
$!N matches everything but the last line of the last file of input ($ negated by !) and runs the N command on it, which as you said yourself appends the next line of input to the line currently under scrutiny (the "pattern space"). In other words, sed now has two lines in the pattern space and has advanced to the next line.
$!D also matches everything but the last line, and wipes the pattern space up to the first newline. D also prevents sed from wiping the entire pattern space when reading the next line.
So, the algorithm being executed is roughly:
For every line up to but not including the last {
Read the next line and append it to the pattern space
If still not at the last line
Delete the first line in the pattern space
}
Print the pattern space
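The sliding window described above can be spelled out with explicit variables. A rough sketch for comparison, with the sed script next to an awk version that keeps the previous and the current line (last_two_sed and last_two_awk are names invented here):

```shell
# The sed program from the question, wrapped in a function.
last_two_sed() { sed '$!N;$!D' "$@"; }

# The same window by hand: prev and cur always hold the last two lines read.
last_two_awk() {
    awk '
        { prev = cur; cur = $0 }
        END {
            if (NR > 1) print prev
            if (NR > 0) print cur
        }
    ' "$@"
}

printf 'a\nb\nc\nd\n' | last_two_sed
printf 'a\nb\nc\nd\n' | last_two_awk
```

Both print c and d for this input.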

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that AWK will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
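Written out the long way, the one-liner amounts to something like the following sketch (dedupe_keep_first is just an illustrative name):

```shell
# Long form of awk '!seen[$0]++': print a line only the first time it appears.
dedupe_keep_first() {
    awk '
        {
            if (seen[$0] == 0)   # not counted yet, so this is the first occurrence
                print
            seen[$0]++           # count this occurrence
        }
    ' "$@"
}

printf 'b\na\nb\nc\na\n' | dedupe_keep_first
```

This prints b, a and c, keeping the original order. Every distinct line is stored in the array, so memory use grows with the number of unique lines.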
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line containing no characters. On my Mac my CPU just spins.
This is an infinite loop if the last line is blank and doesn't contain any characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the pattern space consists of two identical strings separated by \n, the next line is the same as the current line, so according to our core idea we must not print it. Otherwise, the current line is the last appearance in its run of duplicate consecutive lines, and we use the P command to print the pattern space up to the first \n (a newline is printed as well).
D: we use the D command to delete the pattern space up to and including the first \n, so that the pattern space then holds the next line.
The D command also forces sed to jump back to its first command, $!N, without reading the next line from the file or standard input stream.
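For comparison, the same one-line-per-run behavior can be sketched in awk by remembering the previous line. This version prints the first line of each run rather than the last, which gives identical output because the lines in a run are identical (uniq_adjacent is a made-up name):

```shell
# Hypothetical helper: drop a line when it equals the line right before it.
uniq_adjacent() {
    awk '
        # The ("" concatenation) forces a string comparison, so that
        # numeric-looking lines such as 1 and 1.0 are not treated as equal.
        NR == 1 || ($0 "") != (prev "") { print }
        { prev = $0 }
    ' "$@"
}

printf '1\n2\n2\n3\n3\n3\n1\n' | uniq_adjacent
```

This prints 1, 2, 3 and 1: the trailing 1 survives because only adjacent duplicates are dropped.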
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and the t command to implement the loop.
Explanation:
Read a new line from the input stream or file and print it once.
Use the :loop command to set a label named loop.
Use N to read the next line into the pattern space.
Use s/^(.*)\n\1$/\1/ to delete the current line if the next line is the same as the current line. The s command performs the delete.
If the s command succeeded, the tloop command makes sed jump to the label named loop, which repeats the same steps for the following lines until no more duplicates follow the line that was printed last. Otherwise, the D command deletes the line that equals the last-printed line and makes sed jump to the first command, which is the p command. The pattern space then contains the next new line.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparison, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work, at the character set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
sed -E '
$!{
N;
s/[ \t]+$//;
/^(.*)\n\1$/!P;
D;
}
';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
sed -n -E '
s/[ \t]+$//;
G;
/^(\n){2,}/d;
/^([^\n]+).*\n\1(\n|$)/d;
h;
P;
';
}
lastrepeat() {
sed -n -E '
s/[ \t]+$//;
/^$/{
H;
d;
};
G;
# delete previous repeated line if found
s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
# after searching for previous repeat, move tested last line to end
s/^([^\n]+)(\n)(.*)/\3\2\1/;
$!{
h;
d;
};
# squeeze blank lines to one
s/(\n){3,}/\n\n/g;
s/^\n//;
p;
';
}
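This can be achieved using AWK together with uniq.
The line below displays the input with adjacent duplicate lines collapsed (uniq only compares neighboring lines, so duplicates that are further apart survive):
awk '{ print }' file_name | uniq
You can write the result to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will then contain no adjacent duplicates.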
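Use:
cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
Note that this sorts the input and prints only the lines that occur exactly once, so every copy of a duplicated line is dropped. Because it prints $2, only the first whitespace-separated word of each such line is shown.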
