I know that
cat foo | sed '$!N;$!D'
will print out the last two lines of the file foo, but I don't understand why.
I have read the man page and know that N joins the next line to the currently processed line etc - but could someone explain in 'good english' that matches the order of operation what is happening here, step by step?
thanks!
Here is what that script looks like when run through the sedsed debugger (by Aurelio Jargas):
$ echo -e 'a\nb\nc\nd' | sedsed -d '$!N;$!D'
PATT:^a$
COMM:$ !N
PATT:^a\Nb$
COMM:$ !D
PATT:^b$
COMM:$ !N
PATT:^b\Nc$
COMM:$ !D
PATT:^c$
COMM:$ !N
PATT:^c\Nd$
COMM:$ !D
PATT:^c\Nd$
c
d
I've filtered out the lines that would show hold space ("HOLD") since it's not being used. "PATT" shows what's in pattern space and "COMM" shows the command about to be executed. "\N" indicates an embedded newline. "^" and "$" mark the beginning and end of the pattern space.
$!N appends the next input line to the pattern space. $!D deletes the first line of the pattern space (everything up to and including the first newline) and restarts the script without an implicit print and without reading new input. When the last line has been read, both $! tests fail, so there is nothing left to do: the script ends and sed implicitly prints what remains in the pattern space (since -n was not specified in the arguments).
Disclaimer: I am a contributor to the sedsed project and have made a few minor improvements including expanded color support, adding the ^ line-beginning indicator and preliminary support for Python 3. The bleeding edge version (which hasn't been touched lately) is here.
$!N;$!D is a sed program consisting of two statements, $!N and $!D.
$!N matches everything but the last line of the last file of input ($ negated by !) and runs the N command on it, which as you said yourself appends the next line of input to the line currently under scrutiny (the "pattern space"). In other words, sed now has two lines in the pattern space and has advanced to the next line.
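$!D also matches everything but the last line, and deletes the pattern space up to and including the first newline. If anything is left afterwards, D restarts the script with that remainder instead of printing it and reading a fresh line, which is how the second line survives into the next pass.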
So, the algorithm being executed is roughly:
For every line up to but not including the last {
Read the next line and append it to the pattern space
If still not at the last line
Delete the first line in the pattern space
}
Print the pattern space
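To watch the loop above at work, here is a small sketch. The function names last_two and last_two_awk are made up for illustration, and the awk version is only my own spelling-out of the same sliding window, not part of the original one-liner.

```shell
# Hypothetical helper: keep only the last two lines of stdin.
last_two() {
    # $!N : unless on the last line, append the next line to the pattern space
    # $!D : unless on the last line, drop the first line and restart the script
    sed '$!N;$!D'
}

# The same two-line sliding window written out in awk, for comparison.
last_two_awk() {
    awk '{ prev = cur; cur = $0 }
         END { if (NR > 1) print prev; if (NR > 0) print cur }'
}

printf 'a\nb\nc\nd\n' | last_two
printf 'a\nb\nc\nd\n' | last_two_awk
```

Both calls should print c and d. With a one-line input neither $! test ever fires, so that single line is printed as is.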
This post could count as a duplicate, but I have not found any relevant answer in previous threads. I have a large (6 GB) text file and I wish to remove every 3rd and 4th line in each set of 4 lines. For example, the following
line1
line2
line3
line4
line5
line6
line7
line8
needs to be converted to this
line1
line2
line5
line6
Is there any Vim script or command to remove those lines? It could also be done in multiple passes: one pass to delete the 3rd line in each set of 4 (line1, line2, line3, line4), and another pass to delete the 3rd line again (previously the 4th) in each remaining set of 3 (line1, line2, line3).
The command :g/^/+1 d3 is close to what I want, but it also removes the second lines.
If you have GNU sed, you can filter the buffer through this pipeline:
sed -e '0~4d' | sed '0~3d'
The first sed deletes every 4th line, the second deletes every 3rd line.
This has the desired effect.
To pipe the current buffer through this command, enter this in command mode:
%!sed -e '0~4d' | sed '0~3d'
The % selects the range of lines to pass to a command (% means all lines, the entire buffer), and !cmd is the command to pipe through.
To perform this outside of vim, run these two steps:
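sed -i -e '0~4d' file
sed -i -e '0~3d' file
This modifies the file in place, in two steps. (Written as -ie, GNU sed would read the e as a backup suffix and leave a copy named filee.)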
Alternatively you can also use Awk.
awk 'NR%4==3||NR%4==0{next;}1' file.txt > output.txt
To do this via Vim:
%!awk 'NR\%4==3||NR\%4==0{next;}1'
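A quick way to convince yourself the awk filter does the right thing is to run it on numbered sample lines. This is just a sketch: the function name keep_first_two_of_four is invented, and the condition is written as "keep" rather than "skip", which I believe is equivalent to the command above.

```shell
# Hypothetical helper: from every group of four lines, keep the first two.
keep_first_two_of_four() {
    # NR % 4 is 1 or 2 for the lines to keep, 3 or 0 for the lines to drop.
    awk 'NR % 4 == 1 || NR % 4 == 2'
}

# Sample input shaped like the question's example (line1 .. line8).
for i in 1 2 3 4 5 6 7 8; do echo "line$i"; done | keep_first_two_of_four
```

This should print line1, line2, line5 and line6. Since awk reads the input as a stream, the size of the file should not matter much.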
UPDATE: This is a bad approach for large files; the substitution takes about 3 seconds on a 6 MB file.
This approach works in Vim. A regular expression matches four lines at a time and replaces them with the first two. It also works on a long file, only slowly. If the total number of lines is not a multiple of 4, the last 1–3 lines are left untouched.
:%s#\(^.*\n^.*\)\n^.*\n^.*\n#\1\r#g
Explanation:
:%s — substitute in the whole file, # used as a delimiter
\(^.*\n^.*\) — \(\) captures two lines that will be used later as \1; \n stands for a line break; ^ for the beginning of a line; .* for any character repeated as many times as possible before the line break
\n — linebreak after the second line
^.*\n^.*\n — next two lines to be deleted
\1\r — replace the four lines with the first two and add a line break \r
g — replace every match, not just the first one on a line
For a few years, I have often needed to combine lines of (sorted) text with a matching first field, and I never found an elegant (i.e. one-liner Unix command line) way to do it. What I want is similar to what's possible with the Unix join command, but join expects 2 files, with each key appearing at most once. I want to start with a single file, in which a key might appear multiple times.
I have both a ruby and perl script that do this, but there's no way to shorten my algorithm into a one-liner. After years of unix use, I'm still learning new tricks with comm, paste, uniq, etc, and I suspect there's a smart way to do this.
There are some related questions, like join all lines that have the same first column to the same line; Command line to match lines with matching first field (sed, awk, etc.); and Combine lines with matching keys -- but those solutions never really give a clean and reliable solution.
Here's sample input:
apple:A fruit
apple:Type of: pie
banana:tropical fruit
cherry:small burgundy fruit
cherry:1 for me to eat
cherry:bright red
Here's sample output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Here's my ideal syntax:
merge --inputDelimiter=":" --outputDelimiter=";" --matchfield=1 infile.txt
The "matchfield" is really optional. It could always be the first field. Subsequent appearances of the delimiter should be treated like plain text.
I don't mind a perl, ruby, awk one-liner, if you can think of a short and elegant algorithm. This should be able to handle millions of lines of input. Any ideas?
Using awk one liner
awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ sub($1":",";") } 1' file
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Setting ORS="": by default it is \n.
We set ORS="" (the Output Record Separator) because we don't want awk to add a newline at the end of each record. We want to handle that ourselves: a newline is added at the start of every record whose first field differs from the previous one.
a!=$1: when variable a (initially empty) doesn't match the first field $1, which is e.g. apple in the first line, set a=$1 and $0=RS $0, i.e. the whole record becomes "\n"$0 (a newline is added at the beginning of the record). a!=$1 is true whenever the first field ($1) differs from the previous line's $1, and is thus the criterion for separating records by first field.
a==$1: if it matches, you are on a record that belongs to the same group as the previous one. In this case substitute the first occurrence of $1: (note the :), e.g. apple:, with ;. $1":" could also be written as $1FS, where FS is :.
If you have millions of lines in your file, this approach should be fast, because it involves no pre-processing and no extra data structure such as an array for storing keys or records.
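For comparison, here is a sketch of the same streaming idea written with printf instead of the ORS trick. As far as I can tell, the one-liner above starts its output with an empty line (the first record also gets RS prepended) and treats the key as a regular expression inside sub(). This variant sidesteps both. The function name merge_by_key is made up, and like the original it assumes lines with the same key are already adjacent.

```shell
# Hypothetical helper: merge consecutive lines that share the first ":" field.
# Assumes the input is already grouped (for example sorted) by key.
merge_by_key() {
    awk -F: '
        NR == 1 || $1 != key {          # a new key starts a new output line
            if (NR > 1) printf "\n"     # finish the previous group first
            key = $1
            printf "%s", $0
            next
        }
        {                               # same key: append everything after "key:"
            printf ";%s", substr($0, length(key) + 2)
        }
        END { if (NR > 0) printf "\n" } # terminate the last group
    '
}

printf '%s\n' 'apple:A fruit' 'apple:Type of: pie' 'banana:tropical fruit' | merge_by_key
```

Only one line is held in memory at a time, so this should behave the same way on millions of lines. Later colons in a line are left alone because substr() cuts by position rather than by pattern.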
for F in $(cut -f1 -d ':' infile.txt | sort -u); do echo "$F:$(grep "^$F:" infile.txt | cut -f2- -d ':' | paste -s -d ';' - )"; done
Not sure it qualifies as 'elegant', but it works, though surely not quickly for millions of lines: the file is rescanned with grep once per distinct key, so it slows down significantly as the number of keys grows. The grep pattern is anchored as "^$F:" so that a key only matches at the start of a line. What percentage of the matching fields do you expect to be unique?
Discover awk language:
awk -F':' '{ v=substr($0, index($0,":")+1); a[$1]=($1 in a? a[$1]";" : "")v }
END{ for(i in a) print i,a[i] }' OFS=':' infile.txt
The output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
I think this one does the job:
awk -F':' '$1!=a{if(NR>1)print b; a=$1; b=$0; next} {b=b";"substr($0,length(a)+2)} END{if(NR)print b}' infile
@RahulVerma's awk solution is simple and awesome! However, I've been looking for a sed solution, and I wondered whether sed is capable of solving this sort of problem. Finally, I've found my own solution! The core idea is to treat the hold space as a variable: put what we want to search for in the hold space, and search for it in the current input line. I want to give credit to @chaos, because the core idea is from his answer how-to-search-for-the-word-stored-in-the-hold-space-with-sed#239049.
Using sed one liner
sed -E '1{:OK; h; $b; d}; x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; s/\n.*//' file
or
sed -E 'x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; 1d; s/\n.*//; b; :OK; h; $!d' file
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Explanation: (the 2nd sed)
x exchange the hold space and the pattern space
The hold space would become the current input line.
The pattern space would become the content generated by s/// in the last cycle (see below), or the last input line if s/// fails to replace anything. It is empty for the very beginning, i.e., when the current input line is the line 1.
G append the hold space (the current input line) to the pattern space (the content generated in the last cycle). Thus, the pattern space would be two lines (The hold space is not changed). For example,
If the current input line is line 1, the pattern space would be
# the empty hold space
apple:A fruit # the current input line
If it is line 2, the pattern space would be
apple:A fruit # the last input line, because `s///` fails to replace anything in the last cycle
apple:Type of: pie # the current input line
If it is line 3, the pattern space would be
apple:A fruit;Type of: pie # the resulting content generated by `s///` in the last cycle
banana:tropical fruit # the current input line
s/// use the first field ^([^:]+) (: is the field separator) in the first line of the pattern space to search the second line. The \n\1: part shows how this works. If it is found in the second line, concatenate the first line \1\2, a ;, and the remaining fields of the second line \3. Thus, the result is \1\2;\3.
tOK if the s/// is successful (s has replaced something) jump to the label OK, otherwise continue
In case of success
:OK specify the location of the label OK
h put the pattern space which is \1\2;\3, generated by s///, to the hold space. (The first line of the pattern space in the next cycle is made up of this very content)
$!d delete it, don't print it, if hasn't reached the last line
In case of failure, (s/// hasn't changed anything, the pattern space still has two lines)
1d delete the pattern space, don't print it, if the current input line is the line 1. At this moment, there is nothing ready to output. Without this, an empty line would be printed.
s/\n.*// delete the second line of the pattern space, which is the current input line.
b jump without a label, means end the sed script, and the remaining content in the pattern space would be printed out before starting the next cycle.
This is the very place where we finish combining the lines starting with the "current" same field and print out the combined line.
The hold space still holds the current input line for this cycle, and it is going to be "the last input line" for the next cycle (see above). What it holds will start a brand new round of combining lines that begin with a new common field.
In the second sed command line, the 1d; is not strictly necessary; without it, however, the output differs subtly: a harmless empty line is printed at the top, like below,
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Thoughts at the end
Both the time I spent figuring out the sed solution and its final look indicate that awk is better and easier for problems as complicated as this. sed is not good at searching/replacing in a file based on the file's own content. Searching for or replacing constant text is easy, but here what to search for comes from the file itself, so it may vary from line to line. For this kind of task, sed is not as good as awk, which has full programming facilities such as variables, functions, if-else, and for/while/do. (See further in what is the difference between sed and awk)
As an exercise - or even for smallish datasets - the same in just native bash interpreter commands:
declare -A fruit=()
while IFS=: read k v; do fruit[$k]="${fruit[$k]:+${fruit[$k]};}$v"; done < test.in
for k in "${!fruit[@]}"; do echo "$k:${fruit[$k]}"; done
Output:
apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
Any required sorting order might need to be explicitly enforced in more complex datasets.
Is there a way to delete the newline at the end of a line in Vim, so that the next line is appended to the current line?
For example:
Evaluator<T>():
_bestPos(){
}
I'd like to put this all on one line without copying lines and pasting them into the previous one. It seems like I should be able to put my cursor to the end of each line, press a key, and have the next line jump onto the same one the cursor is on.
End result:
Evaluator<T>(): _bestPos(){ }
Is this possible in Vim?
If you are on the first line, pressing (upper case) J will join that line and the next line together, removing the newline. You can also combine this with a count, so pressing 3J will combine all 3 lines together.
Certainly. Vim recognizes the \n character as a newline, so you can just search and replace.
In command mode type:
:%s/\n/
While on the upper line in normal mode, hit Shift+j.
You can prepend a count too, so 3J on the top line would join all those lines together.
As other answers mentioned, (upper case) J and search + replace for \n can be used generally to strip newline characters and to concatenate lines.
But in order to get rid of the trailing newline character in the last line, you need to do this in Vim:
:set noendofline binary
:w
J deletes extra leading spacing (if any), joining lines with a single space. (With some exceptions: after /[.!?]$/, two spaces may be inserted; before /^\s*)/, no spaces are inserted.)
If you don't want that behavior, gJ simply removes the newline and doesn't do anything clever with spaces at all.
set backspace=indent,eol,start
in your .vimrc will allow you to use backspace and delete on \n (newline) in insert mode.
set whichwrap+=<,>,h,l,[,]
will allow you to delete the previous LF in normal mode with X (when in col 1).
All of the following assume that your cursor is on the first line:
Using normal mappings:
3Shift+J
Using Ex commands:
:,+2j
Which is an abbreviation of
:.,.+2 join
Which can also be entered by the following shortcut:
3:j
An even shorter Ex command:
:j3
It probably depends on your settings, but I usually do this with A<delete>
Where A is append at the end of the line. It probably requires nocompatible mode :)
I would just press A (append to end of line, puts you into insert mode) on the line where you want to remove the newline and then press delete.
<CURSOR>Evaluator<T>():
_bestPos(){
}
cursor in first line
NOW, in NORMAL MODE do
shift+v
2j
shift+j
or
V2jJ
:normal V2jJ
if you don't mind using other shell tools,
tr -d "\n" < file >t && mv -f t file
sed -i.bak -e :a -e 'N;s/\n//;ba' file
awk '{printf "%s",$0 }' file >t && mv -f t file
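The three commands above join the lines with nothing in between. If you would rather get the single space that J inserts in Vim, paste can do that. This is only a sketch, and the function name join_lines is invented for the example.

```shell
# Hypothetical helper: join all lines of stdin into one, separated by spaces.
join_lines() {
    # -s pastes the lines of one input serially; -d sets the joining character.
    paste -s -d ' ' -
}

printf 'Evaluator<T>():\n_bestPos(){\n}\n' | join_lines
```

On the example from the question this should give Evaluator<T>(): _bestPos(){ } followed by a single trailing newline.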
The problem is that multiple invisible 0A (\n) characters may accumulate.
Suppose you want to clean up from line 110 to the end:
Press Esc and then : to enter command-line mode:
:110,$s/^\n//
In a vim script:
execute '110,$s/^\n//'
Explanation: from line 110 to the end,
search for lines that start with a newline (that is, blank lines)
and remove them
A very slight improvement to TinkerTank's solution if you're just looking to quickly concatenate all the lines in a text file is to have something like this in your .vimrc:
nnoremap <leader>j :%s/\n/\ /g<CR>
This globally substitutes newlines with a space meaning you don't end up with the last word of a line being joined onto the first word of the next line. This works perfectly for my typical use-case.
If you want to keep deliberate paragraph breaks, V):join is probably the easiest solution.
Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that AWK will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
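Here is a tiny run to make the explanation concrete. It is a sketch only, and the function name dedupe_keep_first is made up.

```shell
# Hypothetical helper: drop repeated lines, keeping the first occurrence of each.
dedupe_keep_first() {
    # seen[$0]++ returns the old count: 0 (false) the first time a line shows up,
    # so the negation is true exactly once per distinct line.
    awk '!seen[$0]++'
}

printf 'b\na\nb\nc\na\n' | dedupe_keep_first
```

This should print b, a, c: the original order is kept and the input does not need to be sorted. The trade-off is that every distinct line is stored in memory.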
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line containing no characters. On my Mac, the CPU just spins.
This is an infinite loop if the last line is blank and contains no characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of consecutive duplicate lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to append the next line to the pattern space.
/^(.*)\n\1$/!P: if the pattern space consists of two identical strings separated by \n, the next line is the same as the current one, so by the core idea we must not print it yet. Otherwise the current line is the last appearance in its run of duplicates, and the P command prints the pattern space up to the first \n (followed by a newline).
D: delete the pattern space up to and including the first \n, so that the pattern space now holds only the next line.
The D command then makes sed jump back to its first command, $!N, without reading a new line from the file or standard input.
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of consecutive duplicate lines only once, at its first appearance, and use the : and t commands to implement the loop.
Explanation:
Read a new line from the input stream or file and print it once (p).
Use the :loop command to set a label named loop.
Use $!N to append the next line to the pattern space, unless the current line is the last one.
Use s/^(.*)\n\1$/\1/ to drop the next line if it is the same as the current line; the s command does the deleting.
If the s command made a substitution, tloop makes sed jump back to the label loop, repeating the same steps until the next line is no longer a duplicate of the line most recently printed. Otherwise, the D command deletes the line that was already printed and makes sed jump to the first command, p, with the next new line as the content of the pattern space.
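If you want something to check the sed versions against, the "print when the line changes" rule can be sketched in awk. This is my own cross-check rather than part of either sed solution, and the function name uniq_adjacent is made up.

```shell
# Hypothetical helper: collapse runs of identical adjacent lines, like uniq.
uniq_adjacent() {
    # Print a line if it is the first one or differs from its predecessor.
    # ($0 "") forces a string comparison, so 1 and 1.0 count as different lines.
    awk 'NR == 1 || ($0 "") != prev { print } { prev = $0 }'
}

printf '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5\n' | uniq_adjacent
```

On the sample above it should print 1 through 5, once each, the same as both sed commands. Non-adjacent repeats are kept, just as with uniq.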
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparison, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work, at the character set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
    sed -E '
        $!{
            N;
            s/[ \t]+$//;
            /^(.*)\n\1$/!P;
            D;
        }
    ';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
    sed -n -E '
        s/[ \t]+$//;
        G;
        /^(\n){2,}/d;
        /^([^\n]+).*\n\1(\n|$)/d;
        h;
        P;
    ';
}
lastrepeat() {
    sed -n -E '
        s/[ \t]+$//;
        /^$/{
            H;
            d;
        };
        G;
        # delete previous repeated line if found
        s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
        # after searching for previous repeat, move tested last line to end
        s/^([^\n]+)(\n)(.*)/\3\2\1/;
        $!{
            h;
            d;
        };
        # squeeze blank lines to one
        s/(\n){3,}/\n\n/g;
        s/^\n//;
        p;
    ';
}
This can be achieved using AWK.
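The line below prints the file with adjacent duplicate lines collapsed (uniq only compares neighbouring lines, so sort the input first if the duplicates are scattered):
awk '{ print }' file_name | uniq
You can write the result to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will then contain no adjacent duplicates.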
Use:
cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
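It counts each distinct line with uniq -c and uses AWK to print those that occur exactly once. Note that lines occurring more than once are dropped entirely rather than reduced to one copy, and that print $2 outputs only the first whitespace-separated word of each line.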
I need to have a script read the files coming in and check information for verification.
On the first line of the files to be read is a date but in numeric form. eg: 20080923
But before the date there is other information, so I need to read it from position 27. That is, at line 1, position 27, I need to get that number and see if it's greater than another number.
I use the grep command to check other information, but there I have special characters to search for. In this case the information before the date is always different, so I can't use a character to search on. It has to be done by line 1, position 27.
sed 1q $file | cut -c27-34
The sed command reads the first line of the file and the cut command chops out characters 27-34 of the one line, which is where you said the date is.
Added later:
For the more general case - where you need to read line 24, for example, instead of the first line, you need a slightly more complex sed command:
sed -n -e 24p -e 24q "$file" | cut -c27-34
sed -n '24p;24q' "$file" | cut -c27-34
The -n option means 'do not print lines by default'; the 24p means print line 24; the 24q means quit after processing line 24. You could leave the 24q out, in which case sed would continue reading the rest of the input, effectively ignoring it.
Finally, especially if you are going to validate the date, you might want to use Perl for the whole job (or Python, or Ruby, or Tcl, or any scripting language of your choice).
You can extract the characters starting at position 27 of line 1 like so:
datestring=$(head -n 1 "$file" | cut -c27-)
You'd perform your next processing step on $datestring.
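Putting the pieces together, the check could be sketched like this. The function name date_field, the sample line, and the threshold value are all made up for illustration; only the sed 1q and cut -c27-34 steps come from the answers above.

```shell
# Hypothetical helper: print characters 27-34 of the first line of a file.
date_field() {
    sed 1q "$1" | cut -c27-34
}

# Made-up sample file: 26 filler characters followed by the date.
sample=$(mktemp)
printf '%s\n' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ20080923 trailing data' 'second line' > "$sample"

threshold=20080101                      # assumed cut-off, for illustration only
datestring=$(date_field "$sample")

case $datestring in
    *[!0-9]*|'')                        # reject anything that is not all digits
        echo "no numeric date at column 27" ;;
    *)
        if [ "$datestring" -gt "$threshold" ]; then
            echo "$datestring is later than $threshold"
        else
            echo "$datestring is not later than $threshold"
        fi ;;
esac
rm -f "$sample"
```

Because the date is in YYYYMMDD form, a plain integer comparison orders dates correctly. The digit check is there so that a malformed first line does not reach the numeric test.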