GVim Command / Script to delete lines from a set of 4 - unix

This post could count as a duplicate, but I have not found a relevant answer in previous threads. I have a large (6 GB) text file and I wish to remove the 3rd and 4th lines in every set of 4 lines. For example, the following
line1
line2
line3
line4
line5
line6
line7
line8
needs to be converted to this
line1
line2
line5
line6
Is there a Vim script or command to remove those lines? It could also be done in multiple passes: one pass to delete the 3rd line in each set of 4 (line1, line2, line3, line4), and another pass to delete the 3rd line again (previously the 4th) in each remaining set of 3 (line1, line2, line3).
The command :g/^/+1 d3 is close to what I want, but it also removes the second lines.

If you have GNU sed, you can filter the buffer through this pipeline:
sed -e '0~4d' | sed '0~3d'
The first sed deletes every 4th line, the second deletes every 3rd line.
This has the desired effect.
To pipe the current buffer through this command, enter this in command mode:
%!sed -e '0~4d' | sed '0~3d'
The % selects the range of lines to pass to a command (% means all lines, the entire buffer), and !cmd is the command to pipe through.
To perform this outside of vim, run these two steps:
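sed -i -e '0~4d' file
sed -i -e '0~3d' file
This modifies the file in place, in two steps. Keep -i and -e separate: GNU sed reads -ie as -i with the backup suffix e, which leaves a stray copy named filee.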

Alternatively you can also use Awk.
awk 'NR%4==3||NR%4==0{next;}1' file.txt > output.txt
To do this via Vim:
%!awk 'NR\%4==3||NR\%4==0{next;}1'
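For a file of this size, a single streaming pass outside the editor may be the more practical route. A minimal sketch, assuming a POSIX awk; the helper name keep_head_of_group and its two parameters are made up for illustration:

```shell
# keep_head_of_group N K: print the first K lines of every group of N lines.
# (Hypothetical helper, not part of any tool.)
keep_head_of_group() {
    awk -v n="$1" -v k="$2" '(NR - 1) % n < k'
}

# Keep lines 1-2 of every 4, as in the question's example.
printf 'line%s\n' 1 2 3 4 5 6 7 8 | keep_head_of_group 4 2
```

On the real file this would be run as keep_head_of_group 4 2 < big.txt > trimmed.txt, where both file names are placeholders.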

UPDATE: This is a bad approach for large files; the substitution takes about 3 seconds on a 6 MB file.
This approach works in Vim. Using a regular expression, you match 4 lines and replace them with the first two of those 4. It also works on a long file, only slowly. It does not handle the last 1–3 lines when the total number of lines is not a multiple of 4.
:%s#\(^.*\n^.*\)\n^.*\n^.*\n#\1\r#g
Explanation:
:%s - substitute in the whole file, with # used as the delimiter
\(^.*\n^.*\) - \(\) captures two lines that are used later as \1; \n stands for a line break; ^ for the beginning of a line; .* for any character repeated as many times as possible before the line break
\n - the line break after the second line
^.*\n^.*\n - the next two lines, which are to be deleted
\1\r - replace the match with the first two lines and add a line break \r
g - replace every match on a line rather than only the first (it is the % range, not g, that covers the whole file)
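The same four-line grouping can also be expressed as a stream, which copes with a trailing partial group. This is a sketch that assumes a POSIX-conforming sed, where n quits when no input is left; drop_3rd_4th is an illustrative name, not an existing command:

```shell
# drop_3rd_4th: print lines 1 and 2 of every group of 4, skip lines 3 and 4.
# p prints the current line; n loads the next one (nothing is auto-printed
# because of -n). At end of input, n simply ends the script.
drop_3rd_4th() {
    sed -n 'p;n;p;n;n'
}

printf '%s\n' line1 line2 line3 line4 line5 line6 line7 | drop_3rd_4th
```

With seven input lines the last group is incomplete: line5 and line6 are still printed and line7 is dropped.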

Related

Remove new line character after a specific word ignore space and tab

I am parsing a SQL script on Unix. If FROM is the first word on a line, that line should be merged with the previous line. If FROM is the last word on a line, that line should be merged with the next line. E.g.:
A
FROM
B
I want the result as
A FROM B
ignoring any spaces and tabs around it.
Code:
cat A.txt | sed ':a;N;$!ba;s|[Ff][Rr][Oo][Mm][\s\t]*\n|FROM |g;s/\n\s*\t*[Ff][Rr][Oo][Mm]/ FROM/g' >B.txt
Here is one using GNU awk and gensub. It replaces any run of spaces, newlines and tabs before and after the word FROM with a single space (carriage returns are omitted because of the unix tag). It uses an empty RS as the record separator, meaning that a record ends at an empty line or at the end of the file.
$ awk 'BEGIN{RS=""}{$0=gensub(/[ \t\n]+(FROM)[ \t\n]+/," \\1 ","g")}1' file
A FROM B
If you just want the word that comes after FROM:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i++)if($i=="FROM")print $(i+1)}' file
B
Both will fail if your query has FROM in the WHERE part values, like:
SELECT * FROM table WHERE variable='DEATH COMES FROM ABOVE';
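Where gensub is not available, a line-by-line variant is one possible alternative. The sketch below assumes a POSIX awk; merge_from is a hypothetical helper name. Unlike the gensub version it trims leading and trailing blanks on every line and leaves the case of FROM untouched. It only looks at the first and last word of each line, so a string literal that happens to start or end a line with FROM would still be merged.

```shell
# merge_from: join a line with its neighbour when FROM is its first or last word.
merge_from() {
    awk '
    {
        line = $0
        sub(/^[ \t]+/, "", line)          # trim leading blanks
        sub(/[ \t]+$/, "", line)          # trim trailing blanks
        upper = toupper(line)
        starts = (upper ~ /^FROM([ \t]|$)/)
        # Join with a space when this line starts with FROM or the
        # previous one ended with it; otherwise start a new output line.
        if (NR > 1) printf "%s", ((starts || prev_ends) ? " " : "\n")
        printf "%s", line
        prev_ends = (upper ~ /(^|[ \t])FROM$/)
    }
    END { if (NR > 0) print "" }'
}

printf 'A\nFROM\nB\n' | merge_from
```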

How can I merge two specific lines with sed?

I have a file that looks like this
line one
line two
line three
line four
and I want it to look like this (I want lines 2 and 3 merged into one)
line one
line two line three
line four
I've tried the following, that based on my research SHOULD do what I want:
$ sed '2s/\n/ /' test.txt > test2.txt
However test2.txt looks like this
line one
line two
line three
line four
I've seen some references to it being different on Solaris than Linux. Here's my server's details
$ uname -a
SunOS myserver 5.10 Generic_142909-17 sun4u sparc SUNW,Sun-Fire-V490
How can I make this give the results I want?
Based on the answers in the linked question, this sed should work:
sed '2N;s/\n/ /' file
2N means on the second line, append the next line to the pattern space. This results in the pattern space containing line two\nline three. The substitution (which replaces the newline \n with a space) applies to every line in the file but only has an effect here.
Otherwise, this awk would do the trick:
awk 'NR==2{printf("%s ",$0);next}1' file
On line two, use printf to print the contents of the line, followed by a space. next skips to the next line. For all other lines, the 1 at the end is effectively a {print $0} block (which prints the line).
Alternatively:
awk '{printf("%s%s",$0,NR==2?OFS:ORS)}' file
Where OFS by default is a space and ORS is a newline
Or a bit of perl:
perl -pe 's/\n/ / if $. == 2' file
$. is the line number. Substitute the newline for a space on the 2nd line.
Output (using any of the above approaches on my system):
line one
line two line three
line four
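The awk approach generalizes to any line number. A small sketch, assuming a POSIX awk; join_with_next is an illustrative name, and the line number is passed in with -v rather than hard-coded:

```shell
# join_with_next N: merge line N with the line that follows it,
# separated by a single space.
join_with_next() {
    awk -v n="$1" 'NR == n { printf "%s ", $0; next } 1'
}

printf '%s\n' 'line one' 'line two' 'line three' 'line four' | join_with_next 2
```

If N is the last line of the input, the output ends with a space instead of a newline, so this is only meant for lines that have a successor.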

Remove every x lines from text input

I'm looking to grep some log files with a few surrounding lines, but then discard the junk lines from the matches. To make matters worse, the stupid code outputs the same exception twice so I want to junk every other grep match. I don't know that there's a good way to skip every other grep match when also including surrounding lines, so I'm good to do it all in one.
So let's say we have the following results from grep:
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
Obviously the grep match is "MatchExceptionText" (simplified, of course). So I'd like to pipe this to something where I can remove lines 2,5,6,7,8 and then repeat that pattern, so the results look like this:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
The repeating is where things get tricky for me. I know sed can just remove certain line numbers but I don't know how to group them into groups of 8 lines and repeat that cut in all the groups.
Any ideas? Thanks for your help.
awk can do modular arithmetic, so printing conditionally on the number of lines read mod 8 lets you repeat the pattern.
awk 'NR%8 ~ /[134]/' file
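The modulo idea can be parameterized so the period and the kept positions are not baked into a regular expression. A sketch under the assumption of a POSIX awk; keep_positions and its arguments are hypothetical names:

```shell
# keep_positions PERIOD "P1 P2 ...": within every block of PERIOD lines,
# print only the lines at the listed 1-based positions.
keep_positions() {
    awk -v period="$1" -v keep="$2" '
    BEGIN {
        n = split(keep, p, " ")
        for (i = 1; i <= n; i++) want[p[i]] = 1
    }
    {
        pos = (NR - 1) % period + 1    # position inside the current block
        if (pos in want) print
    }'
}

# Positions 1, 3 and 4 of every 8 lines, as in the grep example.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | keep_positions 8 "1 3 4"
```

Counting positions from 1 to PERIOD avoids the special case where NR%8 is 0 for the eighth line.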
Sed can do it:
sed -n 'N;s/\n.*//;N;N;p;N;N;N;N' filename
EDIT:
Come to think of it, this is a little better:
sed -n 'p;n;n;N;p;n;n;n;n' filename
With GNU awk you can split the input at appropriate record separators and print the wanted output, eg.:
awk 'NR%2 {print $1, $3}' RS='--\n' ORS='\n--\n' OFS='\n' infile
Output:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
This might work for you (GNU sed):
sed -n 'p;n;n;p;n;p;n;n;n;n' file
Another sed variant keeps a counter of the relative line position in the hold space and prints positions 1, 3 and 4 of every 8 lines:
sed -n "s/^.*\n//;x;s/^/²/
/^²\{1\}$/ b print
/^²\{3\}$/ b print
/^²\{4\}$/ b print
b cycle
: print
x;
# your treatment begin
p
# your treatment stop
x
: cycle
/^²\{8\}$/ s/.*//
x
" YourFile
This is mainly a reference for a "case of" construct on relative line numbers: change the number in /^²\{YourLineNumber\}$/ to select another relative line position.
Don't forget the last line number, which resets the cycle.
The first part takes the line and updates the relative line counter.
The second part is the "case of".
The third part is the treatment (here, a print).
The last part resets the cycle counter when needed.

sed '$!N;$!D' explanation

I know that
cat foo | sed '$!N;$!D'
will print out the last two lines of the file foo, but I don't understand why.
I have read the man page and know that N joins the next line to the currently processed line etc - but could someone explain in 'good english' that matches the order of operation what is happening here, step by step?
thanks!
Here is what that script looks like when run through the sedsed debugger (by Aurelio Jargas):
$ echo -e 'a\nb\nc\nd' | sedsed -d '$!N;$!D'
PATT:^a$
COMM:$ !N
PATT:^a\Nb$
COMM:$ !D
PATT:^b$
COMM:$ !N
PATT:^b\Nc$
COMM:$ !D
PATT:^c$
COMM:$ !N
PATT:^c\Nd$
COMM:$ !D
PATT:^c\Nd$
c
d
I've filtered out the lines that would show hold space ("HOLD") since it's not being used. "PATT" shows what's in pattern space and "COMM" shows the command about to be executed. "\N" indicates an embedded newline. Of course, "^" and "$" indicate the beginning and end of the string.
$!N appends the next line, and $!D deletes the first line in the pattern space and loops to the beginning of the script without doing an implicit print. When the last line is read, the $! tests fail, so there's nothing left to do: the script exits and performs an implicit print of what remains in the pattern space (since -n was not specified in the arguments).
Disclaimer: I am a contributor to the sedsed project and have made a few minor improvements including expanded color support, adding the ^ line-beginning indicator and preliminary support for Python 3. The bleeding edge version (which hasn't been touched lately) is here.
$!N;$!D is a sed program consisting of two statements, $!N and $!D.
$!N matches everything but the last line of the last file of input ($ negated by !) and runs the N command on it, which as you said yourself appends the next line of input to the line currently under scrutiny (the "pattern space"). In other words, sed now has two lines in the pattern space and has advanced to the next line.
$!D also matches everything but the last line, and wipes the pattern space up to and including the first newline. When anything is left in the pattern space afterwards, D restarts the script without reading a new line, so the remaining text is not wiped.
So, the algorithm being executed is roughly:
For every line up to but not including the last {
Read the next line and append it to the pattern space
If still not at the last line
Delete the first line in the pattern space
}
Print the pattern space
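The algorithm above amounts to a sliding two-line window. As a sketch, here is the sed program wrapped in a function next to an awk rendering of the same window; both function names are made up for illustration and a POSIX sed and awk are assumed:

```shell
# The sed program from the question.
last_two_sed() {
    sed '$!N;$!D'
}

# The same idea in awk: remember the previous and the current line,
# and print whatever is in the window once the input ends.
last_two_awk() {
    awk '{ prev = cur; cur = $0 }
         END { if (NR > 1) print prev; if (NR > 0) print cur }'
}

printf '%s\n' a b c d | last_two_sed
printf '%s\n' a b c d | last_two_awk
```

Both should agree with tail -n 2 on the same input, including inputs of only one line.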

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that AWK will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
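Written out in long form, the one-liner does roughly the following. This is only an expanded sketch for reading purposes, assuming a POSIX awk; dedupe_keep_first is an illustrative name:

```shell
# dedupe_keep_first: print each distinct line once, in order of first appearance.
dedupe_keep_first() {
    awk '{
        if (seen[$0] == 0) print   # first time this exact line appears
        seen[$0]++                 # count it so later copies are skipped
    }'
}

printf '%s\n' a b a c b | dedupe_keep_first
```

The array holds one entry per distinct line, so memory use grows with the number of unique lines rather than with the size of the file.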
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq file
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq file
The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line that has no characters. On my Mac the CPU just spins.
This is an infinite loop if the last line is blank and has no characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of consecutive duplicate lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to append the next line to the pattern space.
/^(.*)\n\1$/!P: if the pattern space holds two identical strings separated by \n, the next line is the same as the current line, so by the core idea we must not print it yet. Otherwise the current line is the last appearance in its run of duplicates, and the P command prints the pattern space up to and including the first \n.
D: delete the pattern space up to and including the first \n, so that the pattern space now holds the next line. D then makes sed jump back to its first command, $!N, without reading a new line from the file or standard input stream.
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of consecutive duplicate lines only once, at its first appearance, and use the : and t commands to implement the loop.
Explanation:
Read a new line from the input stream or file and print it once.
Use the :loop command to set a label named loop.
Use N to append the next line to the pattern space.
Use s/^(.*)\n\1$/\1/ to delete the next line if it is the same as the current line; the s command does the deleting.
If the s command made a substitution, tloop makes sed jump back to the label loop, repeating the check against the following lines until no consecutive duplicates of the last printed line remain. Otherwise, the D command deletes the already printed line, leaving the next new line in the pattern space, and makes sed jump back to the first command, which is p.
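For comparison, the "print at first appearance" idea can be written without any looping in awk. A sketch assuming a POSIX awk; uniq_adjacent is an illustrative name, and the empty-string concatenations are there to force a string comparison so that lines such as 1 and 1.0 are not treated as equal numbers:

```shell
# uniq_adjacent: drop a line when it is identical to the line just before it.
uniq_adjacent() {
    awk 'NR == 1 || ($0 "") != (prev "") { print } { prev = $0 }'
}

printf '%s\n' 1 2 2 3 3 3 4 4 4 4 5 | uniq_adjacent
```

On the sample input this should match what the two sed one-liners and plain uniq produce.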
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparison, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work, at the character set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
sed -E '
$!{
N;
s/[ \t]+$//;
/^(.*)\n\1$/!P;
D;
}
';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
sed -n -E '
s/[ \t]+$//;
G;
/^(\n){2,}/d;
/^([^\n]+).*\n\1(\n|$)/d;
h;
P;
';
}
lastrepeat() {
sed -n -E '
s/[ \t]+$//;
/^$/{
H;
d;
};
G;
# delete previous repeated line if found
s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
# after searching for previous repeat, move tested last line to end
s/^([^\n]+)(\n)(.*)/\3\2\1/;
$!{
h;
d;
};
# squeeze blank lines to one
s/(\n){3,}/\n\n/g;
s/^\n//;
p;
';
}
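The third variation, keeping only the last instance of each line, can be cross-checked against a much simpler awk version. This sketch assumes a POSIX awk and an input small enough to hold in memory; keep_last is a hypothetical name, and unlike lastrepeat it neither trims trailing blanks nor squeezes empty lines:

```shell
# keep_last: print each distinct line once, at the position of its last occurrence.
keep_last() {
    awk '
    { last[$0] = NR; line[NR] = $0 }     # remember where each line was last seen
    END {
        for (i = 1; i <= NR; i++)
            if (last[line[i]] == i) print line[i]
    }'
}

printf '%s\n' a b a c b | keep_last
```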
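This can be achieved using AWK together with uniq.
The line below displays the lines with adjacent duplicates removed:
awk '{ print }' file_name | uniq
You can write the result to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will contain no adjacent duplicates. uniq compares only neighboring lines, so duplicates that are not next to each other are kept.
Use:
cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
This sorts the file and keeps only the lines that occur exactly once, so every copy of a duplicated line is dropped. Because it prints $2, only the first word of each line is output.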

Resources