How to extract text between two words in unix?

I
am
using
basic
sed
expression :-
sed -n "am/,/sed/p"
to get the text between "am" and "sed"
which will output "am \n using \n basic \n sed".
But my real problem arises when the text is instead:
I
am
using
basic
grep
expression.
When I apply the above sed command to this text,
it gives "am \n using \n basic \n grep \n expression",
which it should not. How can I discard the
output when there is no match?
Any suggestions?

The command in the question (sed -n "/am/,/sed/p", note the added slash) means:
Find a line containing the string am
and print (p) until a line containing sed occurs
Therefore it prints:
I am using basic grep expression
because it contains am. If you add more lines, they will be printed too, until a line containing sed occurs.
E.g.:
echo -e 'I am using basic grep expression.\nOne more line\nOne with sed\nOne without' | sed -n "/am/,/sed/p"
results in:
I am using basic grep expression.
One more line
One with sed
I think what you want is something like this:
sed -n "s/.*\(am.*sed\).*/\1/p"
Example:
echo 'I am using basic grep expression.' | sed -n "s/.*\(am.*sed\).*/\1/p"
echo 'I am using basic sed expression.' | sed -n "s/.*\(am.*sed\).*/\1/p"
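To check the single-line behaviour, here is a small sketch. The helper name extract is only for illustration; the two sentences are the ones from the example above.

```shell
# extract is a hypothetical helper wrapping the substitution from the answer.
extract() {
  sed -n 's/.*\(am.*sed\).*/\1/p'
}

with_match=$(printf '%s\n' 'I am using basic sed expression.' | extract)
without_match=$(printf '%s\n' 'I am using basic grep expression.' | extract)

printf 'with sed:  [%s]\n' "$with_match"
printf 'with grep: [%s]\n' "$without_match"
```

The second result is empty because p only fires when the substitution succeeds, i.e. when both am and sed occur on the same line.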

You need a slightly different sed command, such as:
sed -n '/am/{:a; /am/x; $!N; /sed/!{$!ba;}; /sed/{s/\n/ /gp;}}' file
This prints ONLY the lines that contain the text am and sed, where the match may span multiple lines.
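If the sed syntax is hard to follow, the same idea (hold the lines back and print them only once the closing word is seen) can be sketched in awk. This is an alternative under my own assumptions, not the command above: between, first, last, inside and buf are made-up names, and the words are matched as plain substrings rather than regular expressions.

```shell
# between is a hypothetical helper: it buffers lines starting at the first one
# containing $1 and prints the buffer only once a line containing $2 shows up.
between() {
  awk -v first="$1" -v last="$2" '
    !inside && index($0, first) { inside = 1; buf = "" }
    inside {
      buf = buf $0 "\n"
      if (index($0, last)) { printf "%s", buf; inside = 0 }
    }
  '
}

found=$(printf '%s\n' I am using basic sed expression | between am sed)
missing=$(printf '%s\n' I am using basic grep expression | between am sed)

printf 'found:\n%s\n' "$found"
printf 'missing: [%s]\n' "$missing"
```

Unlike the sed range /am/,/sed/, this sketch also closes the block when both words sit on the same line, and it prints nothing at all if the closing word never appears.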

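With sed this can work, but the syntax is quite overwhelming.
If you need to crop part of a multi-line (\n) text, you might want to try a simpler way using GNU grep with Perl-compatible regular expressions. grep normally matches one line at a time, so with current GNU grep the -z option is also needed to let a match span newlines (each match is then terminated by a NUL byte instead of a newline):
grep -zoP '(?s)(?<=START phrase).*(?=END phrase)' multi_line.txt
For example, I find this the easiest way to grab a Perforce changelist description (without the rest of the CL info):
p4 describe {CL NUMBER} | grep -zoP '(?s).*(?=Affected files)'
Note that you can replace the lookbehind (?<=...) or the lookahead (?=...) with the plain phrase to include the starting or ending phrase in the output.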
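A self-contained sketch of the grep approach, assuming GNU grep built with PCRE support. The sample text is invented, and the -z option is there because grep otherwise matches one line at a time; tr strips the NUL byte that -z appends to each match.

```shell
# Requires GNU grep with -P (PCRE) support; -z and -o are GNU options as well.
text=$(printf 'header\nSTART phrase\nline one\nline two\nEND phrase\nfooter')

cropped=$(printf '%s\n' "$text" |
  grep -zoP '(?s)(?<=START phrase\n).*(?=END phrase)' |
  tr -d '\0')

printf '%s\n' "$cropped"
```

The lookbehind and lookahead keep the START and END phrases out of the output; the \n inside the lookbehind just avoids a leading empty line.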

Related

How to use sed to extract text between two bar signs (i.e. '|')?

I would like to extract text that falls between two | signs in a file with multiple lines. For instance, I want to extract P16 from sp|P16|SM2. I have found a possible answer here. However, I cannot apply the answer to my case. I am using the following:
sed -n '/|/,/|/ p' filename
or this by escaping the | sign:
sed -n '/\|/,/\|/ p' filename
But what I receive as result are all the lines in the file unchanged even though I am using -n to suppress automatic printing of pattern space. Any ideas what I am missing?
[EDIT]:
I can get the desired result using the following. However, I would like an explanation why the above mentioned is not working:
sed 's/^sp|//' filename | sed 's/|.*//'
The tool for this task is cut:
$ echo "sp|P16|SM2" | cut -d'|' -f2
P16
awk is a better choice for column-based data:
awk -F'|' '{print $2}'
will give you P16.
sed one-liner:
The following sed one-liner will only leave the 2nd column for you:
kent$ echo "sp|P16|SM2"|sed 's/[^|]*|//;s/|[^|]*//'
P16
Or using grouping:
kent$ echo "sp|P16|SM2"|sed 's/.*|\([^|]*\)|.*/\1/'
P16
Short explanation of why your two commands didn't work:
1) sed -n '/|/,/|/ p' filename
This prints every line from a line containing | through the next line containing |. If every line in your file contains |, as it seems to, every line is printed.
2) sed -n '/\|/,/\|/ p' filename
sed uses BRE by default. In GNU sed, escaping the | gives it a special meaning: alternation (logical OR). Again, the /pat1/,/pat2/ address is the wrong tool for your case: it selects whole lines, not text within a line.
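The difference is easy to see on a small sample. In this sketch the second data line is invented for illustration; the commands are the ones discussed above.

```shell
# Two sample records; the second one is made up for this example.
data='sp|P16|SM2
tr|Q07|XY1'

by_cut=$(printf '%s\n' "$data" | cut -d'|' -f2)
by_awk=$(printf '%s\n' "$data" | awk -F'|' '{print $2}')
by_sed=$(printf '%s\n' "$data" | sed 's/[^|]*|//;s/|.*//')

# The range address from the question selects whole lines, so nothing is cropped.
by_range=$(printf '%s\n' "$data" | sed -n '/|/,/|/p')

printf 'cut:\n%s\nawk:\n%s\nsed:\n%s\nrange:\n%s\n' \
  "$by_cut" "$by_awk" "$by_sed" "$by_range"
```

cut, awk and the substitution all return the second field of each line, while the range address returns the input unchanged.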

Delete Special Word Using sed

I would like to use sed to remove all occurrences of a line if and only if it is exactly this:
<ab></ab>
If the line is this, I do not want to delete it:
<ab>keyword</ab>
My attempt that's not working:
sed '/<ab></ab>/d'
Thanks for any insight. I'm not sure what's wrong as I should not have to escape anything?
I'm using a shell script named temp to execute this. My command is this:
cat foobar.html | ./temp
This is my temp shell script:
#!/bin/sh
sed -e '/td/!d' | sed '/<ab></ab>/d'
It looks like we have a couple of problems here. The first is with the / in the close-tag. sed uses this to delimit different parts of the command. Fortunately, all we have to do is escape it with \. Try:
sed '/<ab><\/ab>/d'
Here's an example on my machine:
$ cat test
<ab></ab>
<ab></ab>
<ab>test</ab>
$ sed '/<ab><\/ab>/d' test
<ab>test</ab>
$
The other problem is that I'm not sure what the purpose of sed -e '/td/!d' is. In its default operating mode, you don't need to tell sed not to delete something; just tell it exactly what you want to delete.
So, to do this on a file called input.html:
sed '/<ab><\/ab>/d' input.html
Or, to edit the file in-place, you can just do:
sed -i -e '/<ab><\/ab>/d' input.html
Additionally, sed lets you use any character you want as a delimiter; you don't have to use /. So if you'd prefer not to escape your input, you can do:
sed '\#<ab></ab>#d' input.html
Edit
In the comments, you mentioned wanting to delete lines that only contain </ab> and nothing else. To do that, you need to do what's called anchoring the match. The ^ character represents the beginning of the line for anchoring, and $ represents the end of the line.
sed '/^<\/ab>$/d' input.html
This will only match a line that contains (literally) </ab> and nothing else at all, and delete the line. If you want to match lines that contain whitespace too, but no text other than </ab>:
sed '/^[[:blank:]]*<\/ab>[[:blank:]]*$/d' input.html
[[:blank:]]* matches zero or more blanks (spaces or tabs); [:blank:] is a POSIX character class, used here inside a bracket expression.
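Putting the anchoring and the alternative delimiter together for the original <ab></ab> case, as a sketch: the sample lines and the helper name sample are invented for the test.

```shell
# sample is a hypothetical helper that emits four test lines.
sample() {
  printf '%s\n' '<ab></ab>' '<ab>keyword</ab>' '  <ab></ab>  ' '<td><ab></ab></td>'
}

# Escaped slash with the default / delimiter.
kept_slash=$(sample | sed '/^[[:blank:]]*<ab><\/ab>[[:blank:]]*$/d')

# Same expression with # as the address delimiter, so / needs no escaping.
kept_hash=$(sample | sed '\#^[[:blank:]]*<ab></ab>[[:blank:]]*$#d')

printf '%s\n' "$kept_slash"
```

Both forms delete only the lines that hold nothing but the empty tag pair (with optional surrounding blanks) and keep the keyword line and the line where the pair is nested in other markup.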

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write the result to an output file. My data is shown below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$, i.e. every empty line. The -i flag edits the file in place; if your sed doesn't support that, you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF variants also remove lines containing only blanks or tabs; the regex /^$/ does not.
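A small sketch that makes the difference visible by counting the surviving lines. The sample input and the helper names sample and count_lines are invented for illustration.

```shell
# sample emits four lines: text, an empty line,
# a line holding only a space and a tab, and text again.
sample() {
  printf 'first\n\n \t\nlast\n'
}

count_lines() { awk 'END { print NR }'; }

after_sed=$(sample | sed '/^$/d' | count_lines)
after_awk=$(sample | awk 'NF' | count_lines)
after_class=$(sample | sed '/^[[:space:]]*$/d' | count_lines)

printf 'sed /^$/d: %s lines\nawk NF: %s lines\nsed [[:space:]]: %s lines\n' \
  "$after_sed" "$after_awk" "$after_class"
```

The /^$/ pattern leaves the whitespace-only line in place (three lines remain), while awk 'NF' and the [[:space:]] pattern remove it as well (two lines remain).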
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
This deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
When I execute the above command, I still have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
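The CRLF case can be reproduced without a real Windows file. In this sketch the input is simulated with printf, and crlf_sample and count_lines are made-up helper names.

```shell
# crlf_sample simulates a Windows file: every line, including the
# apparently blank one, ends in carriage return + newline.
crlf_sample() {
  printf 'first\r\n\r\nlast\r\n'
}

count_lines() { awk 'END { print NR }'; }

unchanged=$(crlf_sample | sed -e '/^ *$/d' | count_lines)
cleaned=$(crlf_sample | tr -d '\015' | sed -e '/^ *$/d' | count_lines)

printf 'without tr: %s lines\nwith tr: %s lines\n' "$unchanged" "$cleaned"
```

Without tr the middle line still contains a carriage return, so it does not match /^ *$/ and all three lines survive; after tr -d '\015' it is truly empty and gets deleted.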
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/   /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do.
Cheers
You can use sed's -i option to edit in place without using a temporary file:
sed -i '/^$/d' file

output of one command is argument of another

Is there any way to fit in 1 line using the pipes the following:
output of
sha1sum $(xpi) | grep -Eow '^[^ ]+'
goes instead of 456
sed 's/#version#/456/' input.txt > output.txt
Um, I think you can nest $(command arg arg) occurrences, so if you really need just one line, try
sed "s/#version#/$(sha1sum $(xpi) | grep -Eow '^[^ ]+')/" input.txt \
> output.txt
But I like Trey's solution of putting it on two lines; it's less confusing.
This is not possible using pipes. Command nesting works though:
sed 's/#version#/'$(sha1sum $(xpi) | grep -Eow '^[^ ]+')'/' input.txt > output.txt
Also note that if the result of the nested command contains the / character, you will need to use a different character as the delimiter (#, |, $, and _ are popular ones) or somehow escape the forward slashes in your string. This StackOverflow question also has a solution to the escaping problem. The problem can be solved by piping the command's output to sed and escaping all backslashes (the escape character), all forward slashes (to avoid conflicts with using / as the outer sed delimiter), and all & characters (which stand for the matched text in a replacement).
The following regular expression will escape all \ characters, all / characters, and all & characters in the command's output:
sha1sum $(xpi) | grep -Eow '^[^ ]+' | sed -e 's/\(\/\|\\\|&\)/\\&/g'
Nesting this as we did above we get this solution which should properly escape slashes where needed:
sed 's/#version#/'$(sha1sum $(xpi) | grep -Eow '^[^ ]+' | sed -e 's/\(\/\|\\\|&\)/\\&/g')'/' input.txt > output.txt
Personally I think that looks like a mess as one line, but it works.
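To see the escaping step in isolation, here is a sketch that uses a made-up value instead of the sha1sum output (a real SHA-1 hex digest contains none of these characters). The variable names and the sample template line are assumptions for illustration, and the escaping uses a bracket expression with | as delimiter rather than the alternation shown above.

```shell
# value stands in for the command output; it deliberately contains /, & and \.
value='a/b&c\d'

# Prefix each /, & and \ with a backslash so the outer s/// treats them literally.
escaped=$(printf '%s\n' "$value" | sed 's|[/&\\]|\\&|g')

# Double quotes let the shell expand $escaped inside the sed script.
result=$(printf '%s\n' 'version=#version#' | sed "s/#version#/$escaped/")

printf '%s\n' "$escaped" "$result"
```

After escaping, the value can be dropped into the replacement part of s/// with / as the delimiter and comes out unchanged in the result.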

Why sed removes last line?

$ cat file.txt
one
two
three
$ cat file.txt | sed "s/one/1/"
1
two
Where is the word "three"?
UPDATED:
There is no line after the word "three".
As Ivan suggested, your text file is missing the end of line (EOL) marker on the final line. Since that's not present, three is printed out by sed but then immediately over-written by your prompt. You can see it if you force an extra line to be printed.
sed 's/one/1/' file.txt && echo
This is a common problem since people incorrectly think of the EOL as an indication that there's a following line (which is why it's commonly called a "newline") and not as an indication that the current line has ended.
Using comments from other posts:
older versions of sed do not process the last line of a file if no EOL or "new line" is present.
echo can be used to add a new line
Then, to solve the problem you can re-order the commands:
( cat file.txt && echo ) | sed 's/one/1/'
I guess there is no newline character after the last line. sed didn't find a line separator after the last line and ignored it.
Update
I suggest rewriting this in Perl (if you have it installed):
cat file.txt | perl -pe 's/one/1/'
Instead of cat'ing the file and piping into sed, run sed with the file name as an argument after the substitution string, like so:
sed "s/one/1/" file.txt
When I did it this way, I got the "three" immediately followed by the prompt:
1
two
three$
A Google search shows that the man page for some versions of sed (not the GNU or BSD versions, which work as you'd expect) indicates that it won't process an incomplete line (one that's not newline-terminated) at the end of a file. The solution is to ensure your files end with a newline, install GNU sed, or use awk or perl instead.
Here's an awk solution:
awk '{gsub("one","1")}1' file.txt
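To check whether the input really lacks the final newline, and to add it before sed runs, something like the following sketch can help. The helper name ends_with_newline is invented, and the file content is simulated with printf.

```shell
# ends_with_newline is a hypothetical helper reading from standard input.
ends_with_newline() {
  # $(...) strips a trailing newline, so the captured last byte is empty
  # exactly when that byte is a newline (or when there is no input at all).
  [ -z "$(tail -c 1)" ]
}

if printf 'one\ntwo\nthree' | ends_with_newline; then
  complete=yes
else
  complete=no
fi

# Append the missing newline before sed sees the data, then count newlines.
after=$({ printf 'one\ntwo\nthree'; echo; } | sed 's/one/1/' | wc -l | tr -d ' ')

printf 'input ends with newline: %s\nlines after fix: %s\n' "$complete" "$after"
```

wc -l counts newline characters, so it reports three only once the last line is properly terminated.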

Resources