I want to delete directory path except the file name using sed from an html file. The path looks like:
<a href="/dir1/dir2/file.mp3" other_tags_here </a>
with spaces (%) and other characters in the directory and file names. eg.
<a href="/1-%one%2026/two%20_three%four/1-%eight.mp3"
I just need to keep <a href="1-%eight.mp3" other_tags_here </a>. When I try
echo '<a href="/1-%one%2026/two%20_three%four/1-%eight.mp3"' | sed 's|href="/.*/.*/|href="|g'
it works fine. However when I read from the html file
sed 's|href="/.*/.*/|href="|g' file.html
it deletes everything after href= and returns only href=. How do I correct this?
In sed, a regex finds the leftmost, longest match. That means that the final .*/ in your regex will match all the way to the last / on the line. To prevent that:
sed 's|href="/[^/]*/[^/]*/|href="|g' file.html
The regex [^/]*/ matches only up to the next /.
In languages like Python or Perl we can address this issue with non-greedy regexes. Because sed does not support non-greedy regexes, we have to get a similar effect with tricks like [^/]*/.
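To see the difference on a single line, here is a minimal sketch. The sample link is made up for illustration, not taken from the asker's file, and the third command is an extra variant that assumes the href value is always double-quoted. It also shows why the echo test looked fine: that input had no / after the file name for the greedy .* to run to.

```shell
# Hypothetical sample line; the "/" inside "</a>" is what trips up the greedy regex.
line='<a href="/dir1/dir2/file.mp3">song</a>'

# Greedy .* runs to the last "/" on the line, which sits inside "</a>".
greedy=$(printf '%s\n' "$line" | sed 's|href="/.*/.*/|href="|g')

# [^/]* cannot cross a "/", so exactly two directory levels are removed.
bounded=$(printf '%s\n' "$line" | sed 's|href="/[^/]*/[^/]*/|href="|g')

# Variant (an assumption, not from the answer): [^"]* stays inside the quoted
# value, so any number of directory levels is removed.
anydepth=$(printf '%s\n' "$line" | sed 's|href="/[^"]*/|href="|g')

printf '%s\n' "$greedy" "$bounded" "$anydepth"
```

The first result is mangled, while the second and third keep only the file name inside the href.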
Standard warning: in general, HTML can be very complex, with lots of special cases that regexes are ill-suited to handle.
When working with HTML, it is generally best to use HTML-specific tools (like Python's BeautifulSoup).
I'm trying to replace any occurrence of a cwe.mitre.org.*.html (regex) URL and remove the .html extension and not change any other type of URL.
Example:
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
Expectation:
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Is there a way to do this in sed or another tool?
I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that?
EDIT: I was wrong about the sed manual. It does mention it, see "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.
$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Search for "sed capture groups" for more background.
Use
sed -Ei 's/(cwe\.mitre\.org.*)\.html/\1/' file
EXPLANATION
NODE          EXPLANATION
------------  ------------------------------------------
(             group and capture to \1:
  cwe         'cwe'
  \.          '.'
  mitre       'mitre'
  \.          '.'
  org         'org'
  .*          any character except \n (0 or more times,
              matching as many as possible)
)             end of \1
\.            '.'
html          'html'
The \1 is a backreference to the part of the string captured by the parenthesized piece of the pattern. When you want a piece of the match to stay in the result, use a backreference.
A GNU AWK solution: let the content of file.txt be
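One thing to watch with the capture group: .* is greedy, so if a line holds more than one URL it runs to the last .html on the line. A minimal sketch, using a made-up two-URL line and assuming URLs never contain whitespace:

```shell
# Hypothetical line with two URLs; only the first one is on cwe.mitre.org.
line='see https://cwe.mitre.org/a.html and http://google.com/404.html'

# Greedy .* extends to the last ".html" on the line, so the wrong suffix is dropped.
greedy=$(printf '%s\n' "$line" | sed -E 's/(cwe\.mitre\.org.*)\.html/\1/')

# [^[:space:]]* stops at the first whitespace, keeping the match inside one URL.
bounded=$(printf '%s\n' "$line" | sed -E 's/(cwe\.mitre\.org[^[:space:]]*)\.html/\1/g')

printf '%s\n' "$greedy" "$bounded"
```

With one URL per line, as in the question, both commands give the same result.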
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
then
awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt
gives output
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Explanation: if the provided regex is found in a line, replace .html followed by end of line ($) with an empty string. Then print every line, changed or not.
(tested in GNU Awk 5.0.1)
Another possibility is
% sed '/cwe\.mitre\.org/s/\.html//' try.txt
https://cwe.mitre.org/data/definitions/377
Nothing
hello.html
http://google.com/404.html
This isn't unequivocally better than the accepted answer: it would get confused by foo.html text http://cwe.mitre.org/bar.html, for example (though the other answers may also be assuming there's only one relevant URL on a line). I mention it as a supplement to that one, however, since it usefully illustrates that sed commands can be prefixed by ‘addresses’, which can include regexps. This script deletes the first .html on any line which includes cwe.mitre.org.
This feature is often forgotten, and is only occasionally useful, but when it's appropriate, it can avoid an otherwise complicated regexp in the s ‘pattern’ slot, and back-references.
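The caveat about a line with an earlier .html can be checked directly. A small sketch, using the problem line mentioned above plus one line without the cwe.mitre.org address:

```shell
# The address /cwe\.mitre\.org/ selects the line; s then removes the FIRST ".html" on it.
confused=$(printf '%s\n' 'foo.html text http://cwe.mitre.org/bar.html' |
  sed '/cwe\.mitre\.org/s/\.html//')

# A line that does not match the address is passed through untouched.
untouched=$(printf '%s\n' 'http://google.com/404.html' |
  sed '/cwe\.mitre\.org/s/\.html//')

printf '%s\n' "$confused" "$untouched"
```

Here the .html after foo is the one that gets removed, which is the confusion being described.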
I need some help with using sed in Unix.
I need to use the standard Unix command sed to process the input stream and remove all HTML tags, so that for example:
This is my link.
will be replaced by
This is my link.
I tried
sed -r 's/<[^>]*>//g'
but it didn't work.
This is extremely bare-bones and unlikely to catch all of the scenarios that HTML will throw at you, but if you just want to delete everything from each < to the next >, then something like this might work:
sed 's/<[^>]*>//g'
But seriously, I'd use a parser.
In the general case you cannot parse HTML with regular expressions.
But for simple cases, and assuming that no tag spans more than two lines, you can use:
sed -e 's/<[^<>]*>//g' -e 's/<[^<>]*$//' -e 's/^[^<>]*>//'
The first regex finds and deletes tags contained on one line. The second takes care of tags which begin on a line but end on the next. The third deletes the tails of tags which began on the previous line. If a tag can span more than two lines, then something more complicated (or a better tool) is needed.
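As a rough check of the three expressions, here is a sketch run on a made-up fragment in which one tag is split across two lines (the HTML is hypothetical, chosen only to exercise each expression):

```shell
# Hypothetical input: the <a ...> tag starts on line 1 and ends on line 2.
html='This <b>is</b> my <a
href="page.html">link</a>.'

stripped=$(printf '%s\n' "$html" |
  sed -e 's/<[^<>]*>//g' -e 's/<[^<>]*$//' -e 's/^[^<>]*>//')

# Line 1 keeps a trailing space where "<a" was cut off; line 2 is "link."
printf '%s\n' "$stripped"
```

The text still comes out on two lines because sed works line by line; joining them would need extra work.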
I'm trying to replace every occurrence of \' with ' in a file.
I've used variations of: sed -e s/\'/"\'"/g file.txt
But they always replace every.single.(single).quote
Any help would be greatly appreciated.
I'm not sure it's the best solution, but I could do it like this:
sed "s/[\]'/\"\'\"/g" file.txt
(putting the backslash in a bracket expression so it doesn't interfere with the following quote, and protecting the whole script with double quotes)
Or just extending your syntax, without quotes but using almost the same trick:
sed -e s/[\\]\'/"\'"/g file.txt
An approach trying to conserve as much of the "single-quotedness" of the sed command as possible:
sed 's/\\'"'"'/'"'"'/g'
Just escaping \ with \\ and getting a single quote into the command with '"'"': the first single quote ends the command so far, then we have a double-quoted single quote ("'"), and finally an opening single quote for the rest of the command.
Alternatively, double quoting the whole command and escaping both the backslash and single quote:
sed "s/\\\'/\'/g"
The correct syntax is:
$ echo "foo\\'bar" | sed 's/\\'\''/'\''/'
foo'bar
Every script (sed, awk, whatever) should be enclosed in single quotes, and you use additional single quotes to stop and restart the script delimiters, breaking out to the shell for the minimal portion of the script that is absolutely necessary; in this case, just long enough to write \'. You need to break out to the shell to specify that ' because, per shell rules, no script enclosed in single quotes can contain a ', not even if you try to escape it.
echo "foo'bar" | gawk '{gsub(/\47/,"\\'")}1'
foo'bar
The tricky part here is replacing a single quote with an ampersand. First, to make the single quote manageable, use its octal code \47; then escape the ampersand with two backslashes. And all of a sudden it becomes feasible :)
I have a file where a few lines end with tux. How do I add " to the end of any line that ends in words like this or This?
You could visit this site for more examples and general help with sed. Also check its "Regular expressions" tab, or search the web for something like "unix anchor characters".
For this actual problem, these are the relevant parts of the site:
Sed has the ability to specify which lines are to be examined and/or modified, by specifying addresses before the command. I will just describe the simplest version for now - the /PATTERN/ address. When used, only lines that match the pattern are given the command after the address. Briefly, when used with the /p flag, matching lines are printed twice:
sed '/PATTERN/p' file
And of course PATTERN is any regular expression.
According to these, you could use a sed command like this to get the lines ending with "this" or "This" in your file, or "tux" if you meant that:
$ sed '/[tT]his$/p' yourfile
or
$ sed '/tux$/p' yourfile
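Note that without -n, sed also prints every line by default, which is why matching lines show up twice. A small sketch on made-up input:

```shell
# Hypothetical two-line input; only the first line ends with "tux".
sample=$(printf '%s\n' 'penguin tux' 'no match')

# Default output plus p: the matching line appears twice, the other line once.
twice=$(printf '%s\n' "$sample" | sed '/tux$/p')

# -n suppresses the default output, so only the matching line is printed.
once=$(printf '%s\n' "$sample" | sed -n '/tux$/p')

printf '%s\n' "$twice" '---' "$once"
```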
For putting the double quotes at the end of these lines, you also need to understand:
$ has a special meaning (end of the input line) as an anchor character in regular expressions
... and the character "$" is the end anchor. The expression "A$" will match all lines that end with the capital A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. The "$" is only an anchor if it is the last character.
how to use sed for substitution of characters (see the linked page)
Sed has several commands, but most people only learn the substitute command: s. The substitute command changes all occurrences of the regular expression into a new value. A simple example is changing "day" in the "old" file to "night" in the "new" file:
$ sed 's/day/night/' <old >new
Or another way (for UNIX beginners),
$ sed 's/day/night/' old >new
and for those who want to test this:
$ echo day | sed 's/day/night/'
This will output "night".
After this you can construct your own sed command, knowing that you can use these two parts together in one command like this:
$ sed '/[pP]atternAtTheEndOfLine$/s/$/patternToAddToEndOfTheLine/' yourfile
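Filled in for the question as asked, that could look like the sketch below (the three input lines are made up). The address picks lines ending in this or This, and s/$/"/ substitutes the empty end-of-line position with a double quote.

```shell
# Hypothetical input: two lines end in "this"/"This", one merely contains it.
result=$(printf '%s\n' 'look at this' 'This' 'this is not at the end' |
  sed '/[tT]his$/s/$/"/')

printf '%s\n' "$result"
```

Only the first two lines get a trailing ", since the third does not end with the word.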
I am trying to find the following text
get_pins {
and replace it with
get_pins -hierarchical {proc_top_*/
I've tried using sed but I'm not sure what I'm doing wrong. I know that you need # in front of curly braces but I still can't get the command to work properly.
The closest I've come is to this:
sed 's/get_pins #{#/get_pins -hierarchical #{#proc_top_*\//g' filename.txt > output
but it doesn't do the replacement I wanted above.
@merlin2011's answer shows you how to do it with alternative delimiters, but as for why your command didn't work:
It's actually perfectly fine, if you just remove all # characters from your statement:
sed 's/get_pins {/get_pins -hierarchical {proc_top_*\//'g filename.txt > output
There are two distinct escaping requirements involved here:
Escaping literal use of the regex delimiter: this is what you did correctly, by escaping the / as \/.
Escaping characters with special meaning inside a regex in general: this escaping is always done with \-prefixing, but in your case there is NO need for such escaping: since you're NOT using -E or -r to indicate use of extended regexes - and are therefore using a basic regex - { is actually NOT a special character, so you need NOT escape it. If, by contrast, you had used -E (-r), then you should have escaped { as \{.
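Both cases can be tried on a sample line; a minimal sketch, where the input line is made up and the -E form is shown only to illustrate the \{ escaping:

```shell
# Hypothetical input line containing the text to replace.
line='get_pins {a/b}'

# Basic regex (no -E): { is an ordinary character; only the delimiter / needs \/.
bre=$(printf '%s\n' "$line" |
  sed 's/get_pins {/get_pins -hierarchical {proc_top_*\//')

# Extended regex (-E): { is special in the pattern, so it is written \{ there.
# The replacement side is plain text, so { stays unescaped.
ere=$(printf '%s\n' "$line" |
  sed -E 's/get_pins \{/get_pins -hierarchical {proc_top_*\//')

printf '%s\n' "$bre" "$ere"
```

Both commands should produce the same line.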
The problem is not in the curly braces, it's in the /.
This is exactly why sed lets you do alternate delimiters.
The line below uses ! as a delimiter instead, and works correctly for a simple file with get_pins { in it.
sed 's!get_pins {!get_pins -hierarchical {proc_top_*/!g' Input.txt
Output:
get_pins -hierarchical {proc_top_*/
Update: Based on mklement0's comment, and testing with the csh shell, the following should work in csh.
sed 's#get_pins {#get_pins -hierarchical {proc_top_*/#g' Input.txt
This awk command should do the replacement:
awk '{sub(/get_pins {/,"get_pins -hierarchical {proc_top_*/")}1'