Remove HTML with Regex - asp.net

Is it possible to use regex to remove HTML tags inside a particular block of HTML?
E.g.
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>
I don't want to remove all p tags, only those within the table element.
Ideally, I would be able to either remove or retain the text inside the nested p tag.
Thanks.

It is often recommended not to use regex for parsing HTML, so you could use Html Agility Pack for this:
// Requires: using System.IO; using HtmlAgilityPack;
var html = @"
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>";
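HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = document.DocumentNode.SelectNodes("//table//p");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Swap each <p> for its inner content, which keeps the text.
        // This assumes the paragraph holds a single child (plain text here).
        node.ParentNode.ReplaceChild(
            HtmlNode.CreateNode(node.InnerHtml),
            node
        );
    }
}

// Serialize the modified document back to a string.
string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}
After these manipulations, you get the following result: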
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
My First HTML Table
</td>
</tr>
</table></body>
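
The question also asks about dropping the text together with the tag. A minimal sketch of that variant, assuming the same Html Agility Pack package is referenced; the class and method names are invented for illustration and are not part of the library:

```csharp
using System;
using HtmlAgilityPack;

public static class TableParagraphRemover
{
    // Removes every <p> that sits anywhere inside a <table>, text included.
    public static string RemoveParagraphsInTables(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var nodes = document.DocumentNode.SelectNodes("//table//p");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                node.Remove(); // detaches the node and its children
            }
        }
        return document.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html =
            "<body><p>Hello World!</p><table><tr><td>" +
            "<p>My First HTML Table</p></td></tr></table></body>";
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

The paragraph outside the table is left alone because the XPath only selects p elements that have a table ancestor.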

I have found this link, in which it seems the exact same question was asked:
"I have an HTML document in .txt format containing multiple tables and other texts and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"
Regex to delete HTML within <table> tags

<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>
The round brackets denote a numbered capture group which will contain your text.
However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.
Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.
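
As a rough illustration of how the capture group can be used from C#, here is a sketch that either keeps or drops the paragraph text. It assumes the cell contains nothing but a single p with plain text; the class and method names are made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellParagraphs
{
    // Same idea as the pattern above; \s already covers \r and \n.
    private static readonly Regex Pattern =
        new Regex(@"<td>\s*<p>([^<]*)</p>\s*</td>");

    // Keeps the text: capture group 1 is written back without the <p> wrapper.
    public static string Unwrap(string html) =>
        Pattern.Replace(html, m => "<td>" + m.Groups[1].Value.Trim() + "</td>");

    // Drops the paragraph together with its text.
    public static string Strip(string html) =>
        Pattern.Replace(html, "<td></td>");

    public static void Main()
    {
        string html =
            "<p>Hello World!</p><table><tr><td>\n" +
            "<p>My First HTML Table</p>\n</td></tr></table>";

        Console.WriteLine(Unwrap(html));
        // <p>Hello World!</p><table><tr><td>My First HTML Table</td></tr></table>
        Console.WriteLine(Strip(html));
        // <p>Hello World!</p><table><tr><td></td></tr></table>
    }
}
```

Any attribute on the td or p, or any nested markup inside the paragraph, stops the pattern from matching, which is the kind of assumption the answer warns about.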

It is possible to some extent, but not reliable!
I would rather suggest looking at an HTML parser such as HTML Agility Pack.

Related

Dynamic background-image

I want to have multiple divs with different background URLs.
My inline Razor for this code seems to be wrong:
<table>
    @foreach (var item in fa.get_albums()) {
        <tr>
            <td>
                <div style="background-image:url('@item.picture');">
                    ///something
                </div>
            </td>
        </tr>
    }
</table>
What's the right way to put inline Razor into background-image:url()?

The issue you have is that MVC will happily fix the relative paths inside an img src attribute but not for style. You should map that virtual path using Url.Content():
<div style="background-image:url('@Url.Content(item.picture)');">

PHPExcel: HTML to Excel, writing remove the CSS in excel file

I want to export (force download) HTML with CSS to an Excel sheet. For now I am using the PHPExcel library for this. It generates the Excel file but removes the CSS (inline styles on the HTML tags). Can anyone tell me how to keep the CSS in the Excel sheet?
I am using this code, but I also want to keep the CSS and force a download:
//html
$html = "<table>
<thead> <tr> <td colspan='2'> <h1> Main Heading </h1> </td> </tr> </thead>
<tbody>
<tr>
<th style='background:#ccc; color:red; font-size:15px'> Name </th>
<th style='background:#ccc; color:red; font-size:15px'> Class </th>
</tr>
<tr>
<td style='background:#fff; color:green; font-size:13px'> Jhon </td>
<td style='background:#fff; color:green; font-size:13px'> 9th </td>
</tr>
</tbody>
</table>";
// Put the html into a temporary file
$tmpfile = time().'.html';
file_put_contents($tmpfile, $html);
// Read the contents of the file into PHPExcel Reader class
$reader = new PHPExcel_Reader_HTML;
$content = $reader->load($tmpfile);
// Pass to writer and output as needed
$objWriter = PHPExcel_IOFactory::createWriter($content, 'Excel2007');
$objWriter->save('excelfile.xlsx');
// Delete temporary file
unlink($tmpfile);

You can't read styles from HTML markup at the moment, unless you rewrite PHPExcel's HTML Reader to handle styles; it simply isn't supported yet. If you're building the spreadsheet from HTML, perhaps you should reconsider building it directly from a new PHPExcel object, which gives you access to all the features of PHPExcel.
To send the file to the browser, write to php://output with the appropriate headers, as shown in Examples/01simple-download-xlsx.php and described in the section of the developer documentation entitled "Redirect output to a client's web browser".

Ractivejs: How to get to work Nested properties in view

I have two objects, result and headers, where headers is generated from _.keys(result[0]):
r{
data:{
headers:['head1','head2']
result:[
{head1:'content1',head2:'content2'}
{head1:'content3',head2:'content4'}
{head1:'content5',head2:'content6'}
]
}
I have to create a table dynamically, so I wrote this:
<table class="ui celled table segment">
<thead>
<tr>
{{#headers}}
<th>{{.}}</th>
{{/headers}}
</tr></thead>
<tbody>
{{#result:i}}
<tr>
{{#headers:h}}
<td>{{????}}</td> <-- Here is where I fail to know what to put into
{{/headers}}
</tr>
{{/result}}
</tbody>
</table>
Can someone help me fill in the blanks so that I can create a table that displays the contents?
If I remove the {{#headers}} part and use the elements I already know, <td>{{.head1}}</td> works perfectly. The problem is that I'm generating different objects on the fly.

{{#result:i}}
<tr>
{{#headers:h}}
<td>{{result[i][this]}}</td>
{{/headers}}
</tr>
{{/result}}
The reason this works is that the <td> is repeated for each item in the headers array, because it's inside a headers section - so far, so obvious. Because of that, we can use this to refer to the current header (head1, head2 etc). The trick is to get a reference to the current row - and because you've already created the i index reference, we can do that easily with result[i]. Hence result[i][this].
Here's a demo fiddle: http://jsfiddle.net/rich_harris/dkQ5Z/

How to show horizontal line in iTextSharp

I am creating a PDF and need to put a horizontal line on the page. Can anyone tell me how to do that?
I have an XML file that contains my HTML tags (<table>....</table>). The whole content of the XML file is parsed into a string, which is used to create the PDF. Some tags are not supported, and one of them is <hr>. Is there another tag I can use in the XML file that will draw a line when the PDF is created from the XML data?
Below is an example of the XML content:
<table>
<tr>
<td>
<span>
This is working properly.
</span>
</td>
<tr>
</table>
<table>
<tr>
<td>
<span>
<hr>
This is not working properly.
</span>
</td>
<tr>
</table>
Please let me know if any more information is needed.
Thanks in advance.

The following creates a full-width black line a few pixels thick. I'm using HTMLWorker.Parse:
<table>
<tr>
<td>
<span>
This is working properly.
</span>
</td>
<tr>
</table>
<table>
<tr>
<td>
<span>
<table border="1" cellpadding="0" cellspacing="0"><tr><td> </td></tr></table>
This is working properly now too!
</span>
</td>
<tr>
</table>

You can draw a line by moving to the starting position (MoveTo), calling LineTo, and then calling Stroke to commit the line:
...
PdfContentByte cb = writer.DirectContent;
....
cb.MoveTo(doc.PageSize.Width / 2, doc.PageSize.Height / 2);
cb.LineTo(doc.PageSize.Width / 2, doc.PageSize.Height);
cb.Stroke();
...
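
Note that the snippet above keeps x fixed and changes y, so it draws a vertical line from the centre of the page to its top edge. A sketch of a horizontal rule, assuming iTextSharp 5.x; the class name, method name, and the choice of mid-page height are illustrative only:

```csharp
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class HorizontalRuleDemo
{
    // Builds a one-page PDF in memory with a horizontal line at mid-height.
    public static byte[] BuildPdfWithRule()
    {
        var stream = new MemoryStream();
        var doc = new Document(PageSize.A4);
        PdfWriter writer = PdfWriter.GetInstance(doc, stream);
        doc.Open();

        // Regular content, so the page is not treated as empty.
        doc.Add(new Paragraph("Text above the line"));

        PdfContentByte cb = writer.DirectContent;
        float y = doc.PageSize.Height / 2;                  // same y for both ends
        cb.SetLineWidth(1f);
        cb.MoveTo(doc.LeftMargin, y);                       // start at the left margin
        cb.LineTo(doc.PageSize.Width - doc.RightMargin, y); // end at the right margin
        cb.Stroke();                                        // commit the line

        doc.Close();
        return stream.ToArray(); // ToArray still works after the stream is closed
    }

    public static void Main()
    {
        File.WriteAllBytes("line.pdf", BuildPdfWithRule());
    }
}
```

Direct content is positioned in absolute page coordinates, so the line does not move with the text flow the way an hr would.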

I hope this helps you out
PdfPTable table = new PdfPTable(1); //Create a new table with one column
PdfPCell cellLeft = new PdfPCell(); //Create an empty cell
StyleSheet style = new StyleSheet(); //Declare a stylesheet
style.LoadTagStyle("h1", "border-bottom", "red"); //Create styles for your html tags which you think will be there in PDFText
List<IElement> objects = HTMLWorker.ParseToList(new StringReader(PDFText),style); //This transforms your HTML to a list of PDF compatible objects
for (int k = 0; k < objects.Count; ++k)
{
cellLeft.AddElement((IElement)objects[k]); //Add these objects to cell one by one
}
table.AddCell(cellLeft);

How to remove html tags and contents in between them in c#?

I want to remove a tag and the content inside it from my source.
The following is my source:
<tr>
<td class="ds_label" width="40%" style="font-size: 70%;"></td>
<td id="table_cell_1585" class="ds_label">
<a class="tt" href="#" onClick="return false;">
<table class="tooltip" style="width:300px;" cellpadding="0" cellspacing="0" border=0>
</a>
</td>
<td class="ds_data" width="60%" style="font-size: 70%">800 x 480 pixels</td>
</tr>
I want to remove the whole <a> tag along with its content.
I used this (response contains my source code):
response = Regex.Replace(response, "<a>(.|\n)*?</a>", string.Empty);
but it's not working.
Please advise.

Regex is not a good tool for parsing HTML. Take a look at HTMLAgilityPack instead to save yourself some work.

First, try to avoid using regex to work with HTML. It's the wrong tool because there are too many edge cases for it to be reliable or secure. Use a library designed to work with structured documents, such as the HTMLAgilityPack.
When you define a regular expression with a string literal in C#, it's a good idea to use a verbatim string literal (prefixed with @) so that escape characters in the pattern aren't interpreted as part of the string literal. In the case of this question, @"<a>(.|\n)*?</a>" stops the \n from being treated as an escape character by C#.
New lines can consist of \r, \n, or both.
HTML a tags contain attributes like href, so <a> is unlikely to match anything because of the closing >.
Use RegexOptions.Singleline in the options argument to ensure . matches any character, including newlines.
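This unit test succeeds:
[Test]
public void Test()
{
    // Singleline lets . match \r and \n, so the anchor can span lines.
    Regex pattern = new Regex(@"<a.*?</a>", RegexOptions.Singleline);
    // Sample anchor (assumed; any <a ...>...</a> spanning a line break will do).
    string input = "foo <a href=\"#\">\r\nbaz</a> bar";
    // Both spaces around the removed anchor remain.
    string expected = "foo  bar";
    string actual = pattern.Replace(input, string.Empty);
    Assert.AreEqual(expected, actual);
}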
However, be aware that this is not a secure way of handling user input or any kind of data that is not pre-defined because regular expressions like this can easily be evaded.

Use this:
variable = Server.HtmlDecode(variable).Trim();

Try this regex:
<a\b[^>]*>(.*?)</a>
[TestMethod]
public void TestMethod1()
{
    var source =
@"
<tr>
<td class='ds_label' width='40%' style='font-size: 70%;'></td>
<td id='table_cell_1585' class='ds_label'>
<a class='tt' href='#' onClick='return false;'>
<table class='tooltip' style='width:300px;' cellpadding='0' cellspacing='0' border=0>
</a>
</td>
<td class='ds_data' width='60%' style='font-size: 70%'>800 x 480 pixels</td>
</tr>";
    // These two calls strip the opening and closing <a> tags only;
    // whatever sits between them is kept.
    source = Regex.Replace(source, "<a [^>]*>", string.Empty);
    source = Regex.Replace(source, "</a>", string.Empty);
    Console.Write(source);
}
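
To remove the anchor together with its content, as the question asks, the suggested pattern can be applied in a single call. A sketch using only System.Text.RegularExpressions; the class and method names are invented for the example, and the usual caveats about regex and HTML still apply:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorStripper
{
    // \b keeps the pattern from matching other tags that start with "a",
    // such as <abbr>. Singleline lets .*? run across line breaks.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>.*?</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes each anchor element, including everything between its tags.
    public static string RemoveAnchors(string html) =>
        Anchor.Replace(html, string.Empty);

    public static void Main()
    {
        string source =
            "<td id='table_cell_1585' class='ds_label'>\r\n" +
            "<a class='tt' href='#' onClick='return false;'>\r\n" +
            "<table class='tooltip'>\r\n" +
            "</a>\r\n" +
            "</td>";
        Console.WriteLine(RemoveAnchors(source));
    }
}
```

The lazy .*? stops at the first closing tag, so each anchor is removed separately rather than everything between the first opening tag and the last closing tag.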
