Removing all elements from HTML that have given class using Agility Pack - asp.net

I'm trying to select all elements that have a given class and remove them from an HTML string.
This is what I have so far. It doesn't seem to remove anything, although the source clearly shows 4 elements with that class name.
// Filter page HTML to display required content
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// pageHTML is a string containing the HTML
htmlDoc.LoadHtml(pageHTML);
// ParseErrors contains any errors from the load statement
if (!htmlDoc.ParseErrors.Any())
{
    // Remove all elements marked with pdf-ignore class
    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//body[@class='pdf-ignore']");
    // Remove the collection from above
    foreach (var node in nodes)
    {
        node.Remove();
    }
}
EDIT: Just to clarify the document is parsing and the SelectNodes line is being hit, just not returning anything.
Here is a snippet of the html:
<input type=\"submit\" name=\"ctl00$MainContent$PrintBtn\" value=\"Print Shotlist\" onclick=\"window.print();\" id=\"MainContent_PrintBtn\" class=\"pdf-ignore\">

EDIT: in your updated question you posted part of the HTML string, an <input> element declaration, but you're trying to match a <body> element with the class pdf-ignore (according to your expression //body[@class='pdf-ignore']).
If you want to match all the elements in the document with this class, you should use:
var nodes = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pdf-ignore')]");
to get your nodes. This will match every element whose class attribute contains the specified class name.
Your code seems to be correct except for one detail: the condition htmlDoc.ParseErrors == null. You select and remove nodes ONLY if the ParseErrors property (which is of type IEnumerable<HtmlParseError>) is null, but when no errors are found this property actually returns an empty list. So changing your code to:
if (!htmlDoc.ParseErrors.Any())
{
// some logic here
}
should solve the issue.
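One caveat worth noting: contains(@class, 'pdf-ignore') is a substring test, so it would also match a class such as pdf-ignore-header. A common XPath 1.0 idiom for matching a whole class token is //*[contains(concat(' ', normalize-space(@class), ' '), ' pdf-ignore ')]. The difference can be sketched outside of XPath. The JavaScript below is only an illustration of the matching logic, and the function names substringMatch and tokenMatch are hypothetical, not part of Html Agility Pack:

```javascript
// Substring test: mirrors contains(@class, name).
function substringMatch(classAttr, name) {
  return classAttr.includes(name);
}

// Token test: mirrors contains(concat(' ', normalize-space(@class), ' '), ' name ').
function tokenMatch(classAttr, name) {
  // Collapse whitespace runs and trim, like normalize-space(), then pad with spaces.
  const padded = " " + classAttr.trim().split(/\s+/).join(" ") + " ";
  return padded.includes(" " + name + " ");
}

const exact = "btn  pdf-ignore";          // carries the class as a whole token
const lookalike = "btn pdf-ignore-header"; // only contains it as a substring

const results = {
  substringExact: substringMatch(exact, "pdf-ignore"),
  substringLookalike: substringMatch(lookalike, "pdf-ignore"),
  tokenExact: tokenMatch(exact, "pdf-ignore"),
  tokenLookalike: tokenMatch(lookalike, "pdf-ignore"),
};
```

If no other class name in the page contains pdf-ignore as a substring, the simpler contains() expression is enough.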

Your XPath is probably not matching: have you tried "//div[class='pdf-ignore']" (no "@")?

Related

getElementsByClassName - Undefined Return

I am having issues with a getElementsByClassName call I am using in Google Tag Manager.
I need to capture an input field value in my client's form. I am isolating the class name and using it in my custom JS, but I am only getting undefined back.
The JS I am using is below. I've also created a gtm.formsubmit event, but I reckon that the event is firing before it has time to listen to the user input. Is that even possible?
function() {
var inputField = document.getElementsByClassName("wpcf7-form");
return inputField.value || "";
}
Thanks!
Even if there is just a single element with the class wpcf7-form, a call to getElementsByClassName will return an array-like collection of elements (in that case containing a single element). Since the collection itself has no "value" property, you get "undefined".
If you are reasonably sure there is only one element with the class you can do
...
var inputField = document.getElementsByClassName("wpcf7-form");
return inputField[0].value || "";
...
since a single element will always be at index 0. In that case it would be easier to use a DOM type variable in Google Tag Manager and set the selection method to "CSS selector". This will return the first element with your class (or undefined if not present).
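The difference between reading .value on the collection and on its first element can be sketched as follows. This is a minimal illustration, not the Google Tag Manager setup itself: the helper firstValueByClass and the fakeDocument stand-in are hypothetical and exist only so the snippet runs outside a browser. In a page you would pass the real document instead.

```javascript
// Returns the value of the first element carrying className, or "" if none.
// `root` is anything exposing getElementsByClassName, e.g. `document` in a browser.
function firstValueByClass(root, className) {
  const matches = root.getElementsByClassName(className); // array-like collection
  if (matches.length === 0) return ""; // guard: indexing an empty collection gives undefined
  return matches[0].value || "";
}

// Hypothetical stand-in for `document`, so the sketch runs without a DOM.
const fakeDocument = {
  getElementsByClassName(name) {
    return name === "wpcf7-form" ? [{ value: "hello" }] : [];
  },
};

const collection = fakeDocument.getElementsByClassName("wpcf7-form");
const wrong = collection.value; // the collection itself has no `value` property
const right = firstValueByClass(fakeDocument, "wpcf7-form");
const missing = firstValueByClass(fakeDocument, "no-such-class");
```

The length check matters because matches[0] is undefined when nothing matches, and reading .value on undefined throws.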

Is there a way to filter out hidden elements with css?

As an example, some HTML has several elements that match the CSS path table.class1.class2[role="menu"], but only one of these elements will be visible at any given time, so I want to get only the one that is visible.
Can I adjust my CSS path to narrow it down?
You could use LINQ to get the list. I am not sure which language you are using, but a similar approach can be applied in any of them.
Handling this kind of scenario with LINQ is very simple in C#:
public IWebElement Test()
{
//selector
By bycss = By.CssSelector("table.class1.class2[role='menu']");
return Driver.FindElements(bycss).ToList().FirstOrDefault(d => d.Displayed);
}
Also make sure to import System.Linq if you are using C#:
using System.Linq;
In Java you can do something like this (not using lambdas):
// requires java.util.ArrayList, java.util.Iterator and java.util.List
// start with an empty list; calling add() on null would throw a NullPointerException
List<WebElement> visibleList = new ArrayList<>();
//selector
By byCss = By.cssSelector("table.class1.class2[role='menu']");
//list of visible and hidden elements
Iterator<WebElement> iterator = driver.findElements(byCss).iterator();
while (iterator.hasNext()) {
    WebElement element = iterator.next();
    if (element.isDisplayed()) {
        //building the list of visible elements
        visibleList.add(element);
    }
}
//get the first item of the list (null if nothing is visible)
//you can return all if needed
return visibleList.isEmpty() ? null : visibleList.get(0);
In Java, you can use WebElement.isDisplayed().

How to specify the position of parsed XHTML using itextsharp

I use iTextSharp for creating a PDF. I need to place XHTML on it, so I use the XMLWorkerHelper class:
iTextSharp.tool.xml.XMLWorkerHelper worker = iTextSharp.tool.xml.XMLWorkerHelper.GetInstance();
worker.ParseXHtml(pdfWrite, doc, new StringReader(sb.ToString()));
However I would like to specify a position for the parsed XHTML. How do I do that?
EDIT:
I thought I would post the code in case someone else runs into this. The link provided below was for Java, and in C# things work a bit differently.
First you need a class for gathering the Elements:
class ElementHandlerClass : iTextSharp.tool.xml.IElementHandler
{
public List<IElement> elements = new List<IElement>();
public void Add(iTextSharp.tool.xml.IWritable input)
{
if (input is iTextSharp.tool.xml.pipeline.WritableElement)
{
elements.AddRange(((iTextSharp.tool.xml.pipeline.WritableElement)input).Elements());
}
}
}
Then you use it
ElementHandlerClass ehc = new ElementHandlerClass();
worker.ParseXHtml(ehc, new StringReader(sb.ToString()));
Now you have the elements. Next step is to create a ColumnText and fill it with the Elements:
iTextSharp.text.pdf.ColumnText ct = new iTextSharp.text.pdf.ColumnText(pdfWrite.DirectContent);
ct.SetSimpleColumn(200, 300, 300, 500);
foreach (IElement element in ehc.elements)
ct.AddElement(element);
ct.Go();
You need to combine the answers to two previous questions on StackOverflow.
The first answer you need is the one to "How to get particular html table contents to write it in pdf using itext".
In that answer, you learn how to parse an XHTML source into a list of Element objects.
Once you have this list, you need the answer to "itext ColumnText ignores alignment".
You can create a ColumnText object, define a rectangle with the setSimpleColumn() method, add all the elements retrieved from the XHTML using XML Worker with the addElement() method, and call go() to add that content.

how to get attribute value using selenium and css

I have the following HTML code:
2
I would like to get what is contained in href, i.e., I am looking for a command that would give me the "/search/?p=2&q=move&mt=1" value for href.
Could someone please help me with the respective command and css locator in selenium, for the above query?
if I have something like:
2
3
Out of these two, if I wanted to get the attribute value for the href whose text contains '2', what would my CSS locator syntax look like?
If your HTML consists solely of that one <a> tag, then this should do it:
String href = selenium.getAttribute("css=a@href");
You use the DefaultSelenium#getAttribute() method and pass in a CSS locator, an @ symbol, and the name of the attribute you want to fetch. In this case, you select the a and get its @href.
In response to your comment/edit:
The part after @ tells Selenium that that part is the name of the attribute.
You should place :contains('2') before @href because it's part of the locator, not the attribute. So, like this:
selenium.getAttribute("css=a:contains('2')@href");
Changing css=a@href to href should do the trick. Let me know if this did not work.
List<WebElement> ele = driver.findElements(By.className("c"));
for(WebElement e : ele)
{
String doctorname = e.getText();
String linkValue = e.getAttribute("href");
}

Can someone explain this seeming inconsistency in jQuery/Javascript?? (trailing brackets inconsistency on reads)

So, in my example below, "InputDate" is an input type=text, and "DateColumn" is a TD within a table with a class of "DateColumn".
Read the value of a discrete textbox:
var inputVal = $('#InputDate').val();
Read the value of a div within a table....
This works:
$('#theTable .DateColumn').each(function() {
var rowDate = Date.parse($(this)[0].innerHTML);
});
This doesn't:
$('#theTable .DateColumn').each(function() {
var rowDate = Date.parse($(this)[0].innerHTML());
});
The difference is the "()" after innerHTML. This behavior seems syntactically inconsistent between how you read a value from a textbox and how you read it from a div. I'm ok with sometimes, depending on the type of control, having to read .val vs .innerHTML vs.whateverElseDependingOnTheTypeOfControl...but this example leads me to believe I now must also memorize whether I need trailing brackets or not on each property/method.
So for a person like me who is relatively new to jQuery/Javascript: I seem to have figured out this particular anomaly, in this instance, but is there a convention I am missing out on, or does a person literally have to memorize whether each method does or does not need brackets?
innerHTML is plain JavaScript (part of the DOM, not jQuery), and is a property of an element. If you'd like to stick with the jQuery way of doing things, use html():
$('#theTable .DateColumn').each(function() {
var rowDate = Date.parse($(this).html() );
});
edit: a bit more clarification about your concerns. jQuery is pretty consistent in its syntax. Basically, most of the methods you find allow read/write access by adjusting the parameters passed to the method.
var css = $('#element').css('color'); // read the color of the element
$('#element').css('color', 'red'); // set the color to "red"
var contents = $('#element').html(); // grab the innerHTML of the element
$('#element').html('Hello World'); // set the innerHTML of this element
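How one method can serve as both getter and setter can be sketched in a few lines. This is not jQuery's actual implementation, only a minimal illustration of the pattern: the wrap function and its backing store object are hypothetical names made up for this example.

```javascript
// Hypothetical mini-wrapper, not jQuery itself: one method, two roles.
function wrap(store) {
  return {
    css(name, value) {
      if (value === undefined) {
        return store[name]; // one argument: read the current value
      }
      store[name] = value;  // two arguments: write the new value
      return this;          // return the wrapper so calls can be chained
    },
  };
}

const el = wrap({ color: "blue" });
const before = el.css("color");               // read
el.css("color", "red").css("width", "10px");  // write, chained
const after = el.css("color");                // read again
```

Either way the call needs parentheses, because css is a method. Only the number of arguments decides whether it reads or writes.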
.innerHTML is a property of the element not a method.
Property reference Example: object.MyProperty
Method Example: object.SomeFunction();
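The distinction can be checked at runtime. The sketch below uses a plain object as a hypothetical stand-in for a DOM element (the getText method is made up for illustration) and shows that putting "()" after a non-function property throws a TypeError, which is what happens with innerHTML():

```javascript
// Hypothetical stand-in for a DOM element: one data property, one method.
const cell = {
  innerHTML: "2024-01-15",  // property: holds a value
  getText() {               // method: a function that must be called
    return this.innerHTML;
  },
};

const viaProperty = cell.innerHTML; // read, no parentheses
const viaMethod = cell.getText();   // call, parentheses required

// Calling a property that is not a function throws a TypeError.
let errorName = null;
try {
  cell.innerHTML();
} catch (e) {
  errorName = e.name;
}

// typeof tells the two kinds apart without calling anything.
const kinds = [typeof cell.innerHTML, typeof cell.getText];
```

So rather than memorizing each name, you can inspect it with typeof in the browser console: "function" means it needs brackets, anything else is read directly.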
