How to extract element id attribute values from HTML


I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.
What I would ideally like is something that would extract every HTML attribute value that begins with "ctl00" into a list. The regex Find function in Notepad++ would be perfect, if only I knew what the regex should be?
As an example, if the HTML is:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />
I would like the output to be something like:
name="ctl00$Header$Search$Keywords"
A more advanced search might include the element name as well (e.g. control type):
input|name="ctl00$Header$Search$Keywords"
In order to cope with both Id and Name attributes I will simply rerun the search looking for Id instead of Name (i.e. I don't need something that will search for both at the same time).
The final output will be an Excel report that lists the number of server controls on the page and the length of each control's name, possibly sorted by control type.

Quick and dirty:
Search for
\w+\s*=\s*"ctl00[^"]*"
This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It will not check whether this really occurs within an HTML tag - that's a little more difficult to do and perhaps unnecessary? It will also not handle escaped quotes within the attribute value. As usual with regexes, the complexity required depends on what exactly you want to match and what your input looks like...
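Outside Notepad++, the same pattern can be applied with a few lines of script. The following is a minimal sketch in Python using only the standard re module; the sample markup and the function name find_ctl00_attributes are made up for illustration.

```python
import re

# Pattern from the answer above: an attribute name, "=", then a quoted
# value that starts with ctl00.
ATTR_PATTERN = re.compile(r'\w+\s*=\s*"ctl00[^"]*"')

# Hypothetical sample markup, based on the example in the question.
SAMPLE_HTML = '''
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />
<div id="ctl00_Header_Panel" class="panel"></div>
'''


def find_ctl00_attributes(html):
    """Return every attribute-like substring whose value starts with ctl00."""
    return ATTR_PATTERN.findall(html)


matches = find_ctl00_attributes(SAMPLE_HTML)
for match in matches:
    print(match)
```

Like the regex itself, this does not check that a match really sits inside a tag.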

"7000"? "Hundreds"? Dear god.
Since you're just looking at source in a text editor, try this... /(id|name)="ct[^"]*"/

Answering my own question, the easiest way to do this is to use BeautifulSoup, the 'dirty HTML' Python parser whose tagline is:
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
It works, and it's available from here - http://crummy.com/software/BeautifulSoup
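For illustration, here is a rough sketch of the parser-based approach. Note that it uses html.parser from the Python standard library rather than BeautifulSoup, so it runs without installing anything; the class name Ctl00Collector, the helper collect_ctl00 and the sample markup are assumptions, not part of the original answer.

```python
from html.parser import HTMLParser


class Ctl00Collector(HTMLParser):
    """Collect (tag, attribute, value) for id/name values starting with a prefix."""

    def __init__(self, prefix="ctl00"):
        super().__init__()
        self.prefix = prefix
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        for name, value in attrs:
            if name in ("id", "name") and value and value.startswith(self.prefix):
                self.found.append((tag, name, value))


def collect_ctl00(html):
    parser = Ctl00Collector()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical sample markup, based on the example in the question.
SAMPLE_HTML = '''
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />
<div id="ctl00_Header_Panel" class="panel"></div>
'''

rows = collect_ctl00(SAMPLE_HTML)
for tag, name, value in rows:
    # One line per control: element|attribute="value", then the value length.
    print('{}|{}="{}"\t{}'.format(tag, name, value, len(value)))
```

The tab-separated output can be pasted into a spreadsheet to sort by control type or by name length.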

I suggest xpath, as in this question

Related

Trying to search through text on a website in Playwright API

The text I'm searching for is all contained within a CSS class called "content-center", and within that is a series of CSS classes, all with the same name, that hold similar but different information. It seems to only be returning [<JSHandle preview=JSHandle#node>] rather than returning the text itself, as if saying "yes, this text is on the page X times".
page.wait_for_selector('.content-center')
print(page.query_selector_all(".content-center:has-text('Bob Johnson')"))

page.query_selector_all returns the ElementHandle[] values of the elements that were found. You can loop over these and call the text_content() method on each one to get the text out of that specific element.
Also, in most cases it's enough to use the text selectors to verify that something is on the page or that an element has text; see the Playwright documentation on text selectors for reference.
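The loop described above could look roughly like this. It is a sketch that assumes page is an already opened Playwright sync-API page; the helper name texts_matching is hypothetical.

```python
def texts_matching(page, selector, needle):
    """Return the text of every element matching selector that contains needle."""
    texts = []
    for element in page.query_selector_all(selector):
        # text_content() can return None, so fall back to an empty string.
        text = element.text_content() or ""
        if needle in text:
            texts.append(text.strip())
    return texts


# Usage inside an existing Playwright session (page comes from playwright.sync_api):
# page.wait_for_selector(".content-center")
# found = texts_matching(page, ".content-center", "Bob Johnson")
# print(len(found), found)
```

The length of the returned list answers "how many times", and the list itself holds the text rather than the handles.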

How to verify a text is present on a webpage for 'n' times

I want to verify that a text appears on a webpage 2 times, or 'n' times. I have used the "Page Should Contain" keyword, but it passes as soon as it finds a single occurrence. I don't want to verify using a locator.
Ex: I want to verify that the text "Success" appears on the current webpage 3 times using Robot Framework.
Any inputs/suggestions would be helpful.

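Too bad you don't want to use a locator, as Robot Framework has a keyword just for that:
Xpath Should Match X Times    //*[contains(., "Success")]    2
The caveat is that the locator should not be prefixed with xpath= - just the bare expression.
The library keyword Page Should Contain does pretty much exactly that, by the way.
And if you want to find how many times the string is present in the page - easy:
${count}=    Get Matching Xpath Count    //*[contains(., "Success")]
And then do any kind of checks on the result, e.g.
Should Be Equal    ${count}    2
Note that contains(., "Success") also matches every ancestor of the element holding the text (body, html, ...), so the count can be higher than the number of visible occurrences; //*[text()[contains(., "Success")]] limits the match to elements whose own text contains the string.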

I thought the problem of not using a locator sounded fun (the rationale behind the requirement is still unclear), so here is another solution - look in the source yourself:
${source}=    Get Source    # you have the whole html of the page here
${matches}=    Get Regexp Matches    ${source}    >.*\b(Success)\b.*<
${count}=    Get Length    ${matches}
The first one gets the source, the second gets all non-overlapping (separate) occurrences of the target string, when it is (hopefully) inside a tag. The third line returns the count.
Disclaimer - please don't actually do that, unless you're 100% sure of the source and the structure. Use a locator.
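To make the "look in the source yourself" idea concrete, here is a small sketch in Python rather than Robot Framework. It counts whole-word occurrences in text nodes only, using the standard library html.parser; the class name TextCounter and the sample page are assumptions, and the same disclaimer applies.

```python
import re
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count whole-word occurrences of a word in text nodes, skipping script/style."""

    def __init__(self, word):
        super().__init__()
        self.pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        self.count = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never reach handle_data, so class="Success" is ignored.
        if not self._skip_depth:
            self.count += len(self.pattern.findall(data))


def count_text(source, word):
    counter = TextCounter(word)
    counter.feed(source)
    counter.close()
    return counter.count


# Hypothetical page source.
SOURCE = (
    '<html><body><p>Success</p>'
    '<div class="Success">Upload: Success, Save: Success</div>'
    '<script>var s = "Success";</script></body></html>'
)

total = count_text(SOURCE, "Success")
print(total)
```

Unlike the greedy regex, this counts two occurrences on the same line separately.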

Retrieve data from a website via Visual Basic

There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.

How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that HtmlAgilityPack is the easiest way to accomplish this. It is less error prone than just using Regex. The following is how I deal with scraping.
Create a new application, add HtmlAgilityPack via NuGet (or download the HtmlAgilityPack DLL), and reference it. If you can use Chrome, it will allow you to inspect the page to find out where your information is located. Right-click on a value you wish to capture and look for the table that it is found in (follow the HTML up a bit).
The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value tells HtmlAgilityPack what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in, right-clicking it and choosing Copy XPath. From this we get...
//*[@id="pricing"]
Please note that sometimes the XPath you get from Chrome may be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations it could easily be headings or class or whatever.
This XPath value looks for something with the id equal to pricing; that is our table. When we look further in, we see that our values are within tbody, tr and td tags. HtmlAgilityPack doesn't work well with the tbody, so ignore it. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
' Load the page and select every cell of the pricing table.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    ' Each matching cell is visited here; nothing is done with it yet.
Next
To extract the values we simply reference the table variable that was created in our loop and its InnerText member.
' SelectNodes returns Nothing when nothing matches, so check for that in real code.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    ' Show the text of the current cell.
    MsgBox(table.InnerText)
Next
Now we have message boxes that pop up the values... you can switch the message box for an ArrayList to fill, or whatever way you wish to store the values. Now simply do the same for whatever other tables you wish to get.
Please note that the Doc variable that was created is reusable, so if you want to cycle through a different table on the same page, you do not have to reload the page. This is a good idea especially if you are making many requests: you don't want to slam the website, and if you are automating a large number of scrapes, put some time between requests.
Scraping is really that easy. That is the basic idea. Have fun!
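To see what the XPath //*[@id='pricing']/tr/td selects, here is a small sketch in Python (not VB.NET) using the standard library xml.etree.ElementTree, which supports this subset of XPath. The table markup is invented and well-formed; a real page would need a tolerant HTML parser such as HtmlAgilityPack.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the pricing table on the product page.
SAMPLE = """
<html><body>
  <table id="pricing">
    <tr><td>1</td><td>0.50</td></tr>
    <tr><td>10</td><td>0.45</td></tr>
  </table>
  <table id="other"><tr><td>ignored</td></tr></table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same idea as the XPath above: any element with id="pricing", then its tr/td children.
cells = [td.text for td in root.findall(".//*[@id='pricing']/tr/td")]
print(cells)
```

The parsed root plays the role of the Doc variable: it can be queried again for another table without reloading anything.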

Html Agility Pack is going to be your friend!
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser

What's the correct format for TCDL linkAttributes?

I can see the technology-independent Tridion Content Delivery Language (TCDL) link has the following parameters, which are pretty well described on SDL Live Content.
type
origin
destination
templateURI
linkAttributes
textOnFail
addAnchor
VariantId
How do we add multiple attribute-value pairs for the linkAttributes? Specifically, what do we use to escape the double quotes as well as separate pairs (e.g. if we need class="someclass" and onclick="someevent").

The separate pairs are just space delimited, like a normal series of attributes. Try XML encoding the value of linkAttributes, however. So " becomes &quot;, etc...
If you are using some JavaScript, you might need to take care of the JavaScript quotes too, as in \".
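As an illustration of the XML encoding suggested above, here is a sketch in Python using html.escape from the standard library. The helper name encode_link_attributes is hypothetical, and whether Tridion accepts the encoded value is the answer's suggestion, not something this snippet verifies.

```python
from html import escape


def encode_link_attributes(pairs):
    """Join name="value" pairs with spaces, then XML-encode the whole string."""
    raw = " ".join('{}="{}"'.format(name, value) for name, value in pairs)
    # quote=True also encodes double quotes as &quot;
    return escape(raw, quote=True)


encoded = encode_link_attributes([("class", "someclass"), ("onclick", "someevent")])
print(encoded)
```

The encoded string is what would go between the double quotes of linkAttributes="...".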

Edit: after I figured out your real question, the answer is a lot simpler:
You should wrap the values inside your linkAttributes in single quotes. Spaces inside linkAttributes are typically handled fine; but if not, escape them with %20.
If you need something more or want something that isn't handled by the standard tcdl:ComponentLink, remember that you can always create your own TCDL tag and use a TagHandler or TagRenderer (look them up in the docs for examples or search for Jaime's article on TagRenderer) to do precisely what you want.
My original answer was to a question you didn't ask: what is the format for TCDL tags (in general). But the explanation might still be useful to some, so it remains below.
I'd suggest having a look at what format the default building blocks (e.g. the Link Resolver TBB in the Default Finish Actions) output and use that as a guideline.
This is what I could quickly get from the transport package of a published page:
<tcdl:Link type="Page" origin="tcm:5-199-64" destination="tcm:5-206-64"
templateURI="tcm:0-0-0" linkAttributes="" textOnFail="true"
addAnchor="" variantId="">Home</tcdl:Link>
<tcdl:ComponentPresentation type="Embedded" componentURI="tcm:5-69"
templateURI="tcm:5-133-32">
<span>
...
One of the things that I know from experience: your entire TCDL tag will have to be on a single line (I wrapped the lines above for readability only). Or at least that is the case if it is used to invoke a REL TagRenderer. Clearly the tcdl:ComponentPresentation tag above will span multiple lines, so that "single line rule" doesn't apply everywhere.
And that is probably the best advice: given the fact that TCDL tags are processed at multiple points in Tridion Publishing, Deployment and Delivery pipeline, I'd stick to the format that the default TBBs output. And from my sample that seems to be: put everything on a single line and wrap the values in (double) quotes.

pass data from parent to popup

Apologies if this seems like a duplicate post...
Thomas Warner kindly answered an earlier post suggesting I use:
Popup.aspx?Data1=Piece_of_data&Data2=Piece_of_data
Just want to ask, if my code is Popup.aspx?Data1=textbox1.text&Data2=textbox2.text
what's the proper way to reference what's in the textboxes?
The way it is above, all that appears in the popup is the literal text 'textbox1.text'
rather than what is actually in that control.
thanks again

Using ASP.NET you can literally write the value straight into the string like:
Popup.aspx?Data1=<%=textbox1.Text%>&Data2=<%=textbox2.Text%>
A better way of doing this would be to build up the URL string in your codebehind, so as not to mix your HTML and C# code.
That way you could do something like:
String popupUrl = String.Format("Popup.aspx?Data1={0}&Data2={1}",
    textbox1.Text, textbox2.Text);
This will also allow you to do any sanitization checks on the values from the textboxes before you start passing those values around.
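One check worth doing at that point is URL-encoding the values, since characters such as & or = in a textbox would otherwise break the query string. The idea is sketched here in Python with urllib.parse.urlencode rather than C#; the function name build_popup_url and the sample values are made up.

```python
from urllib.parse import urlencode


def build_popup_url(data1, data2, page="Popup.aspx"):
    """Build the popup URL with both values percent-encoded."""
    return page + "?" + urlencode({"Data1": data1, "Data2": data2})


# Hypothetical textbox contents containing characters that need encoding.
url = build_popup_url("Smith & Sons", "a=b")
print(url)
```

The popup page then receives the original text after the framework decodes the query string.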
