Retrieve data from a website via Visual Basic - asp.net

There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.

How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that HtmlAgilityPack is the easiest way to accomplish this, and it is less error-prone than using Regex alone. Here is how I deal with scraping.
Create a new application and add HtmlAgilityPack via NuGet, which downloads HtmlAgilityPack.dll and references it for you. If you can use Chrome, it will let you inspect the page to find where your information is located: right-click a value you wish to capture and look for the table it is found in (follow the HTML up a bit).
The following example will extract all the values on that page within the "pricing" table. We need to know the XPath for the table (this value instructs HtmlAgilityPack what to look for) so that the document we create looks for our specific values. You can get it by finding whatever structure your values are in, right-clicking it, and choosing Copy XPath. From this we get...
//*[@id="pricing"]
Please note that the XPath Chrome gives you may sometimes be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations it could just as easily be a heading, a class, or whatever.
This XPath looks for something with an id equal to "pricing": that is our table. Looking further in, we see that our values sit within tbody, tr, and td tags. HtmlAgilityPack doesn't work well with tbody (browsers insert it even when it isn't in the raw HTML), so leave it out. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    ' Each iteration visits one td cell of the pricing table.
Next
To extract the values, we simply reference the table variable created in our loop and its InnerText member.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    MsgBox(table.InnerText)
Next
Now we have message boxes popping up with the values. You can swap the MsgBox for a list to fill, or store the values any way you wish. Then simply do the same for whatever other tables you want.
Please note that the Doc variable we created is reusable, so if you want to cycle through a different table on the same page, you do not have to reload it. That matters if you are making many requests: you don't want to slam the website, and if you are automating a large number of scrapes, reusing the loaded document keeps the request count down.
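For example, here is a minimal sketch that collects the values in a list instead of message boxes and then reuses the same Doc for a second table. The "product-details" id is hypothetical, used only for illustration:
Dim values As New List(Of String)
For Each cell As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    values.Add(cell.InnerText.Trim())
Next
' Reuse the already-loaded Doc for a second table; no second page load needed.
' "product-details" is a made-up id for illustration purposes.
Dim detailCells = Doc.DocumentNode.SelectNodes("//*[@id='product-details']/tr/td")
If detailCells IsNot Nothing Then ' SelectNodes returns Nothing when the XPath matches no nodes
    For Each cell As HtmlAgilityPack.HtmlNode In detailCells
        values.Add(cell.InnerText.Trim())
    Next
End If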
Scraping really is that easy. That's the basic idea. Have fun!

Html Agility Pack is going to be your friend!
What exactly is the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser

Related

Scraping variable names and values from html definition lists using R

I'm looking to extract some data from a definition list in some HTML code in R. So far I've done the following:
library(XML)  # htmlParse() and xpathSApply() come from the XML package
url <- "myurl"
doc <- htmlParse(url)
and then I (think I) want to use xpathSApply to extract the list data; however, I keep getting an error. I'm new to the concept of web scraping and HTML, so I'm not entirely sure how the function goes about locating the data to scrape.
How do I find the xpath to pass to xpathSApply?
An example URL would be http://opencorporates.com/companies/gb/06309283, and I would want to scrape the data regarding company name, number, address, directors, etc. into one observation per query.
Firefox has an amazing plugin called Firebug, and an extension to it called FirePath. Using them, you can right-click on any element on a web page and click "Inspect". That will show you the XPath to pass to xpathSApply.
If you can't use Firebug, there's a nifty bookmarklet called SelectorGadget that does much the same thing and should work in IE9.
Turns out the syntax I needed was '//node[@class="myclass"]' for use in the xpathSApply function. Cheers all.

Scraping ASP pages with Excel/VBA

I'm trying to scrape an ASP.NET page with Excel. Unfortunately, the page only returns 50 records at a time, across several pages. Excel's native Web Query module only picks up the first page; I want all the pages.
Like most (all?) ASP.NET pages, there are a few hidden variables sent back to the server when requesting a new page. The important ones are __VIEWSTATE and __EVENTVALIDATION.
I've written a VBA function that gets the HTML source of the page and scrapes these variables from it.
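The function is essentially this shape (a minimal sketch, not my exact code; MSXML2.XMLHTTP is one common way to fetch the source, and the marker string assumes ASP.NET's usual rendering of hidden inputs as id="..." value="..."):
' Sketch: fetch the page and pull a hidden field's value out of the HTML.
Function GetHiddenField(url As String, fieldName As String) As String
    Dim http As Object, html As String, marker As String
    Dim startPos As Long, endPos As Long
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False
    http.send
    html = http.responseText
    ' Assumes the field renders as: id="__VIEWSTATE" value="..."
    marker = "id=""" & fieldName & """ value="""
    startPos = InStr(html, marker)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(marker)
    endPos = InStr(startPos, html, """")
    GetHiddenField = Mid$(html, startPos, endPos - startPos)
End Function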
I've also written an .iqy file, which allows a POST request to be embedded in it. It looks something like this:
WEB
1
http://www.myaspwebsite/search/search_List.aspx
__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwULLTEy[....truncated ..50k characters..]Mhudyk5U6u8%2BBpvxDPN8R4%3D&__EVENTVALIDATION=%2FwEWFQL%2FkN%2FBCgL6g%2B5vAvfY06EOAoic4qIIAome%2Bf4PAuOrjYgIAuKrjYgIAuGrjYgIAuCrjYgIAuerjYgIAt7e34UPAvuL7m8CtuLToQ4CiaTioggCyKX5%2Fg8C4tv1sAgC49v1sAgC4Nv1sAgC4dv1sAgC5tv1sAgC%2Fd7fhQ%2BU8QRtxd7MM4Bpa%2F%2FZC7I64eUh3Q%3D%3D&ctl00_RadMenu1_ClientState=&ctl00%24ContentPlaceHolder1%24NavBar1%24PageNoDropDownList=2&ctl00%24ContentPlaceHolder1%24NavBar1%24btnGo=Go&ctl00%24ContentPlaceHolder1%24NavBar2%24PageNoDropDownList=1
Selection=AllTables
Formatting=None
PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=True
SingleBlockTextImport=False
DisableDateRecognition=False
DisableRedirections=False
This .iqy file successfully returns the desired results if the POST query is placed in the file.
I can also use this .iqy file programmatically in VBA and assign the POST query dynamically using QueryTables. However, I get told that my query returned nothing.
I suspect this is because of the length of my argument: the __VIEWSTATE alone is about 50k characters. I've tried printing the argument string to a file and it gets truncated. However, I can read the same string from a file and use it dynamically with success.
My questions are : Am I going about this the best way? What limitations should I be aware of when doing this? Also, is there a limit to string size in Excel?
According to Microsoft's documentation on Visual Basic strings (same value applies to VBA strings):
A string can contain from 0 to approximately two billion (2 ^ 31) Unicode characters.
That is more than enough to handle a 50k string. A simple way to bypass IDE line limits and Immediate-window printing limits is to write the string into worksheet cells and read it back into a variable when you need that piece of data (note that a single cell holds at most 32,767 characters, so a ~50k string has to be split across cells).
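A minimal VBA sketch of that round trip, splitting across cells to respect the 32,767-character cell limit (the "Scratch" worksheet name is arbitrary):
Sub ParkLongString(payload As String)
    ' A single cell holds at most 32,767 characters, so write the payload in chunks.
    Const CHUNK As Long = 32000
    Dim i As Long, col As Long
    col = 1
    For i = 1 To Len(payload) Step CHUNK
        Worksheets("Scratch").Cells(1, col).Value = Mid$(payload, i, CHUNK)
        col = col + 1
    Next i
End Sub
Function ReadLongString() As String
    ' Reassemble the string by walking the row until an empty cell is hit.
    Dim col As Long, s As String
    col = 1
    Do While Len(Worksheets("Scratch").Cells(1, col).Value) > 0
        s = s & Worksheets("Scratch").Cells(1, col).Value
        col = col + 1
    Loop
    ReadLongString = s
End Function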

How to generate an archive of multiple pages in Spring MVC?

I want to allow my users to "bulk export" an archive of selected resources, i.e., http://.../resource/1, resource/2, resource/4, ...
My thought was "render the HTML of each page to a string and use java.util.zip to create a multifile archive."
My problem then became "how to get the HTML of a page so that I can loop over them?"
I cannot figure out a way to get a JstlView to render to a String, nor can I see a way to set the ServletOutputStream to be a ZipOutputStream.
My last thought is to actually GET the HTML of each of the resources via HTTP. I imagine that will be easy enough to code, but it seems pretty byzantine. Is there a better way? (Perhaps something with RequestDispatcher.forward()?)
Use a SwallowingHttpServletResponse from DWR (or a PageResponseWrapper from Sitemesh) as a parameter to RequestDispatcher.include() and then get the output from that response object.
See my response (no pun intended) to this question.

HTTPModule filter questions

I have one issue I'm struggling with regarding my HttpModule filter:
1) I notice that the module gets its data in chunks. This is problematic for me because I'm using a regex to find and replace: if I get a partial match in one chunk and the rest of the match in the next, it will not work. Is there any way to get the entire response before I do my thing to it? I have seen code that appends data to a StringBuilder until it matches an "</html>" end tag, but my code must work for more than just HTML (XML, custom tags, etc.). I don't know how to detect the end of the stream, or whether that is even possible.
I am attaching the filter in BeginRequest.
Have a look at this example. It looks for "</html>" in the stream of the page.
Here's a sample project that performs a buffered search-and-replace in an HttpModule using Request.Filter and Response.Filter. You should be able to adapt the technique to perform a regex replacement easily.
https://github.com/snives/HttpModuleRewrite
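In case that link ever goes stale, the core idea looks roughly like this (a VB.NET sketch, not the linked project's code, and it assumes a UTF-8 response): buffer every chunk, then run the regex and write downstream only once the response is complete.
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

' Buffers the entire response so a regex match can never be split across chunks.
' Attach it in BeginRequest with: context.Response.Filter = New BufferingFilter(context.Response.Filter)
Public Class BufferingFilter
    Inherits Stream

    Private ReadOnly _inner As Stream            ' the original response filter chain
    Private ReadOnly _buffer As New MemoryStream()

    Public Sub New(inner As Stream)
        _inner = inner
    End Sub

    Public Overrides Sub Write(buffer As Byte(), offset As Integer, count As Integer)
        _buffer.Write(buffer, offset, count)     ' hold each chunk instead of passing it on
    End Sub

    Public Overrides Sub Flush()
        ' Deliberately empty: nothing goes downstream until Close, when the response is complete.
    End Sub

    Public Overrides Sub Close()
        Dim html As String = Encoding.UTF8.GetString(_buffer.ToArray())
        ' "find-me" is a placeholder pattern; substitute your own regex and replacement.
        html = Regex.Replace(html, "find-me", "replacement")
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(html)
        _inner.Write(bytes, 0, bytes.Length)
        _inner.Close()
        MyBase.Close()
    End Sub

    ' Minimal Stream plumbing required by the abstract base class.
    Public Overrides ReadOnly Property CanRead As Boolean
        Get
            Return False
        End Get
    End Property
    Public Overrides ReadOnly Property CanSeek As Boolean
        Get
            Return False
        End Get
    End Property
    Public Overrides ReadOnly Property CanWrite As Boolean
        Get
            Return True
        End Get
    End Property
    Public Overrides ReadOnly Property Length As Long
        Get
            Return _buffer.Length
        End Get
    End Property
    Public Overrides Property Position As Long
        Get
            Return _buffer.Position
        End Get
        Set(value As Long)
            Throw New NotSupportedException()
        End Set
    End Property
    Public Overrides Function Read(buffer As Byte(), offset As Integer, count As Integer) As Integer
        Throw New NotSupportedException()
    End Function
    Public Overrides Function Seek(offset As Long, origin As SeekOrigin) As Long
        Throw New NotSupportedException()
    End Function
    Public Overrides Sub SetLength(value As Long)
        Throw New NotSupportedException()
    End Sub
End Class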

Deleting / Replacing A Node in E4X (AS3 - Flex)

I'm building a listing/grid control in a Flex application and using it in a .NET web application. To make a really long story short, I am getting XML of serialized objects from a webservice. There is a limit on how many items can be on a page. I've taken a data grid and made it page, sort across pages, and handle some basic filtering.
In regards to paging I'm using a Dictionary keyed on the page and storing the XML for that page. This way whenever a user comes back to a page that I've saved into this dictionary I can grab the XML from local memory instead of hitting the webservice. Basically, I'm caching the data retrieved from each call to the webservice for a page of data.
There are several things that can expire my cache. Filtering and sorting are the main reason. However, a user may edit a row of data in the grid by opening an editor. The data they edit could cause the data displayed in the row to be stale. I could easily go to the webservice and get the whole page of data, but since the page size is set at runtime I could be looking at a large amount of records to retrieve.
So let me now get to the heart of the issue that I am experiencing. In order to prevent getting the whole page of data back I make a call to the webservice asking for the completely updated record (the editor handles saving its data).
Since I'm using custom objects, I need to serialize them on the server to XML (this is handled already for other portions of our software). All data is handled as XML via E4X. The cache in the Dictionary is stored as an XMLList.
Now let me show you my code...
var idOfReplacee:String = this._WebService.GetSingleModelXml.lastResult.*[0].*[0].@Id;
var xmlToReplace:XMLList = this._DataPages[this._Options.PageIndex].Data.(@Id == idOfReplacee);
if (xmlToReplace.length() > 0)
{
    delete (this._DataPages[this._Options.PageIndex].Data.(@Id == idOfReplacee)[0]);
    this._DataPages[this._Options.PageIndex].Data += this._WebService.GetSingleModelXml.lastResult.*[0].*[0];
}
Basically, I get the id of the node I want to replace. Then I find it in the cache's Data property (XMLList). I make sure it exists since the filter on the second line returns the XMLList.
The problem I have is with the delete line: I cannot make it remove that node from the list. The line following the delete works; it adds the new node to the list.
How do I replace or delete that node (the node found by the filter statement in the cache's Data property)?
Thanks for the answers guys.
@Theo:
I tried the replace several different ways. For some reason it would never error, but never update the list.
@Matt:
I figured out a solution. The issue wasn't coming from what you suggested, but from how the delete works with Lists (at least how I have it in this instance).
The Data property of the _DataPages dictionary object is a list of Definition nodes (arrived at by previously filtering another XML document).
<Models>
<Definition Id='1' />
<Definition Id='2' />
</Models>
I ended up doing this little deal:
//gets the index of the node to replace from the same filter
var childIndex:int = (this._DataPages[this._Options.PageIndex].Data.(@Id == idOfReplacee)[0]).childIndex();
//deletes the node from the list
delete this._DataPages[this._Options.PageIndex].Data[childIndex];
//appends the new node from the webservice to the list
this._DataPages[this._Options.PageIndex].Data += this._WebService.GetSingleModelXml.lastResult.*[0].*[0];
So basically I had to get the index of the node in the XMLList that is the Data property. From there I could use the delete keyword to remove it from the list. The += adds my new node to the list.
I'm so used to the ActiveX and Mozilla XmlDocument approach, where you call "selectSingleNode" and then "replaceChild" to do this kind of thing. Oh well, at least this is in a forum where someone else can find it. I don't know the procedure for answering my own question; perhaps this insight will help someone come along and answer it better!
Perhaps you could use replace instead?
var oldNode : XML = this._DataPages[this._Options.PageIndex].Data.(@Id == idOfReplacee)[0];
var newNode : XML = this._WebService.GetSingleModelXml.lastResult.*[0].*[0];
oldNode.parent.replace(oldNode, newNode);
I know this is an incredibly old question, but I don't see (what I think is) the simplest solution to this problem.
Theo had the right direction here, but there are a number of errors in the way replace was being used (not least the fact that pretty much everything in E4X is a function).
I believe this will do the trick:
oldNode.parent().replace(oldNode.childIndex(), newNode);
replace() can take a number of different types in the first parameter, but AFAIK, XML objects are not one of them.
I don't immediately see the problem, so I can only venture a guess. The delete line that you've got is looking for the first item at the top level of the list which has an attribute "Id" with a value equal to idOfReplacee. Ensure that you don't need to dig deeper into the XML structure to find that matching id.
Try this instead:
delete (this._DataPages[this._Options.PageIndex].Data..(@Id == idOfReplacee)[0]);
(Notice the extra '.' after Data). You could more easily debug this by setting a breakpoint on the second line of the code you posted, and ensure that the XMLList looks like you expect.
