I'm trying to scrape an ASP.NET page with Excel. Unfortunately, the page only returns 50 records at a time, of several pages. Excel's native Web Query module only picks up the first page. I want all the pages.
Like most (all?) ASP.NET pages, there are a few hidden variables sent back to the server when requesting a new page. The important ones are __VIEWSTATE and __EVENTVALIDATION.
I've written a VBA function that gets the HTML source of the page and scrapes these variables from it.
I've also written an .iqy page, which allows for POST requests in it. It looks something like this:
WEB
1
http://www.myaspwebsite/search/search_List.aspx
__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwULLTEy[....truncated ..50k characters..]Mhudyk5U6u8%2BBpvxDPN8R4%3D&__EVENTVALIDATION=%2FwEWFQL%2FkN%2FBCgL6g%2B5vAvfY06EOAoic4qIIAome%2Bf4PAuOrjYgIAuKrjYgIAuGrjYgIAuCrjYgIAuerjYgIAt7e34UPAvuL7m8CtuLToQ4CiaTioggCyKX5%2Fg8C4tv1sAgC49v1sAgC4Nv1sAgC4dv1sAgC5tv1sAgC%2Fd7fhQ%2BU8QRtxd7MM4Bpa%2F%2FZC7I64eUh3Q%3D%3D&ctl00_RadMenu1_ClientState=&ctl00%24ContentPlaceHolder1%24NavBar1%24PageNoDropDownList=2&ctl00%24ContentPlaceHolder1%24NavBar1%24btnGo=Go&ctl00%24ContentPlaceHolder1%24NavBar2%24PageNoDropDownList=1
Selection=AllTables
Formatting=None
PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=True
SingleBlockTextImport=False
DisableDateRecognition=False
DisableRedirections=False
This .iqy page successfully returns the desired results if the POST query is placed in the file.
I can also use this .iqy page programmatically in VBA and assign the POST query dynamically using QueryTables. However, I get told that my query returned nothing.
I suspect this is because of the length of my argument. The VIEWSTATE alone is about 50k characters. I've tried printing the argument string to a file and it truncates it. However, I can read the same string from a file and use it dynamically successfully.
My questions are: Am I going about this the best way? What limitations should I be aware of when doing this? Also, is there a limit to string size in Excel?
According to Microsoft's documentation on Visual Basic strings (same value applies to VBA strings):
A string can contain from 0 to approximately two billion (2 ^ 31) Unicode characters.
That is more than enough to handle a 50k string. A simple way to bypass IDE line limits and immediate window printing limits would be to print the string into an Excel cell and then read it back into a variable when you need to use that piece of data.
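For comparison, the scrape-and-repost step described in the question can be sketched outside Excel. This is a minimal Python sketch, not the asker's VBA: it collects every hidden input (including __VIEWSTATE and __EVENTVALIDATION) from a page's HTML and URL-encodes them together with the paging field. The names HiddenFieldParser and build_post_body are made up for illustration, and the sample values in the usage lines are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Return an application/x-www-form-urlencoded body for the next request."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields.update(extra_fields)  # e.g. the page-number dropdown and the Go button
    return urlencode(fields)     # percent-encodes "/", "+", "=" and "$"


# Placeholder page and values, only to show the shape of the result.
sample_html = (
    '<form><input type="hidden" name="__VIEWSTATE" value="/wEP+abc=" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="/wEW" /></form>'
)
body = build_post_body(
    sample_html,
    {"ctl00$ContentPlaceHolder1$NavBar1$PageNoDropDownList": "2"},
)
print(body)
```

Because the whole body is built in memory, the 50k-character view state never has to pass through a print or display step that might truncate it.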
I came across an unusual URL structure on a site. It looked like this:
https://www.agilealliance.org/glossary/xp/#q=~(infinite~false~filters~(postType~(~'post~'aa_book~'aa_event_session~'aa_experience_report)~tags~(~'xp))~searchTerm~'~sort~false~sortDirection~'asc~page~1)
It seems the category, pagination and sort options of a widget on the page write to and read from these values. Does this format for storing data in the URL have a name, or is this an esoteric format someone made?
What's the purpose of doing this over using regular GET params, or at least using a more conventional format after the fragment?
If you inspect the URL carefully, you'll see that the parameters you describe are placed after the fragment (#), meaning they're not sent to the server but used by the client instead.
In this case, the client (JavaScript) builds them into something like an ElasticSearch query that's then POSTed to the server, in order to update the listing you see on your screen.
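To see the split concretely, here is a small Python sketch that parses the URL from the question with the standard library. Everything after # ends up in the fragment, and the part a browser would actually request from the server is just the path (plus a query string, if there were one). The variable names are illustrative only.

```python
from urllib.parse import urlsplit

URL = (
    "https://www.agilealliance.org/glossary/xp/#q=~(infinite~false~filters~"
    "(postType~(~'post~'aa_book~'aa_event_session~'aa_experience_report)~tags~"
    "(~'xp))~searchTerm~'~sort~false~sortDirection~'asc~page~1)"
)

parts = urlsplit(URL)

# What goes on the HTTP request line: path and query only, never the fragment.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print("sent to server:", request_target)
print("kept in browser:", parts.fragment[:40] + "...")
```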
I have the following string:
soqDi22c2_A-eY4ahWKJV6GAYgmuJBZ3poNNEixha1lOhXxxoucRuuzmcyDD_9ZYp_ECXRPbrBf6issNn23CUDJrh_A5L3Y5dHhB0o_U5Oq_j4rDCXOJ4Q==
It's a query parameter generated by a form on a page. (This is done server-side in ASP.NET.) We are able to submit this form programmatically and get the string we need (it just leads to a detail page of an object [real-world parcel/building, publicly accessible]) and redirect our user to it. However, I would like to know if there is a way to decrypt/deobfuscate this string to see what it contains, and whether we could generate these ourselves without going through the form (it's a multi-step form).
The string also has some sort of expiration, so I sadly cannot provide a link to the result page, as it would stop working after like 10 minutes or so.
It feels a bit like it's base64, but after trying to run it through base64 -d, it says it's invalid.
It's likely base64 with + and / replaced with - and _ to make it more browser-friendly.
Though even if it's base64-encoded, it may just be a completely random key. You won't necessarily be able to decode it to something readable.
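A quick way to check the URL-safe base64 guess is to decode with that alphabet and look at the raw bytes. The sketch below uses Python's standard base64 module; decode_urlsafe is a made-up helper name that also restores any padding a URL might have dropped.

```python
import base64

TOKEN = (
    "soqDi22c2_A-eY4ahWKJV6GAYgmuJBZ3poNNEixha1lOhXxxoucRuuzmcyDD_9ZYp_"
    "ECXRPbrBf6issNn23CUDJrh_A5L3Y5dHhB0o_U5Oq_j4rDCXOJ4Q=="
)


def decode_urlsafe(token: str) -> bytes:
    """Decode base64 that uses '-' and '_' in place of '+' and '/'."""
    stripped = token.rstrip("=")
    # Re-add the padding so the length is a multiple of four.
    padded = stripped + "=" * (-len(stripped) % 4)
    return base64.urlsafe_b64decode(padded)


raw = decode_urlsafe(TOKEN)
print(len(raw), "bytes, starting with", raw[:8].hex())
```

If the decoded bytes show no readable text and no recognizable structure, the value is most likely encrypted or random, which matches the caveat above: decoding succeeds, but the content stays opaque without the server's key.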
The problem here is the middle of the line (HTML).
The chain:
I have a WinForms program that uses Awesomium (an alternative to the native WebBrowser control) to view an HTML page that embeds part of an ASP.NET page in its iframe.
The problem:
The problem is that I need to pass value to asp.net page, it is easily achieved without middle of the chain (Html iframe) by sending hashed and crypted querystring.
How it works:
The WinForms app does some work, then uses a multi-step encryption to encode all the needed values into one string.
Then it should send this string to the ASP.NET page through the iframe (and that's the problem: it is easy to receive a query string in the ASP.NET page, but first I need to receive it in the HTML page and pass it on to ASP.NET).
Acceptable answers:
1) Probably the most easily one - using JavaScript. I have heard it is possible to be done in that way.
How I imagine this: I send the query string from WinForms to the HTML page as http://HtmlPage.html?AspNet.aspx?CryptedString
Then the HTML page receives it with JavaScript and puts the query string "AspNet.aspx?CryptedString" into the iframe's "src=http://", resulting in "src=http://AspNet.aspx?CryptedString"
And then I easily get it in asp.net page.
2) Somehow create >>>VIRTUAL<<<(NOTE: Virtual, I don't want querystring to be saved on the HDD, even don't suggest) asp.net or html page with iframe source taken directly from WinForm string.
Probably that is possible with awesomium, but I'm new to it and don't know how to (if it is possible ofc).
3) Some web service with which I can communicate between asp.net and WinForm through the existing HTML iframe.
4) Another way that replace one of 3 previous, that doesn't save "values" in querystring/else on HDD nor is visible for the user, doesn't use asp.net page's server to create iframe-page on it. On HTML page's server HTML is only allowed, PhP isn't.
5) If you don't know any of 4 above - suggest free PhP hosting without ads (if such exists, what I highly doubt).
Priority:
The best one would be #3, then #2, then #1, then #5 (#4 is excluded as it is unknown).
And in the end:
Thanks in advance for your help.
P.S.Currently at work, so I'll check/try all answers later on and will report tomorrow if any suits my needs. Thanks again.
Answering my own question. I have found 2 ways that can do what I did want.
The first one:
Creating an in-memory "file" with System.IO.MemoryStream or another method (google c# create a file in ram).
The second one:
Creating a hidden+encrypted+system+custom-readable-only-by-program-crypt file somewhere in the far away folder via File.SetAttributes Method and System.IO.StreamWriter/Reader or System.IO.FileStream or System.IO.TextWriter, etc. depending on what it should be.
Once the file has served its purpose, delete it (and also delete it on exit and on startup) using
if (File.Exists(path))
{
File.Delete(path);
}
(Need more reputation to post few links -_-, and I don't want to post only part of them, either all or no at all, so use google if you'll need anything from here).
If you'll need to store "Small temp file" and not for a long time use first one, if "Heavy" use second one, unless you badly need to use RAM for it.
There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.
How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that htmlagilitypack is the easiest way to accomplish this. It is less error prone than just using Regex. The following will be how I deal with scraping.
Create a new application, add HtmlAgilityPack via NuGet (or download the DLL), and reference it. If you can use Chrome, it will allow you to inspect the page to find where your information is located. Right-click on a value you wish to capture and look for the table that it is found in (follow the HTML up a bit).
The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value is used to instruct htmlagilitypack on what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in and right click copy XPath. From this we get...
//*[@id="pricing"]
Please note that sometimes the XPath you get from Chrome may be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations, it could easily be headings or class or whatever.
This XPath value looks for something with the id equal to pricing; that is our table. When we look further in, we see that our values are within tbody, tr and td tags. HtmlAgilityPack doesn't work well with the tbody, so ignore it. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
Next
To extract the values we simply reference the table variable created by our loop and its InnerText member.
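Dim Web As New HtmlAgilityPack.HtmlWeb
' Web.Load returns the parsed document, so no separate New HtmlDocument is needed
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
' SelectNodes returns Nothing when the XPath matches no nodes
Dim cells As HtmlAgilityPack.HtmlNodeCollection = Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
If cells IsNot Nothing Then
    For Each table As HtmlAgilityPack.HtmlNode In cells
        MsgBox(table.InnerText)
    Next
End If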
Now we have message boxes that pop up the values...you can switch the message box for an arraylist to fill or whatever way you wish to store the values. Now simply do the same for whatever other tables you wish to get.
Please note that the Doc variable that was created is reusable, so if you wanted to cycle through a different table in the same page, you do not have to reload the page. This is a good idea especially if you are making many requests, you don't want to slam the website, and if you are automating a large number of scrapes, it puts some time between requests.
Scraping is really that easy. That is the basic idea. Have fun!
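If you want to try the XPath idea without hitting the website, the same kind of expression can be run against a small inline table. The sketch below is Python with the standard library's ElementTree instead of HtmlAgilityPack, so it only handles well-formed markup; the table contents are invented sample values, not data from the real page, and cell_texts is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample markup standing in for the downloaded page.
SAMPLE = """<html><body>
<table id="pricing">
  <tr><td>1</td><td>0.52000</td></tr>
  <tr><td>10</td><td>0.46900</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
</body></html>"""


def cell_texts(markup, table_id):
    """Return the text of every td directly under a tr of the element with this id."""
    root = ET.fromstring(markup)
    # '@id' selects by attribute; './/*' searches all descendants of the root.
    cells = root.findall(f".//*[@id='{table_id}']/tr/td")
    return [(cell.text or "").strip() for cell in cells]


print(cell_texts(SAMPLE, "pricing"))
```

Only cells from the table whose id matches are returned, which is the reason for anchoring the expression on something unique such as the id.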
Html Agility Pack is going to be your friend!
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser
I have a web application that is using UrlRewriting. Now I want to set it that if the user enters the page with a url in re-written format, all the links apply the same format, otherwise they remain the same (with normal query strings).
Is there a way that I can get a list of query strings that are in the links without parsing the string?
Try the HttpUtility.ParseQueryString method.
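To illustrate what such a parser returns, here is a comparable sketch in Python (not the .NET API): the standard library splits the query string into keys and values, so no manual string handling is needed. The URL and the query_params helper are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Map each query-string key in the URL to the list of its values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


params = query_params("http://example.com/list.aspx?page=2&tag=a&tag=b")
print(params)
```

A repeated key such as tag keeps all of its values, which is why each key maps to a list.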