Scrape information from asp.net gridview with paging - asp.net

I'm trying to scrape some schedules off of a website. The information is displayed in a GridView with paging.
The url is:
http://www.landmarkworldwide.com/when-and-where/register/search-results.aspx?prgid=0&pgID=270&crid=0&ctid=&sdt=0
My issue is scraping pages other than #1 in the GridView.
The best post I found so far was This One, but it doesn't work and that topic is incomplete. I tried using Fiddler and Chrome to capture the POST data and replay it, but I can't get it to work. Can you see what's missing?
Here's the code I am using. It's in VB, but you can answer in C# and I'll translate. :-) (sorry)
Protected Sub Page_Load(sender As Object, e As System.EventArgs) Handles Me.Load
Dim lcUrl As String = "http://www.landmarkworldwide.com/when-and-where/register/search-results.aspx?prgid=0&pgID=270&crid=0&ctid=&sdt=0"
' first, request the login form to get the viewstate value
Dim webRequest__1 As HttpWebRequest = TryCast(WebRequest.Create(lcUrl), HttpWebRequest)
Dim responseReader As New StreamReader(webRequest__1.GetResponse().GetResponseStream())
Dim responseData As String = responseReader.ReadToEnd()
responseReader.Close()
' extract the viewstate value and build out POST data
Dim viewState As String = ExtractViewState(responseData)
Dim loHttp As HttpWebRequest = DirectCast(WebRequest.Create(lcUrl), HttpWebRequest)
' *** Send any POST data
Dim lcPostData As String = [String].Format("__VIEWSTATE={0}&__EVENTTARGET={1}&__EVENTARGUMENT={2}", viewState, HttpUtility.UrlEncode("contentwrapper_0$maincontent_0$maincontentfullwidth_0$ucSearchResults$gvPrograms"), HttpUtility.UrlEncode("Page$3"))
loHttp.Method = "POST"
Dim lbPostBuffer As Byte() = System.Text.Encoding.GetEncoding(1252).GetBytes(lcPostData)
loHttp.ContentLength = lbPostBuffer.Length
Dim loPostData As Stream = loHttp.GetRequestStream()
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length)
loPostData.Close()
Dim loWebResponse As HttpWebResponse = DirectCast(loHttp.GetResponse(), HttpWebResponse)
Dim enc As Encoding = System.Text.Encoding.GetEncoding(1252)
Dim loResponseStream As New StreamReader(loWebResponse.GetResponseStream(), enc)
Dim lcHtml As String = loResponseStream.ReadToEnd()
loWebResponse.Close()
loResponseStream.Close()
Response.Write(lcHtml)
End Sub
Private Function ExtractViewState(s As String) As String
Dim viewStateNameDelimiter As String = "__VIEWSTATE"
Dim valueDelimiter As String = "value="""
Dim viewStateNamePosition As Integer = s.IndexOf(viewStateNameDelimiter)
Dim viewStateValuePosition As Integer = s.IndexOf(valueDelimiter, viewStateNamePosition)
Dim viewStateStartPosition As Integer = viewStateValuePosition + valueDelimiter.Length
Dim viewStateEndPosition As Integer = s.IndexOf("""", viewStateStartPosition)
Return HttpUtility.UrlEncodeUnicode(s.Substring(viewStateStartPosition, viewStateEndPosition - viewStateStartPosition))
End Function

To make it work you need to send all of the page's input fields, not only the view state. Other fields are critical too, for example __EVENTVALIDATION, which your code does not handle. So:
First, scrape page #1: load it and use the Html Agility Pack to parse it into a usable document structure.
Then extract from that document the input fields that you need to post. This snippet, from the answer to "HTML Agility Pack get all input fields", shows how to do that:
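// doc is an HtmlAgilityPack.HtmlDocument loaded with the HTML of page #1
var inputs = doc.DocumentNode.SelectNodes("//input");
if (inputs != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode input in inputs)
    {
        // use these name/value pairs to create the post string
        string name = input.GetAttributeValue("name", "");
        string value = input.GetAttributeValue("value", "");
    }
}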
Once you have all the data required for a valid post, move on to the next step: sending it. For an example, see "How to pass POST parameters to ASP.Net web request?"
You can also read: How to use HTML Agility pack
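As a rough sketch of the "create the post string" step, assuming the name/value pairs have already been collected from page #1: the helper below (PostbackBody.Build is a made-up name, not part of any library) drops any existing __EVENTTARGET and __EVENTARGUMENT entries, appends the ones for the paging postback, and URL-encodes every name and value. It does not filter out submit buttons or other inputs you may not want to send.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

public static class PostbackBody
{
    // Builds an application/x-www-form-urlencoded body for a WebForms postback.
    // "fields" are the name/value pairs scraped from the page (hidden inputs etc.).
    public static string Build(
        IEnumerable<KeyValuePair<string, string>> fields,
        string eventTarget,
        string eventArgument)
    {
        var pairs = fields
            // Replace whatever the page carried with our own target/argument.
            .Where(f => f.Key != "__EVENTTARGET" && f.Key != "__EVENTARGUMENT")
            .Concat(new[]
            {
                new KeyValuePair<string, string>("__EVENTTARGET", eventTarget),
                new KeyValuePair<string, string>("__EVENTARGUMENT", eventArgument)
            });

        // Encode both names and values; field names often contain "$".
        return string.Join("&", pairs.Select(p =>
            WebUtility.UrlEncode(p.Key) + "=" + WebUtility.UrlEncode(p.Value ?? "")));
    }
}
```

For the grid in the question, the call would look something like PostbackBody.Build(fields, "contentwrapper_0$maincontent_0$maincontentfullwidth_0$ucSearchResults$gvPrograms", "Page$3"), with the result written to the request stream and the ContentType set to application/x-www-form-urlencoded.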

Related

HttpWebRequest.GetResponse does not return AutoID for server side controls

By default, .NET builds the client-side IDs of server-side controls by prepending generated text.
For example:
<asp:TextBox ID="txbUser" runat="server">
would become...
<input type="text" id="ctl00_body_txbUser">
However, when I request the same page with HttpWebRequest and GetResponse (CType(objReq.GetResponse, HttpWebResponse)), the item comes back without the auto-generated text.
<input type="text" id="txbUser">
Is it possible to use an HttpWebRequest object and GetResponse in such a way that the response contains the auto-generated IDs that .NET uses for server-side controls?
I am working with a third party that has previously set up translation rules keyed to specific IDs. We are now attempting to have the same rules work against an API call that is passed a string generated from the page. However, the string generated from the page does not have the same IDs.
Below is code being used to return the Web Page as a string.
Public Shared Function GetWebPageAsString(ByVal strURI As String, ByVal strPostData As String) As String
' Declare our variables. '
Dim objHttpRequest As HttpWebRequest
Dim PostBuffer() As Byte
Dim PostDataStream As Stream = Nothing
Dim objHttpResponse As HttpWebResponse = Nothing
Dim objStreamReader As StreamReader = Nothing
Dim strResponseText As String = ""
Try
' Create a new request. '
objHttpRequest = CType(WebRequest.Create(strURI), HttpWebRequest)
objHttpRequest.Timeout = 3000000
objHttpRequest.Method = "POST"
PostBuffer = Encoding.ASCII.GetBytes(strPostData)
objHttpRequest.ContentType = "application/x-www-form-urlencoded"
objHttpRequest.ContentLength = PostBuffer.Length
PostDataStream = objHttpRequest.GetRequestStream
PostDataStream.Write(PostBuffer, 0, PostBuffer.Length)
PostDataStream.Close()
' Get the response to our request as a stream object. '
objHttpResponse = CType(objHttpRequest.GetResponse, HttpWebResponse)
' Create a stream reader to read the data from the stream. '
objStreamReader = New StreamReader(objHttpResponse.GetResponseStream, Encoding.UTF8)
' Copy the text retrieved from the stream to a variable. '
strResponseText = objStreamReader.ReadToEnd()
' Close our objects. '
objStreamReader.Close()
objHttpResponse.Close()
Catch ex As Exception
strResponseText = strURI & " | " & strPostData
Throw (ex)
Finally
If Not objStreamReader Is Nothing Then
objStreamReader.Close()
End If
If Not PostDataStream Is Nothing Then
PostDataStream.Close()
End If
If Not objHttpResponse Is Nothing Then
objHttpResponse.Close()
End If
objHttpRequest = Nothing
PostBuffer = Nothing
PostDataStream = Nothing
objHttpResponse = Nothing
objStreamReader = Nothing
End Try
' Set return value. '
Return strResponseText
End Function
EDIT: Just to clarify, I need the IDs to continue to be auto-generated by .NET. I understand that I could make them equal by setting the ClientIDMode to Static. Unfortunately, the third party we are working with has already created the rules for our current pages based on the IDs that .NET generated. When I request the same page using the HttpWebRequest object and push data into a stream, I no longer see the auto-generated IDs, even though it's the same page.
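Create a clean master page and put your page in it. A prefix such as ctl00_body_ typically comes from the naming containers that wrap the control (the master page and its content placeholder), so the page has to be served inside them to get the prefixed IDs. That should fix the IDs issue.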

How to write response into Buffer instead of rendering on Screen?

I am developing an app where I need to capture information from a web page after supplying credentials automatically. Somehow I managed to do the automatic login and page redirection. Here is my code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://abcd.com.au/categories/A_dfn/sdf");
HttpWebResponse res = req.GetResponse() as HttpWebResponse;
StringBuilder sb = new StringBuilder();
byte[] buf = new byte[10000];
Stream resStream = res.GetResponseStream();
string s = null;
int c = 0;
do
{
c = resStream.Read(buf, 0, buf.Length);
if (c != 0) {
s = ASCIIEncoding.ASCII.GetString(buf, 0, c);
sb.Append(s);
}
} while (c > 0);
string oldhead = "class=\"login_button\">";
string newhead = "class=\"login_button\"> <script type=\"text/javascript\">document.getElementById('btn').click()</script>";
sb.Replace(oldhead, newhead);
string oldbtn = "value=\"Submit\"";
string newbtn = "value=\"Submit\" id=\"btn\" ";
sb.Replace(oldbtn, newbtn);
string oldAction = "<form action=\"/login\" method=\"post\">";
string newAction = "<form action=\"https://abcd.com.au/login?orig_req_url=%2Fcategories/A_dfn/sdf\" method=\"post\">";
sb.Replace(oldAction, newAction);
string oldUsername = "<input id=\"login_email\" type=\"text\" name=\"user[email_address]\" class=\"textBox\" value=\"\">";
string newUserName = "<input id=\"login_email\" type=\"text\" name=\"user[email_address]\" class=\"textBox\" value=\"abc#xyz.com.au\">";
sb.Replace(oldUsername, newUserName);
string oldPass = "<input id=\"login_password\" type=\"password\" name=\"user[password]\" class=\"textBox\" value=\"\">";
string newPass = "<input id=\"login_password\" type=\"password\" name=\"user[password]\" class=\"textBox\" value=\"abc\">";
sb.Replace(oldPass,newPass);
Response.Write(sb);
This gives me the expected output when the page is rendered (Response.Write(sb)). But now I want to do the same thing without redirecting to "https://abcd.com.au/login?orig_req_url=%2Fcategories/A_dfn/sdf", and then do more processing on the result. I want the output of Response.Write(sb) in a buffer instead. Is that possible?
Here is an example that explains exactly what I want to do.
I am looking for a product's quantity, say for the product named Screw 15mm, which is on the page https://abcd.com.au/%2Fcategories/A_dfn/sdf.
So I request this URL first, but since that page requires a login, I am redirected to the login page, where the username and password are filled in and the login button is pressed via JavaScript; I am then redirected to the originally requested page. On that page I want to find the product and return its information to my web app.
I want to do all of this without showing anything to the user.
I just want to show the retrieved information.
Thanks.
What you are looking for is a persisted session. Your current approach triggers the form submit on the client side; what you are trying to achieve should be done on the server side.
The key is to persist (store) the session and cookies set by the login page, then attach those credentials to the web request for the product info.
Use the WebRequest object to load the login page.
Store any info (cookies) sent in the login page's response headers.
Create a new WebRequest object that carries those cookies and posts the user ID and password.
Store any credentials (cookies) returned by that response.
Proceed to request the product info.
There is no generic way to do this without knowing the website you are trying to screen-scrape, but the general steps are as above. Basically, you need to create a custom class for this.
Also, use HTMLAgilityPack to parse the HTML nodes; it is the right tool for that.
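The cookie-persistence part of those steps can be sketched in C# as below. This is only an illustration under assumptions: ScrapeSession and CreateRequest are made-up names, and the actual login field names and URLs depend on the target site. The point is that every request shares one CookieContainer, so cookies set by the login response are sent again with later requests.

```csharp
using System;
using System.Net;

public class ScrapeSession
{
    // One container shared by every request made through this session.
    // HttpWebRequest stores response cookies here and replays them later.
    public CookieContainer Cookies { get; } = new CookieContainer();

    // Creates a request that is wired to the shared cookie container.
    // Nothing is sent until GetRequestStream() or GetResponse() is called.
    public HttpWebRequest CreateRequest(string url, string method = "GET")
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.CookieContainer = Cookies;
        return request;
    }
}
```

Typical use would be: create one ScrapeSession, use CreateRequest to GET the login page, then to POST the credentials, then to GET the product page, reading each response into a string on the server instead of writing it to the browser.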
EDIT: Added my code. It just so happens that I created this class some time ago, so you're in luck. However, you will need HTMLAgilityPack installed and referenced to use it. You can download HAP at: http://htmlagilitypack.codeplex.com/ If you want to do any serious screen-scraping, HAP is the de facto standard.
Public Class clsBrowserSession
'=================================================================================================================================
'This is a special Browser Post class
' Instead of just POST to a URL as per the clsWeb.fnsPostResponse()
' clsBrowserSession allows us to LOAD a page first, persist all the cookies and variables, and then only POST to the target URL.
' The reason is that some program will drop (lets say) a SessionID as an input when you first load the page.
' and when you post, without the SessionID (variable), it will reject the POST. Thus clsBrowserSession can solve this problem.
'=================================================================================================================================
' USAGE:
' Dim voBrowserSession As New clsBrowserSession
' voBrowserSession.sbLoadPage("https://xxx.yyy.net.my/publicncdenq/index.htm")
' voBrowserSession.proFormElements("UserID") = "myID"
' voBrowserSession.proFormElements("Password") = "myPassword"
' Dim vsResponseHTML As String = voBrowserSession.Post("https://xxx.yyy.net.my/publicncdenq/index.htm")
Private vbIsPostingInProgress As Boolean
Public voCookies As System.Net.CookieCollection
Public proHTMLDoc As HtmlAgilityPack.HtmlDocument
Public proFormElements As clsFormElementCollection
Public Sub sbLoadPage(pvsURL As String)
vbIsPostingInProgress = False
fnoCreateWebRequestObject().Load(pvsURL)
End Sub
Public Function Post(pvsURL As String) As String
vbIsPostingInProgress = True
fnoCreateWebRequestObject().Load(pvsURL, "POST")
Return proHTMLDoc.DocumentNode.InnerHtml
End Function
Private Function fnoCreateWebRequestObject() As HtmlAgilityPack.HtmlWeb
Dim voWeb As New HtmlAgilityPack.HtmlWeb
voWeb.UseCookies = True
voWeb.PreRequest = New HtmlAgilityPack.HtmlWeb.PreRequestHandler(AddressOf event_OnPreRequest)
voWeb.PostResponse = New HtmlAgilityPack.HtmlWeb.PostResponseHandler(AddressOf event_OnAfterResponse)
voWeb.PreHandleDocument = New HtmlAgilityPack.HtmlWeb.PreHandleDocumentHandler(AddressOf event_OnPreHandleDocument)
Return voWeb
End Function
Private Sub sbAddPostDataTo(pvoRequest As Net.HttpWebRequest)
Dim vsPayload As String = proFormElements.fnsAssemblePostPayload()
Dim vabyteBuffer As Byte() = Text.Encoding.UTF8.GetBytes(vsPayload.ToCharArray())
pvoRequest.ContentLength = vabyteBuffer.Length
pvoRequest.ContentType = "application/x-www-form-urlencoded"
pvoRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11"
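' Write the payload and close the request stream before the response is requested
Using voStream As IO.Stream = pvoRequest.GetRequestStream()
voStream.Write(vabyteBuffer, 0, vabyteBuffer.Length)
End Using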
End Sub
Private Sub sbAddvoCookiesTo(pvoRequest As Net.HttpWebRequest)
If (Not IsNothing(voCookies)) Then
If voCookies.Count > 0 Then pvoRequest.CookieContainer.Add(voCookies)
End If
End Sub
Private Sub sbSaveCookiesFrom(pvoResponse As Net.HttpWebResponse)
If pvoResponse.Cookies.Count > 0 Then
If IsNothing(voCookies) Then voCookies = New Net.CookieCollection
voCookies.Add(pvoResponse.Cookies)
End If
End Sub
Private Sub sbSaveHtmlDocument(pvoHTMLDocument As HtmlAgilityPack.HtmlDocument)
proHTMLDoc = pvoHTMLDocument
proFormElements = New clsFormElementCollection(proHTMLDoc)
End Sub
Protected Function event_OnPreRequest(pvoRequest As Net.HttpWebRequest) As Boolean
sbAddvoCookiesTo(pvoRequest)
If vbIsPostingInProgress Then sbAddPostDataTo(pvoRequest)
Return True
End Function
Protected Sub event_OnAfterResponse(pvoRequest As System.Net.HttpWebRequest, pvoResponse As Net.HttpWebResponse)
sbSaveCookiesFrom(pvoResponse)
End Sub
Protected Sub event_OnPreHandleDocument(pvoHTMLDocument As HtmlAgilityPack.HtmlDocument)
sbSaveHtmlDocument(pvoHTMLDocument)
End Sub
'-----------------------------------------------------------------------------------------------------
'Form Elements class
' Note: This element class will only capture (any) INPUT elements only, which should be enough
' for most cases. It can be easily modified to add other SELECT, TEXTAREA, etc voInputs
'-----------------------------------------------------------------------------------------------------
Public Class clsFormElementCollection
Inherits Dictionary(Of String, String)
Public Sub New(htmlDoc As HtmlAgilityPack.HtmlDocument)
Dim voInputs As Collections.Generic.IEnumerable(Of HtmlAgilityPack.HtmlNode) = htmlDoc.DocumentNode.Descendants("input")
For Each voInput As HtmlAgilityPack.HtmlNode In voInputs
Dim vsName = voInput.GetAttributeValue("name", "undefined")
Dim vsValue = voInput.GetAttributeValue("value", "")
' Use the indexer so that repeated names (e.g. radio buttons) do not throw; the last value wins
If vsName <> "undefined" Then Me(vsName) = vsValue
Next
End Sub
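Public Function fnsAssemblePostPayload() As String
' URL-encode both names and values (ASP.NET field names often contain "$")
Dim voPairs As New List(Of String)
For Each voKeyValuePair In Me
voPairs.Add(System.Web.HttpUtility.UrlEncode(voKeyValuePair.Key) & "=" & System.Web.HttpUtility.UrlEncode(voKeyValuePair.Value))
Next
' String.Join also copes with an empty collection, unlike Substring(1)
Return String.Join("&", voPairs)
End Function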
End Class
End Class
Just make the above into a class object and instantiate it. The usage example is in the comment. You want the vsResponseHTML string.

How to get Description of Youtube embedded videos in my asp.net application?

I am using the code below to get the title and description of the YouTube videos embedded in my asp.net application. I can see the title, but not the description.
I use AtomFeed to do this...
The problem is that I get the description as "Google.GData.Client.AtomTextConstruct" for all my videos.
Private Function GetTitle(ByVal myFeed As AtomFeed) As String
Dim strTitle As String = ""
For Each entry As AtomEntry In myFeed.Entries
strTitle = entry.Title.Text
Next
Return strTitle
End Function
Private Function GetDesc(ByVal myFeed As AtomFeed) As String
Dim strDesc As String = ""
For Each entry As AtomEntry In myFeed.Entries
strDesc = entry.Summary.ToString()
Next
Return strDesc
End Function
I believe that the description is not handled when the XML from the Atom feed is parsed. Take a look at this: http://code.google.com/p/google-gdata/wiki/UnderstandingTheUnknown
But what happens with things that are not understood? They end up as
an element of the ExtensionElements collection, that is a member of
all classes inherited from AtomBase, like AtomFeed, AtomEntry,
EventEntry etc...
So, what we can do is pull out the description from the extensionelement like this:
Dim query As New FeedQuery()
Dim service As New Service()
query.Uri = New Uri("https://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
Dim myFeed As AtomFeed = service.Query(query)
For Each entry In myFeed.Entries
For Each obj As Object In entry.ExtensionElements
If TypeOf obj Is XmlExtension Then
Dim xel As XElement = XElement.Parse(TryCast(obj, XmlExtension).Node.OuterXml)
If xel.Name = "{http://search.yahoo.com/mrss/}group" Then
Dim descNode = xel.Descendants("{http://search.yahoo.com/mrss/}description").FirstOrDefault()
If descNode IsNot Nothing Then
Console.WriteLine(descNode.Value)
End If
Exit For
End If
End If
Next
Next
Also, the reason why you are getting "Google.GData.Client.AtomTextConstruct" is because Summary is an object of type Google.GData.Client.AtomTextConstruct, so doing entry.Summary.ToString() is just giving you the default ToString() behavior. You would normally do Summary.Text, but this of course is blank because as I say above, it's not handled properly by the library.
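To illustrate the default ToString() behavior described above, here is a tiny C# sketch. TextConstruct is a hypothetical stand-in, not the real Google.GData.Client.AtomTextConstruct; it only mimics a class that exposes a Text property and does not override ToString().

```csharp
using System;

// Hypothetical stand-in for a library type that does not override ToString().
public class TextConstruct
{
    public string Text { get; set; }
}

public static class ToStringDemo
{
    // Mirrors entry.Summary.ToString(): with no override, this falls back to
    // Object.ToString(), which returns the type's name rather than its content.
    public static string Describe(TextConstruct summary)
    {
        return summary.ToString();
    }
}
```

Calling ToStringDemo.Describe on an instance returns the type name, while reading the Text property returns the actual content, which is why the question's code printed a class name for every video.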
For YouTube, I fetch the information for each video using Google.GData.YouTube.
Something like this returns a lot of information about the video.
Dim yv As Google.YouTube.Video
url = New Uri("http://gdata.youtube.com/feeds/api/videos/" & video.Custom)
r = New YouTubeRequest(New YouTubeRequestSettings("??", "??"))
yv = r.Retrieve(Of Video)(url)
Then it's possible to get the description with: yv.Description

ASP.net: Scrape Part of webpage

Dim url As New Uri("http://www.testpage.com")
If url.Scheme = Uri.UriSchemeHttp Then
'Create Request Object
Dim objRequest As HttpWebRequest = DirectCast(HttpWebRequest.Create(url), HttpWebRequest)
'Set Request Method
objRequest.Method = WebRequestMethods.Http.[Get]
'Get response from requested url
Dim objResponse As HttpWebResponse = DirectCast(objRequest.GetResponse(), HttpWebResponse)
'Read response in stream reader
Dim reader As New StreamReader(objResponse.GetResponseStream())
Dim tmp As String = reader.ReadToEnd()
objResponse.Close()
'Set response data to container
Label1.Text = tmp
End If
How would I scrape only part of a web page? The code successfully gets the full HTML content.
For example, I want to scrape everything between <div id="content"> </div>
Once you have the page's full HTML content in a string variable, you can run a regular expression over this string to return the parts you want to extract.
Since you have not provided details on what you want to extract, I will provide you with a link on how to use regular expressions.
A short tutorial on Regular Expressions can be found here
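For the <div id="content"> case from the question, a minimal C# sketch of the regular-expression approach might look like this. ContentExtractor is a made-up name, and the pattern makes strong assumptions: id="content" must be the first attribute of the div, and the match stops at the first closing </div>, so nested divs inside the content will cut the result short. For anything beyond simple markup, an HTML parser is the safer choice.

```csharp
using System.Text.RegularExpressions;

public static class ContentExtractor
{
    // Returns the inner HTML of the first <div id="content">, or null if not found.
    public static string ExtractContentDiv(string html)
    {
        var match = Regex.Match(
            html,
            @"<div\s+id=""content""[^>]*>(.*?)</div>",
            // Singleline lets "." match newlines so multi-line content is captured.
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value : null;
    }
}
```

In the question's code, tmp would be passed to ExtractContentDiv and the result assigned to Label1.Text instead of the whole page.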

a ButtonField within DataGrid calling Vb.net as Javascript.... for streaming PDF?

I just noticed this last night.
I have a ButtonField within a DataGrid. In the interface, the ButtonField looks like a link, but hovering over it shows that it is actually a JavaScript call.
Here is the Image ScreenShot
So that's the first point: it IS a JavaScript call, which I hadn't noticed until now.
Clicking it calls the createPDF() function. The function behind the scenes (I'm using VB.net) executes this code:
Protected Sub createPDF()
Dim document As New Document()
Dim mem As LengthFixingStream = New LengthFixingStream()
' instantiate a iTextSharp.text.pdf.Document
'Dim mem As New MemoryStream()
' PDF data will be written here
PdfWriter.GetInstance(document, mem)
' tie a PdfWriter instance to the stream
document.Open()
Dim titleFont = FontFactory.GetFont("Arial", 18, Font.BOLD)
document.Add(New Paragraph("Northwind Traders Receipt", titleFont))
document.Close()
' automatically closes the attached MemoryStream
Dim docData As Byte() = mem.GetBuffer()
' get the generated PDF as raw data
' write the document data to response stream and set appropriate headers:
Response.AppendHeader("Content-Disposition", "attachment; filename=testdoc.pdf")
Response.ContentType = "application/pdf"
Response.BinaryWrite(docData)
Response.[End]()
End Sub
But this does not deliver the PDF to the browser, because the call is made through JavaScript rather than directly as a normal hyperlink.
So I'm wondering: can ASP.net open a new window and then send the createPDF() result into it?
Correct me if I'm wrong...
Here is a mockup to give you the idea; I haven't tested it. Basically, you will have to put the above code in a new page, say "receipt.aspx", and execute it on the Load event. You will need to set up an id parameter if data is being pulled from the database to generate the PDF.
In the button click handler, add the following:
Dim sb As New System.Text.StringBuilder()
sb.Append("<script type='text/javascript'>")
' Open receipt.aspx in a popup; the id travels in the query string
sb.Append("window.open('receipt.aspx?id=123', 'Receipt',")
sb.Append("'width=800,height=800,menubar=yes,resizable=no');<")
sb.Append("/script>")
Dim t As Type = Me.GetType()
If Not ClientScript.IsStartupScriptRegistered(t, "PopupScript") Then
ClientScript.RegisterStartupScript(t, "PopupScript", sb.ToString())
End If
Notice the "id=123" query string value being passed to receipt.aspx.
You can then read it in the receipt.aspx page like this:
Dim id as String = Request.QueryString("id")
CreatePDF(id)
...shoot! I just realized you are using a grid. The principle remains the same; just wire up the buttons in the RowDataBound event.
Protected Sub GridView_RowDataBound(sender As Object, e As GridViewRowEventArgs)
' Only data rows contain the label and button (skip header and footer rows)
If e.Row.RowType = DataControlRowType.DataRow Then
Dim Id As String = DirectCast(e.Row.Cells(0).FindControl("quotationid"), Label).Text
Dim myButton As Button = DirectCast(e.Row.Cells(4).FindControl("btnViewReceipt"), Button)
' window.open takes (url, name, features): all features go in one comma-separated string
myButton.Attributes.Add("onclick", "window.open('receipt.aspx?id=" & Id & "','Receipt','scrollbars=yes,width=800,height=800')")
End If
End Sub
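If the id can contain spaces or other special characters, it is safer to URL-encode it before putting it in the script. A small C# sketch of building the onclick string follows; ReceiptPopup.BuildOnClick is a made-up helper, and receipt.aspx and the window features are taken from the mockup above.

```csharp
using System;

public static class ReceiptPopup
{
    // Builds the JavaScript for the button's onclick attribute.
    // The id is URL-encoded so it cannot break out of the query string.
    public static string BuildOnClick(string id)
    {
        string url = "receipt.aspx?id=" + Uri.EscapeDataString(id);
        return "window.open('" + url + "','Receipt','scrollbars=yes,width=800,height=800');";
    }
}
```

In RowDataBound, the result would be passed to myButton.Attributes.Add("onclick", ...) in place of the hand-concatenated string.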

Resources