How can I scrape Google? - web-scraping

How do I get HTML inside google.com?
Let's say I go to Google and type "Humpty Dumpty" and I get the search results and the URL changes to something like:
https://www.google.com/search?newwindow=1&q=humpty+dumpty&oq=humtp&gs_l=serp.3.0.0i10l10.7599.8190.0.9757.5.5.0.0.0.0.373.732.3j1j0j1.5.0....0...1c.1.30.serp..2.3.187.2B69R71ux4U
But when I use HttpWebRequest to download this page, the response doesn't contain any search-result HTML. I think this is because Google requests the results after the page has loaded?
Is there any way I can get the HTML?
P.S.: I know scraping Google is against their TOS. I am just trying to learn how to scrape websites like this.

Using the code below, which relies on WebClient, I get the correct HTML back (it includes results about nursery rhymes):
WebClient wbclient = new WebClient();
string html = wbclient.DownloadString("https://www.google.com/search?newwindow=1&q=humpty+dumpty&oq=humtp&gs_l=serp.3.0.0i10l10.7599.8190.0.9757.5.5.0.0.0.0.373.732.3j1j0j1.5.0....0...1c.1.30.serp..2.3.187.2B69R71ux4U");
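For comparison, the same plain HTTP fetch can be sketched in Python. This is only a minimal sketch under assumptions: the helper names `build_search_url` and `fetch_html` and the User-Agent value are made up for illustration, and Google may well return different markup, a consent page, or block automated requests entirely.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(query: str) -> str:
    """Build a search URL carrying only the q parameter (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def fetch_html(url: str) -> str:
    """Download a page as text, sending a browser-like User-Agent (assumed value)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a real network request):
#   html = fetch_html(build_search_url("humpty dumpty"))
```

Only `q` is sent here; the long `oq` and `gs_l` values in the original URL are left out on the assumption that they are not needed for a basic query.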

Related

Why does the same URL give different results?

On the following page, the page numbers 2, 3, ... at the bottom all point to the same URL, yet a different table is shown for each. Does anybody know what specific technique is used here? How can I extract the information in these tables using raw HTTP requests (I would prefer not to use a headless browser)? Thanks.
https://services27.ieee.org/fellowsdirectory/home.html#results_table
It is using Javascript (AJAX) to make HTTP calls to the server.
If you inspect the Network activity in the Developer tools you will see calls to the following URL: https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html.
They send data from Javascript:
selectedJSON: {"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],"helpText":"Click on any of the alphabet letters to view a list of Fellows."}
inputFilterJSON: {"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}
pageNum: 2
You can see the pageNum property. This is how they request a specific page of results.
When you click the number buttons, some Javascript code makes an AJAX POST request to https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html;jsessionid=yoursessionid with formData including pageNum: 3 and some other formatting parameters. The server responds with the HTML block of table rows that get loaded into the page. You can look at the requests on that webpage in your browser's network inspector (in the developer tools) to see exactly what HTTP requests are happening.
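The POST described above could be replayed roughly as follows. This is a sketch, not a tested client: the field values are copied from the form data quoted earlier, `build_page_request` is a hypothetical helper name, and the server may additionally require the `jsessionid` or cookies from a prior visit to home.html.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint seen in the browser's network inspector (quoted in the answers above).
ENDPOINT = "https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html"


def build_page_request(page_num: int) -> Request:
    """Build (but do not send) the POST request for one page of results."""
    selected = {
        "alpha": "ALL",
        "menu": "ALPHABETICAL",
        "gender": "All",
        "currPageNum": 1,  # kept as captured; pageNum below selects the page
        "breadCrumbs": [{"breadCrumb": "Alphabetical Listing "}],
        "helpText": "Click on any of the alphabet letters to view a list of Fellows.",
    }
    input_filter = {
        "sortOnList": [{"sortByField": "fellow.lastName", "sortType": "ASC"}],
        "typeAhead": False,
    }
    form = {
        "selectedJSON": json.dumps(selected),
        "inputFilterJSON": json.dumps(input_filter),
        "pageNum": page_num,
    }
    # Supplying data= makes urllib send a POST with a form-encoded body.
    return Request(ENDPOINT, data=urlencode(form).encode("ascii"))


# Usage (performs a real network request):
#   from urllib.request import urlopen
#   rows_html = urlopen(build_page_request(3)).read().decode("utf-8")
```

If the response really is the HTML block of table rows, it can then be handed to any HTML parser.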
The link has an onclick handler that changes its href when it is clicked. Go to
https://services27.ieee.org/fellowsdirectory/home.html#results_table
In the console, enter:
window.location=getDetailProfileUrl('lOH1bDxMyI1CCIxo5ODlGg==');
This redirects to Aarons, Jules.
Now go back and enter window.location=getDetailProfileUrl('JJuL3J00kHdIUozoVAgKdg==');
This opens Aarts, Ronald.
Basically, when the link is clicked, the JavaScript changes the url of the link.
To extract them using PHP, use the file_get_contents() function.
echo file_get_contents('https://services27.ieee.org/fellowsdirectory/home.html#results_table');
That will print out the page. Now scrape it with JavaScript.
echo "<script>console.log(document.querySelectorAll('.name'));</script>";
Hope this helps.

ASP Request.QueryString doesn't html decode "&" from URL query string

In ASP, given this URL:
http://www.example.com?foo=1&bar=2
Request.QueryString["bar"] returns NULL
The URL is a map area "href" link which I have assigned like so:
PolygonHotSpot p = new PolygonHotSpot();
p.NavigateUrl = "http://www.example.com?foo=1&bar=2";
ASP automatically HTML-encodes the URL for the href, but it does not HTML-decode it again in the request, so the query string parameter "bar" is not found.
I am also using the IIS URL Rewrite 2 module. Maybe this module is causing the problem? What can I do to solve it? I have tried using URL rewrite rules but couldn't figure out how, or whether that is the proper way.
It's probably not a good idea, but you could use Request.ServerVariables("QUERY_STRING") (or Request.ServerVariables["QUERY_STRING"] - your tags say ASP classic but your code looks like C#?) to get at the entire thing and then process it yourself.
I think there must be something deeper wrong though. A link can be encoded to be sent to the browser - the browser does the work of decoding it before navigating to the link. You can demonstrate this with a simple <a href="/test?a=1&b=2"> in a test script - the browser ends up correctly at /test?a=1&b=2. Testing it with a polygonal image map shows the same behaviour.
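The encode/decode round trip can be illustrated with a small Python sketch. It is an illustration of the principle, not of ASP's internals: `query_params` is a hypothetical helper that splits the query string on `&` only, which is assumed to approximate what the server does.

```python
from html import escape, unescape
from urllib.parse import urlsplit

raw_url = "http://www.example.com?foo=1&bar=2"

# What the server writes into the page: "&" becomes "&amp;" in the attribute.
href_in_markup = escape(raw_url)

# What a browser does when it parses the attribute: entities are decoded
# before the URL is requested.
url_requested = unescape(href_in_markup)


def query_params(url: str) -> dict:
    """Split the query string on '&' only (hypothetical helper)."""
    query = urlsplit(url).query
    return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


print(query_params(url_requested))   # the decoded URL has a plain "bar" key
print(query_params(href_in_markup))  # the still-encoded URL has "amp;bar" instead
```

So if "bar" is missing on the server, one thing worth checking is whether the request arrives with a literal `&amp;`, for example because the URL was encoded twice or followed by something other than a browser.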
If you can show me what is in your actual HTML output for the image map I might be able to help more.

youtube videos not playing when embedded with "normal" URL

I'm trying to embed a video on my page, depending on which one the user selects after being presented with a list.
On my page I have:
<div id="vidContent" style="text-align:left">
<object width="550px" height="350px" >
<asp:Literal ID="ltlVideo" runat="server"></asp:Literal>
</object>
</div>
And in the code behind I have:
Dim strVidPath As String = "http://www.youtube.com/v/" & strVidID
ltlVideo.Text = "<embed src='" & strVidPath & "' type='application/x-shockwave-flash' allowscriptaccess='always' allowfullscreen='true' height='350' width='470'></embed>"
phVideoBanner.Visible = True
..
which works OK... if you have the "strVidID".
It only seems to display and play if strVidPath = www.youtube.com/v/_O7iUiftbKU
but does not play if strVidPath = www.youtube.com/watch?v=_O7iUiftbKU ... which seems to be the normal URL in the address bar when watching a YouTube video.
I want the user to be able to add a video to the page, and I thought it would be easier if they pasted in the URL of the video. But now it seems I'll have to get them to paste in the video ID instead, as it only plays when I use www.youtube.com/v/_O7iUiftbKU
Anyone know why this is?
Rather than trying to parse a YouTube watch page URL and construct an embed code yourself, you can use the oEmbed service to do it for you.
If you need to get back legacy embed codes rather than iframe embed codes, you'd need to pass iframe=0 as one of the URL parameters to the oEmbed service, like: http://www.youtube.com/oembed?url=http%3A//www.youtube.com/watch%3Fv%3DbDOYN-6gdRE&format=json&iframe=0
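Building that oEmbed request URL from a pasted watch URL might look like the sketch below. The endpoint and the `iframe=0` parameter are taken from the answer above and are not verified here; `build_oembed_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

OEMBED_ENDPOINT = "http://www.youtube.com/oembed"  # endpoint quoted in the answer


def build_oembed_url(watch_url: str, legacy_embed: bool = False) -> str:
    """Return the oEmbed request URL for a YouTube watch-page URL."""
    params = {"url": watch_url, "format": "json"}
    if legacy_embed:
        params["iframe"] = 0  # parameter as given in the answer; unverified
    # urlencode percent-encodes the nested URL so it survives as one parameter.
    return OEMBED_ENDPOINT + "?" + urlencode(params)


print(build_oembed_url("http://www.youtube.com/watch?v=bDOYN-6gdRE", legacy_embed=True))
```

The JSON returned by the service is expected to carry the embed markup in an `html` field, which could then be written into the page instead of a hand-built embed tag.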
The URL structure with the word "watch" in it is, as you point out, Youtube's public facing web page, which includes a lot more than the video ... it includes all the other content you see on the page as well. In essence, it's a pointer that resolves to an HTML page, and you can't have an HTML page as the source of an embed element.
The URL structure that is proper (i.e. the one that works) is not a pointer to an HTML page but a pointer that resolves directly to the player itself, and thus can serve as the source of an embed element.
Here's a link to a Stack Overflow question whose answer includes a C# code block that takes a regular YouTube URL (in any of its forms) as an input, does a regex match, and returns just the Youtube ID -- should be pretty simple to modify it for your needs ... thus you can still continue to have your users paste in the whole video URL:
C# regex to get video id from youtube and vimeo by url
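As a rough illustration of the same idea, here is a Python sketch that pulls the video ID out of a pasted URL. Note that it swaps the regex approach for URL parsing, it is not the code from the linked answer, and the set of URL shapes it recognises (watch?v=, /v/, /embed/, youtu.be) is an assumption rather than an exhaustive list.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def extract_video_id(url: str) -> Optional[str]:
    """Return the YouTube video ID from a pasted URL, or None if not recognised."""
    if "://" not in url:
        url = "http://" + url  # users often paste URLs without a scheme
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]

    if host == "youtu.be":
        return segments[0] if segments else None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parts.path == "/watch":
            # Regular watch page: the ID is the v query parameter.
            return parse_qs(parts.query).get("v", [None])[0]
        if len(segments) >= 2 and segments[0] in ("v", "embed"):
            # Player URLs: /v/<id> or /embed/<id>.
            return segments[1]
    return None


print(extract_video_id("www.youtube.com/watch?v=_O7iUiftbKU"))
```

The returned ID can then be appended to "http://www.youtube.com/v/" exactly as the code-behind in the question already does with strVidID.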

Cannot pass url in html code

I cannot pass HTML in a URL parameter. I want to do something like this:
e.g.
http://www.myname.com/page.aspx?id=1&name=test&msg=message message message
If I do that, I get "The page cannot be found".
I also tried this:
http://www.myname.com/page.aspx?id=1&name=test&msg=%3Cp%3Emessage%20message%20message%3C/p%3E
but it still does not work. On my localhost it works fine, but once I upload it to my server this method fails.
So how can I pass HTML in a URL?
"The page cannot be found"
Have you tried navigating to the page without the extra stuff?
It doesn't look URL-friendly to me, but this guide may help you get it properly URL-encoded: http://www.w3schools.com/asp/met_urlencode.asp
I think you're looking for something like this (encode only the parameter value; encoding the whole URL would also escape the :, /, ? and & separators):
Response.Write("http://www.myname.com/page.aspx?id=1&name=test&msg=" & Server.URLEncode("message message message"))
Just be aware that spaces, like some special characters, are generally bad in URLs. You also don't want to trust user input when writing it out as a URL, I reckon.
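The same encoding step can be sketched in Python, where each parameter value is percent-encoded separately and the separators are left alone. `build_page_url` is a hypothetical helper, and this only shows the encoding; it does not establish why the server returns "The page cannot be found".

```python
from urllib.parse import quote, urlencode


def build_page_url(base: str, params: dict) -> str:
    """Append percent-encoded query parameters to a base URL."""
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use "+".
    return base + "?" + urlencode(params, quote_via=quote)


url = build_page_url(
    "http://www.myname.com/page.aspx",
    {"id": 1, "name": "test", "msg": "<p>message message message</p>"},
)
print(url)
```

Unlike the hand-encoded URL in the question, this also encodes the "/" inside the closing tag.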

How do I search an iframe for a specific image or grab the source code

My main goal is to search an iframe for a specific image. I know what the image will be (abc_clicked.gif), but I am not sure how to access the iframe to either:
1) search the iframe for the image itself
2) grab the source code, which I will then search for the image myself
I am trying to accomplish this with JavaScript, as I don't see how PHP could help me in this case.
Does anyone have any ideas? I'm lost.
If the iFrame is hosted on the same domain, you can access the DOM the same as you would for the main page using contentDocument.
For example:
var iframeElement = document.getElementById('myiframe');
var imageElement = iframeElement.contentDocument.getElementById('myImage');
(assuming you're working in a Web page and looking for a JavaScript solution)
If the iframed page is in a different domain, there's not much you can do.
If it's in the same domain, here is a cross-browser way to access its content:
var doc=ifr.contentWindow||ifr.contentDocument;
if (doc.document) doc=doc.document;
You can then search your iframe:
var imgs = doc.getElementsByTagName("img");
// etc.
Your second option is also valid (but might be more complicated): use AJAX to retrieve and parse the page source.
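Once the page source has been retrieved by whatever means, searching it for the image could look like the Python sketch below. This is an illustration of the "parse the source" option only; `ImageFinder` and `find_image` are hypothetical names, and matching on the file name of the src attribute is an assumption about how the image is referenced.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collect the src of every <img> whose file name matches the target."""

    def __init__(self, filename: str):
        super().__init__()
        self.filename = filename
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Compare only the last path segment, ignoring any query string.
        if src.split("?", 1)[0].rsplit("/", 1)[-1] == self.filename:
            self.matches.append(src)


def find_image(page_source: str, filename: str) -> list:
    """Return the src values of all matching images in the given HTML."""
    finder = ImageFinder(filename)
    finder.feed(page_source)
    return finder.matches


sample = '<body><img src="/img/abc.gif"><img src="/img/abc_clicked.gif?v=2" /></body>'
print(find_image(sample, "abc_clicked.gif"))
```

An empty list means the image was not found in the markup; an image added later by script would not appear in the raw source.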
