How to retrieve Google pages - information-retrieval

Dear all, I am now using a web tool,
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=
to parse web pages.
For example, to parse a New York Times page, we enter
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://www.nytimes.com/pages/world/index.html
in the browser's address bar, and it parses things nicely for us.
However, it fails for Google pages.
For example, if I try to parse the Google News home page:
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://news.google.com/nwshp?hl=en&tab=wn
I always get a 500 Internal Server Error.
I am sure it has something to do with the Google website. We probably need some API for Google. Does anyone have any idea how to sort this out for Google pages?
Many thanks.
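One detail in the Google News URL above may be worth ruling out: the target URL contains its own `&`, so a server could read `tab=wn` as a parameter of the scraper rather than of Google News. The sketch below percent-encodes the target before appending it. Whether the tool decodes its `url` parameter this way, and whether this has anything to do with the 500 error, are assumptions; the helper name `scrape_url` is made up for illustration.

```python
from urllib.parse import quote, urlsplit, parse_qs

# Base address of the scraping tool, copied from the question.
SCRAPER = "http://fiddesktop.cs.northwestern.edu/mmp/scrape?url="

def scrape_url(target: str) -> str:
    """Return the scraper URL with the target percent-encoded as one value."""
    # safe="" also encodes ":", "/", "?" and "&", so nothing in the
    # target can be mistaken for a parameter of the scraper itself.
    return SCRAPER + quote(target, safe="")

target = "http://news.google.com/nwshp?hl=en&tab=wn"

# Without encoding, a standard query-string parser splits the target at "&".
naive = parse_qs(urlsplit(SCRAPER + target).query)
print(naive["url"][0])   # the target, cut off before "&tab=wn"

# With encoding, the whole target arrives in the "url" parameter.
encoded = scrape_url(target)
received = parse_qs(urlsplit(encoded).query)["url"][0]
print(encoded)
print(received == target)
```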

Per the google.com robots.txt file, you are explicitly requested not to scrape their content. Google does not provide an API for machine-readable search results; they want to control the presentation of their content via widgets and embedding strategies.
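As a rough sketch of how a scraper can honor such a file, Python's standard `urllib.robotparser` can be asked whether a given user agent may fetch a given URL. The rules in `SAMPLE_RULES`, the `my-scraper` agent name, and the `example.com` URLs are invented for illustration; they are not Google's actual rules.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /search
Allow: /about
"""

print(may_fetch(SAMPLE_RULES, "my-scraper", "http://www.example.com/search?q=test"))
print(may_fetch(SAMPLE_RULES, "my-scraper", "http://www.example.com/about"))
```

To check a live site instead of a string, the same class offers `set_url()` followed by `read()`, which downloads the site's robots.txt before `can_fetch()` is called.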

Related

Content of Facebook Open Graph urls

I know Facebook Open Graph self-hosted object URLs should have meta tags describing the object.
But I was wondering whether they are supposed to have anything else at all.
That is, should they also provide user content, or are they just used for FB scraping?
It depends on what you are trying to do. Take a look at the Facebook Sharing Best Practices documents.

Can I track who is linking or manipulating my site's data?

Is it possible to track if someone links to data on my site? Specifically, if my data is used in a site dynamically generated by a developer's program? I would like to know if someone is blatantly passing off my site's data as their own. There are obviously ways around directly linking to content, such as content manipulation or even manual manipulation. But if someone were to link (or directly add word for word, or manipulate) my content into their website, is there a way to track it?
Can I avoid someone being able to scrape my website at all, or is everything just up for grabs?
The best and easiest answer is Google Webmaster Tools.
Doing this yourself is very hard: you would need to crawl the web to discover the links that point to your pages. Dynamic content is linked as well, so Google will find it too.
This tool lets you see the external links that point to your site, so you can check them.
As an extra, you can monitor requests and traffic to your site and look for IPs that request the same page over and over again. That can tell you that an external page is dynamically loading content from your page.
EDIT:
Here is a good article on this subject: link. Scroll down and you can see the use of Google Webmaster Tools with some other programs and methods.
Here is a good starter guide to Google Webmaster Tools: link
Enjoy!
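The traffic-monitoring idea in this answer can be sketched as a small log scan. This assumes the server writes access logs in the common log format (client IP first, request path as the seventh whitespace-separated field); the function name, the threshold, and the sample lines are all made up for illustration.

```python
from collections import Counter

def repeated_requests(log_lines, threshold=3):
    """Return ((ip, path), count) for pairs seen at least `threshold` times."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        # Assumed layout: ip - - [date zone] "METHOD path protocol" status size
        if len(parts) < 7:
            continue  # skip lines that do not match the assumed layout
        ip, path = parts[0], parts[6]
        counts[(ip, path)] += 1
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]

# Invented sample lines using documentation-only IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2012:13:55:36 -0700] "GET /data/prices.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2012:13:56:36 -0700] "GET /data/prices.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2012:13:57:01 -0700] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2012:13:57:36 -0700] "GET /data/prices.html HTTP/1.1" 200 2326',
]

for (ip, path), n in repeated_requests(SAMPLE_LOG):
    print(ip, path, n)
```

A high count is only a hint: a proxy, a crawler, or a single busy visitor can produce the same pattern, so flagged addresses still need a manual look.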

How can I get this Google RSS gadget to work?

I am trying to embed an RSS feed on a web page I am designing. The feed is for an Austin, TX Craigslist page:
http://austin.craigslist.org/search/fua?query=%22modern+salvage%22&srchType=A&minAsk=&maxAsk=
Depending on which URL I use for the feed I get one of these results:
**Feed URL**
http://www.gmodules.com/ig/creator?url=http://austin.craigslist.org/search/fua?query=%22modern%20salvage%22&srchType=A&format=rss
Error parsing module spec:
Not a properly formatted file
missing xml header
**<link> in the XML for the above URL**
http://www.gmodules.com/ig/creator?url=http://austin.craigslist.org/search/fua?query=&quot;modern%20salvage&quot;&amp;srchType=A
Information is temporarily unavailable
**I have also tried the URL in the head of the HTML doc:**
http://austin.craigslist.org/search/fua?query="modernsalvage"&srchType=A&format=rss" title="RSS feed for craigslist | furniture - all ""modern salvage"" in austin
Information is temporarily unavailable.
Although Craigslist encourages users to embed RSS feeds, I wonder if the Craigslist server is denying the request. I have a background in design, not programming. Any suggestions?
Thank You.
I'm not sure I understand what gadget you're using...
Anyway, I was able to make your page load with the RSS Reader+ gadget on a Google Sites page (won't stay up forever).
Attempts to make it work with http://www.gstatic.com/sites-gadgets/rss-sites/rss_sites.xml were unsuccessful. I think that gadget is broken, according to comments on http://www.google.ca/ig/directory?type=gadgets&url=www.google.com/ig/modules/reader.xml

Redirecting search results into an ASP.NET page

I have an ASP.NET page with a textbox and a set of choices for the user: Wikipedia, Google, Dictionary.com, Flickr, Google Images.
The user enters one or more words in the textbox and selects one of these choices.
Depending on the choice selected by the user, I wish to return the following.
Wikipedia: Return the content of, and a link to, the page for the topic matching the word.
Google: Return the top 10 results of a Google search for this word.
Flickr: Return a few images (at most 10) from a Flickr search.
Google Images: Return a few images from a Google image search.
Dictionary: Return the meaning of the word.
How can I do that?
Since you want to do some processing on the results before displaying them, your best bet is probably to issue a web request on the server to fetch your results as RSS or some other parsable XML format.
So first up, we have Wikipedia, which has API support for open search, and queries with XML or JSON output. You can get the details of the API by going to: http://en.wikipedia.org/w/api.php
I would think either the query action, or opensearch action would be what you want.
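A minimal sketch of the opensearch route, written in Python for brevity (the same HTTP request can be issued from server-side ASP.NET code). It builds the request URL against the API address given above and parses the JSON reply, which has the shape `[query, [titles], [descriptions], [urls]]`. The helper names and the canned `sample` reply are illustrative; no network call is made here.

```python
import json
from urllib.parse import urlencode

# API endpoint taken from the answer above.
API = "http://en.wikipedia.org/w/api.php"

def opensearch_url(term, limit=10):
    """Build an opensearch request URL for the given search term."""
    params = {"action": "opensearch", "search": term,
              "limit": limit, "format": "json"}
    return API + "?" + urlencode(params)

def parse_opensearch(body):
    """Turn an opensearch JSON reply into a list of (title, url) pairs."""
    _query, titles, _descriptions, urls = json.loads(body)
    return list(zip(titles, urls))

# Canned reply in the opensearch shape, used instead of a live request.
sample = json.dumps([
    "python",
    ["Python", "Python (programming language)"],
    ["", ""],
    ["https://en.wikipedia.org/wiki/Python",
     "https://en.wikipedia.org/wiki/Python_(programming_language)"],
])

print(opensearch_url("python", limit=2))
for title, url in parse_opensearch(sample):
    print(title, url)
```

To fetch live results, the URL from `opensearch_url` would be requested with any HTTP client and the response body passed to `parse_opensearch`.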
Next there is Google, which supports search results as RSS through their Active Search feature. The link takes you to the main page where you can build the query, at which point it should be easy to drop in your search terms. There is also the Google Search AJAX API, which you can find out about here (see the "Flash and other Non-Javascript Environments" section for building the URLs directly). I believe this option should give you access to Google Image results as well.
For Flickr, have a look at this App Garden page. There are several output formats available to choose from.
I wasn't able to find anything really solid on getting results from Dictionary.com, but it does appear that they have an API. You might be able to dig through Google and find some references on how to get search results as XML or JSON. There are also several other dictionary sites which may have more information about their APIs. While searching I managed to find this SO question about word lookup from Google Dictionary.
Hope this helps.
Have an iframe within your page, and then set the src of the frame to the appropriate query string that you craft from the user's input.
This can be done from javascript within the page, in response to the user selecting something in the 'choice' dropdown. You can have the appropriate urls already embedded in the javascript (as variables), and just substitute in the user's input.
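The substitution step can be sketched as follows. The answer describes doing this in JavaScript inside the page; Python is used here only to show the logic. The URL templates are assumptions about each site's search URL scheme and should be checked against the sites themselves, and `frame_src` is a made-up helper name.

```python
from urllib.parse import quote_plus

# Assumed search URL templates, one per choice in the dropdown.
SEARCH_URLS = {
    "wikipedia": "https://en.wikipedia.org/wiki/Special:Search?search={q}",
    "google": "https://www.google.com/search?q={q}",
    "flickr": "https://www.flickr.com/search/?text={q}",
}

def frame_src(choice, user_input):
    """Return the URL to assign to the iframe's src for this choice."""
    template = SEARCH_URLS[choice.lower()]
    # Encode the user's text so spaces and "&" cannot break the query string.
    return template.format(q=quote_plus(user_input))

print(frame_src("Google", "modern salvage"))
```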

Preventing RSS feed scraping?

On a WordPress site, I have both a normal blog that I want Google to detect and an RSS feed for outgoing links to other sites. I don't need or want bots to get at this other RSS feed, nor do I want people to be able to get the link for their own use.
I've disabled RSS for the main blog successfully but am not sure how to encrypt/protect/hide the RSS link for this additional feed.
I'm not sure how Facebook runs a newsfeed without RSS but however they do it is probably beyond my means/experience to replicate.
Where these are just outgoing links, I don't think copyright notices in the feed will do much. Maybe there is a way to output the links automatically through a means other than RSS?
Use robots.txt (see www.robotstxt.org) to prevent Google from following the link. All self-respecting robots should follow the directives in the robots.txt file. This file needs to go in the root of your site.
The basic answer to this is to deliver the feed entries by some means other than the actual RSS, such as outputting JSON or going through the API.
It will help prevent scraping though not completely.
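As a rough illustration of the JSON route, the outgoing links could be serialized like this and loaded by the page's own script instead of being published as a feed. The sketch is in Python rather than WordPress's PHP, and the function name, field names, and sample entry are invented.

```python
import json

def links_as_json(entries):
    """Serialize (title, url) pairs as a JSON array of objects."""
    return json.dumps([{"title": t, "url": u} for t, u in entries], indent=2)

# Invented sample entry.
print(links_as_json([("Example site", "http://example.com/")]))
```

Anything the browser can load, a determined scraper can load too, which is why this only raises the bar rather than closing the door.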
