I think this question has been answered here before, but I could not find the desired topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data against that name, and if more than one result is found, the data will be grouped according to their names.
All I know is that Google has some kind of restriction on scraping. They provide a custom search API. I still have not used that API, but I am hoping to get all the resulting links for a query from it. However, I could not understand what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can raise that to around 3k a day for 2,000 USD/year, and higher still by getting in contact with them directly.
You will not be able to get accurate ranking positions from this method. If you only need a low number of requests and are mainly interested in getting some websites for a keyword, this is the choice.
Starting point would be here: https://code.google.com/apis/console/
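For illustration, here is a minimal sketch of querying the Custom Search JSON API; the API key and engine ID are placeholders you obtain from that console:

    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"      # placeholder, created in the API console
    ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder custom search engine ID

    def google_search(query, start=1):
        params = urllib.parse.urlencode({
            "key": API_KEY, "cx": ENGINE_ID, "q": query, "start": start,
        })
        url = "https://www.googleapis.com/customsearch/v1?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        # each result item carries a title, link and snippet
        return [(item["title"], item["link"]) for item in data.get("items", [])]

    for title, link in google_search("john smith"):
        print(title, link)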
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also allows you to gather a large number of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains it in more detail and is quite accurate.
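As a rough illustration of the throttling idea only (the proxy addresses and delay values are made-up placeholders, and the HTML parsing is left out because Google's markup changes frequently):

    import random
    import time
    import urllib.parse
    import urllib.request

    # placeholder proxy list -- in practice these would be your own IPs
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

    def fetch_results_page(query, proxy):
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
        return opener.open(url).read()

    for query in ["first keyword", "second keyword"]:
        html = fetch_results_page(query, random.choice(PROXIES))  # spread load across IPs
        time.sleep(random.uniform(10, 30))  # long random pauses keep each IP under the limit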
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be banned by IP/captcha. Not getting detected should be a priority.
Related
We were checking the newly implemented Google Analytics for our mobile app and, surprisingly, there are a lot of visitors from multiple countries; in actuality, we haven't released our app to any store and it's just a beta among 5 main users.
After checking the Google Analytics report in detail, we found that it got spammed by a bot called "Trumps Bot". When it hits your account you can see the following lines in your language section:
“Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!”
There are a lot of solutions available to keep this data out of your reports using filters, but I was wondering if there is any concrete way to permanently remove this data from my reports, and also whether there is anything we can do to avoid such data in the future, as it seriously affects business strategy.
Due to the technology used in Google Analytics, the only way to eliminate this referral is to use a filter: look for one common point of all these hits. This case is a hard one, because all the parameters change except for the language, for a well-known reason: so that you see the spam.
So try filtering on the language; in my case it works.
I highly recommend you read the community policy; this could be considered an off-topic question.
Analytics spammers are always trying to find new ways of getting attention, and with this one, this spammer hit it big.
It is not possible to permanently remove it unless you delete the whole property, but you can create an advanced segment to get a clean view.
But the most important part is blocking it so it doesn't pollute your data. For this particular type of spam you should create a custom exclude language filter with this expression:
\s[^\s]*\s|.{15,}|\.|,
That expression will block any hit that doesn't use a proper language code. That, combined with a valid hostname filter, should prevent most of the current spam and save you a lot of headaches.
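To see what the expression catches, here is a quick check; this is a sketch using Python's re module, which is close enough for this pattern to the RE2 syntax Analytics filters use:

    import re

    # blocks embedded spaces, values of 15+ characters, literal dots, and commas
    pattern = re.compile(r"\s[^\s]*\s|.{15,}|\.|,")

    samples = [
        "en-us",  # legitimate language code -> kept
        "en-gb",  # legitimate -> kept
        "Secret.google.com You are invited! Enter only with this ticket URL.",  # spam -> excluded
    ]
    for lang in samples:
        print(lang, "->", "excluded" if pattern.search(lang) else "kept")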
If you need help, you can check this step-by-step guide for building these filters and creating the advanced segment to remove the spam from your historical data.
Here is also a related question.
Log in to your Google Analytics account
Select the ADMIN section
Click on All Filters -- Add Filter
Give the filter a name such as "Include only website traffic"
In the Predefined section, select Include Only
Google has an API for downloading search suggestions:
https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/xml_reference/query_suggestion.html
Unfortunately, as far as I can tell, these results are specific to your location. For an analysis, I would like to be able to define the city/location that Google thinks it is making the suggestion for. Here's what happens when I scrape from Dar es Salaam, Tanzania:
http://suggestqueries.google.com/complete/search?client=firefox&q=insurance
["insurance",["insurance","insurance companies in tanzania","insurance group of tanzania","insurance principles","insurance act","insurance policy","insurance act tanzania","insurance act 2009","insurance definition","insurance industry in tanzania"]]
I understand that a VPN would partially solve this issue, but only by giving me a different location, not lots of locations. Is there a reasonable way to replicate this sort of thing quickly and easily from, say, the 100 largest cities in the United States?
I have confirmation that results differ within the USA.
Thanks!
Google will use your IP and your location history (if turned on) to determine your location.
To get around that, you can spoof your IP while logged out of your Google account (though I don't know whether Google will consider it an attempted hack, whatever your intentions are).
Another way is to use the Tor Browser (even though that is not its original purpose). You can configure Tor to exit from a certain country using the ExitNodes parameter in the torrc config file.
As found in the docs:
ExitNodes node,node,…
A list of identity fingerprints, country codes, and address patterns of nodes to use as exit node
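For example, to force exits in the United States, the torrc entries would look something like this (StrictNodes is optional, but it makes the country constraint hard rather than best-effort):

    # torrc -- only build circuits that exit in the US
    ExitNodes {us}
    StrictNodes 1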
But if you want a fast way to do it, I don't think that's possible, since Google wants to know the real location of its users and has put a lot of effort into making such tricks fail.
The hl param for interface language changes the search results, but I can't tell if it's actually changing the location. For example:
http://suggestqueries.google.com/complete/search?client=chrome&q=why&hl=FR
Here's an example with 5 different values of hl:
http://jsbin.com/tusacufaza/edit?js,output
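To compare several hl values side by side without the jsbin page, something like this works (a sketch; the client=firefox variant returns plain JSON, which is simpler to parse than the chrome one):

    import json
    import urllib.parse
    import urllib.request

    def suggestions(query, hl):
        params = urllib.parse.urlencode({"client": "firefox", "q": query, "hl": hl})
        url = "http://suggestqueries.google.com/complete/search?" + params
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)[1]

    for hl in ["en", "fr", "de", "es", "pt"]:
        print(hl, suggestions("why", hl)[:3])  # first three suggestions per language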
I'll be working on a project that requires a live output of the number of tweets users have hashtagged on Twitter, as well as the tweets themselves. Something along the lines of MTV's Twitter Tracker: http://vma-twittertracker.mtv.com/live/#buzz.
What intrigued me about this site is how they can constantly make API calls to Twitter without breaching the request limit.
I'd appreciate it if anyone could guide me on the most effective way to accomplish this. From the research I've carried out thus far, I presume I will need to use Twitter's Streaming API.
Since the number of tweets output to my page could be in the thousands (AJAX-loaded), along with stats on the number of retweets/favourites, what would be the most scalable approach within my .NET site? Any examples or guidance would be appreciated.
Check out Linq2Twitter. It is a great wrapper around the Twitter API, and it provides two mechanisms that will help you:
There is a search function that allows you to search for hash tags, etc., which will limit the amount of data you are getting back.
You have the option to request all the data since a certain tweet ID. You can therefore search the feed incrementally by starting each subsequent call from the ID you left off on (see the sketch below).
I have used this many times to search the public feed and have not had any issues to date. I think the search function is key to not requesting too much. Good luck!
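Linq2Twitter itself is C#, but the incremental since_id pattern is language-agnostic. Here is a hedged Python sketch against Twitter's v1.1 search endpoint; the endpoint shape and field names reflect that API version, and the bearer token is a placeholder:

    import json
    import time
    import urllib.parse
    import urllib.request

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder app-only auth token
    SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

    def search_newer_than(hashtag, since_id=None):
        # only request tweets newer than the last ID we have already processed
        params = {"q": hashtag, "count": 100}
        if since_id is not None:
            params["since_id"] = since_id
        req = urllib.request.Request(
            SEARCH_URL + "?" + urllib.parse.urlencode(params),
            headers={"Authorization": "Bearer " + BEARER_TOKEN})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["statuses"]

    last_id = None
    while True:
        for tweet in search_newer_than("#vma", last_id):
            last_id = max(last_id or 0, tweet["id"])
            print(tweet["user"]["screen_name"], tweet["text"])
        time.sleep(15)  # pause between polls to stay inside the rate limit window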
You can look into the Storm framework. Below are a few links for further reference:
http://storm-project.net/
https://github.com/nathanmarz/storm
Thanks for all your responses.
It looks like sites that display a lot of Twitter stats/data use approved third-party providers that have direct access to Twitter's Firehose API.
I have managed to get in contact with an approved provider to supply us with the feeds of data required (and it ain't cheap!).
I'm trying to create a filter on a Google Analytics profile. I'd like it to include only traffic that has come as a result of searching for a specific search term.
For example, imagine I'm interested only in people who have arrived at my site having searched for the word 'dog'. I don't care about any other visitors, so I want all my reports to be filtered for people who have searched for 'dog' to get to the site.
I have tried this a few ways, but I'm not convinced they're working. My latest attempt was the following:
Edit filter
Filter type: Include
Filter field: Referral
Filter pattern: (\?|&)(q|p)=.*dog.*([^&]*)
Case sensitive: no
At the moment, this appears to be letting through traffic that has not come from a search engine. It would be great if someone could explain what I need to do to get it to work correctly!
Many thanks,
Katie
[P.S. I realise this may sound like a strange request, but it's partly to help me learn a bit more about filters]
Add another Include filter where medium=organic
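For instance, the pair of filters might look like this, following the layout from the question (Campaign Medium/Campaign Term are my assumption for the field names as they appear in the filter setup, and 'dog' is the keyword from the question):

    Filter 1 -- Filter type: Include, Filter field: Campaign Medium, Filter pattern: organic
    Filter 2 -- Filter type: Include, Filter field: Campaign Term, Filter pattern: dog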
You could also do it using Advanced Segments instead of having to set up a new profile for each keyword :-)
Having looked through the questions already on SO, I can't seem to find the answer on how to track a form that has multiple steps on one page. I saw an example that Google gives but could not really understand the way they were presenting it. What we have is a one-page order form, and we need to track the users that come from a website and end up ordering. The whole ordering process is done with one file, so I don't know how to track whether or not someone has actually completed the order. Any help would be great, even directing me to better examples than what Google has shown me.
Thank you
Rob
Just call the JS function _gaq.push(['_trackPageview', '/form/stepXX']); each time the process reaches a new step.
You can pass any text string you want as a parameter.
Then you can configure a Goal and a funnel in GA with all the major steps of the process, as sketched below.
You can also track Events, in case of errors for example.
(this uses the GA Async syntax)
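A sketch of how the goal and funnel could then be set up, assuming a three-step form and the '/form/stepXX' virtual pageviews from above:

    Goal type: URL Destination
    Goal URL: /form/step3   (the final confirmation step)
    Funnel step 1: /form/step1
    Funnel step 2: /form/step2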