How to spoof location so Google autocomplete API will provide local results, ideally with R

Google has an API for downloading search suggestions:
https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/xml_reference/query_suggestion.html
Unfortunately, as far as I can tell, these results are specific to your location. For an analysis, I would like to be able to define the city/location that Google thinks it is making the suggestion to. Here's what happens when I scrape from Dar es Salaam, Tanzania:
http://suggestqueries.google.com/complete/search?client=firefox&q=insurance
["insurance",["insurance","insurance companies in tanzania","insurance group of tanzania","insurance principles","insurance act","insurance policy","insurance act tanzania","insurance act 2009","insurance definition","insurance industry in tanzania"]]
I understand that a VPN would partially solve this issue, but only by giving me one different location rather than lots of locations. Is there a reasonable way to replicate this sort of thing quickly and easily from, say, the 100 largest cities in the United States?
confirmation that results differ within the usa-
thanks!

Google will use your IP and your location history (if turned on) to determine your location.
To get around it, you can spoof your IP while logged out of your Google account (but I don't know whether Google will treat that as a hacking attempt, no matter what your intentions are).
Another way is to use Tor Browser (even though that is not its original purpose). You can configure Tor to exit from a certain country using the ExitNodes option in the torrc config file.
As found in the docs:
ExitNodes node,node,…
A list of identity fingerprints, country codes, and address patterns of nodes to use as exit node
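As a rough sketch of what that torrc line could look like, here is a small Python helper that renders it. The function name is made up for illustration; the only Tor-specific detail relied on is that country codes in ExitNodes are written in braces, e.g. {us}. Note that this selects exits by country only, so it would not get you down to city level.

```python
def render_exit_nodes_line(country_codes):
    """Return a torrc ExitNodes line for the given two-letter country codes.

    Hypothetical helper for illustration; it only formats text and does not
    talk to Tor or validate the codes.
    """
    # Tor expects each country code wrapped in braces, e.g. {us}.
    nodes = ",".join("{" + code.lower() + "}" for code in country_codes)
    return "ExitNodes " + nodes


# Print a line that could be pasted into torrc.
print(render_exit_nodes_line(["US"]))
```

You would add the printed line to your torrc and restart Tor for it to take effect.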
But if you want a fast way to do it, I don't think that's possible, since Google wants to know the real location of its users and has put a lot of effort into making such tricks fail.

The hl param for interface language changes the search results, but I can't tell if it's actually changing the location. For example:
http://suggestqueries.google.com/complete/search?client=chrome&q=why&hl=FR
Here's an example with 5 different values of hl:
http://jsbin.com/tusacufaza/edit?js,output
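To compare several hl values side by side, something along these lines might work. It is only a sketch: the helper names are invented, it uses just the client, q and hl parameters that appear in the URLs above, and it assumes the response keeps the [query, [suggestions...]] shape shown in the question. Fetching the URLs is left out on purpose.

```python
import json
from urllib.parse import urlencode

SUGGEST_ENDPOINT = "http://suggestqueries.google.com/complete/search"


def build_suggest_url(query, hl=None, client="firefox"):
    """Build a suggestion URL; hl is only added when given."""
    params = {"client": client, "q": query}
    if hl is not None:
        params["hl"] = hl
    return SUGGEST_ENDPOINT + "?" + urlencode(params)


def parse_suggestions(body):
    """Split a response body into (query, list of suggestions).

    Assumes the first element is the query and the second is the list
    of suggestions, as in the sample response from the question.
    """
    payload = json.loads(body)
    return payload[0], list(payload[1])


# Shortened copy of the sample response from the question, not live data.
sample = '["insurance",["insurance","insurance companies in tanzania","insurance act"]]'
query, suggestions = parse_suggestions(sample)

for hl in ["en", "fr", "de"]:
    print(build_suggest_url("why", hl=hl))
```

Fetching each URL and diffing the parsed suggestion lists would show whether hl changes more than the language.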

Related

How do websites like Craigslist create content depending on the city your computer is located in

I am looking to create a website that generates content depending on your city location. The best example I found was Craigslist. They generate a web domain name like https://yourcity.craigslist.org/ when you either click on the city or it locates where you are. I was just wondering if I could get some help on how to build something like that.
The web pages are created using a template that doesn't change, populated with data that is selected from a database server, using your location to lookup appropriate items.
The subdomain (your city) is usually defined in the DNS record, just like www. There would be an entry for chicago.craigslist.org, for example.
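To make the subdomain idea concrete, here is a minimal sketch of how a server could turn the Host header into a city and then pick matching rows. Everything in it is an assumption for illustration: the example.org domain, the in-memory LISTINGS list standing in for the database, and the function names.

```python
# Stand-in for the database table of items; made-up sample rows.
LISTINGS = [
    {"city": "chicago", "title": "Used bike"},
    {"city": "boston", "title": "Desk lamp"},
    {"city": "chicago", "title": "Bookshelf"},
]


def city_from_host(host, base_domain="example.org"):
    """Return the subdomain part of a Host header, or None if there is none."""
    host = host.split(":")[0].lower()  # drop an optional port
    suffix = "." + base_domain
    if not host.endswith(suffix):
        return None
    return host[: -len(suffix)] or None


def listings_for_host(host):
    """Select the titles whose city matches the request's subdomain."""
    city = city_from_host(host)
    return [item["title"] for item in LISTINGS if item["city"] == city]


print(listings_for_host("chicago.example.org"))
```

The same unchanging template would then be rendered with whatever listings_for_host returns.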
edit
If you're asking how they know where you are, they can take a guess based on your IP address; however, this isn't very reliable. Google also does this when serving search results that could be localized.
So yeah, you are expected to type some stuff into Google to (try to) find your answer; searching for "detect city from javascript" will bring up a lot of results for your problem.
But yeah, you would use a service like https://ipstack.com/ to detect where the visitor is; the accuracy depends on where they live. (The EU has some rules and regulations that make it a lot less accurate than it would be in the US.)
Once you have a database with content, the rest is a lookup. For example, Craigslist has a database of second-hand items sold by people from all over. When you connect to Craigslist, they ask a service where your request came from, then use some filter function based on your location to match the results.
Good luck
Your IP address can be used to make an educated guess as to where you are, but it's not very accurate. When providing you with search results that might be localised, Google also does this. To know more about creating a website like Craigslist, follow this link:
https://www.yarddiant.com/blog/classifieds/how-to-build-a-website-like-craigslist.html

Geolocation of BGP Autonomous Systems

Hi friends, I've been looking around for the past few days for a way to find the geolocation of BGP ASes, preferably through some API. I've been using the RIPEstat API for the majority of my work on this, but it comes up inconclusive on some of the ASes. For example, for AS 10000, RIPE tells me the location is JP, which is sort of fine; I would just like to narrow it down to a city / postal code / etc. if possible. Is there another API suited for this? Or is it just a manual task of fixing all the information once gathered?
Alternatively, if it is possible to grab the IP address of the actual AS itself, and not the range, that would likely work as well.
IP Geolocation isn't nearly accurate enough to pinpoint an IP to a specific City/ZIP code. In many cases, IPs from the same block will be used across a large area in an ISP's control, so it's not possible to be very accurate. Autonomous Systems don't really have "an IP", as there's no one specific location of them.
If you're looking for the locations where they peer to other providers, you might want to check out PeeringDB.

Finding the number of common users between two websites

There are two Swiss (.ch) websites, let's call them A and B. A is owned by me and B by a customer.
Because of legal data protection issues B is hosted in Switzerland and not allowed to store any user information abroad. Which means that software like Google Analytics is not available on B. A is a Swiss website but hosted in a (European) cloud.
Now we would like to find out how many common users we both have over the duration of 30 days. In short:
numberOfUsersA ∩ numberOfUsersB
For the sake of simplicity: Instead of users we are perfectly happy to measure common browsers.
What would you suggest is the simplest way to solve this problem?
First of all, best regards from Zurich/Zug :) Swiss people are everywhere...
I don't think you're correct that it's not legal to collect data in Switzerland at all (including abroad). As I work in the financial industry, I know this topic very well, and we also had to do a lot of research before we could use GA at all.
It's always a question of what data you collect and how. What you can't do, unless you got the user's permission up front, is store personally identifiable information. GA doesn't allow that anyway: you can't import or save, for example, email addresses in custom dimensions/metrics.
Please check https://support.google.com/adsense/answer/6156630?hl=en as general basic information about this topic.
If you save the IP addresses via IP anonymization, you shouldn't run into problems as long as you declare this in your data-privacy statement. Take this approach: https://support.google.com/analytics/answer/2763052?hl=en
I'm not a lawyer and don't want to give you legal advice, but ours told us that's fine. If you are really paranoid about sending data to the USA, like we have to be, you can exclude very sensitive forms from your tracking.
To go back to your basic question: if you want to find this out via Google Analytics, your key is "cross domain tracking". Check https://support.google.com/analytics/answer/1034342?hl=en for more information in this direction.
The only workaround I have in mind besides this is to start collecting browser fingerprints yourself and then join both collections on the fingerprints (that's not reliable, as your visitors will use more than one device/configuration). Personally, I would go for the IP anonymization, exclude very sensitive forms, make sure your data-privacy declaration contains all the necessary parts, and offer an opt-out option; then you should be on the safe side.
All the best and TGIF :)
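The fingerprint workaround mentioned above could be sketched roughly like this. It is an illustration under assumptions, not a vetted design: the choice of fingerprint attributes, the shared salt, and the function names are all made up here. The idea is that both sites hash the same attributes with the same salt, exchange only the hashes, and count the overlap.

```python
import hashlib


def browser_token(user_agent, language, screen, shared_salt):
    """Hash a few browser attributes into an opaque token.

    The attribute set is an assumption; both sites must use the same
    attributes and the same salt for the tokens to be comparable.
    """
    raw = "|".join([user_agent, language, screen])
    return hashlib.sha256((shared_salt + raw).encode("utf-8")).hexdigest()


def count_common(tokens_a, tokens_b):
    """Number of distinct tokens seen on both sites."""
    return len(set(tokens_a) & set(tokens_b))


# Made-up visitors for sites A and B over the same 30-day window.
salt = "agreed-between-a-and-b"
site_a = [
    browser_token("Firefox/115", "de-CH", "1920x1080", salt),
    browser_token("Chrome/120", "fr-CH", "1366x768", salt),
]
site_b = [
    browser_token("Firefox/115", "de-CH", "1920x1080", salt),
    browser_token("Safari/17", "it-CH", "2560x1440", salt),
]
print(count_common(site_a, site_b))
```

Because different visitors can share the same configuration and one visitor can use several devices, the count is only an approximation of common users.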

Scrape all google search result for a specific name

I think this question has been answered here before, but I could not find the topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data for that name and, if more than one match is found, the data will be grouped by name.
All I know is that Google has some kind of restriction on scraping. They provide a Custom Search API. I have not used that API yet, but I am hoping to get all the result links for a query from it. But I could not work out what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have said a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a low number of requests and are mainly interested in getting some websites for a keyword, that's the choice.
Starting point would be here: https://code.google.com/apis/console/
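For route a), a minimal sketch of paging through the Custom Search JSON API might look like the following. It only builds request URLs and pulls links out of an already-decoded response; the API key and engine ID are placeholders you would get from the console, the helper names are invented, and the sample response is made up to show the shape (an "items" list whose entries carry a "link").

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_urls(query, api_key, engine_id, pages=3, per_page=10):
    """Build one request URL per result page (start is 1-based)."""
    urls = []
    for page in range(pages):
        params = {
            "key": api_key,    # placeholder API key
            "cx": engine_id,   # placeholder search engine ID
            "q": query,
            "num": per_page,
            "start": page * per_page + 1,
        }
        urls.append(CSE_ENDPOINT + "?" + urlencode(params))
    return urls


def extract_links(response_json):
    """Collect result links from one decoded JSON response."""
    return [item["link"] for item in response_json.get("items", [])]


urls = build_cse_urls("john smith", "YOUR_API_KEY", "YOUR_ENGINE_ID")

# Made-up response fragment, only to illustrate the shape.
sample_response = {"items": [{"link": "http://example.com/a"}, {"link": "http://example.com/b"}]}
print(extract_links(sample_response))
```

Each collected link can then be fetched and parsed separately to group the data by name.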
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also lets you get a large number of results, if done right.
You can Google for code, the most advanced free (PHP) code I know is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains it in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban or captcha. Not getting detected should be a priority.

How to decode google gclids

Now, I realise the initial response to this is likely to be "you can't" or "use analytics", but I'll continue in the hope that someone has more insight than that.
Google AdWords with "autotagging" appends a "gclid" (presumably "Google click ID") to the link that sends you to the advertised site. It appears in the web log since it's a query parameter, and it's used by Analytics to tie that visit to the ad/campaign.
What I would like to do is to extract any useful information from the gclid in order to do our own analysis on our traffic. The reasons for this are:
Stats are imperfect, but if we are collating them, we know exactly what assumptions we have made, and how they were calculated.
We can tie the data to the rest of our data and produce far more accurate stats wrt conversion rate.
We don't have to rely on javascript for conversions.
Now it is clear that the gclid is base64 encoded (or some close variant), and some parts of it vary more than others. Beyond that, I haven't been able to determine what any of it relates to.
Does anybody have any insight into how I might approach decoding this, or has anybody already related gclids back to campaigns or even accounts?
I have spoken to a couple of people at google, and despite their "don't be evil" motto, they were completely unwilling to discuss the possibility of divulging this information, even under an NDA. It seems they like the monopoly they have over our web stats.
By far the easiest solution is to manually tag your links with Google Analytics campaign tracking parameters (utm_source, utm_campaign, utm_medium, etc.) and then pull out that data.
The gclid is dependent on more than just the adwords account/campaign/etc. If you click on the same adwords ad twice, it could give you different gclids, because there's all sorts of session and cost data associated with that particular click as well.
Gclid is probably not 100% random, true, but I'd be very surprised and concerned if it were possible to extract all your Adwords data from that number. That would be a HUGE security flaw (i.e. an arbitrary user could view your Adwords data). More likely, a pseudo-random gclid is generated with every impression, and if that ad is clicked on, the gclid is logged in Adwords (otherwise it's thrown out). Analytics then uses that number to reconcile the data with Adwords after the fact. Other than that, there's no intrinsic value in the gclid number itself.
In regards to your last point, attempting to crack or reverse-engineer this information is explicitly forbidden in both the Google Analytics and Google Adwords Terms of Service, and is grounds for a permanent ban. Additionally, the TOS that you agreed to when signing up for these services says that it is not your data to use in any way you feel like. Google is providing a free service, so there are strings attached. If you don't like not having complete control over your data, then there are plenty of other solutions out there. However, you will pay a premium for that kind of control.
Google makes nearly all their money from selling ads. Adwords is their biggest money-making product. They're not going to give you confidential information about how it works. They don't know who you are, or what you're going to do with that information. It doesn't matter if you sign an NDA and they have legal recourse to sue you; if you give away that information to a competitor, your life isn't worth enough to pay back the money you will have lost them.
Sorry to break it to you, but "Don't be Evil" or not, Google is a business, not a charity. They didn't become one of the most successful companies in the world by giving away their search algorithm to the first guy who asked for it.
The gclid parameter is encoded in Protocol Buffers, and then in a variant of Base64.
See this guide to decoding the gclid and interpreting it, including an (Apache-licensed) PHP function you can use.
There are basically 3 parameters encoded inside it, one of which is a timestamp. The other 2 as yet are not known.
As far as understanding what these other parameters mean—it may be helpful to compare it to the ei parameter, which is encoded in an extremely similar way (basically Protocol Buffers with the keys stripped out). The ei parameter also has a timestamp, with what seem to be microseconds, and 2 other integers.
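If you want to poke at the structure yourself, here is a rough sketch of the two layers described above: undo the Base64 variant, then walk the bytes as Protocol Buffers wire format. It assumes the variant is the URL-safe alphabet with padding stripped, which is a guess on my part, and the demo token is synthetic (built in the example itself), not a real gclid. The function names are made up.

```python
import base64


def urlsafe_b64decode_padded(text):
    """Decode URL-safe Base64, restoring any stripped '=' padding."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def read_varint(data, pos):
    """Read one protobuf varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(data):
    """Walk the wire format and return (field_number, wire_type, value) tuples."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:          # length-delimited bytes
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        elif wire_type == 1:          # fixed 64-bit
            value = data[pos:pos + 8]
            pos += 8
        elif wire_type == 5:          # fixed 32-bit
            value = data[pos:pos + 4]
            pos += 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((number, wire_type, value))
    return fields


# Synthetic message: field 1 = 150, field 2 = 3 (not a real gclid).
raw = b"\x08\x96\x01\x10\x03"
token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
print(parse_fields(urlsafe_b64decode_padded(token)))
```

On a real value, you would look for a varint field that reads plausibly as a timestamp and compare the others across many clicks.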
FYI, I just posted a quick analysis of some gclid data from my sites on this post. There definitely is some structure to the gclid, but it is difficult to decipher.
I think you can get all the goodies linked to the gclid via google's adword api. Specifically, you can query the click performance report.
https://developers.google.com/adwords/api/docs/appendix/reports#click
I've been working on this problem at our company as well. We'd like to be able to get a better sense of what our AdWords are doing but we're frustrated with limitations in Analytics.
Our current solution is to look in the Apache access logs for GET requests using the regex:
.*[?&]gclid=([^$&]*)
If that exists, then we look at the referer string to get the keyword:
.*[?&]q=([^$&]*).*
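In Python, that two-step lookup might look roughly like this. It is a sketch under assumptions: the log line is in Apache combined format (request and referer in double quotes), the patterns are slightly tightened versions of the ones above rather than the exact originals, and the sample line is invented.

```python
import re
from urllib.parse import unquote_plus

# Stop at '&', whitespace or a quote so we don't run past the parameter.
GCLID_RE = re.compile(r'[?&]gclid=([^&\s"]*)')
QUERY_RE = re.compile(r'[?&]q=([^&\s"]*)')


def extract_gclid_and_keyword(log_line):
    """Return (gclid, keyword) for a combined-format log line, or None.

    Assumes the quoted fields appear in the order request, referer, user agent.
    """
    quoted = log_line.split('"')
    request = quoted[1] if len(quoted) > 1 else ""
    referer = quoted[3] if len(quoted) > 3 else ""
    gclid = GCLID_RE.search(request)
    if gclid is None:
        return None
    keyword = QUERY_RE.search(referer)
    return gclid.group(1), (unquote_plus(keyword.group(1)) if keyword else None)


# Invented sample line in combined log format.
line = (
    '1.2.3.4 - - [10/Oct/2010:13:55:36 +0000] '
    '"GET /landing?gclid=ABC123 HTTP/1.1" 200 512 '
    '"http://www.google.com/search?q=cheap+insurance&hl=en" "Mozilla/5.0"'
)
print(extract_gclid_and_keyword(line))
```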
An alternative option is to change your Apache web log to start logging the __utmz cookie that google sets, which should have a piece for the keyword in utmctr. Google __utmz cookie and you should be able to find plenty of information.
How accurate is the referer string? Not 100%. Firewalls and security appliances will strip it out. But parsing it out yourself does give you more flexibility than Google Analytics. It would be a great feature to send the gclid to AdWords and get data back, but that feature does not look like it's available.
EDIT: Since I wrote this we've also created our own tags that are appended to each destination url as a request parameter. Each tag is just an md5 hash of the text, ad group, and campaign name. We grab it using regex from the access log and look it up in a SQL database.
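The custom tag from the edit above could be generated along these lines. This is a sketch, not our exact code: the separator, the "adtag" parameter name, the dict standing in for the SQL table, and the sample ad are all assumptions for illustration.

```python
import hashlib


def ad_tag(ad_text, ad_group, campaign):
    """md5 hash of the ad text, ad group and campaign name."""
    # A separator keeps ("ab", "c") and ("a", "bc") from colliding.
    joined = "\x1f".join([campaign, ad_group, ad_text])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def tagged_url(url, tag, param="adtag"):
    """Append the tag to a destination URL as a request parameter."""
    separator = "&" if "?" in url else "?"
    return url + separator + param + "=" + tag


# Stand-in for the SQL lookup table: tag -> ad details.
ad = {"campaign": "Spring Sale", "ad_group": "Bikes", "ad_text": "Cheap bikes today"}
tag = ad_tag(ad["ad_text"], ad["ad_group"], ad["campaign"])
lookup = {tag: ad}

print(tagged_url("http://www.example.com/landing", tag))
```

When the tag shows up in the access log, looking it up in the table gives back the campaign, ad group and ad text.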
This is a non-programmatic way to decode the GCLID parameter. Chances are you are simply trying to figure out the campaign, ad group, keyword, placement, or ad that drove the click and conversion. To do this, you can upload the GCLID into AdWords as a separate conversion type and then segment by conversion type to drill down to the criteria that triggered the conversion. The steps are:
In AdWords UI, go to Tools->Conversions->Add conversion with source "Import from clicks"
Visit the AdWords help topic about importing conversions https://support.google.com/adwords/answer/7014069 and create a bulk load file with your GCLID values, assigning the conversions to your new "Import from clicks" conversion type
Upload the conversions into AdWords in Tools->Conversions->Conversion actions (Uploads) on left navigation
Go to campaigns tab, Segment->Conversions->Conversion name
Find your new conversion name in the segment list, this is where the conversion came from. Continue this same process on the ad groups and keywords tab until you know the GCLID originating criteria
Well, this is no answer, but the approach is similar to how you'd tackle any cryptography problem.
Possibility 1: They're just random, in which case, you're screwed. This is analogous to a one-time pad.
Possibility 2: They "mean" something. In that case, you have to control the environment.
Get a good database of them. Find gclids for your site, and others. Record all times that all clicks occur, and any other potentially useful data
Get cracking! As you have started already, start regressing your collected data against your known data, and see if you can find patterns using decryption techniques
Start scraping random gclids, and see where they take you.
I wouldn't hold high hope for this to be successful though, but I do wish you luck!
Looks like my rep is weak, so I'll just post another answer rather than a comment.
This is not an answer, clearly. Just voicing some thoughts.
When you enable auto tagging in Adwords, the gclid params are not added to the destination URLs. Rather they are appended to the destination URLs at run time by the Google click tracking servers. So, one of two things is happening:
The click servers are storing the gclid along with Adwords entity identifiers so that Analytics can later look them up.
The gclid has the entity identifiers encoded in some way so that Analytics can decode them.
From a performance perspective it seems unlikely that Google would implement anything like option 1. Forcing Analytics to "join" the gclid to Adwords IDs seems exceptionally inefficient at scale.
A different approach is to simply look at the referrer data which will at least provide the keyword which was searched.
Here's a thought: Is there a chance the gclid is simply a cryptographic hash, a la bit.ly or some other URL shortener?
In which case the contents of the hashed text would be written to a database, and replaced with a unique id.
After all, the gclid is shortening a bunch of otherwise long text.
Take this example:
www.example.com?utm_source=google&utm_medium=cpc
Is converted to this:
www.example.com?gclid=XDF
just like a URL shortener.
One would need a substitution cipher in order to reverse engineer the cryptographic hash... not an easy task: https://crypto.stackexchange.com/questions/300/reverse-engineering-a-hash
Maybe some deep digging into logs, looking for patterns, etc...
I agree with Ophir and Chris. My feeling is that it is purely a serial number / unique click ID, which only opens up its secrets when the Analytics and Adwords systems talk to each other behind the scenes.
Knowing this, I'd recommend looking at the referring URL and pulling as much as possible from this to use in your back end click tracking setup.
For example, I live in NZ, and am using Firefox. This is a search from the Firefox Google toolbar for "stack overflow":
http://www.google.co.nz/search?q=stack+overflow&ie=utf-8&oe=utf-8&aq=t&client=firefox-a&rlz=1R1GGLL_en-GB
You can see that: a) I'm using the .nz domain, b) my keyword is "stack+overflow", c) I'm running Firefox.
Finally, if you also stash the full landing page URL, you can store the GCLID, which tells you the visitor came from a paid ad, whereas if there is no GCLID, the user must have come from natural search (if URL tagging is enabled, of course).
This would theoretically allow you to then search for the keyword in your campaign and figure out which ad group they came from. Knowing the creative would probably be impossible, though, unless you split test your landing URLs or tag them somehow.
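Here is a rough sketch of that back-end check in Python, using the referring URL from my example. The function names and the "paid"/"organic"/"other" labels are just illustrative, the landing URLs are made up, and the Google check is a simple substring test rather than anything robust.

```python
from urllib.parse import urlparse, parse_qs


def first(params, name):
    """First value of a query parameter, or None if it is absent."""
    values = params.get(name)
    return values[0] if values else None


def classify_visit(landing_url, referrer_url):
    """Pull what we can from the landing URL and referrer of one visit."""
    landing_params = parse_qs(urlparse(landing_url).query)
    referrer = urlparse(referrer_url)
    referrer_params = parse_qs(referrer.query)

    gclid = first(landing_params, "gclid")
    from_google = ".google." in "." + referrer.netloc
    if gclid:
        source = "paid"       # a gclid on the landing URL means an ad click
    elif from_google:
        source = "organic"    # Google referrer without gclid
    else:
        source = "other"
    return {
        "source": source,
        "gclid": gclid,
        "keyword": first(referrer_params, "q"),
        "google_domain": referrer.netloc,
        "client": first(referrer_params, "client"),
    }


referrer = ("http://www.google.co.nz/search?q=stack+overflow&ie=utf-8&oe=utf-8"
            "&aq=t&client=firefox-a&rlz=1R1GGLL_en-GB")
print(classify_visit("http://www.example.com/?gclid=XDF", referrer))
```

Storing the returned dict per visit gives you the keyword, Google domain and paid/organic split to match against your campaigns later.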
