It seems like it's doing more than just removing duplicates, but I can't find any good documentation on it. The API docs currently just say:
dedupe=[0|1]
No explanation yet.
I'm running my own Nominatim instance that was forked from the original a few months ago, and I use the public Nominatim as a backup when mine doesn't respond. So I am interested in answers regarding the latest public Nominatim. On my own Nominatim, I haven't noticed duplicates in results.
I have noticed differences when setting dedupe to 0 or 1. Here's a diff where you can see that even with dedupe=0 there are no duplicates, yet with dedupe=1 the results are different. BTW, dedupe=1 seems to match the default when dedupe isn't set.
Maybe Nominatim is removing duplicates based on coordinates and boundaries and not just place_ids?
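In case it helps reproduce this, here's roughly how I'm generating the comparison (an illustrative Python sketch; the query and limit are just examples):

import json
import urllib.parse
import urllib.request

# Run the same Nominatim search with dedupe=0 and dedupe=1 and compare the place_ids.
BASE = "https://nominatim.openstreetmap.org/search"

def search(dedupe):
    params = urllib.parse.urlencode({"q": "main street", "format": "json",
                                     "limit": 50, "dedupe": dedupe})
    req = urllib.request.Request(BASE + "?" + params,
                                 headers={"User-Agent": "dedupe-test"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

ids_without = [r["place_id"] for r in search(0)]
ids_with = [r["place_id"] for r in search(1)]
print("dedupe=0:", ids_without)
print("dedupe=1:", ids_with)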
You could check the source code here. Basically, the dedupe parameter controls whether duplicates are removed from the search results or not; it is used when generating the query that fetches the data from the database.
The fields used when checking for duplication are the place_id and the address (the country code, postcode, address, etc.). You can check the SQL function here (the function is called get_address_by_language).
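So, conceptually, the effect is like keeping only the first result for each place_id/address combination. A rough client-side illustration of the idea (not the actual implementation, which happens inside the generated SQL query; I'm using display_name as a stand-in for the address here):

def dedupe_results(results):
    # Keep the first result per (place_id, address) combination; later duplicates are dropped.
    seen = set()
    unique = []
    for r in results:
        key = (r.get("place_id"), r.get("display_name"))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique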
I had not heard of the dedupe parameter, but I found your question while searching for Nominatim duplicate addresses.
I'm having an issue with some buildings existing twice in the Nominatim data with slightly different variants of the address (e.g. Avenue vs Boulevard suffix, front door vs back door as coordinates).
I'm pulling distance/time information for a large number of origin/destination pairs using the Google Maps API in R. I'm currently using the gmapsdistance package but have looked at a few others.
My premium API key includes 100k free queries per day. Are there any packages that can return how many are remaining? For example, the ggmap package has a geocodeQueryCheck(). The problem is I don't think this function actually returns the number remaining on your account; it doesn't ask for your API key. My guess is that it just keeps track of how many queries it has made today. The latest GitHub version has a register_google() function that does allow you to set your API key, but when I make API requests with the gmapsdistance package, geocodeQueryCheck() doesn't update.
In summary, I just want to know how many queries are left, even if I need to construct the request URL directly. When I look at the API documentation, I don't even see a URL call for this, which doesn't give me much hope.
As confirmed by #SymbolixAU, there is currently no way to do this.
Sorry, I guess this is late, but have you tried this?
sum(.GoogleDistQueryCount$elements)
I'm trying out different flight APIs from Sabre. I understand from reading that the data I'm getting back is limited in development, but I'm not sure if it really can be THAT limited or if it's me doing something wrong.
1: InstaFlights Search
First I use the city pairs lookup to get city pairs, then use them for the InstaFlights search.
The problem is that unless I use NY or London (there were 2 other cities working fine), I'm getting no response for almost ALL other cities.
I know the data is limited, and the city pairs API already returns VERY limited data, but is it really THAT limited? I feel like I must be doing something wrong, because I can't imagine that API working (in dev) for only 3 cities on 3 different dates :-/
2: Destination API
Here I first use the supported cities API, then use the results with the multi airports API, then use that for the destination API.
Again, same thing here: only 2 or 3 cities actually work, even though in the destination API, UNLIKE the InstaFlights API, the chances of 'matches' are higher since any destination could be shown for the picked origin. HERE AGAIN there are almost no results, except for about 3 cities.
If anyone who has some experience with Sabre could help out, it would be great. I'm just trying to figure out whether it's me who's using it wrong or not. Thanks!
Can you please provide the city pairs that seem to be failing for you? I just did a test of both APIs (InstaFlights and DestinationFinder) and was able to obtain results with the city pairs provided there. I changed the point of sale to FR and obtained PAR-ATH, and that worked. Also worked with ABE-MCO which is the first city pair I obtain when using POS US.
The testing environment for this API has limited data, but you should not be limited to just three cities.
Google has an API for downloading search suggestions:
https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/xml_reference/query_suggestion.html
Unfortunately, as far as I can tell, these results are specific to your location. For an analysis, I would like to be able to define the city/location that Google thinks it is making the suggestion for. Here's what happens when I scrape from Dar es Salaam, Tanzania:
http://suggestqueries.google.com/complete/search?client=firefox&q=insurance
["insurance",["insurance","insurance companies in tanzania","insurance group of tanzania","insurance principles","insurance act","insurance policy","insurance act tanzania","insurance act 2009","insurance definition","insurance industry in tanzania"]]
I understand that a VPN would partially solve this issue, but only by giving me a different location, not lots of locations. Is there a reasonable way to replicate this sort of thing quickly and easily from, say, the 100 largest cities in the United States?
Confirmation that results differ within the USA:
thanks!
Google will use your IP and your location history (if turned on) to determine your location.
To get around it, you can spoof your IP while logged out of your Google account (but I don't know if Google would consider it an attempt at hacking, no matter what your intentions are).
Another way is to use the Tor Browser (even though that is not its original purpose). You can configure Tor to exit from a certain country using the ExitNodes parameter in the torrc config file.
As found in the docs:
ExitNodes node,node,…
A list of identity fingerprints, country codes, and address patterns of nodes to use as exit node
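For example, to force exits in a specific country you can add something like this to your torrc (the braces take a country code; StrictNodes stops Tor from falling back to other exits if none match):

# torrc: only build circuits that exit through relays in the given country
ExitNodes {us}
StrictNodes 1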
But if you want a fast way to do it, I don't think that's possible, since Google wants to know the real location of its users and has put a lot of effort into making such tricks fail.
The hl param for interface language changes the search results, but I can't tell if it's actually changing the location. For example:
http://suggestqueries.google.com/complete/search?client=chrome&q=why&hl=FR
Here's an example with 5 different values of hl:
http://jsbin.com/tusacufaza/edit?js,output
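If you want to diff them outside the browser, here's a rough Python sketch (my own, not from the jsbin) that fetches the same endpoint with several hl values:

import json
import urllib.parse
import urllib.request

# Fetch suggestions for the same query under different interface languages.
BASE = "http://suggestqueries.google.com/complete/search"

for hl in ["en", "fr", "de", "es", "sw"]:
    params = urllib.parse.urlencode({"client": "firefox", "q": "why", "hl": hl})
    with urllib.request.urlopen(BASE + "?" + params) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    suggestions = json.loads(body)[1]  # element 1 is the list of suggestion strings
    print(hl, suggestions)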
This is a question specifically for the Google Developer Relations team. I have read the Geocode API T&Cs and I am aware that I am not allowed to store data except by way of a temporary cache (e.g. for performance). Is this the end of the matter? I am developing a product which requires a search with results sorted by distance from a place, meaning that all my records need a lat/long. I was intending to use the Geocode API to get the lat/long when a user adds a record, and then add that lat/long info to the record. We would then use the Haversine formula to calculate the relative distances and sort the results.
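For reference, the sort I have in mind is just this (an illustrative Python sketch of the Haversine step, not production code):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two lat/long points.
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Sort stored records by distance from a user-supplied point.
records = [{"name": "A", "lat": 51.5074, "lon": -0.1278},
           {"name": "B", "lat": 48.8566, "lon": 2.3522}]
origin = (52.2053, 0.1218)
records.sort(key=lambda rec: haversine_km(origin[0], origin[1], rec["lat"], rec["lon"]))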
If I follow this approach, will I be in breach of the T&Cs? If so, is there another approach using the Geocode API which will allow me to hold onto lat/long data so that I can sort my results by distance, within the letter of the T&Cs?
For anyone else commenting, please observe the following restrictions: (1) we don't have a budget to buy a postcode-lat/long dataset; (2) we don't want to use a static dataset of our own, e.g. GeoNames, because we don't want to have to maintain data which is, effectively, public; (3) we have to support users who have JavaScript disabled.
To be absolutely clear, what I need here is to have the lat/long for all of my records in hand so that I can do effective searching and sorting by distance relative to another lat/long as provided, e.g. by a user searching.
Google Team, please respond to this message with contact details so we can speak.
I think this question has been answered here before, but I could not find the desired topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. It will then grab the related data for that name, and if more than one match is found, the data will be grouped according to the names.
All I know is that Google has some kind of restriction on scraping. They provide a Custom Search API. I have not used that API yet, but I'm hoping to get all the resulting links corresponding to a query from it. However, I could not figure out what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API, you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions from this method. If you only need a low number of requests and are mainly interested in getting some websites for a keyword, that's the choice.
Starting point would be here: https://code.google.com/apis/console/
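If it helps, here is a rough sketch of calling the Custom Search JSON API from Python (you need an API key and a custom search engine ID, shown here as placeholders):

import json
import urllib.parse
import urllib.request

# Fetch the first page of results for a query via the Custom Search JSON API.
API_KEY = "YOUR_API_KEY"            # created in the API console
CX = "YOUR_SEARCH_ENGINE_ID"        # the custom search engine to query

params = urllib.parse.urlencode({"key": API_KEY, "cx": CX, "q": "some specific name"})
url = "https://www.googleapis.com/customsearch/v1?" + params
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

for item in data.get("items", []):
    print(item["title"], item["link"])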
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also allows you to get a large amount of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains things in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban/captcha. Not getting detected should be a priority.