mapdist limited to <200 requests in R

I've recently been playing with the mapdist function in the ggmap package.
For small volumes of queries it works fine for me, but for larger numbers (still below the 2,500 limit) it falls over and I'm not sure why.
I've had an old colleague try this script and they get the same results as I do (they are in a different organisation, using a different computer, on a different network etc.).
Here is my testing script, which runs the same request again and again to see how many queries it manages to pass before failing. It was consistently returning 129 for a time; lately it has begun returning 127 (though the number is still consistent within a given test run).
Note that although this repeats the same postcodes, I have tried similar with a random selection of destination postcodes and get the same results.
library("ggmap")
# Setup ----------
no.of.pcd.to.check <- 500
a <- rep("SW1A 1AA",no.of.pcd.to.check) # Use a repeating list of the same postcode to remove it as a causal factor
b <- rep("WC2H 0HE",no.of.pcd.to.check) # As above
test.length <- 5 # How many iterations should the test run over
# Create results dataframe ----------
# and pre-set capacity to speed up the for loop
results.df <- data.frame(
  Iteration = seq_len(test.length),
  Result    = integer(test.length),
  Remaining = integer(test.length))
# Run the test ----------
for (i in 1:test.length) {
  x <- distQueryCheck() # Remaining number of queries before submission
  try(mapdist(a, b, mode = "driving", output = "simple", override_limit = TRUE))
  y <- distQueryCheck() # Remaining number of queries after submission
  query.use <- x - y    # Difference between pre and post (i.e. number of successful queries submitted)
  print(paste(query.use, "queries used"))
  results.df[i, "Result"] <- query.use # Save the successful query count for this iteration
  results.df[i, "Remaining"] <- y
}
I'd be really grateful for any insight on where I'm going wrong here.

So I had this same error message, and what ended up fixing it was simply changing the '#' in an address to 'Number '. I'm no expert and haven't even looked into the mapdist code, but eliminating '#' allowed me to use mapdist with no problems.
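For what it's worth, here is a minimal sketch of that workaround, assuming your addresses sit in a character vector (the example values below are made up):
addresses <- c("221B Baker St #2, London", "10 Downing St #1, London") # illustrative addresses only
clean.addresses <- gsub("#", "Number ", addresses, fixed = TRUE)       # strip the '#' before querying
# mapdist(clean.addresses, rep("SW1A 1AA", length(clean.addresses)), mode = "driving")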

Related

Twitter rate limit changes to NULL, R tweetscores package self-terminates

I am using the R tweetscores package to estimate Twitter users’ ideology score (i.e. estimating a user’s ideology based on the accounts they follow).
I am using the code below to loop through a list of usernames, get who they follow (getFriends()) and then estimate their ideology score (estimateIdeology2()).
The getFriends() function makes calls to the Twitter API until it hits the rate limit. In this case, it should wait and then resume making calls.
However, the loop seems to self-terminate after about 40 minutes.
It looks like the variable that holds the number of calls left changes from 0 to NULL after a while, causing the loop to break.
Has anyone encountered this and/or knows how to fix this issue? I have tried adapting the code to catch when this variable turns NULL and to change its value, but that doesn't prevent the loop from terminating. I would ideally like to keep this loop running rather than manually restarting it every 40 minutes. The raw code for the getFriends() function is here (it seems to break at line 47): https://github.com/pablobarbera/twitter_ideology/blob/master/pkg/tweetscores/R/get-friends.R
for (user in usernames$user_screen_name) {
  skip_to_next <- FALSE
  tryCatch({
    friends <- getFriends(screen_name = user, oauth = my_oauth)
    results <- estimateIdeology2(user, friends)
  }, error = function(e) { skip_to_next <<- TRUE })
  if (skip_to_next) { next }
  print("results computed successfully.")
  user_scores[nrow(user_scores) + 1, ] <- list(screen_name = user,
                                               ideology_score = results)
}
The tweetscores package uses API v1 endpoints and the rtweet package. These are being replaced by API v2 and academictwitteR, so I would suggest getting the friends list through academictwitteR:
get_user_following(user_ids, bearer_token)
But the rate limits are real: you can make 15 requests per 15-minute window. So if your users only follow a handful of accounts (so that no pagination is required), in the best-case scenario you can fetch the follow list of one user per minute. If you have hundreds of thousands of users, this could take ages. I'm still looking for ways to work around this issue.
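As a rough sketch of pacing the calls yourself so the loop never runs into that window (the bearer token and IDs below are placeholders):
library(academictwitteR)
bearer_token <- "YOUR_BEARER_TOKEN"   # placeholder
user_ids <- c("12345", "67890")       # placeholder user IDs
following <- list()
for (id in user_ids) {
  following[[id]] <- tryCatch(
    get_user_following(id, bearer_token),
    error = function(e) NULL          # skip users that error out
  )
  Sys.sleep(60)                       # ~1 request per minute stays under 15 per 15-minute window
}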

Hitting Query Limit in Google Distance Matrix API on R

I have a list of 36 locations for which I have to get a distance matrix from each location to every other location, i.e. a 36x36 matrix. Using help from other questions on this topic on this forum, I was able to put together a basic code (demonstrated with four locations only) as follows:
library(googleway)
library(plyr)

key <- "VALID KEY" # removed for security reasons

districts <- c("Attock, Pakistan",
               "Bahawalnagar, Pakistan",
               "Bahawalpur, Pakistan",
               "Bhakkar, Pakistan")

# Calculate pairwise distance between each location
lst <- google_distance(origins = districts, destinations = districts, key = key)

res.lst <- list()
for (i in 1:length(districts)) {
  e.row <- cbind(districts[i],
                 distance_destinations(lst),
                 distance_elements(lst)[[i]][['distance']])
  res.lst[[i]] <- e.row
}

# View results as a list
res.lst

# Combine each element of the list into a data frame
res.df <- ldply(res.lst, rbind)

# Give names to the columns
colnames(res.df) <- c("origin", "destination", "dist.km", "dist.m")

# Display the result
res.df
This code works fine for a small number of queries, i.e. a few locations at a time (e.g. 5). For anything larger, I get an "Over-Query-Limit" error with the message "You have exceeded your rate-limit for this API", even though I have not reached the 2,500 limit. I also signed up for the 'Pay-as-you-use' billing option, but I continue to get the same error. I wonder if this is an issue of how many requests are being sent per second (i.e. the rate)? And if so, can I modify my code to address this? Even without an API key, this code does not ask for more than 2,500 queries, so I should be able to do it, but I'm stumped about how to resolve this even with billing enabled.
The free quota is 2500 elements.
Each query sent to the Distance Matrix API is limited by the number of allowed elements, where the number of origins times the number of destinations defines the number of elements.
Standard Usage Limits
Users of the standard API:
2,500 free elements per day, calculated as the sum of client-side and server-side queries.
Maximum of 25 origins or 25 destinations per request.
A 36x36 request would be 1,296 elements, so after two such requests you would be out of quota (and 36 origins also exceeds the 25-origin maximum per request).
For anyone still struggling with this issue: I was able to resolve it by using a while loop. Since I was well under the 2,500-query limit, this was a rate problem rather than a total-quota problem. With a while loop, I broke the locations into chunks (running distance queries for 2x36 at a time) and looped over the entire data set to build up the 36x36 matrix I needed.
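Something along these lines (a sketch of the idea rather than my exact code; the chunk size and pause are illustrative):
library(googleway)
chunk.size <- 2
i <- 1
res.lst <- list()
while (i <= length(districts)) {
  origins.chunk <- districts[i:min(i + chunk.size - 1, length(districts))]
  res.lst[[length(res.lst) + 1]] <- google_distance(origins = origins.chunk,
                                                    destinations = districts,
                                                    key = key)
  Sys.sleep(1)   # pause between requests so the per-second rate limit is not hit
  i <- i + chunk.size
}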

How to Efficiently Download One Million Images with R and Fully Utilize Computer/Network Resources

I am attempting to download data consisting of approximately 1 million JPG files for which I have individual URLs and desired file names. The images have a mean file size of approximately 120 KB and range from 1 KB to 1 MB. I would like to use R to download the images.
I've tried a few things and eventually figured out a way that has let me download all million images in under three hours. My current strategy works, but it is a somewhat absurd solution that I would prefer not to use ever again, and I'm baffled as to why it even works. I would like to understand what's going on and to find a more elegant and efficient way of achieving the same result.
I started out with mapply and download.file(), but this only managed a rate of 2 images per second. Next, I parallelized the process with the parallel package. This was very effective and improved the rate to 9 images per second. I assumed that would be the most I could achieve, but I noticed that the resources being used by my modest laptop were nowhere near capacity. I checked to make sure there wasn't a significant disk or network access bottleneck, and sure enough, neither was running at much more than ~10% of its capacity.
So I split up the URL information and opened a new R console window where I ran a second instance of the same script on a different segment of the data, achieving 18 images per second. Then I just continued to open more and more instances, giving each of them a unique section of the full list of URLs. It was not until I had 12 open that there was any hint of slowing down. Each instance gave a nearly linear increase in downloads per second, and with some memory management I approached my maximum download speed of 13 MB/s.
I have attached a graph showing the approximate total images being downloaded per second as a function of the number of instances running.
Also attached is a screenshot of my resource monitor while 10 simultaneous instances of R were running.
I find this result very surprising and I don't quite understand why this should be possible. What's making each individual script run so slowly? If the computer can run 12 instances of this code with little to no diminishing returns, what prevents it from just running 12 times as fast? Is there a way to achieve the same thing without having to open up entirely new R environments?
Here is the code I am asking about specifically. Unfortunately I cannot disclose the original URLs, but the script is nearly identical to what I am using. I have replaced my data with a few CC images from Wikimedia. For better replication, please replace "images" with your own large URL list if you have access to such a thing.
library(parallel)
library(data.table)
images <-
  data.table(
    file = c(
      "Otter.jpg",
      "Ocean_Ferret.jpg",
      "Aquatic_Cat.jpg",
      "Amphibious_Snake_Dog.jpg"
    ),
    url = c(
      "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Otter_and_Bamboo_Wall_%2822222758789%29.jpg/640px-Otter_and_Bamboo_Wall_%2822222758789%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Otter_Looking_Back_%2817939094316%29.jpg/640px-Otter_Looking_Back_%2817939094316%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Otter1_%2814995327039%29.jpg/563px-Otter1_%2814995327039%29.jpg",
      "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Otter_Profile_%2817962452452%29.jpg/640px-Otter_Profile_%2817962452452%29.jpg"
    ) # Full URLs are redundant and unnecessary, but I kept them in case there was some performance advantage over combining strings inside download.file.
  )
# Download with mapply (just for benchmarking, not actually used in the script)
system.time(
  mapply(
    function(x, y)
      download.file(x, y, mode = 'wb', quiet = TRUE),
    x = images$url,
    y = images$file,
    SIMPLIFY = "vector",
    USE.NAMES = FALSE
  )
)

# Parallel download with clusterMap (this is what each instance is running;
# I give each instance a different portion of the images data table)
cl <- makeCluster(detectCores())
system.time(
  clusterMap(
    cl,
    download.file,
    url = images$url,
    destfile = images$file,
    quiet = TRUE,
    mode = 'wb',
    .scheduling = 'dynamic',
    SIMPLIFY = 'vector',
    USE.NAMES = FALSE
  )
)
In summary, the questions I am asking are:
1) Why is my solution behaving this way? More specifically, why is one script not fully utilizing my computer's resources?
2) What is a better way to achieve the following with R: download 120 GB, composed of one million JPEG images, directly via their URLs in under 3 hours?
Thank you in advance.
cl <- makeCluster(detectCores())
This line says to make a backend cluster with a number of nodes equal to your cores. That would probably be 2, 4 or 8, depending on how beefy a machine you have.
Since, as you noticed, the downloading process isn't CPU-bound, there's nothing stopping you from making the cluster as big as you want. Replace that line with something like
cl <- makeCluster(50)
and you'll have 50 R sessions downloading in parallel. Increase the number until you hit your bandwidth or memory limit.
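Putting it together, the download step might look something like this (50 workers is just an example; tune it to your connection):
library(parallel)
cl <- makeCluster(50)   # many more workers than cores, since the task is I/O-bound
clusterMap(
  cl,
  download.file,
  url = images$url,
  destfile = images$file,
  quiet = TRUE,
  mode = 'wb',
  .scheduling = 'dynamic'
)
stopCluster(cl)         # release the workers when done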

Is it possible to control the speed of the lapply function?

I wrote a script to geocode a list of addresses using R and Google Maps, but it exceeds Google's 10 queries per second speed limit. I would like to slow this down to 5 queries per second.
My function constructs the request URL, and then I call it with do.call, rbind, and lapply to create my geocoded dataset.
library(httr)
library(rjson) # assumed JSON parser for fromJSON(); the question does not show which one was loaded

geoc <- function(address) {
  out <- tryCatch({
    url <- "http://maps.google.com/maps/api/geocode/json"
    response <- GET(url, query = list(sensor = "FALSE", address1 = address))
    json <- fromJSON(content(response, type = "text"))
    loc <- json$results[[1]]$geometry$location
    return(c(address1 = address, long = loc$lng, lat = loc$lat))
  })
  return(out)
}
result <- do.call(rbind,lapply(as.character(sample$location),geoc))
Is there a way to slow this down to about 5 queries per second? It works great if I'm only geocoding 5 or 10 at a time, but anything over that throws Google errors.
Thanks!
Use Sys.sleep to wait a given number of seconds before proceeding. You won't be able to use that R session for anything else while it sleeps, but you can have multiple R sessions running at the same time, so it won't prevent you from working in another session.
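For example, a small wrapper around your geoc function (a sketch, reusing the names from your question) would cap the rate at roughly 5 queries per second:
geoc_throttled <- function(address) {
  Sys.sleep(0.2)   # 0.2 s per call is about 5 queries per second
  geoc(address)
}
result <- do.call(rbind, lapply(as.character(sample$location), geoc_throttled))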

How to get Twitter's Impression and Reach with R twitteR package?

This question is about measuring Twitter impressions and reach using R.
I'm working on a Twitter analysis, "People voice about Lynas Malaysia through Twitter Analysis with R". To take it further, I wish to find out how to measure impressions, reach, and frequency from Twitter.
Definition:
Impressions: The aggregated number of followers that have been exposed to a brand/message.
Reach: The total number of unique users exposed to a message/brand.
Frequency: The number of times each unique user reached is exposed to a message.
My attempt: #1.
From my understanding, impressions are the sum of the follower counts of all the users who tweeted the specific keyword.
For #1 I wrote:
library(twitteR)

rdmTweets <- searchTwitter(cloudstatorg, n = 1500) # cloudstatorg holds the search keyword
tw.df <- twListToDF(rdmTweets)
n <- length(tw.df[, 2])
S <- 0
X <- 0
for (i in 1:n) {
  tuser <- getUser(tw.df$screenName[[i]])
  X <- tuser$followersCount
  S <- S + X
}
S
But the problem that occurs is:
Error in .self$twFromJSON(out) :
Error: Rate limit exceeded. Clients may not make more than 150 requests per hour.
For #2 and #3 I still don't have any ideas; I hope to get help here. Thanks a lot.
The problem you are having with #1 has nothing to do with R or your code; it's about the number of calls you have made to the Twitter Search API, which has exceeded the 150 calls per hour you get by default.
Depending on what you are trying to do, you can mix and match several components of the API to get the results you need.
You can read more in their docs: https://dev.twitter.com/docs/rate-limiting
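As a rough workaround under the old 150-calls-per-hour limit, you could pace the getUser() loop from #1 with Sys.sleep, for example (a sketch reusing the names from the question):
S <- 0
for (i in 1:n) {
  tuser <- getUser(tw.df$screenName[[i]])
  S <- S + tuser$followersCount
  Sys.sleep(25)   # ~144 calls per hour, just under the default 150-per-hour limit
}
S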
