Is it possible to control the speed of the lapply function? - r

I wrote a script to geocode a list of addresses using R and Google Maps, but it exceeds Google's 10 queries per second speed limit. I would like to slow this down to 5 queries per second.
My function constructs the URL, and then I call it with lapply and combine the results with do.call(rbind, ...) to create my geocoded dataset.
library(httr)    # GET(), content()
library(rjson)   # fromJSON()
geoc <- function(address){
  out <- tryCatch({
    url <- "http://maps.google.com/maps/api/geocode/json"
    # the geocoding endpoint expects the query parameter to be named "address"
    response <- GET(url, query = list(sensor = "FALSE", address = address))
    json <- fromJSON(content(response, type = "text"))
    loc <- json$results[[1]]$geometry$location
    c(address1 = address, long = loc$lng, lat = loc$lat)
  },
  error = function(e) c(address1 = address, long = NA, lat = NA))  # return NAs on failure so rbind still works
  return(out)
}
result <- do.call(rbind,lapply(as.character(sample$location),geoc))
Is there a way to slow this down to about 5 queries per second? It works great if I'm only geocoding 5 or 10 at a time, but anything more than that throws Google errors.
Thanks!

Use Sys.sleep to wait a given amount of time between requests, then proceed. You would not be able to use that R session for anything else, but you can have multiple R sessions running at the same time, so that would not prevent you from working in another R session.
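For example, a minimal sketch that reuses the geoc() function from the question and pauses 0.2 seconds after every call, which keeps the rate at roughly 5 queries per second (the wrapper name and the delay value are just illustrative):
geoc_slow <- function(address, delay = 0.2){
  res <- geoc(address)   # geoc() as defined in the question
  Sys.sleep(delay)       # 0.2 s per call is about 5 queries per second
  res
}
result <- do.call(rbind, lapply(as.character(sample$location), geoc_slow))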

Related

R bootstrap continue execution

I would like to use bootstrapping with the boot library. Since calculating the statistic from each sample is a lengthy process, it is going to take several days for the entire bootstrapping calculation to conclude. Since the computer I am using disconnects every few hours, I would like to use some checkpoint mechanism so that I will not have to start from scratch every time. Currently, I am running:
results <- boot(data=data, statistic=my_slow_function, R=10000, parallel='snow', ncpus=4, cl=cl)
but I would rather run it with R=100 multiple times so that I can save the intermediate results and retrieve them if the connection hangs up. How can I achieve that?
Thank you in advance
Maybe you can combine results for the bootstrap replicates:
# simulating R = 10000 by running 100 chunks of R = 100
results_list <- lapply(1:100, function(x) {
  boot(data = data, statistic = my_slow_function, R = 100, parallel = 'snow', ncpus = 4)$t
})
results_t <- unlist(results_list)
hist(results_t)
t0 <- mean(results_t)
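To address the checkpointing part of the question, here is a rough sketch (assuming the same data, my_slow_function and cl objects as in the question; the file naming and chunk count are illustrative) that saves each chunk with saveRDS so an interrupted run can pick up where it left off:
# run 100 chunks of R = 100, skipping chunks already saved by a previous run
n_chunks <- 100
for(i in seq_len(n_chunks)){
  chunk_file <- sprintf("boot_chunk_%03d.rds", i)
  if(file.exists(chunk_file)) next                       # computed in an earlier session
  b <- boot(data = data, statistic = my_slow_function, R = 100,
            parallel = 'snow', ncpus = 4, cl = cl)
  saveRDS(b$t, chunk_file)                               # persist this chunk's replicates
}
# combine whatever chunks exist on disk
results_t <- unlist(lapply(list.files(pattern = "^boot_chunk_.*\\.rds$"), readRDS))
t0 <- mean(results_t)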

Twitter rate limit changes to NULL, R tweetscores package self-terminates

I am using the R tweetscores package to estimate Twitter users’ ideology score (i.e. estimating a user’s ideology based on the accounts they follow).
I am using the code below to loop through a list of usernames, get who they follow (getFriends()) and then estimate their ideology score (estimateIdeology2()).
The getFriends() function makes calls to the Twitter API until it hits the rate limit. In that case, it should wait and then resume making calls.
However, the loop seems to self-terminate after about 40 minutes.
It looks like the variable that holds the number of remaining calls changes from 0 to NULL after a while, causing the loop to break.
Has anyone encountered this and/or does anyone know how to fix it? I have tried adapting the code to catch the case where this variable becomes NULL and change its value, but that doesn't prevent the loop from terminating. I would ideally like to keep this loop running and not manually restart it every 40 minutes. The raw code for the getFriends() function is here (it seems to break at line 47): https://github.com/pablobarbera/twitter_ideology/blob/master/pkg/tweetscores/R/get-friends.R
for(user in usernames$user_screen_name){
  skip_to_next <- FALSE
  tryCatch({
    friends <- getFriends(screen_name = user, oauth = my_oauth)
    results <- estimateIdeology2(user, friends)
  }, error = function(e){ skip_to_next <<- TRUE })
  if(skip_to_next) { next }
  print("results computed successfully.")
  user_scores[nrow(user_scores) + 1, ] <- list(screen_name = user,
                                               ideology_score = results)
}
The tweetscores package uses Twitter API v1 endpoints and the rtweet package. These are being replaced by API v2 and academictwitteR, so I would suggest getting the friends list through academictwitteR:
get_user_following(user_ids, bearer_token)
But rate limits are real: you can make 15 requests per 15-minute window. So if your users only follow a handful of accounts (so that no pagination is required), in the best-case scenario you can retrieve the friends of one user per minute. If you have hundreds of thousands of users, this could take ages. I am still looking for ways to work around this issue myself.
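As a rough sketch of how that could look in practice (the get_user_following() call follows the suggestion above; the user_ids vector, the bearer_token object, the per-user saveRDS checkpointing and the Sys.sleep pacing are my own illustrative additions, and the function expects numeric user IDs rather than screen names):
library(academictwitteR)
dir.create("following", showWarnings = FALSE)
for(id in user_ids){
  out_file <- file.path("following", paste0(id, ".rds"))
  if(file.exists(out_file)) next                          # fetched in an earlier run
  following <- tryCatch(get_user_following(id, bearer_token),
                        error = function(e) NULL)
  if(!is.null(following)) saveRDS(following, out_file)
  Sys.sleep(61)                                           # ~15 requests per 15 minutes
}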

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1,000 or so API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv("Games.csv")
for(i in 1:nrow(df)){
  data <- read_json(df$urls[i])
  if(data$analysisLogExists == TRUE){
    df$Analyzed[i] <- 1
  }
  if(data$analysisLogExists == FALSE){
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the rows that haven't been updated,
  ## then this still doesn't work because the write command below will not be writing the whole dataset to the csv.
  if(i %% 1000 == 0){
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
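One way to make the chunked approach work, as a rough sketch: keep the full data frame in memory, loop only over the rows whose Analyzed value is still missing, and periodically write the whole data frame back out. Using NA as the "not yet checked" sentinel and checkpointing every 1,000 rows are my own assumptions, not anything chess.com or bigchess prescribes:
library(jsonlite)
df <- read.csv("Games.csv")
todo <- which(is.na(df$Analyzed))                      # only rows not updated in earlier runs
for(i in todo){
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- as.integer(isTRUE(data$analysisLogExists))
  Sys.sleep(0.25)
  if(i %% 1000 == 0){
    write.csv(df, "Games.csv", row.names = FALSE)      # checkpoint the full data frame
  }
}
write.csv(df, "Games.csv", row.names = FALSE)          # final write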

mapdist limited to <200 requests

I've recently been playing with the mapdist function in the ggmap package.
For small volumes of queries it works fine for me, but for larger numbers (still below the 2,500 limit) it falls over and I'm not sure why.
I've had an old colleague try this script and they get the same results as I do (they are in a different organisation, using a different computer, on a different network etc.).
Here is my testing script, which runs the same request again and again to see how many queries it manages to pass before failing. It was consistently returning 129 for a time; lately it has begun returning 127 (though the number is still consistent within a given test).
Note that although this repeats the same postcodes, I have tried similar with a random selection of destination postcodes and get the same results.
library("ggmap")
# Setup ----------
no.of.pcd.to.check <- 500
a <- rep("SW1A 1AA",no.of.pcd.to.check) # Use a repeating list of the same postcode to remove it as a causal factor
b <- rep("WC2H 0HE",no.of.pcd.to.check) # As above
test.length <- 5 # How many iterations should the test run over
# Create results dataframe ----------
# and pre-set capacity to speed up the for loop
results.df <- data.frame(
Iteration=seq(1:test.length),
Result=as.integer(rep(0,test.length)),
Remaining=as.integer(rep(0,test.length)))
# Run the test ----------
for(i in 1:test.length){
x <- distQueryCheck() # Get remaining number of queries pre submission
try(mapdist(a, b, mode="driving", output="simple",override_limit=TRUE))
y <- distQueryCheck() # Get remaining number of queries post submission
query.use <- (x-y) # Difference between pre and post (ie number of successful queries submitted)
print(paste(query.use, "queries used"))
results.df[i,"Result"] <- query.use # Save successful number of queries for each test iteration
results.df[i,"Remaining"] <- y
}
I'd be really grateful for any insight on where I'm going wrong here.
So I had this same error message, and what ended up fixing it was simply changing the '#' in an address to 'Number '. I'm no expert and haven't even looked into the mapdist code, but eliminating '#' allowed me to use mapdist with no problems.
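If your origin/destination strings do contain '#', a quick clean-up before calling mapdist is straightforward. The addresses vector below is a hypothetical character vector of address strings (the postcodes in the question contain no '#', so this mainly applies to street addresses):
addresses <- c("10 Downing St, London", "Apt #4, 221B Baker St, London")
addresses <- gsub("#", "Number ", addresses, fixed = TRUE)   # '#' appears to break mapdist
mapdist(addresses, "WC2H 0HE", mode = "driving", output = "simple")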

Does clusterMap in Snow support dynamic processing?

It seems clusterMap in snow doesn't support dynamic processing. I'd like to do parallel computing with pairs of parameters stored in a data frame, but the elapsed time of each job varies a lot. If the jobs are not scheduled dynamically, the run is time-consuming.
e.g.
library(snow)
cl2 <- makeCluster(3, type = "SOCK")
df_t <- data.frame(type = c(rep('a', 3), rep('b', 3)), value = c(rep('1', 3), rep('2', 3)))
clusterExport(cl2, "df_t")
clusterMap(cl2, function(x, y){ paste(x, y) },
           df_t$type, df_t$value)
It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.
In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})
This was common before clusterMap was added to the snow package.
Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:
clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})
In this case, df_t must be exported to the cluster workers since the worker function references it. However, it is generally less efficient since each worker only needs a fraction of the entire data frame.
I found that clusterMap in the parallel package supports load balancing, but it is less efficient than the clusterApplyLB-plus-lapply approach with snow. I tried to look at the source code to figure out why, but clusterMap is not available when I click the 'source' and 'R code' links.
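For reference, a minimal sketch of the load-balanced variant in the parallel package, which exposes dynamic scheduling through the .scheduling argument of clusterMap:
library(parallel)
cl2 <- makeCluster(3)
df_t <- data.frame(type = c(rep('a', 3), rep('b', 3)), value = c(rep('1', 3), rep('2', 3)))
# .scheduling = "dynamic" hands tasks to workers as they become free (load balancing)
res <- clusterMap(cl2, function(x, y) paste(x, y),
                  df_t$type, df_t$value,
                  .scheduling = "dynamic")
stopCluster(cl2)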
