Test if socket is empty (was: Reading data from a raw socket) - r

At first, I thought the poor performance of my driver was caused by the way I read data from a socket.
This was the original function I used:
socket_char_reader = function(in_sock) {
  string_read <- raw(0)
  while ((rd <- readBin(in_sock, what = "raw", n = 1)) > 0) {
    string_read <- c(string_read, rd)
  }
  return(string_read %>% strip_CR_NUL() %>% rawToChar())
}
The results from 3 consecutive calls to this function give the expected result. It takes 0.004 seconds to read 29 bytes in total.
My second attempt reads the socket until it is empty; another function then splits the resulting raw vector into 3 separate parts.
get_response = function() {
  in_sock <- self$get_socket()
  BUF_SIZE <- 8
  chars_read <- raw(0)
  while (length(rd <- readBin(in_sock, what = "raw", n = BUF_SIZE)) > 0) {
    chars_read <- c(chars_read, rd)
  }
  return(chars_read)
},
Reading from the socket now takes 2.049 seconds!
Can somebody explain to me what could be the cause for this difference (or what is the best method for reading the socket until it is empty)?
In the meantime I'll return to my original solution and continue looking for the cause of the poor performance.
Ben
I believe I found the cause (but not the solution).
While debugging, I noticed that the delay is caused by the last call to readBin().
In socket_char_reader(), the stop-condition for the while-loop is based on the value of rd; if that value equals 0, the loop stops.
In get_response() the stop-condition is based on the number of bytes in the buffer. Before that number can be determined, readBin() first waits to see whether any more bytes will be sent to the socket.
The timeout period is set in the socketConnection() call.
private$conn <- socketConnection(host = "localhost", port,
                                 open = "w+b", server = FALSE, blocking = TRUE,
                                 encoding = "UTF-8", timeout = 1)
timeout has to be given a value > 0, otherwise it will take days before the loop stops.
Is it possible to check whether there are still any bytes in the socket without actually reading them?
Ben
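
One way to poll for pending input without consuming it is base R's socketSelect(); with timeout = 0 it returns immediately and reports whether the socket connection has data ready to be read. A minimal sketch (the helper names are illustrative, not from the thread):
has_pending <- function(sock) {
  # timeout = 0 makes socketSelect() return immediately instead of blocking
  socketSelect(list(sock), write = FALSE, timeout = 0)
}

read_available <- function(sock) {
  chars_read <- raw(0)
  # read one byte at a time, but only while the socket reports data ready,
  # so readBin() is never left waiting on the connection's timeout
  while (has_pending(sock)) {
    chars_read <- c(chars_read, readBin(sock, what = "raw", n = 1))
  }
  chars_read
}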

Related

Using the try and tryCatch functions on .WAV files

I'm trying to cut a bunch of audio (.WAV) files into smaller samples in R. For this example, I'm using a loop to cut out 1-minute samples starting at 140 minutes.
For some files, the recording ends before 140 minutes due to an error in the recording device. When this occurs, an error appears and the loop stops. I'm trying to make the loop continue by using the try or tryCatch function, but I keep getting errors.
The code is as follows:
for (i in 1:length(AR_CD288)) {
  CUT_AR288_5 <- try({readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")})
  FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
  OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
  writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i])
}
I get the following two errors from the code:
Error in readBin(con, int, n = N, size = bytes, signed = (bytes != 1), :
  invalid 'n' argument
Error in writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i]) :
  'object' needs to be of class 'Wave' or 'WaveMC'
The loop still saves some samples into the "New files" directory; however, once it reaches a file shorter than 140 minutes, the loop stops.
I am very stuck! Any help would be greatly appreciated.
Cheers.
When I use try, I always do one (or both) of:
check the return value to see if it inherits "try-error", indicating that the command failed; or
add try(., silent = TRUE), indicating that I don't care if it succeeded (but this implies that I will not use its return value, either).
Try this:
for (i in seq_along(AR_CD288)) {
  CUT_AR288_5 <- try({
    readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")
  }, silent = TRUE)
  if (!inherits(CUT_AR288_5, "try-error")) {
    FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
    OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
    writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i])
  }
}
Three notes:
I changed 1:length(.) to seq_along(.); the latter is more resilient in automated use when it is feasible that the vector might be length 0. For example, if AR_CD288 can ever be length 0, intuitively we expect 1:length(AR_CD288) to return nothing so that the for loop will not run; unfortunately, it resolves to 1:0, which returns a vector of length 2, and that will often fail (depending on whatever code is operating in the loop). The use of seq_along(.) will always return a vector of length 0 for an empty input, which is what we need (see the quick illustration below). (Alternatively and equivalently, seq_len(length(AR_CD288)), though that's really what seq_along is intended to do.)
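A quick illustration of the difference:
x <- character(0)
1:length(x)   # 1 0        -- two iterations over an empty vector
seq_along(x)  # integer(0) -- zero iterations, as intended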
If you do not add silent=TRUE (or explicitly add silent=FALSE), then you will get an error message indicating that the command failed. Unfortunately, the error message may not indicate which i failed, so you may be left in the dark as far as fixing or removing the errant file. You may prefer to add an else to the if (inherits(.,"try-error")) clause so that you can provide a clearer error, such as
if (inherits(CUT_AR288_5, "try-error")) {
  warning("'readWave' failed on ", sQuote(AR_CD288[i]), call. = FALSE)
} else {
  FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
  # ...
}
(noting that I put the "things worked" code in the else clause here ... I find it odd to do if (!...) {} else {}, seems like a double-negation :-).
The choice to wrap one function or the whole block depends on your needs: I tend to prefer to know exactly where things fail, so the will-possibly-fail functions are often individually wrapped with try so that I can react (or log/message) accordingly. If you don't need that resolution of error-detection, then you can certainly wrap the whole code-block in a sense:
for (i in seq_along(AR_CD288)) {
  ret <- try({
    CUT_AR288_5 <- readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")
    FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
    OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
    writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i])
  }, silent = TRUE)
  if (inherits(ret, "try-error")) {
    # do or log something
  }
}

Skip or exit command if R goes over specified memory limit

I would like to run a block of code that skips or exits a command if R goes over a specified memory limit at any time. To illustrate a related example, the following code will skip to the next iteration of the for loop if the code block takes more than a specified time limit. It will print: '1', 'skip', '2'.
library(R.utils)  # provides evalWithTimeout()

params = c(1, 4, 2)
for (i in params) {
  tryCatch(
    expr = {
      evalWithTimeout({
        Sys.sleep(i)
        print(i)
      },
      timeout = 3)  # go to next iteration if block takes more than 3 seconds
    },
    TimeoutException = function(x) cat("skip")
  )
}
I would like to do something similar, but skip or exit a command if R goes over a memory limit instead. For example, how can I make the following code print '1', NOTHING, '2'? Note that the second matrix, with 1000 rows, should be skipped before it is fully built. Also, I will not know the size of the matrix/object that needs to be skipped beforehand; I will only know memory_limit.
unknown = matrix(rnorm(1000*1000), ncol = 1000, nrow = 1000)  # unknown object
memory_limit = object.size(unknown) - 100000  # known memory limit that happens to be just under the object

## Evaluate_in_memory_limit ## {
  print(nrow(matrix(rnorm(1*1), ncol = 1, nrow = 1)))
  print(nrow(unknown))  # this should be skipped
  print(nrow(matrix(rnorm(2*2), ncol = 2, nrow = 2)))
  limit = memory_limit
}
An idea:
You could calculate the size of the vector (matrix) beforehand, if you know the length of it in advance.
For vectors, the formula should be
integers: 40 + 4 * n bytes
numeric: 40 + 8 * n bytes
Check with e.g.
sapply((1:3)^10, function(n) object.size(numeric(n)))
# or for matrix
sapply((1:3)^10, function(n) object.size(matrix(numeric(n))))
Then use system('free') on Unix to determine free memory.
Create your elements in a for loop and use the check in an if condition to skip to the next iteration (next) whenever the estimated memory would exceed what is available, as in the sketch below.
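A minimal sketch of that idea (the column position in the free output is an assumption and may vary between systems; the size formula is the one given above; you could compare against a known memory_limit instead of free memory):
sizes <- c(1, 1000, 2)  # matrix dimensions to try, as in the question
for (n in sizes) {
  est_bytes <- 40 + 8 * n * n  # rough size of an n x n numeric matrix
  # field 4 of the "Mem:" row of `free` is the free memory in kB (assumption)
  free_kb <- as.numeric(strsplit(system("free", intern = TRUE)[2], "\\s+")[[1]][4])
  if (est_bytes > free_kb * 1024) {
    message("skipping n = ", n, ": estimated size exceeds free memory")
    next
  }
  print(nrow(matrix(rnorm(n * n), ncol = n, nrow = n)))
}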

How to create an efficient for loop to resolve the rate limit issue with twitteR?

I am quite new to twitteR and the concept of a for loop. I have come across this code to get the followers and profiles.
The code below works fine, though I'm not entirely sure retryOnRateLimit should be set for such a long time.
#This extracts all or most followers.
followers<-getUser("twitter_handle_here")$getFollowerIDs(retryOnRateLimit=9999999)
The code below is the for loop to get the profiles.
However, I think there should be a way to use length(followers) and getCurRateLimitInfo() to better construct the loop.
My question is: if length(followers) = 40000 and the rate limit = 180, how do I construct the loop to sleep for the right amount of time and get all 40000 Twitter profiles?
Any help would be much appreciated.
# This is the for loop to sleep for 5 seconds.
# Problem with this is it simply sleeps for X seconds.
for (follower in followers) {
  Sys.sleep(5)
  followers_info <- lookupUsers(followers)
  followers_full <- twListToDF(followers_info)
}
Here is some code I had written for a similar purpose. First, you need to define this stall_rate_limit function:
stall_rate_limit <- function(limit) {
  # Store the record of all the rate limits into rate
  rate = getCurRateLimitInfo()
  message("Checking Rate Limit")
  if (any(as.numeric(rate[, 3]) == 0)) {
    # Get the locations of API calls that are used up
    index = which(as.numeric(rate[, 3]) == 0)
    # Get the time until the rate limits reset
    wait = as.POSIXct(min(rate[index, 4]),   ## Reset times in the 4th col
                      origin = "1970-01-01", ## Origin of Unix time
                      tz = "US/Mountain")    ## Replace with your timezone
    message(paste("Waiting until", wait, "for Godot to reset rate limit"))
    # Tell the computer to sleep until the rates reset
    Sys.sleep(difftime(wait, Sys.time(), units = "secs"))
    # Set J to 0
    J = 0
    # Return J as a counter
    return(J)
  } else {
    # Count was off, try again
    J = limit - 1
    return(J)
  }
}
Then you can run your code something like this:
callsMade = 0  ## This is your counter to count how many calls were made
limit = 180    ## The limit of how many calls you can make
for (i in 1:length(followers)) {
  # Check to see if you have exceeded your limit
  if (callsMade >= limit) {
    # If you have exceeded your limit, wait and set calls made to 0
    callsMade = stall_rate_limit(limit)
  }
  ### Execute your code here ... ###
  callsMade = callsMade + 1  # or however many calls you have made
}
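For the 40000-followers case, one hedged way to fill in the loop body is to look the IDs up in chunks and count one call per chunk; the 100-IDs-per-call batch size is an assumption about the users/lookup endpoint, not something stated in the thread:
# Assumption: each lookupUsers() call on a chunk of up to 100 IDs counts as one API call.
chunks <- split(followers, ceiling(seq_along(followers) / 100))
profiles <- vector("list", length(chunks))
callsMade <- 0
limit <- 180

for (i in seq_along(chunks)) {
  if (callsMade >= limit) {
    callsMade <- stall_rate_limit(limit)
  }
  followers_info <- lookupUsers(chunks[[i]])
  profiles[[i]] <- twListToDF(followers_info)
  callsMade <- callsMade + 1
}
followers_full <- do.call(rbind, profiles)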

Make concurrent RCurl GET requests for a set of URLs

I wrote a function to use RCurl to obtain the effective URL for a list of shortened URL redirects (bit.ly, t.co, etc.) and handle errors when the effective URL locates a document (PDFs tend to throw "Error in curlPerform... embedded nul in string.")
I would like to make this function more efficient if possible (while keeping it in R). As written, the run-time is prohibitively long for un-shortening a thousand or more URLs.
?getURI tells us that by default, getURI/getURL goes asynchronous when the length of the url vector is >1. But my performance seems totally linear, presumably because sapply turns the thing into one big for loop and the concurrency is lost.
Is there any way I can speed up these requests? Extra credit for fixing the "embedded nul" issue.
require(RCurl)
options(RCurlOptions = list(verbose = FALSE, followlocation = TRUE,
                            timeout = 500, autoreferer = TRUE, nosignal = TRUE,
                            useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))

# find successful location (or error msg) after any redirects
getEffectiveUrl <- function(url) {
  c <- getCurlHandle()
  h <- basicHeaderGatherer()
  curlSetOpt(.opts = list(header = TRUE, verbose = FALSE), curl = c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI(url, curl = c, followlocation = TRUE,
                                   headerfunction = h$update, async = TRUE),
                            error = function(e) e)
  if (inherits(possibleError, "error")) {
    effectiveUrl <- "ERROR_IN_PAGE"  # fails on linked documents (PDFs etc.)
  } else {
    headers <- h$value()
    names(headers) <- tolower(names(headers))  # sometimes cases change on header names?
    statusPrefix <- substr(headers[["status"]], 1, 1)  # 1st digit of http status
    if (statusPrefix == "2") {  # status = success
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else {
      effectiveUrl <- paste(headers[["status"]], headers[["statusmessage"]])
    }
  }
  effectiveUrl
}

testUrls <- c("http://t.co/eivRJJaV4j", "http://t.co/eFfVESXE2j", "http://t.co/dLI6Q0EMb0",
              "http://www.google.com", "http://1.uni.vi/01mvL", "http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH", "http://www.amazon.com", "http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z")  # 10th URL redirects to content w/ embedded nul

system.time(
  effectiveUrls <- sapply(X = testUrls, FUN = getEffectiveUrl, USE.NAMES = FALSE)
)  # takes 7-10 secs on my laptop

# does Vectorize help?
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
)  # nope, makes it worse
I had a bad experience with RCurl and async requests: R would completely freeze (no error message, and CPU and RAM did not spike) with only 20 concurrent requests.
I recommend switching to the curl package and using the curl_fetch_multi() function. In my case it could easily handle 50,000 JSON requests in one pool (with some division into sub-pools under the hood).
https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests
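A minimal sketch of what that could look like for the effective-URL problem above; the resolve_urls helper is illustrative (not from the answer), and it relies on the done callback's res$url field being the final URL after redirects:
library(curl)

resolve_urls <- function(urls) {
  results <- character(length(urls))
  pool <- new_pool()
  for (i in seq_along(urls)) {
    local({
      idx <- i  # capture the current index for the callbacks
      curl_fetch_multi(
        urls[idx],
        done = function(res) results[idx] <<- res$url,              # effective URL after redirects
        fail = function(msg) results[idx] <<- paste("ERROR:", msg),
        pool = pool
      )
    })
  }
  multi_run(pool = pool)  # perform all queued requests concurrently
  results
}

# effectiveUrls <- resolve_urls(testUrls)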

Leaving connections open indefinitely using file() in R

I have this function to write streaming data from Twitter into one file for 12 hours, then to another file for 12 hours. This is so we can clean, parse, and store the data twice a day.
conn  <- file(description = "after12.json", open = "a")
conn2 <- file(description = "before12.json", open = "a")

write.tweets <- function(x) {
  if (nchar(x) > 0 && format(Sys.time(), " %H") >= 12) {
    writeLines(x, conn, sep = "")
  } else {
    writeLines(x, conn2, sep = "")
  }
}
This is within a much larger function to pull and write the data. My question is pretty simple: I want to leave both connections open indefinitely so that I can still write to a connection after 12 hours of inactivity. Is there a way to do this?
Use open():
conn <- file(description = "after12.json")
open(conn, open = "a")
as per ?open:
open opens a connection. In general functions using connections will open them if they are not open, but then close them again, so to leave a connection open call open explicitly.
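Applied to the two connections in the question, that would look something like this (a sketch, not tested against the streaming code):
conn  <- file(description = "after12.json")
conn2 <- file(description = "before12.json")
open(conn,  open = "a")
open(conn2, open = "a")

# ... write.tweets() can now write to either connection indefinitely ...

close(conn)
close(conn2)  # when the session is finished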
