This is a two-part question regarding the following chunk of code.
> ## ran rm(list = ls()) prior to the following
> closeAllConnections()
> tc <- textConnection("messages", "w")
> isOpen(tc)
# [1] TRUE
> close(tc)
> ls()
# [1] "messages" "tc"
> is.object(tc)
# [1] TRUE
> class(tc)
# [1] "textConnection" "connection"
> tc
# Error in summary.connection(x) : invalid connection
Why isn't tc removed from the list of objects, ls(), immediately when the tc connection is closed, and what does invalid connection mean? Is there a reason why R keeps tc in the list?
Is there a way to remove it from the object list immediately after it's closed? I really don't want to call rm() if it's not necessary. Perhaps I missed an argument somewhere while scanning the help files.
This is important because I have a function called list.objects that returns an error after I run the above code, but not otherwise (possibly because tc has two classes).
For 1., tc isn't removed from the list of objects because close doesn't delete the variable you used to hold the pointer to the connection. Rather, close closes the pointer and effectively removes it from the list of open connections (see showConnections). The variable that contained the pointer still exists; it's just that the pointer now points nowhere. This explains why you get an error when you type tc after closing it: you're trying to look at a file connection that goes nowhere.
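A quick illustration of the distinction (a minimal sketch; showConnections() tracks open connections, while ls() tracks variables):
tc <- textConnection("messages", "w")
showConnections()   # the open text connection is listed here
close(tc)           # closes the connection itself...
showConnections()   # ...so it no longer appears in the list
ls()                # but the stale variable "tc" is still in the workspace
rm(tc)              # removing the variable is a separate step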
For 2., what's so hard about close(tc); rm(tc)? It's hardly any more typing than if there actually were a "delete my first argument" parameter.
tc is a variable that holds a reference to certain state. There is no particular reason that a call to close() should come with an rm() built-in. That would be like expecting a TV remote to disappear on its own after you turn off the TV by hitting the power button.
I think you will have to call rm(tc) to remove it.
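If typing both calls bothers you, here is a minimal sketch of a hypothetical helper (closeAndRemove is not a base function; it closes the connection and then drops the variable from the caller's frame):
# Hypothetical convenience wrapper, not part of base R
closeAndRemove <- function(con) {
  name <- deparse(substitute(con))          # the variable name as typed by the caller
  close(con)                                # close the connection
  rm(list = name, envir = parent.frame())   # drop the variable itself
}
tc <- textConnection("messages", "w")
closeAndRemove(tc)
exists("tc")  # [1] FALSE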
I cannot seem to create a closed connection object. I want to have a connection object sitting around, so I can give it a connection if isOpen(myCon) returns FALSE. I would expect myCon<-file() to give me a closed connection, but isOpen(myCon) returns TRUE. I see from the help file that file() actually returns a connection to an "anonymous file", which is apparently just a memory location that acts like a file. ...not what I want. If I create the anonymous file and execute close(myCon), then isOpen(myCon) gives an invalid connection error rather than returning FALSE. I don't want to be trapping errors just to get my false value. How can I create a valid connection that's not open? There must be a way isOpen(myCon) can return FALSE, otherwise it's a somewhat pointless function. My OS is Windows 7.
file() works as long as the first parameter (description) is a non-empty string. Example:
> myCon <- file("dumdum")
> isOpen(myCon)
[1] FALSE
The key is that the second parameter (open) is left empty (the default). It does not matter whether the string used for description is an existing file or not. However, this does not mean that the connection can't be used. R opens the connection as needed. For example:
> myCon <- file("important log file.txt")
> isOpen(myCon)
[1] FALSE
> cat("Thinking this will fail, because the connection is closed. ...wrong!", file=myCon)
> isOpen(myCon)
[1] FALSE
The file has just been overwritten with that one line.
The safe way to make a stand-by connection is to generate the description with tempfile(). That returns a string which is guaranteed "not to be currently in use" (...according to the help page. I interpret that to mean that the string is not the name of an existing file.)
> myCon <- file( tempfile() )
> isOpen(myCon)
[1] FALSE
> cat("Didn't mean to do this, but all it will do is create a new file.", file=myCon)
> isOpen(myCon)
[1] FALSE
In this case, that line was written to a file, but it did not overwrite anything.
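Putting it together, a minimal sketch of the stand-by pattern asked about (the "w" mode is just an example; note that close() invalidates the connection, as discussed in the previous question):
myCon <- file(tempfile())   # a valid but closed stand-by connection
isOpen(myCon)               # [1] FALSE
if (!isOpen(myCon)) {
  open(myCon, "w")          # open it explicitly only when needed
}
isOpen(myCon)               # [1] TRUE
close(myCon)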
Many thanks to Martin Morgan for pointing me in the right direction. I welcome additional comments.
When I run this code with a for loop or with lapply, I get the following error:
"Error in get_entrypoint (debug_port):
Cannot connect R to Chrome. Please retry. "
library(rvest)
library(xml2) #pull html data
library(selectr) #for xpath element
url_stackoverflow_rmarkdown <-
'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'
web_page <- read_html(url_stackoverflow_rmarkdown)
questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]
link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page],
"href")
setwd("~/WebScraping_chrome_print_to_pdf")
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  pagedown::chrome_print(question_to_pdf)
}
Is it possible to build a for loop or use lapply to resume the code from where it breaks, that is, from the last i value, without stopping the whole run?
Many thanks
I adapted @Rui Barradas's idea of tryCatch(). You can try something like the code below. IsValues will hold either the path of the generated PDF or the bad i (the index of a failed conversion).
IsValues <- list()
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))
      pagedown::chrome_print(question_to_pdf)
    },
    error = function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    })
}
Then you can rbind your values and extract the bad i's:
do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]
[1] "3" "5" "19" "31"
You can read more about tryCatch() in this answer.
Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
The second error arises when there are links in the HTML that return 404:
Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)
The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.
I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.
Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely a little more convenient. That said, in the end it's really just a matter of preference.
Also note that I've dealt with the connection error by using repeat within the for loop: R will keep trying to connect to Chrome and print until it either succeeds or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold on the number of connection attempts (a sketch of that variant follows the code below):
quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL
safe_print <- purrr::safely(pagedown::chrome_print)
for (qurl in quest_urls){
repeat {
output <- safe_print(qurl)
if (is.null(output$error)) break
else if (grepl("retry", output$error$message)) next
else {errors <- c(errors, `names<-`(output$error$message, qurl)); break}
}
}
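If you do want that safety valve, here is a minimal sketch of the same loop with a hypothetical max_attempts cap added (everything else is unchanged from the code above):
max_attempts <- 5  # hypothetical upper bound on retries per URL
for (qurl in quest_urls) {
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message) && attempts < max_attempts) next
    else {errors <- c(errors, `names<-`(output$error$message, qurl)); break}
  }
}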
R Version : 2.14.1 x64
Running on Windows 7
Connecting to a database on a remote Microsoft SQL Server 2012
I have an unordered vector of names, say:
names <- c("A", "B", "A", "C", "C")
each of which has an id in a table in my db. I need to convert the names to their corresponding ids.
I currently have the following code to do it.
###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects
nameToID<-function(name, dbConn){
#dbConn : active db connection formed via odbcDriverConnect
#name : a char string
sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
}
sapply(names, nameToID, dbConn=dbConn)
###
Barring better ways to do this, which could involve loading the table into R then working with the problem there (which is possible), I understand why the following doesn’t work, but I cannot seem to find a solution. Attempting to use parallelization via the package ‘parallel’ :
###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects
nameToID<-function(name, dbConn){
#dbConn : active db connection formed via odbcDriverConnect
#name : a char string
sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
}
mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c(“sqlQuery”, “dbConn”))
parSapply(cl, names, nameToID, dbConn=dbConn) #incorrect passing of nameToID’s second argument
###
As in the comment, this is not the correct way to assign the second argument to nameToID.
I have also tried the following:
parSapply(cl, names, function(x) nameToID(x, dbConn))
in place of the previous parSapply call, but that also does not work; the error thrown says "the first parameter is not an open RODBC connection", presumably referring to the first parameter of sqlQuery(). dbConn remains open, though.
The following code does work with parallization.
###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects
nameToID<-function(name){
#name : a char string
dbConn<-odbcDriverConnect(connection=”string”)
result<-sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
odbcClose(dbConn)
result
}
mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c(“sqlQuery”, “odbcDriverConnect”, “odbcClose”, “dbConn”, “nameToID”)) #throwing everything in
parSapply(cl, names, nameToID)
###
But the constant opening and closing of the connection ruins the gains from parallelization, and seems just a bit silly.
So the overall question would be how to pass the second parameter (the open db connection) to the function within parSapply, in much the same way as it is done in the regular apply? In general, how does one pass a second, third, nth parameter to a function within a parallel routine?
Thanks and if you need any more information let me know.
-DT
Database connection objects can't be exported or passed as function arguments because they contain socket connections. If you try, the object will be serialized, sent to the workers, and deserialized, but it won't work correctly because the underlying socket connection won't be valid.
The solution is to create the database connection on each worker before calling parSapply. I often do that using clusterEvalQ:
clusterEvalQ(cl, {
library(RODBC)
dbConn <- odbcDriverConnect(connection="connection string")
NULL
})
Now the worker function can be written as:
nameToID <- function(name) {
sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
and called with:
parSapply(cl, names, nameToID)
Also note that since RODBC is loaded on each of the workers you don't have to export functions defined in it, which I think is good programming practice.
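When the work is done, the same clusterEvalQ trick can close the per-worker connections before stopping the cluster (a minimal clean-up sketch, assuming the setup above):
clusterEvalQ(cl, odbcClose(dbConn))  # close the RODBC connection on each worker
stopCluster(cl)                      # then shut down the cluster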
For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed because I make each data collection request (one of many) as an SQL query, submitted via a Web form, which is simulated by RCurl code. The server processes each request and generates a text file with results at a specific URL (RESULTS_URL in the code below). Regardless of the request, the URL and file name are the same (I cannot change that). Since processing time, obviously, differs between data requests and some requests may take a significant amount of time, my R code needs to "know" when the results are ready (the file has been re-generated) so that it can retrieve them. The following is my solution for this problem.
POLL_TIME <- 5 # polling timeout in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
afterDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz="GMT")
delta <- difftime(afterDate, beforeDate, units = "secs")
if (as.numeric(delta) != 0) { # file modified, results are ready
if (DEBUG) message(" Ready!")
break
}
else { # no results yet, wait the timeout and check again
if (DEBUG) message(".", appendLF = FALSE)
Sys.sleep(POLL_TIME)
}
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through data requests, specified in configuration file (via lapply()),
where for each request perform the following:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue with the code above is that it doesn't seem to work as expected: upon making a data request, the code should print "Waiting for results ..." and then, while periodically checking whether the results file has been modified (re-generated), print progress dots until the results are ready, at which point it prints a confirmation. The actual behavior, however, is that the code waits a long time (I intentionally made one request long-running) without printing anything, and then apparently retrieves the results and prints both "Waiting for results ..." and " Ready!" at the same time.
It seems to me that it's some kind of synchronization issue, but I can't figure out what exactly. Or, maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment to the question, I believe MrFlick solved the issue: the polling logic appears to be functional; the problem is that the progress messages are out of sync with what is actually happening on the system.
By default, the R console output is buffered. This is by design: it speeds things up and avoids the distracting flicker that may be associated with frequent messages, etc. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statements at the console (the console buffer is automatically flushed just before the > prompt returns).
It is however possible to get message() and, more generally, console output in "real time", either by explicitly flushing the console after each critical output statement with the flush.console() function, or by disabling buffering at the level of the R GUI (right-click on the console and see the "Buffered output" Ctrl+W item; it is also available in the Misc menu).
Here's a toy example of the explicit use of flush.console. Note the use of cat() rather than message(), as the former doesn't automatically add a CR/LF to the output. The latter is useful, however, because its messages can be suppressed with suppressMessages() and the like. Also, as shown in the comment, you can cat the "\b" (backspace) character to make the numbers overwrite one another.
CountDown <- function() {
for (i in 9:1){
cat(i)
# alternatively to cat(i) use: message(i)
flush.console() # <<<<<<< immediate ouput to console.
Sys.sleep(1)
cat(" ") # also try cat("\b") instead ;-)
}
cat("... Blast-off\n")
}
The output is the following. What is of course not evident in this print-out is that it took about 9 seconds overall, with one number printed every second, before the final "Blast-off". Remove the flush.console() statement and the output will come all at once, after those 9 seconds, i.e. when the function terminates (unless buffering is disabled at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off
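Applied back to the polling loop from the question, a minimal sketch with explicit flushing added (assuming the same RESULTS_URL, beforeDate, POLL_TIME and DEBUG from the question):
repeat {
  if (DEBUG) { message("Waiting for results ...", appendLF = FALSE); flush.console() }
  afterDate <- url.exists(RESULTS_URL, .header = TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz = "GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) {  # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  } else {                       # no results yet, wait and check again
    if (DEBUG) { message(".", appendLF = FALSE); flush.console() }
    Sys.sleep(POLL_TIME)
  }
}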
In R, I'm wondering if it's possible to temporarily redirect the output of the console to a variable?
p.s. There are a few examples on the web on how to use sink() to redirect the output into a filename, but none that I could find showing how to redirect into a variable.
p.p.s. The reason this is useful, in practice, is that I need to print out a portion of the default console output from some of the built-in functions in R.
I believe results <- capture.output(...) is what you need (i.e. using the default file=NULL argument). sink(textConnection("results")); ...; sink() should work as well, but as ?capture.output says, capture.output() is:
Related to ‘sink’ in the same way that ‘with’ is related to ‘attach’.
... which suggests that capture.output() will generally be better since it is more contained (i.e. you don't have to remember to terminate the sink()).
If you want to send the output of multiple statements to a variable you can wrap them in curly brackets {}, but if the block is sufficiently complex it might be better to use sink() (or make your code more modular by wrapping it in functions).
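For example, a minimal sketch capturing several statements at once (note that, inside the braces, intermediate results need an explicit print() to be captured):
results <- capture.output({
  print(summary(cars))             # explicit print inside the block
  cat("rows:", nrow(cars), "\n")   # output from cat() is captured too
})
results                            # one character element per captured line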
For the record, it's indeed possible to store stdout in a variable with the help of a temporary connection, without calling capture.output -- e.g. when you want to save both the results and stdout. Example:
Prepare the variable for the diverted R output:
> stdout <- vector('character')
> con <- textConnection('stdout', 'wr', local = TRUE)
Divert the output:
> sink(con)
Do some stuff:
> 1:10
End the diversion:
> sink()
Close the temporary connection:
> close(con)
Check results:
> stdout
[1] " [1] 1 2 3 4 5 6 7 8 9 10"