rxDataStep in RevoScaleR package crashing

I am trying to create a new factor column on an .xdf data set with the rxDataStep function in RevoScaleR:
rxDataStep(nyc_lab1
, nyc_lab1
, transforms = list(RatecodeID_desc = factor(RatecodeID, levels=RatecodeID_Levels, labels=RatecodeID_Labels))
, overwrite=T
)
where nyc_lab1 is a pointer to an .xdf file. I know that the file is fine because I imported it into a data table and successfully created the new factor column.
However, I get the following error message:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERROR: The sample data set for the analysis has no variables.
What could be wrong?

First, RevoScaleR has some warts when it comes to replacing data. In particular, overwriting the input file with the output can sometimes cause rxDataStep to fail for unknown reasons.
Even if it works, you probably shouldn't do it anyway. If there is a mistake in your code, you risk destroying your data. Instead, write to a new file each time, and only delete the old file once you've verified you no longer need it.
Second, any object you reference that isn't part of the dataset itself has to be passed in via the transformObjects argument. See ?rxTransform. Basically, the rx* functions are meant to be portable to distributed computing contexts, where the R session that runs the code may not be the same as your local session. In this scenario, you can't assume that objects in your global environment will exist in the session where the code executes.
Try something like this:
nyc_lab2 <- RxXdfData("nyc_lab2.xdf")
nyc_lab2 <- rxDataStep(nyc_lab1, nyc_lab2,
    transforms = list(
        RatecodeID_desc = factor(RatecodeID, levels = .levs, labels = .labs)
    ),
    transformObjects = list(
        .levs = RatecodeID_Levels,
        .labs = RatecodeID_Labels
    )
)
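If the step succeeds, you can sanity-check the new column without reading the whole file into memory; a hedged suggestion, using RevoScaleR's metadata function:
rxGetInfo(nyc_lab2, getVarInfo = TRUE)  # should list RatecodeID_desc and its factor levels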
Or, you could use dplyrXdf, which will handle all this file management business for you:
nyc_lab2 <- nyc_lab1 %>% factorise(RatecodeID)


Externally set default argument of a function from a package

I am building a package with functions that have default arguments. I would like to find a clean way to set the values of the default arguments once the function has been imported.
My first attempt was to set the default values to unknown objects (within the package). Then when I load the package, I would have an external script that would assign a value to the unknown objects.
But it does not seem very clean since I am compiling a function with an unknown object.
My issue is that other people will need to use the functions, and since they have many arguments I want to keep the code as concise as possible. Many of the arguments can be set by a configuration script before running the program.
So, for instance, I define in my package:
function_try <- function(x = VAL) {
  return(x)
}
I compile the package and load it, and then I have an external script (or code reading from a config file) that does:
VAL <- "hello"
And then a user of the function can just run
function_try()
I would use options for that. So your function looks like:
function_try <- function(x = getOption("mypackage.default.value")) x
In your external script you make sure that the option value is set:
options(mypackage.default.value = "hello")
IMHO that is a clean solution, because anybody reading your function will see at first sight that a certain option value is used as a default, and also has a clear understanding of how to overwrite this value if needed.
I would also define a fallback value in your package's .onLoad() to make sure that the value is defined in the first place. Your functions can then even react to this fallback value and issue a meaningful warning if the function realizes that the external script did, for whatever reason, not provide a new value.
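For example, a minimal sketch of such an .onLoad() (assuming your package is called mypackage; the option name is illustrative):
.onLoad <- function(libname, pkgname) {
  # Set the fallback only if the user or a config script hasn't set it already
  if (is.null(getOption("mypackage.default.value"))) {
    options(mypackage.default.value = "fallback")
  }
}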

R tryCatch RODBC function issue

We have a number of MS Access databases on a server which are copies from remote locations which are updated overnight. We collate some of the data from these machines for reporting purposes on a daily basis. Sometimes the overnight update fails, meaning we don’t have access to all of the databases, so I am attempting to write an R script which will test if we can connect (using a list of the database paths), and output an updated version of the list including only those which we can connect to. This will then be used to run a further script which will only update the data related to the available databases.
This is what I have so far (I am new to R but reasonably proficient in SAS and SQL – attempting to use R both as a learning exercise and for potential cost savings):
{
  # Create store data locations listing
  A <- matrix(c(1000, 1, "One",   "//Server/Comms1/Access.mdb",
                2000, 2, "Two",   "//Server/Comms2/Access.mdb",
                3000, 3, "Three", "//Server/Comms3/Access.mdb"),
              nrow = 3, ncol = 4, byrow = TRUE)
  # Add column names
  colnames(A) <- c("Ref1", "Ref2", "Ref3", "Location")
  # Create summary for testing connections (Ref1 and Location)
  B <- A[, c(1, 4)]
  ConnectionTest <- function(Ref1, Location)
  {
    out <- tryCatch({
      ch <- odbcDriverConnect(paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", Location))
      sqlQuery(ch, paste("select ", Ref1, " as Ref1,COUNT(variable) as Count from table"))
    },
    error = matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE)
    )
    return(out)
  }
  # Run function, using 'B' to provide arguments
  C <- apply(B, 1, function(x) do.call(ConnectionTest, as.list(x)))
  # Convert to matrix and add column names
  D <- matrix(unlist(C), ncol = 2, byrow = TRUE)
  colnames(D) <- c("Ref1", "Count")
}
When I run the script I get the following error message:
Error in value[3L] : attempt to apply non-function
I am guessing this is because I am using tryCatch incorrectly inside the UDF?
Does anyone have any advice on what I am doing incorrectly, or even if this is the best way to do what I am attempting?
Thanks
(apologies if this is formatted incorrectly, having to post on my phone due to Stack Overflow posting being blocked)
Edit - I think I fixed the 'Error in value[3L]' issue by adding function(e) {} around the matrix function in the error part of the tryCatch.
The issue now is that the script just fails if it can't reach one of the databases, rather than doing the matrix function. Do I need to add something else to make it ignore the error?
Edit 2 - it seems tryCatch does now work - it processes the alternate function upon error but also shows warnings about the error, which makes sense.
As mentioned in the edit above, wrapping the matrix function in function(e) {} in the error section of the tryCatch fixed the 'Error in value[3L]' issue, so the script now works, but it displays error messages if it can't access a particular channel. I am guessing the 'warning' section of the tryCatch can be used to adjust these as necessary.
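For reference, a sketch of the corrected function; the connection string and query come from the question, and the warning handler is one (assumed) way to deal with the messages mentioned above, since odbcDriverConnect signals a warning rather than an error when it cannot connect:
library(RODBC)

ConnectionTest <- function(Ref1, Location) {
  tryCatch({
    ch <- odbcDriverConnect(paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", Location))
    res <- sqlQuery(ch, paste("select", Ref1, "as Ref1, COUNT(variable) as Count from table"))
    odbcClose(ch)
    res
  },
  error   = function(e) matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE),
  warning = function(w) matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE))
}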

Importing a file only if it has been modified since last import and then save to a new object

I am trying to create a script that I run about once a week. The goal is that it will go out and check an MS Excel file that a co-worker manages. It then tests to see if the date the file was modified is newer than the last time it was imported. If it is newer, it will import the file (I am using the readxl package - WONDERFUL!) into a new object that is named with the date the original Excel file was last modified. I have everything working except for the assignment of the imported data.frame to a new object that includes the date.
An example of the code I am using is:
First I create an object with a path pointing to the file of interest.
pfdFilePath <- file.path("H:", "3700", "3780", "002-00",
                         "3. Project Information", "Program", "RAH program.xls")
After testing to verify the file has been modified, I have tried simple assignment ("test" is just an example for simplification):
paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = "") <- "test"
But that code produces an error:
Error in paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = "") <- "test" :
target of assignment expands to non-language object
I then try the assign function:
assign(paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = ""), "test")
Running this code creates an object that looks to be okay, but when I evaluate it, or try using str() or class(), I get the following error:
Error in df - df-2016-08-09 :
non-numeric argument to binary operator
I am pretty sure this is an error that has to do with the environment in which I am using assign, but being relatively new to R, I cannot figure it out. I understand that the assign function seems to be frowned upon, but those warnings seem to be centered on for-loops vs. lapply functions. I am not really iterating within a function, though - just creating a dynamically named object whenever I run a script. I can't come up with a better way to do it. If there is another way to do this that doesn't require the assign function, or a better way to use the assign function, I would love to know it.
Thank you in advance, and sorry if this is a duplicate. I have spent the entire evening digging and can't derive what I need.
Abdou provided the key.
assign(paste0("df.", "pfd.", strftime(file.info(pfdFilePath)$mtime, "%Y%m%d")), "test01")
I also converted to the cleaner paste0 function and got rid of the dashes to avoid confusion: a name like df-2016-08-09 is not syntactically valid, so evaluating it unquoted is parsed as subtraction (hence the 'non-numeric argument to binary operator' error); it would otherwise need backticks. Lesson learned.
Works perfectly.
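As an aside, since the question asked for a way around assign(): a sketch using a named list, which keeps the dynamically named imports inside one object instead of scattering them through the global environment (the read_excel() call is the assumed real import):
imports <- list()
stamp <- strftime(file.info(pfdFilePath)$mtime, "%Y%m%d")
imports[[paste0("df.pfd.", stamp)]] <- "test01"  # in real use: readxl::read_excel(pfdFilePath)

imports[["df.pfd.20160809"]]  # retrieval is ordinary list indexing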

R: How to make dump.frames() include all variables for later post-mortem debugging with debugger()

I have the following code, which provokes an error and writes a dump of all frames using dump.frames(), as proposed e.g. by Hadley Wickham:
a <- -1
b <- "Hello world!"

bad.function <- function(value)
{
  log(value)  # the log function may cause an error or warning depending on the value
}

tryCatch({
  a.local.value <- 42
  bad.function(a)
  bad.function(b)
},
error = function(e)
{
  dump.frames(to.file = TRUE)
})
When I restart the R session and load the dump to debug the problem via
load(file = "last.dump.rda")
debugger(last.dump)
I cannot find my variables (a, b, a.local.value) nor my function "bad.function" anywhere in the frames.
This makes the dump nearly worthless to me.
What do I have to do to see all my variables and functions for a decent post-mortem analysis?
The output of debugger is:
> load(file = "last.dump.rda")
> debugger(last.dump)
Message:  non-numeric argument to mathematical function
Available environments had calls:
1: tryCatch({
a.local.value <- 42
bad.function(a)
bad.function(b)
2: tryCatchList(expr, classes, parentenv, handlers)
3: tryCatchOne(expr, names, parentenv, handlers[[1]])
4: value[[3]](cond)
Enter an environment number, or 0 to exit
Selection:
PS: I am using R 3.3.2 with RStudio for debugging.
Update Nov. 20, 2016: Note that it is not an R bug (see the answer of Martin Maechler). I did not change my answer, for reproducibility. The described workaround still applies.
Summary
I think dump.frames(to.file = TRUE) is currently an anti-pattern (or probably a bug) in R if you want to debug errors of batch jobs in a new R session.
You should better replace it with
dump.frames()
save.image(file = "last.dump.rda")
or
options(error = quote({dump.frames(); save.image(file = "last.dump.rda")}))
instead of
options(error = dump.frames)
because the global environment (.GlobalEnv = the user workspace where you normally create your objects) is then included in the dump, while it is missing when you save the dump directly via dump.frames(to.file = TRUE).
Impact analysis
Without the .GlobalEnv you lose important top-level objects (and their current values ;-)) needed to understand the behaviour of your code that led to an error!
Especially in case of errors in "non-interactive" R batch jobs you are lost without .GlobalEnv since you can debug only in a newly started (empty) interactive workspace where you then can only access the objects in the call stack frames.
Using the code snippet above you can examine the object values that led to the error in a new R workspace as usual via:
load(file = "last.dump.rda")
debugger(last.dump)
Background
The implementation of dump.frames creates a variable last.dump in the workspace and fills it with the environments of the call stack (sys.frames(); each environment contains the "local variables" of the called function). It then saves this variable into a file using save().
The frame stack (call stack) grows with each call of a function, see ?sys.frames:
.GlobalEnv is given number 0 in the list of frames. Each subsequent
function evaluation increases the frame stack by 1 and the [...] environment for evaluation of that function are returned by [...] sys.frame with the appropriate index.
Observe that the .GlobalEnv has the index number 0.
If I now start debugging the dump produced by the code in the question and select the frame 1 (not 0!) I can see a variable parentenv which points (references) the .GlobalEnv:
Browse[1]> environmentName(parentenv)
[1] "R_GlobalEnv"
Hence I believe that sys.frames does not contain the .GlobalEnv, and therefore neither does dump.frames(to.file = TRUE), since it only stores the sys.frames without all the other objects of the .GlobalEnv.
Maybe I am wrong, but this looks like an unwanted effect or even a bug.
Discussions welcome!
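One quick way to convince yourself (a sketch; run it in a fresh session):
# sys.frames() lists only the environments of active function calls,
# so at top level it is NULL and .GlobalEnv never appears in the list
sys.frames()                  # NULL at the top level

f <- function() sys.frames()
frames <- f()
length(frames)                # 1: just f's own evaluation environment
environmentName(frames[[1]])  # "" - an anonymous frame, not "R_GlobalEnv"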
References
https://cran.r-project.org/doc/manuals/R-exts.pdf
Excerpt from section 4.2 Debugging R code (page 96):
Because last.dump can be looked at later or even in another R session, post-mortem debugging is possible even for batch usage of R. We do need to arrange for the dump to be saved: this can be done either using the command-line flag --save to save the workspace at the end of the run, or via a setting such as
options(error = quote({dump.frames(to.file=TRUE); q()}))
Note that it is often more productive to work with the R Core team than to simply assert that R has a bug. There is clearly no bug here, as R behaves exactly as documented.
Also, there is no problem if you work interactively, as you have full access to your workspace (which may be LARGE) there, so the problem applies only to batch jobs (as you've mentioned).
What we rather have here is a missing feature, and feature requests (and bug reports!) should happen on the R bug site (aka 'R bugzilla'), https://bugs.r-project.org/ ... typically, however, after having read the corresponding page on the R website: https://www.r-project.org/bugs.html.
Note that R bugzilla is searchable, and in the present case, you'd pretty quickly find that Andreas Kersting made a nice proposal (namely as a wish, rather than claiming a bug),
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17116
and consequently I had already added the missing feature to R on Aug. 16. That is, of course, in the development version of R, aka R-devel.
See also today's thread on the R-devel mailing list,
https://stat.ethz.ch/pipermail/r-devel/2016-November/073378.html

An error while trying to use glm model for prediction on another computer

I would like to save a glm object on one R machine and use it for prediction on another data set, located on another machine, that has newer data. I tried to use save and load but with no success. What am I doing wrong?
Here is a toy example:
# on machine 1:
glm <- glm(y ~ x1 + x2, data = dat1, family = binomial(link = "logit"))
save(glm,file="glm.Rdata") # the file is stored in a folder.
# on machine 2:
load(glm.RData) # got an error:"Error in load(glm.RData) : object 'glm.RData' not found"
#I tried :
load(file='glm.RData') # no error was displayed
print(glm) # got an error:"Error in load(glm.RData) : object 'glm.RData' not found"
Any help will be great.
As per @user3710546's advice, I would avoid saving your model under the name glm, as it'll mask (i.e. block) the glm() function, making it difficult for you to use it in your session.
Using save() and load()
save() is generally used to save several objects to a file, rather than a single object. Its list argument takes 'A character vector containing the names of objects to be saved' (emphasis mine), so you'd want to use it like this:
# On machine 1:
save(list = 'glm', file = '/path/to/glm.RData')
# On machine 2:
load(file = '/path/to/glm.RData')
Note that file extensions are often case-sensitive: you saved to a file with the extension .Rdata but loaded from one with the extension .RData, which is different. This may explain why the file isn't found.
Using saveRDS() and readRDS()
An alternative to using save() and load() is to use saveRDS() and readRDS(), which are designed to be used with a single object. They're used slightly differently:
# On machine 1
saveRDS(glm, file = '/path/to/glm.rds')
# On machine 2
glm = readRDS(file = '/path/to/glm.rds')
Note the .rds file extension, and the fact that the object read by readRDS() isn't automatically put into the environment: it needs to be assigned to something.
Saving parts of a GLM
If you just want the formula saved, that is, the actual text string, you can find it in glm$formula, where glm is the name of your object. It comes back as a formula object, but you can convert it to a string with as.character(glm$formula), to then be written to a text file or whatever.
If, however, you want the model itself without the dataset it was created from (to cut down on disk space), have a look at this article, which discusses which parts of a glm object can be safely deleted.
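For illustration, a sketch along the lines such articles suggest (which components are safe to drop depends on how you call predict(); verify predictions against the full model before discarding it):
# Drop components of a fitted glm that predict() typically does not need,
# including the terms/formula environments, which can drag in large data sets
strip_glm <- function(cm) {
  cm$y <- c()
  cm$model <- c()
  cm$residuals <- c()
  cm$fitted.values <- c()
  cm$effects <- c()
  cm$qr$qr <- c()
  cm$linear.predictors <- c()
  cm$weights <- c()
  cm$prior.weights <- c()
  cm$data <- c()
  attr(cm$terms, ".Environment") <- c()
  attr(cm$formula, ".Environment") <- c()
  cm
}

small_model <- strip_glm(glm)  # then: saveRDS(small_model, '/path/to/glm_small.rds')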
