How to do introspection in R

I am somewhat new to R, and I have a piece of code that generates a variable whose type I don't know. Is there any introspection facility in R that will tell me which type this variable belongs to?
The following illustrates the property of this variable:
I am working on linear model selection, and the resource I have is lm result from another model. Now I want to retrieve the lm call by the command summary(model)$call so that I don't need to hardcode the model structure. However, since I have to change the dataset, I need to do a bit of modification on the "string", but apparently it is not a simple string. I wonder if there is any command similar to string.replace so that I can manipulate this variable from the variable $call.
> str<-summary(rdnM)$call
> str
lm(formula = y ~ x1, data = rdndat)
> str[1]
lm()
> str[2]
y ~ x1()
> str[3]
rdndat()
> str[3] <- data
Warning message:
In str[3] <- data :
number of items to replace is not a multiple of replacement length
> str
lm(formula = y ~ x1, data = c(10, 20, 30, 40))
> str<-summary(rdnM)$call
> str
lm(formula = y ~ x1, data = rdndat)
> str[3] <- 'data'
> str
lm(formula = y ~ x1, data = "data")
> str<-summary(rdnM)$call
> type str
Error: unexpected symbol in "type str"
>

In terms of introspection: R allows you to easily examine and operate on language objects.
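For more details, see the R Language Definition, particularly sections 2 and 6. For instance, in your case, summary(rdnM)$call is a "call" object; class() and typeof() will tell you as much. You can retrieve pieces of it by indexing, but assigning a value to an index the way you tried inserts that value itself (the contents of data, or the string "data") into the call rather than a reference to a dataset. To point the call at different data you have to assign a symbol, or construct a new call.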
In your case you are constructing an updated call to lm() out of an existing call. If you want to reuse the formula on different data, you would extract the formula from the call object via formula(foo$call), like so:
foo <- lm(formula = y ~ x1, data = data.frame(y=rnorm(10),x1=rnorm(10)))
bar <- lm(formula(foo$call), data = data.frame(y=rnorm(10),x1=rnorm(10)))
On the other hand, if you are trying to update the formula, you could use update():
baz <- update(bar, . ~ . - 1)
baz$call
##>lm(formula = y ~ x1 - 1, data = data.frame(y = rnorm(10), x1 = rnorm(10)))
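As a minimal sketch of the introspection itself, the following inspects a call object and swaps its data argument for a symbol. The data frames rdndat and newdat are made-up stand-ins for the poster's datasets, so treat this as an illustration rather than the original code.

```r
# Toy data; `rdndat` and `newdat` are assumed stand-ins for the real datasets.
set.seed(1)
rdndat <- data.frame(y = rnorm(10), x1 = rnorm(10))
newdat <- data.frame(y = rnorm(10), x1 = rnorm(10))

rdnM <- lm(y ~ x1, data = rdndat)
cl <- summary(rdnM)$call

class(cl)    # "call"
typeof(cl)   # "language"
str(cl)      # compact display of the object's structure
as.list(cl)  # the function name and each argument as list elements

# Replace the data argument with the *symbol* newdat (not its contents),
# then evaluate the modified call to refit the model.
cl$data <- quote(newdat)
refit <- eval(cl)
refit$call
```

Assigning quote(newdat) keeps the call readable when printed, whereas assigning the data frame itself would embed all of its values in the call.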

Related

Recommended way of creating reusable objects within an R function

Suppose we have the following data:
# simulate data to fit
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
Let's also suppose the user has fit a model:
# user fits a lm
mod = lm(y~x)
Now suppose I have an R package designed to perform several operations on the object mod. Just for simplicity, suppose we have two functions, one that plots the data, and one that computes the coefficients. However, as an intermediary, suppose we want to perform some operation on the data (in this example, add ten).
Example:
# function that adds ten to all scores
add_ten = function(model) {
data = model$model
data = data + 10
return(data)
}
# functions I defined that do something to the "add_ten" dataset
plot_ten = function(model) {
new_data = data.frame(add_ten(model))
x = all.vars(formula(model))[2]
y = all.vars(formula(model))[1]
ggplot2::ggplot(new_data, ggplot2::aes_string(x=x, y=y)) + ggplot2::geom_point() + ggplot2::geom_smooth()
}
coefs_ten = function(model) {
new_data = data.frame(add_ten(model))
coef(lm(formula(model), new_data))
}
(Obviously, this is pretty silly to do. In actuality, the operation I want to perform is multiple imputation, which is computationally intensive).
Notice in the above example I have to call the add_ten function twice, once for plot_ten and once for coefs_ten. This is inefficient.
So, now to my question, what is the best way to create a reusable object within a function?
I could, of course, create an object to be placed in the user's global environment:
add_ten = function(model) {
# check for add_ten_data in the global environment
if (exists("add_ten_data", where = .GlobalEnv)) return(get("add_ten_data", envir = .GlobalEnv))
data = model$model
data = data + 10
# assign add_ten_data to the global environment
assign('add_ten_data', data, envir = .GlobalEnv)
return(data)
}
I'm happy to do so, but worry about the "netiquette" of putting something in the user's environment. There's also a potential problem if users happen to have an object called "add_ten_data" in their environment.
So, what is the best way of accomplishing this?
Thanks in advance!
You should certainly avoid writing an object to the global environment. If you find that you have to repeat the same computationally expensive task at the top of a number of different functions, it means you are carrying out the computationally expensive task too late.
For example, you could create an S3 class that holds the necessary components to produce a "cheap" plot and a "cheap" extraction of the coefficients. It even has the benefits of generic dispatch:
add_ten <- function(model) model$model + 10
lm_tens <- function(formula, data)
{
model <- if(missing(data)) lm(formula) else lm(formula, data = data)
structure(list(data = data.frame(add_ten(model)), model = model),
class = "tens")
}
plot.tens <- function(tens) {
x = all.vars(formula(tens$data))[2]
y = all.vars(formula(tens$data))[1]
ggplot2::ggplot(tens$data, ggplot2::aes(x = x, y = y)) +
ggplot2::geom_point() +
ggplot2::geom_smooth()
}
coef.tens = function(tens) {
coef(lm(formula(tens$model), data = tens$data))
}
So now we just need to do:
set.seed(21)
y = rnorm(100)
x = .5*y + rnorm(100, 0, sqrt(.75))
mod <- lm_tens(y ~ x)
coef(mod)
#> (Intercept) x
#> 4.3269914 0.5775404
plot(mod)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Note that we only need to call add_ten once here.
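If a full S3 class feels heavy, a different technique is to keep the intermediate result in a closure instead of the global environment. The sketch below is only an illustration: make_tens() and its element names are hypothetical, and the counter exists purely to show how often the expensive step runs.

```r
set.seed(21)
y <- rnorm(100)
x <- .5 * y + rnorm(100, 0, sqrt(.75))
mod <- lm(y ~ x)

# make_tens() is a hypothetical helper: it keeps the shifted data in a closure.
make_tens <- function(model) {
  cached <- NULL
  n_computed <- 0                      # counts how often the slow step runs
  get_data <- function() {
    if (is.null(cached)) {
      cached <<- model$model + 10      # stand-in for the expensive operation
      n_computed <<- n_computed + 1
    }
    cached
  }
  list(
    data  = get_data,
    coefs = function() coef(lm(formula(model), data = get_data())),
    times_computed = function() n_computed
  )
}

tens <- make_tens(mod)
tens$coefs()
head(tens$data())
tens$times_computed()   # 1: the expensive step ran only once
```

Nothing is written to the user's workspace, and the cached data disappears when the tens object does.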

Create a function with an argument used in a formula

I'm a beginner at creating functions and I have some trouble with something that is probably basic.
I'd like to create a function that takes a data.frame and the name of a variable as arguments, and returns the linear regression of this variable on the others (there is no real point in doing that, I'm just trying to learn how to create functions):
my_lm <- function(df, var) lm(var~., data = df)
my_lm(diamonds, price)
But I get this error:
Error in eval(predvars, data, env) : object 'price' not found
Thanks for your help, and sorry for my bad English.
One solution is to pass price as a character string and use formula() to convert the string into a proper formula object for lm():
my_lm <- function(df, var) {
f = formula(paste0(var, "~.")) # this creates "price ~ ." in the example
lm(f, data = df)
}
my_lm(diamonds, var="price")
Or, if you have to pass price as "not a string", you need NSE:
my_lm <- function(df, var) {
var = substitute(var)
f = formula(paste0(var, "~."))
lm(f, data = df)
}
my_lm(diamonds, var=price)
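The original attempt fails because var ~ . is taken literally: lm() looks for a variable called var, finds the function argument, and evaluating that argument looks for an object named price outside the data frame. As an alternative to pasting strings, here is a small sketch using reformulate() from the stats package; mtcars stands in for diamonds so that it runs without ggplot2 loaded.

```r
# mtcars ships with R and is used here in place of diamonds.
my_lm <- function(df, var) {
  # reformulate() builds the formula `var ~ .` from a character response name
  f <- reformulate(".", response = var)
  lm(f, data = df)
}

fit <- my_lm(mtcars, "mpg")
formula(fit)
coef(fit)
```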

Calling update within a lapply within a function, why isn't it working?

This is a follow-up question to Error in calling `lm` in a `lapply` with `weights` argument, but it may not be the same problem (though it is related).
Here is a reproducible example:
dd <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100),
wg = runif(100,1,100))
ls.form <- list(
formula(y~x1+x2),
formula(y~x3+x4),
formula(y~x1|x2|x3),
formula(y~x1+x2+x3+x4)
)
I have a function that takes different arguments (1- a subsample, 2- a colname for the weights argument, 3- a list of formulas to try and 4- the data.frame to use)
f1 <- function(samp, dat, forms, wgt){
baselm <- lm(y~x1, data = dat[samp,], weights = dat[samp,wgt])
lapply(forms, update, object = baselm)
}
If I call the function, I get an error:
f1(1:66, dat = dd, forms = ls.form, wgt = "wg")
Error in is.data.frame(data) : object 'dat' not found
I don't really get why it doesn't find the dat object; it should be part of the function environment. The problem is in the update part of the code: if you remove this line from the function, the code works.
In the end, this function will be called with lapply:
lapply(list(1:66, 33:99), f1, dat=dd, forms = ls.form, wgt="wg")
I think your problems are due to the scoping rules used by lm which are quite frankly a pain in the r-squared.
One option is to use do.call to get it to work, but you get some ugly output when it deparses the inputs to give the call used for the standard print method.
f1 <- function(samp, dat, forms, wgt){
baselm <- do.call(lm,list(formula=y~x1, data = dat[samp,], weights = dat[samp,wgt]))
lapply(forms, update, object = baselm)
}
A better way is to use an eval(substitute(...)) construct which gives the output you originally expected:
f2 <- function(samp, dat, forms, wgt){
baselm <- eval(substitute(lm(y~x1, data = dat[samp,], weights = dat[samp,wgt])))
lapply(forms, update, object = baselm)
}
Such scoping issues are very common with lm objects. You can solve this by specifying the correct environment for evaluation:
f1 <- function(samp, dat, forms, wgt){
baselm <- lm(y~x1, data = dat[samp,], weights = dat[samp,wgt])
mods <- lapply(forms, update, object = baselm, evaluate = FALSE)
e <- environment()
lapply(mods, eval, envir = e)
}
f1(1:66, dat = dd, forms = ls.form, wgt = "wg")
#works
The accepted answer works, but I continued digging and found this old r-help question (here), which gave more options and explanation. I thought I would post it here in case somebody else needs it.
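A different technique, not taken from the answers above, is to skip update() and fit every formula directly, with the weights copied into a column of the subsetted data so that lm() finds them wherever the formula was created. This is only a sketch: fit_forms() and the .w column name are made up, and the approach assumes none of the formulas use the dot shorthand (which would pick up .w as a predictor).

```r
set.seed(1)
dd <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                 wg = runif(100, 1, 100))
forms <- list(y ~ x1, y ~ x1 + x2)

# fit_forms() is a hypothetical replacement for f1() that avoids update().
fit_forms <- function(samp, dat, forms, wgt) {
  sub <- dat[samp, , drop = FALSE]
  sub$.w <- sub[[wgt]]          # weights now live inside the data frame
  lapply(forms, function(f) lm(f, data = sub, weights = .w))
}

res <- fit_forms(1:66, dat = dd, forms = forms, wgt = "wg")
lapply(res, coef)
```

Because the weights are looked up in the data first, the environment in which each formula was defined no longer matters.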

loop through column glmer

I am trying to run a glmer by looping through the columns in my dataset that contain response variables (dat_prob). The code I am using is as follows, adapted from code found in another Stack Overflow question (Looping through columns in R).
Their code:
dat_y<-(dat[,c(2:1130)])
dat_x<-(dat[,c(1)])
models <- list()
#
for(i in names(dat_y)){
y <- dat_y[i]
model[[i]] = lm( y~dat_x )
}
My code:
dat_prob<-(probs[,c(108:188)])
dat_age<-(probs[,c(12)])
dat_dist<-(probs[,c(20)])
fyearcap=(probs[,c(25)])
fstation=(probs[,c(22)])
fnetnum=(probs[,c(23)])
fdepth=(probs[,c(24)])
models <- list()
#
for(i in names(dat_prob)){
y <- dat_prob[i]
y2=as.vector(y)
model[[i]] = glmer( y ~ dat_age * dat_dist + (1|fyearcap) + (1|fstation)+
(1|fnetnum)+ (1|fdepth),family=binomial,REML=TRUE )
}
And I receive this error, similar to the error received in the hyperlinked question:
Error in model.frame.default(drop.unused.levels = TRUE, formula = y ~ :
invalid type (list) for variable 'y'
I have been working through this for hours and now can't see the forest through the trees.
Any help is appreciated.
y <- dat_prob[i] makes y a list (or data frame, whatever). Lists are vectors - try is.vector(list()), so even y2 = as.vector(y) is still a list/data frame (even though you don't use it).
class(as.vector(mtcars[1]))
# [1] "data.frame"
To extract a numeric vector from a data frame, use [[: y <- dat_prob[[i]].
class(mtcars[[1]])
# [1] "numeric"
Though I agree with Roman - using formulas is probably a nicer way to go. Try something like this:
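for(i in names(dat_prob)) {
my_formula = as.formula(paste(i,
"~ dat_age * dat_dist + (1|fyearcap) + (1|fstation)+ (1|fnetnum)+ (1|fdepth)"
))
# data = probs lets glmer find the response column named in the formula;
# the result is stored in the models list created before the loop
models[[i]] = glmer(my_formula, data = probs, family = binomial, REML = TRUE)
}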
I'm also pretty skeptical of whatever you're doing trying 80 different response variables, but that's not your question...
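To make the difference between [ and [[ and the formula loop concrete without lme4, here is a minimal base R sketch. It uses glm() on mtcars purely as a stand-in for glmer() on the poster's data; the column choices are arbitrary.

```r
# Single brackets keep the data frame; double brackets extract the column.
class(mtcars["vs"])     # "data.frame"
class(mtcars[["vs"]])   # "numeric"

# Loop over response names, building one formula per response.
# `vs` and `am` are 0/1 columns in mtcars; `wt` is an arbitrary predictor.
responses <- c("vs", "am")
models <- list()
for (resp in responses) {
  f <- as.formula(paste(resp, "~ wt"))
  models[[resp]] <- glm(f, data = mtcars, family = binomial)
}
names(models)
```

Passing the data frame through the data argument means the response only needs to exist as a column, not as a separate object in the workspace.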

How to use the for loop with function needing for a string field?

I am using the smbinning R package to compute the variables information value included in my dataset.
The function smbinning() is pretty simple and it has to be used as follows:
result = smbinning(df= dataframe, y= "target_variable", x="characteristic_variable", p = 0.05)
So, df is the dataset you want to analyse, y the target variable and x is the variable of which you want to compute the information value statistics; I enumerate all the characteristic variables as z1, z2, ... z417 to be able to use a for loop to mechanize all the analysis process.
I tried to use the following for loop:
for (i in 1:417) {
result = smbinning(df=DATA, y = "FLAG", x = "DATA[,i]", p=0.05)
}
in order to be able to compute the information value for each variable corresponding to i column of the dataframe.
The class of DATA is "data.frame", while the class of result is "character".
So, my question is how to compute the information value of each variable and store that in the object denominated result?
Thanks! Any help will be appreciated!
Since no sample data is provided, I can only hazard a guess that the following will work:
results_list = list()
for (i in 1:417) {
current_var = paste0('z', i)
current_result = smbinning(df=DATA, y = "FLAG", x = current_var, p=0.05)
results_list[i] = current_result$iv
}
You could try to use one of the apply methods, iterating over the z-counts. The x value to smbinning should be the column name not the column.
results = sapply(paste0("z",1:417), function(foo) {
smbinning(df=DATA, y = "FLAG", x = foo, p=0.05)
})
class(results) # should be "list"
length(results) # should be 417
names(results) # should be z1,...
results[[1]] # should be the first result, so you can also iterate by indexing
I tried the following, since you had not provided any data
> XX=c("IncomeLevel","TOB","RevAccts01")
> res = sapply(XX, function(z) smbinning(df=chileancredit.train,y="FlagGB",x=z,p=0.05))
Warning message:
NAs introduced by coercion
> class(res)
[1] "list"
> names(res)
[1] "IncomeLevel" "TOB" "RevAccts01"
> res$TOB
...
HTH
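The pattern the answers share is to iterate over column names and pass each name as a string. The sketch below shows that pattern with a hypothetical stand-in, score_column(), because smbinning and the poster's data are not available here; the value it returns is an absolute correlation, not an information value.

```r
# Hypothetical stand-in for smbinning(): takes a data frame and column *names*.
score_column <- function(df, y, x) {
  list(x = x, iv = abs(cor(df[[x]], df[[y]])))   # placeholder statistic only
}

# Made-up data with the same naming scheme as in the question.
DATA <- data.frame(FLAG = c(0, 1, 0, 1, 1, 0),
                   z1 = c(1, 5, 2, 6, 7, 1),
                   z2 = c(3, 3, 4, 2, 5, 1))

# Select the z-columns by name and keep one result per column in a named list.
vars <- grep("^z[0-9]+$", names(DATA), value = TRUE)
results <- lapply(setNames(vars, vars), function(v) score_column(DATA, "FLAG", v))

# Pull one number out of each result into a named numeric vector.
iv <- vapply(results, function(r) r$iv, numeric(1))
iv
```

Using lapply() rather than sapply() guarantees a list regardless of what each call returns, and the names make it possible to look up a result with results$z1.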
