Looping through list for each element computation in R - r

I am naive at R and trying to get a stuff done so advanced apologies if it is a stupid way of doing it.
I am trying to get coefficient and relevance of x-values to y-values. Values in X are criteria to which co-relevance is being tested.
I need to find postive or negative relevance/confidence for results represented in myList. Rather than putting one column in Y manually I just want to iterate through it for result of each column.
library(rms)
parameters <- read.csv(file="C:/Users/manjaria/Documents/Lek papers/validation_csv.csv", header=TRUE)
#attach(parameters)
myList <- c("name1","name2","name3","name4","name5")
for (cnt in seq(length(myList))) {
Y<- cbind(myList[cnt])
X<- cbind(age,female,income,employed,traveldays,modesafety,prPoolsize)
XVar <-c("age","female","income","employed","traveldays","modesafety","prPoolsize")
summary (Y)
summary (X)
table(Y)
ddist<- datadist(XVar)
options(datadist = 'ddist')
ologit<- lrm(Y ~ X, data = parameters)
print(ologit)
fitted<- predict(ologit, newdata=parameters, type = "fitted.ind")
colMeans(fitted)
}
I encounter:
Error in model.frame.default(formula = Y ~ X, data = parameters, na.action = function (frame) :
variable lengths differ (found for 'X')
If I don't loop through for-loop and use a static name for Y like
Y<- cbind(name1) it works well.

Related

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
if you don't get this error, re-run the cross validation and at some point you will get a similar error.
The nature of the problem here is that when you do cross-validation, the train and test subsets might not have the exact same variable levels. Here our variable z has three levels (D,E,F).
In the total amount of our data there is much more D's than E's and F's.
Thus whenever you take a small subset of the whole data (to do cross-validation).
There is a very good chance that your z variable are all going to be set at the D's level.
Thus Eand F levels gets dropped, thus we get the error (This answer is helpful to understand the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this a reproducible example, the actual data I am using has many variables like z, I would like to avoid deleting them.)
To answer your question in the comment, I don't know if there is a function or not. Most likely there is one, but I have no idea on which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
#optional tag row for later identification:
#data$rowid<-1:nrow(data)
stratified <- function(df, column, percent){
#split dataframe into groups based on column
listdf<-split(df, df[[column]])
testsubgroups<-lapply(listdf, function(x){
#pick the number of samples per group, round up.
numsamples <- ceiling(percent*nrow(x))
#selects the rows
whichones <-sample(1:nrow(x), numsamples, replace = FALSE)
testsubgroup <-x[whichones,]
})
#combine the subgroups into one data frame
testgroup<-do.call(rbind, testsubgroups)
testgroup
}
testgroup<-stratified(data, "z", 0.8)
This will just split the initial data by column z, if you are interested is grouping by multiple columns then this could be extended by using the group_by function from the dplyr package, but that would be another question.
Comment on the statistics: If you just have a few examples for any particular factor, what type of fit do you expect? A poor fit with wide confidence limits.

How to use one variable in regression with many independent variables in lm()

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
What I have done is found a way to create an object that contains only the significantly correlated variables, to narrow the number of independent variables down.
I have a variable now: sigvars, which is the names of an object that sorted a correlation matrix and picked out only the variables with correlation coefficients >0.5 and <-0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
if(sortedcor[i] > .5 | sortedcor[i] < -.5){
regvar[k] = i
k = k+1
}
}
regvar
sigvars <- names(sortedcor[regvar])
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')
Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to be). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames, hence you are doing data.matrix(composite[1]) which is taking the first component of composite, leaving it as a data frame, then converting that to a matrix via the data.matrix() function. All you really need is just the name of the dependent variable on the LHS of the formula.
The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
The sigvars is names(data). The equation is usually of the form lm(var1 ~ var2+var3+var4), you however have it as lm(var1 ~ var2 var3 var4).
Hopefully that helps.

variable scope & resolution in R function

I want to loop through the vars in a dataframe, calling lm() on each one, and so I wrote this:
findvars <- function(x = samsungData, dv = 'activity', id = 'subject') {
# Loops through the possible predictor vars, does an lm() predicting the dv
# from each, and returns a data.frame of coefficients, one row per IV.
r <- data.frame()
# All varnames apart from the dependent var, and the case identifier
ivs <- setdiff(names(x), c(dv, id))
for (iv in ivs) {
print(paste("trying", iv))
m <- lm(dv ~ iv, data = x, na.rm = TRUE)
# Take the absolute value of the coefficient, then transpose.
c <- t(as.data.frame(sapply(m$coefficients, abs)))
c$iv <- iv # which IV produced this row?
r <- c(r, c)
}
return(r)
}
This doesn't work, I believe b/c the formula in the lm() call consists of function-local variables that hold strings naming vars in the passed-in dataframe (e.g., "my_dependant_var" and "this_iv") as opposed to pointers to the actual variable objects.
I tried wrapping that formula in eval(parse(text = )), but could not get that to work.
If I'm right about the problem, can someone explain to me how to get R to resolve the contents of those vars iv & dv into the pointers I need? Or if I'm wrong, can someone explain what else is going on?
Many thanks!
Here is some repro code:
library(datasets)
data(USJudgeRatings)
findvars(x = USJudgeRatings, dv = 'CONT', id = 'DILG')
So there's enough bad stuff happening in your function besides your trouble with the formula, that I think someone should walk you through it all. Here are some annotations, followed by a better version:
#For small examples, "growing" objects isn't a huge deal,
# but you will regret it very, very quickly. It's a bad
# habit. Learn to ditch it now. So don't inititalize
# empty lists and data frames.
r <- data.frame()
ivs <- setdiff(names(x), c(dv, id))
for (iv in ivs) {
print(paste("trying", iv))
#There is no na.rm argument to lm, only na.action
m <- lm(dv ~ iv, data = x, na.rm = TRUE)
#Best not to name variables c, its a common function, see two lines from now!
# Also, use the coef() extractor functions, not $. That way, if/when
# authors change the object structure your code won't break.
#Finally, abs is vectorized, no need for sapply
c <- t(as.data.frame(sapply(m$coefficients, abs)))
#This is probably best stored in the name
c$iv <- iv # which IV produced this row?
#Growing objects == bad! Also, are you sure you know what happens when
# you concatenate two data frames?
r <- c(r, c)
}
return(r)
}
Try something like this instead:
findvars <- function(x,dv,id){
ivs <- setdiff(names(x),c(dv,id))
#initialize result list of the appropriate length
result <- setNames(vector("list",length(ivs)),ivs)
for (i in seq_along(ivs)){
result[[i]] <- abs(coef(lm(paste(dv,ivs[i],sep = "~"),data = x,na.action = na.omit)))
}
result
}

Apply nls() function to multiple subsets

I need to run a non-linear least squares regression on an entire data set, and then repeat the regression on several subsets of that data set. I can do this for a single subset; for example (where y is a generic logistic equation, and x is a vector from 1 to 20):
example = nls(x ~ y, subset = c(2:20))
but I want to do this for 3:20, 4:20, 5:20, etc. I tried a for loop:
datasubsets <- sapply(2:19, seq, to = 20)
for (i in 1:19){
example[i] = nls(x ~ y, subset = datasubsets[i])
}
but I receive "Error in xj[i] : invalid subscript type 'list'". I would very much like to avoid having to copy and paste nls() 20 times. Any help is much appreciated.
This does the job: sapply(2:19,function(jj) nls(x~y,subset=jj:20)).

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your formula, lm.cars, in was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
Another problem that can cause this is if one passes a character (string vector) to lm instead of a formula. vectors have no environment, and so when lm converts the character to a formula, it apparently also has no environment instead of being automatically assigned the local environment. If one then uses an object as weights that is not in the data argument data.frame, but is in the local function argument, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one used the name weights for the weights variable. In this case, since lm can't find the weights object, it defaults to the function weights() from base thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.

Resources