Weird error: not found #dependent variable in eval(predvars, data, env) - r

I am getting a strange error when trying to make predictions from my model.
My original dataset is a discrete choice experiment where doctors evaluate a set of patients with different characteristics and make a treatment choice.
The structure of my data is like this:
>str(dcefull2)
'data.frame': 350 obs. of 28 variables:
$ id : Factor w/ 25 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ choice : Factor w/ 3 levels "stop","half",..: 2 3 2 2 2 2 2 2 2 3 ...
$ response : Factor w/ 2 levels "Good response",..: 1 2 2 2 2 1 2 2 1 1 ...
$ resist_profile : Factor w/ 3 levels "MDR-TB","MDR-TB + PZA + EMB resis",..: 3 2 2 3 3 2 1 2 3 3 ...
$ ambu_regimen : Factor w/ 2 levels "BPaLM","Standard": 2 2 2 2 1 2 2 2 2 1 ...
$ expo_his : Factor w/ 2 levels "No exposure",..: 2 1 1 1 2 2 2 1 2 1 ...
$ resist_prob : num 2.8 0.8 1.8 1.8 1.8 1.8 2.8 0.8 2.8 0.8 ...
$ cred_interval : Factor w/ 2 levels "Wide","Narrow": 2 2 2 1 2 2 2 1 1 1 ...
I fitted an ordered logistic model with choice (an ordinal variable with 3 categories) as a function of patient characteristics.
random_model_off3 <- clmm2(choice ~ response + resist_prob*resist_profile + resist_prob*ambu_regimen + resist_prob*expo_his + resist_prob*cred_interval, random = id, data=dcefull2, Hess = TRUE, nAGQ = 10)
I then created a new dataset containing all the independent variables from the original dataset. I varied the values of 'resist_prob' and 'cred_interval' and held each of the other variables at a single fixed value.
The structure of my new data is like this:
>str(newdat2)
'data.frame': 2002 obs. of 6 variables:
$ response : Factor w/ 1 level "Good response": 1 1 1 1 1 1 1 1 1 1 ...
$ resist_profile: Factor w/ 1 level "MDR-TB": 1 1 1 1 1 1 1 1 1 1 ...
$ cred_interval : Factor w/ 2 levels "Wide","Narrow": 1 2 1 2 1 2 1 2 1 2 ...
$ ambu_regimen : Factor w/ 1 level "BPaLM": 1 1 1 1 1 1 1 1 1 1 ...
$ expo_his : Factor w/ 1 level "No exposure": 1 1 1 1 1 1 1 1 1 1 ...
$ resist_prob : num 0 0 0.004 0.004 0.008 0.008 0.012 0.012 0.016 0.016 ...
I used predict() to get the predicted probability of each treatment choice for each row.
predict(random_model_off3, newdata = newdat2)
Then I received this error:
Error in eval(predvars, data, env) : object 'choice' not found
I find this very strange because "choice" is my dependent variable. I could not find any similar issues or solutions online.
I would very much appreciate your help!
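One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: as far as I recall, the predict methods for clm2/clmm2 objects in the ordinal package return the probability of the response category given in each row, so newdata is expected to contain the response column as well. Under that assumption, the usual workaround is to repeat every row of the new data once per response level. The helper name add_response_levels and the toy data below are made up for illustration; the commented lines show how it might be applied to the objects in the question. It may also be necessary to give the factors in newdat2 the same levels as in dcefull2.

```r
# Hypothetical helper: repeat every row of `newdata` once per response level,
# so that each row carries the category whose probability is wanted.
add_response_levels <- function(newdata, response, lev) {
  idx <- rep(seq_len(nrow(newdata)), times = length(lev))
  out <- newdata[idx, , drop = FALSE]
  out[[response]] <- factor(rep(lev, each = nrow(newdata)), levels = lev)
  rownames(out) <- NULL
  out
}

# Toy stand-in for newdat2 (values are illustrative only)
toy_new <- data.frame(
  resist_prob   = c(0, 0, 0.004, 0.004),
  cred_interval = factor(c("Wide", "Narrow", "Wide", "Narrow"),
                         levels = c("Wide", "Narrow"))
)
toy_long <- add_response_levels(toy_new, "choice", c("low", "mid", "high"))
str(toy_long)

# With the real objects this might look like (untested assumption):
# newdat_long <- add_response_levels(newdat2, "choice", levels(dcefull2$choice))
# newdat_long$prob <- predict(random_model_off3, newdata = newdat_long)
```

Each original row then appears once per category, and the predicted value in a given row would be the probability of that row's category.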

Related

Categorical data in R with h2o

I have run a logistic regression model with both categorical and numerical variables. The model tries to predict the number of website visits in a month based on the first week. Unsurprisingly, the number of website visits in the first week was the strongest indicator. However, when I ran h2o deep learning with various models and activation functions, the models performed very poorly. According to the var_imp function, they assign importance to variables that are not important at all (judging by my logistic regression model, which is quite good, this is wrong), and only categorical subsets seem to be ranked with high importance. The models do not perform well even on the training data, a real warning sign! So I wanted to post my code to check that I am not doing anything that harms the model. It seems strange for logistic regression to get it roughly right but deep learning to get it so wrong, so I imagine it is something I have done!
summary(data)
$ VAR1: Factor w/ 8 levels ,..: 1 5 2 1 7 2 5 1 5 1 ...
$ VAR2: Factor w/ 5 levels ,..: 1 4 1 1 4 4 4 1 1 4 ...
$ VAR3: Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 2 2 2 ...
$ VAR4: Factor w/ 2 levels : 2 1 2 2 1 1 1 2 2 1 ...
$ VAR5 : num 1000 20 30 20 30 30 30 50 30 400 ...
$ VAR6: Factor w/ 2 levels "N","Y": 1 2 2 1 2 2 2 2 1 2 ...
$ VAR7: Factor w/ 2 levels "N","Y": 1 2 2 1 2 2 2 2 1 2 ...
$ VAR8: num 0 0 0 0 0 0 0 0 0 0 ...
$ VAR9: num 56 52 49 29 28 38 34 79 53 36 ...
$ VAR10: num 3 2 1 3 2 2 3 4 2 2 ...
$ VAR11: num 1 1 1 2 2 1 1 1 1 2 ...
$ VAR12: Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
$ VAR13: num 1 0 1 1 1 0 1 0 0 0 ...
$ VAR14: Factor w/ 2 levels "N","Y": 2 1 1 1 1 1 1 1 1 1 ...
$ VAR15: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ VAR16: num 1 0 0 1 0 0 0 1 1 0 ...
$ VAR17: num 19 7 1 4 10 2 4 4 7 12 ...
$ VAR18: Factor w/ 2 levels "N","Y": 1 2 2 2 2 2 2 1 2 1 ...
$ VAR19: Factor w/ 2 levels "0","Y": 1 1 2 1 1 1 1 1 1 1 ...
$ VAR20: Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 1 1 1 1 ...
$ VAR21: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ VAR22: num 0.579 0 0 0 0.4 ...
$ VAR23: num 1.89 1 1 1 2.9 ...
$ VAR24: num 0.02962 0.00691 0.05327 0.02727 0.01043 ...
$ VAR25: Factor w/ 3 levels ..: 2 2 2 3 3 2 3 2 1 3 ...
$ VAR26: num 3 2 1 2 3 1 2 1 2 4 ...
$ VAR27: num 3 2 1 1 5 1 1 1 1 2 ...
$ VAR_RESPONSE: num 7 24 4 3 8 12 5 48 2 7 ...
sapply(data,function(x) sum(is.na(x)))
colSums(is.na(data))
data[is.na(data)] = 0
d.hex = as.h2o(data, destination_frame= "d.hex")
Data_g.split = h2o.splitFrame(data = d.hex,ratios = 0.75)
Data_train = Data_g.split[[1]]#75% training data
Data_test = Data_g.split[[2]]
activation_opt <-
c("Rectifier","RectifierWithDropout","Maxout","MaxoutWithDropout",
"Tanh","TanhWithDropout")
hidden_opt <- list(c(10,10),c(20,15),c(50,50,50),c(5,3,2),c(100,100),c(5),c(30,30,30),c(50,50,50,50),c(5,4,3,2))
l1_opt <- c(0,1e-3,1e-5,1e-7,1e-9)
l2_opt <- c(0,1e-3,1e-5,1e-7,1e-9)
hyper_params <- list( activation=activation_opt,
hidden=hidden_opt,
l1=l1_opt,
l2=l2_opt )
search_criteria <- list(strategy = "RandomDiscrete", max_models=30)
dl_grid10 <- h2o.grid("deeplearning"
,grid_id = "deep_learn10"
,hyper_params = hyper_params
,search_criteria = search_criteria
,x = 1:27
,y = "VAR_RESPONSE"
,training_frame = Data_train)
d_grid10 <- h2o.getGrid("deep_learn10",sort_by = "mse")
mn = h2o.deeplearning(x = 1:27,
y = "VAR_RESPONSE",
training_frame = Data_train,
model_id = "mn",
activation = "Maxout",
l1 = 0,
l2 = 1e-9,
hidden = c(100,100))
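One step in the code above that may be worth checking is data[is.na(data)] = 0. As far as I know, for a factor column whose levels do not include "0", that assignment only produces an "invalid factor level" warning and leaves the NA in place. Whether this has anything to do with the poor deep learning results is not established here; the following is just a sketch of filling missing values per column type. The function name fill_missing and the label "missing" are made up for illustration.

```r
# Hypothetical helper: fill NAs with 0 in numeric columns and with an
# explicit extra level in factor columns.
fill_missing <- function(df, label = "missing") {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.numeric(col)) {
      col[is.na(col)] <- 0
    } else if (is.factor(col) && anyNA(col)) {
      levels(col) <- c(levels(col), label)  # register the new level first
      col[is.na(col)] <- label
    }
    df[[nm]] <- col
  }
  df
}

# Toy data, column names borrowed from the question for readability only
toy <- data.frame(
  VAR5 = c(1000, NA, 30),
  VAR6 = factor(c("N", NA, "Y"))
)
clean <- fill_missing(toy)
str(clean)
```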

Error in Logistic Regression for Factors in R

I am trying to do logistic regression by using the code:
model <- glm (Participation ~ Gender + Race + Ethnicity + Education + Comorbidities + WLProgram + LoseWeight + EverLoseWeight + PastYearLW + Age + BMI, data = LogisticData, family = binomial)
summary(model)
I keep getting the error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
Upon checking the forums I checked to see which variables were factors:
str(LogisticData)
'data.frame': 994 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 2 1 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 1 2 2 1 2 1 1 1 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 1 3 1 1 1 1 1 1
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 1 2 1 1 1 2 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 2 1 1 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": NA 1 2 2 1 1 1 NA 1 1 ...
$ LoseWeight : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": NA 2 1 1 1 2 1 NA 1 1 ...
$ EverLoseWeight: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ Age : int 29 35 69 32 21 45 40 62 59 58 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
$ BMI : num 25.7 33.8 26.4 32.3 27.5 ...
All factors appear to have 2 or more levels.
I also tried to omit NA's which still gave me this error.
I want all variables in the regression, and can't figure out why it won't run.
When I run:
newdata <- droplevels(na.omit(LogisticData))
> str(newdata)
'data.frame': 840 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 2 2 1 2 1 1 1 2 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 3 1 1 1 1 1 3
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 2 1 1 1 1 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 1 1 1 ...
$ LoseWeight : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": 2 1 1 1 2 1 1 1 1 2 ...
$ EverLoseWeight: Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int 35 69 32 21 45 40 59 58 23 32 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 2 1 ...
$ BMI : num 33.8 26.4 32.3 27.5 45.4 ...
- attr(*, "na.action")=Class 'omit' Named int [1:154] 1 8 13 14 21 24 25 46 55 58 ...
.. ..- attr(*, "names")= chr [1:154] "1" "8" "13" "14" ...
This doesn't make sense to me, because the first str(LogisticData) clearly shows 2 levels in EverLoseWeight: you can see both "Yes" and "No", and both codes 1 and 2. How do I fix this anomaly?
Given your update, it looks like you have at least two possibilities.
1: Remove the factors that are left with only a single level after removing the NAs (i.e. LoseWeight and EverLoseWeight).
2: Treat the NAs as an extra level. Something along the lines of
a = as.factor(c(1,1,NA,2))
b = as.factor(c(1,1,2,1))
# 0 is an unused factor level for a
x = data.frame(a, b)
levels(x$a) = c(levels(x$a), 0)
x$a[is.na(x$a)] = 0
But this might not deal with any singularity issues that also resulted in having single-level factors.
Try running summary() on your raw data and make sure that every level actually has observations. I would put this in a comment, but I don't have the reputation points :(
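The check suggested above can be sketched as follows: count how many levels each factor keeps once incomplete rows are dropped, since glm() fits on complete cases by default. The function name count_levels and the toy data are made up for illustration.

```r
# Hypothetical helper: number of levels left in each factor after
# dropping rows with any NA and then dropping unused levels.
count_levels <- function(df) {
  complete <- droplevels(na.omit(df))
  vapply(Filter(is.factor, complete), nlevels, integer(1))
}

# Toy data: `a` is "No" only in the row where `b` is missing
toy <- data.frame(
  y = factor(c("Yes", "No", "Yes", "No")),
  a = factor(c("Yes", "Yes", "No", "Yes")),
  b = factor(c("No", "Yes", NA, "Yes"))
)
res <- count_levels(toy)
print(res)

# Factors that would trigger the contrasts error
names(res)[res < 2]
```

Any factor reported with fewer than 2 levels is a candidate for removal from the formula or for the extra-level treatment described above.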

R, aggregate function apparently causes loss of column levels?

I just encountered a weird situation in RGui. I used the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this:
time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
'data.frame': 185648 obs. of 10 variables:
$ time : int 5 5 5 5 5 5 6 6 6 6 ...
$ days : int 62 62 62 62 62 62 69 69 69 69 ...
$ treatment : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
$ parallel : int 1 2 3 1 2 3 1 2 3 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
$ habitat : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0 0 0 0 0 0 0 0 0 ...
I wanted aggregate to calculate the mean value across my (up to 3) parallels:
df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)
Afterwards, the level "biofilm" in column "habitat" is lost:
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 44608 obs. of 9 variables:
$ time : int 1 2 1 2 1 2 1 2 1 2 ...
$ days : int 2 22 2 22 2 22 2 22 2 22 ...
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
$ habitat : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0.00359 0 0 0 ...
I spent a lot of time looking into this (I have only just realised that many of my other issues also seem to be related to aggregate). When I removed the column "cellcounts", it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information ("biofilm" always goes with NA). Is this the cause? It always worked before, so I can't get my head around it. Was there a change to the base::aggregate function or something like that? Do you have an explanation? I'm using R-3.4.0; the other packages I use are reshape, reshape2 and ggplot2.
Thx a lot, a confused crazysantaclaus
The issue comes from the NA values. Maybe your file was loaded differently in the past and these were stored as strings instead of NA values? Here's a way to solve it by setting them to a "NA" string:
levels(df$cellcounts) <- c(levels(df$cellcounts),"NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE)
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 4 obs. of 9 variables:
$ time : int 1 2 1 2
$ days : int 2 22 2 22
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2
$ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
$ habitat : Factor w/ 2 levels "biofilm","water": 2 2 2 1
$ disturbance : Factor w/ 1 level "none": 1 1 1 1
$ variable : Factor w/ 1 level "Proteobacteria": 1 1 1 1
$ cellcounts : Factor w/ 2 levels "batch","NA": 1 1 1 2
$ value : num 0 0.00359 0 0
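To make the mechanism concrete: the formula method of aggregate has na.action = na.omit as its default, so any row with an NA in one of the variables in the formula is dropped before grouping. The toy data below is made up for illustration, and the level label "none" is an arbitrary choice; it is a minimal sketch of the effect, not the original data.

```r
# Toy data: the only "biofilm" row has a missing cellcounts value
df_toy <- data.frame(
  habitat    = factor(c("water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA)),
  value      = c(1, 3, 5)
)

# Default behaviour: the row with NA in cellcounts is removed first,
# so "biofilm" never reaches the grouping step.
agg_default <- aggregate(value ~ habitat + cellcounts, data = df_toy, FUN = mean)
print(agg_default)

# Turning the missing values into an explicit level keeps the row.
df_fixed <- df_toy
levels(df_fixed$cellcounts) <- c(levels(df_fixed$cellcounts), "none")
df_fixed$cellcounts[is.na(df_fixed$cellcounts)] <- "none"
agg_fixed <- aggregate(value ~ habitat + cellcounts, data = df_fixed, FUN = mean)
print(agg_fixed)
```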
data
df <- read.table(text=" time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
",header=T)

Find the average of one variable in multiple year classes in R

I have 50 year-classes, and age and length data on individuals within each year class.
Without creating a separate data set for each year class, I'm trying to plot the average age for each year class.
That is, year class along the x axis and average age (or length) for each year class along the y axis.
This is my data frame:
'data.frame': 236628 obs. of 7 variables:
$ maturity: Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ year : Factor w/ 50 levels "1966","1967",..: 1 1 1 1 1 1 1 1 1 1 ...
$ quarter : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ area : Factor w/ 10 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
$ lngth : int 145 145 145 150 150 150 150 150 155 155 ...
$ age : int 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Ord.factor w/ 2 levels "0"<"1": 1 2 2 2 2 2 2 2 1 2 ...
Cheers
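One way this could be approached, as a minimal sketch: compute the per-year means with aggregate and plot them against the year. The column names year, age and lngth come from the data frame above; the data frame name fish and its values are made up for illustration.

```r
# Toy stand-in for the real data (3 year classes instead of 50)
fish <- data.frame(
  year  = factor(rep(c("1966", "1967", "1968"), each = 4)),
  age   = c(1, 1, 2, 2,  1, 2, 3, 2,  2, 2, 4, 4),
  lngth = c(145, 150, 155, 150,  150, 160, 170, 160,  160, 165, 180, 175)
)

# Mean age and mean length for each year class in one call
means <- aggregate(cbind(age, lngth) ~ year, data = fish, FUN = mean)
print(means)

# Year is a factor, so convert via character to get numeric x values
yr <- as.integer(as.character(means$year))
plot(yr, means$age, type = "b",
     xlab = "Year class", ylab = "Mean age")
```

Swapping means$age for means$lngth in the plot call gives the corresponding plot for average length.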

How can I take multiple vectors and recode their datatypes in R?

I'm looking for an elegant way to change multiple vectors' datatypes in R.
I'm working with an educational dataset: 426 students' answers to eight multiple choice questions (1 = correct, 0 = incorrect), plus a column indicating which instructor (1, 2, or 3) taught their course.
As it stands, my data is sitting pretty in data.df, like this:
str(data.df)
'data.frame': 426 obs. of 9 variables:
$ ques01: int 1 1 1 1 1 1 0 0 0 1 ...
$ ques02: int 0 0 1 1 1 1 1 1 1 1 ...
$ ques03: int 0 0 1 1 0 0 1 1 0 1 ...
$ ques04: int 1 0 1 1 1 1 1 1 1 1 ...
$ ques05: int 0 0 0 0 1 0 0 0 0 0 ...
$ ques06: int 1 0 1 1 0 1 1 1 1 1 ...
$ ques07: int 0 0 1 1 0 1 1 0 0 1 ...
$ ques08: int 0 0 1 1 1 0 1 1 0 1 ...
$ inst : num 1 1 1 1 1 1 1 1 1 1 ...
But those ques0x values aren't really integers. Rather, I think it's better to have R treat them as experimental factors. Same goes for the "inst" values.
I'd love to turn all those ints and nums into factors.
Ideally, an elegant solution should produce a dataframe—I call it factorData.df—that looks like this:
str(factorData.df)
'data.frame': 426 obs. of 9 variables:
$ ques01: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 2 ...
$ ques02: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ ques03: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 2 2 1 2 ...
$ ques04: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
$ ques05: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ ques06: Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 2 ...
$ ques07: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 1 2 ...
$ ques08: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 2 2 1 2 ...
$ inst : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
I'm fairly certain that whatever solution you folks come up with, it ought to be easy to generalize to any n number of variables that'd need to get reclassified, and would work across most common conversions (int -> factor and num -> int, for example).
No matter what solution you folks generate, it's bound to be more elegant than mine, because my current clunky code is just 9 separate factor() statements, one for each variable, like this:
factorData.df$ques01
I'm brand-new to R, programming, and stackoverflow. Please be gentle, and thanks in advance for your help!
This was also answered in R-Help.
I imagine that there's a better way to do it, but here are two options:
# use a sample data set
> str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
> data.df <- cars
You can use lapply:
> data.df <- data.frame(lapply(data.df, factor))
Or a for statement:
> for(i in 1:ncol(data.df)) data.df[,i] <- as.factor(data.df[,i])
In either case, you end up with what you want:
> str(data.df)
'data.frame': 50 obs. of 2 variables:
$ speed: Factor w/ 19 levels "4","7","8","9",..: 1 1 2 2 3 4 5 5 5 6 ...
$ dist : Factor w/ 35 levels "2","4","10","14",..: 1 3 2 9 5 3 7 11 14 6 ...
I found an alternative solution in the plyr package:
# load the package and data
> library(plyr)
> data.df <- cars
Use the colwise function:
> data.df <- colwise(factor)(data.df)
> str(data.df)
'data.frame': 50 obs. of 2 variables:
$ speed: Factor w/ 19 levels "4","7","8","9",..: 1 1 2 2 3 4 5 5 5 6 ...
$ dist : Factor w/ 35 levels "2","4","10","14",..: 1 3 2 9 5 3 7 11 14 6 ...
Incidentally, if you look inside the colwise function, it just uses lapply:
df <- as.data.frame(lapply(filtered, .fun, ...))
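The lapply approach also extends to a chosen subset of columns and to other conversions, which is what the question asks about. The sketch below assumes a small made-up data frame named quiz; the helper name convert_cols is hypothetical.

```r
# Hypothetical helper: apply a conversion function to selected columns only
convert_cols <- function(df, cols, fun) {
  df[cols] <- lapply(df[cols], fun)
  df
}

# Toy data shaped like the question's data.df
quiz <- data.frame(
  ques01 = c(1L, 0L, 1L),
  ques02 = c(0L, 1L, 1L),
  inst   = c(1, 2, 3)
)

# int/num -> factor for every column
factor_quiz <- convert_cols(quiz, names(quiz), factor)
str(factor_quiz)

# num -> int for a single column
int_quiz <- convert_cols(quiz, "inst", as.integer)
str(int_quiz)
```

Assigning into df[cols] keeps the object a data frame and leaves the columns that are not listed untouched.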
