Error in cross validation with factor value - r

I have this code:
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(is_nocnv ~., data = mydata, method = "lm", trControl = train.control)
# Summarize the results
print(model)
When I execute this code I obtain this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
The field is_nocnv is a factor with the values 'YES' and 'NO':
str(mydata)
'data.frame': 8334 obs. of 7 variables:
$ chr : Factor w/ 1 level "chr1": 1 1 1 1 1 1 1 1 1 1 ...
$ start : int 3218610 154080441 154089408 61735 2069681 2074104 3135175 3137913 3214732 5901288 ...
$ stop : int 154074261 154081058 247813706 2061969 2071738 3130590 3136858 3212946 5900106 5902086 ...
$ strand : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
$ num_probes : int 69643 3 59364 379 2 333 2 33 1943 3 ...
$ segment_mean: num -0.122 -13.462 -0.1 -0.326 -25.242 ...
$ is_nocnv : Factor w/ 2 levels "NO","YES": 2 2 2 1 1 1 1 1 1 1 ...
Here is a small part of my dataset (CSV):
"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr1,3218610,154074261,*,69643,-0.122,YES
chr1,154080441,154081058,*,3,-13.462,YES
chr1,154089408,247813706,*,59364,-0.1003,YES
chr1,61735,2061969,*,379,-0.326,NO
chr1,2069681,2071738,*,2,-25.242,NO
chr1,2074104,3130590,*,333,-0.3957,NO
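The str() output above shows that chr and strand are factors with a single level, which is what the contrasts error complains about. A minimal base R sketch, assuming those two columns carry no information for the model, finds and drops such columns before training. The rows are taken from the CSV sample above; single_level and mydata_clean are names introduced here for illustration only.

```r
# Six sample rows from the question (start/stop omitted for brevity)
mydata <- data.frame(
  chr          = factor(rep("chr1", 6)),
  strand       = factor(rep("*", 6)),
  num_probes   = c(69643, 3, 59364, 379, 2, 333),
  segment_mean = c(-0.122, -13.462, -0.1003, -0.326, -25.242, -0.3957),
  is_nocnv     = factor(c("YES", "YES", "YES", "NO", "NO", "NO"))
)

# Flag factor columns that have fewer than two levels
single_level <- vapply(
  mydata,
  function(col) is.factor(col) && nlevels(col) < 2,
  logical(1)
)
names(mydata)[single_level]

# Drop the flagged columns before passing the data to train()
mydata_clean <- mydata[, !single_level, drop = FALSE]
str(mydata_clean)
```

Since is_nocnv is itself a two-level factor, a classification method (for example method = "glm" in caret) is probably a better fit than "lm"; that part is a suggestion, not something verified against this dataset.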

Related

No starting estimate was successful error with coxme upon data subsetting

I have a large dataset that I subsetted and created a new dataset.
I used the following code that works perfectly
require(sjPlot);require(coxme)
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), data))
But when I used the subsetted dataset with the following code,
www<- subset(data, (data$TRIAL != 5 & data$Sex.standerd.BMI.gpM1F2 >=1))
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), www))
it gave me the following error:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
This is my new data structure
str(www)
Classes ‘data.table’ and 'data.frame': 7576 obs. of 79 variables:
$ TRIAL : num 1 1 1 1 1 1 1 1 1 1 ...
$ FEMALE : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ type_comp2 : chr "0" "0" "Revasc" "0" ...
$ comp2 : num 0 0 1 0 0 0 0 0 0 1 ...
$ comp2_years : num 10 10 9.77 10 10 ...
$ Sex.standerd.BMI.gpM1F2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Trial1_4.MiddleBMI : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
I saw this post but I could not solve my current problem.
Any advice will be greatly appreciated.
Add the droplevels() command to your subset.
This happened to me too, and I found that using droplevels() to forget about the levels you did not include in the subset solved it:
library(survival)
library(coxme)
Change ph.ecog from number to categorical to make this point:
lung$ph.ecog <- as.factor(lung$ph.ecog)
(fit <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lung))
Works well for the full data set. Subset out some levels of ph.ecog, and it gives this error:
lunga <- subset(lung, !ph.ecog %in% c(2, 3))
(fita <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lunga))
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
Using droplevels() to forget about empty levels allows coxme to fit again:
lungb <- droplevels(subset(lung, !ph.ecog %in% c(2, 3)))
(fitb <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lungb))
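To see why droplevels() matters, here is a small base R sketch with made-up data (df, sub and sub_clean are illustrative names, not objects from the question). subset() removes the rows but keeps the unused levels attached to the factor, and droplevels() removes them.

```r
# Toy data: a factor with four levels
df <- data.frame(
  id  = 1:6,
  grp = factor(c("0", "1", "2", "3", "1", "0"))
)

# Subsetting removes the rows but keeps all four levels
sub <- subset(df, !grp %in% c("2", "3"))
levels(sub$grp)
table(sub$grp)   # levels "2" and "3" remain, with zero counts

# droplevels() discards the levels that no longer occur
sub_clean <- droplevels(sub)
levels(sub_clean$grp)
```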

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy. I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running:
stats::predict(model, newdata = newdata)
where newdata is the first row of another data frame:
newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used: the factor levels in newdata must match those used in training. For example, predicting on one of the rows of the training data works:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Comparing str() of both datasets shows that some columns stored as factors in the training data are character or integer in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns to factors with the same levels as in the training data:
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
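The same level-alignment step can be tried without a fitted model. The sketch below uses two small made-up data frames (train_df and new_df are hypothetical stand-ins for model$trainingData and newdata) and only base R.

```r
# Stand-in for the training data: factor columns with known levels
train_df <- data.frame(
  stand   = factor(c("L", "R", "R")),
  pitcher = factor(c("112526", "605200", "115629")),
  balls   = c(0, 1, 2)
)

# Stand-in for the new row: same columns, but character/integer types
new_df <- data.frame(
  stand   = "R",
  pitcher = 605200L,
  balls   = 3L,
  stringsAsFactors = FALSE
)

# Columns present in both data frames that are factors in the training data
common   <- intersect(names(train_df), names(new_df))
fac_cols <- names(which(sapply(train_df[common], is.factor)))

# Rebuild each such column as a factor using the training levels
new_df[fac_cols] <- Map(
  function(x, ref) factor(x, levels = levels(ref)),
  new_df[fac_cols], train_df[fac_cols]
)
str(new_df)
```

Any value in the new data that does not appear among the training levels becomes NA after this step, so it is worth checking for NAs before predicting.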

Error: Can't find column `RH_train` in `.data`

I'm trying to train a model with the caret package (random forest). After running the train code, I get: Error: Can't find column RH_train in .data. I then tried converting the dependent variable (Rendimiento) to a factor, but I get: Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X0.5, X0.6, X0.65, X0.7, X0.75, X0.79, X0.8, X0.81, X0.82, X0.83, X0.85, X0.86, X0.87, X0.88, X0.9, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
library(rpart)
library(rpart.plot)
library(RWekajars)
library(randomForest)
library(party)
library(caret)
library(e1071)
library(dplyr)
#### Load the dataset ####
setwd("C:/Users/Frankenstein/Downloads")
RH <- readxl::read_excel("RH.xlsx")  # read_excel() lives in readxl, which is not attached above
RH$`Año Ingreso`=NULL
RH$`Mes ingreso`=NULL
RH$`Status empleado para Gestión t`=NULL
RH$`Horario trabajo`=NULL
RH$Nacional=NULL
RH$Jefe=NULL
RH$`N Personal`=NULL
colnames(RH)
names(RH)[names(RH) == "Grado de distancia"] <- "Distancia"
names(RH)[names(RH) == "Clave para el estado civil"] <- "EstadoCivil"
names(RH)[names(RH) == "Clave de sexo"] <- "Sexo"
#### Inspect the structure of the data ####
str(RH)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 325 obs. of 10 variables:
$ Rendimiento: num 0.6 0.8 0.85 0.86 0.85 0.8 1 0.86 1 0.9 ...
$ Edad : num 36 37 21 26 25 28 32 32 29 36 ...
$ Posición : chr "ANA" "ANA" "AUX" "AUX" ...
$ Sexo : num 1 1 0 1 1 1 1 1 1 1 ...
$ Distancia : num 5 3 1 1 3 2 2 4 5 4 ...
$ Estrato : num 2 5 3 3 3 3 5 2 5 6 ...
$ EstadoCivil: num 1 2 1 1 1 1 2 1 1 2 ...
$ Hijos : num 1 1 0 0 0 0 0 1 0 0 ...
$ Formación : chr "PREGRADO" "PREGRADO" "PREGRADO" "PREGRADO" ...
$ Educación : num 3 3 3 3 3 3 4 3 3 3 ...
hist(RH$Rendimiento)
summary(RH)
#### Split the data into training and test sets ####
RH_train <- RH[1:243, ]
RH_test <- RH[244:325, ]
glimpse(RH_train)
glimpse(RH_test)
# Define the control
trControl <- trainControl(method = "cv",
number = 10,
search = "grid",
classProbs = TRUE)
set.seed(1234)
RH_train$Rendimiento=factor(RH_train$Rendimiento)
RendimientoFactor=factor(RH_train$Rendimiento)
# Run the model
rf_default <- train(RH_train$Rendimiento ~ RH_train$Edad + RH_train$Sexo + RH_train$Distancia + RH_train$Estrato + RH_train$EstadoCivil + RH_train$Hijos + RH_train$Educación,
data=RH_train,
method = "rf",
metric = "Accuracy",
trControl = trControl)
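The second error message points at the factor levels: values such as 0.6 or 1 are not valid R variable names. A minimal base R sketch of one way around it is to rename the levels with make.names(). The rendimiento vector below is made-up illustration data, not the RH dataset. Separately, the first error likely comes from writing RH_train$... inside the formula while also passing data = RH_train. Bare column names are the usual form, though this is an assumption and has not been tested on this dataset.

```r
# Hypothetical scores resembling the Rendimiento column
rendimiento <- c(0.6, 0.8, 0.85, 0.86, 1, 0.8)
clase <- factor(rendimiento)
levels(clase)

# make.names() prefixes an "X" so each level is a syntactically valid name
levels(clase) <- make.names(levels(clase))
levels(clase)

# Formula with bare column names (a subset of the predictors),
# meant to be used together with data = RH_train
f <- Rendimiento ~ Edad + Sexo + Distancia + Estrato + EstadoCivil + Hijos
all.vars(f)
```

Because Rendimiento is numeric, leaving it numeric and dropping the class-probability option (a regression setup) may also be worth considering.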

R - Aggregate function creating sub-lists

I'm using the aggregate function to summarise some data. The data is loans data, I have the ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of Loans and Average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I look at the results of the aggregate function, they look OK:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at the structure of it, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you
Just do do.call(data.frame, ...) on aggr.data to unnest the matrices.
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
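For a smaller illustration of what is going on: when FUN returns more than one value, aggregate() stores the results as a matrix inside a single column. The sketch below uses a cut-down version of the question's data with only numeric values (agg and flat are illustrative names) and shows the column count before and after flattening.

```r
dat <- data.frame(
  StartDate  = c("2016-11-01", "2016-11-01", "2016-12-01", "2016-12-01"),
  LoanAmount = c(200, 500, 600, 150)
)

# FUN returns two named values, so LoanAmount becomes a matrix column
agg <- aggregate(
  LoanAmount ~ StartDate,
  data = dat,
  FUN  = function(x) c(mean = mean(x), n = length(x))
)
ncol(agg)                  # StartDate plus one matrix column
is.matrix(agg$LoanAmount)

# do.call(data.frame, ...) expands the matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
flat$LoanAmount.mean
```

Naming the elements returned by FUN (here mean and n) gives readable column names instead of the V2 suffix seen above.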

Between/within standard deviations

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
In Stata, this can easily be done with the command
xtsum, i(momid)
I searched, but I cannot find any R package which can do that.
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each variable in the dataset can vary either both between and within mothers, or only between mothers. son_birthweigth varies among mothers, but also within the same mother. In contrast, hispanic is fixed for the same mother.
For example, the within-mother variance of son_birthweigth is:
# per-mother means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
I don't know what your Stata command should reproduce, but to answer the second part of the question, about the hierarchical structure: it is easy to do this with a list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view the structure with str:
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think the reshape2 package is what you want. I will demonstrate it here.
The idea is that, in order to do the multilevel analysis, we need to reshape the data.
First, divide the variables into two groups: identifier variables and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ...
Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape that you want:
# per-mother means for all variables
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations for each variable
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
I know this question is four years old, but I recently wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Here df is the data frame, columns is a numeric vector of the column indices to summarize, and individuals is the column that identifies the individuals.
xtsumR<-function(df,columns,individuals){
df<-dplyr::arrange_(df,individuals)
panel<-tibble::tibble()
for (i in columns){
v<-df %>% dplyr::group_by_() %>%
dplyr::summarize_(
mean=mean(df[[i]]),
sd=sd(df[[i]]),
min=min(df[[i]]),
max=max(df[[i]])
)
v<-tibble::add_column(v,variacao="overall",.before=-1)
v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
sdB<-sd(v2)
varW<-df[[i]]-ave(df[[i]],df[[individuals]]) # subtract each individual's mean (also works for unbalanced panels)
varW<-varW+mean(df[[i]])
sdW<-sd(varW)
minB<-min(v2)
maxB<-max(v2)
minW<-min(varW)
maxW<-max(varW)
v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
panel<-rbind(panel,v)
}
var<-rep(names(df)[columns])
n1<-rep(NA,length(columns))
n2<-rep(NA,length(columns))
var<-c(rbind(var,n1,n1))
panel$var<-var
panel<-panel[c(6,1:5)]
names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
panel[3:6]<-as.numeric(unlist(panel[3:6]))
panel[3:6]<-round(unlist(panel[3:6]),2)
return(panel)
}
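For a single variable, the same quantities can be sketched in a few lines of base R. This is a minimal illustration, not a replacement for xtsum: it follows the convention of the function above (between = SD of the group means, within = SD of the deviations from each group mean, re-centred on the overall mean), and within_between_sd is a hypothetical helper name. The data are the seven rows from the question.

```r
dat <- data.frame(
  son_id          = c(1, 2, 3, 1, 2, 1, 2),
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

within_between_sd <- function(x, group) {
  group_mean_by_row <- ave(x, group)            # each row gets its group's mean
  group_means       <- tapply(x, group, mean)   # one mean per group
  c(
    overall = sd(x),
    between = sd(group_means),
    within  = sd(x - group_mean_by_row + mean(x))
  )
}

res <- within_between_sd(dat$son_birthweigth, dat$mom_id)
res
```

Because ave() aligns group means row by row, the groups do not need to have the same number of observations.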
