SVN for classification with NA column on test data [duplicate]

SVN for classification with NA column on test data [duplicate] - r

This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Error in `contrasts' Error
(1 answer)
SVM predict on dataframe with different factor levels
(1 answer)
Closed 4 years ago.
Its the follow I have two files one with the training data other with the test data without the class I want to predict and I am trying to execute the follow code:
full <- bind_rows(train, test)
full$Survived <- factor(full$Survived)
train <- full[1:n,]
test <- full[n+1:total,]
model.svm <- svm(Survived~.,train)
predictions <- predict(model.svm,test)
But when I tried to predict give me the follow error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :contrasts can be applied only to factors with 2 or more levels
As I understand is because the class column is all NA's but don't know what to do about this, I already try to fill it with dummy values only to get the predictions and I get this:
Error in newdata[, object$scaled, drop = FALSE] : (subscript) logical subscript too long
Someone can tell me what I am doing wrong and how to correct?
Edit:
Note I am doing binary classification if it help.
Thanks in the advance.
EDIT:
The dataset is the titanic survived people, that I am using for learn how to use some models( I am begging to learn this kind of things).
str(full):
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
dput is to long to put it here.
I think I have some NA's in age is it a problem?

Related

Error in MEEM(object, conLin, control$niterEM) in lme function

I'm trying to apply the lme function to my data, but the model gives follow message:
mod.1 = lme(lon ~ sex + month2 + bat + sex*month2, random=~1|id, method="ML", data = AA_patch_GLM, na.action=na.exclude)
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
dput for data, copy from https://pastebin.com/tv3NvChR (too large to include here)
str(AA_patch_GLM)
'data.frame': 2005 obs. of 12 variables:
$ lon : num -25.3 -25.4 -25.4 -25.4 -25.4 ...
$ lat : num -51.9 -51.9 -52 -52 -52 ...
$ id : Factor w/ 12 levels "24641.05","24642.03",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ bat : int -3442 -3364 -3462 -3216 -3216 -2643 -2812 -2307 -2131 -2131 ...
$ year : chr "2005" "2005" "2005" "2005" ...
$ month : chr "12" "12" "12" "12" ...
$ patch_id: Factor w/ 45 levels "111870.17_1",..: 34 34 34 34 34 34 34 34 34 34 ...
$ YMD : Date, format: "2005-12-30" "2005-12-31" "2005-12-31" ...
$ month2 : Ord.factor w/ 7 levels "January"<"February"<..: 7 7 7 7 7 1 1 1 1 1 ...
$ lonsc : num [1:2005, 1] -0.209 -0.213 -0.215 -0.219 -0.222 ...
$ batsc : num [1:2005, 1] 0.131 0.179 0.118 0.271 0.271 ...
What's the problem?
I saw a solution applying the lme4::lmer function, but there is another option to continue to use lme function?

The problem is that you have collinear combinations of predictors. In particular, here are some diagnostics:
## construct the fixed-effect model matrix for your problem
X <- model.matrix(~ sex + month2 + bat + sex*month2, data = AA_patch_GLM)
lc <- caret::findLinearCombos(X)
colnames(X)[lc$linearCombos[[1]]]
## [1] "sexM:month2^6" "(Intercept)" "sexM" "month2.L"
## [5] "month2.C" "month2^4" "month2^5" "month2^6"
## [9] "sexM:month2.L" "sexM:month2.C" "sexM:month2^4" "sexM:month2^5"
This is in a weird order, but it suggests that the sex × month interaction is causing problems. Indeed:
with(AA_patch_GLM, table(sex, month2))
## sex January February March April May June December
## F 367 276 317 204 43 0 6
## M 131 93 90 120 124 75 159
shows that you're missing data for one sex/month combination (i.e., no females were sampled in June).
You can:
construct the sex/month interaction yourself (data$SM <- with(data, interaction(sex, month2, drop = TRUE))) and use ~ SM + bat — but then you'll have to sort out main effects and interactions yourself (ugh)
construct the model matrix by hand (as above), drop the redundant column(s), then include all the resulting columns in the model:
d2 <- with(AA_patch_GLM,
data.frame(lon,
as.data.frame(X),
id))
## drop linearly dependent column
## note data.frame() has "sanitized" variable names (:, ^ both converted to .)
d2 <- d2[names(d2) != "sexM.month2.6"]
lme(reformulate(colnames(d2)[2:15], response = "lon"),
random=~1|id, method="ML", data = d2)
Again, the results will be uglier than the simpler version of the model.
use a patched version of nlme (I submitted a patch here but it hasn't been considered)
remotes::install_github("bbolker/nlme")

summary_factorlist, Having Error due variables with less than two levels

*I have a large data set including 2000 variables, including factors and continuous variables.
For example:
library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
I use the following function to compare the mean of each continuous variable among the level of the categorical dependent variable (ANOVA) or the percentage of each categorical variable among the level of the categorical dependent variable (CHI-SQUARE)
summary_factorlist(colon_s, dependent ="perfor.factor", explanatory =explanatory , add_dependent_label=T, p=T,p_cat="fisher", p_cont_para = "aov", fit_id
= T)
But as soon as running the above code, I got the following error:
Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels
*In the data set, there are some variables which do not include at least two levels or just one of their levels has a non-zero frequency. I was wondering if there is any loop function to remove the variable if one of these conditions satisfies.
If the variable includes just one level
If the variable includes more than one level but the frequency of just one level is no-zero.
if all values of the variable are missing*

Update (partial answer):
With this code we can remove factors with only one level and keep other non factor variables:
x <- colon_s[, (sapply(colon_s, nlevels)>1) | (sapply(colon_s, is.factor)==FALSE)]

The OP's code does work with the data provided
library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent ="perfor.factor",
explanatory =explanatory ,
add_dependent_label=TRUE, p=TRUE,p_cat="fisher", p_cont_para = "aov", fit_id = TRUE)
Dependent: Perforation No Yes p fit_id index
Age (years) Mean (SD) 59.8 (11.9) 58.4 (13.3) 0.542 age 1
Age <40 years 68 (7.5) 2 (7.4) 1.000 age.factor<40 years 2
40-59 years 334 (37.0) 10 (37.0) age.factor40-59 years 3
60+ years 500 (55.4) 15 (55.6) age.factor60+ years 4
Sex Female 432 (47.9) 13 (48.1) 1.000 sex.factorFemale 5
Male 470 (52.1) 14 (51.9) sex.factorMale 6
Obstruction No 715 (81.2) 17 (63.0) 0.026 obstruct.factorNo 7
Yes 166 (18.8) 10 (37.0) obstruct.factorYes 8
The strcture of data shows the factor variables to have more than 1 level
> str(colon_s[c(explanatory, dependent)])
'data.frame': 929 obs. of 5 variables:
$ age : num 43 63 71 66 69 57 77 54 46 68 ...
..- attr(*, "label")= chr "Age (years)"
$ age.factor : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
..- attr(*, "label")= chr "Age"
$ sex.factor : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
..- attr(*, "label")= chr "Sex"
$ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Obstruction"
$ perfor.factor : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Perforation"
Regarding selection of factor variables with the condition mentioned, we could use
library(dplyr)
colon_s_sub <- colon_s %>%
select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &
sum(complete.cases(.x)) > 0))

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy and I've been looking through similar posts all day but can't seem to solve my problem. I have a naive bayes model trained and stored as model. I'm attempting to predict with a newdata data frame but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata) where newdata is the first row of another data frame: new data <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".

The issue is with the data used. it should match the levels used in the training. E.g. if we use one of the rows from trainingData to predict, it does work
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
By checking the str of both datasets, some of the factor columns in the training is character class
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
An option is to make levels same for factor class
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider

R: Understanding variable selection in MICE for imputing data

EDIT: This is not a course with a live instructor and I cannot ask him directly in any way. If it were, I wouldn't be wasting your time here.
I am taking an R class that is dealing with the basics of machine learning. We are working with the Vanderbilt Titanic dataset available HERE. The goal is the use the R mice package to imput missing age values. I've already split my data into train and test samples, and str(training) outputs:
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The instructor then goes on to write:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output
I understand the factor_vars piece - these variables are labelled as factors in the structure output. What I don't understand is how the impute_variables were chosen or what they mean exactly. Are they arbitrarily chosen, perhaps on the basis that the instructor believed things like 'pclass' (which is the indicator for steerage, coach, or first class) may help predict age (with older people being able to afford first class perhaps) while things like 'cabin' would have no relevance?
Furthermore, in the line mice_model <- mice(training[,impute_variables], method='rf'), which portion of the function is declaring that we want to be imputing the age of the passengers?

Between/within standard deviations

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
This is something that with the following data in Stata can be easily done through the command
xtsum, i(momid)
I made a research, but I cannot find any R package which can do that..
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each dataset variable can vary either between and within mothers or only between mothers. birtweigth varies among mothers, but also within the same mother. Instead, hispanic is fixed for the same mother.
For example, the within-mother variance of son_birthweigth is:
# mom1 means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# within variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)

I don't know what your stata command should reproduce, but to answer the second part of question about
hierarchical structure , it is easy to do this with list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this lapply is convinent to work with list
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view he structure with str
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think that package reshape2 is what you want. I will demonstrate this here.
The idea here is in order to do the multilevel analysis we need to reshape the data.
First to divide the variables into two groups :identifier and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once your have data in "moten" form , you can "cast" to rearrange it in the shape that you want:
# mom1 means for all variable
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother variance for birthweigth
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143

I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Where: df is the dataframe, columns is a numerical vector to subset the dataframe and individuals is the column with the individuals.
xtsumR<-function(df,columns,individuals){
df<-dplyr::arrange_(df,individuals)
panel<-tibble::tibble()
for (i in columns){
v<-df %>% dplyr::group_by_() %>%
dplyr::summarize_(
mean=mean(df[[i]]),
sd=sd(df[[i]]),
min=min(df[[i]]),
max=max(df[[i]])
)
v<-tibble::add_column(v,variacao="overal",.before=-1)
v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
sdB<-sd(v2)
varW<-df[[i]]-rep(v2,each=12) #
varW<-varW+mean(df[[i]])
sdW<-sd(varW)
minB<-min(v2)
maxB<-max(v2)
minW<-min(varW)
maxW<-max(varW)
v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
panel<-rbind(panel,v)
}
var<-rep(names(df)[columns])
n1<-rep(NA,length(columns))
n2<-rep(NA,length(columns))
var<-c(rbind(var,n1,n1))
panel$var<-var
panel<-panel[c(6,1:5)]
names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
panel[3:6]<-as.numeric(unlist(panel[3:6]))
panel[3:6]<-round(unlist(panel[3:6]),2)
return(panel)
}