R: Understanding variable selection in MICE for imputing data

EDIT: This is not a course with a live instructor and I cannot ask him directly in any way. If it were, I wouldn't be wasting your time here.
I am taking an R class that deals with the basics of machine learning. We are working with the Vanderbilt Titanic dataset available HERE. The goal is to use the R mice package to impute missing age values. I've already split my data into train and test samples, and str(training) outputs:
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The instructor then goes on to write:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output
I understand the factor_vars piece - these variables are labelled as factors in the structure output. What I don't understand is how the impute_variables were chosen or what they mean exactly. Are they arbitrarily chosen, perhaps on the basis that the instructor believed things like 'pclass' (which is the indicator for steerage, coach, or first class) may help predict age (with older people being able to afford first class perhaps) while things like 'cabin' would have no relevance?
Furthermore, in the line mice_model <- mice(training[,impute_variables], method='rf'), which portion of the function is declaring that we want to be imputing the age of the passengers?
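On the second point, a minimal sketch may help. mice() takes no argument naming a target column: by default it builds an imputation model for every column of the supplied data frame that contains NA, using the other columns as predictors. The toy data below is made up (it is not the Titanic data), and the mice call is guarded so it only runs if the package is installed; maxit = 0 is used as a dry run that sets up the models without iterating.

```r
# Toy stand-in for training[, impute_variables]; the values are made up.
toy <- data.frame(
  pclass = factor(c(1, 1, 2, 3, 3, 2, 1, 3)),
  sex    = factor(c("f", "m", "f", "m", "m", "f", "m", "f")),
  age    = c(29, NA, 2, 25, NA, 63, 71, NA),
  fare   = c(211.3, 151.6, 151.6, 26.6, 7.9, 13.0, 49.5, 8.1)
)

# Count missing values per column: only columns with NA need imputing.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Columns that would be imputed (here only "age").
to_impute <- names(na_counts)[na_counts > 0]
print(to_impute)

# Optional: inspect the setup mice derives from the data, if it is installed.
if (requireNamespace("mice", quietly = TRUE)) {
  setup <- mice::mice(toy, method = "rf", maxit = 0, printFlag = FALSE)
  print(setup$method)           # imputation method assigned to each column
  print(setup$predictorMatrix)  # rows: column being imputed; columns: predictors
}
```

So the choice of impute_variables decides both which columns can be imputed and which columns are available as predictors; age is imputed simply because it is the column with missing values.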

Related

summary_factorlist: error due to variables with fewer than two levels

I have a large data set with 2000 variables, including both factors and continuous variables.
For example:
library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
I use the following function to compare the mean of each continuous variable across the levels of the categorical dependent variable (ANOVA), or the percentage of each categorical variable across those levels (CHI-SQUARE):
summary_factorlist(colon_s, dependent = "perfor.factor", explanatory = explanatory, add_dependent_label = TRUE, p = TRUE, p_cat = "fisher", p_cont_para = "aov", fit_id = TRUE)
But as soon as I run the above code on my own data, I get the following error:
Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels
In my data set, some variables do not have at least two levels, or only one of their levels has a non-zero frequency. I was wondering whether there is a loop or function to remove a variable if any of these conditions is satisfied:
If the variable has just one level
If the variable has more than one level but only one level has a non-zero frequency
If all values of the variable are missing
Update (partial answer):
With this code we can remove factors with only one level and keep other non factor variables:
x <- colon_s[, (sapply(colon_s, nlevels)>1) | (sapply(colon_s, is.factor)==FALSE)]
The OP's code does work with the data provided
library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent ="perfor.factor",
explanatory =explanatory ,
add_dependent_label=TRUE, p=TRUE,p_cat="fisher", p_cont_para = "aov", fit_id = TRUE)
Dependent: Perforation                       No         Yes     p fit_id                index
Age (years)            Mean (SD)    59.8 (11.9) 58.4 (13.3) 0.542 age                       1
Age                    <40 years       68 (7.5)     2 (7.4) 1.000 age.factor<40 years       2
                       40-59 years   334 (37.0)   10 (37.0)       age.factor40-59 years     3
                       60+ years     500 (55.4)   15 (55.6)       age.factor60+ years       4
Sex                    Female        432 (47.9)   13 (48.1) 1.000 sex.factorFemale          5
                       Male          470 (52.1)   14 (51.9)       sex.factorMale            6
Obstruction            No            715 (81.2)   17 (63.0) 0.026 obstruct.factorNo         7
                       Yes           166 (18.8)   10 (37.0)       obstruct.factorYes        8
The structure of the data shows that the factor variables have more than one level
> str(colon_s[c(explanatory, dependent)])
'data.frame': 929 obs. of 5 variables:
$ age : num 43 63 71 66 69 57 77 54 46 68 ...
..- attr(*, "label")= chr "Age (years)"
$ age.factor : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
..- attr(*, "label")= chr "Age"
$ sex.factor : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
..- attr(*, "label")= chr "Sex"
$ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Obstruction"
$ perfor.factor : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Perforation"
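Regarding selection of factor variables that meet the conditions mentioned, we could use
library(dplyr)
# keeps only factor columns with more than one level, no empty levels,
# and at least one non-missing value
colon_s_sub <- colon_s %>%
select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &&
sum(complete.cases(.x)) > 0))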
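The three removal conditions from the question can also be written as a single predicate in base R. The sketch below is one possible reading of them: keep_column is a hypothetical helper, the data frame d is made up, non-factor columns are kept unless they are entirely missing, and a factor is kept only if at least two of its levels are actually observed.

```r
# Hypothetical helper: TRUE if a column is worth keeping.
keep_column <- function(x) {
  if (all(is.na(x))) return(FALSE)   # condition 3: every value is missing
  if (!is.factor(x)) return(TRUE)    # non-factor columns are left alone
  sum(table(x) > 0) >= 2             # conditions 1 and 2: need 2+ observed levels
}

# Made-up example data covering each case.
d <- data.frame(
  ok        = factor(c("a", "b", "a", "b")),
  one_level = factor(c("a", "a", "a", "a")),
  empty_lvl = factor(c("a", "a", "a", "a"), levels = c("a", "b")),
  all_na    = factor(c(NA, NA, NA, NA), levels = c("a", "b")),
  num       = c(1.5, 2.0, 3.5, 4.0)
)

kept <- d[, vapply(d, keep_column, logical(1)), drop = FALSE]
print(names(kept))
```

Because table() ignores NA by default, a factor whose only non-missing values fall in one level is dropped as well.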

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy; I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata), where newdata is the first row of another data frame: newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used: it should match the levels used in training. E.g., if we use one of the rows from trainingData to predict, it works:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Checking the str of both datasets shows that some columns that are factors in the training data are character or integer in newdata
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to give the factor columns in newdata the same levels as in the training data
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
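The same level-matching idea can be tried on a small self-contained example. The data below is made up (train stands in for model$trainingData and new_row for newdata); the sketch only shows the conversion step, not the caret prediction itself.

```r
# Made-up training data and a single new row with mismatched column types.
train <- data.frame(
  stand   = factor(c("L", "R", "R", "L")),
  pitcher = factor(c("112526", "605200", "605200", "115629")),
  balls   = c(0, 1, 2, 3)
)
new_row <- data.frame(stand = "R", pitcher = 605200L, balls = 3L,
                      stringsAsFactors = FALSE)

# Columns that are factors in the training data and also present in new_row.
shared      <- intersect(names(train), names(new_row))
factor_cols <- shared[vapply(train[shared], is.factor, logical(1))]

# Rebuild each of those columns as a factor carrying the training levels.
# The value is converted to character explicitly so the integer id is
# compared against the level labels as text.
for (col in factor_cols) {
  new_row[[col]] <- factor(as.character(new_row[[col]]),
                           levels = levels(train[[col]]))
}

str(new_row)
# Values never seen in training would become NA, so it is worth checking.
print(any(is.na(new_row[factor_cols])))
```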

SVM for classification with NA column on test data [duplicate]

This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Error in `contrasts' Error
(1 answer)
SVM predict on dataframe with different factor levels
(1 answer)
Closed 4 years ago.
I have two files, one with the training data and one with the test data, which lacks the class I want to predict. I am trying to run the following code:
full <- bind_rows(train, test)
full$Survived <- factor(full$Survived)
train <- full[1:n,]
test <- full[n+1:total,]
model.svm <- svm(Survived~.,train)
predictions <- predict(model.svm,test)
But when I try to predict, I get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :contrasts can be applied only to factors with 2 or more levels
As I understand it, this is because the class column is all NAs, but I don't know what to do about it. I already tried filling it with dummy values just to get the predictions, and I got this:
Error in newdata[, object$scaled, drop = FALSE] : (subscript) logical subscript too long
Can someone tell me what I am doing wrong and how to correct it?
Edit:
Note that I am doing binary classification, if that helps.
Thanks in advance.
EDIT:
The dataset is the Titanic survival data, which I am using to learn how to use some models (I am just beginning to learn this kind of thing).
str(full):
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
The dput output is too long to post here.
I think I have some NAs in Age; is that a problem?
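One detail in the code above that is worth checking independently of the NA question is the row index in test <- full[n+1:total,]. In R the ':' operator binds more tightly than '+', so n+1:total means n + (1:total) rather than (n+1):total. Whether this is what triggers the errors here cannot be told from the post; the sketch below, on made-up values for n, total and full, only illustrates the difference.

```r
n     <- 3   # hypothetical number of training rows
total <- 5   # hypothetical number of rows in the combined data
full  <- data.frame(id = 1:5)

# ':' is evaluated before '+', so this is n + (1:total), i.e. 4, 5, 6, 7, 8.
wrong_idx <- n + 1:total
# Parentheses give the intended range 4, 5.
right_idx <- (n + 1):total

wrong_test <- full[wrong_idx, , drop = FALSE]  # rows past the end come back as NA
right_test <- full[right_idx, , drop = FALSE]

print(nrow(wrong_test))
print(sum(is.na(wrong_test$id)))
print(right_test$id)
```

With the unparenthesised index, the test set silently gains rows made up entirely of NA, which can cause confusing errors further down.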

Calculate new column with growth rate based on two factors without splintering dataframe

Hej hej,
I would like to calculate growth rates, storing them in a new column of my data frame e.g. named growth.per.day. I am - as always - looking for a way that doesn't include hundreds and hundreds of lines of manually edited code.
I have six levels of algae and 25 levels of nutrients.
This means I have 150 "subgroups" for which I want to calculate the rates. Those subsets differ in length depending on the individual algae.
So, basically:
Algae A ->
Nutrient (1) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... ;
Nutrient (2) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... etc.
I already split the data frame by algae
X <- split(data, data$ALGAE)
names(X) <- c("ANKI", "CHLAMY", "MIX_A", "MIX_B", "SCENE", "STAURA")
list2env(X, envir = .GlobalEnv)
and I have also split those again, creating the aforementioned lovely 150 subsets. Then I applied
ratio1$growth.per.day <- c(NA,ratio1[2:nrow(ratio1), 16] - ratio1[1:(nrow(ratio1)-1), 16])
which is perfect and does what I want, BUT I would really very much appreciate a shorter, more elegant way without butchering my data frame.
'data.frame': 3550 obs. of 16 variables:
$ SAMPLE.ID : Factor w/ 150 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ COMMUNITY : chr "com.1" "com.1" "com.1" "com.1" ...
$ NUTRIENT : Factor w/ 25 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RATIO : Factor w/ 23 levels "3.2","4","5.4",..: 11 9 6 4 1 14 10 8 5 2 ...
$ PHOS : Factor w/ 5 levels "0.09","0.195",..: 5 5 5 5 5 4 4 4 4 4 ...
$ NIT : Factor w/ 5 levels "1.5482","3.0964",..: 5 4 3 2 1 5 4 3 2 1 ...
$ DATUM : Factor w/ 35 levels "30.08.16","31.08.16",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 0 0 0 0 0 0 0 0 0 0 ...
$ TYPE : chr "mono" "mono" "mono" "mono" ...
$ ALGAE : Factor w/ 6 levels "ANK","CHLA","MIX A",..: 5 5 5 5 5 5 5 5 5 5 ...
$ MEAN : num 864 868 882 873 872 ...
$ GROW : num 0.00116 0.00115 0.00113 0.00115 0.00115 ...
$ FLUORO : num NA NA NA NA NA NA NA NA NA NA ...
$ MEAN.MQ : num 0.964 0.969 0.985 0.975 0.973 ...
$ GROW.MQ : num 1.04 1.03 1.02 1.03 1.03 ...
$ C.mikro.gr.L: num -764 -913 -1394 -1085 -1039 ...
I hope this sufficiently describes the problem,
Thanks so much!
Hope it is what you asked for:
df = data.frame(algae = sort(rep(LETTERS[1:6], 20)),
nutrient = rep(letters[22:26], 24),
day = rep(c(rep(1, 5),
rep(2, 5),
rep(3, 5),
rep(4, 5)), 6),
growth = runif(120, 30, 60))
library(dplyr)
df = df %>% group_by(algae, nutrient) %>% mutate(rate = c(NA, diff(growth, lag = 1)))
And here is the table for alga A and nutrient v:
algae nutrient day growth rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 A v 1 48.68547 NA
2 A v 2 55.63570 6.950232
3 A v 3 53.28569 -2.350013
4 A v 4 44.83022 -8.455465
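If adding dplyr is not wanted, the same grouped difference can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. The data below is made up and much smaller than the real data set; the column names follow the example above, not the original data frame.

```r
set.seed(1)  # only so the made-up numbers are reproducible
df <- data.frame(
  algae    = rep(c("A", "B"), each = 6),
  nutrient = rep(rep(c("v", "w"), each = 3), times = 2),
  day      = rep(1:3, times = 4),
  growth   = round(runif(12, 30, 60), 1)
)

# Make sure rows are in day order within each group before differencing.
df <- df[order(df$algae, df$nutrient, df$day), ]

# Day-to-day difference within each algae x nutrient group;
# the first day of every group has no predecessor and gets NA.
df$rate <- ave(df$growth, df$algae, df$nutrient,
               FUN = function(v) c(NA, diff(v)))
print(df)
```

Groups of different lengths are handled automatically, since the function is applied to each group separately.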

Between/within standard deviations

When working with a hierarchical/multilevel/panel dataset, it can be very useful to have a package that returns the within- and between-group standard deviations of the available variables.
In Stata, with the following data, this is easily done with the command
xtsum, i(momid)
I did some research, but I cannot find any R package that does this.
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each variable in the dataset can vary either both between and within mothers, or only between mothers. son_birthweigth varies among mothers but also within the same mother, whereas hispanic is fixed for a given mother.
For example, the within-mother variance of son_birthweigth is:
# per-mother means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
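The hand calculation above can be generalised with ave() and tapply() in base R. The sketch below follows the formulas exactly as written in the question (divisors N - 1 and number of mothers - 1); it has not been checked against Stata's xtsum output, which may define the between component slightly differently for unbalanced groups.

```r
dat <- data.frame(
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

x  <- dat$son_birthweigth
id <- dat$mom_id

# Within: squared deviations from each mother's own mean, divided by N - 1.
mom_mean_per_row <- ave(x, id)   # each mother's mean, repeated for her rows
within_var <- sum((x - mom_mean_per_row)^2) / (length(x) - 1)

# Between: squared deviations of the mother means from the overall mean,
# divided by (number of mothers - 1).
mom_means   <- tapply(x, id, mean)
between_var <- sum((mom_means - mean(x))^2) / (length(mom_means) - 1)

print(c(within_var = within_var, between_var = between_var))
print(sqrt(c(within_sd = within_var, between_sd = between_var)))
```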
I don't know what your Stata command is supposed to produce, but to answer the second part of the question, about the hierarchical structure: this is easy to do with a list.
For example, you can define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view the structure with str
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think the reshape2 package is what you want. I will demonstrate it here.
The idea is that, in order to do the multilevel analysis, we need to reshape the data.
First, divide the variables into two groups: identifier and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape that you want:
# per-mother means for all variables
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations for each variable
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Here df is the data frame, columns is a numeric vector of column indices to summarise, and individuals is the name of the column identifying the individuals.
xtsumR <- function(df, columns, individuals) {
  id <- df[[individuals]]
  panel <- lapply(columns, function(i) {
    x <- df[[i]]
    # between: one mean per individual
    group_means <- as.numeric(tapply(x, id, mean))
    # within: deviation from the individual's own mean, recentred on the overall mean
    # (ave() handles unbalanced panels, so no fixed group size is assumed)
    within_vals <- x - ave(x, id) + mean(x)
    tibble::tibble(
      variable = c(names(df)[i], NA, NA),
      variation = c("overall", "between", "within"),
      mean = c(mean(x), NA, NA),
      standard.deviation = c(sd(x), sd(group_means), sd(within_vals)),
      min = c(min(x), min(group_means), min(within_vals)),
      max = c(max(x), max(group_means), max(within_vals))
    )
  })
  # stack the three rows of every variable into one table
  panel <- dplyr::bind_rows(panel)
  panel[3:6] <- lapply(panel[3:6], round, 2)
  return(panel)
}
