Multiple comparisons for GLMM dataset (proportion/binomial response) - lsmeans? - r

I have a GLMM that runs fine and produces results that make biological sense. I want to do multiple comparisons among the levels of the predictor variable I'm interested in (a factor with 6 levels, labeled Body in the diagram). This factor and its interaction with Class were significant in the GLMM (as expected).
I tried lsmeans with this code:
lsmc <- lsmeans(modelc, ~ Class*Body)
plot(lsmc, by = "Class", intervals = TRUE, type = "response")
cld(lsmc)
The result is a confusing mishmash of grouping codes:
> cld(lsmc)
Class Body lsmean SE df asymp.LCL asymp.UCL .group
a 6 -4.134310 0.2707025 NA -4.664878 -3.603743 123
a 3 -3.970351 0.2728055 NA -4.505040 -3.435662 123
a 4 -3.928422 0.2704543 NA -4.458502 -3.398341 123
a 5 -3.882009 0.2692264 NA -4.409683 -3.354335 123456
b 6 -3.736560 0.4111311 NA -4.542362 -2.930758 1 4 7
a 1 -3.526359 0.2772493 NA -4.069757 -2.982960 456789
a 2 -3.343117 0.2711772 NA -3.874614 -2.811619 789
b 5 -3.200230 0.4107996 NA -4.005383 -2.395078 2 5 8
b 1 -2.879111 0.4122133 NA -3.687034 -2.071187 23 56 89
b 2 -2.840026 0.4110968 NA -3.645761 -2.034291 3 6 9
b 3 -2.818114 0.4102995 NA -3.622287 -2.013942 3 6 9
b 4 -2.649563 0.4096440 NA -3.452450 -1.846675 3 6 9
As far as I am aware, non-contiguous grouping codes, as seen in all of Class b, aren't a good sign.
Is there another way to go about multiple and/or pairwise comparisons with output from GLMMs?


R ranger treeInfo final nodes have the same class

When I use ranger for a classification model and treeInfo() to extract a tree, I see that sometimes a split results in two identical terminal nodes.
Is this expected behaviour? Why does it make sense to introduce a split where the final nodes are the same?
From this question, I gather that the prediction column could be the majority class (although that question concerns Python and a different random forest implementation). The ranger ?treeInfo documentation says it should be the predicted class.
MWE
library(ranger)
data <- iris
data$is_versicolor <- factor(data$Species == "versicolor")
data$Species <- NULL
rf <- ranger(is_versicolor ~ ., data = data,
             num.trees = 1,  # no need for many trees in this example
             max.depth = 3,  # keep depth at an understandable level
             seed = 1351, replace = FALSE)
treeInfo(rf, 1)
#> nodeID leftChild rightChild splitvarID splitvarName splitval terminal prediction
#> 1 0 1 2 2 Petal.Length 2.60 FALSE <NA>
#> 2 1 NA NA NA <NA> NA TRUE FALSE
#> 3 2 3 4 3 Petal.Width 1.75 FALSE <NA>
#> 4 3 5 6 2 Petal.Length 4.95 FALSE <NA>
#> 5 4 7 8 0 Sepal.Length 5.95 FALSE <NA>
#> 6 5 NA NA NA <NA> NA TRUE TRUE
#> 7 6 NA NA NA <NA> NA TRUE TRUE
#> 8 7 NA NA NA <NA> NA TRUE FALSE
#> 9 8 NA NA NA <NA> NA TRUE FALSE
In this example, the last four rows are two pairs of sibling terminal nodes with identical predictions: nodeID 5 and 6 both predict TRUE, and nodeID 7 and 8 both predict FALSE.
Graphically this would look like this
I think I found a (partial) answer to the issue, namely the mtry and min.node.size arguments and their functionality.
Because the random forest considers only mtry randomly chosen variables at each split, the candidates for the final split may not include a variable that separates the classes well. The best available split (by Gini impurity, or whichever metric was chosen) is still made, but the same class can remain the majority in both resulting final nodes.
Playing around with mtry and min.node.size can change this. But we still might get splits with the same results.
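As a rough illustration of the point above, here is a small worked example in base R. It does not reproduce ranger's internals, and the class counts are made up (they are not taken from the iris tree). It only shows that a split can reduce Gini impurity even though both children end up with the same majority class:

```r
# Gini impurity of a node, given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical class counts, chosen for illustration only
parent <- c("FALSE" = 70, "TRUE" = 30)
left   <- c("FALSE" = 40, "TRUE" = 5)
right  <- c("FALSE" = 30, "TRUE" = 25)

# Size-weighted impurity of the two children
weighted_children <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(parent)
decrease <- gini(parent) - weighted_children

decrease > 0             # the split reduces impurity
names(which.max(left))   # majority class in the left child
names(which.max(right))  # majority class in the right child
```

Both children have a FALSE majority, yet the split is still an improvement by the Gini criterion, so a tree grown this way would have no reason to reject it.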

R: Error in contrasts when fitting linear models with `lm`

I've found Error in contrasts when defining a linear model in R and have followed the suggestions there, but none of my factor variables take on only one value and I am still experiencing the same issue.
This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.
This is the code I'm trying to run:
simplelm <- lm(log_SalePrice ~ ., data = train)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
What is the issue?
Thanks for providing your dataset (I hope that link will stay valid so that everyone can access it). I read it into a data frame train.
Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.
info <- debug_contr_error2(log_SalePrice ~ ., train)
## the data frame that is actually used by `lm`
dat <- info$mf
## number of cases in your dataset
nrow(train)
#[1] 1460
## number of complete cases used by `lm`
nrow(dat)
#[1] 1112
## number of levels for all factor variables in `dat`
info$nlevels
# MSZoning Street Alley LotShape LandContour
# 4 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 5
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
As you can see, Utilities is the offending variable here as it has only 1 level.
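If the helper functions are not at hand, the same kind of check can be sketched in base R. This is only an approximation of what they do, assuming (as lm does by default) that incomplete rows are dropped and unused factor levels are discarded. The toy data frame and its column names are made up for illustration:

```r
# Toy data: factor f has two levels overall, but its only "b" sits in an incomplete row
toy <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0),
  x = c(1, 2, NA, 4, 5),
  f = factor(c("a", "a", "b", "a", "a")),
  g = factor(c("u", "v", "u", "v", "u"))
)

# Mimic the rows lm would keep: complete cases only, unused levels dropped
mf <- droplevels(na.omit(toy))

# Number of levels left in each factor column
n_levels <- sapply(Filter(is.factor, mf), nlevels)
n_levels

# Factors that would trigger the contrasts error
offending <- names(n_levels)[n_levels < 2]
offending
```

Here f is reported as offending even though the raw data has two levels for it, which is the same situation as Utilities in your dataset.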
Since you have many character / factor variables in train, I wonder whether you have NA for them. If we add NA as a valid level, we could possibly get more complete cases.
new_train <- NA_preproc(train)
new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)
new_dat <- new_info$mf
nrow(new_dat)
#[1] 1121
new_info$nlevels
# MSZoning Street Alley LotShape LandContour
# 5 2 3 4 4
# Utilities LotConfig LandSlope Neighborhood Condition1
# 1 5 3 25 9
# Condition2 BldgType HouseStyle RoofStyle RoofMatl
# 6 5 8 5 7
# Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
# 14 16 4 4 4
# Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
# 6 5 5 5 7
# BsmtFinType2 Heating HeatingQC CentralAir Electrical
# 7 5 5 2 6
# KitchenQual Functional FireplaceQu GarageType GarageFinish
# 4 6 6 6 3
# GarageQual GarageCond PavedDrive PoolQC Fence
# 5 5 3 4 5
# MiscFeature SaleType SaleCondition MiscVal_bool MoYrSold
# 4 9 6 2 55
We do get more complete cases, but Utilities still has one level. This means that most incomplete cases are actually caused by NA in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.
new_dat$Utilities <- 1
simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
The model now runs successfully. However, it is rank-deficient: some coefficients cannot be estimated and are reported as NA. You probably want to address that, but leaving the model as it is will also work.
b <- coef(simplelm)
length(b)
#[1] 301
sum(is.na(b))
#[1] 9
simplelm$rank
#[1] 292
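To see which terms are affected, you can list the coefficients that came back as NA. The sketch below uses a made-up toy data set with one deliberately redundant predictor, not your housing data, so the names are purely illustrative:

```r
set.seed(1)
toy <- data.frame(x1 = rnorm(20))
toy$x2 <- 2 * toy$x1                  # exactly collinear with x1
toy$y  <- 1 + toy$x1 + rnorm(20)

fit <- lm(y ~ x1 + x2, data = toy)

b <- coef(fit)
aliased <- names(b)[is.na(b)]         # coefficients lm could not estimate
aliased
fit$rank                              # number of estimable coefficients
```

Applied to simplelm, the same names(b)[is.na(b)] call should show which of the 301 coefficients are the 9 aliased ones.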

Issue with NA values when removing rows from data frame in R

This is my data frame:
ID <- c('TZ1','TZ2','TZ3','TZ4')
hr <- c(56,32,38,NA)
cr <- c(1,4,5,2)
data <- data.frame(ID,hr,cr)
ID hr cr
1 TZ1 56 1
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
I want to remove the rows where data$hr = 56. This is what I want the end product to be:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
This is what I thought would work:
data = data[data$hr !=56,]
However the resulting data frame looks like this:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
NA <NA> NA NA
How can I modify my code to handle the NA value so this doesn't happen? Thank you for your help, I can't figure it out.
EDIT: I also want to keep the NA value in the data frame.
The issue is that == and != return NA, rather than TRUE or FALSE, wherever hr is NA, and indexing a data frame with an NA row index produces a row of NAs. So one way to get a logical index with only TRUE/FALSE values is to also use is.na in the comparison.
data[!(data$hr==56 & !is.na(data$hr)),]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
We could also apply the reverse logic
subset(data, hr!=56|is.na(hr))
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
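Another option, not part of the approaches above, is %in%, which never returns NA. A minimal sketch using the data frame from the question:

```r
data <- data.frame(ID = c("TZ1", "TZ2", "TZ3", "TZ4"),
                   hr = c(56, 32, 38, NA),
                   cr = c(1, 4, 5, 2))

NA != 56    # NA: a comparison with NA is neither TRUE nor FALSE
NA %in% 56  # FALSE: %in% always returns TRUE or FALSE

# Keep every row whose hr is not 56, including the row where hr is NA
kept <- data[!(data$hr %in% 56), ]
kept
```

Because the index contains no NA, the row with the missing hr value is kept as it is instead of turning into a row of NAs.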

chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:
present <- c(1:4)
pre.1 <- c(5:8)
pre.2 <- c(9:12)
present2 <- c(13:16)
id <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait <- rnorm(16,10,3)
d <- data.frame(id,prev.id,trait)
print d:
id prev.id trait
1 1 5 10.693266
2 2 6 12.059654
3 3 7 3.594182
4 4 8 14.411477
5 5 9 10.840814
6 6 10 13.712924
7 7 11 11.258689
8 8 12 10.920899
9 9 NA 14.663039
10 10 NA 5.117289
11 11 NA 8.866973
12 12 NA 15.508879
13 13 NA 14.307738
14 14 NA 15.616640
15 15 NA 10.275843
16 16 NA 12.443139
Every observation has a unique value of id. However, some individuals have been observed in the past, so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. Some individuals have been remeasured multiple times. Observations 1-4 have previous IDs of 5-8, and observations 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these individuals were measured. Furthermore, observations 13-16 have never been measured before. So observations 1-4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unique individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individual's trait. The final output would look like:
id <- c(1:4,13:16)
prev.id <- c(5:8, rep(NA,4))
trait <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8 ,rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
id prev.id trait prev.trait.1 prev.trait.2
1 1 5 10.693266 10.84081 14.663039
2 2 6 12.059654 13.71292 5.117289
3 3 7 3.594182 11.25869 8.866973
4 4 8 14.411477 10.92090 15.508879
5 13 NA 14.307738 NA NA
6 14 NA 15.616640 NA NA
7 15 NA 10.275843 NA NA
8 16 NA 12.443139 NA NA
I can accomplish this in a straightforward manner, but it requires me to code an additional pairing for each previous observation, so the number of code chunks I need to write equals the maximum number of times any individual has been recorded. This is a pain, as in the data set I am applying this to, there may be anywhere from 0 to 100 previous observations of an individual.
#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)
#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)
#remove observations that are another individual's previous observation
d <- d[!(d$id %in% d$prev.id),]
How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?
What you have is a forest of linear lists. We'll start at the terminal ends
roots<-d$id[is.na(d$prev.id)]
And determine the paths backwards
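path <- function(node) {
  a <- integer(nrow(d))   # a chain can never be longer than the data
  i <- 0
  while (!is.na(node)) {
    i <- i + 1
    a[i] <- node
    # move to the observation that lists `node` as its previous ID
    node <- d$id[match(node, d$prev.id)]
  }
  # reverse so the most recent observation comes first
  return(rev(a[1:i]))
}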
Then we can get a 'stacked' representation of your desired output with
x <- do.call(rbind, lapply(roots, function(r) {
  p <- path(r)
  data.frame(id = p[[1]], seq = seq_along(p), traits = d$trait[p])
}))
And then use reshape2::dcast to get it in the desired shape
library(reshape2)
dcast(x,id~seq,fill=NA,value.var='traits')
id 1 2 3
1 1 10.693266 10.84081 14.663039
2 2 12.059654 13.71292 5.117289
3 3 3.594182 11.25869 8.866973
4 4 14.411477 10.92090 15.508879
5 13 14.307738 NA NA
6 14 15.616640 NA NA
7 15 10.275843 NA NA
8 16 12.443139 NA NA
I leave it to you to adapt column names.
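For comparison, here is a base R sketch that walks the chains in the other direction (from each current observation back through prev.id) and names the columns along the way. It is an alternative to the approach above, not a drop-in replacement. The function name chain and the seed are made up for illustration, and it assumes the prev.id links contain no cycles:

```r
set.seed(42)
d <- data.frame(id = 1:16,
                prev.id = c(5:12, rep(NA, 8)),
                trait = rnorm(16, 10, 3))

# Follow prev.id links from one id back to its earliest observation
chain <- function(id) {
  out <- integer(0)
  while (!is.na(id)) {
    out <- c(out, id)
    id <- d$prev.id[match(id, d$id)]
  }
  out
}

# Current observations: ids that are nobody's previous observation
heads   <- d$id[!(d$id %in% d$prev.id)]
chains  <- lapply(heads, chain)
max_len <- max(lengths(chains))

# One row per individual; shorter chains are padded with NA
wide <- do.call(rbind, lapply(chains, function(p) {
  d$trait[match(p, d$id)][seq_len(max_len)]
}))
colnames(wide) <- c("trait",
                    if (max_len > 1) paste0("prev.trait.", seq_len(max_len - 1)))

output <- data.frame(id = heads, wide)
output
```

Looking traits up with match(p, d$id) rather than indexing by p directly means the sketch does not depend on id happening to equal the row number.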

Removing certain values from the dataframe in R

I am not sure how to do this. I need to cluster the data frame mydf, but first I want to omit the Inf (infinite) values and the values greater than 50. How can I get a table that contains no Inf and no value greater than 50 (maybe by nullifying those cells)? The clustering itself is not a problem, because I can do it with the mfuzz package. The only problem is that I want to restrict the values being clustered to the range 0-50.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Whatever you do downstream in R should have some way of handling NA, even if it's just na.omit.
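Note that the one-liner above tests every column, including s.no. A slightly more defensive sketch, assuming the measurement columns are exactly A, B and C as in the question, restricts the replacement to those columns and uses is.infinite so that -Inf is caught as well:

```r
mydf <- data.frame(s.no = 1:4,
                   A = c(Inf, 0.43, 34, 3),
                   B = c(Inf, 30, 22, 43),
                   C = c(999.9, 23, 233, 45))

value_cols <- c("A", "B", "C")  # assumed measurement columns; s.no is left alone
vals <- mydf[value_cols]

# Logical matrix marking cells that are infinite or greater than 50
bad <- sapply(vals, function(x) is.infinite(x) | x > 50)
vals[bad] <- NA
mydf[value_cols] <- vals
mydf

# Drop every row that still contains an NA
complete <- na.omit(mydf)
complete
```

Whether to drop incomplete rows with na.omit or keep them depends on what the clustering step accepts, so treat that last step as optional.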
