I want to make a mixed anova with the within-subject-factors mzp and cond besides the between-subject-factors cond_order and video_order.
I have 3 timepoints of a repeated measurement, indicated by mzp.
anova.h1 <- aov_car(ee ~ cond_order + video_order + Error(code/mzp*cond), data=dat_long)
Three things I can't find a solution for:
How do I separate the within-factors in the error term? A lot of the code I found used *, but I fear it might only apply to specific cases. Are there other separating operators?
mzp actually has 3 levels (i.e. times of measurement) for the dependent variable, while cond has only 2 (because no baseline was measured). So I made up a third timepoint for cond by setting its values to NA at baseline. But this now seems to cause issues:
Error: Empty cells in within-subjects design (i.e., bad data structure).
table(data[c("mzp", "cond")])
#     cond
# mzp  s1st t1st s2nd t2nd
#   X0    0    0    0    0
#   X1   44   43    0    0
#   X2    0    0   43   44
I need to examine the relations between all 3 measurement times of the dependent variable and its interactions with the independent variables cond, cond_order and video_order. So is there a way of ignoring the NAs in cond while still including all 3 timepoints of the dependent variable, so that I can examine how it develops over time?
Above all, I need this anova to examine the residuals and test them for normality. I tried the functions I know and ones I found by googling (for a model without the cond variable), but they don't work for this model/this function. I have to examine it graphically. So what works for this anova function?
hist(rstandard(anova.h1))
plot(anova.h1,2)
anova.h1.pr <- proj(anova.h1)
# Error in proj.default(anova.h1) : argument does not contain 'qr' component
res <- anova.h1.pr[["Within"]][ , "Residuals"]
qqnorm(res)
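The proj() route attempted above works on multistratum fits from stats::aov(), not on the object returned by aov_car(). As a rough sketch only, here is that route on simulated data with base R; the data, the effect sizes, and the simplified design (one between factor cond_order, one within factor mzp, no cond) are all assumptions made for illustration, not the original model.

```r
set.seed(1)

# Hypothetical long-format data: 20 subjects, 3 timepoints each
dat <- expand.grid(code = factor(1:20), mzp = factor(c("X0", "X1", "X2")))
subj_effect <- rnorm(20)
order_by_subject <- rep(c("AB", "BA"), each = 10)
dat$cond_order <- factor(order_by_subject[as.integer(dat$code)])
dat$ee <- subj_effect[as.integer(dat$code)] +
  0.5 * as.integer(dat$mzp) + rnorm(nrow(dat))

# Multistratum ANOVA: subjects (code) form the error stratum
fit <- aov(ee ~ cond_order * mzp + Error(code), data = dat)

# Project the data onto each stratum and take the within-subject residuals
pr <- proj(fit)
res <- pr[["Within"]][, "Residuals"]

# Graphical normality checks
hist(res)
qqnorm(res)
qqline(res)
```

With Error(code) the within-subject stratum is named "Within"; other Error() specifications produce other stratum names, so checking names(pr) first is safer than hard-coding one.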
I have a dataset containing genes identified in different reference genomes. The reference genomes are in the rows and the genes are in the columns of the table. The table is coded as binary, where 0 means the gene is absent and 1 means the gene is present. I made gene accumulation curves, which indicate that the number of genes per genome is approaching a plateau. Now, I am trying to plot the rarefaction curves using the R package vegan. I used the following code:
b<-read.csv("data.csv", header = T, check.names = F)
S <- specnumber(b) # observed number of species
(raremax <- min(rowSums(b)))
Srare <- rarefy(b, raremax)
plot(Srare, xlab = "Observed No. of genes", ylab = "Rarefied No. of genes")
abline(0, 1)
rarecurve(b, step = 15, sample = raremax, col = "blue", cex = 0.6)
The data set is like the following:
#         gene1 gene2 gene3
# genome1     0     1     0
# genome2     1     0     1
# genome3     1     0     1
However, this code does not give me a satisfactory output: I only get one straight line along the diagonal. I have attached the output below.
Can someone please suggest how I can correct the output?
Thank you.
The rarefy function rarefies individual rows of your data: it takes a subsample of your occurrences ("individuals") within each row. If all these sampled individuals have value 1, you will have a subsample of ones, and the sum of ones is the sample size: that is what you got. There is no meaningful way of rarefying a vector of ones: you need count data with some counts > 1.
You were perhaps looking for the accumulation of genes in your whole data set when subsampling rows of the matrix. This is done by the vegan function specaccum (argument method = "exact"), which has its own plot and other methods.
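To make the idea of accumulating genes over subsampled rows concrete, here is a minimal base-R sketch on a made-up presence/absence matrix. It averages over random row orders, so it is a stand-in for the general idea rather than the "exact" method of specaccum; the matrix pa, the helper accumulate, and the number of permutations are all hypothetical.

```r
set.seed(42)

# Hypothetical presence/absence matrix: 6 genomes (rows) x 8 genes (columns)
pa <- matrix(rbinom(48, 1, 0.4), nrow = 6,
             dimnames = list(paste0("genome", 1:6), paste0("gene", 1:8)))

# Number of distinct genes seen after adding genomes one at a time
accumulate <- function(m) {
  seen <- apply(m, 2, cummax)  # 1 once a gene has appeared in any earlier row
  unname(rowSums(seen))
}

# Average the accumulation curve over random genome orders
n_perm <- 200
curves <- replicate(n_perm, accumulate(pa[sample(nrow(pa)), , drop = FALSE]))
mean_curve <- rowMeans(curves)

plot(seq_along(mean_curve), mean_curve, type = "b",
     xlab = "Number of genomes sampled",
     ylab = "Mean number of genes observed")
```

The curve flattens once additional genomes stop contributing new genes, which is the plateau the question describes.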
I have a table that looks like this:
ID Survival Event Allele
2 5 1 WildType
2 0 1 WildType
3 3 1 WildType
4 38 0 Variant
I want to make a Kaplan-Meier plot and find out whether wild type or variants tend to survive longer.
I have this code:
library(survival)
Table <-read.table("Table1",header=T)
fit=survfit(Surv(Table$Survival,Table$Event)~Table$Allele)
plot(fit,lty=2:3,col=3:4)
From the p-value of the test below, I can see that the two groups have significantly different survival curves.
survdiff(formula = Surv(dat$Death, dat$Event) ~ dat$Allele, rho = 0)
# N Observed Expected (O-E)^2/E (O-E)^2/V
# dat$Allele=Variant 5592 3400 3503 3.00 8.63
# dat$Allele=WildType 3232 2056 1953 5.39 8.63
# Chisq= 8.6 on 1 degrees of freedom, p= 0.0033
The plot looks as expected (i.e. two curves).
All I want to do is put a legend on the plot, so that I can see which data is represented by the black and red lines, i.e. do the Wild Type or Variant survive longer.
I have tried these two commands:
lab <-gsub("x=","",names(fit$strata))
legend("top",legend=lab,col=3:4,lty=2:3,horiz=FALSE,bty='n')
The first command works (i.e. I get no error). The second command, I get this error:
Error in strwidth(legend, units = "user", cex = cex, font = text.font) :
plot.new has not been called yet
I've tried reading forums etc., but none of the answers seem to work for me (for example, changing between top/topright/topleft etc. doesn't matter).
Edit 1: This is an example of a table for which I get this error:
ID Survival Event Allele
25808 5 1 WTHomo
22196 0 1 Variant
22518 3 1 Variant
25013 38 0 Variant
27354 5 1 Variant
27223 4 1 Variant
22700 5 1 Variant
22390 24 1 Variant
17586 1 1 Variant
What exactly happens is: when I type the very last command ( legend("top",legend=lab,col=3:4,lty=2:3,horiz=FALSE,bty='n')), the X11 window opens, except it's completely blank.
But then if you just type "plot(fit,lty=2:3,col=3:4)", the X11 window and the plot appear.
Edit 2: Also, this graph will have two lines, how do I tell which line is which variable? Would the easiest way to do this be to type summary(fit) which gives me two tables. Then, whichever variable comes first in the table, I put in first in the legend?
Many thanks
Eva
You can also do this using ggsurvplot() from survminer.
Here is an example
library(survminer) # Contains ggsurvplot()
library(survival) # Contains survfit()
ggsurvplot(
  fit = survfit(Surv(time, censor) ~ Allele, data = your_data, type = "kaplan-meier"), # Model
  data = your_data, # Data used to fit the model
  xlab = "Years",
  ylab = "Overall survival probability",
  legend.labs = c("Variant", "WildType"), # Group names shown in the plot; must follow the order of the strata (alphabetical for a character Allele)
  conf.int = TRUE, # Adds a 95% confidence interval
  pval = TRUE, # Displays the p-value in the plot
  pval.method = TRUE # Shows the statistical method used for obtaining the p-value
)
I too have had repeated problems with the "plot.new has not been called yet" error! Strangely, the error was intermittent, and repeating the identical commands did not always result in the error! In my case, I found that preceding the plotting command with
plot.new()
stopped the error from appearing! I have no idea why. Just as an aside, I also had no problem adding a legend to the survival plot using your command.
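For reference, here is a small self-contained sketch of the base-graphics route: legend() draws onto the plot that is currently open, so plot(fit, ...) has to be called first on the same device. The data frame below is made up for illustration; the legend labels are taken from names(fit$strata), which also answers which line is which, because the lty and col vectors are applied to the strata in that order.

```r
library(survival)

# Hypothetical data in the same shape as the table in the question
dat <- data.frame(
  Survival = c(5, 2, 3, 38, 5, 4, 5, 24, 1, 12),
  Event    = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 0),
  Allele   = c("WildType", "WildType", "WildType", "Variant", "Variant",
               "Variant", "WildType", "Variant", "Variant", "WildType")
)

fit <- survfit(Surv(Survival, Event) ~ Allele, data = dat)

# Strata are named "Allele=Variant", "Allele=WildType"; strip the prefix
lab <- sub("^Allele=", "", names(fit$strata))

# Draw the curves first, then add the legend to the same open plot
plot(fit, lty = 2:3, col = 3:4)
legend("topright", legend = lab, col = 3:4, lty = 2:3, bty = "n")
```

Passing data = dat and using bare column names in the formula keeps the strata names short, so the prefix is easy to strip.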
I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set in ctree_control(). There you can set mincriterion (1 minus the p-value a split must reach, so 0.85 allows splits with p < 0.15) along with the minimum split and bucket sizes. So in order to maximize the chance that all the variables will be included, you could do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which show all the variables and give you a p-value to judge whether each is significant, the decision tree does not return the insignificant variables, i.e. it doesn't split on them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package, which ships with R as a recommended package.
library(rpart)
model <- rpart( wheeze3 ~ ., data=d )
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
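In rpart, the penalty mentioned in the question corresponds to the complexity parameter cp. As a sketch only, on simulated data shaped like the sample above (the data frame and all values are made up), setting cp = 0 with the smallest minsplit and minbucket lets the tree grow until no further split is possible, which lowers the training error at the cost of overfitting.

```r
library(rpart)
set.seed(1)

# Hypothetical categorical data resembling the question's data frame
n <- 200
d <- data.frame(
  TargetGroup2000  = factor(sample(1:3, n, replace = TRUE)),
  TargetGroup2012  = factor(sample(1:3, n, replace = TRUE)),
  SmokingGroup_Kai = factor(sample(1:5, n, replace = TRUE)),
  PA_Score         = factor(sample(1:3, n, replace = TRUE))
)
d$wheeze3 <- factor(rbinom(n, 1, 0.3))

# Default settings: splits that improve the fit by less than cp = 0.01 are dropped
default_tree <- rpart(wheeze3 ~ ., data = d, method = "class")

# No complexity penalty and minimal node sizes: grow as far as possible
full_tree <- rpart(wheeze3 ~ ., data = d, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Count terminal nodes and list the variables actually used for splitting
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
used_vars <- setdiff(unique(as.character(full_tree$frame$var)), "<leaf>")

n_leaves(default_tree)
n_leaves(full_tree)
used_vars
```

method = "class" is given explicitly because the outcome is categorical; with a numeric 0/1 outcome rpart would fit a regression tree instead.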
plot(ctree(myFormula, data = d))
I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.
Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?
Example code:
foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)
I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".
You have to remove the extra levels before any calculation, like:
> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
1 2 3 4
-0.1676941 -0.6454521 0.4524391 NA
This is a more general way of doing it: it sets all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't.
Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :
model.matrix(~predictor,data=foo) %*% coef(model)
[,1]
1 -0.1676941
2 -0.6454521
3 0.4524391
At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.
> model.matrix(~predictor,data=foo)
(Intercept) predictorB predictorC
1 1 0 0
2 1 1 0
3 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"
> model.matrix(~predictor,data=foo.new)
(Intercept) predictorB predictorC predictorD
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"
You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this is (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it naively dropped the last dummy variable.
On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.
Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.
Additional features
drops unused factor levels rather than just setting the missing values to NA
issues a message to the user that factor levels have been dropped
checks for the existence of factor variables in test_data and returns the original data.frame if none are present
works not only for lm and glm but also for glmmPQL
Note: The function shown here may change (improve) over time.
#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#' @importFrom purrr map
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {
  # https://stackoverflow.com/a/39495480/4185785
  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data
  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }
    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }
    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }
  # Select column names in test data that are factor predictors in
  # trained model
  predictors <- names(test_data[names(test_data) %in% factors])
  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA
  # (seq_along() skips the loop when no predictor matches; 1:length() would not)
  for (i in seq_along(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue a message to the console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}
We can apply this function to the example in the question as follows:
predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))
While trying to improve this function, I came across the fact that statistical learning methods like lm, glm etc. need the same levels in train and test, while machine learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train and test.
A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.
If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly what levels might be missing beforehand), here is a function I've built to set all levels not in the model to NA. The prediction will then also give NA, and you can use an alternative method to predict these values.
object will be your lm output from lm(...,data=trainData)
data will be the data frame you want to create predictions for
missingLevelsToNA <- function(object, data) {
  # Obtain factor predictors in the model and their levels
  factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "", names(unlist(object$xlevels))))
  factorLevels <- unname(unlist(object$xlevels))
  modelFactors <- as.data.frame(cbind(factors, factorLevels))
  # Select column names in your data that are factor predictors in your model
  predictors <- names(data[names(data) %in% factors])
  # For each factor predictor in your data, if the level is not in the model, set the value to NA
  # (seq_along() skips the loop when no predictor matches)
  for (i in seq_along(predictors)) {
    found <- data[, predictors[i]] %in% modelFactors[modelFactors$factors == predictors[i], ]$factorLevels
    if (any(!found)) data[!found, predictors[i]] <- NA
  }
  data
}
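The same idea can be sketched more compactly by using the xlevels component of the lm fit directly. This is a simplified variant, not the function above: the helper name unknown_levels_to_na is made up, and it assumes the factor columns in the new data have exactly the names stored in object$xlevels (so no terms wrapped in as.factor()).

```r
set.seed(1)

# The example from the question
foo <- data.frame(response = rnorm(3), predictor = factor(c("A", "B", "C")))
model <- lm(response ~ predictor, data = foo)
foo.new <- data.frame(predictor = factor(c("A", "B", "C", "D")))

# Hypothetical helper: rebuild each factor with the levels the model knows,
# which turns every unknown level into NA
unknown_levels_to_na <- function(object, data) {
  for (nm in intersect(names(object$xlevels), names(data))) {
    data[[nm]] <- factor(as.character(data[[nm]]),
                         levels = object$xlevels[[nm]])
  }
  data
}

# Known levels get a prediction, the unknown level "D" gets NA
preds <- predict(model, newdata = unknown_levels_to_na(model, foo.new))
preds
```

Rebuilding the factor with factor(..., levels = ) sets unknown values to NA and removes the extra level in one step, so no separate droplevels() call is needed.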
Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there's little information to use when estimating them. Warning, though, that you'll have to do prediction yourself, rather than using predict().
Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a 0/1 variable for Monday, one for Tuesday, one for Wednesday, etc. Sunday will be automatically removed from the model if it contains all 0's. But having a 1 in the Sunday column in the other data won't make the prediction step fail. It will just assume that Sunday has an effect that's the average of the other days (which may or may not be true).
One of the assumptions of linear/logistic regression is little or no multicollinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor, and can be set to NA without affecting the predictive ability of the remaining factors A, B, C. This is why the model should still be able to make predictions. But the addition of the new level D throws off the expected schema. That's the whole issue. Setting NA fixes that.
The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.
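Example: if your day-of-week factor is in a variable dow and you have a binary outcome b_fail, you could run
# glmer() is the lme4 function for generalized (here: logistic) mixed models; lmer() fits the Gaussian case
M0 <- glmer(b_fail ~ x + (1 | dow), data=df.your.data, family=binomial(link='logit'))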
M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)
This is an example with a random effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.
A quick-and-dirty solution for split testing is to recode rare values as "other". Here is an implementation:
rare_to_other <- function(x, fault_factor = 1e6) {
  # dirty dealing with rare levels:
  # recode small cells as "other" before splitting to train/test,
  # assuring that lopsided split occurs with prob < 1/fault_factor
  # (N.b. not fully kosher, but useful for quick and dirty exploratory).
  if (is.factor(x) || is.character(x)) {
    min.cell.size <- log(fault_factor, 2) + 1
    xfreq <- sort(table(x), decreasing = TRUE)
    rare_levels <- names(which(xfreq < min.cell.size))
    if (length(rare_levels) == length(unique(x))) {
      warning("all levels are rare and recoded as other. make sure this is desirable")
    }
    if (length(rare_levels) > 0) {
      message("recoding rare levels")
      if (is.factor(x)) {
        altx <- as.character(x)
        altx[altx %in% rare_levels] <- "other"
        x <- as.factor(altx)
        return(x)
      } else {
        # is.character(x)
        x[x %in% rare_levels] <- "other"
        return(x)
      }
    } else {
      message("no rare levels encountered")
      return(x)
    }
  } else {
    message("x is neither a factor nor a character, doing nothing")
    return(x)
  }
}
For example, with data.table, the call would be something like:
dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcols = xcols] # recode rare levels as other (mclapply is from the parallel package)
where xcols is any subset of colnames(dt).
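The threshold used in rare_to_other can be read as follows, assuming a 50/50 random split (an assumption; the original does not state the split ratio): the chance that all n observations of a level land on the same side is 2 * 0.5^n, which is below 1/fault_factor once n > log2(fault_factor) + 1. The sketch below works through that arithmetic and the recoding on a made-up vector, without calling the function itself.

```r
# Threshold for the default fault_factor = 1e6
fault_factor <- 1e6
min_cell <- log2(fault_factor) + 1
min_cell  # about 20.9, so levels with 20 or fewer observations count as rare

# Hypothetical factor: two common levels and two rare ones
x <- factor(c(rep("a", 30), rep("b", 25), "c", "d"))
counts <- table(x)
rare <- names(counts)[counts < min_cell]

# Recode the rare levels as "other"
recoded <- as.character(x)
recoded[recoded %in% rare] <- "other"
recoded <- factor(recoded)
table(recoded)
```

Lowering fault_factor lowers the threshold, so fewer levels are merged into "other" at the cost of a higher chance that a level ends up in only one part of the split.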