I have a dataset with a binary dependent variable and a number of predictors, including participant. I am trying to examine the idiosyncratic effects of different predictors for different participants. In order to do that, I'm trying to look at the effect of interactions between participant id and the other predictors on the dependent variable. I'm using randomForest in R. I can fit the forest successfully, and can produce partial dependence plots for individual variables. What I need, however, are partial dependence plots for pairs of variables - participant + others. Is this possible?
For reference, my code:
library(randomForest)
data_sample <- data_raw[sample(nrow(data_raw), 500, replace = FALSE), ]
test_rf <- randomForest(perceptually.rhotic ~ vowel + speaker + modified_clip_start + function_word + year_of_birth + gender + fathers_job_type + prepausal, data = data_sample, ntree = 500, mtry = 3)
partialPlot(test_rf, pred.data = data_sample, x.var = "speaker")
# ??? What I would like to do (two variables at once):
# partialPlot(test_rf, pred.data = data_sample, x.var = c("speaker", "vowel"))
Thanks very much in advance for any advice anyone can offer!
The plotmo R package will plot partial dependencies for all variables and pairs of variables (bivariate dependencies) for "any" model. For example:
library(randomForest)
data(trees)
mod <- randomForest(Volume~., data=trees)
library(plotmo)
plotmo(mod, pmethod="partdep") # plot partial dependencies
which gives the partial dependence plots (image not shown).
You can specify exactly which variables and variable pairs get plotted using plotmo's all1, all2, degree1, and degree2 arguments. Additional examples are in the vignette for the plotmo package.
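To make explicit what a bivariate partial dependence is, here is a minimal base-R sketch of the brute-force computation: fix the two variables at each grid point, predict for every row, and average. The helper name partial_dependence_2d is hypothetical (it is not part of randomForest or plotmo), and an lm fit on the trees data stands in for the random forest so that the example needs no extra packages.

```r
# Hypothetical helper: brute-force partial dependence on a pair of variables.
# Works with any model whose predict() accepts `newdata` and returns numbers.
partial_dependence_2d <- function(model, data, var1, var2, grid1, grid2,
                                  predict_fun = predict) {
  pd <- matrix(NA_real_, nrow = length(grid1), ncol = length(grid2),
               dimnames = list(as.character(grid1), as.character(grid2)))
  for (i in seq_along(grid1)) {
    for (j in seq_along(grid2)) {
      tmp <- data
      tmp[[var1]] <- grid1[i]  # set var1 to the grid value in every row
      tmp[[var2]] <- grid2[j]  # set var2 to the grid value in every row
      pd[i, j] <- mean(predict_fun(model, newdata = tmp))
    }
  }
  pd
}

# Assumption: a linear model stands in for the forest in this demo.
fit <- lm(Volume ~ Girth + Height, data = trees)
girth_grid <- seq(min(trees$Girth), max(trees$Girth), length.out = 5)
height_grid <- seq(min(trees$Height), max(trees$Height), length.out = 4)

pd <- partial_dependence_2d(fit, trees, "Girth", "Height",
                            girth_grid, height_grid)

# Surface plot of the averaged predictions over the two-variable grid
persp(girth_grid, height_grid, pd, theta = 30, phi = 20,
      xlab = "Girth", ylab = "Height", zlab = "Partial dependence")
```

The same helper should work for other models by passing a predict_fun that returns one number per row (for a classifier, for instance, a wrapper that extracts the predicted probability of one class).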
Related
I'm using the book Applied Survival Analysis Using R by Moore to try to model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the Cox model, which makes me wonder whether my understanding of the model is wrong. My data is simple: a time column t, an event indicator column i (1 for event, 0 for censored), and a predictor column p with 6 factor levels.
I believe I can plot estimated survival curves for a Cox model as shown below, but I don't understand how to use survfit with base plot, or functions from survminer, to achieve the same end. Here is some generic code to clarify my question. I'll use the pharmacoSmoking dataset to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data <- data.frame(t, i, p)  # data.frame() keeps p as a factor; cbind() would coerce it to integer codes
model <- coxph(Surv(t, i) ~ p, data = data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s <- exp(-base$hazard)
t <- base$time
plot(s ~ t, type = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the Cox model. The code above gives what I think are estimated survival curves for heavy vs light smokers in the pharmacoSmoking dataset.
Since that's a bit of code, I was looking for a package with a one-liner solution. I had a hard time with the documentation for survival (there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different from my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work on my data (I get a 'cannot allocate vector of size ...' error; my data has ~1m rows), which seems odd given that I can make plots with my earlier approach without any problem. This is the primary reason I am wondering whether I understand how to plot survival curves for my model.
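For comparison, here is a sketch of the survfit route mentioned in the question: passing a newdata data frame to survfit on a coxph fit gives one predicted curve per row of newdata. As an assumption for the sake of a self-contained example, the lung data shipped with the survival package stands in for the asker's data, with sex as a two-level factor predictor.

```r
library(survival)

# Assumption: `lung` (from the survival package) stands in for the real data.
d <- lung
d$sex <- factor(d$sex, levels = c(1, 2), labels = c("male", "female"))

fit <- coxph(Surv(time, status) ~ sex, data = d)

# One row per covariate pattern whose predicted curve we want
nd <- data.frame(sex = factor(c("male", "female"), levels = levels(d$sex)))

sf <- survfit(fit, newdata = nd)
plot(sf, col = 1:2, xlab = "Days", ylab = "Estimated survival")
legend("topright", legend = levels(d$sex), col = 1:2, lty = 1)

# Under proportional hazards the second curve is the first raised to the
# power of the hazard ratio, which is the manual calculation in the question.
hr <- exp(coef(fit))
manual_female <- sf$surv[, 1]^hr
```

Here manual_female should coincide with the second column of sf$surv, which is one way to check the hand-rolled curves against survfit.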
I am trying to perform an analysis of gene expression data with the limma R package. My model includes factors and numerical covariates, and I'm not able to get the results for both types of variables at once.
Here is an example:
library(limma)
design <- model.matrix(~ 0 + Factor + NumericCov, data = sampleData)
fit <- lmFit(geneExprData, design)
# Each contrast needs a unique name; the third one is referred to as "Factor3" below
cont.matrix <- makeContrasts(Factor1 = FactorLevel2 - FactorLevel1,
                             Factor2 = FactorLevel3 - FactorLevel2,
                             Factor3 = FactorLevel1 - FactorLevel3,
                             NumericCov = NumericCov,
                             levels = design)
fit <- contrasts.fit(fit, cont.matrix)
fit <- eBayes(fit)
topTable(fit,coef="Factor1")
topTable(fit,coef="Factor2")
topTable(fit,coef="Factor3")
topTable(fit,coef="NumericCov")
Is this correct? Or should I just not use a contrast matrix for the analysis of the effect of numeric covariates?
If I do not use the makeContrast function it is more difficult to look at the difference between all the levels of the factor (which I need to do).
So if this is not correct, is there nevertheless a way to define the contrasts in order to do both parts of the analysis at once?
Thanks in advance!
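As an illustration of what such a contrast matrix computes, here is a base-R sketch for a single hypothetical gene, using lm.fit rather than limma. The simulated sampleData, the response y, and the hand-built matrix cont are all assumptions made for the example. Each contrast is a weighted sum of the design coefficients, so a column with a single 1 in the NumericCov row simply returns that covariate's slope.

```r
set.seed(1)

# Assumption: simulated stand-ins for the sample annotation and one gene
sampleData <- data.frame(
  Factor = factor(rep(c("Level1", "Level2", "Level3"), each = 4)),
  NumericCov = rnorm(12)
)
y <- rnorm(12)

design <- model.matrix(~ 0 + Factor + NumericCov, data = sampleData)
beta <- coef(lm.fit(design, y))

# Rows follow the design columns; each column is one contrast
cont <- cbind(
  Factor1    = c(-1,  1,  0, 0),
  Factor2    = c( 0, -1,  1, 0),
  Factor3    = c( 1,  0, -1, 0),
  NumericCov = c( 0,  0,  0, 1)
)
rownames(cont) <- colnames(design)

# Contrast estimates are linear combinations of the fitted coefficients
est <- drop(t(cont) %*% beta)
est
```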
I was wondering if I could get some advice about fitting hurdle models using continuous data and covariates.
I have some continuous data that are generally well fit using a right-skewed distribution such as a Pareto, Gamma, or Weibull distribution. However, there are several zeros in my data which are important to my analysis. In addition, I have some categorical (two-level) covariates and would like to model the parameters of a distribution as a function of these covariates in order to formally evaluate their importance (e.g., using AIC).
I have seen examples of hurdle models fit using continuous data but have not yet found any examples of how to incorporate covariates and a model-selection framework. Does anyone have any suggestions as to how to proceed or know of any R packages that allow this procedure? I have included some code below to reproduce the type of data I am working with. The non-zero data are generated via a generalized Pareto distribution from the package texmex. The parameters were estimated directly from my non-zero data. I have also included the code to plot the data in a histogram to see their distribution.
library("texmex")
set.seed(101)
zeros <- rep(0,8)
non_zeros <- rgpd(17, sigma = exp(-10.4856), xi = 0.1030, u = 0)
all.data <- c(zeros,non_zeros)
hist(non_zeros,breaks=50,xlim=c(0,0.00015),ylim=c(0,9),main="",xlab="",
col="gray")
hist(zeros,add=TRUE,col="black",breaks=100,xlim=c(0,0.00015),ylim=c(0,9))
legend("topright",legend=c("zeros"),col="black",lwd=8)
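One possible way to set up a two-part (hurdle) model with covariates in base R is sketched below: a binomial GLM for zero versus non-zero, plus a separate GLM for the positive values. Because the hurdle likelihood factorizes into these two parts, their AICs can be added and compared across covariate sets fitted to the same data. Note the assumptions: the data are simulated, the helper fit_hurdle is hypothetical, and a Gamma GLM is swapped in for the generalized Pareto because glm supports it directly.

```r
set.seed(101)

# Assumption: simulated data with a two-level covariate that affects both
# the probability of a zero and the mean of the positive values
n <- 400
group <- factor(rep(c("a", "b"), each = n / 2))
p_zero <- ifelse(group == "a", 0.1, 0.5)
mu_pos <- ifelse(group == "a", 1, 5)
is_pos <- rbinom(n, 1, 1 - p_zero)
y <- ifelse(is_pos == 1, rgamma(n, shape = 2, rate = 2 / mu_pos), 0)
dat <- data.frame(y, group)
dat$nonzero <- as.integer(dat$y > 0)

# Hypothetical helper: fit both parts and add their AICs
fit_hurdle <- function(zero_formula, pos_formula, data) {
  zero_part <- glm(zero_formula, family = binomial, data = data)
  pos_part <- glm(pos_formula, family = Gamma(link = "log"),
                  data = data[data$y > 0, ])
  list(zero = zero_part, positive = pos_part,
       aic = AIC(zero_part) + AIC(pos_part))
}

full <- fit_hurdle(nonzero ~ group, y ~ group, dat)
null <- fit_hurdle(nonzero ~ 1, y ~ 1, dat)

# Lower combined AIC favours the model with the covariate
c(full = full$aic, null = null$aic)
```

The two parts do not have to share covariates, so the same helper can compare, say, a covariate in the zero part only against a covariate in both parts.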
I would like to use the delete-d cross-validation technique available in the R package bestglm. I have a binomial response variable (species presence/absence) and 11 predictor variables that are continuous or have levels and I am treating them in the analysis as continuous. I have about 7000 data points, depending on the species. I would like to allow interactions between one variable and the other ten variables, and I would also like to include quadratic responses.
Is this possible? From what I gather looking at the R help and the vignette for this package, it is not, but maybe I am missing something.
I have a data.frame (X,Y,a,b,c,d,e)
Is there a package where I can predict both X and Y at the same time?
Thanks for your help.
Try the car package: see ?linearHypothesis, in particular the example of
a multivariate linear model for repeated-measures data.
See ?OBrienKaiser for a description of the dataset used in that example.
However, it might not be appropriate for the real X and Y you have.
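As a minimal sketch of a multivariate linear model in base R, lm accepts a matrix response built with cbind, and predict then returns one column per response. The data below are simulated under assumed relationships purely for illustration; whether a linear model suits the real X and Y is a separate question.

```r
set.seed(42)

# Assumption: simulated data frame with the column names from the question
n <- 50
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n),
                  d = rnorm(n), e = rnorm(n))
dat$X <- 1 + 2 * dat$a - dat$b + rnorm(n)
dat$Y <- -1 + 0.5 * dat$c + dat$d + rnorm(n)

# A matrix response gives a multivariate linear model (class "mlm")
fit <- lm(cbind(X, Y) ~ a + b + c + d + e, data = dat)

# Predictions for new rows come back as a matrix with columns X and Y
new_rows <- dat[1:3, c("a", "b", "c", "d", "e")]
pred <- predict(fit, newdata = new_rows)
pred
```

The point predictions match those from two separate lm fits; the joint fit matters mainly for tests that involve both responses, such as those in car's linearHypothesis.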