Why do R and PROCESS give different results for a mediation model (one is significant, the other is not)?

As a newcomer to R, I am confused about the results of my mediation analysis.
My model is simple: IV 'T1Incivi', mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, where the indirect effect was not significant; in R, however, it is significant. Since I am not that familiar with R, could you please check whether anything is wrong with my code and tell me why the result is significant in R but not in SPSS? Thanks a bunch!
My code in R:
# X predicts M
apath <- lm(T1Envied~T1Incivi, data=dat)
summary(apath)
# X and M predict Y
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
summary(bpath)
# Bootstrapping for the indirect effect
getindirect <- function(dataset,random){
d=dataset[random,]
apath <- lm(T1Envied~T1Incivi, data=d)
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data=dat,
statistic=getindirect,
R=5000)
boot.ci(Ind1,
conf = .95,
type = "norm") # PSRB as outcome

In your function getindirect, all linear regressions should be fitted to the resampled data in d.
However, the line
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
refers to the full data set dat, which should not be used inside this function. As a result, the b path is identical in every bootstrap replicate and only the a path varies, so the bootstrap understates the variability of the indirect effect. That alone can explain the discrepancy.
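A corrected version might look like the sketch below. The data are simulated only so that the example runs on its own: the effect sizes, n = 200, the seed and R = 1000 are arbitrary assumptions, not values from the question. type = "perc" is swapped in for "norm" because, as far as I know, recent PROCESS versions report percentile bootstrap intervals by default, so that is probably the closer comparison.

```r
library(boot)

# Simulated stand-in for 'dat' (hypothetical values, for illustration only)
set.seed(1)
n <- 200
T1Incivi <- rnorm(n)
T1Envied <- 0.5 * T1Incivi + rnorm(n)
T2PSRB <- 0.4 * T1Envied + 0.1 * T1Incivi + rnorm(n)
sim <- data.frame(T1Incivi, T1Envied, T2PSRB)

# Both regressions are fitted to the resampled rows in d
getindirect <- function(dataset, random) {
  d <- dataset[random, ]
  apath <- lm(T1Envied ~ T1Incivi, data = d)
  bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = d)
  unname(coef(apath)["T1Incivi"] * coef(bpath)["T1Envied"])
}

Ind1 <- boot(data = sim, statistic = getindirect, R = 1000)
boot.ci(Ind1, conf = 0.95, type = "perc")
```

With your own data, pass dat instead of sim and keep R = 5000 as in your original call.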

Related

Avoid failure of confint.merMod on weighted models in lme4 when data object modified in calling frame

I'm facing a problem when using lme4 glmer function with weights, where if the data object passed to glmer is modified, some functions such as confint no longer work on the model. Here is an example:
library(lme4)
set.seed(1)
n <- 1000
df <- data.frame(
y=rbinom(n,1,.5),
w=runif(n,0,1)*.1+.95,
g=as.integer(round(runif(n,0,4)))
)
m <- glmer(cbind(y,1-y)~(1|g),data=df,weights=w,family=binomial())
confint(m)
df$w <- df$w*2
confint(m)
The 2nd call to confint gives this error:
Computing profile confidence intervals ...
Error in profile.merMod(object, which = parm, signames = oldNames, ...) :
Profiling over both the residual variance and
fixed effects is not numerically consistent with
profiling over the fixed effects only
It seems this has something to do with the profile function, as that function doesn't work after modifying the data frame.
The following seems to work to remove the dependency on the data object, but I am a bit uneasy not knowing if there might ever be bad side effects:
glmer2 <- function(...){
cl <- match.call()
df <- eval.parent(cl$data)
cl[1] <- call("glmer")
cl$data <- as.name("df")
eval(cl)
}
m <- glmer2(cbind(y,1-y)~(1|g),data=df,weights=w,family=binomial())
confint(m)
df$w <- df$w*2
confint(m)
(results of confint don't change)
The reason I need something like this is that I am creating a series of models, and need to re-compute the weights between each one, and it would be quite messy to keep all of the data objects.
Why do model functions seem to rely on the data object still being present and unchanged in the calling environment? And is there a better way to solve this issue?
(R version 3.6.3 (2020-02-29), x86_64-redhat-linux-gnu, lme4_1.1-21)
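Part of the answer is probably that a fitted model stores the call that created it, and functions that refit the model re-evaluate that call, looking up the data by name at that moment. The sketch below shows the effect with lm() and update() from base R. It only illustrates the mechanism; whether profile.merMod follows exactly this path is an assumption on my part, and the data frame d and its columns are made up.

```r
set.seed(1)
d <- data.frame(x = 1:20, y = rnorm(20), w = seq(0.1, 2, length.out = 20))
m <- lm(y ~ x, data = d, weights = w)

m$call                            # the stored call refers to 'd' by name
refit_before <- coef(update(m))   # re-evaluates the call with the current 'd'

d$w <- rev(d$w)                   # change the weights after fitting
refit_after <- coef(update(m))    # same model object, but the refit sees the new 'd'

# The original fit matches the first refit, but not the second one
same_before <- isTRUE(all.equal(coef(m), refit_before))
same_after <- isTRUE(all.equal(refit_before, refit_after))
c(same_before = same_before, same_after = same_after)
```

Anything that refits from the stored call will therefore pick up later changes to the data object, which is consistent with the behaviour described in the question.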

Is it possible to adapt the standard prediction interval code for dlm in R to another distribution?

Using the dlm package in R I fit a dynamic linear model to a time series data set, consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for said period).
I use the following code to create a prediction interval:
ciTheory <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(c(0.05,0.95))) +
as.vector(t(fut1$f)))
However, my data do not follow a normal distribution, and I wondered whether it would be possible to replace qnorm with the quantile function of another distribution. I have tried qt, but am unable to apply qgamma.
Does anyone know how to go about this?
Below is a reproduced version of my code...
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns most likely estimate of relevant values for parameters
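mle <- dlmMLE(data, rep(0,2), mod.build) # a2 in the original post; assumed here to be the series above
if(mle$convergence==0) print("converged") else print("did not converge")
# v and w were not defined in the post; assumed here to be the estimated variances
v <- exp(mle$par[1])
w <- exp(mle$par[2])
mod1 <- dlmModPoly(dV = v, dW = c(0, w))
mod1Filt <- dlmFilter(data, mod1) # a1 in the original post; assumed to be the same series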
fut1 <- dlmForecast(mod1Filt, n = 7)
Cheers
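One way to think about it: the interval code is "forecast mean + forecast standard deviation x quantile", so any quantile function of a location-scale family (qnorm, qt) can be dropped in directly. qgamma is not of that form, which may be why it does not fit the same pattern. The sketch below uses made-up forecast means and variances in place of fut1$f and fut1$Q. The t degrees of freedom (5) and the moment-matching used for the gamma case are assumptions for illustration, not a recommendation.

```r
# Hypothetical stand-ins for unlist(fut1$f) and unlist(fut1$Q)
f_mean <- c(50, 51, 52)
f_var <- c(4, 9, 16)
probs <- c(0.05, 0.95)

# Location-scale case: mean + sd * quantile, for any quantile function qfun
interval <- function(qfun, ...) outer(sqrt(f_var), qfun(probs, ...)) + f_mean
ci_norm <- interval(qnorm)
ci_t <- interval(qt, df = 5)   # sqrt(f_var) is treated as the scale here

# Gamma case (assumption: match the forecast mean and variance by moments)
ci_gamma <- t(mapply(
  function(m, v) qgamma(probs, shape = m^2 / v, rate = m / v),
  f_mean, f_var
))

ci_norm
ci_t
ci_gamma
```

Each result has one row per forecast step and one column per bound. The gamma interval is only defined for positive forecast means.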

How to do stepwise regression in R with more independent variables than observations?

I'm trying to do stepwise regression for the following data:
y <- c(1.2748, 1.2574, 1.5571, 1.4178, 0.8491, 1.3606, 1.4747, 1.3177, 1.2896, 0.8453)
x <- data.frame(A = c(2,3,4,5,6,2,3,4,5,6),
B = c(2,4,1,3,5,1,3,5,2,4)*100,
C = c(9,5,11,5,11,7,13,7,13,9),
D = c(6,5,3,7,6,4,3,7,5,4),
E = c(1,1,0.8,0.8,0.6,0.6,0.4,0.4,0.2,0.2))
x$A2 <- x$A^2
x$B2 <- x$B^2
x$C2 <- x$C^2
x$D2 <- x$D^2
x$E2 <- x$E^2
x$AB <- x$A*x$B
As we can see, the data have 10 observations and 11 independent variables, so I can't fit a full linear regression model. In fact, only a few factors are useful, so I need to use forward stepwise regression to add independent variables to my formula one at a time. But the stats::step function cannot be used. I wonder if there is a method to do this. I know there is a package called "StepReg", but I don't fully understand how to use it or how to read the results. Thank you!
I just ran stepwise regression on the data you provided using the R package StepReg.
Here is the code; I hope it helps.
library(StepReg)
df <- data.frame(y,x)
# forward method with information criterion 'AIC', you can choose other information criterion
stepwise(df, y="y", exclude=NULL, include=NULL, Class=NULL,
selection="forward", select="AIC")
# forward method with significant levels, significant level for entry = 0.15
stepwise(df, y="y", exclude=NULL, include=NULL, Class=NULL,
selection="forward", select="SL",sle=0.15)
You can use the olsrr package, which gives results similar to those of SPSS. Here is the solution:
library(olsrr)
df <- data.frame(y, x)
model <- lm(y ~ ., data = df)
smlr <- ols_step_both_p(model, pent = 0.05, prem = 0.1)
# pent: variables with a p-value less than pent enter the model.
# prem: variables with a p-value greater than prem are removed from the model.
You can get the model details by calling
smlr
smlr$model
smlr$beta_pval #regression coefficients with p-values
I have kept the values of pent and prem the same as the defaults used by SPSS.
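For what it's worth, base R's step() cannot start from the full model here, but it can do forward selection if it starts from the intercept-only model and is given the candidate terms as an upper scope. A minimal sketch with the data from the question follows. The names dat_step, null_model and upper are made up for the example, and capping the search at steps = 3 is an arbitrary assumption to keep the model small relative to 10 observations.

```r
y <- c(1.2748, 1.2574, 1.5571, 1.4178, 0.8491, 1.3606, 1.4747, 1.3177, 1.2896, 0.8453)
x <- data.frame(A = c(2,3,4,5,6,2,3,4,5,6),
                B = c(2,4,1,3,5,1,3,5,2,4)*100,
                C = c(9,5,11,5,11,7,13,7,13,9),
                D = c(6,5,3,7,6,4,3,7,5,4),
                E = c(1,1,0.8,0.8,0.6,0.6,0.4,0.4,0.2,0.2))
# Squared terms and the A:B product, as in the question
for (v in c("A", "B", "C", "D", "E")) x[[paste0(v, "2")]] <- x[[v]]^2
x$AB <- x$A * x$B

dat_step <- data.frame(y, x)

# Start from the intercept-only model and let step() add terms by AIC
null_model <- lm(y ~ 1, data = dat_step)
upper <- reformulate(names(x))   # ~ A + B + ... + AB
fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = upper),
            direction = "forward",
            steps = 3,            # assumption: add at most three terms
            trace = 0)
formula(fwd)
summary(fwd)
```

With so few observations the selected model should be read with caution, whichever package does the selection.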

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." causes the formula to include that factor in the prediction model itself.
How do I fit the column I want on high-ish dimensional data without having to spell out exactly which columns to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ... etc.)?
EDIT:
On a high level, I believe one basically
1. loads the full data
2. breaks it down into several subsets to prevent overfitting
3. trains on the subset data
4. generates a fitting formula so one can predict values of the target (in my case data[,66]) given data[1:65].
So my problem now is that if I give it a new set of test data, say test = data[1:65], it says "Error in eval(expr, envir, enclos) :" where it is expecting data[,66]. I basically want to predict data[,66] given the rest of the data!
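I think the problem is that the left-hand side of your formula is train3[,66] rather than a column name, so "." expands to every column of train3, including column 66, and the response ends up being used as a feature.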
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
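An alternative to dropping the column is to put the column name itself on the left-hand side, in which case "." leaves it out. This can be checked directly with terms(), which shows what "." expands to. The sketch below uses iris, and uses rpart only because it ships with R; as far as I know the same formula logic applies to randomForest() and cforest(), but that is not tested here. The train/test split is arbitrary.

```r
library(rpart)

# With an expression on the left-hand side, "." still includes Species
attr(terms(iris[, 5] ~ ., data = iris), "term.labels")
# With the column name on the left-hand side, Species is left out of "."
attr(terms(Species ~ ., data = iris), "term.labels")

set.seed(415)
idx <- sample(nrow(iris), nrow(iris) / 2)
train <- iris[idx, ]
test <- iris[-idx, -5]   # features only, no Species column

# The response is named in the formula, so prediction needs only the features
fit <- rpart(Species ~ ., data = train)
pred <- predict(fit, newdata = test, type = "class")
table(pred)
```

With the data from the question, that would mean using the actual name of column 66 in the formula instead of train3[,66].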

How to extract info from package in R and use in function?

I apologize for the vague question title. What I want to do is run a regression in R using geeglm from the geepack R package, then use information from that to calculate a quasilikelihood information criteria (QIC; Pan 2001). I can do this fairly easily for single models but I would like to write a general function that can do this for a variety of different types of models. I guess my real question is whether there is a better alternative than having a long series of nested ifelse statements?
Here's my current code:
library(geepack)
data(dietox) #data from the geepack package
# Run gee regression
dietox$Cu <- as.factor(dietox$Cu)
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 <- geeglm(mf, data = dietox, id = Pig, family = gaussian, corstr = "ar1")
Then I can run a function to calculate the quasilikelihood:
QlogLik.normal <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2)
quasi.R
}
However, I would like to write a function that is more general because the quasilikelihood function is different for every distribution. The above function would work for gee1 because it had a gaussian (normal) distribution. If I wanted to generalize it for a variety of distributions I could use a series of nested ifelse statements (below), but I don't know if this is the best way to do this. Does anyone have other options or a better solution? This just doesn't seem very elegant to say the least (clearly I don't have much programming or R experience).
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
ifelse(model.R$modelInfo$variance == "poisson",
# Quasi Likelihood for Poisson
quasi.R <- sum((y*log(mu.R)) - mu.R),
ifelse(model.R$modelInfo$variance == "gaussian",
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2),
ifelse(model.R$modelInfo$variance == "binomial",
# Quasilikelihood for Binomial
quasi.R <- sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
quasi.R <- "Error: distribution not recognized")))
quasi.R
}
In this example, I used the model output from geeglm to extract the type of distribution used to model the variance
model.R$modelInfo$variance
but there may be other ways to determine what distribution was used in the geeglm model. Any help would be appreciated.
You should be able to rewrite your function like this:
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
type <- family(model.R)$family
switch(type,
poisson = sum((y*log(mu.R)) - mu.R),
gaussian = sum(((y - mu.R)^2)/-2),
binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
stop("Error: distribution not recognized"))
}
As @baptise points out, switch is useful in these cases. You can use family(model.R)$family to automatically detect which family type should be used with switch.
Also, if your commands for what to do in different cases run beyond one line, you can wrap the lines with curly brackets ({ do something here }) instead.
switch(type,
type1 = { something <- do(this)
thisis(something) },
type2 = do(that))
I hope this helps!
You may also use model.R$family$family, which gives the type of distribution used to model the variance, but so far I don't know of a way to eliminate those ifelse statements. The quasi.R in your code differs between distributions, so you have to define each of them separately.
BTW, it is a good question and thanks for posting it: I have had similar situations in the past, and hope to get some advice on how to write code more efficiently.
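Another option, if the list of distributions is likely to grow, is a named list of functions that is indexed by the family name. This is only a sketch of the idea: quasi_fns and quasi_loglik are made-up names, and the three formulas are copied from the code above rather than checked against Pan (2001).

```r
# One quasi-likelihood function per family, keyed by family name
quasi_fns <- list(
  poisson  = function(y, mu) sum(y * log(mu) - mu),
  gaussian = function(y, mu) sum(((y - mu)^2) / -2),
  binomial = function(y, mu) sum(y * log(mu / (1 - mu)) + log(1 - mu))
)

quasi_loglik <- function(y, mu, family_name) {
  f <- quasi_fns[[family_name]]   # NULL if the name is not in the list
  if (is.null(f)) stop("distribution not recognized: ", family_name)
  f(y, mu)
}

# Small check with made-up numbers
quasi_loglik(c(1, 2, 3), c(1.5, 2, 2.5), "gaussian")
```

With a geeglm fit this would be called as quasi_loglik(model.R$y, model.R$fitted.values, family(model.R)$family), and supporting a new distribution only means adding one entry to the list.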
