How to simulate data properly? - r

Hi I'm new to R and would like to ask a more general question. How do I simulate or create an example data set which is suitable to be posted here and simultaneously posses the property of reproducibility. I would like, for instance, create a numeric example which abstract my data set properly. One condition woud be to implement some correlation between my dependent and independent variables.
For instance. how to introduce some correlation between my count and my in.var1 and in.var2?
set.seed(1122)
count<-rpois(1000,30)
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)

You can introduce dependence by adding in some portion of the "information" in the two variables to the construction of the count variable:
set.seed(1222)
in.var1<- rnorm(1000, mean = 25, sd = 3)
#Corrected spelling of in.var2
in.var2<- rnorm(1000, mean = 12, sd = 2)
count<-rpois(1000,30) + 0.15*in.var1 + 0.3*in.var2
# Avoid use 'data` as an object name
dat<-data.frame(count,in.var1,in.var2)
> spearman(count, in.var1)
rho
0.06859676
> spearman(count, in.var2)
rho
0.1276568
> spearman(in.var1, in.var2)
rho
-0.02175273
> summary( glm(count ~ in.var1 + in.var2, data=dat) )
Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.6816 -3.6910 -0.4238 3.4435 15.5326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.05034 1.74084 16.688 < 2e-16 ***
in.var1 0.14701 0.05613 2.619 0.00895 **
in.var2 0.35512 0.08228 4.316 1.74e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If you want count to be a function of in.var1 and invar.2 try this. Note that count is already a function name so I am changing it to Count
set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)
You now have a poisson count based on in.var1 and in.var2. A poisson regression will show an intercept of 3 and coefficients of 0.5 for Var1 and -0.25 for Var2
summary(glm(Count~Var1+Var2,data=Data, family=poisson))
Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.84702 -0.76292 -0.04463 0.67525 2.79537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.001390 0.011782 254.7 <2e-16 ***
Var1 0.499789 0.001004 498.0 <2e-16 ***
Var2 -0.250949 0.001443 -173.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 308190.7 on 999 degrees of freedom
Residual deviance: 1063.3 on 997 degrees of freedom
AIC: 6319.2
Number of Fisher Scoring iterations: 4

As I understand you want to add some pattern to your data.
# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/
set.seed(1122)
rowNumber = 1000
count<-rpois(rowNumber,30)
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
dataNew <- data
for (i in 1:rowNumber) {
# flip a coin
coinFlip <- rbinom(1, size = 1, prob = 0.5)
# if coin is heads add a common pattern to that row
if (coinFlip) {
dataNew[i,"count"] <- 2 * data[i,"in.var1"] + 10* data[i,"in.var2"]
}
}
Basically, I am adding a pattern count = 2 *in.var1 + 10 * in.var2 to some random rows, here coinFlip variable. Of course you should vectorize it for more rows.

Related

How to do a t.test on a linear model for a given value of beta1?

data("hprice2")
reg1 <- lm(price ~ rooms + crime + nox, hprice2)
summary(reg1)
Call:
lm(formula = price ~ rooms + crime + nox, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-18311 -3218 -772 2418 39164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19371.47 3250.94 -5.959 4.79e-09 ***
rooms 7933.18 407.87 19.450 < 2e-16 ***
crime -199.70 35.05 -5.697 2.08e-08 ***
nox -1306.06 266.14 -4.907 1.25e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6103 on 502 degrees of freedom
Multiple R-squared: 0.5634, Adjusted R-squared: 0.5608
F-statistic: 215.9 on 3 and 502 DF, p-value: < 2.2e-16
Question 1.
Run two alternative (two-sided) t-tests for: H0: B1 = 8000
predict(reg1, data.frame(rooms=8000, crime = -199.70, nox = -1306.06), interval = .99)
Report your t-statistic and whether you reject or fail to reject the null at 90, 95, and/or 99 percent confidence levels.
I suppose by beta1 you mean rooms in this case. Your t.test in the summary is tested against beta0 = 0, you can see from wiki:
so using the example of nox:
tstat = (-1306.06 - 0)/266.14
[2] -4.907417
And p.value is
2*pt(-abs(tstat),502)
[2] 1.251945e-06
the null hypothesis in your case will be 8000 and you test rooms = 8000:
tstat = (7933.18 - 8000)/407.87
2*pt(-abs(tstat),502)
You can also use linearHypothesis from cars to do the above:
library(car)
linearHypothesis(reg1, c("rooms = 8000"))

How to formulate time period dummy variable in lm()

I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
y_t = b_0 + [b_1(1-D_t) + b_2 D_t]x_t
where D_t is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0,55,81), right =F)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is zero if observationNumber < 55 and one if its greater or equal its coefficient is your $b_0$. x:obsFactor[55,81) is the product of the dummy and the variable $x_t$ - its coefficient is your $b_2$. The coefficient for $x_t$ is your $b_1$.

Why is glmmTMP is estimating approx half value for Zero inflated Conway maxwell poisson mixed model for Simulated data

I'm trying to estimate parameter for Zero-inflated Conway Maxwell Poisson Mixed Model. I'm not getting why GlmmTMP function is giving approx half value for the non zero effect part and giving nice estimates for the Zero part and dispersion part?
E.g:- Actual value for intercept is 2.5 and I'm getting 1.21
for sexfemale actual value is 1.2 and I'm getting 0.548342
please help me out in this situation?
Thank you
#--------Simulation from ZICOMP mix lambda---------
library(COMPoissonReg)
library(glmmTMB)
set.seed(123)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we constuct a data frame with the design:
# everyone has a baseline measurment, and then measurements at random follow-up times
DF_CMP <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF_CMP)
Z <- model.matrix(~ 1, data = DF_CMP)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF_CMP)
betas <- c(2.5 , 1.2 , 2.3, -1.5) # fixed effects coefficients non-zero part
shape <- 2
gammas <- c(-1.5, 0.9) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
# we simulate random effects
b <- rnorm(n, sd = sqrt(D11))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF_CMP$id,drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas)
DF_CMP$CMP_y <- rzicmp(n * K, lambda = exp(eta_y), nu = shape, p = plogis(eta_zi))
hist(DF_CMP$CMP_y)
#------ estimation -------------
CMPzicmpm0 = glmmTMB(CMP_y~ sex*time + (1|id) , zi= ~ sex, data = DF_CMP, family=compois)
summary(CMPzicmpm0)
> summary(CMPzicmpm0)
Family: compois ( log )
Formula: CMP_y ~ sex * time + (1 | id)
Zero inflation: ~sex
Data: DF_CMP
AIC BIC logLik deviance df.resid
4586.2 4623.7 -2285.1 4570.2 792
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
id (Intercept) 0.1328 0.3644
Number of obs: 800, groups: id, 100
Overdispersion parameter for compois family (): 0.557
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.217269 0.054297 22.42 < 2e-16 ***
sexfemale 0.548342 0.079830 6.87 6.47e-12 ***
time 1.151549 0.004384 262.70 < 2e-16 ***
sexfemale:time -0.735348 0.009247 -79.52 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6291 0.1373 -11.866 < 2e-16 ***
sexfemale 0.9977 0.1729 5.771 7.89e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

negative binomial model with interaction in R

i have fit negative binomial model to my data as follows:
> ngbinmodel <- glm.nb( seizure.rate ~ age + treatment, data = epilepsy_reduced)
> summary(ngbinmodel)
Call:
glm.nb(formula = seizure.rate ~ age + treatment, data = epilepsy_reduced,
init.theta = 1.498983674, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3510 -0.8790 -0.4563 0.4328 1.8916
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0985089 0.5845392 3.590 0.000331 ***
age -0.0007965 0.0193064 -0.041 0.967092
treatment -0.5011593 0.2405658 -2.083 0.037228 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.499) family taken to be 1)
Null deviance: 71.217 on 57 degrees of freedom
Residual deviance: 66.875 on 55 degrees of freedom
AIC: 341.12
Number of Fisher Scoring iterations: 1
Theta: 1.499
Std. Err.: 0.362
2 x log-likelihood: -333.119
Now I would like to check if i should include the interaction effect between age and treatment. I have found two methods to do it:
> intearaction_nbm<-addterm(ngbinmodel, . ~ . * age,test="Chisq")
> summary(intearaction_nbm)
Df AIC LRT Pr(Chi)
Min. :1 Min. :339.1 Min. :0.9383 Min. :0.3327
1st Qu.:1 1st Qu.:339.4 1st Qu.:0.9383 1st Qu.:0.3327
Median :1 Median :339.6 Median :0.9383 Median :0.3327
Mean :1 Mean :339.6 Mean :0.9383 Mean :0.3327
3rd Qu.:1 3rd Qu.:339.9 3rd Qu.:0.9383 3rd Qu.:0.3327
Max. :1 Max. :340.2 Max. :0.9383 Max. :0.3327
NA's :1 NA's :1 NA's :1
and
> ngbinmodel_int <- glm.nb( seizure.rate ~ age*treatment, data = epilepsy_reduced)
> summary(ngbinmodel_int)
glm.nb(formula = seizure.rate ~ age * treatment, data = epilepsy_reduced,
init.theta = 1.531539174, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3503 -0.8742 -0.3848 0.3403 1.8508
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.51361 0.83920 1.804 0.0713 .
age 0.01914 0.02826 0.677 0.4981
treatment 0.60748 1.12199 0.541 0.5882
age:treatment -0.03893 0.03850 -1.011 0.3119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.5315) family taken to be 1)
Null deviance: 72.238 on 57 degrees of freedom
Residual deviance: 66.874 on 54 degrees of freedom
AIC: 342.18
Number of Fisher Scoring iterations: 1
Theta: 1.532
Std. Err.: 0.373
2 x log-likelihood: -332.180
I was expecting to obtain the same result from both of the methods.
How can i access the regression estimates of intearaction_nbm?
why are the outcomes different? According to intearaction_nbm i should include the interaction term (the AIC is lower) but according to ngbinmodel_int i should not include the interaction term (AIC increases).
would discretizing my continuous variable age be advised?
Remark: You should move this post to cross validated.
How can i access the regression estimates of intearaction_nbm?
intearaction_nbm gives you the result of the addition of single term to your model, if you print it, you will have a row per possible additional term (age:treatment, age:another_variable, etc.) giving you the AIC and P-value among other things.
why are the outcomes different?
Not possible to answer without the data, but what I would do is define both models and compare their AIC using: AIC(model_1, model_2). This way I am sure that I am comparing the same quantity. As you know, the AIC is defined up to an additional term, and unless you check how it is computed, you cannot be sure that two different functions in two different packages use the same definition.
would discretizing my continuous variable age be advised?
Not possible to answer without the data...
Let us consider the dataset quine and the following model with only main effects for Eth and Lrn factors:
library(MASS)
negbin_no_int <- glm.nb(Days ~ Eth + Lrn, data = quine)
summary(negbin_no_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 3.0367 0.1334 22.764 < 2e-16 ***
# EthN -0.5520 0.1597 -3.457 0.000546 ***
# LrnSL 0.0388 0.1611 0.241 0.809661
extractAIC(negbin_no_int)
# [1] 3.000 1112.576
The model with the interaction term between the two factor is:
negbin_with_int <- glm.nb(Days ~ Eth * Lrn, data = quine)
summary(negbin_with_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.9218 0.1503 19.446 <2e-16 ***
# EthN -0.3374 0.2100 -1.607 0.108
# LrnSL 0.2929 0.2307 1.269 0.204
# EthN:LrnSL -0.4956 0.3201 -1.549 0.122
extractAIC(negbin_with_int)
# [1] 4.000 1112.196
The statistical significance of the interaction term is p=0.122.
Now we compare the two models using addterm:
interaction_nbm <- addterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
print(interaction_nbm)
# Model:
# Days ~ Eth + Lrn
# Df AIC LRT Pr(Chi)
# <none> 1112.6
# Eth:Lrn 1 1112.2 2.3804 0.1229
The AICs given by addterm are the same calculated using extractAIC.
If you want to see the regression estimates of addterm, you can add a summary(print(nfit)) inside the function, as follows:
myaddterm <- function (object, scope, scale = 0, test = c("none", "Chisq"),
k = 2, sorted = FALSE, trace = FALSE, ...)
{
if (missing(scope) || is.null(scope))
stop("no terms in scope")
if (!is.character(scope))
scope <- add.scope(object, update.formula(object, scope))
if (!length(scope))
stop("no terms in scope for adding to object")
ns <- length(scope)
ans <- matrix(nrow = ns + 1L, ncol = 2L, dimnames = list(c("<none>",
scope), c("df", "AIC")))
ans[1L, ] <- extractAIC(object, scale, k = k, ...)
n0 <- nobs(object, use.fallback = TRUE)
env <- environment(formula(object))
for (i in seq_len(ns)) {
tt <- scope[i]
if (trace) {
message(gettextf("trying + %s", tt), domain = NA)
utils::flush.console()
}
nfit <- update(object, as.formula(paste("~ . +", tt)),
evaluate = FALSE)
nfit <- try(eval(nfit, envir = env), silent = TRUE)
print(summary(nfit))
ans[i + 1L, ] <- if (!inherits(nfit, "try-error")) {
nnew <- nobs(nfit, use.fallback = TRUE)
if (all(is.finite(c(n0, nnew))) && nnew != n0)
stop("number of rows in use has changed: remove missing values?")
extractAIC(nfit, scale, k = k, ...)
}
else NA_real_
}
dfs <- ans[, 1L] - ans[1L, 1L]
dfs[1L] <- NA
aod <- data.frame(Df = dfs, AIC = ans[, 2L])
o <- if (sorted)
order(aod$AIC)
else seq_along(aod$AIC)
test <- match.arg(test)
if (test == "Chisq") {
dev <- ans[, 2L] - k * ans[, 1L]
dev <- dev[1L] - dev
dev[1L] <- NA
nas <- !is.na(dev)
P <- dev
P[nas] <- MASS:::safe_pchisq(dev[nas], dfs[nas], lower.tail = FALSE)
aod[, c("LRT", "Pr(Chi)")] <- list(dev, P)
}
aod <- aod[o, ]
head <- c("Single term additions", "\nModel:", deparse(formula(object)))
if (scale > 0)
head <- c(head, paste("\nscale: ", format(scale), "\n"))
class(aod) <- c("anova", "data.frame")
attr(aod, "heading") <- head
aod
}
interaction_nbm1 <- myaddterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
The output is:
Call:
glm.nb(formula = Days ~ Eth + Lrn + Eth:Lrn, data = quine, init.theta = 1.177546225,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5770 -1.0470 -0.3645 0.3521 2.7227
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.9218 0.1503 19.446 <2e-16 ***
EthN -0.3374 0.2100 -1.607 0.108
LrnSL 0.2929 0.2307 1.269 0.204
EthN:LrnSL -0.4956 0.3201 -1.549 0.122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.1775) family taken to be 1)
Null deviance: 182.93 on 145 degrees of freedom
Residual deviance: 168.18 on 142 degrees of freedom
AIC: 1114.2
Number of Fisher Scoring iterations: 1
Theta: 1.178
Std. Err.: 0.146
2 x log-likelihood: -1104.196

Extract data from Partial least square regression on R

I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The summary(lm(potion1)) give me this answer:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine et Bave are representative. So I redid the model just with this two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I see the linear regression on the plot, but I can not know its equation and R². Is it possible ?
Is the first part is the same as a multiple regression linear (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficeints
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")

Resources