R tables package - error: subscript out of bounds

I need to create a fancy table and export it as PNG, using the tables package in R. I need to group "variacion" by agents ("agentes") who had a positive variation vs. the rest, showing the mean, the sd, and the number of agents who fulfill each condition.
My table is:
agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615
My output should be a grouped table (with specific formatting) showing the mean, sd, and number of agents for variation > 0 and for variation <= 0.
You can arrive at those numbers by doing, for example:
sd(t_agentes1$variacion[t_agentes1$variacion <= 0])
The result is the last number in the sd column of the table: 0.126.
So, with the tables library:
library(tables)
X <- t_agentes1$variacion
latex(tabular((X > 0) + (X < 0) + 1 ~
                ((n = 1) + X * (mean + sd + length))))
But I get the error:
non-numeric argument to binary operator
Also, when I try the first example from the package I get an error:
tabular((Species + 1) ~ (n = 1) + Format(digits = 2) *
          (Sepal.Length + Sepal.Width) * (mean + sd), data = iris)
Error in e[[3]] : subscript out of bounds
I really don't understand the parameters of this package. Is there a way to do the grouping? I'm really lost with this, so any help would be really appreciated. Thanks.

X <- read.table(header = TRUE, text="agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615")
X <- within(X, variation <- factor(variacion > 0, levels = c(TRUE, FALSE),
                                   labels = c('variation > 0',
                                              'variation <= 0')))
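As a quick check, the grouping factor splits the five agents 3/2, matching the n column in the table further below:
with(X, table(variation))
##  variation > 0 variation <= 0
##              3              2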
library(tables)
# latex(
#   tabular(Heading() * variation ~
#             Justify(l) * (Heading() * Format(digits = 2) * variacion * (mean + sd) + (number = (n = 1))),
#           data = X))
latex(
  tabular(Heading() * variation ~
            Justify(l) * (Heading() * variacion *
              (Format(digits = 2) * mean + Format(digits = 2) * sd) +
              (number = (n = 1))),
          data = X))
# mean sd number
# variation $>$ 0 0.20 0.048 3
# variation $\\leq$ 0 -0.16 0.126 2

Without prettifying the results (here X is the numeric vector t_agentes1$variacion):
tabular((X > 0) + (X < 0) ~ mean*X + sd*X + length*X)
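Since the original goal was a PNG file, one possible route is sketched below. It is not part of the answers above; it assumes the gridExtra package, and that as.matrix() on a tabular object yields a character matrix (tables provides an as.matrix.tabular method). It reuses the data frame X and its variation factor from the first answer:
library(gridExtra)
tab <- tabular(Heading() * variation ~
                 Justify(l) * (Heading() * variacion *
                   (Format(digits = 2) * mean + Format(digits = 2) * sd) +
                   (number = (n = 1))),
               data = X)  # X and variation as defined in the first answer above
png("variation_table.png", width = 480, height = 160)
grid.arrange(tableGrob(as.matrix(tab)))  # render the character matrix as a grid table
dev.off()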


Method to extract mean (CI 95%) when independent variable is continuous (both pos and neg) and dependent variable is binary?

(This question is similar to Extract confidence interval for both values of binary variable for glm()?; however, in this example the continuous variable includes negative values, and the dependent and independent variables have swapped places.)
The reproducible data set (smoking_status: 0 = non-smokers, 1 = smokers):

smoking_status  change_hemoglobin  age  sex
             1                2.5   12    0
             0               -5.2   99    0
             1               -2.0   54    1
             0                1.7   46    1
             1                0.0   45    1
             1                0.1   24    1
             0                0.0   24    1
             1               9.05   78    0
             1                6.0   56    0
             1               -5.2   45    0
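To make this reproducible in R (a minimal sketch; the answer below refers to the data as df):
df <- read.table(header = TRUE, text = "
smoking_status change_hemoglobin age sex
1  2.5 12 0
0 -5.2 99 0
1 -2.0 54 1
0  1.7 46 1
1  0.0 45 1
1  0.1 24 1
0  0.0 24 1
1 9.05 78 0
1  6.0 56 0
1 -5.2 45 0")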
With which method can I extract the mean and confidence intervals for the change in hemoglobin for each smoking status? I would like to produce a results table like this, including the p-value for the difference in change of hemoglobin between the smokers and non-smokers:
Unadjusted:

              Change in hemoglobin, mean (95% CI)   p-value of difference between smokers and non-smokers
Smokers       X (X - X)                             X
Non-smokers   X (X - X)

Adjusted for age and sex:

              Change in hemoglobin, mean (95% CI)   p-value of difference between smokers and non-smokers
Smokers       X (X - X)                             X
Non-smokers   X (X - X)
You can do this with a few base R functions. The mean for each group can simply be obtained by mean. The 95% confidence interval around the mean can be obtained by calculating the mean +/- 1.96 standard errors, and the p value can be obtained from a single-sample t test:
with(df, tapply(change_hemoglobin, smoking_status, function(x) {
  mean_x <- round(mean(x), 2)
  sem_x <- round(sd(x)/sqrt(length(x)), 2)
  paste0(mean_x, ' (95% CI ',
         mean_x - 1.96 * sem_x, ' to ', mean_x + 1.96 * sem_x, '),',
         ' p = ', round(t.test(x)$p.val, 2))
}))
#> 0
#> "-1.17 (95% CI -5.2468 to 2.9068), p = 0.63"
#> 1
#> "1.49 (95% CI -2.0772 to 5.0572), p = 0.44"
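A small variant, not the answer's original code: t.test() already returns a t-based confidence interval, which avoids hand-rolling the 1.96 * SEM normal approximation:
with(df, tapply(change_hemoglobin, smoking_status, function(x) {
  tt <- t.test(x)  # one-sample t test: mean, t-based 95% CI, p value
  sprintf("%.2f (95%% CI %.2f to %.2f), p = %.2f",
          tt$estimate, tt$conf.int[1], tt$conf.int[2], tt$p.value)
}))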
Update
To take sex and age into account, we need a linear model:
df$smoking_status <- factor(df$smoking_status)
model1 <- lm(change_hemoglobin ~ 0 + smoking_status + age + sex, data = df)
model2 <- lm(change_hemoglobin ~ smoking_status + age + sex, data = df)
coefs <- summary(model1)$coef
coefs2 <- summary(model2)$coef
c(paste0('Mean change: ', round(coefs[1:2, 1], 2), ', 95% CI: ',
         round(coefs[1:2, 1] - 1.96 * coefs[1:2, 2], 2), ' to ',
         round(coefs[1:2, 1] + 1.96 * coefs[1:2, 2], 2)),
  paste("p value = ",
        round(coefs2[2, 4], 2)))
#> [1] "Mean change: -0.05, 95% CI: -13.14 to 13.04"
#> [2] "Mean change: 2.27, 95% CI: -7.29 to 11.82"
#> [3] "p value = 0.58"
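As a hedged alternative to the manual 1.96 * SE computation above, base R's confint() returns t-based intervals for the same fitted model:
confint(model1)[1:2, ]  # t-based 95% CIs for the two adjusted group means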

Error in MEEM(object, conLin, control$niterEM) / fixed-effect model matrix is rank deficient

I'm trying to run a multilevel mediation analysis (as done here and here).
library(data.table)
library(lme4)
library(nlme)
library(magrittr)
library(dplyr)
set.seed(1)
# Simulated data ------------------------------------------------------------------
dt_1 <- data.table(id = rep(1:10, each = 4),
                   time = as.factor(rep(1:4, 10)),
                   x = rnorm(40),
                   m = rnorm(40),
                   y = rnorm(40))
# Melt m and y into z ------------------------------------------------------------------
dt_2 <- dt_1 %>%
  mutate(mm = m, .after = x) %>%
  melt(id.vars = c("id", "time", "x", "mm"),
       na.rm = FALSE,
       variable.name = "dv",
       value.name = "z") %>%
  within({
    dy <- as.integer(dv == "y")
    dm <- as.integer(dv == "m")
  }) %>%
  arrange(id, time)
> head(dt_2,4)
id time x mm dv z dm dy
1: 1 1 -0.6264538 -0.1645236 m -0.1645236 1 0
2: 1 1 -0.6264538 -0.1645236 y -0.5686687 0 1
3: 1 2 0.1836433 -0.2533617 m -0.2533617 1 0
4: 1 2 0.1836433 -0.2533617 y -0.1351786 0 1
# lme mediation model ------------------------------------------------------------------
model_lme <- lme(fixed = z ~ 0 +
                   dm + dm:x + dm:time +           # m as outcome
                   dy + dy:mm + dy:x + dy:time,    # y as outcome
                 random = ~ 0 + dm:x + dy:mm + dy:x | id,
                 weights = varIdent(form = ~ 1 | dm),  # separate sigma^2_e for each outcome
                 data = dt_2,
                 na.action = na.exclude)
Error in MEEM(object, conLin, control$niterEM): Singularity in backsolve at level 0, block 1
# lmer mediation model ------------------------------------------------------------------
model_lmer <- lmer(z ~ 0 + dm + dm:x + dm:time + dy + dy:mm + dy:x + dy:time +
                     (0 + dm:x + dy:mm + dy:x | id) + (0 + dm | time),
                   data = dt_2,
                   na.action = na.exclude)
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
I've seen some posts about this error (nlme) / warning (lme4) (e.g. this and this), but I couldn't figure out what the problem is here.
I checked
X <- model.matrix(~0 + dm + dm:x + dm:time + dy + dy:mm + dy:x + dy:time, data=dt_2)
> caret::findLinearCombos(X)
$linearCombos
$linearCombos[[1]]
[1] 7 1 4 5 6
$remove
[1] 7
but I don't quite understand the output.
From the summary of model_lmer I see that the dm:time4 and time1:dy coefficients are missing, but why? All the possible combinations (0/0, 0/1, 1/0, 1/1) are present in the dataset.
Fixed effects:
Estimate Std. Error t value
dm 0.30898 0.92355 0.335
dy 0.03397 0.27480 0.124
dm:x 0.21267 0.19138 1.111
dm:time1 -0.19713 1.30583 -0.151
dm:time2 -0.30206 1.30544 -0.231
dm:time3 -0.20828 1.30620 -0.159
dy:mm 0.22625 0.18728 1.208
x:dy -0.37747 0.17130 -2.204
time2:dy 0.29894 0.39123 0.764
time3:dy 0.22640 0.39609 0.572
time4:dy -0.16758 0.39457 -0.425
On the other hand, using time as numeric gives no error/warning:
# lmer mediation model - time as numeric
model_lmer2 <- lmer(z ~ 0 + dm + dm:x + dm:time + dy + dy:mm + dy:x + dy:time +
                      (0 + dm:x + dy:mm + dy:x | id) + (0 + dm | time),
                    data = within(dt_2, time <- as.numeric(time)),
                    na.action = na.exclude)
It's true that one can know dm from dy (if one is 1 the other is 0), but if that was the problem, this last model (model_lmer2) would still give a warning, I guess.
In my real dataset, I could eventually use time as numeric (although not my first approach), but I would like to understand what the problem is with using it as categorical.
Thank you!
This is really a generic question about linear model construction/formulas in R: it's not mixed-model specific.
Let's look at the names of the columns involved in the multicollinear combination of variables (i.e. columns 7, 1, 4, 5, 6):
cc <- caret::findLinearCombos(X)
colnames(X)[cc$linearCombos[[1]]]
## [1] "dm:time4" "dm" "dm:time1" "dm:time2" "dm:time3"
This is telling us that the main effect of dm is confounded with the dm:time interaction; once we know dm:time[i] for all levels of i, the main effect of dm is redundant.
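A quick sanity check of that statement, using the model matrix X the question already built: the dm column is exactly the row-wise sum of the four dm:time dummy columns, so one of those five columns is redundant.
# dm equals dm:time1 + dm:time2 + dm:time3 + dm:time4 in every row
all(X[, "dm"] == rowSums(X[, grep("dm:time", colnames(X), fixed = TRUE)]))
## [1] TRUE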
It's too bad that lme doesn't automatically drop columns to adjust for multicollinearity as lmer does, and that lmer doesn't have a super-convenient way to model heteroscedasticity à la varIdent(); it is possible, but it's a nuisance. It would be possible to build the auto-dropping into nlme, or into glmmTMB (which can also model heteroscedasticity easily), but no-one's gotten around to it yet.
... if you're OK with specifying dm:time and leaving dm out of your model then that might be easiest!
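For instance, here is a sketch of the question's lme call with the redundant dm main effect dropped (the random and weights parts are unchanged); since the fixed-effect matrix is then full rank, the level-0 singularity should go away:
model_lme_ok <- lme(fixed = z ~ 0 +
                      dm:x + dm:time +             # dm main effect absorbed into dm:time
                      dy + dy:mm + dy:x + dy:time,
                    random = ~ 0 + dm:x + dy:mm + dy:x | id,
                    weights = varIdent(form = ~ 1 | dm),
                    data = dt_2,
                    na.action = na.exclude)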
You can experiment with what happens with different model specifications:
lc <- function(f) {
  X <- model.matrix(f, dt_2)
  cc <- caret::findLinearCombos(X)
  lapply(cc$linearCombos, function(z) colnames(X)[z])
}
lc(~0 + dm + dm:time)
lc(~0 + dy + dy:time)
lc(~0 + dm + dm:time + dy + dy:time)
lc(~0 + dy + dy:time + dm + dm:time)
or similar explorations looking at (the heads of) the model matrices, their column names, etc.

Linear Optimization in R

I am new to optimization so please bear with me. Here is my problem:
A, B, C, D and E are percentages (18%, 2%, 1%, 78%, 1%).
Maximize A(x) + B(x) + C(x) + D(x) + E(x), i.e. maximize x (with x <= 499572)
such that
A(x) <= 20076
B(x) <= 8619
C(x) <= 145
D(x) <= 465527
E(x) <= 5205
How do I frame this problem in R?
I was using the lpSolve package, but I am OK with any suggestions.
We restate the problem as follows, omitting the positivity constraint on the single scalar variable x and the upper bound x <= 499572 (which turns out not to bind):
maximize 1 * x
such that
0.18 * x <= 20076
0.02 * x <= 8619
0.01 * x <= 145
0.78 * x <= 465527
0.01 * x <= 5205
so as a linear program we have the following optimum value of x:
library(lpSolve)
constr.mat <- c(.18, .02, .01, .78, .01)
RHS <- c(20076, 8619, 145, 465527, 5205)
soln <- lp("max", 1, constr.mat, "<=", RHS)
soln$solution
## [1] 14500
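A quick check, with plain base R arithmetic, that this optimum respects every constraint (the third constraint is the binding one: 0.01 * 14500 = 145):
all(constr.mat * soln$solution <= RHS)
## [1] TRUE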
Of course, as pointed out in the comments below the question this problem can be solved trivially without linear programming by taking the least upper bound of x:
min(RHS / constr.mat)
## [1] 14500
Note
If what you really meant was not the problem stated in the question but rather this 5 variable problem:
max 0.18 * x1 + 0.02 * x2 + 0.01 * x3 + 0.78 * x4 + 0.01 * x5
such that
0.18 * x1 <= 20076
0.02 * x2 <= 8619
0.01 * x3 <= 145
0.78 * x4 <= 465527
0.01 * x5 <= 5205
then we have
soln2 <- lp("max", constr.mat, diag(constr.mat), "<=", RHS)
soln2$solution
## [1] 111533.3 430950.0 14500.0 596829.5 520500.0
Again this is trivial to compute without linear programming:
RHS / constr.mat
## [1] 111533.3 430950.0 14500.0 596829.5 520500.0

What if I want a single linear regression model rather than an "mlm"?

I have shared the top 9 rows of the data I am working on (y0 to y6 are outputs, the rest are inputs):
My objective is to get fitted output data for y0 to y6.
I tried lm function in R using the commands:
lm1 <- lm(cbind(y0, y1, y2, y3, y4, y5, y6) ~ tt + tcb + s + l + b, data = table3)
summary(lm1)
And it has returned 7 sets of coefficients like "Response y0", "Response y1", etc.
What I really want is just 1 set of coefficients which can predict values for outputs y0 to y6.
Could you please help in this?
By cbind(y0, y1, y2, y3, y4, y5, y6) we fit 7 independent models (which may well be the better idea).
For what you are looking for, stack your y* variables, replicate other independent variables and do a single regression.
## stack the 7 responses and replicate the predictors 7 times
## (assuming all variables live in table3, as in the lm() call above)
Y <- with(table3, c(y0, y1, y2, y3, y4, y5, y6))
tt. <- rep(table3$tt, times = 7)
tcb. <- rep(table3$tcb, times = 7)
s. <- rep(table3$s, times = 7)
l. <- rep(table3$l, times = 7)
b. <- rep(table3$b, times = 7)
fit <- lm(Y ~ tt. + tcb. + s. + l. + b.)
Predicted values for y* are
matrix(fitted(fit), ncol = 7)
For readers other than the OP: here is a tiny reproducible example (with only one covariate x and two replicates y1, y2) to help you digest the issue.
set.seed(0)
dat_wide <- data.frame(x = round(runif(4), 2),
                       y1 = round(runif(4), 2),
                       y2 = round(runif(4), 2))
# x y1 y2
#1 0.90 0.91 0.66
#2 0.27 0.20 0.63
#3 0.37 0.90 0.06
#4 0.57 0.94 0.21
## The original "mlm"
fit_mlm <- lm(cbind(y1, y2) ~ x, data = dat_wide)
Instead of doing c(y1, y2) and rep(x, times = 2), I would use the reshape function from the base R package stats, as such an operation is essentially a "wide" to "long" dataset reshaping.
dat_long <- stats::reshape(dat_wide,          ## wide dataset
                           varying = 2:3,     ## columns 2:3 are replicates
                           v.names = "y",     ## the stacked variable is called "y"
                           direction = "long" ## reshape to "long" format
                           )
# x time y id
#1.1 0.90 1 0.91 1
#2.1 0.27 1 0.20 2
#3.1 0.37 1 0.90 3
#4.1 0.57 1 0.94 4
#1.2 0.90 2 0.66 1
#2.2 0.27 2 0.63 2
#3.2 0.37 2 0.06 3
#4.2 0.57 2 0.21 4
Extra variables time and id are created. The former tells which replicate a case comes from; the latter tells which record that case is within a replicate.
To fit the same model for all replicates, we do
fit1 <- lm(y ~ x, data = dat_long)
#(Intercept) x
# 0.2578 0.5801
matrix(fitted(fit1), ncol = 2) ## there are two replicates
# [,1] [,2]
#[1,] 0.7798257 0.7798257
#[2,] 0.4143822 0.4143822
#[3,] 0.4723891 0.4723891
#[4,] 0.5884029 0.5884029
Don't be surprised that two columns are identical; there is only a single set of regression coefficients for both replicates after all.
If you think carefully, you can do the following instead:
dat_wide$ymean <- rowMeans(dat_wide[2:3]) ## average all replicates
fit2 <- lm(ymean ~ x, data = dat_wide)
#(Intercept) x
# 0.2578 0.5801
and we will get the same point estimates. Standard errors and other summary statistics differ, as the two models have different sample sizes.
coef(summary(fit1))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.2998382 0.8596755 0.4229808
#x 0.5800691 0.5171354 1.1216967 0.3048657
coef(summary(fit2))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.01385864 18.59949 0.002878193
#x 0.5800691 0.02390220 24.26844 0.001693604

"Missing value or an infinity produced when evaluating the model"

I'm trying to fit a two-component decay model in R using the nls function, but I'm running into errors. The equation is:

Ctot = p1 * C1 * exp(-k1 * t) + p2 * C2 * exp(-k2 * t)

where t is time, Ctot is C1 + C2, and p1 and p2 are known proportions of Ctot.
My data (dd) is:
> head(dd,n=15)
t Ctot
1 0.00 6.62
2 0.33 6.45
3 0.50 6.38
4 0.67 6.44
5 0.83 6.38
6 1.00 6.39
7 1.17 6.35
8 1.33 6.33
9 1.50 6.33
10 1.67 6.28
11 1.83 6.17
12 2.00 6.11
13 2.17 6.07
14 2.33 5.89
15 2.50 5.86
Using nls I have tried:
p1 <- 0.3
p2 <- 0.7
z <- nls(Ctot~(p1*C1*(exp(-k1*t)))+(p2*C2*(exp(-k2*t))), data=dd, start=list(C1=6, C2=0.1, k1=0.01, k2=0.01))
However I am getting:
Error in numericDeriv(form[[3L]], names(ind), env) :
  Missing value or an infinity produced when evaluating the model
I would be grateful if anyone has suggestions!
The data seems fairly limited, and it is clearly incomplete since it is only the head. If we make up some data for testing methods ... and leave out the confusing p1 and p2:
t <- seq(0, 20, by = .3)
Ctot <- 3 * exp(-1 * t) + 4 * exp(-5 * t)
# following the example on gnm::gnm's help page:
library(gnm)
saved.fits <- list()
for (i in 1:10) {
  saved.fits[[i]] <- suppressWarnings(gnm(Ctot ~ Exp(1 + t, inst = 1) +
                                            Exp(1 + t, inst = 2),
                                          verbose = FALSE))
}
plot(Ctot ~ t)
lines(saved.fits[[3]]$fitted ~ t, col = "red")
I wasn't familiar with the gnm package, so I ended up reading the first few sections and then the worked two-component data-fitting example in its vignette: https://cran.r-project.org/web/packages/gnm/vignettes/gnmOverview.pdf . Most of the fits will be as expected, but some will find a local maximum in likelihood that is not the global max:
> saved.fits[[1]]$coefficients
(Intercept) Exp(. + t, inst = 1).(Intercept)
1.479909e-12 1.098612e+00
Exp(1 + ., inst = 1).t Exp(. + t, inst = 2).(Intercept)
-1.000000e+00 1.386294e+00
Exp(1 + ., inst = 2).t
-5.000000e+00
attr(,"eliminated")
[1] 0
> exp( saved.fits[[1]]$coefficients[4] )
Exp(. + t, inst = 2).(Intercept)
4
> exp( saved.fits[[1]]$coefficients[2] )
Exp(. + t, inst = 1).(Intercept)
3
With the data shown in the question the two-exponential model does not seem to work well, but if you are open to other parametric models then this 3-parameter model seems reasonable:
fm <- nls(Ctot ~ 1 / (a + b * t^c), dd, start = list(a = 1, b = 1, c = 1))
plot(dd)
lines(fitted(fm) ~ t, dd, col = "red")
