I have a dataset with age as both a continuous variable and a factor, sex as a factor, and several binary group indicators.
structure(list(Age = c(9, 12, 16, 57), Age_1 = structure(c(2L,
3L, 3L, 7L), .Label = c("8", "1", "2", "3", "4", "5", "6", "7"
), class = "factor"), Sex = structure(c(2L, 1L, 2L, 1L), .Label = c("M",
"F", "U"), class = "factor"), N = structure(c(2L, 2L, 2L,
2L), .Label = c("0", "1"), class = "factor"), G = structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), L_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), C_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), G_1 =
structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), m = structure(c(1L,
1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), A = c(1,
1, 1, 1)), row.names = c(NA, 4L), class = "data.frame")
I want to run a logistic regression of each group (N, G, L_1, C_1, G_1, m) on each variable (Age, Age_1 and Sex).
For example:
mylogit <- glm(N ~ Sex, data = logistic_s, family = "binomial")
mylogit <- glm(N ~ Age, data = logistic_s, family = "binomial")
I am using gtsummary for combining the variables in a table.
library(gtsummary)
tbl_n <-
tbl_uvregression(
logistic_s[c("N", "Age", "Sex", "Age_1")],
method = glm,
y = N,
method.args = list(family = binomial),
exponentiate = TRUE
)
tbl_n
This produces the output for one group (eg. N) with the variables Age, Age_1, Sex.
I want to repeat this with each of the groups (eg N, G, L_1 etc) and then combine the tables to make one combined table.
I am open to using different packages if there are other options that suit this better. I want to make a table that can be exported to Word.
I agree that some reproducible code would be helpful. I am not 100% certain what kind of output you're looking to get. You want to build some univariate logistic regression models, separately for 2 or more groups?
If that is correct, here is one way to go about it:
I will use the trial data set in the gtsummary package as an example. I'll make the grouping variable treatment (trt).
library(gtsummary)
library(tidyverse)
trial_subset <-
trial %>%
select(trt, response, age, marker, grade)
We'll begin by constructing univariate regression tables stratified by trt using the tbl_uvregression() function from the gtsummary package. They will be stored in a new list column of the data frame called tbl_uv.
df_uv <-
trial_subset %>%
# group by trt, and nest data within group
group_by(trt) %>%
nest() %>%
# build univariate logistic regression models separately within grouping variable
mutate(
tbl_uv = map(
data,
~tbl_uvregression(
data = .x,
y = response,
method = glm,
method.args = list(family = binomial),
exponentiate = TRUE
)
)
)
#> # A tibble: 2 x 3
#> # Groups: trt [2]
#> trt data tbl_uv
#> <chr> <list> <list>
#> 1 Drug A <tibble [98 x 4]> <tbl_vrgr>
#> 2 Drug B <tibble [102 x 4]> <tbl_vrgr>
We can now use the tables saved in df_uv to merge them into a single table using the tbl_merge() function.
tbl_merge(
tbls = df_uv$tbl_uv, # list of the univariate logistic regression tables
tab_spanner = paste0("**", df_uv$trt, "**") # adding stars to bold the header
)
This produces a single table with one set of columns per treatment group. I hope this is helpful!
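If the goal is one table per outcome column (N, G, L_1, ...) rather than per stratum, the same idea can be sketched by looping over outcome names. This is only a sketch under assumptions: the gtsummary `trial` data stands in for the asker's data, `response` and `death` stand in for the binary group columns, and the helper names `outcomes`, `predictors`, `tbls` and `merged` are made up for illustration.

```r
library(gtsummary)

# Assumption: `trial` replaces logistic_s; `response` and `death` play the
# role of the binary group columns (N, G, L_1, ...).
outcomes   <- c("response", "death")
predictors <- c("age", "grade", "marker")

tbls <- lapply(outcomes, function(out) {
  dat <- trial[c(out, predictors)]
  # Give the outcome a fixed name so that `y = outcome` works for every column
  names(dat)[1] <- "outcome"
  tbl_uvregression(
    dat,
    method = glm,
    y = outcome,
    method.args = list(family = binomial),
    exponentiate = TRUE
  )
})

# One column group per outcome, labelled with the original column name
merged <- tbl_merge(tbls = tbls, tab_spanner = paste0("**", outcomes, "**"))

# Export to Word (requires the flextable package to be installed):
# flextable::save_as_docx(as_flex_table(merged), path = "uv_tables.docx")
```

Renaming the outcome column inside the loop avoids having to pass a column name stored in a string to the `y` argument.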
Essentially what I want to do is automate this for all my data:
plants_A<-plants_sorted[plants_sorted[, 'treatment']== 'A', ]
plants_A1<-plants_A[plants_A[, 'replicate']== '1', ]
lm(weight~time, data = plants_A1)
From 'plants' I want to make lm's for all treatment and replicate combinations.
I have also managed to split the data using:
plants_treat_repl <- split(plants, paste(plants$treatment, plants$replicate))
But I can't seem to make lm's from this split data.
# sample data
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 1L, 2L ), .Label = c("A",
"B", "C", "D"), class = "factor"), replicate = c(1, 2, 3, 4, 5, 1), time =
structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("6", "8", "10", "12", "14",
"16"), class = "factor"), weight = c(2, 0, 0, 0, 0.5, 2.6), trtrep =
structure(c(1L, 5L, 9L, 13L, 17L, 2L), .Label = c("A.1", "B.1", "C.1", "D.1",
"A.2", "B.2", "C.2", "D.2", "A.3", "B.3", "C.3", "D.3", "A.4", "B.4", "C.4",
"D.4", "A.5", "B.5", "C.5", "D.5"), class = "factor")), row.names = c(NA, 6L),
class = "data.frame")
Let's transition this to a reproducible example using built-in data. (Feel free to share reproducible sample data in the question - dput(your_data[1:10, ]) is a great way to share data reproducibly, pick an appropriate subset.)
## split the data by group
mt_split = split(mtcars, mtcars$cyl)
## fit models to each group
mods = lapply(mt_split, lm, formula = mpg ~ wt)
## extract the coefficients from each model
lapply(mods, coef)
# $`4`
# (Intercept) wt
# 39.571196 -5.647025
#
# $`6`
# (Intercept) wt
# 28.408845 -2.780106
#
# $`8`
# (Intercept) wt
# 23.868029 -2.192438
Try the lapply function:
lms <- lapply(split(plants, paste(plants$treatment,plants$replicate)),
function(x)lm(weight~time, data = x))
lms is a list of all the possible lm fits, one per treatment and replicate combination.
You haven't told us exactly what doesn't work, but:
plants <- transform(plants, trtrep = interaction(treatment, replicate))
lme4::lmList(weight~time | trtrep, data=plants)
should work. You might get away with defining the grouping factor on the fly:
lme4::lmList(weight~time | interaction(treatment, replicate), data=plants)
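To make the split-and-fit approach concrete, here is a minimal base R sketch. The data are simulated (the sample in the question has only one row per group, too few to fit a line), `time` is treated as numeric rather than as a factor, and the names `groups`, `fits` and `coef_table` are illustrative only.

```r
set.seed(1)

# Simulated stand-in for `plants`: 2 treatments x 2 replicates x 4 time points
plants <- expand.grid(
  treatment = c("A", "B"),
  replicate = 1:2,
  time      = c(6, 8, 10, 12)
)
plants$weight <- 0.5 * plants$time + rnorm(nrow(plants))

# One data frame per treatment-by-replicate combination
groups <- split(
  plants,
  interaction(plants$treatment, plants$replicate, drop = TRUE)
)

# Fit the same model within each group
fits <- lapply(groups, function(d) lm(weight ~ time, data = d))

# Collect intercepts and slopes into one matrix, one row per group
coef_table <- do.call(rbind, lapply(fits, coef))
coef_table
```

Row names of `coef_table` follow the `interaction()` labels (for example `A.1`), which makes it easy to match coefficients back to groups.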
I'm trying to add significance annotations to an errorbar plot with a factor x-axis and dodged groups within each level of the x-axis. It is a similar but NOT identical use case to this
My base errorbar plot is:
library(ggplot2)
library(dplyr)
library(tidyr) # gather() and spread() are used further down
pres_prob_pd = structure(list(x = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), labels = c(`1` = 1,
`2` = 2, `3` = 3)), predicted = c(0.571584427222816, 0.712630712634987,
0.156061969566517, 0.0162388386564817, 0.0371877245103279, 0.0165022541901018,
0.131528946944238, 0.35927812866896, 0.0708662221985375), std.error = c(0.355802875027348,
0.471253661425626, 0.457109887762665, 0.352871728451576, 0.442646879181155,
0.425913568532558, 0.376552208691762, 0.48178172708116, 0.451758041335245
), conf.low = c(0.399141779923204, 0.496138837620712, 0.0701919316506831,
0.00819832576725402, 0.0159620304815404, 0.00722904089045731,
0.0675129352870401, 0.17905347369819, 0.030504893442457), conf.high = c(0.728233665534388,
0.861980236164486, 0.311759350126477, 0.031911364587827, 0.0842227723261319,
0.0372248587668487, 0.240584344249407, 0.590437963881823, 0.156035177669385
), group = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain",
"neutral", "uncertain"), class = "factor"), group_col = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain", "neutral",
"uncertain"), class = "factor"), language = structure(c(2L, 2L,
2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("english", "dutch", "german"
), class = "factor"), top = c(0.861980236164486, 0.861980236164486,
0.861980236164486, 0.0842227723261319, 0.0842227723261319, 0.0842227723261319,
0.590437963881823, 0.590437963881823, 0.590437963881823)), row.names = c(NA,
-9L), groups = structure(list(language = structure(1:3, .Label = c("english",
"dutch", "german"), class = "factor"), .rows = structure(list(
4:6, 1:3, 7:9), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
#dodge
pd = position_dodge(.75)
#plot
p = ggplot(pres_prob_pd,aes(x=language,y=predicted,color=group,shape=group)) +
geom_point(position=pd,size=2) +
geom_errorbar(aes(ymax=conf.high,ymin=conf.low),width=.125,position=pd)
p
What I want to do is annotate the plot such that the contrasts between group within each level of language are annotated for significance. I've plotted points representing the relevant contrasts and (toy) sig. annotations as follows:
#bump function
f = function(x){
v = c()
bump=0.025
constant = 0
for(i in x){
v = c(v,i+constant+bump)
bump = bump + 0.075
}
v
}
#create contrasts
combs = data.frame(gtools::combinations(3, 2, v=c("certain", "neutral", "uncertain"), set=F, repeats.allowed=F)) %>%
mutate(contrast=c("cont_1","cont_2","cont_3"))
combs = rbind(combs %>% mutate(language = 'english'),
combs %>% mutate(language='dutch'),
combs %>% mutate(language = "german")) %>%
left_join(select(pres_prob_pd,language:top)%>%distinct(),by='language') %>%
group_by(language)
#long transform and calc y_pos
combs_long = mutate(combs,y_pos=f(top)) %>% gather(long, probability, X1:X2, factor_key=TRUE) %>% mutate(language=factor(language,levels=c("english","dutch","german"))) %>%
arrange(language,contrast)
#back to wide
combs_wide =combs_long %>% spread(long,probability)
combs_wide$p = rep(c('***',"*","ns"),3)
#plot
p +
geom_point(data=combs_long,
aes(x = language,
color=probability,
shape=probability,
y=y_pos),
inherit.aes = T,
position=pd,
size=2) +
geom_text(data=combs_wide,
aes(x=language,
label=p,
y=y_pos+.025,
group=X1),
color='black',
position=position_dodge(.75),
inherit.aes = F)
What I am failing to achieve is plotting a line connecting each of the contrasts of group within each level of language, as is standard when annotating significant group-wise differences. Any help much appreciated!
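One possible way to get the connecting lines is to compute the dodged x positions by hand and draw them with geom_segment(). The sketch below assumes that position_dodge(width) shifts group i of n by width * ((i - 0.5) / n - 0.5), which matches how zero-width geoms such as points appear to be placed. The helper `dodge_offset`, the data frame `brackets` and the heights are hypothetical; the `top` values are rounded from the data above.

```r
library(ggplot2)

# Hypothetical helper: x offset of group i (out of n) under position_dodge(width)
dodge_offset <- function(i, n, width) width * ((i - 0.5) / n - 0.5)

n_groups <- 3                                # certain, neutral, uncertain
pairs    <- t(combn(n_groups, 2))            # rows: (1,2), (1,3), (2,3)
top      <- c(0.0842, 0.862, 0.590)          # english, dutch, german (rounded)

# One row per contrast within each language (language coded 1..3 on the x axis)
brackets <- expand.grid(pair = seq_len(nrow(pairs)), lang = seq_along(top))
brackets$x     <- brackets$lang + dodge_offset(pairs[brackets$pair, 1], n_groups, 0.75)
brackets$xend  <- brackets$lang + dodge_offset(pairs[brackets$pair, 2], n_groups, 0.75)
brackets$y     <- top[brackets$lang] + 0.05 * brackets$pair   # stack the brackets
brackets$label <- rep(c("***", "*", "ns"), times = length(top))  # toy labels

bracket_layers <- list(
  geom_segment(data = brackets,
               aes(x = x, xend = xend, y = y, yend = y),
               inherit.aes = FALSE),
  geom_text(data = brackets,
            aes(x = (x + xend) / 2, y = y + 0.02, label = label),
            inherit.aes = FALSE)
)

# Add to the base plot from the question:
# p + bracket_layers
```

Numeric x values can be mixed with the discrete language axis because each factor level sits at an integer position (1, 2, 3).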
I am trying to perform a Tukey's HSD test or an LSD test on my data. I have two factors, Collection (2 treatments) and Irrigation (5 treatments), and want to do the test on the Sucrose responses from each combination, so 10 total treatments.
Data:
structure(list(Collection = structure(c(1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("1", "2"), class = "factor"), Irrigation = structure(c(1L,
2L, 3L, 4L, 5L, 1L), .Label = c("Rate1", "Rate2", "Rate3", "Rate4",
"Rate5"), class = "factor"), meanSuc = c(0.585416666666667, 0.5032,
0.61375, 0.602775, 0.688466666666667, 0.545133333333333)), row.names =
c(NA,
-6L), groups = structure(list(Collection = structure(1:2, .Label = c("1",
"2"), class = "factor"), .rows = list(1:5, 6L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE), class =
c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Attempt at combining treatments into a column and using agricolae to perform the test:
Tukey_data <- dataAvgSucCI %>%
mutate(Tukey_ID = paste(Collection, Irrigation, sep="_"))
TukeyAov <- aov(meanSuc ~ Tukey_ID,Tukey_data)
HSD.test(TukeyAov, "Tukey_ID", group=TRUE)
Error message:
Error in if (pvalue[k] <= 0.001) sig[k] <- "***" else if (pvalue[k] <=
:
missing value where TRUE/FALSE needed
In addition: Warning message:
In qtukey(1 - alpha, ntr, DFerror) : NaNs produced
How should I edit my code to make it work?
Or would I be better off writing something entirely different?
The data have to look like this (One way ANOVA):
Collection = rep(1:2, times = 1, each = 5)
Irrigation = rep(1:5, times = 2, each = 1)
meanSuc = rnorm(10, mean = 0, sd = 1)
d = data.frame(Collection, Irrigation, meanSuc)
fit = aov(meanSuc ~ as.factor(Collection), data=d)
TukeyHSD(fit)
or Two way ANOVA:
fit2 = aov(meanSuc ~ as.factor(Collection) + as.factor(Irrigation), data = d)
TukeyHSD(fit2)
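I think you want to perform a two-way ANOVA. As AkselA said, there is no variation in your target variable (meanSuc) within groups if you fit a one-way ANOVA on the combined factor as you did: each Collection-by-Irrigation combination appears to have a single mean, so no residual degrees of freedom are left for the test.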
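The degrees-of-freedom point can be checked with a small simulation. This is a sketch on made-up data: the column names `rep` and `Sucrose`, the four replicates per combination and the object names are assumptions, not the asker's data.

```r
set.seed(42)

# Simulated raw data: 4 replicate measurements per Collection x Irrigation cell
d <- expand.grid(
  rep        = 1:4,
  Irrigation = paste0("Rate", 1:5),
  Collection = c("1", "2")
)
d$Sucrose  <- rnorm(nrow(d), mean = 0.6, sd = 0.05)
d$Tukey_ID <- interaction(d$Collection, d$Irrigation, sep = "_")

# With replicates, the combined-factor one-way ANOVA has residual df to work with
fit_raw <- aov(Sucrose ~ Tukey_ID, data = d)
df_raw  <- df.residual(fit_raw)          # 40 observations - 10 groups
hsd     <- TukeyHSD(fit_raw)
n_comparisons <- nrow(hsd$Tukey_ID)      # all pairs among 10 treatments

# Collapsing to one mean per combination leaves nothing to estimate error from
means    <- aggregate(Sucrose ~ Tukey_ID, data = d, FUN = mean)
df_means <- df.residual(aov(Sucrose ~ Tukey_ID, data = means))

c(df_raw = df_raw, n_comparisons = n_comparisons, df_means = df_means)
```

If the raw (unaveraged) sucrose measurements are available, fitting the model to those rather than to the cell means is what lets a Tukey test on all ten combinations run.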
I want to fit a linear mixed model to a dependent variable DV that is collected under two levels of Condition at three levels of Timepoint. The data are structured as follows:
## dput(head(RawData,5))
structure(list(Participant = structure(c(2L, 2L, 2L, 2L, 4L),
.Label = c("Jessie", "James", "Gus", "Hudson", "Flossy",
"Bobby", "Thomas", "Alfie", "Charles", "Will", "Mat", "Paul", "Tim",
"John", "Toby", "Blair"), class = "factor"),
xVarCondition = c(1, 1, 0, 0, 1),
Measure = structure(c(1L, 2L, 3L, 4L, 1L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12"), class = "factor"),
Sample = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("1", "2"), class = "factor"),
Condition = structure(c(2L, 2L, 1L, 1L, 2L),
.Label = c("AM", "PM"), class = "factor"),
Timepoint = structure(c(2L, 2L, 2L, 2L, 1L),
.Label = c("Baseline", "Mid", "Post"), class = "factor"),
DV = c(83.6381348645853, 86.9813802115179, 69.2691666620429,
71.3949807856125, 87.8931998204771)),
.Names = c("Participant", "xVarCondition", "Measure",
"Sample", "Condition", "Timepoint", "DV"),
row.names = c(NA, 5L), class = "data.frame")
Each Participant performs two trials per Condition across three Timepoints as depicted by Measure; however, there are missing data so not necessarily 12 levels per participant. The column xVarCondition is simply a dummy variable that includes a 1 for each entry of PM in Condition. The column Sample refers to the 2 trials for each Condition at each Timepoint.
I am an R user but the statistician is a SAS user who believes the code for the model should be:
proc mixed data=RawData covtest cl alpha=&alpha;
class Participant Condition Timepoint Measure Sample;
model &dep=Condition Timepoint/s ddfm=sat outp=pred residual noint;
random int xVarCondition xVarCondition*TimePoint*Sample
TimePoint/subject=Participant s;
The above SAS code gives sensible answers and is working perfectly. We believe the resulting lme4 syntax for the above model to be:
TestModel = lmer(DV ~ Condition + Timepoint +
(1 | Participant/Timepoint) +
(0 + xVarCondition | Participant) +
(1 | Participant:xVarCondition:Measure), data = RawData)
However, I get the following error when running this model:
Error: number of levels of each grouping factor must be < number of observations
Are the random effects specified correctly?
I can't quite tell from your description, but most likely your Participant:xVarCondition:Measure term constructs a grouping variable that has no more than one observation in each level, which makes the (1|Participant:xVarCondition:Measure) term redundant with the residual error term that is always included in an lmer model. You can override the error if you really want to by including
control=lmerControl(check.nobs.vs.nlev = "ignore")
in your function call, but (if I've diagnosed the problem correctly) this will lead to the residual variance and the Participant:xVarCondition:Measure variance being jointly unidentifiable. Such unidentifiability usually doesn't cause any problems with the rest of the model, but I am more comfortable with an identifiable model (there's always the possibility that such unidentifiability will lead to numerical issues).
There's a similar example here.
You can check my conjecture as follows:
ifac <- with(RawData,
interaction(Participant,xVarCondition,Measure,drop=TRUE))
length(levels(ifac)) == nrow(RawData)
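For illustration, here is the same check run on a complete toy design. The data frame `toy` is invented; it mimics the described layout (two trials per Condition at three Timepoints, with Measure indexing the resulting 12 cells) for two participants and no missing data.

```r
# Toy data: 2 participants x 2 conditions x 3 timepoints x 2 samples
toy <- expand.grid(
  Participant = c("James", "Hudson"),
  Condition   = c("AM", "PM"),
  Timepoint   = c("Baseline", "Mid", "Post"),
  Sample      = 1:2
)
# Dummy variable for one Condition level (which level is an assumption here)
toy$xVarCondition <- as.integer(toy$Condition == "PM")
# Measure indexes the 12 Condition x Timepoint x Sample cells per participant
toy$Measure <- interaction(toy$Condition, toy$Timepoint, toy$Sample)

ifac <- with(toy,
             interaction(Participant, xVarCondition, Measure, drop = TRUE))

# TRUE means every level holds exactly one row, i.e. the grouping factor
# is confounded with the residual term
nlevels(ifac) == nrow(toy)

# Distribution of observations per level (all ones in this toy design)
table(table(ifac))
```

With missing data the number of rows drops, but each remaining level would still hold a single observation, so the comparison would still come out TRUE.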
I am interested in knowing if there is a lattice alternative to geom_tile() in ggplot2 when I want to display factor levels/map fill to text. Example data frame (df) follows...
Gene Sample Mutation
A1 2 Missense
A2 2 WT
A1 3 Missense
A2 3 Missense
With ggplot2 this is trivial
qplot(x = Sample, y = Gene, fill = Mutation, data = df, geom = 'tile')
what would the lattice equivalent of this be? (I am interested in this because axis alignment in ggplot2 between plots is convoluted and cumbersome currently).
df <- structure(list(Gene = structure(c(1L, 2L, 1L, 2L), .Label = c("A1", "A2"), class = "factor"),
Sample = structure(c(1L, 1L, 2L, 2L ), .Label = c("2", "3"), class = "factor"),
Mutation = structure(c(1L, 2L, 1L, 1L), .Label = c("Missense", "WT"), class = "factor")), .Names = c("Gene", "Sample", "Mutation"), row.names = c(NA, -4L), class = "data.frame")
Check out the levelplot() function in lattice, for example
library("lattice")
df <- transform(df, Sample = factor(Sample))
levelplot(Mutation ~ Gene * Sample, data = df)
You'll need to work out the colour scale key yourself though.
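A rough sketch of one way to build that key: convert the factor to its integer codes, place the breaks halfway between codes, and label the colour key with the factor levels. The colours and the names `lev`, `cols`, `brks` and `plt` are arbitrary choices for illustration.

```r
library(lattice)

df <- data.frame(
  Gene     = factor(c("A1", "A2", "A1", "A2")),
  Sample   = factor(c("2", "2", "3", "3")),
  Mutation = factor(c("Missense", "WT", "Missense", "Missense"))
)

lev  <- levels(df$Mutation)
cols <- c("tomato", "grey80")                    # one arbitrary colour per level
brks <- seq(0.5, length(lev) + 0.5, by = 1)      # breaks between integer codes

plt <- levelplot(
  as.numeric(Mutation) ~ Gene * Sample,
  data        = df,
  at          = brks,
  col.regions = cols,
  # Put the level names at the integer codes on the colour key
  colorkey    = list(at = brks,
                     labels = list(at = seq_along(lev), labels = lev))
)

# print(plt)  # draws the plot when run interactively or in a script
```

The key then shows one block per mutation type instead of a continuous gradient.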