I am using the survey package by Thomas Lumley to create cross tabs and SE's. I am struggling to specify the denominator of the cross tabulation.
This is my data:
library(survey)
library(tidyverse)  # read_table2(), %>%, mutate(), filter(), str_replace()
data <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt id
yes 3 Yes NA High M 1.3 A
NA 4 No 2 Med F 0.4 B
no 2 NA 4 Low F 1.2 C
maybe 3 No 2 High M 0.5 D
yes NA No NA High M 0.7 E
no 2 Yes 3 Low F 0.56 F
maybe 4 Yes 2 Med F 0.9 G")
Create the design object:
design <- svydesign(id =~id,
weights = ~wgt,
nest = FALSE,
data = data)
To find the cross tabulation of Q50_1 by Female:
svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)
This gives me:
mean SE
interaction(Q50_1, gender == "F")maybe.FALSE 0.096899 0.1043
interaction(Q50_1, gender == "F")no.FALSE 0.000000 0.0000
interaction(Q50_1, gender == "F")yes.FALSE 0.387597 0.2331
interaction(Q50_1, gender == "F")maybe.TRUE 0.174419 0.1725
interaction(Q50_1, gender == "F")no.TRUE 0.341085 0.2233
interaction(Q50_1, gender == "F")yes.TRUE 0.000000 0.0000
This is not as useful to me because the denominator includes TRUE FALSE values for every combination, whereas I am only interested in the mean that is true. So, I could easily find the percentage of TRUE as follows:
dat <- as.data.frame(svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)) %>% tibble::rownames_to_column("question")
dat %>% tidyr::separate(question,c("question",'response'), sep = "\\)", extra = "merge") %>%
mutate(question = str_replace(question,"interaction\\("," ")) %>%
tidyr::separate(response,c('value', 'bool'), sep ="\\." ) %>%
tidyr::separate(question,c('question', 'group'), sep ="\\," ) %>%
tidyr::separate(group,c('group_level', 'group'), sep ="\\==" ) %>%
filter(bool=='TRUE') %>%
group_by(question, group_level, group) %>%
mutate(sum_true = sum(mean)) %>%
mutate(mean= mean/sum_true)
This gives me:
question group_level group value bool mean SE sum_true
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
" Q50_1" " gender " " \"F\"" maybe TRUE 0.338 0.173 0.516
" Q50_1" " gender " " \"F\"" no TRUE 0.662 0.223 0.516
" Q50_1" " gender " " \"F\"" yes TRUE 0 0 0.516
The means are exactly what I want, but the SEs are associated with a different denominator and don't represent the manipulated mean. Is there a way to call the svymean to present the mean and SE of ONLY the TRUE values in the denominator?
I thought something like this might do (but it does not work):
svymean(~interaction(Q50_1,gender=="F"[TRUE]), design, na.rm = T)
My desired outcome (the SE's are fake):
mean SE
interaction(Q50_1, gender == "F"[TRUE])maybe.TRUE 0.338 0.0725
interaction(Q50_1, gender == "F"[TRUE])no.TRUE 0.662 0.0233
interaction(Q50_1, gender == "F"[TRUE])yes.TRUE 0.0 0.0000
To get the percentage of women who responded to each answer you want
svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
or equivalently (because that's how svyby does it)
svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)
If you want to get the empty category as well, you need to convert the ~Q50_1 variable to a factor -- that's the point of factors (vs strings): they know what levels they have.
If you want to be able to extract parts of the output programmatically, use the coef and SE functions:
data$Q50_1<-factor(data$Q50_1)
design <- svydesign(id =~id,
weights = ~wgt,
nest = FALSE,
data = data)
svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)[1,]
coef(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE))
SE(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE))
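The remark about factors knowing their levels can be seen without the survey package. This is a small base R sketch on made-up values (the vector names are hypothetical): a character vector only reports the values that are present, while a factor still reports an empty level with a count of 0.

```r
# Toy responses: nobody in this subgroup answered "yes"
answers_chr <- c("no", "no", "maybe")

# Declaring the levels tells R that "yes" is a possible answer
answers_fct <- factor(answers_chr, levels = c("maybe", "no", "yes"))

table(answers_chr)  # lists only "maybe" and "no"
table(answers_fct)  # also lists "yes", with a count of 0
```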
These don't agree with what you got using ~interaction, because what you got that way doesn't match what you said you want. The interaction analysis gives you the percentage of people who are women and also responded yes, not the percentage of yes responses among women. To put it another way, the 6 percentages you get with the interaction analysis add to 100%, not to 200%.
> sum(coef(svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)))
[1] 1
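The difference between the two denominators can be checked by hand. The sketch below uses only base R and the seven rows from the question; it reproduces the point estimates (not the standard errors, which need the survey design). The object names are my own.

```r
# The question's data, reduced to the three columns that matter here
d <- data.frame(
  Q50_1  = c("yes", NA, "no", "maybe", "yes", "no", "maybe"),
  gender = c("M", "F", "F", "M", "M", "F", "F"),
  wgt    = c(1.3, 0.4, 1.2, 0.5, 0.7, 0.56, 0.9)
)
d <- d[!is.na(d$Q50_1), ]  # same rows that na.rm = TRUE keeps

# Weighted totals for every answer-by-gender cell
w <- xtabs(wgt ~ Q50_1 + gender, data = d)

joint  <- prop.table(w)              # denominator: everyone (what interaction() gives)
within <- prop.table(w, margin = 2)  # denominator: each gender separately

round(joint[, "F"], 3)   # maybe 0.174, no 0.341, yes 0.000
round(within[, "F"], 3)  # maybe 0.338, no 0.662, yes 0.000
```

The joint proportions sum to 1 over all six cells, whereas the within-gender proportions sum to 1 inside each gender column.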
I have extracted time series from a few regions of interest in the brain (fMRI), and I have added pairwise correlation (Fisher-Z) values for each subject under columns corresponding to the correlation between two nodes in the brain (for example: stim_lvis3, where stim = stimulation site and lvis3 = left visual network 3). Now, I would like to perform ANOVAs on this dataset to look at the effects and between/within group differences (3 groups x 3 timepoints). My data is already in long format.
*groups= ctbs [10 subjects x 3 timepoints], itbs = [10 subjects x 3 timepoints], and sham [10 subjects x 3 timepoints]
Any suggestions on how this can be done, given that I have 12 columns with connectivity values (stim_lvis3 ... stim_rpcc1)? For example, I have not been able to draw a box plot of the data faceted by both time and group.
How can I perform a two-way mixed ANOVA for all 12 columns, for each group at a specific timepoint, and then compare the groups at each timepoint?
I converted subject, time and group to factors
tbs %>%
group_by(time, group) %>%
get_summary_stats(stim_lVis3, type = "mean_sd")
Error in tbs(.) : could not find function "tbs"
bxp <- ggboxplot(
tbs, x = "time", y = "stim_lvis3",
color = "group", palette = "jco"
)
bxp
Error in FUN(X[[i]], ...) : object 'stim_lvis3' not found
Welcome to SO! It looks like you're new, but in the future, to get great answers quickly, make sure you include sample data in a format that is usable (i.e., the output from dput(head(myData))). Check it out: making R reproducible questions.
I know of two approaches to completing within and between ANOVA. The easier to implement version is from the package ez. The package jmv offers a much more complex write-up but you have an immense amount of control, as well. I'm sure there is more to ez's version, but I haven't worked with that package very much.
I created data to somewhat simulate what you are working with.
library(tidyverse)
library(jmv)
library(ez)
set.seed(35)
df1 <- data.frame(subject = factor(rep(1:15, each = 3)),
time = factor(rep(c(1:3), 15)),
group = factor(rep(1:3, each = 15)),
stim_IVis3 = rnorm(45, .5, .15),
stim_IVis4 = rnorm(45, .51, .125))
head(df1)
summary(df1)
To use ez, it's a pretty straightforward implementation, although I couldn't find an option that allowed for multiple dependent variables. Essentially, you would either need to use pivot_longer and run one dependent variable at a time, or you can use jmv.
For this method, you don't get the post hoc comparisons; the effect size is the generalized η2.
ezANOVA(df1, dv = stim_IVis3, wid = subject, within = time,
between = group, detailed = T)
# $ANOVA
# Effect DFn DFd SSn SSd F p p<.05 ges
# 1 (Intercept) 1 12 12.58700023 0.3872673 390.0252306 1.616255e-10 * 0.92579702
# 2 group 2 12 0.05853169 0.3872673 0.9068417 4.297644e-01 0.05483656
# 3 time 2 24 0.05417372 0.6215855 1.0458491 3.668654e-01 0.05096178
# 4 group:time 4 24 0.06774352 0.6215855 0.6539102 6.298074e-01 0.06292379
#
# $`Mauchly's Test for Sphericity`
# Effect W p p<.05
# 3 time 0.9914977 0.9541232
# 4 group:time 0.9914977 0.9541232
#
# $`Sphericity Corrections`
# Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
# 3 time 0.9915694 0.3664558 1.187039 0.3668654
# 4 group:time 0.9915694 0.6286444 1.187039 0.6298074
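Since one dependent variable is handled per call, several connectivity columns can be covered with a loop. The following is a minimal sketch that swaps in base R's aov() with an Error() term (a classical split-plot mixed ANOVA) instead of ezANOVA, so it runs without extra packages; it assumes the simulated df1 from above and that every dependent variable starts with "stim_". It does not apply sphericity corrections.

```r
# Same simulated data as above
set.seed(35)
df1 <- data.frame(subject = factor(rep(1:15, each = 3)),
                  time    = factor(rep(1:3, 15)),
                  group   = factor(rep(1:3, each = 15)),
                  stim_IVis3 = rnorm(45, .5, .15),
                  stim_IVis4 = rnorm(45, .51, .125))

# Every column holding a connectivity value (assumed naming pattern)
dvs <- grep("^stim_", names(df1), value = TRUE)

# One mixed ANOVA per dependent variable:
# group is between subjects, time is within subjects
fits <- lapply(dvs, function(dv) {
  f <- as.formula(paste(dv, "~ group * time + Error(subject/time)"))
  aov(f, data = df1)
})
names(fits) <- dvs

summary(fits[["stim_IVis3"]])
```

With 12 columns this means 12 separate tests, so some correction for multiple comparisons is worth considering.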
Now to use jmv, you will need to pivot the data wider. Your within-subject data needs to be in separate columns. Since time is the factor that represents the within-subject values in the stim... columns, that's what you need to pivot.
df2 <- df1 %>%
pivot_wider(id_cols = c(subject, group),
names_from = time,
values_from = starts_with("stim"),
names_vary = "fastest")
head(df2)
# # A tibble: 6 × 8
# subject group stim_IVis3_1 stim_IVis3_2 stim_IVis3_3 stim_IVis4_1 stim_IVis4_2 stim_IVis4_3
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0.497 0.505 0.601 0.417 0.776 0.660
# 2 2 1 0.584 0.408 0.737 0.454 0.670 0.650
# 3 3 1 0.741 0.399 0.450 0.306 0.665 0.577
# 4 4 1 0.223 0.450 0.350 0.522 0.342 0.417
# 5 5 1 0.717 0.661 0.581 0.601 0.700 0.686
# 6 6 2 0.534 0.780 0.443 0.420 0.565 0.739
Now you're ready for jmv. bs equates to between-subjects; rm equates to repeated measures.
fit = anovaRM(data = df2,
ss = "3", # type of SS (1, 2, or 3)
bs = list("group"), # between subjects (links with other parameters this way)
bsTerms = list("group"),
rm = list(list(label = "stim_IVs", # within subjects
levels = c(names(df2)[3:8]))), # could also write the column names out: c("stim_IVis3_1", "stim_IVis3_2", ...)
rmCells = list(list(measure = names(df2)[3], # could also write "stim_IVis3_1"
cell = names(df2)[3]), # the group associated in 'rm' level (I used the column name)
list(measure = names(df2)[4],
cell = names(df2)[4]),
list(measure = names(df2)[5],
cell = names(df2)[5]),
list(measure = names(df2)[6],
cell = names(df2)[6]),
list(measure = names(df2)[7],
cell = names(df2)[7]),
list(measure = names(df2)[8],
cell = names(df2)[8])
),
rmTerms = list("stim_IVs"), # groups variable for repeated/within measures
emMeans = list(list("stim_IVs", "group")), # all grouping vars in the ANOVA (for the em tables)
emmPlots = T, # show em plots
emmTables = T, # show em tables
effectSize = c("omega", "partEta"), # multiple options here, see help
spherTests = T, # run the sphericity tests
spherCorr = c("none", "GG", "HF"), # sphericity corrections to report; multiple options here, see help
leveneTest = T, # check homogeneity (p GREATER than .05 is good)
qq = T, # plot normality validation qq plot
postHoc = list("group",c("group","stim_IVs"),"stim_IVs"),
postHocCorr = "tukey") # use TukeyHSD
If you print fit, you're going to get a ton of information. In addition to the content provided by ezANOVA, this provides the post hoc comparisons of each group, time, dependent variable, and the intermix of each; the results of the statistical assumptions: Levene's for homogeneity and a QQ plot of the residuals for normality; and the estimated marginal means table and a plot of those means.
I realize that you have more fields in your data than what I have here. I believe I've provided enough for you to build off of.
I suggest that you determine what you want to learn from the data and choose the algorithm from there.
If you have any questions, let me know.
I wanted to get differences between subgroup means from survey data (with survey weights, psu and strata), but due to a missing observation (NA), I could not do so. Would you mind helping me?
I used the "survey" package, created a survey design, and grouped my observations, which contained an NA (income in the example below), by subgroup (city) using svyby. I also set covmat = TRUE so I could use svycontrast later to compute standard errors. However, when I did so, I got NA.
library(survey)
data <- data.frame(psu = 1:8, city = rep(1:2, 4), income = c(2:8, NA), weights = 1)
svy <- svydesign(id=~psu, data = data, weights =~weights)
svyby(~income,~city, svy, svymean, covmat=TRUE)
city income se
1 1 5 1.195229
2 2 NA NaN
I then tried to add all sorts of NA removals, but none of them seemed to work.
> svyby(~income,~city, svy, svymean, covmat=TRUE, na.rm.by=T, na.rm.all=T)
city income se
1 1 5 1.195229
2 2 NA NaN
svyby(~income,~city, svy, svymean, covmat=TRUE, na.rm = T)
Error in inflmats[[i]][idxs[[i]], ] <- infs[[i]] :
number of items to replace is not a multiple of replacement length
Any advice would be welcome.
Looks like a bug.
A work-around is to subset in advance:
> data <- data.frame(psu = factor(1:8), city = rep(1:2, 4), income = c(2:8, NA), weights = 1)
> svy <- svydesign(id=~psu, data = data, weights =~weights)
> svyby(~income,~city, subset(svy,!is.na(income)), svymean, covmat=TRUE)->a
> a
city income se
1 1 5 1.195229
2 2 5 1.007905
> vcov(a)
1 2
1 1.428571 0.000000
2 0.000000 1.015873
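For the original goal, the difference between the two subgroup means, the estimate and its standard error follow from the means and the covariance matrix printed above. This is a base R sketch of that arithmetic with the numbers copied from the output; with the survey package loaded, I would expect svycontrast(a, c(1, -1)) to report the same thing, but treat that call as an assumption to check.

```r
# Values copied from the svyby output and vcov(a) above
est <- c(city1 = 5, city2 = 5)
V   <- matrix(c(1.428571, 0,
                0,        1.015873), nrow = 2, byrow = TRUE)

# Contrast: city 1 minus city 2
cvec <- c(1, -1)

diff_est <- sum(cvec * est)                     # estimate of the difference
diff_se  <- sqrt(drop(t(cvec) %*% V %*% cvec))  # sqrt(c' V c)

round(c(estimate = diff_est, se = diff_se), 4)  # estimate 0, se about 1.5635
```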
I'm trying to write a function to calculate toplines (as commonly used in polling data). It needs to include both a "percent" and "valid percent" column.
Here's an example
library(tidyverse)
# prepare some data
d <- gss_cat %>%
mutate(tvhours2 = tvhours,
tvhours2 = replace(tvhours2, tvhours > 5 , "6-8"),
tvhours2 = replace(tvhours2, tvhours > 8 , "9+"),
tvhours2 = fct_explicit_na(tvhours2),
# make a weight variable
fakeweight = rnorm(n(), mean = 1, sd = .25))
The following function works as far as it goes:
make.topline <- function(variable, data, weight){
variable <- enquo(variable)
weight <- enquo(weight)
table <- data %>%
# calculate denominator
mutate(total = sum(!!weight)) %>%
# calculate proportions
group_by(!!variable) %>%
summarise(pct = (sum(!!weight)/first(total))*100,
n = sum(!!weight))
table
}
make.topline(variable = tvhours2, data = d, weight = fakeweight)
I'm struggling to implement the valid percent column. Here is the syntax I tried.
make.topline2 <- function(variable, data, weight){
variable <- enquo(variable)
weight <- enquo(weight)
table <- data %>%
# calculate denominator
mutate(total = sum(!!weight),
valid.total = sum(!!weight[!!variable != "(Missing)"])) %>%
# calculate proportions
group_by(!!variable) %>%
summarise(pct = (sum(!!weight)/first(total))*100,
valid.pct = (sum(!!weight)/first(valid.total))*100,
n = sum(!!weight))
table
}
make.topline2(variable = tvhours2, data = d, weight = fakeweight)
This yields the following error:
Error: Base operators are not defined for quosures.
Do you need to unquote the quosure?
# Bad:
myquosure != rhs
# Good:
!!myquosure != rhs
Call `rlang::last_error()` to see a backtrace
I know the problem is in this line, but I don't know how to fix it:
mutate(valid.total = sum(!!weight[!!variable != "(Missing)"]))
You can put parentheses around the !!weight. I think of this as making sure we are using the extract brackets only after weight is unquoted (so an order of operations thing).
That line would then look like:
valid.total = sum((!!weight)[!!variable != "(Missing)"])
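The order-of-operations point can be inspected directly with base R's quote(), which shows how an expression is parsed without evaluating it. In this sketch, weight and cond are placeholder names that are never evaluated.

```r
# Without parentheses, `[` binds tighter than `!`,
# so the subsetting happens inside the !! call
bad <- quote(!!weight[cond])
bad[[1]]        # `!`  (the outermost call is negation)
bad[[2]][[2]]   # weight[cond]  (what the two ! operators wrap)

# With parentheses, the outermost call is the subsetting
good <- quote((!!weight)[cond])
good[[1]]       # `[`
good[[2]]       # (!!weight)
```

So in the failing version, the quosure itself is indexed before it is unquoted, which is why the error message complains about base operators on quosures.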
Alternatively, you could use the new curly-curly operator ({{), which takes the place of the enquo()/!! combination for relatively simple cases like yours. Then your function would look something like
make.topline <- function(variable, data, weight){
table <- data %>%
# calculate denominator
mutate(total = sum({{ weight }}),
valid.total = sum({{ weight }}[{{ variable }} != "(Missing)"])) %>%
# calculate proportions
group_by({{ variable }}) %>%
summarise(pct = (sum({{ weight }})/first(total))*100,
valid.pct = (sum({{ weight }})/first(valid.total))*100,
n = sum({{ weight }}))
table
}
Like the parentheses solution, this runs without error.
make.topline(variable = tvhours2, data = d, weight = fakeweight)
# A tibble: 9 x 4
tvhours2 pct valid.pct n
<fct> <dbl> <dbl> <dbl>
1 0 3.16 5.98 679.
2 1 10.9 20.6 2342.
3 2 14.1 26.6 3022.
4 3 9.10 17.2 1957.
5 4 6.67 12.6 1432.
6 5 3.24 6.13 696.
7 6-8 4.02 7.61 864.
8 9+ 1.67 3.16 358.
9 (Missing) 47.2 89.3 10140.
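Note that in the table above the "(Missing)" row also gets a valid.pct (89.3), which is not a meaningful number because missing responses are excluded from that denominator. Below is a base R sketch of the same calculation that leaves that cell as NA; the function name topline and its arguments are my own, not part of any package.

```r
# Weighted topline with percent and valid percent (hypothetical helper)
topline <- function(x, w, missing_label = "(Missing)") {
  x <- as.character(x)
  x[is.na(x)] <- missing_label

  n     <- tapply(w, x, sum)             # weighted count per response
  valid <- names(n) != missing_label     # TRUE for non-missing responses

  data.frame(
    response  = names(n),
    n         = as.vector(n),
    pct       = as.vector(n) / sum(n) * 100,
    # valid percent uses only non-missing weight; NA for the missing row
    valid.pct = ifelse(valid, as.vector(n) / sum(n[valid]) * 100, NA_real_)
  )
}

res <- topline(c("a", "a", "b", NA), w = c(1, 1, 2, 4))
res
```

In this toy call, "a" and "b" each have 25 percent overall and 50 percent of the valid responses, and the missing row has 50 percent overall.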
I am having trouble trying to apply a custom function to multiple groups within a data frame and mutate it to the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, and the result is divided by the mean of the negative control minus the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition, and as a double check, for each experiment (grouping) the mean of the pct inhib of the - controls should be around 0 and the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
uninhibited <- as.vector(uninhibited)
inhibited <- as.vector(inhibited)
unknown <- as.vector(unknown)
mu_u <- mean(uninhibited, na.rm = TRUE)
mu_i <- mean(inhibited, na.rm = TRUE)
percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below), (1 target, box, and replicate) but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
filter(target == "A", box == "1", replicate == 1)
uninhib <- subset$value[subset$sample == "uninhib"]
inhib <- subset$value[subset$sample == "inhib"]
pct <- subset %>%
mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But get the error that columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
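The random data above draws both controls from the same distribution, which is why the pct values look extreme. The double check from the question (controls averaging 0 and 100 within each group) can be illustrated with a base R sketch on toy data where the controls really differ. The toy data frame, the "test" sample label, and the centre values are all made up for illustration.

```r
percent_inhibition <- function(uninhibited, inhibited, unknown) {
  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)
  (mu_u - unknown) / (mu_u - mu_i) * 100
}

set.seed(1)
toy <- data.frame(
  target = rep(c("A", "B"), each = 6),
  sample = rep(c("uninhib", "inhib", "test"), times = 4)
)
# Made-up signal levels: high when uninhibited, low when inhibited
centre <- c(uninhib = 100, inhib = 10, test = 50)
toy$value <- rnorm(nrow(toy), mean = centre[as.character(toy$sample)], sd = 5)

# Apply the function within each group, using that group's own controls
pct_by_group <- lapply(split(toy, toy$target), function(g) {
  percent_inhibition(g$value[g$sample == "uninhib"],
                     g$value[g$sample == "inhib"],
                     g$value)
})
toy$pct <- unsplit(pct_by_group, toy$target)

# Sanity check: group means of the controls are 0 and 100 by construction
check <- tapply(toy$pct, list(toy$target, toy$sample), mean)
round(check, 1)
```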
Let's say I have 2 (categorical) variables and one continuous:
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!
ds %>% group_by(sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 2 x 2
sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.0436
2 1.00 10.1640
The same thing happens when I regress depression on schooling. The intercept value is 10.4398. The mean of schooling = 0 is the same.
ds %>% group_by(schooling_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 3 x 2
schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
<dbl> <chr>
1 0 10.4398
2 1.00 9.7122
3 2.00 10.1593
Now, when I compute a model with both variables, why is the intercept not the mean when both groups = 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
<dbl> <dbl> <chr>
1 0 0 10.32548
2 0 1.00 10.55404
3 1.00 0 9.59305
4 1.00 1.00 9.83139
5 2.00 0 10.21218
6 2.00 1.00 10.10648
When I predict the model when both are 0:
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
1
10.37956
This result is related to depression (of course...) but still not what I was expecting, since:
(Reference: https://www.theanalysisfactor.com/interpret-the-intercept/)
This is the same as in this previous forum post.
I am aware that my variables are categorical, and I have adjusted my script accordingly, as you can reproduce using the code below:
Thanks
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
mod <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod2 <- lm(data=ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod3 <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
Two errors in your thinking (although your R code works, so it's not a programming error).
First and foremost, you violated your own statement: you have not dummy coded schooling. It does not have only zeroes and ones; it has 0, 1 and 2.
Second, you forgot the interaction effect in your lm modeling...
Try this...
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
# if you explicitly make these variables factors not integers R will do the right thing with them
ds$schooling_dummy<-factor(ds$schooling_dummy)
ds$sex_dummy<-factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)
The results give you the mean(s) you were expecting...
Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)
Coefficients:
(Intercept) schooling_dummy1 schooling_dummy2
10.325482 -0.732433 -0.113305
sex_dummy1 schooling_dummy1:sex_dummy1 schooling_dummy2:sex_dummy1
0.228561 0.009778 -0.334254
and FWIW you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with... so if your data is coded this way:
ds <- data.frame(
depression=rnorm(90,10,2),
schooling=c("A","B","C"),
sex=c("Male","Female")
)
You're less likely to make the same mistake plus the results are easier to read...
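To see where the additive model's intercept comes from, here is a sketch using the question's simulated data. Because the design is balanced (each of the six cells has 15 rows), the additive fit for the reference cell should equal the sex = 0 mean plus the schooling = 0 mean minus the grand mean, which with the question's numbers is 10.0436 + 10.4398 - 10.1038 = 10.3796. Only the interaction model reproduces the cell mean itself. The column and object names are my own.

```r
set.seed(123)
ds <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling  = factor(rep(0:2, length.out = 90)),
  sex        = factor(rep(0:1, length.out = 90))
)

additive <- lm(depression ~ sex + schooling, data = ds)  # no interaction
full     <- lm(depression ~ sex * schooling, data = ds)  # with interaction

# Mean of the reference cell (sex = 0, schooling = 0)
cell_mean <- mean(ds$depression[ds$sex == "0" & ds$schooling == "0"])

# What the additive model predicts for that cell in a balanced design
from_margins <- mean(ds$depression[ds$sex == "0"]) +
  mean(ds$depression[ds$schooling == "0"]) -
  mean(ds$depression)

c(additive_intercept    = coef(additive)[[1]],
  from_margins          = from_margins,
  interaction_intercept = coef(full)[[1]],
  cell_mean             = cell_mean)
```

The first two numbers agree with each other, and the last two agree with each other; the gap between the pairs is the part of the cell mean that only an interaction term can capture.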