Creating non-random matched pairs - r

I am looking for an R package that would allow me to match match each subject in a treatment group to a subject in the general population that has similar characteristics (age, gender, etc).

I use the MatchIt package for doing this type of thing. You may receive advice to use propensity score matching, but there are limitations to that widely used approach (see: PS Not)
library(MatchIt) # use for matching
library(tidyverse) # The overall package. It will load lots of dependencies
set.seed(950)
n.size <- 1000
# This creates a tibble (an easier to use version of a data frame)
myData <- tibble(
a = lubridate::now() + runif(n.size) * 86400,
b = lubridate::today() + runif(n.size) * 30,
ID = 1:n.size,
# d = runif(1000),
ivFactor = sample(c("Level 1", "Level 2", "Level 3", "Level 4" ), n.size, replace = TRUE),
age = round(rnorm(n = n.size, mean = 52, sd = 10),2),
outContinuous = rnorm(n = n.size, mean = 100, sd = 10),
tmt = sample(c(1,0), size = n.size, prob = c(.3, .7), replace = TRUE)
)
# Using matching methods suggestions found in Ho, Imai, King and Stuart
myData.balance <- matchit(tmt~age + ivFactor, data = myData, method = "nearest", distance = "logit")
# Check to see if the matching improved balance between treatment and control group
summary(myData.balance)
# Extract the matched data. Now we can use this in subsequent analyses
myData.matched <- match.data(myData.balance)

Related

Apply custom interaction contrast in emmeans

I have been trying to compare a set of interaction contrasts using emmeans() and contrast(). However, I am having trouble applying a custom contrast and then compare it between groups. Here I added an example dataset so you can replicate what I am working with.
In summary, this dataset contains 3 factors: group, location, scenario. Location and scenario are within subject variables and group is a between subject variable. I would like to know if there is a higher measurement in location A compared to B, C and D (I set a contrast in the form c(1, -1/2, -1/2)) and then compare this contrast between groups (a contrast of contrast). So far I managed to get the contrast for each group (working with a single scenario at the moment):
library(afex)
library(emmeans)
data <- data.frame(
id = rep(c(100:189), each = 9),
group = rep(c("W", "X", "Y"), each = 90 * 3),
location = rep(c("A", "B", "C"), each = 3, times = 90),
scenario = rep(c("alpha", "beta", "gamma"), times = 270),
measurement = c(rnorm(length(id)))
)
test_aov <- aov_car(measurement ~ group * scenario * location +
Error(id/scenario * location),
data = data)
location_a_vs_all <- c(1, -1/2, -1/2)
test_effects <- emmeans(test_aov,
specs = ~ location | scenario | group,
at = list(scenario = c("alpha")))
test_contrast <- contrast(test_effects,
method = list("Location A vs All" = location_a_vs_all),
adjust = "none")
But if I try to set the argument interaction to True in the contrast() function I get an error:
test_contrast <- contrast(test_effects,
method = list("Location A vs All" = location_a_vs_all),
interaction = T,
adjust = "none")
Error in contrast.emmGrid(object, interaction[[i]], by = vars[-pos], name = nm, : 'method' must be a list, function, or the basename of an '.emmc' function
The same happens if I put my custom contrast directly in the interaction argument.
test_contrast <- contrast(test_effects,
interaction = list("Location A vs All" = location_a_vs_all),
adjust = "none")
I looked at the documentation on Interaction analysis in CRAN but I am lost on how to correctly implement what I would like to do, so I would appreciate any pointers from you. Thanks in advance!
The solution is to split the analysis in three steps:
library(afex)
library(emmeans)
data <- data.frame(
id = rep(c(100:189), each = 9),
group = rep(c("W", "X", "Y"), each = 90 * 3),
location = rep(c("A", "B", "C"), each = 3, times = 90),
scenario = rep(c("alpha", "beta", "gamma"), times = 270),
measurement = c(rnorm(length(id)))
)
test_aov <- aov_car(measurement ~ group * scenario * location +
Error(id/scenario * location),
data = data)
First I got the effects for the specific scenario:
test_effects <- emmeans(test_aov,
specs = ~ location | group,
at = list(scenario = c("alpha")))
Then I make the contrast comparing one of the locations to the other 2 (which I referred as my custom contrast in the question setup), using the "treatment vs control" method:
location_con <- contrast(test_effects,
method = "trt.vs.ctrl",
name = "A vs Others",
by = "group",
ref = 2:3)
Finally I apply a contrast of contrast to this, to compare the previous result between group W and the others, defining W as my reference level:
group_con <- contrast(location_con,
method = "trt.vs.ctrl",
by = "A vs Others",
ref = "W")

How to add a covariate to a between-within design MANCOVA with R?

I want to run a between-within design MANCOVA with R, with two dependent variables (Planned and Unplanned), two between-subject variables (Genre [Male, Female] and Urb [Yes, No]), one within-subject variable (Period [Before, During]), and one covariate (BMI).
Here is what I've done (see here for similar calculation: https://stats.stackexchange.com/questions/183441/correct-way-to-perform-a-one-way-within-subjects-manova-in-r):
# Create dummy data
data <- data.frame(Quest_before_planned = sample(1:100, 10),
Quest_during_planned = sample(1:100, 10),
Quest_before_unplanned = sample(1:100, 10),
Quest_during_unplanned = sample(1:100, 10),
Genre = sample(rep(c("Male", "Female"), each = 5)),
Urb = sample(rep(c("Yes", "No"), each = 5)),
BMI = sample(1:100, 10))
# Define the within-subjects factor
period <- as.factor(rep(c('before','during'), each = 2))
idata <- data.frame(period)
# Create the data structure for the linear model
data.model <- with(data, cbind(Quest_before_planned, Quest_during_planned,
Quest_before_unplanned, Quest_during_unplanned))
# Build the multivariate-linear model
mod.mlm <- lm(data.model ~ Genre * Urb, data = data_total)
# Run the MANOVA
mav.blpaq <- Anova(mod.mlm, idata = idata, idesign = ~ period, type = 2)
print(mav.blpaq)
Thus, the between-within design MANOVA here works well. However, I failed to add a covariate (i.e., BMI) to this model. Do you know how can I achieve this?
N.B.: I also tried using the (great) mancova() function , which include a covariate parameter; but with this function, I do not know how to specify that Period is a within-subject variable...
blpaq_macov <- mancova(data_tidy,
deps = c("Quest_planned", "Quest_unplanned"),
factors = c("Genre", "Period", "Urb"),
covs = "BMI",
multivar = "pillai")

R Survey library Difference of Means Test

I am currently using R's survey library to analyze survey data. I have two samples from two different time periods. My goal is to test if the difference between the two weighted sample means is equal to 0. Question: How do I approach this using R's survey library?
I have tried two approaches to doing this:
Approach 1: Create two different postStratify objects. Toy example:
q1 = c(1,1,1,1,0)
group = c(0,0,0,1,1)
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
first = svymean(q1, ps_design) #Weighted Mean of first sample
q1 = c(1,1,1,0,0)
g2 = c(1,1,0,0,0)
df2 = data.frame(q1, g2)
pop_data_2 = data.frame(group = c(0,1), Freq = c(20,80))
svd_2 = svydesign(ids = ~1, data = df2)
psd_2 = postStratify(svd_2, strata = ~g2, pop_data_2)
second = svymean(q2, psd_2) #Weighted mean of second sample
The problem with this approach is that I do not know how to conduct the difference of means test on "first" and "Second" - the two svymean objects.
Approach 2: Create only one postStratify object. Toy example:
q1 = c(1,1,1,1,0, 1,1,0,0,1)
group = c(0,0,0,1,1, 0,0,1,1,1)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)
The problem with this approach is that when i run svyby just to check the created mean values, the output of svyby is not what I expect. It puts out mean = 0.5714 for time = 0, when the theoretical weighted mean for that is 0.55. Any insight as to why the theoretical mean differs from that of svyby will be greatly appreciated.
Thank you so much for your time.
are you looking for this? thanks
library(survey)
q1 = c(1,1,1,1,0, 1,1,0,0,1)
# edited #
group = c(0,0,0,1,1, 2,2,3,3,3)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
# edited #
pop_data = data.frame(group = c(0,1,2,3), Freq = c(10,90,20,80))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)

How can a blocking factor be included in makeClassifTask() from mlr package?

In some classification tasks, using mlr package, I need to deal with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model keeping together the values with the same ID, I know from the tutorial that:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.
The question is how can I include this blocking factor in the makeClassifTask?
Unfortunately, I couldn't find any example.
What version of mlr do you have? Blocking should be part of it since a while. You can find it directly as an argument in makeClassifTask
Here is an example for your data:
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# to prove-check that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})
#all entries are empty, blocking indeed works!
The answer by #jakob-r no longer works. My guess is something changed with cv10.
Minor edit to use "blocking.cv = TRUE"
Complete working example:
set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV",iters=10,blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# to prove-check that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})

Histogram in R when using a binary value

I have data of students from several schools. I want to show a histogram of the percentage of all students that passed the test in each school, using R.
My data looks like this (id,school,passed/failed):
432342 school1 passed
454233 school2 failed
543245 school1 failed
etc'
(The point is that I am only interested in the percent of students that passed, obviously those that didn't passed have failed. I want to have one column for each school that shows the percent of the students in that school that passed)
Thanks
there are many ways to do that.
one is:
df<-data.frame(ID=sample(100),
school=factor(sample(3,100,TRUE),labels=c("School1","School2","School3")),
result=factor(sample(2,100,TRUE),labels=c("passed","failed")))
p<-aggregate(df$result=="passed"~school, mean, data=df)
barplot(p[,2]*100,names.arg=p[,1])
My previous answer didn't go all the way. Here's a redo. Example is the one from #eyjo's answer.
students <- 400
schools <- 5
df <- data.frame(
id = 1:students,
school = sample(paste("school", 1:schools, sep = ""), size = students, replace = TRUE),
results = sample(c("passed", "failed"), size = students, replace = TRUE, prob = c(.8, .2)))
r <- aggregate(results ~ school, FUN = table, data = df)
r <- do.call(cbind, r) # "flatten" the result
r <- as.data.frame(cbind(r, sum = rowSums(r)))
r$perc.passed <- round(with(r, (passed/sum) * 100), 0)
library(ggplot2)
ggplot(r, aes(x = school, y = perc.passed)) +
theme_bw() +
geom_bar(stat = "identity")
Since you have individual records (id) and want to calculate based on index (school) I would suggest tapply for this.
students <- 400
schools <- 5
df <- data.frame("id" = 1:students,
"school" = sample(paste("school", 1:schools, sep = ""),
size = students, replace = TRUE),
"results" = sample(c("passed", "failed"),
size = students, replace = TRUE, prob = c(.8, .2)))
p <- tapply(df$results == "passed", df$school, mean) * 100
barplot(p)

Resources