Propensity matching issue in RStudio - r

I am trying to do a propensity matched analysis but am having a lot of trouble. I have a large data set with an exposure coded as 0 (no exposure) and 1 (exposure) and am trying to matched based on a couple of variables. Basically I was trying to follow a tutorial on propensity matching via Coursera but am getting a really weird output. My initial dataset has 2,202 distinct observations. However, once I do the matching, my dataset has 3,074 distinct observations, which is obviously not supposed to happen. It creates a matched sample, but I'm not sure where the additional observations come from...
Does anyone know what I'm doing wrong? I have been trying to troubleshoot for the past week but keep coming up empty handed.
Here is what I'm doing:
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$privateinsurance)
language <- as.numeric(cohort$primarylanguage)
bloodpressure <- as.numeric(cohort$bloodpressure
bmi <- cohort$bmiatdelivery
exp <- as.numeric(cohort$prechange)
out <- as.numeric(cohort$tdapvaccinedate_yn)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out)
propensity <- data.frame(prop_score)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),] # THIS IS WHERE THE PROBLEM OCCURS SHOWING THAT I HAVE 3074 OBSERVATIONS ```

it is not as easy to tackle your question, as your provided code snippet seems to have some issues before the error you report to experience, hence I was not able to reproduce your error. Still, I have created in the following a dummy dataset based on random numbers, and proceeded your steps. I have put down comments where potential errors could arise with your current code. Maybe this already helps!
#Indicate which Packages are needed
library(tableone)
library(Matching)
##Create Reproducible Example Dataset
# I create dummy data with random variables
race = as.numeric(rep(c(1,2),5))
insurance = as.numeric(sample(1:100,10))
language = as.numeric(sample(1:100,10))
bloodpressure = as.numeric(sample(1:100,10))
bmi = sample(1:100,10)
# it would be safer if you renamed exp as it is is also base function
exp = rep(c(1,0),5)
out = sample(1:100,10)
#You did not inlcude here an age variable but will refer to it later
# did you maybe forgot to include it?
age= sample(1:50,10)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out,age)
# In your example a new dataset "prop_score" appeared
# I can only guess you ment the just created dataset propensity
propensity <- data.frame(propensity)
#Input Dimension
dim(propensity)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[,xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),]
#Output Dimensions
dim(matched) #The Dimensions are fine

Thanks for responding! I did ur code and it still didn't work. What do you mean by "it would be safer if you renamed exp as it is is also base function"? I have attached a reprex for my dataset. Does this help?
outcome = c(0, 0, 0, 1, 0),
exposure = c(0, 0, 0, 1, 1),
insurance = c(1, 1, 1, 1, 1),
language = c(3, 1, 1, 1, 1),
age = c(32, 36, 22, 26, 38),
bmi = c(23.8407, 25.354099, 29.709999, 26.9098, 36.290401),
race_eth = as.factor(c("5", "1", "2", "1", "2")),
nullip = as.factor(c("1", "0", "1", "1", "0"))
)
library(tableone)
library(Matching)
#recode variables to use in matching
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$insurance)
language <- as.numeric(cohort$language)
nullip <- as.numeric(cohort$nullip)
age <- cohort$age
bmi <- cohort$bmi
exp <- as.numeric(cohort$exposure)
out <- as.numeric(cohort$outcome)
#create new dataset
prop_score <- cbind(race, insurance, language, nullip, pnc, age, bmi, exp, out)
prop_score <- data.frame(prop_score)
xvars <- c("race", "insurance", "language", "nullip", "pnc", "age", "bmi")
#table 1
table1 <- CreateTableOne(vars=xvars, strata="exp",data=prop, test=FALSE)
print(table1, smd=TRUE)
#matching
greedymatch <- Match(Tr=prop$exp, M=1, X=prop[xvars])
matched <- prop[unlist(greedymatch[c("index.treated", "index.control")]),]`

just add the replace = FALSE to your code
greedymatch <- Match(Tr = propensity$exp, M = 1, X = propensity[xvars], replace = FALSE)

Related

Is there way a way to cbind() unimputed data with an imputationList from jomo imputation, similar to mice?

I am trying to impute a large dataset of covariates, and then cbind it with outcome data for different cohorts/ages of children (so I don't want to impute those outcomes across the whole large set of covariates, because different ages of children got different measures). Here is my syntax
#jomo imputation
set.seed(90291)
TchParImput$icept <- 1
l1miss <- c("TCR_CLD_Q3A", "PAR_QA2D", "PAR_QH1", "TCR_CLD_Q2A", "TCR_CLD_Q2B", "PAR_QI6_YEARS_DERIVED", "PAR_SQ8_DERIVED", "PAR_QB6",
"PAR_QB16_RELCHILD_DERIVED", "PAR_QB16_RELADULT_DERIVED","PAR_QC2B2"
)
l2miss <- c("T_QA17D", "T_QB1B2", "TA_QA17D", "TA_QB1B2")
l1complete <- c("icept")
l2complete <- c("icept")
impdata <- jomo(TchParImput[l1miss], Y2 = TchParImput[l2miss], X = TchParImput[l1complete],
X2 = TchParImput[l2complete], clus = TchParImput$CLASSROOM_ID,
nburn = 2000, nbetween = 2000, nimp = 20, meth = "random")
imp.list <- imputationList(split(imp, imp$Imputation)[-1])
With mice, I've used:
Preschool_data_imp <- filter(imputedmidsobject, CHILDAGE_MONTHS >= 36)
data_imp <- cbind(Preschool_data_imp, data.frame(outcomes))
to then add in the unimputed, complete outcome data, but the imp.list format of jomo is different than mids, and so I am not sure how to proceed. Feedback appreciated

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As #slamballais mentioned in their answer, the issue is with dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})

Propensity Score Matching by using Nearest Neighborhood Method with weights in R

I am performing Propensity Score Matching in R by using one of my variable as weighting Factor, i.e., Weight. I want to do Matching using nearest neighbor method. The sample of my data is:
dput(dat2):
structure(list(ID = c(1, 2, 3, 4, 6, 7),
Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
Age = c("35-44", "<15-24", "25-34", "35-44", ">45", "25-34"),
Treatment = c(1, 0, 0, 1, 0, 0),
Outcome = c(1, 1, 1, 0, 1, 1)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")))
I am using tableone package to create a pre-matching table. My code is here:
`model1<-lm(formula = dat2$Outcome ~ dat2$Treatment + dat2$Age,
data = dat2,weights = Weights)
model1
Effect<- model1$coeff[2]
Effect
pscores.model<-glm(dat2$Treatment ~ dat2$Age,
family = binomial("logit"), data = dat2, weights = dat2$Weights)
summary(pscores.model)
Propensity_scores<-pscores.model
dat2$Pscores_1<-pscores.model$fitted.values
require(tableone)
xvars<-c("Age")
table1<-svyCreateTableOne(vars = xvars, strata = "Treatment", data = dat2, test = FALSE)
print(table1,smd = TRUE)
require(MatchIt)
match2 <- matchit(pscores.model, method="nearest", radio=1,data=dat2)
match2
ATE_ATT_2.1<-ATE(Y = dat2$Outcome,
Ti = dat$Treatment,
X = dat$Age,
ATT = TRUE
)
ATE_ATT_2.1
ATE_ATT_2.2<-ATE(Y = dat$Outcome,
Ti = dat$Treatment,
X = dat$Age,
ATT = FALSE
)
ATE_ATT_2.2
match2.data <- match.data(match2)
#Create the tableone for nearest matching
table_match2 <- CreateTableOne(vars = xvars,
strata = "Treatment",
data = match2.data,test = FALSE)
print(table_match2, smd = TRUE)`
But when I am running this code, it is giving results without weighting factor and it is not taking weights as an argument inside CreateTableOne function.
Please help me regarding this issue.
Thanks in advance.
There are several issues with this code.
First, to use the svyCreateTableOne function, you need to supply it with a svydesign object in the data argument, which you have not done. You need to use the survey package and supply it with weights. In your case it looks like you are using this function on unweighted data anyway, so you probably shouldn't be using it.
Second, in your call to matchit(), you have the argument radio=1. radio is not an appropriate argument to matchit(). You probably meant ratio. Fortunately, this probably won't affect anything.
I don't know why you have the ATE() code in there. It is irrelevant to the problem.
I don't know what weights you are intending to appear in your final table, but you didn't supply any weights for it to use. In your first table, you give it unmatched, unweighted data. In the second table, you give it matched, unweighted data. What weights are you trying to supply? If you want to give it weighted data, you need to supply those weights using svyCreateTableOne and a svydesign object that has the weights.

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the following function:
textstat_readability(texts(mwe, groups = "document"), "Flesch")
I then want to bootstrap the results to obtain a 95% confidence interval by wrapping a function:
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
But "b <-" fails with the error:
"Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix"
I've wasted two days trying to debug with no luck. What am I doing wrong? Much appreciated for any advice...
MWE:
mwe<-structure(list(document = structure(c(1L, 1L),
.Label = c("a", "b", "c", "d", "e"), class = "factor"), text = c("Text 1. Text 1.1", "Text 2."), section = structure(2:1, .Label = c("aa", "bb", "cc", "dd", "ee", "ff", "hh", "ii", "jj", "kk"), class = "factor"), year = c(1919L, 1944L), preamble = structure(8:9, .Label = c("1", "2","3", " 4 ", "5", "6 ", "7 ", "8 ", "9 ", "10 "), class = "factor"), articles = c(43L, 47L), pages = c(5.218, 7.666), wordcount = c(3503L, 4929L), mean_articles = c(45, 45)), row.names = 1:2, class = "data.frame")
mwe <- corpus(mwe)
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
A good question that involves knowing a lot about the boot package as well as how to index and group corpus texts in quanteda. Here's the best (currently) and safest way to do it. "Safest" here means future-proof, since there are some things that currently work in the internal addressing of a quanteda corpus that will not work in upcoming v2. (We warn about this very clearly in ?corpus but no one seems to heed that warning...) Note also that while this should always work, we are also planning, in future versions, more direct methods for bootstrapping textual statistics that would not require the user to do this sort of deep dive into the boot package.
Let's try a reproducible example from built-in objects first. To "bootstrap" a text, we will construct a new, hypothetical text using sentence-level resampling (with replacement) from the original, and use texts(x, groups = "<groupvar>") to piece this together into a hypothetical kind of text. (This is how I have done in in the two references at the end of this post.) To make this happen, we can exploit the property of texts() that it works to get texts from a corpus object but also works on character objects (but with fast grouping).
To get the sentences, after subsetting the corpus to simplify our example here, we reshape it into sentences.
First, however, I recorded the original document's name in a new document variable, so that we can use this for grouping later. In this example, we could also have used Year, but doing it this way will work for any example. (There are some internal records about the original docname that we might have used, but doing it this way will be future-proof.)
library("quanteda")
## Package version: 1.4.1
library("boot")
docvars(data_corpus_inaugural, "docnameorig") <- docnames(data_corpus_inaugural)
sent_corpus <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
corpus_reshape(to = "sentences")
Then we have to define the function to be bootstrapped. We will use the "index" method and call the index i (as you did above). Here, x will be a character and not a corpus, even though we will call texts() on it, again, using the grouping variable to reassemble it. This will also need to return a vector and not a data.frame, which is normal form of a textstat_*() return. So we will extract just the measure column and return it as a vector.
b_readability <- function(x, i, groups = NULL, measure = "Flesch") {
textstat_readability(texts(x[i], groups = groups[i]), measure)[[measure]]
}
We will call our grouping variable simgroups just to distinguish the value from the argument name, and use this for both the groups argument and for strata in the call to boot(). The strata is an argument to boot(), while groups is passed through to our function b_readability(). We need to factorize this grouping variable since the function seems to want that. Then we call boot() and get our answer.
simgroups <- factor(docvars(sent_corpus, "docnameorig"))
boot(texts(sent_corpus), b_readability, R = 10,
strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(sent_corpus), statistic = b_readability, R = 10,
## strata = simgroups, groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 60.22723 -0.01454477 2.457416
## t2* 53.23332 1.24942328 2.564719
## t3* 60.56705 1.07426297 1.996705
## t4* 53.55532 -0.28971190 1.943986
## t5* 58.63471 0.52289051 2.502101
These correspond to the five (original) documents, here distinguished by year, although unfortunately those names have been replaced by t1, t2, ... in the return object from boot().
To return to your original example, let's say these form two documents from one strata (since these are too short two subdivide further). Then:
simgroups <- rep(1, ndoc(mwe))
boot(texts(mwe), b_readability, R = 10, strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(mwe), statistic = b_readability, R = 10, strata = simgroups,
## groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 119.19 0.6428333 0.4902916

Convincing R to exclude single level factors when using lm() in a for loop in some subsets (but not all)

I am working on a large dataset with 19 subcohorts for which I want to run a lineair regression model to estimate BMI.
One of the covariates I am using is sex, but some subcohorts consist only of men, which causes problems in my loop.
If I try to run a linear regression model, I get the following error:
tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I have found a solution for this problem, by running seperate loops for subcohorts with men and subcohorts with men and women by the following (simplified) code:
men <- c(1,6,15) # Cohort nrs that only contain men
menandwomen <- c(2,3,4,5,7,8,9,10,11,12,13,14,16,17,18,19)
trenddpmodelm <-list()
for(i in men) {
trenddpmodelm[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodelmw <-list()
for(i in menandwomen) {
trenddpmodelmw[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodel <- c(list(trenddpmodelm[[1]]), list(trenddpmodelmw[[2]]), list(trenddpmodelmw[[3]]), list(trenddpmodelmw[[4]]), list(trenddpmodelmw[[5]]), list(trenddpmodelm[[6]]), list(trenddpmodelmw[[7]]), list(trenddpmodelmw[[8]]), list(trenddpmodelmw[[9]]), list(trenddpmodelmw[[10]]), list(trenddpmodelmw[[11]]), list(trenddpmodelmw[[12]]), list(trenddpmodelmw[[13]]), list(trenddpmodelmw[[14]]), list(trenddpmodelm[[15]]), list(trenddpmodelmw[[16]]), list(trenddpmodelmw[[17]]), list(trenddpmodelmw[[18]]), list(trenddpmodelmw[[19]]))
After this step, I extract relevant information from the summaries and put this in a df to export to excel.
My problem is that I will be running quite a lot of analyses, which will result in pages and pages of code.
My question is therefore: Is there a setting in R that I could use that allows non varying factors to be dropped from my lineair regression model in subcohorts where this is applicable? (similar to what happens in coxph; R gives a warning that the factor does not always vary, but the loop does run)
It is not like I cannot continue working without a solution, but I have been trying to find an answer to this question for days without succes and I think it must be possible somehow. Any advice is much appreciated :)
I would recommend building your formula dynamically within the loop.
DF <- list(Cohort1 = data.frame(bmi = rnorm(25, 24, 1),
age = rnorm(25, 50, 3),
sex = sample(c("F", "M"), 25, replace = TRUE)),
Cohort2 = data.frame(bmi = rnorm(15, 24, 1),
age = rnorm(15, 55, 4),
sex = rep("M", 15)))
candidate_vars <- c("age", "sex")
Models <- vector("list", length(DF))
for (i in seq_along(DF)){
# Determine if the variables are either numeric, or factor with more than 1 level
indep <- vapply(X = DF[[i]][candidate_vars],
FUN = function(x){
if (is.numeric(x)) return(TRUE)
else return(nlevels(x) > 1)
},
FUN.VALUE = logical(1))
# Write the formula
form <- paste("bmi ~ ", paste(candidate_vars[indep], collapse = " + "))
# Create the model
Models[[i]] <- lm(as.formula(form), data = DF[[i]])
}

Resources