Weighted dataset after IPTW using weightit?

I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??

The weightit() function produces balancing weights. In your case, setting method = "ps" will produce propensity scores that are transformed into weights. More details of how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via: data.frame(w = W.out[["weights"]]). The output is a vector of weights with a length equal to the number of non-NA rows in your data (lalonde).
What you actually mean by "weighted dataset" is ambiguous, for two reasons. First, analyses that use those weights will typically not produce a new dataset; rather, each row's contribution to the likelihood is weighted. This is substantively different from analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, your dataset has character vectors in some columns. For example, lalonde$race is a character vector, and multiplying 5*"black" doesn't make much sense.
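For example, to use the weights as intended, you can attach them to the data and pass them to a weighted estimator. Here is a minimal sketch with the survey package loaded above; the outcome model on re78 is a hypothetical illustration, not part of the original question:
lalonde$w <- W.out$weights                       # attach the estimated weights
d <- svydesign(ids = ~1, weights = ~w, data = lalonde)
fit <- svyglm(re78 ~ treat, design = d)          # weights enter the estimation,
summary(fit)                                     # not the data values themselves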
If you are indeed intent on multiplying every value in every row of your data by the row's weight, you will need to convert your race variable to numeric indicators, remove the original column, and then apply sweep():
library(dplyr)
df <- lalonde %>%
  mutate(black = if_else(race == "black", 1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white = if_else(race == "white", 1, 0)) %>%
  select(-race)
sweep(df, MARGIN = 1, W.out[["weights"]], `*`) # MARGIN = 1: each row is multiplied by its weight

Related

GAM distributed lag model with factor smooth interaction (by variable)

I'm trying to compare the climate response over the last 60 years of two subgroups of a plant (a factor variable subgroups with 2 levels). The response of the two subgroups, which both grew on the same plots, is measured as the deviation from long-term growth (plant_growth). Mean temperature (tmean) and mean precipitation (prec) are available as climate data.
I formulated a distributed lag model using mgcv's gam() to test the hypothesis that the climate response differs between the plant subgroups:
climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                       te(prec, lag, by = subgroups) +
                       te(tmean, prec, lag, by = subgroups),
                     data = plant_data)
plant_data is a list that contains tmean, prec, and lag as separate numeric matrices, subgroups as a factor variable distinguishing subgroups A and B, a character variable giving the ID of the plant, and the measured plant_growth as a numeric vector.
The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:
Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, :
factor `by' variables can not be used with matrix arguments.
I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.
I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, in this way the climate response of the two subgroups is treated as if it were completely independent, which is not the case.
I was able to reproduce my problem with growth data from the treeclim package:
library("treeclim") #Data library
data("muc_spruce") #Plant growth
data("muc_clim") #Climate data
#Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp,prec))
#Format the growth data and add three new growth time series
growth <- muc_spruce %>%
  select(-samp.depth) %>%
  mutate(year = as.numeric(row.names(muc_spruce))) %>%
  mutate(ID = 1) %>%
  rename("plant_growth" = "mucstd")
additional_growth <- data.frame()
for (i in c(1:3)){
  A <- growth %>%
    mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
    mutate(ID = ID + i)
  additional_growth <- rbind(additional_growth, A)
}
growth <- rbind(growth, additional_growth)
#Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) #clean
#Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))
#Format for gam input
plant_data <- list(lag = matrix(1:12, nrow(plant_data), 12, byrow = TRUE),
                   year = plant_data$year,
                   ID = plant_data$ID,
                   plant_growth = plant_data$plant_growth,
                   subgroups = as.factor(plant_data$subgroups),
                   tmean = data.matrix(plant_data[, c(4:15)]),
                   prec = data.matrix(plant_data[, c(16:27)]))
From ?mgcv::linear.functional.terms:
The mechanism is usable with random effect smooths which take factor arguments, by using a trick to create a 2D array of factors. Simply create a factor vector containing the columns of the factor matrix stacked end to end (column major order). Then reset the dimensions of this vector to create the appropriate 2D array: the first dimension should be the number of response data and the second the number of columns of the required factor matrix. You can not use matrix or data.matrix to set up the required matrix of factor levels. See example below:
## set up a `factor matrix' (n is the number of response data)...
fac <- factor(sample(letters, n*2, replace = TRUE))
dim(fac) <- c(n, 2)
You cannot create a factor matrix with matrix() or data.matrix(), though; you can create a factor vector and modify its dims afterwards.
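A minimal sketch of that trick applied to this example, assuming the 12-column lag matrix from above. Note the quoted help page documents the mechanism for random effect smooths that take factor arguments, so whether it lifts the by= restriction for te() terms would need testing:
n <- length(plant_data$plant_growth)
#stack the subgroups column 12 times end to end (column-major order)...
fac <- rep(plant_data$subgroups, times = 12)
#...then reset the dims; matrix() or data.matrix() would drop the factor class
dim(fac) <- c(n, 12)
str(fac) #a factor with dim c(n, 12)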

MatchIt Question - How to Access Distance Between Matched Units with Mahalanobis Distance

Is it possible to get the distances between matched units using the MatchIt::matchit() function?
Here is a reproducible example. I can see the distances when I use distance = "glm" but not with distance = "mahalanobis".
If you have a recommendation for a different package I am also happy to try that. I am only looking to match to another unit and not, for example, to calculate an ATT. Thank you!
# Run nearest neighbor matching with "mahalanobis" distance
# (df_example: a data frame with columns treat, age, and male; its construction is not shown here)
library(MatchIt)
res_matchitmahalanobis <- matchit(
  data = df_example,
  formula = treat ~ age + male,
  method = "nearest",
  distance = "mahalanobis",
  exact = ~ male,
  replace = TRUE
)
# Note: No `distance` column
get_matches(res_matchitmahalanobis)
# Note: `distance` element is missing
res_matchitmahalanobis$distance
# Run nearest neighbor with "glm" distance
res_glm <- matchit(
  data = df_example,
  formula = treat ~ age + male,
  method = "nearest",
  distance = "glm",
  exact = ~ male,
  replace = TRUE
)
# Note: There is now a `distance` column
get_matches(res_glm)
# Note: `distance` element is now present
res_glm$distance
It looks like matchit() doesn't give you the distances if you use Mahalanobis. It does calculate the results using that metric, though.
If you'd like to use Mahalanobis, you can use it along with another metric (like "glm"). Alternatively, you can collect the distances separately.
I ran the matchit() function with both the glm and Mahalanobis distances, then collected the Mahalanobis distances separately. (Really, I wanted to see whether the reported distances were Mahalanobis or glm... but as expected, they were glm.)
To collect the Mahalanobis distances (even with factors and no extra work) you can use the maha_dist() function from the assertr package. The base R function mahalanobis() requires you to manually convert factors to numeric values.
library(MatchIt)
library(tidyverse)
library(assertr)
data("lalonde")
m.out2 <- matchit(treat ~ age + educ + race, data = lalonde,
                  distance = "glm", method = "nearest",
                  exact = ~educ, replace = T,
                  mahvars = ~age + educ + race)
summary(m.out2)
la2 <- lalonde %>% select(age, educ, race)
head(la2) # as expected
# collect distances
vals <- maha_dist(la2, robust = T) # robust = T requests a robust covariance estimate
# visualize it
plot(density(vals, bw = .5),
     main = "Mahal Sq Distances")
qqplot(qchisq(ppoints(100), df = 3), vals,
       main = "QQ Plot Mahal Sq Distances")
abline(0, 1, col = "gray")
# definitely outside of the 'normal' range
As #Kat pointed out, matchit() does not return this value. It would be inappropriate to have this in the distance column; see here for why. The distance output in the matchit object is a misnomer; it refers to the propensity score, and each unit has one distance value. This is why it shows up with distance = "glm"; you are estimating a propensity score, which is then used to compute the distance between units. No methods in matchit() will actually return the distance between two paired units.
It would take a fair bit of work to extract this information. matchit() does not provide the Mahalanobis distance matrix used in the matching (because this would be way too big for big datasets!). However, you can compute a distance matrix outside matchit(), supply it to the distance argument, and then access the distance between units by extracting those distances from the matrix after doing the pairing. You can compute the Mahalanobis distance using, e.g., optmatch::match_on(), though it is not guaranteed to be identical to the Mahalanobis distance matchit() uses internally. Here is how you would do this:
data("lalonde", package = "MatchIt")
#Create distance matrix
dist <- optmatch::match_on(treat ~ age + educ + race, data = lalonde,
                           method = "mahalanobis")
#Do matching on distance matrix
m <- MatchIt::matchit(treat ~ age + educ + race, data = lalonde,
                      distance = dist, exact = ~married,
                      replace = TRUE)
#Extract matched pairs
mm <- m$match.matrix
#Create data frame of pairs and distance
d <- data.frame(treated = rownames(mm), control = mm[,1],
                distance = dist[cbind(rownames(mm), mm[,1])])
head(d)
#> treated control distance
#> NSW1 NSW1 PSID368 0.3100525
#> NSW2 NSW2 PSID341 0.2067017
#> NSW3 NSW3 PSID99 0.2067017
#> NSW4 NSW4 PSID189 0.3900789
#> NSW5 NSW5 PSID400 0.4134033
#> NSW6 NSW6 PSID253 0.1033508
dist["NSW1", "PSID368"]
#> [1] 0.3100525
Created on 2022-02-24 by the reprex package (v2.0.1)
This works with replace = FALSE as well, but it would take a bit more work with k:1 matching or full matching. Although you are not matching using matchit()'s internal Mahalanobis distance, the distances produced in the output above do correspond to the distances used to pair.
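For reference, a hedged sketch of that extra work for k:1 matching, assuming the dist and m objects from above and a matchit() call with, e.g., ratio = 2, so that match.matrix has one column per match:
#stack one row per treated-control pair, skipping unfilled (NA) match slots
pairs <- do.call(rbind, lapply(seq_len(ncol(mm)), function(j) {
  ok <- !is.na(mm[, j])
  data.frame(treated = rownames(mm)[ok],
             control = mm[ok, j],
             distance = dist[cbind(rownames(mm)[ok], mm[ok, j])])
}))
head(pairs)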

MatchIt in R - how to link/know which rows were matched to original dataset

I've run the MatchIt package successfully using the nearest method and got a new dataset with only matched rows.
I need to find which rows in the original dataset were matched, so that I can further describe the matched population (based on variables that aren't included in the propensity score matching). But the matched output data only shows the variables that the regression was performed on.
reprex:
original_data <- data.frame(row_ID = c(1232451, 4938593, 2948201, 3349281, 3958593, 3948202, 2938402, 1192932),
                            wealth = c(low, med, high, med, high, med, low, med),
                            income = c(50000, 1000000, 150000, 52000, 29000, 29330, 20000, 10292),
                            marriage_status = c(1, 1, 0, 0, 0, 1, 0),
                            death = c(0, 1, 1, 0, 1, 0, 0, 1))
ps <- glm(death ~ wealth + income, family = binomial(), data = original)
ps_df <- data.frame(pr_score = predict(ps, type = "response"),
                    readmit = ps$model$death)
ps_match <- matchit(death ~ wealth + income, method = "nearest", data = original_data)
ps_data <- match.data(ps_match)
View(ps_data)
How do I link the row ID in the new dataset ps_data so I know which rows were included from the original dataset?
Your code is not runnable, so this is not a reprex. That said, match.data() uses the original dataset that was supplied to matchit(), so it automatically includes all of its variables, whether they were used for matching or not. You can also set drop.unmatched = FALSE to make sure the original dataset and the matched dataset have the same number of rows (unmatched units will receive a weight of 0 and a subclass of NA).
If for some reason your variable of interest is in a totally different dataset, you can subset that dataset using the row names of the matched dataset, e.g., using merge() with by = 0 (which matches on row names).
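A minimal sketch of both points, using MatchIt's built-in lalonde data in place of the irreproducible example above (other_df is a hypothetical second dataset):
library(MatchIt)
data("lalonde", package = "MatchIt")
m <- matchit(treat ~ age + educ + race, data = lalonde, method = "nearest")
md <- match.data(m)
# every column of lalonde is kept; re78 was not in the matching model
head(md[, c("re78", "weights", "subclass")])
# link a separate dataset by row names; by = 0 merges on row names
# merged <- merge(md, other_df, by = 0)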

Fama-MacBeth Regression in R with pmg

In the past few days I have been trying to find out how to run Fama-MacBeth regressions in R. The usual advice is to use the plm package with pmg(); however, every attempt returns the error that I have an insufficient number of time periods.
My dataset consists of 2,828,419 observations with 13 columns of variables, on which I want to run multiple cross-sectional regressions.
My firms are identified by seriesid, I have a variable date, and I want to run the following Fama-MacBeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, a data.frame, and a pdata.frame. Switching the index does not help either.
My data contains NAs as well.
Can anyone fix this, or suggest a different way for me to run Fama-MacBeth regressions?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful: it is probably not a case of "too few time periods to estimate" but very likely a case of "there are firm/unit IDs that are not represented across all time periods", because observations with missing data get dropped.
You have two options: impute the missing data, or drop observations with missing data (the latter is a quick way to check that the model runs without missing points before deciding on an approach that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
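A minimal sketch of the multiple-imputation route with Amelia, assuming your data are in a data frame called Data with identifier columns seriesid and date:
library(Amelia)
# impute 5 times; idvars keeps identifier columns out of the imputation model
a.out <- amelia(Data, m = 5, idvars = c("seriesid", "date"))
# a.out$imputations is a list of 5 completed data sets:
# fit pmg() on each and pool the coefficient estimates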
EDIT: I dug into the source code to see where the error is triggered, and here is the issue; it is again likely caused by missing data, but it interacts with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what triggers your error. So imputation is definitely a solution, but there might be just a single offender in your data, and importantly, once this condition is satisfied your model will run just fine with missing data.
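A quick hedged check for such offenders, assuming dplyr and the single-regressor model above (so k = 2, and any seriesid observed fewer than 3 times trips the condition):
library(dplyr)
Data %>%
  count(seriesid) %>%
  filter(n < 3) # IDs with t < k + 1 for totret ~ momentum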
I recently got the Fama-MacBeth regression working in R.
Starting from a data.table with all of the characteristics in the rows, the following works and gives you the option of equal weighting or applying weights to the regression (remove weights = marketcap for equal weighting). totret is a total return variable and logmarket is the logarithm of market capitalization.
library(dplyr)
logmarket <- df %>%
  group_by(date) %>%
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2])
You obtain a data frame with the monthly alphas (constant), betas (beta), and R-squared values (rsquared).
To retrieve the coefficients with t-statistics in a data frame:
library(lmtest) # for coeftest()
Summarystatistics <- as.data.frame(matrix(data = NA, nrow = 6, ncol = 1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
Some entries of seriesid have only one observation; that is why pmg() gives the error. If you do something like this (with the variable names you use), it will stop the error:
try2 <- try2 %>%
  group_by(cusip) %>%
  mutate(flag = if (length(cusip) == 1) 1 else 0) %>%
  ungroup() %>%
  filter(flag == 0)

Tapply only producing missing values

I'm trying to generate estimates of the percent of Catholics within a given municipality in a country and I'm using multilevel regression and post-stratification of survey data.
The approach fits a multilevel logit and generates predicted probabilities of the dependent variable. It then weights the probabilities using poststratification of the sample to census data.
I can generate the initial estimates (which are essentially just the predicted probability of being Catholic for a given individual in the survey data). However, when I try to take the average with the last line of code below, it only returns NAs for each of the municipalities. The initial cell predictions have some missing values, but nowhere near a majority.
I don't understand why I can't generate municipal weighted averages as I've followed the procedure using different data. Any help would be greatly appreciated.
rm(list=ls(all=TRUE))
library("arm")
library("foreign")
#read in megapoll and attach
ES.data <- read.dta("ES4.dta", convert.underscore = TRUE)
#read in municipal-level dataset
munilevel <- read.dta("election.dta",convert.underscore = TRUE)
munilevel <- munilevel[order(munilevel$municode),]
#read in Census data
Census <- read.dta("poststratification4.dta",convert.underscore = TRUE)
Census <- Census[order(Census$municode),]
Census$municode <- match(Census$municode, munilevel$municode)
#Create index variables
#At level of megapoll
ES.data$ur.female <- (ES.data$female *2) + ES.data$ur
ES.data$age.edr <- 6 * (ES.data$age -1) + ES.data$edr
#At census level (same coding as above for all variables)
Census$cur.cfemale <- (Census$cfemale *2) + Census$cur
Census$cage.cedr <- 6 * (Census$cage -1) + Census$cedr
##Municipal level variables
Census$c.arena<- munilevel$c.arena[Census$municode]
Census$c.fmln <- munilevel$c.fmln[Census$municode]
#run individual-level opinion model
individual.model1 <- glmer(formula = catholic ~ (1|ur.female) + (1|age) +
                             (1|edr) + (1|age.edr) + (1|municode) +
                             p.arena + p.fmln,
                           data = ES.data, family = binomial(link = "logit"))
display(individual.model1)
#examine random effects and standard errors for urban-female
ranef(individual.model1)$ur.female
se.ranef(individual.model1)$ur.female
#create vector of state ranefs and then fill in missing ones
muni.ranefs <- array(NA,c(66,1))
dimnames(muni.ranefs) <- list(c(munilevel$municode),"effect")
for(i in munilevel$municode){
muni.ranefs[i,1] <- ranef(individual.model1)$municode[i,1]
}
muni.ranefs[,1][is.na(muni.ranefs[,1])] <- 0 #set states with missing REs (b/c not in data) to zero
#create a prediction for each cell in Census data
cellpred1 <- invlogit(fixef(individual.model1)["(Intercept)"]
                      + ranef(individual.model1)$ur.female[Census$cur.cfemale, 1]
                      + ranef(individual.model1)$age[Census$cage, 1]
                      + ranef(individual.model1)$edr[Census$cedr, 1]
                      + ranef(individual.model1)$age.edr[Census$cage.cedr, 1]
                      + muni.ranefs[Census$municode, 1]
                      + (fixef(individual.model1)["p.fmln"] * Census$c.fmln)    # municipal level
                      + (fixef(individual.model1)["p.arena"] * Census$c.arena)) # municipal level
#weights the prediction by the freq of cell
cellpredweighted1 <- cellpred1 * Census$cpercent.muni
#calculates the percent within each municipality (weighted average of responses)
munipred <- 100* as.vector(tapply(cellpredweighted1, Census$municode, sum))
munipred
The extensive amount of code is totally redundant without the data! I suppose you have NAs in the object cellpredweighted1, and by default sum() propagates NAs to the answer: if one or more elements of a vector is NA, then by definition the sum of those elements is also NA.
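A tiny self-contained illustration of that propagation (the values are hypothetical):
x <- c(0.2, NA, 0.4)
g <- c("a", "a", "b")
tapply(x, g, sum)               # a is NA because one of its cells is NA
tapply(x, g, sum, na.rm = TRUE) # the NA is dropped: a = 0.2, b = 0.4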
If the above is the case here, then simply adding na.rm = TRUE to the tapply() call should solve the problem.
tapply(cellpredweighted1, Census$municode, sum, na.rm = TRUE)
You should be asking yourself why there are NAs at this stage and whether they result from errors earlier in the process.
