MatchIt Question - How to Access Distance Between Matched Units with Mahalanobis Distance - R

Is it possible to get the distances between matched units using the MatchIt::matchit() function?
Here is a reproducible example. I can see the distances when I use distance = "glm" but not with distance = "mahalanobis".
If you have a recommendation for a different package I am also happy to try that. I am only looking to match to another unit and not, for example, to calculate an ATT. Thank you!
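(Note: the original df_example is not shown; below is a purely hypothetical stand-in with the columns the formula uses, only so the snippets can be run.)
library(MatchIt)
set.seed(1)
df_example <- data.frame(
  treat = rbinom(100, 1, 0.3),        # hypothetical treatment indicator
  age   = round(rnorm(100, 40, 10)),  # hypothetical covariate
  male  = rbinom(100, 1, 0.5)         # hypothetical covariate
)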
# Run nearest neighbor with "mahalanobis" distance
res_matchitmahalanobis <- matchit(
  data = df_example,
  formula = treat ~ age + male,
  method = "nearest",
  distance = "mahalanobis",
  exact = ~ male,
  replace = TRUE
)
# Note: No `distance` column
get_matches(res_matchitmahalanobis)
# Note: `distance` element is missing
res_matchitmahalanobis$distance
# Run nearest neighbor with "glm" distance
res_glm <- matchit(
  data = df_example,
  formula = treat ~ age + male,
  method = "nearest",
  distance = "glm",
  exact = ~ male,
  replace = TRUE
)
# Note: There is now a `distance` column
get_matches(res_glm)
# Note: `distance` element is now present
res_glm$distance

It looks like matchit() doesn't give you the distances if you use Mahalanobis, even though it calculates the results using that metric.
If you'd like to use Mahalanobis, you can use it along with another metric (like "glm"). Alternatively, you can collect the distances separately.
I ran matchit() with both the glm and Mahalanobis distances, then collected the Mahalanobis distances separately. (Really, I wanted to see whether the distances in the output were Mahalanobis or glm... but, as expected, they were glm.)
To collect the Mahalanobis distances (even with factors and no extra work) you can use the package assertr and its function maha_dist. The base R function requires you to manually convert factors to numeric values.
library(MatchIt)
library(tidyverse)
library(assertr)
data("lalonde")
m.out2 <- matchit(treat ~ age + educ + race, data = lalonde,
                  distance = "glm", method = "nearest",
                  exact = ~educ, replace = TRUE,
                  mahvars = ~age + educ + race)
summary(m.out2)
la2 <- lalonde %>% select(age, educ, race)
head(la2) # as expected
# collect distances
vals <- maha_dist(la2, robust = TRUE) # robust = TRUE uses a robust covariance estimate
# visualize it
plot(density(vals, bw = .5),
     main = "Mahal Sq Distances")
qqplot(qchisq(ppoints(100), df = 3), vals,
       main = "QQ Plot Mahal Sq Distances")
abline(0, 1, col = "gray")
# definitely outside of the 'normal'
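For reference, here is a rough base-R equivalent of the step above (assuming, as maha_dist does, that you want each row's squared distance to the multivariate centre of the data): convert the factor to numeric indicator columns with model.matrix() and then call stats::mahalanobis().
# convert factors to numeric columns, dropping the intercept
la2_num <- model.matrix(~ age + educ + race, data = lalonde)[, -1]
# squared Mahalanobis distance of each row from the column means
vals_base <- mahalanobis(la2_num, center = colMeans(la2_num), cov = cov(la2_num))
head(vals_base)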

As @Kat pointed out, matchit() does not return this value. It would be inappropriate to have this in the distance column; see here for why. The distance output in the matchit object is a misnomer; it refers to the propensity score, and each unit has one distance value. This is why it shows up with distance = "glm": you are estimating a propensity score, which is then used to compute the distance between units. No method in matchit() will actually return the distance between two paired units.
It would take a fair bit of work to extract this information. matchit() does not provide the Mahalanobis distance matrix used in the matching (because this would be way too big for big datasets!). However, you can compute a distance matrix outside matchit(), supply it to the distance argument, and then access the distance between units by extracting those distances from the matrix after doing the pairing. You can compute the Mahalanobis distance using, e.g., optmatch::match_on(), though it is not guaranteed to be identical to the Mahalanobis distance matchit() uses internally. Here is how you would do this:
data("lalonde", package = "MatchIt")
#Create distance matrix
dist <- optmatch::match_on(treat ~ age + educ + race, data = lalonde,
                           method = "mahalanobis")
#Do matching on distance matrix
m <- MatchIt::matchit(treat ~ age + educ + race, data = lalonde,
                      distance = dist, exact = ~married,
                      replace = TRUE)
#Extract matched pairs
mm <- m$match.matrix
#Create data frame of pairs and distance
d <- data.frame(treated = rownames(mm), control = mm[,1],
                distance = dist[cbind(rownames(mm), mm[,1])])
head(d)
#>      treated control  distance
#> NSW1    NSW1 PSID368 0.3100525
#> NSW2    NSW2 PSID341 0.2067017
#> NSW3    NSW3  PSID99 0.2067017
#> NSW4    NSW4 PSID189 0.3900789
#> NSW5    NSW5 PSID400 0.4134033
#> NSW6    NSW6 PSID253 0.1033508
dist["NSW1", "PSID368"]
#> [1] 0.3100525
Created on 2022-02-24 by the reprex package (v2.0.1)
This works with replace = FALSE as well but would take a bit more work when k:1 matching or full matching. Although you are not matching using matchit()'s Mahalanobis distance, the distances produced in the output above do correspond to the distances used to pair.
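For what it's worth, here is a rough, untested sketch of how the same extraction might extend to k:1 matching (e.g., ratio = 2), where match.matrix has one column per matched control; the object names follow the example above.
# stack the (treated, control, distance) triples from each column of match.matrix
pairs_k <- do.call(rbind, lapply(seq_len(ncol(mm)), function(j) {
  ok <- !is.na(mm[, j])
  data.frame(treated  = rownames(mm)[ok],
             control  = mm[ok, j],
             distance = dist[cbind(rownames(mm)[ok], mm[ok, j])])
}))
head(pairs_k)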

How can I use cubic splines for extrapolation?

I am looking to use natural cubic splines to interpolate between some data points using stats::splinefun(). The documentation states:
"These interpolation splines can also be used for extrapolation, that is prediction at points outside the range of ‘x’. Extrapolation makes little sense for ‘method = "fmm"’; for natural splines it is linear using the slope of the interpolating curve at the nearest data point."
I have attempted to replicate the spline function in Excel as a review, which is working fine except that I can't replicate the extrapolation approach. Example data and code below:
library(stats)
# Example data
x <- c(1,2,3,4,5,6,7,8,9,10,12,15,20,25,30,40,50)
y <- c(7.1119,5.862,5.4432,5.1458,4.97,4.8484,4.7726,4.6673,4.5477,4.437,4.3163,4.1755,4.0421,3.9031,3.808,3.6594,3.663)
df <- data.frame(x,y)
# Create spline functions
splinetest <- splinefun(x = df$x, y = df$y, method = "natural")
# Create dataframe of coefficients
splinetest_coef <- environment(splinetest)$z
splinetest_coefdf <- data.frame(i = 0:16, x = splinetest_coef$x, a = splinetest_coef$y,
                                b = splinetest_coef$b, c = splinetest_coef$c, d = splinetest_coef$d)
# Calculate extrapolated value at 51
splinetest(51)
# Result:
# [1] 3.667414
Question: How is this result calculated?
Expected result using linear extrapolation from x = 40 and x = 50 is 3.663 + (51 - 50) * (3.663 - 3.6594) / (50 - 40) = 3.66336.
The spline coefficients at the last knot (x = 50) are a = 3.663 and b = 0.00441355...
Therefore splinetest(51) is calculated as 3.663 + 0.00441355.
How is 0.00441355 calculated in this function?
Linear extrapolation is not done by computing the slope between a particular pair of points, but by using the estimated derivative at the boundary (the "nearest data point" in R's documentation). The derivatives at any point can be calculated directly from the spline function, e.g. to calculate the estimated first derivative at the upper boundary:
splinetest(max(df$x), deriv = 1)
[1] 0.004413552
This agrees with your manual back-calculation of the slope used to do the extrapolation.
As pointed out in the comments, plotting the end of the curve/data set with curve(splinetest, from = 30, to = 60); points(x, y) clearly illustrates the difference between the derivative at the boundary (x = 50) and the line based on the last two data points (i.e. (y(x=50) - y(x=40)) / 10).
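To make the arithmetic explicit, the extrapolated value can be reproduced by hand from the boundary value and boundary slope reported above:
# value at the boundary plus the boundary slope times the distance beyond it
splinetest(50) + splinetest(50, deriv = 1) * (51 - 50)
#> [1] 3.667414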

Weighted dataset after IPTW using weightit?

I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??
The weightit() function produces balance weights. In your case, setting method = "ps" will produce propensity scores that are transformed into weights. More details of how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via: data.frame(w = W.out[["weights"]]). The output is a vector of weights with a length equal to the number of non-NA rows in your data (lalonde).
What you actually mean by "weighted dataset" is ambiguous for two reasons. First, any analysis that uses those weights will typically not produce a new dataset; rather, it weights each row's contribution to the likelihood. This is substantively different from analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, your data contain character columns. For example, lalonde$race is a character vector, and multiplying 5 * "black" doesn't make much sense.
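For example, one common way to use the weights without transforming the data is a survey-weighted outcome regression. A minimal sketch using the survey package you already loaded, with re78 as an illustrative outcome:
# build a design object that carries the estimated weights
des <- svydesign(ids = ~1, weights = W.out$weights, data = lalonde)
# weighted outcome regression of re78 on treatment
fit <- svyglm(re78 ~ treat, design = des)
summary(fit)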
If you are indeed intent on multiplying every value in every row of your data by the row's respective weight, you will need to convert your race variable to numeric indicators, remove it from your data, then you can apply sweep():
library(dplyr)
df <- lalonde %>%
  mutate(black = if_else(race == "black", 1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white = if_else(race == "white", 1, 0)) %>%
  select(-race)
sweep(df, MARGIN = 1, W.out[["weights"]], `*`)

Using dtwclust to calculate clusters based on euclidean distance when time series contain NAs

I wish to cluster a sample of income trajectories using Euclidean distance as my distance measure. I would like to do this using the tsclust() function in the dtwclust package, as its multi-threading support gives it substantially better processing times than other packages such as cluster.
However, a lot of my time series data have missing values, which is causing several problems that I illustrate below.
My primary question is: Is there a way to tell tsclust to calculate Euclidean distance whilst ignoring time periods where one or both time series contain NA values? A more detailed description of what I want can be found in Step 3.
Step 1: Create Example Data
# Import required libraries
library(tidyverse)
library(dtwclust)
library(parallel)
# Define number of trajectories
ntraj = 6
# Define proportion of trajectories that will have missing data
prop = 0.4
# Set seed for reproducible results
set.seed(123)
# Randomly generate number of NA values to occur in each income trajectory (will only occur in % of sample defined by prop).
num_NAs = rep(0,ntraj) %>% replace(sample(1:ntraj,round(ntraj*prop)), sample(1:4,round(ntraj*prop),replace = T)) %>% as.list()
# Randomly generate where these NA values will occur in the time series
NA_posn = num_NAs %>% map(~ sample(1:10,.)) %>% map(~ sort(.))
# Use rnorm to generate income trajectories
inc_traj_full = list() %>% .[1:ntraj] %>% map(~ rnorm(10, 1588.647, 1484.186))
# Populate with NA values
inc_traj = inc_traj_full %>% map2(NA_posn, ~ .x %>% replace(.y, rep(NA,length(.y))) )
# Show trajectories (apologies for ugly code)
inc_traj %>% map(~ as.data.frame(.)) %>% bind_cols() %>% t() %>% as.data.frame() %>%
setNames(paste("year_",1:10,sep="")) %>% `rownames<-`(paste("traj_",1:ntraj,sep="")) %>% View()
Here I’ve created a sample dataset with 6 trajectories, 2 of which contain NA values (4 and 6).
Step 2: Calculate euclidean distance using method registered with proxy package
# Set up parallelisation
# Code taken from https://cran.r-project.org/web/packages/dtwclust/vignettes/parallelization-considerations.html
# create multi-process workers
workers <- makeCluster(detectCores())
# load dtwclust in each one, and make them use 1 thread per worker
invisible(clusterEvalQ(workers, {
library(dtwclust)
RcppParallel::setThreadOptions(1L)
}))
# register your workers, e.g. with doParallel
require(doParallel)
registerDoParallel(workers)
# Calculate distance matrix using proxy::dist
distmat_regd = proxy::dist(inc_traj, method = "Euclidean")
# Show output
distmat_regd
When calculating Euclidean distance using the method registered with proxy::dist, all distances associated with trajectories 4 and 6 are returned as NA. I understand why this has been done. However, I would prefer for the calculation to ignore time-periods that contain these NA values.
# Partition data (PAM) using tsclust using registered distance function
clusters = tsclust(inc_traj, k = 2L, distance = "Euclidean", centroid = "pam", seed = 3247, trace = TRUE)
Furthermore, when trying this approach using tsclust it returns the above error.
Step 3: Calculating euclidean distance using custom method
# Function to calculate euclidean distance whilst removing NAs
calc_euc = function(x, y) {
  euc = {x - y} %>% {sqrt(. ^ 2)} %>% mean(na.rm = T)
  return(euc)
}
# Calculate distance matrix using proxy::dist
distmat_custm = proxy::dist(inc_traj, method = calc_euc)
# Show output
distmat_custm
In the above distance function, I calculate euclidean distance whilst ignoring time-periods where one or both time-series contain an NA value. This is my desired approach and successfully produces non-NA distances for the entire matrix.
# Partition data (PAM) using tsclust with custom distance function
clusters = tsclust(inc_traj, k = 2L, distance = calc_euc, centroid = "pam", seed = 3247, trace = TRUE)
However, when I try to pass this function to tsclust it returns the same error as before. I tried to get around these errors by providing a registered distance calculation and a dataset with no NA values, in the hopes it would still calculate the same clusters using the distance matrix I pre-calculated above. Unfortunately, it resulted in additional errors.
# Partition data (PAM) using tsclust with custom distance function, non-NA dataset and pre-supplied distance matrix
clusters = tsclust(inc_traj_full, k = 2L, distance = calc_euc, centroid = "pam",
                   control = partitional_control(distmat = distmat_custm), seed = 3247, trace = TRUE)
# Partition data (PAM) using tsclust with registered distance function, non-NA dataset and pre-supplied distance matrix
clusters = tsclust(inc_traj_full, k = 2L, distance = "Euclidean", centroid = "pam",
                   control = partitional_control(distmat = distmat_custm), seed = 3247, trace = TRUE)
I'm therefore wondering if there's a registered way to calculate distance that is equivalent to my custom function calc_euc? If not, is there a way to implement my custom function within tsclust that doesn't trigger these errors?
Any information or advice people can provide would be greatly appreciated!

Is it possible to analyse a spatial point pattern given another, underlying, spatial point pattern in R

I want to analyse the type of spatial pattern shown by an animal (i.e. random, clustered, uniform) taking into consideration the underlying spatial pattern of its available habitat. The animals in question roost in trees, so a standard analysis of the animal point pattern will always show a clustered distribution (i.e. clustering around trees), but I want to test whether the animals cluster in particular trees or distribute themselves randomly across the trees. To provide a visual, I want to be able to differentiate between the following scenarios in the image:
https://imgur.com/a/iE3nAoh (image not allowed because I'm new to stack overflow, but it's available through the link)
Here is a reproducible data frame. The scenario here is of uniform habitat (25 areas of habitat) and uniform animals (16 animals per habitat):
library(spatstat)
data <- data.frame(matrix(ncol = 4, nrow = 25))
x <- c("habitat", "x", "y", "animalcount")
colnames(data) <- x
data$habitat <- 1:25
data$x <- seq(from=2, to=20, by=4)
data$y[1:5] <- 2
data$y[6:10] <- 6
data$y[11:15] <- 10
data$y[16:20] <- 14
data$y[21:25] <- 18
data$animalcount <- 16
Set up conditions for the spatial analysis:
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
nS <- 499 # number of simulations
cd <- 5 # cluster distance
ed <- 50 # envelope distance
incr.dist <- 0.5 # increment distance for envelopes
Create the point pattern for the habitat:
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
Create the point pattern for the animals. To do this, first make a new dataframe with repeated rows for the number in animal count, so that points are individual animals. Jitter x/y so that x/y coordinates are not exactly the same:
data <-data[which(data$animalcount>0),]
duplicate_rows <- function(habitat, x, y, animalcount) {
  expanded <- paste0("animal-", 1:animalcount)
  repeated_rows <- data.frame("habitat" = habitat, "x" = x, "y" = y, "animalcount" = expanded)
  repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, data$habitat, data$x, data$y, data$animalcount)
animal_data <- do.call(rbind, expanded_rows)
animal_data$xan <- jitter(animal_data$x)
animal_data$yan <- jitter(animal_data$y)
animal <- ppp(x = animal_data$xan, y = animal_data$yan, window = plot.win)
Now test Complete Spatial Randomness of animals regardless of habitat. This should come out as clustered:
an.csr <- envelope(animal, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE) #CSR fit with nS simulations
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE) #calculate the summary statistics of the CSR null model fit (dclf.test)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = "")) #plot 0-centered fit with the confidence bounds
clarkevans(animal)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animal, "Donnelly")$p
Now test Complete Spatial Randomness of animals, given the available habitat. This should come out not clustered. But simply adding habitat as a covariate clearly isn't the appropriate way to do it:
an.csr <- envelope(animal, covariates = animal_data[,2:3], Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animal)[2]
clarkevans.test(animal, "Donnelly")$p
I also tried running the test of Complete Spatial Randomness on a fitted Point Process Model, where the animal point pattern could be predicted by x&y, but this also did not change outcomes:
animalppm<-ppm(animal~x+y)
an.csr <- envelope(animalppm, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animalppm)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animalppm, "Donnelly")$p
From there I would run tests of aggregation models, but the logic of adding the second point pattern should be similar.
I would appreciate any suggestions on ways to deal with this. I cannot think of an effective way to google this, and am coming up short on clever coding solutions in R. Thanks in advance!
You can model the intensity as depending on the distance to the habitat pattern. Here is a simple example where the animals follow a Poisson point process with an intensity function that decays log-linearly with distance to the habitat:
library(spatstat)
data <- expand.grid(x = seq(2, 18, by=4), y = seq(2, 18, by=4))
data$animalcount <- 16
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
d <- distmap(habitat)
plot(d)
lam <- exp(3-2*d)
plot(lam)
animal <- rpoispp(lam)
plot(animal)
fit <- ppm(animal ~ d)
fit
#> Nonstationary Poisson process
#> 
#> Log intensity:  ~d
#> 
#> Fitted trend coefficients:
#> (Intercept)           d 
#>    2.952048   -1.974381 
#> 
#>              Estimate       S.E.   CI95.lo   CI95.hi Ztest      Zval
#> (Intercept)  2.952048 0.07265533  2.809646  3.094450   ***  40.63085
#> d           -1.974381 0.07055831 -2.112673 -1.836089   *** -27.98226
Taking the underlying non-homogeneous intensity into account, there is no sign of departure from the Poisson model in the (inhomogeneous) K-function:
plot(Kinhom(animal, lambda = fit))
#> Warning: The behaviour of Kinhom when lambda is a ppm object has changed
#> (in spatstat 1.37-0 and later). See help(Kinhom)
You don't have to have a simple log-linear dependence on distance. You could also make a threshold model where you have one intensity within, e.g., distance 1 of the habitat and another intensity outside this distance. You can make all kinds of derived covariates from, e.g., the distance for use in your model.
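For instance, a minimal sketch of the threshold idea, using the distance map d created above (the cutoff of 1 is arbitrary):
near <- eval.im(1 * (d < 1))  # 1 within distance 1 of the habitat, 0 otherwise
fit2 <- ppm(animal ~ near)
fit2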
If animals is the point pattern of animals, and trees is the point pattern of trees (both objects of class "ppp" in spatstat) then you could do
d <- distfun(trees)
f <- rhohat(animals, d)
plot(f)
to get an idea of how the concentration of animals depends on distance to nearest tree. You can use
berman.test(animals, d)
to perform a hypothesis test of dependence on the trees.

Matching on a quaternary variable in R

I am working with a data set you can generate with the following code:
set.seed(922)
dat<-data.frame("y" = c(rnorm(500,20,2),rnorm(500, 40,2),rnorm(500,60,2),rnorm(500,80,2)),
"quaternary" = sample(LETTERS[1:4],2000,replace = T),
"binary" = sample(c("0","1"),2000,replace = T))
(Generating a 4-modal distribution was an arbitrary decision)
The four treatment groups ("A", "B", "C", "D") are what is important.
I am trying to create a balanced matched sample based on the values of y in the data frame. I've used the MatchIt package to build balanced samples based on a binary variable:
matchit(binary~y,data = dat)
but I'm not sure how I could build matches of a 4-level factor "quaternary" on the values of "y".
I'm not certain there's an elegant way to do it in the MatchIt package, but I'm open to any suggestion on how I might stack the methodologies to get a good balanced sample. Any help would be awesome.
EDIT:
OK so I think I'm close. You can leverage dplyr in a for loop. It's a bit inefficient, and I still have to think about the implications of using this to create a balanced sample, but it's getting closer...
First, in the dat data frame, you create four new variables populated with NAs:
dat$A_match<-NA
dat$B_match<-NA
dat$C_match<-NA
dat$D_match<-NA
Then you use the summarise() function in dplyr to find the values.
require(dplyr) #haha. Hey that rhymes
for (i in 1:dim(dat)[1]) {
  dat_A_index <- dat %>%
    mutate(y = ifelse(quaternary == "A", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(A_index = which.min(abs))
  dat$A_match[i] <- dat[dat_A_index$A_index, 1]
  rm(dat_A_index)
  dat_B_index <- dat %>%
    mutate(y = ifelse(quaternary == "B", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(B_index = which.min(abs))
  dat$B_match[i] <- dat[dat_B_index$B_index, 1]
  rm(dat_B_index)
  dat_C_index <- dat %>%
    mutate(y = ifelse(quaternary == "C", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(C_index = which.min(abs))
  dat$C_match[i] <- dat[dat_C_index$C_index, 1]
  rm(dat_C_index)
  dat_D_index <- dat %>%
    mutate(y = ifelse(quaternary == "D", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(D_index = which.min(abs))
  dat$D_match[i] <- dat[dat_D_index$D_index, 1]
  rm(dat_D_index)
}
I know it's clunky, but at least it selects the best match in each of the 4 categories for the given value of y. In a real-world application, the final balanced sample should be no larger than the smallest conditional n multiplied by 4. You also have to assume some outliers might have to be thrown out (maybe an F-test to set the last filtering rule?). At any rate, the vector y we generated already represents a balanced sample, but in a real-world application this would not be the case.
Remember that MatchIt only produces matched samples that are suitable for estimating the ATT (average treatment effect on the treated). Typically, MatchIt selects a group it considers the "treated", which is usually the treatment level labeled "1". It then matches to each treated unit one or more control units.
With multinomial treatments, you also need to decide which estimand you are interested in. If, again, you are interested in the ATT, you must select one group to be considered the "treated", and the other groups are considered "control" (I prefer to refer to them as "focal" and "non-focal"). Importantly, your treatment effect estimates will only generalize to a population similar in composition to that of the focal group.
If this is what you want, you need to select one group as focal, and then perform three separate matchit() calls where each one matches units from one of the non-focal groups to the units in the focal group. The focal group remains unchanged. Below is some code I might use to do this:
set.seed(922)
library(MatchIt)
dat<-data.frame("y" = c(rnorm(500,20,2),rnorm(500, 40,2),rnorm(500,60,2),rnorm(500,80,2)),
"quaternary" = sample(LETTERS[1:4],2000,replace = T, prob = c(.1, .3, .3, .3)),
"binary" = sample(c("0","1"),2000,replace = T))
focal <- "A"
dat$match.weights <- 1
for (lev in levels(dat$quaternary)) {
  if (lev != focal) {
    dat0 <- dat[dat$quaternary %in% c(focal, lev),]
    dat0$treat <- as.numeric(dat0$quaternary == focal)
    m.out <- matchit(treat ~ y, data = dat0, replace = FALSE)
    dat$match.weights[dat$quaternary == lev] <- m.out$weights[dat0$treat == 0]
  }
}
library(cobalt)
bal.tab(quaternary ~ y, data = dat, weights = dat$match.weights,
        method = "matching", focal = focal, un = TRUE)
#> Note: estimand and s.d.denom not specified; assuming ATT and treated.
#> Balance summary across all treatment pairs
#>      Type Max.Diff.Un Max.Diff.Adj
#> y Contin.      0.1134       0.0009
#> 
#> Sample sizes
#>             B   C   D   A
#> All       593 597 612 198
#> Matched   198 198 198 198
#> Unmatched 395 399 414   0
Created on 2018-10-13 by the reprex package (v0.2.1)
Note that if your focal group is not the smallest of the groups, you must match with replacement by setting replace = TRUE in matchit(). To ensure the focal group in this example was the smallest, I set the probabilities of the randomly sampled values of quaternary so that the probability of A was lowest.
If, on the other hand, you want the ATE, matching is probably not your best option. It would be hard to use MatchIt to produce a matched set for the ATE for a binary treatment, making it even harder to do so for multiple treatment groups. Instead, you might look into propensity score weighting, for which the weights are well defined with multinomial treatments. Below is some code to estimate the weights using the above data set to estimate either the ATT or the ATE:
library(WeightIt)
#Weighting for the ATT with A as focal:
w.out.att <- weightit(quaternary ~ y, data = dat, estimand = "ATT", focal = "A")
#> Using multinomial logit regression.
dat$w.att <- w.out.att$weights
#Weighting for the ATE:
w.out.ate <- weightit(quaternary ~ y, data = dat, estimand = "ATE")
#> Using multinomial logit regression.
dat$w.ate <- w.out.ate$weights
bal.tab(quaternary ~ y, data = dat, weights = c("w.att", "w.ate"),
        method = "weighting", estimand = c("ATT", "ATE"), un = TRUE)
#> Balance summary across all treatment pairs
#>      Type Max.Diff.Un Max.Diff.w.att Max.Diff.w.ate
#> y Contin.      0.1092         0.0055         0.0024
#> 
#> Effective sample sizes
#>             A       B       C       D
#> All   198.000 593.000 597.000 612.000
#> w.att 198.000 591.139 593.474 604.162
#> w.ate 196.947 592.822 596.993 611.107
Created on 2018-10-13 by the reprex package (v0.2.1)
No matter which strategy you use, you can then estimate the treatment effect with a weighted regression of the outcome on the treatment, using the estimated matching weights or the ATT or ATE weights.
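For example, a minimal sketch of that final step with the matching weights constructed above (lm()'s weights argument is just one option; you may prefer survey-weighted or robust standard errors):
# weighted outcome regression using the ATT matching weights
fit.att <- lm(y ~ quaternary, data = dat, weights = match.weights)
summary(fit.att)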
[Disclosure: I'm the author of both the cobalt and WeightIt packages.]
