I am working with a data set you can generate with the following code:
set.seed(922)
dat<-data.frame("y" = c(rnorm(500,20,2),rnorm(500, 40,2),rnorm(500,60,2),rnorm(500,80,2)),
"quaternary" = sample(LETTERS[1:4],2000,replace = T),
"binary" = sample(c("0","1"),2000,replace = T))
(Generating a 4-modal distribution was an arbitrary decision)
The four treatment groups ("A", "B", "C", "D") are what is important.
I am trying to create a balanced matched sample based on the values of y in the data frame. I've used the MatchIt package to build balanced samples based on a binary variable:
matchit(binary~y,data = dat)
but I'm not sure how I could build matches for the 4-level factor "quaternary" on the values of "y".
I'm not certain there's an elegant way to do it in the MatchIt package, but I'm open to any suggestion on how I might stack the methodologies to get a good balanced sample. Any help would be awesome.
EDIT:
OK so I think I'm close. You can leverage dplyr in a for loop. It's a bit inefficient, and I still have to think about the implications of using this to create a balanced sample, but it's getting closer...
First, in the data frame, you create four new variables populated with NAs:
dat$A_match<-NA
dat$B_match<-NA
dat$C_match<-NA
dat$D_match<-NA
Then you use the summarise function in dplyr to find the values.
require(dplyr) #haha. Hey that rhymes
for(i in 1:dim(dat)[1]){
  dat_A_index <- dat %>%
    mutate(y = ifelse(quaternary == "A", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(A_index = which.min(abs))
  dat$A_match[i] <- dat[dat_A_index$A_index, 1]
  rm(dat_A_index)

  dat_B_index <- dat %>%
    mutate(y = ifelse(quaternary == "B", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(B_index = which.min(abs))
  dat$B_match[i] <- dat[dat_B_index$B_index, 1]
  rm(dat_B_index)

  dat_C_index <- dat %>%
    mutate(y = ifelse(quaternary == "C", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(C_index = which.min(abs))
  dat$C_match[i] <- dat[dat_C_index$C_index, 1]
  rm(dat_C_index)

  dat_D_index <- dat %>%
    mutate(y = ifelse(quaternary == "D", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(D_index = which.min(abs))
  dat$D_match[i] <- dat[dat_D_index$D_index, 1]
  rm(dat_D_index)
}
I know it's clunky, but at least it's selecting the best match in each of the 4 categories for the given value of y. In a real-world application, the final balanced sample should be no larger than the smallest conditional n multiplied by 4. You also have to assume some outliers might have to be thrown out (maybe an F-test to set the last filtering rule?). At any rate, the vector y we generated already represents a balanced sample, but in a real-world application that would not be the case.
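For what it's worth, here is a more compact sketch of the same idea. It loops over the levels instead of the rows, and it avoids replacing the other groups' y with 0 (which could grab the wrong row if y were ever close to 0):
for (lev in c("A", "B", "C", "D")) {
  y_lev <- dat$y[dat$quaternary == lev]
  dat[[paste0(lev, "_match")]] <- sapply(dat$y, function(v) y_lev[which.min(abs(v - y_lev))])
}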
Remember that MatchIt only produces matched samples that are suitable for estimating the ATT (average treatment effect on the treated). Typically, MatchIt selects a group it considers the "treated", which is usually the treatment level labeled "1". It then matches one or more control units to each treated unit.
With multinomial treatments, you also need to decide which estimand you are interested in. If, again, you are interested in the ATT, you must select one group to be considered the "treated", and the other groups are considered "control" (I prefer to refer to them as "focal" and "non-focal"). Importantly, your treatment effect estimates will only generalize to a population similar in composition to that of the focal group.
If this is what you want, you need to select one group as focal, and then perform three separate matchit calls where each one matches units from one of the non-focal groups to the units in the focal group. The focal group remains unchanged. Below is some code I might use to do this:
set.seed(922)
library(MatchIt)
dat<-data.frame("y" = c(rnorm(500,20,2),rnorm(500, 40,2),rnorm(500,60,2),rnorm(500,80,2)),
"quaternary" = sample(LETTERS[1:4],2000,replace = T, prob = c(.1, .3, .3, .3)),
"binary" = sample(c("0","1"),2000,replace = T))
focal <- "A"
dat$match.weights <- 1
for (lev in levels(dat$quaternary)) {
  if (lev != focal) {
    dat0 <- dat[dat$quaternary %in% c(focal, lev),]
    dat0$treat <- as.numeric(dat0$quaternary == focal)
    m.out <- matchit(treat ~ y, data = dat0, replace = FALSE)
    dat$match.weights[dat$quaternary == lev] <- m.out$weights[dat0$treat == 0]
  }
}
library(cobalt)
bal.tab(quaternary ~ y, data = dat, weights = dat$match.weights,
method = "matching", focal = focal, un = TRUE)
#> Note: estimand and s.d.denom not specified; assuming ATT and treated.
#> Balance summary across all treatment pairs
#> Type Max.Diff.Un Max.Diff.Adj
#> y Contin. 0.1134 0.0009
#>
#> Sample sizes
#> B C D A
#> All 593 597 612 198
#> Matched 198 198 198 198
#> Unmatched 395 399 414 0
Created on 2018-10-13 by the reprex package (v0.2.1)
Note that if your focal group is not the smallest of the groups, you must match with replacement by setting replace = TRUE in matchit(). To ensure the focal group in this example was the smallest, I set the probabilities of the randomly sampled values of quaternary so that the probability of A was lowest.
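As a sketch, the only line inside the loop above that would need to change in that case is the matchit() call (the weight extraction stays the same):
m.out <- matchit(treat ~ y, data = dat0, replace = TRUE)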
If, on the other hand, you want the ATE, matching is probably not your best option. It would be hard to use MatchIt to produce a matched set for the ATE with a binary treatment, and harder still with multiple treatment groups. Instead, you might look into propensity score weighting, for which the weights are well defined with multinomial treatments. Below is some code that uses the above data set to estimate weights for either the ATT or the ATE:
library(WeightIt)
#Weighting for the ATT with A as focal:
w.out.att <- weightit(quaternary ~ y, data = dat, estimand = "ATT", focal = "A")
#> Using multinomial logit regression.
dat$w.att <- w.out.att$weights
#Weighting for the ATE:
w.out.ate <- weightit(quaternary ~ y, data = dat, estimand = "ATE")
#> Using multinomial logit regression.
dat$w.ate <- w.out.ate$weights
bal.tab(quaternary ~ y, data = dat, weights = c("w.att", "w.ate"),
method = "weighting", estimand = c("ATT", "ATE"), un = TRUE)
#> Balance summary across all treatment pairs
#> Type Max.Diff.Un Max.Diff.w.att Max.Diff.w.ate
#> y Contin. 0.1092 0.0055 0.0024
#>
#> Effective sample sizes
#> A B C D
#> All 198.000 593.000 597.000 612.000
#> w.att 198.000 591.139 593.474 604.162
#> w.ate 196.947 592.822 596.993 611.107
Created on 2018-10-13 by the reprex package (v0.2.1)
No matter which strategy you use, you can then run a weighted regression of the outcome on the treatment, using the estimated matching weights or the ATT or ATE weights.
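As a rough sketch of that last step (the toy data here have no outcome variable, so outcome below is a hypothetical placeholder column):
library(survey)
des <- svydesign(ids = ~1, weights = ~w.ate, data = dat)
fit <- svyglm(outcome ~ quaternary, design = des) # `outcome` is a placeholder
summary(fit)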
[Disclosure: I'm the author of both the cobalt and WeightIt packages.]
Related
I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??
The weightit() function produces balance weights. In your case, setting method = "ps" will produce propensity scores that are transformed into weights. More details of how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via: data.frame(w = W.out[["weights"]]). The output is a vector of weights with a length equal to the number of non-NA rows in your data (lalonde).
What you actually mean by "weighted dataset" is ambiguous for two reasons. First, any analyses that use those weights will typically not actually produce a new dataset; rather, they weight each row's contribution to the likelihood. This is substantively different from simply analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, you are asking how to get a weighted dataset from data that have non-numeric columns. For example, lalonde$race is not numeric, and multiplying 5*"black" doesn't make much sense.
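If the goal is simply to run a weighted analysis, a typical pattern is to hand the weights to a design or model object rather than build a new dataset. A sketch using the survey package you already load (the outcome re78 and the bare-bones model are my assumptions about what you intend to estimate):
lalonde$w <- W.out[["weights"]]
d.w <- svydesign(ids = ~1, weights = ~w, data = lalonde)
fit <- svyglm(re78 ~ treat, design = d.w)
summary(fit)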
If you are indeed intent on multiplying every value in every row of your data by the row's respective weight, you will need to convert your race variable to numeric indicators, remove it from your data, then you can apply sweep():
library(dplyr)
df <- lalonde %>%
  mutate(black = if_else(race == "black", 1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white = if_else(race == "white", 1, 0)) %>%
  select(-race)
sweep(df, MARGIN = 2, W.out[["weights"]], `*`)
Is it possible to get the distances between matched units using the MatchIt::matchit() function?
Here is a reproducible example. I can see the distances when I use distance = "glm" but not with distance = "mahalanobis".
If you have a recommendation for a different package I am also happy to try that. I am only looking to match to another unit and not, for example, to calculate an ATT. Thank you!
# Run nearest neighbor with "mahalanobis" distance
res_matchitmahalanobis <- matchit(
data = df_example,
formula = treat ~ age + male,
method = "nearest",
distance = "mahalanobis",
exact = ~ male,
replace = TRUE
)
# Note: No `distance` column
get_matches(res_matchitmahalanobis)
# Note: `distance` element is missing
res_matchitmahalanobis$distance
# Run nearest neighbor with "glm" distance
res_glm <- matchit(
data = df_example,
formula = treat ~ age + male,
method = "nearest",
distance = "glm",
exact = ~ male,
replace = TRUE
)
# Note: There is now a `distance` column
get_matches(res_glm)
# Note: `distance` element is now present
res_glm$distance
It looks like matchit() doesn't give you the distances if you use Mahalanobis. It still calculates the matches using that metric, though.
If you'd like to use Mahalanobis, you can use it along with another metric (like 'glm'). Alternatively, you can collect the distances separately.
I ran the matchit function with both the glm and Mahalanobis distances. Then I collected the Mahalanobis distances separately. (Really, I wanted to see if the distances were Mahalanobis or glm... but as expected, they were glm.)
To collect the Mahalanobis distances (even with factors and no extra work) you can use the package assertr and the function maha_dist. The base R function requires you to manually convert factors to values.
library(MatchIt)
library(tidyverse)
library(assertr)
data("lalonde")
m.out2 <- matchit(treat ~ age + educ + race, data = lalonde,
distance = "glm", method = "nearest",
exact = ~educ, replace = T,
mahvars = ~age + educ + race)
summary(m.out2)
la2 <- lalonde %>% select(age, educ, race)
head(la2) # as expected
# collect distances
vals <- maha_dist(la2, robust = T) # robust = TRUE uses a robust estimate of the covariance matrix
# visualize it
plot(density(vals, bw = .5),
     main = "Mahal Sq Distances")
qqplot(qchisq(ppoints(100), df = 3), vals,
       main = "QQ Plot Mahal Sq Distances")
abline(0, 1, col = "gray")
# definitely outside of the 'normal'
As #Kat pointed out, matchit() does not return this value. It would be inappropriate to have this in the distance column; see here for why. The distance output in the matchit object is a misnomer; it refers to the propensity score, and each unit has one distance value. This is why it shows up with distance = "glm"; you are estimating a propensity score, which is then used to compute the distance between units. No methods in matchit() will actually return the distance between two paired units.
It would take a fair bit of work to extract this information. matchit() does not provide the Mahalanobis distance matrix used in the matching (because this would be way too big for big datasets!). However, you can compute a distance matrix outside matchit(), supply it to the distance argument, and then access the distance between units by extracting those distances from the matrix after doing the pairing. You can compute the Mahalanobis distance using, e.g., optmatch::match_on(), though it is not guaranteed to be identical to the Mahalanobis distance matchit() uses internally. Here is how you would do this:
data("lalonde", package = "MatchIt")
#Create distance matrix
dist <- optmatch::match_on(treat ~ age + educ + race, data = lalonde,
method = "mahalanobis")
#Do matching on distance matrix
m <- MatchIt::matchit(treat ~ age + educ + race, data = lalonde,
distance = dist, exact = ~married,
replace = TRUE)
#Extract matched pairs
mm <- m$match.matrix
#Create data frame of pairs and distance
d <- data.frame(treated = rownames(mm), control = mm[,1],
distance = dist[cbind(rownames(mm), mm[,1])])
head(d)
#> treated control distance
#> NSW1 NSW1 PSID368 0.3100525
#> NSW2 NSW2 PSID341 0.2067017
#> NSW3 NSW3 PSID99 0.2067017
#> NSW4 NSW4 PSID189 0.3900789
#> NSW5 NSW5 PSID400 0.4134033
#> NSW6 NSW6 PSID253 0.1033508
dist["NSW1", "PSID368"]
#> [1] 0.3100525
Created on 2022-02-24 by the reprex package (v2.0.1)
This works with replace = FALSE as well, but it would take a bit more work with k:1 matching or full matching. Although you are not matching using matchit()'s Mahalanobis distance, the distances produced in the output above do correspond to the distances used to pair.
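For k:1 matching (i.e., a matchit() call with ratio = k), a rough sketch of that extra work is to stack the pairs from each column of match.matrix:
mm <- m$match.matrix
d <- do.call(rbind, lapply(seq_len(ncol(mm)), function(j) {
  ok <- !is.na(mm[, j])
  data.frame(treated = rownames(mm)[ok],
             control = mm[ok, j],
             distance = dist[cbind(rownames(mm)[ok], mm[ok, j])])
}))
head(d)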
I am trying to run a PLSR model using Fe concentration with XRF spectra. The spectra matrix contains some zero values.
The code I am using up to the point where the error message pops up is as follows:
#------------------------------------------------------------------------
dataset<-data.frame(cbind(chem_properties$Fe, xrf_spectra))
names(dataset)[1]<-"Fe"
summary(dataset$Fe)
dim(dataset)
plot(dataset$Fe)
#######################################
## Building calibration model
library(caret)  # provides createDataPartition(), trainControl(), train()

# Dataset partitioning
set.seed(100)
pls_Fe <- createDataPartition(dataset$Fe, p = 0.7, list = FALSE)
training <- dataset[pls_Fe,]
testing <- dataset[-pls_Fe,]
summary(training$Fe)
summary(testing$Fe)
# =================================================
# Model
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
tg <- data.frame(ncomp = seq(2, 15, by =1))
pls_rcv <- train(Fe~.,
data = training,
preProcess = c("center", "scale"),
method = "pls",
tuneGrid = tg,
trControl = tc)
#----------------------------------------------------------------------------------
Whenever I run the last line (i.e. the pls_rcv assignment), this error message shows up:
Error in na.fail.default(list(Fe = c(568L, 437L, 599L, 1016L, 670L, 1951L, :
  missing values in object
Although my question is similar to some previously asked questions, I have tried some of the solutions suggested in those cases, but none seems to work. Maybe I am doing something wrong.
One of the suggestions was to use na.exclude() on the whole data frame.
I will be grateful for any feedback.
So here is a first pass at an answer, but without peeking at a subsample of your data this is the gist I can give.
First, we can set up some fake data and add some missing cases to solve the first issue.
# Some fake data
dat <- mtcars
# Now lets add some missing data
dat[sample(x = 1:nrow(dat), size = 5),
    sample(x = 1:ncol(dat), size = 5)] <- NA
Now when we look at our data, we see we have some missing values:
(the function counts the missing cells per column, then turns the output into a data frame for easier presentation)
as.data.frame(lapply(dat, function(x) sum(is.na(x))))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 0 5 0 0 5 5 0 5 5 0 0
If we know that we can't have missing values, we can then use the complete.cases function to only keep those rows that do not have any NAs.
dat_complete <- dat[complete.cases(dat),]
This removes some records (in my case 5)
nrow(dat) -nrow(dat_complete)
#> [1] 5
Aside: if dropping missing data is problematic (e.g. discarding information from these data can bias your estimates, perhaps because the data are not missing completely at random, or because of a known instrument malfunction), there are many methods for imputation and joint estimation of the missing values.
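A minimal sketch of that alternative, using caret's median imputation purely for illustration (a principled approach like multiple imputation would usually be preferable):
library(caret)
pp <- preProcess(dat, method = "medianImpute") # fills NAs in numeric columns with the column median
dat_imputed <- predict(pp, dat)
sum(is.na(dat_imputed)) # should now be 0 for this all-numeric example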
The second problem deals with splits not containing all of the factor levels.
For example, if I were to add a factor to my data set and generate the partitions, it is important to check whether every level is captured in my training data.
# Add a factor (note that "d" is a rare level)
library(caret)
dat_complete$my_factor <- factor(sample(x = letters[1:4],
                                        size = nrow(dat_complete),
                                        prob = c(.7, .2, .15, .05), replace = T))
pls_Fe <- createDataPartition(dat_complete$mpg, p = 0.7, list = FALSE)
training <- dat_complete[pls_Fe,]
testing <- dat_complete[-pls_Fe,]
When I look at my testing and training data, I see that the testing data contain a level ("c" in this run) that the training data do not.
table(training$my_factor)
#> a b c
#> 14 6 0
table(testing$my_factor)
#> a b c
#> 5 1 1
A model cannot reliably predict on a factor level it has never seen (generally speaking, of course; there are methods that try to handle this).
How to fix this? You can always convert the factors to numbers if that makes sense in the context (this adds bias, but allows your model to work). You might lower your split (rather than 70% training, do 60% and see if you capture some of the low-incidence levels). If the predictors aren't relevant, remove them from consideration. Additionally, given this is PLS, you could also try one-hot encoding, which splits a factor column into indicator columns of 1s and 0s, one per level. Do this on the full data set, but only if you know that you will always observe those levels (i.e., the value can only take on one of those categories).
dumz <- dummyVars(~my_factor, data = dat_complete)
dat_dummies <- cbind(dat_complete, predict(dumz, dat_complete))
dat_dummies <- dat_dummies[,names(dat_dummies) !="my_factor"]
pls_Fe <- createDataPartition(dat_dummies$mpg, p = 0.7, list = FALSE)
training <- dat_dummies[pls_Fe,]
testing <- dat_dummies[-pls_Fe,]
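A quick sanity check that every level is represented in the training split (the column names assume dummyVars' default "." separator, e.g. my_factor.a):
colSums(training[, grep("^my_factor", names(training)), drop = FALSE])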
I am very new to programming, so I apologize in case my question seems too fundamental.
Basically, I have a data set of approx. 300 rows. The idea is to create an entirely new data set of, say, 10k rows that still has the same characteristics as the small data set of 300.
ID Category1 Category2 Amount1 Probability1
1 Class1 A 100 0.3
2 Class2 B 800 0.2
3 Class3 C 300 0.7
4 Class2 A 250 0.4
5 Class3 C 900 0.6
I already did exploratory analysis. I know that my numeric data has a beta distribution and I know the mean and sd (and the level of skewness in case it is relevant)
For my categorical data I know the percentage distribution: for instance, category A takes 25% of the data set, category B takes 35%, and category C takes 40%.
My question now is: what are the best packages in order to simulate this data and to create a bigger data set?
I found the simstudy package, which seemed very good; however, I am still very new to programming and I'm having a hard time getting my head around the code.
Here is the link to the description
https://cran.r-project.org/web/packages/simstudy/vignettes/simstudy.html
(I also checked the R documentation but for a newbie like me it is very hard to follow and fully understand it)
I still don't really get how I can define my categorical values there. (They set the percentage distribution of the classes, but they don't actually specify which probability applies to which class.)
Maybe someone here could explain how I could apply it to my data set, or is there a better package for that?
Thank you very much in advance!
EDIT
So my current code with the simstudy package is the following:
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
However, my problem here is that I can't create a skewed beta distribution (and I know that my data are skewed to the right).
Alternatively, I could use this function, but then I have to create each column separately and I cannot create a relationship between columns (e.g. correlation, which I would have to create later on as well):
rbeta(n, shape1, shape2)
# shape1 < shape2 (both > 0) creates a right-skewed beta distribution
rbeta(1000, 0.2, 3)
Any other suggestions how to resolve this problem?
How do you usually simulate data sets when you only have a limited number of entries?
Would it work if you just used the sample() function in R with replacement?
Here is an example using the mtcars data set.
data(mtcars)
mydata=mtcars[,1:4] # only using the first 4 columns for this example
head(mydata)
dim(mydata) # data has 32 rows 4 columns
bigdata=data.frame(mpg=sample(mydata$mpg,1000,replace = T),
cyl=sample(mydata$cyl,1000,replace = T),
disp=sample(mydata$disp,1000,replace = T),
hp=sample(mydata$hp,1000,replace = T))
head(bigdata)
dim(bigdata)
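One caveat: sampling each column independently breaks any relationships between the columns. If preserving correlations matters, you can resample whole rows instead (at the cost of only ever reproducing observed combinations):
bigdata_rows <- mydata[sample(seq_len(nrow(mydata)), 1000, replace = TRUE), ]
head(bigdata_rows)
dim(bigdata_rows) # 1000 rows, same 4 columns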
I have actually done something exactly like this: I calculated the actual min and max for each variable so I could simulate data that mimic my own original dataset. Using simstudy has several advantages over just using sample, primarily that sample only draws from the existing data, while simstudy generates any potential value between the minimum and maximum (for numeric types), or according to a proportion for the categorical variables. simstudy is also useful if your original data are sensitive/personal, since you can bypass privacy problems compared to using sample. This is what I did:
library(skimr)
library(simstudy)
library(dplyr)
library(glue)
sim_definitions <-
skim_to_wide(iris) %>%
mutate(min = as.numeric(p0), max = as.numeric(p100)) %>%
transmute(
varname = variable,
dist = case_when(
# For binary data if it is only 0 and 1
n_unique == 2 ~ "binary",
n_unique > 2 ~ "categorical",
TRUE ~ "uniform"
),
formula = case_when(
dist == "uniform" ~ as.character(glue("{min};{max}")),
# For only factors with 3 levels. number is proportion. 0.3 = 30%
dist == "categorical" ~ "0.5;0.2;0.3",
dist == "binary" ~ "0.2",
# other wise 10 is min, 20 is max
TRUE ~ "10;20"
),
link = case_when(
dist == "binary" ~ "logit",
TRUE ~ "identity"
)
)
# 1000 is the final size of the dataset. Change it to whatever you want.
simulated_data <- genData(1000, sim_definitions)
dim(simulated_data)
head(simulated_data)
NOTE: I seem to have an error with simstudy. Not sure if it's because of an update. Let me know if this works for you. UPDATE: It seems the categorical specification causes the error, but I was unable to find the problem.
UPDATE based on clarification in question and comments:
Your code works fine in generating a simulated dataset. If you want to force a skewed distribution, you can use base R's distribution functions like qlnorm. So:
library(simstudy)
#> Loading required package: data.table
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
simulated_data <- genData(1000, def)
hist(simulated_data$Amount2)
simulated_data$Amount2 <- qlnorm(simulated_data$Amount2)
hist(simulated_data$Amount2)
Created on 2019-03-24 by the reprex package (v0.2.1)
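On your earlier question of which probability applies to which class: simstudy's "categorical" distribution generates the integers 1 through k in the order the probabilities are listed in the formula, so you can relabel them afterwards. A small sketch (the Class1/Class2/Class3 labels are just an assumption based on your example data):
simulated_data$Product_Class <- factor(simulated_data$Product_Class,
                                       levels = 1:3,
                                       labels = c("Class1", "Class2", "Class3"))
prop.table(table(simulated_data$Product_Class)) # roughly 0.25 / 0.35 / 0.40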
I am doing double cross-validation with the LASSO from the glmnet package. However, when I plot the results I get lambda values from 0 to 150,000, which is unrealistic in my case. I am not sure what I am doing wrong; can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),1:(n.in%%k.in))
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit
I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically, the data are "dirty", and that is a possible reason why the algorithm does not converge properly. E.g., there are 771-year-old patients; besides 1 and 2 for male and female, there is a 22 in the sex encoding, etc.
In your case, you only removed NAs.
You also need to check the column types the data.frame was imported with. E.g., variables that should be factors (SEX, Lab and Age group) could be imported as integers, which will affect the model.
I think you need to:
1) cleanse the data (see the sketch below);
2) if that does not work, share the *.csv file.
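A minimal sketch of the kind of cleaning meant in 1) (the AGE and SEX column names and the cutoffs are assumptions based on the issues described above):
summary(calcium) # look for impossible values (e.g. AGE = 771, SEX = 22)
calcium <- subset(calcium, AGE <= 110 & SEX %in% c(1, 2))
calcium$SEX <- factor(calcium$SEX, levels = c(1, 2), labels = c("male", "female"))
str(calcium) # confirm that categorical variables are factors, not integers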