Ratio 1:1 for exact matching using MatchIt package - r

library(MatchIt)
df <- data.frame(lalonde)
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
                  method = "exact")
m.data1 <- match.data(m.out1)
I would like to know how I can get the same sample size for the control and treatment groups after running exact matching with the MatchIt package. Ideally, I would like to randomly pick one control when a treated unit has been matched to more than one control.
My real dataset is not lalonde; it is actually extremely large, so I might have many controls associated with each treated unit, and I want to draw one at random for each treated unit.

For exact matching you could use this code.
library(Matching)
data("lalonde")
Y <- lalonde$re78
Tr <- lalonde$treat
X <- lalonde[setdiff(names(lalonde), c('re78', 'treat'))]
set.seed(42) ## makes the random tie-breaking reproducible; comment out to let it vary between runs
rmtch <- Match(Y=Y, Tr=Tr, X=X, exact=TRUE, ties=FALSE)
summary(rmtch)
# Estimate... 1678.6
# SE......... 981
# T-stat..... 1.7111
# p.val...... 0.087055
#
# Original number of observations.............. 445
# Original number of treated obs............... 185
# Matched number of observations............... 55
# Matched number of observations (unweighted). 55
#
# Number of obs dropped by 'exact' or 'caliper' 130
str(rmtch) ## what is stored in Match object
rmtch$index.control ## indices of control units
# [1] 261 254 188 279 288 317 323 280 186 311 305 234 337 302 219 345 234 328
# [19] 271 218 253 249 339 271 339 344 351 253 328 339 255 217 254 197 254 284
# [37] 266 252 253 280 208 226 209 354 204 282 350 296 202 247 219 330 347 280
# [55] 344
If you re-run the code without the set.seed() call, you will see that the control IDs change slightly; the effect would likely be more pronounced with a larger dataset.
set.seed() makes the random selection of control units reproducible. ties=FALSE is what tells Match() to break ties randomly, i.e. to draw a single control for each treated unit instead of keeping all tied controls with weights (see the ?Match help page).
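If you then want the 1:1 matched sample itself, here is a small sketch built from the stored row indices (index.treated and index.control are components of the Match object, see ?Match):
matched <- rbind(lalonde[rmtch$index.treated, ],
                 lalonde[rmtch$index.control, ])
table(matched$treat)   # same number of treated and control rows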

The easiest way is to do 1:1 nearest neighbor matching with exact matching constraints:
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
                  method = "nearest",
                  exact = ~ age + race + educ)
If you are doing coarsened exact matching, there is a built-in option for this; set k2k = TRUE:
m.out1 <- matchit(treat ~ age + race + educ, data = lalonde,
                  method = "cem", k2k = TRUE,
                  cutpoints = 0)
Setting cutpoints = 0 requests exact matching (no coarsening).
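If you would rather keep pure exact matching and literally draw one control at random per treated unit, as described in the question, here is a minimal sketch using the subclass column returned by match.data(); it assumes dplyr and samples an equal number of treated and control units within each subclass (so treated units in subclasses with too few controls are dropped):
library(MatchIt)
library(dplyr)
m.exact <- matchit(treat ~ age + race + educ, data = lalonde, method = "exact")
md <- match.data(m.exact)          # contains a `subclass` column
set.seed(42)                       # reproducible random draws
one_to_one <- md %>%
  group_by(subclass) %>%
  group_modify(function(d, ...) {
    tr <- filter(d, treat == 1)
    co <- filter(d, treat == 0)
    k  <- min(nrow(tr), nrow(co))  # 1:1 within the subclass
    bind_rows(slice_sample(tr, n = k), slice_sample(co, n = k))
  }) %>%
  ungroup()
table(one_to_one$treat)            # equal group sizes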

Related

Stratifying multiple columns for cross-validation

There are many ways I've seen to stratify a sample by a single variable to use for cross-validation. The caret package does this nicely with the createFolds() function. By default it seems that caret will partition such that each fold has roughly the same target event rate.
What I want to do, though, is stratify by the target rate and by time. I've found a function that can partially do this: stratified() from the splitstackshape package. The issue with that function is that it returns a single sample; it doesn't split the data into k groups under the given conditions.
Here's some dummy data to reproduce.
set.seed(123)
time = rep(seq(1:10),100)
target = rbinom(n=100, size=1, prob=0.3)
data = as.data.frame(cbind(time,target))
table(data$time,data$target)
0 1
1 60 40
2 80 20
3 80 20
4 60 40
5 80 20
6 80 20
7 60 40
8 60 40
9 70 30
10 80 20
As you can see, the target event rate is not the same across time. It's 40% in time 1 and 20% in time 2, etc. I want to preserve this when creating the folds used for cross-validation. If I understand correctly, caret will partition by the overall event rate.
table(data$target)
0 1
710 290
This rate of ~30% will be preserved overall, but the target event rate over time will not be.
We can get one sample like this:
library(splitstackshape)
train.index <- stratified(data,c("target","time"),size=.2)
I need to repeat this four more times, though, for a 5-fold cross-validation, and it needs to be done such that once a row is assigned to a fold it can't be assigned again. I feel like there should be a function designed for this already. Any ideas?
I know this post is old but I just had the same problem and I couldn't find another solution. In case anyone else needs an answer, here's the solution I'm implementing.
library(data.table)
mystratified <- function(indt, group, NUM_FOLDS) {
  indt <- setDT(copy(indt))
  if (is.numeric(group))
    group <- names(indt)[group]
  temp_grp <- temp_ind <- NULL
  indt[, `:=`(temp_ind, .I)]                                    # row index
  indt[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]  # stratum label
  samp_sizes <- indt[, .N, by = group]
  samp_sizes[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
  # row indices grouped by stratum, in the order the strata appear in samp_sizes
  inds <- split(indt$temp_ind, indt$temp_grp)[samp_sizes$temp_grp]
  z <- unlist(inds, use.names = FALSE)
  # deal the stratum-ordered indices out round-robin across the folds
  model_folds <- suppressWarnings(split(z, 1:NUM_FOLDS))
  model_folds
}
This is basically a rewrite of splitstackshape::stratified. It works as follows, giving as output a list of validation indices, one per fold.
myfolds = mystratified(indt = data, group = colnames(data), NUM_FOLDS = 5)
str(myfolds)
List of 5
$ 1: int [1:200] 1 91 181 261 351 441 501 591 681 761 ...
$ 2: int [1:200] 41 101 191 281 361 451 541 601 691 781 ...
$ 3: int [1:200] 51 141 201 291 381 461 551 641 701 791 ...
$ 4: int [1:200] 61 151 241 301 391 481 561 651 741 801 ...
$ 5: int [1:200] 81 161 251 341 401 491 581 661 751 841 ...
So, for instance the train and validation data for each fold are:
# first fold
train = data[-myfolds[[1]],]
valid = data[myfolds[[1]],]
# second fold
train = data[-myfolds[[2]],]
valid = data[myfolds[[2]],]
# etc...
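As a hedged alternative to the function above, caret's createFolds() stratifies on the levels of a factor, so passing the interaction of the two variables approximates the same goal (assuming every target/time combination has at least k rows):
library(caret)
set.seed(123)
# stratify on the combination of target and time
folds <- createFolds(interaction(data$target, data$time), k = 5)
str(folds)   # a list of 5 vectors of validation row indices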

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends, respectively: compare L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row), and if the difference is significant, return which column it is significantly different from.
It would help if someone could explain how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that differ the most on columns 2:4 at the top. Columns 5:7 can be handled in an analogous way.
And some tips for statistical tests:
Prefer t.test() (parametric, based on the mean) over the Wilcoxon/Mann-Whitney U test (non-parametric, based on the median), since it has more power; HOWEVER:
- Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test() will work fine when you take 100 people from each city, so we have height measurements for 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similarly non-normal distribution (it may, for example, be binomial). Either way, we can't use this test when one sample is normally distributed and the other isn't.
- Both samples should be of roughly equal size, so 100 vs 100 is OK, but 87 vs 234 is not ideal; the p-value may come out below 0.05 and still be misleading.
If your data doesn't meet the conditions above, I prefer a non-parametric test: less power, but more robust.
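Purely as an illustration of the two tests mentioned above (the height vectors below are made up, not taken from the question's data):
set.seed(1)
montreal <- rnorm(100, mean = 172, sd = 7)   # hypothetical heights in cm
quebec   <- rnorm(100, mean = 170, sd = 7)
t.test(montreal, quebec)        # parametric, compares means
wilcox.test(montreal, quebec)   # non-parametric Mann-Whitney U test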

Convert time values to numeric while keeping time characteristics

I have a data set which contains interval times of different events occurring. What I want to do is convert the data into a numeric vector so it's easier to manipulate, run summaries, and make graphs etc., while keeping its time characteristics. Here is a snippet of my data:
data <- c( "03:31", "12:17", "16:29", "09:52", "04:01", "09:00", "06:29",
"04:17", "04:42")
class(data)
[1] "character"
The obvious answer is :
as.numeric(data)
But I get this error:
Warning message:
NAs introduced by coercion
I thought of maybe taking the ':' out, but then it loses its time characteristics. By that, I mean that if I sum values together say 347 and 543, it would give me 890 as opposed to 930. Here is the code that I would use to take the colon out, which works fine for its purpose:
Nocolon <- gsub("[:]", "", data, perl = TRUE)
"0331" "1217" "1629" "0952" "0401" "0900" "0629" "0417" "0442"
So essentially, what I want is for my time values to be in a form which is easy to manipulate and analyse. My idea is a numeric vector, but that comes from my minimal understanding of R. My actual data has thousands of time values, and I want to create a plot that will allow me to view and determine whether the values follow a statistical distribution.
Thanks in advance!
Here are some approaches. All convert to minutes. For example, the first component is "03:31" which is 3 * 60 + 31 = 211 minutes. (1) to (5) do not use any packages.
1) %*% It works by reading data into a 2 column data frame with hours and minutes. That is converted to a matrix so that it can be matrix multiplied by c(60, 1). Finally, unravel it with c.
c(as.matrix(read.table(text = data, sep = ":")) %*% c(60, 1))
[1] 211 737 989 592 241 540 389 257 282
2) with This variation is even shorter. It creates the same data frame and then simply multiplies the first column (V1) by 60 and adds it to the second column (V2).
with(read.table(text = data, sep = ":"), 60*V1+V2)
[1] 211 737 989 592 241 540 389 257 282
3) complex This converts each component to a complex number and then performs the required arithmetic on the real and imaginary parts:
data_c <- as.complex(sub(":(\\d+)", "+\\1i", data))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
3a) This variation of (3) also works and avoids regular expressions:
data_c <- as.complex(paste0(chartr(":", "+", data), "i"))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
4) eval This converts each component into an arithmetic expression which evaluates to the number of minutes and then performs the evaluation. Using eval is not really recommended when you can avoid it, so this one is less desirable:
sapply(parse(text = sub("(\\d+):", "60*\\1+", data)), eval)
## [1] 211 737 989 592 241 540 389 257 282
5) POSIXlt We can convert to "POSIXlt" class and then use the hour and min components:
with(unclass(as.POSIXlt(data, format = "%H:%M")), 60 * hour + min)
## [1] 211 737 989 592 241 540 389 257 282
6) chron Using the chron package we can paste on the seconds, convert to "times" class and then convert to minutes:
library(chron)
24 * 60 * as.numeric(times(paste0(data, ":00")))
## [1] 211 737 989 592 241 540 389 257 282
7) lubridate Using the lubridate package we can convert it using hm and then to numeric giving seconds and finally dividing by 60 to give minutes:
as.numeric(hm(data)) / 60
## [1] 211 737 989 592 241 540 389 257 282
Use the as.difftime function designed for this:
as.difftime(data, format="%H:%M", units="mins")
#Time differences in mins
#[1] 211 737 989 592 241 540 389 257 282
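On the plotting part of the question: once the values are numeric minutes (from any of the conversions above), a quick distributional check might look like this (a sketch; mins is just the converted vector):
mins <- with(read.table(text = data, sep = ":"), 60 * V1 + V2)
hist(mins, main = "Interval times", xlab = "minutes")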

R summing row one with all rows

I am trying to analyse website data for AB testing.
My reference point is based on experimentName = Experiment 1 (control version)
experimentName UniquePageView UniqueFrequency NonUniqueFrequency
1 Experiment 1 459 294 359
2 Experiment 2 440 286 338
3 Experiment 3 428 273 348
What I need to do is add the Experiment 1 values of UniquePageView, UniqueFrequency and NonUniqueFrequency to the corresponding values in every row,
e.g.
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 2 ',
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 3 '
and so on and so forth (I could have an unlimited number of experiments),
then do the same for UniqueFrequency and NonUniqueFrequency (I could have an unlimited number of columns as well).
Result expected:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 918 588 718
2 Experiment 2 440 286 338 899 580 697
3 Experiment 3 428 273 348 887 567 707
here is the math behind it:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 459 + 459 294 + 294 359 + 359
2 Experiment 2 440 286 338 459 + 440 294 + 286 359 + 338
3 Experiment 3 428 273 348 459 + 428 294 + 273 359 + 348
In base R, you can do this in one line by column-binding (with cbind) the initial data frame to the sum of the initial data frame and a version that is just duplicates of the "Experiment 1" row.
cbind(dat, dat[,-1] + dat[rep(which(dat$experimentName == "Experiment 1"), nrow(dat)), -1])
# experimentName UniquePageView UniqueFrequency NonUniqueFrequency UniquePageView UniqueFrequency
# 1 Experiment 1 459 294 359 918 588
# 2 Experiment 2 440 286 338 899 580
# 3 Experiment 3 428 273 348 887 567
# NonUniqueFrequency
# 1 718
# 2 697
# 3 707
To update the column names at the end (assuming you stored the resulting data frame in res), you could use:
names(res)[4:6] <- c("CombinedPageView", "CombinedUniqueFrequency", "CombinedNonUniqueFrequency")
Do you know how to use dplyr? If you're new to R, this is a pretty good lesson to learn. dplyr includes the functions filter() and summarise(), which are all you need for this problem - very simple!
First, take your data frame
df
Then, filter to only the data you want, in this case when experimentName = Experiment 1
df
df <- filter(df, experimentName == "Experiment 1")
Now, summarise to find the sums of UniquePageView, UniqueFrequency and NonUniqueFrequency
df
df <- filter(df, experimentName == "Experiment 1")
summarise(df, SumUniquePageView = sum(UniquePageView),
          SumUniqueFrequency = sum(UniqueFrequency),
          SumNonUniqueFrequency = sum(NonUniqueFrequency))
This will return a small table with the answers you're looking for. For a slightly more advanced (but simpler) way to do this, you can use the piping operator %>% from the magrittr package. It passes the object from the previous statement as the first argument of the following statement, as follows:
df %>% filter(experimentName == "Experiment 1") %>% summarise(SumUniquePageView = sum(UniquePageView), etc)
If you don't yet have those packages, you can get them with install.packages("dplyr") and load them with library(dplyr).
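If the goal is specifically the pooled columns shown in the question, here is a hedged dplyr sketch (assuming dplyr >= 1.0 for across(); dat is the data frame from the base R answer above and the new column names are illustrative):
library(dplyr)
ref <- filter(dat, experimentName == "Experiment 1")   # the reference row
res <- dat %>%
  mutate(across(-experimentName,
                ~ .x + ref[[cur_column()]],            # add Experiment 1's value
                .names = "Pooled_{.col}"))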

How do I make sure numbers are numeric from a .txt?

I'm setting up a script to extract the thickness and voltages from a single-column text file and fit a Weibull distribution to the data. When I try to use fitdistr() I get an error stating "'x' must be a non-empty numeric vector". R is supposed to interpret numbers in text files as numeric, but that doesn't seem to be happening. Any thoughts?
filename <- "SampleBreakdownSet.txt"
d <- read.table(filename, header = FALSE, sep = "")
#Extract thickness from the dataset; set to variable t
t = d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV = tail(d,(nrow(d)-1))
#Calculates the breakdown field from the thickness and BDV
BDF = (BDV*10000) / t
#Calculates the Weibull parameters from the input breakdown voltages.
fitdistr(BDF, densfun ="weibull", lower = 0)
fitdistr(BDF, densfun ="weibull", lower = 0)
Error in fitdistr(BDF, densfun = "weibull", lower = 0) :
'x' must be a non-empty numeric vector
Sample data I'm using:
2
200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243
You are passing a data.frame to fitdistr, but you should be passing the vector itself.
Try this:
d <- read.table(text='200
250
450
320
100
400
200
403
502
203
420
120
342
304
253
423
534
534
243
253
423
123
433
534
234
633
432
342
543
532
123
453
231
532
342
213
243', header=FALSE)
t <- d[1,1]
#Extract the breakdown voltages and toss into dataset, BDV
BDV <- d[-1, 1]
BDF <- (BDV*10000) / t
library(MASS)
fitdistr(BDF, densfun ="weibull", lower = 0)
Alternatively, if you keep BDF as a one-column data frame (as in your original code), you can refer to the relevant column when calling fitdistr, e.g.:
fitdistr(BDF$V1, densfun ="weibull", lower = 0)
# shape scale
# 2.745485e+00 1.997509e+04
# (3.716797e-01) (1.283667e+03)
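To see why the original call failed, here is a small illustration of the data.frame-versus-vector distinction (the values are arbitrary):
d <- data.frame(V1 = c(200, 250, 450))
is.numeric(d)        # FALSE - a data.frame is not a numeric vector
is.numeric(d[1])     # FALSE - still a one-column data.frame
is.numeric(d[, 1])   # TRUE  - single column extracted as a vector
is.numeric(d$V1)     # TRUE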
