I am trying to use the designmatch package for cardinality matching of a treated group (n=88) to two untreated controls each. The output returns 88x3=264 group_id entries and 88 t_id entries, but only 88 c_id entries (instead of 88x2=176). I understand designmatch does not match with replacement by default, so I don't understand why I only get 88 c_id.
out <- bmatch(t_ind = t_ind, near_exact = near_exact, n_controls=2)
out
$obj_total
[1] -88
$obj_dist_mat
NULL
$t_id
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[44] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
[87] 87 88
$c_id
[1] 108 308 279 131 220 147 231 437 194 278 153 445 383 290 482 105 241 335 238 202 289 301 323 312 159 262 176 315 443 200 377 393
[33] 885 581 927 398 217 117 240 448 263 554 525 854 169 352 317 119 386 414 518 477 424 469 280 286 297 513 316 97 936 609 387 455
[65] 168 702 284 432 349 379 446 543 552 293 851 185 713 501 232 641 997 561 499 310 485 466 675 647
$group_id
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[44] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
[87] 87 88 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21
[130] 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42
[173] 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51 52 52 53 53 54 54 55 55 56 56 57 57 58 58 59 59 60 60 61 61 62 62 63 63 64
[216] 64 65 65 66 66 67 67 68 68 69 69 70 70 71 71 72 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80 80 81 81 82 82 83 83 84 84 85 85
[259] 86 86 87 87 88 88
Thanks for any help
Answer
The function does not appear to work properly, so this is likely not possible with designmatch. The package also does not seem to be actively maintained. My recommendation is to move on to a different package, such as MatchIt.
Details
I had an extensive look at the source code of the package and made several observations.
The group_id element in the output does not seem to be based on anything
In the output, you indeed see a group_id that has the correct dimensions. However, the numbers don't represent anything meaningful:
group_id_t = 1:(length(t_id))
group_id_c = sort(rep(1:(length(t_id)), n_controls))
group_id = c(group_id_t, group_id_c)
As you can see, they just create a vector group_id_t that runs from 1 to length(t_id) (the IDs of the treated group; see t_id in your output). Next, they create a vector group_id_c that contains the same sequence with each index repeated n_controls times. The final group_id is just the concatenation of the two.
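To make this concrete, here is a minimal sketch of what that construction produces with 3 treated units and n_controls = 2 (toy values, not your data):
t_id <- 1:3
group_id_t <- 1:length(t_id)                # 1 2 3
group_id_c <- sort(rep(1:length(t_id), 2))  # 1 1 2 2 3 3
c(group_id_t, group_id_c)                   # 1 2 3 1 1 2 2 3 3
This is exactly the pattern in your output: 1 to 88, followed by each number repeated twice.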
I looked for a matrix where these IDs are used, or one whose number of rows or columns matches the length of group_id, but I cannot find one. The numbers in group_id seem to carry no meaning.
The optimizer seems to optimize for n_controls or fewer
The bmatch function has several steps. First, it calculates some initial parameters. Second, it passes those parameters to an optimizer (in the default case GLPK, via Rglpk::Rglpk_solve_LP). Third, it does some calculations to create the output.
When you vary n_controls (1, 2, 10, etc.), only one of the initial parameters changes (bvec). This parameter essentially encodes how many matches should be found and is then entered as a constraint into the optimizer. However, I get the impression that something is wrong with bvec: it is entered with the condition <=, meaning that the optimizer only has to find a solution with n_controls or fewer matches. I tried looking deeper into how the initial parameters are determined, but that's several hundred lines of code, so I gave up.
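To illustrate the general mechanism (a toy LP, not the package's actual constraint matrix), a <= constraint by itself never forces the solver to use the full allowance:
library(Rglpk)
# Binary variables x1, x2 with the constraint x1 + x2 <= 2.
# Only x1 is rewarded by the objective, so (1, 0) and (1, 1) score the
# same and the solver is free to leave x2 at 0 -- "<=" allows fewer.
obj <- c(1, 0)
mat <- matrix(c(1, 1), nrow = 1)
Rglpk_solve_LP(obj, mat, "<=", rhs = 2, types = c("B", "B"), max = TRUE)
If bmatch's objective rewards only the treated units (which would be consistent with your obj_total of -88), the same logic could explain why you get fewer than n_controls matches.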
Final thoughts
The package was last updated on 2018-06-18, which suggests to me that the authors haven't looked at it for a while. You can/should contact them and see what they say. Alternatively, switch to a package like MatchIt that has been verified extensively.
Related
I have to get the order of one vector to sort another vector. The point is that I don't want my sort to be stable. In fact, I'd like a random order for equal values. Any idea how to do it in R in finite time? :D
Thanks for any help.
You can do this in base R using order. order will take multiple variables to sort on. If you make the second one a random variable, it will randomize the ties. Here is an example using the built-in iris data. The variable Sepal.Length has several ties for second-lowest value. Here are some:
iris$Sepal.Length[c(9,39,43)]
[1] 4.4 4.4 4.4
Now let's sort just that variable (stable sort) and then sort with a random secondary sort.
order(iris$Sepal.Length)
[1] 14 9 39 43 42 4 7 23 48 3 30 12 13 25 31 46 2 10 35
[20] 38 58 107 5 8 26 27 36 41 44 50 61 94 1 18 20 22 24 40
[39] 45 47 99 28 29 33 60 49 6 11 17 21 32 85 34 37 54 81 82
[58] 90 91 65 67 70 89 95 122 16 19 56 80 96 97 100 114 15 68 83
[77] 93 102 115 143 62 71 150 63 79 84 86 120 139 64 72 74 92 128 135
[96] 69 98 127 149 57 73 88 101 104 124 134 137 147 52 75 112 116 129 133
[115] 138 55 105 111 117 148 59 76 66 78 87 109 125 141 145 146 77 113 144
[134] 53 121 140 142 51 103 110 126 130 108 131 106 118 119 123 136 132
order(iris$Sepal.Length, sample(150,150))
[1] 14 43 39 9 42 48 7 4 23 3 30 25 31 46 13 12 35 38 107
[20] 10 58 2 8 41 27 61 94 5 36 44 50 26 18 22 99 40 20 47
[39] 24 45 1 33 60 29 28 49 85 11 6 32 21 17 90 81 91 54 34
[58] 37 82 67 122 95 65 70 89 100 96 56 114 80 16 19 97 93 15 68
[77] 143 102 83 115 150 62 71 120 79 84 63 139 86 72 135 74 64 92 128
[96] 149 69 98 127 88 134 101 57 137 73 104 147 124 138 112 129 116 75 52
[115] 133 148 55 111 105 117 59 76 87 66 78 146 141 109 125 145 144 113 77
[134] 140 53 121 142 51 103 126 130 110 108 131 106 136 119 118 123 132
Without the random secondary sort, positions 2, 3, and 4 are in order (stable). With the random secondary sort, they are jumbled.
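If you need this repeatedly, you can wrap it in a small helper (a sketch; order_random_ties is my name, not a base R function):
# Order x, breaking ties at random rather than stably.
order_random_ties <- function(x) order(x, sample(length(x)))
order_random_ties(c(2, 1, 1, 3))  # the two 1s come out in either order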
Try fct_reorder in the forcats package to order one factor by another. If you want to introduce randomness as well, try fct_reorder2 with .y = runif(length(your_vector)).
(I'm apparently thinking in strange directions today - fct_reorder will reorder the levels of a factor. If that's what you are after, this may help. Otherwise, order is the better approach.)
Suppose I have a data set and I want to do 4-fold cross-validation using logistic regression, so there will be 4 different models. In R, I did the following:
ctrl <- trainControl(method = "repeatedcv", number = 4, savePredictions = TRUE)
mod_fit <- train(outcome ~., data=data1, method = "glm", family="binomial", trControl = ctrl)
I would assume that mod_fit should contain 4 separate sets of coefficients? When I inspect mod_fit$finalModel I just get one set of coefficients.
I've created a reproducible example based on your code snippet. The first thing to notice about your code is that it specifies repeatedcv as the method but doesn't give any repeats, so the number=4 parameter is just telling it to resample 4 times (this is not an answer to your question but is important to understand).
mod_fit$finalModel gives you only one set of coefficients because it is the single final model that caret fits once resampling has finished; the 4 folds are used to estimate performance, not to produce 4 retained models.
You can see the fold-level performance in the resample object:
library(caret)
library(mlbench)
data(iris)
iris$binary <- ifelse(iris$Species=="setosa",1,0)
iris$Species <- NULL
ctrl <- trainControl(method = "repeatedcv",
                     number = 4,
                     savePredictions = TRUE,
                     verboseIter = TRUE,
                     returnResamp = "all")
mod_fit <- train(binary ~ .,
                 data = iris,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl)
# Fold-level Performance
mod_fit$resample
RMSE Rsquared parameter Resample
1 2.630866e-03 0.9999658 none Fold1.Rep1
2 3.863821e-08 1.0000000 none Fold2.Rep1
3 8.162472e-12 1.0000000 none Fold3.Rep1
4 2.559189e-13 1.0000000 none Fold4.Rep1
To your earlier point, the package is not going to save and display information on the coefficients of each fold. In addition to the performance information above, it does however save the index (list of in-sample rows), indexOut (held-out rows), and random seeds for each fold, so if you were so inclined it would be easy to reconstruct the intermediate models (a sketch follows the output below).
mod_fit$control$seeds
[[1]]
[1] 169815
[[2]]
[1] 445763
[[3]]
[1] 871613
[[4]]
[1] 706905
[[5]]
[1] 89408
mod_fit$control$index
$Fold1
[1] 1 2 3 4 5 6 7 8 9 10 11 12 15 18 19 21 22 24 28 30 31 32 33 34 35 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 59 60 61 63
[45] 64 65 66 68 69 70 71 72 73 75 76 77 79 80 81 82 84 85 86 87 89 90 91 92 93 94 95 96 98 99 100 103 104 106 107 108 110 111 113 114 116 118 119 120
[89] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 140 141 142 143 145 147 149 150
$Fold2
[1] 1 6 7 8 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 42 44 46 48 50 51 53 54 55 56 57 58
[45] 59 61 62 64 66 67 69 70 71 72 73 74 75 76 78 79 80 81 82 83 84 85 87 88 89 90 91 92 95 96 97 98 99 101 102 104 105 106 108 109 111 112 113 115
[89] 116 117 119 120 121 122 123 127 130 131 132 134 135 137 138 139 140 141 142 143 144 145 146 147 148
$Fold3
[1] 2 3 4 5 6 7 8 9 10 11 13 14 16 17 20 23 24 25 26 27 28 29 30 33 35 36 37 38 39 40 41 43 45 46 47 49 50 51 52 54 55 56 57 58
[45] 60 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 82 83 84 85 86 88 89 93 94 97 98 99 100 101 102 103 105 106 107 108 109 110 111 112 114 115
[89] 117 118 119 121 124 125 126 128 129 131 132 133 134 135 136 137 138 139 144 145 146 147 148 149 150
$Fold4
[1] 1 2 3 4 5 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 29 31 32 34 36 37 38 39 41 42 43 44 45 47 48 49 52 53 55 56
[45] 57 58 59 60 61 62 63 65 67 68 74 77 78 79 80 81 83 86 87 88 90 91 92 93 94 95 96 97 100 101 102 103 104 105 107 109 110 112 113 114 115 116 117 118
[89] 120 122 123 124 125 126 127 128 129 130 133 136 137 138 139 140 141 142 143 144 146 148 149 150
mod_fit$control$indexOut
$Resample1
[1] 13 14 16 17 20 23 25 26 27 29 36 37 38 39 55 56 57 58 62 67 74 78 83 88 97 101 102 105 109 112 115 117 137 138 139 144 146 148
$Resample2
[1] 2 3 4 5 9 10 11 24 41 43 45 47 49 52 60 63 65 68 77 86 93 94 100 103 107 110 114 118 124 125 126 128 129 133 136 149 150
$Resample3
[1] 1 12 15 18 19 21 22 31 32 34 42 44 48 53 59 61 79 80 81 87 90 91 92 95 96 104 113 116 120 122 123 127 130 140 141 142 143
$Resample4
[1] 6 7 8 28 30 33 35 40 46 50 51 54 64 66 69 70 71 72 73 75 76 82 84 85 89 98 99 106 108 111 119 121 131 132 134 135 145 147
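As an example, here is a sketch of reconstructing the fold-1 model by hand, assuming the mod_fit object built above:
# Refit the same glm on the stored in-sample rows for fold 1.
fold1_rows <- mod_fit$control$index$Fold1
fold1_fit <- glm(binary ~ ., data = iris[fold1_rows, ], family = binomial)
coef(fold1_fit)  # coefficients of the fold-1 intermediate model
# (glm may warn about separation, since setosa is perfectly separable)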
@Damien, your mod_fit will not contain 4 separate sets of coefficients. You are asking for cross-validation with 4 folds. This does not mean you will have 4 different models. According to the documentation, the train function works as follows:
At the end of the resampling loop (in your case, 4 iterations for 4 folds) you will have one set of average forecast accuracy measures (e.g., RMSE, R-squared) for a given set of model parameters.
Since you did not use the tuneGrid or tuneLength argument of train, by default train will tune over three values of each tunable parameter.
This means you will have at most three models (not the 4 models you were expecting) and therefore three sets of average model performance measures.
The optimal model is the one with the lowest RMSE in the case of regression. That model's coefficients are available in mod_fit$finalModel.
I have plotted a bar graph, and knowing that the transition in the curve takes place around values 21 to 25, I want to find such a point in general.
barplot(Views$V2,names=Views$V1,las=2,cex.names=0.2,border="blue",xpd=FALSE)
abline(v=25,col="red")
Any help is appreciated
Update:
A part of my data frame is given below. It is sorted by the values of V2. By "transition" I mean a point in the sorted data where there is a large deviation, after which the data continue in that pattern; in other words, a point where the pattern changes.
V1 V2
1 1 16154424
2 2 3701944
3 3 1618377
4 4 903302
5 5 569824
6 6 389772
7 7 281751
8 8 212450
9 9 166364
10 10 133339
11 11 109410
12 12 90934
13 13 77155
14 14 66124
15 15 57861
16 16 50765
17 17 44805
18 18 39996
19 19 35850
20 20 32492
21 21 29522
22 22 27152
23 23 24821
24 24 22619
25 25 21238
26 26 19639
27 27 18320
28 28 16867
29 29 15890
30 30 14936
31 31 14252
32 32 13150
33 33 12696
34 34 11656
35 35 11191
36 36 10951
37 37 10232
38 38 9605
39 39 9058
40 40 8916
41 41 8531
42 42 8010
43 43 7932
44 44 7436
45 45 6991
46 46 6750
47 47 6613
48 48 6254
49 49 6292
50 50 5731
51 51 5659
52 52 5551
53 53 5396
54 54 5122
55 55 4845
56 56 4860
57 57 4591
58 58 4504
59 59 4233
60 60 4371
61 61 4014
62 62 4083
63 63 3923
64 64 3796
65 65 3616
66 66 3519
67 67 3466
68 68 3409
69 69 3357
70 70 3215
71 71 3118
72 72 3081
73 73 3040
74 74 2951
75 75 2808
76 76 2797
77 77 2829
78 78 2714
79 79 2564
80 80 2563
81 81 2445
82 82 2528
83 83 2443
84 84 2316
85 85 2314
86 86 2212
87 87 2215
88 88 2102
89 89 2172
90 90 2020
91 91 2108
92 92 2020
93 93 2027
94 94 1982
95 95 1936
96 96 1836
97 97 1801
98 98 1850
99 99 1751
100 100 1810
It probably has something to do with the ratios of successive differences, something that becomes evident when observing the graph.
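One way to formalize that ratio-of-differences idea is the following sketch (the 0.07 cutoff is arbitrary and chosen only to suit the data above; this is an illustration, not a definitive method):
# Fraction by which V2 drops at each step of the sorted data.
rel_drop <- -diff(Views$V2) / head(Views$V2, -1)
# First position where the relative drop falls below the cutoff.
transition <- which(rel_drop < 0.07)[1]
transition  # 24 for the data above, inside the observed 21-25 range
# Mark it on the plot; barplot returns the bar midpoints, which are not
# at integer x positions, so reuse them for the vertical line.
mids <- barplot(Views$V2, names = Views$V1, las = 2, cex.names = 0.2,
                border = "blue", xpd = FALSE)
abline(v = mids[transition], col = "red")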
The max.print option (queried with getOption("max.print")) can be used to limit the number of values printed by a single call. For example:
options(max.print=20)
print(cars)
prints only the first 10 rows of 2 columns. However, max.print doesn't work very well for lists. Especially if they are nested deeply, the number of lines printed to the console can still be enormous.
Is there any way to specify a harder cutoff on the amount that can be printed to the screen? For example, by specifying the number of lines after which printing is interrupted? Something that also protects against printing huge recursive objects?
Based in part on this question, I would suggest just building a wrapper for print that uses capture.output to regulate what is printed:
print2 <- function(x, nlines = 10, ...)
    cat(head(capture.output(print(x, ...)), nlines), sep = "\n")
For example:
> print2(list(1:10000,1:10000))
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[13] 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60
[61] 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101 102 103 104 105 106 107 108
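One caveat: capture.output still formats the entire object before head() truncates it, so this caps what reaches the screen but not the time spent formatting a truly huge object. If you also want a visible marker when output was cut off, a small variation (my own, not from the linked question) is:
print3 <- function(x, nlines = 10, ...) {
    out <- capture.output(print(x, ...))
    cat(head(out, nlines), sep = "\n")
    if (length(out) > nlines)
        cat("... [", length(out) - nlines, " more lines]\n", sep = "")
}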
Why does the equal.count() function create overlapping shingles when it is clearly possible to create groupings with no overlap? Also, on what basis are the overlaps decided?
For example:
equal.count(1:100,4)
Data:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
[89] 89 90 91 92 93 94 95 96 97 98 99 100
Intervals:
min max count
1 0.5 40.5 40
2 20.5 60.5 40
3 40.5 80.5 40
4 60.5 100.5 40
Overlap between adjacent intervals:
[1] 20 20 20
Wouldn't it be better to create groups of size 25? Or maybe I'm missing something that makes this functionality useful?
The overlap smooths transitions between the shingles (which, as the name suggests, overlap like shingles on a roof), but a better choice would have been some windowing function, such as those used in spectral analysis.
I believe it is a prehistoric relic: the behavior goes back to some very old pre-lattice code and is used in coplot, which is remembered only by veteRans. lattice::equal.count calls co.intervals from the graphics package, where you will find some explanation; its overlap argument controls the fraction of points shared by adjacent intervals. Try:
lattice::equal.count(1:100, 4, overlap = 0)