This is a simplified version of my current problem. I need to create a model.matrix from 2 model matrices, without losing the info in "assign". For example, consider the data and formula
y <- rnorm(100); x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
f1 <- y ~ x1 + x2 + x3
and 2 model matrices X1 and X2 created using
trms <- terms.formula(f1)
trms2 <- drop.terms(trms, dropx = 2)
trms3 <- drop.terms(trms, dropx = -2)
X1 <- model.matrix(trms2)
X2 <- model.matrix(trms3)
Is there an easy way to create from X1 and X2 a matrix X with 1 intercept column and with attr(,"assign") that would have been obtained from f1?
I'm not completely sure if this is what you are trying to do, but cbind() seems to work fine in this case.
X <- cbind(X1, X2)
X <- X[, !duplicated(colnames(X))]
You can then concatenate the attributes from X1 and X2. To avoid duplicates, take only the assign info from X2 that isn't already present in X1:
attributes(X)$assign <- c(attr(X1,"assign"), attr(X2,"assign")[!attr(X2,"assign") %in% attr(X1,"assign")])
If this is not what you were trying to do, let us know.
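Note, though, that concatenating assign vectors by value can misnumber terms: here attr(X1, "assign") is c(0, 1, 2) even though x3 is term 3 in f1, and x2's code gets filtered out entirely. A sketch that instead reconstructs "assign" by matching term labels against the full formula (assuming trms, trms2 and trms3 from the question are still available):
# Combine the matrices, then rebuild "assign" from term labels
term_label <- function(M, trm) {
  # map each column's assign code to its term label; code 0 is the intercept
  c("(Intercept)", attr(trm, "term.labels"))[attr(M, "assign") + 1L]
}
labs <- c(term_label(X1, trms2), term_label(X2, trms3))
X    <- cbind(X1, X2)
keep <- !duplicated(colnames(X))
labs <- labs[keep]
asgn <- match(labs, attr(trms, "term.labels"), nomatch = 0L)  # 0 for intercept
ord <- order(asgn)              # restore the column order of f1
X   <- X[, keep][, ord]
attr(X, "assign") <- asgn[ord]  # matrix subsetting drops "assign", so re-attach last
attr(X, "assign")               # 0 1 2 3, as model.matrix(f1) would give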
If I understand the question correctly, how about something simple and direct like:
X3 <- cbind(X1[,1:2], X2[,2], X1[,3])
attr(X3,"assign") <- c(0,1,2,3)
colnames(X3) <- c("Intercept",attr(trms, "term.labels"))
head(X3)
Intercept x1 x2 x3
1 1 -1.28372461 -0.2598796 0.3028496
2 1 0.56880875 0.2803302 0.7593734
3 1 -0.32480770 -1.6705911 -1.1750247
4 1 -1.02761734 -0.1405454 -0.6805033
5 1 0.84218452 -0.1224962 -1.3882420
6 1 0.07221231 0.5587801 -0.9042751
I am interested in solving this kind of problem (interpolation/midpoint):
Suppose I have this kind of data (I randomly generated it, so it might not exactly fit this setup):
x1 = runif(100)
x2 = x1+ runif(100)
y1 = runif(100)
y2 = y1 + runif(100)
z1 = y2 - (runif(100)/3)
interpolation_data = data.frame(x1, x2, y1, y2, z1)
> head(interpolation_data)
x1 x2 y1 y2 z1
1 0.98851249 1.8413195 0.5362105 0.6490267 0.4130697
2 0.87617261 0.9825167 0.5628463 0.9246420 0.7357799
3 0.05901131 0.4298880 0.2463369 1.1585020 1.0614156
4 0.26843176 0.6694523 0.9467490 1.1860874 1.0859276
5 0.86021705 1.4767606 0.9956568 1.2048599 1.1932450
6 0.19368972 0.2269496 0.7618794 0.9763212 0.8594290
I want to calculate a value of z2 for each row. I did this myself:
# two-point interpolation: x1 plus the fraction of the way z1 lies
# between y1 and y2, times (x2 - x1)
interpolation_data$z2 <- with(interpolation_data,
  x1 + (z1 - y1) * (x2 - x1) / (y2 - y1))
Are there any built-in ways in R to do this? I tried looking for something and nothing came up that exactly matched this kind of problem.
Thank you!
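For what it's worth, assuming the goal is the standard two-point linear interpolation of x at z1 along the line through (y1, x1) and (y2, x2), the closest built-in is approx(), applied row by row. It only interpolates within the range, so rule = 2 is used here to clamp any z1 outside [y1, y2] to the nearest endpoint:
# Row-by-row linear interpolation with the built-in approx()
interpolation_data$z2_builtin <- with(interpolation_data, mapply(
  function(x1, x2, y1, y2, z1)
    approx(x = c(y1, y2), y = c(x1, x2), xout = z1, rule = 2)$y,
  x1, x2, y1, y2, z1
))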
When using the matchit function for full matching, the results depend on the order of the input data frame. That is, if the order of the data is changed, the results change too. This is surprising, because in my understanding the optimal full matching algorithm should yield a single best solution.
Am I missing something, or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
# create data
library(MatchIt)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument that fullmatch has. The matching algorithm requires integer distances, so we have to scale, then truncate, floating point distances. For a given set of integer distances, there may be multiple matchings that achieve the minimum, so the solver is free to pick among these non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean = 50, sd = 20)
outcome <- rnorm(100, mean = 500, sd = 20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1 = x1, outcome = outcome, group = group, row.names = nr, nr = nr)
df_neworder <- df_oldorder[order(outcome), ] # re-order data.frame
glm_oldorder <- match_on(glm(group ~ x1, data = df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group ~ x1, data = df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data = df_oldorder)
fm_new <- fullmatch(glm_neworder, data = df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
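In code, that means passing a smaller tol to fullmatch; matchit() with method = "full" should forward extra arguments through to fullmatch as well, though check the documentation of your version:
# Tighter tolerance: more stable across row orderings, possibly slower
fm_old_tight <- fullmatch(glm_oldorder, data = df_oldorder, tol = 1e-7)
fm_new_tight <- fullmatch(glm_neworder, data = df_neworder, tol = 1e-7)
# the same via MatchIt, assuming tol is passed through to fullmatch
model_tight <- matchit(group ~ x1, data = df, method = "full",
                       distance = "logit", tol = 1e-7)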
Several functions in R treat certain functions of variables on the right-hand side of a formula specially, for example s in mgcv or strata in survival. In my case, I want particular functions of variables to be taken out of the model matrix and treated specially. I can't see how to do this other than by using grep on the column names (see below), which also fails if f(.) has not been used in the formula. Does anyone have a more elegant solution? I have looked in survival and mgcv, but I find the code very hard to follow, and it is overkill for my needs. Thanks.
f <- function(x) {
# do stuff
return(x)
}
data <- data.frame(y = rnorm(10),
x1 = rnorm(10),
x2 = rnorm(10),
s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)
mf <- model.frame(formula, data)
x <- model.matrix(formula, mf)
desired_x <- x[ , -grep("f\\(", colnames(x))]
desired_f <- x[ , grep("f\\(", colnames(x))]
output:
> head(desired_x)
(Intercept) x1 x2
1 1 0.29864902 0.1474018
2 1 -0.03192798 -0.4424467
3 1 -0.83716557 1.0268295
4 1 -0.74094149 1.1094299
5 1 1.38706580 -0.2339486
6 1 -0.52925896 1.2866540
> desired_f
1 2 3 4 5 6
0.46751965 0.65939178 -1.35835634 -0.05322648 -0.09286254 1.05423067
7 8 9 10
-1.71971996 0.71743985 -0.65993305 -0.79821349
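For what it's worth, a more robust route is the specials argument of terms(), which is the same hook survival and mgcv use to mark such functions. A minimal sketch, assuming f(s) enters only as a main effect (no interactions), so that its variable index minus one is its term index:
# Ask terms() to record where f() appears in the formula
tt <- terms(formula, specials = "f")   # formula may or may not contain f()
sp <- attr(tt, "specials")$f           # NULL when no f() term is present
mf <- model.frame(tt, data)
if (is.null(sp)) {
  desired_x <- model.matrix(tt, mf)
  desired_f <- NULL
} else {
  desired_f <- mf[[sp]]                # the evaluated f(s) values
  tt2 <- drop.terms(tt, sp - 1L, keep.response = TRUE)
  desired_x <- model.matrix(tt2, mf)
}
Unlike grep, this degrades gracefully when the formula contains no f() term.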
I am trying to simulate a dataset and am stuck on the following, as I'm fairly new to R. Here's the code I have so far:
set.seed(10)
n <- 300
x1a <- rnorm(100,1,2)
x1b <- rnorm(100,0,1)
x1c <- rnorm(100,1,0.5)
x1 <- c(x1a,x1b,x1c)
x2a <- rnorm(100,1,2)
x2b <- rnorm(100,1,1)
x2c <- rnorm(100,0,0.5)
x2 <- c(x2a,x2b,x2c)
This is what I wanted to create:
Dataset with 300 observations and three variables: x1, x2 and g.
g has 3 levels, with level 1 covering observations 1-100, level 2 observations 101-200, and level 3 observations 201-300.
x1 and x2 have the following distributions:
the first 100 obs have mean 1 for x1, mean 1 for x2, sd of 2 for both;
the second 100 obs have mean 0 for x1, mean 1 for x2, sd of 1 for both;
the third 100 obs have mean 1 for x1, mean 0 for x2, sd of 0.5 for both.
I was able to create the first two vectors, x1 and x2, in my code above. However, when creating g, I'm not sure where to begin or how to incorporate it into the dataset. I am also not sure how to combine everything into a single dataset with the full 300 observations and all of these variables.
Any suggestions?
Thanks!
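A minimal sketch of one way to finish, using base R only: build g with rep() and put everything in a data frame:
# g: level 1 for obs 1-100, level 2 for 101-200, level 3 for 201-300
g <- factor(rep(1:3, each = 100))
sim_data <- data.frame(x1 = x1, x2 = x2, g = g)
str(sim_data)  # 300 obs. of 3 variables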
I have an experiment where I measured a bit less than 200 variables in triplicate. In other words, I have three vectors of ~ 200 values.
I want a quick way to determine if I should use the mean or the median for my calculations. I can do the mean easily ((v1 + v2 + v3) / 3), but how do I calculate the SD so that I end up with a vector of ~200 SDs? And what about the median?
After having these values, I need to do growth curves (measurements were taken over a certain period of time).
Here is a dplyr solution:
library(dplyr)
d <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10)
)
d %>%
rowwise() %>%
mutate(
mean = mean(c(x1, x2, x3)),
median = median(c(x1, x2, x3)),
sd = sd(c(x1, x2, x3))
)
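With dplyr 1.0 or later, c_across() expresses the same row-wise computation and scales better to many columns; an equivalent variant:
d %>%
  rowwise() %>%
  mutate(
    mean   = mean(c_across(x1:x3)),
    median = median(c_across(x1:x3)),
    sd     = sd(c_across(x1:x3))
  ) %>%
  ungroup()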
It sounds like you also have a substantive question about longitudinal data. If so, Cross Validated would be a good platform for that question.
apply is what you want here. Put your vectors in a matrix, e.g.
mydat <- matrix(rnorm(600), ncol = 3)
means <- apply(mydat, MARGIN = 1, mean) # MARGIN = 1 is rows, MARGIN = 2 would be columns...
sds <- apply(mydat, MARGIN = 1, sd)
medians <- apply(mydat, MARGIN = 1, median)
Though I have to say, with 3 values each, using median sounds pretty questionable.
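As an aside, for the mean there is also the vectorised rowMeans(), which avoids the per-row function calls (base R has no analogous shortcut for sd or median):
means <- rowMeans(mydat)  # same result as apply(mydat, 1, mean), but faster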
A traditional 'for' loop can also be used, though it is not preferred:
for (i in 1:nrow(d)) d[i, 4] <- mean(unlist(d[i, 1:3]))
for (i in 1:nrow(d)) d[i, 5] <- sd(unlist(d[i, 1:3]))
for (i in 1:nrow(d)) d[i, 6] <- median(unlist(d[i, 1:3]))
names(d)[4:6] <- c('meanval', 'sdval', 'medianval')
d
x1 x2 x3 meanval sdval medianval
1 -1.3230176 0.6956100 -0.7210798 -0.44949580 1.0363556 -0.7210798
2 -1.8931166 0.9047873 -1.0378874 -0.67540558 1.4337404 -1.0378874
3 -0.2137543 0.1846733 0.6410478 0.20398893 0.4277283 0.1846733
4 0.1371915 -1.0345325 -0.2260038 -0.37444827 0.5998009 -0.2260038
5 -0.8662465 -0.8229465 -0.2230030 -0.63739866 0.3595296 -0.8229465
6 -0.2918697 -1.3543493 1.3025262 -0.11456426 1.3372826 -0.2918697
7 -0.4931936 1.7186173 1.3757156 0.86704643 1.1904138 1.3757156
8 0.3982403 -0.3394208 1.9316059 0.66347514 1.1585131 0.3982403
9 -1.0332427 -0.3045905 1.1513260 -0.06216908 1.1122775 -0.3045905
10 -1.5603811 -0.1709146 -0.5409815 -0.75742575 0.7195765 -0.5409815
Using d from #DMC's answer.