Simulating a dataset in R

I am trying to simulate a dataset and am a bit stuck on the following, as I'm a bit new to R. Here's the code I have so far:
set.seed(10)
n <- 300
x1a <- rnorm(100,1,2)
x1b <- rnorm(100,0,1)
x1c <- rnorm(100,1,0.5)
x1 <- c(x1a,x1b,x1c)
x2a <- rnorm(100,1,2)
x2b <- rnorm(100,1,1)
x2c <- rnorm(100,0,0.5)
x2 <- c(x2a,x2b,x2c)
This is what I wanted to create:
Dataset with 300 observations and three variables: x1, x2 and g.
g has 3 levels, with level 1 having observations 1-100, level 2 having obs 101-200, and level 3 having obs 201-300.
x1 and x2 are drawn with the following parameters:
the first 100 obs have mean 1 for x1, mean 1 for x2, sd of 2 for both;
the second 100 obs have mean 0 for x1, mean 1 for x2, sd of 1 for both;
the third 100 obs have mean 1 for x1, mean 0 for x2, sd of 0.5 for both.
I was able to create the first two vectors, x1 and x2, in my code above. However, I'm not sure where to begin with g or how to incorporate it into the dataset. I'm also not sure how to combine everything into a single dataset containing all 300 observations with these properties.
Any suggestions?
Thanks!
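A minimal sketch of one way to finish this, building on the x1 and x2 vectors above: rep() creates the grouping variable, factor() turns it into the 3-level factor g, and data.frame() combines everything into one 300-row dataset.
g <- factor(rep(1:3, each = 100)) # level 1 = obs 1-100, level 2 = obs 101-200, level 3 = obs 201-300
dat <- data.frame(x1 = x1, x2 = x2, g = g)
str(dat) # 300 obs. of 3 variables: x1, x2, g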

Related

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit function for full matching, the results differ depending on the order of the input data frame. That is, if the order of the data is changed, the results change too. This is surprising because, in my understanding, optimal full matching should yield a single best solution.
Am I missing something, or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
library(MatchIt)
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group ~ x1, data = df, method = "full", distance = "logit")
model_neworder <- matchit(group ~ x1, data = df_neworder, method = "full", distance = "logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data=df_oldorder)
glm_neworder <- match_on(glm(group~x1, data=df_neworder), data=df_neworder)
fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
fm_new <- fullmatch(glm_neworder, data=df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
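In practice this suggests tightening the tolerance if order-invariant results matter to you. A hedged sketch (tol is fullmatch's tolerance argument, and matchit with method = "full" forwards extra arguments to its optmatch backend; a smaller tol trades run time for precision):
model_tight <- matchit(group ~ x1, data = df, method = "full",
                       distance = "logit", tol = 1e-7) # tighter than the 0.001 default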

merge/cbind model matrices

This is a simplified version of my current problem. I need to create a model.matrix from 2 model matrices, without losing the info in "assign". For example, consider data and formula
y <- rnorm(100); x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
f1 <- y ~ x1 + x2 + x3
and 2 model matrices X1 and X2 created using
trms <- terms.formula(f1)
trms2 <- drop.terms(trms, dropx = 2)
trms3 <- drop.terms(trms, dropx = -2)
X1 <- model.matrix(trms2)
X2 <- model.matrix(trms3)
Is there an easy way to create from X1 and X2 a matrix X with 1 intercept column and with attr(,"assign") that would have been obtained from f1?
I'm not completely sure if this is what you are trying to do, but cbind() seems to work fine in this case.
X <- cbind(X1, X2)
X <- X[, !duplicated(colnames(X))]
You can then concatenate the attributes from X1 and X2. To avoid duplicates, take from X2 only the assign info that isn't already present in X1:
attributes(X)$assign <- c(attr(X1,"assign"), attr(X2,"assign")[!attr(X2,"assign") %in% attr(X1,"assign")])
If this is not what you were trying to do, let us know.
If I understand the question correctly, how about something simple and direct like:
X3 <- cbind(X1[,1:2], X2[,2], X1[,3])
attr(X3,"assign") <- c(0,1,2,3)
colnames(X3) <- c("Intercept",attr(trms, "term.labels"))
head(X3)
Intercept x1 x2 x3
1 1 -1.28372461 -0.2598796 0.3028496
2 1 0.56880875 0.2803302 0.7593734
3 1 -0.32480770 -1.6705911 -1.1750247
4 1 -1.02761734 -0.1405454 -0.6805033
5 1 0.84218452 -0.1224962 -1.3882420
6 1 0.07221231 0.5587801 -0.9042751
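As a quick sanity check (a sketch reusing the objects defined above), you can compare the hand-built assign attribute against what model.matrix() produces for the full formula f1:
X_full <- model.matrix(trms) # model matrix built directly from f1's terms
attr(X_full, "assign") # 0 1 2 3 -- the same values set on X3 above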

Extract function calls from the right hand side of a formula

Several functions in R treat certain functions of variables on the right hand side of a formula specially, for example s in mgcv or strata in survival. In my case, I want particular functions of variables to be taken out of the model matrix and treated specially. I can't see how to do this other than using grep on the column names (see below), which also doesn't work if f(.) has not been used in the formula. Does anyone have a more elegant solution? I have looked in survival and mgcv, but I find the code very hard to follow and it is overkill for my needs. Thanks.
f <- function(x) {
  # do stuff
  return(x)
}
data <- data.frame(y = rnorm(10),
                   x1 = rnorm(10),
                   x2 = rnorm(10),
                   s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)
mf <- model.frame(formula, data)
x <- model.matrix(formula, mf)
desired_x <- x[ , -grep("f\\(", colnames(x))]
desired_f <- x[ , grep("f\\(", colnames(x))]
output:
> head(desired_x)
(Intercept) x1 x2
1 1 0.29864902 0.1474018
2 1 -0.03192798 -0.4424467
3 1 -0.83716557 1.0268295
4 1 -0.74094149 1.1094299
5 1 1.38706580 -0.2339486
6 1 -0.52925896 1.2866540
> desired_f
1 2 3 4 5 6
0.46751965 0.65939178 -1.35835634 -0.05322648 -0.09286254 1.05423067
7 8 9 10
-1.71971996 0.71743985 -0.65993305 -0.79821349
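For reference, a sketch of the specials mechanism in base R's terms(), which is the hook survival uses for strata(): flag f as special, and terms() records where it occurs, so no grep on column names is needed. (The recorded index counts the response y, hence the - 1 when dropping the corresponding term; that offset assumes a single response and no offset terms, as in this example.)
tt <- terms(y ~ x1 + x2 + f(s), specials = "f")
idx <- attr(tt, "specials")$f # 4: position of f(s) among the variables
mf2 <- model.frame(tt, data)
desired_f2 <- mf2[[idx]] # the evaluated f(s) column
desired_x2 <- model.matrix(drop.terms(tt, idx - 1L), mf2) # model matrix without f(s)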

Mean and standard deviation of triplicated vector data

I have an experiment where I measured a bit less than 200 variables in triplicate. In other words, I have three vectors of ~ 200 values.
I want a quick way to determine whether I should use the mean or the median for my calculations. I can compute the mean easily ((v1 + v2 + v3) / 3), but how do I calculate the SD so that I end up with a vector of ~200 SDs? And what about the median?
After having these values, I need to do growth curves (measurements were taken over certain period of time).
Here is a dplyr solution:
require(dplyr)
d <- data.frame(
  x1 = rnorm(10),
  x2 = rnorm(10),
  x3 = rnorm(10)
)
d %>%
  rowwise() %>%
  mutate(
    mean = mean(c(x1, x2, x3)),
    median = median(c(x1, x2, x3)),
    sd = sd(c(x1, x2, x3))
  )
It sounds like you also have a substantive question about longitudinal data. If so, Cross Validated would be a good platform for that question.
apply is what you want here. Put your vectors into a matrix, e.g.
mydat <- matrix(rnorm(600), ncol = 3)
means <- apply(mydat, MARGIN = 1, mean) # MARGIN = 1 is rows, MARGIN = 2 would be columns...
sds <- apply(mydat, MARGIN = 1, sd)
medians <- apply(mydat, MARGIN = 1, median)
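For the mean specifically, base R's rowMeans() gives the same result as the apply() call and is vectorized:
means <- rowMeans(mydat) # equivalent to apply(mydat, MARGIN = 1, mean), but faster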
Though I have to say, with 3 values each, using median sounds pretty questionable.
A traditional for loop can also be used, though it is not the preferred approach:
for (i in 1:nrow(d)) d[i, 4] <- mean(unlist(d[i, 1:3]))
for (i in 1:nrow(d)) d[i, 5] <- sd(unlist(d[i, 1:3]))
for (i in 1:nrow(d)) d[i, 6] <- median(unlist(d[i, 1:3]))
names(d)[4:6] <- c('meanval', 'sdval', 'medianval')
d
x1 x2 x3 meanval sdval medianval
1 -1.3230176 0.6956100 -0.7210798 -0.44949580 1.0363556 -0.7210798
2 -1.8931166 0.9047873 -1.0378874 -0.67540558 1.4337404 -1.0378874
3 -0.2137543 0.1846733 0.6410478 0.20398893 0.4277283 0.1846733
4 0.1371915 -1.0345325 -0.2260038 -0.37444827 0.5998009 -0.2260038
5 -0.8662465 -0.8229465 -0.2230030 -0.63739866 0.3595296 -0.8229465
6 -0.2918697 -1.3543493 1.3025262 -0.11456426 1.3372826 -0.2918697
7 -0.4931936 1.7186173 1.3757156 0.86704643 1.1904138 1.3757156
8 0.3982403 -0.3394208 1.9316059 0.66347514 1.1585131 0.3982403
9 -1.0332427 -0.3045905 1.1513260 -0.06216908 1.1122775 -0.3045905
10 -1.5603811 -0.1709146 -0.5409815 -0.75742575 0.7195765 -0.5409815
Using d from #DMC's answer.

split on factor, sapply, and lm [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I want to apply lm() to observations grouped by subject, but cannot work out the sapply syntax. At the end, I want a dataframe with 1 row for each subject, containing the intercept and slope (i.e., rows of: subj, lm$coefficients[1], lm$coefficients[2]).
set.seed(1)
subj <- rep(c("a","b","c"), 4) # 4 observations each on 3 experimental subjects
ind <- rnorm(12) #12 random numbers, the independent variable, the x axis
dep <- rnorm(12) + .5 #12 random numbers, the dependent variable, the y axis
df <- data.frame(subj=subj, ind=ind, dep=dep)
s <- split(df, subj) # create a list of observations by subject
I can pull a single set of observations from s, make a dataframe, and get what I want:
df2 <- as.data.frame(s[1])
df2
lm1 <- lm(df2$a.dep ~ df2$a.ind)
lm1$coefficients[1]
lm1$coefficients[2]
I am having trouble looping over all the elements of s and getting the data into the final form I want:
lm.list <- sapply(s, FUN = function(x)
  lm(x[, "dep"] ~ x[, "ind"]))
a <- as.data.frame(lm.list)
I feel like I need some kind of transpose of the structure below; the columns (a,b,c) are what I want my rows to be, but t(a) does not work.
head(a)
a
coefficients 0.1233519, 0.4610505
residuals 0.4471916, -0.3060402, 0.4460895, -0.5872409
effects -0.6325478, 0.6332422, 0.5343949, -0.7429069
rank 2
fitted.values 0.74977179, 0.09854505, -0.05843569, 0.47521446
assign 0, 1
b
coefficients 1.1220840, 0.2024222
residuals -0.04461432, 0.02124541, 0.27103003, -0.24766112
effects -2.0717363, 0.2228309, 0.2902311, -0.2302195
rank 2
fitted.values 1.1012775, 0.8433366, 1.1100777, 1.0887808
assign 0, 1
c
coefficients 0.2982019, 0.1900459
residuals -0.5606330, 1.0491990, 0.3908486, -0.8794147
effects -0.6742600, 0.2271767, 1.1273566, -1.0345665
rank 2
fitted.values 0.3718773, 0.2193339, 0.5072572, 0.2500516
assign 0, 1
By the sounds of it, this might be what you're trying to do:
sapply(s, FUN = function(x)
  lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)])
# a b c
# (Intercept) 0.71379430 -0.6817331 0.5717372
# x[, "ind"] 0.07125591 1.1452096 -1.0303726
Other alternatives, if this is what you're looking for
I've seen it noted that, in general, if you're splitting and then using sapply/lapply, you can usually just jump straight to by and skip the split step:
do.call(rbind,
        by(data = df, INDICES = df$subj, FUN = function(x)
          lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)]))
# (Intercept) x[, "ind"]
# a 0.7137943 0.07125591
# b -0.6817331 1.14520962
# c 0.5717372 -1.03037257
Or, you can use one of the packages that lets you do such sorts of calculations more conveniently, like "data.table":
library(data.table)
DT <- data.table(df)
DT[, list(Int = lm(dep ~ ind)$coefficients[1],
          Slo = lm(dep ~ ind)$coefficients[2]), by = subj]
# subj Int Slo
# 1: a 0.7137943 0.07125591
# 2: b -0.6817331 1.14520962
# 3: c 0.5717372 -1.03037257
How about nlme::lmList?
library(nlme)
coef(lmList(dep ~ ind | subj, df))
## (Intercept) ind
## a 0.7137943 0.07125591
## b -0.6817331 1.14520962
## c 0.5717372 -1.03037257
You can transpose this if you want.
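For example:
t(coef(lmList(dep ~ ind | subj, df)))
##                      a          b          c
## (Intercept) 0.71379430 -0.6817331  0.5717372
## ind         0.07125591  1.1452096 -1.0303726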
