Extract function calls from the right hand side of a formula - r

Several functions in R treat certain functions of variables on the right-hand side of a formula specially, for example s in mgcv or strata in survival. In my case, I want particular functions of variables to be taken out of the model matrix and handled separately. The only way I can see to do this is to grep the column names (see below), which also fails if f(.) has not been used in the formula. Does anyone have a more elegant solution? I have looked at survival and mgcv, but I find the code very hard to follow and it is overkill for my needs. Thanks.
f <- function(x) {
# do stuff
return(x)
}
data <- data.frame(y = rnorm(10),
x1 = rnorm(10),
x2 = rnorm(10),
s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)
mf <- model.frame(formula, data)
x <- model.matrix(formula, mf)
desired_x <- x[ , -grep("f\\(", colnames(x))]
desired_f <- x[ , grep("f\\(", colnames(x))]
output:
> head(desired_x)
(Intercept) x1 x2
1 1 0.29864902 0.1474018
2 1 -0.03192798 -0.4424467
3 1 -0.83716557 1.0268295
4 1 -0.74094149 1.1094299
5 1 1.38706580 -0.2339486
6 1 -0.52925896 1.2866540
> desired_f
1 2 3 4 5 6
0.46751965 0.65939178 -1.35835634 -0.05322648 -0.09286254 1.05423067
7 8 9 10
-1.71971996 0.71743985 -0.65993305 -0.79821349
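One way to avoid grep on the column names is the specials argument of terms(), which is the base R mechanism for flagging particular function calls in a formula. The following is a minimal sketch, assuming f is a pass-through marker as in the question; split_special is a hypothetical helper name and not part of any package. It uses the "specials" and "factors" attributes of the terms object together with the "assign" attribute of the model matrix, and it returns a zero-column matrix when f(.) does not appear in the formula.

```r
# Marker function as in the question (assumed to return its input unchanged)
f <- function(x) x

set.seed(1)
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), s = rnorm(10))

# Hypothetical helper: split a model matrix into ordinary and special columns
split_special <- function(formula, data, special = "f") {
  tt <- terms(formula, specials = special)
  mf <- model.frame(tt, data)
  x  <- model.matrix(tt, mf)
  # Row positions (in the factors matrix) of variables wrapped in the special
  var_idx <- attr(tt, "specials")[[special]]
  fac <- attr(tt, "factors")          # rows are variables, columns are terms
  term_idx <- if (length(var_idx)) {
    which(colSums(fac[var_idx, , drop = FALSE]) > 0)
  } else {
    integer(0)                        # special not used in this formula
  }
  # "assign" maps each model-matrix column to its term (0 is the intercept)
  is_special <- attr(x, "assign") %in% term_idx
  list(x = x[, !is_special, drop = FALSE],
       f = x[,  is_special, drop = FALSE])
}

res <- split_special(y ~ x1 + x2 + f(s), data)
colnames(res$x)
colnames(res$f)
```

Because the split relies on term indices rather than pattern matching, a column such as "elf(s)" would not be picked up by accident, and a formula without f(.) leaves res$x as the full model matrix.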

Related

R: Tetrachoric correlation for multiple variables at one go?

You can see I'm a beginner at this when I'm not even able to reproduce my problem with a dummy dataset... Anyways, here goes: I want to calculate tetrachoric correlations between one grouping variable and multiple other variables. Like this:
library(psych)
set.seed(42)
n <- 16
dat <- data.frame(id=1:n,
group=c(rep("a", times=5), rep("b", times=3)),
x=sample(1:2, n, replace=TRUE),
y=sample(1:2, n, replace=TRUE),
z=sample(1:2, n, replace=TRUE))
dat
id group x y z
1 1 a 1 1 2
2 2 a 1 2 2
3 3 a 1 1 2
4 4 a 1 2 2
5 5 a 2 1 1
6 6 b 2 2 1
7 7 b 2 1 1
8 8 b 2 1 1
tetrachoric(as.matrix(dat[,c("group","y")]))
Now with this example (not with my actual dataset) I get an error which I'm unable to solve:
Error in apply(x, 2, function(x) min(x, na.rm = TRUE)) :
dim(X) must have a positive length
In addition: Warning messages:
1: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion
2: In tetrachoric(as.matrix(dat[, c("group", "y")])) :
Item = group had no variance and was deleted
My question is still what would be the best solution to get all the correlations with a single piece of code? Thank you for help!
The help file for tetrachoric says "The tetrachoric correlation is the inferred Pearson Correlation from a two x two table with the assumption of bivariate normality", so presumably you need to pass it a 2x2 table. You could write a little function that hands tetrachoric the appropriate table for each variable and collects the results:
myfun <- function(x,y, ...){
tabs <- lapply(seq_along(y), function(i)table(x,y[,i]))
l <- lapply(tabs, function(x)tetrachoric(x, ...))
rho <- sapply(l, function(x)x$rho)
tau <- sapply(l, function(x)x$tau)
colnames(tau) <- colnames(y)
names(rho) <- colnames(y)
ret <- list(rho = rho ,
tau = tau)
ret
}
myfun(dat$group, dat[,c("x", "y", "z")])
# $rho
# x y z
# 0.5397901 -0.2605839 0.6200705
#
# $tau
# x y z
# a 0.3186394 0.3186394 0.2690661
# 1 0.1573107 0.1573107 -0.6045853

merge/cbind model matrices

This is a simplified version of my current problem. I need to create a model.matrix from 2 model matrices, without losing the info in "assign". For example, consider data and formula
y<-rnorm(100); x1<-rnorm(100); x2<-rnorm(100); x3<-rnorm(100)
f1 <- y ~ x1 + x2 + x3
and 2 model matrices X1 and X2 created using
trms<-terms.formula(f1)
trms2<-drop.terms(trms, dropx = 2)
trms3<-drop.terms(trms, dropx = -2)
X1<-model.matrix(trms2)
X2<-model.matrix(trms3)
Is there an easy way to create from X1 and X2 a matrix X with 1 intercept column and with attr(,"assign") that would have been obtained from f1?
I'm not completely sure if this is what you are trying to do, but cbind() seems to work fine in this case.
X <- cbind(X1, X2)
X <- X[, !duplicated(colnames(X))]
You can then rebuild the assign attribute. The assign values in X1 and X2 index the terms of their own reduced formulas, so they cannot simply be concatenated. Instead, match the column names of X against the term labels of the full formula (this works here because every term is a single numeric column named after its term label):
attr(X, "assign") <- match(colnames(X), c("(Intercept)", attr(trms, "term.labels"))) - 1L
If this is not what you were trying to do, let us know.
If I understand the question correctly, how about something simple and direct like:
X3 <- cbind(X1[,1:2], X2[,2], X1[,3])
attr(X3,"assign") <- c(0,1,2,3)
colnames(X3) <- c("Intercept",attr(trms, "term.labels"))
head(X3)
Intercept x1 x2 x3
1 1 -1.28372461 -0.2598796 0.3028496
2 1 0.56880875 0.2803302 0.7593734
3 1 -0.32480770 -1.6705911 -1.1750247
4 1 -1.02761734 -0.1405454 -0.6805033
5 1 0.84218452 -0.1224962 -1.3882420
6 1 0.07221231 0.5587801 -0.9042751
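Either approach can be checked against the matrix that f1 itself would produce. The sketch below assumes every term is a single numeric column whose name equals its term label (no factors or interactions); the variable names labels, X and X_ref are introduced here for illustration, and dropx = c(1, 3) is used to keep only x2.

```r
set.seed(1)
y <- rnorm(100); x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
f1 <- y ~ x1 + x2 + x3
trms <- terms(f1)
X1 <- model.matrix(drop.terms(trms, dropx = 2))        # (Intercept), x1, x3
X2 <- model.matrix(drop.terms(trms, dropx = c(1, 3)))  # (Intercept), x2

# Column names expected from the full formula, in term order
labels <- c("(Intercept)", attr(trms, "term.labels"))

# Append only the columns of X2 that X1 lacks, then restore the order of f1
X <- cbind(X1, X2[, setdiff(colnames(X2), colnames(X1)), drop = FALSE])
X <- X[, labels]
attr(X, "assign") <- match(colnames(X), labels) - 1L

# Compare with the model matrix built directly from f1
X_ref <- model.matrix(f1)
all.equal(X, X_ref, check.attributes = FALSE)
identical(attr(X, "assign"), attr(X_ref, "assign"))
```

With factor predictors or interactions a term can span several columns, so the name matching above would need to be replaced by a mapping from columns to terms.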

How do you create user-defined formula functions in R

R allows you to define a formula with a transformation function applied to one of the variables. For example, in the following formula the logarithmic transformation will be applied to variable b.
y ~ a + log(b) + c
How does one define their own formula functions in R without the use of I()? For example applying the user defined function foo to a in the following formula.
y ~ foo(a) + b
You can use your own functions in R formulas.
An example function:
foo <- function(x)
log(x) ^ 2
Data:
set.seed(1)
dat <- data.frame(y = rnorm(5), x = rgamma(5, 2))
Create model matrix based on formula and data:
mod <- model.matrix(y ~ 1 + foo(x), data = dat)
mod
# (Intercept) foo(x)
# 1 1 0.16837521
# 2 1 0.02222275
# 3 1 0.68509896
# 4 1 0.01936180
# 5 1 0.14758002
# attr(,"assign")
# [1] 0 1
The function is indeed applied to the data:
identical(foo(dat$x), unname(mod[ , "foo(x)"]))
# [1] TRUE
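The same holds for fitting and prediction, since the function is looked up each time the formula is evaluated. This is a small sketch using lm; newdat and manual are illustrative names that are not in the original answer.

```r
foo <- function(x) log(x)^2

set.seed(1)
dat <- data.frame(y = rnorm(5), x = rgamma(5, 2))

# The user-defined function is used directly inside the model formula
fit <- lm(y ~ foo(x), data = dat)
names(coef(fit))

# predict() re-applies foo to the new x values before using the coefficients
newdat <- data.frame(x = c(0.5, 1, 2))
pred <- predict(fit, newdata = newdat)

# Manual check: intercept plus slope times foo(x)
manual <- as.vector(cbind(1, foo(newdat$x)) %*% coef(fit))
all.equal(unname(pred), manual)
```

Note that foo must be visible from the environment of the formula (or the calling environment) at the time predict() is called.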

Is there a function to return the matching response vector to model.matrix?

In glmnet() I have to specify the raw X matrix and response vector Y (different than lm where you can specify the model formula). model.matrix() will correctly remove incomplete observations from the X matrix, but it doesn't include the response in the output object. So I will have something like this:
mydf
glmnet(y = mydf$response, x = model.matrix(myformula, mydf)[,-1], ...)
When model.matrix removes observations the y and x dimensions won't match. Is there a function to align y data to x?
Try using model.frame and model.response.
> d <- data.frame(y=rnorm(3), x=c(1,NA,2), z=c(NA, NA, 1))
> d
y x z
1 -0.6257260 1 NA
2 -0.4979723 NA NA
3 -1.2233772 2 1
> form <- y~x
> mf <- model.frame(form, data=d)
> model.response(mf)
1 3
-0.625726 -1.223377
> model.matrix(form, mf)
(Intercept) x
1 1 1
3 1 2
attr(,"assign")
[1] 0 1
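I'm not familiar with glmnet, but it might be the case that mf is sufficient: just pass y = mf[, 1] and x = mf[, -1] (the first column of the model frame is the response and the remaining columns are the predictors). If glmnet requires a numeric matrix for x, wrap the latter in as.matrix().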
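The pattern can be wrapped in a small helper so that the response and the design matrix always come from the same model frame. This is a sketch only: xy_from_formula is a hypothetical name, the data are made up, and it relies on the default na.action (normally na.omit) to drop incomplete rows.

```r
# Hypothetical helper: return a response vector and design matrix that
# are built from the same model frame, so their rows always match
xy_from_formula <- function(formula, data) {
  mf <- model.frame(formula, data = data)   # incomplete rows dropped here
  list(y = model.response(mf),
       x = model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE])
}

d <- data.frame(response = c(1.2, 0.7, 3.1, 2.4),
                a = c(1, NA, 2, 4),
                b = c(0.5, 1.5, NA, 2.5))

xy <- xy_from_formula(response ~ a + b, d)
length(xy$y)   # same as nrow(xy$x)
nrow(xy$x)
```

The two components could then be passed as the y and x arguments of a fitting function that does not accept a formula.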

Using split function in R

I am trying to simulate three small datasets, each of which contains x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using "split" in R, I get warning messages and the outputs are incorrect. Could someone please give me a hint about what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 4 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times; however, I can't figure out what I did wrong. Many thanks in advance.
IND is the global variable for the full data, sim. You want to use the specific one for the subset, e.g.
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
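The length mismatch is easy to see on a smaller example. The sketch below uses made-up data (12 rows instead of 150) with the same column names; by_both is an illustrative name and not part of the original code.

```r
set.seed(12)
sim <- data.frame(x1  = rnorm(12),
                  trt = rep(c("A", "B", "C"), times = 4),
                  IND = rep(1:3, each = 4))

IND  <- sim$IND                    # global copy: one entry per row of sim
Aset <- subset(sim, trt == "A")    # only the rows for treatment A

# The global IND is longer than Aset has rows, hence the recycling warning
length(IND)
nrow(Aset)

# Grouping by the subset's own column keeps rows and groups aligned
Anew <- split(Aset, f = Aset$IND)
sapply(Anew, nrow)

# Alternatively, split the full data by treatment and replicate in one call
by_both <- split(sim, list(sim$trt, sim$IND))
names(by_both)
```

Each element of Anew then contains only rows whose IND matches the element name, which is not the case in the output shown in the question.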
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Here you can replace summarise and the arguments that follow it with any function that returns a data.frame or something that can be coerced to one. If lists are the target, dlply is your friend.
