How do you create user-defined formula functions in R

R allows you to define a formula with a transformation function applied to one of the variables. For example, in the following formula the logarithmic transformation is applied to variable b.
y ~ a + log(b) + c
How does one define their own formula functions in R without the use of I()? For example applying the user defined function foo to a in the following formula.
y ~ foo(a) + b

You can use your own functions in R formulas.
An example function:
foo <- function(x)
log(x) ^ 2
Data:
set.seed(1)
dat <- data.frame(y = rnorm(5), x = rgamma(5, 2))
Create model matrix based on formula and data:
mod <- model.matrix(y ~ 1 + foo(x), data = dat)
mod
# (Intercept) foo(x)
# 1 1 0.16837521
# 2 1 0.02222275
# 3 1 0.68509896
# 4 1 0.01936180
# 5 1 0.14758002
# attr(,"assign")
# [1] 0 1
The function is indeed applied to the data:
identical(foo(dat$x), unname(mod[ , "foo(x)"]))
# [1] TRUE
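The same idea carries over to model fitting. The following is a minimal sketch, assuming the foo and dat definitions from above; new_dat is a hypothetical data frame introduced only for illustration. It shows that lm() accepts the user-defined function directly and that predict() re-applies it to new data.

```r
# User-defined transformation, as in the answer above
foo <- function(x) log(x)^2

set.seed(1)
dat <- data.frame(y = rnorm(5), x = rgamma(5, 2))

# foo() is evaluated on dat$x when the model frame is built
fit <- lm(y ~ foo(x), data = dat)
names(coef(fit))

# predict() evaluates foo() again on the new x values
new_dat <- data.frame(x = c(0.5, 1.5))   # hypothetical new data
pred <- predict(fit, newdata = new_dat)

# Manual check: intercept + slope * foo(x)
manual <- coef(fit)[[1]] + coef(fit)[[2]] * foo(new_dat$x)
all.equal(unname(pred), manual)
```

Note that foo must be visible from the environment in which the formula is evaluated, so it has to be defined before the model is fitted or used for prediction.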

Related

Reverting a linear filter for time series in R

I'm using the stats::filter function in R in order to understand ARIMA simulation (as in the function stats::arima.sim) and estimation. I know that stats::filter applies a linear filter to a vector or time series, but I'm not sure how to "unfilter" my series.
Consider the following example: I want to apply a recursive filter with coefficient 0.7 to my series x = 1:5 (which essentially generates an AR(1) with phi = 0.7). I can do so with:
x <- 1:5
ar <-0.7
filt <- filter(x, ar, method="recursive")
filt
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1.0000 2.7000 4.8900 7.4230 10.1961
This essentially returns c(y1,y2,y3,y4,y5), where:
y1 <- x[1]
y2 <- x[2] + ar*y1
y3 <- x[3] + ar*y2
y4 <- x[4] + ar*y3
y5 <- x[5] + ar*y4
Now imagine I have the series y = c(y1,y2,y3,y4,y5). How can I use the filter function to recover the original series x = 1:5?
I can write code to do it, like this:
unfilt <- rep(NA, 5)
unfilt[1] <- filt[1]
for(i in 2:5){
unfilt[i] <- filt[i] - ar*filt[i-1]
}
unfilt
[1] 1 2 3 4 5
But I want to use the filter function for this instead of writing my own function. How can I do so? I tried stats::filter(filt, -ar, method="recursive"), which returns [1] 1.0000 2.0000 3.4900 4.9800 6.7101, which is not what I want.
stats::filter used with the recursive option is a particular case of an ARMA filter.
a[1]*y[n] + a[2]*y[n-1] + … + a[n]*y[1] = b[1]*x[n] + b[2]*x[n-1] + … + b[m]*x[1]
You could implement this filter with the signal package, which allows more options than stats::filter:
a = c(1,-ar)
b = 1
filt_Arma <- signal::filter(signal::Arma(b = b, a = a),x)
filt_Arma
# Time Series:
# Start = 1
# End = 5
# Frequency = 1
# [1] 1.0000 2.7000 4.8900 7.4230 10.1961
identical(filt,filt_Arma)
# [1] TRUE
Reverting an ARMA filter can be done by switching b and a, provided that the inverse filter stays stable (which is the case here):
signal::filter(signal::Arma(b = a, a = b),filt)
# Time Series:
# Start = 2
# End = 6
# Frequency = 1
# [1] 1 2 3 4 5
This corresponds to switching numerator and denominator in the z-transform:
Y(z) = b(z)/a(z) X(z)
X(z) = a(z)/b(z) Y(z)
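The question asked for a solution based on stats::filter itself. As a sketch of one way to do that: the inverse of a purely recursive (AR) filter is a finite convolution (MA) filter with coefficients c(1, -ar). The leading zero prepended below is an assumption of zero initial conditions, which matches the default init of the recursive filter; the names padded and unfilt are illustrative.

```r
x <- 1:5
ar <- 0.7

# Forward step: recursive (AR) filter, as in the question
filt <- stats::filter(x, ar, method = "recursive")

# Inverse step: one-sided convolution with c(1, -ar).
# A zero is prepended so that the first output has a "previous" value;
# without it the first element would be NA.
padded <- c(0, as.numeric(filt))
unfilt <- stats::filter(padded, c(1, -ar), method = "convolution", sides = 1)

# Drop the NA produced for the padding position
unfilt <- as.numeric(unfilt)[-1]

all.equal(unfilt, as.numeric(x))
```

For a higher-order recursive filter with coefficients phi, the same pattern would use c(1, -phi) and length(phi) leading zeros.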

Multiclass Changepoint detection in R

The output of my classification predictive model is as follows:
a <- c(1,1.1,1,1,2,0.9,1.1,1,1.1,1) ## Class A
b <- c(2,2.1,1.9,1.7,2,2,3,2,2,2) ## Class B
c <- c(3,3,3.1,3.6,3.2,2,3.1,3,3,3) ## Class C
x <- data.frame(c(a,b,c))
x$color <- rep(c("red","green","blue"),times=c(10,10,10))
I am trying to find out at which position the class type changes. In this scenario, the class type changes at positions 11 and 21.
I have tried the following packages: CPM, EPC, QCC, but none of them gives me what I want.
What's the best way to detect that the class type has changed?
What about
library(strucchange)
( bp <- breakpoints( x[, 1] ~ 1) )
# Optimal 3-segment partition:
#
# Call:
# breakpoints.formula(formula = x[, 1] ~ 1)
#
# Breakpoints at observation number:
# 10 20
#
# Corresponding to breakdates:
# 0.3333333 0.6666667
bp$breakpoints
# [1] 10 20
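strucchange reports the last observation of each segment, so breakpoints 10 and 20 correspond to class changes at positions 11 and 21. As a cross-check, and assuming the class labels in x$color are available, the change positions can be read directly off the labels with base R. This is a sketch rather than a general changepoint method; the vector names cls_a, cls_b, cls_c and the column name value are renamed here for illustration.

```r
cls_a <- c(1, 1.1, 1, 1, 2, 0.9, 1.1, 1, 1.1, 1)   # Class A
cls_b <- c(2, 2.1, 1.9, 1.7, 2, 2, 3, 2, 2, 2)     # Class B
cls_c <- c(3, 3, 3.1, 3.6, 3.2, 2, 3.1, 3, 3, 3)   # Class C

x <- data.frame(value = c(cls_a, cls_b, cls_c))
x$color <- rep(c("red", "green", "blue"), times = c(10, 10, 10))

# Compare each label with its predecessor; a mismatch marks the first
# observation of a new class
change_pos <- which(x$color[-1] != x$color[-nrow(x)]) + 1L
change_pos
# 11 21

# Last observation of each segment, the convention used by breakpoints()
last_of_segment <- change_pos - 1L
```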

Extract function calls from the right hand side of a formula

Several functions in R treat certain functions of variables on the right-hand side of a formula specially, for example s in mgcv or strata in survival. In my case, I want particular functions of variables to be taken out of the model matrix and treated specially. I can't see how to do this other than by using grep on the column names (see below), which also fails if f(.) has not been used in the formula. Does anyone have a more elegant solution? I have looked at survival and mgcv, but I find the code very hard to follow, and it is overkill for my needs. Thanks.
f <- function(x) {
# do stuff
return(x)
}
data <- data.frame(y = rnorm(10),
x1 = rnorm(10),
x2 = rnorm(10),
s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)
mf <- model.frame(formula, data)
x <- model.matrix(formula, mf)
desired_x <- x[ , -grep("f\\(", colnames(x))]
desired_f <- x[ , grep("f\\(", colnames(x))]
output:
> head(desired_x)
(Intercept) x1 x2
1 1 0.29864902 0.1474018
2 1 -0.03192798 -0.4424467
3 1 -0.83716557 1.0268295
4 1 -0.74094149 1.1094299
5 1 1.38706580 -0.2339486
6 1 -0.52925896 1.2866540
> desired_f
1 2 3 4 5 6
0.46751965 0.65939178 -1.35835634 -0.05322648 -0.09286254 1.05423067
7 8 9 10
-1.71971996 0.71743985 -0.65993305 -0.79821349
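One possible approach, offered as a sketch rather than as an answer from the original thread, is the specials argument of terms(). It records which variables in the formula are calls to a named function, and the "assign" attribute of the model matrix then maps columns back to terms. The names tt, special_vars, special_terms and is_f are illustrative.

```r
f <- function(x) {
  # do stuff
  return(x)
}

set.seed(1)
data <- data.frame(y = rnorm(10), x1 = rnorm(10),
                   x2 = rnorm(10), s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)

# Mark calls to f() as special when building the terms object
tt <- terms(formula, specials = "f")
mf <- model.frame(tt, data)
x  <- model.matrix(tt, mf)

# Row indices (in the "factors" matrix) of variables that are f() calls;
# NULL if f() does not appear in the formula
special_vars <- attr(tt, "specials")$f

# Terms that involve any of those variables
fac <- attr(tt, "factors")
special_terms <- which(colSums(fac[special_vars, , drop = FALSE]) > 0)

# Columns of the model matrix that belong to those terms
is_f <- attr(x, "assign") %in% special_terms

desired_x <- x[, !is_f, drop = FALSE]
desired_f <- x[, is_f, drop = FALSE]
```

Because the columns are selected with a logical vector, desired_x keeps all columns and desired_f has zero columns when f() is absent from the formula, which avoids the failure of the grep approach in that case.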

Is there a function to return the matching response vector to model.matrix?

In glmnet() I have to specify the raw X matrix and response vector Y (unlike lm, where you can specify a model formula). model.matrix() will correctly remove incomplete observations from the X matrix, but it doesn't include the response in the output object. So I will have something like this:
mydf
glmnet(y = mydf$response, x = model.matrix(myformula, mydf)[,-1], ...)
When model.matrix removes observations the y and x dimensions won't match. Is there a function to align y data to x?
Try using model.frame and model.response.
> d <- data.frame(y=rnorm(3), x=c(1,NA,2), z=c(NA, NA, 1))
> d
y x z
1 -0.6257260 1 NA
2 -0.4979723 NA NA
3 -1.2233772 2 1
> form <- y~x
> mf <- model.frame(form, data=d)
> model.response(mf)
1 3
-0.625726 -1.223377
> model.matrix(form, mf)
(Intercept) x
1 1 1
3 1 2
attr(,"assign")
[1] 0 1
I'm not familiar with glmnet; it might be the case that mf is sufficient, just passing the columns y=mf[,1] and x=mf[,-1].
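Putting the two pieces together, the following sketch builds a response vector and a predictor matrix from the same model frame, so their rows are guaranteed to match. It assumes the small data frame d from above; the name X is illustrative, and no glmnet call is made here.

```r
set.seed(1)
d <- data.frame(y = rnorm(3), x = c(1, NA, 2), z = c(NA, NA, 1))
form <- y ~ x

# model.frame() drops row 2, where x is NA (z is not in the formula)
mf <- model.frame(form, data = d)

# Response and design matrix are both derived from mf
y <- model.response(mf)
X <- model.matrix(form, mf)[, -1, drop = FALSE]   # drop the intercept column

# Same observations, same order
stopifnot(length(y) == nrow(X), identical(names(y), rownames(X)))
```

The pair y and X could then be handed to a fitting function that expects a response vector and a predictor matrix.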

Creating an R function to use mclapply from the multicore package

I need to analyze some simulated data with the following structure:
h c x1 y1 x1c10
1 0 37.607056431 104.83097593 5
1 1 27.615251557 140.85532974 10
1 0 34.68915314 114.59312842 2
1 1 30.090387454 131.60485642 9
1 1 39.274429397 106.76042522 10
1 0 33.839385007 122.73681319 2
...
where h ranges from 1 to 2500 and indexes the Monte Carlo sample, each sample having 1000 observations. I'm analysing these data with the following code, which gives me two objects (fdN1, fdQB101):
library(MASS) ## provides polr()
mc<-2500 ##create loop index
fdN1<-matrix(0,mc,1000)
fdQB101 <- matrix(0,mc,1000) ##create 2500x1000 storage matrices, elements zero
for(j in 1:mc){
fdN1[j,] <- dnorm(residuals(lm(x1 ~ c,data=s[s$h==j,])),
mean(residuals(lm(x1 ~ c,data=s[s$h==j,]))),
sd(residuals(lm(x1 ~ c,data=s[s$h==j,]))))
x1c10<-as.matrix(subset(s,s$h==j,select=x1c10))
fdQB100 <- as.matrix(predict(polr(as.factor(x1c10) ~ c ,
method="logistic", data=s[s$h==j,]),
type="probs"))
indx10<- as.matrix(cbind(as.vector(seq(1:nrow(fdQB100))),x1c10))
fdQB101[j,] <- fdQB100[indx10]
}
The objects fdN1 and fdQB101 are 2500x1000 matrices with predicted probabilities as elements. I need to create a function out of this loop that I can call with lapply() or mclapply(). When I wrap this in the following function command:
ndMC <- function(mc){
for(j in 1:mc){
...
}
return(list(fdN1,fdQB101))
}
lapply(mc,ndMC)
the objects fdN1 and fdQB101 are each returned as 2500x1000 matrices of zeros, instead of the predicted probabilities. What am I doing wrong?
You should be able to do this with the data.table package. Here is an example:
library(data.table)
dt<-data.table(h=rep(1L,6), c=c(0L,1L,0L,1L,1L,0L),
X1=c(37.607056431,27.615251557,34.68915314,30.090387454,39.274429397,33.839385007),
y1=c(104.83097593,140.85532974,114.59312842,131.60485642,106.76042522,122.73681319),
x1c10=c(5L,10L,2L,9L,10L,2L))
## Create a linear model for every grouping of variable h:
fdN1.partial<-dt[,list(lm=list(lm(X1~c))),by="h"]
## Retrieve the linear model for h==1:
fdN1.partial[h==1,lm]
## [[1]]
##
## Call:
## lm(formula = X1 ~ c)
##
## Coefficients:
## (Intercept) c
## 35.379 -3.052
You could also write a function to generalize this solution:
f.dnorm<-function(y,x) {
f<-lm(y ~ x)
out<-list(dnorm(residuals(f), mean(residuals(f)), sd(residuals(f))))
return(out)
}
## Generate two dnorm lists for every grouping of variable h:
dt.lm<-dt[,list(dnormX11=list(f.dnorm(X1,rep(1,length(X1)))), dnormX1c=list(f.dnorm(X1,c))),by="h"]
## Retrieve one of the dnorm lists for h==1:
unlist(dt.lm[h==1,dnormX11])
## 1 2 3 4 5 6
## 0.06296194 0.03327407 0.08884549 0.06286739 0.04248756 0.09045784
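Since the question asked for something callable through lapply() or mclapply(), here is a sketch of that pattern: a function that handles one Monte Carlo sample and returns its row of densities, applied over the sample indices. It uses the parallel package shipped with R, which took over the functionality of multicore. The simulated data, the sizes n_samples and n_obs, and the function name nd_one are assumptions for illustration, and only the dnorm part of the loop is covered.

```r
library(parallel)

# Small simulated stand-in for the data in the question (hypothetical values)
set.seed(1)
n_samples <- 4
n_obs <- 50
s <- data.frame(h = rep(seq_len(n_samples), each = n_obs),
                c = rbinom(n_samples * n_obs, 1, 0.5))
s$x1 <- 30 + 5 * s$c + rnorm(nrow(s), sd = 4)

# Work for ONE sample j: fit once, reuse the residuals, return one row
nd_one <- function(j, s) {
  res <- unname(residuals(lm(x1 ~ c, data = s[s$h == j, ])))
  dnorm(res, mean(res), sd(res))
}

# Apply over the sample indices, not over the single number mc.
# mc.cores = 1L runs everywhere; raise it on Unix-alikes to parallelise.
rows <- mclapply(seq_len(n_samples), nd_one, s = s, mc.cores = 1L)

# Bind the per-sample rows into an n_samples x n_obs matrix
fdN1 <- do.call(rbind, rows)
dim(fdN1)
```

The design choice is that each call returns its own result instead of writing into a matrix defined outside the function, so nothing depends on shared state across workers.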
