lm(y~poly(x1, x2,x3, degree=2, raw=TRUE), data) - r

Is
lm(y~poly(x1, x2,x3, degree=2, raw=TRUE), data)
equal to
lm(y~x1 + x2 + x3 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2 , data)
?
If yes, why do we need to set raw=TRUE?

You can test this yourself easily:
DF <- data.frame(x1 = 1:2, x2 = 3:4, x3 = 5:6)
with(DF, poly(x1, x2, x3, degree = 2, raw = TRUE))
# 1.0.0 2.0.0 0.1.0 1.1.0 0.2.0 0.0.1 1.0.1 0.1.1 0.0.2
#[1,] 1 1 3 3 9 5 5 15 25
#[2,] 2 4 4 8 16 6 12 24 36
#attr(,"degree")
#[1] 1 2 1 2 2 1 2 2 2
#attr(,"class")
#[1] "poly" "matrix"
The column names show the product of the three variables and the degree each variable has in this product. E.g., 1.1.0 means x1^1 * x2^1 * x3^0, i.e. the x1:x2 interaction column.
Of course, you see this also in the output of the regression model.
You need raw = TRUE if you want the coefficients to correspond to raw polynomials, i.e., alpha0 + alpha11 * x1^1 + alpha12 * x1^2 + .... If you don't need that, you should not set raw = TRUE because orthogonal polynomials have some desirable properties for regression analysis.
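To check the equivalence numerically, here is a minimal sketch on simulated data (the data frame `d` and the coefficients used to generate `y` are made up for illustration). Note that inside a formula `x1^2` is a formula operator and collapses to `x1`, so the explicit version needs `I(x1^2)`, and `x1*x2` would expand to `x1 + x2 + x1:x2`.

```r
set.seed(42)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
d$y <- 1 + d$x1 - 2 * d$x2 * d$x3 + 0.5 * d$x1^2 + rnorm(30)

# Raw polynomial basis of total degree <= 2
fit_poly <- lm(y ~ poly(x1, x2, x3, degree = 2, raw = TRUE), data = d)

# The same nine monomials written out by hand
fit_explicit <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 +
                     I(x1^2) + I(x2^2) + I(x3^2), data = d)

# Orthogonal basis: different coefficients, same fitted values
fit_orth <- lm(y ~ poly(x1, x2, x3, degree = 2), data = d)

same_raw  <- all.equal(fitted(fit_poly), fitted(fit_explicit))
same_orth <- all.equal(fitted(fit_poly), fitted(fit_orth))
```

All three models span the same column space, so the fitted values should agree; only the meaning of the individual coefficients changes between the raw and orthogonal versions.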

Related

Multiplying a categorical variable with a dummy in regression

I am trying to run a regression of scores on a female dummy (taking a value of 0 or 1), and I also have the country for each observation. I am trying to add a fixed effect to the regression in which female is interacted with country, but every method I try fails because I am multiplying a numeric with a factor.
I have tried using fastDummies, but that did not work. I also tried the country - 1 approach and multiplying the result with female, with no success.
#first wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + I(female * factor(country)), data = f1)
# second wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + factor( female * country ), data = f1)
The error messages say that I cannot multiply a factor with a numeric.
The * operator in the formula will give interactions as well as lower order terms. Here is an example:
country <- c("A", "A", "A", "B", "B", "B")
female <- c(1, 1, 0, 1, 0, 1)
y <- 1:6
fm <- lm(y ~ country * female)
fm
giving:
Call:
lm(formula = y ~ country * female)
Coefficients:
(Intercept) countryB female countryB:female
3.0 2.0 -1.5 1.5
Also we can check the model matrix
model.matrix(fm)
giving
(Intercept) countryB female countryB:female
1 1 0 1 0
2 1 0 1 0
3 1 0 0 0
4 1 1 1 1
5 1 1 0 0
6 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
You won't need the I() here. * alone will perform an interaction, whereas I() will execute an arithmetic operation before the regression.
Compare:
lm(pv1math ~ ggi*female, data=dat)$coefficients
# (Intercept) ggi female ggi:female
# ... ... ... ...
lm(pv1math ~ I(ggi*female), data=dat)$coefficients
# (Intercept) I(ggi * female)
# ... ...
I() is useful e.g. for polynomials, where age is a popular candidate: pv1math ~ age + I(age^2) + I(age^3), or to binarize a dependent variable in a GLM: glm(I(pv1math > 0.75) ~ ggi*female, family=binomial).
And - as @G.Grothendieck already wrote - you don't need to repeat the variables that are already present in the interaction term (it's just redundant), so you may want to try:
lm(pv1math ~ ggi*female + factor(year) + female*factor(country), data=f1)
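The redundancy can be seen without fitting anything by asking R which terms a formula expands to. A small sketch using the variable names from the question (no data is needed, because `terms()` only parses the formula):

```r
# Formula that repeats main effects already implied by the * terms
f_redundant <- pv1math ~ ggi*female + female + factor(country) +
  female*factor(country)

# Compact version without the repeats
f_compact <- pv1math ~ ggi*female + female*factor(country)

terms_redundant <- attr(terms(f_redundant), "term.labels")
terms_compact   <- attr(terms(f_compact), "term.labels")

# Both formulas expand to the same set of model terms
same_terms <- setequal(terms_redundant, terms_compact)
terms_compact
```

Duplicated terms are dropped when the formula is expanded, so spelling out the main effects next to `a*b` does not change the model.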

The order returned from a vectorised function

I am sending two columns of a data frame to a vectorised function.
For each row of the data frame, the function will return 3 rows. So the total number of rows returned will be nrow(dataframe) * 3. The total columns returned will be equal to 2.
The trivial function below produces the correct set of numbers. But these numbers are returned in a peculiar order. I guess it would be possible to get the order of these numbers in the order I desire...using some combination of base functions. But, if possible, I want to write easy-to-understand code.
So my question is this:
Is there a better way of writing either the function (or call to the function) such that it will produce the desired result (which is commented out below) ?
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(x1, x2, x3)
vcty <- c(y1, y2, y3)
#vct.pair <- c(vctx, vcty)
vct.series <- c(x1, y1, x2, y2, x3, y3)
return(vct.series)
}
vct.names <- c("a", "b")
vct.x <- c(10, 20)
vct.y <- c(100, 200)
df.data <- data.frame(name = vct.names, x = vct.x, y = vct.y)
aa <- fnVector(df.data$x, df.data$y)
# desired result [nrow(dataframe) * 3, 2] (i.e. 3 x 2 )
#11, 101 (i.e. row a)
#12, 102 (i.e. row a)
#13, 103 (i.e. row a)
#21, 201 (i.e. row b)
#22, 202 (i.e. row b)
#23, 203 (i.e. row b)
I think you want to interleave your vectors, i.e. the returned x is x1[1], x2[1], x3[1], x1[2], x2[2], x3[2], ...
so you could:
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
Then return a matrix, not a vector:
return(cbind(vctx, vcty))
Giving you
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
return(cbind(vctx, vcty))
}
fnVector(df.data$x, df.data$y)
# vctx vcty
# [1,] 11 101
# [2,] 12 102
# [3,] 13 103
# [4,] 21 201
# [5,] 22 202
# [6,] 23 203
You may want to think about also retaining the name column.
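One possible way to keep the name column in base R, assuming the per-row work really is just adding the offsets 1:3 as in the simplified example (`offsets`, `idx` and `out` are names introduced here for illustration):

```r
df.data <- data.frame(name = c("a", "b"), x = c(10, 20), y = c(100, 200))

offsets <- 1:3

# Repeat each row index once per offset: 1 1 1 2 2 2
idx <- rep(seq_len(nrow(df.data)), each = length(offsets))

# offsets is recycled along the repeated rows: 1 2 3 1 2 3
out <- data.frame(name = df.data$name[idx],
                  x    = df.data$x[idx] + offsets,
                  y    = df.data$y[idx] + offsets)
out
```

Each input row expands to three consecutive output rows, so the result is already in the order asked for in the question.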
I don't know if this is adaptable to your specific application or not (I understand you have simplified your fnVector for the purposes of this question), but you might want to investigate plyr:
library(plyr)
ddply(df.data, .(name), summarize,
vctx = x + 1:3,
vcty = y + 1:3)
# name vctx vcty
# 1 a 11 101
# 2 a 12 102
# 3 a 13 103
# 4 b 21 201
# 5 b 22 202
# 6 b 23 203
The ddply(df.data, .(name), says "for each unique value in df.data$name", the summarize says "call the summarize function", then the two named arguments vctx=.. and vcty=... create the output 3 rows for each of these columns (for us, x+1:3 and y+1:3, but for your application, probably something more complex).
I think your function can be greatly simplified, and I also think it makes the most sense to use the custom function along with one of the apply functions. Try this code:
fnVector <- function(x) {
y <- rbind(x+1, x+2, x+3)
return(y)
}
df.output <- data.frame(apply(df.data[, c("x", "y")], 2, function(x) fnVector(x)))
> df.output
x y
1 11 101
2 12 102
3 13 103
4 21 201
5 22 202
6 23 203

Model matrix with all pairwise interactions between columns

Let's say that I have a numeric data matrix with columns w, x, y, z and I also want to add in the columns that are equivalent to w*x, w*y, w*z, x*y, x*z, y*z since I want my covariate matrix to include all pairwise interactions.
Is there a clean and effective way to do this?
If you mean in a model formula, then the ^ operator does this.
## dummy data
set.seed(1)
dat <- data.frame(Y = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
The formula is
form <- Y ~ (x + y + z)^2
which gives (using model.matrix() - which is used internally by the standard model fitting functions)
model.matrix(form, data = dat)
R> form <- Y ~ (x + y + z)^2
R> form
Y ~ (x + y + z)^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
If you don't know how many variables you have, or it is tedious to write out all of them, use the . notation too
R> form <- Y ~ .^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
The "power" in the ^ operator, here 2, controls the order of interactions. With ^2 we get second order interactions of all pairs of variables considered by the ^ operator. If you want up to 3rd-order interactions, then use ^3.
R> form <- Y ~ .^3
R> head(model.matrix(form, data = dat))
(Intercept) x y z x:y x:z y:z x:y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.05403 1.24860 1.887604
2 1 0.38984 0.78214 -0.10279 0.304911 -0.04007 -0.08039 -0.031341
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.24084 0.02891 -0.017958
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.11916 0.10704 -0.237055
5 1 1.12493 0.61983 -1.37706 0.697261 -1.54910 -0.85354 -0.960170
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.01865 0.02329 -0.001047
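For the case in the question, a plain numeric matrix with columns w, x, y, z and no response, a one-sided formula should work as well. This is a sketch with a made-up 5 x 4 matrix `m`; the `- 1` drops the intercept column so that only the original columns and their pairwise products remain.

```r
set.seed(1)
m <- matrix(rnorm(20), ncol = 4,
            dimnames = list(NULL, c("w", "x", "y", "z")))

# One-sided formula: all main effects plus all pairwise interactions
mm <- model.matrix(~ .^2 - 1, data = as.data.frame(m))
colnames(mm)

# Each interaction column is the elementwise product of its parents
wx_matches <- all.equal(unname(mm[, "w:x"]), m[, "w"] * m[, "x"])
```

Note that `^2` only produces products of distinct columns; squared terms such as w^2 are not included and would need `I(w^2)`.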
If you are doing a regression, you can just do something like
reg <- lm(w ~ (x + y + z)^2)
and it will figure things out for you. For example,
lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
# Call:
# lm(formula = Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2,
# data = iris)
# Coefficients:
# (Intercept) Sepal.Length Sepal.Width
# -1.05768 0.07628 0.22983
# Petal.Length Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
# 0.47586 -0.03863 -0.03083
# Sepal.Width:Petal.Length
# 0.06493

specifying a regression in R with an indicator variable

I would like to specify a regression in R that would estimate coefficients on x that are conditional on a third variable, z, being greater than 0. For example
y ~ a + x*1(z>0) + x*1(z<=0)
What is the correct way to do this in R using formulas?
The ":" (colon) operator constructs conditional interactions when it is used with disjoint indicator predictors built with I(). The fitted model can then be used with predict():
> y=rnorm(10)
> x=rnorm(10)
> z=rnorm(10)
> mod <- lm(y ~ x:I(z>0) )
> mod
Call:
lm(formula = y ~ x:I(z > 0))
Coefficients:
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
-0.009983 -0.203004 -0.655941
> predict(mod, newdata=data.frame(x=1:10, z=c(-1, 1)) )
1 2 3 4 5 6 7
-0.2129879 -1.3218653 -0.6189968 -2.6337471 -1.0250057 -3.9456289 -1.4310147
8 9 10
-5.2575108 -1.8370236 -6.5693926
> plot(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(-1)) ) )
> lines(1:10, predict(mod, newdata=data.frame(x=1:10, z=c(1)) ) )
Might help to look at its model matrix:
> model.matrix(mod)
(Intercept) x:I(z > 0)FALSE x:I(z > 0)TRUE
1 1 -0.2866252 0.00000000
2 1 0.0000000 -0.03197743
3 1 -0.7427334 0.00000000
4 1 2.0852202 0.00000000
5 1 0.8548904 0.00000000
6 1 0.0000000 1.00044600
7 1 0.0000000 -1.18411791
8 1 0.0000000 -1.54110256
9 1 0.0000000 -0.21173300
10 1 0.0000000 0.17035257
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$`I(z > 0)`
[1] "contr.treatment"
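The two interaction columns are simply x multiplied by each indicator, which can be verified directly. A short sketch with simulated data (the seed and variable values are arbitrary):

```r
set.seed(1)
x <- rnorm(10)
z <- rnorm(10)
y <- rnorm(10)

mm <- model.matrix(y ~ x:I(z > 0))

# Column 2 is x where z <= 0 (zero elsewhere), column 3 is x where z > 0
col_false_ok <- all.equal(unname(mm[, 2]), x * (z <= 0))
col_true_ok  <- all.equal(unname(mm[, 3]), x * (z > 0))
```

Because the indicators are disjoint, each coefficient is the slope on x within its own regime, with a single shared intercept.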
y <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
z <- sample(x = -10:10, size = length(y), replace = TRUE)
x <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
a <- rnorm(n=length(x))
lm(y~a+I(x*1*I(z>0))+ I(x*1*I(z<=0)))
But I think using the : operator, as in DWin's solution, is more elegant.
Edit
lm(y~a+I(x*1*I(z>0))+ I(x*1*I(z<=0)))
Call:
lm(formula = y ~ a + I(x * 1 * I(z > 0)) + I(x * 1 * I(z <= 0)))
Coefficients:
(Intercept) a I(x * 1 * I(z > 0)) I(x * 1 * I(z <= 0))
6.5775 -0.1345 -0.3352 -0.3366
> lm(formula = y ~ a+ x:I(z > 0))
Call:
lm(formula = y ~ a + x:I(z > 0))
Coefficients:
(Intercept) a x:I(z > 0)FALSE x:I(z > 0)TRUE
6.5775 -0.1345 -0.3366 -0.3352

simple examples of filter function, recursive option specifically

I am seeking some simple (i.e. - no maths notation, long-form reproducible code) examples for the filter function in R
I think I have my head around the convolution method, but am stuck at generalising the recursive option. I have read and battled with various documentation, but the help is just a bit opaque to me.
Here are the examples I have figured out so far:
# Set the input series and some values for filter components
x <- 1:5
f1 <- 1; f2 <- 1; f3 <- 1;
And on we go:
# basic convolution filter
filter(1:5,f1,method="convolution")
[1] 1 2 3 4 5
#equivalent to:
x[1] * f1
x[2] * f1
x[3] * f1
x[4] * f1
x[5] * f1
# convolution with 2 coefficients in filter
filter(1:5,c(f1,f2),method="convolution")
[1] 3 5 7 9 NA
#equivalent to:
x[1] * f2 + x[2] * f1
x[2] * f2 + x[3] * f1
x[3] * f2 + x[4] * f1
x[4] * f2 + x[5] * f1
x[5] * f2 + x[6] * f1
# convolution with 3 coefficients in filter
filter(1:5,c(f1,f2,f3),method="convolution")
[1] NA 6 9 12 NA
#equivalent to:
NA * f3 + x[1] * f2 + x[2] * f1 #x[0] = doesn't exist/NA
x[1] * f3 + x[2] * f2 + x[3] * f1
x[2] * f3 + x[3] * f2 + x[4] * f1
x[3] * f3 + x[4] * f2 + x[5] * f1
x[4] * f3 + x[5] * f2 + x[6] * f1
Now's when I am hurting my poor little brain stem.
I managed to figure out the most basic example using info at this post: https://stackoverflow.com/a/11552765/496803
filter(1:5, f1, method="recursive")
[1] 1 3 6 10 15
#equivalent to:
x[1]
x[2] + f1*x[1]
x[3] + f1*x[2] + f1^2*x[1]
x[4] + f1*x[3] + f1^2*x[2] + f1^3*x[1]
x[5] + f1*x[4] + f1^2*x[3] + f1^3*x[2] + f1^4*x[1]
Can someone provide similar code to what I have above for the convolution examples for the recursive version with filter = c(f1,f2) and filter = c(f1,f2,f3)?
Answers should match the results from the function:
filter(1:5, c(f1,f2), method="recursive")
[1] 1 3 7 14 26
filter(1:5, c(f1,f2,f3), method="recursive")
[1] 1 3 7 15 30
EDIT
To finalise using @agstudy's neat answer:
> filter(1:5, f1, method="recursive")
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1 3 6 10 15
> y1 <- x[1]
> y2 <- x[2] + f1*y1
> y3 <- x[3] + f1*y2
> y4 <- x[4] + f1*y3
> y5 <- x[5] + f1*y4
> c(y1,y2,y3,y4,y5)
[1] 1 3 6 10 15
and...
> filter(1:5, c(f1,f2), method="recursive")
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1 3 7 14 26
> y1 <- x[1]
> y2 <- x[2] + f1*y1
> y3 <- x[3] + f1*y2 + f2*y1
> y4 <- x[4] + f1*y3 + f2*y2
> y5 <- x[5] + f1*y4 + f2*y3
> c(y1,y2,y3,y4,y5)
[1] 1 3 7 14 26
and...
> filter(1:5, c(f1,f2,f3), method="recursive")
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1 3 7 15 30
> y1 <- x[1]
> y2 <- x[2] + f1*y1
> y3 <- x[3] + f1*y2 + f2*y1
> y4 <- x[4] + f1*y3 + f2*y2 + f3*y1
> y5 <- x[5] + f1*y4 + f2*y3 + f3*y2
> c(y1,y2,y3,y4,y5)
[1] 1 3 7 15 30
In the recursive case, I think there is no need to expand the expression in terms of the x values.
The key with "recursive" is to express the right hand expression in terms of previous y's.
I prefer thinking in terms of filter size.
filter size =1
y1 <- x1
y2 <- x2 + f1*y1
y3 <- x3 + f1*y2
y4 <- x4 + f1*y3
y5 <- x5 + f1*y4
filter size = 2
y1 <- x1
y2 <- x2 + f1*y1
y3 <- x3 + f1*y2 + f2*y1 # apply the filter for the past value and add current input
y4 <- x4 + f1*y3 + f2*y2
y5 <- x5 + f1*y4 + f2*y3
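The same pattern can be written as a loop for any filter size. This is only a sketch of the recurrence described above (`rec_filter` is a hypothetical helper, not part of R), compared against `stats::filter`, which is called with its namespace in case another package masks `filter`:

```r
rec_filter <- function(x, f) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    y[i] <- x[i]                      # start from the current input
    for (k in seq_along(f)) {
      # add f[k] times the output k steps back, if it exists
      if (i - k >= 1) y[i] <- y[i] + f[k] * y[i - k]
    }
  }
  y
}

res2 <- rec_filter(1:5, c(1, 1))      # 1 3 7 14 26
res3 <- rec_filter(1:5, c(1, 1, 1))   # 1 3 7 15 30

# Same numbers as the built-in recursive filter
ref2 <- as.numeric(stats::filter(1:5, c(1, 1), method = "recursive"))
ref3 <- as.numeric(stats::filter(1:5, c(1, 1, 1), method = "recursive"))
```

Outputs that would fall before the start of the series are treated as zero here, which matches the default behaviour seen in the examples above.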
Here's the example that I've found most helpful in visualizing what recursive filtering is really doing:
(x <- rep(1, 10))
# [1] 1 1 1 1 1 1 1 1 1 1
as.vector(filter(x, c(1), method="recursive")) ## Equivalent to cumsum()
# [1] 1 2 3 4 5 6 7 8 9 10
as.vector(filter(x, c(0,1), method="recursive"))
# [1] 1 1 2 2 3 3 4 4 5 5
as.vector(filter(x, c(0,0,1), method="recursive"))
# [1] 1 1 1 2 2 2 3 3 3 4
as.vector(filter(x, c(0,0,0,1), method="recursive"))
# [1] 1 1 1 1 2 2 2 2 3 3
as.vector(filter(x, c(0,0,0,0,1), method="recursive"))
# [1] 1 1 1 1 1 2 2 2 2 2
With recursive, the sequence of your "filters" gives the coefficients applied to the previous output values of the sequence. With filter=c(1,1) you're saying "take the i-th component in my sequence x and add to it 1 times the result from the previous step and 1 times the result from the step before that one". Here are a couple of examples to illustrate.
I think the lagged effect notation looks like this:
## only one filter, so autoregressive cumsum only looks "one sequence behind"
> filter(1:5, c(2), method='recursive')
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1 4 11 26 57
1 = 1
2*1 + 2 = 4
2*(2*1 + 2) + 3 = 11
...
## filter with lag in it, looks two sequences back
> filter(1:5, c(0, 2), method='recursive')
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1 2 5 8 15
1= 1
0*1 + 2 = 2
2*1 + 0*(0*1 + 2) + 3 = 5
2*(0*1 + 2) + 0 * (2*1 + 0*(0*1 + 2) + 3) + 4 = 8
2*(2*1 + 0*(0*1 + 2) + 3) + 0*(2*(0*1 + 2) + 0 * (2*1 + 0*(0*1 + 2) + 3) + 4) + 5 = 15
Do you see the cumulative pattern there? Put differently:
1 = 1
0*1 + 2 = 2
2*1 + 0*2 + 3 = 5
2*2 + 0*5 + 4 = 8
2*5 + 0*8 + 5 = 15
I spent an hour reading this; below is my summary, by comparison with Matlab.
NOTATION: command in Matlab = command in R
filter([1,1,1], 1, data) = filter(data, c(1,1,1), method = "convolution", sides = 1) ; the difference is that in R the first 2 elements are NA
filter(1, [1,-1,-1,-1], data) = filter(data, c(1,1,1), method = "recursive")
If you know some DSP: recursive corresponds to an IIR filter, convolution to an FIR filter.
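On the R side, where the NA values land in the convolution case depends on the `sides` argument. A small sketch on 1:5 (`one_sided` and `two_sided` are names introduced here):

```r
x <- 1:5

# sides = 1: only current and past values are used, so the first two are NA
one_sided <- as.numeric(stats::filter(x, c(1, 1, 1),
                                      method = "convolution", sides = 1))
one_sided   # NA NA 6 9 12

# sides = 2 (the default): window centred on the current value
two_sided <- as.numeric(stats::filter(x, c(1, 1, 1),
                                      method = "convolution", sides = 2))
two_sided   # NA 6 9 12 NA
```

The one-sided form is the causal one, which is why it is the natural counterpart when comparing against a causal FIR filter.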

Resources