R / Rolling Regression with extended Data Frame - r

Hello, I'm currently working on a regression analysis with the following code:
for (i in 1:ncol(Ret1)){
r2.out[i]=summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of each column in the data frame against the first column and reports the R^2 of each regression. At the moment the regression uses all data points of a column. What I need now is for the code to use a rolling window of data points instead: it should calculate the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with the R^2 per rolling window for each (1,i) pair.
This code does the rolling regression part but does not run the regression for each (1,i) pair:
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.

Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq, which takes the indexes to use, ix, and the x variable name, xname. We then sapply over the xnames and, for each one, rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
x1 x2 x3 x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply replacing the last line of code with:
rollapply(1:n, w, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, w, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
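Applied to the question's setup, the same idea might look like the following sketch. It uses only base R in place of rollapply, assumes the first column is the response, and substitutes a simulated data frame for Ret1, since the real data is not shown in the question; the column names are made up for illustration.

```r
set.seed(42)

# Hypothetical stand-in for the asker's Ret1: 100 days, 4 return series
Ret1 <- as.data.frame(matrix(rnorm(400), ncol = 4,
                             dimnames = list(NULL, c("mkt", "a", "b", "c"))))

w <- 30                        # rolling window width
n <- nrow(Ret1)
yname  <- names(Ret1)[1]       # response: the first column
xnames <- names(Ret1)[-1]      # every other column is a regressor
starts <- seq_len(n - w + 1)   # first row of each window

# R^2 of yname ~ xname on the rows given by ix
rsq <- function(ix, xname) {
  f <- reformulate(xname, response = yname)
  summary(lm(f, data = Ret1[ix, ]))$r.squared
}

# One column per regressor, one row per window
r2.roll <- sapply(xnames, function(xname)
  sapply(starts, function(s) rsq(s:(s + w - 1), xname)))

dim(r2.roll)
```

The result has n - w + 1 rows and one column per regressor. For a simple regression with an intercept, each entry equals the squared correlation of the two series within that window, which gives a quick way to check the output.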

Related

McNemar test in R - sparse data

I'm attempting to run a good sized dataset through R, using the McNemar test to determine whether I have a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table of
test1
y n
y 34 2
n 12 16
but if I try and run something more like:
34 0
12 0
it errors, telling me that "'x' and 'y' must have the same number of levels (minimum 2)".
I should clarify that I've tried converting wide data to a 2x2 matrix using the table function on my wide data set, where, rather than appearing as above, it drops the final column, giving me:
test1
y
y 34
n 12
I've also run mcnemar.test using the factor object option, which gives me the same error, so I'm assuming that it does something similar. I'm wondering whether there is either a way to force the table function to generate the 2nd column despite there being no observations which would fall under either of those categories, or a way to make the test overlook this missing data?
Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
# x2
# x1 y n
# y 0 34
# n 0 12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08
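As a check on the output above, the continuity-corrected statistic can be reproduced by hand from the two off-diagonal counts. This is only a sketch; the names n_yn, n_ny and stat_manual are introduced here for illustration.

```r
# Same example data, with both levels declared explicitly
x1 <- factor(c(rep("y", 34), rep("n", 12)), levels = c("y", "n"))
x2 <- factor(rep("n", 46), levels = c("y", "n"))
tab <- table(x1, x2)

# Discordant pairs: the off-diagonal cells of the 2x2 table
n_yn <- as.numeric(tab["y", "n"])
n_ny <- as.numeric(tab["n", "y"])

# McNemar statistic with continuity correction
stat_manual <- (abs(n_yn - n_ny) - 1)^2 / (n_yn + n_ny)

# Statistic as computed by mcnemar.test
stat_test <- unname(mcnemar.test(tab)$statistic)

c(manual = stat_manual, test = stat_test)
```

Only the discordant cells enter the statistic, which is why the test needs the empty "y" column to exist in the table even though it contains no observations.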

Error with predict - rows don't match [duplicate]

I'm running a linear regression where the predictor is categorized by another value and am having trouble generating modeled responses for newdata.
First, I generate some random values for the predictor and the error terms. I then construct the response. Note that the predictor's coefficient depends on the value of a categorical variable. I compose a design matrix based on the predictor and its category.
set.seed(1)
category = c(rep("red", 5), rep("blue",5))
x1 = rnorm(10, mean = 1, sd = 1)
err = rnorm(10, mean = 0, sd = 1)
y = ifelse(category == "red", x1 * 2, x1 * 3)
y = y + err
df = data.frame(x1 = x1, category = category)
dm = as.data.frame(model.matrix(~ category + 0, data = df))
dm = dm * df$x1
fit = lm(y ~ as.matrix(dm) + 0, data = df)
# This line will not produce a warning
predictOne = predict.lm(fit, newdata = dm)
# This line WILL produce a warning
predictTwo = predict.lm(fit, newdata = dm[1:5,])
The warning is:
'newdata' had 5 rows but variable(s) found have 10 rows
Unless I'm very much mistaken, I shouldn't have any issues with the variable names. (There are one or two discussions on this board which suggest that issue.) Note that the first prediction runs fine, but the second does not. The only change is that the second prediction uses only the first five rows of the design matrix.
Thoughts?
I'm not 100% sure what you're trying to do, but I think a short walk-through of how formulas work will clear things up for you.
The basic idea is very simple: you pass two things, a formula and a data frame. The terms in the formula should all be names of variables in your data frame.
Now, you can get lm to work without following that guideline exactly, but you're just asking for things to go wrong. So stop and look at your model specifications and think about where R is looking for things.
When you call lm basically none of the names in your formula are actually found in the data frame df. So I suspect that df isn't being used at all.
Then if you call model.frame(fit) you'll see what R thinks your variables should be called. Notice anything strange?
model.frame(fit)
y as.matrix(dm).categoryblue as.matrix(dm).categoryred
1 2.2588735 0.0000000 0.3735462
2 2.7571299 0.0000000 1.1836433
3 -0.2924978 0.0000000 0.1643714
4 2.9758617 0.0000000 2.5952808
5 3.7839465 0.0000000 1.3295078
6 0.4936612 0.1795316 0.0000000
7 4.4460969 1.4874291 0.0000000
8 6.1588103 1.7383247 0.0000000
9 5.5485653 1.5757814 0.0000000
10 2.6777362 0.6946116 0.0000000
Is there anything called as.matrix(dm).categoryblue in dm? Yeah, I didn't think so.
I suspect (but am not sure) that you meant to do something more like this:
df$y <- y
fit <- lm(y~category - 1,data = df)
Joran is on the right track. The issue relates to column names. What I had wanted to do was create my own design matrix, something which, as it happens, I didn't need to do. If I run the model with the following line of code, it's smooth sailing:
fit = lm(y ~ x1:category + 0, data = df)
That formula designation will replace the manual construction of the design matrix.
Using my own design matrix is something I had done in the past and the fit parameters and diagnostics were just as they ought to have been. I'd not used the predict function, so had never known that R was discarding the "data = " parameter. A warning would have been cool. R is a harsh mistress.
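A short sketch of the difference this makes for predict, reusing the simulated data from the question. Here y is stored inside the data frame and category is converted to a factor, both of which are choices made for this illustration rather than part of the original code.

```r
set.seed(1)
category <- c(rep("red", 5), rep("blue", 5))
x1  <- rnorm(10, mean = 1, sd = 1)
err <- rnorm(10, mean = 0, sd = 1)
y   <- ifelse(category == "red", x1 * 2, x1 * 3) + err

# Keep every variable the formula mentions inside one data frame
df <- data.frame(x1 = x1, category = factor(category), y = y)

# One slope per category, no intercept
fit <- lm(y ~ x1:category + 0, data = df)

# predict can now look up x1 and category by name in newdata
pred_all  <- predict(fit, newdata = df)
pred_five <- predict(fit, newdata = df[1:5, ])

length(pred_five)
```

Because the formula refers to columns by name, predict evaluates it against whatever rows newdata contains, so the five-row subset yields five predictions and no warning about mismatched row counts.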
This may help. Convert the new data to a data.frame, for example:
x = 1:5
y = c(2,4,6,8,10)
fit = lm(y ~ x)
# PREDICTION
newx = c(3,5,7)
predict(fit, data.frame(x=newx))

Writing a loop for randomly selecting rows of a matrix and doing a linear regression on data from rows and storing in a matrix

I need to write a program that does the following in R:
I have a data set (42 rows, 2 columns) of y variables and x variables.
I want to randomly select 12 rows from this matrix and record the coefficients (slope and intercept) of a linear regression of the randomly generated matrix. I would also like to write a loop for this so I can repeat this 1000 times, so I can then have a matrix with 1000 rows and 2 columns filled in with the slopes and intercepts of the 1000 randomly selected sets of 12 rows from my data set.
I am able to get this far but do not know how to incorporate a loop into the code, or how to store the coefficients in a matrix.
#Box.Z and Box.DC.gm are columns of data used to generate my initial matrix of data
A <- matrix(c(Box.Z, Box.DC.gm), nrow=42)
B <- A[sample(42, 12), ]
C <- lm(B[,2] ~ B[,1])
D <- matrix(c(coefficients(C)), ncol =2)
Something like this maybe:
#set.seed(23)
A <- matrix(runif(84),ncol=2)
randco <- function(A) {
B <- A[sample(42,12),]
lm(B[,2] ~ B[,1])$coefficients
}
t(replicate(10,randco(A)))
# (Intercept) B[, 1]
# [1,] 0.6018459 -0.1643174222
# [2,] 0.4411607 0.0005322798
#...
# [9,] 0.3201649 0.4848679516
#[10,] 0.5413830 0.1850853748
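Scaled up to the 1000 repetitions asked for, the same approach could be sketched as follows. The random matrix A again stands in for the real 42 x 2 data, and the function name rand_coef, its argument k and the column labels are assumptions made for this example.

```r
set.seed(23)

# Stand-in for the real data: column 1 is x, column 2 is y
A <- matrix(runif(84), ncol = 2)

# Fit y ~ x on k randomly chosen rows and return intercept and slope
rand_coef <- function(A, k = 12) {
  B <- A[sample(nrow(A), k), , drop = FALSE]
  coef(lm(B[, 2] ~ B[, 1]))
}

# replicate returns a 2 x 1000 matrix, so transpose to get one row per draw
coefs <- t(replicate(1000, rand_coef(A)))
colnames(coefs) <- c("intercept", "slope")

dim(coefs)
head(coefs, 3)
```

Using nrow(A) instead of a hard-coded 42 keeps the function usable if the number of rows changes.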

How can I perform a pairwise t.test in R across multiple independent vectors?

TL;DR edition
I have vectors X1,X2,X3,...Xn. I want to test to see whether the average value for any one vector is significantly different than the average value for any other vector, for every possible combination of vectors. I am seeking a better way to do this in R than running n^2 individual t.tests.
Full Story
I have a data frame full of census data for a particular CSA. Each row contains observations for each variable (column) for a particular census tract.
What I need to do is compare means for the same variable across census tracts in different MSAs. In other words, I want to factor my data.frame according to the MSA designation variable (which is one of the columns) and then compare the differences in the means for another variable of interest pairwise across each newly-factored MSA. This is essentially doing pairwise t.tests across each ensuing vector, but I wish to do this in a more elegant way than writing t.test(MSAx, MSAy) over and over again. How can I do this?
The advantage of my method below over the one proposed by @ashkan is that mine removes duplicates (i.e. either X1 vs X2 or X2 vs X1 will appear in the results, not both).
# Generate dummy data
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)
# Create combinations of the variables
combinations <- combn(colnames(df),2, simplify = FALSE)
# Do the t.test
results <- lapply(seq_along(combinations), function (n) {
df <- df[,colnames(df) %in% unlist(combinations[n])]
result <- t.test(df[,1], df[,2])
return(result)})
# Rename list for legibility
names(results) <- sapply(combinations, paste, collapse = " vs. ")
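Since the list holds one full test object per pair, a natural follow-up is to pull out the p-values and adjust them for multiple comparisons. The sketch below rebuilds the dummy data so that it runs on its own; the seed and the choice of the Holm method are assumptions for illustration.

```r
set.seed(1)

# Dummy data: 10 variables with 10 observations each
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)

# Every unordered pair of columns, each appearing once
combinations <- combn(colnames(df), 2, simplify = FALSE)

# Welch t-test for each pair
results <- lapply(combinations, function(pair) t.test(df[[pair[1]]], df[[pair[2]]]))
names(results) <- sapply(combinations, paste, collapse = " vs. ")

# Raw p-values, then Holm-adjusted p-values
p_raw  <- sapply(results, function(r) r$p.value)
p_holm <- p.adjust(p_raw, method = "holm")

head(sort(p_holm))
```

With 10 variables there are 45 pairs, so some correction of this kind is usually worth considering before reading individual p-values.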
Just use pairwise.t.test, here is an example:
x1 <- rnorm(50)
x2 <- rnorm(30, mean=0.2)
x3 <- rnorm(100,mean=0.1)
x4 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(x1,x2,x3,x4),
key=c(
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3)),
rep("x4", length(x4))) )
pairwise.t.test(x$data,
x$key,
pool.sd=FALSE)
# Pairwise comparisons using t tests with non-pooled SD
#
# data: x$data and x$key
#
# x1 x2 x3
# x2 0.7395 - -
# x3 0.9633 0.9633 -
# x4 0.0067 0.9633 0.0121
#
# P value adjustment method: holm
If you have a data.frame and you wish to independently perform t-tests between each pair of columns of the data.frame, you can use a double apply loop:
apply(MSA, 2, function(x1) {
apply(MSA, 2, function(x2) {
t.test(x1, x2)
})
})
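The double loop above returns a nested list of test objects. If only the p-values are of interest, a variant along these lines collects them into a square matrix; the simulated MSA matrix and its column names are placeholders for the real data.

```r
set.seed(2)

# Placeholder for the real data: 20 observations of 3 variables
MSA <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("A", "B", "C")))

# Entry [i, j] is the p-value of the t-test between column i and column j
pmat <- apply(MSA, 2, function(x1) {
  apply(MSA, 2, function(x2) {
    t.test(x1, x2)$p.value
  })
})

round(pmat, 3)
```

The matrix is symmetric and its diagonal is 1, because each column is also tested against itself, so every comparison is computed twice.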
A good visualization to accompany such a brute force approach would be a forest plot:
# approximate 95% confidence interval for each column mean: mean +/- 1.96 standard errors
cis <- apply(MSA, 2, function(x) mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(length(x)))
plot.new()
plot.window(xlim=c(1, ncol(cis)), ylim=range(cis))
segments(1:ncol(cis), cis[1, ], 1:ncol(cis), cis[2, ])
axis(1, at=1:ncol(cis), labels=colnames(MSA))
axis(2)
box()
# overall mean; as.matrix lets this work when MSA is a data.frame
abline(h=mean(as.matrix(MSA)), lty='dashed')
title('Forest plot of 95% confidence intervals of MSA')
In addition to the response from quarzgar, there is another method to perform pairwise t-tests across multiple factors in R. The trick is to combine the levels of the two (or more) factors into a single factor.
Example with a 2x2 classical design:
df <- data.frame(Id=c(rep(1:100,2),rep(101:200,2)),
dv=c(rnorm(100,10,5),rnorm(100,20,7),rnorm(100,11,5),rnorm(100,12,6)),
Group=c(rep("Experimental",200),rep("Control",200)),
Condition=rep(c(rep("Pre",100),rep("Post",100)),2))
#ANOVA
summary(aov(dv~Group*Condition+Error(Id/Condition),data = df))
#post-hoc across all factors
df$posthoclevels <- paste(df$Group,df$Condition) #factor combination
pairwise.t.test(df$dv,df$posthoclevels)
# Pairwise comparisons using t tests with pooled SD
#
# data: df$dv and df$posthoclevels
#
# Control Post Control Pre Experimental Post
# Control Pre 0.60 - -
# Experimental Post <2e-16 <2e-16 -
# Experimental Pre 0.26 0.47 <2e-16
#
# P value adjustment method: holm
