Centering Variables in R

Do centered variables have to stay in matrix form when using them in a regression equation?
I have centered a few variables using the scale function with center=TRUE and scale=FALSE. I then converted those variables to numeric so that I could manipulate the data frame for other purposes. However, when I run an ANOVA, I get slightly different F values for just that variable; everything else is the same.
Edit:
What's the difference between these two:
df$A <- scale(df$A, center=TRUE, scale=FALSE)
which embeds a matrix within your data.frame,
AND
df$A <- scale(df$A, center=TRUE, scale=FALSE)
df$A <- as.numeric(df$A)
which makes variable A numeric and removes the matrix structure within the variable?
Here is an example of what I am trying to do (though this example does not reproduce the problem I am having):
library(car)
library(MASS)
mtcars$wt_c <- scale(mtcars$wt, center=TRUE, scale=FALSE)
mtcars$gear <- as.factor(mtcars$gear)
mtcars1 <- as.data.frame(mtcars)
# Part 1
rlm.mpg <- rlm(mpg~wt_c+gear+wt_c*gear, data=mtcars1)
anova.mpg <- Anova(rlm.mpg, type="III")
# Part 2
# Make wt_c Numeric
mtcars1$wt_c <- as.numeric(mtcars1$wt_c)
rlm.mpg2 <- rlm(mpg~wt_c+gear+wt_c*gear, mtcars1)
anova.mpg2 <- Anova(rlm.mpg2, type="III")
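To quantify the difference between the two runs, the two tables can be compared directly (a quick check, assuming both fits converged without warnings):
all.equal(anova.mpg, anova.mpg2)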

I'll attempt to answer both of your questions.
Do centered variables have to stay in matrix form when using them in a regression equation?
I'm not sure what you mean by this, but you can strip the center and scale attributes you get back from scale() if that is what you are referring to. You can see in the example below that you get the same answer whether the variable is in 'matrix form' or not.
What's the difference between these two:
df$A <- scale(df$A, center=TRUE, scale=FALSE)
which embeds a matrix within your data.frame,
AND
df$A <- scale(df$A, center=TRUE, scale=FALSE)
df$A <- as.numeric(df$A)
From the help file for scale() we see that it returns,
"For scale.default, the centered, scaled matrix."
You are getting back a matrix with the attributes scaled:center and scaled:scale. as.numeric() strips off those attributes, which is the difference between your first and second method. c() does the same thing; I would guess as.numeric() either calls c() (through as.double()) or uses the same mechanism.
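A quick way to watch the attributes come and go (a minimal illustration):
AA <- scale(1:5, center = TRUE, scale = FALSE)
attributes(AA)              # $dim plus $`scaled:center`
attributes(as.numeric(AA))  # NULL: all attributes stripped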
set.seed(1234)
test <- data.frame(matrix(runif(10*5),10,5))
head(test)
X1 X2 X3 X4 X5
1 0.1137034 0.6935913 0.31661245 0.4560915 0.5533336
2 0.6222994 0.5449748 0.30269337 0.2651867 0.6464061
3 0.6092747 0.2827336 0.15904600 0.3046722 0.3118243
4 0.6233794 0.9234335 0.03999592 0.5073069 0.6218192
5 0.8609154 0.2923158 0.21879954 0.1810962 0.3297702
6 0.6403106 0.8372956 0.81059855 0.7596706 0.5019975
# center and scale
testVar <- scale(test[,1])
testVar
[,1]
[1,] -1.36612292
[2,] 0.48410899
[3,] 0.43672627
[4,] 0.48803808
[5,] 1.35217501
[6,] 0.54963231
[7,] -1.74522210
[8,] -0.93376661
[9,] 0.64339300
[10,] 0.09103797
attr(,"scaled:center")
[1] 0.4892264
attr(,"scaled:scale")
[1] 0.2748823
# put testvar back with its friends
bindVar <- cbind(testVar,test[,2:5])
# run a regression with 'matrix form' y var
testLm1 <- lm(testVar~.,data=bindVar)
# strip non-name attributes
testVar <- as.numeric(testVar)
# rebind and regress
bindVar <- cbind(testVar,test[,2:5])
testLm2 <- lm(testVar~.,data=bindVar)
# check for equality
all.equal(testLm1, testLm2)
[1] TRUE
lm() seems to return the same thing either way, so both forms appear to be equivalent.
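Applied to the mtcars example above, the attributes can simply be stripped at creation time (a one-line variant of the same idea):
mtcars$wt_c <- as.numeric(scale(mtcars$wt, center=TRUE, scale=FALSE))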

Related


How can I get the contribution by each predictor to the final regression prediction in lm

Using R, when I use rlm or lm, I would like to get the contribution of each predictor to the model's prediction.
The problem occurs when I have interaction terms, as I don't think the interaction columns are stored in the lm object.
Below is sample data (I am looking for a way that generalizes to any number of predictors).
Sample data:
set.seed(1)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m)
betaHat <- coefficients(lmObj)
betaHat
v1 v3 v2 v1:v3 v3:v2
0.03455 -0.50224 -0.57745 0.58905 -0.65592
# How do I get the data.frame or matrix with columns (v1, v3, v2, v1:v3, v3:v2)
# holding [m$v1 * betaHat["v1"], ..., (m$v3 * m$v2) * betaHat["v3:v2"]]?
At first I thought that by "contribution" you meant the explained variance of each term (which an ANOVA table would give), but actually you want term-wise predictions:
predict(lmObj, type = "terms")
See ?predict.lm.
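As a sanity check, the term columns plus the "constant" attribute add back up to the fitted values:
tt <- predict(lmObj, type = "terms")
# rows of the terms matrix plus the constant reproduce the fitted values
all.equal(rowSums(tt) + attr(tt, "constant"), fitted(lmObj))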
Actually I got it from lm itself; the trick is to ask for x=TRUE:
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m, x=TRUE)
lmObj$x %*% diag(lmObj$coefficients)
[,1] [,2] [,3] [,4] [,5]
1 0.0522305 -0.68238 -0.53066 1.20993 -0.81898
2 0.0134687 0.05162 -0.45164 -0.02360 0.05273
3 -0.0214632 -0.19470 -0.04306 -0.14187 -0.01896
4 -0.0765156 0.02702 1.14875 0.07019 -0.07021
5 0.0388652 0.69161 -0.35792 -0.91250 0.55985
6 -0.0015524 0.20843 0.03241 0.01098 -0.01528
7 -0.0005594 0.19803 0.08996 0.00376 -0.04029
8 0.0326086 0.02979 0.84928 -0.03298 -0.05722
9 0.0283723 -0.55248 0.27611 0.53213 0.34500
10 0.0205187 -0.38330 -0.24134 0.26699 -0.20921
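An equivalent formulation that keeps the term names on the columns (one possible variation using sweep()):
contrib <- sweep(lmObj$x, 2, coef(lmObj), `*`)
head(contrib)  # same values, with v1, v3, v2, v1:v3, v3:v2 as column names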

How to calculate "terms" from predict-function manually when regression has an interaction term

Does anyone know how the predict function calculates terms when there is an interaction term in the regression model? I know how to compute the terms when the regression has no interaction term, but once I add one I can't reproduce them manually anymore. Here is some example data; I would like to see how to calculate those values by hand. Thanks! -Aleksi
set.seed(2)
a <- c(4,3,2,5,3) # first I make some data
b <- c(2,1,4,3,5)
e <- rnorm(5)
y <- 0.6*a + e
data <- data.frame(a,b,y)
model1 <- lm(y~a*b,data=data) # regression
predict(model1,type='terms',data) # terms
#This gives the result:
a b a:b
1 0.04870807 -0.3649011 0.2049069
2 -0.03247205 -0.7298021 0.7740928
3 -0.11365216 0.3649011 0.2049069
4 0.12988818 0.0000000 -0.5919534
5 -0.03247205 0.7298021 -0.5919534
attr(,"constant")
[1] 1.973031
Your model is technically y = b0 + b1*a + b2*b + b3*(a*b) + e. Each term is calculated by multiplying the variable by its coefficient and centering the result. So, for example, the terms for a would be
cf <- coef(model1)
scale(a * cf[2], scale = FALSE)
[,1]
[1,] 0.04870807
[2,] -0.03247205
[3,] -0.11365216
[4,] 0.12988818
[5,] -0.03247205
which matches your output above.
And since the interaction term is nothing more than the product of the two variables, this translates to
scale(a * b * cf[4], scale = FALSE)
[,1]
[1,] 0.2049069
[2,] 0.7740928
[3,] 0.2049069
[4,] -0.5919534
[5,] -0.5919534
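Putting all three terms together (the b column uses cf[3]), the manual reconstruction matches predict exactly:
manual <- cbind(a = scale(a * cf[2], scale = FALSE),
                b = scale(b * cf[3], scale = FALSE),
                `a:b` = scale(a * b * cf[4], scale = FALSE))
max(abs(manual - predict(model1, type = "terms")))  # effectively zero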

Pairwise Correlation Table

I'm new to R, so I apologize if this is a straightforward question, but I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and I'd like to create a table of the correlations among a subset of them: basically the equivalent of "pwcorr" in Stata or "correlations" in SPSS. The one key is that I want not only the r but also the significance associated with each value.
Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.
Bill Venables offers this solution in this answer from the R mailing list, to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)       # F statistic for H0: rho = 0
  R[above] <- 1 - pf(Fstat, 1, dfr)  # p-values into the upper triangle
  cor.mat <- t(R)                    # p-values move to the lower triangle
  cor.mat[upper.tri(cor.mat)] <- NA  # drop the correlations from the upper triangle
  cor.mat                            # lower triangle now holds the p-values
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 NA NA NA NA
[2,] 0.7005361 1.0000000 NA NA NA
[3,] 0.5990483 0.6816955 1.0000000 NA NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000 NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835 1
Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
Here is something that I just made; I stumbled on this post because I was looking for a way to take every pair of variables and get a tidy n-by-4 data frame. Columns 1 and 2 are the variable names, and columns 3 and 4 are their absolute and signed correlation. Just pass the function a data frame of numeric and integer columns.
pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Variable1=rep(0, length(pairs)), Variable2=rep(0, length(pairs)),
                   AbsCor=rep(0, length(pairs)), Cor=rep(0, length(pairs)))
  for(i in 1:length(pairs)){
    df[i, 1] <- pairs[[i]][1]
    df[i, 2] <- pairs[[i]][2]
    df[i, 3] <- round(abs(cor(dataframe[, pairs[[i]][1]], dataframe[, pairs[[i]][2]])), 4)
    df[i, 4] <- round(cor(dataframe[, pairs[[i]][1]], dataframe[, pairs[[i]][2]]), 4)
  }
  pairwiseCorDF <- df[order(df$AbsCor, decreasing=TRUE), ]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF  # side effect: also writes the result to the global environment
  pairwiseCorDF
}
This is what the output is:
> head(pairwiseCorDF)
Variable1 Variable2 AbsCor Cor
1 roll_belt accel_belt_z 0.9920 -0.9920
2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
3 roll_belt total_accel_belt 0.9811 0.9811
4 total_accel_belt accel_belt_z 0.9752 -0.9752
5 pitch_belt accel_belt_x 0.9658 -0.9658
6 gyros_dumbbell_z gyros_forearm_z 0.9491 0.9491
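For a loop-free alternative (a sketch, assuming a data frame df with only numeric columns):
prs <- t(combn(names(df), 2))
corDF <- data.frame(Variable1 = prs[, 1], Variable2 = prs[, 2],
                    Cor = apply(prs, 1, function(p) cor(df[[p[1]]], df[[p[2]]])))
corDF[order(abs(corDF$Cor), decreasing = TRUE), ]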
I've found that the R package picante does a nice job dealing with the problem you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman's rho in the function. See this link for help:
http://www.inside-r.org/packages/cran/picante/docs/cor.table
Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:
install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
You can use the sjt.corr function of the sjPlot package, which gives you a nicely formatted correlation table, ready for use in your Office application.
Simplest function call is just to pass the data frame:
sjt.corr(df)
See examples here.

Returning a vector of attributes shared by a set of objects

I have a list of lm (linear model) objects.
How can I select a particular element (such as the intercept, rank, or residuals) from all the objects in a single call?
I use the plyr package. If my list of objects is called modelOutput and I want all the predicted values, I would do this (the extractor has to be wrapped in a function so ldply can apply it to each model):
modelPredictions <- ldply(modelOutput, function(m) as.data.frame(predict(m)))
If I want all the coefficients, I do this:
modelCoef <- ldply(modelOutput, function(m) as.data.frame(coef(m)))
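The same pattern extends to any extractor function, e.g. residuals (a sketch along the same lines):
modelResiduals <- ldply(modelOutput, function(m) as.data.frame(resid(m)))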
Hadley originally showed me how to do this in a previous question.
First I'll generate some example data:
> set.seed(123)
> x <- 1:10
> a <- 3
> b <- 5
> fit <- list()
> for (i in 1:10) {
+ y <- a + b*x + rnorm(10,0,.3)
+ fit[[i]] <- lm(y ~ x)
+ }
Here's one option for grabbing the estimates from each fit:
> t(sapply(fit, function(x) coef(x)))
(Intercept) x
[1,] 3.157640 4.975409
[2,] 3.274724 4.961430
[3,] 2.632744 5.043616
[4,] 3.228908 4.975946
[5,] 2.933742 5.011572
[6,] 3.097926 4.994287
[7,] 2.709796 5.059478
[8,] 2.766553 5.022649
[9,] 2.981451 5.020450
[10,] 3.238266 4.980520
As you mention, other quantities concerning the fit are available. Above I only grabbed the coefficients with the coef() function. Check out the following command for more:
names(summary(fit[[1]]))
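The same pattern pulls any summary element across all fits, for instance the R-squared values (one possible example):
sapply(fit, function(m) summary(m)$r.squared)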
