r-squared by groups in linear regression - r

I have calculated a linear regression using all 24 elements of my dataset, and the resulting model is IP2. Now I want to know how well that single model fits (the r-squared; I am not interested in the slope and intercept) for each country in my dataset. The awful way to do it (I would need to repeat the following 200 times) is:
Country <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B")
IP <- c(55,56,59,63,67,69,69,73,74,74,79,87,0,22,24,26,26,31,37,41,43,46,46,47)
IP2 <- c(46,47,49,50,53,55,53,57,60,57,58,63,0,19,20,21,22,25,26,28,29,30,31,31)
summary(lm(IP[Country=="A"] ~ IP2[Country=="A"]))
summary(lm(IP[Country=="B"] ~ IP2[Country=="B"]))
Is there a way of calculating both r-squared values at the same time? I tried the approach from Linear Regression and group by in R as well as some other posts (Fitting several regression models with dplyr), but it did not work: I get the same coefficients for all four groups I am working with.
Any idea on what I am doing wrong or how to solve the problem?
Thank you

A few options with base R:
sapply(unique(Country), function(cn)
summary(lm(IP[Country == cn] ~ IP2[Country == cn]))$r.sq)
# A B
# 0.9451881 0.9496636
and
c(by(data.frame(IP, IP2), Country, function(x) summary(lm(x))$r.sq))
# A B
# 0.9451881 0.9496636
or
sapply(split(data.frame(IP, IP2), Country), function(x) summary(lm(x))$r.sq)
# A B
# 0.9451881 0.9496636
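For reference, the dplyr/broom approach from the linked post also works once the vectors are wrapped in a data frame. Here is a hedged sketch using the same data as above (do() is the older dplyr idiom, also used further down this page):
library(dplyr)
library(broom)
dat <- data.frame(Country, IP, IP2)
dat %>%
  group_by(Country) %>%
  do(glance(lm(IP ~ IP2, data = .))) %>%   # one-row model summary per country
  select(Country, r.squared)               # r.squared matches the base R results above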

You can use the split function and then mapply to accomplish this.
split takes a vector and turns it into a list with k elements, where k is the number of distinct levels of (in this case) Country.
mapply allows us to loop over multiple inputs.
getR2 is a simple function that takes two inputs, fits a model and then extracts the R^2 value.
Code example below
Country <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B")
IP <- c(55,56,59,63,67,69,69,73,74,74,79,87,0,22,24,26,26,31,37,41,43,46,46,47)
IP2 <- c(46,47,49,50,53,55,53,57,60,57,58,63,0,19,20,21,22,25,26,28,29,30,31,31)
ip_split = split(IP,Country)
ip2_split = split(IP2,Country)
getR2 = function(ip, ip2) {
  model = lm(ip ~ ip2)
  return(summary(model)$r.squared)
}
r2.values = mapply(getR2,ip_split,ip2_split)
r2.values
#> A B
#> 0.9451881 0.9496636

Related

Performing linear regressions using columns of two matrices in R

I have two large matrices with the same dimensions e.g.:
#dummy matrices
A <- matrix(c(1:3288),nrow=12)
B <- matrix(c(3289:6576),nrow=12)
For each column I would like to run a linear regression between the two matrices (A and B) and if possible I would like to get the output of the lm into a data frame e.g. for each column's regression I want to know lm the r^2, the slope, the intercept etc.
Any help appreciated.
Assuming that you'll fit a regression between any two combinations of columns, this could be a solution. Keep in mind that the code will change depending on what you finally want in the resulting data.frame.
A <- matrix(c(1:3288),nrow=12)
B <- matrix(c(3289:6576),nrow=12)
library(broom)
library(dplyr)
results <- NULL
for (i in 1:ncol(A)) {
  for (j in 1:ncol(B)) {
    model_ <- lm(A[, i] ~ B[, j])
    results <- bind_rows(results,
                         bind_cols(columnx = i,
                                   columny = j,
                                   glance(model_),
                                   intercept = model_$coefficients[1],
                                   slope = model_$coefficients[2]))
  }
}
If you only need pairwise regressions, in the sense that column 1 in A is fitted against column 1 in B, 2 against 2, and so on, a more elegant solution can be written using map from the purrr package. Hope this helps.
Edit: fitting only column 1 in A with column 1 in B, and so forth:
library(purrr)
library(dplyr)
library(broom)
A<-data.frame(A)
B<-data.frame(B)
results <- map2_df(.x = A,
                   .y = B,
                   ~ {
                     model_ <- lm(.y ~ .x)
                     bind_cols(glance(model_),
                               intercept = model_$coefficients[1],
                               slope = model_$coefficients[2])
                   })
The purrr documentation explains clearly how map2_df works: it basically loops over two lists at the same time, executing one function and returning a data.frame.

Computing slope of changing data

In R, I have a dataset of (x, y) points that is constantly being updated via simulation (values are appended to the end of the dataset).
I would like to compute the slope (via a linear model) of the line created by the data using only the last 10 listed datapoints.
The confusion here arises from the fact that the data are changing, and so I suspect a loop may be needed to iterate over the indices of the datapoints.
In R, one usually does something like
linreg <- lm(y ~ x, data = d) # set up linear model
summary.linreg <- summary(linreg) # output summary of model
beta1 <- coef(summary.linreg)[2] # extract slope
The change that is needed in my case is in linreg, specifically
linreg <- lm(y[?] ~ x[?], data = d) # subset response and predictor
For a non-changing dataset of 10 x-y points, one simply does [?] = [1:10] and the problem is solved. In my case though, I am at a standstill as to the best way to proceed efficiently.
Any thoughts?
No, don't subset inside the formula. Subset the data.frame. Inside your loop, after each database update, do this:
linreg <- lm(y ~ x, data = tail(d, 10))
If instead you want to loop over the rows of a data.frame, do this:
linreg <- lm(y ~ x, data = d[i:(i+9),])
If your data.frame is large and you only need the slope, you should use the more low-level function lm.fit for better performance. There might also be packages that provide functions for rolling regression.
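For example, a minimal sketch of the lm.fit idea (the column names x and y in d are assumptions):
last10 <- tail(d, 10)                 # last 10 rows only
X <- cbind(1, last10$x)               # design matrix: intercept column + predictor
fit <- lm.fit(X, last10$y)            # low-level fit, no formula/summary overhead
beta1 <- fit$coefficients[2]          # the slope; note lm.fit does not compute p-values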

How to run many linear regressions/correlations in one data set

I have one data set in an excel/csv form. I wish to run many simple linear regressions/correlations (each with a p-value).
I have several independent variables (x's) and one dependent variable (y).
The variables are all columns of data, not rows. Each column has the name of the data type in the first cell, and all the numerical data in the lower cells.
I want to create a loop instead of manually running each test, but I'm unfamiliar with loops in R. If anyone could help, I would greatly appreciate it. Thanks!
Without more detail it's hard to know for sure, but using dplyr and broom might get you where you need to go.
For example, this runs a linear model for each group:
library(broom)
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))
For more detail, may I suggest: http://r4ds.had.co.nz/many-models.html
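If, rather than one model per group, you want one simple regression of y on each predictor column, a hedged base R sketch along those lines (assuming your data frame is called dat and the response column is named y) could be:
predictors <- setdiff(names(dat), "y")
fits <- lapply(predictors, function(v) lm(reformulate(v, response = "y"), data = dat))
results <- data.frame(
  predictor = predictors,
  slope     = sapply(fits, function(m) coef(m)[2]),
  p_value   = sapply(fits, function(m) summary(m)$coefficients[2, 4])
)
results   # one row per predictor, with its slope and p-value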
Here is my attempt to use a simulated data set to demonstrate how to 1) "manually" compute correlations, and 2) compute correlations iteratively with a for loop in R.
First, generate simulated data with two independent variables, x1 (normally distributed) and x2 (exponentially distributed), and a dependent variable y (same distribution as x1):
set.seed(1) #reproducibility
## The first column is your DEPENDENT variable
## The rest are independent variables
data <- data.frame(y=rnorm(100,0.5,1), x1=rnorm(100,0,1), x2= rexp(100,0.5))
"Manually" compute correlation:
cor_x1_y <- cor.test(data$x1, data$y)
cor_x2_y <- cor.test(data$x2, data$y)
c(cor_x1_y$estimate, cor_x2_y$estimate) #corr. coefficients
## cor cor
## -0.0009943199 -0.0404557828
c(cor_x1_y$p.value, cor_x2_y$p.value) #p values
## [1] 0.9921663 0.6894252
Iteratively compute correlation and store results in a matrix called results:
results <- NULL # placeholder
for (i in 2:ncol(data)) {
  ## Perform the i-th test:
  one_test <- cor.test(data[, i], data$y)
  test_cor <- one_test$estimate
  p_value <- one_test$p.value
  ## Add any other parameters you'd like to include
  ## Update the results matrix
  results <- rbind(results, c(test_cor, p_value))
}
colnames(results) <- c("correlation", "p_value")
results
## correlation p_value
## [1,] -0.0009943199 0.9921663
## [2,] -0.0404557828 0.6894252

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors, each with about 80 entries. I'm trying to find significant correlations, meaning I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, that is: Y = a*X + b.
The lm() command then gives me an estimate of a and a p-value for that estimate. I would then use another one of my variables for X and try again, until I find a p-value that is really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This produces a corrgram plot of the pairwise correlations.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import the library and find the correlations (as in @Carlos's answer):
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m) {
  for (j in 1:m) {
    if (!is.na(correlations$P[i, j])) {
      if (correlations$P[i, j] < 0.05) {
        print(paste(rownames(correlations$P)[i], "-",
                    colnames(correlations$P)[j], ": ",
                    correlations$P[i, j]))
      }
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values purely by random chance: https://www.xkcd.com/882/. There are statistical methods better suited to this that adjust for running multiple tests, e.g. the Bonferroni correction (https://en.wikipedia.org/wiki/Bonferroni_correction).
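As a concrete illustration of such an adjustment (a sketch only, reusing the correlations object from the rcorr code above and base R's p.adjust):
p_raw <- correlations$P[upper.tri(correlations$P)]   # each pair counted once
p_adj <- p.adjust(p_raw, method = "bonferroni")      # Bonferroni-adjusted p-values
sum(p_adj < 0.05, na.rm = TRUE)                       # how many pairs survive the correction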
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x) cor(the_data[, 1], x),
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
Knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To find it, just use which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator.
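A minimal sketch of that lasso suggestion with the glmnet package (reusing the_data from the answers above, with the response in the first column; the package choice is an assumption, not part of the original answer):
library(glmnet)
x <- as.matrix(the_data[, -1])
y <- the_data[, 1]
cv_fit <- cv.glmnet(x, y)              # cross-validated lasso path
coef(cv_fit, s = "lambda.min")         # nonzero coefficients = selected predictors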
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

R: extract regression coefficients from multiple regressions via lapply

I have a large dataset with several variables, one of which is a state variable, coded 1-50 for each state. I'd like to run a regression of 28 variables on the remaining 27 variables of the dataset (there are 55 variables total), and specific for each state.
In other words, run a regression of variable1 on covariate1, covariate2, ..., covariate27 for observations where state==1. I'd then like to repeat this for variable1 for states 2-50, and then repeat the whole process for variable2, variable3, ..., variable28.
I think I've written the correct R code to do this, but the next thing I'd like to do is extract the coefficients, ideally into a coefficient matrix. Could someone please help me with this? Here's the code I've written so far:
for (num in 1:50) {
  # PUF is the data set I'm using
  # Subset the data by state
  PUFnum <- subset(PUF, state == num)
  # Attach data set with state-specific data
  attach(PUFnum)
  # Run our prediction regression
  # the variables class1 through e19700 are the 27 covariates I want to use
  regression <- lapply(PUFnum, function(z) lm(z ~ class1+class2+class3+class4+class5+class6+class7+
                                                xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
                                                e09600+e07180+e07220+e07260+e06500+e10300+
                                                e59720+e11900+e18425+e18450+e18500+e19700))
  Beta <- lapply(regression, function(d) d <- coef(regression$d))
  detach(PUFnum)
}
This is another example of the classic split-apply-combine problem, which can be addressed using the plyr package by @hadley. In your problem, you want to
Split data frame by state
Apply regressions for each subset
Combine coefficients into data frame.
I will illustrate it with the Cars93 dataset available in the MASS package. We are interested in the relationship between Horsepower and EngineSize by country of origin (Origin).
# LOAD LIBRARIES
require(MASS); require(plyr)
# SPLIT-APPLY-COMBINE
regressions <- dlply(Cars93, .(Origin), lm, formula = Horsepower ~ EngineSize)
coefs <- ldply(regressions, coef)
Origin (Intercept) EngineSize
1 USA 33.13666 37.29919
2 non-USA 15.68747 55.39211
Edit: for your example, substitute PUF for Cars93, state for Origin, and fm for the formula.
I've cleaned up your code slightly:
fm <- z ~ class1+class2+class3+class4+class5+class6+class7+
xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
e09600+e07180+e07220+e07260+e06500+e10300+
e59720+e11900+e18425+e18450+e18500+e19700
PUFsplit <- split(PUF, PUF$state)
mod <- lapply(PUFsplit, function(z) lm(fm, data=z))
Beta <- sapply(mod, coef)
If you wanted, you could even put this all in one line:
Beta <- sapply(lapply(split(PUF, PUF$state), function(z) lm(fm, data=z)), coef)
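Since sapply returns the coefficients as columns (one column per state), transposing gives one row per state if you prefer the coefficient matrix in that orientation, e.g.:
Beta_mat <- t(Beta)   # rows = states, columns = intercept and covariate coefficients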
