R extract regression coefficients from multiple regressions via lapply command - r

I have a large dataset with several variables, one of which is a state variable, coded 1-50 for each state. I'd like to run a regression of each of 28 variables on the remaining 27 variables of the dataset (there are 55 variables total), separately for each state.
In other words, run a regression of variable1 on covariate1, covariate2, ..., covariate27 for observations where state==1. I'd then like to repeat this for variable1 for states 2-50, and then repeat the whole process for variable2, variable3, ..., variable28.
I think I've written the correct R code to do this, but the next thing I'd like to do is extract the coefficients, ideally into a coefficient matrix. Could someone please help me with this? Here's the code I've written so far:
for (num in 1:50) {
  # PUF is the data set I'm using
  # Subset the data by state
  PUFnum <- subset(PUF, state==num)
  # Attach data set with state-specific data
  attach(PUFnum)
  # Run our prediction regression
  # the variables class1 through e19700 are the 27 covariates I want to use
  regression <- lapply(PUFnum, function(z) lm(z ~ class1+class2+class3+class4+class5+class6+class7+
                         xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
                         e09600+e07180+e07220+e07260+e06500+e10300+
                         e59720+e11900+e18425+e18450+e18500+e19700))
  Beta <- lapply(regression, function(d) d <- coef(regression$d))
  detach(PUFnum)
}

This is another example of the classic Split-Apply-Combine problem, which can be addressed using the plyr package by @hadley. In your problem, you want to
1. Split the data frame by state
2. Apply the regression to each subset
3. Combine the coefficients into a data frame.
I will illustrate it with the Cars93 dataset available in the MASS package. We are interested in the relationship between Horsepower and EngineSize by country of origin.
# LOAD LIBRARIES
require(MASS); require(plyr)
# SPLIT-APPLY-COMBINE
regressions <- dlply(Cars93, .(Origin), lm, formula = Horsepower ~ EngineSize)
coefs <- ldply(regressions, coef)
coefs
#    Origin (Intercept) EngineSize
# 1     USA    33.13666   37.29919
# 2 non-USA    15.68747   55.39211
EDIT. For your example, substitute PUF for Cars93, state for Origin, and fm for the formula.
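For instance, a minimal sketch of that substitution (using the fm formula defined just below; untested against the real PUF data):
# Hedged sketch: the same dlply/ldply pattern applied to PUF, split by state
regressions <- dlply(PUF, .(state), lm, formula = fm)
coefs <- ldply(regressions, coef)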

I've cleaned up your code slightly:
# The LHS must name the response column in PUF; variable1 is the question's example outcome
fm <- variable1 ~ class1+class2+class3+class4+class5+class6+class7+
      xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
      e09600+e07180+e07220+e07260+e06500+e10300+
      e59720+e11900+e18425+e18450+e18500+e19700
PUFsplit <- split(PUF, PUF$state)
mod <- lapply(PUFsplit, function(z) lm(fm, data=z))
Beta <- sapply(mod, coef)
If you wanted, you could even put this all in one line:
Beta <- sapply(lapply(split(PUF, PUF$state), function(z) lm(fm, data=z)), coef)
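To cover all 28 outcomes, here is a hedged sketch that rebuilds the formula for each response and reruns the same pipeline. The names variable1 through variable28 are placeholders taken from the question's description, not actual PUF column names:
# Hypothetical sketch: swap the left-hand side of fm for each response column
responses <- paste0("variable", 1:28)
all_betas <- lapply(responses, function(v) {
  fm_v <- reformulate(all.vars(fm)[-1], response = v)  # same covariates, new LHS
  sapply(lapply(split(PUF, PUF$state), function(z) lm(fm_v, data = z)), coef)
})
names(all_betas) <- responses  # one coefficient matrix per outcome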

Related

r-squared by groups in linear regression

I have calculated a linear regression using all the elements of my dataset (24), and the resulting model is IP2. Now I want to know how well that single model fits (r-squared; I am not interested in the slope and intercept) for each country in my dataset. The awful way to do it is the following (I would need to repeat it about 200 times):
Country <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B")
IP <- c(55,56,59,63,67,69,69,73,74,74,79,87,0,22,24,26,26,31,37,41,43,46,46,47)
IP2 <- c(46,47,49,50,53,55,53,57,60,57,58,63,0,19,20,21,22,25,26,28,29,30,31,31)
summary(lm(IP[Country=="A"] ~ IP2[Country=="A"]))
summary(lm(IP[Country=="B"] ~ IP2[Country=="B"]))
Is there a way of calculating both r-squared values at the same time? I tried Linear Regression and group by in R as well as some other posts (Fitting several regression models with dplyr), but it did not work, and I get the same coefficients for the four groups I am working with.
Any idea on what I am doing wrong or how to solve the problem?
Thank you
A couple of options with base R:
sapply(unique(Country), function(cn)
  summary(lm(IP[Country == cn] ~ IP2[Country == cn]))$r.sq)
# A B
# 0.9451881 0.9496636
and
c(by(data.frame(IP, IP2), Country, function(x) summary(lm(x))$r.sq))
# A B
# 0.9451881 0.9496636
or
sapply(split(data.frame(IP, IP2), Country), function(x) summary(lm(x))$r.sq)
# A B
# 0.9451881 0.9496636
You can use the split function and then mapply to accomplish this.
split takes a vector and turns it into a list with k elements, where k is the number of distinct levels of (in this case) Country.
mapply lets us loop over multiple inputs in parallel.
getR2 is a simple function that takes two inputs, fits a model, and then extracts the R^2 value.
Code example below
Country <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B")
IP <- c(55,56,59,63,67,69,69,73,74,74,79,87,0,22,24,26,26,31,37,41,43,46,46,47)
IP2 <- c(46,47,49,50,53,55,53,57,60,57,58,63,0,19,20,21,22,25,26,28,29,30,31,31)
ip_split = split(IP,Country)
ip2_split = split(IP2,Country)
getR2 = function(ip, ip2){
  model = lm(ip ~ ip2)
  return(summary(model)$r.squared)
}
r2.values = mapply(getR2,ip_split,ip2_split)
r2.values
#> A B
#> 0.9451881 0.9496636
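Since the question mentions a failed dplyr attempt, here is a hedged sketch of the grouped version with dplyr and broom (assuming a reasonably recent dplyr; broom's glance() returns r.squared among its columns):
library(dplyr); library(broom)
data.frame(Country, IP, IP2) %>%
  group_by(Country) %>%
  group_modify(~ glance(lm(IP ~ IP2, data = .x))) %>%  # one model per country
  select(Country, r.squared)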

Computing slope of changing data

In R, I have a dataset of (x, y) points that is constantly being updated via simulation (values are appended to the end of the dataset).
I would like to compute the slope (via a linear model) of the line created by the data using only the last 10 listed datapoints.
The confusion here arises from the fact that the data are changing, and so I suspect a loop may be needed to iterate over the indices of the datapoints.
In R, one usually does something like
linreg <- lm(y ~ x, data = d) # set up linear model
summary.linreg <- summary(linreg) # output summary of model
beta1 <- coef(summary.linreg)[2] # extract slope
The change that is needed in my case is in linreg, specifically
linreg <- lm(y[?] ~ x[?], data = d) # subset response and predictor
For a non-changing dataset of 10 x-y points, one simply does [?] = [1:10] and the problem is solved. In my case though, I am at a standstill as to the best way to proceed efficiently.
Any thoughts?
No, don't subset inside the formula. Subset the data.frame. Inside your loop, after each database update, do this:
linreg <- lm(y ~ x, data = tail(d, 10))
If you want to loop over a data.frame rows, do this:
linreg <- lm(y ~ x, data = d[i:(i+9),])
If your data.frame is large and you only need the slope, you should use the lower-level function lm.fit for better performance. There might also be packages that provide functions for rolling regression.
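A minimal lm.fit sketch of that idea, assuming d has numeric columns x and y as above:
last10 <- tail(d, 10)                        # keep only the 10 newest points
X <- cbind(1, last10$x)                      # design matrix: intercept + x
beta1 <- lm.fit(X, last10$y)$coefficients[2] # slope estimate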

How to run many linear regressions/correlations in one data set

I have one data set in an excel/csv form. I wish to run many simple linear regressions/correlations (each with a p-value).
I have several independent variables (x's) and one dependent variable (y).
The variables are all columns of data, not rows. Each column has the name of the data type in the first cell, and all the numerical data in the lower cells.
I want to create a loop instead of manually running each test, but I'm unfamiliar with loops in R. If anyone could help, I would greatly appreciate it. Thanks!
Without more detail it's hard to know for sure, but using dplyr and broom might get you where you need to go.
For example, this runs a linear model for each group:
library(broom)
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))
For more detail, may I suggest: http://r4ds.had.co.nz/many-models.html
Here is my attempt to use a simulated data set to demonstrate (1) computing correlations "manually", and (2) computing them iteratively with a for loop in R:
First, simulate data with two independent variables, x1 (normally distributed) and x2 (exponentially distributed), and a dependent variable y (also normally distributed):
set.seed(1) #reproducibility
## The first column is your DEPENDENT variable
## The rest are independent variables
data <- data.frame(y=rnorm(100,0.5,1), x1=rnorm(100,0,1), x2= rexp(100,0.5))
"Manually" compute correlation:
cor_x1_y <- cor.test(data$x1, data$y)
cor_x2_y <- cor.test(data$x2, data$y)
c(cor_x1_y$estimate, cor_x2_y$estimate) #corr. coefficients
## cor cor
## -0.0009943199 -0.0404557828
c(cor_x1_y$p.value, cor_x2_y$p.value) #p values
## [1] 0.9921663 0.6894252
Iteratively compute the correlations and store the results in a matrix called results:
results <- NULL # placeholder
for(i in 2:ncol(data)) {
  ## Perform the i-th test:
  one_test <- cor.test(data[,i], data$y)
  test_cor <- one_test$estimate
  p_value <- one_test$p.value
  ## Add any other parameters you'd like to include
  ## Update the results matrix
  results <- rbind(results, c(test_cor, p_value))
}
colnames(results) <- c("correlation", "p_value")
results
## correlation p_value
## [1,] -0.0009943199 0.9921663
## [2,] -0.0404557828 0.6894252
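The loop can also be replaced by a one-pass sapply over the covariate columns; a small sketch that reproduces the same numbers, with row names added for readability:
# Apply cor.test to every column except the dependent variable y
results2 <- t(sapply(data[-1], function(x) {
  ct <- cor.test(x, data$y)
  c(correlation = unname(ct$estimate), p_value = ct$p.value)
}))
results2
##     correlation   p_value
## x1 -0.0009943199 0.9921663
## x2 -0.0404557828 0.6894252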

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for effects() in mlogit not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see index(mod1) below).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get effects() to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
# Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
  gather(Open, Rank, -c(1,9:12)) %>%
  arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long', alt.var='Open', choice='Rank', ranked=TRUE, chid.var='X')
#Fit Model
mod1 <- mlogit(Rank ~ 1 | gender+age+economic+Job, data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample (this is the example from the effects help file)
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
                           catch = tapply(catch, index(m)$alt, mean),
                           income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling

log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = α_k + x^T β_k

for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (The same goes for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
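If you later want predicted probabilities for a new covariate combination, a hedged sketch (the numeric values here are hypothetical; the factor levels must match those in dat3):
# Hypothetical new case: the economic and age values are made up
newdat <- data.frame(economic = 5, gender = "Female",
                     age = 40, Job = "Government backbencher")
predict(fit1, newdata = newdat, type = "probs")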
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure for ranked data in mlogit is that the first set of records for a person contains all options, the second contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence the bunch of NAs. We just need to get rid of them.
with(mlogit.out, data.frame(economic = tapply(economic,
     index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

plot multiple fit and predictions for logistic regression

I am running a logistic regression many times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot the outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example, I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the League (A or N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model the player be league A in function of the Hits,Runs,Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
  sp=sample(seq(1:length(PA)),30,replace=FALSE)
  fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
  prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
  fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
1. PA is a subset of baseball$League, but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
2. PA is treated as a continuous response when using the default family (gaussian); it should be changed to a factor, with the binomial family.
3. prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm.
4. You must save the true class labels for the predictions, otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
  sp <- sample(nrow(dataset), 30, replace=FALSE)
  fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
                   family="binomial", data=dataset[sp[1:15],])
  preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
  truths[[k]] <- dataset$League[sp[16:30]]  # labels for the held-out rows
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
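If you want that comparison as an explicit boxplot rather than relying on plot's factor dispatch, a one-line sketch:
# Predicted probability of league N, grouped by the true league
boxplot(unlist(preds) ~ unlist(truths), xlab = "True league", ylab = "Pr(League == N)")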
