I'm having a brain freeze, and hoping one of you can point me in the right direction. My end goal is the output of various regression coefficients (mainly interested in price elasticity), which I achieved via simple multiple regression, using the "by" function.
I am using the "by" function to loop through the regression formula for each iteration of the "State.UPC" variable. Since my data is quite large (~1MM rows), I had to subset my data into groups of 3-4 states (see mystates1...mystates10). I am then performing the regression on those subsets, each time changing my data source in the "datastep3" data frame. And this is where I need your help:
What is the best way to efficiently re-write this with a combination of my existing "by" regression function, and the "for" loops, so I can bypass the step of constantly changing the data frame name in "datastep3" and the "write.csv" steps. Essentially R looping through each "mystates" data subset and doing the regression by the "State.UPC" attributes?
I have tried several combinations with no success. Pardon the amateurish question...still learning R. Here is my code:
data <-read.csv("PriceData.csv")
datastep1 <-subset(data, subset=c(X..Vol>0, Unit.Vol>0))
datastep2 <- transform(datastep1, State.UPC = paste(State,UPC, sep="."))
mystates1 <- c("AL","AR","AZ")
mystates2 <- c("CA","CO","FL")
mystates3 <- c("GA","IA","IL")
mystates4 <- c("IN","KS","KY")
mystates5 <- c("LA","MI","MN")
mystates6 <- c("MO","MS","NC")
mystates7 <- c("NJ","NM","NV")
mystates8 <- c("NY","OH","OK")
mystates9 <- c("SC","TN","TX")
mystates10 <- c("UT","VA","WI","WV")
datastep3 <-subset(datastep2, subset=State %in% mystates10)
datastep4 <-na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
lm(log(Unit.Vol)~log(Price) + Distribution+Independence.Day+Labor.Day+Memorial.Day+Thanksgiving+Christmas+New.Years+
Year+Month, data=df))
x <- do.call("rbind",lapply(PEbyItem, coef))
y <-data.frame(x)
write.csv(x, file="mystates10.csv", row.names=TRUE)
Impossible to test this because you do not provide any data, but theoretically you could just combine the various mystatesN into a list and then run lapply(...) on that.
## Not tested...
get.PEbyItem <- function(i) {
datastep3 <-subset(datastep2, subset=State %in% mystates[[i]])
datastep4 <-na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
lm(log(Unit.Vol)~log(Price) + Distribution+Independence.Day+Labor.Day+
Memorial.Day+Thanksgiving+Christmas+New.Years+Year+Month,
data=df))
x <- do.call("rbind",lapply(PEbyItem, coef))
y <-data.frame(x)
write.csv(x, file=paste(names(mystates[i]),"csv",sep="."), row.names=TRUE)
}
mystates <- list(ms1=mystates1, ms2=mystates2, ..., ms10=mystates10)
lapply(1:length(mystates),get.PEbyItem)
There are lots of other things that could be improved but without the dataset it's pointless to try.
Related
I am trying to loop over objects in R.
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
The function here works perfectly as it should. It returns a vector of 48 rows and it comes from the object x. Now 'x' is nothing but the full regression model from a GLM function (think: mod.fit <- glm (dep~indep, data = data)). The problem is that I have 20 different such ('mod.fit') objects and need to find predictions for each of these. I could literally repeat the code, but I was looking to find a neater solution. So what I want is a matrix with 48 rows and 20 columns for the above function. This is probably basic for an advanced user, but I have only ever used "apply" and "for" loops for numbers and never objects. I looked into lapply but couldn't figure it out.
I tried: (and this is probably dumb)
allmodels <- c(mod.fit, mod.fit2, mod.fit3)
lpred.matrix <- matrix(data=NA, nrow=48, ncol=20)
for(i in allmodels){
lpred.matrix[i,] <- myfunc.linear.pred(i)
}
which obviously won't work because allmodels has a class of "list" and it contains all the stuff from the GLM function. Hope someone can help. Thanks!
In order to use lapply, you must have a list object not a vector object. Something like this should work:
## Load data
data("mtcars")
# fit models
mod.fit1 <- glm (mpg~disp, data = mtcars)
mod.fit2 <- glm (mpg~drat, data = mtcars)
mod.fit3 <- glm (mpg~wt, data = mtcars)
# build function
myfunc.linear.pred <- function(x){
linear.pred <- predict(object = x)
w <- exp(linear.pred)/(1+exp(linear.pred))
as.vector(w)
}
# put models in a list
allmodels <- list("mod1" = mod.fit1, "mod2" = mod.fit2, "mod2" =
mod.fit3)
# use lapply and do.call to generate matrix of prediction results
df <- do.call('cbind', lapply(allmodels, function(x){
a <- myfunc.linear.pred(x)
}))
Hope this helps
I am performing a regression analysis within R that looks the following:
lm_carclass_mod <- lm(log(count_faves+1)~log(views+1)+dateadded+group_url+license+log(precontext.nextphoto.views+1)+log(precontext.prevphoto.views+1)+log(oid.Bridge+1)+log(oid.Face+1)+log(oid.Quail+1)+log(oid.Sky+1)+log(oid.Car+1)+log(oid.Auditorium+1)+log(oid.Font+1)+log(oid.Lane+1)+log(oid.Bmw+1)+log(oid.Racing+1)+log(oid.Wheel+1),data=flickrcar_wo_country)
confint(lm_carclass_mod,level=0.95)
summary(lm_carclass_mod)
The dependent variable as well as some of the independent variables are quite variable throughout my analysis, which is why I would like to keep inserting them manually.
However, I am looking for a way to replace all of the "oid. ..." variables with one single function.
So far I have come up with the following:
g <- paste("log(",variables,"+1)", collapse="+")
Unfortuntaley this does not work inside the lm() function. Neither does a formula like this:
g <- as.formula(
paste("log(",variables,"+1)", collapse="+")
)
The vector variables has the following elements in it:
variables <- ("oid.Bridge", "oid.Face", "oid.Quail", "oid.Off-roading", "oid.Sky", "oid.Car", "oid.Auditorium", "oid.Font", "oid.Lane", "oid.Bmw", "oid.Racing", "oid.Wheel")
In the end my regression model should look something like this:
lm_carclass_mod <- lm(log(count_faves+1)~log(views+1)+dateadded+group_url+license+log(precontext.nextphoto.views+1)+log(precontext.prevphoto.views+1)+g,data=flickrcar_wo_country)
confint(lm_carclass_mod,level=0.95)
summary(lm_carclass_mod)
Thanks for your helpm in advance!
You would need to convert both of the parts into a string and then make the formula:
#the manual bit
manual <- "log(count_faves+1)~log(views+1)+dateadded+group_url+license+log(precontext.nextphoto.views+1)+log(precontext.prevphoto.views+1)"
#the variables:
oid_variables <- c("oid.Bridge", "oid.Face", "oid.Quail", "oid.Off-roading", "oid.Sky", "oid.Car", "oid.Auditorium", "oid.Font", "oid.Lane", "oid.Bmw", "oid.Racing", "oid.Wheel")
#paste them together
g <- paste("log(", oid_variables, "+1)", collapse="+")
#make the formula
myformula <- as.formula(paste(manual, '+', g))
Then you add the formula into lm:
lm_carclass_mod <- lm(myformula, data=flickrcar_wo_country
This post sort of gets to my question :Linear Regression loop for each independent variable individually against dependent
However, I am trying too add rolling periods for the regression.
Example:
data <-data.frame("col1"=runif(10,2,10),"col2"=runif(10,1,10),"col3"=runif(10,1,10),"col4"=runif(10,1,10))
sapply(data, function(x) rollapply(data,30, coef(lm(data$col1~x,data=data))))
EDIT: To give an deeper idea of what I am after, I should note that prior to using the Sapply method (after having read some SO posts that vectorized solutions were preferred), I had been trying for loops to loop over the columns I wanted to use as independent variables.
betadf <- data.frame()
for (i in colnames(data2[,3:ncol(data2)])){
formula <- formula(paste("variablename ~",i,sep=""))
data3 <- na.omit(merge(data2[,'variablename'],data2[,i]))
model <- na.omit(rollapply(data3,rollperiod,
function(z) coef(lm(formula,data=as.data.frame(z))),
by.column = FALSE, align = "right"))
colnames(model) <- c("intercept",i)
betadf <- cbind(betadf,model[,i])
}
I'm having some difficulties figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split I basically want to run a linear regression model on the subsets. Here is the code I'm working with, but it's definitely not correct. Also, It would be great if I could output the model results to a new data frame such that each site would have one row with the model parameter estimates - that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
lm1 <-lm(cpe~K,data=les_events)
coef <- coef(lm1)
q.hat <- -coef(lm1)[2]
les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues.
You haven't passed a function (the FUN argument) to lapply
Your closure ( The bit inside {} is almost, but not quite the body you want for your function)
something like th following will return the coefficients from your models
result <- lapply(les_events, function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
For a more general split / apply / combine approaches use plyr or data.table
data.table
library(data.table)
DT <- data.table(les)
result <- les[, {lm1 <- lm(cpe ~ K, data = .SD)
as.list(lm1)}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, link5, function(DD){ lm(cpe ~K, data =DD)})
How can I insert summary outputs from multiple regression analyses in a matrix type variable in R statistics package?
Here is my script, which runs the regression and collect intercepts and co-eff in a variable:
for (i in 2:(ncol(data.base))) {
Test <- lm(data.base[,i] ~ log(database$var.1))
results <- rbind(results, c(Test$coefficients))
}
I would like to do is to import summary(lm-test) for each regression in to a matrix type variable. I assume the matrix type variable is what I need.
I appreciate your help.
Yuck! Some nasty variable naming there, in my opinion.
I see data.base has outcomes, and you don't want the first column but each is a separate outcome. You also have database which is a data.frame with a variable var.1. Run each regression, store them in a matrix format.
This is a start:
fits <- apply(data.base[, -1], 2, function(y) lm(y ~ log(database$var.1))
summ <- lapply(fits, summary)
summ <- lapply(fits, coef)
Reduce(cbind, summ)