Loop Over Columns to perform Rolling Regression - r

This post sort of gets at my question: Linear Regression loop for each independent variable individually against dependent. However, I am trying to add rolling periods to the regression.
Example:
data <- data.frame("col1" = runif(10, 2, 10), "col2" = runif(10, 1, 10),
                   "col3" = runif(10, 1, 10), "col4" = runif(10, 1, 10))
sapply(data, function(x) rollapply(data, 30, coef(lm(data$col1 ~ x, data = data))))
EDIT: To give a deeper idea of what I am after, I should note that prior to using the sapply method (after having read some SO posts suggesting that vectorized solutions were preferred), I had been using for loops to loop over the columns I wanted to use as independent variables.
betadf <- data.frame()
for (i in colnames(data2[, 3:ncol(data2)])) {
  formula <- formula(paste("variablename ~", i, sep = ""))
  data3 <- na.omit(merge(data2[, 'variablename'], data2[, i]))
  model <- na.omit(rollapply(data3, rollperiod,
                             function(z) coef(lm(formula, data = as.data.frame(z))),
                             by.column = FALSE, align = "right"))
  colnames(model) <- c("intercept", i)
  betadf <- cbind(betadf, model[, i])
}
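A minimal sketch of one way to combine rollapply with a loop over the columns, assuming rollapply from the zoo package and using the toy data frame above (the window is shortened to 5 so it fits the 10 example rows; roll_betas is just an illustrative name):
library(zoo)  # rollapply

# rolling slope of col1 ~ x for every other column, over a 5-row window
roll_betas <- sapply(names(data)[-1], function(nm) {
  rollapply(as.matrix(data[, c("col1", nm)]), width = 5,
            FUN = function(z) coef(lm(z[, 1] ~ z[, 2]))[2],
            by.column = FALSE, align = "right")
})
roll_betas  # one column of rolling slopes per predictor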

Related

How to use a for loop for the svyttest function in the survey package?

I am trying to use the svyttest function in a for loop in the survey package. I want to test for differences in proportions of responses between subpopulations in likert-scale type data. For example, in a survey question (1=strongly disagree, 5 = strongly agree), are there statistically significant differences in the proportion of "strongly disagree" responses between Groups 1 and 2?
I understand that I can also use the svyglm function from the survey package, but I have been unable to successfully use that in a for loop.
I also understand that there is a wtd.t.test in the weights package and the glm function in the stats package has a weights argument, but neither of these two options get the correct results. I need to use either the svyttest or the svyglm functions in the survey package.
For reference I have been looking here and here for some help but have been unable to adapt these examples to my problem.
Thank you for your time and effort.
# create example survey data
ids <- 1:1000
stratas <- rep(c("strata1", "strata2","strata3","strata4"), each=250)
weight <- rep(c(5,2,1,1), each=250)
group <- rep(c(1,2), times=500)
q1 <- sample(1:5, 1000, replace = TRUE)
survey_data <- data.frame(ids, stratas, weight, group, q1)
# create example svydesign
library(survey)
survey_design <- svydesign(ids = ~0,
                           probs = NULL,
                           strata = survey_data$stratas,
                           weights = survey_data$weight,
                           data = survey_data)
# look at the proportions of q1 responses by group
prop.table(svytable(~q1+group, design = survey_design), margin = 2)
# t-test for significant differences in the proportions of the first item in q1
svyttest(q1 == 1 ~ group, design = survey_design)
# trying a for loop for all five items
for (i in 1:5) {
  print(svyttest(q1 == i ~ group, design = survey_design))
}
# I receive the following error:
Error in svyglm.survey.design(formula, design, family = gaussian()) :
all variables must be in design= argument
When dynamically building a formula inside a function or a loop, you need to wrap the pasted string in as.formula() so that it is converted into a proper formula object. This should work:
# trying a for loop for all five items
for (i in 1:5) {
  print(svyttest(as.formula(paste("q1==", i, "~group")),
                 design = survey_design))
}
I tried a trick: you can build up the formulas in a list and use that in your loop:
x <- c()
for (i in 1:5) {
  x <- append(x, as.formula(paste("q1==", i, "~ group")))
  print(svyttest(x[[i]], design = survey_design))
}
With regards,
Aleksei
I would use bquote
for (i in 1:5) {
  print(eval(
    bquote(svyttest(q1 == .(i) ~ group, design = survey_design))
  ))
}
In this example as.formula works just as well, but bquote is more general.
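If the goal is a table of results rather than printed output, a small sketch along these lines collects the estimates and p-values (using lapply with the as.formula approach above; the column names here are just illustrative):
results <- do.call(rbind, lapply(1:5, function(i) {
  tt <- svyttest(as.formula(paste("q1 ==", i, "~ group")), design = survey_design)
  data.frame(item = i,
             estimate = unname(tt$estimate),
             p_value = tt$p.value)
}))
results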

Deduplicate a list of lm objects in R

I have a list of lm model objects with possible duplicates, so I'd like a way to check whether some of these lm objects are equal and, if so, delete them. In other words, I want to "deduplicate" my list.
I'd appreciate very much any help.
An example of the problem:
## Creates outcome and predictors
outcome <- c(names(mtcars)[1:3])
predictors <- c(names(mtcars)[4:11])
dataset <- mtcars
## Creates model list
model_list <- lapply(seq_along(predictors), function(n) {
  left_hand_side <- outcome[1]
  right_hand_side <- apply(X = combn(predictors, n), MARGIN = 2, paste, collapse = " + ")
  paste(left_hand_side, right_hand_side, sep = " ~ ")
})
## Convert model list into a vector
model_vector <- unlist(model_list)
## Fit linear models to all items from the vector of models
list_of_fit <- lapply(model_vector, function(x) {
  formula <- as.formula(x)
  fit <- step(lm(formula, data = dataset))
  fit
})
# Exclude possible missing
list_of_fit <- Filter(Negate(function(x) is.null(unlist(x))), list_of_fit)
# These models are the same in my list
lm253 <- list_of_fit[[253]];lm253
lm254 <- list_of_fit[[254]];lm254
lm255 <- list_of_fit[[255]];lm255
I want to exclude duplicated entries in list_of_fit.
It seems wasteful to fit so many models and then throw away most of them. Your object names make your code hard to read for me, but it seems your models can be distinguished based on their formula. Maybe this helps:
list_of_fit[!duplicated(vapply(list_of_fit,
                               function(m) deparse(m$call),
                               FUN.VALUE = "a"))]
I made a simple correction to your code, Roland, and it worked for me.
I changed deparse(m$call) to deparse(formula(m)); this way I am able to compare the complete formulas.
list_of_fit[!duplicated(vapply(list_of_fit, function(m) deparse(formula(m)), FUN.VALUE = "a"))]
Thank you very much!
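One caveat with comparing deparsed formulas: two step() results that contain the same terms in a different order will not be flagged as duplicates. A small sketch that sorts the term labels into a canonical key before comparing (model_key is just an illustrative helper name):
model_key <- function(m) {
  tt <- terms(formula(m))
  # response on the left, sorted term labels on the right
  paste(deparse(formula(m)[[2]]),
        paste(sort(attr(tt, "term.labels")), collapse = " + "),
        sep = " ~ ")
}
deduped <- list_of_fit[!duplicated(vapply(list_of_fit, model_key, character(1)))]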

Ideas to re-write looping regression with 'for' loops

I'm having a brain freeze, and hoping one of you can point me in the right direction. My end goal is the output of various regression coefficients (mainly interested in price elasticity), which I achieved via simple multiple regression, using the "by" function.
I am using the "by" function to loop through the regression formula for each iteration of the "State.UPC" variable. Since my data is quite large (~1MM rows), I had to subset my data into groups of 3-4 states (see mystates1...mystates10). I am then performing the regression on those subsets, each time changing my data source in the "datastep3" data frame. And this is where I need your help:
What is the best way to efficiently re-write this with a combination of my existing "by" regression function, and the "for" loops, so I can bypass the step of constantly changing the data frame name in "datastep3" and the "write.csv" steps. Essentially R looping through each "mystates" data subset and doing the regression by the "State.UPC" attributes?
I have tried several combinations with no success. Pardon the amateurish question...still learning R. Here is my code:
data <- read.csv("PriceData.csv")
datastep1 <- subset(data, subset = c(X..Vol > 0, Unit.Vol > 0))
datastep2 <- transform(datastep1, State.UPC = paste(State, UPC, sep = "."))
mystates1 <- c("AL","AR","AZ")
mystates2 <- c("CA","CO","FL")
mystates3 <- c("GA","IA","IL")
mystates4 <- c("IN","KS","KY")
mystates5 <- c("LA","MI","MN")
mystates6 <- c("MO","MS","NC")
mystates7 <- c("NJ","NM","NV")
mystates8 <- c("NY","OH","OK")
mystates9 <- c("SC","TN","TX")
mystates10 <- c("UT","VA","WI","WV")
datastep3 <- subset(datastep2, subset = State %in% mystates10)
datastep4 <- na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
  lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
       Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
     data = df))
x <- do.call("rbind", lapply(PEbyItem, coef))
y <- data.frame(x)
write.csv(x, file = "mystates10.csv", row.names = TRUE)
Impossible to test this because you do not provide any data, but theoretically you could just combine the various mystatesN into a list and then run lapply(...) on that.
## Not tested...
get.PEbyItem <- function(i) {
  datastep3 <- subset(datastep2, subset = State %in% mystates[[i]])
  datastep4 <- na.omit(datastep3)
  PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
    lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
         Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
       data = df))
  x <- do.call("rbind", lapply(PEbyItem, coef))
  y <- data.frame(x)
  write.csv(x, file = paste(names(mystates[i]), "csv", sep = "."), row.names = TRUE)
}
mystates <- list(ms1 = mystates1, ms2 = mystates2, ..., ms10 = mystates10)
lapply(1:length(mystates), get.PEbyItem)
There are lots of other things that could be improved but without the dataset it's pointless to try.
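As one example of such an improvement (also untested without the data): have the function return the coefficient table instead of writing one CSV per subset, so everything can be combined into a single data frame afterwards. The name get.coefs is just illustrative:
## Also not tested...
get.coefs <- function(i) {
  datastep3 <- subset(datastep2, subset = State %in% mystates[[i]])
  datastep4 <- na.omit(datastep3)
  PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
    lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
         Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
       data = df))
  x <- do.call("rbind", lapply(PEbyItem, coef))
  data.frame(subset = names(mystates)[i], State.UPC = rownames(x), x, row.names = NULL)
}
all_coefs <- do.call("rbind", lapply(seq_along(mystates), get.coefs))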

plyr + forecasting multiple regressors

Taking the content in this thread a bit further: I've gone as far as I can, but finally hit a wall. I'm looking to use plyr to create some ARIMA models with exogenous regressors at scale. A high-level overview of the process I've been using (code with example data follows).
1) I have a dataframe with businesses, regions, revenue and orders, all by date
2) For each combination of business + region, I want to create a forecast for revenue based on previous values of revenue + previous values of orders.
3) I want to use an ARIMA model (using auto.arima() ) to figure out optimal orders for both revenue and orders, then apply that information to a forecast function
4) The problem I run into seems to boil down to not being able to pass multiple lists to a plyr argument to operate on, which most likely in turn boils down to my not fully understanding how llply works (so hopefully this is an easy task)
Here's some sample data I'm working off:
library(plyr)
library(xts)
library(forecast)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 100, replace = TRUE),
  region = sample(c("mideast", "americas", "asia"), 100, replace = TRUE),
  date = rep(seq(as.Date("2010-02-01"), length = 10, by = "1 day"), 10),
  revenue = sample(1:100),
  orders = sample(1:100)
)
Edit: First, reorganize data through ddply to get rid of duplicate entries:
dataframe <- ddply(data, c("biz", "region", "date"), function(df) {
  c(revenue = sum(df[, 4]),
    orders = sum(df[, 5]))
})
Step 1: Create a list that contains the time series info for each combination of business + region:
list1 <- dlply(dataframe, .(biz,region), identity)
Step 2: Turn that list into an XTS object so we can use it for time-series analysis:
xtsobject <- llply(list1, function(list) {
  xts(x = list[, c("revenue", "orders")], order.by = list[, "date"])
})
Here's where I run into trouble. I want to make a list of orders from the auto.arima() function to pass into a forecast.Arima() function. This would be straightforward if I were just doing one variable with no exogenous regressors:
arimamodel1 <- llply(xtsobject, function(list) {
  fity <- auto.arima(list$revenue)
})
And then I would apply that list to the forecast.Arima() function:
forecast1 <- llply(arimamodel1, function(model) {
  forecast.Arima(model, h = 2)
})
That comes out fine. I've tried changing the argument to include some room for the extra regressors, but I'm not sure the forecasts are actually pulling in the x values:
arimamodel2 <- llply(xtsobject, function(list) {
  fity <- auto.arima(list$revenue, xreg = list$orders)
  fitx <- auto.arima(list$orders)
})
and the forecasts:
and the forecasts:
forecast2 <- llply(arimamodel2, function(model) {
  forecast.Arima(model, h = 2)
})
... But it seems like in the forecast function I should be doing something to account for the x-regressor model, the way I normally use forecast.Arima() with multiple regressors; something like:
forecast.Arima(model, h = 2, xreg = forecast(model, h = 2)$mean)
But this doesn't work. Does anybody have any insight into how to use plyr to make forecasts based on auto.arima() for multiple regressors?
I'm pretty sure I figured this out, in case anybody stumbles onto this question. It's just a matter of making a function that passes through all these arguments, then passing that function through lapply or llply (the data in the question won't work for auto.arima because of the way it was created, but it works on the actual data I'm using):
arimafunc <- function(list) {
  fity <- auto.arima(list$revenue, xreg = list$orders)
  fitx <- auto.arima(list$orders)
  forecast <- forecast.Arima(fity, h = 2, xreg = forecast(fitx, h = 2)$mean)
  return(forecast)
}
then pass through the list apply:
forecasts <- lapply(xtsobject, FUN = arimafunc)
I'm sure there's a way to do this using built-in functionality of something like llply or from one of the base commands, mapply, but this works for now...
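For what it's worth, the llply version should be a drop-in replacement here; a minimal sketch, assuming the xtsobject and arimafunc defined above:
# same mapping, using plyr's llply instead of base lapply
forecasts <- llply(xtsobject, arimafunc)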

apply series of commands to split data frame

I'm having some difficulties figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split, I basically want to run a linear regression model on the subsets. Here is the code I'm working with, but it's definitely not correct. Also, it would be great if I could output the model results to a new data frame such that each site would have one row with the model parameter estimates - that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
  lm1 <- lm(cpe ~ K, data = les_events)
  coef <- coef(lm1)
  q.hat <- -coef(lm1)[2]
  les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues:
You haven't passed a function (the FUN argument) to lapply.
Your closure (the bit inside {}) is almost, but not quite, the body you want for your function.
Something like the following will return the coefficients from your models:
result <- lapply(les_events, function(DD) {
  lm1 <- lm(cpe ~ K, data = DD)
  coef <- coef(lm1)
  data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
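If you then want the one-row-per-site data frame mentioned in the question, a minimal sketch binding that list together (assuming result from the lapply call above):
# combine the per-site coefficient rows; names(result) are the link5 values
coef_table <- do.call(rbind, result)
coef_table$link5 <- names(result)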
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
For a more general split / apply / combine approaches use plyr or data.table
data.table
library(data.table)
DT <- data.table(les)
result <- DT[, {lm1 <- lm(cpe ~ K, data = .SD)
                as.list(coef(lm1))}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD) {
  lm1 <- lm(cpe ~ K, data = DD)
  coef <- coef(lm1)
  data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, .(link5), function(DD) { lm(cpe ~ K, data = DD) })
