I have a model that uses its own previous prediction as a predictor, e.g. target ~ lag(target prediction).
Using purrr::accumulate I'm able to write a custom function to predict. Here are some silly data and a silly model that illustrate the setup:
### A model that uses a lag prediction as a predictor using purrr::accumulate() ###
my_diamonds <- diamonds %>%
group_by(cut) %>%
mutate(cumprice = cumsum(price)) %>% # cumulative within groups
mutate(lag_cumprice = lag(cumprice)) %>%
mutate(InitialValue = min(cumprice)) %>%
filter(!is.na(lag_cumprice)) %>%
select(cut, cumprice, lag_cumprice, x, InitialValue)
silly_model <- glm(formula = cumprice ~ x + lag_cumprice, family = 'poisson', data = my_diamonds)
This model uses the previous prediction as an input to the next prediction. I'm able to write a custom function to generate the predictions inside mutate():
# when predicting we won't have lag_cumprice; instead, the result of the previous prediction should be an input to the model:
accPrice <- function(mod, acc, cur) {
  db <- cur_data_all() # grouped data segment
  x <- db$x[cur] # cur is the current row in the data; use it to get this iteration's value of x
  total_exponent <- mod$coefficients['(Intercept)'] +
    (mod$coefficients['x'] * x) +
    (mod$coefficients['lag_cumprice'] * acc) # acc is the accumulated prediction for cumprice
  exp(total_exponent) # back-transform from the log link so the returned prediction is on the cumprice scale
}
# now predict
my_diamonds <- my_diamonds %>%
mutate(predicted = accumulate(.x = row_number()[-1], .init = InitialValue %>% unique, .f = accPrice, mod = silly_model))
So far so good. In this example I used the previous prediction acc as an input.
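A quick, illustrative way to eyeball the result (assuming the steps above have run) is to put the accumulated predictions next to the observed cumulative prices:
my_diamonds %>%
  select(cut, cumprice, predicted) %>%
  head()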
But, I created a variation model that now uses two lagged variables as predictors:
### now a model with lag on two variables not just one ###
my_diamonds2 <- diamonds %>%
group_by(cut) %>%
mutate(cumprice = cumsum(price)) %>% # cumulative within groups
mutate(lag_cumprice = lag(cumprice)) %>%
mutate(InitialValue = min(cumprice)) %>%
mutate(rn = row_number()) %>%
mutate(cumrn = cumsum(rn)) %>%
mutate(lag_cumrn = lag(cumrn)) %>%
filter(!is.na(lag_cumprice)) %>%
select(cut, cumprice, lag_cumprice, lag_cumrn, x, InitialValue)
silly_model2 <- glm(formula = cumprice ~ x + lag_cumprice + lag_cumrn, family = 'poisson', data = my_diamonds2)
### Stuck after here ###
How can I modify the function accPrice() above to accumulate 2 variables, both lag_cumprice and lag_cumrn as opposed to just lag_cumprice as before?
We could add an argument to the function, then extract the corresponding coefficient from the model and multiply by it:
accPrice2 <- function(mod, acc, acc2, cur) {
  db <- cur_data_all() # grouped data segment
  x <- db$x[cur] # cur is the current row in the data; use it to get this iteration's value of x
  total_exponent <- mod$coefficients['(Intercept)'] +
    (mod$coefficients['x'] * x) +
    (mod$coefficients['lag_cumprice'] * acc) +
    (mod$coefficients['lag_cumrn'] * acc2)
  exp(total_exponent) # back-transform from the log link, as before
}
my_diamonds2 %>%
  mutate(predicted = accumulate(.x = row_number()[-1],
                                .init = InitialValue %>% unique,
                                .f = accPrice2,
                                mod = silly_model2))
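Note that accumulate() only ever passes a single accumulated value to .f, so acc2 would never be updated between iterations by the call above. A minimal, untested sketch of one way around that is to carry both running values in a length-2 vector (accPriceBoth is a name I've made up, and the update rule for the second value is a placeholder to be replaced with whatever actually generates lag_cumrn):
accPriceBoth <- function(mod, acc, cur) {
  db <- cur_data_all()
  x <- db$x[cur]
  total_exponent <- mod$coefficients['(Intercept)'] +
    (mod$coefficients['x'] * x) +
    (mod$coefficients['lag_cumprice'] * acc[1]) + # acc[1] carries the accumulated cumprice prediction
    (mod$coefficients['lag_cumrn'] * acc[2])      # acc[2] carries the second accumulated value
  new_rn <- db$lag_cumrn[cur] # placeholder update for the second accumulator
  c(exp(total_exponent), new_rn)
}

my_diamonds2 %>%
  mutate(predicted = accumulate(.x = row_number()[-1],
                                .init = c(first(InitialValue), first(lag_cumrn)),
                                .f = accPriceBoth,
                                mod = silly_model2) %>%
           map_dbl(1)) # keep only the cumprice component of each pair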
I am running a Stochastic Frontier model (using the package frontier) by the group industry as follows:
data is a panel data frame with index year and individual id and columns as below:
Columns: y1, x1, x2 and x3 are all numeric variables; industry is a character variable.
library(dplyr)
library(frontier)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod <- sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = .))
I want mod to store the estimated SFA model for each industry group. I don't think SFA-specific knowledge is required here. Thanks.
This did the trick (note the = instead of <- inside do(), which stores each fitted model in the list column mod):
library(dplyr)
library(frontier)
library(plm)
sfa_out <- data %>%
group_by(industry) %>%
do(
mod = sfa(log(y1) ~ log(x1) + log(x2) + log(x3),
ineffDecrease = T,
truncNorm = F,
timeEffect = T,
data = pdata.frame(., index = c("individual id", "year"))))
To then display each model you could do this:
# to display all industry models
sfa_out$mod
# to display a specific industry model
sfa_out$mod[[1]]
sfa_out$mod[[2]]
# ...
sfa_out$mod[[n]]
# to get the estimated efficiency measures for the nth model in sfa_out$mod
efficiencies(sfa_out$mod[[n]])
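If you want the efficiencies of all industry models at once, here is a small purrr-based sketch (untested, reusing the sfa_out object above):
library(purrr)
# name each model by its industry, then extract the efficiency estimates from each one
eff_by_industry <- set_names(sfa_out$mod, sfa_out$industry) %>%
  map(efficiencies)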
library(tidyverse)
library(lme4)
library(broom.mixed)
Tibble = tibble(
Class1 = rep(c("TITUS","CAIUS"),27),
Class2 = rep(c("A","A","A",
"B","B","B",
"C","C","C"),6
),
Outcome = rnorm(54,5,2),
Predictor = Outcome + rnorm(54,0,2.5),
alpha = NA,
beta = NA)
lmer(data = Tibble,
Outcome ~ (0 + (Class1) + (0 + Predictor|(Class2)))) %>%
tidy(effects = c("fixed","ran_coefs")) -> model
for(i in 1:54) {
Tibble$alpha[i] <- model %>%
filter(effect == "fixed",
term == str_c("Class1",Tibble$Class1[i])) %>%
pull(estimate)
Tibble$beta[i] <- model %>%
filter(effect == "ran_coefs",
level == Tibble$Class2[i],
term == "Predictor") %>%
pull(estimate)
}
Tibble %>% mutate(
Predicted = (alpha + Predictor*beta),
epsilon = Outcome - Predicted) -> Tibble
Tibble %>% summarise(cor(Predicted, Outcome, method = "kendall"))
Key concepts: there is an alpha for each Class1 and a beta for each Class2.
Epsilon is the residual.
I want to make the code above faster without resorting to a for loop.
Also, I am very worried about the regression model, because I plan to fit it on a tibble with more than 1 million observations, 600k Class1 levels and 40k Class2 levels.
Note that this model formula and regression package are only one possible combination, and suggestions are welcome. I am most interested in estimating the alphas, not in minimizing the epsilons.
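One way to drop the for loop is a join-based approach: build one small lookup table of estimates per grouping variable and left_join them on. This is a hedged sketch reusing the tidy() output stored in model above; alphas and betas are names I've introduced:
# fixed-effect estimates: one row per Class1 level
alphas <- model %>%
  filter(effect == "fixed") %>%
  transmute(Class1 = str_remove(term, "^Class1"), alpha = estimate)

# random-coefficient estimates for Predictor: one row per Class2 level
betas <- model %>%
  filter(effect == "ran_coefs", term == "Predictor") %>%
  transmute(Class2 = level, beta = estimate)

Tibble %>%
  select(-alpha, -beta) %>%
  left_join(alphas, by = "Class1") %>%
  left_join(betas, by = "Class2") %>%
  mutate(Predicted = alpha + Predictor * beta,
         epsilon = Outcome - Predicted)
The joins scale with the number of distinct levels rather than with the number of rows, which should matter at a million-plus observations.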
I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variables.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: how do I need to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_dt %>%
mutate(lm = map(
.x = data,
.f = lm_function,
formula = sales ~ clicks + needs + guesses + interests
))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm function from the roll package. See the description here: Package ‘roll’ on CRAN.
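For example, here is a rough, untested sketch of how roll_lm() might slot into the nested workflow above (roll_function, roll_data and the window width of 10 are illustrative names and values):
library(roll)

roll_function <- function(data, width = 10) {
  x <- as.matrix(data[, c("clicks", "needs", "guesses", "interests")])
  y <- as.matrix(data[, "sales"])
  roll_lm(x, y, width = width) # rolling coefficients and r-squared, one row per window
}

roll_data <- nest_dt %>%
  mutate(roll = map(.x = data, .f = roll_function))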
I have the following data frame:
input.df <- dplyr::data_frame(x = rnorm(4),
y = rnorm(4),
`z 1` = rnorm(4))
I would like to do a multiple regression for each column with the other columns and extract the R-squared from each model. This means that I could run the following code:
summary(lm(x ~ ., data = input.df))
summary(lm(y ~ ., data = input.df))
summary(lm(`z 1` ~ ., data = input.df))
And note down the R-squared.
I'd like to automate this task and get a two-column data frame where the first column is the dependent variable and the second column is the R-squared.
This is what I've tried:
n <- ncol(input.df)
replicate(n, input.df, simplify = F) %>%
dplyr::bind_rows() %>%
dplyr::mutate(group = rep(names(.), each = nrow(.) / n)) %>%
dplyr::group_by(group) %>%
dplyr::do({
tgt.var <- .$group[1]
# How do I get the formula to interpret . as all variables?
lm(get(tgt.var) ~ ., data = .) %>%
broom::glance() %>%
dplyr::select(r.squared)
})
I've put a comment on the part where I am stuck. I get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
I think you've overcomplicated building your data frame a little. There is no need for replicate, as you are running all the regressions on the same dataset. You could just use map from purrr; the idea is to try something like:
library(purrr)
names(input.df) %>%
map(~ lm(get(.) ~ ., data = input.df))
This runs without errors but doesn't give the desired result. The reason is that get(.) gets added as a new variable in the dataset, so for example the first regression is x ~ x + y + `z 1`, which is not what we want. This can easily be fixed, though, by changing the formula in lm as follows:
names(input.df) %>%
map(~ lm(formula(paste0('`', ., '` ~ .')), data = input.df))
(note the need to include the escaping backticks because of the name of your third variable; otherwise they wouldn't have been necessary). This now gives the desired results. If you don't want to keep everything and only want to extract the R-squared values, you can just do:
names(input.df) %>%
map(~ lm(formula(paste0('`', ., '` ~ .')), data = input.df)) %>%
map(summary) %>%
map_dbl('r.squared')
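To get the two-column data frame the question asks for, the same pipeline can be wrapped in a tibble (tibble() comes from the tibble package, loaded explicitly here):
library(tibble)

r2_df <- tibble(
  dependent = names(input.df),
  r.squared = names(input.df) %>%
    map(~ lm(formula(paste0('`', ., '` ~ .')), data = input.df)) %>%
    map(summary) %>%
    map_dbl('r.squared')
)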
Not certain how to resolve your issue directly. Here's an alternative method to derive a data.frame with dependent variable and r.sq for separate models.
cond <- matrix(c(1,0,0,0,1,0,0,0,1), ncol=3)
colnames(cond)<- colnames(input.df)
cond
x y z 1
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
xy <- lapply(1:nrow(cond), function(v)
list(y = colnames(cond)[which(cond[v,]==1)] %>% paste0("`", ., "`"),
x = colnames(cond)[which(cond[v,]==0)] %>% paste0("`", ., "`") %>% paste(., collapse="+")))
lm.form <- lapply(1:length(xy), function(v) paste(xy[[v]]$y, xy[[v]]$x, sep="~") %>% as.formula)
lm.mod <- lapply(lm.form, function(v) lm(v, data = input.df))
data.frame(pred = lapply(xy, function(v) v["x"]) %>% unlist,
r.sq = lapply(lm.mod, function(v) summary(v)$r.sq)%>% unlist)
pred r.sq
1 `y`+`z 1` 0.5806704
2 `x`+`z 1` 0.8500431
3 `x`+`y` 0.8335421
My question is very similar to this one, but the problem I am facing has a twist that those answers do not address. Specifically, I am estimating a spatial model, y = rho * lw * y + X * beta. Because the observations are related through the weights matrix lw, I must apply the model to the entire X matrix at once; since those answers operate row-wise, they do not apply.
Here is MWE data, consisting of twenty points across three groups and a spatial weights matrix:
library(spdep)
#Coordinates
pointcoords <- data.frame(x = runif(n=20, min =10, max = 100), y = runif(n=20, min = 10, max = 100), ID = as.character(1:20))
pointsSP <- SpatialPoints(pointcoords[,1:2])
# Weights matrix
lw <- nb2listw(knn2nb(knearneigh(pointsSP, k = 4, RANN = FALSE),
row.names = pointcoords$ID))
# Data
MyData <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
DV = rnorm(60), IV = rnorm(60))
I can estimate the models by Group with dplyr
library(dplyr)
models <- MyData %>% group_by(Group) %>%
do(lm = lm(DV ~ IV, data = .),
sar = lagsarlm(DV ~ IV, data = ., listw = lw))
Predicting to new data with this answer operates on a row-wise basis, working fine for the lm objects,
MyData2 <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
IV = rnorm(60))
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(lmPred = predict(lm, newdata = list("IV" = IV))) %>% head()
#Joining by: "Group"
#Source: local data frame [6 x 6]
#Groups:
# ID Group IV lm sar lmPred
#1 1 1 -0.8930794 <S3:lm> <S3:sarlm> -0.21378814
#2 1 2 -1.6637963 <S3:lm> <S3:sarlm> 0.42547796
#3 1 3 0.5243841 <S3:lm> <S3:sarlm> -0.23372996
#4 2 1 -0.1956969 <S3:lm> <S3:sarlm> -0.20860280
#5 2 2 0.8149920 <S3:lm> <S3:sarlm> 0.14771431
#6 2 3 -0.3000439 <S3:lm> <S3:sarlm> 0.05082524
But not for the sar models:
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(sarPred = predict(sar, newdata = list("IV" = IV), listw=lw)) %>% head()
#Joining by: "Group"
#Error in if (nrow(newdata) != length(listw$neighbours)) stop("mismatch between newdata and spatial weights") :
argument is of length zero
I think there should be a better way of doing this, without joining the models to every row. Also, creating a list object for newdata won't work if you have several predictors or a changing set of predictors. It seems that the dplyr way should be something like this:
MyData2 %>% group_by(Group) %>%
mutate(sarPred = predict(models$sar[[Group]], newdata = ., listw=lw))
But the [[Group]] index isn't quite right.
I ended up doing this with do in dplyr, going through the models data frame row by row. I believe it does what you want, although the output doesn't contain the new data used for the predictions. I did add Group to the output, though, as it seemed necessary to keep the groups separated.
models %>%
do(data.frame(Group = .$Group,
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
EDIT
Playing around with adding the explanatory variable into the output data.frame. The following works, although there is likely a better way to do this.
models %>%
do(data.frame(Group = .$Group, IV = select(filter(MyData2, Group == .$Group), IV),
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
I'm putting this out there because it does do what I want it to, even if it needs to use a for loop (gasp)
predictobj <- list()
for(i in models$Group){
predictobj[[i]] <- predict.sarlm(models$sar[[i]],
newdata = filter(MyData2, Group == i),
listw = lw)
}
Anybody have a dplyr solution?
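For what it's worth, here is a purrr-flavoured sketch of the same loop (untested; it assumes models is ordered by Group so that the models line up with split()'s output):
library(purrr)

predictobj <- map2(models$sar,
                   split(MyData2, MyData2$Group),
                   ~ predict(.x, newdata = .y, listw = lw))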