Nest models with dplyr for calibration and validation in R

I have used R quite a bit, but I'm starting my journey in the tidyverse.
I'm trying to create a function that allows me to bias-correct daily precipitation series.
I want to split each time series in two (one part for calibration, one for validation). I need to fit the model on the calibration period and then apply it to the validation period, keeping the result together with the observed and simulated data.
So far I have managed to do this with two for loops, but I was wondering whether it could be done in a "tidier" way with nest; I can't figure it out.
Moreover, how could I use something like apply to run this over many precipitation time series stored in a data.frame?
My current code is below.
Thanks in advance!
# Libraries
library(lubridate)
library(qmap)
library(dplyr)
# Simulate data
obs_ <- runif(min = 0, max = 157, n = 14975)
sim_ <- obs_ + 20
date_ <- seq(as.Date("1979-01-01"), as.Date("2019-12-31"), by = "days")
db <- data.frame(obs = obs_, sim = sim_, date = date_, month = month(date_), year = year(date_))
# Sample years
ss <- seq(from = 1979, to = 2019, by = 1)
samp <- sample(ss, length(ss) / 2)
samp <- samp[order(samp)]
samp1 <- subset(ss, !(ss %in% samp))
# Model
list_mod <- list()
for (i in 1:12) {
  # retrieve the data for the calibration period
  model_fit <- db %>%
    mutate(id = case_when(year %in% samp  ~ "cal",
                          year %in% samp1 ~ "val")) %>%
    filter(month == i, id == "cal")
  # fit the model for each month and store it in a list
  list_mod[[i]] <- fitQmap(model_fit$obs, model_fit$sim)
}
# Retrieve the data for the validation period
model1 <- db %>%
  mutate(id = case_when(year %in% samp  ~ "cal",
                        year %in% samp1 ~ "val")) %>%
  filter(id == "val")
# Estimate the new data and store it with the observations and simulations
for (i in 1:12) {
  temp__ <- model1[model1$month == i, "sim"]
  model1[model1$month == i, "model"] <- doQmap(temp__, list_mod[[i]])
}

If you're not wedded to tidy, here is a solution using data.table.
Using your db:
library(data.table)
library(qmap)
##
setDT(db)[, Set:='cal']
db[sample(.N, .N/2), Set:='val']
db[, pred:=doQmap(sim, fitQmap(obs[Set=='cal'], sim[Set=='cal'])), by=.(month)]
result <- db[Set=='val']
The first line converts your db to a data.table and creates a column, Set, to mark calibration/validation. The second line assigns a random half of the rows to the validation set.
The third line does all the work: it groups the rows by month (by = .(month)), fits fitQmap(...) on the calibration set, and then generates debiased predictions with doQmap(...) on the full dataset.
The final line just filters out the calibration rows.
I notice in this example that Qmap reduces but does not eliminate bias. Is that what you expect?
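For completeness, since the question asked specifically about nest: here is a minimal tidyr/purrr sketch of the same idea. It assumes the db and samp objects from the question and the qmap package, and it keeps the question's split by year rather than by row; treat it as an illustration, not tested code.
library(dplyr)
library(tidyr)
library(purrr)
library(qmap)
result <- db %>%
  mutate(id = if_else(year %in% samp, "cal", "val")) %>%
  group_by(month) %>%
  nest() %>%
  mutate(
    # fit one quantile-mapping model per month on the calibration years only
    fit = map(data, ~ fitQmap(.x$obs[.x$id == "cal"], .x$sim[.x$id == "cal"])),
    # apply that monthly fit to every row of the month (calibration and validation)
    pred = map2(data, fit, ~ doQmap(.x$sim, .y))
  ) %>%
  select(-fit) %>%
  unnest(c(data, pred)) %>%
  filter(id == "val")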

Related

I tried predicting on 5 rows of the dataset, but why does it keep predicting on the whole dataset?

So I built an lm model in R on 65,000 rows (mydata) and I want to see the predictions for only the first 5 rows in order to see how well my model predicts. Below you can see the code I wrote to do this, but it keeps predicting the values of all 65,000 rows. Is someone able to help me?
lm_model2002 <- lm(`AC: Volume` ~ `Market Area (L1)`,data=mydata)
summary(lm_model2002)
df = head(data.frame(`Market Area (L1)`=mydata$`Market Area (L1)`),5)
predict(lm_model2002,newdata=df)
But now the real problem: I took the first row of mydata and copied it 5 times, then I made a vector that ranges from 1 to 2 and replaced one of the variables (price_per_unit) with that vector. As a result, I want to predict the exact same rows, only with a different price, so that I can plot the effect of a higher price:
lm_model3204<- lm(`AC: Volume` ~ log(price_per_unit)*(Cluster_country_hierarchical+`Loyalty-cumulative-volume-10`+`Loyalty-cumulative-orders-10`+`Loyalty-number-of-order-10`+price_discount+Incoterms)+Cluster_spg*(price_discount+Cluster_country_hierarchical)+price_discount*(Month+`GDP per capita`+`Loyalty-cumulative-orders-10`+`Loyalty-cumulative-volume-10`)+`Payer CustGrp`+`CRU Index`,data = mydata)
summary(lm_model3204)
test_data <- mydata[1:1,]
df <- data.frame(test_data,ntimes=c(5))
df <- as.data.frame(lapply(df, rep, df$ntimes))
priceperunit<-seq(1,2,by=0.25)
df$price_per_unit<-priceperunit
pred <- predict(lm_model3204,newdata=df)
Please use a minimal reproducible example next time you post a question.
You just have to predict on the first five rows. Here is an example with the built-in iris dataset:
data("iris")
lm_model2002 <- lm(Sepal.Length ~ Sepal.Width,data=iris)
summary(lm_model2002)
predict(lm_model2002,newdata=iris[1:5,])
output:
> predict(lm_model2002,newdata=iris[1:5,])
1 2 3 4 5
5.744459 5.856139 5.811467 5.833803 5.722123
Or:
df <- head(iris,5)
predict(lm_model2002,newdata=df)
EDIT
After your last comment: to see the change in prediction when one of the independent variables is varied:
data(iris)
df <- iris[rep(1, 5), ]
Petal_Length <- seq(1, 2, by = 0.25)
df$Petal.Length <- Petal_Length
lm_model3204 <- lm(Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)
pred <- predict(lm_model3204, newdata = df)
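Since the goal was to plot how the prediction evolves as the varied variable increases (the price in your data), a minimal sketch using the df and pred objects from the block above:
# plot the predicted response against the values that were varied in newdata
plot(df$Petal.Length, pred, type = "b",
     xlab = "Petal.Length (stand-in for price_per_unit)",
     ylab = "Predicted Sepal.Length")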

How to capture the most important variables in Bootstrapped models in R?

I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt at solving this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, my issue is that I still see variability in the top 10 most-used predictors, so I cannot manually rename the variables, because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models on every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # CV and splitting
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg ~ ., train)[, -1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg ~ ., test)[, -1]
  ytest <- df[-ind, 1]
  # Create a model per loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you rerun the code chunk above, you will notice that the top 3 variables change, so if I had to rename these variables it would be difficult, because they are not constant and change with every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure your sample() calls return the same draws each time. For example
set.seed(123)
When I add this to the above code and run it twice, the following is returned both times:
wt carb hp
98 89 86
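To see why this works, here is a minimal, self-contained demonstration: with the same seed, sample() reproduces the same bootstrap indices, so every model in the loop is refit on the same data in every run (the 32 is just nrow(mtcars), mirroring the question's loop).
set.seed(123)
a <- sample(32, 32, replace = TRUE)
set.seed(123)
b <- sample(32, 32, replace = TRUE)
identical(a, b) # TRUE: both runs draw exactly the same indices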

How to write a function in R that will implement the "best subsets" approach to model selection?

So I need to write a function that takes a data frame as input. The columns are my explanatory variables, except for the last (right-most) column, which is the response variable. I'm trying to fit a linear model and track each model's adjusted R-squared as the criterion used to pick the best model.
The model will use all the columns as explanatory variables (except for the right-most column, which will be the response variable).
The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), the subset of explanatory variables along with the response variable, the model formula, the outcome of fitting the linear model, and others as needed.
The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted R-squared, and a graph (I can figure the graph out on my own). I have an image of a table here to help with visualizing what the result should look like.
I figured out that this code will get me the explanatory and response variables:
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
I know that I can get a model with something like this (ignore variable names for now):
model <- final_data %>%
  group_by(debt) %>%
  lm(debt ~ distance, data = .) %>%
  glance()
I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.
What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.
For reference here is a very early (not working at all) attempt I made:
best_subsets <- function(data) {
  cols <- colnames(data)
  # Get the response variable.
  y <- tail(cols, 1)
  # Get a list of the explanatory variables.
  xs <- head(cols, length(cols) - 1)
  # Create the formula as a string and then later in the lm function
  # have it turned into a real formula.
  form <- paste(y, "~", xs, sep = " ")
  data %>%
    lm(as.formula(form), data = .) %>%
    glance()
}
I don't fully understand your description, but I think I understand your goal. Maybe this can help in some way:
library(tidyverse)
library(broom)
library(data.table)
lm_func <- function(df){
  fit1 <- lm(df[, 1] ~ df[, 2], data = df)
  fit2 <- lm(df[, 1] ~ df[, 3], data = df)
  fit3 <- lm(df[, 1] ~ df[, 2] + df[, 3], data = df)
  results <- list(fit1, fit2, fit3)
  names(results) <- paste0("explanatory_variables_", 1:3)
  r_sq <- lapply(results, function(x){
    glance(x)
  })
  r_sq_df <- rbindlist(r_sq, idcol = "df_name")
  r_sq_df
}
lm_func(iris)
This gives you a data frame of all the important outputs, from which you can select adj.r.squared. It would also be possible to automate. As a side note, selecting a model based on R-squared seems risky (danger of overfitting?): a higher R-squared does not necessarily mean a better model, so consider looking into AIC as well.
Let me know if this helps at all or if I can refine the answer a little more towards your goal.
UPDATE:
lm_func <- function(df) {
  lst <- list()
  for (i in 2:ncol(df)) {
    ind <- i
    form_df <- df[, 1:ind]
    form <- DF2formula(form_df)
    fit <- lm(form, data = df)
    lst[[i - 1]] <- glance(fit)
  }
  names(lst) <- paste0("explanatory_variables_", 1:length(lst))
  lst <- rbindlist(lst, idcol = "df_name")
  lst
}
lm_func(iris)
This assumes your first column is y and you want a model for every additional column.
OK one more UPDATE:
I think this does everything possible but is probably overkill:
library(combinat)
library(data.table)
library(tidyverse)
library(broom)
#First function takes a dataframe containing only the dependent and independent variables. Specify them by variable name or column position.
#The function then returns a list of dataframes of every possible order of independent variables (y ~ x1 + x2...) (y ~ x2 + x1...).
#So you can run your model on every possible sequence of explanatory variables
formula_func <- function(df,
                         dependent = df["Sepal.Length"],
                         independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
  independents_df_list <- permn(independents) # length of output should be the factorial of the number of independent variables
  df_list <- lapply(independents_df_list, function(x){ # this just pastes your dependent variable as the first column of each df
    cbind(dependent, x)
  })
  df_list
}
permd_df_list <- formula_func(iris) # voila
# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms, building in one independent variable at a time in every possible order
# If that is as confusing to you as it is to me then check final output. You will see what model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
  mega_lst <- lapply(form_df_list, function(x) {
    lst <- vector(mode = "list", length = length(2:ncol(x)))
    for (i in 2:ncol(x)) {
      ind <- i
      form_df <- x[, 1:ind]
      form <- DF2formula(form_df)
      fit <- lm(form, data = x)
      lst[[i - 1]] <- glance(fit)
      names(lst)[[i - 1]] <- deparse(form)
    }
    lst <- rbindlist(lst, idcol = "Model_formula")
    return(lst)
  })
  return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
#Remove duplicates and return single df
everything_list_distinct <- everything_list %>%
  rbindlist() %>%
  distinct()
## You can now subset and select whichever column you want from the final output
I posted this as a coding exercise, so let me know if anyone spots any errors. Just one caveat: this code does NOT represent a statistically sound approach, just a coding experiment, so be sure to understand the stats first!
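For reference, since the question literally asks for a best-subsets search in a tibble/map style, here is a hedged sketch of an all-subsets version (my own illustration, not part of the answer above; it assumes, as in the question, that the response is the right-most column, and the same overfitting caveat applies):
library(dplyr)
library(purrr)
library(broom)
best_subsets <- function(data) {
  cols <- colnames(data)
  y <- tail(cols, 1)                  # response: the right-most column
  xs <- head(cols, length(cols) - 1)  # explanatory variables
  # every non-empty subset of the explanatory variables
  subsets <- unlist(
    lapply(seq_along(xs), function(k) combn(xs, k, simplify = FALSE)),
    recursive = FALSE
  )
  tibble(
    model_number = seq_along(subsets),
    predictors   = map_chr(subsets, paste, collapse = " + "),
    formula      = paste(y, "~", predictors)
  ) %>%
    mutate(
      fit      = map(formula, ~ lm(as.formula(.x), data = data)),
      adj_r_sq = map_dbl(fit, ~ glance(.x)$adj.r.squared)
    ) %>%
    arrange(desc(adj_r_sq))
}
best_subsets(mtcars[, c("cyl", "wt", "hp", "mpg")]) # 7 candidate models, mpg as the response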

Caret - creating stratified data sets based on several variables

In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
In the code above the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work but I don't know if it is correct:
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)
This is fairly simple if you use the tidyverse.
For example:
df <- df %>%
  mutate(n = row_number()) %>% # create a row number if you don't have one
  select(n, everything())      # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>% # any number of variables you wish to partition by proportionally
  sample_frac(.7)          # '.7' is the proportion of the original df you wish to sample
test <- anti_join(df, train) # creates a test dataframe with the observations not in 'train'
There is a better way to do this.
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5),
                yourFactor3 = rbinom(n, 1, .7))
# stratum indicator
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
# sample selection
indices <- tapply(1:nrow(d), d$group, sample, 30)
# obtain subsample
subsampd <- d[unlist(indices, use.names = FALSE), ]
What this does is take a stratified random sample of size 30 from every combination of yourFactor and yourFactor2.
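If you would rather stay inside caret, a possible middle ground (a sketch, untested against your data; yourFactor and yourFactor2 are the hypothetical column names from the question) is to build the stratum indicator with interaction() as above and pass that single factor to createDataPartition():
library(caret)
strata <- interaction(df$yourFactor, df$yourFactor2, drop = TRUE)
inTrain <- createDataPartition(strata, p = 2/3, list = FALSE)
dfTrain <- df[inTrain, ]
dfTest <- df[-inTrain, ]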

Get predicted values for next period

Please consider the following data:
y<- c(2,2,6,3,2,23,5,6,4,23,3,4,3,87,5,7,4,23,3,4,3,87,5,7)
x1<- c(3,4,6,3,3,23,5,6,4,23,6,5,5,1,5,7,2,23,6,5,5,1,5,7)
x2<- c(7,3,6,3,2,2,5,2,2,2,2,2,6,5,4,3,2,3,2,2,6,5,4,3)
type <- c("a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c")
generation<- c(1,1,1,1,2,2,3,3,1,2,2,2,3,3,4,4,1,2,2,2,3,3,4,4)
year<- c(2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011)
data <- data.frame(y, x1, x2, type, generation, year)
I would now like to run an analysis that takes into account only a single year at a time and predicts on the following one. So, in essence, this would run several separate analyses, each using only the data for one point in time and then predicting on the next (only the directly following) period.
I tried to set up an example for the three types:
data2004 <- subset(data, year==2004)
data2005 <- subset(data, year==2005)
m1 <- lm(y~x1+x2, data=data2004)
preds <- predict(m1, data2005)
How can I do this automatically? My preferred output would be a predicted value for each type, indicating what the value would have been for each of the values that exist in the following period (the original data has 200 periods).
Thanks in advance, help very much appreciated!
The following may be more like what you want.
dat <- data        # work on a copy of the data frame built above
type <- dat$type   # keep the type labels for naming the predictions later
uq.year <- sort(unique(dat$year)) ## sorted so that the (i+1)th element is the year after the ith
year <- dat$year   ## keep year as a separate vector; subset() below finds it in the environment
dat <- dat[, c("y", "x1", "x2")] ## keep only the response and the predictors so that y ~ . works
predlist <- vector("list", length(uq.year) - 1) ## one prediction fewer than the number of unique years
for (i in 1:(length(uq.year) - 1))
{
  mod <- lm(y ~ ., data = subset(dat, year == uq.year[i]))
  predlist[[i]] <- predict(mod, subset(dat, subset = year == uq.year[i + 1], select = -y))
  names(predlist[[i]]) <- type[year == uq.year[i + 1]] ## label each prediction with its type
}
The reason we want dat to contain only modelling variables (rather than year, for example) is that we can then use the y ~ . notation and avoid having to spell out all of the predictors in the lm call.
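As a small follow-up sketch (it assumes the predlist, uq.year and type objects created above), the list of yearly predictions can be flattened into a single data frame, which may be easier to handle with 200 periods:
preds_df <- do.call(
  rbind,
  lapply(seq_along(predlist), function(i) {
    data.frame(year = uq.year[i + 1],        # the period being predicted
               type = names(predlist[[i]]),  # the labels attached in the loop
               pred = unname(predlist[[i]]))
  })
)
preds_df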
