I am trying to fit a piecewise regression to this dataset. I know there is no linear relation between the dependent and independent variable, but my real-world application requires me to model the data as an lm segmented regression.
Here is my code, with a description of the steps.
bond_data <- data.frame(
yield_change = c(-1.2,-0.9,-1.8,-1.4,-1.8,-2.1,-2.3,-2.1,-2.5,-2.2,-2.4,-2.5,-2.4,-2.4,
-3.0,-2.6,-5.1,-4.8,-4.9,-5.0,-5.0,-6.2,-6.1,-6.3,-5.0,-5.0),
maturity =c(10.2795,10.8603,11.7753,12.3562,12.5205,13.3589,13.8630,14.2822,14.3589,15.3589,
15.8630,16.778,17.3616,17.8658,18.3616,21.8685,22.5288,23.8685,24.3644,25.3671,
26.8712,27.8712,28.8712,29.8740,44.3781,49.3836))
The bond_data data frame contains the two vectors stated above.
#Defining the lm model & segmented model
model <- lm(yield_change ~ maturity, data = bond_data)
segmented.model <- segmented(model, seg.Z = ~maturity,
                             psi = list(maturity = c(15, 20, 30)),
                             fixed.psi = c(15, 20, 30),
                             control = seg.control(it.max = 0, n.boot = 50))
xp <- c(min(bond_data$maturity), segmented.model$psi[,"Est."], max(bond_data$maturity))
new_data <- data.frame(xp)
colnames(new_data) <- "maturity"
o <- segmented.model
# U terms: slope-change basis functions, pmax(maturity - breakpoint, 0)
new_data$dummy1 <- pmax(new_data$maturity - o$psi[1, 2], 0)
new_data$dummy2 <- pmax(new_data$maturity - o$psi[2, 2], 0)
new_data$dummy3 <- pmax(new_data$maturity - o$psi[3, 2], 0)
# psi terms: indicator of being past each breakpoint times its slope-change coefficient
new_data$dummy4 <- I(new_data$maturity > o$psi[1, 2]) * coef(o)[3]
new_data$dummy5 <- I(new_data$maturity > o$psi[2, 2]) * coef(o)[4]
new_data$dummy6 <- I(new_data$maturity > o$psi[3, 2]) * coef(o)[5]
# Rename the dummy columns to match the names segmented used in its model frame
names(new_data)[-1] <- names(model.frame(o))[-c(1, 2)]
yp <- predict(segmented.model,new_data)
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",ylim = c(-8,0))
lines(xp,yp)
I get the following image:
[Plot: actual values as blue points with the prediction line]
I am trying to get the first segment to start at the point (maturity = 10, yield_change = 0).
One thing to note: all my breakpoints have fixed x positions and no estimates are made, so when I run segmented.model$psi my initial values are the same as my estimates (15, 20 and 30) and all the st.err values are zero.
How would I go about making my prediction line start at the point (maturity = 10, yield_change = 0)? I appreciate any help!
I have tried doing the following:
model <- lm(I(yield_change-0)~I(maturity-10), data = bond_data)
segmented.model <- segmented(model,seg.Z=~maturity,psi = list(maturity = c(15,20,30)),fixed.psi = c(15,20,30), control = seg.control(it.max = 0, n.boot = 50))
#But running the previous line gives the error: object 'maturity' not recognised.
#By running:
segmented.model <- segmented(model,seg.Z=~I(maturity-10),psi = list(I(maturity-10) = c(15,20,30)),fixed.psi = c(15,20,30), control = seg.control(it.max = 0, n.boot = 50))
I get this error:
Error: unexpected '=' in "segmented.model <- segmented(model,seg.Z=~I(maturity-10),psi = list(I(maturity-10) ="
I do not think I am using the correct method to solve my problem...
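For what it's worth, a workaround consistent with both errors might be to create the shifted variable as an actual column before fitting, since seg.Z has to name a variable that exists in the data, and to shift the breakpoints by the same offset. This is an untested sketch (in particular, it is worth verifying that segmented() accepts a no-intercept fit):
#Sketch: shift maturity so the desired origin (10, 0) maps to (0, 0), then fit through the origin
bond_data$maturity10 <- bond_data$maturity - 10 # 'maturity10' is an illustrative name
model0 <- lm(yield_change ~ 0 + maturity10, data = bond_data) # no intercept
segmented.model0 <- segmented(model0, seg.Z = ~maturity10,
                              psi = list(maturity10 = c(5, 10, 20)), # 15, 20, 30 shifted by -10
                              fixed.psi = c(5, 10, 20),
                              control = seg.control(it.max = 0, n.boot = 50))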
Say I have some vectors in Stan
vector[100] mu[20]
This means mu is 20 vectors, each of length 100.
Do I read this in as mu = matrix(NA, nrow = 100, ncol = 20) or as mu = array(NA, dim = c(100, 20)) from R?
(I hope I'm not getting the dimensions mixed up.)
Or is that incorrect, and do you do this differently?
You have two options: provide the matrix/array, or a list of vectors. In your example you have to swap the order of rows and columns: vector[100] mu[20] corresponds to 20 rows and 100 columns.
The rstan error messages are clear, so you will know if something goes wrong with the dimensions.
library(rstan)
# dummy parameters
scode <- "
data {
vector[100] mu[20];
}
parameters {
real y[2];
}
"
# Both fits run without errors
fit1 <- stan(
model_code = scode,
data = list(mu = matrix(rnorm(20 * 100), nrow = 20, ncol = 100))
)
fit2 <- stan(
model_code = scode,
data = list(mu = lapply(1:20, function(x) rnorm(100)))
)
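If your mu values already live in a 100 x 20 matrix (lengths in rows), transposing produces the 20 x 100 layout this model declares; a small sketch along the same lines as above:
mu_raw <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20) # 100 rows, 20 columns
fit3 <- stan(model_code = scode, data = list(mu = t(mu_raw))) # t() gives 20 x 100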
I am trying to use kfold CV as a means of evaluating a model run using brms, and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
expand_grid(fish_id = seq_len(n_fish_per_length),
length = length) %>%
mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
mutate(is_carp = rbinom(n = n(), size = 1, prob= prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method = "glm", formula = y ~ x + I(x^2),
method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
data=sim_data,
family = bernoulli(link = "logit"),
warmup = 2500,
iter = 5000,
chains = 4,
inits= "0",
cores=4,
seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0), and my reading suggests that I need to use stratified kfold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect, but I'm not sure of the best way to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify it to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error, likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing/misinterpreting something here? I'm assuming there is a very simple solution that I am overlooking, but I appreciate any thoughts.
After some additional digging, and learning how to access the information about each fold in the analysis, I was able to determine that the structure of the data (the proportion of 0s and 1s in the response) is maintained using the default settings of the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits is a list of the model fitting results and the observations used in the test data set (omitted) for each fold.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for(i in 1:10){
print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
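If you still want to enforce exact stratification rather than rely on the default split, the kfold() documentation in brms indicates that folds can also be a precomputed vector of fold indices, which would let you plug in kfold_split_stratified() directly; an untested sketch:
# Sketch: pass stratified fold assignments straight to kfold()
folds_strat <- loo::kfold_split_stratified(K = 10, x = sim_data$is_carp)
kfold_strat <- kfold(Bayes_Model_Binary, folds = folds_strat, save_fits = TRUE)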
I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something more general about R, but could anyone tell me what's wrong with my code? (I saw a related post here, How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem.)
My code:
N <- 100
id <- 1:N; id <- c(id, id)
gid <- 1:(N/2); gid <- c(gid, gid, gid, gid)
T <- rep(0, N); T <- c(T, T + 1)
a <- qnorm(runif(N), mean = 0, sd = 0.005)
gp <- qnorm(runif(N/2), mean = 0, sd = 0.0005)
u <- qnorm(runif(N * 2), mean = 0, sd = 0.05)
a <- c(a, a); gp <- c(gp, gp, gp, gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit, cluster = Data$id, type = "CR2") # doesn't work, though it runs if I don't specify cluster, as in the next line
vcovCR(fe.fit, type = "CR2")
vcovCR(fe.fit, cluster = Data$gid, type = "CR2") # this is what I ultimately want to run
Make your data a pdata.frame first. This is safer, especially if you want the time index to be created automatically (which seems to be the case, looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster = Data$id, type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2, cluster = Data$gid, type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (as of version 0.3.3) for plm objects: it assumes both index variables are present in the original data, which is not the case in your example because the time index is created automatically when only the individual dimension is specified via the index argument.
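Until that is fixed, one workaround consistent with this explanation (a sketch, not a tested fix) is to add an explicit time index column so that both index variables really are present in the original data; 'year' is a made-up name for the two periods:
Data$year <- rep(1:2, each = N) # period indicator matching how Data was stacked
fe.fit3 <- plm(Y ~ T, data = Data, model = "within",
               index = c("id", "year"), effect = "individual")
vcovCR(fe.fit3, cluster = Data$gid, type = "CR2")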
I have a bnlearn model in R that is learned using the gs function with 4 categorical variables and 8 numerical variables.
When I try to validate my model with a test set, I get this error when trying to predict some of the nodes:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'Keyword' has different number of levels in the node and in the data.
Is it not possible to use both numerical and categorical variables with bnlearn? And if it is possible, what am I doing wrong?
mydata$A <- as.factor(mydata$A)
mydata$B <- as.numeric(mydata$B)
mydata$C <- as.numeric(mydata$C)
mydata$D <- as.numeric(mydata$D)
mydata$E <- as.factor(mydata$E)
mydata$F <- as.numeric(mydata$F)
mydata$G <- as.numeric(mydata$G)
mydata$H <- as.numeric(mydata$H)
mydata$I <- as.numeric(mydata$I)
mydata$J <- as.numeric(mydata$J)
mydata$K <- as.numeric(mydata$K)
mydata$L <- as.numeric(mydata$L)
mydata$M <- as.numeric(mydata$M)
mydata$N <- as.numeric(mydata$N)
mydata$O <- as.numeric(mydata$O)
mydata$P <- as.numeric(mydata$P)
mydata$Q <- as.numeric(mydata$Q)
#create vector of black arcs
temp1=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
for (j in 1:length(varnames)){
temp1 <- c(temp1,varnames[i])
}
}
temp2=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
temp2 <- c(temp2,varnames)
}
#create the white (allowed) arcs of the model
arcdata = read.csv("C:/users/asaf/desktop/in progress/whitearcs.csv", header = T)
wfrom=arcdata[,1]
wto=arcdata[,2]
whitelist = data.frame(from = wfrom,to =wto)
#block unwanted arcs
blacklist = data.frame(from = temp1, to = temp2)
#fit and plot the model
#structure learning with the grow-shrink (gs) algorithm
model = gs(mydata, whitelist = whitelist, blacklist = blacklist)
#inference procedure
learntmodel = bn.fit(model,mydata,method = "mle",debug = F)
graphviz.plot(learntmodel)
myvalidation=read.csv("C:/users/asaf/desktop/in progress/val.csv", header = T)
#predict A
pred = predict(learntmodel, node="A", myvalidation)
myvalidation$A <- pred
#predict B
pred = predict(learntmodel, node="B", myvalidation)
myvalidation$B <- pred
At this point it throws the following error:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'A' has different number of levels in the node and in the data.
bnlearn can't work with mixed variables (qualitative and quantitative) at the same time; I have read that it is possible in the deal package.
Another possibility is to use discretize() to transform your continuous variables into discrete ones:
dmydata <- discretize(mydata, breaks = 2, method = "interval")
model <- gs(dmydata, whitelist = whitelist, blacklist = blacklist)
... and continue your code.
Actually I had the same problem today. I resolved it by ensuring that the other nodes connected to the one in question (i.e. A) also had the same number of levels.
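One way to enforce that (a sketch based on the column names above; A and E are the factor variables in this question) is to align the factor levels of the validation set with those of the training data before calling predict():
# Sketch: make the validation factors use exactly the training levels
myvalidation$A <- factor(myvalidation$A, levels = levels(mydata$A))
myvalidation$E <- factor(myvalidation$E, levels = levels(mydata$E))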
I'd appreciate any help getting segmented.lm (or any other function) to find the obvious breakpoints in this example:
data = list(x=c(50,60,70,80,90) , y= c(703.786,705.857,708.153,711.056,709.257))
plot(data, type='b')
require(segmented)
model.lm = segmented(lm(y~x,data = data),seg.Z = ~x, psi = NA)
It returns with the following error:
Error in solve.default(crossprod(x1), crossprod(x1, y1)) :
system is computationally singular: reciprocal condition number = 1.51417e-20
If I change K:
model.lm = segmented(lm(y~x,data = data),seg.Z = ~x, psi = NA, control = seg.control(K=1))
I get another error:
Error in segmented.lm(lm(y ~ x, data = data), seg.Z = ~x, psi = NA, control = seg.control(K = 1)) :
only 1 datum in an interval: breakpoint(s) at the boundary or too close each other
A nice objective method to determine the break point is described in Crawley (2007: 427).
First, define a vector breaks for a range of potential break points:
breaks <- data$x[data$x >= 70 & data$x <= 90]
Then run a for loop fitting a piecewise regression at each potential break point, and pull the residual standard error (here called mse) out of each model's summary:
mse <- numeric(length(breaks))
for (i in 1:length(breaks)) {
  piecewise <- lm(data$y ~ data$x * (data$x < breaks[i]) + data$x * (data$x >= breaks[i]))
  mse[i] <- summary(piecewise)$sigma # residual standard error of this fit
}
Finally, identify the break point with the smallest residual standard error:
breaks[which.min(mse)]
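If useful, the winning model can then be refit and inspected; a small follow-up sketch using the objects defined above:
best <- breaks[which.min(mse)]
final <- lm(data$y ~ data$x * (data$x < best) + data$x * (data$x >= best))
summary(final)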
Hope this helps.