I am trying to fit a piecewise regression to this dataset. I know the relationship between the dependent and independent variables is not linear, but my real-world application requires me to model the data with a segmented lm regression.
Here is my code, with a description of the steps:
bond_data <- data.frame(
yield_change = c(-1.2,-0.9,-1.8,-1.4,-1.8,-2.1,-2.3,-2.1,-2.5,-2.2,-2.4,-2.5,-2.4,-2.4,
-3.0,-2.6,-5.1,-4.8,-4.9,-5.0,-5.0,-6.2,-6.1,-6.3,-5.0,-5.0),
maturity =c(10.2795,10.8603,11.7753,12.3562,12.5205,13.3589,13.8630,14.2822,14.3589,15.3589,
15.8630,16.778,17.3616,17.8658,18.3616,21.8685,22.5288,23.8685,24.3644,25.3671,
26.8712,27.8712,28.8712,29.8740,44.3781,49.3836))
The bond_data data frame contains the two vectors above.
# Defining the lm model and the segmented model
library(segmented)
model <- lm(yield_change~maturity, data = bond_data)
segmented.model <- segmented(model,seg.Z=~maturity,psi = list(maturity = c(15,20,30)),fixed.psi = c(15,20,30),control = seg.control(it.max = 0, n.boot = 50))
xp <- c(min(bond_data$maturity), segmented.model$psi[,"Est."], max(bond_data$maturity))
new_data <- data.frame(xp)
colnames(new_data) <- "maturity"
o <- segmented.model
new_data$dummy1 <- pmax(new_data$maturity - o$psi[1,2], 0)
new_data$dummy2 <- pmax(new_data$maturity - o$psi[2,2], 0)
new_data$dummy3 <- pmax(new_data$maturity - o$psi[3,2], 0)
new_data$dummy4 <-I(new_data$maturity > o$psi[1,2]) * coef(o)[3]
new_data$dummy5 <-I(new_data$maturity > o$psi[2,2]) * coef(o)[4]
new_data$dummy6 <-I(new_data$maturity > o$psi[3,2]) * coef(o)[5]
names(new_data)[-1] <- names(model.frame(o))[-c(1,2)]
yp <- predict(segmented.model,new_data)
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",ylim = c(-8,0))
lines(xp,yp)
I get the following plot:
[Figure: actual values as blue points, with the predicted line]
I am trying to make the first segment start at the point (maturity = 10, yield_change = 0).
One thing to note is that all my breakpoints have fixed x positions and none are estimated, so when I run segmented.model$psi my initial values are the same as my estimates (15, 20 and 30) and all my st.err values are zero.
How would I go about making my prediction line start at the point (maturity = 10, yield_change = 0)? I appreciate any help!
I have tried doing the following:
model <- lm(I(yield_change-0)~I(maturity-10), data = bond_data)
segmented.model <- segmented(model,seg.Z=~maturity,psi = list(maturity = c(15,20,30)),fixed.psi = c(15,20,30), control = seg.control(it.max = 0, n.boot = 50))
#But by running the previous line I get the error (object maturity not recognised).
#By running:
segmented.model <- segmented(model,seg.Z=~I(maturity-10),psi = list(I(maturity-10) = c(15,20,30)),fixed.psi = c(15,20,30), control = seg.control(it.max = 0, n.boot = 50))
I get this error:
Error: unexpected '=' in "segmented.model <- segmented(model,seg.Z=~I(maturity-10),psi = list(I(maturity-10) ="
I do not think I am using the correct method to solve my problem...
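One possible approach, sketched under the assumption that the breakpoints really do stay fixed at 15, 20 and 30: with fixed breakpoints, a segmented fit is an ordinary lm on hinge terms of the form pmax(maturity - knot, 0). Shifting maturity by 10 and dropping the intercept then forces the fitted line through (10, 0). This uses plain lm rather than the segmented package, and the names x0, knots and fit are made up for the illustration.

```r
bond_data <- data.frame(
  yield_change = c(-1.2,-0.9,-1.8,-1.4,-1.8,-2.1,-2.3,-2.1,-2.5,-2.2,-2.4,-2.5,-2.4,-2.4,
                   -3.0,-2.6,-5.1,-4.8,-4.9,-5.0,-5.0,-6.2,-6.1,-6.3,-5.0,-5.0),
  maturity = c(10.2795,10.8603,11.7753,12.3562,12.5205,13.3589,13.8630,14.2822,14.3589,15.3589,
               15.8630,16.778,17.3616,17.8658,18.3616,21.8685,22.5288,23.8685,24.3644,25.3671,
               26.8712,27.8712,28.8712,29.8740,44.3781,49.3836))

x0    <- 10             # x position the line must pass through (with y = 0)
knots <- c(15, 20, 30)  # fixed breakpoints, not estimated

# "0 +" removes the intercept, so the fitted value at maturity = x0 is exactly 0;
# each pmax() term lets the slope change at one breakpoint
fit <- lm(yield_change ~ 0 + I(maturity - x0) +
            pmax(maturity - knots[1], 0) +
            pmax(maturity - knots[2], 0) +
            pmax(maturity - knots[3], 0),
          data = bond_data)

# predict at the start point, the breakpoints and the largest maturity
xp <- c(x0, knots, max(bond_data$maturity))
yp <- predict(fit, newdata = data.frame(maturity = xp))

# plot(bond_data$maturity, bond_data$yield_change, pch = 16, col = "blue", ylim = c(-8, 0))
# lines(xp, yp)
```

The trade-off is that the constraint is built into the model, so the remaining slopes are estimated subject to it instead of the line being shifted after the fit.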
I'm working with the train() function from the caret package to fit multiple regression and ML models and compare their fit. I'd like to write a function that iterates through all model types and enters the best fit for each into a data frame. The biggest issue is that caret doesn't provide all the model fit statistics I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way in which caret outputs each model's fit.
Another post (sorry, I don't have a link) provided this function, which pulls from fit$results and fit$bestTune to get the precalculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate from the raw output is BIC. The two functions below do that. They need the residuals (y_actual - y_predicted), the number of x variables (k), and the number of rows used in the prediction (n). k and n must be derived from the output, not the original dataset, because a model may drop x variables (feature selection) or rows (omitting NAs) depending on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
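As a sanity check on the two functions above, here is a small sketch using plain lm on made-up data (not caret). The quantity n*log(mse) + k*log(n) differs from the Gaussian BIC reported by stats::BIC() only by a term that depends on n alone, so differences between models fitted on the same rows should agree. The helper bic_from_fit and the data frame d are hypothetical names for this illustration; residuals are rebuilt from predict() on the complete rows, which is the same idea one would try on a caret fit.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + rnorm(50)
d[sample(50, 7), "x1"] <- NA   # introduce some missing rows

calculate_MSE <- function(residuals) mean(residuals^2)
calculate_BIC <- function(n, mse, k) n * log(mse) + k * log(n)

# hypothetical helper: rebuild residuals from predictions on the complete rows
bic_from_fit <- function(fit, data, y_name, k) {
  used <- data[complete.cases(data), ]
  res  <- used[[y_name]] - predict(fit, newdata = used)
  calculate_BIC(n = length(res), mse = calculate_MSE(res), k = k)
}

fit1 <- lm(y ~ x1, data = d)        # lm drops the NA rows by default
fit2 <- lm(y ~ x1 + x2, data = d)

bic1 <- bic_from_fit(fit1, d, "y", k = 1)
bic2 <- bic_from_fit(fit2, d, "y", k = 2)

# the two differences should match up to floating point error
c(custom = bic1 - bic2, stats = BIC(fit1) - BIC(fit2))
```

Because only differences are comparable, the raw values from calculate_BIC should not be mixed with BIC values computed by other tools.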
The real question: does caret::train() have a standardized output for the x variables, or for any of y_actual, y_predicted, or the residuals?
I tried fit$finalModel$model and other approaches, but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above part of this reproducible example.
library(caret)
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducibility
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
# column names must be passed as strings so they can be pasted into a formula
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)
I'm using an R package called logistf to fit a logistic regression. I saw that the package has no predict function for new data, and predict() does not work with it, so I found some code that shows how to do this with new data:
fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData)
betas <- coef(fit)
X <- model.matrix(fit, data=testData)
probs <- 1 / (1 + exp(-X %*% betas))
I want to build a cross-validated version of this, using fit$predict and the probabilities that probs gives me. Has anyone ever done something like this before?
The other thing I want to know is about fit$predict. I'm fitting a binary logistic regression and it returns many values. Are these values for class 0 or class 1, and how can I tell? Thanks
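To illustrate the manual prediction formula and the class question, here is a sketch that uses glm from base R as a stand-in, since it ships with R and has a predict method to compare against. The data are simulated and all names are made up. For a binomial glm with a factor response, the first level is treated as "failure", so the fitted probabilities are for the second level. My understanding is that logistf codes the response the same way, but that is an assumption worth checking against its documentation.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(runif(n) < plogis(0.5 + dat$x1 - dat$x2), "yes", "no"),
                levels = c("no", "yes"))

train_idx <- sample(n, 150)
trainData <- dat[train_idx, ]
testData  <- dat[-train_idx, ]

fit   <- glm(y ~ x1 + x2, family = binomial, data = trainData)
betas <- coef(fit)

# design matrix for the new data, built from the same right-hand side
X <- model.matrix(~ x1 + x2, data = testData)

# inverse logit by hand; plogis(X %*% betas) is an equivalent shortcut
probs_manual  <- as.vector(1 / (1 + exp(-X %*% betas)))
probs_predict <- unname(predict(fit, newdata = testData, type = "response"))

all.equal(probs_manual, probs_predict)  # the two routes should agree
levels(trainData$y)[2]                  # the class the probabilities refer to
```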
While the code that you wrote works perfectly, there seems to be a more concise way of getting the same results:
library(brglm)
brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
As for the CV, you have to write a few lines of code, I guess:
# Set the number of folds and the number of instances in each fold
library(dplyr)
library(ggplot2)
n_folds <- 5
fold_size <- nrow(dataset) %/% n_folds
residual <- nrow(dataset) %% n_folds
# label the instances by fold; leftover rows go into the last fold
cv_labels <- c(rep(seq_len(n_folds), each = fold_size), rep(n_folds, residual))
# the error term would differ based on each threshold value
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq))
index_mat[,1] <- t_seq
# the main loop for calculation of the CV error on each fold
for (i in seq_len(n_folds)){
train <- dataset %>% filter(cv_labels != i)
test <- dataset %>% filter(cv_labels == i )
brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train )
# predict with the model fitted on this fold's training data
brglm_cv_pred <- predict(object = brglm_cv_model, newdata = test , type = "response")
# error formula that you want, e.g. misclassification
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( factor(test$response_var) , factor(brglm_cv_pred>treshold, levels = c("FALSE","TRUE") ))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
# other indices can be computed as follows
#spec <- conf_mat[1,1]/sum(conf_mat[1,])
#prec <- conf_mat[2,2]/sum(conf_mat[,2])
#F1 <- (2*prec * sen)/(prec+sen)
#accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
#here I am only interested in sensitivity
index_mat[counter,(i+1)] <- sen
}
}
# final data.frame would be the mean of sensitivity over each threshold value
final_mat <- matrix(nrow = length(t_seq), ncol = 2 )
final_mat[,1] <- t_seq
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean)
final_mat <- data.frame(final_mat)
colnames(final_mat) <- c("treshold","sensitivity")
#why not having a look at the CV-sensitivity of the model over threshold values?
ggplot(data = final_mat) +
geom_line(aes(x = treshold, y = sensitivity ), color = "blue")
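The fold labels above are assigned in row order, which can bias the folds if the dataset is sorted. A small alternative sketch, assuming random assignment is acceptable: make_folds is a hypothetical helper that shuffles the labels and keeps fold sizes within one row of each other.

```r
# hypothetical helper: random, near-equal-sized fold labels for n rows
make_folds <- function(n, n_folds, seed = 1) {
  set.seed(seed)
  # rep(..., length.out = n) spreads leftover rows over the first folds,
  # sample() then shuffles which rows get which label
  sample(rep(seq_len(n_folds), length.out = n))
}

cv_labels <- make_folds(n = 103, n_folds = 5)
fold_sizes <- table(cv_labels)
fold_sizes
```

The resulting cv_labels vector can be used in the loop in the same way as the ordered labels.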
I am using the Car Evaluation dataset from UCI and trying to fit an SVM classifier to it. After creating the model, I calculate accuracy from the confusion matrix, but I get the same accuracy every time, even when I change the SVM parameters. My code is posted below.
require("e1071");
#Code to read data from csv and convert to numeric
car_data <- read.csv("car.data.csv",header = TRUE,sep = ",",quote = "\"");
#backup original data to other data frame
car_data_bkp <- car_data;
car_data$buying<-as.numeric(car_data$buying);
car_data$maint<-as.numeric(car_data$maint);
car_data$doors<-as.numeric(car_data$doors);
car_data$persons<-as.numeric(car_data$persons);
car_data$lug_boot<-as.numeric(car_data$lug_boot);
car_data$safety<-as.numeric(car_data$safety);
car_data$class<-as.numeric(car_data$class);
#scaling of data
maxs = apply(car_data, MARGIN = 2, max);
mins = apply(car_data, MARGIN = 2, min);
scaled = as.data.frame(scale(car_data, center = mins, scale = maxs - mins));
#sampling of data for train and testing
trainIndex <- sample(1:nrow(scaled), 0.8 * nrow(scaled));
train <- scaled[trainIndex, ];
test <- scaled[-trainIndex, ];
n <- names(train);
f <- as.formula(paste("class ~", paste(n[!n %in% "class"], collapse = " + ")));
svm_model <- svm(formula=f,train,cross = 2,tolerance= 0.00001, cost = 1000,gamma=1);
summary(svm_model);
svm.pred <- predict(svm_model, test[,-7],type = "class");
table(pred = svm.pred, true = test[,7]);
#calculate accuracy
sum(diag(svm.pred))/sum(svm.pred);
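For reference, accuracy is normally computed from the confusion table, not from the prediction vector. The sketch below uses made-up labels (no SVM involved) to show the table-based calculation. It also shows that applying the same expression to a plain numeric vector gives 1 whatever the values are, which may be why the number in the code above never changes.

```r
true_class <- factor(c("acc", "acc", "good", "good", "unacc", "unacc"))
pred_class <- factor(c("acc", "good", "good", "good", "unacc", "acc"),
                     levels = levels(true_class))

# rows are predictions, columns are true classes
conf_mat <- table(pred = pred_class, true = true_class)

# correct predictions sit on the diagonal: here 4 of 6
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# on a numeric vector, diag() builds a diagonal matrix from the values,
# so the ratio is 1 regardless of what the values are
v <- c(0.2, 0.5, 0.9)
always_one <- sum(diag(v)) / sum(v)
```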
I'm trying a very simple random forest, as shown below. The code is entirely self-contained and runnable.
library(randomForest)
n = 1000
factor=10
x1 = seq(n) + rnorm(n, 0, 150)
y = x1*factor + rnorm(n, 0, 550)
x_data = data.frame(x1)
y_data = data.frame(y)
k=2
for (nfold in seq(k)){
fold_ids <- cut(seq(1, nrow(x_data)), breaks=k, labels=FALSE)
id_indices <- which(fold_ids==nfold)
fold_x <- x_data[id_indices,]
fold_y <- y_data[id_indices,]
fold_x_df = data.frame(x=fold_x)
fold_y_df = data.frame(y=fold_y)
print(paste("number of rows in fold_x_df is ", nrow(fold_x_df), sep=" "))
print(paste("number of rows in fold_y_df is ", nrow(fold_y_df), sep=" "))
rf = randomForest(fold_x_df, fold_y_df, ntree=1000)
print(paste("mse for fold number ", " is ", sum(rf$mse)))
}
rf = randomForest(x_data, y_data, ntree=1000)
It gives me an error:
...The response has five or fewer unique values. Are you sure you want to do regression?
I don't understand why it gives me that error.
I've checked these sources:
Use of randomforest() for classification in R?
RandomForest error code
https://www.kaggle.com/c/15-071x-the-analytics-edge-competition-spring-2015/forums/t/13383/warning-message-in-random-forest
None of those solved my problem. You can look at the print statements: there are clearly more than 5 unique labels. Not to mention, I'm doing regression here, not classification, so I'm not sure why the word "label" is used in the error.
The problem is giving the response as a data frame. Since the response must be one-dimensional, it makes sense that it should be a vector. Here's how I would simplify your code to use the data argument of randomForest with the formula method to avoid the issue entirely:
## simulation: unchanged (but seed set for reproducibility)
library(randomForest)
n = 1000
factor=10
set.seed(47)
x1 = seq(n) + rnorm(n, 0, 150)
y = x1*factor + rnorm(n, 0, 550)
## use a single data frame
all_data = data.frame(y, x1)
## define the folds outside the loop
k = 2
fold_ids <- cut(seq(1, nrow(all_data)), breaks = k, labels = FALSE)
for (nfold in seq(k)) {
id_indices <- which(fold_ids == nfold)
## sprintf can be nicer than paste for "filling in blanks"
print(sprintf("number of rows in fold %s is %s", nfold, length(id_indices)))
## just pass the subset of the data directly to randomForest
## no need for extracting, subsetting, putting back in data frames...
rf <- randomForest(y ~ ., data = all_data[id_indices, ], ntree = 1000)
## sprintf also allows for formatting
## %g switches to scientific notation when the exponent is below -4 or at least the precision (6 significant digits by default)
print(sprintf("mse for fold %s is %g", nfold, sum(rf$mse)))
}
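To see why a data frame response can trip the "five or fewer unique values" message even with many distinct values, here is a base R sketch. My reading, which is an assumption since I have not traced the package source here, is that the check counts unique values with something like length(unique(y)). On a data frame, length() returns the number of columns, not the number of distinct rows.

```r
y    <- c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5)
y_df <- data.frame(y = y)

# on a vector, unique() keeps the distinct values
n_unique_vector <- length(unique(y))

# on a data frame, unique() keeps distinct rows but the result is still
# a data frame, and length() of a data frame is its number of columns
n_unique_df <- length(unique(y_df))

c(vector = n_unique_vector, data_frame = n_unique_df)
```

Passing the response as a vector, or using the formula interface as above, sidesteps this.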