Plotting after Doing Simulation of Linear Regression with R

I am doing a simulation of linear regression with R.
The regression model I consider is y_i = a + b_1 * x_1i + b_2 * x_2i + e_i.
The parameters are designed as follows:
x_1i ~ N(2, 1), x_2i ~ Poisson(4), e_i ~ N(0, 1), theta = (a, b_1, b_2)
In the code below, I generate 100 independent random samples of (y, x_1, x_2) 1000 times using the distributions above and estimate theta_hat (the estimator of theta) for each sample. After getting theta_hat, I would like to plot the distributions of the estimators of a (a_hat), b_1 (b_1_hat), and b_2 (b_2_hat), respectively.
## Construct 1000 replications of x_1, x_2, the error term, and y
x_1_1000 <- as.data.frame(replicate(n = 1000,
                                    expr = rnorm(n = 100, mean = 2, sd = 1)))
colnames(x_1_1000) <- paste("x_1", 1:1000, sep = "_")

x_2_1000 <- as.data.frame(replicate(n = 1000,
                                    expr = rpois(n = 100, lambda = 4)))
colnames(x_2_1000) <- paste("x_2", 1:1000, sep = "_")

error_1000 <- as.data.frame(replicate(n = 1000,
                                      expr = rnorm(n = 100, mean = 0, sd = 1)))
colnames(error_1000) <- paste("e", 1:1000, sep = "_")

y_1000 <- 1 + x_1_1000 * 1 + x_2_1000 * (-2) + error_1000
colnames(y_1000) <- paste("y", 1:1000, sep = "_")
######################################################################
lms <- lapply(1:1000, function(x) lm(y_1000[,x] ~ x_1_1000[,x] + x_2_1000[,x]))
theta_hat_1000 <- as.data.frame(sapply(lms, coef))
After running the regressions, I store the results in lms, which is a list. Because I only want the coefficients, I store them in theta_hat_1000. However, when I try to plot their distributions, I cannot get what I want. I have tried two ways to solve the problem but am still confused.
The first way I tried was simply renaming the data frame theta_hat_1000. I successfully renamed column i, for i from 1 to 1000, but I could not successfully rename the rows:
rownames(theta_hat_1000[1,]) <- "ahat"
rownames(theta_hat_1000[2,]) <- "x1hat"
rownames(theta_hat_1000[3,]) <- "x2hat"
The code above did not produce any error message but ultimately failed to change the row names. I therefore tried the following instead:
rownames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
This renamed the rows successfully. However, when I check whether anything is stored under that name, it returns NULL:
theta_hat_1000$ahat
NULL
So something is clearly off, and I tried a second approach.
I tried to unlist theta_hat_1000, which is stored in my global environment, but that did not give me what I want either. The expected result is three rows, each with 1000 values, but what I actually get is 3000 observations in a single column.
The ideal result is three columns, each with 1000 values, in a data frame, so I can do further processing such as using ggplot to show the distributions of the estimated coefficients.
I have been stuck on this for several hours. I would appreciate any help or suggestions.

The line theta_hat_1000$ahat in your code does not work because "ahat" is a row name, not a column name, in the data frame. You would get the result by calling theta_hat_1000["ahat", ].
However, I understand that your desired result is actually a data frame with 3 columns (and 1000 rows) representing the 3 parameters of your regression model (intercept, x1, x2). The line as.data.frame(sapply(lms, coef)) produces a data frame with 3 rows and 1000 columns. You can, for instance, transpose the matrix before converting it into a data frame to get 1000 rows and 3 columns:
theta_hat_1000 <- sapply(lms, coef)
theta_hat_1000 <- as.data.frame(t(theta_hat_1000))
colnames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
head(theta_hat_1000)
ahat x1hat x2hat
1 2.0259326 0.7417404 -2.111874
2 0.7827929 0.9437324 -1.944320
3 1.1034906 1.0091594 -2.035405
4 0.9677150 0.8168757 -1.905367
5 1.0518646 0.9616123 -1.985357
6 0.8600449 1.0781489 -2.017061
Now you could also call the variables with theta_hat_1000$ahat.
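From here, one way to draw the three sampling distributions with ggplot2 is to reshape the data frame to long format and facet by parameter. A minimal sketch, assuming the theta_hat_1000 data frame built above (tidyr and ggplot2 are extra packages, not part of the original code):
library(ggplot2)
library(tidyr)

# one row per (parameter, estimate) pair
theta_long <- pivot_longer(theta_hat_1000, cols = everything(),
                           names_to = "parameter", values_to = "estimate")

# one histogram panel per estimated coefficient
ggplot(theta_long, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ parameter, scales = "free_x")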

Related

pred.zoib error, non-conformable arguments

I'm trying to fit a model with one inflation only. The pred.zoib function worked when I included both one and zero inflation, but it isn't running now that I've excluded the zero inflation; I get this error:
Error in x1[i, ] %*% b1 : non-conformable arguments
The chaffinchdata object is just a data frame with various pieces of information, most importantly a detection probability column that has been adjusted to replace the zeros with 0.00001, and a distance column that holds numeric values between 0 and 131.
When using pred.zoib before with both zero and one inflation, it worked fine:
fit.zoib.chaffinch4 <- zoib(Detection_probability ~ Distance | Distance | Distance | Distance , data = chaffinchdata)
chaffinchxnew <- data.frame(Distance = seq(0,150,0.1))
pred.chaff4 <- pred.zoib(fit.zoib.chaffinch4, chaffinchxnew)
dfchaff4 <- data.frame(Distance = seq(0,150,0.1),pred.chaff4$summary)
So that all worked perfectly. The code below, however, only runs up until the pred.zoib step:
# now set the zero inflation to FALSE, to investigate just one inflation
nozeroinf <- birddata$Detection_probability
nozeroinf <- ifelse(nozeroinf == 0, 0.00001, nozeroinf)
birddata$nozeroinfs <- nozeroinf
chaffinchdata <- filter(birddata, Species == 'Chaffinch')
fit.zoib.chaff.oneinf <- zoib(nozeroinfs ~ Distance | Distance | Distance,
                              data = chaffinchdata, zero.inflation = F,
                              one.inflation = T)
chaffxnew.oneinf <- data.frame(Distance = seq(0, 100, 0.1))
pred.chaff.oneinf <- pred.zoib(fit.zoib.chaff.oneinf, chaffxnew.oneinf)
I've tried using the distance data straight from the chaffinchdata dataset instead of creating a sequence of my own, i.e.
pred.zoib(fit.zoib.chaff.oneinf, data.frame(chaffinchdata$Distance))
but that didn't work, nor did
pred.zoib(fit.zoib.chaff.oneinf, chaffinchdata)
Any help on this would be greatly appreciated!

Does caret::train() in R have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models and test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a data frame. The biggest issue is that caret doesn't provide all the model fit statistics I'd like, so they need to be derived from the raw output. Based on my exploration, there doesn't seem to be a standardized way caret outputs each model's fit.
Another post (sorry, I don't have a link) created this function, which pulls from fit$results and fit$bestTune to get the pre-calculated RMSE, R^2, etc.:
get_best_result <- function(caret_fit) {
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}
One example of another fit statistic I need to calculate from the raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed, along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output rather than the original dataset, because a model may drop x variables (feature selection) or rows (omitting NAs) depending on its algorithm.
calculate_MSE <- function(residuals){
  # residuals can be replaced with y_actual - y_predicted
  mse <- mean(residuals^2)
  return(mse)
}

calculate_BIC <- function(n, mse, k){
  BIC <- n * log(mse) + k * log(n)
  return(BIC)
}
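As a toy illustration of how these two functions fit together (the lm model on mtcars here is just a placeholder, not part of the caret workflow):
m   <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(m)
calculate_BIC(n = length(res), mse = calculate_MSE(res), k = 2)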
The real question is: is there a standardized output of caret::train() for the x variables, or for y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
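One route that does seem to be standardized is fit$trainingData, which caret stores with the response renamed to .outcome regardless of method, together with predict() on the train object itself, which dispatches correctly for each model type. A sketch along those lines (an assumption on my part; I have not verified it for every method, and rows with NAs may still need dropping by hand):
train_df <- fit$trainingData                # data used by train(); response is .outcome
used     <- complete.cases(train_df)        # guard against rows dropped by na.omit
y_actual <- train_df$.outcome[used]
y_pred   <- predict(fit, newdata = train_df[used, ])  # model-agnostic predictions
res      <- y_actual - y_pred
n        <- length(res)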
Here is a reproducible example along with the function I'm using. Please consider the functions above part of this reproducible example.
library(caret)   # train() comes from caret
library(rlist)
library(data.table)

# data
df <- data.frame(y1 = rnorm(50, 0, 1),
                 y2 = rnorm(50, .25, 1.5),
                 x1 = rnorm(50, .4, .9),
                 x2 = rnorm(50, 0, 1.1),
                 x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index, ] <- NA

# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
  # empty list
  results <- list()
  # number of for loops
  loops_counter <- 0
  # for every y
  for(y in 1:length(Ys)){
    # for every model
    for(m in 1:length(models)){
      # track loops
      loops_counter <- loops_counter + 1
      # fit the model
      set.seed(1) # seed for reproducibility
      fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
                                             sep = ' ~ ')),
                            data = df,
                            method = models[m],
                            na.action = na.omit,
                            tuneLength = 10),
                      error = function(e) {return(NA)})
      # pull results
      results[[loops_counter]] <- c(Y = Ys[y],
                                    model = models[m],
                                    sample_size = nrow(fit$finalModel$model),
                                    RMSE = get_best_result(fit)[[2]],
                                    R2 = get_best_result(fit)[[3]],
                                    MAE = get_best_result(fit)[[4]],
                                    BIC = calculate_BIC(n = length(fit$finalModel),
                                                        mse = calculate_MSE(fit$finalModel$residuals),
                                                        k = length(fit$finalModel$xNames)))
    }
  }
  # list bind
  results_df <- list.rbind(results)
  return(results_df)
}

linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
# Ys and Xs must be character vectors of column names
fits <- fitModels(df, c('y1', 'y2'), c('x1', 'x2', 'x3'), linear_models)

Nested loop doesn't return expected values: Return model results from multiple recalculated independent variables

I would like some help with my nested loop, which is not returning the values I expect. I am new to nested loops, so please bear with me. I want to calculate a new independent variable for a logistic regression model based on different transformations of the original variables. Specifically, I have six variables, x1...x6, and I create three new variables (newvar1, newvar2, newvar3) by extracting a percentile from each pair of the original variables. I then combine these three new variables via subtraction to form a final new variable, which serves as the independent variable for a logistic regression model. The value of that final variable is then evaluated by the AIC of the logistic regression model.
I need to determine the optimal combination of percentile values forming newvar1, newvar2, and newvar3 which gives me the best logistic regression model. To do this I have attempted to create a three-level nested loop like this:
df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 x3 = rnorm(100),
                 x4 = rnorm(100),
                 x5 = rnorm(100),
                 x6 = rnorm(100),
                 y = as.factor(runif(100) <= .70))
n <- 1
AIC <- NULL
for (i in 0.1:n){
  for (j in 0.1:n){
    for (k in 0.1:n){
      df$newvar1 <- apply(df[, 1:2], 1, quantile, probs = i, na.rm = T)
      df$newvar2 <- apply(df[, 3:4], 1, quantile, probs = j, na.rm = T)
      df$newvar3 <- apply(df[, 5:6], 1, quantile, probs = k, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      AIC[i] <- as.numeric(model$aic)
    }
  }
}
I would like to provide a sequence of 11 values (0, 0.1, 0.2, ..., 0.9, 1) to the probs argument of the quantile function, and I would like to get the AIC for each possible combination of quantile parameters (11*11*11 = 1331). Thus the AIC variable should end up as a numeric vector of 1331 values. However, when I run the above code I get an empty numeric value for AIC. How can I get this code to run properly and supply the values for all 1331 possible models?
Thanks!
EDIT: this isn't the solution, but it provides part of the answer, I think. In my previous code n was less than one, so the loop was only performing a single iteration; obviously n needs to be greater than one. The reason it was less than 1 before is that the probs argument to quantile requires a value between 0 and 1. To overcome this, the parameter passed to probs is now divided by 10. Now with AIC[i] I can get a vector of 10 values, but I still don't understand how to get the full 10*10*10 (or 11*11*11) set representing all combinations.
New code:
n <- 10
AIC <- NULL
for (i in 1:n){
  for (j in 1:n){
    for (k in 1:n){
      df$newvar1 <- apply(df[, 1:2], 1, quantile, probs = i/10, na.rm = T)
      df$newvar2 <- apply(df[, 3:4], 1, quantile, probs = j/10, na.rm = T)
      df$newvar3 <- apply(df[, 5:6], 1, quantile, probs = k/10, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      AIC[i] <- as.numeric(model$aic)
    }
  }
}
First of all, AIC is an R function, so I've renamed the variable to aic.
Second, in your code's innermost loop you index by i only, when you have 3 indices. So maybe this is what you really need (filling in the loop body from your code above):
n <- 10
aic <- array(0, dim = c(n, n, n)) # changed: one slot per (i, j, k) combination
for (i in 1:n){
  for (j in 1:n){
    for (k in 1:n){
      df$newvar1 <- apply(df[, 1:2], 1, quantile, probs = i/10, na.rm = T)
      df$newvar2 <- apply(df[, 3:4], 1, quantile, probs = j/10, na.rm = T)
      df$newvar3 <- apply(df[, 5:6], 1, quantile, probs = k/10, na.rm = T)
      df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
      model <- glm(y ~ finalvar, data = df, family = "binomial")
      aic[i, j, k] <- as.numeric(model$aic) # changed: index by all three
    }
  }
}
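Once the array is filled, the best combination can be read off directly. A small sketch, assuming the completed aic array from above:
# arr.ind = TRUE returns the (i, j, k) position instead of a flat index
best <- which(aic == min(aic), arr.ind = TRUE)
best_probs <- best / 10   # recover the probs values passed to quantile()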

Rolling regression and prediction with lm() and predict()

I need to apply lm() to an enlarging subset of my data frame dat, while making a prediction for the next observation. For example, I am doing:
fit model     predict
----------    ----------------
dat[1:3, ]    dat[4, ]
dat[1:4, ]    dat[5, ]
    .             .
    .             .
dat[-1, ]     dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need summary(fit)$adj.r.squared;
From predict.fit, I'd need the predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1

## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}

## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used the subset argument to select the subset to fit;
I have used model = FALSE to avoid saving the model frame, which saves memory;
Overall, there is no explicit loop; sapply does the iteration.
Fitting starts from p, the minimum number of data points required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we need at least the final row for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))
Applying the code above gives these results (27 rows in total; output truncated to 5 rows):
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared value of the fitted model, while the second column is the prediction. The first value of adj.r2 is NaN because the first model fits 3 coefficients to 3 data points, so no sensible statistic is available. The same happens to se: that fit passes through all 3 points exactly, leaving zero residuals and zero residual degrees of freedom, so the prediction is made without an estimable uncertainty.
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
data <- data.frame(v1 = rnorm(100), v2 = rnorm(100), clicks = rnorm(100))
data1 <- data[1:(nrow(data) - 1), ]
data2 <- data[nrow(data), ]

for (i in 3:nrow(data)){
  nam  <- paste("predict", i, sep = "")
  nam1 <- paste("fit", i, sep = "")
  nam2 <- paste("summary_fit", i, sep = "")
  fit  <- lm(clicks ~ v1 + v2, data = data[1:i, ])
  tmp  <- predict(fit, newdata = data2, se.fit = TRUE)
  tmp1 <- fit
  tmp2 <- summary(fit)
  assign(nam, tmp)
  assign(nam1, tmp1)
  assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output objects like this:
rm(data, data1, data2, i, nam, nam1, nam2, fit, tmp, tmp1, tmp2)
frames <- ls()
frames.fit <- frames[1:98]       # change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]

fit.table <- data.frame(intercept = NA, v1 = NA, v2 = NA, sourcedf = NA)
for (i in 1:length(frames.fit)){
  tmp <- get(frames.fit[i])
  fit.table <- rbind(fit.table,
                     c(tmp$coefficients[[1]], tmp$coefficients[[2]],
                       tmp$coefficients[[3]], frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
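The same pattern extends to the prediction objects; a sketch along the same lines, assuming the frames.predict vector defined above:
pred.table <- data.frame(prediction = NA, se = NA, sourcedf = NA)
for (i in 1:length(frames.predict)){
  tmp <- get(frames.predict[i])
  # predict(..., se.fit = TRUE) objects carry $fit and $se.fit
  pred.table <- rbind(pred.table, c(tmp$fit, tmp$se.fit, frames.predict[i]))
}
pred.table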

Why does Random Forest regression in R not like my input data?

I'm trying a very simple random forest, as shown below. The code is entirely self-contained and runnable:
library(randomForest)

n <- 1000
factor <- 10
x1 <- seq(n) + rnorm(n, 0, 150)
y  <- x1 * factor + rnorm(n, 0, 550)
x_data <- data.frame(x1)
y_data <- data.frame(y)

k <- 2
for (nfold in seq(k)){
  fold_ids <- cut(seq(1, nrow(x_data)), breaks = k, labels = FALSE)
  id_indices <- which(fold_ids == nfold)
  fold_x <- x_data[id_indices, ]
  fold_y <- y_data[id_indices, ]
  fold_x_df <- data.frame(x = fold_x)
  fold_y_df <- data.frame(y = fold_y)
  print(paste("number of rows in fold_x_df is", nrow(fold_x_df)))
  print(paste("number of rows in fold_y_df is", nrow(fold_y_df)))
  rf <- randomForest(fold_x_df, fold_y_df, ntree = 1000)
  print(paste("mse for fold number", nfold, "is", sum(rf$mse)))
}
rf <- randomForest(x_data, y_data, ntree = 1000)
It gives me an error:
...The response has five or fewer unique values. Are you sure you want to do regression?
I don't understand why it gives me that error.
I've checked these sources:
Use of randomforest() for classification in R?
RandomForest error code
https://www.kaggle.com/c/15-071x-the-analytics-edge-competition-spring-2015/forums/t/13383/warning-message-in-random-forest
None of those solved my problem. As the print statements show, there are clearly more than 5 unique values. Not to mention, I'm doing regression here, not classification, so I'm not sure why the word "label" is used in the error.
The problem is passing the response as a data frame. Since the response must be one-dimensional, it makes sense for it to be a vector. Here's how I would simplify your code, using the data argument of randomForest with the formula method to avoid the issue entirely:
## simulation: unchanged (but seed set for reproducibility)
library(randomForest)
n <- 1000
factor <- 10
set.seed(47)
x1 <- seq(n) + rnorm(n, 0, 150)
y  <- x1 * factor + rnorm(n, 0, 550)

## use a single data frame
all_data <- data.frame(y, x1)

## define the number of folds and the folds themselves outside the loop
k <- 2
fold_ids <- cut(seq(1, nrow(all_data)), breaks = k, labels = FALSE)

for (nfold in seq(k)) {
  id_indices <- which(fold_ids == nfold)
  ## sprintf can be nicer than paste for "filling in blanks"
  print(sprintf("number of rows in fold %s is %s", nfold, length(id_indices)))
  ## just pass the subset of the data directly to randomForest
  ## no need for extracting, subsetting, putting back in data frames...
  rf <- randomForest(y ~ ., data = all_data[id_indices, ], ntree = 1000)
  ## sprintf also allows for formatting: %g switches to scientific
  ## notation for very large or very small values
  print(sprintf("mse for fold %s is %g", nfold, sum(rf$mse)))
}
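If you would rather keep the x/y interface from the question, passing the response as a plain vector instead of a data frame also avoids the warning. A minimal sketch using the question's x_data and y_data objects:
# y_data$y is a numeric vector, so randomForest treats this as regression
rf <- randomForest(x_data, y_data$y, ntree = 1000)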
