I have some data to which I've fitted an earth model. I'm interested in the slopes of the different line segments, but looking at the model summary I don't see the values I expected.
library(earth)
library(dplyr)
library(ggplot2)
d = structure(list(x = c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30), y = c(0.151534750704409,
0.0348452707597105, -0.0913494247372798, -0.214465577974757,
-0.365251164825619, -0.528214103496014, -0.614970081844732,
-0.922572314358796,
-1.15911158401926, -1.36432638285029, -1.51587576144429, -1.63708705686248,
-1.7530889072188, -1.86142968143915, -1.98159646754281, -2.0994478459505,
-2.23037530743309, -2.3421669680425, -2.40621060828366, -2.55432043723978,
-2.73246980567199, -2.92496136528975)), .Names = c("x", "y"), row.names =
c(NA, -22L), class = c("tbl_df", "tbl", "data.frame"))
mod = earth(y ~ x, data = d)
d$pred = predict(mod, newdata = d)
summary(mod, style = 'pmax')
This gives me the following summary:
Call: earth(formula=y~x, data=d)
y =
-1.314958
- 0.06811314 * pmax(0, x - 16)
+ 0.1518165 * pmax(0, 19 - x)
- 0.05124021 * pmax(0, x - 19)
Selected 4 of 4 terms, and 1 of 1 predictors
Termination condition: RSq changed by less than 0.001 at 4 terms
Importance: x
Number of terms at each degree of interaction: 1 3 (additive model)
GCV 0.004496406 RSS 0.04598597 GRSq 0.9953947 RSq 0.9976504
However, when I plot the model, all three slopes look negative:
ggplot(d, aes(x, y)) +
geom_point() +
geom_line(aes(x, pred)) +
theme(aspect.ratio = 1)
How do I get the values for those 3 negative slopes?
mod$coefficients gives the coefficients. If a coefficient belongs to a hinge term in -x (such as pmax(0, 19 - x)), the slope in x is the negative of that coefficient. You can do mod$coefficients %>% {ifelse(grepl('-x', rownames(.)), -., .)} to get the slopes (or just mentally reverse the signs for the terms containing -x).
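If you would rather read the three segment slopes straight off the fitted line, note that the fit is piecewise linear with knots at 16 and 19 (taken from the summary above), so differencing predictions at the segment boundaries gives them directly. A minimal sketch, assuming the mod and d defined above:
xk <- c(min(d$x), 16, 19, max(d$x))  # segment boundaries; knots from the summary
yk <- predict(mod, newdata = data.frame(x = xk))
diff(as.vector(yk)) / diff(xk)       # one slope per segment
Keep in mind that on the middle segment (16 < x < 19) two hinge terms are active at once, so its slope is the sum of their sign-corrected coefficients rather than a single coefficient.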
I'm trying to calculate the area below a certain point and I'm unsure how to do that. I've seen this question, but it doesn't quite answer what I'm looking for.
Here is some example data...
library(tidyverse)
test_df <- structure(list(time = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23), balance = c(27,
-45, -118, -190, -263, -343, -424, -1024, -434, -533, -613, -694,
-775, -355, -436, -516, -597, -77, -158, -239, -319, -400, -472,
-545)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
)) %>% as_tibble()
ggplot(test_df, aes(time, balance)) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = -400)
I'd like to calculate the AUC for the trend line, but only for when it is below a certain threshold (-400, for example).
So I can extract the values for the smoothed line...
test_plot <- ggplot(test_df, aes(time, balance)) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = -400)
ggp_data <- ggplot_build(test_plot)$data[[1]]
and use something like this to get an AUC value
MESS::auc(ggp_data$x, ggp_data$y)
My questions are:
How to only calculate below -400?
How to interpret the value?
What units would it be in?
If my x axis is in hours, is there a way to turn the value into an hour value?
Thanks!
To calculate the area relative to a threshold, shift your y-values so that the threshold becomes zero, i.e. use y - threshold. Since your threshold is -400, that means adding 400 to y. For your case that looks like this:
MESS::auc(ggp_data$x, ggp_data$y+400)
However, this calculates the AUC over the whole range from 0 to 23, and therefore also includes the parts that are above -400. To get the AUC for only the part below your threshold, you have to find the x-values where your smoothed line intersects the horizontal line at -400. Inspecting the plot by eye, you can approximate these x-values as:
x1 <- 4.45
x2 <- 15.45
x3 <- 21.35
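If you would rather not eyeball the crossings, a small sketch (using the ggp_data extracted above) finds them from the sign changes of y + 400, with linear interpolation between the bracketing points:
s <- sign(ggp_data$y + 400)
i <- which(diff(s) != 0)  # index just before each crossing
x_cross <- ggp_data$x[i] -
  (ggp_data$y[i] + 400) * diff(ggp_data$x)[i] / diff(ggp_data$y)[i]
x_cross                   # approximately x1, x2, x3 above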
Now we calculate the AUC between x1 and x2, and between x3 and max(x), and then add these values together:
AUC1 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x1, to = x2)
AUC2 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x3, to = max(ggp_data$x))
AUC.total <- AUC1 + AUC2
> AUC.total
[1] -1747.352
Note that the value is negative because the shifted curve (y + 400) lies below 0 on these intervals. There are no "negative areas", so you can take the absolute value, AUC.total = 1747.352, and proceed. As for units, an AUC is measured in y-axis units times x-axis units, so if your x-axis is in hours the result is balance-hours; without knowing what your y-axis represents, one cannot interpret the value more concretely.
I'm calculating some stats for 3 groups - all, male, and female. I've stored them in a data frame called stats_df that has each group as a column, with the stats as the row data. I need a boxplot (my most recent attempt is included in the code) that shows all 3 groups as boxes, but I can't seem to figure it out and no online tutorial has helped.
Code:
all_stats <- c(all_mean, all_median, all_mode, all_25, all_50, all_75)
female_stats <- c(female_mean, female_median, female_mode, female_25, female_50, female_75)
male_stats <- c(male_mean, male_median, male_mode, male_25, male_50, male_75)
stats_df <- data.frame(all_stats, female_stats, male_stats)
boxplot(all_stats ~ male_stats,
data=stats_df,
main="Stats Boxplot",
xlab="Group",
ylab="Number")
You can use the following code:
stats_df <- data.frame(all_stats = c(35.19, 32, 29, 26, 32, 50),
female_stats = c(36.23, 32, 32, 24, 32, 52),
male_stats = c(33.5, 32, 29, 28.5, 32, 39.5))
library(tidyverse)
stats_df %>%
ggplot() +
geom_boxplot(aes(x = "all_stats", y = all_stats)) +
geom_boxplot(aes(x = "female_stats", y = female_stats)) +
geom_boxplot(aes(x = "male_stats", y = male_stats)) +
xlab("Group") +
ylab("Number") +
ggtitle("Stats Boxplot")
Output: a single plot with one box for each of the three groups.
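As a side note, a tidier variant of the same plot (a sketch, reshaping the stats_df above to long format) lets a single geom_boxplot() call cover all three groups:
stats_df %>%
  pivot_longer(everything(), names_to = "group", values_to = "value") %>%
  ggplot(aes(x = group, y = value)) +
  geom_boxplot() +
  labs(x = "Group", y = "Number", title = "Stats Boxplot")
pivot_longer() is part of tidyr, which library(tidyverse) already loads.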
I am trying to generate a plot like the one below to show the change in R-hat over iterations:
I have tried the following options:
summary(fit1)$summary: gives R-hat with all chains merged
summary(fit1)$c_summary: gives R-hat for each chain individually
Can you please help me to get R-hat for each iteration for a given parameter?
rstan provides the Rhat() function, which takes a matrix of iterations x chains and returns R-hat. We can extract this matrix from the fitted model and apply Rhat() cumulatively over it. The code below uses the 8 schools model as an example (copied from the getting started guide).
library(tidyverse)
library(purrr)
library(rstan)
theme_set(theme_bw())
# Fit the 8 schools model.
schools_dat <- list(J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(file = 'schools.stan', data = schools_dat)
# Extract draws for mu as a matrix; columns are chains and rows are iterations.
mu_draws = as.array(fit)[,,"mu"]
# Get the cumulative R-hat as of each iteration.
mu_rhat = map_dfr(
1:nrow(mu_draws),
function(i) {
return(data.frame(iteration = i,
rhat = Rhat(mu_draws[1:i,])))
}
)
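One caveat: recomputing R-hat from the first iteration up to each subsequent iteration is quadratic in chain length. For long chains, a thinned grid keeps it cheap; a sketch of the same computation evaluated every 10th iteration:
mu_rhat = map_dfr(
  seq(10, nrow(mu_draws), by = 10),
  function(i) data.frame(iteration = i, rhat = Rhat(mu_draws[1:i, ]))
)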
# Plot iteration against R-hat.
mu_rhat %>%
ggplot(aes(x = iteration, y = rhat)) +
geom_line() +
labs(x = "Iteration", y = expression(hat(R)))
I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.
I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() function. My results are much worse here, though, and I'm not sure why.
I was hoping someone could take a look at my code and see if there are any glaring errors.
# Load dplyr for select() and ggplot2 for the plot below
library(dplyr)
library(ggplot2)
# Create training data with and without the dependent variable
train <- data[1:split, ]
train.treat <- select(train, -c(y))
# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))
# Load the package xgboost
library(xgboost)
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat),
label = train$y,
nrounds = 100,
nfold = 10,
objective = "reg:linear",
eta = 0.1,
max_depth = 6,
early_stopping_rounds = 10,
verbose = 0 # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
summarize(ntrees.train = which.min(train_rmse_mean), # find the index of min(train_rmse_mean)
ntrees.test = which.min(test_rmse_mean)) # find the index of min(test_rmse_mean)
# The number of trees to use, as determined by xgb.cv
ntrees <- 25
# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat), # training data as matrix
label = train$y, # column of outcomes
nrounds = ntrees, # number of trees to build
objective = "reg:linear", # objective
eta = 0.001,
depth = 10,
verbose = 0 # silent
)
# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))
# Plot predictions vs actual bike rental count
ggplot(test, aes(x = pred, y = y)) +
geom_point() +
geom_abline()
# Calculate RMSE
test %>%
mutate(residuals = y - pred) %>%
summarize(rmse = sqrt(mean(residuals^2)))
How does this look?
Also, one thing I don't get about xgboost() is why I have to take out the dependent variable from the dataset in the "data" option and then add it back in the "label" option. Why do we do this?
My dataset has 809 observations and 108 independent variables. Here is an arbitrary subset:
structure(list(year = c(2019, 2019, 2019, 2019), ht = c(74, 76,
74, 73), wt = c(223, 234, 215, 215), age = c(36, 29, 32, 24),
gp_l1 = c(16, 16, 11, 14), gp_l2 = c(7, 0, 16, 0), gp_l3 = c(16,
15, 16, 0), gs_l1 = c(16, 16, 11, 13), gs_l2 = c(7, 0, 16,
0), gs_l3 = c(16, 15, 16, 0), cmp_l1 = c(372, 430, 226, 310
), cmp_l2 = c(154, 0, 297, 0), cmp_l3 = c(401, 346, 364,
0), att_l1 = c(597, 639, 365, 486), y = c(8, 71.5, 26, 22
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
My RMSE from this xgboost() model is 31.7, whereas my random forest and glmnet models give RMSEs around 13. The prediction metric I'm comparing against has an RMSE of 15.5. I don't get why my xgboost() model does so much worse than my random forest and glmnet models.
I am confused by this warning message as I try to fit my data with a nonlinear regression model using the drc package and its drm function.
I have
N_obs <- c(1, 80, 80, 80, 81, 82, 83, 84, 84, 95, 102, 102, 102, 103, 104, 105, 105, 109, 111, 117, 120, 123, 123, 124, 126, 127, 128, 128, 129, 130)
times <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
The model is
model.drm <- drm(N_obs ~ times, data = data.frame(N_obs = N_obs, times = times), fct = MM.2())
and the warnings come from predictions
preds <- predict(model.drm, times = times, interval = "confidence", level = 0.95)
There were 30 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In (tquan * sqrt(varVal + sumObjRV)) * c(-1, 1) :
Recycling array of length 1 in array-vector arithmetic is deprecated.
Use c() or as.vector() instead.
2: In (tquan * sqrt(varVal + sumObjRV)) * c(-1, 1) :
Recycling array of length 1 in array-vector arithmetic is deprecated.
Use c() or as.vector() instead.
3: In (tquan * sqrt(varVal + sumObjRV)) * c(-1, 1) :
Recycling array of length 1 in array-vector arithmetic is deprecated.
Use c() or as.vector() instead.
I have been trying to change the data inputs by using as.vector(times), c(times), etc., but I still cannot get rid of the warnings. Could someone help me identify the problem? Thank you!!
I re-ran your analysis with the sample data provided, and I can reproduce your warnings. Here's a summary:
Fit a Michaelis-Menten model of the form f(x; d, e) = d * (1 + e/x)^-1.
# Fit a 2 parameter Michaelis-Menten model
library(drc);
fit <- drm(
formula = N_obs ~ times,
data = data.frame(N_obs = N_obs, times = times),
fct = MM.2())
Based on the model fit, predict the response for the original times. Note you can omit the newdata argument here, because in that case predict will simply use the fitted values (which are based on times).
# Predictions
pred <- as.data.frame(predict(
fit,
newdata = data.frame(N_obs = N_obs, times = times),
interval = "confidence", level = 0.95));
pred$times <- times;
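For completeness, the omission mentioned above would be the same call without newdata, relying on predict falling back to the fitted values:
pred <- as.data.frame(predict(
  fit,
  interval = "confidence", level = 0.95));
pred$times <- times;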
Visualise data and predictions.
library(tidyverse);
data.frame(times = times, N_obs = N_obs) %>%
ggplot(aes(times, N_obs)) +
geom_point() +
geom_line(data = pred, aes(x = times, y = Prediction)) +
geom_ribbon(
data = pred,
aes(x = times, ymin = Lower, ymax = Upper),
alpha = 0.4);
The model fit seems reasonable, and I would say that the warnings can be safely ignored (see details).
Details
I had a look at the drc source-code, and the warning originates from line 201 of predict.drc.R:
retMat[rowIndex, 3:4] <- retMat[rowIndex, 1] + (tquan * sqrt(varVal + sumObjRV)) * c(-1, 1)
In that line, the one-element array produced by tquan * sqrt(varVal + sumObjRV) is multiplied by the vector c(-1, 1), which is exactly the array-vector arithmetic the warning complains about.
Here is a simple example to reproduce the warning:
arr <- array(5, dim = 1);
arr + c(1, 2);
#[1] 6 7
#Warning message:
#In arr + c(1, 2) :
# Recycling array of length 1 in array-vector arithmetic is deprecated.
# Use c() or as.vector() instead.
Note that the result is still correct; it's just that R no longer likes arithmetic between a one-dimensional array and a vector, and prefers proper scalars and vectors, or vectors and vectors.
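If you do want to silence the warning, following its own advice and flattening the array with c() before the arithmetic is enough; applied to the toy example above:
arr <- array(5, dim = 1);
c(arr) + c(1, 2);
#[1] 6 7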