Getting error when using plot.gbm to produce marginal plots - r

I'm a total novice with R.
I am trying to make some marginal plots from a BRT I fitted with the gbm package, and I keep getting the same error.
Below is my code; boosted.tree_LRFF is the object returned by gbm.fit.
> plot.gbm(boosted.tree_LRFF,
+ i.var= 5,
+ n.trees = train.model$finalModel$tuneValue$n.trees,
+ continuous.resolution = 100,
+ return.grid = FALSE,
+ type = "link")
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

In this first section I just re-create the dataset used to fit a gbm from the "gbm.pdf" for the package:
library(gbm)
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1 ** 1.5 + 2 * (X2 ** .5) + mu
sigma <- sqrt(var(Y) / SNR)
Y <- Y + rnorm(N, 0, sigma)
# introduce some missing values
X1[sample(1:N, size = 500)] <- NA
X4[sample(1:N, size = 300)] <- NA
data <- data.frame(Y = Y, X1 = X1, X2 = X2, X3 = X3, X4 = X4, X5 = X5, X6 = X6)
boosted.tree_LRFF <-
gbm(Y ~ X1 + X2 + X3 + X4 + X5 + X6,
data = data,
var.monotone = c(0, 0, 0, 0, 0, 0),
distribution = "gaussian",
n.trees = 1000,
shrinkage = 0.05,
interaction.depth = 3,
bag.fraction = 0.5,
train.fraction = 0.5,
n.minobsinnode = 10,
cv.folds = 3,
keep.data = TRUE,
verbose = FALSE,
n.cores = 1)
Now I plot the tree function values for variable x5, similar to your plot:
plot(boosted.tree_LRFF,
i.var = 5,
n.trees = boosted.tree_LRFF$n.trees,
continuous.resolution = 100,
return.grid = FALSE,
type = "link")
I think your error comes from the n.trees argument. It can be either a constant or a value taken from a fitted gbm object. In my example I took it from boosted.tree_LRFF, which appears to be the name of the fitted object in your question (although of course my data is different).
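A quick sanity check along these lines may help before calling plot(). This is only a sketch: valid_n_trees is a hypothetical helper (it is not part of gbm), and it assumes that the value passed as n.trees should be a single finite number no larger than the number of trees stored in the fitted object.

```r
# Hypothetical helper, not part of the gbm package.
# Returns TRUE only if n is a single finite number between 1 and max_trees.
valid_n_trees <- function(n, max_trees) {
  is.numeric(n) && length(n) == 1 && is.finite(n) && n >= 1 && n <= max_trees
}

valid_n_trees(NULL, 1000)   # FALSE: e.g. a list lookup that found nothing
valid_n_trees(500, 1000)    # TRUE
valid_n_trees(2000, 1000)   # FALSE: more trees than the model holds
```

With the objects from the question, you could call valid_n_trees(train.model$finalModel$tuneValue$n.trees, boosted.tree_LRFF$n.trees) and only plot when it returns TRUE.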

Related

Problem with updating terms in the multinom function

I am trying to add1 all interaction terms on top of a multinomial baseline model using multinom() but it shows the error
trying + x1:x2
Error in if (trace) { : argument is not interpretable as logical
Called from: nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE,
censored = censored, rang = 0, ...)
What is the problem here? I appreciate any input. Here is a reproducible example:
require(nnet)
data <- data.frame(y=sample(1:3, 24, replace = TRUE),
x1 = c(rep(1,12), rep(2,12)),
x2 = rep(c(rep(1,4), rep(2,4), rep(3,4)),2),
x3=rnorm(24),
z1 = sample(1:10, 24, replace = TRUE))
m0 <- multinom(y ~ x1 + x2 + x3 + z1, data = data)
m1 <- add1(m0, scope = .~. + .^2, test="Chisq")
My end goal is to see which terms are appropriate to drop by later adding the line m1[order(m1$'Pr(>Chi)'),].

lm() looped over factor variable while dropping single-level factor variables from the model

I have a dataset where I'm trying to loop over a factor variable (location) and build a separate model for each level of that factor. Depending on the location, however, some factor variables have only a single level, which gives me this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Called from: `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])
So depending on the location, I want to drop any single-level factors from the model. I've tried splitting the data into one dataset that doesn't have any single-level factors and another that does, but I don't know how to drop a given factor variable depending on the location.
This will give you the error:
library(data.table)
dt <- data.table(df, key = "location")
lapply(unique(dt$location), function(z) lm(y ~ x1 + x2 + x3 + x4 + x5, data = dt[J(z),]))
I'm not very comfortable with data.table, however, so any non-data.table solution would be really helpful. Thank you.
Some data:
y <- rnorm(n = 100, mean = 50, 5)
x1 <- rnorm(n = 100, mean = 10, sd = 3)
location <- factor(c(rep(1, 20), rep(2, 20), rep(3, 20), rep(4, 20), rep(5, 20)))
x2 <- rnorm(n = 100, mean = 25, sd = 3)
x3 <- factor(sample(c(0, 1), size = 100, replace = TRUE))
x4 <- factor(ifelse(location == 1, 0,
ifelse(location == 2, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 3, 1,
ifelse(location == 4, 0, sample(c(0, 1), size = 20, replace = TRUE))))))
x5 <- factor(ifelse(location == 1, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 2, 1,
ifelse(location == 3, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 4, sample(c(0, 1), size = 20, replace = TRUE), 0)))))
df <- data.frame(y, location, x1, x2, x3, x4, x5)
Your model's formula is conditional on whether or not there are enough levels in each independent variable to be included.
You can create a formula based on these conditions (e.g., using ifelse()) and then feed the formula to the model inside lapply().
Here is a solution:
lapply(unique(df$location), function(z) {
sub_df = dplyr::filter(df, location == z) # subset by location
form_x4 = ifelse(length(unique(sub_df$x4)) > 1, "+ x4", "")
form_x5 = ifelse(length(unique(sub_df$x5)) > 1, "+ x5", "")
form = as.formula(paste("y ~ x1 + x2 + x3", form_x4, form_x5))
return(lm(data = sub_df, formula = form))
})
The form inside the above lapply(...) combines the consistent part of the lm() formula with multiple variables that meet the conditions to be used in the formula. If a variable only has a single level, the ifelse() statement allows you to treat it as if it's not there when putting it in the formula.
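The same idea can be generalized so that you don't have to write one ifelse() per variable. The following is a minimal sketch in base R, not a definitive implementation: fit_by_group is a hypothetical helper name, and the toy data frame is made up for illustration. It keeps only predictors that take more than one value within each subset and builds the formula with reformulate().

```r
# Hypothetical helper: fit one lm() per group, dropping any predictor
# that has a single unique value inside that group.
fit_by_group <- function(data, group, response, predictors) {
  lapply(split(data, data[[group]]), function(sub_df) {
    varies <- vapply(predictors,
                     function(p) length(unique(sub_df[[p]])) > 1,
                     logical(1))
    keep <- predictors[varies]
    if (length(keep) == 0) keep <- "1"  # fall back to an intercept-only model
    lm(reformulate(keep, response = response), data = sub_df)
  })
}

# Toy data (made up): x4 is constant in location 1 and varies in location 2.
set.seed(42)
toy <- data.frame(
  y        = rnorm(40),
  location = factor(rep(1:2, each = 20)),
  x1       = rnorm(40),
  x4       = factor(c(rep(0, 20), rep(c(0, 1), 10)))
)
fits <- fit_by_group(toy, "location", "y", c("x1", "x4"))
lapply(fits, formula)
```

The result is a list of models named by location level, so fits[["1"]] is fitted without x4 and fits[["2"]] with it.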

Confusion about 'standardize' option of glmnet package in R

I am confused about the standardize option of the glmnet package in R. I get different coefficients when I standardize the covariate matrix myself and set standardize=FALSE than when I leave the covariate matrix unstandardized and set standardize=TRUE. I assumed they would be the same! The two cases are shown below with the models ridge.mod1 and ridge.mod2. I also created a third model (ridge.mod3) that standardizes the outcome as well as the covariate matrix and uses standardize=FALSE, to check whether I needed to standardize the outcome too in order to get the same coefficients as in ridge.mod1.
set.seed(1)
y <- rnorm(30, 20, 10)
x1 <- rnorm(30, 5, 2)
x2 <- x1 + rnorm(30, 0, 5)
cor(x1,x2)
x <- as.matrix(cbind(x1,x2))
z1 <- scale(x1)
z2 <- scale(x2)
z <- as.matrix(cbind(z1,z2))
y.scale <- scale(y)
n <- 30
# Fixing foldid for proper comparison
foldid=sample(rep(seq(5),length=n))
table(foldid)
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = TRUE)
ridge.mod1 <- glmnet(x, y, alpha = 0, standardize = TRUE)
coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
> coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
x1 2.856136e-37
x2 4.334910e-38
cv.ridge.mod2 <- cv.glmnet(z, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod2 <- glmnet(z, y, alpha = 0, standardize = FALSE)
coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
> coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
V1 4.391657e-37
V2 2.389751e-37
cv.ridge.mod3 <- cv.glmnet(z, y.scale, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod3 <- glmnet(z, y.scale, alpha = 0, standardize = FALSE)
coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
> coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.023487e-16
V1 4.752255e-38
V2 2.585973e-38
Could anyone please tell me what's going on there and if (or how) I can get the same coefficients as in ridge.mod1 with prior standardization (in the data processing step) and then using standardize=FALSE?
Update: (what I tried based on the comments below)
So, I tried standardizing by SS/n instead of SS/(n-1). I tried by standardizing both y and x. Neither gave me coefficients equal to the de-standardized coefficients of model 1.
## Standardizing by sqrt(SS(X)/n) like glmnet instead of sqrt(SS(X)/(n-1)), which is what the scale command does
Xs <- apply(x, 2, function(m) (m - mean(m)) / sqrt(sum(m^2) / n))
Ys <- (y-mean(y)) / sqrt(sum(y^2) / n)
# Standardizing only X by sqrt(SS(X)/n)
cv.ridge.mod4 <- cv.glmnet(Xs, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod4 <- glmnet(Xs, y, alpha = 0, standardize = FALSE)
coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[2]/sd(x1)
[1] 7.995171e-38
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[3]/sd(x2)
[1] 2.957854e-38
# Standardizing both Y and X by sqrt(SS(X)/n) but neither is centered
cv.ridge.mod6 <- cv.glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod6 <- glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, standardize = FALSE)
coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[2] / (sqrt(sum(x1^2) / n))
[1] 1.019023e-39
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[3] / (sqrt(sum(x2^2) / n))
[1] 9.189263e-40
What is it that still is wrong there?
I tweaked your code so that I could work with a more sensible problem. To reproduce the same coefficients when switching between standardize=TRUE and standardize=FALSE, you first need to standardize the variables with the (1/N) variance estimator rather than the (1/(N-1)) one used by scale(). For this example I also centered the variables to get rid of the constant, and I focus only on the coefficients of the variables. After that, note that a coefficient fitted on the standardized data is the original-scale coefficient multiplied by sdN(x_j)/sdN(y), so you have to invert that relation (multiply by sdN(y)/sdN(x_j)) to get the de-standardized coefficients. I do that in the following code.
set.seed(1)
x1 <- rnorm(300, 5, 2)
x2 <- x1 + rnorm(300, 0, 5)
x3 <- rnorm(300, 6, 5)
e= rnorm(300, 0, 1)
y <- 0.3*x1+3.5*x2+x3+e
x <- as.matrix(cbind(x1,x2,x3))
sdN=function(x){
sigma=sqrt( (1/length(x)) * sum((x-mean(x))^2))
return(sigma)
}
n=300
foldid=sample(rep(seq(5),length=n))
g1=(x1-mean(x1))/sdN(x1)
g2=(x2-mean(x2))/sdN(x2)
g3=(x3-mean(x3))/sdN(x3)
gy=(y-mean(y))/sdN(y)
equis <- as.matrix(cbind(g1,g2,g3))
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid,standardize = TRUE)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
cv.ridge.mod2 <- cv.glmnet(equis, gy, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE, intercept=FALSE)
beta=coef(cv.ridge.mod2, s=cv.ridge.mod2$lambda.min)
beta[2]*sdN(y)/sdN(x1)
beta[3]*sdN(y)/sdN(x2)
beta[4]*sdN(y)/sdN(x3)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
This yields the results:
> beta[2]*sdN(y)/sdN(x1)
[1] 0.5984356
> beta[3]*sdN(y)/sdN(x2)
[1] 3.166033
> beta[4]*sdN(y)/sdN(x3)
[1] 0.9145646
>
> coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.5951423
x1 0.5984356
x2 3.1660328
x3 0.9145646
As you can see, the coefficients agree to 4 decimal places. I hope this answers your question.
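To see why scale() alone does not reproduce the (1/N) standardization used above, here is a small base-R check. It is only a sketch: the sdN function mirrors the one defined in the answer, and the sample vector is made up. It shows that the two standard deviations differ by a factor of sqrt((n-1)/n), so the standardized values differ by sqrt(n/(n-1)).

```r
set.seed(1)
x <- rnorm(30, 5, 2)          # made-up sample, same size as in the question
n <- length(x)

# Standard deviation with the 1/N variance estimator (as in the answer)
sdN <- function(v) sqrt(mean((v - mean(v))^2))

# sd() divides by n - 1, so the two differ by sqrt((n - 1) / n)
sd_check <- all.equal(sdN(x), sd(x) * sqrt((n - 1) / n))

# Consequently scale() output is off by the inverse factor
z_scale <- as.numeric(scale(x))
z_N     <- (x - mean(x)) / sdN(x)
z_check <- all.equal(z_N, z_scale * sqrt(n / (n - 1)))

c(sd_check, z_check)
```

With n = 30 the factor is about 1.017, which is small but enough to change the penalized coefficients.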

KNN outlier detection in R

I am trying to run a script I was given to perform outlier detection using a weighted KNN outlier score, but keep getting the following error:
Error in apply(kNNdist(x = dat, k = k), 1, mean) :
dim(X) must have a positive length
The script I am trying to run is as below. It is a single block of script, but I have added a comment directly above the section of the script that is causing the error, which is the function:
WKNN_Outlier <- apply(kNNdist(x=dat, k = k), 1, mean)
If anyone has any better or simpler ideas for unsupervised outlier detection, I am all ears (so to speak...)
library(dbscan)
library(ggplot2)
set.seed(0)
x11 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x1 coordinate)
x21 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x2 coordinate)
x12 <- rnorm(n = 100, mean = 20, sd = 1) # Cluster 2 (x1 coordinate)
x22 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 2 (x2 coordinate)
x13 <- rnorm(n = 100, mean = 15, sd = 3) # Cluster 3 (x1 coordinate)
x23 <- rnorm(n = 100, mean = 25, sd = 3) # Cluster 3 (x2 coordinate)
x14 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x1 coordinate)
x24 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x2 coordinate)
dat <- data.frame(x1 = c(x11,x12,x13,x14), x2 = c(x21,x22,x23,x24))
( g0a <- ggplot() + geom_point(data=dat, mapping=aes(x=x1, y=x2), shape = 19) )
k <- 4 # KNN parameter
top_n <- 20 # No. of top outliers to be displayed
KNN_Outlier <- kNNdist(x=dat, k = k)
rank_KNN_Outlier <- order(x=KNN_Outlier, decreasing = TRUE) # Sorting (descending)
KNN_Result <- data.frame(ID = rank_KNN_Outlier, score = KNN_Outlier[rank_KNN_Outlier])
head(KNN_Result, top_n)
graph <- g0a +
geom_point(data=dat[rank_KNN_Outlier[1:top_n],], mapping=aes(x=x1,y=x2), shape=19,
color="red", size=2) +
geom_text(data=dat[rank_KNN_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=rank_KNN_Outlier[1:top_n]), size=2.5)
graph
## Use KNNdist() to calculate the weighted KNN outlier score
k <- 4 # KNN parameter
top_n <- 20 # No. of top outliers to be displayed
The WKNN_Outlier line below is what causes the error. From what I can gather, the apply function shouldn't have any issues, as the data (dat) is converted into a data.frame, which should prevent the error, but it doesn't.
WKNN_Outlier <- apply(kNNdist(x=dat, k = k), 1, mean) # Weighted KNN outlier score (mean)
rank_WKNN_Outlier <- order(x=WKNN_Outlier, decreasing = TRUE)
WKNN_Result <- data.frame(ID = rank_WKNN_Outlier, score = WKNN_Outlier[rank_WKNN_Outlier])
head(WKNN_Result, top_n)
ge1 <- g0a +
geom_point(data=dat[rank_WKNN_Outlier[1:top_n],], mapping=aes(x=x1,y=x2), shape=19,
color="red", size=2) +
geom_text(data=dat[rank_WKNN_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=rank_WKNN_Outlier[1:top_n]), size=2.5)
ge1
The function kNNdist(x=dat, k = k) produces a vector not a matrix, which is why when you try to do the apply function it tells you dim(X) must have a positive length (vectors have a NULL dim).
Try:
WKNN_Outlier <- apply(kNNdist(x=dat, k = k, all=TRUE), 1, mean)
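The difference between a vector and a matrix here can be checked without dbscan. The sketch below uses made-up numbers in place of the kNNdist() output: apply() fails on a plain vector because dim() is NULL, and works on a matrix with one row per point and one column per neighbour.

```r
# Stand-in for a distance vector (one value per point): a plain vector
v <- c(1.2, 0.8, 2.5)
is.null(dim(v))   # TRUE: vectors have no dim attribute

# apply() on a vector raises the error from the question
err <- tryCatch(apply(v, 1, mean), error = function(e) conditionMessage(e))
err

# Stand-in for a distance matrix: rows are points, columns are neighbours
m <- cbind(c(1, 2, 3), c(3, 4, 5))
row_scores <- apply(m, 1, mean)   # mean distance per point
row_scores
```

For a numeric matrix, rowMeans(m) gives the same per-row means and is usually faster than apply(m, 1, mean).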

Optimization of a function in R ( L-BFGS-B needs finite values of 'fn')

I would like to maximize the following function f1. I wrote the code below using lower and upper bounds, since we know that all of our parameters are greater than or equal to zero, and x4 should always be less than or equal to x6. How can I fix this problem in R? I want to get a finite maximum value of the function f1.
x1 = 0.1
x2 = 0.1
x3 = 2
x4 = 10
x5 = 2
x6 = 30
x7 = 1
par = list(x1=x1, x2=x2, x3=x3, x4=x4,x5=x5, x6=x6, x7=x7)
par1 = c(1, 1, 2, 1.5, 1, 1.5, 1)
f1 = function(x, par){
sum(log(exp(-(par$x7)*(par$x1*x + par$x2*x^2/2 +
par$x3 * (par$x4-x)^3/3+par$x5 *(x-par$x6)^3/3))))
}
x = seq(0, 500, length=100)
z = c(par$x1, par$x2, par$x3, par$x4, par$x5, par$x6, par$x7)
f2 = function(z){
par.new = list(x1 = z[1], x2 = z[2], x3 = z[3], x4 = z[4]
, x5 = z[5], x6 = z[6], x7 = z[7])
f1(x, par.new)
}
optim(par1, f2, method = "L-BFGS-B", lower = rep(0, length(z)),
upper = rep(Inf,length(z)),control = list(trace = 5,fnscale=-1))
> optim(par1, f2, method = "L-BFGS-B", lower = rep(0, length(z)),
upper = rep(Inf, length(z)), control = list(trace = 5,fnscale=-1))
N = 7, M = 5 machine precision = 2.22045e-16
L = 0 0 0 0 0 0 0
X0 = 1 1 2 1.5 1 1.5 1
U = inf inf inf inf inf inf inf
At X0, 0 variables are exactly at the bounds
Error in optim(par1, f2, method = "L-BFGS-B", lower = rep(0, length(z)), :
L-BFGS-B needs finite values of 'fn'
At some point in the optimization, your function returns a non-finite value: the exp(...) inside f1 exceeds .Machine$double.xmax (which is 1.797693e+308 on my machine) and overflows to Inf.
Since your function f1(...) is defined as sum(log(exp(...))), and since log(exp(z)) = z for any z, why not use this:
par1 = c(1, 1, 2, 1.5, 1, 1.5, 1)
x = seq(0, 500, length=100)
f1 = function(par, x){
sum(-(par[7])*(par[1]*x + par[2]*x^2/2 +
par[3] * (par[4]-x)^3/3+par[6] *(x-par[7])^3/3))
}
result <- optim(par1, f1, x=x,
method = "L-BFGS-B",
lower = rep(0, length(par1)), upper = rep(Inf,length(par1)),
control = list(trace = 5,fnscale=-1))
result$par
# [1] 2.026284e-01 2.026284e-01 8.290126e+08 0.000000e+00 1.000000e+00 9.995598e+35 2.920267e+27
result$value
# [1] 2.423136e+147
Note that the vector of parameters (par) must be the first argument to f1.
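The overflow can be seen directly at the starting values. The sketch below is an illustration, not the original poster's code: inner, f_naive and f_direct are hypothetical names, and the parameter vector is assumed to follow the question's f2 mapping (par[1] to par[7] correspond to x1 to x7).

```r
x    <- seq(0, 500, length = 100)
par1 <- c(1, 1, 2, 1.5, 1, 1.5, 1)

# Exponent of the question's f1, with par[1..7] standing for x1..x7
inner <- function(par, x) {
  -par[7] * (par[1] * x + par[2] * x^2 / 2 +
             par[3] * (par[4] - x)^3 / 3 + par[5] * (x - par[6])^3 / 3)
}

# Original form: exp() overflows for large x, so the sum is not finite
f_naive  <- function(par, x) sum(log(exp(inner(par, x))))

# Simplified form: same quantity mathematically, no exp()/log() round trip
f_direct <- function(par, x) sum(inner(par, x))

is.finite(f_naive(par1, x))    # FALSE
is.finite(f_direct(par1, x))   # TRUE
```

Dropping the log(exp(...)) round trip removes the overflow at the starting point, which is what L-BFGS-B needs in order to begin iterating.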
