Error using optim() in purrr(): vmmin is not finite - r

I've received a function from a colleague that calculates the top height a tree of a given age must have in order to reach a certain height at age 100 (the site index, SI). My job is to put this into purrr to calculate the height for a number of SI and age combinations in order to plot the growth trajectories.
First I create the base function:
SI_tall <- function(topheight, age, si){
  paramasi <- 25
  parambeta <- 7395.6
  paramb2 <- -1.7829
  refAge <- 100
  d <- parambeta * (paramasi^paramb2)
  r <- (((topheight - d)^2) + (4 * parambeta * topheight * (age^paramb2)))^0.5
  ## height at reference age
  h2 <- (topheight + d + r) / (2 + (4 * parambeta * (refAge^paramb2)) / (topheight - d + r))
  return(abs(h2 - si))
}
To calculate the top height for a tree of a given age and site index, we use this function inside another one. The height is given by:
my.age <- 10
my.si <- 30
new.topheight <- function(my.si, my.age){
  optim(par = list(topheight = 10), ## this topheight is just an initial value
        method = 'L-BFGS-B', fn = SI_tall, si = my.si, age = my.age,
        lower = 0, upper = 100)$par
}
This works nicely for each value.
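For instance, a quick single-value check with the values defined above (a minimal sketch; the returned number is not shown here):
## optimised top height for a tree of age 10 on a site with SI 30
new.topheight(my.si = 30, my.age = 10)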
Since I want to draw a trajectory of the growth of each tree, I'll first need to calculate the ages and site indices at a required resolution to plot. I create two vectors to cross:
my.age <- seq(0, 110, by = 0.2)
my.si <- c(5, 10, 15, 20, 25, 30, 35)
si.crossing <- tidyr::crossing(my.age, my.si)

si.crossing %>%
  group_by(my.age, my.si) %>%
  nest() %>%
  mutate(topheight = map2(.x = my.age, .y = my.si,
                          .f = ~new.topheight(my.si = .y, my.age = .x)))
Here's the error I get:
Error in optim(par = list(topheight = 30), method = "BFGS", fn = SI_tall, :
initial value in 'vmmin' is not finite
What's going wrong? Many thanks.

Pass the values directly to map2_dbl() and wrap the call in tryCatch() to handle the errors.
library(dplyr)
library(purrr)
si.crossing %>%
  mutate(topheight = map2_dbl(my.si, my.age,
                              ~tryCatch(new.topheight(.x, .y),
                                        error = function(e) NA)))
Or use mapply() in base R:
si.crossing$topheight <- mapply(function(x, y)
  tryCatch(new.topheight(x, y), error = function(e) NA),
  si.crossing$my.si, si.crossing$my.age)

We can use possibly() from purrr:
library(purrr)
pnew.topheight <- possibly(new.topheight, otherwise = NA)
si.crossing %>%
  mutate(topheight = map2_dbl(my.si, my.age, pnew.topheight))


Unable to make code run - Error in `lag()`:! `n`

I need help solving this error; I am not sure how to fix it, but it appears when I run "dp_stat" at the end, in my OLS model. Essentially, I try to find the cumulative difference between a benchmark and a given individual predictive model. Here dp is my independent variable and sg is my dependent variable; datanu is my Excel data. I'm not sure how to attach the data here, but here is a link to the Excel file and the code: https://drive.google.com/drive/folders/12BOuNBODURIP7CQIBZWMmHFc1d7zXHxN?usp=sharing If anyone has a fix it would mean the world!
"#Error in lag():! n must be a positive integer, not a double vector of length 1."
rm(list= ls()) # Clear global environment
invisible(gc()) # Free up unused R-occupied memory
cat("\014") # Clear console output: equivalent to ctrl + L
library("tseries")
library("readxl")
library("Metrics")
library("lubridate")
library("ggplot2")
library("data.table")
library("dyn")
library("reshape2")
# header TRUE because the first row contains the names
datanu <- read_xlsx("~/Documents/6.semester/Bachelor/Data/datanu.xlsx",
na = "NaN",
sheet = "datax",
)
myts <- ts(datanu, start=c(1872, 1), end=c(2020, 12), frequency=12)
plot(myts[, c("dp", "dy", "ep", "de")])
get_statistics <- function(myts, dp, sg, h=1, start=1872, end=2020, est_periods_OOS = 20) {
#### IS ANALYSIS
#1. Historical mean model for a portfolio
avg <- mean(window(myts, start, end)[, sg], na.rm=TRUE)
IS_error_N <- (window(myts, start, end)[, sg] - avg)
#2. OLS model
#reg <- dyn$lm(sg ~ lag(as.numeric(dp), 1), data=window(myts, start, end))
reg <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1), data=window(myts, start, end)) #Error in `lag()`:! `n` must be a positive integer, not a double vector of length 1.
IS_error_A <- reg$residuals
#OOS ANALYSIS
OOS_error_N <- numeric(end - start - est_periods_OOS)
OOS_error_A <- numeric(end - start - est_periods_OOS)
# only use information available up to the point the forecast is made
j <- 0
for (i in (start + est_periods_OOS):(end-1)) {
j <- j + 1
#Get the actual ERP that you want to predict
actual_ERP <- as.numeric(window(myts, i+1, i+1)[, sg])
#1. Historical mean model
OOS_error_N[j] <- actual_ERP - mean(window(myts, start, i)[, sg], na.rm=TRUE)
#2. OLS model
reg_OOS <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1),
data=window(myts, start, i))
#Compute_error
df <- data.frame(x=as.numeric(window(myts, i, i)[, dp]))
names(df) <- dp
pred_ERP <- predict.lm(reg_OOS, newdata=df)
OOS_error_A[j] <- pred_ERP - actual_ERP
}
#Compute statistics
MSE_N <- mean(OOS_error_N^2)
MSE_A <- mean(OOS_error_A^2)
T <- length(!is.na(myts[, sg]))
OOS_R2 <- 1 - MSE_A/MSE_N
#Is the -1 enough (maybe -2 needed because of lag)?
OOS_oR2 <- OOS_R2 - (1-OOS_R2)*(reg$df.residual)/(T - 1)
dRMSE <- sqrt(MSE_N) - sqrt(MSE_A)
##
#### CREATE PLOT
IS <- cumsum(IS_error_N[2:length(IS_error_N)]^2)-cumsum(IS_error_A^2)
OOS <- cumsum(OOS_error_N^2)-cumsum(OOS_error_A^2)
df <- data.frame(x=seq.int(from=start + 1 + est_periods_OOS, to=end),
IS=IS[(1 + est_periods_OOS):length(IS)],
OOS=OOS) #Because you lose one observation due to the lag
#Shift IS errors vertically, so that the IS line begins
# at zero on the date of the first OOS prediction (see Goyal/Welch (2008, p. 1465))
df$IS <- df$IS - df$IS[1]
df <- melt(df, id.var="x")
plotGG <- ggplot(df) +
geom_line(aes(x=x, y=value,color=variable)) +
geom_rect(data=data.frame(),#Needed by ggplot2, otherwise not transparent
aes(xmin=2008, xmax=2010,ymin=-0.2,ymax=0.2),
fill='red',
alpha=0.1) +
scale_y_continuous('Cumulative SSE Difference', limits=c(-0.2, 0.2)) +
scale_x_continuous('Year')
##
return(list(IS_error_N = IS_error_N,
IS_error_A = reg$residuals,
OOS_error_N = OOS_error_N,
OOS_error_A = OOS_error_A,
IS_R2 = summary(reg)$r.squared,
IS_aR2 = summary(reg)$adj.r.squared,
OOS_R2 = OOS_R2,
OOS_oR2 = OOS_oR2,
dRMSE = dRMSE,
plotGG = plotGG))
}
dp_stat <- get_statistics(myts, "dp", "sg", start=1872)
dp_stat$plotGG
As the error message states, n must be a positive integer, not a double vector of length 1. The error comes from passing n = -1 (i.e., a negative number) as an argument. I assume your idea is to lag by a negative number of positions; however, this lag() function only accepts a positive number of lag positions. Instead of lag(), you should use lead() with n = 1 to achieve the desired result.
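Concretely, the regression line could be changed like this (a sketch, assuming dplyr's lead()/lag() is what is being picked up in your session):
# replace lag(..., -1) with lead(..., 1)
reg <- dyn$lm(eval(parse(text = sg)) ~ lead(eval(parse(text = dp)), 1),
              data = window(myts, start, end))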

Error: Mean Distance Between Objects Zero

I am trying to learn about the "kohonen" package in R. In particular, there is a function called supersom() (https://www.rdocumentation.org/packages/kohonen/versions/3.0.10/topics/supersom , corresponding to the SOM (Self-Organizing Maps) algorithm used in unsupervised machine learning) that I am trying to apply to some data.
Below (following a previous question: R error: "Error in check.data : Argument Should be Numeric"), I show how I learned to apply the supersom() function to some artificially created data with both "factor" and "numeric" variables.
#the following code works
#load libraries
library(kohonen)
library(dplyr)
#create and format data
a =rnorm(1000,10,10)
b = rnorm(1000,10,5)
c = rnorm(1000,5,5)
d = rnorm(1000,5,10)
e <- sample( LETTERS[1:4], 100 , replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
f <- sample( LETTERS[1:5], 100 , replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2) )
g <- sample( LETTERS[1:2], 100 , replace=TRUE, prob=c(0.5, 0.5) )
data = data.frame(a,b,c,d,e,f,g)
data$e = as.factor(data$e)
data$f = as.factor(data$f)
data$g = as.factor(data$g)
cols <- 1:4
data[cols] <- scale(data[cols])
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Everything works well. The problem is, when I try to apply the supersom() function to more realistic and bigger data, I get the following error:
"Error: Non-informative layers present : mean distances between objects zero"
When I look at the source code for this function (https://rdrr.io/cran/kohonen/src/R/supersom.R), I notice a reference for the same error:
if (any(sapply(meanDistances, mean) < .Machine$double.eps))
stop("Non-informative layers present: mean distance between objects zero")
Can someone please show me how I might be able to resolve this error, i.e. make the "supersom()" function work with factor and numeric data?
I thought that perhaps removing duplicate rows and NA's might fix this problem:
data <- na.omit(data)
data <- unique(data)
However the same error ("Non-informative layers present : mean distances between objects zero") is still there.
Can someone please help me figure out what might be causing this error? Note: when I remove the "factor" variables, everything works fine.
Sources:
https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
https://www.rdocumentation.org/packages/kohonen/versions/2.0.5/topics/supersom
https://rdrr.io/cran/kohonen/src/R/supersom.R
The error happens if you have numeric columns whose mean is 0. You can reproduce the error by setting any one column to 0:
data$a <- 0
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Error in supersom(data = as.list(data), grid = somgrid(10, 10, "hexagonal"), :
Non-informative layers present: mean distance between objects zero
Maybe you can investigate why those columns have a mean of 0, or remove the columns with 0 means from the data:
library(kohonen)
library(dplyr)
data <- data %>% select(where(~(is.numeric(.) && mean(.) > 0) | !is.numeric(.)))
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)

uniroot() function in source code does not work with modification; Could not figure out the error

I was trying to find the coordinates of the intersection of two curves in R. The input data are coordinates of empirical points from the two curves. My solution is to use the function curve_intersect(). I need to do this for 2000 replications (i.e., 2000 pairs of curves), so I put the data in two lists. Each list contains 2000 data frames, with the x and y coordinates of one curve in each data frame.
Here is my data: data
Below is the code that I used.
threshold_or1 <- map2_df(recall_or1_4, precision_or1_4,
                         ~curve_intersect(.x, .y, empirical = TRUE, domain = NULL))
# recall_or1_4 is a list of 2000 data frames; each data frame
#   contains coordinates from curve #1.
# precision_or1_4 is a list of 2000 data frames; each data frame
#   contains coordinates from curve #2.
I got this error message below.
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), : f() values at end points not of opposite sign
Since the function curve_intersect() can be successfully applied to some individual data frames from the two lists, I ran the following code in order to see exactly which pair of data frames made the process fail.
test <- for (i in 1:2000){
curve_intersect(recall_or1_4[[i]], precision_or1_4[[i]], empirical = TRUE, domain = NULL)
print(paste("i=",i))}
Then, I got the following message, which means that the process ran successfully until it reached data pair #460. So I checked that individual data pair.
[1] "i= 457"
[1] "i= 458"
[1] "i= 459"
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), : f() values at end points not of opposite sign
I plotted data pair #460.
test1 <- precision_or1_4[[460]] %>% mutate(statistics = 'precision')
test2 <- recall_or1_4[[460]] %>% mutate(statistics = 'recall')
test3 <- rbind(test1, test2)
test3 <- test3 %>% mutate(statistics = as.factor(statistics))
curve_test3 <- ggplot(test3, aes(x = x, y = y))+
geom_line(aes(colour = statistics))
curve_test3
[Plot of data pair #460 - find coordinates of the intersection point]
I then went to modify the source code of curve_intersect(). The original source code is
curve_intersect <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(function(x) curve1_f(x) - curve2_f(x),
c(min(curve1$x), max(curve1$x)))$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
I modified the uniroot() part in the third if statement. Instead of using c(min(curve1$x), max(curve1$x)) as the interval argument of uniroot(), I used lower = -100000000, upper = 100000000. The modified function is:
curve_intersect_tq <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(function(x) curve1_f(x) - curve2_f(x),
lower = -100000000, upper = 100000000)$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
I tried changing the values of the lower and upper arguments. It did not work; I got the same error message, as shown below.
curve_intersect_tq(recall_or1_4[[460]], precision_or1_4[[460]], empirical = TRUE, domain = NULL)
Error in uniroot(function(x) curve1_f(x) - curve2_f(x), c(min(curve1$x), :
f() values at end points not of opposite sign
I also tried to use possibly(fun, NA) from purrr (part of the tidyverse), hoping that the process could keep running despite an error. It did not work when I used either
(1) possibly(curve_intersect(), NA) or
(2) possibly(uniroot(), NA)
The same error message appeared.
Why do I have the error message? What could be possible solutions? Thanks in advance.
Might be a little late to the party, but here's why your code still fails and what you could do, depending on what you want to get out of your analysis:
First of all, the reason why your code fails, even after adaptation, is that you are merely telling uniroot to search a wider window in x. However, the underlying curves never intersect - there just isn't any x with curve1_f(x) - curve2_f(x) == 0 to be found.
From the doc of uniroot:
"The function values at the endpoints must be of opposite signs (or zero), for extendInt="no", the default."
In the original curve_intersect implementation, uniroot is searching the x-interval that is defined in your data (that's the c(min(curve1$x), max(curve1$x))). In your alteration, you're telling it to search in the x interval [-100000000, 100000000]. You could as well have set extendInt = "yes", but it wouldn't change anything.
The problem doesn't lie in the search interval, it lies with approxfun!
approxfun merely helps you by interpolating the empirical data between the points you pass in. Outside of the data you pass in, the returned function has to be told what to do.
approxfun lets you specify explicit y values to be returned outside the empirically defined window (with its yleft/yright parameters), or lets you set a rule for each side.
In the code you posted above, rule = 2 means "the value at the closest data extreme is used". So approxfun does not extrapolate the data you pass in; it only extends the known values as constants.
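A tiny standalone illustration of that behaviour (made-up data, just to show what rule = 2 does):
f <- approxfun(x = c(0, 1), y = c(0, 10), rule = 2)
f(c(-5, 0.5, 5))
[1]  0  5 10
# interpolated inside [0, 1], clamped to the endpoint values outside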
We can plot how curve1_f and curve2_f will extend outside the empirically defined x-interval into infinity:
library(tidyverse)  # for tibble(), pivot_longer(), and ggplot()

tibble(
x = seq(0, 1, by = 0.001),
curve1_approxed = curve1_f(x),
curve2_approxed = curve2_f(x)
) %>%
pivot_longer(starts_with("curve"), names_to = "curve", values_to = "y") %>%
ggplot(aes(x = x, y = y, color = curve)) +
geom_line() +
geom_vline(xintercept = c(min(curve1$x), max(curve1$x)), color = "grey75")
So, now to what you can do to get your code to not crash:
(spoiler: it pretty much depends on what you're trying to accomplish with your project)
Option 1: accept that there is no intersection within the observed limits of your data.
If you don't want to make any assumptions, I'd suggest you wrap your mapped function in a tryCatch statement and let it fail where the out-of-the-box solution doesn't give you any results. Let's run this for the part of your list that previously made the whole thing crash:
threshold_or1.fix1 <- map2_df(
recall_or1_4, precision_or1_4,
~tryCatch({
curve_intersect(.x, .y, empirical = TRUE, domain = NULL)
}, error = function(e){
return(tibble(.rows = 1))
}),
.id = "i"
)
Now, there is just an NA row when curve_intersect isn't able to give you a result.
threshold_or1.fix1[459:461,]
# A tibble: 3 x 3
i x y
<chr> <dbl> <dbl>
1 459 0.116 0.809
2 460 NA NA
3 461 0.264 0.773
Option 2: try to extrapolate your data with a linear model.
In this case, we'll use a custom curve_intersect function. Let's wrap the problematic uniroot call in a tryCatch, and if no root can be found, we'll fit an lm for each curve and let uniroot find an intersection of the fitted lines.
That might or might not make sense in the light of your experiment, so I'll let you be the judge here. And obviously you can use other models than the simplistic lm if your data is more complex than that...
Just to visualize this approach vs the default:
tibble(
x = seq(-1, 2, by = 0.001),
curve1_approxed = curve1_f(x),
curve2_approxed = curve2_f(x),
curve1_lm = predict(lm(y ~ x, data = curve1), newdata = tibble(x = x)),
curve2_lm = predict(lm(y ~ x, data = curve2), newdata = tibble(x = x))
) %>%
pivot_longer(starts_with("curve"), names_to = "curve", values_to = "y") %>%
ggplot(aes(x = x, y = y, color = curve)) +
geom_line() +
geom_vline(xintercept = c(min(curve1$x), max(curve1$x)), color = "grey75")
You see, where approxfun "fails", with lm we make the assumption that we can extrapolate linearly and find an intersection around x = 1.27, outside of your observed frame.
To go for that second approach and include an extrapolation with lm in our search, you could throw together something like this:
(here, too, only the third if is edited.)
curve_intersect_custom <- function(curve1, curve2, empirical=TRUE, domain=NULL) {
if (!empirical & missing(domain)) {
stop("'domain' must be provided with non-empirical curves")
}
if (!empirical & (length(domain) != 2 | !is.numeric(domain))) {
stop("'domain' must be a two-value numeric vector, like c(0, 10)")
}
if (empirical) {
return(
tryCatch({
# Approximate the functional form of both curves
curve1_f <- approxfun(curve1$x, curve1$y, rule = 2)
curve2_f <- approxfun(curve2$x, curve2$y, rule = 2)
# Calculate the intersection of curve 1 and curve 2 along the x-axis
point_x <- uniroot(
f = function(x) curve1_f(x) - curve2_f(x),
interval = c(min(curve1$x), max(curve1$x))
)$root
# Find where point_x is in curve 2
point_y <- curve2_f(point_x)
return(list(x = point_x, y = point_y, method = "approxfun"))
}, error = function(e) {
tryCatch({
curve1_lm_f <- function(x) predict(lm(y ~ x, data = curve1), newdata = tibble(x = x))
curve2_lm_f <- function(x) predict(lm(y ~ x, data = curve2), newdata = tibble(x = x))
point_x <- uniroot(
f = function(x) curve1_lm_f(x) - curve2_lm_f(x),
interval = c(min(curve1$x), max(curve1$x)),
extendInt = "yes"
)$root
point_y <- curve2_lm_f(point_x)
return(list(x = point_x, y = point_y, method = "lm"))
}, error = function(e) {
return(list(x = NA_real_, y = NA_real_, method = NA_character_))
})
})
)
} else {
# Calculate the intersection of curve 1 and curve 2 along the x-axis
# within the given domain
point_x <- uniroot(function(x) curve1(x) - curve2(x), domain)$root
# Find where point_x is in curve 2
point_y <- curve2(point_x)
}
return(list(x = point_x, y = point_y))
}
For your problematic list elements, this now tries to extrapolate with the naively fitted lm model:
threshold_or1.fix2 <- map2_df(
recall_or1_4, precision_or1_4,
~curve_intersect_custom(.x, .y, empirical = TRUE, domain = NULL),
.id = "i"
)
threshold_or1.fix2[459:461,]
# A tibble: 3 x 4
i x y method
<chr> <dbl> <dbl> <chr>
1 459 0.116 0.809 approxfun
2 460 1.27 0.813 lm
3 461 0.264 0.773 approxfun
Hope this helps a little in understanding and fixing your issue :)

How to calculate running slope for rlm using runner?

I have a data frame "customers" built from customer id, month, and total purchases in that month.
I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I've expected:
Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
cbind(1:length(x))
0.6888112
So I ran into similar problems and finally came to the solution below using slider. This provides a 3-day rolling estimate (of course you can change that as you see fit). It doesn't quite get to your answer (which you could probably get with loops), but it gets most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>%
mutate(t = 1:n() %>% as.numeric())
dat %>%
mutate(results = slide_dbl(.x = .,
.f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
.before = 2,
.complete = T))
It looks like that's the way to go, thanks!
It seems like what caused the singularity was that I didn't change .complete from its default FALSE to TRUE.
So, combined with your suggestion, this is how I made it work (it took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below):
slope_rlm <- function(x) {
x=as.numeric(x)
prep = tibble(data=x)%>%mutate(t=1:n()%>%as.numeric())
return(rlm(data~t,data=prep)$coefficients[2])
}
customers_rlm = customers %>%
mutate(cust_rlm_12=slide_dbl(total_purchases,slope_rlm,.before=11,.complete=T))
Consider data for two customers spanning 1000 days. total_purchases is cumulative within each customer, and each purchase size is ~Pois(5).
set.seed(1)
customers <- data.frame(
id = factor(rep(1:2, length.out = 100)),
date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
group_by(id) %>%
mutate(
total_purchases = cumsum(rpois(n(), lambda = 5))
)
When calculating a regression in a rolling window, make sure that you handle errors coming from insufficient degrees of freedom, singularity, etc. - that is why I've put a tryCatch around the rlm call: if there is any error, the function returns NA for the failing window.
The data below is grouped by id, which means that the model is calculated per customer. The yearly rolling regression should converge to slope = 5 (+/- random error).
library(runner)

customers <- customers %>%
group_by(id) %>%
mutate(
slope = runner(
x = .,
f = function(x) {
tryCatch(
rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
error = function(e) NA
)
},
idx = "date",
k = "year"
)
)
Plotting the slope over time for each customer:
library(ggplot2)

ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
geom_line() +
geom_hline(yintercept = 5, color = "red")

Error with multiscale hierarchical clustering in R

I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why is it that I'm getting these errors from pvclust at 3 dimensions and below, but not from hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?
At the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
Then, digging deeper, we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the parameter r for pvclust is r = seq(.5, 1.4, by = .1). However, as we can see, this value is being changed somewhere:
Bootstrap (r = 0.33)...
so what we get is size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl, ] then has only 1 row - fewer than the 2 objects hclust needs. After correcting r, the call only produces a warning, which is probably harmless, and output is given too:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
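Just to spell out the arithmetic behind the original failure (using the 3-row mat from the question):
n <- nrow(mat)   # 3 rows in this data set
round(n * 0.33)  # bootstrap sample size for the reported r = 0.33
[1] 1
# only 1 row sampled, hence "must have n >= 2 objects to cluster"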
Let me know if the result is satisfactory.
