How to do top down forecasted proportions for hts objects with 2 levels? - r

I previously asked a question about getting top-down forecasted proportions reconciliation out of the hts package. The solution there works great for multilevel hierarchies, but I get an error when I apply it to a two-level hierarchy.
library(hts)
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = newhts$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two-level hierarchy skips the last conditional, if (l.levels > 2L). The last statement inside that conditional multiplies prop by the time series flist[[k + 1L]], which converts prop into a time series matrix. When the conditional is skipped, prop remains a regular matrix, and the error occurs when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two level hierarchies for method = "tdfp", I thought there might be a nice clean solution.
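For the two-level case the method reduces to a few lines, so a hand-rolled version is one possible workaround. The following is only a minimal sketch in base R, not the hts implementation: the function name tdfp_two_level is hypothetical, and it assumes the first column of the forecast matrix is the total and every other column is a bottom-level series.

```r
# Sketch: top-down forecasted proportions for a two-level hierarchy.
# Assumption: column 1 of `allf` is the total, the rest are bottom-level series.
tdfp_two_level <- function(allf) {
  m <- matrix(as.numeric(allf), nrow = nrow(allf))  # drop ts attributes
  top <- m[, 1]
  bottom <- m[, -1, drop = FALSE]
  prop <- bottom / rowSums(bottom)     # per-horizon share of each bottom forecast
  out <- unname(cbind(top, prop * top))  # rescale so the bottom sums to the top
  if (is.ts(allf)) {
    out <- ts(out, start = start(allf), frequency = frequency(allf))
  }
  out
}

# Toy input: a total forecast plus two bottom-level forecasts, h = 3
toy <- ts(cbind(c(10, 20, 30), c(2, 4, 6), c(6, 12, 18)), start = 51)
rec <- tdfp_two_level(toy)
```

Working on a plain matrix and restoring the time series attributes at the end sidesteps the ts/matrix arithmetic that triggers the error above.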

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 observations of 196 variables) whose rows belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were correctly classified, whereas k-means reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")[[1]] # labels as a vector, not a data frame
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- as.character(cl_gmm)
for (k in 1:nclusters) {
  ii <- which(cl_gmm == k) # individuals of group k
  counts <- table(y[ii]) # number of occurrences for each label
  maj_lab <- names(counts)[which.max(counts)] # majority label
  print(paste0("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] <- maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab) # confusion matrix
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is NA in gmm$BIC.
So far I have not found a solution to this problem. Are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every model when too many explanatory variables are involved. The right thing to do is to reduce the dimensionality first, choosing a number of dimensions small enough for the GMMs to be fitted properly while preserving as much information about the data as possible.
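One way to do that reduction is PCA with base R's prcomp(). This is only a sketch under assumptions: the data are simulated stand-ins for X, and the 90% explained-variance cut-off is an arbitrary choice, not something from the original post.

```r
# Sketch: reduce dimensionality with PCA before fitting a GMM.
# Assumption: simulated data stands in for X (400 obs, 30 variables here).
set.seed(1)
latent <- matrix(rnorm(400 * 3), nrow = 400)      # 3 underlying factors
loadings <- matrix(rnorm(3 * 30), nrow = 3)
X_sim <- latent %*% loadings + matrix(rnorm(400 * 30, sd = 0.5), nrow = 400)

pca <- prcomp(X_sim, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Assumption: keep the smallest number of components reaching 90% of variance
k <- which(var_explained >= 0.90)[1]
scores <- pca$x[, seq_len(k), drop = FALSE]

# The reduced matrix can then be passed to Mclust, e.g.:
# library(mclust)
# gmm <- Mclust(scores, G = 5)
```

After fitting, gmm$BIC shows which covariance models could actually be estimated on the reduced data.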

How to remove two data points from a data set that have a large influence on the regression model

I have found two outlier data points in my data set but I don't know how to remove them. All of the guides that I have found online seem to emphasize plotting the data but my question does not require plotting, it only takes regression model fitting. I am having great difficulty finding out how to remove the two data points from my data set and then fitting the new data set with a new model.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method shows log transformation)
# inside a formula (x1)^2 is just x1, so use the precomputed squared terms
y.regression=lm(y~x1+x2+x1_square+x2_square+(x1*x2))
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which.max(mod$y)]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
If the execution time and the size of the data allow it, you could loop over your data and copy (or remove) elements according to your criterion to get the result you want, e.g.
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary_outlier_value
        corpus_list_without_outliers.append(elem)  # keep it because it is OK :)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I see the situation, but you can replace the if above with a removal step to avoid creating a second list, e.g.
# rebuild the list in place without the outliers
# (removing items while looping over the same list would skip elements)
corpus_list[:] = [elem for elem in corpus_list if elem.speed <= 10000]
print(corpus_list)
# regression algorithm after
Hope it helped you!
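In R itself, the same idea amounts to dropping the rows by index and refitting. The following is a minimal sketch on simulated data (the data frame dat and its columns are hypothetical stand-ins for lathe1), assuming the two observations with the largest Cook's distance are the ones to remove.

```r
# Sketch: drop the two most influential observations and refit.
# Assumption: `dat` is simulated and only stands in for the real data set.
set.seed(1)
dat <- data.frame(x1 = runif(20, -1, 1), x2 = runif(20, -1, 1))
dat$y <- exp(1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(20, sd = 0.1))

fit_full <- lm(log(y) ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = dat)

cd <- cooks.distance(fit_full)
drop_idx <- order(cd, decreasing = TRUE)[1:2]  # rows with the largest distances

dat_reduced <- dat[-drop_idx, ]                # negative index removes the rows
fit_reduced <- lm(formula(fit_full), data = dat_reduced)
```

With the original data the equivalent step would be lathe1[-c(9, 10), ], followed by refitting the model on the reduced data frame.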

CONFUSION MATRIX, R,

I need a little help with the code below. I have to set up a loop that trains a neural network model on the TRAINING data with a different number of epochs each time, starting from 5 and adding 3 until I reach 20. Then I have to plot a line chart showing the accuracy for the different numbers of epochs. I also have to keep all the parameters the same as shown. Much of the code was given by our instructor. I added epochs= c(5,8,11,14,17,20) to create a list of epochs, and error.rate = vector(), where I intend to store the accuracy from each iteration. The accuracy I want comes from the confusion matrix and is obtained with
h2o.hit_ratio_table(<model>,train = TRUE)[1,2]
The problem is with the loop I tried to create. I am trying to capture the result of each iteration: I assigned the model to X so that I could store its accuracy in the vector with error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2]).
However, it gives an error.
> Error in is(object, "H2OModelMetrics") : object 'X' not found
> In addition: Warning messages:
> 1: In 1:epochs : numerical expression has 6 elements: only the first used
Moreover, when I remove the error.rate=...... part, the function runs fine but there is no way to find the values of the accuracy.
I am a noob at R so a little help will be much appreciated.
s <- proc.time()
epochs= c(5,8,11,14,17,20)
error.rate = vector()
for (epoch in 1:epochs){#set up loop to go around 6 times
X=h2o.deeplearning(x = 2:785, # column numbers for predictors
y = 1, # column number for label
training_frame = train_h2o, # data in H2O format
activation = "RectifierWithDropout", # mathematical activation function
input_dropout_ratio = 0.2, # % of inputs dropout, because some inputs might not matter.
hidden_dropout_ratios = c(0.25,0.25,0.25,0.25), # % for nodes dropout, because maybe we don't need full connections. Improves generalisability
balance_classes = T, # over/under samples so that all classes are similar size.
hidden = c(50,50,50,50), # four layers of 50 nodes
momentum_stable = 0.99,
nesterov_accelerated_gradient = T,
error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2])
proc.time() - s}
You are looping with for(epoch in 1:epochs). Because epochs is a vector, 1:epochs does not do what you expect: it takes only the first element of epochs (5), so the loop is effectively for(epoch in 1:5), with epoch equal to 1, then 2, ... and then 5. You want something like for(epoch in epochs), and if you DO want a sequence such as 1:epoch in your code, you should write it inside the loop.
Also, X is overwritten each time the loop runs. Initialize a list before the loop and store each model in it instead:
epochs= c(5,8,11,14,17,20)
X <- list() # save the models in a list # option 1
y <- list() # for an option 2
for (epoch in epochs){ # loops 6 times, once per value in epochs
X[[as.character(epoch)]] = h2o.deeplearning(... )
# or NOW you can somehow use 1:epoch, since epoch changes on each iteration
}
But I would really focus on the fact that epoch is never used inside your for loop, as far as I can see in your post. Find out where you want to use it...
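The overall pattern (loop over the epoch values, train, store one accuracy per iteration) could look like the sketch below. The function train_and_score is a hypothetical stand-in so the example runs without an H2O cluster; its return value is a placeholder, not a real accuracy.

```r
# Hypothetical stand-in for the real training step. In the actual code the
# body would be roughly:
#   model <- h2o.deeplearning(..., epochs = n_epochs)
#   h2o.hit_ratio_table(model, train = TRUE)[1, 2]
train_and_score <- function(n_epochs) {
  1 - 1 / (1 + n_epochs)  # placeholder value only
}

epochs <- seq(5, 20, by = 3)          # 5, 8, 11, 14, 17, 20
accuracy <- numeric(length(epochs))   # one slot per epoch setting

for (i in seq_along(epochs)) {
  # the accuracy is computed after the model call, not as one of its arguments
  accuracy[i] <- train_and_score(epochs[i])
}
names(accuracy) <- epochs
```

The line chart is then a single call such as plot(epochs, accuracy, type = "b").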

How to get top down forecasts using `hts::combinef()`?

I am trying to compare forecast reconciliation methods from the hts package on previously existing forecasts. The forecast.gts function is not available to me, since there is no computationally tractable way to create a user-defined function that returns the values in a forecast object. Because of this, I am using the combinef() function in the package to redistribute the forecasts. I have been able to work out the proper weights for the wls and nseries methods, and the ols version is the default. I was able to get the "bottom up" method using:
# Creates sample forecasts, taken from `combinef()` example
library(hts)
h <- 12
ally <- aggts(htseg1)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# create the weight vector
numTS <- ncol(allf) # Get the total number of series
numBaseTS <- sum(tail(htseg1$nodes, 1)[[1]]) # Get the number of bottom level series
# Create weights of 0 for all aggregate ts and 1 for the base level
weightVals <- c(rep(0, numTS - numBaseTS), rep(1, numBaseTS))
y.f <- combinef(allf, htseg1$nodes, weights = weightVals)
I was hoping that something like making the first weight 1 and the rest 0 might give me one of the three top-down forecasts, but that just results in a bunch of 0s or NaN values, depending on how you look at it.
combinef(allf, htseg1$nodes, weights = c(1, rep(0, numTS - 1)))
I know the top-down methods aren't the hardest thing to compute manually, and I can just write a function to do that, but are there any tools in the hts package that can help with this? I'd like to keep the data format consistent to simplify my analysis. Most specifically, I would like to get the "top down forecasted proportions" or tdfp method.
The functions to reconcile the forecasts using the "top-down" method are currently not exported. Probably I should export them to make the "top-down" results as tractable as combinef() in the next version. The workaround is as follows:
hts:::TdFp(allf, nodes = htseg1$nodes)
Hope it helps.
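For comparison, the bottom-up result that the question obtains through the 0/1 weights can also be written directly as an aggregation with a summing matrix. This is a base R sketch, not hts code; it assumes a hierarchy shaped like htseg1 (one total, two middle nodes with 3 and 2 children), and bottom_f is made-up data.

```r
# Sketch: bottom-up aggregation with a summing matrix S.
# Assumption: hierarchy shape 1 total -> 2 middle nodes -> 3 + 2 bottom series.
group_sizes <- c(3, 2)
n_bottom <- sum(group_sizes)
membership <- rep(seq_along(group_sizes), times = group_sizes)  # 1 1 1 2 2

S <- rbind(
  rep(1, n_bottom),                                   # total row
  t(sapply(seq_along(group_sizes),
           function(g) as.numeric(membership == g))), # one row per middle node
  diag(n_bottom)                                      # bottom-level rows
)

# Made-up bottom-level forecasts: h = 4 horizons, 5 series
bottom_f <- matrix(1:20, nrow = 4)

# Each row of the result holds all 8 series for one horizon
bu <- bottom_f %*% t(S)
```

Every aggregate column of bu is a sum of bottom-level columns, which is what forcing all the weight onto the bottom level achieves in combinef().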

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the information from the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)) . The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat) # simdat ships with spatstat, so load the package first
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variable. The data is then in the correct orientation to be fitted to the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted, your problem is most likely the starting values. There are two strategies you could use:
1. Use brute force to find starting values. See package nls2 for a function to do this.
2. Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
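The large-R approximation above can be turned into starting values: a linear fit of ln(G) on ln(R) over the tail has slope -(C+1) and intercept -(C+1)*ln(B). The following is a sketch on simulated data only (true B = 2 and C = 2, with small multiplicative noise); the cut-off R > 3 for "large" is an assumption.

```r
# Sketch: starting values for nls from the linearized tail of the model.
# Assumption: data are simulated with B = 2, C = 2 and small noise.
set.seed(42)
R <- seq(0.1, 10, length.out = 100)
G <- (1 + 2^2 * R^2)^(-(2 + 1) / 2) * exp(rnorm(length(R), sd = 0.01))
dat <- data.frame(R = R, G = G)

# For large B^2 * R^2: ln(G) ~ -(C+1)*ln(B) - (C+1)*ln(R)
tail_fit <- lm(log(G) ~ log(R), data = subset(dat, R > 3))
intercept <- unname(coef(tail_fit)[1])
slope <- unname(coef(tail_fit)[2])

start_vals <- list(
  B = exp(intercept / slope),  # intercept / slope = ln(B)
  C = -slope - 1               # slope = -(C + 1)
)

fit <- nls(G ~ (1 + B^2 * R^2)^(-(C + 1) / 2), data = dat, start = start_vals)
```

The rough values from the linear fit only need to be close enough for nls to take over.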
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.

Resources