MPC K-means constraints definition using conclust package in R - r

I'm using the conclust package in R to perform semi-supervised clustering with the MPC k-means algorithm, in order to cluster fuel stations based on their activities.
I started with the code below.
mustLink =list(c ('station x','stations y'))
cantLink = list(c('station z','station w'))
mpckm(subset, 5, mustLink, cantLink, maxIter = 10)
subset is my dataframe.
stations x, stations y, station z and station w represent row indices.
My problem is that I'm not sure how to define my constraints.
For now I'm beginning with simple constraints: for example, I don't want station X to be in the same cluster as station Y.
In the conclust package, the mpckm function takes two lists of must-link and cannot-link constraints, but no further details are given.
I tried to do the same thing by adding the row indices of the stations to the constraint lists, but this didn't work and threw this error:
Error in 1:nm : argument of length 0.
What exactly am I missing?

It works well for me when the constraints are objects of class matrix, which is also what the example in the help page uses.
So you should do something similar to this (assuming subset is a data.frame):
mustLink = cbind(which(rownames(subset)=='station x'), which(rownames(subset)=='stations y'))
cantLink = cbind(which(rownames(subset)=='station z'), which(rownames(subset)=='station w'))
mpckm(subset, 5, mustLink, cantLink, maxIter = 10)
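As a minimal sketch of the approach above: the helper make_constraints below is a hypothetical name, not part of conclust. It turns pairs of row names into a two-column matrix of row indices, one constraint per row, and fails loudly on an unknown name instead of silently producing an empty index. The data frame and station names are made up for illustration.

```r
# Hypothetical helper (not part of conclust): convert pairs of row names
# into a two-column matrix of row indices, one constraint per row.
make_constraints <- function(data, pairs) {
  idx <- t(vapply(pairs, function(p) {
    i <- match(p, rownames(data))
    if (anyNA(i)) stop("unknown row name: ", paste(p[is.na(i)], collapse = ", "))
    i
  }, integer(2)))
  matrix(idx, ncol = 2)
}

# Illustrative data; the real 'subset' would be the station data frame
stations <- data.frame(
  volume = c(10, 12, 50, 55, 11),
  visits = c(3, 4, 20, 22, 3),
  row.names = c("station x", "station y", "station z", "station w", "station v")
)

mustLink <- make_constraints(stations, list(c("station x", "station y")))
cantLink <- make_constraints(stations, list(c("station z", "station w")))

# With conclust installed, the call would then look like this
# (passing a matrix is an assumption based on the answer above):
# library(conclust)
# mpckm(as.matrix(stations), 2, mustLink, cantLink, maxIter = 10)
```

A misspelled name such as 'stations y' would stop with an error here, whereas which() would return a zero-length index.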

Related

How to decide best number of clusters for kamila clustering with R?

I have a mixed type data set, so I wanted to try kamila clustering. It is easy to apply it, but I would like a plot to decide the number of clusters similar to knee-plot.
data <- read.csv("binarymat.csv",header=FALSE,sep=";")
conInd <- c(9)
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,
calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
From the kamila package documentation:
Setting calcNumClust to ’ps’ uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC based methods for large
sample sizes.
In your case you have specified only one value for numClust, so you are not actually selecting the number of clusters: you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example numClust = 2 : 7, as well as the method for selecting the number of clusters.
Something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.
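A minimal plotting sketch, assuming kamRes$nClust$psValues holds one prediction-strength value per candidate in numClust. The values below are placeholders made up so that the snippet runs without the kamila package; substitute the real kamRes$nClust$psValues. Picking the largest k whose prediction strength reaches the threshold follows the rule described by Tibshirani & Walther; kamila's own selection may differ in detail, for example by accounting for standard errors.

```r
ks <- 2:7
# Placeholder values for illustration only; use kamRes$nClust$psValues instead
ps_values <- c(0.92, 0.81, 0.64, 0.55, 0.41, 0.38)
ps_thresh <- 0.5  # same value as predStrThresh in the kamila() call

# Largest number of clusters whose prediction strength reaches the threshold
best_k <- max(ks[ps_values >= ps_thresh])

# Knee-style plot: prediction strength against number of clusters
plot(ks, ps_values, type = "b",
     xlab = "Number of clusters", ylab = "Prediction strength")
abline(h = ps_thresh, lty = 2)  # threshold line
abline(v = best_k, lty = 3)     # selected number of clusters
```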

Unscaling neural network numeric matrix in R

I have a question which I assume is generic, but in my case it applies to neural networks in R.
For the record, I am using both the h2o and neuralnet packages.
As you may know, it is often advised to scale the input of a neural network so that the network works better with the chosen activation function.
In R there are several ways to do this; I use scale() or min/max scaling.
Say I have a 700x10 matrix as input, so scaling produces two vectors, scale and center, each of length 10.
The problem starts when I want to unscale the output.
The formula says vOutput * vScale (full vector) + vCenter (full vector).
Question: should I then use the full vectors (scale and center) to do the unscaling, or is there a more complex formula or some boundary condition that I could not find?
#sample data
df <- data.frame(col1 = c(1:5), col2 = c(11:15), target=c(1,0,0,0,1))
#normalize sample data using scale() - except the 'target' column
df_scaled <- scale(df[,-ncol(df)])
df_scaled
#revert back to original data from scaled version
df_original <- as.data.frame(t(apply(df_scaled, 1,
function(x) (x * attr(df_scaled, 'scaled:scale') + attr(df_scaled, 'scaled:center')))))
df_original
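On the question of which vectors to use: each column is unscaled with its own center and scale, so a predicted target needs only the target column's two values. A minimal sketch, assuming the target was scaled together with the inputs (unlike the snippet above, which leaves target out); pred_scaled stands in for a network output and is made up for illustration.

```r
df <- data.frame(col1 = 1:5, col2 = 11:15, target = c(10, 20, 30, 40, 50))

scaled  <- scale(df)
centers <- attr(scaled, "scaled:center")
scales  <- attr(scaled, "scaled:scale")

# Unscale a whole matrix column by column: x * scale + center
unscaled <- sweep(sweep(scaled, 2, scales, `*`), 2, centers, `+`)

# Unscale a prediction for 'target' using only that column's values
pred_scaled <- c(-0.5, 0, 0.5)  # stand-in for a network output
pred <- pred_scaled * scales["target"] + centers["target"]
```

The two sweep() calls do the same job as the apply() version, without the transpose.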

CONFUSION MATRIX, R,

I need a little help with the code below. I have to set up a loop to train a neural network model on the TRAINING data with a different number of epochs each time, starting from 5 and adding 3 until I reach 20. Then I have to draw a line chart showing the accuracy for the different numbers of epochs. I also have to keep all the other parameters the same as shown. Much of the code was given by our instructor. I added epochs= c(5,8,11,14,17,20) to create a list of epochs, and error.rate = vector(), where I intend to store the accuracy from each loop. The accuracy I want comes from the confusion matrix and is obtained with
h2o.hit_ratio_table(<model>,train = TRUE)[1,2]
The problem is with the loop I tried to create. I am trying to get the result from each iteration. I have labelled the model in the first part of the loop X, and tried to put its accuracy into the vector with error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2]).
However, it gives an error.
> Error in is(object, "H2OModelMetrics") : object 'X' not found In
> addition: Warning messages: 1: In 1:epochs : numerical expression has
> 6 elements: only the first used
Moreover, when I remove the error.rate=...... part, the function runs fine but there is no way to find the values of the accuracy.
I am a noob at R so a little help will be much appreciated.
s <- proc.time()
epochs= c(5,8,11,14,17,20)
error.rate = vector()
for (epoch in 1:epochs){#set up loop to go around 6 times
X=h2o.deeplearning(x = 2:785, # column numbers for predictors
y = 1, # column number for label
training_frame = train_h2o, # data in H2O format
activation = "RectifierWithDropout", # mathematical activation function
input_dropout_ratio = 0.2, # % of inputs dropout, because some inputs might not matter.
hidden_dropout_ratios = c(0.25,0.25,0.25,0.25), # % for nodes dropout, because maybe we don't need full connections. Improves generalisability
balance_classes = T, # over/under samples so that all classes are similar size.
hidden = c(50,50,50,50), # two layers of 100 nodes
momentum_stable = 0.99,
nesterov_accelerated_gradient = T,
error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2])
proc.time() - s}
You are doing for(epoch in 1:epochs). Here epoch changes on each iteration (and usually you would use it within the loop, but I don't see it used?). 1:epochs will not work as you think it should. It takes only the first element of epochs (5), so it is equivalent to for(epoch in 1:5), where epoch is 1, then 2, ... and then 5. You want something like for(epoch in epochs), and if you DO want a sequence from 1 to each epoch value, you should write it within the loop.
Also, X is overwritten on each iteration. You should initialize a container before the loop and store each result in it instead:
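epochs <- c(5, 8, 11, 14, 17, 20)
X <- list() # one fitted model per epoch value (option 1)
y <- list() # for an option 2
for (epoch in epochs) { # loops 6 times, once per value in epochs
  # index by name so the list has 6 entries rather than 20 mostly empty slots
  X[[as.character(epoch)]] <- h2o.deeplearning(... )
  # or NOW you can somehow use 1:epoch where each loop epoch changes
}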
But I would really focus on the fact that epoch is not used anywhere inside your for loop, as far as I can see in your post. Perhaps find out where you want to use it...
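A sketch of how the loop could record one accuracy value per epoch count. train_and_score is a hypothetical stand-in so that the snippet runs without an H2O cluster; in the real code it would wrap the h2o.deeplearning() call with epochs = n_epochs added to the existing parameters, and return h2o.hit_ratio_table(model, train = TRUE)[1, 2]. The formula inside the stub is made up and is not a real result.

```r
# Hypothetical stand-in for training a model and returning its accuracy.
# In real code this would wrap h2o.deeplearning(..., epochs = n_epochs)
# followed by h2o.hit_ratio_table(model, train = TRUE)[1, 2].
train_and_score <- function(n_epochs) {
  1 - 1 / (1 + n_epochs)  # made-up number, not a real result
}

epochs <- seq(5, 20, by = 3)         # 5, 8, 11, 14, 17, 20
accuracy <- numeric(length(epochs))  # one slot per model

# Iterate over positions so each result lands in its own slot
for (i in seq_along(epochs)) {
  accuracy[i] <- train_and_score(epochs[i])
}

# Line chart of accuracy against number of epochs
plot(epochs, accuracy, type = "l",
     xlab = "Epochs", ylab = "Training accuracy")
```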

Cluster data using medoids (cluster centers) in R

I have a dataframe with three features as
library(cluster)
df <- data.frame(f1=rnorm(480,30,1),
f2=rnorm(480,40,0.5),
f3=rnorm(480,50, 2))
Now, I want to do clustering using K-medoids in two steps. In step 1, using some data from df I want to get medoids (cluster centers), and in step 2, I want to use obtained medoids to do clustering on remaining data. Accordingly,
# find medoids using some data
sample_data <- df[1:240,]
sample_data <- scale(sample_data) # scaling features
clus_res1 <- pam(sample_data,k = 4,diss=FALSE)
# Now perform clustering using medoids obtained from above clustering
test_data <- df[241:480,]
test_data <- scale(test_data)
clus_res2 <- pam(test_data,k = 4,diss=FALSE,medoids=clus_res1$medoids)
With this script, I get the following error message:
Error in pam(test_data, k = 4, diss = FALSE, medoids = clus_res1$medoids) :
'medoids' must be NULL or vector of 4 distinct indices in {1,2, .., n}, n=240
It is clear that the error message is due to the input format of the medoid matrix. How can I convert this matrix to the vector specified in the error message?
The medoids parameter expects the index numbers of points in your data set. So 42,17 means to use objects 42 and 17 as the initial medoids.
By the definition of medoids, you can only use points of your data set as medoids, not other vectors!
Clustering is unsupervised. There is no need to split your data into training and test sets, because there are no labels to overfit to in unsupervised learning.
Notice that in PAM each cluster center is an observation; that is, you get 4 observations, each of which is the center of a cluster.
So if you want to reuse the same centers, you need to find the observations in the new data that are closest to the observations that are the centers in your training data.
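One way to reuse the step 1 medoids without passing them to pam() is to assign each remaining row to its nearest medoid. A minimal sketch under two assumptions: Euclidean distance (the pam() default), and scaling the remaining rows with the center and scale of the first half so that both halves share a coordinate system. assign_to_medoids is a hypothetical helper, not part of the cluster package.

```r
library(cluster)

set.seed(1)
df <- data.frame(f1 = rnorm(480, 30, 1),
                 f2 = rnorm(480, 40, 0.5),
                 f3 = rnorm(480, 50, 2))

# Step 1: find medoids on the first half
train <- scale(df[1:240, ])
fit <- pam(train, k = 4)

# Step 2: scale the remaining rows with the training center and scale
test <- scale(df[241:480, ],
              center = attr(train, "scaled:center"),
              scale  = attr(train, "scaled:scale"))

# Hypothetical helper: index of the medoid at the smallest Euclidean distance
assign_to_medoids <- function(x, medoids) {
  d <- sapply(seq_len(nrow(medoids)), function(j) {
    sqrt(rowSums(sweep(x, 2, medoids[j, ])^2))
  })
  max.col(-d, ties.method = "first")
}

test_clusters <- assign_to_medoids(test, fit$medoids)
```

The medoids stay fixed here; nothing is re-optimized on the second half.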

How to do top down forecasted proportions for hts objects with 2 levels?

I had previously asked this question while trying to get top-down forecasted proportions for forecast recombination using the hts package. The solution there works great for multilevel hierarchies; however, I get an error when I try to use it on a two-level hierarchy.
library(hts)
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = htseg1$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two-level hierarchy skips the last if block, whose condition is if (l.levels > 2L). The last statement of this block includes a piece where prop is multiplied by the time series flist[[k + 1L]], which converts prop into a time series matrix. When this statement is skipped, prop remains a regular matrix, which causes the error when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non-exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two-level hierarchies with method = "tdfp", I thought there might be a nice clean solution.
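For a two-level hierarchy the forecast-proportions rule reduces to one step: each bottom-level forecast is divided by the sum of all bottom-level forecasts for that horizon, and the resulting proportion is multiplied by the top-level forecast. A hand-coded sketch, assuming allf has the total in column 1 and the bottom series in the remaining columns (the layout aggts() produces); tdfp_two_level is a hypothetical name, not an hts function, and the numbers are made up.

```r
# Hypothetical helper (not part of hts): top-down forecast proportions
# for a hierarchy with only a total and one bottom level.
tdfp_two_level <- function(allf) {
  total  <- as.numeric(allf[, 1])
  bottom <- unclass(allf)[, -1, drop = FALSE]  # plain matrix, no ts attributes
  prop   <- bottom / rowSums(bottom)           # proportions per forecast horizon
  out    <- prop * total                       # each row now sums to the total
  if (is.ts(allf)) {
    out <- ts(out, start = start(allf), frequency = frequency(allf))
  }
  out
}

# Made-up base forecasts: column 1 is the total, columns 2-4 the bottom level
allf <- ts(cbind(Total = c(100, 110),
                 A = c(30, 35), B = c(50, 50), C = c(25, 30)),
           start = 51)
bf <- tdfp_two_level(allf)
```

Stripping the time series class before the arithmetic avoids the time-series/matrix mismatch described above; the result is turned back into a ts at the end.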
