I would like to do a cluster analysis with Kmeans and use the Euclidean distance.
This is part of my code:
WKA_ohneJB <- read.csv("WKA_ohneJB.csv", header=TRUE, sep = ";", stringsAsFactors = FALSE)
WKA_ohneJB_scaled <- scale(WKA_ohneJB)
set.seed(123)
WKA_ohneJB_sample <- sample(1:500, 300)
WKA_ohneJB_scaled <- WKA_ohneJB_scaled[WKA_ohneJB_sample,]
kmeans(WKA_ohneJB_scaled, 8, iter.max = 10, nstart = 1, method = "euclidean")
fviz_nbclust(WKA_ohneJB_scaled, kmeans, method = "wss")+ geom_vline(xintercept = 8, linetype = 2)
Error in kmeans(WKA_ohneJB_scaled, 8, iter.max = 10, nstart = 1,
  method = "euclidean") : unused argument (method = "euclidean")
You have to install the "amap" package to use its Kmeans function (note the capital K), which is different from base R's kmeans and accepts a method argument.
Here's the link to the documentation: https://www.rdocumentation.org/packages/amap/versions/0.8-18/topics/Kmeans
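Note that base R's kmeans already minimizes squared Euclidean distances, which is why it has no method argument. With amap installed, the call could look like this (a minimal sketch, reusing the variables from the question):
# install.packages("amap")  # if not already installed
library(amap)
# amap's Kmeans takes the distance metric via `method`
fit <- Kmeans(WKA_ohneJB_scaled, centers = 8, iter.max = 10,
              nstart = 1, method = "euclidean")
fit$withinss  # within-cluster sum of squares, one value per cluster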
I'm new to R and trying to isolate the best performing features from a data set of 247 columns (246 variables + 1 outcome), and 800 or so rows (where each row is one person's data) to create a predictive model.
I'm using caret to do RFE with lmFuncs; I need to use linear regression since the target variable is continuous.
I use the following to split into test/training data (this runs without errors):
inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
The resulting test and train sets are consistent: x and y contain the same number of samples, and all columns are the same length.
My control parameters are as follows (this also runs without error):
control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
But when I run RFE I get an error message saying
Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) :
there should be the same number of samples in x and y
My code for RFE is as follows, with the target variable in the first column:
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
I've looked through various forums, but nothing seems to work.
This Google Groups thread suggests using an older version of caret, which I tried, but I got the same x/y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1
Others suggest converting the target variable to a factor or matrix. This hasn't helped, and produces
Warning message:
In createDataPartition(data$EBI_SUM, p = 0.8, list = F) :
Some classes have a single record
when partitioning the data into test/train, and the same X/Y sample error if you try to carry out RFE.
Mega thanks in advance :)
P.S.
Here's the dput for the target variable (EBI_SUM, shown as TargetVar below) and a couple of other variables:
data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355), Dosing = c(2,
2, 2, 2, 2, 2), `QIDS_1 ` = c(1, 1, 3, 1, 1, 1), `QIDS_2 ` = c(3,
3, 2, 3, 3, 3), `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Your data object's column names should not contain spaces (note the trailing space in `QIDS_1 ` and the other QIDS columns). Rebuild it without them:
library(caret)
data <- data.frame(
  TargetVar = c(243, 243, 243, 243, 355, 355),
  Dosing = c(2, 2, 2, 2, 2, 2),
  QIDS_1 = c(1, 1, 3, 1, 1, 1),
  QIDS_2 = c(3, 3, 2, 3, 3, 3),
  QIDS_3 = c(1, 2, 1, 1, 1, 2)
)
inTrain <- createDataPartition(data$TargetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
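If retyping the data frame isn't practical for 247 columns, stripping the whitespace from the existing column names should work too (a minimal sketch, assuming the trailing spaces are the only problem):
# Trim leading/trailing whitespace from every column name in place
names(data) <- trimws(names(data))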
I am doing some kmeans clustering analysis. Example:
library(tidyverse)
library(foreach)
my_diamonds <- diamonds %>% select_if(is.numeric) %>% scale %>% as.data.frame
try_centers <- seq(from = 3, to = 12, by = 1)
wss_list <- foreach(k = try_centers) %do% { # forget parallel processing with this size of data not enough ram
  print(k) # progress bar
  hw = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'Hartigan-Wong')
}
wss <- lapply(wss_list, function(i) i$tot.withinss) %>% unlist()
plot(try_centers, wss)
This returns a plot of the total within-cluster sum of squares for each number of centers.
But I would like to compare this with the two other algorithms available in kmeans. I tried:
wss_list <- foreach(k = try_centers) %do% { # forget parallel processing with this size of data not enough ram
  print(k) # progress bar
  hw = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'Hartigan-Wong')
  lloyd = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'Lloyd')
  mac = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'MacQueen')
}
wss <- lapply(wss_list, function(i) i$tot.withinss) %>% unlist()
plot(try_centers, wss)
This does return a plot, but which one am I looking at?! MacQueen? Lloyd?
How can I structure this to run kmeans with the three algorithms on each iteration and then plot on one chart each of the 3?
If we don't specify the return value, the loop returns only the last object created. We can return a list of objects instead:
wss_list <- foreach(k = try_centers) %do% { # forget parallel processing with this size of data not enough ram
  print(k) # progress bar
  hw = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'Hartigan-Wong')
  lloyd = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'Lloyd')
  mac = kmeans(my_diamonds, centers = k, iter.max = 20, nstart = 3, algorithm = 'MacQueen')
  return(dplyr::lst(hw, lloyd, mac))
}
Then we can extract each component:
library(purrr)
hw <- map(wss_list, ~ .x$hw)
lloyd <- map(wss_list, ~ .x$lloyd)
mac <- map(wss_list, ~ .x$mac)
Or transpose the list to create a list of three elements:
wss_list1 <- wss_list %>%
  transpose
names(wss_list1)
#[1] "hw" "lloyd" "mac"
Now we can plot:
wss <- lapply(wss_list1$hw, function(i) i$tot.withinss) %>%
  unlist()
plot(try_centers, wss)
and do the same with the other components:
wss2 <- lapply(wss_list1$lloyd, function(i) i$tot.withinss) %>%
  unlist()
plot(try_centers, wss2)
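To answer the last part of the question, the three curves can also be drawn on a single chart. A sketch with base graphics, assuming the transposed wss_list1 from above:
# Collect tot.withinss for each algorithm across all values of k
wss_hw    <- sapply(wss_list1$hw,    function(i) i$tot.withinss)
wss_lloyd <- sapply(wss_list1$lloyd, function(i) i$tot.withinss)
wss_mac   <- sapply(wss_list1$mac,   function(i) i$tot.withinss)
# One chart, one curve per algorithm; ylim covers all three curves
plot(try_centers, wss_hw, type = "b", col = "black",
     ylim = range(wss_hw, wss_lloyd, wss_mac),
     xlab = "number of centers", ylab = "total within-cluster SS")
lines(try_centers, wss_lloyd, type = "b", col = "red")
lines(try_centers, wss_mac, type = "b", col = "blue")
legend("topright", legend = c("Hartigan-Wong", "Lloyd", "MacQueen"),
       col = c("black", "red", "blue"), lty = 1, pch = 1)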
I get the following warning message that I want to get rid of but I don't understand it:
possible convergence problem: optim gave code = 1
Below is my code:
# Creating the specification with ugarchroll
library("rugarch")
garch_spec_N <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                           mean.model = list(armaOrder = c(1, 1), include.mean = TRUE),
                           distribution.model = "norm")
nsim <- 2 # number of simulations
nstart <- 500
simul_ <- matrix(rnorm(2000), nrow = 1000, ncol = nsim)
alpha <- 0.05 # VaR level; assumed here, since it is not defined in the snippet
# Model fitting
garch_sigma_N <- garch_mu_N <- matrix(NA, nrow = nstart, ncol = nsim)
for (i in 1:nsim) {
  garch_fit_N <- ugarchroll(garch_spec_N, simul_[, i], n.ahead = 1,
                            forecast.length = 1, n.start = nstart,
                            refit.every = 1, refit.window = "moving",
                            window.size = nstart, calculate.VaR = TRUE,
                            VaR.alpha = alpha)
  # Retrieving estimated garch variance and Mu
  garch_sigma_N[, i] <- garch_fit_N@forecast[["density"]][["Sigma"]]
  garch_mu_N[, i] <- garch_fit_N@forecast[["density"]][["Mu"]]
}
Any help would be very much appreciated :)
library(stats4)
x <- 0:10
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)
## Easy one-dimensional MLE:
nLL <- function(lambda) -sum(stats::dpois(y, lambda, log = TRUE))
fit0 <- mle(nLL, start = list(lambda = 5), nobs = NROW(y), method = "L-BFGS-B")
This is a toy example from mle's documentation. The optimization method I chose to use is L-BFGS-B. I'm interested in seeing the lambda values at different iterations.
Looking into optim's help page, I tried adding trace = TRUE. But that seems to give me the likelihood at each iteration and not the lambda values.
> fit0 <- mle(nLL, start = list(lambda = 5), nobs = NROW(y), method = "L-BFGS-B", control = list(trace = TRUE))
final value 42.726780
converged
How can I obtain the lambda estimates at each iteration?
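One way to see them (a sketch, not from mle's documentation): since optim calls the objective function at every evaluation, the lambda values can be recorded inside the negative log-likelihood itself.
lambda_trace <- c()
nLL_traced <- function(lambda) {
  lambda_trace <<- c(lambda_trace, lambda) # record every lambda optim evaluates
  -sum(stats::dpois(y, lambda, log = TRUE))
}
fit0 <- mle(nLL_traced, start = list(lambda = 5), nobs = NROW(y),
            method = "L-BFGS-B")
lambda_trace # includes the extra evaluations L-BFGS-B makes for finite-difference gradients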
Is it possible to create a deep learning net that gives multiple outputs?
The reason for doing this is to also try to capture the relationships between outputs.
In the examples given I can only create one output.
library(h2o)
localH2O = h2o.init()
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath)
h2o.deeplearning(x = 1:4, y = 5, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5)
It doesn't look like multiple response columns are currently supported in H2O (H2O FAQ and H2O Google Group topic). Their suggestion is to train a new model for each response.
(Nonsensical) example:
library(h2o)
localH2O <- h2o.init()
irisPath <- system.file("extdata", "iris.csv", package = "h2o")
iris.hex <- h2o.importFile(localH2O, path = irisPath)
m1 <- h2o.deeplearning(x = 1:2, y = 3, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5, classification = FALSE)
m2 <- h2o.deeplearning(x = 1:2, y = 4, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5, classification = FALSE)
However, it appears that multiple responses are available through the deepnet package (check library(sos); findFn("deep learning")).
library(deepnet)
x <- as.matrix(iris[, 1:2]) # two input features
y <- as.matrix(iris[, 3:4]) # two response columns trained at once
m3 <- dbn.dnn.train(x = x, y = y, hidden = c(5, 5))
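To confirm the net really produces both outputs, the predictions can be inspected with deepnet's nn.predict (a quick check, not part of the original answer):
pred <- nn.predict(m3, x)
dim(pred) # 150 x 2: one column per response variable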