I am trying to learn about the "kohonen" package in R. In particular, there is a function called "supersom()" (https://www.rdocumentation.org/packages/kohonen/versions/3.0.10/topics/supersom), which implements the SOM (Self-Organizing Maps) algorithm used in unsupervised machine learning, that I am trying to apply to some data.
Below (from a previous question: R error: "Error in check.data : Argument Should be Numeric"), I learned how to apply the "supersom()" function to some artificially created data with both "factor" and "numeric" variables.
#the following code works
#load libraries
library(kohonen)
library(dplyr)
#create and format data
a =rnorm(1000,10,10)
b = rnorm(1000,10,5)
c = rnorm(1000,5,5)
d = rnorm(1000,5,10)
e <- sample( LETTERS[1:4], 100 , replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
f <- sample( LETTERS[1:5], 100 , replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2) )
g <- sample( LETTERS[1:2], 100 , replace=TRUE, prob=c(0.5, 0.5) )
data = data.frame(a,b,c,d,e,f,g)
data$e = as.factor(data$e)
data$f = as.factor(data$f)
data$g = as.factor(data$g)
cols <- 1:4
data[cols] <- scale(data[cols])
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Everything works well. The problem is that when I try to apply the "supersom()" function to more realistic and bigger data, I get the following error:
"Error: Non-informative layers present : mean distances between objects zero"
When I look at the source code for this function (https://rdrr.io/cran/kohonen/src/R/supersom.R), I notice a reference for the same error:
if (any(sapply(meanDistances, mean) < .Machine$double.eps))
stop("Non-informative layers present: mean distance between objects zero")
Can someone please show me how I might be able to resolve this error, i.e. make the "supersom()" function work with factor and numeric data?
I thought that perhaps removing duplicate rows and NA's might fix this problem:
data <- na.omit(data)
data <- unique(data)
However the same error ("Non-informative layers present : mean distances between objects zero") is still there.
Can someone please help me figure out what might be causing this error? Note: when I remove the "factor" variables, everything works fine.
Sources:
https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
https://www.rdocumentation.org/packages/kohonen/versions/2.0.5/topics/supersom
https://rdrr.io/cran/kohonen/src/R/supersom.R
The error occurs when a layer carries no information, i.e. when a column has the same value in every row, so that the mean distance between objects in that layer is zero. You can reproduce the error by setting any one column to a constant, for example 0.
data$a <- 0
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Error in supersom(data = as.list(data), grid = somgrid(10, 10, "hexagonal"), :
Non-informative layers present: mean distance between objects zero
Investigate why those columns are constant in your real data (for factors, check whether only a single level is present), or remove the constant columns from the data.
library(kohonen)
library(dplyr)
# keep only columns with more than one distinct value
data <- data %>% select(where(~ n_distinct(.x) > 1))
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
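Since the check in supersom() compares the mean distance between objects in each layer against zero, a quick way to find the offending layers is to look for columns with a single distinct value. The following is a minimal sketch in base R on a small made-up data frame; `is_constant` is a hypothetical helper, not part of kohonen.

```r
# Hypothetical helper (not part of kohonen): TRUE when a column has
# at most one distinct non-missing value.
is_constant <- function(x) length(unique(x[!is.na(x)])) <= 1

# Made-up example data
df <- data.frame(
  a = c(0, 0, 0, 0),                  # constant numeric column
  b = c(-1, 0, 1, 0),                 # mean is 0, but the values vary
  e = factor(c("A", "A", "A", "A")),  # factor with a single level in use
  f = factor(c("A", "B", "A", "B"))
)

# Names of the columns that would form non-informative layers
constant_cols <- names(df)[sapply(df, is_constant)]
print(constant_cols)

# Drop them before building the list of layers
df_clean <- df[, !names(df) %in% constant_cols, drop = FALSE]
```

Column b is kept even though its mean is 0, because its values still differ between rows.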
I'm trying to fit a model with one inflation only. The pred.zoib function worked when I included both one and zero inflation, but it fails now that I've excluded the zero inflation, with this error:
Error in x1[i, ] %*% b1 : non-conformable arguments
The chaffinchdata is just a data frame with various bits of information, but importantly a detection probability column that has been adjusted to remove the zeros and replace them with 0.00001s, and a distance column that holds numeric values between 0 and 131.
When using pred.zoib before with both zero and one inflation, it worked fine:
fit.zoib.chaffinch4 <- zoib(Detection_probability ~ Distance | Distance | Distance | Distance , data = chaffinchdata)
chaffinchxnew <- data.frame(Distance = seq(0,150,0.1))
pred.chaff4 <- pred.zoib(fit.zoib.chaffinch4, chaffinchxnew)
dfchaff4 <- data.frame(Distance = seq(0,150,0.1),pred.chaff4$summary)
This all worked perfectly. The code below, however, runs only up to the pred.zoib stage.
# now set the zero inflation to FALSE, to investigate just one inflation
nozeroinf <- birddata$Detection_probability
nozeroinf <- ifelse(nozeroinf == 0, 0.00001, nozeroinf)
birddata$nozeroinfs <- nozeroinf
chaffinchdata <- filter(birddata, Species == 'Chaffinch')
fit.zoib.chaff.oneinf <- zoib(nozeroinfs ~ Distance | Distance | Distance,
data = chaffinchdata, zero.inflation = FALSE,
one.inflation = TRUE)
chaffxnew.oneinf <- data.frame(Distance = seq(0, 100, 0.1))
pred.chaff.oneinf <- pred.zoib(fit.zoib.chaff.oneinf, chaffxnew.oneinf)
I've tried using the distance data straight from the chaffinchdata dataset instead of creating a sequence of my own, i.e.
pred.zoib(fit.zoib.chaff.oneinf, data.frame(chaffinchdata$Distance))
but that didn't work, nor did
pred.zoib(fit.zoib.chaff.oneinf, chaffinchdata)
Any help on this would be greatly appreciated!
I've received a function from a colleague that calculates the height a tree needs at a given age in order to reach a certain height at age 100 (the site index, SI). My job is to put this into purrr to calculate what the height will look like for a number of SI and height crossings in order to plot the growth trajectory.
First I create the base function:
SI_tall <- function(topheight, age, si ){
paramasi <- 25
parambeta <- 7395.6
paramb2 <- -1.7829
refAge <- 100
d <- parambeta*(paramasi^paramb2)
r <- (((topheight-d)^2)+(4*parambeta*topheight*(age^paramb2)))^0.5
## height at reference age
h2 <- (topheight+d+r)/ (2+(4*parambeta*(refAge^paramb2)) / (topheight-d+r))
return(abs(h2 - si))
}
To calculate the height for a tree of given age and site index, we use this function in another. The height will be given by
my.age <- 10
my.si <- 30
new.topheight <- function(my.si, my.age){
optim(par = list(topheight = 10), ## this topheight is just an initial value
method = 'L-BFGS-B', fn = SI_tall, si = my.si, age = my.age, lower= 0, upper=100)$par
}
This works nicely for each value.
Since I want to draw a trajectory of the growth of each tree, I'll first need to calculate the ages and site indices at a required resolution to plot. I create two vectors to cross:
my.age <- seq(0,110, by=0.2)
my.si <- c(5,10,15,20,25,30,35)
si.crossing <- tidyr::crossing(my.age, my.si)
si.crossing %>% group_by(my.age, my.si) %>%
nest() %>%
mutate(topheight = map2(.x=my.age, .y=my.si, .f=~new.topheight(my.si=.y, my.age=.x)))
Here's the error I get:
Error in optim(par = list(topheight = 30), method = "BFGS", fn = SI_tall, :
initial value in 'vmmin' is not finite
What's going wrong? Many thanks.
Directly pass it to map2_dbl with tryCatch to handle errors.
library(dplyr)
library(purrr)
si.crossing %>%
mutate(topheight = map2_dbl(my.si, my.age,
~tryCatch(new.topheight(.x, .y), error = function(e) NA)))
Or use mapply in base R :
si.crossing$topheight <- mapply(function(x, y)
tryCatch(new.topheight(x, y),error = function(e) NA),
si.crossing$my.si, si.crossing$my.age)
Alternatively, wrap the function with possibly() from purrr, which returns a fallback value instead of raising an error:
library(purrr)
pnew.topheight <- possibly(new.topheight, otherwise = NA)
si.crossing %>%
mutate(topheight = map2_dbl(my.si, my.age, pnew.topheight))
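As for why optim() fails in the first place: a likely cause is that my.age starts at 0, and 0 raised to the negative exponent paramb2 is Inf in R, so SI_tall() is not finite for those rows. A minimal check, repeating SI_tall from the question (the values 10 and 30 are just illustrative inputs):

```r
# SI_tall as defined in the question
SI_tall <- function(topheight, age, si) {
  paramasi <- 25
  parambeta <- 7395.6
  paramb2 <- -1.7829
  refAge <- 100
  d <- parambeta * (paramasi^paramb2)
  r <- (((topheight - d)^2) + (4 * parambeta * topheight * (age^paramb2)))^0.5
  # height at reference age
  h2 <- (topheight + d + r) / (2 + (4 * parambeta * (refAge^paramb2)) / (topheight - d + r))
  abs(h2 - si)
}

# 0 raised to a negative power is Inf, which propagates through r and h2
at_zero <- SI_tall(topheight = 10, age = 0, si = 30)
# any positive age gives a finite objective value
at_positive <- SI_tall(topheight = 10, age = 0.2, si = 30)

print(is.finite(at_zero))
print(is.finite(at_positive))
```

If that is indeed the cause, starting the sequence at a positive age, e.g. seq(0.2, 110, by = 0.2), avoids the failing rows instead of masking them with NA.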
I have been stuck for hours trying to run XGBoost with R. I have a training set and a test set, each containing around 40 columns; the last column is the target column, a nominal 0/1 value. I am running this code, which I got from https://www.kaggle.com/michaelpawlus/xgboost-example-0-76178/code.
require(xgboost)
library(xgboost)
train <- read.csv(file.choose(),header = T)
test <- read.csv(file.choose(),header = T)
feature.names <- names(train)[2:ncol(train)-1]
clf <- xgboost(data = data.matrix(train[,feature.names]),
label = train$target,
nrounds = 100, # 100 is better than 200
objective = "binary:logistic",
eval_metric = "auc")
cat("making predictions in batches due to 8GB memory limitation\n")
submission <- data.frame(ID=test$ID)
submission$target1 <- NA
for (rows in test) {
submission[rows, "Succeed"] <- predict(clf, data.matrix(test[rows,feature.names]))
}
varimp_clf <- xgb.importance(feature_names=feature.names,model=clf)
xgb.plot.importance(varimp_clf)
These are the errors I am getting:
Error in xgb.get.DMatrix(data, label, missing, weight) :
xgboost: need label when data is a matrix
Error in $<-.data.frame(*tmp*, target1, value = NA) :
replacement has 1 row, data has 0
Error in predict(clf, data.matrix(test[rows, feature.names])) :
object 'clf' not found
Check your input data. Is your last column named target? It sounds like it isn't.
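One way to check this before calling xgboost() is sketched below. The data frame is a made-up stand-in for the CSV contents, and the column name "Succeed" is only a guess based on the prediction loop in the question.

```r
# Hypothetical stand-in for the data read from the CSV file
train <- data.frame(ID = 1:3, x1 = c(0.1, 0.2, 0.3), Succeed = c(0, 1, 0))

# `$` returns NULL for a column that does not exist, so `label = train$target`
# would pass no label at all
label_is_missing <- is.null(train$target)

# Explicit check on the column names
has_target <- "target" %in% names(train)

print(names(train))
print(label_is_missing)
print(has_target)
```

If the label column has a different name, either rename it to target or refer to it by its actual name in the label argument.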
I'm new to R, and this is the first time I'm using SOM.
I want to predict survival using a Self-Organizing Map.
The following is the code I used to ingest the data:
# load raw data
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
# Add a "Survived" variable to the test set to allow for combining data sets
test.survived <- data.frame(survived = rep("None", nrow(test)), test[,])
# Combine data sets
data.combined <- rbind(train, test.survived)
# Change the variables to factors
data.combined$Survived <- as.factor(data.combined$survived)
data.combined$Pclass <- as.factor(data.combined$pclass)
# Fitting the data to the SOM model
library(kohonen)
# Train SOM
som.train.1 <- data.combined[1:891, c("pclass", "title")]
som.label <- as.factor(train$survived)
table(som.train.1)
table(som.label)
som.train.1.grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
set.seed(1234)
som.model <- som(som.label,
grid=som.train.1.grid,
rlen = 100,
alpha = c(0.05, 0.01),
keep.data = TRUE,
normalizeDataLayers = TRUE)
plot(som.model)
I get an error that says: sort.list(y): 'x' must be atomic for 'sort.list'
I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when I use each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why am I getting these errors with pvclust at 3 dimensions and below, but not with hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?
At the end of function pvclust we see a line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the parameter r for the function pvclust is r=seq(.5,1.4,by=.1). However, as the output shows, this value is changed somewhere along the way:
Bootstrap (r = 0.33)...
So we get size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl,] then has only 1 row, fewer than the 2 objects needed for clustering. After adjusting r, the call returns a warning, which is possibly harmless, and output is produced:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
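The arithmetic above can be checked without pvclust. The sketch below uses a small made-up matrix with 3 rows and imitates only the resampling step with a single sampled row; it assumes, as in the excerpt of boot.hclust, that the one-row subset is what reaches the distance computation.

```r
# Made-up data: 3 dimensions (rows) and 4 observations (columns)
mat <- cbind(A = c(9000, 2, 238), B = c(10000, 6, 224),
             C = c(1001, 3, 259), D = c(9580, 94, 51))

n <- nrow(mat)
size <- round(n * 0.33, digits = 0)  # number of rows drawn per bootstrap

# Selecting a single row drops the matrix to a plain numeric vector
one_row <- mat[1, ]

# Euclidean case: the transposed vector is one object, so hclust refuses it
euclid_err <- tryCatch(
  hclust(dist(t(one_row)), method = "average"),
  error = function(e) conditionMessage(e)
)

# Correlation case: cor() needs a matrix-like x when y is not supplied
cor_err <- tryCatch(
  cor(one_row),
  error = function(e) conditionMessage(e)
)

print(size)
print(euclid_err)
print(cor_err)
```

Both messages match the errors reported in the question, which is consistent with the one-row resample being the cause.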
Let me know if the results are satisfactory.