Automatically select SOM_GRID in R

I am trying to run the SOM algorithm in R using the kohonen package. I have to define the grid dimensions xdim and ydim manually, as in the code below:
som_grid <- somgrid(xdim=5, ydim=6, topo="hexagonal")
som_model <- som(data_train_matrix,
grid=som_grid,
keep.data = TRUE)
My questions:
Is there a method that automatically selects the dimensions based on the data?
Can anyone explain the logic behind this selection, so that we can write a function in R that identifies the dimensions automatically?

I'm not very experienced with R, but I think this can help you:
#Consider a dummy xdim and ydim Data.
x<-c(seq(0,5,by=0.5))
y<-c(seq(0,6,by=0.5))
## Determine the sector starting and end points.
a<-rbind(1,2,3,4,5)
b<-rbind(1,2,3,4,5,6)
sectors<-cbind(a,b)
sectors
## See the table of the sector.
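As far as I know, the kohonen package has no built-in way to choose the grid size. One rule of thumb, which I understand to be the default heuristic of the MATLAB SOM Toolbox, is to use about 5 * sqrt(N) units for N observations and to set the ratio of the side lengths from the two largest eigenvalues of the data covariance. The sketch below is only an illustration of that heuristic: the function name suggest_som_dims and its units_factor argument are hypothetical, and it assumes an all-numeric data set with no constant columns.

```r
# Hypothetical helper: suggest SOM grid dimensions from the data.
suggest_som_dims <- function(data, units_factor = 5) {
  data <- as.matrix(data)
  # Rule of thumb: roughly 5 * sqrt(N) map units.
  n_units <- ceiling(units_factor * sqrt(nrow(data)))
  # Side ratio from the two largest eigenvalues of the scaled covariance.
  ev <- eigen(cov(scale(data)), symmetric = TRUE, only.values = TRUE)$values
  ratio <- if (length(ev) >= 2 && ev[2] > 0) sqrt(ev[1] / ev[2]) else 1
  ydim <- max(1, round(sqrt(n_units / ratio)))
  xdim <- max(1, ceiling(n_units / ydim))
  c(xdim = xdim, ydim = ydim)
}

dims <- suggest_som_dims(iris[, 1:4])
dims
# With kohonen loaded, the result could then be used as:
# som_grid <- somgrid(xdim = dims[["xdim"]], ydim = dims[["ydim"]], topo = "hexagonal")
```

Treat the result as a starting point only; it is still worth comparing a few grid sizes on your own data.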

Related

How to create a custom population generator with the R GA package?

I'm working on an optimisation problem using R's GA package with a 'permutation' type genetic algorithm. I need to introduce some parameters for how the initial population is generated before parent selection and crossover. The reason is that only a general framework of gene arrangements in the chromosomes can work at all, but at the same time I need a lot of randomization to find local maxima, not just to test a few suggested solutions through the suggestions argument of the ga() function.
If you check out the R GA package github repo, you can see there's a population generator function on line 576 that does the following:
gaperm_Population_R <- function(object)
{
  int <- seq.int(object@lower, object@upper)
  n <- length(int)
  population <- matrix(NA, nrow = object@popSize, ncol = n)
  for(i in 1:object@popSize)
    population[i,] <- sample(int, replace = FALSE)
  return(population)
}
I want to create a new function that is quite similar, but which takes some pre-calculated parameters pop_parms, and then call that function through the population argument of the ga() function, instead of using the default function, population = gaControl(type)$population.
My new function would look like this, with the new pop_parms argument:
gaperm_Feasible_Pop <- function(object, pop_parms)
{
  int <- seq.int(object@lower, object@upper)
  n <- length(int)
  population <- matrix(NA, nrow = object@popSize, ncol = n)
  for(i in 1:object@popSize)
    population[i,] <- sapply(pop_parms, function(x) sample(x, replace = FALSE)
    )
  return(population)
}
Of course, when I try to use this function, the package doesn't know how to pass through the object parameter.
Is there anyone who could help me get this function to work, or perhaps take a different approach?
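One possible approach is to bind pop_parms in a closure. This assumes ga() calls the population function with the GA object as its only argument, as the default generator quoted above suggests. The sketch below is not taken from the GA package: make_feasible_pop and the MockGA class are hypothetical names, the mock object exists only so the example runs without GA installed, and pop_parms is assumed to be a list of integer vectors that together cover lower:upper.

```r
library(methods)

# Hypothetical factory: returns a function(object) that remembers pop_parms.
make_feasible_pop <- function(pop_parms) {
  function(object) {
    n <- length(seq.int(object@lower, object@upper))
    population <- matrix(NA_integer_, nrow = object@popSize, ncol = n)
    for (i in seq_len(object@popSize)) {
      # Shuffle each block separately, then concatenate the blocks.
      # x[sample.int(length(x))] also behaves correctly for length-1 blocks.
      population[i, ] <- unlist(lapply(pop_parms, function(x) x[sample.int(length(x))]))
    }
    population
  }
}

# Mock object with only the slots used here (stand-in for the real GA object).
setClass("MockGA", representation(lower = "numeric", upper = "numeric", popSize = "numeric"))
obj <- new("MockGA", lower = 1, upper = 6, popSize = 4)

pop_parms <- list(1:3, 4:6)  # assumed framework: genes 1-3 always precede genes 4-6
pop <- make_feasible_pop(pop_parms)(obj)
pop
```

With the real package, the generator would presumably be passed as population = make_feasible_pop(pop_parms) in the ga() call, so that ga() only ever has to supply object.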

Save non-SparkDataFrame from Azure Databricks to local computer as .RData

In Databricks (SparkR), I run the batch algorithm of the self-organizing map in parallel from the kohonen package as it gives me considerable reductions in computation time as opposed to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine to continue working with the results (create plots etc.) in a way that is not available in Databricks. I know how to save & download a SparkDataFrame to csv:
sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")
However, I am not sure how to do this for a 'regular' R list object.
Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?
EDIT :
library(kohonen)
# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")
# Collect SDF to RDF as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)
# Change rdf to matrix as is required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo="hexagonal",
neighbourhood.fct="gaussian")
set.seed(1)
som.model <- som(rdf.som, grid=som.grid, rlen=10, alpha=c(0.05,0.01), keep.data = TRUE, dist.fcts = "euclidean", mode = "online")
Any help is very much appreciated!
If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.
Here is an example of fitting kohonen models, one for each iris species:
library(SparkR)
library(kohonen)
fit_model <- function(df) {
library(kohonen)
grid_size <- ceiling(nrow(df) ^ (1/2.5))
som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = TRUE)
som_model <- som(data.matrix(df), grid = som_grid)
som_model
}
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models
The models variable contains a list of kohonen models fitted in parallel:
$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
Then you can save/serialise the R object as usual:
saveRDS(models, file="/dbfs/kohonen_models.rds")
Note that any file stored under the /dbfs/ path will be available through Databricks DBFS, accessible with the CLI or API.
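Once the file has been downloaded, readRDS() restores the object on the local machine. Here is a minimal round-trip sketch, using a plain list as a stand-in for the fitted models and temporary files in place of the /dbfs/ path (both are assumptions for illustration). If you specifically need the .RData format, save() and load() work the same way.

```r
# Stand-in for the list of fitted models.
models <- list(setosa = list(grid = c(5, 5)), versicolor = list(grid = c(5, 5)))

# .rds: stores a single object, restored under any name you like.
rds_path <- tempfile(fileext = ".rds")
saveRDS(models, file = rds_path)
restored <- readRDS(rds_path)

# .RData: stores named objects, restored under their original names.
rdata_path <- tempfile(fileext = ".RData")
save(models, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)

identical(models, restored)
identical(models, env$models)
```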

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán in answer to the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't find out how to fix it.
Below I provide an example (using 4 cores) that generates 2 simulations. After running the code, you will see that the simulated maps derived from parallelization show some artifacts (like vertical lines) and differ from the ones obtained using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat, plus maptools (called as maptools::spRbind). If any of the library() lines do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
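To see the difference between the two ways of splitting, here is a small sketch with 12 indices and 4 workers (toy numbers chosen for illustration, not the 100 x 100 grid from the question):

```r
library(parallel)

n <- 12  # number of grid cells (toy value)
k <- 4   # number of workers (toy value)

# Original approach: the grouping factor 1:k is recycled, so indices are interleaved.
interleaved <- split(x = 1:n, f = 1:k)
interleaved[[1]]

# splitIndices() hands each worker one consecutive block instead.
consecutive <- splitIndices(n, k)
consecutive[[1]]
```

With interleaved indices each worker simulates a scattered subset of cells, so neighbouring cells come from independent simulations. Consecutive blocks keep each worker's cells spatially contiguous.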
Original answer, which covers only a minor effect: it still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging, it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.

RStudio - object not found - kohonen pack

I'm trying to write a script for a SOM map. It comes from this tutorial. My problem is that it doesn't work in RStudio. I have this code:
require(kohonen)
# Create a training data set (rows are samples, columns are variables)
# Here I am selecting a subset of my variables available in "data"
data_train <- data[, c(2,4,5,8)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(data_train))
# Create the SOM Grid - you generally have to specify the size of the
# training grid prior to training the SOM. Hexagonal and rectangular
# topologies are possible
som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
# Finally, train the SOM, options for the number of iterations,
# the learning rates, and the neighbourhood are available
som_model <- som(data_train_matrix,
grid=som_grid,
rlen=500,
alpha=c(0.05,0.01),
keep.data = TRUE )
plot(som_model, type="changes")
If I try to run this script, it gives this error:
Error in supersom(list(X), ...) : object 'data_train_matrix' not found
> plot(som_model, type="changes")
Error in plot(som_model, type = "changes") : object 'som_model' not found
I don't understand this. What does it mean that there is no data_train_matrix? I define data_train_matrix a few lines before. When I run just the first 3 lines of code (up to data_train_matrix <- as.matrix(scale(data_train))), it gives this error:
data_train_matrix <- as.matrix(scale(data_train))
Error in scale(data_train) : object 'data_train' not found
and when I run just the first two lines it gives:
data_train <- data[, c(2,4,5,8)]
Error in data[, c(2, 4, 5, 8)] :
object of type 'closure' is not subsettable
How is it possible that this code works in the tutorial while I get so many errors using the same code?
It looks like the error comes from having no original data-frame-like object named "data". As a result, the variable "data_train", a subset of "data", was never assigned.
You first need to make the commented line that creates the training data set work:
# Create a training data set (rows are samples, columns are variables)
# Here I am selecting a subset of my variables available in "data"
data_train <- data[, c(2,4,5,8)]
R also has a function named "data", and since no data frame of that name exists, that is what R finds. Functions are not subsettable, which explains the 'closure' error.
If you create some data at the very beginning, everything should work.
data <- data.frame(matrix(rnorm(5000), ncol = 10))  # 500 rows of dummy data, more than the 400 grid units
data_train <- data[, c(2,4,5,8)]
# the rest of the script as written
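The failure can be reproduced in a fresh R session. The sketch below first shows that data is a function when no data frame of that name exists, then loads dummy data under a different name (survey_df is a hypothetical name) so it does not shadow the built-in function:

```r
# In a fresh session, `data` resolves to the function utils::data().
is.function(data)

# Subsetting a function fails; capture the message instead of stopping.
msg <- tryCatch(data[, c(2, 4, 5, 8)], error = function(e) conditionMessage(e))
msg

# Using a distinct name avoids the clash with the built-in function.
survey_df <- data.frame(matrix(rnorm(5000), ncol = 10))
data_train <- survey_df[, c(2, 4, 5, 8)]
dim(data_train)
```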

How can I find out which data record goes into which cluster in R using kohonen and kmeans

I have clustered my data with a SOM and kmeans:
install.packages("kohonen")
library(kohonen)
set.seed(7)
som_grid <- somgrid(xdim = 8, ydim=8, topo="hexagonal")
som_model <- som(umfrage_veraendert_kurz,
grid=som_grid,
rlen=500,
alpha=c(0.05,0.01),
keep.data = TRUE )
From my som_model I take the "codes" and cluster them with kmeans:
mydata <- som_model$codes
clusterzentren <- kmeans(mydata, centers=3)
head(clusterzentren)
I now have 3 clusters, but I don't know which data record goes to which cluster. How can I find that out?
Thanks for any help
The return value of kmeans is an S3 object which contains not only the centers, but also the cluster assignment (the cluster component).
See the R manual of kmeans for details.
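Since kmeans was run on the codebook vectors, its cluster component holds one label per SOM unit, not per data record. A record's cluster is therefore the cluster of its winning unit, which kohonen stores in som_model$unit.classif when keep.data = TRUE. The sketch below uses simulated stand-ins (codes, unit_classif) so it runs without the original data. Depending on the kohonen version, the real codebook may have to be extracted as som_model$codes[[1]].

```r
set.seed(7)

# Stand-ins: 64 codebook vectors (8 x 8 grid) and the winning unit of 200 records.
codes <- matrix(rnorm(64 * 4), nrow = 64)          # in practice: the SOM codebook matrix
unit_classif <- sample(1:64, 200, replace = TRUE)  # in practice: som_model$unit.classif

# Cluster the units, then look up each record's unit to get its cluster.
km <- kmeans(codes, centers = 3)
record_cluster <- km$cluster[unit_classif]

head(record_cluster)
table(record_cluster)
```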
