I've run my k-means test from an Excel data source and now want to get the results out into Excel. I've tried the following code, but all I get is a blank worksheet. I'm relatively new to R, so I imagine it's something simple that I'm missing - please help!
set.seed(123)
kmeansresults<-kmeans(df[,7], 5, iter.max = 50, nstart = 100)
x<-kmeansresults$clusters
write.csv(x, "clustering results.csv")
Try the following:
data("USArrests")
m <- scale(USArrests)
set.seed(123)
km_res <- kmeans(m, 4, nstart=25)
x <- km_res$cluster
write.csv(x, "/Users/user/Desktop/foo.csv")
I guess your problem was not with writing the CSV file but with calling kmeansresults$clusters, which should be kmeansresults$cluster. In R you can inspect the structure of an object with str(km_res) to see what is "inside": there is no component clusters, only cluster.
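Applied to the original code, a minimal sketch of the fix (using the df and kmeans call from the question) would be:
set.seed(123)
kmeansresults <- kmeans(df[, 7], 5, iter.max = 50, nstart = 100)
str(kmeansresults)                      # lists the available components, including $cluster
x <- kmeansresults$cluster              # note: cluster, not clusters
write.csv(x, "clustering results.csv")  # now writes one cluster label per observation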
In Databricks (SparkR), I run the batch algorithm of the self-organizing map from the kohonen package in parallel, as it gives me considerable reductions in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine to continue working with the results (create plots, etc.) in a way that is not available in Databricks. I know how to save and download a SparkDataFrame to CSV:
sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")
However, I am not sure how to do this for a 'regular' R list object.
Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?
EDIT :
library(kohonen)
# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")
# Collect SDF to RDF, as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)
# Change rdf to matrix as is required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo="hexagonal",
neighbourhood.fct="gaussian")
set.seed(1)
som.model <- som(rdf.som, grid=som.grid, rlen=10, alpha=c(0.05,0.01), keep.data = TRUE, dist.fcts = "euclidean", mode = "online")
Any help is very much appreciated!
If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.
Here is an example of fitting kohonen models, one for each iris species:
library(SparkR)
library(kohonen)
fit_model <- function(df) {
library(kohonen)
grid_size <- ceiling(nrow(df) ^ (1/2.5))
som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = T)
som_model <- som(data.matrix(df), grid = som_grid)
som_model
}
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models
The models variable contains a list of kohonen models fitted in parallel:
$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
Then you can save/serialise the R object as usual:
saveRDS(models, file="/dbfs/kohonen_models.rds")
Note that any file stored under the /dbfs/ path will be available through Databricks' DBFS, accessible with the CLI or the API.
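To continue working locally, one option (assuming the Databricks CLI is configured) is to copy the file down with databricks fs cp dbfs:/kohonen_models.rds ./kohonen_models.rds and then restore the object in a local R session:
# minimal sketch: read the serialised list of models back in locally
library(kohonen)                         # needed for the kohonen plot/summary methods
models <- readRDS("kohonen_models.rds")
class(models$setosa)                     # "kohonen"
plot(models$setosa)                      # continue with plots etc.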
So I'm using the SpadeR package in R to test similarities of pairs of abundance data for my thesis. Does anyone have an idea how to turn an output into a matrix for further analysis?
Here is the code I'm working with:
CompAB <- mydata %>% select(2, 3)
dataAB <- data.matrix(CompAB, rownames.force = NA)
SimilarityPair(dataAB, datatype = c("abundance"), nboot = 1000)
I want to do this so I can later loop over the outputs from a randomisation run and put them into a matrix for analysis.
Have you tried using the matrix() command?
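Building on that, here is a minimal sketch of the looping idea. The randomisation step and the extraction line are placeholders: inspect one SimilarityPair() result with str() first to locate the estimate you want to keep, and substitute it below.
library(SpadeR)
library(dplyr)
n_runs  <- 100                                  # number of randomisation runs (assumed)
results <- matrix(NA_real_, nrow = n_runs, ncol = 1,
                  dimnames = list(NULL, "estimate"))
for (i in seq_len(n_runs)) {
  shuffled <- mydata                            # placeholder: randomise/resample mydata here
  dataAB   <- data.matrix(shuffled %>% select(2, 3), rownames.force = NA)
  out      <- SimilarityPair(dataAB, datatype = "abundance", nboot = 1000)
  results[i, "estimate"] <- NA_real_            # replace with the value located via str(out)
}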
I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't figure out how to fix it.
Below I provide an example (using 4 cores) generating 2 simulations. You will see that, after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines) and are different from the ones using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines does not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual-core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
Original answer, which addresses only a minor effect: it still might make sense to fix the seed with set.seed for sequential runs and clusterSetRNGStream for parallel processing.
From what I have read about kriging, it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
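A minimal sketch of how that seeding could look with the objects from the question (the seed value is arbitrary):
# sequential run: fix the RNG seed before krige()
set.seed(123)
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m, nmax = 12, nsim = 2)
# parallel run: give the cluster a reproducible RNG stream before parLapply()
cl <- makeCluster(no_cores)
clusterSetRNGStream(cl, iseed = 123)
# ... clusterExport(), clusterEvalQ() and parLapply() as in the question ...
stopCluster(cl)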
I have worked with TensorFlow in Python several times. I always use Keras because it has a very nice syntax.
Currently I have to implement a multilayer perceptron in R.
In the tutorial at:
https://tensorflow.rstudio.com/tutorial_mnist_pros.html
They directly load the MNIST dataset in their own format:
input_dataset <- tf$examples$tutorials$mnist$input_data
mnist <- input_dataset$read_data_sets("MNIST-data", one_hot = TRUE)
And they use its helper functions to obtain the batches:
for (i in 1:1000) {
batches <- mnist$train$next_batch(100L)
batch_xs <- batches[[1]]
batch_ys <- batches[[2]]
sess$run(train_step,feed_dict = dict(x = batch_xs, y_ = batch_ys))
}
I have my dataset in a CSV file and I load it into a data frame, but I have absolutely no idea how to transform it into a TensorFlow dataset in order to use functions like next_batch.
Any tips on how to solve this problem?
Thank you very much in advance
See the Dataset class and some helper functions in base.py in TensorFlow.
Reference:
tensorflow/tensorflow/contrib/learn/python/learn/datasets/base.py
tensorflow/tensorflow/contrib/learn/python/learn/datasets/mnist.py
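If porting that code is more than you need, a simple workaround is to do the batching yourself on the data frame. This is only a sketch, not TensorFlow's own API: the file name and the assumption that the label sits in the last column are made up, and sess, train_step, x and y_ are the objects defined in the tutorial.
df <- read.csv("my_data.csv")                   # hypothetical CSV file
n  <- nrow(df)
batch_size <- 100L
for (i in 1:1000) {
  idx      <- sample(n, batch_size)             # draw a random mini-batch of rows
  batch_xs <- as.matrix(df[idx, -ncol(df)])     # features (all but the last column)
  batch_ys <- as.matrix(df[idx,  ncol(df)])     # labels (last column, assumed)
  sess$run(train_step, feed_dict = dict(x = batch_xs, y_ = batch_ys))
}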
I have several large rasters that I want to process in a PCA (to produce summary rasters).
I have seen several examples whereby people seem to be simply calling prcomp or princomp. However, when I do this, I get the following error message:
Error in as.vector(data): no method for coercing this S4 class to a vector
Example code:
files<-list.files() # a set of rasters
layers<-stack(files) # using the raster package
pca<-prcomp(layers)
I have tried using a raster brick instead of a stack, but that doesn't seem to be the issue. What method do I need to provide to the command so that it can convert the raster data to vector format? I understand that there are ways to sample the raster and run the PCA from that, but I would really like to understand why the above method is not working.
Thanks!
The above method is not working simply because prcomp does not know how to deal with a raster object. It only knows how to deal with plain vectors and matrices, and coercing the S4 raster object to one does not work, hence the error.
What you need to do is read each of your files into a vector, and put each of the rasters in a column of a matrix. Each row will then be a time series of values at a single spatial location, and each column will be all the pixels at a certain time step. Note that the exact spatial coordinates are not needed in this approach. This matrix serves as the input of prcomp.
Reading the files can be done with readGDAL, and as.data.frame can be used to cast the spatial data to a data.frame.
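A minimal sketch of that matrix-based approach, assuming single-band rasters and the files vector from the question:
library(rgdal)
# one column per raster file; rows are pixels, in the same grid order for every file
mat <- sapply(files, function(f) as.data.frame(readGDAL(f))[, 1])
pca <- prcomp(mat, scale. = TRUE)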
Answer to my own question: I ended up doing something slightly different: rather than using every raster cell as input (very large dataset), I took a sample of points, ran the PCA and then saved the output model so that I could make predictions for each grid cell…maybe not the best solution but it works:
rasters <- stack(myRasters)
sr <- sampleRandom(rasters, 5000) # sample 5000 random grid cells
# run PCA on random sample with correlation matrix
# retx=FALSE means don't save PCA scores
pca <- prcomp(sr, scale=TRUE, retx=FALSE)
# write PCA model to file
dput(pca, file=paste("./climate/", name, "/", name, "_pca.csv", sep=""))
x <- predict(rasters, pca, index=1:6) # create new rasters based on PCA predictions
There is a rasterPCA function in the RStoolbox package: http://bleutner.github.io/RStoolbox/rstbx-docu/rasterPCA.html
For example:
library('raster')
library('RStoolbox')
rasters <- stack(myRasters)
pca1 <- rasterPCA(rasters)
pca2 <- rasterPCA(rasters, nSamples = 5000) # sample 5000 random grid cells
pca3 <- rasterPCA(rasters, norm = FALSE) # without normalization
Here is a working solution:
library(raster)
filename <- system.file("external/rlogo.grd", package="raster")
r1 <- stack(filename)
pca<-princomp(r1[], cor=T)
res<-predict(pca,r1[])
Display result:
r2 <- raster(filename)
r2[]<-res[,1]
plot(r2)
Yet another option would be to extract the values from the raster stack, i.e.:
rasters <- stack(my_rasters)
values <- getValues(rasters)
pca <- prcomp(values, scale = TRUE)
Here is another approach that expands on the getValues approach proposed by @Daniel. The result is a raster stack. The index (idx) references non-NA positions so that NA values are accounted for.
library(raster)
r <- stack(system.file("external/rlogo.grd", package="raster"))
r.val <- getValues(r)
idx <- which(!is.na(r.val))
pca <- princomp(r.val, cor=T)
ncomp <- 2 # first two principal components
r.pca <- r[[1:ncomp]]
for(i in 1:ncomp) { r.pca[[i]][idx] <- pca$scores[,i] }
plot(r.pca)