interactively work with xy point plot clusters - group manipulation in R

I have a large number of pairs of X and Y variables along with a cluster membership column. The cluster membership (group) may not always be right (a limitation of the clustering algorithm), so I want to interactively visualize the clusters and change the cluster memberships of points I identify.
I tried rggobi, and the following is as far as I was able to get (I do not mean that I need to use rggobi / ggobi; if better options are available you are welcome to suggest them).
# data
set.seed (1234)
c1 <- rnorm (40, 0.1, 0.02); c2 <- rnorm (40, 0.3, 0.01)
c3 <- rnorm (40, 0.5, 0.01); c4 <- rnorm (40, 0.7, 0.01)
c5 <- rnorm (40, 0.9, 0.03)
Yv <- 0.3 + rnorm (200, 0.05, 0.05)
myd <- data.frame (Xv = round (c(c1, c2, c3, c4, c5), 2), Yv = round (Yv, 2),
                   cltr = factor (rep(1:5, each = 40)))
require(rggobi)
g <- ggobi(myd)
display(g[1], vars=list(X="Xv", Y="Yv"))
You can see five clusters, colored differently by the cltr variable. I manually identified the points that are outliers and I want to set their value to NA in the cltr variable. Is there any easy way to disassociate such memberships and write the result to a file?

You could try identify to get the indices of the points manually:
## use base::plot
plot(myd$Xv, myd$Yv, col=myd$cltr)
exclude <- identify(myd$Xv, myd$Yv) ## left click on the points you want to exclude (right click to stop/finish)
myd$cltr[exclude] <- NA
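To write the result to a file, as the question asks, one simple option (the file name here is just an example) is:
## save the updated cluster memberships to CSV (example file name)
write.csv(myd, "myd_clusters.csv", row.names = FALSE)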

Related

A function in R for plotting locations with consistent high values in multiple raster data

I have four raster files of the same extent. The pattern of low and high values differs in each raster. I would like to plot areas in the extent (boundary) with values greater than x (where x is some number). Can anyone help me with an R function to do this? Please find below sample code for the raster data. In this example, let's say I want to plot and identify cells with values greater than 0.4 in all four rasters. Instead of four separate images I want one image that shows the cells with values greater than 0.4 in all of them - more like overlaying the rasters and identifying the cells that exceed 0.4 in all the images.
library(raster)
r1 <- raster(nrows = 1, ncols = 1, res = 0.5, xmn = -1.5, xmx = 1.5, ymn = -1.5, ymx = 1.5, vals = 0.3)
rr <- lapply(1:4, function(i) setValues(r1,runif(ncell(r1))))
par(mfrow = c(2,2))
plot(rr[[1]])
plot(rr[[2]])
plot(rr[[3]])
plot(rr[[4]])
Thank you.
You can combine raster images with &. First set a threshold and apply it to each individual raster:
threshold <- 0.4
r2 = lapply(rr, `>`, threshold)
And then combine them, retaining only fields which are all greater than the threshold:
summary = Reduce(`&`, r2)
plot(summary)
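If you prefer to work with a single object, a roughly equivalent sketch (reusing rr and threshold from above) stacks the rasters and takes the cell-wise minimum of the thresholded layers:
s <- raster::stack(rr)
combined <- raster::calc(s > threshold, fun = min)  # 1 only where every layer exceeds the threshold
plot(combined)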
This is a simple solution that can easily be generalized to a different number of input rasters:
myplot <- function(input, threshold) {
  # threshold each raster, then plot the results in a grid
  input <- lapply(input, function(x) x > threshold)
  n <- length(input)
  par(mfrow = c(ceiling(n / 2), 2))
  for (r in input) plot(r)
}
myplot(rr, 0.4)

Density based clustering that allows user to specify number of clusters

I have data that consists of roughly 100,000 points on a 2-d graph. Each point has X and Y coordinates. I'm looking for an algorithm that will cluster these points based on density but I want to specify the number of clusters.
I originally tried K-Means since this would allow me to specify the number of clusters. However, my data naturally "clumps" into ridges. K-Means would inevitably bisect some of these ridges. DBSCAN seems like a better fit simply due to the shape of my data, but with DBSCAN I can't specify the number of clusters I'd like.
Essentially what I'm trying to find is an algorithm that will optimally cluster the graph into N groups based on density, where N is supplied by me. At this point I don't care where it's implemented (R, Python, FORTRAN...).
Any direction you can provide would be much appreciated.
In an area of high density, the points tend to be close together, so clustering on the (Euclidean) distance may give similar results (though not always).
For example, with these three normals in 2 dimensions:
x1 <- mnormt::rmnorm(200, c(10,10), matrix(c(20,0,0,.1), 2, 2))
x2 <- mnormt::rmnorm(100, c(10,20), matrix(c(20,0,0,.1), 2, 2))
x3 <- mnormt::rmnorm(300, c(23, 15), matrix(c(.1,0,0,35), 2, 2))
xx <- rbind(x1, x2, x3)
plot(xx, col=rep(c("grey10","pink2", "green4"), times=c(200,100,300)))
We can apply different clustering algorithms:
# hierarchical
clustering <- hclust(dist(xx, method = "euclidean"),
                     method = "ward.D")
h.cl <- cutree(clustering, k=3)
# K-means and dbscan
k.cl <- kmeans(xx, centers = 3L)
d.cl <- dbscan::dbscan(xx, eps = 1)
And we see that, in this particular example, hierarchical clustering and DBSCAN produce similar results, whereas K-means cuts one of the clusters in the wrong place.
opar <- par(mfrow=c(3,1), mar = c(1,1,1,1))
plot(xx, col = k.cl$cluster, main="K-means")
plot(xx, col = d.cl$cluster, main="DBSCAN")
plot(xx, col = h.cl, main="Hierarchical")
par(opar)
Of course, there is no guarantee this will work on your particular data.

Identify all local extrema of a fitted smoothing spline via R function 'smooth.spline'

I have a 2-dimensional data set.
I use R's smooth.spline function to smooth my points graph, following an example in this article:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html
This gives me a spline graph similar to the green line in this picture.
I'd like to know the X values, where the first derivative of the smoothing spline equals zero (to determine exact minimum or maximum).
My problem is that my initial dataset (or a dataset that I could auto-generate) to feed into the predict() function does not contain such exact X values that correspond to the smoothing spline extrema.
How can I find such X values?
Here is the picture of the first derivative of the green spline line above, but the X coordinates of the extrema are still only approximate.
My approximate R script to generate the pictures looks like the following
sp1 <- smooth.spline(df)
pred.prime <- predict(sp1, deriv=1)
pred.second <- predict(sp1, deriv=2)
d1 <- data.frame(pred.prime)
d2 <- data.frame(pred.second)
dfMinimums <- d1[abs(d1$y) < 1e-4, c('x','y')]
I think that there are two problems here:
1. You are using the original x-values, and they are spaced too far apart.
2. Because of the wide spacing of the x's, your threshold for when you consider the derivative "close enough" to zero is too high.
Here is basically your code but with many more x values and requiring smaller derivatives. Since you do not provide any data, I made a coarse approximation to it that should suffice for illustration.
## Coarse approximation of your data
x = runif(300, 0,45000)
y = sin(x/5000) + sin(x/950)/4 + rnorm(300, 0,0.05)
df = data.frame(x,y)
sp1 <- smooth.spline(df)
Spline code
Sx = seq(0,45000,10)
pred.spline <- predict(sp1, Sx)
d0 <- data.frame(pred.spline)
pred.prime <- predict(sp1, Sx, deriv=1)
d1 <- data.frame(pred.prime)
Mins = which(abs(d1$y) < mean(abs(d1$y))/150)
plot(df, pch=20, col="navy")
lines(sp1, col="darkgreen")
points(d0[Mins,], pch=20, col="red")
The extrema look pretty good.
plot(d1, type="l")
points(d1[Mins,], pch=20, col="red")
The points identified look like zeros of the derivative.
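If you want the stationary points located more precisely than the grid spacing allows, one option (a sketch, not part of the answer above; it reuses sp1 and Sx from the code above) is to polish each sign change of the predicted first derivative with uniroot:
## refine each extremum by root-finding on the predicted first derivative
deriv1 <- function(x) predict(sp1, x, deriv = 1)$y
sgn <- sign(deriv1(Sx))                  # sign of the derivative on the fine grid
crossings <- which(diff(sgn) != 0)       # grid intervals containing a sign change
ext_x <- sapply(crossings, function(i)
  uniroot(deriv1, lower = Sx[i], upper = Sx[i + 1])$root)
ext_y <- predict(sp1, ext_x)$y
cbind(ext_x, ext_y)                      # stationary points, up to uniroot's tolerance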
You can use my R package SplinesUtils: https://github.com/ZheyuanLi/SplinesUtils, which can be installed by
devtools::install_github("ZheyuanLi/SplinesUtils")
The functions to be used are SmoothSplineAsPiecePoly and solve. I will just use the example from the documentation.
library(SplinesUtils)
## a toy dataset
set.seed(0)
x <- 1:100 + runif(100, -0.1, 0.1)
y <- poly(x, 9) %*% rnorm(9)
y <- y + rnorm(length(y), 0, 0.2 * sd(y))
## fit a smoothing spline
sm <- smooth.spline(x, y)
## coerce "smooth.spline" object to "PiecePoly" object
oo <- SmoothSplineAsPiecePoly(sm)
## plot the spline
plot(oo)
## find all stationary / saddle points
xs <- solve(oo, deriv = 1)
#[1] 3.791103 15.957159 21.918534 23.034192 25.958486 39.799999 58.627431
#[8] 74.583000 87.049227 96.544430
## predict the "PiecePoly" at stationary / saddle points
ys <- predict(oo, xs)
#[1] -0.92224176 0.38751847 0.09951236 0.10764884 0.05960727 0.52068566
#[7] -0.51029209 0.15989592 -0.36464409 0.63471723
points(xs, ys, pch = 19)
One caveat I found in the @G5W implementation is that it sometimes returns multiple records close together around an extremum instead of a single one. On the diagram they cannot be seen, since they all effectively fall onto one point.
The following snippet from here keeps only a single point per extremum, the one with the minimum absolute value of the first derivative (df is assumed to contain the fitted y values and the first derivative d1 as columns):
library(tidyverse)
df2 <- df %>%
  group_by(round(y, 4)) %>%
  filter(abs(d1) == min(abs(d1))) %>%
  ungroup() %>%
  select(-5)

Consistent Cluster Order with Kmeans in R

This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.
The background is, I have a database of information on different cities, so like name, population, pollution, crime, etc by year. I'm querying it to aggregate the data on a per-city basis and outputting the result to a table. That works fine.
The next step is I'm running the kmeans() function in R on the data set to find clusters, in testing I've found that 5 clusters is almost always a good choice via the "elbow method".
The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. So I don't want to identify row 2 with "cluster 5", I want to say "low population, high crime, low income".
If R would output the clusters in the same order, say having cluster 5 always equate to the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:
> a = kmeans(city_date,centers=5)
> b = kmeans(city_date,centers=5)
> c = kmeans(city_date,centers=5)
Then run this code:
a$centers
b$centers
c$centers
The clusters will all contain the same data set, but the cluster number will be different. So if I have a mapping table in SQL that has cluster number and interpretation, it won't work, because when I run it one day it might have the "low population, high crime, low income" cluster as 5, and the next it might be 2, the next 4, etc.
What I'm trying to figure out is if there is a way to keep the output consistent. The data set gets updated so it won't even be the same every time, and since R doesn't keep the cluster order consistent even with the same data set, I am wondering if it will be possible at all.
Thanks for any help anyone can provide. On my end my current idea is to output the $centers data to a SQL table, then order the table by the various metrics, each time the one with the highest/lowest getting tagged as such, and then concatenating the results to tag the level. This may work but isn't very elegant.
I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:
library(dplyr)
# create a random data frame
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
# use kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers
# order the centers
centers <- sort(centers)
# call kmeans again but this time passing the centers calculated in the previous step
clusteridx <- kmeans(df$obs, centers = centers)$cluster
Not very elegant, but it works. The clusteridx vector will always return the cluster number based on the centers in ascending order.
This can also be collapsed into just one line if you prefer:
clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster
Usually k-means is initialized randomly a few times to avoid local minima. If you want the resulting clusters ordered, you have to order them manually after the k-means algorithm finishes.
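A minimal sketch of what that manual ordering could look like (toy one-dimensional data; clusters relabelled so that 1 is the one with the smallest center):
obs <- sample(0:500, 30, replace = TRUE)
k <- kmeans(obs, centers = 3)
relabeled <- match(k$cluster, order(k$centers))  # 1 = smallest center, 3 = largest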
I haven't done this myself so I am not sure it will work, but kmeans has the parameter:
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
If you know basically where the clusters should be (perhaps by getting the cluster centers from a dataset you are matching to), you could use that to initialize the model. That would make the starting locations non-random, so the clusters should stay in the same order. Also, as an added benefit, initializing the cluster centers close to where they will end up should speed up your clustering.
Edit
I just checked using the data from the kmeans example, but initializing with the first center at (1,1) and the second at (0,0) (the means of the distributions used to make the clusters), as below.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, matrix(c(1,0,1,0),ncol=2)))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
After repeated runs, I found that the first cluster was always in the top right and the second in the bottom left, whereas initializing with just the number 2 caused them to switch back and forth. If you have some approximate starting values for your clusters (i.e. a quantification of "low population, high crime, low income"), that could be your initialization and give you the results you want.
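For illustration only, here is a hedged sketch of that idea with made-up variables and approximate centers (all names and values are hypothetical):
set.seed(42)
toy <- data.frame(pop   = c(rnorm(50, 1), rnorm(50, 5)),
                  crime = c(rnorm(50, 5), rnorm(50, 1)))
init <- rbind("low pop, high crime" = c(1, 5),
              "high pop, low crime" = c(5, 1))
cl <- kmeans(toy, centers = init)                 # cluster i corresponds to row i of init
toy$interpretation <- rownames(init)[cl$cluster]
table(toy$interpretation)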
This function runs kmeans with 1-dimensional input and returns a normal "kmeans" object with sensibly numbered clusters, without having to run the kmeans twice.
ordered_kmeans = function(x, centers, iter.max = 10, nstart = 1,
                          algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                                        "MacQueen"),
                          trace = FALSE,
                          desc = TRUE) {
  if (NCOL(x) > 1) {
    stop("only one-dimensional inputs are allowed")
  }
  k = kmeans(x = x, centers = centers, iter.max = iter.max, nstart = nstart,
             algorithm = algorithm, trace = trace)
  centers_ind = order(k$centers, decreasing = desc)
  centers_ord = setNames(seq_along(k$centers), nm = centers_ind)
  k$cluster = unname(centers_ord[as.character(k$cluster)])
  k$centers = matrix(k$centers[centers_ind], ncol = 1)
  k$withinss = k$withinss[centers_ind]
  k$size = k$size[centers_ind]
  k
}
Example usage:
vec = c(20.28, 9.49, 7.14, 2.48, 2.36, 1.82, 1.3, 1.26, 1.11, 0.98,
0.81, 0.73, 0.66, 0.63, 0.57, 0.53, 0.44, 0.42, 0.38, 0.37, 0.33,
0.29, 0.28, 0.27, 0.26, 0.23, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14,
0.14, 0.12, 0.11, 0.1, 0.1, 0.08)
# For comparison
set.seed(1)
k = kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3, desc = FALSE); k
Here's an example where you ascribe letter factor groups to the k-means clusters, ordered from A (low) to C (high). The parameters can be altered to fit the data you have.
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
km <- kmeans(df$obs, centers = 3)
km.order <- as.numeric(names(sort(km$centers[,1])))
names(km.order) <- toupper(letters)[1:3]
km.order <- sort(km.order)
clus.order <- factor(names(km.order[km$cluster]))

generate random sequence and plot in R

I would like to generate a random sequence composed of 3000 points which follows the normal distribution. The mean is c and the standard deviation is d, but I would like these 3000 points to lie in the range [a, b].
Can you tell me how to do it in R?
If I plot this sequence with the generated 3000 points on the Y-axis, how should I generate the points corresponding to the X-axis?
You can do this using standard R functions like this:
c <- 1
d <- 2
a <- -2
b <- 3.5
ll <- pnorm(a, c, d)
ul <- pnorm(b, c, d)
x <- qnorm( runif(3000, ll, ul), c, d )
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
The pnorm function is used to find the limits to use for the uniform distribution; data is then generated from a uniform and transformed back to the normal.
This is even simpler using the distr package:
library(distr)
N <- Norm(c,d)
N2 <- Truncate(N, lower=a, upper=b)
plot(N2)
x <- r(N2)(3000)
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
Note that in both cases the mean is not c and the sd is not d. If you want the mean and sd of the resulting truncated data to be c and d, then you need the parent distribution (before truncating) to have different values (a higher sd; the mean depends on the truncation limits); finding those values would be a good homework problem for a math/stat theory course. If that is what you really need, then add a comment or edit the question to say so specifically.
If you want to generate the data from the untruncated normal, but only plot the data within the range [a,b] then just use the ylim argument to plot:
plot( rnorm(3000, c, d), ylim=c(a,b) )
Generating a random sequence of numbers from any probability distribution is very easy in R. To do this for the normal distribution specifically
c = 1
d = 2
x <- rnorm(3000, c, d)
Clipping the values in x so that they're only within a given range is kind of a strange thing to want to do with a sample from the normal distribution. Maybe what you really want to do is sample a uniform distribution.
a = 0
b = 3
x2 <- runif(3000, a, b)
As for how to plot the distribution, I'm not sure I follow your question. You can plot a density estimate for the sample with this code
plot(density(x))
But, if you want to plot this data as a scatter plot of some sort, you actually need to generate a second sample of numbers.
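For example, one possible pairing (reusing a, b and x from above; whether this is what you want depends on your data) is:
x_axis <- runif(3000, a, b)   # an independent uniform sample for the x-axis
plot(x_axis, x)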
If I would like to plot this sequence, if Y-axis uses the generated 3000 points, then how should I generate the points corresponding to X-axis.
If you just generate your points, as JoFrhwld said, with
y <- rnorm(3000, 1, 2)
then
plot(y)
will automatically plot them using the vector indices as the x-axis.
a = -2; b = 3
plot(dnorm, xlim = c(a, b))
