I would like to obtain the x and y values at the points where the slope is maximal and minimal on a spline. I saw how to do that in this post, but could not replicate it on my dataset. That post makes use of diff(), but my dataset has specific x and y vectors. My data typically looks like this:
x<-c(0, 0.13, 0.22, 0.34, 0.44, 0.53, 0.62, 0.72, 0.83, 0.91)
y<-c(120, 121, 122, 122, 122, 122, 122, 121, 119, 103)
z <-data.frame(x, y)
z
plot(z)
xspline(z$x,z$y, shape=0.5)
But of course I cannot use this:
w <-xspline(z$x,z$y, shape=0.5)
plot(diff(w))
If I could, I would do this:
param1 <- which(abs(diff(w))==max(abs(diff(w))) )
param2 <-z[which(abs(diff(w))==max(abs(diff(w))) ) ]
param1 <- which(abs(diff(w))==min(abs(diff(w))) )
param2 <-z[which(abs(diff(w))==min(abs(diff(w))) ) ]
I would be grateful for advice on a different way to get that plot of slopes, or alternatively another way to get these parameters. Perhaps I have gone too far down the 'splines' road.
Just use
w <-data.frame(xspline(z$x,z$y, shape=0.5, draw=FALSE))
That will make w a data.frame with the x and y values used to draw the spline. You can then take the largest and smallest differences in y between consecutive points to estimate where the slope is maximal and minimal:
plot(z)
with(w, lines(x,y))
with(w[which.max(diff(w$y)),], points(x,y,col="red"))
with(w[which.min(diff(w$y)),], points(x,y,col="blue"))
Following on from the clue given by @Carl Witthoft, I installed the package numDeriv so I could use grad(), and continued where I left off previously:
plot(splinefun(z, method="monoH.F")) #just to check the shape of the spline
w <-splinefun(z, method="monoH.F")
z$slopes <-grad(w, z$x, method="simple")
z
plot(z$slopes~z$x) #just out of interest
max.slope <- subset(z, slopes==max(z$slopes))
max.slope.y <-max.slope$y[1]
max.slope.x <-max.slope$x[1]
Maybe not so elegant, but it did the trick. I have noted the slight difference in the shape of the splines between splinefun() and xspline(), which may or may not be relevant to my application.
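As a possible alternative to numDeriv, the function returned by splinefun() accepts a deriv argument, so the slope can be evaluated directly. The following is only a minimal sketch under that assumption: the 1001-point grid and the names grid, slope, steepest_up and steepest_down are illustrative choices, not part of the original answer.

```r
# Data from the question
x <- c(0, 0.13, 0.22, 0.34, 0.44, 0.53, 0.62, 0.72, 0.83, 0.91)
y <- c(120, 121, 122, 122, 122, 122, 122, 121, 119, 103)

# Monotone Hermite spline, the same method used in the answer above
f <- splinefun(x, y, method = "monoH.F")

# Evaluate the first derivative on a fine grid (grid size is an arbitrary choice)
grid <- seq(min(x), max(x), length.out = 1001)
slope <- f(grid, deriv = 1)

# Locate the largest and smallest slope on the grid
i_max <- which.max(slope)
i_min <- which.min(slope)
steepest_up   <- c(x = grid[i_max], y = f(grid[i_max]), slope = slope[i_max])
steepest_down <- c(x = grid[i_min], y = f(grid[i_min]), slope = slope[i_min])
print(rbind(steepest_up, steepest_down))
```

Evaluating on a grid rather than only at the original x values means the reported points are not restricted to the ten data points, which may or may not be what you want.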
I have a dataset from a biological experiment:
x = c(0.488, 0.977, 1.953, 3.906, 7.812, 15.625, 31.250, 62.500, 125.000, 250.000, 500.000, 1000.000)
y = c(0.933, 1.036, 1.112, 1.627, 2.646, 5.366, 11.115, 2.355, 1.266, 0, 0, 0)
plot(log(x),y)
x represents a concentration and y represents the response in our assay.
The plot can be found here: 1
How can I predict the x-value (concentration) of a pre-defined y-value (in my case 1.5)?
After a loess smoothing I can predict the y-value at a defined x-value. See the example:
smooth_data <- loess(y~log(x))
predict(smooth_data, 1.07) # which gives 1.5
Using the predict function, both log(x) = 1.07 and log(x) = 5.185 result in y = 1.5.
Is there a convenient way to get the estimates from the loess regression at y = 1.5 without manually typing some x values into the predict function?
Any suggestions?
I guess your x and y values are pairs? So f(0.488) = 0.933 and so on?
More of a math problem in my opinion :).
If you could define a function that describes your graph, it would be pretty easy.
You could also draw a straight line between each pair of neighbouring points, and for every line that intersects your y value you could get the corresponding x value. But straight lines wouldn't be very precise.
If you have enough pairs you could also train a neural network. That might get you the best results, but it takes some time and a lot of pairs to train well.
Could you clarify your question a bit and tell us what you are looking for? A way to do it or a code example?
I hope this helps you at least a little bit :)
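The straight-line idea mentioned above can be sketched in a few lines of base R. This is only an illustration on the raw data, interpolating linearly on the log(x) scale; the names target, d, i, frac and lx_cross are made up for the example.

```r
# Data from the question
x <- c(0.488, 0.977, 1.953, 3.906, 7.812, 15.625, 31.250, 62.500,
       125.000, 250.000, 500.000, 1000.000)
y <- c(0.933, 1.036, 1.112, 1.627, 2.646, 5.366, 11.115, 2.355,
       1.266, 0, 0, 0)

lx <- log(x)
target <- 1.5

# A segment crosses the target when y - target changes sign between its ends
d <- y - target
i <- which(d[-length(d)] * d[-1] < 0)

# Fraction of the way along each crossing segment, then interpolate log(x)
frac <- d[i] / (d[i] - d[i + 1])
lx_cross <- lx[i] + frac * (lx[i + 1] - lx[i])

lx_cross       # crossings on the log(x) scale
exp(lx_cross)  # the same crossings as concentrations
```

Because this works on the unsmoothed points, the crossings will differ somewhat from those of the loess curve.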
Since your function is not monotonic, there is no true inverse, but if you split it into two functions - one for x < maximum and one for x > maximum - you can just create two inverse functions and solve for whatever values of y you want.
smooth_data <- loess(y~log(x))
X = seq(0,6.9,0.1)
P = predict(smooth_data, X)
M = which.max(P)
Inverse1 = approxfun(X[1:M] ~ P[1:M])
Inverse2 = approxfun(X[M:length(X)] ~ P[M:length(X)])
Inverse1(1.5)
[1] 1.068267
predict(smooth_data, 1.068267)
[1] 1.498854
Inverse2(1.5)
[1] 5.185876
predict(smooth_data, 5.185876)
[1] 1.499585
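If you would rather not interpolate between grid points, one option is to let uniroot() refine each crossing on the loess curve itself. The sketch below assumes the same loess fit and the same 0 to 6.9 grid as above; the helper f and the names target, brackets and roots are hypothetical. It looks for sign changes of predict() minus the target on the grid and runs uniroot() inside each bracketing interval.

```r
# Data from the question
x <- c(0.488, 0.977, 1.953, 3.906, 7.812, 15.625, 31.250, 62.500,
       125.000, 250.000, 500.000, 1000.000)
y <- c(0.933, 1.036, 1.112, 1.627, 2.646, 5.366, 11.115, 2.355,
       1.266, 0, 0, 0)

fit <- loess(y ~ log(x))
target <- 1.5

# Smoothed response minus the target, as a function of log(x)
f <- function(lx) predict(fit, newdata = data.frame(x = exp(lx))) - target

# Coarse grid on the log(x) scale, used only to bracket the crossings
grid <- seq(0, 6.9, by = 0.1)
vals <- f(grid)
brackets <- which(diff(sign(vals)) != 0)

# Refine each bracketed crossing with uniroot()
roots <- vapply(brackets, function(i) {
  uniroot(f, lower = grid[i], upper = grid[i + 1], tol = 1e-8)$root
}, numeric(1))

roots       # crossings on the log(x) scale
exp(roots)  # the same crossings as concentrations
```

This finds however many crossings the grid resolves, so it does not need the curve to be split at its maximum first, but a crossing pair that falls entirely between two grid points would be missed.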
This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.
The background is, I have a database of information on different cities, so like name, population, pollution, crime, etc by year. I'm querying it to aggregate the data on a per-city basis and outputting the result to a table. That works fine.
The next step is running the kmeans() function in R on the data set to find clusters. In testing I've found that 5 clusters is almost always a good choice via the "elbow method".
The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. So I don't want to identify row 2 with "cluster 5", I want to say "low population, high crime, low income".
If R would output the clusters in the same order, say having cluster 5 always equate to the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:
> a = kmeans(city_date,centers=5)
> b = kmeans(city_date,centers=5)
> c = kmeans(city_date,centers=5)
Then run this code:
a$centers
b$centers
c$centers
The clusters will all contain the same data set, but the cluster number will be different. So if I have a mapping table in SQL that has cluster number and interpretation, it won't work, because when I run it one day it might have the "low population, high crime, low income" cluster as 5, and the next it might be 2, the next 4, etc.
What I'm trying to figure out is if there is a way to keep the output consistent. The data set gets updated so it won't even be the same every time, and since R doesn't keep the cluster order consistent even with the same data set, I am wondering if it will be possible at all.
Thanks for any help anyone can provide. On my end my current idea is to output the $centers data to a SQL table, then order the table by the various metrics, each time the one with the highest/lowest getting tagged as such, and then concatenating the results to tag the level. This may work but isn't very elegant.
I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:
library(dplyr)
# create a random data frame
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
# use kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers
# order the centers
centers <- sort(centers)
# call kmeans again but this time passing the centers calculated in the previous step
clusteridx <- kmeans(df$obs, centers = centers)$cluster
Not very elegant, but it works. The clusteridx vector will always return the cluster number based on the centers in ascending order.
This can also be collapsed into just one line if you prefer:
clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster
Usually k-means is initialized randomly several times to avoid local minima. If you want the resulting clusters to be ordered, you have to order them manually after the k-means algorithm has finished.
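Ordering the clusters manually after the fit could look like the sketch below. Everything in it is illustrative: the cities data frame is synthetic, and ranking the clusters by the population center is just one possible convention. The idea is to rank the fitted centers on a chosen column and relabel every row with that rank, so label 1 always means the lowest center.

```r
set.seed(42)

# Synthetic stand-in for the per-city table (hypothetical data)
cities <- data.frame(
  population = c(rnorm(20, 10, 1), rnorm(20, 50, 2), rnorm(20, 100, 3)),
  crime      = c(rnorm(20, 5, 1),  rnorm(20, 2, 0.5), rnorm(20, 8, 1))
)

km <- kmeans(cities, centers = 3, nstart = 10)

# ord[r] is the original label of the cluster with the r-th smallest
# population center
ord <- order(km$centers[, "population"])

# relabel[old] is the rank of original label 'old'
relabel <- match(seq_along(ord), ord)

centers_sorted <- km$centers[ord, ]
cities$cluster <- relabel[km$cluster]

# Attach a stable, human-readable interpretation to each rank
level_names <- c("low population", "medium population", "high population")
cities$segment <- factor(level_names[cities$cluster], levels = level_names)

centers_sorted
table(cities$segment)
```

With several metrics you would need a rule per metric (or a combined ordering), which is essentially the tagging scheme described in the question.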
I haven't done this myself so I am not sure it will work, but kmeans has the parameter:
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
If you know roughly where the clusters should be (perhaps by getting the cluster centers from a dataset you are matching to), you could use that to initialize the model. That would make the starting locations non-random, so the clusters should stay in the same order. As an added benefit, initializing the cluster centers close to where they will end up should speed up your clustering.
Edit
I just checked using the data from the kmeans example, but initializing with the first center at (1,1) and the second at (0,0) (the means of the distributions used to make the clusters), as below.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, matrix(c(1,0,1,0),ncol=2)))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
After repeated runs, I found that the first cluster was always in the top right and the second in the bottom left, whereas initializing with just the number of clusters (centers = 2) caused them to switch back and forth. If you have some approximate starting values for your clusters (i.e. a quantification of "low population, high crime, low income"), those could be your initialization and give you the results you want.
This function runs kmeans with 1-dimensional input and returns a normal "kmeans" object with sensibly numbered clusters, without having to run the kmeans twice.
ordered_kmeans = function(x, centers, iter.max = 10, nstart = 1,
                          algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                                        "MacQueen"),
                          trace = FALSE,
                          desc = TRUE) {
  if (NCOL(x) > 1) {
    stop("only one-dimensional inputs are allowed")
  }
  k = kmeans(x = x, centers = centers, iter.max = iter.max, nstart = nstart,
             algorithm = algorithm, trace = trace)
  centers_ind = order(k$centers, decreasing = desc)
  centers_ord = setNames(seq_along(k$centers), nm = centers_ind)
  k$cluster = unname(centers_ord[as.character(k$cluster)])
  k$centers = matrix(k$centers[centers_ind], ncol = 1)
  k$withinss = k$withinss[centers_ind]
  k$size = k$size[centers_ind]
  k
}
Example usage:
vec = c(20.28, 9.49, 7.14, 2.48, 2.36, 1.82, 1.3, 1.26, 1.11, 0.98,
0.81, 0.73, 0.66, 0.63, 0.57, 0.53, 0.44, 0.42, 0.38, 0.37, 0.33,
0.29, 0.28, 0.27, 0.26, 0.23, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14,
0.14, 0.12, 0.11, 0.1, 0.1, 0.08)
# For comparison
set.seed(1)
k = kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3, desc = FALSE); k
Here's an example where you assign letter factor groups to the k-means clusters, ordered from A (low) to C (high). The parameters can be altered to fit the data you have.
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
km <- kmeans(df$obs, centers = 3)
km.order <- as.numeric(names(sort(km$centers[,1])))
names(km.order) <- toupper(letters)[1:3]
km.order <- sort(km.order)
clus.order <- factor(names(km.order[km$cluster]))
The concept of jittering in graphical plotting is intended to make sure points do not overlap. I want to do something similar in a vector.
Imagine I have a vector like this:
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.8)
As you can see, it is a vector that is ordered by increasing numbers, with the caveat that some of the numbers are exactly the same. How can I "jitter" them through adding a very small value, so that they can be ordered strictly in increasing order? I would like to achieve something like this:
0.5, 0.50001, 0.55, 0.60, 0.71, 0.71001, 0.8
How can I achieve this in R?
If the solution allows me to adjust the size of the "added value" it's a bonus!
Jitter and then sort:
sort(jitter(v))
The function rle gets you the run length of repeated elements in a vector. Using this information, you can then create a sequence of the repeats, multiply this by your verySmallNumber and add it to v.
# New vector to illustrate a triplet
v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.71, 0.8)
# Define the amount you wish to add
verySmallNumber <- 0.00001
# Get the rle
rv <- rle(v)
# Create the sequence, multiply and subtract the verySmallNumber, then add
sequence(rv$lengths) * verySmallNumber - verySmallNumber + v
# [1] 0.50000 0.50001 0.55000 0.60000 0.71000 0.71001 0.71002 0.80000
Of course, eventually, a very long sequence of repeats might lead to a value equal to the next real value. Adding a check to see what the longest repeated value is would possibly solve that.
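The check mentioned above could be sketched roughly as follows. The function name make_strict and the rule for shrinking the increment are assumptions made for illustration: the step is capped at the smallest gap between distinct values divided by the longest run, so the largest offset added to any repeat stays below the next distinct value.

```r
# Hypothetical helper: make a sorted vector strictly increasing by adding
# small offsets to repeated values, shrinking the step if it would overshoot
make_strict <- function(v, eps = 1e-5) {
  stopifnot(!is.unsorted(v))  # assumes v is already sorted ascending
  r <- rle(v)
  if (length(r$values) > 1) {
    # Largest offset is (longest run - 1) * eps, which must stay below
    # the smallest gap between consecutive distinct values
    eps <- min(eps, min(diff(r$values)) / max(r$lengths))
  }
  v + (sequence(r$lengths) - 1) * eps
}

v <- c(0.5, 0.5, 0.55, 0.60, 0.71, 0.71, 0.71, 0.8)
out <- make_strict(v)
out

# A case where the requested step would collide with the next value
tight <- make_strict(c(1, 1, 1, 1.00001), eps = 1e-5)
tight
```

Like the rle approach above, this relies on exact equality of the repeated values, so it will not treat nearly equal floating-point numbers as repeats.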
I want to plot a polygon from a sample of points (in practice, the polygon is a convex hull) whose coordinates are
x <- c(0.66, 0.26, 0.90, 0.06, 0.94, 0.37)
y <- c(0.99, 0.20, 0.38, 0.77, 0.71, 0.17)
When I apply the polygon function I get the following plot:
plot(x,y,type="n")
polygon(x,y)
text(x,y,1:length(x))
But it is not what I expect... What I want is the following plot:
I obtained this last plot by doing:
good.order <- c(1,5,3,6,2,4)
plot(x,y,type="n")
polygon(x[good.order], y[good.order])
text(x,y,1:length(x))
My question
Basically, my question is: how can I obtain the vector of indices (called good.order in the code above) that gives me the polygon I want?
Assuming a convex polygon, just take a central point and compute the angle, then order in increasing angle.
> pts = cbind(x,y)
> polygon(pts[order(atan2(x-mean(x),y-mean(y))),])
Note that any cycle of your good.order will work, mine gives:
> order(atan2(x-mean(x),y-mean(y)))
[1] 6 2 4 1 5 3
probably because I've mixed up x and y in atan2, so it's measuring the angle as if the plot were rotated by 90 degrees, not that it matters here.
Here is one possibility. The idea is to use the angle around the center for ordering:
x <- c(0.66, 0.26, 0.90, 0.06, 0.94, 0.37)
y <- c(0.99, 0.20, 0.38, 0.77, 0.71, 0.17)
xnew <- x[order(Arg(scale(x) + scale(y) * 1i))]
ynew <- y[order(Arg(scale(x) + scale(y) * 1i))]
plot(xnew, ynew, type = "n")
polygon(xnew ,ynew)
text(x, y, 1:length(x))
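Since the question says the polygon is in practice a convex hull, base R's chull() may be a simpler route than ordering by angle: it returns the indices of the hull points in clockwise order. A minimal sketch, in which the name hull is illustrative and plotting is skipped in non-interactive sessions:

```r
x <- c(0.66, 0.26, 0.90, 0.06, 0.94, 0.37)
y <- c(0.99, 0.20, 0.38, 0.77, 0.71, 0.17)

# Indices of the points on the convex hull, in clockwise order
hull <- chull(x, y)
hull

if (interactive()) {
  plot(x, y, type = "n")
  polygon(x[hull], y[hull])
  text(x, y, seq_along(x))
}
```

Note that chull() only returns points that lie on the hull, so any interior points would be left out of the polygon, unlike the angle-based ordering.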
Just use the geometry package with the function convhulln.
Here is the example they provide (see ?convhulln):
ps <- matrix(rnorm(3000), ncol=3) # generate points on a sphere
ps <- sqrt(3)*ps/drop(sqrt((ps^2) %*% rep(1, 3)))
ts.surf <- t(convhulln(ps)) # see the qhull documentations for the options
rgl.triangles(ps[ts.surf,1],ps[ts.surf,2],ps[ts.surf,3],col="blue",alpha=.2)
For plotting you need the rgl-package
I have a large number of pairs of X and Y variables along with their cluster membership column. The cluster membership (group) may not always be right (the clustering algorithm is not perfect), so I want to interactively visualize the clusters and manipulate the cluster memberships of identified points.
I tried rggobi and the following is the point I was able to get to (I do not mean that I need to use rggobi / ggobi, if better options are available you are welcome to suggest).
# data
set.seed (1234)
c1 <- rnorm (40, 0.1, 0.02); c2 <- rnorm (40, 0.3, 0.01)
c3 <- rnorm (40, 0.5, 0.01); c4 <- rnorm (40, 0.7, 0.01)
c5 <- rnorm (40, 0.9, 0.03)
Yv <- 0.3 + rnorm (200, 0.05, 0.05)
myd <- data.frame (Xv = round (c(c1, c2, c3, c4, c5), 2), Yv = round (Yv, 2),
cltr = factor (rep(1:5, each = 40)))
require(rggobi)
g <- ggobi(myd)
display(g[1], vars=list(X="Xv", Y="Yv"))
You can see five clusters, colored differently by the cltr variable. I manually identified the points that are outliers and I want to set their value to NA in the cltr variable. Is there any easy way to disassociate such membership and write the result to a file?
You could try identify to get the indices of the points manually:
## use base::plot
plot(myd$Xv, myd$Yv, col=myd$cltr)
exclude <- identify(myd$Xv, myd$Yv) ## left click on the points you want to exclude (right click to stop/finish)
myd$cltr[exclude] <- NA
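To cover the "write to file" part of the question, the result can be saved with write.csv(). The sketch below is non-interactive, so the exclude indices are hard-coded, hypothetical stand-ins for what identify() would return, and the output goes to a temporary file rather than a real path.

```r
# Rebuild the data from the question
set.seed(1234)
c1 <- rnorm(40, 0.1, 0.02); c2 <- rnorm(40, 0.3, 0.01)
c3 <- rnorm(40, 0.5, 0.01); c4 <- rnorm(40, 0.7, 0.01)
c5 <- rnorm(40, 0.9, 0.03)
Yv <- 0.3 + rnorm(200, 0.05, 0.05)
myd <- data.frame(Xv = round(c(c1, c2, c3, c4, c5), 2), Yv = round(Yv, 2),
                  cltr = factor(rep(1:5, each = 40)))

# Hypothetical indices, standing in for the result of identify()
exclude <- c(3, 57, 120)
myd$cltr[exclude] <- NA

# Write the edited table to a file (a temporary file in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(myd, out_file, row.names = FALSE)

# Read it back to confirm the NA memberships survived the round trip
check <- read.csv(out_file)
sum(is.na(check$cltr))
```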