Clustering by pair of 3 points for a given x position? - r

If I have a set of points that have different y positions (A,B,C) each with the same x coordinate. Is it possible to cluster this set of 3 points together and not individually?
I'd like to see the occurrence of this set of 3 points together in a given sample and see what set (A,B,C) is most frequent.
I've seen most of the clustering algorithm can cluster points for a given position (x,y) but not a set of several points for a given x coordinate.
For instance, if I have the following:
X A B C
1 0.7 0.1 0.2
2 0.3 0.4 0.1
3 0.4 0.5 0.1
4 0.7 0.1 0.2
5 0.7 0.1 0.2
6 0.2 0.1 0.5
The positions x :1, 4 and 5 should be clustered together because they have the same set (A,B,C) = (0.7,0.1,0.2).
Is there an existing algorithm or R tool that does this: clustering by sets of several points and finding the most frequent set, with a graphical visualization?
Any help would be much appreciated.

If you're looking to tabulate the instances, then something along the lines of:
tab <- table(sprintf("%s:%s:%s", df1$A, df1$B, df1$C))
which.max(tab)
sort(tab, decreasing=TRUE)
will give you the most frequent combination (you can use strsplit to separate out the individual components if you need to go on and use them programmatically).
If you're looking to cluster, in the sense of find similar distances, then you can just use
dis <- dist(as.matrix(df1[, c("A","B","C")]))
clust <- hclust(dis)
and dis will tell you all the pairwise distances (find the zeros to get the identical rows), and clust will give you a tree based on similarity across A:C.
If this isn't answering the question, you probably need to clarify. You say things like same x coordinate in the text, but none of your rows have the same X value. And it's fairly unconventional to switch interchangeably between y coordinate / position / (A,B,C) .
It's hard to suggest a visualisation without knowing what feature you want to emphasize. Possibly a multi-dimensional scaling graph, where each node represents all x with the same (A,B,C) triplet, and then neighbours are other X's with closest (A', B', C') values?

Related

How to measure differences in spatial densities or organisation of individuals on a plane in R?

I have two distinct datasets which look like this:
identity x-pos y-pos
1: Z 0.5 0.7
2: B 0.1 0.0
3: C 4.6 2.5
4: D 5.6 5.0
5: A 0.2 1.0
6: P 0.4 2.0
Here, each object with a unique identity is positioned on a 2D plane, and its coordinates are given by x-pos and y-pos (in micrometers).
What I want to do is measure whether the objects differ in spatial positioning/organisation between the two datasets. This could be a difference in clustering, for instance more clusters in one dataset than the other,
or the radius of each cluster being larger in one dataset than the other,
or more objects lying within a defined radius in one dataset than the other.
Is there a simple way or R package to do this?
Thanks!

graph visualization in R basis symmetric matrix having values in diagonal

I have a symmetric matrix which I modified a bit:
The above matrix is a symmetric matrix except the fact that I have added values in diagonal too (will tell the purpose going forward)
This matrix records how many times each person (A, B, C, D, E) worked with another person on a publication, e.g. B and C worked together 3 times, and A and E worked together 4 times. The diagonal values represent how many publications a person worked on, e.g. B worked on 4 publications (either alone or with someone else) and C worked on 3.
Now I want to make a network analysis graph in R which describes relation between different person in terms of edge thickness and node size. e.g. the graph should look like this:
In graph, node circle size depends on number of publications a person worked on, e.g. circle B is largest as its diagonal value is maximum and A & E are smallest as they have lowest diagonal values. Also, the edge thickness between nodes depends on how many times they worked together, e.g. edge thickness between A & E is maximum as they worked 4 times together, compared to edge thickness (lesser than edge thickness between A & E) between B & C as they have worked 3 times together.
I can describe the relation between two persons on the basis of edge thickness, but including the diagonal values is causing problems for me. Is it possible to do this in R? Any leads would be highly appreciated.
You can do this with the igraph package. Because the diagonal means something different from the other entries in the matrix, I have separated the matrix into two pieces, the diagonal and the rest.
Your data
SM = as.matrix(read.table(text="A B C D E
1 2 1 1 4
2 4 3 2 1
1 3 3 1 2
1 2 1 2 1
4 1 2 1 1",
header=TRUE))
rownames(SM) = colnames(SM)
library(igraph)
AM = SM
diag(AM) = 0
D = diag(SM)
g = graph_from_adjacency_matrix(AM,
mode = "undirected",
weighted = TRUE)
plot(g,
edge.width=E(g)$weight,
vertex.size = 10+3*D)

output clustered similarity matrix

I have generated a pearson similarity matrix and plotted the results using pheatmap (clustered using hclust, method = "complete"). I'd like to output the ordered matrix, but in R the default seems to be just to alphabetize everything.
Here is my code:
df <- cor(t(genes), method = "pearson")
pheatmap(df, clustering_method = "complete")
head(genes)
pre early mid late end
AAC1 2.0059007 3.64679740 3.0092533 2.4936171 2.2693034
AAC3 -1.6843969 -1.62572636 -0.7654462 -1.5827481 -1.6059080
AAD10 2.6012529 2.05759631 1.3665322 1.4590833 0.3778324
AAD14 0.5047704 0.76021375 0.1825944 0.6111774 0.1174208
AAD15 7.6017557 8.52315453 7.2605744 6.9029452 5.9028824
AAD16 1.2018193 -0.03285354 0.2229450 -0.1337033 0.2198542
This is what the current output (df) looks like:
A B C D
A 1 0.5 0.25 0.1
B 0.1 1 0.1 0.5
C 0.5 0.2 1 0.2
D 0 0.1 0.7 1
How can I output the similarity matrix as ordered by hclust?
I've looked, but I haven't been able to find anything that quite accomplishes what I need. Thanks in advance for your help!
(also sorry I don't know how to properly format everything yet)
EDIT: maybe some visuals would help. My clustered pheatmap output looks like this: ordered heatmap
I can see groups of genes that behave similarly, but because there are so many it's impossible/useless to read the labels. I want to find out which genes cluster together, but I can't output the ordered matrix.
When I plot the data without clustering it looks like this: unclustered heatmap
So the output/data I can get is pretty much useless for further analysis.

Plotting a million points in R?

I have a text file (tab delimited) and it has 3 columns A, B, C:
A B C
0.07142857142857142 0.35714285714285715 0.21428571428571427
0.0 0.3333333333333333 0.3888888888888889
0.07142857142857142 0.35714285714285715 0.21428571428571427
0.0 0.3333333333333333 0.3888888888888889
Each row represents a sample with 3 different percentages A, B and C. In total I have 4 files for 4 different organisms.
There can be more than a million rows per file.
My idea is to plot each row in order to see the distribution of the pairs of points (A,B,C) in a given file and then to determine what is the most frequent pair in a given file and then compare the 4 files.
I tried plotting these points in R (multi-curves in a same graph: A, B, C in the y axis and the number of sample in the x axis) for each file but there are so many points that basically the graph can't be interpreted. Also for the million rows file, R crashes and won't plot the points.
What would be the best approach to represent these points? Also is the mode function enough to determine the most frequent pair (A,B,C) or is there any appropriate statistic test I could try to do so?
Any help would be much appreciated.
Thanks.
As I mentioned in my comment, clustering may be a solution to your problem. Here is one way of clustering using kmeans:
irisCl <- transform(iris, Cluster = kmeans(iris[1:4],3)$cluster)
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data=irisCl, colour=Species) + facet_grid(~Cluster)
Note that we have clustered in a 4-dimensional variable space. As you can see, the setosa are identified correctly in the first cluster, the second cluster contains only virginica, but the third cluster contains a mixture of versicolor and virginica.

Average 3D paths

I have two paths in 3D and I want to "average" them, if there's such a thing.
I have the xyz pairs timestamped at the time they were sampled:
ms x y z
3 0.1 0.2 0.6
12 0.1 0.2 1.3
23 2.1 4.2 0.3
55 0.1 6.2 0.3
Facts about the paths:
They all start and end on/near the same xyz point.
I have the total duration it took to complete the path as well as individual vertices
They have different lengths (i.e. different number of xyz pairs).
Any help would be appreciated.
A simple method is the following...
First build a function interp(t, T, waypoints) that given the current time t, the total path duration T and the path waypoints returns the current position. This can be done using linear interpolation or more sophisticated approaches to avoid speed or acceleration discontinuities.
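As a minimal sketch of the linear-interpolation variant, assuming each waypoint is a tuple `(time, x, y, z)` sorted by time, with times measured from the start of the path and lying within `[0, T]` (this layout is an assumption, not something the question specifies):

```python
from bisect import bisect_right

def interp(t, T, waypoints):
    """Linearly interpolate the position at time t.

    Assumed layout: waypoints is a list of (time, x, y, z) tuples,
    sorted by time, with all times inside [0, T].
    """
    # Clamp t to the duration of the path
    t = min(max(t, 0.0), T)
    times = [w[0] for w in waypoints]
    # Index of the first waypoint whose time is strictly greater than t
    i = bisect_right(times, t)
    if i == 0:
        # Before the first sample: hold the first position
        return tuple(waypoints[0][1:])
    if i == len(waypoints):
        # At or after the last sample: hold the last position
        return tuple(waypoints[-1][1:])
    t0, *p0 = waypoints[i - 1]
    t1, *p1 = waypoints[i]
    # Fraction of the way from waypoint i-1 to waypoint i
    f = (t - t0) / (t1 - t0)
    return tuple(a + f * (b - a) for a, b in zip(p0, p1))
```

Outside the sampled time range this sketch simply holds the nearest endpoint; a smoother scheme (e.g. splines) would be needed to avoid the speed discontinuities mentioned above.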
Once you have interp the average path can be defined as (example in python)
def avg(t, T1, waypoints1, T2, waypoints2):
    # Duration of the average path
    T = (T1 + T2) / 2
    # Rescale t onto each path's own timeline, then take the midpoint
    return middlePoint(interp(t*T1/T, T1, waypoints1),
                       interp(t*T2/T, T2, waypoints2))
the duration of the average path will be the average T = (T1 + T2) / 2 of the two durations.
It's also easy to change this approach to make a weighted average path.
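One possible reading of the weighted variant is sketched below. The names `weighted_avg`, `w`, `pos1` and `pos2` are hypothetical; `pos1` and `pos2` stand for any functions that return the position on each path at a given time (for instance a wrapper around the interpolation function described above). Weighting the total duration with the same `w` is also an assumption.

```python
def weighted_avg(t, w, T1, pos1, T2, pos2):
    """Position at time t on a weighted average of two paths.

    w = 0 reproduces path 1, w = 1 reproduces path 2, and w = 0.5
    gives the plain average. pos1(s) and pos2(s) return coordinate
    tuples for s in [0, T1] and [0, T2] respectively.
    """
    # Weighted duration of the combined path (assumption: same weight w)
    T = (1 - w) * T1 + w * T2
    # Rescale t onto each path's own timeline
    p1 = pos1(t * T1 / T)
    p2 = pos2(t * T2 / T)
    # Weighted combination, coordinate by coordinate
    return tuple((1 - w) * a + w * b for a, b in zip(p1, p2))
```

For example, `weighted_avg(t, 0.25, T1, pos1, T2, pos2)` would stay closer to the first path in both shape and timing.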
In R, the distances between consecutive points in that series assuming it is in a dataframe named "dat"
would be:
with(dat, sqrt(diff(x)^2 +diff(y)^2 +diff(z)^2) )
#[1] 0.700000 4.582576 2.828427
There are a couple of averages I could think of: average distance per interval, or average distance traveled per unit time. It depends on what you want. This gives the average velocity in each of the three intervals:
with(dat, sqrt(diff(x)^2 +diff(y)^2 +diff(z)^2) /diff(ms) )
#[1] 0.07777778 0.41659779 0.08838835
There is definitely such a thing. For each point on path A, find the point on path B that corresponds to it, and then find the mid-point between those corresponding vertices. You will then get a path in between the two that is the "average" of the two paths. If there is a mismatch because the two paths were not sampled the same way, then for an interior point on path A (i.e., not the end-point), find the two closest sampled points with a similar time-sampling on path B, and locate the mid-point of the triangle those three points make.
Now, since you've discretized your path by sampling it, this "average" is only going to be an approximation, not a "true" average like you could get by solving for the average function between two differentiable parametric functions defined by r(t) = <x(t), y(t), z(t)>.
Expanding on @6502's answer.
If you wish to retrieve a list of points that would make up the average path, you could sample the avg function at the instances of the individual input points. (Stretched toward the average length)
def avg2(T1, waypoints1, T2, waypoints2):
    # Collect the times we want to sample at
    T = (T1 + T2) / 2
    times = []
    times.extend(t*T/T1 for (t,x,y) in waypoints1) # Shift the time towards
    times.extend(t*T/T2 for (t,x,y) in waypoints2) # the average
    times.sort()
    last_t = None
    for t in times:
        # Check if we have two points in close succession
        if last_t is not None and last_t + 1.0e-6 >= t:
            continue
        last_t = t
        # Sample the average path at this instance
        x, y = avg(t, T1, waypoints1, T2, waypoints2)
        yield t, x, y
