How to cluster points (coordinates) in a large data set - r

I have a large data set made of x and y coordinates (not necessarily lat, lng). The points are not ordered.
df <- data.frame(point=1:7, x=c(3,7,2,23,5,67,16) , y=c(1,4,5,23,17,89,20))
>df
point x y
1 3 1
2 7 4
3 2 5
4 23 23
5 5 17
6 67 89
7 16 20
Is there an easy way to cluster points? (according to a radius for example)
So for example:
points 1, 2, 3 would be together - group A
points 4, 5, 7 would be together - group B
point 6 would be group C
I have tried to use: %>% arrange to sort values, and then x-(x+1) (coordinate differences)
but the method is not perfect and there are situations where clustering isn't done properly.
Any suggestions or comments are welcome!
Thanks

You can try hclust with cutree.
df$group <- cutree(hclust(dist(df[,2:3])), 3)
df
# point x y group
#1 1 3 1 1
#2 2 7 4 1
#3 3 2 5 1
#4 4 23 23 2
#5 5 5 17 1
#6 6 67 89 3
#7 7 16 20 2
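If a distance threshold is more natural than a fixed number of groups, one option is to cut the tree by height instead. This is only a minimal sketch: the single linkage method and the cutoff h = 12 are assumptions chosen for this toy data, not something taken from the question. With single linkage, two points land in the same group whenever they are connected by a chain of neighbours that are each at most h apart.

```r
df <- data.frame(point = 1:7,
                 x = c(3, 7, 2, 23, 5, 67, 16),
                 y = c(1, 4, 5, 23, 17, 89, 20))

# Single linkage: a point joins a group if it is close to ANY member of it.
tree <- hclust(dist(df[, c("x", "y")]), method = "single")

# Cut by height (a distance threshold) rather than by number of groups.
# h = 12 is an assumed value picked for this toy data.
df$group <- cutree(tree, h = 12)
df$group
```

On this data the cutoff gives the grouping asked for in the question (points 1, 2, 3 together, points 4, 5, 7 together, point 6 alone). The result is sensitive to h, and single linkage can chain distant points together through intermediate ones, so the threshold needs checking on the real data. Also note that dist stores n(n-1)/2 pairwise distances, which may be too much memory for a very large data set.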

Related

Calculating angle of closest point to multiple lines across time in R

I am trying to find the angle to the closest point from a line in multiple cases within and across time. I have a data set that looks something like this. Four points in group 1, four in group 2 and one in group 3.
x <- sample(1:50, 27)
y <- sample(1:50, 27)
group <- c(1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3)
id <- rep(seq(1,9,1), 3)
time <- rep(1:3, each = 9)
df <- as.data.frame(cbind(x, y, group, id, time))
x y group id time
1 25 36 1 1 1
2 49 35 1 2 1
3 41 27 1 3 1
4 28 47 1 4 1
5 7 3 2 5 1
6 46 25 2 6 1
7 15 7 2 7 1
8 32 15 2 8 1
9 38 29 3 9 1
10 19 4 1 1 2
11 18 14 1 2 2
12 8 37 1 3 2
13 29 8 1 4 2
14 6 1 2 5 2
15 30 6 2 6 2
16 10 19 2 7 2
17 45 49 2 8 2
18 40 43 3 9 2
19 17 48 1 1 3
20 27 21 1 2 3
21 26 20 1 3 3
22 33 50 1 4 3
23 16 16 2 5 3
24 23 46 2 6 3
25 21 26 2 7 3
26 13 31 2 8 3
27 11 41 3 9 3
The item in group 3 is used to identify which point is the base of all of the lines. In this example, for time 1, id 3 in group 1 is closest. This signals that a line should be drawn to all other points in group 1 (3-1, 3-2 and 3-4). I then need to identify which id in group 2 is closest to each of the 3 lines. For example, point 6 might be closest to the line 3-2, and from that I would calculate the angle in points 6-3-2. I need to calculate this for all other lines at this time, and then repeat it across all other times.
The following code identifies the base point for the lines (it is not optimal but I need the other data it calculates for other uses)
library(dplyr) # group_by, mutate, filter, %>%
library(purrr) # map
#### calculate distance between all points
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2)) #distance function
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred]) #
distance3 = function(x, y, id){
dists = map(1:9, ~distance2(x,y, which(id == .x)))
}
#use distance formula
df2 <- df %>%
group_by(time) %>%
mutate(distances=distance3(x, y, id))
distances <- df2$distances # extract distance list
distances <- do.call(rbind.data.frame, distances) # change list to dataframe
colnames(distances) <- c(paste0("dist", 1:9)) # change column names
df <- cbind(df,distances) # merge dataframes
group3 <- df %>% filter(group == 3)
df <- df %>% filter(group == 1 | group == 2) #remove group 3 (id 9 from data as no longer needed)
#new columns with id and group as closest to position of id 9
df <- df %>% group_by(time) %>% mutate(closest = id[which.min(dist9)]) %>%
mutate(closest.group = group[which.min(dist9)]) %>% ungroup
This is about as far as I can get on my own. I have found the following formula on here, which I can use to calculate the distance of a point to a line in an individual case, but I have no idea how to integrate it across the multiple time periods and with conditions.
dist2d <- function(a,b,c) {
v1 <- b - c
v2 <- a - b
m <- cbind(v1,v2)
d <- abs(det(m))/sqrt(sum(v1*v1)) # perpendicular distance from a to the infinite line through b and c
}
For clarification, the line only runs between the two points and does not extend to infinity.
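Since the line is a segment rather than an infinite line, the perpendicular-distance formula needs a clamping step. The following is a rough sketch of the two building blocks, not a full solution for the grouped data: dist_to_segment and angle_at are hypothetical helper names introduced here, and each point is assumed to be a numeric vector c(x, y).

```r
# Distance from point p to the segment a-b (not the infinite line).
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  len2 <- sum(ab * ab)
  if (len2 == 0) return(sqrt(sum((p - a)^2)))  # a and b coincide
  t <- sum((p - a) * ab) / len2                # position of the projection along a-b
  t <- max(0, min(1, t))                       # clamp so it stays on the segment
  proj <- a + t * ab
  sqrt(sum((p - proj)^2))
}

# Angle in degrees at vertex b for the three points a-b-c.
angle_at <- function(a, b, c) {
  v1 <- a - b
  v2 <- c - b
  cosang <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
  acos(max(-1, min(1, cosang))) * 180 / pi     # guard against rounding outside [-1, 1]
}

# Time 1 from the example data: base point id 3, line 3-2, candidate point id 6.
p3 <- c(41, 27); p2 <- c(49, 35); p6 <- c(46, 25)
dist_to_segment(p6, p3, p2)  # distance of point 6 to segment 3-2
angle_at(p6, p3, p2)         # angle 6-3-2, vertex at point 3
```

To cover every line and every time, these helpers could be applied inside a loop or a grouped operation over time, taking for each line the group 2 point with the smallest dist_to_segment value.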

Retrieving binary interactions from linkcomm package as a data frame in R

Suppose I have the following clusters:
library(linkcomm)
g <- swiss[,3:4]
lc <- getLinkCommunities(g)
plot(lc, type = "members")
getNodesIn(lc, clusterids = c(3, 7, 8))
From the plot you can see the node 6 is present in 3 overlapping clusters: 3, 7 and 8. I am interested to know how to retrieve the direct binary interactions in these clusters as a data frame. Specifically, I would like a data frame with the cluster id as the first column, and the last two columns as "interactor 1" and "interactor 2", where all pairs of interactors can be listed per cluster. These should be direct, i.e. they have an edge in common.
Basically I would like something like this:
Cluster ID Interactor 1 Interactor 2
3 6 14
3 3 7
3 6 7
3 14 3
3 6 3
and so on for the other ids. If possible I would like to avoid duplicates such as 6 and 14, 14 and 6 etc.
Many thanks,
Abigail
You might be looking for the edges. Note: Use str(lc) to examine everything that is included in your object of interest.
lc$edges
# node1 node2 cluster
# 1 17 15 1
# 2 17 8 1
# 3 15 8 1
# 4 16 13 2
# 5 16 10 2
# 6 16 29 2
# 7 14 6 3
# 8 ...
res <- setNames(lc$edges, c(paste0("interactor.", 1:2), "cluster"))[c(3, 1, 2)]
res
# cluster interactor.1 interactor.2
# 1 1 17 15
# 2 1 17 8
# 3 1 15 8
# 4 2 16 13
# 5 2 16 10
# 6 2 16 29
# 7 3 14 6
# 8 ...
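To avoid reversed duplicates such as 6-14 and 14-6 within a cluster, one possible approach is to put each pair into a fixed order before checking for duplicates. The sketch below uses a small made-up data frame, res_demo, in the same shape as res, so it runs without linkcomm; it assumes the interactor columns are numeric or character rather than factors.

```r
# Hypothetical edge list shaped like `res` (values invented for illustration).
res_demo <- data.frame(
  cluster      = c(3, 3, 3, 3),
  interactor.1 = c(6, 14, 3, 7),
  interactor.2 = c(14, 6, 7, 3)
)

# Order each pair so that (6, 14) and (14, 6) produce the same key.
lo <- pmin(res_demo$interactor.1, res_demo$interactor.2)
hi <- pmax(res_demo$interactor.1, res_demo$interactor.2)

# Keep the first occurrence of each cluster/pair combination.
dedup <- res_demo[!duplicated(data.frame(res_demo$cluster, lo, hi)), ]
dedup
```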

Highlight specific points from vector in scatterplot

I have a data frame df with two columns, which are plotted in a scatterplot using ggplot. I have divided the curve into intervals, and the boundary points of the intervals are in a vector r. I now want to highlight these points to improve the visualization of the intervals. I was thinking about coloring these boundary points, or even marking the intervals by adding vertical lines to the plot. I have tried some commands, but they didn't work for me.
Here is an idea of how my data frame looks like:
d is the first column, e is the second, with the number of instances.
d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4
My vector r shows where the interval borders were set.
7
8
9
10
11
12
18
Any ideas how to do so? Thanks!
You can try a tidyverse approach. The idea is to find overlapping points using mutate and %in%, then color by the resulting logical vector gr. I also added vertical lines to illustrate the "intervals".
library(tidyverse)
d %>%
mutate(gr=d %in% r) %>%
ggplot(aes(d,e, color=gr)) +
geom_vline(xintercept=r, alpha=.1) +
geom_point()
Edit: Without tidyverse you can add gr using
d$gr <- d$d %in% r
ggplot(d, aes(d,e, color=gr)) ...
The data
d <- read.table(text=" d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4", header=TRUE)
r <- c(7:12,18)
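For comparison, the same idea can be sketched with base graphics only, in case ggplot2 is not wanted. The colours are arbitrary choices made here, and the data frame is rebuilt inline so the snippet runs on its own.

```r
d <- data.frame(d = 1:18,
                e = c(4, 4, 5, 5, 5, 4, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 4))
r <- c(7:12, 18)

# TRUE where a point sits on an interval border.
gr <- d$d %in% r

# Empty plot first, then grey border lines, then the points on top.
plot(d$d, d$e, type = "n", xlab = "d", ylab = "e")
abline(v = r, col = "grey80")
points(d$d, d$e, pch = 19, col = ifelse(gr, "red", "black"))
```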

Sum certain values from changing dataframe in R

I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into some function that generates a value x which is then put into the output data frame.
cluster year lambda v e x
1 1 1 -0.12160997 -0.31105287 -0.253391178 15
2 1 2 -0.12160997 -1.06313732 -0.300349972 10
3 1 3 -0.12160997 -0.06704185 0.754397069 40
4 2 1 -0.07378295 -0.31105287 -1.331764904 4
5 2 2 -0.07378295 -1.06313732 0.279413039 19
6 2 3 -0.07378295 -0.06704185 -0.004581941 23
7 3 1 -0.02809310 -0.31105287 0.239647063 28
8 3 2 -0.02809310 -1.06313732 1.284568047 38
9 3 3 -0.02809310 -0.06704185 -0.294881283 18
10 4 1 0.33479251 -0.31105287 -0.480496125 15
11 4 2 0.33479251 -1.06313732 -0.380251626 12
12 4 3 0.33479251 -0.06704185 -0.078851036 34
13 5 1 0.27953088 -0.31105287 1.435456851 100
14 5 2 0.27953088 -1.06313732 -0.795435607 0
15 5 3 0.27953088 -0.06704185 -0.166848530 0
16 6 1 0.29409366 -0.31105287 0.126647655 44
17 6 2 0.29409366 -1.06313732 0.162961658 18
18 6 3 0.29409366 -0.06704185 -0.812316265 13
To aggregate, I then add up the x value for cluster 1 across all three years with seroconv.cluster1=sum(data.all[c(1:3),6]) and repeat for each cluster.
Every time I change the number of clusters right now I have to manually change the addition of the x's. I would like to be able to say n.vec <- seq(6, 12, by=2) and feed n.vec into the functions and get x and have R add up the x values for each cluster every time with the number of clusters changing. So it would do 6 clusters and add up all the x's per cluster. Then 8 and add up the x's and so on.
It seems you are asking for an easy way to split your data up, apply a function (sum in this case) and then combine it all back together. Split-apply-combine is a common data strategy, and there are several ways to do it in R, the most popular being ave in base, the dplyr package and the data.table package.
Here's an example for your data using dplyr:
library(dplyr)
df %>% group_by(cluster) %>% summarise(x = sum(x)) # one row per cluster, x summed over the years
To get the sum of x for each cluster as a vector, you can use tapply:
tapply(df$x, df$cluster, sum)
# 1 2 3 4 5 6
# 65 46 84 61 100 75
If you instead wanted to output as a data frame, you could use aggregate:
aggregate(x~cluster, sum, data=df)
# cluster x
# 1 1 65
# 2 2 46
# 3 3 84
# 4 4 61
# 5 5 100
# 6 6 75
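Because the grouping is driven by the cluster column, none of these calls needs to know how many clusters there are, so the same line keeps working when the number changes. A rough sketch of looping over n.vec follows; make_output is a hypothetical stand-in for whatever function produces your output data frame, and its x values are dummy numbers.

```r
# Hypothetical stand-in for the function that generates the output data frame.
make_output <- function(n_clusters, n_years = 3) {
  data.frame(cluster = rep(seq_len(n_clusters), each = n_years),
             year    = rep(seq_len(n_years), times = n_clusters),
             x       = seq_len(n_clusters * n_years))  # dummy values
}

n.vec <- seq(6, 12, by = 2)

# One named vector of per-cluster sums for each number of clusters.
sums <- lapply(n.vec, function(n) {
  out <- make_output(n)
  tapply(out$x, out$cluster, sum)
})
names(sums) <- n.vec

sums[["8"]]  # per-cluster sums for the run with 8 clusters
```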

t-test in R by specific factors

I have a data set with a few variables:
X is a numeric variable, Y and Z are factor variables with only 2 levels each (Y=1,2 Z=3,4)
x y z
1 -0.59131983 1 3
2 1.51800178 1 3
3 0.03079412 1 3
4 -0.43881764 1 3
5 -1.44914000 1 3
6 -1.33483914 1 4
7 0.25612595 1 4
8 0.12606742 1 4
9 0.44735965 1 4
10 1.83294817 1 4
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
A t-test is easy to perform if my factor variable is Y (t.test(X~Y)), but I am not sure how to do a t-test that compares, for example, only the X values for Y==2 between the levels of Z (3 and 4).
I am not sure if I expressed myself correctly, so it might be easier to see it in the table. So, I would like to do a t-test for X where the factor variable is Z and Y==2. How could I do this?
In Stata it is easy:
ttest var1 if var3==3, by(var2)
but I don't get it in R :(
x y z
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
If you read the t.test documentation in R you will see that for one-sample t.tests you shouldn't use the formula interface of the function (type ?t.test):
The formula interface is only applicable for the 2-sample tests.
So, in your case you need to create a subset of your data.frame according to the conditions you specified like this:
df2 <- df[df$y==2 & df$z %in% c(3,4), ]
> df2
x y z
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
And then run the one-sample t.test using the following syntax:
> t.test(x=df2$x)
One Sample t-test
data: df2$x
t = 0.1171, df = 9, p-value = 0.9094
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.7275964 0.8070325
sample estimates:
mean of x
0.03971805
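Note that the one-sample test above checks whether the mean of x in the subset differs from 0. If the aim is instead to compare the z == 3 and z == 4 groups within y == 2, as the Stata command in the question suggests, the formula interface can be applied to the subset. This is a sketch under that reading of the question; the data frame is rebuilt from the values posted above, and sub and res are names introduced here.

```r
x10 <- c(-0.59131983, 1.51800178, 0.03079412, -0.43881764, -1.44914000,
         -1.33483914, 0.25612595, 0.12606742, 0.44735965, 1.83294817)
df <- data.frame(x = rep(x10, 2),
                 y = factor(rep(1:2, each = 10)),
                 z = factor(rep(rep(3:4, each = 5), 2)))

# Keep only the rows with y == 2, then compare x between the two levels of z.
sub <- df[df$y == 2, ]
res <- t.test(x ~ z, data = sub)  # Welch two-sample test (R's default)
res
```

Passing var.equal = TRUE would give the classical pooled-variance test instead of the Welch version.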
