Identify sequences of approximately equivalent values in a series using R

I have a series of values that contains runs of values close to each other, as in the data below. I have labelled the values in V1 with distinct group numbers in V2; the level of the values shifts at roughly the points where the label changes. All the values labelled 1 in V2 are within 20 points of each other, and the same holds for the values labelled 2, 3, and so on. The values within a group are not identical (they are all different); instead, they cluster around a common value.
I identified these clusters manually. How could I automate it?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet, I don't know how to do this.

You could use dist and hclust with cutree to detect clusters, and then start a new sequence number at every break in the cluster labels.
hc <- hclust(dist(x))
cl <- cutree(hc, k=6)
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
However, the dendrogram suggests k=4 clusters rather than 6, although the choice of cut height is somewhat arbitrary.
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
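The run-numbering step in this answer may be easier to follow in isolation. The following is a minimal sketch on a made-up label vector (cl below is hypothetical, not the actual cutree output): diff() is non-zero wherever the label changes, and cumsum() over those change flags gives each consecutive run its own id, even when a label recurs later.

```r
# Hypothetical cluster labels; label 1 occurs in two separate runs
cl <- c(1, 1, 2, 2, 1, 1, 3)

# TRUE wherever the label differs from the previous one
# (the first element is never counted as a break)
is_break <- c(0, diff(cl)) != 0

# Running count of breaks, shifted so that numbering starts at 1
run_id <- cumsum(is_break) + 1
run_id
# [1] 1 1 2 2 3 3 4
```

This is why the output above has 9 sequence levels even though cutree was asked for only 6 clusters.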

This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
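# dat is the data frame from the question (columns V1 and V2)
maxrange <- 18                  # largest range allowed within one group
grp_start <- 1                  # index where the current group starts
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  grp <- dat$V1[grp_start:i]
  # start a new group at i if including this value exceeds the allowed range
  if (max(grp) - min(grp) > maxrange) {
    grp_num <- grp_num + 1
    grp_start <- i
  }
  V3[[i]] <- grp_num
}
cbind(dat, V3)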
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
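For reuse, the loop above can be wrapped in a function. This is only a sketch of the same logic: the name group_by_range is made up for illustration, it tracks a running minimum and maximum instead of re-slicing the group on every step, and x is the vector from the Data block of the other answer.

```r
# Hypothetical helper: start a new group whenever the running range exceeds maxrange
group_by_range <- function(v, maxrange) {
  grp <- integer(length(v))
  grp_num <- 1L
  lo <- hi <- v[1]              # running min and max of the current group
  for (i in seq_along(v)) {
    lo <- min(lo, v[i])
    hi <- max(hi, v[i])
    if (hi - lo > maxrange) {   # v[i] does not fit, so it opens a new group
      grp_num <- grp_num + 1L
      lo <- hi <- v[i]
    }
    grp[i] <- grp_num
  }
  grp
}

x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
       411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
       322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
       279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
       274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
       281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
       297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
       309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
       304.156, 305.502, 306.175, 306.175, 304.829)

g18 <- group_by_range(x, 18)
g20 <- group_by_range(x, 20)
data.frame(x, g18, g20)
```

Comparing the g18 and g20 columns side by side makes the trade-off between the two thresholds visible row by row.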

Related

Adding a value based on a sequence of numbers

Please simplify my code. The result should be the same. The script works but R shows warning messages:
1: In data$sygnature[seq(first[v], last[v])] <- paste0(n[v], "/", syg) :
number of items to replace is not a multiple of replacement length
etc.
The idea is to assign each sequence in the column the same value.
data <- data.frame(sygnature = c(seq(1:8),seq(1:3),seq(1:11),seq(1:6),seq(1:9),seq(1:5)))
n <- c(44:49)
k <- c()
for (i in (1:nrow(data))) {
  s <- data$sygnature[i]
  z <- data$sygnature[i+1]
  if (
    if (is.na(z)) {
      z <- 1
      s > z
    } else {
      s > z
    }
  ) {
    k <- c(k, s)
  }
}
last <- cumsum(k)
first <- (last-k)+1
syg <- data$sygnature
for (v in 1:6) {
  data$sygnature[seq(first[v],last[v])] <- paste0(n[v],"/",syg)
}
You can do:
data$res <- paste0(rep(n, aggregate(sygnature ~ cumsum(sygnature == 1), data, length)[[2]]),
                   '/',
                   data$sygnature)
data
sygnature res
1 1 44/1
2 2 44/2
3 3 44/3
4 4 44/4
5 5 44/5
6 6 44/6
7 7 44/7
8 8 44/8
9 1 45/1
10 2 45/2
11 3 45/3
12 1 46/1
13 2 46/2
14 3 46/3
15 4 46/4
16 5 46/5
17 6 46/6
18 7 46/7
19 8 46/8
20 9 46/9
21 10 46/10
22 11 46/11
23 1 47/1
24 2 47/2
25 3 47/3
26 4 47/4
27 5 47/5
28 6 47/6
29 1 48/1
30 2 48/2
31 3 48/3
32 4 48/4
33 5 48/5
34 6 48/6
35 7 48/7
36 8 48/8
37 9 48/9
38 1 49/1
39 2 49/2
40 3 49/3
41 4 49/4
42 5 49/5
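A slightly shorter variant, offered as a sketch rather than a drop-in replacement: since cumsum(sygnature == 1) already numbers the runs 1, 2, 3, and so on, it can index n directly. This assumes that every run starts at 1 and that n has one entry per run; the variable name run is introduced here for illustration.

```r
data <- data.frame(sygnature = c(1:8, 1:3, 1:11, 1:6, 1:9, 1:5))
n <- 44:49

# Run id: increases by one each time the sequence restarts at 1
run <- cumsum(data$sygnature == 1)

# Pick the prefix for each row by its run id, then append the position
data$res <- paste0(n[run], "/", data$sygnature)
head(data, 10)
```

Because both vectors passed to paste0 have the same length, this form does not trigger the "number of items to replace" warning from the question.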

Making multiple matrices for different run of simulation using for loops

I have data from a simulation with counts of interactions among 10 individuals, and there are 80 runs. I would like to make a separate matrix for each run, and then use a function to calculate the ranking of individuals from the matrix for each run.
Is it possible to use for loops for:
making matrices for each run
running a function through all matrices
I am new to R, so I don't really know how to write these iterative loops. I made separate matrices and ran the function separately for each matrix, but this is very time-consuming and error-prone.
This is what the data looks like:
head(A)
[run number] distribution who-won1 who-won2 won-battle
1 3 4 patches 7 5 17
2 3 4 patches 9 4 31
3 3 4 patches 0 1 11
4 3 4 patches 2 1 7
5 3 4 patches 2 9 4
6 3 4 patches 5 7 36
7 3 4 patches 9 6 10
8 3 4 patches 2 7 3
9 3 4 patches 1 0 19
10 3 4 patches 3 7 7
Then I used this to make the matrices; each one is an actor-receiver matrix with the counts of fights won for each actor-receiver pair.
Alist <- vector("list", 40)
for (run in 1:40) {
  newmatrix <- matrix(nrow = 10, ncol = 10)
  for (x in 1:90) { # 90 rows per run
    Actor <- A$Actor[A$Group == run][x] + 1
    Receiver <- A$Receiver[A$Group == run][x] + 1
    Won <- A$`won-battle`[A$Group == run][x]
    newmatrix[Actor, Receiver] <- as.numeric(Won)
  }
  newmatrix[is.na(newmatrix)] <- 0
  Alist[[run]] <- newmatrix # store in the preallocated list
}
and it gives a matrix like this:
...1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 11 19 23 11 9 1 12 34 3
2 2 19 0 25 24 13 12 5 12 35 13
3 3 14 7 0 14 6 3 1 3 38 4
4 4 16 8 10 0 1 5 2 7 19 8
5 5 30 19 35 35 0 17 9 16 67 18
6 6 31 50 52 38 21 0 21 36 83 26
7 7 69 42 46 38 35 43 0 62 66 59
8 8 38 23 48 44 19 17 7 0 66 21
9 9 26 14 31 24 4 2 5 6 0 12
10 10 41 35 43 48 31 33 10 34 64 0
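One way to avoid the nested loops, sketched here on toy data: split the data frame by run, build one matrix per piece, and apply a function to every matrix with lapply. The column names Group, Actor, Receiver and Won, the toy values, and the helper make_matrix are all assumptions made for illustration; rowSums merely stands in for whatever ranking function is actually used.

```r
# Hypothetical toy data: 2 runs, 3 individuals with ids starting at 0
A <- data.frame(
  Group    = c(1, 1, 1, 2, 2),
  Actor    = c(0, 1, 2, 0, 2),
  Receiver = c(1, 2, 0, 2, 1),
  Won      = c(5, 3, 7, 2, 9)
)
n_ind <- 3

# Build one actor-receiver matrix from the rows of a single run
make_matrix <- function(d, n) {
  m <- matrix(0, nrow = n, ncol = n)
  # A two-column index matrix addresses (row, column) pairs directly
  m[cbind(d$Actor + 1, d$Receiver + 1)] <- d$Won
  m
}

# One matrix per run, stored in a named list
mats <- lapply(split(A, A$Group), make_matrix, n = n_ind)

# Apply any function to every matrix; rowSums is only a placeholder
row_totals <- lapply(mats, rowSums)
mats[["1"]]
```

With the real data, n_ind would be 10 and the placeholder would be replaced by the ranking function; the list names are the run numbers as character strings.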

Finding the k-largest clusters in dbscan result

I have a dataframe df consisting of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into the dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see that cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points and cluster 5 consists of 561 seed points.
I define the largest cluster as the one that contains the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or in the k-largest clusters in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point with db$cluster.
Hence, I can create a new dataframe df2 with db$cluster as an additional column besides the original x and y columns.
Then, I can aggregate df2 by the cluster numbers in the third column and find the number of points in each cluster.
After that, I can find the k largest groups, which are 2, 1 and 5 again.
Finally, I can select the rows in df2 whose third column equals 2 to return the points in the largest cluster.
But the above approach re-computes many results that are already known from the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k){
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  data[dbs$cluster == kth,]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
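Note that cluster_k ranks clusters by their total size and counts the noise group 0 as a candidate, whereas the question ranks clusters by seed points. A possible variant is sketched below; it relies only on the cluster and isseed components listed by attributes(dbs) above. The function name cluster_k_seed and the small hand-made list standing in for a dbscan result are assumptions for illustration.

```r
# Hypothetical variant: rank clusters by their number of seed points
cluster_k_seed <- function(dbs, data, k) {
  # Keep only seed points before counting; noise points are never seeds,
  # so cluster 0 drops out of the ranking
  seed_counts <- table(dbs$cluster[dbs$isseed])
  ranked <- names(sort(seed_counts, decreasing = TRUE))
  # Return every point of the chosen cluster, border points included
  data[dbs$cluster == as.integer(ranked[k]), , drop = FALSE]
}

# Hand-made stand-in for a dbscan result (not produced by fpc)
fake <- list(
  cluster = c(1, 1, 1, 2, 2, 0, 2, 2, 2),
  isseed  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
pts <- data.frame(x = 1:9, y = 9:1)

cluster_k_seed(fake, pts, k = 1)
```

In this toy input cluster 2 has three seed points and cluster 1 has two, so k = 1 selects the five rows of cluster 2.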

R Data.Table Filter by table A in table B

The goal of this code is to find the quadrant in which a given point lies relative to a given circle.
I have two separate data.tables. Table A holds different variations of the circle equation parameters. Table B holds the raw data, for which I want to count how many points lie in each quadrant of each circle. I follow this sequence:
Get the circle equation from Table A
Filter the points in Table B to those within the circle's coordinate bounds
Find the quadrant of the circle in which each point lies (getQuadrant function)
Count how many points lie in each quadrant (Quadrants function)
I have made some attempts, but they are slow to return results. The tables are as follows:
set.seed(4)
TableA <- data.table(speed=rep(42:44,each=3),
minX = rep(c(1:12),3),
maxX = rep(c(10:21),3),
minY = 1,
maxY = 10,
r = 5,
cX = rep(c(6:17),3),
cY = 6,
indx = 1:36)
TableA
speed minX maxX minY maxY r cX cY indx
1: 42 1 10 1 10 1 2 2 1
2: 42 2 11 1 10 1 2 2 2
3: 42 3 12 1 10 1 2 2 3
4: 43 1 10 1 10 1 2 2 4
5: 43 2 11 1 10 1 2 2 5
6: 43 3 12 1 10 1 2 2 6
7: 44 1 10 1 10 1 2 2 7
8: 44 2 11 1 10 1 2 2 8
9: 44 3 12 1 10 1 2 2 9
TableB <- data.table(speed=rep(42:44,each=100),
x = rep(sample(12),100),
y = rep(sample(12),100),
n = rep(sample(12),100))
TableB
speed x y n
1: 42 8 2 8
2: 42 1 11 10
3: 42 3 5 5
4: 42 10 10 12
5: 42 7 8 11
Function to find quadrant:
getQuadrant <- function(X=0,Y=0,R=1,PX=10,PY=10){
  #' X and Y are center of the circle
  #' R = Radius
  #' PX and PY are a point anywhere
  # The point is on the center
  if (PX == X & PY == Y)
    return(0)
  val = ((PX - X)^2 + (PY - Y)^2)
  # Outside the circle
  if (val > R^2)
    return(5)
  # 1st quadrant
  if (PX > X & PY >= Y)
    return(1)
  # 2nd quadrant
  if (PX <= X & PY > Y)
    return(2)
  # 3rd quadrant
  if (PX < X & PY <= Y)
    return(3)
  # 4th quadrant
  if (PX >= X & PY < Y)
    return(4)
}
Function to return number of points in the quadrant.
Quadrants <- function(dt,radius,centerX,centerY){
  #' dt is filtered data for the circle
  #' radius of the circle equation
  #' centerX and centerY are the center point of the circle equation
  if(nrow(dt) > 0 ){
    dt[,quadrant:=factor(mapply(function(X,Y,R,PX,PY) getQuadrant(X=X,Y=Y,R=R,PX=PX,PY=PY),centerX,centerY,radius,x_cut,y_cut), levels = c("1","2","3","4","5"))]
    dt <- dt[, .(.N), keyby = .(quadrant)]
    setkeyv(dt, c("quadrant"))
    dt <- dt[CJ(levels(dt[,quadrant])),]
    dd <- list(Q1=dt$N[1],Q2=dt$N[2],Q3=dt$N[3],Q4=dt$N[4],Q5=dt$N[5])
  }else{
    dd <- list(Q1=NA,Q2=NA,Q3=NA,Q4=NA,Q5=NA)
  }
  return(dd)
}
I have the following solution, but it doesn't work.
finalTable <- TableA[,c('Q1','Q2','Q3','Q4','Q5') := mapply(function(a,b,c,d,e,f,g,h) Quadrants(TableB[, .SD[x %between% c(a,b) & y %between% c(c,d) & speed == h]], radius=e, centerX = f, centerY = g),minX,maxX,minY,maxY,r,cX,cY,speed)]
I don't think I am doing this right, because the results below are not the expected ones.
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 1 10 1 10 5 6 6 1 32 32 100 68 68
2: 42 2 11 1 10 5 7 6 2 32 32 100 68 68
3: 42 3 12 1 10 5 8 6 3 32 32 100 68 68
4: 43 4 13 1 10 5 9 6 4 32 32 100 68 68
...
11: 42 11 20 1 10 5 16 6 11 32 32 100 68 68
12: 42 12 21 1 10 5 17 6 12 32 32 100 68 68
13: 43 1 10 1 10 5 6 6 13 32 32 100 68 68
14: 43 2 11 1 10 5 7 6 14 32 32 100 68 68
15: 43 3 12 1 10 5 8 6 15 32 32 100 68 68
...
22: 43 10 19 1 10 5 15 6 22 32 32 100 68 68
23: 43 11 20 1 10 5 16 6 23 32 32 100 68 68
24: 43 12 21 1 10 5 17 6 24 32 32 100 68 68
25: 44 1 10 1 10 5 6 6 25 32 32 100 68 68
26: 44 2 11 1 10 5 7 6 26 32 32 100 68 68
27: 44 3 12 1 10 5 8 6 27 32 32 100 68 68
28: 42 4 13 1 10 5 9 6 28 32 32 100 68 68
...
35: 44 11 20 1 10 5 16 6 35 32 32 100 68 68
36: 44 12 21 1 10 5 17 6 36 32 32 100 68 68
Could anyone take a look, please? I would really appreciate it.
Expected Output:
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 2 11 1 10 5 7 6 1 200 100 400 100 200
2: 42 3 12 1 10 5 8 6 2 200 100 300 100 200
3: 42 4 13 1 10 5 9 6 3 200 100 300 100 100
4: 42 5 14 1 10 5 10 6 4 100 200 300 NA 100
...
11: 42 12 21 1 10 5 17 6 11 NA NA NA NA NA
12: 42 13 22 1 10 5 18 6 12 NA NA NA NA NA
13: 43 2 11 1 10 5 7 6 13 200 100 400 100 200
14: 43 3 12 1 10 5 8 6 14 200 100 300 100 200
15: 43 4 13 1 10 5 9 6 15 200 100 300 100 100
...
22: 43 11 20 1 10 5 16 6 22 NA NA NA NA 100
23: 43 12 21 1 10 5 17 6 23 NA NA NA NA NA
24: 43 13 22 1 10 5 18 6 24 NA NA NA NA NA
25: 44 2 11 1 10 5 7 6 25 200 100 400 100 200
26: 44 3 12 1 10 5 8 6 26 200 100 300 100 200
27: 44 4 13 1 10 5 9 6 27 200 100 300 100 100
28: 44 5 14 1 10 5 10 6 28 100 200 300 NA 100
...
35: 44 12 21 1 10 5 17 6 35 NA NA NA NA NA
36: 44 13 22 1 10 5 18 6 36 NA NA NA NA NA
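One likely contributor to the slowness is that getQuadrant is called once per point through mapply. A vectorized version of the same classification rules is sketched below in base R; the name get_quadrant_vec is made up for illustration, and the sketch covers only the quadrant step, not the filtering or the join between the two tables.

```r
# Hypothetical vectorized counterpart of getQuadrant: same rules, whole vectors at once
get_quadrant_vec <- function(X, Y, R, PX, PY) {
  dx <- PX - X
  dy <- PY - Y
  q <- rep(NA_integer_, length(dx))
  # The four quadrant conditions are mutually exclusive, so order does not matter
  q[dx > 0  & dy >= 0] <- 1L
  q[dx <= 0 & dy > 0]  <- 2L
  q[dx < 0  & dy <= 0] <- 3L
  q[dx >= 0 & dy < 0]  <- 4L
  # Later assignments override earlier ones, mirroring the early returns
  q[dx^2 + dy^2 > R^2] <- 5L    # outside the circle
  q[dx == 0 & dy == 0] <- 0L    # exactly on the center
  q
}

px <- c(0, 1, -1, -1,  1, 10, 0,  0)
py <- c(0, 1,  1, -1, -1, 10, 3, -3)
q <- get_quadrant_vec(X = 0, Y = 0, R = 5, PX = px, PY = py)
q
# [1] 0 1 2 3 4 5 2 4

# Counts per quadrant, keeping empty quadrants as zero
table(factor(q, levels = 1:5))
```

Counting through a factor with fixed levels gives a zero for an empty quadrant instead of dropping it, which differs from the NA shown in the expected output and may need adjusting.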

Cumulative function for a specific range of values

I have a table with a column "Age" that has values from 1 to 10, and a column "Population" with a value for each age. I want to generate a cumulative sum of the population such that the resulting values cover ages 1 and above, 2 and above, and so on. That is, the resulting array should be (203, 180, and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
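As a quick check of the arithmetic, the same column can be obtained as the grand total minus everything strictly below each age. The sketch below rebuilds the Age and Population columns from the question; the variable alt is introduced only for the comparison.

```r
df <- data.frame(Age = 1:10,
                 Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22))

# Reverse, accumulate, reverse back: element i is the total for ages >= i
df$sum_above <- rev(cumsum(rev(df$Population)))

# Same quantity: grand total minus the running total, plus the current row
alt <- sum(df$Population) - cumsum(df$Population) + df$Population

all(df$sum_above == alt)
# [1] TRUE
```

The first value is the overall total (203) and the last is just the population at age 10 (22), matching the result shown above.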

Resources