Efficiently derive bins based on condition in R - r

I need to create bins for every completed rotation e.g. 360° and bins will be of varying lengths. I have created a for loop but with 100,000+ rows it is slow. I tried to implement using dplyr and/or other non-loop methods but am unclear where and how to declare the cutoffs. None of the examples I found for either dplyr or cut() seemed to address my problem.
Sample data:
x <- c(seq(90, .5, length.out = 3),
seq(359.5, .2, length.out = 5),
seq(358.9, .8, length.out = 8),
seq(359.2, .3, length.out = 11),
seq(358.3, .1, length.out = 15))
df <- data.frame(x)
df$bin <- NA
df[1,2] <- 1
For loop:
for(i in 2:nrow(df)) {
if(df[i,1] < df[i-1,1]) {
df[i,2] <- df[i-1,2]
} else {
df[i,2] <- df[i-1,2] + 1
}
}
How are the results in df$bin achieved without using a loop?

It looks like you could do:
df$binnew <- cumsum(c(1, diff(df$x) > 0))
Compare:
x bin binnew
1 90.00000 1 1
2 45.25000 1 1
3 0.50000 1 1
4 359.50000 2 2
5 269.67500 2 2
6 179.85000 2 2
7 90.02500 2 2
8 0.20000 2 2
9 358.90000 3 3
10 307.74286 3 3
11 256.58571 3 3
12 205.42857 3 3
13 154.27143 3 3
14 103.11429 3 3
15 51.95714 3 3
16 0.80000 3 3
17 359.20000 4 4
18 323.31000 4 4
19 287.42000 4 4
20 251.53000 4 4
21 215.64000 4 4
22 179.75000 4 4
23 143.86000 4 4
24 107.97000 4 4
25 72.08000 4 4
26 36.19000 4 4
27 0.30000 4 4
28 358.30000 5 5
29 332.71429 5 5
30 307.12857 5 5
31 281.54286 5 5
32 255.95714 5 5
33 230.37143 5 5
34 204.78571 5 5
35 179.20000 5 5
36 153.61429 5 5
37 128.02857 5 5
38 102.44286 5 5
39 76.85714 5 5
40 51.27143 5 5
41 25.68571 5 5
42 0.10000 5 5

Related

identify sequences of approximately equivalent values in a series using R

I have a series of values that includes strings of values that are close to each other, for example the sequences below. Note that roughly around the places I have categorized the values in V1 with distinct values in V2, the range of the values changes. That is, all the values called 1 in V2 are within 20 points of each other. All the values marked 2 in V2 are within 20 points of each other. All the values marked 3 are within 20 points of each other, etc. Notice that the values are not identical (they are all different). But instead, they cluster around a common value.
I identified these clusters manually. How could I automate it?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet, I don't know how to do this.
Using dist and hclust with cutree to detect clusters, but with unique levels at the breaks.
hc <- hclust(dist(x))
cl <- cutree(hc, k=6)
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
However, the dendrogram suggests rather k=4 clusters instead of 6, but it is arbitrary.
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
maxrange <- 18
grp_start <- 1
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
grp <- dat$V1[grp_start:i]
if (max(grp) - min(grp) > maxrange) {
grp_num <- grp_num + 1
grp_start <- i
}
V3[[i]] <- grp_num
}
cbind(dat, V3)
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.

Adding a value based on a sequence of numbers

Please simplify my code. The result should be the same. The script works but R shows warning messages:
1: In data$sygnature[seq(first[v], last[v])] <- paste0(n[v], "/", syg) :
number of items to replace is not a multiple of replacement length
etc.
The idea is to assign each sequence in the column the same value.
data <- data.frame(sygnature = c(seq(1:8),seq(1:3),seq(1:11),seq(1:6),seq(1:9),seq(1:5)))
n <- c(44:49)
k<-c()
for(i in (1: nrow(data))){
s<- data$sygnature[i]
z<-data$sygnature[i+1]
if(
if(is.na(z)){
z<-1
s > z
}else{
s > z
}
){
k<- c(k, s)
}
}
last<- cumsum(k)
first<-(last-k)+1
syg <- data$sygnature
for(v in 1:6)
{
data$sygnature[seq(first[v],last[v])] <- paste0(n[v],"/",syg)
}
You can do,
data$res <- paste0(rep(n, aggregate(sygnature ~ cumsum(sygnature == 1), data, length)[[2]]),
'/',
data$sygnature)
data
sygnature res
1 1 44/1
2 2 44/2
3 3 44/3
4 4 44/4
5 5 44/5
6 6 44/6
7 7 44/7
8 8 44/8
9 1 45/1
10 2 45/2
11 3 45/3
12 1 46/1
13 2 46/2
14 3 46/3
15 4 46/4
16 5 46/5
17 6 46/6
18 7 46/7
19 8 46/8
20 9 46/9
21 10 46/10
22 11 46/11
23 1 47/1
24 2 47/2
25 3 47/3
26 4 47/4
27 5 47/5
28 6 47/6
29 1 48/1
30 2 48/2
31 3 48/3
32 4 48/4
33 5 48/5
34 6 48/6
35 7 48/7
36 8 48/8
37 9 48/9
38 1 49/1
39 2 49/2
40 3 49/3
41 4 49/4
42 5 49/5

Creating Groups by Matching Values of Different Columns

I would like to create groups from a base by matching values.
I have the following data table:
now<-c(1,2,3,4,24,25,26,5,6,21,22,23)
before<-c(0,1,2,3,23,24,25,4,5,0,21,22)
after<-c(2,3,4,5,25,26,0,6,0,22,23,24)
df<-as.data.frame(cbind(now,before,after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach the answer to this without using a "for" loop becouse the real data is too large.
Any you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df)) #assign each row to its own group
stop <- FALSE #indicates convergence
while(!stop){
pre <- df$group #group column at start of loop
for(i in seq_len(nrow(df))){
matched <- which(df$before==df$now[i] | df$after==df$now[i]) #check matches in before and after columns
group <- min(df$group[i], df$group[matched]) #identify smallest group no of matching rows
df$group[i] <- group #set to smallest group
df$group[matched] <- group #set to smallest group
}
if(identical(df$group, pre)) stop <- TRUE #stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))] #convert groups to letters
#(just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now),] #reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B

Rearranging order for a pair in R

I have a column with 10 random numbers, from that I want to create a new column that have switched places for every pair, see example for how I mean. How would you do that?
column newcolumn
1 5
5 1
7 6
6 7
25 67
67 25
-10 2
2 -10
-50 36
36 -50
Taking advantage of the fact that R will replicate smaller vectors when adding them to larger vectors, you can:
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- a$column[seq(nrow(a)) + c(1, -1)]
Something like this.
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- 0
a[seq(1,nrow(a),by=2),"newColumn"]<-a[seq(2,nrow(a),by=2),"column"]
a[seq(2,nrow(a),by=2),"newColumn"]<-a[seq(1,nrow(a),by=2),"column"]
# results
column newColumn
1 1 5
2 5 1
3 7 6
4 6 7
5 25 67
6 67 25
7 -10 2
8 2 -10
9 50 36
10 36 50
Here is a base R one-liner: We can cast column as 2 x nrow(df)/2 matrix, swap rows, and recast as vector.
df$newcolumn <- c(matrix(df$column, ncol = nrow(df) / 2)[c(2,1), ]);
# column newcolumn
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
Sample data
df <- read.table(text =
"column
1
5
7
6
25
67
-10
2
-50
36", header = T)
Another option would be to use ave and rev
transform(df, newCol = ave(x = df$column, rep(1:5, each = 2), FUN = rev))
# column newCol
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
The part rep(1:5, each = 2) creates a grouping variable ("pairs") for each of which we reverse the elements.
Here's a compact way:
a$new_col <- c(matrix(a$column,2)[2:1,])
# column new_col
# 1 1 5
# 2 5 1
# 3 7 6
# 4 6 7
# 5 25 67
# 6 67 25
# 7 -10 2
# 8 2 -10
# 9 50 36
# 10 36 50
The idea is to write in a 2 row matrix, switch the rows, and unfold back in a vector.

5 nearest neighbors based on given distance in r

I have the following dataset:
id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22
I want to find 5 closest neighbors for each of the id based on the distance calculated from the given (x,y). The final output should look like the following:
id n_id dist age age_n_id
1 2 2 30 23
1 5 1.5 30 2
1 3 5 30 10
1 7 3 30 7
1 8 3 30 55
2 1 6 23 30
2 10 1 23 22
2 6 2 23 66
2 7 6 23 7
2 8 9 23 55
3 2 1 10 23
3 1 2 10 30
3 4 1.2 10 5
3 6 1.6 10 66
3 9 2.3 10 26
................................
................................
10 2 1.9 22 23
10 6 2.3 22 66
10 9 2.1 22 26
10 1 2.5 22 30
10 5 1.6 22 2
where n_id is the id if the neighbors, dist is the straight line distance between id and n_id, age is the age of the id, and age_n_id is the age of the n_id. Also, the maximum distance would be 10km. If there are fewer than 5 neighbors within 10km, say 3 neighbors, the corresponding id would be repeated only three times.
I am relatively newer in r programming and any help would be much appreciated.
data.table solution:
library(data.table)
data<-fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
data[,all_x:=list(list(x))]
data[,all_y:=list(list(y))]
data[,all_age:=list(list(age))]
data[,seq_nr:=seq_len(.N)]
#Distance formula:
formula_distance<-function(x_1,x_2,y_1,y_2,z){
x_2<-x_2[[1]][-z]
y_2<-y_2[[1]][-z]
sqrt((x_1-x_2)^2+(y_1-y_2)^2)
}
data<-data[,{list(dist = formula_distance(x,all_x,y,all_y,seq_nr),
id =seq(1:nrow(data))[-id],
age_id=all_age[[1]][-id],
age=rep(age,nrow(data)-1))},by=1:nrow(data)]
data<-data[order(nrow,dist)]
#Filter data within threshold:
threshold<-1000
#How many nearest neighbors to take:
k<-5
filtered<-data[dist<=threshold]
filtered<-filtered[,{list(dist=dist[1:k],n_id=id[1:k],n_age=age_id[1:k])},by=c("nrow","age")]
filtered<-filtered[!is.na(dist)]
setnames(filtered,"nrow","id")
filtered
id age dist n_id n_age
1: 1 30 25.37893 4 5
2: 1 30 52.27055 5 2
3: 1 30 69.37211 2 23
4: 1 30 213.41050 3 10
5: 2 23 26.31045 5 2
6: 2 23 48.08326 4 5
7: 2 23 69.37211 1 30
8: 2 23 147.12665 3 10
9: 3 10 147.12665 2 23
10: 3 10 172.11243 5 2
11: 3 10 194.93653 4 5
12: 3 10 213.41050 1 30
13: 4 5 25.37893 1 30
14: 4 5 27.32471 5 2
15: 4 5 48.08326 2 23
16: 4 5 194.93653 3 10
17: 5 2 26.31045 2 23
18: 5 2 27.32471 4 5
19: 5 2 52.27055 1 30
20: 5 2 172.11243 3 10
21: 6 66 67.84106 9 26
22: 6 66 209.88273 7 7
23: 7 7 180.54432 9 26
24: 7 7 209.88273 6 66
25: 8 55 805.91482 10 22
26: 9 26 67.84106 6 66
27: 9 26 180.54432 7 7
28: 10 22 805.91482 8 55
Assuming that the unit of coordinates is in meter.
# Load packages
library(FNN)
library(tidyverse)
library(data.table)
# Create example data frame
dataset <- fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
# Calculate the nearest ID and distance
near_data <- get.knn(dataset[, 2:3], k = 5)
# Extract the nearest ID
nn_index <- as.data.frame(near_data$nn.index)
# Extract the nearest Distance
nn_dist <- as.data.frame(near_data$nn.dist)
# Re organize the data
nn_index2 <- nn_index %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, n_id, -ID)
nn_dist2 <- nn_dist %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, dist, -ID)
# Remove coordinates in dataset
dataset2 <- dataset %>% select(-x, -y)
# Create the final output
nn_final <- nn_index2 %>%
# Merge nn_index2 and nn_dist2
left_join(nn_dist2, by = c("ID", "Rank")) %>%
# Merge with dataset2 by ID and id
left_join(dataset2, by = c("ID" = "id")) %>%
# Merge with dataset2 by n_id and id
left_join(dataset2, by = c("n_id" = "id")) %>%
# Remove Rank
select(-Rank) %>%
# Rename column names
rename(id = ID, age = age.x, age_n_id = age.y) %>%
# Sort the data frame
arrange(id, dist) %>%
# Filter the dist < 10000 meters
filter(dist < 10000)

Resources