Imagine a small dataset of xy coordinates. These points are grouped by a variable called indexR; there are 3 groups in total. All xy coordinates are in the same units. The data looks approximately like so:
# A tibble: 61 x 3
indexR x y
<dbl> <dbl> <dbl>
1 1 837 924
2 1 464 661
3 1 838 132
4 1 245 882
5 1 1161 604
6 1 1185 504
7 1 853 870
8 1 1048 859
9 1 1044 514
10 1 141 938
# ... with 51 more rows
The goal is to determine which 3 points, one from each group, are closest to each other, in the sense of minimizing the sum of the pairwise distances between selected points.
I have attempted this by considering Euclidean distances, as follows. (Credit goes to @Mouad_S, in this thread, and https://gis.stackexchange.com/questions/233373/distance-between-coordinates-in-r)
#dput provided at bottom of this post
> df$dummy = 1
> df %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ filter(indexR.x != indexR.y & indexR.x != indexR & indexR.y != indexR) %>%
+ mutate(dist =
+ ((.$x - .$x.x)^2 + (.$y- .$y.x)^2)^.5 +
+ ((.$x - .$x.y)^2 + (.$y- .$y.y)^2)^.5 +
+ ((.$x.x - .$x.y)^2 + (.$y.x- .$y.y)^2)^.5,
+ dist = round(dist, digits = 0)) %>%
+ arrange(dist) %>%
+ filter(dist == min(dist))
# A tibble: 6 x 11
indexR.x x.x y.x dummy indexR.y x.y y.y indexR x y dist
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 638 324 1 2 592 250 3 442 513 664
2 1 638 324 1 3 442 513 2 592 250 664
3 2 592 250 1 1 638 324 3 442 513 664
4 2 592 250 1 3 442 513 1 638 324 664
5 3 442 513 1 1 638 324 2 592 250 664
6 3 442 513 1 2 592 250 1 638 324 664
From this we can identify the three points closest together (minimum distance apart; enlarged on the figure below). However, the challenge comes when extending this so that indexR has 4, 5, ..., n groups. The problem is finding a more practical or optimised method for making this calculation.
structure(list(indexR = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3), x = c(836.65, 464.43, 838.12, 244.68, 1160.86,
1184.52, 853.4, 1047.96, 1044.2, 141.06, 561.01, 1110.74, 123.4,
1087.24, 827.83, 100.86, 140.07, 306.5, 267.83, 1118.61, 155.04,
299.52, 543.5, 782.25, 737.1, 1132.14, 659.48, 871.78, 1035.33,
867.81, 192.94, 1167.8, 1099.59, 1097.3, 1089.78, 1166.59, 703.33,
671.64, 346.49, 440.89, 126.38, 638.24, 972.32, 1066.8, 775.68,
591.86, 818.75, 953.63, 1104.98, 1050.47, 722.43, 1022.17, 986.38,
1133.01, 914.27, 725.15, 1151.52, 786.08, 1024.83, 246.52, 441.53
), y = c(923.68, 660.97, 131.61, 882.23, 604.09, 504.05, 870.35,
858.51, 513.5, 937.7, 838.47, 482.69, 473.48, 171.78, 774.99,
792.46, 251.26, 757.95, 317.71, 401.93, 326.32, 725.89, 98.43,
414.01, 510.16, 973.61, 445.33, 504.54, 669.87, 598.75, 225.27,
789.45, 135.31, 935.51, 270.38, 241.19, 595.05, 401.25, 160.98,
778.86, 192.17, 323.76, 361.08, 444.92, 354, 249.57, 301.64,
375.75, 440.03, 428.79, 276.5, 408.84, 381.14, 459.14, 370.26,
304.05, 439.14, 339.91, 435.85, 759.42, 513.37)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -61L), .Names = c("indexR",
"x", "y"))
One possibility would be to formulate the problem of identifying the closest elements, one from each group, as a mixed integer program. We could define decision variables y_i for whether each point i is selected, as well as x_{ij} for whether points i and j are both selected (x_{ij} = y_iy_j). We need to select one element from each group.
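Written out, the formulation that the code further down encodes looks roughly as follows. The notation is introduced here for illustration only: d_{ij} is the Euclidean distance between points i and j, g(i) is the group of point i, and x_{ij} is only defined for pairs from different groups. The three inequalities are the usual linearization of x_{ij} = y_i y_j for binary variables.

```latex
\begin{align}
\min_{x,\,y} \quad & \sum_{i<j,\; g(i)\neq g(j)} d_{ij}\, x_{ij} \\
\text{s.t.} \quad & x_{ij} \le y_i, \qquad x_{ij} \le y_j, \qquad x_{ij} \ge y_i + y_j - 1
  && \text{for every pair } (i,j) \\
& \sum_{i:\; g(i)=a} y_i = 1
  && \text{for every group } a \\
& \sum_{i<j:\; \{g(i),\,g(j)\}=\{a,b\}} x_{ij} = 1
  && \text{for every pair of groups } a \neq b \\
& x_{ij} \in \{0,1\}, \qquad y_i \in \{0,1\}
\end{align}
```

The last family of equalities is implied by the others, but it is included because the code adds it as an explicit constraint.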
In practice, you could implement this mixed integer program using the lpSolve package (or one of the other R optimization packages).
opt.closest <- function(df) {
# Compute every pair of indices
library(dplyr)
pairs <- as.data.frame(t(combn(nrow(df), 2))) %>%
mutate(G1=df$indexR[V1], G2=df$indexR[V2]) %>%
filter(G1 != G2) %>%
mutate(dist = sqrt((df$x[V1]-df$x[V2])^2+(df$y[V1]-df$y[V2])^2))
# Compute a few convenience values
n <- nrow(df)
nP <- nrow(pairs)
groups <- sort(unique(df$indexR))
nG <- length(groups)
gpairs <- combn(groups, 2)
nGP <- ncol(gpairs)
# Solve the optimization problem
obj <- c(pairs$dist, rep(0, n))
constr <- rbind(cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==")),
cbind(diag(nP), -outer(pairs$V2, seq_len(n), "==")),
cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==") - outer(pairs$V2, seq_len(n), "==")),
cbind(matrix(0, nG, nP), outer(groups, df$indexR, "==")),
cbind((outer(gpairs[1,], pairs$G1, "==") &
outer(gpairs[2,], pairs$G2, "==")) |
(outer(gpairs[2,], pairs$G1, "==") &
outer(gpairs[1,], pairs$G2, "==")), matrix(0, nGP, n)))
dir <- rep(c("<=", ">=", "="), c(2*nP, nP, nG+nGP))
rhs <- rep(c(0, -1, 1), c(2*nP, nP, nG+nGP))
library(lpSolve)
mod <- lp("min", obj, constr, dir, rhs, all.bin=TRUE)
which(tail(mod$solution, n) == 1)
}
This can compute the closest 3 points, one from each cluster, in your example dataset:
df[opt.closest(df),]
# A tibble: 3 x 3
# indexR x y
# <dbl> <dbl> <dbl>
# 1 1 638.24 323.76
# 2 2 591.86 249.57
# 3 3 441.53 513.37
It can also compute the best possible solution for datasets with more points and groups. Here are the runtimes for datasets with 7 groups each and 100 and 200 points:
make.dataset <- function(n, nG) {
set.seed(144)
data.frame(indexR = sample(seq_len(nG), n, replace=T), x = rnorm(n), y=rnorm(n))
}
df100 <- make.dataset(100, 7)
system.time(opt.closest(df100))
# user system elapsed
# 11.536 2.656 15.407
df200 <- make.dataset(200, 7)
system.time(opt.closest(df200))
# user system elapsed
# 187.363 86.454 323.167
This is far from instantaneous -- it takes 15 seconds for the 100-point, 7-group dataset and 323 seconds for the 200-point, 7-group dataset. Still, it is much quicker than iterating through all 92 million 7-tuples in the 100-point dataset or all 13.8 billion 7-tuples in the 200-point dataset. You could set a runtime limit with a solver like the one from the Rglpk package to get the best solution obtained within that limit.
You cannot afford to enumerate all possible solutions, and I don't see any obvious shortcut.
So I guess you'll have to do a branch and bound optimization approach.
First guess a reasonably good solution. Like the closest two points with different labels. Then add the nearest with a different label until you have all labels covered.
Now do some trivial optimization: for every label, try if there is some point that you can use instead of the current point to improve the result. Stop when you can't find any further improvement.
For this initial guess, compute the distances. This will give you an upper bound, which allows you to stop your search early. You can also compute a lower bound, the sum of all best two-label solutions.
Now you can try to remove points, where the nearest neighbors of each label + the lower bounds for all other labels is already worse than your initial solution. This will hopefully eliminate a lot of points.
Then you can start enumerating solutions (probably begin with the smallest labels first), but stop recursion whenever the current solution + the remaining lower bounds are larger than your best known solution (branch and bound).
You can also try sorting points e.g. by minimum distance to the remaining labels, to hopefully find better bounds fast.
I'd certainly not choose R to implement this...
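That caveat aside, the first two steps (initial guess plus simple per-label improvement) can be sketched in R. This is only a minimal sketch that produces an upper bound, not the full branch and bound. The function name greedy_closest is hypothetical, the column names indexR, x and y follow the question, and "nearest" is taken to mean the smallest summed distance to the points already chosen, which is an assumption.

```r
# Hypothetical helper: greedy seed + local swaps, returns an upper bound only.
greedy_closest <- function(df) {
  d <- as.matrix(dist(df[, c("x", "y")]))   # all pairwise Euclidean distances
  g <- df$indexR
  groups <- unique(g)                       # assumes at least two groups

  # Seed: the closest pair of points carrying different labels
  d_cross <- d
  d_cross[outer(g, g, "==")] <- Inf         # forbid same-label pairs
  seed <- which(d_cross == min(d_cross), arr.ind = TRUE)[1, ]
  sel <- as.integer(seed)

  # Grow: add the nearest point whose label is not covered yet
  while (length(sel) < length(groups)) {
    cand <- which(!(g %in% g[sel]))
    cost <- rowSums(d[cand, sel, drop = FALSE])
    sel <- c(sel, cand[which.min(cost)])
  }

  # Local improvement: for each label, swap in a better point if one exists
  improved <- TRUE
  while (improved) {
    improved <- FALSE
    for (k in seq_along(sel)) {
      others <- sel[-k]
      cand <- which(g == g[sel[k]])
      cost <- rowSums(d[cand, others, drop = FALSE])
      if (min(cost) < sum(d[sel[k], others]) - 1e-12) {
        sel[k] <- cand[which.min(cost)]
        improved <- TRUE
      }
    }
  }
  list(rows = sort(sel), upper_bound = sum(d[sel, sel]) / 2)
}

# Made-up toy data: points 1, 3 and 5 sit close together near the origin
toy <- data.frame(indexR = c(1, 1, 2, 2, 3, 3),
                  x = c(0, 10, 0, 20, 1, 30),
                  y = c(0, 10, 1, 20, 0, 30))
greedy_closest(toy)
```

The returned upper_bound is what the pruning steps described above would compare partial solutions against.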
You can use cross joins to build all the point combinations, calculate the total distance between the three points in each combination, then take the minimum of that.
df$id <- row.names(df) # to create ID's for the points
df2 <- merge(df, df, by = NULL ) # the first cross join
df3 <- merge(df2, df, by = NULL) # the second cross join
# eliminating rows where the points are of the same indexR
df3 <- df3[df3$indexR.x != df3$indexR.y & df3$indexR.x != df3$indexR
& df3$indexR.y != df3$indexR,]
## calculating the total distance
df3$total_distance <- ((df3$x - df3$x.x)^2 + (df3$y- df3$y.x)^2)^.5 +
((df3$x - df3$x.y)^2 + (df3$y- df3$y.y)^2)^.5 +
((df3$x.x - df3$x.y)^2 + (df3$y.x- df3$y.y)^2)^.5
## minimum distance
df3[which.min(df3$total_distance),]
indexR.x x.x y.x id.x indexR.y x.y y.y id.y indexR x y id total_distance
155367 3 441.53 513.37 61 2 591.86 249.57 46 1 638.24 323.76 42 664.3373
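The same brute-force idea can be written for any number of groups by enumerating one row index per group with expand.grid. The sketch below is an illustration rather than part of the answer above: closest_brute is a hypothetical name, the columns indexR, x and y are assumed as in the question, and the number of combinations is the product of the group sizes, so it is only practical for small inputs.

```r
# Hypothetical helper: exhaustive search over one point per group.
closest_brute <- function(df) {
  d <- as.matrix(dist(df[, c("x", "y")]))            # pairwise distances
  idx_by_group <- split(seq_len(nrow(df)), df$indexR)
  combos <- as.matrix(expand.grid(idx_by_group))      # one row index per group
  pair_cols <- combn(ncol(combos), 2)                 # every pair of groups

  # Sum the pairwise distances for each candidate combination
  total <- numeric(nrow(combos))
  for (k in seq_len(ncol(pair_cols))) {
    a <- combos[, pair_cols[1, k]]
    b <- combos[, pair_cols[2, k]]
    total <- total + d[cbind(a, b)]
  }
  combos[which.min(total), ]                          # row indices into df
}

# Made-up toy data: points 1, 3 and 5 sit close together near the origin
toy <- data.frame(indexR = c(1, 1, 2, 2, 3, 3),
                  x = c(0, 10, 0, 20, 1, 30),
                  y = c(0, 10, 1, 20, 0, 30))
toy[closest_brute(toy), ]
```

Unlike the cross join, this avoids generating every ordering of the same set of points, since each group contributes exactly one column.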
I developed a simple algorithm to quickly solve this problem. The first step is to overlay a grid on the entire area of points. The next step is to assign each point from each group to the cell (unit square) where it is located.
Next we go to the lower left corner of the grid and move over one cell and up one cell. This is the starting cell. We define a region of interest consisting of this cell and all 8 of its neighbors, then test whether at least one point from each of the groups lies within this 9-cell region. If so, all combinations of points in the region are used to get a total distance, where paired points for the distance calculation are never from the same group. From these calculations, the combination with the minimum distance involving a single point from each group is saved as a possible solution.
This entire process is then repeated with the central cell moved one cell to the right, stopping one cell from the right end. When the first row is completed, the process moves up one row and starts again at the left, one cell over. Thus each cell has been considered when the top row is finished. The solution is the minimum distance computed over all the 9-cell regions tested.
The reason we consider a 9-cell region and not just go cell-by-cell is that we could miss closely spaced points from different groups that are located in the corners of cells.
It's important to choose the correct cell or grid size. If the cells are too small then no possible solution will be found because none of the regions will encompass at least one point from each group. If the cells are too large then there will be many points from each group and calculation time will be excessive. Fortunately this optimal cell size can be quickly found through trial and error.
I've run this algorithm multiple times with varying number of groups and number of points in a group. For randomly scattered points in all groups I found that a 15 x 15 grid size works well for a 10 group - 400 point (40 points per group) case. That example runs in under one second.
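The sliding 9-cell window described above might look something like this in R. It is a sketch under several assumptions, not the author's implementation: grid_closest and n_cells are hypothetical names, the columns indexR, x and y are taken from the question, the grid is fitted to the range of the data, and indexR is assumed not to be a factor. As a heuristic it can miss the optimum when the best combination spans more than three cells in either direction.

```r
# Hypothetical helper: search every 3x3 block of grid cells.
grid_closest <- function(df, n_cells = 15) {
  # Map a coordinate vector to cell numbers 1..n_cells (assumes max > min)
  cell_of <- function(v) {
    pmin(n_cells, 1 + floor(n_cells * (v - min(v)) / (max(v) - min(v))))
  }
  cx <- cell_of(df$x)
  cy <- cell_of(df$y)
  n_groups <- length(unique(df$indexR))
  d <- as.matrix(dist(df[, c("x", "y")]))
  best <- list(total = Inf, rows = NULL)

  # Centre cells run from one cell in from each edge (needs n_cells >= 3)
  for (i in 2:(n_cells - 1)) {
    for (j in 2:(n_cells - 1)) {
      inside <- which(abs(cx - i) <= 1 & abs(cy - j) <= 1)
      by_group <- split(inside, df$indexR[inside])
      if (length(by_group) < n_groups) next    # some group is missing here

      # Brute force inside the window: one point per group
      combos <- as.matrix(expand.grid(by_group))
      pair_cols <- combn(ncol(combos), 2)
      total <- numeric(nrow(combos))
      for (k in seq_len(ncol(pair_cols))) {
        total <- total + d[cbind(combos[, pair_cols[1, k]],
                                 combos[, pair_cols[2, k]])]
      }
      k_min <- which.min(total)
      if (total[k_min] < best$total) {
        best <- list(total = total[k_min], rows = combos[k_min, ])
      }
    }
  }
  best   # total stays Inf if no window contained every group
}

# Made-up toy data: points 1, 3 and 5 sit close together near the origin
toy <- data.frame(indexR = c(1, 1, 2, 2, 3, 3),
                  x = c(0, 10, 0, 20, 1, 30),
                  y = c(0, 10, 1, 20, 0, 30))
grid_closest(toy, n_cells = 5)
```

If the result comes back with total equal to Inf, the cells are too small for the data and n_cells should be reduced.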
I am new to R and have been stuck with a problem for quite a while now ...
I have a big dataset (originally gridded data) with more than 1,000,000 observations and have to make a group variable for my elements.
My dataset looks like follows:
ID Var1
1 0,5
2 0,6
3 0,2
4 0,15
... ...
1029600 0,43
What I want now is to make groups according to the following scheme:
1 2 3 4 5 6 ... 4320
4321 4322 4323 4324 4325 4326 ... 8640
8641 8642 8643 8644 8645 8646 ... 12960
12961 12962 12963 12964 12965 12966 ... 17280
17281 17282 17283 17284 17285 17286 ... 21600
21601 21602 21603 21604 21605 21606 ... 25920
... ... ... ... ... ... ... ...
1025281 1025282 1025283 1025284 1025285 1025286... 1029600
Where the 36 numbers {1,2,3,4,5,6,4321,4322,4323,4324,4325,4326,8641,8642,...,21606} are the first group.
The second group would be {7,8,9,10,11,12,4327,4328,...,21612}. The third group would start with {13,14,15...}. And so on for all observations. I hope I could make it clear what my goal is here. I wanted to visualize it with a picture, but as a new member, this is not possible.
So far I managed to do it with a really ugly loop function, which looks as follows:
for(k in 0:40) {
nk <- 25920 * k
mk <- 720 * k
for (j in 0:719) {
cj <- j * 6
for (i in 0:5) {
ai <- i * 4320 + 1 + cj + nk
bi <- i * 4320 + 6 + cj + nk
group[ai:bi] <- 1 + j + mk
}
}
}
I am aware that this is pretty inefficient and it takes a very long time to compute this with loops. I am pretty sure that there is an easier way to solve my problem, but as I am new to R, I cannot find it myself.
Any help would be really appreciated. Thank you in advance!
You can get the group from the ID with a simple formula:
group <- (((ID-1) %% 4320) %/% 6) +1
Note that %% is the modulo operation and %/% is the integer division. The formula should give you groups numbered from 1. No need to include it in a loop, it is a vectorized operation.
There are plenty of ways to do it (like reshaping 1:1029600 into a matrix with 4320 columns and taking the 6*N:6*(N+1) columns and do a match or something) but this is why you should always stop and think about what, really, you want to do. And realize it comes down to a little arithmetic :)
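One thing worth checking against the loop in the question: the formula above cycles through the same 720 group numbers on every row of 4320 IDs, whereas the loop starts a fresh set of 720 groups every 6 rows (25920 IDs). If the 36-element groups from the question are what is wanted, a block offset can be added. The sketch below is an assumption about the intended grouping; group_block and its parameter names are hypothetical.

```r
# Hypothetical helper: group number for a width x height tile on a grid
# that is n_col cells wide and numbered row by row.
group_block <- function(ID, n_col = 4320, width = 6, height = 6) {
  groups_per_band <- n_col %/% width            # 720 tiles side by side
  band <- (ID - 1) %/% (n_col * height)         # which band of 6 rows
  tile <- ((ID - 1) %% n_col) %/% width         # tile position inside the band
  tile + 1 + groups_per_band * band
}

ID <- 1:51840                  # the first two bands of 6 rows
group <- group_block(ID)
which(group == 1)              # the IDs that end up in the first group
range(table(group))            # smallest and largest group size
```

Like the original formula, this is vectorized, so it can be applied to the whole ID column at once.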
Create sample data
dtf <- data.frame(ID = 1:1e4, Var1 = rnorm(1:1e4))
Grouping as explained by @antine-sac:
group <- (((dtf$ID-1) %% 4320) %/% 6) +1
Split the data
dtfsplit <- split(dtf, group)
First group
> dtfsplit[1]
$`1`
ID Var1
1 1 0.56655
2 2 0.87645
3 3 -1.41986
4 4 -1.84881
5 5 0.03233
6 6 3.06512
4321 4321 -1.57179
4322 4322 -1.09958
4323 4323 0.55980
4324 4324 0.32390
4325 4325 0.85438
4326 4326 -0.10311
8641 8641 2.08886
8642 8642 1.19836
8643 8643 0.52592
8644 8644 0.20571
8645 8645 1.08429
8646 8646 0.69648
Second group
dtfsplit[2]
I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000
And 1000 / 100 (the number of bins specified) = 10
Therefore the ideal data would look like this
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but multiple data ranges. Hopefully the example below clarifies this.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I have to retype the function body 100 times (because my number of bins is 100)
2) I can't find a way to repeat the function along my values. In other words, my function can only deal with the range 10-1010 and not the next one, 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
f1
[1] 10 20 30 40
Any hints or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I would share it. I used the package "binr" and its function "bins" to get the required bins. Please find below my attempt to answer my question. It's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) { # default of 100 bins so the two-argument calls below run
range <- END-START
jump <- range/BINS
v1 <- c(START, seq(START+jump+1, END, jump))
v2 <- seq(START+jump-1, END, jump)+1
data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
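For the output asked for in the question (shared bin edges, no decimals, several ranges at once), one alternative is to let seq compute the 101 edges per range and loop over the ranges with Map. This is a sketch rather than a drop-in replacement for the answers above; make_bins and n_bins are hypothetical names, and rounding the edges to whole numbers is a choice made here to match the question's example.

```r
# Hypothetical helper: n_bins consecutive bins for each (start, end) pair.
make_bins <- function(start, end, n_bins = 100) {
  pieces <- Map(function(s, e) {
    # n_bins bins need n_bins + 1 edges; round to drop the decimals
    edges <- round(seq(s, e, length.out = n_bins + 1))
    data.frame(start = head(edges, -1), end = tail(edges, -1))
  }, start, end)
  do.call(rbind, pieces)        # stack the per-range tables
}

start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
bins <- make_bins(start, end)
head(bins, 3)                   # first bins of the 10-1010 range
bins[201:202, ]                 # first bins of the third range
```

Each bin's end is the next bin's start here, as in the question's example; the answers above instead start each bin one past the previous end.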
I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the 'Vehicle velocity' column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it to the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This only works for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine the data frames of all vehicles into one single table. That means row binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r
}
for (i in unique(traj1$'Vehicle ID')){
veh <- subset (traj1, traj1$'Vehicle ID'==i)
svel <- smooth(veh$'Vehicle velocity',30,10)
svel <- data.frame(svel)
veh <- head(tail(veh, -30), -30)
fta <- cbind(veh,svel)
}
'fta' now only shows the data frame for the last vehicle, but I want the data frames for all vehicles 'i' combined by row. Maybe a for loop is not the right way to do it, but I don't know how to use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)){
tre <- subset (Orange, Orange$'Tree'==i)
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb <- cbind(tre,age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
tre <- subset (Orange, Tree==unique(Orange$Tree)[i])
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb[[i]] <- cbind(tre,age2) ### save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data<-ddply(Orange, .(Tree), tail, n=-2)
data<-ddply(data, .(Tree), head, n=-2)
data<- cbind(data,
age2=matrix(sapply(split(Orange$age, Orange$Tree), smooth, D=2, delta=0.67), ncol=1, byrow=FALSE))
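A base R variant of the same idea, without plyr, is to split the data frame by group, process each piece with lapply, and row-bind the results. The sketch below reuses the smooth function from the question and the Orange analogy; pieces and final are illustrative names. Note that split orders the pieces by the factor levels of Tree (for a numeric ID column such as 'Vehicle ID', that means ascending ID).

```r
# Smoothing function as given in the question (masks stats::smooth)
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  convolve(x, z, type = "filter") / convolve(rep(1, length(x)), z, type = "filter")
}

orange <- as.data.frame(Orange)   # plain data frame copy of the built-in data

# One data frame per tree: trim 2 rows at each end, attach the smoothed column
pieces <- lapply(split(orange, orange$Tree), function(tre) {
  trimmed <- head(tail(tre, -2), -2)
  trimmed$age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  trimmed
})

final <- do.call(rbind, pieces)   # combine all groups by row
head(final)
```

For the vehicle data, the same pattern would use the 'Vehicle ID' column in split and 30 in place of 2.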