I have xy coordinates of points and I want to use the distances between them to average nearby points. My data is named qq and I obtain the distance matrix using the dist function
qq
X Y
2 4237.5 4411.5
3 4326.5 4444.5
4 4382.0 4418.0
5 4204.0 4487.5
6 4338.5 4515.0
mydist = as.matrix(dist(qq))
2 3 4 5 6
2 0.00000 94.92102 144.64612 83.0557 144.61414
3 94.92102 0.00000 61.50203 129.8278 71.51398
4 144.64612 61.50203 0.00000 191.0870 106.30734
5 83.05570 129.82777 191.08702 0.0000 137.28256
6 144.61414 71.51398 106.30734 137.2826 0.00000
What I want to do is to average points that are closer than a certain threshold; for this example we could use 80. The only pairwise distances that fall below that limit are 3-4 and 3-6. The question is how to go back to the original data and average the xy coordinates so that the 3-4 pair becomes one point and the 3-6 pair another one (discarding the former points 3, 4 and 6).
here's the dput of my data.frame
dput(qq)
structure(list(X = c(4237.5, 4326.5, 4382, 4204, 4338.5), Y = c(4411.5,
4444.5, 4418, 4487.5, 4515)), .Names = c("X", "Y"), row.names = 2:6, class = "data.frame")
UPDATE
Using some of the provided code with modifications, I get the 2 points I need to put in place of the 3-4 pair and the 3-6 pair. This means points 3, 4 and 6 will have to disappear from qq and these two points should be appended to it
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))
X Y
3 4354.25 4431.25
3 4332.50 4479.75
I think this should do it for you, if I understand the problem correctly.
y <- dist(qq)
thresh <- 80
pairs <- which(as.matrix(y) < thresh & upper.tri(as.matrix(y)), arr.ind = TRUE)
result <- apply(pairs,1,function(i) apply(qq[i,],2,mean))
#optionally, I think this is the form you will want it in.
result <- data.frame(t(result))
It will be a data frame with the same structure as qq, containing the averages of the pairs of points that are closer to each other than thresh.
UPDATE
qq <- qq[-unique(c(pairs)),]
qq <- rbind(qq,result)
Ok so I was able to merge strategies and solve the issue but not in a fancy way
# Search pairs less than threshold
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs[,1],pairs[,2]))
# Get result dataframe
out = data.frame(rbind(qq[-indx,],t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))),row.names=NULL)
dim(out)
[1] 4 2
out
X Y
1 4237.50 4411.50
2 4204.00 4487.50
3 4354.25 4431.25
4 4332.50 4479.75
The row.names get dropped because they mean nothing now that I've removed original points and added new ones. I'm still open to better ways to do it and to check everything is done correctly.
UPDATE
I made a function that could be more useful than doing things step-wise and lets you play with the threshold.
distance_fix = function(dataframe, threshold){
  mydist = as.matrix(dist(dataframe))
  # Which pairs in the upper triangle are below threshold
  pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = TRUE)
  # No close pairs: return early, because dataframe[-integer(0), ] would drop every row
  if (nrow(pairs) == 0) return(data.frame(dataframe, row.names = NULL))
  # Get the row numbers for subsetting the original matrix
  indx = unique(c(pairs))
  # Get result dataframe: untouched points first, then one averaged point per pair
  out = data.frame(rbind(dataframe[-indx, ],
                         t(apply(pairs, 1, function(i) colMeans(dataframe[i, ])))),
                   row.names = NULL)
  return(out)
}
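Note that point 3 takes part in two pairs, so the pairwise approach produces two averaged points that both include it. If you would rather collapse every chain of close points into a single point, one possible alternative (a different technique from the function above, not what was asked for originally) is single-linkage clustering cut at the threshold. The sketch below assumes the qq data from the question; the names groups and merged are only illustrative.

```r
# Data from the question
qq <- data.frame(X = c(4237.5, 4326.5, 4382, 4204, 4338.5),
                 Y = c(4411.5, 4444.5, 4418, 4487.5, 4515),
                 row.names = 2:6)
threshold <- 80

# Single linkage joins two groups as soon as any two of their points are
# closer than the cut height, so chains such as 4-3-6 end up in one group
groups <- cutree(hclust(dist(qq), method = "single"), h = threshold)

# One averaged point per group; isolated points are kept unchanged
merged <- aggregate(qq, by = list(group = groups), FUN = mean)
merged
```

With a threshold of 80 this gives three points: the untouched points 2 and 5, plus the centroid of points 3, 4 and 6.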
Use this example data to see what I mean
tag <- as.character(c(1,2,3,4,5,6,7,8,9,10))
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
radius <- (size*40)
x <- c(9,4,25,14,28,19,9,22,10,2)
y <- c(36,7,15,16,22,24,39,20,34,9)
data <- data.frame(tag, species, size, radius, x, y)
# Plot the points using qplot (from package ggplot2, which the tidyverse loads)
library(ggplot2)
qplot(x, y, data = data) +
  geom_point(aes(colour = species, size = size))
Now that you can see the plot, what I want to do is for each individual “species A” point, I’d like to identify the largest point within a radius of size*40.
For example, in the bottom left of the plot you can see that species A (tag 2) would produce a radius large enough to contain the close species D point.
However, the species A point on the far right-hand-side of the plot (tag 3) would produce a radius large enough to contain both of the close species B and species C points, in which case I’d want some sort of output that identifies the largest individual within the species A radius.
I’d like to know what I can run (if anything) on this data set to find the largest “within radius” point for each species A point and get an output like this:
Species A point ---- Largest point within radius
Species A tag 1 ----- Species C tag 9
Species A tag 2 ----- Species D tag 10
Species A tag 3 ----- Species B tag 5
Species A tag 4 ----- Species C tag 8
I've used spatstat and CTFSpackage to make some plots in the past but I can't figure out how to "find largest neighbor within radius". Perhaps I can tackle this in ArcMAP? Also, this is just a small example dataset. Realistically I will be wanting to find the "largest neighbor within radius" for thousands of points.
Any help or feedback would be greatly appreciated.
The following finds, for every point (not only species A), the largest point of a different species that lies within its radius, and reports that point's species and tag.
all_df <- data # don't wanna have a variable called data
res_df <- data.frame()
for (j in 1 : nrow(all_df)) {
  # subset the data: keep only points of a different species
  df <- subset(all_df, species != all_df$species[j])
  # index of animals within radius (compare squared distances, no sqrt needed)
  ind <- which((df$x - all_df$x[j])^2 + (df$y - all_df$y[j])^2 < all_df$radius[j]^2)
  # nothing inside the radius: skip this point instead of calling max() on an empty set
  if (length(ind) == 0) next
  # find the max `size` in the subset df
  max_size <- max(df$size[ind])
  # all indices with max_size in df
  max_inds <- which(df$size[ind] == max_size)
  # pick the last one if there is more than one with max_size
  new_ind <- ind[max_inds[length(max_inds)]]
  # results in data.frame
  res_df <- rbind(res_df, data.frame(org_sp = all_df$species[j],
                                     org_tag = all_df$tag[j],
                                     res_sp = df$species[new_ind],
                                     res_tag = df$tag[new_ind]))
}
res_df
# org_sp org_tag res_sp res_tag
# 1 A 1 C 9
# 2 A 2 D 10
# 3 A 3 B 5
# 4 A 4 C 8
# 5 B 5 A 3
# 6 B 6 C 8
# 7 B 7 C 9
# 8 C 8 B 5
# 9 C 9 B 7
# 10 D 10 A 2
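For a larger data set the same search can be sketched without subsetting inside a loop, by computing the full distance matrix once. This is only a sketch under the assumption that an n x n matrix fits in memory (for many thousands of points a spatial index would be a better fit); candidate and best are names introduced here for illustration, and ties are broken by taking the last point, as in the loop above.

```r
# Example data from the question
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
data <- data.frame(tag = as.character(1:10), species, size,
                   radius = size * 40,
                   x = c(9,4,25,14,28,19,9,22,10,2),
                   y = c(36,7,15,16,22,24,39,20,34,9))

# Pairwise euclidean distances between all points
d <- as.matrix(dist(data[, c("x", "y")]))

# candidate[i, j] is TRUE when point j is inside the radius of point i
# and belongs to a different species (radius is recycled down the rows)
candidate <- d < data$radius & outer(species, species, "!=")

# For each point, index of the largest candidate (last one on ties)
best <- sapply(seq_len(nrow(data)), function(i) {
  idx <- which(candidate[i, ])
  if (length(idx) == 0) return(NA_integer_)
  top <- idx[data$size[idx] == max(data$size[idx])]
  top[length(top)]
})

res <- data.frame(org_sp = data$species, org_tag = data$tag,
                  res_sp = data$species[best], res_tag = data$tag[best])
res[res$org_sp == "A", ]
```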
I have a data frame with a group of x and y points. I need to calculate the euclidean distance of every point relative to every other point. Then I have to figure, for each row, how many are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that signals how many points (if we consider these points to be in a cartesian plane) are within a distance of 3 units from every other point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or other library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function (df) {
xp <- df$x
yp <- df$y
return(nrow(filter(df, dist( rbind(c(x,y), c(xp,yp)) ) < 3 )))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist". Subtract 1 so that a point does not count itself (its distance to itself is 0):
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3) - 1
This is a base approach with a straightforward application of a "distance function", but only on a row-by-row basis:
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2+(x[2]-df_[['y']])^2 <=9 )-1 )
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
I would suggest you work with pairs of points in the long format and then use a data.table solution, which is probably one of the fastest alternatives to work with large datasets
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df,df)
# rename columns
setDT(df_long )
setnames(df_long, c("x","y","x1","y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[ , mydist := dist ( matrix(c(x,x1,y,y1), ncol = 2, nrow = 2) ) , by=.(x,y,x1,y1)]
# count how many other points are within a distance of 3 units
# (subtract 1 because every point is also paired with itself at distance 0)
df_long[mydist <3 , .(count = .N - 1), by=.(x,y)]
#> x y count
#> 1: 1 2 1
#> 2: 2 2 1
#> 3: 9 9 0
I'm trying to write a function to compare the values of two colums (x and y) in every row of a dataframe. The function shall compare line by line if the values are identical, allowing a specified tolerance z for each pair of values.
identical() doesn't help because it doesn't allow small differences.
Nor can I use all.equal(), because its "tolerance" parameter relates to the mean difference across all rows, as the following example demonstrates.
> df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))
> df$diff_x_y <- df$x-df$y
> df
x y diff_x_y
1 1 2 -1
2 2 7 -5
3 3 3 0
4 4 4 0
5 5 5 0
> all.equal(df$x, df$y, scale=1,tolerance=4)
[1] TRUE
>
So this is what I've made up so far:
fun <- function (x, y, z)
{
diff <- abs(x-y) # compute difference for each row
tolerance <- ifelse(diff <= z, TRUE, FALSE) # test whether difference <= tolerance
return(summary(tolerance))
}
This works fine for the example dataframe from above:
> fun(df$x,df$y,1)
Mode FALSE TRUE NA's
logical 1 4 0
Now I want the function to give me some information about the existing differences. I imagine something like this:
difference frequency
1:10 4
11:100 30
101:1000 350
"difference" is supposed to define an adjustable values range of the differences and "frequency" shall display the number of rows with the corresponding difference.
Other suggestions for returning more detailed information about the differences are welcome. Note that my original dataframe contains about 2 million rows, some of which may differ significantly.
simplest way imho is to use cut:
df$diff.cat <- cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE)
the right = FALSE switch is making the intervals include the left (small) margin -
0 <= first interval < 1
1 <= second interval < 10 etc.
you can adjust the intervals of course
you can see the frequencies with
table(df$diff.cat)
so basically for:
df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))
table(cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE))
will give:
[0,1) [1,10) [10,100) [100,1e+03)
3 2 0 0
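To get the two-column "difference / frequency" layout from the question, the cut() call can be wrapped in a small helper. This is a minimal sketch; diff_table is a hypothetical name, and the trailing Inf break is an assumption added so that very large differences are not silently turned into NA.

```r
# Hypothetical helper: bin the absolute row-wise differences and count them
diff_table <- function(x, y, breaks = c(0, 1, 10, 100, 1000, Inf)) {
  # right = FALSE makes the intervals closed on the left: [0,1), [1,10), ...
  bins <- cut(abs(x - y), breaks = breaks, right = FALSE)
  # one row per interval, with the number of rows that fall into it
  as.data.frame(table(difference = bins), responseName = "frequency")
}

df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 7, 3, 4, 5))
res <- diff_table(df$x, df$y)
res
```

The breaks argument can be changed to whatever ranges are of interest; cut() and table() are vectorised, so this should remain usable on a couple of million rows.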
I have two data frames imported from txt files -- the sampling points and the station locations.
The sampling points data frame
X Y Z
346449.30 576369.65 86.93
346449.55 576368.24 87.16
346449.29 576368.17 79.08
346449.83 576366.86 88.23
346449.97 576365.42 84.97
346449.91 576362.22 86.59
346449.74 576363.65 88.87
346449.61 576363.59 84.99
346449.50 576363.54 81.33
The station locations data frame
Station x y
1 346479.720 576349.710
2 346575.380 576361.530
3 346685.540 576303.180
4 346722.820 576412.680
5 346514.780 576406.140
6 346813.130 576435.830
7 346748.880 576304.090
8 346825.830 576402.800
So I would like to know how to find and label the points (from the sampling data frame) that fall within a buffer zone (e.g. a 3-meter buffer radius around each of the stations from the second data frame).
This is what i would like to get:
X Y Z Station
346449.30 576369.65 86.93 1
346449.55 576368.24 87.16 1
346449.29 576368.17 79.08 1
346449.83 576366.86 88.23 2
346449.97 576365.42 84.97 2
346449.91 576362.22 86.59 3
346449.74 576363.65 88.87 4
346449.61 576363.59 84.99 5
346449.50 576363.54 81.33 5
346449.51 576365.07 89.38 5
346449.36 576365.01 84.93 5
346449.24 576366.46 88.70 5
346448.93 576367.83 86.75 5
I am new in R so any help appreciated. Thanks.
If you simply want to add the id of the nearest station within 3 meters of each sampling point to your sampling data.frame, one solution would be:
# get a matrix with the squares of the euclidean distances
# (one row per sampling point, one column per station)
mx <- outer(seq(nrow(sampleData)),
            seq(nrow(stations)),
            # return the square of the euclidean distance
            function(i,j){
              (sampleData[i,'X'] - stations[j,'x'])^2 +
              (sampleData[i,'Y'] - stations[j,'y'])^2
            })
# maximum distance to consider
d = 3
# get rid of distances greater than 3 meters away
mx[mx>d^2] <- NA
index <- apply(mx,
               1,
               # returns the number of the nearest row in `stations` that is less than 3 meters away
               function(x){
                 if(all(is.na(x)))
                   return(NA_integer_)
                 # which.min ignores the NA entries
                 which.min(x)
               })
sampleData$station <- stations$Station[index]
# a comma delimited list of stations with distance < 3
sampleData$closeStations <- apply(mx,
                                  1,
                                  # returns the ids of all stations that are less than 3 meters away
                                  function(x){
                                    if(all(is.na(x)))
                                      return(NA_character_)
                                    paste(stations$Station[!is.na(x)], collapse = ',')
                                  })
Using outer and apply may make the solution run faster, but if you're having trouble with it, it may be easier to debug using a for loop instead:
# maximum distance to consider
d = 3
distanceToNearestStation <- rep(NA_real_, nrow(sampleData))
nearestStation <- rep(NA_integer_, nrow(sampleData))
nearestStations <- rep(NA_character_, nrow(sampleData))
for(i in seq(nrow(sampleData))){
  # euclidean distances from this data point to the stations
  distances <- sqrt((sampleData[i,'X'] - stations[,'x'])^2 +
                    (sampleData[i,'Y'] - stations[,'y'])^2 )
  # something is wrong with this data point (e.g. missing coordinates)
  if(all(is.na(distances)))
    next
  # record the nearest station to this data point
  distanceToNearestStation[i] <- min(distances,na.rm=TRUE)
  nearestStation[i] <- which.min(distances)
  # comma delimited list of stations within 3 meters
  distanceIsClose <- distances < d
  distanceIsClose[is.na(distanceIsClose)] <- FALSE
  nearestStations[i] <- paste(stations$Station[distanceIsClose], collapse = ',')
}
range(distanceToNearestStation, na.rm = TRUE)
sampleData$station <- stations$Station[nearestStation]
# number of data points within 3 meters of a station
table(distanceToNearestStation <= d)
# data points within 3 meters of a station
# (named closePoints to avoid masking the subset() function)
closePoints <- sampleData[which(distanceToNearestStation <= d),]
# save to individual files.
for(s in unique(closePoints$station))
  write.csv(closePoints[closePoints$station == s,],
            file.path('My/Favorite/Directory' # note there is no trailing slash
                      ,paste('station',s,'data.csv')))
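To see the labelling step in isolation, here is a compact sketch on made-up coordinates. The toy values below are hypothetical (none of the sampling points in the question lie within 3 meters of a listed station), and the names samples, dist_mx and nearest are introduced only for this example.

```r
# Hypothetical toy data: two sampling points near a station, one far away
samples  <- data.frame(X = c(0, 10, 50), Y = c(0, 10, 50))
stations <- data.frame(Station = 1:2, x = c(1, 11), y = c(1, 9))
d <- 3  # buffer radius

# Distance from every sampling point (rows) to every station (columns)
dist_mx <- sqrt(outer(samples$X, stations$x, "-")^2 +
                outer(samples$Y, stations$y, "-")^2)

# Column index of the nearest station for each sampling point
nearest <- apply(dist_mx, 1, which.min)
nearest_dist <- dist_mx[cbind(seq_len(nrow(samples)), nearest)]

# Label the point with the station id only when it is inside the buffer
samples$Station <- ifelse(nearest_dist <= d, stations$Station[nearest], NA)
samples
```

Points outside every buffer get NA, so they can be dropped afterwards with samples[!is.na(samples$Station), ].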
With the help of people on this site I have a matrix y that looks similar to this (but much more simplified).
1,3
1,3
1,3
7,1
8,2
8,2
I have created a third column of random numbers (meant to be drawn without replacement for each of the repeating chunks) using this code: j=cbind(y,sample(1:99999,y[,2],replace=FALSE)).
Matrix j looks like this:
1,3,4520
1,3,7980
1,3,950
7,1,2
8,3,4520
8,3,7980
8,3,950
How do I obtain truly random numbers for my third column such that for each of the repeating rows i.e. 3,then 1, then 2 I get a random number that is not replicated within that repeating part (replace = FALSE)?
Why this happens:
The problem is that sample command structure is:
sample(vector of values, how many?, replace = FALSE or TRUE)
here, "how many?" is supposed to be ONE value. Since you provide the whole of the second column of y, it just picks the first value which is 3 and so it reads as:
set.seed(45) # just for reproducibility
sample(1:99999, 3, replace = F)
And for this seed, the values are:
# [1] 63337 31754 24092
And since there are only 3 values and you're binding them to your matrix with 6 rows, it "recycles" the values (meaning, it repeats the values in the same order). So, you get:
# [,1] [,2] [,3]
# [1,] 1 3 63337
# [2,] 1 3 31754
# [3,] 1 3 24092
# [4,] 7 1 63337
# [5,] 8 2 31754
# [6,] 8 2 24092
See that the values repeat. For the matrix you've shown, I've no idea how the 7,1,2 occurs, as the first value of y[,2] in your matrix is 3.
What you should do instead:
y <- cbind(y, sample(1:99999, nrow(y), replace = FALSE))
This asks sample to generate nrow(y) = 6 (here) values without replacement. This would generate 6 distinct values, which are then bound to your matrix y.
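The recycling and the fix can be checked directly. This is a small sketch using the simplified matrix from the question; the names short, recycled and fixed are only illustrative, and the actual random values depend on the R version and seed, so none are shown.

```r
y <- cbind(c(1, 1, 1, 7, 8, 8), c(3, 3, 3, 1, 2, 2))
set.seed(45) # just for reproducibility

# Only 3 values are drawn, so cbind() repeats them to fill the 6 rows
short <- sample(1:99999, 3, replace = FALSE)
recycled <- cbind(y, short)
all(recycled[1:3, 3] == recycled[4:6, 3])   # TRUE: rows 4-6 repeat rows 1-3

# Drawing nrow(y) values gives one distinct number per row
fixed <- cbind(y, sample(1:99999, nrow(y), replace = FALSE))
anyDuplicated(fixed[, 3]) == 0              # TRUE: no repeats
```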
This should get you what you want:
j <- cbind(y, unlist(sapply(unique(y[,2]), function(n) sample(1:99999, n))))
edit: There was an error in code. Function unique is of course needed.
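The one-liner above relies on every group having a different size, because of unique(). A possible variant that does not depend on that is to draw the numbers group by group with ave(). This sketch assumes that the first column of y identifies the repeating chunk, which is a guess based on the example matrix; numbers are distinct within a chunk but may still repeat across chunks.

```r
y <- cbind(c(1, 1, 1, 7, 8, 8), c(3, 3, 3, 1, 2, 2))
set.seed(1) # just for reproducibility

# For each chunk (rows sharing the same value in column 1), draw as many
# numbers as the chunk has rows, without replacement inside the chunk
third <- ave(y[, 1], y[, 1], FUN = function(g) sample(1:99999, length(g)))
j <- cbind(y, third)
j
```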
I can't get this without a loop. Maybe someone else can find a more elegant solution. For me the problem is to sample with repetition intra-group and without repetition inter-group
# assumes dat is a data.frame with columns V1 and V2, e.g. dat <- as.data.frame(y)
ll <- split(dat, paste(dat$V1,dat$V2,sep=''))
z <- rep(0,nrow(dat))
SET <- seq(1,100) ## we can change 100 by 99999 for example
v =1
for (i in seq_along(ll)){
  # remove the values already used by previous groups
  SET <- setdiff(SET, z)
  nn <- nrow(ll[[i]])
  z[v:(v+nn-1)] <- sample(SET,nn,replace=TRUE)
  v <- v+nn
}
z
[1] 35 77 94 100 23 59