I have two data frames imported from txt files -- the sampling points and the station locations.
The sampling points data frame
X Y Z
346449.30 576369.65 86.93
346449.55 576368.24 87.16
346449.29 576368.17 79.08
346449.83 576366.86 88.23
346449.97 576365.42 84.97
346449.91 576362.22 86.59
346449.74 576363.65 88.87
346449.61 576363.59 84.99
346449.50 576363.54 81.33
The station locations data frame
Station x y
1 346479.720 576349.710
2 346575.380 576361.530
3 346685.540 576303.180
4 346722.820 576412.680
5 346514.780 576406.140
6 346813.130 576435.830
7 346748.880 576304.090
8 346825.830 576402.800
So I would like to know how to find and label the points (from the sampling data frame) that fall within a buffer zone (e.g. a 3-meter buffer radius generated around each of the stations from the second data frame).
This is what I would like to get:
X Y Z Station
346449.30 576369.65 86.93 1
346449.55 576368.24 87.16 1
346449.29 576368.17 79.08 1
346449.83 576366.86 88.23 2
346449.97 576365.42 84.97 2
346449.91 576362.22 86.59 3
346449.74 576363.65 88.87 4
346449.61 576363.59 84.99 5
346449.50 576363.54 81.33 5
346449.51 576365.07 89.38 5
346449.36 576365.01 84.93 5
346449.24 576366.46 88.70 5
346448.93 576367.83 86.75 5
I am new to R, so any help is appreciated. Thanks.
If you simply want to add the ID of the nearest station within 3 meters of each sampling point to your sampling data.frame, one solution would be:
# get a matrix with the squares of the Euclidean distances
mx <- outer(seq(nrow(sampleData)),
            seq(nrow(stations)),
            # return the square of the Euclidean distance
            function(i,j){
              (sampleData[i,'X'] - stations[j,'x'])^2 +
                (sampleData[i,'Y'] - stations[j,'y'])^2
            })
# maximum distance to consider
d = 3
# get rid of distances greater than 3 meters away
mx[mx>d^2] <- NA
indx <- apply(mx,
              1,
              # returns the row number in `stations` of the nearest station
              # that is less than 3 meters away (NA if there is none)
              function(x){
                if(all(is.na(x)))
                  return(NA_integer_)
                # which.min ignores the NA entries
                which.min(x)
              })
sampleData$station <- stations$Station[indx]
# a comma delimited list of stations with distance < 3
sampleData$closeStations <- apply(mx,
                                  1,
                                  # returns the ids of all stations that are less than 3 meters away
                                  function(x){
                                    if(all(is.na(x)))
                                      return(NA_character_)
                                    paste(stations$Station[!is.na(x)], collapse = ',')
                                  })
Using outer and apply may make the solution run faster, but if you're having trouble with it, it may be easier to debug using a for loop instead:
# maximum distance to consider
d = 3
# pre-fill with NA so that skipped data points stay NA
distanceToNearestStation <-
  nearestStation <- rep(NA_real_, nrow(sampleData))
nearestStations <- rep(NA_character_, nrow(sampleData))
for(i in seq(nrow(sampleData))){
  # Euclidean distances from this data point to the stations
  distances <- sqrt((sampleData[i,'X'] - stations[,'x'])^2 +
                      (sampleData[i,'Y'] - stations[,'y'])^2 )
  # optionally get rid of distances greater than 3 meters away
  # distances[distances>d] <- NA
  # all the stations are too far away or something is wrong with this data point
  if(all(is.na(distances)))
    next
  # record the nearest station to this data point
  distanceToNearestStation[i] <- min(distances,na.rm=T)
  nearestStation[i] <- which.min(distances)
  # comma delimited list of stations within 3 meters
  distanceIsClose <- distances <= d
  distanceIsClose[is.na(distanceIsClose)] <- F
  nearestStations[i] <- paste(stations$Station[distanceIsClose], collapse = ',')
}
range(distanceToNearestStation, na.rm = T)
sampleData$station <- stations$Station[nearestStation]
# number of data points within 3 meters of a station
table(distanceToNearestStation <= 3)
# data points within 3 meters of a station
subset <- sampleData[distanceToNearestStation<= 3,]
# save to individual files.
for(s in unique(subset$station))
write.csv(subset[subset$station == s,],
file.path('My/Favorite/Directory'# note there is no trailing slash
,paste('station',s,'data.csv')))
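For a quick check of the idea on real numbers, here is a minimal vectorised sketch. It uses the station table from the question, but the first two sampling points are made up so that something falls inside a buffer (the third is the first point from the question, which lies roughly 36 m from the nearest station and therefore stays unlabelled). The names pts, d2, nearest and within are only illustrative.

```r
# Stations from the question
stations <- data.frame(
  Station = 1:8,
  x = c(346479.72, 346575.38, 346685.54, 346722.82,
        346514.78, 346813.13, 346748.88, 346825.83),
  y = c(576349.71, 576361.53, 576303.18, 576412.68,
        576406.14, 576435.83, 576304.09, 576402.80)
)

# Sampling points: the first two are hypothetical, the third is from the question
pts <- data.frame(
  X = c(346480.00, 346575.00, 346449.30),
  Y = c(576350.00, 576363.00, 576369.65),
  Z = c(86.93, 87.16, 79.08)
)

d <- 3  # buffer radius in meters

# Squared distance from every point (rows) to every station (columns)
d2 <- outer(pts$X, stations$x, "-")^2 + outer(pts$Y, stations$y, "-")^2

# Column index of the nearest station for each point
nearest <- apply(d2, 1, which.min)

# Keep the label only when the nearest station is inside the buffer
within <- d2[cbind(seq_len(nrow(pts)), nearest)] <= d^2
pts$Station <- ifelse(within, stations$Station[nearest], NA)
pts
```

Points outside every buffer get NA, so `pts[!is.na(pts$Station), ]` keeps only the labelled ones.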
I currently have a data frame storing separate x,y,z coordinates from an accelerometer sensor (with timestamps), but want to perform vector operations on it.
Test data (actually have thousands of rows, and a timestamp row to be preserved)
x <- c(1,3,1,0,3)
y <- c(2,4,8,8,9)
z <- c(0,1,1,2,0)
df <- data.frame(x,y,z)
proj <- function(a,b) {
as.double((a %*% b) / (b %*% b)) * b
}
v = c(1,2,3)
I want to mutate (or create a new data frame?) df by applying proj(_,v) to each row.
I have tried something along the lines of mutate(projected = proj(c(x,y,z), v)), but it doesn't work; I am probably misusing it.
What is the best way to achieve this? Should I instead be using a list of vectors to store the coordinates?
Your proj(a,b) function takes only two inputs, but in your example it looks like you wanted to provide three, proj(c(x,y,z),v); or did I misunderstand?
However, this would work:
dplyr::mutate(df, projected = proj(x,y)), resulting in
x y z projected
1 1 2 0 0.4279476
2 3 4 1 0.8558952
3 1 8 1 1.7117904
4 0 8 2 1.7117904
5 3 9 0 1.9257642
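If the intent was instead to project each row's (x, y, z) vector onto v, a minimal base R sketch could look like the following. It reuses proj, df and v from the question; the output column names px, py and pz are made up for illustration.

```r
# Data and projection function from the question
df <- data.frame(x = c(1, 3, 1, 0, 3),
                 y = c(2, 4, 8, 8, 9),
                 z = c(0, 1, 1, 2, 0))
proj <- function(a, b) {
  as.double((a %*% b) / (b %*% b)) * b
}
v <- c(1, 2, 3)

# apply() hands each row to proj() as a numeric vector and returns one
# column per row, so transpose to get one row per observation
projected <- t(apply(df[, c("x", "y", "z")], 1, proj, b = v))
colnames(projected) <- c("px", "py", "pz")  # hypothetical column names

# Other columns of df (e.g. timestamps) are carried along unchanged
df_proj <- cbind(df, projected)
df_proj
```

Because only the x, y and z columns are passed to apply, a timestamp column would not get coerced into the numeric matrix.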
I have transect data with latitude, longitude and substrate type. Below I provide a script that creates hypothetical data with 3 substrate types along a straight transect starting at longitude -24.5 and ending at -23.2. Within this transect there are 3 substrate types (a, b and c), but substrate type "a" occurs 4 times and substrate type "b" twice. I would like to calculate the total length (in meters) of each "a", "b" and "c" substrate segment in the transect. As an example, the first substrate segment "a" ends at the position of the first observation of substrate type "b", and the substrate segment "c" ends where the fourth "a" segment starts. I have looked into the sp and Rdistance packages but I'm really stuck. Thanks in advance.
Hypothetical example: each box denotes a segment whose length I would like to calculate.
Alon<-c(-23.20, -23.30, -23.40, -24.10, -24.15, -23.95, -23.70, -23.60, -24.20, -24.25)
Blon<-c(-23.80, -23.85, -24.00, -24.03, -24.06)
Clon<-c(-23.47, -23.50,-23.55)
Alat<-c(64,64,64,64,64, 64, 64, 64,64, 64)
Blat<-c(64,64, 64, 64,64)
Clat<-c(64,64, 64)
A<-as.data.frame(cbind(Alon, Alat))
B<-as.data.frame(cbind(Blon, Blat))
C<-as.data.frame(cbind(Clon, Clat))
plot(A$Alon, A$Alat, pch=97)
points(B$Blon, B$Blat, col="red", pch=98)
points(C$Clon, C$Clat, col="blue", pch=99)
A$ID<-seq.int(nrow(A))
A[,3]<-"A"
B$ID<-seq.int(nrow(B))
B[,3]<-"B"
C$ID<-seq.int(nrow(C))
C[,3]<-"C"
colnames(A)<-c("lon", "lat", "ID")
colnames(B)<-c("lon", "lat", "ID")
colnames(C)<-c("lon", "lat", "ID")
A<-as.data.frame(A)
B<-as.data.frame(B)
C<-as.data.frame(C)
pos<- rbind(A,B,C)
pos<-pos[,c("ID","lon","lat")]
I suspect the length in metres depends on your projection, so here I calculate the length in degrees and will leave the conversion up to you. First, I order by longitude (as your latitudes are all the same).
# Order data frame
pos <- pos[order(pos$lon),]
Next, I use rle to pull out runs of each ID. I add 1 to start the first run on the first element and use pmin to make sure the final index isn't greater than the number of rows in the data frame.
# Pull out start and end points of segments
df_seg <- pos[pmin(nrow(pos), c(1, cumsum(rle(pos$ID)$lengths) + 1)),]
Finally, I use diff to calculate the difference between the start and end longitudes of each run.
# Calculate difference in longitude
data.frame(ID = df_seg$ID[1:(nrow(df_seg)-1)], diff_lon = abs(diff(df_seg$lon)))
# Check data frame
# ID diff_lon
# 1 A 0.19
# 2 B 0.11
# 3 A 0.10
# 4 B 0.15
# 5 A 0.15
# 6 C 0.15
# 7 A 0.20
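For the conversion to metres, one option is the haversine great-circle formula. The sketch below assumes a spherical Earth with a mean radius of 6,371,000 m, so the results are approximate; the function name haversine_m is made up here, and the end points are the segment boundaries picked out in df_seg (all at latitude 64).

```r
# Great-circle distance in metres between points given in decimal degrees.
# Assumes a spherical Earth of radius r, so results are approximate.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Segment boundaries along the transect (longitudes as in df_seg)
seg_lon <- c(-24.25, -24.06, -23.95, -23.85, -23.70, -23.55, -23.40, -23.20)
seg_lat <- rep(64, length(seg_lon))
n <- length(seg_lon)

# Length of each segment: distance from its start to the start of the next
seg_len_m <- haversine_m(seg_lon[-n], seg_lat[-n], seg_lon[-1], seg_lat[-1])
data.frame(ID = c("A", "B", "A", "B", "A", "C", "A"),
           length_m = round(seg_len_m))
```

Summing length_m by ID would then give the total per substrate type. Packages such as geosphere provide comparable distance functions if an ellipsoidal model is needed.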
Regarding ordering stations
I wish I had a good solution to this, but I don't. So, I'll apologise before I do some terrible things...
library(dplyr)
library(RANN)
# Temporary data frame
df_stations <- pos
# Function for finding order of stations
station_order <- function(){
# If only one row, return it (i.e., it's the final station)
if(nrow(df_stations) == 1)return(df_stations)
# Find the nearest neighbour for the first station
r <- nn2(data = df_stations %>% select(lon, lat), k = 2)$nn.idx[1,2]
# Bump the nearest neighbour to first in the data frame
# This also deletes the first entry
df_stations[1, ] <<- df_stations[r, ]
# Drop the nearest neighbour elsewhere in the data frame
df_stations <<- df_stations %>% distinct
# Return the nearest neighbour
return(df_stations[1, ])
}
# Initialise data frame
res <- df_stations[1,]
# Loop over data frame
for(i in 2:nrow(df_stations))res[i, ] <- station_order()
This code orders your stations using nearest neighbour (i.e., nn2 from RANN). You'll notice that the transect is inverted, but you can always change it with res[nrow(res):1, ].
# ID lon lat
# 1 A -23.20 64
# 2 A -23.30 64
# 3 A -23.40 64
# 4 C -23.47 64
# 5 C -23.50 64
# 6 C -23.55 64
# 7 A -23.60 64
# 8 A -23.70 64
# 9 B -23.80 64
# 10 B -23.85 64
# 11 A -23.95 64
# 12 B -24.00 64
# 13 B -24.03 64
# 14 B -24.06 64
# 15 A -24.10 64
# 16 A -24.15 64
# 17 A -24.20 64
# 18 A -24.25 64
Use this example data to see what I mean
tag <- as.character(c(1,2,3,4,5,6,7,8,9,10))
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
radius <- (size*40)
x <- c(9,4,25,14,28,19,9,22,10,2)
y <- c(36,7,15,16,22,24,39,20,34,9)
data <- data.frame(tag, species, size, radius, x, y)
# Plot the points using qplot (from ggplot2, which is part of the tidyverse)
library(ggplot2)
qplot(x, y, data = data) +
  geom_point(aes(colour = species, size = size))
Now that you can see the plot, what I want to do is for each individual “species A” point, I’d like to identify the largest point within a radius of size*40.
For example, in the bottom left of the plot you can see that species A (tag 2) would produce a radius large enough to contain the close species D point.
However, the species A point on the far right-hand-side of the plot (tag 3) would produce a radius large enough to contain both of the close species B and species C points, in which case I’d want some sort of output that identifies the largest individual within the species A radius.
I’d like to know what I can run (if anything) on this data set to find the largest “within radius” point for each species A point and get an output like this:
Species A point ---- Largest point within radius
Species A tag 1 ----- Species C tag 9
Species A tag 2 ----- Species D tag 10
Species A tag 3 ----- Species B tag 5
Species A tag 4 ----- Species C tag 8
I've used spatstat and the CTFS package to make some plots in the past but I can't figure out how to "find largest neighbor within radius". Perhaps I can tackle this in ArcMap? Also, this is just a small example dataset. Realistically I will want to find the "largest neighbor within radius" for thousands of points.
Any help or feedback would be greatly appreciated.
The following finds, for each point, the largest individual of a different species (as a species and tag pair) that lies within that point's radius.
all_df <- data # don't wanna have a variable called data
res_df <- data.frame()
for (j in 1 : nrow(all_df)) {
  # subset the data: keep only individuals of other species
  df <- subset(all_df, species != all_df$species[j])
  # index of animals within radius
  ind <- which ((df$x - all_df$x[j])^2 + (df$y - all_df$y[j])^2 < all_df$radius[j]^2 )
  # nothing within radius: skip this point
  if (length(ind) == 0) next
  # find the max `size` in the subset df
  max_size <- max(df$size[ind])
  # all indices with max_size in df
  max_inds <- which(df$size[ind] == max_size)
  # pick the last one if there is more than one max_size
  new_ind <- ind[max_inds[length(max_inds)]]
  # results in data.frame
  res_df <- rbind(res_df, data.frame(org_sp = all_df$species[j],
                                     org_tag = all_df$tag[j],
                                     res_sp = df$species[new_ind],
                                     res_tag = df$tag[new_ind]))
}
res_df
# org_sp org_tag res_sp res_tag
# 1 A 1 C 9
# 2 A 2 D 10
# 3 A 3 B 5
# 4 A 4 C 8
# 5 B 5 A 3
# 6 B 6 C 8
# 7 B 7 C 9
# 8 C 8 B 5
# 9 C 9 B 7
# 10 D 10 A 2
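If only the species A points are of interest, as in the question, the same idea can be sketched without growing a data frame inside a loop. This is a minimal version built on a full squared-distance matrix; the names pts, d2, focal and best are made up, and ties are broken by taking the last candidate so that the result agrees with the loop above. For many thousands of points the n-by-n matrix may become too large, in which case a spatial index would be a better fit.

```r
# Example data from the question
pts <- data.frame(
  tag     = as.character(1:10),
  species = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  size    = c(0.10, 0.20, 0.25, 0.30, 0.30, 0.15, 0.15, 0.20, 0.15, 0.15),
  x       = c(9, 4, 25, 14, 28, 19, 9, 22, 10, 2),
  y       = c(36, 7, 15, 16, 22, 24, 39, 20, 34, 9),
  stringsAsFactors = FALSE
)
pts$radius <- pts$size * 40

# Squared distances between all pairs of points
d2 <- outer(pts$x, pts$x, "-")^2 + outer(pts$y, pts$y, "-")^2

# Rows of the focal species
focal <- which(pts$species == "A")

# For each focal point: row index of the largest other-species neighbour
best <- sapply(focal, function(j) {
  cand <- which(d2[j, ] < pts$radius[j]^2 & pts$species != pts$species[j])
  if (length(cand) == 0) return(NA_integer_)
  # last candidate among those sharing the maximum size
  cand[max(which(pts$size[cand] == max(pts$size[cand])))]
})

res_A <- data.frame(org_tag = pts$tag[focal],
                    res_sp  = pts$species[best],
                    res_tag = pts$tag[best],
                    stringsAsFactors = FALSE)
res_A
```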
I have xy coordinates of points and I want to use the distances between them to average nearby points. My data is named qq and I obtain the distance matrix using the dist function.
qq
X Y
2 4237.5 4411.5
3 4326.5 4444.5
4 4382.0 4418.0
5 4204.0 4487.5
6 4338.5 4515.0
mydist = as.matrix(dist(qq))
2 3 4 5 6
2 0.00000 94.92102 144.64612 83.0557 144.61414
3 94.92102 0.00000 61.50203 129.8278 71.51398
4 144.64612 61.50203 0.00000 191.0870 106.30734
5 83.05570 129.82777 191.08702 0.0000 137.28256
6 144.61414 71.51398 106.30734 137.2826 0.00000
What I want to do is to average points that are closer than a certain threshold; for this example we could use 80. The only pairwise distances that fall below that limit are 3-4 and 3-6. The question is how to go back to the original matrix and average the xy coordinates to make the 3-4 pair one point and the 3-6 pair another one (discarding the former points 3, 4 and 6).
here's the dput of my data.frame
dput(qq)
structure(list(X = c(4237.5, 4326.5, 4382, 4204, 4338.5), Y = c(4411.5,
4444.5, 4418, 4487.5, 4515)), .Names = c("X", "Y"), row.names = 2:6, class = "data.frame")
UPDATE
Using some of the provided code with modifications, I get the 2 points I need to put in place of the 3-4 and 3-6 pairs. This means points 3, 4 and 6 will have to disappear from qq and these two points should be appended to it.
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))
X Y
3 4354.25 4431.25
3 4332.50 4479.75
I think this should do it for you, if I understand the problem correctly.
thresh <- 80
pairs <- which(as.matrix(dist(qq)) < thresh & upper.tri(as.matrix(dist(qq))), arr.ind = T)
result <- apply(pairs,1,function(i) apply(qq[i,],2,mean))
#optionally, I think this is the form you will want it in.
result <- data.frame(t(result))
This gives a data frame with the same structure as qq, containing the averages of the pairs of points that are closer to each other than thresh.
UPDATE
qq <- qq[-unique(c(pairs)),]
qq <- rbind(qq,result)
Ok so I was able to merge strategies and solve the issue but not in a fancy way
# Search pairs less than threshold
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs[,1],pairs[,2]))
# Get result dataframe
out = data.frame(rbind(qq[-indx,],t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))),row.names=NULL)
dim(out)
[1] 4 2
out
X Y
1 4237.50 4411.50
2 4204.00 4487.50
3 4354.25 4431.25
4 4332.50 4479.75
The row.names get dropped because they mean nothing now that I've removed original points and added new ones. I'm still open to better ways to do it and to check everything is done correctly.
UPDATE
I made a function that could be more useful than doing things step-wise, and it lets you play with the threshold.
distance_fix = function(dataframe,threshold){
mydist = as.matrix(dist(dataframe))
# Which pairs in the upper triangle are below threshold
pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs))
# Get result dataframe
out = data.frame(rbind(dataframe[-indx,],t(apply(pairs,1,function(i) apply(dataframe[i,],2,mean)))),row.names=NULL)
return(out)
}
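A quick way to check the function is to run it on the qq data from the dput output. The function is repeated in the sketch below so that it runs on its own, with one addition that is not in the original: a guard for the case where no pair is below the threshold, since negative indexing with an empty index vector would otherwise drop every row.

```r
# Data from the question
qq <- data.frame(X = c(4237.5, 4326.5, 4382, 4204, 4338.5),
                 Y = c(4411.5, 4444.5, 4418, 4487.5, 4515),
                 row.names = 2:6)

distance_fix <- function(dataframe, threshold) {
  mydist <- as.matrix(dist(dataframe))
  # Which pairs in the upper triangle are below threshold
  pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = TRUE)
  # Added guard: no close pairs, so return the points unchanged
  if (nrow(pairs) == 0) return(data.frame(dataframe, row.names = NULL))
  # Row numbers of every point that takes part in a close pair
  indx <- unique(c(pairs))
  # One averaged point per close pair
  averaged <- t(apply(pairs, 1, function(i) colMeans(dataframe[i, ])))
  data.frame(rbind(dataframe[-indx, ], averaged), row.names = NULL)
}

out <- distance_fix(qq, 80)   # points 3, 4 and 6 are replaced by two averages
out
unchanged <- distance_fix(qq, 10)   # no pair is closer than 10
```

Note that a point belonging to several close pairs (point 3 here) contributes to one averaged point per pair.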
I am new to R and really unsure how to filter data in a data frame.
I have created a data frame with two columns including monthly date and corresponding temperature. It has a length of 324.
> head(Nino3.4_1974_2000)
Month_common Nino3.4_degree_1974_2000_plain
1 1974-01-15 -1.93025
2 1974-02-15 -1.73535
3 1974-03-15 -1.20040
4 1974-04-15 -1.00390
5 1974-05-15 -0.62550
6 1974-06-15 -0.36915
The filter rule is to select the temperatures that are greater than or equal to 0.5 degrees. Also, they have to stay that way for at least 5 consecutive months.
I have eliminated the data with temperatures below 0.5 degrees (see below).
for (i in 1) {
el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),]
}
> head(el_nino)
Month_common Nino3.4_degree_1974_2000_plain
32 1976-08-15 0.5192000
33 1976-09-15 0.8740000
34 1976-10-15 0.8864501
35 1976-11-15 0.8229501
36 1976-12-15 0.7336500
37 1977-01-15 0.9276500
However, I still need to extract the runs of at least 5 consecutive months. I hope someone can help me out.
If you can always rely on the spacing being one month, then let's temporarily discard the time information:
temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
So, since every temperature in that vector is always separated by one month, we just have to look for runs where the temps[i]>=0.5, and the run has to be at least 5 long.
If we do the following:
ofinterest <- temps >= 0.5
we'll have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE .... etc where it's TRUE when temps[i] was >= 0.5 and FALSE otherwise.
To rephrase your problem then, we just need to look for occurrences of at least five TRUE in a row.
To do this we can use the function rle. ?rle gives:
> ?rle
Description
Compute the lengths and values of runs of equal values in a vector
- or the reverse operation.
Value:
‘rle()’ returns an object of class ‘"rle"’ which is a list with
components:
lengths: an integer vector containing the length of each run.
values: a vector of the same length as ‘lengths’ with the
corresponding values.
So we use rle which counts up all the streaks of consecutive TRUE in a row and consecutive FALSE in a row, and look for at least 5 TRUE in a row.
I'll just make up some data to demonstrate:
# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000)
# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5
# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest)
# we need to find points where runs$lengths >= 5 (ie at least 5 in a row),
# AND runs$values is TRUE (so at least 5 'TRUE's in a row).
streakIs <- which(runs$lengths>=5 & runs$values)
# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1;
# that gives the index into `temps`.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
# startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1)
# }
#
# However, since this is R we can vectorise with sapply.
# (seq_len(n-1) is empty when n is 1, so a streak in the very first run starts at 1;
# 1:(n-1) would wrongly give c(1, 0) in that case.)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Now if you do Nino3.4_1974_2000$Month_common[startMonths] you'll get all the months in which the El Nino started.
It boils down to just a few lines:
runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5)
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Nino3.4_1974_2000$Month_common[startMonths]
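If the goal is to pull out every row that belongs to a qualifying streak, rather than only the start months, one option is to switch off the short runs and expand the run-length encoding back into a row-wise mask with inverse.rle. A minimal sketch on made-up data (the data frame nino and its columns month and temp are hypothetical stand-ins for Nino3.4_1974_2000 and its columns):

```r
# Made-up monthly data: a run of 5, a run of 2 and a run of 6 warm months
nino <- data.frame(
  month = seq(as.Date("1990-01-15"), by = "month", length.out = 15),
  temp  = c(0.6, 0.7, 0.8, 0.9, 1.0, 0.2, 0.6, 0.7, 0.1,
            0.5, 0.5, 0.6, 0.7, 0.8, 0.9)
)

# Runs of months at or above 0.5 degrees
runs <- rle(nino$temp >= 0.5)

# Keep a run only if it is TRUE and at least 5 months long
runs$values <- runs$values & runs$lengths >= 5

# Expand back to one TRUE/FALSE per row and subset
keep <- inverse.rle(runs)
el_nino <- nino[keep, ]
el_nino
```

Here the 2-month warm spell is dropped, while both longer streaks are kept in full.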
Here's one way, using the fact that the rows are regularly spaced, always one month apart. Then the problem reduces to finding 5 consecutive rows with temps >= 0.5 degrees:
# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d
# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)
# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]
# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 1 6
#[2,] 13 17
UPDATE I had a bug that detected runs with 5 values less than 0.5 too. Here's the updated code (and test data):
d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 14 18