I have transect data with latitude, longitude and substrate type. Below I provide a script that creates a hypothetical data set with 3 substrate types along a straight transect starting at longitude -24.5 and ending at -23.2. Within this transect there are 3 substrate types (a, b and c), but substrate type "a" occurs 4 times and substrate type "b" twice. I would like to calculate the total length (in meters) of each "a", "b" and "c" substrate segment in the transect. As an example, the substrate segment "a" ends at the position of the first observation of substrate type "b", and the substrate segment "c" ends where the fourth "a" segment starts. I have looked into the sp and Rdistance packages but I'm really stuck. With thanks in advance.
Hypothetical example: each box denotes a segment whose length I would like to calculate.
Alon<-c(-23.20, -23.30, -23.40, -24.10, -24.15, -23.95, -23.70, -23.60, -24.20, -24.25)
Blon<-c(-23.80, -23.85, -24.00, -24.03, -24.06)
Clon<-c(-23.47, -23.50,-23.55)
Alat<-c(64,64,64,64,64, 64, 64, 64,64, 64)
Blat<-c(64,64, 64, 64,64)
Clat<-c(64,64, 64)
A<-as.data.frame(cbind(Alon, Alat))
B<-as.data.frame(cbind(Blon, Blat))
C<-as.data.frame(cbind(Clon, Clat))
plot(A$Alon, A$Alat, pch=97)
points(B$Blon, B$Blat, col="red", pch=98)
points(C$Clon, C$Clat, col="blue", pch=99)
A$ID<-seq.int(nrow(A))
A[,3]<-"A"
B$ID<-seq.int(nrow(B))
B[,3]<-"B"
C$ID<-seq.int(nrow(C))
C[,3]<-"C"
colnames(A)<-c("lon", "lat", "ID")
colnames(B)<-c("lon", "lat", "ID")
colnames(C)<-c("lon", "lat", "ID")
A<-as.data.frame(A)
B<-as.data.frame(B)
C<-as.data.frame(C)
pos<- rbind(A,B,C)
pos<-pos[,c("ID","lon","lat")]
I suspect the length in metres depends on your projection, so here I calculate the length in degrees and will leave the conversion up to you. First, I order by longitude (as your latitudes are all the same).
# Order data frame
pos <- pos[order(pos$lon),]
Next, I use rle to pull out runs of each ID. I add 1 to the cumulative run lengths so that each index points at the first element of the next run, prepend 1 for the start of the first run, and use pmin to make sure the final index isn't greater than the number of rows in the data frame.
# Pull out start and end points of segments
df_seg <- pos[pmin(nrow(pos), c(1, cumsum(rle(pos$ID)$lengths) + 1)),]
Finally, I use diff to calculate the difference between the start and end longitudes of each run.
# Calculate difference in longitude
data.frame(ID = df_seg$ID[1:(nrow(df_seg)-1)], diff_lon = abs(diff(df_seg$lon)))
# Check data frame
# ID diff_lon
# 1 A 0.19
# 2 B 0.11
# 3 A 0.10
# 4 B 0.15
# 5 A 0.15
# 6 C 0.15
# 7 A 0.20
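To get from degrees to metres without committing to a projection, one option is a great-circle (haversine) distance on a sphere. The sketch below assumes a spherical Earth of radius 6371 km, which is only an approximation; the function name haversine_m and the small data frame seg are illustrative and not part of the answer above. The segment end points are those of the first two runs in the sorted example data.

```r
# Great-circle distance in metres between lon/lat points given in degrees,
# assuming a spherical Earth of radius 6371 km (an approximation).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Illustrative segment end points at latitude 64
seg <- data.frame(ID = c("A", "B"),
                  lon_start = c(-24.25, -24.06),
                  lon_end   = c(-24.06, -23.95),
                  lat = 64,
                  stringsAsFactors = FALSE)
seg$length_m <- haversine_m(seg$lon_start, seg$lat, seg$lon_end, seg$lat)
seg
```

At latitude 64 a degree of longitude is a little under 49 km, so the two segments come out at roughly 9.3 km and 5.4 km. For survey-grade lengths a proper projected coordinate system or an ellipsoidal formula would be the safer choice.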
Regarding ordering stations
I wish I had a good solution to this, but I don't. So, I'll apologise before I do some terrible things...
library(dplyr)
library(RANN)
# Temporary data frame
df_stations <- pos
# Function for finding order of stations
station_order <- function(){
# If only one row, return it (i.e., it's the final station)
if(nrow(df_stations) == 1)return(df_stations)
# Find the nearest neighbour for the first station
r <- nn2(data = df_stations %>% select(lon, lat), k = 2)$nn.idx[1,2]
# Bump the nearest neighbour to first in the data frame
# This also deletes the first entry
df_stations[1, ] <<- df_stations[r, ]
# Drop the nearest neighbour elsewhere in the data frame
df_stations <<- df_stations %>% distinct
# Return the nearest neighbour
return(df_stations[1, ])
}
# Initialise data frame
res <- df_stations[1,]
# Loop over data frame
for(i in 2:nrow(df_stations))res[i, ] <- station_order()
This code orders your stations using nearest neighbour (i.e., nn2 from RANN). You'll notice that the transect is inverted, but you can always change it with res[nrow(res):1, ].
# ID lon lat
# 1 A -23.20 64
# 2 A -23.30 64
# 3 A -23.40 64
# 4 C -23.47 64
# 5 C -23.50 64
# 6 C -23.55 64
# 7 A -23.60 64
# 8 A -23.70 64
# 9 B -23.80 64
# 10 B -23.85 64
# 11 A -23.95 64
# 12 B -24.00 64
# 13 B -24.03 64
# 14 B -24.06 64
# 15 A -24.10 64
# 16 A -24.15 64
# 17 A -24.20 64
# 18 A -24.25 64
Use this example data to see what I mean
tag <- as.character(c(1,2,3,4,5,6,7,8,9,10))
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
radius <- (size*40)
x <- c(9,4,25,14,28,19,9,22,10,2)
y <- c(36,7,15,16,22,24,39,20,34,9)
data <- data.frame(tag, species, size, radius, x, y)
# Plot the points using qplot (from ggplot2, which is part of the tidyverse)
library(ggplot2)
qplot(x, y, data = data) +
  geom_point(aes(colour = species, size = size))
Now that you can see the plot, what I want to do is for each individual “species A” point, I’d like to identify the largest point within a radius of size*40.
For example, in the bottom left of the plot you can see that species A (tag 2) would produce a radius large enough to contain the close species D point.
However, the species A point on the far right-hand-side of the plot (tag 3) would produce a radius large enough to contain both of the close species B and species C points, in which case I’d want some sort of output that identifies the largest individual within the species A radius.
I’d like to know what I can run (if anything) on this data set to find the largest “within radius” point for each species A point and get an output like this:
Species A point ---- Largest point within radius
Species A tag 1 ----- Species C tag 9
Species A tag 2 ----- Species D tag 10
Species A tag 3 ----- Species B tag 5
Species A tag 4 ----- Species C tag 8
I've used spatstat and the CTFS package to make some plots in the past, but I can't figure out how to "find largest neighbor within radius". Perhaps I can tackle this in ArcMap? Also, this is just a small example dataset. Realistically I will want to find the "largest neighbor within radius" for thousands of points.
Any help or feedback would be greatly appreciated.
The following finds the largest species and tag pair within the given radius for every point (not only for species A).
all_df <- data # don't wanna have a variable called data
res_df <- data.frame()
for (j in 1 : nrow(all_df)) {
# subset the data
df <- subset(all_df, species != species[j])
# index of animals within radius
ind <- which((df$x - all_df$x[j])^2 + (df$y - all_df$y[j])^2 < all_df$radius[j]^2)
# find the max `size` in the subset df
max_size <- max(df$size[ind])
# all indices with max_size in df
max_inds <- which(df$size[ind] == max_size)
# pick the last one if there is more than one max_size
new_ind <- ind[max_inds[length(max_inds)]]
# results in data.frame
res_df <- rbind(res_df, data.frame(org_sp = all_df$species[j],
org_tag = all_df$tag[j],
res_sp = df$species[new_ind],
res_tag = df$tag[new_ind]))
}
res_df
# org_sp org_tag res_sp res_tag
# 1 A 1 C 9
# 2 A 2 D 10
# 3 A 3 B 5
# 4 A 4 C 8
# 5 B 5 A 3
# 6 B 6 C 8
# 7 B 7 C 9
# 8 C 8 B 5
# 9 C 9 B 7
# 10 D 10 A 2
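If only the species A points are of interest, the same idea can be wrapped in a small function that also copes with a point that has nothing inside its radius. This is a sketch, not a drop-in replacement: the helper name largest_neighbour is invented here, ties are resolved by taking the last candidate (as in the loop above), and the example data are rebuilt so that the block runs on its own.

```r
# Example data from the question
data <- data.frame(
  tag = as.character(1:10),
  species = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  size = c(0.10, 0.20, 0.25, 0.30, 0.30, 0.15, 0.15, 0.20, 0.15, 0.15),
  x = c(9, 4, 25, 14, 28, 19, 9, 22, 10, 2),
  y = c(36, 7, 15, 16, 22, 24, 39, 20, 34, 9),
  stringsAsFactors = FALSE
)
data$radius <- data$size * 40

focal <- data[data$species == "A", ]

# For the i-th focal point, return the largest point of another species
# that lies strictly inside the focal point's radius (NA if there is none).
largest_neighbour <- function(i) {
  others <- data[data$species != focal$species[i], ]
  d2 <- (others$x - focal$x[i])^2 + (others$y - focal$y[i])^2
  inside <- others[d2 < focal$radius[i]^2, ]
  if (nrow(inside) == 0) {
    return(data.frame(res_sp = NA_character_, res_tag = NA_character_,
                      stringsAsFactors = FALSE))
  }
  # last index among the candidates that share the maximum size
  best <- inside[max(which(inside$size == max(inside$size))), ]
  data.frame(res_sp = best$species, res_tag = best$tag,
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(seq_len(nrow(focal)), largest_neighbour))
res <- cbind(org_tag = focal$tag, res, stringsAsFactors = FALSE)
res
```

For thousands of points, the all-against-all comparison grows quadratically, so a spatial index (for example a fixed-radius neighbour search) may be worth considering.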
I have xy coordinates of points and I want to use distance for averaging points. My data is named qq and I obtain the distance matrix using the dist function.
qq
X Y
2 4237.5 4411.5
3 4326.5 4444.5
4 4382.0 4418.0
5 4204.0 4487.5
6 4338.5 4515.0
mydist = as.matrix(dist(qq))
2 3 4 5 6
2 0.00000 94.92102 144.64612 83.0557 144.61414
3 94.92102 0.00000 61.50203 129.8278 71.51398
4 144.64612 61.50203 0.00000 191.0870 106.30734
5 83.05570 129.82777 191.08702 0.0000 137.28256
6 144.61414 71.51398 106.30734 137.2826 0.00000
What I want to do is to average points that are closer than a certain threshold; for this example we could use 80. The only pairwise distances that fall below that limit are 3-4 and 3-6. The question is how to go back to the original matrix and average the xy coordinates to make the 3-4 pair one point and the 3-6 pair another one (discarding the former points 3, 4 and 6).
here's the dput of my data.frame
dput(qq)
structure(list(X = c(4237.5, 4326.5, 4382, 4204, 4338.5), Y = c(4411.5,
4444.5, 4418, 4487.5, 4515)), .Names = c("X", "Y"), row.names = 2:6, class = "data.frame")
UPDATE
Using some of the provided code, with modifications, I get the 2 points I need to put in place of the 3-4 and 3-6 pairs. This means points 3, 4 and 6 will have to disappear from qq and these two points should be appended to it.
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))
X Y
3 4354.25 4431.25
3 4332.50 4479.75
I think this should do it for you, if I understand the problem correctly.
# y is assumed to be the distance object, e.g. y <- dist(qq)
pairs <- which(as.matrix(y) > 140 & upper.tri(as.matrix(y)), arr.ind = T)
result <- apply(pairs,1,function(i) apply(qq[i,],2,mean))
#optionally, I think this is the form you will want it in.
result <- data.frame(t(result))
The result has a structure similar to qq and contains the averages of points which are "far" away from each other, as determined by the threshold (140 above).
UPDATE
qq <- qq[-unique(c(pairs)),]
qq <- rbind(qq,result)
Ok so I was able to merge strategies and solve the issue but not in a fancy way
# Search pairs less than threshold
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs[,1],pairs[,2]))
# Get result dataframe
out = data.frame(rbind(qq[-indx,],t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))),row.names=NULL)
dim(out)
[1] 4 2
out
X Y
1 4237.50 4411.50
2 4204.00 4487.50
3 4354.25 4431.25
4 4332.50 4479.75
The row.names get dropped because they mean nothing now that I've removed original points and added new ones. I'm still open to better ways to do it and to check everything is done correctly.
UPDATE
I made a function that could be more useful than doing things step-wise and lets you play with the threshold.
distance_fix = function(dataframe,threshold){
mydist = as.matrix(dist(dataframe))
# Which pairs in the upper triangle are below threshold
pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs))
# Get result dataframe
out = data.frame(rbind(dataframe[-indx,],t(apply(pairs,1,function(i) apply(dataframe[i,],2,mean)))),row.names=NULL)
return(out)
}
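A short usage sketch with the qq data from the question follows. It repeats the same logic under a different, made-up name (merge_close_points) and adds one guard that is an assumption about the desired behaviour: when no pair is below the threshold, the input is returned unchanged, because negative indexing with an empty index would otherwise drop every row.

```r
qq <- data.frame(X = c(4237.5, 4326.5, 4382, 4204, 4338.5),
                 Y = c(4411.5, 4444.5, 4418, 4487.5, 4515),
                 row.names = 2:6)

merge_close_points <- function(dataframe, threshold) {
  mydist <- as.matrix(dist(dataframe))
  # Which pairs in the upper triangle are below the threshold
  pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = TRUE)
  # Guard (assumed behaviour): nothing to merge, return the points as they are
  if (nrow(pairs) == 0) {
    out <- dataframe
    rownames(out) <- NULL
    return(out)
  }
  # Row numbers of every point that takes part in at least one pair
  indx <- unique(c(pairs))
  # One averaged point per pair, appended to the untouched points
  means <- t(apply(pairs, 1, function(i) apply(dataframe[i, ], 2, mean)))
  data.frame(rbind(dataframe[-indx, ], means), row.names = NULL)
}

out80 <- merge_close_points(qq, 80)   # merges the 3-4 and 3-6 pairs
out10 <- merge_close_points(qq, 10)   # no pair is this close
out80
```

Note that a point belonging to several pairs (point 3 here) contributes to several averages, exactly as in the step-wise version.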
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried dplyr and functions such as sample, order and sort, but nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(x)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
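Whichever reordering is used, the result is easy to verify with rle. The two helpers below are illustrative names, not part of the answer: max_run reports the longest run of identical consecutive values, and same_elements checks that a reordering kept exactly the original values.

```r
# Longest run of identical consecutive values in a vector
max_run <- function(x) max(rle(as.vector(x))$lengths)

# TRUE if b is a reordering of a (same values, same counts)
same_elements <- function(a, b) {
  identical(sort(as.vector(a)), sort(as.vector(b)))
}

ok_seq  <- c(2, 2, 1, 3, 3)   # longest run is 2
bad_seq <- c(2, 2, 2, 1, 3)   # longest run is 3
max_run(ok_seq)
max_run(bad_seq)
same_elements(ok_seq, rev(ok_seq))

# Typical use after reordering (new.val and val as in the example above):
# stopifnot(max_run(new.val) <= 2, same_elements(new.val, val))
```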
I believe this loop will set the valence categories appropriately. I've called the valence category column treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] && s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
I have two data frames imported from txt files -- the sampling points and the station locations.
The sampling points data frame
X Y Z
346449.30 576369.65 86.93
346449.55 576368.24 87.16
346449.29 576368.17 79.08
346449.83 576366.86 88.23
346449.97 576365.42 84.97
346449.91 576362.22 86.59
346449.74 576363.65 88.87
346449.61 576363.59 84.99
346449.50 576363.54 81.33
The station locations data frame
Station x y
1 346479.720 576349.710
2 346575.380 576361.530
3 346685.540 576303.180
4 346722.820 576412.680
5 346514.780 576406.140
6 346813.130 576435.830
7 346748.880 576304.090
8 346825.830 576402.800
So I would like to know how to find and label points (from the sampling data frame) that fall within a buffer zone (e.g. a 3 meter buffer radius generated from each of the stations in the second data frame).
This is what i would like to get:
X Y Z Station
346449.30 576369.65 86.93 1
346449.55 576368.24 87.16 1
346449.29 576368.17 79.08 1
346449.83 576366.86 88.23 2
346449.97 576365.42 84.97 2
346449.91 576362.22 86.59 3
346449.74 576363.65 88.87 4
346449.61 576363.59 84.99 5
346449.50 576363.54 81.33 5
346449.51 576365.07 89.38 5
346449.36 576365.01 84.93 5
346449.24 576366.46 88.70 5
346448.93 576367.83 86.75 5
I am new in R so any help appreciated. Thanks.
If you simply want to add the ID of the nearest station within 3 meters of the sampling data points to your sampling data.frame, one solution would be:
# get a matrix with the squares of the euclidian distances
mx <- outer(seq(nrow(sampleData)),
seq(nrow(stations)),
# return the square of the euclidian distance
function(i,j){
(sampleData[i,'X'] - stations[j,'x'])^2 +
(sampleData[i,'Y'] - stations[j,'y'])^2
})
# maximum distance to consider
d = 3
# get rid of distances greater than 3 meters away
mx[mx>d^2] <- NA
index <- apply(mx,
               1,
               # returns the row number in `stations` of the nearest station
               # that is less than 3 meters away, or NA if there is none
               function(x){
                 if(all(is.na(x)))
                   return(NA_integer_)
                 # which.min ignores the NA (too distant) entries
                 which.min(x)
               })
sampleData$station <- stations$Station[index]
# a comma delimited list of stations with distance < 3
sampleData$closeStations <- apply(mx,
                                  1,
                                  # returns the IDs of all stations that are less than 3 meters away
                                  function(x){
                                    if(all(is.na(x)))
                                      return(NA_character_)
                                    paste(stations$Station[!is.na(x)], collapse = ',')
                                  })
Using outer and apply may make the solution run faster, but if you're having trouble with it, it may be easier to debug using a for loop instead:
# maximum distance to consider
d = 3
# pre-allocate with NA so that skipped data points stay NA
distanceToNearestStation <-
  nearestStation <- rep(NA_real_, nrow(sampleData))
nearestStations <- rep(NA_character_, nrow(sampleData))
for(i in seq(nrow(sampleData))){
  # euclidian distances from this data point to the stations
  distances <- sqrt((sampleData[i,'X'] - stations[,'x'])^2 +
                    (sampleData[i,'Y'] - stations[,'y'])^2 )
  # optionally, get rid of distances greater than d meters away
  # distances[distances>d] <- NA
  # all the stations are too far away or something is wrong with this data point
  if(all(is.na(distances)))
    next
  # record the nearest station to this data point
  distanceToNearestStation[i] <- min(distances,na.rm=T)
  nearestStation[i] <- which.min(distances)
  # comma delimited list of stations within d meters
  distanceIsClose <- distances < d
  distanceIsClose[is.na(distanceIsClose)] <- F
  nearestStations[i] <- paste(stations$Station[distanceIsClose], collapse = ',')
}
range(distanceToNearestStation, na.rm = TRUE)
sampleData$station <- stations$Station[nearestStation]
# number of data points within 3 meters of a station
table(distanceToNearestStation <= d)
# data points within 3 meters of a station
subset <- sampleData[which(distanceToNearestStation <= d),]
# save to individual files.
for(s in unique(subset$station))
write.csv(subset[subset$station == s,],
file.path('My/Favorite/Directory'# note there is no trailing slash
,paste('station',s,'data.csv')))
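The code above assumes two data frames named sampleData (columns X, Y, Z) and stations (columns Station, x, y). A compact, self-contained sketch of the same idea is shown below, using the first three rows of each table from the question; the function name label_nearest and the added column nearest_dist are invented for illustration.

```r
sampleData <- data.frame(
  X = c(346449.30, 346449.55, 346449.29),
  Y = c(576369.65, 576368.24, 576368.17),
  Z = c(86.93, 87.16, 79.08)
)
stations <- data.frame(
  Station = 1:3,
  x = c(346479.720, 346575.380, 346685.540),
  y = c(576349.710, 576361.530, 576303.180)
)

# Label each point with its nearest station, or NA if that station is
# farther away than d (same units as the coordinates).
label_nearest <- function(points, stations, d) {
  # n x m matrix of distances: rows are points, columns are stations
  dx <- outer(points$X, stations$x, "-")
  dy <- outer(points$Y, stations$y, "-")
  dist_m <- sqrt(dx^2 + dy^2)
  nearest <- apply(dist_m, 1, which.min)
  nearest_dist <- dist_m[cbind(seq_len(nrow(points)), nearest)]
  points$nearest_dist <- nearest_dist
  points$Station <- ifelse(nearest_dist <= d, stations$Station[nearest], NA)
  points
}

within3  <- label_nearest(sampleData, stations, d = 3)
within50 <- label_nearest(sampleData, stations, d = 50)
within50
```

With the coordinates as posted, these three sampling points are roughly 35 to 36 m from station 1, so a 3 m buffer labels none of them while a 50 m buffer assigns all of them to station 1. If the real data behave the same way, it is worth checking that both tables use the same coordinate system.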
I need to test the value of 'peso' (see replication code below) for each factor. If a factor reaches 50% of the overall sum of 'peso', the values of each factor should be pasted into a new object 'results'; otherwise, R should evaluate which factor has the lowest aggregated value of 'peso', and consider the factor in the next column to aggregate 'peso' again. Basically, this process replaces the lowest-scored factor with the next factor. The process should repeat until a factor crosses the 50% threshold. So my question is, where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like
If your first two rows are:
a b c d e peso
8 2 3 7 9 1
8 3 4 5 7 3
9 7 4 10 1 2
10 3 4 5 7 3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So, factor 8 got the greatest share, 4/(4+2+3) = 0.4444444, but did not reach the 50% threshold in round a. Therefore, I need something more: repeat the aggregation, but now considering factor 7 in column 'b' instead of factor 9 in column 'a', since factor 9 got the lowest aggregated value in the first round.
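The first round described here can be reproduced on the four example rows with tapply. This is only a sketch of that single round, under the assumption that "aggregating" means summing peso per value of column a; the names mini, totals_a and share are illustrative.

```r
# The four example rows from above
mini <- data.frame(a = c(8, 8, 9, 10),
                   b = c(2, 3, 7, 3),
                   c = c(3, 4, 4, 4),
                   d = c(7, 5, 10, 5),
                   e = c(9, 7, 1, 7),
                   peso = c(1, 3, 2, 3))

# Round "a": sum of peso for each factor found in column a
totals_a <- tapply(mini$peso, mini$a, sum)
share <- totals_a / sum(totals_a)

totals_a                                  # 8 -> 4, 9 -> 2, 10 -> 3
any(share >= 0.5)                         # has any factor reached 50%?
lowest <- names(totals_a)[which.min(totals_a)]
lowest                                    # factor to replace in the next round
```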
It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
colSums(sapply(myFactors, function(fctr)
# calculate totals per fctr
as.integer(Data$peso) * rowSums(fctr == subset(Data, select= -peso))
))
names(Totals) <- myFactors
Which gives
Totals
# 1 2 3 4 5 6 7 8 9 10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure if afterwards, you want to compare to the sum of peso or the sum of the totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
Note on your sampling for Peso
You took a sample of 0:3.5, however, the x:y sequence only gives integers.
If you want fractions, you can either use seq() or you can take a larger sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)
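A quick check of these points is sketched below. It also shows one consequence for the Totals code above: because that code wraps Data$peso in as.integer(), fractional weights would be truncated unless the as.integer() call is dropped. The variable names are illustrative.

```r
# The colon operator steps by 1 and stops at the last value not above 3.5
colon_seq <- 0:3.5
colon_seq

# Two equivalent ways of getting half steps
half_steps_a <- (0:7) / 2
half_steps_b <- seq(from = 0, to = 3.5, by = 0.5)

# as.integer() truncates towards zero, so fractional weights lose their
# fractional part
truncated <- as.integer(c(0.5, 1.5, 3.5))
truncated
```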