Use this example data to see what I mean
tag <- as.character(c(1,2,3,4,5,6,7,8,9,10))
species <- c("A","A","A","A","B","B","B","C","C","D")
size <- c(0.10,0.20,0.25,0.30,0.30,0.15,0.15,0.20,0.15,0.15)
radius <- (size*40)
x <- c(9,4,25,14,28,19,9,22,10,2)
y <- c(36,7,15,16,22,24,39,20,34,9)
data <- data.frame(tag, species, size, radius, x, y)
# Plot the points using qplot (from package tidyverse)
qplot(x, y, data = data) +
geom_point(aes(colour = species, size = size))
Now that you can see the plot, what I want to do is for each individual “species A” point, I’d like to identify the largest point within a radius of size*40.
For example, in the bottom left of the plot you can see that species A (tag 2) would produce a radius large enough to contain the close species D point.
However, the species A point on the far right-hand-side of the plot (tag 3) would produce a radius large enough to contain both of the close species B and species C points, in which case I’d want some sort of output that identifies the largest individual within the species A radius.
I’d like to know what I can run (if anything) on this data set to get find the largest “within radius” point for each species A point and get an output like this:
Species A point ---- Largest point within radius
Species A tag 1 ----- Species C tag 9
Species A tag 2 ----- Species D tag 10
Species A tag 3 ----- Species B tag 5
Species A tag 4 ----- Species C tag 8
I've used spatstat and CTFSpackage to make some plots in the past but I can't figure out how to "find largest neighbor within radius". Perhaps I can tackle this in ArcMAP? Also, this is just a small example dataset. Realistically I will be wanting to find the "largest neighbor within radius" for thousands of points.
Any help or feedback would be greatly appreciated.
Following finds the largest species and tag pair that is within given radius for each of the species.
all_df <- data # don't wanna have a variable called data
res_df <- data.frame()
for (j in 1 : nrow(all_df)) {
# subset the data
df <- subset(all_df, species != species[j])
# index of animals within radius
ind <- which ((df$x - x[j])^2 + (df$y - y[j])^2 < radius[j]^2 )
# find the max `size` in the subset df
max_size <- max(df$size[ind])
# all indices with max_size in df
max_inds <- which(df$size[ind] == max_size)
# pick the last one is there is more than on max_size
new_ind <- ind[max_inds[length(max_inds)]]
# results in data.frame
res_df <- rbind(res_df, data.frame(org_sp = all_df$species[j],
org_tag = all_df$tag[j],
res_sp = df$species[new_ind],
res_tag = df$tag[new_ind]))
}
res_df
# org_sp org_tag res_sp res_tag
# 1 A 1 C 9
# 2 A 2 D 10
# 3 A 3 B 5
# 4 A 4 C 8
# 5 B 5 A 3
# 6 B 6 C 8
# 7 B 7 C 9
# 8 C 8 B 5
# 9 C 9 B 7
# 10 D 10 A 2
Related
I currently have a data frame storing separate x,y,z coordinates from an accelerometer sensor (with timestamps), but want to perform vector operations on it.
Test data (actually have thousands of rows, and a timestamp row to be preserved)
x <- c(1,3,1,0,3)
y <- c(2,4,8,8,9)
z <- c(0,1,1,2,0)
df <- data.frame(x,y,z)
proj <- function(a,b) {
as.double((a %*% b) / (b %*% b)) * b
}
v = c(1,2,3)
I want to mutate (or create a new dataframe?) df by applying proj(_,v) on each row.
I have tried along the lines of mutate(projected = proj(c(x,y,z), v), but doesn't work, I am probably misusing this.
What is the best way to achieve this? Should I instead be using a list of vectors to store the coordinates?
While your proj(a,b)-function does only take two inputs, in your example you wanted to provide three proj(c(x,y,z),v) or did I misunderstand?
However, this would work:
dplyr::mutate(projected = proj(x,y), df) resulting in
x y z projected
1 1 2 0 0.4279476
2 3 4 1 0.8558952
3 1 8 1 1.7117904
4 0 8 2 1.7117904
5 3 9 0 1.9257642
Problem: I have several (10+) custom functions, each defining a step in the workflow. I want to run a nested function of these steps over a large data frame for n (50+) periods iteratively. My current function achieves the result but it is too slow and not very elegant.
Example Input
id x_1975 z_1975
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 NA
Step 1:
Compare initial x values (x_1975) against a cutoff=3. If x is greater than 3, then the z value should be "Y".
Step 2:
If z value is "Y", then x value in next year should be x times 2. Otherwise, it should be x times 5. Although the z values can be skipped altogether, I need the categorical column to create summary stats.
Note:
The data set I am working with has 20 variables that need to be calculated based on some similar logics.
Desired Output
id x_1975 z_1975 x_1976 z_1976 x_1977 z_1977 x_1978
1 1 1 <NA> 5 Y 10 Y 20
2 2 2 <NA> 10 Y 20 Y 40
3 3 3 <NA> 15 Y 30 Y 60
4 4 4 Y 8 Y 16 Y 32
5 5 5 Y 10 Y 20 Y 40
6 6 6 Y 12 Y 24 Y 48
What I have tried:
Tried setting the data in long format. But found it complicated to iterate over rows.
Pre-allocated all columns with appropriate class. That reduced run time a little although not enough.
Have been trying to use purrr::compose to nest all the functions. But I am not being able to make it work.
Reproducible Example
library(dplyr)
library(purrr)
# Create Data Frame
n <- 6
dat <- data.frame(id=1:n,
x_1975=seq(1,6,1),
z_1975=NA)
cut_off <- 3
# Functions
# Set a value for "z_" variables in period t by comparing "x_" value in period t against the the cut_off value.
func_1 <- function(dat,yr){
# pre-define variables
z <- paste0("z_",yr)
x <- paste0("x_",yr)
# Caclulate values for "z_" in period t
dat <- dat %>% mutate(!!sym(z):=
case_when(!!sym(x)>cut_off ~ "Y",
TRUE~as.character(NA)
))
}
# Calculate the value for "x_" variables in period t+1 based on "z_" variables in period t.
func_2 <- function(dat,yr){
# pre-define variables
x <- paste0("x_",yr+1)
x_lag <- paste0("x_",yr)
z <- paste0("z_",yr)
# Calculate "x_" value for t+1
dat <- dat %>% mutate(!!sym(x):=case_when(
!!sym(z)=="Y"~!!sym(x_lag)*2,
TRUE~!!sym(x_lag)*5
))
}
# Join function 1 and function 2 together. The joined function needs to iterate over the `dat` from beginning year to ending year
joined_func <- function(dat,beginning,ending){
for (year in seq(beginning,ending,1)){
dat <- func_1(dat,year)
# Output of step 1 is used as input for step 2
dat <- func_2(dat,year)
}
return(dat)
}
# Run the code from 1975 to 2025.The data_output has the desired output, but need to reduce runtime.
data_output <- joined_func(dat,1975,1977)
# Tried to use the compose function from purrr. but getting error.
my_funs <- c(func_1, func_2)
f1 <- invoke(compose, my_funs)
joined_func_2 <- function(dat,beginning,ending){
for (year in seq(beginning,ending,1)){
dat <- f1(dat,year=year)
}
}
data_output_2 <- joined_func_2(dat,1975,1977)
# Error message:
# Error in f1(dat, year = year) : unused argument (year = year).
Questions
a) how do I make purrr::compose work? b) any other way to achieve efficiency?
Would really appreciate if someone could help me on this!
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages like dylpr, sample, order, sort and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
Take the vector of valences and samples from it.
If sequences are found that are larger than the desired length, then, (for each such sequence), take the last value of that sequence at places it "somewhere else".
Check if the problem is solved. If so, return the reordered vector. If not, then go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(val)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the Valence Category's appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
I have two data frames imported from txt files -- the sampling points and the station locations.
The sampling points data frame
X Y Z
346449.30 576369.65 86.93
346449.55 576368.24 87.16
346449.29 576368.17 79.08
346449.83 576366.86 88.23
346449.97 576365.42 84.97
346449.91 576362.22 86.59
346449.74 576363.65 88.87
346449.61 576363.59 84.99
346449.50 576363.54 81.33
The station locations data frame
Station x y
1 346479.720 576349.710
2 346575.380 576361.530
3 346685.540 576303.180
4 346722.820 576412.680
5 346514.780 576406.140
6 346813.130 576435.830
7 346748.880 576304.090
8 346825.830 576402.800
So i would like to know how to find and label points (from the sampling data frame) that fall within a buffer zone (e.g. 3 meters buffer radius generated from each of the stations from the second data frame)?
This is what i would like to get:
X Y Z Station
346449.30 576369.65 86.93 1
346449.55 576368.24 87.16 1
346449.29 576368.17 79.08 1
346449.83 576366.86 88.23 2
346449.97 576365.42 84.97 2
346449.91 576362.22 86.59 3
346449.74 576363.65 88.87 4
346449.61 576363.59 84.99 5
346449.50 576363.54 81.33 5
346449.51 576365.07 89.38 5
346449.36 576365.01 84.93 5
346449.24 576366.46 88.70 5
346448.93 576367.83 86.75 5
I am new in R so any help appreciated. Thanks.
if you simply want to add id of the nearest station within 3 meters of the sampling data points to your sampling data.frame one solution would be:
# get a matrix with the squares of the euclidian distances
mx <- outer(seq(nrow(sampleData)),
seq(nrow(stations)),
# return the square of the euclidian distance
function(i,j){
(sampleData[i,'X'] - stations[j,'x'])^2 +
(sampleData[i,'Y'] - stations[j,'y'])^2
})
# maximum distance to consider
d = 3
# get rid of distances greater than 3 meters away
mx[mx>d^2] <- NA
index <- apply(mx,
1,
# returns the number of the nearest row in `stations` that is less than 3 meters away
function(x){
if(all(is.na(x)))
return(NA)
x[is.na(x)] <- F
which.max( x == min(x,na.rm=T) )
})
sampleData$station <- stations$station[indx]
# a comma delimited list of stations with distance < 3
sampleData$closeStations <- apply(mx,
1,
# returns the number of the nearest row in `stations` that is less than 3 meters away
function(x){
if(all(is.na(x)))
return(NA)
paste0(stations$Station[x],sep = ',')
})
using outer and apply may make the solution run faster, but if you're having trouble with it, it may be easier to debug using a for loop instead:
# maximum distance to consider
d = 3
distanceToNearestStation <-
nearestStation <- numeric(0)
nearestStations <- character(0)
for(i in seq(nrow(sampleData))){
# square of the euclidian distances from this data point to the stations
distances <- sqrt((sampleData[i,'X'] - stations[,'x'])^2 +
(sampleData[i,'Y'] - stations[,'y'])^2 )
# get rid of distances greater than 3 meters away
# distances[distances>d] <- NA
# all the stations are too far away or something is wrong with this data point
if(all(is.na(distances)))
next
# record the nearest station to this data point
distanceToNearestStation[i] <- min(distances,na.rm=T)
nearestStation[i] <- which.max( distances == min(distances,na.rm=T) )
# comma delimeted list of stations within 3 meters
distanceIsClose <- distance < 3
distanceIsClose[is.na(distanceIsClose)] <- F
nearestStations[i] <- paste0(paste0(stations$Station[distanceIsClose],sep = ','))
}
range(distanceToNearestStation)
sampleData$station <- stations$station[nearestStation]
# number of data points within 3 meters of a station
table(distanceToNearestStation <= 3)
# data points within 3 meters of a station
subset <- sampleData[distanceToNearestStation<= 3,]
# save to individual files.
for(s in unique(subset$station))
write.csv(subset[subset$station == s,],
file.path('My/Favorite/Directory'# note there is no trailing slash
,paste('station',s,'data.csv')))
I am trying to construct a function which shouldn't be hard in terms of programming but I am having some difficulties to conceptualize it. Hope you'll be able to understand my problem better than me!
I'd like a function that takes a single list of vectors as argument. Something like
arg1 = list(c(1,2), c(2,3), c(5,6), c(1,3), c(4,6), c(6,7), c(7,5), c(5,8))
The function should output a matrix with two columns (or a list of two vectors or something like that) where one column contains letters and the other numbers. One can think of the argument as a list of the positions/values that should be placed in the same group. If in the list there is the vector c(5,6), then the output should contain somewhere the same letters next to the values 5 and 6 in the number column. If there are the three following vectors c(1,2), c(2,3) and c(1,3), then the output should contain somewhere the same letters next to the value 1, 2 and 3 in the number column.
Therefore if we enter the object arg1 in the function it should return:
myFun(arg1)
number_column letters_column
1 A
2 A
3 A
5 B
6 B
7 B
4 C
6 C
5 D
8 D
(the order is not important. The letters E should not be present before the letter D has been used)
Therefore the function has constructed 2 groups of 3 (A:[1,2,3] and B:[5,6,7]) and 2 groups of 2 (C:[4,6] and D:[5,8]). Note one position or number can be in several group.
Please let me know if something is unclear in my question! Thanks!
As I wrote in the comments, it appears that you want a data frame that lists the maximal cliques of a graph given a list of vectors that define the edges.
require(igraph)
## create a matrix where each row is an edge
argmatrix <- do.call(rbind, arg1)
## create an igraph object from the matrix of edges
gph <- graph.edgelist(argmatrix, directed = FALSE)
## returns a list of the maximal cliques of the graph
mxc <- maximal.cliques(gph)
## creates a data frame of the output
dat <- data.frame(number_column = unlist(mxc),
group_column = rep.int(seq_along(mxc),times = sapply(mxc,length)))
## converts group numbers to letters
## ONLY USE if max(dat$group_column) <= 26
dat$group_column <- LETTERS[dat$group_column]
# number_column group_column
# 1 5 A
# 2 8 A
# 3 5 B
# 4 6 B
# 5 7 B
# 6 4 C
# 7 6 C
# 8 3 D
# 9 1 D
# 10 2 D