How can I get the index of the first sample that follows a run of consecutive samples above a fixed threshold, within each group?
In the example below, I need to find, for each group, the time of the first sample that comes after 3 consecutive samples whose speed is greater than 35.
speed_threshold = 35
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !
For each row, check whether the speed is above the threshold and whether it is the third such value in a row; if so, capture its index in ends. Select the first index in ends and add one to get the index of the return time. (There may be more than one such run of 3 and therefore more than one element in ends. In that case, the first element is used.)
Note: In your example, the speed at return time is always above the threshold. This code does not check that as a condition at all, but simply gives the first time after three rows with speeds above threshold (regardless of whether the speed at that time is still above the threshold).
library(data.table)
setDT(df)
speed_thresh <- 35
df[, {above <- Speed > speed_thresh
ends <- which(above & rowid(rleid(above)) == 3)
.(Return_Time = Time[ends[1] + 1])}
, Group]
# Group Return_Time
# 1: 1 35
# 2: 2 25
# 3: 3 NA
Data used:
df <- fread('
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40
1 25 42
1 30 52
1 35 48
1 40 45
2 5 22
2 10 36
2 15 38
2 20 46
2 25 53
3 5 45
3 10 58
')
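The run-position logic above does not depend on data.table. As a minimal base R sketch for a single group (the values are Group 1 from the example, and the name run_pos is only illustrative), sequence(rle(above)$lengths) plays the role of rowid(rleid(above)):

```r
# Group 1 from the example data
Time  <- c(5, 10, 15, 20, 25, 30, 35, 40)
Speed <- c(25, 23, 21, 40, 42, 52, 48, 45)
speed_thresh <- 35

above <- Speed > speed_thresh
# Position of each element inside its run of identical values: 1 2 3 1 2 3 4 5
run_pos <- sequence(rle(above)$lengths)
# Rows that are the third above-threshold value in a row
ends <- which(above & run_pos == 3)
# Time of the row right after the first such run of three (NA if there is none)
return_time <- Time[ends[1] + 1]
return_time
# [1] 35
```

When no run of three exists, ends is empty, ends[1] is NA, and the subscript returns NA, which matches the NA reported for Group 3.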
One option is to use rleid to create a grouping variable based on the logical condition on 'Speed', then filter the runs where the number of rows (n()) is at least 3 and every 'Speed' is greater than 35, and keep the first three rows of each such run
library(dplyr)
library(data.table)
df1 %>%
group_by(Group, grp = rleid(Speed > speed_threshold)) %>%
filter(n() >= 3, all(Speed > speed_threshold)) %>%
slice(1:3)
1) Using DF, defined reproducibly in the Note at the end, define a function ok that takes a logical vector indicating whether speed is greater than 35 and returns a logical vector of the same length that is TRUE for the first speed that comes after 3 consecutive TRUEs. Apply it to each group using ave and subset DF down to the rows which are TRUE, giving s.
If just returning the groups which satisfy the condition is sufficient then we are done; otherwise, define Groups as a one column data frame with one row per Group and merge that with s so that we get an NA for those groups not satisfying the condition.
library(zoo)
ok <- function(x) cumsum(rollapplyr(x, list(-(1:3)), all, fill = FALSE)) == 1
s <- subset(DF, ave(Speed > 35, Group, FUN = ok))
Groups <- data.frame(Group = unique(DF$Group))
merge(Groups, s, all.x = TRUE)[1:2]
## Group Time
## 1 1 35
## 2 2 25
## 3 3 NA
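For readers without zoo, the "previous three values are all TRUE" test that ok builds with rollapplyr can be sketched in base R by combining three lagged copies of the vector. The helper name prev3_all is made up for this illustration and is not part of zoo or of the answer above:

```r
# Hypothetical helper: TRUE where the three preceding elements are all TRUE
prev3_all <- function(x) {
  n <- length(x)
  # Shift x right by k positions, padding the front with FALSE
  lag_k <- function(k) c(rep(FALSE, min(k, n)), head(x, max(n - k, 0)))
  lag_k(1) & lag_k(2) & lag_k(3)
}

above <- c(25, 23, 21, 40, 42, 52, 48, 45) > 35  # Group 1 speeds
# Index of the first row preceded by three above-threshold rows
which(prev3_all(above))[1]
# [1] 7
```

Row 7 of Group 1 is the row with Time 35. For a group with fewer than four rows, such as Group 3, the helper returns all FALSE, so which(...)[1] is NA.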
2) A second approach is to split DF by group and then perform the calculation over each component of the split.
library(zoo)
calc <- function(x) {
r <- rollapplyr(x$Speed > 35, list(-(1:3)), all, fill = FALSE)
c(x$Time[cumsum(r) == 1], NA)[1] # first matching Time, or NA if there is none
}
sapply(split(DF, DF$Group), calc)
## 1 2 3
## 35 25 NA
Note
Lines <- "Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !"
DF <- read.table(text = Lines, header = TRUE)
Related
Let's say I have a data frame
mydata <- data.frame(x = 1:25,
y = 26:50)
and another data frame with a set of min and max values
df.remove <- data.frame(min = c(3,10,22,17),
max = c(6,13,24,20))
I'm looking to create an output where the rows of mydata whose value in column x falls between the min and max of any row of df.remove are deleted.
thus giving me an output data frame
x y
1 26
2 27
7 32
8 33
9 34
14 39
15 40
16 41
21 46
25 50
I figured I can use the between() function to delete the values that fall between a range, and since I would be looking at the min and max values from each row in df.remove I attempted to run a loop using the code
result <- data.frame()
for(i in 1:nrow(df.remove)) {
result <- mydata[!between(mydata$x,df.remove$min[i],df.remove$max[i]),]
}
This, for obvious reasons returns the output with only the last set of min and max values removed. I figured to get the output I am looking for I would likely have to run the consecutive iteration on the output from the previous iteration instead of the original data frame mydata, however I couldn't find a way to do it.
What you are looking for is known as a non-equi anti-join. This can be done pretty easily with the data.table package. Consider
library(data.table)
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3,10,22,17), max = c(6,13,24,20))
setDT(mydata)[!df.remove, on = .(x >= min, x <= max)] # drop rows where min <= x <= max
Output
x y
1: 1 26
2: 2 27
3: 7 32
4: 8 33
5: 9 34
6: 14 39
7: 15 40
8: 16 41
9: 21 46
10: 25 50
In your code, the result dataframe can only keep your last update, as you operated on the original mydata dataframe and assigned this single update to the result dataframe every time.
Instead, you are supposed to operate on the updated dataframe. You could try the following code.
result <- mydata
for(i in 1:nrow(df.remove)) {
result <- result[!between(result$x,df.remove$min[i],df.remove$max[i]),]
}
After assigning the original mydata dataframe to the result dataframe, you are able to update it in an iterated way.
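The same "carry the result forward" idea can also be written without an explicit loop. The following is only a sketch in base R, assuming the same mydata and df.remove as in the question; it uses Reduce to accumulate a logical keep mask instead of re-subsetting the data frame on every pass:

```r
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3, 10, 22, 17), max = c(6, 13, 24, 20))

# Start with "keep every row" and switch rows off one range at a time
keep <- Reduce(
  function(acc, i) acc & !(mydata$x >= df.remove$min[i] & mydata$x <= df.remove$max[i]),
  seq_len(nrow(df.remove)),
  rep(TRUE, nrow(mydata))
)
result <- mydata[keep, ]
result$x
# [1]  1  2  7  8  9 14 15 16 21 25
```

Working on a mask rather than on the shrinking data frame keeps the original row positions available, which can be handy if the same mask has to be applied to another object of the same length.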
A base R approach -
res <- subset(mydata, !x %in% unlist(Map(`:`, df.remove$min, df.remove$max)))
res
# x y
#1 1 26
#2 2 27
#7 7 32
#8 8 33
#9 9 34
#14 14 39
#15 15 40
#16 16 41
#21 21 46
#25 25 50
Using Map we create the integer sequence between each min and max, unlist the sequences into a single vector, and drop the rows whose x appears in that vector.
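One thing to keep in mind with the sequence-based approach is that min:max only generates whole-number steps, so it matches the question's integer x but would miss non-integer values that lie inside a range. A small illustration with made-up values (x and rng below are hypothetical and not part of the question's data):

```r
x   <- c(2.5, 3.5, 7)               # hypothetical non-integer values
rng <- data.frame(min = 3, max = 6) # one hypothetical range

# Membership in the generated sequence 3, 4, 5, 6
by_seq <- x %in% unlist(Map(`:`, rng$min, rng$max))
# Direct comparison against the bounds
by_cmp <- x >= rng$min & x <= rng$max

by_seq
# [1] FALSE FALSE FALSE
by_cmp
# [1] FALSE  TRUE FALSE
```

For data like the question's, where x is always an integer, both tests agree.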
Another option using fuzzyjoin package -
fuzzyjoin::fuzzy_anti_join(mydata, df.remove,
c('x' = 'min', 'x' = 'max'),
match_fun = c(`>=`, `<=`))
Since you're using the dplyr function between, we can also use dplyr's filter function. For each row of df.remove, apply between to column x of mydata to see which values fall inside that range. This can be accomplished with mapply (since there are two values, min and max, to pass to the function), which returns a matrix of TRUE/FALSE with one row per row of mydata and one column per range. Then use apply across rows to check whether any value in a row is TRUE, and negate the result in filter so that every row falling inside some range is dropped:
library(dplyr)
mydata %>%
filter(
!mapply(function(left, right) between(mydata$x, left, right), left = df.remove$min, right = df.remove$max) %>%
apply(., 1, any)
)
Returns:
x y
1 1 26
2 2 27
3 7 32
4 8 33
5 9 34
6 14 39
7 15 40
8 16 41
9 21 46
10 25 50
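The TRUE/FALSE matrix described above can also be built in base R with outer. This is a sketch of the same idea rather than a replacement for the dplyr code, again assuming mydata and df.remove as defined in the question:

```r
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3, 10, 22, 17), max = c(6, 13, 24, 20))

# 25 x 4 logical matrix: is x[i] at or above min[j] and at or below max[j]?
inside <- outer(mydata$x, df.remove$min, ">=") &
          outer(mydata$x, df.remove$max, "<=")

# Keep the rows that fall inside none of the ranges
res <- mydata[rowSums(inside) == 0, ]
res$x
# [1]  1  2  7  8  9 14 15 16 21 25
```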
Just because this is an interesting problem which has several possible solutions, here is another approach using meta programming.
The idea is that we turn df.remove into a list of expressions which we then use inside filter(mydata, !!! .) by splicing it with the !!! operator.
One way to get the list of expressions is to use rowwise summarise and create a list of expressions with bquote which allows us to evaluate expressions wrapped in .(). In our case the min and max values.
And although this is possible, I'd probably use either @ekoam's {data.table} or @Ronak's base R approach.
library(dplyr)
df.remove %>%
rowwise %>%
summarise(x = list(bquote(!x %in% c(.(min):.(max))))) %>%
pull(x) %>%
filter(mydata, !!! .)
#> `summarise()` has ungrouped output. You can override using the `.groups`
#> argument.
#> x y
#> 1 1 26
#> 2 2 27
#> 3 7 32
#> 4 8 33
#> 5 9 34
#> 6 14 39
#> 7 15 40
#> 8 16 41
#> 9 21 46
#> 10 25 50
Created on 2022-01-23 by the reprex package (v0.3.0)
Using data.table::inrange.
library(data.table)
mydata[!mydata$x %inrange% df.remove, ]
# x y
# 1 1 26
# 2 2 27
# 7 7 32
# 8 8 33
# 9 9 34
# 14 14 39
# 15 15 40
# 16 16 41
# 21 21 46
# 25 25 50
I am trying to group a series of observations by two columns, and then create a third column with an id number. I've tried group_indices, but that gives each combination of observations a unique number. I want the number to revert to 1 for the first observation of each group.
In my data there are a series of Sites with a number of rows showing the calendar Day when an observation was collected. I want to calculate the chronological day within a Site.
library(dplyr)
# Make some data
df <- data.frame(Site = rep(c("A", "B", "C"), each = 70),
Day = as.integer(rep(c(21,22,23,24,25,26,27,1,2,3,4,5,6,7,
24,25,26,27,28,29,30), each = 10)))
# Create Day Number column (this doesn't actually work, but is the sort
# of thing I'm looking for...)
df <- df %>% group_by(Site, Day) %>%
mutate(Day.Number = group_indices(Day))
# Desired output
Site Day Day.Number
1 A 21 1
2 A 21 1
3 A 21 1
...
11 A 22 2
12 A 22 2
13 A 22 2
14 A 22 2
15 A 22 2
...
141 C 24 1
142 C 24 1
143 C 24 1
144 C 24 1
...
151 C 25 2
152 C 25 2
153 C 25 2
154 C 25 2
155 C 25 2
...
This is just a toy dataset to demonstrate the problem. Although most sites will have ten observations of seven days it is not always a given, so I can't just use a sequence of rep() etc.
There is a bit of a discussion about this on GitHub here and here, but it doesn't seem to have been resolved. Any suggestions for workarounds are much appreciated.
Here's one way to do it:
df <- df %>%
left_join(unique(df) %>% group_by(Site) %>% mutate(Day.Number=1:n()))
head(df)
# Site Day Day.Number
# 1 A 21 1
# 2 A 21 1
# 3 A 21 1
# 4 A 21 1
# 5 A 21 1
# 6 A 21 1
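A base R sketch of the same idea, for comparison: within each Site, number the days by the order in which they first appear. This assumes, as in the toy data, that the rows of each Site are already in chronological order; match and unique do the numbering and ave applies them per Site.

```r
df <- data.frame(Site = rep(c("A", "B", "C"), each = 70),
                 Day = as.integer(rep(c(21,22,23,24,25,26,27,1,2,3,4,5,6,7,
                                        24,25,26,27,28,29,30), each = 10)))

# Within each Site, give every distinct Day its order of first appearance
df$Day.Number <- ave(df$Day, df$Site, FUN = function(d) match(d, unique(d)))

# Spot-check the rows shown in the desired output
df[c(1, 11, 141, 151), ]
```

Because the numbering follows order of appearance rather than the numeric value of Day, a site whose days wrap over a month boundary (27, then 1) still counts upward.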
I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID regardless of whether the score goes above 5 or below -5 during the event( this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
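The two conditions can also be combined into a single logical mask, which avoids the duplicate rows and the re-sorting step. This is a minimal sketch in base R, assuming the data frame is called Data as above and that the rows of each ID are contiguous and in time order; duplicated(..., fromLast = TRUE) is used here to flag the last row of each ID:

```r
Data <- read.table(text = "ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11", header = TRUE)

# TRUE for the last row of each ID
is_last <- !duplicated(Data$ID, fromLast = TRUE)
# Keep rows where the score is outside [-5, 5] or the row is the last of its ID
out <- Data[abs(Data$Score) > 5 | is_last, ]
rownames(out)
# [1] "5"  "6"  "11" "12" "13" "16"
```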
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either.. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
y[sample(1:nrow(y), x),]
}, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases when number of rows is less than the number of samples that should be allocated.
set.seed(357)
result <- do.call("rbind",
by(xy, INDICES = xy$animal, FUN = function(x) {
avail.obs <- nrow(x)
num.rows <- sample(3:15, 1)
while (num.rows > avail.obs) {
message("Sample to be larger than available data points, repeating sampling.")
num.rows <- sample(3:15, 1)
}
x[sample(1:avail.obs, num.rows), ]
}))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL #Start with NULL; rbind() below builds up the data frame
for(i in unique(xy$animal)){
#Sample a number between 1 and 10 (or 6 and 156, if you need)
samp <- sample(1:10, 1)
#Determine which rows of data frame xy belong to animal i
rows <- which(xy$animal == i)
#From xy, sample samp of the rows associated with animal i
#(replace = TRUE allows the same row to be drawn more than once)
newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
#Append everything to the same new data frame
newVec <- rbind(newVec, newVec1)
}
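If the same location must not be drawn twice and some animals may have fewer rows than the requested sample size, one possible variation is to cap the sample size at the number of available rows. This is a sketch under those assumptions, not the method from either answer; note that capping with min() is a different design choice from re-drawing the size until it fits, and it shifts the distribution of sample sizes for small groups.

```r
set.seed(1)
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))

pieces <- lapply(split(xy, xy$animal), function(d) {
  # Draw a target size, then cap it at the rows available for this animal
  n <- min(sample(3:15, 1), nrow(d))
  # seq_len() avoids sample()'s special case for a single number
  d[sample(seq_len(nrow(d)), n), , drop = FALSE]
})
result <- do.call(rbind, pieces)

# Number of sampled rows per animal
table(result$animal)
```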
I have this data frame
t<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
Scanning the data.frame from top to bottom, I want to get the cumulative sum of v1 for as long as v2 is positive. When v2 becomes negative, it should stop and record the value (of the cumulative sum up to then), and the cumulative sum should restart from the next positive v2, and so on. So in the end, the result for the above data frame would be the vector
8, 10 , 12, 2
Any ideas?
I changed the name of the data.frame because t is a function (transpose). I don't get why you want to use cumsum if you only want the sum.
dtf<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
groups <- rle(dtf$v2 > 0)
dtf$groups<- rep(seq_along(groups$values), groups$lengths)
library(plyr)
daply(dtf, .(groups), function(x) sum(x$v1))[groups$values]
1 3 5 7
8 10 12 2
Here's one way:
t <- data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
unname(with(t, tapply(v1[v2>0], cumsum(abs(diff(sign(c(0,v2)))))[v2>0], sum)))
[1] 8 10 12 2
It might seem a bit complicated at first :)
The cumsum(abs(diff(sign(c(0,v2))))) generates a unique group id for each run of positive or negative values. Using diff and cumsum for this is a "common" trick that's good to know about... A snag is that diff produces a shorter vector - that's why the c(0, v2) is used.
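To make the run-id trick easier to follow, here it is split into named steps. This is a sketch using the same data; the intermediate names pos and run_id are only for illustration, and the change detection is done on the logical v2 > 0 rather than on sign(v2), which gives the same runs for this data:

```r
v1 <- c(1, 2, 1, 4, 6, 7, 8, 2, 3, 4, 8, 1, 2)
v2 <- c(2, 3, 6, 1, -3, -2, 1, 2, -3, 6, 7, -2, 1)

pos <- v2 > 0
# A new run starts at the first element and wherever pos changes value
run_id <- cumsum(c(TRUE, diff(pos) != 0))
run_id
# [1] 1 1 1 1 2 2 3 3 4 5 5 6 7

# Sum v1 within each run, keeping only the runs where v2 is positive
sums <- tapply(v1[pos], run_id[pos], sum)
unname(sums)
# [1]  8 10 12  2
```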
Here's another way.
> r <- rle(sign(t$v2))
> diff(c(0,cumsum(t$v1)[cumsum(r$lengths)]))[r$values==1]
[1] 8 10 12 2
It's easier to understand if you split it up; it works by picking out the right elements of the cumulative sum and subtracting them.
> (s <- cumsum(t$v1))
[1] 1 3 4 8 14 21 29 31 34 38 46 47 49
> (r <- rle(sign(t$v2)))
Run Length Encoding
lengths: int [1:7] 4 2 2 1 2 1 1
values : num [1:7] 1 -1 1 -1 1 -1 1
> (k <- cumsum(r$lengths))
[1] 4 6 8 9 11 12 13
> (a <- c(0,s[k]))
[1] 0 8 21 31 34 46 47 49
> (d <- diff(a))
[1] 8 13 10 3 12 1 2
> d[r$values==1]
[1] 8 10 12 2
Similarly, but without rle:
> k <- which(diff(c(sign(t$v2),0))!=0)
> diff(c(0,cumsum(t$v1)[k]))[t$v2[k]>0]
[1] 8 10 12 2