Determine if there was a recent event within a group - r

I am trying to calculate a variable that depends on the value of multiple other columns, but in other rows.
Here's the sample data:
set.seed(2)
df1 <- data.frame(Participant = c(rep(1, 5), rep(2, 7), rep(3, 10)),
                  Action = sample(c(rep("Play", 9), rep("Other", 13))),
                  time = c(sort(runif(5, 1, 100)), sort(runif(7, 1, 100)), sort(runif(10, 1, 100))))
df1$Action[2] <- "Play" # edited to provide an important test case
What I am trying to achieve is a column that tests whether the last "Play" event was at most 10 s ago (time column). If there is no "Play" event in the last 10 s, the value of StillPlaying should be "n", regardless of the current action. Here's a sample of what I would like to have:
Part Action time StillPlaying
1 1 Play 15.77544 n
2 1 Play 15.89964 y
3 1 Other 35.37995 n
4 1 Play 49.38855 n
5 1 Other 83.85203 n
6 2 Other 2.031038 n
7 2 Play 14.10483 n
8 2 Other 17.29958 y
9 2 Play 36.3492 n
10 2 Play 81.20902 n
11 2 Other 87.01724 y
12 2 Other 96.30176 n

It seems like you want to group by participant and flag any row where the most recent prior "Play" was within 10 seconds, regardless of the current action. You can do this using group_by in dplyr, using cummax to track the time of the last "Play" action:
library(dplyr)
df1 %>%
  group_by(Participant) %>%
  mutate(StillPlaying = ifelse(time - c(-100, head(cummax(ifelse(Action == "Play", time, -100)), -1)) <= 10, "y", "n"))
# Participant Action time StillPlaying
# (dbl) (fctr) (dbl) (chr)
# 1 1 Play 15.775439 n
# 2 1 Play 15.899643 y
# 3 1 Other 35.379953 n
# 4 1 Play 49.388550 n
# 5 1 Other 83.852029 n
# 6 2 Other 2.031038 n
# 7 2 Play 14.104828 n
# 8 2 Other 17.299582 y
# 9 2 Play 36.349196 n
# 10 2 Play 81.209022 n
# .. ... ... ... ...
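To see why this works, here is a step-by-step walk-through of the inner expression for participant 1, using the (rounded) times from the output above; p1_time and p1_action are just illustrative stand-ins for that participant's columns:
p1_time   <- c(15.78, 15.90, 35.38, 49.39, 83.85)
p1_action <- c("Play", "Play", "Other", "Play", "Other")
play_time <- cummax(ifelse(p1_action == "Play", p1_time, -100))
# 15.78 15.90 15.90 49.39 49.39   -- time of the latest "Play" seen so far
last_play <- c(-100, head(play_time, -1))
# -100.00 15.78 15.90 15.90 49.39 -- shifted down one row, so only earlier rows count
ifelse(p1_time - last_play <= 10, "y", "n")
# "n" "y" "n" "n" "n"             -- the -100 sentinel guarantees the first row can never be "y"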
If you want to stay in base R, you can do the same split-apply-combine with the same basic commands:
do.call(rbind, lapply(split(df1, df1$Participant), function(x) {
  x$StillPlaying <- ifelse(x$time - c(-100, head(cummax(ifelse(x$Action == "Play", x$time, -100)), -1)) <= 10, "y", "n")
  x
}))
# Participant Action time StillPlaying
# 1.1 1 Play 15.775439 n
# 1.2 1 Play 15.899643 y
# 1.3 1 Other 35.379953 n
# 1.4 1 Play 49.388550 n
# 1.5 1 Other 83.852029 n
# 2.6 2 Other 2.031038 n
# 2.7 2 Play 14.104828 n
# 2.8 2 Other 17.299582 y
# 2.9 2 Play 36.349196 n
# 2.10 2 Play 81.209022 n
# 2.11 2 Other 87.017243 y
# 2.12 2 Other 96.301761 n
# ...

Iterative Lag with calculation result from previous row: conditional cumulative product and sum

The actual question
Given the following data:
library(dplyr)
df <- tibble(v1 = 1:6, cond = c(1, 0, 1, 1, 0, 1))
## # A tibble: 6 × 2
## v1 cond
## <int> <dbl>
## 1 1 1
## 2 2 0
## 3 3 1
## 4 4 1
## 5 5 0
## 6 6 1
I want to calculate a mixture of cumulative sum and cumulative product:
if cond = 1, calculate the sum of the current v1 and the result of the
preceding calculation; if cond = 0, calculate the product of the current
v1 and the result of the preceding calculation.
The desired result should look like this:
## # A tibble: 6 × 3
## v1 cond cum_prodsum
## <int> <dbl> <int>
## 1 1 1 1
## 2 2 0 2
## 3 3 1 5
## 4 4 1 9
## 5 5 0 45
## 6 6 1 51
In SPSS this is the code I would use:
COMPUTE cum_prodsum = 0.
IF($casenum = 1 & cond = 1) cum_prodsum = v1.
IF($casenum > 1 & cond = 0) cum_prodsum = lag(cum_prodsum) * v1.
IF($casenum > 1 & cond = 1) cum_prodsum = lag(cum_prodsum) + v1.
But how can this be done in R?
Sounds like a silly task that no one would ever need to do? Yeah,
it probably is. But think of it as a simple example for a whole group of
problems where the calculation for the current row depends on the
calculation results of the preceding rows.
Some information for (former) SPSS users working with R (not part of the question)
When I used to work with SPSS I often used a combination of the IF
and LAG commands for common tasks, such as slicing the
data and keeping only the first row of each group. When I started
working with R, I quickly learned that for those common tasks R usually
comes with handy functions, so there is no need to program your own
routines with a lag function. And even for less common tasks, a
little research often leads to a solution without iterating through the
data.
In the end, the situations where I think “Well, I know how to do it in
SPSS with the LAG command. But how could I do it in R?” are very
rare. The dplyr package comes with a lag function, but it
works differently, so the naive approach of substituting SPSS's LAG with
R's lag does not work.
Difference between LAG from SPSS and dplyr::lag from R
Let’s say you have the following data with just one column:
library(dplyr)
df <- tibble(v1 = 1:6)
## # A tibble: 6 × 1
## v1
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
When computing a new variable in SPSS with LAG, SPSS processes cases
sequentially from top to bottom. The results from the calculation of the
preceding rows can be used for computing the current row.
COMPUTE lagsum_spss = v1.
IF ($casenum > 1) lagsum_spss = lagsum_spss + LAG(lagsum_spss).
Which results in:
## # A tibble: 6 × 2
## v1 lagsum_spss
## <int> <int>
## 1 1 1
## 2 2 3
## 3 3 6
## 4 4 10
## 5 5 15
## 6 6 21
The dplyr::lag function, on the other hand, is a vectorised function
which applies the calculation to all elements of a vector
simultaneously. So when I try to mimic the SPSS behavior in R with the
mutate and lag functions, I get a different result:
df %>%
  mutate(lagsum_r = v1,
         lagsum_r = lagsum_r + lag(lagsum_r, default = 0))
## # A tibble: 6 × 3
## v1 lagsum_spss lagsum_r
## <int> <int> <dbl>
## 1 1 1 1
## 2 2 3 3
## 3 3 6 5
## 4 4 10 7
## 5 5 15 9
## 6 6 21 11
The fourth row, for example, is calculated like this:
lagsum_spss[4] = 4 + 6, whereas lagsum_r[4] = 4 + 3.
So how can we reproduce this calculation in R? Well in this case it is
quite simple:
df %>%
mutate(cumsum = cumsum(v1))
## # A tibble: 6 × 3
## v1 lagsum_spss cumsum
## <int> <int> <int>
## 1 1 1 1
## 2 2 3 3
## 3 3 6 6
## 4 4 10 10
## 5 5 15 15
## 6 6 21 21
See, no need for lag this time.
OK, OK, but what if I want to sum only the values from cases that meet a
certain condition, a conditional cumsum so to speak?
Example data set:
df <- tibble(v1 = 1:6, cond = c(1, 0, 1, 1, 0, 1))
df
## # A tibble: 6 × 2
## v1 cond
## <int> <dbl>
## 1 1 1
## 2 2 0
## 3 3 1
## 4 4 1
## 5 5 0
## 6 6 1
The SPSS code would look like this:
COMPUTE cond_cumsum = 0.
IF($casenum = 1 & cond = 1) cond_cumsum = v1.
IF($casenum > 1 & cond = 0) cond_cumsum = lag(cond_cumsum).
IF($casenum > 1 & cond = 1) cond_cumsum = lag(cond_cumsum) + v1.
So how can this be done in R? Well, the solution is also pretty easy:
df %>%
mutate(cond_cumsum = cumsum(v1 * cond))
## # A tibble: 6 × 3
## v1 cond cond_cumsum
## <int> <dbl> <dbl>
## 1 1 1 1
## 2 2 0 1
## 3 3 1 4
## 4 4 1 8
## 5 5 0 8
## 6 6 1 14
For a task where I think it is unavoidable to iterate through the
data rows, see the question above.
So what we want to do is basically this: start with the first two elements
of a vector as input, do something with them, use that outcome as the new
first input and the next vector element as the second input, do the same
thing again, use that outcome as the new first input … and so on. If you
have heard about the Reduce (base R) or reduce and accumulate (purrr)
functions, this may sound familiar.
(The purrr cheat sheet has an illustration of how the accumulate function works: each result is fed back in as the first argument of the next call.)
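As a quick illustration of that idea, the running sum from earlier can be reproduced with either function:
Reduce(`+`, 1:6, accumulate = TRUE)
## [1]  1  3  6 10 15 21
purrr::accumulate(1:6, `+`)
## [1]  1  3  6 10 15 21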
Let's first think about the function that we want to apply:
first, we want to check whether cond is 0 or 1;
if cond is 1, then add v1 from the current row to the outcome of
the preceding step;
if not, then multiply v1 from the current row by the outcome of
the preceding step.
So we program this function in R:
function(last_result, i) { # i stands for the row index
  if (cond[i]) last_result + v1[i]
  else last_result * v1[i]
}
Now let's think about the first row, since there is no “last_result”
that we could feed into that function. Following the idea of a
cumulative sum and a cumulative product, the value should be
cumsum(v1[1]) if cond[1] is 1 and cumprod(v1[1]) if not. In both cases
these functions return v1[1], so this is our initial value for
the first row.
OK, now let's put this together with the accumulate function from the
purrr package:
library(purrr)
df %>%
  mutate(
    cum_prodsum = accumulate(
      .x = row_number()[-1], # apply the function to all rows except the first one
      .init = v1[1],         # initial value for the first row
      .f = function(last_result, i) {
        if (cond[i]) last_result + v1[i]
        else last_result * v1[i]
      }
    ))
## # A tibble: 6 × 3
## v1 cond cum_prodsum
## <int> <dbl> <int>
## 1 1 1 1
## 2 2 0 2
## 3 3 1 5
## 4 4 1 9
## 5 5 0 45
## 6 6 1 51
And that's it. For similar types of problems, where the calculation of a
value depends on the calculations of the preceding values, just adjust
the function within the accumulate call to your needs.
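For reference, a minimal base R sketch of the same computation, using Reduce(..., accumulate = TRUE) with the column names from the example data above:
df$cum_prodsum <- Reduce(
  function(last_result, i) {
    if (df$cond[i] == 1) last_result + df$v1[i] else last_result * df$v1[i]
  },
  seq_len(nrow(df))[-1], # row indices 2..n
  init = df$v1[1],       # value for the first row
  accumulate = TRUE
)
df$cum_prodsum
## [1]  1  2  5  9 45 51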
A vectorised attempt with case_when, like the one below, computes cumsum and cumprod over the whole column first and only then picks one of them per row, so it does not capture the row-by-row dependence and gives a different result:
library(dplyr)
df <-
  tibble(v1 = as.numeric(1:6), v2 = c(1, 0, 1, 1, 0, 1))
df %>%
  mutate(output = case_when(v2 == 1 ~ cumsum(v1),
                            v2 == 0 ~ cumprod(v1)))

R: Why is expand.grid() producing many more rows than I expect?

My understanding is that base::expand.grid() and tidyr::expand_grid() will return an object with one row for each unique combination of the values across one or more vectors. For example, here is what I expect:
# Preliminaries
library(tidyr)
set.seed(123)
# Simulate data
df <- data.frame(x = as.factor(rep(c(1,2), 50)), y= as.factor(sample(1:3, 100, replace = T)))
# Expected result
data.frame(x = rep(1:2, 3), y = rep(1:3, 2)) # 6 rows!
However, when I actually use the functions, I get many more (duplicated) rows than I expect:
# Tidyverse result
tidyr::expand_grid(df) # produces 100 rows!
tidyr::expand_grid(df$x, df$y) # produces 10k rows!
# Base R version
base::expand.grid(df) # produces 10k rows!
base::expand.grid(df$x, df$y) # produces 10k rows!
# Solution...but why do I have to do this?!
unique(base::expand.grid(df))
Can someone explain what I am missing about what it is supposed to do?
The input to expand_grid is variadic (...), so we can use do.call:
do.call(expand_grid, df)
Or with invoke
library(purrr)
invoke(expand_grid, df)
# A tibble: 10,000 × 2
x y
<fct> <fct>
1 1 3
2 1 3
3 1 3
4 1 2
5 1 3
6 1 2
7 1 2
8 1 2
9 1 3
10 1 1
# … with 9,990 more rows
Or with !!!
expand_grid(!!! df)
# A tibble: 10,000 × 2
x y
<fct> <fct>
1 1 3
2 1 3
3 1 3
4 1 2
5 1 3
6 1 2
7 1 2
8 1 2
9 1 3
10 1 1
# … with 9,990 more rows
As @Mossa commented, the function to return unique combinations would be expand or crossing, because expand calls expand_grid on the unique values:
> expand(df, df)
# A tibble: 6 × 2
x y
<fct> <fct>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
Based on the source code
getAnywhere("expand.data.frame")
function (data, ..., .name_repair = "check_unique")
{
    out <- grid_dots(..., `_data` = data)
    out <- map(out, sorted_unique)
    out <- expand_grid(!!!out, .name_repair = .name_repair)
    reconstruct_tibble(data, out)
}
expand.grid makes no attempt to return only unique values of the input vectors. It will always output a data frame whose number of rows equals the product of the lengths of its input vectors:
nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000
nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1
If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:
d <- lengths(args)
which returns a vector with one entry for each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).
There then follows the line
orep <- prod(d)
which gives us the product of d, which is 100x100, or 10,000.
The variable orep is used later in the function to repeat each vector so that its length is equal to the value orep.
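On a toy input, the same bookkeeping looks like this (illustrative only, not the actual function body):
args <- list(1:2, 1:3)
d <- lengths(args)          # 2 3
orep <- prod(d)             # 6
nrow(expand.grid(1:2, 1:3)) # 6 rows, one per combination of the inputs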
If you only want unique combinations of the two input vectors, then you must make them unique at the input to expand.grid.
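For instance, with the df from the question (and assuming all three levels of y actually occur in the sampled data), deduplicating the inputs first gives the expected six rows:
nrow(expand.grid(unique(df$x), unique(df$y)))
#> [1] 6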

Count number of shared observations between samples using dplyr

I have a list of observations grouped by samples. I want to find the samples that share the most identical observations. An identical observation is where the start and end number are both matching between two samples. I'd like to use R and preferably dplyr to do this if possible.
I've been getting used to using dplyr for simpler data handling but this task is beyond what I am currently able to do. I've been thinking the solution would involve grouping the start and end into a single variable: group_by(start,end) but I also need to keep the information about which sample each observation belongs to and compare between samples.
example:
sample start end
a 2 4
a 3 6
a 4 8
b 2 4
b 3 6
b 10 12
c 10 12
c 0 4
c 2 4
Here samples a, b and c share 1 observation (2 4),
samples a and b share 2 observations (2 4, 3 6),
samples b and c share 2 observations (2 4, 10 12),
and samples a and c share 1 observation (2 4).
I'd like an output like:
abc 1
ab 2
bc 2
ac 1
and also to see what the shared observations are if possible:
abc 2 4
ab 2 4
ab 3 6
etc
Thanks in advance
Here's something that should get you going:
df %>%
  group_by(start, end) %>%
  summarise(
    samples = paste(unique(sample), collapse = ""),
    n = length(unique(sample)))
# Source: local data frame [5 x 4]
# Groups: start [?]
#
# start end samples n
# <int> <int> <chr> <int>
# 1 0 4 c 1
# 2 2 4 abc 3
# 3 3 6 ab 2
# 4 4 8 a 1
# 5 10 12 bc 2
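If you also want the pairwise counts listed in the question (ab 2, bc 2, ac 1), one possible sketch is to compare each pair of samples directly; sample, start and end are the column names from the example, and the pair labels are built with paste:
library(dplyr)

samples <- sort(unique(df$sample))
pairs <- combn(samples, 2, simplify = FALSE)

pair_counts <- bind_rows(lapply(pairs, function(p) {
  shared <- inner_join(
    df %>% filter(sample == p[1]) %>% distinct(start, end),
    df %>% filter(sample == p[2]) %>% distinct(start, end),
    by = c("start", "end"))
  tibble(pair = paste(p, collapse = ""), n = nrow(shared))
}))
pair_counts
# pair: "ab" "ac" "bc"; n: 2 1 2 (matching the counts in the question)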
Here is an idea via base R,
sp <- Filter(nrow, split(df, list(df$start, df$end)))
final_d <- data.frame(count1 = sapply(sp, nrow),
                      pairs1 = sapply(sp, function(i) paste(i[[1]], collapse = '')))
# count1 pairs1
#0.4 1 c
#2.4 3 abc
#3.6 2 ab
#4.8 1 a
#10.12 2 bc

creating row index based on time difference in R

I have data that looks like:
player event diff
A x NA
A y 2
A z 240
A w 3
A x 9
B x NA
B y 3
B z 120
C x NA
C x 8
What I did to get this was to group by the player column and take the difference between time events, hence the NA in the diff column at each player's first event.
What I want to do is to partition the data into player-specific interactions that are within a few minutes of each other (say a cutoff of diff = 20). What I want in the end is to have:
player event diff interaction
A x NA 1
A y 2 1
A z 240 2
A w 3 2
A x 9 2
B x NA 1
B y 3 1
B z 120 2
C x NA 1
C x 8 1
So basically the interactions are grouped based on having the same player and the difference being less than 20; otherwise a new interaction is started. A new interaction is also started if an NA is present. I'm not really sure how to do this in a fast/efficient way, as I've got a large data set with many players. My preference is for a dplyr solution.
You can replace NA with 0 (or any other number below your threshold) in the diff column using coalesce, and then do a cumsum on the diff >= 20 condition, which increments the id whenever diff reaches the threshold:
library(dplyr)
df %>%
  group_by(player) %>%
  mutate(interaction = cumsum(coalesce(diff, 0L) >= 20) + 1)
# Source: local data frame [10 x 4]
# Groups: player [3]
# player event diff interaction
# <fctr> <fctr> <int> <dbl>
# 1 A x NA 1
# 2 A y 2 1
# 3 A z 240 2
# 4 A w 3 2
# 5 A x 9 2
# 6 B x NA 1
# 7 B y 3 1
# 8 B z 120 2
# 9 C x NA 1
# 10 C x 8 1
We can also use base R to get the expected output
df1$interaction <- with(df1, ave(diff, player, FUN = function(x)
cumsum(x > 20 & !is.na(x))+1))
df1$interaction
#[1] 1 1 2 2 2 1 1 2 1 1

How can I loop a data matrix in R?

I am trying to loop over a data matrix for each separate ID tag, “1”, “2” and “3” (see my data at the bottom). Ultimately I am doing this to transform the X and Y coordinates into a time series with the ts() function, but first I need to build a loop into the function that returns a time series for each separate ID. The looping itself works perfectly fine when I use the following code on a data frame:
for(i in 1:3){
  print(na.omit(xyframe[ID==i,]))
}
Returning the following output:
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
Timestamp X Y ID
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
Timestamp X Y ID
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
Yet when I use the same code to loop over a matrix:
for(i in 1:3){
  print(na.omit(xymatrix[ID==i,]))
}
It returns the following error:
Error in print(na.omit(xymatrix[ID == i, ]) :
(subscript) logical subscript too long
Why does looping over the ID work for the data frame but not for the matrix, and how can I fix it?
Furthermore, I have read that looping requires much more computation than doing the same thing vectorised; would there be a way to do this vector-based?
The data (simplification of the real data):
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
The subsetting xymatrix[ID==i,] doesn't work for a matrix. Try it this way:
for(i in 1:3){
  print(na.omit(xymatrix[xymatrix[,'ID'] == i,]))
}
In general, if you want to apply a function to a data frame, split by some factor, then you should be using one of the apply family of functions in combination with split.
Here's some reproducible sample data.
n <- 20
some_data <- data.frame(
  x = sample(c(1:5, NA), n, replace = TRUE),
  y = sample(c(letters[1:5], NA), n, replace = TRUE),
  grp = gl(3, 1, length = n)
)
If you want to print out the rows with no missing values, split by each ID level, then you want something like this.
lapply(split(some_data, some_data$grp), na.omit)
or more concisely using the plyr package.
library(plyr)
dlply(some_data, .(grp), na.omit)
Both methods return output like this
# $`1`
# x y grp
# 1 2 d 1
# 4 3 e 1
# 7 3 c 1
# 10 4 a 1
# 13 2 e 1
# 16 3 a 1
# 19 1 d 1
# $`2`
# x y grp
# 2 1 e 2
# 5 3 e 2
# 8 3 b 2
# $`3`
# x y grp
# 6 3 c 3
# 9 5 a 3
# 12 2 c 3
# 15 2 d 3
# 18 4 a 3
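Coming back to the original goal of one time series per ID: a minimal sketch, assuming xyframe has the Timestamp/X/Y/ID columns from the question and that observations arrive every 100 ms (the frequency below is a guess):
xy_ts <- lapply(split(xyframe, xyframe$ID), function(d) {
  d <- na.omit(d)
  ts(as.matrix(d[, c("X", "Y")]), start = 0, frequency = 10) # 10 samples per second
})
xy_ts[["1"]] # the multivariate series for ID 1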
