How to count observations between two rows based on condition in R?

I am trying to create a variable in a data frame that counts the number of observations between two observations that meet a criterion. Here, it counts the number of games since the last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),win = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized as a vector, the result should be (with NA for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily and create it as a variable in the data.frame using dplyr (or any other suitable approach).

I am not quite sure why the first value should be NA, because the elapsed time since the last "win" is 0, not NA.
For purely logical reasons, I would take the following approach:
seq = with(df, ave(win, cumsum(win == 1), FUN = seq_along) - 1)  # position within each block starting at a win, minus 1
So you get the cumulative count of games since the last win as follows:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still aim for your described result, with a little data handling you can achieve it with this:
append(NA, seq[-length(seq)])
It is not nice, but it works ;)
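If you want the whole thing in a single dplyr pipeline, here is a minimal sketch of the same cumsum idea (the column names grp and since_win are just illustrative):
library(dplyr)

df %>%
  mutate(grp = cumsum(win == 1)) %>%          # a new block starts at every win
  group_by(grp) %>%
  mutate(since_win = row_number() - 1) %>%    # games elapsed within the block
  ungroup() %>%
  mutate(since_win = lag(since_win)) %>%      # shift down one row to get the NA-first version
  select(-grp)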

With {tidyverse}, try:
library(tidyverse)
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
win = c(1,0,0,0,1,1,0,1,0,0,1))
df %>%
  group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
  mutate(counter = row_number(),
         counter = if_else(win == 1, true = 0L, false = counter)) %>%
  ungroup() %>%
  group_by(player) %>%
  mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
  ungroup() %>%
  select(-group)
player win counter
<dbl> <dbl> <int>
1 10 1 NA
2 10 0 1
3 10 0 2
4 10 0 3
5 10 1 0
6 10 1 0
7 10 0 1
8 10 1 0
9 10 0 1
10 10 0 2
11 10 1 0

Related

How to count subsets of elements in a data frame using base R or dplyr?

I would like to add a column, call it "grpRnk", to the data frame nCode below. It should rank each group (a group being rows where Group != 0) against the other groups in the data frame, with the top rank going to the grouped row with the lowest associated nmCnt and the rank descending from there as nmCnt increases for the other grouped rows. The desired result is described in the column manually added ("grpRnk ADD") on the far right of the data frame output below:
> print.data.frame(nCode)
Name Group nmCnt seqBase subGrp grpRnk ADD
1 B 0 1 1 0 0 since Group = 0
2 R 0 1 1 0 0 since Group = 0
3 R 1 2 2 1 2 since it is 2nd place among the Groups, with its nmCnt > the nmCnt for the highest ranking Group in row 6
4 R 1 3 2 2 2 same reason as above
5 B 0 2 2 0 0 since Group = 0
6 X 2 1 1 1 1 since it is 1st place among the Groups, with its nmCnt of 1 being the lowest among all the groups
7 X 2 2 1 2 1 same reason as above
Any recommendations for how to do this in base R or dplyr?
Below is the code that generates the above (except for the column manually added on the right):
library(dplyr)
library(stringr)
library(tidyr)   # needed for fill()
myDF5 <-
data.frame(
Name = c("B","R","R","R","B","X","X"),
Group = c(0,0,1,1,0,2,2)
)
nCode <- myDF5 %>%
group_by(Name) %>%
mutate(nmCnt = row_number()) %>%
ungroup() %>%
mutate(seqBase = ifelse(Group == 0 | Group != lag(Group), nmCnt,0)) %>%
mutate(seqBase = na_if(seqBase, 0)) %>%
group_by(Name) %>%
fill(seqBase) %>%
mutate(seqBase = match(seqBase, unique(seqBase))) %>%
ungroup %>%
mutate(subGrp = as.integer(ifelse(Group > 0, sapply(1:n(), function(x) sum(Name[1:x]==Name[x] & Group[1:x] == Group[x])),0)))
print.data.frame(nCode)
Here's a dplyr solution. However, instead of filling non-group rows with 0 as per my OP, this code drops in NA for non-groups, which works better for what this is intended for. The dplyr slice() function used in my solution is new to me and is very useful; I found out about it in the post dplyr filter: Get rows with minimum of variable, but only the first if multiple minima
grpRnk <- nCode %>% select(Name,Group,nmCnt) %>%
filter(Group > 0) %>%
group_by(Name) %>%
slice(which.min(Group)) %>%
arrange(nmCnt) %>%
select(-nmCnt)
grpRnk$grpRnk <- as.integer(row.names(grpRnk))
left_join(nCode,grpRnk)
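As an alternative, here is a compact sketch of the same ranking using dense_rank(), assuming each group should be ranked by the smallest nmCnt among its rows (non-group rows stay NA):
library(dplyr)

nCode %>%
  group_by(Group) %>%
  mutate(grpRnk = if_else(Group > 0, min(nmCnt), NA_integer_)) %>%  # smallest nmCnt per group, NA elsewhere
  ungroup() %>%
  mutate(grpRnk = dense_rank(grpRnk))                               # rank the groups by that value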

How to use the lag function correctly in R dplyr?

I get the incorrect output below for the last cell in column reSeq when running the R/dplyr code immediately beneath it. The code produces a value of 8 in that last cell of column reSeq, when, via the lag() function in the code, it should instead produce a 7. What is wrong with my use of the lag() function? Also see the image at the bottom that better explains what I am trying to do.
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 4
7 X 0 5 5
8 X 0 6 6
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 7
14 X 3 8 7
15 X 3 9 8
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  mutate(reSeq = eleCnt) %>%
  mutate(reSeq = ifelse(
    Element == lag(Element) & Group == lag(Group) & Group > 0,
    lag(reSeq),
    eleCnt)
  )
The above is an attempted translation from Excel as shown in the image below. I am new to R, migrating over from Excel. I am trying to replicate column D "Target", highlighted in yellow, with the formula to the right. The image shows the correct output, including the desired 7 in cell D17 which I can't replicate with the above R code.
Breaking the derivation of "Target" down into 2 columns, Step1 and Step2, highlighted in yellow and blue in the second image (Step2 is the same as Target in the image above; 2 steps is how I got the R code working, as shown in one of the solutions).
The below code works. The original attempt fails because lag(reSeq) inside a single mutate() refers to the reSeq column as it existed before that mutate (i.e. the values copied from eleCnt), not to the values being updated row by row the way the Excel formula works; that is why row 15 picks up the 8 from row 14's eleCnt instead of the 7. I broke the Excel "Target" calculation into 2 steps in the 2nd image in the OP in order to reflect the step-wise R solution.
library(dplyr)
library(tidyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
ungroup
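Since the Excel formula carries a value forward cell by cell, another option is to make that iteration explicit with purrr::accumulate(). This is only a sketch of that idea (not part of the original answer), and it reproduces the 7 in the last row:
library(dplyr)
library(purrr)

myDF %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  mutate(reSeq = accumulate(
    .x = seq_len(n())[-1],   # rows 2 to n
    .init = eleCnt[1],       # row 1 keeps its own eleCnt
    .f = function(prev, i) {
      # carry the previous value forward while Element and Group repeat and Group > 0
      if (Element[i] == Element[i - 1] && Group[i] == Group[i - 1] && Group[i] > 0) prev else eleCnt[i]
    }
  ))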

Mark row before count starts again

shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
So I am trying to mark the last row of every shift (so the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking maybe that is possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))  # 1 where count equals its shift's maximum
transform(test, last = +!duplicated(shift, fromLast = TRUE))              # 1 on the last occurrence of each shift
or with dplyr use mutate:
library(dplyr)

test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))
Try this one:
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows

Recode matrix values with dplyr based on look-up table

I have a matrix with numerous cases and corresponding answers from a questionnaire. The strongly reduced example below (raw_responses) contains the answers of 10 persons to 5 items. Let us assume that these are multiple choice items with 4 possible answers each. If the item was not processed, the person received the code 9.
raw_responses <- data.frame('id' = 1:10,
'item_1' = sample(c(1:4,9), 10, replace = TRUE),
'item_2' = sample(c(1:4,9), 10, replace = TRUE),
'item_3' = sample(c(1:4,9), 10, replace = TRUE),
'item_4' = sample(c(1:4,9), 10, replace = TRUE),
'item_5' = sample(c(1:4,9), 10, replace = TRUE))
The correct answers are stored in a separate table that reflects the entire test design. Below is again a strongly reduced variant (design) with only item names and the corresponding correct answers.
design <- data.frame('item' = c('item_1','item_2','item_3','item_4','item_5'),
'key' = sample(1:4, 5, replace = TRUE))
Finally, the goal is a table with scored answers. A correct answer is coded with 1, a wrong one with 0 and an "empty" answer with 99. This works for example with the for loop below.
scored_responses <- raw_responses
for(item in colnames(raw_responses)[2:6]) {
scored_responses[, item] <- ifelse(scored_responses[, item] == design[design$item == item, 'key'], 1,
ifelse(scored_responses[, item] == 9, 99, 0))
}
However, I was wondering if this could be done with a more efficient variant using dplyr (including case_when) and possibly purrr. Especially because the very extensive answer table is cleaned up with the help of a longer dplyr pipe, it would be an advantage if the scoring could be built in there.
I thank you in advance for all ideas and hints.
Get the data in long format, join, recode the values and get the data back in wide format.
library(dplyr)
library(tidyr)
raw_responses %>%
  pivot_longer(cols = -id, names_to = 'item') %>%
  left_join(design, by = 'item') %>%
  mutate(value = case_when(value == 9 ~ 99,
                           value == key ~ 1,
                           TRUE ~ 0)) %>%
  select(-key) %>%
  pivot_wider(names_from = 'item')
# A tibble: 10 x 6
# id item_1 item_2 item_3 item_4 item_5
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 99 99 1 0 0
# 2 2 99 99 99 0 0
# 3 3 1 99 0 99 99
# 4 4 0 1 1 99 1
# 5 5 99 0 1 0 1
# 6 6 0 1 0 0 1
# 7 7 0 0 0 1 99
# 8 8 1 99 0 0 0
# 9 9 0 99 99 0 1
#10 10 99 1 99 1 0
Another approach, which avoids reshaping the data to long format, is to use map2_dfc from purrr:
library(purrr)
map2_dfc(raw_responses[-1], design$key,
         ~ case_when(.x == 9 ~ 99,
                     .x == .y ~ 1,
                     TRUE ~ 0))
However, for this answer to work we need to ensure that the columns of raw_responses and the items in design$item are in the same order. In this example they already are; if they are not in the real data, we can align them by doing:
raw_responses[-1] <- raw_responses[-1][as.character(design$item)]
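For completeness, a minimal sketch (the object name scored_responses2 is just illustrative) that keeps the id column next to the scored items, assuming the columns have been aligned as above:
library(dplyr)
library(purrr)

scored_responses2 <- bind_cols(
  raw_responses[1],                      # keep the id column
  map2_dfc(raw_responses[-1], design$key,
           ~ case_when(.x == 9 ~ 99,
                       .x == .y ~ 1,
                       TRUE ~ 0))
)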

Create group variable based on common dates

I have a large data set containing animal ID's and dates. There are two groups within this dataset but there is no grouping variable, so I have to extrapolate who belongs to which group based on the dates they appear to have in common.
Dummy data.
mydf<-data.frame(
Date=sort(rep(seq(as.Date("2012/1/1"),as.Date("2012/1/4"), length.out = 4),5)),
ID = c(1,2,3,4,5,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10))
The other issue I have is that every now and then an ID belonging to group 1 might appear with a date associated with group 2, which is what has thrown off every attempt I've made so far at grouping.
What I need is an output with IDs and a new Group ID, like this:
ID Group
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
1:5 all appear together on the 1st and the 3rd so they are likely to be one group.
6:10 appear on the 2nd and 4th and are likely to be the 2nd group.
ID 5 belongs to group 1, because even though it was observed once on the 2nd with IDs 6:9, it was observed twice, on the 1st and the 3rd, with IDs 1:4, so it's most likely to belong to group 1.
All my attempts have fallen flat. Can anyone offer a solution to this?
Thanks in advance.
EDIT:
I thought we had nailed a solution using Jon's kmeans solution (in the comments below):
mydf_wide <- mydf %>%
  select(ID, Date) %>%
  distinct(ID, Date) %>%
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)
mydf_wide$clusters <- mydf_wide %>%
  kmeans(centers = 2) %>%
  pluck("cluster")
but I'm actually finding the kmeans method not quite getting it right every time. See below:
The groups where certain tags (ID) appear on the same day as each other are fairly easy to spot by eye. There are two groups, one is in the center, and the other group appears on either side. The clustering should be vertical by common dates as in Jon's answer below, but it is clustering across the entire date range. (Apologies for the messy axis labels)
The k-means method has worked on other groups, but it's not consistently able to group by common dates. I think the clustering approach is sensible, but I was wondering if there may be other clustering methods that may cope better than kmeans?
Alternatively, could a filtering method help reduce any background noise and make the kmeans approach more reliable?
Again, very grateful for any and all advice.
Cheers.
My thinking here is that you just assign each Date to a group, then take the average of group for each ID. You could then round to the nearest whole number from there. In this case, average group of ID == 5 would be 1.33
library(dplyr)
mydf %>%
  mutate(group = case_when(
    Date %in% as.Date(c("2012-01-01", "2012-01-03")) ~ 1,
    Date %in% as.Date(c("2012-01-02", "2012-01-04")) ~ 2,
    TRUE ~ NA_real_
  )) %>%
  group_by(ID) %>%
  summarise(likely_group = mean(group) %>% round)
Which gives you the following:
# A tibble: 10 x 2
ID likely_group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
This works as long as there isn't an even split between groups for a single ID. But there isn't currently a way to address this situation with the information provided.
As a general solution, you might consider using k-means as an automatic way to split the data into groups based on similarity to other IDs.
First, I converted the data into wide format so that each ID gets one row. Then I fed that into the base kmeans function to get the clustering output as a list, and used purrr::pluck to extract just the assignment part of that list.
library(tidyverse)
mydf_wide <- mydf %>%
mutate(x = 1) %>%
spread(Date, x, fill = 0)
mydf_wide
# ID 2012-01-01 2012-01-02 2012-01-03 2012-01-04
#1 1 1 0 1 0
#2 2 1 0 1 0
#3 3 1 0 1 0
#4 4 1 0 1 0
#5 5 1 1 1 0
#6 6 0 1 0 1
#7 7 0 1 0 1
#8 8 0 1 0 1
#9 9 0 1 0 1
#10 10 0 0 0 1
clusters <- mydf_wide %>%
kmeans(centers = 2) %>%
pluck("cluster")
clusters
# [1] 2 2 2 2 2 1 1 1 1 1
Here's what that looks like if you add those cluster assignments to the wide data and plot.
mydf_wide %>%
mutate(cluster = clusters) %>%
# ggplot works better with long (tidy) data...
gather(date, val, -ID, -cluster) %>%
filter(val != 0) %>%
arrange(cluster) %>%
ggplot(aes(date, ID, color = as.factor(cluster))) +
geom_point(size = 5) +
scale_y_continuous(breaks = 1:10, minor_breaks = NULL) +
scale_color_discrete(name = "cluster")
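One thing worth checking if the clusters drift, as in the edit above: the ID column is part of the feature matrix passed to kmeans(), so it also contributes to the distances. A minimal sketch that clusters on the date indicator columns only:
clusters <- mydf_wide %>%
  select(-ID) %>%          # drop ID so only the date indicators drive the clustering
  kmeans(centers = 2) %>%
  pluck("cluster")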
