Mark row before count starts again - r

shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
So I am trying to mark the last row of every shift (the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking that might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather to use some techniques from dplyr. Thanks guys!

Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
or with dplyr use mutate, either grouped by shift:
library(dplyr)
test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
or without grouping, reusing duplicated:
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))
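The approaches above rely on each shift value forming one contiguous block. If shift IDs could recur later in the data, one possible alternative is to look only at where count restarts. The following is a minimal base R sketch under the assumption (not stated in the question) that count always restarts at 1:

```r
# Rebuild the example data as a data frame
shift <- rep(1:3, times = c(8, 10, 7))
count <- c(1:8, 1:10, 1:7)
test  <- data.frame(shift, count)

# A row is "last" when the next row's count is 1, or when there is no next row.
# Appending a 1 to the shifted vector makes the final row count as "last".
test$last <- +(c(test$count[-1], 1) == 1)

which(test$last == 1)  # 8 18 25
```

This never uses the shift column, so it marks the row before every restart even when the same shift label appears more than once.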

Try this one
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows

Related

How to summarize in R the number of first occurrences of a character string in a dataframe column?

I am trying to figure out a fast way to calculate the number of "first times" a specified character appears in a dataframe column, by groups. In this example, I am trying to summarize (sum) the number of first times, for each Period, the State of "X" appears, grouped by ID. I am looking for a fast way to process this because it is to be run against a database of several million rows. Maybe there is a good solution using the data.table package?
Immediately below I illustrate what I am trying to achieve, and at the bottom I post the code for the dataframe called testDF.
Code:
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
We can group by 'ID' first to create the indicator column, then group by 'Period' and summarise
library(dplyr)
testDF %>%
group_by(ID) %>%
mutate(`1stStateX` = row_number() == which(State == "X")[1]) %>%
group_by(Period) %>%
summarise(`1stStateX` = sum(`1stStateX`, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Another option is to slice after grouping by 'ID', get the count, and use complete to fill in the 'Period' values that are missing
library(tidyr)
testDF %>%
group_by(ID) %>%
slice(match('X', State)) %>%
ungroup %>%
count(Period, sort = TRUE ,name = "1stStateX") %>%
complete(Period = unique(testDF$Period),
fill = list(`1stStateX` = 0))
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Or similar option in data.table
library(data.table)
setDT(testDF)[, `1stStateX` := .I == .I[State == 'X'][1],
ID][, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
-output
Period 1stStateX
<int> <int>
1: 1 0
2: 2 1
3: 3 1
4: 4 0
5: 5 0
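For comparison, the same "flag the first X per ID, then sum per Period" logic can be sketched in base R with ave and tapply. This is only an illustration of the idea; the names is_x, first_x and first_by_period are made up here, and no claim is made about its speed on several million rows.

```r
testDF <- data.frame(
  ID     = c(rep(10, 5), rep(50, 5), rep(60, 5)),
  Period = c(1:5, 1:5, 1:5),
  State  = c("A", "B", "X", "X", "X",
             "A", "A", "A", "A", "A",
             "A", "X", "A", "X", "B")
)

is_x <- testDF$State == "X"

# Within each ID, keep TRUE only for the first row where State is "X":
# cumsum(x) equals 1 from the first X until the second X appears.
first_x <- ave(is_x, testDF$ID, FUN = function(x) x & cumsum(x) == 1)

# Count those first occurrences per Period
first_by_period <- tapply(first_x, testDF$Period, sum)
first_by_period
```

IDs that never reach state "X" (ID 50 here) contribute nothing, so no NA handling is needed.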

How to use the lag function correctly in r dplyr?

I get the below incorrect output for the last cell in column reSeq when running the R/dplyr code immediately beneath. The code produces a value of 8 in that last cell of column reSeq, when via the lag() function in the code it should instead produce a 7. What is wrong with my use of the lag() function? Also see image at the bottom that better explains what I am trying to do.
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 4
7 X 0 5 5
8 X 0 6 6
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 7
14 X 3 8 7
15 X 3 9 8
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = eleCnt) %>%
mutate(reSeq = ifelse(
Element == lag(Element)& Group == lag(Group) & Group > 0,
lag(reSeq),
eleCnt)
)
The above is an attempted translation from Excel, as shown in the image below. I am new to R, migrating over from Excel. I am trying to replicate column D, "Target", highlighted in yellow, with the formula to its right. The image shows the correct output, including the desired 7 in cell D17, which I can't replicate with the above R code.
Breaking the derivation of "Target" down into 2 columns, Step1 and Step2, highlighted in yellow and blue in the below image (Step2 below is same as Target in above image)(2 steps is how I got the R code working as shown in one of the solutions):
The below code works. I broke the Excel "Target" calculation into 2 steps in the 2nd image in the OP in order to reflect the step-wise R solution.
library(dplyr)
library(tidyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
ungroup
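A likely reason the first attempt returns 8 rather than 7: inside mutate(), lag(reSeq) is evaluated once over the whole column as it stood before the assignment (at that point a copy of eleCnt), so a row never sees a value that has just been carried forward into the row above it. An Excel formula, by contrast, reads the already-updated cell above. A small sketch with made-up vectors standing in for the last three rows (ele_cnt and same_prev are illustrative names):

```r
library(dplyr)

# Stand-ins for the last three rows of the example
ele_cnt   <- c(7L, 8L, 9L)
same_prev <- c(FALSE, TRUE, TRUE)  # TRUE: row belongs with the row above it

# Vectorized: lag() shifts the ORIGINAL vector once, so row 3 receives the
# original value of row 2 (8), not the carried-forward value (7).
vectorized <- ifelse(same_prev, lag(ele_cnt), ele_cnt)

# Sequential, spreadsheet-like: each row reads the already-updated row above.
sequential <- ele_cnt
for (i in seq_along(sequential)) {
  if (same_prev[i]) sequential[i] <- sequential[i - 1]
}

vectorized  # 7 7 8
sequential  # 7 7 7
```

The fill() step in the working solution gives the sequential behaviour without a loop, because it propagates the last non-missing value down through every following NA.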

How to count observations between two rows based on condition in R?

I am trying to create a variable for a data frame in which I count the number of observations between two observations that meet a criterion. Here it is counting the number of games since the last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),win = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized in a vector, the result should be (setting a Not Applicable for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily and create it as a variable in the data.frame using dplyr (or any other suitable approach)
I am not quite sure why the first value should be NA, because the elapsed time since the last "win" is 0 and not NA.
For purely logical reasons, I would take the following approach:
seq = with(df, ave(win, cumsum(win == 1), FUN = seq_along)-1)
So you get the cumulative number of games since the last win as follows:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still aim for your described result with a little data handling you can achieve it with this:
append(NA, head(seq, -1))
It is not nice, but it works ;)
With {tidyverse}, try:
library(tidyverse)
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
win = c(1,0,0,0,1,1,0,1,0,0,1))
df %>%
group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
mutate(counter = row_number(),
counter = if_else(win == 1, true = 0L, false = counter)) %>%
ungroup() %>%
group_by(player) %>%
mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
ungroup() %>%
select(-group)
player win counter
<dbl> <dbl> <int>
1 10 1 NA
2 10 0 1
3 10 0 2
4 10 0 3
5 10 1 0
6 10 1 0
7 10 0 1
8 10 1 0
9 10 0 1
10 10 0 2
11 10 1 0
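The tidyverse output above differs from the vector requested in the question (NA,0,1,2,3,0,0,1,0,1,2). One way to reproduce the requested vector with dplyr is to count rows since the most recent win and then shift the result down by one row. This is a sketch under the assumption that each player's first row is a win, as in the example; the column names run, since and games_since_win are made up for illustration.

```r
library(dplyr)

df <- data.frame(player = rep(10, 11),
                 win    = c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1))

res <- df %>%
  group_by(player) %>%
  mutate(run = cumsum(win == 1)) %>%          # a new run starts at every win
  group_by(player, run) %>%
  mutate(since = row_number() - 1L) %>%       # 0 on the win itself, then 1, 2, ...
  group_by(player) %>%
  mutate(games_since_win = lag(since)) %>%    # value as of the previous game
  ungroup() %>%
  select(-run, -since)

res$games_since_win  # NA 0 1 2 3 0 0 1 0 1 2
```

Grouping by player before lag() keeps the NA at the start of every player's history rather than only at the top of the data frame.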

Find last occurrence of unique values in a column and alter value in R

I have a data frame as below
a b
5 0
5 0
5 0
6 0
6 0
I require to edit the column b and change it to one, at the last instance of each unique value of a. Example expected output is,
a b
5 0
5 0
5 1
6 0
6 1
I'm looking for a more efficient solution than using apply() to extract the row number and then traversing the dataframe to change the value, as my dataframe is large.
Multiple ways to do this
library(dplyr)
df %>%
group_by(a) %>%
mutate(b = if_else(row_number() == n(), 1L ,b))
# a b
# <int> <dbl>
#1 5 0
#2 5 0
#3 5 1
#4 6 0
#5 6 1
Same using ave
with(df, ave(b, a, FUN = function(x) ifelse(seq_along(x) == length(x), 1, x)))
EDIT
If the columns are stored as characters, we need to convert them to numeric first and then use if_else
df %>%
mutate_all(as.numeric) %>%
group_by(a) %>%
mutate(b = if_else(row_number() == n(), 1 ,b))
OR just use ifelse as it does not depend on strict type checking
df %>%
group_by(a) %>%
mutate(b = ifelse(row_number() == n(), 1 ,b))
Use duplicated and set fromLast to be TRUE so that you start looking from the end of a.
with(df1, replace(b, !duplicated(a, fromLast = TRUE), 1))
#[1] 0 0 1 0 1
You could do a join on the last row:
library(data.table)
setDT(DT)
DT[.(unique(a)), on=.(a), mult="last", b := 1]
a b
1: 5 0
2: 5 0
3: 5 1
4: 6 0
5: 6 1
The syntax is x[i, on=, j].
It looks up each row of i in x using join conditions on=.
When there are multiple matches for a row of i, it takes the last one.
In j, we are updating b in x on the matched rows.
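The join above assumes a data.table named DT already exists. A self-contained sketch of the same update join, with toy data built from the question's table and a cross-check against a grouped row-index idiom, might look like this:

```r
library(data.table)

# Toy data matching the question; DT is constructed here because the
# answer above takes it as given.
DT <- data.table(a = c(5, 5, 5, 6, 6), b = 0)

# Look up each distinct value of a in DT, keep only the last matching row,
# and set b to 1 on those rows (by reference).
DT[.(unique(a)), on = .(a), mult = "last", b := 1]

# Cross-check: .I[.N] is the row number of the last row in each group of a.
last_rows <- DT[, .I[.N], by = a]$V1

DT$b       # 0 0 1 0 1
last_rows  # 3 5
```

Both routes identify rows 3 and 5; the join form updates in place without first materialising the row indices.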

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number()))) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3, fill = "extend")) %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package (zoo):
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because partial = TRUE, rows with fewer than three observations available in the window are summed as well.
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
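The cumulative-sum idea can also produce the whole column on its own: a rolling sum over the last k values equals the cumulative sum minus the cumulative sum k rows earlier. The base R sketch below uses a hypothetical helper, roll_sum_partial, which is not part of zoo or dplyr, and adds a second made-up ID with only two rows to show that short groups are handled.

```r
# Rolling sum over the last k values, using shorter windows at the start.
# roll_sum_partial is an illustrative helper, not a zoo or dplyr function.
roll_sum_partial <- function(x, k = 3) {
  cs <- cumsum(x)
  # Cumulative sum as of k positions earlier; 0 where no such position exists
  shifted <- c(rep(0, min(k, length(x))), head(cs, -k))
  cs - shifted
}

# ID 1 mirrors the question; ID 2 is an invented group with fewer than 3 rows
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2),
                 VAL = c(2, 1, 3, 4, 5, 6))

# Apply the helper within each ID
df$SUM <- ave(df$VAL, df$ID, FUN = roll_sum_partial)
df$SUM  # 2 3 6 8 5 11
```

Because nothing here depends on a minimum group length, groups with one or two rows simply return their running total.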

Resources