Only Use The First Match For Every N Rows - r

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that contains a 1 where Number holds the first 1 within each block of 3 rows, and a 0 otherwise. For example, this is how I would like the new data.frame to look:
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Every three rows we find the first 1 and populate the column otherwise we place a 0. Thank you.
Hmm, at first glance I thought akrun's answer gave me the solution. However, it is not exactly what I am looking for. Here is what @akrun's solution produces.
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk of three rows; everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.

Here is an option: create a grouping column with gl, then compare row_number() with the index of the first 1 in each group. match returns only the index of the first match, and nomatch = 0 ensures that a group without any 1 gets all zeros.
library(dplyr)
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
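For comparison, the same first-match-per-chunk logic can be sketched in base R. This is only an illustration, assuming the df1 from the question and a chunk size of 3; the helper name grp is introduced here and is not part of the answer above.

```r
# Data from the question's follow-up example
df1 <- data.frame(Number = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0))

# Chunk index: rows 1-3 -> 1, rows 4-6 -> 2, and so on (chunk size 3 is assumed)
grp <- ceiling(seq_len(nrow(df1)) / 3)

# Within each chunk, flag the position returned by match(), i.e. the first 1;
# nomatch = 0 never equals a position, so chunks without a 1 stay all 0
df1$New <- ave(df1$Number, grp, FUN = function(x) {
  as.numeric(seq_along(x) == match(1, x, nomatch = 0))
})

df1$New
# 1 0 0 0 1 0 1 0 0 0 0 0
```

ave() applies the function per chunk and returns the results in the original row order, so no explicit grouping column has to be added to the data frame.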

Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
mutate(New = ifelse(Number == 1 & lag(Number, n = 1L, default = 0) == 0 & lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
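Note that this condition looks at a sliding window of the two preceding rows rather than at fixed blocks of three, so the two readings of the question can disagree. A small base R sketch, assuming the 12-value vector from the question's follow-up (the names sliding and chunked are illustrative only):

```r
x <- c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0)

# Sliding-window reading: a 1 whose two predecessors are both 0
prev1 <- c(0, head(x, -1))      # value one row earlier (0 before the start)
prev2 <- c(0, 0, head(x, -2))   # value two rows earlier
sliding <- as.numeric(x == 1 & prev1 == 0 & prev2 == 0)

# Fixed-chunk reading: first 1 inside each block of three rows
grp <- ceiling(seq_along(x) / 3)
chunked <- ave(x, grp, FUN = function(v) {
  as.numeric(seq_along(v) == match(1, v, nomatch = 0))
})

sliding  # 1 0 0 0 0 0 0 0 0 0 0 0
chunked  # 1 0 0 0 1 0 1 0 0 0 0 0
```

On the original 9-row example both readings give the same result, which is why the difference only shows up with the second data set.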

You can replace every Number value with 0 except for the first occurrence of 1 in each block of 3 rows.
library(dplyr)
df %>%
group_by(gr = ceiling(row_number()/3)) %>%
mutate(New = replace(Number, -which.max(Number), 0)) %>%
#Or to be safe and specific use
#mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
ungroup() %>% select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1

Related

how to cumulative sum variable by unique values and input back in

I'm looking to do the following -- cumulative sum the indicator values and remove the indicators after those days
original:
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 1
1 5 1
1 6 1
2 1 0
2 2 0
2 3 0
2 4 0
2 5 1
2 6 1
and make the new table like this --
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 3
2 1 0
2 2 0
2 3 0
2 4 0
2 5 2
Change all day with indicator == 1 to the first day with indicator == 1
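library(dplyr)
df %>%
  group_by(transaction) %>%
  # move every indicator == 1 row to the first day on which indicator == 1
  mutate(day = case_when(indicator == 0 ~ day,
                         TRUE ~ head(day[indicator == 1], 1))) %>%
  group_by(transaction, day) %>%
  summarise(indicator = sum(indicator)) %>%
  ungroup()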
transaction day indicator
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2
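The same two steps (relabel the days, then sum per day) can also be sketched in base R. This is an illustration only: the data frame is rebuilt from the table in the question, the name first_day is introduced here, and the sketch assumes every transaction has at least one row with indicator == 1.

```r
# Input rebuilt from the question's table
df <- data.frame(
  transaction = rep(1:2, each = 6),
  day         = rep(1:6, times = 2),
  indicator   = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1)
)

# First day with indicator == 1, per transaction
# (assumption: each transaction contains at least one such row)
first_day <- ave(
  ifelse(df$indicator == 1, df$day, NA), df$transaction,
  FUN = function(d) min(d, na.rm = TRUE)
)

# Collapse all indicator == 1 rows onto that first day, then sum per day
df$day <- ifelse(df$indicator == 1, first_day, df$day)
out <- aggregate(indicator ~ transaction + day, data = df, FUN = sum)
out <- out[order(out$transaction, out$day), ]

out$indicator
# 0 0 0 3 0 0 0 0 2
```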
Please try the code below:
library(dplyr)
library(tidyr)  # provides fill()
# note: df1 and df2 are not defined in this post
df <- bind_rows(df1, df2) %>% group_by(transaction) %>%
  mutate(cumsum=cumsum(indicator), cumsum2=ifelse(cumsum==1, day, NA)) %>%
  fill(cumsum2) %>%
  mutate(day=ifelse(!is.na(cumsum2), cumsum2, day)) %>%
  group_by(transaction, day) %>% slice_tail(n=1) %>% select(-cumsum2)
Created on 2023-01-19 with reprex v2.0.2
output
# A tibble: 8 × 4
# Groups: transaction, day [8]
transaction day indicator cumsum
<dbl> <int> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 1 3
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 2
Another approach to try. After grouping by transaction, change indicator to either 0 (same) or the sum of indicator. Finally, keep or filter previous rows where cumall (cumulative all) values for indicator are 0. Using lag will provide the last row containing the sum.
library(tidyverse)
df %>%
group_by(transaction) %>%
mutate(indicator = ifelse(indicator == 0, 0, sum(indicator))) %>%
filter(cumall(lag(indicator, default = 0) == 0))
Output
transaction day indicator
<int> <int> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2

R: Replace string with consecutive 0 less than three with 1

I have a vector like this in R:
dt = data.frame(input=c(0,0,1,1,0,0,1,0,0,0,1,1,1,0,1) )
dt
input
# 1 0
# 2 0
# 3 1
# 4 1
# 5 0
# 6 0
# 7 1
# 8 0
# 9 0
# 10 0
# 11 1
# 12 1
# 13 1
# 14 0
# 15 1
I want to replace each run of consecutive 0s whose length is less than three with 1s, and save the result to a new column.
Update:
I also want the replacement to happen only when the run of fewer than three 0s is sandwiched between 1s. Under this condition, the two 0s in rows 1 and 2 are left unchanged (the same applies to a run at the tail or next to an NA).
For example, I want to output:
input output
# 1 0 0
# 2 0 0
# 3 1 1
# 4 1 1
# 5 0 1
# 6 0 1
# 7 1 1
# 8 0 0
# 9 0 0
# 10 0 0
# 11 1 1
# 12 1 1
# 13 1 1
# 14 0 1
# 15 1 1
How can I write it in the foreach loop? (I have the data with thousands of rows)
Thanks.
Create a grouping column with rleid on the 'input' column, and if the number of rows is less than 3 and all values are 0, replace with 1 or else return input
library(dplyr)
library(data.table)
dt %>%
mutate(new = cumsum(input)) %>%
group_by(grp = rleid(input)) %>%
mutate(output = if(n() <3 & all(input == 0) & all(new > 0)) 1 else input) %>%
ungroup %>%
select(-grp, -new)
Output:
# A tibble: 15 × 2
input output
<dbl> <dbl>
1 0 0
2 0 0
3 1 1
4 1 1
5 1 1
6 0 1
7 1 1
8 0 0
9 0 0
10 0 0
11 1 1
12 1 1
13 1 1
14 0 1
15 1 1
Or use base R with rle
dt$output <- inverse.rle(within.list(rle(dt$input),
values[!values & lengths < 3 & seq_along(values) != 1] <- 1))
dt$output
#[1] 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1
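A variant of the rle idea that also leaves a trailing run of zeros untouched, as the question's update asks. This is a sketch under the assumption that the input contains only 0 and 1 and no NA, so that every interior run of zeros is necessarily flanked by 1s; the name interior is introduced here for illustration.

```r
# Input vector as given in the question
input <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1)

r <- rle(input)
k <- length(r$values)

# A run is "interior" if it is neither the first nor the last run
interior <- seq_len(k) > 1 & seq_len(k) < k

# Flip only short interior runs of zeros to 1
r$values[r$values == 0 & r$lengths < 3 & interior] <- 1

output <- inverse.rle(r)
output
# 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1
```

Because the whole computation works on runs rather than rows, it stays fast for thousands of rows and needs no explicit loop.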
Update after clarification:
We could now ungroup() and check if the sequence is wrapped by a 1 with lag(input==1):
dt %>%
mutate(
x= cumsum(input != lag(input, def = first(input)))
) %>%
group_by(x) %>%
mutate(x = seq_along(input),
x = last(x)) %>%
ungroup() %>%
mutate(output = case_when(input == 0 &
lag(input==1) &
x<=2 ~ 1,
TRUE ~ as.numeric(input))) %>%
select(-x)
Output:
# A tibble: 15 x 2
input output
<dbl> <dbl>
1 0 0
2 0 0
3 1 1
4 1 1
5 1 1
6 0 1
7 1 1
8 0 0
9 0 0
10 0 0
11 1 1
12 1 1
13 1 1
14 0 1
15 1 1
First answer:
Here is a suggestion, although I don't understand rows 1 and 2 in your output: "replace consecutive 0, in which the length is less than three, with 1" also applies to rows 1 and 2.
dt %>%
mutate(
x= cumsum(input != lag(input, def = first(input)))
) %>%
group_by(x) %>%
mutate(x = seq_along(input),
x = last(x)) %>%
mutate(output = case_when(input == 0 & x<=2 ~ 1,
TRUE ~ as.numeric(input))) %>%
ungroup() %>%
select(-x)
input output
<dbl> <dbl>
1 0 1
2 0 1
3 1 1
4 1 1
5 1 1
6 0 1
7 1 1
8 0 0
9 0 0
10 0 0
11 1 1
12 1 1
13 1 1
14 0 1
15 1 1
Update following reformulation of the question: This tidyverse approach simply makes use of case_when().
library(dplyr)
mutate(dt, inputX = case_when(input == 0 &
lag(input) == 1 &
lead(input) == 1 ~ 1,
input == 0 &
lag(input) == 0 &
lag(input, n = 2) == 1 &
lead(input) == 1 ~ 1,
T ~ input))
# input inputX
# 1 0 0
# 2 0 0
# 3 1 1
# 4 1 1
# 5 1 1
# 6 0 1
# 7 1 1
# 8 0 0
# 9 0 0
# 10 0 0
# 11 1 1
# 12 1 1
# 13 1 1
# 14 0 1
# 15 1 1
Previous solution: Having understood the requirements like Tarjae did, a tidyverse option could look as follows.
library(dplyr)
dt %>%
mutate(x = cumsum(input)) %>%
group_by(x) %>%
mutate(y = +(n() %in% 2:3)) %>%
ungroup() %>%
transmute(input = input,
inputX = if_else(y == 1, 1, input))
# # A tibble: 15 x 2
# input inputX
# <dbl> <dbl>
# 1 0 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 0 1
# 7 1 1
# 8 0 0
# 9 0 0
# 10 0 0
# 11 1 1
# 12 1 1
# 13 1 1
# 14 0 1
# 15 1 1

How to get the number of consecutive zeroes from a column in a dataframe

I'm trying to work out how to get the number of consecutive zeroes for a given column for a dataframe.
Here is a dataframe:
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2), value = c(1,0,0,1,0,0,0,0,0,0,4,3))
This would be the desired output:
id value consec
1 1 0
1 0 2
1 0 2
1 1 0
1 0 2
1 0 2
2 0 4
2 0 4
2 0 4
2 0 4
2 4 0
2 3 0
Any ideas on how to achieve this output?
Many thanks
You can do:
data$consec <- with(data, ave(value, value, cumsum(value != 0), id, FUN = length) - (value != 0))
data
id value consec
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0
Here's a base R solution using interaction and rle (run-length encoding):
rlid <- rle(as.numeric(interaction(data$id, data$value)))$lengths
data$consec <- replace(rep(rlid, rlid), data$value != 0, 0)
data
#> id value consec
#> 1 1 1 0
#> 2 1 0 2
#> 3 1 0 2
#> 4 1 1 0
#> 5 1 0 2
#> 6 1 0 2
#> 7 2 0 4
#> 8 2 0 4
#> 9 2 0 4
#> 10 2 0 4
#> 11 2 4 0
#> 12 2 3 0
This dplyr solution will work. Using a cumulative sum, we start a new group every time a value of 1 occurs (use value != 0 instead of value == 1 if a new group should start at any non-zero entry), and within each of these groups we count the number of zeros:
data %>%
group_by(id) %>%
mutate(flag_0 = cumsum(value == 1)) %>%
group_by(id, flag_0) %>%
mutate(conseq = ifelse(value == 0, sum(value == 0), 0)) %>%
ungroup()
# A tibble: 12 x 4
id value flag_0 conseq
<dbl> <dbl> <int> <dbl>
1 1 1 1 0
2 1 0 1 2
3 1 0 1 2
4 1 1 2 0
5 1 0 2 2
6 1 0 2 2
7 2 0 0 4
8 2 0 0 4
9 2 0 0 4
10 2 0 0 4
11 2 4 0 0
12 2 3 0 0
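The grouping idea can also be sketched in base R with ave(), here starting a new group at every non-zero value rather than only at a 1. The names run_id and is_zero are assumptions made for this illustration.

```r
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2),
                   value = c(1,0,0,1,0,0,0,0,0,0,4,3))

is_zero <- as.numeric(data$value == 0)

# Within each id, increase the counter whenever a non-zero value appears,
# so each run of zeros shares a counter value with the non-zero before it
run_id <- ave(as.numeric(data$value != 0), data$id, FUN = cumsum)

# Count the zeros per (id, run_id) group and keep the count only on zero rows
data$consec <- ave(is_zero, data$id, run_id, FUN = sum) * is_zero

data$consec
# 0 2 2 0 2 2 4 4 4 4 0 0
```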
This tidyverse approach can also do the job
library(tidyverse)
data %>% group_by(id) %>%
mutate(value2 =cumsum(value)) %>% group_by(id, value, value2) %>%
mutate(consec = ifelse(value == 0, n(), 0)) %>%
ungroup() %>% select(-value2)
# A tibble: 12 x 3
id value consec
<dbl> <dbl> <dbl>
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0

Set value to 0 if any of the remaining values is 0

I have a data.frame like this:
dat <- data.frame("ID"=c(rep(1,13),rep(2,5)), "time"=c(seq(1,13),c(seq(1,5))), "value"=c(rep(0,5), rep(1,3), 2, 0, 1, 5, 20, rep(0,2), seq(1:3)))
ID time value
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 2
10 1 10 0
11 1 11 1
12 1 12 5
13 1 13 20
14 2 1 0
15 2 2 0
16 2 3 1
17 2 4 2
18 2 5 3
My goal is to set all values to 0, if among the remaining values there is any other 0 (for each unique ID and sorted by time). That means in the example data, I would like to have 0 in the rows 6:9.
I tried dat %>% group_by(ID) %>% mutate(value2 = ifelse(lead(value, order_by=time)==0, 0, value)) but I would have to run this several times, since it only changes one row at a time (i.e. row 9 first, then row 8, etc.).
A dplyr solution would be preferred, but I'd take anything that works :)
Short explanation: value is the size of a tumor. If the tumor does not grow large, but actually vanishes completely at a later time, it was most likely an irrelevant encapsulation, hence should be coded as "zero tumor".
I am not sure whether this is your desired output, but maybe it can be useful to you:
dat %>%
group_by(ID) %>%
arrange(-time) %>%
mutate(value = if_else(cumsum(value == 0) > 0, 0, value)) %>%
arrange(ID, time)
ID time value
<dbl> <int> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 0
9 1 9 0
10 1 10 0
11 1 11 1
12 1 12 5
13 1 13 20
14 2 1 0
15 2 2 0
16 2 3 1
17 2 4 2
18 2 5 3
Basically, I first put the observations in descending order of time. Then I check whether a zero has already been seen in value (cumsum(value == 0) > 0). If so, I set the value to zero.
Finally, I put the observations in correct order again.
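The reverse-cumulative-sum step can be illustrated without sorting the data frame at all. The following base R sketch assumes the rows are already ordered by time within each ID; the names zero_at_or_after and value2 are introduced here for illustration.

```r
dat <- data.frame(
  ID    = c(rep(1, 13), rep(2, 5)),
  time  = c(1:13, 1:5),
  value = c(rep(0, 5), rep(1, 3), 2, 0, 1, 5, 20, rep(0, 2), 1:3)
)

# For each ID: reverse the vector, count zeros cumulatively, reverse back.
# The result is 1 wherever a zero occurs at this row or at any later row.
zero_at_or_after <- ave(dat$value, dat$ID, FUN = function(v) {
  rev(cumsum(rev(v == 0))) > 0
})

dat$value2 <- ifelse(zero_at_or_after == 1, 0, dat$value)

dat$value2
# 0 0 0 0 0 0 0 0 0 0 1 5 20 0 0 1 2 3
```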
If you do not want to order and reorder the data you can use the following code, which relies on the same logic but is a bit more difficult to read:
dat %>%
group_by(ID) %>%
arrange(ID, time) %>%
mutate(value = if_else(cumsum(value == 0) < sum(value == 0), 0, value))
Or a bit more efficient without if_else:
dat %>%
group_by(ID) %>%
arrange(ID, time) %>%
mutate(value = value * (cumsum(value == 0) >= sum(value == 0)))
One way could be to find the indices of the first and last occurrences of 0 and replace everything in between.
library(dplyr)
dat %>%
group_by(ID) %>%
mutate(value = replace(value, between(row_number(), which.max(value == 0), tail(which(value == 0), 1)), 0))
# A tibble: 18 x 3
# Groups: ID [2]
ID time value
<dbl> <int> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 0
9 1 9 0
10 1 10 0
11 1 11 1
12 1 12 5
13 1 13 20
14 2 1 0
15 2 2 0
16 2 3 1
17 2 4 2
18 2 5 3
With data.table you can calculate fields with the data in a certain order without actually reordering the data frame, which is useful here:
library(data.table)
setDT(dat)
dat[order(-time), value := fifelse(cumsum(value == 0) > 0, 0, value), ID]
dat
# ID time value
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 0
# 4: 1 4 0
# 5: 1 5 0
# 6: 1 6 0
# 7: 1 7 0
# 8: 1 8 0
# 9: 1 9 0
# 10: 1 10 0
# 11: 1 11 1
# 12: 1 12 5
# 13: 1 13 20
# 14: 2 1 0
# 15: 2 2 0
# 16: 2 3 1
# 17: 2 4 2
# 18: 2 5 3
You can use accumulate(..., .dir = "backward") in purrr
library(dplyr)
library(purrr)
dat %>%
group_by(ID) %>%
arrange(time, .by_group = T) %>%
mutate(value2 = accumulate(value, ~ if(.y == 0) 0 else .x, .dir = "backward")) %>%
ungroup()
# A tibble: 18 x 4
ID time value value2
<dbl> <int> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 0 0
6 1 6 1 0
7 1 7 1 0
8 1 8 1 0
9 1 9 2 0
10 1 10 0 0
11 1 11 1 1
12 1 12 5 5
13 1 13 20 20
14 2 1 0 0
15 2 2 0 0
16 2 3 1 1
17 2 4 2 2
18 2 5 3 3
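A comparable right-to-left accumulation exists in base R as Reduce(..., accumulate = TRUE, right = TRUE). The sketch below is an assumption-laden illustration, not part of the answer above: it presumes rows sorted by time within each ID and introduces the argument names x (current value) and acc (result accumulated from the later rows).

```r
dat <- data.frame(
  ID    = c(rep(1, 13), rep(2, 5)),
  time  = c(1:13, 1:5),
  value = c(rep(0, 5), rep(1, 3), 2, 0, 1, 5, 20, rep(0, 2), 1:3)
)

# Walk each ID's values from the last row to the first:
# once the accumulated result is 0, every earlier value becomes 0 as well
dat$value2 <- ave(dat$value, dat$ID, FUN = function(v) {
  Reduce(function(x, acc) if (acc == 0) 0 else x,
         v, accumulate = TRUE, right = TRUE)
})

dat$value2
# 0 0 0 0 0 0 0 0 0 0 1 5 20 0 0 1 2 3
```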

Constructing variable lags based on additional condition

I want to create a lagged variable based on the following additional condition and operations:
When the lag (previous row) of the variable (day_active) is 1, it should also take the lag of the variable n_wins
When the lag (previous row) of day_active is 0, it should just repeat the value of n_wins of the previous row as long as day_active remains 0.
Let's assume we observe a game player for ten days. day_active indicates if he was active on that day and n_wins indicates the number of games he won.
Example dataset:
da = data.frame(day = c(1,2,3,4,5,6,7,8,9,10), day_active = c(1,1,0,0,1,1,0,0,1,1), n_wins = c(2,3,0,0,1,0,0,0,0,1))
da
day day_active n_wins
1 1 1 2
2 2 1 3
3 3 0 0
4 4 0 0
5 5 1 1
6 6 1 0
7 7 0 0
8 8 0 0
9 9 1 0
10 10 1 1
This is how it should look after the transformation:
da2 = data.frame(day = c(1,2,3,4,5,6,7,8,9,10), day_active = c(1,1,0,0,1,1,0,0,1,1), n_wins = c(2,3,0,0,1,0,0,0,0,1), lag_n_wins = c(NA,2,3,3,3,1,0,0,0,0))
da2
day day_active n_wins lag_n_wins
1 1 1 2 NA
2 2 1 3 2
3 3 0 0 3
4 4 0 0 3
5 5 1 1 3
6 6 1 0 1
7 7 0 0 0
8 8 0 0 0
9 9 1 0 0
10 10 1 1 0
We can create a grouping column from 'day_active' by taking the cumulative sum of the logical vector day_active == 1. Within each group, if all n_wins values are 0 we keep them as they are; otherwise we replace the zeros with NA and fill each NA with the previous non-NA value using na.locf0 (from zoo). Finally, we ungroup and take the lag of the resulting column.
library(dplyr)
da %>%
group_by(grp = cumsum(day_active == 1)) %>%
mutate(lag_n_wins = zoo::na.locf0(if(all(n_wins == 0)) n_wins
else na_if(n_wins, 0)) ) %>%
ungroup %>%
mutate(lag_n_wins = lag(lag_n_wins)) %>%
select(-grp)
# A tibble: 10 x 4
# day day_active n_wins lag_n_wins
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 NA
# 2 2 1 3 2
# 3 3 0 0 3
# 4 4 0 0 3
# 5 5 1 1 3
# 6 6 1 0 1
# 7 7 0 0 0
# 8 8 0 0 0
# 9 9 1 0 0
#10 10 1 1 0
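The two rules stated in the question can also be written as an explicit loop, which may be easier to check against the expected result. This is a minimal sketch using the example data from the question; it is a direct transcription of the rules, not a replacement for the vectorised answer above.

```r
da <- data.frame(day = 1:10,
                 day_active = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
                 n_wins     = c(2, 3, 0, 0, 1, 0, 0, 0, 0, 1))

lag_n_wins <- rep(NA_real_, nrow(da))   # the first row has no predecessor

for (i in seq_len(nrow(da))[-1]) {
  lag_n_wins[i] <- if (da$day_active[i - 1] == 1) {
    da$n_wins[i - 1]        # previous day active: take its n_wins
  } else {
    lag_n_wins[i - 1]       # previous day inactive: repeat the carried value
  }
}

da$lag_n_wins <- lag_n_wins
da$lag_n_wins
# NA 2 3 3 3 1 0 0 0 0
```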
