I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(F, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
We can get a running count grouped by diff(df$a) < 0. Which are the positions in the vector that are less than their predecessors. We add c(F, ..) to account for the first position. The cumulative sum of that vector creates an index for grouping. The function ave can carry out a function on that index, we use seq_along for a running count. But since it starts at 1, we subtract by one ave(...) - 1 to start from zero.
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(row_number() - 1)
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
that it is without reset.
with reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))
Related
I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag got me close, the figures were not calculating as expected....my ifelse and for loops eventually ended up with length 1 vectors of NA_real_, NA or 1). I have also considered cumsum - but not sure how to frame the range to just the 1s - and have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use :
library(dplyr)
df %>%
group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
mutate(counter = (row_number() - 1) * x) %>%
ungroup %>%
select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps -
Create a new column (x1), replace NA in x with 0 and increment the group value by 1 (using cumsum) whenever x = 0.
For each group subtract the row number with 0 and multiply it by x. This multiplication is necessary because it will help to keep counter as 0 where x = 0 and counter as NA where x is NA.
Welcome #cpanagakos.
In dplyr::lag it's not posibble to use a column that still doesn't exist.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
x == 0 ~ 0,
x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criteria that "resets" the counter, you would need to write a formula that changes the group when you need a reset an then refer to the row_number, that will be restarted at 1 inside the group (like #Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
group_by(x1 = cumsum(!coalesce(x, 0))) %>%
mutate(counter = row_number() - 1) %>%
ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified: because the alternatives are conceptually harder to understand.
The conditional statement is that in any event, if there are two or more consecutive rows with values higher than 1, the group should be deleted.
For example:
Event<- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
Value<- c(1,0,0,0,8,7,1,0,0,0,8,0,0,0,0)
A<- data.frame(Event, Value)
Event Value
1 1
1 0
1 0
1 0
2 8
2 7
2 1
2 0
2 0
2 0
3 8
3 0
3 0
3 0
3 0
In this example the group of event 2 should be deleted because it has two consecutive rows with values higher than 1. So it should looks like:
Event Value
1 1
1 0
1 0
1 0
3 8
3 0
3 0
3 0
3 0
Any suggestion?
We can use rle by groups.
library(dplyr)
A %>%
group_by(Event) %>%
filter(!any(with(rle(Value > 1), lengths[values] > 1)))
#Opposite way using all
#filter(all(with(rle(Value > 1), lengths[values] < 2)))
# Event Value
# <dbl> <dbl>
#1 1 1
#2 1 0
#3 1 0
#4 1 0
#5 3 8
#6 3 0
#7 3 0
#8 3 0
#9 3 0
The same logic can be used in base R :
subset(A, !ave(Value > 1, Event, FUN = function(x)
any(with(rle(x), lengths[values] > 1))))
as well as data.table
library(data.table)
setDT(A)[, .SD[!any(with(rle(Value > 1), lengths[values] > 1))], Event]
Using dplyr
A %>%
group_by(Event) %>%
mutate(consec = if_else(Value > 1, row_number(), 0L),
remove = if_else(consec > 1,"Y","N")) %>%
filter(!any(remove == "Y")) %>%
select(-c("consec","remove"))
A base R approach:
# split the dataframe by event into separate lists, record whether values are > 1 (T/F)
A_split <- split(A$Value > 1, Event)
# for each item in the list, record the number of consecutive T values;
# make T/F vector "keep" with row names corresponding to A$Event
keep <- sapply(A_split, function(x) sum(x[1:length(x) - 1] * x[2:length(x)])) == 0
# convert keep to numeric vector of A$Event values
keep <- as.numeric(names(keep == T))
# subset A based on keep vector
A[A$Event %in% keep, ]
I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(F, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
We can get a running count grouped by diff(df$a) < 0. Which are the positions in the vector that are less than their predecessors. We add c(F, ..) to account for the first position. The cumulative sum of that vector creates an index for grouping. The function ave can carry out a function on that index, we use seq_along for a running count. But since it starts at 1, we subtract by one ave(...) - 1 to start from zero.
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(row_number() - 1)
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
that it is without reset.
with reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))
I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
group_by(ID) %>%
mutate(outP = c(0, abs(diff(Val)) == 1))
We can also a faster option by comparing the previous value with current
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill =Val[1]) - Val)==1) , by = ID]
I would love some help understanding the syntax needed to do a certain calculation in R.
I have a dataframe like this:
a b c
1 1 0
2 1 1
3 1 0
4 2 0
5 2 0
6 3 1
7 3 0
8 3 0
9 4 0
and I want to create a new column "d" that has a value of 1 if (and only if) any of the values in column "c" equal 1 for each group of rows that have the same value in column "b." Otherwise (see rows 4,5 and 9) column "d" gives 0.
a b c d
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 0
5 2 0 0
6 3 1 1
7 3 0 1
8 3 0 1
9 4 0 0
Can this be done with a for loop? If so, any advice on how to write that would be greatly appreciated.
Using data.table
setDT(df)
df[, d := as.integer(any(c == 1L)), b]
Since you asked for a loop:
# adding the result col
dat <- data.frame(dat, d = rep(NA, nrow(dat)))
# iterate over group
for(i in unique(dat$b)){
# chek if there is a one for
# each group
if(any(dat$c[dat$b == i] == 1))
dat$d[dat$b == i] <- 1
else
dat$d[dat$b == i] <- 0
}
of course the data.table solutions is more elegant ;)
To do this in base R (using the same general function as the dat.table method any), you can use ave:
df$d <- ave(cbind(df$c), df$b, FUN=function(i) any(i)==1)