I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
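The question does not show how `d` was created, so here is a minimal sketch that rebuilds it (the column types are an assumption based on the printed table) and exposes the intermediate result of `diff`, which may make both variants easier to follow. The names `steps`, `either` and `decreasing` are illustrative only.

```r
# Rebuild the example data (column types assumed from the printed table)
d <- data.frame(ID = rep("a", 6), Val = c(8, 7, 5, 4, 3, 1))

# diff() returns the change from each row to the next,
# so it is one element shorter than the input
steps <- diff(d$Val)
steps
# [1] -1 -2 -1 -1 -2

# Any step of size 1, up or down; the leading 0 pads the first row
either <- c(0, abs(steps) == 1)

# Only steps of exactly -1 (decreasing consecutive values)
decreasing <- c(0, steps == -1)
```

For this data both vectors are identical because every step of size 1 happens to be a decrease.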
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
group_by(ID) %>%
mutate(outP = c(0, abs(diff(Val)) == 1))
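To see why the grouped version matters, here is a small sketch with hypothetical two-group data (`d2` is not from the question). Without grouping, the difference across the boundary between two IDs is treated like any other step.

```r
# Hypothetical data: the last value of "a" (3) and the first of "b" (4) differ by 1
d2 <- data.frame(ID = c("a", "a", "a", "b", "b"), Val = c(5, 4, 3, 4, 9))

# Ungrouped: the step across the a/b boundary is flagged as consecutive
ungrouped <- c(0, abs(diff(d2$Val)) == 1)
ungrouped
# [1] 0 1 1 1 0

# Grouped with ave(): the first row of every ID is padded with 0
grouped <- ave(d2$Val, d2$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
grouped
# [1] 0 1 1 0 0
```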
We can also use a faster option that compares the previous value with the current one (note that this flags decreasing steps only):
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill =Val[1]) - Val)==1) , by = ID]
I have the following dummy dataframe:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace only the first non-zero value in column b. Call the row that meets this criterion i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic so I cannot directly point to a specific row, it has to meet the criteria of being the first row not equal to zero in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
ind <- which(t$b != 0)[1L]
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
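Since the question says the dataset is dynamic, it may be worth guarding against two edge cases: a column with no non-zero value, and a first non-zero value with fewer than two rows after it. The helper below is a hypothetical sketch (the name `replace_first_nonzero` and the choice to leave the vector unchanged in those cases are assumptions, not part of the original answer).

```r
# Hypothetical helper: replace the first non-zero element of x
# with the sum of the two elements that follow it.
replace_first_nonzero <- function(x) {
  ind <- which(x != 0)[1L]
  # Leave x unchanged if there is no non-zero value,
  # or if fewer than two values follow the first non-zero one
  if (is.na(ind) || ind + 2L > length(x)) {
    return(x)
  }
  x[ind] <- x[ind + 1L] + x[ind + 2L]
  x
}

t <- data.frame(a = c(0, 0, 2, 4, 5), b = c(0, 0, 4, 6, 5))
t$b <- replace_first_nonzero(t$b)
t$b
# [1]  0  0 11  6  5
```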
Here is a roundabout way of getting there with a combination of group_by() and mutate():
library(tidyverse)
t %>%
mutate(
b_cond = b != 0,
row_number = row_number()
) %>%
group_by(b_cond) %>%
mutate(
min_row_number = row_number == min(row_number),
b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
) %>%
ungroup() %>%
select(a, b) # optional, to get back to original columns
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
The condition is that, within any event, if two or more consecutive rows have values higher than 1, the whole group should be deleted.
For example:
Event<- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
Value<- c(1,0,0,0,8,7,1,0,0,0,8,0,0,0,0)
A<- data.frame(Event, Value)
Event Value
1 1
1 0
1 0
1 0
2 8
2 7
2 1
2 0
2 0
2 0
3 8
3 0
3 0
3 0
3 0
In this example the group for event 2 should be deleted because it has two consecutive rows with values higher than 1. So the result should look like:
Event Value
1 1
1 0
1 0
1 0
3 8
3 0
3 0
3 0
3 0
Any suggestion?
We can use rle by groups.
library(dplyr)
A %>%
group_by(Event) %>%
filter(!any(with(rle(Value > 1), lengths[values] > 1)))
#Opposite way using all
#filter(all(with(rle(Value > 1), lengths[values] < 2)))
# Event Value
# <dbl> <dbl>
#1 1 1
#2 1 0
#3 1 0
#4 1 0
#5 3 8
#6 3 0
#7 3 0
#8 3 0
#9 3 0
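As a sketch of what the `rle` condition evaluates inside one group, here it is applied by hand to the values of event 2 from the example (the names `value`, `runs` and `has_consecutive` are illustrative).

```r
# Values of Event 2 from the example
value <- c(8, 7, 1, 0, 0, 0)

# rle() collapses the logical vector into runs of identical values
runs <- rle(value > 1)
runs$lengths
# [1] 2 4
runs$values
# [1]  TRUE FALSE

# Keep only the lengths of the TRUE runs; a length above 1 means
# at least two consecutive rows exceed 1
has_consecutive <- any(runs$lengths[runs$values] > 1)
has_consecutive
# [1] TRUE
```

Because `has_consecutive` is `TRUE` for event 2, negating it in `filter()` drops every row of that group.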
The same logic can be used in base R :
subset(A, !ave(Value > 1, Event, FUN = function(x)
any(with(rle(x), lengths[values] > 1))))
as well as data.table
library(data.table)
setDT(A)[, .SD[!any(with(rle(Value > 1), lengths[values] > 1))], Event]
Using dplyr, flag rows where both the current and the previous Value exceed 1:
library(dplyr)
A %>%
group_by(Event) %>%
mutate(consec = Value > 1 & lag(Value, default = 0) > 1) %>%
filter(!any(consec)) %>%
select(-consec)
A base R approach:
# split Value > 1 (TRUE/FALSE) into one logical vector per event
A_split <- split(A$Value > 1, A$Event)
# for each event, count the adjacent pairs that are both TRUE;
# "keep" is a named logical vector (names are the Event values), TRUE when there are none
keep <- sapply(A_split, function(x) sum(head(x, -1) & tail(x, -1))) == 0
# convert keep to a numeric vector of the Event values to retain
keep <- as.numeric(names(keep)[keep])
# subset A based on the keep vector
A[A$Event %in% keep, ]
I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(F, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
This gives a running count within groups defined by diff(df$a) < 0, which marks the positions that are less than their predecessors. We prepend F with c(F, ...) to account for the first position, which has no predecessor. The cumulative sum of that vector creates a grouping index that increases by one at every drop. ave applies a function within each group of that index; here we use seq_along for a running count. Since seq_along starts at 1, we subtract 1 from the result, ave(...) - 1, so that each run starts from zero.
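The intermediate vectors can be inspected one at a time; the following sketch splits the one-liner into named steps (`drops`, `grp` and `run` are names chosen here for illustration).

```r
df <- data.frame(a = c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1))

# TRUE wherever a value drops below its predecessor; FALSE is prepended for row 1
drops <- c(FALSE, diff(df$a) < 0)

# Each drop starts a new group
grp <- cumsum(drops)
grp
# [1] 0 0 0 0 1 1 1 1 1 1 1 2 2 3

# Count positions within each group, then shift so that each run starts at 0
run <- ave(df$a, grp, FUN = seq_along) - 1
run
# [1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
```

Note that the grouping resets only on a strict decrease, so two equal neighbouring values would keep the count running.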
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(grp = cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(run = row_number() - 1) %>%
ungroup()
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
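The `cummax` step in that function is compact, so here is a sketch that unpacks it on the same vector. The variable `offset` is a name introduced here for illustration; it holds the running total reached at the most recent non-increase.

```r
a <- c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1)

test <- diff(a) > 0          # TRUE where a value exceeds its predecessor
y <- cumsum(test)            # running total of increases, never reset
offset <- cummax(y * !test)  # total reached at the most recent non-increase
offset
# [1]  0  0  0  3  3  3  3  3  3  3  9  9 10

result <- c(0, y - offset)   # subtracting the offset restarts the count
result
# [1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
```

Unlike the `ave` version, this resets on any non-increase, including ties, which matches the "greater than their predecessor" wording of the question.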
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
That is the version without reset.
With reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))
Let's say I have data like this:
group value
1 0
1 0
1 0
2 1
2 0
3 1
3 0
4 1
4 1
How would I iterate through all values of "group" to check whether the values within each group are all equal? I want a dataset that includes ONLY the groups whose values are not all identical. I'm not sure of an easy way to do this that avoids a for loop.
You can do:
tapply(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L
# 1 2 3 4
# FALSE TRUE TRUE FALSE
To subset the table, write the same with ave:
DF[ ave(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L, ]
# group value
# 4 2 1
# 5 2 0
# 6 3 1
# 7 3 0
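The reason `ave` works for subsetting where `tapply` does not is that `ave` returns one value per row rather than one per group. A minimal sketch, rebuilding `DF` from the table in the question (the numeric column types and the name `n_per_row` are assumptions):

```r
# Rebuild the example data from the question
DF <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                 value = c(0, 0, 0, 1, 0, 1, 0, 1, 1))

# Number of distinct values in each row's group, repeated for every row
n_per_row <- ave(DF$value, DF$group, FUN = function(x) length(unique(x)))
n_per_row
# [1] 1 1 1 2 2 2 2 1 1

# A row-length logical vector can index the rows directly
res <- DF[n_per_row > 1, ]
```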
With packages, the latter step looks like...
library(data.table)
setDT(DF)[, if (uniqueN(value) > 1L) .SD, by=group]
# or
library(dplyr)
DF %>% group_by(group) %>% filter(n_distinct(value) > 1L)
Here is another option using table
tbl <- rowSums(table(df1)>0)>1
subset(df1, group %in% names(tbl)[tbl])
# group value
#4 2 1
#5 2 0
#6 3 1
#7 3 0