Subsetting when there are n consecutive dummies - r

I have a data frame in which I have created a series of dummy variables and then combined them into a final column. I want to know whether there is a case with 3 consecutive 1's, i.e., is there a way to subset the data frame that gives me rows 3:5 in the following example?
library(tibble)
df <- tibble(
a= c(0, 0, 1, 1, 1, 0, 1, 1)
)
df
# A tibble: 8 x 1
a
<dbl>
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 1

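The data.table package has a handy function, rleid, that assigns a new group id each time the value changes (a run-length id). Grouping by it lets you keep only the runs of 1's that are at least 3 rows long:
library(tidyverse)
df %>%
group_by(grp = data.table::rleid(a)) %>%
# keep runs with at least 3 rows in which every value is 1
filter(n() >= 3 & all(a == 1))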
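If you would rather not depend on data.table, the same idea can be sketched with base R's rle(). This is only an illustration on the example vector from the question; the names runs and keep are made up here.

```r
# Example vector from the question
a <- c(0, 0, 1, 1, 1, 0, 1, 1)

# rle() summarises consecutive runs: their lengths and their values
runs <- rle(a)

# Flag runs of 1's that are at least 3 long, then expand back to one flag per row
keep <- rep(runs$values == 1 & runs$lengths >= 3, times = runs$lengths)

which(keep)
# [1] 3 4 5
```

The logical vector keep can then be used to index the data frame, e.g. df[keep, ].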

Related

R Looping through columns of data frame, filtering by each column

I have a data frame looking at errors in data entry.
It has two sets of variables:
One set is error flags (their names will contain 'Error').
One set is information about the cases which were flagged (for this example, the column 'values').
I would like to loop through the error flag variables, filter the data frame by a value of 1 in each column, and then print.
My problem is that I can't get R to recognize the names of the columns as names to filter by.
I've looked for other examples, and I haven't hit what I need.
In this example, I have a data frame 'test_df', with four variables: error_1-error_3, and values.
I'd like to loop through those three error variables, and filter test_df for rows having a value of 1.
# set up libraries:
library(tidyverse)
library(magrittr)
# Create the data set 'test_df':
test_df <- structure(list(error_1 = c(0, 0, 1, 1), error_2 = c(0, 0, 1,
1), error_3 = c(0, 0, 1, 1), values = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA,
-4L))
# Pull the column names from test_df, retaining only those with 'error' in their name, and print:
names_test_df <- test_df %>%
dplyr::select(.,contains("error")) %>%
names()
test_df
names_test_df
> test_df
error_1 error_2 error_3 values
1 0 0 0 1
2 0 0 0 2
3 1 1 1 3
4 1 1 1 4
> names_test_df
[1] "error_1" "error_2" "error_3"
Here is where the trouble starts - I can't figure out how to feed the elements of names_test_df into functions so that they are recognized as column names in test_df:
test_df %>% dplyr::filter(.,error_1==1)
test_df %>% dplyr::filter(.,as.character(names_test_df[1])==1)
test_df %>% dplyr::filter(.,noquote(names_test_df[1])==1)
> test_df %>% dplyr::filter(.,error_1==1)
error_1 error_2 error_3 values
1 1 1 1 3
2 1 1 1 4
> test_df %>% dplyr::filter(.,as.character(names_test_df[1])==1)
[1] error_1 error_2 error_3 values
<0 rows> (or 0-length row.names)
> test_df %>% dplyr::filter(.,noquote(names_test_df[1])==1)
[1] error_1 error_2 error_3 values
<0 rows> (or 0-length row.names)
I've also played with looping through 'item' in colnames(test_df), and I get the same result.
Could anybody give some guidance on how to do this?
We can convert the string to a symbol with rlang::sym and unquote it with !!
library(dplyr)
test_df %>%
dplyr::filter(., !! rlang::sym(names_test_df[1])==1)
# error_1 error_2 error_3 values
#1 1 1 1 3
#2 1 1 1 4
Or another option with across
test_df %>%
filter(across(all_of(names_test_df[1]), ~ . == 1))
# error_1 error_2 error_3 values
#1 1 1 1 3
#2 1 1 1 4
In base R, you can subset the columns with [. For one column you can do:
test_df[test_df[names_test_df[1]] == 1, ]
# error_1 error_2 error_3 values
#3 1 1 1 3
#4 1 1 1 4
For more than one column, to select the rows where any of the columns contains a 1:
test_df[rowSums(test_df[names_test_df] == 1) > 0, ]
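Since the goal was to loop over every error flag, here is a minimal sketch of that loop using the .data pronoun that dplyr exports. The name flagged_by_col is hypothetical. For context, the original attempts returned zero rows because as.character(names_test_df[1]) == 1 compares the string "error_1" with 1, which is FALSE for every row.

```r
library(dplyr)

test_df <- data.frame(error_1 = c(0, 0, 1, 1),
                      error_2 = c(0, 0, 1, 1),
                      error_3 = c(0, 0, 1, 1),
                      values  = c(1, 2, 3, 4))

# Column names containing "error"
names_test_df <- grep("error", names(test_df), value = TRUE)

# .data[[nm]] looks up the column whose name is stored in the string nm
flagged_by_col <- lapply(names_test_df, function(nm) {
  filter(test_df, .data[[nm]] == 1)
})
names(flagged_by_col) <- names_test_df

# Print the flagged rows for each error column in turn
for (nm in names_test_df) {
  cat("Rows flagged by", nm, "\n")
  print(flagged_by_col[[nm]])
}
```

Collecting the results in a named list first keeps the filtered data frames available for later use instead of only printing them.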

slice(which.max()) with condition

I have the following dataset:
ID, diff
1 -40
1 -21
1 -5
1 1
1 6
1 7
...
The ID variable takes the values 1, 2, 3, 4, 5, ... while diff is a numeric variable. From this dataset, for each ID I want to extract the row whose diff is negative and closest to zero, that is, the row with the largest negative value of diff. In the dataset above, for ID 1 I want to extract the 3rd row, with values (1, -5).
The following code can extract rows where the absolute value is closest to 0:
library(dplyr)
dataset22 = dataset1 %>% group_by(ID) %>% slice(which.min(abs(diff)))
How can I extract the row with a negative number that is closest to zero?
Thanks in advance!
This works:
library(dplyr)
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1),
diff = c(-40, -21, -5, 1, 6, 7))
df %>%
group_by(ID) %>%
filter(diff < 0) %>%
summarise(min_negative_diff = max(diff))
#> # A tibble: 1 x 2
#> ID min_negative_diff
#> <dbl> <dbl>
#> 1 1 -5
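The summarise() call above returns only the value. If the whole row is wanted, as in the original slice(which.min(...)) attempt, one possible variant is to filter to the negative values first and then slice with which.max(). A sketch, where the rows for ID 2 are made-up data added purely for illustration and closest_negative is a hypothetical name:

```r
library(dplyr)

# ID 2 is made-up data added for illustration
df <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 2, 2),
                 diff = c(-40, -21, -5, 1, 6, 7, -3, 2))

closest_negative <- df %>%
  group_by(ID) %>%
  filter(diff < 0) %>%          # keep only negative differences
  slice(which.max(diff)) %>%    # the largest negative value is the one closest to zero
  ungroup()

closest_negative
```

Note that an ID with no negative diff at all drops out of the result entirely.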

Add a new column by mutate involving a conditional

I need to add a new column in dplyr with mutate, involving a conditional. I can't find a way to implement the following scheme in the tidyverse, but I can do it in Excel. That makes me feel like something of a barbarian. Does someone know how to accomplish this in the tidyverse?
The first value of the running count column is 1, no matter what is in the "n" column.
After the first row, here is the conditional. If the n column=1, the running.count output is the running.count value from the row above +1. If the n column=0, the running.count output is the running.count value from the row above +1 only when it is the first 0 after a 1 in the "n" column. Otherwise, it is just the running.count value from the row above.
Here's some toy data with the desired output:
data.frame("n"=c(0,1,0,0,0,0,1,0,1,1),"running.count"=c(1,2,3,3,3,3,4,5,6,7))
We can use rleid from data.table to create the running.count column
library(dplyr)
library(data.table)
df1 %>%
group_by(running.count = rleid(n) ) %>%
mutate(ind = if(all(n==1)) row_number() - 1 else 0) %>%
ungroup %>%
mutate(running.count = rleid(running.count, ind)) %>%
select(-ind)
# A tibble: 10 x 2
# n running.count
# <dbl> <int>
# 1 0 1
# 2 1 2
# 3 0 3
# 4 0 3
# 5 0 3
# 6 0 3
# 7 1 4
# 8 0 5
# 9 1 6
#10 1 7
data
df1 <- structure(list(n = c(0, 1, 0, 0, 0, 0, 1, 0, 1, 1)),
class = "data.frame", row.names = c(NA, -10L))
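The rule in the question can also be read as "the count goes up whenever n is 1, or whenever the row above was a 1". Under that reading, a base R sketch with cumsum() reproduces the desired output; prev and step are illustrative names, not part of the answer above.

```r
n <- c(0, 1, 0, 0, 0, 0, 1, 0, 1, 1)

# Value of n in the row above; the first row is treated as if a 1 preceded it,
# so the count always starts at 1
prev <- c(1, head(n, -1))

# The count goes up when n is 1, or when the row above was a 1
step <- n == 1 | prev == 1

running.count <- cumsum(step)
running.count
# [1] 1 2 3 3 3 3 4 5 6 7
```

Inside a dplyr pipeline the same logic could be written with lag(n, default = 1) in place of prev.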

Determining the first occurrence of a value by group and its position within the group

For each group in the column 'Participants', I would like to know where the value '1' occurs for the first time in the column 'Signal'. The position should be counted within the group.
Here is an example data frame
> dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'), Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
> dfInput
Participants Signal
1 A 0
2 A 1
3 A 1
4 B 0
5 B 0
6 B 0
7 B 1
8 C 1
9 C 0
And here is the output I am looking for:
> dfOutput <-data.frame(Participants=c( 'A','B','C'), RowNumberofFirst1=c(2, 4, 1))
> dfOutput
Participants RowNumberofFirst1
1 A 2
2 B 4
3 C 1
The problem is somewhat similar to this: Find first occurence of value in group using dplyr mutate
Yet, I could not adapt it accordingly, to create my output df
I think this is what you are looking for
library(dplyr)
dfInput %>%
group_by(Participants) %>%
summarise(RowNumberofFirst1 = which(Signal == 1)[1])
Another option, in base R, via aggregate
aggregate(Signal~Participants, dfInput, function(i)which(i == 1)[1])
# Participants Signal
#1 A 2
#2 B 4
#3 C 1
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'),
Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
library(dplyr)
dfInput %>%
group_by(Participants) %>% # for each Participant
summarise(NumFirst1 = min(row_number()[Signal == 1])) # get the minimum number of row where signal equals 1
# # A tibble: 3 x 2
# Participants NumFirst1
# <fct> <int>
# 1 A 2
# 2 B 4
# 3 C 1
In case you want to return the row (i.e. all column values) that you've identified, you can use this:
set.seed(5)
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'),
Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0),
A = sample(c("C","D","F"),9, replace = T),
B = sample(c("N","M","K"),9, replace = T))
library(dplyr)
dfInput %>%
group_by(Participants) %>%
filter(row_number() == min(row_number()[Signal == 1])) %>%
ungroup()
# # A tibble: 3 x 4
# Participants Signal A B
# <fct> <dbl> <fct> <fct>
# 1 A 1 F N
# 2 B 1 D N
# 3 C 1 F M
So, in this case you use filter to return, for each participant, the row that is equal to the minimum row number where Signal is 1.
With tidyverse:
dfInput%>%
group_by(Participants)%>%
mutate(max=cumsum(Signal),
RowNumberofFirst1=row_number())%>%
filter(max==1)%>%
top_n(-1,RowNumberofFirst1)%>%
select(Participants,RowNumberofFirst1)
# A tibble: 3 x 2
# Groups: Participants [3]
Participants RowNumberofFirst1
<fct> <int>
1 A 2
2 B 4
3 C 1
Here is a solution with base R:
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'), Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
# > tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
# A B C
# 2 4 1
If you want a dataframe you can do:
first1 <- tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
data.frame(Participants=names(first1), f=first1)
Here is a variant with data.table:
library("data.table")
setDT(dfInput)
dfInput[, which(Signal==1)[1], "Participants"]
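One difference between the answers above shows up when a group contains no 1 at all: which(...)[1] gives NA, whereas min(which(...)) gives Inf together with a warning. A small sketch with a made-up vector (no_ones and the first_by_* names are hypothetical):

```r
# A hypothetical group with no 1 in it
no_ones <- c(0, 0, 0)

# Indexing an empty result with [1] yields NA
first_by_index <- which(no_ones == 1)[1]

# min() of an empty vector yields Inf (and warns unless suppressed)
first_by_min <- suppressWarnings(min(which(no_ones == 1)))

# match() returns the first position of the value, or NA if it is absent
first_by_match <- match(1, no_ones)

c(index = first_by_index, min = first_by_min, match = first_by_match)
```

Which behaviour is preferable depends on how missing signals should be reported downstream; NA is usually easier to filter on.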

Filter rows by last maximal value ordering by a time value

I have a data frame with an id, an ordering time value, and a value. For each id, I would like to remove rows whose value does not exceed every value at an earlier time.
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
time = c(0, 1, 2, 0, 1, 2, 3),
value = c(1, 1, 2, 3, 1, 2, 4))
> data
id time value
1 a 0 1
2 a 1 1
3 a 2 2
4 b 0 3
5 b 1 1
6 b 2 2
7 b 3 4
So the result would be :
> data
id time value
1 a 0 1
2 a 2 2
3 b 0 3
4 b 3 4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than the value at a lower time.)
I was thinking about lag
data %>%
group_by(id) %>%
filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
id time value
<fctr> <dbl> <dbl>
1 a 0 1
2 a 2 2
3 b 0 3
4 b 2 2
5 b 3 4
But it doesn't work as expected since it's a vectorized function, so instead the idea would be to use a "recursive lag function" or to check the last maximal value. I can do it recursively with a loop but I'm sure there is a more straightforward and high level way to do it.
Any help would be appreciated, thank you !
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by=id][value > myVal][, myVal := NULL][]
id time value
1: a 0 1
2: a 2 2
3: b 0 3
4: b 3 4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), 0 is added to supply a value lower than any in the variable; more generally, you could use min(value) - 1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two links in the chain remove the cumulative maximum variable and print out the result.
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
# id time value
# 1: a 0 1
# 2: a 2 2
# 3: b 0 3
# 4: b 3 4
Which is basically saying: "If time is larger but value is less than or equal, then I don't want these rows (the ! sign)"
Here is an option with dplyr. After grouping by 'id', we filter the rows where the 'value' is greater than the cumulative maximum of the 'lag' of the 'value' column
library(dplyr)
data %>%
group_by(id) %>%
filter(value > cummax(lag(value, default = 0)) )
# id time value
# <fctr> <dbl> <dbl>
#1 a 0 1
#2 a 2 2
#3 b 0 3
#4 b 3 4
Or another option is slice after arranging by 'id' and 'time' (since the OP mentioned the order)
data %>%
group_by(id) %>%
arrange(id, time) %>%
slice(which(value > cummax(lag(value, default = 0))))
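For completeness, the same "greater than everything seen so far" rule can be sketched in base R with ave() and cummax(). This is an illustrative variant, not one of the answers above; prev_max and result are made-up names, and -Inf is used instead of 0 as the starting value so that negative values would also work.

```r
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
                   time = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))

# Make sure rows are in time order within each id
data <- data[order(data$id, data$time), ]

# Largest value seen in earlier rows of the same id (-Inf for the first row)
prev_max <- ave(data$value, data$id,
                FUN = function(v) cummax(c(-Inf, head(v, -1))))

# Keep rows that strictly exceed every earlier value in their group
result <- data[data$value > prev_max, ]
result
```

On the example data this keeps the rows with time 0 and 2 for id "a" and time 0 and 3 for id "b", matching the expected result in the question.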
