slice(which.max()) with condition - r

I have the following dataset:
ID, diff
1 -40
1 -21
1 -5
1 1
1 6
1 7
...
The ID variable takes the values 1, 2, 3, 4, 5, ..., while diff is a numeric variable. For each ID, I want to extract the row whose diff is closest to zero AND negative, i.e. the row with the largest negative value of diff. In the dataset above, for ID 1 I want to extract the third row, with values (1, -5).
The following code can extract rows where the absolute value is closest to 0:
library(dplyr)
dataset22 = dataset1 %>% group_by(ID) %>% slice(which.min(abs(diff)))
How can I extract the row with a negative number that is closest to zero?
Thanks in advance!

This works:
library(dplyr)
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1),
diff = c(-40, -21, -5, 1, 6, 7))
df %>%
group_by(ID) %>%
filter(diff < 0) %>%
summarise(min_negative_diff = max(diff))
#> # A tibble: 1 x 2
#> ID min_negative_diff
#> <dbl> <dbl>
#> 1 1 -5
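The summarise() call above returns only ID and the selected value. If the whole row is needed, as the slice() in the question suggests, a minimal sketch is to filter to the negative values first and then slice. The column `other` is a hypothetical extra column added only to show that the full row survives.

```r
library(dplyr)

# Same toy data as above, plus a hypothetical extra column `other`
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1),
                 diff = c(-40, -21, -5, 1, 6, 7),
                 other = letters[1:6])

closest_negative <- df %>%
  group_by(ID) %>%
  filter(diff < 0) %>%           # keep only negative values
  slice(which.max(diff)) %>%     # largest negative value = closest to zero
  ungroup()

closest_negative
```

Note that an ID with no negative diff at all drops out of the result, because filter() removes all of its rows before slice() runs.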

Related

Count cumulative and sequential values of the same sign in R

I'm looking for the R equivalent of the code in this post in Python: adding a column that cumulatively counts the number of positive and negative values in the preceding column.
I've found many examples of cumulative sums or something more complex, but I would just like to count the number of positives and negatives in a row that resets whenever the sign changes. See sample code.
library(dplyr)
df <- data.frame(x = c(0.5, 1, 6.5, -2, 3, -0.2, -1))
My expected output is this:
df <- data.frame(x = c(0.5, 1, 6.5, -2, 3, -0.2, -1),
z = c(1,2,3,-1,1,-1,-2))
I would like to add column "z" to the data frame df with mutate(), starting from a df that contains only "x".
You can try:
library(dplyr)
df %>%
mutate(z = with(rle(sign(x)), sequence(lengths) * rep(values, lengths)))
x z
1 0.5 1
2 1.0 2
3 6.5 3
4 -2.0 -1
5 3.0 1
6 -0.2 -1
7 -1.0 -2
You may want to consider how zeroes should be treated as the above may need a modification if zeroes exist in your vector. Perhaps:
df %>%
mutate(z = with(rle(sign(x)), sequence(lengths) * rep(values^(values != 0), lengths)))
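To see what the zero handling changes, here is a small base R sketch that wraps both variants in one function. The function name `run_count` and its `zero_as_positive` argument are made up for illustration. The trick relies on `0^0` evaluating to 1 in R, so a run of zeros gets a multiplier of +1 instead of 0.

```r
# Hypothetical helper: signed running count within runs of equal sign
run_count <- function(x, zero_as_positive = FALSE) {
  r <- rle(sign(x))
  s <- r$values
  # s != 0 is FALSE for zero runs, and 0^FALSE is 0^0, which is 1 in R
  if (zero_as_positive) s <- s^(s != 0)
  sequence(r$lengths) * rep(s, r$lengths)
}

x <- c(0, 0, 2, -1, 0)
plain    <- run_count(x)                           # zeros stay 0
modified <- run_count(x, zero_as_positive = TRUE)  # zeros are counted like positives
```

With the plain version the zeros produce 0 in the new column; with the modification each run of zeros is counted 1, 2, ... like a positive run.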
Edit addressing OP comment below:
df %>%
mutate(z = with(tmp <- rle(sign(x)), sequence(lengths) * rep(values, lengths)),
id = with(tmp, rep(seq_along(lengths), lengths))) %>%
group_by(id) %>%
mutate(avg = cumsum(x)/row_number()) %>%
ungroup() %>%
select(-id)
# A tibble: 7 x 3
x z avg
<dbl> <dbl> <dbl>
1 0.5 1 0.5
2 1 2 0.75
3 6.5 3 2.67
4 -2 -1 -2
5 3 1 3
6 -0.2 -1 -0.2
7 -1 -2 -0.6

counting leading & trailing zeros for every row in a dataframe in R

I am trying to analyse a dataframe where every row represents a timeseries. My df is structured as follows:
df <- data.frame(key = c("10A", "11xy", "445pe"),
Obs1 = c(0, 22, 0),
Obs2 = c(10, 0, 0),
Obs3 = c(0, 3, 5),
Obs4 = c(0, 10, 0)
)
I would now like to create a new dataframe, where every row represents again the key, and the columns consist of the following results:
"TotalZeros": counts the total number of zeros for each row (=key)
"LeadingZeros": counts the number of zeros before the first nonzero obs for each row
This means I would like to receive the following dataframe in the end:
key TotalZeros LeadingZeros
10A 3 1
11xy 1 0
445pe 3 2
I managed to count the total number of zeros for each row:
zeroCountDf <- data.frame(key = df$key, TotalZeros = rowSums(df[-1] == 0))
But I am struggling with counting the LeadingZeros. I found how to count the first non-zero position in a vector, but I don't understand how to apply this approach to my dataframe:
vec <- c(0,1,1)
min(which(vec != 0)) # returns 2, meaning the second position is first nonzero value
Can anyone explain how to count leading zeros for every row in a dataframe? I am new to R and thankful for any insight and tips. Thanks in advance.
We could use rowCumsums from matrixStats along with rowSums
library(matrixStats)
cbind(df[1], total_zeros = rowSums(df[-1] == 0),
Leading_zeros = rowSums(!rowCumsums(df[-1] != 0)))
-output
key total_zeros Leading_zeros
1 10A 3 1
2 11xy 1 0
3 445pe 3 2
or in tidyverse, we may also use rowwise
library(dplyr)
df %>%
mutate(total_zeros = rowSums(select(., starts_with("Obs")) == 0)) %>%
rowwise %>%
transmute(key, total_zeros,
Leading_zeros = sum(!cumsum(c_across(starts_with('Obs')) != 0))) %>%
ungroup
-output
# A tibble: 3 x 3
key total_zeros Leading_zeros
<chr> <dbl> <int>
1 10A 3 1
2 11xy 1 0
3 445pe 3 2
Edit Added Miff's comment to the solution.
Here is a tidyverse solution:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(starts_with("Obs"),
names_pattern = "Obs(\\d+)") %>%
arrange(key, as.integer(name)) %>%
group_by(key) %>%
summarize(
leading_zeros = sum(cumsum(abs(value)) == 0),
total_zeros = sum(value == 0),
trailing_zeros = sum(cumsum(abs(value)) == last(cumsum(abs(value)))) - 1)
This returns
# A tibble: 3 x 4
key leading_zeros total_zeros trailing_zeros
<chr> <int> <int> <dbl>
1 10A 1 3 2
2 11xy 0 1 0
3 445pe 2 3 1
A data.table option
library(data.table)
setDT(df)[
, .(
total_zeros = rowSums(.SD == 0),
Leading_zeros = which.max(.SD != 0) - 1,
Trailing_zeros = length(.SD)-max(which(.SD!=0))
),
key
]
gives
key total_zeros Leading_zeros Trailing_zeros
1: 10A 3 1 2
2: 11xy 1 0 0
3: 445pe 3 2 1
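For comparison, a base R sketch of the same counts, which also covers a row that contains only zeros (there, which.max() on an all-FALSE vector would return 1 and min(which()) would return Inf). The helper name `count_leading` is made up for illustration.

```r
df <- data.frame(key = c("10A", "11xy", "445pe"),
                 Obs1 = c(0, 22, 0),
                 Obs2 = c(10, 0, 0),
                 Obs3 = c(0, 3, 5),
                 Obs4 = c(0, 10, 0))

# Logical matrix: TRUE where an observation is nonzero
nz <- df[-1] != 0

# Hypothetical helper: number of FALSE values before the first TRUE;
# if the row has no TRUE at all, every position counts as a leading zero
count_leading <- function(v) if (any(v)) which.max(v) - 1L else length(v)

res <- data.frame(
  key = df$key,
  TotalZeros = rowSums(!nz),
  LeadingZeros = apply(nz, 1, count_leading),
  # trailing zeros are the leading zeros of the reversed row
  TrailingZeros = apply(nz, 1, function(v) count_leading(rev(v)))
)
res
```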

Lowest positive and least negative value among various columns in R?

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3), timeA=c(-10, NA, NA, -15, -10, -5), timeB=c(5, 100, -10, -10, -15, 5), timeC=c(1, 160, 17, -5, -5, 2))
Question 1:
I want to create a column giving the lowest positive value of time for each participant (ID). If all of a participant's values are negative, I want to keep the one that is least negative instead.
Question 2: Is there a function looking for the value that is closest to 0?
So that my output would look like this:
df <- data.frame(ID=c(1,2,3), time_new=c(1, -5, 2))
I think you're looking for Closest() from the DescTools package.
library(tidyverse)
library(DescTools)
# your data
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3),
timeA=c(-10, NA, NA, -15, -10, -5),
timeB=c(5, 100, -10, -10, -15, 5),
timeC=c(1, 160, 17, -5, -5, 2))
# your results
# I stacked the information for easier searching
df %>% pivot_longer(!ID,values_to = "value") %>%
group_by(ID) %>%
summarise(time_new = Closest(value, 0, na.rm = TRUE)) # closest value to zero
Simply calculate distance to 0 and then filter
For #1
library(tidyverse)
# filter_function() returns a TRUE/FALSE vector following the logic of #1:
# prefer the smallest positive value;
# if all values are negative, take the largest (least negative) one
filter_function <- function(x) {
result <- rep(0, length(x))
if (all(x < 0, na.rm = TRUE)) {
reference <- max(x, na.rm = TRUE)
} else {
reference <- min(x[x > 0], na.rm = TRUE)
}
result <- result + (x == reference)
result[is.na(result)] <- 0
as.logical(result)
}
# filter as #1 option
df %>% pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(distance_to_zero = 0 + value,
abs_distance_to_zero = abs(distance_to_zero)) %>%
group_by(ID) %>%
filter(filter_function(distance_to_zero))
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID name value distance_to_zero abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 timeC 1 1 1
#> 2 2 timeC -5 -5 5
#> 3 3 timeC 2 2 2
And this is for #2
# filter as closest to ZERO no matter positive or negative
df %>%
pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(abs_distance_to_zero = abs(0 + value)) %>%
group_by(ID) %>%
# Then keep the rows equal to the group minimum; this can return multiple
# records per group in your actual data if there are ties
filter(abs_distance_to_zero == min(abs_distance_to_zero, na.rm = TRUE) &
!is.na(abs_distance_to_zero)) %>%
ungroup()
#> # A tibble: 3 x 4
#> ID name value abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 timeC 1 1
#> 2 2 timeC -5 5
#> 3 3 timeC 2 2
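Both questions can also be sketched in base R without reshaping packages. The helper names `pick_time` and `closest_to_zero` are made up for illustration, and the sketch assumes every ID has at least one non-missing value.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 3, 3),
                 timeA = c(-10, NA, NA, -15, -10, -5),
                 timeB = c(5, 100, -10, -10, -15, 5),
                 timeC = c(1, 160, 17, -5, -5, 2))

# Question 1 (hypothetical helper): smallest positive value if there is one,
# otherwise the largest of the remaining (non-positive) values
pick_time <- function(x) {
  x <- x[!is.na(x)]
  pos <- x[x > 0]
  if (length(pos) > 0) min(pos) else max(x)
}

# Question 2 (hypothetical helper): value with the smallest absolute value;
# which.min() returns the first one if there are ties
closest_to_zero <- function(x) {
  x <- x[!is.na(x)]
  x[which.min(abs(x))]
}

# Stack the time columns (column by column) and repeat the IDs to match
vals <- unlist(df[-1], use.names = FALSE)
ids  <- rep(df$ID, times = ncol(df) - 1)

q1 <- tapply(vals, ids, pick_time)
q2 <- tapply(vals, ids, closest_to_zero)
```

On this data both helpers pick the same value per ID; they differ as soon as a group contains a negative value that is closer to zero than its smallest positive one.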

Subsetting when there are n consecutive dummies

I have a data frame in which I created a series of dummy variables and then combined them into a final column. I want to know whether there is a run of 3 consecutive 1's, i.e., is there a way to subset the data frame that gives me rows 3:5 in the following example?
df <- tibble(
a= c(0, 0, 1, 1, 1, 0, 1, 1)
)
df
# A tibble: 8 x 1
a
<dbl>
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 1
The data.table package has a handy function called rleid() that assigns a new group id each time the value changes. Using that, you can do:
library(tidyverse)
df %>%
group_by(grp = data.table::rleid(a)) %>%
filter(n() >= 3 & all(a == 1))
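The same idea can be sketched in base R with rle(), which avoids the data.table dependency. The variable names here are chosen only for illustration.

```r
a <- c(0, 0, 1, 1, 1, 0, 1, 1)

r <- rle(a)
# TRUE for every element that belongs to a run of 1's of length >= 3
keep <- rep(r$values == 1 & r$lengths >= 3, r$lengths)

rows <- which(keep)
```

Here `rows` holds the positions 3, 4 and 5, so `df[rows, ]` would give the requested subset; the trailing run of two 1's is too short and is dropped.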

Get last row of each group in R [duplicate]

I have some data similar in structure to:
a <- data.frame("ID" = c("A", "A", "B", "B", "C", "C"),
"NUM" = c(1, 2, 4, 3, 6, 9),
"VAL" = c(1, 0, 1, 0, 1, 0))
And I am trying to sort it by ID and NUM then get the last row.
This code works to get the last row and summarize down to a unique ID, however, it doesn't actually get the full last row like I want.
a <- a %>% arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(max(NUM))
I understand why this code doesn't work but am looking for the dplyr way of getting the last row for each unique ID
Expected Results:
ID NUM VAL
<fct> <dbl> <dbl>
1 A 2 0
2 B 4 1
3 C 9 0
Note: I will admit that though it is nearly a duplicate of Select first and last row from grouped data, the answers on that thread were not quite what I was looking for.
You might try:
a %>%
group_by(ID) %>%
arrange(NUM) %>%
slice(n())
One dplyr option could be:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise_all(last)
ID NUM VAL
<fct> <dbl> <dbl>
1 A 2. 0.
2 B 4. 1.
3 C 9. 0.
Or since dplyr 1.0.0:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(across(everything(), last))
Or using slice_max():
a %>%
group_by(ID) %>%
slice_max(order_by = NUM, n = 1)
tail() returns the last 6 items of a subsettable object by default. With aggregate(), extra arguments to FUN are passed after the function, separated by a comma; here 1 is matched to n = 1, which tells tail() to return only the last item. Because tail() takes the rows in their current order, sort by ID and NUM first:
a <- a[order(a$ID, a$NUM), ]
aggregate(a[, c('NUM', 'VAL')], list(a$ID), tail, 1)
# Group.1 NUM VAL
# 1 A 2 0
# 2 B 4 1
# 3 C 9 0
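You can use top_n(). (Grouping already orders the output by ID, and sorting by NUM beforehand isn't necessary because top_n() picks the row with the largest NUM directly.) Note that top_n() has been superseded by slice_max() since dplyr 1.0.0.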
library(dplyr)
a %>%
group_by(ID) %>%
top_n(1, NUM)
# # A tibble: 3 x 3
# # Groups: ID [3]
# ID NUM VAL
# <fct> <dbl> <dbl>
# 1 A 2 0
# 2 B 4 1
# 3 C 9 0
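Another option along the same lines, sketched here under the assumption that dplyr 1.0.0 or later is installed, is slice_tail() after sorting. Unlike slice_max(), whose default with_ties = TRUE can return several rows when the maximum is tied, slice_tail(n = 1) always returns exactly one row per group.

```r
library(dplyr)

a <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                NUM = c(1, 2, 4, 3, 6, 9),
                VAL = c(1, 0, 1, 0, 1, 0))

last_rows <- a %>%
  arrange(ID, NUM) %>%      # sort so the largest NUM is last within each ID
  group_by(ID) %>%
  slice_tail(n = 1) %>%     # keep the last row of each group
  ungroup()

last_rows
```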
