How to find the first element of a group that fulfills a condition - r

df <- structure(list(group = c(17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L,
19L, 19L, 20L, 20L, 20L, 21L, 21L, 22L, 23L, 24L, 25L, 25L, 25L,
26L, 27L, 27L, 27L, 28L), var = c(74L, 49L, 1L, 74L, 1L, 49L,
61L, 49L, 1L, 5L, 5L, 1L, 44L, 44L, 12L, 13L, 5L, 5L, 1L, 1L,
4L, 4L, 1L, 1L, 1L, 49L), first = c(0, 0, 1, 0, 1, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)), .Names = c("group",
"var", "first"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-26L))
Using the first two columns, I would like to create a third column (called first) where first == 1 only for the first row in each group where var == 1. In other words, I would like to mark the first element within each group that fulfills var == 1. How can I do that in dplyr? Presumably group_by should be used, but what next?

library(dplyr)
df$first = NULL
df %>%
group_by(group) %>%
mutate(first = as.numeric(row_number() == min(row_number()[var == 1]))) %>%
ungroup()
# # A tibble: 26 x 3
# group var first
# <int> <int> <dbl>
# 1 17 74 0
# 2 17 49 0
# 3 17 1 1
# 4 18 74 0
# 5 18 1 1
# 6 18 49 0
# 7 18 61 0
# 8 19 49 0
# 9 19 1 1
# 10 19 5 0
# # ... with 16 more rows
The idea is to flag the minimum row number where var == 1 within each group.
This returns warnings because some groups contain no var == 1 rows, so min() is called on an empty vector (and returns Inf, which matches no row number).
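One way to avoid the warnings, sketched here on a small made-up data frame (toy is not the original data), is to look up the position of the first match with match() and pass nomatch = 0L, so that groups without any var == 1 simply get no flag:

```r
library(dplyr)

# Illustrative data only: group 2 contains no var == 1
toy <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
                  var   = c(5L, 1L, 1L, 7L, 9L, 1L, 4L))

res <- toy %>%
  group_by(group) %>%
  # match() returns the position of the first 1, or 0 if there is none;
  # no row number equals 0, so such groups are all 0 and nothing warns
  mutate(first = as.numeric(row_number() == match(1, var, nomatch = 0L))) %>%
  ungroup()

res$first
```

Because match() stops at the first hit, later occurrences of 1 in the same group stay 0.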
Another option would be this:
library(dplyr)
df$first = NULL
# create row id
df$id = seq_along(df$group)
df %>%
filter(var == 1) %>% # keep cases where var = 1
distinct(group, .keep_all = TRUE) %>% # keep distinct cases based on group
mutate(first = 1) %>% # create first column
right_join(df, by=c("id","group","var")) %>% # join back original dataset
mutate(first = coalesce(first, 0)) %>% # replace NAs with 0
select(-id) # remove row id
# # A tibble: 26 x 3
# group var first
# <int> <int> <dbl>
# 1 17 74 0
# 2 17 49 0
# 3 17 1 1
# 4 18 74 0
# 5 18 1 1
# 6 18 49 0
# 7 18 61 0
# 8 19 49 0
# 9 19 1 1
#10 19 5 0
# # ... with 16 more rows

For ungrouped data, one solution is
first_equal_to = function(x, value)
(x == value) & (cumsum(x == value) == 1)
so
tbl %>% group_by(group) %>% mutate(first = first_equal_to(var, 1))
(it seems appropriate to keep this as a logical vector, since that is what the column represents).
Another implementation is
first_equal_to2 = function(x, value) {
result = logical(length(x))
result[match(value, x)] = TRUE
result
}
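A minimal base R sketch, assuming made-up vectors group and var, suggests that the two helpers agree and shows how either could be applied per group with ave() when dplyr is not available:

```r
first_equal_to <- function(x, value) (x == value) & (cumsum(x == value) == 1)

first_equal_to2 <- function(x, value) {
  result <- logical(length(x))
  # match() gives NA when value is absent; an NA index assigns nothing
  result[match(value, x)] <- TRUE
  result
}

# Illustrative data only
group <- c(1, 1, 1, 2, 2, 3, 3)
var   <- c(5, 1, 1, 7, 9, 1, 4)

# On the ungrouped vector both versions flag the same single element
same <- identical(first_equal_to(var, 1), first_equal_to2(var, 1))

# Per group: ave() applies the helper within each group and keeps row order
first <- as.logical(ave(var, group, FUN = function(v) first_equal_to(v, 1)))
first
```

ave() returns a vector of the same type as var, hence the as.logical() around it.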

We can use the expression shown for first:
DF %>%
group_by(group) %>%
mutate(first = { var == 1 } %>% { . * !duplicated(.) } ) %>%
ungroup
giving:
# A tibble: 26 x 3
group var first
<int> <int> <int>
1 17 74 0
2 17 49 0
3 17 1 1
4 18 74 0
5 18 1 1
6 18 49 0
7 18 61 0
8 19 49 0
9 19 1 1
10 19 5 0
# ... with 16 more rows

Related

R: creating continuous timelines

I have some data showing the duration individuals spent in states a, b, and c. The durations should not overlap. In my data, c sometimes overlaps with a and b. I am trying to correct the data so there are no overlaps. a and b take priority, so I am only changing the start or end dates of c.
For example, my data looks like this (for two individuals):
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), period = c("a", "b", "c", "a", "b", "c", "a", "a", "b",
"c", "a", "b"), start = c(3L, 11L, 8L, 18L, 20L, 26L, 31L, 1L,
11L, 13L, 15L, 16L), end = c(10L, 12L, 20L, 19L, 25L, 32L, 35L,
10L, 14L, 17L, 15L, 20L)), class = "data.frame", row.names = c(NA,
-12L))
And the desired output would be:
In this example, the individual with id == 2 has no time in period c, because periods a and b completely overlap.
How would you do this?
My method is a bit crude, but it'll work
library(tidyverse)
df %>% mutate(rowid = row_number()) %>%
pivot_longer(cols = c(start, end), names_to = "event", values_to = "val") %>%
mutate(event = factor(event, levels = c("start", "end"), ordered = TRUE)) %>%
arrange(rowid, id, period, event) %>%
mutate(val = ifelse(period == "c" & event == "start" & val < lag(val), lag(val)+1, val),
val = ifelse(period == "c" & event == "end" & val > lead(val), lead(val)-1, val)) %>%
pivot_wider(id_cols = c(rowid, id, period), names_from = event, values_from = val) %>%
arrange(rowid) %>% mutate(start = ifelse(start > end, NA, start),
end = ifelse(end < start, NA, end)) %>%
select(-rowid)
# A tibble: 12 x 4
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
Adding - instead of NA will turn the column type from dbl to chr.
EDIT: on second thought, here is a shorter version, grouped by id:
df %>% group_by(id) %>%
mutate(start = ifelse(period == "c" & start < lag(end), lag(end)+1, start),
end = ifelse(period == "c" & end > lead(start), lead(start)-1, end),
start = ifelse(start > end, NA, start),
end = ifelse(start > end, NA, end))
# A tibble: 12 x 4
# Groups: id [2]
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
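To check that a corrected table really contains no overlaps, a small base R helper can be used. This is only a sketch: has_overlap is a hypothetical name, the fixed data frame is typed in from the output above, and intervals are assumed to be inclusive at both ends.

```r
# TRUE if any two periods of one individual share at least one time point
has_overlap <- function(start, end) {
  keep <- !is.na(start) & !is.na(end)   # rows set to NA hold no time at all
  o <- order(start[keep])
  s <- start[keep][o]
  e <- end[keep][o]
  if (length(s) < 2) return(FALSE)
  # once sorted by start, any overlap also shows up between neighbours
  any(s[-1] <= e[-length(e)])
}

fixed <- data.frame(
  id    = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  start = c(3, 11, 13, 18, 20, 26, 31, 1, 11, NA, 15, 16),
  end   = c(10, 12, 17, 19, 25, 30, 35, 10, 14, NA, 15, 20)
)

# One logical per individual
overlaps <- sapply(split(fixed, fixed$id),
                   function(d) has_overlap(d$start, d$end))
overlaps
```

Applied to the first three uncorrected rows of id 1, has_overlap(c(3, 11, 8), c(10, 12, 20)) reports an overlap, because period c (8 to 20) still covers a and b.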

How to group contiguous variable into a range r

I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
I am trying to produce the following output:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?
We can use rleid from data.table to generate a run-length ID (a counter that increments whenever Cat changes), group on it together with Road, and then summarise
library(dplyr)
library(data.table)
df %>%
group_by(Road, grp = rleid(Cat)) %>%
summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)), .(Road, Cat, grp = rleid(Cat))]
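For readers without data.table, the run ID can be sketched in base R. run_id below is a hypothetical helper written for illustration, not part of any package; it is assumed to behave like rleid only for a single atomic vector without NA values.

```r
# Counter that increases by one each time the value changes
run_id <- function(x) {
  if (length(x) == 0) return(integer(0))
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

df <- data.frame(
  Road  = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Start = c(0L, 50L, 60L, 70L, 0L, 20L, 25L),
  End   = c(50L, 60L, 90L, 75L, 20L, 25L, 40L),
  Cat   = c("a", "b", "b", "a", "a", "a", "b"),
  stringsAsFactors = FALSE
)

ids <- run_id(df$Cat)

# Collapse each (Road, run) piece into one row
pieces <- split(df, list(df$Road, ids), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(d) {
  data.frame(Road = d$Road[1], Cat = d$Cat[1],
             Start = min(d$Start), End = max(d$End))
}))
res <- res[order(res$Road, res$Start), ]
rownames(res) <- NULL
res
```

Splitting on Road as well as the run ID keeps a run from spanning two roads, which is the same reason Road appears in the group_by call.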
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))

Summarise a group value into single row

I have a large dataset with longitudinal readings from single individuals.
I want to summarise information over time into a binary variable, i.e. if diff in the input table below is > 5 for any row of an individual, I want to reduce that individual's observations to a single row with a new column saying TRUE (as for A).
#Input
individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
#Output
individual newval
A TRUE
B FALSE
Using dplyr you can:
library(dplyr)
df %>%
group_by(individual) %>% # first group data
summarize(newval = any(diff > 5)) # then evaluate test for each group
#> # A tibble: 2 x 2
#> individual newval
#> <fct> <lgl>
#> 1 A TRUE
#> 2 B FALSE
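If diff can contain missing values, which the question does not say, any() may return NA for a group. A minimal base R sketch with made-up values, assuming that missing readings should simply be ignored:

```r
# Illustrative data only: B and C each have one missing reading
individual <- c("A", "A", "B", "B", "C", "C")
diff_val   <- c(-4, 8, 1, NA, NA, 9)

# Without na.rm, a group with an NA and no TRUE yields NA
strict <- tapply(diff_val > 5, individual, any)

# With na.rm = TRUE, missing comparisons are dropped before any() is applied
lenient <- tapply(diff_val > 5, individual, any, na.rm = TRUE)

strict
lenient
```

The same na.rm = TRUE argument can be passed to any() inside summarize().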
data
df <- read.table(text = "individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
", header = TRUE)
There are multiple ways to do this:
In base R we can use aggregate
aggregate(diff~individual, df,function(x) any(x>5))
# individual diff
#1 A TRUE
#2 B FALSE
Or tapply
tapply(df$diff > 5, df$individual, any)
We can also use data.table
library(data.table)
setDT(df)[, .(newval = any(diff > 5)), by = individual] # .() is needed to name the result column
An option in base R with rowsum
rowsum(+(df1$diff > 5), df1$individual) != 0
or with by
by(df1$diff > 5, df1$individual, any)
data
df1 <- structure(list(individual = c("A", "A", "A", "A", "B", "B", "B",
"B"), val1 = c(32L, 36L, 28L, 26L, 65L, 58L, 57L, 54L), val2 = c(36L,
28L, 26L, 26L, 64L, 59L, 54L, 51L), diff = c(-4L, 8L, 2L, 0L,
1L, -1L, 3L, 3L)), class = "data.frame", row.names = c(NA, -8L
))

How would I add a Total row for each value in a specific column, with calculations based on other columns

Assume I have this data frame
What I want is this
I want to add a row for each value of month that contains the sum of the total variable over all persons in that month, together with the unique value of days_in_month for that month.
Is there a quick way to do this that avoids multiple spreads and gathers with adorn_totals, after which I would have to reset days_in_month to its original value because it was summed as well?
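One option would be to group by 'month' and 'days_in_month' and apply adorn_totals to each group. In dplyr 0.8.1 and later this is done with group_modify, which returns a grouped tibble (group_map now returns a list)
library(dplyr)
library(janitor)
df1 %>%
group_by(month, days_in_month) %>%
group_modify(~ .x %>%
adorn_totals("row")) %>%
select(names(df1))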
# A tibble: 10 x 4
# Groups: month, days_in_month [2]
# month person total days_in_month
# <int> <chr> <int> <int>
# 1 1 John 7 31
# 2 1 Jane 18 31
# 3 1 Tim 20 31
# 4 1 Cindy 11 31
# 5 1 Total 56 31
# 6 2 John 18 28
# 7 2 Jane 13 28
# 8 2 Tim 15 28
# 9 2 Cindy 9 28
#10 2 Total 55 28
If we need other statistics, we can compute them inside group_modify
library(tibble)
df1 %>%
group_by(month, days_in_month) %>%
group_modify(~ bind_rows(.x, tibble(person = "Mean", total = mean(.x$total))))
data
df1 <- structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), person = c("John",
"Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"), total = c(7L,
18L, 20L, 11L, 18L, 13L, 15L, 9L), days_in_month = c(31L, 31L,
31L, 31L, 28L, 28L, 28L, 28L)), class = "data.frame", row.names = c(NA,
-8L))
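The same total rows can also be sketched in base R without janitor. This is an alternative to the answer above, not the method it uses; it assumes the df1 shown there, computes one sum per month with aggregate(), and then appends and sorts the rows so that each total follows its month.

```r
df1 <- data.frame(
  month = rep(1:2, each = 4),
  person = rep(c("John", "Jane", "Tim", "Cindy"), 2),
  total = c(7L, 18L, 20L, 11L, 18L, 13L, 15L, 9L),
  days_in_month = rep(c(31L, 28L), each = 4),
  stringsAsFactors = FALSE
)

# One row per month; days_in_month is a grouping key, so it is not summed
totals <- aggregate(total ~ month + days_in_month, data = df1, FUN = sum)
totals$person <- "Total"

# Append the totals, then sort by month with the Total row last in each month
res <- rbind(df1, totals[names(df1)])
res <- res[order(res$month, res$person == "Total"), ]
rownames(res) <- NULL
res
```

Treating days_in_month as a grouping variable is what keeps its original value in the total rows.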

Unable to summarize the minimum and maximum while using for loop

Below is random data.
drop drop1 drop2 ch
15 14 40 1
20 15 45 1
35 16 90 1
40 17 70 0
25 18 80 0
30 18 90 0
11 20 100 0
13 36 11 0
16 70 220 0
19 40 440 1
25 45 1 1
35 30 70 1
40 40 230 1
17 11 170 1
30 2 160 1
I am using the code below to profile a continuous variable in R.
library(dplyr)
dt %>% mutate(dec=ntile(drop, n=2)) %>%
count(ch, dec) %>%
filter(ch == 1) -> datcbld
datcbld$N <- unclass(dt %>%
mutate(dec=ntile(drop, n=2)) %>%
count(dec) %>%
unname())[[2]]
datcbld$ch_perc <- datcbld$n / datcbld$N
datcbld$GreaterThan <- unclass(dt %>% mutate(dec=ntile(drop, n=2)) %>%
group_by(dec) %>%
summarise(min(drop)))[[2]]
datcbld$LessThan <- unclass(dt %>%
mutate(dec=ntile(drop, n=2)) %>%
group_by(dec) %>%
summarise(max(drop)))[[2]]
datcbld$Varname <- rep("drop", nrow(datcbld))
Below is the output of the code.
ch dec n N ch_perc GreaterThan LessThan Varname
1 1 4 8 0.5 11 25 drop
1 2 5 7 0.714285714 25 40 drop
This code works fine when I use it for a single variable.
When I try to run it for each column using a for loop, it fails to summarise the min and max for each decile.
Below is my code using a for loop.
finaldata <- data.frame()
for(i in 1:(ncol(dt) - 1)){
dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n = 2)) %>%
count(ch,dec) %>%
filter(ch == 1) -> dat
dat$N <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
count(dec) %>%
unname())[[2]]
dat$ch_perc <- dat$n / dat$N
dat$GreaterThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(min(dt[, colnames(dt[i])])))[[2]]
dat$LessThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(max(dt[, colnames(dt[i])])))[[2]]
dat$Varname <- rep(colnames(dt[i]), nrow(dat))
finaldata <- rbind(finaldata, dat)
}
But I am unable to get the same result.
We could do this with map by looping over the column names, which can be done without breaking the chain (%>%)
library(tidyverse)
names(dt)[1:3] %>%
map_df(~
dt %>%
select(.x, ch) %>%
mutate(dec = ntile(!! rlang::sym(.x), n = 2)) %>%
group_by(dec) %>%
mutate(N = n(),
GreaterThan = min(!!rlang::sym(.x)), # lower bound of the bin, as in the question
LessThan = max(!!rlang::sym(.x))) %>% # upper bound of the bin
select(-1) %>%
count(!!! rlang::syms(names(.))) %>%
filter(ch == 1)%>%
mutate(ch_perc = n/N,
Varname = .x))
# A tibble: 6 x 8
# Groups: dec [2]
# dec ch N GreaterThan LessThan n ch_perc Varname
# <int> <int> <int> <dbl> <dbl> <int> <dbl> <chr>
#1 1 1 8 11 25 4 0.5 drop
#2 2 1 7 25 40 5 0.714 drop
#3 1 1 8 2 18 5 0.625 drop1
#4 2 1 7 20 70 4 0.571 drop1
#5 1 1 8 1 90 5 0.625 drop2
#6 2 1 7 90 440 4 0.571 drop2
The issue in the OP's for loop is the use of
dt[, colnames(dt[i])]
within summarise. It will apply the min or max on the full column value instead of applying the function on the column respecting the group by structure
We could convert the column names to symbols as shown above (sym) and evaluate them, or use summarise_at
finaldata <- data.frame()
for(i in 1:(ncol(dt) - 1)){
dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n = 2)) %>%
count(ch,dec) %>%
filter(ch == 1) -> dat
dat$N <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
count(dec) %>%
unname())[[2]]
dat$ch_perc <- dat$n / dat$N
dat$GreaterThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(min(!! rlang::sym(names(dt)[i]))))[[2]]
dat$LessThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(max(!! rlang::sym(names(dt)[i]))))[[2]]
dat$Varname <- rep(colnames(dt[i]), nrow(dat))
finaldata <- rbind(finaldata, dat)
}
finaldata
# A tibble: 6 x 8
# ch dec n N ch_perc GreaterThan LessThan Varname
# <int> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#1 1 1 4 8 0.5 11 25 drop
#2 1 2 5 7 0.714 25 40 drop
#3 1 1 5 8 0.625 2 18 drop1
#4 1 2 4 7 0.571 20 70 drop1
#5 1 1 5 8 0.625 1 90 drop2
#6 1 2 4 7 0.571 90 440 drop2
data
dt <- structure(list(drop = c(15L, 20L, 35L, 40L, 25L, 30L, 11L, 13L,
16L, 19L, 25L, 35L, 40L, 17L, 30L), drop1 = c(14L, 15L, 16L,
17L, 18L, 18L, 20L, 36L, 70L, 40L, 45L, 30L, 40L, 11L, 2L), drop2 = c(40L,
45L, 90L, 70L, 80L, 90L, 100L, 11L, 220L, 440L, 1L, 70L, 230L,
170L, 160L), ch = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L)), .Names = c("drop", "drop1", "drop2", "ch"),
class = "data.frame", row.names = c(NA,
-15L))
