I am trying to identify which groups in a column contain a run of non-zero numbers of a specific length. In the basic example below, where the goal is to find the groups with a run of 5 consecutive non-zero values, only group b would be correct.
set.seed(123)
df <- data.frame(
  id = seq(1:40),
  grp = sort(rep(letters[1:4], 10)),
  x = c(
    c(0, sample(1:10, 3), rep(0, 6)),
    c(0, 0, sample(1:10, 5), rep(0, 3)),
    c(rep(0, 6), sample(1:10, 4)),
    c(0, 0, sample(1:10, 3), 0, sample(1:10, 2), 0, 0)
  )
)
One limited solution uses cumsum below to count the non-zero values, but it does not work when there are breaks in the sequence: with a target length of 5, group d is incorrectly included.
library(dplyr)
df %>%
  group_by(grp) %>%
  mutate(cc = cumsum(x != 0)) %>%
  filter(cc == 5) %>%
  distinct(grp)
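Running this returns both b and d, because the cumulative count keeps growing across the zero breaks (group d has 3 + 2 = 5 non-zero values in total):
#> # A tibble: 2 × 1
#> # Groups:   grp [2]
#>   grp
#>   <chr>
#> 1 b
#> 2 d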
The desired output for a run length of 5 would identify only group b, not d.
You may use rle to find runs of consecutive non-zero numbers within each group.
library(dplyr)
find_groups <- function(x, n) {
  tmp <- rle(x != 0)
  any(tmp$lengths[tmp$values] >= n)
}
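For intuition, here is rle on a vector shaped like group d, which has two separate non-zero runs of lengths 3 and 2:
rle(c(0, 0, 7, 3, 9, 0, 5, 8, 0, 0) != 0)
#> Run Length Encoding
#>   lengths: int [1:5] 2 3 1 2 2
#>   values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
# tmp$lengths[tmp$values] is c(3, 2), so find_groups(x, 5) is FALSE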
# apply the function for each group
df %>%
  group_by(grp) %>%
  dplyr::filter(find_groups(x, 5)) %>%
  ungroup() %>%
  distinct(grp)
#   grp
#   <chr>
# 1 b
In data.table:
library(data.table)
# group rows into runs of zero / non-zero values and keep runs of
# exactly 5 non-zero values
setDT(df)[, .N == 5 & x[1] != 0, .(grp, rleid(!x))][(V1), .(grp)]
   grp
1:   b
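Here rleid(!x) assigns a new run id at every switch between zero and non-zero values, and the x[1] != 0 check ensures the length-5 run consists of non-zero values rather than zeros. For example:
rleid(!c(0, 0, 1, 2, 0))
#> [1] 1 1 2 2 3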
Split the runs at the sequence breaks by adding cumsum(x == 0) to the group_by() statement, then filter for the sub-groups that contain 5 non-zero rows.
library(dplyr)
df %>%
  group_by(grp, cumsum(x == 0)) %>%
  filter(sum(x != 0) == 5) %>%
  ungroup() %>%
  distinct(grp)
#> # A tibble: 1 × 1
#>   grp
#>   <chr>
#> 1 b
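To see why this works, look at the counter for a pattern shaped like group d's: every zero increments it, so each non-zero run gets its own constant value and is counted separately.
x <- c(0, 0, 7, 3, 9, 0, 5, 8, 0, 0)  # shaped like group d
cumsum(x == 0)
#> [1] 1 2 2 2 2 3 3 3 4 5
# the two non-zero runs land in sub-groups 2 and 3,
# whose counts of non-zero rows are 3 and 2, never 5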
I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
  ~group, ~from, ~to,
  1, 1, 10,
  1, 5, 8,
  1, 15, 20,
  2, 1, 10,
  2, 5, 10,
  2, 15, 18
)
I tried to nest all values in a list for each row (works), but combining these nested lists into one vector and counting the uniques doesn't work as expected.
tmp %>%
  group_by(group) %>%
  rowwise() %>%
  mutate(nrs = list(c(from:to))) %>%
  summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
       n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
                 length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
#   group n_uni
#   <dbl> <int>
# 1     1    16
# 2     2    14
Any help would be much appreciated!
tmp %>%
  rowwise() %>%
  mutate(nrs = list(from:to)) %>%
  group_by(group) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))
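This returns the desired counts:
#> # A tibble: 2 × 2
#>   group n_uni
#>   <dbl> <int>
#> 1     1    16
#> 2     2    14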
The issue with OP's approach is that rowwise() is the equivalent of a new grouping that drops the initial group_by() step. Thus we have to group_by() after creating nrs in the rowwise step.
Is it possible for the top_n() command to return both the max and min values at the same time?
Using the example from the reference page https://dplyr.tidyverse.org/reference/top_n.html
I tried the following
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(c(1,-1)) ## returns an error
df %>% top_n(1) %>% top_n(-1) ## returns only max value
Thanks
Not really involving top_n(), but you can try:
df %>%
  arrange(x) %>%
  slice(c(1, n()))
#    x
# 1  1
# 2 10
Or:
df %>%
  slice(which(x == max(x) | x == min(x))) %>%
  distinct()
Or (provided by #Gregor):
df %>%
  slice(c(which.min(x), which.max(x)))
Or using filter():
df %>%
  filter(x %in% range(x) & !duplicated(x))
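Unlike the tie-keeping approaches, !duplicated(x) keeps only one row per tied value, so this returns exactly two rows:
#    x
# 1 10
# 2  1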
Idea similar to #Jakub's answer with purrr::map_dfr
library(tidyverse) # dplyr and purrr for map_dfr
df %>%
  map_dfr(c(1, -1), top_n, wt = x, x = .)
#    x
# 1 10
# 2  1
# 3  1
# 4  1
Here is an option with top_n() where we pass as the weight a logical vector that is TRUE for the min/max values (using range), and then get the distinct rows, as there are ties for the range values, i.e. duplicate elements are present.
library(dplyr)
df %>%
  top_n(1, x %in% range(x)) %>%
  distinct()
#    x
# 1 10
# 2  1
I like #tmfmnk's answer. If you want to use the top_n() function, you can do this:
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
bind_rows(
  df %>% top_n(1),
  df %>% top_n(-1)
)
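Note that top_n() keeps all tied rows, so the minimum value 1 shows up three times:
#    x
# 1 10
# 2  1
# 3  1
# 4  1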
# this solution addresses the specification in the comments,
# which assumes a grouping column y
df %>%
  group_by(y) %>%
  summarise(min = min(x),
            max = max(x),
            average = mean(x))
I would like to summarize data across multiple columns, based on all rows other than rows with a certain value in a separate grouping variable column. For example, in the df below, I want to get the medians of A, B, C, D, and E based on values from rows that are not assigned to the cluster matching a given row.
df = data.frame(cluster = c(1:5, 1:3, 1:2),
                A = rnorm(10, 2),
                B = rnorm(10, 5),
                C = rnorm(10, 0.4),
                D = rnorm(10, 3),
                E = rnorm(10, 1))

df %>%
  group_by(cluster) %>%
  summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would give the equivalent of:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do it with nested for loops, but it's pretty slow to run and doesn't seem like a good R-like solution.
for(col in toupper(letters[1:5])){
  for(clust in unique(df$cluster)){
    df[which(df$cluster == clust), col] <-
      median(df[which(df$cluster != clust), col])
  }
}
A solution using the tidyverse.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
                A = rnorm(10, 2),
                B = rnorm(10, 5),
                C = rnorm(10, 0.4),
                D = rnorm(10, 3),
                E = rnorm(10, 1))

library(tidyverse)

df2 <- map_dfr(unique(df$cluster),
               ~ df %>%
                 filter(cluster != .x) %>%
                 summarize_at(vars(-cluster), funs(median(.))) %>%
                 # label each row with the cluster it excludes
                 mutate(not_cluster = .x))
df2
#          A        B          C        D         E not_cluster
# 1 2.070508 5.110683  0.1820251 3.553918 0.7920827           1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446           2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568           3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152           4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152           5
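If, as in the for-loop version, you want these medians attached back to the rows of df, one way is a join (a sketch; the "_m" suffix for the median columns is illustrative):
df3 <- df %>%
  left_join(df2, by = c("cluster" = "not_cluster"), suffix = c("", "_m"))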
I have some patient data, where the individual patients change treatment groups over time. My goal is to visualize the sequence of group changes and aggregate this data into a "sequence profile" for each treatment group.
For each treatment group I would like to show when it generally occurs in the treatment cycle (say, rather in the beginning or in the end). To account for the differing sequence lengths, I would like to standardize these profiles between 0 (very beginning) and 1 (end).
I am looking for an efficient way to prepare and visualize the data.
Minimal Example
Structure of Data
library(dplyr)
library(purrr)
library(ggplot2)
# minimal data
cj_df_raw <- tibble::tribble(
  ~id, ~group,
  1, "A",
  1, "B",
  2, "A",
  2, "B",
  2, "A"
)
# compute "intervals" for each person [start, end]
cj_df_raw %>%
  group_by(id) %>%
  mutate(pos = row_number(),
         len = length(id),
         start = (pos - 1) / len,
         end = pos / len) %>%
  filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups:   id [2]
#>      id group   pos   len start   end
#>   <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1     1 A         1     2 0       0.5
#> 2     2 A         1     3 0       0.333
#> 3     2 A         3     3 0.667   1
(So Id 1 was in group A for the first 50% of their sequence, and Id 2 was in group A for the first 33% and the last 33% of their sequence. This means that 2 Ids were in group A between 0-33% of the sequence, 1 between 33-50%, 0 between 50-66%, and 1 above 66%.)
This is the result I would like to achieve, but I am missing a way to transform my data effectively.
Desired outcome
profile_treatmen_a <- tibble::tribble(
  ~x, ~y,
  0, 0L,
  0.33, 2L,
  0.5, 1L,
  0.66, 0L,
  1, 1L,
  1, 0L
)

profile_treatmen_a %>%
  ggplot(aes(x, y)) +
  geom_step(direction = "vh") +
  expand_limits(x = c(0, 1), y = 0)
(Ideally the area under the curve would be shaded)
Ideal solution: via ggridges
The goal of the visualization would be to compare the "sequence-profile" of many treatment groups at the same time. If I could prepare the data accordingly, I would like to use the ggridges package for a striking visual comparison of the treatment groups.
library(ggridges)
data.frame(group = rep(letters[1:2], each = 20),
           mean = rep(2, each = 20)) %>%
  mutate(count = runif(nrow(.))) %>%
  ggplot(aes(x = count, y = group, fill = group)) +
  geom_ridgeline(stat = "binline", binwidth = 0.5, scale = 0.9)
You could build helper intervals and then just plot a histogram. Since each patient is in either group A or B at any point, the two groups together sum to 100%. With these helper intervals you could also easily switch to other geoms.
library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)
# create sample data
set.seed(42)
id <- 1:10 %>%
  map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
  unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
  as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...
# tidy data
df <- df %>%
  group_by(id) %>%
  mutate(from = (row_number() - 1) / n(),
         to = row_number() / n()) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
  unnest()
# plot
df %>%
  ggplot(aes(x = list, fill = group)) +
  geom_histogram(binwidth = 1/60) +
  ggthemes::theme_hc()
Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
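The key step is seq(from + 1/60, to, 1/60), which expands each patient's [from, to] interval into markers for 1/60-wide bins, so geom_histogram simply counts how many patients occupy each bin. For example:
# an interval covering the first quarter of a sequence expands to
# markers at 1/60, 2/60, ..., 15/60
seq(0 + 1/60, 0.25, 1/60)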
My attempt at an answer.. although it is probably not the nicest/fastest/most efficient way, I think it might help you in your efforts.
library(data.table)
# compute "intervals" for each person [start, end]
df <- cj_df_raw %>%
  group_by(id) %>%
  mutate(pos = row_number(),
         len = length(id),
         from = (pos - 1) / len,
         to = pos / len,
         value = 1)

dt <- as.data.table(df)
setkey(dt, from, to)
# create intervals
dt.interval <- data.table(from = seq(from = 0, by = 0.01, length.out = 100),
                          to = seq(from = 0.01, by = 0.01, length.out = 100))
# perform overlap join on intervals
dt2 <- foverlaps(dt.interval, dt, type = "within", nomatch = NA)[
  , sum(value), by = c("i.from", "group")]
# some melting and casting to fill in '0' on empty intervals
dt3 <- melt(dcast(dt2, ... ~ group, fill = 0), id.vars = 1)
# plot
ggplot(dt3) +
  geom_step(aes(x = i.from, y = value, color = variable)) +
  facet_grid(. ~ variable)
My data is below
grp <- paste('group', sample(1:3, 100, replace = T))
x <- rnorm(100, 100)
y <- rnorm(100, 10)
df <- data.frame(grp = grp, x =x , y =y , stringsAsFactors = F)
lag_size <- c(10, 4, 9)
Now when I try to use
df %>% group_by(grp) %>% mutate_all(lag, n = lag_size) %>% arrange(grp)
it gives an error
Error in mutate_impl(.data, dots) :
Expecting a single value:
whereas this works fine
df %>% group_by(grp) %>% mutate_all(lag, n = 10) %>% arrange(grp)
If we need to do the lag based on 'grp', i.e. to lag each 'grp' by the corresponding value in 'lag_size':
library(tidyverse)
res <- map2(split(df[2:3], df$grp), lag_size, ~ .x %>%
              mutate_all(lag, n = .y)) %>%
  bind_rows(., .id = 'grp')
We can check the lag applied to each 'grp' by the position of the first non-NA element:
res %>%
  group_by(grp) %>%
  summarise(n = which(!is.na(x))[1] - 1)
# A tibble: 3 x 2
#   grp         n
#   <chr>   <dbl>
# 1 group 1    10
# 2 group 2     4
# 3 group 3     9
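With dplyr 1.0 or later you can also avoid the split/map2 step and look up the lag inside mutate() via cur_group() (a sketch, assuming lag_size is ordered to match sort(unique(df$grp))):
library(dplyr)
# named lookup: group name -> lag size
lag_lookup <- setNames(lag_size, sort(unique(df$grp)))
df %>%
  group_by(grp) %>%
  mutate(across(c(x, y), ~ lag(.x, n = lag_lookup[[cur_group()$grp]]))) %>%
  arrange(grp)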