Group data by interval slice - r

Let's say I have this kind of data:
> head(data)
year type
1 1999 A
2 2018 B
3 2002 A
4 2001 B
5 2017 B
6 2017 A
How do I group the column 'year' by an interval defined by the user, say 2?
So the returned data would look like this:
> head(data)
Year Type Freq
1 1999-2000 A 12
2 1999-2000 B 5
3 2001-2002 A 23
4 2001-2002 B 6
5 2003-2004 A 30
6 2003-2004 B 15
I'm using this in a Shiny app, and I've gotten this far, but it only works for one column:
period <- 1999:2004
n <- 2
# split the period into chunks of n consecutive years
interval <- split(period, ceiling(seq_along(period) / n))
# build a "min - max" label for each chunk
year_interval <- unlist(lapply(interval, function(x) {
  paste(min(x), max(x), sep = " - ")
}))
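For what it's worth, those labels can be applied back to the data by matching each year to its position in period (a sketch, not from the original post; it assumes every year in data falls inside period):
# assign each row its interval label, then tabulate by type
data$years <- year_interval[ceiling(match(data$year, period) / n)]
table(data$years, data$type)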

Create data
library(tidyverse)
set.seed(10)
df <- tibble(year = sample(1999:2020, 30, replace = TRUE),
             type = sample(LETTERS[1:3], 30, replace = TRUE))
Group by intervals of N = 2 years
N <- 2
df %>%
  mutate(g = year %/% N,
         years = paste0(g * N, '-', g * N + N - 1)) %>%
  count(years, type)
# # A tibble: 20 x 3
# # Groups: years [10]
# years type n
# <chr> <chr> <int>
# 1 2000-2001 A 2
# 2 2000-2001 B 1
# 3 2002-2003 C 1
# 4 2004-2005 A 1
# 5 2004-2005 B 2
# 6 2004-2005 C 2
# 7 2006-2007 A 3
# 8 2006-2007 B 2
# 9 2008-2009 A 2
# 10 2008-2009 B 1
# 11 2010-2011 A 1
# 12 2010-2011 B 1
# 13 2012-2013 A 1
# 14 2012-2013 C 3
# 15 2014-2015 B 2
# 16 2014-2015 C 1
# 17 2016-2017 A 1
# 18 2016-2017 B 1
# 19 2016-2017 C 1
# 20 2018-2019 B 1
For N = 3
N <- 3
df %>%
  mutate(g = year %/% N,
         years = paste0(g * N, '-', g * N + N - 1)) %>%
  count(years, type)
# # A tibble: 18 x 3
# # Groups: years [7]
# years type n
# <chr> <chr> <int>
# 1 1998-2000 A 1
# 2 1998-2000 B 1
# 3 2001-2003 A 1
# 4 2001-2003 C 1
# 5 2004-2006 A 2
# 6 2004-2006 B 4
# 7 2004-2006 C 2
# 8 2007-2009 A 4
# 9 2007-2009 B 1
# 10 2010-2012 A 1
# 11 2010-2012 B 1
# 12 2010-2012 C 3
# 13 2013-2015 A 1
# 14 2013-2015 B 2
# 15 2013-2015 C 1
# 16 2016-2018 A 1
# 17 2016-2018 B 2
# 18 2016-2018 C 1
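Since the question mentions a Shiny app, the same logic can be wrapped in a helper that takes the user-chosen interval as an argument. A sketch, where the function name and the input ID are hypothetical and the data is assumed to have year and type columns:
# hypothetical helper: count rows per type within N-year bins
group_by_interval <- function(data, N) {
  data %>%
    mutate(g = year %/% N,
           years = paste0(g * N, '-', g * N + N - 1)) %>%
    count(years, type)
}
# in the Shiny server, N could come from a numeric input, e.g.:
# output$table <- renderTable(group_by_interval(df, input$interval))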

Related

R: subsetting a dataframe according to a condition by id

I have the following data set:
Lines <- "id Observation_code Observation_value
1 A 5
1 A 6
1 B 24
2 C 2
2 D 9
2 A 12
3 V 5
3 E 6
3 C 24
4 B 2
4 D 9
4 C 12"
dat <- read.table(text = Lines, header = TRUE)
I would like to subset the data in a way that gives me the whole history of patients with Observation_code == "A". In this example, since only ids 1 and 2 have Observation_code "A", they should be the ones left. Note that all observations for ids 1 and 2 should be in the final dataset:
Final <- "id Observation_code Observation_value
1 A 5
1 A 6
1 B 24
2 C 2
2 D 9
2 A 12"
dat_Final <- read.table(text = Final, header = TRUE)
base R
# flag each row TRUE if its id group contains at least one "A"
ind <- ave(dat$Observation_code == "A", dat$id, FUN = any)
dat[ind, ]
# id Observation_code Observation_value
# 1 1 A 5
# 2 1 A 6
# 3 1 B 24
# 4 2 C 2
# 5 2 D 9
# 6 2 A 12
or
do.call(rbind, by(dat, dat$id, FUN = function(z) z[any(z$Observation_code == "A"),]))
dplyr
library(dplyr)
dat %>%
  group_by(id) %>%
  filter(any(Observation_code == "A")) %>%
  ungroup()
# # A tibble: 6 x 3
# id Observation_code Observation_value
# <int> <chr> <int>
# 1 1 A 5
# 2 1 A 6
# 3 1 B 24
# 4 2 C 2
# 5 2 D 9
# 6 2 A 12
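For completeness, a data.table equivalent of the same any() idea (a sketch, not part of the original answers):
library(data.table)
# keep whole groups where the condition holds for at least one row
setDT(dat)[, if (any(Observation_code == "A")) .SD, by = id]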

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced: in the sample above, only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance the dataset so that all classes are the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such sampling.
I managed to do so by first calculating the smallest class size...
min_length <- as.numeric(df %>%
  group_by(class) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go.
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
  group_by(class) %>%
  sample_n(min(table(df$class))) %>%
  ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
  mutate(mn = min(table(class))) %>%
  group_by(class) %>%
  sample_n(mn[1]) %>%
  ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
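As an aside, not from the original answers: in current dplyr, sample_n() is superseded by slice_sample(), so a sketch of the same one-step trick (with the same df$ caveat) would be:
df %>%
  group_by(class) %>%
  slice_sample(n = min(table(df$class))) %>%
  ungroup()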

Only rows where difference between them is less than 'n' in groups

Let's say we have the dataset below, where values in V2 are sorted ascending within groups of V1:
Input =(" V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
10 B 12
11 C 13
12 C 14
13 C 18")
df <- as.data.frame(read.table(textConnection(Input), header = TRUE, row.names = 1))
Now I want to keep rows where the difference between consecutive values is <= 1, so my desired output is:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
11 C 13
12 C 14
However when I use:
df %>%
  group_by(V1) %>%
  filter(c(0, diff(V2)) <= 1)
I have:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 13
6 B 4
7 B 5
8 B 6
9 C 13
10 C 14
The row with V2 value 12 is missing, and it should be in the dataset. I also tried lag(), but the result is the same.
df %>%
  group_by(V1) %>%
  filter(V2 - lag(V2) <= 1 | is.na(V2 - lag(V2)))
Could you point out my mistake?
You need to check the difference on both sides: a row should stay if it is within 1 of either its previous or its next value. Try lead and lag:
library(dplyr)
df %>%
  group_by(V1) %>%
  filter(V2 - lag(V2) <= 1 | V2 - lead(V2) <= 1)
# V1 V2
# <chr> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
Here is another idea where we create groups with a tolerance of 1, and filter out those groups with only one observation, i.e.
df %>%
  group_by(V1, grp = cumsum(c(TRUE, diff(V2) != 1))) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 11 x 2
# V1 V2
# <fct> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
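The same both-sides check also works in base R (a sketch, not from the original answers):
# keep a row if it is within 1 of either its previous or next value in the group
keep <- with(df, ave(V2, V1, FUN = function(v) {
  d_prev <- c(NA, diff(v))
  d_next <- c(diff(v), NA)
  (!is.na(d_prev) & d_prev <= 1) | (!is.na(d_next) & d_next <= 1)
}))
df[as.logical(keep), ]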

Count combinations of elements with condition

My question is similar to "r count combinations of elements in groups"; however, first I want to gather all potential combinations per group in a column Comb, and second, count the occurrences of those combinations depending on year in a column n.
Using the same mock dataset:
> dat = data.table(group = c(1,1,1,2,2,2,3,3), id = c(10,11,12,10,11,13,11,13), year = rep(2010:2012, times = c(3, 3, 2)))
> dat
group id year
1: 1 10 2010
2: 1 11 2010
3: 1 12 2010
4: 2 10 2011
5: 2 11 2011
6: 2 13 2011
7: 3 11 2012
8: 3 13 2012
The desired outcome:
> dat
group Comb year n
1: 1 10 11 2010 1
2: 1 11 12 2010 1
3: 1 12 10 2010 1
4: 2 10 11 2011 2
5: 2 11 13 2011 1
6: 2 13 10 2011 1
7: 3 11 13 2012 2
I would much appreciate a possible solution with dplyr. Thanks!
Here's a solution, presented first as data.table then as dplyr. The process is the same: we self-join on group, filter where the id combinations are in a consistent order (any order would work, we pick first id < second id), group by combination to number the rows, and drop the unused columns.
dat <- data.table(group = c(1,1,1,2,2,2,3,3), id = c(10,11,12,10,11,13,11,13),
                  year = rep(2010:2012, times = c(3, 3, 2)))
## with data.table
merge(dat, dat, by = "group", allow.cartesian = TRUE)[
  id.x < id.y, ][
  , Comb := paste(id.x, id.y)][
  , n := 1:.N, by = .(Comb)][
  , .(group, Comb, n)]
# group Comb n
# 1: 1 10 11 1
# 2: 1 10 12 1
# 3: 1 11 12 1
# 4: 2 10 11 2
# 5: 2 10 13 1
# 6: 2 11 13 1
# 7: 3 11 13 2
## with dplyr
dat %>%
  full_join(dat, by = "group") %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, n)
# # A tibble: 7 x 3
# # Groups: Comb [5]
# group Comb n
# <dbl> <chr> <int>
# 1 1 10 11 1
# 2 1 10 12 1
# 3 1 11 12 1
# 4 2 10 11 2
# 5 2 10 13 1
# 6 2 11 13 1
# 7 3 11 13 2
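The desired outcome in the question also keeps the year column. Since year is constant within each group, it can be carried through by joining on both columns (a sketch building on the dplyr version above, assuming dat includes year as printed in the question):
dat %>%
  full_join(dat, by = c("group", "year")) %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, year, n)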

Programmatically rename data frame columns using lookup data frame

What is the best way to batch rename columns using a lookup data frame?
Can I do it as part of a pipe?
library(tidyverse)
df <- tibble(
  a = seq(1, 10),
  b = seq(10, 1),
  c = rep(1, 10)
)
df_lookup <- tibble(
  old_name = c("b", "c", "a"),
  new_name = c("y", "z", "x")
)
I know how to do it manually
df %>%
  rename(x = a, y = b, z = c)
I am seeking a solution in tidyverse / dplyr packages.
Use rlang: first build up a list of name symbols with syms(), then splice the arguments into rename() with the UQS (!!!) operator:
library(rlang); library(dplyr)
df %>% rename(!!!syms(with(df_lookup, setNames(old_name, new_name))))
# A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
#10 10 1 1
You could write your own helper to make this easier:
rename_to <- function(data, old, new) {
  # look up each old name's replacement; match() is safer than element-wise ==
  data %>% rename_at(old, function(x) new[match(x, old)])
}
df %>% rename_to(df_lookup$old_name, df_lookup$new_name)
In base R:
names(df)[match(df_lookup$old_name,names(df))] <- df_lookup$new_name
# # A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
# 10 10 1 1
Using data.table:
library(data.table)
setnames(setDT(df), old = df_lookup$old_name, new = df_lookup$new_name)
# x y z
# 1: 1 10 1
# 2: 2 9 1
# 3: 3 8 1
# 4: 4 7 1
# 5: 5 6 1
# 6: 6 5 1
# 7: 7 4 1
# 8: 8 3 1
# 9: 9 2 1
# 10: 10 1 1
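In more recent dplyr (a sketch, assuming dplyr >= 1.0 and the original df; not part of the original answers), rename_with() plus match() avoids the rlang splicing entirely:
df %>%
  rename_with(~ df_lookup$new_name[match(.x, df_lookup$old_name)],
              .cols = all_of(df_lookup$old_name))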
