Group data by interval slice - r

Let's say I have this kind of data:
> head(data)
year type
1 1999 A
2 2018 B
3 2002 A
4 2001 B
5 2017 B
6 2017 A
How do I group the column 'year' by an interval defined by the user, say 2?
So the returned data would look like this:
> head(data)
Year Type Freq
1 1999-2000 A 12
2 1999-2000 B 5
3 2001-2002 A 23
4 2001-2002 B 6
5 2003-2004 A 30
6 2003-2004 B 15
I'm using this in a Shiny app, and I've gotten this far, but it only works for one column:
period <- 1999:2004
n <- 2
# split the period into chunks of n consecutive years
interval <- split(period, ceiling(seq_along(period) / n))
# build a "min - max" label for each chunk
year_interval <- unlist(lapply(interval, function(x) {
  paste(min(x), max(x), sep = " - ")
}))
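For what it's worth, those labels can be applied back to the data by matching each year to its position in period (a sketch, not from the original post; it assumes every year in data falls inside period):
# assign each row its interval label, then tabulate by type
data$years <- year_interval[ceiling(match(data$year, period) / n)]
table(data$years, data$type)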

Create data
library(tidyverse)
set.seed(10)
df <- tibble(year = sample(1999:2020, 30, replace = TRUE),
             type = sample(LETTERS[1:3], 30, replace = TRUE))
Group by intervals of N = 2 years
N <- 2
df %>%
  mutate(g = year %/% N,
         years = paste0(g * N, '-', g * N + N - 1)) %>%
  count(years, type)
# # A tibble: 20 x 3
# # Groups: years [10]
# years type n
# <chr> <chr> <int>
# 1 2000-2001 A 2
# 2 2000-2001 B 1
# 3 2002-2003 C 1
# 4 2004-2005 A 1
# 5 2004-2005 B 2
# 6 2004-2005 C 2
# 7 2006-2007 A 3
# 8 2006-2007 B 2
# 9 2008-2009 A 2
# 10 2008-2009 B 1
# 11 2010-2011 A 1
# 12 2010-2011 B 1
# 13 2012-2013 A 1
# 14 2012-2013 C 3
# 15 2014-2015 B 2
# 16 2014-2015 C 1
# 17 2016-2017 A 1
# 18 2016-2017 B 1
# 19 2016-2017 C 1
# 20 2018-2019 B 1
For N = 3
N <- 3
df %>%
  mutate(g = year %/% N,
         years = paste0(g * N, '-', g * N + N - 1)) %>%
  count(years, type)
# # A tibble: 18 x 3
# # Groups: years [7]
# years type n
# <chr> <chr> <int>
# 1 1998-2000 A 1
# 2 1998-2000 B 1
# 3 2001-2003 A 1
# 4 2001-2003 C 1
# 5 2004-2006 A 2
# 6 2004-2006 B 4
# 7 2004-2006 C 2
# 8 2007-2009 A 4
# 9 2007-2009 B 1
# 10 2010-2012 A 1
# 11 2010-2012 B 1
# 12 2010-2012 C 3
# 13 2013-2015 A 1
# 14 2013-2015 B 2
# 15 2013-2015 C 1
# 16 2016-2018 A 1
# 17 2016-2018 B 2
# 18 2016-2018 C 1
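Since the question mentions a Shiny app, the same logic can be wrapped in a helper that takes the user-chosen interval as an argument. A sketch, where the function name and the input ID are hypothetical and the data is assumed to have year and type columns:
# hypothetical helper: count rows per type within N-year bins
group_by_interval <- function(data, N) {
  data %>%
    mutate(g = year %/% N,
           years = paste0(g * N, '-', g * N + N - 1)) %>%
    count(years, type)
}
# in the Shiny server, N could come from a numeric input, e.g.:
# output$table <- renderTable(group_by_interval(df, input$interval))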

Related

R: subsetting a dataframe according to a condition by id

I have the following data set:
Lines <- "id Observation_code Observation_value
1 A 5
1 A 6
1 B 24
2 C 2
2 D 9
2 A 12
3 V 5
3 E 6
3 C 24
4 B 2
4 D 9
4 C 12"
dat <- read.table(text = Lines, header = TRUE)
I would like to subset the data in a way that gives me the whole history of patients with Observation_code == "A". In this example, since only ids 1 and 2 have Observation_code "A", they should be the ones left. Note that all observations for ids 1 and 2 should be in the final dataset:
Final <- "id Observation_code Observation_value
1 A 5
1 A 6
1 B 24
2 C 2
2 D 9
2 A 12"
dat_Final <- read.table(text = Final, header = TRUE)
base R
# flag each row TRUE if its id group contains at least one "A"
ind <- ave(dat$Observation_code == "A", dat$id, FUN = any)
dat[ind, ]
# id Observation_code Observation_value
# 1 1 A 5
# 2 1 A 6
# 3 1 B 24
# 4 2 C 2
# 5 2 D 9
# 6 2 A 12
or
do.call(rbind, by(dat, dat$id, FUN = function(z) z[any(z$Observation_code == "A"),]))
dplyr
library(dplyr)
dat %>%
  group_by(id) %>%
  filter(any(Observation_code == "A")) %>%
  ungroup()
# # A tibble: 6 x 3
# id Observation_code Observation_value
# <int> <chr> <int>
# 1 1 A 5
# 2 1 A 6
# 3 1 B 24
# 4 2 C 2
# 5 2 D 9
# 6 2 A 12
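For completeness, a data.table equivalent of the same any() idea (a sketch, not part of the original answers):
library(data.table)
# keep whole groups where the condition holds for at least one row
setDT(dat)[, if (any(Observation_code == "A")) .SD, by = id]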

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
  id = 1:18,
  class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
  var_a = rep(c("a", "b"), 9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced: in the sample above, only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance the dataset so that all classes are the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such sampling.
I managed to do so by first calculating the smallest class size...
min_length <- as.numeric(df %>%
  group_by(class) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  summarise(min = min(n)))
...and then applying the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go.
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
  group_by(class) %>%
  sample_n(min(table(df$class))) %>%
  ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because they property we're looking for is of the whole frame but the table function only sees one group at a time, we need to side-step that a little.
One could do
df %>%
  mutate(mn = min(table(class))) %>%
  group_by(class) %>%
  sample_n(mn[1]) %>%
  ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
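As an aside, not from the original answers: in current dplyr, sample_n() is superseded by slice_sample(), so a sketch of the same one-step trick (with the same df$ caveat) would be:
df %>%
  group_by(class) %>%
  slice_sample(n = min(table(df$class))) %>%
  ungroup()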

Only rows where difference between them is less than 'n' in groups

Let's say we have the dataset below, where values in V2 are sorted ascending within groups of V1:
Input =(" V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
10 B 12
11 C 13
12 C 14
13 C 18")
df <- as.data.frame(read.table(textConnection(Input), header = TRUE, row.names = 1))
Now I want to keep rows where the difference between consecutive values is <= 1, so my desired output is:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
11 C 13
12 C 14
However when I use:
df %>%
  group_by(V1) %>%
  filter(c(0, diff(V2)) <= 1)
I have:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 13
6 B 4
7 B 5
8 B 6
9 C 13
10 C 14
The row with V2 value 12 is missing, and it should be in the dataset. I also tried lag(), but the result is the same.
df %>%
  group_by(V1) %>%
  filter(V2 - lag(V2) <= 1 | is.na(V2 - lag(V2)))
Could you point out my mistake?
You need to check the difference on both sides: a row should stay if it is within 1 of either its previous or its next value. Try lead and lag:
library(dplyr)
df %>%
  group_by(V1) %>%
  filter(V2 - lag(V2) <= 1 | V2 - lead(V2) <= 1)
# V1 V2
# <chr> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
Here is another idea where we create groups with a tolerance of 1, and filter out those groups with only one observation, i.e.
df %>%
  group_by(V1, grp = cumsum(c(TRUE, diff(V2) != 1))) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 11 x 2
# V1 V2
# <fct> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
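The same both-sides check also works in base R (a sketch, not from the original answers):
# keep a row if it is within 1 of either its previous or next value in the group
keep <- with(df, ave(V2, V1, FUN = function(v) {
  d_prev <- c(NA, diff(v))
  d_next <- c(diff(v), NA)
  (!is.na(d_prev) & d_prev <= 1) | (!is.na(d_next) & d_next <= 1)
}))
df[as.logical(keep), ]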

Count combinations of elements with condition

My question is similar to "r count combinations of elements in groups"; however, first I want to gather all potential combinations per group in a column Comb, and second, count the occurrences of those combinations depending on year in a column n.
Using the same mock dataset:
> dat = data.table(group = c(1,1,1,2,2,2,3,3), id = c(10,11,12,10,11,13,11,13), year = rep(2010:2012, times = c(3, 3, 2)))
> dat
group id year
1: 1 10 2010
2: 1 11 2010
3: 1 12 2010
4: 2 10 2011
5: 2 11 2011
6: 2 13 2011
7: 3 11 2012
8: 3 13 2012
The desired outcome:
> dat
group Comb year n
1: 1 10 11 2010 1
2: 1 11 12 2010 1
3: 1 12 10 2010 1
4: 2 10 11 2011 2
5: 2 11 13 2011 1
6: 2 13 10 2011 1
7: 3 11 13 2012 2
I would much appreciate a possible solution with dplyr. Thanks!
Here's a solution, presented first as data.table then as dplyr. The process is the same: we self-join on group, filter where the id combinations are in a consistent order (any order would work, we pick first id < second id), group by combination to number the rows, and drop the unused columns.
dat <- data.table(group = c(1,1,1,2,2,2,3,3), id = c(10,11,12,10,11,13,11,13),
                  year = rep(2010:2012, times = c(3, 3, 2)))
## with data.table
merge(dat, dat, by = "group", allow.cartesian = TRUE)[
  id.x < id.y, ][
  , Comb := paste(id.x, id.y)][
  , n := 1:.N, by = .(Comb)][
  , .(group, Comb, n)]
# group Comb n
# 1: 1 10 11 1
# 2: 1 10 12 1
# 3: 1 11 12 1
# 4: 2 10 11 2
# 5: 2 10 13 1
# 6: 2 11 13 1
# 7: 3 11 13 2
## with dplyr
dat %>%
  full_join(dat, by = "group") %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, n)
# # A tibble: 7 x 3
# # Groups: Comb [5]
# group Comb n
# <dbl> <chr> <int>
# 1 1 10 11 1
# 2 1 10 12 1
# 3 1 11 12 1
# 4 2 10 11 2
# 5 2 10 13 1
# 6 2 11 13 1
# 7 3 11 13 2
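The desired outcome in the question also keeps the year column. Since year is constant within each group, it can be carried through by joining on both columns (a sketch building on the dplyr version above, assuming dat includes year as printed in the question):
dat %>%
  full_join(dat, by = c("group", "year")) %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, year, n)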

Programmatically rename data frame columns using lookup data frame

What is the best way to batch rename columns using a lookup data frame?
Can I do it as part of a pipe?
library(tidyverse)
df <- tibble(
  a = seq(1, 10),
  b = seq(10, 1),
  c = rep(1, 10)
)
df_lookup <- tibble(
  old_name = c("b", "c", "a"),
  new_name = c("y", "z", "x")
)
I know how to do it manually
df %>%
  rename(x = a, y = b, z = c)
I am seeking a solution in tidyverse / dplyr packages.
Use rlang: first build up a list of name symbols with syms(), then splice the arguments into rename() with the UQS (!!!) operator:
library(rlang); library(dplyr)
df %>% rename(!!!syms(with(df_lookup, setNames(old_name, new_name))))
# A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
#10 10 1 1
You could write your own helper to make this easier:
rename_to <- function(data, old, new) {
  # look up each old name's replacement; match() is safer than element-wise ==
  data %>% rename_at(old, function(x) new[match(x, old)])
}
df %>% rename_to(df_lookup$old_name, df_lookup$new_name)
In base R:
names(df)[match(df_lookup$old_name,names(df))] <- df_lookup$new_name
# # A tibble: 10 x 3
# x y z
# <int> <int> <dbl>
# 1 1 10 1
# 2 2 9 1
# 3 3 8 1
# 4 4 7 1
# 5 5 6 1
# 6 6 5 1
# 7 7 4 1
# 8 8 3 1
# 9 9 2 1
# 10 10 1 1
Using data.table:
library(data.table)
setnames(setDT(df), old = df_lookup$old_name, new = df_lookup$new_name)
# x y z
# 1: 1 10 1
# 2: 2 9 1
# 3: 3 8 1
# 4: 4 7 1
# 5: 5 6 1
# 6: 6 5 1
# 7: 7 4 1
# 8: 8 3 1
# 9: 9 2 1
# 10: 10 1 1
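In more recent dplyr (a sketch, assuming dplyr >= 1.0 and the original df; not part of the original answers), rename_with() plus match() avoids the rlang splicing entirely:
df %>%
  rename_with(~ df_lookup$new_name[match(.x, df_lookup$old_name)],
              .cols = all_of(df_lookup$old_name))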
