It's easy to group_by unique values of a variable:
library(tidyverse)
library(gapminder)
gapminder %>%
group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 2
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
Here's an alternative solution, where you can specify the number of groups you want in the beginning and the process creates the corresponding groups:
library(tidyverse)
library(gapminder)
# input number of groups
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
Here's a function that does the job
group_by_n = function(x, n) {
ux <- match(x, sort(unique(x)))
ceiling(ux / n)
}
It does not require that x be ordered, or that values be evenly spaced or even numeric values. Use as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
Related
The code below should group the data by year and then create two new columns with the first and last value of each year.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, r = T))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
dplyr::mutate() did the trick
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
You can also try by using summarise function within dpylr to get the first and last values of unique groups
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(values)),
last_value = last(na.omit(values))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions or want a future-proof solution, you can just index the columns like you would a list:
> d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
I have a series of rows in a single dataframe. I'm trying to aggregate the first two rows for each ID- i.e. - I want to combine events 1 and 2 for ID 1 into a single row, events 1 and 2 for ID 2 into a singlw row etc, but leave event 3 completely untouched.
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
event <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
score <- c(3,NA,1,3,NA,2,6,NA,1,8,NA,2,4,NA,1)
score2 <- c(NA,4,1,NA,5,2,NA,0,3,NA,5,6,NA,8,7)
df <- tibble(id, event, score, score2)
# A tibble: 15 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 NA
2 1 2 NA 4
3 1 3 1 1
4 2 1 3 NA
5 2 2 NA 5
6 2 3 2 2
7 3 1 6 NA
8 3 2 NA 0
9 3 3 1 3
10 4 1 8 NA
11 4 2 NA 5
12 4 3 2 6
13 5 1 4 NA
14 5 2 NA 8
15 5 3 1 7
I've tried :
df_merged<- df %>% group_by (id) %>% summarise_all(funs(min(as.character(.),na.rm=TRUE))),
which aggregates these nicely, but then I struggle to merge these back into the orignal dataframe/tibble (there are really about 300 different "score" columns in the full dataset, so a right_join is a headache with score.x, score.y, score2.x, score2.y all over the place...)
Ideally, the situation would need to be dplyr as the rest of my code runs on this!
EDIT:
Ideally, my expected output would be:
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
3 1 3 1 1
4 2 1 3 5
6 2 3 2 2
7 3 1 6 0
9 3 3 1 3
10 4 1 8 5
12 4 3 2 6
13 5 1 4 8
15 5 3 1 7
We may change the order of NA elements with replace
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('score'),
~replace(., 1:2, .[1:2][order(is.na(.[1:2]))]))) %>%
ungroup %>%
filter(if_all(starts_with('score'), Negate(is.na)))
-output
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7
Here is an alternative way to achieve your task with fill from tidyr package:
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1,3)
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7
How about this?
library(dplyr)
df_e12 <- df %>%
filter(event %in% c(1, 2)) %>%
group_by(id) %>%
mutate(across(starts_with("score"), ~min(.x, na.rm = TRUE))) %>%
ungroup() %>%
distinct(id, .keep_all = TRUE)
df_e3 <- df %>%
filter(event == 3)
df <- bind_rows(df_e12, df_e3) %>%
arrange(id, event)
df
> df
# A tibble: 10 x 4
id event score score2
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 4
2 1 3 1 1
3 2 1 3 5
4 2 3 2 2
5 3 1 6 0
6 3 3 1 3
7 4 1 8 5
8 4 3 2 6
9 5 1 4 8
10 5 3 1 7
I want to pivot_longer into two columns based on two sets of variables.
For example:
df <- data.frame(year = rep(c(2010,2012,2017), 4),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
pp1 = rep(c(3,4,5,1,2,6), 2),
pp2 = rep(c(1,2,3,4,5,6), 2),
pp3 = rep(c(6,2,3,1,5,4), 2),
l_pp1 = rep(c(1,2,6,3,4,5), 2),
l_pp2 = rep(c(4,5,6,1,2,3), 2),
l_pp3 = rep(c(1,5,4,6,2,3), 2))
Data:
year party pp1 pp2 pp3 l_pp1 l_pp2 l_pp3
1 2010 A 3 1 6 1 4 1
2 2012 A 4 2 2 2 5 5
3 2017 A 5 3 3 6 6 4
4 2010 B 1 4 1 3 1 6
5 2012 B 2 5 5 4 2 2
6 2017 B 6 6 4 5 3 3
7 2010 A 3 1 6 1 4 1
8 2012 A 4 2 2 2 5 5
9 2017 A 5 3 3 6 6 4
10 2010 B 1 4 1 3 1 6
11 2012 B 2 5 5 4 2 2
12 2017 B 6 6 4 5 3 3
What I need is the following:
year party area pp l_pp
1 2010 A 1 3 1
2 2012 A 1 4 2
3 2017 A 1 5 6
4 2010 B 1 1 3
5 2012 B 1 2 4
etc.
Here pp and l_pp are the same area (pp1 & l_pp1 become pp and l_pp for area 1).
I would think something like this, but values_to can only take size 1.
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp", "l_pp"))
This gets me somehow close, but is not what I am looking for:
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp"))
year party area pp
1 2010 A pp1 3
2 2010 A pp2 1
3 2010 A pp3 6
4 2010 A l_pp1 1
5 2010 A l_pp2 4
6 2010 A l_pp3 1
EDIT Making use of the .value sentinel this could be achieved via one pivot_longer like so:
library(tidyr)
df %>%
pivot_longer(-c(year, party), names_to = c(".value", "area"), names_pattern = "^(.*?)(\\d+)$")
#> # A tibble: 36 × 5
#> year party area pp l_pp
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2010 A 1 3 1
#> 2 2010 A 2 1 4
#> 3 2010 A 3 6 1
#> 4 2012 A 1 4 2
#> 5 2012 A 2 2 5
#> 6 2012 A 3 2 5
#> 7 2017 A 1 5 6
#> 8 2017 A 2 3 6
#> 9 2017 A 3 3 4
#> 10 2010 B 1 1 3
#> # … with 26 more rows
As a second option the same result could be achieved via an additional pivot_wider like so, where as an intermediate step one has to add an id column to uniquely identify the rows in the data:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(!c(year, party), names_to = c("var", "area"), names_pattern = "(.*)(\\d)") %>%
group_by(year, party, area, var) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = var, values_from = value)
#> # A tibble: 36 x 6
#> year party area id pp l_pp
#> <dbl> <chr> <chr> <int> <dbl> <dbl>
#> 1 2010 A 1 1 3 1
#> 2 2010 A 2 1 1 4
#> 3 2010 A 3 1 6 1
#> 4 2012 A 1 1 4 2
#> 5 2012 A 2 1 2 5
#> 6 2012 A 3 1 2 5
#> 7 2017 A 1 1 5 6
#> 8 2017 A 2 1 3 6
#> 9 2017 A 3 1 3 4
#> 10 2010 B 1 1 1 3
#> # … with 26 more rows
data=data.frame(person=c(1,1,1,2,2,2,2,3,3,3,3),
t=c(3,NA,9,4,7,NA,13,3,NA,NA,12),
WANT=c(3,6,9,4,7,10,13,3,6,9,12))
So basically I am wanting to create a new variable 'WANT' which takes the PREVIOUS value in t and ADDS 3 to it, and if there are many NA in a row then it keeps doing this. My attempt is:
library(dplyr)
data %>%
group_by(person) %>%
mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
group_by(person) %>%
mutate(
# cs = cumsum(!is.na(t)), # creates index for reference value; uncomment if interested
w = case_when(
# rle() gives the running length of NA
is.na(t) ~ t[cumsum(!is.na(t))] + 3*sequence(rle(is.na(t))$lengths),
TRUE ~ t
)
) %>%
ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
Here is another way. We can do linear interpolation with the imputeTS package.
library(dplyr)
library(imputeTS)
data2 <- data %>%
group_by(person) %>%
mutate(WANT2 = na.interpolation(WANT)) %>%
ungroup()
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
This is harder than it seems because of the double NA at the end. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you want you want. The simplest way, that uses the same logic but doesn't look very clever (sorry!) would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
You can use functional programming from purrr and "NA-safe" addition from hablar:
library(hablar)
library(dplyr)
library(purrr)
data %>%
group_by(person) %>%
mutate(WANT2 = accumulate(t, ~.x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
I'm trying to create dummy variable for whether a child is first born, and one for if the child is second born. My data looks something like this
ID MID CMOB CYRB
1 1 1 1991
2 1 7 1989
3 2 1 1985
4 2 11 1985
5 2 9 1994
6 3 4 1992
7 4 2 1992
8 4 10 1983
With ID = child ID, MID = mother ID, CMOB = month of birth and CYRB = year of birth.
For the first born dummy I tried using this:
Identifiers_age <- Identifiers_age %>% group_by(MPUBID)
%>% mutate(first = as.numeric(rank(CYRB) == 1))
But there doesn't seem to be a way of breaking ties by the rank of another columnn (clearly in this case the desired column being CMOB), whenever I try using the "ties.method" argument it tell me the input must be a character vector.
Am I missing something here?
order might be more convenient to use here, from ?order:
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
mutate(first = as.numeric(order(CYRB, CMOB) == 1))
Identifiers_age
#Source: local data frame [8 x 5]
#Groups: MID [4]
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <dbl>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
If we still want to use rank, we can convert the 'CYRB', 'CMOB' in to 'Date', apply rank on it and the get the binary output based on the logical vector
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(as.Date(paste(CYRB, CMOB, 1,
sep="-"), "%Y-%m-%d"))==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
Or we can use arithmetic to do this with rank
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(CYRB + CMOB/12)==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1