I want to pivot_longer into two value columns based on two sets of variables.
For example:
df <- data.frame(year = rep(c(2010,2012,2017), 4),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
pp1 = rep(c(3,4,5,1,2,6), 2),
pp2 = rep(c(1,2,3,4,5,6), 2),
pp3 = rep(c(6,2,3,1,5,4), 2),
l_pp1 = rep(c(1,2,6,3,4,5), 2),
l_pp2 = rep(c(4,5,6,1,2,3), 2),
l_pp3 = rep(c(1,5,4,6,2,3), 2))
Data:
year party pp1 pp2 pp3 l_pp1 l_pp2 l_pp3
1 2010 A 3 1 6 1 4 1
2 2012 A 4 2 2 2 5 5
3 2017 A 5 3 3 6 6 4
4 2010 B 1 4 1 3 1 6
5 2012 B 2 5 5 4 2 2
6 2017 B 6 6 4 5 3 3
7 2010 A 3 1 6 1 4 1
8 2012 A 4 2 2 2 5 5
9 2017 A 5 3 3 6 6 4
10 2010 B 1 4 1 3 1 6
11 2012 B 2 5 5 4 2 2
12 2017 B 6 6 4 5 3 3
What I need is the following:
year party area pp l_pp
1 2010 A 1 3 1
2 2012 A 1 4 2
3 2017 A 1 5 6
4 2010 B 1 1 3
5 2012 B 1 2 4
etc.
Here pp and l_pp belong to the same area (pp1 & l_pp1 become the pp and l_pp values for area 1).
I would have thought something like the following, but values_to can only take a single column name.
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp", "l_pp"))
This gets me somewhat close, but is not what I am looking for:
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp"))
year party area pp
1 2010 A pp1 3
2 2010 A pp2 1
3 2010 A pp3 6
4 2010 A l_pp1 1
5 2010 A l_pp2 4
6 2010 A l_pp3 1
EDIT: Making use of the .value sentinel, this can be achieved with a single pivot_longer like so:
library(tidyr)
df %>%
pivot_longer(-c(year, party), names_to = c(".value", "area"), names_pattern = "^(.*?)(\\d+)$")
#> # A tibble: 36 × 5
#> year party area pp l_pp
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2010 A 1 3 1
#> 2 2010 A 2 1 4
#> 3 2010 A 3 6 1
#> 4 2012 A 1 4 2
#> 5 2012 A 2 2 5
#> 6 2012 A 3 2 5
#> 7 2017 A 1 5 6
#> 8 2017 A 2 3 6
#> 9 2017 A 3 3 4
#> 10 2010 B 1 1 3
#> # … with 26 more rows
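To see why this works: the regex `^(.*?)(\d+)$` in names_pattern splits each column name into a non-digit prefix (sent to `.value`, i.e. it becomes a new column name) and the trailing digits (sent to `area`). A quick sanity check of the two capture groups with base `sub()`:

```r
nms <- c("pp1", "l_pp2", "l_pp3")
# First capture group: the measure prefix that .value turns into column names
sub("^(.*?)(\\d+)$", "\\1", nms, perl = TRUE)  # "pp"  "l_pp"  "l_pp"
# Second capture group: the trailing digits that fill the area column
sub("^(.*?)(\\d+)$", "\\2", nms, perl = TRUE)  # "1"  "2"  "3"
```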
As a second option, the same result can be achieved with an additional pivot_wider; as an intermediate step, one has to add an id column to uniquely identify the rows in the data:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(!c(year, party), names_to = c("var", "area"), names_pattern = "(.*)(\\d)") %>%
group_by(year, party, area, var) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = var, values_from = value)
#> # A tibble: 36 x 6
#> year party area id pp l_pp
#> <dbl> <chr> <chr> <int> <dbl> <dbl>
#> 1 2010 A 1 1 3 1
#> 2 2010 A 2 1 1 4
#> 3 2010 A 3 1 6 1
#> 4 2012 A 1 1 4 2
#> 5 2012 A 2 1 2 5
#> 6 2012 A 3 1 2 5
#> 7 2017 A 1 1 5 6
#> 8 2017 A 2 1 3 6
#> 9 2017 A 3 1 3 4
#> 10 2010 B 1 1 1 3
#> # … with 26 more rows
Related
The code below should group the data by group and then create two new columns holding the first and last value of each group.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
Explicitly namespacing with dplyr::mutate() did the trick:
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
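For context, this failure mode is what you see when another attached package (plyr is the usual suspect) masks dplyr's verbs; fully qualifying the calls sidesteps the masking. A minimal sketch on a smaller frame (the data here are made up for illustration):

```r
library(dplyr)

d2 <- data.frame(group = rep(1:2, each = 2), value = c(3, 8, 8, 1))
res <- d2 %>%
  group_by(group) %>%
  # Namespaced calls stay unambiguous even if another package masks
  # mutate(), first(), or last()
  dplyr::mutate(first = dplyr::first(value), last = dplyr::last(value)) %>%
  ungroup()
res$first  # 3 3 8 8
res$last   # 8 8 1 1
```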
You can also use the summarise function from dplyr to get the first and last values of each group, then join them back:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions, or if you simply want a future-proof solution, you can just index the vector like you would a list:
> d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
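In the same spirit, dplyr's `nth()` covers both ends with one function, since a negative `n` counts from the end of the vector (note it is still a dplyr function, so it is not deprecation-proof in the way the list-indexing approach is):

```r
library(dplyr)

d <- data.frame(group = rep(1:3, each = 3),
                value = c(3, 8, 4, 8, 9, 1, 5, 9, 5))
res <- d %>%
  group_by(group) %>%
  # nth(value, 1) is the first element, nth(value, -1) the last
  mutate(first = nth(value, 1), last = nth(value, -1)) %>%
  ungroup()
```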
I have this dataset in R:
and I want to expand the data according to the nb variable, meaning ID = 1 will have 5 rows and ID = 2 will have 12 rows, as shown below:
Is there any R function that I could use to transform my data :) ?
Thanks in advance
We need uncount from tidyr to expand the rows based on the 'nb' column. By default it removes that column (.remove = TRUE), so change it to FALSE, and then create nb_long with row_number() within each group.
library(dplyr)
library(tidyr)
df1 %>%
uncount(nb, .remove = FALSE) %>%
group_by(ID) %>%
mutate(nb_long = row_number()) %>%
ungroup
-output
# A tibble: 17 x 3
ID nb nb_long
<int> <dbl> <int>
1 1 5 1
2 1 5 2
3 1 5 3
4 1 5 4
5 1 5 5
6 2 12 1
7 2 12 2
8 2 12 3
9 2 12 4
10 2 12 5
11 2 12 6
12 2 12 7
13 2 12 8
14 2 12 9
15 2 12 10
16 2 12 11
17 2 12 12
data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
Here is another option: we map out the values from 1 to nb and then unnest the vector longer.
#packages
library(tidyverse)
#data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
#solution
df1 %>%
mutate(nums = map(nb, ~seq(1, .x, by = 1))) %>%
unnest_longer(nums)
#> # A tibble: 17 x 3
#> ID nb nums
#> <int> <dbl> <dbl>
#> 1 1 5 1
#> 2 1 5 2
#> 3 1 5 3
#> 4 1 5 4
#> 5 1 5 5
#> 6 2 12 1
#> 7 2 12 2
#> 8 2 12 3
#> 9 2 12 4
#> 10 2 12 5
#> 11 2 12 6
#> 12 2 12 7
#> 13 2 12 8
#> 14 2 12 9
#> 15 2 12 10
#> 16 2 12 11
#> 17 2 12 12
We can try the following data.table option
> setDT(df)[,.(nb_long = 1:nb),.(ID,nb)]
ID nb nb_long
1: 1 5 1
2: 1 5 2
3: 1 5 3
4: 1 5 4
5: 1 5 5
6: 2 12 1
7: 2 12 2
8: 2 12 3
9: 2 12 4
10: 2 12 5
11: 2 12 6
12: 2 12 7
13: 2 12 8
14: 2 12 9
15: 2 12 10
16: 2 12 11
17: 2 12 12
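For completeness, the same expansion needs no packages at all: `rep()` on the row indices duplicates the rows, and `sequence()` builds the within-ID counter in one call (a base R sketch of the same idea):

```r
df1 <- data.frame(ID = 1:2, nb = c(5, 12))

# Repeat row i of df1 exactly nb[i] times
out <- df1[rep(seq_len(nrow(df1)), times = df1$nb), ]
# sequence(c(5, 12)) is c(1:5, 1:12): the within-ID row counter
out$nb_long <- sequence(df1$nb)
rownames(out) <- NULL
```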
I hope anyone can help with this. I have a data frame similar to this:
test <- data.frame(ID = c(1:24),
group = rep(c(1,1,1,1,1,1,2,2,2,2,2,2),2),
year1 = rep(c(2018,2018,2018,2019,2019,2019),4),
month1 = rep(c(1,2,3),8))
Now I want to do a cumsum per group, but when I use the following code the cumsum 'restarts' each year.
test2 <-test %>%
group_by(group,year1,month1) %>%
summarise(a = length(unique(ID))) %>%
mutate(a = cumsum(a))
My desired output is:
group year1 month1 a
1 1 2018 1 2
2 1 2018 2 4
3 1 2018 3 6
4 1 2019 1 8
5 1 2019 2 10
6 1 2019 3 12
7 2 2018 1 2
8 2 2018 2 4
9 2 2018 3 6
10 2 2019 1 8
11 2 2019 2 10
12 2 2019 3 12
You could first count the unique IDs for each group, year and month, and then take the cumsum of the counts for each group.
library(dplyr)
test %>%
group_by(group, year1, month1) %>%
summarise(a = n_distinct(ID)) %>%
group_by(group) %>%
mutate(a = cumsum(a))
# group year1 month1 a
# <dbl> <dbl> <dbl> <int>
# 1 1 2018 1 2
# 2 1 2018 2 4
# 3 1 2018 3 6
# 4 1 2019 1 8
# 5 1 2019 2 10
# 6 1 2019 3 12
# 7 2 2018 1 2
# 8 2 2018 2 4
# 9 2 2018 3 6
#10 2 2019 1 8
#11 2 2019 2 10
#12 2 2019 3 12
With data.table, this can be done with
library(data.table)
setDT(test)[, .(a = uniqueN(ID)), by = .(group, year1, month1)
][, a := cumsum(a), by = group]
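The same two-step logic also translates to base R, in case neither package is at hand: `aggregate()` for the distinct count, then `ave()` for the within-group running sum (a sketch using the `test` data from the question):

```r
test <- data.frame(ID = 1:24,
                   group = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 2),
                   year1 = rep(c(2018, 2018, 2018, 2019, 2019, 2019), 4),
                   month1 = rep(c(1, 2, 3), 8))

# Step 1: count distinct IDs per (group, year1, month1)
agg <- aggregate(ID ~ group + year1 + month1, data = test,
                 FUN = function(x) length(unique(x)))
agg <- agg[order(agg$group, agg$year1, agg$month1), ]
# Step 2: cumulative sum of the counts within each group
agg$a <- ave(agg$ID, agg$group, FUN = cumsum)
```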
I have two data frames of the same respondents, one from Time 1 and the next from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT filled in columns):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 2
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't take into account friend ID numbers that appear in a different column in Time 2 (for example, respondent #6's friends 9 and 19 appear in both T1 and T2, but in different columns each time).
Another option:
library(tidyverse)
left_join(
gather(df1, key, x, -ID),
gather(df2, key, y, -ID),
by = c("ID", "key")
) %>%
group_by(ID) %>%
summarise(
num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
)
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
df1 <- df1 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
df2 <- df2 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
full_join(
count(lost_friends, ID, name = "num_lost_friends")
) %>%
full_join(
count(new_friends, ID, name = "num_new_friends")
) %>%
) %>%
mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
It's easy to group_by unique values of a variable:
library(tidyverse)
library(gapminder)
gapminder %>%
group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 5
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
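A small note on the arithmetic: `ceiling(i / n)` is the one-based counterpart of integer division, so `(i - 1) %/% n + 1` produces exactly the same block indices, which can be handy if you prefer to stay in integer arithmetic:

```r
i <- 1:12
# Both expressions assign consecutive indices to blocks of size 3
all(ceiling(i / 3) == (i - 1) %/% 3 + 1)  # TRUE
```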
Here's an alternative solution where you specify the maximum group size nn up front, and the process creates all the corresponding group columns:
library(tidyverse)
library(gapminder)
# input maximum group size
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
Here's a function that does the job
group_by_n = function(x, n) {
ux <- match(x, sort(unique(x)))
ceiling(ux / n)
}
It does not require that x be ordered, or that the values be evenly spaced or even numeric. Use it as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
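To illustrate the claim that x need not be ordered or numeric (the definition is repeated so the snippet is self-contained):

```r
group_by_n <- function(x, n) {
  ux <- match(x, sort(unique(x)))
  ceiling(ux / n)
}

# Unordered character input: groups follow the sorted order of the values,
# not their order of appearance
group_by_n(c("d", "a", "c", "b"), 2)  # 2 1 2 1
```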