pivot_longer two sets of variables into two columns in R

I want to pivot_longer into two columns based on two sets of variables.
For example:
df <- data.frame(year = rep(c(2010,2012,2017), 4),
party = rep(c("A", "A", "A", "B", "B", "B"), 2),
pp1 = rep(c(3,4,5,1,2,6), 2),
pp2 = rep(c(1,2,3,4,5,6), 2),
pp3 = rep(c(6,2,3,1,5,4), 2),
l_pp1 = rep(c(1,2,6,3,4,5), 2),
l_pp2 = rep(c(4,5,6,1,2,3), 2),
l_pp3 = rep(c(1,5,4,6,2,3), 2))
Data:
year party pp1 pp2 pp3 l_pp1 l_pp2 l_pp3
1 2010 A 3 1 6 1 4 1
2 2012 A 4 2 2 2 5 5
3 2017 A 5 3 3 6 6 4
4 2010 B 1 4 1 3 1 6
5 2012 B 2 5 5 4 2 2
6 2017 B 6 6 4 5 3 3
7 2010 A 3 1 6 1 4 1
8 2012 A 4 2 2 2 5 5
9 2017 A 5 3 3 6 6 4
10 2010 B 1 4 1 3 1 6
11 2012 B 2 5 5 4 2 2
12 2017 B 6 6 4 5 3 3
What I need is the following:
year party area pp l_pp
1 2010 A 1 3 1
2 2012 A 1 4 2
3 2017 A 1 5 6
4 2010 B 1 1 3
5 2012 B 1 2 4
etc.
Here pp and l_pp are the same area (pp1 & l_pp1 become pp and l_pp for area 1).
I would have thought something like this would work, but values_to only accepts a single name:
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp", "l_pp"))
This gets me somewhat close, but is not what I am looking for:
df <- df %>%
pivot_longer(!c("party", "year"), names_to = "area", values_to = c("pp"))
year party area pp
1 2010 A pp1 3
2 2010 A pp2 1
3 2010 A pp3 6
4 2010 A l_pp1 1
5 2010 A l_pp2 4
6 2010 A l_pp3 1

EDIT: Making use of the .value sentinel, this can be achieved with a single pivot_longer like so:
library(tidyr)
df %>%
pivot_longer(-c(year, party), names_to = c(".value", "area"), names_pattern = "^(.*?)(\\d+)$")
#> # A tibble: 36 × 5
#> year party area pp l_pp
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2010 A 1 3 1
#> 2 2010 A 2 1 4
#> 3 2010 A 3 6 1
#> 4 2012 A 1 4 2
#> 5 2012 A 2 2 5
#> 6 2012 A 3 2 5
#> 7 2017 A 1 5 6
#> 8 2017 A 2 3 6
#> 9 2017 A 3 3 4
#> 10 2010 B 1 1 3
#> # … with 26 more rows
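As a side note (an illustration, not part of the original answer), you can see what the names_pattern regex does by testing it on a couple of names with base R's sub(): the non-greedy (.*?) captures the prefix that feeds .value, and (\d+) captures the trailing digits that become area.

```r
# The pattern "^(.*?)(\\d+)$" splits a column name into a lazy prefix
# and the trailing run of digits.
pat <- "^(.*?)(\\d+)$"
nm <- c("pp1", "l_pp23")
prefix <- sub(pat, "\\1", nm, perl = TRUE)
digits <- sub(pat, "\\2", nm, perl = TRUE)
prefix  # "pp"   "l_pp"
digits  # "1"    "23"
```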
As a second option, the same result can be achieved with an additional pivot_wider. As an intermediate step, an id column has to be added to uniquely identify the rows in the data:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(!c(year, party), names_to = c("var", "area"), names_pattern = "(.*)(\\d)") %>%
group_by(year, party, area, var) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = var, values_from = value)
#> # A tibble: 36 x 6
#> year party area id pp l_pp
#> <dbl> <chr> <chr> <int> <dbl> <dbl>
#> 1 2010 A 1 1 3 1
#> 2 2010 A 2 1 1 4
#> 3 2010 A 3 1 6 1
#> 4 2012 A 1 1 4 2
#> 5 2012 A 2 1 2 5
#> 6 2012 A 3 1 2 5
#> 7 2017 A 1 1 5 6
#> 8 2017 A 2 1 3 6
#> 9 2017 A 3 1 3 4
#> 10 2010 B 1 1 1 3
#> # … with 26 more rows
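For completeness, base R's reshape() can produce the same long format without tidyr, since its varying argument accepts a list of column sets that share a time index. This is only a sketch alongside the answers above, reusing the df from the question:

```r
# Base-R sketch: each element of `varying` is one set of columns
# (pp1-pp3 and l_pp1-l_pp3) that collapses into one value column.
df <- data.frame(year = rep(c(2010, 2012, 2017), 4),
                 party = rep(c("A", "A", "A", "B", "B", "B"), 2),
                 pp1 = rep(c(3, 4, 5, 1, 2, 6), 2),
                 pp2 = rep(c(1, 2, 3, 4, 5, 6), 2),
                 pp3 = rep(c(6, 2, 3, 1, 5, 4), 2),
                 l_pp1 = rep(c(1, 2, 6, 3, 4, 5), 2),
                 l_pp2 = rep(c(4, 5, 6, 1, 2, 3), 2),
                 l_pp3 = rep(c(1, 5, 4, 6, 2, 3), 2))
long <- reshape(df, direction = "long",
                varying = list(c("pp1", "pp2", "pp3"),
                               c("l_pp1", "l_pp2", "l_pp3")),
                v.names = c("pp", "l_pp"),
                timevar = "area", times = 1:3)
nrow(long)  # 36, as with pivot_longer
```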


How to keep only first value from distinct values in one column based on repeated values in other column in R?

The code below should group the data by group and then create two new columns with the first and last value within each group.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, 9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
Explicitly namespacing the call as dplyr::mutate() did the trick; most likely another attached package (such as plyr) was masking mutate().
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
You can also use the summarise function from dplyr to get the first and last values of each group and join them back:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions or want a future-proof solution, you can just index the columns like you would a list:
> d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
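For comparison (an addition, not from the original answers), the grouped first/last columns can also be built in base R with ave(), which returns a vector aligned with the original rows:

```r
# ave() applies FUN within each group and recycles the result back
# onto the original row order. Values taken from the expected output.
d <- data.frame(group = rep(1:3, each = 3),
                year  = rep(2000:2002, 3),
                value = c(3, 8, 4, 8, 9, 1, 5, 9, 5))
d$first <- ave(d$value, d$group, FUN = function(x) x[1])
d$last  <- ave(d$value, d$group, FUN = function(x) x[length(x)])
```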

Changing the number of observations in a dataset by ID according to a given value

I have a dataset in R with an ID column and a count column nb. I want to change the data according to the nb variable, meaning ID = 1 will have 5 rows and ID = 2 will have 12 rows, as shown in the output below.
Is there any R function that I could use to transform my data :) ?
Thanks in advance
We need uncount from tidyr to expand based on the 'nb' column. By default it removes that column (.remove = TRUE), so change it to FALSE, and then create nb_long with row_number() after grouping by ID:
library(dplyr)
library(tidyr)
df1 %>%
uncount(nb, .remove = FALSE) %>%
group_by(ID) %>%
mutate(nb_long = row_number()) %>%
ungroup
Output:
# A tibble: 17 x 3
ID nb nb_long
<int> <dbl> <int>
1 1 5 1
2 1 5 2
3 1 5 3
4 1 5 4
5 1 5 5
6 2 12 1
7 2 12 2
8 2 12 3
9 2 12 4
10 2 12 5
11 2 12 6
12 2 12 7
13 2 12 8
14 2 12 9
15 2 12 10
16 2 12 11
17 2 12 12
data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
Here is another option: we map out the values from 1 to nb and then unnest the vector into long format.
#packages
library(tidyverse)
#data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
#solution
df1 %>%
mutate(nums = map(nb, ~seq(1, .x, by = 1))) %>%
unnest_longer(nums)
#> # A tibble: 17 x 3
#> ID nb nums
#> <int> <dbl> <dbl>
#> 1 1 5 1
#> 2 1 5 2
#> 3 1 5 3
#> 4 1 5 4
#> 5 1 5 5
#> 6 2 12 1
#> 7 2 12 2
#> 8 2 12 3
#> 9 2 12 4
#> 10 2 12 5
#> 11 2 12 6
#> 12 2 12 7
#> 13 2 12 8
#> 14 2 12 9
#> 15 2 12 10
#> 16 2 12 11
#> 17 2 12 12
We can try the following data.table option
> setDT(df)[,.(nb_long = 1:nb),.(ID,nb)]
ID nb nb_long
1: 1 5 1
2: 1 5 2
3: 1 5 3
4: 1 5 4
5: 1 5 5
6: 2 12 1
7: 2 12 2
8: 2 12 3
9: 2 12 4
10: 2 12 5
11: 2 12 6
12: 2 12 7
13: 2 12 8
14: 2 12 9
15: 2 12 10
16: 2 12 11
17: 2 12 12
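In base R the same expansion is just rep() on the row indices plus sequence() for the within-ID counter; a minimal sketch with the same df1:

```r
df1 <- data.frame(ID = 1:2, nb = c(5, 12))
# Repeat each row nb times, then number the copies within each ID.
out <- df1[rep(seq_len(nrow(df1)), df1$nb), ]
out$nb_long <- sequence(df1$nb)  # 1:5 followed by 1:12
nrow(out)  # 17
```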

Calculating cumulative sums within a group correctly

I hope someone can help with this. I have a data frame similar to this:
test <- data.frame(ID = c(1:24),
group = rep(c(1,1,1,1,1,1,2,2,2,2,2,2),2),
year1 = rep(c(2018,2018,2018,2019,2019,2019),4),
month1 = rep(c(1,2,3),8))
Now I want to do a cumsum per group, but when I use the following code the cumsum 'restarts' each year.
test2 <-test %>%
group_by(group,year1,month1) %>%
summarise(a = length(unique(ID))) %>%
mutate(a = cumsum(a))
My desired output is:
group year1 month1 a
1 1 2018 1 2
2 1 2018 2 4
3 1 2018 3 6
4 1 2019 1 8
5 1 2019 2 10
6 1 2019 3 12
7 2 2018 1 2
8 2 2018 2 4
9 2 2018 3 6
10 2 2019 1 8
11 2 2019 2 10
12 2 2019 3 12
You could first count unique IDs for each group, year and month, and then take the cumsum of it for each group.
library(dplyr)
test %>%
group_by(group, year1, month1) %>%
summarise(a = n_distinct(ID)) %>%
group_by(group) %>%
mutate(a = cumsum(a))
# group year1 month1 a
# <dbl> <dbl> <dbl> <int>
# 1 1 2018 1 2
# 2 1 2018 2 4
# 3 1 2018 3 6
# 4 1 2019 1 8
# 5 1 2019 2 10
# 6 1 2019 3 12
# 7 2 2018 1 2
# 8 2 2018 2 4
# 9 2 2018 3 6
#10 2 2019 1 8
#11 2 2019 2 10
#12 2 2019 3 12
With data.table, this can be done with
library(data.table)
setDT(test)[, .(a = uniqueN(ID)), by = .(group, year1, month1)
][, a := cumsum(a), by = group]
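The same two-step logic (count distinct IDs per group/year/month, then a cumulative sum within group) can also be sketched in base R with aggregate() and ave():

```r
test <- data.frame(ID = 1:24,
                   group  = rep(c(1,1,1,1,1,1,2,2,2,2,2,2), 2),
                   year1  = rep(c(2018,2018,2018,2019,2019,2019), 4),
                   month1 = rep(c(1,2,3), 8))
# Step 1: distinct-ID counts per (group, year1, month1).
agg <- aggregate(ID ~ group + year1 + month1, data = test,
                 FUN = function(x) length(unique(x)))
# Step 2: order chronologically, then cumulative-sum within each group.
agg <- agg[order(agg$group, agg$year1, agg$month1), ]
agg$a <- ave(agg$ID, agg$group, FUN = cumsum)
```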

Count number of new and lost friends between two data frames in R

I have two data frames of the same respondents, one from Time 1 and the next from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT: expected values filled in):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 2
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't take into account friend IDs that might appear in a different column in Time 2 (for example, respondent #6's friends 9 and 19 appear in both T1 and T2, but in different columns each time).
Another option:
library(tidyverse)
left_join(
gather(df1, key, x, -ID),
gather(df2, key, y, -ID),
by = c("ID", "key")
) %>%
group_by(ID) %>%
summarise(
num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
)
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
df1 <- df1 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
df2 <- df2 %>%
pivot_longer(starts_with("friend_"), values_to = "friend") %>%
drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
full_join(
count(lost_friends, ID, name = "num_lost_friends")
) %>%
full_join(
count(new_friends, ID, name = "num_new_friends")
) %>%
mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
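The per-respondent counts can also be computed row-wise in base R with setdiff(); a sketch where t1 and t2 are hypothetical stand-ins for the Time 1 and Time 2 data frames (ID first, then the friend columns). Note that setdiff() drops duplicates, which is fine here since friend IDs within a row are distinct.

```r
count_changes <- function(t1, t2) {
  t(sapply(seq_len(nrow(t1)), function(i) {
    f1 <- na.omit(unlist(t1[i, -1]))  # Time 1 friends for this row
    f2 <- na.omit(unlist(t2[i, -1]))  # Time 2 friends for this row
    c(ID = t1$ID[i],
      num_newfriends  = length(setdiff(f2, f1)),  # in T2 but not T1
      num_lostfriends = length(setdiff(f1, f2)))  # in T1 but not T2
  }))
}
# Tiny made-up example (two respondents):
t1 <- data.frame(ID = 1:2, friend_1 = c(4, 8), friend_2 = c(12, 6),
                 friend_3 = c(7, 7))
t2 <- data.frame(ID = 1:2, friend_1 = c(4, 8), friend_2 = c(12, 6),
                 friend_3 = c(3, 14))
res <- count_changes(t1, t2)
```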

group_by n unique sequential values of a variable

It's easy to group_by unique values of a variable:
library(tidyverse)
library(gapminder)
gapminder %>%
group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 2
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
Here's an alternative solution, where you can specify the number of groups you want in the beginning and the process creates the corresponding groups:
library(tidyverse)
library(gapminder)
# input number of groups
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
Here's a function that does the job
group_by_n = function(x, n) {
ux <- match(x, sort(unique(x)))
ceiling(ux / n)
}
It does not require that x be ordered, or that the values be evenly spaced or even numeric. Use it as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
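A quick sanity check of the idea on unordered, unevenly spaced input (restating the function so the snippet is self-contained):

```r
group_by_n <- function(x, n) {
  ux <- match(x, sort(unique(x)))  # rank of each value among the unique values
  ceiling(ux / n)
}
group_by_n(c(2007, 1952, 1957, 2007), 2)
#> [1] 2 1 1 2
```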
