"spread" multiple variables using pivot_wider() - r

What is the best way to "spread" multiple variables using pivot_wider() in the development version of tidyr?
# https://tidyr.tidyverse.org/dev/reference/pivot_wider.html
# devtools::install_github("tidyverse/tidyr")
library(tidyr)
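library(tidyverse)
library(lubridate) # provides as_datetime(); library(tidyverse) attaches lubridate only from tidyverse 2.0.0 onward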
have <- tibble::tribble(
~user_id, ~question, ~answer, ~timestamp,
1, "q1", "a1", "2019-07-22 16:54:43",
1, "q2", "a2", "2019-07-22 16:55:43",
2, "q1", "a1", "2019-07-22 16:56:43",
2, "q2", "a2", "2019-07-22 16:57:43",
3, "q1", "a1", "2019-07-22 16:58:43",
3, "q2", "a2", "2019-07-22 16:59:43"
) %>%
mutate(timestamp = as_datetime(timestamp))
have
# # A tibble: 6 x 4
# user_id question answer timestamp
# <dbl> <chr> <chr> <dttm>
# 1 1 q1 a1 2019-07-22 16:54:43
# 2 1 q2 a2 2019-07-22 16:55:43
# 3 2 q1 a1 2019-07-22 16:56:43
# 4 2 q2 a2 2019-07-22 16:57:43
# 5 3 q1 a1 2019-07-22 16:58:43
# 6 3 q2 a2 2019-07-22 16:59:43
want <- tibble::tribble(
~user_id, ~q1, ~q2, ~timestamp_q1, ~timestamp_q2,
1, "a1", "a2", "2019-07-22 16:54:43", "2019-07-22 16:55:43",
2, "a1", "a2", "2019-07-22 16:56:43", "2019-07-22 16:57:43",
3, "a1", "a2", "2019-07-22 16:58:43", "2019-07-22 16:59:43"
) %>%
mutate(timestamp_q1 = as_datetime(timestamp_q1)) %>%
mutate(timestamp_q2 = as_datetime(timestamp_q2))
want
# A tibble: 3 x 5
# user_id q1 q2 timestamp_q1 timestamp_q2
# <dbl> <chr> <chr> <dttm> <dttm>
#1 1 a1 a2 2019-07-22 16:54:43 2019-07-22 16:55:43
#2 2 a1 a2 2019-07-22 16:56:43 2019-07-22 16:57:43
#3 3 a1 a2 2019-07-22 16:58:43 2019-07-22 16:59:43
This works if you want to spread a single pair of variables, but it fails here because timestamp is also treated as an identifying variable, whereas only user_id should be:
have %>%
pivot_wider(names_from = question, values_from = answer)
# # A tibble: 6 x 4
# user_id timestamp q1 q2
# <dbl> <dttm> <chr> <chr>
# 1 1 2019-07-22 16:54:43 a1 NA
# 2 1 2019-07-22 16:55:43 NA a2
# 3 2 2019-07-22 16:56:43 a1 NA
# 4 2 2019-07-22 16:57:43 NA a2
# 5 3 2019-07-22 16:58:43 a1 NA
# 6 3 2019-07-22 16:59:43 NA a2

You can include multiple columns in the values_from argument to spread multiple columns in one go:
have %>%
pivot_wider(
id_cols = user_id,
names_from = question,
values_from = c(answer, timestamp)
) %>%
# remove the 'answer_' prefix from those cols
rename_all(~ str_remove(., "answer_"))
Output:
# A tibble: 3 x 5
user_id q1 q2 timestamp_q1 timestamp_q2
<dbl> <chr> <chr> <dttm> <dttm>
1 1 a1 a2 2019-07-22 16:54:43 2019-07-22 16:55:43
2 2 a1 a2 2019-07-22 16:56:43 2019-07-22 16:57:43
3 3 a1 a2 2019-07-22 16:58:43 2019-07-22 16:59:43
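For reference, here is a self-contained sketch of the same reshaping. It makes two substitutions that are assumptions of this sketch rather than part of the answer above: timestamps are parsed with base as.POSIXct() in UTC instead of as_datetime(), and rename_with() (dplyr >= 1.0.0) stands in for rename_all(). The intermediate names(wide) call shows where the 'answer_' prefix comes from: with several values_from columns, each new column is named after the value column followed by the question.

```r
library(dplyr)
library(tidyr)

have <- tibble::tribble(
  ~user_id, ~question, ~answer, ~timestamp,
  1, "q1", "a1", "2019-07-22 16:54:43",
  1, "q2", "a2", "2019-07-22 16:55:43",
  2, "q1", "a1", "2019-07-22 16:56:43",
  2, "q2", "a2", "2019-07-22 16:57:43",
  3, "q1", "a1", "2019-07-22 16:58:43",
  3, "q2", "a2", "2019-07-22 16:59:43"
) %>%
  # assumption: UTC is used only to keep the example reproducible
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC"))

wide <- have %>%
  pivot_wider(
    id_cols = user_id,
    names_from = question,
    values_from = c(answer, timestamp)
  )

# columns at this point: user_id, answer_q1, answer_q2, timestamp_q1, timestamp_q2
names(wide)

# drop the 'answer_' prefix, leaving the timestamp_* names untouched
wide <- wide %>%
  rename_with(~ sub("^answer_", "", .x))
```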

Related

dplyr join with OR condition?

I am wondering whether there is any way, preferably in the tidyverse, to join two dataframes based on OR conditions.
There are two dataframes: df_obs and df_event.
a) The join should happen if obs_id matches event_id, and obs_date, event_date, or both are NA;
OR
b) obs_date and event_date are identical, and obs_id, event_id, or both are NA.
The match should not happen if obs_id differs from event_id (when neither is NA) OR
obs_date differs from event_date (when neither is NA).
The result should look like df_res below. The column 'event' from the df_event is added to the df_obs.
I have seen the answer to this question, but maybe there is a way around SQL?
df_obs <- tibble::tribble(
~obs, ~obs_date, ~obs_id,
"a1", NA, 10L,
"a2", "01/01/2000", NA,
"b", "02/01/2000", NA,
"a3", "03/01/2000", 10L
)
df_obs
#> # A tibble: 4 × 3
#> obs obs_date obs_id
#> <chr> <chr> <int>
#> 1 a1 <NA> 10
#> 2 a2 01/01/2000 NA
#> 3 b 02/01/2000 NA
#> 4 a3 03/01/2000 10
df_event <- tibble::tribble(
~event, ~event_date, ~event_id,
"A", "01/01/2000", 10L,
"B", "02/01/2000", NA
)
df_event
#> # A tibble: 2 × 3
#> event event_date event_id
#> <chr> <chr> <int>
#> 1 A 01/01/2000 10
#> 2 B 02/01/2000 NA
df_res <- tibble::tribble(
~obs, ~obs_date, ~obs_id, ~event,
"a1", NA, 10L, "A",
"a2", "01/01/2000", NA, "A",
"b", "02/01/2000", NA, "B",
"a3", "03/01/2000", 10L, NA
)
df_res
#> # A tibble: 4 × 4
#> obs obs_date obs_id event
#> <chr> <chr> <int> <chr>
#> 1 a1 <NA> 10 A
#> 2 a2 01/01/2000 NA A
#> 3 b 02/01/2000 NA B
#> 4 a3 03/01/2000 10 <NA>
Created on 2022-09-13 with reprex v2.0.2
I can come up only with this solution:
df_obs %>%
left_join(df_event, by = c("obs_id" = "event_id"), na_matches='never') %>%
mutate(event = ifelse(!(is.na(obs_date)|is.na(event_date)|obs_date == event_date), NA, event)) %>%
select(-event_date) %>%
left_join(df_event, by = c("obs_date" = "event_date"), na_matches='never') %>%
mutate(event.y = ifelse(!(is.na(obs_id)|is.na(event_id)|obs_id == event_id), NA, event.y)) %>%
select(-event_id) %>%
mutate(event = ifelse(is.na(event.x), event.y, event.x)) %>%
select(-c(event.x, event.y))
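The matching rule can also be written down more literally as a cross join followed by a filter. This is only a sketch under some assumptions: it needs dplyr >= 1.1.0 for cross_join(), it treats obs as a unique row key when joining the matches back, and it reads the rule as "both keys must be compatible (equal or NA) and at least one key must truly match". The name matches is introduced here for illustration.

```r
library(dplyr)  # cross_join() requires dplyr >= 1.1.0

df_obs <- tibble::tribble(
  ~obs, ~obs_date,    ~obs_id,
  "a1", NA,           10L,
  "a2", "01/01/2000", NA,
  "b",  "02/01/2000", NA,
  "a3", "03/01/2000", 10L
)

df_event <- tibble::tribble(
  ~event, ~event_date,  ~event_id,
  "A",    "01/01/2000", 10L,
  "B",    "02/01/2000", NA
)

matches <- df_obs %>%
  cross_join(df_event) %>%  # every obs row paired with every event row
  filter(
    # ids must not contradict each other (equal, or at least one NA)
    is.na(obs_id) | is.na(event_id) | obs_id == event_id,
    # dates must not contradict each other (equal, or at least one NA)
    is.na(obs_date) | is.na(event_date) | obs_date == event_date,
    # at least one of the two keys must be a real, non-NA match
    (!is.na(obs_id) & !is.na(event_id) & obs_id == event_id) |
      (!is.na(obs_date) & !is.na(event_date) & obs_date == event_date)
  ) %>%
  select(obs, event)

# assumption: obs uniquely identifies a row of df_obs
res <- left_join(df_obs, matches, by = "obs")
res
```

A cross join grows as nrow(df_obs) * nrow(df_event), so this form is mainly suitable for small tables.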

How to check if pairs remain the same between years?

I have a data frame with pairs of individual birds (male and female) that were observed in several years. I am trying to figure out whether these pairs have changed from one year to the next so that I can do some further analyses.
My data is structured like this:
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
# A tibble: 9 x 3
year Male Female
<int> <chr> <chr>
1 1 A1 X1
2 1 B1 Y1
3 1 C1 Z1
4 2 A1 X1
5 2 B1 Y2
6 2 C1 Z2
7 3 A1 X1
8 3 B1 Y2
9 3 C2 Z2
And my expected output is something like:
# A tibble: 9 x 5
year Male Female male_state female_state
<int> <chr> <chr> <chr> <chr>
1 1 A1 X1 new new
2 1 B1 Y1 new new
3 1 C1 Z1 new new
4 2 A1 X1 reunited reunited
5 2 B1 Y2 divorced new
6 2 C1 Z2 divorced new
7 3 A1 X1 reunited reunited
8 3 B1 Y2 reunited reunited
9 3 C2 Z2 new divorced
I cannot figure out how to check whether a value from a different column is the same in the year before (e.g. if the male ID is the same for a certain female in year 2 or 3 as in the year prior). Any ideas?
This (probably overcomplicated) pipe produces the following output.
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
dat %>%
mutate(pair=paste0(Male,Female)) %>%
arrange(pair,year) %>%
mutate(check = if_else((pair==lag(pair)) & (year>lag(year)), 'old couple', 'new couple')) %>%
mutate(check = if_else(is.na(check), 'new couple', check)) %>%
mutate(divorced = if_else((Male == lag(Male)) & (Female != lag(Female)), 'divorce', '')) %>%
mutate(divorced = if_else(is.na(divorced), '', divorced))
OUTPUT:
# A tibble: 9 × 6
year Male Female pair check divorced
<int> <chr> <chr> <chr> <chr> <chr>
1 1 A1 X1 A1X1 new couple ""
2 2 A1 X1 A1X1 old couple ""
3 3 A1 X1 A1X1 old couple ""
4 1 B1 Y1 B1Y1 new couple ""
5 2 B1 Y2 B1Y2 new couple "divorce"
6 3 B1 Y2 B1Y2 old couple ""
7 1 C1 Z1 C1Z1 new couple ""
8 2 C1 Z2 C1Z2 new couple "divorce"
9 3 C2 Z2 C2Z2 new couple ""
Try this:
library(tidyverse)
dat <- tibble(
year = rep(1:3, each = 3),
Male = c(
"A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"
),
Female = c(
"X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"
)
)
dat |>
mutate(pairing = str_c(Male, "|", Female)) |>
add_count(pairing) |>
group_by(pairing) |>
mutate(male_state = if_else(pairing == lag(pairing), "reunited", NA_character_),
female_state = if_else(pairing == lag(pairing), "reunited", NA_character_)) |>
group_by(Male) |>
mutate(
male_state = if_else(row_number() == 1, "new", male_state),
male_state = if_else(is.na(male_state), "divorced", male_state)
) |>
group_by(Female) |>
mutate(
female_state = if_else(row_number() == 1, "new", female_state),
female_state = if_else(is.na(female_state), "divorced", female_state)
) |>
arrange(year, Male)
#> # A tibble: 9 × 7
#> # Groups: Female [5]
#> year Male Female pairing n male_state female_state
#> <int> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 1 A1 X1 A1|X1 3 new new
#> 2 1 B1 Y1 B1|Y1 1 new new
#> 3 1 C1 Z1 C1|Z1 1 new new
#> 4 2 A1 X1 A1|X1 3 reunited reunited
#> 5 2 B1 Y2 B1|Y2 2 divorced new
#> 6 2 C1 Z2 C1|Z2 1 divorced new
#> 7 3 A1 X1 A1|X1 3 reunited reunited
#> 8 3 B1 Y2 B1|Y2 2 reunited reunited
#> 9 3 C2 Z2 C2|Z2 1 new divorced
Created on 2022-05-03 by the reprex package (v2.0.1)
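A more direct sketch of the same idea compares each bird's partner with its partner in the previous row of that bird's own group, using lag(). It assumes that, once sorted by year, consecutive rows for a bird correspond to consecutive years (there are no gap years in the example data); the name res is introduced here for illustration.

```r
library(dplyr)

dat <- tibble(
  year = rep(1:3, each = 3),
  Male = c("A1", "B1", "C1",
           "A1", "B1", "C1",
           "A1", "B1", "C2"),
  Female = c("X1", "Y1", "Z1",
             "X1", "Y2", "Z2",
             "X1", "Y2", "Z2")
)

res <- dat %>%
  arrange(year) %>%
  group_by(Male) %>%
  mutate(male_state = case_when(
    row_number() == 1     ~ "new",       # first record of this male
    Female == lag(Female) ~ "reunited",  # same partner as his previous record
    TRUE                  ~ "divorced"
  )) %>%
  group_by(Female) %>%
  mutate(female_state = case_when(
    row_number() == 1 ~ "new",       # first record of this female
    Male == lag(Male) ~ "reunited",  # same partner as her previous record
    TRUE              ~ "divorced"
  )) %>%
  ungroup() %>%
  arrange(year, Male)

res
```

If some birds are missing in certain years, the comparison would additionally need a check such as year == lag(year) + 1 to restrict it to the year immediately before.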

tidyverse: Simulating random sample with nested factor

I want to simulate a random sample with a nested factor. Factor Dept has two levels, A and B. Level A has two nested levels, A1 and A2; level B has three nested levels, B1, B2 and B3. I want to simulate a random sample from 2022-01-01 to 2022-01-31 using R. Part of the desired output is given below (2022-01-01 to 2022-01-03 only, for reference).
library(tibble)
set.seed(12345)
df1 <-
tibble(
Date = c(rep("2022-01-01", 5), rep("2022-01-02", 4), rep("2022-01-03", 4))
, Dept = c("A", "A", "B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "B")
, Prog = c("A1", "A2", "B1", "B2", "B3", "A1", "B1", "B2", "B3", "A1", "A2", "B2", "B3")
, Amount = runif(n = 13, min = 50000, max = 100000)
)
df1
#> # A tibble: 13 x 4
#> Date Dept Prog Amount
#> <chr> <chr> <chr> <dbl>
#> 1 2022-01-01 A A1 86045.
#> 2 2022-01-01 A A2 93789.
#> 3 2022-01-01 B B1 88049.
#> 4 2022-01-01 B B2 94306.
#> 5 2022-01-01 B B3 72824.
#> 6 2022-01-02 A A1 58319.
#> 7 2022-01-02 B B1 66255.
#> 8 2022-01-02 B B2 75461.
#> 9 2022-01-02 B B3 86385.
#> 10 2022-01-03 A A1 99487.
#> 11 2022-01-03 A A2 51727.
#> 12 2022-01-03 B B2 57619.
#> 13 2022-01-03 B B3 86784.
To sample randomly, create the fully expanded data with crossing(), then slice a random number of leading rows for each Date:
library(dplyr)
library(tidyr)
library(stringr)
crossing(Date = seq(as.Date("2022-01-01"), as.Date("2022-01-31"),
by = "1 day"), Dept = c("A", "B"), Prog = 1:3) %>%
mutate(Prog = str_c(Dept, Prog)) %>%
filter(Prog != "A3") %>%
mutate(Amount = runif(n = n(), min = 50000, max = 100000)) %>%
group_by(Date) %>%
slice(seq_len(sample(row_number(), 1))) %>%
ungroup
Output:
# A tibble: 102 × 4
Date Dept Prog Amount
<date> <chr> <chr> <dbl>
1 2022-01-01 A A1 83964.
2 2022-01-01 A A2 93428.
3 2022-01-01 B B1 85187.
4 2022-01-01 B B2 79144.
5 2022-01-01 B B3 65784.
6 2022-01-02 A A1 86014.
7 2022-01-03 A A1 76060.
8 2022-01-03 A A2 56412.
9 2022-01-03 B B1 87365.
10 2022-01-03 B B2 66169.
# … with 92 more rows
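Note that slice(seq_len(...)) keeps a random number of leading rows per date, so a date can never contain, say, B1 without A1 and A2. If an arbitrary subset of programmes per date is wanted, as in the desired output, one possible sketch is to draw random row positions within each date instead. The names progs, dates and sim are introduced here for illustration, and the grid is built with rep() rather than crossing() purely to keep the example dependent on dplyr alone.

```r
library(dplyr)

set.seed(12345)

# nested structure: each Prog belongs to exactly one Dept
progs <- tibble(
  Dept = c("A", "A", "B", "B", "B"),
  Prog = c("A1", "A2", "B1", "B2", "B3")
)
dates <- seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "1 day")

sim <- tibble(
  Date = rep(dates, each = nrow(progs)),
  Dept = rep(progs$Dept, times = length(dates)),
  Prog = rep(progs$Prog, times = length(dates))
) %>%
  group_by(Date) %>%
  # per date: pick a random count (1 to 5), then that many random rows
  filter(row_number() %in% sample(n(), size = sample(n(), 1))) %>%
  ungroup() %>%
  mutate(Amount = runif(n(), min = 50000, max = 100000))

sim
```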

Replacing values bigger than threshold with 0 in specified range of columns in R dataframe

I have a dataset df1 with columns from S_2018_p to S_2021_p where I want to replace the values >= 10 with 0. I would expect a dataset like df2.
library(data.table)
df1 = data.table(
ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
"string1" = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
"S_2018_p" = c(3,5,11,3,9,22,6),
"S_2019_p" = c(3,5,6,21,1,4,0),
"S_2020_p" = c(0,4,13,9,16,7,9),
"S_2021_p" = c(4,0,3,8,5,4,6),
"string2" = c("si", "q2", "oq", "mx", "ix", "p2", "2q"))
ID string1 S_2018_p S_2019_p S_2020_p S_2021_p string2
1: a1 x2 3 3 0 4 si
2: a2 g3 5 5 4 0 q2
3: a3 n2 11 6 13 3 oq
4: a4 m3 3 21 9 8 mx
5: a5 2w 9 1 16 5 ix
6: a6 ps2 22 4 7 4 p2
7: a7 kg2 6 0 9 6 2q
df2 = data.table(
ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
"string1" = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
"S_2018_p" = c(3,5,0,3,9,0,6),
"S_2019_p" = c(3,5,6,0,1,4,0),
"S_2020_p" = c(0,4,0,9,0,7,9),
"S_2021_p" = c(4,0,3,8,5,4,6),
"string2" = c("si", "q2", "oq", "mx", "ix", "p2", "2q"))
ID string1 S_2018_p S_2019_p S_2020_p S_2021_p string2
1: a1 x2 3 3 0 4 si
2: a2 g3 5 5 4 0 q2
3: a3 n2 0 6 0 3 oq
4: a4 m3 3 0 9 8 mx
5: a5 2w 9 1 0 5 ix
6: a6 ps2 0 4 7 4 p2
7: a7 kg2 6 0 9 6 2q
I tried to do it with mutate_if, but I don't get the desired output.
df1 %>%
mutate_if(is.numeric, ~1 * (. >= 10))
dplyr provides across() for use inside mutate(). It was introduced in dplyr 1.0.0, so update the package if your version is older.
library(dplyr)
df1 %>% mutate(across(where(is.numeric), function(x) ifelse(x >= 10, 0, x)))
ID string1 S_2018_p S_2019_p S_2020_p S_2021_p string2
1: a1 x2 3 3 0 4 si
2: a2 g3 5 5 4 0 q2
3: a3 n2 0 6 0 3 oq
4: a4 m3 3 0 9 8 mx
5: a5 2w 9 1 0 5 ix
6: a6 ps2 0 4 7 4 p2
7: a7 kg2 6 0 9 6 2q
You can use the apply and ifelse functions on the specific columns of interest. For example:
apply(df1[,c(3,4,5,6)], MARGIN = c(1,2), FUN = function(x) ifelse(x >= 10, 0, x))
apply() operates on the selected columns of the data frame (df1[,c(3,4,5,6)]) and applies FUN = function(x) ifelse(x >= 10, 0, x) to each cell (MARGIN = c(1,2)): if x is greater than or equal to 10 it is replaced with 0, otherwise it is left unchanged.
You can then assign the result back to those columns:
df1[,c(3,4,5,6)] <- apply(df1[,c(3,4,5,6)], MARGIN = c(1,2), FUN = function(x) ifelse(x >= 10, 0, x))
Another solution, based on dplyr:
library(tidyverse)
library(data.table)
df1 = data.table(
ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
"string1" = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
"S_2018_p" = c(3,5,11,3,9,22,6),
"S_2019_p" = c(3,5,6,21,1,4,0),
"S_2020_p" = c(0,4,13,9,16,7,9),
"S_2021_p" = c(4,0,3,8,5,4,6),
"string2" = c("si", "q2", "oq", "mx", "ix", "p2", "2q"))
df1 %>%
mutate(
across(
where(is.numeric),
# >= (not >) so that a value of exactly 10 is also replaced
~ if_else(.x >= 10, 0, .x)))
#> ID string1 S_2018_p S_2019_p S_2020_p S_2021_p string2
#> 1: a1 x2 3 3 0 4 si
#> 2: a2 g3 5 5 4 0 q2
#> 3: a3 n2 0 6 0 3 oq
#> 4: a4 m3 3 0 9 8 mx
#> 5: a5 2w 9 1 0 5 ix
#> 6: a6 ps2 0 4 7 4 p2
#> 7: a7 kg2 6 0 9 6 2q
And yet another solution, based on purrr::map_if:
library(tidyverse)
library(data.table)
df1 = data.table(
ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
"string1" = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
"S_2018_p" = c(3,5,11,3,9,22,6),
"S_2019_p" = c(3,5,6,21,1,4,0),
"S_2020_p" = c(0,4,13,9,16,7,9),
"S_2021_p" = c(4,0,3,8,5,4,6),
"string2" = c("si", "q2", "oq", "mx", "ix", "p2", "2q"))
df1 %>% map_if(is.numeric, ~ ifelse(.x >= 10, 0, .x)) %>% as.data.table
#> ID string1 S_2018_p S_2019_p S_2020_p S_2021_p string2
#> 1: a1 x2 3 3 0 4 si
#> 2: a2 g3 5 5 4 0 q2
#> 3: a3 n2 0 6 0 3 oq
#> 4: a4 m3 3 0 9 8 mx
#> 5: a5 2w 9 1 0 5 ix
#> 6: a6 ps2 0 4 7 4 p2
#> 7: a7 kg2 6 0 9 6 2q
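The answers above select columns by type (is.numeric), which happens to coincide with the S_2018_p to S_2021_p range in this data. Since df1 is a data.table, a sketch that names the range explicitly could use .SDcols and update by reference. The cols vector is a helper introduced here, and replace() is used as a plain base R alternative to ifelse().

```r
library(data.table)

df1 <- data.table(
  ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
  string1 = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
  S_2018_p = c(3, 5, 11, 3, 9, 22, 6),
  S_2019_p = c(3, 5, 6, 21, 1, 4, 0),
  S_2020_p = c(0, 4, 13, 9, 16, 7, 9),
  S_2021_p = c(4, 0, 3, 8, 5, 4, 6),
  string2 = c("si", "q2", "oq", "mx", "ix", "p2", "2q")
)

# the columns to treat, spelled out by name
cols <- paste0("S_", 2018:2021, "_p")

# modify df1 in place: values >= 10 become 0 in the chosen columns only
df1[, (cols) := lapply(.SD, function(x) replace(x, x >= 10, 0)), .SDcols = cols]

df1
```

Naming the columns keeps any other numeric column that might be added later out of the replacement.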

Adding a pvalue column to dataframe in R

I have a dataframe that looks like this:
A1 A2 A3 B1 B2 B3
0 1 0 2 3 3
5 6 4 4 6 6
I would like to add a column based on t-testing the significance of the difference between As and Bs:
A1 A2 A3 B1 B2 B3 PValue
0 1 0 2 3 3 <some small number>
5 6 4 4 6 6 <some large number>
I tried using dplyr like this:
data %>%
mutate(PValue = t.test(unlist(c(A1,A2,A3)), unlist(c(B1,B2,B3)))$p.value)
However, the resulting PValue column is constant for some reason. I would appreciate any help.
If we are doing this by row, then pmap is one way
library(tidyverse)
pmap_dbl(data, ~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(data, PValue = .)
# A1 A2 A3 B1 B2 B3 PValue
#1 0 1 0 2 3 3 0.007762603
#2 5 6 4 4 6 6 0.725030185
or another option is rowwise with do
data %>%
rowwise() %>%
do(data.frame(., PValue = t.test(unlist(.[1:3]), unlist(.[4:6]))$p.value))
# A tibble: 2 x 7
# A1 A2 A3 B1 B2 B3 PValue
#* <int> <int> <int> <int> <int> <int> <dbl>
#1 0 1 0 2 3 3 0.00776
#2 5 6 4 4 6 6 0.725
Or we can gather to 'long' format and then do the group by t.test
data %>%
rownames_to_column('rn') %>%
gather(key, val, -rn) %>% group_by(rn) %>%
summarise(PValue = t.test(val[str_detect(key, "A")],
val[str_detect(key, "B")])$p.value) %>%
pull(PValue) %>%
bind_cols(data, PValue = .)
Data:
data <- structure(list(A1 = c(0L, 5L), A2 = c(1L, 6L), A3 = c(0L, 4L),
B1 = c(2L, 4L), B2 = c(3L, 6L), B3 = c(3L, 6L)), .Names = c("A1",
"A2", "A3", "B1", "B2", "B3"), class = "data.frame", row.names = c(NA,
-2L))
Also with apply in Base R:
data$PValue = apply(data, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
or:
library(dplyr)
data %>%
mutate(PValue = apply(., 1, function(x) t.test(x[1:3], x[4:6])$p.value))
Result:
A1 A2 A3 B1 B2 B3 PValue
1 0 1 0 2 3 3 0.007762603
2 5 6 4 4 6 6 0.725030185
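The constant column in the question arises because, without grouping, mutate() evaluates t.test() once on the full columns and recycles the single p-value. With dplyr >= 1.0.0, a row-wise sketch using rowwise() and c_across() is one way to run the test per row; the name res is introduced here for illustration.

```r
library(dplyr)

data <- data.frame(
  A1 = c(0L, 5L), A2 = c(1L, 6L), A3 = c(0L, 4L),
  B1 = c(2L, 4L), B2 = c(3L, 6L), B3 = c(3L, 6L)
)

res <- data %>%
  rowwise() %>%  # treat each row as its own group
  mutate(PValue = t.test(c_across(A1:A3), c_across(B1:B3))$p.value) %>%
  ungroup()

res
```

Selecting by column range (A1:A3, B1:B3) avoids relying on positional indices such as x[1:3] and x[4:6].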
