Pivot/gather multiple "crossed" values belonging to a common key

I have some strangely stored time series data. Two kinds of values, event and foo, can be observed together for different phenomena a and b. The observations are made at times t and belong to different categories (these are basically different recordings).
Everything is stored as follows, in a kind of mixed wide format:
> tibble(category = c("x", "x", "y", "y"), t = c(1:2, 1:2),
         event_a = c(T, T, F, F), event_b = c(T, F, T, F),
         foo_a = c(1, 2, 3, 4), foo_b = c(10, 20, 30, 40))
# A tibble: 4 x 6
  category     t event_a event_b foo_a foo_b
  <chr>    <int> <lgl>   <lgl>   <dbl> <dbl>
1 x            1 TRUE    TRUE        1    10
2 x            2 TRUE    FALSE       2    20
3 y            1 FALSE   TRUE        3    30
4 y            2 FALSE   FALSE       4    40
Now I want to convert it to long format, with the phenomena used to index the kind of event with a value, and the foo values matched to them via a/b:
# A tibble: 8 x 5
  category     t event value   foo
  <chr>    <dbl> <chr> <lgl> <dbl>
1 x            1 a     TRUE      1
2 x            1 b     TRUE     10
3 x            2 a     TRUE      2
4 x            2 b     FALSE    20
5 y            1 a     FALSE     3
6 y            1 b     TRUE     30
7 y            2 a     FALSE     4
8 y            2 b     FALSE    40
I'm looking for some sort of tidyr (or at least tidyverse) solution using gather/pivot_longer and friends, but couldn't come up with anything useful, since there are multiple value columns in the result. I also thought about a join with the foo columns split off, but didn't really succeed, and I'm not deep enough into SQL to know what goes wrong there...

This is a complicated way of solving the problem, but it works.
The idea is to handle the multiple-value-column issue in two steps: a pivot_longer() for each of event_* and foo_*, then bind_cols() the results. Finally, remove the prefix 'event_' from the new column event.
library(tidyverse)

df1 %>%
  dplyr::select(-starts_with('foo')) %>%
  pivot_longer(
    cols = starts_with('event'),
    names_to = 'event',
    values_to = 'value'
  ) %>%
  bind_cols(
    df1 %>%
      dplyr::select(-starts_with('event')) %>%
      pivot_longer(
        cols = starts_with('foo'),
        values_to = 'foo'
      ) %>%
      dplyr::select(-category, -t, -name)
  ) %>%
  mutate(event = sub('event_', '', event))
# A tibble: 8 x 5
#  category     t event value   foo
#  <chr>    <int> <chr> <lgl> <dbl>
#1 x            1 a     TRUE      1
#2 x            1 b     TRUE     10
#3 x            2 a     TRUE      2
#4 x            2 b     FALSE    20
#5 y            1 a     FALSE     3
#6 y            1 b     TRUE     30
#7 y            2 a     FALSE     4
#8 y            2 b     FALSE    40
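For the record, tidyr 1.0.0 and later can also do this in a single pivot_longer() call via the ".value" sentinel, which avoids the bind_cols() step entirely. A sketch; the rename() at the end is only needed because the desired output reuses the name event for the a/b key:
df1 %>%
  pivot_longer(
    cols = -c(category, t),
    names_to = c(".value", "key"),  # ".value" keeps event/foo as columns, "key" gets a/b
    names_sep = "_"
  ) %>%
  rename(value = event, event = key)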

complex(ish) pivot_longer in R

I have a dataset that roughly looks like this:
  person_id mem_was_there_1 mem_was_there_2 mem_was_there_3 new_number_yn_1 new_number_yn_2 new_number_yn_3
      <dbl>           <dbl>           <dbl>           <dbl> <lgl>           <lgl>           <lgl>
1       100               1               2               3 FALSE           TRUE            FALSE
2       101               4               5               6 TRUE            FALSE           FALSE
I need to pivot this data into something like this:
# A tibble: 6 x 4
  person_id    nr mem_was_there new_number_yn
      <dbl> <dbl>         <dbl> <lgl>
1       100     1             1 FALSE
2       100     2             2 TRUE
3       100     3             3 FALSE
4       101     1             4 TRUE
5       101     2             5 FALSE
6       101     3             6 FALSE
I would like to use a pivot_longer() option from tidyr. I tried the code below, but I do not know what to fill in at the ??? to make the regex split on the third _. Ideally, I would like a separate names_sep for both 'mem_was_there_xx' and 'new_number_yn_xx'.
df1 %>%
  pivot_longer(cols = c(matches("^mem_was_there"), matches("^new_number_yn")),
               names_to = c('.value', 'nr'),
               names_sep = ???)

df1 <-
  tribble(~person_id, ~mem_was_there_1, ~mem_was_there_2, ~mem_was_there_3, ~new_number_yn_1, ~new_number_yn_2, ~new_number_yn_3,
          100, 1, 2, 3, F, T, F,
          101, 4, 5, 6, T, F, F)
This should do the trick:
spec <- data.frame(.name = names(df1)[-1],
                   nr = rep(1:3, 2),
                   .value = c(rep("mem_was_there", 3), rep("new_number_yn", 3)),
                   stringsAsFactors = FALSE)

library(tidyverse)

df1 %>%
  pivot_longer_spec(., spec)
gives:
# # A tibble: 6 x 4
#   person_id    nr mem_was_there new_number_yn
#       <dbl> <int>         <dbl> <lgl>
# 1       100     1             1 FALSE
# 2       100     2             2 TRUE
# 3       100     3             3 FALSE
# 4       101     1             4 TRUE
# 5       101     2             5 FALSE
# 6       101     3             6 FALSE
As deschen recommended, I played around with the regex a bit, and this pivot_longer() call works as expected. It is a bit cleaner than having to manually create the spec for pivot_longer_spec(), and it also works if there are unequal numbers of mem_was_there_x and new_number_yn_y columns (it will just insert missings where applicable).
df1 %>%
  pivot_longer(
    cols = c(matches("^mem_was_there"), matches("^new_number_yn")),
    names_to = c('.value', 'nr'),
    names_pattern = "([A-Za-z_]+)_([0-9]+)")  # the final underscore stays out of .value
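For what it's worth, a slightly more general pattern anchors on the trailing number instead of spelling out a character class, so it also tolerates digits in the name stem; a sketch:
df1 %>%
  pivot_longer(
    cols = -person_id,
    names_to = c(".value", "nr"),
    names_pattern = "(.*)_([0-9]+)$"  # everything before the final _<digits> becomes .value
  )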

Binding dataframes with different column names by row

I imported an Excel sheet as a list of dataframes. I want to merge the list into one dataframe. bind_rows() lets me easily stack the dataframes, but the issue is that one variable/column has a different name in each dataframe. bind_rows() will by default create two separate columns, with NA values where the other dataframes have no data. How can I join these columns?
Sample code:
# Sample dataframes
df1 <- tibble(A = c(1,2,3),
              B = c("X","Y","Z"),
              C = c(T,F,F)
)
df2 <- tibble(A = c(3,4,5),
              B = c("U","V","W"),
              D = c(T,T,F)
)

# List of dataframes
my_ls <- list(df1, df2)
my_ls
[[1]]
# A tibble: 3 x 3
      A B     C
  <dbl> <chr> <lgl>
1     1 X     TRUE
2     2 Y     FALSE
3     3 Z     FALSE

[[2]]
# A tibble: 3 x 3
      A B     D
  <dbl> <chr> <lgl>
1     3 U     TRUE
2     4 V     TRUE
3     5 W     FALSE
# Creating joined dataframe:
my_df <- bind_rows(my_ls)
my_df
# Current outcome:
# A tibble: 6 x 4
      A B     C     D
  <dbl> <chr> <lgl> <lgl>
1     1 X     TRUE  NA
2     2 Y     FALSE NA
3     3 Z     FALSE NA
4     3 U     NA    TRUE
5     4 V     NA    TRUE
6     5 W     NA    FALSE
The desired outcome:
# Desired outcome:
# A tibble: 6 x 3
      A B     C
  <dbl> <chr> <lgl>
1     1 X     TRUE
2     2 Y     FALSE
3     3 Z     FALSE
4     3 U     TRUE
5     4 V     TRUE
6     5 W     FALSE
Currently, I've been using mutate() with case_when(), checking which column is not empty (!is.na()). This works, but I can't help thinking there must be an easier way.
# Example using mutate
my_df <- my_df %>%
  mutate(
    C = case_when(is.na(C) & !is.na(D) ~ D,
                  !is.na(C) & is.na(D) ~ C,
                  # The lines below may be a bit redundant for my purpose, since
                  # the dataframes have either the C or the D variable.
                  !is.na(C) & !is.na(D) ~ C, # better would be to flag that the variables overlap
                  is.na(C) & is.na(D) ~ NA
    )
  ) %>%
  select(-D)
my_df
# A tibble: 6 x 3
      A B     C
  <dbl> <chr> <lgl>
1     1 X     TRUE
2     2 Y     FALSE
3     3 Z     FALSE
4     3 U     TRUE
5     4 V     TRUE
6     5 W     FALSE
You can bind_rows() and then pick the first non-NA value with coalesce(); element-wise, coalesce() returns the first non-missing of its arguments, which is exactly the is.na() bookkeeping the case_when() above spells out by hand:
library(dplyr)

bind_rows(my_ls) %>% mutate(C = coalesce(C, D)) %>% select(A:C)
#      A B     C
#  <dbl> <chr> <lgl>
#1     1 X     TRUE
#2     2 Y     FALSE
#3     3 Z     FALSE
#4     3 U     TRUE
#5     4 V     TRUE
#6     5 W     FALSE
Following the comment by #KarthikS, you can rename your columns before binding. My approach using rename_with() does not require the columns to be in a specific order. To illustrate this, I used somewhat different example dataframes:
library(purrr)
library(dplyr)

d1 <- data.frame(A = 1, B = 2, C = 3)
d2 <- data.frame(A = 4, B = 5, D = 6)
d3 <- data.frame(D = 7, A = 8, B = 9)
d <- list(d1, d2, d3)

map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
  bind_rows()
#>   A B C
#> 1 1 2 3
#> 2 4 5 6
#> 3 8 9 7
And now for your dataset:
d <- list(df1, df2)

map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
  bind_rows()
#> # A tibble: 6 x 3
#>       A B     C
#>   <dbl> <chr> <lgl>
#> 1     1 X     TRUE
#> 2     2 Y     FALSE
#> 3     3 Z     FALSE
#> 4     3 U     TRUE
#> 5     4 V     TRUE
#> 6     5 W     FALSE
And if we add an additional one with a different column order:
df3 <- tibble(D = c(T,T,F),
              A = c(7,8,9),
              B = c("A","B","C"))
d <- list(df1, df2, df3)

map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
  bind_rows()
#> # A tibble: 9 x 3
#>       A B     C
#>   <dbl> <chr> <lgl>
#> 1     1 X     TRUE
#> 2     2 Y     FALSE
#> 3     3 Z     FALSE
#> 4     3 U     TRUE
#> 5     4 V     TRUE
#> 6     5 W     FALSE
#> 7     7 A     TRUE
#> 8     8 B     TRUE
#> 9     9 C     FALSE
Created on 2020-10-16 by the reprex package (v0.3.0)
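If the stray column's name varies and is not known up front, the same rename_with() trick can target "whatever is not in the shared set" instead of a literal name. A sketch, assuming every frame has exactly one stray column; shared is a hypothetical helper vector, not part of the original answer:
shared <- c("A", "B")  # columns every dataframe is known to have
map(d, ~ rename_with(.x, ~ "C", !all_of(shared))) %>%
  bind_rows()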
Apologies for breaking out of the tidyverse for a quick answer:
expl <- read.table(text= " A B C D
1 1 X TRUE NA
2 2 Y FALSE NA
3 3 Z FALSE NA
4 3 U NA TRUE
5 4 V NA TRUE
6 5 W NA FALSE")
expl$E <- ifelse(is.na(expl$C), expl$D, expl$C)
print(expl)
or maybe
expl[,c("C", "D")] %>% rowMeans(na.rm = TRUE) %>% as.logical()
EDIT: Translated the latter to tidy:
expl %>% select("C", "D") %>% rowMeans(na.rm = TRUE) %>% as.logical()
EDIT after first comment:
If you want more control, you should probably write out what you want to do in each case in a function, similar to the following example:
library(magrittr)
expl <- read.table(text= " A B C D
1 1 X TRUE NA
2 2 Y FALSE NA
3 3 Z FALSE NA
4 3 U NA TRUE
5 4 V NA TRUE
6 5 W NA FALSE
7 7 I NA NA
8 9 J TRUE TRUE")
myfun <- function(a, b){
  if(is.na(a) & is.na(b))
    return(NA)
  if(!is.na(a) & !is.na(b)) {
    warning("too much information, a and b set!")
    return(NaN)
  }
  return(max(a, b, na.rm = TRUE))
}
myfun <- Vectorize(myfun)

myfun(expl$C, expl$D) %>% as.logical()

Add column to grouped data that assigns 1 to individuals and randomly assigns 1 or 0 to pairs

I have a dataframe...
df <- tibble(
  id = 1:7,
  family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families with only one member (e.g. c, d and e) and, for families containing 2 members (a and b in the example), randomly gives one member 1 and the other 0.
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
  id = 1:7,
  family = c("a","a","b","b","c", "d", "e"),
  random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by() and mutate(), since I am mostly using the tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
  group_by(family) %>%
  mutate(
    random = if_else(
      condition = n() == 1,
      true = 1,
      false = as.double(sample(0:1, 1, replace = TRUE))
    )
  )
You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
  group_by(family) %>%
  mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups:   family [5]
#>      id family random
#>   <int> <chr>   <dbl>
#> 1     1 a           0
#> 2     2 a           1
#> 3     3 b           0
#> 4     4 b           1
#> 5     5 c           1
#> 6     6 d           1
#> 7     7 e           1
We can use if/else
library(dplyr)

df %>%
  group_by(family) %>%
  mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups:   family [5]
#     id family random
#  <int> <chr>   <dbl>
#1     1 a           0
#2     2 a           1
#3     3 b           1
#4     4 b           0
#5     5 c           1
#6     6 d           1
#7     7 e           1
Another option
df %>%
  group_by(family) %>%
  mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups:   family [5]
#      id family random
#   <int> <chr>   <dbl>
# 1     1 a           1
# 2     2 a           0
# 3     3 b           1
# 4     4 b           0
# 5     5 c           1
# 6     6 d           1
# 7     7 e           1
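A further variation on the same idea, for the record: draw one row index per family to receive the 1. A minimal sketch; the unary + only converts the logical comparison to numeric:
df %>%
  group_by(family) %>%
  mutate(random = +(row_number() == sample(n(), 1)))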

R: Sum the Max Values of Unique Rows with dplyr

I am trying to compute, for each task in a dataset, a sum that uses the largest value observed for each id only once. If that's not clear, I've provided an example of the desired output below.
Sample Data
dat <- data.frame(task = rep(LETTERS[1:3], each = 3),
                  id = c(rep(1:2, 4), 3),
                  value = c(rep(c(10, 20), 4), 5))
dat
  task id value
1    A  1    10
2    A  2    20
3    A  1    10
4    B  2    20
5    B  1    10
6    B  2    20
7    C  1    10
8    C  2    20
9    C  3     5
I've found an answer that works, but it requires two separate group_by() calls. Is there a way to get the same output with a single group_by()? The reason is that I have other summarized metrics that are sensitive to the grouping, and I can't run two different group_by() calls in the same pipeline.
dat %>%
  group_by(task, id) %>%
  summarize(v = max(value)) %>%
  group_by(task) %>%
  summarize(unique_ids = n_distinct(id),
            value_sum = sum(v))
# A tibble: 3 × 3
  task  unique_ids value_sum
  <chr>      <int>     <dbl>
1 A              2        30
2 B              2        30
3 C              3        35
I've found something that works using tapply().
dat %>%
  group_by(task) %>%
  summarize(unique_ids = length(unique(id)),
            value_sum = sum(tapply(value, id, FUN = max)))
# A tibble: 3 × 3
  task  unique_ids value_sum
  <chr>      <int>     <dbl>
1 A              2        30
2 B              2        30
3 C              3        35
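For what it's worth, if you would rather avoid tapply() inside summarize(), the same one-max-per-id sum can be written with base split() plus vapply(); a sketch:
dat %>%
  group_by(task) %>%
  summarize(unique_ids = n_distinct(id),
            value_sum  = sum(vapply(split(value, id), max, numeric(1))))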

check whether steps in a counter variable are missing

I have a datafile with one row per participant (numbered 1-x, based on the study they took part in). I want to check whether all participants are present in the dataset. In this toy dataset, personid identifies the participants and study is the study they took part in.
df <- read.table(text = "personid study measurement
1 x 23
2 x 32
1 y 21
3 y 23
4 y 23
6 y 23", header=TRUE)
which looks like this:
  personid study measurement
1        1     x          23
2        2     x          32
3        1     y          21
4        3     y          23
5        4     y          23
6        6     y          23
So for y, I am missing participants 2 and 5. How do I check that automatically? I tried adding a counter variable and comparing it to the participant id, but once one participant is missing, the comparison is meaningless because the alignment is off.
df %>% group_by(study) %>% mutate(id = 1:n(), check = id == personid)
Source: local data frame [6 x 5]
Groups: study [2]
  personid  study measurement    id check
     <int> <fctr>       <int> <int> <lgl>
1        1      x          23     1  TRUE
2        2      x          32     2  TRUE
3        1      y          21     1  TRUE
4        3      y          23     2 FALSE
5        4      y          23     3 FALSE
6        6      y          23     4 FALSE
Assuming your personid values are sequential, you can do this using setdiff(), i.e.
library(dplyr)

df %>%
  group_by(study) %>%
  mutate(new = toString(setdiff(max(personid):min(personid), personid)))
#Source: local data frame [6 x 4]
#Groups: study [2]
#  personid  study measurement   new
#     <int> <fctr>       <int> <chr>
#1        1      x          23
#2        2      x          32
#3        1      y          21  5, 2
#4        3      y          23  5, 2
#5        4      y          23  5, 2
#6        6      y          23  5, 2
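If one row per study is enough, the same setdiff() idea fits into summarise(); a sketch that also starts the sequence at 1, since participants are numbered from 1:
df %>%
  group_by(study) %>%
  summarise(missing = toString(setdiff(seq_len(max(personid)), personid)))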
One approach is to use tidyr::expand() to generate all possible combinations of study and personid and then use anti_join() to remove the combinations that actually appear in the data.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df %>%
  expand(study, personid) %>%
  anti_join(df)
#> Joining, by = c("study", "personid")
#> # A tibble: 4 × 2
#>    study personid
#>   <fctr>    <int>
#> 1      y        2
#> 2      x        6
#> 3      x        4
#> 4      x        3
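One caveat: expand() only crosses values that actually appear somewhere in the data, so a personid that is missing from every study (here 5) is never generated, which is why 5 does not show up above. tidyr's full_seq() can fill the gaps in the id sequence first; a sketch:
df %>%
  expand(study, personid = full_seq(personid, 1)) %>%
  anti_join(df, by = c("study", "personid"))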
A simple solution using base R
tapply(df$personid, df$study, function(a) setdiff(min(a):max(a), a))
Output:
$x
integer(0)

$y
[1] 2 5
