complex(ish) pivot_longer in R

I have a dataset that roughly looks like this:
person_id mem_was_there_1 mem_was_there_2 mem_was_there_3 new_number_yn_1 new_number_yn_2 new_number_yn_3
<dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 100 1 2 3 FALSE TRUE FALSE
2 101 4 5 6 TRUE FALSE FALSE
I need to pivot this data into something like this:
# A tibble: 6 x 4
person_id nr mem_was_there new_number_yn
<dbl> <dbl> <dbl> <lgl>
1 100 1 1 FALSE
2 100 2 2 TRUE
3 100 3 3 FALSE
4 101 1 4 TRUE
5 101 2 5 FALSE
6 101 3 6 FALSE
I would like to use pivot_longer() from tidyr. I tried the code below, but I do not know what to fill in at the ??? so that the regex splits at the third _. Ideally, I would like a names_sep that works for both 'mem_was_there_xx' and 'new_number_yn_xx'.
df1 %>%
pivot_longer(cols = c(matches("^mem_was_there"), matches("^new_number_yn")),
names_to = c('.value', 'nr'),
names_sep = ??? )
df1 <-
tribble(~person_id, ~mem_was_there_1, ~mem_was_there_2, ~mem_was_there_3, ~new_number_yn_1, ~new_number_yn_2, ~new_number_yn_3,
100, 1, 2, 3, F, T, F,
101, 4, 5, 6, T, F, F)

This should do the trick:
spec <- data.frame(.name = names(df1)[-1],
nr = rep(1:3, 2),
.value = c(rep("mem_was_there", 3), rep("new_number_yn", 3)),
stringsAsFactors = FALSE)
library(tidyverse)
df1 %>%
pivot_longer_spec(., spec)
gives:
# # A tibble: 6 x 4
# person_id nr mem_was_there new_number_yn
# <dbl> <int> <dbl> <lgl>
# 1 100 1 1 FALSE
# 2 100 2 2 TRUE
# 3 100 3 3 FALSE
# 4 101 1 4 TRUE
# 5 101 2 5 FALSE
# 6 101 3 6 FALSE

As deschen recommended, I played around with the regex a bit, and this pivot_longer() call works as expected. It is a bit cleaner than manually creating the spec for pivot_longer_spec(), and it also works if there are unequal numbers of mem_was_there_x and new_number_yn_y columns (it will just insert missing values where applicable).
df1 %>%
pivot_longer(
cols = c(matches("^mem_was_there"), matches("^new_number_yn")),
names_to = c('.value', 'nr'),
# the separating underscore sits outside the first group, so .value has no trailing "_"
names_pattern = "([A-Za-z_]+)_([0-9]+)")
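For the names_sep variant the question originally asked about, one possibility is a lookahead that matches only the underscore directly before the trailing digits. This is a minimal sketch, not taken from the answers above; it assumes a tidyr version that supports names_transform (used here only to turn nr into an integer).

```r
library(tidyr)
library(tibble)

df1 <- tribble(
  ~person_id, ~mem_was_there_1, ~mem_was_there_2, ~mem_was_there_3,
  ~new_number_yn_1, ~new_number_yn_2, ~new_number_yn_3,
  100, 1, 2, 3, FALSE, TRUE, FALSE,
  101, 4, 5, 6, TRUE, FALSE, FALSE
)

long <- pivot_longer(
  df1,
  cols = -person_id,
  names_to = c(".value", "nr"),
  # split only at the underscore that is followed by digits up to the end of the name
  names_sep = "_(?=[0-9]+$)",
  # optional: store nr as integer instead of character
  names_transform = list(nr = as.integer)
)

long
```

Because the lookahead does not consume the digits, the earlier underscores in mem_was_there and new_number_yn are left alone and end up in .value.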

Related

dplyr group_by_ lazy .drop = F

I am trying to incorporate the drop = F into the following dplyr function
dspreadN = function(data, ...) {
data %>% group_by_(.dots = lazyeval::lazy_dots(...), .drop = F) %>%
summarise(n = n()*100) %>% spread(value, n, fill = 0)
}
Basically, the function transforms this
id x
1 1 A
2 1 A
3 1 A
4 1 A
5 2 A
6 2 A
7 2 B
8 2 B
9 3 A
10 3 A
11 3 B
12 3 A
into that
id drop A B
<dbl> <lgl> <dbl> <dbl>
1 1 FALSE 400 0
2 2 FALSE 200 200
3 3 FALSE 300 100
I use the function in this way dff %>% dspreadN(id, value = x)
(My real example is much more complicated, which is why I need the dplyr function.)
What I would like is to keep all the levels of the x variable, here the C is missing.
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 400 0 0
2 2 200 200 0
3 3 300 100 0
Why is the drop = F not working?
library(tidyverse)
# data
dff = data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4),
x = c('A','A','A','A', 'A','A','B','B', 'A','A','B','A', 'C', 'C', 'C', 'C'))
# remove the case to keep the C level
dff = dff[dff$id != 4, ]
You can use the .drop = FALSE argument in count() instead of group_by().
group_by() + summarise() with n() is equivalent to count().
spread() has been superseded by pivot_wider().
Thanks to @Edo for useful tips on improving the post.
library(dplyr)
library(tidyr)
dspreadN = function(data, ...) {
data %>%
count(id, x, .drop = FALSE, wt = n() * 100) %>%
pivot_wider(names_from = x, values_from = n, values_fill = 0)
}
dspreadN(dff, id, x)
# id A B C
# <dbl> <dbl> <dbl> <dbl>
#1 1 400 0 0
#2 2 200 200 0
#3 3 300 100 0
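One detail worth spelling out: .drop = FALSE only keeps empty groups for factor columns. From R 4.0 on, data.frame() no longer converts strings to factors by default, so the example data would need x to be a factor that still carries the C level. The sketch below is a self-contained variant under that assumption; the explicit levels argument and the separate mutate() step are my own choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# x is an explicit factor, so the unused level "C" survives
dff <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  x = factor(
    c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "B", "A"),
    levels = c("A", "B", "C")
  )
)

res <- dff %>%
  count(id, x, .drop = FALSE) %>%   # empty id/x combinations are kept with n = 0
  mutate(n = n * 100) %>%
  pivot_wider(names_from = x, values_from = n, values_fill = 0)

res
```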

Pivot/gather multiple "crossed" values belonging to a common key

I have some strangely stored time series data. Two kinds of values, event and foo, can be observed together for different phenomena a and b. The observations are in time t and belong to different category (those are basically different recordings).
Everything is stored as follows, in a kind of mixed wide format:
> tibble(category = c("x", "x", "y", "y"), t = c(1:2, 1:2),
event_a = c(T, T, F, F), event_b = c(T, F, T, F),
foo_a = c(1, 2, 3, 4), foo_b = c(10, 20, 30, 40))
# A tibble: 4 x 6
category t event_a event_b foo_a foo_b
<chr> <int> <lgl> <lgl> <dbl> <dbl>
1 x 1 TRUE TRUE 1 10
2 x 2 TRUE FALSE 2 20
3 y 1 FALSE TRUE 3 30
4 y 2 FALSE FALSE 4 40
Now I want to convert it to long format, with the phenomena being used to index the kind of event with a value, and the foo value being matched to them via a/b:
# A tibble: 8 x 5
category t event value foo
<chr> <dbl> <chr> <lgl> <dbl>
1 x 1 a TRUE 1
2 x 1 b TRUE 10
3 x 2 a TRUE 2
4 x 2 b FALSE 20
5 y 1 a FALSE 3
6 y 1 b TRUE 30
7 y 2 a FALSE 4
8 y 2 b FALSE 40
I'm looking for some sort of tidyr (or at least tidyverse) solution using gather/pivot_longer and friends, but couldn't come up with anything useful, since there are multiple value columns in the result. I was thinking about a join with the foo columns split off, but didn't really succeed, and I'm not familiar enough with SQL to know what goes wrong there...
This is a complicated way of solving the problem, but it works.
The idea is to solve the multiple-columns issue in two steps: one pivot_longer for each of event_* and foo_*, then bind_cols the results. Finally, remove the prefix 'event_' from the new column event.
library(tidyverse)
df1 %>%
dplyr::select(-starts_with('foo')) %>%
pivot_longer(
cols = starts_with('event'),
names_to = 'event',
values_to = 'value'
) %>%
bind_cols(
df1 %>%
dplyr::select(-starts_with('event')) %>%
pivot_longer(
cols = starts_with('foo'),
values_to = 'foo'
) %>%
dplyr::select(-category, -t, -name)
) %>%
mutate(event = sub('event_', '', event))
## A tibble: 8 x 5
# category t event value foo
# <chr> <int> <chr> <lgl> <dbl>
#1 x 1 a TRUE 1
#2 x 1 b TRUE 10
#3 x 2 a TRUE 2
#4 x 2 b FALSE 20
#5 y 1 a FALSE 3
#6 y 1 b TRUE 30
#7 y 2 a FALSE 4
#8 y 2 b FALSE 40
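A shorter route may be a single pivot_longer() with the ".value" sentinel, which creates one output column per name prefix (event, foo) in one pass. This is a sketch rather than the answerer's method; the intermediate column name phenomenon is a hypothetical placeholder, needed only because the desired key column is also called event.

```r
library(dplyr)
library(tidyr)

df1 <- tibble(
  category = c("x", "x", "y", "y"),
  t = c(1:2, 1:2),
  event_a = c(TRUE, TRUE, FALSE, FALSE),
  event_b = c(TRUE, FALSE, TRUE, FALSE),
  foo_a = c(1, 2, 3, 4),
  foo_b = c(10, 20, 30, 40)
)

long <- df1 %>%
  pivot_longer(
    cols = -c(category, t),
    # the part before "_" becomes a value column, the part after it the key
    names_to = c(".value", "phenomenon"),
    names_sep = "_"
  ) %>%
  rename(value = event) %>%       # logical event_* values
  rename(event = phenomenon)      # a/b key, named as in the desired output

long
```

The two rename() calls are kept separate so that each step is unambiguous about which column it touches.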

Add column to grouped data that assigns 1 to individuals and randomly assigns 1 or 0 to pairs

I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
group_by(family) %>%
mutate(
random = if_else(
condition = n() == 1,
true = 1,
false = as.double(sample(0:1,1,replace = T))
)
)
You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1
Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1
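All three answers share the same property: a singleton family always gets 1, and a pair gets one 1 and one 0 in random order. A small sketch to check that property on the example data follows; the seed value is arbitrary and only there to make the run repeatable.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, for repeatability only

df <- tibble(
  id = 1:7,
  family = c("a", "a", "b", "b", "c", "d", "e")
)

res <- df %>%
  group_by(family) %>%
  mutate(random = sample(seq_len(n())) %% 2) %>%  # shuffle 1..n, then take parity
  ungroup()

# per family: group size and how many members were assigned 1
check <- res %>%
  group_by(family) %>%
  summarise(size = n(), ones = sum(random))

check
```

Every family should show ones equal to 1, whatever the seed.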

Strange behavior with a conditional mutate with dplyr

My apologies if this topic has been discussed somewhere; I was not able to find it.
I was trying to apply a quite simple conditional mutate() with dplyr when I noticed something that looked quite weird to me. Let me explain:
Let's say that in a data.frame I want to modify a variable (here VALUE) according to the value of a specific row in each group (here COND).
The modification is: "if the last value of COND within the current group is 0, then set VALUE to 99 for the current group, otherwise do nothing".
Here's what I naturally wrote:
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(c(1,0,0), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = ifelse(COND[n()] == 0,
99,
VALUE))
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 4 <
# 6 2 1 4 <
The propagation went well for the first group: VALUE is now 99, which is legitimate (COND == 0 in row 3). However, I was surprised to see that VALUE also changed for the other group: the first VALUE within the group was propagated even though the condition is not fulfilled.
Can someone enlighten me on what I am misunderstanding here?
Expected result was:
# ID COND VALUE
# <dbl> <dbl> <dbl>
# 1 1 1 99
# 2 1 0 99
# 3 1 0 99
# 4 2 1 4
# 5 2 1 5 <
# 6 2 1 6 <
[edit] I also tried using case_when() which apparently I do not manage well either:
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(
COND[n()] == 0 ~ 99,
TRUE ~ VALUE
))
# Error: must be a double vector, not an integer vector
One workaround would be to calculate an intermediate variable, but I am quite surprised to have to do that.
Possible solution:
tab %>%
group_by(ID) %>%
mutate(TEST_COND = COND[n()] == 0,
VALUE = ifelse(TEST_COND, 99, VALUE))
# ID COND VALUE TEST_COND
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 99 TRUE
# 2 1 0 99 TRUE
# 3 1 0 99 TRUE
# 4 2 1 4 FALSE
# 5 2 1 5 FALSE
# 6 2 1 6 FALSE
# Yeepee
Try this
library(dplyr)
tab <- data.frame(
ID = c(rep(1,3), rep(2,3)),
COND = c(1, rep(0,2), rep(1,3)),
VALUE = 1:6
)
tab %>%
group_by(ID) %>%
mutate(VALUE = case_when(last(COND) == 0 ~ 99L,
TRUE ~ VALUE))
#> # A tibble: 6 x 3
#> # Groups: ID [2]
#> ID COND VALUE
#> <dbl> <dbl> <int>
#> 1 1 1 99
#> 2 1 0 99
#> 3 1 0 99
#> 4 2 1 4
#> 5 2 1 5
#> 6 2 1 6
Created on 2020-05-12 by the reprex package (v0.3.0)
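The behaviour in the question comes from how ifelse() works: its result has the same length as test. COND[n()] == 0 is a single value per group, so ifelse() returns only one element (the first VALUE of the group), and mutate() recycles it across the group. A sketch of that effect and of a plain if/else alternative, using the same data as the question:

```r
library(dplyr)

tab <- data.frame(
  ID = c(rep(1, 3), rep(2, 3)),
  COND = c(1, 0, 0, 1, 1, 1),
  VALUE = 1:6
)

# a length-1 test gives a length-1 result: only the first element of `no` is used
scalar_result <- ifelse(FALSE, 99, 4:6)
scalar_result

# if/else decides once per group and returns the full VALUE vector otherwise
fixed <- tab %>%
  group_by(ID) %>%
  mutate(VALUE = if (last(COND) == 0) 99L else VALUE) %>%
  ungroup()

fixed
```

The 99L keeps VALUE an integer in both groups, which is also why the case_when() answer above uses 99L rather than 99.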

Run length ID in sparklyr

data.table provides a rleid function which I find invaluable - it acts as a ticker when a watched variable(s) changes, ordered by some other variable(s).
library(dplyr)
tbl = tibble(time = as.integer(c(1, 2, 3, 4, 5, 6, 7, 8)),
var = c("A", "A", "A", "B", "B", "A", "A", "A"))
> tbl
# A tibble: 8 × 2
time var
<int> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
6 6 A
7 7 A
8 8 A
Desired result is
> tbl %>% mutate(rleid = data.table::rleid(var))
# A tibble: 8 × 3
time var rleid
<int> <chr> <int>
1 1 A 1
2 2 A 1
3 3 A 1
4 4 B 2
5 5 B 2
6 6 A 3
7 7 A 3
8 8 A 3
I was wondering if I could reproduce something similar using the tools provided by sparklyr. When testing, I found the best I could do was get to the point at which I needed to do a fill forward, but then couldn't achieve that.
library(sparklyr)
spark_install(version = "2.0.2")
sc <- spark_connect(master = "local",
spark_home = spark_home_dir())
spk_tbl = copy_to(sc, tbl, overwrite = TRUE)
spk_tbl %>%
mutate(var2 = (var != lag(var, 1L, order = time))) %>% # Thanks @JaimeCaffarel
mutate(var3 = if(var2) { paste0(time, var) } else { NA })
Source: query [8 x 4]
Database: spark connection master=local[4] app=sparklyr local=TRUE
time var var2 var3
<int> <chr> <lgl> <chr>
1 1 A TRUE 1A
2 2 A FALSE <NA>
3 3 A FALSE <NA>
4 4 B TRUE 4B
5 5 B FALSE <NA>
6 6 A TRUE 6A
7 7 A FALSE <NA>
8 8 A FALSE <NA>
I've tried using SparkR, however I much prefer the sparklyr interface and its ease of use, so I'd ideally be able to do this in Spark SQL.
I can of course, already do this by partitioning the data into small enough chunks, collecting it, running a function and sending it back.
For context, the reason I find the rleid to be useful is that I work with a lot of train data, and it's useful to be able to index what run it's on.
Thanks for any help
Akhil
A working solution in sparklyr would be this:
spk_tbl %>%
dplyr::arrange(time) %>%
dplyr::mutate(rleid = (var != lag(var, 1, order = time, default = FALSE))) %>%
dplyr::mutate(rleid = cumsum(as.numeric(rleid)))
Try this:
tbl %>% mutate(run = c(0,cumsum(var[-1L] != var[-length(var)])))
# A tibble: 8 × 3
time var run
<int> <chr> <dbl>
1 1 A 0
2 2 A 0
3 3 A 0
4 4 B 1
5 5 B 1
6 6 A 2
7 7 A 2
8 8 A 2
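Both answers rely on the same idea: flag the rows where var differs from the previous row, then take the cumulative sum of the flags. A local dplyr sketch of that idea, numbered from 1 like data.table::rleid, is shown below. It runs on an ordinary tibble; whether the same expression translates to Spark SQL through sparklyr is not verified here.

```r
library(dplyr)

tbl <- tibble(
  time = 1:8,
  var = c("A", "A", "A", "B", "B", "A", "A", "A")
)

res <- tbl %>%
  arrange(time) %>%
  mutate(
    # TRUE wherever var differs from the previous row; the first row is never a change
    changed = var != lag(var, default = first(var)),
    # running count of changes, shifted so that the first run is 1
    rleid = cumsum(changed) + 1L
  )

# cross-check against base R run-length encoding
rle_ids <- with(rle(tbl$var), rep(seq_along(lengths), lengths))

res
```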
