R - Nested list to tibble - r

I have a nested list like so:
> ex <- list(list(c("This", "is", "an", "example", "."), c("I", "really", "hate", "examples", ".")), list(c("How", "do", "you", "feel", "about", "examples", "?")))
> ex
[[1]]
[[1]][[1]]
[1] "This" "is" "an" "example" "."
[[1]][[2]]
[1] "I" "really" "hate" "examples" "."
[[2]]
[[2]][[1]]
[1] "How" "do" "you" "feel" "about" "examples" "?"
I want to convert it to a tibble like so:
> tibble(d_id = as.integer(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)),
+ s_id = as.integer(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1)),
+ t_id = as.integer(c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)),
+ token = c("This", "is", "an", "example", ".", "I", "really",
+ "hate", "examples", ".", "How", "do", "you", "feel", "about", "examples", "?"))
# A tibble: 17 x 4
d_id s_id t_id token
<int> <int> <int> <chr>
1 1 1 1 This
2 1 1 2 is
3 1 1 3 an
4 1 1 4 example
5 1 1 5 .
6 1 2 1 I
7 1 2 2 really
8 1 2 3 hate
9 1 2 4 examples
10 1 2 5 .
11 2 1 1 How
12 2 1 2 do
13 2 1 3 you
14 2 1 4 feel
15 2 1 5 about
16 2 1 6 examples
17 2 1 7 ?
What is the most efficient way for me to perform this? Preferably using tidyverse functionality?

Time to get some sequences working, which should be very efficient:
d_id <- rep(seq_along(ex), lengths(ex))
s_id <- sequence(lengths(ex))
t_id <- lengths(unlist(ex, recursive = FALSE))
data.frame(
d_id = rep(d_id, t_id),
s_id = rep(s_id, t_id),
t_id = sequence(t_id),
token = unlist(ex)
)
# d_id s_id t_id token
#1 1 1 1 This
#2 1 1 2 is
#3 1 1 3 an
#4 1 1 4 example
#5 1 1 5 .
#6 1 2 1 I
#7 1 2 2 really
#8 1 2 3 hate
#9 1 2 4 examples
#10 1 2 5 .
#11 2 1 1 How
#12 2 1 2 do
#13 2 1 3 you
#14 2 1 4 feel
#15 2 1 5 about
#16 2 1 6 examples
#17 2 1 7 ?
This will run in about 2 seconds for a 500K sample of your ex list. I suspect that will be hard to beat in terms of efficiency.
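To see what rep(), sequence() and lengths() each contribute, here is a minimal sketch on a smaller list. The toy list and the names n_sent and n_tok are made up for illustration and are not part of the answer above.

```r
# Toy nested list (hypothetical): 2 documents with 2 and 1 sentences
ex_small <- list(list(c("a", "b"), c("c")), list(c("d", "e", "f")))

n_sent <- lengths(ex_small)                             # sentences per document: 2 1
n_tok  <- lengths(unlist(ex_small, recursive = FALSE))  # tokens per sentence: 2 1 3

res <- data.frame(
  d_id  = rep(rep(seq_along(ex_small), n_sent), n_tok), # document index, once per token
  s_id  = rep(sequence(n_sent), n_tok),                 # sentence index within document
  t_id  = sequence(n_tok),                              # token index within sentence
  token = unlist(ex_small)
)
res
```

Every column is built with vectorised calls only, which is likely why this approach scales well.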

We can do
library(tidyverse)
ex %>%
set_names(seq_along(ex)) %>%
map( ~ set_names(.x, seq_along(.x)) %>%
stack) %>%
bind_rows(.id = 'd_id') %>%
group_by(d_id, s_id = ind) %>%
mutate(t_id = row_number()) %>%
select(d_id, s_id, t_id, token = values)
# A tibble: 17 x 4
# Groups: d_id, s_id [3]
# d_id s_id t_id token
# <chr> <chr> <int> <chr>
# 1 1 1 1 This
# 2 1 1 2 is
# 3 1 1 3 an
# 4 1 1 4 example
# 5 1 1 5 .
# 6 1 2 1 I
# 7 1 2 2 really
# 8 1 2 3 hate
# 9 1 2 4 examples
#10 1 2 5 .
#11 2 1 1 How
#12 2 1 2 do
#13 2 1 3 you
#14 2 1 4 feel
#15 2 1 5 about
#16 2 1 6 examples
#17 2 1 7 ?
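The key step in that pipeline is stack(), which turns a named list of vectors into a two-column data frame. A small sketch in base R, using a made-up list called sentences:

```r
# Hypothetical named list: names play the role of sentence ids
sentences <- list(`1` = c("This", "is"), `2` = c("fine"))

st <- stack(sentences)
st
# 'values' holds the tokens, 'ind' is a factor built from the list names
str(st)
```

Because ind is derived from names, the ids are not integers, so a conversion such as as.integer(as.character(st$ind)) may still be needed if integer columns are wanted.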

You can use melt from the reshape2 package:
library(data.table)
# call reshape2's melt explicitly so that its list method (melt.list) is used
setDT(reshape2::melt(ex))[, .(d_id = L1, s_id = L2, t_id = rowid(L1, L2), token = value)]
d_id s_id t_id token
1: 1 1 1 This
2: 1 1 2 is
3: 1 1 3 an
4: 1 1 4 example
5: 1 1 5 .
6: 1 2 1 I
7: 1 2 2 really
8: 1 2 3 hate
9: 1 2 4 examples
10: 1 2 5 .
11: 2 1 1 How
12: 2 1 2 do
13: 2 1 3 you
14: 2 1 4 feel
15: 2 1 5 about
16: 2 1 6 examples
17: 2 1 7 ?
I'm showing it here with data.table, since I know how to do the column selection and renaming in one step from there (though it should be no trouble with dplyr instead). The melt.list function is coming from reshape2.

Another tidyverse solution:
library(tidyverse)
ex %>%
modify_depth(-1,~tibble(token=.x) %>% rowid_to_column("t_id")) %>%
map(~map_dfr(.x,identity,.id = "s_id")) %>%
map_dfr(identity,.id = "d_id")
# # A tibble: 17 x 4
# d_id s_id t_id token
# <chr> <chr> <int> <chr>
# 1 1 1 1 This
# 2 1 1 2 is
# 3 1 1 3 an
# 4 1 1 4 example
# 5 1 1 5 .
# 6 1 2 1 I
# 7 1 2 2 really
# 8 1 2 3 hate
# 9 1 2 4 examples
# 10 1 2 5 .
# 11 2 1 1 How
# 12 2 1 2 do
# 13 2 1 3 you
# 14 2 1 4 feel
# 15 2 1 5 about
# 16 2 1 6 examples
# 17 2 1 7 ?
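Note that the .id columns come back as character, while the question asks for integer ids. One possible way to convert them afterwards, sketched in base R on a small stand-in data frame (res and id_cols are illustrative names, not from the answer):

```r
# Stand-in for the result above, with character id columns
res <- data.frame(
  d_id  = c("1", "1", "2"),
  s_id  = c("1", "2", "1"),
  t_id  = 1:3,
  token = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

id_cols <- c("d_id", "s_id")
res[id_cols] <- lapply(res[id_cols], as.integer)  # convert only the id columns
str(res)
```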

Related

Nested list to grouped rows in R

I have the following nested list called l (dput below):
> l
$A
$A$`1`
[1] 1 2 3
$A$`2`
[1] 3 2 1
$B
$B$`1`
[1] 2 2 2
$B$`2`
[1] 3 4 3
I would like to convert this to a grouped dataframe where A and B are the first group column and 1 and 2 are the subgroups with respective values. The desired output should look like this:
group subgroup values
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
As you can see A and B are the main group and 1 and 2 are the subgroups. Using purrr::flatten(l) or unnest doesn't work. So I was wondering if anyone knows how to convert a nested list to a grouped row dataframe?
dput of l:
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)), B = list(`1` = c(2,
2, 2), `2` = c(3, 4, 3)))
Using stack and rowbind with id:
data.table::rbindlist(lapply(l, stack), idcol = "id")
# id values ind
# 1: A 1 1
# 2: A 2 1
# 3: A 3 1
# 4: A 3 2
# 5: A 2 2
# 6: A 1 2
# 7: B 2 1
# 8: B 2 1
# 9: B 2 1
# 10: B 3 2
# 11: B 4 2
# 12: B 3 2
You can use enframe() to convert the list into a data.frame, and unnest the value column twice.
library(tidyr)
tibble::enframe(l, name = "group") %>%
unnest_longer(value, indices_to = "subgroup") %>%
unnest(value)
# A tibble: 12 × 3
group value subgroup
<chr> <dbl> <chr>
1 A 1 1
2 A 2 1
3 A 3 1
4 A 3 2
5 A 2 2
6 A 1 2
7 B 2 1
8 B 2 1
9 B 2 1
10 B 3 2
11 B 4 2
12 B 3 2
Turn the list directly into a data frame, then pivot it into a long format and arrange to your desired order.
library(tidyverse)
l %>%
as.data.frame() %>%
pivot_longer(everything(), names_to = c("group", "subgroup"),
values_to = "values",
names_pattern = "(.+?)\\.(.+?)") %>%
arrange(group, subgroup)
# A tibble: 12 × 3
group subgroup values
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
You can combine rrapply with unnest, which has the benefit of working with lists of arbitrary lengths:
library(rrapply)
library(tidyr)
rrapply(l, how = "melt") |>
unnest(value)
# A tibble: 12 × 3
L1 L2 value
<chr> <chr> <dbl>
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 3
5 A 2 2
6 A 2 1
7 B 1 2
8 B 1 2
9 B 1 2
10 B 2 3
11 B 2 4
12 B 2 3
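For comparison, a dependency-free sketch in base R that builds the columns directly in the desired order. It assumes exactly two levels of nesting, as in l; the names flat and n are only for illustration.

```r
l <- list(A = list(`1` = c(1, 2, 3), `2` = c(3, 2, 1)),
          B = list(`1` = c(2, 2, 2), `2` = c(3, 4, 3)))

flat <- unlist(l, recursive = FALSE)  # one level flattened: 4 numeric vectors
n    <- lengths(flat)                 # number of values per subgroup

res <- data.frame(
  group    = rep(rep(names(l), lengths(l)), n),                    # A A A ... B B B
  subgroup = rep(unlist(lapply(l, names), use.names = FALSE), n),  # 1 1 1 2 2 2 ...
  values   = unlist(flat, use.names = FALSE)
)
res
```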

Remove rows after observing some specific row values in group id

I am trying to filter within each id group and remove the rows that follow the first observation of sex == 2. The data looks like
data<- data.frame( id= c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,3 ,3,3,4,4,4), sex=c(1,1,2,2,1,1,1,2,2,2,1,1,2,1,1,2,1,2,2))
data
id sex
1 1
1 1
1 2
1 2
2 1
2 1
2 1
2 2
2 2
2 2
3 1
3 1
3 2
3 1
3 1
3 2
4 1
4 2
4 2
The desired output
id sex
1 1
1 1
1 2
2 1
2 1
2 1
2 2
3 1
3 1
3 2
3 1
3 1
3 2
4 1
4 2
I tried:
library(dplyr)
data1 <- data %>% filter(type == 1 ) & silec(2))
But I got an error. Can anyone help?
Data
data<- data.frame( id= c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,3 ,3,3,4,4,4), sex=c(1,1,2,2,1,1,1,2,2,2,1,1,2,1,1,2,1,2,2))
Code
data %>%
#Grouping by id
group_by(id) %>%
#Filter sex = 1 or the first time sex was equal 2
filter( sex == 1 | (cumsum(sex == 2) == 1))
Output
# A tibble: 14 x 2
# Groups: id [4]
id sex
<dbl> <dbl>
1 1 1
2 1 1
3 1 2
4 2 1
5 2 1
6 2 1
7 2 2
8 3 1
9 3 1
10 3 2
11 3 1
12 3 1
13 4 1
14 4 2
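To see what the cumsum() condition does inside one group, here is a short base R trace for the sex values of id 3 (the variable names are illustrative). It also shows why this output has 14 rows while the desired output has 15: only the first 2 per id is kept, so the second 2 of id 3 is dropped.

```r
sex <- c(1, 1, 2, 1, 1, 2)          # sex values for id 3

seen_two <- cumsum(sex == 2)        # running count of 2s: 0 0 1 1 1 2
keep <- sex == 1 | seen_two == 1    # TRUE for 1s and for rows up to the second 2
kept <- sex[keep]
kept
```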
You can group each run of consecutive 1s together with the run of 2s that follows it. From each such group, keep the rows up to and including the first 2.
library(dplyr)
library(data.table)
data %>%
group_by(id, grp = ceiling(rleid(sex)/2)) %>%
slice(seq_len(match(2, sex))) %>%
ungroup() %>%
select(-grp)
# id sex
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 2
# 4 2 1
# 5 2 1
# 6 2 1
# 7 2 2
# 8 3 1
# 9 3 1
#10 3 2
#11 3 1
#12 3 1
#13 3 2
#14 4 1
#15 4 2
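The grouping trick can be traced in base R, with rle() standing in for data.table::rleid(). This is only a sketch: it relies on the runs alternating 1, 2, 1, 2, ... across the whole column, which holds for this data because every id starts with a 1.

```r
sex <- c(1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2)

r      <- rle(sex)
run_id <- rep(seq_along(r$lengths), r$lengths)  # same values as rleid(sex)
grp    <- ceiling(run_id / 2)                   # pairs a run of 1s with the next run of 2s

# within each pair, keep rows up to and including the first 2
keep <- ave(sex, grp, FUN = function(s) seq_along(s) <= match(2, s)) == 1
sum(keep)
```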

Counter based on ID and value in a column

I have a dataframe that contains an ID and Type column. I want a counter that if the Type is "T" then the counter in the next row would be counter + 1 for every ID. Basically, the counter is the Output_column in this example.
ID <- c(1,1,1,1,1,1,3,3,4,4,4,4)
Type <- c("A","A","T","A","A","A","A","A","T","A","T","A")
Output_Column <- c(1,1,1,2,2,2,1,1,1,2,2,3)
ID Type Output_Column
1 1 A 1
2 1 A 1
3 1 T 1
4 1 A 2
5 1 A 2
6 1 A 2
7 3 A 1
8 3 A 1
9 4 T 1
10 4 A 2
11 4 T 2
12 4 A 3
d <- data.frame(ID,Type, Output_Column)
Base R solution
output_col <- as.numeric(ave(Type, ID, FUN = function(x) cumsum(c('T', x[-length(x)]) == 'T')))
output_col
[1] 1 1 1 2 2 2 1 1 1 2 2 3
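What FUN does inside one group can be traced by hand. A small sketch for the Type values of ID 4 (prev and counter are illustrative names):

```r
x <- c("T", "A", "T", "A")        # Type values for ID 4

prev    <- c("T", x[-length(x)])  # previous row's Type, seeded with "T": "T" "T" "A" "T"
counter <- cumsum(prev == "T")    # increments on the row after each "T": 1 2 2 3
counter
```

Seeding the shifted vector with "T" is what makes each ID start at 1 rather than 0.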
Here's data.table version :
library(data.table)
setDT(d)[, res := shift(cumsum(Type == 'T') + 1, fill = 1), ID]
d
# ID Type Output_Column res
# 1: 1 A 1 1
# 2: 1 A 1 1
# 3: 1 T 1 1
# 4: 1 A 2 2
# 5: 1 A 2 2
# 6: 1 A 2 2
# 7: 3 A 1 1
# 8: 3 A 1 1
# 9: 4 T 1 1
#10: 4 A 2 2
#11: 4 T 2 2
#12: 4 A 3 3
Here is a way to achieve it using group_by, lag, and cumsum
library(dplyr)
d %>%
# group by ID so calculation is within each ID
group_by(ID) %>%
mutate(
# flag rows whose previous Type is "T"
# the default "T" makes the first row of each ID count as 1
counter = if_else(lag(Type, default = "T") == "T", 1, 0),
# the cumulative sum of the flag reproduces the expected output column
output_column_calculated = cumsum(counter)) %>%
ungroup() %>%
# Remove the counter column if not needed
select(-counter)
#> # A tibble: 12 x 4
#> ID Type Output_Column output_column_calculated
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 1 1
#> 2 1 A 1 1
#> 3 1 T 1 1
#> 4 1 A 2 2
#> 5 1 A 2 2
#> 6 1 A 2 2
#> 7 3 A 1 1
#> 8 3 A 1 1
#> 9 4 T 1 1
#> 10 4 A 2 2
#> 11 4 T 2 2
#> 12 4 A 3 3
Created on 2021-04-26 by the reprex package (v2.0.0)

Creating a "run ID" for values in sequence

I have a vector which contains an ordered sequence of repeated integers:
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
I want to create a "run ID" (I assume using data.table::rleid()) for numbers that are in sequence. That is, numbers which are either equal or +1 the previous value.
So, the expected output would be:
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid(???)
#> [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
My first thought was to simply check if each value is the same or +1 the previous, but that doesn't work since the first change is considered a run of its own, obviously (a FALSE surrounded by TRUEs):
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid((x - lag(x, default = 1)) %in% 0:1)
#> [1] 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 4 5 5 5
I obviously need something which allows me to compare each value to the last different value, but I can't think of how to do that effectively. Any pointers?
How about using lag from dplyr with cumsum?
library(dplyr)
cumsum(x - lag(x,default = 0) > 1)+1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
Or the data.table way with shift:
library(data.table)
cumsum(x - shift(x,1,fill = 0) > 1) + 1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
Base R option using diff and cumsum :
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
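The intermediate vectors make the logic easier to follow. A brief sketch, where step, new_run and runs are names introduced here for illustration:

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)

step    <- diff(x)             # difference to the previous element
new_run <- c(TRUE, step > 1)   # a jump larger than 1 starts a new run
run_id  <- cumsum(new_run)     # running count of run starts

runs <- split(x, run_id)       # inspect the values that fall into each run
runs
```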
library(dplyr)
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
tibble(X = x) %>%
mutate(PREV.X = lag(X, default = 0),
IS.SEQ = X != PREV.X & X != PREV.X + 1,
RZLT = 1 + cumsum(IS.SEQ))
# A tibble: 19 x 4
X PREV.X IS.SEQ RZLT
<dbl> <dbl> <lgl> <dbl>
1 1 0 FALSE 1
2 1 1 FALSE 1
3 1 1 FALSE 1
4 2 1 FALSE 1
5 2 2 FALSE 1
6 2 2 FALSE 1
7 2 2 FALSE 1
8 3 2 FALSE 1
9 3 3 FALSE 1
10 5 3 TRUE 2
11 5 5 FALSE 2
12 5 5 FALSE 2
13 5 5 FALSE 2
14 6 5 FALSE 2
15 6 6 FALSE 2
16 9 6 TRUE 3
17 9 9 FALSE 3
18 9 9 FALSE 3
19 9 9 FALSE 3

Add a count column and count twice if a certain condition is met

I am wondering if there is a way to make a conditional column-count by a group, adding 1 to a row_number or rowid if a certain value is met (in this case 0). For example:
df<-data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1),
want=c(1, 3, 4,5,1,3,5,6,7,2,3,4,6,7))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7
I think this might involve making a row_number per group and then making a customized row_number but I am open to suggestions. It is kind of a work-around method to "break up" my data when a 0 appears.
Using dplyr, for each group of data (group_by(group)) we can add a column that counts from 1 to the size of the group (1:n()). Adding the cumulative sum of condition == 0 makes the counter jump by one extra step whenever the condition is met.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(desired = (1:n()) + cumsum(condition == 0))
Output:
#> # A tibble: 14 x 3
#> # Groups: group [3]
#> group condition desired
#> <dbl> <dbl> <int>
#> 1 1 1 1
#> 2 1 0 3
#> 3 1 1 4
#> 4 1 1 5
#> 5 2 1 1
#> 6 2 0 3
#> 7 2 0 5
#> 8 2 1 6
#> 9 2 1 7
#> 10 3 0 2
#> 11 3 1 3
#> 12 3 1 4
#> 13 3 0 6
#> 14 3 1 7
Data:
df1 <- data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1))
You can do:
transform(df, want = ave(condition, group, FUN = function(x) cumsum(x + (x == 0) * 2 )))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7
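Both answers compute the same thing: a row counter that advances by 2 on a 0 and by 1 otherwise. A base R sketch that checks the two formulations against each other on the question's data (by_rownum and by_weights are illustrative names):

```r
df <- data.frame(group     = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
                 condition = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1))

# row number within group plus the number of zeros seen so far
by_rownum <- ave(df$condition, df$group,
                 FUN = function(x) seq_along(x) + cumsum(x == 0))

# cumulative sum of step sizes: 2 for a zero, 1 otherwise
by_weights <- ave(df$condition, df$group,
                  FUN = function(x) cumsum(ifelse(x == 0, 2, 1)))

all(by_rownum == by_weights)
```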
