using apply on listcolumns in R seems inconsistent

..or at least inconsistent with my intuition.
I'm trying to extract data from inside a listcolumn using apply - in the example I've got a column of tibbles called eagles:
df1 <- tibble(
  location = c(1,2),
  eagles = list(
    tibble(
      talons = c(2,3,4),
      beaks = c("blue","red","red")),
    tibble(
      talons = c(2,3),
      beaks = c("red","red"))))
and extracting the beaks values as vectors using apply:
df1$beakz <- apply(df1, 1, \(x) x$eagles$beaks)
which works as expected:
> df1
# A tibble: 2 x 3
location eagles beakz
<dbl> <list> <list>
1 1 <tibble [3 x 2]> <chr [3]>
2 2 <tibble [2 x 2]> <chr [2]>
However if I add another row to one of the nested tibbles, the apply function won't play along anymore:
df2 <- tibble(
  location = c(1,2),
  eagles = list(
    tibble(
      talons = c(2,3,4),
      beaks = c("blue","red","red")),
    tibble(
      talons = c(2,3,2),
      beaks = c("red","red","yellow"))))
df2$beakz <- apply(df2, 1, \(x) x$eagles$beaks)
Error:
! Assigned data `apply(df2, 1, function(x) x$eagles$beaks)` must be compatible with existing data.
x Existing data has 2 rows.
x Assigned data has 3 rows.
i Only vectors of size 1 are recycled.
The expected output would be adding a listcolumn beakz with two vectors (of length 3) as elements.
Additionally, if both the nested tibbles have two rows only, the apply function does work, but instead of a single new listcolumn, I get two new columns:
df3 <- tibble(
  location = c(1,2),
  eagles = list(
    tibble(
      talons = c(2,3),
      beaks = c("blue","red")),
    tibble(
      talons = c(2,3),
      beaks = c("red","red"))))
df3$beakz <- apply(df3, 1, \(x) x$eagles$beaks)
df3
# A tibble: 2 x 3
location eagles beakz[,1] [,2]
<dbl> <list> <chr> <chr>
1 1 <tibble [2 x 2]> blue red
2 2 <tibble [2 x 2]> red red
This is a grossly simplified example, but basically, I would expect apply to function the same way in all three cases: I would like to extract a column as a vector and bring it up a level. Ideally using apply, although I'm sure there are purrr ways of doing this. But mainly I would just like to understand why this works this way, because debugging it has not been much fun :lolsob:
(also would appreciate it if someone with enough reputation could add listcolumn or list-column to the tags)

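This is happening because apply() simplifies its result whenever it can. If every call returns a vector of the same length n > 1, apply() returns an n x nrow matrix instead of a list: a 3x2 matrix for df2 and a 2x2 matrix for df3. Only when the lengths differ, as in df1, does it fall back to a list. The 3x2 matrix has too many rows to be put into df2, while the 2x2 matrix fits into df3 and is stored as a matrix column, which is why you see beakz[,1] and [,2]. To get it to do what you want you could e.g. coerce the matrix to a data frame (to give the columns names) and then to a list. There's probably a more elegant way to do it. But basically apply() does not play well with the list structure of your data, whereas the purrr functions do.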
apply(df2, 1, \(x) x$eagles$beaks)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "red" "red"
#> [3,] "red" "yellow"
class(apply(df2, 1, \(x) x$eagles$beaks))
#> [1] "matrix" "array"
df2$beakz <- as.list(data.frame(apply(df2, 1, \(x) x$eagles$beaks)))
df2
#> # A tibble: 2 × 3
#> location eagles beakz
#> <dbl> <list> <named list>
#> 1 1 <tibble [3 × 2]> <chr [3]>
#> 2 2 <tibble [3 × 2]> <chr [3]>
df2$beakz
#> $X1
#> [1] "blue" "red" "red"
#>
#> $X2
#> [1] "red" "red" "yellow"
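For what it's worth, the simplification rule is easy to see on a plain matrix, and it can be sidestepped in base R by iterating over the list-column itself. This is only a sketch: the names `m`, `as_list`, `as_matrix` and `beakz_apply` are made up for illustration, and the `simplify = FALSE` argument of apply() assumes R >= 4.1 (which the `\(x)` syntax needs anyway).

```r
library(tibble)

# apply() simplifies whenever every call returns the same length
m <- matrix(1:6, nrow = 2)                 # rows are (1, 3, 5) and (2, 4, 6)
as_list   <- apply(m, 1, \(r) r[r > 1])    # lengths 2 and 3 differ: a list
as_matrix <- apply(m, 1, \(r) r * 2)       # both length 3: a 3 x 2 matrix

df2 <- tibble(
  location = c(1, 2),
  eagles = list(
    tibble(talons = c(2, 3, 4), beaks = c("blue", "red", "red")),
    tibble(talons = c(2, 3, 2), beaks = c("red", "red", "yellow"))
  )
)

# Keep apply() but switch off simplification (assumes R >= 4.1)
beakz_apply <- apply(df2, 1, \(x) x$eagles$beaks, simplify = FALSE)

# Or iterate over the list-column directly: lapply() always returns a list
df2$beakz <- lapply(df2$eagles, \(e) e$beaks)
```

Either way the new column is a list with one character vector per row, regardless of how many rows the nested tibbles have.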

Purely for reference (not debugging OP), purrr works without issue:
library(dplyr)  # for mutate()
library(purrr)
> mutate(df1, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [2 × 2]> <chr [2]>
> mutate(df2, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [3 × 2]> <chr [3]>
2 2 <tibble [3 × 2]> <chr [3]>
> mutate(df3, beaks=map(eagles, ~ .x$beaks))
# A tibble: 2 × 3
location eagles beaks
<dbl> <list> <list>
1 1 <tibble [2 × 2]> <chr [2]>
2 2 <tibble [2 × 2]> <chr [2]>

Related

In R, how can I create a new list-column variable in a tibble that doesn't depend on any prior columns?

This seems like a fairly basic question but I'm struggling to solve it. In the simplest form my problem is this: I have a vector of latitudes and a vector of longitudes and I'd like to generate 3 random integers without replacement for each lat-lon pair.
The code I want to make in my head runs something like this:
library(tidyverse)
LAT = c(0, 10, 20, 30, 40)
LON = c(-180, -120, -60, 0, 60, 120)
Case = c("All", "These", "Are", "Arbitrary")
df <- crossing(LAT, LON, Case) %>%
nest(TotalCases = Case) %>%
mutate(ViewingAngles = sample(0:50, 3, replace=F))
The 'TotalCases' element there is representative of some of the other irrelevant code I have to do on this dataframe, but the effect is the same - I generate a large (in this case 120x3) tibble that nests down to a 30x3 nested tibble. I'd then like to add another nested list-column to the end on the basis of that Sample function, but the output needs to be generated without replacement (meaning I cannot generate the three values separately) and must be distinct for each lat-lon pair, which means I cannot simply include "ViewingAngles" in my crossing() call.
My first inclination was to use some variant of map but that doesn't work without a .x and none makes sense to me. I also cannot simply use mutate(ViewingAngles = list(sample(0:50,3,replace=F))) because then each lat-lon pair has the same set of ViewingAngles.
Wut Do?
Thanks.

library(tidyverse)
LAT <- c(0, 10, 20, 30, 40)
LON <- c(-180, -120, -60, 0, 60, 120)
Case <- c("All", "These", "Are", "Arbitrary")
# create the dataframe
df <- crossing(LAT, LON, Case) %>%
nest(TotalCases = Case)
# generate random numbers
lists_numbers <- replicate(nrow(df), sample(0:50, 3, replace = FALSE), simplify = FALSE)
# check that no two LON LAT combinations received the same vector
anyDuplicated(lists_numbers) == 0
#> [1] TRUE
# create the nested list column
new_df <- df %>%
mutate(ViewingAngles = lists_numbers)
new_df
#> # A tibble: 30 × 4
#> LAT LON TotalCases ViewingAngles
#> <dbl> <dbl> <list> <list>
#> 1 0 -180 <tibble [4 × 1]> <int [3]>
#> 2 0 -120 <tibble [4 × 1]> <int [3]>
#> 3 0 -60 <tibble [4 × 1]> <int [3]>
#> 4 0 0 <tibble [4 × 1]> <int [3]>
#> 5 0 60 <tibble [4 × 1]> <int [3]>
#> 6 0 120 <tibble [4 × 1]> <int [3]>
#> 7 10 -180 <tibble [4 × 1]> <int [3]>
#> 8 10 -120 <tibble [4 × 1]> <int [3]>
#> 9 10 -60 <tibble [4 × 1]> <int [3]>
#> 10 10 0 <tibble [4 × 1]> <int [3]>
#> # … with 20 more rows
Created on 2022-07-07 by the reprex package (v2.0.1)
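If you'd rather keep everything in one pipeline, the same idea can be written inside mutate(). A sketch, assuming dplyr >= 1.0 and tidyr >= 1.0; `df_inline` and `df_rowwise` are illustrative names, not anything from the question:

```r
library(dplyr)
library(tidyr)

LAT <- c(0, 10, 20, 30, 40)
LON <- c(-180, -120, -60, 0, 60, 120)
Case <- c("All", "These", "Are", "Arbitrary")

df <- crossing(LAT, LON, Case) %>%
  nest(TotalCases = Case)

# Option 1: one independent draw per row, generated inside mutate()
# (sample() draws without replacement by default)
df_inline <- df %>%
  mutate(ViewingAngles = replicate(n(), sample(0:50, 3), simplify = FALSE))

# Option 2: rowwise() re-evaluates the expression once for every row
df_rowwise <- df %>%
  rowwise() %>%
  mutate(ViewingAngles = list(sample(0:50, 3))) %>%
  ungroup()
```

Note that neither version guarantees that two rows never receive the same triple; the draws are merely independent, so a collision is unlikely but possible.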

Removing duplicate records in a dataframe based on the values of a list column

I have a dataframe which contains duplicate values in a list column and I want to keep only the first appearance of each unique value.
Let's say I have the following tibble:
df <- tribble(
~x, ~y,
1, tibble(a = 1:2, b = 2:3),
2, tibble(a = 1:2, b = 2:3),
3, tibble(a = 0:1, b = 0:1)
)
df
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
The desired outcome is:
desired_df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>
If y weren't a list column, I could use distinct(df, y, .keep_all = TRUE), but the function doesn't support list columns properly, as shown:
distinct(df, y, .keep_all = TRUE)
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `y`
#> # A tibble: 3 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 2 <tibble [2 x 2]>
#> 3 3 <tibble [2 x 2]>
Is there any "clean" way to achieve what I want?

One option is to use filter with duplicated
library(dplyr)
df %>%
filter(!duplicated(y))
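This works because base R's duplicated() compares list elements by value rather than by reference. A small self-contained check using the tibble from the question (`dup` and `deduped` are just illustrative names):

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~x, ~y,
  1, tibble(a = 1:2, b = 2:3),
  2, tibble(a = 1:2, b = 2:3),
  3, tibble(a = 0:1, b = 0:1)
)

# TRUE marks an element that equals an earlier element of the list
dup <- duplicated(df$y)

# Keep only the first appearance of each distinct nested tibble
deduped <- df %>%
  filter(!dup)
```

According to ?duplicated this can be slow for long lists of non-atomic elements, so it may be worth timing on real data.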

I have come to an answer, but I think it's quite "wordy" (and I suspect it might be slow as well):
df <- df %>%
mutate(unique_list_id = match(y, unique(y))) %>%
group_by(unique_list_id) %>%
slice(1) %>%
ungroup() %>%
select(-unique_list_id)
df
#> # A tibble: 2 x 2
#> x y
#> <dbl> <list>
#> 1 1 <tibble [2 x 2]>
#> 2 3 <tibble [2 x 2]>

Is there a way to automatically propagate NULL in purrr::map?

I have some data, some of which is replicated, some not. I can only fit
my model for the replicated data.
library(tidyverse)
library(magrittr)  # for the %<>% assignment pipe used below
d <- tribble(
~env, ~val,
"A", 1,
"A", 2,
"B", 3
)
I am using tidyr::nest() and purrr::map() functions to fit my model.
However, in every function I use for map() I had to cater for the special
case that a particular set of data is not modellable, which I achieved
via calls in the style of
map(col, function(elem){ if(!is.null(elem)) DO_STUFF(elem) else NULL})
After a while, I managed to extract this behaviour to a purrr-style adverb
function which takes another function and wraps it such that this behaviour
for NULL elements is automatic:
maybe <- function(fun){
function(val,...){ if(!is.null(val)) fun(val, ...) else NULL}
}
However, this left me wondering: Am I duplicating behaviour which is
already achievable using tidyverse functions?
Bonus question: Is there a word in functional programming for a function like maybe?
This is an example to test my adverb:
Simple Model: A mean for data in environment A and no model for data in
environment B (since the data is unreplicated):
modelFuns <- list(A = mean, B = NULL)
Group data by environment and determine the appropriate model for each group
d <- d %>% group_by(env) %>% nest(.key = "data")
d %<>% mutate(model = modelFuns[env])
d
## # A tibble: 2 x 3
## env data model
## <chr> <list> <list>
## 1 A <tibble [2 × 1]> <fn>
## 2 B <tibble [1 × 1]> <NULL>
Perform modelling:
d %<>% mutate(out = pmap(list(model, data), maybe(function(m,d) m(d$val))))
d
## # A tibble: 2 x 4
## env data model out
## <chr> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL>
Which is equivalent to the following code which does not use my maybe adverb:
d %<>% mutate(out = pmap(list(model, data), function(m,d){if(!is.null(m)) m(d$val) else NULL}))
d
## # A tibble: 2 x 4
## env data model out
## <chr> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL>
The fact that there might be a value or there might be NULL propagates to
everything I want to do with the modelling results downstream, which is
why the adverb maybe is useful. Does something like this already exist
in the tidyverse?
isModelNice <- function(val) val > 0
d %<>% mutate(nice = map(out, maybe(isModelNice)))
d
## # A tibble: 2 x 5
## env data model out nice
## <chr> <list> <list> <list> <list>
## 1 A <tibble [2 × 1]> <fn> <dbl [1]> <lgl [1]>
## 2 B <tibble [1 × 1]> <NULL> <NULL> <NULL>

could you use purrr::possibly()?
library(tidyverse)
d <- tribble(
~env, ~val,
"A", 1,
"A", 2,
"B", 3
)
modelFuns <- list(A = mean, B = NULL)
d %>% group_by(env) %>%
nest(.key = "data") %>%
mutate(model = modelFuns[env]) %>%
mutate(out = pmap(list(model, data), possibly(function(m,d) m(d$val), NULL)))
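One caveat: possibly() does not test for NULL. It catches any error and returns the fallback value, and it only covers the NULL case here because calling NULL as a function fails. A sketch contrasting it with the maybe() adverb from the question (`safe_sqrt` and the result names are made up for illustration):

```r
library(purrr)

# The adverb from the question: skip the call only when the input is NULL
maybe <- function(fun) {
  function(val, ...) if (!is.null(val)) fun(val, ...) else NULL
}

# possibly() wraps the call in error handling and returns `otherwise` on failure
safe_sqrt <- possibly(sqrt, otherwise = NULL)

# Both return NULL for a NULL input ...
null_maybe    <- maybe(sqrt)(NULL)
null_possibly <- safe_sqrt(NULL)

# ... but possibly() also hides genuine errors, while maybe() lets them surface
hidden   <- safe_sqrt("a")                          # NULL, the error is swallowed
surfaced <- try(maybe(sqrt)("a"), silent = TRUE)    # object of class "try-error"
```

So possibly() is the less wordy option, but maybe() is stricter if you want real modelling errors to stay visible.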

List of lists by unique coordinates

I have a data frame like the one below. I want to collapse it, though, so that each unique coordinate is a list of its SubIDs.
subID latlon
1 S20298920 29.2178694, -94.9342990
2 S35629295 26.7063982, -80.7168961
3 S35844314 26.7063982, -80.7168961
4 S35833936 26.6836236, -80.3512144
7 S30634757 42.4585456, -76.5146989
8 S35834082 26.4330582, -80.9416786
9 S35857972 26.4330582, -80.9416786
10 S35833885 26.7063982, -80.7168961
So, here, I want (26.7063982, -80.7168961) to be a list containing (S35629295, S35844314), and (29.2178694, -94.9342990) to be a list containing just (S20298920). I think a list of lists is what makes most sense.

Use aggregate:
out <- aggregate(data=df,subID~latlon,FUN = function(t) list(sort(paste(t))))
Since your data set is large and cumbersome, the sample code below uses watered down data which is easier to read.
out <- aggregate(data=df,name~ID,FUN = function(t) list(sort(paste(t))))
out
ID name
1 1 apple, orange
2 2 orange
3 3 apple, orange
Data:
df <- data.frame(ID=c(1,1,2,3,3),
name=c('apple', 'orange', 'orange', 'orange', 'apple'))

with(data,tapply(subID,latlon,as.list))
output:
$`26.4330582 -80.9416786`
$`26.4330582 -80.9416786`[[1]]
[1] "S35834082"
$`26.4330582 -80.9416786`[[2]]
[1] "S35857972"
$`26.6836236 -80.3512144`
$`26.6836236 -80.3512144`[[1]]
[1] "S35833936"
:
:
:
data:
data=read.table(text="subID latlon
S20298920 '29.2178694 -94.9342990'
S35629295 '26.7063982 -80.7168961'
S35844314 '26.7063982 -80.7168961'
S35833936 '26.6836236 -80.3512144'
S30634757 '42.4585456 -76.5146989'
S35834082 '26.4330582 -80.9416786'
S35857972 '26.4330582 -80.9416786'
S35833885 '26.7063982 -80.7168961' ",h=T,stringsAsFactors=F)

In the tidyverse, you can either use tidyr::nest, which will nest data frames:
library(tidyverse)
df <- tibble(subID = c("S20298920", "S35629295", "S35844314", "S35833936", "S30634757", "S35834082", "S35857972", "S35833885"),
latlon = c("29.2178694, -94.934299", "26.7063982, -80.7168961", "26.7063982, -80.7168961", "26.6836236, -80.3512144", "42.4585456, -76.5146989", "26.4330582, -80.9416786", "26.4330582, -80.9416786", "26.7063982, -80.7168961"))
df %>% nest(data = subID)
#> # A tibble: 5 x 2
#> latlon data
#> <chr> <list>
#> 1 29.2178694, -94.934299 <tibble [1 x 1]>
#> 2 26.7063982, -80.7168961 <tibble [3 x 1]>
#> 3 26.6836236, -80.3512144 <tibble [1 x 1]>
#> 4 42.4585456, -76.5146989 <tibble [1 x 1]>
#> 5 26.4330582, -80.9416786 <tibble [2 x 1]>
or just summarize with list to make a list column of vectors:
df %>%
group_by(latlon) %>%
summarise_all(list)
#> # A tibble: 5 x 2
#> latlon subID
#> <chr> <list>
#> 1 26.4330582, -80.9416786 <chr [2]>
#> 2 26.6836236, -80.3512144 <chr [1]>
#> 3 26.7063982, -80.7168961 <chr [3]>
#> 4 29.2178694, -94.934299 <chr [1]>
#> 5 42.4585456, -76.5146989 <chr [1]>
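In base R, split() gives a named list of vectors directly, with one element per unique coordinate. A sketch using the sample rows from the question (`by_coord` is an illustrative name):

```r
df <- data.frame(
  subID = c("S20298920", "S35629295", "S35844314", "S35833936",
            "S30634757", "S35834082", "S35857972", "S35833885"),
  latlon = c("29.2178694, -94.9342990", "26.7063982, -80.7168961",
             "26.7063982, -80.7168961", "26.6836236, -80.3512144",
             "42.4585456, -76.5146989", "26.4330582, -80.9416786",
             "26.4330582, -80.9416786", "26.7063982, -80.7168961"),
  stringsAsFactors = FALSE
)

# One list element per unique coordinate, named by the coordinate string
by_coord <- split(df$subID, df$latlon)

# Look up the subIDs recorded at one coordinate
by_coord[["26.7063982, -80.7168961"]]
```

If you need each element to be a list rather than a character vector, `lapply(by_coord, as.list)` converts it.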

filter tibble of tibbles by nrow

I have a tibble like this:
>dat
# A tibble: 556 × 3
sample run abc
<chr> <chr> <list>
1 206_03_07_2013 21102016 <tibble [304 × 21]>
2 206_04_07_2017 7082017 <tibble [229 × 21]>
3 206_04_10_2015 25112015 <tibble [2,687 × 21]>
4 206_07_08_2013 15102015 <tibble [460 × 21]>
5 206_08_12_2016 3032017 <tibble [3,250 × 21]>
6 206_11_03_2014 21102016 <tibble [975 × 21]>
7 206_13_02_2013 21112016 <tibble [101 × 21]>
8 206_13_03_2013 21112016 <tibble [345 × 21]>
9 206_14_08_2014 8092016 <tibble [1,952 × 21]>
10 206_19_03_2015 25012016 <tibble [11 × 21]>
# ... with 546 more rows
The abc column contains tibbles of different lengths. I want to filter the dat tibble by the number of rows in each nested tibble (> 100 rows).
I could do something like this:
dat[sapply(dat$abc,nrow)>100,]
but I would like to follow the dplyr philosophy.
Any ideas?
Thanks

A way could be:
library(dplyr)
library(purrr)
dat <- tribble(
~foo, ~bar,
1, as_tibble(head(iris, 3)),
2, as_tibble(head(iris, 7))
)
# # A tibble: 2 x 2
# foo bar
# <dbl> <list>
# 1 1 <tibble [3 x 5]>
# 2 2 <tibble [7 x 5]>
res <- filter(dat, map_int(bar, nrow) > 5)
# # A tibble: 1 x 2
# foo bar
# <dbl> <list>
# 1 2 <tibble [7 x 5]>
desired_output <- dat[sapply(dat$bar,nrow)>5,]
identical(res, desired_output)
# [1] TRUE
There is not really any added value here, compared to what you tried, it's a matter of using drop-in replacements to [ and sapply (with filter and map_int respectively). Base R functions are not incompatible with the so-called "dplyr philosophy". If you mean the use of the magrittr pipe %>%, dat %>% .[sapply(.$bar, nrow) > 5, ] and dat %>% filter(map_int(bar, nrow) > 5) work equally well.
Note: I usually prefer all.equal over identical but couldn't make it work:
all.equal(res, desired_output)
# Error in equal_data_frame(target, current, ignore_col_order = ignore_col_order, :
# Can't join on 'bar' x 'bar' because of incompatible types (list / list)
(See https://github.com/tidyverse/dplyr/issues/2194)
