Subsetting a list column of integer matrices in R

Background
I've gotten myself into a situation where one column in a tibble/dataframe consists of a list of integer matrices which have zero or more rows and exactly 2 columns. This column happens to be the output of a stringr::str_locate_all() invocation, so I expect this is a common scenario.
What I would like to do is to select only one of the columns of the integer matrices and then unnest the dataframe, but I am getting confused about how to do this properly.
Example
Here's an example (I have to create it manually because dpasta() doesn't seem to work with list-column tibbles). In any case, my starting point is the tibble mydf:
library(tidyverse)
m1 <- matrix(c(761, 784), nrow = 1, ncol = 2, dimnames = list(c(), c("start", "end")))
m2 <- matrix(integer(0), nrow = 0, ncol = 2, dimnames = list(c(), c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2, dimnames = list(c(), c("start", "end")))
mydf <- tibble(item = c("a", "b", "c"), pos = list(m1, m2, m3))
In the RStudio viewer this is somewhat misleading, because it suggests that the pos entries are just vectors of integers. They're actually n x 2 matrices, and there isn't any cue indicating the extra structure. It caused me some confusion, but that's beside the point now.
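For the record, inspecting one element directly does reveal the matrix structure:
str(mydf$pos[[1]])
# num [1, 1:2] 761 784, plus a dimnames attribute naming the columns "start" and "end"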
What I would like to do is end up with an unnested tibble where the 1st column, "start", is selected. The desired output would look like this (after unnesting):
mydf_desired <- tibble( item = c("a","c","c"), start_pos = c(761,1001,2300))
Note that the first row in mydf had only one row in its pos matrix, so it contributes one row to the desired result. The row with item = "b" had a 0x2 matrix, so it doesn't appear (though it would have been fine for it to appear as an NA). The row with item = "c" had two rows in its pos matrix, so it contributes two rows to the desired result.
What I tried
This seems simple enough; I've unnested list columns before. The only twist here is that I have to first select the "start" column and then unnest, right? I just map the pos list column to [,1] to pick off the 1st column (the "start" column), and then it should be a matter of unnesting...
mydf_desired <- mydf %>%
  mutate(start_pos = map(pos, ~ .[, 1])) %>%
  unnest()
#> Error in vec_rbind(!!!x, .ptype = ptype): Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
#> Warning: `cols` is now required.
#> Please use `cols = c(pos, start_pos)`
I have no idea what "value should have been recycled to fit x" actually means, but it's also warning me that I didn't supply cols to unnest(). My suspicion is that it's something about what I'm passing to unnest().
If I omit unnest() I don't get that error...
mydf_desired <- mydf %>%
  mutate(start_pos = map(pos, ~ .[, 1]))
The output sort of looks OK, although I notice there's still a pos entry of integer(0) for item = "b". But even if I omit that row, I get the same error when I try to unnest().
Here's where I am stumped. Why can't I just unnest() this tibble? What is the meaning of the value should have been recycled to fit x error?

One option is to filter out the empty rows, then map over the list elements to extract the column from each matrix, and use unnest_longer():
library(dplyr)
library(purrr)
mydf %>%
  filter(lengths(pos) > 0) %>%
  transmute(item, start_pos = map(pos, ~ as.vector(.x[, 1]))) %>%
  unnest_longer(c(start_pos))
# A tibble: 3 x 2
#  item  start_pos
#  <chr>     <dbl>
#1 a           761
#2 c          1001
#3 c          2300
We can also avoid the filter step if we convert each extracted column to a tibble:
mydf %>%
  transmute(item, pos = map(pos, ~ .x[, 1] %>%
    tibble(start_pos = .))) %>%
  unnest(c(pos))

The error comes because unnest() is trying to unnest the pos column as well. You can avoid the error by specifying explicitly which columns you want to unnest.
library(dplyr)
library(purrr)
mydf %>% mutate(start_pos = map(pos, ~.[, 1])) %>% unnest(start_pos)
# A tibble: 3 x 3
#  item  pos               start_pos
#  <chr> <list>                <dbl>
#1 a     <dbl[,2] [1 × 2]>       761
#2 c     <dbl[,2] [2 × 2]>      1001
#3 c     <dbl[,2] [2 × 2]>      2300
If you want an NA row for item "b", you can use unnest_longer():
mydf %>%
  mutate(start_pos = map(pos, ~ .[, 1])) %>%
  unnest_longer(start_pos, indices_include = FALSE)
# A tibble: 4 x 3
#  item  pos               start_pos
#  <chr> <list>                <dbl>
#1 a     <dbl[,2] [1 × 2]>       761
#2 b     <int[,2] [0 × 2]>        NA
#3 c     <dbl[,2] [2 × 2]>      1001
#4 c     <dbl[,2] [2 × 2]>      2300
Or use unnest() with keep_empty = TRUE:
mydf %>%
  mutate(start_pos = map(pos, ~ .[, 1])) %>%
  unnest(start_pos, keep_empty = TRUE)
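Whichever variant you use, if you want exactly the two-column shape asked for (item and start_pos only), you can drop the original matrix column afterwards, for example:
mydf %>%
  mutate(start_pos = map(pos, ~ .[, 1])) %>%
  unnest(start_pos, keep_empty = TRUE) %>%
  select(item, start_pos)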

Related

I wish to keep/discard items in a PURRR nested list based on a sublist within each grouped level

I have sets of weather station data which I wish to compare by site. I need to do this efficiently because each set is large, and I wish to build my experience with purrr. My issue concerns the use of keep/discard (or list.exclude from rlist) to remove days (id) with incomplete data; it should be a doozy but I can't get the syntax right. I have tried to approach this problem by computing the dimensions of each tibble, and then using the length to give me a unitary list. I am using R 3.6.1 on a PC running Windows 10. Here is a trivial example. I wish 'mylist' to comprise id = 'a' only in this example.
mylist <- tibble(id = c(rep("a", 5), rep("b", 4)),
                 dl = c(seq(1, 5, 1), seq(1, 4, 1)),
                 v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
  group_by(id) %>%
  nest() %>%
  mutate(ddim = map(data, dim)) %>%
  mutate(nn = map(ddim, extract(1)))
mylist
# A tibble: 2 x 4
# Groups:   id [2]
  id    data           ddim      nn
  <chr> <list<df[,2]>> <list>    <list>
1 a     [5 x 2]        <int [2]> <int [1]>
2 b     [4 x 2]        <int [2]> <int [1]>
It is not clear how "incomplete data" is defined, but since the question is more about how to filter rows where a certain condition is satisfied in a list column, I have used a placeholder condition: select rows where the first value of the v column in the nested tibble is 0. This condition can be changed after clarification from the OP.
We can use filter to select rows and map_lgl to loop over the data column for each id.
library(tidyverse)
mylist %>% filter(map_lgl(data, ~first(.x$v) == 0))
#  id    data
#  <chr> <list<df[,2]>>
#1 a     [5 × 2]
Similarly, in base R, we can use subset with sapply
subset(mylist, sapply(data, function(x) x$v[1] == 0))
data
mylist <- tibble(id = c(rep("a", 5), rep("b", 4)),
                 dl = c(seq(1, 5, 1), seq(1, 4, 1)),
                 v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
  group_by(id) %>%
  nest()

How do you calculate the mean / sum of particular columns of a nested list column using purrr's map function

I'm trying to use purrr to summarize a particular column of a nested list column.
library(tidyverse)
z <- tibble(name = c("Bill","Bill","Bill","Sue","Sue"), grade =c(90L,95L,70L,100L,98L), time=c(10L,11L,10L,15L,16L))
summary <- z %>%
  group_by(name) %>%
  nest %>%
  mutate(n = map_int(data, nrow)) %>%
  mutate(avg = map(data$grade, mean)) %>%
  mutate(ttl_time = map(data$time, sum))
When I run this I get an error: Column y must be length 3 (the number of rows) or one, not 2
My target output is:
name  data   n  avg  ttl_time
Bill  [3x3]  3   92        31
Sue   [2x3]  2   99        31
When I remove the last two mutate() calls the script works as anticipated. This leads me to believe that I'm not isolating the grade and time columns within the data column, but I can't figure out what I'm doing wrong.
I watched this RStudio video and I believe I'm doing the same thing I saw in the video: Working with List Columns.
z %>%
  group_by(name) %>%
  nest() %>%
  mutate(n = map_int(data, nrow),
         avg = map_dbl(data, ~ mean(.x$grade)),
         ttl_time = map_dbl(data, ~ sum(.x$time)))
# # A tibble: 2 x 5
#   name  data                 n   avg ttl_time
#   <chr> <list>           <int> <dbl>    <dbl>
# 1 Bill  <tibble [3 × 2]>     3    85       31
# 2 Sue   <tibble [2 × 2]>     2    99       31
The formula notation with ~ is a shortcut for, e.g., function(.x) mean(.x$grade).
The OP's error indeed stems from the fact that map cannot iterate directly over each grade element of the data list, at least not with this syntax: data$grade is understood as an element of the list data that has the name grade, and there is no such element.
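A minimal sketch of the distinction, using a plain list of tibbles as a stand-in for the nested data column:
d <- list(tibble(grade = c(90, 95, 70)), tibble(grade = c(100, 98)))
d$grade          # NULL: no element of the list is itself named "grade"
map(d, "grade")  # extracts the grade column from each tibble instead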
This alternative syntax might help understand how this is achievable:
z %>%
  group_by(name) %>%
  nest() %>%
  mutate(n = map_int(data, nrow),
         avg = map_dbl(map(data, "grade"), mean),
         ttl_time = map_dbl(map(data, "time"), sum))
where map(data, "grade") extracts each grade component from the elements of the list column data.
Though this is, in my opinion, less readable than the first suggestion.

Create a list of dataframes and use it to call details about that dataframe

I am trying to create a list of dataframes and then use that list to build another dataframe describing the attributes of each dataframe. I wanted to do this with a loop.
I tried creating a list of dataframes. Then I used that list in a loop that says: for each row in my new dataframe, put the name of the dataframe in one column and the number of rows of that dataframe in another column.
df_Months <- as.list(c(df_Jan2018, df_Feb2018, df_March2018, df_April2018, df_May2018))
for (i in 1:length(df_Months)) {
  Monthly_Size$Month[i] <- paste(df_Months[i])
  Monthly_Size$Size[i] <- nrow(df_Months[i])
}
If I do nrow(df_Months[1]) the result is NULL, even though I know that is not the case, because if I just do nrow(df_Jan2018) it gives me back the correct number of rows.
Here is a solution using the purrr and dplyr packages that should work on your data. You wouldn't need the for loop anymore.
library("purrr")
library("dplyr")
test_df <- data.frame(a = c(1, 2, 3, 4, NA),
                      b = c(NA, 6, 5, 7, 9))
test_df2 <- data.frame(c = c(1:10),
                       d = c(11:20))
df_list <- list(test_df = test_df, test_df2 = test_df2)
res <- map_dbl(df_list,nrow)
tibble(df = names(res), nrow = res)
The output looks like this:
# A tibble: 2 x 2
  df        nrow
  <chr>    <dbl>
1 test_df      5
2 test_df2    10
A slightly different approach would be to put the above list df_list into a tibble, and then do operations on that tibble to create new columns with the information you are looking for.
df_tibble <- tibble(name = names(df_list), df = df_list)
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)))
# A tibble: 2 x 3
  name     df                     nrow
  <chr>    <list>                <dbl>
1 test_df  <data.frame [5 × 2]>      5
2 test_df2 <data.frame [10 × 2]>    10
You could go on and include more information in this way, for example the number of columns:
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)),
                     ncol = map_dbl(df, ~ ncol(.x)))
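As an aside, the NULL in the question most likely comes from how the list was built and indexed: c() flattens data frames into one long list of columns, and single-bracket indexing returns a one-element list rather than the data frame itself. A small sketch with made-up monthly data frames:
# Hypothetical stand-ins for the monthly data frames in the question
df_Jan2018 <- data.frame(x = 1:3)
df_Feb2018 <- data.frame(x = 1:5)
df_Months <- list(Jan2018 = df_Jan2018, Feb2018 = df_Feb2018)  # list(), not as.list(c(...))
nrow(df_Months[1])   # NULL: `[` returns a one-element list, which has no row count
nrow(df_Months[[1]]) # 3: `[[` returns the data frame itself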

For loop: Count matches and unique elements among two dataframes and apply function to counts

I would like to conduct a very involved loop. I have multiple regions, each with hundreds of plots, in my real data frame. I would like to subset by region and then by plot, and perform various functions on the subsets to ultimately calculate the dissimilarity owed only to species that are shared. I will preface by saying each row represents an interaction.
My example df:
set.seed(540)
df <- data.frame(region = c(rep(1, 16), rep(2, 8)),
                 plot = c(rep("A", 5), rep("B", 9), rep("C", 2), rep("D", 6), rep("E", 2)),
                 plantsp = sample(1:24, 24, replace = TRUE),
                 lepsp = sample(1:24, 24, replace = TRUE),
                 psitsp = sample(1:24, 24, replace = TRUE))
df[] <- lapply(df, as.character)
df$plantsp <- paste('plantsp', df$plantsp, sep = '_')
df$lepsp <- paste('lepsp', df$lepsp, sep = '_')
df$psitsp <- paste('psitsp', df$psitsp, sep = '_')
df$paste1 <- paste(df$plantsp, df$lepsp, sep = '_')
df$paste2 <- paste(df$lepsp, df$psitsp, sep = '_')
df$paste3 <- paste(df$plantsp, df$lepsp, df$psitsp)
Step1: Subset df by region. Example:
region_sub <- split(df, df$region)
Step2: Subset df by plot. Example:
plot_sub <- split(region_sub[[1]], region_sub[[1]][[2]])
Step3: We will call each subset (each list component) from the step above a plot subset. In this example I will use the first subset (region 1, plot A) as an example for all subsequent outputs; I will call this subset plot_sub1. I want to compare plot_sub1 to the original df to make three df subsets, which we will call df_sub1, df_sub2 and df_sub3. First, df_sub1 consists of matches among entries in the plantsp and lepsp columns between plot_sub1 and df. Rows with any unique entries are removed, as well as rows where the plantsp matches but not the lepsp, and vice versa. Example of df_sub1:
df_sub1<- df[c(1,2,3,4,5,22),c(1:4,6)]
Notice that only those rows with shared species remain, and further, only those rows with shared species that also interact. Also, I have removed unnecessary columns (e.g. psitsp, paste2, paste3) to draw your attention to the results of this step; these columns do not need to be removed for the code.
Step4: Repeat step3 for lepsp and psitsp columns to make df_sub2. Example:
df_sub2<- df[1:5,c(1:2,4,5,7)]
Step5: Repeat step3 for the plantsp, lepsp and psitsp columns to make df_sub3. Example:
df_sub3<- df[1:5,c(1:5,8)]
Step6: Now that all subsets are made, I want to count the matching elements in the paste1 column between plot_sub1 and df_sub1 (= 5). The result would be stored in a vector match. Example:
match<- length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1))
match
I also want to count the unique elements (= 1). This would be stored in a vector unique. This will be repeated for plot_sub1 and df_sub2, and for plot_sub1 and df_sub3. I am not sure how to count unique elements among two data frames, so I cannot offer example code for that.
unique<- 1
Note: matches with plot_sub only need to be counted once in the event that the df_sub has repeated interactions or matches. This needs to account for presence-absence of matches, not abundance.
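(One generic way to count elements of one vector that are absent from another, i.e. the "unique" count described above, is setdiff(); a minimal sketch with made-up values:)
x <- c("p1_l1", "p2_l2", "p3_l3")  # toy pasted interactions
y <- c("p1_l1", "p2_l2")
length(setdiff(x, y))  # 1: one interaction in x is not shared with y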
In summary for this subset, the two vectors would be:
match <- c(length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1)),
           length(intersect(df_sub2$paste2, plot_sub[[1]]$paste2)),
           length(intersect(df_sub3$paste3, plot_sub[[1]]$paste3)))
match
unique<-c(1,0,0)
The sum will then be totaled for each vector. Example:
sum_match<- 15
sum_unique<- 1
Step7: Lastly, these values would be input into a function:
((a + b) / ((2*a + b) / 2)) - 1, where a = sum_match and b = sum_unique.
The value is then input into the result vector res_vec.
Step8: This process (steps 3-7) would be iterated for each plot subset.
Effectively, this will calculate the dissimilarity of shared interactions among plot interactions and the corresponding metaweb (all possible interactions). This is a modification from (Poisot et al 2012) to account for tritrophic interactions.
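For concreteness, a direct R translation of this formula might look like the sketch below (the helper name dissim is made up; a = 15 and b = 1 are the example sums above):
dissim <- function(a, b) ((a + b) / ((2 * a + b) / 2)) - 1  # a = sum_match, b = sum_unique
dissim(15, 1)  # 0.03225806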
It's quite pathetic, but to start the for loop I have:
res_vec<- NA
for (i in 1:length(unique(df$region)))
{
for (j in 1:length(unique(df$plot)))
{
I really appreciate any time anyone is willing to spend helping me work out the body of the loop. That is where it gets tricky for me.
Thanks @Gregor for all the clarification you've already done in the comments!
Here is my solution using the tidyverse.
CODE + EXPLANATION
## Load packages
library(tidyverse)
## Nest data
new_df <- df %>%
  group_by(region, plot) %>%
  nest(.key = plot_sub)
new_df
# A tibble: 5 x 3
#   region plot   plot_sub
#    <dbl> <fctr> <list>
# 1      1 A      <tibble [5 x 3]>
# 2      1 B      <tibble [9 x 3]>
# 3      1 C      <tibble [2 x 3]>
# 4      2 D      <tibble [6 x 3]>
# 5      2 E      <tibble [2 x 3]>
The column plot_sub contains the same data as the list with the same name in your question. Think of this column as a list of dataframes.
I now write a function to create the df_subs. This keeps our code cleaner and avoids unnecessary repetition. This function will then be applied to our plot_sub column.
# Function to create the df_sub
# Takes the plot_sub, original dataframe (df) and a list of columns, which should be compared
# Returns the desired df_sub with new interactions of species which are in plot_sub
# Only unique interactions are returned
create_df_sub <- function(plot_sub, df, col_list){
  # Filter df such that it only contains species which are in plot_sub
  for (x in col_list) {
    df <- df[df[[x]] %in% plot_sub[[x]], ]
  }
  # Combine plot_sub and filtered df
  df_sub <- rbind(plot_sub[, col_list], df[, col_list])
  # Paste relevant columns together
  df_sub$paste_col <- do.call(paste, c(df_sub[, col_list], sep = '_'))
  # Exclude duplicated values
  df_sub <- df_sub[!duplicated(df_sub$paste_col), ]
  return(df_sub)
}
Now I define the columns I want to create the df_subs with, and then apply the function to the plot_sub column:
col_list1 <- c('plantsp', 'lepsp')
col_list2 <- c('lepsp', 'psitsp')
col_list3 <- c('plantsp', 'lepsp', 'psitsp')
new_df <- new_df %>%
  mutate(df_sub1 = map(plot_sub, create_df_sub, df = df, col_list = col_list1),
         df_sub2 = map(plot_sub, create_df_sub, df = df, col_list = col_list2),
         df_sub3 = map(plot_sub, create_df_sub, df = df, col_list = col_list3))
map takes a vector or list as argument and applies the specified function to each element (like lapply). Compare the first elements of df_sub1 and plot_sub to see the difference.
new_df$plot_sub[[1]]
# A tibble: 5 x 3
#   plantsp    lepsp    psitsp
#   <chr>      <chr>    <chr>
# 1 plantsp_2  lepsp_19 psitsp_19
# 2 plantsp_21 lepsp_19 psitsp_4
# 3 plantsp_19 lepsp_2  psitsp_11
# 4 plantsp_9  lepsp_13 psitsp_24
# 5 plantsp_24 lepsp_9  psitsp_2
new_df$df_sub1[[1]]
# A tibble: 6 x 3
#   plantsp    lepsp    paste_col
#   <chr>      <chr>    <chr>
# 1 plantsp_2  lepsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2  plantsp_19_lepsp_2
# 4 plantsp_9  lepsp_13 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9  plantsp_24_lepsp_9
# 6 plantsp_9  lepsp_2  plantsp_9_lepsp_2
The new interaction is added in df_sub1.
To extract matching and unique values, I use inner_join and anti_join on the plot_sub column and the different df_subs:
new_df <- new_df %>%
  mutate(match1 = map2(df_sub1, plot_sub, inner_join, by = col_list1),
         match2 = map2(df_sub2, plot_sub, inner_join, by = col_list2),
         match3 = map2(df_sub3, plot_sub, inner_join, by = col_list3),
         unique1 = map2(df_sub1, plot_sub, anti_join, by = col_list1),
         unique2 = map2(df_sub2, plot_sub, anti_join, by = col_list2),
         unique3 = map2(df_sub3, plot_sub, anti_join, by = col_list3))
The inner_join returns all rows which have matching values in the columns specified in the by argument, whereas the anti_join returns all rows of df_sub which are not matched.
Here I use the map2 function, which takes two vectors/lists and applies the specified function to corresponding pairs of elements.
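As a toy illustration of map2() semantics (independent of the data above):
map2(list(1:2, 3:4), list(10, 100), ~ .x * .y)
#> [[1]]
#> [1] 10 20
#>
#> [[2]]
#> [1] 300 400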
new_df$match1[[1]]
# A tibble: 5 x 4
#   plantsp    lepsp    psitsp    paste_col
#   <chr>      <chr>    <chr>     <chr>
# 1 plantsp_2  lepsp_19 psitsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 psitsp_4  plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2  psitsp_11 plantsp_19_lepsp_2
# 4 plantsp_9  lepsp_13 psitsp_24 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9  psitsp_2  plantsp_24_lepsp_9
new_df$unique1[[1]]
# A tibble: 1 x 3
#   plantsp   lepsp   paste_col
#   <chr>     <chr>   <chr>
# 1 plantsp_9 lepsp_2 plantsp_9_lepsp_2
In the last step I extract the number of rows of each match and unique and sum it up. I also calculate the res_vec.
new_df <- new_df %>%
  mutate(sum_match = map_int(match1, nrow) + map_int(match2, nrow) + map_int(match3, nrow),
         sum_unique = map_int(unique1, nrow) + map_int(unique2, nrow) + map_int(unique3, nrow),
         res_vec = ((sum_match + sum_unique) / ((2 * sum_match + sum_unique) / 2)) - 1)
Here I use map_int as my return value is an integer and I want to use it directly in a sum. Using map alone would return a list which I would first have to convert to an integer vector.
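The difference in a nutshell (a tiny example, independent of the data above):
map(list(1:2, 1:5), length)      # a list: list(2L, 5L)
map_int(list(1:2, 1:5), length)  # an integer vector: c(2L, 5L)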
new_df %>% select(region, plot, sum_match, sum_unique, res_vec)
# A tibble: 5 x 5
#   region plot   sum_match sum_unique    res_vec
#    <dbl> <fctr>     <int>      <int>      <dbl>
# 1      1 A             15          1 0.03225806
# 2      1 B             27          3 0.05263158
# 3      1 C              6          2 0.14285714
# 4      2 D             18          1 0.02702703
# 5      2 E              6          0 0.00000000
DATA
set.seed(540)
df <- data.frame(region = c(rep(1, 16), rep(2, 8)),
                 plot = c(rep('A', 5), rep('B', 9), rep('C', 2), rep('D', 6), rep('E', 2)),
                 plantsp = sample(1:24, 24, replace = TRUE),
                 lepsp = sample(1:24, 24, replace = TRUE),
                 psitsp = sample(1:24, 24, replace = TRUE))
df$plantsp <- paste('plantsp', df$plantsp, sep = '_')
df$lepsp <- paste('lepsp', df$lepsp, sep = '_')
df$psitsp <- paste('psitsp', df$psitsp, sep = '_')

Filtering out nested data frames by number of observations

Following from: Use filter() (and other dplyr functions) inside nested data frames with map()
I want to nest on multiple columns, and then filter out rows by the number of items that were nested into that row. For example,
df <- tibble(
  a = sample(x = c(rep(c('x', 'y'), 4), 'w', 'z')),
  b = sample(c(1:10)),
  c = sample(c(91:100))
)
I want to nest on column a, as in:
df_nest <- df %>%
  nest(-a)
Then, I want to filter out the rows that only have 1 observation in the data column (where a = w or a = z, in this case). How can I do that?
You can use map/map_int on the data column to return the nrow in each nested tibble, and construct the filter condition based on it:
df %>%
  nest(-a) %>%
  filter(map_int(data, nrow) == 1)
# filter(map(data, nrow) == 1) works as well
# A tibble: 2 x 2
#  a     data
#  <chr> <list>
#1 w     <tibble [1 x 2]>
#2 z     <tibble [1 x 2]>
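If "filter out" is meant the other way round, i.e. dropping the single-observation groups rather than keeping them, the same condition can simply be inverted (a sketch using the same nesting call):
df %>%
  nest(-a) %>%
  filter(map_int(data, nrow) > 1)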
