Extract a dataframe from a column of dataframes (tidyverse approach) - r

I have been able to do some nice things with purrr when working with dataframe columns within a dataframe, by which I mean a column of a dataframe where every cell itself contains a dataframe.
I am trying to find the idiomatic approach for extracting one of these dataframes back out.
Example
# Create a couple of dataframes:
df1 <- tibble::tribble(~a, ~b,
1, 2,
3, 4)
df2 <- tibble::tribble(~a, ~b,
11, 12,
13, 14)
# Make a dataframe with a dataframe column containing
# our first two dfs as cells:
meta_df <- tibble::tribble(~df_name, ~dfs,
"One", df1,
"Two", df2)
My question is, what is the tidyverse-preferred way of getting one of these dataframes back out of meta_df? Say I get the cell I want using select() and filter():
library("magrittr")
# This returns a 1x1 tibble with the only cell containing the 2x2 tibble that
# I'm actually after:
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs)
This works, but seems non-tidyverse-ish:
# To get the actual tibble that I'm after I can wrap the whole lot in brackets
# and then use position [[1, 1]] index to get it:
(meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs))[[1, 1]]
# Or a pipeable version:
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs) %>%
`[[`(1, 1)
I have a feeling that this might be a situation where the answer is in purrr rather than dplyr, and that it might be a simple trick once you know it, but I'm coming up blank so far.

better solution:
Use tidyr::unnest(), naming the list-column to expand (recent tidyr versions warn when cols is omitted):
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::select(dfs) %>%
tidyr::unnest(cols = dfs)
other solution:
You can use pull (the tidyverse way to select the column, equivalent to $), but it returns a one-element list of tibbles, so you need to add %>% .[[1]] to the end.
meta_df %>%
dplyr::filter(df_name == "Two") %>%
dplyr::pull(dfs) %>% .[[1]]
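A variation on the pull() approach, as a minimal sketch: purrr::pluck(1) can stand in for the `.[[1]]` step. This assumes dplyr, purrr, tibble and magrittr are installed. The meta_df below is rebuilt with tibble::tibble() and an explicit list() column only to keep the snippet self-contained; it is not the asker's exact construction.

```r
library(magrittr)

# Two small tibbles, mirroring df1 and df2 from the question
df1 <- tibble::tibble(a = c(1, 3), b = c(2, 4))
df2 <- tibble::tibble(a = c(11, 13), b = c(12, 14))

# A tibble whose `dfs` column is a list-column of tibbles
meta_df <- tibble::tibble(df_name = c("One", "Two"), dfs = list(df1, df2))

# pull() returns the list-column as a plain list; pluck(1) takes its first element
extracted <- meta_df %>%
  dplyr::filter(df_name == "Two") %>%
  dplyr::pull(dfs) %>%
  purrr::pluck(1)
```

If the filter could match more than one row, pluck(1) silently takes the first match, so it may be worth checking the number of rows first.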

Related

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)), group = c(rep("A", 100), rep("B", 100), rep("C", 100)), stringsAsFactors = FALSE)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Use split() from base R. It should be faster than filtering with == once per unique group:
with(df, split(id, group))
Or, with the tidyverse, we can pull the column after group_split(). group_split() returns a list of tibbles, so it could be slower than the split()-only method above. We can still save some work by dropping the group column (.keep = FALSE; older dplyr releases spell this argument keep) and then pulling the 'id' column from each list element to create the list of vectors
library(dplyr)
library(purrr)
df %>%
group_split(group, .keep = FALSE) %>%
map(~ .x %>%
pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
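As a quick sanity check, here is a base-R-only sketch comparing split() with the per-group loop from the question. The loop is rewritten with plain subsetting instead of dplyr::filter so the snippet needs no packages; the data is the same as in the question.

```r
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = rep(c("A", "B", "C"), each = 100),
                 stringsAsFactors = FALSE)

# Per-group loop, equivalent in spirit to the lapply/filter version in the question
by_loop <- lapply(unique(df$group), function(g) df$id[df$group == g])

# split() returns a list named by group
by_split <- split(df$id, df$group)

# Apart from the names, both approaches give the same list of vectors
same <- identical(unname(by_split), by_loop)
```

Note that split() orders its elements by the sorted group levels, while the loop follows the order of first appearance; here both happen to be A, B, C.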

most elegant way to calculate rowSums of columns that start AND end with certain strings, using dplyr

I am working with a dataset for which I want to calculate rowSums of the columns that start with one string and end with another, using dplyr (in my example: starts_with('c_') & ends_with('_f'))
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- as.data.frame(t(values))
colnames(df) <- names
> df
  c_first_f c_second_o t_third_f c_fourth_f
1         5          3         2          5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
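For completeness, a sketch that keeps the original starts_with()/ends_with() helpers in a single select() and also keeps all the other columns. It assumes a reasonably recent dplyr (one built on tidyselect 1.0 or later), where selection helpers can be combined with `&`; the one-row data frame is a stand-in for the example above.

```r
library(dplyr)

# One-row stand-in for the example data
df <- data.frame(c_first_f = 5, c_second_o = 3, t_third_f = 2, c_fourth_f = 5)

# Combine both helpers with `&` inside one select(); `.` is the piped data frame
df <- df %>%
  mutate(row.sum = rowSums(select(., starts_with("c_") & ends_with("_f"))))
```

Unlike the select-then-mutate pipelines above, this leaves c_second_o and t_third_f in the result.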

Frequency table with common values of 5 tables

I have 5 data frames and I have to analyze just the first column of each. From these, I must obtain a frequency table of their common words (not necessarily common to all data frames; for example, a word may appear in just two or more of them).
Then I must obtain a frequency table of the words common to ALL data frames.
I tried a for loop, but it seems very complicated. Moreover, the data frames have different dimensions. I didn't find any useful function.
Then I tried doing
lst1 <- list(a,b,c,d,e)
newdat <- stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))[2:1]
library(dplyr)
newdat %>% group_by(val) %>% filter(uniqueN(ind) > 1) %>% count(val)
but it gives me an error
> stack(setNames(lapply(lst1, "[", 1), seq_along(lst1)))
Error in stack.default(setNames(lapply(lst1, "[", 1), seq_along(lst1))):
at least one vector element is required
Thank you
Here's my solution using purrr & dplyr:
library(purrr)
library(dplyr)
lst1 <- list(mtcars=mtcars, iris=iris, chick=chickwts, cars=cars, airqual=airquality)
lst1 %>%
map_dfr(select, value=1, .id="df") %>% # select first column of every dataframe and name it "value"
group_by(value) %>%
summarise(freq=n(), # frequency over all dataframes
n_df=n_distinct(df), # number of dataframes in which this value occurs
dfs = paste(unique(df), collapse=",")) %>%
filter(n_df > 1) # keep values that occur in more than one dataframe
# if the value has to be in all 5 dataframes, use filter(n_df == 5) instead
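The same idea as a base-R sketch, which also suggests a likely cause of the stack() error in the question: lapply(lst1, "[", 1) returns one-column data frames rather than vectors, and stack() only accepts vectors. Extracting with `[[` (and converting to character, in case the columns are factors) avoids that. The three small data frames and their column name `w` are made up for illustration.

```r
# Hypothetical toy data: the first column of each data frame holds words
lst1 <- list(
  a = data.frame(w = c("apple", "pear", "fig"), stringsAsFactors = FALSE),
  b = data.frame(w = c("apple", "kiwi"), stringsAsFactors = FALSE),
  c = data.frame(w = c("apple", "pear", "pear"), stringsAsFactors = FALSE)
)

# `[[` extracts the first column as a vector (`[` would keep a data frame)
words <- lapply(lst1, function(d) as.character(d[[1]]))
long <- stack(words)  # columns: values (the word), ind (source data frame)

freq <- table(long$values)  # total frequency of each word
n_df <- tapply(long$ind, long$values, function(x) length(unique(x)))

result <- data.frame(word = names(freq),
                     freq = as.vector(freq),
                     n_df = as.vector(n_df[names(freq)]),
                     stringsAsFactors = FALSE)

common_any <- result[result$n_df > 1, ]              # in at least two data frames
common_all <- result[result$n_df == length(lst1), ]  # in every data frame
```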

Read a set of files into a matrix in R

I am trying to read a set of tab separated files into a matrix or data.frame. For each file I need to extract one column and then concatenate all the columns into a single matrix keeping both column and row names.
I am using tidyverse (and I am terrible at that). I successfully get column names but I miss row names at the very last stage of processing.
library("purrr")
library("tibble")
samples <- c("a","b","c","d")
a <- samples %>%
purrr::map_chr(~ file.path(getwd(), TARGET_FOLDER, paste(., "tsv", sep = "."))) %>%
purrr::map(safely(~ read.table(., row.names = 1, skip = 4))) %>%
purrr::set_names(samples) %>%
purrr::transpose()
is_ok <- a$error %>% purrr::map_lgl(is_null)
x <- a$result[is_ok] %>%
purrr::map(~ {
v <- .[,1]
names(v) <- rownames(.)
v
}) %>% as_tibble(rownames = NA)
The x data.frame has the correct colnames but lacks rownames. All the elements of the list a have the same rownames in exactly the same order. I am aware of tricks like rownames(x) <- rownames(a$result[[1]]) but I am looking for a more consistent solution.
It turned out that the solution was easier than expected. Using as.data.frame instead of the final as_tibble solved it.
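A small base-R sketch of why as.data.frame helps here: data.frame() takes its row names from the first named vector it is given, so a list of identically named vectors keeps them. The row and column names below are made up for illustration and stand in for the columns read from the files.

```r
# Hypothetical stand-in for a$result[is_ok] after extracting one named column per file
cols <- list(
  a = c(gene1 = 1.5, gene2 = 2.5),
  b = c(gene1 = 3.0, gene2 = 4.0)
)

# The names of the first vector become the row names of the data frame
x <- as.data.frame(cols)

# as.matrix() carries both row and column names over
m <- as.matrix(x)
```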

dplyr: conditional filter with character strings as column names

I am trying to 1) use dplyr::filter with character strings as column names and 2) use if statement, at the same time. However, while either works fine by itself, I had difficulty combining them, seemingly due to the issue that there cannot be a comma in the if statement. Below is a simplified example with dplyr 0.7.1.
Any insights are welcome. Thank you very much!!
library("dplyr")
df = data.frame(A=1:6, B=rep(c("good","bad"), 3), C=c("AA","AA","BB","BB","CC","CC"))
# with if statement but not character strings as column names - works
df %>% filter(if(T) {B %in% "good"})
# with character string as column names but no if statement - works
df %>% filter_at(vars("B"), any_vars(. %in% "good"))
# doesn't work when I tried to combine the two
# Error: unexpected ',' in "df %>% filter_at(if(T){vars("B"),"
df %>% filter_at(if(T){vars("B"), any_vars(. %in% "good")})
# tried to indicate the two parts in the if statement are really one item not two
# by adding () or {}, and got the same complaint
df %>% filter_at(if(T){(vars("B"), any_vars(. %in% "good"))})
df %>% filter_at(if(T){{vars("B"), any_vars(. %in% "good")}})
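This works, because the if now wraps only the first argument (vars("B")); an if expression evaluates to a single value, so it cannot supply two comma-separated arguments at once: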
df %>% filter_at(if(T){vars("B")}, any_vars(. %in% "good"))
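An alternative sketch that also handles the case where the condition is FALSE (the snippet above would then pass NULL as the variable selection). It wraps the whole filter in an ordinary if/else and uses the .data pronoun to refer to a column by its name as a string. The helper name maybe_filter and its arguments are hypothetical, not part of dplyr.

```r
library(dplyr)

df <- data.frame(A = 1:6,
                 B = rep(c("good", "bad"), 3),
                 C = c("AA", "AA", "BB", "BB", "CC", "CC"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: filter on a column given as a string, but only when asked to
maybe_filter <- function(data, col, value, apply_filter) {
  if (apply_filter) {
    filter(data, .data[[col]] %in% value)  # .data[[col]] looks the column up by name
  } else {
    data  # condition is FALSE: return the data unchanged
  }
}

kept <- maybe_filter(df, "B", "good", apply_filter = TRUE)
untouched <- maybe_filter(df, "B", "good", apply_filter = FALSE)
```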
