pmap_df inside mutate applying to entire dataframe not just row - r

I have a dataframe where each row includes arguments that I want to pass into a function iteratively. The function itself returns a dataframe with a few rows. I would like to keep the arguments and results together in one dataframe by applying pmap_df like you can with pmap_dbl inside of a mutate to add a new column with the results from the function. With the code below, I am able to get a column with nested data in it, but every row contains the data for all of the results, not just the ones corresponding to that row.
library(dplyr)  # group_by(), mutate(), %>%
library(tidyr)  # nest()
example_function <- function(data, string, ...){
word_one <- paste(data$word_one, string)
word_two <- paste(data$word_two, string)
output <- data_frame(result_words = c(word_one, word_two))
}
fake_data <- tibble(group_id = rep(c(1, 2), each = 3),
word_one = c("hello", "goodbye", "today",
"apple", "banana", "coconut"),
word_two = c("my", "name", "is",
"ellie", "good", "morning"))
test <- fake_data %>%
group_by(group_id) %>%
nest() %>%
mutate(string = "not working") %>%
mutate(final_output = list(purrr::pmap_df(.l = ., .f = example_function)))
The output looks like:
Rows: 2
Columns: 4
Groups: group_id [2]
$ group_id <dbl> 1, 2
$ data <list> [<tbl_df[3 x 2]>], [<tbl_df[3 …
$ string <chr> "not working", "not working"
$ final_output <list> [<tbl_df[12 x 1]>], [<tbl_df[…
What I would like to have would be for each of the final outputs to have only 6 rows in each dataframe, corresponding to the inputs from the nested data column. Is this possible?

With the OP's function, this can be done without any pmap (just return the output from the function explicitly):
example_function <- function(data, string, ...){
  word_one <- paste(data$word_one, string)
  word_two <- paste(data$word_two, string)
  # tibble() replaces the deprecated data_frame()
  output <- tibble(result_words = c(word_one, word_two))
  output
}
Because nest_by() returns a rowwise data frame, the function can be applied directly to each row:
library(dplyr)
fake_data %>%
nest_by(group_id) %>%
mutate(string = "not working") %>%
mutate(final_output = list(example_function(data, string)))
# A tibble: 2 × 4
# Rowwise: group_id
group_id data string final_output
<dbl> <list<tibble[,2]>> <chr> <list>
1 1 [3 × 2] not working <tibble [6 × 1]>
2 2 [3 × 2] not working <tibble [6 × 1]>
With pmap, capture the arguments for each row as a list in an object x1, then apply the OP's function to the list elements, i.e. x1$data and x1$string:
library(purrr)
fake_data %>%
nest_by(group_id) %>%
mutate(string = "not working") %>%
ungroup %>%
mutate(final_output = pmap(across(-group_id),
~ {
x1 <- list(...)
example_function(x1$data, x1$string)
}))
# A tibble: 2 × 4
group_id data string final_output
<dbl> <list<tibble[,2]>> <chr> <list>
1 1 [3 × 2] not working <tibble [6 × 1]>
2 2 [3 × 2] not working <tibble [6 × 1]>
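For reference, the original attempt produced 12 rows per cell because `.` inside the pipe refers to the whole nested data frame, so pmap_df ran the function over every row and bound all of the results together before list() wrapped them. Below is a minimal sketch of another way to get one result per row, assuming map2 over the data and string columns fits your real function. The data is nested with tidyr::nest() here rather than nest_by(), so the result is a plain ungrouped tibble.

```r
library(dplyr)
library(tidyr)
library(purrr)

example_function <- function(data, string, ...) {
  tibble(result_words = c(paste(data$word_one, string),
                          paste(data$word_two, string)))
}

fake_data <- tibble(
  group_id = rep(c(1, 2), each = 3),
  word_one = c("hello", "goodbye", "today", "apple", "banana", "coconut"),
  word_two = c("my", "name", "is", "ellie", "good", "morning")
)

result <- fake_data %>%
  nest(data = c(word_one, word_two)) %>%   # one row per group_id
  mutate(string = "not working",
         # map2 walks data and string in parallel, one call per row
         final_output = map2(data, string, example_function))

# number of rows in each per-group result
map_int(result$final_output, nrow)
```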

Related

How to make a code run for twice rows each as a group in a dataframe in R?

I would like to check whether an image includes a certain color range. I have the following data, the code for the color groups, and the picture:
library(magick)
library(scatterplot3d)
library(colordistance)
image <- image_read("image.jpg")
colorpalette <- data.frame(group = c(1,1,2,2,3,3,4,4),
A = c(0.25168911, 0.22669197, 0.22669197, 0.21293378, 0.21293378, 0.19612549, 0.19612549, 0.17940884),
B = c(0.1477321049, 0.1738456673, 0.1738456673, 0.1149389567, 0.1149389567, 0.1365333333, 0.1365333333, 0.0737957272),
C = c(0.0887786440, 0.1442915876, 0.1442915876, 0.0545024047, 0.0545024047, 0.1020078431, 0.1020078431, 0.0139303483))
upper1 <- c(0.22669197, 0.1738456673, 0.1442915876)
lower1 <- c(0.21293378, 0.1149389567, 0.0545024047)
group1 <- countColors("image.jpg", color.range = "rectangular", upper = upper1, lower = lower1, bg.lower = NULL, plotting = TRUE, target.color = "orange")
I would like to run this code for all color groups (1-4) in the predefined colorpalette.
The url for the picture:
https://www.google.com/imgres?imgurl=https%3A%2F%2Fdigitalsynopsis.com%2Fwp-content%2Fuploads%2F2016%2F05%2Fcinema-palettes-famous-movie-colors-star-wars-the-force-awakens.jpg&imgrefurl=https%3A%2F%2Fdigitalsynopsis.com%2Fdesign%2Fcinema-palettes-famous-movie-colors%2F&tbnid=x8RFFszidcs6-M&vet=12ahUKEwiY892Tj9f2AhUp1-AKHdjVBNsQMygAegQIARAd..i&docid=sZw2jvuqpGDZ2M&w=780&h=551&q=cinema-palettes-famous-movie-colors-star-wars-the-force-awakens.jpg&client=firefox-b-e&ved=2ahUKEwiY892Tj9f2AhUp1-AKHdjVBNsQMygAegQIARAd
I'm a little confused by your initial code as the online doc for colordistance suggests that countColors is not exported from the package. However, here's a proof of concept example to show you the sort of thing you will need to do.
My plan is to use the group_map function from the tidyverse. To do this, I need to reformat your colorpalette dataframe so that the input parameters to countColors (or any other function) are each contained in the columns of one row (or group) of the data frame. Your format won't work because you have the values we need in the rows of each column. So I need to pivot the dataframe.
library(tidyverse)
d <- colorpalette %>%
add_column(limit=rep(c("upper", "lower"), times=4)) %>%
nest(data=c(A, B, C)) %>%
pivot_wider(names_from=limit, values_from=data)
d
# A tibble: 4 × 3
group upper lower
<dbl> <list> <list>
1 1 <tibble [1 × 3]> <tibble [1 × 3]>
2 2 <tibble [1 × 3]> <tibble [1 × 3]>
3 3 <tibble [1 × 3]> <tibble [1 × 3]>
4 4 <tibble [1 × 3]> <tibble [1 × 3]>
That looks a little odd, but the original data points are still there. For example:
d$upper[1]
[[1]]
# A tibble: 1 × 3
A B C
<dbl> <dbl> <dbl>
1 0.252 0.148 0.0888
As I mentioned, it seems that I can't run countColors. So, as a demonstration, I'll simply calculate the squared Euclidean distances between the 3-dimensional points defined by each row of upper and lower:
d %>%
rowwise() %>%
group_map(
function(.x, .y) {
y <- .x %>% unnest(
c(upper, lower),
names_repair=function(x) {
c("group", "A1", "B1", "C1", "A2", "B2", "C2")
}
)
list(delta=(y$A1 - y$A2)^2 + (y$B1 - y$B2)^2 + (y$C1 - y$C2)^2)
}
) %>%
bind_rows()
# A tibble: 4 × 1
delta
<dbl>
1 0.00439
2 0.0117
3 0.00301
4 0.0120
I'm not suggesting that's a sensible way of calculating Euclidean distances. I'm just illustrating a technique.
That suggests you might get what you want with something like
colorpalette %>%
add_column(limit=rep(c("upper", "lower"), times=4)) %>%
nest(data=c(A, B, C)) %>%
pivot_wider(names_from=limit, values_from=data) %>%
rowwise() %>%
group_map(
function(.x, .y) {
countColors(
"image.jpg",
color.range = "rectangular",
upper = unlist(.x$upper),  # flatten the 1 × 3 tibble into a numeric triplet
lower = unlist(.x$lower),
bg.lower = NULL,
plotting = TRUE,
target.color = "orange"
)
}
)
But I can't be sure because I can't test the code.
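If the reshaping feels heavy, a base R sketch of the same idea is below. It makes an assumption that is not in the question: that each group's lower and upper bounds are the per-channel minimum and maximum of its two rows, so that lower is never greater than upper. Each element of the hypothetical bounds list then holds plain numeric triplets that could be supplied as the lower and upper arguments of the call above. No countColors call is made here, since it could not be tested.

```r
colorpalette <- data.frame(
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  A = c(0.25168911, 0.22669197, 0.22669197, 0.21293378,
        0.21293378, 0.19612549, 0.19612549, 0.17940884),
  B = c(0.1477321049, 0.1738456673, 0.1738456673, 0.1149389567,
        0.1149389567, 0.1365333333, 0.1365333333, 0.0737957272),
  C = c(0.0887786440, 0.1442915876, 0.1442915876, 0.0545024047,
        0.0545024047, 0.1020078431, 0.1020078431, 0.0139303483)
)

# Split the three channel columns by group, then take the per-channel
# min and max of each group's rows (assumption: this defines the range).
bounds <- lapply(
  split(colorpalette[c("A", "B", "C")], colorpalette$group),
  function(g) {
    list(lower = vapply(g, min, numeric(1)),
         upper = vapply(g, max, numeric(1)))
  }
)

# Bounds for group 1, as named numeric vectors of length 3
bounds[["1"]]
```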

How to write R function to create every subgroup based on multiple columns?

I'm struggling to create a function in R that will take in a dataset and columns, and output every permutation of datasets filtered by all of these 3 columns.
My data set looks like
structure(list(name = c("Peter Doe", "John Gary", "Elsa Johnson",
"Mary Poppins", "Jesse Bogart"), sex = c("Male", "Male", "Female",
"Female", "Male"), class = c("Honors", "Core", "Core", "Honors",
"Honors"), grade = c("A", "A", "A", "B", "C")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
I tried to visualize my goal here:
I was hoping to create new variables based on what path of this map it followed (e.g. male_honors_a <- dataset filtered by those column values) and I think I could do that with the paste function but am not sure here as well. More importantly though, I'm struggling with how to put for loops together inside the function that are able to filter based on the unique values of a column.
I got as far as coding up a function that creates every subgroup individually, but I was not able to figure out how to put them together.
subgroups <- function(df, filters, group = "none", name = ""){
listofdfs <- list()
for (i in filters) {
subgroups <- unique(df[[i]])
for (j in subgroups){
x <- df[df[i] == j,]
listofdfs[[paste(name,j, sep = "")]] <- x
}
}
if (group != "none"){
return(listofdfs[[group]])
}
else {
return(listofdfs)}
}
subgroups(df, c("sex", "class", "grade"))
I would hope by running subgroups(df, c("sex", "class")), my output would be a list of dataframes:
list(male_honors, male_core, female_honors, female_core)
in which the male_honors element is
# A tibble: 2 × 4
name sex class grade
1 Peter Doe Male Honors A
2 Jesse Bogart Male Honors C
Would really appreciate any help!
tidyr::nest() does this directly. Notice that for each combination of grouping/nesting variables, a tibble is neatly tucked into the data cell. I've modified your function a little by (a) removing the aspects unrelated to grouping (like filter) and (b) making groups default to an empty character vector, so if nothing is passed then nothing is grouped.
Also, the names (e.g., male honors) are easily retrievable via variable values. That's typically a lot more useful than retrieving the values from the variable names.
Will this work for your purposes?
subgroups <- function(df, groups = character(0)) {
  df |>
    # all_of() selects the columns named in the character vector `groups`
    tidyr::nest(data = -tidyselect::all_of(groups))
}
> subgroups(ds, c("class", "sex"))
# # A tibble: 4 × 3
# sex class data
# <chr> <chr> <list>
# 1 Male Honors <tibble [2 × 2]>
# 2 Male Core <tibble [1 × 2]>
# 3 Female Core <tibble [1 × 2]>
# 4 Female Honors <tibble [1 × 2]>
> subgroups(ds, c("sex"))
# # A tibble: 2 × 2
# sex data
# <chr> <list>
# 1 Male <tibble [3 × 3]>
# 2 Female <tibble [2 × 3]>
> subgroups(ds)
# # A tibble: 1 × 1
# data
# <list>
# 1 <tibble [5 × 4]>
Additional resources: tidyr's Nested data vignette
You can create a key column and filter on it. Looping over the unique values of the key gives each subset of your data frame. Here is a solution with your data as df and the desired list result as l.
library(dplyr)
# make a key (constructed from 2 or more column values)
df <- df |> mutate(key = paste0(sex, "_", class))
# get the unique keys
keys <- unique(df$key)
# make an empty list
l <- list()
# loop through the unique keys to filter your df, removing the key column
for (x in seq_along(keys)) {
  l[[x]] <- df[df$key == keys[x], ] |> select(!key)
}
# name the list elements
names(l) <- tolower(keys)
# your desired result
l
And written as a function, it would look like this:
subgroups <- function(df, groups = character(0)){
  # make a key vector by pasting the grouping columns together
  v <- df |> select(all_of(groups))
  v <- do.call(paste, c(v, sep = "_"))
  # get the unique keys
  keys <- unique(v)
  # make an empty list
  l <- list()
  # loop through the unique keys to filter; the key lives in v, not in df,
  # so there is no key column to remove
  for (x in seq_along(keys)) {
    l[[x]] <- df[v == keys[x], ]
  }
  # name the list elements
  names(l) <- tolower(keys)
  return(l)
}
#example call
subgroups(df, c("sex", "class"))
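If a named list of data frames is all you need, a base R sketch with split() may be enough. This assumes the names should be the lower-cased group values joined with an underscore, as in the question's male_honors example. The helper name split_by is made up for illustration.

```r
df <- data.frame(
  name  = c("Peter Doe", "John Gary", "Elsa Johnson", "Mary Poppins", "Jesse Bogart"),
  sex   = c("Male", "Male", "Female", "Female", "Male"),
  class = c("Honors", "Core", "Core", "Honors", "Honors"),
  grade = c("A", "A", "A", "B", "C")
)

# split_by() is a hypothetical helper: it builds one combined factor from the
# grouping columns and splits the data frame on it.
split_by <- function(df, groups) {
  key <- interaction(df[groups], sep = "_", drop = TRUE)
  out <- split(df, key)
  names(out) <- tolower(names(out))
  out
}

res <- split_by(df, c("sex", "class"))
names(res)
res$male_honors
```

Because drop = TRUE is set, combinations that do not occur in the data produce no empty list elements.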

R purrr:map on a grouped/nested tibble

I would like to apply a function across columns of a nested grouped tibble as in the example below.
library(tidyverse)
df <- swiss %>%
group_by(Catholic > 20) %>%
nest()
Which results in a tibble that looks like:
> df
# A tibble: 2 x 2
# Groups: Catholic > 20 [2]
`Catholic > 20` data
<lgl> <list>
1 FALSE <tibble [26 × 6]>
2 TRUE <tibble [21 × 6]>
Now I make some function to build a model
fit <- function(df, modL = NA){
if (modL == 1) {fit <- lm(Fertility ~ Education, data = df)}
if (modL == 2) {fit <- lm(Fertility ~ Education + Examination, data = df)}
fit
}
Now I map that model to columns of the grouped data and make two new variables to store the model fits.
df <- df %>%
mutate(model1 = map(data, fit, modL = 1)) %>%
mutate(model2 = map(data, fit, modL = 2))
Which produces a tibble with two new columns that contain the model fits
> df
# A tibble: 2 x 4
# Groups: Catholic > 20 [2]
`Catholic > 20` data model1 model2
<lgl> <list> <list> <list>
1 FALSE <tibble [26 × 6]> <lm> <lm>
2 TRUE <tibble [21 × 6]> <lm> <lm>
What I want to achieve is a purrr-type map function that does the same thing as the following code.
anova(df$model1[[1]], df$model2[[1]])
anova(df$model1[[2]], df$model2[[2]])
I thought the following code would work, but it does not.
map(df[,3:4], anova)
Gurus, how do I map a function across columns of a nested and grouped dataset to give one result per row using the columns of that row as input?
Brant
df %>%
mutate(anova = map2(model1, model2, ~ anova(.x,.y)))%>%
mutate(pvalue = map_dbl(anova, ~.$`Pr(>F)`[2]))
I think this is what you want; please clarify if not. The second mutate pulls out the p-value of the anova for each pairwise comparison.
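The map(df[,3:4], anova) attempt does not work because it iterates over the two columns rather than over the rows. Here is a self-contained sketch of the row-wise version using map2. It assumes inline lm() calls in place of the fit() helper, and the column names high_catholic and comparison are made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

models <- swiss %>%
  mutate(high_catholic = Catholic > 20) %>%
  nest(data = -high_catholic) %>%           # one row per group, ungrouped
  mutate(
    model1 = map(data, ~ lm(Fertility ~ Education, data = .x)),
    model2 = map(data, ~ lm(Fertility ~ Education + Examination, data = .x)),
    # map2 pairs model1[[i]] with model2[[i]], i.e. one anova per row
    comparison = map2(model1, model2, ~ anova(.x, .y)),
    # the second row of the anova table holds the p-value of the comparison
    pvalue = map_dbl(comparison, ~ .x[["Pr(>F)"]][2])
  )

select(models, high_catholic, pvalue)
```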

Running multiple Cox-PH models with tidyr

I have a regular Surv object from the survival package;
library(survival)
s <- Surv(sample(100:150, 5), sample(c(TRUE, FALSE), 5, replace = TRUE))
And a matrix of multiple variables;
df <- data.frame(var1 = rnorm(5),
var2 = rnorm(5),
var3 = rnorm(5))
I need to fit a Cox-PH model for each variable separately. My code currently uses a loop as follows:
for (v in colnames(df)) {
coxph(s ~ df[[v]])
}
Of course, in reality there are thousands of variables and this process takes a bit. I wanted to follow the answer given here to try and do it all with tidyr but I'm kinda stumped because the predictand isn't a factor, it's a survival object, so I don't quite know how to handle it as part of a tibble.
Assuming your response is s for the survival model, you can use a nested dataframe similar to the answer you link to, then map the model to the different variables:
library(tidyverse)
library(survival)  # coxph()
df_nested <- df %>% pivot_longer(cols = var1:var3) %>% group_by(name) %>% nest()
surv_model <- function(df) {
coxph(s ~ df$value)
}
df_nested <- df_nested %>% mutate(model = map(data, surv_model))
df_nested
# A tibble: 3 x 3
# Groups: name [3]
name data model
<chr> <list> <list>
1 var1 <tibble [5 x 1]> <coxph>
2 var2 <tibble [5 x 1]> <coxph>
3 var3 <tibble [5 x 1]> <coxph>
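If the reshaping step is not needed for anything else, a plain lapply over the columns gives the same one-model-per-variable result. This is only a sketch: it uses simulated data with 50 observations and a fixed seed (both are assumptions, chosen so the fits are stable), and the names models and hazard_ratios are made up.

```r
library(survival)

set.seed(42)
n <- 50
# simulated survival times with alternating event indicators (illustrative only)
s <- Surv(time  = sample(100:150, n, replace = TRUE),
          event = rep(c(TRUE, FALSE), length.out = n))
df <- data.frame(var1 = rnorm(n),
                 var2 = rnorm(n),
                 var3 = rnorm(n))

# one Cox-PH model per column; the list keeps the column names
models <- lapply(df, function(v) coxph(s ~ v))

# hazard ratio of the single covariate in each model
hazard_ratios <- vapply(models, function(m) unname(exp(coef(m))), numeric(1))
hazard_ratios
```

With thousands of variables, the list returned by lapply can be turned into a tibble column afterwards if you still want the nested layout.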

subsetting a list column of integer matrices

Background
I've gotten myself into a situation where one column in a tibble/dataframe consists of a list of integer matrices which have zero or more rows and exactly 2 columns. This column happens to be the output of a stringr::str_locate_all() invocation, so I expect this is a common scenario.
What I would like to do is to select only one of the columns of the integer matrices and then unnest the dataframe, but I am getting confused about how to do this properly.
Example
Here's an example (I have to create it manually because dpasta() doesn't seem to work with list column tibbles). In any case, my starting point, is the tibble mydf:
library(tidyverse)
m1 <- matrix( c(761,784), nrow=1,ncol=2, dimnames = list(c(),c("start","end")) )
m2 <- matrix( integer(0), nrow=0,ncol=2, dimnames = list(c(),c("start","end")) )
m3 <- matrix( c(1001,2300,1010,2310), nrow=2,ncol=2, dimnames = list(c(),c("start","end")) )
mydf <- tibble( item = c("a","b","c"), pos = list(m1,m2,m3))
Below is what that looks like in the rstudio viewer. It's kind of misleading because it suggests that the pos rows are just vectors of integers. They're actually nx2 matrices and there isn't any cue that indicates it's more complex. It caused me some confusion, but that's beside the point now.
What I would like to do is end up with an unnested tibble where the 1st column, "start", is selected. The desired output would look like this (after unnesting):
mydf_desired <- tibble( item = c("a","c","c"), start_pos = c(761,1001,2300))
Note that the first row in mydf had only one row in its pos matrix, so it has one row in the desired result. The row with item="b" had a 0x2 matrix, so it doesn't appear (but it would have been OK if it appeared as an NA too). The row with item="c" had two rows in the pos matrix, so it has two rows in the desired result.
What I tried
This seems simple enough, I've unnested list columns before. The only twist here is that I have to first select the "start" column and then unnest, right? I just map the pos list column to [,1] to pick off the 1st column (the "start" column). And then it should be a matter of unnesting...
mydf_desired <- mydf %>%
mutate(start_pos = map(pos, ~ .[,1])) %>%
unnest()
#> Error in vec_rbind(!!!x, .ptype = ptype): Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
#> Warning: `cols` is now required.
#> Please use `cols = c(pos, start_pos)`
No idea what "value should have been recycled to fit x" actually means, but it's also giving me a warning about not giving cols in unnest(). The suspicion is now something about what I am giving unnest().
If I omit unnest() I don't get that error...
mydf_desired <- mydf %>%
mutate(start_pos = map(pos, ~ .[,1]))
And the output looks like this...
That sort of looks OK, I notice there's still a pos entry for item=b of integer(0). But even if I omit that row, I get the same error when I try to unnest().
Here's where I am stumped. Why can't I just unnest() this tibble? What is the meaning of the value should have been recycled to fit x error?
One option is to filter out the rows with empty matrices, then map over the list elements to extract the column from each matrix, and use unnest_longer:
library(dplyr)
library(purrr)
library(tidyr)  # unnest_longer(), unnest()
mydf %>%
filter(lengths(pos) > 0) %>%
transmute(item, start_pos = map(pos, ~ as.vector(.x[,1]))) %>%
unnest_longer(c(start_pos))
# A tibble: 3 x 2
# item start_pos
# <chr> <dbl>
#1 a 761
#2 c 1001
#3 c 2300
We can also avoid the filter step by converting each extracted column to a tibble:
mydf %>%
transmute(item, pos = map(pos, ~ .x[,1] %>%
tibble(start_pos = .))) %>%
unnest(c(pos))
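The error occurs because unnest(), when called without a cols argument, tries to unnest every list column, including pos. You can specify explicitly which columns you want to unnest to avoid the error.
library(dplyr)
library(purrr)
library(tidyr)  # unnest(), unnest_longer()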
mydf %>% mutate(start_pos = map(pos, ~.[, 1])) %>% unnest(start_pos)
# A tibble: 3 x 3
# item pos start_pos
# <chr> <list> <dbl>
#1 a <dbl[,2] [1 × 2]> 761
#2 c <dbl[,2] [2 × 2]> 1001
#3 c <dbl[,2] [2 × 2]> 2300
If you want NA for "b" item you can use unnest_longer
mydf %>%
mutate(start_pos = map(pos, ~.[, 1])) %>%
unnest_longer(start_pos, indices_include = FALSE)
# A tibble: 4 x 3
# item pos start_pos
# <chr> <list> <dbl>
#1 a <dbl[,2] [1 × 2]> 761
#2 b <int[,2] [0 × 2]> NA
#3 c <dbl[,2] [2 × 2]> 1001
#4 c <dbl[,2] [2 × 2]> 2300
Or unnest with keep_empty = TRUE.
mydf %>%
mutate(start_pos = map(pos, ~.[, 1])) %>%
unnest(start_pos, keep_empty = TRUE)
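Putting the pieces together, here is a self-contained sketch that rebuilds mydf, extracts the "start" column by name, and unnests only that column. It assumes you are happy to drop pos and to keep item "b" as an NA. The result name starts is made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

m1 <- matrix(c(761, 784), nrow = 1, ncol = 2,
             dimnames = list(c(), c("start", "end")))
m2 <- matrix(integer(0), nrow = 0, ncol = 2,
             dimnames = list(c(), c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2,
             dimnames = list(c(), c("start", "end")))
mydf <- tibble(item = c("a", "b", "c"), pos = list(m1, m2, m3))

starts <- mydf %>%
  # pull the "start" column out of each matrix as a plain vector
  mutate(start_pos = map(pos, ~ as.vector(.x[, "start"]))) %>%
  select(item, start_pos) %>%
  # name the column to unnest; keep_empty = TRUE keeps "b" as NA
  unnest(start_pos, keep_empty = TRUE)

starts
```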

Resources