rsample vfold_cv function not accepting .y parameter from purrr::map2 - r

I'm trying to create nested cross-validations using the rsample package, and I use purrr::map2 to create them multiple times, with a different number of folds each time, as dictated by the v parameter. However, vfold_cv does not accept the value I pass to v, and instead I get this error: Error: v must be a single integer.
In the reprex below, I'm simulating the situation using the mtcars data, by creating a cross-validation for each value of cyl. Replacing .y with a number works, but I need the parameter to vary with each cylinder group by using the n column.
library(dplyr) # needed for %>%, select(), group_by() and mutate()
library(purrr)
library(parsnip)
library(rsample)
library(tidyr)
data("mtcars")
nested <- mtcars %>%
select(cyl, disp:gear) %>%
group_by(cyl) %>%
nest(data = disp:gear) %>%
cbind(n = 2:4)
nested %>%
group_by(cyl) %>%
mutate(cv = map2(data, n,
~nested_cv(.x,
inside = vfold_cv(v = 10, repeats = 3),
outside = vfold_cv(v = .y))))
Error: `v` must be a single integer.

The problem is the vfold_cv call inside nested_cv. You can reproduce it with a plain function:
createNested = function(x,y){
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = vfold_cv(v = y))
}
createNested(nested$data[[1]],3)
Error in vfold_splits(data = data, v = v, strata = strata, breaks = breaks) :
object 'y' not found
So it cannot see the y variable (like your .y) inside the function. To get around this, I wrote a function that calls vfold_cv() for the outside resampling first and passes the result explicitly to nested_cv(). It's a few more lines of code, but it works:
createNested = function(x,y){
outside_cv = vfold_cv(x,v = y)
nested_cv(x,inside = vfold_cv(v = 10, repeats = 3),outside = outside_cv)
}
nested <- mtcars %>%
select(cyl, disp:gear) %>%
nest(data = disp:gear) %>%
mutate(n=2:4)
nested %>% mutate(cv = map2(data,n,.f=createNested))
# A tibble: 3 x 4
cyl data n cv
<dbl> <list> <int> <list>
1 6 <tibble [7 × 8]> 2 <tibble [2 × 3]>
2 4 <tibble [11 × 8]> 3 <tibble [3 × 3]>
3 8 <tibble [14 × 8]> 4 <tibble [4 × 3]>
Note that once you have nested the data, you don't need group_by().
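The "object 'y' not found" error can be reproduced without rsample. The sketch below is only a toy analogue: it assumes that nested_cv() captures its outside argument as an unevaluated call and evaluates that call in its own frame. outer_fn and make_spec are hypothetical stand-ins, not rsample functions.

```r
# Hypothetical stand-in for a function that re-evaluates an argument
# from its captured call (an assumption about the mechanism, not rsample code).
outer_fn <- function(data, spec) {
  spec_call <- match.call()[["spec"]]
  if (is.call(spec_call)) {
    eval(spec_call)  # evaluated in outer_fn's frame, where the caller's `y` is not visible
  } else {
    spec             # an already computed object is used as-is
  }
}

# Hypothetical stand-in for vfold_cv(v = ...)
make_spec <- function(v) v

# Mirrors the failing pattern: `y` only exists inside the wrapper.
# (This assumes there is no global variable called `y`.)
wrapper_broken <- function(x, y) outer_fn(x, make_spec(v = y))

# Mirrors the fix: compute the object first, then pass it in
wrapper_fixed <- function(x, y) {
  spec_value <- make_spec(v = y)
  outer_fn(x, spec_value)
}

broken_msg <- tryCatch(wrapper_broken(1:3, 3), error = function(e) conditionMessage(e))
fixed <- wrapper_fixed(1:3, 3)

print(broken_msg)
print(fixed)
```

The same reasoning would explain why a purrr lambda's .y is not found: it only exists inside the lambda, not where the captured call ends up being evaluated.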

Related

Unable to access nested data elements inside mutate

I am trying to understand why the following code doesn't work. My understanding is that it will take data$Sepal.Length (an element within the nested data column) and apply the function sum to that vector.
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data$Sepal.Length, sum, na.rm = TRUE))
print(df)
But this throws the error: Total.Sepal.Length must be size 3 or 1, not 0. The following code works by using an anonymous function, which is how the nested data is usually accessed:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, function(x) sum(x$Sepal.Length, na.rm = TRUE)))
print(df)
I am trying to understand why the previous code didn't work even though I am correctly passing arguments to mutate and map.
You should do this:
df <- iris %>%
nest(-Species) %>%
mutate(Total.Sepal.Length = map_dbl(data, ~sum(.x$Sepal.Length, na.rm = TRUE)))
Two things: any reason you're not using group_by?
Second: your initial mutate is trying to perform:
map_dbl(df$data$Sepal.Length, sum, na.rm = TRUE)
This returns an empty result because df$data$Sepal.Length is NULL: data is a list of tibbles, so you have to go into each list element to reach its columns (for example, df$data[[1]]$Sepal.Length works).
Output:
# A tibble: 3 × 3
Species data Total.Sepal.Length
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 250.
2 versicolor <tibble [50 × 4]> 297.
3 virginica <tibble [50 × 4]> 329.
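The NULL can be checked directly. This is a minimal base R sketch in which split() plays the role of the nested list column (an assumption made to avoid loading tidyr); the behaviour of $ on a list of data frames is the same.

```r
# A list of three data frames, analogous to the nested `data` column
parts <- split(iris[, 1:4], iris$Species)

# `$` on the list looks for a list element named Sepal.Length, which does not exist
print(is.null(parts$Sepal.Length))

# Going into each element first gives the per-species totals
totals <- vapply(parts, function(d) sum(d$Sepal.Length, na.rm = TRUE), numeric(1))
print(totals)
```

With purrr, the vapply() call corresponds to map_dbl(data, ~sum(.x$Sepal.Length, na.rm = TRUE)) from the answer.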

How to run code on each pair of rows, treated as a group, in a data frame in R?

I would like to check whether an image includes a certain color range. I have the following data, the code for the color groups, and the picture:
library(magick)
library(scatterplot3d)
library(colordistance)
image <- image_read("image.jpg")
colorpalette <- data.frame(group = c(1,1,2,2,3,3,4,4),
A = c(0.25168911, 0.22669197, 0.22669197, 0.21293378, 0.21293378, 0.19612549, 0.19612549, 0.17940884),
B = c(0.1477321049, 0.1738456673, 0.1738456673, 0.1149389567, 0.1149389567, 0.1365333333, 0.1365333333, 0.0737957272),
C = c(0.0887786440, 0.1442915876, 0.1442915876, 0.0545024047, 0.0545024047, 0.1020078431, 0.1020078431, 0.0139303483))
upper1 <- c(0.22669197, 0.1738456673, 0.1442915876)
lower1 <- c(0.21293378, 0.1149389567, 0.0545024047)
group1 <- countColors("image.jpg", color.range = "rectangular", upper = upper1, lower = lower1, bg.lower = NULL, plotting = TRUE, target.color = "orange")
I would like to generalize this and run the code for all color groups (1-4) in the predefined colorpalette.
The url for the picture:
https://www.google.com/imgres?imgurl=https%3A%2F%2Fdigitalsynopsis.com%2Fwp-content%2Fuploads%2F2016%2F05%2Fcinema-palettes-famous-movie-colors-star-wars-the-force-awakens.jpg&imgrefurl=https%3A%2F%2Fdigitalsynopsis.com%2Fdesign%2Fcinema-palettes-famous-movie-colors%2F&tbnid=x8RFFszidcs6-M&vet=12ahUKEwiY892Tj9f2AhUp1-AKHdjVBNsQMygAegQIARAd..i&docid=sZw2jvuqpGDZ2M&w=780&h=551&q=cinema-palettes-famous-movie-colors-star-wars-the-force-awakens.jpg&client=firefox-b-e&ved=2ahUKEwiY892Tj9f2AhUp1-AKHdjVBNsQMygAegQIARAd
I'm a little confused by your initial code as the online doc for colordistance suggests that countColors is not exported from the package. However, here's a proof of concept example to show you the sort of thing you will need to do.
My plan is to use the group_map function from the tidyverse. To do this, I need to reformat your colorpalette dataframe so that the input parameters to countColors (or any other function) are each contained in the columns of one row (or group) of the data frame. Your format won't work because you have the values we need in the rows of each column. So I need to pivot the dataframe.
library(tidyverse)
d <- colorpalette %>%
add_column(limit=rep(c("upper", "lower"), times=4)) %>%
nest(data=c(A, B, C)) %>%
pivot_wider(names_from=limit, values_from=data)
d
# A tibble: 4 × 3
group upper lower
<dbl> <list> <list>
1 1 <tibble [1 × 3]> <tibble [1 × 3]>
2 2 <tibble [1 × 3]> <tibble [1 × 3]>
3 3 <tibble [1 × 3]> <tibble [1 × 3]>
4 4 <tibble [1 × 3]> <tibble [1 × 3]>
That looks a little odd, but the original data points are still there. For example:
d$upper[1]
[[1]]
# A tibble: 1 × 3
A B C
<dbl> <dbl> <dbl>
1 0.252 0.148 0.0888
As I mentioned, it seems that I can't run countColors. So, as a demonstration, I'll simply calculate the squared Euclidean distances between the 3-dimensional points defined by each row of upper and lower:
d %>%
rowwise() %>%
group_map(
function(.x, .y) {
y <- .x %>% unnest(
c(upper, lower),
names_repair=function(x) {
c("group", "A1", "B1", "C1", "A2", "B2", "C2")
}
)
list(delta=(y$A1 - y$A2)^2 + (y$B1 - y$B2)^2 + (y$C1 - y$C2)^2)
}
) %>%
bind_rows()
# A tibble: 4 × 1
delta
<dbl>
1 0.00439
2 0.0117
3 0.00301
4 0.0120
I'm not suggesting that's a sensible way of calculating Euclidean distances. I'm just illustrating a technique.
That suggests you might get what you want with something like
colorpalette %>%
add_column(limit=rep(c("upper", "lower"), times=4)) %>%
nest(data=c(A, B, C)) %>%
pivot_wider(names_from=limit, values_from=data) %>%
rowwise() %>%
group_map(
function(.x, .y) {
countColors(
"image.jpg",
color.range = "rectangular",
upper = .x$upper,
lower = .x$lower,
bg.lower = NULL,
plotting = TRUE,
target.color = "orange"
)
}
)
But I can't be sure because I can't test the code.
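A base R alternative is to build plain numeric upper and lower vectors per group, which is the form the question passes to countColors. This sketch assumes, as the answer does, that the first row of each group is the upper bound and the second is the lower bound. The countColors call is left commented out because it is untested here.

```r
colorpalette <- data.frame(
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  A = c(0.25168911, 0.22669197, 0.22669197, 0.21293378, 0.21293378, 0.19612549, 0.19612549, 0.17940884),
  B = c(0.1477321049, 0.1738456673, 0.1738456673, 0.1149389567, 0.1149389567, 0.1365333333, 0.1365333333, 0.0737957272),
  C = c(0.0887786440, 0.1442915876, 0.1442915876, 0.0545024047, 0.0545024047, 0.1020078431, 0.1020078431, 0.0139303483)
)

# One list entry per group, each holding two named numeric vectors (A, B, C).
# Assumption: row 1 of a group is "upper", row 2 is "lower".
bounds <- lapply(split(colorpalette, colorpalette$group), function(g) {
  list(
    upper = unlist(g[1, c("A", "B", "C")]),
    lower = unlist(g[2, c("A", "B", "C")])
  )
})

# Same demonstration quantity as in the answer: squared distance per group
sq_dist <- vapply(bounds, function(b) sum((b$upper - b$lower)^2), numeric(1))
print(sq_dist)

# Untested sketch of the intended call, one result per group:
# results <- lapply(bounds, function(b) {
#   countColors("image.jpg", color.range = "rectangular",
#               upper = b$upper, lower = b$lower,
#               bg.lower = NULL, plotting = TRUE, target.color = "orange")
# })
```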

R Use map2 to iterate over columns within a list of data frames to fit statistical models

I'm trying to figure out a purrr approach to iteratively map over columns within a list of data frames to fit univariate GLMs. Using map2, the first element, .x, would be the three pred columns, and the second element, .y, would be the list of data frames (or vice-versa). map2 seems to be able to do this, but I recognize that I need to cross the .x and .y elements first, so I use tidyr::crossing first to do this. From here, I am unsure how to properly reference the columns to select within the data frames. Example code is below:
#Sample data
set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
pred2 = sample(40:80, size = 1000, replace = TRUE),
pred3 = sample(40:80, size = 1000, replace = TRUE),
resp = sample(100:200, size = 1000, replace = TRUE),
group = sample(c('a','b','c'), size = 1000, replace = TRUE))
#Split into list
test_ls <- test_df %>%
group_by(group) %>%
{df_groups <<- .} %>%
group_split()
#Obtain keys and name list elements
group_keys <- df_groups %>%
group_keys() %>%
pull()
test_ls <- test_ls %>% setNames(nm = group_keys)
#Cross all combinations of pred columns and list element names
preds <- c('pred1','pred2','pred3')
map_keys <- crossing(preds, group_keys)
#.y = list of data frames; iterate over data frames
#.x = three pred columns; iterate over columns
#Use purrr to fit glm of each .x columns within each of .y dfs
#Example structure - does not work
map2(.x, .y, .f = ~glm(resp ~ .x, data = .y))
#Workaround that does work
lapply(test_ls, function(x) {
x %>%
select(pred1, pred2, pred3) %>%
map(.f = ~glm(resp ~ .x, data = x))
})
There's something I'm missing, and I can't seem to figure it out. I've gotten a variety of errors with a few approaches, but I think it's coming down to not properly referencing the .x columns within the .y data frames. My approaches don't seem to recognize that .x is a column within .y. The workaround does the trick, but I'd prefer to avoid using both lapply and map.
My suggestion would be to NOT split the data before fitting models, since you are considering all possible combinations of variables that are already available directly in your original dataset. Instead, consider converting the original data frame to the "long" format, and then grouping by the necessary variables:
test_df %>% gather( pred, value, pred1:pred3 ) %>%
nest( -c(group, pred) ) %>%
mutate( models = map(data, ~glm(resp ~ value, data=.x)) )
# # A tibble: 9 x 4
# group pred data models
# <chr> <chr> <list> <list>
# 1 b pred1 <tibble [340 x 2]> <glm>
# 2 a pred1 <tibble [317 x 2]> <glm>
# 3 c pred1 <tibble [343 x 2]> <glm>
# 4 b pred2 <tibble [340 x 2]> <glm>
# 5 a pred2 <tibble [317 x 2]> <glm>
# 6 c pred2 <tibble [343 x 2]> <glm>
# 7 b pred3 <tibble [340 x 2]> <glm>
# 8 a pred3 <tibble [317 x 2]> <glm>
# 9 c pred3 <tibble [343 x 2]> <glm>
This substantially simplifies your code, and you can now split the result, if you still need those models in a list.
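If you do want to keep the split list and iterate over the crossed keys, one option is to pass the column name rather than the column and build the formula with reformulate(). This is a sketch in base R, where Map() stands in for purrr::map2 and expand.grid() for tidyr::crossing; the list names such as a_pred1 are my own choice.

```r
set.seed(100)
test_df <- data.frame(
  pred1 = sample(40:80, size = 1000, replace = TRUE),
  pred2 = sample(40:80, size = 1000, replace = TRUE),
  pred3 = sample(40:80, size = 1000, replace = TRUE),
  resp  = sample(100:200, size = 1000, replace = TRUE),
  group = sample(c("a", "b", "c"), size = 1000, replace = TRUE)
)

# Named list of data frames, one per group
test_ls <- split(test_df, test_df$group)

# All combinations of predictor name and group name
map_keys <- expand.grid(
  pred = c("pred1", "pred2", "pred3"),
  group = names(test_ls),
  stringsAsFactors = FALSE
)

# Iterate over the two key columns in parallel; the formula is built from the
# predictor *name*, and the data frame is looked up by the group *name*
models <- Map(
  function(pred, group) glm(reformulate(pred, response = "resp"), data = test_ls[[group]]),
  map_keys$pred,
  map_keys$group
)
names(models) <- paste(map_keys$group, map_keys$pred, sep = "_")

print(names(models))
```

The purrr version would have the same shape: map2(map_keys$pred, map_keys$group, ~glm(reformulate(.x, response = "resp"), data = test_ls[[.y]])).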

run model for each line of model parameters (meta) data.frame

In the spirit of purrr, broom, modelr, I am trying to create a "meta" data.frame in which each row denotes the dataset (d) and the model parameters (yvar, xvars, FEvars). For instance:
iris2 <- iris %>% mutate(Sepal.Length=Sepal.Length^2)
meta <- data.frame(n=1:4,
yvar = c('Sepal.Length','Sepal.Length','Sepal.Length','Sepal.Length'),
xvars= I(list(c('Sepal.Width'),
c('Sepal.Width','Petal.Length'),
c('Sepal.Width'),
c('Sepal.Width','Petal.Length'))),
data= I(list(iris,iris,iris2,iris2)) )
Now, I would like to run a model for each row of "meta", and then add a list column "model" with the model output object. To run the model I use an auxiliary function that takes a dataset, a y variable and a vector of x variables:
OLS_help <- function(d,y,xvars){
paste(y, paste(xvars, collapse=" + "), sep=" ~ ") %>% as.formula %>%
lm(d)
}
y <- 'Sepal.Length'
xvars <- c('Sepal.Width','Petal.Length')
OLS_help(iris,y,xvars)
How can I execute OLS_help for all the rows of meta and add the output of OLS_help as a list column in meta? I tried the following code, but it did not work:
meta %>% mutate(model = map2(d,yvar,xvars,OLS_help) )
Error: Can't convert a `AsIs` object to function
Call `rlang::last_error()` to see a backtrace
Note: The solution for the case with only the nested "data" list column (covered in Hadley's book) is:
by_country <- gapminder %>% group_by(country, continent) %>% nest()
country_model <- function(df) { lm(lifeExp ~ year, data = df) }
by_country <- by_country %>% mutate(model = map(data, country_model))
We can use pmap in the following way:
df <- meta %>%
as_tibble() %>%
mutate_if(is.factor, as.character) %>%
mutate(fit = pmap(
list(yvar, xvars, data),
function(y, x, df) lm(reformulate(x, response = y), data = df)))
## A tibble: 4 x 5
# n yvar xvars data fit
# <int> <chr> <I<list>> <I<list>> <list>
#1 1 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#2 2 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
#3 3 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#4 4 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
Explanation: pmap iterates over multiple arguments simultaneously (similar to base R's Map); here we simultaneously loop through the entries in columns yvar, xvars and data, then use reformulate to construct the formula to be used within lm. We store the lm fit object in column fit.
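For comparison, here is a small sketch of the same idea with base R's Map, which also shows what reformulate returns. The object names (datasets, fits, n_coef) are illustrative only.

```r
# reformulate() builds a formula from character vectors
f <- reformulate(c("Sepal.Width", "Petal.Length"), response = "Sepal.Length")
print(f)

# The three "columns" of the meta table, as plain vectors and lists
iris2 <- transform(iris, Sepal.Length = Sepal.Length^2)
yvar <- rep("Sepal.Length", 4)
xvars <- list(
  "Sepal.Width",
  c("Sepal.Width", "Petal.Length"),
  "Sepal.Width",
  c("Sepal.Width", "Petal.Length")
)
datasets <- list(iris, iris, iris2, iris2)

# Map() walks the three inputs in parallel, like pmap() over a list of columns
fits <- Map(
  function(y, x, df) lm(reformulate(x, response = y), data = df),
  yvar, xvars, datasets
)

# Number of estimated coefficients per model (intercept included)
n_coef <- vapply(fits, function(m) length(coef(m)), integer(1))
print(unname(n_coef))
```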

R - Adding an extrapolated (lm) value to a matrix of observations

I am trying to add a set of extrapolated "observations" to a matrix in R. I know how to do this using normal programming techniques (read: a bunch of nested loops and functions), but I feel this must be possible in a much cleaner way by using built-in R functionality.
The code below illustrates the point, and where it breaks down
Many thanks in advance for your help!
With kind regards
Sylvain
library(dplyr)
# The idea is that I have a table of observations for e.g. x = 5, 6, 7, 8, 9 and 10. The observations (two series in this example)
# conform fairly well to 2nd-order polynomials.
# Now, I want to add an extrapolated value to this table (e.g. x = 4). I know how to do this programmatically
# but I feel there must be a cleaner solution.
#generate dummy data table
x <- 5:10
myData <- tibble(x, a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01), b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01) )
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
fitted_models <- myDataKeyFormat %>% group_by(someLabel) %>% do(model = lm(myObservation ~ poly(x,2), data = .))
myExtrapolatedDataPointx <- tibble(x = 4)
#Add the x=4 field
fitted_points <- fitted_models %>% group_by(someLabel) %>% do(predict(.$model,myExtrapolatedDataPointx)) #R really doesnt like this bit
#append the fitted_points to the myDataKeyFormat
myDataKeyFormatWithExtrapolation <- union(myDataKeyFormat,fitted_points)
#use spread to
myDataWithExtrapolation <- myDataKeyFormatWithExtrapolation %>% spread(someLabel,myObservation)
Here is a solution in the tidyverse that uses purrr to fit the different models. The idea is to nest the data (using tidyr::nest) and then use purrr::map to train a model per group. I will then add new values and compute the predictions using modelr::add_predictions. This keeps everything in one place: training data, models, new data and predictions, by your variable someLabel. I also give you a way to visualise the data.
You can check R for Data Science by Hadley Wickham & Garrett Grolemund, and especially the part about models for more information.
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(modelr)
library(ggplot2)
set.seed(1) # For reproducibility
x <- 5:10
myData <- tibble(x,
a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01),
b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01))
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
myModels <- myDataKeyFormat %>%
nest(-someLabel) %>%
mutate(model = map(data, ~lm(myObservation ~ poly(x,2), data = .x)))
Here is the result at this point: you have a model for each value of someLabel.
# A tibble: 2 × 3
someLabel data model
<chr> <list> <list>
1 a <tibble [6 × 2]> <S3: lm>
2 b <tibble [6 × 2]> <S3: lm>
I'll add some data points in a new column (map is used to create a tibble for each row of the data frame).
# New data
new_data <- myModels %>%
mutate(new = map(data, ~tibble(x = c(3, 4, 11, 12))))
I add the predictions: add_predictions takes a data frame and a model as arguments, so I use map2 to map over the new data and the models.
fitted_models <- new_data %>%
mutate(new = map2(new, model, ~add_predictions(.x, .y)))
fitted_models
# A tibble: 2 × 4
someLabel data model new
<chr> <list> <list> <list>
1 a <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
2 b <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
There you go: you have for each label the data and model trained on this data, and the new data with predictions.
In order to plot it, I use unnest to take the data back to the data frame, and I bind the rows to have the "old" data and the new values together.
my_points <- bind_rows(unnest(fitted_models, data),
unnest(fitted_models, new))
ggplot(my_points)+
geom_point(aes(x = x, y = myObservation), color = "black") +
geom_point(aes(x = x, y = pred), color = "red")+
facet_wrap(~someLabel)
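If you only need the extrapolated row appended to the original wide table, a compact base R sketch is shown below. It assumes the noise term can be dropped, so the polynomials are fitted exactly and the result is easy to check by hand; preds and extended are illustrative names.

```r
x <- 5:10
# Noise omitted on purpose (assumption) so the fit is exact
myData <- data.frame(x = x, a = 2 * x^2 + 3 * x + 4, b = 3 * x^2 + 4 * x + 5)
new_x <- data.frame(x = 4)

# Fit one 2nd-order polynomial per observation column and predict at x = 4
preds <- sapply(c("a", "b"), function(label) {
  fit <- lm(reformulate("poly(x, 2)", response = label), data = myData)
  unname(predict(fit, newdata = new_x))
})

# Prepend the extrapolated row to the original table
extended <- rbind(data.frame(x = 4, a = preds[["a"]], b = preds[["b"]]), myData)
print(extended)
```

With noisy data the predicted values would only approximate the underlying polynomials, which is what the modelr version above handles in the same way.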
