I would like to apply multiple functions over multiple columns in a data frame. I've worked out how to apply one function to all the columns in a data frame, but I'm stumped trying to use invoke_map to apply a list of functions. I've played around with quo and enquo, but I don't have the right combination (or grasp yet, I guess).
Toy example set up:
library(tidyverse)
library(RcppRoll)
library(purrr)
ID <- letters[1:26]
var1 <- sample(1:100, 26, replace= T)
var2 <- sample(100:200, 26, replace= T)
temp <- cbind(ID, var1, var2) %>% data.frame()
This works to apply one function:
roll.var <- function(name) {
label <- enquo(name)
map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)) %>%
rename_all(funs(paste0(., '.', (!!label)))) %>%
cbind(temp, .)
}
test <- roll.var(roll_sdr)
Here's my attempt to use invoke_map to apply a list of functions to the chosen columns:
roll.func <- c("roll_sdr", "roll_varr")
invoke_map(roll.var, .x= roll.func)
And it returns: Error in name(.x, 51, fill = NA) : could not find function "name"
The second issue is that in the resulting 'test' data frame from the first example, the first variable is named incorrectly (var1.~) whereas the second one is named as I anticipated (var2.roll_sdr). Can anyone tell me why?
Any solution and/or education would be much appreciated!
EDIT:
Incorporating Mike's explanation that invoke_map needs a list of lists the complete code to produce what I want is:
library(tidyverse)
library(purrr)
library(RcppRoll)
library(plyr)
options(stringsAsFactors= F)
ID <- letters[1:26] %>% data.frame(ID= .)
var1 <- sample(1:100, 26, replace= T) %>% data.frame(var1= .)
var2 <- sample(100:200, 26, replace= T) %>% data.frame(var2= .)
temp <- bind_cols(ID, var1, var2)
roll.func <- list(list(roll_sdr, 'roll_sdr'),
list(roll_varr, 'roll_varr'))
roll.var <- function(name, vname) {
map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)) %>%
rename_all(funs(paste0(., '.', vname))) %>%
cbind(temp, .)
}
df <- invoke_map(roll.var, roll.func)
## plyr statement works much faster than purrr::reduce
df2 <- join_all(df, by= c('ID', 'var1', 'var2'))
Is it possible to add a statement in the roll.var function so that the vname doesn't have to be repeated in roll.func, i.e. quote the name once inside the function? I've played around with enquo and the rlang package, but I haven't found the right combination. Ideally, this:
roll.func <- list(list(roll_sdr),
list(roll_varr))
would work both as a function call, and in appending the label to the variable name.
There are two problems with this.
The first problem is with the construct map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)). When roll.var is given the string "roll_sdr", name holds a character vector rather than a function, so R cannot find a function called name to call. You will find it much easier in these cases to pass the function object itself rather than its name; that is, don't put it in quotes.
The second problem is that your construct roll.func isn't correct. Read the docs for invoke_map carefully - that argument must be a list. Each element of the list must be a list, the elements of which will be passed as arguments to the function. So, this simple example works:
library(purrr)
var1 <- sample(1:100, 26, replace= T) %>% as.numeric
var2 <- sample(100:200, 26, replace= T) %>% as.numeric
temp <- cbind(var1, var2) %>% data.frame()
simple_example <- function(func) map(temp, func)
roll.func <- list(
list(mean),
list(sum)
)
invoke_map(simple_example, roll.func)
#> [[1]]
#> [[1]]$var1
#> [1] 53.42308
#>
#> [[1]]$var2
#> [1] 140.6154
#>
#>
#> [[2]]
#> [[2]]$var1
#> [1] 1389
#>
#> [[2]]$var2
#> [1] 3656
and you should be able to adapt that to do what you need.
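As a rough sketch of one way to adapt this so that each function name is written only once, a named list can be combined with purrr::imap, which passes both the element and its name to the callback. The helper apply_and_label and the choice of cumsum and cummax (standing in for roll_sdr and roll_varr) are assumptions made for illustration, not part of the original code. Note also that invoke_map is deprecated in recent purrr releases, so this version avoids it.

```r
library(purrr)

set.seed(42)
temp <- data.frame(
  ID   = letters[1:26],
  var1 = sample(1:100, 26, replace = TRUE),
  var2 = sample(100:200, 26, replace = TRUE)
)

# Named list: each name is typed once and reused as the column suffix.
# cumsum/cummax are stand-ins for the rolling functions in the question.
funs_to_apply <- list(cumsum = cumsum, cummax = cummax)

# Hypothetical helper: apply f to the numeric columns and suffix the names.
apply_and_label <- function(f, label) {
  out <- as.data.frame(lapply(temp[, c("var1", "var2")], f))
  names(out) <- paste0(names(out), ".", label)
  out
}

# imap() calls apply_and_label(function, name) for every list element.
pieces <- imap(funs_to_apply, apply_and_label)
result <- cbind(temp, do.call(cbind, unname(pieces)))

names(result)
```

Because the pieces are bound column-wise in one step, no join on ID, var1 and var2 is needed afterwards.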
Related
I built a function to use it inside the purrr::map2 function and run it in two lists. When I run the function steps separately it works ok. But apparently in map2 it runs the first time (for the first elements of list .x[[1]] .y[[1]]) and then in the second round throws this error in the for function:
How can I find out why it's not working?
PS: It's hard to put an example of the data here because they are lists with very specific characteristics for this function. I'm sorry.
Follow the function:
df <- list()
build_HUW_raster <- function(.x, .y) {
list.time <- .x %>%
split(.$id) %>%
purrr::map(~list(t=as.matrix(.x$date),
xy=unname(as.matrix(.x[,c(22,23)])))
)
for(i in 1:50){
cat(i," ")
path=list.time[[i]]
ctmc=ctmcmove::path2ctmc(path$xy,path$t,r,method="LinearInterp")
df[[i]] <- as.data.frame(do.call(cbind, ctmc))
}
df <- df %>% purrr::map(~ group_by(., ec) %>%
summarise(rt = mean(rt)) %>%
arrange(desc(rt))
)
stacktime <- df %>% purrr::map(~ rename(., cell = ec)) %>%
map(~dplyr::left_join(cargo.grid, ., by="cell", copy=T)) %>%
map(~raster::rasterize(., r, field="rt", na.rm=F, background=0)) %>%
raster::stack()
stackprop <- .y %>%
split(.$id) %>%
purrr::map(~ raster::rasterize(., y = r,
field=.$proportion,
fun=function(x, ...)median(x))) %>%
raster::stack()
stack_huw <- raster::overlay(raster::calc(stacktime, fun=function(x)
ifelse(is.na(x), NA, x/sum(x, na.rm=T))), stackprop, fun=function(x,y)x*y
)
raster_mean <- raster::stackApply(stack_huw,
indices = rep(1,raster::nlayers(stack_huw)),
fun = "mean",
na.rm = F
)
}
result.list <- purrr::map2(.x = list1, .y = list2, .f = build_HUW_raster)
The cause is the element being looped over. [[ extracts the list element itself, and map then iterates over whatever that element contains: the individual values if it is a vector or matrix, or the columns if it is a data.frame, since those are its units. [, by contrast, returns a list of length one that still wraps the element:
list(1, 2, 3)[1]
[[1]]
[1] 1
vs
list(1, 2, 3)[[1]]
[1] 1
When the function applied inside map requires a specific structure (colSums, for example, requires a matrix or data.frame, i.e. an object with a dim attribute), it fails if we use [[, because it receives one column at a time:
> map(replicate(2, data.frame(col1 = 1:5, col2 = 6:10), simplify = FALSE)[[1]], colSums)
Error in .f(.x[[i]], ...) :
'x' must be an array of at least two dimensions
> map(replicate(2, data.frame(col1 = 1:5, col2 = 6:10), simplify = FALSE)[1], colSums)
[[1]]
col1 col2
15 40
Here, we may change the code to
purrr::map2(.x = list1[1], .y = list2[1], .f = build_HUW_raster)
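The difference can be checked on small inputs. The following is a minimal sketch with made-up data frames; both_data_frames is a hypothetical stand-in for build_HUW_raster that only reports what kind of object it received.

```r
library(purrr)

list1 <- list(data.frame(a = 1:3), data.frame(a = 4:6))
list2 <- list(data.frame(b = 1:3), data.frame(b = 7:9))

# Hypothetical stand-in for build_HUW_raster: reports what it was given.
both_data_frames <- function(x, y) is.data.frame(x) && is.data.frame(y)

# `[` keeps the list wrapper, so the function sees whole data frames.
with_single <- map2_lgl(list1[1], list2[1], both_data_frames)

# `[[` unwraps the data frame, so map2 iterates over its columns instead
# and the function receives plain integer vectors.
with_double <- map2_lgl(list1[[1]], list2[[1]], both_data_frames)

with_single
with_double
```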
I would like to create a for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish approach (various flavours of applying, mapping, walking ...)
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
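Applied to the counts from the question, the same idea might look like the sketch below. The data frame is made up for illustration, and the survey-based PREV95 step is left out because it depends on a design object that is not shown.

```r
# Made-up 0/1 data standing in for the 150 real columns.
df <- data.frame(
  var1 = c(0, 1, 1, 0, 1),
  var2 = c(1, 1, 1, 0, 0),
  var3 = c(0, 0, 0, 0, 1)
)

# vapply() runs the function on every column and checks that each result
# is a numeric vector of length 2; t() puts one variable per row.
counts <- t(vapply(
  df,
  function(x) c(N = sum(x == 0), n = sum(x == 1)),
  numeric(2)
))

counts
```

The result is a matrix with one row per variable and the columns N and n, so adding more columns to df needs no change to the code.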
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it, and a function eval() that does all the work. (Note that this name masks base R's eval() for the rest of the session, so you may prefer a different one.)
First, it keeps just the factors. Then it converts the factors to numeric variables so that the values can be summed as 0 and 1; summing the factors directly would be based on their internal codes 1 and 2. Within the function I create another function neg() that gives the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then I create the dataframes "n" (count of the positives), "N" (count of the negatives), and PREV95. I used pivot_longer to get the data into a long format so that each statistic ends up in its own column when they are merged.
Note that I had to leave PREV95 out because I do not have a 'design' object to pass to the function. I commented it out, but you can uncomment it to add it back in. I then used left_join to combine these dataframes and return "results"; again, the version that includes PREV95 is commented out. I think the logic for PREV95 should work, but I cannot check it without a 'design' object. The function eval() takes your original dataframe as input and returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
I am running a function using purrr::map that will return an error if the dataframe does not contain numeric data (i.e., na.omit does not return any valid rows). I discovered map_if, but it seems map_if returns .x if .p is false. Is there a way to return NA instead? This example should explain what I need:
library(openair)
library(tidyverse)
# Build test dataset
df1 <- mydata
df2 <- mydata
df2$no2 <- NA_real_
df3 <- mydata
dfx <- tibble(id = c(1, 2, 3), data = list(df1, df2, df3))
# polarPlot function will return error if dataframe does not contain numeric data (i.e., it only contains NAs)
polarPlot(df2, pollutant = "no2")
# Function to test length of dataframe (i.e., if 0 then everything is NAs)
check_length <- function(x) (x %>% select(ws, wd, "no2") %>% na.omit() %>% nrow()) > 0
check_length(df1)
check_length(df2)
# purrr::map (is there a way for map_if to return NA if length == 0?)
dfx %>% mutate(mynewvar = map_if(.x = data, check_length, ~ polarPlot(.x, pollutant = "no2")))
In other words, I would like mynewvar[[2]] to return NA.
@dylanjm I posted a reprex; I'm not sure whether you are able to see it. As you suggested, the possibly function is what I need.
possible_polarPlot <- possibly(polarPlot, otherwise = NA)
out <- dfx %>% mutate(mynewvar = map(.x = data, ~ possible_polarPlot(.x, pollutant = "no2")))
out$mynewvar[[2]] # Returns NA as I was looking for.
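For reference, here is a small sketch of both options without the openair dependency. mean_or_stop is a hypothetical stand-in for polarPlot that fails when its input contains only NAs. Besides possibly, map_if also accepts an .else function (available since purrr 0.3.0, as far as I know) that is applied where the predicate is FALSE.

```r
library(purrr)

# Hypothetical stand-in for polarPlot(): errors when there is no usable data.
mean_or_stop <- function(x) {
  if (all(is.na(x))) stop("no valid data")
  mean(x, na.rm = TRUE)
}

inputs <- list(c(1, 2, 3), c(NA_real_, NA_real_), c(4, 5, 6))

# Option 1: possibly() replaces any error with the fallback value.
safe_mean <- possibly(mean_or_stop, otherwise = NA_real_)
via_possibly <- map_dbl(inputs, safe_mean)

# Option 2: map_if() with .else returns NA where the predicate is FALSE,
# so the failing call is never attempted.
has_data <- function(x) any(!is.na(x))
via_map_if <- map_if(inputs, has_data, mean_or_stop,
                     .else = function(x) NA_real_)

via_possibly
```

possibly hides every error, including unexpected ones, whereas the map_if version only skips the inputs that the predicate rejects.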
I am trying to apply a function that takes multiple inputs (columns that vary depending on the problem at hand) to a list of data frames. I took the code below from this example: Map with Purrr multiple dataframes and have those modified dataframes as the output, and modified it to include another metric of my choosing ('choice'). This code, however, throws an error:
Error in .f(.x[[i]], ...) : unused argument (choice = "disp").
Ideally, I would like to create a grouped data frame (with group_by or split()) and apply a function over the different groups within it, but I have not been able to work this out. Hence I am looking at a list of data frames instead.
mtcars2 <- mtcars
#change one variable just to distinguish them
mtcars2$mpg <- mtcars2$mpg / 2
#create the list
dflist <- list(mtcars,mtcars2)
#then, a simple function example
my_fun <- function(x)
{x <- x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(choice))}
#then, using map, this works and prints the desired results
list_results <- map(dflist,my_fun, choice= "disp")
Three things to fix the code above:
Add choice as an argument in your function.
Make your function have an output by removing x <-
Use tidyeval to make the "choice" argument work.
The edited code thus looks like this:
my_fun <- function(x, choice)
{x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl),
`sum of choice` = sum(!!choice))}
list_results <- map(dflist, my_fun, choice = quo(disp))
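If the column should be passed as a string, as in the original call with choice = "disp", one alternative is the .data pronoun that dplyr provides. This is a sketch of that variant rather than a correction of the code above; the function name sum_cols and the output column names are made up for illustration.

```r
library(dplyr)
library(purrr)

# Same two data frames as in the question.
dflist <- list(mtcars, transform(mtcars, mpg = mpg / 2))

# Hypothetical variant of my_fun: `choice` is a column name given as a string
# and is looked up in the data with .data[[choice]].
sum_cols <- function(x, choice) {
  x %>%
    summarise(
      sum_mpg    = sum(mpg),
      sum_cyl    = sum(cyl),
      sum_choice = sum(.data[[choice]])
    )
}

list_results <- map(dflist, sum_cols, choice = "disp")
list_results[[1]]
```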
If you want to stay within a dataframe/tibble, then using nest to create list-columns might help.
mtcars2$group <- sample(c("a", "b", "c"), 32, replace = TRUE)
mtcars2 %>%
as_tibble() %>%
nest(data = -group) %>%
mutate(out = map(data, my_fun, quo(disp))) %>%
unnest(out)
Is there a "tidyverse" way to join a list of data.frames (a la full_join(), but for >2 data.frames)? I have a list of data.frames as a result of a call to map(). I've used Reduce() to do something like this before, but would like to merge them as part of a pipeline - just haven't found an elegant way to do that. Toy example:
library(tidyverse)
## Function to make a data.frame with an ID column and a random variable column with mean = df_mean
make.df <- function(df_mean){
data.frame(id = 1:50,
x = rnorm(n = 50, mean = df_mean))
}
## What I'd love:
my.dfs <- map(c(5, 10, 15), make.df) #%>%
# <<some magical function that will full_join() on a list of data frames?>>
## Gives me the result I want, but inelegant
my.dfs.joined <- full_join(my.dfs[[1]], my.dfs[[2]], by = 'id') %>%
full_join(my.dfs[[3]], by = 'id')
## Kind of what I want, but I want to merge, not bind
my.dfs.bound <- map(c(5, 10, 15), make.df) %>%
bind_cols()
We can use Reduce
set.seed(1453)
r1 <- map(c(5, 10, 15), make.df) %>%
Reduce(function(...) full_join(..., by = "id"), .)
Or this can be done with reduce from purrr
library(purrr)
set.seed(1453)
r2 <- map(c(5, 10, 15), make.df) %>%
reduce(full_join, by = "id")
identical(r1, r2)
#[1] TRUE
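One thing to watch for: because every data frame has a column called x, the joined result ends up with suffixed names such as x.x and x.y. A possible way around this, sketched below under the assumption that a numbered name per input is acceptable, is to rename the column before reducing. The deterministic make_df is a made-up replacement for make.df so that the result is reproducible.

```r
library(dplyr)
library(purrr)

# Deterministic stand-in for make.df(), for illustration only.
make_df <- function(df_mean) data.frame(id = 1:5, x = df_mean + (1:5))

my_dfs <- map(c(5, 10, 15), make_df)

# imap() on an unnamed list passes the position as the second argument,
# which is used here to give each x column a unique name.
my_dfs_named <- imap(my_dfs, function(d, i) {
  names(d)[names(d) == "x"] <- paste0("x", i)
  d
})

joined <- reduce(my_dfs_named, full_join, by = "id")
names(joined)
```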