Alter Variables in List with purrr - r

I have three datasets a, b, c with identical variable names. I want to check whether these variables contain missing/invalid values.
I have a checking function check_variables() that checks missing or invalid values (for example the function could just be is.na).
While I could apply my checking function check_variables() explicitly to each of these datasets, like:
check.output = list(
a = check_variables(a),
b = check_variables(b),
c = check_variables(c)
)
purrr offers a nice all-in-one-step solution for this problem:
list(a,b,c) %>%
map(~ .x %>% check_variables())
But this step only maps check_variables() to the elements within each dataset in the list. Instead, I want check_variables() to be mapped to each dataset as a whole. Is there a way to effectively map functions to the datasets in the list instead of the elements within each dataset?

If you want to modify the independent variables you can pass a list of variable names to edit then use get and assign to access and modify them.
library(purrr)
library(magrittr)
a = list(var = 1)
b = list(var = 2)
c = list(var = 3)
# get the current environment. alternative is to use functions like
# parent.frame from within the loop but that can get confusing
e = environment()
c('a','b','c') %>%
map(function(x){
ls = get(x,envir = e)
# whatever modification you want to make on the list
ls$var = ls$var+1
assign(x,ls,envir = e)
})
Note that in real life, as @MrFlick stated, you probably don't want to do this. Keep a, b, c in a single list and your downstream analysis will be easier, since I assume they will have to be processed through the same pipeline. map will happily return your modified list, which you can either use to overwrite the original list or assign to a new variable. Alternatively, use a for loop over list indexes to modify the original list on the go or fill a pre-allocated new variable.
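A minimal sketch of the single-list approach described above, assuming the three datasets are stored in one named list (the name datasets is made up for illustration):

```r
library(purrr)

# Hypothetical container: the three datasets kept together in one named list
datasets <- list(
  a = list(var = 1),
  b = list(var = 2),
  c = list(var = 3)
)

# map() returns a new list; assigning it back overwrites the originals
datasets <- map(datasets, function(d) {
  d$var <- d$var + 1
  d
})

str(datasets)
```

No get/assign or environment handling is needed this way, and the names a, b, c are kept on the result.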

If the purpose is to apply check_variables() which takes in a dataset (table) and returns a single TRUE or FALSE, then the issue might be related to the usage of vectorized functions.
R and its packages have many vectorized functions, such as is.na. When such a function is applied to a vector like c(1, NA, 2) or to a data frame, it is applied to each element, returning FALSE TRUE FALSE rather than a single TRUE (is any element NA?) or FALSE (are all elements NA?).
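To make the difference concrete, here is a small base R illustration on the same example vector:

```r
x <- c(1, NA, 2)

is.na(x)       # element-wise: FALSE TRUE FALSE
any(is.na(x))  # TRUE: at least one element is missing
all(is.na(x))  # FALSE: not every element is missing
```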
When check_variables() is composed of such vectorized functions, we need to "aggregate" their results with functions like all or any. Furthermore, we need to control the scope of aggregation in order to control whether check_variables() is applied to elements, variables (columns), or the entire table (data frame):
require(tidyverse) # in production code, import only `dplyr` and `tidyr`
require(purrr)
a = data.frame(x = c(1,2,3), y =c(3,NA,5))
b = data.frame(x = c(1,NA,3), y =c(3,4,5))
c = data.frame(x = c(1,NA,3), y =c(3,NA,4))
# apply `check.func` to variables (columns)
# aggregation has to be limited to the scope of each variable (column)
# `dplyr::summarize_all` happens to work like this
check.vars = function(list.tbls, check.func) list.tbls %>% map(~ .x %>% summarize_all(check.func) )
# apply `check.func` on the entire table
# as long as `check.func` takes a table and returns a single value
# we can directly apply this function
check.tbls = function(list.tbls, check.func) list.tbls %>% map(~ check.func(.x))
## Some sample functions
# check that no element under the scope is NA
# takes either a vector or a table, returns a boolean
has.no.na = . %>% is.na %>% any %>% `!`
# check that all elements under the scope are less than 5; NAs are counted as FALSE
# takes either a vector or a table, returns a boolean
is.lt.5 = . %>% `<`(5) %>% all %>% replace_na(F)
# check that all elements under the scope are less than 5; NAs are ignored, all NA means TRUE
# takes either a vector or a table, returns a boolean
is.lt.5.rm.na = . %>% `<`(5) %>% all(na.rm=T)
## Use of sample functions to check variables within each dataset
list(a,b,c) %>% check.vars(has.no.na)
list(a,b,c) %>% check.vars(is.lt.5)
## Use of sample functions to check each dataset
list(a,b,c) %>% check.tbls(has.no.na)
list(a,b,c) %>% check.tbls(is.lt.5)
list(a,b,c) %>% check.tbls(is.lt.5.rm.na)
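If a single TRUE/FALSE per dataset is all that is needed, a shorter sketch is possible with map_lgl on a named list. The dataset d and the name datasets are made up here so that one table passes the check:

```r
library(purrr)

a <- data.frame(x = c(1, 2, 3), y = c(3, NA, 5))
b <- data.frame(x = c(1, NA, 3), y = c(3, 4, 5))
d <- data.frame(x = c(1, 2, 3), y = c(3, 4, 5))  # hypothetical dataset without NAs

datasets <- list(a = a, b = b, d = d)

# One TRUE/FALSE per dataset: TRUE when the whole table has no NA
no_na_per_table <- map_lgl(datasets, ~ !any(is.na(.x)))
no_na_per_table
```

Because the list is named, the result is a named logical vector, which makes it easy to see which dataset failed.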

Related

How to get grouping variables and tibble to a function (possibly group_nest)?

Suppose I have the following code:
df <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(7,8,9))
func <- function(...) {
the_args <- list(...)
data <- the_args[[1]]
message(names(data))
}
Now I want to make three calls to func, one for each distinct value of a. I thought maybe group_nest was my friend, but not quite:
# func gets all rows instead of one group at a time
df %>% group_nest(a) %>% func()
# func gets one group at a time, but without a
df %>% group_nest(a) %>% mutate(result=map(data, func))
I'd like func to be called three times (one for each distinct value of a), each time with all three columns (a, b, c).
Suggestions?
EDIT: If I knew the grouping in advance, I could hardcode it in advance:
df %>% group_nest(a) %>% mutate(result=map(data, func, a))
and inside the function I could set a <- the_args[[2]]
However, I want a result that is resilient to different groupings, and passes a complete data frame (data and grouping columns put together), so func doesn't have to know anything about how to assemble data.
EDIT 2: My actual use case has grouping columns specified more generally, i.e., something like
grouping_cols <- c('a')
df %>% group_nest(across(all_of(grouping_cols))) %>%
mutate(result=map(data, func))
For the simplest case, you can just
df %>% group_nest(a_ = a)
And as pointed out by the OP, you can also use a variable for more generic cases
df %>% group_nest(foo = across(all_of(grouping_cols)))
Another alternative would be
df %>%
mutate(across(!!grouping_cols, `(`, .names = "{.col}_")) %>%
group_nest(across(paste0(grouping_cols, "_")))
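Another option, not mentioned in the answer above and offered only as a sketch, is group_map with .keep = TRUE, which hands each group to the function with the grouping columns still included. The func below is a simplified stand-in for the one in the question:

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
grouping_cols <- c("a")

# Stand-in for func(): report which columns each group's data frame carries
func <- function(data, key) names(data)

cols_seen <- df %>%
  group_by(across(all_of(grouping_cols))) %>%
  group_map(func, .keep = TRUE)

cols_seen
```

The result is a plain list with one entry per group, so no duplicated grouping column is needed.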

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let r know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently, dplyr::across() was introduced in dplyr 1.0.0 (mid-2020) and is built for exactly this task: applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
# (purrr::rerun() is deprecated as of purrr 1.0.0, so map() is used here)
data_values <- purrr::map(1:165,
~ sample(x = 100,
size = 237,
replace = TRUE))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
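The answer above uses log10, while the question asked for the natural log with a "_log" suffix. A small sketch of that variant, on a made-up three-row data frame (nutr_demo is a hypothetical name):

```r
library(dplyr)

# Hypothetical miniature of nutr_all: diagnosis plus two nutrient columns
nutr_demo <- data.frame(
  diagnosis  = c(0, 1, 1),
  nutrient_1 = c(1, 10, 100),
  nutrient_2 = c(2, 4, 8)
)

# Natural log of every column except diagnosis, appended as <name>_log
nutr_demo_log <- nutr_demo %>%
  mutate(across(-diagnosis, log, .names = "{.col}_log"))

names(nutr_demo_log)
```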
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
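For completeness, the for loop from the question can also be made to work by looping over column names rather than column values. This is only a sketch on a made-up miniature data frame (nutr_loop is a hypothetical name):

```r
# Hypothetical miniature of the data: diagnosis plus two nutrient columns
nutr_loop <- data.frame(
  diagnosis  = c(0, 1, 1),
  nutrient_1 = c(1, 10, 100),
  nutrient_2 = c(2, 4, 8)
)

# Loop over column *names*, so each new column can be named from the old one
for (nm in names(nutr_loop)[-1]) {
  nutr_loop[[paste0(nm, "_log")]] <- log(nutr_loop[[nm]])
}

names(nutr_loop)
```

The original loop iterated over the column contents, so the name of each column was never available inside the loop body.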
The function below uses dplyr to log-transform all numeric variables in a dataset. It also checks whether a column contains non-positive values; currently, it does not calculate the log for those columns.
logTransformation <- function(ds)
{
# this function creates log transformations of a data frame, only for variables that are strictly positive
# args:
# ds : Dataset
require(dplyr)
# is.data.frame() also accepts tibbles, whose class() has more than one element
if(!is.data.frame(ds)) { stop("ds must be a data frame")}
ds <- ds %>%
dplyr::select_if(is.numeric)
# to get only positive variables
varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
ds <- ds %>%
dplyr::select(all_of(varList)) %>%
dplyr::mutate_at(
setNames(varList, paste0(varList, "_log")), log)
return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed<- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with negative values too, replace varList as below (note that log() returns NaN with a warning for negative values):
varList<- names(ds)

Can not set sheet name with dplyr package

I am exporting multiple data frames in a list to different sheets in one Excel file, and I can do this with the code below (using mtcars as an example):
library(tidyverse)
library(ImportExport)
data_list <- split(mtcars, mtcars$cyl)
table_name <- names(data_list)
# Run
excel_export(data_list, "foo.xlsx", table_names = table_name)
Then I wanted to do this another way, because the dplyr documentation says:
.y to refer to the key, a one row tibble with one column per
grouping variable that identifies the group.
So I thought .y would be equal to the table_name variable I created, and I tried this:
data_list %>% excel_export("foo.xlsx", table_names = .y)
Then I got an error:
Error in .jcall(wb, "Lorg/apache/poi/ss/usermodel/Sheet;",
"createSheet", : object '.y' not found
Could someone explain why, and how I can do this with .y?
Any help will be highly appreciated.
If you reference a quote, I think it makes sense to use the function for which that quote was written. In this case, I found it in the documentation for group_map (which also covers group_walk, a complementary function that is called primarily for its side effects).
You still need to group_by the data. More specifically, it needs to operate on a tbl_df (not a list), typically (but not always) a grouped tibble.
I have neither ImportExport nor xlsx (on which the former depends) installed, so I'll proxy your action with write.csv.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
group_walk(~ write.csv(.x, paste0(.y, ".csv")))
The side-effect of this is that three files are created in the current directory named 4.csv, 6.csv, and 8.csv.
If you want to operate on a named list, one could also use one of:
# using: data_list <- split(mtcars, mtcars$cyl)
purrr::imap(data_list,
~ write.csv(.x, paste0("~/Downloads/", .y, ".csv")))
Map(write.csv, data_list, paste0(names(data_list), ".csv"))
The effect is the same.
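Since the files are written purely for their side effect, purrr::iwalk is another candidate. The sketch below writes into tempdir() so it does not clutter the working directory; the "cyl_" file prefix is made up for illustration:

```r
library(purrr)

data_list <- split(mtcars, mtcars$cyl)
out_dir <- tempdir()

# .x is the data frame, .y is its name in the list ("4", "6", "8")
iwalk(data_list,
      ~ write.csv(.x, file.path(out_dir, paste0("cyl_", .y, ".csv"))))

# Paths of the files that should now exist
written <- file.path(out_dir, paste0("cyl_", names(data_list), ".csv"))
file.exists(written)
```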
.x and .y are not global parameters that can be used anywhere; they are reserved for specific functions (usually map and its variants).
From ?map
.f A function, formula, or vector (not necessarily atomic).
If a formula, e.g. ~ .x + 2, it is converted to a function. There are three ways to refer to the arguments:
For a single argument function, use .
For a two argument function, use .x and .y
For more arguments, use ..1, ..2, ..3 etc
Let's take a simple list
library(purrr)
listvec1 <- list(a = 1:3, b = 4:6, c = 2:4)
1) Let's say we want to multiply every element in the list with 2, map has a single argument so we use . here.
map(listvec1, ~. * 2)
#$a
#[1] 2 4 6
#$b
#[1] 8 10 12
#$c
#[1] 4 6 8
Coincidentally, .x works here as well:
map(listvec1, ~.x * 2)
but if you use .y it will give an error, because map passes only one argument to the function.
map(listvec1, ~.y * 2)
Error in .f(.x[[i]], ...) : the ... list contains fewer than 2 elements
2) Let's take another list and now add listvec1 with listvec2. For this, we can use map2 which has two arguments so here we refer them as .x and .y.
listvec2 <- list(a = 7, b = 8, c = 9)
map2(listvec1,listvec2, ~.x + .y)
# this could be simplified to the following, but it is just an example
#map2(listvec1,listvec2, `+`)
#$a
#[1] 8 9 10
#$b
#[1] 12 13 14
#$c
#[1] 11 12 13
In the same way, we use .x and .y in imap, where .x is the element and .y is either the name of the element (if present) or its index.
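A small sketch of imap on the same list, using imap_chr so the result is a character vector:

```r
library(purrr)

listvec1 <- list(a = 1:3, b = 4:6, c = 2:4)

# .x is each element, .y is that element's name
labelled_sums <- imap_chr(listvec1, ~ paste0(.y, ": ", sum(.x)))
labelled_sums
```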
In the post above .y means nothing, so using it as
data_list %>% excel_export("foo.xlsx", table_names = .y)
would definitely yield an error. You need to use specific functions as described above to use .x, .y. So if you want to use pipes for your command, you should use
data_list %>% excel_export("foo.xlsx", table_names = table_name)

How can I simultaneously assign value to multiple new columns with R and dplyr?

Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is a do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element for each of the different columns.
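With dplyr 1.0.0 or later there is another possibility, sketched here under the assumption that the function can be changed to return a one-row data frame (f_df is a hypothetical replacement for f): an unnamed data frame passed to mutate is unpacked into separate columns.

```r
library(dplyr)

base <- data.frame(a = 1)

# Hypothetical variant of f() that returns a one-row data frame
f_df <- function() data.frame(b = 2, c = 3, d = 4)

# The unnamed data-frame result becomes the columns b, c and d
result <- base %>% mutate(f_df())
result
```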
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
Rename your columns.

R: row-wise dplyr::mutate using function that takes a data frame row and returns an integer

I am trying to use a custom function in a piped mutate statement. I looked at this somewhat similar SO post, but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select some based on the value in the exclude column and, e.g., calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN, I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two attempts do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
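If the goal is to stay inside mutate while still handing FUN a one-row data frame, one possible sketch uses purrr::pmap_dbl. The helper row_fun is made up for illustration; it rebuilds a single-row data frame from each row's values before calling the FUN from the question:

```r
library(dplyr)
library(purrr)

df <- data.frame(exclude = c("B", "B", "D"),
                 B = c(1, 0, 0),
                 C = c(3, 4, 9),
                 D = c(1, 1, 0),
                 blob = c("fd", "fs", "sa"),
                 stringsAsFactors = FALSE)

# FUN as defined in the question
FUN <- function(df) {
  sum(df[c("B", "C", "D")][!names(df[c("B", "C", "D")]) %in% df["exclude"]])
}

# Hypothetical helper: rebuild a one-row data frame from the row's values
row_fun <- function(...) FUN(data.frame(..., stringsAsFactors = FALSE))

# pmap_dbl() calls row_fun once per row, with the columns as named arguments
result <- df %>% mutate(s = pmap_dbl(df, row_fun))
result$s
```

This is unlikely to be faster than the rowSums version, since a data frame is built for every row, but it keeps the original FUN unchanged.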
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise makes each subsequent function operate row by row, which supports arbitrarily complex operations that need to be applied to each row.
Next, nest will create a new list-column holding the data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of the columns exclude through D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
  exclude     B     C     D blob      s
  <chr>   <dbl> <dbl> <dbl> <chr> <dbl>
1 B           1     3     1 fd        4
2 B           0     4     1 fs        5
3 D           0     9     0 sa        9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
A word of warning: while I like the functionality of by_row, it's not always the best approach for performance! The purrr approach is more intuitive, but comes at a rather large speed cost. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see @konvas's answer), and roughly on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.

Resources