Strange(?) behaviour of `summary()` inside a pipe

A couple of colleagues and I are wondering why
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary()
or
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary() %>% `+`(., 0)
do not work (summary is not printed), while
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% {print(summary(.))}
does?

There's no reason for summary to print its output any more than rnorm or matrix or any of the other functions in the pipeline. Printing is generally suppressed when there is assignment <-.
N <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4]) %T>% summary()
# no printing
N <- 5
# no printing
M <- rnorm(16) %>% matrix(., ncol = 4) %>% `colnames<-`(letters[1:4])
# no printing
M %>% summary
summary(M)
# prints
sM <- M %>% summary
sM <- summary(M)
# no printing
# assignment prevents implicit printing.
An explicit print call creates a side-effect (printing), which is what you want to happen.
The tee pipe %T>% doesn't create any side effects (like printing) itself; it just returns the LHS. So if you %T>% into a function that creates side effects but does not return its input (like plot()), you get those side effects while still piping the input to the next step.
Note that print does return its argument (invisibly), so you don't actually need the tee pipe %T>% with print.
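To make the difference concrete, here is a minimal sketch of the three cases, assuming magrittr is installed. The object names M, N, P and S are only illustrative.

```r
library(magrittr)

set.seed(1)
M <- matrix(rnorm(16), ncol = 4, dimnames = list(NULL, letters[1:4]))

# Tee pipe: the braces print the summary, and %T>% hands M itself onward
N <- M %T>% {print(summary(.))}

# print() returns its argument invisibly, so a plain %>% is enough
# when the object being printed is the one you want to keep
P <- M %>% print()

# Without the tee, the pipeline ends with whatever print() returned:
# here that is the summary table, not the matrix
S <- M %>% summary() %>% print()
```

So the tee is only needed when the thing being printed (the summary) is not the thing you want to assign (the matrix).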

Related

Printing an intermediate output produced by a customized function in dplyr R

Purpose
I have a customized function that takes a list of variables and, for each variable, chooses another variable within the data frame and fits a regression between the two to create fitted values. I would like to print the summary of the regression output when I run this in dplyr. The function below should make it easier to see what I am trying to achieve.
The Customized Function
prisma_fn_add_fitted_fam <- function(d, vars){
for(i in 1:length(vars)){
varname <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.', sep = ''),"7.",vars[i])
lag <- gsub(paste0(str_extract(vars[i], "[0-9]+")[[1]], '.', sep = ''),"6.",vars[i])
d <-
d %>%
set.name(vars[i], 'xyz') %>%
set.name(lag, 'wxy')
s <-
lm(xyz ~ wxy, d)
d <-
d %>%
modelr::add_predictions(s) %>%
rename(!!varname := pred) %>%
set.name('xyz', vars[i]) %>%
set.name('wxy', lag)
}
print(summary(s)) # It does not print.
d
}

FUN == 'x' does not work, how to go around it in R

I am trying to write a function that takes some other function FUN as an argument. In this case I want (among other things) to alter what it does if I set FUN = match0.
library(dplyr)
library(purrr)
f <- function(df, pair, FUN, ...){
df1 <- df %>%
group_split()
w <- df1 %>%
map(~ .x %>%
nrow() %>%
seq())
x <- map2(w, df1, ~map(.x, mean, df = .y))
y <- map(x, unlist)
l <- map2(y, df1, ~map(.x, function(x, df = .y){
if(deparse(substitute(FUN)) == 'match0'){
out <- x
} else{
out <- df[x, pair]}
return(out)
}
)) %>% unlist()
df <- bind_rows(df1) %>% bind_cols(index = l)
return(df)
}
If I run, for instance:
a <- data.frame(n = c(15,20,15,20,15,20)) %>% group_by(n)
x <- f(a, pair = 'pairs0', FUN = match0)
I get Column 'pairs0' doesn't exist.
This would be the case if, in fact, the conditional statement evaluated to FALSE. How can I change this?
To be honest, I'm not quite sure how to use deparse, substitute and the like; I've just tried to follow some other posts. FWIW, I thought it would work because if I test deparse(substitute(match0)) == 'match0', I get TRUE.
Any help?
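One likely cause is that substitute() only replaces symbols bound in the environment it is called from. Inside the inner anonymous function, FUN is not an argument, so substitute(FUN) returns the symbol FUN unchanged and the comparison with 'match0' is FALSE. The sketch below shows this with base R only. match0 is a hypothetical placeholder here (its definition is not shown in the question), and outer_wrong and outer_right are illustrative names.

```r
# Hypothetical placeholder; the real match0 is not shown in the question
match0 <- function(x) x

# substitute() is called inside a nested function where FUN is not an argument
outer_wrong <- function(FUN) {
  inner <- function() deparse(substitute(FUN))
  inner()
}

# Capture the name once, at the top level of the function that receives FUN
outer_right <- function(FUN) {
  fun_name <- deparse(substitute(FUN))
  inner <- function() fun_name
  inner()
}

wrong <- outer_wrong(match0)  # the literal string "FUN"
right <- outer_right(match0)  # the string "match0"
```

Applied to f, that would mean computing the name once at the start of the function body and testing that stored string inside the nested map calls.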

Magrittr pipe with matrix operations in R

I'm working on some functions that take a matrix as input and provide a matrix as output. Is it possible to use the magrittr pipe with matrices without using the . placeholder? Ideally, I'd like these functions to be piped into each other like a dplyr chain. The issue is that I'm constantly forgetting to specify the . placeholder and getting errors.
library(magrittr)
set.seed(123)
m <- matrix(rnorm(10), ncol = 2)
# This works perfectly:
layout_align_x <- function(n = nodes, anchor, m = matrix){
m[n, 1] <- m[anchor, 1]
return(m)}
# This also works perfectly:
layout_align_x(c(1,2), 3, m)
# And this also:
m %>% layout_align_x(c(1,2), 3, .)
# This returns error:
m %>% layout_align_x(c(1,2), 3)
#Error in m[anchor, 1] : incorrect number of dimensions
# The goal is:
m %>%
layout_align_x(c(1,2), 3) %>%
layout_align_x(c(3,4), 5)

Change your function to
layout_align_x <- function(m = matrix, n = nodes, anchor){
m[n, 1] <- m[anchor, 1]
return(m)
}
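This works because %>% inserts the left-hand side as the first argument of the call when no . placeholder is present. A short sketch of the chain from the question with the matrix argument first, assuming magrittr is installed (the placeholder defaults are dropped here for brevity, and out is an illustrative name):

```r
library(magrittr)

set.seed(123)
m <- matrix(rnorm(10), ncol = 2)

# Matrix first, so the pipe can supply it without a placeholder
layout_align_x <- function(m, n, anchor) {
  m[n, 1] <- m[anchor, 1]
  m
}

out <- m %>%
  layout_align_x(c(1, 2), 3) %>%
  layout_align_x(c(3, 4), 5)
```

Rows 1 and 2 of the first column now hold the value from row 3, and rows 3 and 4 hold the value from row 5.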

How to optimize memory usage in dplyr + purrr

I have an issue where, after replicating data for a training and testing set, I'm seeing a large amount of memory allocated to my user in RStudio but not being used in my R session. I've created a small example to reproduce my situation :)
This code runs a bunch of models, based on different formulas, algorithms, and parameter sets that I give it. It is a function, but I've created a simple script for the reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)
# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
test2 = mpg ~ cyl + wt)
params = list()
methods <- "lm"
n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)
### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
cross_params <- params %>%
map(cross) %>%
map_df(enframe, name = "param_set", .id = "method") %>%
list
} else cross_params <- NULL
methods_df <- tibble(method = methods) %>%
list %>%
append(cross_params) %>%
reduce(left_join, by = "method") %>%
split(1:nrow(.))
# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
name = names(formulas)) %>%
split(.$name)
# split out the data into n random train-test combos
cv_data <- data %>%
crossv_kfold(n) %>% # rsample?
mutate_at(vars(train:test), ~map(.x, as_tibble))
# sample out if needed
cv_data_samp <- cv_data %>%
mutate(train = modify(train,
~ .x %>%
split(.$gear == 4) %>%
# take a sample of the non-vo data
modify_at("FALSE", sample_frac, frac) %>%
# multiply out the vo-on data
modify_at("TRUE", function(.df) {
map_df(seq_along(1:mult), ~ .df)
}) %>%
bind_rows))
# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
form = formulas_df,
meth = methods_df) %>%
cross %>%
map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
unnest(data, .preserve = matches("formula|param|value")) %>%
{if ("value" %in% names(.)) . else mutate(., value = list(NULL))}
# run the models
model_combos %>%
# put all arguments into a single params column
mutate(params = pmap(list(formula = formula, data = train), list)) %>%
mutate(params = map2(params, value, ~ append(.x, .y))) %>%
mutate(params = modify(params, discard, is.null)) %>%
# run the models
mutate(model = invoke_map(method, params))
mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now after I do this, my mem_used comes out to 77.3 MB, but I'm seeing roughly double that (160 MB) allocated to my R user. This really explodes when my data is 3 GB, which is my real-life case. I end up using 100 GB and tying up a whole server :(.
What is going on and how can I optimize?
Any help appreciated!!!

I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles and that was exploding the memory even though I was subsequently sampling them down. The solution? Write methods for dealing with resample objects so that I never have to convert resample objects to tibble. These looked like:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
data$idx <- sample(data$idx, frac * length(data$idx))
data
}
# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
data$idx <- unlist(map(seq_along(1:n), ~ data$idx))
data
}
# This function does simple splitting (logical only) of resample objects
split.resample <- function(data, .p) {
# which() takes only the logical vector here; its second argument is arr.ind, not a row index
pos <- list(data = data$data, idx = which(.p))
neg <- list(data = data$data, idx = which(!.p))
class(pos) <- "resample"
class(neg) <- "resample"
list("TRUE" = pos,
"FALSE" = neg)
}
# This function takes the equivalent of a `bind_rows` for resample objects.
# Since `bind_rows` does not call `UseMethod`, I had to call it something else
bind <- function(data) {
out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
class(out) <- "resample"
out
}
Then I just converted to a tibble in the same purrr closure in which my model is run for that CV. Problem solved! My memory usage is VERY low now.
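The idea of manipulating indices and materialising rows only at model time can be sketched in base R. This is an illustration under assumptions, not the code used above: make_resample and the class name resample_sketch are hypothetical stand-ins for a modelr resample object.

```r
# Hypothetical stand-in for a resample object: a data frame plus row indices
make_resample <- function(data, idx = seq_len(nrow(data))) {
  structure(list(data = data, idx = idx), class = "resample_sketch")
}

r <- make_resample(mtcars)

# Replicate the rows 10 times by repeating indices only
r_big <- r
r_big$idx <- rep(r$idx, times = 10)

# Materialise the rows only at the point where a model needs them
train <- r_big$data[r_big$idx, , drop = FALSE]
fit <- lm(mpg ~ cyl + wt, data = train)

# Compare the footprint of the index vector with the materialised rows
idx_size  <- object.size(r_big$idx)
data_size <- object.size(train)
```

The index vector is much smaller than the expanded data frame, which is why deferring the conversion keeps memory low across many CV splits.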

pass arguments when pipe is called inside a function

I wanted to write a function that is the composition of two functions using the pipe operator %>%, so that, for instance, the following are equivalent (imagine I call it %O%):
iris[1:4] %>% lapply(FUN = function(x) hist(sample(x)))
iris[1:4] %>% lapply(FUN = sample %O% hist)
I want it in this direction (not hist %O% sample) because it would follow the same logic as %>%.
I've come up with something like this:
'%O%' <- function(lhs1, rhs1){
return(
function(x){
return(x %>% lhs1 %>% rhs1)
})
}
However, it raises errors when I try
iris[1:4] %>% lapply(FUN = sample(size = 100, replace = TRUE) %O% hist)
What should I do to allow the parameters to be understood by %>% inside the %O% function? Is it an eval and quote problem? I also don't really understand how %>% is able to read arguments in lhs and rhs.
Any help will be appreciated. Thanks a lot!

Use a magrittr functional sequence instead?
f <- . %>% sample() %>% hist()
iris[1:4] %>% lapply(f)
Or just
iris[1:4] %>% lapply(. %>% sample() %>% hist())
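A functional sequence is an ordinary function, so it can also carry the extra arguments that the original attempt could not. In the question's version, sample(size = 100, replace = TRUE) is a complete call with no x, so it fails as soon as it is evaluated. Below is a minimal sketch, assuming magrittr is installed. %O% is the asker's operator name, resample_100 is an illustrative name, and length stands in for hist so that the result is easy to check.

```r
library(magrittr)

# Compose two functions left to right: apply lhs first, then rhs
`%O%` <- function(lhs, rhs) {
  force(lhs)
  force(rhs)
  function(x) rhs(lhs(x))
}

# Wrap the step that needs arguments in a functional sequence
resample_100 <- . %>% sample(size = 100, replace = TRUE)

f <- resample_100 %O% length
out <- lapply(iris[1:4], f)
```

Replacing length with hist gives the histogram version from the question.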

Here are some possibilities:
iris[1:4] %>% lapply(FUN = function(x) x %>% sample %>% hist)
library(gsubfn)
iris[1:4] %>% fn$lapply(FUN = x ~ x %>% sample %>% hist)
library(functional)
iris[1:4] %>% lapply(FUN = Compose(sample, hist))
library(functional)
`%c%` <- Compose
iris[1:4] %>% lapply(FUN = sample %c% hist)
Update: Added additional solutions.
