I am trying to write a function that will (in part) rename a variable by combining its source dataframe and existing variable name. In essence, I want:
df1 <- data.frame(a = 1, b = 2)
to become:
df1 %>%
rename(df1_a = a)
# df1_a b
#1 1 2
But I want to do this programmatically, something along the lines of:
fun <- function(df, var) {
outdf <- rename_(df, paste(df, var, sep = "_") = var)
return(outdf)
}
This admittedly naive approach obviously doesn't work, but I haven't been able to figure it out. I'm sure the answer is somewhere in the NSE vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html), but that doesn't seem to address constructing variable names.
Not sure if this is the proper dplyr-esque way, but it'll get you going.
fun <- function(df, var) {
x <- deparse(substitute(df))
y <- deparse(substitute(var))
rename_(df, .dots = with(df, setNames(as.list(y), paste(x, y, sep = "_"))))
}
fun(df1, a)
# df1_a b
# 1 1 2
fun(df1, b)
# a df1_b
# 1 1 2
lazyeval isn't really needed here because the environment of both inputs is known. That being said:
library(lazyeval)
library(dplyr)
library(magrittr)
fun = function(df, var) {
df_ = lazy(df)
var_ = lazy(var)
fun_(df_, var_)
}
fun_ = function(df_, var_) {
new_var_string =
paste(df_ %>% as.character %>% extract(1),
var_ %>% as.character %>% extract(1),
sep = "_")
dots = list(var_) %>% setNames(new_var_string)
df_ %>%
lazy_eval %>%
rename_(.dots = dots)
}
fun(df1, a)
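Both answers above rely on rename_() and lazyeval, which newer dplyr releases have deprecated. As a rough sketch only, assuming dplyr 1.0.0 or later, the same renaming could be written with rename_with(). The helper name prefix_with_df_name is hypothetical and not from the thread, and the data frame name is only captured correctly when the function is called directly rather than through a pipe.

```r
library(dplyr)

# Hypothetical helper (not from the thread): prefix one column with the
# name of the data frame it came from.
prefix_with_df_name <- function(df, var) {
  df_name  <- deparse(substitute(df))   # e.g. "df1"
  var_name <- deparse(substitute(var))  # e.g. "a"
  # rename_with() applies the function only to the columns selected by .cols
  rename_with(df, ~ paste(df_name, .x, sep = "_"), .cols = all_of(var_name))
}

df1 <- data.frame(a = 1, b = 2)
renamed <- prefix_with_df_name(df1, a)
names(renamed)
# "df1_a" "b"
```

Capturing the names with deparse(substitute()) mirrors the first answer; only the renaming step is swapped for rename_with().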
I am attempting to adapt a long function (rcompanion::groupwiseMean) to use dplyr instead of plyr::ddply in its code to avoid dependency on the now deprecated plyr package.
I would like to define a custom ddply2 function, taking the same arguments as the original plyr function, but with dplyr under the hood. The benefit would be to only redefine the function once at the top of the existing long function/script without changing anything else. My attempts have failed so far. Demo below.
I have been using this resource: plyr::ddply equivalent in dplyr
Original plyr::ddply call
data <- mtcars
var <- "mpg"
group <- c("cyl", "am")
# Original function passed to plyr::ddply:
fun.y <- function(x, idx) { length(x[, idx]) }
# Original plyr::ddply call:
plyr::ddply(.data = data, .variables = group, var, .fun = fun.y)
#> cyl am V1
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
This is the function that I CANNOT rewrite
fun.y <- function(x, idx) { length(x[, idx]) }
However this is just an example. Here are some other functions I will need working with ddply2:
fun.z <- function(x, idx) { as.numeric(mean(x[, idx], trim = trim, na.rm = na.rm)) }
fun.w <- function(x, idx) {
mean(boot(x[, idx], function(y, j) mean(y[j], trim = trim,
na.rm = na.rm), R = R, ...)$t[, 1])
}
Now let's proceed to the desired ddply2 call, which I am allowed to modify any way I want. However it must take the same arguments as plyr::ddply.
Attempt to rewrite plyr::ddply as ddply2
library(dplyr)
ddply2 <- function(.data, .variables, var, .fun) {
.data %>%
group_by(across({{.variables}})) %>%
do(.fun(., {{var}}))
}
ddply2(.data = data, .variables = group, var, .fun = fun.y)
# Error in `do()`:
# ! Results 1, 2, 3, 4, 5, 6 must be data frames, not integer.
Edit
Again, I cannot rewrite fun.y, fun.z, or fun.w, only ddply2. So solutions based on summarize() or count() will not work, as they do not generalize to other functions. plyr::ddply did not require summarize() or count(); that's the idea.
After some discussion, I now understand the goal: rewrite this function using dplyr rather than plyr so that it gives the same result for inputs such as those listed in the inputs section below.
dd <- function(data, group, var, fun)
plyr::ddply(.data = data, .variables = group, var, .fun = fun)
To do that the new function can use group_by with either summarize or group_modify. dd1 below uses the first and dd2 uses the second. Use whichever you prefer.
Note that fun.z, as written, assumes a data frame rather than a tibble: x[, idx] with a single column returns a vector for a data frame but another tibble for a tibble. We therefore use as.data.frame to make sure a data frame is passed. Also, plyr returns a data frame, so at the end of dd1 and dd2 we convert the resulting tibble to a data frame to make the results identical.
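The difference in subsetting behaviour can be checked directly. This is a small illustration with made-up data; the names df_demo and tb_demo are hypothetical.

```r
library(tibble)

df_demo <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))
tb_demo <- as_tibble(df_demo)

class(df_demo[, "mpg"])  # "numeric": the single column is dropped to a vector
class(tb_demo[, "mpg"])  # "tbl_df" "tbl" "data.frame": still a tibble
```

This is why a function such as fun.z, which calls mean(x[, idx]), needs to receive a plain data frame.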
dd1 <- function(data, group, var, fun)
data %>%
group_by(across(all_of(group))) %>%
summarize(V1 = fun(as.data.frame(cur_data()), var), .groups = "drop") %>%
as.data.frame
dd2 <- function(data, group, var, fun)
data %>%
group_by(across(all_of(group))) %>%
group_modify(~ { data.frame(V1 = fun(as.data.frame(.), var)) }) %>%
ungroup %>%
as.data.frame
Now test it out
# inputs - start #
data <- mtcars
trim <- 0
na.rm <- FALSE
var <- "mpg"
group <- c("cyl", "am")
fun.z <- function(x, idx) {
as.numeric(mean(x[, idx], trim = trim, na.rm = na.rm))
}
# inputs - end #
library(dplyr)
dd.out <- dd(data, group, var, fun.z) # plyr
dd1.out <- dd1(data, group, var, fun.z)
dd2.out <- dd2(data, group, var, fun.z)
identical(dd1.out, dd.out)
## [1] TRUE
identical(dd2.out, dd.out)
## [1] TRUE
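One caveat: cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). A possible variant of dd1 for newer dplyr versions might look like the sketch below. The name dd3 is hypothetical, and the sketch assumes dplyr 1.1.0 or later; it has not been compared against plyr here.

```r
library(dplyr)

# Hypothetical variant of dd1 (not from the answer); assumes dplyr >= 1.1.0.
# pick(everything()) returns the non-grouping columns of the current group.
dd3 <- function(data, group, var, fun)
  data %>%
    group_by(across(all_of(group))) %>%
    summarize(V1 = fun(as.data.frame(pick(everything())), var),
              .groups = "drop") %>%
    as.data.frame()

trim <- 0
na.rm <- FALSE
fun.z <- function(x, idx) {
  as.numeric(mean(x[, idx], trim = trim, na.rm = na.rm))
}

dd3.out <- dd3(mtcars, c("cyl", "am"), "mpg", fun.z)
```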
I'm building a dplyr structure to run some custom functions over the columns of a dataframe in one block of code.
Currently my function looks like this:
funx <- function(x) {
logchoice <- if(max(x) < 400) {'T' } else { 'F' }
logtest <- suppressWarnings(log10(x))
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <- tibble(pvalue = y$p.value, Transform = logchoice)
return(z)
}
and the dplyr structure looks like this:
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather %>%
unnest %>%
arrange(pvalue) %>%
rename(Parameter = key)
which gives me:
Parameter pvalue Transform
1 cyl 0.00000000 T
2 drat 0.03026093 T
3 hp 0.04252001 T
4 disp 0.06050505 F
I would like to know how I can access the column name inside my function, mainly because I would like to change the name in the result table to the output of paste('log10', original_column_name, sep = '_') if the function applies the log transformation, but leave the original name as is when it decides not to.
so the expected output is:
Parameter pvalue Transform
1 log10_cyl 0.00000000 T
2 log10_drat 0.03026093 T
3 log10_hp 0.04252001 T
4 disp 0.06050505 F
You were quite close. You can just add a mutate() to the end
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather() %>%
unnest() %>%
arrange(pvalue) %>%
rename(Parameter = key) %>%
mutate(Parameter = ifelse(Transform == "T", paste0("log10_", Parameter), Parameter)) %>%
select(Parameter, pvalue)
# Parameter pvalue
# log10_cyl 0.00000000
# log10_drat 0.01389723
# disp 0.02771770
# log10_hp 0.08493466
Answering in a separate post as the solution is different. To get the column names in a print(), I would pass them into the function and use purrr::map_dfr to build a data frame of the results. The small changes I made are to grab the column name, col_name, and to pass the data frame explicitly. I tried a few approaches to grab the column name from inside your original function but was unsuccessful.
logtest_pval <- function(col, df) {
col_name <- col
x <- df %>% pull(!!col)
logchoice <- max(x) < 400
logtest <- log10(x)
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <-
tibble(
transform = logchoice,
column = ifelse(logchoice, paste0("log10_", col_name), col_name),
pvalue = y$p.value
)
print(paste0(z, collapse = " | "))
return(z)
}
Then you can build your dataframe:
purrr::map_dfr(
.x = names(mtcars), # the columns to use
.f = logtest_pval, # the function to use
df = mtcars # additional arguments needed
)
Here's another example
df <-
mtcars %>%
select_if(is.numeric)
pvalues <-
purrr::map_dfr(names(df), logtest_pval, df)
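Another way to give the function access to the column name, sketched here on the assumption that purrr is available, is purrr::imap(), which calls the function with each column and its name. The function label_column is a hypothetical stand-in for funx: it keeps only the naming logic and skips the dip test so the sketch has no extra dependency.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for funx: receives the column values and the
# column name, and returns the label the result table should carry.
label_column <- function(x, name) {
  use_log <- max(x) < 400
  tibble(
    Parameter = if (use_log) paste0("log10_", name) else name,
    Transform = use_log
  )
}

# imap() passes each column of the data frame together with its name
labels <- mtcars[2:5] %>%
  imap(label_column) %>%
  bind_rows()
```

The p-value computation from funx could be added inside label_column in the same way as in the answer above.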
I'm trying to write a function that takes as one of its arguments a vector of column names from user. The column names will be used to specify what columns of the dataframe will be pasted together to form a new column within dplyr::mutate. I tried to collapse the elements of argument vector first and then use the collapsed string in mutate - this is wrong. See that latest attempt below. I made other attempts but I'm not understanding the new quo, enquo, UQ, !!!, !!, and so on within dplyr. Can someone show what I need to do?
df <- data.frame(.yr = c("2000", "2001", "2002"), .mo = c("12", "01", "02"), .other = rnorm(3))
cols <- colnames(df)[1:2]
do_want <- df %>%
mutate(new = paste(.yr, .mo, sep = "-"))
my_func <- function(dat, vars){
.vars <- paste(vars, collapse = ",")
result <- dat %>%
mutate(new = paste(.vars, sep = "-" ))
return(result)
}
my_func(dat = df, vars = cols)
Edit: this is my attempt at using quo and !! in the function definition. The result is a column containing the repeated string ".yr,.mo".
my_func <- function(dat, vars){
.vars <- quo(paste(vars, collapse = ","))
result <- dat %>%
mutate(new = paste(!!.vars, sep = "-" ))
return(result)
}
Because you have a list of strings, you can use rlang::syms in your function to take the strings and turn them into symbols. Then you can use !!! to splice the arguments together to put into paste.
my_func <- function(dat, vars){
.vars <- rlang::syms(vars)
result <- dat %>%
mutate(new = paste(!!!.vars, sep = "-" ))
return(result)
}
my_func(dat = df, vars = cols)
.yr .mo .other new
1 2000 12 -0.2663456 2000-12
2 2001 01 0.5463433 2001-01
3 2002 02 -1.3133078 2002-02
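To see why splicing works where the collapsed string did not, it can help to build the call without evaluating it. This is only an illustration using rlang::expr(); the variable name spliced is hypothetical.

```r
library(rlang)

cols <- c(".yr", ".mo")

# Build the call without evaluating it, to see what mutate() receives
spliced <- expr(paste(!!!syms(cols), sep = "-"))
spliced
# paste(.yr, .mo, sep = "-")

# The string-based attempt, by contrast, pastes one literal string
paste(paste(cols, collapse = ","), sep = "-")
# ".yr,.mo"
```

In the first case each column name becomes a separate argument that mutate() can look up in the data; in the second, paste() only ever sees a single character value.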
Use tidyr::unite().
library(dplyr)
library(tidyr)
nms <- colnames(iris) # avoid masking base::names
iris %>% mutate(new = paste(nms)) # Error
iris %>% unite("new", all_of(nms), remove = FALSE) # OK
Using mutate_ instead of mutate and turning the expression into a string worked for me:
dplyr_solution <- function(dat, vars){
.vars <- paste(vars, collapse = ",")
result <- dat %>%
mutate_(new = paste0('paste(', .vars, ', sep="-")'))
return(result)
}
dplyr_solution(dat = df, vars = cols)
I want to use filter and summarise from dplyr inside my function. Without a function it works as follows:
library(dplyr)
> Orange %>%
+ filter(Tree==1) %>%
+ summarise(age_max = max(age))
age_max
1 1582
I want to do the same inside a function, but the following fails:
## Function definition:
df.maker <- function(df, plant, Age){
require(dplyr)
dfo <- df %>%
filter(plant==1) %>%
summarise(age_max = max(Age))
return(dfo)
}
## Use:
> df.maker(Orange, Tree, age)
Error in as.lazy_dots(list(...)) : object 'Tree' not found
I know that similar questions have been asked before. I've also gone through some relevant links such as page1 and page2. But I can't fully grasp the concepts of NSE and SE. I tried the following:
df.maker <- function(df, plant, Age){
require(dplyr)
dfo <- df %>%
filter_(plant==1) %>%
summarise_(age_max = ~max(Age))
return(dfo)
}
But I get the same error. Please help me understand what's going on. How can I correctly create my function? Thanks!
EDIT:
I also tried the following:
df.maker <- function(df, plant, Age){
require(dplyr)
dfo <- df %>%
#filter_(plant==1) %>%
summarise_(age_max = lazyeval::interp(~max(x),
x = as.name(Age)))
return(dfo)
}
> df.maker(Orange, Tree, age)
Error in as.name(Age) : object 'age' not found
Either supply character arguments and use as.name:
df.maker1 <- function(d, plant, Age){
require(dplyr)
dfo <- d %>%
filter_(lazyeval::interp(~x == 1, x = as.name(plant))) %>%
summarise_(age_max = lazyeval::interp(~max(x), x = as.name(Age)))
return(dfo)
}
df.maker1(Orange, 'Tree', 'age')
age_max
1 1582
Or capture the arguments with substitute:
df.maker2 <- function(d, plant, Age){
require(dplyr)
plant <- substitute(plant)
Age <- substitute(Age)
dfo <- d %>%
filter_(lazyeval::interp(~x == 1, x = plant)) %>%
summarise_(age_max = lazyeval::interp(~max(x), x = Age))
return(dfo)
}
df.maker2(Orange, Tree, age)
age_max
1 1582
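Both versions above depend on filter_(), summarise_() and lazyeval, which newer dplyr releases have deprecated. As a sketch only, assuming a dplyr version that supports the embracing operator {{ }} (introduced with rlang 0.4.0), the function could be written as below. The name df.maker3 is hypothetical.

```r
library(dplyr)

# Hypothetical rewrite of df.maker using embracing instead of lazyeval.
df.maker3 <- function(d, plant, Age) {
  d %>%
    filter({{ plant }} == 1) %>%
    summarise(age_max = max({{ Age }}))
}

# as.data.frame() drops the extra groupedData classes that Orange carries
orange_df <- as.data.frame(Orange)
res <- df.maker3(orange_df, Tree, age)
res$age_max
# 1582
```

Here {{ plant }} and {{ Age }} forward the unquoted column names supplied by the caller into filter() and summarise(), so the call looks the same as for df.maker2.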
How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a workaround:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise, checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise, checkSum = sum(value))
})
names(l) <- names(mylist); l
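The same idea, iterating over the list together with its names, can be sketched without plyr. The version below assumes dplyr 1.0.0 or later and purrr, and uses purrr::imap() so the list names are kept on the result automatically. The variable name sums is hypothetical.

```r
library(dplyr)
library(purrr)

df1 <- data.frame(value = rnorm(100), cat = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# imap() passes each data frame (d) together with its list name (nm)
sums <- imap(mylist, function(d, nm) {
  d %>%
    group_by(across(all_of(c("idx", nm)))) %>%
    summarise(checkSum = sum(value), .groups = "drop")
})
```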
Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = somelistOfDataframes,
cat = names(somelistOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))
You could first store the names with names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each list name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)