I'm building a dplyr pipeline to run some custom functions over the columns of a data frame in one block of code.
Currently my function looks like this:
funx <- function(x) {
logchoice <- if(max(x) < 400) {'T' } else { 'F' }
logtest <- suppressWarnings(log10(x))
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <- tibble(pvalue = y$p.value, Transform = logchoice)
return(z)
}
and the dplyr structure looks like this:
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather %>%
unnest %>%
arrange(pvalue) %>%
rename(Parameter = key)
which gives me:
Parameter pvalue Transform
1 cyl 0.00000000 T
2 drat 0.03026093 T
3 hp 0.04252001 T
4 disp 0.06050505 F
I would like to know how I can access the column name inside my function, mainly because I would like to change the name in the result table to look like the output of paste('log10', original_column_name, sep = '_') if the function applies the log transformation, but leave the original name as is when it decides not to.
So the expected output is:
Parameter pvalue Transform
1 log10_cyl 0.00000000 T
2 log10_drat 0.03026093 T
3 log10_hp 0.04252001 T
4 disp 0.06050505 F
You were quite close. You can just add a mutate() to the end:
mtcars %>%
sample_n(30) %>%
select(colnames(mtcars)[2:5]) %>%
summarise_all(list(~ list(funx(.)))) %>%
gather() %>%
unnest() %>%
arrange(pvalue) %>%
rename(Parameter = key) %>%
mutate(Parameter = ifelse(Transform == "T", paste0("log10_", Parameter), Parameter)) %>%
select(Parameter, pvalue)
# Parameter pvalue
# log10_cyl 0.00000000
# log10_drat 0.01389723
# disp 0.02771770
# log10_hp 0.08493466
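If you would rather have the function itself see the column name, one option (assuming dplyr 1.0.0 or later) is cur_column(), which returns the name of the column currently being processed inside across(). The sketch below is not the original funx: label_col is a hypothetical helper that keeps only the naming logic and leaves out the dip test so that it runs without diptest. It reuses the max(x) < 400 rule from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: decides the label a column should get.
# `name` is supplied by cur_column() in the pipeline below.
label_col <- function(x, name) {
  use_log <- max(x) < 400
  tibble(
    Parameter = if (use_log) paste0("log10_", name) else name,
    Transform = use_log
  )
}

res <- mtcars %>%
  select(cyl, disp, hp, drat) %>%
  # one list-column per input column, each holding a one-row tibble
  summarise(across(everything(), ~ list(label_col(.x, cur_column())))) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  unnest(value)

res
```

The p-value column could be added inside label_col in the same way as in funx; the only new piece here is passing cur_column() as an extra argument.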
Answering in a separate post as the solution is different. To get the column names in a print(), I would pass them into the function and use purrr::map_dfr to build a data frame of the results. The small changes I made are to take the column name, col_name, as an argument and to specify the data frame. I tried a few approaches to grab the column name using your original function but was unsuccessful.
logtest_pval <- function(col, df) {
col_name <- col
x <- df %>% pull(!!col)
logchoice <- max(x) < 400
logtest <- log10(x)
remaining <- length(logtest[which(!is.na(logtest) & is.finite(logtest))])
x <- if(remaining > 0.75*length(x)) {suppressWarnings(log10(x)) } else { x }
x <- x[which(!is.na(x) & is.finite(x))]
y <- diptest::dip.test(x)
z <-
tibble(
transform = logchoice,
column = ifelse(logchoice, paste0("log10_", col_name), col_name),
pvalue = y$p.value
)
print(paste0(z, collapse = " | "))
return(z)
}
Then you can build your dataframe:
purrr::map_dfr(
.x = names(mtcars), # the columns to use
.f = logtest_pval, # the function to use
df = mtcars # additional arguments needed
)
Here's another example:
df <-
mtcars %>%
select_if(is.numeric)
pvalues <-
map_dfr(names(df), logtest_pval, df)
Related
Suppose I have multiple data frames with the same prefixes and same structure.
mydf_1 <- data.frame('fruit' = 'apples', 'n' = 2)
mydf_2 <- data.frame('fruit' = 'pears', 'n' = 0)
mydf_3 <- data.frame('fruit' = 'oranges', 'n' = 3)
I have a for-loop that grabs all the tables with this prefix, and appends those that match a certain condition.
res <- data.frame()
for(i in mget(apropos("^mydf_"), envir = .GlobalEnv)){
if(sum(i$n) > 0){
res <- rbind.data.frame(res, data.frame('name' = paste0(i[1]),
'n' = sum(i$n)))
}
}
res
This works fine, but I want my 'res' table to identify the name of the original data frame itself in the 'name' column, instead of the column name. My desired result is:
name n
1 mydf_1 2
2 mydf_3 3
The closest I have gotten to solving this issue is:
'name' = paste0(substitute(i))
instead of
'name' = paste0(i[1])
but it just returns 'i'.
Any simple solution? Base preferred but not essential.
As mentioned in the comments, it is better to put data frames into a list, as that makes them much easier to handle and manipulate. However, we could still grab the data frames from the global environment, get the sum for each one, then bind them together and add the data frame name as a column.
library(tidyverse)
df_list <-
do.call("list", mget(grep("^mydf_", names(.GlobalEnv), value = TRUE))) %>%
map(., ~ .x %>% summarise(n = sum(n))) %>%
discard(~ .x$n == 0) %>%
bind_rows(., .id = "name")
Or we could use map_dfr to bind together and summarise, then filter out the 0 values:
map_dfr(mget(ls(pattern = "^mydf_")), ~ c(n = sum(.x$n)), .id = "name") %>%
filter(n != 0)
Output
name n
1 mydf_1 2
2 mydf_3 3
To bind a list of data.frames and store the list names as a new column, a convenient way is to set the arg .id in dplyr::bind_rows().
library(dplyr)
mget(apropos("^mydf_")) %>%
bind_rows(.id = "name") %>%
count(name, wt = n) %>%
filter(n > 0)
# name n
# 1 mydf_1 2
# 2 mydf_3 3
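Since the question says base R is preferred, here is a minimal base-only sketch of the same idea: collect the data frames into a named list with mget(), so the names travel with the data, and build the result from that list. The variable names env, dfs, and totals are illustrative only.

```r
mydf_1 <- data.frame(fruit = "apples", n = 2)
mydf_2 <- data.frame(fruit = "pears", n = 0)
mydf_3 <- data.frame(fruit = "oranges", n = 3)

# The environment where the mydf_* objects live (the global one at top level)
env <- environment()

# Named list of all data frames whose name starts with "mydf_"
dfs <- mget(ls(envir = env, pattern = "^mydf_"), envir = env)

# One total per data frame; vapply keeps the list names
totals <- vapply(dfs, function(d) sum(d$n), numeric(1))

res <- data.frame(name = names(totals), n = unname(totals))
res <- res[res$n > 0, ]
rownames(res) <- NULL
res
```

The loop in the question loses the names because for() iterates over the list elements only; iterating over names(dfs), or using vapply as above, keeps them available.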
I am trying to run a function of the following structure:
my_fun <- function(new_var_names, input_var_names, df){
df %>%
mutate(...)
}
Where an indefinite number of new variables are generated. Their names use each element of a character vector new_var_names which has variable length. They are generated using as inputs the variables named in the character vector input_var_names using some functions f1 and f2.
Also ideally this should be done within mutate and without using mapping or looping procedures.
Is it possible to adapt mutate to do this?
Thanks in advance.
Here's an approach with across and rename_at. Who knows when rename will have across functionality added:
my_fun <- function(new_var_names, input_var_names, df){
df %>%
mutate(across(.cols = one_of(input_var_names), .names = "New.{.col}",
~ . * 100)) %>%
rename_at(vars(paste0("New.",input_var_names)),~new_var_names)
}
my_fun(c("NewX1","NewX2"),c("X1","X2"),data)
X1 X2 X3 X4 NewX1 NewX2
1 76.512308 59.52818 35.45349 53.071453 7651.2308 5952.818
2 90.432867 53.60952 55.91350 87.441985 9043.2867 5360.952
3 82.226977 39.00973 14.58712 87.100901 8222.6977 3900.973
4 8.071753 32.63577 78.70822 3.345176 807.1753 3263.577
5 1.385738 81.03024 88.79939 97.613080 138.5738 8103.024
6 6.167663 5.15003 21.20549 49.532196 616.7663 515.003
7 86.789458 37.01053 77.29167 39.527862 8678.9458 3701.053
8 58.048272 85.80310 60.03993 42.337941 5804.8272 8580.310
9 32.070415 70.09671 95.80930 10.199656 3207.0415 7009.671
10 95.987113 68.76416 16.71015 17.019112 9598.7113 6876.416
You can replace ~ . * 100 with whatever function you want.
Sample Data:
data <- data.frame(replicate(4,runif(10,1,100)))
Sure you can. Look at this:
data <- data.frame(L=letters,X=runif(26),Y=rnorm(26))
# Make a Cumsum in X and Y at the same time.
my_fun <- function(new_var_names, input_var_names, df) {
df %>% mutate(across(.cols = all_of(input_var_names),
.fns = cumsum,
.names = paste0(new_var_names,'{.col}') ))
}
my_fun(new_var_names="cumsum_",input_var_names=c("X","Y"),df=data)
# Make a Cumsum and CumMax in X and Y at the same time.
my_fun <- function(new_var_names, input_var_names, df) {
df %>% mutate(across(.cols = all_of(input_var_names),
.fns = list(CumSum=cumsum,CumMax=cummax),
.names = paste0(new_var_names,'{.fn}_{.col}') ))
}
my_fun(new_var_names="New_",input_var_names=c("X","Y"),df=data)
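If new_var_names should hold one complete name per input column, rather than a shared prefix, a possible variant is to let across() write temporary names and then rename them through a named lookup vector passed to all_of(). This is only a sketch: my_fun_renamed, the "tmp_" prefix, the default f, and the toy data are all made up for illustration.

```r
library(dplyr)

# Hypothetical variant: new_var_names[i] becomes the name of f(input_var_names[i])
my_fun_renamed <- function(new_var_names, input_var_names, df,
                           f = function(x) x * 100) {
  stopifnot(length(new_var_names) == length(input_var_names))
  tmp_names <- paste0("tmp_", input_var_names)
  df %>%
    mutate(across(all_of(input_var_names), f, .names = "tmp_{.col}")) %>%
    # lookup vector: names are the new names, values are the current names
    rename(all_of(setNames(tmp_names, new_var_names)))
}

toy <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9)
out <- my_fun_renamed(c("NewX1", "NewX2"), c("X1", "X2"), toy)
out
```

Everything stays inside mutate() and rename(), with no explicit loop or map call.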
I am trying to write a function so that I can input any columns to be described both at the overall level and by a grouping variable.
However, I am having trouble with getting output for grouped results.
My data:
df <- data.frame(gender=c("m", "f", "m","m"), age=c("18-22","23-32","23-32","50-60"), income=c("low", "low", "medium", "high"), group=c("A", "A", "B", "B"))
> df
gender age income group
1 m 18-22 low A
2 f 23-32 low A
3 m 23-32 medium B
4 m 50-60 high B
Function:
library(dplyr)
make_sum <- function(data=df, cols, group_var) {
data %>% dplyr::select(cols) %>%
# print tables with frequency and proportions
apply(2, function(x) {
n <- table(x, useNA = "no")
prop=round(n/length(x[!is.na(x)])*100,2)
print(cbind(n, prop))
})
# print tables by group
data %>% dplyr::select(cols, vars(group_var)) %>%
apply(2, function(x) {
n <- table(x, vars(group_var),useNA = "no")
print(n)
})
}
cols <- df %>% dplyr::select(gender,age, income) %>% names()
make_sum(data=df, cols=cols, group_var="group")
I get the proper output for the overall tables but not the grouped, with this error showing:
Error: `vars(group_var)` must evaluate to column positions or names, not a list
Desired output (example) for grouped gender variable:
A B
f 1 0
m 1 2
Instead of using apply with MARGIN = 2, summarise_all can be called here. Also, the vars() wrapper is meant to be used inside tidyverse functions (the scoped verbs such as summarise_at), so it cannot be passed to table(). Here, in order to get the frequencies, a more direct option is to subset the grouping column with [[. And because summarise returns only a single row (one per group, if there is a grouping variable), we can wrap the output in a list:
make_sum <- function(data=df, cols, group_var) {
data %>%
dplyr::select(all_of(cols)) %>%
summarise_all(~ {
n <- table(., data[[group_var]], useNA = "no")
#list(round(n/length(.[!is.na(.)])*100,2))
list(n)
})
}
cols <- df %>%
dplyr::select(gender,age, income) %>%
names()
out <- make_sum(data=df, cols=cols, group_var="group")
out$gender
#[[1]]
#. A B
# f 1 0
# m 1 2
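For comparison, the same cross-tabulations can be sketched without summarise_all by looping over the column names with lapply() and subsetting with [[. The name make_tables is hypothetical; the data is the df from the question.

```r
df <- data.frame(
  gender = c("m", "f", "m", "m"),
  age    = c("18-22", "23-32", "23-32", "50-60"),
  income = c("low", "low", "medium", "high"),
  group  = c("A", "A", "B", "B")
)

# Hypothetical helper: one contingency table per column in `cols`,
# cross-tabulated against the column named by `group_var`
make_tables <- function(data, cols, group_var) {
  lapply(setNames(cols, cols), function(col) {
    table(data[[col]], data[[group_var]], useNA = "no")
  })
}

out <- make_tables(df, cols = c("gender", "age", "income"), group_var = "group")
out$gender
```

The result is a named list, so each table can be pulled out by column name.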
I am trying to apply a custom function to a data.frame row by row, but I can't figure out how to apply the function row by row. I'm trying rowwise() as in the simple artificial example below:
library(tidyverse)
my_fun <- function(df, col_1, col_2){
df[,col_1] + df[,col_2]
}
dff <- data.frame("a" = 1:10, "b" = 1:10)
dff %>%
rowwise() %>%
mutate(res = my_fun(., "a", "b"))
However, the data does not get passed row by row. How can I achieve that?
dplyr's rowwise() puts the row-output (.data) as a list of lists, so you need to use [[. You also need to use .data rather than ., because . is the entire dff, rather than the individual rows.
my_fun <- function(df, col_1, col_2){
df[[col_1]] + df[[col_2]]
}
dff %>%
rowwise() %>%
mutate(res = my_fun(.data, 'a', 'b'))
You can see what .data looks like with the code below
dff %>%
rowwise() %>%
do(res = .data) %>%
.[[1]] %>%
head(1)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 1
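When the row-wise computation only needs a few columns, another option (assuming dplyr 1.0.0 or later) is c_across(), which gathers the selected columns of the current row into a vector. A minimal sketch with the same toy data:

```r
library(dplyr)

dff <- data.frame(a = 1:10, b = 1:10)

out <- dff %>%
  rowwise() %>%
  # c_across() returns the values of a and b for the current row only
  mutate(res = sum(c_across(c(a, b)))) %>%
  ungroup()

out
```

For a plain sum like this one, rowwise() is not strictly needed, since a + b is already vectorised; the pattern is more useful for functions that only accept one row's worth of values at a time.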
I am trying to write a function that will (in part) rename a variable by combining its source dataframe and existing variable name. In essence, I want:
df1 <- data.frame(a = 1, b = 2)
to become:
df1 %>%
rename(df1_a = a)
# df1_a b
#1 1 2
But I want to do this programmatically, something along the lines of:
fun <- function(df, var) {
outdf <- rename_(df, paste(df, var, sep = "_") = var)
return(outdf)
}
This admittedly naive approach obviously doesn't work, but I haven't been able to figure it out. I'm sure the answer is somewhere in the nse vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html), but that doesn't seem to address constructing variable names.
Not sure if this is the proper dplyr-esque way, but it'll get you going.
fun <- function(df, var) {
x <- deparse(substitute(df))
y <- deparse(substitute(var))
rename_(df, .dots = with(df, setNames(as.list(y), paste(x, y, sep = "_"))))
}
fun(df1, a)
# df1_a b
# 1 1 2
fun(df1, b)
# a df1_b
# 1 1 2
lazyeval isn't really needed here because the environment of both inputs is known. That being said:
library(lazyeval)
library(dplyr)
library(magrittr)
fun = function(df, var) {
df_ = lazy(df)
var_ = lazy(var)
fun_(df_, var_)
}
fun_ = function(df_, var_) {
new_var_string =
paste(df_ %>% as.character %>% extract(1),
var_ %>% as.character %>% extract(1),
sep = "_")
dots = list(var_) %>% setNames(new_var_string)
df_ %>%
lazy_eval %>%
rename_(.dots = dots)
}
fun(df1, a)
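Both answers above rely on rename_() and lazyeval, which have since been deprecated in dplyr. A sketch of the same idea with current tidy evaluation tools follows; prefix_var is a hypothetical name, and the approach assumes the data frame is passed as a bare object name (inside a pipe, deparse(substitute(df)) would not see the original name).

```r
library(dplyr)

# Hypothetical helper: rename `var` to "<data frame name>_<var>"
prefix_var <- function(df, var) {
  df_name  <- deparse(substitute(df))            # e.g. "df1"
  var_name <- rlang::as_name(rlang::ensym(var))  # e.g. "a"
  new_name <- paste(df_name, var_name, sep = "_")
  # !! injects the constructed name on the left of :=,
  # {{ }} forwards the bare column name on the right
  rename(df, !!new_name := {{ var }})
}

df1 <- data.frame(a = 1, b = 2)
out <- prefix_var(df1, a)
out
```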