I can use !! to filter by a user-given variable but not to modify that same variable. The following function throws an error when created, but it works just fine if I delete the mutate call.
avg_dims <- function(x, y) {
y <- enquo(y)
x %>%
filter(!!y != "TOTAL") %>%
mutate(!!y = "MEAN") %>%
group_by(var1, var2)
}
Inside mutate, an unquoted name on the left-hand side of an assignment has to be paired with the := operator rather than =, because R's parser does not accept !!y = ... as valid syntax. The name itself must be either a string or a symbol, so we can convert the quosure (y from enquo) to a string with quo_name and then unquote it with !!.
avg_dims <- function(x, y) {
y <- enquo(y)
y1 <- rlang::quo_name(y)
x %>%
filter(!!y != "TOTAL") %>%
mutate(!!y1 := "MEAN") %>%
group_by(var1, var2)
}
avg_dims(df1, varN)
data
set.seed(24)
df1 <- data.frame(var1 = rep(LETTERS[1:3], each = 4),
var2 = rep(letters[1:2], each = 6),
varN = sample(c("TOTAL", "hello", 'bc'), 12, replace = TRUE),
stringsAsFactors = FALSE)
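As a side note, newer rlang releases (0.4.3 and later, as far as I know) also accept the embrace operator, so the enquo/quo_name pair can be skipped. This is only a minimal sketch under that assumption; avg_dims2 and demo are names made up for illustration, and the group_by step is left out to keep it short.

```r
library(dplyr)

# Hypothetical variant of avg_dims: {{ y }} forwards the column on the
# right-hand side, and the glue-style string "{{ y }}" names it on the left.
avg_dims2 <- function(x, y) {
  x %>%
    filter({{ y }} != "TOTAL") %>%
    mutate("{{ y }}" := "MEAN")
}

# Small made-up data set, not the df1 from the answer
demo <- data.frame(var1 = c("A", "A", "B"),
                   varN = c("TOTAL", "hello", "bc"),
                   stringsAsFactors = FALSE)

res <- avg_dims2(demo, varN)
res
```

The := operator is still required here; only the way the name is supplied changes.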
I'm working on making a function to create tables and I need to have some conditional rules involved for formatting. One will be based on a column name, however when I send it down using as.formula it seems to be over doing it. I've made an example here:
library(tidyverse)
library(rlang)
library(flextable)
a <- as_tibble(x =cbind( Year = c(2018, 2019, 2020), a = 1:3,
b.1 = c("a", "b", "c"),
b.2 = c("d", "e", "f"),
fac = c("This", "This","That")))
foo <- function(x, y, z, ...){
y_var <- enquo(y)
x %>%
filter(Year %in% c(2018, 2019),
...) %>%
mutate(!!quo_name(y_var) := factor(!!y_var,
levels = z,
ordered = TRUE)) %>%
arrange(!!y_var)
}
to.table <- function(x, y, z, ...){
y_var <- enquo(y)
df.in <- foo(x=x,
y=!!y_var,
z= z)
cond <- paste("~!is.na(", quo_name(y_var),")")
cond.2 <- paste("~startsWith(colnames(", df.in, "),\"b\")")
flextable(df.in) %>%
bold(i = as.formula(cond),
part = "body") %>%
bg(i = as.formula(cond.2),
bg = "Red3",
j = as.formula(cond.2))
}
to.table(x=a,
y=Year,
z= c(2020,2018,2019),
fac == "This")
Error in startsWith(colnames(2:3), "b") : non-character object(s)
From the error I've been receiving, it looks like the expression is evaluated before it is passed to as.formula, since those two columns are the correct answer.
Proof:
df.in <- foo(x=a,
y=Year,
z= c(2020,2018,2019),
fac == "This")
startsWith(colnames(df.in), prefix = "b")
[1] FALSE FALSE TRUE TRUE FALSE
What am I missing here? If anyone has a solution, or suggestion on how to do things differently potentially using quosures or other tidyverse friendly methods I would much appreciate it.
Extension:
To make this a bit more clear, I may need to elaborate on my intended use of this example. I'm trying to figure out how to take names generated dynamically in a function represented as foo that start with a specified value (generally 3 columns), and then check those columns for a specified value that I can then highlight in a specific Color.
Additionally, the answer uses cond for both of the i = arguments, but in my real use case these are two separate conditions that will likely never overlap.
We could specify j with column names taken from the data that was created: startsWith returns a logical vector marking the column names that start with 'b', and subsetting names(df.in) with that vector using [ extracts those names (nm1).
to.table <- function(x, y, z, ...){
y_var <- enquo(y)
df.in <- foo(x=x,
y=!!y_var,
z= z)
cond <- as.formula(glue::glue('~ !is.na({quo_name(y_var)})'))
nm1 <- names(df.in)[startsWith(names(df.in), prefix = "b")]
flextable(df.in) %>%
bold(i = cond,
part = "body") %>%
bg(i = cond,
bg = "Red3",
j = nm1)
}
-testing
to.table(x=a,
y=Year,
z= c(2020,2018,2019),
fac == "This")
In the OP's post, the formula created for 'cond' is fine, although building it with glue is a bit more flexible. The second one, 'cond.2', instead returns
paste("~startsWith(colnames(", df.in, "),\"b\")")
[1] "~startsWith(colnames( 2:3 ),\"b\")" "~startsWith(colnames( c(\"1\", \"2\") ),\"b\")"
[3] "~startsWith(colnames( c(\"a\", \"b\") ),\"b\")" "~startsWith(colnames( c(\"d\", \"e\") ),\"b\")"
[5] "~startsWith(colnames( c(\"This\", \"This\") ),\"b\")"
This happens because df.in is a data.frame, and paste works through it column by column when attaching the startsWith(colnames( string. Each returned element contains the deparsed values of one column rather than the column names.
If we want the columns whose names start with either 'a' or 'b' highlighted in red, change startsWith to grep, which accepts a regex as its pattern
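The behaviour of paste on a data.frame can be reproduced in base R without flextable. This is just an illustrative sketch; df_demo is a made-up two-column data frame, not the df.in from the question.

```r
# Made-up data frame for illustration
df_demo <- data.frame(b.1 = c("a", "b"), n = 1:2, stringsAsFactors = FALSE)

# paste() coerces the data frame with as.character(), which deparses
# each column, so we get one string per column instead of one string
out <- paste("colnames(", df_demo, ")")
out

# What was presumably intended: work with the column names themselves
nm_b <- names(df_demo)[startsWith(names(df_demo), "b")]
nm_b
```

So the formula string ends up being built from column contents, which is why the error mentions colnames(2:3).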
to.table <- function(x, y, z, ...){
y_var <- enquo(y)
df.in <- foo(x=x,
y=!!y_var,
z= z)
cond <- as.formula(glue::glue('~ !is.na({quo_name(y_var)})'))
nm1 <- grep("^(a|b)", names(df.in), value = TRUE)
flextable(df.in) %>%
bold(i = cond,
part = "body") %>%
bg(i = cond,
bg = "Red3",
j = nm1)
}
to.table(x=a,
y=Year,
z= c(2020,2018,2019),
fac == "This")
If we want to color based on the value of 'a'
to.table <- function(x, y, z, ...){
y_var <- enquo(y)
df.in <- foo(x=x,
y=!!y_var,
z= z)
cond <- as.formula(glue::glue('~ !is.na({quo_name(y_var)})'))
nm1 <- names(df.in)[startsWith(names(df.in), prefix = "b")]
flextable(df.in) %>%
bold(i = cond,
part = "body") %>%
bg(i = ~ a == 2,
bg = "Red3",
j = nm1)
}
to.table(x=a,
y=Year,
z= c(2020,2018,2019),
fac == "This")
I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector
You can use curly braces, which stop %>% from also inserting the left-hand side as the first argument:
data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
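To see why the braces matter, here is a small sketch with a made-up helper (count_args is hypothetical and only counts what it receives). When the dot appears only inside a nested expression such as .[1] or .$x, magrittr still passes the left-hand side as the first argument; wrapping the call in braces turns that off.

```r
library(magrittr)

# Hypothetical helper: returns how many arguments it was called with
count_args <- function(...) length(list(...))

# Without braces the call becomes count_args(., .[1]): two arguments
n_plain <- 1:3 %>% count_args(.[1])

# With braces only the arguments written explicitly are passed: one argument
n_braced <- 1:3 %>% { count_args(.[1]) }

c(plain = n_plain, braced = n_braced)
```

In the question, that extra first argument is the whole data frame, which shifts the other arguments of pairwise.t.test and likely explains the match.arg(p.adjust.method) error.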
Background
Using rlang I've a simple summary function for dplyr that counts a number of missing observations within a variable per provided groups. I would like to return the results in a descending order of grouping variables.
Sample data
library("tidyverse")
set.seed(123)
test_data <- tibble(dates = seq.Date(
from = as.Date.character(x = "01-01-2000", format = "%d-%m-%Y"),
to = as.Date.character(x = "31-12-2010", format = "%d-%m-%Y"),
by = "day"
)) %>%
transmute(
t_year = lubridate::year(dates),
t_mnth = lubridate::month(dates),
t_day = lubridate::day(dates),
tst_var = if_else(rnorm(n()) > .8, NA_real_, rnorm(n()))
)
Summary function
Working version
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var)))
}
Desired results
test_data %>%
group_by(t_year, t_mnth) %>%
summarise(num_missing = sum(is.na(tst_var))) %>%
arrange(desc(t_year), desc(t_mnth))
Problem
Implementing the arrange(desc(x)) call so that it handles each of the variables initially passed via enquos. I.e. if 5 grouping variables are passed via enquos, this should be equivalent to arrange(desc(var1), ..., desc(var5)).
Attempt
Naturally, this doesn't work:
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var))) %>%
# Desc call should be created for each of the group variables
arrange(desc(!!!group_by_vars))
}
You can use arrange_at like this:
quick_smry <- function(df, x, ...) {
group_by_vars <- enquos(...)
check_var <- enquo(x)
df %>%
group_by(!!!group_by_vars) %>%
summarise(num_missing = sum(is.na(!!check_var))) %>%
arrange_at(group_by_vars, desc)
}
quick_smry(test_data, tst_var, t_year, t_mnth)
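The scoped verbs such as arrange_at are marked as superseded in dplyr 1.0 and later, so here is a sketch of an alternative that builds one desc() call per grouping variable and splices them into arrange. The names quick_smry2 and toy are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Hypothetical variant: wrap every captured grouping variable in desc()
quick_smry2 <- function(df, x, ...) {
  group_by_vars <- enquos(...)
  desc_calls <- lapply(group_by_vars,
                       function(q) rlang::quo(dplyr::desc(!!q)))
  df %>%
    group_by(!!!group_by_vars) %>%
    summarise(num_missing = sum(is.na({{ x }})), .groups = "drop") %>%
    arrange(!!!desc_calls)
}

# Tiny made-up data set instead of test_data
toy <- data.frame(yr  = c(2000, 2000, 2001, 2001),
                  mn  = c(1, 2, 1, 2),
                  val = c(NA, 1, NA, NA))

res <- quick_smry2(toy, val, yr, mn)
res
```

This mirrors arrange(desc(t_year), desc(t_mnth)) from the desired results, with the list of variables supplied through the dots.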
I frequently work with data frames and have to run some sophisticated data wrangling / manipulations by subgroup that is defined in one of the columns. I am aware of dplyr and group_by and know that many things could be solved using group_by. However, often I have to do some pretty intricate calculations and end up just using the 'for' loop.
I was wondering about the existence of some other general approach or paradigm that is faster/more elegant. Maybe map (that I am not very familiar with)?
Below is an example. Notice - it is fake and meaningless. So let's ignore why I need to do those things or the fact that there could be 2 consecutive NAs in a column, etc. That's not the focus of my question. The point is that often I have to operate "within the constraints of a subgroup" and then - inside that subgroup - I have to do operations columnwise, rowwise and sometimes even cellwise.
I also realize that I could probably put most of that code inside a function, split my data frame into a list based on 'group', apply this function to each element of that list and then do.call(rbind...) at the end. But is this the only way?
Thanks a lot for any hints!
library(dplyr)
library(forcats)
set.seed(123)
x <- tibble(group = c(rep('a', 10), rep('b', 10), rep('c', 10)),
attrib = c(sample(c("one", "two", "three", "four"), 10, replace = T),
sample(c("one", "two", "three"), 10, replace = T),
sample(c("one", "three", "four"), 10, replace = T)),
v1 = sample(c(1:5, NA), 30, replace = T),
v2 = sample(c(1:5, NA), 30, replace = T),
v3 = sample(c(1:5, NA), 30, replace = T),
n1 = abs(rnorm(30)), n2 = abs(rnorm(30)), n3 = abs(rnorm(30)))
v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)
results <- NULL # Placeholder for final results
for(i in seq(length(unique(x$group)))) { # loop through groups
mygroup <- unique(x$group)[i]
mysubtable <- x %>% filter(group == mygroup)
# IMPUTE NAs in v columns
# Replace every NA with a mean of values above and below it; and if it's the first or
# the last value, with the mean of 2 values below or above it.
for (v in v_vars){ # loop through v columns
which_nas <- which(is.na(mysubtable[[v]])) # create index of NAs for column v
if (length(which_nas) == 0) next else {
for (na in which_nas) { # loop through indexes of column values that are NAs
if (na == 1) {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na + 1],
mysubtable[[v]][na + 2]), na.rm = TRUE)
} else if (na == nrow(mysubtable)) {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 2],
mysubtable[[v]][na - 1]), na.rm = TRUE)
} else {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 1],
mysubtable[[v]][na + 1]), na.rm = TRUE)
}
} # end of loop through NA indexes
} # end of else
} # end of loop through v vars
# Aggregate v columns (mean) for each value of column 'attrib'
result1 <- mysubtable %>% group_by(attrib) %>%
summarize_at(v_vars, mean)
# Aggregate n columns (sum) for each value of column 'attrib'
result2 <- mysubtable %>% group_by(attrib) %>%
summarize_at(n_vars, sum)
# final result should contain the name of the group
results[[i]] <- cbind(mygroup, result1, result2[-1])
}
results <- do.call(rbind, results)
Maybe this example is too simple, but in this case, the only thing you need to pull out is the imputation.
my_impute <- function(x) {
which_nas <- which(is.na(x))
for (na in which_nas) {
if (na == 1) {
x[na] <- mean(c(x[na + 1], x[na + 2]), na.rm = TRUE)
} else if (na == length(x)) {
x[na] <- mean(c(x[na - 2], x[na - 1]), na.rm = TRUE)
} else {
x[na] <- mean(c(x[na - 1], x[na + 1]), na.rm = TRUE)
}
}
x
}
Then you just need to group appropriately and impute and summarize.
x2 <- x %>% group_by(group) %>% mutate_at(v_vars, my_impute) %>%
group_by(group, attrib)
full_join(x2 %>% summarize_at(v_vars, mean),
x2 %>% summarize_at(n_vars, sum))
My usual method for things like this, where similar calculations need to be on a bunch of columns, is to put it in long format. Here it feels a little like the long way round, but perhaps this would be useful to see.
x %>% mutate(row=1:n()) %>% gather("variable", "value", c(v_vars, n_vars)) %>%
separate(variable, c("var", "x"), sep=1) %>% spread(var, value) %>%
arrange(group, x, row) %>% group_by(group, x) %>%
mutate(v=my_impute(v)) %>% group_by(group, attrib, x) %>%
summarize(v=mean(v), n=sum(n)) %>%
gather("var", "value", v, n) %>% mutate(X=paste0(var, x)) %>%
select(-x, -var) %>% spread(X, value)
More generally, split-apply-combine is probably the way to go, as you suggest in your question; here's a way using the tidyverse.
doX <- function(x) {
x2 <- x %>% mutate_at(v_vars, my_impute) %>% group_by(attrib)
full_join(x2 %>% summarize_at(v_vars, mean),
x2 %>% summarize_at(n_vars, sum))
}
x %>% group_by(group) %>% nest() %>%
mutate(result=map(data, doX)) %>% select(-data) %>% unnest()
The more traditional method is with do.call, split, and rbind; here I don't make the effort to keep the group information.
do.call(rbind, lapply(split(x, x$group), doX))
The first thing to do is to change your data imputing into a function. I made some simple modifications to have it accept a vector and simplified the call to mean.
fx_na_rm <- function(z) {
which_nas <- which(is.na(z))
if (length(which_nas) > 0) {
for (na in which_nas) { # loop through indexes of column values that are NAs
if (na == 1) {
z[na] <- mean(z[na + (1:2)], na.rm = TRUE)
} else if (na == length(z)) {
z[na] <- mean(z[na - (1:2)], na.rm = TRUE)
} else {
z[na] <- mean(z[c(na - 1, na + 1)], na.rm = TRUE)
}
} # end of loop through NA indexes
}
return(z)
}
I like data.table so here's a solution that uses it. Now since you use different functions for the n and v variable groups, most purrr or any other solutions will also be a little funny.
library(data.table)
dt <- copy(as.data.table(x))
v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)
dt[, (v_vars) := lapply(.SD, as.numeric), .SDcols = v_vars]
dt[, (v_vars) := lapply(.SD, fx_na_rm), by = group, .SDcols = v_vars]
# see https://stackoverflow.com/questions/50626316/r-data-table-apply-function-a-to-some-columns-and-function-b-to-some-others
scols <- list(v_vars, n_vars)
funs <- rep(c(mean, sum), lengths(scols))
dt[, setNames(Map(function(f, x) f(x), funs, .SD), unlist(scols))
, by = .(group,attrib)
, .SDcols = unlist(scols)]
The for loop itself is difficult to vectorize because each imputed value can depend on values imputed earlier in the loop. Here is my attempt, which does not produce output identical to yours:
# not identical
fx_na_rm2 <- function(z) {
which_nas <- which(is.na(z))
if (length(which_nas) > 0) {
ind <- c(rbind(which_nas - 1 + 2 * (which_nas == 1) + -1 * (which_nas == length(z)),
which_nas + 1 + 1 * (which_nas == 1) + -2 * (which_nas == length(z))))
z[which_nas] <- colMeans(matrix(z[ind], nrow = 2), na.rm = T)
}
return(z)
}
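The difference between the two versions shows up with consecutive NAs. Here is a stripped-down sketch (interior NAs only, so the first/last special cases are left out; the variable names are made up) that compares filling sequentially with filling from a snapshot of the original vector.

```r
z <- c(1, NA, NA, 4)
idx <- which(is.na(z))

# Sequential fill, as in the for loop: later fills see earlier fills
seq_fill <- z
for (i in idx) {
  seq_fill[i] <- mean(seq_fill[c(i - 1, i + 1)], na.rm = TRUE)
}

# Snapshot fill, as in a vectorized version: every fill uses the original z
snap_fill <- z
snap_fill[idx] <- sapply(idx, function(i) mean(z[c(i - 1, i + 1)], na.rm = TRUE))

seq_fill   # 1.0 1.0 2.5 4.0
snap_fill  # 1 1 4 4
```

The third element differs because the loop averages the freshly imputed second value with 4, while the snapshot version only sees the NA and 4.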
I would like to programmatically set a column name for the dplyr::top_n function.
getSubset <- function(df, t, f) {
df %>%
top_n(t, wt = eval(as.name(f), envir = df))
}
data.frame(x = 1:20, y = 20:1) %>%
getSubset(10, "x")
And it tells me that Error: object 'f' not found. I tried to play with lazyeval package but somehow I keep misunderstanding the concept. Could somebody push me in the right direction? Thanks!
Your problem is that top_n uses non-standard evaluation. It's trying to evaluate the expression eval(as.name(f), envir = df) in the context of the data frame df, and f doesn't exist in that environment.
One workaround would be to temporarily add the desired value of wt to the data frame.
getSubset <- function(df, t, f) {
df %>%
mutate(.wt = eval(as.name(f), envir = df)) %>%
top_n(t, wt = .wt) %>%
select(-.wt)
}
data.frame(x = 1:20, y = 20:1) %>%
getSubset(10, "x")
Another approach would be to use interp from the lazyeval package.
getSubset <- function(df, t, f) {
call <- quote(df %>%
top_n(t, wt = .wt))
call <- lazyeval::interp(call, .wt = eval(as.name(f), envir = df))
eval(call)
}
data.frame(x = 1:20, y = 20:1) %>%
getSubset(10, "x")
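With a current dplyr, the same thing can probably be done without lazyeval by using the .data pronoun, which looks a column up by its name given as a string. This is a sketch assuming dplyr 1.0.0 or newer, where slice_max is available; getSubset2 is a made-up name.

```r
library(dplyr)

# Hypothetical variant: f is a string, .data[[f]] picks that column
getSubset2 <- function(df, t, f) {
  df %>%
    slice_max(order_by = .data[[f]], n = t)
}

res <- data.frame(x = 1:20, y = 20:1) %>%
  getSubset2(10, "x")
res
```

Unlike top_n, slice_max returns the rows sorted by the ordering column, so the row order may differ from the versions above.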