How to speed up case_when/conditional mutate? - r

One can notice that case_when does not evaluate the formula on the subset of the tibble where the condition is met, but on the whole tibble. An example:
picks = c(1:3)
a = tibble(id = c(1:4),
k = NA)
a %>% mutate(
k = case_when(id %in% picks~runif(length(picks)))
)
This is clear in the error:
Error: Problem with `mutate()` column `k`. i `k = case_when(id %in% picks ~ runif(length(picks)))`. x `id %in% picks ~ runif(length(picks))` must be length 4 or one, not 3.
An alternative would be to use rowwise() or group_by(id), but that would still be highly inefficient. I would probably still opt for rowwise(), but since I only have to perform operations on 1% of the tibble, I just want a mutate within that 1%, leaving everything else untouched. Any suggestions to make R perform the minimal number of evaluations?
I thought about a combination of filter and join, but that would not work for a tidygraph object, for example: by filtering the nodes, one would filter out edges too, so local_members would no longer work properly.
EDIT:
Also, in my experience, it seems that base::ifelse is faster than dplyr::case_when; is that expected?

dplyr::if_else is faster than base::ifelse. You also need T ~ NA_real_ in the case_when together with n():
picks = c(1:3)
a = tibble(id = c(1:4),
k = NA)
a %>% mutate(
k = case_when(
id %in% picks~runif(n()),
T ~ NA_real_
)
)
# A tibble: 4 x 2
id k
<int> <dbl>
1 1 0.0757
2 2 0.708
3 3 0.255
4 4 NA
# Would be faster with if_else:
a %>% mutate(
k = if_else(id %in% picks,
runif(n()),
NA_real_
)
)
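Note that in both versions above runif(n()) still draws one number per row of the tibble, and the draws for the unmatched rows are thrown away. If the goal is to evaluate the right-hand side only for the matching rows, one option is plain subset assignment. A minimal sketch in base R (the name idx is only illustrative, and a data.frame is used so the example has no dependencies; the same assignment should also work on a tibble):

```r
picks <- 1:3
a <- data.frame(id = 1:4, k = NA_real_)

# Logical index of the rows that should be updated
idx <- a$id %in% picks

# Draw only as many random numbers as there are matching rows
a$k[idx] <- runif(sum(idx))

a
```

Here runif() is called for three values rather than four, and row 4 keeps its NA.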

Related

More efficient way to compute mean for subset

In this dataframe:
df <- data.frame(
comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
hbr = c(101:110)
)
Let's say I need to compute the mean of hbr for the first stretch where comp == "story". How would I do that more efficiently than this way, which seems bulky and long-winded and requires that I manually specify the grp I want to compute the mean for:
library(dplyr)
library(data.table)
df %>%
mutate(grp = rleid(comp)) %>%
summarise(M = mean(hbr[grp==2]))
M
1 103.5
I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':
df %>%
mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
filter(grp == min(grp, na.rm = TRUE)) %>%
summarise(M = mean(hbr))
#> M
#> 1 103.5
In base R, you can select the desired rows using cumsum and diff, and then choosing which group you need (here it's the first, so 1), and then compute the mean on those rows. With this option, you don't need to get the group you need manually and you don't require any additional packages.
idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
#[1] 2 3 4 5
mean(df$hbr[first])
#[1] 103.5
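To make the cumsum/diff step easier to follow, here is a small sketch that wraps it in a helper and generalises it to the k-th run. The function name mean_of_run and its arguments are made up for illustration; the comments show the intermediate values for the example data.

```r
df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr = 101:110
)

# Mean of `value` over the k-th consecutive run where `key == target`
mean_of_run <- function(key, value, target, k = 1) {
  idx <- which(key == target)          # positions of all matches: 2 3 4 5 8 9 10
  run <- cumsum(c(1, diff(idx) != 1))  # a gap in positions starts a new run: 1 1 1 1 2 2 2
  mean(value[idx[run == k]])
}

first_story  <- mean_of_run(df$comp, df$hbr, "story")         # uses rows 2 to 5
second_story <- mean_of_run(df$comp, df$hbr, "story", k = 2)  # uses rows 8 to 10
```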

Creating a loop in R for a function

I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~ var1 == 1, level = 0.95, design = design, deff = "replace") * 100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish approach (various flavours of applying, mapping, walking ...).
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
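Applied to the question, the counting part could look roughly like the sketch below. The toy data frame and the vector vars are invented for illustration, and the svyciprop() call is left out because it needs a survey design object that is not shown in the question.

```r
# Hypothetical toy data standing in for the 150 real columns
df <- data.frame(
  var1 = c(0, 1, 1, 0, 1),
  var2 = c(1, 1, 0, 0, 0)
)
vars <- c("var1", "var2")

# For each column name, count the zeros (N) and the ones (n)
counts <- sapply(vars, function(v) {
  c(N = sum(df[[v]] == 0), n = sum(df[[v]] == 1))
})

# One row per variable, one column per statistic
t(counts)
```

sapply() returns a matrix here because every call returns a named vector of the same length, so the result can be indexed by statistic and by variable name.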
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it. I created a function eval() that does all the work. First, it keeps only the factor columns. Then it converts the factors to numeric variables so that the values can be summed as 0 and 1; otherwise the sums would be based on the internal factor codes 1 and 2.
Within the function I create another function neg() to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then I create the dataframes "n" (count of the positives), "N" (count of the negatives), and PREV95. I used pivot_longer to get the data into a long format so that each stat you are looking for will be in its own column when merged together. I then used left_join to combine these dataframes and return "results".
Note that I had to leave PREV95 out because I do not have a 'design' object to pass to the function. I commented it out, but you can remove the comment markers to add it back in; the same goes for the commented-out version of the join that includes PREV95. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.
The function eval() takes your original dataframe as input. It returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6

Checking if all factors within a factor are unique, then if so, returning that factor. If not, returning a third value. R

First time posting here! Been struggling with this for about two days but I have a dataframe that looks like this:
code.1 <- factor(c(rep("x",3), rep("y",2), rep("z",3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- cbind.data.frame(type.1, code.1)
df
And am trying to get it to return this:
code.2 <- factor(c("x", "y", "z"))
type.2 <- factor(c("multiple", "multiple", "large"))
df2 <- cbind.data.frame(type.2, code.2)
df2
I've tried all manner of If/Else and apply functions grouping by "code" to return these results but am stuck. Any help appreciated!
You can do that with dplyr: you group by code.1, then all you have to do is to summarize type.1 with an if/else: if there is only a single value, you return it, else you return "multiple".
The code is slightly more complicated because of practical considerations: the factor needs to be converted to character, and because if_else is vectorized, the TRUE branch must return a single value even when the condition is FALSE (hence first(type.1)):
df %>%
group_by(code.1) %>%
summarize(type.2 = if_else(n_distinct(type.1) == 1,
as.character(first(type.1)),
"multiple"),
type.2 = as.factor(type.2))
# A tibble: 3 x 2
# code.1 type.2
# <fct> <fct>
# 1 x multiple
# 2 y multiple
# 3 z large
EDIT: here is a different formulation of the same approach without converting to character, might be better suited for large problems, and might give a different view of the same question:
# default value when multiple
iffalse <- as.factor("multiple")
df %>%
group_by(code.1) %>%
mutate(type.1 = factor(type.1, levels = c(levels(type.1), levels(iffalse)))) %>% # add possible level to type.1
summarize(type.2 = if_else(n_distinct(type.1) == 1,
first(type.1),
iffalse))
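For comparison, the same logic can be sketched in base R with tapply(), which avoids the vectorized if_else because an ordinary if/else runs once per group. The helper name collapse_type is made up for this example.

```r
code.1 <- factor(c(rep("x", 3), rep("y", 2), rep("z", 3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- data.frame(type.1, code.1)

# Return the single type if a group has only one, otherwise "multiple"
collapse_type <- function(x) {
  u <- unique(as.character(x))
  if (length(u) == 1) u else "multiple"
}

# One result per level of code.1, named by that level
res <- tapply(df$type.1, df$code.1, collapse_type)

df2 <- data.frame(
  type.2 = factor(as.vector(res)),
  code.2 = factor(names(res))
)
df2
```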

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
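Another way to add the summary variables independently, without computing them for every row first, is to collect the summary expressions in a named list and splice that list into a single summarise() call with !!!. This is only a sketch: the list name summ_exprs is made up, the example runs against a local data frame, and whether dbplyr translates it for a given backend would need checking.

```r
library(dplyr)

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2), grp = c(1, 1, 2))

# Collect one unevaluated expression per requested summary
summ_exprs <- list()
if (summarize_num) summ_exprs$n <- quote(n())
if (summarize_num_distinct) summ_exprs$n_unique <- quote(n_distinct(val))

# Splice the named expressions into a single summarise() call
summ <- data %>%
  group_by(grp) %>%
  summarise(!!!summ_exprs)
```

Each condition contributes at most one line, so the number of branches grows linearly with the number of conditions instead of combinatorially.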
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
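In current dplyr versions summarise_at() is superseded by across(), which accepts the same kind of named list of functions. A sketch of the dynamic selection with across(), assuming dplyr 1.0.0 or later; the variable name selected is made up, and .names is set so the result columns carry just the function name:

```r
library(dplyr)

data <- data.frame(val = c(1, 2, 2))
summarize_num_distinct <- FALSE
summarize_num <- TRUE

# All possible aggregations, then the subset that was requested
fcts <- list(n_unique = n_distinct, n = length)
selected <- fcts[c(summarize_num_distinct, summarize_num)]

summ <- data %>%
  summarise(across(val, selected, .names = "{.fn}"))
```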

filtering within the summarise function of dplyr

I am struggling a little with dplyr because I want to do two things at once and wonder if it is possible.
I want to calculate the mean of values and, at the same time, the mean of the values that have a specific value in another column.
library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
tp=letters[1:14],
value_type=sample(LETTERS[1:3], 140, replace=TRUE),
values=runif(140))
df %>%
group_by(id, tp) %>%
summarise(
all_mean=mean(values),
A_mean=mean(values), # Only the values with value_type A
value_count=sum(value_type == 'A')
)
So the A_mean column should calculate the mean of values where value_type == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
group_by(id, tp) %>%
summarise(all_mean = mean(values),
A_mean = mean(values[value_type=="A"]),
value_count=sum(value_type == 'A'))
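This works because inside summarise() each column is just the vector for the current group, so it can be subset with a logical condition before being passed to mean(). A small sketch on invented toy data (the data frame toy is not from the question) that also shows what happens when a group has no 'A' rows:

```r
library(dplyr)

# Toy data: group 1 has two "A" rows, group 2 has none
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  value_type = c("A", "B", "A", "B", "B"),
  values = c(1, 2, 3, 4, 5)
)

res <- toy %>%
  group_by(id) %>%
  summarise(
    all_mean = mean(values),
    A_mean = mean(values[value_type == "A"]),  # mean over the "A" rows only
    value_count = sum(value_type == "A")
  )
```

For the group without any 'A' rows the subset is empty, so A_mean is NaN and value_count is 0.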
You can do this with two summary steps:
df %>%
group_by(id, tp, value_type) %>%
summarise(A_mean = mean(values)) %>%
summarise(all_mean = mean(A_mean),
A_mean = sum(A_mean * (value_type == "A")),
value_count = sum(value_type == "A"))
The first summary calculates the means per value_type and the second "sums" only the mean of value_type == "A"
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation provides quite a good example of this, I think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting thing here is the predicate function. It represents the rule by which the columns to be summarized are selected.
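The scoped variants such as summarise_if() are superseded in dplyr 1.0.0 and later; the same predicate idea can be written with across() and where(). A small sketch on the built-in iris data (my choice of example data, not taken from the documentation):

```r
library(dplyr)

# Apply mean() to every column for which is.numeric() returns TRUE
res <- iris %>%
  summarise(across(where(is.numeric), mean))
```

The factor column Species is skipped because is.numeric() returns FALSE for it.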
