Using case_when with dplyr across - r

I'm trying to translate a mutate_at() call into mutate() using dplyr's new across() function, and I'm a bit stumped.
In a nutshell, I need to compare the values in a series of columns to a "baseline" column. When the values in the columns are higher than the baseline, I need to use the baseline value. When the values in the columns are lower than or equal to the baseline, I need to keep the value. Here's an example dataset (my actual dataset is much larger):
test <- structure(list(baseline = c(5, 7, 8, 4, 9, 1, 0, 46, 47), bob = c(7,
11, 34, 9, 6, 8, 3, 49, 12), sally = c(3, 5, 2, 2, 6, 1, 3, 4,
56), rita = c(6, 4, 6, 7, 6, 0, 3, 11, 3)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(baseline = structure(list(), class = c("collector_double",
"collector")), bob = structure(list(), class = c("collector_double",
"collector")), sally = structure(list(), class = c("collector_double",
"collector")), rita = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
My current code uses mutate_at() and works fine:
trial1 <- test %>%
mutate_at(
vars('bob','sally', 'rita'),
funs(case_when(
. > baseline ~ baseline,
. <= baseline ~ .)))
But when I try to update it to reflect across() from dplyr 1.0, I keep getting an error. Here is my attempt:
trial2 <- test %>%
mutate(across(c(bob, sally, rita),
case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
And here is the error:
error: Problem with mutate() input ..1.
x . > baseline ~ baseline, . <= baseline ~ . must be length 36 or one, not 9, 4.
ℹ Input ..1 is across(...)
Any ideas what I might be doing wrong? Does case_when() work with across?

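across() expects a function, so we can use ~ to wrap the case_when() call in an anonymous (lambda) function. Without the ~, case_when() is evaluated immediately and . refers to the whole data frame coming through the pipe (9 rows x 4 columns = 36 values), which is why the error mentions length 36.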
library(dplyr)
test %>%
mutate(across(c(bob, sally, rita),
~ case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
According to ?across, the .fns argument can be any of the following:
Functions to apply to each of the selected columns. Possible values are:
NULL, to return the columns untransformed.
A function, e.g. mean.
A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
Also, instead of case_when(), we can use pmin(), which returns the element-wise minimum of baseline and each column:
test %>%
mutate(across(c(bob, sally, rita), ~ pmin(baseline, .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
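As a rough sketch of the point about .fns (assuming dplyr 1.0 or later is installed; the data is rebuilt here as a plain tibble, and the names with_lambda and with_function are illustrative only), the purrr-style lambda and an ordinary anonymous function passed to across() should give the same result:

```r
library(dplyr)

# Same values as the example data, without the readr spec attributes
test <- tibble(
  baseline = c(5, 7, 8, 4, 9, 1, 0, 46, 47),
  bob      = c(7, 11, 34, 9, 6, 8, 3, 49, 12),
  sally    = c(3, 5, 2, 2, 6, 1, 3, 4, 56),
  rita     = c(6, 4, 6, 7, 6, 0, 3, 11, 3)
)

# purrr-style lambda: .x is the current column, baseline comes from the data
with_lambda <- test %>%
  mutate(across(c(bob, sally, rita),
                ~ case_when(.x > baseline ~ baseline,
                            .x <= baseline ~ .x)))

# ordinary anonymous function doing the same thing with pmin()
with_function <- test %>%
  mutate(across(c(bob, sally, rita), function(col) pmin(col, baseline)))

all.equal(with_lambda, with_function)
```

Either form works because across() receives a function to call once per column, rather than an already-evaluated expression.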

Related

How to bootstrap dataset in R which is blocked by a factor?

I want to perform bootstrap on this data set. Notice that the data has two factors: replicate and level, and two variables high.density and low.density that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.
replicate level high.density low.density
1 low 14 36
1 low 54 31
1 mid 82 10
1 mid 24 NA
2 low 12 28
2 low 11 45
2 mid 12 17
2 mid NA 24
2 up 40 10
2 up NA 5
2 up 20 2
For instance, in replicate/ level: 1/low the low.density 31 and 36 can be interchanged (or high.density interchanged) so the head of that dataset may look like:
replicate level high.density low.density
1 low 14 31
1 low 54 36
1 mid 82 10
1 mid 24 NA
I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.
##DATA FRAME (credits: caldwellst)
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level))
We may exploit split and do the sampling according to unique combinations of replicate and level. We could repeat this process B times.
df_shuffle <- function(DF) {
my_split <- split(DF, f = ~ DF$replicate + DF$level)
shuffle <- lapply(my_split, function(x) {
nrX <- nrow(x)
cbind(x[, c('replicate', 'level')],
high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
})
DF_new <- do.call(rbind, shuffle)
rownames(DF_new) <- NULL
return(DF_new)
}
B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
replicate level high.density low.density
1 1 low 54 36
2 1 low 54 36
3 2 low 12 45
4 2 low 12 28
5 1 mid 24 10
6 1 mid 82 10
7 2 mid NA 17
8 2 mid 12 17
9 2 up 20 10
10 2 up 40 10
11 2 up 20 5
Because the original data contains missing observations, we either have to multiply impute them or listwise delete them. For now, let's use the latter option.
# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
Finally, we perform a linear regression on each shuffled dataset and store the B coefficients in out.
row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)
## out <- row_bind(
## lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
## family = binomial())$coef)
## )
# -------------------------------------------------------------------
> dim(out)
[1] 1000 2
Output
> head(out)
X.Intercept. low.density
1 13.74881 0.2804738
2 20.01074 -0.2095672
3 30.26643 -0.2946373
4 29.19541 -0.2752761
5 37.76273 -0.4555651
6 37.72250 -0.1548349
The code required to create this image can be found here.
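One common way to summarise bootstrap coefficients like those stored in out is a percentile interval. The following is a minimal base R sketch under stated assumptions: it rebuilds the data as a plain data frame, uses a smaller B of 200 to keep it quick, relies on lm() dropping incomplete rows under the default na.action, and the helper name shuffle_blocks and the 95% level are choices made for illustration, not part of the answer above.

```r
set.seed(42)  # fixed seed only so the sketch is reproducible

df <- data.frame(
  replicate = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)),
  level = factor(c("low", "low", "mid", "mid", "low", "low",
                   "mid", "mid", "up", "up", "up")),
  high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20),
  low.density  = c(36, 31, 10, NA, 28, 45, 17, 24, 10, 5, 2)
)

# Resample each density column with replacement inside every
# replicate/level block (hypothetical helper mirroring df_shuffle)
shuffle_blocks <- function(d) {
  blocks <- split(d, list(d$replicate, d$level), drop = TRUE)
  shuffled <- lapply(blocks, function(b) {
    n <- nrow(b)
    data.frame(
      high.density = b$high.density[sample(seq_len(n), replace = TRUE)],
      low.density  = b$low.density[sample(seq_len(n), replace = TRUE)]
    )
  })
  do.call(rbind, shuffled)
}

B <- 200L
# One row of coefficients per bootstrap sample; lm() omits rows with NA
coefs <- t(replicate(
  B, coef(lm(high.density ~ low.density, data = shuffle_blocks(df)))
))

# 95% percentile interval for the intercept and the slope
ci <- apply(coefs, 2, quantile, probs = c(0.025, 0.975))
ci
```

With so few observations per block the interval will be wide, so this is best read as a template for the workflow rather than a result.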
Here's a solution using dplyr, purrr, and tidyr. First nest the numeric data, then sample from the unique combinations of replicate and level. Within each sampled combination, draw one value of each density at random, then unnest to get the final data frame.
# library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)
df %>%
nest(data = ends_with("density")) %>%
slice_sample(n = 500, replace = TRUE) %>%
mutate(data = map(data, ~summarize(.x, across(everything(), ~ sample(.x, size = 1))))) %>%
unnest(cols = data)
#> # A tibble: 500 × 4
#> replicate level high.density low.density
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 low 54 31
#> 2 2 mid 12 24
#> 3 1 mid 24 10
#> 4 2 up 20 2
#> 5 2 mid 12 24
#> 6 2 mid 12 24
#> 7 1 mid 82 10
#> 8 2 up NA 2
#> 9 1 low 14 36
#> 10 2 mid 12 17
#> # … with 490 more rows
Data
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
level = c("low", "low", "mid", "mid", "low", "low", "mid",
"mid", "up", "up", "up"), high.density = c(14, 54, 82, 24,
12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(replicate = structure(list(), class = c("collector_double",
"collector")), level = structure(list(), class = c("collector_character",
"collector")), high.density = structure(list(), class = c("collector_double",
"collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

Replacing values of a column in R dataframe

I have a data frame named C0001 with 3671 observations of 31 variables. I want to apply a check on each value of one variable named Y. If the value of that variable is greater than 30, then replace it with 30 otherwise keep the existing value. I wrote the following in R but it gives me an error:
C0001 <- read.csv("C0001.csv")
C0001$Y<- ifelse(C0001$Y > 30, 30, C0001$Y)
Error in ans[npos] <- rep(no, length.out = len)[npos] :
replacement has length zero
In addition: Warning message:
In rep(no, length.out = len) : 'x' is NULL so the result will be NULL
Could someone help me with what mistake I am making here? Is there some other way to do the same operation without using ifelse?
Thank you
Try replacing read.csv() with read_csv(), and check your working directory as well. read_csv() imports the data as a tibble, while read.csv() returns a regular data frame. The error indicates that the input to ifelse() is either NULL or a zero-length vector, so make sure the column you are indexing actually exists under that exact name (column names are case-sensitive).
library(readr)
C0001 <- read_csv("C:/Users/Desktop//C0001.csv")
C0001
> C0001
# A tibble: 6 x 3
x y z
<dbl> <dbl> <dbl>
1 2 40 4
2 3 12 5
3 45 12 6
4 1 50 7
5 1 50 30
6 1 0 0
C0001$y<- ifelse(C0001$y > 30, 30, C0001$y)
C0001
# A tibble: 6 x 3
x y z
<dbl> <dbl> <dbl>
1 2 30 4
2 3 12 5
3 45 12 6
4 1 30 7
5 1 30 30
6 1 0 0
Data sample:
structure(list(x = c(2, 3, 45, 1, 1, 1), y = c(30, 12, 12, 30,
30, 0), z = c(4, 5, 6, 7, 30, 0)), row.names = c(NA, -6L), spec = structure(list(
cols = list(x = structure(list(), class = c("collector_double",
"collector")), y = structure(list(), class = c("collector_double",
"collector")), z = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Use vectorized subset assignment like this:
C0001$Y[C0001$Y > 30] <- 30
This replaces only the values above 30 and leaves the rest unchanged, so ifelse() is not needed.
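As a small illustration (a sketch on made-up values, assuming the column really is named Y; the variable names capped_* are hypothetical), three common ways to cap a column at 30 agree with each other, and asking for a column under the wrong name returns NULL, which is the kind of input that triggers the "replacement has length zero" error:

```r
# Toy data standing in for the real CSV
C0001 <- data.frame(x = c(2, 3, 45, 1, 1, 1),
                    Y = c(40, 12, 12, 50, 50, 0))

# A misspelled or wrongly cased column name silently gives NULL
is.null(C0001$y)

# 1. ifelse(): choose 30 where the condition holds, else keep the value
capped_ifelse <- ifelse(C0001$Y > 30, 30, C0001$Y)

# 2. logical subset assignment: overwrite only the offending elements
capped_index <- C0001$Y
capped_index[capped_index > 30] <- 30

# 3. pmin(): element-wise minimum of the column and the cap
capped_pmin <- pmin(C0001$Y, 30)

identical(capped_ifelse, capped_index)
identical(capped_ifelse, capped_pmin)
```

Running names(C0001) right after reading the file is a quick way to confirm how the column is actually spelled.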

How to add new column and calculate recursive cum using dplyr and shift

I have a dataset (in reality it has more than 100 groups),
and I want to use dplyr to create a variable y for each group, setting the first value of y to 1 and computing each later value from the previous row:
second y = 1 * first x + 2 * first y
The result would be:
I tried creating a column y filled with 1 and then using
df %>% group_by(group) %>% mutate(var = shift(x) + 2 * shift(y)) %>% ungroup()
but this always uses the initial y value of 1, so the formula becomes
second y = 1 * first x + 2 * 1
Could someone give me some ideas about this? Thank you!
The dput of my result data is:
structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b" ), x =
c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 8, 19, 42, 1, 8, 23)),
row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame" ))
To perform such a calculation we can use accumulate() from purrr or Reduce() in base R.
Since you are already using dplyr, we can use accumulate(), which feeds each result into the next step:
library(dplyr)
df %>%
group_by(group) %>%
mutate(y1 = purrr::accumulate(x[-n()], ~.x * 2 + .y, .init = 1))
# group x y y1
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 3 3
#3 a 3 8 8
#4 a 4 19 19
#5 a 5 42 42
#6 b 6 1 1
#7 b 7 8 8
#8 b 8 23 23
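For comparison, the Reduce() route mentioned above might look roughly like this in base R. It is only a sketch: the helper name recur and the use of ave() for the per-group step are choices made here, and the data is the dput from the question rebuilt as a plain data frame.

```r
df <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1, 3, 8, 19, 42, 1, 8, 23)
)

# y[1] = init; y[i] = 2 * y[i - 1] + x[i - 1]
recur <- function(x, init = 1) {
  Reduce(function(prev_y, prev_x) 2 * prev_y + prev_x,
         x[-length(x)],          # the last x is never used
         init, accumulate = TRUE)
}

# ave() applies recur() within each group and keeps the row order
df$y1 <- ave(df$x, df$group, FUN = recur)

all.equal(df$y1, df$y)
```

Dropping the last x and supplying an initial value keeps the result the same length as the group, which is the same trick as x[-n()] with .init = 1 in the accumulate() version.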

Fill multiple columns in a R dataframe [duplicate]

I have a dataframe called flu that is a count of cases (n) by group per week.
flu <- structure(list(isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5), group = c("fluA",
"fluB", "fluA", "fluB", "fluA", "fluB", "fluA", "fluA", "fluB"
), n = c(5, 6, 3, 5, 12, 14, 6, 23, 25)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(isoweek = structure(list(), class = c("collector_double",
"collector")), group = structure(list(), class = c("collector_character",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
In the data set there are some rows where zero cases are not reported in the data so there are no NA values to work with.
I have identified a fix for this to fill down missing weeks with zeros.
flu %>% complete(isoweek, nesting(group), fill = list(n = 0))
My problem is that this only works for the weeks of data reported. For example, at weeks 6, 7, 8 etc if there are no cases reported I have no data.
How can I extend this fill down process to extend the data frame with zeros for isoweeks 6 to 10 (for example) and have a corresponding fluA and fluB for each week with a zero value for each isoweek/group pair?
You can expand multiple columns in complete(). For example, if you need data up to week 8, you can do:
tidyr::complete(flu, isoweek = 1:8, group, fill = list(n = 0))
# A tibble: 16 x 3
# isoweek group n
# <dbl> <chr> <dbl>
# 1 1 fluA 5
# 2 1 fluB 6
# 3 2 fluA 3
# 4 2 fluB 5
# 5 3 fluA 12
# 6 3 fluB 14
# 7 4 fluA 6
# 8 4 fluB 0
# 9 5 fluA 23
#10 5 fluB 25
#11 6 fluA 0
#12 6 fluB 0
#13 7 fluA 0
#14 7 fluB 0
#15 8 fluA 0
#16 8 fluB 0
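To make it clearer what complete() is doing here, the same idea can be sketched in base R: build every isoweek/group pair, left-join the observed counts, and fill the gaps with zero. This is an illustrative equivalent rather than the tidyr implementation; the names grid and full are hypothetical, and weeks 1 to 10 are used because that is the range mentioned in the question.

```r
flu <- data.frame(
  isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  group = c("fluA", "fluB", "fluA", "fluB", "fluA", "fluB",
            "fluA", "fluA", "fluB"),
  n = c(5, 6, 3, 5, 12, 14, 6, 23, 25)
)

# Every combination of week 1..10 and each observed group
grid <- expand.grid(isoweek = 1:10, group = unique(flu$group),
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, attach counts where they exist
full <- merge(grid, flu, by = c("isoweek", "group"), all.x = TRUE)

# Unreported week/group pairs become explicit zeros
full$n[is.na(full$n)] <- 0
full <- full[order(full$isoweek, full$group), ]
full
```

The tidyr one-liner is shorter, but spelling out the steps shows why the week range has to be supplied explicitly: weeks that never appear in the data cannot be inferred from it.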

Dataframe: Divide each group by a vector corresponding to each group in R?

I have a data frame like this:
df1 <- structure(list(user_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), param_a = c(123,
2.3, -9, 1, -0.03333, 4, -41, -12, 0.89)), .Names = c("user_id",
"param_a"), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
and another dataframe of vectors:
df2 <- structure(list(user_id = c(1, 2, 3), param_b = c(34, 12, -0.89
)), .Names = c("user_id", "param_b"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to divide each group in df1 by the corresponding value in df2.
For example, for the group of user 1, divide each row by the first value of param_b:
user_id param_a
1 123/34
1 2.3/34
1 -9/34
2 1/12
2 -0.03333/12
2 4/12
....
For user 2, divide each row by the second value of param_b, and so on.
Please advise: how can I divide a data frame grouped by user by a per-group value?
P.S.
If df1 is extended with param_a, param_k, param_p
and df2 is extended accordingly with param_b, param_l, param_r,
how can I perform this kind of operation? @nicola suggested a very nice solution, but I want to extend it.
Something like this?
library(dplyr)
df1 %>%
left_join(df2) %>%
mutate(result = param_a / param_b)
Joining, by = "user_id"
# A tibble: 9 x 4
user_id param_a param_b result
<dbl> <dbl> <dbl> <dbl>
1 1 123 34 3.62
2 1 2.3 34 0.0676
3 1 -9 34 -0.265
4 2 1 12 0.0833
5 2 -0.0333 12 -0.00278
6 2 4 12 0.333
7 3 -41 -0.89 46.1
8 3 -12 -0.89 13.5
9 3 0.89 -0.89 -1
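For the extended case raised in the P.S., one possible approach is to look up each user's row in df2 once and then loop over numerator/denominator column pairs. The sketch below is base R on small made-up data: the param_k and param_l values, the pairs lookup vector, and the _ratio suffix are all hypothetical and only there to show the pattern.

```r
# Hypothetical extended inputs: one extra numerator and denominator column
df1 <- data.frame(user_id = c(1, 1, 2, 2),
                  param_a = c(123, 2.3, 1, 4),
                  param_k = c(10, 20, 30, 40))
df2 <- data.frame(user_id = c(1, 2),
                  param_b = c(34, 12),
                  param_l = c(5, 10))

# Which df2 column divides which df1 column (name = numerator)
pairs <- c(param_a = "param_b", param_k = "param_l")

# Row of df2 that belongs to each row of df1
idx <- match(df1$user_id, df2$user_id)

for (num in names(pairs)) {
  den <- pairs[[num]]
  df1[[paste0(num, "_ratio")]] <- df1[[num]] / df2[[den]][idx]
}
df1
```

The same pairing could be expressed with a join followed by one mutate() per pair; the explicit lookup vector just keeps the mapping between columns in one place.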
