Add two columns simultaneously via mutate - r

I would like to use dplyr::mutate to add two named columns to a dataframe simultaneously and with a single function call. Consider the following example:
library(dplyr)
n <- 1e2; M <- 1e3
variance <- 1
x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)
dat <- data.frame(s = s, x = x)
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}
ci_studclt(x)
Trying something like the below returns an error, since obviously two values are produced and cannot be inserted into a single atomic-type column.
dat %>%
group_by(s) %>%
mutate(ci = ci_studclt(x, variance))
It seems one option is to insert a list column and then unnest_wider(), and that this is easier with data.table or in the specific case of splitting a string column into two new columns.
In my example, a confidence interval (lower and upper bound) comes out of a function, and I would like to add both bounds directly as new columns to dat, e.g. calling the columns ci_lower and ci_upper.
Is there a straightforward way of doing this with dplyr or do I need to insert the elements as a list column then unnest?
NB Keep in mind that the confidence interval values are a function of a group of simulated values x, grouped by s; the CI values should be constant within a group.
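For reference, the list-column route would look roughly like this (a sketch assuming tidyr is available; unnest_wider() with names_sep gives ci_1/ci_2, which then still need renaming):
library(tidyr)
dat %>%
  group_by(s) %>%
  mutate(ci = list(ci_studclt(x))) %>%
  unnest_wider(ci, names_sep = "_") %>%
  rename(ci_lower = ci_1, ci_upper = ci_2)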

You can do this by having your function (or a wrapper function) return a data.frame. When you call it in mutate, don’t specify a column name (or else you’ll end up with a nested data.frame column). If you want to specify names for the new columns, you can include them as function arguments as in the below.
library(dplyr)
n <- 1e2; M <- 1e3
variance <- 1
x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)
dat <- data.frame(s = s, x = x)
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}
ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
ci <- ci_studclt(x, alpha = alpha)
out <- data.frame(ci[[1]], ci[[2]])
names(out) <- names_out
out
}
# original code was ci_studclt(x, variance)
# but ci_studclt() doesn't take a variance argument, so I omitted it
dat %>%
group_by(s) %>%
mutate(ci_wrapper(x))
output:
# A tibble: 100,000 x 4
# Groups: s [1,000]
s x ci_lower ci_upper
<int> <dbl> <dbl> <dbl>
1 1 0.233 -0.223 0.139
2 1 1.03 -0.223 0.139
3 1 1.53 -0.223 0.139
4 1 0.0150 -0.223 0.139
5 1 -0.211 -0.223 0.139
6 1 -1.13 -0.223 0.139
7 1 -1.51 -0.223 0.139
8 1 0.371 -0.223 0.139
9 1 1.80 -0.223 0.139
10 1 -0.137 -0.223 0.139
# ... with 99,990 more rows
With specified column names:
dat %>%
group_by(s) %>%
mutate(ci_wrapper(x, names_out = c("ci.lo", "ci.hi")))
output:
# A tibble: 100,000 x 4
# Groups: s [1,000]
s x ci.lo ci.hi
<int> <dbl> <dbl> <dbl>
1 1 0.233 -0.223 0.139
2 1 1.03 -0.223 0.139
3 1 1.53 -0.223 0.139
4 1 0.0150 -0.223 0.139
5 1 -0.211 -0.223 0.139
6 1 -1.13 -0.223 0.139
7 1 -1.51 -0.223 0.139
8 1 0.371 -0.223 0.139
9 1 1.80 -0.223 0.139
10 1 -0.137 -0.223 0.139
# ... with 99,990 more rows

If you get your function to return a two-column data frame with repeated values of the same length as the input, then this becomes very easy:
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
res <- mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
data.frame(lower = rep(res[1], length(x)), upper = res[2])
}
dat %>%
group_by(s) %>%
mutate(ci_studclt(x))
#> # A tibble: 100,000 x 4
#> # Groups: s [1,000]
#> s x lower upper
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.767 -0.147 0.293
#> 2 1 -0.480 -0.147 0.293
#> 3 1 -1.31 -0.147 0.293
#> 4 1 0.219 -0.147 0.293
#> 5 1 0.650 -0.147 0.293
#> 6 1 0.542 -0.147 0.293
#> 7 1 -0.249 -0.147 0.293
#> 8 1 2.22 -0.147 0.293
#> 9 1 -0.239 -0.147 0.293
#> 10 1 0.176 -0.147 0.293
#> # ... with 99,990 more rows

Another possible variation, if you don't want to change your ci_studclt function:
dat %>%
group_by(s) %>%
mutate(
across(x,
.fns = list(
lower = ~ci_studclt(.)[1],
upper = ~ci_studclt(.)[2]
)
)
)
In this case the output will also contain new x_lower and x_upper columns. This variant also scales: if you want to apply the function to another column y as well, just replace x with c(x, y) and you will get y_lower and y_upper columns in dat too, as in the sketch below.
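A sketch of that scaled-up version (hypothetical, since dat in the example above has no column y):
dat %>%
  group_by(s) %>%
  mutate(
    across(c(x, y),
      .fns = list(
        lower = ~ci_studclt(.)[1],
        upper = ~ci_studclt(.)[2]
      )
    )
  )
# adds x_lower, x_upper, y_lower and y_upper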
UPDATE
Actually, everything Allan did in his answer can be done inside the mutate() call, without modifying the original function (set_names() below comes from purrr):
library(purrr)
dat %>%
group_by(s) %>%
mutate(
t(ci_studclt(x)) %>%
as.data.frame() %>%
set_names(c('ci_lower','ci_upper'))
)
We just transpose the output of ci_studclt(x) so that as.data.frame() treats it as a single row, then give this one-row data frame the desired names.

Related

R: How to get rid of rows depending on a calculation applied in a certain column, that refers to values of other columns?

I have a df with the columns subject (subj), condition, and reaction time (rt_link). Each row is one trial. I calculated the means and SDs of RTs per subject and condition:
RTs_overview <- links_RTs %>%
group_by(subj, condition) %>%
summarize(mean_rt = mean(rt_link),
sd_rt = sd(rt_link))
and it gives a nice df showing the subject, the condition, and the mean and SD of RTs, so there are 4 rows per subject, one for each condition.
I would now like to get rid of all rows where the RT is more than 3 SDs above or below the mean for the corresponding subject and condition. Earlier, I got rid of rows with RTs > 10 s with this:
links_RTs <- links_cl[links_cl$rt_link < 10, ]
I tried to do the same now, but unsuccessfully
links_RTs[links_RTs$rt_link < (lapply(mean(links4_RTs$rt_link) + 3 * sd(links4_RTs$rt_link))), ]
I also tried it without lapply, but this too did not work.
How can I delete every row where rt_link is more than 3 SDs above or below the corresponding RT mean (depending on the subject and condition)?
My suggestion is to retain decision-making criteria in the original dataset and then subset based on those criteria:
library(dplyr)
set.seed(99)
# Define raw data
dat <- data.frame(subj = rep(1:25, 8),
condition = rep(c("A","B", "C", "D"), 200),
rt_link = runif(200, 0, 1))
dat %>%
group_by(subj, condition) %>%
mutate(mean_rt = mean(rt_link)) %>%
mutate(sd_rt = sd(rt_link)) %>%
ungroup() %>%
mutate(upr = mean_rt + (sd_rt*3)) %>%
mutate(lwr = mean_rt - (sd_rt*3)) %>%
mutate(remove = ifelse(rt_link > upr | rt_link < lwr, 1, 0))
subj condition rt_link mean_rt sd_rt upr lwr remove
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 0.585 0.321 0.282 1.17 -0.524 0
2 2 B 0.114 0.462 0.372 1.58 -0.654 0
3 3 C 0.684 0.709 0.0264 0.788 0.630 0
4 4 D 0.993 0.541 0.482 1.99 -0.905 0
5 5 A 0.535 0.582 0.0505 0.734 0.431 0
6 6 B 0.967 0.827 0.149 1.27 0.380 0
7 7 C 0.671 0.508 0.174 1.03 -0.0147 0
8 8 D 0.295 0.442 0.158 0.916 -0.0315 0
9 9 A 0.358 0.249 0.117 0.600 -0.101 0
10 10 B 0.175 0.167 0.00939 0.195 0.138 0
# ... with 790 more rows
then (after assigning the result of the pipeline back to dat) filter:
mydat <- dat[dat$remove == 0,]
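If you would rather not keep the helper columns at all, the same logic can be collapsed into a single grouped filter (a sketch; note that groups whose sd() is NA, e.g. single-row groups, would be dropped by the NA comparison):
library(dplyr)
mydat <- dat %>%
  group_by(subj, condition) %>%
  filter(abs(rt_link - mean(rt_link)) <= 3 * sd(rt_link)) %>%
  ungroup()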

problem with `replace_na()` from tidyr package

I wrote a function that has five arguments to calculate random numbers from a normal distribution. It has two steps:
replace NA with 0 in tibble column
replace 0 with a random number
My problems are:
line three doesn't replace NA value with 0
line five doesn't replace 0 with a random number
I have this error :
! Must subset columns with a valid subscript vector.
x Subscript `col` has the wrong type `function`.
It must be logical, numeric, or character.
here is my code :
whithout=function(col,min,max,mean,sd){
for(i in 1:4267){
continuous_dataset=continuous_dataset %>% replace_na(continuous_dataset[,col]=0)
if(is.na(continuous_dataset[,col])){
continuous_dataset[i,col]=round(rtruncnorm(1,min,max,mean,sd))
}
}
}
There's no need to write a function that loops across both columns and observations.
I assume you have no zeroes in your dataset to begin with, in which case I can skip replacing NA with 0 and go straight to generating the replacement value.
My solution is based on the tidyverse.
First, generate some test data.
library(tidyverse)
set.seed(123)
df <- tibble(x=runif(5), y=runif(5), z=runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA
df
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 NA 0.892 0.678
4 0.883 NA 0.573
5 0.940 0.457 NA
Now solve the problem.
df %>%
mutate(
across(
everything(),
function(.x, mean, sd) .x <- ifelse(is.na(.x), rnorm(nrow(.), mean, sd), .x),
mean=500,
sd=100
)
)
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 467.
By avoiding looping through columns and rows, the code is more compact, more robust and (though I've not tested) faster.
If you don't want to process every column, simply replace everything() with a vector of columns that you do want to process. For example
df %>%
mutate(
across(
c(x, y),
function(.x, mean, sd) .x <- ifelse(is.na(.x), rnorm(nrow(.), mean, sd), .x),
mean=500,
sd=100
)
)
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 NA
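For completeness, since the question was about replace_na(): it takes a named list of per-column replacement values, so the question's first step (NA to 0) on its own would look roughly like this (a sketch):
library(tidyr)
df %>% replace_na(list(x = 0, y = 0, z = 0))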

Generating a column with the average value of rows before and after the row index

Given some data like the following:
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# A tibble: 12 x 2
# class value
# <chr> <dbl>
# 1 a -1.21
# 2 a 0.277
# 3 a 1.08
# 4 a -2.35
# 5 a 0.429
# 6 a 0.506
# 7 b 0.943
# 8 b 0.945
# 9 b 0.944
#10 b 0.911
#11 b 0.952
#12 b 0.900
I'm trying to generate a new column (context) that contains the average of "value" over the X preceding and X following rows, when possible. It would be desirable to have this by level of a factor in a different column. For example, for X = 2, I would expect something like the following:
# A tibble: 12 x 2
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.7135
# 4 a -2.35 0.573
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.9377
#10 b 0.911 0.9278
#11 b 0.952 NA
#12 b 0.900 NA
Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X=2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.
Similarly, rows 5 and 6 do not have a value of context, because these do not have two values afterwards belonging to the same level of the factor "class" (because row 7 is class="b" while 5 and 6 are class="a").
I do not know if this is even possible in R; I haven't found any similar questions, and I can only come up with solutions like the following one, which I don't think is idiomatic for this language.
My solution:
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
tmp <- df_list[[i]]
context <- vector()
for (j in 1:nrow(tmp)) {
if (j<=X | j>nrow(tmp)-X) context <- c(context, NA)
else {
values <- vector()
for (k in 1:X) {
values <- c(values, tmp$value[j-k], tmp$value[j+k])
}
context <- c(context, mean(values))
}
}
tmp <- tmp %>% dplyr::mutate(context=context)
result <- result %>% dplyr::bind_rows(tmp)
}
This gives an approximate solution to the one above (the differences are due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions developed to solve tasks like this one (e.g. vectorized functions)?
# this is your dataframe
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# pipes ('%>%') and grouping from the dplyr package
library(tidyverse)
# rolling mean function from the zoo package
library(zoo)
df %>% # take df
group_by(class) %>% # group it by class
mutate(context = (rollsum(value, 5, fill = NA) - value) / 4) # and calculate the rolling mean
Basically you calculate a rolling sum with a window width of 5, centered (the default), and fill the remaining positions with NA. Since the value of the current row should not be included in the average, it is subtracted before dividing by 4. A more general sketch for arbitrary X follows below.
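A sketch of the same idea parameterised by X, which also lets you create several context columns at once as asked in the question (the helper name context_col is just illustrative):
library(dplyr)
library(zoo)
# rolling mean of the X values before and after each row, excluding the row itself
context_col <- function(v, X) (rollsum(v, 2 * X + 1, fill = NA) - v) / (2 * X)
df %>%
  group_by(class) %>%
  mutate(context_1 = context_col(value, 1),
         context_2 = context_col(value, 2)) %>%
  ungroup()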
One way using dplyr (map_dbl() comes from purrr):
n <- 2
library(dplyr)
library(purrr)
df %>%
group_by(class) %>%
mutate(context = map_dbl(row_number(), ~ if(.x <= n | .x > (n() - n))
NA else mean(value[c((.x - n):(.x - 1), (.x + 1) : (.x + n))])))
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.712
# 4 a -2.35 0.574
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.938
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Here is a base R solution using ave(), i.e.,
df <- within(df,
context <- ave(value,
class,
FUN = function(v,X=2) sapply(seq(v), function(k) ifelse(k-X < 1 | k+X >length(v),NA,mean(v[c(k-(X:1),k + (1:X))])))))
such that
> df
# A tibble: 12 x 3
class value context
<chr> <dbl> <dbl>
1 a -1.21 NA
2 a 0.277 NA
3 a 1.08 -0.712
4 a -2.35 0.574
5 a 0.429 NA
6 a 0.506 NA
7 b 0.943 NA
8 b 0.945 NA
9 b 0.944 0.938
10 b 0.911 0.935
11 b 0.952 NA
12 b 0.900 NA

Transforming R dataframe by applying a function rowwise and creating (possibly) larger columns

I'm trying to transform a dataframe (tibble) by using each row as function arguments and creating a new column out of it, which may be longer than the number of arguments. Consider the following example, where I have some sample observations:
library(dplyr)
library(stringi)
observations <- c("110", "11011", "1100010")
df <- tibble(obs = observations) %>%
transmute(
Failure = stri_count(obs, fixed = "0"),
Success = stri_count(obs, fixed = "1")
)
df is then:
# A tibble: 3 x 2
Failure Success
<int> <int>
1 1 2
2 1 4
3 4 3
I would like to take every row and use that for calculating a bunch of values, and save each result vector in a new column. For example I would like to do:
p_values <- seq(from = 0, to = 1, length.out = 11)
df %>%
rowwise() %>%
transmute(
p = p_values,
likelihood = dbinom(Success,
size = Failure + Success,
prob = p_values
)
)
Error: Column `p` must be length 1 (the group size), not 11
And get something like:
# A tibble: 11 x 4
p_values likelihood_1 likelihood_2 likelihood_3
<float> <float> <float> <float>
1 0 ... ... ...
2 0.1 ... ... ...
... ... ... ... ...
10 0.9 ... ... ...
11 1 ... ... ...
I would actually switch into purrr for this. The function pmap() will iterate by row. We use ..1 and ..2 to signify the first and second inputs, respectively. Using pmap_dfc() will bind the results by columns (dfc = data frame columns).
library(purrr)
library(tibble)
df %>%
pmap_dfc(~ dbinom(..2, size = ..1 + ..2, prob = p_values)) %>%
set_names(paste0("likelihood_", seq_along(.))) %>%
add_column(p_values = p_values, .before = 1)
# A tibble: 11 x 4
p_values likelihood_1 likelihood_2 likelihood_3
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.1 0.027 0.00045 0.0230
3 0.2 0.096 0.0064 0.115
4 0.3 0.189 0.0284 0.227
5 0.4 0.288 0.0768 0.290
6 0.5 0.375 0.156 0.273
7 0.6 0.432 0.259 0.194
8 0.7 0.441 0.360 0.0972
9 0.8 0.384 0.410 0.0287
10 0.9 0.243 0.328 0.00255
11 1 0 0 0
This sort of workflow can be somewhat awkward with a tidyverse approach, as the data is not in a 'tidy' format.
I would come at it from the other angle, starting with the p_values vector:
likelihoods <-
tibble(p = p_values) %>%
mutate(likelihood_1 = dbinom(df[1,]$Success,size = df[1,]$Failure + df[1,]$Success,prob = p),
likelihood_2 = dbinom(df[2,]$Success,size = df[2,]$Failure + df[2,]$Success,prob = p),
likelihood_3 = dbinom(df[3,]$Success,size = df[3,]$Failure + df[3,]$Success,prob = p))
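If df had more than a handful of rows, the repetition could be avoided by building the columns programmatically; a rough sketch (the likelihood_i names are just illustrative):
library(purrr)
library(tibble)
likelihoods <- seq_len(nrow(df)) %>%
  map(~ dbinom(df$Success[.x],
               size = df$Failure[.x] + df$Success[.x],
               prob = p_values)) %>%
  set_names(paste0("likelihood_", seq_len(nrow(df)))) %>%
  as_tibble() %>%
  add_column(p = p_values, .before = 1)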
The issue is that transmute or mutate expects the number of elements to be the same as the number of rows (or, if the data is grouped, the number of rows in that group). Here we use rowwise(), which basically makes each row its own group, so the expected n() is 1, whereas the output has the length of p_values. One option is to wrap the results in a list, unnest, and reshape to 'wide' format with pivot_wider() if needed:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(grp = str_c('likelihood_', row_number())) %>%
rowwise() %>%
transmute(grp, p = list(p_values),
likelihood = list(dbinom(Success,
size = Failure + Success,
prob = p_values
))
) %>%
unnest(c(p, likelihood)) %>%
pivot_wider(names_from = grp, values_from = likelihood)
# A tibble: 11 x 4
# p likelihood_1 likelihood_2 likelihood_3
# <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0 0
# 2 0.1 0.027 0.00045 0.0230
# 3 0.2 0.096 0.0064 0.115
# 4 0.3 0.189 0.0284 0.227
# 5 0.4 0.288 0.0768 0.290
# 6 0.5 0.375 0.156 0.273
# 7 0.6 0.432 0.259 0.194
# 8 0.7 0.441 0.360 0.0972
# 9 0.8 0.384 0.410 0.0287
#10 0.9 0.243 0.328 0.00255
#11 1 0 0 0

R summarise with multiple evaluation metric functions that use actual and predicted from a data frame

I want to calculate multiple model evaluation metrics by groups for a data set. Each metric requires the input of actual (observed) and predicted values. These are columns in my data frame. My groups are represented by the variables iTime and an_id.
I can do the necessary calculations with summarise and much redundant typing, but there must be a purrr way to do this. I am trying to master purrr. I have tried invoke_map and pmap but could not figure out how to refer to the columns "actual" and "predicted" in my data frame.
A short example - there are more metrics needed:
library(Metrics)
library(dplyr)
df <- data.frame(an_id = c('G','J','J', 'J', 'G','G','J','G'),
iTime = c(1,1,2,2,1,2,1,2),
actual = c(1.28, 2.72,.664,.927,.711,1.16,.727,.834),
predicted = c(1.14,1.61,.475,.737,.715,1.15,.725,.90))
dataMetrics <- df %>%
group_by(an_id, iTime) %>%
summarise(vmae = mae(actual, predicted),
rae = rae(actual, predicted),
vrmse = rmse(actual, predicted))
> dataMetrics
A tibble: 4 x 5
an_id iTime vmae rae vrmse
<chr> <dbl> <dbl> <dbl> <dbl>
1 G 1 0.072 0.253 0.0990
2 G 2 0.038 0.233 0.0472
3 J 1 0.556 0.558 0.785
4 J 2 0.190 1.44 0.190
mae, rae and rmse come from the Metrics package; it's important to always explicitly state which packages you're using, otherwise your example isn't reproducible.
invoke_map is the way to map multiple functions to the same data. We can then combine that with nesting data and mapping invoke_map over the nested data.
I'll demonstrate with the sample data you give and by defining two functions f1 and f2:
f1 <- function(x, y) sum(abs(x - y))
f2 <- function(x, y) sum((x - y)^2)
library(tidyverse)
df %>%
group_by(an_id, iTime) %>%
nest() %>%
mutate(tmp = map(data, ~invoke_map_dfc(
list(f1 = f1, f2 = f2),
x = .x$actual, y = .x$predicted))) %>%
select(-data) %>%
unnest()
## A tibble: 4 x 4
# an_id iTime f1 f2
# <fct> <int> <dbl> <dbl>
#1 G 1 0.144 0.0196
#2 J 1 1.11 1.23
#3 J 2 0.381 0.0718
#4 G 2 0.01 0.0001
Explanation: We group observations by an_id and iTime, then nest the remaining data and use invoke_map_dfc inside map to apply f1 and f2 to data and store the result in columns of a nested tibble. The last step is removing the data column and un-nesting the summary stats.
Update
To reproduce your expected output
library(Metrics)
df %>%
group_by(an_id, iTime) %>%
nest() %>%
mutate(tmp = map(data, ~invoke_map_dfc(
list(vmae = mae, rae = rae, vrmse = rmse),
actual = .x$actual, predicted = .x$predicted))) %>%
select(-data) %>%
unnest()
## A tibble: 4 x 5
# an_id iTime vmae rae vrmse
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 G 1 0.072 0.253 0.0990
#2 J 1 0.556 0.558 0.785
#3 J 2 0.190 1.44 0.190
#4 G 2 0.038 0.233 0.0472
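Note that invoke_map() has since been retired in purrr. On a current dplyr/purrr, a rough equivalent (a sketch, not part of the original answer) is to map a named list of metric functions inside summarise(), which unpacks the resulting one-row data frame into columns:
library(dplyr)
library(purrr)
library(Metrics)
metric_fns <- list(vmae = mae, rae = rae, vrmse = rmse)
df %>%
  group_by(an_id, iTime) %>%
  summarise(map_dfc(metric_fns, ~ .x(actual, predicted)), .groups = "drop")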
Sample data
df <- read.table(text =
"an_id iTime actual predicted
G 1 1.28 1.14
J 1 2.72 1.61
J 2 0.664 0.475
J 2 0.927 0.737
G 1 0.711 0.715
G 2 1.16 1.15
J 2 0.727 0.725", header = T)
