problem with `replace_na()` from tidyr package - r

I wrote a function that takes five arguments and draws random numbers from a normal distribution. It has two steps:
1. replace NA with 0 in a tibble column
2. replace 0 with a random number
My problems are:
1. line three doesn't replace the NA values with 0
2. line five doesn't replace 0 with a random number
I get this error:
! Must subset columns with a valid subscript vector.
x Subscript `col` has the wrong type `function`.
It must be logical, numeric, or character.
Here is my code:
whithout=function(col,min,max,mean,sd){
for(i in 1:4267){
continuous_dataset=continuous_dataset %>% replace_na(continuous_dataset[,col]=0)
if(is.na(continuous_dataset[,col])){
continuous_dataset[i,col]=round(rtruncnorm(1,min,max,mean,sd))
}
}
}

There's no need to write a function that loops across both columns and observations.
I assume you have no zeroes in your dataset to begin with, in which case I can skip replacing NA with 0 and go straight to generating the replacement value.
My solution is based on the tidyverse.
First, generate some test data.
library(tidyverse)
set.seed(123)
df <- tibble(x=runif(5), y=runif(5), z=runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA
df
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 NA 0.892 0.678
4 0.883 NA 0.573
5 0.940 0.457 NA
Now solve the problem.
df %>%
  mutate(
    across(
      everything(),
      # keep observed values, draw a normal random number wherever the value is NA
      ~ ifelse(is.na(.x), rnorm(length(.x), mean = 500, sd = 100), .x)
    )
  )
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 467.
By avoiding looping through columns and rows, the code is more compact, more robust and (though I've not tested) faster.
If you don't want to process every column, simply replace everything() with a vector of the columns that you do want to process. For example:
df %>%
  mutate(
    across(
      c(x, y),
      # only x and y are processed; z keeps its NA
      ~ ifelse(is.na(.x), rnorm(length(.x), mean = 500, sd = 100), .x)
    )
  )
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 NA
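If only the missing cells need new values, one draw per NA is enough. The following is a minimal base R sketch of that idea, not part of the original answer: `fill_na_random()` is a hypothetical helper name, and it uses `rnorm()` rather than the truncated, rounded draws of the question's `rtruncnorm()` call.

```r
# Hypothetical helper (an assumption for illustration): replace each NA in a
# numeric vector with its own draw from a normal distribution.
fill_na_random <- function(x, mean = 0, sd = 1) {
  missing <- is.na(x)
  # Draw exactly as many values as there are NAs.
  x[missing] <- rnorm(sum(missing), mean = mean, sd = sd)
  x
}

set.seed(123)
v <- c(0.288, 0.788, NA, 0.883, 0.940)
filled <- fill_na_random(v, mean = 500, sd = 100)

# Observed entries are untouched; only the third entry has been replaced.
filled
```

Because the helper takes and returns a plain vector, it should also be usable as the function inside across(), for example mutate(across(c(x, y), ~ fill_na_random(.x, mean = 500, sd = 100))).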

Related

Add two columns simultaneously via mutate

I would like to use dplyr::mutate to add two named columns to a dataframe simultaneously and with a single function call. Consider the following example:
library(dplyr)
n <- 1e2; M <- 1e3
variance <- 1
x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)
dat <- data.frame(s = s, x = x)
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}
ci_studclt(x)
Trying something like the below returns an error, since obviously two values are produced and cannot be inserted into a single atomic-type column.
dat %>%
group_by(s) %>%
mutate(ci = ci_studclt(x, variance))
It seems one option is to insert a list column and then use unnest_wider, and that this is easier with data.table or in the specific case of splitting a string column into two new columns.
In my example, a confidence interval (lower and upper bound) comes out of a function, and I would like to add both values directly as new columns to dat, e.g. calling the columns ci_lower and ci_upper.
Is there a straightforward way of doing this with dplyr or do I need to insert the elements as a list column then unnest?
NB Keep in mind that the confidence interval values are a function of a group of simulated values x, grouped by s; the CI values should be constant within a group.
You can do this by having your function (or a wrapper function) return a data.frame. When you call it in mutate, don’t specify a column name (or else you’ll end up with a nested data.frame column). If you want to specify names for the new columns, you can include them as function arguments as in the below.
library(dplyr)
n <- 1e2; M <- 1e3
variance <- 1
x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)
dat <- data.frame(s = s, x = x)
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}
ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
ci <- ci_studclt(x, alpha = alpha)
out <- data.frame(ci[[1]], ci[[2]])
names(out) <- names_out
out
}
# original code was ci_studclt(x, variance)
# but ci_studclt() doesn't take a variance argument, so I omitted
dat %>%
group_by(s) %>%
mutate(ci_wrapper(x))
output:
# A tibble: 100,000 x 4
# Groups: s [1,000]
s x ci_lower ci_upper
<int> <dbl> <dbl> <dbl>
1 1 0.233 -0.223 0.139
2 1 1.03 -0.223 0.139
3 1 1.53 -0.223 0.139
4 1 0.0150 -0.223 0.139
5 1 -0.211 -0.223 0.139
6 1 -1.13 -0.223 0.139
7 1 -1.51 -0.223 0.139
8 1 0.371 -0.223 0.139
9 1 1.80 -0.223 0.139
10 1 -0.137 -0.223 0.139
# ... with 99,990 more rows
With specified column names:
dat %>%
group_by(s) %>%
mutate(ci_wrapper(x, names_out = c("ci.lo", "ci.hi")))
output:
# A tibble: 100,000 x 4
# Groups: s [1,000]
s x ci.lo ci.hi
<int> <dbl> <dbl> <dbl>
1 1 0.233 -0.223 0.139
2 1 1.03 -0.223 0.139
3 1 1.53 -0.223 0.139
4 1 0.0150 -0.223 0.139
5 1 -0.211 -0.223 0.139
6 1 -1.13 -0.223 0.139
7 1 -1.51 -0.223 0.139
8 1 0.371 -0.223 0.139
9 1 1.80 -0.223 0.139
10 1 -0.137 -0.223 0.139
# ... with 99,990 more rows
If you get your function to return a two-column data frame with repeated values of the same length as the input, then this becomes very easy:
ci_studclt <- function(x, alpha = 0.05) {
n <- length(x)
S_n <- var(x)
res <- mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
data.frame(lower = rep(res[1], length(x)), upper = res[2])
}
dat %>%
group_by(s) %>%
mutate(ci_studclt(x))
#> # A tibble: 100,000 x 4
#> # Groups: s [1,000]
#> s x lower upper
#> <int> <dbl> <dbl> <dbl>
#> 1 1 -0.767 -0.147 0.293
#> 2 1 -0.480 -0.147 0.293
#> 3 1 -1.31 -0.147 0.293
#> 4 1 0.219 -0.147 0.293
#> 5 1 0.650 -0.147 0.293
#> 6 1 0.542 -0.147 0.293
#> 7 1 -0.249 -0.147 0.293
#> 8 1 2.22 -0.147 0.293
#> 9 1 -0.239 -0.147 0.293
#> 10 1 0.176 -0.147 0.293
#> # ... with 99,990 more rows
Another possible variation, if you don't want to change your ci_studclt function:
dat %>%
group_by(s) %>%
mutate(
across(x,
.fns = list(
lower = ~ci_studclt(.)[1],
upper = ~ci_studclt(.)[2]
)
)
)
In this case the output will contain new x_lower and x_upper columns. This variant is also somewhat scalable: if you want to apply your function to another column y as well, you can just replace x with c(x,y) and get y_lower and y_upper columns in dat too.
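A minimal sketch of the c(x, y) variant on a small data set follows. The data frame `toy` and its column `y` are assumptions added for illustration; `ci_studclt()` is the function from the question, unchanged.

```r
library(dplyr)

# The question's interval function, unchanged.
ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n - 1) * sqrt(S_n / n)
}

# Illustrative data (assumed): two groups of ten observations, two value columns.
set.seed(1)
toy <- data.frame(s = rep(1:2, each = 10), x = rnorm(20), y = rnorm(20))

res <- toy %>%
  group_by(s) %>%
  mutate(
    across(
      c(x, y),
      .fns = list(
        lower = ~ ci_studclt(.x)[1],
        upper = ~ ci_studclt(.x)[2]
      )
    )
  ) %>%
  ungroup()

# The result keeps s, x and y and gains one lower/upper pair per input column.
names(res)
```

Note that each bound calls ci_studclt() separately, so the interval is computed twice per column and group; for an expensive function, the data-frame-returning approaches above avoid that.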
UPDATE
Actually, all the stuff that Allan did in his answer could be done inside mutate call and without any modification of initial function:
dat %>%
  group_by(s) %>%
  mutate(
    t(ci_studclt(x)) %>%
      as.data.frame() %>%
      setNames(c('ci_lower','ci_upper'))  # base R; set_names() would need purrr or magrittr attached
  )
We just transpose the output of ci_studclt(x) so that as.data.frame treats it as a single row, and then give this one-row data frame the correct names.

Using group_modify with selected columns (retaining whole data frame and order)

I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except it removes all columns except the numerical ones, AND reorders everything. Which makes it hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_apply to specified columns, while retaining the other columns, and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
name=letters[1:10],
csize=c("L","S","S","L","L","S","L","S","L","S"),
v1=rnorm(10),
v2=rnorm(10),
v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(across(all_of(groupvar))) %>%
  mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
Does this work:
> library(dplyr)
> library(tidyr)  # replace_na() comes from tidyr
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = TRUE))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476
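Since both answers run on random data, a small deterministic check may make the behaviour easier to verify. This is a sketch under stated assumptions: the tibble `toy` and the names `cols` and `imputed` are made up for illustration, and the grouping column is held as a string as in the question.

```r
library(dplyr)
library(tidyr)

# Illustrative data (assumed), with values chosen so the group means are obvious.
toy <- tibble(
  name  = letters[1:6],
  csize = c("L", "S", "L", "S", "L", "S"),
  v1    = c(1, 10, NA, 30, 3, NA),
  v2    = c(NA, 1, 2, 3, 4, 5)
)
cols <- c("v1", "v2")   # columns to impute, held as strings
groupvar <- "csize"     # grouping column, held as a string

imputed <- toy %>%
  group_by(across(all_of(groupvar))) %>%
  mutate(across(all_of(cols), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()

# v1: row 3 becomes 2 (mean of 1 and 3 in group L), row 6 becomes 20 (mean of 10 and 30 in group S)
# v2: row 1 becomes 3 (mean of 2 and 4 in group L)
imputed
```

The name column and the original row order are both retained, because mutate() only modifies the selected columns in place.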

Generating a column with the average value of rows before and after the row index

Given some data like the following:
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# A tibble: 12 x 2
# class value
# <chr> <dbl>
# 1 a -1.21
# 2 a 0.277
# 3 a 1.08
# 4 a -2.35
# 5 a 0.429
# 6 a 0.506
# 7 b 0.943
# 8 b 0.945
# 9 b 0.944
#10 b 0.911
#11 b 0.952
#12 b 0.900
I'm trying to generate a new column (context) that contains the average of "value" of the X preceding and posterior rows, when possible. It would be desirable to have this by level of a factor in a different column. For example, for X=2, I would expect something like the following:
# A tibble: 12 x 3
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.7135
# 4 a -2.35 0.573
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.9377
#10 b 0.911 0.9278
#11 b 0.952 NA
#12 b 0.900 NA
Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X=2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.
Similarly, rows 5 and 6 do not have a value of context, because these do not have two values afterwards belonging to the same level of the factor "class" (because row 7 is class="b" while 5 and 6 are class="a").
I do not know if this is even possible in R, I haven't found any similar questions, and I can only reach to solutions like the following one, which I think is not representative of this language.
My solution:
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
tmp <- df_list[[i]]
context <- vector()
for (j in 1:nrow(tmp)) {
if (j<=X | j>nrow(tmp)-X) context <- c(context, NA)
else {
values <- vector()
for (k in 1:X) {
values <- c(values, tmp$value[j-k], tmp$value[j+k])
}
context <- c(context, mean(values))
}
}
tmp <- tmp %>% dplyr::mutate(context=context)
result <- result %>% dplyr::bind_rows(tmp)
}
This gives an approximate solution to the one above (differences are due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions developed to solve tasks like this one (e.g. vectorized functions)?
# pipes ('%>%'), tibble() and grouping from the tidyverse
library(tidyverse)
# rolling sum function from the zoo package
library(zoo)
# this is your dataframe
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
df %>% # take df
  group_by(class) %>% # group it by class
  mutate(context = (rollsum(value, 5, fill = NA) - value) / 4) # rolling sum of 5 values, minus the row itself, divided by the 4 neighbours
Basically you calculate a rolling sum with a window width of 5 that is centered (the default) and fill the positions where the window does not fit with NAs. Since the value of the row itself is not to be included in the average, it is subtracted from the sum before dividing by 4.
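The same window arithmetic can be sketched in base R for an arbitrary X, which may help when several context columns are wanted at once. This is an illustration under stated assumptions rather than the answer's method: `context_mean()` is a hypothetical helper, it swaps zoo's rollsum() for stats::filter(), and the data frame `d` is made up.

```r
# Hypothetical helper: mean of the X values before and the X values after
# each position, NA where the window does not fit.
context_mean <- function(v, X = 2) {
  w <- 2 * X + 1
  if (length(v) < w) return(rep(NA_real_, length(v)))
  # Centred moving sum over w values; stats::filter() leaves NA at both ends.
  window_sum <- as.numeric(stats::filter(v, rep(1, w), sides = 2))
  # Remove the row's own value, then average the 2 * X neighbours.
  (window_sum - v) / (2 * X)
}

context_mean(1:6, X = 2)
# NA NA  3  4 NA NA

# Illustrative grouped data (assumed); ave() applies the helper within each class.
d <- data.frame(class = rep(c("a", "b"), each = 6),
                value = as.numeric(c(1:6, 11:16)))
for (X in 1:2) {
  d[[paste0("context_", X)]] <- ave(d$value, d$class,
                                    FUN = function(v) context_mean(v, X))
}
d
```

stats::filter() is called with its namespace because dplyr masks filter() when it is attached.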
One way using dplyr and purrr:
n <- 2
library(dplyr)
library(purrr)  # for map_dbl()
df %>%
  group_by(class) %>%
  mutate(context = map_dbl(row_number(), ~ if(.x <= n | .x > (n() - n))
    NA_real_ else mean(value[c((.x - n):(.x - 1), (.x + 1) : (.x + n))])))
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.712
# 4 a -2.35 0.574
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.938
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Here is a base R solution using ave(), i.e.,
df <- within(df,
contest <- ave(value,
class,
FUN = function(v,X=2) sapply(seq(v), function(k) ifelse(k-X < 1 | k+X >length(v),NA,mean(v[c(k-(X:1),k + (1:X))])))))
such that
> df
# A tibble: 12 x 3
class value contest
<chr> <dbl> <dbl>
1 a -1.21 NA
2 a 0.277 NA
3 a 1.08 -0.712
4 a -2.35 0.574
5 a 0.429 NA
6 a 0.506 NA
7 b 0.943 NA
8 b 0.945 NA
9 b 0.944 0.938
10 b 0.911 0.935
11 b 0.952 NA
12 b 0.900 NA

summarise dplyr with dynamic columns? [duplicate]

I have some R code which does what I want it to do. But now the question:
Is there any mechanism to avoid coding A1, A2, A3 and so on? I would like to code A* for all columns beginning with A. There can be any number of "A" columns, depending on the length of a list which is defined in the code. The rest of the code is dynamic, but here I need a manual intervention (adding or deleting some A columns within the summarise statement).
I have found summarize_at, but I don't see how I can do the other things like last() and sum() at the same time for the other columns.
l_af <- l_cf %>%
group_by(PID, Server) %>%
summarise(Player=last(Player),
Guild=last(Guild),
Points=last(Points),
Battles=last(Battles),
A1=max(A1),
A2=max(A2),
A3=max(A3),
A4=max(A4),
A5=max(A5),
A6=max(A6),
RecCount=sum(RecCount))
Any help is appreciated.
The problem with summarise is that it drops every column that is not used. You can instead use mutate first to perform all the operations and then summarise.
library(dplyr)
l_cf %>%
group_by(PID, Server) %>%
mutate_at(vars(Player,Guild,Points,Battles), last) %>%
mutate_at(vars(starts_with("A")), max) %>%
mutate(RecCount = sum(RecCount)) %>%
summarise_all(max)
A reproducible example
set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
a1 = runif(10), a2 = runif(10), z = runif(10))
First applying functions individually for each column
df %>%
group_by(group) %>%
summarise(x=last(x),
y=last(y),
a1=max(a1),
a2=max(a2),
z=sum(z))
# A tibble: 5 x 6
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
Now apply the functions together for multiple columns
df %>%
group_by(group) %>%
mutate_at(vars(x, y), last) %>%
mutate_at(vars(starts_with("a")), max) %>%
mutate(z = sum(z)) %>%
summarise_all(max)
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
We can see that both the approaches gave the same output.
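With dplyr 1.0.0 or later, the same result can probably be reached in one summarise() call by combining several across() selections. This is a sketch assuming that dplyr version, using the reproducible example's data; the name `res` is made up.

```r
library(dplyr)

set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
                 a1 = runif(10), a2 = runif(10), z = runif(10))

res <- df %>%
  group_by(group) %>%
  summarise(
    across(c(x, y), last),           # last value for the listed columns
    across(starts_with("a"), max),   # max for every column starting with "a"
    z = sum(z)                       # a single named summary alongside
  )
res
```

Here starts_with("a") picks up however many matching columns exist, so no manual edit is needed when A columns are added or removed.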

how to calculate cumsum with depreciation in a grouped dataframe?

I am trying to calculate a cumulative sum with a depreciation rate.
I have a grouped dataframe with a numeric column, num.
I want to add the numbers one by one, depreciating the running total at each step.
If the rate is 1, then the cumsum function in base R is good enough.
But if not, say with a rate of 0.5 (meaning the running total is multiplied by 0.5 before the next number is added), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id=sample(1:5,25,replace=TRUE)
num=rnorm(25)
df=data.frame(id,num)
# my custom function
depre=function(data){
rate=0.5
r=nrow(data)
sl=data$num
nl=data$num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
# work with one group
df %>% filter(id==1) %>% depre(.)
# failed to work with dplyr
df %>% group_by(id) %>% mutate(sl=depre(.))
I expect the first element of column sl to be the same as in column num.
Each following element should be the previous sl value multiplied by 0.5, plus the next num.
It works for one group, but fails on the grouped dataframe.
The error message is: "Error: Column sl must be length 6 (the group size) or one, not 25".
I have no idea why. Does anyone have a clue?
Thanks
Your function would work if you passed it a vector instead of a dataframe:
depre <- function(num){
  rate <- 0.5
  sl <- num
  # seq_along(num)[-1] is empty for a single-element group, so the loop is skipped
  for (i in seq_along(num)[-1]){
    sl[i] <- sl[i-1]*rate + num[i]
  }
  return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
We can split by 'id' and use the OP's function without any changes:
library(dplyr)
library(purrr)
df %>%
  group_split(id) %>%  # id is kept in each piece so it can be returned below
  map_df(~ tibble(id = .x$id, sl = depre(.x)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
or an option would be accumulate from purrr which would be more compact
out <- df %>%
group_by(id) %>%
mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
The issue in the OP's function is that, inside mutate(), the `.` still refers to the whole data frame passed down the pipe, so nrow(data) is the total number of rows (25) rather than the size of the current group. Within grouped dplyr verbs, the group size is given by n(). With group_split, the input data.frame is split into one data.frame per group, so nrow() on each piece is the group size and the original function works.
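For comparison, the same recurrence can be sketched in base R with Reduce(), which plays the role of purrr's accumulate() here. `depre_rate()` and the data frame `toy` are hypothetical names introduced for illustration, and the rate is exposed as an argument instead of being fixed at 0.5.

```r
# Hypothetical helper: running total where the previous total is multiplied
# by `rate` before the next number is added.
depre_rate <- function(num, rate = 0.5) {
  if (length(num) == 0) return(num)
  Reduce(function(acc, x) acc * rate + x, num, accumulate = TRUE)
}

depre_rate(c(1, 2, 3))
# 1.00 2.50 4.25

# Illustrative grouped data (assumed); ave() applies the helper within each id
# and returns the results in the original row order.
toy <- data.frame(id = c(1, 2, 1, 2, 1), num = c(1, 10, 2, 20, 3))
toy$sl <- ave(toy$num, toy$id, FUN = depre_rate)
toy
```

With rate = 1 the helper reduces to cumsum(), which matches the behaviour described in the question.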
