Replace NA with random numbers within `group_by` in dplyr - r

I have a data frame in long format and I want to replace missing values with random numbers, but I want to do this group-wise with different settings...
library(dplyr)
set.seed(1)
imp_df <-
  data.frame(exp = rep(letters[1:3], each = 2),
             rep = 1:2,
             mean = 1:6,
             sd = seq(0, 0.5, 0.1))
df <-
  data.frame(
    exp = rep(letters[1:3], each = 20),
    rep = 1:2,
    int = rnorm(60, 10, 5)
  )
df[sample(1:60, 25, replace = FALSE), 'int'] <- NA
So my data looks like the above: in imp_df I have the settings for the rnorm function, keyed by the experiment exp and the replicate rep.
My data frame df then has some missing values, and I want to replace the NAs with random numbers drawn according to those settings.
How can I do it using dplyr or tidyr?
Edit
After the answer from @starja, I found a quick, but maybe slow, solution using rowwise together with left_join.
df %>%
  left_join(imp_df) %>%
  rowwise() %>%
  mutate(imp.int = if_else(
    is.na(int),
    rnorm(1, mean, sd),
    int
  )) %>%
  print(n = 60)
Are there other ways to do this?
Edit 2
Since the rowwise approach is pretty slow and I couldn't get it running inside regular dplyr code, I used a for loop to step through imp_df with the imputation settings.
This is a pretty quick solution, but not as readable as I was hoping:
df$imp.int <- df$int
for (line in 1:nrow(imp_df)) {
  imp_settings <- as.list(imp_df[line, ])
  rows_missing_values <- which(
    df$exp == imp_settings$exp &
      df$rep == imp_settings$rep &
      is.na(df$imp.int)
  )
  df$imp.int[rows_missing_values] <-
    stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}
So we first add a column imp.int for the imputed values, and then run the different imputation settings line by line, replacing the NAs for each group.

This could also be done:
library(dplyr)
library(purrr)
df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  mutate(int = ifelse(is.na(int),
                      # map2_dbl (rather than map2) keeps int a numeric column
                      map2_dbl(mean, sd, ~ rnorm(1, .x, .y)),
                      int))
   exp rep      int mean  sd
1    a   1        1    1 0.0
2    a   2 10.91822    2 0.1
3    a   1 5.821857    1 0.0
4    a   2  17.9764    2 0.1
5    a   1 11.64754    1 0.0
6    a   2 5.897658    2 0.1
7    a   1 12.43715    1 0.0
8    a   2 13.69162    2 0.1
9    a   1 12.87891    1 0.0
10   a   2 1.986482    2 0.1

I guess there are cleverer solutions out there that use vectorisation, but if you don't have super large data, I like to use a purrr::map function for this together with a small custom-made function:
library(dplyr)
set.seed(1)
imp_df <-
  data.frame(exp = rep(letters[1:3], each = 2),
             rep = 1:2,
             mean = 1:6,
             sd = seq(0, 0.5, 0.1))
df <-
  data.frame(
    exp = rep(letters[1:3], each = 20),
    rep = 1:2,
    int = rnorm(60, 10, 5)
  )
df[sample(1:60, 25, replace = FALSE), 'int'] <- NA

replace_fun <- function(x, mean, sd) {
  if (is.na(x)) {
    rnorm(1, mean, sd)
  } else {
    x
  }
}

df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  mutate(int = purrr::pmap_dbl(list(int, mean, sd), replace_fun)) %>%
  head()
#>   exp rep       int mean  sd
#> 1   a   1  1.000000    1 0.0
#> 2   a   2 10.918217    2 0.1
#> 3   a   1  5.821857    1 0.0
#> 4   a   2 17.976404    2 0.1
#> 5   a   1 11.647539    1 0.0
#> 6   a   2  5.897658    2 0.1
Created on 2021-05-27 by the reprex package (v0.3.0)
(If you want, you can remove the mean/sd columns with select(-c(mean, sd)).)
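For completeness, a fully vectorised variant should also work (a sketch, not from the original thread): rnorm() is vectorised over mean and sd, so after the join one draw per row can be generated and kept only where int is missing, via coalesce():
library(dplyr)
df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  # one draw per row (wasteful for non-missing rows, but fully vectorised);
  # coalesce() keeps the observed int and falls back to the draw for NAs
  mutate(imp.int = coalesce(int, rnorm(n(), mean, sd)))
This avoids both rowwise() and the explicit loop.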

Related

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
(11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
             val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
purrr::map_dfr(1:3, function(gr) {
c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
})) %>%
nest(bp = -gr) %>%
mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error: Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a group_by + mutate step:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup
#       gr    val breakpoints
#    <int>  <dbl>       <int>
#  1     1  0.440           1
#  2     1  0.770           2
#  3     1  2.56            4
#  4     1  1.07            3
#  5     1  1.13            3
#  6     1  2.72            4
#  7     1  1.46            4
#  8     1 -0.265           1
#  9     1  0.313           1
# 10     1  0.554           2
# … with 20 more rows
findInterval is applied for each gr separately.
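A similar sketch with cut() (labels = FALSE returns the integer bin directly); note that cut()'s intervals are right-closed by default, so values falling exactly on a quartile can land in a different bin than with findInterval:
library(dplyr)
df %>%
  group_by(gr) %>%
  # labels = FALSE makes cut() return integer codes instead of a factor
  mutate(lvl = cut(val,
                   breaks = c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf),
                   labels = FALSE)) %>%
  ungroup()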

Variance over time with two dataframes in R

I have two dataframes. df1 is a dataframe where every row is a score that someone gives.
df1
title <- c("x","x","x","x","y","y","y","y","y")
day <- c(0,2,2,4,1,1,3,3,4)
score <- c(7,7,6,4,8,1,7,1,5)
df1 = data.frame(title,day,score)
df2 is a title-day panel dataset in long format with a lot of variables. I'm looking for a way to mutate in the variance of the scores on day x and the variance of the scores over time (i.e. the variance of the score(s) on day x together with all previous scores).
It should look like this:
title <- c("x","x","x","x","x","y","y","y","y","y")
day <- c(0,1,2,3,4,0,1,2,3,4)
variance_day_x <- c(0,0,0.5,0,0,0,24.5,0,12,0)
variance_cumulative <- c(0,0,0.3333,0.3333,2,0,24.5,24.5,14.25,10.8)
df2 <- data.frame(title,day,variance_day_x,variance_cumulative)
As you can see, I need to mutate 2 variables out of df1 into df2. The first variable is the variance per day, where variance = 0 when there are 0 or 1 scores available on that day, because there is nothing to calculate. The second variable is the cumulative variance, which needs to be updated every time new scores become available.
Hope this explained my problem well enough. I'm stuck at this moment, hope you guys can help!
Using tidyverse you could try something like this. First group_by title and use a custom cumulative-variance function that can be called from mutate. The daily variance is computed after grouping by both title and day. complete will fill in missing days, and fill will carry the cumulative variance forward for those missing days. You can replace the NAs with zero if you would like, using replace_na.
library(tidyverse)

# Cumulative variance: var() of the first i elements of x, for each i
cumvar <- function(x) {
  sapply(seq_along(x), function(i) var(x[1:i]))
}

df1 %>%
  group_by(title) %>%
  mutate(cvar = cumvar(score)) %>%
  group_by(title, day) %>%
  summarise(variance_day_x = var(score),
            variance_cumulative = last(cvar)) %>%
  complete(title, day = 0:4) %>%
  fill(variance_cumulative, .direction = "down")
Output
# A tibble: 10 x 4
# Groups:   title [2]
   title   day variance_day_x variance_cumulative
   <chr> <dbl>          <dbl>               <dbl>
 1 x         0           NA                NA
 2 x         1           NA                NA
 3 x         2            0.5               0.333
 4 x         3           NA                 0.333
 5 x         4           NA                 2
 6 y         0           NA                NA
 7 y         1           24.5              24.5
 8 y         2           NA                24.5
 9 y         3           18                14.2
10 y         4           NA                10.8
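As noted above, replace_na can zero out the remaining NAs; a sketch, assuming the summarised tibble from the pipeline above has been assigned to a variable res (a hypothetical name):
# res is assumed to hold the result of the pipeline above
res %>%
  mutate(across(c(variance_day_x, variance_cumulative),
                ~ replace_na(.x, 0)))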
A bit messy Base R solution:
df_variances <- cbind(df1, data.frame(do.call("rbind", lapply(split(df1, df1$title),
  function(x) {
    variance_cumulative <- sapply(seq_len(nrow(x)), function(i) {
      var(x$score[1:i])
    })
    variance_day_x <- sapply(seq_len(nrow(x)), function(j) {
      var(x$score[(j - 1):j])
    })
    data.frame(variance_day_x = variance_day_x,
               variance_cumulative = variance_cumulative)
  })), row.names = NULL))
df_clean <- replace(df_variances, is.na(df_variances), 0)
Another base R solution. I also use a custom cumvar function. Furthermore, I use @Ruben's great repeat_last function to fill up NAs with the last known values.
This solution relies mainly on ave, which applies a function to a variable, grouped by other variables. Since the days are not complete, we merge the original data with a complete data set of all unique titles and days. Before we calculate the variances, we calculate the cumulative variances; the idea is to later select the "newest" value per day using length. Finally we delete the dupes, and we're done.
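repeat_last is not defined in this snippet (it comes from @Ruben's answer elsewhere); a minimal stand-in (a sketch of a last-observation-carried-forward helper, not the full original) could be:
# Minimal LOCF helper (sketch): carry the last non-NA value forward;
# leading NAs stay NA via the padded first element.
repeat_last <- function(x) {
  ind <- cumsum(!is.na(x))      # running count of non-NA values seen so far
  c(NA, x[!is.na(x)])[ind + 1]  # ind == 0 (leading NAs) picks the padded NA
}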
cumvar <- function(x) sapply(seq_along(x), function(i) var(x[1:i]))

df1$vari.cum <- with(df1, ave(score, title, FUN = cumvar))
compl <- expand.grid(title = unique(df1$title), day = unique(df1$day))
dfx <- merge(compl, df1, all.x = TRUE)
dfx$vari.cum <- with(dfx, ave(vari.cum, title, FUN = repeat_last))

res <- within(dfx, {
  vari.day <- ave(score, title, day, FUN = var)
  vari.cum <- ave(vari.cum, title, day, FUN = function(x) x[length(x)])
})
res <- res[!duplicated(res[c("title", "day")]), c(1:2, 5:4)]
res
# title day vari.day vari.cum
# 1 x 0 NA NA
# 2 x 1 NA NA
# 3 x 2 0.5 0.3333333
# 5 x 3 NA 0.3333333
# 6 x 4 NA 2.0000000
# 7 y 0 NA NA
# 8 y 1 24.5 24.5000000
# 10 y 2 NA 24.5000000
# 11 y 3 18.0 14.2500000
# 13 y 4 NA 10.8000000

Running Shannon and Simpson: vegan package

I am interested in biodiversity index calculations using the vegan package. The Simpson index works, but I get no results from the Shannon argument. I was hoping somebody knows the solution.
What I have tried: I converted the data frame into the vegan package test data format using the code below.
Plot <- c(1,1,2,2,3,3,3)
species <- c("Aa","Aa","Aa","Bb","Bb","Rr","Xx")
count <- c(3,2,1,4,2,5,7)
veganData <- data.frame(Plot, species, count)
matrify(veganData)
diversity(veganData, "simpson")
diversity(veganData, "shannon", base = exp(1))
1. I get the following results, so I think it produces all the Simpson indices:
> diversity(veganData, "simpson")
  simpson.D simpson.I simpson.R
1      1.00      0.00       1.0
2      0.60      0.40       1.7
3      0.35      0.65       2.8
2. But when I run the Shannon index I get the following message:
> diversity(veganData,"shannon")
data frame with 0 columns and 3 rows
I am not sure why it's not working. Do we need to make any changes to the data formatting when switching methods?
Your data need to be in wide format. Also, the counts must be either totals or averages (not repeated counts for the same plot).
library(dplyr); library(tidyr)
df <- veganData %>%
  group_by(Plot, species) %>%
  summarise(count = sum(count)) %>%
  ungroup %>%
  spread(species, count, fill = 0)
df
# # A tibble: 3 x 5
#    Plot    Aa    Bb    Rr    Xx
#   <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1     5     0     0     0
# 2     2     1     4     0     0
# 3     3     0     2     5     7
diversity(df[,-1], "shannon")
# [1] 0.0000000 0.5004024 0.9922820
To check whether the calculation is correct, note that the Shannon index is computed as H' = -Σ pᵢ · ln(pᵢ):
# For plot 3:
-1 * (
  (2/(2+5+7)) * log(2/(2+5+7)) +  # Pi*lnPi of Bb
  (5/(2+5+7)) * log(5/(2+5+7)) +  # Pi*lnPi of Rr
  (7/(2+5+7)) * log(7/(2+5+7))    # Pi*lnPi of Xx
)
# [1] 0.992282
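As an aside, spread() is superseded in current tidyr; an equivalent reshaping sketch with pivot_wider():
library(dplyr); library(tidyr)
df <- veganData %>%
  group_by(Plot, species) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  # one column per species, zero where a species is absent from a plot
  pivot_wider(names_from = species, values_from = count, values_fill = 0)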

How to create a column that is a group label for unique collections of other columns data table [duplicate]

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).
EDIT: this was subsequently resolved by the addition of group_indices() (now deprecated) in dplyr 0.4.0
a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3...
e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on.
How to do this with one mutate(), without a three-step summarize-and-self-join?
dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.
b) Actually, what I really want is to assign a string/character label ('A','B',...).
But numbering groups by integers is good enough, because I can then use integer_to_label(i), as below. Unless there's a clever way to merge these two steps? But don't sweat this part.
set.seed(1234)
# Helper fn for mapping integers 1..26 to character labels
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", i, i) }
df <- tibble::as_tibble(data.frame(u = sample.int(3, 10, replace = TRUE),
                                   v = sample.int(4, 10, replace = TRUE)))
# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u, v) %>% mutate(label = n())  # WRONG: n() is the number of elements within its group, not the overall number of the group
u v
1 2 3
2 1 3
3 1 2
4 2 3
5 1 2
6 3 3
7 1 3
8 1 2
9 3 1
10 3 4
KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()), then self-join.
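(Spelled out, that self-join kludge might look like this sketch: number the distinct combinations with row_number(), then join the labels back.)
# Sketch of the summarize-and-self-join kludge
labels <- df %>% distinct(u, v) %>% mutate(label = row_number())
df %>% left_join(labels, by = c("u", "v"))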
dplyr has a group_indices() function that you can use like this:
df %>%
  mutate(label = group_indices(., u, v)) %>%
  group_by(label) ...
Another approach using data.table would be
require(data.table)
setDT(df)[, label := .GRP, by = c("u", "v")]
which results in:
u v label
1: 2 1 1
2: 1 3 2
3: 2 1 1
4: 3 4 3
5: 3 1 4
6: 1 1 5
7: 3 2 6
8: 2 3 7
9: 3 2 6
10: 3 4 3
As of dplyr version 1.0.4, the function cur_group_id() has replaced the older function group_indices().
Call it on the grouped data.frame:
df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())
# A tibble: 10 x 3
# Groups:   u, v [6]
       u     v label
   <int> <int> <int>
 1     2     2     4
 2     2     2     4
 3     1     3     2
 4     3     2     6
 5     1     4     3
 6     1     2     1
 7     2     2     4
 8     2     4     5
 9     3     2     6
10     2     4     5
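For part (b) of the question, the integer can then be mapped straight to a letter with the asker's integer_to_label helper (a sketch; works for up to 26 groups):
df %>%
  group_by(u, v) %>%
  # cur_group_id() is a scalar per group, so substr() picks one letter
  mutate(label = integer_to_label(cur_group_id()))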
Updated answer
get_group_number <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}
group_number <- get_group_number()
df %>% group_by(u, v) %>% mutate(label = group_number())
You can also consider the following slightly unreadable version:
group_number <- (function() { i <- 0; function() i <<- i + 1 })()
df %>% group_by(u, v) %>% mutate(label = group_number())
Using the iterators package:
library(iterators)
counter <- icount()
df %>% group_by(u, v) %>% mutate(label = nextElem(counter))
Updating my answer with three different ways:
A) A neat non-dplyr solution using interaction(u, v):
> df$label <- factor(interaction(df$u, df$v, drop = TRUE))
> df$label
 [1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4
> match(df$label, levels(df$label)[rank(unique(df$label))])
 [1] 1 2 3 4 5 4 6 6 7 7
B) Making Randy's neat fast-and-dirty generator-function answer more compact:
get_next_integer <- function() {
  i <- 0
  function(u, v) { i <<- i + 1 }
}
get_integer <- get_next_integer()
df %>% group_by(u, v) %>% mutate(label = get_integer())
C) Also, here is a one-liner using a generator function that abuses a global variable assignment:
i <- 0
generate_integer <- function() { return(assign('i', i + 1, envir = .GlobalEnv)) }
df %>% group_by(u, v) %>% mutate(label = generate_integer())
rm(i)
I don't have enough reputation for a comment, so I'm posting an answer instead.
The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:
my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
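Applied to the two-column (u, v) example, a sketch of the same idea: build a combined key, then number the groups in order of first appearance:
df %>%
  mutate(key = paste(u, v, sep = "."),
         # levels = unique(key) preserves first-appearance order
         label = as.integer(factor(key, levels = unique(key))))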

Filter Dataframe with Tidy Evaluation

I am handling a large dataset. First, for certain columns (X1, X2, ...), I am trying to identify a range of values (a, b) made up of repeated values (values appearing more than n times). Next, I wish to filter rows based on a condition that matches the respective columns against the result from the previous step.
Here is a reproducible example simulating the scenario I am facing,
library(tidyverse)
set.seed(1122)
vecs <- lapply(X = 1:2, function(x) rep(c(1, 2, 3), times = 10) %>% sample() %>% head(10))
names(vecs) <- paste0("col_", 1:2)
dat <- vecs %>% as.data.frame()
dat
col_1 col_2
1 3 2
2 1 1
3 1 1
4 1 2
5 1 2
6 3 3
7 3 3
8 2 1
9 1 3
10 2 2
I am able to identify the range with the following method:
# Which values in a column appear more than 3 times? Return their range.
more_than_3 <- function(df, var) {
  var <- rlang::sym(var)
  df %>%
    group_by(!!var) %>%
    summarise(n = n()) %>%
    filter(n > 3) %>%
    pull(!!var) %>%
    range()
}
cols_name <- c("col_1", "col_2")
some_range <- purrr::map(cols_name, more_than_3, df = dat)
names(some_range) <- cols_name
some_range
$col_1
[1] 1 1
$col_2
[1] 2 2
However, to filter out values that fall outside the upper limit, this is what I do.
dat %>%
  filter(col_1 <= some_range[["col_1"]][2],
         col_2 <= some_range[["col_2"]][2])
  col_1 col_2
1     1     1
2     1     1
3     1     2
4     1     2
I believe there must be a more efficient and elegant way of filtering the result based on tidy evaluation. Can someone point me in the right direction?
Many thanks in advance.
First let's try to create a small function that creates a single filter expression for one column. This function takes a symbol and transforms it to a string, but it could work the other way around:
new_my_filter_call_upper <- function(sym, range) {
  col_name <- as.character(sym)
  col_range <- range[[col_name]]
  if (is.null(col_range)) {
    stop(sprintf("Can't find column `%s` to compute range", col_name), call. = FALSE)
  }
  expr(!!sym < !!col_range[[2]])
}
Let's try it:
new_my_filter_call_upper(quote(foobar), some_range)
#> Error: Can't find column `foobar` to compute range
# It works!
new_my_filter_call_upper(quote(col_1), some_range)
#> col_1 < 3
Now we're ready to create a pipeline verb that takes a data frame and bare column names.
# Probably cleaner to pass the range as an argument. Prefix with a dot to allow
# columns named `range`.
my_filter <- function(.data, ..., .range) {
  # ensyms() guarantees there won't be complex expressions
  syms <- rlang::ensyms(...)
  # Now let's map the function to create many calls:
  calls <- purrr::map(syms, new_my_filter_call_upper, range = .range)
  # And we're ready to filter with those expressions:
  dplyr::filter(.data, !!!calls)
}
Let's try it:
dat %>% my_filter(col_1, col_2, .range = some_range)
#> col_1 col_2 NA.
#> 1 2 1 1
#> 2 2 2 1
We could use map2:
library(purrr)
map2(dat, some_range, ~ .x < .y[2]) %>%
  reduce(`&`) %>%
  dat[., ]
# col_1 col_2
#1 2 2
#2 1 1
#3 1 2
#6 1 1
Or with pmap:
pmap(list(dat, some_range %>% map(2)), `<`) %>%
  reduce(`&`) %>%
  dat[., ]
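With current dplyr (>= 1.0.0), another possible sketch (using the asker's <= upper-bound condition) is if_all() together with cur_column():
library(dplyr)
dat %>%
  # keep rows where every tracked column is at or below its upper bound
  filter(if_all(all_of(names(some_range)),
                ~ .x <= some_range[[cur_column()]][2]))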
