Replacing NAs with LOCF using sparklyr

My aim is to replace NAs in a Spark data frame using the last-observation-carried-forward (LOCF) method. I wrote the following code and it works. However, it seems to take longer than expected on a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and Code with sparklyr
In the following example, NAs are replaced after ordering the rows by time and grouping them by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
                                     as.Date('2010/01/01'), length.out = 10),
                          grp = c(rep(1, 5), rep(2, 5)),
                          v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
                          v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
  spark_apply(function(df) {
    library(dplyr)
    # LOCF: carry the last non-NA value forward by indexing the
    # non-NA values with a running count
    na_locf <- function(x) {
      v <- !is.na(x)
      c(NA, x[v])[cumsum(v) + 1]
    }
    df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate_at(vars(-v1, -grp), funs(na_locf(.)))
  })
# # Source: spark<?> [?? x 4]
#      time   grp    v1    v2
#     <dbl> <dbl> <dbl> <dbl>
#  1 11323      1     1   NaN
#  2 11688.     1   NaN   NaN
#  3 12053.     1   NaN     3
#  4 12419.     1   NaN     3
#  5 12784.     1     5     3
#  6 13149.     2   NaN   NaN
#  7 13514.     2   NaN   NaN
#  8 13880.     2   NaN     3
#  9 14245.     2   NaN     3
# 10 14610      2   NaN     3
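One wrinkle worth noting: as the output above shows, spark_apply() returns time as a plain double (days since 1970-01-01). A small sketch to restore the Date class after collecting, where tbl_filled is a hypothetical name for the result of the spark_apply() call:
tbl_filled %>%          # hypothetical: the filled Spark tbl from above
  collect() %>%
  mutate(time = as.Date(time, origin = "1970-01-01"))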
data.table
The following approach with data.table works quite fast for the data I have. I am expecting the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
            by = grp, .SDcols = cols]
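As a hedged alternative that drops the zoo dependency, recent data.table versions (>= 1.12.4) ship nafill(), which does LOCF on numeric columns; a sketch on the same data:
# Sketch: per-group LOCF with data.table's built-in nafill()
df_with_nas[, (cols) := lapply(.SD, nafill, type = "locf"),
            by = grp, .SDcols = cols]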

I did this sort of loop; it is quite slow...
df_with_nas = df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas = df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until no NAs remain; then
collect(df_with_nas)
will run the accumulated query.
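A hedged sketch of that run-until-clean idea, stopping when a pass no longer fills anything (so a leading NA cannot cause an infinite loop); it assumes the row column created above, and counts the remaining NAs on the cluster after each pass:
prev_na <- Inf
repeat {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
  # is.na() translates to IS NULL in Spark SQL, so the count runs on the cluster
  n_na <- df_with_nas %>%
    summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE)) %>%
    collect() %>%
    sum()
  if (n_na == 0 || n_na >= prev_na) break   # done, or nothing changed this pass
  prev_na <- n_na
}
collect(df_with_nas)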

You can leverage the spark_apply() function and run the na.locf function on each of your cluster nodes.
Install the R runtime on each of your cluster nodes.
Install the zoo R package on each node as well.
Run spark_apply() this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
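Since the fill in the question is grouped, a hedged refinement is spark_apply()'s group_by argument, which splits the work by group before applying the function; the grp column name is assumed from the question's data, and na.rm = FALSE keeps leading NAs instead of dropping those rows:
data_filled <- spark_apply(
  data_with_holes,
  function(df) zoo::na.locf(df, na.rm = FALSE),  # fill within each group only
  group_by = "grp"                               # "grp" assumed from the question
)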

You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want to use is LAST_VALUE(column, true) OVER (window) - this searches over the window for the most recent value of column which is not NA (passing true as the second argument sets ignoreNulls to true). Since you want to look backwards from the current row, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
  v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
  v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
  sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
  sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
                           OVER (PARTITION BY v2
                                 ORDER BY time
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
                           AS last_non_na
                           FROM test_table") %>%
  sdf_register() %>%
  mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#>      v1    v2  time last_non_na
#>   <dbl> <dbl> <dbl>       <dbl>
#> 1     1     1     1         NaN
#> 2     2     1     2           1
#> 3     2     1     3           2
#> 4     3     1     4           2
#> 5     3     1     5           3
#> 6   NaN     2     1         NaN
#> 7     5     2     2         NaN
#> 8     6     2     3           5
#> 9     6     2     4           6
Created on 2019-08-27 by the reprex package (v0.3.0)
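If those leading NAs also need filling, a hedged follow-up sketch back-fills them with the symmetric next-observation-carried-backward via FIRST_VALUE over the rows that follow (same table and columns as above):
spark_session(sc) %>%
  sparklyr::invoke("sql", "SELECT *, FIRST_VALUE(v1, true)
                           OVER (PARTITION BY v2
                                 ORDER BY time
                                 ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
                           AS next_non_na
                           FROM test_table") %>%
  sdf_register() %>%
  mutate(v1 = ifelse(is.na(v1), next_non_na, v1))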


How can I calculate a group-level statistic omitting a given sub-group nested within the group?

I have long data that is students nested within classrooms. I would like to calculate various class-level statistics for each student about the classroom that they study in, but exclude the student's own data in this calculation.
A simple example would be as below:
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)
As shown above, I have six students in two classrooms, each of whom has one or more observations of value. It's easy to get the student-level average with:
df %>%
  group_by(class_id, student_id) %>%
  summarize(value = mean(value))
or to add a classroom-level average with:
df %>%
  group_by(class_id) %>%
  mutate(class_avg = mean(value))
but I can't figure out how to tell dplyr to "leave out" a given group in the higher-level group level calculation. This is similar to the question asked here, but that calculates the mean of all groups except for the given group. I'm not sure how to modify this with dplyr to get what I want.
Thanks for your help.
Edit: After @akrun's request, the expected output is below (using a slightly modified version of @jared_mamrot's answer). As you can see, the class_mean_othstudents variable takes the value of the mean of the students in each class except for the given student. Jared's solution works but is a very manual approach and would only apply to getting a mean value. I am wondering if there is a dplyr way to do this more generally.
set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)
df %>%
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value)) %>%
  mutate(class_mean_othstudents =
           (sum(student_mean) - student_mean) / (n() - 1))
`summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
# A tibble: 6 x 4
# Groups:   class_id [2]
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907
2 a                 2       0.0999                 0.986
3 a                 3       1.72                   0.178
4 b                 4      -0.402                  0.195
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186
Based on the update, we may loop over row_number(), take the 'student_mean' values that are not from the current row, and get their mean:
library(dplyr)
library(purrr)
df %>%
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value), .groups = 'drop_last') %>%
  mutate(class_mean_othstudents = map_dbl(row_number(),
                                          ~ mean(student_mean[-.x]))) %>%
  ungroup
Output:
# A tibble: 6 x 4
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907
2 a                 2       0.0999                 0.986
3 a                 3       1.72                   0.178
4 b                 4      -0.402                  0.195
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186
Depending on the statistics you want for each classroom you could calculate them 'manually', e.g. classroom_mean = sum(x) / n; classroom_mean_excluding_the_student_in_question = (sum(x) - x) / (n - 1).
E.g.
library(tidyverse)
set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)
df %>%
  group_by(class_id, student_id) %>%
  summarise(student_mean = mean(value)) %>%
  mutate(class_mean_exc_this_student =
           (sum(student_mean) - student_mean) / (n() - 1))
#> `summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
#> # A tibble: 6 x 4
#> # Groups:   class_id [2]
#>   class_id student_id student_mean class_mean_exc_this_student
#>   <chr>         <dbl>        <dbl>                       <dbl>
#> 1 a                 1       0.256                       0.907
#> 2 a                 2       0.0999                      0.986
#> 3 a                 3       1.72                        0.178
#> 4 b                 4      -0.402                       0.195
#> 5 b                 5       0.0305                     -0.0211
#> 6 b                 6       0.360                      -0.186
Created on 2021-07-13 by the reprex package (v2.0.0)
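Since the OP asked for something more general than a hand-rolled mean, one hedged option is to pull the leave-one-out mean into a helper (loo_mean is a made-up name) so it can be reused like any other summary function:
library(dplyr)
# Hypothetical helper: leave-one-out mean of a vector
loo_mean <- function(x) (sum(x) - x) / (length(x) - 1)
df %>%
  group_by(class_id, student_id) %>%
  summarise(student_mean = mean(value), .groups = "drop_last") %>%
  mutate(class_mean_othstudents = loo_mean(student_mean)) %>%
  ungroup()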
Here is a solution which computes student_mean and class_mean_othstudents bottom-up, as overall means of the raw values. The result differs from the other answers posted so far, which use the mean of means to compute class_mean_othstudents:
library(data.table)
setDT(df)[, lapply(unique(student_id),
                   \(sid) .(student_id = sid,
                            student_mean = mean(value[student_id == sid]),
                            class_mean_othstudents = mean(value[student_id != sid]))) |>
            rbindlist(),
          by = .(class_id)]
   class_id student_id student_mean class_mean_othstudents
1:        a          1   0.25601839              0.6382870
2:        a          2   0.09989806              0.6207800
3:        a          3   1.71506499              0.1935703
4:        b          4  -0.40207251              0.1128452
5:        b          5   0.03052233             -0.1481104
6:        b          6   0.35981383             -0.1425156
For the sake of completeness, and for comparison with the other answers, here is the version which uses means of means:
library(data.table)
setDT(df)[, .(student_mean = mean(value)), by = .(class_id, student_id)][
  , class_student_mean :=
      .SD[, sapply(student_id, \(sid) mean(student_mean[student_id != sid])),
          by = class_id]$V1][]
   class_id student_id student_mean class_student_mean
1:        a          1   0.25601839         0.90748153
2:        a          2   0.09989806         0.98554169
3:        a          3   1.71506499         0.17795823
4:        b          4  -0.40207251         0.19516808
5:        b          5   0.03052233        -0.02112934
6:        b          6   0.35981383        -0.18577509
This result is in line with the other two answers, which are based on means of means.
Data
Note that the same seed as in Jared's answer and in the OP's edit is used.
set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

How do I subset a vector using a list of indices of variable lengths

I am trying to determine an efficient way to gather the means and standard deviations of subsections of a variable in a data frame, based on a list of the lengths of the sections within the variable. This is a small example of the type of data I have.
X1 <- c(1, 2.5, 3, .5, 1, 1.5, 3, 3.5, 4, 6, 8, 8, 6, 3, 4)
X2 <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
df <- c(X1, X2)
X3 <- list(3, 5, 4, 3)
I will note that the data I have are the result of preprocessing a large set of continuous-time data that I inherited, and the list (X3) represents the length of each sequence of X2. Unfortunately, I need a mean and SD for each section of the vector. I tried using the aggregate() function
aggregate(X1, by = list(X2), FUN = mean)
but this code aggregates all of the "0" and "1" values together into two values. I need 4 values.
I can't quite figure out how to capture some function of each unique sequence of X2 without doing it manually by creating a factor for each separate sequence.
Another option is to create a grouping variable by comparing the adjacent elements and taking the cumulative sum:
library(dplyr)
data_frame(X1, X2) %>%
  group_by(ID = cumsum(X2 != lag(X2, default = X1[1]))) %>%
  summarise(MEAN = mean(X1), SD = sd(X1))
# A tibble: 4 x 3
#      ID  MEAN    SD
#   <int> <dbl> <dbl>
# 1     1  2.17  1.04
# 2     2  1.90  1.29
# 3     3  6.50  1.91
# 4     4  4.33  1.53
Or if we have the number of elements in 'X3', create the grouping variable with rep and use aggregate from base R
do.call(data.frame, aggregate(X1 ~ cbind(ID = rep(seq_along(X3),
          unlist(X3))), FUN = function(x) c(MEAN = mean(x), SD = sd(x))))
#   ID  X1.MEAN    X1.SD
# 1  1 2.166667 1.040833
# 2  2 1.900000 1.294218
# 3  3 6.500000 1.914854
# 4  4 4.333333 1.527525
First of all, I assume that you want to create a data frame with two columns, X1 and X2. Here is how to create the data frame.
df <- data.frame(X1, X2)
df
#     X1 X2
# 1  1.0  0
# 2  2.5  0
# 3  3.0  0
# 4  0.5  1
# 5  1.0  1
# 6  1.5  1
# 7  3.0  1
# 8  3.5  1
# 9  4.0  0
# 10 6.0  0
# 11 8.0  0
# 12 8.0  0
# 13 6.0  1
# 14 3.0  1
# 15 4.0  1
We can then use the data.table package to calculate the mean and standard deviation of each group. The key is to use the rleid function to create the ID of each group. After that, we can summarize the data. df2 is the final output. X3 is actually not needed as long as you have the X2 column in your data frame.
# Load the package
library(data.table)
# Convert df to a data.table
setDT(df)
# Perform the analysis
df2 <- df[, ID := rleid(X2)][, .(MEAN = mean(X1), SD = sd(X1)), by = ID]
df2[]
#    ID     MEAN       SD
# 1:  1 2.166667 1.040833
# 2:  2 1.900000 1.294218
# 3:  3 6.500000 1.914854
# 4:  4 4.333333 1.527525
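For comparison, a base-R sketch of the same run-grouping idea, building the run ID with rle() instead of data.table::rleid():
r <- rle(X2)                                # runs of identical X2 values
ID <- rep(seq_along(r$lengths), r$lengths)  # run ID for each element
tapply(X1, ID, function(x) c(MEAN = mean(x), SD = sd(x)))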

Arrange function in dplyr 0.7.1

I am trying to use the new quo functionality while writing a function utilizing dplyr and ran into the following issue:
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 3, 1),
  a = sample(5),
  b = sample(5)
)
To arrange the dataframe by a variable is straightforward:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(!!quo_arrange_var)
}
But what if I want to set a preferential order? For example, the arrange variable should put the value 2 at the top and then sort normally. With the previous version of dplyr I would use:
arrange(-(arrange_var == 2), arrange_var)
but in the new structure I am not sure how to approach. I have tried:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-!!quo_arrange_var == 2, !!quo_arrange_var)
}
but I get the error
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 1, expecting : 5
I have also tried using the quo_name:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-!!(paste0(quo_name(quo_arrange_var), "==2")), !!quo_arrange_var)
}
but get this error:
Error in arrange_impl(.data, dots) :
Evaluation error: invalid argument to unary operator.
Any help would be appreciated.
The easiest fix is to put parentheses around the bang-bang. This has to do with operator precedence with respect to ! and ==. When you have !!a==b, it gets parsed as !!(a==b) even though you want (!!a)==b. And, for some reason, you can compare a quosure to a numeric value: quo(a)==2 returns FALSE, so your expression evaluates to arrange(-FALSE, g2), which would give you the same error message.
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-((!!quo_arrange_var) == 2), !!quo_arrange_var)
}
my_arrange(df, g2)
# # A tibble: 5 x 4
#      g1    g2     a     b
#   <dbl> <dbl> <int> <int>
# 1     1     2     5     4
# 2     1     1     2     5
# 3     2     1     4     3
# 4     2     1     3     1
# 5     2     3     1     2
The tidyverse has evolved and there is no need for enquo() here anymore. Instead we enclose expressions in double braces {{ }} (aka we embrace them).
library("tidyverse")
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 3, 1),
  a = sample(5),
  b = sample(5)
)
my_arrange <- function(df, arrange_var) {
  df %>%
    arrange(desc({{ arrange_var }} == 2), {{ arrange_var }})
}
my_arrange(df, g2)
#> # A tibble: 5 × 4
#>      g1    g2     a     b
#>   <dbl> <dbl> <int> <int>
#> 1     1     2     1     2
#> 2     1     1     4     5
#> 3     2     1     3     3
#> 4     2     1     5     1
#> 5     2     3     2     4
packageVersion("tidyverse")
#> [1] '1.3.1'
Created on 2022-03-17 by the reprex package (v2.0.1)

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
With the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for (i in unique(df$id)) {
  ifelse(!unique(df$rank), df$dg == T, df$dg == F)
}
I am trying to mark the $dg variable as T provided that rank differs within each unique id, and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for (i in unique(df$id)) {
  ifelse(length(unique(df$rank)), df$dg == T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for that id. I'm looking to say: for a given id, which has anywhere between 1-13 instances, mark dg TRUE if rank differs across those instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification that the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, T, F))
For another data set (presented below), which also has an id with duplicate as well as non-duplicate ranks, this would be the output:
df2 %>%
  group_by(id) %>%
  mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, T, F))
# Output:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
#      id  rank dg
#   <dbl> <dbl> <lgl>
# 1     1     1 TRUE
# 2     2     2 TRUE
# 3     2     1 TRUE
# 4     3     3 FALSE
# 5     3     3 FALSE
# 6     4     4 TRUE
# 7     5     1 TRUE
# 8     5     1 TRUE
# 9     5     3 TRUE
Data set no. 2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5),
                      rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3)),
                 .Names = c("id", "rank"), row.names = c(NA, -9L),
                 class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use the dplyr package:
library(dplyr)
df %>%
  group_by(id, rank) %>%
  mutate(dg = ifelse(n() > 1, F, T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
#      id  rank dg
#   <dbl> <dbl> <lgl>
# 1     1     1 TRUE
# 2     2     2 TRUE
# 3     2     1 TRUE
# 4     3     3 FALSE
# 5     3     3 FALSE
# 6     4     4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
dt$dg <- ifelse(dt[, dg := .N, by = list(id, rank)]$dg > 1, F, T)
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4),
                     rank = c(1, 2, 1, 3, 3, 4)),
                .Names = c("id", "rank"), row.names = c(NA, -6L),
                class = "data.frame")
# > df
#   id rank
# 1  1    1
# 2  2    2
# 3  2    1
# 4  3    3
# 5  3    3
# 6  4    4
N.B. Unless you want a different identifier rather than TRUE/FALSE, using ifelse() is redundant and computationally costly; see the sketch below. @DavidArenburg
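A minimal sketch of that simplification, assuming the same df as above: the grouped condition is already logical, so it can be assigned directly.
df %>%
  group_by(id, rank) %>%
  mutate(dg = n() == 1)  # TRUE when the (id, rank) pair occurs exactly once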

Sample by group with a condition (R)

I need to randomly select a diary for each individual (id) but only for those who filled more than one.
Let us suppose my data look like this:
dta = rbind(c(1, 1, 'a'),
            c(1, 2, 'a'),
            c(1, 3, 'b'),
            c(2, 1, 'a'),
            c(3, 1, 'b'),
            c(3, 2, 'a'),
            c(3, 3, 'c'))
colnames(dta) <- c('id', 'DiaryNumber', 'type')
dta = as.data.frame(dta)
dta
id DiaryNumber type
 1           1    a
 1           2    a
 1           3    b
 2           1    a
 3           1    b
 3           2    a
 3           3    c
For example, id 1 filled 3 diaries. What I need is to randomly select one of the 3 diaries. Id 2 only filled one diary, so I do not need to do anything with it.
I have no idea how I could do that.
Any ideas ?
You can use sample_n:
library(dplyr)
dta %>% group_by(id) %>% sample_n(1)
## Source: local data frame [3 x 3]
## Groups: id
##
##   id DiaryNumber type
## 1  1           2    a
## 2  2           1    a
## 3  3           1    b
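In current dplyr (1.0.0+), sample_n() is superseded by slice_sample(); a sketch of the same per-id draw:
dta %>%
  group_by(id) %>%
  slice_sample(n = 1)   # one random diary per id; singleton ids keep their row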
Base package:
set.seed(123)
df <- lapply(split(dta, dta$id), function(x) x[sample(nrow(x), 1), ])
do.call("rbind", df)
Output:
  id DiaryNumber type
1  1           1    a
2  2           1    a
3  3           2    a
