How do you dynamically add rows to a tibble - r

I am working with the mlogit package. The package has some unforgiving data requirements: for each key in a data set, there must be an identical number of rows.
Here is a reprex with an example:
library(tibble)
## Have this
df <- tibble(
  key = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  y   = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  z   = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)
df
# Want this via tidyverse
df2 <- tibble(
  key = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  y   = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0),
  z   = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)
df2
Created on 2020-05-02 by the reprex package (v0.3.0)
df has three keys: 1, 2 and 3. Key 1 has five rows of observations, key 2 has four rows and key 3 has three rows. I need each key to have five rows of observations and would like to achieve this with the tidyverse. I thought add_row() might be my solution, but I couldn't get it to work. Is this possible?
In my example, I have df as the before and df2 as the desired after.

We could expand the dataset based on the row count within each 'key':
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  mutate(ind = rowid(key)) %>%       # data.table::rowid(): within-key row index
  complete(key, ind) %>%             # pad every key out to the maximum index
  select(-ind) %>%
  fill(z) %>%                        # carry the last z value forward into the padded rows
  mutate(y = replace_na(y, 0))       # padded y values become 0
# A tibble: 15 x 3
# key y z
# <dbl> <dbl> <lgl>
# 1 1 2 TRUE
# 2 1 2 FALSE
# 3 1 2 FALSE
# 4 1 2 FALSE
# 5 1 2 FALSE
# 6 2 2 TRUE
# 7 2 2 FALSE
# 8 2 2 FALSE
# 9 2 2 FALSE
#10 2 0 FALSE
#11 3 2 TRUE
#12 3 2 FALSE
#13 3 2 FALSE
#14 3 0 FALSE
#15 3 0 FALSE
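If you prefer to stay entirely in the tidyverse, here is a sketch of the same idea using row_number() in place of data.table::rowid() (same df as above; not from the original answer):
library(dplyr)
library(tidyr)
df %>%
  group_by(key) %>%
  mutate(ind = row_number()) %>%   # within-key row index
  ungroup() %>%
  complete(key, ind) %>%           # pad every key out to the maximum index
  select(-ind) %>%
  fill(z) %>%                      # carry z forward into the padded rows
  mutate(y = replace_na(y, 0))     # padded y values become 0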

Related

change values of several columns based on value in other column with mutate() and across()

I would like to change columns a and b of my initial data frame (df) to the columns in data frame df_new. Basically, I just want to set the values of columns a and b to 999 if the NAvar column is NA. How is this possible with mutate() and across()?
df <- data.frame(a =1:5, b = 6:10, NAvar = c(NA, 1, 12, 4, NA))
df %>%
  dplyr::filter(is.na(NAvar)) %>%
  select(colnames(df)) %>%
  mutate(across(c("a", "b"), EqualsTo = 999))
df_new <- data.frame(a = c(999, 2:4, 999), b = c(999, 7:9, 999), NAvar = c(NA, 1, 12, 4, NA))
df
df_new
df %>%
  mutate(across(a:b, ~ if_else(is.na(NAvar), 999L, .)))
# a b NAvar
# 1 999 999 NA
# 2 2 7 1
# 3 3 8 12
# 4 4 9 4
# 5 999 999 NA
Note the use of 999L rather than 999, a numeric. If your real data ever blurs the line between integer and numeric, then dplyr::if_else will be problematic (for good reason). If you need to side-step that (perhaps some other calculation is inadvertently converting a and/or b to numeric), here are some options:
mutate(across(a:b, ~ if_else(is.na(NAvar), 999L, as.integer(.))))
mutate(across(a:b, ~ if_else(is.na(NAvar), 999, as.numeric(.))))
## base::ifelse
mutate(across(a:b, ~ ifelse(is.na(NAvar), 999, .)))
The last option takes advantage of the fact that base::ifelse can be perceived as sloppy about types. Realize that ifelse(..., 1, "") may return numeric or character depending on the condition, and not knowing a priori which class will be returned is a risk. The difference between integer and numeric is less risky, but it can still be a problem if/when other expressions require one over the other.
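A minimal illustration of that class instability (a toy call, not from the question):
class(ifelse(c(TRUE, TRUE), 1, ""))   # "numeric"   -- only the numeric branch is used
class(ifelse(c(TRUE, FALSE), 1, ""))  # "character" -- the "" coerces the whole result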
Using base R:
df <- data.frame(a =1:5, b = 6:10, NAvar = c(NA, 1, 12, 4, NA))
cbind(apply(df[1:2], 2, function(x) ifelse(is.na(df$NAvar), 999L, x)), NAvar = df$NAvar)
#> a b NAvar
#> [1,] 999 999 NA
#> [2,] 2 7 1
#> [3,] 3 8 12
#> [4,] 4 9 4
#> [5,] 999 999 NA
Created on 2021-01-18 by the reprex package (v0.3.0)
or
df[1:2] <- lapply(df[1:2], function(x) ifelse(is.na(df$NAvar), 999L, x))
df
#> a b NAvar
#> 1 999 999 NA
#> 2 2 7 1
#> 3 3 8 12
#> 4 4 9 4
#> 5 999 999 NA
Created on 2021-01-19 by the reprex package (v0.3.0)

Replacing NA's with LOCF using Sparklyr

My aim is to replace NAs in a Spark data frame using the last-observation-carried-forward (LOCF) method. I wrote the following code and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone could recommend a better approach or improve the code.
Example and code with sparklyr
In the following example, NAs are replaced after ordering the rows by time and grouping them by grp.
library(sparklyr)
sc <- spark_connect(master = "local")  # connection not shown in the original question; assumed local here
df_with_nas <- data.frame(
  time = seq(as.Date('2001/01/01'), as.Date('2010/01/01'), length.out = 10),
  grp  = c(rep(1, 5), rep(2, 5)),
  v1   = c(1, rep(NA, 3), 5, rep(NA, 5)),
  v2   = c(NA, NA, 3, rep(NA, 4), 3, NA, NA)
)
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
  spark_apply(function(df) {
    library(dplyr)
    na_locf <- function(x) {
      v <- !is.na(x)
      c(NA, x[v])[cumsum(v) + 1]
    }
    df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate_at(vars(-v1, -grp), funs(na_locf(.)))
  })
# # Source: spark<?> [?? x 4]
# time grp v1 v2
# <dbl> <dbl> <dbl> <dbl>
# 1 11323 1 1 NaN
# 2 11688. 1 NaN NaN
# 3 12053. 1 NaN 3
# 4 12419. 1 NaN 3
# 5 12784. 1 5 3
# 6 13149. 2 NaN NaN
# 7 13514. 2 NaN NaN
# 8 13880. 2 NaN 3
# 9 14245. 2 NaN 3
# 10 14610 2 NaN 3
data.table
The following approach with data.table works quite fast for the data I have. I expect the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
by = grp, .SDcols = cols]
I did this sort of loop; it is quite slow...
df_with_nas <- df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until no NAs remain; then collect(df_with_nas) will execute the computation.
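A rough sketch of that "repeat until no NAs remain" idea, reusing the loop body above (untested; v1 and v2 are the value columns from the example, and the NA count collects only a single summary row):
prev <- Inf
repeat {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
  remaining <- df_with_nas %>%
    summarise(n_na = sum(as.integer(is.na(v1)) + as.integer(is.na(v2)), na.rm = TRUE)) %>%
    collect() %>%
    pull(n_na)
  # leading NAs can never be filled by a lag, so also stop once the count stops shrinking
  if (remaining == 0 || remaining >= prev) break
  prev <- remaining
}
collect(df_with_nas)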
You can leverage the spark_apply() function and run the na.locf function on each of your cluster nodes.
Install R runtimes on each of your cluster nodes.
Install the zoo R package on each node as well.
Run spark_apply() this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want to use is LAST_VALUE(column, true) OVER (window): this searches over the window for the most recent value of column that is not NA (passing true as the second argument to LAST_VALUE sets ignoreNulls = true). Since you want to look backwards from the current value, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
OVER (PARTITION BY v2
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
AS last_non_na
FROM test_table") %>%
sdf_register() %>%
mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#> v1 v2 time last_non_na
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 NaN
#> 2 2 1 2 1
#> 3 2 1 3 2
#> 4 3 1 4 2
#> 5 3 1 5 3
#> 6 NaN 2 1 NaN
#> 7 5 2 2 NaN
#> 8 6 2 3 5
#> 9 6 2 4 6
Created on 2019-08-27 by the reprex package (v0.3.0)

Arrange function in dplyr 0.7.1

I am trying to use the new quo functionality while writing a function utilizing dplyr and ran into the following issue:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 3, 1),
a = sample(5),
b = sample(5)
)
Arranging the data frame by a variable is straightforward:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(!!quo_arrange_var)
}
But what if I want to set a preferential order? For example, rows where the arrange variable equals 2 should come first, with the rest then sorted normally. With the previous version of dplyr I would use:
arrange(-(arrange_var == 2), arrange_var)
but in the new structure I am not sure how to approach. I have tried:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-!!quo_arrange_var == 2, !!quo_arrange_var)
}
but I get the error
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 1, expecting : 5
I have also tried using the quo_name:
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-!!(paste0(quo_name(quo_arrange_var), "==2")), !!quo_arrange_var)
}
but get this error:
Error in arrange_impl(.data, dots) :
Evaluation error: invalid argument to unary operator.
Any help would be appreciated.
The easiest fix is to put parentheses around the bang-bang. This has to do with operator precedence between ! and ==. When you write !!a == b, it gets parsed as !!(a == b) even though you want (!!a) == b. And, perhaps surprisingly, you can compare a quosure to a numeric value: quo(a) == 2 returns FALSE, so your expression evaluates to arrange(-FALSE, g2), which gives the same error message.
my_arrange <- function(df, arrange_var) {
  quo_arrange_var <- enquo(arrange_var)
  df %>%
    arrange(-((!!quo_arrange_var) == 2), !!quo_arrange_var)
}
my_arrange(df, g2)
# # A tibble: 5 x 4
# g1 g2 a b
# <dbl> <dbl> <int> <int>
# 1 1 2 5 4
# 2 1 1 2 5
# 3 2 1 4 3
# 4 2 1 3 1
# 5 2 3 1 2
The tidyverse has evolved and there is no need for enquo() anymore. Instead we enclose expressions in double braces {{ }} (i.e. we "embrace" them).
library("tidyverse")
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 3, 1),
a = sample(5),
b = sample(5)
)
my_arrange <- function(df, arrange_var) {
  df %>%
    arrange(desc({{ arrange_var }} == 2), {{ arrange_var }})
}
my_arrange(df, g2)
#> # A tibble: 5 × 4
#> g1 g2 a b
#> <dbl> <dbl> <int> <int>
#> 1 1 2 1 2
#> 2 1 1 4 5
#> 3 2 1 3 3
#> 4 2 1 5 1
#> 5 2 3 2 4
packageVersion("tidyverse")
#> [1] '1.3.1'
Created on 2022-03-17 by the reprex package (v2.0.1)

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for (i in unique(df$id)) {
  ifelse(!unique(df$rank), df$dg == T, df$dg == F)
}
I am trying to mark the $dg variable as T providing that rank is different for each unique id and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for (i in unique(df$id)) {
  ifelse(length(unique(df$rank)), df$dg == T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for that id. I'm looking to say: for a given id, which has anywhere between 1 and 13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, T, F))
For another data set (presented below), which also has an id with duplicated rows as well as a non-duplicated rank, this would be the output:
df2 %>%
  group_by(id) %>%
  mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, T, F))
# Output:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data set 2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
  group_by(id, rank) %>%
  mutate(dg = ifelse(n() > 1, F, T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
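For example (a small illustration of that note, not from the original answer):
res <- df %>%
  group_by(id, rank) %>%
  mutate(dg = ifelse(n() > 1, F, T)) %>%
  ungroup() %>%        # drop the grouping first
  as.data.frame()      # back to a plain data.frame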
A data.table solution would be:
dt <- data.table(df)
dt$dg <- ifelse(dt[, dg := .N, by = list(id, rank)]$dg > 1, F, T)
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want a different identifier rather than TRUE/FALSE, wrapping the condition in ifelse() is redundant and costs extra computation. (per @DavidArenburg)
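A sketch of that simplification applied to the grouped dplyr version above; the comparison already returns TRUE/FALSE, so ifelse() can be dropped:
df %>%
  group_by(id, rank) %>%
  mutate(dg = n() == 1) %>%   # TRUE when the id/rank combination occurs exactly once
  ungroup()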

For each observation, find a corresponding centile on a subset determined by factor

Assume I have a data frame like so:
df<-data.frame(f=rep(c("a", "b", "c", "d"), 100), value=rnorm(400))
I want to create a new column, which will contain a centile that an observation belongs to, calculated separately on each factor level.
What would be a reasonably simple and efficient way to do that? The closest I came to a solution was
df$newColumn<-findInterval(df$value, tapply(df$value, df$f, quantile, probs=seq(0, 0.99, 0.01))$df[, "f"])
However, this just gives zeros to all observations. The tapply returns a four-element list of quantile vectors and I'm not sure how to access a relevant element for each observation to pass as an argument for the findInterval function.
The number of rows in the data frame could reach a few millions, so speed is an issue too. The factor column will always have four levels.
With dplyr (note that quantile(value) with its default probs gives quartile breaks; pass probs = seq(0, 1, 0.01) for centiles):
library(dplyr)
df %>%
  group_by(f) %>%
  mutate(quant = findInterval(value, quantile(value)))
#> Source: local data frame [400 x 3]
#> Groups: f [4]
#>
#> f value quant
#> <fctr> <dbl> <int>
#> 1 a 0.51184061 3
#> 2 b 0.44362348 3
#> 3 c -1.04869448 1
#> 4 d -2.41772425 1
#> 5 a 0.10738332 3
#> 6 b -0.58630348 1
#> 7 c 0.34376820 3
#> 8 d 0.68322738 4
#> 9 a 1.00232314 4
#> 10 b 0.05499391 3
#> # ... with 390 more rows
With data.table:
library(data.table)
dt <- setDT(df)
dt[, quant := findInterval(value, quantile(value)), by = f]
dt
#> f value quant
#> 1: a 0.3608395 3
#> 2: b -0.1028948 2
#> 3: c -2.1903336 1
#> 4: d 0.7470262 4
#> 5: a 0.5292031 3
#> ---
#> 396: d -1.3475332 1
#> 397: a 0.1598605 3
#> 398: b -0.4261003 2
#> 399: c 0.3951650 3
#> 400: d -1.4409000 1
Data:
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))
I think that data.table is faster; however, a solution without using packages is:
Define a function based on cut() (or findInterval()) together with quantile():
cut2 <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)),
      include.lowest = TRUE, labels = 1:100)
}
then apply it by factor level using ave():
df$newColumn <- ave(df$value, df$f, FUN = cut2)
