I am trying to use the new quo functionality while writing a function utilizing dplyr and ran into the following issue:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 3, 1),
a = sample(5),
b = sample(5)
)
To arrange the dataframe by a variable is straightforward:
my_arrange <- function(df, arrange_var) {
quo_arrange_var <- enquo(arrange_var)
df %>%
arrange(!!quo_arrange_var)
}
But what if I want to set a preferential order? For example, rows where the arrange variable equals 2 come first, and the rest sort normally. With the previous version of dplyr I would use:
arrange(-(arrange_var == 2), arrange_var)
but in the new structure I am not sure how to approach this. I have tried:
my_arrange <- function(df, arrange_var) {
quo_arrange_var <- enquo(arrange_var)
df %>%
arrange(-!!quo_arrange_var==2, !!quo_arrange_var)
}
but I get the error
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 1, expecting : 5
I have also tried using the quo_name:
my_arrange <- function(df, arrange_var) {
quo_arrange_var <- enquo(arrange_var)
df %>%
arrange(-!!(paste0(quo_name(quo_arrange_var), "==2")), !!quo_arrange_var)
}
but get this error:
Error in arrange_impl(.data, dots) :
Evaluation error: invalid argument to unary operator.
any help would be appreciated
The easiest fix is to put parentheses around the bang-bang. This comes down to operator precedence between ! and ==: !!a==b gets parsed as !!(a==b), even though you want (!!a)==b. And, for some reason, you can compare a quosure to a numeric value: quo(a)==2 returns FALSE, so your expression evaluates to arrange(-FALSE, g2), which would give you the same error message.
my_arrange <- function(df, arrange_var) {
quo_arrange_var <- enquo(arrange_var)
df %>%
arrange(-((!!quo_arrange_var)==2), !!quo_arrange_var)
}
my_arrange(df, g2)
# # A tibble: 5 x 4
# g1 g2 a b
# <dbl> <dbl> <int> <int>
# 1 1 2 5 4
# 2 1 1 2 5
# 3 2 1 4 3
# 4 2 1 3 1
# 5 2 3 1 2
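The precedence claim can be checked without dplyr by inspecting the parse tree with base R's quote(). This is only an illustrative sketch: the symbol a is a placeholder and is never evaluated.

```r
# Inspect how R parses `!!a == 2` (nothing is evaluated; `a` is a placeholder symbol).
expr <- quote(!!a == 2)

outer_fn <- expr[[1]]        # the outermost call is `!`
inner    <- expr[[2]][[2]]   # the expression both `!` are applied to
print(outer_fn)
print(inner)                 # the comparison `a == 2`, i.e. the whole thing is !!(a == 2)

# With explicit parentheses, the comparison becomes the outermost call.
fixed <- quote((!!a) == 2)
print(fixed[[1]])
```

Because `!` binds more loosely than `==`, the unquoting would apply to the whole comparison unless the parentheses are added.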
The tidyverse has evolved and there is no need for enquo anymore. Instead, we enclose expressions in double braces {{ }} (that is, we embrace them).
library("tidyverse")
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 3, 1),
a = sample(5),
b = sample(5)
)
my_arrange <- function(df, arrange_var) {
df %>%
arrange(desc({{ arrange_var }} == 2), {{ arrange_var }})
}
my_arrange(df, g2)
#> # A tibble: 5 × 4
#> g1 g2 a b
#> <dbl> <dbl> <int> <int>
#> 1 1 2 1 2
#> 2 1 1 4 5
#> 3 2 1 3 3
#> 4 2 1 5 1
#> 5 2 3 2 4
packageVersion("tidyverse")
#> [1] '1.3.1'
Created on 2022-03-17 by the reprex package (v2.0.1)
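The same pattern extends naturally if the preferred value should not be hard-coded. The sketch below is an assumption-laden variation, not part of the answer above: the function name my_arrange_first and the argument first_value are hypothetical, and the data frame is reduced to the two grouping columns so the result is deterministic.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 3, 1)
)

# Hypothetical helper: rows where `arrange_var` equals `first_value` come first,
# then the remaining rows are sorted by `arrange_var` as usual.
my_arrange_first <- function(df, arrange_var, first_value) {
  df %>%
    arrange(desc({{ arrange_var }} == first_value), {{ arrange_var }})
}

out <- my_arrange_first(df, g2, first_value = 3)
out$g2
```

Here first_value is an ordinary value, so it needs no embracing; only the column argument does.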
I have a tibble listing the explicit "id" values and column names of the cells I need to convert to NA. Is there any way I can create the NAs without making my df a long dataset? I considered using the new rows_update function, but I'm not sure if this is correct because I only want certain columns to be NA.
library(dplyr)
to_na <- tribble(~x, ~col,
1, "z",
3, "y"
)
df <- tibble(x = c(1,2,3),
y = c(1,1,1),
z = c(2,2,2))
# desired output:
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 1 1 NA
#> 2 2 1 2
#> 3 3 NA 2
Created on 2020-07-03 by the reprex package (v0.3.0)
This definitely isn't the most elegant solution, but it gets the output you want.
library(dplyr)
library(purrr)
to_na <- tribble(~x, ~col,
1, "z",
3, "y"
)
df <- tibble(x = c(1,2,3),
y = c(1,1,1),
z = c(2,2,2))
map2(to_na$x, to_na$col, #Pass through these two objects in parallel
function(xval_to_missing, col) df %>% #Two objects above matched by position here.
mutate_at(col, #mutate_at the specified cols
~if_else(x == xval_to_missing, NA_real_, .) #if x == xval_to_missing, make NA, else keep as is.
) %>%
select(x, col) #keep x and the modified column.
) %>% #end of map2
reduce(left_join, by = "x") %>% #merge within the above list, by x.
relocate(x, y, z) #Keep your ordering
Output:
# A tibble: 3 x 3
x y z
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 1 2
3 3 NA 2
We can use row/column indexing to assign the values to NA in base R
df <- as.data.frame(df)
df[cbind(to_na$x, match(to_na$col, names(df)))] <- NA
df
# x y z
#1 1 1 NA
#2 2 1 2
#3 3 NA 2
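Note that the indexing above uses to_na$x directly as row numbers, which works here because x happens to equal the row position. If the ids are not row positions, one possible adjustment (an assumption about the data, not something stated in the question) is to look the rows up with match() as well. The ids 10, 20, 30 below are made up for illustration.

```r
# Illustrative data: ids that are NOT equal to the row numbers.
df <- data.frame(x = c(10, 20, 30), y = c(1, 1, 1), z = c(2, 2, 2))
to_na <- data.frame(x = c(10, 30), col = c("z", "y"), stringsAsFactors = FALSE)

# Two-column matrix of (row, column) positions, both found by lookup.
idx <- cbind(
  match(to_na$x, df$x),          # row whose id matches
  match(to_na$col, names(df))    # column whose name matches
)
df[idx] <- NA
df
```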
If we want to use rows_update
library(dplyr)
library(tidyr)
library(purrr)
lst1 <- to_na %>%
mutate(new = NA_real_) %>%
split(seq_len(nrow(.))) %>%
map(~ .x %>%
pivot_wider(names_from = col, values_from = new))
for(i in seq_along(lst1)) df <- rows_update(df, lst1[[i]])
df
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 2 1 2
#3 3 NA 2
I am working with the mlogit package. The package has some unforgiving data requirements. For each key in a data set, there must be an identical number of rows.
Here is a reprex with an example:
library(tibble)
## Have this
df <- tibble(key = c(1,1,1,1,1,2,2,2,2,3,3,3), y = c(2,2,2,2,2,2,2,2,2,2,2,2), z = c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE))
## Want this via tidyverse
df2 <- tibble(key = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3), y = c(2,2,2,2,2,2,2,2,2,0,2,2,2,0,0), z = c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE))
Created on 2020-05-02 by the reprex package (v0.3.0)
df has three keys: 1, 2 and 3. Key 1 has five rows of observations, key 2 has four rows and key 3 has three rows. I need each key to have five rows of observations and would like to achieve this with the tidyverse. I thought add_row() might be my solution, but I couldn't get it to work. Is this possible?
In my example, I have df as the before and df2 as the desired after.
We could expand the dataset based on the count of 'key' column
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(ind = rowid(key)) %>%
complete(key, ind) %>%
select(-ind) %>%
fill(z) %>%
mutate(y = replace_na(y, 0))
# A tibble: 15 x 3
# key y z
# <dbl> <dbl> <lgl>
# 1 1 2 TRUE
# 2 1 2 FALSE
# 3 1 2 FALSE
# 4 1 2 FALSE
# 5 1 2 FALSE
# 6 2 2 TRUE
# 7 2 2 FALSE
# 8 2 2 FALSE
# 9 2 2 FALSE
#10 2 0 FALSE
#11 3 2 TRUE
#12 3 2 FALSE
#13 3 2 FALSE
#14 3 0 FALSE
#15 3 0 FALSE
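If loading data.table only for rowid() is undesirable, a within-group row counter can be built with dplyr alone. This is a sketch of an alternative, not the answer's method; it assumes the padding rows should get y = 0 and z = FALSE, as in the desired df2, and passes those values through the fill argument of complete().

```r
library(dplyr)
library(tidyr)

df <- tibble(
  key = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  y   = rep(2, 12),
  z   = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

out <- df %>%
  group_by(key) %>%
  mutate(ind = row_number()) %>%   # position of each row within its key
  ungroup() %>%
  complete(key, ind, fill = list(y = 0, z = FALSE)) %>%  # add the missing key/ind pairs
  select(-ind)

count(out, key)
```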
My aim is to replace NAs in a Spark data frame using the Last Observation Carried Forward method. I wrote the following code and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and Code with Sparklyr
In the following example, NA's are replaced after ordering them using the time and grouping them by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
as.Date('2010/01/01'), length.out = 10),
grp = c(rep(1, 5), rep(2, 5)),
v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
spark_apply(function(df) {
library(dplyr)
na_locf <- function(x) {
v <- !is.na(x)
c(NA, x[v])[cumsum(v) + 1]
}
df %>% arrange(time) %>% group_by(grp) %>% mutate_at(vars(-v1, -grp),
funs(na_locf(.)))
})
# # Source: spark<?> [?? x 4]
# time grp v1 v2
# <dbl> <dbl> <dbl> <dbl>
# 1 11323 1 1 NaN
# 2 11688. 1 NaN NaN
# 3 12053. 1 NaN 3
# 4 12419. 1 NaN 3
# 5 12784. 1 5 3
# 6 13149. 2 NaN NaN
# 7 13514. 2 NaN NaN
# 8 13880. 2 NaN 3
# 9 14245. 2 NaN 3
# 10 14610 2 NaN 3
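To make the na_locf helper above easier to follow, here it is applied to a plain vector outside Spark. The input vector is made up for illustration; the helper itself is copied from the code above.

```r
# Last observation carried forward, as defined in the question.
na_locf <- function(x) {
  v <- !is.na(x)               # TRUE where a value is observed
  c(NA, x[v])[cumsum(v) + 1]   # cumsum(v) counts observed values seen so far;
                               # index 1 (the leading NA) is used before the first one
}

x <- c(NA, 1, NA, NA, 5, NA)
filled <- na_locf(x)
filled
```

A leading NA stays NA because there is no earlier observation to carry forward.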
data.table
The following approach with data.table works quite fast for the data I have. I expect the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
by = grp, .SDcols = cols]
I did this sort of loop, but it is quite slow:
df_with_nas = df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for(n in 1:50){
df_with_nas = df_with_nas %>%
arrange(row) %>%
mutate_all(~if_else(is.na(.),lag(.,1),.))
}
Run the loop until no NAs remain, then call
collect(df_with_nas)
which will run the code.
You can leverage the spark_apply() function and run the na.locf function in each of your cluster nodes.
Install R runtimes on each of your cluster nodes.
Install the zoo R package on each node as well.
Run spark_apply this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want to use is LAST_VALUE(column, true) OVER (window): this searches over the window for the most recent column value which is not NA (passing "true" to LAST_VALUE sets ignore NA = true). Since you want to look backwards from the current value, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
OVER (PARTITION BY v2
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
AS last_non_na
FROM test_table") %>%
sdf_register() %>%
mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#> v1 v2 time last_non_na
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 NaN
#> 2 2 1 2 1
#> 3 2 1 3 2
#> 4 3 1 4 2
#> 5 3 1 5 3
#> 6 NaN 2 1 NaN
#> 7 5 2 2 NaN
#> 8 6 2 3 5
#> 9 6 2 4 6
Created on 2019-08-27 by the reprex package (v0.3.0)
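For readers without a Spark session at hand, the window logic can be emulated locally in base R to see what it computes. This is only a sketch of the semantics (last non-NA value among the preceding rows of each group), not how Spark executes it; the helper names locf and last_non_na_before are made up for this illustration.

```r
test_table <- data.frame(
  v1   = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
  v2   = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
)

# Emulate PARTITION BY v2 ORDER BY time by sorting first.
test_table <- test_table[order(test_table$v2, test_table$time), ]

# Carry the last observed value forward, including the current row.
locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v) + 1]
}

# Shift by one row so that only PRECEDING rows are considered.
last_non_na_before <- function(x) c(NA, head(locf(x), -1))

test_table$last_non_na <- ave(test_table$v1, test_table$v2, FUN = last_non_na_before)
test_table$v1_filled <- ifelse(is.na(test_table$v1), test_table$last_non_na, test_table$v1)
test_table
```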
Problem:
I have a function that uses an argument to index into an internal data.frame, but returns an integer. However, when I run the function in dplyr::mutate to create a new variable based on another variable in a data.frame, I get an error:
Error in mutate_impl(.data, dots) :
Evaluation error: duplicate subscripts for columns.
This appears to be caused by the internal indexing of the data frame using the index position of the variable, instead of its value.
How can I solve this?
Example:
In this function I need to index into an internal data.frame and use this in the calculation of the result. Function and data:
toyfun <- function(thing1){
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[, thing1] <- 1
thing3 <- sum(thing2[1,]) + thing1
return(thing3)
}
toydat <- tibble(thing1 = c(4, 3, 2, 1, 1, 2))
Function does as expected:
toyfun(thing1 = toydat$thing1[1])
#[1] 5
But if I want to calculate the function with each element of a variable in a tibble or data.frame, with mutate, it fails:
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
# Error in mutate_impl(.data, dots) :
# Evaluation error: duplicate subscripts for columns.
If we just use the first 4 rows (or fewer) of toydat, and note that the internal data.frame in toyfun is 4 columns wide, it works fine
toydat[1:4,] %>%
mutate(thing4 = toyfun(thing1 = thing1))
# # A tibble: 4 x 2
# thing1 thing4
# <dbl> <dbl>
# 1 4 5
# 2 3 4
# 3 2 3
# 4 1 2
But again, if we use 5 rows, so going over the index value of the internal data.frame, we fail again:
toydat[1:5,] %>%
mutate(thing4 = toyfun(thing1 = thing1))
# Error in mutate_impl(.data, dots) :
# Evaluation error: duplicate subscripts for columns.
Crux of the issue
This result seems to illustrate that the problem is with this internal indexing using the index value from thing1 rather than its actual value. This is weird because, as the 4-row example above shows, the returned values in thing4 are what they should be when the values of thing1 are used to calculate the result.
NB: The same problem doesn't occur with sapply:
sapply(toydat$thing1, toyfun)
# [1] 5 4 3 2 2 3
Any ideas on ways around this in the dplyr type framework so I can keep the work flow consistent?
The issue is that mutate sends the entire column to the function at once.
Let's debug the function
toyfun <- function(thing1){
browser()
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[,thing1] <- 1
thing3 <- thing1 + 1
return(thing3)
}
Now we run the mutate command
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
#Called from: toyfun(thing1 = thing1)
#Browse[1]> thing1
#[1] 4 3 2 1 1 2
As thing1 contains duplicate column indices (for example, 1 appears twice), it gives an error.
It is the same as
df <- mtcars
df[, c(5, 5)] <- 1
# Error in `[<-.data.frame`(`*tmp*`, , c(5, 5), value = 1) :
# duplicate subscripts for columns
Now let's look at sapply call
sapply(toydat$thing1, toyfun)
#Called from: FUN(X[[i]], ...)
#Browse[1]> thing1
#[1] 4
sapply passes the values one by one, hence there is no error.
This is the same as
df <- mtcars
df[, 5] <- 1
df[, 5] <- 1
which doesn't give any error.
To resolve the error we can use unique to get only unique entries of thing1
toyfun <- function(thing1){
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[,unique(thing1)] <- 1
thing3 <- thing1 + 1
return(thing3)
}
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
# A tibble: 6 x 2
# thing1 thing4
# <dbl> <dbl>
#1 4 5
#2 3 4
#3 2 3
#4 1 2
#5 1 2
#6 2 3
and this would also continue to work with sapply
sapply(toydat$thing1, toyfun)
#[1] 5 4 3 2 2 3
If you do not want to change the function, another option is to use rowwise, which works the same as sapply and sends each individual value one by one to the function
toydat %>%
rowwise() %>%
mutate(thing4 = toyfun(thing1 = thing1))
#Called from: toyfun(thing1 = thing1)
#Browse[1]> thing1
#[1] 4
toydat %>%
rowwise() %>%
mutate(thing4 = toyfun(thing1 = thing1))
# thing1 thing4
# <dbl> <dbl>
#1 4 5
#2 3 4
#3 2 3
#4 1 2
#5 1 2
#6 2 3
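A further option, offered here only as a sketch, is to keep the original function untouched and do the element-by-element iteration inside mutate with vapply(), which is close to what the sapply call does. The function and data are the ones from the question.

```r
library(dplyr)

# Original function from the question, unchanged.
toyfun <- function(thing1) {
  thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
  thing2[, thing1] <- 1
  thing3 <- sum(thing2[1, ]) + thing1
  return(thing3)
}

toydat <- tibble(thing1 = c(4, 3, 2, 1, 1, 2))

# vapply() calls toyfun once per element and checks each result is a single number.
out <- toydat %>%
  mutate(thing4 = vapply(thing1, toyfun, numeric(1)))
out
```

Unlike the unique() fix, this leaves the sum over the internal data.frame computed per element, as in the original definition.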
Hope this was clear and helpful.
What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer.
I'm guessing something along the lines of lapply, but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!
Yes, lapply works well here:
lapply(listofdfs, function(df) { # loop through each data.frame in list
df[ , 3] <- as.integer(df[ , 3]) # make the 3rd column of type integer
df # return the new data.frame
})
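One caveat, stated as an assumption about the input: if the list elements are tibbles rather than base data.frames, df[ , 3] returns a one-column tibble instead of a vector, so as.integer() on it may fail. Indexing with [[ extracts the column as a vector in both cases. A minimal sketch with made-up data (the name listofdfs follows the answer above):

```r
# Made-up example list of four data frames with a double third column.
one_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9))
listofdfs <- list(one_df, one_df, one_df, one_df)

converted <- lapply(listofdfs, function(df) {
  df[[3]] <- as.integer(df[[3]])  # [[ ]] extracts the column as a plain vector
  df                              # return the modified data frame
})

sapply(converted, function(df) class(df[[3]]))
```

lapply returns a new list, so the result has to be assigned; the original list is left unchanged.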
This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
map(myList, ~(.x %>% mutate_at(vars(3), funs(as.integer(.)))))
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
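In more recent dplyr releases (1.0.0 and later), funs() is deprecated and mutate_at() is superseded by across(). Assuming such a version is installed, the same conversion might be written as follows; the data are the ones from the input above.

```r
library(dplyr)
library(purrr)

df <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)

# across(3, ...) selects the third column by position.
converted <- map(myList, ~ mutate(.x, across(3, as.integer)))

map_chr(converted, ~ class(.x$d))
```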
You can use this:
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double