Using dplyr's enquo to access Spark table columns via sparklyr - r

I would like to be able to use dplyr's enquo inside an lapply call while iterating over Spark table columns.
lapply(tbl_vars(sprkTbl),
function(col_nme) {
print(col_nme)
# Enquo the column name
quo_col_nme <- enquo(col_nme)
print(quo_col_nme)
sprkTbl %>%
select(!!quo_col_nme) %>%
# do stuff
collect -> dta_res
}) -> l_res
However, when I try to run this code I keep getting this error:
Error in (function (x, strict = TRUE) : the argument has already
been evaluated
I've isolated the error to enquo:
>> lapply(tbl_vars(sprkTbl),
... function(col_nme) {
... print(col_nme)
... # Enquo the column name
... quo_col_nme <- enquo(col_nme)
... # print(quo_col_nme)
...
... # sprkTbl%>%
... # select(!!quo_col_nme) %>%
... # # do stuff
... # collect -> dta_res
... }) -> l_res
[1] "first_column_in_spark"
(and then the same error)
Error in (function (x, strict = TRUE) : the argument has
already been evaluated
I want to understand why enquo can't be used like that. tbl_vars returns an ordinary character vector, so shouldn't col_nme be a string? I would expect the syntax to work in the same manner as in:
mtcars %>% select(!!enquote("am")) %>% head(2)
am
Mazda RX4 1
Mazda RX4 Wag 1
but clearly this is not the case when it is called from within lapply.
Edit
Leaving the sparklyr aspect aside, a simpler and more reproducible example is:
lapply(names(mtcars),function(x) {
col_enq <- enquo(x)
mtcars %>%
select(!!col_enq) %>%
head(2)
})
This produces the identical error.
Desired results
The old underscore-based syntax (select_) works:
lapply(names(mtcars),function(x) {
# col_enq <- enquo(x)
mtcars %>%
select_(x) %>%
head(2)
})
In short, I want to achieve the same functionality while iterating over Spark table columns, and I would prefer not to use the deprecated select_.

Do I understand your question correctly that you are interested in this result? Or are you bound to use enquo instead of quo?
library(dplyr)
lapply(names(mtcars),function(x) {
col_enq <- quo(x)
mtcars %>%
select(!!col_enq) %>%
head(2)
})
#> [[1]]
#> mpg
#> Mazda RX4 21
#> Mazda RX4 Wag 21
#>
#> [[2]]
#> cyl
#> Mazda RX4 6
#> Mazda RX4 Wag 6
#>
#> [[3]]
#> disp
#> Mazda RX4 160
#> Mazda RX4 Wag 160
#>
#> [[4]]
#> hp
#> Mazda RX4 110
#> Mazda RX4 Wag 110
#>
#> [[5]]
#> drat
#> Mazda RX4 3.9
#> Mazda RX4 Wag 3.9
#>
#> [[6]]
#> wt
#> Mazda RX4 2.620
#> Mazda RX4 Wag 2.875
#>
#> [[7]]
#> qsec
#> Mazda RX4 16.46
#> Mazda RX4 Wag 17.02
#>
#> [[8]]
#> vs
#> Mazda RX4 0
#> Mazda RX4 Wag 0
#>
#> [[9]]
#> am
#> Mazda RX4 1
#> Mazda RX4 Wag 1
#>
#> [[10]]
#> gear
#> Mazda RX4 4
#> Mazda RX4 Wag 4
#>
#> [[11]]
#> carb
#> Mazda RX4 4
#> Mazda RX4 Wag 4
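As a possible explanation, offered as a sketch rather than a definitive account: enquo() is meant to capture the expression a caller wrote for a function argument, whereas inside lapply the argument x already holds a plain string. One way to avoid enquo() altogether is to turn the string into a symbol with rlang::sym() and unquote it. The example below uses mtcars; the same pattern would presumably carry over to a Spark table such as sprkTbl, but that is an assumption and is not tested here.

```r
library(dplyr)

# Loop over column names held as strings; sym() converts each string
# into a symbol, which !! then injects into select()
l_res <- lapply(names(mtcars), function(x) {
  mtcars %>%
    select(!!rlang::sym(x)) %>%
    head(2)
})

names(l_res[[9]])
#> [1] "am"
```

With dplyr 1.0.0 or later, select(all_of(x)) is another way to select a column from a string.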

Related

How to view the process step by step of a function in R?

Df:
df <- tibble(
a = c("z", "x", "y"),
b = c("m", "n", "o"),
c = c("p", "q", "r")
)
I would like to inspect the result of names(.) in:
df %>%
set_names(c(names(.)[1], unlist(.[2, 2:3])))
I know it is c("a", "b", "c"), and I could likewise work out the result of unlist(.[2, 2:3]). This is only an example; I want to apply the idea to any operation in R. Is there a tool for this? I want to get a deeper sense of what a function is doing at each step.
Perhaps you're looking for the boomer package by @Moody_Mudskipper:
Installation
Install with remotes::install_github("moodymudskipper/boomer")
Examples (from github):
library(boomer)
boom(1 + !1 * 2)
#> 1 * 2
#> [1] 2
#> !1 * 2
#> [1] FALSE
#> 1 + !1 * 2
#> [1] 1
boom(subset(head(mtcars, 2), qsec > 17))
#> head(mtcars, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#> qsec > 17
#> [1] FALSE TRUE
#> subset(head(mtcars, 2), qsec > 17)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# You can use boom() with {magrittr} pipes, just pipe to boom() at the end of a pipe chain.
library(magrittr)
mtcars %>%
head(2) %>%
subset(qsec > 17) %>%
boom()
#> head(., 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#> qsec > 17
#> [1] FALSE TRUE
#> subset(., qsec > 17)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#> mtcars %>% head(2) %>% subset(qsec > 17)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#If a call fails, {boomer} will print intermediate outputs up to the occurrence of the error, it can help with debugging:
"tomato" %>%
substr(1, 3) %>%
toupper() %>%
sqrt() %>%
boom()
#> substr(., 1, 3)
#> [1] "tom"
#> toupper(.)
#> [1] "TOM"
#> Error in .Primitive("sqrt")(.): non-numeric argument to mathematical function
Boomer prints the output of intermediate steps as they are executed, and thus says nothing about what isn't executed. This is in contrast with functions like lobstr::ast(), which return the parse tree.
Apparently boomer doesn't 'play nice' with the set_names function; here is a similar example:
df %>%
names(.)[1] %>%
boom()
#>names(.)
#>[1] "a" "b" "c"
#>.[names(.), 1]
#> A tibble: 3 x 1
#> a
#> <dbl>
#>1 NA
#>2 NA
#>3 NA
#>df %>% names(.)[1]
#> A tibble: 3 x 1
#> a
#> <dbl>
#>1 NA
#>2 NA
#>3 NA
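The output above hints at what happened: magrittr inserted df as the first argument of `[`, so the call ran as .[names(.), 1] rather than names(.)[1]. A minimal sketch of one workaround is to wrap the right-hand side in braces, so the dot is used only where it is written. The small data frame here is a stand-in for the tibble in the question.

```r
library(magrittr)

df <- data.frame(
  a = c("z", "x", "y"),
  b = c("m", "n", "o"),
  c = c("p", "q", "r"),
  stringsAsFactors = FALSE
)

# Braces stop magrittr from also passing df as the first argument of `[`
first_name <- df %>% { names(.)[1] }
first_name
#> [1] "a"

# The same form evaluates the vector of new names from the question on its own
new_names <- df %>% { c(names(.)[1], unlist(.[2, 2:3])) }
```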
Not sure if this is what you had in mind, but lobstr::ast() shows the abstract syntax tree that R creates to determine how the various inputs and functions in a call are related.
https://medium.com/analytics-vidhya/become-a-better-r-programmer-with-the-awesome-lobstr-package-af97fcd22602
lobstr::ast(df %>%
set_names(c(names(.)[1], unlist(.[2, 2:3]))))
█─`%>%`
├─df
└─█─set_names
  └─█─c
    ├─█─`[`
    │ ├─█─names
    │ │ └─.
    │ └─1
    └─█─unlist
      └─█─`[`
        ├─.
        ├─2
        └─█─`:`
          ├─2
          └─3

remove quotes from colnames?

I have a dataframe of the following form
"column1"  "column2"
1          5
2          6
3          7
How do I remove the quotation marks from the column names? I've tried using gsub, but I can't quote quotation marks haha. I also need a way to do this that isn't just names(data) <- c("column1", "column2"). Thank you all!
You can use gsub with single-quotes in order to reference the double-quote character for replacement:
names(df) = gsub('"', "", names(df))
Test:
# Set up data
d = mtcars[1:3, 1:4]
names(d)[1:2] = c('"column1"', '"column2"')
names(d)
#> [1] "\"column1\"" "\"column2\"" "disp" "hp"
d
#> "column1" "column2" disp hp
#> Mazda RX4 21.0 6 160 110
#> Mazda RX4 Wag 21.0 6 160 110
#> Datsun 710 22.8 4 108 93
# Remove quotation marks from column names
names(d) = gsub('"', "", names(d))
names(d)
#> [1] "column1" "column2" "disp" "hp"
d
#> column1 column2 disp hp
#> Mazda RX4 21.0 6 160 110
#> Mazda RX4 Wag 21.0 6 160 110
#> Datsun 710 22.8 4 108 93
Created on 2021-01-19 by the reprex package (v0.3.0)

Maintain rownames when filter a data frame with %>%

Look at the two lines of code below: myup1 keeps the row names, myup2 does not.
myup1 <- outdf2[outdf2$label == "Up-Regulated", ]
myup2 <- outdf2 %>% filter(label == "Up-Regulated")
Is there a way to keep the row names with the %>% approach?
To expand my comment with an example, we can use add_rownames but it is deprecated, so use tibble::rownames_to_column() instead.
library(dplyr)
library(tibble)
df1 <- mtcars[1:5, 1:3]
df1
# mpg cyl disp
# Mazda RX4 21.0 6 160
# Mazda RX4 Wag 21.0 6 160
# Datsun 710 22.8 4 108
# Hornet 4 Drive 21.4 6 258
# Hornet Sportabout 18.7 8 360
df1[ df1$cyl == 6, ]
# mpg cyl disp
# Mazda RX4 21.0 6 160
# Mazda RX4 Wag 21.0 6 160
# Hornet 4 Drive 21.4 6 258
df1 %>%
rownames_to_column("myCars") %>%
filter(cyl == 6)
# # A tibble: 3 x 4
# myCars mpg cyl disp
# <chr> <dbl> <dbl> <dbl>
# 1 Mazda RX4 21.0 6 160
# 2 Mazda RX4 Wag 21.0 6 160
# 3 Hornet 4 Drive 21.4 6 258
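If the result needs real row names again rather than a name column, one option is to convert the column back after filtering. This is a small sketch on the same df1, and the column name "myCars" is only an illustrative choice:

```r
library(dplyr)
library(tibble)

df1 <- mtcars[1:5, 1:3]

# Move row names into a column, filter, then restore them as row names
res <- df1 %>%
  rownames_to_column("myCars") %>%
  filter(cyl == 6) %>%
  column_to_rownames("myCars")

rownames(res)
#> [1] "Mazda RX4"      "Mazda RX4 Wag"  "Hornet 4 Drive"
```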

Rename multiple columns given character vectors of column names and replacement [duplicate]

This is easy to do with base R, with setnames in data.table, or with rename_ in dplyr 0.5. However, since rename_ is deprecated, I couldn't find an easy way to do it in dplyr 0.6.0.
Below is an example. I want to replace the column names in col.from with the corresponding values in col.to:
col.from <- c("wt", "hp", "vs")
col.to <- c("foo", "bar", "baz")
df <- mtcars
head(df, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Expected output:
names(df)[match(col.from, names(df))] <- col.to
head(df, 2)
#> mpg cyl disp bar drat foo qsec baz am gear carb
#> Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
How can I do this with rename or rename_at in dplyr 0.6.0?
I don't know if this is the right way to approach it, but
library(dplyr)
df %>% rename_at(vars(col.from), function(x) col.to) %>% head(2)
# mpg cyl disp bar drat foo qsec baz am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Also note that I live in the future:
# packageVersion("dplyr")
# # [1] ‘0.7.0’
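For readers on a newer dplyr, a hedged alternative: assuming dplyr 1.0.0 or later (which postdates this thread), rename() accepts a named character vector wrapped in all_of(), where the names are the new column names and the values are the old ones. The lookup variable below is an illustrative helper, not part of the original question.

```r
library(dplyr)

col.from <- c("wt", "hp", "vs")
col.to <- c("foo", "bar", "baz")

# Named vector in new_name = old_name form, as rename() expects
lookup <- setNames(col.from, col.to)

df <- mtcars %>% rename(all_of(lookup))
head(df, 2)
```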

Column names that contain NAs in R

I have a large data frame that contains 900 variables per row. I am trying to write a function that gives me the name of each column that contains a NA for each row.
For example:
x->
mpg cyl disp hp draw wt
Mazda RX4 21.0 6 160 110 NA 2.62
Mazda RX4 Wag 21.0 6 NA 110 3.90 NA
Datsun 710 22.8 4 NA 93 NA NA
I would like a function to return:
Mazda RX4: "draw"
Mazda RX4 Wag: "disp", "wt"
Datsun 710: "disp","draw","wt"
Run apply by row to select from colnames(x). Probably going to get a list since the result is ragged.
apply(x, 1, function(x2) colnames(x)[ is.na(x2) ] )
$`Mazda RX4`
[1] "draw"
$`Mazda RX4 Wag`
[1] "disp" "wt"
$`Datsun 710`
[1] "disp" "draw" "wt"
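One caveat worth knowing: apply() simplifies its result, so if every row happened to have the same number of NAs it would return a vector or matrix instead of a list. A minimal sketch that always returns a named list is shown below; the data frame is rebuilt from the example in the question, and na_mat and na_cols are illustrative names.

```r
x <- data.frame(
  mpg  = c(21.0, 21.0, 22.8),
  cyl  = c(6, 6, 4),
  disp = c(160, NA, NA),
  hp   = c(110, 110, 93),
  draw = c(NA, 3.90, NA),
  wt   = c(2.62, NA, NA),
  row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710")
)

# Logical matrix marking the missing cells
na_mat <- is.na(x)

# lapply() never simplifies, so the result is a list for any input
na_cols <- lapply(seq_len(nrow(x)), function(i) colnames(x)[na_mat[i, ]])
names(na_cols) <- rownames(x)
na_cols
```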

Resources