I have a data frame in which every column consists of a number followed by text, e.g. 533 234r/r.
The following code to get rid of the text works well:
my_data <- my_data %>%
  mutate(column1 = str_extract(column1, '.+?(?=[a-z])'))
I would like to do it for multiple columns:
col_names <- names(my_data)
for (i in 1:length(col_names)) {
  my_data <- my_data %>%
    mutate(col_names[i] = str_extract(col_names[i], '.+?(?=[a-z])'))
}
But it returns an error:
Error: unexpected '=' in:
" my_data <- my_data %>%
mutate(col_names[i] ="
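I think mutate_all() wouldn't work either, because str_extract() requires the column name as an argument.
If the column names are strings, convert each one to a symbol with rlang::sym() and unquote it with !! on the right-hand side; on the left-hand side, unquote the string with !! and assign with := instead of =.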
library(dplyr)
library(stringr)
col_names <- names(my_data)
for (i in seq_along(col_names)) {
  my_data <- my_data %>%
    mutate(!! col_names[i] :=
             str_extract(!!rlang::sym(col_names[i]), '.+?(?=[a-z])'))
}
In the tidyverse, this can be done without a for loop by using across() (dplyr version >= 1.0):
my_data <- my_data %>%
  mutate(across(everything(), ~ str_extract(., '.+?(?=[a-z])')))
If the dplyr version is older, use mutate_all():
my_data <- my_data %>%
  mutate_all(~ str_extract(., '.+?(?=[a-z])'))
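As a quick check of what the pattern does, here is a minimal sketch on a made-up two-column tibble (the data and the names toy and cleaned are illustrative, not from the question). The lazy .+? combined with the lookahead (?=[a-z]) keeps everything before the first lowercase letter, so spaces inside the number survive, the result is still a character column, and a value without any lowercase letter would come back as NA.

```r
library(dplyr)
library(stringr)

# Hypothetical data: every value is a number followed by text
toy <- tibble(
  a = c("533 234r/r", "12kg"),
  b = c("7m", "1 000km")
)

# Apply the same extraction to every column
cleaned <- toy %>%
  mutate(across(everything(), ~ str_extract(.x, ".+?(?=[a-z])")))

cleaned$a  # "533 234" "12"
cleaned$b  # "7" "1 000"
```

If numeric columns are needed afterwards, the spaces would have to be removed before calling as.numeric().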
Related
I have been reading from this SO post on how to work with string references to variables in dplyr.
I would like to mutate an existing column based on string input:
var <- 'vs'
my_mtcars <- mtcars %>%
  mutate(get(var) = factor(get(var)))
Error: unexpected '=' in:
"my_mtcars <- mtcars %>%
mutate(get(var) ="
Also tried:
my_mtcars <- mtcars %>%
  mutate(!! rlang::sym(var) = factor(!! rlang::symget(var)))
This resulted in the exact same error message.
How can I do the following based on passing string 'vs' within var variable to mutate?
# works
my_mtcars <- mtcars %>%
  mutate(vs = factor(vs))
This can be done by unquoting the string with !! on the left-hand side and assigning with :=, while on the right-hand side the string is converted to a symbol with rlang::sym() and unquoted with !!:
library(dplyr)
my_mtcars <- mtcars %>%
  mutate(!! var := factor(!! rlang::sym(var)))
class(my_mtcars$vs)
#[1] "factor"
Alternatively, use mutate_at(), which accepts column names held in strings inside vars() and applies the function of interest:
my_mtcars2 <- mtcars %>%
  mutate_at(vars(var), factor)
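Two other spellings of the same idea are sketched below, assuming dplyr >= 1.0: across() with all_of() for the selection, and the .data pronoun for looking the column up by name. The object names by_across and by_pronoun are only illustrative.

```r
library(dplyr)

var <- "vs"

# across() + all_of() selects the column named by the string
by_across <- mtcars %>%
  mutate(across(all_of(var), factor))

# .data[[var]] looks the column up by name on the right-hand side
by_pronoun <- mtcars %>%
  mutate(!!var := factor(.data[[var]]))

class(by_across$vs)
class(by_pronoun$vs)
```

Both variants avoid building a symbol by hand, which some readers may find easier to follow than !! rlang::sym(var).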
The gist of this question is that I have some R code which works fine on a local data frame, but fails on a Spark data frame, even though the two tables are otherwise identical.
In R, given a dataframe of all character columns, one can dynamically type cast all the columns to numeric that can be safely converted to numeric with the following code:
require(dplyr)
require(varhandle)
require(sparklyr)
checkNumeric <- function(column)
{
  # TRUE only if every value in the column can be read as a number
  column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
typeCast <- function(df)
{
  columns <- colnames(df)
  numericIdx <- df %>% mutate(across(columns, checkNumeric)) %>% .[1,]
  doThese <- columns[which(numericIdx==TRUE)]
  # all_of() belongs inside vars(), not around it
  df <- df %>% mutate_at(vars(all_of(doThese)), as.numeric)
  return(df)
}
For a trivial example, one could run:
df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
newDF <- df %>% typeCast
class(df$Sepal.Length)
class(newDF$Sepal.Length)
Now, this code will not work on a dataset like starwars, which has list columns. But for other data frames, I would expect this code to work just fine on a Spark data frame. It doesn't. That is:
sc <- spark_connect('yarn', config=config) # define your Spark configuration somewhere, that's outside the scope of this question
df <- copy_to(sc, iris, "iris")
newDF <- df %>% typeCast
This fails with the following error:
Error in .[1, ] : incorrect number of dimensions
When debugging, if we try to run this code:
columns <- colnames(df)
df %>% mutate(across(columns, checkNumeric))
This error is returned:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
What gives? Why would the code work fine on a local data frame, but not a Spark data frame?
I didn't find an exact solution per se, but I did find a workaround.
typeCheckPartition <- function(df)
{
  require(dplyr)
  require(varhandle)
  checkNumeric <- function(column)
  {
    column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
  }
  # this works on non-spark data frames
  columns <- colnames(df)
  numericIdx <- df %>% mutate(across(all_of(columns), checkNumeric)) %>% .[1,]
  return(numericIdx)
}
typeCastSpark <- function(df, max_partitions = 1000, undo_coalesce = TRUE)
{
  # numericIdxDf will have these dimensions: num_partition rows x num_columns
  # so long as num_columns is not absurd, this coalesce should make collect a safe operation
  num_partitions <- sdf_num_partitions(df)
  if (num_partitions > max_partitions)
  {
    df <- df %>% sdf_coalesce(max_partitions)
  } else
  {
    # nothing was coalesced, so there is nothing to undo
    undo_coalesce <- FALSE
  }
  columns <- colnames(df)
  numericIdxDf <- df %>% spark_apply(typeCheckPartition, packages = TRUE) %>% collect
  # a column counts as numeric only if it is numeric in every partition
  numericIdx <- numericIdxDf %>% as.data.frame %>% apply(2, all)
  doThese <- columns[which(numericIdx == TRUE)]
  # all_of() belongs inside vars(), not around it
  df <- df %>% mutate_at(vars(all_of(doThese)), as.numeric)
  if (undo_coalesce)
    df <- df %>% sdf_repartition(num_partitions)
  return(df)
}
Run typeCastSpark() on your Spark data frame and it will cast to numeric every column that can be safely converted.
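For the local (non-Spark) case, the same idea can be sketched without varhandle. This is only an illustration under stated assumptions: is_all_numeric, is_castable and type_cast_local are hypothetical helpers, and the check (as.numeric() must not introduce new NA values) is a stand-in that may not match varhandle::check.numeric() in every edge case. It is not a fix for the Spark failure, since sparklyr translates dplyr verbs into Spark SQL and an arbitrary R function inside mutate() has no SQL translation, which is probably what the "escape" error is reporting.

```r
library(dplyr)

# TRUE if converting to numeric introduces no new NA values
# (stand-in for varhandle::check.numeric; an assumption of this sketch)
is_all_numeric <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  !any(is.na(converted) & !is.na(x))
}

# Predicate for where(): character columns that convert cleanly
is_castable <- function(x) is.character(x) && is_all_numeric(x)

# Convert every character column that passes the check
type_cast_local <- function(df) {
  df %>% mutate(across(where(is_castable), as.numeric))
}

df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
df$Species <- as.character(df$Species)
newDF <- type_cast_local(df)
class(newDF$Sepal.Length)
class(newDF$Species)
```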
How can I pass the argument ColName of my function foo to the dplyr function count? ColName is the name of a column in the data frame, given as a string.
library(scales)
library(dplyr)
foo <- function(df, ColName, YearCol){
  percentData <- df %>%
    group_by(format(as.Date(df[,YearCol]),"%Y")) %>%
    count(ColName) %>% # does not work like this, also df[,ColName] does not work
    mutate(ratio=scales::percent(n/sum(n)))
}
You can pass the column names to select() under a name such as .dots. Because the selection is named, select() renames the chosen columns to .dots1 and .dots2, and the rest of the pipeline can refer to them by those fixed names.
foo <- function(df, ColName, YearCol){
  percentData <- df %>%
    select(.dots = c(ColName, YearCol)) %>%
    group_by(format(as.Date(.dots2), "%Y")) %>%
    count(.dots1) %>%
    mutate(ratio=scales::percent(n/sum(n)))
  percentData
}
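A variation that keeps the original column names is sketched below. It assumes both arguments are strings and uses rlang::sym() and the .data pronoun instead of renaming; the orders data frame and the column label year are made up for the illustration.

```r
library(dplyr)

foo <- function(df, ColName, YearCol) {
  df %>%
    # look the date column up by its string name
    group_by(year = format(as.Date(.data[[YearCol]]), "%Y")) %>%
    # turn the string into a symbol so count() sees a column
    count(!!rlang::sym(ColName)) %>%
    mutate(ratio = scales::percent(n / sum(n)))
}

# Hypothetical example data
orders <- data.frame(
  status = c("a", "a", "b", "a"),
  when = c("2020-01-01", "2020-05-01", "2020-06-01", "2021-01-01")
)
res <- foo(orders, "status", "when")
res
```

Because count() keeps the grouping by year, the ratio is computed within each year rather than over the whole table.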
I have read "Programming with dplyr" and have succeeded in writing my first functions using dplyr pipes and bare variable names.
For the sake of readability, and so that I can use non-dplyr functions via do(), I rename the columns at the beginning of the function, perform the calculations, and return the data frame with an additional calculated variable. The problem arises when trying to return to the original variable names.
require(dplyr)
require(rlang)
myfun <- function(df, var1) {
  var1 <- enquo(var1)
  # Rename column of interest
  df <- df %>% rename(tempname = UQ(var1))
  # Calculate mean of column of interest
  df <- df %>% mutate(calc = tempname*2)
  # Rename column of interest back to original name
  df <- df %>% rename(UQ(var1) = tempname)
}
test <- myfun(mtcars, cyl)
This is the error thrown:
Error: unexpected '=' in:
" # Rename column of interest back to original name
df <- df %>% rename(UQ(var1) ="
> }
Error: unexpected '}' in "}"
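The parse error has the same cause as in the other questions here: R does not allow a function call on the left-hand side of = inside a call. One possible way around it, sketched under the assumption that var1 is always a bare column name, is to store the name as a string with rlang::as_name() and unquote it on the left of :=. The helper variable var1_name is introduced only for this illustration.

```r
library(dplyr)

myfun <- function(df, var1) {
  var1 <- enquo(var1)
  # Keep the original column name as a string
  var1_name <- rlang::as_name(var1)
  df %>%
    rename(tempname = !!var1) %>%
    mutate(calc = tempname * 2) %>%
    # := allows an unquoted string on the left-hand side
    rename(!!var1_name := tempname)
}

test <- myfun(mtcars, cyl)
names(test)
```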
I'm working with nested dataframes and want to pass the name of the top level dataframe, and the name of a column containing lower level dataframes, to a function that uses purrr::map to iterate over the lower level data frames.
Here's a toy example.
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)
df1 <- tibble(x = c("a","b","c", "a","b","c"), y = 1:6)
df1 <- df1 %>%
  group_by(x) %>%
  nest()
testfunc1 <- function(df) {
  df <- df %>%
    mutate(out = map(data, min))
  tibble(min1 = df$out)
}
testfunc2 <- function(df, col_name) {
  df <- df %>%
    mutate(out = map(col_name, min))
  tibble(min2 = df$out)
}
df1 <- bind_cols(df1, testfunc1(df1))
df1 <- bind_cols(df1, testfunc2(df1, "data"))
df1$min1
df1$min2
testfunc1 behaves as expected, in this case giving the minimum of each data column in a new column. In testfunc2, where I've tried to pass the column name, a string reading "data" is passed to the new column. I think I understand from the thread here (Pass a data.frame column name to a function) why this doesn't behave as I want, but I haven't been able to figure out how to make it work in this case. Any suggestions would be great.
This should work for you. It uses the tidy eval framework and assumes col_name is a string.
testfunc2 <- function(df, col_name) {
  df <- df %>%
    mutate(out = map(!! rlang::sym(col_name), min))
  tibble(min2 = df$out)
}
EDIT:
If you'd rather pass a bare column name to the function instead of a string, use enquo() instead of sym().
testfunc2 <- function(df, col_name) {
  col_quo <- enquo(col_name)
  df <- df %>%
    mutate(out = map(!! col_quo, min))
  tibble(min2 = df$out)
}
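For comparison, here is a minimal sketch of the same two variants written with the .data pronoun (string input) and the {{ }} operator (bare name), assuming rlang >= 0.4 and tidyr >= 1.0. The function names min_by_string and min_by_name and the ungrouped nested tibble are illustrative choices, not part of the original question.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# String input: look the list-column up through the .data pronoun
min_by_string <- function(df, col_name) {
  df %>% mutate(out = map(.data[[col_name]], min))
}

# Bare-name input: {{ }} forwards the column to mutate()
min_by_name <- function(df, col_name) {
  df %>% mutate(out = map({{ col_name }}, min))
}

nested <- tibble(x = c("a", "b", "c", "a", "b", "c"), y = 1:6) %>%
  nest(data = y)

unlist(min_by_string(nested, "data")$out)
unlist(min_by_name(nested, data)$out)
```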