I have working code that excludes columns based on a parameter and mutates certain columns based on other parameters. There is this SO question, Can dplyr package be used for conditional mutating?, but it does not address conditional select.
Is there a way to have pure dplyr code without the if statements?
Working R Code:
# Loading
library(dplyr)
library(ggplot2) # for the diamonds dataset
diamonds_tbl <- diamonds
head(diamonds_tbl)
# parameters
initialColumnDrop <- c('x','y','z')
forceCategoricalColumns <- c('carat','cut', 'color')
forceNumericalColumns <- c('')
# Main Code
if (length(which(colnames(diamonds_tbl) %in% initialColumnDrop)) >= 1) {
  diamonds_tbl_clean <- diamonds_tbl %>%
    select(-one_of(initialColumnDrop)) # Drop specific columns in columnDrop
}
if (length(which(colnames(diamonds_tbl_clean) %in% forceCategoricalColumns)) >= 1) {
  diamonds_tbl_clean <- diamonds_tbl_clean %>%
    mutate_at(forceCategoricalColumns, funs(as.character)) # Force columns to be categorical
}
if (length(which(colnames(diamonds_tbl_clean) %in% forceNumericalColumns)) >= 1) {
  diamonds_tbl_clean <- diamonds_tbl_clean %>%
    mutate_at(forceNumericalColumns, funs(as.numeric)) # Force columns to be numeric
}
I don't really understand the desire for a "pure dplyr" solution, but you can make any problem easier with helper functions. For example, you could write a function that runs a transformation only if certain columns are found:
run_if_cols_match <- function(data, cols, expr) {
  if (any(names(data) %in% cols)) {
    expr(data)
  } else {
    data
  }
}
Then you could use that in a pipe
diamonds_tbl_clean <- diamonds_tbl %>%
  run_if_cols_match(initialColumnDrop,
                    . %>% select(-one_of(initialColumnDrop))) %>%
  run_if_cols_match(forceCategoricalColumns,
                    . %>% mutate_at(forceCategoricalColumns, funs(as.character))) %>%
  run_if_cols_match(forceNumericalColumns,
                    . %>% mutate_at(forceNumericalColumns, funs(as.numeric)))
which would do the same thing as your code. Here we just conditionally run different anonymous pipes.
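As a side note, funs() and one_of() are superseded in recent dplyr releases. If you are on dplyr 1.0 or later, here is a sketch of the same pipeline using across() and any_of(); since any_of() silently ignores names that are absent from the data, the conditional helper becomes optional here:

library(dplyr)

# any_of() tolerates column names that do not exist in the data,
# so each step is safe even when none of the listed columns are present
diamonds_tbl_clean <- diamonds_tbl %>%
  select(-any_of(initialColumnDrop)) %>%
  mutate(across(any_of(forceCategoricalColumns), as.character)) %>%
  mutate(across(any_of(forceNumericalColumns), as.numeric))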
I would like to create a function that updates a data frame from a different environment. Specifically, I would like to update the labels of a data frame using the Hmisc::label() function.
assign_label <- function(df, col) {
  col <- rlang::as_name(rlang::ensym(col))
  Hmisc::label(df[, col]) <- fetch_label(col)
}
fetch_label <- function(col) {
  val <- c("mpg" = "MPG",
           "hp" = "HP")
  unname(val[col])
}
The following code executes without issue: assign_label(mtcars, hp)
However, it does not actually alter the data frame in the calling environment. I just can't figure out how to make it do what I imagine.
Ideally, I would like to be able to pipe a dataframe to this function as such:
mtcars %>% assign_label(mpg)
1) Return modified object
Modifying objects in place is discouraged in R. The usual way to do this is to return the data frame and then assign it to a new name, or back to the original name, clobbering or shadowing it.
assign_label <- function(df, col) {
  col <- deparse(substitute(col))
  Hmisc::label(df[[col]]) <- fetch_label(col)
  df
}
mtcars_labelled <- mtcars %>% assign_label(mpg)
2) magrittr
Despite what we have said above, there are some facilities for modifying in place in R and in some R packages. The magrittr package provides a syntax for overwriting or shadowing the input. Using the definition in (1) we can write:
library(magrittr)
mtcars %<>% assign_label(mpg)
If mtcars were in the global environment it would overwrite it with the new value, but in this case mtcars is in the datasets package, so a new mtcars is written to the caller and the original in datasets is unchanged.
3) replacement function
Although not widely used, R does provide replacement functions, which are defined and used like this. This does overwrite or shadow the input.
`assign_label<-` <- function(df, value) {
  Hmisc::label(df[[value]]) <- fetch_label(value)
  df
}
assign_label(mtcars) <- "mpg"
Note
As an aside, if the aim is an interface that is consistent with the tidyverse, use tidyselect to retrieve the column name(s) so that examples like the following work:
assign_labels <- function(df, col) {
  nms <- names(select(df, {{ col }}))
  for (nm in nms) Hmisc::label(df[[nm]]) <- fetch_label(nm)
  df
}
mtcars_labelled <- mtcars %>% assign_labels(starts_with("mp"))
str(mtcars_labelled)
mtcars_labelled <- mtcars %>% assign_labels(mpg|hp)
str(mtcars_labelled)
In regard to the comments about not modifying outside the scope of a function, I created two functions that assign new data frames with labels.
fetch_label <- function(col) {
  val <- c("mpg" = "MPG",
           "hp" = "HP")
  unname(val[col])
}
assign_label <- function(df, col) {
  col <- rlang::as_name(rlang::ensym(col))
  Hmisc::label(df[[col]]) <- fetch_label(col)
  return(df)
}
assign_labels <- function(df) {
  purrr::iwalk(df, function(.x, .y) {
    lab <- fetch_label(.y)
    Hmisc::label(df[[.y]]) <<- lab
  })
  return(df)
}
mtcars <- mtcars %>% assign_label(hp)
mtcars <- mtcars %>% assign_labels()
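If you would rather avoid <<- altogether, here is a small sketch of my own (not part of the code above) that loops over the columns and only labels those for which fetch_label() returns a value:

assign_labels <- function(df) {
  for (nm in names(df)) {
    lab <- fetch_label(nm)
    # skip columns that have no label defined in fetch_label()
    if (!is.na(lab)) Hmisc::label(df[[nm]]) <- lab
  }
  df
}

mtcars_labelled <- mtcars %>% assign_labels()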
I'm trying to drop columns that have the suffix .1, indicating a repeated column name. This needs to act over a list of data frames.
I have written a function:
drop_duplicated_columns <- function(df) {
  lapply(df, function(x) {
    x <- x %>% select(-contains(".1"))
    x
  })
  return(df)
}
However it is not working. Any ideas why?
One tidy way to solve this problem is to first create a function that works for one data.frame and then map this function over the list:
library(tidyverse)
drop_duplicated_columns <- function(df) {
  df %>%
    select(-contains(".1"))
}
Or even better
drop_duplicated_columns <- . %>%
select(-contains(".1"))
For usage in pipes, combine it with map():
list_dfs <- list(mtcars,mtcars)
list_dfs %>%
map(drop_duplicated_columns)
If you just need one function, you can create a new functional sequence from the working code you tested before:
drop_duplicated_columns_list <- . %>%
map(drop_duplicated_columns)
list_dfs %>%
drop_duplicated_columns_list()
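One caveat worth adding (my note, not part of the original answer): contains(".1") matches the literal characters .1 anywhere in the name, so a column such as price.10 would also be dropped. If the suffix is always at the end of the name, an anchored regular expression with matches() is stricter:

drop_duplicated_columns <- . %>%
  select(-matches("\\.1$")) # only drop names that end in ".1"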
I have a problem with the use of mutate(); please check the next code block.
output1 <- mytibble %>%
mutate(newfield = FND(mytibble$ndoc))
output1
where the FND function is a filter applied to a large file (5 GB):
FND <- function(n) {
  result <- LARGETIBBLE %>% filter(LARGETIBBLE$id == n)
  return(paste(unique(result$somefield), collapse = " "))
}
I want to execute the FND function for each row of the output1 tibble, but it only executes once.
Avoid using $ inside dplyr pipes; it is very rarely needed. You can change your FND function to:
library(dplyr)
FND <- function(n) {
  LARGETIBBLE %>% filter(id == n) %>% pull(somefield) %>%
    unique %>% paste(collapse = " ")
}
Now apply this function to every ndoc value in mytibble.
mytibble %>% mutate(newfield = purrr::map_chr(ndoc, FND))
You can also use sapply :
mytibble$newfield <- sapply(mytibble$ndoc, FND)
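Since LARGETIBBLE comes from a 5 GB file, filtering it once per row of mytibble can be slow. A join-based sketch (my suggestion, assuming the column names id, somefield and ndoc from your example) computes all lookups in a single pass:

library(dplyr)

# one row per id, with all unique somefield values pasted together
lookup <- LARGETIBBLE %>%
  distinct(id, somefield) %>%
  group_by(id) %>%
  summarise(newfield = paste(somefield, collapse = " "))

# note: ndoc values with no match get NA instead of "" in this sketch
output1 <- mytibble %>%
  left_join(lookup, by = c("ndoc" = "id"))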
Writing FND(mytibble$ndoc) is the base data frame style. When you use functions such as mutate() on a tibble, there is no need to specify the name of the tibble, only that of the column; the %>% pipe already makes sure that only data from the tibble is used. Thus your example would be:
output1 <- mytibble %>%
  mutate(newfield = FND(ndoc))

FND <- function(n) {
  result <- LARGETIBBLE %>% filter(id == n)
  return(paste(unique(result$somefield), collapse = " "))
}
This is the theory; however, I do not know whether your FND function will work as written. Try it, and if it does not, give a practical example with data and what you are trying to achieve.
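If FND does turn out to run only once (it is written for a single value of n, not a whole column), one possible workaround, purely a suggestion on my part, is to wrap it with base R's Vectorize() or to use the map_chr() approach from the other answer:

FND_vec <- Vectorize(FND) # applies FND element-wise via mapply()

output1 <- mytibble %>%
  mutate(newfield = FND_vec(ndoc))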
I have four functions, clean, clean2, cleanFun, and trim. Currently I apply the functions to one column, like so.
library(tidyverse)
library(data.table)
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
This process does the trick but I have to copy and paste this multiple times, and I'd eventually like to expand this to multiple columns.
For now, I'd like to save time and add an apply function but I'm not sure how to create that apply function. I've tried creating this.
maxclean <- function(cleaner) {
  c(clean(cleaner), clean2(cleaner), cleanFun(cleaner), trim(cleaner))
}
py17$CE.Finding.Description <- sapply(py17$CE.Finding.Description, maxclean)
After trying this I just get
Error in `$<-.data.frame`(`*tmp*`, CE.Finding.Description, value = c(NA, :
replacement has 4 rows, data has 4318
I do not get any errors doing this the long way. Where am I going wrong on this?
Your maxclean function should take the same arguments as the separate functions, in your case a vector, and then call each function in turn, like this:
maxclean <- function(x) {
  x <- clean(x)
  x <- clean2(x)
  x <- cleanFun(x)
  x <- trim(x)
  return(x)
}
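With this version there is no need for sapply(); assuming clean, clean2, cleanFun and trim are vectorized (which your original column-wise calls imply), you can apply it to the whole column directly:

py17$CE.Finding.Description <- maxclean(py17$CE.Finding.Description)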
Apparently, the OP has created a cleaning pipeline where the output of one step is fed into the next step and the final result of the pipeline overwrites the original input.
The magrittr package has the freduce() function which applies one function after the other in the described way. Thus,
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
can be written as:
library(magrittr)
fcts <- list(clean, clean2, cleanFun, trim)
py17$CE.Finding.Description %<>% freduce(fcts)
which is a shortcut for
py17$CE.Finding.Description <- py17$CE.Finding.Description %>%
  clean() %>%
  clean2() %>%
  cleanFun() %>%
  trim()
Here, %>% is the magrittr forward-pipe operator and %<>% is the magrittr compound assignment pipe-operator which updates the left-hand side object with the resulting value.
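For what it's worth, freduce() behaves much like folding a value through a list of functions with base R's Reduce(); a rough equivalent (a sketch for illustration, not what the answer above uses) is:

apply_all <- function(x, fcts) Reduce(function(acc, f) f(acc), fcts, init = x)

# same result as freduce() for the mtcars example below
apply_all(mtcars$mpg, list(function(.) . - mean(.), abs, sqrt))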
Reproducible example
Using the mtcars dataset:
data(mtcars)
mycars <- mtcars
mycars$mpg %<>%
{. - mean(.)} %>%
abs() %>%
sqrt()
mycars
or
mycars <- mtcars
mycars$mpg %<>% freduce(list(function(.) {. - mean(.)}, abs, sqrt))
mycars
Applying on multiple columns
The OP has mentioned that he would eventually like to expand this to multiple columns.
This can be achieved by, e.g.,
mycars <- mtcars
fcts <- list(function(.) {. - mean(.)}, abs, sqrt)
mycars$mpg %<>% freduce(fcts)
mycars$disp %<>% freduce(fcts)
mycars
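If you are on dplyr 1.0 or later, the same per-column pipeline can also be written with across(), which scales more comfortably to many columns (a sketch under that version assumption):

library(dplyr)
library(magrittr)

mycars <- mtcars %>%
  mutate(across(c(mpg, disp), ~ freduce(.x, fcts)))
mycars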
I am currently learning R and I don't understand why this simple assignment does not work. I would like to subset by year using the filter function of the dplyr package. After several attempts, here is a reproducible example using the gapminder dataset.
I could use the subset function, lapply, or even an anonymous function to solve this problem, but here I just want to understand why this specific code is not working.
library(gapminder)
library(dplyr)
for (i in unique(gapminder$year)) {
  paste0("gapminder", i) <- print(gapminder %>%
                                    filter(year == i))
}
With or without print, the problem is the same.
It's because you are assigning to the result of a function call (paste0), which is not a valid assignment target.
If you remove that part it prints each filtered dataframe:
library(gapminder)
library(dplyr)
for (i in unique(gapminder$year)) {
  print(gapminder %>% filter(year == i))
}
You could assign each to a list, like so:
my_list <- list()
library(gapminder)
library(dplyr)
for (i in seq_along(unique(gapminder$year))) {
  year_filter <- unique(gapminder$year)[i] # each iteration we get another year
  my_list[[i]] <- gapminder %>% filter(year == year_filter)
  cat(paste0("gapminder", year_filter, " ")) # use cat if you want to print at each iteration
}
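A related shortcut (my addition, not part of the original answer): if the goal is simply one data frame per year, split() builds the whole named list in a single call:

gapminder_by_year <- split(gapminder, gapminder$year)

# access an individual year, e.g. 1952
gapminder_by_year[["1952"]]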
paste0 just concatenates vectors after converting to character.
Use the assign function to store the output.
for (i in unique(gapminder$year)) {
  assign(paste0("gapminder", i), print(gapminder %>% filter(year == i)))
}
If you want to retrieve a specific object later, use the get function:
out_i = get(paste0("gapminder", i))
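If you really do want separate gapminderYEAR objects in the global environment (the list approach above is usually preferable), a compact sketch combines split() with list2env():

by_year <- split(gapminder, gapminder$year)
names(by_year) <- paste0("gapminder", names(by_year))
list2env(by_year, envir = .GlobalEnv)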