I'm trying to make my code general: I'd like to change only the YEAR variable, without having to change everything else in the code.
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y){
summarize(column_YEAR = sum(col1))
}
#Right now this gives
column_YEAR
1 15
#I would like this function to output this (so col1 is changed to column_1970)
column_1970
1 15
Or, for example, this:
df <- list("a_YEAR" = anotherdf)
#I would like to have a list with a df with the name a_1970
I tried things like
df <- list(assign(paste0(a_, YEAR), anotherdf))
But it does not work. Does anybody have any advice? Thanks in advance :)
rlang provides a flexible way to defuse R expressions, and you can use that functionality to create dynamic column names within a dplyr pipeline. In this example, the dynamic column name is built from the suffix argument passed to a wrapper function around dplyr's summarise().
library("tidyverse")
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y) {
summarize(column_YEAR = sum(col1))
}
my_summarise <- function(.data, suffix, sum_col) {
var_name <- paste0("column_", suffix)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
Results
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
# column_1970
# 1 15
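As an alternative sketch: recent dplyr versions (1.0 or later, which is an assumption about your installation) also accept a glue-style string on the left-hand side of :=, so the name can be built in place. The function name my_summarise_glue is made up for this illustration.

```r
library(dplyr)

YEAR <- 1970
y <- data.frame(col1 = 1:5)

# Hypothetical variant of my_summarise(): the output column name is
# built from `suffix` with a glue string instead of paste0().
my_summarise_glue <- function(.data, suffix, sum_col) {
  summarise(.data, "column_{suffix}" := sum({{ sum_col }}))
}

res <- my_summarise_glue(y, suffix = YEAR, sum_col = col1)
names(res)
```

Changing YEAR and re-running the call is then enough to change the column name.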
You can also read the suffix directly from the global environment, but this is a poorer solution from a readability perspective, as it's not immediately clear how the function creates the suffix.
my_summarise_two <- function(.data, sum_col) {
var_name <- paste0("column_", YEAR)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise_two(.data = y, sum_col = col1)
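For the second part of the question (a list element named a_1970), one possible approach in base R is to build the name with paste0() and attach it with setNames(). This is only a sketch: anotherdf here is placeholder data, since the question does not show it. Note that the attempt in the question also fails because a_ is unquoted, so R looks for an object called a_.

```r
YEAR <- 1970
anotherdf <- data.frame(col1 = 1:5)  # placeholder data, assumed for illustration

# Build the element name from YEAR, then attach it to a one-element list.
df <- setNames(list(anotherdf), paste0("a_", YEAR))
names(df)
```

The same idea works for an existing list via df[[paste0("a_", YEAR)]] <- anotherdf.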
Related
I would like to manually define the name of the object/output of a function. A very simple example of what I have is:
x <- data.frame(name = c("A", "B", "C"),
value = c(50, 20, 100))
statistics <- function(data, name){
total <- data %>% mutate(New = value +50)
assign(paste0(name), data)
}
statistics(x, "NewName")
I would like to run this function and define what data to use and the name of the output. The idea is to create a uniquely named output for each dataset used.
Thanks!
One way is to assign the result with <- at the call site instead of using assign() inside the function. Alternatively, you can use assign(), but you have to specify the envir you would like the object to be assigned to. If it is left blank, the object goes into the function's environment.
A word of caution on using assign() in the function: it will overwrite other objects in the global environment if they have the same name, so be careful with your object names.
library(dplyr)
x <- data.frame(name = c("A", "B", "C"),
                value = c(50, 20, 100))
# Option 1: return the result and name it with <- at the call site
statistics <- function(data){
  data %>% mutate(New = value + 50)
}
Newname <- statistics(x)
# Option 2: assign() into the global environment under the given name
statistics2 <- function(data, name){
  total <- data %>% mutate(New = value + 50)
  assign(name, total, envir = .GlobalEnv)
}
statistics2(x, "NewName2")
A slight aside: in your code, the assign() call should use total, not data.
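If the goal is one uniquely named output per dataset, another option that avoids writing into the global environment is to collect the results in a named list. The following is a minimal sketch of that idea; add_new and results are hypothetical names, and the calculation is rewritten in base R only so that the example has no package dependency.

```r
x <- data.frame(name = c("A", "B", "C"),
                value = c(50, 20, 100))

# Same calculation as statistics() above, written without dplyr.
add_new <- function(data) {
  data$New <- data$value + 50
  data
}

# Store each output under a name chosen at run time.
results <- list()
results[["NewName"]] <- add_new(x)
results[["NewName"]]$New
```

Each dataset then lives under its own name in results, and nothing in the global environment can be overwritten by accident.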
I have the following piece of code to update a column in a dataframe in R with the median value. This works fine, but I would like to be able to call this as a function from other parts of the program, passing over other dataframes and columns.
medianVal <- median(df$column, na.rm = T)
df$column[is.na(df$column)] <- medianVal
The logic of the function I am attempting to write is: pass in the DataFrame and Column, get the median value, update the NAs, and return the dataframe.
updateWithMedian <- function(DataFrame, Column)
{
medianValue <- median(Column, na.rm = T)
Column[is.na(DataFrame$Column)] <- medianValue
return(DataFrame)
}
DataFrame[[Column]] in the function helps me identify the column, but I am still struggling to update the NA values to the median.
For example, the code
DataFrame[[Column]][is.na(DataFrame$Column)] <- medianValue
doesn't feel like the correct syntax.
You're mixing notations here. If you pass the column name as a quoted string, you cannot use the dataframe$variable notation. Try this (untested) solution:
updateWithMedian <- function(df, colname)
{
medianValue <- median(df[,colname], na.rm = T)
df[is.na(df[,colname]), colname] <- medianValue
return(df)
}
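Since the solution above is marked as untested, here is a small self-contained check of the same idea. It is a sketch that uses [[ ]] indexing throughout (my own choice, so that it also behaves the same on tibbles); update_with_median and example_df are names made up for the example.

```r
# Replace NAs in one column, given by name, with that column's median.
update_with_median <- function(df, colname) {
  median_value <- median(df[[colname]], na.rm = TRUE)
  df[[colname]][is.na(df[[colname]])] <- median_value
  df
}

example_df <- data.frame(column = c(1, NA, 3, 10, NA))
filled <- update_with_median(example_df, "column")
filled$column
```

The median of the non-missing values (1, 3, 10) is 3, so both NAs become 3.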
We can also do
library(dplyr)
library(zoo)
updateWithMedian <- function(df, colname) {
df %>%
mutate_at(vars(colname), na.aggregate, FUN = median)
}
updateWithMedian(df, "column")
Thanks, that has worked perfectly. Just a follow on question, if I wanted to update the column value with the value of a different column in the same dataframe, what would be the correct step for it?
I have tried the code below, but it replaces the NAs in Col1 with the name of Col2, whereas it's the value I need.
updateWithMedian <- function(df, colname1, colname2)
{
df[is.na(df[,colname1]), colname1] <- colname2
return(df)
}
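One possible fix for the follow-up, sketched under the assumption that both columns have compatible types: index the second column with the same logical vector, so that values rather than the column name are assigned. fill_from_column and example_df are hypothetical names.

```r
# Replace NAs in colname1 with the values from the same rows of colname2.
fill_from_column <- function(df, colname1, colname2) {
  missing <- is.na(df[[colname1]])
  df[[colname1]][missing] <- df[[colname2]][missing]
  df
}

example_df <- data.frame(Col1 = c(1, NA, 3), Col2 = c(10, 20, 30))
filled <- fill_from_column(example_df, "Col1", "Col2")
filled$Col1
```

The original attempt assigned colname2 itself, which is just a character string, and that is why the column name ended up in the data.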
library(dplyr)
clean_name <- function(df,col_name,new_col_name){
#remove whitespace and common titles.
df$new_col_name <- mutate_all(df,
trimws(gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?","",df$col_name)))
#remove any chunks of text where a number is present
df$new_col_name<- transmute_all(df,
gsub("[^\\s]*[\\d]+[^\\s]*","",df$col_name,perl = TRUE))
}
I get the following error
"Error: Column new_col_name must be a 1d atomic vector or a list"
What you want to do is make sure that the output of the functions you're using is either a vector or a list with only one dimension, so that you can add it as a new column in the desired data frame. You can check the class of an object with the class() function, which comes with the base package.
The mutate function by itself should do what you want, it returns the same data frame but with the new column:
library(dplyr)
clean_name <- function(df, col_name, new_col_name) {
# first_cleaning_to_colname = The first change you want to make to the col_name column. This should be a vector.
# second_cleaning_to_colname = The change you're going to make to the col_name column after the first one. This should be a vector too.
first_change <- mutate(df, col_name = first_cleaning_to_colname)
second_change <- mutate(first_change, new_col_name = second_cleaning_to_colname)
return(second_change)
}
You can make both of these changes at the same time, but I thought it's easier to read this way.
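To make the outline above concrete, here is a sketch for the case where the column names are passed as strings. It assumes a dplyr version that provides the .data pronoun, reuses the two patterns from the question unchanged, and uses a hypothetical function name (clean_name_chr) and made-up sample data.

```r
library(dplyr)

# Column names are passed as strings: .data[[...]] reads a column by
# name, and !!name := creates a column whose name is held in a variable.
clean_name_chr <- function(df, col_name, new_col_name) {
  titles <- "MR.?|MRS.?|MS.?|MISS.?|MASTER.?"   # pattern from the question
  with_number <- "[^\\s]*[\\d]+[^\\s]*"         # pattern from the question
  df %>%
    mutate(!!new_col_name := trimws(gsub(titles, "", .data[[col_name]]))) %>%
    mutate(!!new_col_name := trimws(gsub(with_number, "",
                                         .data[[new_col_name]], perl = TRUE)))
}

dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
cleaned <- clean_name_chr(dat1, "col1", "colN")
cleaned$colN
```

Because mutate() returns the whole data frame, there is no need to assign into df$new_col_name, which is what triggered the error in the question.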
If we are passing unquoted column names, then use
library(tidyverse)
clean_name <- function(df,col_name, new_col_name){
col_name <- enquo(col_name)
new_col_name <- enquo(new_col_name)
df %>%
mutate(!! new_col_name :=
trimws(str_replace_all(!!col_name, "MR.?|MRS.?|MS.?|MISS.?|MASTER.?","")) ) %>%
transmute(!! new_col_name := trimws(str_replace_all(!! new_col_name,
"[^\\s]*[\\d]+[^\\s]*","")))
}
clean_name(dat1, col1, colN)
# colN
#1 one
#2 two
data
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
I'm trying to extract the following code as a function, where movies_sub is a dataframe and Director is a column name.
library(tidyr)
library(reshape2)
movies_sub$Director <- strsplit(movies_sub$Director,"(\\s)?,(\\s)?")
unnested <- unnest(movies_sub)
movies_sub <- dcast(unnested, ... ~ Director, fun.aggregate = length)
Here is my attempted function:
toDummyVars = function(df, col) {
df[,col] = strsplit(df[,col],"(\\s)?,(\\s)?") # split by comma
unnested = unnest(df)
df = eval(dcast(unnested, ... ~ col, fun.aggregate = length))
}
I've figured out how to represent movies_sub$Director as df[,col].
However, how do I have the column name "col" recognized when I execute dcast in the 3rd line of toDummyVars()?
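We can build the formula in the third line as a string with paste0(). Note that the cast has to be applied to unnested rather than df, and that [[ ]] is needed to store the list returned by strsplit() as a single column:
toDummyVars = function(df, col) {
  df[[col]] = strsplit(df[[col]], "(\\s)?,(\\s)?")  # split by comma into a list column
  unnested = unnest(df, cols = all_of(col))         # one row per value (tidyr >= 1.0)
  dcast(unnested, paste0("... ~ ", col), fun.aggregate = length)
}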
toDummyVars <- function(df, colName) {
dfTemp = df
dfTemp$colName <- strsplit(dfTemp[,colName],"(\\s)?,(\\s)?") # separate by commas
unnested <- unnest(dfTemp) # convert to long format with each feature separated with correct corresponding Profit.
dfTemp <- dcast(unnested, Title + Profit ~ `colName`, fun.aggregate = length, value.var = "colName") # convert to binary vectors with one-hot encoding.
return(dfTemp)
}
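The problem was that colName needed to be referred to with backticks, as in `colName`, in line 4 of the function. (Note that dfTemp$colName creates a column that is literally named colName, which is the column the formula and value.var then refer to.) Also, the column name needs to be passed to the function as a string, e.g.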
genreBin = toDummyVars(moviesForBin, "Genre")
see
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
I have been searching for this and have found this link helpful for renaming passed columns from within a function (the [,column_name] code actually made my_function1 work after I had been searching for a while). Is there a way to use the pipe operator to rename columns in a dataframe within a function?
My attempt is shown in my_function2 but it gives me an Error: All arguments to rename must be named or Error: Unknown variables: col2. I am guessing because I have not specified what col2 belongs to.
Also, is there a way to pass associated arguments into the function, like col1 and new_col1, so that you can associate the column name to be replaced with the column name that is replacing it? Thanks in advance!
library(dplyr)
my_df = data.frame(a = c(1,2,3), b = c(4,5,6), c = c(7,8,9))
my_function1 = function(input_df, col1, new_col1) {
df_new = input_df
df_new[,new_col1] = df_new[,col1]
return(df_new)
}
temp1 = my_function1(my_df, "a", "new_a")
my_function2 = function(input_df, col2, new_col2) {
df_new = input_df %>%
rename(new_col2 = col2)
return(df_new)
}
temp2 = my_function2(my_df, "b", "new_b")
rename_ (alongside other dplyr verbs suffixed with an underscore) has been deprecated.
Instead, try:
my_function3 = function(input_df, cols, new_cols) {
input_df %>%
rename({{ new_cols }} := {{ cols }})
}
See the "Programming with dplyr" vignette (vignette("programming", package = "dplyr")) for more information about embracing arguments with double braces.
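For the second part of the question (passing associated old/new names together), one option is a named character vector of the form c(new_name = "old_name"), handed to rename() through all_of(). This is a sketch assuming a reasonably recent dplyr; rename_many and lookup are hypothetical names.

```r
library(dplyr)

my_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# `lookup` pairs each new name with the column it replaces:
# c(new_name = "old_name", ...)
rename_many <- function(input_df, lookup) {
  input_df %>%
    rename(all_of(lookup))
}

renamed <- rename_many(my_df, c(new_a = "a", new_b = "b"))
names(renamed)
```

Keeping each pair in a single vector means the old and new names cannot get out of step, which is a risk with two separate arguments.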
Following @MatthewPlourde's answer to a similar question, we can do:
my_function3 = function(input_df, cols, new_cols) {
rename_(input_df, .dots = setNames(cols, new_cols))
}
# example
my_function3(my_df, "b", "new_b")
# a new_b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
Many dplyr functions have lesser-known variants with names ending in _ that allow you to work with the package more programmatically. One pattern is...
DF %>% dplyr_fun(arg1 = val1, arg2 = val2, ...)
# becomes
DF %>% dplyr_fun_(.dots = list(arg1 = "val1", arg2 = "val2", ...))
This has worked for me in a few cases, where the val* are just column names. There are more complicated patterns and techniques, covered in the document that pops up when you type vignette("nse"), but I do not know them well.