I want to substitute parts of the transform function with variable inputs.
I have created a df using subset with col1 from an existing table:
col1 = c('A','B','C')
The df looks something like this:
A = c(1, 3)
B = c(3, 1)
C = c(5, 2)
df = data.frame(A, B, C)
I now want to automate calculations which manually would look like this:
df <- transform(df, 'ABC' = (A + B + C))
where (A + B + C) refers to the columns of the df. Because I have hundreds of 'col1's I can't do it by hand. I was trying to use something similar to %s (as available in Python 2.x), but so far nothing has really worked, and I understand too little of R (is this related to eval()?) to get things working (I tried paste, as.formula, sprintf, substitute, etc.).
Using cv(col1) below I'm trying to paste the output into the transform call, but the furthest I got was transform grabbing values from the environment (not the columns) when using as.formula.
cv = function(var){
  output = paste('(', paste(var, collapse = ' + '), ')', sep = '')
  return(output)
}
Would appreciate any hints or ideas!
You have maneuvered yourself into a strange corner. This is easy with R:
cols <- c("A", "B", "C")
df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
#alternatively for other binary functions:
#Reduce("+", df[, cols])
# A B C ABC
#1 1 3 5 9
#2 3 1 2 6
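Because you mention hundreds of 'col1' groups: the same line extends to a plain loop over a list of column groups. A minimal sketch, assuming the groups live in a list called col_groups (a name invented here for illustration):
col_groups <- list(c("A", "B"), c("A", "B", "C"))  # hypothetical column groups
for (cols in col_groups) {
  df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
}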
You can get a similar effect using mutate from dplyr:
library(dplyr)
cols <- c("A", "B", "C")
df %>% mutate_(.dots = setNames(paste(cols, collapse = '+'),
                                'new_column_name'))
Here we tell mutate_ (spot the _) what to do via paste() which yields "A+B+C", and use setNames to name the new column.
I acknowledge the syntax is somewhat convoluted, but that is a consequence of non-standard evaluation in dplyr. If you want to stay within the dplyr ecosystem, this is the way to do it.
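For what it's worth, the underscore verbs have since been removed from dplyr. A rough equivalent on dplyr 1.0+ (a sketch, not part of the original answer) would be:
library(dplyr)
cols <- c("A", "B", "C")
df %>% mutate(!!paste(cols, collapse = "") := rowSums(across(all_of(cols))))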
I have a data.table with two text columns. I need to use column b to determine which letters to replace in column a with an "x".
I can do it using a for loop, as in the code below. However, my actual data set has 250,000+ rows, so the script takes ages. Is there a more efficient way to do this? I considered lapply but couldn't get my head round it.
library(data.table)
DT <- data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"), b = c("A", "B", "C", "D"))
DT$c <- ""
for (i in 1:NROW(DT)) {
  DT[i]$c <- sub(DT[i, b], "x", DT[i, a])
}
Here is one approach using the tidyverse:
library(tidyverse)
DT <- data.table::data.table(a = c("ABCD","ABCD","ABCD","ABCD"), b = c("A","B","C", "D"))
DT %>%
  mutate(new_vec = str_replace_all(string = a, pattern = b, replacement = "x"))
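Since sub() is vectorised over x but not over pattern, a base mapply call gives the same result without the explicit row loop. A sketch (my own, not benchmarked):
# apply sub() elementwise over the pattern/string pairs
DT[, c := mapply(sub, pattern = b, replacement = "x", x = a, USE.NAMES = FALSE)]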
dplyr's rename functions require the new column name to be passed in as unquoted variable names. However I have a function where the column name is constructed by pasting a string onto an argument passed in and so is a character string.
For example say I had this function
myFunc <- function(df, col){
  new <- paste0(col, '_1')
  out <- dplyr::rename(df, new = old)
  return(out)
}
If I run this
df <- data.frame(a = 1:3, old = 4:6)
myFunc(df, 'x')
I get
a new
1 1 4
2 2 5
3 3 6
Whereas I want the 'new' column to be the name of the string I constructed ('x_1'), i.e.
a x_1
1 1 4
2 2 5
3 3 6
Is there any way of doing this?
I think this is what you were looking for. It uses rename_ as #Henrik suggested, but the argument has a, let's say, interesting name:
myFunc <- function(df, col){
  new <- paste0(col, '_1')
  out <- dplyr::rename_(df, .dots = setNames(list(col), new))
  return(out)
}
myFunc(data.frame(x = c(1, 2, 3)), "x")
#   x_1
# 1   1
# 2   2
# 3   3
Note the use of setNames to use the value of new as the name in the list.
Recent versions of dplyr (1.0.0 and later) provide the rename_with function.
Say you have a data frame:
library(tidyverse)
df <- tibble(V0 = runif(10), V1 = runif(10), V2 = runif(10), key=letters[1:10])
And you want to change all of the "V" columns. Usually, my reference for columns like this comes from a JSON file, which in R becomes a named list (here simulated with a named character vector), e.g.:
colmapping <- c("newcol1", "newcol2", "newcol3")
names(colmapping) <- paste0("V",0:2)
You can then use the following to change the names of df to the strings in the colmapping list:
df <- rename_with(.data = df, .cols = starts_with("V"), .fn = function(x){colmapping[x]})
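If the mapping is already a named vector of old -> new names like colmapping above, the splice operator offers a one-line alternative (a sketch; note that rename() expects new = old pairs, hence the swap inside setNames):
df <- rename(df, !!!setNames(names(colmapping), colmapping))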
I have been searching this and have found this link helpful with renaming passed columns from a function (the [, column_name] code actually made my_function1 work after I had been searching for a while): Is there a way to use the pipe operator to rename columns in a dataframe within a function?
My attempt is shown in my_function2, but it gives me Error: All arguments to rename must be named or Error: Unknown variables: col2. I am guessing this is because I have not specified what col2 belongs to.
Also, is there a way to pass associated arguments into the function, like col1 and new_col1, so that you can associate the column name to be replaced with the column name that is replacing it? Thanks in advance!
library(dplyr)
my_df = data.frame(a = c(1,2,3), b = c(4,5,6), c = c(7,8,9))
my_function1 = function(input_df, col1, new_col1) {
  df_new = input_df
  df_new[, new_col1] = df_new[, col1]
  return(df_new)
}
temp1 = my_function1(my_df, "a", "new_a")
my_function2 = function(input_df, col2, new_col2) {
  df_new = input_df %>%
    rename(new_col2 = col2)
  return(df_new)
}
temp2 = my_function2(my_df, "b", "new_b")
rename_ (alongside other dplyr verbs suffixed with an underscore) has been deprecated.
Instead, try:
my_function3 = function(input_df, cols, new_cols) {
  input_df %>%
    rename({{ new_cols }} := {{ cols }})
}
See this vignette for more information about embracing arguments with double braces and programming with dplyr.
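With the embraced arguments the function is called with bare column names rather than strings; a quick usage example of my own, reusing my_df from the question:
my_function3(my_df, b, new_b)
#   a new_b c
# 1 1     4 7
# 2 2     5 8
# 3 3     6 9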
Following #MatthewPlourde's answer to a similar question, we can do:
my_function3 = function(input_df, cols, new_cols) {
  rename_(input_df, .dots = setNames(cols, new_cols))
}
# example
my_function3(my_df, "b", "new_b")
# a new_b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
Many dplyr functions have less-known variants with names ending in _ that allow you to work with the package more programmatically. One pattern is...
DF %>% dplyr_fun(arg1 = val1, arg2 = val2, ...)
# becomes
DF %>% dplyr_fun_(.dots = list(arg1 = "val1", arg2 = "val2", ...))
This has worked for me in a few cases, where the val* are just column names. There are more complicated patterns and techniques, covered in the document that pops up when you type vignette("nse"), but I do not know them well.
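As a concrete instance, the rename_ call from the accepted answer fits this template like so (my own illustration; the names of .dots are the new column names, the values the old ones):
df <- data.frame(x = 1:3)
dplyr::rename_(df, .dots = list(x_1 = "x"))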
I have a data frame with 300 columns which has a string variable somewhere which I am trying to remove. I have found this solution on Stack Overflow using lapply (see below), which is what I want to do, but using the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
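On current dplyr (1.0 and later), the same selection is usually written with where(); a sketch of the equivalent:
df %>% select(where(is.numeric))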
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
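For reference, mutate_each() and funs() have since been deprecated. A rough modern equivalent on dplyr 1.0+ (a sketch, not part of the original answer):
df2 <- df %>% mutate(across(everything(), ~ as.numeric(as.character(.x))))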
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting so much as trawling for non-integers.)
StrScrub <- function(x) {
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
I want to filter a dataframe using a field which is defined in a variable, to select a value that is also in a variable. Say I have
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
The value I want would be df[df$Unhappy == "Y", ].
I've read the nse vignette to try to use filter_ but can't quite understand it. I tried
df %>% filter_(.dots = ~ fld == sval)
which returned nothing. I got what I wanted with
df %>% filter_(.dots = ~ Unhappy == sval)
but obviously that defeats the purpose of having a variable to store the field name. Any clues please? Eventually I want to use this where fld is a vector of field names and sval is a vector of filter values for each field in fld.
You can try with interp from lazyeval:
library(lazyeval)
library(dplyr)
df %>%
  filter_(interp(~v == sval, v = as.name(fld)))
# V Unhappy
#1 1 Y
#2 5 Y
#3 3 Y
For multiple key/value pairs, I found the following to work, but I think there should be a better way.
df1 %>%
  filter_(interp(~v == sval1[1] & y == sval1[2],
                 .values = list(v = as.name(fld1[1]), y = as.name(fld1[2]))))
# V Unhappy Col2
#1 1 Y B
#2 5 Y B
For these cases, I find the base R option to be easier. For example, if we are trying to filter the rows based on the 'key' variables in 'fld1' with corresponding values in 'sval1', one option is using Map. We subset the dataset (df1[fld1]) and apply the FUN (==) to each column of df1[fld1] with the corresponding value in 'sval1', then use Reduce with & to get a logical vector that can be used to filter the rows of 'df1'.
df1[Reduce(`&`, Map(`==`, df1[fld1],sval1)),]
# V Unhappy Col2
# 2 1 Y B
#3 5 Y B
data
df1 <- cbind(df, Col2= c("A", "B", "B", "C", "A"))
fld1 <- c(fld, 'Col2')
sval1 <- c(sval, 'B')
Now, with rlang 0.4.0, there is a new, more intuitive way for this type of use case:
packageVersion("rlang")
# [1] ‘0.4.0’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(.data[[fld]]==sval)
#OR
filter_col_val <- function(df, fld, sval) {
  df %>% filter({{ fld }} == sval)
}
filter_col_val(df, Unhappy, "Y")
More information can be found at https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
Previous Answer
With dplyr 0.6.0 and later, this code works:
packageVersion("dplyr")
# [1] ‘0.7.1’
df <- data.frame(V=c(6, 1, 5, 3, 2), Unhappy=c("N", "Y", "Y", "Y", "N"))
fld <- "Unhappy"
sval <- "Y"
df %>% filter(UQ(rlang::sym(fld))==sval)
#OR
df %>% filter((!!rlang::sym(fld))==sval)
#OR
fld <- quo(Unhappy)
sval <- "Y"
df %>% filter(UQ(fld)==sval)
More about the dplyr syntax is available at http://dplyr.tidyverse.org/articles/programming.html, and quosure usage is covered in the rlang package: https://cran.r-project.org/web/packages/rlang/index.html.
If you find it challenging to master non-standard evaluation in dplyr 0.6+, Alex Hayes has an excellent write-up on the topic: https://www.alexpghayes.com/blog/gentle-tidy-eval-with-examples/
Original Answer
With dplyr version 0.5.0 and later, it is possible to use a simpler syntax that gets closer to the syntax #Ricky originally wanted, which I also find more readable than using lazyeval::interp:
df %>% filter_(.dots = paste0(fld, "=='", sval, "'"))
# V Unhappy
#1 1 Y
#2 5 Y
#3 3 Y
#OR
df %>% filter_(.dots = glue::glue("{fld}=='{sval}'"))
Here's an alternative with base R, which is maybe not very elegant, but it might have the benefit of being rather easily understandable:
df[df[colnames(df)==fld]==sval,]
# V Unhappy
#2 1 Y
#3 5 Y
#4 3 Y
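Equivalently, and arguably more directly, the column can be pulled with [[ (a small variation of my own):
df[df[[fld]] == sval, ]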
Following on from LmW: personally, I prefer using a dplyr pipeline where the dots are specified before the pipeline, so that it is easier to use programmatically, say in a loop of filters.
dots <- paste0(fld," == '",sval,"'")
df %>% filter_(.dots = dots)
LmW's example is correct but the values are hardcoded.
So I was trying to do the same thing, and it seems that dplyr now has built-in functionality to address exactly this.
Check the last example here: https://dplyr.tidyverse.org/reference/filter.html
I'm also pasting it here for simplicity:
# To refer to column names that are stored as strings, use the `.data` pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars %>%
  filter(
    .data[[vars[[1]]]] > cond[[1]],
    .data[[vars[[2]]]] > cond[[2]]
  )
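To generalise this to an arbitrary number of field/value pairs, one option is to fold filter over the indices. A sketch of my own, assuming the same vars and cond vectors:
library(purrr)
reduce(seq_along(vars),
       function(d, i) filter(d, .data[[vars[[i]]]] > cond[[i]]),
       .init = starwars)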