R Edit data frame in function within function - r

I have a codebase made up of many functions, shared across several scripts, each of which modifies a df by adding columns. I need a global function that calls several of these functions in turn, but because they run inside another function, my df is not updated after each call. Do you have any advice for this problem?
Here is an example of my problem:
f_a<-function(df){
df$x<-1
.GlobalEnv$df <- df
}
f_b<-function(df){
df$y<-1
.GlobalEnv$df <- df
}
f_global<-function(df){
f_a(df)
f_b(df)
}
In this case df will not have the x and y columns created
Thanks

It's generally a bad idea for functions to have "side effects": things are easier to get right if functions are completely self contained. For your example, that would look like this:
f_a<-function(df){
df$x<-1 # This only changes the local copy
df # This returns the local copy as the function result
}
f_b<-function(df){
df$y<-1
df
}
f_global<-function(df){
df <- f_a(df) # This uses f_a to change the local copy
df <- f_b(df) # This uses f_b to make another change
df # This returns the changed dataframe
}
Then you use it like this:
mydf <- data.frame(z = 1)
mydf <- f_global(mydf)
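If the list of steps grows, the same return-and-reassign pattern can be folded over a list of functions. The following is only a sketch: apply_steps is a hypothetical helper name, not something from the answer above, and it assumes every step takes a data frame and returns one.

```r
# Each step takes a data frame and returns a modified copy (no side effects).
f_a <- function(df) {
  df$x <- 1
  df
}
f_b <- function(df) {
  df$y <- 1
  df
}

# Hypothetical helper: feed the result of each step into the next one.
apply_steps <- function(df, steps) {
  Reduce(function(acc, step) step(acc), steps, init = df)
}

mydf <- data.frame(z = 1)
result <- apply_steps(mydf, list(f_a, f_b))

names(result)  # "z" "x" "y"
names(mydf)    # "z": the original is untouched until you reassign it
```

Adding another transformation then only means appending a function to the list.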

Use the <<- operator inside the function. For example:
dat = data.frame(x1 = rep(1,10),x2 = rep(2,10),x3 = rep(3,10))
head(dat)
myFun <- function(x){
print(x)
dat$x1 <<- rep(5,10)
}
myFun(10)
head(dat)
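One caveat worth knowing before relying on <<-: it assigns to a variable by name in an enclosing environment, so the function's argument plays no part. A small sketch (set_x1 and other are made-up names for illustration):

```r
dat   <- data.frame(x1 = rep(1, 3))
other <- data.frame(x1 = rep(1, 3))

# The argument df is never used: <<- looks up the name `dat` in the
# enclosing environments and modifies that object.
set_x1 <- function(df) {
  dat$x1 <<- rep(5, 3)
  invisible(NULL)
}

set_x1(other)
dat$x1    # 5 5 5
other$x1  # 1 1 1, because `other` was passed in but never touched
```

This is why the approach only works when the data frame always has the same, known name.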

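Each function assigns its local df to .GlobalEnv, overwriting the df that already exists there. Inside f_global, f_a(df) writes a copy with column x to the global df, but f_b(df) is then passed f_global's original argument, which has no x, and its result overwrites the global df again. If f_b is passed the global df instead, it adds column y to the data frame that already has x.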
All that needs to be changed is f_global:
f_global<-function(df){
f_a(df)
f_b(.GlobalEnv$df)
}
f_global(data.frame(a=1))
df
# a x y
#1 1 1 1
df <- head(mtcars)
f_global(df)
df
# mpg cyl disp hp drat wt qsec vs am gear carb x y
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1 1
Though the code above works and follows the lines of the question, I think a better strategy is for each f_* to return its result, for f_global to reassign its argument with each return value, and to assign the final result in f_global's parent environment only after all transformations are done.
f_a <- function(df){
df$x <- 1
df
}
f_b <- function(df){
df$y <- 1
df
}
f_global <- function(df){
dfname <- deparse(substitute(df))
df <- f_a(df)
df <- f_b(df)
assign(dfname, df, envir = parent.frame())
invisible(NULL)
}
df1 <- data.frame(a=1)
f_global(df1)
df1
df <- head(mtcars)
f_global(df)
df

Related

Prevent change to dataframe format in R

I have a dataframe that must have a specific layout. Is there a way for me to make R reject any command I attempt that would change the number or names of the columns?
It is easy to check the format of the data table manually, but I have found no way to make R do it for me automatically every time I execute a piece of code.
regards
This doesn’t offer the level of foolproof safety I think you’re looking for (hard to know without more details), but you could define a function operator that yields modified functions that error if changes to columns are detected:
same_cols <- function(fn) {
function(.data, ...) {
out <- fn(.data, ...)
stopifnot(identical(sort(names(.data)), sort(names(out))))
out
}
}
For example, you could create modified versions of dplyr functions:
library(dplyr)
my_mutate <- same_cols(mutate)
my_summarize <- same_cols(summarize)
which work as usual if columns are preserved:
mtcars %>%
my_mutate(mpg = mpg / 2) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 10.50 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 10.50 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 11.40 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 10.70 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 9.35 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 9.05 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
my_summarize(across(everything(), mean))
# mpg cyl disp hp drat wt qsec vs am
# 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
# gear carb
# 1 3.6875 2.8125
But throw errors if changes to columns are made:
mtcars %>%
my_mutate(mpg2 = mpg / 2)
# Error in my_mutate(., mpg2 = mpg/2) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
mtcars %>%
my_summarize(mpg = mean(mpg))
# Error in my_summarize(., mpg = mean(mpg)) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
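The wrapper is not tied to dplyr: any function whose first argument is the data frame can be wrapped. A minimal base R sketch, repeating same_cols from above so it runs on its own; halve_mpg and add_flag are hypothetical functions invented for the illustration:

```r
same_cols <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    stopifnot(identical(sort(names(.data)), sort(names(out))))
    out
  }
}

# Hypothetical transformations
halve_mpg <- function(df) {
  df$mpg <- df$mpg / 2
  df
}
add_flag <- function(df) {
  df$flag <- TRUE
  df
}

safe_halve <- same_cols(halve_mpg)
safe_flag  <- same_cols(add_flag)

ok <- safe_halve(mtcars)  # column set unchanged, so this succeeds

# Adding a column trips the check; tryCatch captures the error for inspection.
res <- tryCatch(safe_flag(mtcars), error = function(e) e)
inherits(res, "error")  # TRUE
```

Because the check sorts the names first, reordering columns still passes; compare names(.data) and names(out) directly if the order matters too.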
You mention that the names and columns need to stay the same. Be aware that data.table updates names by reference too, so a vector obtained with names() changes along with the table. See the example below.
library(data.table)
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- names(foo)
colnames
# [1] "x" "y"
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
colnames
# [1] "a" "b" "z"
identical(colnames, names(foo))
# [1] TRUE
To check that both the columns and the names are unaltered (and in the same order here), take a copy of the names right away. After each code run, compare the current names with the copied names.
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- copy(names(foo))
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
identical(colnames, names(foo))
# [1] FALSE
colnames
# [1] "x" "y"
names(foo)
# [1] "a" "b" "z"

Create column from symbol in mutate (tidy eval)

So I want to create a new column called var that has the text "testing" for all rows. I.e. the result should be like mtcars$var <- "testing". I have tried different things such as as_name, as_string...
library(tidyverse)
f <- function(df, hello = testing){
df %>%
mutate(var = hello)
}
f(mtcars)
We can do:
f <- function(df, hello = testing){
hello <- deparse(substitute(hello))
df %>%
mutate(var =rlang::as_name(hello))
}
f(mtcars)
However, as pointed out by @Lionel Henry (see comments below):
deparse will not check for simple inputs and might return a character vector of length > 1. as_name() will then fail if given a vector of length > 1, or do nothing otherwise, since the input is already a string.
as_name(substitute(hello)) does the same but checks that the input is a simple symbol or string. It is more constrained than as_label().
It might therefore be better rewritten as:
f <- function(df, hello = testing){
hello <- as_label(substitute(hello))
df %>%
mutate(var = hello )
}
Or:
f <- function(df, hello = testing){
hello <- rlang::as_name(substitute(hello))
df %>%
mutate(var = hello)
}
Result:
mpg cyl disp hp drat wt qsec vs am gear carb var
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 testing
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 testing
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 testing
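The caveat about deparse() quoted above can be seen in base R alone. In this sketch, label_of is a hypothetical helper and the long expression is arbitrary; neither comes from the question.

```r
# Hypothetical helper: capture the argument's expression as text.
label_of <- function(arg) deparse(substitute(arg))

label_of(testing)          # "testing"
length(label_of(testing))  # 1

# A long expression is split into several strings, because deparse() breaks
# its output into lines of roughly width.cutoff (60 by default) characters.
long <- label_of(
  first_long_variable_name + second_long_variable_name +
    third_long_variable_name + fourth_long_variable_name +
    fifth_long_variable_name
)
length(long) > 1  # TRUE
```

For a bare symbol like testing the two approaches agree; the difference only shows up when someone passes something more complex.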

Use a character vector in the `by` argument

Within the data.table package in R, is there a way to pass a character vector to the by argument of a calculation?
Here is an example of what would be the desired output from this using mtcars:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[,.( AveMpg = mean(mpg)), by = .(ColSelect)] # Doesn't work
# Desired Output
cyl AveMpg
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
I know this is possible when assigning column names in j, by wrapping the variable in parentheses.
ColSelect <- 'AveMpg' # Column to be assigned for average mpg value
mtcars[,(ColSelect):= mean(mpg), by = .(cyl)]
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb AveMpg
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19.74286
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 19.74286
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.66364
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 19.74286
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15.10000
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 19.74286
Is there a suggestion as to what to put in the by argument in order to achieve this?
From ?data.table in the by section it says that by accepts:
a single character string containing comma separated column names (where spaces are significant since column names may contain spaces
even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
a character vector of column names: e.g., DT[, sum(a), by=c("x", "y")]
So yes, you can use the approach in @cccmir's response. You can also use c() as @akrun mentioned, but that seems slightly extraneous unless you want multiple columns.
The reason you cannot use .() syntax is that in data.table .() is an alias for list(). And according to the same help for by the list() syntax requires an expression of column names - not a character string.
Going off the examples in the by help if you wanted to use multiple variables and pass the names as characters you could do:
mtcars[,.( AveMpg = mean(mpg)), by = "cyl,am"]
mtcars[,.( AveMpg = mean(mpg)), by = c("cyl","am")]
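Putting this together, the grouping columns can live in an ordinary character vector. A short sketch, assuming data.table is installed; group_cols and avg are names chosen here, and dt is used instead of overwriting mtcars:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Column names to group by, held as plain strings
group_cols <- c("cyl", "am")

# by accepts the character vector directly; .() would treat it as an expression
avg <- dt[, .(AveMpg = mean(mpg)), by = group_cols]

names(avg)  # "cyl" "am" "AveMpg"
nrow(avg)   # 6, one row per cyl/am combination present in mtcars
```

Note that this works as long as the variable name (group_cols here) is not itself a column of the table.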
Try using it like this:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[, AveMpg := mean(mpg), by = ColSelect] # Should work

Finding duplicate columns in a data.table

I have a pretty big data.table (500 x 2000), and I need to find out if any of the columns are duplicates, i.e., have the same values for all rows. Is there a way to efficiently do this within the data.table structure?
I have tried a naive two loop approach with all(col1 == col2) for each pair of columns, but it takes too long. I have also tried converting it to a data.frame and using the above approach, and it still takes quite a long time.
My current solution is to convert the data.table to a matrix and use the apply() function as:
similarity.matrix <- apply(m, 2, function(x) colSums(x == m))/nrow(m)
However, the approach forces the modes of all elements to be the same, and I'd rather not have that happen. What other options do I have?
Here is a sample construction for the data.table:
m = matrix(sample(1:10, size=1000000, replace=TRUE), nrow=500, ncol=2000)
DF = as.data.frame(m)
DT = as.data.table(m)
Following the suggestion of @Haboryme*, you can do this using duplicated to find any duplicated vectors. duplicated usually works row-wise, but you can transpose the data with t() just for finding the duplicates.
DF <- DF[ , which( !duplicated( t( DF ) ) ) ]
With a data.table, you may need to add with = FALSE (I think this depends on the version of data.table you're using).
DT <- DT[ , which( !duplicated( t( DT ) ) ), with = FALSE ]
*@Haboryme, if you were going to turn your comment into an answer, please do and I'll remove this one.
Here's a different approach, where you hash each column first and then call duplicated.
library(digest)
dups <- duplicated(sapply(DF, digest))
DF <- DF[,which(!dups)]
Depending on your data this might be a faster way.
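Another option that avoids both the matrix coercion and the extra package is to call duplicated() on the columns as a list. This is a sketch on a small made-up data frame, not a benchmark; whether it is fast enough at 500 x 2000 would need to be measured.

```r
df <- data.frame(
  a         = 1:3,
  b         = c("x", "y", "z"),
  a_copy    = 1:3,               # exact duplicate of a
  b_changed = c("x", "y", "q")   # differs from b in one row
)

# A data frame is a list of columns, so duplicated() compares whole columns
# and each column keeps its own type.
dup <- duplicated(as.list(df))

names(df)[dup]          # "a_copy"
unique_df <- df[!dup]   # keep the first occurrence of each column
names(unique_df)        # "a" "b" "b_changed"
```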
I am using mtcars for a reproducible result:
library(data.table)
library(digest)
# Create data
data <- as.data.table(mtcars)
data[, car.name := rownames(mtcars)]
data[, car.name.dup := car.name] # create a duplicated column
data[, car.name.not.dup := car.name] # create a second duplicated column...
data[1, car.name.not.dup := "Moon walker"] # ... but change a value so that it is no longer a duplicated column
data contains now:
> head(data)
mpg cyl disp hp drat wt qsec vs am gear carb car.name car.name.dup car.name.not.dup
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4 Moon walker
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Datsun 710 Datsun 710
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout Hornet Sportabout Hornet Sportabout
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant Valiant Valiant
Now find the duplicated columns:
# create a vector with the checksum for each column (the column names are kept as the vector's names)
col.checksums <- sapply(data, function(x) digest(x, "md5"), USE.NAMES = TRUE)
# make a data table with one row per column name and hash value
dup.cols <- data.table(col.name = names(col.checksums), hash.value = col.checksums)
# self join using the hash values and filter out all column name pairs that were joined to themselves
dup.cols[dup.cols, on = "hash.value"][col.name != i.col.name]
Results in:
col.name hash.value i.col.name
1: car.name.dup 58fed3da6bbae3976b5a0fd97840591d car.name
2: car.name 58fed3da6bbae3976b5a0fd97840591d car.name.dup
Note: The result still contains both directions (col1 == col2 and col2 == col1) and should be deduplicated ;-)
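One way to drop the mirrored pairs mentioned in the note is to keep only rows where the first name sorts before the second. A sketch assuming data.table is installed; the hash strings below are placeholders standing in for real digest() output, and dup.pairs is a name chosen here.

```r
library(data.table)

# Placeholder table: one row per column name with a stand-in hash value
dup.cols <- data.table(
  col.name   = c("mpg", "car.name", "car.name.dup", "car.name.not.dup"),
  hash.value = c("hash_1", "hash_2", "hash_2", "hash_3")
)

# Self join on the hash, then keep each unordered pair once.
# col.name < i.col.name also removes rows joined to themselves.
dup.pairs <- dup.cols[dup.cols, on = "hash.value"][col.name < i.col.name]

nrow(dup.pairs)        # 1
dup.pairs$col.name     # "car.name"
dup.pairs$i.col.name   # "car.name.dup"
```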

Error when trying to programmatically create columns in a data.table

I am getting this error when I try to create a new column in a data.table programmatically:
dt[, (new_x) := get(x)]
# Error in get(x) : invalid first argument
Where x is a variable that holds the name of the column that I am using in the assignment, which also happens to be named "x" in this case. In other words, x <- "x", and "x" %in% names(dt) is TRUE. This error only seems to occur when the variable name is the same as the column name.
A reproducible example:
library(data.table)
# Our data.table
dt <- as.data.table(mtcars)
dt
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# ...
# My new column name
new_col <- "new_column"
# Will make my new column be the sum of two columns
mpg <- "mpg"
cyl <- "cyl"
# I thought this would work:
dt[, (new_col) := get(mpg) + get(cyl)]
# Error in get(mpg) : invalid first argument
# If the variable names are not the same as the string it contains, it works
mpg_col <- "mpg"
cyl_col <- "cyl"
dt[, (new_col) := get(mpg_col) + get(cyl_col)]
Now, in my script, I have a helper function that takes in two column names, x and y, as arguments to calculate a new column with name new_col.
calculate_new_column <- function(dt, x, y, new_col) {
dt[, (new_col) := some calculation with x and y ]
}
Is there a way to make my function safe to this kind of corner case, where x = 'x' or y = 'y'? I guess I could give unique names to the arguments of the function (e.g. .x. and .y.), but would prefer a better solution.
EDIT
Following my reproducible example, it seems this works:
dt[, (new_col) := get(eval(mpg)) + get(eval(cyl))]
But I am wary of using eval and am not sure if this follows best practices. Would this be the way to go?
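One way to sidestep the name clash without eval is to avoid evaluating the column names inside [ ] at all: dt[[x]] extracts a column by its string name, and data.table's set() assigns by reference. This is a sketch of one option rather than an established best practice. It assumes data.table is installed, and the sum stands in for whatever calculation is actually needed.

```r
library(data.table)

calculate_new_column <- function(dt, x, y, new_col) {
  # dt[[x]] is ordinary list extraction, so x is looked up in the function,
  # never among the columns of dt
  set(dt, j = new_col, value = dt[[x]] + dt[[y]])
  invisible(dt)
}

dt <- as.data.table(mtcars)

# Variables deliberately named like the columns they refer to
mpg <- "mpg"
cyl <- "cyl"

calculate_new_column(dt, mpg, cyl, "new_column")
head(dt$new_column)  # 27.0 27.0 26.8 27.4 26.7 24.1
```

Since nothing is evaluated inside the table's scope, it does not matter whether an argument is called x and also holds "x".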

Resources