Avoid overwriting existing columns in data.table functions - r

I'm writing a function that takes a data.table as an argument. Some of the data.table's column names are passed as arguments, but not all column names are specified, and all original columns need to be maintained. Inside the function, some columns need to be added to the data.table. Even if the data.table is copied inside the function, I want to add these columns in a way that is guaranteed not to overwrite existing columns. What's the best way to ensure I'm not overwriting columns, given that not all column names are known?
Here's one approach:
# x is a data.table and knownvar is a column name of that data.table
f <- function(x, knownvar) {
  x <- copy(x)
  tempcol <- "z"
  while (tempcol %in% names(x))
    tempcol <- paste0("i.", tempcol)
  tempcol2 <- "q"
  while (tempcol2 %in% names(x))
    tempcol2 <- paste0("i.", tempcol2)
  x[, (tempcol) := 3]
  eval(parse(text = paste0("x[,(tempcol2):=", tempcol, "+4]")))
  x
}
Note that even though I'm copying x here, I still need this process to be memory efficient. Is there an easier way of doing this, possibly without using eval(parse(text = ...))?
Obviously I could just create a local variable (e.g. a vector) in the function environment (rather than adding it explicitly as column of the data.table), but this wouldn't work if I then need to sort/join the data.table. Plus I may want to explicitly return a data.table that contains both the original variables and the new column.

Here is one way to write the function using set and non-standard evaluation with substitute() + eval().
Note 1: if new columns are created based on the column names in newcols (instead of the column name in knownvar), the character names in newcols are converted to symbols with as.name() (or equivalently as.symbol()).
Note 2: new columns in newvals can only be added in a sensible order, i.e. if column q requires column z, column z should be added before column q.
library(data.table)
f <- function(x, knownvar) {
  ## remove if x should be modified in-place
  x <- copy(x)
  ## new column names
  newcols <- setdiff(make.unique(c(names(x), c("z", "q"))), names(x))
  ## new column values based on knownvar or new column names
  zcol <- as.name(newcols[1])
  newvals <- list(substitute(3 * knownvar), substitute(zcol + 4))
  for (i in seq_along(newvals)) {
    set(x, j = newcols[i], value = eval(newvals[[i]], envir = x))
  }
  return(x)
}
## example data
x <- as.data.table(mtcars)
x[, c("q", "q.1") := .(mpg, 2 * mpg)]
head(f(x, mpg))
#> mpg cyl disp hp drat wt qsec vs am gear carb q q.1 z q.2
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0 42.0 63.0 67.0
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0 42.0 63.0 67.0
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 45.6 68.4 72.4
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 42.8 64.2 68.2
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 18.7 37.4 56.1 60.1
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 36.2 54.3 58.3
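The name-picking step in the answer above can be looked at in isolation. Here is a minimal base R sketch, assuming a hypothetical helper called fresh_names() (it is not part of data.table). make.unique() appends numeric suffixes to later duplicates, so the tail of its result holds the requested names, adjusted to avoid every existing one.

```r
# Hypothetical helper (illustration only, not a data.table function):
# return versions of `wanted` that do not clash with `existing`.
fresh_names <- function(existing, wanted) {
  candidates <- make.unique(c(existing, wanted))
  # The requested names sit at the end of the de-duplicated vector.
  tail(candidates, length(wanted))
}

existing <- c("mpg", "cyl", "q", "q.1")
new_names <- fresh_names(existing, c("z", "q"))
print(new_names)
```

With "q" and "q.1" already taken, the second requested name should come back as "q.2", which matches the column added in the output above.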

Related

Using glue-like constructs on RHS in R/Tidyeval

I've spent hours trying to make glue work on the RHS of a formula and am out of ideas. Here is a simple reprex.
meta <- function(x, var, suffix){
x<- x %>% mutate("{{var}}_{suffix}":= 5)
x<- x %>% mutate("{{var}}_{suffix}_new":= {{var}} - "{{var}}_{suffix}")
}
x<- meta(mtcars, mpg, suf)
#Should be equivalent to
x<- mtcars %>% mutate(mpg_suf:= 5)
x<- x%>% mutate(mpg_suf_new:= mpg - mpg_suf)
#N: Tried https://stackoverflow.com/questions/70427403/how-to-correctly-glue-together-prefix-suffix-in-a-function-call-rhs but none of the methods in it worked, unfortunately
Meta function gives me "Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems? "
I went through all the hits for that error message on SO, but nothing has worked so far.
Would really appreciate any insights. Thank you!
Here is a working version:
meta <- function(x, var, suffix){
new_name <- rlang::englue("{{ var }}_{{ suffix }}")
x %>%
mutate("{new_name}" := 5) %>%
mutate("{new_name}_new" := {{ var }} - .data[[new_name]])
}
names(meta(mtcars, mpg, suf))
#> [1] "mpg" "cyl" "disp" "hp"
#> [5] "drat" "wt" "qsec" "vs"
#> [9] "am" "gear" "carb" "mpg_suf"
#> [13] "mpg_suf_new"
To understand what is going on:
Learn about the difference between "{{ var }}" and "{var}" in tidyeval glue strings: https://rlang.r-lib.org/reference/glue-operators.html
Learn about englue() to create glue strings outside of the LHS of :=: https://rlang.r-lib.org/reference/englue.html. This part is not necessary but I thought it was nicer to create and reuse a variable.
Tricky part, you create a new column with a constructed name and then want to use the new column that this name refers to. You'll have to subset it with .data, see: https://rlang.r-lib.org/reference/dot-data.html
See also the general topic: https://rlang.r-lib.org/reference/topic-data-mask-programming.html
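The "tricky part" above, reading back a column whose name only exists as a string, can be reduced to a small sketch. It assumes dplyr is installed; add_and_double() and the column name "flag" are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: create a column named by the string `name`,
# then read that same column back through the .data pronoun.
add_and_double <- function(df, name) {
  df %>%
    mutate("{name}" := 1) %>%
    mutate("{name}_twice" := .data[[name]] * 2)
}

out <- add_and_double(head(mtcars, 3), "flag")
print(names(out))
```

Writing "{name}" on the right-hand side would only produce a character string, which is why the subtraction in the question failed; .data[[name]] looks the column up instead.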
I think it's best if we define the pieces we need first, then we can use them as needed on the LHS or the RHS of the calculation. I will add that it doesn't make much sense to me to pass the suffix argument as a bare name. I think it would be a clearer choice to make it string only.
library(dplyr)
meta <- function(x, var, suffix) {
var <- rlang::as_name(enquo(var))
suffix <- rlang::as_name(enquo(suffix)) # Remove this to make "suffix" string only.
new_var <- glue::glue("{var}_{suffix}")
x %>%
mutate("{new_var}" := 5,
"{new_var}_new" := !!sym(var) - !!sym(new_var))
}
mtcars %>%
head() %>%
meta(mpg, suf)
mpg cyl disp hp drat wt qsec vs am gear carb mpg_suf mpg_suf_new
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5 16.0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5 16.0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5 17.8
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5 16.4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5 13.7
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5 13.1
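The string-only suffix suggested above could look roughly like this. It is a sketch rather than the answerer's code: meta_str() is a hypothetical name, and it uses .data[[...]] in place of !!sym(...), which should be equivalent here.

```r
library(dplyr)

# Hypothetical variant: `var` is still a bare column name,
# but `suffix` must now be passed as a character string.
meta_str <- function(x, var, suffix) {
  var <- rlang::as_name(enquo(var))
  new_var <- paste0(var, "_", suffix)
  x %>%
    mutate("{new_var}" := 5,
           "{new_var}_new" := .data[[var]] - .data[[new_var]])
}

out <- meta_str(head(mtcars), mpg, "suf")
print(tail(names(out), 2))
```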

R single element subsetting (aka `[[`) using a matrix

Assuming I'm understanding the documentation of [[ correctly, a matrix can be used to subset a data.frame:
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector. Negative indices are not allowed in the index matrix. NA and zero values are allowed: rows of an index matrix containing a zero are ignored, whereas rows containing an NA produce an NA in the result.
While this works for [, I'm struggling to understand how to do this with [[.
mtcars[1:6, 1:6]
#> mpg cyl disp hp drat wt
#> Mazda RX4 21.0 6 160 110 3.90 2.620
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
#> Datsun 710 22.8 4 108 93 3.85 2.320
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440
#> Valiant 18.1 6 225 105 2.76 3.460
(ind <- matrix(1:6, ncol = 2))
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 6
mtcars[ind]
#> [1] 110.00 3.90 2.32
mtcars[[ind]]
#> Error in as.matrix(x)[[i]]: attempt to select more than one element in vectorIndex
Is this a bug? Or am I misinterpreting the documentation?
Here is the source of [[.data.frame (v3.6.1)
function (x, ..., exact = TRUE)
{
na <- nargs() - !missing(exact)
if (!all(names(sys.call()) %in% c("", "exact")))
warning("named arguments other than 'exact' are discouraged")
if (na < 3L)
(function(x, i, exact) if (is.matrix(i))
as.matrix(x)[[i]]
else .subset2(x, i, exact = exact))(x, ..., exact = exact)
else {
col <- .subset2(x, ..2, exact = exact)
i <- if (is.character(..1))
pmatch(..1, row.names(x), duplicates.ok = TRUE)
else ..1
col[[i, exact = exact]]
}
}
The doc page (?Extract) you reference says that arrays can be indexed by matrices. Implicitly, I take that to mean non-arrays cannot be indexed by matrices. Data frames are not arrays, so they cannot be indexed by matrices. (Matrices are arrays, of course.)
I do think you're misinterpreting the documentation. You're looking at a documentation page that jointly documents [, [[, and $, together. In the argument description, it says
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x...
The section you quote at the top of your question comes later on, under the heading Matrices and Arrays, which I take to be a section about subsetting matrices and arrays, not about using matrices as indices. (Look at the rest of the section, and the sections before and after, and I think you'll agree with me.)
Nowhere on that documentation page does it talk about using matrices as indices for [[.
I'm surprised it's handled specially in the [[ code you show - but as near as I can tell, a matrix given to [[.data.frame will error out unless it's a 1x1 matrix, in which case the data frame is treated as a matrix and the single element is returned, for some arcane reason (probably "compatibility with S", though I've no good guess as to why S would allow it).
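To make the distinction concrete, here is a small base R sketch. It is one possible workaround, not something the documentation prescribes: matrix indexing with [ returns one element per row of the index matrix, while [[ takes a single (row, column) pair per call, so the pairs are looped over.

```r
ind <- matrix(1:6, ncol = 2)   # rows are (row, column) pairs: (1,4), (2,5), (3,6)

# Matrix indexing with `[` on a true matrix: one element per row of `ind`.
m <- as.matrix(mtcars)
from_matrix <- m[ind]

# `[[` on a data frame accepts one (row, column) pair per call,
# so apply it to each row of the index matrix in turn.
from_df <- vapply(
  seq_len(nrow(ind)),
  function(k) mtcars[[ind[k, 1], ind[k, 2]]],
  numeric(1)
)

print(from_matrix)
print(from_df)
```

Both vectors should hold the same three values as mtcars[ind] in the question.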

Apply variable function to columns in data.table

I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
This obviously works:
mtcars[,eval(returnNames) := myfunc(.SD),.SDcols = SDnames,by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[,eval(returnNames) := lapply(.SD,eval(functionCall)),.SDcols = SDnames,by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to feed such a data table into the function as one object.
To get the function you need get instead of eval
On the left-hand-side of :=, you can just put the character vector in parentheses, eval isn't needed
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following code
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
functionCall <- "myfunc"
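The difference between eval and get on a character string can be checked without data.table. A minimal base R sketch, where square() is a made-up example function:

```r
square <- function(v) v^2   # hypothetical example function
function_name <- "square"

# eval() on a character string simply returns the string itself,
# which is why calling it fails with "attempt to apply non-function".
evaluated <- eval(function_name)
print(is.function(evaluated))

# get() looks the name up and returns the object bound to it;
# mode = "function" skips non-function objects with the same name.
looked_up <- get(function_name, mode = "function")
result <- looked_up(3)
print(result)
```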

Using 'mutate_' to sum a bunch of columns row-wise

In this blog post, Paul Hiemstra shows how to sum up two columns using dplyr::mutate_. Copy/paste-ing relevant parts:
library(lazyeval)
f = function(col1, col2, new_col_name) {
mutate_call = lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
allows one to then do:
head(f('wt', 'mpg', 'hahaaa'))
Great!
I followed up with a question (see comments) as to how one could extend this to 100 columns, since it wasn't quite clear (to me) how one could do it without having to type all the names using the above method. Paul was kind enough to indulge me and provided this answer (thanks!):
# data
df = data.frame(matrix(1:100, 10, 10))
names(df) = LETTERS[1:10]
# answer
sum_all_rows = function(list_of_cols) {
summarise_calls = sapply(list_of_cols, function(col) {
lazyeval::interp(~col_name, col_name = as.name(col))
})
df %>% select_(.dots = summarise_calls) %>% mutate(ans1 = rowSums(.))
}
sum_all_rows(LETTERS[sample(1:10, 5)])
I'd like to improve this answer on these points:
The other columns are gone. I'd like to keep them.
It uses rowSums() which has to coerce the data.frame to a matrix which I'd like to avoid.
Also I'm not sure if the use of . within non-do() verbs is encouraged? Because . within mutate() doesn't seem to adapt to just those rows when used with group_by().
And most importantly, how can I do the same using mutate_() instead of mutate()?
I found this answer, which addresses point 1, but unfortunately, both dplyr answers use rowSums() along with mutate().
PS: I just read Hadley's comment under that answer. IIUC, 'reshape to long form + group by + sum + reshape to wide form' is the recommended dplyr way for these types of operations?
Here's a different approach:
library(dplyr); library(lazyeval)
f <- function(df, list_of_cols, new_col) {
df %>%
mutate_(.dots = ~Reduce(`+`, .[list_of_cols])) %>%
setNames(c(names(df), new_col))
}
head(f(mtcars, c("mpg", "cyl"), "x"))
# mpg cyl disp hp drat wt qsec vs am gear carb x
#1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 27.0
#2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 27.0
#3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.8
#4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 27.4
#5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 26.7
#6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 24.1
Regarding your points:
Other columns are kept
It doesn't use rowSums
You are specifically asking for a row-wise operation here so I'm not sure (yet) how a group_by could do any harm when using . inside mutate/mutate_
It makes use of mutate_
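The core of this approach is Reduce(`+`, ...) over a list of columns, which adds the columns element by element and never builds a matrix. Since mutate_() has been deprecated in later dplyr releases, the sketch below shows the same idea in base R only; the data frame and the column name "total" are made up for illustration.

```r
# Toy data, assumed for illustration.
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)
cols <- c("A", "C")

# df[cols] is a list of column vectors; Reduce() adds them pairwise,
# so no coercion to a matrix takes place and all other columns are kept.
df[["total"]] <- Reduce(`+`, df[cols])

print(df)
```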

How to get width of dataframe being printed in terminal in R

I'm looking for a way to put a bottom line under a data frame that I print out in R.
The output looks like
I want the bottom line to be exactly as long as the printed data frame, but the width varies.
Any ideas?
EDIT
I'm trying to get rid of
cat("---------------------------------------\n")
and want the line to adapt to the output width of a given data frame. The "-----" line should be neither longer nor shorter than the printed data frame.
Use getOption("width"):
> getOption("width")
[1] 80
You can see a description of this option via ?options which states
‘width’: controls the maximum number of columns on a line used in
printing vectors, matrices and arrays, and when filling by
‘cat’.
That doesn't mean that the entire 80 (in my case) characters are used, but R's printing shouldn't extend beyond that so it should be an upper limit.
You should probably also check this in IDE or other front-ends to R. For example, RStudio might do something different depending on the width of the console widget in their app.
To actually format exactly the correct width for the data frame, you'll need to process the data frame into character strings for each line (much as print.data.frame does via its format method). Something like:
df <- data.frame(Price = round(runif(10), 2),
Date = Sys.Date() + 0:9,
Subject = rep(c("Foo", "Bar", "DJGHSJIBIBFUIBSFIUBFUIS"),
length.out = 10),
Category = rep("Media", 10))
class(df) <- c("MyDF", "data.frame")
print.MyDF <- function(x, ...) {
  fdf <- format(x)
  ## formatted first row of each column, used to measure column widths
  strings <- apply(x, 2, function(x) unlist(format(x)))[1, ]
  rowname <- format(rownames(fdf))[[1]]
  strings <- c(rowname, strings)
  names <- c("", colnames(x))
  ## each column is as wide as the longer of its header and its values
  widths <- pmax(nchar(strings), nchar(names))
  ## one space between columns
  csum <- sum(widths + 1) - 1
  print.data.frame(x)  ## print the argument, not the global df
  writeLines(paste(rep("-", csum), collapse = ""))
  writeLines("Balance: 48") ## FIXME !!
  invisible(x)
}
which gives:
> df
Price Date Subject Category
1 0.73 2015-06-29 Foo Media
2 0.11 2015-06-30 Bar Media
3 0.19 2015-07-01 DJGHSJIBIBFUIBSFIUBFUIS Media
4 0.54 2015-07-02 Foo Media
5 0.04 2015-07-03 Bar Media
6 0.37 2015-07-04 DJGHSJIBIBFUIBSFIUBFUIS Media
7 0.59 2015-07-05 Foo Media
8 0.85 2015-07-06 Bar Media
9 0.15 2015-07-07 DJGHSJIBIBFUIBSFIUBFUIS Media
10 0.05 2015-07-08 Foo Media
----------------------------------------------------
Balance: 48
See how this works. Very simple counting of characters, no bells and whistles, but should do the expected job:
EDIT: Printing of the data.frame and the line done in the function.
# create a function that prints the data.frame with the line we want
lineLength <- function( testDF )
{
# start with the characters in the row names,
# plus empty space between columns
dashes <- max( nchar( rownames( testDF ) ) ) + length ( testDF )
# loop finding the longest string in each column, including header
for( i in 1 : length ( testDF ) )
{
x <- nchar( colnames( testDF ) )[ i ]
y <- max( nchar( testDF[ , i ] ) )
if( x > y ) dashes <- dashes + x else dashes <- dashes + y
}
myLine <- paste( rep( "-", dashes ), collapse = "" )
print( testDF )
cat( myLine, "\n" )
}
# sample data
data( mtcars )
# see how it works
lineLength( head( mtcars ) )
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
--------------------------------------------------------------------
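Both answers above count characters column by column. A different route, offered here only as a sketch and not taken from either answer, is to capture the printed lines and measure them directly; print_with_rule() is a hypothetical name.

```r
# Hypothetical helper: print a data frame followed by a dashed rule
# that is exactly as wide as the widest printed line.
print_with_rule <- function(df) {
  lines <- capture.output(print(df))        # the text print() would show
  rule <- strrep("-", max(nchar(lines)))    # one dash per character
  writeLines(c(lines, rule))
  invisible(rule)
}

print_with_rule(head(mtcars))
```

Because the rule is measured from the actual output, it should follow whatever formatting print() applies, although a data frame wide enough to wrap across several blocks would get a rule matching its widest block only.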
