Why is data.table keeping changes to colnames done within a function?


I have noticed this interesting behaviour of data.tables:
I create a new data.table and use a function to do something to it and change the colnames with setnames. In this minimal example only setnames is used to change 'B' to 'C':
dt1 <- data.table(
A=c(1:5),
B=c(6:10))
dt1
> dt1
> A B
> 1: 1 6
> 2: 2 7
> 3: 3 8
> 4: 4 9
> 5: 5 10
doSomething <- function(dt){
setnames(dt, "B", "C")
}
dt2 <- doSomething(dt1)
dt2
> dt2
> A C
> 1: 1 6
> 2: 2 7
> 3: 3 8
> 4: 4 9
> 5: 5 10
All appears to have worked without a hitch. However, looking at dt1:
dt1
> dt1
> A C
> 1: 1 6
> 2: 2 7
> 3: 3 8
> 4: 4 9
> 5: 5 10
After the function call, dt1 also has the changed colname 'C'. I know that data.tables do not work in exactly the same manner as data.frames, in that certain operations modify them in place instead of creating "duplicates" assigned to new objects. However, in this case a new object gets assigned, and still the old object changes after the operation. It somehow reminds me of Python.
Is this working as intended or should I report it as a bug? Also, is there a way to change this behaviour? I would like to keep dt1 intact after applying a function that uses setnames to it.
Cheers

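setnames() changes the names by reference, and doSomething() returns that same table, so after dt2 <- doSomething(dt1) both names point to one object. You can make a copy of the original dataset ('dt1') first and then call doSomething(dt1), which will change only 'dt1':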
dt2 <- copy(dt1)
doSomething(dt1)
colnames(dt1)
#[1] "A" "C"
colnames(dt2)
#[1] "A" "B"
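If the goal is to keep dt1 intact without having to remember the copy at every call site, the copy can also be taken inside the function. This is only a minimal sketch: doSomethingCopy is a hypothetical name, not something from the question.

```r
library(data.table)

dt1 <- data.table(A = 1:5, B = 6:10)

# Hypothetical variant of doSomething() that works on a private copy
doSomethingCopy <- function(dt) {
  dt <- copy(dt)          # deep copy; the caller's table is left alone
  setnames(dt, "B", "C")  # still by reference, but only on the copy
  dt                      # return the modified copy
}

dt2 <- doSomethingCopy(dt1)
names(dt1)  # "A" "B"
names(dt2)  # "A" "C"
```

The trade-off is memory: copy() duplicates the whole table, which is exactly what the by-reference functions are designed to avoid.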

Related

Adding a column by reference to every data.table in a list does not "stick"

I have a list of many metering data.tables each with a UTC timestamp column (all the other columns are unit meterings). I want to add a CET column by reference to each of the data.tables. Weirdly this does not seem to stick.
My code:
lapply(list_dtmetering, function(x) x[, cet:=utc + hours(1)])
When I run this, I get each element printed out to the console, showing the added column. Then when I do ...
names(list_dtmetering[[1]])
...I only get the original column names. The "cet" column is gone!
I tried a simpler example to check my syntax, but this shows the expected behaviour:
> c1 = 1:5
> DT = data.table(c1)
> L = list(DT)
> L
[[1]]
c1
1: 1
2: 2
3: 3
4: 4
5: 5
> lapply(L, function(x) x[, C2:=c1+10])
[[1]]
> L
[[1]]
c1 C2
1: 1 11
2: 2 12
3: 3 13
4: 4 14
5: 5 15
Does anybody know why this does not work in my first example?
list_dtmetering is only 900Mb.
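One workaround that does not depend on why the by-reference update is lost is to assign the result of lapply back to the list, since x[, col := ...] returns the updated table. The sketch below uses a small made-up list (list_dt, m1) in place of list_dtmetering and adds 3600 seconds instead of calling hours(1), so it is an illustration under those assumptions rather than a diagnosis of the original data.

```r
library(data.table)

# Hypothetical stand-in for list_dtmetering: two small tables with a UTC timestamp
t0 <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
list_dt <- list(
  data.table(utc = t0 + 0:2 * 3600, m1 = 1:3),
  data.table(utc = t0 + 3:5 * 3600, m1 = 4:6)
)

# Keep whatever `:=` returns instead of relying on the in-place side effect
list_dt <- lapply(list_dt, function(x) x[, cet := utc + 3600])

names(list_dt[[1]])  # "utc" "m1" "cet"
```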

don't auto-return by-columns with data.table

Sample data:
dt = data.table(a=c(10,20,30,40),b=c(40,30,20,10),c=c(0,0,1,1))
rank_a = dt[,rank(a)]
rank_a
[1] 1 2 3 4
This returns what I want. However, if I add a by statement,
rank_a = dt[,rank(a),by=c]
...then it returns a whole data frame including the by column "c", not just the answers I want.
How to fix this behavior?

A generic solution to remove one or multiple "by" columns could be to use mget + chaining:
dt = data.table(a=c(10,20,30,40),b=c(40,30,20,10),c=c(0,0,1,1),d=c(0,0,0,1))
dt
# a b c d
# 1: 10 40 0 0
# 2: 20 30 0 0
# 3: 30 20 1 0
# 4: 40 10 1 1
by.cols = c("c", "d")
# group by one or multiple columns without returning the "group by" columns
# (includes code of @Frank now, see comment)
dt[, .(rank=rank(a)),by = by.cols] [, -(1:length(by.cols))]
# rank
# 1: 1
# 2: 2
# 3: 1
# 4: 1
#
# OLD code (before including the code snippets from @Frank)
# dt[, .(rank = rank(a)),by=mget(by.cols)] [, -(1:length(by.cols))]
Not pretty, but working.
PS: Is there a better way to exclude columns from a data.table whose names are stored in a character vector?
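Regarding the PS, two alternatives that address the columns by name instead of by position are sketched below. The variable names res and res1 are made up for illustration; the data is the sample from this answer.

```r
library(data.table)

dt <- data.table(a = c(10, 20, 30, 40), b = c(40, 30, 20, 10),
                 c = c(0, 0, 1, 1), d = c(0, 0, 0, 1))
by.cols <- c("c", "d")

res <- dt[, .(rank = rank(a)), by = by.cols]

# Option 1: select every column that is not in by.cols (returns a new table)
res1 <- res[, setdiff(names(res), by.cols), with = FALSE]

# Option 2: delete the by columns from the result by reference
res[, (by.cols) := NULL]

names(res1)  # "rank"
names(res)   # "rank"
```

Option 2 modifies res in place, which is harmless here because res is a fresh result and not the original dt.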

It returns the other columns because they are necessary to understand the output. If, for instance you have a dataset:
a b
1 1
1 2
2 3
2 4
.. and you run:
dt[, sum(b), a]
The output will look like:
a V1
1 3
2 7
This makes sense because without the "a" column, the output wouldn't make any sense, as you wouldn't know which V1 values correspond to which "a" groupings. If you really want to get rid of the columns after an operation like this, you can just take the result with something like
dt[, sum(b), a][, V1]
or as a data.table with
dt[, sum(b), a][, .(V1)]

Filtering columns in data table by vector of names

I am learning data.table and trying to filter certain columns by using a vector containing a set of column names.
> dt <- data.table(A=1:5, B=2:6, C=3:7)
> dt
A B C
1: 1 2 3
2: 2 3 4
3: 3 4 5
4: 4 5 6
5: 5 6 7
>
> list <- c("A", "B")
> dt[ ,list, with=FALSE]
A B
1: 1 2
2: 2 3
3: 3 4
4: 4 5
5: 5 6
>
This works fine and filters the columns.
However, a "missing" item in the list will return an error:
> list <- c("A", "B", "D")
> dt[ ,list, with=FALSE]
Error in `[.data.table`(dt, , list, with = FALSE) :
column(s) not found: D
How can I ignore the missing column name from the list and return just existing columns from the dt data.table?

dt[ ,colnames(dt) %in% list, with=FALSE]
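An equivalent approach, sketched here, is to intersect the requested names with the names that exist before selecting. The vector is called cols instead of list to avoid shadowing base R's list() function; that rename is my own choice, not part of the question.

```r
library(data.table)

dt <- data.table(A = 1:5, B = 2:6, C = 3:7)
cols <- c("A", "B", "D")  # "D" does not exist in dt

# Keep only the requested names that are actually present
present <- intersect(cols, names(dt))
res <- dt[, present, with = FALSE]

names(res)  # "A" "B"
```

Unlike the %in% version, which returns columns in the table's order, intersect(cols, names(dt)) returns them in the order given in cols.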

Use of "list" in data.table's j argument

I am learning data.table properties from a blog post. I am trying to understand the part under "summary table (short and narrow)", starting by coercing data.frame(mtcars) to data.table:
> data <- as.data.table(mtcars)
> data <- data[,.(gear,cyl)]
> head(data)
gear cyl
1: 4 6
2: 4 6
3: 4 4
4: 3 6
5: 3 8
6: 3 6
Up to this point everything is fine.
Now I have tried this data[, gearsL := list(list(unique(gear))), by=cyl]
> head(data)
gear cyl gearsL
1: 4 6 4,3,5
2: 4 6 4,3,5
3: 4 4 4,3,5
4: 3 6 4,3,5
5: 3 8 3,5
6: 3 6 4,3,5
I am able to understand unique(gear), but I am unable to understand what list(list(unique(gear))) is doing.

A data.table -- like any data.frame -- is a list of pointers to column vectors.
When creating new columns, we write j of DT[i,j,by] so that it evaluates to a list of columns:
DT[, (newcol_names) := list(newcol_A, newcol_B)]
That's what the outermost list() in the OP's example does, for a single list column.
data[,gearsL := list(list(unique(gear))), by=cyl]
This can and should be written using the alias .(), for clarity:
data[, gearsL := .(list(unique(gear))), by=cyl]
That's all you need to know, but I've put some elaboration below.
Details. When creating a new column, we can often skip list()/.():
DT = data.table(id=1:3)
DT[, E := c(4,5,6)]
DT[, R := 3]
# this works as if we had typed
# R := c(3,3,3)
Note that E enumerates each value, while R recycles a single value over all rows. Next example:
DT[, Elist := list(hist(rpois(1,1)), hist(rpois(2,2)), hist(rpois(3,3)))]
As we did for E, we're enumerating the values of Elist here. This still uses the shortcut; list() is here only because the column is itself a list, as confirmed by
sapply(DT, class)
# id E R Elist
# "integer" "numeric" "numeric" "list"
The convenient shortcut of skipping list()/.() fails in one special case: when we are creating a list column that recycles its value:
DT[, Rlist := list(c("a","b"))]
# based on the pattern for column R, this should work as if we typed
# Rlist := list(c("a","b"), c("a","b"), c("a","b"))
It doesn't work because the parser sees this as Rlist := .( c("a", "b") ) and thinks we simply neglected to make a full enumeration with one value for each row, like Elist does. To get the desired result, skip the shortcut and wrap the vector in list()/.():
DT[, Rlist := .(list(c("a","b")))]
# id E R Elist Rlist
# 1: 1 4 3 <histogram> a,b
# 2: 2 5 3 <histogram> a,b
# 3: 3 6 3 <histogram> a,b
This is the case in the OP's example, where the outer list()/.() is necessary.
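The recycled list column can be checked with a small self-contained sketch. It uses the explicit list(list(...)) spelling and leaves out the histogram column to keep the example short.

```r
library(data.table)

DT <- data.table(id = 1:3)

# Outer list(): "this is a list of columns" (here, one column).
# Inner list(): the column itself is a list holding a single element,
# which is recycled over all three rows.
DT[, Rlist := list(list(c("a", "b")))]

sapply(DT, class)  # id is "integer", Rlist is "list"
lengths(DT$Rlist)  # every row holds the same length-2 vector
```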

How do I reorder data.table columns? [duplicate]

This question already has answers here:
How to reorder data.table columns (without copying)
(2 answers)
Closed 8 years ago.
How do I permute columns in a data.table?
I can do that for a data.frame, but data.table overrides the method:
> df <- data.frame(a=1:3,b=4:6)
> df
a b
1 1 4
2 2 5
3 3 6
> df[c("b","a")]
b a
1 4 1
2 5 2
3 6 3
> dt <- as.data.table(df)
> dt
a b
1: 1 4
2: 2 5
3: 3 6
> dt[c("b","a")]
Error in `[.data.table`(dt, c("b", "a")) :
When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
Calls: [ -> [.data.table
Note that this is not a dupe for How does one reorder columns in R?.

Use setcolorder:
> library(data.table)
> dt <- data.table(a=1:3,b=4:6)
> setcolorder(dt, c("b", "a"))
> dt
b a
1: 4 1
2: 5 2
3: 6 3

This is how you do it in data.table (without modifying the original table):
dt[, list(b, a)]
or
dt[, c("b", "a")]
or
dt[, c(2, 1)]
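The difference between the two answers can be seen side by side in the sketch below: selecting in j returns a new table and leaves dt alone, while setcolorder reorders dt itself. The explicit with = FALSE is an assumption made for safety, since older data.table versions needed it for a character vector in j.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6)

# Selection in j: returns a reordered copy, dt keeps its column order
reordered <- dt[, c("b", "a"), with = FALSE]
names(reordered)  # "b" "a"
names(dt)         # "a" "b"

# setcolorder: reorders the columns of dt by reference, no copy is made
setcolorder(dt, c("b", "a"))
names(dt)         # "b" "a"
```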
