In my data.table, I wanted to number the entries within each by group whenever the group has more than one row:
dt1 <- data.table(col1=1:4, col2 = c('A', 'B', 'B', 'C'))
# col1 col2
# 1: 1 A
# 2: 2 B
# 3: 3 B
# 4: 4 C
dt1[, col3:={
if (.N>1) {paste0((1:.N), "_", col2)} else {col2};
}, by=col2]
# col1 col2 col3
# 1: 1 A A
# 2: 2 B 1_B
# 3: 3 B 2_B
# 4: 4 C C
This works fine, but didn't work when I tried to use ifelse() instead:
dt1[, col4:=ifelse (.N>1, paste0((1:.N), "_", col2), col2), by=col2]
# col1 col2 col3 col4
# 1: 1 A A A
# 2: 2 B 1_B 1_B
# 3: 3 B 2_B 1_B
# 4: 4 C C C
Can anyone explain why?
This is only indirectly related to data.table; at its core, the issue is that ifelse is designed to be used like this:
ifelse(test, yes, no)
where test, yes, and no all have the same length -- the output has the same length as test; wherever test is TRUE the output takes the corresponding element of yes, and wherever test is FALSE it takes the corresponding element of no.
When test is a scalar and yes or no are vectors, as in your case, you have to look at what ifelse is doing to understand what's going on:
Relevant source:
if (any(test[ok])) # is any element of `test` `TRUE`?
  ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok]
What is rep(c(1, 2), length.out = 1)? It's just 1 -- the second element is truncated.
That's what's happened here -- the value of ifelse is only the first element of paste0(1:.N, "_", col2). When passed to `:=`, this single element is recycled.
When your logical condition is a scalar, you should use if, not ifelse. I'll also add that I do my damnedest to avoid using ifelse in general because it's slow.
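To see the truncation outside of data.table, here is a minimal base R sketch. The values of n and col2 are stand-ins for what .N and col2 would be inside the "B" group of the question; they are assumptions for illustration only.

```r
# Stand-ins for the per-group values of the "B" group (.N is 2 there).
n    <- 2L
col2 <- c("B", "B")

# Vectorised use: test, yes and no all have the same length.
ifelse(c(TRUE, FALSE), c("y1", "y2"), c("n1", "n2"))   # "y1" "n2"

# Scalar test: the result has the length of the test, i.e. 1,
# so only the first element of `yes` survives.
truncated <- ifelse(n > 1, paste0(seq_len(n), "_", col2), col2)
truncated   # "1_B"

# if/else returns the chosen branch whole.
full <- if (n > 1) paste0(seq_len(n), "_", col2) else col2
full        # "1_B" "2_B"
```

Inside `:=` with by, the length-1 result is then recycled over the group, which is why both rows of group B ended up as 1_B.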
I have a data.table DT with millions of rows and quite a few columns.
I'd like to aggregate the data.table on various columns at the same time. One column, 'Var', is a categorical variable, and I want to aggregate it so that the most frequent entry is chosen.
> require(data.table)
> DT <- data.table(ID = c(1,1,1,1,2,2,2,3,3), Var = c('A', 'B', 'B', 'B', 'C', 'C', 'A', 'A', 'A'))
> DT
ID Var
1: 1 A
2: 1 B
3: 1 B
4: 1 B
5: 2 C
6: 2 C
7: 2 A
8: 3 A
9: 3 A
My desired output is:
> desired_output
ID agg_Var
1: 1 B # B occurred the most for ID = 1
2: 2 C # C occurred the most for ID = 2
3: 3 A # A occurred the most for ID = 3
I know I can do this in two steps: first by counting the occurrences for each ID and Var, then by choosing the row with the maximum frequency:
> ## I know this works but it involves more than one step:
> step1 <- DT[,.( freq = .N), by=.(ID, Var)]
> step1
ID Var freq
1: 1 A 1
2: 1 B 3
3: 2 C 2
4: 2 A 1
5: 3 A 2
> step2 <- step1[, .(Var_agg = Var[which.max(freq)]), by = .(ID)]
> step2
ID Var_agg
1: 1 B
2: 2 C
3: 3 A
Is there a way to do this in one step?
The reason is that I have quite a few other aggregations I need to do for this table, and they all take a single step. It would be great if I didn't have to do a separate aggregation for this column, so that I could just include it with the aggregation of the other columns. This is a code optimisation issue. I'm only interested in data.table operations, not additional packages.
Create a function that calculates the mode, then call it inside a grouped aggregation:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
DT[, .(agg_Var = Mode(Var)), ID]
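Since the point of the question was to fold this into other one-step aggregations, here is a small sketch of how that might look. The val column is hypothetical (it is not in the original data) and only stands in for "the other columns".

```r
library(data.table)

# Mode as defined above: the most frequent value of x.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

DT <- data.table(
  ID  = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  Var = c("A", "B", "B", "B", "C", "C", "A", "A", "A"),
  val = 1:9   # hypothetical numeric column, not part of the question
)

# Mode sits next to any other aggregation in the same j expression.
res <- DT[, .(agg_Var = Mode(Var), total = sum(val), n = .N), by = ID]
res$agg_Var   # "B" "C" "A"

# On a tie, which.max picks the first maximum, so the value seen first wins.
tie <- Mode(c("A", "B"))
tie           # "A"
```

If ties should be resolved differently (e.g. alphabetically, or returned as NA), Mode is the place to change that.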
I think I am on the right track with this code, but I am not quite there yet.
I tried finding something useful on Google and SE, but I did not seem to be able to formulate the question in a way that gets me the answer I am looking for.
I could write a for-loop for this, comparing for each id and for each unique value of a per row, but I strive to achieve a higher level of R-understanding and thus want to avoid loops.
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
a <- c(1,1,1,2,2,2,3,3,4,4,4,5,5,5,6)
b <- c(1,2,3,3,3,4,3,4,5,4,4,5,6,7,8)
require(data.table)
dt <- data.table(id, a, b)
dt
dt[,unique(a) %in% b, by=id]
tmp <- dt[,unique(a) %in% b, by=id]
tmp$id[tmp$V1 == FALSE]
In my example, IDs 2, 3 and 5 should be the result, the decision rule being: "By id, check for each unique value of a whether there is at least one observation where the value of b equals the value of a."
However, my code only outputs IDs 2 and 5, but not 3. This is because for ID 3, the 4 is matched with the 4 of the previous observation.
The result should either output the IDs for which the condition is not met, or add a dummy variable to the original table that indicates whether the condition is met for the ID.
How about
dt[, all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
# id V1
#1: 1 TRUE
#2: 2 FALSE
#3: 3 FALSE
#4: 4 TRUE
#5: 5 FALSE
If you want to add a dummy variable to the original table, you can modify it like
dt[, check:=all(sapply(unique(a), function(i) any(a == i & b == i))), by = id]
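For comparison, a sketch of the same rule without sapply. This is my own variant, not part of the answer above: a[a == b] collects the values of a that are matched by b in the same row, so the check becomes a single %in% per group.

```r
library(data.table)

dt <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  a  = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6),
  b  = c(1, 2, 3, 3, 3, 4, 3, 4, 5, 4, 4, 5, 6, 7, 8)
)

# For each id: is every unique value of a matched by b on at least one row?
res <- dt[, .(ok = all(unique(a) %in% a[a == b])), by = id]

# IDs for which the condition is not met.
failing <- res[ok == FALSE, id]
failing   # 2 3 5
```

The difference to the original attempt is that the comparison a == b is row-wise, so the 4 in b on one row of ID 3 can no longer satisfy the 4 in a on another row.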
I wondered if I could find a more data.table-esque solution for this old question using the enhanced join capabilities introduced to data.table in version 1.9.6 (on CRAN 19 Sep 2015). With that version, data.table gained the ability to join without having to set keys, by using the on argument.
Variant 1
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)][is.na(b), unique(id)]
[1] 2 3 5
First, the rows of dt where a and b are equal are selected. Only these rows are right joined with the unique values of a for each id. The result of the join is
dt[a == b][dt[, unique(a), by = id], on = .(id, a == V1)]
id a b
1: 1 1 1
2: 2 2 NA
3: 3 3 3
4: 3 4 NA
5: 4 4 4
6: 4 4 4
7: 4 5 5
8: 5 5 NA
9: 5 6 NA
The NA values in column b indicate that no match is found. Any id which has an NA value indicates that OP's condition is not met.
Variant 2
dt[dt[, unique(a), by = id], on = .(id, a == V1, b == V1), unique(id[is.na(x.a)])]
[1] 2 3 5
This variant right joins dt (unfiltered!) with the unique values of a for each id, but the join conditions require matches in id as well as matches in a and b. (This resembles the a == i & b == i expression in konvas' accepted answer.) Finally, those ids are returned which have at least one NA value in the join result, indicating a missing match.
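A third way to phrase the same idea, offered only as a sketch and not as one of the variants above, is an anti-join: keep the unique (id, a) pairs that have no partner among the rows where a == b. The names pairs and matched are made up for this illustration.

```r
library(data.table)

dt <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  a  = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6),
  b  = c(1, 2, 3, 3, 3, 4, 3, 4, 5, 4, 4, 5, 6, 7, 8)
)

pairs   <- unique(dt[, .(id, a)])   # every (id, a) combination once
matched <- dt[a == b]               # rows where b equals a

# Not-join: rows of pairs without a match in matched; then collect their ids.
failing <- pairs[!matched, on = .(id, a), unique(id)]
failing   # 2 3 5
```

This avoids inspecting NA values after the join, at the cost of building the intermediate pairs table.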
Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
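The difference between get and mget can be seen in plain R, without data.table. In this sketch env is a hypothetical environment standing in for the table's columns.

```r
# Hypothetical environment holding two "columns".
env  <- list2env(list(a = 1:4, b = 5:8))
cols <- c("a", "b")

# get returns the bare value, so a list built from it has no names.
unnamed <- lapply(cols, get, envir = env)
names(unnamed)   # NULL

# mget returns a list named after the requested objects.
named <- mget(cols, envir = env)
names(named)     # "a" "b"

# Adding the names back by hand gives the same result as mget.
renamed <- setNames(unnamed, cols)
identical(renamed, named)   # TRUE
```

The V1, V2 headers in the lapply output above are what data.table falls back to when the list elements in j have no names.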
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfuscated code.
There are a number of options in data.table
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
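A short self-contained check that the two options select the same columns. Here z is built with named arguments rather than from global vectors, so that nothing depends on variables called a or b existing outside the table.

```r
library(data.table)

z    <- data.table(a = 1:4, b = 5:8, c = c("aa", "bb", "cc", "dd"), e = 1:4)
cols <- c("a", "b")

by_prefix <- z[, ..cols]                  # look up cols one level up
by_sdcols <- z[, .SD, .SDcols = cols]     # restrict .SD to cols

names(by_prefix)   # "a" "b"
names(by_sdcols)   # "a" "b"
```

Both keep the column names, which was the part that get could not do.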
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char=NULL, uneval=NULL) {
  Call <- match.call()
  cols <- lapply(Call$uneval, as.character)
  unlist(c(char, cols))
}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine a list of column names in a variable cols with another hard-coded column name c, we combine them in a new character vector c(cols, 'c') in the call to data.table. We can refer to cols from within j (the second argument within []) by using the "up-one-level" notation ..:
z[, c(..cols, 'c')]
Thank you to @thelatemail for providing the base to the solution above.
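If the .. prefix inside c() is not available in your data.table version, two spellings that build the character vector first should give the same selection. This is a sketch using the same z and cols, rebuilt here so the block runs on its own.

```r
library(data.table)

z    <- data.table(a = 1:4, b = 5:8, c = c("aa", "bb", "cc", "dd"), e = 1:4)
cols <- c("a", "b")

# with = FALSE makes j an ordinary expression that returns column names.
res1 <- z[, c(cols, "c"), with = FALSE]

# .SDcols accepts the combined character vector directly.
res2 <- z[, .SD, .SDcols = c(cols, "c")]

names(res1)   # "a" "b" "c"
names(res2)   # "a" "b" "c"
```

In both cases the hard-coded column is given as the string "c", so there is no mixing of quoted and unquoted column references in one call.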
I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to split this data.table into two. The criterion is to have approximately the first half of each group (in this case, grouped by column "id") in one data.table and the remainder in another data.table. So the expected result is two data.tables as follows
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following,
z1 = x[,.SD[.I < .N/2,],by=id]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.
.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
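A quick way to see this is to print .I next to a within-group counter. The sketch below rebuilds the table from the question; the column names whole_table, within_group and half are made up for the illustration.

```r
library(data.table)

DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), times = c(7, 4)))

idx <- DT[, .(ts,
              whole_table  = .I,            # row numbers in DT
              within_group = seq_len(.N),   # 1..N inside each group
              half         = .N / 2),
          by = id]

idx[id == "b", whole_table]    # 8 9 10 11
idx[id == "b", within_group]   # 1 2 3 4
```

For group b, .I runs from 8 to 11 while .N/2 is 2, so a condition like .I < .N/2 is never true there, which is why only rows of group a came back.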
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
Should work
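Note that the split above puts the extra row of an odd-sized group into the second half. If the first table should get it instead, as in the expected output of the question, one possible variant is to compare against ceiling(.N / 2). The flag name first_half is my own choice.

```r
library(data.table)

DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), times = c(7, 4)))

# TRUE for the first ceiling(N/2) rows of each group.
DT[, first_half := seq_len(.N) <= ceiling(.N / 2), by = id]

first  <- DT[first_half == TRUE,  .(ts, id)]
second <- DT[first_half == FALSE, .(ts, id)]

first$ts    # 1 2 3 4 1 2
second$ts   # 5 6 7 3 4
```

Selecting .(ts, id) in j drops the helper flag from the two results without touching DT again.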
For more than 2 groups, you could use cut to assign each row to one of the required number of bins (with labels = FALSE, cut returns integer codes rather than a factor)
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)
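If you would rather not create objects in the global environment, a sketch that keeps the pieces together in a named list could look like this. It assumes a data.table version whose split method accepts by; the column name chunk is my own and avoids reusing the name of the base function subset.

```r
library(data.table)

DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), times = c(7, 4)))

# Integer bin code (1, 2 or 3) for each row, computed within each id.
DT[, chunk := cut(seq_len(.N), 3, labels = FALSE), by = id]

# One data.table per bin, collected in a list.
pieces <- split(DT, by = "chunk")
names(pieces) <- paste0("s", names(pieces))

length(pieces)   # 3
```

Each piece can then be reached as pieces$s1, pieces$s2 and so on.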
Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
Yours is not working because .I and .N are not available in the i-expression (i.e. the first argument of [), so the parent data.table's .I and .N are used (i.e. dt's).