Unexpected .GRP sequence in data.table

Given a data.table such as:
library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))
and a function which converts a numeric vector into an ordered factor, such as:
toAgeGroups <- function(x){
  groups = c('Under 40', '40-64', '65+')
  grp = findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}
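For instance:
toAgeGroups(c(25, 50, 70))
# [1] Under 40 40-64    65+
# Levels: Under 40 < 40-64 < 65+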
I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP.
pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]
returns:
age_segment age_segment_id pop_count
1: Under 40 1 1743
2: 40-64 3 2015
3: 65+ 2 1242
I would have expected the age_segment_id values to be c(1,2,3), not c(1,3,2), but .GRP seems to be set by order of occurrence in the underlying data (as in by= order) rather than by sorted order (as in keyby=).
I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:
pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]
to get what I want.
Is this expected behavior? If so, is there a better workaround?
(v. 1.9.6)

This issue should no longer occur in versions 1.9.8+ of data.table.
library(data.table) #1.9.8+
pop[, .(age_segment_id = .GRP, pop_count=.N),
keyby=.(age_segment = toAgeGroups(age))]
# age_segment age_segment_id pop_count
# 1: Under 40 1 1743
# 2: 40-64 2 2015
# 3: 65+ 3 1242
For some more, see the discussion here. Basically, by works internally by returning sorted rows for each group and then re-sorting the table back to its original order.
The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.
Before (through 1.9.6), keyby would just re-sort the answer at the end by running setkey, as documented in ?data.table:
[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.
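In other words, on 1.9.6 the keyby= call above behaved like this two-step sketch: .GRP is assigned in order of appearance, and only afterwards does setkey() re-sort the rows, which is why the ids look scrambled:
# equivalent of keyby= through 1.9.6: ids fixed in appearance order
# (Under 40 = 1, 65+ = 2, 40-64 = 3 for this seed), then rows re-sorted
res <- pop[, .(age_segment_id = .GRP, pop_count = .N),
           by = .(age_segment = toAgeGroups(age))]
setkey(res, age_segment)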
Thus, on less-than-brand-new versions of data.table, you'd have to fix your code as:
pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
keyby=.(age_segment = toAgeGroups(age))]
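Alternatively, a variant of your .I workaround reads the index straight off the ordered factor, since its integer codes already follow the sorted level order:
# as.integer() on the ordered factor gives the level index, which
# matches the keyby (sorted) order -- no .GRP needed
pop[, .(pop_count = .N), keyby = .(age_segment = toAgeGroups(age))
    ][, age_segment_id := as.integer(age_segment)][]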

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works, but you will get coercion warnings:
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
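If you want a stricter version of the sapply() calls above, vapply() pins the return type and errors instead of silently returning a list on malformed input:
# vapply guarantees a logical vector of length(a)
dt[vapply(a, function(l) !anyNA(l), logical(1L))]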
Another option that you might prefer: you could move the functionality to a separate helper function that ingests a list of vectors (nl) and returns a logical vector of length equal to length(nl), and then apply that function as below. In this example, I explicitly call unlist() on the result of lapply() rather than letting sapply() do that for me, but I could also have used sapply().
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14

Generate group by condition on row value in column R data.table

I want to split a data.table in R into groups based on a condition in the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to loop over rows with a for loop).
I have data like this:
library(data.table)
dt1 <- data.table(x = 1:139, t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table(x = 1:139, t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)), group = c(rep(1, 50), rep(2, 46), rep(3, 43)))
Since each large value starts a new group (as rows x = 51 and x = 97 do in dt.desired), a cumulative sum of the condition gives the group directly:
dt1[ , group := cumsum(t > 200) + 1]
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
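If you prefer, the three assignments collapse into a single line; fill = 0L handles the first row that shift() would otherwise leave as NA:
dt1[, group := shift(cumsum(t > 100), fill = 0L) + 1L]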

Modify list-column by reference in nested data.table

When using a list column of data.tables in a nested data.table it is easy to apply a function over the column. Example:
dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
We can use:
dt[ ,list(length = nrow(dt.mtcars[[1]])), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
or
dt[, list( length = lapply(dt.mtcars, nrow)), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
I would like to do the same process and apply a modification by reference using the operator := to each data.table of the column.
Example:
modify_by_ref <- function(d){
  d[, max_hp := max(hp)]
}
dt[, modify_by_ref(dt.mtcars[[1]]), by = gear]
That returns the error:
Error in `[.data.table`(d, , `:=`(max_hp, max(hp))) :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Using the tip in the error message does not work in any way for me; it seems to be targeting another case, but maybe I am missing something. Is there any recommended way or flexible workaround to modify list columns by reference?
This can be done in the following two steps, or in a single step:
The given table is:
dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add a list of hp vectors, one in each row of dt:
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp:
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
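As a quick check, these maxima agree with max(hp) computed directly on the standard mtcars data:
dt[, .(gear, max_hp)]
#    gear max_hp
# 1:    4    123
# 2:    3    245
# 3:    5    335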
Starting again from the given table:
dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Single step - this is the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within the nested tables by reference, the following link discusses how to do it; we just need to ignore a warning message. I would be happy if anyone can point out how to fix the warning message, or whether there is any pitfall. For more detail, please refer to the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from it, I am going to show how to do that here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in a different way:
dt <- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that I have defined the table slightly differently: I have used data.table() in addition to list() in the above definition.
Next, populate the max by reference within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, better still, we can perform further manipulation within the nested tables:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
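A quick peek inside one of the nested tables confirms that both new columns landed there (the first nested table holds the gear == 4 rows):
dt$dt.mtcars[[1]][, .(hp, max_hp, weighted_hp_carb)]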

Is it possible to negate columns in the by parameter for data.table R

I would like to specify sum columns and group by the remaining columns. It seems there is no way to negate columns in the by parameter as is possible for .SDcols. Is that correct? I have found another way of doing it, but was wondering if I am missing some data.table magic.
a=data.table(a=c(1,3,1), b=c(2,2,3), c=c(5,6,7))
not_gp = c('b','c')
# this works but is not what I want!
a[,lapply(.SD,sum),by=not_gp,.SDcols =!not_gp]
# what I want, but doesn't work
a[,lapply(.SD,sum),by=!not_gp,.SDcols =not_gp]
# Error in !not_gp : invalid argument type
#does work
gp = names(a)[!names(a) %in% not_gp]
a[,lapply(.SD,sum),by=gp,.SDcols =not_gp]
# also works
a[,lapply(.SD,sum),by=gp]
You could use:
a[, lapply(.SD, sum), by = setdiff(names(a), not_gp), .SDcols = not_gp]
Which gives you:
a b c
1: 1 5 12
2: 3 2 6
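If this pattern comes up a lot, you can tuck the setdiff() away in a small helper (sum_by_rest is just an illustrative name):
# sum the given columns, grouping by every remaining column
sum_by_rest <- function(DT, sum_cols) {
  DT[, lapply(.SD, sum), by = setdiff(names(DT), sum_cols), .SDcols = sum_cols]
}
sum_by_rest(a, not_gp)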

Perform row-wise operations on a data.table for a vector-valued column

EDIT:
(I apologize for the fact that my example was oversimplified, and I will try to remedy this, as well as format my more relevant example in a more convenient format for copying directly into R. In particular, there are multiple value columns, and some preceding columns with other information that does not need to be parsed.)
I am fairly new to R, and to data.table as well, so I would appreciate input on an issue I am finding. I am working with a data table where one column is a colon-separated format string that serves as a legend for values in other colon-separated columns. In order to parse it, I have to first split it into its components, and then search for the indices of the components I need in order to later index the value strings. Here is a simplified example of the sort of situation I might be working with:
DT <- data.table(number=c(1:5),
format=c("name:age","age:name","age:name:height","height:age:name","weight:name:age"),
person1=c("john:30","40:bill","20:steve:100","300:70:george","140:fred:20"),
person2=c("jane:31","42:ivan","21:agnes:120","320:72:vivian","143:rose:22"))
When evaluated, we get
> DT
number format person1 person2
1: 1 name:age john:30 jane:31
2: 2 age:name 40:bill 42:ivan
3: 3 age:name:height 20:steve:100 21:agnes:120
4: 4 height:age:name 300:70:george 320:72:vivian
5: 5 weight:name:age 140:fred:20 143:rose:22
Let's say that for each person, I need to know ONLY their name and age, and don't need their height or weight; in this example, and in my actual data, every format string has fields for name and age, but possibly in different positions (the fields that I am actually looking for are usually fixed in certain columns, but I am reluctant to hard-code any indices as I am not completely familiar with the production of the data files I am working with). I would first split up the format string and then do a match() search for the names of the fields I want.
DT[, format.split := strsplit(format, ":")]
At this point, the only method I used that worked to perform the match was a vapply:
DT[, index.name := vapply(format.split, function (x) match('name', x), 0L)]
DT[, index.age := vapply(format.split, function (x) match('age', x), 0L)]
because I don't know of any other way to let R know that it should look at each row's value individually, rather than at the whole column bunched together as a vector, and perform the match on each row's vector-valued format.split entry instead of on the whole column of rows. Even then, once I find the indices for each row, I have to perform another strsplit() and then an mapply() to parse the name and age out of each person's value string:
DT[, person1.split := strsplit(person1, ':')]
DT[, person1.name := mapply(function(x, y) x[y], person1.split, index.name)]
DT[, person1.age := mapply(function(x, y) x[y], person1.split, index.age)]
DT[, person2.split := strsplit(person2, ':')]
DT[, person2.name := mapply(function(x, y) x[y], person2.split, index.name)]
DT[, person2.age := mapply(function(x, y) x[y], person2.split, index.age)]
(And, of course, I would do the same for any additional person columns.)
I am working with fairly large data sets, so I'd like my code to be as efficient as possible. Does anyone have recommendations for ways I can speed up or otherwise optimize my code?
(NOTE: I am really looking for the right approach to take, not the right *apply or *ply or Map function to use. If *(ap)ply or Map really is the right approach, I would appreciate knowing which is the most efficient or appropriate for my situation, but if there is a better way of approaching this parsing problem, I would prefer recommendations about that to function suggestions. Suggestions are welcome, though.)
EDIT 2:
It turns out that my example was much more general than it need have been. I only need two fields, which are always going to be the first two fields in the format string, without variation. The first field is just a literal character string. The second field, however, consists of at least 2 numbers, separated by commas (ultimately, I filter out any rows with more than 2 numbers in the second field, so the possibility of more is only relevant if the filtering happens after the parsing). For each of the (3) value strings, I only need to create three columns: a character column for the first field, and two numeric columns, one for each member of the comma-separated pair in the second field. Any other fields are irrelevant. My current method, which is probably quite inefficient, is to use sub() to pattern-match on the desired fields and subfields with back-references.
> DT <- data.table(id=1:5,
format=c(rep("A:B:C:D:E", 5)),
person1=paste(paste0("foo",LETTERS[1:5]), paste(1:5, 10:6, sep=','), "blah", "bleh", "bluh", sep=':'),
person2=paste(paste0("bar",LETTERS[1:5]), paste(16:20, 5:1, sep=','), "blah", "bleh", "bluh", sep=':'),
person3=paste(paste0("baz",LETTERS[1:5]), paste(0:4, 12:8, sep=','), "blah", "bleh", "bluh", sep=':'))
> DT
id format person1 person2 person3
1: 1 A:B:C:D:E fooA:1,10:blah:bleh:bluh barA:16,5:blah:bleh:bluh bazA:0,12:blah:bleh:bluh
2: 2 A:B:C:D:E fooB:2,9:blah:bleh:bluh barB:17,4:blah:bleh:bluh bazB:1,11:blah:bleh:bluh
3: 3 A:B:C:D:E fooC:3,8:blah:bleh:bluh barC:18,3:blah:bleh:bluh bazC:2,10:blah:bleh:bluh
4: 4 A:B:C:D:E fooD:4,7:blah:bleh:bluh barD:19,2:blah:bleh:bluh bazD:3,9:blah:bleh:bluh
5: 5 A:B:C:D:E fooE:5,6:blah:bleh:bluh barE:20,1:blah:bleh:bluh bazE:4,8:blah:bleh:bluh
My code then does this:
DT[, `:=`(person1.A=sub("^([^:]*):.*$","\\1", person1),
person2.A=sub("^([^:]*):.*$","\\1", person2),
person3.A=sub("^([^:]*):.*$","\\1", person3),
person1.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person1),
person1.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person1),
person2.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person2),
person2.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person2),
person3.B.first=sub("^[^:]*:([^:,]*),.*$","\\1", person3),
person3.B.second=sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$","\\1", person3))]
for the splitting, and filters by
DT <- DT[grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person1) &
grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person2) &
grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person3) ]
I understand that this method is probably very inefficient, but it was the first improvement I came up with over my old approach of repeatedly applying strsplit. With the new conditions in mind, is there an even better way of doing things than melt, cSplit, dcast?
EDIT 3:
Since I only needed the first two fields, I ended up trimming all the value strings, removing those with more than two commas (i.e. more than three second-field numbers), changing the commas to colons, replacing the format string of every line with the names of the (now 3) fields, and performing the dcast(cSplit(melt())) as suggested by @AnandaMahto. It seems to work well.
@bskaggs has the right idea that it might just make more sense to put your data into a long form, or even a structured wide form.
I'll show you two options, but first, it's always better to share your data in a way that others can actually use it:
DT <- data.table(
format = c("name:age", "name:age:height", "age:height:name",
"height:weight:name:age", "name:age:weight:height",
"name:age:height:weight"),
values = c("john:30", "rene:33:183", "100:10:speck",
"100:400:sumo:11", "james:43:120:120",
"plink:2:300:400"))
I'm also going to suggest you use my cSplit function.
Here's how you would easily convert this dataset into a long form:
cSplit(DT, c("format", "values"), ":", "long")
# format values
# 1: name john
# 2: age 30
# 3: name rene
# 4: age 33
# 5: height 183
# 6: age 100
# 7: height 10
# 8: name speck
# 9: height 100
# 10: weight 400
# 11: name sumo
# 12: age 11
# 13: name james
# 14: age 43
# 15: weight 120
# 16: height 120
# 17: name plink
# 18: age 2
# 19: height 300
# 20: weight 400
Once the data are in a "long" form, you can convert it easily to a "wide" form using dcast.data.table, like this. (I've also reordered the columns using setcolorder, which lets you rearrange the data without copying.)
X <- dcast.data.table(
cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"),
id ~ format, value.var = "values")
setcolorder(X, c("id", "name", "age", "height", "weight"))
X
# id name age height weight
# 1: 1 john 30 NA NA
# 2: 2 rene 33 183 NA
# 3: 3 speck 100 10 NA
# 4: 4 sumo 11 100 400
# 5: 5 james 43 120 120
# 6: 6 plink 2 300 400
How does this fare in terms of speed?
First, a very moderate dataset:
DT <- rbindlist(replicate(2000, DT, FALSE))
dim(DT)
# [1] 12000 2
## @bskaggs's suggestion
system.time(colonMelt(DT))
# user system elapsed
# 0.27 0.00 0.27
## cSplit. It would be even faster if you already had
## an id column and didn't need to cbind one in
system.time(cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"))
# user system elapsed
# 0.02 0.00 0.01
## cSplit + dcast.data.table
system.time(dcast.data.table(
cSplit(cbind(id = 1:nrow(DT), DT),
c("format", "values"), ":", "long"),
id ~ format, value.var = "values"))
# user system elapsed
# 0.08 0.00 0.08
Update
For your updated problem, you can melt the "data.table" first, and then proceed similarly:
library(reshape2)
## Melting, but no reshaping -- a nice long format
cSplit(melt(DT, id.vars = c("number", "format")),
c("format", "value"), ":", "long")
## Try other combinations for the LHS and RHS of the
## formula. This seems to be what you might be after
dcast.data.table(
cSplit(melt(DT, id.vars = c("number", "format")),
c("format", "value"), ":", "long"),
number ~ variable + format, value.var = "value")
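For the fixed-position layout in your EDIT 2, you could also skip the regexes with tstrsplit() (a sketch assuming data.table >= 1.9.6, where tstrsplit() was introduced; shown for person1 only):
# keep only the first two ':'-separated fields, then split field B on ','
DT[, c("person1.A", "person1.B") := tstrsplit(person1, ":", keep = 1:2)]
DT[, c("person1.B.first", "person1.B.second") :=
     tstrsplit(person1.B, ",", keep = 1:2, type.convert = TRUE)]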
I think you may be better served by using a tall tidy format:
colonMelt <- function(DT) {
  formats <- strsplit(DT$format, ":")
  rows <- rep(row.names(DT), sapply(formats, length))
  data.frame(row = rows,
             key = unlist(formats),
             value = unlist(strsplit(DT$values, ":")))
}
newDT <- colonMelt(DT)
The result is a format on which it is much easier to search and filter without string splitting all the time:
row key value
1 1 name john
2 1 age 30
3 2 name rene
4 2 age 33
5 2 height 183
6 3 age 100
7 3 height 10
8 3 name speck
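And once the data are tall, pulling out just the fields you need is a plain subset:
# keep only the name and age rows
subset(newDT, key %in% c("name", "age"))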
