When using a list column of data.tables in a nested data.table, it is easy to apply a function over the column. Example:
dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
We can use:
dt[, list(length = nrow(dt.mtcars[[1]])), by = gear]
   gear length
1:    4     12
2:    3     15
3:    5      5
or
dt[, list(length = lapply(dt.mtcars, nrow)), by = gear]
   gear length
1:    4     12
2:    3     15
3:    5      5
I would like to do the same kind of operation, but apply a modification by reference, using the := operator, to each data.table in the column.
Example:
modify_by_ref <- function(d){
  d[, max_hp := max(hp)]
}
dt[, modify_by_ref(dt.mtcars[[1]]), by = gear]
That returns the error:
Error in `[.data.table`(d, , `:=`(max_hp, max(hp))) :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Using the tip in the error message does not work for me in any way; it seems to target another case, but maybe I am missing something. Is there any recommended way or flexible workaround to modify list columns by reference?
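The case the tip seems to target is the ordinary flat-table one, where := in j modifies by group directly, e.g.:
data.table(mtcars)[, max_hp := max(hp), by = gear]
which adds a column to the flat table rather than modifying each nested data.table in the list column.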
This can be done in the following two steps, or in a single step:
The given table is:
dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add a list of hp column vectors to each row of dt:
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp:
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
Single Step - Starting again from the table given above, the single step is the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within the nested tables by reference, the following link shows how to do it; we just need to ignore a warning message. I would be happy if anyone can point out how to fix the warning message, or whether there is any pitfall. For more detail, please refer to the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from the same, I am going to show how to do it here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in a different way:
dt <- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that I have defined the table slightly differently: I wrapped .SD in data.table() in addition to list() in the definition above.
Next, populate the max by reference within the nested tables:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, better still, we can perform further manipulation within the nested tables:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
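As a side note, because data.table(.SD) stores a copy rather than the locked .SD object itself, the original modify_by_ref() from the question should now also work directly on the nested tables (a quick sketch under that assumption; it may emit the same shallow-copy warning discussed above):
modify_by_ref <- function(d){
  d[, max_hp := max(hp)]
}
dt[, { modify_by_ref(dt.mtcars[[1]]); NULL }, by = gear]
dt$dt.mtcars[[1]]  # max_hp is present in the nested table
Returning NULL from j suppresses the per-group output; the modification happens by reference inside each nested table.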
I want to split a data.table in R into groups based on a condition on the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to loop across rows).
I have data like this:
library(data.table)
dt1 <- data.table(x = 1:139, t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table(x = 1:139, t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)), group = c(rep(1, 50), rep(2, 46), rep(3, 43)))
Since each large value starts a new group in dt.desired, a cumulative count of the large values gives the group directly:
dt1[, group := cumsum(t > 200) + 1]
Checking against the desired output:
dt1[t > 200]
#     x      t group
# 1: 51 120928     2
# 2: 97  10400     3
dt.desired[t > 200]
#     x      t group
# 1: 51 120928     2
# 2: 97  10400     3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as > 100
dt1[, islarge := t > 100]
dt1[, group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
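The three steps can also be collapsed into one expression via shift()'s fill argument (a sketch, assuming the same "> 100" threshold):
dt1[, group := shift(cumsum(t > 100), fill = 0) + 1]
fill = 0 handles the first row, and the shift keeps each large value in the group above it, as before.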
I would like to specify sum columns and group by the remaining columns. It seems like there is no way to negate columns in the by parameter the way it's possible for .SDcols. Is that correct? I have found another way of doing it, but was wondering if I am missing some data.table magic.
a = data.table(a = c(1, 3, 1), b = c(2, 2, 3), c = c(5, 6, 7))
not_gp = c('b', 'c')
# this works but is not what I want!
a[, lapply(.SD, sum), by = not_gp, .SDcols = !not_gp]
# what I want, but doesn't work
a[, lapply(.SD, sum), by = !not_gp, .SDcols = not_gp]
# Error in !not_gp : invalid argument type
# does work
gp = names(a)[!names(a) %in% not_gp]
a[, lapply(.SD, sum), by = gp, .SDcols = not_gp]
# also works
a[, lapply(.SD, sum), by = gp]
You could use:
a[, lapply(.SD, sum), by = setdiff(names(a), not_gp), .SDcols = not_gp]
Which gives you:
   a b  c
1: 1 5 12
2: 3 2  6
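If this pattern comes up often, the setdiff() call can be wrapped in a small helper (sum_by_rest is a hypothetical convenience name, not a data.table function):
sum_by_rest <- function(DT, sum_cols) {
  DT[, lapply(.SD, sum), by = setdiff(names(DT), sum_cols), .SDcols = sum_cols]
}
sum_by_rest(a, not_gp)  # same result as above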
Given a data.table such as:
library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))
and a function which converts a numeric vector into an ordered factor, such as:
toAgeGroups <- function(x){
  groups = c('Under 40', '40-64', '65+')
  grp = findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}
I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP.
pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]
returns:
   age_segment age_segment_id pop_count
1:    Under 40              1      1743
2:       40-64              3      2015
3:         65+              2      1242
I would have expected the age_segment_id values to be c(1,2,3), not c(1,3,2), but .GRP seems set on order of occurrence in underlying data (as in by= order) rather than sorted order (as in keyby=).
I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:
pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]
to get what I want.
Is this expected behavior? If so, is there a better workaround?
(v. 1.9.6)
This issue should no longer occur in versions 1.9.8+ of data.table.
library(data.table) #1.9.8+
pop[, .(age_segment_id = .GRP, pop_count=.N),
keyby=.(age_segment = toAgeGroups(age))]
#    age_segment age_segment_id pop_count
# 1:    Under 40              1      1743
# 2:       40-64              2      2015
# 3:         65+              3      1242
For some more, see the discussion here. Basically, by works internally by returning sorted rows for each group and then re-sorting the table back to its original order.
The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.
Before (through 1.9.6), keyby would just re-sort the answer at the end by running setkey, as documented in ?data.table:
[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.
Thus, on less-than-brand-new versions of data.table, you'd have to fix your code as:
pop[order(age), .(age_segment_id = .GRP, pop_count = .N),
    keyby = .(age_segment = toAgeGroups(age))]
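For contrast: with plain by= instead of keyby=, .GRP follows the order in which the groups first appear in the data in any version, since no sorting is requested:
pop[, .(age_segment_id = .GRP, pop_count = .N),
    by = .(age_segment = toAgeGroups(age))]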
I am using data.table in R and looping over my table; it's really slow because of my table size. I wonder if someone has any idea on how to speed it up.
I have a set of values that I want to "cluster".
Each line has a position, a positive integer. You can load a simple view of that:
library(data.table)
# here is a toy example
fulltable = seq(1, 4) * seq(1, 1000, 10)
fulltable = data.table(pos = fulltable[order(fulltable)])
fulltable$id = 1
So I loop over my lines, and when there is a gap of more than 50 between two positions I change the group:
# here is the main loop
lastposition = fulltable[1]$pos
lastid = fulltable[1]$id
for (i in 2:nrow(fulltable)) {
  if (fulltable[i]$pos - 50 > lastposition) {
    lastid = lastid + 1
    print(lastid)
  }
  fulltable[i]$id = lastid
  lastposition = fulltable[i]$pos
}
Any idea for an efficient way to do this?
# rows that start a new group (gap to the previous position > 50) get ids 2, 3, ...
fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N + 1)]
# every other row gets a placeholder of 1
fulltable[is.na(new_group), new_group := 1]
# cummax() carries each group id forward; the helper column is then dropped
fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]
I have a follow up to this question.
I am creating a data.frame conditional on the column names and specific row entries of an existing data.frame. Below is how I resolved it using a for loop (thanks to #Roland's suggestion... the real data violated the requirements of #eddi's answer), but it has been running on the actual data set (200 × 500,000+ rows × cols) for more than two hours now...
(The following generated data.frames are very similar to the actual data.)
set.seed(1)
a <- data.frame(year=c(1986:1990),
events=round(runif(5,0,5),digits=2))
b <- data.frame(year=c(rep(1986:1990,each=2,length.out=40),1986:1990),
region=c(rep(c("x","y"),10),rep(c("y","z"),10),rep("y",5)),
state=c(rep(c("NY","PA","NC","FL"),each=10),rep("AL",5)),
events=round(runif(45,0,5),digits=2))
d <- matrix(rbinom(200,1,0.5),10,20, dimnames=list(c(1:10), rep(1986:1990,each=4)))
e <- data.frame(id=sprintf("%02d",1:10), as.data.frame(d),
region=c("x","y","x","z","z","y","y","z","y","y"),
state=c("PA","AL","NY","NC","NC","NC","FL","FL","AL","AL"))
for (i in seq_len(nrow(d))) {
  for (j in seq_len(ncol(d))) {
    d[i,j] <- ifelse(d[i,j]==0,
                     a$events[a$year==colnames(d)[j]],
                     b$events[b$year==colnames(d)[j] &
                                b$state==e$state[i] &
                                b$region==e$region[i]])
  }
}
Is there a better/faster way to do this?
A simpler way to do it (I think - it does not involve melting, dcasting and merging) is as follows:
First, your a and b tables should be indexed by year (for a) and by year/state/region (for b):
at = a$events; names(at) = a$year
bt = tapply(b$events, list(b$year, b$state, b$region), function(x) min(x))
# note: I used min(x) in tapply just to be on the safe side, so that the function always returns a scalar
# we now create the result of the more complex case (lookup in b)
ids = cbind(colnames(d)[col(d)],
as.character(e$state[row(d)]),
as.character(e$region[row(d)])
)
vals=bt[ids]; dim(vals)=dim(d)
# and compute your desired result with the ifelse
result = ifelse(d==0,at[colnames(d)[col(d)]],vals)
# and that's it!
This should be faster (avoiding the nested loops), but I haven't profiled it. Let us know how it works for you on the full data.
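The key trick above is R's character matrix indexing: an n-by-3 character matrix selects one element of the 3-dimensional array bt per row, matched against its dimnames. A tiny self-contained illustration:
arr <- array(1:8, dim = c(2, 2, 2),
             dimnames = list(c("a", "b"), c("x", "y"), c("p", "q")))
arr[cbind("b", "x", "q")]  # same as arr["b", "x", "q"], i.e. 6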
# This will require a couple of merges,
# but first let's convert the data to long form and extract year as integer
# I convert result to data.table, since that's easier and faster to deal with
# Note: it *is* possible to do the melt/dcast entirely in data.table framework,
# but it's a hassle right now - there is a FR iirc about that
library(reshape2)
library(data.table)
dt = data.table(melt(e))[, year := as.integer(sub('X([0-9]*).*','\\1',variable))]
# set key for merging and merge with b and a
setkey(dt, year, region, state)
dt.result = data.table(a, key = 'year')[
data.table(b, key = c('year', 'region', 'state'))[dt]]
# now we can compute the value we want
dt.result[, final.value := value * events.1 + (!value) * events]
# dcast back
e.result = dcast(dt.result, id + region + state ~ variable,
value.var = 'final.value')