Suppose I have a data.table with columns "A", "C", "byvar", and sometimes "B". I want to summarize it by the variable 'byvar', but only include B if it is present (or conditional on some other criterion).
The following doesn't seem to work; does anyone have an idea?
dt[, .(
A=sum(A),
if("B" %in% names(dt)) {B=mean(B)},
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
Try B = ifelse("B" %in% names(dt), mean(B), NA). When B is absent it gives you a column of NAs, but the pattern is extensible to arbitrary criteria and column names.
dt<-data.table(A=runif(100,1,100), C=runif(100,1,100), byvar=rep(letters[1:10],10))
dt[, .(
A=sum(A),
B=ifelse("B"%in%names(dt),mean(B),NA),
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
Running this I get a 100-row result, because your D=sum(A)/C uses the original column C (ten values per group), not the new summary C, so each group expands to ten rows. If you change your definition of D to sum(A)/mean(C), it gives what you likely intended.
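For what it's worth, here is a runnable sketch of the corrected summary; the seed and the if/else form of the B check are choices made for illustration:

```r
library(data.table)

set.seed(1)
dt <- data.table(A = runif(100, 1, 100),
                 C = runif(100, 1, 100),
                 byvar = rep(letters[1:10], 10))

res <- dt[, .(
  A = sum(A),
  B = if ("B" %in% names(dt)) mean(B) else NA_real_,  # NA column when B is absent
  C = mean(C),
  D = sum(A) / mean(C)  # mean(C) keeps each group to a single row
), by = byvar]

nrow(res)  # one row per group
```

Because mean(C) is a scalar per group, D is too, and the result has exactly one row per byvar level.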
Edit:
Another way you can do this is to take advantage of the ability to use curly braces in the J expression
dt[, {
  checkcol <- 'B'
  prelimreturn <- list(A = sum(A),
                       C = mean(C),
                       D = sum(A)/mean(C))
  if (checkcol %in% names(dt)) prelimreturn[[checkcol]] <- mean(get(checkcol))
  prelimreturn
}, by = .(byvar)]
Here I set a helper variable called checkcol so that we're not hard-coding "B" in two places. Next we build a preliminary result with the columns you know you want. After that, we check whether the column named in checkcol exists, and if it does, we add it to our pre-existing list. The last line inside the curly braces is what data.table returns: our prelimreturn list, which may or may not have a "B" column. You could extend this approach pretty broadly, too.
You can try
dt[, lapply(.SD, sum), by = byvar, .SDcols = patterns("A|B|C")]
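As a small runnable sketch (the toy dt is made up, and patterns() in .SDcols needs data.table >= 1.12.0), note that the regex quietly skips a missing B column:

```r
library(data.table)

dt <- data.table(A = 1:6, C = 7:12, byvar = rep(c("x", "y"), each = 3))

# .SDcols = patterns() keeps only the columns whose names match the regex,
# so a missing "B" is simply skipped rather than causing an error
res <- dt[, lapply(.SD, sum), by = byvar, .SDcols = patterns("^[ABC]$")]
res
```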
Related
I am trying to subset a data.table by two conditions stored in the same data.table, that are found by a single key.
In practice, I am trying to merge overlapping ranges.
I know how to do this:
dt[, max := max(localRange), by=someGroup]
However, I want to use a range as selectors in i. So something like:
dt[range > min(localRange) & range < max(localRange),
max := max(localRange),
by = someGroup]
where range and localRange are the same column, just range is outside of the scope of .SD.
Or something like:
dt[col2 > dt[,min(col2),by = col1] & col2 < dt[,max(col2),by = col1],
col2 := max(col2)]
where the two by='s synchronise/share the same col1 value
I have tried it with a for loop using set(), iterating over a list of the min and max ranges as conditions on the data.table. I made the list using split() on a data.table:
for (range in split(
  dt[,
     list(min = min(rightBound), max = max(rightBound)),
     by = leftBound
  ],
  f = 1:nrow(dt[, .GRP, by = leftBound])
)) {
  set(
    x = dt,
    i = dt[rightBound >= range$min & rightBound <= range$max],
    j = range$max
  )
}
It all became a mess (even errors), though I assume this could be a (syntactically) fairly straightforward operation. Moreover, this is only a case with a single step: getting the conditions associated with the by= group.
What if I would want to adjust values based on a series of transformations on a value in by= based on data in the data.table outside of .SD? For example: "by each start, select the range of ends, and based on that range find a range of starts", etc.
Here it does not matter at all that we are talking about ranges, as I think this is generally useful functionality.
In case anybody is wondering about the practical case, user971102 provides fine sample data for a simple case:
my.df <- data.frame(name = c("a","b","c","d","e","f","g"),
                    leftBound = as.numeric(c(0, 70001, 70203, 70060, 40004, 50000872, 50000872)),
                    rightBound = as.numeric(c(71200, 71200, 80001, 71051, 42004, 50000890, 51000952)))
dt = as.data.table(my.df)
name leftBound rightBound
a 0 71200
b 70001 71200
c 70203 80001
d 70060 71051
e 40004 42004
f 50000872 50000890
g 50000872 51000952
Edit:
The IRanges package is going to solve my practical problem. However, I am still very curious to learn a possible solution to the more abstract case of 'chaining' selectors in data.tables
Thanks a bunch, Jeremycg and AGstudy. Though it wasn't the findOverlaps() function but rather the reduce() and disjoin() functions.
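For completeness, overlapping ranges can also be merged in plain data.table with a cumulative-maximum trick. This is only a sketch of that idea (not the asker's code, and note it also merges intervals that merely touch at an endpoint):

```r
library(data.table)

dt <- data.table(name = c("a","b","c","d","e","f","g"),
                 leftBound  = c(0, 70001, 70203, 70060, 40004, 50000872, 50000872),
                 rightBound = c(71200, 71200, 80001, 71051, 42004, 50000890, 51000952))

setorder(dt, leftBound)
# a new group starts whenever an interval begins after the running maximum end
dt[, grp := cumsum(leftBound > shift(cummax(rightBound), fill = -Inf))]
merged <- dt[, .(leftBound = min(leftBound), rightBound = max(rightBound)), by = grp]
merged
```

On the sample data this collapses the seven intervals into two merged ranges, [0, 80001] and [50000872, 51000952], which is the same reduction IRanges::reduce() performs.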
I'm happy to find data.table has its new release, and got one question about J(). From data.table NEWS 1.9.2:
x[J(2), a], where a is the key column sees a in j, #2693 and FAQ 2.8. Also, x[J(2)] automatically names the columns from i using the key columns of x. In cases where the key columns of x and i are identical, i's columns can be referred to by using i.name; e.g., x[J(2), i.a]
There are several questions about J() on Stack Overflow, and the introduction to data.table also talks about the binary search behind J(). But my understanding of J() is still not very clear.
All I know is that if I want to select rows where column A is "b" and column B is "d":
DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)
DT2[J("b", "d")]
and if I want to select the rows where A = "a" or "c", I code like this
DT2[A == "a" | A == "c"]
much like the data.frame way. (Minor question: how would I select these in a more data.table way?)
So to my understanding, J() is only used in cases like the above: selecting a single value from each of two key columns. I hope my understanding is wrong.
There are few documents about J(). I read How is J() function implemented in data.table?: J(.) is detected and simply replaced with list(.). It seems that list(.) can replace J(.) in every case.
And back to the question: what is the purpose of this new feature, x[J(2), a]?
I'd really appreciate a detailed explanation!
.() and J(), as functions wrapping the i argument of data.table, are simply replaced by list(), because [.data.table does some programming on the language of the i and j arguments to optimize how things are done internally. Each can be thought of as an alias for list.
The reason they are included is to save time and effort (three keystrokes!).
If I wanted to select key values 'a' or 'c' from the first column of a key I could do
DT[.(c('a','c'))]
# or
DT[J(c('a','c'))]
# or
DT[list(c('a','c'))]
If I wanted A = 'a' or 'c' and B = 'd', then I could use
DT[.(c('a','c'),'d')]
If I wanted A = 'a' or 'c' and B = 'd' or 'e' then I would use CJ (or expand.grid) to create all combinations
DT[CJ(c('a','c'),c('d','e'))]
The help for J, SJ and CJ is quite well written! See also the vignette Keys and fast binary search based subset.
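Putting this together with the DT2 from the question, a runnable sketch (nomatch = 0L is added here so combinations that don't exist in the table are dropped rather than returned as NA rows):

```r
library(data.table)

DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)

# binary-search subset on the first key column: rows where A is "a" or "c"
res1 <- DT2[.(c("a", "c"))]

# all combinations of A in {"a","c"} and B in {"d","e"} via CJ;
# only the pair ("c","e") exists in DT2
res2 <- DT2[CJ(c("a", "c"), c("d", "e")), nomatch = 0L]
```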
I can select a few columns from a data.frame:
> z[c("events","users")]
events users
1 26246016 201816
2 942767 158793
3 29211295 137205
4 30797086 124314
but not from a data.table:
> best[c("events","users")]
Starting binary search ...Error in `[.data.table`(best, c("events", "users")) :
typeof x.pixel_id (integer) != typeof i.V1 (character)
Calls: [ -> [.data.table
What do I do?
Is there a better way than to turn the data.table back into a data.frame?
Given that you're looking to get a data.table back, you should use list rather than c in the j part of the call.
z[, list(events,users)] # first comma is important
Note that you don't need the quotes around the column names.
Column subsetting should be done in j, not in i. Do instead:
DT[, c("x", "y")]
Check this presentation (slide 4) to get an idea of how to read a data.table syntax (more like SQL). That'll help convince you that it makes more sense for providing columns in j - equivalent of SELECT in SQL.
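If you do need to select by a character vector of names (rather than bare symbols), a sketch of the usual options; the toy table is made up, and the .. prefix requires data.table >= 1.10.2:

```r
library(data.table)

best <- data.table(events = c(26246016, 942767), users = c(201816, 158793), other = 1:2)
cols <- c("events", "users")

r1 <- best[, c("events", "users")]   # a literal character vector in j selects columns
r2 <- best[, ..cols]                 # ..cols looks cols up in the calling scope
r3 <- best[, cols, with = FALSE]     # older equivalent of the same thing
```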
Hi, I'm still trying to figure out data.table. If I have a data.table of values such as those below, what is the most efficient way to replace the values with those from another data.table?
set.seed(123456)
a=data.table(
date_id = rep(seq(as.Date('2013-01-01'),as.Date('2013-04-10'),'days'),5),
px =rnorm(500,mean=50,sd=5),
vol=rnorm(500,mean=500000,sd=150000),
id=rep(letters[1:5],each=100)
)
b=data.table(
date_id=rep(seq(as.Date('2013-01-01'),length.out=600,by='days'),5),
id=rep(letters[1:5],each=600),
px=NA_real_,
vol=NA_real_
)
setkeyv(a,c('date_id','id'))
setkeyv(b,c('date_id','id'))
What I'm trying to do is replace the px and vol in b with those in a where date_id and id match. I'm a little flummoxed by this - I would suppose that something along the following lines might be the way to go, but I don't think it will work in practice.
b[which(b$date_id %in% a$date_id & b$id %in% a$id),list(px:=a$px,vol:=a$vol)]
EDIT
I tried the following
t = a[b,roll=T]
t[!is.na(px),list(px.1:=px,vol.1=vol),by=list(date_id,id)]
and got the error message
Error in `:=`(px.1, px) :
:= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(":="). Check is.data.table(DT) is TRUE.
If you want to replace the values within b, you can use the prefix i.. From the NEWS regarding version 1.7.10:
The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
b[a, `:=`(px = i.px, vol = i.vol)]
From your description it doesn't sound like you need the roll, and it seems you wanted to do this instead where you got your error:
t[!is.na(px),`:=`(px.1=px,vol.1=vol),by=list(date_id,id)]
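A minimal, self-contained sketch of the update join itself (small made-up tables; on= is used here in place of pre-set keys, which data.table has supported since 1.9.6):

```r
library(data.table)

a <- data.table(date_id = as.Date("2013-01-01") + 0:2, id = "a",
                px = c(50.1, 49.8, 51.2), vol = c(5e5, 4e5, 6e5))
b <- data.table(date_id = as.Date("2013-01-01") + 0:5, id = "a",
                px = NA_real_, vol = NA_real_)

# update join: for rows of b matched by a (on date_id and id),
# overwrite b's px/vol by reference with a's values (the i. prefix)
b[a, on = .(date_id, id), `:=`(px = i.px, vol = i.vol)]
```

Rows of b with no match in a keep their NAs; everything happens by reference, so no copy of b is made.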
I have this code:
dat<-dat[,list(colA,colB
,RelativeIncome=Income/.SD[Nation=="America",Income]
,RelativeIncomeLog2=log2(Income)-log2(.SD[Nation=="America",Income])) #Read 1)
,by=list(Name,Nation)]
1) I would like to be able to say "RelativeIncomeLog2=log2(RelativeIncome)", but "RelativeIncome" is not available in j's scope?
2) I tried the following instead (per the data.table FAQ). Now "RelativeIncome" is available but it doesn't add the columns:
dat<-dat[,{colA;colB
  RelativeIncome=Income/.SD[Nation=="America",Income]
  RelativeIncomeLog2=log2(RelativeIncome)}
  ,by=list(Name,Nation)]
You can create and assign objects in j, just use { curly braces }.
You can then pass these objects (or functions & calculations of the objects) out of j and assign them as columns of the data.table. To assign more than one column at a time, simply:
wrap the LHS in c(.), making sure the column names are strings, and
ensure the last line of j (i.e., the "return" value) is a list.
dat[ , c("NewIncomeColumn", "AnotherNewColumn") := {
  RelativeIncome <- Income/.SD[Nation == "A", Income];
  RelativeIncomeLog2 <- log2(RelativeIncome);
  ## this last line is what will be assigned
  list(RelativeIncomeLog2 * 100, c("A", "hello", "World"))
  # assigned values are recycled as needed;
  # if the recycling does not match up, a warning is issued
}
, by = list(Name, Nation)
]
You can loosely think of j as a function within the environment of dat.
You can also get a lot more sophisticated and complex if required. You can incorporate by arguments as well, using by=list(<someName>=col).
In fact, much as in a function, simply creating an object in j and assigning it a value does not mean that it will be available outside of j. For it to be assigned to your data.table, you must return it. j automatically returns its last line; if that last line is a list, each element of the list will be treated as a column. If you are assigning by reference (i.e., using :=), then you will get the results you are expecting.
On a separate note, I noticed the following in your code:
Income / .SD[Nation == "America", Income]
# Which instead could simply be:
Income / Income[Nation == "America"]
.SD is a wonderful shorthand. However, invoking it when you don't need all of the columns it encapsulates burdens your code with extra memory costs. If you are using only a single column, consider naming that column explicitly, or add the .SDcols argument (after j) and name only the columns needed there.
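A small sketch of the difference (made-up data; both forms give the same numbers, but the .SDcols version materializes only one column in .SD):

```r
library(data.table)

dat <- data.table(Nation = c("A", "A", "B"),
                  Income = c(10, 30, 20),
                  Other  = c(1.5, 2.5, 3.5))

# .SD here contains only Income, not every non-by column
res <- dat[, lapply(.SD, mean), by = Nation, .SDcols = "Income"]
res
```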