R data.table: replacing an index of values from another data.table

Hi, I'm still trying to figure out data.table. If I have a data.table of values such as those below, what is the most efficient way to replace the values with those from another data.table?
set.seed(123456)
a=data.table(
date_id = rep(seq(as.Date('2013-01-01'),as.Date('2013-04-10'),'days'),5),
px =rnorm(500,mean=50,sd=5),
vol=rnorm(500,mean=500000,sd=150000),
id=rep(letters[1:5],each=100)
)
b=data.table(
date_id=rep(seq(as.Date('2013-01-01'),length.out=600,by='days'),5),
id=rep(letters[1:5],each=600),
px=NA_real_,
vol=NA_real_
)
setkeyv(a,c('date_id','id'))
setkeyv(b,c('date_id','id'))
What I'm trying to do is replace the px and vol in b with those in a where date_id and id match. I'm a little flummoxed by this; I would suppose that something along the following lines might be the way to go, but I don't think it will work in practice.
b[which(b$date_id %in% a$date_id & b$id %in% a$id),list(px:=a$px,vol:=a$vol)]
EDIT
I tried the following
t = a[b,roll=T]
t[!is.na(px),list(px.1:=px,vol.1=vol),by=list(date_id,id)]
and got the error message
Error in `:=`(px.1, px) :
:= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(":="). Check is.data.table(DT) is TRUE.

If you want to replace the values within b, you can use the prefix i.. From the NEWS regarding version 1.7.10:
The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
b[a, `:=`(px = i.px, vol = i.vol)]

From your description it doesn't sound like you need the roll, and where you hit the error it seems you want this instead:
t[!is.na(px),`:=`(px.1=px,vol.1=vol),by=list(date_id,id)]
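To make the update join concrete, here is a minimal sketch on much smaller tables than in the question. It uses the `on=` form, which is assumed to be available (reasonably recent data.table versions accept it without setting keys first):

```r
library(data.table)

# small stand-ins for the a and b tables from the question
a <- data.table(date_id = as.Date("2013-01-01") + 0:4,
                id      = "a",
                px      = c(50.1, 49.2, 51.3, 48.7, 50.9),
                vol     = c(5e5, 4.8e5, 5.2e5, 4.9e5, 5.1e5))
b <- data.table(date_id = as.Date("2013-01-01") + 0:9,
                id      = "a",
                px      = NA_real_,
                vol     = NA_real_)

# update join: for rows of b that match a on (date_id, id),
# overwrite px and vol with the join-inherited i.px and i.vol
b[a, on = c("date_id", "id"), `:=`(px = i.px, vol = i.vol)]
```

After the join, the first five rows of b carry a's values and the remaining rows stay NA, since no row of a matches them.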

data.table size and datatable.alloccol option

The dataset I am working on is not very big, but it is quite wide. It currently has 10,854 columns and I would like to add approximately another 10-11k columns. It has only 760 rows.
When I try to add them (applying functions to a subset of the existing columns), I get the following:
Warning message:
In `[.data.table`(setDT(Final), , `:=`(c(paste0(vars, ".xy_diff"), :
truelength (30854) is greater than 10,000 items over-allocated (length = 10854). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo().
I have tried to play with setalloccol, but I get something similar. For example:
setalloccol(Final, 40960)
Error in `[.data.table`(x, i, , ) :
getOption('datatable.alloccol') should be a number, by default 1024. But its type is 'language'.
In addition: Warning message:
In setalloccol(Final, 40960) :
tl (51894) is greater than 10,000 items over-allocated (l = 21174). If you didn't set the datatable.alloccol option to be very large, please report to data.table issue tracker including the result of sessionInfo().
Is there a way to bypass this problem?
Thanks a lot
Edit:
to answer Roland's comment, here is what I am doing:
vars <- c(colnames(FinalTable_0)[271:290], colnames(FinalTable_0)[292:dim(FinalTable_0)[2]]) # <- variables I want to operate on
# FinalTable_0 is a previous table I use to collect the roots of the variables I want to work with
difference <- function(root) lapply(root, function(z) paste0("get('", z, ".x') - get('", z, ".y')"))
ratio <- function(root) lapply(root, function(z) paste0("get('", z, ".x') / get('", z, ".y')"))
# proceed to the computation
setDT(Final)[ , c(paste0(vars,".xy_diff"), paste0(vars,".xy_ratio")) := lapply(c(difference(vars), ratio(vars)), function(x) eval(parse(text = x)))]
I tried the solution proposed by Roland, but was not fully satisfied. It works, but I do not like the idea of transposing my data.
In the end, I just split the original data.table into several smaller ones, did the computations on each individually, and joined them back together at the end. Fast and simple, with no need to play with variables, declare which are ids and which are measures, or shape and reshape. I just prefer it this way.
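A sketch of that split/compute/rebind idea, on a toy wide table and with a made-up per-column transformation (squaring) standing in for the real diff/ratio computations:

```r
library(data.table)

# toy stand-in for a very wide table: 4 rows, 6 measurement columns
DT <- as.data.table(lapply(1:6, function(i) as.numeric(1:4)))
setnames(DT, paste0("m", 1:6))

# split the column names into chunks and compute derived columns per chunk,
# so no single data.table ever needs a huge over-allocation
chunks <- split(names(DT), ceiling(seq_along(names(DT)) / 3))
parts <- lapply(chunks, function(cols) {
  part <- DT[, ..cols]                       # copy of just this chunk
  part[, paste0(cols, ".sq") := lapply(.SD, function(z) z^2), .SDcols = cols]
  part
})

# bind the pieces back together column-wise
result <- do.call(cbind, unname(parts))
```

Each chunk doubles from 3 to 6 columns, so the rebuilt table ends up with 12 columns here.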

if else command in data.table to add column

Suppose I have a data.table with columns "A","C","byvar" and sometimes "B". I want to summarize it by a variable 'byvar', but only include B if it is present or conditional upon some other criteria.
The following doesn't seem to work, does someone have an idea?
dt[, .(
A=sum(A),
if("B" %in% names(dt)) {B=mean(B)},
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
Try B = ifelse("B" %in% names(dt), mean(B), NA). It will give you a column of NAs when B is missing, but it is extensible to arbitrary criteria and column names.
dt<-data.table(A=runif(100,1,100), C=runif(100,1,100), byvar=rep(letters[1:10],10))
dt[, .(
A=sum(A),
B=ifelse("B"%in%names(dt),mean(B),NA),
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
Running this I get a 100-row result because your D=sum(A)/C refers to the original column C, not the newly computed C, so each group expands to the length of the original C. If you change your definition of D to sum(A)/mean(C), it gives what you likely intended.
Edit:
Another way you can do this is to take advantage of the ability to use curly braces in the J expression
dt[, {
  checkcol <- "B"
  prelimreturn <- list(A = sum(A),
                       C = mean(C),
                       D = sum(A) / mean(C))
  if (checkcol %in% names(dt)) prelimreturn[[checkcol]] <- mean(get(checkcol))
  prelimreturn
}, by = .(byvar)]
Here I set a helper variable called checkcol so that "B" isn't written in two places. Next we build the preliminary result with the columns you know you want. After that we check whether the column named in checkcol exists, and if it does we add it to the pre-existing list. The last expression inside the curly braces is what data.table returns, which is our prelimreturn list, with or without a "B" column. You could extend this approach pretty broadly, too.
You can try
dt[, lapply(.SD, sum), by = byvar, .SDcols = patterns("A|B|C")]
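As a quick check of the patterns() form (assumed to require data.table >= 1.12, where patterns() is accepted in .SDcols):

```r
library(data.table)

dt <- data.table(A = 1:4, C = 5:8, byvar = c("x", "x", "y", "y"))

# sum every column whose name matches the pattern, per group;
# an absent B simply isn't matched, so no error is thrown
res <- dt[, lapply(.SD, sum), by = byvar, .SDcols = patterns("A|B|C")]
```

The result has one row per byvar group, with A and C summed and no B column, since dt has none.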

Subsetting columns works on data.frame but not on data.table

I can select a few columns from a data.frame:
> z[c("events","users")]
events users
1 26246016 201816
2 942767 158793
3 29211295 137205
4 30797086 124314
but not from a data.table:
> best[c("events","users")]
Starting binary search ...Error in `[.data.table`(best, c("events", "users")) :
typeof x.pixel_id (integer) != typeof i.V1 (character)
Calls: [ -> [.data.table
What do I do?
Is there a better way than to turn the data.table back into a data.frame?
Given that you're looking for a data.table back you should use list rather than c in the j part of the call.
z[, list(events,users)] # first comma is important
Note that you don't need the quotes around the column names.
Column subsetting should be done in j, not in i. Do instead:
DT[, c("x", "y")]
Check this presentation (slide 4) to get an idea of how to read data.table syntax (it reads more like SQL). That should help convince you that it makes more sense to provide columns in j, the equivalent of SELECT in SQL.
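The list and character-vector forms side by side. The bare character vector in j is assumed to work on reasonably recent data.table versions, with `with = FALSE` as the older spelling:

```r
library(data.table)

z <- data.table(events = c(26246016L, 942767L),
                users  = c(201816L, 158793L),
                extra  = 1:2)

r1 <- z[, list(events, users)]                 # list form, unquoted names
r2 <- z[, c("events", "users")]                # character vector in j
r3 <- z[, c("events", "users"), with = FALSE]  # older equivalent
```

All three return a two-column data.table; the error in the question came from putting the character vector in i, where data.table tried to join on it.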

Subset row and column at the same time

I'm a bit surprised by how data.table works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I've got to pass the key as a list, since a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the third subset expressions yield different results seems rather confusing to me. I'd like to get the third result, but would expect to be able to write it the second way.
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
This will work as expected, but it prevents binary search and requires a sequential (vector) scan instead (is there a plan to change this behavior?), making it potentially slower.
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).
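The double-subset workaround and the == form, side by side on the question's table (a sketch; the printed shape of intermediate results varies across data.table versions):

```r
library(data.table)

dt <- data.table(a = 11:20, b = 21:30, c = 31:40, key = "a")

v1 <- dt[list(12)][, b]   # binary search on the key, then extract the column
v2 <- dt[a == 12, b]      # == form, optimized via the key/index in v1.9.3+
```

Both extract the single b value for the row where a equals 12.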

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) behaves in a surprising way: it produces a result with the attributes and length taken from test and the values taken from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm, or maybe it is, since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
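The length pitfall is easy to demonstrate in base R, with no data.table involved:

```r
# ifelse() takes its length (and attributes) from the test argument,
# so a length-1 test silently truncates a vector result:
x <- ifelse(TRUE, 1:5, 0)

# plain if () ... else ... returns the chosen branch unchanged:
y <- if (TRUE) 1:5 else 0
```

Here x collapses to a single value (the first element of 1:5), while y keeps all five, which is why if/else is the right tool when the condition is a single TRUE or FALSE.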
