Replacing all missing values in R data.table with a value

If you have an R data.table that has missing values, how do you replace all of them with say, the value 0? E.g.
aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]
V1 X V2
1: 1 NA 1
2: 2 NA 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 NA 4
8: 8 NA 4
9: 9 NA 4
10: 10 NA 4
Any way to do this in one line? If it were just a matrix, you could just do:
tt[is.na(tt)] = 0

is.na (being a primitive) has relatively little overhead and is usually quite fast. So, you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way of using data.table.
First I'll illustrate how to do it, and then show how slow this can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that 0 is being coerced to character to match the type of the column. You can ignore it.
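If you'd rather avoid the coercion warning, you can pick a replacement value matching each column's type; a minimal sketch:
for (j in seq_along(tt)) {
  # choose a type-appropriate zero: "0" for character, 0L for integer, 0 otherwise
  repl <- switch(class(tt[[j]])[1], character = "0", integer = 0L, 0)
  set(tt, i = which(is.na(tt[[j]])), j = j, value = repl)
}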
Why shouldn't you use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187

Nothing unusual here:
tt[is.na(tt)] = 0
...will work.
This is somewhat confusing, however, given that:
tt[is.na(tt)]
...currently returns:
Error in [.data.table(tt, is.na(tt)) : i is invalid type
(matrix). Perhaps in future a 2 column matrix could return a list of
elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
datatable-help know if you'd like this, or add your comments to FR #1611.

I would make use of data.table and lapply, namely:
tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]
yielding:
V1 X V2
1: 1 -666 1
2: 2 -666 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 -666 4
8: 8 -666 4
9: 9 -666 4
10: 10 -666 4
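Note that this builds and returns a new data.table (with -666 coerced to "-666" in the character column). To overwrite tt by reference with the same logic, a sketch:
tt[, names(tt) := lapply(.SD, function(col) ifelse(is.na(col), -666, col))]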

The specific problem the OP posted could also be solved by:
tt[is.na(X), X := 0]

How to perform a "serial join" in data.table?

I have two data.tables: an experiment data table x and a category lookup table dict.
library(data.table)
set.seed(123)
x = data.table(samp=c(1,1,2,3,3,3,4,5,5,5,6,7,7,7,8,9,9,10,10), y=rnorm(19))
x
samp y
#1: 1 -0.56047565
#2: 1 -0.23017749
#3: 2 1.55870831
#4: 3 0.07050839
#5: 3 0.12928774
#6: 3 1.71506499
#7: 4 0.46091621
#8: 5 -1.26506123
#9: 5 -0.68685285
#10: 5 -0.44566197
#11: 6 1.22408180
#12: 7 0.35981383
#13: 7 0.40077145
#14: 7 0.11068272
#15: 8 -0.55584113
#16: 9 1.78691314
#17: 9 0.49785048
#18: 10 -1.96661716
#19: 10 0.70135590
dict = data.table(samp=c(1:5, 4:8, 7:10), cat=c(rep(1,length(1:5)), rep(2,length(4:8)), rep(3,length(7:10))))
dict
# samp cat
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 4 2
# 7: 5 2
# 8: 6 2
# 9: 7 2
# 10: 8 2
# 11: 7 3
# 12: 8 3
# 13: 9 3
# 14: 10 3
For each samp, I need to first compute the product of all y's associated with it. I then need to compute the sum of these products per each sample category specified in dict$cat. Note that each samp maps to more than one dict$cat.
One way of doing this is to merge x and dict right away, allowing row duplication (allow.cartesian=T):
setkey(dict, samp)
setkey(x, samp)
step0 = dict[x, allow.cartesian=T]
setkey(step0, samp, cat)
step1 = step0[, list(prodY=prod(y)[1], cat=cat[1]), by=c("samp", "cat")]
resMet1 = step1[, sum(prodY), by="cat"]
I wonder, however, whether this joining step can be avoided. There are a few reasons for this - for example, if x is enormous, duplication will use extra memory (am I right?). Also, these summary tables with duplicated rows are quite confusing, making the analysis more error-prone.
So instead I was thinking of using samples in each dict$cat for a binary search in x. I know how to do it for a single category, so an ugly way of doing it for all of them would be with a loop:
setkey(x, samp)
setkey(dict,samp)
pool = vector("list")
for(n in unique(dict$cat)){
thisCat = x[J(dict[cat==n])]
setkey(thisCat, samp)
step1 = thisCat[, list(prodY=prod(y)[1], cat=cat[1]), by="samp"]
pool[[n]] = step1[, sum(prodY), by="cat"]
}
resMet2 = rbindlist(pool)
But of course such loops are to be avoided. So I'm wondering if there's any way to somehow get data.table to iterate over the key values inside of J()?
IIUC, I'd formulate your question as follows: for each dict$cat, I'd like to get prod(y) corresponding to each sample for that cat, and then sum them all up.
Let's construct this step by step now:
For each dict$cat - sounds like you need to group by cat:
dict[, ,by=cat]
All that's left is to fill up j properly.
You need to get prod(y) from x for each sample for this group:
x[samp %in% .SD$samp, prod(y), by=samp]
This extracts those rows from x corresponding to this group's samp (using .SD, which stands for Subset of Data) and computes prod(y) on them, grouped by samp. Great!
We still need to sum them.
sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1)
We now have the complete j expression. Let's plug it all in:
dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
# cat V1
# 1: 1 1.7770272
# 2: 2 0.7578771
# 3: 3 -1.0295633
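As a quick sanity check, this agrees with the merge-based resMet1 from the question (all.equal rather than identical, to allow for attribute differences):
res1 <- dict[, sum(x[samp %in% .SD$samp, prod(y), by=samp]$V1), by=cat]
all.equal(res1, resMet1) # should be TRUE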
Hope this helps.
Note 1: There's some redundant computation of prod(y) here, but the upside is that we don't materialise much intermediate data, so it's memory efficient. If you have too many groups, this might get slower, and you might want to compute prod(y) in another variable like so:
x_p = x[, .(p = prod(y)), by=samp]
With this, we can simplify j as follows:
dict[, x_p[samp %in% .SD$samp, sum(p)], by=cat]
Note 2: The %in% expression creates an auto index on the first run on x's samp column, so that binary-search-based subsetting is used from then on. Therefore there's no need to worry about performance due to vector scans.
You might as well collapse x to the samp level first.
xprod = x[, .(py = prod(y)), by=samp]
Merge
res2 <- xprod[dict, on = "samp"][, sum(py), by=cat]
identical(res2, resMet2) # test passed
Or subset
If samp is the row number in xprod (as here), you can subset instead of merging:
res3 <- xprod[(dict$samp), sum(py), by=.(cat=dict$cat)]
identical(res3, resMet2) # test passed
It's very simple to relabel sample IDs so that this is true.
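If relabeling isn't convenient, a join with which=TRUE returns the matching row positions directly; a sketch (idx and res4 are illustrative names):
idx <- xprod[dict, on = "samp", which = TRUE] # row of xprod for each dict$samp
res4 <- xprod[idx, sum(py), by = .(cat = dict$cat)]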

Don't resort data.table rows

I am learning data.table, so I'm very new to its syntax. I am trying to use the package as a hash lookup and it works well, except that, because of my ignorance of the syntax, it reorders the rows. I want it not to reorder the rows, without sacrificing speed (i.e., I want the efficient way to accomplish this). Here is an example and desired output:
library(data.table)
(key <- setNames(aggregate(mpg~as.character(carb), mtcars, mean), c("x", "y")))
set.seed(10)
terms <- data.frame(x = c(9, 12, sample(key[, 1], 6, TRUE)), stringsAsFactors = FALSE)
## > terms$x
## [1] "9" "12" "4" "2" "3" "6" "1" "2"
setDT(key)
setDT(terms)
setkey(key, x)
setkey(terms, x)
terms[key, out := i.y]
terms
This gives:
## x out
## 1: 1 25.34286
## 2: 12 NA
## 3: 2 22.40000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 4 15.79000
## 7: 6 19.70000
## 8: 9 NA
I want:
## x out
## 1: 9 NA
## 2: 12 NA
## 3: 4 15.79000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 6 19.70000
## 7: 1 25.34286
## 8: 2 22.40000
In data.table, a join x[i] has to have a key set for x, but it's not essential for the key to be set for i.
NOTE: But if you don't set the key for i,
1) Ensure that the columns of i are in the same order as the key columns of x (reorder if necessary, using setcolorder), as it doesn't join by checking for names (yet).
2) It could be a tad slower (but not by much in my benchmarks).
The issue therefore is that, if you just want to do a x[i] join without any additional preprocessing, then terms has to take the place of i with no key set in order to get the results in the order you require.
With this in mind, we can approach this in two ways (that I could think of).
First method:
This one requires no additional preprocessing. That is, we treat key as the x mentioned above, meaning its key has to be set. We don't set a key for terms.
setkey(key, x)
The first column of terms is also named x and that's the column we want to join with. So, no reordering needed here.
ans = key[terms]
> ans
# x y
# 1: 9 NA
# 2: 12 NA
# 3: 4 15.79000
# 4: 2 22.40000
# 5: 3 16.30000
# 6: 6 19.70000
# 7: 1 25.34286
# 8: 2 22.40000
The difference is that this creates an entirely new data.table rather than assigning the column by reference.
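If you also want the column named out, as in the desired output, rename it by reference:
setnames(ans, "y", "out") # matches the desired column name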
Second method:
We do a little extra preprocessing - addition of an extra column N to terms, by reference, which runs from 1:nrow(terms). This basically helps us to rearrange the data back in the order required, after the join. Here, we'll consider terms as x.
terms[, N := 1:.N]
setkey(terms, x)
It doesn't matter whether key has its x column set as key. But again, ensure that x is the first column in key if its key isn't set. In my case, I'll set the key column of key to x.
setkey(key, x)
setkey(terms[key, out := i.y], N)
> terms
# x N out
# 1: 9 1 NA
# 2: 12 2 NA
# 3: 4 3 15.79000
# 4: 2 4 22.40000
# 5: 3 5 16.30000
# 6: 6 6 19.70000
# 7: 1 7 25.34286
# 8: 2 8 22.40000
Personally, since you require terms unsorted, I'd go with the first method here. But feel free to benchmark on your real data dimensions and choose whichever suits your needs best.
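For completeness: in data.table v1.9.6+ you can join with on= instead of setting keys at all; this assigns by reference and leaves the row order of terms untouched. A sketch:
terms[key, out := i.y, on = "x"] # keyless join; terms keeps its original order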

Compare a value in two vectors and assign the compared results to a new vector in R

I have a vector to append to, and here is the code, which is pretty slow because nrow is big.
All I want is to speed it up. I have tried c() and append(), and neither seems fast enough.
And I checked Efficiently adding or removing elements to a vector or list in R?
Here is the code:
compare <- vector()
for (i in 1:nrow(domin)) {
  for (j in 1:nrow(domin)) {
    a = 0
    if ((domin[i, ]$GPA > domin[j, ]$GPA) & (domin[i, ]$SAT > domin[j, ]$SAT)) {
      a = 1
    }
    compare <- c(compare, a)
  }
  print(i)
}
I found it hard to figure out the index for compare if I use preallocation:
#compare<-rep(0,times=nrow(opt_predict)*nrow(opt_predict))
The information you want would be better placed in a matrix:
v1 <- 1:3
v2 <- c(1,2,2)
mat1 <- outer(v1,v1,`>`)
mat2 <- outer(v2,v2,`>`)
both <- mat1 & mat2
To see which positions the inequality holds for, use which:
which(both,arr.ind=TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
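Applied to the OP's data and flattened back into the same i-major vector the nested loop produced (a sketch, assuming domin has numeric columns GPA and SAT as in the question):
both <- outer(domin$GPA, domin$GPA, `>`) & outer(domin$SAT, domin$SAT, `>`)
compare <- as.integer(t(both)) # t() gives the same i-major order as the loop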
Comments:
This answer should be a lot faster than your loop. However, you are really just sorting two vectors, so there is probably a faster way to do this than taking the exhaustive set of inequalities...
In your case, there is only a partial ordering (since, for a given i and j, it is possible that neither one is strictly greater than the other in both dimensions). If you were satisfied with sorting first on v1 and then on v2, you could use the data.table package to easily get a full ordering:
set.seed(1)
v1 <- sample.int(10,replace=TRUE)
v2 <- sample.int(10,replace=TRUE)
require(data.table)
DT <- data.table(v1,v2)
setkey(DT)
DT[,rank:=.GRP,by='v1,v2']
which gives
v1 v2 rank
1: 1 8 1
2: 3 3 2
3: 3 8 3
4: 4 2 4
5: 6 7 5
6: 7 4 6
7: 7 10 7
8: 9 5 8
9: 10 4 9
10: 10 8 10
It depends on what you were planning to do next.

Using .BY with a lookup table--unexpected results

I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by=. here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
The OP said that this creates a new data.table, which is undesirable for them.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
To explain why the unexpected result happens:
In the first case, you do dt[, b := k[.BY, list(b)], by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. To answer why you get what you get when you do b := k[.BY, list(b)]: since the RHS returns a data.table and you're assigning it to a column, it takes the first element and drops the rest. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)].
In the first case, k[6], you are accessing the 6th row of k, which is 10 d. But in the second case, you're asking for a J, a join. So it searches for x = 6 (the key column), and since there isn't one in k, it returns 6 NA. In your case, since you use k[.BY], which returns a list, it is a join operation, which fetches the right value.
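To make the distinction concrete, using the k from the question:
k[6]        # positional: the 6th row of k  -> x=10, b="d"
k[list(6)]  # join on the key column x == 6 -> x=6, b=NA (no x == 6 in k)
k[J(6)]     # J() is an alias for list() in i; the same join as above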
I hope this helps.

Rbind with new columns and data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.
Since v1.9.2, data.table's rbind function has gained a fill argument. From the ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
Here is an approach that will update the missing columns in each of the two tables before stacking them:
rbind.missing <- function(A, B) {
  cols.A <- names(A)
  cols.B <- names(B)
  missing.A <- setdiff(cols.B, cols.A)
  # check for and define missing columns in A
  if (length(missing.A) > 0L) {
    # .. means "look up one level"
    class.missing.A <- lapply(B[, ..missing.A], class)
    nas.A <- lapply(class.missing.A, as, object = NA)
    A[, c(missing.A) := nas.A]
  }
  # check for and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if (length(missing.B) > 0L) {
    class.missing.B <- lapply(A[, ..missing.B], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[, c(missing.B) := nas.B]
  }
  # reorder so the columns are the same
  setcolorder(B, names(A))
  rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many or large data.tables, as it only works on two at a time.
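You could fold the pairwise helper over a list of tables with Reduce(); note also that recent data.table versions let rbindlist(..., fill=TRUE) do all of this directly:
combined <- Reduce(rbind.missing, list(dt.1, dt.2)) # pairwise fold over a list
# combined <- rbindlist(list(dt.1, dt.2), fill = TRUE) # built-in equivalent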
The answers are awesome, but it looks like there are some existing functions for this, such as plyr::rbind.fill and gtools::smartbind, which seemed to work perfectly for me.
The basic concept is to add missing columns in both directions: from the running master table to the newTable, and back the other way.
As @mnel pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.
One solution is to force all columns to a single type (i.e. as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) (cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
  # Append newTable to master
  # assign to master
  #-----------------#
  # identify missing columns
  colMisng <- setdiff(names(newTable), names(master))
  # if there are no columns missing, move on to the next part
  if (!identical(colMisng, character(0))) {
    # identify the class of each
    colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
    # assign to each missing column a value of NA with the appropriate class
    master[, eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
  }
  # assign to newTable
  #-----------------#
  # identify missing columns
  colMisng <- setdiff(names(master), names(newTable))
  # if there are no columns missing, move on to the next part
  if (!identical(colMisng, character(0))) {
    # identify the class of each
    colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
    # assign to each missing column a value of NA with the appropriate class
    newTable[, eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
  }
  # reorder the columns to avoid a warning about ordering
  #-----------------#
  setcolorder(newTable, names(master))
  # rbind them!
  #-----------------#
  rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE
Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0], which has the structure of the second data.table. This avoids the possibility of introducing bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A (note the use of A2[0]):
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (this is possibly not needed; I'm not sure whether rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand (see the sketch after the output below)...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnel's output (except for the random numbers and the column order).
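Following that suggestion, the repeated merge-with-an-empty-table block could be wrapped up like so (a sketch; merge_fill is a hypothetical name, applied to the tables as originally defined):
merge_fill <- function(master, newDT) {
  # add newDT's missing columns to master, then master's missing columns to newDT
  master <- merge(x = master, y = newDT[0],
                  by = intersect(names(master), names(newDT)), all = TRUE)
  newDT <- merge(x = master[0], y = newDT,
                 by = intersect(names(master), names(newDT)), all = TRUE)
  setcolorder(newDT, names(master))
  rbindlist(l = list(master, newDT))
}
DT.ALL2 <- Reduce(merge_fill, list(A, A2, B, C))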
PS1: The original author does not say what to do if there are matching variables -- do we really want an rbind(), or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.
