rbindlist data.tables with different numbers of columns

I am wondering how to rbindlist data.tables with different numbers of columns, filling the missing entries with NA, like rbind.fill does.
DT1 <- data.table(A = 1:3)
DT2 <- data.table(A =4:5, B = letters[4:5])
l <- list(DT1, DT2)
rbindlist(l)
# Error in rbindlist(l) :
# Item 2 has 2 columns, inconsistent with item 1 which has 1 columns
What I want to get is
A B
1: 1 NA
2: 2 NA
3: 3 NA
4: 4 d
5: 5 e

This feature is now implemented in commit 1266 of v1.9.3. From NEWS:
o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented
entirely in C. Closes #5249
-> use.names by default is FALSE for backwards compatibility (doesn't bind by
names by default)
-> rbind(...) now just calls rbindlist() internally, except that 'use.names'
is TRUE by default, for compatibility with base (and backwards compatibility).
-> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
-> At least one item of the input list has to have non-null column names.
-> Duplicate columns are bound in the order of occurrence, like base.
-> Attributes that might exist in individual items would be lost in the bound result.
-> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
-> And incredibly fast ;).
-> Documentation updated in much detail. Closes DR #5158.
Check this post for benchmarks.
Examples:
1) Using fill argument of rbindlist:
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=2, z=-1)
rbindlist(list(DT1, DT2), fill=TRUE)
# x y z
# 1: 1 2 NA
# 2: NA 2 -1
Note that when fill=TRUE, use.names must be TRUE.
2) Binding tables with duplicate names appropriately:
DT1 <- data.table(x=1, x=2, y=1, y=2)
DT2 <- data.table(y=3, y=-1, y=-2)
rbindlist(list(DT1, DT2), fill=TRUE)
# x x y y y
# 1: 1 2 1 2 NA
# 2: NA NA 3 -1 -2
3) It's not limited to just data.tables, but works on data.frames and lists as well:
DT1 <- data.table(x=1, y=2)
DT2 <- data.frame(y=2, z=-1)
DT3 <- list(z=10)
rbindlist(list(DT1,DT2,DT3), fill=TRUE)
# x y z
# 1: 1 2 NA
# 2: NA 2 -1
# 3: NA NA 10
4) If you would like to bind just by names, set use.names=TRUE and leave fill=FALSE:
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)
rbindlist(list(DT1,DT2), use.names=TRUE, fill=FALSE)
# x y
# 1: 1 2
# 2: 2 1
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(z=2, y=1)
# errors when fill=FALSE, because these tables cannot be bound without fill=TRUE
rbindlist(list(DT1, DT2), use.names=TRUE, fill=FALSE)
# Error in rbindlist(list(DT1, DT2), use.names = TRUE, fill = FALSE) :
# Answer requires 3 columns whereas one or more item(s) in the input
# list has only 2 columns. ...
5) The defaults remain as before for backwards compatibility (use.names=FALSE, fill=FALSE):
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)
rbindlist(list(DT1, DT2))
# x y
# 1: 1 2
# 2: 1 2
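As one more illustration (mine, not from the original answer) of the NEWS item about coercing columns to the highest SEXPTYPE:
DT1 <- data.table(a = 1L)      # integer column
DT2 <- data.table(a = 2.5)     # double column
rbindlist(list(DT1, DT2))      # 'a' is coerced to double
#      a
# 1: 1.0
# 2: 2.5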
HTH

Related

Binary search for integer64 in data.table

I have an integer64-indexed data.table object:
library(data.table)
library(bit64)
some_data = as.integer64(c(
  1514772184120000026, 1514772184120000068, 1514772184120000042,
  1514772184120000078, 1514772184120000011, 1514772184120000043,
  1514772184120000094, 1514772184120000085, 1514772184120000083,
  1514772184120000017, 1514772184120000013, 1514772184120000060,
  1514772184120000032, 1514772184120000059, 1514772184120000029))
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 values that I need to binary-search for within the index of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these were native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
These are wrong: it is not true that every entry of search_values is larger than all entries of x$new_col.
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
values_index has as many entries as search_values. For each entry of search_values, the corresponding entry in values_index gives the rank that entry would have if it were inserted into x$new_col. So the first entry of values_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
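For intuition on what findInterval computes with ordinary numeric input (a sketch of my own, not from the question):
findInterval(5, c(1, 3, 4, 7))   # three entries (1, 3, 4) are <= 5
# [1] 3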
Maybe you want something like this:
findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)]  # search values that are not in the data
  x2 <- copy(x)
  x2[, i := .I]  # mark the original rows
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = TRUE, fill = TRUE)  # add the missing search values
  setkey(x2, new_col)  # order by new_col
  x2[, index := cumsum(!is.na(i))]  # running count of original rows
  x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
Update:
Look at this integer example to see that this function gives the same result as base findInterval:
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I understand what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)]  # search values that are not in the data
x[, i := .I]  # mark the original rows
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = TRUE, fill = TRUE)  # add the missing search values
setkey(x, new_col)  # order by new_col
x[, index := new_col %in% search_values]  # mark where the values are
x[, index := cumsum(index)]  # turn the marks into indexes
x <- x[!is.na(i)]  # remove the added rows
x$index  # should contain your desired output
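Note that x[, i := .I] modifies x by reference, and setkey reorders it. If you need the original x intact, work on a copy first, as findInterval2 above does:
x2 <- copy(x)   # data.tables are modified by reference; work on x2 to keep x untouched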

How to append values of columns? [duplicate]

This question already has answers here:
paste two data.table columns (4 answers)
Closed 6 years ago.
For example there is the following data.table:
dt <- data.table(x = list(1:2, 3:5, 6:9), y = c(1,2,3))
# x y
# 1: 1,2 1
# 2: 3,4,5 2
# 3: 6,7,8,9 3
I need to create a new data.table, where values of the y column will be appended to lists stored in the x column:
# z
# 1: 1,2,1
# 2: 3,4,5,2
# 3: 6,7,8,9,3
I've tried the lapply, cbind, list, and c functions, but I can't get the table I need.
UPDATE:
The question is different from paste two data.table columns because a trivial solution with paste or the like doesn't work here.
This will do it:
# Merge two lists
dt[, z := mapply(c, x, y, SIMPLIFY=FALSE)]
print(dt)
x y z
1: 1,2 1 1,2,1
2: 3,4,5 2 3,4,5,2
3: 6,7,8,9 3 6,7,8,9,3
And deleting the original x and y columns
dt[, c("x", "y") := NULL]
print(dt)
z
1: 1,2,1
2: 3,4,5,2
3: 6,7,8,9,3
I would like to suggest a general approach for this kind of task, in case you have multiple columns that you would like to combine into a single one.
Example data with multiple columns:
dt <- data.table(x = list(1:2, 3:5, 6:9), y = 1:3, z = list(4:6, NULL, 5:8))
Solution
res <- melt(dt, measure.vars = names(dt))[, .(.(unlist(value))), by = rowid(variable)]
res$V1
# [[1]]
# [1] 1 2 1 4 5 6
#
# [[2]]
# [1] 3 4 5 2
#
# [[3]]
# [1] 6 7 8 9 3 5 6 7 8
The idea here is to convert to long format and then unlist/list by group.
(You will receive a warning due to the different classes in the resulting value column.)
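If that warning is expected in your data, one option (my suggestion, not the answerer's) is to wrap the call:
res <- suppressWarnings(
  melt(dt, measure.vars = names(dt))[, .(.(unlist(value))), by = rowid(variable)]
)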

Melt data.table according to nested list

I had a data.table like this:
library(data.table)
dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))
I needed to perform an operation (forecast) on the values for each a, so I decided to put them in a list, like this:
dt <- dt[, x := .(list(b)), by = a][, .SD[1,], by = a, .SDcols = "x"]
Now I wanted to "melt" (that's the thing that comes to mind) dt back into its original form.
I could do it for very few levels of a like this:
dt2 <- rbind(expand.grid(dt[1, a], dt[1, x[[1]]]), expand.grid(dt[2, a], dt[2, x[[1]]]))
but of course, the solution is impractical for more levels of a.
I've tried
dt2 <- dt[, expand.grid(a, x[[1]]), by = a]
which results in
dt2
## a Var1 Var2
## 1: A A 1
## 2: A A 3
## 3: A A 5
## 4: B A 2
## 5: B A 4
## 6: B A 6
It's interesting to note that Var1 doesn't actually follow the expected "A - B" pattern (but at least a remains).
Is there a better approach to achieve this?
EDITS
Expected output will be the result of
dt2[, .(a, Var2)]
Corrected "melt" for "dcast".
You are looking for a way to nest (convert a column from an atomic vector to a list column) and unnest (the opposite direction) in a data.table way. This is different from reshaping data, which either spreads a column's values into row headers (dcast) or gathers row headers into a column's values (melt):
In data.table syntax, you can use list and unlist on the target column to summarize or broadcast it along with group variables:
Say if we are starting from:
dt
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
To repeat what you have achieved in your first step, i.e. nest column b, you can do:
dt_nest <- dt[, .(b = list(b)), a]
dt_nest
# a b
# 1: A 1,3,5
# 2: B 2,4,6
To go the opposite direction, use unlist with the group variable:
dt_nest[, .(b = unlist(b)), a]
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
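Applied to the nested dt from the question (where the vectors live in the list column x), the same unlist idiom yields the expected output directly; a small sketch:
dt[, .(Var2 = unlist(x)), by = a]   # matches dt2[, .(a, Var2)] from the question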

trying to subset a data table in R by removing items that are in a 2nd table

I have two data frames (from a csv file) in R as such:
df1 <- data.frame(V1 = 1:9, V2 = LETTERS[1:9])
df2 <- data.frame(V1 = 1:3, V2 = LETTERS[1:3])
I convert both to data.table as follows:
dt1 <- data.table(df1, key="V1")
dt2 <- data.table(df2, key="V1")
I want to now return a table that looks like dt1 but without any rows where the key is found in dt2. So in this instance I would like to get back:
4 D
5 E
...
9 I
I'm using the following code in R:
dt3 <- dt1[!dt2$V1]
this works on this example; however, when I try it on a large data set (100k rows) it does not work. It only removes 2 rows, and I know it should be a lot more than that. Is there a limit to this type of operation, or something else I haven't considered?
Drop the column name "V1" to do a not-join. The tables are already keyed by V1.
dt3 <- dt1[!dt2]
Because the tables are keyed, you can do this with a "not-join"
dt1 <- data.table(rep(1:3,2), LETTERS[1:6], key="V1")
# V1 V2
# 1: 1 A
# 2: 1 D
# 3: 2 B
# 4: 2 E
# 5: 3 C
# 6: 3 F
dt2 <- data.table(1:2, letters[1:2], key="V1")
# V1 V2
# 1: 1 a
# 2: 2 b
dt1[!.(dt2$V1)]
# V1 V2
# 1: 3 C
# 2: 3 F
According to the documentation, . or J should not be necessary, since the ! alone is enough:
All types of i may be prefixed with !. This signals a not-join or not-select should be performed.
However, the OP's code does not work:
dt1[!(dt2$V1)]
# V1 V2
# 1: 2 B
# 2: 2 E
# 3: 3 C
# 4: 3 F
In this case, dt2$V1 is read as a vector of row numbers, not as part of a join. This appears to be what is meant by a "not-select", though the documentation could be more explicit: from the sentence above alone, "not-select" and "not-join" might as well be two terms for the same thing.
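To make the distinction concrete, a small sketch using the keyed dt1 from above:
dt1[!2]      # not-select: drops only row number 2 (the 1, D row)
dt1[!J(2)]   # not-join: drops every row whose key V1 == 2 (the B and E rows)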
You could try:
dt1[!(dt1$V1 %in% dt2$V1)]
This assumes that you don't care about ordering.
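On current data.table versions (1.9.6+), you can also express the not-join ad hoc with the on argument, no keys required; a sketch using the original tables:
dt1[!dt2, on = "V1"]   # rows of dt1 whose V1 does not appear in dt2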

R: Merge data.table and fill in NAs

Suppose 3 data tables:
dt1<-data.table(Type=c("a","b"),x=1:2)
dt2<-data.table(Type=c("a","b"),y=3:4)
dt3<-data.table(Type=c("c","d"),z=3:4)
I want to merge them into 1 data table, so I do this:
dt4<-merge(dt1,dt2,by="Type") # No error, produces what I want
dt5<-merge(dt4,dt3,by="Type") # Produces empty data.table (0 rows) of 4 cols: Type,x,y,z
Is there a way to make dt5 like this instead?:
> dt5
Type x y z
1: a 1 3 NA
2: b 2 4 NA
3: c NA NA 3
4: d NA NA 4
While you explore the all argument to merge, I'll also offer an alternative that you might want to consider:
Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), list(dt1, dt2, dt3))
# Type x y z
# 1: a 1 3 NA
# 2: b 2 4 NA
# 3: c NA NA 3
# 4: d NA NA 4
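Reduce simply folds the pairwise merge from left to right, so for these three tables the call above is equivalent to:
merge(merge(dt1, dt2, by = "Type", all = TRUE), dt3, by = "Type", all = TRUE)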
If you know in advance the unique values in your Type column, you can use J and join the tables the data.table way. Set the key for each table so data.table knows what to join on, like this...
# setkeys
setkey( dt1 , Type )
setkey( dt2 , Type )
setkey( dt3 , Type )
# Join
dt1[ dt2[ dt3[ J( letters[1:4] ) , ] ] ]
# Type x y z
#1: a 1 3 NA
#2: b 2 4 NA
#3: c NA NA 3
#4: d NA NA 4
This shows off data.table's compound queries (i.e. dt1[dt2[dt3[...]]]), which are wicked!
If you don't know the unique key values in advance, you can put your tables in a list and use lapply to run through them, collecting the unique values for your J expression...
# A simple way to get the unique values to make 'J',
# assuming the key is in the first column.
ll <- list( dt1 , dt2 , dt3 )
vals <- unique( unlist( lapply( ll , `[[` , 1 ) ) )
# [1] "a" "b" "c" "d"
Then use it like before, i.e. dt1[ dt2[ dt3[ J( vals ) , ] ] ].
