Vectorised between: data.table in R

I have a hard time understanding the "Vectorised between" example in the data.table package documentation (v1.10.4).
X = data.table(a = 1:5, b = 6:10, c = 5:1)
> X
   a  b c
1: 1  6 5
2: 2  7 4
3: 3  8 3
4: 4  9 2
5: 5 10 1
# NEW feature in v1.9.8, vectorised between
> X[c %between% list(a, b)]
   a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
X[between(c, a, b)] # same as above
Can someone please explain how it works? Why were only 5, 4, 3 from c selected? Thanks.

----- As posted in comments -----
The check is row-wise: each value of c is compared against a and b from the same row. In row 4, c = 2 is not between a = 4 and b = 9, i.e. between(2, 4, 9) is FALSE, so the row is dropped.
between uses >= and <= (rather than > and <), so the bounds are inclusive. That's why row 3 is kept: c = 3 equals a = 3, so the comparison is TRUE.
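The same filter can be spelled out with explicit inclusive comparisons; a minimal sketch of the row-wise check:
# Equivalent to X[c %between% list(a, b)]: keep rows where a <= c <= b
X[c >= a & c <= b]
#    a b c
# 1: 1 6 5
# 2: 2 7 4
# 3: 3 8 3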


Summing in data.table returns different values in R 3 vs 4

I am getting a weird summation problem when using data.table in R 4.0.2. When I group data by one column and sum the other (the bar[,.(C = sum(B)), by = A] line), I get some incorrect numbers. Here is a reprex where I only load data.table:
> library(data.table)
> bar <- data.table(data.frame("A" = as.character(c(1,2,3,2,3,2)),
+                              "B" = as.numeric(c(1,2,3,4,5,6))))
> bar
   A B
1: 1 1
2: 2 2
3: 3 3
4: 2 4
5: 3 5
6: 2 6
> bar[, .(C = sum(B)), by = A]
   A  C
1: 1  2
2: 2 10
3: 3  8
> bar[A == 1, sum(B)]
[1] 1
> bar[A == 2, sum(B)]
[1] 12
> bar[A == 3, sum(B)]
[1] 8
> bar[, .(C = sum(as.integer(B))), by = A]
   A  C
1: 1  1
2: 2 12
3: 3  8
Yet, if I do this on R 3.6.3, everything works as I expect, and the problematic portion above now looks like:
> bar[, .(C = sum(B)), by = A]
   A  C
1: 1  1
2: 2 12
3: 3  8
And everything else is the same.
Did R 4.x change how numerics are summed? Why is it fixed when I convert to integers first?
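One way to narrow this down (a diagnostic sketch, not a confirmed cause) is to check whether data.table's GForce-optimized group sum takes a different path from base R's sum(), and whether the package binary matches the running R version:
options(datatable.verbose = TRUE)   # prints optimization decisions, e.g. "GForce optimized j"
bar[, .(C = sum(B)), by = A]

options(datatable.optimize = 1L)    # disable GForce, keep level-1 optimization only
bar[, .(C = sum(B)), by = A]        # if this now matches R 3.6.3, the optimized path is the suspect

packageVersion("data.table")        # packages built under R 3.x should be reinstalled on R 4.x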

Quick way to use the row element as name and the value as column in R

I currently do this with some steps that are neither elegant nor safe, but I'm sure there is an easier and faster way.
I need help finding a quick way to go from dataframe_1 to dataframe_2.
# from this
a <- c("A","A","B","B","C","C")
b <- c(1,2,12,2,4,5)
dataframe_1 <- cbind.data.frame(a, b)
  a  b
1 A  1
2 A  2
3 B 12
4 B  2
5 C  4
6 C  5
# to this
a <- c(1,2)
b <- c(12,2)
c <- c(4,5)
dataframe_2 <- cbind.data.frame(A = a, B = b, C = c)
  A  B C
1 1 12 4
2 2  2 5
Try unstack; rev() reverses the column order so that unstack splits b by the grouping column a:
> unstack(rev(dataframe_1))
  A  B C
1 1 12 4
2 2  2 5
One option, if the number of elements in each group is constant, is to split b by a and bind the pieces back together:
data.frame(do.call(cbind, split(dataframe_1$b, dataframe_1$a)))
  A  B C
1 1 12 4
2 2  2 5
This can also be done with dcast and rowid from data.table:
dcast(as.data.table(dataframe_1), rowid(a) ~ a, value.var = 'b')[, -1]
   A  B C
1: 1 12 4
2: 2  2 5
Here, [, -1] removes the first column (the rowid(a) index).
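For completeness, a tidyverse sketch (assuming dplyr and tidyr are installed) that builds a per-group row index so pivot_wider knows which values line up:
library(dplyr)
library(tidyr)
dataframe_1 %>%
  group_by(a) %>%
  mutate(row = row_number()) %>%   # position within each group
  ungroup() %>%
  pivot_wider(names_from = a, values_from = b) %>%
  select(-row)
#   A  B C
# 1 1 12 4
# 2 2  2 5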

How to replace a certain value in one data.table with values from another data.table of the same dimension

Given two data.tables:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3),
                  b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3), a = c(6,4,3,2,6),
                  b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
    id   a b   c
1:   1   2 5 -99
2: -99   1 3 -99
3:   2 -99 3 -99
4:   2 -99 2   2
5: -99   3 5   5
> dt2
   id a b c
1:  2 6 3 2
2:  3 4 7 2
3:  1 3 8 4
4:  4 2 8 3
5:  3 6 3 2
How can one replace the -99 values in dt1 with the corresponding values from dt2?
The desired result is dt3:
> dt3
   id a b c
1:  1 2 5 2
2:  3 1 3 2
3:  2 3 3 4
4:  2 2 2 2
5:  3 3 5 5
You can do the following (converting to data.frames so that logical-matrix subsetting applies):
dt3 <- as.data.frame(dt1)
dt2 <- as.data.frame(dt2)
dt3[dt3 == -99] <- dt2[dt3 == -99]
dt3
#   id a b c
# 1  1 2 5 2
# 2  3 1 3 2
# 3  2 3 3 4
# 4  2 2 2 2
# 5  3 3 5 5
If your data are all of the same type (as in your example), converting to matrices is fast and transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# build a logical matrix of the same shape flagging the entries to replace
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## overwrite the flagged entries
This is a vectorized operation, so it should be fast.
Note: if you do this, make sure the two data sources match exactly in shape and order of rows/columns. If they don't, you need to join by the relevant keys and pick the correct columns, as sketched below.
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a more terse solution.
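To illustrate that last point, here is a hypothetical update-join sketch; key_id is an invented column, since in this example id itself contains -99 and cannot serve as a reliable key:
# Hypothetical: dt1 and dt2 both carry a clean key column 'key_id'.
# In an update join, plain names refer to dt1's columns and the
# 'i.'-prefixed names to dt2's.
dt1[dt2, on = "key_id", `:=`(
  a = fifelse(a == -99, i.a, a),
  b = fifelse(b == -99, i.b, b),
  c = fifelse(c == -99, i.c, c)
)]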
A simple way is to use the setDF function to convert both tables to data.frames (by reference), use data.frame subsetting, and restore them to data.tables at the end.
# Convert to data.frame by reference
setDF(dt1)
setDF(dt2)
# Perform the assignment
dt1[dt1 == -99] <- dt2[dt1 == -99]
# Restore to data.table
setDT(dt1)
setDT(dt2)
dt1
#    id a b c
# 1:  1 2 5 2
# 2:  3 1 3 2
# 3:  2 3 3 4
# 4:  2 2 2 2
# 5:  3 3 5 5
The same matrix trick, written compactly; note that as.matrix() and as.data.table() return new objects, so assign the results back:
dt1 <- as.matrix(dt1)
dt2 <- as.matrix(dt2)
index.replace <- dt1 == -99
dt1[index.replace] <- dt2[index.replace]
dt1 <- as.data.table(dt1)
dt2 <- as.data.table(dt2)
This also works, though a double loop over cells is slow and best kept for small tables:
# Element-by-element replacement; [[ extracts a column as a plain vector,
# which avoids data.table's [ overhead inside the loop
for (i in 1:nrow(dt1)) {
  for (j in 1:ncol(dt1)) {
    if (dt1[[j]][i] == -99) dt1[[j]][i] <- dt2[[j]][i]
  }
}
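A data.table-native alternative, sketched under the same aligned-rows assumption, updates each column by reference with set() and avoids any conversion:
# For each column, find the -99 positions and overwrite them in place
for (col in names(dt1)) {
  idx <- which(dt1[[col]] == -99)
  set(dt1, i = idx, j = col, value = dt2[[col]][idx])
}
dt1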

Creating a new data.table for each row of an existing data.table in R while avoiding a vector memory issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
   w d
1: 1 5
2: 2 6
3: 3 7
> B
   K  m
1: 2  9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C = A[, B[], by = names(A)]
   w d K  m
1: 1 5 2  9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2  9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2  9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops; memory is not an issue on my server, which has up to 50 GB of RAM, and the resulting data.table would certainly be smaller than that. (The "negative length vectors" error typically indicates that the length of an intermediate result overflowed R's integer limit, rather than that RAM ran out.)
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then allow a cartesian join:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = TRUE]
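The helper column can be dropped from the result afterwards. Note that A[B] loops over the rows of B, so the rows come out grouped by B; swap the roles (B[A, on = 'j', allow.cartesian = TRUE]) and reorder columns with setcolorder() if you want one block per row of A, as in the desired output above.
C[, j := NULL]   # remove the helper column from the result
A[, j := NULL]   # and from the inputs, if desired
B[, j := NULL]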

R data.table not preserving factor when applying function by group [duplicate]

The data comes from another question I was playing around with:
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
#     user country event
#  1:    3       1     1
#  2:    3       1     2
#  3:    3       1     3
#  4:    3       1     4
#  5:    3       2     5
#  6:    4       2     6
#  7:    4       2     7
#  8:    4       2     8
#  9:    4       2     9
# 10:    4       2    10
And here's the surprising behavior:
dt[user == 3, as.data.frame(table(country))]
#   country Freq
# 1       1    4
# 2       2    1
dt[user == 4, as.data.frame(table(country))]
#   country Freq
# 1       2    5
dt[, as.data.frame(table(country)), by = user]
#    user country Freq
# 1:    3       1    4
# 2:    3       2    1
# 3:    4       1    5
#           ^^^ - why is this 1 instead of 2?!
Thanks mnel and Victor K. The natural follow-up is: shouldn't it be 2, i.e. is this a bug? I expected
dt[, blah, by = user]
to return identical result to
rbind(dt[user == 3, blah], dt[user == 4, blah])
Is that expectation incorrect?
The idiomatic data.table approach is to use .N:
dt[, .N, by = list(user, country)]
This will be far quicker, and it also retains country with the same class as in the original.
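For the example data this yields:
> dt[, .N, by = list(user, country)]
   user country N
1:    3       1 4
2:    3       2 1
3:    4       2 5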
As mnel noted in comments, as.data.frame(table(...)) produces a data frame where the first variable is a factor. For user == 4, there is only one level in the factor, which is stored internally as 1.
What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:
> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
Update. Regarding your second question: no, I think data.table's behaviour is correct. The same thing happens in plain R when you concatenate two factors with different levels:
> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
This shows the underlying integer codes. (Note that since R 4.1.0, c() on factors combines the levels properly, so this demonstration applies to older R versions.)
