I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by=. here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
The OP said that this creates a new data.table, which is undesirable for them.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
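As a quick check that the update really happens in place, you can compare addresses before and after with data.table's address() (a small sketch re-creating the tables above):

```r
library(data.table)

dt <- data.table(x = 1:10)
setkey(dt, x)
k <- data.table(x = c(1:5, 10), b = c(letters[1:5], "d"))
setkey(k, x)

before <- address(dt)
dt[k, b := i.b]               # join-assign: column added by reference
identical(address(dt), before)
# [1] TRUE
dt$b
# [1] "a" "b" "c" "d" "e" NA  NA  NA  NA  "d"
```

The address is unchanged because data.table over-allocates column pointer slots, so adding a column with := needs no reallocation.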
To explain why the unexpected result happens:
In the first case you do dt[,b:=k[.BY, list(b)],by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. As to why you get what you get with b := k[.BY, list(b)]: since the RHS returns a data.table and you're assigning it to a single column, only the first element is taken and the rest is dropped. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)]:
In the first case, k[6], you are accessing the 6th row of k, which is 10 d. In the second case, you're asking for a J, a join. It searches for x = 6 in the key column and, since there isn't one in k, returns 6 NA. In your case, since k[.BY] passes a list, it is a join operation, which fetches the right value.
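To see the difference side by side (a small sketch, re-creating the k from the question):

```r
library(data.table)

k <- data.table(x = c(1:5, 10), b = c(letters[1:5], "d"))
setkey(k, x)

k[6]        # positional indexing: the 6th row of k
#     x b
# 1: 10 d

k[list(6)]  # join on the key column: look up x == 6
#    x  b
# 1: 6 NA
```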
I hope this helps.
How can I check whether one data.table is a subset of another, regardless of row and column order? For instance, imagine someone rbinded DT_x and DT_y, removed the duplicates, and created DT_z. Now I want to compare DT_x and DT_z and get a result showing that DT_x is a subset of DT_z.
as very simple example:
DT1 <- data.table(a= LETTERS[1:10], v=1:10)
DT2 <- data.table(a= LETTERS[1:6], v=1:6)
DT1
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
7: G 7
8: H 8
9: I 9
10: J 10
DT2
a v
1: A 1
2: B 2
3: C 3
4: D 4
5: E 5
6: F 6
I am sure all.equal(DT1, DT2) will not answer my question.
I think you can use data.table's fintersect() and fsetequal():
is_df1_subset_of_df2 <- function(df1, df2) {
intersection <- data.table::fintersect(df1, df2)
data.table::fsetequal(df1, intersection)
}
The first line picks the elements in df1 that exist in df2.
The second line checks if that set is all of df1.
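Applied to the DT1/DT2 from the question, this gives (a usage sketch):

```r
library(data.table)

is_df1_subset_of_df2 <- function(df1, df2) {
  intersection <- data.table::fintersect(df1, df2)
  data.table::fsetequal(df1, intersection)
}

DT1 <- data.table(a = LETTERS[1:10], v = 1:10)
DT2 <- data.table(a = LETTERS[1:6],  v = 1:6)

is_df1_subset_of_df2(DT2, DT1)  # TRUE: every row of DT2 is also in DT1
is_df1_subset_of_df2(DT1, DT2)  # FALSE: DT1 has rows DT2 lacks
```

Note that this compares whole rows, ignoring row order, which matches the question's requirement.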
Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever the rows of the data set match the row numbers given by my.rows the entry is populated with, say, true, or false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.
We can compare 'my.rows' with the sequence of rows using %in% to create a logical vector and assign it (:=) to create the 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of 'FALSE', using 'my.rows' as 'i', we assign the elements in 'z' that correspond to 'i' as 'TRUE'.
DT[, z:= FALSE][my.rows, z:= TRUE]
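Either form fills the whole column at once; for instance, with the question's setup (the exact rows selected depend on the random draw):

```r
library(data.table)

set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y = sample(10))
my.rows <- sample(1:dim(DT)[1], 3)

DT[, z := 1:.N %in% my.rows]  # logical column: TRUE on the sampled rows
sum(DT$z)                     # exactly 3 rows are TRUE
```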
DT <- cbind(DT,z = ifelse(DT[, .I] %in% my.rows , T, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA
If you have an R data.table that has missing values, how do you replace all of them with say, the value 0? E.g.
aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]
V1 X V2
1: 1 NA 1
2: 2 NA 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 NA 4
8: 8 NA 4
9: 9 NA 4
10: 10 NA 4
Any way to do this in one line? If it were just a matrix, you could just do:
tt[is.na(tt)] = 0
is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way with data.table.
First I'll illustrate as to how to do it and then show how slow this can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
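If the coercion warning bothers you, one sketch is to pick a replacement value that already matches each column's type — here 0 for numeric columns and the string "0" for character ones (the choice of "0" for characters is just illustrative):

```r
library(data.table)

# re-create tt from the question's join
aa <- data.table(V1 = 1:10, V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
bb <- data.table(V1 = 3:6, X = letters[1:4])
setkey(aa, V1); setkey(bb, V1)
tt <- bb[aa]

for (j in seq_along(tt)) {
  nas <- which(is.na(tt[[j]]))
  if (!length(nas)) next
  # match the replacement to the column type, so set() needn't coerce
  fill <- if (is.numeric(tt[[j]])) 0 else "0"
  set(tt, i = nas, j = j, value = fill)
}
tt$X
# [1] "0" "0" "a" "b" "c" "d" "0" "0" "0" "0"
```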
Why shouldn't you use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
Nothing unusual here:
tt[is.na(tt)] = 0
...will work.
This is somewhat confusing however given that:
tt[is.na(tt)]
...currently returns:
Error in [.data.table(tt, is.na(tt)) : i is invalid type
(matrix). Perhaps in future a 2 column matrix could return a list of
elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
datatable-help know if you'd like this, or add your comments to FR #1611.
I would make use of data.table and lapply, namely:
tt[, lapply(.SD, function(kkk) ifelse(is.na(kkk), -666, kkk)), .SDcols = names(tt)]
yielding:
V1 X V2
1: 1 -666 1
2: 2 -666 2
3: 3 a 2
4: 4 b 3
5: 5 c 3
6: 6 d 3
7: 7 -666 4
8: 8 -666 4
9: 9 -666 4
10: 10 -666 4
The specific problem the OP posted could also be solved by:
tt[is.na(X), X := 0]
I just discovered this warning in my script that was a bit strange.
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
Observation 1: Here's a reproducible example:
require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
# works fine
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
However, now if I convert column x to a factor (ordered or not) and do the same:
DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: NA 11
# 7: NA 12
# 8: NA 13
# 9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
But rbind does this job nicely!
rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
The same behaviour can be reproduced if column x is an ordered factor as well. Since the help page ?rbindlist says "Same as do.call("rbind", l), but much faster.", I'm guessing this is not the desired behaviour.
Here's my session info:
# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.8.8
#
# loaded via a namespace (and not attached):
# [1] tools_3.0.0
Edit:
Observation 2: Following another interesting observation from @AnandaMahto, reversing the order:
# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
# x y
# 1: A 11
# 2: B 12
# 3: C 13
# 4: D 14
# 5: E 15
# 6: 1 6
# 7: 2 7
# 8: 3 8
# 9: 4 9
# 10: 5 10
Here, the column from DT.1 is silently coerced to numeric.
Edit: Just to clarify, this is the same behaviour as that of rbind(DT2, DT1) with DT1's column x being a factor. rbind seems to retain the class of the first argument. I'll leave this part here and mention that in this case, this is the desired behaviour since rbindlist is a faster implementation of rbind.
Observation 3: If now, both the columns are converted to factors:
# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: a 11
# 7: b 12
# 8: c 13
# 9: d 14
# 10: e 15
Here, the column x from DT.2 is lost (/ replaced with that of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 gets replaced with that of DT.2).
In general, there seems to be a problem with handling factor columns in rbindlist.
UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9
I believe that rbindlist when applied to factors is combining the numerical values of the factors and using only the levels associated with the first list element.
As in this bug report:
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
# Temporary workaround:
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))
As another view of what's going on:
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
Edit as per comments:
as for observation 1, what's happening is similar to:
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5]
x
# Notice however, if you are assigning a value that is already present
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
rbindlist is super fast because it doesn't do the checking of rbind.fill or do.call(rbind.data.frame, ...).
You can use a workaround like this to ensure that factors are coerced to characters.
DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
DDL <- list(DT.1, DT.2)
for(ii in seq_along(DDL)){
ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
for(fn in ff){
set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
}
}
rbindlist(DDL)
or (less memory efficiently)
rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))
The bug is not fixed in R 4.0.2 and data.table 1.13.0. When I try to rbindlist() two DTs, one of which has factor columns and the other of which is empty, the final result gets this column broken and the factor values mangled (\n occurring randomly; levels are broken, NAs are introduced).
The workaround is not to rbindlist a DT together with an empty one, but instead to rbindlist it with other DTs that also carry payload data, although this requires some boilerplate code.
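One way to write that workaround compactly is to drop the zero-row tables before binding (a sketch; the table names here are made up):

```r
library(data.table)

DT.a     <- data.table(x = factor(letters[1:3]), y = 1:3)
DT.empty <- data.table(x = factor(character(0)), y = integer(0))

# bind only the tables that actually carry rows
DTs <- list(DT.a, DT.empty)
res <- rbindlist(Filter(function(d) nrow(d) > 0, DTs))
res$x
# [1] a b c
# Levels: a b c
```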
Hi, I want to avoid using loops, so I want to use something from plyr to help solve my problem.
I would like to create a function that gets the sum of a specifically chosen column for each factor from a dataframe.
So if we have the following example data...
df <- data.frame(cbind(x=rnorm(100),y=rnorm(100),z=rnorm(100),f=sample(1:10,100, replace=TRUE)))
df$f <- as.factor(df$f)
i.e. I would like something like:
foo <- function(df.obj,colname){
some code
}
where the df.obj would be the df variable above and the colname argument could be any of x,y or z.
and I would like the output/result of the function to have a column of the unique factors (in the above case 1:10) and the sums of the values in column x for each factor.
I expect the solution is quite simple, probably using ddply or summarise somehow, but I can't work out how to do it so that I can have the column name as an argument.
Thanks
Is this what you're after?
> ddply(df, .(f), colwise(sum))
f x y z
1 1 -0.4190284 2.61101681 1.2280026
2 2 1.1063977 2.40006922 4.9550079
3 3 0.4498366 -4.00610558 0.9964754
4 4 1.9325488 -2.81241212 -3.1185574
5 5 -4.1077670 -1.01232884 -3.9852388
6 6 -1.0488003 -2.42924689 3.5273636
7 7 2.2999306 0.85930085 -0.6245167
8 8 -4.8105311 -6.81352238 -2.1223436
9 9 -2.8187083 5.03391770 1.6433896
10 10 5.1323666 -0.06192382 1.8978994
Edit: correct answer as supplied by TS:
foo <- function(df.obj,colname){ddply(df, .(f), colwise(sum))[,c("f",colname)]}
This seems a perfect fit for data.table and the lapply(.SD,FUN) and .SDcols arguments
.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s).
.SDcols is a vector containing the names of the columns to which you wish to apply the function (FUN)
An example
Setup the data.table
library(data.table)
DT <- as.data.table(df)
The sums of x,y,z columns by f
DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "y", "z")]
## f x y z
## 1: 4 4.8041 3.9788 1.2519
## 2: 2 1.1255 -0.8147 2.9053
## 3: 3 0.9699 -0.1550 -8.5876
## 4: 9 2.2685 -1.2734 1.0506
## 5: 5 -0.1282 -2.5512 5.0668
## 6: 10 -2.7397 0.5290 -0.3638
## 7: 1 2.9544 -3.1139 -1.3884
## 8: 8 -4.3488 0.6894 1.4195
## 9: 7 2.3152 0.6474 2.7183
## 10: 6 -0.1569 1.0142 0.9156
The sums of x, and z columns by f
DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "z")]
## f x z
## 1: 4 4.8041 1.2519
## 2: 2 1.1255 2.9053
## 3: 3 0.9699 -8.5876
## 4: 9 2.2685 1.0506
## 5: 5 -0.1282 5.0668
## 6: 10 -2.7397 -0.3638
## 7: 1 2.9544 -1.3884
## 8: 8 -4.3488 1.4195
## 9: 7 2.3152 2.7183
## 10: 6 -0.1569 0.9156
Examples calculating the mean
DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "y", "z")]
## f x y z
## 1: 4 0.36955 0.30606 0.09630
## 2: 2 0.10232 -0.07407 0.26412
## 3: 3 0.07461 -0.01193 -0.66059
## 4: 9 0.15123 -0.08489 0.07004
## 5: 5 -0.01425 -0.28346 0.56298
## 6: 10 -0.21075 0.04069 -0.02799
## 7: 1 0.29544 -0.31139 -0.13884
## 8: 8 -0.54360 0.08617 0.17744
## 9: 7 0.38586 0.10790 0.45305
## 10: 6 -0.07844 0.50710 0.45782
DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "z")]
## f x z
## 1: 4 0.36955 0.09630
## 2: 2 0.10232 0.26412
## 3: 3 0.07461 -0.66059
## 4: 9 0.15123 0.07004
## 5: 5 -0.01425 0.56298
## 6: 10 -0.21075 -0.02799
## 7: 1 0.29544 -0.13884
## 8: 8 -0.54360 0.17744
## 9: 7 0.38586 0.45305
## 10: 6 -0.07844 0.45782
I haven't got enough rep to comment so will have to ask in answer form - why do you want to avoid using loops in R?
EDIT: Anyway, using plyr I'd use count().
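For the sum-of-one-column-per-factor case specifically, count()'s wt_var argument does it in one call (a sketch with the df from the question; the exact sums depend on the random draw):

```r
library(plyr)

set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# freq holds sum(x) within each level of f
res <- count(df, vars = "f", wt_var = "x")
head(res)
```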