data.table alternative to the relist function in R

I am looking for a faster data.table alternative to the relist function, specifically its skeleton argument, which lets you group values according to a given structure. I know the by option does this, but only by columns of the data.table. Ultimately, I want to take a column of values from a data.table and group those values based on an already created 'frame'. For example, the code below shows how listFrame is created; it would be passed to the skeleton argument to group values based on each sequence of numbers.
library(data.table)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
nList <- as.data.table(sampleData[, table(V1)])
listFrame <- data.table(sapply(1:nrow(nList), function(x) 1:nList$N[x]))
sampleData <- relist(sampleData[[1]], listFrame[[1]])
Furthermore, after relisting sampleData I plan to apply a function over each sublist, using lapply. If it can all be done using data.table that would be great.
lapply(1:length(sampleData), function(x) median(sampleData[[x]]))

If I understand correctly, your listFrame skeleton is a way of counting the occurrences of each value in V1. As your input data is sorted on V1, I think this is the same as a run-length group id, which you can obtain with the function rleid.
So you could group by rleid(V1) and then apply the median:
library(data.table)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
sampleData[, .(med = median(V1)), by = .(V1, rleid(V1))][, rleid := NULL][]
#> V1 med
#> 1: 0 0
#> 2: 1 1
#> 3: 2 2
#> 4: 3 3
#> 5: 4 4
#> 6: 5 5
#> 7: 6 6
#> 8: 7 7
#> 9: 8 8
#> 10: 9 9
#> 11: 10 10
The results in column med are the same as in your example, but stored in a table rather than a list.
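If a list is really needed, as in the original relist call, the same run-length ids can be fed to base R's split. This is only a minimal sketch: the fixed seed and the names groups and meds are assumptions added here for illustration, and it relies on V1 being sorted as in the question.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible
sampleData <- data.table(V1 = rep(seq(0, 10), each = 10))
sampleData <- sampleData[-sample(.N, 25)]

# One list element per run of identical values in V1
groups <- split(sampleData$V1, rleid(sampleData$V1))

# Apply any function over the sublists, here the median
meds <- lapply(groups, median)
```

The list groups plays the role of the relisted sampleData in the question, without building a skeleton first.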
Edit, after the clarification in the comments
rleid is just one way of creating the ID column. In your specific case of the function dplR::tbrm, you only need an ID column to apply the function by.
library(data.table)
library(dplR)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
Create an ID column:
sampleData[, ID := LETTERS[rleid(V1)]]
Apply your function by ID:
sampleData[, dplR::tbrm(V1), by = ID]
#> ID V1
#> 1: A 0
#> 2: B 1
#> 3: C 2
#> 4: D 3
#> 5: E 4
#> 6: F 5
#> 7: G 6
#> 8: H 7
#> 9: I 8
#> 10: J 9
#> 11: K 10

Related

Merging data.tables by numeric column when machine tolerance needs to be accounted for

Many have seen the issue with using == to compare two floating-point
numbers: == fails to return TRUE, but all.equal works.
x <- sqrt(2)
x^2 == 2
#> [1] FALSE
all.equal(x^2, 2)
#> [1] TRUE
My issue comes from the need to join two data.tables by a numeric column
where == will fail to find the matching pairs.
I have considered coercing the numeric values to characters, but that option
has too many other potential errors. I have considered rounding the values,
but that too, in the application I need, will create more problems.
Here is a simple example of a join that fails because
DT1$x == DT2$x returns FALSE when it would be preferable for it to
return TRUE.
library(data.table)
packageVersion("data.table")
#> [1] '1.12.8'
DT1 <- data.table(x = sqrt(1:10), v1 = 1:10)
DT2 <- data.table(x = 1:10, v2 = LETTERS[1:10])
# set x to its square
DT1[, x := x^2]
# left join
merge(DT1, DT2, by = "x", all.x = TRUE)
#> x v1 v2
#> 1: 1 1 A
#> 2: 2 2 <NA>
#> 3: 3 3 <NA>
#> 4: 4 4 D
#> 5: 5 5 <NA>
#> 6: 6 6 <NA>
#> 7: 7 7 <NA>
#> 8: 8 8 <NA>
#> 9: 9 9 I
#> 10: 10 10 <NA>
How can I specify a left join by a numeric column key such that the machine
tolerance in the comparison is accounted for?
Created on 2020-04-06 by the reprex package (v0.3.0)
You could use roll = "nearest". Note that only the last column specified in on = can be rolling.
library(data.table)
DT1[DT2,on = "x", roll = "nearest"]
x v1 v2
1: 1 1 A
2: 2 2 B
3: 3 3 C
4: 4 4 D
5: 5 5 E
6: 6 6 F
7: 7 7 G
8: 8 8 H
9: 9 9 I
10: 10 10 J
I suspect the problem is more complicated than this simple case, but you could subsequently filter joins that do not meet a certain threshold of difference.
Data
DT1 <- data.table(x = sqrt(1:10), v1 = 1:10)
DT2 <- data.table(x = 1:10, v2 = LETTERS[1:10])
DT1[, x := x^2]
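The threshold filter mentioned above could look roughly like the sketch below. It keeps DT1 as the left table, carries DT2's own x along as x2 so the gap can be measured after the rolling join, and blanks out matches that are further apart than a tolerance. The tolerance value, the extra row with x = 10.4, and the names x2, tol and res are assumptions made here for illustration.

```r
library(data.table)

# Same data as above, plus one row (x = 10.4) that has no close match in DT2
DT1 <- data.table(x = c(sqrt(1:10)^2, 10.4), v1 = 1:11)
DT2 <- data.table(x = as.numeric(1:10), v2 = LETTERS[1:10])

tol <- 1e-8  # hypothetical tolerance, choose one that suits the data

# For every row of DT1, take the nearest x in DT2; x2 keeps DT2's own value
res <- DT2[, .(x, x2 = x, v2)][DT1, on = "x", roll = "nearest"]

# Discard matches whose distance exceeds the tolerance
res[abs(x2 - x) > tol, c("x2", "v2") := .(NA_real_, NA_character_)]
```

In res, the column x holds DT1's values, so rows that are only off by floating-point error keep their v2, while the row at 10.4 ends up with NA.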

Data.table: square brackets used with j

I am trying to learn data.table and came across the .SD notation in an online cheat sheet. The example uses square brackets with .SD to subset rows. But why not just subset rows with i? .SD[c(1, .N)] subsets rows, right? And why should I subset rows like this?
library(data.table)
DT <- data.table(A = letters[c(1, 1, 1, 2, 2)],
B = 1:5,
C = 6:10)
DT
#> A B C
#> 1: a 1 6
#> 2: a 2 7
#> 3: a 3 8
#> 4: b 4 9
#> 5: b 5 10
# Method 1
DT[, .SD[c(1, .N)], by = A]
#> A B C
#> 1: a 1 6
#> 2: a 3 8
#> 3: b 4 9
#> 4: b 5 10
# method 2
DT[c(1, .N), .SD, by = A]
#> A B C
#> 1: a 1 6
#> 2: b 5 10
In the second case, i is evaluated on the whole table, so .N is the last row of the table; in the first case, .SD[c(1, .N)] is evaluated within each group, so .N is the last row of each group.
DT[c(1, .N)]
selects the same rows as
DT[c(1, .N), .SD, by = A]
The only difference is that in the latter the rows selected in i are then grouped by 'A' before j is evaluated.
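The order of evaluation described above can be checked directly. The sketch below also shows a commonly used alternative that collects row numbers per group with .I and subsets once; the object names whole, per_group, idx and via_I are chosen here for illustration.

```r
library(data.table)

DT <- data.table(A = letters[c(1, 1, 1, 2, 2)],
                 B = 1:5,
                 C = 6:10)

# i is evaluated on the whole table before any grouping, so .N is 5 here
whole <- DT[c(1, .N)]

# .SD[...] runs inside j, once per group, so .N is the size of each group
per_group <- DT[, .SD[c(1, .N)], by = A]

# Same per-group selection: collect the row numbers with .I, then subset once
idx <- DT[, .I[c(1, .N)], by = A]$V1
via_I <- DT[idx]
```

Subsetting with .SD in j is therefore what gives first and last rows per group; subsetting in i cannot, because i knows nothing about the groups.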

Join single variable to multiple variables in r data.table

DT1 holds the mapping for all IDs, DT2 holds relationships between IDs.
I'd like to join the mappings from DT1 directly to both IDs in DT2.
Example datasets below:
# Join a mapping to multiple variables.
library(data.table)
# Dataset with mappings.
set.seed(1)
dt1 <- data.table(id=1:10,
group=sample(letters[1:4], 10, replace=TRUE))
# > dt1
# id group
# 1: 1 b
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 a
# 6: 6 d
# 7: 7 d
# 8: 8 c
# 9: 9 c
# 10: 10 a
# Dataset with relationship between IDs.
dt2 <- data.table(id1=1:5,
id2=6:10)
# > dt2
# id1 id2
# 1: 1 6
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 10
I could of course use two joins, first on ID1, then on ID2. Another way of achieving what I want is first melting DT2, so all the ID values are a single variable before joining...
# Now melt, join group variable of DT1 to DT2, then cast again to obtain
# original structure.
dt2[, i := .I] # need an observation ID
dt2Long <- melt(dt2, id="i")
setkey(dt2Long, value)
dcast(dt2Long[dt1], i ~ variable, value.var=c("value", "group"))
# i value_id1 value_id2 group_id1 group_id2
# 1: 1 1 6 b d
# 2: 2 2 7 b d
# 3: 3 3 8 c c
# 4: 4 4 9 d c
# 5: 5 5 10 a a
This gives the desired result, but I would like to know if something like the following is possible (i.e. merging a single variable with two variables)?
setkey(dt1, id)
dt1[dt2, on=c("id1", "id2")]
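For comparison, the two-join route mentioned earlier can be sketched with update joins, which add the looked-up group to dt2 by reference without melting. This is only one possible way to write it; the column names group_id1 and group_id2 are chosen here to mirror the dcast output and are not part of the original data.

```r
library(data.table)

set.seed(1)
dt1 <- data.table(id = 1:10,
                  group = sample(letters[1:4], 10, replace = TRUE))
dt2 <- data.table(id1 = 1:5,
                  id2 = 6:10)

# Update join: match dt2$id1 against dt1$id and copy the group over
dt2[dt1, on = .(id1 = id), group_id1 := i.group]

# Same lookup for the second id column
dt2[dt1, on = .(id2 = id), group_id2 := i.group]
```

Each line is a separate join against dt1, so this does not answer whether a single call can map one key to two columns; it only avoids the melt and dcast round trip.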

How to create new column in a data.table when the name of column must be a string

How to create new column in a data.table when the name of column must be a string or character?
For example:
library(data.table)
DT = data.table(v1=c(1,2,3), v2=2:4)
new_var <- "v3"
DT[, new_var:=v2+5]
I get
DT
#> v1 v2 new_var
#> 1: 1 2 7
#> 2: 2 3 8
#> 3: 3 4 9
But I want
#> v1 v2 v3
#> 1: 1 2 7
#> 2: 2 3 8
#> 3: 3 4 9
It can be done this way, by wrapping the variable name in eval() or in parentheses:
DT = data.table(v1=c(1,2,3), v2=2:4)
new_var <- "v3"
DT[, eval(new_var):=v2+5]
# or
DT[, (new_var):=v2+5]
DT
#> v1 v2 v3
#> 1: 1 2 7
#> 2: 2 3 8
#> 3: 3 4 9
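Another option, not mentioned above, is data.table's set() function, which takes the column name as a character value in its j argument. A minimal sketch using the same toy data:

```r
library(data.table)

DT <- data.table(v1 = c(1, 2, 3), v2 = 2:4)
new_var <- "v3"

# set() adds or overwrites the column named by the string in new_var, by reference
set(DT, j = new_var, value = DT$v2 + 5)
```

Because set() does not go through [.data.table, it is often preferred inside loops over many column names.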

Return a list in dplyr mutate()

I have a function in my real-world problem that returns a list. Is there any way to use this with dplyr's mutate()? This toy example doesn't work:
it = data.table(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
list(temp1,temp2)
}
myfun(1,2)
it%.%mutate(new = myfun(V2,V3))
I see that it is cycling through the output of the function in the first "column" of the new variable, but do not understand why.
Thanks!
The idiomatic way to do this using data.table would be to use the := (assignment by reference) operator. Here's an illustration:
it[, c(paste0("V", 4:5)) := myfun(V2, V3)]
If you really want a list, why not:
as.list(it[, myfun(V2, V3)])
Alternatively, maybe this is what you want, but why don't you just use the data.table functionality:
it[, c(.SD, myfun(V2, V3))]
# V1 V2 V3 V4 V5
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Note that if myfun were to name its output, then the names would show up in the final result columns:
# V1 V2 V3 new.1 new.2
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Given the title to this question, I thought I'd post a tidyverse solution that uses dplyr::mutate. Note that myfun needs to output a data.frame to work.
library(tidyverse)
it = data.frame(
v1 = c("a","a","b","b","c"),
v2 = c(1,2,3,4,5),
v3 = c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
data.frame(temp1, temp2)
}
it %>%
nest(data = c(v2, v3)) %>%
mutate(out = map(data, ~myfun(.$v2, .$v3))) %>%
unnest(cols = c(data, out))
#> # A tibble: 5 x 5
#> v1 v2 v3 temp1 temp2
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 2 3 -1
#> 2 a 2 3 5 -1
#> 3 b 3 4 7 -1
#> 4 b 4 2 6 2
#> 5 c 5 2 7 3
Created on 2020-02-04 by the reprex package (v0.3.0)
The mutate() function is designed to add new columns to the existing data frame. A data frame is a list of vectors of the same length. Thus, you can't add the two-element list that myfun() returns as a single new column, because its length does not match the number of rows.
You can rewrite your function as two functions, each of which returns a vector. Then apply each of these separately using mutate() and it should work.
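A minimal sketch of that suggestion, using the toy data from the question; the helper names my_sum and my_diff and the result name res are made up here for illustration.

```r
library(dplyr)

it <- data.frame(
  v1 = c("a", "a", "b", "b", "c"),
  v2 = c(1, 2, 3, 4, 5),
  v3 = c(2, 3, 4, 2, 2))

# Hypothetical helpers: each returns a plain vector with one value per row
my_sum  <- function(arg1, arg2) arg1 + arg2
my_diff <- function(arg1, arg2) arg1 - arg2

# One mutate() call can add both columns
res <- it %>%
  mutate(temp1 = my_sum(v2, v3),
         temp2 = my_diff(v2, v3))
```

This is only practical when the shared computation is cheap; if the real function does expensive work once and returns several results, the data.table or nest/unnest approaches above avoid repeating it.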
