Collapse data.table column values while grouping

Given a data.table object, I would like to collapse the values of some grouped columns into a single object and insert the resulting objects into a new column.
dt <- data.table(
c('A|A', 'B|A', 'A|A', 'B|A', 'A|B'),
c(0, 0, 1, 1, 0),
c(22.7, 1.2, 0.3, 0.4, 0.0)
)
setnames(dt, names(dt), c('GROUPING', 'NAME', 'VALUE'))
dt
# GROUPING NAME VALUE
# 1: A|A 0 22.7
# 2: B|A 0 1.2
# 3: A|A 1 0.3
# 4: B|A 1 0.4
# 5: A|B 0 0.0
I think that to do this is first necessary to specify the column for which you want to group, so I should start with something like dt[, OBJECTS := <expr>, by = GROUPING].
Unfortunately, I don't know the expression <expr> to use so that the result is as follows:
# GROUPING OBJECTS
# 1: A|A <vector>
# 2: B|A <vector>
# 3: A|B <vector>
Each <vector> must contain the values of the other columns. E.g. the first <vector> has to be a named vector equivalent to:
eg <- c(22.7, 0.3)
names(eg) <- c('0', '1')
# 0 1
# 22.7 0.3

Working inside of j: If you want to have the values of a column be a vector, you need to wrap the output in list(.).
j itself requires a call to list, so your final expression will resemble a nested list, eg:
dt[, list(allNames=list(NAME), allValues=list(VALUE)), by=GROUPING]
# GROUPING allNames allValues
# 1: A|A 0,1 22.7,0.3
# 2: B|A 0,1 1.2,0.4
# 3: A|B 0 0
As @Mnel points out, equivalently:
dt[, lapply(.SD, list), by=GROUPING]
If you want it in long form, then the structure of your <expr> will be:
list( c( list(), list(), ..., list() ) ) eg:
dt[, list(c(list(NAME), list(VALUE))), by=GROUPING]
# GROUPING V1
# 1: A|A 0,1
# 2: A|A 22.7,0.3
# 3: B|A 0,1
# 4: B|A 1.2,0.4
# 5: A|B 0
# 6: A|B 0
Or equivalently:
dt[, list(lapply(.SD, c)), by=GROUPING]

I think that this is what you are looking for:
dt1 <- dt[, list(list(setNames(VALUE, NAME))), by = GROUPING]
dt1
# GROUPING V1
# 1: A|A 22.7,0.3
# 2: B|A 1.2,0.4
# 3: A|B 0
str(dt1)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ GROUPING: chr "A|A" "B|A" "A|B"
# $ V1 :List of 3
# ..$ : Named num 22.7 0.3
# .. ..- attr(*, "names")= chr "0" "1"
# ..$ : Named num 1.2 0.4
# .. ..- attr(*, "names")= chr "0" "1"
# ..$ : Named num 0
# .. ..- attr(*, "names")= chr "0"
# - attr(*, ".internal.selfref")=<externalptr>
dt1$V1
# [[1]]
# 0 1
# 22.7 0.3
#
# [[2]]
# 0 1
# 1.2 0.4
#
# [[3]]
# 0
# 0
As @Arun points out in the comments, the "data.table" alternative to setNames in this case is setattr(VALUE, 'names', NAME), making another solution:
dt1 <- dt[, list(list(setattr(VALUE, 'names', NAME))), by = GROUPING]
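For comparison, the same collapse can be sketched in base R with split() and setNames() on a plain data.frame (the df below is an illustrative mirror of dt, not part of the original question):

```r
# Base-R sketch of the same collapse, using a plain data.frame
# that mirrors dt (GROUPING / NAME / VALUE).
df <- data.frame(
  GROUPING = c('A|A', 'B|A', 'A|A', 'B|A', 'A|B'),
  NAME     = c(0, 0, 1, 1, 0),
  VALUE    = c(22.7, 1.2, 0.3, 0.4, 0.0)
)
# One named vector per group, with names taken from NAME.
objects <- lapply(split(df, df$GROUPING),
                  function(g) setNames(g$VALUE, g$NAME))
objects[['A|A']]
#    0    1
# 22.7  0.3
```

This loses data.table's by-reference efficiency, but makes the grouping logic explicit.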

Related

Rename list elements which has names starting with specified characters using purrr package

I have a list with element names like x.height, x.weight, y.height, y.length, z.weight, z.price. I would like to extract the elements whose names start with "x." and rename these elements by removing their prefix "x.". This can be done in two steps:
list.new <- list.old %>% keep(str_detect(names(.), "^x."))
names(list.new) <- str_replace(names(list.new), "x", "")
My first question: how to combine these two steps in a pipeline?
At the end, I would like to process the list for all of the different prefixes "y.", "z." to get a new list with the renamed sublists like:
List of 3
$ x:List of 2
..$ height: num 100
..$ weight: num 200
$ y:List of 2
..$ height: num 300
..$ length: num 400
$ z:List of 2
..$ weight: num 500
..$ price: num 600
Is it possible to do this using a single pipeline?
You can simply use setNames() or set_names():
list.old <- list(
x.height=1, x.weight=2, y.height=3, y.length=4, z.weight=5, z.price=6
)
prefix <- "x."  # note: str_replace() treats this as a regex, where "." matches any character
list.old %>%
keep(startsWith(names(.), prefix)) %>%
set_names(str_replace(names(.), prefix, ""))
# $height
# [1] 1
#
# $weight
# [1] 2
And to apply to many prefixes, use the previous code as a function:
prefix_list <- c("x","y","z")
map(prefix_list,
function(prefix) list.old %>%
keep(startsWith(names(.), prefix)) %>%
set_names(str_replace(names(.), prefix, ""))
) %>%
set_names(prefix_list)
# $x
# $x$.height
# [1] 1
#
# $x$.weight
# [1] 2
#
#
# $y
# $y$.height
# [1] 3
#
# $y$.length
# [1] 4
#
#
# $z
# $z$.weight
# [1] 5
#
# $z$.price
# [1] 6
You can achieve what you want in the following way. Note that this requires a recent version of the dplyr package (>= 1.0.0), which introduced rename_with().
library(dplyr)
library(stringr)
library(purrr)
list.old <- list(
x = list(x.height = 100, x.weight = 200),
y = list(y.height = 300, y.length = 400),
z = list(z.weight = 500, z.price = 600)
)
list.new <- list.old %>%
map(as_tibble) %>%
map(~ rename_with(.x, ~ str_remove(.x, "^[xyz]\\."))) %>%
map(as.list)
str(list.new)
List of 3
$ x:List of 2
..$ height: num 100
..$ weight: num 200
$ y:List of 2
..$ height: num 300
..$ length: num 400
$ z:List of 2
..$ weight: num 500
..$ price : num 600
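For comparison, a base-R sketch of the same regrouping, starting from the flat list (no purrr/dplyr needed); the regexes here assume each prefix is a run of lowercase letters followed by a dot:

```r
list.old <- list(
  x.height = 1, x.weight = 2, y.height = 3,
  y.length = 4, z.weight = 5, z.price = 6
)
# Group elements by their prefix, then strip the prefix from the names.
prefixes <- sub("\\..*$", "", names(list.old))   # "x" "x" "y" "y" "z" "z"
list.new <- lapply(split(list.old, prefixes),
                   function(l) setNames(l, sub("^[a-z]+\\.", "", names(l))))
str(list.new)
```

split() works on plain lists too (it dispatches to split.default), which is what makes this one-liner possible.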

How do I lookup the closest row in a lookup data.frame based on multiple columns?

Suppose I have the following dataframe with data (v) and a lookup dataframe (l):
v <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-05'), as.Date('2019-01-30'), as.Date('2019-02-02')), kind=c('a', 'b', 'c', 'a'), v1=c(1,2,3,4))
v
d kind v1
1 2019-01-01 a 1
2 2019-01-05 b 2
3 2019-01-30 c 3
4 2019-02-02 a 4
l <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-04'), as.Date('2019-02-01')), kind=c('a','b','a'), l1=c(10,20,30))
l
d kind l1
1 2019-01-01 a 10
2 2019-01-04 b 20
3 2019-02-01 a 30
I would like to find the closest row in the l dataframe corresponding to each row in v using the columns: c("d", "kind"). Column kind needs to match exactly and maybe use findInterval(...) on d?
I would like my result to be:
d kind v1 l1
1 2019-01-01 a 1 10
2 2019-01-05 b 2 20
3 2019-01-30 c 3 NA
4 2019-02-02 a 4 30
NOTE: I would prefer a base-R implementation but it would be interesting to see others
I tried findInterval(...) but I don't know how get it to work with multiple columns
Here's a shot in base-R only. (I do believe that data.table will do this much more elegantly, but I appreciate your aversion to bringing in other packages.)
Split each frame into a list of frames, by kind:
v_spl <- split(v, v$kind)
l_spl <- split(l, l$kind)
str(v_spl)
# List of 3
# $ a:'data.frame': 2 obs. of 3 variables:
# ..$ d : Date[1:2], format: "2019-01-01" "2019-02-02"
# ..$ kind: Factor w/ 3 levels "a","b","c": 1 1
# ..$ v1 : num [1:2] 1 4
# $ b:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-05"
# ..$ kind: Factor w/ 3 levels "a","b","c": 2
# ..$ v1 : num 2
# $ c:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-30"
# ..$ kind: Factor w/ 3 levels "a","b","c": 3
# ..$ v1 : num 3
Now we determine the unique kind we have in common between the two, no need to try to join everything:
### this has the 'kind' in common
(nms <- intersect(names(v_spl), names(l_spl)))
# [1] "a" "b"
### this has the 'kind' we have to bring back in later
(miss_nms <- setdiff(names(v_spl), nms))
# [1] "c"
For the in-common kind, do an interval join:
joined <- Map(
v_spl[nms], l_spl[nms],
f = function(v0, l0) {
ind <- findInterval(v0$d, l0$d)
ind[ ind < 1 ] <- NA
v0$l1 <- l0$l1[ind]
v0
})
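The findInterval() step is what makes this a "most recent value on or before" match: for each query value it returns the index of the last sorted lookup value that is <= the query, or 0 if the query precedes them all (hence the NA patch above). A minimal illustration:

```r
# findInterval(query, sorted_breaks) returns, for each query value,
# the index of the last break that is <= that value (0 if none is).
findInterval(c(1, 4, 10), c(2, 5))
# [1] 0 1 2
```

This relies on l0$d being sorted within each kind, which holds for the example data.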
Ultimately we will rbind things back together, but those in miss_nms will not have the new column(s). This is a generic way to capture exactly one row of the new columns with an appropriate NA value:
newcols <- setdiff(colnames(joined[[1]]), colnames(v))
emptycols <- joined[[1]][, newcols, drop = FALSE][1, , drop = FALSE][NA, , drop = FALSE]
emptycols
# l1
# NA NA
And add that column(s) to the not-yet-found frames:
unjoined <- lapply(v_spl[miss_nms], cbind, emptycols)
unjoined
# $c
# d kind v1 l1
# 3 2019-01-30 c 3 NA
And finally bring everything back into a single frame:
do.call(rbind, c(joined, unjoined))
# d kind v1 l1
# a.1 2019-01-01 a 1 10
# a.4 2019-02-02 a 4 30
# b 2019-01-05 b 2 20
# c 2019-01-30 c 3 NA
If you want an exact match, you can simply do:
vl <- merge(v, l, by = c("d","kind"))
For your purposes, you can also derive additional variables from d (year, month or day) and use them in the merge.

Instantiate a data.frame with list in a cell in one step

If I do
data.frame(`Type` = list(c("aa", "bb")))
the list is spread across rows, and I get this output:
c..aa....bb..
1 aa
2 bb
Whereas if I do it in three steps:
df = data.frame(`Type` = NA)
df$Type <- list(c("aa", "bb"))
df
it works as intended:
Type
1 aa, bb
Note also that I need to instantiate `Type` first. A link explaining these behaviours would be very welcome.
You can use I():
data.frame(Type = I(list(c("aa", "bb"))))
# Type
# 1 aa, bb
str(.Last.value)
# 'data.frame': 1 obs. of 1 variable:
# $ Type:List of 1
# ..$ : chr "aa" "bb"
# ..- attr(*, "class")= chr "AsIs"
"dplyr" and "data.table" allow this directly:
library(dplyr)
data_frame(Type = list(c("aa", "bb")))
# Source: local data frame [1 x 1]
#
# Type
# (chr)
# 1 <chr[2]>
library(data.table)
data.table(Type = list(c("aa", "bb")))
# Type
# 1: aa,bb
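Whichever constructor you use, the cell holds the whole vector, which you can get back with [[. A quick check on the base-R I() version:

```r
# The list column stores the full vector in a single cell.
df <- data.frame(Type = I(list(c("aa", "bb"))))
nrow(df)       # one row
df$Type[[1]]   # the stored vector
# [1] "aa" "bb"
```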

Append a data frame to a list

I'm trying to figure out how to add a data.frame or data.table to the first position in a list.
Ideally, I want a list structured as follows:
List of 4
$ :'data.frame': 1 obs. of 3 variables:
..$ a: num 2
..$ b: num 1
..$ c: num 3
$ d: num 4
$ e: num 5
$ f: num 6
Note the data.frame is an object within the structure of the list.
The problem is that I need to add the data frame to the list after the list has been created, and the data frame has to be the first element in the list. I'd like to do this using something simple like append, but when I try:
append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0)
I get a list structured:
str(append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0))
List of 6
$ a: num 2
$ b: num 1
$ c: num 3
$ : num 1
$ : num 2
$ : num 3
It appears that R is coercing the data.frame into a list when I'm trying to append. How do I prevent it from doing so? Or what alternative method might there be for constructing this list, inserting the data.frame into position 1 after the list's initial creation?
The issue you are having is that to put a data frame anywhere into a list as a single list element, it must be wrapped with list(). Let's have a look.
df <- data.frame(1, 2, 3)
x <- as.list(1:3)
If we just wrap with c(), which is what append() is doing under the hood, we get
c(df)
# $X1
# [1] 1
#
# $X2
# [1] 2
#
# $X3
# [1] 3
But if we wrap it in list() we get the desired list element containing the data frame.
list(df)
# [[1]]
# X1 X2 X3
# 1 1 2 3
Therefore, since x is already a list, we will need to use the following construct.
c(list(df), x) ## or append(x, list(df), 0)
# [[1]]
# X1 X2 X3
# 1 1 2 3
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 2
#
# [[4]]
# [1] 3
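A quick check (illustrative, mirroring the question's data) that the wrapped append() really keeps the data frame intact as the first element:

```r
df <- data.frame(a = 2, b = 1, c = 3)
res <- append(list(1, 2, 3), list(df), after = 0)
length(res)              # 4 elements, not 6
is.data.frame(res[[1]])  # TRUE: the data frame survived as one element
```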

R - split + list functions

Could anyone explain the split() and list() functions in R? I am quite confused about how to use them together. For example:
x <- rnorm(10)
a <- gl(2,5)
b <- gl(5,2)
str(split(x, list(a, b)))
The result I get is
List of 10
$ 1.1: num [1:2] 0.1326 -0.0578
$ 2.1: num(0)
$ 1.2: num [1:2] 0.151 0.907
$ 2.2: num(0)
$ 1.3: num -0.393
$ 2.3: num 1.83
$ 1.4: num(0)
$ 2.4: num [1:2] 0.4266 -0.0116
$ 1.5: num(0)
$ 2.5: num [1:2] 0.62 1.64
How are values in x assigned to a level in list(a,b)? Why are there some levels without any values and some with many values? I do not see any relation between the values in x and the levels of list(a,b). Are they randomly assigned?
I'd really appreciate it if someone could help me with this.
When you call split(x, list(a, b)), you are basically saying that two x values are in the same group if they have the same a and b value and are in different groups otherwise.
list(a, b)
# [[1]]
# [1] 1 1 1 1 1 2 2 2 2 2
# Levels: 1 2
#
# [[2]]
# [1] 1 1 2 2 3 3 4 4 5 5
# Levels: 1 2 3 4 5
We can see that the first two elements in x are going to be in group "1.1" (the group where a=1 and b=1), the next two will be in group 1.2, the next one will be in group 1.3, the next one will be in group 2.3, the next two will be in group 2.4, and the last two will be in group 2.5. This is exactly what we see when we call split(x, list(a, b)):
split(x, list(a, b))
# $`1.1`
# [1] -0.2431983 -1.5747339
# $`2.1`
# numeric(0)
# $`1.2`
# [1] -0.1058044 -0.8053585
# $`2.2`
# numeric(0)
# $`1.3`
# [1] -1.538958
# $`2.3`
# [1] 0.8363667
# $`1.4`
# numeric(0)
# $`2.4`
# [1] 0.8391658 -1.0488495
# $`1.5`
# numeric(0)
# $`2.5`
# [1] 0.3141165 -1.1813052
The reason you have extra empty groups (e.g. group 2.1) is that a and b have some pairs of values where there are no x values. From ?split, you can read that the way to not include these in the output is with the drop=TRUE option:
split(x, list(a, b), drop=TRUE)
# $`1.1`
# [1] -0.2431983 -1.5747339
# $`1.2`
# [1] -0.1058044 -0.8053585
# $`1.3`
# [1] -1.538958
# $`2.3`
# [1] 0.8363667
# $`2.4`
# [1] 0.8391658 -1.0488495
# $`2.5`
# [1] 0.3141165 -1.1813052
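Under the hood, split() with a list of factors builds the grouping with interaction(), so the two calls below should be equivalent (a sketch; sep = "." matches split()'s default separator, and interaction() varies the first factor fastest, which is why the group names come out in the order 1.1, 2.1, 1.2, ...):

```r
x <- 1:10
a <- gl(2, 5)   # 1 1 1 1 1 2 2 2 2 2
b <- gl(5, 2)   # 1 1 2 2 3 3 4 4 5 5
s1 <- split(x, list(a, b))
s2 <- split(x, interaction(a, b, sep = "."))
identical(s1, s2)
# [1] TRUE
```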
