copy() in data.table, R

I am attempting to assign a column by reference after subsetting a data.table and assigning the result to another data.table, like so (toy example):
> x <- data.table(a=1:10, b=11:20, c=21:30)
> x
a b c
1: 1 11 21
2: 2 12 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: 8 18 28
9: 9 19 29
10: 10 20 30
> y <- x[a==1 | a == 2, list(a,b,c)]
> y[,d:=a+b]
Error in `[.data.table`(y, , `:=`(d, a + b)) :
It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to datatable-help.
I don't exactly understand the issue: is it that the returned y is simply a "view" into the same memory as x and thus one should copy x before setting a column by reference?
Thanks

Unable to reproduce the error with data.table 1.8.2 in R 2.15.1:
> x <- data.table(a=1:10, b=11:20, c=21:30); x
a b c
1: 1 11 21
2: 2 12 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: 8 18 28
9: 9 19 29
10: 10 20 30
>
> y <- x[x$a==1 | x$a == 2, list(a,b,c)]
>
> y
a b c
1: 1 11 21
2: 2 12 22
> y[,d:=a+b]
a b c d
1: 1 11 21 12
2: 2 12 22 14
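To address the "view" part of the question, here is a minimal sketch, assuming a reasonably recent data.table. It illustrates that a row subset is a new object rather than a view, whereas plain assignment with <- shares the object until copy() is used. The names z and w are illustrative only.

```r
library(data.table)

x <- data.table(a = 1:10, b = 11:20, c = 21:30)

# A row subset allocates a new data.table, so := on y leaves x untouched.
y <- x[a == 1 | a == 2, list(a, b, c)]
y[, d := a + b]
"d" %in% names(x)   # FALSE

# Plain assignment does not copy: z and x refer to the same data.table.
z <- x
z[, e := 1L]
"e" %in% names(x)   # TRUE

# copy() makes an independent object that can be modified by reference.
w <- copy(x)
w[, f := 2L]
"f" %in% names(x)   # FALSE
```

So copy() matters when the new name is a plain alias of x, not when it holds the result of a subset.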

Related

Extracting the first element from strsplit, applied across each row element in data.table in R

I have the following dataset:
library(data.table)
x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
a b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19
and I would like to get a new dataset which has
> x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
so that column c holds the first element of each space-separated string in column b. I tried to do
x[,c:=unlist(strsplit(b, " "))[[1]][1]]
but it doesn't work. Is there a way to apply such a thing in data.table?
You can use stringr::str_split_i to take the first element of each split string:
library(stringr)
x[, c := str_split_i(b, " ", 1)]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
Use tstrsplit from data.table:
x[,c := tstrsplit(b," ")[1]]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
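Another option is readr::parse_number(), which extracts the first number from each string (note that c is then numeric rather than character):
x[, c := readr::parse_number(b)]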
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
We can use sapply() along with strsplit() and retain the first element from each vector in the list.
x$c <- sapply(strsplit(x$b, " "), `[[`, 1)
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
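As a hedged illustration of why the attempt in the question fails: unlist() flattens all the split pieces into one vector, so the expression yields a single value that is recycled down the column. The sketch below contrasts that with two per-row alternatives. The column names c_wrong and c2 are made up for the comparison.

```r
library(data.table)

x <- data.table(a = c(1:3, 1), b = c("12 13", "14 15", "16 17", "18 19"))

# The original attempt: one value ("12") is computed and recycled to every row.
x[, c_wrong := unlist(strsplit(b, " "))[[1]][1]]

# Per-row extraction: take element 1 of each split result.
x[, c := vapply(strsplit(b, " ", fixed = TRUE), `[`, character(1), 1L)]

# tstrsplit() with keep = 1L returns only the first piece of each string.
x[, c2 := tstrsplit(b, " ", fixed = TRUE, keep = 1L)]

x
```

vapply() is used instead of sapply() here only so that the result type is guaranteed to be character.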

Grouped application of function that return a data.frame (without a for loop)

I need to apply a function that returns a data.frame across a (grouped) tibble.
Some data:
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
ff <- function(start,end,... ) {
out <- data.frame(T1=c(start,rev(start)),T2=c(end,rev(end)))
return(out)
}
and then I would like to do something like this
library(dplyr)
library(purrr)
df %>%
group_by(g) %>%
pmap_dfr( ff,.keep=TRUE)
to produce a tibble / data.frame like this:
g start end
1 A 1 21
2 A 3 23
3 A 4 24
4 A 5 25
5 A 6 26
6 A 7 27
7 A 8 28
8 A 8 28
9 A 7 27
10 A 6 26
11 A 5 25
12 A 4 24
13 A 3 23
14 A 1 21
15 B 2 22
16 B 9 29
17 B 10 30
18 B 10 30
19 B 9 29
20 B 2 22
That is, the output should be concatenated row-wise, with the group each row belongs to marked somehow.
The functions I would like to apply need arguments from the other columns of the original data.frame (df in the example code), so I thought pmap_dfr would be the correct function to use. But I am confused by the output, so I must be using that function incorrectly.
I would appreciate any help on this.
One option is to use dplyr::group_split() and purrr::map_dfr().
How this works: group_split() will divide your data.frame df into a list of data.frames based on the grouping variables you supply (e.g., g). Next, map_dfr() can be used to apply a function to each element of that list. Because your custom function ff() returns a data.frame without your grouping variable g, you'll want to add that information back to ff() output - this can be accomplished with mutate() as in the example below:
library(dplyr)
library(purrr)
# set seed so that example is reproducible
set.seed(1)
# your example data and function
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
ff <- function(start,end,... ) {
out <- data.frame(T1=c(start,rev(start)),T2=c(end,rev(end)))
return(out)
}
# use group_split & map_dfr
df %>%
# divide df into a list of data.frames based on supplied grouping variables
group_split(g) %>%
# for each element in the list, apply this function
map_dfr(function(df.x) {
with(df.x,
# get the data.frame your function returns
ff(start, end) %>%
# add your grouping variables back-in (stripped by ff)
mutate(g = g[1]))
})
# a short-hand version of the above can be written as:
df %>%
group_split(g) %>%
map_dfr(~ff(.x$start, .x$end) %>% mutate(g = .x$g[1]))
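A possible alternative, sketched here under the assumption that dplyr 0.8.1 or later is installed, is dplyr::group_modify(). It applies a function to each group and adds the grouping column back, so the mutate() step is not needed. The name res is illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(start = 1:10, end = 21:30,
                 g = sample(LETTERS[1:2], 10, replace = TRUE))

ff <- function(start, end, ...) {
  data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
}

res <- df %>%
  group_by(g) %>%
  # .x is the data for one group, without the grouping column
  group_modify(~ ff(.x$start, .x$end)) %>%
  ungroup()

res
```

Each group contributes twice as many rows as it had, and g is kept as the first column.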
The expected result can also be achieved with data.table and lapply.
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
start end g
1: 1 21 B
2: 2 22 A
3: 3 23 A
4: 4 24 A
5: 5 25 A
6: 6 26 B
7: 7 27 A
8: 8 28 B
9: 9 29 B
10: 10 30 B
library(data.table)
setDT(df)
ff <- function(x) {
x <- c(x, rev(x))
return(x)
}
df[,lapply(.SD, ff), .SDcols = c('start', 'end'), by = .(g)]
g start end
1: B 1 21
2: B 6 26
3: B 8 28
4: B 9 29
5: B 10 30
6: B 10 30
7: B 9 29
8: B 8 28
9: B 6 26
10: B 1 21
11: A 2 22
12: A 3 23
13: A 4 24
14: A 5 25
15: A 7 27
16: A 7 27
17: A 5 25
18: A 4 24
19: A 3 23
20: A 2 22
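If the original two-argument ff() has to be kept as is, a minimal sketch is to call it directly in j with by = g. This relies on data.table accepting a data.frame (which is a list of columns) as the result of j. The name res is illustrative.

```r
library(data.table)

set.seed(1)
df <- data.frame(start = 1:10, end = 21:30,
                 g = sample(LETTERS[1:2], 10, replace = TRUE))

# The function from the question, unchanged in behaviour.
ff <- function(start, end, ...) {
  data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
}

# j returns a data.frame per group; data.table binds the pieces row-wise
# and prepends the grouping column g.
res <- as.data.table(df)[, ff(start, end), by = g]
res
```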
It is possible to use dplyr::across() like this:
library(tidyverse)
group_by(df, g) %>%
summarise(across(all_of(c("start", "end"))) %>%
{
ff(.[[1]], .[[2]])
})
#> `summarise()` has grouped output by 'g'. You can override using the `.groups` argument.
#> # A tibble: 20 × 3
#> # Groups: g [2]
#> g T1 T2
#> <chr> <int> <int>
#> 1 A 1 21
#> 2 A 3 23
#> 3 A 4 24
#> 4 A 9 29
#> 5 A 10 30
#> 6 A 10 30
#> 7 A 9 29
#> 8 A 4 24
#> 9 A 3 23
#> 10 A 1 21
#> 11 B 2 22
#> 12 B 5 25
#> 13 B 6 26
#> 14 B 7 27
#> 15 B 8 28
#> 16 B 8 28
#> 17 B 7 27
#> 18 B 6 26
#> 19 B 5 25
#> 20 B 2 22
Created on 2021-12-21 by the reprex package (v2.0.1)
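Note that in dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A hedged sketch of the same idea with reframe(), assuming dplyr >= 1.1.0 and the ff() from the question, might look like this (res is an illustrative name):

```r
library(dplyr)

set.seed(1)
df <- data.frame(start = 1:10, end = 21:30,
                 g = sample(LETTERS[1:2], 10, replace = TRUE))

ff <- function(start, end, ...) {
  data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
}

res <- df %>%
  group_by(g) %>%
  # an unnamed data.frame result is unpacked into its columns T1 and T2
  reframe(ff(start, end))

res
```

Unlike summarise(), reframe() always returns an ungrouped tibble.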

functions with data.table variables which names are stored in a character vector

I am not a big data.table expert but I am somehow puzzled by some things. Here is my simple example:
test<-data.table(x= 1:10,y= 1:10,z= 1:10, l = 11:20,d= 21:30)
test<-test[,..I:=.I]
vec_of_names = c("z","l","d")
function_test<-function(x,y){
sum(x)+y
}
vec_of_final_names<-c("sum_z","sum_l","sum_d")
When I then attempt to do something like this:
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=.SD,y=eval(parse(text=vec_of_names[i]))),.SDcols=c("x","y"),by=..I]
}
I get an error:
Error in eval(expr, envir, enclos) : object 'z' not found
The code below works, but it is ugly and slow. Can somebody suggest a better alternative?
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=eval(parse(text=paste("c(",paste(c("x","y"),collapse=","),")",sep=""))),y=eval(parse(text=vec_of_names[i]))),by=..I]
}
After specifying .SDcols and grouping by ..I (an unusual name for a column), we unlist .SD and take its sum, collect the columns named in vec_of_names into a list with mget, add the sum to each of those columns with Map, and assign (:=) the results to the new columns named in vec_of_final_names:
test[, (vec_of_final_names) := Map(`+`, sum(unlist(.SD)),
mget(vec_of_names)), by = ..I, .SDcols = x:y]
Based on the example, this can also be done without the grouping variable
test[, (vec_of_final_names) := Map(`+`, list(x+y), mget(vec_of_names))]
Or by specifying the .SDcols
test[, (vec_of_final_names) := Map(`+`, list(Reduce(`+`, .SD)),
mget(vec_of_names)), .SDcols = x:y]
Or using the OP's function
test[, (vec_of_final_names) := Map(function_test, list(unlist(.SD)),
mget(vec_of_names)), ..I, .SDcols = x:y]
test
# x y z l d ..I sum_z sum_l sum_d
# 1: 1 1 1 11 21 1 3 13 23
# 2: 2 2 2 12 22 2 6 16 26
# 3: 3 3 3 13 23 3 9 19 29
# 4: 4 4 4 14 24 4 12 22 32
# 5: 5 5 5 15 25 5 15 25 35
# 6: 6 6 6 16 26 6 18 28 38
# 7: 7 7 7 17 27 7 21 31 41
# 8: 8 8 8 18 28 8 24 34 44
# 9: 9 9 9 19 29 9 27 37 47
#10: 10 10 10 20 30 10 30 40 50
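For this particular example, where the per-row sum of x and y is simply x + y, the same columns can be produced without eval(parse()), mget() or grouping. The following is only a sketch under that assumption. It puts the target columns in .SDcols and refers to x and y by name inside j.

```r
library(data.table)

test <- data.table(x = 1:10, y = 1:10, z = 1:10, l = 11:20, d = 21:30)
vec_of_names <- c("z", "l", "d")
vec_of_final_names <- c("sum_z", "sum_l", "sum_d")

# .SD holds z, l and d; x and y are still visible by name inside j.
test[, (vec_of_final_names) := lapply(.SD, function(v) x + y + v),
     .SDcols = vec_of_names]

test
```

If function_test() did something that cannot be vectorised across rows, the grouped versions above would still be needed.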

data.table difference set of columns from another column

I'm trying to difference a set of columns from another column with data.table. Here's a simple example:
library(data.table)
dt <- data.table(a=1:10,b=11:20,d=21:30)
mycols <- c("b","d")
dt[,c(paste0("diff",mycols)):=lapply(mycols, function(x, env) get(x,env) - get("a",env), env=dt)]
dt
a b d diffb diffd
1: 1 11 21 10 20
2: 2 12 22 10 20
3: 3 13 23 10 20
4: 4 14 24 10 20
5: 5 15 25 10 20
6: 6 16 26 10 20
7: 7 17 27 10 20
8: 8 18 28 10 20
9: 9 19 29 10 20
10: 10 20 30 10 20
My question is whether there is a better syntax for this with data.table? The issue is that the column "a" is not defined within the scope of the function, so I have to use get to make it work.
You can subset .SD using mycols and subtract a:
dt[, paste0("diff", mycols) := .SD[, mycols, with = FALSE] - a ]
# a b d diffb diffd
# 1: 1 11 21 10 20
# 2: 2 12 22 10 20
# 3: 3 13 23 10 20
# 4: 4 14 24 10 20
# 5: 5 15 25 10 20
# 6: 6 16 26 10 20
# 7: 7 17 27 10 20
# 8: 8 18 28 10 20
# 9: 9 19 29 10 20
#10: 10 20 30 10 20
As Frank pointed out in the comments, this works, too
dt[, paste0("diff", mycols) := .SD - dt$a, .SDcols=mycols]
Not sure what's better practice, though.
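A third variant, sketched here as one possibility rather than a recommendation, is a plain loop with data.table's set(), which adds each column by reference and avoids any scoping question about a:

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20, d = 21:30)
mycols <- c("b", "d")

# set() adds or overwrites one column per iteration, by reference.
for (col in mycols) {
  set(dt, j = paste0("diff", col), value = dt[[col]] - dt[["a"]])
}

dt
```

This form tends to be useful when the number of columns is large, since set() has little overhead per call.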

Fastest way to remove same number of NA from each column and realign data

This extends a previous post, which gives the following result:
x y z
1: 1 NA NA
2: 2 NA 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: NA 18 28
9: NA 19 NA
10: NA NA NA
As you can see, if NAs of each column are removed, we can obtain data.table as follows:
x y z
1: 1 13 22
2: 2 14 23
3: 3 15 24
4: 4 16 25
5: 5 17 26
6: 6 18 27
7: 7 19 28
I came up with this code to obtain the above result:
mat.temp <- na.omit(mat[,1, with = F])
for (i in 2:3) {
temp <- na.omit(mat[,i, with = F])
mat.temp <- cbind(mat.temp, temp)
}
However, I am not sure it is efficient.
Could you please give me suggestions?
Thank you
It sounds like you are just trying to do:
DT[, lapply(.SD, function(x) x[!is.na(x)])]
# x y z
# 1: 1 13 22
# 2: 2 14 23
# 3: 3 15 24
# 4: 4 16 25
# 5: 5 17 26
# 6: 6 18 27
# 7: 7 19 28
However, I'm not sure how well this would hold up if you have a different number of NA values in each column.
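For the unequal case, one possible approach (a sketch, not part of the original answer) is to drop the NAs per column and then pad every column with trailing NAs up to the longest remaining length. The data below is reconstructed from the table in the question, and DT2, cols, n and res are illustrative names.

```r
library(data.table)

DT <- data.table(x = c(1:7, NA, NA, NA),
                 y = c(NA, NA, 13:19, NA),
                 z = c(NA, 22:28, NA, NA))

# Make the NA counts unequal: y now has 8 non-missing values, x and z have 7.
DT2 <- copy(DT)
DT2[1L, y := 12L]

# Drop NAs per column, then pad each column to the longest length.
cols <- lapply(DT2, function(v) v[!is.na(v)])
n <- max(lengths(cols))
res <- as.data.table(lapply(cols, function(v) v[seq_len(n)]))

res
```

Indexing a vector beyond its length returns NA, which is what pads the shorter columns here.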
