data.table difference set of columns from another column

I'm trying to difference a set of columns from another column with data.table. Here's a simple example:
library(data.table)
dt <- data.table(a=1:10,b=11:20,d=21:30)
mycols <- c("b","d")
dt[,c(paste0("diff",mycols)):=lapply(mycols, function(x, env) get(x,env) - get("a",env), env=dt)]
dt
     a  b  d diffb diffd
 1:  1 11 21    10    20
 2:  2 12 22    10    20
 3:  3 13 23    10    20
 4:  4 14 24    10    20
 5:  5 15 25    10    20
 6:  6 16 26    10    20
 7:  7 17 27    10    20
 8:  8 18 28    10    20
 9:  9 19 29    10    20
10: 10 20 30    10    20
My question is whether there is a better syntax for this with data.table? The issue is that the column "a" is not defined within the scope of the function, so I have to use get to make it work.

You can subset .SD using mycols and subtract a:
dt[, paste0("diff", mycols) := .SD[, mycols, with = FALSE] - a ]
#      a  b  d diffb diffd
#  1:  1 11 21    10    20
#  2:  2 12 22    10    20
#  3:  3 13 23    10    20
#  4:  4 14 24    10    20
#  5:  5 15 25    10    20
#  6:  6 16 26    10    20
#  7:  7 17 27    10    20
#  8:  8 18 28    10    20
#  9:  9 19 29    10    20
# 10: 10 20 30    10    20
As Frank pointed out in the comments, this also works:
dt[, paste0("diff", mycols) := .SD - dt$a, .SDcols=mycols]
Not sure what's better practice, though.
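A minimal sketch of a third variant, assuming a recent data.table release: a column referenced inside j (here a) is visible to a function passed to lapply(.SD, ...), so neither get() nor dt$a is needed. The argument name col is only illustrative.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 11:20, d = 21:30)
mycols <- c("b", "d")

# .SD holds only the columns named in .SDcols; `a` is still found
# because it is a column of dt that is referenced inside j.
dt[, paste0("diff", mycols) := lapply(.SD, function(col) col - a),
   .SDcols = mycols]
```

This keeps the column selection in .SDcols and the arithmetic in one place, which may be easier to extend if the operation is more than a plain subtraction.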

Related

Extracting the first element from strsplit, applied across each row element in data.table in R

I have the following dataset:
library(data.table)
x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
a b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19
and I would like to get a new dataset which has
> x
   a     b  c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
so that it takes the first element of column b's elements. I tried to do
x[,c:=unlist(strsplit(b, " "))[[1]][1]]
but it doesn't work. Is there a way to apply such a thing in data.table?

You can use stringr::str_split_i to take the first element of each split string:
library(stringr)
x[, c := str_split_i(b, " ", 1)]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

Use tstrsplit from data.table:
x[,c := tstrsplit(b," ")[1]]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
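
Another option is readr::parse_number, which extracts the first number in each string and returns it as a numeric value:
x[, c := readr::parse_number(b)]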
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

We can use sapply() along with strsplit() and retain the first element from each vector in the list.
x$c <- sapply(strsplit(x$b, " "), `[[`, 1)
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
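A short sketch of why the attempt in the question fills every row with the same value, followed by a per-row split. It assumes a data.table version in which tstrsplit() has the keep argument; the names flat and first_of_flat are only illustrative.

```r
library(data.table)

x <- data.table(a = c(1:3, 1), b = c("12 13", "14 15", "16 17", "18 19"))

# unlist() flattens all pieces into one vector, so [[1]][1] is a single
# string that := would then recycle down the whole column.
flat <- unlist(strsplit(x$b, " "))
first_of_flat <- flat[[1]][1]

# tstrsplit() splits per row; keep = 1L retains only the first piece.
x[, c := tstrsplit(b, " ", fixed = TRUE, keep = 1L)]
```

The new column is character, as with the strsplit-based answers; wrap it in as.integer() if numbers are needed.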

functions with data.table variables which names are stored in a character vector

I am not a big data.table expert but I am somehow puzzled by some things. Here is my simple example:
test<-data.table(x= 1:10,y= 1:10,z= 1:10, l = 11:20,d= 21:30)
test<-test[,..I:=.I]
vec_of_names = c("z","l","d")
function_test<-function(x,y){
sum(x)+y
}
vec_of_final_names<-c("sum_z","sum_l","sum_d")
When I then attempt to do something like this:
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=.SD,y=eval(parse(text=vec_of_names[i]))),.SDcols=c("x","y"),by=..I]
}
I get an error:
Error in eval(expr, envir, enclos) : object 'z' not found
Whereas code below works perfectly fine but is a little bit ugly and also slow. Maybe somebody can suggest better alternatives.
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=eval(parse(text=paste("c(",paste(c("x","y"),collapse=","),")",sep=""))),y=eval(parse(text=vec_of_names[i]))),by=..I]
}

After specifying .SDcols and grouping by ..I (an unusual name for a column), we unlist .SD and take its sum, fetch the columns named in 'vec_of_names' as a list with mget, add each of them to sum(unlist(.SD)) with Map, and assign (:=) the results to 'vec_of_final_names' to create the new columns:
test[, (vec_of_final_names) := Map(`+`, sum(unlist(.SD)),
mget(vec_of_names)), by = ..I, .SDcols = x:y]
Based on the example, this can also be done without the grouping variable
test[, (vec_of_final_names) := Map(`+`, list(x+y), mget(vec_of_names))]
Or by specifying the .SDcols
test[, (vec_of_final_names) := Map(`+`, list(Reduce(`+`, .SD)),
mget(vec_of_names)), .SDcols = x:y]
Or using the OP's function
test[, (vec_of_final_names) := Map(function_test, list(unlist(.SD)),
mget(vec_of_names)), by = ..I, .SDcols = x:y]
test
# x y z l d ..I sum_z sum_l sum_d
# 1: 1 1 1 11 21 1 3 13 23
# 2: 2 2 2 12 22 2 6 16 26
# 3: 3 3 3 13 23 3 9 19 29
# 4: 4 4 4 14 24 4 12 22 32
# 5: 5 5 5 15 25 5 15 25 35
# 6: 6 6 6 16 26 6 18 28 38
# 7: 7 7 7 17 27 7 21 31 41
# 8: 8 8 8 18 28 8 24 34 44
# 9: 9 9 9 19 29 9 27 37 47
#10: 10 10 10 20 30 10 30 40 50
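For comparison, a loop-based sketch that avoids eval(parse(...)) by looking columns up by name with [[ ]] and assigning with set(). It assumes, as in the example, that every group is a single row, so the per-group sum of x and y is just x + y; row_sum is an illustrative name.

```r
library(data.table)

test <- data.table(x = 1:10, y = 1:10, z = 1:10, l = 11:20, d = 21:30)
vec_of_names <- c("z", "l", "d")
vec_of_final_names <- c("sum_z", "sum_l", "sum_d")

# With one row per group, sum(x, y) per row is simply x + y.
row_sum <- test$x + test$y

for (i in seq_along(vec_of_names)) {
  # [[ ]] looks a column up by a name held in a character vector.
  set(test, j = vec_of_final_names[i],
      value = row_sum + test[[vec_of_names[i]]])
}
```

If groups can span several rows, this shortcut no longer applies and one of the grouped versions above is the safer choice.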

Add data frames row wise with [d]plyr

I have two data frames
df1
# a b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
# a b
# 1 4 8
I want the following output:
df3
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?

I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).

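One-liner with dplyr (mutate_each() and funs() are deprecated in current dplyr releases; across() is the replacement there).
mutate_each(df1, funs(.+ df2$.), a:b)
# with dplyr >= 1.0.0: mutate(df1, across(a:b, ~ .x + df2[[cur_column()]]))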
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33

A base R solution using sweet function sweep:
sweep(df1, 2, unlist(df2), '+')
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
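A self-contained base R sketch of the same column-wise addition. The data frames are rebuilt from the printed values in the question (the original construction code is not shown, so the column types are an assumption), and columns are matched by name rather than by position.

```r
df1 <- data.frame(a = 10:15, b = 20:25)
df2 <- data.frame(a = 4, b = 8)

# Map() walks the columns of df1 and df2 in parallel; the single value in
# each df2 column is recycled along the matching df1 column.
df3 <- as.data.frame(Map(`+`, df1, df2[names(df1)]))
```

Indexing df2 with names(df1) guards against the two data frames listing their columns in a different order.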

Fastest way to remove same number of NA from each column and realign data

This extends a previous post, which gives the following result:
x y z
1: 1 NA NA
2: 2 NA 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: NA 18 28
9: NA 19 NA
10: NA NA NA
As you can see, if NAs of each column are removed, we can obtain data.table as follows:
x y z
1: 1 13 22
2: 2 14 23
3: 3 15 24
4: 4 16 25
5: 5 17 26
6: 6 18 27
7: 7 19 28
I came up with this code to obtain the above result:
mat.temp <- na.omit(mat[,1, with = F])
for (i in 2:3) {
temp <- na.omit(mat[,i, with = F])
mat.temp <- cbind(mat.temp, temp)
}
However, I am not sure it is efficient.
Could you please give me suggestions?
Thank you

It sounds like you are just trying to do:
DT[, lapply(.SD, function(x) x[!is.na(x)])]
# x y z
# 1: 1 13 22
# 2: 2 14 23
# 3: 3 15 24
# 4: 4 16 25
# 5: 5 17 26
# 6: 6 18 27
# 7: 7 19 28
However, I'm not sure how well this would hold up if you have a different number of NA values in each column.
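One way to hedge against unequal NA counts is to pad every cleaned column to a common length. This is a sketch, not part of the original answer: DT is rebuilt from the printed table in the question, and n_keep and res are illustrative names.

```r
library(data.table)

DT <- data.table(x = c(1:7, NA, NA, NA),
                 y = c(NA, NA, 13:19, NA),
                 z = c(NA, 22:28, NA, NA))

# Longest run of non-missing values across the columns.
n_keep <- max(sapply(DT, function(col) sum(!is.na(col))))

res <- DT[, lapply(.SD, function(col) {
  v <- col[!is.na(col)]
  length(v) <- n_keep   # extending a vector pads it with NA
  v
})]
```

With the data above every column has seven non-missing values, so no padding occurs and the result matches the table shown in the answer.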

copy() in data.table, R

I am attempting to assign a column by reference after I subsetted a data.table and assigned the return value to another data.table like so (toy example):
> x <- data.table(a=1:10, b=11:20, c=21:30)
> x
a b c
1: 1 11 21
2: 2 12 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: 8 18 28
9: 9 19 29
10: 10 20 30
> y <- x[a==1 | a == 2, list(a,b,c)]
> y[,d:=a+b]
Error in `[.data.table`(y, , `:=`(d, a + b)) :
It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to datatable-help.
I don't exactly understand the issue: is it that the returned y is simply a "view" into the same memory as x and thus one should copy x before setting a column by reference?
Thanks

Unable to reproduce the error with data.table 1.8.2 in R 2.15.1:
> x <- data.table(a=1:10, b=11:20, c=21:30); x
a b c
1: 1 11 21
2: 2 12 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: 8 18 28
9: 9 19 29
10: 10 20 30
>
> y <- x[x$a==1 | x$a == 2, list(a,b,c)]
>
> y
a b c
1: 1 11 21
2: 2 12 22
> y[,d:=a+b]
a b c d
1: 1 11 21 12
2: 2 12 22 14
