What I'm trying to achieve is this: say I have a data.table DT with 4 columns. I want the number of unique rows for every combination of 2 columns.
DT <- data.table(a = 1:10, b = c(1,1,1,2,2,3,4,4,5,5), c = letters[1:10], d = c(3,3,5,2,4,2,5,1,1,5))
> DT
a b c d
1: 1 1 a 3
2: 2 1 b 3
3: 3 1 c 5
4: 4 2 d 2
5: 5 2 e 4
6: 6 3 f 2
7: 7 4 g 5
8: 8 4 h 1
9: 9 5 i 1
10: 10 5 j 5
I tried the following code:
cols <- colnames(DT)
for(i in 1:(length(cols)-1)) {
for (j in i+1:length(cols)) {
print(unique(DT[,.SD, .SDcols = c(cols[i],cols[j])]))
}
}
Here, basically 'i' goes from first column to second last whereas 'j' is the combining column with 'i'. So the combinations I get are : ab, ac, ad, bc, bd, cd.
But it gives me the following error
Error in [.data.table(DT, , .SD, .SDcols = c(cols[i], cols[j])) :
.SDcols missing at the following indices: [2]
If someone can explain why this is and a way around it, I'll be really grateful. Thanks.
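This is due to operator precedence: the : operator is evaluated before +, so i+1:length(cols) means i + (1:length(cols)), and j runs past the last column. For example, with i = 1 and 4 columns:
> 1+1:length(cols)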
[1] 2 3 4 5
> (1+1):length(cols)
[1] 2 3 4
The correct loop is:
for(i in 1:(length(cols)-1)) {
for (j in (i+1):length(cols)) {
print(unique(DT[,.SD, .SDcols = c(cols[i],cols[j])]))
}
}
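If the goal is only the count of unique rows per column pair, the nested loop can probably be avoided altogether. A minimal sketch, assuming combn() to enumerate the pairs and data.table's uniqueN() to count distinct rows (neither appears in the question; pairs and counts are illustrative names):

```r
library(data.table)

DT <- data.table(a = 1:10, b = c(1,1,1,2,2,3,4,4,5,5),
                 c = letters[1:10], d = c(3,3,5,2,4,2,5,1,1,5))

# Every unordered pair of column names: ab, ac, ad, bc, bd, cd
pairs <- combn(names(DT), 2, simplify = FALSE)

# Number of distinct rows when only the two columns of each pair are considered
counts <- vapply(pairs, function(p) uniqueN(DT, by = p), integer(1))
names(counts) <- vapply(pairs, paste, character(1), collapse = "")

print(counts)
```

Because combn() generates the index pairs itself, there is no i+1:n expression left to get wrong.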
I am impressed by how efficient R code can become when using functions and loops.
I will give a simplified example of the question first, and then explain my actual problem (where the code is probably not reproducible).
Suppose I have several vectors that differ in content and length, like:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
And I have a function to create data.tables from the vector, like:
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = c("a","b","c"))
return(DT)
}
I am wondering if there is a way to write a loop or function that creates all (or some) of the data.tables for a vector by running the function defined above.
(probably like)
for (i in seq_along(tables_vector)) {
create_dt(tables_vector, i)
}
And then combine the results in a list, same as the result if you run:
list(create_dt(tables_vector_1,1),create_dt(tables_vector_1,2),create_dt(tables_vector_1,3))
I have tried lapply(list(1:3), create_dt, tables_vector = tables_vector_1, i), but it fails, since I don't know how to specify the i argument correctly in lapply().
Here is why this problem arises:
In the real situation, I have created a function to import data.table from the database:
import_data <- function(tables_vector,i){
end <- Sys.time()
start <- end - 7200
con <- dbConnect("PostgreSQL", dbname="db", host = "host", user=db_user, password=db_password)
query <- sprintf("SELECT %s.timeutc, %s.scal AS %s FROM %s WHERE timeutc BETWEEN '%s' AND '%s' AND mode='General';",
tables_vector[i],tables_vector[i],tables_vector[i], tables_vector[i],start,end)
rs <- dbSendQuery(con, query)
df <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
return(as.data.table(df))
}
I have tens of vectors defined by group (e.g. vector1 contains channels for purpose 1, vector2 contains channels for purpose 2).
Since they are created for different analysis purposes, I cannot simply combine them into one vector.
Moreover, some vectors contain 7 or 8 channels, so it is quite tedious to call the function for each element one by one.
How about something like this:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = letters[1:i])
return(DT)
}
make_list <- function(x){
lapply(seq_along(x), function(i)create_dt(x, i))
}
make_list(tables_vector_1)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
make_list(tables_vector_2)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
[[4]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
[[5]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
[[6]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
[[7]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
[[8]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
[[9]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
[[10]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Note: I changed create_dt() so that name has the same length as id (letters[1:i]) and no longer produces a warning; the mechanics should still work as intended.
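Since the question mentions tens of such vectors, the same idea can be applied one level up. A rough sketch, assuming the vectors are first collected in a named list (vectors and all_tables are hypothetical names, and create_dt() is the modified version from this answer):

```r
library(data.table)

create_dt <- function(tables_vector, i) {
  data.table(id = 1:i, name = letters[1:i])
}

# Hypothetical container: one named entry per group of channels
vectors <- list(group1 = c(1, 2, 3), group2 = 1:10)

# Outer lapply walks the groups, inner lapply walks the positions in each vector
all_tables <- lapply(vectors, function(v) {
  lapply(seq_along(v), function(i) create_dt(v, i))
})

print(lengths(all_tables))
```

The result is a list of lists, so all_tables$group2[[4]] would be the fourth table built from the second vector.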
Based on this previous post I built leftOuterJoin, a function that updates a data.table X according to another data.table Y. The function is defined as follows:
leftOuterJoin <- function(X, Y, onCol) {
.colsY <- names(Y)
X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}
The function works 99% of the time as intended, e.g.:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 <NA> 10
2: 4 g NA
3: 5 h 12
leftOuterJoin(X, Y, "id")
X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
However, for a reason unknown to me, it stops working with some data tables (I have no reproducible example at hand). There is no error, but the data table is not updated. When I step through with debug(), everything seems to work and X is updated inside the function, but the real data.table isn't. If I run the same statement outside the function, it works. Maybe it is related to the scope of the function? I am really struggling with this problem.
Spec: R v3.5.1 and data.table v1.11.4.
EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:
> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
id L
1: 1 a
2: 2 b
3: 3 <NA>
4: 4 g
5: 5 h
Notice that X is updated, but not the way we want: the existing column L is changed, while the new column N is not added. However, if we use setDT() it works properly:
> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?
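One possible workaround, offered as a sketch rather than a confirmed fix: as I understand it, a table read back from disk has lost its over-allocated column slots, so adding a column inside the function reallocates and the caller's X never sees the new pointer. Capturing the function's return value avoids depending on the in-place update. The round trip through saveRDS()/readRDS() below is only my stand-in for the save()/load() step, and I have not checked this on every data.table version.

```r
library(data.table)

leftOuterJoin <- function(X, Y, onCol) {
  .colsY <- names(Y)
  # := returns the (possibly reallocated) table, which becomes the return value
  X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}

X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

# Mimic a table that was written to disk and read back
tmp <- tempfile(fileext = ".rds")
saveRDS(X, tmp)
X <- readRDS(tmp)

# Reassign instead of relying on update-by-reference through the function
X <- leftOuterJoin(X, Y, "id")
print(names(X))
```

This costs one extra assignment at each call site, but it behaves the same whether or not setDT() was run after loading.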
I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD table's columns are ordered the same way as the names in d.vars.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to reorder 'd.vars' so that it follows the column order of 'dt' (stored as 'd.vars1'), and then use 'd.vars1' for both the new column names and .SDcols so the two stay aligned:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
(L - shift(L, n = 1, type='lag')) * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.62414408 -3
# 3: A -0.7834941 7 -1 G 1.05373150 -3
# 4: A -1.2765652 10 -1 J 0.49307118 -3
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.04844252 -3
# 7: B -1.6577614 8 -1 H 1.41256181 -3
# 8: B 1.0668059 11 -1 K -2.72456737 -3
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F 0.67798333 -3
#11: C 0.2951743 9 -1 I -1.08925140 -3
#12: C -0.4508854 12 -1 L 0.74605969 -3
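Another way to sidestep the ordering question, sketched here under my own assumptions rather than taken from the answer: subset .SD by name inside j, so that lapply always walks the columns in the order of d.vars regardless of how .SD happens to be arranged. The b column below uses fixed values instead of rnorm() so the result is predictable, and by is used in place of keyby.

```r
library(data.table)

dt2 <- data.table(col1 = rep(LETTERS[1:3], 4),
                  b = as.numeric(12:1),   # fixed values, chosen for illustration
                  a = 1:12,
                  neg = -1)

d.vars <- c("a", "b")   # deliberately not in the column order of dt2
negate <- "neg"

dt2[, paste0("d.", d.vars) :=
      lapply(.SD[, d.vars, with = FALSE],   # force the order given by d.vars
             function(L) (L - shift(L, n = 1, type = "lag")) * get(negate)),
    by = col1, .SDcols = d.vars]

print(dt2)
```

Subsetting .SD per group adds some overhead, so on large tables the match-based reordering above may be the better choice.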
I want to add a field to a given data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
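If staying in base R is preferred, the same result can likely be had in one assignment with ifelse(). A small sketch using the data from the question (the choice of ifelse() and NA_character_ is mine, not the answerer's):

```r
is.even <- function(x) x %% 2 == 0

df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"))

# Take b where a is even, otherwise a character NA
df$test <- ifelse(is.even(df$a), as.character(df$b), NA_character_)

print(df)
```

Unlike the data.table version, this evaluates the condition over the whole column and builds a new vector, which is fine for small data frames.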
This is a small challenge within a big project, so I'm going to try to keep this simple.
I'm attempting to conditionally add columns to a data.table, and then process them on a conditional basis.
x <- T
y <- data.table(a = 1:10, b = c(rep(1,5), rep(2,5)))
y[ # filter some rows
a != 1
][ # conditionally add two calculated columns
,
if(x){
`:=` (
c = a*b,
d = 1/b
)
}
][ # process columns and group
,
list(
a = sum(a),
b = sum(b),
if(x) c = sum(c) # only add c if it's created above
),
by = if(x) list(b, d) else list(b) # only group by d if it's created above
]
Here is the output (error references the second set []):
Error in eval(expr, envir, enclos) : object 'd' not found
In addition: Warning message:
In deconstruct_and_eval(m, envir, enclos) :
Caught and removed `{` wrapped around := in j. := and `:=`(...) are
defined for use in j, once only and in particular ways. See help(":=").
Of course, the error is a symptom of the warning. How can I get this done?
As @Michal pointed out, putting the if() statement outside the data.table call is an option:
if(x) {
y[
...
]
} else {
y[
...
]
}
I'm hoping there's a way to get this done without repeating the code in its entirety, to simplify everything.
I can't think of a way of doing it inside the j-expression, because of how := gets evaluated in there (it really only works if it's at the root of the expression tree), but you could put it in the i-expression as a workaround:
x = FALSE
y[a != 1][x, `:=`(c = a * b, d = 1/b)][]
# a b
#1: 2 1
#2: 3 1
#3: 4 1
#4: 5 1
#5: 6 2
#6: 7 2
#7: 8 2
#8: 9 2
#9: 10 2
x = TRUE
y[a != 1][x, `:=`(c = a * b, d = 1/b)][]
# a b c d
#1: 2 1 2 1.0
#2: 3 1 3 1.0
#3: 4 1 4 1.0
#4: 5 1 5 1.0
#5: 6 2 12 0.5
#6: 7 2 14 0.5
#7: 8 2 16 0.5
#8: 9 2 18 0.5
#9: 10 2 20 0.5
Since c(1) is the same as c(1, NULL), and an if() without an else returns NULL when its condition is FALSE, c() can be used to build complete vectors or lists when you're not sure how many elements they will contain.
To conditionally include columns in j:
y[
,
c(
list(
a = sum(a),
b = sum(b)
),
if(x) list(c = sum(c))
)
]
And to conditionally include columns in by:
y[
,
...,
by = c("b", if(x) "d")
]
by won't accept a vector of lists, but it will accept a vector of column names.
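Putting the two pieces together, here is a hedged end-to-end sketch. The wrapper summarise_y() is a hypothetical name of mine, the conditional := is done as a separate statement instead of inside the chain, and sum(b) is left out because b is a grouping column here.

```r
library(data.table)

summarise_y <- function(x) {
  y <- data.table(a = 1:10, b = c(rep(1, 5), rep(2, 5)))
  y <- y[a != 1]                               # filter some rows

  # Add the calculated columns only when requested
  if (x) y[, `:=`(c = a * b, d = 1 / b)]

  y[, c(list(a = sum(a)),                      # always present
        if (x) list(c = sum(c))),              # NULL, hence dropped, when x is FALSE
    by = c("b", if (x) "d")]                   # group by d only if it exists
}

print(summarise_y(TRUE))
print(summarise_y(FALSE))
```

Both calls share one j expression and one by expression, which is the point of the c(..., if (x) ...) idiom.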