I would like to remove data points above 97.5% and below 2.5%. I created the following parsimonious data set to explain the issue:
y <- data.table(a = rep(c("b","d"), each = 2, times = 3), c=rep(c("e","f"),
each = 3, times = 2), seq(1,6))
I created the following script to accomplish the task:
require(data.table)
y[, trimErr := ifelse(y$V3 < quantile(y$V3, 0.95) & y$V3 > quantile(y$V3, 0.05),y$V3, NA),
by = list(a,c)]
I then got 4 warning messages, I will only provide the first warning:
Warning messages:
1: In `[.data.table`(y, , `:=`(trimErr, ifelse(y$V3 < quantile(y$V3, :
RHS 1 is length 12 (greater than the size (3) of group 1). The last 9 element(s) will be discarded.
can you please explain to me what the warning means and how can i modify my code.
Would you suggest a better code to remove the top and bottom 2.5% of the data. Thanks in advance.
You're grouping by a and c, but passing in a vector that is the length of the entire data.table, instead of just the data for each group.
You don't need the y$ inside the [.data.table call
y[, trimErr:=ifelse(V3 < quantile(V3, 0.95) & V3 > quantile(V3, 0.05),V3, NA),
by=list(a,c)]
y
# a c V3 trimErr
# 1: b e 1 NA
# 2: b e 2 2
# 3: d e 3 NA
# 4: d f 4 NA
# 5: b f 5 5
# 6: b f 6 NA
# 7: d e 1 NA
# 8: d e 2 2
# 9: b e 3 NA
#10: b f 4 NA
#11: d f 5 5
#12: d f 6 NA
Related
I am impressed by the efficiency R-code could be by using functions and loops.
I will provide a simplified example of the question first, and explain my problem (where the code is probably not replicable).
If I have several vectors which are different in contents and length,like:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
And I have a function to create data.tables from the vector, like:
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = c("a","b","c"))
return(DT)
}
I am wondering, if there is a way to write a loop or function, where I can create all (or some of ) data.tables in the vector by running the function created before?
(probably like)
for i in 1:length(tables_vector){
create_dt(tables_vector, i)
}
And then combine the results in a list, same as the result if you run:
list(create_dt(tables_vector_1,1),create_dt(tables_vector_1,2),create_dt(tables_vector_1,3))
I have tried to use lapply(list(1:3),create_dt,tables_vector = tables_vector_1, i), but it falls, since I don't know how to specify the i argument correctly in lapply().
Here is the explanation why this problem rise:
In the real situation, I have created a function to import data.table from the database:
import_data <- function(tables_vector,i){
end <- Sys.time()
start <- end - 7200
con <- dbConnect("PostgreSQL", dbname="db", host = "host", user=db_user, password=db_password)
query <- sprintf("SELECT %s.timeutc, %s.scal AS %s FROM %s WHERE timeutc BETWEEN '%s' AND '%s' AND mode='General';",
tables_vector[i],tables_vector[i],tables_vector[i], tables_vector[i],start,end)
rs <- dbSendQuery(con, query)
df <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
return(as.data.table(df))
}
And I have tens of vectors which are defined by groups (e.g. vector1 contains channels for purpose 1, vector2 contains channels for purpose 2).
Since they are created for different analysis purposes, I cannot simply combine them in one vector.
Moreover, some vector contains 7, 8 channels, so it is quite annoying to list them by repeating the function one by one.
How about something like this:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = letters[1:i])
return(DT)
}
make_list <- function(x){
lapply(seq_along(x), function(i)create_dt(x, i))
}
make_list(tables_vector_1)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
make_list(tables_vector_2)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
[[4]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
[[5]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
[[6]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
[[7]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
[[8]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
[[9]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
[[10]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Note, I changed the create_dt() function so it did not produce a warning, but the mechanics should still work as intended.
Based on this previous post I build leftOuterJoin which is a function to update a data.table X according to an other data.table Y. The function is defined as follows:
leftOuterJoin <- function(X, Y, onCol) {
.colsY <- names(Y)
X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}
The function works 99% of the time as intended, e.g.:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 <NA> 10
2: 4 g NA
3: 5 h 12
leftOuterJoin(X, Y, "id")
X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
However, for some reason that is unknown to me, it just stops working with some data tables (I have no reproductible example at hand). There is no error, but the data table is not updated. When I use the debug function, everything seems to be working fine, X is updated, but the real data.table isn't. Now, if I just do it outside the function it works. Maybe it is related to the scope of the function? I am really struggling with this problem.
Spec: R v3.5.1 and data.table v1.11.4.
EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:
> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
id L
1: 1 a
2: 2 b
3: 3 <NA>
4: 4 g
5: 5 h
Notice that X is updated but not the way we want it. However, if we use setDT() it works properly:
> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?
Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
B=c("a","b","c","c","f","g",NA,"j",NA),
C=c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare a row (or a vector) with another row (vector), and define two rows to be identical if any of the elements in each row (vector) are. (without reiterating to compare each element)
We can do this by getting the lead rows of the dataset, paste each row, check for any pattern in with the pasteed rows of original dataset using grepl and Map, then unlist and convert to integer
DT[, D := {
v1 <- do.call(paste, .SD)
v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
v2N <- gsub("NA\\|*|\\|*NA", "", v2)
v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
as.integer(head(c(FALSE, v3), -1))
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Or we can do a split and do comparison using Map
as.integer(c(FALSE, unlist(Map(function(x,y) {
x1 <- na.omit(unlist(x))
y1 <- na.omit(unlist(y))
any(x1 %in% y1 | y1 %in% x1) },
split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))] creates a copy of the data.frame, with the lagged variables included (cbinded). Then the second chain intersects the unlisted variables in the original data.table and the shifted data.table. NAs are assigned 0 and non-NAs are assigned 1 and these results are summed. This operation occurs for each row of the copied data.table. The sum is extracted with $v1 and is turned into binary (0 and 1) using sign.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated but id does perform pretty well with just over a second for a 1-million-row data set with three columns.
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
row1 <- as.character(DT[i,])
row2 <- as.character(DT[i+1,])
return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer to get integers, wrap the output of compare in a as.numeric()
Here is a base R solution using intersect:
res <- c(0, sapply(2:nrow(DT), function(i)
length(intersect( na.omit(as.character(DT[i,])), na.omit(as.character(DT[i-1,])) ) )>0))
cbind(DT, D=res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
This solution compares the two rows with %in% (after unlist()):
DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Using a combination of intersect and mapply you could do:
#list of unique elements in each row
tableList = apply(DT,1,function(x) unique(na.omit(x)))
#a lagged list to be compared with above list
tableListLag = c(NA,tableList[2:length(tableList)-1])
#find common elements using intersect function
#if length > 0 implies common elements hence set value as 1 else 0
DT$D = mapply(function(x,y) ifelse(length(intersect(x,y))>0,1,0) ,tableList,tableListLag,
SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The result is assigns differenced values to the "wrong" column because my named vector (d.vars) is ordered differently than the columns in dt. The result is:
The results are as expected, the .SD table's columns are ordered the same way as the names in d.vars.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to match the 'd.vars' and the column names of 'dt' ('d.vars1') and then use it to get the order right
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.0836692 5
# 3: A -0.7834941 7 -1 G -0.5132567 11
# 4: A -1.2765652 10 -1 J -2.0600593 17
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.5388418 7
# 7: B -1.6577614 8 -1 H -1.9029610 13
# 8: B 1.0668059 11 -1 K -0.5909555 19
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F -0.9101709 9
#11: C 0.2951743 9 -1 I -0.4989028 15
#12: C -0.4508854 12 -1 L -0.1557111 21
If I want to add a field to a given data frame and setting it equal to an existing field in the same data frame based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F