Reduce redundant SQLite nested query?

How can I reduce this nested query so that X,Y,Z are filtered prior to checking AA?
This works, but it is expensive since the X, Y, Z filter is re-evaluated for each subquery.
Only AA needs to be checked in each.
SELECT 3*b3.bin3 + 2*b2.bin2 + b1.bin1 FROM
(SELECT count(*) AS bin1 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 1 AND `AA` <= 2) b1
JOIN
(SELECT count(*) AS bin2 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 2.01 AND `AA` <= 3) b2
JOIN
(SELECT count(*) AS bin3 FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2 AND
`AA` >= 3.01 AND `AA` <= 4) b3;

You might be able to use WITH ... AS (a common table expression); the syntax you may know from SQL Server 2008 also works in SQLite, which has supported CTEs since version 3.8.3. Try this:
WITH b AS
(SELECT *
 FROM TD
 WHERE
 x = 1 AND
 y >= 2 AND
 z >= 2)
SELECT 3*b3.bin3 + 2*b2.bin2 + b1.bin1 FROM
(SELECT count(*) AS bin1
 FROM b
 WHERE AA >= 1 AND AA <= 2) b1
JOIN
(SELECT count(*) AS bin2
 FROM b
 WHERE AA >= 2.01 AND AA <= 3) b2
JOIN
(SELECT count(*) AS bin3
 FROM b
 WHERE AA >= 3.01 AND AA <= 4) b3;

--REDUCED FORM from Golden Ratio's hint.
WITH `v` AS
(SELECT `AA` FROM `TD` WHERE
`X` = 1 AND
`Y` >= 2 AND
`Z` >= 2)
SELECT 3*bin3 + 2*bin2 + bin1 FROM
(SELECT count(*) AS bin1 FROM `v` WHERE
`AA` >= 1 AND `AA` <= 2)
JOIN
(SELECT count(*) AS bin2 FROM `v` WHERE
`AA` >= 2.01 AND `AA` <= 3)
JOIN
(SELECT count(*) AS bin3 FROM `v` WHERE
`AA` >= 3.01 AND `AA` <= 4);
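For anyone who wants to try this from R, here is a hedged sketch using DBI/RSQLite on an in-memory database; the toy rows are made up purely for illustration.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# made-up toy rows: X/Y/Z always pass the filter, only AA varies
dbWriteTable(con, "TD",
             data.frame(X = 1, Y = 2, Z = 2, AA = c(0.5, 1.5, 1.9, 2.5, 3.2, 3.9)))
dbGetQuery(con, "
  WITH v AS
    (SELECT AA FROM TD WHERE X = 1 AND Y >= 2 AND Z >= 2)
  SELECT 3*bin3 + 2*bin2 + bin1 AS score FROM
    (SELECT count(*) AS bin1 FROM v WHERE AA >= 1    AND AA <= 2)
  JOIN
    (SELECT count(*) AS bin2 FROM v WHERE AA >= 2.01 AND AA <= 3)
  JOIN
    (SELECT count(*) AS bin3 FROM v WHERE AA >= 3.01 AND AA <= 4)
")
# score = 3*2 + 2*1 + 2 = 10 for these toy rows
dbDisconnect(con)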


How to write ifelse statement with multiple conditions in R?

I have a problem writing an ifelse statement. I have the columns shown below:
Team 1 Winner
T1 T1
T2 T1
T2 NA
T3 NA
I want another column, Result, such that if Team = Winner it should be "winner", else "loser"; and if Winner is NA then it should be "no result"...
Team 1 Winner result
T1 T1 winner
T2 T1 loser
T2 NA noresult
T3 NA noresult
Any help would be appreciated.
Another possibility is with case_when from dplyr:
library(dplyr)
df %>%
mutate(Result = case_when(
Team == Winner ~ "Winner",
Team != Winner ~ "Loser",
is.na(Winner) ~ "No result"
))
# Team Winner Result
# 1 T1 T1 Winner
# 2 T2 T1 Loser
# 3 T2 <NA> No result
# 4 T3 <NA> No result
Data:
tt <- "Team Winner
T1 T1
T2 T1
T2 NA
T3 NA"
df <- read.table(text=tt, header = T, stringsAsFactors = F)
You can use dplyr::if_else(). As I learned, it is strict: it checks the data types of the true/false outcomes, and it handles NAs via the missing argument, which keeps the code simple:
df %>% mutate(Result = if_else( Team==Winner, "Winner", "Loser", missing ='No result'))
Team Winner Result
1 T1 T1 Winner
2 T2 T1 Loser
3 T2 <NA> No result
4 T3 <NA> No result
That said, for your example data this one-liner is not the fastest (the winner is Tim Biegeleisen's answer, +1):
Unit: microseconds
expr min lq mean median uq max neval cld
IF_ELSE 893.013 974.5060 1176.35331 1053.2260 1343.3590 2278.398 100 b
IFELSE 20.481 34.3475 49.57934 47.3605 58.0275 143.361 100 a
CASE 1067.946 1152.4255 1423.41426 1226.0255 1721.3850 4108.795 100 c
So there is a trade-off between simplicity (subjective, of course), control (objective, given how strictly the functions check their inputs), and speed (also objective, though it only matters if it is an issue for your real data).
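For reference, a benchmark along these lines can be set up as follows (a sketch: I am assuming the three expressions timed above are essentially the three answers in this thread):
library(microbenchmark)
microbenchmark(
  IF_ELSE = df %>% mutate(Result = if_else(Team == Winner, "Winner", "Loser",
                                           missing = "No result")),
  IFELSE = ifelse(is.na(df$Winner), "no result",
                  ifelse(df$Team == df$Winner, "winner", "loser")),
  CASE = df %>% mutate(Result = case_when(
    Team == Winner ~ "Winner",
    Team != Winner ~ "Loser",
    is.na(Winner) ~ "No result"
  )),
  times = 100L
)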
Use -
df$Winner <- factor(df[,2], levels=unique(df$Team.1)) # avoid "level sets of factors are different" error
df$result <- ifelse(df$Team.1 == df$Winner, "winner", "loser")
df[is.na(df$result), "result"] <- "noresult"
df
Output
Team.1 Winner result
1 T1 T1 winner
2 T2 T1 loser
3 T2 <NA> noresult
4 T3 <NA> noresult
Try this logic:
df$result <- ifelse(is.na(df$Winner), "no result",
ifelse(df$Team==df$Winner, "winner", "loser"))
df
Team Winner result
1 T1 T1 winner
2 T2 T1 loser
3 T2 <NA> no result
4 T3 <NA> no result

data.table: calculate statistics of row times within a moving time window

library(data.table)
library(lubridate)
df <- data.table(col1 = c('A', 'A', 'A', 'B', 'B', 'B'), col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28", "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40"))
For each row I want to calculate the standard deviation of the times (col2) of rows that have the same value of col1 and whose time falls within the window of the past 10 minutes before this row's time (inclusive).
I use the following approach:
df$col2 <- as_datetime(df$col2)
gap <- 10L
df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2)
, on = .(col1, col2 >= t1, col2 <= t2)
, .(sd_time = sd(as.numeric(col2))), by = .EACHI]$sd_time][]
As a result I see only NA values instead of values in seconds.
For example, for the third row (col1 = "A" and col2 = "2015-03-06 01:45:28")
I calculated it manually as follows:
v <- c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28")
v <- as_datetime(v)
sd(v)
# [1] 233.5815
Two alternative data.table solutions (variations on my previous answer):
# option 1
df[.(col1 = col1, t1 = col2, t2 = col2 + gap * 60L)
, on = .(col1, col2 >= t1, col2 <= t2)
, .(col1, col2 = x.col2, times = as.numeric(t1))
][, .(feat1 = sd(times))
, by = .(col1, col2)]
# option 2
df[, feat1 := .SD[.(col1 = col1, t1 = col2, t2 = col2 + gap * 60L)
, on = .(col1, col2 >= t1, col2 <= t2)
, .(col1, col2 = x.col2, times = as.numeric(t1))
][, .(sd_times = sd(times))
, by = .(col1, col2)]$sd_times][]
which both give:
col1 col2 feat1
1: A 2015-03-06 00:37:57 NA
2: A 2015-03-06 00:39:57 84.85281
3: A 2015-03-06 00:45:28 233.58153
4: B 2015-03-06 01:31:44 NA
5: B 2015-03-06 02:55:45 NA
6: B 2015-03-06 03:01:40 251.02291
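As a sanity check, one cell of this result can be recomputed directly from the rolling definition (a quick sketch on the example data; the third row of group "A" is the one worked out by hand in the question):
# times of group "A" that fall in the 10-minute window ending at the third row's time
idx <- df$col1 == "A" &
  df$col2 <= df$col2[3] &
  df$col2 >= df$col2[3] - gap * 60L
sd(as.numeric(df$col2[idx]))
# [1] 233.5815  -- matches the manual calculation above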
A pure data.table solution:
df[,col3:=as.numeric(col2)]
df[, feat1 := {
d <- df$col3 - col3
sd(df$col3[col1 == df$col1 & d <= 0 & d >= -gap * 60L])
},
by = list(col3, col1)]
Another way to loop over all combinations of col1, col2 with mapply:
df[,col3:=as.numeric(col2)]
df[, feat1:=mapply(Date = col3,ID = col1, function(Date, ID) {
DateVect=df[col1 == ID,col3]
d <- DateVect - Date
sd(DateVect[d <= 0 & d >= -gap * 60L])})][]

data.table: count rows within a moving time window

library(data.table)
library(lubridate)
df <- data.table(col1 = c('B', 'A', 'A', 'B', 'B', 'B'), col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28", "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40"))
For each row I want to count the number of rows that have the same value of col1 and whose time falls within the window of the past 10 minutes before this row's time (inclusive).
I run the following code:
df$col2 <- as_datetime(df$col2)
window = 10L
(counts = setDT(df)[.(t1=col2-window*60L, t2=col2), on=.((col2>=t1) & (col2<=t2)),
.(counts=.N), by=col1]$counts)
df[, counts := counts]
and got the following error:
Error in `[.data.table`(setDT(df), .(t1 = col2 - window * 60L, t2 = col2), : Column(s) [(col2] not found in x
I want a result like this:
col1 col2 counts
B 2015-03-06 01:37:57 1
A 2015-03-06 01:39:57 1
A 2015-03-06 01:45:28 2
B 2015-03-06 02:31:44 1
B 2015-03-06 03:55:45 1
B 2015-03-06 04:01:40 2
A possible solution:
df[.(col1 = col1, t1 = col2 - window * 60L, t2 = col2)
, on = .(col1, col2 >= t1, col2 <= t2)
, .(counts = .N), by = .EACHI][, (2) := NULL][]
which gives:
col1 col2 counts
1: B 2015-03-06 01:37:57 1
2: A 2015-03-06 01:39:57 1
3: A 2015-03-06 01:45:28 2
4: B 2015-03-06 02:31:44 1
5: B 2015-03-06 03:55:45 1
6: B 2015-03-06 04:01:40 2
A couple of notes about your approach:
You don't need setDT because you already constructed df with data.table(...).
Your on-statement isn't specified correctly: you need to separate the join conditions with a comma, not with &. For example: on = .(col1, col2 >= t1, col2 <= t2)
Use by = .EACHI to get the result for each row.
An alternative approach:
df[, counts := .SD[.(col1 = col1, t1 = col2 - window * 60L, t2 = col2)
, on = .(col1, col2 >= t1, col2 <= t2)
, .N, by = .EACHI]$N][]
which gives the same result.
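As a quick cross-check against the expected output in the question, the same counts can also be computed with a plain loop (a sketch; slower, but it makes the window definition explicit):
counts_check <- sapply(seq_len(nrow(df)), function(i) {
  # rows in the same group whose time falls in the 10-minute window ending at row i
  sum(df$col1 == df$col1[i] &
      df$col2 <= df$col2[i] &
      df$col2 >= df$col2[i] - window * 60L)
})
df[, counts_check := counts_check][]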

How can I create a new column based on conditional statements and dplyr?

x y
2 4
5 8
1 4
9 12
I have four conditions
maxx = 3, minx = 1, maxy = 6, miny = 3. (If minx < x < maxx and miny < y < maxy, then z = apple)
maxx = 6, minx = 4, maxy = 9, miny = 7. (If minx < x < maxx and miny < y < maxy, then z = ball)
maxx = 2, minx = 0, maxy = 5, miny = 3. (If minx < x < maxx and miny < y < maxy, then z = pine)
maxx = 12, minx = 7, maxy = 15, miny = 11. (If minx < x < maxx and miny < y < maxy, then z = orange)
Expected outcome:
x y z
2 4 apple
5 8 ball
1 4 pine
9 12 orange
I have thousands of rows, and these four conditions cover all the values.
How can I do this using the mutate function? I know how to manipulate numbers directly, but I'm not sure how to store a character value based on conditional statements.
I believe the best option here is to use dplyr::case_when
df %>% mutate(z = case_when(
x < 3 & x > 1 & y < 6 & y > 3 ~ "apple" ,
x < 6 & x > 4 & y < 9 & y > 7 ~ "ball" ,
x < 2 & x > 0 & y < 5 & y > 3 ~ "pine" ,
x < 12 & x > 7 & y < 15 & y > 11 ~ "orange"
)
)
Which gives us:
# A tibble: 4 x 3
x y z
<dbl> <dbl> <chr>
1 2 4 apple
2 5 8 ball
3 1 4 pine
4 9 12 orange
Alternative answer:
library(mosaic)
df <- mutate(df, fruit = derivedFactor(
"apple" = (x<3 & x>1 & y<6 & y>3),
"ball" = (x<6 & x>4 & y<9 & y>7),
"pine" = (x<2 & x>0 & y<5 & y>3),
"orange" = (x<12 & x>7 & y<15 & y>11),
method ="first",
.default = NA
))
Using ifelse, it's
df %>% mutate(z = ifelse(x<3 & x>1 & y<6 & y>3, 'apple',
ifelse(x<6 & x>4 & y<9 & y>7, 'ball',
ifelse(x<2 & x>0 & y<5 & y>3, 'pine',
ifelse(x<12 & x>7 & y<15 & y>11, 'orange', NA))))
)
# x y z
# 1 2 4 apple
# 2 5 8 ball
# 3 1 4 pine
# 4 9 12 orange
Notes:
If you have cases that match two conditions (x = 1.5, y = 4), this will fail.
dplyr also has a between helper function that can reduce your conditions to two calls each, but it uses <= and >=, so you'd need to reconfigure your endpoints (see the sketch after these notes).
You could use switch, but all your conditions would need to be in the first term, which will end up looking exactly like the ifelse version, and your cases will have nothing to do.
If your ranges don't overlap, this is better solved with cut, which is easy to implement for one variable and could be overwritten by a second.
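For illustration, the between() variant mentioned in the notes might look like this (a sketch: between() is inclusive, so I shrink each range by a small eps to mimic the strict inequalities, which is my own choice of workaround):
library(dplyr)
eps <- 1e-9  # shrink the ranges slightly so inclusive between() behaves like < and >
df %>% mutate(z = case_when(
  between(x, 1 + eps, 3 - eps)  & between(y, 3 + eps, 6 - eps)   ~ "apple",
  between(x, 4 + eps, 6 - eps)  & between(y, 7 + eps, 9 - eps)   ~ "ball",
  between(x, 0 + eps, 2 - eps)  & between(y, 3 + eps, 5 - eps)   ~ "pine",
  between(x, 7 + eps, 12 - eps) & between(y, 11 + eps, 15 - eps) ~ "orange"
))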

Cartesian product with filter data.table

I'm trying to replace a Cartesian product produced by SQL with a data.table call.
I have a large history of assets and values, and I need a subset of all combinations.
Let's say that I have a table T = [date, contract, value]. In SQL it looks like:
SELECT a.date, a.contract, a.value, b.contract, b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4
In R I now have the following:
library(data.table)
n <- 1500
dt <- data.table(date = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
contract = c(rep("a", n), rep("b", n), rep("c", n)),
value = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)
dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]
I believe that my solution creates all combinations first (in this case 13,500 rows) and then filters them (down to 3,000). SQL, however (and I might be wrong), joins on a subset and, more importantly, doesn't load all combinations into RAM. Any ideas on how to use data.table more efficiently?
Use the by = .EACHI feature. In data.table, joins and subsets are very closely linked; i.e., a join is just another subset, using a data.table instead of the usual integer / logical / row-name index. They are designed this way with these cases in mind.
Subset-based joins allow you to incorporate j-expressions and grouping operations while joining.
require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow = TRUE]
This is the idiomatic way (in case you'd like to use the i.* columns just for the condition, but not return them as well); however, .SD has not yet been optimised, and evaluating the j-expression on .SD for each group is costly.
system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow = TRUE])
# user system elapsed
# 2.874 0.020 2.983
Some cases using .SD have already been optimised. Until these cases are taken care of, you can work around it this way:
dt[dt, {
idx = contract != i.contract & value + i.value < 4L
list(contract = contract[idx],
value = value[idx],
i.contract = i.contract[any(idx)],
i.value = i.value[any(idx)]
)
}, by = .EACHI, allow = TRUE]
And this takes 0.045 seconds, as opposed to 0.005 seconds with your method. But by = .EACHI evaluates the j-expression group by group as it joins (and is therefore memory efficient). That's the trade-off you'll have to accept.
Since version v1.9.8 (on CRAN 25 Nov 2016), non-equi joins are possible with data.table which can be utilized here.
In addition, OP's approach creates "symmetric duplicates" (a, b) and (b, a). Avoiding duplicates would halve the size of the result set without loss of information (compare ?combn).
If this is the intention of the OP we can use non-equi joins to avoid those symmetric duplicates:
library(data.table)
dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
which gives
date contract value rn i.contract i.value
1: 2013-09-24 a 1 1501 b 2
2: 2013-09-25 a 1 1502 b 2
3: 2013-09-26 a 1 1503 b 2
4: 2013-09-27 a 1 1504 b 2
5: 2013-09-28 a 1 1505 b 2
---
1496: 2017-10-28 a 1 2996 b 2
1497: 2017-10-29 a 1 2997 b 2
1498: 2017-10-30 a 1 2998 b 2
1499: 2017-10-31 a 1 2999 b 2
1500: 2017-11-01 a 1 3000 b 2
as opposed to the result using OP's code
date contract value i.contract i.value
1: 2013-09-24 b 2 a 1
2: 2013-09-24 a 1 b 2
3: 2013-09-25 b 2 a 1
4: 2013-09-25 a 1 b 2
5: 2013-09-26 b 2 a 1
---
2996: 2017-10-30 a 1 b 2
2997: 2017-10-31 b 2 a 1
2998: 2017-10-31 a 1 b 2
2999: 2017-11-01 b 2 a 1
3000: 2017-11-01 a 1 b 2
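To convince yourself that dropping the symmetric duplicates loses no information, a quick consistency check is possible (a sketch; cart and nej are just illustrative names for the two results):
cart <- dt[dt, on = .(date), allow.cartesian = TRUE][
  contract != i.contract & value + i.value < 4]
nej <- dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
# every unordered pair appears twice in the Cartesian result and once in the non-equi join
nrow(cart) == 2L * nrow(nej)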
The next step is to further reduce the number of pairs which are created but need to be filtered out afterwards:
dt[, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]
which returns the same result as above.
Note that the filter condition value + i.value < 4 is replaced by another join condition, val4 > value, where val4 is a helper column created specifically for this purpose.
Benchmark
For a benchmark case of n <- 150000L resulting in 450 k rows in dt the timings are:
n <- 150000L
dt <- data.table(date = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
contract = c(rep("a", n), rep("b", n), rep("c", n)),
value = c(rep(1, n), rep(2, n), rep(3, n)))
dt0 <- copy(dt)
microbenchmark::microbenchmark(
OP = {
dt <- copy(dt0)
dt[dt, on = .(date), allow.cartesian = TRUE][
(contract != i.contract) & (value + i.value < 4)]
},
nej1 = {
dt <- copy(dt0)
dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
},
nej2 = {
dt <- copy(dt0)
dt[, rn := .I][, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]
},
times = 20L
)
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 136.3091 143.1656 246.7349 298.8648 304.8166 311.1141 20 b
nej1 127.9487 133.1772 160.8096 136.0825 146.0947 298.3348 20 a
nej2 180.4189 183.9264 219.5171 185.9385 198.7846 351.3038 20 b
So, doing the check value + i.value < 4 after the join seems to be faster than including it in the non-equi join.
