Oracle combining columns with comma separated values - oracle11g

I have one table with comma-separated values like "1,2,3,4,5,6,7" per row:
ID Value
101 5,6,7
201 8,9,3
301 3,4,5
The values in the Value column are foreign keys into another table, B:
Table B
5 A
6 C
7 N
Is there any way I can join these two tables together in one query?
I want to pass 101 and get the values A, C, N.

If your model is as shown, something like this?
select a.id, listagg(new_value, ',') within group (order by new_value) new_value
from a
inner join b
  on ',' || a.value || ',' like '%,' || b.value || ',%'
group by a.id
http://www.sqlfiddle.com/#!4/74e46/1
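To return the result for a single id as asked (101 -> A, C, N), just add a filter; a sketch reusing the value / new_value column names assumed by the fiddle above:
select a.id, listagg(new_value, ',') within group (order by new_value) new_value
from a
inner join b
  on ',' || a.value || ',' like '%,' || b.value || ',%'
where a.id = 101
group by a.id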

Related

Join 2 tables/files (CSV or Excel) into 1, keeping the row order of the Id column from table 1 and the same column headers in the 2 tables

For example, how can I join these 2 tables/files into 1, keeping the row order of the Id column from file/table 1 and the same column headers from the 2 tables?
File/table 1: (screenshot)
File/table 2: (screenshot)
You only have screenshots of your data, but I'm guessing you want this?
DF1 <- read.csv("table1.csv")
DF2 <- read.csv("table2.csv")
combined <- merge(DF1, DF2, by="Id")
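If the joined result should also be written back to a file, a minimal follow-up sketch (the output file name joined.csv is just an illustration):
write.csv(combined, "joined.csv", row.names = FALSE)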

Moving sum with dates for Teradata

I have a situation where I have to create a moving sum for the past 6 months. My data looks like
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried the solution proposed in an earlier thread: inserting dummy records for dates that have no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW.
But there may be situations where there are multiple records on the same day, which means that choosing the last 181 rows will lead to the earliest record getting dropped.
I have checked a lot of cases on this forum and others but can't find a solution for a moving sum where the window size is not constant. Please help.
Teradata doesn't implement RANGE in windowed aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high it's very efficient, but it needs an intermediate table (unless the GROUP BY columns are already the PI of the source table). The self-join on the PI columns results in an AMP-local direct join plus local aggregation; without matching PIs it would be a less efficient join plus global aggregation:
create volatile table vt as
( select a,b,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.a, t1.b, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (the n*m self-join would inflate the number of rows going into the sum). You need some unique column combination; this defect_id might be useful.
Otherwise you need to switch to a scalar subquery, which takes care of the non-uniqueness but is usually less efficient:
create volatile table vt as
( select a,b,defect_id,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.*
,(select sum(t2.sumcol)
from vt as t2
where t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
)
from vt as t1
To use your existing approach you must first aggregate those multiple rows per day, e.g. as sketched below.
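A minimal sketch of that pre-aggregation, assuming the dummy rows for missing dates are already in place and reusing the column names from the volatile-table example above (a, b, datecol, sumcol; the alias daily is just illustrative):
select a, b, datecol,
       sum(daily_sum) over (partition by a, b
                            order by datecol
                            rows between 181 preceding and current row) as moving_sum
from
 ( select a, b, datecol, sum(sumcol) as daily_sum
   from mytable
   group by a, b, datecol
 ) as daily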

How to calculate a column by adding the value of the previous row in SQLite?

What I want to do is: when I select records from the table, the last column should be a running total of the difference of two columns. In the first record, the last column (i.e. the subtraction of the two columns) will be [Value1] - [Value2], where [Value1] and [Value2] are columns of the table.
The second record will then be:
value of (previous row's last column) + ([Value1] - [Value2])
and so on for the next record, and so on.
The columns are as below,
[ID],[Value1],[Value2]
Now the records will be like below,
[ID] [Value1] [Value2] [Result]
1 10 5 10 - 5 = 5
2 15 7 5 + (15 - 7) = 13
3 100 50 13 + (100 - 50) = 63
and so on......
SQLite doesn't support running totals, but for your data and your desired result it's possible to factor out the arithmetic and write the query like this:
SELECT t.id, t.value1, t.value2, SUM(t1.value1 - t1.value2)
FROM table1 AS t
JOIN table1 AS t1 ON t.id >= t1.id
GROUP BY t.id, t.value1, t.value2
http://sqlfiddle.com/#!7/efaf1/2/0
This query will slow down as your row count increases. So, if you're planning to run this on a large table, you may want to run the calculation outside of SQLite.
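For completeness: if your SQLite build includes window functions (version 3.25 or later), the running total can also be written directly; a sketch against the same table1:
SELECT id, value1, value2,
       SUM(value1 - value2) OVER (ORDER BY id) AS result
FROM table1;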

R Multiple condition join using data.table

I have a large dataset and a lookup table. For each row of the dataset I need to return the smallest value present among the lookup rows whose conditions are met.
Given the size of my dataset I'm reluctant to hack an iffy solution together by cross-joining, as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base R or data.table, since these are already in use in an efficient manner.
Example
library(data.table)
A <- seq(1e4, 9e4, 1e4)
B <- seq(0, 1e4, 1e3)
dt1 <- data.table(expand.grid(A, B), ID = 1:nrow(expand.grid(A, B)))
setnames(dt1, c("Var1", "Var2"), c("A", "B"))
lookup <- data.table(minA = c(1e4, 1e4, 2e4, 2e4, 5e4),
                     maxA = c(2e4, 3e4, 7e4, 6e4, 9e4),
                     minB = rep(2e3, 5),
                     Val = seq(.1, .5, .1))
# Sample Desired Value
A B ID Val
99: 90000 10000 99 0.5
In SQL, I would then write something along the lines of
SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
and dt1.A<=lookup.maxA
and dt1.B>=lookup.minB
GROUP BY ID, A, B
Which would join all matching records from lookup to dt1 and return the smallest Val.
Update
My solution so far looks like:
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
list(Val=Val[which.min( Val)]),
by=list(ID,A,B)]
dt1.lookup<-rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])
This retrieves all records and allows the return of additional columns from the lookup table if I need them. It also has the benefit of enforcing the pick of the minimum Val.
A solution I found without cross joining first needs to prepare the data by getting rid of rows where A and B are out of range entirely:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions are met that correspond to the lowest possible Val:
Indices = Prep[,list(min(which(A >= lookup$minA)),
min(which(A <= lookup$maxA)),
min(which(B >= lookup$minB)), A, B),by=ID]
Then you must get Val at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
ID Val A B
1: 19 0.1 10000 2000
2: 20 0.1 20000 2000
3: 21 0.2 30000 2000
4: 22 0.3 40000 2000
5: 23 0.3 50000 2000
6: 24 0.3 60000 2000
7: 25 0.3 70000 2000
8: 26 0.5 80000 2000
9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
My first thought was trying to make an index like Senor O did. However, the min(Val) made the index table tougher for me to think through. The way I thought to do it was to loop through the lookup table.
dt1[, Val := as.numeric(NA)]
for (row in 1:NROW(lookup)) {
  dt1[A >= lookup[order(Val)][row, minA] &
      A <= lookup[order(Val)][row, maxA] &
      B >= lookup[order(Val)][row, minB] &
      is.na(Val),
      Val := lookup[order(Val)][row, Val]]
}
I think this should work because it first sets the new column to NA values.
Then it orders the lookup table by Val, so you always take the lowest value first.
At each iteration it only changes values in dt1 that are still NA in Val, and since we're looping through lookup from the smallest Val to the biggest, this ensures you get the min(Val) that you wanted.
Replace the rbind.fill line with
rbindlist(list(dt1.lookup, dt1[!ID %in% dt1.lookup[,ID]][,list(ID, A, B, Val=as.numeric(NA))]))
It will eliminate the reliance on the plyr package (which provides rbind.fill), and I think it'll be faster.
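For reference, newer data.table versions (1.9.8 and later) support non-equi joins, which avoid the cartesian step entirely; a sketch using the dt1 and lookup objects defined in the question (rows of dt1 with no matching lookup row get NA):
# for each row of dt1, the smallest Val whose lookup ranges contain A and B
dt1[, Val := lookup[dt1,
                    on = .(minA <= A, maxA >= A, minB <= B),
                    min(Val),
                    by = .EACHI]$V1]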

merge data.tables when the number of key columns is different

I am attempting to understand the join logic in data.table from the documentation and am a bit unclear. I know I can just try this and see what happens, but I would like to make sure that there is no pathological case, and therefore I would like to know how the logic was actually coded. When two data.table objects have a different number of key columns, for example a has 2 and b has 3, and you run c <- a[b], will a and b be merged simply on the first two key columns, or will the third column in a be automatically matched to the 3rd key column in b? An example:
require(data.table)
a <- data.table(id=1:10, t=1:20, v=1:40, key=c("id", "t"))
b <- data.table(id=1:10, v2=1:20, key="id")
c <- a[b]
This should select rows of a that match the id key column in b. For example, for id==1 in b, there are 2 rows in b and 4 rows in a that should generate 8 rows in c. This is indeed what seems to happen:
> head(c,10)
id t v v2
1: 1 1 1 1
2: 1 1 21 1
3: 1 11 11 1
4: 1 11 31 1
5: 1 1 1 11
6: 1 1 21 11
7: 1 11 11 11
8: 1 11 31 11
9: 2 2 2 2
10: 2 2 22 2
The other way to try it is to do:
d <- b[a]
This should do the same thing: for every row in a it should select the matching row in b. Since a has an extra key column, t, that column should not be used for matching, and a join based only on the first key column, id, should be done. It seems like this is the case:
> head(d,10)
id v2 t v
1: 1 1 1 1
2: 1 11 1 1
3: 1 1 1 21
4: 1 11 1 21
5: 1 1 11 11
6: 1 11 11 11
7: 1 1 11 31
8: 1 11 11 31
9: 2 2 2 2
10: 2 12 2 2
Can someone confirm? To be clear: is the third key column of a ever used in any of the merges, or does data.table only use the first min(length(key(DT))) key columns of the two tables?
Good question. First, the correct terminology is (from ?data.table):
[A data.table] may have one key of one or more columns. This key can be used for row indexing instead of rownames.
So "key" (singlular) not "keys" (plural). We can get away with "keys", currently. But when secondary keys are added in future, there may then be multiple keys. Each key (singular) can have multiple columns (plural).
Otherwise you're absolutely correct. The following paragraph was improved in v1.8.2 based on feedback from others who were also confused. From ?data.table:
When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then many rows of x will ordinarily match to each row of i since not all of x's key columns will be joined to (a common use case). If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 to column 2, and so on) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant. The columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key.
Following comments, in v1.8.3 (on R-Forge) this now reads (changes in bold) :
When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 of i's key to column 2 of x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)).
Quoting the data.table FAQ:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.
merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ;
whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.
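A tiny illustration of that asymmetry (toy tables, not from the question above):
library(data.table)
X <- data.table(id = c(1, 2), x = 1:2, key = "id")
Y <- data.table(id = c(1, 1), y = 1:2, key = "id")
nrow(X[Y])        # 2: one row per row of Y
nrow(Y[X])        # 3: id 1 matches both Y rows, id 2 has no match (NA row)
nrow(merge(X, Y)) # 2
nrow(merge(Y, X)) # 2: same count either way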
