I have a situation where I have to create a moving sum for the past 6 months. My data looks like this:
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is:
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried to use the solution proposed in an earlier thread: inserting dummy records for dates where there is no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW.
But there may be situations with multiple records on the same day, which means that taking the last 181 rows would drop the earliest records.
I have checked a lot of cases on this forum and others, but I can't find a solution for this moving sum where the window size (in rows) is not constant. Please help.
Teradata doesn't implement RANGE in windowed aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high it's very efficient, but it needs an intermediate table (unless the GROUP BY columns are the PI of the source table). The self-join on the PI columns results in an AMP-local direct join plus local aggregation; without matching PIs it will be a less efficient join plus global aggregation.
create volatile table vt as
( select a,b,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.a, t1.b, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (the n*m self-join then inflates the sum). You need some unique column combination; that defect_id might be useful, e.g. as sketched below.
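If such a defect_id exists, a minimal sketch of the same join, assuming the volatile table also carries defect_id (as in the variant below); grouping by it keeps every base row in its own group, so the sum is no longer inflated:
select t1.a, t1.b, t1.defect_id, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3,4  -- one group per unique (a, b, defect_id, datecol) row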
Otherwise you need to switch to a Scalar Subquery, which takes care of the non-uniqueness but is usually less efficient:
create volatile table vt as
( select a,b,defect_id,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.*
      ,(select sum(t2.sumcol)
        from vt as t2
        where t1.a=t2.a
          and t1.b=t2.b
          and t2.datecol between t1.datecol -181 and t1.datecol
       ) as moving_sum
from vt as t1
To use your existing approach you must aggregate those multiple rows per day first, e.g. as sketched below.
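A minimal sketch of that pre-aggregation, assuming your existing column names; the derived table collapses the data to one row per day, after which a 181-row window again corresponds to 181 days (the dummy rows for dates without data are still needed):
select a, b, datecol
,sum(daysum)
 over (partition by a, b
       order by datecol
       rows between 181 preceding and current row) as moving_sum
from
 ( select a, b, datecol
   ,sum(sumcol) as daysum  -- collapse multiple rows per day into one
   from mytable
   group by 1,2,3
 ) as daily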
Related
First, I can't understand the aggregate function and cbind; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this with:
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command; please explain it to me very simply. Thank you!
cbind takes two or more tables (dataframes), puts them side by side, and then makes them into one big table. So for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it was, it's only one thing.
aggregate takes a table, divides it by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (cbind(return), which doesn't really make sense here) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique number for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'll basically get the same table back, since you're grouping the data into groups of one row each. Actually, since your variable list is effectively empty as far as I can tell, you'll get nothing back.
What I want to do is: when I select records from the table, the last column is the subtraction of two columns. In the first record, the last column (i.e. the subtraction of the two columns) will be [Value1] - [Value2], where [Value1] and [Value2] are columns of the table.
The second record will then be:
Value of (previous row's last column) + ([Value1] - [Value2])
and so on for the following records.
The columns are as below,
[ID],[Value1],[Value2]
Now the records will be like below,
[ID] [Value1] [Value2] [Result]
1 10 5 10 - 5 = 5
2 15 7 5 + (15 - 7) = 13
3 100 50 13 + (100 - 50) = 63
and so on...
SQLite doesn't support running totals, but for your data and your desired result it's possible to factor out the arithmetic and write the query like this:
SELECT t.id, t.value1, t.value2, SUM(t1.value1 - t1.value2)
FROM table1 AS t
JOIN table1 AS t1 ON t.id >= t1.id
GROUP BY t.id, t.value1, t.value2
http://sqlfiddle.com/#!7/efaf1/2/0
This query will slow down as your row count increases. So, if you're planning to run this on a large table, you may want to run the calculation outside of SQLite.
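As a side note, SQLite 3.25 and later do support window functions, which avoid the self-join entirely and scale linearly with the row count; a sketch against the same assumed table1 columns:
SELECT id, value1, value2,
       -- running total in a single pass; needs SQLite 3.25+
       SUM(value1 - value2) OVER (ORDER BY id
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS result
FROM table1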
I have a large dataset and a lookup table. I need to return for each row in the dataset the smallest value present for rows in the lookup where conditions are met.
Given the size of my dataset I'm reluctant to hack an iffy solution together by cross-joining, as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base R or data.table, since these are already in use in an efficient manner.
Example
library(data.table)
A<-seq(1e4,9e4,1e4)
B<-seq(0,1e4,1e3)
dt1<-data.table(expand.grid(A,B),ID=1:nrow(expand.grid(A,B)))
setnames(dt1, c("Var1","Var2"),c("A","B"))
lookup<-data.table(minA=c(1e4,1e4,2e4,2e4,5e4),
                   maxA=c(2e4,3e4,7e4,6e4,9e4),
                   minB=rep(2e3,5),
                   Val=seq(.1,.5,.1))
# Sample Desired Value
A B ID Val
99: 90000 10000 99 0.5
In SQL, I would then write something along the lines of
SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
and dt1.A<=lookup.maxA
and dt1.B>=lookup.minB
GROUP BY ID, A, B
Which would join all matching records from lookup to dt1 and return the smallest Val.
Update
My solution so far looks like:
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
list(Val=Val[which.min( Val)]),
by=list(ID,A,B)]
dt1.lookup<-plyr::rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])
This retrieves all records and allows additional columns from the lookup table to be returned if I need them. It also has the benefit of ensuring that the minimum Val is picked.
A solution I found that avoids a cross join first prepares the data by getting rid of rows where A and B are entirely out of range:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions are met that correspond to the lowest possible Val:
Indices = Prep[, list(min(which(A >= lookup$minA)),
                      min(which(A <= lookup$maxA)),
                      min(which(B >= lookup$minB)),
                      A, B),
               by = ID]
Then you must get Val at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
ID Val A B
1: 19 0.1 10000 2000
2: 20 0.1 20000 2000
3: 21 0.2 30000 2000
4: 22 0.3 40000 2000
5: 23 0.3 50000 2000
6: 24 0.3 60000 2000
7: 25 0.3 70000 2000
8: 26 0.5 80000 2000
9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
My first thought was trying to make an index like Senor O did. However, the min(Val) made the index table tougher for me to think through. The way I thought to do it was to loop through the lookup table.
dt1[,Val:=as.numeric(NA)]
for (row in 1:NROW(lookup)) {
  dt1[A >= lookup[order(Val)][row, minA] &
      A <= lookup[order(Val)][row, maxA] &
      B >= lookup[order(Val)][row, minB] &
      is.na(Val),
      Val := lookup[order(Val)][row, Val]]
}
I think this should work because it first sets the new column to NA values.
Then it puts the lookup table in order by Val, so you're going to get its lowest values first.
Each loop iteration only changes values in dt1 that are still NA in Val, and since we're looping through lookup from the smallest Val to the biggest, this ensures you get the min(Val) you wanted.
Replace the rbind.fill line with:
rbindlist(list(dt1.lookup,dt1[!ID %in% dt1.lookup[,ID]][,list(ID, A, B, Val=as.numeric(NA))]))
It will eliminate the reliance on the plyr package (which provides rbind.fill), and I think it'll be faster.
I have multiple records in my table's column, and most of them are duplicate entries. I want to sum them up so that whatever number is duplicated is summed just once, like:
Numbers
10
10
10
15
20
The summed result should be 45
I am using this query:
/sum(summary.filter(start_time>='2013-01-01'&end_time<='2013-05-01'&student='john'&course='BCS').s_sub_n)
Please help me figure out where I can put ^ to make s_sub_n distinct.
OK, if I'm not getting you wrong, you want the Number column to be made distinct and then summed.
Distinct Number column (s_sub_n):
10
15
20
and then its sum = 45
So the HTSQL query would be:
sum(summary.filter(start_time>='2013-01-01'&end_time<='2013-05-01'&student='john'&course='BCS')^{s_sub_n}{s_sub_n})
I have one table with comma-separated values like "1,2,3,4,5,6,7" per row, like:
ID Value
101 5,6,7
201 8,9,3
301 3,4,5
The Value column's values are foreign keys into another table, B.
Table B
5 A
6 C
7 N
Is there any way I can join these two tables together in one query?
I want to pass 101 and get the values A, C, N.
If your model is as shown, something like this?
select a.id, listagg(new_value, ',') within group (order by new_value) new_value
from a
inner join b
on ','||a.value||',' like '%,'|| b.value ||',%'
group by a.id
http://www.sqlfiddle.com/#!4/74e46/1
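And to get just the values for a single ID, e.g. the 101 from the question, a sketch that simply adds a filter to that query (keeping the new_value column name assumed above):
select a.id, listagg(new_value, ',') within group (order by new_value) new_value
from a
inner join b
on ','||a.value||',' like '%,'|| b.value ||',%'
where a.id = 101
group by a.id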