Creating an embedded R windowing function in MonetDB

I'm trying to create a windowing aggregate function using embedded R in monetdb. The function I have used is:
CREATE AGGREGATE r_sw(val double, part varchar(255), endtime timestamp, starttime timestamp) RETURNS double LANGUAGE R {
library(data.table)
library(zoo)
DT=data.table(ag=aggr_group,pa=part,va=val,et=endtime,st=starttime)
setorder(DT,pa,et)
DT[, o:=mapply(function(x,y) DT[(et>=x & pa==y),.N], DT$st, DT$pa)]
as.data.frame(DT[, .(s = rollapply(va, o, sum)), by = pa]$s)
};
When attempting to select from the function I am getting an error claiming the aggregate doesn't exist:
Error: SELECT: no such operator 'r_sw'
SQLState: 22000
ErrorCode: 0
I think this is an issue with the number of parameters I am passing rather than with the R code itself. I have created aggregates with two parameters that work perfectly, but three or more seems to cause a problem. Is there something else I need to do to get this to work?
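For reference, a two-parameter aggregate of the kind described looks roughly like this (a sketch, not my actual working code; r_sum2 is a made-up name, and the body simply sums val per group via the implicit aggr_group vector that MonetDB passes to R aggregates; the SELECT uses mytable from the repro steps below):
CREATE AGGREGATE r_sum2(val double, part varchar(255)) RETURNS double LANGUAGE R {
    as.numeric(aggregate(val, by = list(aggr_group), FUN = sum)$x)
};
SELECT mypart, r_sum2(myval, mypart) FROM mytable GROUP BY mypart;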
(EDIT) Steps to reproduce:
CREATE TABLE mytable (myval double, mypart varchar(255), myend timestamp, mystart timestamp);
INSERT INTO mytable VALUES (10,'A','2016-01-01 00:00:00','2016-01-07 00:00:00');
INSERT INTO mytable VALUES (200,'A','2016-01-04 00:00:00','2016-01-12 00:00:00');
SELECT mypart, r_sw(myval,mypart,myend,mystart) from mytable GROUP BY mypart;

Related

result_create(conn@ptr, statement) : Result too large

t1_DA <- sqldf("select decile,
count(decile) as count, avg(pred_spent) as avg_pred_spent,
avg(exp(total_spent)) as avg_total_spent,
avg(log(pred_spent)) as ln_avg_pred_spent,
avg(total_spent) as ln_avg_total_spent
from t1
group by decile
order by decile desc")
I am doing linear regression on a file, and when I run this part of the code I get the following error:
Error in result_create(conn@ptr, statement) : Result too large
Is there any way to overcome this error?
As mentioned, by default sqldf uses the SQLite dialect, which does not support mathematical and statistical functions such as exp and log. Admittedly, a clearer message than Result too large would help users debug (maybe a GitHub issue for the author, @ggrothendieck?).
However, in order to integrate these outputs into your aggregate query, consider creating those columns before running sqldf. Use either transform or within for easy new-column assignment without constantly referencing the data frame via the $ assignment approach.
t1 <- transform(t1, exp_total_spent = exp(total_spent),
                    log_pred_spent = log(pred_spent)
)
# ALTERNATIVE
t1 <- within(t1, {exp_total_spent <- exp(total_spent)
                  log_pred_spent <- log(pred_spent)
})
t1_DA <- sqldf("select decile,
count(decile) as count,
avg(pred_spent) as avg_pred_spent,
avg(exp_total_spent) as avg_total_spent,
avg(log_pred_spent) as ln_avg_pred_spent,
avg(total_spent) as ln_avg_total_spent
from t1
group by decile
order by decile desc")
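As an optional sanity check (not part of the answer above), the group means can also be reproduced in base R on the transformed t1:
# group means of the transformed columns, assuming the t1 created above
chk <- aggregate(cbind(pred_spent, exp_total_spent, log_pred_spent, total_spent) ~ decile,
                 data = t1, FUN = mean)
chk <- chk[order(-chk$decile), ]   # mirror ORDER BY decile DESC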

Converting a date to a varchar using "like" in PL/SQL

I need to go through a few million rows of data searching for a year that is sent as a parameter to a method. The year comes as a varchar.
This is the query I'm working with
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND to_char(cre_date, 'YYYY') = year_;
cre_date is of type DATE and year_ is of type VARCHAR.
This query takes around 25 minutes to complete.
Does anyone know of a different approach that would execute more quickly?
Please help.
This didn't work out:
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND cre_date LIKE '%2013';
The reason might be that cre_date and '%2013' are of different types.
If you have an index on (mch_code, contract, cre_date) columns, you can improve performance by doing something like:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= to_date('01/01/'||year_, 'dd/mm/yyyy')
and cre_date < add_months(to_date('01/01/'||year_, 'dd/mm/yyyy'), 12);
Even better would be to declare the start of the year as a DATE variable prior to running the SQL, e.g.:
v_year_dt := to_date('01/01/'||year_, 'dd/mm/yyyy');
which would make the query:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= v_year_dt
and cre_date < add_months(v_year_dt, 12);
If you don't have an index on those three columns, you could create a function based index on (mch_code, contract, to_char(cre_date, 'yyyy')) that should help speed up your query, depending on the percentage of rows you're expecting to select. It may help even more if you added the x and y columns into the index, so that no table access was required at all.
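For illustration, such a function-based index could be created along these lines (a sketch; the index names are made up):
CREATE INDEX a_year_idx ON a (mch_code, contract, TO_CHAR(cre_date, 'YYYY'));
-- or, including the selected columns so the query can be answered from the index alone:
CREATE INDEX a_year_cov_idx ON a (mch_code, contract, TO_CHAR(cre_date, 'YYYY'), x, y);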
Alternatively, you could think about partitioning the table on cre_date, monthly or yearly.
The reason your query is slow is that you're applying a function to a column on every row in your table. Let's try it another way:
SELECT X,Y
FROM A
WHERE mch_code = 'KN' AND
contract = '15KTN' AND
CRE_DATE >= TO_DATE('01/01/' || year_, 'DD/MM/YYYY') AND
CRE_DATE < TO_DATE('01/01/' || year_, 'DD/MM/YYYY') + INTERVAL '1' YEAR;
This eliminates the need to apply a function against every row in the table, and should allow any indexes on CRE_DATE to be used.
Best of luck.
You can try the EXTRACT function:
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND EXTRACT(YEAR FROM cre_date) = year_;

Update data.table column changes data type

I am testing a small scale scenario before rolling it out in a larger production environment and am experiencing a strange occurrence.
I have 2 data sets:
dtL <- data.table(URN=c(1,2,3,4,5), DonorType=c("Cash","RG","Emergency","Emergency","Cash"))
dtL[,c("EmergVal","EmergDate") := list(as.numeric(NA),as.Date(NA))]
setkey(dtL,URN)
dtR <- data.table(URN = c(1,1,1,2,3,3 ,3 ,4,4, 4,4,5),
class=c(5,5,5,1,5,40,40,5,40,5,40,5),
xx= c(25,50,25,10,100,20,25,20,40,35,20,25),
xdate=as.Date(c("2013-01-01","2013-06-05","2014-05-27","2014-10-14",
"2014-06-09","2014-04-07","2014-10-16",
"2014-07-16","2014-10-21","2014-10-22","2014-09-18","2013-12-19")))
setkey(dtR,URN)
I want to update dtL where DonorType equals "Emergency", but only using a subset of records from dtR. I have seen Update subset of data.table based on join and have used that as the foundation for my solution.
dtL[dtR[class==40,list(maxxx=max(xx)),by=URN],
EmergVal := ifelse(DonorType=="Emergency",i.maxxx,as.numeric(NA))]
dtL[dtR[class==40,list(maxdate=max(xdate)),by=URN],
EmergDate := ifelse(DonorType=="Emergency",as.Date(i.maxdate),as.Date(NA)),nomatch=0]
I don't get any errors, however when I look at the data now in dtL it has changed the datatype for EmergDate to num rather than what it originally was (i.e. Date).
So, three questions:
Why has it changed the data type (especially when EmergDate is created as a Date in dtL, and I tell it to use a date in my ifelse statement)?
How do I get it to keep the Date type when I assign it? Or will I have to do some post-assignment conversion/casting?
Is there a clean way to assign EmergVal and EmergDate in a single statement, given that I don't have a DonorType field in dtR and don't want to add one (so I can't use a multiple-column key for the join)?
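For what it's worth, the type change in question 1 is most likely down to ifelse() itself: it builds its result from a bare template vector, so attributes such as the Date class are dropped. A minimal illustration, not taken from the code above:
x <- as.Date("2014-10-16")
ifelse(TRUE, x, as.Date(NA))                          # a bare number; the Date class is gone
as.Date(ifelse(TRUE, x, NA), origin = "1970-01-01")   # one way to re-attach the class afterwards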

How to get the count of non-null values in all columns of a table using PL/SQL?

Is there any PL/SQL function that allows passing a table name and returns, for each of its columns, the count of non-null values?
I have a huge number of columns and don't want to query each and every column. I'm new to PL/SQL and highly appreciate your help.
As noted in a comment on the question, one approach is the following query. It relies on the optimizer statistics (num_rows, num_nulls), so the figures are only as current as last_analyzed:
SELECT t.table_name,
t.num_rows,
c.column_name,
c.num_nulls,
t.num_rows - c.num_nulls num_not_nulls,
c.data_type,
c.last_analyzed
FROM all_tab_cols c
JOIN sys.all_all_tables t ON c.table_name = t.table_name
WHERE c.table_name LIKE 'EXT%'
AND c.nullable = 'Y'
GROUP BY t.table_name,
t.num_rows,
c.column_name,
c.num_nulls,
c.data_type,
c.last_analyzed
ORDER BY t.table_name,
c.column_name
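If the statistics are stale or exact figures are needed, a small PL/SQL block with dynamic SQL can count the non-null values directly (a sketch; MYTABLE is a placeholder table name):
DECLARE
  v_cnt NUMBER;
BEGIN
  FOR c IN (SELECT column_name
              FROM all_tab_cols
             WHERE table_name = 'MYTABLE') LOOP
    -- COUNT(col) counts only the non-null values of that column
    EXECUTE IMMEDIATE
      'SELECT COUNT(' || c.column_name || ') FROM MYTABLE' INTO v_cnt;
    DBMS_OUTPUT.PUT_LINE(c.column_name || ': ' || v_cnt);
  END LOOP;
END;
/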

Subset of values in result from RODBC

I select some values from a database in R with RODBC like this:
library(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server};server=mydatabase, ...')
res <- sqlQuery(dbhandle, "select id, class, param1, param2 from table1 ...")
For analysis of the data I need to select a subset of it. I have the column class, which is a varchar and defines some subclasses like set1 or set2.
For example, I need summary() for both sets together and then for each set separately. I would say that this is done by:
summary(res) # works fine
summary(res[res["class"] == 'set1']) # does not work
summary(res[res["class"] == 'set2']) # does not work
Because I get this instead:
Length Class Mode
10788 character character
After filtering, I have the data as a long character vector rather than a data frame. What is wrong here?
zx8754's answer shows what is wrong in your code. Another way of getting it done is to use the subset function:
summary(subset(res, class == 'set2'))
Try this:
summary(res[res[,"class"] == "set1",])
Update:
res[row, column] - generally the 1st value is the row index and the 2nd value is the column index, so:
res[, "class"] - selects the "class" column from res.
res[, "class"] == "set1" - compares the "class" column values with the string "set1"; this gives TRUE/FALSE values.
res[res[, "class"] == "set1", ] - the TRUE/FALSE values define which rows to return.
