I need to go through few millions of data searching for a year sent as a parameter to a method. The year comes as a varchar.
This is the query I'm working with
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND to_char(cre_date, 'YYYY') = year_;
cre_ date is of type date and year_ is from type carchar.
when performing this query it take around 25 minutes to process it completely.
Is anyone knows about a different approach to find out the quick execution.
Please help.
This didn't work out.
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND cre_date LIKE '%2013';
The reason might be 'cre_date' and '%2013' are of different types
If you have an index on (mch_code, contract, cre_date) columns, you can improve performance by doing something like:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= to_date('01/01/'||year_, 'dd/mm/yyyy')
and cre_date < add_months(to_date('01/01/'||year_, 'dd/mm/yyyy'), 12);
Even better would be to declare the start of the year as a DATE variable prior to running the sql, eg:
v_year_dt := to_date('01/01/'||year_, 'dd/mm/yyyy');
which would make the query:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= v_year_dt
and cre_date < add_months(v_year_dt, 12);
If you don't have an index on those three columns, you could create a function based index on (mch_code, contract, to_char(cre_date, 'yyyy')) that should help speed up your query, depending on the percentage of rows you're expecting to select. It may help even more if you added the x and y columns into the index, so that no table access was required at all.
Alternatively, you could think about partitioning the table on cre_date, monthly or yearly.
The reason your query is slow is that you're applying a function to a column on every row in your table. Let's try it another way:
SELECT X,Y
FROM A
WHERE mch_code = 'KN' AND
contract = '15KTN' AND
CRE_DATE BETWEEN TO_DATE('01/01/' || year_, 'DD/MM/YYYY')
AND TO_DATE('01/01/' || year_, 'DD/MM/YYYY') + INTERVAL '1' YEAR;
This eliminates the need to apply a function against every row in the table, and should allow any indexes on CRE_DATE to be used.
Best of luck.
You can try with EXTRACT function:
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND EXTRACT(YEAR FROM cre_date) = year_;
Related
I'm asking for you help here because i didn't find how to do it on internet and even after trying myself.
Here is my concern:
I want to create an Order Book for a trading strategy, I created then a dictionary like this :
bids = {Price : (size, submission_time, is_market_maker)}
When executing orders, the first priority is price, and, within a given price, the order that was sent prior to another one will have priority (sort by number, the lower being the first).
This is called the price-priority rule for order books.
I already tried this function with the '[1][1]' to sort by the value which is a list but it doesn't work maybe because I'm using a DefaultDict I don't know
sorted_dict = sorted(bids.items(), key=lambda item: item[1][1])
And then I would like to sort it by price without changing the order of the orders.
I would like it to be like this :
bids = {100: [(5,2,True), (4,1, False)], 101: [(2,3,True), (4,1, False)]}
bids = {101: [(4, 1, False), (2 , 3, True)] , 100: [(4, 1, False), (5, 2, True)]}
Thank you for your help !
I'm trying to find number of distinct values in all columns for some query. I found that dcount works well but you have to supply the specific column. I want to do this on all columns, where the column names and the number of columns are dynamic
you'll have to explicitly include all columns of interest.
note that any additional column you add to the query will increase the resources utilization of the query, so if you have any knowledge about columns that are likely to be of high cardinality, consider including only those.
FWIW: you can generate the query (for all columns, with the caveat above) dynamically, then invoke the result of this:
let tableName = "my_table";
let datetime_column_name = "my_datetime_column";
let lookback_period = 1h;
let column_names = toscalar(
table(tableName)
| getschema
| summarize make_set(ColumnName)
);
print query = strcat(
tableName,
"\n| where ",
datetime_column_name,
" > ago(timespan(",
lookback_period,
"))\n| summarize dcount(",
strcat_array(column_names, "),\ndcount("),
")")
I want to use full_join to join two tables. Below is my pseudo code:
join <- full_join(a, b, by = c("a_ID" = "b_ID" , "a_DATE_MONTH" = "b_DATE_MONTH" +1 | "a_DATE_MONTH" = "b_DATE_MONTH" -1 | "a_DATE_MONTH" = "b_DATE_MONTH"))
a_DATE_MONTH and b_DATE_MONTH are in date format "%Y-%m".
I want to do full join based on condition that a_DATE_MONTH can be one month prior to b_DATE_MONTH, OR one month after b_DATE_MONTH, OR exactly equal to b_DATE_MONTH. Thank you!
While SQL allows for (almost) arbitrary conditions in a join statement (such as a_month = b_month + 1 OR a_month + 1 = b_month) I have not found dplyr to allow the same flexibility.
The only way I have found to join in dplyr on anything other than a_column = b_column is to do a more general join and filter afterwards. Hence I recommend you try something like the following:
join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
filter(abs(a_DATE_MONTH - b_DATE_MONTH) <= 1)
This approach still produces the same records in your final results.
It perform worse / slower if R does a complete full join before doing any filtering. However, dplyr is designed to use lazy evaluation, which means that (unless you do something unusual) both commands should be evaluated together (as they would be in a more complex SQL join).
I am testing a small scale scenario before rolling it out in a larger production environment and am experiencing a strange occurrence.
I have 2 data sets:
dtL <- data.table(URN=c(1,2,3,4,5), DonorType=c("Cash","RG","Emergency","Emergency","Cash"))
dtL[,c("EmergVal","EmergDate") := list(as.numeric(NA),as.Date(NA))]
setkey(dtL,URN)
dtR <- data.table(URN = c(1,1,1,2,3,3 ,3 ,4,4, 4,4,5),
class=c(5,5,5,1,5,40,40,5,40,5,40,5),
xx= c(25,50,25,10,100,20,25,20,40,35,20,25),
xdate=as.Date(c("2013-01-01","2013-06-05","2014-05-27","2014-10-14",
"2014-06-09","2014-04-07","2014-10-16",
"2014-07-16","2014-10-21","2014-10-22","2014-09-18","2013-12-19")))
setkey(dtR,URN)
I am wanting to update dtL where the DonorType is equal to "Emergency", but only for a subset of records from dtR. I have seen Update subset of data.table based on join and thus have used that as a foundation for my solution.
dtL[dtR[class==40,list(maxxx=max(xx)),by=URN],
EmergVal := ifelse(DonorType=="Emergency",i.maxxx,as.numeric(NA))]
dtL[dtR[class==40,list(maxdate=max(xdate)),by=URN],
EmergDate := ifelse(DonorType=="Emergency",as.Date(i.maxdate),as.Date(NA)),nomatch=0]
I don't get any errors, however when I look at the data now in dtL it has changed the datatype for EmergDate to num rather than what it originally was (i.e. Date).
So three questions
Why has it changed the data type (especially when it is a Date when first created in dtL, and I tell it to put it as a date in my ifelse statement?
How do I get it to keep the date type when I assign it? or will I have to do some post assignment conversion/castint?
Is there a clean way I could do my assignment of EmergVal and EmergDate in a single statement given that I don't have a field DonorType in dtR and I don't want to add it in (so can't use a multiple key for the join)?
Is there any PL/SQL function, which allows to pass a table name and returns the count of all columns, which don't include null values?
I have a huge number of columns and don't want to query each and every column. I'm new to PL/SQL and highly appreciate your help.
As of a comment to the question one approach to solve this is the following query:
SELECT t.table_name,
t.num_rows,
c.column_name,
c.num_nulls,
t.num_rows - c.num_nulls num_not_nulls,
c.data_type,
c.last_analyzed
FROM all_tab_cols c
JOIN sys.all_all_tables t ON c.table_name = t.table_name
WHERE c.table_name LIKE 'EXT%'
AND c.nullable = 'Y'
GROUP BY t.table_name,
t.num_rows,
c.column_name,
c.num_nulls,
c.data_type,
c.last_analyzed
ORDER BY t.table_name,
c.column_name