Setting up a dynamic bucketed constraint in MarkLogic for dates - xquery

In my database, I have a path range index on <date> that contains xs:dates.
<date>2019-01-01</date>
I'm trying to set up a faceted constraint with the following buckets:
2019, split by quarter (2019 Q1 (Jan-Mar), Q2 (Apr-Jun), etc.)
2018
2017
2016
2015
etc
My issue: I want the buckets to update dynamically so that the current year is split into quarters, with whole-year buckets for all earlier years. My current bucketed range constraint:
<constraint name="date">
<range type="xs:date" facet="true">
<path-index>/data/date</path-index>
<bucket ge="2019-01-01" lt="2019-04-01" name="q1">2019 Q1</bucket>
<bucket ge="2019-04-01" lt="2019-07-01" name="q2">2019 Q2</bucket>
<bucket ge="2019-07-01" lt="2019-10-01" name="q3">2019 Q3</bucket>
<bucket ge="2019-10-01" lt="2020-01-01" name="q4">2019 Q4</bucket>
<bucket ge="2018-01-01" lt="2019-01-01" name="2018">2018</bucket>
<bucket ge="2017-01-01" lt="2018-01-01" name="2017">2017</bucket>
<bucket ge="2016-01-01" lt="2017-01-01" name="2016">2016</bucket>
<bucket ge="2015-01-01" lt="2016-01-01" name="2015">2015</bucket>
</range>
</constraint>
The issue with the above is that it hard-codes 2019 as the year that is split into quarters. When 2020 arrives, and in the years that follow, how can I get the buckets to update automatically so that only the current year is split into quarters?

Would computed buckets address the requirement? See:
http://docs.marklogic.com/guide/search-dev/search-api#id_22725
and
http://docs.marklogic.com/guide/search-dev/appendixa#id_91755
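Computed buckets define their bounds as durations relative to an anchor that is evaluated at query time, so the quarter buckets can track the current year without manual edits. A minimal sketch of the quarterly portion, assuming the path index above (attribute names and anchor values are taken from the linked appendix; verify them against your MarkLogic version):
<constraint name="date">
<range type="xs:date" facet="true">
<path-index>/data/date</path-index>
<!-- current-year quarters, bounded relative to the start of the current year -->
<computed-bucket name="q1" ge="P0M" lt="P3M" ge-anchor="start-of-year" lt-anchor="start-of-year">Q1</computed-bucket>
<computed-bucket name="q2" ge="P3M" lt="P6M" ge-anchor="start-of-year" lt-anchor="start-of-year">Q2</computed-bucket>
<computed-bucket name="q3" ge="P6M" lt="P9M" ge-anchor="start-of-year" lt-anchor="start-of-year">Q3</computed-bucket>
<computed-bucket name="q4" ge="P9M" lt="P12M" ge-anchor="start-of-year" lt-anchor="start-of-year">Q4</computed-bucket>
</range>
</constraint>
The fixed whole-year buckets for past years would still need a periodic refresh, e.g. a small scheduled XQuery task that rewrites the options each January.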
Hoping that helps,

Related

Portfolio sorts with incomplete data

I have a panel data of stock returns, where after a certain year the coverage universe of stocks doubled. It looks a bit like this:
Year   Stock 1  Stock 2  Stock 3  Stock 4
2000   5.1%     0.04%    NA       NA
2001   3.6%     9.02%    NA       NA
2002   5.0%     12.09%   NA       NA
2003   -2.1%    -9.05%   1.1%     4.7%
2004   7.1%     1.03%    4.2%     -1.1%
...
Of course, I am trying to maximize my observations in both the time series and the cross-section. However, I am not sure which of these three ways to sort would be the most "academically honest":
1. Sort the years through 2002 using only stocks 1 and 2, and incorporate the remaining stocks into the calculations once they become available in 2003.
2. Only include stocks that have been available since 2000, i.e. stocks 1 and 2, and ignore the remaining stocks altogether, since we do not have their full return profiles.
3. Start the sort in 2003, to have a larger cross-section.
Our coverage universe expands in 2003 simply because the data provider I am using changed their methodology in that year and decided to track more stocks. Stocks 3 and 4 do exist before 2003, but I cannot use their past return data, since I need to follow my data provider (for the second variable I am sorting on).
Thanks all!
I am using the portsort() package in R, but it does not seem to handle NAs well.
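To make the setup concrete, here is a toy version of option 1 in base R, ranking only the stocks observed in each year (portsort() specifics aside, since it expects complete matrices; numbers made up to match the table above):
# Toy panel: one row per year, one column per stock, NA where coverage is missing
returns <- data.frame(
  Year   = 2000:2004,
  Stock1 = c(0.051, 0.036, 0.050, -0.021, 0.071),
  Stock2 = c(0.0004, 0.0902, 0.1209, -0.0905, 0.0103),
  Stock3 = c(NA, NA, NA, 0.011, 0.042),
  Stock4 = c(NA, NA, NA, 0.047, -0.011)
)

# Option 1: for each year, rank only the stocks observed in that year
by_year <- lapply(split(returns, returns$Year), function(row) {
  cross_section <- unlist(row[-1])                 # drop the Year column
  sort(cross_section[!is.na(cross_section)], decreasing = TRUE)
})

by_year[["2001"]]  # only Stock1 and Stock2 enter the 2001 cross-section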

Propensity Score Matching with panel data

I am trying to use MatchIt to perform propensity score matching (PSM) on my panel data, which contains multi-year observations from the same group of companies.
The data describes a list of bonds and the financial data of their issuers, along with bond terms such as issue date, coupon rate, maturity, and bond type. For instance:
Firmnames       Year  ROA  Bond_type
AAPL US Equity  2015  0.3  0
AAPL US Equity  2015  0.3  1
AAPL US Equity  2016  0.3  0
AAPL US Equity  2017  0.3  0
C US Equity     2015  0.3  0
C US Equity     2016  0.3  0
C US Equity     2017  0.3  0
...
I already know how to match the observations on the criteria I want, and I use exact = "Year" to make sure I match observations from the same year. The problem I am now facing is that observations from the same company get matched together, which is not what I want. The code I used:
matchit(Bond_type ~ Year + Amount_Issued + Cpn + Total_Assets_bf + AssetsEquityRatio_bf + Asset_Turnover_bf, data = rdata, method = "nearest", distance = "glm", exact = "Year")
However, as you can see in the second row of my sample, there can be two observations in one year from the same company, due to the nature of my study (a company can issue bonds more than once a year). The only difference between them is the Bond_type. The matchit() function will therefore treat them as the best treatment and control pair and match these two observations together, since they have the same ROA and other matching covariates in that year.
I see two ways to solve this:
1. Remove the observations from the same year and company; however, removing observations might bias the results and undermine the study.
2. Prevent the matchit() function from matching observations from the same company (i.e., with the same Firmnames).
The second approach would be better, since it does not introduce bias; however, I don't know if this is possible with the matchit() function. I hope someone can give me some advice on this, or if there is a better solution to this problem, please be so kind as to share it. Thanks in advance!
Note: If there's any further information or requirement I should provide, please just inform me. This is my first time raising the question here!
This is not possible with MatchIt at the moment (though it's an interesting idea and not hard to implement, so I may add it as a feature).
In the optmatch package, which performs optimal pair and full matching, there is a constraint that can be added called "anti-exact matching", which sounds like exactly what you want. Units with the same value of the anti-exact matching variable will not be matched with each other. This can be implemented using optmatch::antiExactMatch().
In the Matching package, which performs nearest-neighbor and genetic matching, the restrict argument can be supplied to the matching function to forbid certain matches. You could manually create the restriction matrix by restricting all pairs of observations from the same company and then supply the matrix to Match().
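A rough sketch of the optmatch route, using the variable names from the question (an untested outline; check the optmatch documentation for the exact signatures in your version):
library(optmatch)

# Propensity score model with the same covariates used in matchit()
ps_model <- glm(Bond_type ~ Amount_Issued + Cpn + Total_Assets_bf +
                  AssetsEquityRatio_bf + Asset_Turnover_bf,
                data = rdata, family = binomial)

# Distance on the propensity score, restricted to pairs within the same Year,
# with pairs from the same company forbidden via anti-exact matching
dist <- match_on(ps_model, data = rdata,
                 within = exactMatch(Bond_type ~ Year, data = rdata)) +
        antiExactMatch(rdata$Firmnames, z = rdata$Bond_type)

pm <- pairmatch(dist, data = rdata)
summary(pm)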

Count rows until you get to the current owning team value... Kusto, countof()

I have been trying to develop this Kusto query, and any help would be greatly appreciated.
The objective is to count rows until the first occurrence of the CurrentOwningTeam value in the OwningTeamId column.
I packed the owning team number and parsed the value into a column of its own. I need to count the owning teams until I get to the current owning team.
Example columns (Application Insights / Kusto):
[CODE]
OwningTeamId  CurrentOwningTeam  CreateDate  RequestType
155523        888888             2017-02-07  PRIMARY
256924        888888             2017-02-08  TRANSFER
888888        888888             2017-02-09  TRANSFER
954005        888888             2017-02-10  TRANSFER
888888        888888             2017-02-11  TRANSFER
155523        888888             2017-02-12  TRANSFER
954005        888888             2017-02-13  TRANSFER
888888        888888             2017-02-14  TRANSFER
[/CODE]
I think you can match the current owning team with the countof() function, but I don't know how to go about it using regex. Note: the values differ for each owning team on every incident, which is why I capture the owning team on the incident first and then try to count to the very first instance of the CurrentOwningTeam number in the OwningTeamId column. In other words, I want to count the number of rows it takes to get to the very first occurrence of the current owning team. In this case, it would be three.
Note: OwningTeamId and CurrentOwningTeam can change on every incident; I first capture the CurrentOwningTeam, then try to match it in the OwningTeamId column.
Note: This is just one incident, but I am trying to handle multiple incidents.
Below is how I got the CurrentOwningTeam value.
[CODE]
| extend CurrentOwningTeam=pack_array(OwningTeamId)
| parse CurrentOwningTeam with * "[" CurrentOwningTeam:int "]" *
| serialize CurrentOwningTeam
[/CODE]
I tried using row_number(), but it does not work across multiple incidents, only within a single incident, so I have to use the count or countof functions, or find another way of doing it.
Thanks for the clarification. Here is a suggestion for a query that counts time-ordered rows until a certain condition is reached (the count is kept contextual per the IncidentId key).
datatable(IncidentId:string, OwningTeamId:string, CurrentOwningTeam:string, CreateDate:datetime, RequestType:string)
[
'Id1','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id1','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-09),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id1','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id1','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id1','888888','888888',datetime(2017-02-14),'TRANSFER',
// Id2
'Id2','155523','888888',datetime(2017-02-07),'PRIMARY',
'Id2','256924','888888',datetime(2017-02-08),'TRANSFER',
'Id2','999999','888888',datetime(2017-02-09),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-10),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-11),'TRANSFER',
'Id2','155523','888888',datetime(2017-02-12),'TRANSFER',
'Id2','954005','888888',datetime(2017-02-13),'TRANSFER',
'Id2','888888','888888',datetime(2017-02-14),'TRANSFER',
]
| order by IncidentId, CreateDate asc
| extend c= row_cumsum(1, IncidentId!=prev(IncidentId))
| where OwningTeamId == CurrentOwningTeam
| summarize arg_min(CreateDate, c) by IncidentId
Result:
IncidentId CreateDate c
Id1 2017-02-09 00:00:00.0000000 3
Id2 2017-02-11 00:00:00.0000000 5
Here are links to the docs on finding the earliest record using the arg_min() aggregation, and on the row_cumsum() (cumulative sum) function.
https://learn.microsoft.com/en-us/azure/kusto/query/arg-min-aggfunction
https://learn.microsoft.com/en-us/azure/kusto/query/rowcumsumfunction
I figured it out by using row_number() directly with grouping inside the table, then finally summing to get my total count.
[CODE]
| serialize Id
| extend RowNumber=row_number(1, (Id) == Id)
| summarize TotalOwningTeamChanges=sum(RowNumber) by Id
[/CODE]
After that, I got the minimum date, to extract the data set up to the first instance of the current OwningTeamName.
[CODE]
//Outside the scope of the table.
| extend ExtractFirstOwningTeamCreateDate=CreateDate2
| extend VeryFirstOwningTeamCreateDate=MinimumCreateDate
| where FirstOwningTeamRow == true or MinimumCreateDate <= ExtractFirstOwningTeamCreateDate
| serialize VeryFirstOwningTeamCreateDate
[/CODE]

Update one table with 2 WHERE conditions on the same table and one condition on another table

I have 2 tables, fees and students. I want to update one field of fees with 3 WHERE conditions, i.e., 2 conditions on the table fees and 1 condition on the table students.
I tried many queries, like:
UPDATE fees, students SET fees.dues= 300 WHERE fees.month= November
AND fees.session= 2017-18 AND students.class= Nursery
It gives me an error like java.sql.SQLException: near ",": syntax error.
I am using SQLite as the database. Please suggest a query or correct this one.
Thanks
You cannot join tables in an UPDATE command in SQLite. Therefore, use a sub-query in the WHERE condition:
UPDATE fees
SET dues = 300
WHERE
month = November AND
session = 2017-18 AND
student_id IN (SELECT id FROM students WHERE class=Nursery)
Also, I am not sure about the types of your columns. String literals must be enclosed in single quotes ('). The expression 2017-18 would yield the number 2017 minus 18 = 1999. Should it be a string literal as well?
UPDATE fees
SET dues = 300
WHERE
month = 'November' AND
session = '2017-18' AND
student_id IN (SELECT id FROM students WHERE class='Nursery')

Get Correct Price based on Effectivity Date

I have a problem getting the right "Price" for a product based on Effectivity date.
For example, I have 2 tables:
a. "Transaction" table --> this contains the products ordered, and
b. "Item Master" table --> this contains the product prices and effectivity dates of those prices
Inside the "Transaction" table:
INVOICE_NO  INVOICE_DATE  PRODUCT_PKG_CODE  PRODUCT_PKG_ITEM
1234        6/29/2009     ProductA          ProductA-01
1234        6/29/2009     ProductA          ProductA-02
1234        6/29/2009     ProductA          ProductA-03
Inside the "Item_Master" table:
PRODUCT_PKG_CODE  PRODUCT_PKG_ITEM  PRODUCT_ITEM_PRICE  EFFECTIVITY_DATE
ProductA          ProductA-01       25                  6/1/2009
ProductA          ProductA-02       22                  6/1/2009
ProductA          ProductA-03       20                  6/1/2009
ProductA          ProductA-01       15                  5/1/2009
ProductA          ProductA-02       12                  5/1/2009
ProductA          ProductA-03       10                  5/1/2009
ProductA          ProductA-01       19                  4/1/2009
ProductA          ProductA-02       17                  4/1/2009
ProductA          ProductA-03       15                  4/1/2009
In my report, I need to display the invoices and orders, as well as the price of each order item that was effective at the time it was paid (the invoice date).
My query looks like this (my source db is Oracle):
SELECT T.INVOICE_NO,
       T.INVOICE_DATE,
       T.PRODUCT_PKG_CODE,
       T.PRODUCT_PKG_ITEM,
       P.PRODUCT_ITEM_PRICE
FROM TRANSACTION T, ITEM_MASTER P
WHERE T.PRODUCT_PKG_CODE = P.PRODUCT_PKG_CODE
  AND T.PRODUCT_PKG_ITEM = P.PRODUCT_PKG_ITEM
  AND P.EFFECTIVITY_DATE <= T.INVOICE_DATE
  AND T.INVOICE_NO = '1234';
...which shows more than one price for each item.
I tried several other query styles, but to no avail, so I decided it's time to get help. :) Thanks to any of you who can share your knowledge. --CJ--
p.s. Sorry, my post doesn't even look right! :D
If it's returning multiple rows with different effective dates that are less than the invoice date, you may want to change your date join to
AND P.EFFECTIVITY_DATE = (
    SELECT MAX(effectivity_date)
    FROM item_master
    WHERE effectivity_date <= t.invoice_date
      AND product_pkg_code = t.product_pkg_code
      AND product_pkg_item = t.product_pkg_item)
or something like that, to get only the one price that is the most recent one on or before the invoice date.
Analytics is your friend. You can use the FIRST_VALUE() function, for example, to get all the product_item_prices for the given product, sort by effectivity_date (descending), and just pick the first one. You'll need a DISTINCT as well so that only one row is returned for each transaction.
SELECT DISTINCT
T.INVOICE_NO,
T.INVOICE_DATE,
T.PRODUCT_PKG_CODE,
T.PRODUCT_PKG_ITEM,
FIRST_VALUE(P.PRODUCT_ITEM_PRICE)
OVER (PARTITION BY T.INVOICE_NO, T.INVOICE_DATE,
T.PRODUCT_PKG_CODE, T.PRODUCT_PKG_ITEM
ORDER BY P.EFFECTIVITY_DATE DESC)
as PRODUCT_ITEM_PRICE
FROM TRANSACTION T,
ITEM_MASTER P
WHERE T.PRODUCT_PKG_CODE = P.PRODUCT_PKG_CODE
AND T.PRODUCT_PKG_ITEM = P.PRODUCT_PKG_ITEM
AND P.EFFECTIVITY_DATE <= T.INVOICE_DATE
AND T.INVOICE_NO = '1234';
While your question's formatting is a bit too messy for me to get all the details, it sure looks like you're looking for the standard SQL construct ROW_NUMBER() OVER with both PARTITION BY and ORDER BY. It's in PostgreSQL 8.4, and has been in Oracle (and MS SQL Server, and DB2...) for quite a while; it's the handiest way to select the "top" (or "top N") rows per group, in a certain order, in a SQL query. Look it up; see here for the PostgreSQL-specific docs.
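For instance, a sketch of that construct against the tables above (Oracle syntax; keeps only the newest effective price per invoice line):
SELECT INVOICE_NO, INVOICE_DATE, PRODUCT_PKG_CODE, PRODUCT_PKG_ITEM, PRODUCT_ITEM_PRICE
FROM (
  SELECT T.INVOICE_NO,
         T.INVOICE_DATE,
         T.PRODUCT_PKG_CODE,
         T.PRODUCT_PKG_ITEM,
         P.PRODUCT_ITEM_PRICE,
         -- rank each item's candidate prices, newest effective date first
         ROW_NUMBER() OVER (PARTITION BY T.INVOICE_NO, T.PRODUCT_PKG_CODE, T.PRODUCT_PKG_ITEM
                            ORDER BY P.EFFECTIVITY_DATE DESC) AS RN
  FROM TRANSACTION T
  JOIN ITEM_MASTER P
    ON T.PRODUCT_PKG_CODE = P.PRODUCT_PKG_CODE
   AND T.PRODUCT_PKG_ITEM = P.PRODUCT_PKG_ITEM
   AND P.EFFECTIVITY_DATE <= T.INVOICE_DATE
  WHERE T.INVOICE_NO = '1234'
)
WHERE RN = 1;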
