Netting transactions in quantstrat - r

Let's say I have a strategy with multiple rules that generates multiple orders on the same symbol at the same timestamp. For example, on 2012-05-23 one rule might buy 10 shares of IBM while another rule sells 5 shares of IBM. In production, a reasonable system would use netting and execute one order to buy 5 shares, rather than one order to buy 10 shares and another order to sell 5 shares.
Is there a way to get this behaviour in quantstrat? From my experiments, quantstrat does not do netting, and for example will add transaction fees for both opposing orders as if two separate orders were executed.
If quantstrat cannot net orders then it should still be possible to obtain the desired PnL in backtesting by using a custom TxnFees function. If this is the correct way to go, how would one go about defining a custom function to net the transaction fees?

A 'reasonable system' would likely do no such thing. In my experience, simultaneous execution of aggressive orders on tick data essentially never happens.
On bar data, yes, internal netting would make sense, and it would be handled by a production order management system: for example, internalizing resting limit orders against other signals asking for aggressive orders on the other side, or netting positions. But does any investor of non-trivial size actually trade on bar data?
That seems to miss the point of what quantstrat is for. In research, you are trying to find a strategy that makes good predictions and to evaluate the quality of those predictions by writing a backtest.
Backtests aren't reality.
Further, netting would completely muddle any ability to figure out if your signal process has predictive power.
The account in blotter will net P&L automatically, so it will have the same result as your order netting, in the absence of fees. So I don't think you would need a separate TxnFees function to understand the possible impact of netting, pre-fees.
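If you do still want to model fee impact explicitly, blotter lets TxnFees be a function rather than a fixed number, and quantstrat's ruleSignal can be given the name of such a function. A minimal sketch, assuming the standard addTxn()/ruleSignal() arguments; the signal and quantity names are placeholders, so check ?addTxn and ?ruleSignal in your installed versions:

# fee function: argument names follow my reading of ?addTxn; blotter expects
# fees as negative numbers (they reduce cash), here half a cent per share
perShareFee <- function(TxnQty, TxnPrice, Symbol) {
  -0.005 * abs(TxnQty)
}

# hypothetical use in a rule; everything except TxnFees is a placeholder
# add.rule(strategy.st, name = "ruleSignal",
#          arguments = list(sigcol = "longEntry", sigval = TRUE,
#                           orderqty = 10, ordertype = "market",
#                           orderside = "long", TxnFees = "perShareFee"),
#          type = "enter")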

Related

Check if any combination of binary variables is correlated/has impact on an ordinal dependent variable

I am working on a case to finish my (not so advanced) data science course, and I have already been helped a lot by topics here, thanks!
Unfortunately now I am stuck again and cannot find an existing answer.
My data comes from a bike shop, and I want to see whether the products bought during a customer's first registered purchase are related to, or have an impact on, how important that customer becomes to the shop in the future. I have grouped customers into 5 clusters (from those who registered but never made a registered purchase again, through those who made 2-3 purchases for little money and those who made a few purchases for a lot of money, up to those who purchase regularly and really bring a lot of money to this bike shop) and ordered these clusters into an ordinal dependent variable.
As the independent variables I have prepared 20+ binary variables that identify products/services bought during the first purchase from this shop (the first purchase as a registered customer), with one row per customer. So I want to test the idea that there are combinations of products (probably "extras" to the bike purchase) that increase the chance that a customer registers and hopefully stays a loyal customer in the future.
The dream would be to be able to say, for example: if you buy a cheap or mid-priced bike during this first purchase, you probably won't contribute much to the bike shop in the long term, so you get a low grade on the dependent variable; but those who bought a mid-priced bike AND a helmet AND a lock (probably at a special price) are more likely to become loyal registered customers who bring in money for a longer time.
There might be no such relationship, but I want to test it anyway. One way to put the result into practice would be to recommend an extra product (at a good price) during a purchase.
I am learning R during this course. We went through some techniques, and at first I imagined I could use neural networks (just because they sounded the most fun to try), with all these products as a sparse input matrix and the customer clusters as the output. I hoped it would be similar to the examples I had read about, where a sparse matrix of pixels from a picture is the input and the numbers 1-9 are the output, but then I was told that those examples rely on real patterns in images, and in my case I don't even know whether any pattern exists.
Then I thought I could try an ordinal forest, but it doesn't predict my clusters well at all (2 out of 5 clusters get no predictions). That is OK: I don't expect the first purchase to be able to predict a customer's entire future. But I would really like to see whether there are combinations of products that increase the chance that a customer ends up in one of the "higher" clusters on the loyalty scale.
I am not sure if this was clear enough. :) Do you think that there is any way of testing my idea? What could I try to do? Let me know if you need more information.
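One simple thing to try, offered as a sketch rather than a definitive answer: fit a proportional-odds (ordinal logistic) model and compare a main-effects model against one with an explicit product interaction. The data frame and column names below (customers, cluster, bike_mid, helmet, lock) are invented for illustration.

# ordinal logistic regression with MASS::polr; the response must be an ordered factor
library(MASS)
customers$cluster <- factor(customers$cluster, ordered = TRUE)

# main effects only
fit0 <- polr(cluster ~ bike_mid + helmet + lock, data = customers, Hess = TRUE)
# add the three-way interaction, i.e. "does the combination matter beyond the parts?"
fit1 <- polr(cluster ~ bike_mid * helmet * lock, data = customers, Hess = TRUE)

# likelihood-ratio test of whether the interaction terms improve the fit
anova(fit0, fit1)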

KNIME Analytics Platform: What does "pattern" mean in KNIME?

I am working with the KNIME Analytics Platform as part of my project. I am new to this analytics platform.
Prediction Analysis is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. ... Knime is based on the Eclipse platform and provides a visual programming language based on data-flows to create an easy-to-understand analysis process quickly
My Approach
With some existing data I was trying to form a pattern. For example:
There are several customers with amounts pending to be paid, and a few of them have paid. In my case there can be one or more orders per customer.
Say there are customers 1, 2 and 3. Cust_1 has 3 orders, Cust_2 has 2 orders and Cust_3 has 1 order, with some of the order amounts paid and some not paid.
My Question
My question is: can we generate a pattern based on customers?
For example, can I pick out (and colour) the customers with more than 2 orders and arrange them into a pattern? Which nodes in KNIME would build such a pattern?
Can anyone please help with this question?
The patterns in this case are what customers buy together, which are expressed as association rules. These rules can be applied to new data and can help predict future purchases by suggesting the remaining products when one of them is already in the basket.
If more information is available on the customers, it can be used to cluster them based on those properties (in which case the patterns are the similarities between customers), and when a new customer fits into one of those clusters, the most common product(s) in that cluster can be suggested to him or her. The nice thing is that KNIME makes this very easy once you have your data and are familiar with the platform (which is itself user friendly; there are many free resources available: https://www.knime.com/resources).
Obviously other patterns might also be useful. If you have more data, you might see trends (patterns) in individual customers' orders (or in the order amounts, where the ARIMA nodes might be useful) or in the popularity of different products. These can also be called patterns.
For complex models you might need other tools too, such as R or Python. I should emphasize that KNIME has very good PMML support, so you are not tied to a single tool: you can create/train your model in KNIME and use some other tool to make predictions based on that model, or the other way around.
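For comparison, the same association-rule idea looks roughly like this in R with the arules package (the answer above does it with KNIME nodes; the file and column names here are assumptions for illustration):

# assumed layout: one row per (order_id, product) pair in orders.csv
library(arules)
orders <- read.csv("orders.csv", stringsAsFactors = FALSE)
trans  <- as(split(orders$product, orders$order_id), "transactions")

# mine rules such as {bike, helmet} => {lock}; the thresholds are arbitrary
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))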

How do blotter/quantstrat/quantmod/performanceanalytics handle internal cashflows and expiring instruments?

I don't understand how internal cashflows are handled in blotter/quantstrat/quantmod/performanceanalytics. This mainly concerns two aspects: regular cashflows such as dividends, coupons, etc., and cashflows from expiring instruments (e.g. a cash-settled, in-the-money option). For equities this is not much of an issue, as one can always use dividend-adjusted prices and it is relatively rare for stocks to be delisted. For coupon bonds or options, however, I don't see how this is handled.
So my questions are:
Is there a generic mechanism to handle internal cashflows (dividends, coupons, repayments etc.) in these packages?
If so, is there some documentation for this, and where can I find the relevant implementation in the source code (i.e. pointers to specific R files and/or functions would be great)?
Thanks in advance
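Not a full answer, but one concrete entry point worth checking: blotter provides an addDiv() function for booking per-share cashflows such as dividends into a portfolio, and a one-off settlement cashflow (e.g. a cash-settled option at expiry) can be booked as an ordinary closing transaction with addTxn(). A rough sketch; the portfolio and symbol names are placeholders and the argument names follow my reading of the help pages, so verify with ?addDiv and ?addTxn:

library(blotter)

# book a dividend of 1.50 per share held on the ex-date
addDiv(Portfolio = "myport", Symbol = "IBM",
       TxnDate = "2012-05-23", DivPerShare = 1.50)

# close out an expiring, cash-settled position at its settlement value
addTxn(Portfolio = "myport", Symbol = "OPT_XYZ",
       TxnDate = "2012-06-15", TxnQty = -10, TxnPrice = 2.35, TxnFees = 0)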

Get Annual Financial Data for a Stock for many years in R

Suppose I want to regress Gross Profit on Total Revenue in R. I need data for this, and the more, the better.
There is a package on CRAN that I find very useful, quantmod, which does what I need.
library(quantmod)
getFinancials(Symbol = "AMD", src = "google")
# to see the row names of the annual income statement: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit  <- AMD.f$IS$A["Gross Profit", ]
# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue I have is that this package gets me data for only 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other packages) to get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
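For what it's worth, the "spreadsheet first, R second" workflow can be as simple as the sketch below; the file name and column names are invented for illustration:

# annual fundamentals exported from the provider's Excel add-in to CSV,
# one row per fiscal year; columns assumed: fiscal_year, total_revenue, gross_profit
fundamentals <- read.csv("amd_annual_fundamentals.csv")

reg1 <- lm(gross_profit ~ total_revenue, data = fundamentals)
summary(reg1)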
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector. "Sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions. Pretty much every year some accounting regulation or other changes and breaks your data series: last year minority interests were reported in one place, this year the item has moved to another position in the P&L, and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals for example). This just adds another layer of abstraction and typically it is hard to check to see how the data has been manipulated.
In short, getting the data is onerous and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogenous group of companies, this is a can of worms even if you do have the data.

How do you estimate a ROI for clearing technical debt?

I'm currently working with a fairly old product that's been saddled with a lot of technical debt from poor programmers and poor development practices in the past. We are starting to get better and the creation of technical debt has slowed considerably.
I've identified the areas of the application that are in bad shape and I can estimate the cost of fixing those areas, but I'm having a hard time estimating the return on investment (ROI).
The code will be easier to maintain and will be easier to extend in the future but how can I go about putting a dollar figure on these?
A good place to start seems to be going back through our bug tracking system and estimating costs based on the bugs and features related to these "bad" areas. But that seems time-consuming and may not be the best predictor of value.
Has anyone performed such an analysis in the past and have any advice for me?
Managers care about making money, first and foremost through growth (e.g. new features that attract new customers) and, second, through optimizing the process lifecycle.
Your proposal falls into the second category, so it will inevitably be prioritized below goal #1, even if it could save money... because saving money usually implies spending money first (most of the time, at least ;-)).
Now, putting a dollar figure on the "bad technical debt" could be turned around into a more positive spin (assuming the following applies in your case): "if we invest in reworking component X, we could introduce feature Y faster and thus win Z more customers".
In other words, evaluate the cost of technical debt against the cost of lost business opportunities.
Sonar has a great plugin (the Technical Debt plugin) that analyzes your source code for just such a metric. You may not be able to use it in your build, as it is a Maven tool, but it should provide some good metrics.
Here is a snippet of their algorithm:
Debt (in man-days) =
    cost_to_fix_duplications
  + cost_to_fix_violations
  + cost_to_comment_public_API
  + cost_to_fix_uncovered_complexity
  + cost_to_bring_complexity_below_threshold

where:

  Duplications = cost_to_fix_one_block * duplicated_blocks
  Violations   = cost_to_fix_one_violation * mandatory_violations
  Comments     = cost_to_comment_one_API * public_undocumented_api
  Coverage     = cost_to_cover_one_of_complexity * uncovered_complexity_by_tests
                 (80% coverage is the objective)
  Complexity   = cost_to_split_a_method * (function_complexity_distribution >= 8)
               + cost_to_split_a_class * (class_complexity_distribution >= 60)
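Purely as an illustration of the arithmetic, here is the same calculation as a back-of-the-envelope R sketch; every count and unit cost below is an invented number, not a Sonar default:

# invented unit costs, in man-days per item
costs <- c(fix_one_block = 0.5, fix_one_violation = 0.1, comment_one_API = 0.05,
           cover_one_complexity = 0.2, split_a_method = 0.5, split_a_class = 2)

# invented metrics for a hypothetical codebase
m <- c(duplicated_blocks = 120, mandatory_violations = 800,
       public_undocumented_api = 300, uncovered_complexity_by_tests = 450,
       methods_over_threshold = 40, classes_over_threshold = 10)

debt_man_days <- costs["fix_one_block"]        * m["duplicated_blocks"] +
                 costs["fix_one_violation"]    * m["mandatory_violations"] +
                 costs["comment_one_API"]      * m["public_undocumented_api"] +
                 costs["cover_one_complexity"] * m["uncovered_complexity_by_tests"] +
                 costs["split_a_method"]       * m["methods_over_threshold"] +
                 costs["split_a_class"]        * m["classes_over_threshold"]

unname(debt_man_days)   # total estimated technical debt in man-days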
I think you're on the right track.
I've not had to calculate this but I've had a few discussions with a friend who manages a large software development organisation with a lot of legacy code.
One of the things we've discussed is generating some rough effort metrics from analysing VCS commits and using them to divide up a rough estimate of programmer hours. This was inspired by Joel Spolsky's Evidence-based Scheduling.
Doing such data mining would allow you to also identify clustering of when code is being maintained and compare that to bug completion in the tracking system (unless you are already blessed with a tight integration between the two and accurate records).
Proper ROI needs to calculate the full Return, so some things to consider are:
- decreased cost of maintenance (obviously)
- opportunity cost to the business of downtime or missed new features that couldn't be added in time for a release
- ability to generate new product lines due to refactorings
Remember, once you have a rule for deriving data, you can have arguments about exactly how to calculate things, but at least you have some figures to seed discussion!
I can only speak to how to do this empirically in an iterative and incremental process.
You need to gather metrics to estimate your demonstrated best cost per story point. Presumably this represents your system just after the initial architectural churn, when most of the design trial-and-error has been done but entropy has had the least time to cause decay. Find the point in the project history when velocity per team member was highest, and use this as your cost-per-point baseline (zero debt).
Over time, as technical debt accumulates, the velocity/team-size begins to decrease. The percentage decrease of this number with respect to your baseline can be translated into "interest" being paid on each new story point. (This is really interest paid on technical and knowledge debt)
Disciplined refactoring and annealing cause the interest on technical debt to stabilize at some value higher than your baseline. Think of this as the steady-state interest the product owner pays on the technical debt in the system. (The same concept applies to knowledge debt.)
Some systems reach the point where the cost + interest on each new story point exceeds the value of the feature point being developed. This is when the system is bankrupt, and it's time to rewrite the system from scratch.
I think it's possible to use regression analysis to tease apart technical debt and knowledge debt (but I haven't tried it). For example, if you assume that technical debt correlates closely with some code metrics, e.g. code duplication, you could determine the degree the interest being paid is increasing because of technical debt versus knowledge debt.
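To make the velocity-based interest idea above concrete, here is a toy numeric version; the figures are invented:

# best observed velocity per developer per iteration (the zero-debt baseline)
baseline <- 8
# current velocity per developer per iteration
current  <- 6

# fraction of each new story point's cost that is "interest" on accumulated debt
interest <- (baseline - current) / baseline
interest   # 0.25, i.e. about a quarter of the effort goes to servicing debt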
+1 for jldupont's focus on lost business opportunities.
I suggest thinking about those opportunities as perceived by management. What do they think affects revenue growth -- new features, time to market, product quality? Relating debt paydown to those drivers will help management understand the gains.
Focusing on management perceptions will help you avoid false numeration. ROI is an estimate, and it is no better than the assumptions made in its estimation. Management will suspect solely quantitative arguments because they know there's some qualitative in there somewhere. For example, over the short term the real cost of your debt paydown is the other work the programmers aren't doing, rather than the cash cost of those programmers, because I doubt you're going to hire and train new staff just for this. Are the improvements in future development time or quality more important than features these programmers would otherwise be adding?
Also, make sure you understand the horizon for which the product is managed. If management isn't thinking about two years from now, they won't care about benefits that won't appear for 18 months.
Finally, reflect on the fact that management perceptions have allowed this product to get to this state in the first place. What has changed that would make the company more attentive to technical debt? If the difference is you -- you're a better manager than your predecessors -- bear in mind that your management team isn't used to thinking about this stuff. You have to find their appetite for it, and focus on those items that will deliver results they care about. If you do that, you'll gain credibility, which you can use to get them thinking about further changes. But appreciation of the gains might be a while in growing.
Being mostly a lone or small-team developer, this is outside my field, but to me a great way to find out where time is wasted is very, very detailed timekeeping, for example with a handy task-bar tool like this one, which can even filter out when you go to the loo and can export everything to XML.
It may be cumbersome at first, and a challenge to introduce to a team, but if your team can log every fifteen minutes they spend due to a bug, mistake or misconception in the software, you accumulate a basis of impressive, real-life data on what technical debt is actually costing in wages every month.
The tool I linked to is my favourite because it is dead simple (it doesn't even require a database) and provides access to every project/item through a task-bar icon. Additional information about the work carried out can be entered there as well, and timekeeping is literally activated in seconds. (I am not affiliated with the vendor.)
It might be easier to estimate the amount it has cost you in the past. Once you've done that, you should be able to come up with an estimate for the future with ranges and logic even your bosses can understand.
That being said, I don't have a lot of experience with this kind of thing, simply because I've never yet seen a manager willing to go this far in fixing up code. It has always just been something we fix up when we have to modify bad code, so refactoring is effectively a hidden cost on all modifications and bug fixes.
