Let's say that I have a diminishing value that should be portrayed both on a monthly basis and on a weekly basis.
For example, I know that a value, say 100 000, diminishes by 30%/year, which (by the normal "periodic compound" formulas) works out to 2.21%/month and 0.51%/week.
However, when I run these calculations over an entire year, I do not get the same end value from the weekly and monthly versions. Only if I calculate it as "interest" (= the percentage is ADDED to the value, NOT taken away) do the weekly and monthly calculations match.
What is the correct formula for calculating this "compound taxation" problem?
I don't know if I fully understand your question.
You cannot calculate a diminishing value the way you did.
If your value (100 000) diminishes by 30 %/ year this means that at the end of year 1 your value is 70 000.
The way you calculated your compound rate would only be correct if diminishing by 30% meant 100000/1.3.
Your mistake:
You made your calculation this way:
(1+x)^12 - 1 = 30%, so x = 0.0221: the monthly interest is 2.21%
(1+x)^52 - 1 = 30%, so x = 0.0051: the weekly interest is 0.51%
But what you should have done is:
(1-x)^12 = 1 - 30%, so x = 0.0292: the monthly rate is 2.92%
(1-x)^52 = 1 - 30%, so x = 0.0068: the weekly rate is 0.68%
You cannot calculate the compound interest as if it was increasing 30% when it's decreasing 30%.
It's easy to understand that the compound rate for an increasing value will be smaller than the one for a decreasing value:
Example:
Let's say your investment makes 30% per year.
At the end of the first month you will have more money, and therefore you are investing more, so you need a smaller return to make as much money as in the first month.
Therefore for increasing interest the compound rate i = 2.21 is smaller than 30/12 = 2.5.
By the same reasoning, for decreasing interest i = 2.92 > 30/12 = 2.5.
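To see the difference numerically, here is a small Python sketch (nothing here comes from a library; it just solves both equations and compounds the question's 100 000 over a year):

```python
start = 100_000

# The question's approach: solve (1+x)^12 - 1 = 0.30 ...
x_wrong = 1.30 ** (1 / 12) - 1        # ~ 0.0221
# ... versus solving (1-x)^12 = 0.70
x_right = 1 - 0.70 ** (1 / 12)        # ~ 0.0293

# Compound each monthly rate as a *decrease* over 12 months:
after_year_wrong = start * (1 - x_wrong) ** 12   # ~ 76 473: did not shrink enough
after_year_right = start * (1 - x_right) ** 12   # ~ 70 000, as required
```

The 2.21% rate only reproduces 70 000 when it is added rather than subtracted, which is exactly the mismatch described in the question.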
note:
(1+x)^12 - 1 = 30% is not equivalent to (1-x)^12 = 1 - 30%.
A decrease is not simply the inverse of an increase:
if it were, adding 10% to one and then taking 10% away from the result would return one:
(1+10%)/(1+10%) = 1
But that is not how it's actually calculated, and the result differs: (1+10%)*(1-10%) = 0.99
Hope I understood your question and that this helps.
Engaging psychic debugging...
diminishes by 30 %/year. Which when I calculate (by normal "periodic compound" formulas) is 2.21 %/month and 0.51 %/week.
You are doing an inappropriate calculation.
You are correct in saying that 30% annual growth is approx 2.21% monthly growth. The reason for this is because 30% annual growth is expressed as multiplication by 1.30 (since 100% + 30% = 130%, or 1.30), and making this monthly is:
1.30 ^ (1/12) = 1.0221 (approx)
However, it does not follow from this that 30% annual shrinkage is approx 2.21% monthly shrinkage. To work out the monthly shrinkage we must note that 30% shrinkage is multiplication by 0.70 (since 100% - 30% = 70%, or 0.70), and make this monthly in the same way:
0.70 ^ (1/12) = 0.9707 (approx)
Multiplication by 0.9707 is monthly shrinkage of 2.929% (approx).
Hopefully this will give you the tools you need to correct your calculations.
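As a quick numerical check (a Python sketch), the monthly and weekly shrinkage factors derived this way both land on the same year-end value, which is what the original question was missing:

```python
start = 100_000

monthly_factor = 0.70 ** (1 / 12)   # ~ 0.9707, i.e. ~ 2.93% monthly shrinkage
weekly_factor = 0.70 ** (1 / 52)    # ~ 0.9932, i.e. ~ 0.68% weekly shrinkage

end_monthly = start * monthly_factor ** 12   # ~ 70 000
end_weekly = start * weekly_factor ** 52     # ~ 70 000
```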
Main question: Suppose you have a discrete, finite data set $d$. Then the command summary(d) returns the Min, 1st quartile, Median, mean, 3rd quartile, and max. My question is: what formula does R use to compute the 1st quartile?
Background: My data set was: d=c(1,2,3,3,4,9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than or equal to q1. Clearly, this is not what R is using. So, I was wondering, what formula does R use to compute the first quartile?
Google searches on this topic have left me even more puzzled, and I couldn't find the formula that R uses. Typing help(summary) in R wasn't helpful to me either.
General discussion:
There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.
As a result, the wide variety of packages between them use many different definitions.
The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions for the quantile function, and mentions which of a number of common packages use which definitions. Its introduction says:
the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as
$\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}$, where $\frac{j-m}{n} \leq p < \frac{j-m+1}{n}$ (1)
for some $m \in \mathbb{R}$ and $0 \leq \gamma \leq 1$.
Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).
In R:
In particular, R offers all nine definitions mentioned in Hyndman & Fan, with type 7 as the default. From Hyndman & Fan we see:
Definition 7. Gumbel (1939) also considered the modal position
$p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.
What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.
We can see that in R:
> quantile(1:9)
0% 25% 50% 75% 100%
1 3 5 7 9
For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:
> quantile(1:10)
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0
> quantile(1:12)
0% 25% 50% 75% 100%
1.00 3.75 6.50 9.25 12.00
When k isn't integer, it's taking a weighted average of the adjacent order statistics, in proportion to the fraction it lies between them (that is, it does linear interpolation).
The nice thing is that on average you get 3 times as many observations above the first quartile as you get below. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them into the ratio 3:1.
What's happening with your sample data
You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, it takes 25% of the way between the second and third observation (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.
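The type-7 rule is easy to reproduce outside R. Here is a Python sketch (the function name is my own) that matches the values worked out above:

```python
def quantile_type7(data, p):
    """R's default (type 7) sample quantile: linear interpolation
    between order statistics at fractional index (n - 1) * p."""
    xs = sorted(data)
    n = len(xs)
    h = (n - 1) * p          # fractional position among the order statistics
    j = int(h)               # lower neighbouring order statistic
    g = h - j                # interpolation weight
    if g == 0 or j + 1 >= n:
        return float(xs[j])
    return (1 - g) * xs[j] + g * xs[j + 1]

print(quantile_type7([1, 2, 3, 3, 4, 9], 0.25))   # 2.25, matching summary(d)
print(quantile_type7(range(1, 10), 0.25))         # 3.0, matching quantile(1:9)
```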
Under the hood: some R details:
When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.
The summary.data.frame function (ultimately -- via summary.default applied to each column) calls quantile to compute quartiles (you won't see this in the help, unfortunately, since ?summary.data.frame simply takes you to the summary help and that doesn't give you any details on what happens when summary is applied to a numeric vector -- this is one of those really bad spots in the help).
So ?quantile (or help(quantile)) describes what R does.
Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:
All sample quantiles are defined as weighted averages of consecutive order
statistics. Sample quantiles of type i are defined by:
Q[i](p) = (1 - γ) x[j] + γ x[j+1],
where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic,
n is the sample size, the value of γ is a function of j = floor(np + m) and
g = np + m - j, and m is a constant determined by the sample quantile type.
Second, there's specific information about method 7:
Type 7
m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.
Reference:
[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4. (Nov.), pp. 361-365
I am trying to find the value of a parameter where all constituents of a portfolio with a parameter smaller than this value sum up to 25% of the entire portfolio.
Of course, using R.
For example, suppose I have a portfolio for a market index with 100 constituent equity securities. 25% of this portfolio means 25% of its total market value. I want to find the value of a parameter, let's say the P/E ratio, such that all securities with a P/E ratio smaller than that value contribute 25% of the portfolio's market value.
I know that I can use the 'quantile' function to find the 25% value for the P/E ratio,
Bottom_Quartile <- data[data$PE < quantile(data$PE, 0.25), ]
But that value does not necessarily correspond to 25% of market value, and the result generated above does not give a total market value equal to 25% of the original total market value.
I am wondering if there is a function that can help us find the value of a specific parameter (in this case P/E ratio) based on the given quantile of another parameter (in this case 25% Market Value).
Thanks in advance for your help!!!
All right, here is an update.
I have tried a few approaches; the following one worked fine, however, it looks very inefficient.
#compute % market weight first
data$MarketValuePct <- data$MarketValue / sum(data$MarketValue)
#sort the data by ascending P/E
sort_data <- data[order(data$PE),]
#calculate cumulative sum for percentage market value
sort_data$cumsum_MarketValuePct <- cumsum(sort_data$MarketValuePct)
#Locate the edge of the bottom quartile
Bottom_quartile <- sort_data[sort_data$cumsum_MarketValuePct < 0.25, ]
#The last value of PE is what I was looking for
bottom_quartile_PE <- Bottom_quartile[nrow(Bottom_quartile), "PE"]
I am still looking for a more elegant and more accurate way of doing this. Please kindly help if you have any idea.
Thank you!
You do not want a quantile of the market value, but 25% of the total market value.
You can use order to sort the stocks by increasing PE, and cumsum to compute
the total market value of stocks below a certain PE.
i <- order(PE)
tail( which( cumsum(MCap[i]) < .25 * sum(MCap) ), 1 )
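If you'd rather see the logic spelled out step by step, here is a Python sketch of the same idea; the pe and mcap numbers are made up for illustration:

```python
# Hypothetical P/E ratios and market values for a five-stock portfolio.
pe = [8.0, 15.0, 5.0, 30.0, 12.0]
mcap = [100.0, 50.0, 60.0, 25.0, 125.0]

# Sort positions by ascending P/E, then walk up the cumulative market
# value until it reaches 25% of the total.
order = sorted(range(len(pe)), key=lambda i: pe[i])
total = sum(mcap)

cum = 0.0
cutoff_pe = None
for i in order:
    cum += mcap[i]
    if cum >= 0.25 * total:
        cutoff_pe = pe[i]
        break

print(cutoff_pe)   # 8.0: stocks with P/E <= 8 make up the bottom 25% of value
```

Note this reports the first P/E at which the cumulative value reaches 25%; the R one-liner above instead reports the last position strictly below the 25% mark, so the two differ by one position at the boundary.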
Given the results for a simple A / B test...
A B
clicked 8 60
ignored 192 1940
( i.e. a conversion rate of 4% for A and 3% for B )
... a fisher test in R quite rightly says there's no significant difference
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?
I could just increase the A values (in their proportion) until I get to it, but there has to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
Two-sample comparison of proportions power calculation
n = 5300.739
p1 = 0.04
p2 = 0.03
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
That gives 5301, which is for each group, so your sample size needs to be 10600. Subtracting out the 2200 that have already run, you have 8400 "tests" to go.
In this case:
sig.level is the same as your p-value.
power is the likelihood of finding significant results that exist within your sample. This is somewhat arbitrary, 80% is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size to reach your desired significance level.
If you wanted to decide how much longer it will take to reach significance, divide 8400 by the number of impressions per day. That can help determine whether it's worthwhile to continue the test.
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
This is a base R function, so you won't need to install or load any packages. Other than that I can't say how similar this is to pwr.2p2n.test().
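If you want the same number without R: power.prop.test uses the standard normal-approximation sample-size formula for comparing two proportions, which is short enough to write out directly. Here is a Python sketch (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample test of
    proportions (normal approximation, no continuity correction)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

print(n_per_group(8 / 200, 60 / 2000))   # ~ 5300.7, matching power.prop.test
```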
I have a stream of data that trends over time. How do I determine the rate of change using C#?
It's been a long time since calculus class, but now is the first time I actually need it (in 15 years). Now when I search for the term 'derivatives' I get financial stuff, and other math things I don't think I really need.
Mind pointing me in the right direction?
If you want something more sophisticated that smooths the data, you should look into a digital filter algorithm. It's not hard to implement if you can cut through the engineering jargon. The classic method is Savitzky-Golay.
If you have the last n samples stored in an array y and each sample is equally spaced in time, then you can calculate the derivative using something like this:
deriv = 0
coefficients = (1, -8, 0, 8, -1)
N = 5   # points
h = 1   # seconds between samples
for i in range(N):
    deriv += y[i] * coefficients[i]
deriv /= (12 * h)
This example happens to be an N=5 "3/4 (cubic/quartic)" filter. The bigger N is, the more points it averages and the smoother the result will be, but the latency will also be higher: you'll have to wait N/2 points to get the derivative at time "now".
For more coefficients, look at the Appendix of
https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
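As a quick sanity check (in Python), applying the five-point filter above to a parabola recovers the exact derivative at the window centre, since the filter is exact for polynomials up to quartics:

```python
# y = t^2 sampled at t = 0..4 with h = 1; the true derivative at the
# window centre (t = 2) is 2 * t = 4.
coefficients = (1, -8, 0, 8, -1)
h = 1
y = [t * t for t in range(5)]   # 0, 1, 4, 9, 16

deriv = sum(c * v for c, v in zip(coefficients, y)) / (12 * h)
print(deriv)   # 4.0
```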
You need both the data value V and the corresponding time T, at least for the latest data point and the one before that. The rate of change can then be approximated with Euler's backward formula, which translates into
dvdt = (V_now - V_a_moment_ago) / (T_now - T_a_moment_ago);
in C#.
Rate of change is calculated as follows:
Calculate a delta, such as "price minus price 20 days ago"
Calculate the rate of change, such as "delta / price 99 days ago"
Total rate of change, i.e. (new_value - original_value)/time?
Under the user generated posts on my site, I have an Amazon-like rating system:
Was this review helpful to you: Yes | No
If there are votes, I display the results above that line like so:
5 of 8 people found this reply helpful.
I would like to sort the posts based upon these rankings. If you were ranking from most helpful to least helpful, how would you order the following posts?
a) 1/1 = 100% helpful
b) 2/2 = 100% helpful
c) 999/1000 = 99.9% helpful
d) 3/4 = 75% helpful
e) 299/400 = 74.8% helpful
Clearly, it's not right to sort just on the percent helpful; somehow the total number of votes should be factored in. Is there a standard way of doing this?
UPDATE:
Using Charles' formulas to calculate the Agresti-Coull lower range and sorting on it, this is how the above examples would sort:
1) 999/1000 (99.9%) = 95% likely to fall in 'helpfulness' range of 99.2% to 100%
2) 299/400 (74.8%) = 95% likely to fall in 'helpfulness' range of 69.6% to 79.3%
3) 3/4 (75%) = 95% likely to fall in 'helpfulness' range of 24.7% to 97.5%
4) 2/2 (100%) = 95% likely to fall in 'helpfulness' range of 23.7% to 100%
5) 1/1 (100%) = 95% likely to fall in 'helpfulness' range of 13.3% to 100%
Intuitively, this feels right.
UPDATE 2:
From an application point of view, I don't want to be running these calculations every time I pull up a list of posts. I'm thinking I'll either update and store the Agresti-Coull lower bound either on a regular, cron-driven schedule (updating only those posts which have received a vote since the last run) or update it whenever a new vote is received.
For each post, generate bounds on how helpful you expect it to be. I prefer to use the Agresti-Coull interval. Pseudocode:
float AgrestiCoullLower(int n, int k) {
//float conf = 0.05; // 95% confidence interval
float kappa = 2.24140273; // In general, kappa = ierfc(conf/2)*sqrt(2)
float kest=k+kappa^2/2;
float nest=n+kappa^2;
float pest=kest/nest;
float radius=kappa*sqrt(pest*(1-pest)/nest);
return max(0,pest-radius); // Lower bound
// Upper bound is min(1,pest+radius)
}
Then take the lower end of the estimate and sort on this. So the 2/2 is (by Agresti-Coull) 95% likely to fall in the 'helpfulness' range 23.7% to 100%, so it sorts below the 999/1000 which has range 99.2% to 100% (since .237 < .992).
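For reference, the pseudocode above translated into runnable Python reproduces the numbers quoted here:

```python
from math import sqrt

def agresti_coull_lower(n, k, kappa=2.24140273):
    """Lower bound of the Agresti-Coull interval for k successes in n
    trials; kappa as in the pseudocode above (ierfc(conf/2) * sqrt(2))."""
    kest = k + kappa ** 2 / 2
    nest = n + kappa ** 2
    pest = kest / nest
    radius = kappa * sqrt(pest * (1 - pest) / nest)
    return max(0.0, pest - radius)

print(round(agresti_coull_lower(2, 2), 3))       # 0.237: the 2/2 review
print(round(agresti_coull_lower(1000, 999), 3))  # 0.992: the 999/1000 review
```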
Edit: Since some people seem to have found this helpful (ha ha), let me note that the algorithm can be tweaked based on how confident/risk-averse you want to be. The less confidence you need, the more willing you will be to abandon the 'proven' (high-vote) reviews for the untested but high-scoring reviews. A 90% confidence interval gives kappa = 1.95996398, an 85% confidence interval gives 1.78046434, a 75% confidence interval gives 1.53412054, and the all-caution-to-the-wind 50% confidence interval gives 1.15034938.
The 50% confidence interval gives
1) 999/1000 (99.7%) = 50% likely to fall in 'helpfulness' range of 99.7% to 100%
2) 299/400 (72.2%) = 50% likely to fall in 'helpfulness' range of 72.2% to 77.2%
3) 2/2 (54.9%) = 50% likely to fall in 'helpfulness' range of 54.9% to 100%
4) 3/4 (45.7%) = 50% likely to fall in 'helpfulness' range of 45.7% to 91.9%
5) 1/1 (37.5%) = 50% likely to fall in 'helpfulness' range of 37.5% to 100%
which isn't that different overall, but it does prefer the 2/2 to the safety of the 3/4.
This question is probably better asked on http://stats.stackexchange.com .
I guess you still want to order by increasing 'helpfulness'.
If you want to know how precise a given number is, the simplest way is to use the square root of the variance of the Binomial distribution with n equal to the total number of responses and p the fraction of responses which were 'helpful'.
A very simple solution would be to ignore everything with less than a cut-off amount of votes, and then sort by percentage.
For example (require at least five votes)
1. 99.9% (1000 votes)
2. 74.8% (400 votes)
3-5. waiting for five votes
It depends on the expected rate of positive feedback and the number of the people that vote on average.
If, like in the example you give, you are going to have sometimes 5 or 10 people voting and other times 1000, then I would suggest the Wilson midpoint:
(x + z^2/2) / (n + z^2), the midpoint of the adjusted Wald interval / Wilson score interval
where:
n = Sum(all_votes),
x = Sum(positive_votes),
z = 1.96 (fixed value)
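Applied to the example votes from the question, the Wilson midpoint produces the following ordering (a Python sketch; the function name is mine):

```python
def wilson_midpoint(positive, total, z=1.96):
    """Midpoint of the Wilson score interval; positive is the count of
    positive votes and total is the number of votes cast."""
    return (positive + z ** 2 / 2) / (total + z ** 2)

votes = [(1, 1), (2, 2), (999, 1000), (3, 4), (299, 400)]
ranked = sorted(votes, key=lambda v: wilson_midpoint(*v), reverse=True)
print(ranked)   # 999/1000, then 299/400, then 2/2, then 3/4, then 1/1
```

Like the 50% Agresti-Coull interval above, this prefers the untested 2/2 over the 3/4.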