Find a parameter value based on weight of a portfolio - r

I am trying to find the value of a parameter such that all constituents of a portfolio with a parameter value smaller than this threshold sum up to 25% of the entire portfolio.
Of course, using R.
For example, I have a portfolio for a market index that has 100 constituent equity securities. 25% of this portfolio means 25% of its total market value. I want to find the value of a parameter, let's say the P/E ratio, such that all securities with a P/E ratio smaller than that value contribute 25% of this portfolio's market value.
I know that I can use the 'quantile' function to find the 25% quantile of the P/E ratio,
Bottom_Quartile <- data[data$PE < quantile(data$PE, 0.25), ]
But that value does not necessarily correspond to 25% of the market value, and the subset generated above does not have a total market value equal to 25% of the original total market value.
I am wondering if there is a function that can help us find the value of a specific parameter (in this case the P/E ratio) based on a given cumulative share of another parameter (in this case 25% of market value).
Thanks in advance for your help!!!
All right, here is an update.
I have tried some methods to do this; the following one worked fine, however it looks very inefficient.
#compute % market weight first
data$MarketValuePct <- data$MarketValue / sum(data$MarketValue)
#sort the data in ascending order of P/E
sort_data <- data[order(data$PE),]
#calculate cumulative sum for percentage market value
sort_data$cumsum_MarketValuePct <- cumsum(sort_data$MarketValuePct)
#Locate the edge of bottom quartile
Bottom_quartile <- sort_data[sort_data$cumsum_MarketValuePct < 0.25, ]
#The last (largest) P/E in this subset is the value I was looking for
Bottom_Quartile_PE <- Bottom_quartile[nrow(Bottom_quartile), "PE"]
I am still looking for a more elegant and more accurate way of doing this. Please kindly help if you have any idea.
Thank you!

You do not want a quantile of the market value, but 25% of the total market value.
You can use order to sort the stocks by increasing PE, and cumsum to compute
the total market value of stocks below a certain PE.
i <- order(PE)
tail( which( cumsum(MCap[i]) < .25 * sum(MCap) ), 1 )
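Putting those two pieces together with the data frame from the question (columns PE and MarketValue), a hedged sketch of the whole lookup:
sorted <- data[order(data$PE), ]                                    # sort by increasing P/E
cum_weight <- cumsum(sorted$MarketValue) / sum(sorted$MarketValue)  # cumulative market-value share
cutoff_index <- tail(which(cum_weight < 0.25), 1)                   # last security still below 25%
cutoff_PE <- sorted$PE[cutoff_index]                                # the P/E threshold you were after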

Related

Time series: how to get a single number?

I need your help!
I have to replicate a table in which there are some variables (GDP growth, labor share, etc.).
I have a database that contains, for each country, the time series (1960-2000) of the base variables.
How can I compute a single value starting from a time series?
For example:
I need to compute the Capital Share (K / Y) of the USA.
In the database I have the time series of K and the time series of Y.
If I compute K / Y in R, I get a time series as a result, but I need a single number instead (as in the table that I have to replicate).
Should I take the average of all the observations in K and the average of all the observations in Y and then take the ratio?
Or should I compute the difference between the first observation and the last one for both variables and then take the ratio?
Thanks to anyone who will help me.
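For concreteness, the two options described above would look roughly like this in R (a sketch using hypothetical K and Y series; the real ones would come from the database):
K <- ts(runif(41, 1000, 2000), start = 1960)   # hypothetical capital series, 1960-2000
Y <- ts(runif(41, 3000, 5000), start = 1960)   # hypothetical output series, 1960-2000

ratio_of_means   <- mean(K) / mean(Y)                               # option 1: ratio of period averages
ratio_of_changes <- (K[length(K)] - K[1]) / (Y[length(Y)] - Y[1])   # option 2: ratio of first-to-last changes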

Trying to coerce the data onto a Gaussian curve and the results are not as expected

This is not a question about curve fitting. Instead, what I have is a collection of 60 different sites from which I can collect maximum, minimum and average temperatures. I need to use this data to calculate the operating temperature of a photovoltaic cell; it doesn't make sense to do this with the average temperatures, however, because they include values from after sunset. Instead, I first create a "fake" average temperature (this is our "fake average", totalityoftemperatures_fakemeans), which is the average of the maximum and minimum temperatures. I then calculate an adjusted minimum temperature by subtracting one standard deviation (assuming 6 * sd = max - min), and finally calculate an "adjusted" mean temperature, which is the average of the new minimum (fake mean - 1 * sd) and the pre-existing maximum temperature (this is our "adjusted mean").
What really bothers me is that this re-calculated average ought to be higher than the "fake" mean; after all, it is the average of the adjusted minimum and the original maximum value. I might also cross-post this to the statistics stack exchange, but I'm pretty sure this is a coding issue right now. Can anyone look at the R code below?
#The first data sets of maxima and minima are taken from empirical data
for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_fakemeans[i, j] = mean(totalityoftemperatures_maxima[i, j], totalityoftemperatures_minima[i, j])
  }
}

totality_onesigmaDF = abs((1/6)*(totalityoftemperatures_maxima - totalityoftemperatures_minima))
totalityoftemperatures_adjustedminima = totalityoftemperatures_fakemeans - totality_onesigmaDF

for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_adjustedmeans[i, j] = mean(totalityoftemperatures_adjustedminima[i, j], totalityoftemperatures_maxima[i, j])
  }
}
#The second calculation of the average should be higher than "fake" but that is not the case
I think your problem lies in your use of the mean function. When you do this:
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
You are calling mean with two arguments. The first argument is the vector of numbers to average; a second positional argument gets matched to mean's trim argument, so the second value is never included in the average. Look:
mean(2, 100)
#[1] 2
Whereas if you concatenate the values into a single vector, you get the right answer:
mean(c(2, 100))
#[1] 51
So you need to change
mean(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j])
to
mean(c(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j]))
and
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
to
mean(c(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j]))
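As a side note, if the maxima and minima objects are numeric matrices or data frames of identical dimensions (which the [i,j] indexing suggests), the loops can be avoided entirely; a sketch under that assumption:
# Element-wise arithmetic replaces both nested loops
totalityoftemperatures_fakemeans      <- (totalityoftemperatures_maxima + totalityoftemperatures_minima) / 2
totality_onesigmaDF                   <- abs(totalityoftemperatures_maxima - totalityoftemperatures_minima) / 6
totalityoftemperatures_adjustedminima <- totalityoftemperatures_fakemeans - totality_onesigmaDF
totalityoftemperatures_adjustedmeans  <- (totalityoftemperatures_adjustedminima + totalityoftemperatures_maxima) / 2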

Determining percentile based on reference table

I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a csv to use as a dataset in R. The data is simply different age ranges (e.g. 3 months to 6 months, or 1 year to 2 years) and then the heart rate at the 1st, 10th, 25th, 50th, 75th, 90th and 99th percentiles for that age range.
I want to compare a patient's data with this reference table to tell me what percentile they are at. Since this is a perfectly normal distribution, I don't think it's a very hard task, but it's outside of my R experience and I can't seem to find any good information on how to accomplish this.
Based on what you explained, I can suggest this simple function that takes the heart rate and the age range of your patient and returns the percentile based on a normal density for that specific range.
my.quantile <- function(myrange, heart.rate) {
  table <- data.frame('range' = c("range1", "range2", "range3"),
                      'mean'  = c(120, 90, 60),
                      'sd'    = c(12, 15, 30))
  res <- pnorm(q = heart.rate,
               mean = subset(table, range == myrange)$mean,
               sd   = subset(table, range == myrange)$sd)
  return(res * 100)
}
### my.quantile("range1", 140)
### [1] 95.22096
From what you say, if it is perfectly normal you just need the mean and variance of each range, right? You can adapt it for the respiratory rate.
EDIT: in order to retrieve the normal distribution parameters from your quantile table, given the hypothesis that the quantiles you've got are rather precise:
i/ Your mean parameter is exactly the 50th percentile
ii/ You find the standard deviation by taking any other percentile, for instance let's assume your 90th percentile is 73 beats and the 50th is 61 beats:
(73-61)/qnorm(0.90)
### [1] 9.36365
9.36 is your standard deviation. From here it shouldn't be very hard to automate it.
Note: if your percentile data are not very precise, you may want to repeat the operation for each percentile value and take the average.
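A hedged sketch of that automation (the age-range labels and percentile values below are invented for illustration; in practice they would come from the csv reference table):
ref <- data.frame(range = c("3-6 months", "1-2 years"),   # hypothetical reference rows
                  p50   = c(134, 119),                    # 50th-percentile heart rate
                  p90   = c(156, 140))                    # 90th-percentile heart rate

ref$mean <- ref$p50                              # i/ mean = 50th percentile
ref$sd   <- (ref$p90 - ref$p50) / qnorm(0.90)    # ii/ sd recovered from another percentile

percentile_for <- function(myrange, heart.rate) {
  row <- ref[ref$range == myrange, ]
  100 * pnorm(heart.rate, mean = row$mean, sd = row$sd)
}

percentile_for("3-6 months", 150)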

Compound interest but with a twist: "compound tax"

Let's say that I have a diminishing value that should be portrayed both on a monthly basis and on a weekly basis.
For example, I know that the value, say 100 000, diminishes by 30%/year, which when I calculate (by normal "periodic compound" formulas) is 2.21%/month and 0.51%/week.
However, looking at the results of these calculations (calculating over an entire year) I do not get the same end value. Only if I calculate it as an "interest" (i.e. the percentage is ADDED to the value, NOT taken away) do I get matching values from the weekly and monthly calculations.
What is the correct formula for calculating this "compound taxation" problem?
I don't know if I fully understand your question.
You cannot calculate diminishing interest the way you did.
If your value (100 000) diminishes by 30%/year, this means that at the end of year 1 your value is 70 000.
The way you calculated your compound rate would only work if diminishing by 30% meant 100000/1.3.
Your mistake:
You made your calculation this way:
(1+x)^12 - 1 = 30%, so x = 0.0221 and the monthly interest is 2.21%
(1+x)^52 - 1 = 30%, so x = 0.0051 and the weekly interest is 0.51%
But what you should have done is:
(1-x)^12 = 1 - 30%, so x = 0.0292 and the monthly interest is 2.92%
(1-x)^52 = 1 - 30%, so x = 0.0068 and the weekly interest is 0.68%
You cannot calculate the compound interest as if the value were increasing by 30% when it is decreasing by 30%.
It's easy to understand why the compound rate for an increasing value is smaller than the one for a decreasing value:
Example:
Let's say your investment makes 30% per year.
At the end of the first month you will have more money, and therefore you are investing more, so you need a smaller return to make as much money as in the first month.
Therefore, for increasing interest, the compound rate i = 2.21% is smaller than 30/12 = 2.5%.
The same reasoning applies to the decreasing case: i = 2.92% > 30/12 = 2.5%.
Note:
(1+x)^12 - 1 = 30% is not equivalent to (1-x)^12 = 1 - 30%.
A decrease cannot simply be treated as a negative increase:
following your approach, adding 10% to 1 and then taking 10% away from the result should return 1, which actually requires dividing by 1.10:
(1+10%)/(1+10%) = 1
Treating the decrease as a simple -10% does not give the same result: (1+10%)*(1-10%) = 0.99.
Hope I understood your question and that this helps.
Engaging psychic debugging...
You wrote: "diminishes by 30%/year. Which when I calculate (by normal 'periodic compound' formulas) is 2.21%/month and 0.51%/week."
You are doing an inappropriate calculation.
You are correct in saying that 30% annual growth is approx 2.21% monthly growth. The reason for this is because 30% annual growth is expressed as multiplication by 1.30 (since 100% + 30% = 130%, or 1.30), and making this monthly is:
1.30 ^ (1/12) = 1.0221 (approx)
However, it does not follow from this that 30% annual shrinkage is approx 2.21% monthly shrinkage. To work out the monthly shrinkage we must note that 30% shrinkage is multiplication by 0.70 (since 100% - 30% = 70%, or 0.70), and make this monthly in the same way:
0.70 ^ (1/12) = 0.9707 (approx)
Multiplication by 0.9707 is monthly shrinkage of 2.929% (approx).
Hopefully this will give you the tools you need to correct your calculations.
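A quick numerical check of the corrected rates in R (a small sketch, not part of either answer):
annual_factor <- 0.70                       # the value keeps 70% after one year
monthly_rate  <- 1 - annual_factor^(1/12)   # about 0.0293, i.e. 2.93% shrinkage per month
weekly_rate   <- 1 - annual_factor^(1/52)   # about 0.0068, i.e. 0.68% shrinkage per week

100000 * (1 - monthly_rate)^12              # 70000
100000 * (1 - weekly_rate)^52               # 70000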

How should I order these "helpful" scores?

Under the user generated posts on my site, I have an Amazon-like rating system:
Was this review helpful to you: Yes | No
If there are votes, I display the results above that line like so:
5 of 8 people found this reply helpful.
I would like to sort the posts based upon these rankings. If you were ranking from most helpful to least helpful, how would you order the following posts?
a) 1/1 = 100% helpful
b) 2/2 = 100% helpful
c) 999/1000 = 99.9% helpful
d) 3/4 = 75% helpful
e) 299/400 = 74.8% helpful
Clearly, it's not right to sort just on the percent helpful; the total number of votes should somehow be factored in. Is there a standard way of doing this?
UPDATE:
Using Charles' formulas to calculate the Agresti-Coull lower range and sorting on it, this is how the above examples would sort:
1) 999/1000 (99.9%) = 95% likely to fall in 'helpfulness' range of 99.2% to 100%
2) 299/400 (74.8%) = 95% likely to fall in 'helpfulness' range of 69.6% to 79.3%
3) 3/4 (75%) = 95% likely to fall in 'helpfulness' range of 24.7% to 97.5%
4) 2/2 (100%) = 95% likely to fall in 'helpfulness' range of 23.7% to 100%
5) 1/1 (100%) = 95% likely to fall in 'helpfulness' range of 13.3% to 100%
Intuitively, this feels right.
UPDATE 2:
From an application point of view, I don't want to be running these calculations every time I pull up a list of posts. I'm thinking I'll update and store the Agresti-Coull lower bound either on a regular, cron-driven schedule (updating only those posts which have received a vote since the last run) or whenever a new vote is received.
For each post, generate bounds on how helpful you expect it to be. I prefer to use the Agresti-Coull interval. Pseudocode:
float AgrestiCoullLower(int n, int k) {
    //float conf = 0.05; // 95% confidence interval
    float kappa = 2.24140273; // In general, kappa = ierfc(conf/2)*sqrt(2)
    float kest = k + kappa^2/2;
    float nest = n + kappa^2;
    float pest = kest/nest;
    float radius = kappa*sqrt(pest*(1-pest)/nest);
    return max(0, pest - radius); // Lower bound
    // Upper bound is min(1, pest + radius)
}
Then take the lower end of the estimate and sort on this. So the 2/2 is (by Agresti-Coull) 95% likely to fall in the 'helpfulness' range 23.7% to 100%, so it sorts below the 999/1000 which has range 99.2% to 100% (since .237 < .992).
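For reference, a direct R translation of that pseudocode (a sketch; kappa defaults to the answer's 95% value and can be swapped for any of the values listed in the edit below):
AgrestiCoullLower <- function(n, k, kappa = 2.24140273) {
  kest   <- k + kappa^2 / 2
  nest   <- n + kappa^2
  pest   <- kest / nest
  radius <- kappa * sqrt(pest * (1 - pest) / nest)
  max(0, pest - radius)               # lower bound; upper bound is min(1, pest + radius)
}

# The examples from the question, sorted by the lower bound
votes <- data.frame(helpful = c(1, 2, 999, 3, 299), total = c(1, 2, 1000, 4, 400))
votes$lower <- mapply(AgrestiCoullLower, votes$total, votes$helpful)
votes[order(votes$lower, decreasing = TRUE), ]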
Edit: Since some people seem to have found this helpful (ha ha), let me note that the algorithm can be tweaked based on how confident/risk-averse you want to be. The less confidence you need, the more willing you will be to abandon the 'proven' (high-vote) reviews for the untested but high-scoring reviews. A 90% confidence interval gives kappa = 1.95996398, an 85% confidence interval gives 1.78046434, a 75% confidence interval gives 1.53412054, and the all-caution-to-the-wind 50% confidence interval gives 1.15034938.
The 50% confidence interval gives
1) 999/1000 (99.7%) = 50% likely to fall in 'helpfulness' range of 99.7% to 100%
2) 299/400 (72.2%) = 50% likely to fall in 'helpfulness' range of 72.2% to 77.2%
3) 2/2 (54.9%) = 50% likely to fall in 'helpfulness' range of 54.9% to 100%
4) 3/4 (45.7%) = 50% likely to fall in 'helpfulness' range of 45.7% to 91.9%
5) 1/1 (37.5%) = 50% likely to fall in 'helpfulness' range of 37.5% to 100%
which isn't that different overall, but it does prefer the 2/2 to the safety of the 3/4.
This question is probably better asked on http://stats.stackexchange.com .
I guess you still want to order by increasing 'helpfulness'.
If you want to know how precise a given number is, the simplest way is to use the square root of the variance of the Binomial distribution with n equal to the total number of responses and p the fraction of responses which were 'helpful'.
A very simple solution would be to ignore everything with fewer than a cut-off number of votes, and then sort by percentage.
For example (require at least five votes)
1. 99.9% (1000 votes)
2. 74.8% (400 votes)
3-5. waiting for five votes
It depends on the expected rate of positive feedback and the number of the people that vote on average.
If, as in the example you give, you are sometimes going to have 5 or 10 people voting and other times 1000, then I would suggest the Wilson midpoint:
(x+z^2/2)/(n+z^2) The midpoint of the Adjusted Wald Interval / Wilson Score
where:
n = Sum(all_votes),
x = Sum(positive_votes),
z = 1.96 (fixed value)
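In R that midpoint is a one-liner (a sketch, with x as the count of positive votes and n the total number of votes):
wilson_midpoint <- function(x, n, z = 1.96) {
  (x + z^2 / 2) / (n + z^2)
}

wilson_midpoint(c(1, 2, 999, 3, 299), c(1, 2, 1000, 4, 400))   # the examples from the question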

Resources