So I have estimated 1000 cash flows 10 years ahead. These cash flows are contained in a 1000*10 matrix.
For analytical purposes I want to examine the different percentiles/quantiles in terms of cash flows.
Part of my R code:
plot(timeline, TOTKS[1,], ylim=range(-15000000,100000000), type="l", ylab="Cash Flow", xlab="year")
for (i in 1:total.simulations){
lines(timeline, TOTKS[i,])}
Example:
I want to color the 10 % lowest cash flows in the plot above.
I want to examine those 10 % in a new matrix.
Any suggestions?
Regards
By "lowest cashflows" do you mean "lowest total cash flow over all the years" or "lowest single point of cash flow over the years" or ... ?
I'm guessing the former, though what you mean might be something I didn't guess at all. :)
Also, I don't have any data from you, so what's below is guesswork based on your description. In future, you'll get better answers faster if you can put some sample or fake data (in the correct structure) into the question using dput(). But I hope this gets you started.
Edited - changed 10 lowest to lowest 10%
Due to inattention and insufficient coffee, I answered the wrong question. Fixed, I think.
First, find the total cash flow of each simulation. Each row of TOTKS is one simulation over the 10 years, so add up the rows:
total.cashflow <- rowSums(TOTKS, na.rm=TRUE)
Then find which are the 10% lowest:
rank.cashflow.le10 <- which(
  total.cashflow <= quantile(total.cashflow, 0.1)
)
Add them to the plot (matlines draws one line per column, so transpose so that each simulation becomes a column):
matlines(timeline, t(TOTKS[rank.cashflow.le10, , drop=FALSE]), col="red")
To examine those as a separate matrix (though with the vector of indices you might not need to):
low10TOTKS <- TOTKS[rank.cashflow.le10, ]
Hope that helps.
I'm looking for a way to generate different data frames where a variable is distributed randomly among a set number of observations, but where the sum of those values adds up to a predetermined total. More specifically, I'm looking for a way to distribute 20,000,000 votes randomly among 15 political parties. I've looked around the forums a bit but can't seem to find an answer, and while trying to generate the data on my own I've gotten nowhere; I don't even know where to begin. The distribution itself does not matter, though I'd love to be able to influence the way it distributes the votes.
Thank you :)
You could make a vector of 20,000,000 samples of the numbers 1 through 15 and then make a table of them, but this is rather computationally expensive and will result in an unrealistically even split of votes. Instead, you could normalise the cumulative sum of 15 numbers drawn from a uniform distribution and multiply by 20 million. This gives a more realistic spread of votes, with some parties having significantly more votes than others.
my_sample <- cumsum(runif(15))
my_sample <- c(0, my_sample/max(my_sample))
votes <- round(diff(my_sample) * 20000000)
votes
#> [1] 725623 2052337 1753844 61946 1173750 1984897
#> [7] 554969 1280220 1381259 1311762 766969 2055094
#> [13] 1779572 2293662 824096
These add up to 20,000,000 (in general, rounding each share can leave the total off by a few votes, but here it comes out exact):
sum(votes)
#> [1] 2e+07
And we can see quite a "natural looking" spread of votes.
barplot(setNames(votes, letters[1:15]), xlab = "party")
I'm guessing if you substitute rexp for runif in the above solution this would more closely match actual voting numbers in real life, with a small number of high-vote parties and a large number of low-vote parties.
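For example, the same three lines with exponential draws (just a sketch, untested against real election data):
# diff() of the normalised cumsum() recovers the draws themselves, so
# votes end up proportional to the rexp() values: a skewed split with a
# few large parties and many small ones
my_sample <- cumsum(rexp(15))
my_sample <- c(0, my_sample/max(my_sample))
votes <- round(diff(my_sample) * 20000000)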
My problem is as follows:
I have a dataset of 6000 observations containing information about customers (each observation is one client's information).
I'm optimizing a given function (in my case a profit function) in order to find an optimum for my variable of interest. In particular, I'm looking for the optimal interest rate I should offer in order to maximize my expected profits.
I don't have any doubt about my function. The problem is that I don't know how should I proceed in order to apply this function to EACH OBSERVATION in order to obtain an OPTIMAL INTEREST RATE for EACH OF MY 6000 CLIENTS (or observations, as you prefer).
Until now, it has been easy to find the UNIQUE optimum (the same for all clients) that maximizes my profits (the global maximum, I guess). But what I need to know is how to apply my optimization problem to EACH of my 6000 observations INDIVIDUALLY, in order to get the optimal interest rate to offer each customer (that is, 6000 optimal interest rates, one for each of them).
I guess I should do something like a for loop, but my experience in this area is limited and I'm quite frustrated already. What's more, I've tried mapply(myfunction, mydata) as usual, but I only get error messages.
This is what my (really) simple code looks like now:
profits <- function(Rate)
  sum((Amount*(Rate-1.2)/100) *
        (1/(1+exp(0.600002438-0.140799335888812 *
                    ((Previous.Rate - Rate)+(Competition.Rate - Rate))))))
And the result for ONE optimum over the entire sample:
> optimise(profits, lower = 0, upper = 100, maximum = TRUE)
$maximum
[1] 6.644821
$objective
[1] 1347291
So the thing is, how do I rewrite my code in order to maximize this and obtain the optimum of my variable of interest for EACH of my rows?
Hope I've been clear! Thank you all in advance!
It appears each of your customers is independent, so you can just put lapply() around the optimise() call, one call per customer. One catch: profits as written sums over the whole sample and reads Amount, Previous.Rate and Competition.Rate from the global environment, so repeating optimise(profits, ...) unchanged would solve the same aggregate problem 6000 times. Rewrite profits without the sum() so that it scores a single customer (a sketch follows below), then pass each row's values through optimise():
results <- lapply(seq_len(nrow(mydata)), function(i)
  optimise(profits, lower = 0, upper = 100, maximum = TRUE,
           Amount = mydata$Amount[i],
           Previous.Rate = mydata$Previous.Rate[i],
           Competition.Rate = mydata$Competition.Rate[i]))
This will return a very big list, where each element has a $maximum (the optimal rate for that customer) and an $objective (the expected profit at that rate). You can then sum the $objective entries to find just how rich you have become!
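Here is the reworked profits I have in mind (define it before running the lapply() above), plus one way to pull out the results; this is a sketch, with mydata and its column names taken from your question and your formula:

# Per-customer profit: same formula as above, minus the sum(), with the
# customer's values passed in as arguments
profits <- function(Rate, Amount, Previous.Rate, Competition.Rate)
  (Amount*(Rate-1.2)/100) *
    (1/(1+exp(0.600002438-0.140799335888812 *
                ((Previous.Rate - Rate)+(Competition.Rate - Rate)))))

# Optimal rate per customer, and the total expected profit at those rates
optimal.rates <- sapply(results, `[[`, "maximum")
total.profit <- sum(sapply(results, `[[`, "objective"))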
I am working on the EBS Forex market limit order book (LOB); here is an example of the LOB in a 100 millisecond time slice:
datetime | side (0=Bid, 1=Ask) | distance (1=best price, 2=2nd best, etc.) | price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they follow two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the bid or ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
When the last price (only the last) is removed from a given side of the order book, they write a single record with nothing in the price field. Again, there is no record for the rest of the LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066), or a weighted average price (using the sizes at all distances as weights; there is a size column in my real data), for my whole data set. But as you can see, the data has been summarized, which makes this non-routine. I have written code that reconstructs the full (unsummarized) data. This is fine for a small data set, but for a large one it creates a huge file. I was wondering if you have any tips on how to handle the data, i.e. how to fill the gaps efficiently?
You did not give a reproducible example, so what follows is pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(For this to work you might have to transform the date/time into a linear measure, e.g. time in seconds since market opening.)
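For example, one possible conversion (the column name datetime is my assumption; %OS parses the fractional seconds):
# Turn stamps like "2008/01/28,09:11:28.000" into numeric seconds
data$time <- as.numeric(as.POSIXct(data$datetime,
                                   format = "%Y/%m/%d,%H:%M:%OS"))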
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end-of-day particularity, but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
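For instance, a sketch of the bid side, assuming the size column you mention sits alongside price:
# Size-weighted average bid price per timestamp, across all distances
bids <- subset(data, side == 0)
weighted.avg.bid <- aggregate(cbind(psize = price * size, size) ~ time,
                              data = bids, FUN = sum)
weighted.avg.bid$wavg.price <- weighted.avg.bid$psize / weighted.avg.bid$size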
I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time, by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch, and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of the % of TCP_HIT out of the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset, in ten-second intervals (this relies on the order of the factor levels, TCP_HIT=1 and TCP_MISS=2, since levels are alphabetical by default):
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
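For the top-50 restriction in the question, one possibility is to keep only the 50 most frequent URIs before computing the ratios (a sketch, using the u data frame above):
# Keep only the 50 most frequent URIs, dropping unused factor levels so
# the lapply over levels(u$uri) covers just those 50
top50 <- names(sort(table(u$uri), decreasing = TRUE))[1:50]
u <- droplevels(u[u$uri %in% top50, ])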
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
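For instance, a hypothetical data.table version of the per-URI, per-ten-second hit ratio (same column assumptions as the code above):
library(data.table)
dt <- as.data.table(u)
# Hit ratio per URI within ten-second buckets, in one grouped pass
hits <- dt[, .(ratio = mean(action == "TCP_HIT")),
           by = .(uri, bucket = time %/% 10)]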
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot of the actual values; I haven't done any conversions.
for (i in seq_along(h)) {
  name <- unlist(h[[i]][1])                           # the URI for this element
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))  # its time/ratio table
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}
So, this is a bit different from standard fantasy football. What I have is a list of players, their average points per game (PPG) and their salary. I want to maximize points per game under the constraint that my team does not exceed a salary cap. A team consists of 1 QB, 1 TE, 3 WRs, and 2 RBs. So, if we have 15 of each position, there are 15 × 15 × C(15,3) × C(15,2) = 10,749,375 possible teams.
Pretty computationally complex. I can use a bit of branch and bound, i.e. once a team has surpassed the salary cap I can prune that branch, but even with that the algorithm is still pretty slow. I tried another option where I used a genetic algorithm: I made 10 random teams, picked the best one and "mutated" it (randomly changing some of the players) into another 10 teams, picked the best of those, and so on, looping until the points per game of the best team stopped improving.
There must be a better way to do this. I'm not a computer scientist and I've only taken an intro course in algorithmics. Programmers - what are your thoughts? I have a feeling that some sort of application of dynamic programming could help.
Thanks
I think a genetic algorithm, intelligently implemented, will yield an acceptable result for you. You might want to use a metric like points per salary dollar rather than straight PPG to decide the best team; this way you are inherently measuring value added. Also, you should consider running the full algorithm/mutation to satisfactory completion numerous times, so that you can identify which players consistently show up in the final outcomes. Those players should then be valued above the others.
Of course, the problem with the genetic approach is that you need a good mutation algorithm, and that is highly personal to how you want to implement it.
Take i to be the current number of players considered out of the n players, and j to be the remaining salary. Take m[i, j] to be the dynamic programming table of solutions (the best PPG achievable using only the first i players with budget j).
Then m[i, 0] = 0 and m[0, j] = 0, and

m[i, j] = m[i - 1, j]   if the salary of player i is greater than j,
m[i, j] = max(m[i - 1, j], m[i - 1, j - salary of player i] + PPG of player i)   otherwise.
Sorry that I don't know R but I'm good with algorithms so I hope this helps.
A further optimization you can make: you really only need 2 rows of m[i, j], because the DP recurrence only uses the current row and the previous row (you can save memory this way).
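For reference, a minimal R sketch of that recurrence with the two-row optimisation, assuming integer salaries in a vector salary, points in a vector ppg, and an integer cap (and ignoring the position constraints, as the recurrence above does):
# 0/1 knapsack DP over players, keeping only two rows of m[i, j]
best.ppg <- function(salary, ppg, cap) {
  prev <- numeric(cap + 1)        # previous row, budgets 0..cap
  for (i in seq_along(salary)) {
    cur <- prev                   # not taking player i keeps the old values
    s <- salary[i]
    if (s <= cap) {
      j <- (s + 1):(cap + 1)      # budget indices where player i fits
      cur[j] <- pmax(prev[j], prev[j - s] + ppg[i])
    }
    prev <- cur
  }
  prev[cap + 1]                   # best total PPG within the cap
}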
First of all, the number of variations you computed overstates the problem: there is no point in distinguishing line-ups that merely reorder players within the same position.
Cristiano Ronaldo, Suarez and Messi will give you the same sum of fantasy points in any line-up, like:
Cristiano Ronaldo, Suarez and Messi
or
Suarez, Cristiano Ronaldo and Messi
or
Messi, Suarez, Ronaldo
First step: eliminate these redundant orderings.
Next step: calculate the average price, and build the team one player at a time, adding the player with the lower salary but higher points. When you reach the salary limit, remove an expensive player and add a cheaper one with the same fantasy points, and so on. Don't enumerate the variations; instead, value each player by a weight combining salary and fantasy points.
Does this help? It sets up the constraints and maximises points.
You could adapt it to get the data out of Excel.
http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams
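The linked post formulates this as an integer program; here is a minimal sketch of that kind of formulation using the lpSolve package (the vectors ppg, salary and pos and the scalar cap are hypothetical names for your player data):
library(lpSolve)
# One binary variable per player: 1 if picked, 0 if not. Maximise total
# PPG subject to the salary cap and the exact position counts.
const.mat <- rbind(salary,
                   as.numeric(pos == "QB"),
                   as.numeric(pos == "TE"),
                   as.numeric(pos == "WR"),
                   as.numeric(pos == "RB"))
const.dir <- c("<=", "=", "=", "=", "=")
const.rhs <- c(cap, 1, 1, 3, 2)
sol <- lp("max", ppg, const.mat, const.dir, const.rhs, all.bin = TRUE)
team <- which(sol$solution > 0.5)   # indices of the selected players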