Conditional reordering of values within a column (permutation) in R

I am working on a research paper on graph manipulation and I have the following data:
         returns 1+returns   cum_return     price  period_ret(step=25)
1    7.804919e-03 1.0078049  0.007804919 100.78355                   NA
2    3.560800e-03 1.0035608  0.011393511 101.14306                   NA
3   -1.490719e-03 0.9985093  0.009885807 100.99239                   NA
.   -2.943304e-03 0.9970567  0.006913406 100.69558                   NA
.    1.153007e-03 1.0011530  0.008074385 100.81175                   NA
.   -2.823012e-03 0.9971770  0.005228578 100.52756                   NA
25  -7.110762e-03 0.9928892 -0.001919363  99.81526             -0.02364
.   -1.807268e-02 0.9819273 -0.019957356  98.02754                   NA
.   -3.300315e-03 0.9966997 -0.023191805  97.70455                   NA
250  5.846750e-03 1.0058467 -0.017480652  98.27748              0.12125
These are 250 daily stock returns, the cumulative return, the price and the 25-day period returns (returns over days 0-25, 25-50, ..., 225-250).
What I want to do is the following:
I want to rearrange the returns, but the set of period returns should stay identical, although their order can change. So there are 10! possible orderings of the ten 25-day subsets.
What I have done so far: I wrote code using the sample, repeat and identical functions; here is a shortened version:
repeat {
  temp <- tibble(returns = sample(x$returns, 250, replace = TRUE))
  # the resampled series needs its own period returns; here they are assumed
  # to compound the daily returns within each 25-day block
  temp$period_ret <- NA_real_
  ends <- seq(25, 250, by = 25)
  temp$period_ret[ends] <- sapply(ends, function(i) prod(1 + temp$returns[(i - 24):i]) - 1)
  if (identical(sort(round(x$period_ret[!is.na(x$period_ret)], 2)),
                sort(round(temp$period_ret[!is.na(temp$period_ret)], 2)))) break
}
This took me quite some time and unfortunately it isn't of any real use. Only later did I think about the math: there are 250! possible samples, so I would be waiting days for any result.
What do I need this for?
I would like to create graphs with different orderings of the returns. That way, all the graphs have the same summary statistics but look different. It's important that they have the same period returns (regardless of their order) to fulfil a utility formula.
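A minimal sketch of the block-wise reshuffle this describes, assuming x$returns holds the 250 daily returns in chronological order: treat the ten 25-day windows as whole units and permute them, so every period return reappears in the new series, only in a different position.

set.seed(1)                                # optional, for reproducibility
block_id  <- rep(1:10, each = 25)          # days 1-25, 26-50, ..., 226-250
blocks    <- split(x$returns, block_id)    # ten blocks of 25 daily returns
new_order <- sample(10)                    # one of the 10! block orderings
temp <- tibble::tibble(returns = unlist(blocks[new_order], use.names = FALSE))

Because the compounded return of a block, prod(1 + returns) - 1, does not depend on the order of the returns within it, you could additionally shuffle the days inside each block before reassembling and the set of 25-day period returns would still be preserved.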

Calculate similarity within a dataframe across specific rows (R)

I have a dataframe that looks something like this:
df <- data.frame("index" = 1:10,
                 "title" = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
                             "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
                             "Dragons Den", "Peaky Blinders"),
                 "date" = c("01/01/20", "01/01/20", "01/01/20", "01/01/20", "01/01/20",
                            "02/01/20", "02/01/20", "02/01/20", "02/01/20", "02/01/20"))
The output looks like this:
Index Title Date
1 Sherlock 01/01/20
2 Peaky Blinders 01/01/20
3 Eastenders 01/01/20
4 BBC News 01/01/20
5 Antiques Roadshow 01/01/20
6 Eastenders 02/01/20
7 BBC News 02/01/20
8 Casualty 02/01/20
9 Dragons Den 02/01/20
10 Peaky Blinders 02/01/20
I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).
It's probably also worth mentioning that the actual dataframe is much larger, and has 120 titles per day, and spans some 700 days. I need to compare the "titles" of each "date" with the previous "date" and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on...
Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and I'll have to somehow add it into the actual data itself.
Here is another possibility. You can create a simple function to calculate the similarity (or another index) between groups. Then split your data frame by date into a list and lapply the custom function over the remaining elements of the list (the final result will be a list).
# split the titles into one vector per date
s <- split(df$title, df$date)

# share of day i's titles that already appeared on day i-1
calc_similar <- function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}

# apply to every date except the first, naming the results by date
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])
Output
$`2020-01-02`
[1] 0.6
I have come up with this solution. However, I'm unsure how it will work when the number of records per day differs (i.e. you have 8 titles for day n and 15 titles for day n+1). I guess you would want to normalise with respect to the day with more records. Anyway, here it comes:
divide <- split.data.frame(df, as.factor(df$date))
similarity <- vector()
for (i in 1:(length(divide) - 1)) {
  # shared titles, normalised by the larger of the two consecutive days
  index <- sum(divide[[i]]$title %in% divide[[i + 1]]$title) /
    max(length(divide[[i]]$title), length(divide[[i + 1]]$title))
  similarity <- c(similarity, index)
}
similarity
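Run on the example df above, similarity comes out as 0.6, matching the 60% worked out in the question.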

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and I am looking to run full-season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week depending on whether a team won or lost. For example, let's say the Bills and Ravens each start Sunday's game with a rating of 100; after the Ravens win, their rating increases to 120 and the Bills' rating decreases to 80.
While running the simulation, I would like to update each team's rating throughout, in order to get a more accurate representation of the number of ways a season could play out, but I am not sure how to include something like this within the loop.
My loop for the 2017 season:
full.sim <- NULL
for (i in 1:10000) {
  nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
  nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
  winningteams <- table(nflpredictions$winner)
  projectedwins <- data.frame(Team = names(winningteams), Wins = as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017, where I already have the full season's worth of data, but I am having trouble adapting the model to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings each week, something along the lines of:
full.sim <- NULL
for (i in 1:10000) {
  for (j in 1:nrow(nflpredictions)) {   # the inner loop needs its own index
The idea being to update a team's rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire data frame. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above it. So what would be the easiest way to update the ratings each week based on the result of the last game each team played in?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!
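A minimal sketch of the nested-loop idea, assuming hypothetical starting ratings and the ±20 adjustment from the Bills/Ravens example; the plogis() call is only a placeholder for the win probability that the fitted GLM would supply for each matchup.

# hypothetical starting ratings keyed by team abbreviation
init_ratings <- c(BAL = 105, BUF = 85, NE = 120, HOU = 90, LAC = 100, JAX = 100)

full.sim <- NULL
for (i in 1:10000) {
  ratings <- init_ratings                 # reset before every simulated season
  sim <- nflpredictions
  sim$winner <- NA_character_
  for (g in seq_len(nrow(sim))) {
    home <- as.character(sim$HomeTeam[g])
    away <- as.character(sim$AwayTeam[g])
    # win probability from the current ratings; replace this placeholder
    # logistic with the fitted GLM's prediction for this matchup
    p_home <- plogis((ratings[home] - ratings[away]) / 25)
    win <- rbinom(1, 1, p_home)
    sim$winner[g] <- if (win == 1) home else away
    loser <- if (win == 1) away else home
    # update the ratings so later weeks see the result of this game
    ratings[sim$winner[g]] <- ratings[sim$winner[g]] + 20
    ratings[loser] <- ratings[loser] - 20
  }
  winningteams <- table(sim$winner)
  projectedwins <- data.frame(Team = names(winningteams), Wins = as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}

The aggregation into expected wins then proceeds exactly as in the 2017 loop above.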

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in how optimal my solution is. Even though it works, it could be (a bit) long, and I'd like your advice on whether the way I solved it is the best one; it could also help me learn new techniques and functions in R.
I have a dataset on students identified by their id, with the school they are matched to and the score they obtained on a specific test (in short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the students at the school they are matched to. So for each school I take the average score of the students matched to it, and then I average these school averages within percentile classes (yes, a school's average can appear more than once in this calculation). In plain English it answers: "a student in the x-th percentile of scores is on average matched to a school of this average quality".
Here is an example:
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students having a score between 0 and 15, I take the average of the average score of the schools they are matched to (note that school b's average appears twice, but that's fine).
Here is how I did it:
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.1), na.rm = TRUE))
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- 0
for (i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is the loop; maybe apply() would be better, but I'm not sure how to use it here (I tried to code my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. tapply outputs a vector named by the levels of the factor you grouped by, and indexing AvgeSchScore with the values of match picks out each student's school average by name.
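With the small example from the question, this looks like the following (the printed values are the school averages for a, b and c):

AvgeSchScore
#    a    b    c
# 16.5  6.0 24.0
AvgeSchScore[match]   # one school average per student, selected by name
#    a    b    a    b    c
# 16.5  6.0 16.5  6.0 24.0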
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it after.
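For example (a small sketch, storing the chained result first):

res <- dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
res[!is.na(scoreQuant)]   # drop the row whose quantile bin is NA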

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
# this function returns the number of unique hospital visits associated with
# the supplied record number
countVisits <- function(record.number) {
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}

recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers) {
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package. It's faster on large data frames, and I find the logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
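For example, applied to the sample dd from the previous answer (uniqueN() is data.table's shortcut for length(unique(...))):

library(data.table)
dd.dt <- as.data.table(dd)
dd.dt[, .(Number.Of.Visits = uniqueN(Patient.ID.Visit)), by = Medical.Record.Number]
#    Medical.Record.Number Number.Of.Visits
# 1:                AAAXXX                3
# 2:                AAAZZZ                1
# 3:                AAABBB                2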

HoltWinters initial values not matching Rob Hyndman's theory

I am following this tutorial by Rob Hyndman for initialization (additive).
I am working through the steps to calculate the initial values manually (with pen and paper) on the data set provided in Rob Hyndman's free online textbook.
I used the same data set in R, but the seasonal values R outputs are drastically different from the ones I obtain after the first two steps.
Not sure what I am doing wrong. Any help would be appreciated.
Another interesting thing I have observed just now: the initial level (l(t)) in the textbook is 33.8, but in the R output it is 48.24, which proves that I am missing something in my manual calculation.
EDIT:
Here is how I am calculating the moving-average smooth (based on the formula used in Section 2 of this link).
After calculating it, I de-trended, meaning original value minus smoothed value.
Then the seasonal values, which are:
S1 = average of the Q1 values
S2 = average of the Q2 values
...
The first two values of your moving average are incorrect. You have assumed that the values prior to the first observation are zero. They are not zero, they are missing, which is quite different. It is impossible to compute the moving average for the first two observations for this reason.
The third and subsequent values of your moving average are only approximately correct because you have rounded the data to the first decimal point instead of using the data as provided in the fpp package in R.
The values obtained following this procedure are used as initial values in the optimization within ets(). So the output from ets() will not contain the initial values but the optimized values. The table in the book gives the optimized values. You will not be able to reproduce them using a simple procedure.
However, you can reproduce what is provided by HoltWinters because it does not do any optimization of initial values. Using HoltWinters, the initial seasonal values are given as:
> HoltWinters(y)$fitted[1:4,]
xhat level trend season
[1,] 43.73934 33.21330 1.207739 9.318302
[2,] 28.25863 35.65614 1.376490 -8.774002
[3,] 36.86581 37.57569 1.450688 -2.160566
[4,] 41.87604 38.83521 1.424568 1.616267
(The output in coefficients gives the final states not the initial states.)
The seasonal indices in the last column can be computed as follows:
y MAsmooth detrend detrend.adj
41.72746 NA NA NA
24.04185 NA NA NA
32.32810 34.41724 -2.089139 -2.160566
37.32871 35.64101 1.687695 1.616267
46.21315 36.82342 9.389730 9.318302
29.34633 38.04890 -8.702575 -8.774002
36.48291 NA NA NA
42.97772 NA NA NA
The last column is the adjusted detrended data (so they add to zero).
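For reference, a short sketch that reproduces the table above, assuming y holds the eight observations shown in its first column:

y <- c(41.72746, 24.04185, 32.32810, 37.32871,
       46.21315, 29.34633, 36.48291, 42.97772)
# centred 2x4 moving average for quarterly data
MAsmooth <- stats::filter(y, c(0.5, 1, 1, 1, 0.5) / 4, sides = 2)
# detrend, then centre the seasonal effects so they add to zero; with only
# two years of data each quarter appears once, so the adjustment is simply
# the overall mean of the detrended values
detrend <- y - MAsmooth
detrend.adj <- detrend - mean(detrend, na.rm = TRUE)
cbind(y, MAsmooth, detrend, detrend.adj)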
