I have been struggling with this problem for quite a while and any help would be much appreciated.
I am trying to write a function to calculate a transition matrix from observed data for a Markov model.
The initial data I am using to build the function look something like this:
    Season Team              State
1   1      Manchester United 1
2   1      Chelsea           1
3   1      Manchester City   1
.
.
.
99  5      Charlton Athletic 4
100 5      Watford           4
with 5 seasons and 4 states.
I know how I am going to calculate the transition matrix, but in order to do this I need to count the number of teams that move from state i to state j for each season.
I need code that will do something like this,
a<-function(x,i,j){
if("team x is in state i in season 1 and state j in season 2") 1 else 0
}
sum(a)
and then I could do this for each team and pair of states and repeat for all 5 seasons. However, I am having a hard time getting my head around how to tell R the thing in quotation marks. Sorry if there is a really obvious answer but I am a rubbish programmer.
Thanks so much for reading!
This function tells you if a team made the transition from state1 to state2 from season1 to season2:
a <- function(team, state1, state2, data, season1, season2) {
  # keep only the rows that belong to the given team
  team.rows <- data[data$Team == team, ]
  # 1 for every row where the team is in state1 during season1, otherwise 0
  in.season1.in.state1 <- ifelse(team.rows$Season == season1 & team.rows$State == state1, 1, 0)
  # 1 for every row where the team is in state2 during season2, otherwise 0
  in.season2.in.state2 <- ifelse(team.rows$Season == season2 & team.rows$State == state2, 1, 0)
  return(sum(in.season1.in.state1) * sum(in.season2.in.state2))
}
In the first line I select all rows of a particular team.
The second line determines, for each of those rows, whether the team is in state1 in season1.
The third line determines, for each of those rows, whether the team is in state2 in season2.
The return statement returns 1 if the team was in state1 in season1 and in state2 in season2, and 0 otherwise. This only works if each team appears at most once per season; with duplicate rows it might return a value greater than 1.
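If the goal is the full matrix rather than a single team and pair of states, the per-team checks can be skipped. The following is only a sketch, assuming a data frame with the Season, Team and State columns from the question; the toy values and the names `nxt`, `pairs`, `counts` and `P` are made up for illustration. It pairs every season with the following one via merge() and tabulates the state pairs.

```r
# Toy data (hypothetical values): 3 teams, 3 seasons, 2 states
data <- data.frame(
  Season = rep(1:3, each = 3),
  Team   = rep(c("A", "B", "C"), times = 3),
  State  = c(1, 1, 2,  1, 2, 2,  2, 2, 1),
  stringsAsFactors = FALSE
)
states <- sort(unique(data$State))

# Shift a copy back by one season so that season s lines up with season s + 1
nxt <- data
nxt$Season <- nxt$Season - 1

# One row per team and pair of consecutive seasons
pairs <- merge(data, nxt, by = c("Team", "Season"), suffixes = c(".from", ".to"))

# Count the moves from state i (rows) to state j (columns)
counts <- table(
  from = factor(pairs$State.from, levels = states),
  to   = factor(pairs$State.to,   levels = states)
)

# Row-normalise the counts to get estimated transition probabilities
P <- prop.table(counts, margin = 1)
```

A state that is never observed as a starting state has a row of zero counts, so its row in `P` comes out as NaN and would need separate handling.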
Related
I'm trying to make my life easier by writing a menu creator, which is supposed to generate weekly menus from a list of my favourite dishes, in order to get a little more variety in my life.
I gave every dish a value for approximately how many days it lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried solutions for knapsack functions from here, including dynamic programming, but I'm not experienced enough to get the hang of it. The problem is that all of these solutions return only the single most efficient option, not every combination that fills the knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function of the adagio library as an example; all other solutions I found to the knapsack problem behaved the same way
most_expensive_food <- knapsack(days_the_food_lasts,price_of_the_food,food_needed_for_days)
data[data$rownumber %in% most_expensive_food$indices, ]
#output
       dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel                   2                20         1
2    Burger                   2                20         2
3     Steak                   1                40         3
4     Salad                   1                10         4
6    Salmon                   1                18         6
Simplified:
I need a solution to a single objective single Knapsack problem, which returns all possible combinations of dishes which add up to 7 days of food.
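One way to approach this is a plain recursive enumeration instead of an optimising knapsack solver. The sketch below is not from the adagio package; `find_menus` is a hypothetical helper, and it assumes each dish is used at most once per menu and that a menu must add up to exactly 7 days. It returns every set of row indices whose days sum to the target.

```r
# Data from the question
dish <- c('Schnitzel', 'Burger', 'Steak', 'Salad', 'Falafel', 'Salmon',
          'Mashed potatoes', 'MacnCheese', 'Hot Dogs')
days_the_food_lasts <- c(2, 2, 1, 1, 3, 1, 2, 2, 4)
food_needed_for_days <- 7

# Hypothetical helper: list every combination of dishes whose days sum to target
find_menus <- function(days, target) {
  results <- list()
  recurse <- function(start, remaining, chosen) {
    if (remaining == 0) {
      # the chosen dishes fill the week exactly, so store this combination
      results[[length(results) + 1]] <<- chosen
      return(invisible(NULL))
    }
    if (start > length(days)) return(invisible(NULL))
    for (k in start:length(days)) {
      # only add dishes that still fit, and only look at later indices
      # so that each combination is produced once
      if (days[k] <= remaining) {
        recurse(k + 1, remaining - days[k], c(chosen, k))
      }
    }
  }
  recurse(1, target, integer(0))
  results
}

menus <- find_menus(days_the_food_lasts, food_needed_for_days)

# Show the first few menus by dish name
lapply(head(menus, 3), function(idx) dish[idx])
```

The number of combinations can grow exponentially with the number of dishes, so this is only practical for short lists like the one above.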
Thank you very much in advance
This is my first question on stackoverflow, so feel free to criticize the question.
For every row in a data set, I would like to sum the rows that:
have identical 'team', 'season' and 'simulation_ID'.
have 'match_ID' smaller than (and not equal to) the current 'match_ID'.
such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).
I am having trouble implementing the second condition without using an extremely slow for-loop.
The data looks like this:
match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1
My current solution is:
simulation$accumulated_points <- 0
for (row in 1:nrow(simulation)) {
  simulation$accumulated_points[row] <-
    sum(simulation$team_points[simulation$season==simulation$season[row] &
                               simulation$match_ID<simulation$match_ID[row] &
                               simulation$simulation_ID==simulation$simulation_ID[row] &
                               simulation$team==simulation$team[row]], na.rm = TRUE)
}
This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?
Explicit for loops are often slow in interpreted languages like R, especially when every iteration scans the whole data set as yours does, and are best avoided where a vectorized alternative exists. "Vectorized operations" apply a function to a whole vector at once rather than to each element separately. Native functions in R and popular packages often rely on optimized compiled code and linear algebra libraries under the hood, so they run much faster than an equivalent loop written in R. For example, compiled code can let your CPU process several vector elements at the same time rather than going one by one as in a for loop. You can find more information about vectorization in this question.
In your specific case, you could use dplyr, for example, to transform your data:
library(dplyr)
simulation %>%
  # you want to perform the same operation for each of the groups
  group_by(team, season, simulation_ID) %>%
  # within each group, order the data by match_ID (ascending)
  arrange(match_ID) %>%
  # take the vector team_points in each group, then calculate its cumsum
  # cumsum() includes the current match, so subtract team_points to keep
  # only the points earned in earlier matches
  # write the result into a new column named "points"
  mutate(points = cumsum(team_points) - team_points)
The code above essentially decomposes the team_points column into one vector for each group that you care about, then applies a single, highly optimized operation to each of them.
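The same grouped running total can also be sketched in base R with ave(), in case adding a dependency is not an option. The toy data frame below is made up for illustration and only carries the columns that matter here. Subtracting `p` from `cumsum(p)` leaves only the points from earlier matches, which is what the question asks for, and this assumes each match_ID appears at most once per team, season and simulation_ID.

```r
# Toy data (hypothetical values), deliberately not sorted by match_ID
simulation <- data.frame(
  match_ID      = c(1, 2, 3, 1, 2, 3),
  season        = "2020-2021",
  simulation_ID = 1,
  team          = c("Liverpool", "Liverpool", "Liverpool",
                    "Burnley", "Burnley", "Burnley"),
  team_points   = c(3, 1, 0, 0, 1, 3),
  stringsAsFactors = FALSE
)

# Sort by match so the running total follows the order of play
simulation <- simulation[order(simulation$match_ID), ]

# Within each team/season/simulation group: points earned before this match
simulation$accumulated_points <- ave(
  simulation$team_points,
  simulation$team, simulation$season, simulation$simulation_ID,
  FUN = function(p) cumsum(p) - p
)
```

Like the grouped cumsum, this makes one pass per group instead of rescanning the whole data frame for every row.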
Sorry I do not know how to properly title my question. It is easier to understand with an example.
Sample data
Consider the following example.
> l_ids=as.data.frame(cbind(a=c("strong","intense","intensity"),
id=c("1","2","3"),new_id=c("","1","2")),stringsAsFactors = FALSE)
          a id new_id
1    strong  1
2   intense  2      1
3 intensity  3      2
I would like to update the id of each word in a with a new_id, if it applies. Consider this as a synonym dictionary. As I iterate over new_id;
> for (i in 1:nrow(l_ids)){
+ if (nchar(l_ids$new_id[i])>0){
+ l_ids$id[i]=l_ids$new_id[i]
+ }
+ }
> l_ids
          a id new_id
1    strong  1
2   intense  1      1
3 intensity  2      2
The problem is that I would like for intensity to also be given a 1. Is there a way to do this without having to iterate multiple times?
Update on background
I have a document where I have a list of synonyms. These are synonyms only relevant to the field of application of the problem. Example:
> dictionary
     good       bad
1  strong   intense
2 intense intensity
3   light      soft
I am then given a list of words, each with a given id. My task is to check if any of those words is in the bad column of dictionary and, if so, update it with the id of the word to its left. As can be seen, intensity would need two steps to become strong (a good word in the dictionary). Is there a way to do so without having to do multiple iterations? (say, a for loop)
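One possible approach, sketched here under assumptions rather than as a definitive answer, is to turn the dictionary into a named lookup vector and keep replacing bad words until none are left. This still loops, but once per level of the synonym chain rather than once per row. `resolve_synonyms` and `lookup` are hypothetical names, and the sketch assumes the dictionary has no cycles (a cycle such as a -> b -> a would never terminate).

```r
# Dictionary from the question
dictionary <- data.frame(
  good = c("strong", "intense", "light"),
  bad  = c("intense", "intensity", "soft"),
  stringsAsFactors = FALSE
)

# Named vector: the names are the bad words, the values are their replacements
lookup <- setNames(dictionary$good, dictionary$bad)

# Hypothetical helper: follow each word along the chain until it is a good word
resolve_synonyms <- function(words, lookup) {
  repeat {
    hit <- words %in% names(lookup)
    if (!any(hit)) break
    # replace every remaining bad word in one vectorized step
    words[hit] <- unname(lookup[words[hit]])
  }
  words
}

resolved <- resolve_synonyms(c("intensity", "soft", "strong"), lookup)
```

Here "intensity" needs two passes (intensity, then intense, then strong), while "soft" needs one and "strong" is left alone. Once every word is resolved, the ids can be taken from the resolved words in a single match() call.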
I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week depending on whether a team won or lost. For example, let's say the Bills and Ravens each started Sunday's game with a rating of 100; after the Ravens win, their rating increases to 120 and the Bills' rating decreases to 80.
While running the simulation, I would like to update each team's rating as the season progresses in order to get a more accurate representation of the number of ways a season could play out, but I am not sure how to include something like this within the loop.
My loop for the 2017 season:
full.sim <- NULL
for(i in 1:10000){
  nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
  nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
  winningteams <- table(nflpredictions$winner)
  projectedwins <- data.frame(Team=names(winningteams), Wins=as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017, where I already have the full season's worth of data, but I am having trouble adapting the model to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings for each week, something along the lines of
full.sim <- NULL
for(i in 1:10000){
  for(j in 1:nrow(nflpredictions)){
The idea is to update a team's rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire data frame. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above it. So what would be the easiest way to update the ratings each week based on the result of the last game that team played in?
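One way to sidestep the "row that is not directly above" problem is to keep the current ratings in a named vector keyed by team instead of in the rows, so each game simply looks up both teams' latest values. The following is a minimal sketch under several assumptions: `win_prob` is a hypothetical stand-in for the GLM, the starting ratings for LAC and JAX are made up, and the 20-point step is taken from the Bills and Ravens example.

```r
# Schedule from the question; ratings live outside the data frame
schedule <- data.frame(
  Week     = c(1, 1, 2, 2),
  HomeTeam = c("BAL", "NE", "BUF", "JAX"),
  AwayTeam = c("BUF", "HOU", "LAC", "NE"),
  stringsAsFactors = FALSE
)
start_ratings <- c(BAL = 105, BUF = 85, NE = 120, HOU = 90, LAC = 100, JAX = 100)

# Hypothetical stand-in for the GLM: maps a rating gap to a home win probability
win_prob <- function(home_rating, away_rating) {
  plogis((home_rating - away_rating) / 50)
}

# Simulate one season, updating the ratings after every game
simulate_season <- function(schedule, ratings, step = 20) {
  wins <- setNames(numeric(length(ratings)), names(ratings))
  for (g in seq_len(nrow(schedule))) {
    home <- schedule$HomeTeam[g]
    away <- schedule$AwayTeam[g]
    # look up each team's latest rating, whenever its previous game was
    p_home <- win_prob(ratings[[home]], ratings[[away]])
    home_wins <- rbinom(1, 1, p_home) == 1
    winner <- if (home_wins) home else away
    loser  <- if (home_wins) away else home
    ratings[[winner]] <- ratings[[winner]] + step
    ratings[[loser]]  <- ratings[[loser]] - step
    wins[[winner]] <- wins[[winner]] + 1
  }
  list(wins = wins, ratings = ratings)
}

set.seed(42)
result <- simulate_season(schedule, start_ratings)
```

Calling `simulate_season()` inside the outer 10,000-iteration loop, each time starting from `start_ratings`, and summing the `wins` vectors would then give the expected wins per team.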
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1    BAL      BUF      105        85         .60      .40
1    NE       HOU      120        90         .65      .35
2    BUF      LAC      NA         NA         NA       NA
2    JAX      NE       NA         NA         NA       NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!
My problem is that I need to work out how much a point is worth based on the number of games played.
When a team plays a match it gets 3 points for a win, 1 point for a tie and 0 points for a loss.
The problem is the following:
Team 1
Wins:8 Tie:2 Loss:3 Points:26 Played Games: 13
Team 2
Wins:8 Tie:3 Loss:4 Points:27 Played Games: 15
Here you can see that Team 2 has 1 more point than Team 1. But Team 2 has played 2 more matches and has a lower win % than Team 1. If you ranked the two by points alone, Team 2 would get a higher "rating" than Team 1.
So how should the math look to make this fair, so that Team 1 gets a better score than Team 2?
Just divide by the number of games to get the average points per game played.
Team1: 2.0 ppg
Team2: 1.8 ppg
Okay, first of all, thanks for the help.
The solution I ended up with is the following:
(p / pg) * p = real points
p = sum of points,
pg = played games
So for the example above, the real points will be:
Team 1: 52
Team 2: 48.6
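Both measures can be checked with a few lines of R. This is just a sketch using the numbers from the example; the data frame and column names are made up for illustration.

```r
# Totals from the example above
teams <- data.frame(
  team   = c("Team 1", "Team 2"),
  points = c(26, 27),
  played = c(13, 15),
  stringsAsFactors = FALSE
)

# Average points per game played
teams$ppg <- teams$points / teams$played

# "Real points" as defined above: (p / pg) * p
teams$real_points <- teams$ppg * teams$points

# Rank by points per game, best first
teams[order(-teams$ppg), ]
```

Both columns put Team 1 ahead of Team 2, matching the figures above.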