I made a minimal reproducible example; my real data is much larger.
ac_1 <-c(0.1, 0.3, 0.03, 0.03)
ac_2 <-c(0.2, 0.4, 0.1, 0.008)
ac_3 <-c(0.8, 0.043, 0.7, 0.01)
ac_4 <-c(0.2, 0.73, 0.1, 0.1)
c_2<-c(1,2,5,23)
check_1<-c(0.01, 0.902,0.02,0.07)
check_2<-c(0.03, 0.042,0.002,0.00001)
check_3<-c(0.01, 0.02,0.5,0.001)
check_4<-c(0.001, 0.042,0.02,0.2)
id<-1:4
df<-data.frame(id,ac_1, ac_2,ac_3,ac_4,c_2,check_1,check_2,check_3,check_4)
So the data frame looks like this:
> df
id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
1 1 0.10 0.200 0.800 0.20 1 0.010 0.03000 0.010 0.001
2 2 0.30 0.400 0.043 0.73 2 0.902 0.04200 0.020 0.042
3 3 0.03 0.100 0.700 0.10 5 0.020 0.00200 0.500 0.020
4 4 0.03 0.008 0.010 0.10 23 0.070 0.00001 0.001 0.200
What I want to do is:
if check_1 is 0.02, set the corresponding ac_1 value to missing (NA).
if check_2 is 0.02, set the corresponding ac_2 value to missing (NA).
I want to do the same for every pair of "check" and "ac" columns.
For example, in the check_1 column, the person with id 3 has 0.02,
so this person's ac_1 score (0.03) should become missing (NA).
In the check_3 column, the person with id 2 has 0.02,
so this person's ac_3 score should become missing.
In the check_4 column, the person with id 3 has 0.02,
so this person's ac_4 score should become missing.
What I tried is as follows:
for(i in 1:4){
if(paste0("df$check_",i)==0.02){
paste0("df$ac_",i)==NA
}
}
But, it did not work...
You're really close, but you're off on a few fundamentals.
You can't (easily) use strings to refer to objects, so "df$check_1" won't work. You can use strings to refer to column names, but not with $, you need to use [ or [[, so df[["check_1"]] will work.
if isn't vectorized, so it won't work on each value in a column. Use ifelse instead, or even better in this case we can skip the if entirely.
Using == on non-integer numbers is risky due to precision issues. We'll use a tolerance instead.
A minor issue: paste0("df$ac_",i)==NA doesn't assign anything, because == tests equality. You need = or <- for assignment on that line.
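Two of these points are easy to check in the console. The snippet below is a small sketch, not part of the original answer; the `demo` data frame and the `col_name` variable are made up for illustration.

```r
# Floating-point equality: the stored value of 0.1 + 0.2 is not exactly 0.3
exact_match <- (0.1 + 0.2) == 0.3
tolerant_match <- abs((0.1 + 0.2) - 0.3) < 1e-10

# Column access by a name built at run time: use [[ ]], not $
demo <- data.frame(check_1 = c(0.01, 0.02))   # hypothetical toy data
col_name <- paste0("check_", 1)
via_brackets <- demo[[col_name]]   # the check_1 column
via_dollar <- demo$col_name        # looks for a column literally named "col_name"

print(c(exact = exact_match, tolerant = tolerant_match))
print(via_brackets)
print(is.null(via_dollar))
```

The exact comparison fails while the tolerance-based one succeeds, and `$` finds nothing because it does not evaluate the variable holding the column name.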
Addressing all of these issues:
for(i in 1:4){
df[
## rows to replace
abs(df[[paste0("check_", i)]] - 0.02) < 1e-10,
## column to replace
paste0("ac_", i)
] <- NA
}
df
# id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
# 1 1 0.10 0.200 0.80 0.20 1 0.010 0.03000 0.010 0.001
# 2 2 0.30 0.400 NA 0.73 2 0.902 0.04200 0.020 0.042
# 3 3 NA 0.100 0.70 NA 5 0.020 0.00200 0.500 0.020
# 4 4 0.03 0.008 0.01 0.10 23 0.070 0.00001 0.001 0.200
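If the real data has many column pairs, the loop can also be replaced by one matrix comparison. This is a sketch of an alternative, not the answer's method; it assumes the ac_* and check_* columns pair up by their numeric suffix, and the helper names `ac_cols`, `check_cols`, `hit` and `ac` are introduced here for illustration.

```r
# Data from the question (c_2 omitted because it is not involved)
df <- data.frame(
  id = 1:4,
  ac_1 = c(0.1, 0.3, 0.03, 0.03),       ac_2 = c(0.2, 0.4, 0.1, 0.008),
  ac_3 = c(0.8, 0.043, 0.7, 0.01),      ac_4 = c(0.2, 0.73, 0.1, 0.1),
  check_1 = c(0.01, 0.902, 0.02, 0.07), check_2 = c(0.03, 0.042, 0.002, 0.00001),
  check_3 = c(0.01, 0.02, 0.5, 0.001),  check_4 = c(0.001, 0.042, 0.02, 0.2)
)

ac_cols    <- paste0("ac_", 1:4)
check_cols <- paste0("check_", 1:4)

# Logical matrix: TRUE where a check value is 0.02 (within a tolerance)
hit <- abs(as.matrix(df[check_cols]) - 0.02) < 1e-10

# Blank out the matching cells of the ac columns in one step
ac <- as.matrix(df[ac_cols])
ac[hit] <- NA
df[ac_cols] <- ac

print(df)
```

This relies on `check_cols` and `ac_cols` being listed in the same order, so that cell (row, k) of `hit` lines up with cell (row, k) of `ac`.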
It's often better to work with long-format data, even if just temporarily. Here is an example of doing so, using dplyr and tidyr:
library(dplyr)
library(tidyr)
pivot_longer(df, -c(id,c_2)) %>%
separate(name,into=c("type", "pos")) %>%
pivot_wider(names_from=type, values_from = value) %>%
mutate(ac=if_else(near(check,0.02), as.double(NA), ac)) %>%
pivot_wider(names_from = pos, values_from = ac:check)
(Updated with near() thanks to Gregor)
Output:
id c_2 ac_1 ac_2 ac_3 ac_4 check_1 check_2 check_3 check_4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0.2 0.8 0.2 0.01 0.03 0.01 0.001
2 2 2 0.3 0.4 NA 0.73 0.902 0.042 0.02 0.042
3 3 5 NA 0.1 0.7 NA 0.02 0.002 0.5 0.02
4 4 23 0.03 0.008 0.01 0.1 0.07 0.00001 0.001 0.2
I have got this coefficient table under the form of a dataframe:
## check.names = FALSE keeps names such as "1-5" as written (otherwise R changes them to X1.5 etc.)
coefficient_table <- data.frame("less_than_1" = c( 1, 0.5, 0.1, 0.025, 0.010, 0.005, 0.001),
"1-5" = c(0.500, 1.000, 0.200, 0.050, 0.020, 0.010, 0.002),
"5-20" = c(0.10, 0.20, 1.00, 0.25, 0.10, 0.05, 0.01),
"20-50" = c(0.025, 0.050, 0.250, 1.000, 0.400, 0.200, 0.040),
"50-100" = c(0.010, 0.020, 0.10, 0.400, 1.00, 0.500, 0.100),
"100-500" = c(0.005, 0.010, 0.050, 0.200, 0.500, 1.000, 0.200),
"more_than_500" = c(0.001, 0.002, 0.010, 0.040, 0.100, 0.200, 1.000),
check.names = FALSE)
I would like to apply it to a matrix that has the same 7 columns as my coefficient data frame. For now, this is how it looks:
A <- data.frame("less_than_1" = c(0,0,1,0), "1-5" = c(1,0,0,0), "5-20" = c(0,0,0,0),
"20-50" = c(0,1,0,1), "50-100" = c(0,0,0,0), "100-500" = c(0,0,0,0),
"more_than_500" = c(0,0,0,0), check.names = FALSE)
A <- as.matrix(A)
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
1 0 1 0 0 0 0 0
2 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0
4 0 0 0 1 0 0 0
However, I would like to use my coefficient table to weight the elements of the matrix, based on the formula I used to create the coefficients, namely: min(BudgetRange1,BudgetRange2) / max(BudgetRange1,BudgetRange2).
For example, the first row has a budget of "1-5", so its "1-5" column should take the value 1. The other columns of that row should take their respective values from the "1-5" column of the coefficient table (coefficient_table).
table_z
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
1 0.5 1 0.2 0.05 0.02 0.01 0.002
2 0.025 0.05 0.25 1 0.4 0.2 0.04
3 1 0.5 0.1 0.025 0.01 0.005 0.001
4 0.025 0.05 0.25 1 0.4 0.2 0.04
Does anyone know how? Thanks for reading this far.
Assuming the columns (i.e. 'less_than_1', '1-5', '5-20' ...) are in exactly the same order in both coefficient_table and A, you can use matrix multiplication:
Z <- as.matrix(A)%*%as.matrix(coefficient_table)
> Z
less_than_1 1-5 5-20 20-50 50-100 100-500 more_than_500
[1,] 0.500 1.00 0.20 0.050 0.02 0.010 0.002
[2,] 0.025 0.05 0.25 1.000 0.40 0.200 0.040
[3,] 1.000 0.50 0.10 0.025 0.01 0.005 0.001
[4,] 0.025 0.05 0.25 1.000 0.40 0.200 0.040
# where Z is a matrix, you can convert to data.frame if you need it :
table_z <- as.data.frame(Z)
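Why the multiplication works: each row of A has a single 1, so the product simply picks out the matching row of the coefficient table (which equals the matching column, since the table is symmetric). The sketch below checks this on a made-up 3 x 3 example; `coef_small` and `A_small` are hypothetical names and values, not the question's data.

```r
# Hypothetical symmetric coefficient matrix with three budget classes
coef_small <- matrix(c(1.0, 0.5, 0.1,
                       0.5, 1.0, 0.2,
                       0.1, 0.2, 1.0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("low", "mid", "high")))

# One-hot rows: row 1 is "mid", row 2 is "high", row 3 is "low"
A_small <- rbind(c(0, 1, 0),
                 c(0, 0, 1),
                 c(1, 0, 0))

# Matrix product, as in the answer
by_product <- A_small %*% coef_small

# Same result by looking up the position of the 1 in each row
by_index <- coef_small[max.col(A_small, ties.method = "first"), ]

print(by_product)
print(isTRUE(all.equal(by_product, by_index)))
```

The indexing form only applies when every row of A contains exactly one 1; the matrix product also handles rows with several non-zero entries by summing the corresponding rows.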
I have some asset data in the middle of a dplyr pipeline similar to this:
fcast <- data.frame(group = rep(c('a','b'),each=12),
yr = rep(2018:2019,each=6,times=2),
mo = rep(c(7:12,1:6),times=2),
book_value = c(10000,rep(0,times=11),15000,rep(0,times=11)),
accum_depr = c(200,rep(0,times=11),700,rep(0,times=11)),
depr_rate = .02,
depr_expense = c(10,rep(0,times=11),15,rep(0,times=11)),
book_addn = c(0,0,0,0,80,0,0,40,0,0,0,0,0,0,100,70,0,0,0,0,0,0,0,0),
book_growth = 1.01
)
I need to apply some (ideally, tidy) rolling function to each group like the one below, which does not work at the moment.
roll_depr <- function(.data) {
r_d <- .data$depr_rate[1]
r_g <- .data$book_growth[1]
for(i in 2:length(.data$depreciation_rate)) {
.data$book_value[i] <- .data$book_value[i-1]*r_g + .data$book_addn[i]
.data$depr_expense[i] <- (.data$book_value[i] - .data$accum_depr[i-1])*r_d
.data$accum_depr[i] <- .data$accum_depr[i-1]+.data$depr_expense[i]
}
return(.data)
}
To further complicate things, this calculation will be performed in a shiny dashboard repeatedly as users input new values for book_addn. The actual dataset is much larger, and for loops don't cut it.
I know a better solution must exist with data.table or apply, but I haven't been able to figure it out. Bonus points if this can be accomplished from within the pipeline!
EDIT: I'm expecting the code to output the following table. Basically, the book_value grows at 1% of the previous value, plus any additions in the period. The depr_expense takes the book_value net of the previous accum_depr, and multiplies by the depr_rate. Finally, accum_depr updates to account for the newly-calculated depr_expense.
group yr mo book_value accum_depr depr_rate depr_expense book_addn book_growth
a 2018 7 10000.00 200.00 0.02 10.00 0 1.01
a 2018 8 10100.00 398.00 0.02 198.00 0 1.01
a 2018 9 10201.00 594.06 0.02 196.06 0 1.01
a 2018 10 10303.01 788.24 0.02 194.18 0 1.01
a 2018 11 10486.04 982.20 0.02 193.96 80 1.01
a 2018 12 10590.90 1174.37 0.02 192.17 0 1.01
a 2019 1 10696.81 1364.82 0.02 190.45 0 1.01
a 2019 2 10843.78 1554.40 0.02 189.58 40 1.01
a 2019 3 10952.22 1742.35 0.02 187.96 0 1.01
a 2019 4 11061.74 1928.74 0.02 186.39 0 1.01
a 2019 5 11172.35 2113.61 0.02 184.87 0 1.01
a 2019 6 11284.08 2297.02 0.02 183.41 0 1.01
b 2018 7 15000.00 700.00 0.02 15.00 0 1.01
b 2018 8 15150.00 989.00 0.02 289.00 0 1.01
b 2018 9 15401.50 1277.25 0.02 288.25 100 1.01
b 2018 10 15625.52 1564.22 0.02 286.97 70 1.01
b 2018 11 15781.77 1848.57 0.02 284.35 0 1.01
b 2018 12 15939.59 2130.39 0.02 281.82 0 1.01
b 2019 1 16098.98 2409.76 0.02 279.37 0 1.01
b 2019 2 16259.97 2686.76 0.02 277.00 0 1.01
b 2019 3 16422.57 2961.48 0.02 274.72 0 1.01
b 2019 4 16586.80 3233.99 0.02 272.51 0 1.01
b 2019 5 16752.67 3504.36 0.02 270.37 0 1.01
b 2019 6 16920.19 3772.68 0.02 268.32 0 1.01
This can actually be done at decent speed with two simple functions that implement for loops, and using them within mutate.
The key is to recognize that book_value can be calculated independently in its own loop. Once that has been done, accum_depr[i] is only a function of accum_depr[i-1] and book_value[i]. The depr_expense can be extracted as the difference between accum_depr and its lag, but I don't need it for my purposes.
expn[i] = (book[i] - accum_depr[i-1])*depr_rate
accum_depr[i] = accum_depr[i-1] + expn[i]
Which implies
accum_depr[i] = accum_depr[i-1]*(1-depr_rate) + book_value[i]*depr_rate
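As a quick numeric check of that substitution, here is a sketch using the first two rows of group a from the question's expected table (variable names are chosen here for illustration):

```r
r_d        <- 0.02    # depr_rate
accum_prev <- 200     # accum_depr[i-1]
book_now   <- 10100   # book_value[i]

# Two-step form: compute the expense, then add it to the running total
expn     <- (book_now - accum_prev) * r_d
two_step <- accum_prev + expn

# One-step form after substituting expn into the second equation
one_step <- accum_prev * (1 - r_d) + book_now * r_d

print(c(two_step = two_step, one_step = one_step))
```

Both forms give 398, the accum_depr value shown for 2018-08 in the expected output.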
The code:
roll_book <- function(book_val,addn,g_rate) {
z <- rep(0,length(book_val))
z[1] <- book_val[1]
for(i in seq_along(book_val)[-1]) { ## empty for a one-row group, unlike 2:length(book_val)
z[i] <- z[i-1]*g_rate[1] + addn[i]
}
return(z)
}
roll_depr <- function(accum_depr,book_val,depr_rate) {
r_d <- depr_rate[1]
z <- rep(0, length(accum_depr))
z[1] <- accum_depr[1]
for(i in seq_along(accum_depr)[-1]) { ## empty for a one-row group, unlike 2:length(accum_depr)
z[i] <- book_val[i]*r_d + z[i-1]*(1-r_d)
}
return(z)
}
library(dplyr)
fcast <- fcast %>%
group_by(group) %>%
mutate(book_value = roll_book(book_value,book_addn,book_growth),
accum_depr = roll_depr(accum_depr,book_value,depr_rate))
On my dataset with ~110,000 rows and ~450 groups:
Unit: milliseconds
min lq mean median uq max neval
65.01492 67.14825 70.80178 69.85741 72.53611 98.75224 100
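For reference, the same two recurrences can be written without an explicit loop using base R's Reduce with accumulate = TRUE, and depr_expense can then be recovered as the difference between consecutive accum_depr values, as mentioned above. This is only a sketch on the first three rows of group a; it has not been benchmarked against the loop versions and may well be slower.

```r
r_g <- 1.01                 # book_growth
r_d <- 0.02                 # depr_rate
book_addn <- c(0, 0, 0)     # first three months of group a
book_0    <- 10000          # starting book_value
accum_0   <- 200            # starting accum_depr

# book[i] = book[i-1] * r_g + addn[i]
book <- Reduce(function(prev, addn) prev * r_g + addn,
               book_addn[-1], accumulate = TRUE, init = book_0)

# accum[i] = book[i] * r_d + accum[i-1] * (1 - r_d)
accum <- Reduce(function(prev, bv) bv * r_d + prev * (1 - r_d),
                book[-1], accumulate = TRUE, init = accum_0)

# Expense per period is the change in accumulated depreciation;
# the first period has no previous value, so it is left as NA here
depr_expense <- c(NA, diff(accum))

print(data.frame(book, accum, depr_expense))
```

The values agree with the first three rows of the expected table in the question (10000, 10100, 10201 and 200, 398, 594.06).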
In R, how does one run a tournament simulation?
I have the probability of each team winning against each of the other teams, for example:
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
Which would mean something like this:
1 2 3 4 5 6 7 8
1 0 0.76 0.35 0.81 0.95 0.08 0.47 0.26
2 0 0.00 0.24 0.34 0.54 0.48 0.53 0.54
3 0 0.00 0.00 0.47 0.51 0.68 0.50 0.80
4 0 0.00 0.00 0.00 0.52 0.59 0.38 0.91
5 0 0.00 0.00 0.00 0.00 0.05 0.88 0.64
6 0 0.00 0.00 0.00 0.00 0.00 0.23 0.65
7 0 0.00 0.00 0.00 0.00 0.00 0.00 0.77
8 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The next step would be to run a set of simulations, say n = 100000
First the quarter-finals (best out of 3):
1 vs 8
2 vs 7
3 vs 6
4 vs 5
And then the winners of each pair face off in the semi-finals:
1-8 winner VS 4-5 winner
2-7 winner VS 3-6 winner
Winners move on to the final. All is best out of 3.
What approach/package could I use to run bracket simulations? I did find a package called mRchmadness but it's too specific to handle this simulation.
I have created some dummy code that can help you figure out how to do it. The code is not optimized at all, but it is linear enough that it should be easy to follow.
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
prob_res
## Total number of combinations
posscombi<-t(combn(1:8, 2))
## This function plays repetitionMatches matches for every possible pair of teams and counts each team's wins.
## It "reproduces" a whole league, assuming the winning probabilities are static.
League <- function(repetitionMatches, posscomb , prob_res)
{
TotalVect<-integer(0)
for(i in 1:nrow(posscomb)){
pair <- posscomb[i,]
Vect<-sample(pair,
size = repetitionMatches,
prob = c(prob_res[pair[1], pair[2]], 1-prob_res[pair[1], pair[2]]),
replace = TRUE)
TotalVect <- c(TotalVect, Vect)
}
return(table(TotalVect))
}
Result<-League(100,posscomb = posscombi, prob_res= prob_res)
Myorder<-order(Result)
### Quarters
pair1<- c(names(Result)[Myorder[c(1,8)]])
pair2<- c(names(Result)[Myorder[c(2,7)]])
pair3<- c(names(Result)[Myorder[c(3,6)]])
pair4<- c(names(Result)[Myorder[c(4,5)]])
## This function gives you the results to n matches (being 3 in the example)
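PlayMatch<-function(pairs, numMatches){
## prob_res is upper-triangular, so look the pair up as [lower id, higher id]
pairs <- as.character(sort(as.integer(pairs)))
p <- prob_res[pairs[1], pairs[2]]
Res <-sample(pairs, size = numMatches, prob = c(p, 1-p), replace = TRUE)
return(table(Res))
}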
# Results of the matches
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
winner3<-PlayMatch(pairs = pair3, 3)
winner4<-PlayMatch(pairs = pair4, 3)
## Semis
#Choosing the winning teams
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
pair2<- c(names(winner3)[which.max(winner3)],names(winner4)[which.max(winner4)])
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
## Final
# Same as before
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
winner1<-PlayMatch(pairs = pair1, 3)
paste0( "team ",names(winner1)[which.max(winner1)], " is the winner!")
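To get championship probabilities rather than a single outcome, the whole bracket can be repeated many times. The sketch below uses the fixed seeding from the question (1 vs 8, 4 vs 5, 2 vs 7, 3 vs 6). It is not the answer's code: `win_prob`, `play_series` and `simulate_bracket` are helper names introduced here, the probability matrix is random toy data, and the seed is fixed only for reproducibility.

```r
set.seed(42)   # assumption: fixed seed so the run is reproducible
n_teams <- 8

# Toy upper-triangular matrix: p[a, b] = P(team a beats team b) for a < b
p <- matrix(0, n_teams, n_teams)
p[upper.tri(p)] <- round(runif(sum(upper.tri(p)), 0.05, 0.95), 2)

# Single-game win probability for a over b, whichever way round they are given
win_prob <- function(a, b) if (a < b) p[a, b] else 1 - p[b, a]

# Best-of-three series: a advances if it wins at least 2 of 3 games
# (playing all 3 games gives the same winner as stopping at 2 wins)
play_series <- function(a, b) {
  wins_a <- rbinom(1, size = 3, prob = win_prob(a, b))
  if (wins_a >= 2) a else b
}

# One run of the bracket described in the question
simulate_bracket <- function() {
  q1 <- play_series(1, 8)
  q2 <- play_series(4, 5)
  q3 <- play_series(2, 7)
  q4 <- play_series(3, 6)
  s1 <- play_series(q1, q2)   # 1-8 winner vs 4-5 winner
  s2 <- play_series(q3, q4)   # 2-7 winner vs 3-6 winner
  play_series(s1, s2)         # final
}

n_sim <- 10000
champions  <- replicate(n_sim, simulate_bracket())
title_prob <- table(factor(champions, levels = 1:n_teams)) / n_sim
print(round(title_prob, 3))
```

Raising `n_sim` towards the 100000 mentioned in the question only costs run time; the structure stays the same.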
I have to solve this specific problem in R. I have a large list, containing columns and rows in this format:
Day_and_Time Rain1_mm/min Rain2_mm/min
01.12.10 18:01 0 0
.............. .... ...
02.12.10 01:00 0.03 0
02.12.10 01:01 0.03 0
02.12.10 01:02 0.01 0
02.12.10 01:03 0.05 0
02.12.10 01:04 0.03 0.1
02.12.10 01:05 0.04 0
.............. .... ...
02.12.10 18:00 0 0
What I want to do is write a function that sums six consecutive rows and returns the result as a new row. This means that at the end I have a new list, looking like this for example:
Day_and_Time Rain1_mm/5min Rain2_mm/5min
.............. .... ...
02.12.10 01:05 0.19 0.1
02.12.10 01:10 .... ...
.............. .... ...
Is it possible to do this? The goal is to transform the unit [mm/min] from the first and second column to [mm/5min].
Thank you very much!
Assuming that you read the data in your .csv file as a data frame df, one approach to your problem is to use rollapply from the zoo package to give you a rolling sum:
library(zoo)
ind_keep <- seq(1,floor(nrow(df)/5)*5, by=5) ## 1.
out <- sapply(df[,-1], function(x) rollapply(x,6,sum)) ## 2.
out <- data.frame(df[ind_keep+5,1],out[ind_keep,]) ## 3.
colnames(out) <- c("Day_and_time","Rain1_mm/5min","Rain2_mm/5min") ## 4.
Notes:
1. Here, we define the indices corresponding to every 5 minutes where we want to keep the rolling sum over the next 5 minutes.
2. Apply a rolling sum function to each column:
   - Use sapply over all columns of df except the first. Note that the column indices specified in df[,-1] can be adjusted so that you process only certain columns.
   - The function to apply is rollapply from the zoo package. The additional arguments are the window width, 6 (six one-minute rows, i.e. the closed interval [i, i+5]), and the sum function, so that this performs a rolling sum.
   At this point, out contains the rolling sum at each minute, but we only want those at every 5 minutes. Therefore:
3. Combine the Day_and_time column from the original df with out, keeping only the rows at every 5 minutes. Note that we keep the last Day_and_Time in each window.
4. This just renames the columns.
Using MikeyMike's data, which is
Day_and_Time rain1 rain2
1 2010-02-12 01:00:00 0.03 0.00
2 2010-02-12 01:01:00 0.03 0.00
3 2010-02-12 01:02:00 0.01 0.00
4 2010-02-12 01:03:00 0.05 0.00
5 2010-02-12 01:04:00 0.03 0.10
6 2010-02-12 01:05:00 0.04 0.00
7 2010-02-12 01:06:00 0.02 0.10
8 2010-02-12 01:07:00 0.10 0.10
9 2010-02-12 01:08:00 0.30 0.00
10 2010-02-12 01:09:00 0.01 0.00
11 2010-02-12 01:10:00 0.00 0.01
this gives:
print(out)
## Day_and_time Rain1_mm/5min Rain2_mm/5min
##1 2010-02-12 01:05:00 0.19 0.10
##2 2010-02-12 01:10:00 0.47 0.21
Note the difference in the result. This approach assumes you want overlapping windows, since you specified that you want to sum the six numbers in the closed interval [i,i+5] at each 5-minute mark, so the reading at each mark is counted in two consecutive windows.
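The difference between overlapping and non-overlapping windows can be seen with plain base R on the rain1 column of the sample data. This is only a sketch to illustrate the two interpretations; `starts` and `block` are names introduced here.

```r
rain1 <- c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04, 0.02, 0.1, 0.3, 0.01, 0)

# Overlapping closed windows [i, i+5], started every 5 rows:
# rows 1-6 and rows 6-11, so row 6 is counted twice
starts <- seq(1, length(rain1) - 5, by = 5)
overlapping <- sapply(starts, function(s) sum(rain1[s:(s + 5)]))

# Non-overlapping blocks of six rows: rows 1-6 and rows 7-11
block <- (seq_along(rain1) - 1) %/% 6
non_overlapping <- as.vector(tapply(rain1, block, sum))

print(overlapping)
print(non_overlapping)
```

The two second-window sums differ by exactly the 01:05 reading (0.04), which the overlapping version includes again.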
To extend the above to a window in the closed interval [i, i+nMin] at each nMin mark:
library(zoo)
nMin <- 10 ## for example 10 minutes
ind_keep <- seq(1, floor(nrow(df)/nMin)*nMin, by=nMin)
out <- sapply(df[,-1], function(x) rollapply(x, nMin+1, sum))
out <- data.frame(df[ind_keep+nMin, 1],out[ind_keep,])
colnames(out) <- c("Day_and_time",paste0("Rain1_mm/",nMin,"min"),paste0("Rain2_mm/",nMin,"min"))
For this to work, the data must have at least 2 * nMin + 1 rows
Hope this helps.
Assuming you want non-overlapping groups of six one-minute readings (01:00 - 01:05, 01:06 - 01:11, etc.; dividing the timestamp by 360 seconds gives six-minute bins), this should give you what you're looking for:
library(data.table)
setDT(df)[,.(day_time = max(Day_and_Time),
rain1_sum=sum(rain1),
rain2_sum=sum(rain2)),
by=.(floor(as.numeric(Day_and_Time)/360))]
floor day_time rain1_sum rain2_sum
1: 3516540 2010-02-12 01:05:00 0.19 0.10
2: 3516541 2010-02-12 01:10:00 0.43 0.21
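If true 5-minute bins are wanted instead (five readings per bin, each labelled by the time at which it ends), one possible base R sketch is to round each timestamp up to the next multiple of 300 seconds. The UTC time zone and the variable names are assumptions made for this example; the rain values are the 01:01 to 01:10 readings of rain1 from the sample data.

```r
# Ten one-minute readings, 01:01 through 01:10 (UTC assumed for the example)
times <- as.POSIXct("2010-02-12 01:01:00", tz = "UTC") + 60 * (0:9)
rain  <- c(0.03, 0.01, 0.05, 0.03, 0.04, 0.02, 0.1, 0.3, 0.01, 0)

# Round each timestamp up to the end of its 5-minute bin (300 seconds)
bin_end_sec <- ceiling(as.numeric(times) / 300) * 300

# Sum the readings within each bin
sums <- tapply(rain, bin_end_sec, sum)

out <- data.frame(
  Day_and_Time = as.POSIXct(as.numeric(names(sums)), origin = "1970-01-01", tz = "UTC"),
  rain_5min    = as.vector(sums)
)
print(out)
```

With this binning, 01:01 to 01:05 fall into the bin ending 01:05 and 01:06 to 01:10 into the bin ending 01:10, so no reading is counted twice.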
Data:
df <- structure(list(Day_and_Time = structure(c(1265954400, 1265954460,
1265954520, 1265954580, 1265954640, 1265954700, 1265954760, 1265954820,
1265954880, 1265954940, 1265955000), class = c("POSIXct", "POSIXt"
), tzone = ""), rain1 = c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04,
0.02, 0.1, 0.3, 0.01, 0), rain2 = c(0, 0, 0, 0, 0.1, 0, 0.1,
0.1, 0, 0, 0.01)), .Names = c("Day_and_Time", "rain1", "rain2"
), row.names = c(NA, -11L), class = "data.frame")