I want to obtain new values at each step and use them in an equation at later steps. Normally I could just write a loop, but for this problem I also need to use values from past steps. For instance, I have flow data like this:
q<-c(10, 15.83333, 21.66667)
I created a loop manually:
z1<-190 #initial elevation
s1<-24011 #initial storage
in1<-q[1] #initial inflow
out1<-1.86*sqrt((z1-110)*19.62) #outflow
z2<- z1+0.3*((in1-out1)/s1) #elevation at second step
in2<-q[2] #second inflow
out2<-1.86*sqrt((z2-110)*19.62) #outflow at z2 elevation
ds2<-0.3*((in1+in2)/2-(out1+out2)/2) #change in storage
s2<-s1+ds2 #net storage value
z3<-z2+0.3*((in2-out2)/s2) #elevation at third step
in3<-q[3]
out3<-1.86*sqrt((z3-110)*19.62)
ds3<-0.3*((in2+in3)/2-(out2+out3)/2)
s3<-s2+ds3
.
.
.
z4<-z3+0.3*((in3-out3)/s3)
Briefly, I am calculating each z value using the previous values of in, out and s. What I need is the z values that correspond to the q values.
Expected result is:
z q outflows storages
[1,] 190.0000 10.00000 73.68981 24011.00
[2,] 189.9992 15.83333 73.68944 23992.77
[3,] 189.9985 21.66667 73.68911 23976.29
I'll extend my comment here.
You can overwrite the variables. Following your implementation, you could for instance create a temporary variable for your out2:
q = c(10, 15.83333, 21.66667)
#Results storage array
results = array(numeric(),c(length(q),4))
colnames(results) = c("z", "q", "outflows", "storages")
z = 190
s = 24011
infl = q[1]
out = 1.86*sqrt((z-110)*19.62)
#Save init values
results[1,1] = z
results[,2] = q
results[1,3] = out
results[1,4] = s
for (n in 2:length(q)) {
  z = z + 0.3*((infl-out)/s)                 # elevation at step n, from the previous state
  out_tmp = 1.86*sqrt((z-110)*19.62)         # outflow at the new elevation
  ds = 0.3*((infl+q[n])/2 - (out+out_tmp)/2) # change in storage
  s = s + ds                                 # net storage
  infl = q[n]                                # carry the new inflow forward
  out = out_tmp                              # overwrite out with the temporary value
  results[n,1] = z
  results[n,3] = out
  results[n,4] = s
}
View(results)
If you want to avoid creating the temporary variable, you can try something like this:
q = c(10, 15.83333, 21.66667)
results = array(numeric(),c(length(q),4))
colnames(results) = c("z", "q", "outflows", "storages")
z = 190
s = 24011
infl = q[1]
out = 1.86*sqrt((z-110)*19.62)
#Save init values
results[1,1] = z
results[,2] = q
results[1,3] = out
results[1,4] = s
for (n in 2:length(q)) {
  z = z + 0.3*((infl-out)/s)
  out = 1.86*sqrt((z-110)*19.62)
  ds = 0.3*((infl+q[n])/2 - (results[n-1,3]+out)/2) # previous outflow read back from the results table
  s = s + ds
  infl = q[n]
  results[n,1] = z
  results[n,3] = out
  results[n,4] = s
}
View(results)
The general pattern for this kind of state update is:
1. Init values and create a result table.
2. Add the current state values to the table.
3. Simulate the new state using old or new states.
4. Set the new state to all variables.
library(tidyverse)
data <- tibble(step = numeric(), out = numeric(), y = numeric(), z = numeric())
# Initialization
z <- 190
y <- 1
out <- NA
for (step in seq(5)) {
# save current state
data <- data %>% add_row(step = step, out = out, z = z, y = y)
# use old state of z
new_out <- z / 2
# use old state of y
new_z <- y + 1
# use new state of out
new_y <- new_out
# Lastly, update all new variables
out <- new_out
y <- new_y
z <- new_z
}
data
#> # A tibble: 5 x 4
#> step out y z
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 1 190
#> 2 2 95 95 2
#> 3 3 1 1 96
#> 4 4 48 48 2
#> 5 5 1 1 49
Created on 2021-11-10 by the reprex package (v2.0.1)
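Applied to the reservoir problem from the question, the same pattern might look like the sketch below (same formulas and q vector as in the question; it should reproduce the expected z, outflow and storage columns):
library(tidyverse)

q <- c(10, 15.83333, 21.66667)

# initial state
z    <- 190                              # initial elevation
s    <- 24011                            # initial storage
infl <- q[1]                             # initial inflow
out  <- 1.86 * sqrt((z - 110) * 19.62)   # initial outflow

res <- tibble(z = z, q = infl, outflows = out, storages = s)

for (n in 2:length(q)) {
  # new state, computed only from the old state
  new_z   <- z + 0.3 * ((infl - out) / s)
  new_out <- 1.86 * sqrt((new_z - 110) * 19.62)
  new_s   <- s + 0.3 * ((infl + q[n]) / 2 - (out + new_out) / 2)
  # update all variables, then record the new row
  z <- new_z; out <- new_out; s <- new_s; infl <- q[n]
  res <- res %>% add_row(z = z, q = infl, outflows = out, storages = s)
}
res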
I have a data set of real estate data. I'm trying to create a new column of days on market groups (labeled DOM_Groups) and group them into 15-day intervals (i.e. 0-14, 15-29, etc.). Then I'm trying to summarize() these groupings by the count of observations and the average sale price for each 15-day group.
I'm using the cut() function to try to break DOM into these 15-day intervals. In the base spreadsheet that I imported, the column containing the days on market has one observation per cell, and the data in that column are numeric whole numbers: no decimals, no negative numbers.
When I run the following code, the tibble output is not grouping correctly, and it is including a negative number with a decimal, which does not exist in my data set. I'm not sure what to do to correct this.
gibbsMkt %>%
mutate(DOM_Groups = cut(DOM, breaks = 15, dig.lab = 2)) %>%
filter(Status == "SOLD") %>%
group_by(DOM_Groups) %>%
summarize(numDOM = n(),
avgSP = mean(`Sold Price`, na.rm = TRUE))
The tibble output I get is this:
DOM_Groups numDOM avgSP
<fct> <int> <dbl>
1 (-0.23,16] 74 561675.
2 (16,31] 18 632241.
3 (31,47] 11 561727.
4 (47,63] 8 545862.
5 (63,78] 7 729286.
6 (78,94] 6 624167.
7 (1.4e+02,1.6e+02] 2 541000
8 (1.6e+02,1.7e+02] 1 535395
Also, for rows 7 & 8 in the tibble, the largest number is 164, so I also don't understand why these rows are being converted to scientific notation.
When I use an Excel pivot table, I get the output that I want to reproduce in R: one row per 15-day group with the count of observations and the average sale price (screenshot omitted).
How can I reproduce this in R with the correct code?
cut(x, breaks = 15) means x will be cut into 15 intervals--it cannot guess that you want 15-unit intervals starting with 0 and ending with 150. This is in the docs for ?cut:
breaks either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You will need to define your own start and end to each interval such as:
seq(0, max(x), 15)
# [1] 0 15 30 45 60 75 90 105 120 135 150
cut(x, seq(0, max(x), 15))
However, if you set it up correctly, you can define your intervals and make labels at the same time.
set.seed(1)
x <- floor(runif(500, 0, 164))
from <- seq(0, max(x), 15)
to <- from + 15 - 1
labs <- sprintf('%s-%s', from, to)
# [1] "0-14" "15-29" "30-44" "45-59" "60-74" "75-89" "90-104" "105-119" "120-134" "135-149" "150-164"
data.frame(table(cut(x, c(from, Inf), right = FALSE)), labels = labs)
# Var1 Freq labels
# 1 [0,15) 35 0-14
# 2 [15,30) 57 15-29
# 3 [30,45) 45 30-44
# 4 [45,60) 44 45-59
# 5 [60,75) 57 60-74
# 6 [75,90) 55 75-89
# 7 [90,105) 33 90-104
# 8 [105,120) 47 105-119
# 9 [120,135) 40 120-134
# 10 [135,150) 39 135-149
# 11 [150,Inf) 48 150-164
DOM_Groups <- cut(x, c(from, Inf), labs, right = FALSE)
data.frame(table(DOM_Groups))
# DOM_Groups Freq
# 1 0-14 35
# 2 15-29 57
# 3 30-44 45
# 4 45-59 44
# 5 60-74 57
# 6 75-89 55
# 7 90-104 33
# 8 105-119 47
# 9 120-134 40
# 10 135-149 39
# 11 150-164 48
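To plug this back into your original pipeline, a sketch keeping the column names from your question (gibbsMkt, DOM, Status, `Sold Price`) might look like:
library(dplyr)

gibbsMkt %>%
  filter(Status == "SOLD") %>%
  mutate(DOM_Groups = cut(DOM,
                          breaks = c(seq(0, max(DOM), 15), Inf),
                          labels = sprintf("%s-%s", seq(0, max(DOM), 15), seq(0, max(DOM), 15) + 14),
                          right = FALSE)) %>%
  group_by(DOM_Groups) %>%
  summarize(numDOM = n(),
            avgSP = mean(`Sold Price`, na.rm = TRUE))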
As for your other question of why you are getting negative numbers: as I mentioned, this does not mean that you have negatives in your data; these are just the labels generated by using breaks = 15 with your data.
These are the relevant lines in cut.default
if (length(breaks) == 1L) {
if (is.na(breaks) || breaks < 2L)
stop("invalid number of intervals")
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] +
dx/1000)
}
Using the x from before and breaks = 15, you can see how negatives are introduced:
breaks <- 15
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
} else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
}
breaks
# [1] -0.16300 10.86667 21.73333 32.60000 43.46667 54.33333 65.20000 76.06667 86.93333 97.80000 108.66667 119.53333 130.40000
# [14] 141.26667 152.13333 163.16300
levels(cut(x, breaks = 15))
# [1] "(-0.163,10.9]" "(10.9,21.7]" "(21.7,32.6]" "(32.6,43.5]" "(43.5,54.3]" "(54.3,65.2]" "(65.2,76.1]" "(76.1,86.9]"
# [9] "(86.9,97.8]" "(97.8,109]" "(109,120]" "(120,130]" "(130,141]" "(141,152]" "(152,163]"
Here's a simple solution with my santoku package:
library(santoku)
gibbsMkt %>%
mutate(DOM_Groups = chop_width(DOM, 15, labels = lbl_dash("-")))
# then proceed as before
You can use the start argument to chop_width if you want to start the intervals at a particular number.
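For example, a small sketch (dom is a made-up vector here; chop_width() and lbl_dash() are the santoku functions used above):
library(santoku)

dom <- c(3, 17, 45, 160)                               # hypothetical days-on-market values
chop_width(dom, 15, start = 0, labels = lbl_dash("-")) # 15-unit-wide intervals beginning at 0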
I'm working to implement an lpSolve solution to optimize a hypothetical daily fantasy baseball problem. I'm having trouble applying my last constraint:
position - exactly 3 outfielders (OF), 2 pitchers (P) and 1 of everything else
cost - total cost less than 200
team - max number of players from any one team is 6
team - minimum number of teams on a roster is 3
Say for example you have a dataframe of 1000 players with points, cost, position, and team and you're trying to maximize average points:
library(tidyverse)
library(lpSolve)
set.seed(123)
df <- data_frame(avg_points = sample(5:45,1000, replace = T),
cost = sample(3:45,1000, replace = T),
position = sample(c("P","C","1B","2B","3B","SS","OF"),1000, replace = T),
team = sample(LETTERS,1000, replace = T)) %>% mutate(id = row_number())
head(df)
# A tibble: 6 x 5
# avg_points cost position team id
# <int> <int> <chr> <chr> <int>
#1 17 13 2B Y 1
#2 39 45 1B P 2
#3 29 33 1B C 3
#4 38 31 2B V 4
#5 17 13 P A 5
#6 10 6 SS V 6
I've implemented the first 3 constraints with the following code, but I'm having trouble figuring out how to implement the minimum number of teams on a roster. I think I need to add additional variables to the model, but I'm not sure how to do that.
#set the objective function (what we want to maximize)
obj <- df$avg_points
# set the constraint rows.
con <- rbind(t(model.matrix(~ position + 0,df)), cost = df$cost, t(model.matrix(~ team + 0, df)) )
#set the constraint values
rhs <- c(1,1,1,1,3,2,1, # 1. #exactly 3 outfielders 2 pitchers and 1 of everything else
200, # 2. at a cost less than 200
rep(6,26) # 3. max number from any team is 6
)
#set the direction of the constraints
dir <- c("=","=","=","=","=","=","=","<=",rep("<=",26))
result <- lp("max",obj,con,dir,rhs,all.bin = TRUE)
If it helps, I'm trying to replicate this paper (with minor tweaks), which has corresponding Julia code here.
This might be a solution for your problem.
This is the data I have used (identical to yours):
library(tidyverse)
library(lpSolve)
N <- 1000
set.seed(123)
df <- tibble(avg_points = sample(5:45,N, replace = T),
cost = sample(3:45,N, replace = T),
position = sample(c("P","C","1B","2B","3B","SS","OF"),N, replace = T),
team = sample(LETTERS,N, replace = T)) %>%
mutate(id = row_number())
You want to find x1...xn that maximise the objective function below:
x1 * average_points1 + x2 * average_points2 + ... + xn * average_pointsn
With the way lpSolve works, you will need to express the left-hand side of every constraint as a sum over x1...xn times the coefficient vector you provide.
Since you cannot express the number of teams with your current variables, you can introduce new ones (I will call them y1..yn_teams and z1..zn_teams):
# number of teams:
n_teams = length(unique(df$team))
Your new objective function (the ys and zs will not influence the overall objective, since their coefficients are set to 0):
obj <- c(df$avg_points, rep(0, 2 * n_teams))
The first 3 constraints are the same, but with the added constants for y and z:
c1 <- t(model.matrix(~ position + 0,df))
c1 <- cbind(c1,
matrix(0, ncol = 2 * n_teams, nrow = nrow(c1)))
c2 = df$cost
c2 <- c(c2, rep(0, 2 * n_teams))
c3 = t(model.matrix(~ team + 0, df))
c3 <- cbind(c3, matrix(0, ncol = 2 * n_teams, nrow = nrow(c3)))
Since you want to have at least 3 teams, you will first use y to count the number of players per team.
The following constraint does that counting: you sum up all the picked players of a team and subtract the corresponding y variable for that team; this should be equal to 0. (diag() creates the identity matrix; we do not worry about z at this point.)
# should be x1...xn - y1...n = 0
c4_1 <- cbind(t(model.matrix(~team + 0, df)), # x
-diag(n_teams), # y
matrix(0, ncol = n_teams, nrow = n_teams) # z
) # == 0
Since each y is now the number of players in a team, you can now make sure that z is binary with this constraint:
c4_2 <- cbind(t(model.matrix(~ team + 0, df)), # x1+...+xn ==
-diag(n_teams), # - (y1+...+yn )
diag(n_teams) # z binary
) # <= 1
This is the constraint that ensures that at least 3 teams are picked:
c4_3 <- c(rep(0, nrow(df) + n_teams), # x and y
rep(1, n_teams) # z >= 3
)
You also need to make sure that z is forced to 1 whenever the corresponding team is used, i.e. whenever its y is greater than 0. You can use the big-M method for that to create a constraint, which is:
y <= M * z
Or, in a more lpSolve friendly version:
y - M * z <= 0
In this case you can use 6 as a value for M, because it is the largest value any y can take:
c4_4 <- cbind(matrix(0, nrow = n_teams, ncol = nrow(df)),
diag(n_teams),
-diag(n_teams) * 6)
This constraint is added to make sure all x are binary:
#all x binary
c5 <- cbind(diag(nrow(df)), # x
matrix(0, ncol = 2 * n_teams, nrow = nrow(df)) # y + z
)
Create the new constraint matrix
con <- rbind(c1,
c2,
c3,
c4_1,
c4_2,
c4_3,
c4_4,
c5)
#set the constraint values
rhs <- c(1,1,1,1,3,2,1, # 1. #exactly 3 outfielders 2 pitchers and 1 of everything else
200, # 2. at a cost less than 200
rep(6, n_teams), # 3. max number from any team is 6
rep(0, n_teams), # c4_1
rep(1, n_teams), # c4_2
3, # c4_3,
rep(0, n_teams), #c4_4
rep(1, nrow(df))# c5 binary
)
#set the direction of the constraints
dir <- c(rep("==", 7), # c1
"<=", # c2
rep("<=", n_teams), # c3
rep('==', n_teams), # c4_1
rep('<=', n_teams), # c4_2
'>=', # c4_3
rep('<=', n_teams), # c4_4
rep('<=', nrow(df)) # c5
)
The lp() call is almost the same, but I am using all.int instead of all.bin so that the y variables can count the players per team (they can be larger than 1):
result <- lp("max",obj,con,dir,rhs,all.int = TRUE)
Success: the objective function is 450
roster <- df[result$solution[1:nrow(df)] == 1, ]
roster
# A tibble: 10 x 5
avg_points cost position team id
<int> <int> <chr> <chr> <int>
1 45 19 C I 24
2 45 5 P X 126
3 45 25 OF N 139
4 45 22 3B J 193
5 45 24 2B B 327
6 45 25 OF P 340
7 45 23 P Q 356
8 45 13 OF N 400
9 45 13 SS L 401
10 45 45 1B G 614
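A quick way to sanity-check the returned roster against the constraints (a small sketch):
table(roster$position)       # should show 3 OF, 2 P and 1 of each remaining position
sum(roster$cost)             # should be at most 200
table(roster$team)           # no team should appear more than 6 times
length(unique(roster$team))  # should be at least 3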
If you change your data to
N <- 1000
set.seed(123)
df <- tibble(avg_points = sample(5:45,N, replace = T),
cost = sample(3:45,N, replace = T),
position = sample(c("P","C","1B","2B","3B","SS","OF"),N, replace = T),
team = sample(c("A", "B"),N, replace = T)) %>%
mutate(id = row_number())
It will now be infeasible, because the number of teams in the data is less than 3.
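One way to detect this programmatically (a small sketch; lp() stores a status code in the object it returns):
result <- lp("max", obj, con, dir, rhs, all.int = TRUE)
result$status  # 0 means an optimal solution was found, 2 means no feasible solution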
Going back to the original data, you can check that the z indicators match the teams on the selected roster:
sort(unique(df$team))[result$solution[1027:1052]==1]
[1] "B" "E" "I" "J" "N" "P" "Q" "X"
sort(unique(roster$team))
[1] "B" "E" "I" "J" "N" "P" "Q" "X"
For the lift curve example I run
library(caret)
set.seed(1)
simulated <- data.frame(obs = factor(rep(letters[1:2], each = 100)),
perfect = sort(runif(200), decreasing = TRUE),
random = runif(200))
lift2 <- lift(obs ~ random + perfect, data = simulated)
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))
and get the resulting plot (figure omitted). I expected the image to be mirrored horizontally.
What am I doing wrong?
Btw: This is a lift chart not a cumulative gains chart.
Update:
The plot that I expected is now produced by my own code (figure omitted):
mylift <- caret::lift(Class ~ cforest_prob + perfect_prob + guess_prob, data = data_test)
ggplot(mylift$data) +
geom_line(aes(CumTestedPct, lift, color = liftModelVar))
I noticed that the data frame mylift$data contains the following columns:
names(mylift$data)
[1] "liftModelVar" "cuts" "events" "n" "Sn" "Sp" "EventPct"
[8] "CumEventPct" "lift" "CumTestedPct"
So I plotted lift against the cuts instead (figure omitted):
ggplot(mylift$data) +
geom_line(aes(cuts, lift, color = liftModelVar))
So I guess that the different plots are just different ways of examining lift? I wasn't aware that there are different lift charts - I thought it was standardized across the industry.
Edit by the question author, for late readers: I accepted this answer for a large part because of the helpful discussion in the comments to this answer. Please consider reading the discussion!
Let's reproduce the graph and find the baseline. Let
cutoffs <- seq(0, 1, length = 1000)
be our cutoffs. Now the main computations are done by
aux <- sapply(cutoffs, function(ct) {
perf <- simulated$obs[simulated$perfect > ct]
rand <- simulated$obs[simulated$random > ct]
c(mean(perf == "a"), mean(rand == "a"))
})
where we go over the vector of cutoffs and do the following. Take the perfect case. We say that whenever perfect > ct, we are going to predict "a". Then simulated$obs[simulated$perfect > ct] are the true values, while mean(perf == "a") is our accuracy with a given cutoff. The same happens with random.
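For a single cutoff, say ct = 0.6, this boils down to the following (a small illustration using the simulated data from above):
ct <- 0.6
mean(simulated$obs[simulated$perfect > ct] == "a")  # accuracy of the "perfect" scorer at this cutoff
mean(simulated$obs[simulated$random > ct] == "a")   # accuracy of the "random" scorer at this cutoff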
As for the baseline, it is just a constant defined by the share of "a" in the sample:
baseline <- mean(simulated$obs == "a")
When plotting the lifts, we divide our accuracy by that of the baseline method and get the same graph along with the baseline curve:
plot(x = cutoffs, y = aux[1, ] / baseline, type = 'l', ylim = c(0, 2), xlab = "Cutoff", ylab = "Lift")
lines(x = cutoffs, y = aux[2, ] / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')
Update:
Here's an illustration that, at least when plotted manually, the lift curve of the "expected" type can be manipulated and gives non-unique results.
Your example graph is from here, which also has this data:
# contacted response
# 1 10000 6000
# 2 20000 10000
# 3 30000 13000
# 4 40000 15800
# 5 50000 17000
# 6 60000 18000
# 7 70000 18800
# 8 80000 19400
# 9 90000 19800
# 10 100000 20000
Now suppose that we know not this evolution but 10 individual blocks:
# contacted response
# 1 10000 6000
# 2 10000 4000
# 3 10000 3000
# 4 10000 2800
# 5 10000 1200
# 6 10000 1000
# 7 10000 800
# 8 10000 600
# 9 10000 400
# 10 10000 200
In that case it depends on how we order the observations when putting "% Contacted" in the x-axis:
set.seed(1)
# rebuild the data frame from the ten blocks above
df <- data.frame(contacted = rep(10000, 10),
                 response = c(6000, 4000, 3000, 2800, 1200, 1000, 800, 600, 400, 200))
baseline <- sum(df$response) / sum(df$contacted) * cumsum(df$contacted)
lift1 <- cumsum(df$response)
lift2 <- cumsum(sample(df$response))
x <- 1:10 * 10
plot(x = x, y = lift1 / baseline, col = 'red', type = 'l', ylim = c(0, 3), xlab = "% Customers contacted", ylab = "Lift")
lines(x = x, y = lift2 / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')
I am using the BTYD package's BG/NBD model in R and did the individual-level estimates.
For instance, following the documentation on page 20 of the
BTYD Walkthrough
Code for Data Prep:
library(BTYD)
library(dplyr)
library(lubridate)

system.file("data/cdnowElog.csv", package = "BTYD") %>%
  dc.ReadLines(., cust.idx = 2, date.idx = 3, sales.idx = 5) %>%
  dc.MergeTransactionsOnSameDate() %>%
  mutate(date = parse_date_time(date, "%Y%m%d")) -> elog
end.of.cal.period <- as.Date("1997-09-30")
elog.cal <- elog[which(elog$date <= end.of.cal.period), ]
split.data <- dc.SplitUpElogForRepeatTrans(elog.cal);
birth.periods <- split.data$cust.data$birth.per
last.dates <- split.data$cust.data$last.date
clean.elog <- split.data$repeat.trans.elog;
freq.cbt <- dc.CreateFreqCBT(clean.elog);
tot.cbt <- dc.CreateFreqCBT(elog)
cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)
cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period)
cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates,per="week")
params <- pnbd.EstimateParameters(cal.cbs);
With that in place, one can get estimates for a particular observation.
Code for Individual Level Estimation:
cal.cbs["1516",]
# x t.x T.cal
# 26.00 30.86 31.00
x <- cal.cbs["1516", "x"]
t.x <- cal.cbs["1516", "t.x"]
T.cal <- cal.cbs["1516", "T.cal"]
bgnbd.ConditionalExpectedTransactions(params, T.star = 52,
x, t.x, T.cal)
# [1] 25.76
My question is: is it possible to run this for every customer, so that I get a data frame containing the expectation for each row, instead of hard-coding a particular ID such as "1516"?
Thanks!
Yes, it is straightforward with dplyr's mutate()
cal.cbs%>%
data.frame()%>%
mutate(`Conditional Expectation` = bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))
x t.x T.cal Conditional Expectation
1 2 30.428571 38.85714 2.3224971
2 1 1.714286 38.85714 1.0646350
3 0 0.000000 38.85714 0.5607707
4 0 0.000000 38.85714 0.5607707
5 0 0.000000 38.85714 0.5607707
6 7 29.428571 38.85714 6.0231497
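If you also want to keep the customer IDs (the row names of cal.cbs) next to the expectations, a small variation is sketched below, assuming the tibble package is available; the column name cond_expected is made up here:
library(tibble)
library(dplyr)

cal.cbs %>%
  as.data.frame() %>%
  rownames_to_column("cust") %>%   # cust comes from the row names of cal.cbs
  mutate(cond_expected = bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))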