My problem:
Suppose the sex ratio at birth (male to female) is 1.1, but people adopt the following strategy: have children until you have one son, then stop, unless you already have 12 daughters (in which case you also stop). What would the average sex ratio in the population be? (Calculate by simulation; suppose you randomly select 10,000 families.)
My code
pm=0.5238095 # Probability of Male
pw=0.4761905 # Female
w=0 # initial number of Female
n=1 # loop
p=0 # count of number
for(i in 1:n){
  s=rbinom(1,1,0.4761905)
  if(s==1){
    w=w+1
  }
  p=p+1
  while(w<=12){ ####1. How to count the number of female? ###
    while(s==1){
      s=rbinom(1,1,0.4761905)
      if(s==1){
        w=w+1
      }
      p=p+1
    }
  }
  f[i]=p
}
w/p
My question
How do I count the number of females? I'm using a loop to count them (if (s==1) { w = w+1 }), which seems inefficient; I think counting TRUE/FALSE values might be more efficient. How can I write the code more concisely?
The answer, of course, is that this strategy won't affect the sex ratio at all! At least as you've set this up, no matter what a couple's previous birth history is, the probability of a male arising from each birth is always the same.
Here's one way to confirm that with some calculations (the code is offered without further explanation, at least for now):
pm <- 0.5238095
m <- cbind(boys=c(rep(1, 12), 0), girls=0:12)
p <- c(dgeom(0:11, pm), 1-pgeom(11, pm))
## Calculate expected number of boys and girls for an immortal couple pursuing
## this "strategy"
(res <- p %*% m)
# boys girls
# [1,] 0.9998641 0.9089674
res[1] / sum(res)
# [1] 0.5238095 ## Look familiar
Yes, this is very inefficient. Perhaps I can address a couple of things in your code that almost make sense, and that will give you your answer. In your code...
for(i in 1:n){
s=rbinom(1,1,0.4761905)
if(s==1){
w=w+1
}
can be rewritten as...
s = rbinom(n,1,0.4761905)
w = sum(s)
That gives the same result. Keep in mind that rbinom produces 0s and 1s, so you can just sum them to count how many 1s there are. Given that you defined n, the number of 0s (females) is...
n - w
But, if you didn't it would be easy to find too...
length(s) - sum(s)
It is still probably inefficient but at least it's correct for what you're trying to do:
# set.seed(1)
pw <- 0.4761905 # Initial sex ratio
w <- 0          # number of daughters
n <- 10000      # number of families
p <- 0          # number of kids
f <- data.frame(Daughters=vector(length=n), Kids=vector(length=n))
for(i in 1:n){
  while(w < 12 & w==p){ # As long as you don't have 12 daughters or 1 son...
    s <- rbinom(1,1,pw)
    if(s==1){w <- w+1}
    p <- p+1
  }
  f[i,] <- c(w,p)  # Number of daughters and total kids in each family
  w <- p <- 0      # Reset number of kids and daughters for the next family
}
colSums(f)[1]/colSums(f)[2] # Final sex ratio
Daughters
0.4736842 # So, as @JoshO'Brien pointed out, very close to the original sex ratio.
And you can check f to see that there is never more than 1 son per family (number of kids minus number of daughters):
range(f[,2]-f[,1])
[1] 1 1 # Range of the number of boys per family
range(f[,1])
[1] 0 11 # Range of the number of daughters per family
nrow(f[f[,1]==0,])
[1] 5275 # Number of families having 1 son and no daughters (to be compared with 1-pw)
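For completeness, the whole simulation can also be vectorized: the number of daughters a family has before its first son is geometric, so (as a minimal sketch, assuming the same birth probabilities as above) you can draw all 10,000 families at once with rgeom and cap the daughters at 12:
set.seed(1)
pm <- 0.5238095                           # probability of a boy at each birth
n  <- 10000
daughters <- pmin(rgeom(n, pm), 12)       # girls born before the first boy, capped at 12
sons      <- as.integer(daughters < 12)   # a son arrives only if the cap wasn't reached
sum(daughters) / (sum(daughters) + sum(sons))  # proportion of girls, again close to 0.4762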
Related
I can do it for the two-sample t test, but not for the median test, the Wilcoxon test, or the Hodges-Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))
I think this is what you're trying to do; I altered your code as little as possible. There are packages like infer that will do this for you, and the for loop is not the most efficient, but it's plenty good enough and may help you learn. As long as we're looping, I computed the mean and median at the same time, since all other parts of the code are identical. ifelse is an easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
  median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
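As an aside, since those logical comparisons are already 0s and 1s, the same p-values can be written a bit more compactly with mean(); a small sketch, keeping the strict inequalities used above:
p_mean2   <- mean(mean_diff > delta_mean | mean_diff < -delta_mean)
p_median2 <- mean(median_diff > delta_median | median_diff < -delta_median)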
Following up on your question about the Hodges-Lehmann test, quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times; the answer is the same every time because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
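Incidentally, wilcox.test() can also report a Hodges-Lehmann-type estimate of the location shift when you ask for a confidence interval; as a quick cross-check (again with exact = FALSE because of the ties), it should come out close to the 49 computed above:
wilcox.test(data_2019, data_2000, conf.int = TRUE, exact = FALSE)$estimate
# "difference in location", i.e. the Hodges-Lehmann-style shift estimate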
I'll update this when I figure out how to do the other tests.
I'm quite new to coding, so I don't know what the limits are for what I can do in R, and I haven't been able to find an answer for this particular kind of problem yet, although it probably has quite a simple solution.
For equation 2, A.1 is the starting value, but in each subsequent equation I need to use the previous answer (i.e. for A.3 I need A.2, for A.4 I need A.3, etc.).
A.1 <- start.x*(1-rate[1])+start.x*rate[1]
A.[2:n] <- A.[n-1]*(1-rate[2:n])+x*rate[2:n]
How do I set A.1 as the initial value, and is there a better way of writing equation 2 than to copy and paste the equation 58 times?
I've included the variables I have below:
A.1<- -13.2 # which is the same as start.x
x<- -10.18947 # x[2:n]
n<- 58
Age<-c(23:80)
rate <- function(Age){
  Turnover <- (1/(1.0355*Age-3.9585))
  return(Turnover)
}
I need to find the age at which A can be rounded to -11.3. I expect to see it from ages 56 to 60.
Using the new information, try this:
x<- -10.18947
n<- 58
Age <- 23:80
rate <- (1 / (1.0355 * Age - 3.9585))
A <- vector("numeric", 58)
A[1] <- -13.2
for (i in 2:n) {
  A[i] <- A[i-1] * (1 - rate[i]) + x * rate[i]
}
Age[which.min(abs(A + 11.3))]
# [1] 58
plot(Age, A, type="l")
abline(h=-11.3, v=58, lty=3)
So the age at which A is closest to -11.3 is 58 years.
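If you prefer to avoid the explicit loop, the same recursion can also be written with Reduce(); a minimal sketch using the x, rate, and A objects defined above:
A2 <- Reduce(function(prev, r) prev * (1 - r) + x * r,
             rate[-1], init = -13.2, accumulate = TRUE)
all.equal(A, A2)  # the two approaches should agree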
I am trying to apply the Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such I am trying to construct code which can cope with this automatically without needing to manually construct a formula each time I do it. Example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
  total <- x[,"total"]
  nuse1 <- x[,"nuse1"]
  nuse2 <- x[,"nuse2"]
  nuse3 <- x[,"nuse3"]
  nuse4 <- x[,"nuse4"]
  div <- round(1 - ((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1))) / (total*(total - 1)), digits = 4)
  return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see this works fine. However, how would I be able to create a function which could automatically adjust to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function + as.formula as indicated here: Formula with dynamic number of variables; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions, please? Thanks.
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
  spcs <- grep(species, colnames(x))  # which column names contain "nuse"
  total <- rowSums(x[,spcs])          # sum by row
  div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
  return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find which column names contain "nuse" (or whatever species prefix your dataset uses). It constructs the "total" value within the function, so it does not require a total column in the dataset.
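Just to illustrate that it adapts automatically, here is a quick check with a hypothetical fifth species column added (the counts are made up purely for illustration):
diverse2 <- diverse
diverse2$nuse5 <- c(1, 0, 4, 2)   # hypothetical extra species
diverse2$Simpson5 <- simp(diverse2, species = "nuse")
diverse2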
I have to calculate cosine similarity (a patient similarity metric) in R between 48k patients' data with some predictive variables. Here is the equation: PSM(P1, P2) = P1 . P2 / (||P1|| ||P2||),
where P1 and P2 are the predictor vectors corresponding to two different patients; P1 is the index patient, each other patient P2 is compared with it, and finally the pairwise patient similarity metric PSM(P1, P2) is calculated.
This process will go on for all 48k patients.
I have added a sample data set for 300 patients in a .csv file. Please find the sample data set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall (if I remember my computational complexity correctly), as the number of rows in your input increases, the work grows roughly quadratically, since every row has to be compared with every other row.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
                   !grepl(pattern = "_id$",   x = temp$variable) &
                   !grepl(pattern = "_code$", x = temp$variable) &
                   !grepl(pattern = "_type$", x = temp$variable)) |
  temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <- pdf(file = "psm_pairs.pdf", width = 50, height = 50, paper = "special")
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <- different.pat[which.min(different.pat$similarity), ]
dissimilar.pat.frame <- rbind(
  psm_numerics[rownames(psm_numerics) == as.character(most.dissimilar$enc.A), ],
  psm_numerics[rownames(psm_numerics) == as.character(most.dissimilar$enc.B), ]
)
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of the flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with the starting data and all of the intermediate data frames:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of different varieties have mean similarities approaching negative 1.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by the product of [the magnitude of A] and [the magnitude of B].
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don't. R is highly optimized for operations on vectors and matrices. If you're new to R, don't second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
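For what it's worth, here is a minimal matrix-algebra sketch of an all-pairs version (rows as individuals); under the same scaling it should agree with lsa::cosine(t(psm_scaled)):
cosine.sim.matrix <- function(M) {
  M <- as.matrix(M)
  dots  <- M %*% t(M)          # all pairwise dot products
  norms <- sqrt(rowSums(M^2))  # vector magnitudes
  dots / (norms %o% norms)     # divide by the outer product of the magnitudes
}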
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
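As a quick cross-check, lsa::cosine() also accepts two plain vectors, and the cosine of 45 degrees is 1/sqrt(2):
lsa::cosine(individual.1, individual.2)  # about 0.7071068, i.e. 1/sqrt(2)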
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
This, I believe, is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
I have a simple (indeed standard in economics) nonlinear constrained discrete maximisation problem to solve in R and am having trouble. I found solutions for parts of the problem (nonlinear maximisation; discrete maximisation) but not for the union of all the problems.
Here is the problem. A consumer wants to buy three products (ananas, banana, cookie), knows the prices and has a budget of 20€. He likes variety (i.e., he wants to have all three products if possible) and his satisfaction is decreasing in the amount consumed (he likes his first cookie way more than his 100th).
The function he wishes to maximise is
u(x_a, x_b, x_c) = sqrt(x_a) * sqrt(x_b) * sqrt(x_c)
and of course, since each product has a price and he has a limited budget, he maximises this function under the constraint that
p_a * x_a + p_b * x_b + p_c * x_c <= 20
What I want to do is to find the optimal buying list (N ananas, M bananas, K cookies) that satisfies the constraint.
If the problem were linear, I would simply use linprog::solveLP(). But the objective function is nonlinear.
If the problem were of a continuous nature, there would be a simple analytic solution to it.
The question being discrete and nonlinear, I do not know how to proceed.
Here is some toy data to play with.
df <- data.frame(rbind(c("ananas",2.17),c("banana",0.75),c("cookie",1.34)))
names(df) <- c("product","price")
I'd like to have an optimization routine that gives me an optimal buying list of (N,M,K).
Any hints?
1) no packages This can be done by brute force. Using df from the question as input, ensure that price is numeric (it's a factor in the df of the question) and calculate the largest feasible count mx for each variable. Then create a grid g of variable counts, and compute the total price and the associated objective for each, giving gg. Now sort gg in descending order of objective and take those solutions satisfying the constraint. head will show the top few solutions.
price <- as.numeric(as.character(df$price))
mx <- ceiling(20/price)
g <- expand.grid(ana = 0:mx[1], ban = 0:mx[2], cook = 0:mx[3])
gg <- transform(g, total = as.matrix(g) %*% price, objective = sqrt(ana * ban * cook))
best <- subset(gg[order(-gg$objective), ], total <= 20)
giving:
> head(best) # 1st row is best soln, 2nd row is next best, etc.
ana ban cook total objective
1643 3 9 5 19.96 11.61895
1929 3 7 6 19.80 11.22497
1346 3 10 4 19.37 10.95445
1611 4 6 5 19.88 10.95445
1632 3 8 5 19.21 10.95445
1961 2 10 6 19.88 10.95445
2) dplyr This can also be nicely expressed using the dplyr package. Using g and price from above:
library(dplyr)
g %>%
  mutate(total = c(as.matrix(g) %*% price), objective = sqrt(ana * ban * cook)) %>%
  filter(total <= 20) %>%
  arrange(desc(objective)) %>%
  top_n(6)
giving:
Selecting by objective
ana ban cook total objective
1 3 9 5 19.96 11.61895
2 3 7 6 19.80 11.22497
3 3 10 4 19.37 10.95445
4 4 6 5 19.88 10.95445
5 3 8 5 19.21 10.95445
6 2 10 6 19.88 10.95445
If you do not mind using a "by hand" solution:
uf <- function(x) prod(x)^.5
bf <- function(x, pr){
  if(!is.null(dim(x))) apply(x, 1, bf, pr) else x %*% pr
}
budget <- 20
df <- data.frame(product = c("ananas", "banana", "cookie"),
                 price = c(2.17, 0.75, 1.34), stringsAsFactors = FALSE)
an <- 0:(budget/df$price[1]) # include 0 for all possibilities
bn <- 0:(budget/df$price[2])
co <- 0:(budget/df$price[3])
X <- expand.grid(an, bn, co)
colnames(X) <- df$product
EX <- apply(X, 1, bf, pr = df$price)
psX <- X[which(EX <= budget), ]                            # 1st restriction
psX <- psX[apply(psX, 1, function(z) sum(z == 0)) == 0, ]  # 2nd restriction
Ux <- apply(psX, 1, uf)
cbind(psX, Ux)
(sol <- psX[which.max(Ux), ])
uf(sol)            # utility
bf(sol, df$price)  # budget
> (sol=psX[which.max(Ux),])
ananas banana cookie
1444 3 9 5
> uf(sol) # utility
[1] 11.61895
> bf(sol,df$price) #budget
1444
19.96
I think this problem is very similar in nature to this question (Solve indeterminate equation system in R). The answer by Richie Cotton was the basis for this possible solution:
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
FUN <- function(w, price=df$price){
  total <- sum(price * w)
  errs <- c((total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3])))
  sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.140093 9.085182 5.085095
sum(res$par*df$price) # 20.44192
Notice that the total cost (i.e. price) for this solution is 20.44, which exceeds the budget of 20. To fix this, we can weight the error terms to put more emphasis on the 1st term, which relates to the total cost:
### weighting of error terms
FUN2 <- function(w, price=df$price){
  total <- sum(price * w)
  errs <- c(100*(total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3]))) # 1st term weighted by 100
  sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN2, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.072868 8.890832 4.976212
sum(res$par*df$price) # 20.00437
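In this particular case, rounding the continuous solution happens to land on the same integer bundle the brute-force answers found, and it stays within the budget, though (as the next answer notes) rounding is not guaranteed to give the true integer optimum in general:
w_int <- round(res$par)    # 3 9 5
sum(w_int * df$price)      # 19.96, within the budget of 20
sqrt(prod(w_int))          # 11.61895, the same objective value as the brute-force best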
As LyzandeR remarked, there is no nonlinear integer programming solver available in R. Instead, you can use the R package rneos, which sends data to one of the NEOS solvers and returns the results into your R process.
Select one of the solvers for "Mixed Integer Nonlinearly Constrained Optimization" on the NEOS Solvers page, e.g., Bonmin or Couenne. For your example above, send the following files in the AMPL modeling language to one of these solvers:
[Note that maximizing the product x1 * x2 * x3 is the same as maximizing sqrt(x1) * sqrt(x2) * sqrt(x3), since the square root is a monotone transformation of a nonnegative product.]
Model file:
param p{i in 1..3};
var x{i in 1..3} integer >= 1;
maximize profit: x[1] * x[2] * x[3];
subject to restr: sum{i in 1..3} p[i] * x[i] <= 20;
Data file:
param p:= 1 2.17 2 0.75 3 1.34 ;
Command file:
solve;
display x;
and you will receive the following solution:
x [*] :=
1 3
2 9
3 5
;
This approach will work for more extended examples where solutions "by hand" are not reasonable and rounded optim solutions are not correct.
To look at a more demanding example, let me propose the following problem:
Find an integer vector x = (x_i), i=1,...,10, that maximizes x1 * ... * x10, such that p1*x1 + ... + p10*x10 <= 10, where p = (p_i), i=1,...,10, is the following price vector
p <- c(0.85, 0.22, 0.65, 0.73, 0.91, 0.11, 0.31, 0.47, 0.93, 0.71)
Using constrOptim for this nonlinear optimization problem with a linear inequality constraint, I get solutions like 900 for different starting points, but never the optimal solution, which is 960!