R - Optimization (max) - r

I am trying to port a solution from Excel Solver to R but am not sure where to start.
The problem: choose 5 options for each hour (5 rows) that maximize the sum of "Score", without picking the same group more than 2 times across hours.
In other words: maximize Score, subject to:
1. Rows within the same group are picked at most 2 times.
2. Rows within the same hour are picked at most 5 times.
I think it is easiest to explain this by showing the data and the results from Excel:
Data:
group,hour,Score
a,1,1000
a,2,1231
b,1,12312
b,2,6438
c,1,3033
c,2,6535
d,1,4283
d,2,4957
e,1,9507
e,2,5115
f,1,1914
f,2,9278
g,1,5362
g,2,8408
h,1,4640
h,2,4296
j,1,8115
j,2,1143
aa,1,3242
aa,2,3695
bb,1,3908
bb,2,2540
cc,1,6438
cc,2,2170
dd,1,6497
dd,2,3327
ee,1,5067
ee,2,6614
ff,1,5140
ff,2,9858
gg,1,8061
gg,2,2316
hh,1,7848
hh,2,3525
jj,1,8259
jj,2,9014
a,3,31100
b,3,111100
c,3,87200
d,3,60700
e,3,50600
f,3,74300
g,3,97400
h,3,28900
j,3,25900
aa,3,55600
bb,3,38200
cc,3,58500
dd,3,51300
ee,3,84000
ff,3,83700
gg,3,74200
hh,3,19700
jj,3,62800
Data in dput format.
df1 <-
structure(list(group = structure(c(1L, 1L, 3L, 3L,
5L, 5L, 7L, 7L, 9L, 9L, 11L, 11L, 13L, 13L, 15L,
15L, 17L, 17L, 2L, 2L, 4L, 4L, 6L, 6L, 8L, 8L,
10L, 10L, 12L, 12L, 14L, 14L, 16L, 16L, 18L,
18L, 1L, 3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L,
2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L),
.Label = c("a", "aa", "b", "bb", "c", "cc",
"d", "dd", "e", "ee", "f", "ff", "g", "gg",
"h", "hh", "j", "jj"), class = "factor"),
hour = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), Score = c(1000L,
1231L, 12312L, 6438L, 3033L, 6535L, 4283L, 4957L,
9507L, 5115L, 1914L, 9278L, 5362L, 8408L, 4640L,
4296L, 8115L, 1143L, 3242L, 3695L, 3908L, 2540L,
6438L, 2170L, 6497L, 3327L, 5067L, 6614L, 5140L,
9858L, 8061L, 2316L, 7848L, 3525L, 8259L, 9014L,
31100L, 111100L, 87200L, 60700L, 50600L, 74300L,
97400L, 28900L, 25900L, 55600L, 38200L, 58500L,
51300L, 84000L, 83700L, 74200L, 19700L, 62800L)),
class = "data.frame", row.names = c(NA, -54L))

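This is an optimization problem with binary variables and linear constraints, so you can use the lpSolve package:
library(lpSolve)
library(data.table) # for pivoting the data and shifting the constraint coefficients
# pivot to one row per group and one column per hour
# (setDT converts df1 to a data.table so that data.table's dcast is used)
d <- dcast(setDT(df1), group ~ hour, value.var="Score")
nr <- nrow(d)
nc <- ncol(d) - 1L
# one constraint row per group: a group is picked at most 2 times across hours
m1 <- matrix(c(1,1,1,rep(0, nr*nc-3L)), ncol=nc, byrow=TRUE)
max2constr <- do.call(rbind, shift(m1, 0L:(nr-1), fill=0))
# one constraint row per hour: exactly 5 groups are picked in each hour
m2 <- matrix(c(rep(1, nr), rep(0, (nc-1)*nr)), ncol=nc)
choose5constr <- do.call(rbind, shift(m2, seq(0, by=nr, length.out=nc), fill=0))
ans <- lp("max",
    unlist(d[, 2:4]),                                             # objective: the scores, column by column
    rbind(max2constr, choose5constr),                             # constraint matrix
    c(rep("<=", nrow(max2constr)), rep("=", nrow(choose5constr))), # directions
    c(rep(2, nrow(max2constr)), rep(5, nrow(choose5constr))),     # right-hand sides
    all.bin=TRUE)
ans$objval
# reshape the 0/1 solution vector back into a group x hour matrix
soln <- matrix(ans$solution, nrow=nr, dimnames=list(d$group, names(d)[-1L]))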
Objective value = 552826
soln output:
1 2 3
a 0 0 0
aa 0 0 0
b 1 0 1
bb 0 0 0
c 0 0 1
cc 0 0 0
d 0 0 0
dd 0 0 0
e 1 0 0
ee 0 1 1
f 0 1 0
ff 0 1 1
g 0 1 1
gg 1 0 0
h 0 0 0
hh 0 0 0
j 1 0 0
jj 1 1 0
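The same kind of constraint matrix can also be built without pivoting. The sketch below is not the code that produced the output above: it uses a made-up toy data frame (4 groups, 2 hours) with assumed limits of 1 pick per group and 2 picks per hour, chosen so the result is easy to check by hand. It keeps the data in long format, uses one binary variable per row, and lets model.matrix() generate the indicator rows.

```r
library(lpSolve)

# Toy data (hypothetical): 4 groups x 2 hours
toy <- data.frame(
  group = rep(c("a", "b", "c", "d"), times = 2),
  hour  = rep(1:2, each = 4),
  Score = c(10, 20, 30, 40, 45, 35, 25, 15)
)

# One binary decision variable per row of `toy`.
# Each row of these matrices flags the rows of `toy` in one group / one hour.
group_con <- t(model.matrix(~ 0 + factor(group), data = toy))
hour_con  <- t(model.matrix(~ 0 + factor(hour),  data = toy))

res <- lp(
  direction    = "max",
  objective.in = toy$Score,
  const.mat    = rbind(group_con, hour_con),
  const.dir    = c(rep("<=", nrow(group_con)), rep("=", nrow(hour_con))),
  const.rhs    = c(rep(1, nrow(group_con)), rep(2, nrow(hour_con))),  # assumed limits
  all.bin      = TRUE
)

picked <- toy[res$solution > 0.5, ]
picked       # c and d in hour 1, a and b in hour 2
res$objval   # 150 for this toy data
```

For the original problem the right-hand sides would be 2 and 5, and the solution vector lines up with the rows of the long data frame, so no reshaping back is needed.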

Related

R:Error in `[.data.frame` undefined columns selected

I have this data sample
dput()
timeseries=structure(list(sales_point_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), calendar_id_operday = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
20L, 21L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), line_fact_amt = c(55767L,
59913L, 36363L, 48558L, 505L, 76344L, 22533L, 11965L, 78944L,
36754L, 30621L, 55716L, 32470L, 62165L, 57986L, 2652L, 16487L,
72849L, 73715L, 65656L, 64411L, 47460L, 61866L, 10877L, 72392L,
53011L, 23544L, 76692L, 10388L, 24255L, 56684L, 59329L, 6655L,
65612L, 17495L, 10389L, 63702L, 47407L, 78782L, 22898L, 21151L,
32587L)), class = "data.frame", row.names = c(NA, -42L))
calendar_id_operday = 1 means the week 20210101-20210108 (ymd). There is no date column here, only a week number; that is just a peculiarity of this data. I tried to transform my data:
library(reshape)
df <- cast(melt(timeseries, id=c("calendar_id_operday"), na.rm=TRUE),
line_fact_amt + calendar_id_operday)[, c("line_fact_amt", "calendar_id_operday", substring(month.name, 1, 3))]
colnames(df)[1] <- "sales_point_id"
df[, substring(month.name, 1, 3)] <- lapply(timeseries[, substring(month.name, 1, 3)],
function(x) as.numeric(as.character(x)))
But something goes wrong:
Error in `[.data.frame`(timeseries, , substring(month.name, 1, 3)) :
undefined columns selected
As the result, I want to get this data.frame:
sales_point_id year jan-1 jan-2 jan-3 jan-4 feb1
1 1 2021 8034.843 7485.725 8238.493 8446.994 134
2 1 2021 7810.315 7261.198 8013.965 8222.466 346
3 1 2021 7585.788 7036.670 7789.438 7997.938 54364
4 1 2021 7361.260 6812.142 7564.910 7773.411 34546
5 1 2021 7136.733 6587.615 7340.382 7548.883 46436
jan-1 is the data for the first week of January, jan-2 is the second week of January, and so on.
What should I do to get the desired result?
Thanks for your valuable help
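The error itself arises because timeseries has no columns named "Jan", "Feb", and so on; it only has sales_point_id, calendar_id_operday and line_fact_amt. A possible base R route to the wide layout is sketched below. It rests on an assumption that is not stated in the question: week k is taken to start on 2021-01-01 + 7 * (k - 1) days, and a week is assigned to the month of its start date. The sample uses only the first six weeks of each sales point from the dput() above.

```r
# First six weeks of each sales point, taken from the dput() above
timeseries <- data.frame(
  sales_point_id      = rep(1:2, each = 6),
  calendar_id_operday = rep(1:6, times = 2),
  line_fact_amt       = c(55767, 59913, 36363, 48558, 505, 76344,
                          47460, 61866, 10877, 72392, 53011, 23544)
)

# ASSUMPTION: week k starts on 2021-01-01 + 7 * (k - 1) days
week_start <- as.Date("2021-01-01") + 7 * (timeseries$calendar_id_operday - 1)
lt         <- as.POSIXlt(week_start)
month_idx  <- lt$mon + 1      # 1 = January, ..., 12 = December
year       <- lt$year + 1900

# Number the weeks within each sales point / year / month
week_in_month <- ave(timeseries$calendar_id_operday,
                     timeseries$sales_point_id, year, month_idx,
                     FUN = seq_along)

long <- data.frame(
  sales_point_id = timeseries$sales_point_id,
  year           = year,
  label          = paste0(tolower(month.abb[month_idx]), "-", week_in_month),
  line_fact_amt  = timeseries$line_fact_amt
)

# One row per sales point and year, one column per month-week label
wide <- reshape(long, idvar = c("sales_point_id", "year"),
                timevar = "label", direction = "wide")
names(wide) <- sub("^line_fact_amt\\.", "", names(wide))
wide
```

Under this assumption weeks 1 to 5 fall in January and week 6 in February, so the columns come out as jan-1 ... jan-5, feb-1. If the real calendar mapping differs, only the week_start line needs to change.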

Using sample() to sample from nested lists in R

I'm looking for a way to use sample() to sample values from different lists based on a value in another column of a data.table - at the moment I'm getting a recursive indexing failed error - code and more explanation below:
First set up example data:
library(stats)
library(data.table)
# list of three different nest survival rates
survival<-list(0.91,0.95,0.99)
# incubation period
inc.period<-28
# then set up function to use the geometric distribution to generate 3 lists of incubation outcomes based on the nest survivals and incubation period above.
# e.g. less than 28 is a nest failure, 28 is a successful nest.
create.sample <- function(survival){
outcome<-rgeom(100,1-survival)
fifelse(outcome > inc.period, inc.period, outcome)
}
# then create list of 100 nest outcomes with 3 different survival values using lapply
inc.outcomes <- lapply(survival,create.sample)
# set up a data.table - each row of data will be a nest.
index<-c(1:3)
iteration<-1:20
dt = CJ(index,iteration)
Then I want to make a new column 'inc.period' which samples from the 'inc.outcomes' list using the index column of the dt to select which of the three 'inc.outcomes' lists to sample from (with a different sample for each row of data).
So e.g. when index = 1, the sampled value comes from inc.outcomes[[1]] - which is the low nest survival list, when index = 2 I sample from inc.outcomes[[2]] etc.
The code would look something like this but this doesn't work (I get a recursive indexing failed error):
dt[,inc.period:= sample(inc.outcomes[[index]],nrow(dt),replace = TRUE)]
Any help or advice gratefully received, also suggestions for different approaches to this problem - this is for an update to code that runs in a shiny simulation so quicker options preferred!
There are two problems:
1. inc.outcomes[[index]] fails because index is 60 long here, so you are effectively trying inc.outcomes[[ c(1,1,...,2,2,...,3,3) ]]. [[-indexing takes either a length-1 index (for most uses) or a vector as long as the list is nested deep. For example, in list(list(1,2),list(3,4))[[ c(1,2) ]] the length-2 index works because we have 2-deep nested lists. Since inc.outcomes is only 1-deep, we can only use a length-1 index inside [[.
2. This means we need to do this by index. (Given that, we also need to change nrow(dt) to .N, but frankly we should be using .N anyway, even without by=.)
dt[, inc.period := sample(inc.outcomes[[ index[1] ]], .N, replace = TRUE), by = index]
# index iteration inc.period
# <int> <int> <num>
# 1: 1 1 17
# 2: 1 2 17
# 3: 1 3 21
# 4: 1 4 24
# 5: 1 5 3
# 6: 1 6 1
# 7: 1 7 17
# 8: 1 8 0
# 9: 1 9 1
# 10: 1 10 0
# ---
# 51: 3 11 0
# 52: 3 12 0
# 53: 3 13 28
# 54: 3 14 28
# 55: 3 15 9
# 56: 3 16 28
# 57: 3 17 7
# 58: 3 18 28
# 59: 3 19 28
# 60: 3 20 28
My data:
dt <- setDT(structure(list(index = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), iteration = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -60L), class = c("data.table", "data.frame"), sorted = c("index", "iteration")))
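The [[ behaviour described above can be seen in isolation with a few lines of base R. This is only an illustration on made-up values (nested, outcomes and index here are hypothetical stand-ins, not the question's objects); it also shows a per-row alternative that draws one value per row with vapply(), in case a separate draw for every row is wanted without grouping.

```r
# A 2-deep list: a length-2 index drills down one level per element
nested <- list(list(1, 2), list(3, 4))
nested[[c(1, 2)]]                      # same as nested[[1]][[2]]

# A 1-deep list of outcome vectors (made-up values)
outcomes <- list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300))
index    <- rep(1:3, each = 4)

# Indexing with the whole `index` vector tries to drill 12 levels deep and fails
res <- tryCatch(outcomes[[index]], error = function(e) "error")
res

# Per-row alternative: one draw per row from the list that row points to
set.seed(1)
draws <- vapply(index, function(i) sample(outcomes[[i]], 1), numeric(1))
draws
```

Note that sample(x, 1) treats a single number x as 1:x, so this per-row form assumes every outcome vector has more than one element. The grouped data.table version is likely to be faster on large tables because it calls sample() once per group rather than once per row.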

`ddply` fails to apply logistic regression (GLM) by group to my dataset

I'm working out the LD50 (lethal dosage) for multiple populations from different experiments using the MASS package. It's simple enough when I subset the data and do one at a time, but I'm getting an error when I use ddply. Essentially I need an LD50 for each population at each temperature.
My data looks somewhat like this:
# dput(d)
d <- structure(list(Pop = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor"), Temp = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("high", "low"), class = "factor"),
Dose = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Dead = c(0L,
11L, 12L, 14L, 2L, 16L, 17L, 7L, 5L, 3L, 17L, 15L, 9L, 20L,
8L, 19L, 7L, 2L, 20L, 14L, 9L, 15L, 1L, 15L), Alive = c(20L,
9L, 8L, 6L, 18L, 4L, 3L, 13L, 15L, 17L, 3L, 5L, 11L, 0L,
12L, 1L, 13L, 18L, 0L, 6L, 11L, 5L, 19L, 5L)), .Names = c("Pop",
"Temp", "Dose", "Dead", "Alive"), class = "data.frame", row.names = c(NA,
-24L))
The following works fine:
d$Mortality <- cbind(d$Alive, d$Dead)
a <- d[d$Pop=="a" & d$Temp=="high",]
library(MASS)
dose.p(glm(Mortality ~ Dose, family="binomial", data=a), p=0.5)[1]
But when I put this into ddply I get the following error:
library(plyr)
d$index <- paste(d$Pop, d$Temp, sep="_")
ddply(d, 'index', function(x) dose.p(glm(Mortality~Dose, family="binomial", data=x), p=0.5)[1])
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
I can get the right LD50 when I use a proportion but can't figure out where I've gone wrong with my approach (and had already written this question).
This may surprise you, but if you use the formula
cbind(Alive, Dead) ~ Dose
instead of
Mortality ~ Dose
the problem will be gone.
library(MASS)
library(plyr)
## `d` is as your `dput` result
## a function to apply
f <- function(x) {
fit <- glm(cbind(Alive, Dead) ~ Dose, family = "binomial", data = x)
dose.p(fit, p=0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), f)
# Pop Temp V1
#1 a high 2.6946257
#2 a low 2.1834099
#3 b high 2.5000000
#4 b low 0.4830998
#5 c high 2.2899553
#6 c low 2.5000000
So what happened with Mortality ~ Dose? Let's set .inform = TRUE when calling ddply:
## `d` is as your `dput` result
d$Mortality <- cbind(d$Alive, d$Dead)
## a function to apply
g <- function(x) {
fit <- glm(Mortality ~ Dose, family = "binomial", data = x)
dose.p(fit, p=0.5)[[1]]
}
## call `ddply`
ddply(d, .(Pop, Temp), g, .inform = TRUE)
#Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
#Error: with piece 1:
# Pop Temp Dose Dead Alive Mortality
#1 a high 1 0 20 20
#2 a high 2 11 9 9
#3 a high 3 12 8 8
#4 a high 4 14 6 6
Now we see that the variable Mortality has lost its dimension, and only the first column (Alive) is retained. For a glm with a binomial response, if the response is a single vector, glm expects 0-1 binary values or a factor with two levels. Here we have the integers 20, 9, 8, 6, ..., hence glm complains:
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
I have not found a way to fix this inside ddply. I tried protecting the matrix column with I():
d$Mortality <- I(cbind(d$Alive, d$Dead))
but it still ends up with the same failure.
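If ddply is not a requirement, the grouping can also be done in base R, which avoids the matrix column altogether. The following is a minimal sketch, not part of the answer above: ld50 is a hypothetical helper, and it relies on the fact that with a logit link the dose at p = 0.5 is where the linear predictor is zero, i.e. -intercept / slope. Only population "a" from the question's data is used to keep the example short.

```r
# Subset of the question's data: population "a" at both temperatures
d <- data.frame(
  Pop   = "a",
  Temp  = rep(c("high", "low"), each = 4),
  Dose  = rep(1:4, times = 2),
  Dead  = c(0, 11, 12, 14, 2, 16, 17, 7),
  Alive = c(20, 9, 8, 6, 18, 4, 3, 13)
)

# Hypothetical helper: LD50 from a logistic fit, i.e. the dose at which
# the linear predictor (intercept + slope * Dose) equals 0
ld50 <- function(x) {
  fit <- glm(cbind(Alive, Dead) ~ Dose, family = binomial, data = x)
  b <- coef(fit)
  unname(-b[1] / b[2])
}

# Split by population and temperature, then fit one model per piece
pieces <- split(d, list(d$Pop, d$Temp), drop = TRUE)
res <- sapply(pieces, ld50)
res
```

The result is a named vector with one entry per Pop.Temp combination; for "a.high" it should agree with the value in the ddply output above.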

Sampling distribution and sum of tables

I've run a few experiments, and each experiment resulted in a color.
As I can't do more experiments, I want to draw samples of size 30 and see what frequency table (of colors) I would obtain over 1000 samples. The resulting frequency table should be the sum of the 1000 individual frequency tables.
I thought about concatenating the tables as follows and then aggregating, but it did not work:
mydata=structure(list(Date = structure(c(11L, 1L, 9L, 9L, 10L, 1L, 2L,
3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 7L, 4L, 4L, 4L, 6L, 6L, 11L,
5L, 4L, 7L, 10L, 6L, 6L, 2L, 5L, 7L, 11L, 1L, 9L, 11L, 11L, 11L,
1L, 1L), .Label = c("01/02/2016", "02/02/2016", "03/02/2016",
"08/02/2016", "10/02/2016", "11/02/2016", "16/02/2016", "22/02/2016",
"26/01/2016", "27/01/2016", "28/01/2016"), class = "factor"),
Color = structure(c(30L, 33L, 11L, 1L, 18L, 18L, 11L,
16L, 19L, 19L, 22L, 1L, 18L, 18L, 13L, 14L, 13L, 18L, 24L,
24L, 11L, 24L, 2L, 33L, 25L, 1L, 30L, 5L, 24L, 18L, 13L,
35L, 19L, 19L, 18L, 23L, 19L, 8L, 19L, 14L), .Label = c("ARD",
"ARP", "BBB", "BIE", "CFX", "CHR", "DDD", "DOO", "EAU", "ELY",
"EPI", "ETR", "GEN", "GER", "GGG", "GIS", "ISE", "JUV", "LER",
"LES", "LON", "LYR", "MON", "NER", "NGY", "NOJ", "NYO", "ORI",
"PEO", "RAY", "RRR", "RSI", "SEI", "SEP", "VIL", "XQU", "YYY",
"ZYZ"), class = "factor"), Categorie = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "1,2", "1,2,3",
"1,3", "2", "2,3", "3", "4", "5"), class = "factor"), Portion_Longueur = c(3L,
4L, 1L, 1L, 2L, 4L, 5L, 6L, 7L, 7L, 8L, 8L, 9L, 8L, 8L, 9L,
11L, 7L, 7L, 7L, 9L, 8L, 3L, 8L, 7L, 11L, 2L, 9L, 8L, 5L,
8L, 12L, 3L, 4L, 1L, 3L, 3L, 3L, 4L, 5L)), .Names = c("Date",
"Color", "Categorie", "Portion_Longueur"), row.names = c(NA,
40L), class = "data.frame")
for (i in 1:1000) {
mysamp= sample(mydata$Color,size=30)
x=data.frame(table(mysamp))
if (i==1) w=x
else w <- c(w, x)
}
aggregate(w$Freq, by=list(Color=w$mysamp), FUN=sum)
For example, with 3 samples (for (i in 1:3)), I expect to get the sum of the three frequency tables.
But I do not get the sum; instead I get:
Color x
1 ARD 2
2 ARP 1
3 BBB 0
4 BIE 0
5 CFX 0
6 CHR 0
7 DDD 0
8 DOO 1
9 EAU 0
10 ELY 0
11 EPI 3
12 ETR 0
13 GEN 2
14 GER 2
15 GGG 0
16 GIS 1
17 ISE 0
18 JUV 4
19 LER 5
20 LES 0
21 LON 0
22 LYR 1
23 MON 1
24 NER 2
25 NGY 1
26 NOJ 0
27 NYO 0
28 ORI 0
29 PEO 0
30 RAY 1
31 RRR 0
32 RSI 0
33 SEI 2
34 SEP 0
35 VIL 1
36 XQU 0
37 YYY 0
38 ZYZ 0
How can I do this?
Thanks a lot
Your for loop is what's causing your issues. You end up creating a big list that is somewhat difficult to perform calculations on (check out names(w) to see what I mean). A better data structure would allow for easier calculations:
x = NULL #initialize
for (i in 1:1000) {
mysamp = sample(mydata$Color,size=30) #sample
mysamp = data.frame(table(mysamp)) #frequency
x = rbind(x, mysamp) #bind to x
}
aggregate(Freq~mysamp, data = x, FUN = sum) #perform calculation
Note that this loop runs a bit slower than your loop. This is because of the rbind() function. See this post. Maybe someone will come along with a more efficient solution.
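One way to avoid growing a data frame inside the loop is to collect the counts as columns of a matrix and sum the rows at the end. The sketch below uses a small made-up factor in place of mydata$Color (including an unused level, like the zero-count colors in the question) and a sample size of 5; the names colors, n_samples and sample_size are illustrative only.

```r
set.seed(42)

# Made-up stand-in for mydata$Color; "ZYZ" is a level that never occurs
colors <- factor(c("ARD", "ARD", "EPI", "JUV", "JUV", "JUV", "LER", "GEN"),
                 levels = c("ARD", "EPI", "GEN", "JUV", "LER", "ZYZ"))
n_samples   <- 1000
sample_size <- 5

# Each column holds the frequency vector of one sample (one row per level)
counts <- replicate(
  n_samples,
  tabulate(sample(as.integer(colors), size = sample_size),
           nbins = nlevels(colors))
)

# Sum over the samples to get the total frequency per color
total <- setNames(rowSums(counts), levels(colors))
total
```

With the real data this would be sample(as.integer(mydata$Color), size = 30) and nlevels(mydata$Color). Because tabulate() always returns one count per level, every sample has the same length and the results stack cleanly into a matrix.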

array manipulation: calculate odds ratios for a layer in a 3-way table

This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have this value in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
                 deaths  0  1 2+
litter treatment
7      A                58 11  5
       B                75 19  7
8      A                49 14 10
       B                58 17  8
9      A                33 18 15
       B                45 22 10
10     A                15 13 15
       B                39 22 18
11     A                 4 12 17
       B                 5 15  8
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs.2+ deaths, which is easy in
array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
      treatment
litter          A          B
    7   1.6625477  1.3730491
    8   1.2527630  1.2272297
    9   0.6061358  0.7156200
    10  0.1431008  0.5725192
    11 -1.0986123 -1.0986123
But, for analysis, I want to have these in a data frame, with factors litter, treatment and contrast
and a column, 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c(1, -1, 0,
                    0, 1, -1), ncol = 2)
followed by reshaping and labeling.
Can anyone help with this?
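A possible shape for such a general function is sketched below. It is only one way to do it: adjacent_logodds is a hypothetical name, it assumes a 3-way table whose last dimension is the response and whose dimnames are named, and it does not handle zero cells. The example uses just litters 7 and 8 from the table above.

```r
# Hypothetical helper: adjacent log odds over the last dimension of an
# I x J x K table, returned as a data frame with I * J * (K - 1) rows
adjacent_logodds <- function(tab) {
  stopifnot(length(dim(tab)) == 3, dim(tab)[3] >= 2)
  # Work on a plain array so that subsetting keeps all three dimensions
  a   <- array(as.vector(tab), dim = dim(tab), dimnames = dimnames(tab))
  K   <- dim(a)[3]
  lev <- dimnames(a)[[3]]
  # log(n_k / n_{k+1}) for k = 1, ..., K - 1
  lo <- log(a[, , -K, drop = FALSE] / a[, , -1, drop = FALSE])
  dimnames(lo)[[3]] <- paste(lev[-K], lev[-1], sep = ":")
  names(dimnames(lo))[3] <- "contrast"
  as.data.frame(as.table(lo), responseName = "logodds")
}

# Litters 7 and 8 from the table above: litter x treatment x deaths
mice <- array(
  c(58, 49, 75, 58,   11, 14, 19, 17,   5, 10, 7, 8),
  dim = c(2, 2, 3),
  dimnames = list(litter = c("7", "8"), treatment = c("A", "B"),
                  deaths = c("0", "1", "2+"))
)

res <- adjacent_logodds(mice)
res
```

Since the function only reads dim() and dimnames(), it should also accept the xtabs object directly, as in adjacent_logodds(mice.tab).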
