Why does Keras perform poorly on this simple toy dataset?

Here I've created a toy dataset by randomly sampling from two Bernoulli distributions whose success probabilities are dictated by the logistic functions
1 / (1 + exp(-0.2 * (x - 20)))
-1 / (1 + exp(-0.2 * (x - 80)))
My hope was that I could train a Keras neural net with a 2-node hidden layer and a softmax activation function that would learn these two logistic functions, but the resulting model predicts a probability of 1 for every x value.
library(keras)
train <- data.frame(
  x = c(4.44, 8.25, 15.72, 17.53, 17.53, 17.86, 18.57, 20.22, 20.24, 20.57, 21.99, 25.06, 28.3, 31.1, 35.91, 37.29, 38.36, 39.58,
        39.78, 40.1, 47.29, 51.67, 51.74, 53.52, 57.45, 62.69, 63.03, 69.03, 70.11, 74.44, 76.4, 79.81, 86.92, 87.59, 89.88),
  y = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
head(train, 10)
x y
1 4.44 0
2 8.25 0
3 15.72 0
4 17.53 0
5 17.53 0
6 17.86 0
7 18.57 0
8 20.22 0
9 20.24 1
10 20.57 1
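The sampling process described above can be sketched as follows (an assumption on my part: y is drawn as Bernoulli(p(x)) with p(x) the sum of the two logistic curves, which matches the pattern of the training labels; the seed and the uniform x grid are hypothetical):

```r
# implied success probability: rises near x = 20, falls back down near x = 80
p <- function(x) 1 / (1 + exp(-0.2 * (x - 20))) - 1 / (1 + exp(-0.2 * (x - 80)))

set.seed(1)  # hypothetical seed, for reproducibility only
x <- sort(round(runif(35, 0, 90), 2))
y <- rbinom(length(x), size = 1, prob = p(x))
```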
# Build and fit model
model <- keras_model_sequential()
model <- layer_dense(object = model, input_shape = 1L, use_bias = TRUE, units = 2L, activation = 'sigmoid')
model <- layer_dense(object = model, units = 1L, activation = 'softmax', input_shape = 2L)
model <- compile(object = model, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy'))
fit(object = model, x = train$x, y = train$y, epochs = 30)
# Evaluate
predict_proba(object = model, x = train$x)[, 1]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Why does Keras do such a poor job of fitting to the training data?

Keras is not doing a poor job; it is doing exactly the job you told it to do in your network architecture :)
You are using a softmax activation at the output with only one output neuron, which means the softmax will always output 1.0, since the output is normalized across neurons. Do not do that: use at least two output neurons so the normalization can happen correctly.
As you use binary cross-entropy loss, a better choice of activation is sigmoid at the output, which works with a single output neuron.
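A minimal sketch of the suggested fix, keeping the question's architecture but swapping the output activation to sigmoid (the division of x by 100 is my suggestion, not part of the question: plain SGD tends to struggle with unscaled inputs in the 0-90 range; this assumes the `train` data frame defined above):

```r
library(keras)

model <- keras_model_sequential()
model <- layer_dense(object = model, input_shape = 1L, units = 2L, activation = 'sigmoid')
model <- layer_dense(object = model, units = 1L, activation = 'sigmoid')  # sigmoid, not softmax
model <- compile(object = model, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = 'accuracy')
fit(object = model, x = train$x / 100, y = train$y, epochs = 30)
```

The two hidden sigmoid units give the network enough capacity to model both the rise near x = 20 and the fall near x = 80.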


combine redundant row items in r

I have a dataset with the names of many different plant species (column MTmatch), some of which appear repeatedly. Each of these has a column (ReadSum) with an associated sum (as well as many other pieces of information). How do I combine/aggregate all of the redundant plant species and sum the associated ReadSum for each, while leaving the non-redundant rows alone?
I would like to take a dataset like this and either have it transformed so that each sample has the aggregate of the combined rows, or at least gain an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing; I'm not sure how to ask this question.
I have been messing about with dplyr, using group_by() and summarise(), but that seems to summarize across the whole column rather than just the new group.
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Do either of these approaches produce your intended outcome?
Data:
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Create a new column "combined_ReadSum" (2nd col) which is the sum of "ReadSum" for each "SPECIES":
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(combined_ReadSum = sum(ReadSum)) %>%
left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#> SPECIES combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶ MAX ReadSum
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 3.56 ESV_… 0.222 0.0226 0.361 0.221 0.0261 0.401 3.55
#> 2 "Abies " 3.56 ESV_… 0 0 0 0 0 0.00993 0.0121
#> 3 "Acer" 0.203 ESV_… 0 0 0 0 0 0.124 0.203
#> 4 "Alnus" 0.0211 ESV_… 0 0 0.00815 0 0 0.00815 0.0211
#> 5 "Berber… 0.00982 ESV_… 0 0 0 0 0 0.00982 0.00982
#> 6 "Betula… 0.128 ESV_… 0 0 0 0 0 0.0865 0.128
#> 7 "Boykin… 0.978 ESV_… 0.0291 0.0164 0.0151 0.00515 0 0.155 0.860
#> 8 "Boykin… 0.978 ESV_… 0 0.00126 0 0 0 0.0156 0.0682
#> 9 "Boykin… 0.978 ESV_… 0 0 0.00115 0 0 0.00834 0.0503
#> # … with abbreviated variable names ¹​combined_ReadSum, ²​S026401.R1,
#> # ³​S026404.R1, ⁴​S026406.R1, ⁵​S026409.R1, ⁶​S026412.R1
Or, summarise columns by summing the values for each unique species:
library(dplyr)
df %>%
group_by(SPECIES) %>%
summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#> SPECIES S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹ MAX ReadSum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 0.222 0.0226 0.361 0.221 0.0261 0.411 3.56
#> 2 "Acer" 0 0 0 0 0 0.124 0.203
#> 3 "Alnus" 0 0 0.00815 0 0 0.00815 0.0211
#> 4 "Berberis" 0 0 0 0 0 0.00982 0.00982
#> 5 "Betula " 0 0 0 0 0 0.0865 0.128
#> 6 "Boykinia" 0.0291 0.0176 0.0163 0.00515 0 0.179 0.978
#> # … with abbreviated variable name ¹​S026412.R1
Created on 2022-10-28 by the reprex package (v2.0.1)
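For completeness, a base R sketch of the same two operations, using a trimmed version of the `df` from the dput above (only the columns needed here): `ave()` adds the per-species total as a new column without collapsing rows, while `aggregate()` collapses to one row per species.

```r
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
  "ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
  "ESV_000383"), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
  0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
  0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
  "Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")),
  row.names = c(NA, -9L), class = "data.frame")

# keep all rows, add the per-species total as a new column
df$combined_ReadSum <- ave(df$ReadSum, df$SPECIES, FUN = sum)

# or collapse to one row per species
aggregate(ReadSum ~ SPECIES, data = df, FUN = sum)
```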

Extract columns from data frames in a list into a separate list of data frames

I have a list, cj1, with multiple data frames:
dput(head(cj1[1:2]))
list(structure(list(individual = c("a12TTT.pdf", "a15.pdf", "a17.pdf",
"a18.pdf", "a21.pdf", "a2TTT.pdf", "a5.pdf", "B11.pdf", "B12.pdf",
"B13.pdf", "B22.pdf", "B24.pdf", "B4.pdf", "B7.pdf", "B8.pdf",
"cw10-1.pdf", "cw13-1.pdf", "cw15-1TTT.pdf", "cw17-1.pdf", "cw18.pdf",
"cw3.pdf", "cw4.pdf", "cw7_1TTT.pdf"), id = 1:23, Ntot = c(13,
9, 16, 15, 9, 13, 10, 10, 11, 10, 14, 10, 11, 12, 11, 10, 15,
12, 14, 11, 9, 10, 11), N1 = c(5, 5, 10, 11, 7, 9, 5, 5, 6, 8,
8, 8, 9, 8, 7, 1, 0, 6, 3, 4, 2, 4, 2), ND = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), N0 = c(8,
4, 6, 4, 2, 4, 5, 5, 5, 2, 6, 2, 2, 4, 4, 9, 15, 6, 11, 7, 7,
6, 9), score = c(5.06923076923077, 4.96666666666667, 9.925, 10.86,
6.83333333333333, 8.88461538461539, 5, 5, 5.97272727272727, 7.82,
7.95714285714286, 7.82, 8.80909090909091, 7.9, 6.91818181818182,
1.24, 0.3, 6, 3.17142857142857, 4.08181818181818, 2.16666666666667,
4.06, 2.19090909090909), propscore = c(0.389940828402367, 0.551851851851852,
0.6203125, 0.724, 0.759259259259259, 0.683431952662722, 0.5,
0.5, 0.54297520661157, 0.782, 0.568367346938776, 0.782, 0.800826446280992,
0.658333333333333, 0.628925619834711, 0.124, 0.02, 0.5, 0.226530612244898,
0.371074380165289, 0.240740740740741, 0.406, 0.199173553719008
), theta = c(-0.571211122446447, 0.418736780198501, 0.464533662219296,
0.760432013134893, 1.43961032059382, 0.935963883364303, 0.0742361005467161,
0.416783201347136, 0.232586422933618, 1.65345248955369, 0.178947462869717,
1.3980442736112, 1.5300599487058, 0.340087410746963, 0.616985944469495,
-1.73246102772711, -4.06186172096556, -0.347700710331151, -1.21009964741398,
0.239145600406579, -1.88836418690337, -0.276451472526056, -0.611455626388059
), se.theta = c(0.689550115014498, 0.689441554709003, 0.595659709892116,
0.609506508256404, 0.917792293663691, 0.652011367164736, 0.720534163064516,
0.695969555549033, 0.661019531367007, 0.87050969318314, 0.605775647419845,
0.797443937820774, 0.768436114096332, 0.695748274310803, 0.709380679025605,
1.00089414765463, 1.8701468050665, 0.68959824350285, 0.733014089189809,
0.656392513303483, 0.952935324276941, 0.71608982789968, 0.771906532861938
), outfit = c(1.24922700170817, 1.46067763769417, 0.915183304626819,
0.753992664091072, 0.37410361433915, 0.727316037037668, 0.616907868814702,
1.01528298230254, 1.01594232662062, 0.616808170683195, 0.646097057961938,
0.622993494551005, 0.807441271101246, 0.788526018181888, 1.2157399735092,
0.341189086206191, 0.021052091633073, 0.543024513106335, 1.04183076617928,
1.1772656963046, 0.736106160865241, 0.756316095787985, 0.58320701094964
), infit = c(1.4078580948461, 1.42854494963967, 1.09762978932861,
0.893957122448352, 0.64936943769433, 0.899191443180872, 0.724956556509282,
1.14975990693782, 1.08074439712469, 0.978248081241133, 0.755557633771936,
0.823903684368671, 0.911855771375284, 0.954272320131035, 0.926253596526142,
0.634052701587448, 0.0504659822408584, 0.712539957033173, 0.966034039620798,
1.1901663169553, 0.81371119642719, 0.817417869881874, 0.737574872116582
)), row.names = c(NA, -23L), class = "data.frame"), structure(list(
parlabel = c("Ties", "Home"), par = c("delta", "eta"), est = c(-43.5016417611571,
0.337872999554289), se = c(366043197.615422, 0.215169736220537
)), row.names = c(NA, -2L), class = "data.frame"))
Here is how the data frames look:
head(cj1[[1]],2)
individual id Ntot N1 ND N0 score propscore theta se.theta outfit
1 a12TTT.pdf 1 13 5 0 8 5.069231 0.3899408 -0.5712111 0.6895501 1.249227
2 a15.pdf 2 9 5 0 4 4.966667 0.5518519 0.4187368 0.6894416 1.460678
infit
1 1.407858
2 1.428545
I would like to create a separate list, results1, that would contain data frames including columns 1 and 9, named individual and theta.
I tried:
results1 <- sapply(cj1, "[", c("individual","theta"))
Error in [.data.frame(X[[i]], ...) : undefined columns selected
library(dplyr)
results1 <- lapply(cj1, function(x) x %>% select(individual, theta))
Error:
Can't subset columns that don't exist.
x Column individual doesn't exist.
Run rlang::last_error() to see where the error occurred.
I can extract these columns from one data frame:
cj1[[1]][c(1, 9)]
but I could not apply this to the whole list.
You can use the following solution. We use .x to refer to each individual element of your list; here .x is each data frame, from which we would like to select only the two columns c("individual","theta").
However, since only one of your data frames contains those column names, I used the keep function to retain only the elements whose data frames contain them. Just bear in mind that this form of coding, called a purrr-style formula, needs a ~ before .x. The map function is the purrr equivalent of lapply from base R; we use this syntax to apply a function to every individual element (here, the data frames).
library(purrr)
cj1 %>%
map_if(~ all(c("individual","theta") %in% names(.x)),
~ .x %>% select(individual, theta)) %>%
keep(~ all(c("individual","theta") %in% names(.x)))
[[1]]
individual theta
1 a12TTT.pdf -0.5712111
2 a15.pdf 0.4187368
3 a17.pdf 0.4645337
4 a18.pdf 0.7604320
5 a21.pdf 1.4396103
6 a2TTT.pdf 0.9359639
7 a5.pdf 0.0742361
8 B11.pdf 0.4167832
9 B12.pdf 0.2325864
10 B13.pdf 1.6534525
11 B22.pdf 0.1789475
12 B24.pdf 1.3980443
13 B4.pdf 1.5300599
14 B7.pdf 0.3400874
15 B8.pdf 0.6169859
16 cw10-1.pdf -1.7324610
17 cw13-1.pdf -4.0618617
18 cw15-1TTT.pdf -0.3477007
19 cw17-1.pdf -1.2100996
20 cw18.pdf 0.2391456
21 cw3.pdf -1.8883642
22 cw4.pdf -0.2764515
23 cw7_1TTT.pdf -0.6114556
Or we can spare a line of code to be more concise:
cj1 %>%
keep(~ all(c("individual","theta") %in% names(.x))) %>%
map(~ .x %>% select(individual, theta))
[[1]]
individual theta
1 a12TTT.pdf -0.5712111
2 a15.pdf 0.4187368
3 a17.pdf 0.4645337
4 a18.pdf 0.7604320
5 a21.pdf 1.4396103
6 a2TTT.pdf 0.9359639
7 a5.pdf 0.0742361
8 B11.pdf 0.4167832
9 B12.pdf 0.2325864
10 B13.pdf 1.6534525
11 B22.pdf 0.1789475
12 B24.pdf 1.3980443
13 B4.pdf 1.5300599
14 B7.pdf 0.3400874
15 B8.pdf 0.6169859
16 cw10-1.pdf -1.7324610
17 cw13-1.pdf -4.0618617
18 cw15-1TTT.pdf -0.3477007
19 cw17-1.pdf -1.2100996
20 cw18.pdf 0.2391456
21 cw3.pdf -1.8883642
22 cw4.pdf -0.2764515
23 cw7_1TTT.pdf -0.6114556
Here is just another base R solution with a slightly different syntax. Just note that \(x) is equivalent to function(x), a new feature available as of R 4.1.0.
cj1 |>
  lapply(\(x) {
    if (all(c("individual","theta") %in% names(x))) {
      x[c("individual","theta")]
    }
  }) -> cj2
# drop the NULL elements; Filter() also works when nothing is NULL,
# unlike cj2[-which(sapply(cj2, is.null))], which drops everything in that case
cj2 <- Filter(Negate(is.null), cj2)
In base R, you can try this solution with lapply -
cols <- c("individual","theta")
lapply(cj1, function(x) if(all(cols %in% names(x))) x[cols])
#[[1]]
# individual theta
#1 a12TTT.pdf -0.5712
#2 a15.pdf 0.4187
#3 a17.pdf 0.4645
#4 a18.pdf 0.7604
#5 a21.pdf 1.4396
#6 a2TTT.pdf 0.9360
#7 a5.pdf 0.0742
#8 B11.pdf 0.4168
#9 B12.pdf 0.2326
#10 B13.pdf 1.6535
#11 B22.pdf 0.1789
#12 B24.pdf 1.3980
#13 B4.pdf 1.5301
#14 B7.pdf 0.3401
#15 B8.pdf 0.6170
#16 cw10-1.pdf -1.7325
#17 cw13-1.pdf -4.0619
#18 cw15-1TTT.pdf -0.3477
#19 cw17-1.pdf -1.2101
#20 cw18.pdf 0.2391
#21 cw3.pdf -1.8884
#22 cw4.pdf -0.2765
#23 cw7_1TTT.pdf -0.6115
#[[2]]
#NULL
If you want to drop the NULL lists you can add Filter -
Filter(length, lapply(cj1, function(x) if(all(cols %in% names(x))) x[cols]))

problem while changing col names with str_to_title

I have a data set that can be built using the following code:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to change the column names that do not start with a number to proper case. How should I do it? I thought I could use str_to_title, but I have tried many ways and cannot get it to work. Here is the code that I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? There is no error message; the column names just did not get updated. Could anyone help me identify the issue, or maybe show me a better way if you have one?
Here I have a small data set, so I can find the column numbers by hand. If I wanted to auto-correct the column names to proper case, how could I do that?
Thanks.
We can use rename_at. (None of your attempts worked because subsetting with df[, 3:5] creates a temporary copy; renaming that copy, or assigning to its names, never updates df itself.)
library(dplyr)
library(stringr)
df %>%
rename_at(3:5, ~ str_to_title(.))
-output
# Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1 DOCETAXEL 0 4 0 0
#2 BEVACIZUMAB 0 3 0 0
#3 CARBOPLATIN 0 3 0 0
#4 CETUXIMAB 0 3 0 0
#5 DOXORUBICIN 1 3 0 0
#6 IRINOTECAN 0 3 0 0
Or using rename_with
df %>%
rename_with(~ str_to_title(.), 3:5)
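To address the follow-up about auto-detecting the columns, one option (a sketch, assuming "not starting with a number" means the first character is not a digit) is to select the names with a regular expression. In base R, using tools::toTitleCase:

```r
df <- structure(list(Med = c("DOCETAXEL",
  "BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"),
  `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
  3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
  0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
  6L), class = "data.frame")

# title-case every column name that does not start with a digit
idx <- !grepl("^[0-9]", names(df))
names(df)[idx] <- tools::toTitleCase(tolower(names(df)[idx]))
names(df)
```

With dplyr and stringr, the same selection can be written as `df %>% rename_with(str_to_title, .cols = matches("^[^0-9]"))`.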

Weighted random sampling for Monte Carlo simulation in R

I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should return either X2002 or X2003; "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, with no chance of returning X2004; and the only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is transposed relative to what you need: the sample() function takes a probability vector, but your probabilities are stored row-wise, which makes them harder to extract from a data.frame.
To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply() statement. Note the apply() will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.
mc_df <- read.table(
text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
, header = T
,row.names = 1)
From there, can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(apply(mc_df
, 1
, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
,2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485
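For a single Monte Carlo run that returns exactly one column per ID, as the question asks, size = 1 works, and the draws can be expanded back into the indicator layout sketched in the question. A base R sketch, reusing the mc_df from above:

```r
mc_df <- read.table(text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5',
header = TRUE, row.names = 1)

# one draw per ID, weighted by that ID's row of probabilities
draw <- apply(mc_df, 1, function(x) sample(names(x), size = 1, prob = x))

# expand the draws into a one-hot indicator matrix (rows = IDs, cols = years)
ind <- t(sapply(draw, function(year) as.integer(names(mc_df) == year)))
colnames(ind) <- names(mc_df)
ind
```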

How to group by with if statement in R?

I have data containing four variables (id, quantity, weight, date), and I want to build packages of quantity = 6 using only the observations with quantity below 6. For example, if I have 6 products of quantity 1 each, I want a single product of quantity 6 in return, whose weight is the sum of the individual weights,
and I want to do this for every element of the table.
I've tried this code, but it's not working. Can anyone help me, please?
poids <- c()
qte <- c()
dd <- data.frame()
for (i in length(paquet)) {
  if (paquet$RealQuantity[i] + paquet$RealQuantity[i + 1] == 6) {
    poids[i] <- sum(paquet$RealWeigth)
    qte[i] <- sum(paquet$RealQuantity)
    dd <- rbind(dd, data.frame(i = i, poids = poids[i], qte = qte[i]))
  }
}
Here is an example:
A tibble: 232 x 4
ProductID RealQuantity RealWeigth PickingDate
<dbl> <dbl> <dbl> <date>
1 1 5 0.296 2017-12-26
2 1 1 0.064 2018-01-05
3 1 1 0.061 2018-01-05
4 1 5 0.297 2018-01-10
5 1 5 0.298 2018-01-13
6 1 1 0.058 2018-01-16
7 1 3 0.172 2018-01-23
8 1 3 0.172 2018-01-23
The expected output is:
ProductID RealQuantity RealWeigth PickingDate
<dbl> <dbl> <dbl> <date>
1 1 6 0.36 2017-12-26
2 1 6 0.358 2018-01-05
3 1 6 0.356 2018-01-13
4 1 6 0.344 2018-01-23
This is an iterative solution that cannot guarantee everything is assigned a package of 6 items:
df <- structure(list(ProductID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), RealQuantity = c(5, 1, 1, 5, 5, 1, 3, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 3, 3), RealWeigth = c(0.296, 0.064, 0.061, 0.297, 0.298, 0.058, 0.172, 0.172, 0.177, 0.1695, 0.179, 0.18, 0.175, 0.301, 0.181, 0.178, 0.161, 0.178, 0.1775, 0.183), PickingDate = structure(c(17526, 17536, 17536, 17541, 17544, 17547, 17554, 17554, 17554, 17555, 17556, 17556, 17557, 17557, 17557, 17557, 17558, 17561, 17562, 17562), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
# for portion of data.frame with RealQuantity < 6:
df$group <- NA
df <- df[order(df$RealQuantity, decreasing=TRUE),]
gi <- 0
groupsize <- 0
while (TRUE) {
gi <- gi + 1
# find biggest, non-assigned row:
i <- min(which(is.na(df$group)))
if (is.infinite(i)) # nothing found; which would return a zero-length vector. But min(integer(0)) returns Inf.
break
groupsize <- df$RealQuantity[i]
df$group[i] <- gi
# find the next line whose combined quantity might fit within "6":
while (TRUE) {
j <- min(which(is.na(df$group) & df$RealQuantity + groupsize <= 6))
if (is.infinite(j)) # nothing found
break
groupsize <- groupsize + df$RealQuantity[j]
df$group[j] <- gi
if (groupsize >= 6)
break
}
}
library(dplyr)
df %>% group_by(group) %>% summarise_at(vars(RealQuantity, RealWeigth), sum)
Or, with some more detail:
df %>% group_by(group) %>% summarise(combined=paste(RealQuantity, collapse=', '), RealQuantity=sum(RealQuantity), RealWeigth=sum(RealWeigth), Firstdate=min(PickingDate))
