R lpSolve setting constraints (fantasy football related)

I am trying to solve a linear optimization problem.
I have 100 players, each with a score and a price. My goal is to select 11 players and maximize the total score while staying within budget (83 in this example).
The code below solves this task.
library(data.table)
library(lpSolve)
set.seed(1)
# simulating data
data <- data.table(id = 1:100,
price = round(rnorm(100, 9, 2), 1),
score = round(rnorm(100, 150, 25)))
# objective which should be maximized
obj <- data$score
constraints <- matrix(
c(data$price,
rep(1, 100)),
byrow = TRUE, nrow = 2
)
constr_dir <- c("<=", "==")
rhs <- c(83, 11)
result <- lp("max", obj, constraints, constr_dir, rhs, all.bin = TRUE)
# joining result for easier observing
data$result <- result$solution
data[result != 0,]
Here is the question. My team should have a captain whose score counts twice. How do I modify my code to add this condition? (Please pay attention to the all.bin argument of the lp function; maybe it should be changed in the final solution.)
Here is the current result:
id price score coeff
1: 6 7.4 194 1
2: 10 8.4 192 1
3: 13 7.8 186 1
4: 14 4.6 134 1
5: 24 5.0 146 1
6: 35 6.2 158 1
7: 60 8.7 197 1
8: 66 9.4 205 1
9: 71 10.0 208 1
10: 78 9.0 202 1
11: 97 6.4 186 1
What I want to achieve is that 10 coefficients equal 1 and one equals 2 (the result may differ from the one below; this is just an example):
id price score coeff
1: 6 7.4 194 1
2: 10 8.4 192 1
3: 13 7.8 186 1
4: 14 4.6 134 1
5: 24 5.0 146 1
6: 35 6.2 158 1
7: 60 8.7 197 1
8: 66 9.4 205 1
9: 71 10.0 208 2
10: 78 9.0 202 1
11: 97 6.4 186 1

You can solve 100 augmented linear problems. In the i-th problem you multiply the score of the i-th player by 2. At the end you pick the solution with the highest objective value:
constraints <- matrix(c(data$price, rep(1, 100)), byrow = TRUE, nrow = 2)
constr_dir <- c("<=", "==")
rhs <- c(83, 11)
res <- vector("list", nrow(data))
for(i in seq_len(nrow(data))){
cat("iter:", i, "\n")
obj <- data$score
obj[[i]] <- obj[[i]] * 2
res[[i]] <- lp("max", obj, constraints, constr_dir, rhs, all.bin = TRUE)
}
captain_index <- which.max(unlist(lapply(res, function(x) x$objval)))
data[res[[captain_index]]$solution == 1,]
The chosen captain has index captain_index.
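As an alternative to solving 100 separate problems, the captain can be modelled inside a single binary program. The sketch below is not from the answer above; it assumes a second set of binary variables c_i ("player i is captain") next to the selection variables x_i, linked by c_i <= x_i, with exactly one captain. The names budget_row, team_row, captain_row and link_rows are made up for illustration. all.bin = TRUE can stay as it is, because every variable is still binary.

```r
library(lpSolve)

set.seed(1)
n <- 100
price <- round(rnorm(n, 9, 2), 1)
score <- round(rnorm(n, 150, 25))

# Variables: x_1..x_n (player selected) followed by c_1..c_n (player is captain).
# The captain's score is counted once via x_i and a second time via c_i.
obj <- c(score, score)

budget_row  <- c(price, rep(0, n))        # total price of selected players
team_row    <- c(rep(1, n), rep(0, n))    # number of selected players
captain_row <- c(rep(0, n), rep(1, n))    # number of captains
link_rows   <- cbind(-diag(n), diag(n))   # c_i - x_i <= 0: captain must be selected

constraints <- rbind(budget_row, team_row, captain_row, link_rows,
                     deparse.level = 0)
constr_dir  <- c("<=", "==", "==", rep("<=", n))
rhs         <- c(83, 11, 1, rep(0, n))

result <- lp("max", obj, constraints, constr_dir, rhs, all.bin = TRUE)

x_part <- result$solution[1:n]
c_part <- result$solution[(n + 1):(2 * n)]
selected <- x_part > 0.5
captain  <- which(c_part > 0.5)

# coeff is 1 for ordinary team members and 2 for the captain
team <- data.frame(id = which(selected),
                   price = price[selected],
                   score = score[selected],
                   coeff = (x_part + c_part)[selected])
team
```

At an optimum the captain is always a highest scorer of the selected team, since moving the captaincy to a better selected player would increase the objective.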

Related

R: Counting rows in a dataframe in which all values fall within individual ranges

I have a dataframe (A rows x K columns). For each column, I get the 5th and 95th percentile value. I want to know how many rows in the df have all K of their values within these K sets of 5th and 95th percentile values.
Example code below works, removing rows that do not fall within the bounds (1 and 9 here, but in practice will be percentiles), and then counting what remains. But A will be 10K and K will be 40, and I am simulating this dataframe 10K times, so I am wondering if there is code that will run faster.
data <- rbind(c(1,2,3,4,5), c(3,5,7,8,5), c(2,8,9,5,9), c(9,1,1,8,9),
c(3,5,6,7,5))
Lower_Bound <- rbind(1,1,1,1,1)
Upper_Bound <- rbind(9,9,9,9,9)
for (i in c(1:5)) {
data <- data[data[,i] > Lower_Bound[i,],]
data <- data[data[,i] < Upper_Bound[i,],]
}
N <- nrow(data)
If I understand correctly, the OP is only interested in the number of rows which fulfill the condition. So, there is no need to actually remove rows from data that do not fall within the bounds. It is sufficient to count the number of rows which do fall within the bounds.
This answer contains solutions for matrices and for data.frames, and a benchmark which compares OP's approach, apply() with matrices and data.frames, and an approach using purrr's map() and reduce() functions.
apply() with matrices
Let's start with the provided sample data and fixed Lower_Bound and Upper_Bound. Please, note that all three objects are matrices created by rbind(). This is in contrast to the text of the question which refers to a dataframe (A rows x K columns). Anyhow, we will provide solutions for both cases.
apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound))
returns a vector of type logical
[1] FALSE TRUE FALSE FALSE TRUE
The number of rows which fulfill the condition can be derived by
N <- sum(apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound)))
N
[1] 2
because TRUE is coerced to 1L and FALSE to 0L.
The next step is to also compute the bounds for each column as 5th and 95th percentile. For this, we have to create a new sample dataset mat, again as matrix
# create sample data
n_col <- 5
n_row <- 10
set.seed(42) # required for reproducible results
mat <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.3048697 2.693361 4.455450 5.205999
[2,] 0.4353018 4.2866454 1.218692 4.704837 4.638943
[3,] 1.3631284 0.6111393 2.828083 5.035104 5.758163
[4,] 1.6328626 1.7212112 4.214675 3.391074 4.273295
[5,] 1.4042683 1.8666787 4.895193 4.504955 3.631719
[6,] 0.8938755 2.6359504 2.569531 2.282991 5.432818
[7,] 2.5115220 1.7157471 2.742731 3.215541 4.188607
[8,] 0.9053410 -0.6564554 1.236837 3.149092 6.444101
[9,] 3.0184237 -0.4404669 3.460097 1.585792 4.568554
[10,] 0.9372859 3.3201133 2.360005 4.036123 5.655648
For demonstration, each column has a different mean.
# count number of rows
probs <- c(0.05, 0.95)
bounds <- apply(mat, 2, quantile, probs)
idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
N
[1] 5
If required, the subset of mat which fulfills the condition can be derived by
mat[idx, ]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.304870 2.693361 4.455450 5.205999
[2,] 1.6328626 1.721211 4.214675 3.391074 4.273295
[3,] 0.8938755 2.635950 2.569531 2.282991 5.432818
[4,] 2.5115220 1.715747 2.742731 3.215541 4.188607
[5,] 0.9372859 3.320113 2.360005 4.036123 5.655648
The bounds are
bounds
[,1] [,2] [,3] [,4] [,5]
5% 0.641660 -0.5592606 1.226857 1.899532 3.882318
95% 2.790318 3.8517060 4.588960 4.886484 6.135429
apply() with data.frames
In case the dataset is a data.frame we can use the same code, i.e.,
df <- as.data.frame(mat)
probs <- c(0.05, 0.95)
bounds <- apply(df, 2, quantile, probs)
idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
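For comparison, here is a sketch of a variant that avoids the row-wise apply() call altogether. It is not one of the methods in the benchmark below, so no timing claim is made for it. It compares whole columns against their bounds with sweep() and then counts, per row, how many columns are inside; the names inside and idx_vec are made up for this example.

```r
# same sample data as above
set.seed(42)
mat <- sapply(1:5, function(x) rnorm(10, mean = x))
bounds <- apply(mat, 2, quantile, probs = c(0.05, 0.95))

# logical matrix: TRUE where a cell lies strictly between its column's bounds
inside <- sweep(mat, 2, bounds[1, ], ">") & sweep(mat, 2, bounds[2, ], "<")

# a row qualifies if all of its cells are inside
idx_vec <- rowSums(inside) == ncol(mat)
N <- sum(idx_vec)

# cross-check against the apply() version
idx_apply <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
all(idx_vec == idx_apply)
```

If the input is a data.frame, converting it once with as.matrix() before this step should give the same result.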
Benchmark
The OP is looking for code which is faster than OP's own approach because the OP wants to replicate the simulation 10000 times.
So, here is a benchmark which compares
OP1: OP's own approach using matrices
OP2: a slightly modified version of OP1
apply_mat: the apply() function with matrices
apply_df: the apply() function with data.frames
purrr: using map(), pmap(), and reduce() from the purrr package
(Note that the list of methods is not exhaustive)
The benchmark is repeated for varying problem sizes, i.e., 5, 10, and 40 columns as well as 100, 1000, and 10000 rows. The largest problem size corresponds to the size of OP's simulations. As some codes modify the input dataset, all runs start with a fresh copy of the input data.
library(bench)
library(purrr)
library(ggplot2)
bm <- press(
n_col = c(5L, 10L, 40L)
, n_row = 10L^(2:4)
, {
set.seed(42)
mat0 <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
df0 <- as.data.frame(mat0)
mark(
OP1 = {
data <- data.table::copy(mat0)
Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
for (i in seq_len(ncol(data))) {
data <- data[data[, i] > Lower_Bound[i, ], ]
data <- data[data[, i] < Upper_Bound[i, ], ]
}
nrow(data)
},
OP2 = {
data <- data.table::copy(mat0)
Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
for (i in seq_len(ncol(data))) {
data <- data[data[, i] > Lower_Bound[i, ] & data[, i] < Upper_Bound[i, ], ]
}
nrow(data)
},
apply_mat = {
mat <- data.table::copy(mat0)
probs <- c(0.05, 0.95)
bounds <- apply(mat, 2, quantile, probs)
idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
sum(idx)
},
apply_df = {
df <- data.table::copy(df0)
probs <- c(0.05, 0.95)
bounds <- apply(df, 2, quantile, probs)
idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
sum(idx)
},
purrr = {
data.table::copy(df0) %>%
map2(map_dfc(., quantile, probs), ~ (.x > .y[1L] & .x < .y[2L])) %>%
pmap(all) %>%
reduce(`+`)
}
)
}
)
autoplot(bm)
Note the logarithmic time scale
print(bm[, 1:11], n = Inf)
# A tibble: 45 x 11
expression n_col n_row min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <int> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 OP1 5 100 1.46ms 1.93ms 493. 88.44KB 0 248 0 503ms
2 OP2 5 100 1.34ms 1.78ms 534. 71.56KB 0 267 0 500ms
3 apply_mat 5 100 1.16ms 1.42ms 621. 26.66KB 2.17 286 1 461ms
4 apply_df 5 100 1.41ms 1.8ms 526. 34.75KB 0 263 0 500ms
5 purrr 5 100 2.34ms 2.6ms 374. 17.86KB 0 187 0 500ms
6 OP1 10 100 2.42ms 2.78ms 344. 205.03KB 0 172 0 500ms
7 OP2 10 100 2.37ms 2.71ms 354. 153.38KB 2.07 171 1 484ms
8 apply_mat 10 100 1.76ms 2.12ms 457. 51.64KB 0 229 0 501ms
9 apply_df 10 100 2.31ms 2.63ms 367. 67.78KB 0 184 0 501ms
10 purrr 10 100 3.44ms 4.1ms 222. 34.89KB 2.09 106 1 477ms
11 OP1 40 100 9.4ms 10.57ms 92.9 955.41KB 0 47 0 506ms
12 OP2 40 100 9.18ms 10.08ms 96.8 638.92KB 0 49 0 506ms
13 apply_mat 40 100 5.44ms 6.46ms 146. 429.95KB 2.12 69 1 472ms
14 apply_df 40 100 6.12ms 6.75ms 141. 608.66KB 0 71 0 503ms
15 purrr 40 100 10.43ms 11.8ms 84.9 149.53KB 0 43 0 507ms
16 OP1 5 1000 1.75ms 1.94ms 478. 837.55KB 2.10 228 1 477ms
17 OP2 5 1000 1.69ms 1.94ms 487. 674.36KB 0 244 0 501ms
18 apply_mat 5 1000 4.84ms 5.62ms 176. 255.17KB 0 89 0 506ms
19 apply_df 5 1000 6.37ms 7.66ms 122. 333.58KB 0 62 0 506ms
20 purrr 5 1000 9.86ms 11.22ms 87.7 165.52KB 2.14 41 1 467ms
21 OP1 10 1000 3.35ms 3.91ms 253. 1.89MB 0 127 0 503ms
22 OP2 10 1000 3.33ms 3.72ms 256. 1.41MB 2.06 124 1 484ms
23 apply_mat 10 1000 5.86ms 6.93ms 142. 491.09KB 0 72 0 508ms
24 apply_df 10 1000 7.74ms 10.08ms 99.2 647.86KB 0 50 0 504ms
25 purrr 10 1000 14.55ms 15.44ms 62.5 323.17KB 2.23 28 1 448ms
26 OP1 40 1000 13.8ms 16.28ms 58.8 8.68MB 2.18 27 1 459ms
27 OP2 40 1000 13.29ms 14.72ms 67.9 5.84MB 0 34 0 501ms
28 apply_mat 40 1000 12.17ms 13.85ms 68.5 4.1MB 2.14 32 1 467ms
29 apply_df 40 1000 14.61ms 15.86ms 62.9 5.78MB 0 32 0 509ms
30 purrr 40 1000 41.85ms 43.66ms 22.7 1.25MB 0 12 0 529ms
31 OP1 5 10000 5.57ms 6.55ms 147. 8.15MB 2.07 71 1 482ms
32 OP2 5 10000 5.38ms 6.27ms 157. 6.55MB 2.06 76 1 485ms
33 apply_mat 5 10000 43.98ms 46.9ms 20.7 2.48MB 0 11 0 532ms
34 apply_df 5 10000 53.59ms 56.53ms 17.8 3.24MB 3.57 5 1 280ms
35 purrr 5 10000 86.32ms 88.83ms 11.1 1.6MB 0 6 0 540ms
36 OP1 10 10000 12.03ms 13.63ms 72.3 18.97MB 2.07 35 1 484ms
37 OP2 10 10000 11.66ms 12.97ms 76.5 14.07MB 4.25 36 2 471ms
38 apply_mat 10 10000 50.31ms 51.77ms 18.5 4.77MB 0 10 0 541ms
39 apply_df 10 10000 62.09ms 65.17ms 15.1 6.3MB 0 8 0 528ms
40 purrr 10 10000 125.82ms 128.3ms 7.35 3.13MB 2.45 3 1 408ms
41 OP1 40 10000 53.38ms 56.34ms 16.2 87.79MB 5.41 6 2 369ms
42 OP2 40 10000 46.24ms 47.43ms 20.3 58.82MB 2.25 9 1 444ms
43 apply_mat 40 10000 78.25ms 83.79ms 11.4 40.94MB 2.85 4 1 351ms
44 apply_df 40 10000 95.66ms 97.02ms 10.3 57.58MB 2.06 5 1 486ms
45 purrr 40 10000 361.26ms 373.23ms 2.68 12.31MB 0 2 0 746ms
Conclusions
To my surprise, OP's approach performs quite well despite the repeated copy operations. In fact, for OP's problem size of 10000 rows and 40 columns, the modified version OP2 is nearly two times faster than apply_mat.
A possible explanation (which needs to be verified, though) is that OP's approach is kind of recursive: the number of rows to be checked shrinks while iterating over the columns.
Interestingly, the purrr variant has the lowest memory requirements.
Taking the median run time of about 50 ms for the OP2 method from this benchmark, 10000 repetitions of the simulation (10000 × 50 ms = 500 s) may take less than 10 minutes.

How to resample and remodel n times by vectorization?

Here's my for-loop version of resampling and refitting the model:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B+1, 2)
estMat[B+1,] <- model$coef
for (i in 1:B) {
resample <- butterfly[sample(1:n, n, replace = TRUE),]
re.model <- lm(Hk ~ inv.alt, resample)
estMat[i,] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model,coef)
estMat <- cbind(re.model.coef, model$coef)
It worked but didn't improve efficiency. Is there a way to vectorize this?
Sorry, I am not quite familiar with StackOverflow. Here's the dataset butterfly.
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You get the error because resample is not a list of resampled data.frames, which you can obtain with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are re.model[[1]]$coef, re.model[[2]]$coef, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine it with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
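None of the versions above is truly vectorized; replicate() and lapply() are loops in disguise, which is why they are not faster than the for loop. If speed matters, one option is to skip the formula and model-frame handling of lm() and call lm.fit() on a prebuilt design matrix. The following is only a sketch under that assumption: the data are typed in from the table in the question (only alt and Hk are needed), the seed 123 is arbitrary, and X, y and boot_coef are names made up here.

```r
butterfly <- data.frame(
  alt = c(0.5, 0.8, 0.57, 0.55, 0.38, 0.93, 0.48, 0.63, 1.5,
          1.75, 2, 4.2, 2.5, 2, 6.5, 7.85, 8.95, 10.5),
  Hk  = c(98, 36, 72, 67, 82, 72, 65, 1, 40,
          39, 9, 19, 42, 37, 16, 4, 1, 4)
)
butterfly$inv.alt <- 1 / butterfly$alt

model <- lm(Hk ~ inv.alt, data = butterfly)

# design matrix (intercept + inv.alt) and response, built once
X <- cbind(1, butterfly$inv.alt)
y <- butterfly$Hk
n <- nrow(butterfly)
B <- 999

set.seed(123)
boot_coef <- replicate(B, {
  idx <- sample.int(n, replace = TRUE)          # resample row indices
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
})

# one row per resample, observed coefficients in the last row
estMat <- rbind(t(boot_coef), coef(model))
dim(estMat)
```

Note that this estMat has B + 1 rows and 2 columns, like the matrix in the original for loop, whereas cbind(re.model.coef, model$coef) yields the transposed layout.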

Multiple different conditions and if statements within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other lines which essentially do the same thing. Notice that the conditions change throughout this loop and are of two types: one of the type i %in% 45:60 == T and the other of the type i == 99.
My original code has many such ifs within this loop, so any help on how I can write it more efficiently or compactly will be appreciated.
The user has requested to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to given rules he has hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
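For readers without data.table, the same "start of each tile" idea can be sketched in base R with cut(). This is an illustration rather than part of the answer above: the codes are typed in from the sample output, and the closing break of 100 is an assumption so that code 99 gets its own interval.

```r
# start code of each sector, as in the lookup table above
starts  <- c(1, 4, 21, 45, 61:75, 98, 99)
sectors <- LETTERS[1:21]

# sample codes copied from the output above
H07_NACE <- data.frame(NACE2.Code = c(27, 37, 57, 90, 20, 89, 94, 66, 63, 7))

# right = FALSE makes the intervals [start, next start), matching roll = TRUE
H07_NACE$NACE2.Sector <- as.character(
  cut(H07_NACE$NACE2.Code, breaks = c(starts, 100),
      labels = sectors, right = FALSE)
)
H07_NACE
```

Codes outside 1 to 99 would come back as NA, which can be a useful check on the input.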
Here is a quick and dirty solution that should do the job (I'm sure there is a more efficient/elegant way to do this). We can set up a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to guarantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of @Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here are two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na

Combining factor levels in R 3.2.1

In previous versions of R I could combine factor levels that didn't have a "significant" threshold of volume using the following little function:
whittle = function(data, cutoff_val){
#convert to a data frame
tab = as.data.frame.table(table(data))
#returns vector of indices where value is below cutoff_val
idx = which(tab$Freq < cutoff_val)
levels(data)[idx] = "Other"
return(data)
}
This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:
> sort(table(data$State))
05 27 35 40 54 84 9 AP AU BE BI DI G GP GU GZ HN HR JA JM KE KU L LD LI MH NA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
OU P PL RM SR TB TP TW U VD VI VS WS X ZH 47 BL BS DL M MB NB RP TU 11 DU KA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
BW ND NS WY AK SD 13 QC 01 BC MT AB HE ID J NO LN NM ON NE VT UT IA MS AO AR ME
4 4 4 4 5 5 6 6 7 7 7 8 8 8 9 10 11 17 23 26 26 30 31 31 38 40 44
OR KS HI NV WI OK KY IN WV AL CO WA MN NH MO SC LA TN AZ IL NC MI GA OH ** CT DE
45 47 48 57 57 64 106 108 112 113 120 125 131 131 135 138 198 200 233 492 511 579 645 646 840 873 1432
RI DC TX MA FL VA MD CA NJ PA NY
1782 2513 6992 7027 10527 11016 11836 12221 15485 16359 34045
Now when I use whittle it returns me the following message:
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other", :
duplicated levels in factors are deprecated
How can I modify my function so that it has the same effect but doesn't use these "deprecated" factor levels? Converting to a character, tabling, and then converting to the character "Other"?
I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table and using replace to do the replacement of the low-frequency levels:
whittle <- function(data, cutoff_val) {
tab = as.data.frame.table(table(data))
factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
Testing on some sample data:
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD MD MD Other Other
# Levels: MD Other
I think this version should work. The levels<- function allows you to collapse levels by assigning a list (see ?levels).
whittle <- function(data, cutoff_val){
tab <- table(data)
shouldmerge <- tab < cutoff_val
tokeep <- names(tab)[!shouldmerge]
tomerge <- names(tab)[shouldmerge]
nv <- c(as.list(setNames(tokeep,tokeep)), list("Other"=tomerge))
levels(data)<-nv
return(data)
}
And we test it with
set.seed(15)
x<-factor(c(sample(letters[1:10], 100, replace=T), sample(letters[11:13], 10, replace=T)))
table(x)
# x
# a b c d e f g h i j k l m
# 5 11 8 8 7 5 13 14 14 15 2 3 5
y <- whittle(x, 9)
table(y)
# y
# b g h i j Other
# 11 13 14 14 15 43
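To make the list form of levels<- more concrete, here is a minimal sketch on a tiny factor. The data are made up for illustration. Each list element is named after the new level and holds the old levels that should be mapped to it.

```r
state <- factor(c("MD", "MD", "MD", "VA", "TX"))

# new level name = old level(s) to be merged into it
levels(state) <- list(MD = "MD", Other = c("VA", "TX"))

state
table(state)
```

The order of the list elements also determines the order of the resulting levels, which is why whittle() above lists the kept levels first and "Other" last.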
It's worth adding to this answer that the new forcats package contains the fct_lump() function which is dedicated to this.
Using @MrFlick's data:
x <- factor(c(sample(letters[1:10], 100, replace=T),
sample(letters[11:13], 10, replace=T)))
library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table
# b g h i j Other
#11 13 14 14 15 43
The n argument specifies the number of most common values to preserve.
Here's another way of doing it by replacing all the items below the threshold with the first and then renaming that level to Other.
whittle <- function(x, thresh) {
belowThresh <- names(which(table(x) < thresh))
x[x %in% belowThresh] <- belowThresh[1]
levels(x)[levels(x) == belowThresh[1]] <- "Other"
factor(x)
}

Normalizing the values in a data table using the values stored in another data table

I am trying to normalize the values in a data table (dt) using the baseline values stored in another data table (dt.base). Below are sample contents of these tables and the code to generate the example:
> dt
Bench Config Part Power
1: A 10 P 171
2: A 10 Q 125
3: A 100 P 139
4: A 100 Q 109
5: B 10 P 196
6: B 10 Q 101
7: B 100 P 157
8: B 100 Q 176
> dt.base
Bench Config Part Power
1: A Base P 187
2: A Base Q 104
3: B Base P 166
4: B Base Q 188
Example generation code:
set.seed(13)
dt <- data.table(
Bench = c(rep('A', 4), rep('B', 4)),
Config = rep(c(10, 10, 100, 100), 2),
Part = rep(c('P', 'Q'), 4),
Power = round(runif(8, 100, 200)))
dt.base <- data.table(
Bench = c(rep('A', 2), rep('B', 2)),
Config = c('Base', 'Base', 'Base', 'Base'),
Part = rep(c('P', 'Q'), 2),
Power = round(runif(4, 100, 200)))
The idea would be to divide all the values in dt by their corresponding values in dt.base. Therefore, the table would become:
Bench Config Part Power
1: A 10 P 171 / 187
2: A 10 Q 125 / 104
3: A 100 P 139 / 187
4: A 100 Q 109 / 104
5: B 10 P 196 / 166
6: B 10 Q 101 / 188
7: B 100 P 157 / 166
8: B 100 Q 176 / 188
I thought the solution for this was quite straightforward, but I am running into some issues. This is my current attempt:
normalize.power <- function(pwr, base.pwr) {
pwr / base.pwr
}
dt.norm <- dt[,
Power <- normalize.power(
.SD, dt.base[Bench == Bench & Config == 'Base' & Part == Part,
'Power', with = F]
), by = list(Bench, Config, Part)]
The problem is that normalize.power is not receiving a single value in its second parameter (base.pwr), but rather a vector containing all the power values in dt.base. However, when I directly execute from the command line
dt.base[Bench == 'A' & Config == 'Base' & Part == 'P', 'Power', with = F]
then I obtain a single power value, as expected.
I would appreciate any help that solves my problem or leads me to the solution.
You can try something like this
setkey(dt, Bench, Part)
setkey(dt.base, Bench, Part)
dt[dt.base, Power := Power / i.Power]
dt
## Bench Config Part Power
## 1: A 10 P 0.91444
## 2: A 100 P 0.74332
## 3: A 10 Q 1.20192
## 4: A 100 Q 1.04808
## 5: B 10 P 1.18072
## 6: B 100 P 0.94578
## 7: B 10 Q 0.53723
## 8: B 100 Q 0.93617
Thanks @Arun for the useful i.Power syntax
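The same update join can also be written without setting keys, by naming the join columns in on =. This keeps the original row order of dt. The sketch below types in the values from the printed tables in the question rather than regenerating them, and works on a copy (dt.norm, the name used in the question's attempt) so that dt itself stays unchanged.

```r
library(data.table)

dt <- data.table(
  Bench  = rep(c("A", "B"), each = 4),
  Config = rep(c(10, 10, 100, 100), 2),
  Part   = rep(c("P", "Q"), 4),
  Power  = c(171, 125, 139, 109, 196, 101, 157, 176))
dt.base <- data.table(
  Bench  = rep(c("A", "B"), each = 2),
  Config = "Base",
  Part   = rep(c("P", "Q"), 2),
  Power  = c(187, 104, 166, 188))

dt.norm <- copy(dt)
# for every row of dt.base, update the matching rows of dt.norm;
# Power is the column of dt.norm, i.Power the one from dt.base
dt.norm[dt.base, on = .(Bench, Part), Power := Power / i.Power]
dt.norm
```

Inside the join, the i. prefix is what distinguishes the baseline column from the column of the same name being updated.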
