Function that checks multiple combinations of operations between columns of a dataframe


Here is a sample of the original dataframe (data provided at the end)
> DATA
N_b N_l A x.sqr_sum e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8
1 7 6 -36 4032 33.8 21.8 9.8 -2.2 -14.2 -26.2 0 0
2 7 4 -27 2268 23.6 11.6 -0.4 -12.4 0.0 0.0 0 0
3 7 4 -27 2268 23.6 11.6 -0.4 -12.4 0.0 0.0 0 0
4 7 6 -36 4032 33.8 21.8 9.8 -2.2 -14.2 -26.2 0 0
5 7 6 -36 4032 33.8 21.8 9.8 -2.2 -14.2 -26.2 0 0
6 7 6 -36 4032 33.8 21.8 9.8 -2.2 -14.2 -26.2 0 0
7 7 8 -45 6300 44.0 32.0 20.0 8.0 -4.0 -16.0 -28 -40
8 7 8 -45 6300 44.0 32.0 20.0 8.0 -4.0 -16.0 -28 -40
9 7 8 -45 6300 44.0 32.0 20.0 8.0 -4.0 -16.0 -28 -40
I want to write a function to calculate R from the equation
I wrote the code below to calculate R and the N_l responsible for the maximum R.
R <- function(x){
N_b <- x[1]
N_l <- x[2]
N_l_seq <- seq(N_l)
A <- x[3]
x.sqr_sum <- x[4]
e <- x[5:12]
m <- Multi.Presence$m[N_l_seq]
f <- m * (N_l_seq/N_b + A * cumsum(e) / x.sqr_sum)
c(val = max(f), pos = which.max(f))
}
DATA <- cbind(DATA, vars = t(apply(DATA, 1, R)))
In the function above, R is calculated for all possible values of N_l by defining N_l_seq <- seq(N_l). The problem is I don't want to just multiply by cumsum(e) as written in the function. I want to modify it so that it would calculate R for all possible combinations for the same number of e_1, e_2, e_3,... as the current value of N_l.
Example
If N_l = 3, the equation for f is calculated for the cumsum of all possible combinations of 3 of the e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, such as cumsum(e_1, e_8, e_6) and cumsum(e_7, e_2, e_4). When N_l = 5, the equation for f is calculated for the cumsum of all possible combinations of 5 of the e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, and so on.
PROBLEM
I am not sure how to update the f equation so instead of the cumsum() of all the possible e values, it calculates the cumsum() of all combinations of a number equal to the current N_l of the e values.
DATA
> dput(DATA)
structure(list(N_b = c(7, 7, 7, 7, 7, 7, 7, 7, 7), N_l = c(6,
4, 4, 6, 6, 6, 8, 8, 8), A = c(-36, -27, -27, -36, -36, -36,
-45, -45, -45), x.sqr_sum = c(4032, 2268, 2268, 4032, 4032, 4032,
6300, 6300, 6300), e_1 = c(33.8, 23.6, 23.6, 33.8, 33.8, 33.8,
44, 44, 44), e_2 = c(21.8, 11.6, 11.6, 21.8, 21.8, 21.8, 32,
32, 32), e_3 = c(9.8, -0.399, -0.399, 9.8, 9.8, 9.8, 20, 20, 20),
e_4 = c(-2.2, -12.4, -12.4, -2.2, -2.2, -2.2, 8, 8, 8), e_5 =
c(-14.2, 0, 0, -14.2, -14.2, -14.2, -4, -4, -4), e_6 = c(-26.2,
0, 0, -26.2, -26.2, -26.2, -16, -16, -16), e_7 = c(0, 0, 0, 0,
0, 0, -28, -28, -28), e_8 = c(0, 0, 0, 0, 0, 0, -40, -40, -40),
S = c(12, 9, 9, 12, 12, 12, 15, 15, 15)), row.names = c(1L, 3L,
4L, 115L, 116L, 117L, 199L, 200L, 201L), class = "data.frame")
A dependent variable m is defined in the dataframe below:
> dput(Multi.Presence)
structure(list(N_l = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), m = c(1.2,
1, 0.85, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65)), row.names = c(NA,
-10L), class = "data.frame")

I am not sure if this is what you want. I guess you should use combn rather than cumsum in your function R.
As the first step, I merged Multi.Presence to DATA so you can read the corresponding value of m with respect to N_l
df <- merge(DATA, Multi.Presence, by = "N_l")
Then, I rewrote function R such that it accepts the rows of df as the argument
R <- function(x){
  N_l <- x["N_l"]
  N_b <- x["N_b"]
  A <- x["A"]  # the column is named A in DATA
  x.sqr_sum <- x["x.sqr_sum"]
  e <- x[grepl("e_\\d",names(x))]  # all e_1 ... e_8 columns
  m <- x["m"]
  # one value of f per combination of N_l of the e values
  f <- m * (N_l/N_b + A * combn(e,N_l,sum) / x.sqr_sum)
  c(val = max(f), pos = which.max(f))
}
Finally, you can execute function R within apply by rows, e.g.,
> apply(df,1,R)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
val 0.4704685 0.4704685 0.7475 0.7475 0.7475 0.7475 0.6685714 0.6685714
pos 56.0000000 56.0000000 28.0000 28.0000 28.0000 28.0000 1.0000000 1.0000000
[,9]
val 0.6685714
pos 1.0000000
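To make the combn step concrete, here is a minimal sketch for a single row with N_l = 6, using the values from the sample DATA and m = 0.65 from Multi.Presence. The variable name sums is only illustrative. Because A is negative in this row, the largest f comes from the combination with the smallest sum.

```r
# One row of DATA with N_l = 6 (values copied from the question)
e <- c(e_1 = 33.8, e_2 = 21.8, e_3 = 9.8, e_4 = -2.2,
       e_5 = -14.2, e_6 = -26.2, e_7 = 0, e_8 = 0)
N_b <- 7; N_l <- 6; A <- -36; x.sqr_sum <- 4032
m <- 0.65  # Multi.Presence$m for N_l = 6

# Sum of every combination of N_l of the e values: choose(8, 6) = 28 sums
sums <- combn(e, N_l, sum)
length(sums)

# One f per combination, then the maximum and its position
f <- m * (N_l / N_b + A * sums / x.sqr_sum)
c(val = max(f), pos = which.max(f))
```

This should agree with the 0.7475 / 28 columns in the output above.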
Update
I have no clue how you want to deal with the combn, but below is an update
R <- function(x){
# browser()
N_l <- x["N_l"]
N_b <- x["N_b"]
N_l_seq <- seq(N_l)
A <- x["A"]
x.sqr_sum <- x["x.sqr_sum"]
e <- x[grepl("e_\\d",names(x))]
m <- Multi.Presence$m[N_l_seq]
f <- m * sapply(N_l_seq,function(k) N_l/N_b + A * max(combn(e,k,sum)) / x.sqr_sum)
c(val = max(f), pos = which.max(f))
}

Related

Mix "color_bar" and "style" in formattable package

I'm using formattable package and I want to personalize my table but I can't in the way I want.
Here is my table
structure(list(PJ = c(4, 4, 4, 4, 4, 4), V = c(4, 2, 2, 2, 1,
1), E = c(0, 0, 0, 0, 0, 0), D = c(0, 2, 2, 2, 3, 3), GF = c(182,
91, 92, 185, 126, 119), GC = c(84, 143, 144, 115, 141, 168),
Dif = c(98, -52, -52, 70, -15, -49), Pts = c(12, 6, 6, 6,
3, 3)), class = "data.frame", row.names = c("Player1", "Player2",
"Player3", "Player4", "Player5", "Player6"))
It looks like this:
PJ V E D GF GC Dif Pts
Player1 4 4 0 0 182 84 98 12
Player2 4 2 0 2 91 143 -52 6
Player3 4 2 0 2 92 144 -52 6
Player4 4 2 0 2 185 115 70 6
Player5 4 1 0 3 126 141 -15 3
Player6 4 1 0 3 119 168 -49 3
If I want the column GF in bold, I use
formattable(TAB.df, list(
GF = formatter("span",style = style("font.weight"="bold"))
))
If I want a color_bar I run this code:
formattable(TAB.df, list(
GF = color_bar("lightgreen")
))
Nevertheless, I don't know how to combine them and get the "color_bar" with "bold" numbers.

Add a condition within the function with apply

I want to write a conditional function to apply to my dataframe.
The existing function is:
Proposed <- function(N_b,N_l,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,K_g,a,b,c,d) {
e <- data.frame(e_1,e_2,e_3,e_4,e_5,N_l)
ee <- e[e != 0]
CSi <- m * ((N_l/N_b) * ((a*K_g)^b) +
((A * apply(ee,1,function(v) max(combn(v[1:ncol(ee)],v["N_l"],sum))))
/ x.sqr) * ((c*K_g)^d))
return(CSi)
}
The condition I want to add has two branches:
if A > 0, select max(combn(v[1:5],v["N_l"],sum)); if A < 0, select min(combn(v[1:5],v["N_l"],sum)).
Data
A sample dataframe is provided (just a couple of rows)
dput(DATA)
structure(list(N_l = c(2, 4, 2, 4, 1, 2, 1, 3), N_b = c(5, 5,
5, 5, 5, 5, 5, 5), m = c(1, 0.65, 1, 0.65, 1.2, 1, 1.2, 0.85),
A = c(-12, -12, -15, -15, -9, -9, -9, -9), x.sqr = c(1440,
1440, 2250, 2250, 810, 810, 810, 810), e_1 = c(21.8, 21.8,
29, 29, 14.6, 14.6, 14.6, 14.6), e_2 = c(9.8, 9.8, 17, 17,
2.6, 2.6, 2.6, 2.6), e_3 = c(-2.2, -2.2, 5, 5, -9.4, -9.4,
-9.4, -9.4), e_4 = c(-14.2, -14.2, -7, -7, 0, 0, 0, 0), e_5 = c(0,
0, -19, -19, 0, 0, 0, 0), K_g = c(6340598.65753794, 6340598.65753794,
6429472.98493414, 6429472.98493414, 6296482.86883766, 6296482.86883766,
8140521.8248051, 8140521.8248051)), row.names = c(20L, 40L,
60L, 80L, 100L, 120L, 140L, 160L), class = "data.frame")

max(combn(v, n, sum)) is the same as sum(tail(sort(v), n)), no combn necessary. For example,
v <- 1:5
n <- 3
combn(v, n, sum)
# [1] 6 7 8 8 9 10 9 10 11 12
sum(tail(sort(v), n))
# [1] 12
Similarly, min(combn(v, n, sum)) is just the sum of the n smallest values, sum(head(sort(v), n)). With this, your combn calls can be simplified.
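A quick check of both shortcuts, as a sketch on the e values from the first row of the sample data (the names max_sum and min_sum are only for illustration):

```r
v <- c(21.8, 9.8, -2.2, -14.2, 0)  # e_1 ... e_5 from the first row
n <- 2

max_sum <- sum(tail(sort(v), n))  # sum of the n largest values
min_sum <- sum(head(sort(v), n))  # sum of the n smallest values

# Both agree with the brute-force combn results
stopifnot(isTRUE(all.equal(max_sum, max(combn(v, n, sum)))))
stopifnot(isTRUE(all.equal(min_sum, min(combn(v, n, sum)))))
```

Sorting is O(k log k) in the number of e columns, while combn enumerates choose(k, n) combinations, so the shortcut matters more as the number of columns grows.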
A simple function:
func2 <- function(x) { # A=x[1], N_l=x[2], e_*=x[-(1:2)]
decr <- (x[1] > 0)
sum(head(sort(x[-(1:2)], decreasing = decr), x[2]))
}
For the sake of "interesting data", I'll negate A in every other row:
dat$A
# [1] -12 -12 -15 -15 -9 -9 -9 -9
dat$A[c(2,4,6,8)] <- (-dat$A[c(2,4,6,8)])
dat$A
# [1] -12 12 -15 15 -9 9 -9 9
The use of it:
apply(dat[,c("A", "N_l", grep("^e_", colnames(dat), value = TRUE))],
1, func2)
# 20 40 60 80 100 120 140 160
# -16.4 29.4 -26.0 44.0 -9.4 17.2 -9.4 17.2
(Assign this to a new variable in your frame if you'd like.)
Decomposition: let's debug it and run through the first row:
debug(func2)
# debugging in: FUN(newX[, i], ...)
# debug at #1: {
# decr <- (x[1] > 0)
# sum(head(sort(x[-(1:2)], decreasing = decr), x[2]))
# }
# Browse[2]>
x
# A N_l e_1 e_2 e_3 e_4 e_5
# -12.0 2.0 21.8 9.8 -2.2 -14.2 0.0
Our x variable has everything, so we'll need to reference them individually. We could use x["A"] instead of x[1], which would be clearer to read. Over to you.
# Browse[2]>
decr <- (x[1] > 0)
# Browse[2]>
decr
# A
# FALSE
# Browse[2]>
sort(x[-(1:2)], decreasing = decr)
# e_4 e_3 e_5 e_2 e_1
# -14.2 -2.2 0.0 9.8 21.8
Okay, so in this row, A is less than 0, so we want the N_l (2) lowest values, so when we sort them, the lowest are first.
# Browse[2]>
head(sort(x[-(1:2)], decreasing = decr), x[2])
# e_4 e_3
# -14.2 -2.2
# Browse[2]>
sum(head(sort(x[-(1:2)], decreasing = decr), x[2]))
# [1] -16.4
We'll repeat for the second row, where I changed A to be positive (and now N_l is 4):
# debugging in: FUN(newX[, i], ...)
# debug at #1: {
# decr <- (x[1] > 0)
# sum(head(sort(x[-(1:2)], decreasing = decr), x[2]))
# }
x
# A N_l e_1 e_2 e_3 e_4 e_5
# 12.0 4.0 21.8 9.8 -2.2 -14.2 0.0
decr <- (x[1] > 0)
decr
# A
# TRUE
sort(x[-(1:2)], decreasing = decr)
# e_1 e_2 e_5 e_3 e_4
# 21.8 9.8 0.0 -2.2 -14.2
head(sort(x[-(1:2)], decreasing = decr), x[2])
# e_1 e_2 e_5 e_3
# 21.8 9.8 0.0 -2.2
sum(head(sort(x[-(1:2)], decreasing = decr), x[2]))
# [1] 29.4
The same steps repeat for the remaining rows. (If you're doing this on the console, either reload/redefine the function, or call undebug(func2).)
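As mentioned in the walkthrough, the positional indices can be swapped for names. The following is a minimal sketch of that variant; func2_named and the e_sum column are hypothetical names, and the two rows are the first two rows used in the walkthrough (with A negated in the second).

```r
# Two rows taken from the walkthrough above
dat <- data.frame(N_l = c(2, 4), A = c(-12, 12),
                  e_1 = c(21.8, 21.8), e_2 = c(9.8, 9.8),
                  e_3 = c(-2.2, -2.2), e_4 = c(-14.2, -14.2),
                  e_5 = c(0, 0))

# Same logic as func2, but columns are looked up by name
func2_named <- function(x) {
  e <- x[grepl("^e_", names(x))]
  # A > 0: take the N_l largest values; otherwise the N_l smallest
  sum(head(sort(e, decreasing = x[["A"]] > 0), x[["N_l"]]))
}

dat$e_sum <- unname(apply(dat, 1, func2_named))
dat$e_sum
```

With names, the column order of the data frame no longer matters, so the whole frame can be passed to apply as long as every column is numeric. Note that A == 0 falls into the "smallest values" branch here, as it does in func2.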

How to iteratively change variable value until all predicted probabilities are above .5

I am trying to write code that subtracts a given value from a variable until each row has a predicted probability at or above .05.
train <- data.frame('cost'= c(120, 3, 2, 4, 10, 110, 200, 43, 1, 51, 22, 14),
'price' = c(120, 20, 10, 4, 3, 4, 30, 43, 56, 88, 75, 44),
'dich' = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0))
train$dich <- as.factor(train$dich)
test <- data.frame('cost'= c(13, 5, 32, 22, 14, 145, 54, 134, 11, 14, 33, 21),
'price' = c(32, 11, 210, 6, 3, 7, 22, 423, 19, 99, 192, 32)
)
model <- glm(dich ~ cost + price,
data = train,
family = "binomial")
pred <- predict(model, test, type = "response")
1 2 3 4
3.001821e-01 4.442316e-01 4.507495e-04 6.310900e-01
5 6 7 8
5.995459e-01 9.888085e-01 7.114101e-01 1.606681e-06
9 10 11 12
4.096450e-01 2.590474e-02 9.908167e-04 3.572890e-01
So in the above output the cases 4, 5, 6, and 7 would remain the same because they are already above .05 but for the rest of the cases I would like to subtract 1 from the price column and then run the prediction again and repeat until all cases have a probability at or above .05.

If you want to subtract 1 for each row (or "customer") individually, rather than 1 across the board:
test$pred_prob <- NA
for (n in 1:nrow(test)) {
print("-----------------------------")
print(n)
while (TRUE) {
pred <- predict(model, test[n,], type = "response")
print(pred)
test$pred_prob[n] <- pred
if (sum(pred > 0.05) == length(pred)) {
print(test$price[n])
break
}
test$price[n] <- test$price[n] - 1
}
print(test)
}
# cost price pred_prob
# 1 13 32 0.30018209
# 2 5 11 0.44423163
# 3 32 96 0.05128337
# 4 22 6 0.63109001
# 5 14 3 0.59954586
# 6 145 7 0.98880854
# 7 54 22 0.71141007
# 8 134 175 0.05074762
# 9 11 19 0.40964501
# 10 14 82 0.05149897
# 11 33 97 0.05081947
# 12 21 32 0.35728897
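Since each row's prediction depends only on that row, the same per-row result can also be sketched without the inner while loop: predict for all rows at once and lower the price only where the probability is still too low. The names threshold, max_iter and below are assumptions for this sketch, and max_iter is just a safety cap in case the probabilities never reach the threshold.

```r
train <- data.frame(cost  = c(120, 3, 2, 4, 10, 110, 200, 43, 1, 51, 22, 14),
                    price = c(120, 20, 10, 4, 3, 4, 30, 43, 56, 88, 75, 44),
                    dich  = factor(c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0)))
test <- data.frame(cost  = c(13, 5, 32, 22, 14, 145, 54, 134, 11, 14, 33, 21),
                   price = c(32, 11, 210, 6, 3, 7, 22, 423, 19, 99, 192, 32))
model <- glm(dich ~ cost + price, data = train, family = "binomial")

threshold <- 0.05
max_iter  <- 1000  # safety cap, not part of the original answer
for (i in seq_len(max_iter)) {
  pred  <- predict(model, test, type = "response")
  below <- pred <= threshold           # rows that still need a lower price
  if (!any(below)) break
  test$price[below] <- test$price[below] - 1
}
test$pred_prob <- predict(model, test, type = "response")
test
```

Rows that start above the threshold keep their original price, as in the loop above.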

I see what you are trying to do but the results are quite hilarious. This is if you want to subtract 1 from all elements of price each time:
x <- 1
while (TRUE) {
print("----------------------------------------")
print(x)
test$price <- test$price - 1
pred <- predict(model, test, type = "response")
print(pred)
x <- x + 1
if (sum(pred > 0.05) == length(pred)) {
print(test)
break
}
}
# ... loops 247 times
# [1] "----------------------------------------"
# [1] 248
# 1 2 3 4 5 6 7 8 9 10 11 12
# 0.99992994 0.99996240 0.93751936 0.99998243 0.99997993 0.99999966 0.99998781 0.05074762 0.99995669 0.99887117 0.97058913 0.99994594
# cost price
# 1 13 -216
# 2 5 -237
# 3 32 -38
# 4 22 -242
# 5 14 -245
# 6 145 -241
# 7 54 -226
# 8 134 175
# 9 11 -229
# 10 14 -149
# 11 33 -56
# 12 21 -216

In case anyone else wants to run the same thing with an xgboost model:
library(xgboost)
train <- data.frame('cost'= c(120, 3, 2, 4, 10, 110, 200, 43, 1, 51, 22, 14),
'price' = c(120, 20, 10, 4, 3, 4, 30, 43, 56, 88, 75, 44))
label <- data.frame('dich' = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0))
train <- as.matrix(train)
label <- as.matrix(label)
model <- xgboost(data = train,
label = label,
max.depth = 3,
nround = 1,
objective = "binary:logistic")
test <- data.frame('cost'= c(13, 5, 32, 22, 14, 145, 54, 134, 11, 14, 33, 21),
'price' = c(32, 11, 210, 6, 3, 7, 22, 423, 19, 99, 192, 32)
)
test <- as.matrix(test)
#FOR A MATRIX
test <- cbind(test, rep(NA, nrow(test)))
colnames(test)[3] <- c("pred_prob")
for (n in 1:nrow(test)) {
print("-----------------------------")
print(n)
while (TRUE) {
# pass only the two feature columns the model was trained on
pred <- predict(model, t(test[n, c("cost", "price")]), type = "response")
print(pred)
test[,"pred_prob"][n] <- pred
if (sum(pred > 0.5) == length(pred)) {
print(test[,"pred_prob"][n])
break
}
test[,"price"][n] <- test[,"price"][n] - .01
}
print(test)
}
It seems to take a while to run on 12 rows. I need to do some thinking about the thresholds of a tree model and how they'll affect the range of price changes needed to reach a probability at or above .5 (which is what I meant in my first question, but I wrote .05 haha).

R: generate means and SD table from some columns of a table by group

My apologies if this question has already been answered but I can't find it.
I have a table in R: (see below example copied from txt, the actual table has more data and NA)
I need to compute the mean and sd from column c, e, and f by the group in column b
I can calculate the mean and SD separately for each group and each column, e.g.
mean(c[b == 1], na.rm=TRUE)
var(e[b == 2], na.rm=TRUE)
I can also calculate the mean and SD for all the columns and generate a table with the results
library(data.table)
new <- data.table(project2016)
wide <- setnames(new[, sapply(.SD, function(x) list(mean = round(mean(x), 3), sd = round(sd(x), 3))), by = b], c("b", sapply(names(new)[-1], paste0, c(".mean", ".SD"))))
wide
But I am not able to do it for only the needed columns, separated by group.
Thx in advance,
Nimby
"id" "a" "b" "c" "d" "e" "f" "g"
1 78 2 83 4 2.53 1.07 3
2 72 2 117 4 2.50 1.16 2
3 72 2 132 4 2.43 1.13 2
4 73 2 102 4 2.48 .81 2
5 73 2 114 4 2.33 1.13 2
6 73 2 88 43 2.13 .84 2
7 65 2 213 4 2.55 1.26 1
8 68 2 153 4 2.45 1.23 1

library(dplyr)
# Some reproducible data
d <- matrix(c(1, 78, 2, 83, 4, 2.53, 1.07, 3, 2, 72, 2, 117, 4, 2.50, 1.16, 2, 3, 72, 2, 132, 4, 2.43, 1.13, 2, 4, 73, 2, 102, 4, 2.48, .81, 2, 5, 73, 2, 114, 4, 2.33, 1.13, 2, 6, 73, 2, 88, 43, 2.13, .84, 2, 7, 65, 2, 213, 4, 2.55, 1.26, 1, 8, 68, 2, 153, 4, 2.45, 1.23, 1),
ncol = 8, byrow = TRUE) %>%
as.data.frame
names(d) <- c("id", "a", "b", "c", "d", "e", "f", "g")
# Your data only included one group in column b
d$b[5:8] <- 1
# Calc mean and sd for the 3 columns, grouped by b
d %>%
group_by(b) %>%
summarise(mean_c = mean(c), sd_c = sd(c),
mean_e = mean(e), sd_e = sd(e),
mean_f = mean(f), sd_f = sd(f))
d
This yields
# A tibble: 2 × 7
b mean_c sd_c mean_e sd_e mean_f sd_f
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 142.0 54.35071 2.365 0.18064699 1.1150 0.1915724
2 2 108.5 20.95233 2.485 0.04203173 1.0425 0.1594522
There'll also be non dplyr ways to do it.
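One possible base R version, as a sketch: it uses aggregate twice (once for the mean, once for the SD) and merges the results. The names cols, means, sds and result are only illustrative. na.rm = TRUE is passed through to mean and sd because the question mentions NA values in the real table.

```r
# Same reproducible data as above, built without dplyr
d <- as.data.frame(matrix(c(
  1, 78, 2,  83,  4, 2.53, 1.07, 3,
  2, 72, 2, 117,  4, 2.50, 1.16, 2,
  3, 72, 2, 132,  4, 2.43, 1.13, 2,
  4, 73, 2, 102,  4, 2.48, 0.81, 2,
  5, 73, 2, 114,  4, 2.33, 1.13, 2,
  6, 73, 2,  88, 43, 2.13, 0.84, 2,
  7, 65, 2, 213,  4, 2.55, 1.26, 1,
  8, 68, 2, 153,  4, 2.45, 1.23, 1), ncol = 8, byrow = TRUE))
names(d) <- c("id", "a", "b", "c", "d", "e", "f", "g")
d$b[5:8] <- 1  # second group, as in the dplyr example

cols  <- c("c", "e", "f")
means <- aggregate(d[cols], by = list(b = d$b), FUN = mean, na.rm = TRUE)
sds   <- aggregate(d[cols], by = list(b = d$b), FUN = sd,   na.rm = TRUE)
names(means)[-1] <- paste0("mean_", cols)
names(sds)[-1]   <- paste0("sd_", cols)

result <- merge(means, sds, by = "b")
result
```

The columns come out grouped as all means followed by all SDs rather than interleaved, which is only a difference in layout from the dplyr output.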

Forming a new variable based on previous row value

My data is of the following form
structure(list(Flag = c(1, 0, 0, 1, 0, 0, 1, 0), variable = c(3,
8, 6, 7, 1, 4, 3, 6), sale = c(26, 27, 61, 38, 79, 87, 81, 13
)), .Names = c("Flag", "variable", "sale"), row.names = c(NA,
-8L), class = "data.frame")
And I want to create output as follows
structure(list(Flag = c(1, 0, 0, 1, 0, 0, 1, 0), variable = c(3,
8, 6, 7, 1, 4, 3, 6), sale = c(26, 27, 61, 38, 79, 87, 81, 13
), begin = c(3, -23, -50, 7, -31, -70, 3, -78), end = c(-23,
-50, -111, -31, -70, -151, -78, -91)), .Names = c("Flag", "variable",
"sale", "begin", "end"), row.names = c(NA, -8L), class = "data.frame")
where the new columns begin and end are based on the following algorithm
if flag = 1 then
  begin = variable;
  end = variable - sale;
else
  begin = lag(end)   (i.e. the previous value of end)
  end = lag(end) - sale
That is, when Flag is 1, "begin" equals the "variable" value and "end" equals "variable - sale".
For all other rows, "begin" is the previous row's "end" value and "end" is "begin - sale".
Can anyone help me achieve this in R?

I think the example output you provide is incorrect, but I would try the following:
beginEnd <- by(indf, cumsum(indf$Flag), FUN = function(x) {
out <- Reduce("-", c(x[, "variable"][1], x[, "sale"]), accumulate = TRUE)
cbind(begin = head(out, -1),
end = tail(out, -1))
})
cbind(indf, do.call(rbind, beginEnd))
# Flag variable sale begin end
# 1 1 3 26 3 -23
# 2 0 8 27 -23 -50
# 3 0 6 61 -50 -111
# 4 1 7 38 7 -31
# 5 0 1 79 -31 -110
# 6 0 4 87 -110 -197
# 7 1 3 81 3 -78
# 8 0 6 13 -78 -91
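The same result can also be sketched without Reduce, by noting that within each Flag group "end" is the group's first "variable" minus the running total of "sale", and "begin" is "end" plus the current "sale". This sketch assumes the first row has Flag = 1; grp and start are illustrative names.

```r
indf <- data.frame(Flag     = c(1, 0, 0, 1, 0, 0, 1, 0),
                   variable = c(3, 8, 6, 7, 1, 4, 3, 6),
                   sale     = c(26, 27, 61, 38, 79, 87, 81, 13))

grp   <- cumsum(indf$Flag)                                 # a new group starts at each Flag == 1
start <- ave(indf$variable, grp, FUN = function(v) v[1])   # first variable of each group

indf$end   <- start - ave(indf$sale, grp, FUN = cumsum)    # start minus running sales
indf$begin <- indf$end + indf$sale                         # value before the current sale
indf[, c("Flag", "variable", "sale", "begin", "end")]
```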
