Better way to add elements to a data frame without looping in R

I want to create a data frame that holds the odds ratio, together with its standard error and confidence interval, in R.
I have a dataset similar to this one:
dat <- read.table(header = TRUE, text = "
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 target
0 0 1 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 1
1 1 1 1 1 0 0 0 0 0 0 0")
I created a function that calculates everything I need for a particular feature in the dataset, like so:
get_ci <- function(df, feature) {
tab <- table(df[[feature]], df$target)
a <- tab[1,1]
b <- tab[1,2]
c <- tab[2,1]
d <- tab[2,2]
odds_ratio <- (a/b)/(c/d)
standard_error <- sqrt(1/a + 1/b + 1/c + 1/d)
log_ci_lower <- log(odds_ratio) - 1.96 * standard_error # lower bound: subtract 1.96 standard errors
log_ci_upper <- log(odds_ratio) + 1.96 * standard_error # upper bound: add 1.96 standard errors
ci_lower <- exp(log_ci_lower)
ci_upper <- exp(log_ci_upper)
df <- data.frame(Feature = feature,
`Odds Ratio` = odds_ratio,
`Standard Error` = standard_error,
`Lower Bound CI` = ci_lower,
`Upper Bound CI` = ci_upper
)
}
I want to create a data frame that contains the odds ratio, standard error, and confidence interval for each feature (f1-f11). What is the most efficient way to do this?
I am currently creating an empty data frame and looping through the features to populate it, but I feel this is not the right way to do it. I looked at the apply functions, but I am not sure how to use them with the function I created.

I think the first line of the function, the call to table, should be:
tab <- table(factor(df[[feature]], levels = 0:1), df$target)
Otherwise, if a particular column contains only 1's or only 0's, the table has a single row and the next lines break.
With that change, you can use lapply and pass it the column names:
result <- do.call(rbind, lapply(paste0('f', 1:11), get_ci, df = dat))
Or use purrr's map_df (after loading the package with library(purrr)):
result <- map_df(paste0('f', 1:11), get_ci, df = dat)
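For reference, the pieces above can be put together into one runnable script. This is only a sketch: it assumes the usual Wald interval (log odds ratio plus or minus 1.96 standard errors), also fixes the levels of target (my addition, not part of the answer), and uses check.names = FALSE so the column names keep their spaces. With only six sample rows many cells are zero, so most rows come out as Inf or NaN.

```r
# Sample data from the question
dat <- read.table(header = TRUE, text = "
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 target
0 0 1 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 1
1 1 1 1 1 0 0 0 0 0 0 0")

get_ci <- function(df, feature) {
  # Fixing the levels guarantees a 2 x 2 table even for constant columns
  tab <- table(factor(df[[feature]], levels = 0:1),
               factor(df$target, levels = 0:1))
  a <- tab[1, 1]; b <- tab[1, 2]
  c <- tab[2, 1]; d <- tab[2, 2]
  odds_ratio <- (a / b) / (c / d)
  standard_error <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
  # Wald interval on the log scale, then back-transformed
  ci_lower <- exp(log(odds_ratio) - 1.96 * standard_error)
  ci_upper <- exp(log(odds_ratio) + 1.96 * standard_error)
  data.frame(Feature = feature,
             `Odds Ratio` = odds_ratio,
             `Standard Error` = standard_error,
             `Lower Bound CI` = ci_lower,
             `Upper Bound CI` = ci_upper,
             check.names = FALSE)
}

# One row per feature, bound into a single data frame
features <- setdiff(names(dat), "target")
result <- do.call(rbind, lapply(features, get_ci, df = dat))
print(result)
```

Taking the feature names from names(dat) instead of paste0('f', 1:11) avoids hard-coding the number of columns.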

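Here's another solution, which passes each column to the function instead of its name. Note that apply returns one column per feature here; wrap the result in t() if you want one row per feature.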
get_ci <- function(x, target) {
tab <- table(factor(x, levels=0:1), target) #changed
...
ci_upper <- exp(log_ci_upper)
c(`Odds Ratio` = odds_ratio, # changed
`Standard Error` = standard_error,
`Lower Bound CI` = ci_lower,
`Upper Bound CI` = ci_upper
)
}
as.data.frame(apply(dat[,1:11], 2, function(x) { get_ci(x, dat$target) })) #changed

Related

R - Replace sequence values by its maximum in a column

I have a small problem and hope someone can help me.
I have a dataframe like this:
df <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))
and want to get a result like this:
df_result <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,2,2,0,0,3,3,3,0,0,0,4,4,4,4))
How do I do this without using a while loop?

Using ave in base R :
with(df, as.integer(bar > 0) * (ave(bar, cumsum(bar == 0), FUN = max)))
#[1] 0 0 1 0 0 0 2 2 0 0 3 3 3 0 0 0 4 4 4 4
where cumsum(bar == 0) creates the groups, ave calculates the max within each group, and as.integer(bar > 0) keeps the values that are 0 as 0.
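To see why this works, it can help to look at the intermediate vectors. The following sketch splits the one-liner into its three steps; the names grp, grp_max and res are only illustrative.

```r
df <- data.frame(foo = 1:20,
                 bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))

# Every 0 starts a new group, so each run 1, 2, ..., k shares a group
# with the 0 directly before it
grp <- cumsum(df$bar == 0)

# Maximum of bar within each group, repeated for every member of the group
grp_max <- ave(df$bar, grp, FUN = max)

# The leading 0 of each group must stay 0, so mask it out again
res <- as.integer(df$bar > 0) * grp_max
print(res)
```

The printed vector should match the bar column of df_result from the question.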

How to solve linear programming model in R

I need to solve the following microeconomic problem:
I have six assets I can produce (asset 1 - 6) across five years (2011 - 2015).
Each asset can only be produced during one year.
Each asset must be produced in my five year period.
Production is not mutually exclusive; I can produce more than one good in a year without affecting the production of either.
Each asset has a fixed cost of production equal to 30.
I must have non-negative profit in each year; revenues must be at least 30.
Below is a matrix representing my potential revenue for producing each asset (i) in a given year (j).
        2011  2012  2013  2014  2015
Asset1   35*   37    39    42    45
Asset2   16    17    18    19    20*
Asset3  125   130   136*  139   144
Asset4   15    27    29    30*   33
Asset5   14    43*   46    50    52
Asset6    5     7     8    10    11*
The asterisks (*) represent what should be the optimal solution set.
How can I use R to solve for the production plan that maximizes my revenue (and therefore profit) subject to the constraints outlined. My output should be a similar 6x5 matrix of 0's and 1's, where 1's represent choosing to produce a good in a given year.

This is a classic problem, but it needs to be reformulated before a solver can handle it.
Start by reformulating your problem:
Max( sum_[i,t] (pi_[i,t] - C_[i,t]) * x_[i,t] )
s.t.
sum_t x_[i,t] = 1 [ for all i ]
sum_i pi_[i,t] * x_[i,t] >= 30 [ for all t ]
x_[i,t] in {0, 1} [ for all i, t ]
Here pi_[i,t] is the revenue of asset i in year t, C_[i,t] is its cost, and x_[i,t] is 1 if asset i is produced in year t and 0 otherwise.
The lpSolve package expects the objective as a flat vector rather than a matrix. Let's start by building a vector x of revenues with one entry per x_[i,t]. Naming the entries is not required by lpSolve, but it helps us keep track.
n <- 6
t <- 5
#x ordered by column.
x <- c(35, 16, 125, 15, 14, 5, 37, 17, 130, 27, 43, 7, 39, 18, 136, 29, 46, 8, 42, 19, 139, 30, 50, 10, 45, 20, 144, 33, 52, 11)
# if x is matrix use:
# x <- as.vector(x)
names(x) <- paste0('x_[', seq(n), ',', rep(seq(t), each = n), ']')
head(x, n * 2)
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2] x_[2,2] x_[3,2] x_[4,2] x_[5,2] x_[6,2]
35 16 125 15 14 5 37 17 130 27 43 7
length(x)
[1] 30
Now we need to create our conditions, starting with the first one:
sum_t x_[i,t] = 1 [ for all i ]
We can create this rather simply. The thing to watch out for is that the dimensions have to be right. We have a vector of length 30, so our conditions matrix needs 30 columns. In addition we have 6 assets, so we need 6 rows for this condition. Again, let's name the rows and columns to keep track ourselves.
cond1 <- matrix(0, ncol = t * n,
nrow = n,
dimnames = list(paste0('x_[', seq(n), ',t]'),
names(x)))
cond1[, seq(n + 1)]
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2]
x_[1,t] 0 0 0 0 0 0 0
x_[2,t] 0 0 0 0 0 0 0
x_[3,t] 0 0 0 0 0 0 0
x_[4,t] 0 0 0 0 0 0 0
x_[5,t] 0 0 0 0 0 0 0
x_[6,t] 0 0 0 0 0 0 0
Next we fill in the correct fields: x_[1,1] + x_[1,2] + ... = 1 and x_[2,1] + x_[2,2] + ... = 1 and so forth. A for loop is the simplest way to do this:
for(i in seq(n)){
cond1[i, seq(i, 30, n)] <- 1
}
cond1[, seq(n + 1)]
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2]
x_[1,t] 1 0 0 0 0 0 1
x_[2,t] 0 1 0 0 0 0 0
x_[3,t] 0 0 1 0 0 0 0
x_[4,t] 0 0 0 1 0 0 0
x_[5,t] 0 0 0 0 1 0 0
x_[6,t] 0 0 0 0 0 1 0
We still have to create the RHS and specify the direction, but I'll hold off on that for now.
Next, let's create our matrix for the second condition:
sum_i pi_[i,t] * x_[i,t] >= 30 [ for all t ]
The process for this one is very similar, but now we need a row for each period, so the dimension of the matrix is 5x30. The main difference is that we need to insert the revenue values (stored in our vector x) instead of 1's:
cond2 <- matrix(0, ncol = t * n,
nrow = t,
dimnames = list(paste0('t=', seq(t)),
names(x)))
for(i in seq(t)){
cond2[i, seq(n) + n * (i - 1)] <- x[seq(n) + n * (i - 1)]
}
cond2[, seq(1, n * t, n)]
x_[1,1] x_[1,2] x_[1,3] x_[1,4] x_[1,5]
t=1 35 0 0 0 0
t=2 0 37 0 0 0
t=3 0 0 39 0 0
t=4 0 0 0 42 0
t=5 0 0 0 0 45
Note that I'm printing the result for x_[1, t] to illustrate we've got it right.
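The yearly constraint matrix can also be built without a loop. This is a sketch of an alternative, not the answer's own code: it first creates a 0/1 block pattern with one row per year and then scales every column by its revenue using sweep. The names revenue, year_blocks and cond2_sweep are illustrative.

```r
n_assets <- 6
n_years <- 5

# Revenues ordered by column (all assets for 2011, then 2012, ...)
revenue <- c(35, 16, 125, 15, 14, 5,
             37, 17, 130, 27, 43, 7,
             39, 18, 136, 29, 46, 8,
             42, 19, 139, 30, 50, 10,
             45, 20, 144, 33, 52, 11)

# Row t has ones in the six columns that belong to year t
year_blocks <- kronecker(diag(n_years), matrix(1, nrow = 1, ncol = n_assets))

# Multiply column j by revenue[j]
cond2_sweep <- sweep(year_blocks, 2, revenue, `*`)

# Loop version from the answer, for comparison
cond2_loop <- matrix(0, nrow = n_years, ncol = n_assets * n_years)
for (i in seq(n_years)) {
  idx <- seq(n_assets) + n_assets * (i - 1)
  cond2_loop[i, idx] <- revenue[idx]
}

print(all(cond2_sweep == cond2_loop))
print(rowSums(cond2_sweep))
```

The row sums are the total revenue available in each year if every asset were produced in that year.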
Lastly, we have the final condition. For this, note that lpSolve::lp has an argument all.bin, which the help page (?lpSolve::lp) describes as:
Logical: should all variables be binary? Default: FALSE.
Since all variables are either 1 or 0, we simply set this argument to TRUE. Before continuing, let's combine our conditions into one matrix:
cond <- rbind(cond1, cond2)
Now both the RHS and the direction are simply taken from the 2 conditions. From the documentation on the const.dir argument
Vector of character strings giving the direction of the constraint: each value should be one of "<," "<=," "=," "==," ">," or ">=". (In each pair the two values are identical.)
In our conditions we have 6 rows representing the first condition and 5 rows representing the second. Thus we need n (6) times == and t (5) times >=.
cond_dir <- c(rep('==', n), rep('>=', t))
The RHS is created in a similar fashion
RHS <- c(rep(1, n), rep(30, t))
And that's it! Now we're ready to solve our problem using the lpSolve::lp function.
sol = lpSolve::lp(direction = 'max',
objective.in = x,
const.mat = cond,
const.dir = cond_dir,
const.rhs = RHS,
all.bin = TRUE)
sol$objval
[1] 275
The weights for the solution are stored in sol$solution
names(sol$solution) <- names(x)
sol$solution
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2] x_[2,2] x_[3,2] x_[4,2] x_[5,2] x_[6,2] x_[1,3] x_[2,3] x_[3,3]
1 0 0 0 0 0 0 0 0 0 1 0 0 0 1
x_[4,3] x_[5,3] x_[6,3] x_[1,4] x_[2,4] x_[3,4] x_[4,4] x_[5,4] x_[6,4] x_[1,5] x_[2,5] x_[3,5] x_[4,5] x_[5,5] x_[6,5]
0 0 0 0 0 0 1 0 0 0 1 0 0 0 1
matrix(sol$solution,
ncol = t,
dimnames = list(rownames(cond1),
rownames(cond2)))
t=1 t=2 t=3 t=4 t=5
x_[1,t] 1 0 0 0 0
x_[2,t] 0 0 0 0 1
x_[3,t] 0 0 1 0 0
x_[4,t] 0 0 0 1 0
x_[5,t] 0 1 0 0 0
x_[6,t] 0 0 0 0 1
Which we quickly see is the correct solution. :-)
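The reported plan can also be checked by hand without a solver. The sketch below rebuilds the revenue matrix from the question, marks the chosen year for each asset as printed above, and verifies both constraints and the objective value. The names revenue, plan and yearly are illustrative.

```r
# Revenue matrix from the question (rows: assets, columns: years)
revenue <- matrix(c(35, 16, 125, 15, 14, 5,
                    37, 17, 130, 27, 43, 7,
                    39, 18, 136, 29, 46, 8,
                    42, 19, 139, 30, 50, 10,
                    45, 20, 144, 33, 52, 11),
                  nrow = 6,
                  dimnames = list(paste0("Asset", 1:6), 2011:2015))

# Production plan reported by the solver: the year index chosen per asset
plan <- matrix(0, nrow = 6, ncol = 5, dimnames = dimnames(revenue))
plan[cbind(1:6, c(1, 5, 3, 4, 2, 5))] <- 1

# Condition 1: every asset is produced exactly once
print(rowSums(plan))

# Condition 2: revenue in every year is at least 30
yearly <- colSums(revenue * plan)
print(yearly)

# Objective: total revenue of the plan
print(sum(revenue * plan))
```

The tightest year is 2014, where Asset4 alone brings in exactly 30.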
Side note on costs
One may have wondered, "Where the hell did the costs go?" In this specific case, costs are fixed and not very interesting. This means we can ignore them during the calculations, because we know the total cost is going to be 30 * 6 = 180 (which has to be subtracted from the objective value). However, it is not uncommon for costs to depend on various factors, and they might affect the optimal solution. For illustration, I'll show how we could incorporate costs in this example.
First we have to extend our objective vector to incorporate the costs for each product in each period:
Fixed_C <- -30
x <- c(x, rep(Fixed_C, n * t))
Next we'll add a pseudo-constraint
x_[i,t] - C_[i,t] = 0 [for all i, t]
This constraint ensures that if x_[i,t] = 1 then the relevant cost is added to the problem. There are 2 ways to create this constraint. The first is to have a matrix with n * t rows, one for each asset and period. Alternatively, we can rely on our first constraint and live with only a single constraint
sum_[i,t] (x_[i,t] - C_[i,t]) = 0
because our first constraint already ensures that each asset is produced exactly once, and since every cost is the same here, only the number of costs charged matters. So our third constraint becomes
cond3 <- c(rep(1, n * t), rep(-1, n * t))
Lastly we have to extend our RHS and the condition 1 and 2 matrices. Simply add 0's to the condition matrices to make the dimensions fit (cond2 has t rows, not n).
cond1 <- cbind(cond1, matrix(0, nrow = n, ncol = n * t))
cond2 <- cbind(cond2, matrix(0, nrow = t, ncol = n * t))
cond <- rbind(cond1, cond2, cond3)
cond_dir <- c(cond_dir, '==')
RHS <- c(RHS, 0)
And now we can once again find the optimal solution using lpSolve::lp
solC = lpSolve::lp(direction = 'max',
objective.in = x,
const.mat = cond,
const.dir = cond_dir,
const.rhs = RHS,
all.bin = TRUE)
solC$objval
[1] 95
which is equal to our previous value of 275 minus our fixed costs of 30 * n = 180.

error in design matrix: "Design matrix not of full rank. The following coefficients not estimable"

I have a design matrix for my data, shown below. When I run the commands to analyze and compare the different groups, I get an error.
I would like to make these comparisons:
L4vsL6.L8, Q3vsQ5.Q7, QvsL
Design matrix:
design
organoids_biological_samples method
L4_D49_rep_1 L4 L
L4_D49_rep_2 L4 L
L6_L8_D49_rep_1 L6_L8 L
L6_L8_D49_rep_2 L6_L8 L
Q3_D49_rep_1 Q3 Q
Q3_D49_rep_2 Q3 Q
Q5_Q7_D49_rep_1 Q5_Q7 Q
Q5_Q7_D49_rep_2 Q5_Q7 Q
design$organoids_biological_samples <- factor(design$organoids_biological_samples, levels = c("L4","L6_L8", "Q3", "Q5_Q7"))
design$method <- factor(design$method, levels = c("L", "Q"))
all(rownames(design) %in% colnames(data))
all(rownames(design) == colnames(data))
Group <- factor(paste(design$organoids_biological_samples,design$method,sep="."))
design<- cbind(design,Group)
design.matrix <- model.matrix(~0+Group+method,design)
colnames(design.matrix) <- c("L4.L", "L6_L8.L", "Q3.Q", "Q5_Q7.Q", "method")
design.matrix
L4.L L6_L8.L Q3.Q Q5_Q7.Q method
L4_D49_rep_1 1 0 0 0 0
L4_D49_rep_2 1 0 0 0 0
L6_L8_D49_rep_1 0 1 0 0 0
L6_L8_D49_rep_2 0 1 0 0 0
Q3_D49_rep_1 0 0 1 0 1
Q3_D49_rep_2 0 0 1 0 1
Q5_Q7_D49_rep_1 0 0 0 1 1
Q5_Q7_D49_rep_2 0 0 0 1 1
attr(,"assign")
[1] 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$Group
[1] "contr.treatment"
attr(,"contrasts")$method
[1] "contr.treatment"
edgeR.dgelist = DGEList(counts = data,group = Group)
edgeR.dgelist = calcNormFactors(edgeR.dgelist,method = "TMM")
CommonDisp <- estimateGLMCommonDisp(edgeR.dgelist, design.matrix)
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
Design matrix not of full rank. The following coefficients not estimable:
method

There is an exact linear dependency among the columns of your design matrix:
Q3.Q + Q5_Q7.Q = method.
Therefore your model cannot find unique coefficients for these columns:
if B1, B2, B3 is an optimal set of coefficients for Q3.Q, Q5_Q7.Q and method, then so is B1+x, B2+x, B3-x for any real number x.
Numerically, your program cannot invert the matrix to find the optimal coefficients.
Consider testing whether it is better to include both Q3.Q and Q5_Q7.Q or just method in your model.
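The dependency can be confirmed in base R before anything is handed to edgeR. The sketch below rebuilds the sample table from the question, compares the rank of the design matrix with and without the method column, and writes the three requested comparisons as plain contrast vectors. Treating QvsL as the difference between the two group averages is my assumption about what the question means, and the names full, reduced and contrast_mat are illustrative.

```r
design <- data.frame(
  organoids_biological_samples = factor(rep(c("L4", "L6_L8", "Q3", "Q5_Q7"), each = 2)),
  method = factor(rep(c("L", "Q"), each = 4))
)
design$Group <- factor(paste(design$organoids_biological_samples,
                             design$method, sep = "."))

# Design from the question: 5 columns, but only 4 are linearly independent
full <- model.matrix(~ 0 + Group + method, design)
print(c(columns = ncol(full), rank = qr(full)$rank))

# Dropping method leaves one column per group and full rank
reduced <- model.matrix(~ 0 + Group, design)
colnames(reduced) <- levels(design$Group)
print(c(columns = ncol(reduced), rank = qr(reduced)$rank))

# The requested comparisons as contrasts of the four group means
contrast_mat <- cbind(L4vsL6.L8 = c(1, -1, 0, 0),
                      Q3vsQ5.Q7 = c(0, 0, 1, -1),
                      QvsL      = c(-0.5, -0.5, 0.5, 0.5))
rownames(contrast_mat) <- colnames(reduced)
print(contrast_mat)
```

With the reduced design, the method effect is not lost: it is recovered as the QvsL contrast, and each column of contrast_mat could be supplied as the contrast when testing the fitted model.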

Faster way to fill in missing columns in R data frame

Can any R experts provide a faster way to do the following? My code works, but it takes 1 minute to do a 30,000-[column] by 12-[row] data frame. Thanks!
sync.columns = function(old.data, new.colnames)
{
# Given a data frame and a vector of column names,
# makes a new data frame containing exactly the named
# columns in the specified order; any that were not
# present are filled in as columns of zeroes.
if (length(new.colnames) == ncol(old.data) &&
all(new.colnames == colnames(old.data)))
{
old.data # nothing to do
}
else
{
m = matrix(nrow=nrow(old.data),ncol=length(new.colnames))
for (t in 1:length(new.colnames))
{
if (new.colnames[t] %in% colnames(old.data))
{
m[,t] = old.data[,new.colnames[t]] # copy column
}
else
{
m[,t] = rep(0,nrow(m)) # fill with zeroes
}
}
result = as.data.frame(m)
rownames(result) = rownames(old.data)
colnames(result) = new.colnames
result
}
}
Maybe something with cbind?

This seems rather fast. First create a data.frame full of zeroes, then only replace what you can find in the old data:
sync.columns <- function(old.data, new.colnames) {
M <- nrow(old.data)
N <- length(new.colnames)
rn <- rownames(old.data)
cn <- new.colnames
new.data <- as.data.frame(matrix(0, M, N, dimnames = list(rn, cn)))
keep.col <- intersect(cn, colnames(old.data))
new.data[keep.col] <- old.data[keep.col]
new.data
}
M <- 30000
x <- data.frame(b = runif(M), i = runif(M), z = runif(M))
rownames(x) <- paste0("z", 1:M)
system.time(y <- sync.columns(x, letters[1:12]))
# user system elapsed
# 0.031 0.010 0.043
head(y)
# a b c d e f g h i j k l
# z1 0 0.27994248 0 0 0 0 0 0 0.3785181 0 0 0
# z2 0 0.75291520 0 0 0 0 0 0 0.7414294 0 0 0
# z3 0 0.07036461 0 0 0 0 0 0 0.1543653 0 0 0
# z4 0 0.40748957 0 0 0 0 0 0 0.5564374 0 0 0
# z5 0 0.98769595 0 0 0 0 0 0 0.4277466 0 0 0
# z6 0 0.82117781 0 0 0 0 0 0 0.2034743 0 0 0
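A tiny input makes the behaviour easier to check by eye. The sketch below applies the same function to a two-row data frame: column b is copied, the missing columns a and c are filled with zeroes, and column z is dropped because it was not requested. The names small and out are illustrative.

```r
sync.columns <- function(old.data, new.colnames) {
  # Start from an all-zero data frame with the requested columns
  new.data <- as.data.frame(matrix(0, nrow(old.data), length(new.colnames),
                                   dimnames = list(rownames(old.data), new.colnames)))
  # Overwrite only the columns that exist in the old data
  keep.col <- intersect(new.colnames, colnames(old.data))
  new.data[keep.col] <- old.data[keep.col]
  new.data
}

small <- data.frame(b = c(1.5, 2.5), z = c(9, 8), row.names = c("r1", "r2"))
out <- sync.columns(small, c("a", "b", "c"))
print(out)
```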
Edit: following the OP's comments, here is a matrix version:
sync.columns <- function(old.data, new.colnames) {
M <- nrow(old.data)
N <- length(new.colnames)
rn <- rownames(old.data)
cn <- new.colnames
new.data <- matrix(0, M, N, dimnames = list(rn, cn))
keep.col <- intersect(cn, colnames(old.data))
new.data[, keep.col] <- old.data[, keep.col]
new.data
}
x <- t(as.matrix(x)) # a wide matrix
system.time(y <- sync.columns(x, paste0("z", sample(1:50000, 30000))))
# user system elapsed
# 0.049 0.002 0.051

How to build binary data.frame in R for multiple dimensions?

I have a dataframe with three factors of which two are binary and the third one is integer:
DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0
etc.
Now I want to transform it to something like this
YEAR1_REGION1 YEAR1_REGION2 YEAR2_REGION1 YEAR2_REGION2
OBS1 X 0 0 0
OBS2 0 Y 0 0
OBS3 0 0 Z 0
Basic matrix multiplication is not what I'm after. I would like to find a neat way to do this that would automatically have the columns renamed as well. My actual data has three factor dimensions with 20*8*6 observations so finally there will be 960 columns altogether.

Here's another approach based on outer, similar to @Roland's answer.
year <- grep("YEAR", names(DF), value = TRUE)
region <- grep("REGION", names(DF), value = TRUE)
data <- as.character(DF$DATA)
df <- outer(year, region, function(x, y) DF[,x] * DF[,y])
colnames(df) <- outer(year, region, paste, sep = "_")
df <- as.data.frame(df)
for (i in seq_len(ncol(df)))
df[as.logical(df[,i]), i] <- data[as.logical(df[,i])]
df
## YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
## OBS1 X 0 0 0
## OBS2 0 0 Y 0
## OBS3 0 Z 0 0

Maybe others will come up with a more succinct possibility, but this creates the expected result:
DF <- read.table(text=" DATA YEAR1 YEAR2 REGION1 REGION2
OBS1 X 1 0 1 0
OBS2 Y 1 0 0 1
OBS3 Z 0 1 1 0", header=TRUE)
DF[,-1] <- lapply(DF[,-1], as.logical)
DF[,1] <- as.character(DF[,1])
res <- apply(expand.grid(2:3, 4:5), 1, function(i) {
tmp <- rep("0", length(DF[,1]))
ind <- do.call(`&`,DF[,i])
tmp[ind] <- DF[ind,1]
tmp <- list(tmp)
names(tmp) <- paste0(names(DF)[i], collapse="_")
tmp
})
res <- as.data.frame(res)
rownames(res) <- rownames(DF)
# YEAR1_REGION1 YEAR2_REGION1 YEAR1_REGION2 YEAR2_REGION2
# OBS1 X 0 0 0
# OBS2 0 0 Y 0
# OBS3 0 Z 0 0
However, I suspect there is a much better possibility to achieve what you actually want to do, without creating a huge wide-format data.frame.
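One possible alternative along those lines is a long format with one label column per dimension instead of one indicator column per combination. This is only a sketch under the assumption that exactly one YEAR indicator and exactly one REGION indicator is set in every row; the name long is illustrative.

```r
DF <- data.frame(DATA = c("X", "Y", "Z"),
                 YEAR1 = c(1, 1, 0), YEAR2 = c(0, 0, 1),
                 REGION1 = c(1, 0, 1), REGION2 = c(0, 1, 0),
                 row.names = c("OBS1", "OBS2", "OBS3"))

year <- grep("YEAR", names(DF), value = TRUE)
region <- grep("REGION", names(DF), value = TRUE)

# max.col returns, for each row, the position of the largest entry,
# i.e. the indicator that is set to 1
long <- data.frame(
  DATA = DF$DATA,
  YEAR = year[max.col(as.matrix(DF[year]), ties.method = "first")],
  REGION = region[max.col(as.matrix(DF[region]), ties.method = "first")],
  row.names = rownames(DF)
)
print(long)
```

With 20 * 8 * 6 combinations this keeps three label columns instead of 960 mostly empty ones.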
