Issue with loop increments lower than 1 when simulating data - R

I'm trying to simulate data utilizing a for loop and storing it in a matrix with the following code:
m <- matrix(nrow = 500 , ncol = 7)
for(i in seq(from = 1, to = 4, by = 0.5)){
a <- 1 * i + rnorm(n = 500, mean = 0, sd = 1)
m[, i] <- a
}
But instead of giving me 7 columns with means of roughly 1, 1.5, 2, 2.5, 3, 3.5 and 4, matrix m contains 4 columns with means of roughly 1.5, 2.5, 3.5 and 4, and 3 columns of NA values.
If I change the increment to 1 and run the code below, everything behaves as expected, so the issue seems to be with the increments, but I can't figure out what I should do differently. Help would be most appreciated.
m <- matrix(nrow = 500 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
a <- 1 * i + rnorm(n = 500, mean = 0, sd = 1)
m[, i] <- a
}

Column indices must be integers. In your case, you try to select column 1.5; R truncates the fractional index, so several iterations write to the same column and the remaining columns stay NA. You can fix this with a simple calculation, (i * 2) - 1, which maps 1, 1.5, ..., 4 onto columns 1 to 7.
# reduce number of rows for showcase
n <- 100
m <- matrix(nrow = n , ncol = 7)
for(i in seq(from = 1, to = 4, by = 0.5)){
# NOTE: 1*i does not change anything
a <- 1*i + rnorm(n = n, mean = 0, sd = 1)
# map i = 1, 1.5, ..., 4 to integer column indices 1, 2, ..., 7
m[, (i * 2) - 1] <- a
}
m[1:5, ]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 1.15699467 0.8917952 1.999899 2.330557 4.502607 4.469957 5.687460
#> [2,] -1.13634309 1.5394771 1.700148 1.669329 2.124019 3.472836 3.513351
#> [3,] 2.08584731 1.0591743 2.866186 3.192953 3.984286 3.593902 3.983265
#> [4,] 0.02211767 2.2222376 2.055832 2.927851 2.846376 3.411725 3.742966
#> [5,] 0.49167319 2.2244472 2.190050 3.525931 2.841522 5.722172 4.797856
colMeans(m)
#> [1] 0.8537568 1.6805235 1.9907633 2.6434843 2.8651140 3.5499583 3.9757984
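If you'd rather avoid the index arithmetic altogether, a sketch of an equivalent option is to loop over the positions of the sequence instead of its values:
means <- seq(from = 1, to = 4, by = 0.5)
m <- matrix(nrow = 500, ncol = length(means))
for (j in seq_along(means)) {
# j is always an integer, so it can be used as the column index directly
m[, j] <- means[j] + rnorm(n = 500, mean = 0, sd = 1)
}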

When you use rnorm, it actually allows vectorized input for the mean value, so you can try the code below (you just need matrix() and t() to reshape the output into the desired dimensions of your output matrix):
nr <- 500
nc <- 7
m <- t(matrix(rnorm(nr * nc, seq(1, 4, 0.5), 1), nc, nr))
where you can see, for example
> m[1:5, ]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 3.2157776 0.3805689 0.7550255 2.508356 3.567479 2.597378 4.122201
[2,] 0.8634009 0.4887092 2.5655513 1.710756 2.377790 3.733045 4.199812
[3,] -0.1786419 2.4471083 1.2138140 3.090687 2.763694 3.471715 4.676037
[4,] 1.2492511 2.3480447 2.2180039 1.965656 1.505342 3.832380 4.086075
[5,] -0.1301543 1.7463687 1.2467769 2.649525 4.795677 2.606623 4.318468
> colMeans(m)
[1] 0.901146 1.476423 1.900147 2.567463 2.996918 3.468140 4.025929
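If the transpose feels opaque, an equivalent (though not necessarily faster) way to get one column per mean is sapply, which binds the seven simulated vectors together as columns:
m <- sapply(seq(from = 1, to = 4, by = 0.5), function(mu) rnorm(n = 500, mean = mu, sd = 1))
colMeans(m) should again be close to 1, 1.5, ..., 4.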

You're using i as a column index, but i takes non-integer values, and only integers can be used to index a matrix or data frame. When i is, say, 1.5 and you use it in the m[, i] expression, it gets coerced to an integer and truncated down to 1, so the first 2 runs of your loop overwrite each other (and the 3rd and 4th, etc.).
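You can see the truncation directly with a small vector:
x <- c(10, 20, 30)
x[1.5] # fractional indices are truncated towards zero, so this is x[1]
#> [1] 10
x[2.9] # and this is x[2], not x[3]
#> [1] 20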
You could simply use your second code and replace 1*i with 0.5 + 0.5*i:
m <- matrix(nrow = 5000 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
a <- 0.5 + 0.5*i + rnorm(n = 5000, mean = 0, sd = 1)
m[,i] <- a
}
However, it may be better to use the mean parameter of rnorm directly to generate values with the specified mean/sd: currently you are drawing from a normal distribution centered around 0 and then shifting it sideways, when you could simply tell rnorm to use the mean you actually want.
m <- matrix(nrow = 5000 , ncol = 7)
for(i in seq(from = 1, to = 7, by = 1)){
m[,i] <- rnorm(n = 5000, mean = 0.5 + 0.5*i, sd = 1)
}

Related

Problem with missing values when calculating the median

I'm having a problem handling NAs when calculating a median across multiple matrices.
This is an example of the code and data I'm working on:
#Data example
m1 = matrix(c(2, 4, 3, 1),nrow=2, ncol=2, byrow = TRUE)
m2 = matrix(c(NA, 5, 7, 9),nrow=2, ncol=2, byrow = TRUE)
m3 = matrix(c(NA, 8, 10, 14),nrow=2, ncol=2, byrow = TRUE)
Median calculation
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median)
[,1] [,2]
[1,] NA 5
[2,] 7 9
As expected, the function doesn't return a value for cells that contain NAs.
The problem is that if I replace NAs with 0 I'll get an output like this:
#Data example
m1 = matrix(c(2, 4, 3, 1),nrow=2, ncol=2, byrow = TRUE)
m2 = matrix(c(0, 5, 7, 9),nrow=2, ncol=2, byrow = TRUE)
m3 = matrix(c(0, 8, 10, 14),nrow=2, ncol=2, byrow = TRUE)
Median calculation
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median)
[,1] [,2]
[1,] 0 5
[2,] 7 9
Instead, I'm trying to get an output where NAs are simply skipped, so that only the actual values are taken into consideration. As in the example, for a cell with NA, NA, 2 I would expect 2 as the result, while (outside the example) for a cell with NA, 2, 5 I would expect 3.5.
[,1] [,2]
[1,] 2 5
[2,] 7 9
Do you have an idea of how I could get this result? Any suggestion would be appreciated, thanks.
Just pass the argument na.rm=TRUE inside apply
apply(abind::abind(m1, m2, m3, along = 3), 1:2, median, na.rm = TRUE)
Output:
[,1] [,2]
[1,] 2 5
[2,] 7 9
Perhaps you should drop the NAs first? Try adding na.rm = TRUE.
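For the record, this works because any extra arguments passed to apply after the function are forwarded to it, so median(x, na.rm = TRUE) is what actually gets called for every cell. An equivalent, more explicit spelling uses an anonymous function:
apply(abind::abind(m1, m2, m3, along = 3), 1:2, function(v) median(v, na.rm = TRUE))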

Solutions to a system of inequalities in R

Suppose I have the following system of inequalities:
-2x + y <= -3
1.25x + y <= 2.5
y >= -3
I want to find multiple tuples of (x, y) that satisfy the above inequalities.
library(Rglpk)
obj <- numeric(2)
mat <- matrix(c(-2, 1, 1.25, 1, 0, 1), nrow = 3)
dir <- c("<=", "<=", ">=")
rhs <- c(-3, 2.5, -3)
Rglpk_solve_LP(obj = obj, mat = mat, dir = dir, rhs = rhs)
Using the above code only seems to return one possible solution tuple, (1.5, 0). Is it possible to return other solution tuples?
Edit: Based on the comments, I would be interested to learn if there are any functions that could help me find the corner points.
To understand the possible answers to the given question, we can try to solve the system of inequalities graphically.
There is a nice answer on Stack Overflow concerning the plotting of inequalities in R. Using that approach we can plot the following graph:
library(ggplot2)
fun1 <- function(x) 2*x - 3 # boundary of -2x + y <= -3, i.e. y = 2x - 3
fun2 <- function(x) -1.25*x + 2.5 # boundary of 1.25x + y <= 2.5
fun3 <- function(x) -3 # boundary of y >= -3
x1 = seq(-1,5, by = 1/16)
mydf = data.frame(x1, y1=fun1(x1), y2=fun2(x1),y3= fun3(x1))
mydf <- transform(mydf, z = pmax(y3,pmin(y1,y2)))
ggplot(mydf, aes(x = x1)) +
geom_line(aes(y = y1), colour = 'blue') +
geom_line(aes(y = y2), colour = 'green') +
geom_line(aes(y = y3), colour = 'red') +
geom_ribbon(aes(ymin=y3,ymax = z), fill = 'gray60')
All the possible tuples (infinitely many of them) lie inside the gray triangle.
The vertices can be found using the following code.
obj <- numeric(2)
mat <- matrix(c(-2, 1.25, 1, 1), nrow = 2)
rhs <- matrix(c(-3, 2.5), nrow = 2)
aPoint <- solve(mat, rhs)
mat <- matrix(c(-2, 0, 1, 1), nrow = 2)
rhs <- matrix(c(-3, -3), nrow = 2)
bPoint <- solve(mat, rhs)
mat <- matrix(c(1.25, 0, 1, 1), nrow = 2)
rhs <- matrix(c(2.5, -3), nrow = 2)
cPoint <- solve(mat, rhs)
Note the order of the elements in each mat: matrix() fills column by column, so the first column holds the x coefficients and the second column the y coefficients.
And you get the coordinates:
> aPoint
[,1]
[1,] 1.6923077
[2,] 0.3846154
> bPoint
[,1]
[1,] 0
[2,] -3
> cPoint
[,1]
[1,] 4.4
[2,] -3.0
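As a quick sanity check (a small sketch, with a tolerance for floating-point error), each vertex should satisfy all three inequalities:
check <- function(p) c(-2 * p[1] + p[2] <= -3 + 1e-9,
1.25 * p[1] + p[2] <= 2.5 + 1e-9,
p[2] >= -3 - 1e-9)
sapply(list(aPoint, bPoint, cPoint), check) # every entry should be TRUE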
All the code below uses base R only (no need for library(Rglpk)).
1. Corner Points
If you want to get all the corner points, here is one option
A <- matrix(c(-2, 1.25, 0, 1, 1, -1), nrow = 3)
b <- c(-3, 2.5, 3)
# use `det` to check whether the 2x2 coefficient matrix is singular; if so, return NA instead of solving
xh <-
combn(nrow(A), 2, function(k) {
if (det(A[k, ]) == 0) {
rep(NA, length(k))
} else {
solve(A[k, ], b[k])
}
})
# keep only the intersection points that satisfy all the constraints
corner_points <- t(xh[, colSums(A %*% xh <= b, na.rm = TRUE) == length(b)])
such that
> corner_points
[,1] [,2]
[1,] 1.692308 0.3846154
[2,] 0.000000 -3.0000000
[3,] 4.400000 -3.0000000
2. Possible Tuples
If you want multiple tuples, e.g. n = 10, we can use Monte Carlo simulation (based on the corner_points obtained in the previous step) to sample tuples that satisfy the constraints:
xrange <- range(corner_points[, 1])
yrange <- range(corner_points[, 2])
n <- 10
res <- list()
while (length(res) < n) {
px <- runif(1, xrange[1], xrange[2])
py <- runif(1, yrange[1], yrange[2])
if (all(A %*% c(px, py) <= b)) {
res[length(res) + 1] <- list(c(px, py))
}
}
and you will see n possible tuples in a list like below
> res
[[1]]
[1] 3.643167 -2.425809
[[2]]
[1] 2.039007 -2.174171
[[3]]
[1] 0.4990635 -2.3363637
[[4]]
[1] 0.6168402 -2.6736421
[[5]]
[1] 3.687389 -2.661733
[[6]]
[1] 3.852258 -2.704395
[[7]]
[1] 1.7571062 0.1067597
[[8]]
[1] 3.668024 -2.771307
[[9]]
[1] 2.108187 -1.365349
[[10]]
[1] 2.106528 -2.134310
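If you do not need a particular sampling distribution and just want feasible points without rejection, an alternative sketch is to take random convex combinations of the corner points: any convex combination of the vertices lies inside the (convex) feasible region, although the points are not uniform over the triangle.
w <- matrix(runif(n * nrow(corner_points)), nrow = n)
w <- w / rowSums(w)            # each row: non-negative weights summing to 1
tuples <- w %*% corner_points  # n feasible (x, y) points, one per row
all(A %*% t(tuples) <= b)      # should be TRUE: every point satisfies the constraints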
First of all, the matrix representing the three inequalities needs a small correction, because R fills matrices column by column:
-2x + y <= -3
1.25x + y <= 2.5
y >= -3
mat <- matrix(c(-2, 1.25, 0, 1, 1, 1), nrow = 3)
# and not: mat <- matrix(c(-2, 1, 1.25, 1, 0, 1), nrow = 3)
To get different tuples, you could modify the objective function:
obj <- numeric(2) results in an objective function 0 * x + 0 * y, which is always equal to 0 and can't be meaningfully maximized: the first feasible (x, y) found is returned.
Optimization on x is achieved by using obj <- c(1,0), resulting in maximization / minimization of 1 * x + 0 * y.
Optimization on y is achieved by using obj <- c(0,1).
#setting the bounds is necessary, otherwise optimization occurs only for x>=0 and y>=0
bounds <- list(lower = list(ind = c(1L, 2L), val = c(-Inf, -Inf)),
upper = list(ind = c(1L, 2L), val = c(Inf, Inf)))
# finding maximum x: obj = c(1,0), max = T
Rglpk_solve_LP(obj = c(1,0), mat = mat, dir = dir, rhs = rhs, bound = bounds, max = T)$solution
# [1] 4.4 -3.0
# finding minimum x: obj = c(1,0), max = F
Rglpk_solve_LP(obj = c(1,0), mat = mat, dir = dir, rhs = rhs, bound = bounds, max = F)$solution
#[1] 0 -3
# finding maximum y: obj = c(0,1), max = T
Rglpk_solve_LP(obj = c(0,1), mat = mat, dir = dir, rhs = rhs,bound=bounds, max = T)$solution
#[1] 1.6923077 0.3846154
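For completeness, minimizing y works the same way, but note that the minimum y = -3 is attained along a whole edge of the feasible region, so the solver simply returns one optimal point on that edge (which endpoint you get is a solver detail):
# finding minimum y: obj = c(0,1), max = F
Rglpk_solve_LP(obj = c(0,1), mat = mat, dir = dir, rhs = rhs, bound = bounds, max = F)$solution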

Pick a slice from R array according to coordinates

I have two vectors x and y of coordinates and a 3D array A in R. I want to produce a matrix, where the i'th row is A[x[i], , y[i]].
If A was 2D, I believe I could use A[cbind(x,y)]. For the 3D array, I think the following works, but it's kind of slow:
sapply(1:length(x), function(i) A[x[i],,y[i]]).
Is there a faster way to do this, e.g. by somehow using cbind?
Edit:
For instance, consider the following
A = array(1:12, c(2,2,3))
x = c(1,2,1)
y = c(1,2,3)
> A
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
, , 3
[,1] [,2]
[1,] 9 11
[2,] 10 12
I would like to get the following output, but with a faster code:
> t(sapply(1:length(x), function(i) A[x[i],,y[i]]))
[,1] [,2]
[1,] 1 3
[2,] 6 8
[3,] 9 11
indices <- cbind(
rep(x, each = dim(A)[2]), # first dimension: x[i]
rep(seq_len(dim(A)[2]), times = length(x)), # second dimension: every column
rep(y, each = dim(A)[2]) # third dimension: y[i]
)
identical(
array(A[indices], dim = c(dim(A)[2], length(x))),
sapply(1:length(x), function(i) A[x[i],,y[i]])
)
#> [1] TRUE
Data:
A <- array(1:30, dim = c(2, 3, 5))
n <- 7
set.seed(123)
x <- sample(dim(A)[1], n, replace = TRUE)
y <- sample(dim(A)[3], n, replace = TRUE)
Performance: not necessarily better; it depends on the array and coordinate sizes, see the benchmark plot below for insights.
set.seed(42)
create_data <- function(array_size, coordinates_size) {
list(
A = array(1:array_size^3, dim = rep(array_size, 3)),
x = sample(array_size, coordinates_size, replace = TRUE),
y = sample(array_size, coordinates_size, replace = TRUE)
)
}
results <- bench::press(
array_size = c(10, 100, 1e3),
coordinates_size = c(100, 1e3, 10e3),
{
dat <- create_data(array_size, coordinates_size)
A <- dat[["A"]] ; x <- dat[["x"]] ; y <- dat[["y"]]
bench::mark(
sapply = {
sapply(1:length(x), function(i) A[x[i],,y[i]])
},
cbind = {
indices <- cbind(
rep(x, each = dim(A)[2]),
rep(seq_len(dim(A)[2]), times = length(x)),
rep(y, each = dim(A)[2])
)
array(A[indices], dim = c(dim(A)[2], length(x)))
}
)
}
)
ggplot2::autoplot(results)
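Note that, like sapply, the result above has one column per i. If you want the row-per-i layout shown in the question's expected output, wrap the result in t(); with the question's small example (a sketch):
A <- array(1:12, c(2, 2, 3))
x <- c(1, 2, 1)
y <- c(1, 2, 3)
indices <- cbind(rep(x, each = dim(A)[2]),
rep(seq_len(dim(A)[2]), times = length(x)),
rep(y, each = dim(A)[2]))
t(matrix(A[indices], nrow = dim(A)[2]))
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    6    8
#> [3,]    9   11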

Creating 10 categorical and 10 continuous random variables and save them as a data frame

I would like to create a data frame with 10 categorical and 10 continuous random variables. I can do it using the following loop.
p_val=rbeta(10,1,1) #10 probabilities
n=20
library(truncnorm)
mu_val=rtruncnorm(length(p_val),0,Inf, mean = 100, sd=5)#rnorm(length(p))
d_mat_cat=matrix(NA, nrow = n, ncol = length(p_val))
d_mat_cont= matrix(NA, nrow = n, ncol = length(p_val))
for ( j in 1:length(p_val)){
d_mat_cat[,j]=rbinom(n,1,p_val[j]) #Binary RV
d_mat_cont[,j]=rnorm(n,mu_val[j]) #Cont. RV
}
d_mat=cbind(d_mat_cat, d_mat_cont)
Any alternative options are appreciated.
rbinom is vectorized over prob, and rnorm is vectorized over mean, so you can use this:
cbind(
matrix(rbinom(n * length(p_val), size = 1, prob = p_val),
ncol = length(p_val), byrow = TRUE),
matrix(rnorm(n * length(mu_val), mean = mu_val),
ncol = length(mu_val), byrow = TRUE)
)
We can be a little clever with rep to make the call much cleaner:
p_val = c(0, 0.5, 1)
mu_val = c(1, 10, 100)
n = 4
##
matrix(
c(
rbinom(n * length(p_val), size = 1, prob = rep(p_val, each = n)),
rnorm(n * length(mu_val), mean = rep(mu_val, each = n))
),
nrow = n,
)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 1 1 1.1962718 9.373595 100.1739
# [2,] 0 0 1 -0.1854631 9.574706 100.0725
# [3,] 0 1 1 3.4873697 9.447363 100.1345
# [4,] 0 1 1 2.8467450 9.700975 101.3178
You can try using sapply to run rbinom and rnorm and cbind the data.
cbind(sapply(p_val, rbinom, n = n, size = 1), sapply(mu_val, rnorm, n = n))
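Since the question asks for a data frame with categorical variables, here is a minimal sketch (reusing p_val, mu_val and n from the question; the column names are made up) that wraps the simulated matrices in a data.frame and turns the binary columns into factors:
cat_mat <- matrix(rbinom(n * length(p_val), size = 1, prob = rep(p_val, each = n)), nrow = n)
cont_mat <- matrix(rnorm(n * length(mu_val), mean = rep(mu_val, each = n)), nrow = n)
dat <- data.frame(cat_mat, cont_mat)
names(dat) <- c(paste0("cat", seq_along(p_val)), paste0("cont", seq_along(mu_val)))
dat[seq_along(p_val)] <- lapply(dat[seq_along(p_val)], factor)
str(dat)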

Constrained Optimization in R not giving expected results

I have 2 given matrices
a1 <- matrix(c(0.4092951, 0.1611806, 0.4283178, 0.001206529), nrow = 1)
a2 <- matrix(c(0.394223557, 0.140443266, 0.463980790, 0.001352387), nrow = 1)
I have an initial matrix
b <- matrix(c(0.4095868, 0.1612955, 0.4286231, 0.0004946572,
0, 0.2732351, 0.7260891, 0.0006757670,
0, 0, 0.9909494, 0.0090505527,
0, 0, 0, 1), nrow = 4, byrow = T)
I need to update 'b' such that
a1 %*% b = a2
The above is an optimization problem where the objective is to minimize
sum(abs(a1 %*% b - a2))
i.e. to drive the discrepancy a1 %*% b - a2 to zero, subject to the constraints:
lower triangle of b = 0
row sums of b = 1
## creating a data vector with a1 and a2
data = c(as.numeric(a1), as.numeric(a2))
## objective function
min_obj <- function(p){
## Creating a matrix to recreate 'b'
p1 <- matrix(rep(0, 16), nrow = 4)
k = 1
for(i in 1:nrow(p1)){
for (j in 1:ncol(p1)){
if(j >= i){
p1[i,j] <- p[k]
k = k+1
}
}
}
actual <- matrix(data[1:(length(data)/2)], nrow = 1)
pred <- matrix(data[(length(data)/ 2 + 1):length(data)], nrow = 1)
s <- (actual %*% p1) - pred
sum(abs(s))
}
## Initializing the initial values for b taking only non-zero values
init <- b[b>0]
opt <- optim(init, min_obj, control = list(trace = T), method = "L-BFGS-B",
lower = rep(0, length(init)), upper = rep(1, length(init)))
transformed_b <- matrix(rep(0, 16), nrow = 4)
k = 1
for(i in 1:nrow(transformed_b)){
for (j in 1:ncol(transformed_b)){
if(j >= i){
transformed_b[i,j] <- opt$par[k]
k = k+1
}
}
}
transformed_b
The issue with transformed_b is that the row sums of the matrix are not 1. Any help is highly appreciated.
"optim" is the right choice. Since the row sums have to be 1, there are only 6 parameters, not 10 as in your attempt. The diagonal is uniquely determined by the values strictly above the diagonal.
a1 <- matrix(c(0.4092951, 0.1611806, 0.4283178, 0.001206529), nrow = 1)
a2 <- matrix(c(0.394223557, 0.140443266, 0.463980790, 0.001352387), nrow = 1)
b <- matrix(c(0.4095868, 0.1612955, 0.4286231, 0.0004946572,
0, 0.2732351, 0.7260891, 0.0006757670,
0, 0, 0.9909494, 0.0090505527,
0, 0, 0, 1), nrow = 4, byrow = T)
#======================================================================
# Build an upper triangular matrix with rowsums 1:
B <- function(x)
{
X <- matrix(c(0,x[1:3],0,0,x[4:5],0,0,0,x[6],rep(0,4)),4,4,byrow=TRUE)
diag(X) <- 1-rowSums(X)
return(X)
}
#----------------------------------------------------------------------
# The function we want to minimize:
f <- function(x)
{
return (sum((a1%*%B(x) - a2)^2))
}
#----------------------------------------------------------------------
#Optimization:
opt <- optim( par = c(b[1,2:4],b[2,3:4],b[3,4]),
fn = f,
lower = rep(0,6),
method = "L-BFGS-B" )
optB <- B(opt$par)
Result:
> optB
[,1] [,2] [,3] [,4]
[1,] 0.9631998 0.03680017 0.0000000 0.0000000000
[2,] 0.0000000 0.77820700 0.2217930 0.0000000000
[3,] 0.0000000 0.00000000 0.9998392 0.0001608464
[4,] 0.0000000 0.00000000 0.0000000 1.0000000000
> a1 %*% optB - a2
[,1] [,2] [,3] [,4]
[1,] 9.411998e-06 5.07363e-05 1.684534e-05 -7.696464e-05
> rowSums(optB)
[1] 1 1 1 1
I chose the sum of squares instead of the sum of absolute values since it is differentiable, which I suspect makes it easier for optim to find the minimum.
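As a quick illustration of why the row-sum constraint does not need to be handed to the optimizer at all: B() sets each diagonal entry to 1 minus the rest of its row, so the rows of B(x) sum to 1 for any parameter vector x.
rowSums(B(runif(6)))
#> [1] 1 1 1 1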
