I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
The syntax mat[rn, cn] (whether mat is a dense or a sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select only the elements (rn[1], cn[1]), (rn[2], cn[2]), and so on. The correct syntax for that is mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
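To make the difference between the two indexing forms concrete, here is a small sketch on a toy sparse matrix. The matrix `m` and the index vectors `rn` and `cn` are made up for illustration and are not taken from your data. With k index pairs, the form m[rn, cn] <- 1 writes a k x k block, while m[cbind(rn, cn)] <- 1 writes only k cells.

```r
library(Matrix)

# Toy 4 x 5 sparse matrix with three nonzero entries (illustrative values).
m <- sparseMatrix(i = c(1, 2, 3), j = c(2, 3, 4), x = c(5, 6, 7), dims = c(4, 5))

rn <- c(1, 2, 3)
cn <- c(2, 3, 4)

# Block assignment: every combination of rn and cn is overwritten (3 x 3 cells).
block <- m
block[rn, cn] <- 1

# Matrix indexing: only the pairs (1,2), (2,3), (3,4) are overwritten.
pairs <- m
pairs[cbind(rn, cn)] <- 1

cells_written_block <- sum(block)  # all written cells hold 1, the rest are 0
cells_written_pairs <- sum(pairs)
```

With millions of index pairs, the block form asks for a block with millions of rows and millions of columns, which would explain why the assignment never finished.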
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame holding the (i, j, x) triplets, where both i and j are indices starting from 1. You could have worked with this nice output directly, as follows:
# Getting the (i, j, x) triplets (stored as a data.frame) for `m1`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
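Two things are worth checking when working on the slots directly. This is only a sketch on toy data (the small `m1` and `m2` below are made up): it assumes both matrices are column-compressed (class dgCMatrix, which is what sparseMatrix() returns here), verifies that their `i` and `p` slots really are identical before reusing one logical index for both, and uses drop0() from the Matrix package to remove the explicit zeros that writing 0 into the `x` slot leaves behind.

```r
library(Matrix)

# Toy matrices sharing the same nonzero pattern (illustrative values only).
i <- c(1, 3, 2, 3)
j <- c(1, 1, 2, 3)
m1 <- sparseMatrix(i = i, j = j, x = c(0.01, 0.03, 0.50, 0.90))
m2 <- sparseMatrix(i = i, j = j, x = c(3, 2, 1, 2))

# The slot trick is only valid if both matrices store the same positions
# in the same order, so check the row indices and column pointers first.
same_pattern <- identical(m1@i, m2@i) && identical(m1@p, m2@p)
stopifnot(same_pattern)

ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0

# Writing 0 into @x keeps those entries stored as explicit zeros.
stored_with_zeros <- length(m1@x)

# drop0() returns a matrix with the explicit zeros removed.
m1 <- drop0(m1)
stored_after_drop <- length(m1@x)
```

Note that after drop0() the two matrices no longer share the same pattern, so any further slot-based updates of this kind should be done before dropping the zeros.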
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting the (i, j, x) triplets (stored as a data.frame) for `m1`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
I'm working with the popbio package on a population model. It looks something like this:
library(popbio)
babies <- 0.3
kids <- 0.5
teens <- 0.75
adults <- 0.98
A <- c(0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults
)
A <- matrix ((A), ncol=6, byrow = TRUE)
N<-c(10,10,10,10,10,10)
N<-matrix (N, ncol=1)
model <- pop.projection(A,N,iterations=10)
model
I'd like to know how I can randomise the input so that at each iteration, which represents a year in this case, I'd get a different input for the matrix elements. So, for instance, my model runs for 10 years, and I'd like to have the baby survival rate change for each year. babies <- rnorm(1,0.3,0.1) doesn't do it, because that still leaves me with a single value, just randomly selected.
Update: This is distinct from running 10 separate models with different initial, random values. I'd like the update to occur within a single model run, which itself has 10 iterations in the pop.projection function.
Hope you can help.
I know this answer is very late, but here's one approach using expressions. First, use an expression to create the matrix.
vr <- list( babies=0.3, kids=0.5, teens=0.75, adults=0.98 )
Ax <- expression( matrix(c(
0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults), ncol=6, byrow = TRUE ))
A1 <- eval(Ax, vr)
lambda(A1)
[1] 1.011821
Next, use an expression to create vital rates with rnorm or other functions.
vr2 <- expression( list( babies=rnorm(1,0.3,0.1), kids=0.5, teens=0.75, adults=0.98 ))
A2 <- eval(Ax, eval( vr2))
lambda(A2)
[1] 1.014586
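The reason the expression helps is that rnorm() is called again every time the expression is evaluated, whereas a plain list stores the value drawn when the list was created. A minimal sketch of that difference, using the hypothetical names `vr_fixed` and `vr_expr` (not part of popbio):

```r
set.seed(1)

# Plain list: rnorm() runs once, here, and the value is stored.
vr_fixed <- list(babies = rnorm(1, 0.3, 0.1))

# Expression: rnorm() runs again at every eval().
vr_expr <- expression(list(babies = rnorm(1, 0.3, 0.1)))

fixed_draws <- sapply(1:5, function(k) vr_fixed$babies)
fresh_draws <- sapply(1:5, function(k) eval(vr_expr)$babies)

n_distinct_fixed <- length(unique(fixed_draws))  # the same value five times
n_distinct_fresh <- length(unique(fresh_draws))  # a new draw per evaluation
```

One thing to keep in mind is that a normal draw for a survival rate can fall below 0 or above 1, so you may want to truncate it or draw from a bounded distribution instead.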
Apply the expression to 100 matrices.
x <- sapply(1:100, function(x) lambda(eval(Ax, eval(vr2))))
quantile(x, c(.05,.95))
5% 95%
0.996523 1.025900
Finally, make two small changes to pop.projection by adding the vr option and a line to evaluate A at each time step.
pop.projection2 <- function (Ax, vr, n, iterations = 20)
{
x <- length(n)
t <- iterations
stage <- matrix(numeric(x * t), nrow = x)
pop <- numeric(t)
change <- numeric(t - 1)
for (i in 1:t) {
stage[, i] <- n
pop[i] <- sum(n)
if (i > 1) {
change[i - 1] <- pop[i]/pop[i - 1]
}
## evaluate Ax
A <- eval(Ax, eval(vr))
n <- A %*% n
}
colnames(stage) <- 0:(t - 1)
w <- stage[, t]
pop.proj <- list(lambda = pop[t]/pop[t - 1], stable.stage = w/sum(w),
stage.vectors = stage, pop.sizes = pop, pop.changes = change)
pop.proj
}
n <-c(10,10,10,10,10,10)
pop.projection2(Ax, vr2, n, 10)
$lambda
[1] 0.9874586
$stable.stage
[1] 0.33673579 0.11242588 0.08552367 0.02189786 0.02086656 0.42255023
$stage.vectors
0 1 2 3 4 5 6 7 8 9
[1,] 10 11.590000 16.375700 19.108186 20.2560223 20.5559445 20.5506251 20.5898222 20.7603581 20.713271
[2,] 10 4.147274 3.332772 4.443311 5.6693931 1.9018887 6.8455597 5.3879202 10.5214540 6.915534
[3,] 10 5.000000 2.073637 1.666386 2.2216556 2.8346965 0.9509443 3.4227799 2.6939601 5.260727
[4,] 10 5.000000 2.500000 1.036819 0.8331931 1.1108278 1.4173483 0.4754722 1.7113899 1.346980
[5,] 10 7.500000 3.750000 1.875000 0.7776139 0.6248948 0.8331209 1.0630112 0.3566041 1.283542
[6,] 10 17.300000 22.579000 24.939920 25.8473716 25.9136346 25.8640330 25.9715930 26.2494195 25.991884
$pop.sizes
[1] 60.00000 50.53727 50.61111 53.06962 55.60525 52.94189 56.46163 56.91060 62.29319 61.51194
$pop.changes
[1] 0.8422879 1.0014610 1.0485765 1.0477793 0.9521023 1.0664832 1.0079517 1.0945797 0.9874586
I have the following code which returns the model with the lowest AIC, but I want all the models with their AIC in ascending or descending order without using the built-in sort function in R
sp <- rnorm(100) ## just some toy data to make code work!
spfinal.aic <- Inf
spfinal.order <- c(0,0,0)
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
spcurrent.aic <- AIC(arima(sp, order=c(p, d, q)))
if (spcurrent.aic < spfinal.aic) {
spfinal.aic <- spcurrent.aic
spfinal.order <- c(p, d, q)
spfinal.arima <- arima(sp, order=spfinal.order)
}
}
I want spfinal.order <- c(p, d, q) to be a list of all models in ascending or descending order of AIC. How can I do that?
The code below does what you want. As you want a record of all models having been tried, no comparison is done inside the loop. A vector aic.vec will hold AIC values of all models, while a matrix order.matrix will hold column-by-column the ARIMA specification. In the end, we sort both by ascending AIC values, so that you know the best model is the first one.
sp <- rnorm(100) ## just some toy data to make code work!
order.matrix <- matrix(0, nrow = 3, ncol = 4 * 2 * 4)
aic.vec <- numeric(4 * 2 * 4)
k <- 1
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
order.matrix[, k] <- c(p,d,q)
aic.vec[k] <- AIC(arima(sp, order=c(p, d, q)))
k <- k+1
}
ind <- order(aic.vec, decreasing = FALSE)
aic.vec <- aic.vec[ind]
order.matrix <- order.matrix[, ind]
I did not use a list to store the ARIMA specification, because I think a matrix is better. At the moment the matrix is in wide format, i.e., with 3 rows and many columns. You can transpose it for better printing:
order.matrix <- t(order.matrix)
Maybe you also want to bind order.matrix and aic.vec together for better presentation? Do this:
result <- cbind(order.matrix, aic.vec)
colnames(result) <- c("p", "d", "q", "AIC")
I think this makes it easier for you to inspect. Example output (first 5 rows):
> result
p d q AIC
[1,] 2 0 2 305.5698
[2,] 3 0 3 305.8882
[3,] 1 0 3 307.8365
[4,] 2 1 3 307.9137
[5,] 1 1 2 307.9952
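A variation you might consider, sketched here under my own assumptions rather than as part of the code above: arima() can fail to fit for some orders on some data, which would stop the whole loop. Wrapping the fit in tryCatch() records NA for such models instead. The helper name `safe_aic` is hypothetical, and the grid is kept smaller than yours to run quickly.

```r
set.seed(1)
sp <- rnorm(100)  # toy data, as in the question

# All (p, d, q) combinations as rows of a data frame.
grid <- expand.grid(p = 1:2, d = 0:1, q = 1:2)

# Return the AIC of one model, or NA if the fit throws an error.
safe_aic <- function(p, d, q) {
  tryCatch(AIC(arima(sp, order = c(p, d, q))),
           error = function(e) NA_real_)
}

grid$AIC <- mapply(safe_aic, grid$p, grid$d, grid$q)

# order() sorts ascending and puts NA values last by default.
result <- grid[order(grid$AIC), ]
```

The first row of `result` is then the model with the lowest AIC among those that could be fitted.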
I'm trying to write a function that performs a given number (n) of t-tests on a random set of normal data of size k. The output should be a count of the total number of significant (<0.05) t-tests and a ratio of significant to overall t-tests. I wrote this function below:
StatPractice <- function(n, k) {
i = 1
length <- k
size <- n
while(i <= size){
k1 <- rnorm(length)
k2 <- rnorm(length)
t <- t.test(k1, k2)
p <- cbind(t$p.value)
i <- i + 1;
q <- c(p <= 0.05)
count <- length(q[q==TRUE])
prop <- count/size
print(q)
}
cat("count of significant t-tests:", count, "\n",
"proportion of significant t-tests:", prop, "\n")
}
I've tooled with this in a number of ways, but essentially, the output is something like this:
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
count of significant t-tests: 0
proportion of significant t-tests: 0
Could someone help me figure out why the count is unable to recognize q as a single vector and thus unable to give correct output for number of TRUE values?
You need a vector to store the p-values. Currently, the object p stores the last value only. You can create a vector before the loop starts (p <- numeric(size)). Within the loop, you assign the current p-value to the vector p at index i (p[i] <- t$p.value). The counting of significant p-values has to be done after the loop. Below is a modified version of your function.
StatPractice <- function(n, k) {
i = 1
length <- k
size <- n
p <- numeric(size)
while(i <= size){
k1 <- rnorm(length)
k2 <- rnorm(length)
t <- t.test(k1, k2)
p[i] <- t$p.value
i <- i + 1
}
q <- p <= 0.05
count <- sum(q)
prop <- count/size
print(q)
cat("count of significant t-tests:", count, "\n",
"proportion of significant t-tests:", prop, "\n")
}
Note that length(q[q==TRUE]) has been replaced with the simpler command sum(q). Furthermore, the function prints q only once, after the loop, instead of at every iteration.
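If you do not need the explicit while loop, the same simulation can be written more compactly with replicate(). This is only a sketch of an alternative, not a change to the function above; the name `stat_practice` and the choice to return a list instead of printing are my own assumptions.

```r
# Run n two-sample t-tests on standard normal samples of size k and
# summarise how many p-values are at or below 0.05.
stat_practice <- function(n, k) {
  p <- replicate(n, t.test(rnorm(k), rnorm(k))$p.value)
  significant <- p <= 0.05
  list(count = sum(significant), prop = mean(significant))
}

set.seed(1)
res <- stat_practice(200, 30)
```

Here mean(significant) gives the proportion directly, because TRUE counts as 1 and FALSE as 0.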
I have got a triple summation expression like this
sum(l from 1 to n)
sum(i from 1 to m)
sum(t from 1 to m)
phil_z1_1[i] * phil_z1_1[t] * I(x[l] < min(y[i], y[t]))
I have done:
set.seed(1234567)
x <- rnorm(2900)
n <- length(x)
y <- rnorm(3000)*0.25
m <-length(y)
z1 <- runif(m,min=0,max=1)
z2 <- runif(m,min=0,max=1)
phil_z1_1 <- sqrt(12*(z1/z2))
for min(y[i],y[t]) I have done something like
y_m<-matrix(rep(y,length(y)),ncol=length(y))
y_m_t<-t(y_m)
y_min<-pmin(y_m_t,y_m)
After expanding the two inner summations, for example with m=2 and n=3,
I can put the original expression into the matrices like x*A*x'
where
x=[phil_z1_1[1] phil_z1_1[2]]
A is a 2*2 matrix
A=[sum(l from 1 to n) I(x[l]<=min(y[1],y[1])), sum(l from 1 to n) I(x[l]<=min(y[1],y[2])); sum(l from 1 to n) I(x[l]<=min(y[2],y[1])), sum(l from 1 to n) I(x[l]<=min(y[2],y[2]))]
Therefore,
x*A*x'=[phil_z1_1[1] phil_z1_1[2]]*[sum(l from 1 to n) I(x[l]<=min(y[1],y[1])), sum(l from 1 to n) I(x[l]<=min(y[1],y[2])); sum(l from 1 to n) I(x[l]<=min(y[2],y[1])), sum(l from 1 to n) I(x[l]<=min(y[2],y[2]))]*[phil_z1_1[1] phil_z1_1[2]]'
Basically, I want to create an m*m matrix A in which each element is equal to the corresponding sum. For example, sum(l from 1 to n) I(x[l]<=min(y[1],y[1])) will be the element a11 of the matrix A I want to create.
I have tried to use
args <- expand.grid(l=1:n, i=1:m, t=1:m)
args <- subset(args, x[l] <= pmin(y[i],y[t])-z1[i]*z2[t])
args <- transform(args, result=phil_z1_1[i]*phil_z1_1[t])
sum(args[,"result"])
But R cannot run the above code, because the data set is too big (the sample size is around 3,000).
Can someone tell me how to solve this problem?
Thanks in advance!
Here is a matrix approach for your triple sum
set.seed(1234567)
n <- 10
x <- rnorm(n)
m <- 3000
y <- rnorm(m)/4
y_m <- pmin(matrix(rep(y,m), ncol=m, byrow=TRUE), y)
z1 <- runif(m,min=0,max=1)
z2 <- runif(m,min=0,max=1)
phi <- sqrt(12*(z1/z2))
phi_m <- phi %o% phi
f1 <- function(l) sum(phi_m * (x[l] < y_m))
sum(sapply(1:n, f1))
[1] 242034847337
It is not lightning fast, but much faster than the data.frame approach
f2 <- function(lrng) {
args <- expand.grid(l=lrng, i=1:m, t=1:m)
args <- subset(args, x[l] <= pmin(y[i],y[t]))
args <- transform(args, result=phi[i]*phi[t])
sum(args[,"result"])
}
sum(sapply(1:n, f2)) # 90 times slower
[1] 242034847337
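The quadratic form described in the question can also be computed directly, without looping over l. The following is only a sketch of a different technique from the code above: it sorts x once and uses findInterval() to count, for every pair (i, t), how many x[l] lie strictly below min(y[i], y[t]). The sizes are kept tiny so that a brute-force triple loop can confirm the result, and the names `A`, `quad_form` and `brute` are introduced here for illustration.

```r
set.seed(1)
n <- 20
m <- 15
x <- rnorm(n)
y <- rnorm(m) / 4
phi <- sqrt(12 * runif(m) / runif(m))

# y_min[i, t] = min(y[i], y[t])
y_min <- outer(y, y, pmin)

# A[i, t] = number of l with x[l] < y_min[i, t].
# With left.open = TRUE, findInterval() counts sorted values strictly below.
A <- matrix(findInterval(y_min, sort(x), left.open = TRUE), m, m)

# The triple sum as the quadratic form phi' A phi.
quad_form <- drop(t(phi) %*% A %*% phi)

# Brute-force check of the same triple sum.
brute <- 0
for (l in 1:n) for (i in 1:m) for (t in 1:m) {
  if (x[l] < min(y[i], y[t])) brute <- brute + phi[i] * phi[t]
}
```

The cost of building A no longer depends on n beyond the initial sort, so this should scale to n = 2900 and m = 3000, though I have only checked it on the small example here.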