Weighted rowSums of a matrix - r

I have a matrix like this:
I would like to sum every value of a single row but weighted.
Example: Given a specific row, the sum would be:
S = x1 * loan + x2 * mortdue + x3 * value + ...
x1, x2, x3, ... are predefined values.
I tried rowSums() and things like that but I have not been able to figure out how to do it properly.

You are looking for a matrix-vector multiplication. For example, if you have a matrix:
set.seed(0)
A <- matrix(round(rnorm(9), 1), 3)
# [,1] [,2] [,3]
#[1,] 1.3 1.3 -0.9
#[2,] -0.3 0.4 -0.3
#[3,] 1.3 -1.5 0.0
And you have another vector x, which holds the weights you described (x1, x2, x3, ...):
x <- round(rnorm(3), 1)
#[1] 2.4 0.8 -0.8
You can do
drop(A %*% x)
#[1] 4.88 -0.16 1.92
drop() just converts the resulting single-column matrix into a 1D vector.
You can have a quick check to see this is what you want:
sum(A[1, ] * x)
#[1] 4.88
sum(A[2, ] * x)
#[1] -0.16
sum(A[3, ] * x)
#[1] 1.92
Compared with rowSums(), you can think of this computation as a "weighted rowSums".
At the moment, it seems more likely that you have a data frame rather than a matrix. You can convert it to a matrix with as.matrix(); see the sketch below.
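A minimal sketch, assuming your data frame has columns loan, mortdue, and value (the numbers and weights below are made up purely for illustration):
df <- data.frame(loan    = c(1100, 1300, 1500),
                 mortdue = c(25000, 70000, 13000),
                 value   = c(39000, 68000, 16000))
x <- c(0.5, 0.2, 0.3)       # your predefined x1, x2, x3
drop(as.matrix(df) %*% x)   # one weighted sum per row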

split() returns "longer object length is not a multiple of shorter object length"

Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix, using the code fix provided by @robertdj:
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split the distances into 5 partitions
for (i in 1:length(splt)) {
  for (j in 1:length(splt)) {
    if (i != j) {
      a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
      b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
    }
  }
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per @Robert Hacken's comment, if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However, my real data does not have this "same size" property, and it contains many more partitions than the 5 in my MWE.
Here is my code
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix has FASTA headers as column names, from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to the distances from a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences of Species X itself.
splt would thus contain 10 partitions (not necessarily of the same length), one per species.
The expected output of a and b is a number between 0 and 1 inclusive. I think these numbers should be small in my real example, but they are large when I run my code, which I think is a consequence of the warning.
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need `%notin%` <- Negate(`%in%`).
However, %notin% gives the error from my original post:
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark conversation about which of my interpretations are correct (and which are not).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what fraction of splt[[i]] lies beyond the max (or below the min) of splt[[j]]?". Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]], and likewise for the others, we end up with a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since one of the two values in [[1]] is greater than 2's max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below 0.15, so we expect 0 there.
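A quick numeric check of that pair:
### 1 of 2 is at or above the max
sum(splt[[1]] >= max(splt[[2]])) / length(splt[[1]])
# [1] 0.5
### none is at or below the min
sum(splt[[1]] <= min(splt[[2]])) / length(splt[[1]])
# [1] 0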
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 2 of 10 are less than or equal to the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it (a sketch of that route appears after the comparisons below). I'll adapt your double-for loop premise into nested sapply calls.
Greater than the other's max
sapply(splt, function(y) {
  sapply(setNames(splt, paste0("max", seq_along(splt))),
         function(z) sum(y >= max(z)) / length(y))
})
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y) {
  sapply(setNames(splt, paste0("min", seq_along(splt))),
         function(z) sum(y <= min(z)) / length(y))
})
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
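For completeness, here is a minimal sketch of the outer() route mentioned above; it produces the transpose of the sapply orientation (rows here index the left-hand partition):
idx <- seq_along(splt)
# fraction of splt[[i]] at or above max(splt[[j]]); Vectorize() is needed
# because outer() passes whole vectors of indices at once
outer(idx, idx,
      Vectorize(function(i, j) sum(splt[[i]] >= max(splt[[j]])) / length(splt[[i]])))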
Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (e.g., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to both sum(.) and min(.) in the calls above. Thanks @compbiostats!

Using a Function With 2 Inputs in a Loop With Combinations of Inputs to Create a Data Frame

I've created a custom function to calculate values based on two inputs.
# function
info.theta <- function(theta, delta) {
  P = 1/(1+exp(-1*(theta-delta)))
  Q = 1 - P
  1*P*Q
}
I'd like to use the function to calculate the value for all possible combinations of values for two sequences of interest.
# for each input of the function create sequences of values to explore
thetas <- seq(-4, 4, by = .5)
deltas <- seq(-4, 4, by = .5)
I'd like to end up with a data frame with columns labeled thetas, deltas, and information, where thetas and deltas hold the sequence values used in the function, and information is the output of the function for each combination of theta and delta.
I'm at a loss for how to execute the last point, as this level of coding is new to me. My hunch was maybe a nested for loop. This is obviously not correct, but it is as close as I can get to a start. How would I use the function in the way I described to generate the desired data frame?
#nested for loop
y <- NULL
for (i in thetas) {
  for (j in deltas) {
    tmp <- info.theta(i, j)
    y <- rbind(y, tmp)
  }
}
y
You can use outer to get a matrix of values:
outer(thetas,deltas,info.theta)
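If you want that matrix labeled by its inputs, a small sketch:
m <- outer(thetas, deltas, info.theta)
dimnames(m) <- list(theta = thetas, delta = deltas)  # rows = theta, columns = delta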
A slight change to your original function:
info.theta <- function(theta, delta) {
  P = 1/(1+exp(-1*(theta-delta)))
  Q = 1 - P
  data.frame(theta=theta, delta=delta, information=1*P*Q)
}
Because data.frames are cooler.
Now:
td_grid<-expand.grid(thetas, deltas)
info.theta(td_grid[,1],td_grid[,2])
results in:
theta delta information
1 -4.0 -4.0 0.2500000000
2 -3.5 -4.0 0.2350037122
3 -3.0 -4.0 0.1966119332
4 -2.5 -4.0 0.1491464521
5 -2.0 -4.0 0.1049935854
6 -1.5 -4.0 0.0701037165
7 -1.0 -4.0 0.0451766597
8 -0.5 -4.0 0.0284530239
9 0.0 -4.0 0.0176627062
10 0.5 -4.0 0.0108662297
11 1.0 -4.0 0.0066480567
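Alternatively, a small sketch that keeps the original function (the version returning 1*P*Q) unchanged and builds the data frame around it:
td_grid <- expand.grid(theta = thetas, delta = deltas)
td_grid$information <- info.theta(td_grid$theta, td_grid$delta)
head(td_grid)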

Fast algorithm for calculating a second-order adjacency matrix from a first-order adjacency matrix for a probabilistic directed graph

I'm working with adjacency matrices that look like this:
N <- 5
A <- matrix(round(runif(N^2),1),N)
diag(A) <- 0
1> A
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0 0.1 0.2 0.6 0.9
[2,] 0.8 0.0 0.4 0.7 0.5
[3,] 0.6 0.8 0.0 0.8 0.6
[4,] 0.8 0.1 0.1 0.0 0.3
[5,] 0.2 0.9 0.7 0.9 0.0
Probabilistic and directed.
Here is a slow way to calculate the probability that i is linked to j through at least one other node:
library(foreach)
`%ni%` <- Negate(`%in%`) # opposite of `%in%`
union.pr <- function(x){ # Function to calculate the union of many probabilities
  if (length(x) == 1){return(x)}
  pr <- sum(x[1:2]) - prod(x[1:2])
  i <- 3
  while(i <= length(x)){
    pr <- sum(pr,x[i]) - prod(pr,x[i])
    i <- 1+i
  }
  pr
}
second_order_adjacency <- function(A, i, j){ # probability that i is linked to j through some other node
  pr <- foreach(k = (1:nrow(A))[1:nrow(A) %ni% c(i,j)], .combine = c) %do% {
    A[i,k]*A[k,j]
  }
  union.pr(pr)
}
#loop through the indices...
A2 <- A * NA
for (i in 1:N){
  for (j in 1:N){
    if (i != j){
      A2[i,j] <- second_order_adjacency(A, i, j)
    }
  }
}
diag(A2) <- 0
1> A2
[,1] [,2] [,3] [,4] [,5]
[1,] 0.000000 0.849976 0.666112 0.851572 0.314480
[2,] 0.699040 0.000000 0.492220 0.805520 0.831888
[3,] 0.885952 0.602192 0.000000 0.870464 0.790240
[4,] 0.187088 0.382128 0.362944 0.000000 0.749960
[5,] 0.954528 0.607608 0.440896 0.856736 0.000000
This algorithm scales like N^2, and I've got thousands of nodes. And my matrix isn't all that sparse -- a lot of small numbers with a few big ones. I could parallelize it, but I'd only be dividing by the number of cores. Is there some vectorized trick that allows me to take advantage of the relative speed of vectorized operations?
tl;dr: how can I quickly calculate a second-order adjacency matrix in a probabilistic directed graph?
Your union.pr function is roughly 500 times slower than the equivalent direct computation. Replace union.pr(pr) with 1-prod(1-pr) and you get about a 500x speed-up.
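The identity behind this: for independent link probabilities p_1, ..., p_m, the probability that at least one path exists is
P(at least one) = 1 - (1 - p_1)(1 - p_2)...(1 - p_m)
which is exactly what union.pr computes iteratively, just much more slowly.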
x <- runif(1000)*0.01
t1 <- proc.time()
for (i in 1:10000){
  y <- union.pr(x)
}
t1 <- proc.time()-t1
print(t1)
# user system elapsed
# 21.09 0.02 21.18
t2 <- proc.time()
for (i in 1:10000){
  y <- 1-prod(1-x)
}
t2 <- proc.time() - t2
print(t2)
# user system elapsed
# 0.04 0.00 0.03
So @Julius's answer was useful for reminding me of some elementary probability rules, but it didn't speed up the computation much. The following function, however, helps a ton:
second_order_adjacency2 <- function(A, i, j){ # probability that i is linked to j through some other node
  a1 <- A[i, 1:nrow(A) %ni% c(i,j)]
  a2 <- t(A)[j, 1:nrow(A) %ni% c(i,j)]
  1 - prod(1 - a1*a2)
}
It still scales like N^2 because it is called inside the double loop, but it takes advantage of vectorization when computing the various paths from i to j, so it is much faster.
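For completeness, a hedged sketch of a fully vectorized variant (the function name is mine, not from the code above). Because diag(A) is zero, the k = i and k = j terms contribute a factor of exactly 1 and need no special handling, so the whole matrix can be built with one outer() per intermediate node:
second_order_adjacency_all <- function(A) { # hypothetical helper; assumes diag(A) == 0
  logs <- matrix(0, nrow(A), ncol(A))
  for (k in seq_len(nrow(A))) {
    # accumulate log(1 - A[i,k]*A[k,j]) over intermediate nodes k
    logs <- logs + log1p(-outer(A[, k], A[k, ]))
  }
  A2 <- 1 - exp(logs)
  diag(A2) <- 0
  A2
}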

I want to save the results of the simulation from this programme

# library (energy)
RR=100
n=10
a=2
b=4
miu1=2
miu2=4
m22=(b^2)*(1-(rho^2))
# This is the point where I am having a problem:
# I want the programme to retain the results average0.1, average0.05 and
# average0.01 for every 'rho' from rho_list used for the simulation,
# but I am stuck because I don't know how to get the results out
rho_list=c(0,0.3,0.6)
for (rho in rho_list){
  energy=rep(NA,RR)
  for (i in 1:RR){
    z1=rnorm(n,0,1)
    z2=rnorm(n,0,1)
    x1=miu1+a*z1
    x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
    X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
    energy[i]=mvnorm.etest(X)$p.value
  }
  average0.1=sum(energy<=0.1)/RR
  average0.05=sum(energy<=0.05)/RR
  average0.01=sum(energy<=0.01)/RR
}
I want the program to retain the results average0.1, average0.05 and average0.01 for every rho in rho_list, but I am stuck because I don't know how to get the results out of the loop.
Your example is not reproducible, so I'm giving you some simulated data to demonstrate how to output the result.
rho_list=c(0,0.3,0.6)
result <- sapply(rho_list, FUN = function(rho, ...) {
  average0.1 = runif(1)
  average0.05 = runif(1)
  average0.01 = runif(1)
  c(rho = rho, a01 = average0.1, a0.05 = average0.05, a0.01 = average0.01)
}, RR = RR, n = n, a = a, b = b, miu1 = miu1, miu2 = miu2, m22 = m22, simplify = FALSE)
do.call("rbind", result)
rho a01 a0.05 a0.01
[1,] 0.0 0.0136175 0.08581583 0.07171591
[2,] 0.3 0.8334469 0.42103038 0.07857328
[3,] 0.6 0.8231120 0.40647485 0.65408540
One option would be to store the results in a list for each value of rho and then bind them into a single data frame. Here's an example. Note that since rho isn't defined in the set-up code, I've substituted the definition of m22 for m22 in the loop. Also, I've set RR=10 to save time in running the code.
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
rho_list=c(0, 0.3, 0.6)
energy_threshold = c(0.1, 0.05, 0.01) # Store energy thresholds in a vector
# Create a list of data frames. Each data frame contains the result for each
# of the three energy thresholds for one value of rho.
results = lapply(rho_list, function(rho) {
  energy=rep(NA,RR)
  for (i in 1:RR) {
    z1=rnorm(n,0,1)
    z2=rnorm(n,0,1)
    x1=miu1+a*z1
    x2=miu2+(rho*b*z1)+(sqrt((b^2)*(1-(rho^2)))*z2)
    X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
    energy[i]=mvnorm.etest(X)$p.value
  }
  data.frame(rho, energy_threshold, result=sapply(energy_threshold, function(y) sum(energy <= y)/RR))
})
# Bind the three data frames into a single data frame
results = do.call(rbind, results)
And here's the output:
results
rho energy_threshold result
1 0.0 0.10 0.1
2 0.0 0.05 0.0
3 0.0 0.01 0.0
4 0.3 0.10 0.2
5 0.3 0.05 0.1
6 0.3 0.01 0.0
7 0.6 0.10 0.0
8 0.6 0.05 0.0
9 0.6 0.01 0.0
I stored the results from the loop in numeric vectors and then used rbind() to append each new value. Here is the entire code:
library(energy)
RR=10
n=10
a=2
b=4
miu1=2
miu2=4
average0.1 <- as.numeric()
average0.05 <- as.numeric()
average0.01 <- as.numeric()
# This is the point where I am having a problem:
# I want the programme to retain the results average0.1, average0.05 and
# average0.01 for every 'rho' from rho_list used for the simulation,
# but I am stuck because I don't know how to get the result
rho_list=c(0,0.3,0.6)
for (rho in unique(rho_list)){
  m22=(b^2)*(1-(rho^2)) # moved inside the loop so it is defined and updates with rho
  energy=rep(NA,RR)
  for (i in 1:RR){
    z1=rnorm(n,0,1)
    z2=rnorm(n,0,1)
    x1=miu1+a*z1
    x2=miu2+(rho*b*z1)+(sqrt(m22)*z2)
    X=matrix(c(x1,x2),byrow=TRUE,ncol=2)
    energy[i]=mvnorm.etest(X)$p.value
  }
  average0.1=rbind(average0.1, sum(energy<=0.1)/RR)
  average0.05=rbind(average0.05, sum(energy<=0.05)/RR)
  average0.01=rbind(average0.01, sum(energy<=0.01)/RR)
}
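After the loop, the three vectors can be combined with rho_list into one small data frame, for example:
data.frame(rho = rho_list,
           average0.1  = as.vector(average0.1),
           average0.05 = as.vector(average0.05),
           average0.01 = as.vector(average0.01))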

Select matrix based in intervals (subset function) in R

I would like to subset a matrix based on the values in its X2 column. I tried this:
on <- subset(mat.num, X2 <= -3)
un <- subset(mat.num, X2 >= -1.50000 & X2 <= -0.3599999)
dn <- subset(mat.num, X2 >= -0.3599998 & X2 <= 0.5)
But I get this error:
Error in subset.matrix(mat.num, X2 <= -3) : object 'X2' not found.
PS: I do have an X2 column. head(mat.num):
T_EBV X2
[1,] 0.09 -0.00777840
[2,] 0.26 0.03600431
[3,] 0.20 -0.06191900
[4,] 0.25 0.13423752
[5,] 0.42 0.06354759
[6,] -0.20 0.06303164
The matrix method doesn't reference column names the same way that you can with data frames. You probably want:
subset(mat.num, mat.num[,2] <= -3)
If you look at the code for subset.matrix you'll see that it's not evaluating the subset criteria inside any special environment:
if (missing(subset))
subset <- TRUE
else if (!is.logical(subset))
stop("'subset' must be logical")
x[subset & !is.na(subset), vars, drop = drop]
as opposed to subset.data.frame, which uses eval and substitute.
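Two equivalent workarounds (a sketch, assuming the column really is named "X2"): index the column by name, or convert the matrix to a data frame so that subset() can evaluate X2 inside it.
on <- subset(mat.num, mat.num[, "X2"] <= -3)
on_df <- subset(as.data.frame(mat.num), X2 <= -3)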
