-Inf content in a Position Weight Matrix - r

What does it mean when I have -Inf content in some positions of a Position Weight Matrix?
I am using the seqLogo package. For plotting the seqLogo:
library(seqLogo)
seqLogo(weight_matrix, ic.scale=TRUE, xaxis=TRUE, yaxis=TRUE, xfontsize=15, yfontsize=15)
and I have:
Error in seqLogo(weight_matrix, ic.scale = TRUE, xaxis = TRUE, yaxis =
TRUE, : Columns of PWM must add up to 1.0

The error says it directly: each column must sum to 1, because a column holds the probabilities of the four bases at one position. See the example:
Below works fine, using example m matrix from seqLogo package:
library(seqLogo)
# get example matrix
mFile <- system.file("Exfiles/pwm1", package="seqLogo")
m <- read.table(mFile)
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 1 1 1 1 1 1 1
# plot, all great!
seqLogo(m)
Now let's change one of the values so that a column sum is more than 1. This gives the error.
m[1, 1] <- 1
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 2 1 1 1 1 1 1 1
seqLogo(m)
# Error in seqLogo(m) : Columns of PWM must add up to 1.0
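Another possible reason is that the matrix values are already log-transformed, which would also explain the -Inf entries: the log of a probability of 0 is -Inf. If the values are log2 ratios against a uniform background of 0.25, convert them back to probabilities using: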
plotMatrix <- 2 ^ weight_matrix * 0.25
then plot:
seqLogo(plotMatrix)
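To illustrate where the -Inf values come from and that the back-transformation removes them, here is a minimal sketch. The matrix prob is made-up example data, and the sketch assumes log2 odds against a uniform background of 0.25, as in the formula above.

```r
# Hypothetical probability matrix (rows A, C, G, T; one column per position)
prob <- matrix(c(0.00, 0.80, 0.20, 0.00,
                 0.25, 0.25, 0.25, 0.25,
                 0.50, 0.00, 0.00, 0.50),
               nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Log-odds against a uniform background: zero probabilities become -Inf
log_odds <- log2(prob / 0.25)

# Back-transform: 2^(-Inf) is 0, so the -Inf entries turn into probability 0
recovered <- 2^log_odds * 0.25

# Every column sums to 1 again, which is what seqLogo() checks
colSums(recovered)
```

If colSums() on the back-transformed matrix still does not return 1 for every column, the matrix was probably built with a different log base or background.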


Calculating weight ratios in the presence of empty cells

I have a sample which needs to be weighted in order to represent the population.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
V1 V2 V3 V4
1: 1 0 2 2
2: 3 4 3 0
population <- fread("
10,20,20,10
30,40,20,10
")
This weight would simply be:
weights <- population/sample
V1 V2 V3 V4
1: 10 Inf 10.000000 5
2: 10 10 6.666667 Inf
However, because V2 in row 1 of the sample has no observations, it receives an infinite weight (Note that also V4 in row 2 receives an Inf, but this is easier to solve, because the weight is irrelevant, as there are no observations in either the sample or the population).
A solution to the problem would be to count V1 and V2 together in the sample and the population.
EDIT:
After some thought I realised that, for the weights to be correct, only the population values have to be adapted. If V1 and V2 in row 1 of population are added together in V1 of population, this will already lead to the correct weight for the sample observation of V1 in row 1. The value of V2 becomes irrelevant because there is no observation in the sample to receive that weight.
End of EDIT
The observation would then get a weight of:
(population[1,1]+population[1,2])/(sample[1,1]+sample[1,2])
(10+20)/(1+0)=30
In my actual data, however, there are many more rows, with a 0 in the sample here and there. I am trying to figure out if there is a way to write my code so that I do not have to do this manually.
Desired outcome (notice that the weight of V1 row 1 is now 30):
weights
V1 V2 V3 V4
1: 30 0 10.000000 5
2: 10 10 6.666667 0
Attempt
I was thinking of doing something like:
for (i in seq_along(ncol(sample))) {
lapply(population, (ifelse(sample[i]==0), population[i]<-population[i+1], population[i])
}
Here the population value of the cell to the right would be added when the value in the sample is zero. However, I am having trouble getting the syntax right, and even if it worked, it would not solve the case where V4 is 0.
Here is a rather verbose solution. If there were more columns that should be aggregated when the sample contains zeros, I would have proposed a more flexible approach, but this seems sufficient for your example:
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
population <- fread("
10,20,20,10
30,40,20,10
")
# aggregate Values if sample is zero
population[sample$V1 == 0, `:=`(V1 = 0,
V2 = V1 + V2)]
population[sample$V2 == 0, `:=`(V1 = V1 + V2,
V2 = 0)]
weights <- population/sample
# Fix NaNs
weights[is.na(weights), ] <- 0
weights
#> V1 V2 V3 V4
#> 1: 30 0 10.000000 5
#> 2: 10 10 6.666667 Inf
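A slightly more general variant can be sketched in base R with plain matrices. This is only an illustration under stated assumptions: merge_pairs is a hypothetical list naming which column pairs may be pooled, and every remaining non-finite weight (Inf or NaN) is set to 0, which matches the desired outcome in the question rather than the output above.

```r
# Same example data as above, as plain numeric matrices
cols <- paste0("V", 1:4)
sample_m     <- matrix(c(1, 0, 2, 2,  3, 4, 3, 0), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, cols))
population_m <- matrix(c(10, 20, 20, 10,  30, 40, 20, 10), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, cols))

# Hypothetical configuration: column pairs whose population counts are pooled
# when exactly one of the two has no sample observations
merge_pairs <- list(c("V1", "V2"))

pop_adj <- population_m
for (p in merge_pairs) {
  a <- p[1]; b <- p[2]
  total  <- population_m[, a] + population_m[, b]
  only_a <- sample_m[, a] == 0 & sample_m[, b] != 0  # a is empty, b is observed
  only_b <- sample_m[, b] == 0 & sample_m[, a] != 0  # b is empty, a is observed
  pop_adj[, a] <- ifelse(only_b, total, ifelse(only_a, 0, population_m[, a]))
  pop_adj[, b] <- ifelse(only_a, total, ifelse(only_b, 0, population_m[, b]))
}

weights_m <- pop_adj / sample_m
# Assumption: cells without sample observations get weight 0
weights_m[!is.finite(weights_m)] <- 0
weights_m
```

Adding further pairs to merge_pairs extends the pooling to other columns without writing one statement per case.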

A vector is created from different vectors and I want to find the starting and ending positions of these vectors in the vector created from them

These vectors will always be in increasing order such as 1 ..2 ... 3 ..4. They cannot decrease. Let's say I have three vectors as an example.
v1 <- c(1,3)
v2 <- c(2)
v3 <- c(1,3,4)
And I have a vector that was created from these vectors:
vsum <- c(v2, v1, v3)
Now I want to write code that finds the position where each vector (v1, v2, v3) starts and ends in vsum. In this case, the starting positions would look like
start <- c(1,2,4)
because if I run vsum these are the starting positions of each vector.
2 1 3 1 3 4
the ending position would look like
end <- c(1,3,6)
because these are ending positions
2 1 3 1 3 4
You can wrap your vectors in a list and use lengths with cumsum:
v1 <- c(1,3)
v2 <- c(2)
v3 <- c(1,3,4)
l = lengths(list(v2, v1, v3))
# [1] 1 2 3
start = cumsum(l) - l + 1
# [1] 1 2 4
end = cumsum(l)
# [1] 1 3 6
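The same idea can be wrapped in a small helper. This is a sketch; vector_spans is a hypothetical name, not an existing base R function.

```r
# Hypothetical helper: start/end positions of each input vector
# within the concatenation c(...)
vector_spans <- function(...) {
  len <- lengths(list(...))
  end <- cumsum(len)
  data.frame(start = end - len + 1, end = end)
}

v1 <- c(1, 3)
v2 <- c(2)
v3 <- c(1, 3, 4)
vsum <- c(v2, v1, v3)

spans <- vector_spans(v2, v1, v3)
spans

# Sanity check: the third span recovers v3 from vsum
vsum[spans$start[3]:spans$end[3]]
```

Note that an empty input vector would get a start one greater than its end, so such spans should not be used for indexing with `:`.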

How can I pad a vector with NA from the front?

I want to pad an existing vector to length n using NA. I know I can pad at the end of the vector like so:
v1 <- 1:10
v2 <- diff(v1)
length(v2) <- length(v1)
v2
# 1 1 1 1 1 1 1 1 1 NA
But I want to fill in the NA at the beginning instead, in a generic way. For this particular example I can just write
v2 <- c(NA, diff(v1))
# NA 1 1 1 1 1 1 1 1 1
But I was hoping that there exists some base R function or library that provides something like v2 <- pad(v2, n=length(v1), value=NA)
Is there anything like that I can use off the shelf, or do I need to define my own function:
pad <- function(x, n) { # ugly function that doesn't keep the attributes of x
len.diff <- n - length(x)
c(rep(NA, len.diff), x)
}
pad(1:10, 12) # NA NA 1 2 3 4 5 6 7 8 9 10
Assuming v1 has the desired length and v2 is shorter (or the same length), each of these left-pads v2 with NA values to the length of v1. The first four assume numeric vectors, although they can be modified to work more generally by replacing NA*v1 in the code with rep(NA, length(v1)).
replace(NA * v1, seq(to = length(v1), length = length(v2)), v2)
rev(replace(NA * v1, seq_along(v2), rev(v2)))
replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2)
tail(c(NA * v1, v2), length(v1))
c(rep(NA, length(v1) - length(v2)), v2)
The fourth is the shortest. The first two and fourth do not involve any explicit arithmetic calculations other than multiplying v1 with NA values. The second is likely slow since it involves two applications of rev.
One option is diff from zoo, which also has the na.pad argument:
library(zoo)
as.vector(diff(zoo(v1), na.pad=TRUE))
#[1] NA 1 1 1 1 1 1 1 1 1
Defining nrValues as the number of NA elements you want at the start of v2, you could use:
n <- length(v1)
v2 <- c(rep(NA,nrValues),v1[nrValues:n])
I'm not familiar with a function that does this, so if you intend to do it multiple times I would create your own function.
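If you do write your own function, a minimal sketch could look like this. pad_left and its value argument are hypothetical names chosen for illustration, and attributes of x other than its values are not preserved.

```r
# Hypothetical helper: left-pad x with `value` up to length n
pad_left <- function(x, n, value = NA) {
  stopifnot(n >= length(x))           # refuse to truncate
  c(rep(value, n - length(x)), x)
}

v1 <- 1:10
v2 <- pad_left(diff(v1), length(v1))  # NA followed by nine 1s
v2
pad_left(1:10, 12)                    # two leading NAs, then 1 to 10
```

Because c() drops most attributes, factors or dates would need extra handling.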

Replace specific values in a data frame except first column

I have this line in one of my functions: result[result>0.05] <- "". It replaces all values greater than 0.05 in my data frame, including the row names in the first column. How can I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
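Assigning '' turns the affected columns into character columns, which is why the remaining values print with full precision above. If the columns should stay numeric, a variant of the same indexing with NA instead of '' can be sketched as follows (set.seed is only there to make the example reproducible):

```r
set.seed(1)
df <- as.data.frame(matrix(runif(100), nrow = 10))
first_col <- df[[1]]               # keep a copy to verify it is untouched

# Blank out values above 0.05 in every column except the first
df[-1][df[-1] > 0.05] <- NA

sapply(df, class)                  # all columns are still numeric
identical(df[[1]], first_col)      # the first column is unchanged
```

Whether NA or '' is preferable depends on whether the result is used for further computation or only for display.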

R independent columns in matrix

I am trying to find independent columns to solve a system of linear equations. Here is my simplified example:
> mat = matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1), nrow=4, ncol=6, dimnames=list(c("A", "B", "C", "D"), paste("v", 1:6, sep="")))
> mat
v1 v2 v3 v4 v5 v6
A 1 -1 0 0 -1 0
B 0 1 -1 0 0 0
C 0 0 1 -1 0 1
D 0 0 0 0 1 -1
The matrix is full rank:
qr(mat)$rank
gives me 4, and since there are 6 columns, there should be 6-4=2 independent columns from which I can calculate the others.
I know that columns v4 and v6 are independent... My first question is, how can I find these columns (maybe with qr(mat)$pivot)?
By rearranging the linear equations on paper, I see that
[v1, v2, v3, v4, v5, v6] = [v4, v4-v6, v4-v6, v4, v6, v6]
and thus I can find from arbitrary values for v4 and v6 a vector that lies in the null space by multiplying v4 and v6 with the vectors below:
v4 * [1,1,1,1,0,0] + v6 * [0,-1,-1,0,1,1]
My second question is: How do I find these vectors, meaning how do I solve the matrix for v4 and v6?
For example
qr.solve(mat, cbind(c(0,0,0,0), c(0,0,0,0)))
gives me two vectors of length 6 with only zeros.
Any help is appreciated, many thanks in advance!
-H-
Use the pivot information to find a set of independent columns:
q <- qr(mat)
mmat <- mat[,q$pivot[seq(q$rank)]]
mmat
## v1 v2 v3 v5
## A 1 -1 0 -1
## B 0 1 -1 0
## C 0 0 1 0
## D 0 0 0 1
qr(mmat)$rank
## [1] 4
Why does this work? The meaning of pivot is given in QR.Auxiliaries {base} brought up with ?qr.Q. In particular:
qr.R returns R. This may be pivoted, e.g., if a <- qr(x) then x[, a$pivot] = QR.
The number of rows of R is either nrow(X) or ncol(X) (and may depend on whether
complete is TRUE or FALSE).
Pivoting is done for numerical stability: R's default routine moves columns that are (numerically) linear combinations of earlier columns to the right-hand end. This means that any dependent columns sit beyond q$rank in q$pivot, so the first q$rank entries of q$pivot index a linearly independent set (in the current example, Q is a 4x4 orthogonal matrix).
The final lines in the QR.Auxiliaries {base} show this relationship:
pivI <- sort.list(a$pivot) # the inverse permutation
stopifnot(
all.equal(x[, a$pivot], qr.Q(a) %*% qr.R(a)), # TRUE
all.equal(x , qr.Q(a) %*% qr.R(a)[, pivI])) # TRUE too!
If you start with v4 and v6, then you need 2 more columns with non-zero values in rows 1 and 2, so you need to pick v1 and either v2 or v3. All of the following are possible basis choices with maximal rank:
> qr(mat[, c(1,2,4,6)])$rank
[1] 4
> qr(mat[, c(1,2,3,5)])$rank
[1] 4
> qr(mat[, c(1,3,4,6)])$rank
[1] 4
It is simply not the case that "independent columns" are uniquely determined. There may be sets of columns that are necessarily dependent, e.g., ones which are scalar multiples of each other, but that is not the case here.
On the other hand this will be rank deficient:
> qr(mat[, c(1,2,3,4)])$rank
[1] 3
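For the second question (expressing the solution in terms of v4 and v6), one possible sketch uses the pivot information to split the columns into a basic set and a free set, then solves for the basic variables. It assumes the matrix has full row rank, so the basic columns form an invertible square matrix; basic, free and null_basis are names introduced here for illustration.

```r
mat <- matrix(c(1,0,0,0, -1,1,0,0, 0,-1,1,0, 0,0,-1,0, -1,0,0,1, 0,0,1,-1),
              nrow = 4, ncol = 6,
              dimnames = list(c("A", "B", "C", "D"), paste("v", 1:6, sep = "")))

q     <- qr(mat)
basic <- q$pivot[seq_len(q$rank)]          # independent columns (v1, v2, v3, v5)
free  <- sort(q$pivot[-seq_len(q$rank)])   # remaining columns (v4, v6)

# One null-space vector per free column: set that free variable to 1,
# the other free variables to 0, and solve for the basic variables
null_basis <- matrix(0, nrow = ncol(mat), ncol = length(free),
                     dimnames = list(colnames(mat), colnames(mat)[free]))
null_basis[cbind(free, seq_along(free))] <- 1
null_basis[basic, ] <- -solve(mat[, basic], mat[, free, drop = FALSE])

null_basis
mat %*% null_basis                         # all zeros, up to rounding
```

The two columns of null_basis correspond to the vectors multiplied by v4 and v6 in the question, so any solution of the homogeneous system is a linear combination of them.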
