R independent columns in matrix - r

I am trying to find independent columns to solve the system of linear equations. Here is my simplified example:
> mat = matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1), nrow=4, ncol=6, dimnames=list(c("A", "B", "C", "D"), paste("v", 1:6, sep="")))
> mat
v1 v2 v3 v4 v5 v6
A 1 -1 0 0 -1 0
B 0 1 -1 0 0 0
C 0 0 1 -1 0 1
D 0 0 0 0 1 -1
The matrix is full rank:
qr(mat)$rank
gives me 4, and since there are 6 columns, there should be 6-4=2 independent columns from which I can calculate the others.
I know that columns v4 and v6 are independent... My first question is, how can I find these columns (maybe with qr(mat)$pivot)?
By rearranging the linear equations on paper, I see that
[v1, v2, v3, v4, v5, v6] = [v4, v4-v6, v4-v6, v4, v6, v6]
and thus I can find from arbitrary values for v4 and v6 a vector that lies in the null space by multiplying v4 and v6 with the vectors below:
v4 * [1,1,1,1,0,0] + v6 * [0,-1,-1,0,1,1]
My second question is: How do I find these vectors, meaning how do I solve the matrix for v4 and v6?
For example
qr.solve(mat, cbind(c(0,0,0,0), c(0,0,0,0)))
gives me two vectors of length 6 with only zeros.
Any help is appreciated, many thanks in advance!
-H-

Use the pivot information to find a set of independent columns:
q <- qr(mat)
mmat <- mat[,q$pivot[seq(q$rank)]]
mmat
## v1 v2 v3 v5
## A 1 -1 0 -1
## B 0 1 -1 0
## C 0 0 1 0
## D 0 0 0 1
qr(mmat)$rank
## [1] 4
Why does this work? The meaning of pivot is given in QR.Auxiliaries {base} brought up with ?qr.Q. In particular:
qr.R returns R. This may be pivoted, e.g., if a <- qr(x) then x[, a$pivot] = QR.
The number of rows of R is either nrow(X) or ncol(X) (and may depend on whether
complete is TRUE or FALSE).
Pivoting is done for numerical stability. With R's default LINPACK method, columns that are numerically dependent on the ones before them are moved to the right-hand end, so the first q$rank entries of q$pivot index a set of independent columns and any remaining entries index columns that depend on them (v4 and v6 in the current example).
The final lines of the examples in QR.Auxiliaries {base} show this relationship:
pivI <- sort.list(a$pivot) # the inverse permutation
stopifnot(
all.equal(x[, a$pivot], qr.Q(a) %*% qr.R(a)), # TRUE
all.equal(x , qr.Q(a) %*% qr.R(a)[, pivI])) # TRUE too!
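As a quick check, the same identity can be verified on the matrix from the question. This is only a minimal sketch: the object name a mirrors the help-page excerpt, the helper names QR, ok_pivoted and ok_original are made up for illustration, and check.attributes = FALSE is there only so that differing dimnames do not cause a spurious mismatch.

```r
mat <- matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1),
              nrow = 4, ncol = 6,
              dimnames = list(c("A", "B", "C", "D"), paste("v", 1:6, sep = "")))

a    <- qr(mat)
pivI <- sort.list(a$pivot)      # the inverse permutation
QR   <- qr.Q(a) %*% qr.R(a)     # product of the two factors (columns in pivoted order)

# the product matches the column-permuted matrix ...
ok_pivoted  <- isTRUE(all.equal(mat[, a$pivot], QR, check.attributes = FALSE))
# ... and undoing the permutation recovers the original column order
ok_original <- isTRUE(all.equal(mat, QR[, pivI], check.attributes = FALSE))
```

Both flags should come out TRUE if the pivot is interpreted as described above.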

If you start with v4 and v6, you need two more columns with non-zero values in rows 1 and 2, for example v1 and either v2 or v3. Each of the following column sets is a valid basis choice with maximal rank:
> qr(mat[, c(1,2,4,6)])$rank
[1] 4
> qr(mat[, c(1,2,3,5)])$rank
[1] 4
> qr(mat[, c(1,3,4,6)])$rank
[1] 4
It is simply not the case that "independent columns" are uniquely determined. There may be sets of columns that are necessarily dependent, e.g., ones which are scalar multiples of each other, but that is not the case here.
On the other hand this will be rank deficient:
> qr(mat[, c(1,2,3,4)])$rank
[1] 3
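For the second question (finding the null-space vectors), one possible approach is to treat the non-pivot columns as free variables and solve for the pivot columns. The sketch below is not taken from either answer; the names ind, free and N are made up for illustration, and it assumes the pivot columns reported by qr() really are independent.

```r
mat <- matrix(c(1,0,0,0,-1,1,0,0,0,-1,1,0,0,0,-1,0,-1,0,0,1,0,0,1,-1),
              nrow = 4, ncol = 6,
              dimnames = list(c("A", "B", "C", "D"), paste("v", 1:6, sep = "")))

q    <- qr(mat)
ind  <- sort(q$pivot[seq_len(q$rank)])    # independent (pivot) columns
free <- sort(q$pivot[-seq_len(q$rank)])   # remaining columns, used as free variables

# We want mat[, ind] %*% x_ind + mat[, free] %*% x_free = 0.
# Setting x_free to each unit vector in turn gives one basis vector per free column.
N <- matrix(0, nrow = ncol(mat), ncol = length(free),
            dimnames = list(colnames(mat), colnames(mat)[free]))
N[free, ] <- diag(length(free))
N[ind, ]  <- -qr.solve(mat[, ind], mat[, free])

max_residual <- max(abs(mat %*% N))       # should be numerically zero
```

With the pivot columns reported above (v1, v2, v3, v5), the two columns of N should correspond to the vectors the question derived by hand for v4 and v6. Any linear combination of the columns of N then lies in the null space of mat.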

Related

Calculating weight ratios in the presence of empty cells

I have a sample which needs to be weighted in order to represent the population.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
V1 V2 V3 V4
1: 1 0 2 2
2: 3 4 3 0
population <- fread("
10,20,20,10
30,40,20,10
")
This weight would simply be:
weights <- population/sample
V1 V2 V3 V4
1: 10 Inf 10.000000 5
2: 10 10 6.666667 Inf
However, because V2 in row 1 of the sample has no observations, it receives an infinite weight (Note that also V4 in row 2 receives an Inf, but this is easier to solve, because the weight is irrelevant, as there are no observations in either the sample or the population).
A solution to the problem would be to count V1 and V2 together in the sample and the population.
EDIT:
After some thought I realised that, for the weights to be correct, only the population values have to be adapted. If V1 and V2 in row 1 of population are added together in V1 of population, this will already lead to the correct weight for the sample observation of V1 row 1. The value of V2 becomes irrelevant because there is no observation in the sample to receive that weight.
End of EDIT
The observation would then get a weight of:
(population[1,1]+population[1,2])/(sample[1,1]+sample[1,2])
(10+20)/(1+0)=30
In my actual data, however, there are many more rows, with a 0 here and there in the sample. I am trying to figure out whether there is a way to write my code so that I do not have to do this manually.
Desired outcome (notice that the weight of V1 row 1 is now 30):
weights
V1 V2 V3 V4
1: 30 0 10.000000 5
2: 10 10 6.666667 0
Attempt
I was thinking of doing something like:
for (i in seq_along(ncol(sample))) {
lapply(population, (ifelse(sample[i]==0), population[i]<-population[i+1], population[i])
}
Where the values in the population of the cell to the right will be added when the value in the sample is zero. However, I am having trouble getting the syntax right, and even if I did, it would not solve the case where V4 is 0.
Here is a rather verbose solution. If there were more columns that should be aggregated when the sample contains zeros, I would have proposed a more flexible approach, but this seems sufficient for your example.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
population <- fread("
10,20,20,10
30,40,20,10
")
# aggregate Values if sample is zero
population[sample$V1 == 0, `:=`(V1 = 0,
V2 = V1 + V2)]
population[sample$V2 == 0, `:=`(V1 = V1 + V2,
V2 = 0)]
weights <- population/sample
# Fix NaNs
weights[is.na(weights), ] <- 0
weights
#> V1 V2 V3 V4
#> 1: 30 0 10.000000 5
#> 2: 10 10 6.666667 Inf
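The output above still contains Inf for V4 in row 2, because is.na() catches NaN (0/0) but not Inf (10/0). If those cells should also become 0, as in the desired outcome, one option is to reset every non-finite value. This is a minimal sketch on a plain data.frame holding the weights as they look before the NaN fix; the data.frame stand-in and the name w are assumptions for illustration, not part of the original answer.

```r
# weights before any clean-up: NaN from 0/0, Inf from 10/0
w <- data.frame(V1 = c(30, 10), V2 = c(NaN, 10),
                V3 = c(10, 20 / 3), V4 = c(5, Inf))

# replace NaN, NA, Inf and -Inf by 0, column by column
w[] <- lapply(w, function(col) replace(col, !is.finite(col), 0))
w
```

Whether 0 is the right weight for such a cell depends on the data: it is harmless only when the sample has no observations there.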

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between adjacent columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the difference. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
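With around 150 columns it may be simpler to subtract two shifted blocks of columns in one step rather than naming each column. The sketch below assumes that id is the only non-value column and that the remaining columns are already in cumulative order; the helper name cols is made up for illustration.

```r
x <- read.table(header = TRUE, text = "
id v1 v2 v3
1 4 5 9
2 1 1 4")

cols <- setdiff(names(x), "id")   # all cumulative columns, in their original order

# subtract each column's left neighbour; the first value column stays unchanged
x[cols[-1]] <- x[cols[-1]] - x[cols[-length(cols)]]
x
```

The right-hand side is evaluated on the original values before anything is assigned, so no copy of the data frame is needed.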

-Inf content in a Position Weight Matrix

What does it mean when I have -Inf content in some positions of a Position Weight Matrix?
I am using the seqLogo package. For plotting the seqLogo:
library(seqLogo)
seqLogo(weight_matrix, ic.scale=TRUE, xaxis=TRUE, yaxis=TRUE, xfontsize=15, yfontsize=15)
and I have:
Error in seqLogo(weight_matrix, ic.scale = TRUE, xaxis = TRUE, yaxis =
TRUE, : Columns of PWM must add up to 1.0
The error states the problem: every column must sum to 1, because each column holds probabilities, which cannot add up to more than 1. See the example:
The following works fine, using the example matrix m from the seqLogo package:
library(seqLogo)
# get example matrix
mFile <- system.file("Exfiles/pwm1", package="seqLogo")
m <- read.table(mFile)
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 1 1 1 1 1 1 1
# plot, all great!
seqLogo(m)
Now, let's change one of the values so that a column sum is more than 1. This will give us an error.
m[1, 1] <- 1
# check if all columns have sum of 1
colSums(m)
# V1 V2 V3 V4 V5 V6 V7 V8
# 2 1 1 1 1 1 1 1
seqLogo(m)
# Error in seqLogo(m) : Columns of PWM must add up to 1.0
Another reason could be that the matrix values are already log-transformed; -Inf entries then correspond to zero probabilities, since the log of 0 is -Inf. If so, convert them back to probabilities using:
plotMatrix <- 2 ^ weight_matrix * 0.25
then plot:
seqLogo(plotMatrix)
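To see where -Inf comes from and that the conversion above undoes it, here is a small sketch without seqLogo. The probability matrix prob is invented for illustration, and the uniform background of 0.25 is taken from the formula above, not from any knowledge of how weight_matrix was actually built.

```r
# toy probability matrix: rows A, C, G, T; each column sums to 1
prob <- matrix(c(0.50, 0.25, 0.25, 0.00,
                 0.25, 0.25, 0.25, 0.25),
               nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

log_odds <- log2(prob / 0.25)   # a zero probability becomes -Inf
back     <- 2 ^ log_odds * 0.25 # 2^(-Inf) is 0, so the zero is recovered

colSums(back)                   # each column sums to 1 again
```

If colSums() of the converted matrix is not 1 for every column, the matrix was probably built with a different background or a different log base.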

How can I pad a vector with NA from the front?

I want to pad an existing vector with NA to size n. I know I can pad at the end of the vector like so:
v1 <- 1:10
v2 <- diff(v1)
length(v2) <- length(v1)
v2
# 1 1 1 1 1 1 1 1 1 NA
But I want to fill in the NA at the beginning instead, in a generic way. I mean, for this particular example I can just do
v2 <- c(NA, diff(v1))
# NA 1 1 1 1 1 1 1 1 1
But I was hoping that there exists some base R function or library that provides something like v2 <- pad(v2, n=length(v1), value=NA).
Is there anything like that I can use off the shelf, or do I need to define my own function:
pad <- function(x, n) { # ugly function that doesn't keep the attributes of x
len.diff <- n - length(x)
c(rep(NA, len.diff), x)
}
pad(1:10, 12) # NA NA 1 2 3 4 5 6 7 8 9 10
Assuming v1 has the desired length and v2 is shorter (or the same length), each of the following left-pads v2 with NA values to the length of v1. The first four assume numeric vectors, although they can be modified to work more generally by replacing NA*v1 in the code with rep(NA, length(v1)).
replace(NA * v1, seq(to = length(v1), length = length(v2)), v2)
rev(replace(NA * v1, seq_along(v2), rev(v2)))
replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2)
tail(c(NA * v1, v2), length(v1))
c(rep(NA, length(v1) - length(v2)), v2)
The fourth is the shortest. The first two and fourth do not involve any explicit arithmetic calculations other than multiplying v1 with NA values. The second is likely slow since it involves two applications of rev.
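A quick way to convince yourself that the five expressions agree is to run them on the example from the question and compare the results. This is only a sketch; the names results and all_same are made up for illustration.

```r
v1 <- 1:10
v2 <- diff(v1)

results <- list(
  replace(NA * v1, seq(to = length(v1), length = length(v2)), v2),
  rev(replace(NA * v1, seq_along(v2), rev(v2))),
  replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2),
  tail(c(NA * v1, v2), length(v1)),
  c(rep(NA, length(v1) - length(v2)), v2)
)

# compare every result against the first one
all_same <- all(vapply(results,
                       function(r) isTRUE(all.equal(r, results[[1]])),
                       logical(1)))
all_same
```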
One option is diff from zoo, which also has the na.pad argument:
library(zoo)
as.vector(diff(zoo(v1), na.pad=TRUE))
#[1] NA 1 1 1 1 1 1 1 1 1
Defining nrValues as the number of NA values you want at the start of v2, you could use:
n <- length(v1)
v2 <- c(rep(NA, nrValues), v1[(nrValues + 1):n])
I'm not familiar with a function that does this, so if you intend to do it multiple times I would create your own function.

Interpreting this error message in R

I have the following matrix
mat<-read.csv("mat.csv")
sel<-c(135, 211)
I would like to select the rows in 'mat' that correspond to 'sel'
I do it in the following way:
subset(mat, mat$V2==c(sel))
and I get the following error:
Warning message:
In l[, 2] == c(135, 211) :
longer object length is not a multiple of shorter object length
And also it only selects one of the two.
Try this (credits go to Roland)
mat[mat$V2 %in% sel,]
X V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
11 11 1 135 2 7 100 2 0 0 0 0
15 15 1 211 5 7 100 2 0 0 0 0
From ?'%in%' you can read:
%in% is a more intuitive interface as a binary operator, which returns
a logical vector indicating if there is a match or not for its left operand.
A logical vector that flags the matches can be used directly for indexing. Here, mat$V2 %in% sel returns a logical vector that is TRUE for every element of mat$V2 found in sel. Used in mat[row, col] as mat[mat$V2 %in% sel, ], it means: give me all the columns for those rows whose V2 meets the condition mat$V2 %in% sel. By contrast, == compares element by element and recycles the shorter vector sel, which is why it produced the warning and selected only one of the two rows.
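The difference between == and %in% is easiest to see on a small made-up example. The data frame below is invented for illustration (the original mat.csv is not available); only the column name V2 and the vector sel come from the question.

```r
mat <- data.frame(V1 = 1, V2 = c(135, 20, 211, 135, 7))
sel <- c(135, 211)

# == recycles sel to 135, 211, 135, 211, 135 and compares position by position.
# Without suppressWarnings() this gives the warning from the question,
# because 5 is not a multiple of 2.
by_equal <- suppressWarnings(mat$V2 == sel)
by_equal
# [1]  TRUE FALSE FALSE FALSE FALSE

# %in% asks, for each element of V2, whether it occurs anywhere in sel
by_in <- mat$V2 %in% sel
by_in
# [1]  TRUE FALSE  TRUE  TRUE FALSE

mat[by_in, ]   # keeps the three rows with V2 equal to 135 or 211
```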
