Simple linear equation using qr.solve gives very off the mark results - r

I'm trying to figure out how to solve a system of linear equations whose right-hand sides are approximate (i.e. there is error in the observations, and I want it minimized).
To understand/verify the process, I came up with a simple example: I generate a series of values of 5x + 4x^2 + 3x^3 for x = 1, ..., 100, with a 0-5% error in each answer.
> a
       [,1]  [,2]    [,3]
  [1,]    1     1       1
  [2,]    2     4       8
  [3,]    3     9      27
[...]
 [98,]   98  9604  941192
 [99,]   99  9801  970299
[100,]  100 10000 1000000
> b
[1] 12.04 48.17 130.02 269.93 505.75 838.44
[7] 1202.04 1911.69 2590.51 3381.00 4538.80 5846.19
...
[97] 2824722.45 2826700.98 3012558.52 2920400.25
When I try to solve this using qr.solve,
> qr.solve(a,b)
[1] 85.2896286 -0.8924785 3.0482766
the results are completely off (I want 5, 4, 3). I'm sure I'm missing something obvious. Or perhaps my experiment with polynomials is inherently bad? (If so, why?)

I cannot reproduce this problem with an additive error:
a <- cbind(1:100, (1:100)^2, (1:100)^3)
set.seed(42)
b <- a %*% (5:3) + rnorm(100, sd = 0.1)
qr.solve(a, b)
# [,1]
#[1,] 4.998209
#[2,] 4.000056
#[3,] 3.000000
I can reproduce it with a relative error, but that's not really surprising, since the error is then dominated by the magnitude of the third-degree summand:
a <- cbind(1:100, (1:100)^2, (1:100)^3)
set.seed(42)
b <- a %*% (5:3) * rnorm(100, mean = 1, sd = 0.1)
qr.solve(a, b)
# [,1]
#[1,] -1686.611970
#[2,] 68.693368
#[3,] 2.481742
Note that the third coefficient is about what you expect (even more so in your non-reproducible example).
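If the error really is multiplicative, a standard remedy (a sketch, not part of the original answer) is weighted least squares: weighting each observation by 1/b^2 turns the fit into an approximate minimization of the squared relative errors, which should recover the coefficients much more closely:
a <- cbind(1:100, (1:100)^2, (1:100)^3)
set.seed(42)
b <- as.vector(a %*% (5:3)) * rnorm(100, mean = 1, sd = 0.1)
# lm with weights w minimizes sum(w * residuals^2); w = 1/b^2 makes this
# (approximately) a sum of squared *relative* errors
fit <- lm(b ~ a - 1, weights = 1/b^2)
coef(fit)  # should come out close to 5, 4, 3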

How does the equation for the SpatRaster roughness index (terrain, v = "roughness") work?

The terra package offers and describes the following terrain indices:
x <- terrain(x, v="roughness")
x <- terrain(x, v="TPI")
x <- terrain(x, v="TRI")
I am confused about how this is calculated based on the package description of roughness as "the difference between the maximum and the minimum value of a cell and its 8 surrounding cells" (Hijmans et al. 2023). How does this work for edge and corner cells? I assume the calculation reduces to a cell and its 5 or 3 surrounding cells in these cases?
The ruggedness (TRI) index is described as "the mean of the absolute differences between the value of a cell and the value of its 8 surrounding cells". The following is a graphic illustration of how I envision the calculation of these indices from the description provided.
Does this provide a correct interpretation of these indices?
If this interpretation is incorrect, then I am hoping someone can point me in the right direction (a reference) or explain here. I am interested in coding a criterion of a 16° slope from a DSM together with an elevational difference of 1.3 m, but I think a terrain index would give a better indicator of the 1.3 m criterion for this habitat model.
## > 16° slope
habitat_slope_mat <- matrix(nrow = 2, ncol = 3)
habitat_slope_mat[1, ] <- c(0,16,0) # from,to = 0 absent
habitat_slope_mat[2, ] <- c(16,minmax(x)[2],1) # from,to = 1 present
habitat_slope <- classify(x, habitat_slope_mat, include.lowest=TRUE)
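By analogy, a possible sketch for the 1.3 m criterion (assuming roughness, i.e. the max-min difference in the neighborhood, is a suitable proxy for the 1.3 m elevational difference):
## > 1.3 m elevational difference
r <- terrain(x, v="roughness")
habitat_rough_mat <- matrix(nrow = 2, ncol = 3)
habitat_rough_mat[1, ] <- c(0, 1.3, 0)            # from,to = 0 absent
habitat_rough_mat[2, ] <- c(1.3, minmax(r)[2], 1) # from,to = 1 present
habitat_rough <- classify(r, habitat_rough_mat, include.lowest=TRUE)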
I looked at the cited references and was expecting to find the formula for this to help me think of the best way to treat the 1.3 m criterion. I have been unable to locate a written / published description that further explains the method. This paper is listed in the citations for the terrain function description:
Jones, K.H., 1998. A comparison of algorithms used to compute hill (sic) *terrain* as a property of the DEM. Computers & Geosciences 24: 315-323.
The correct title of the article (DOI: 10.1016/S0098-3004(98)00032-6) is: "A comparison of algorithms used to compute hill *slope* as a property of the DEM". I cannot locate the formula for roughness in that paper and was interested in reading more on this topic.
I am not sure if this question is appropriate here, as you do not seem to be asking a coding question.
The manual points to Wilson et al. (2007) for terrain indices. It also shows how you can use focal instead of terrain to compute them.
You can see for yourself what happens with small examples like this:
library(terra)
x <- rast(nrow=3, ncol=3, vals=c(1,2,3,1,2,1,1,2,8), ext=ext(0,1,0,1), crs="local")
as.matrix(x, wide=TRUE)
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 1 2 1
#[3,] 1 2 8
terrain(x, "roughness") |> as.matrix(wide=TRUE)
# [,1] [,2] [,3]
#[1,] NaN NaN NaN
#[2,] NaN 7 NaN
#[3,] NaN NaN NaN
focal(x, w=3, fun=\(x) max(x) - min(x)) |> as.matrix(wide=TRUE)
# [,1] [,2] [,3]
#[1,] NA NA NA
#[2,] NA 7 NA
#[3,] NA NA NA
terrain(x, "TRI") |> as.matrix(wide=TRUE)
# [,1] [,2] [,3]
#[1,] NaN NaN NaN
#[2,] NaN 1.375 NaN
#[3,] NaN NaN NaN
focal(x, w=3, fun=\(x) sum(abs(x[-5]-x[5]))/8) |> as.matrix(wide=TRUE)
# [,1] [,2] [,3]
#[1,] NA NA NA
#[2,] NA 1.375 NA
#[3,] NA NA NA
So the edge cells become missing (you could handle them differently via focal).
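For example, a minimal sketch of one way to handle the edges (my assumption, not something terrain itself does): focal pads the window with NA at the edges, so dropping the NAs computes the index over whichever neighbors exist:
focal(x, w=3, fun=\(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)) |> as.matrix(wide=TRUE)
#     [,1] [,2] [,3]
#[1,]    1    2    2
#[2,]    1    7    7
#[3,]    1    7    7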
Or look at the source code.

Limma to Compare Bulk RNA Seq using makeContrasts and eBayes

After a day of googling, I've decided that it'd be better to ask the question here.
So the experiment is: I have bulk RNA-seq data from 3 patients: A, B, C.
Their RNA-seq data were obtained pre-treatment and at treatment cycles 1, 2, and 3.
So in total I have 12 bulk RNA-seq samples:
A.PreTreat -> A.Cycle1 -> A.Cycle2 -> A.Cycle3
B.PreTreat -> B.Cycle1 -> B.Cycle2 -> B.Cycle3
C.PreTreat -> C.Cycle1 -> C.Cycle2 -> C.Cycle3
I want to get a differential gene list between different cycles (e.g. cycle 3 vs. pre-treatment, cycle 3 vs. cycle 2) using model.matrix(), lmFit(), makeContrasts(), contrasts.fit(), and eBayes(), all of which are in the limma package.
Here is my minimal working example.
library(limma)
# Already normalized expression set: rows are genes, columns are the 12 samples
normalized_expression <- matrix(data=sample(1:100, 120, replace=TRUE), nrow=10, ncol=12)
colnames(normalized_expression) <- c("A.PreTreat", "A.Cycle1", "A.Cycle2", "A.Cycle3", "B.PreTreat", "B.Cycle1", "B.Cycle2", "B.Cycle3", "C.PreTreat", "C.Cycle1", "C.Cycle2", "C.Cycle3")
patient_and_treatment <- factor(colnames(normalized_expression), levels = colnames(normalized_expression))
design.matrix <- model.matrix(~0 + patient_and_treatment)
colnames(design.matrix) <- patient_and_treatment
fit <- lmFit(normalized_expression, design.matrix)
# I want to get a contrast matrix to get differential genes between cycle 3 treatment and pre-treatment in all patients
contrast.matrix <- makeContrasts("A.Cycle3+B.Cycle3+C.Cycle3-A.PreTreat-B.PreTreat-C.PreTreat",
levels = levels(patient_and_treatment))
# Outputs Error of no residual degree of freedom
fit2 <- eBayes( contrasts.fit( fit, contrast.matrix ) )
# Want to run but cannot
summary(decideTests(fit2))
So far I am stuck on the "no residual degrees of freedom" error.
I am not even sure if this is the statistically correct way in limma to get a differential gene list between cycle 3 treatment and pre-treatment across all patients.
Any help will be greatly appreciated.
Thanks!
You cannot have one observation per group: with 12 samples and 12 coefficients there are no residual degrees of freedom, and the regression is meaningless because you are fitting each data point to itself.
Briefly, what you are looking for are effects common across all patients, say for Cycle3 compared to PreTreat and so on. Set up the model like this:
library(limma)
metadata = data.frame(
  Patient   = gsub("[.][^ ]*", "", colnames(normalized_expression)),
  Treatment = gsub("^[A-Z][.]*", "", colnames(normalized_expression))
)
   Patient Treatment
1        A  PreTreat
2        A    Cycle1
3        A    Cycle2
4        A    Cycle3
5        B  PreTreat
6        B    Cycle1
7        B    Cycle2
8        B    Cycle3
9        C  PreTreat
10       C    Cycle1
11       C    Cycle2
12       C    Cycle3
Now specify the model matrix; the Patient term accounts for differences in baseline levels between patients:
design.matrix <- model.matrix(~ 0 + Treatment + Patient, data = metadata)
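With 12 samples and only 6 coefficients (4 treatment levels plus 2 patient offsets) there are now residual degrees of freedom, which is why the earlier error disappears. A quick check (the column names below are what model.matrix should produce from the sorted factor levels):
colnames(design.matrix)
## [1] "TreatmentCycle1"   "TreatmentCycle2"   "TreatmentCycle3"
## [4] "TreatmentPreTreat" "PatientB"          "PatientC"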
fit <- lmFit(normalized_expression, design.matrix)
contrast.matrix <- makeContrasts(TreatmentCycle3 - TreatmentPreTreat,
                                 TreatmentCycle1 - TreatmentPreTreat,
                                 levels = design.matrix)
fit2 = contrasts.fit(fit, contrast.matrix)
fit2 = eBayes(fit2)
You can check that the coefficients give you what you wanted:
fit2$coefficients
       Contrasts
        TreatmentCycle3 - TreatmentPreTreat TreatmentCycle1 - TreatmentPreTreat
 [1,]                             -3.666667                           -11.33333
 [2,]                            -13.666667                           -19.33333
 [3,]                              1.666667                           -27.33333
 [4,]                            -40.666667                           -42.33333
 [5,]                             12.000000                            27.33333
 [6,]                            -46.000000                           -32.66667
 [7,]                            -32.000000                           -33.00000
 [8,]                              4.666667                           -30.66667
 [9,]                             11.333333                            46.00000
[10,]                              5.666667                            17.33333
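From here the differential gene lists the question asked for come out in the usual limma way (a sketch; the coefficient name follows the contrasts defined above):
# per-contrast summary of up/down/unchanged genes
summary(decideTests(fit2))
# top genes for cycle 3 vs pre-treatment
topTable(fit2, coef = "TreatmentCycle3 - TreatmentPreTreat")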

rbind is changing a constant number of one of my columns

I am trying to make a matrix from a dataframe. Everything is working perfectly, but when I use rbind, a constant number in one of my columns changes.
My dataframe looks like this:
  CHR  POS   FdNum FdDenom
1  10 3809  0.0000  0.0000
2  10 5673 -0.2500  0.0000
3  10 5847  0.0000  0.5000
...
And it is named Fs_10.
On that dataframe I am running the following for loop:
table10 <- c()
a <- 0
for(i in 1:round(nrow(Fs_10)/50)) {
  window <- Fs_10[(a+1):(a+50), ]  # next block of 50 rows
  a <- a + 50
  a1 <- sum(window$FdNum)
  a2 <- sum(window$FdDenom)
  Result <- a1/a2
  start <- window[1, ]
  end <- window[50, ]
  middle <- (start[, 2] + end[, 2])/2
  table10 <- rbind(table10, c(window[1, 1], start[, 2], end[, 2], end[, 2] - start[, 2], middle, Result))
}
My output looks like this:
    V1     V2     V3     V4       V5          V6
1    2   3869 624096 620287 313952.5 0.029411765
50   2 624096 624694    598 624395.0 0.500000000
100  2 624714 625470    756 625092.0 0.205128205
I expect the number 10 in column V1, but I am getting 2. I have changed several things and the 2 is still there instead of the 10. Do you know what is happening?
A simplified version of the problem is:
rbind(c(start[,1], start[,2], end[,2]), c(start[,1], start[,2], end[,2]))
Where start is:
CHR POS FdNum FdDenom
240938 10 148990666 0.25 0.25
And end is:
CHR POS FdNum FdDenom
240987 10 149534407 -0.5 0
I have this:
[,1] [,2] [,3]
[1,] 2 148990666 149534407
[2,] 2 148990666 149534407
Again 2 instead of 10.
Using this (inicio and fin are the same start and end rows as above):
rbind(list(inicio[,1], inicio[,2], fin[,2]), list(inicio[,1], inicio[,2], fin[,2]))
I have this:
[,1] [,2] [,3]
[1,] factor,1 148990666 149534407
[2,] factor,1 148990666 149534407
Do you know what the problem is?
Thanks
I am a newbie programming in R. I thought that I was creating a matrix, but akrun is right: I was creating a list. My solution was to create a dataframe instead, substituting the last line of my for loop with this:
table10 <- rbind(table10, data.frame(window[1, c(1,2)], start[,2], end[,2], end[,2]-start[,2], middle, Result))
However, I still do not understand why the 10 was turning into 2; I'll study more about this.
Thanks
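The likely explanation, inferred from the factor,1 output above (the thread never states it outright): CHR is a factor, and putting a factor into a numeric matrix coerces it to its underlying integer codes. Chromosome names sort as characters, which makes "10" the second level:
chr <- factor(as.character(1:22))
levels(chr)[1:3]  # "1" "10" "11" -- character sorting puts "10" second
f <- chr[chr == "10"]
as.integer(f)     # 2 -- the underlying code, which is what lands in the matrix
as.character(f)   # "10" -- convert like this before building the matrix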

R while loop with vector condition

I want to vectorize a function that uses a while-loop.
The original function is
getParamsLeadtime <- function(leadtimeMean, in_tolerance, tolerance){
  searchShape <- 0
  quantil <- 0
  # iterate the shape parameter until the desired percentage of values
  # falls within the tolerance interval
  while (quantil < in_tolerance){
    searchShape <- searchShape + 1
    quantil <- pgamma(leadtimeMean + tolerance, shape = searchShape, rate = searchShape/leadtimeMean) -
      pgamma(leadtimeMean - tolerance, shape = searchShape, rate = searchShape/leadtimeMean)
  }
  leadtimeShape <- searchShape
  leadtimeRate <- searchShape/leadtimeMean
  return(c(leadtimeShape, leadtimeRate))
}
I would like a vectorized call to this function in order to apply it to a data frame. Currently I am looping through it:
leadtimes <- data.frame()
for (a in seq(92:103)) {  # note: seq(92:103) evaluates to 1:12, not 92:103
  leadtimes <- rbind(leadtimes, getParamsLeadtime(a, .85, 2))
}
When I tried to vectorize the function, while did not seem to accept a vector as its condition. The following warning occurred:
Warning message:
In while (input["U"] < rep(tolerance, dim(input)[1])) { :
the condition has length > 1 and only the first element will be used
This led me to suppose that while does not like vectors. Can you tell me how to vectorize the function?
On a sidenote, I wonder why the column names of the resulting leadtimes data.frame appear to be values:
> leadtimes
X1 X1.1
1 1 1.000000
2 1 0.500000
3 4 1.333333
4 8 2.000000
5 13 2.600000
6 19 3.166667
7 25 3.571429
8 33 4.125000
9 42 4.666667
10 52 5.200000
11 63 5.727273
12 74 6.166667
Here's an option that is pretty performant.
We vectorize the calculation of pgamma for a given mean lead time, for both the +tol and the -tol case, over a sufficiently large sequence of shape values shp. We calculate a (vectorized) difference and compare it to in_tol. The index (minus 1, since the sequence starts at 0) of the first element that reaches in_tol is the lowest value of shp for which the coverage probability is at least in_tol.
f <- function(lead, in_tol, tol) {
  shp <- which(!(pgamma(lead + tol, 0:10000, (0:10000)/lead) -
                   pgamma(lead - tol, 0:10000, (0:10000)/lead)) < in_tol)[1] - 1
  rate <- shp/lead
  c(shp, rate)
}
We can then sapply this over a range of mean lead times.
t(sapply(1:12, f, 0.85, 2))
## [,1] [,2]
## [1,] 1 1.000000
## [2,] 1 0.500000
## [3,] 4 1.333333
## [4,] 8 2.000000
## [5,] 13 2.600000
## [6,] 19 3.166667
## [7,] 25 3.571429
## [8,] 33 4.125000
## [9,] 42 4.666667
## [10,] 52 5.200000
## [11,] 63 5.727273
## [12,] 74 6.166667
system.time(leadtimes <- sapply(1:103, f, 0.85, 2))
## user system elapsed
## 1.28 0.00 1.30
You just need to make sure you choose a sensible upper ceiling for the shape parameter (here I've chosen 10000, which is more than generous). Note that if you don't choose a high enough upper limit, some return values will be NA.
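Two footnotes on the question itself, as a sketch (neither point is part of the answer above). First, while() requires a length-one condition, which is exactly what the warning says; a vector condition has to be reduced with any() or all(). Second, the odd column names in the sidenote seem to arise because rbind() builds the data.frame from unnamed vectors; naming the result columns avoids that:
# while() needs a scalar condition: reduce a vector with any()/all()
x <- c(1, 5, 10)
while (any(x < 8)) x <- x + 1
x  # 8 12 17
# naming the result columns gives the data.frame sensible names
leadtimes <- as.data.frame(t(sapply(1:12, f, 0.85, 2)))
colnames(leadtimes) <- c("shape", "rate")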

Transforming rows in a PCA context using dudi.pca

I have a huge matrix of genetic data (1e7 rows representing individuals x 5,000 columns representing markers) on which I would like to perform a PCA in order to keep c. 20 columns. However, due to memory issues, I cannot perform the PCA using either dudi.pca or big.PCA on R 3.1.2 on an 8 GB, 64-bit machine.
An alternative is to compute an approximation of the principal axes' coordinates on a row-subset of the matrix, and then transform the whole matrix by a linear combination with the approximate axis coordinates.
I am facing a simple PCA-related problem using dudi.pca: how can I get the row coordinates using the original matrix and the matrix of column coordinates (= principal axes) ?
Here is a simple example, let's take a random matrix M (3 rows and 4 columns) such as:
M =
 1  9 10 13
20 13 20  7
18 19 17 10
Doing dudi.pca(M, center=T, scale=T) and keeping only one PC, dudi.pca outputs the following $c1 matrix (column normed scores, i.e. the principal axes):
c1 =
-0.547
-0.395
-0.539
0.504
To compute the row coordinates of the data on the first principal axis, I thought of taking the inner product:
r =
-0.547*1  + -0.395*9  + -0.539*10 + 0.504*13
-0.547*20 + -0.395*13 + -0.539*20 + 0.504*7
-0.547*18 + -0.395*19 + -0.539*17 + 0.504*10
i.e.
r =
-2.944
-23.331
-21.481
But if I look at the $li (row coordinates, i.e. principal components) natively computed by dudi.pca on the same dataset, I read:
r' =
2.565
-1.559
-1.005
Am I doing something wrong when computing the row coordinates from dudi.pca's $c1 matrix?
Many thanks for your help,
Quaerens.
Code:
> library(ade4)
> M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
> M
[,1] [,2] [,3] [,4]
[1,] 1 9 10 13
[2,] 20 13 20 7
[3,] 18 19 17 10
> N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
> N$c1
CS1
V1 -0.5468634
V2 -0.3955638
V3 -0.5389504
V4 0.5039863
> r=c( M[1,] %*% N$c1[,1], M[2,] %*% N$c1[,1], M[3,] %*% N$c1[,1] )
> r
[1] -2.94462 -23.33070 -21.48155
> N$li
Axis1
1 2.565165
2 -1.559546
3 -1.005619
If this is still of interest...
ade4 works on the duality diagram; hence, when p is greater than n, the singular value decomposition is carried out on the n x n symmetric matrix.
library(ade4)
M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 9 10 13
## [2,] 20 13 20 7
## [3,] 18 19 17 10
N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
#dimensions of M
n=3
p=4
X=scalewt(M,center=T,scale=T)
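Note that this X already resolves the question: the row coordinates are the projection of the centered and scaled data onto the principal axes, not of the raw M:
X %*% as.matrix(N$c1)
## matches N$li: 2.565165 -1.559546 -1.005619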
#this could be done in two ways: singular value decomposition or duality diagrams.
#Consider the singular value decomposition X = U D V', where U holds the left
#singular vectors, V the right singular vectors, and D is the diagonal matrix
#of singular values
svd=svd(X)
#These are equivalent (up to sign)
N$c1
svd$v[,1]
#The eigenvalues are the squared singular values divided by n
N$eig
## [1] 3.341175 0.658825
svd$d[1:2]^2/n
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(svd$d^2/n)
#N$lw gives the row weights
N$lw
#0.3333333 0.3333333 0.3333333
#take the inverse of the square root of the diagonal matrix of row weights; this is the normalization part
K=solve(sqrt(diag(N$lw,n)))%*%svd$u
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components
pc=K%*%sqrt(lambda)
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
This could also be done using the duality diagram implemented in ade4.
See here for references on the duality diagram as implemented in ade4: http://projecteuclid.org/euclid.aoas/1324399594
Q<-diag(p)
D<-diag(1/n, n)
rk<-qr(X)
rank=rk$rank
#Statistical Triplets
V<-t(X)%*%D%*%X
W<-X%*%Q%*%t(X)
#Compute the eigen values and vectors of the statistical triplet
example.eigen=eigen(W%*%D)
#Equivalent
N$eig
## [1] 3.341175 0.658825
example.eigen$values[1:rank]
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(example.eigen$values[1:rank])
#take the inverse of the square root of the diagonal matrix of row weights; this is the normalizing part
Binv<-solve(sqrt(D))
K=Binv%*%example.eigen$vectors[,1:rank]
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components
pc=K%*%sqrt(lambda)
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
