Online PCA in R

I'm trying to code online PCA in R. There is no existing implementation available, so this may be useful for others as well. The pseudo-code can be found here (Algorithm 1). What I've done so far is as follows:
PCA <- function(X, k, epsilon) {
  X_f <- norm(as.matrix(X), "f")
  d <- nrow(X)
  n <- ncol(X)
  l <- floor((8*k)/(epsilon^2))
  U <- matrix(0, d, l)
  C <- matrix(0, d, d)
  Y <- matrix(0, n, l)
  for (t in 1:n) {
    r <- X[,t] - (U %*% t(U) %*% X[,t])
    n <- C + r %*% t(r)
    while (norm(n, "2") >= 2*(X_f^2)/l) {
      lamb <- eigen(C)$values[1]
      u <- eigen(C)$vectors[,1]
      U <- cbind(U, u)
      # U[,which(!apply(U==0,2,all))]
      C <- C - (lamb*(u %*% t(u)))
      r <- X[,t] - (U %*% t(U) %*% X[,t])
    }
    C <- C + (r %*% t(r))
    y <- matrix(0, 1, l)
    y <- t(U) %*% x_t
    Y[t,] <- y
  }
  return(Y)
}
To test the code I used the famous Fisher iris data:
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
ir.pca <- PCA(log.ir,50,0.2)
There seems to be a bug in the code which is not obvious to me: the while loop never stops. Can someone please help?

It's because while(norm(n,"2") >= 2*(X_f^2)/l) never finishes: 2*(X_f^2)/l is always smaller than norm(n,"2"), so the condition is always true. In fact, if you print out these values and run debug(PCA), you'll see that they never change:
PCA <- function(X, k, epsilon) {
  X_f <- norm(as.matrix(X), "f")
  d <- nrow(X)
  n <- ncol(X)
  l <- floor((8*k)/(epsilon^2))
  U <- matrix(0, d, l)
  C <- matrix(0, d, d)
  Y <- matrix(0, n, l)
  for (t in 1:n) {
    r <- X[,t] - (U %*% t(U) %*% X[,t])
    n <- C + r %*% t(r)
    while (norm(n, "2") >= 2*(X_f^2)/l) {
      print(norm(n, "2"))
      print(2*(X_f^2)/l)
      lamb <- eigen(C)$values[1]
      u <- eigen(C)$vectors[,1]
      U <- cbind(U, u)
      U[,which(!apply(U==0,2,all))]
      C <- C - (lamb*(u %*% t(u)))
      r <- X[,t] - (U %*% t(U) %*% X[,t])
    }
    C <- C + (r %*% t(r))
    y <- matrix(0, 1, l)
    y <- t(U) %*% x_t
    Y[t,] <- y
  }
  return(Y)
}
debug(PCA)
In general, using print statements inside functions you want to debug is a good way to diagnose problems.
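For what it's worth, here is a minimal sketch of one possible fix (my own untested reading of the pseudo-code, not a verified implementation of Algorithm 1). The key changes: recompute the candidate matrix C + r r' inside the while condition so the test can actually change, start U with zero columns instead of l zero columns, replace the undefined x_t with X[,t], and zero-pad y since U grows one direction at a time:
PCA_fixed <- function(X, k, epsilon) {
  # Sketch only: untested against the paper's Algorithm 1.
  X <- as.matrix(X)
  X_f <- norm(X, "F")                    # Frobenius norm of the data matrix
  d <- nrow(X)
  n <- ncol(X)
  l <- floor((8*k)/(epsilon^2))
  U <- matrix(0, d, 0)                   # no directions yet; columns added on demand
  C <- matrix(0, d, d)
  Y <- matrix(0, n, l)
  for (t in 1:n) {
    r <- X[,t] - U %*% (t(U) %*% X[,t])  # residual outside the current subspace
    # Recompute the candidate C + r r' each pass so the condition can change.
    while (norm(C + r %*% t(r), "2") >= 2*(X_f^2)/l) {
      eig <- eigen(C)
      u <- eig$vectors[,1]               # top eigenvector becomes a new direction
      U <- cbind(U, u)
      C <- C - eig$values[1] * (u %*% t(u))
      r <- X[,t] - U %*% (t(U) %*% X[,t])
    }
    C <- C + r %*% t(r)
    y <- t(U) %*% X[,t]                  # projection onto the directions found so far
    Y[t, seq_len(nrow(y))] <- y          # zero-pad: U may have fewer than l columns
  }
  return(Y)
}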

Related

R function to calculate SSE for loess span optimisation - can't find error in code

The code comes from http://r-statistics.co/Loess-Regression-With-R.html, and it is exactly what I need, if only I could make it work.
Data:
data(economics, package="ggplot2")
economics$index <- 1:nrow(economics)
economics <- economics[1:80, ]
Function to calculate SSE:
calcSSE <- function(x){
  loessMod <- try(loess(uempmed ~ index, data=economics, span=x), silent=T)
  res <- try(loessMod$residuals, silent=T)
  if(class(res)!="try-error"){
    if((sum(res, na.rm=T) > 0)){
      sse <- sum(res^2)
    }
  }else{
    sse <- 99999
  }
  return(sse)
}
Finding out the best 'span' value for loess:
optim(par=c(0.5), calcSSE, method="SANN")
The problem is that this should yield a span of 0.0543 and a minimum SSE of 3.85e-28 after a function count of 10000 (this can be checked at the link at the beginning of this post), but when I run it, it gives a span of 0.5, an SSE of 99999, and a function count of only 2, so it obviously isn't working.
I am running R 4.1.2 under Win10 from RStudio. The only thing I can think of is that something major in R has changed since this code was written. Any help in fixing it would be greatly appreciated.
I think the problem is here: if((sum(res, na.rm=T) > 0)) - this unnecessarily skips the SSE calculation when the sum of the residuals is below zero, which can quite reasonably happen (and it leaves sse undefined in that case). If you remove that condition, you get a more reasonable answer. The try() function should catch any wonky stuff going on in the results.
data(economics, package="ggplot2")
economics$index <- 1:nrow(economics)
economics <- economics[1:80, ]
calcSSE <- function(x){
  loessMod <- try(loess(uempmed ~ index, data=economics, span=x), silent=T)
  res <- try(loessMod$residuals, silent=T)
  if(!inherits(res, "try-error")){
    sse <- sum(res^2)
  }else{
    sse <- 99999
  }
  return(sse)
}
optimize(calcSSE, c(0.01,1))
#> $minimum
#> [1] 0.04414211
#>
#> $objective
#> [1] 7.888609e-31
Created on 2022-01-30 by the reprex package (v2.0.1)
Note that I used optimize instead of optim because that's what is suggested when there is only one parameter to optimize. I also removed the warnings from the output. One other note: I switched class(res)!="try-error" to !inherits(res, "try-error"), as this is encouraged because an object's class can often have more than one value.
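As a quick follow-up of my own (not part of the original answer; the variable names here are mine), you can refit the loess with the optimized span and plot it against the data to sanity-check the fit:
# Refit with the span found by optimize() and inspect the fit visually.
best_span <- optimize(calcSSE, c(0.01, 1))$minimum
fit <- loess(uempmed ~ index, data = economics, span = best_span)
plot(economics$index, economics$uempmed, xlab = "index", ylab = "uempmed")
lines(economics$index, predict(fit), col = "red")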

For>while>for loops slowing down this R code?

I apologize in advance for the elementary question, but I thought this might be a quick pointer for someone out there.
I am looking at this publicly-available code and wondering why it runs slowly (or stalls completely) when the mu for the negative binomial generation is >1. Is it related to the nested loops?
Thank you.
for(i in 1:runs) {
  cases <- seed
  t <- rep(0, seed)
  times <- t
  while(cases > 0) {
    secondary <- rnbinom(cases, size=fit.cases$estimate[1], mu=fit.cases$estimate[2])
    t.new <- numeric()
    for(j in 1:length(secondary)) {
      t.new <- c(t.new, t[j] + rgamma(secondary[j], shape=fit.serial$estimate[1],
                                      rate=fit.serial$estimate[2]))
    }
    cases <- length(t.new)
    t <- t.new
    times <- c(times, t.new)
  }
  lines(sort(times), 1:length(times), col=cols[i], lwd=1)
  points(max(times), length(times), col=cols[i], pch=16)
}
https://github.com/calthaus/Ebola/blob/master/Superspreading%20(Lancet%20Inf%20Dis%202015)/Ebola_superspreading_analysis.R
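No answer was recorded for this question, but two observations may help (my own notes, not from the thread). First, when mu > 1 each case produces more than one secondary case on average, so the simulated branching process is supercritical: the while loop keeps generating new cases and may effectively never terminate. The slowdown is the epidemic failing to die out, not the loops as such. Second, growing t.new with c() inside the inner for loop is quadratic in the number of cases; it can be replaced with a vectorized equivalent, sketched here using the same fit.cases/fit.serial objects the script assumes:
# Vectorized replacement for the inner for loop: repeat each parent's
# onset time once per secondary case, then add one gamma-distributed
# serial interval per new case.
t.new <- rep(t, times = secondary) +
  rgamma(sum(secondary), shape = fit.serial$estimate[1],
         rate = fit.serial$estimate[2])
# Consider also capping cases (e.g. while(cases > 0 && cases < 1e6))
# so a supercritical run cannot grow without bound.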

Complex tryCatch in loop

I'm writing code to get data from UN Comtrade, a UN database. Because the database has a usage limit of 100 queries per hour, I need to build in a timeout.
I want to write the code with tryCatch so that it will:
Automatically time out every time the max-limit error appears
Rerun the current level of i, j and k if a connection error occurs
My current code works, but I want to learn how to use tryCatch too.
Also, is there a way to get rid of the for loops? Can the apply family of functions be used here?
Thanks, guys.
library(foreign)  # for write.dta()

n <- 0
a <- c()
for (i in 1996:2014) {
  for (j in c("0301","0302","0303","0304","0305","0306","0307","0308")) {
    for (k in c("704","116","360","418","458","104","608","702","764")) {
      s2 <- paste(i, j, k, sep="")
      a <- c(a, s2)
      print(s2)
      n <- n + 1
      if (n <= 100) {
        s1 <- get.Comtrade(r=k, ps=i, rg="2", cc=j, fmt="csv", px="H0")
        Sys.sleep(1)
        s1 <- do.call(rbind.data.frame, s1)
        write.dta(s1, file=paste0("D:/unTrade/", s2, ".dta"))  # paste0: no stray spaces in the path
      } else {
        print(n)
        print(s2)
        print("reset here")
        n <- 0
        Sys.sleep(3610)
      }
    }
  }
}
I can't really help you with the tryCatch(); I don't have the experience myself.
Regarding the for loops, this is one solution (although I think for loops are not that evil in cases like this; vectorization really counts in all kinds of matrix operations, etc.).
dat <- expand.grid(i = 1996:1999,
                   j = c("0301","0302","0303","0304","0305","0306","0307","0308"),
                   k = c("704","116","360","418","458","104","608","702","764"))
library(dplyr)
dat %>% group_by(i, j, k) %>%
  do({
    cat('s1 <- get.Comtrade(r=', .$k, ', ps=', .$i, ', cc=', .$j, ', rg=\"2\", fmt=\"csv\", px=\"H0\")\n')
    flush.console()
    # return(s1)
  })
From your own code, s1 (also) appears to be a data.frame, so in this case dplyr's do() nicely glues all these data frames together.
HTH
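To round this out with a sketch of the tryCatch part (my own addition, not from the answer above): a retry wrapper for a single query might look like the following. The get.Comtrade call mirrors the question's code; the error-message matching and the wait times are assumptions you would adapt to the API's actual behaviour:
fetch_one <- function(i, j, k, max_tries = 5) {
  for (attempt in seq_len(max_tries)) {
    s1 <- tryCatch(
      get.Comtrade(r = k, ps = i, rg = "2", cc = j, fmt = "csv", px = "H0"),
      error = function(e) e   # capture the condition instead of stopping
    )
    if (!inherits(s1, "error")) {
      return(do.call(rbind.data.frame, s1))
    }
    if (grepl("limit", conditionMessage(s1), ignore.case = TRUE)) {
      Sys.sleep(3610)  # assumed rate-limit error: wait out the hourly window
    } else {
      Sys.sleep(5)     # assumed transient connection error: short pause, retry
    }
  }
  stop("giving up on ", paste(i, j, k), " after ", max_tries, " attempts")
}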

How to iteratively update a variable with its previous value in R with vectorization?

I have a piece of R code like this:
s <- vector()
s[1] <- 1
for (i in 2:10) {
  ds <- 1/i/s[i-1]
  s[i] <- s[i-1] + ds
}
s
The result is:
[1] 1.000000 1.500000 1.722222 1.867384 1.974485 2.058895 2.128281 2.187014
[9] 2.237819 2.282505
The above is a toy example; what I am actually trying to do is update s[i] using some function of s[i-1]. Please help me with this case. I have tried the following:
s <- vector()
s[1] <- 1
sapply(2:10, FUN = function(i){
  if (i == 1) {
    return(s[i])
  }
  ds <- 1/i/s[i-1]
  # print(paste(i, s_new, ds, ds+s[i-1]))
  s[i] <- s[i-1] + ds
  return(s[i])
})
The above does not work.
I think that using <<- will let you assign directly to the global version of s. This is not something I do frequently; see this question for good details on how environments work in R.
I'm not crazy about the version below, mostly because it mixes return() with a direct assignment via <<-. At this point, it seems like you might as well use a loop, as mentioned in the comments.
s <- vector()
s[1] <- 1
sapply(2:10, FUN = function(i){
  if (i == 1) {
    return(s[i])
  }
  ds <- 1/i/s[i - 1]
  s[i] <<- s[i - 1] + ds
})
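As a side note (my own addition, not from the thread), base R's Reduce() with accumulate = TRUE expresses exactly this kind of fold over the previous value without a loop or <<-:
# Each step combines the accumulated value prev with the next index i;
# accumulate = TRUE keeps the whole sequence of intermediate values.
s <- Reduce(function(prev, i) prev + 1/i/prev,
            x = 2:10, init = 1, accumulate = TRUE)
s
# 1.000000 1.500000 1.722222 1.867384 ... (matches the loop output above)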

My function works in R console but not in R script

I tried to write some functions to calculate ANOVA power and sample size using the non-centrality parameter.
There are some very good functions for this in R, but mine were written to learn and reproduce the line of thought from a biostatistics book...
Despite the math involved, my "nc" and "fpower" functions work well, as expected:
nc <- function(diff, n, sd) {
  nonc <- (diff^2/2)*(n/sd^2)
  return(nonc)
}
fpower <- function(k, n, diff, sd, alpha=0.05) {
  nonc <- nc(diff, n, sd)
  dfn <- k - 1
  dfd <- k*(n-1)
  f1 <- qf(1-alpha, dfn, dfd)
  f2 <- pf(f1, dfn, dfd, nonc)
  return(1-f2)
}
However, my "fsample" just doesn't work as expected. It returns 2, the first n in the sequence.
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k, n, sd, alpha) >= power) break
  }
  return(n)
}
But if I "hand run" this code in the console, it works as expected and returns the right n value.
What's wrong?
You didn't pass the diff argument to fpower, so the arguments aren't in the order you think they are. fsample should be:
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k, n, diff, sd, alpha) >= power) break
  }
  return(n)
}
Note that this wouldn't have been a problem if you had named the arguments when calling fpower, because you would have received an error about diff being missing and having no default value:
# this will error
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k=k, n=n, sd=sd, alpha=alpha) >= power) break
  }
  return(n)
}
Also, you might want to avoid giving data objects the same names as existing functions (e.g. diff, sd, and power are all functions too), otherwise you may confuse yourself.
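As a quick check of the corrected version (the values here are illustrative, not from the original post):
# With diff passed through, fsample scans n = 2, 3, ... and returns the
# first per-group sample size whose power reaches the target.
n_req <- fsample(k = 4, diff = 1, sd = 1)
n_req
fpower(k = 4, n = n_req, diff = 1, sd = 1)  # should be >= 0.9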
