I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example, Z(42,3) is the element in the 9th row and the 4th column.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (i.e. w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I had found.
However, I want to find a way to estimate all 12 parameters at once (b, z0 and the 10 values of F(w)), as b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind,lapply(w,function(w)
z0*F[w+1]*(1-exp(-b*u/F[w+1]))+rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format: the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
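For reference, the same wide-to-long step can be sketched in base R without reshape2. This is only an illustration on a tiny made-up matrix (the names m_small, u_small, w_small and df_small are hypothetical); it relies on as.vector() unrolling a matrix column by column.

```r
# Tiny stand-in for the data matrix: rows are u, columns are w
u_small <- 1:3
w_small <- 0:1
m_small <- matrix(c(11, 12, 13, 21, 22, 23), nrow = 3)

# as.vector() stacks the columns, so u cycles fastest and w changes once per column
df_small <- data.frame(
  u = rep(u_small, times = length(w_small)),
  w = rep(w_small, each = length(u_small)),
  z = as.vector(m_small)
)
print(df_small)
```

Note that in the question the rows run from u=50 down to u=1, so with the real data the u vector would have to be supplied in that order.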
Now we have a data frame df with columns u, w, and z. The nls.lm(...) call used here takes (at least) 4 arguments: par is a vector of initial estimates of the parameters of the fit, fn is a function that calculates the residuals at each step, and observed and xx are extra arguments that nls.lm(...) passes through to fn. observed is the dependent variable (z), and xx is a vector or matrix containing the independent variables (u, w).
Next we define a function, f(pars, xx), where pars is an 11-element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9; F(0) is not estimated because you state that it is known to be 1. xx has two columns: the values for u and w respectively. f(pars,xx) then calculates an estimate of the response z for all values of u and w, for the given parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
  z0 <- pars[1]
  b <- pars[2]
  F <- c(1,pars[3:11])
  u <- xx[,1]
  w <- xx[,2]
  z <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
  return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
fn = resids, observed = df$z, xx = df[,c("u","w")],
control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df,aes(x=u, color=factor(w)))+
geom_point(aes(y=z))+ geom_line(aes(y=pred))
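If minpack.lm is not available, a joint fit of all parameters can also be sketched with base R's optim(...). This is a rough sketch rather than the method used above: it minimizes the sum of squared residuals with BFGS and works on the log scale of the parameters (an assumption made here so that z0, b and F stay positive). The names mu, sse, grid and theta0 are made up for the illustration, and the starting values are arbitrary.

```r
set.seed(42)
u <- 1:50
w <- 0:9
F_true <- 10 / (10 + w^2)

# Model: z0 * F(w) * (1 - exp(-b*u/F(w))), with Fw holding F(0), ..., F(9)
mu <- function(z0, b, Fw, u, w) {
  Fv <- Fw[w + 1]
  z0 * Fv * (1 - exp(-b * u / Fv))
}

# Artificial long-format data with N(0,1) noise
grid <- expand.grid(u = u, w = w)
grid$z <- mu(100, 0.02, F_true, grid$u, grid$w) + rnorm(nrow(grid))

# theta = log(z0), log(b), log(F(1)), ..., log(F(9)); F(0) is fixed at 1
sse <- function(theta) {
  p <- exp(theta)
  sum((grid$z - mu(p[1], p[2], c(1, p[3:11]), grid$u, grid$w))^2)
}

theta0 <- log(c(50, 0.05, rep(0.5, 9)))   # arbitrary starting values
fit <- optim(theta0, sse, method = "BFGS", control = list(maxit = 500))
est <- exp(fit$par)                        # back to the original scale
print(round(est, 4))
```

As with nls.lm(...), it is worth checking fit$convergence and plotting the fitted curves before trusting the estimates.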
I need to conduct Gaussian Maximum Likelihood Classification for 1000 data sets of two classes of bivariate Gaussian distributions, each with 100 data points.
Here is the code to create the data sets:
# mean vector for two classes
mean1<-c(70,130) ; mean2<-c(148,160)
# covariance matrix for two classes
cov1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
cov2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(MASS)
# Number of samples
nrs <- 1000
# sample size
ss <- 100
# number of dimensions
d <- length(mean1)
set.seed(1)
# generation of bivariate normal random variables based on mean vector and covariance matrix for each class
refdata_1 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean1, Sigma = cov1 ),ncol = d,nrow = ss),simplify=FALSE)
refdata_2 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean2, Sigma = cov2 ),ncol = d,nrow = ss),simplify=FALSE)
# calculation of mean vector for each sample of random reference data
mean_refdata_1 <- lapply(refdata_1,colMeans)
mean_refdata_2 <- lapply(refdata_2,colMeans)
# calculation of covariance matrix for each sample of random reference data
cov_refdata_1 <- lapply(refdata_1,cov)
cov_refdata_2 <- lapply(refdata_2,cov)
Now, I need to plot the decision boundary between the two classes for each of the 1000 data sets (thus 1000 decision boundaries).
This is the vector of the data points:
x = vector(SR,var('a,b'))
Here is the decision equation (if you wonder where the ln p(class) part is, both classes have same probability and thus cancel each other out):
decision1 =-0.5*log(det(cov1))-0.5*((x-mean1)*cov1.inverse()*(x-mean1))
decision2 =-0.5*log(det(cov2))-0.5*((x-mean2)*cov2.inverse()*(x-mean2))
If decision1(data point) > decision2(data point), then the data point belongs to class 1. The decision boundary is where decision1 - decision2 == 0. The data points come from RGB images, so a in the data vector x ranges over 0:255. I solve the equation for b:
solve(decision1-decision2==0,b)
In R, that looks for the original data set like this:
m_1<-c(70,130) ; m_2<-c(148,160)
covma_1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
covma_2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(rSymPy)
c11 <- Var("c11")
c12 <- Var("c12")
c13 <- Var("c13")
c14 <- Var("c14")
sympy("covma_1 = Matrix([[c11,c12], [c13,c14]])")
a <- Var("a")
b <- Var("b")
sympy("x = Matrix([a,b])")
m11 <- Var("m11")
m12 <- Var("m12")
sympy("m_1 = Matrix([m11,m12])")
sympy("covma_1=covma_1.subs(c11,784)")
sympy("covma_1=covma_1.subs(c12,-546)")
sympy("covma_1=covma_1.subs(c13,-546)")
sympy("covma_1=covma_1.subs(c14,900)")
sympy("m_1=m_1 .subs(m11,70)")
sympy("m_1=m_1 .subs(m12,130)")
first <-sympy("-0.5*log(covma_1.det())")
second <-sympy("-0.5*((x-m_1).T*covma_1.inv()*(x-m_1))")
second<-gsub("\\[","",second)
second<-gsub("\\]","",second)
c21 <- Var("c21")
c22 <- Var("c22")
c23 <- Var("c23")
c24 <- Var("c24")
sympy("covma_2 = Matrix([[c21,c22], [c23,c24]])")
m21 <- Var("m21")
m22 <- Var("m22")
sympy("m_2 = Matrix([m21,m22])")
sympy("covma_2=covma_2.subs(c21,484)")
sympy("covma_2=covma_2.subs(c22,285.1)")
sympy("covma_2=covma_2.subs(c23,285.1)")
sympy("covma_2=covma_2.subs(c24,324)")
sympy("m_2=m_2.subs(m21,148)")
sympy("m_2=m_2.subs(m22,160)")
third <-sympy("-0.5*log(covma_2.det())")
fourth <-sympy("-0.5*((x-m_2).T*covma_2.inv()*(x-m_2))")
fourth<-gsub("\\[","",fourth)
fourth<-gsub("\\]","",fourth)
class1 <- paste(c(first,second),collapse="")
class2 <- paste(c(third,fourth),collapse="")
sympy(paste(c("hm=solve(",class2,"-","(",class1,")",",b)"), collapse = ""))
As you can see, I use very nasty string operations to pass the expressions to sympy. Anyway, after solving for b in sympy, I am stuck and don't know how to get numeric values. Can somebody tell me how to solve symbolically for b and plot the result in a loop for 1000 data sets? I'm also open to non-symbolic approaches.
Any help is appreciated!
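One possible non-symbolic route is to evaluate decision1 - decision2 on a grid over 0:255 and take the zero-level contour as the boundary. The following is a minimal sketch under that assumption, using the original means and covariances; the helper name discriminant and the variables grid, diff12 and boundary are made up for the illustration.

```r
# Quadratic discriminant g(x) = -0.5*log(det(S)) - 0.5*(x-mu)' S^-1 (x-mu),
# evaluated for every row of the matrix X
discriminant <- function(X, mu, S) {
  d <- sweep(X, 2, mu)                      # subtract the mean from each row
  -0.5 * log(det(S)) - 0.5 * rowSums((d %*% solve(S)) * d)
}

mean1 <- c(70, 130); mean2 <- c(148, 160)
cov1 <- matrix(c(784, -546, -546, 900), 2, 2, byrow = TRUE)
cov2 <- matrix(c(484, 285.1, 285.1, 324), 2, 2, byrow = TRUE)

# Grid over the 0:255 range of both components of x
a <- 0:255
b <- 0:255
grid <- as.matrix(expand.grid(a = a, b = b))
diff12 <- matrix(discriminant(grid, mean1, cov1) - discriminant(grid, mean2, cov2),
                 nrow = length(a))          # rows index a, columns index b

# The boundary is the zero-level contour of decision1 - decision2
boundary <- contourLines(a, b, diff12, levels = 0)
# contour(a, b, diff12, levels = 0, drawlabels = FALSE)  # draws the boundary
```

For the simulated samples, the same function could be called in a loop with mean_refdata_1[[i]], cov_refdata_1[[i]] and their class-2 counterparts in place of the fixed means and covariances, adding each contour to one plot with add = TRUE.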
This is a follow up question to a previous post (How to modify slots lme4 >1.0). I have a similar pairwise data structure and want the random effect to consider both "pops" in the pair. I have a functional random intercept model using the code previously suggested:
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,40,22,18,18,18,18,45,10,47,67,5,6))
#build random effects matrix
Zl<-lapply(c("pop1","pop2"),function(nm)Matrix:::fac2sparse(dat[[nm]],"d",drop=FALSE))
ZZ<-Reduce("+",Zl[-1],Zl[[1]])
#specify model structure
mod<-lFormula(Y~X+(1|pop1),data=dat,REML=TRUE)
#replace slot
mod$reTrms$Zt <- ZZ
#fit model
dfun<-do.call(mkLmerDevfun,mod)
opt<-optimizeLmer(dfun)
mkMerMod(environment(dfun),opt,mod$reTrms,fr=mod$fr)
However, when attempting to add a random slope variable:
mod2<-lFormula(Y~X+(1+X|pop1),data=dat,REML=TRUE)
mod2$reTrms$Zt <- ZZ
dfun<-do.call(mkLmerDevfun,mod2)
This results in the same error identified in the previous post (where the issue was calling the wrong data frame): "Error in Lambdat %*% Ut :
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82"
View lm for each pop
plot(1,type="n",xlim=c(0,150),ylim=c(0,75),ylab = "Y",xlab="X")
for(i in 1:length(unique(c(dat$pop1,dat$pop2)))){
subdat<-dat[which(dat$pop1==i | dat$pop2==i),]
out<-summary(lm(subdat$Y~subdat$X))
x=subdat$X
y=x*out$coefficients[2,1]+out$coefficients[1,1]
lines(x,y,col=i)
}
legend(125,60,1:6,col=1:6,lty=1,title="Pop")
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,32,22,29,32,38,44,45,51,47,67,59,61))
It helps to try to understand what the original code is actually doing:
## build random effects matrix
## 1. sparse dummy-variable matrices for each population ID
Zl <- lapply(dat[c("pop1","pop2")],
Matrix::fac2sparse,to="d",drop.unused.levels=FALSE)
## 2. take the sum of all components of the list of dummy-variable matrices ...
ZZ <- Reduce("+",Zl[-1],Zl[[1]])
The Reduce form is convenient in general if we have a long list, but it helps to see that in this case it's just Zl[[1]]+Zl[[2]] ...
all.equal(Zl[[1]]+Zl[[2]],ZZ) ## TRUE
What does this RE structure look like?
library(gridExtra)
grid.arrange(
image(t(Zl[[1]]),main="pop 1",sub="",xlab="pop",ylab="obs"),
image(t(Zl[[2]]),main="pop 2",sub="",xlab="pop",ylab="obs"),
image(t(ZZ),main="combined",sub="",xlab="RE",ylab="obs"),
nrow=1)
For the random slope, I think we want to take each filled element of ZZ and replace it with the X value observed for the corresponding observation/row of dat: the indexing here is a bit obscure - in this case it boils down to there being 2 filled values in each row of Z/column of Zt (the @p slot of the sparse matrix gives a zero-indexed pointer to the first non-zero element in each column ...)
vals <- dat$X[rep(1:(length(ZZ@p)-1),diff(ZZ@p))]
ZZX <- ZZ
ZZX@x <- vals
image(t(ZZX))
library(lme4)
mod <- lFormula(Y~X+(X|pop1),data=dat,REML=TRUE)
## replace slot
mod$reTrms$Zt <- rbind(ZZ,ZZX)
## fit model
dfun <- do.call(mkLmerDevfun,mod)
opt <- optimizeLmer(dfun)
m1 <- mkMerMod(environment(dfun),opt,mod$reTrms,fr=mod$fr)
This seems to work, but you should certainly check it with your own knowledge of what's supposed to be going on here ...
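To make the p-slot indexing used for ZZX less obscure, here is a small self-contained sketch on a made-up 3x4 matrix (the names Zt, ZtX and X are hypothetical stand-ins for ZZ, ZZX and dat$X). It assumes the matrix is stored in the default column-compressed format.

```r
library(Matrix)

# Toy Zt: 3 populations (rows) x 4 observations (columns),
# with two populations involved in each observation
i <- c(1, 2,  1, 3,  2, 3,  1, 2)
j <- rep(1:4, each = 2)
Zt <- sparseMatrix(i = i, j = j, x = rep(1, 8))

Zt@p          # zero-indexed column pointers: 0 2 4 6 8
diff(Zt@p)    # number of non-zero entries per column: 2 2 2 2

# Covariate value of each observation, repeated once per non-zero entry of its column
X <- c(10, 20, 30, 40)
ZtX <- Zt
ZtX@x <- X[rep(seq_len(ncol(Zt)), diff(Zt@p))]
as.matrix(ZtX)
```

Each non-zero entry in column j of ZtX now carries the covariate value of observation j, which is what the random-slope block of Zt needs.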
I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each normally distributed) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n,probs,corrs) {
  tmp <- normalCopula( corrs, dim=4 , "un")
  getSigma(tmp) ## test
  x <- rCopula(n, tmp) ## use the requested sample size
  x2 <- x
  x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
  x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs,targetcorrs,probs,n=1000) {
cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
if (is(cc,"try-error")) return(NA)
sum((cc[lower.tri(cc)]-targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5),
method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).
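A different route, sketched here as an alternative to the optimization above and not as part of it, is to generate four latent standard normals and adjust their correlations analytically before dichotomizing. Assuming both binary traits are cut at the median (probability 0.5, as in probs), a latent correlation rho becomes rho*sqrt(2/pi) between a normal and a dichotomized normal, and (2/pi)*asin(rho) between two dichotomized normals, so the targets can be inverted directly. The names target, latent and Sigma are made up for the illustration.

```r
target <- c(0.23, 0.55, 0.45, 0.40, 0.50, 0.47)  # same order as corrs above

# Convert target correlations into correlations of the latent normals,
# assuming both binary traits are cut at the median (probability 0.5)
latent <- target
latent[2:5] <- target[2:5] / sqrt(2 / pi)   # normal vs. binary pairs
latent[6]   <- sin(pi * target[6] / 2)      # binary vs. binary pair

# Build the symmetric latent correlation matrix (lower triangle filled column-wise)
Sigma <- diag(4)
Sigma[lower.tri(Sigma)] <- latent
Sigma <- Sigma + t(Sigma) - diag(4)

set.seed(101)
n <- 1e5
# Rows are draws from N(0, Sigma); chol() errors if Sigma is not positive definite
X <- matrix(rnorm(n * 4), n) %*% chol(Sigma)
X[, 3:4] <- (X[, 3:4] > 0) * 1              # dichotomize Z1 and Z2 at the median
obs <- cor(X)[lower.tri(diag(4))]
print(round(rbind(target, obs), 3))
```

If chol(Sigma) fails, the requested correlations are not attainable with these marginals under this construction, which is one way to check possibility (1) above. Other cut points would need different adjustment formulas.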
Attempting to do loess on two variables x and y in R using MA normalization (http://en.wikipedia.org/wiki/MA_plot) like this:
> x = rnorm(100) + 5
> y = x + 0.6 + rnorm(100)*0.8
> m = log2(x/y)
> a = 0.5*log2(x*y)
I want to normalize x and y in such a way that the average m is 0, as in standard MA normalization, and then back-calculate the correct x and y values. First running loess on MA:
> l = loess(m ~ a)
What is the way to get corrected m values then? Is this correct?
> mc <- predict(l, a)
# original MA plot
> plot(a,m)
# corrected MA plot
> plot(a,m-mc)
It is not clear to me what predict actually does in the case of loess objects and how it differs from using l$residuals in the object l returned by loess. Can someone explain?
Finally, how can I back-calculate new x and y values based on this correction?
First, yes, your proposed method gets the corrected m values.
Regarding the predict function: yes, l$residuals, m - fitted(l), and m - predict(l) all give the same result: the corrected m values. However, the predict function is more general: it will take any new values as input. This is useful if you want to use only a subset of the data to fit the loess, and then predict on the totality of the data (for example, when using spiked-in standards).
Finally, how can you back calculate new x and y values based on this correction? If you transform your data into log-space, by creating two new variables x1 <- log2(x) and y1 <- log2(y), it becomes easier to see. Since we're in log-space, calculating m and a is simpler:
m <- x1 - y1
a <- (x1 + y1)/2
Now, for correcting your data based on the fitted loess model, instead of updating the m variable by your mc correction, you can update x1 and y1 instead. Set:
x1 <- x1 - mc / 2
y1 <- y1 + mc / 2
This update has the same effect as updating m <- m - mc (because m will be recomputed as the difference between the updated x1 and y1) and has no effect on the a value.
To get your corrected data out, transform them by returning 2^x1 and 2^y1.
This is the method as used by the authors of the normalize.loess function in affy package, as originally described here (and includes the capability to cyclically look at all pairs of variables as opposed to a single pair in this case): http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/normalizeCyclicLoess.html
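The steps above can be put together as a short runnable sketch. The simulated data here are made up (log-normal intensities, so that x and y are guaranteed positive), and the names x1c, y1c, x_new and y_new are only for illustration.

```r
set.seed(1)
x <- exp(rnorm(100, mean = 2, sd = 0.5))     # strictly positive intensities
y <- x * exp(0.3 + rnorm(100, sd = 0.2))     # y is systematically larger than x

# Work in log2 space
x1 <- log2(x)
y1 <- log2(y)
m <- x1 - y1
a <- (x1 + y1) / 2

l  <- loess(m ~ a)
mc <- as.numeric(fitted(l))                  # loess trend of m at each a

# Split the correction evenly between the two variables
x1c <- x1 - mc / 2
y1c <- y1 + mc / 2
x_new <- 2^x1c
y_new <- 2^y1c

# a is unchanged and the new m equals the loess residual
print(all.equal((x1c + y1c) / 2, a))
print(all.equal(x1c - y1c, m - mc))
```

Plotting a against x1c - y1c should show the cloud centered on zero, which is the same picture as plot(a, m - mc) in the question.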
I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- apply(d[test.set, design.set], 1, order)
# Predict class membership of the test set
k <- 5
# NN has one column per test sample; row i holds its i-th nearest neighbor
pred <- apply(NN[1:k, , drop=FALSE], 2, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.
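For binary (0/1) features the whole procedure can also be sketched in base R, without e1071. This is a minimal illustration on made-up data, and the helper names hamming_matrix and knn_predict are hypothetical; features with more than two categories would need a different distance computation.

```r
# Hamming distance between every row of a and every row of b (0/1 features only)
hamming_matrix <- function(a, b) {
  a <- a * 1
  b <- b * 1
  a %*% t(1 - b) + (1 - a) %*% t(b)   # counts positions where the rows differ
}

# Majority vote among the k nearest training rows for each test row
knn_predict <- function(train, test, cl, k = 3) {
  d <- hamming_matrix(test, train)    # rows: test samples, columns: training samples
  apply(d, 1, function(row) {
    nn <- order(row)[seq_len(k)]
    votes <- table(cl[nn])
    names(votes)[which.max(votes)]
  })
}

train <- rbind(c(1, 1, 1, 0), c(1, 1, 0, 0), c(1, 1, 1, 1),
               c(0, 0, 0, 1), c(0, 0, 1, 1), c(0, 0, 0, 0))
cl <- c("a", "a", "a", "b", "b", "b")
test <- rbind(c(1, 1, 0, 1), c(0, 0, 1, 0))

pred <- knn_predict(train, test, cl, k = 3)
print(pred)   # "a" "b"
```

Ties in the vote are resolved by which.max, which returns the first class in table order, so an odd k is a sensible choice for two classes.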