How to do MA and loess normalization in R?

I'm attempting to run loess on two variables x and y in R using MA normalization (http://en.wikipedia.org/wiki/MA_plot), like this:
> x = rnorm(100) + 5
> y = x + 0.6 + rnorm(100)*0.8
> m = log2(x/y)
> a = 0.5*log2(x*y)
I want to normalize x and y so that the average m is 0, as in standard MA normalization, and then back-calculate the corrected x and y values. First, I run loess of m on a:
> l = loess(m ~ a)
What is the way to get corrected m values then? Is this correct?
> mc <- predict(l, a)
# original MA plot
> plot(a,m)
# corrected MA plot
> plot(a,m-mc)
It's not clear to me what predict actually does in the case of loess objects, and how it differs from using l$residuals in the object returned by loess. Can someone explain?
Finally, how can I back-calculate new x and y values based on this correction?

First, yes, your proposed method gets the corrected m values.
Regarding the predict function: yes, l$residuals, m - fitted(l), and m - predict(l) all give the same result: the corrected m values. However, predict is more general: it will accept any new values as input. This is useful if you want to fit the loess on only a subset of the data and then predict on the totality of the data (for example, when using spiked-in standards).
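For example, a minimal sketch of that subset-then-predict pattern (the subset indices here are hypothetical):
# fit the loess on a subset (e.g. spike-in controls), predict on everything
d <- data.frame(a = a, m = m)
sub <- 1:50                            # hypothetical subset of control points
l_sub <- loess(m ~ a, data = d[sub, ])
mc_all <- predict(l_sub, newdata = d)  # returns NA outside the fitted range of a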
Finally, how can you back-calculate new x and y values based on this correction? If you transform your data into log-space by creating two new variables x1 <- log2(x) and y1 <- log2(y), it becomes easier to see. Since we're in log-space, calculating m and a is simpler:
m <- x1 - y1
a <- (x1 + y1)/2
Now, for correcting your data based on the fitted loess model, instead of updating the m variable by your mc correction, you can update x1 and y1 instead. Set:
x1 <- x1 - mc / 2
y1 <- y1 + mc / 2
This update has the same effect as updating m <- m - mc (because m will be recomputed as the difference between the updated x1 and y1) and has no effect on the a value.
To get your corrected data back on the original scale, return 2^x1 and 2^y1.
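Putting the whole correction together, a minimal sketch using the x and y simulated in the question:
x1 <- log2(x); y1 <- log2(y)
m  <- x1 - y1
a  <- (x1 + y1) / 2
fit <- loess(m ~ a)
mc  <- predict(fit, a)   # loess estimate of the bias at each a
x1  <- x1 - mc / 2
y1  <- y1 + mc / 2
x_corrected <- 2^x1
y_corrected <- 2^y1
mean(x1 - y1)            # average corrected m; should be close to 0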
This is the method used by the authors of the normalize.loess function in the affy package, and it generalizes to cyclically visiting all pairs of variables (rather than the single pair in this case) in limma's normalizeCyclicLoess: http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/normalizeCyclicLoess.html
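If you would rather lean on that implementation directly, a minimal sketch (assuming limma is installed and a matrix of log2 intensities as above):
library(limma)
mat <- cbind(x1 = log2(x), y1 = log2(y))  # columns of log2 intensities
mat_norm <- normalizeCyclicLoess(mat)     # pairwise loess normalization, default settings
head(2^mat_norm)                          # back to the original scale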


How to generate samples from MVN model?

I am trying to run some code in R based on example 5.1 of a paper. I want to simulate the following:
My background in R isn't great, so I have the code below. How can I generate samples and a histogram from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
  u <- rbinom(1, 1, 0.4)
  z <- rnorm(1)
  return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
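For instance, a sketch of that vectorized route, drawing all 5000 u's and z's at once with no explicit function needed:
u <- rbinom(5000, 1, 0.4)   # 5000 Bernoulli draws
z <- rnorm(5000)            # 5000 standard normals
betadraws <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws)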
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
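A sketch of that wrapping (reusing betadraws from above; each call draws a fresh x matrix and epsilon vector):
geny <- function() {
  x <- replicate(5000, rnorm(17))       # 17 x 5000 matrix
  epsilon <- rnorm(17)
  as.vector(x %*% betadraws + epsilon)  # one Y vector of length 17
}
ydraws <- replicate(100, geny())        # 17 x 100: one Y vector per column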
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Calculation of DFFITS as diagnostic for Leverage and Influence in regression

I am trying to calculate DFFITS by hand. The value obtained should equal the first value returned by the dffits function. However, there must be something wrong with my own calculation.
attach(cars)
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function
Your numerator is wrong. Since you removed the first datum when fitting the second model, the corresponding predicted value is not in fitted(x2). We need predict(x2, cars[1, ]) in place of fitted(x2)[1].
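In code:
num_dffits <- x1$fitted.values[1] - predict(x2, cars[1, ]) # corrected numerator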
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a generic function for getting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
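Putting these pieces together, a quick check (assuming cars is attached as in the question):
num   <- fitted(x1)[1] - predict(x2, cars[1, ])
denom <- sqrt(c(crossprod(x2$residuals)) / x2$df.residual * hatvalues(x1)[1])
num / denom     # manual DFFITS for the first observation
dffits(x1)[1]   # should agree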

Using uniroot function in r with data from data frame to solve Euler-Lotka equation

I'd like to solve an equation for a variable for each line of a given csv file.
You may know the equation as the Euler-Lotka equation.
This is what I have so far:
# seed is needed for reproducible results (otherwise random numbers will never be the same!)
set.seed(42)
# using the Euler-Lotka equation
# l = survival rate until age x
# m = amount of offspring at age x
# x = age of reproduction
# r = population growth rate
y <- function(r, l1, l2, l3, m1, m2, m3, x1, x2, x3, z){((l1*m1*exp(-r*x1)) + (l2*m2*exp(-r*x2)) + (l3*m3*exp(-r*x3))) - z}
# iterate through each line calculating r and writing it into the respective field
for (i in 1:length(neos_data$jar_no)){
  # declare the variables from the table (this does not work!!)
  l1 <- neos_data$surv_rate_clutch1[i]
  l2 <- neos_data$surv_rate_clutch2[i]
  l3 <- neos_data$surv_rate_clutch3[i]
  m1 <- neos_data$indiv_sum_1_clutch[i]
  m2 <- neos_data$indiv_sum_2_clutch[i]
  m3 <- neos_data$indiv_sum_3_clutch[i]
  x1 <- neos_data$age_clutch_1[i]
  x2 <- neos_data$age_clutch_2[i]
  x3 <- neos_data$age_clutch_3[i]
  # this works, even though these numbers are the same as in the data frame
  l1 <- 0.9333333
  l2 <- 0.9333333
  l3 <- 0.9333333
  m1 <- 3.4
  m2 <- 0
  m3 <- 0
  x1 <- 9
  x2 <- 13
  x3 <- 16
  # uniroot finds a 0 value, so the function is offset; that's why -z in the formula above
  r <- uniroot(y, l1=l1, l2=l2, l3=l3, m1=m1, m2=m2, m3=m3, x1=x1, x2=x2, x3=x3,
               z = 1, interval = c(-1, 1))[1] # writing only the result of r into the variable
  # write r into the table
  neos_data$pop_gr[i] <- r
}
As I already commented, uniroot works fine with manual input of values. But when I try to load a value from my data frame, it gives the error "values of f() have the same sign".
I do understand the meaning of the error itself, but why does it work with the values I insert manually and not with the same values from the data frame (and yes, I have checked the data types).
Would be glad for any help, as what I've seen so far was not helpful in my case :)
EDIT:
To clarify: I'd like to get a value of r for which the equation becomes 0. This works fine with the given code as long as I insert the values of the variables as numbers. But when I try to pass the values from my data frame, it fails even though the same values are passed.
Ok, I think I've found the problem.
There are some lines where each part of the sum becomes 0. Whenever the loop hits these 0s, the error occurs and the whole thing fails.
This seems natural as the equation is:
1 = SUM( l(x) * m(x) * exp(-r*x) )
if all l(x) and m(x) are 0 the equation cannot become 1, of course.
I didn't notice this at first because the script didn't work at all. Now, after rewriting and deleting code, it writes the resulting r into the data frame up to the first line of 0s, which brought me to this conclusion.
Why does this always happen after hours of trying? :D
However, to solve this issue, I inserted a 0.0001 in the respective fields just to get the loop running. In my case I just want to copy the r values to my data master sheet. As there are only 3 lines with all 0s (caused by NAs, which uniroot cannot handle), I will delete those values by hand (NAs won't disturb any further calculation).
Thanks for your help anyway. It pointed me in the right direction :)
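For reference, a defensive version of the body of the loop above is sketched below; it skips rows where every l*m product is zero or NA (so the sum can never reach 1) instead of patching the data by hand:
lm_sum <- l1*m1 + l2*m2 + l3*m3
if (is.na(lm_sum) || lm_sum == 0) {
  neos_data$pop_gr[i] <- NA   # no root can exist for this row; skip it
} else {
  r <- uniroot(y, l1 = l1, l2 = l2, l3 = l3,
               m1 = m1, m2 = m2, m3 = m3,
               x1 = x1, x2 = x2, x3 = x3,
               z = 1, interval = c(-1, 1))$root  # $root extracts a plain number
  neos_data$pop_gr[i] <- r
}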

R: interaction between continuous and categorical vars in 'isat' regression ('gets' package)

I want to calculate the differential response of y to x (continuous) depending on the categorical variable z.
In the standard lm setup:
lm(y~ x:z)
However, I want to do this while allowing for Impulse Indicator Saturation (IIS) in the 'gets' package. However, the following syntax produces an error:
isat(y, mxreg=x:z, iis=TRUE)
The error message is of the form:
Error in solve.qr(out, tol = tol, LAPACK = LAPACK) :
  singular matrix 'a' in 'solve'
with warnings:
1: In x:z : numerical expression has 96 elements: only the first used
2: In x:z : numerical expression has 96 elements: only the first used
How should I modify the syntax?
Thank you!
At the moment, alas, isat doesn't provide the same functionality as lm on categorical/character variables, nor on using * and :. We hope to address that in a future release.
In the meantime you'll have to create distinct variables in your dataset representing the interaction. I guess something like the following...
library(gets)
N <- 100
x <- rnorm(N)
z <- c(rep("A",N/4),rep("B",N/4),rep("C",N/4),rep("D",N/4))
e <- rnorm(N)
y <- 0.5*x*as.numeric(z=="A") + 1.5*x*as.numeric(z=="B") - 0.75*x*as.numeric(z=="C") + 5*x*as.numeric(z=="D") + e
lm.reg <- lm(y ~ x:z)
arx.reg.0 <- arx(y,mxreg=x:z)
data <- data.frame(y,x,z,stringsAsFactors=F)
for(i in z[duplicated(z)==F]) {
  data[[paste("Zx",i,sep=".")]] <- data$x * as.numeric(data$z==i)
}
arx.reg.1 <- arx(data$y,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")])
isat.1 <- isat(data$y,mc=TRUE,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")],max.block.size=20)
Note that as you'll be creating dummies for each category, there's a chance those dummies will cause singularity of your matrix of explanatory variables (if, as in my example, isat automatically uses 4 blocks). Using the argument max.block.size enables you to avoid this problem.
Let me know if I haven't addressed your particular point.

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the side of the rows and w=0,...,9 along the columns). For example as I haven't explained that great, Z(42,3) will be the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I found.
However, I wanted to find a way to estimate all 12 parameters at once (b, z0 and the 10 values of F(w)), since b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind, lapply(w, function(w)
  z0*F[w+1]*(1-exp(-b*u/F[w+1])) + rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format - the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with a separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
Now we have a data frame df with columns u, w, and z. The nls.lm(...) function needs (at least) two arguments: par, a vector of initial estimates of the fit parameters, and fn, a function that calculates the residuals at each step. Any additional named arguments (here observed, the dependent variable z, and xx, a matrix containing the independent variables u and w) are passed through to fn.
Next we define a function, f(par, xx), where par is an 11-element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9 (F(0) is known to be 1, so it is not fitted). xx is a matrix with two columns: the values of u and w respectively. f(par, xx) then calculates the estimated response z for all values of u and w, given the parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
  z0 <- pars[1]
  b <- pars[2]
  F <- c(1, pars[3:11])
  u <- xx[,1]
  w <- xx[,2]
  z <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
  return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
                  fn = resids, observed = df$z, xx = df[,c("u","w")],
                  control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df, aes(x=u, color=factor(w))) +
  geom_point(aes(y=z)) +
  geom_line(aes(y=pred))
