sommer, multivariate linear mixed model analysis, plant breeding applications - R

I had the opportunity to read the sommer documentation, but I was not able to find any example of regression directly on markers (the rrBLUP parametrization), only examples using the kinship parametrization (the GBLUP parametrization). Could you please say whether it is possible
in sommer to regress directly on the markers instead of using the kinship matrix, especially in multivariate scenarios (multiple traits, locations, etc.), modelling an unstructured variance-covariance for the marker effects?

In sommer >= 3.7 it is straightforward to fit an rrBLUP model in the multivariate setting; the DT_cpdata dataset has a good example:
library(sommer)
data(DT_cpdata)
mix.rrblup <- mmer(fixed=cbind(color,Yield)~1,
                   random=~vs(list(GT),Gtc=unsm(2)) + vs(Rowf,Gtc=diag(2)),
                   rcov=~vs(units,Gtc=unsm(2)),
                   data=DT)
summary(mix.rrblup)
A <- A.mat(GT)
mix.gblup <- mmer(fixed=cbind(color,Yield)~1,
                  random=~vs(id,Gu=A, Gtc=unsm(2)) + vs(Rowf,Gtc=diag(2)),
                  rcov=~vs(units,Gtc=unsm(2)),
                  data=DT)
summary(mix.gblup)
The vs() function builds a variance structure for a given random effect, and the covariance structure for the univariate/multivariate setting is provided in the Gtc argument as a matrix, where e.g. a diagonal, unstructured or customized structure can be specified. When the user wants to provide a customized matrix as a random effect, such as a marker matrix GT to do rrBLUP, it has to be wrapped in list() to help sommer internally put it in the right format, whereas in the GBLUP version the random effect id, which holds the labels for the individuals, can take a covariance matrix through the Gu argument.
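For reference, the two Gtc constraint structures used above can be printed directly (a small illustrative snippet; see the sommer documentation for the exact meaning of the constraint codes):
library(sommer)
unsm(2)  # unstructured 2x2 constraint matrix: trait variances plus a free covariance
diag(2)  # diagonal 2x2 constraint matrix: trait variances only, covariance fixed to zero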

Related

How to do a leave-one-out cross validation for CAP/capscale in R vegan?

I would like to perform a "leave-one-out cross validation" (LOO-CV) for a CAP in R. The CAP was calculated with capscale in the R package vegan and is a canonical analysis of principal coordinates, similar to an rda or cca but based on another dissimilarity matrix, in my case Bray-Curtis. I have found that the predict.cca help page also documents a calibrate.cca function, but I cannot make it work.
https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/predict.cca
This is what I have (based on the sample data mite available in vegan)
library(vegan)
data(mite, mite.env)
str(mite.env) #"SubsDens", "WatrCont", "Substrate", "Shrub", "Topo"
miteBC <- vegdist(mite, method="bray") # Bray-Curtis dissimilarity matrix
miteCAP <- capscale(miteBC ~ Substrate + Shrub + Topo, data=mite.env, # CAP via capscale
                    distance = "bray", metaMDSdist = FALSE)
summary(miteCAP)
anova(miteCAP)
anova(miteCAP, by = "axis")
anova(miteCAP, by = "margin")
calibrate.cca(miteCAP, type = c("response")) #error cannot find function calibrate.cca
In the program Primer this is done automatically within the CAP function ("Leave-one-out Allocation of Observations to Groups"), which assigns each sample to a group and gives a misclassification error (similar to a classification randomForest, which I have already done), but I would like to use R, and it should be possible with vegan::capscale.
Any help is very much appreciated!
Function vegan::calibrate does not have an argument type and never returns "response"; check its documentation. It does environmental calibration and returns the predicted values of the constraints (Substrate, Shrub, Topo) on the scale of the model matrix, and with factors these hardly make sense directly.
There is no built-in option for LOO: you have to do it by hand, cycling through the points and using the complete left-out point as the newdata. However, I'd suggest k-fold cross-validation as a better alternative for estimating predictive power: LOO changes the data too little and gives an over-optimistic view of predictive power.
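For illustration, a minimal hand-rolled LOO sketch (assuming your vegan version's calibrate() method accepts a newdata argument, see ?predict.cca; with factor constraints the calibrated values stay on the model-matrix scale, so this mainly shows the mechanics of the loop):
library(vegan)
data(mite, mite.env)

loo <- vector("list", nrow(mite))
for (i in seq_len(nrow(mite))) {
  ## refit the CAP without site i, computing Bray-Curtis internally
  fit_i <- capscale(mite[-i, ] ~ Substrate + Shrub + Topo,
                    data = mite.env[-i, ], distance = "bray")
  ## calibrate the constraints for the left-out site
  loo[[i]] <- calibrate(fit_i, newdata = mite[i, , drop = FALSE])
}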

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R to clean a "dirty" data matrix.
A Google search turned up a dazzling range of packages. Some apply to 1-D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go that way, as they typically involve (i) installing more or less complex/heavy dependencies (ImageMagick, ...) and/or (ii) converting the matrix to an image.
Here is sample data:
r <- seq(0, 360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be to use a non-linear prediction method and take the fitted values from the model.
For example, using a polynomial regression on a single column, the fitted values (shown as the purple curve in the original answer's plot) track the underlying signal through the noise.
Following the same logic, you can do the same thing for every column of the zz matrix:
predictions <- matrix(nrow = nrow(zz), ncol = 0)
for (i in 1:ncol(zz)) {
  # fit a quadratic polynomial to column i and keep its fitted values
  pred <- as.matrix(fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns, or use other, more powerful non-linear prediction methods (SVM, ANN, etc.) to get a more accurate model.
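If you prefer a genuine low-pass filter that operates on the matrix directly, here is a minimal base-R sketch of a k x k moving-average (box) filter; box_filter() and the window size k are illustrative choices, and edges are handled by truncating the window:
## simple 2-D box filter: each cell becomes the mean of its k x k neighbourhood
box_filter <- function(m, k = 5) {
  half <- k %/% 2
  out <- m
  for (i in seq_len(nrow(m))) {
    for (j in seq_len(ncol(m))) {
      rows <- max(1, i - half):min(nrow(m), i + half)
      cols <- max(1, j - half):min(ncol(m), j + half)
      out[i, j] <- mean(m[rows, cols])
    }
  }
  out
}
zz_smooth <- box_filter(zz, k = 9)
image(zz_smooth, main = "Box-filtered")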

Writing syntax for bivariate survival censored data to fit copula models in R

library(Sunclarco)
library(MASS)
library(survival)
library(SPREDA)
library(SurvCorr)
library(doBy)
#Dataset
data("diabetes")  # loads the 'diabetes' dataset into the workspace
data1=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME1","STATUS1"))
data2=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME2","STATUS2"))
#Adding variable which identify cluster
data1$CLUSTER<- rep(1,197)
data2$CLUSTER<- rep(2,197)
#Renaming the variables so that we have uniformity in the common items in the data
names(data1)[5] <- "TIME"
names(data1)[6] <- "STATUS"
names(data2)[5] <- "TIME"
names(data2)[6] <- "STATUS"
#merge the files
Total_data=rbind(data1,data2)
# Rearranging the database
diabete_full=orderBy(~LASER+TRT_EYE+AGE_DX,data=Total_data)
diabete_full
#using the Sunclarco package for the Clayton and Gumbel copulas
Clayton_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="Clayton",marginal="Weibull")
summary(Clayton_1step)
# Estimates StandardErrors
#lambda 0.01072631 0.005818201
#rho 0.79887565 0.058942208
#theta 0.10224445 0.090585891
#beta_LASER 0.16780224 0.157652947
#beta_TRT_EYE 0.24580489 0.162333369
#beta_ADULT 0.09324001 0.158931463
# Estimate StandardError
#Kendall's Tau 0.04863585 0.04099436
Clayton_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="Clayton",marginal="Weibull")
summary(Clayton_2step)
# Estimates StandardErrors
#lambda 0.01131751 0.003140733
#rho 0.79947406 0.012428824
#beta_LASER 0.14244235 0.041845100
#beta_TRT_EYE 0.27246433 0.298184235
#beta_ADULT 0.06151645 0.253617142
#theta 0.18393973 0.151048024
# Estimate StandardError
#Kendall's Tau 0.08422381 0.06333791
Gumbel_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="GH",marginal="Weibull")
summary(Gumbel_1step)
# Estimates StandardErrors
#lambda 0.01794495 0.01594843
#rho 0.70636113 0.10313853
#theta 0.87030690 0.11085344
#beta_LASER 0.15191936 0.14187943
#beta_TRT_EYE 0.21469814 0.14736381
#beta_ADULT 0.08284557 0.14214373
# Estimate StandardError
#Kendall's Tau 0.1296931 0.1108534
Gumbel_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="GH",marginal="Weibull")
I am required to fit copula models in R for different copula classes, particularly the Gaussian, FGM, Plackett and possibly Frank (if I still have time). The data I am using is the diabetes data available in R through the packages survival and SurvCorr.
This is for my thesis, an exploratory study of how the different copula classes serve different purposes, in that they lead to different results on the same data. I found the package Sunclarco, with which I was able to fit the Clayton and Gumbel copula classes, but it is not yet available for the other classes.
The challenge I am facing is that the censoring has to be incorporated in the likelihood estimation, which makes it harder for me to write the syntax, as I don't have a strong programming background. In addition, I have to incorporate the covariates and see whether they have an impact on the association or not. However, talking to my promoter, he gave me the following insights on how to approach writing the syntax for this puzzle:
• First of all, forget about the likelihood function. We only work with the log-likelihood function. In this way you do not need to take the product of the contributions over the observations, but can take the sum of the log-contributions over the different observations.
• Next, since we have a balanced design, we can use the regular data frame structure in which we have only one row per cluster. The different variables, such as the lifetimes, the censoring indicators and all the covariates, are the columns of this data frame.
• Due to the bivariate setting, there are only 4 possible ways to contribute to the log-likelihood: both uncensored, both censored, first uncensored and second censored, or first censored and second uncensored. To create the log-likelihood function, you create a new variable in your data frame in which you put the correct log-likelihood contribution based on which individual in the couple is censored. When you take the sum of this variable, you have the value of the log-likelihood function.
• Since this function depends on parameters, you can use any optimizer, like optim or nlm, to get the optimal values. Be careful here: optim and nlm look for the minimum of a function, not a maximum. This is easily solved, since the minimum of -f is attained at the same point as the maximum of f.
• Since you have, for each copula function, the expressions for the derivatives, it should be possible to write the likelihood functions now.
I am still struggling, because the likelihood changes for each copula class, as the generator function is unique to the respective copula and needs to be adapted during estimation. Lastly, I should run the analysis for both one-stage and two-stage copula estimation, as I will use them to compare results.
If someone could help me figure it out I would be eternally grateful, even if just for one copula class (e.g. Gaussian); I would then work out the rest from that one, since I have tried everything, still have nothing to show for it, and feel time is running out to get answers by myself.
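Following the promoter's outline above, here is a minimal sketch for one copula class. It uses the Clayton copula because its partial derivatives are short to write; the same skeleton works for the Gaussian, Plackett, FGM or Frank copulas once their C, dC/du and copula density are plugged in. Everything here is illustrative: Weibull margins S(t) = exp(-lambda * t^rho) without covariates (a parameterization that need not match Sunclarco's), and a hypothetical one-row-per-patient data frame wide with columns T1, T2, D1, D2 (times and event indicators for the two members of each cluster).
clayton_negloglik <- function(par, d) {
  lambda <- exp(par[1]); rho <- exp(par[2]); theta <- exp(par[3])  # keep parameters positive
  S1 <- exp(-lambda * d$T1^rho); S2 <- exp(-lambda * d$T2^rho)     # marginal survival functions
  f1 <- lambda * rho * d$T1^(rho - 1) * S1                         # marginal densities
  f2 <- lambda * rho * d$T2^(rho - 1) * S2
  A  <- S1^(-theta) + S2^(-theta) - 1                              # Clayton copula kernel
  ll <- ifelse(d$D1 == 1 & d$D2 == 1,                              # both uncensored: copula density x f1 x f2
               log((1 + theta) * (S1 * S2)^(-theta - 1) * A^(-1/theta - 2) * f1 * f2),
        ifelse(d$D1 == 1 & d$D2 == 0,                              # first uncensored, second censored
               log(S1^(-theta - 1) * A^(-1/theta - 1) * f1),
        ifelse(d$D1 == 0 & d$D2 == 1,                              # first censored, second uncensored
               log(S2^(-theta - 1) * A^(-1/theta - 1) * f2),
               log(A^(-1/theta)))))                                # both censored: joint survival C(S1, S2)
  -sum(ll)                                                         # optim() minimises, so return the negative
}
## Hypothetical usage: 'wide' holds one row per patient with both times and indicators
## wide <- data.frame(T1 = ..., T2 = ..., D1 = ..., D2 = ...)
## fit <- optim(c(log(0.01), log(1), log(0.5)), clayton_negloglik, d = wide, hessian = TRUE)
## exp(fit$par)  # lambda, rho, theta back on their original scale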

Convert a list (class numeric) into a distance structure in R

I have a list that looks like this; it is a measure of dispersion for each sample:
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To analyze this further I need it to be a distance structure; the vegan package needs it as a 'dist' object.
I found some solutions for converting a matrix to a dist, but how could I change this data into a dist object?
I am using the FD package; in the manual I found:
"Still, one potential advantage of FDis over Rao's Q is that in the unweighted case (i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in FD between two or more communities through a distance-based test for homogeneity of multivariate dispersions (Anderson 2006); see betadisper for more details"
I wanted to use vegan's betadisper function to test whether there are differences among regions (the grouping is provided in the element "region", which also has a column "region").
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
using gowdis or fdisp from FD didn't work too.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this, but I have a list object; I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but there is no information on the differences among SUs. The eigen decomposition can be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can directly use the input dissimilarities. Function FD::gowdis() returns a regular "dist" structure that you can directly use in vegan::betadisper() if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In typical application of fdisp, the units are species (taxa), but it seems you want to get analysis for communities/sites/whatever. This will not be possible with these tools.
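To illustrate the point about matching units, here is a small sketch using made-up trait data; the objects traits, d_species and grp are purely illustrative, and the analysis is only meaningful if your grouping factor classifies the same units as the dissimilarities:
library(FD)     # for gowdis()
library(vegan)  # for betadisper()

## illustrative trait table for 8 hypothetical species
traits <- data.frame(size  = c(2, 4, 7, 1, 5, 9, 3, 6),
                     woody = factor(c("y", "y", "n", "n", "y", "n", "n", "y")))
rownames(traits) <- paste0("sp", 1:8)

d_species <- gowdis(traits)                # Gower dissimilarities among the 8 species
grp <- factor(rep(c("A", "B"), each = 4))  # illustrative grouping of those same species
mod <- betadisper(d_species, grp)          # works: the 'dist' units match the grouping units
anova(mod)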

Confusion about Markov random fields in the mgcv package in R

In order to implement a spatial analysis, I tried a simple Markov random field smoother on an example in the mgcv package in R; the manual page is here:
https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.construct.mrf.smooth.spec.html
This is the example I tried:
library(mgcv)
data(columb) ## data frame
data(columb.polys) ## district shapes list
xt <- list(polys=columb.polys) ## neighbourhood structure info for MRF
b <- gam(crime ~ s(district,bs="mrf",xt=xt),data=columb,method="REML")
However, when I took a look at estimated coefficients in b$coefficients, there are 48 estimates from the Markov random field smoother:
> b$coefficients
(Intercept) s(district).1 s(district).2 s(district).3 s(district).4
35.12882390 -10.96490165 20.99250496 16.04968951 10.49535483
s(district).5 s(district).6 s(district).7 s(district).8 s(district).9
16.56626217 14.55352540 17.90043996 -0.60239588 13.41215603
s(district).10 s(district).11 s(district).12 s(district).13 s(district).14
18.61920671 -11.13853418 -2.95677338 7.89719220 3.04717540
s(district).15 s(district).16 s(district).17 s(district).18 s(district).19
-11.18235328 12.57473374 19.83013619 10.56130003 12.36240748
s(district).20 s(district).21 s(district).22 s(district).23 s(district).24
15.65160761 20.40965885 24.79853590 0.05312873 -14.65881026
s(district).25 s(district).26 s(district).27 s(district).28 s(district).29
-13.01294201 7.16191556 -9.36311304 3.65410713 -16.37092777
s(district).30 s(district).31 s(district).32 s(district).33 s(district).34
11.23500771 13.92036006 -14.67653893 -12.39341674 11.02216471
s(district).35 s(district).36 s(district).37 s(district).38 s(district).39
-12.93210046 -15.48924425 3.42745125 -2.54916472 -1.90604972
s(district).40 s(district).41 s(district).42 s(district).43 s(district).44
-16.25160966 -7.46491914 -4.48126353 -7.61064264 -2.91807488
s(district).45 s(district).46 s(district).47 s(district).48
-12.12765102 6.68446503 2.55883220 -0.20920888
However, the district shapes list shows 49 areas (labelled 0 to 48). When I tried my own data the same thing happened: data with 28 areas produced only 27 estimates from the Markov random field smoother.
My understanding is that the Markov random field, used as a spatial function, can be regarded as a structured random effect; however, the Markov random field smoother in the mgcv package seems to automatically set up the first area as a reference level. I am wondering whether it behaves like a fixed effect, but with spatial autocorrelation taken into account.
If so, an extended problem is how to explain such output. It feels odd that each spatial estimate would be interpreted as the difference between an area and the reference area; that interpretation is not very meaningful.
I am wondering whether we can fit a Markov random field smoother like a random effect in R. I hope someone familiar with this package can provide some suggestions. Thanks!
The coefficients of a smooth are not, and should not be, interpreted as coefficients applied individually to each level of the variable that s() is a function of. They are the coefficients of the smooth's basis, and their number is controlled by the k argument of s().
By default, s() sets k to its maximum, n - 1, where n is the number of levels (here, districts). k cannot be larger than n - 1 because an intercept is needed to set the average level the smooth varies around, and the total number of fitted parameters must not exceed the size of the data.
For further details, check the choose.k help page in mgcv.
If you are interested in something directly applicable to each of your districts, you should look at the values predicted by the smooth. Following the gamObject help, these are in the fitted.values component.
Here I get:
> b$fitted.values
[1] 18.81758 22.12502 30.13315 33.14305 44.11208 30.17184 20.96227 39.77438
[9] 35.64875 32.88071 54.08242 49.13961 43.58527 49.65618 47.64344 50.99036
[17] 32.48752 46.50207 51.70913 21.95138 40.98711 36.13709 21.90757 45.66465
[25] 52.92006 43.65122 45.45233 48.74153 53.49958 57.88845 18.43111 20.07698
[33] 40.25183 23.72681 36.74403 16.71899 44.32493 47.01028 18.41338 20.69650
[41] 20.15782 17.60067 36.51737 30.54075 31.18387 16.83831 25.62500 28.60866
[49] 25.47928
The result of plot(b) lets you visualize the fit; it looks good, and the correspondence between observed and fitted values seems reasonable: plot(columb$crime, b$fitted.values)
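If you want something closer to the "structured random effect" reading, one option (a small sketch using standard mgcv calls; the object sm is illustrative, and the "s(district)" column label follows mgcv's usual term naming) is to extract the centred per-district contribution of the MRF smooth rather than the raw coefficients:
## centred contribution of the MRF smooth for each observation (district)
sm <- predict(b, type = "terms")[, "s(district)"]
head(sm)

## for this model the fitted values are just intercept + smooth contribution
all.equal(unname(coef(b)["(Intercept)"] + sm), unname(b$fitted.values))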
