Partial Canonical Correspondence Analysis in R

I am having issues conducting a partial Canonical Correspondence Analysis (pCCA) in R. The data associated with the code are quite extensive, so I am unable to include them here.
The following code produces the error below it. In the pCCA model, I am attempting to account for both environmental and spatial variables in explaining the species matrix. The spatial variables are latitude and longitude values; env2 is a set of continuous and a few binary (0/1) environmental variables.
mod2 <- cca(species ~ env2 + spatial)
Error in model.frame.default(~env2 + spatial, na.action = na.pass, xlev = NULL) : invalid type (list) for variable 'env2'
I have used unlist() on both env2 and spatial, but it does not work.
Thoughts?

The right-hand side of the formula must contain variables, but it seems that you have data frames of several variables. This will not work, and it gives an error message similar to the one in your post (this is documented). Further, your formula will not define a partial CCA, because it does not contain the Condition() function, which defines the terms to be partialled out.
The formula interface may work if you have numerical matrices as terms, but it will not work with unlist()ed variables.
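For example, a formula-based partial CCA could look like the sketch below; the environmental column names temp and ph are assumptions, while Latitude and Longitude come from your description of spatial. The explanatory tables are first combined into a single data frame:
library(vegan)
# combine the explanatory tables so the formula can find the columns by name
dat <- cbind(env2, spatial)
# Condition() marks the spatial terms to be partialled out before the env2 terms are assessed
mod2 <- cca(species ~ temp + ph + Condition(Latitude + Longitude), data = dat)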
If you are using vegan 2.5-1 or newer, you can define a partial CCA without the formula interface as
cca(species, env2, spatial)
and the data frames env2 and spatial are automatically expanded to model matrices; the spatial terms are partialled out before analysing the effects of the env2 terms.

Related

Can dismo::evaluate() be used for a model fit with glmnet() or cv.glmnet()?

I'm using the glmnet package to create a species distribution model (SDM) based on a lasso regression. I've successfully fit models using glmnet::cv.glmnet(), and I can use the predict() function to generate predicted probabilities for a given lambda value by setting s = lambda.min and type = "response".
I'm creating several different kinds of SDMs and had been using dismo::evaluate() to generate fit statistics (based on a testing dataset) and thresholds to convert probabilities to binary values. However, when I run dismo::evaluate() with a cv.glmnet (or glmnet) model, I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': not-yet-implemented method for <data.frame> %*%
This is confusing to me, as I think the x argument in evaluate() isn't needed when I'm providing a matrix of predictor values at presence locations (p) and another matrix of values at absence locations (a). I'm wondering whether evaluate() simply doesn't work with these types of models? Thanks, and apologies if I've missed something obvious!
After spending more time on this, I don't think dismo::evaluate() works with glmnet objects when supplying p and a as matrices of predictor values: dismo::evaluate() converts them to data frames before calling predict(). To solve my problem, I created a new function based on dismo::evaluate() that supplies p or a as a matrix to predict().
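A minimal sketch of that workaround, assuming a fitted cv.glmnet model fit and predictor matrices p and a (the function name evaluate_glmnet is hypothetical; compute your preferred fit statistics from the returned predictions):
library(glmnet)
evaluate_glmnet <- function(fit, p, a, s = "lambda.min") {
  # predict directly on matrices instead of letting them be coerced to data frames
  pred_p <- as.numeric(predict(fit, newx = as.matrix(p), s = s, type = "response"))
  pred_a <- as.numeric(predict(fit, newx = as.matrix(a), s = s, type = "response"))
  list(presence = pred_p, absence = pred_a)
}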

Regression in subpopulations using svyglm function in the R survey package

I would like to use the svyglm function from the survey package to run stratified regression models, i.e., regression models on subsets of my population.
Suppose x is my predictor, y is my outcome, and z is a third (factor) variable. I would like to see individual relationships between x and y for different levels of z.
The documentation for this package says that "The correct standard error estimate for a subpopulation that isn’t a stratum is not just obtained by pretending that the sub population was a designed survey of its own. However, the subset function and [ method for survey design objects handle all these details automagically, so you can ignore this problem."
There is a subset argument in the svyglm function. My question is - do you specify the subpopulation in the subset argument of the design function, in the svyglm function, or both?
Either one, but not both.
The code inside svyglm looks like
subset <- substitute(subset)
subset <- eval(subset, model.frame(design), parent.frame())
if (!is.null(subset))
    design <- design[subset, ]
The first two lines handle where to look up the subset, and then it is simply used to subset the design with [.
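Both routes therefore end up subsetting the same design object. A short sketch, assuming a survey design object des containing variables x, y, and z (all names are assumptions):
library(survey)
# Option 1: subset the design first, then fit
fit1 <- svyglm(y ~ x, design = subset(des, z == "level1"))
# Option 2: pass the condition through svyglm's subset argument
fit2 <- svyglm(y ~ x, design = des, subset = z == "level1")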

Simple variogram in R, understanding gstat::variogram() and object gstat

I have a data.frame in R whose variables represent locations and whose observations are measurements of a certain variable at those locations. I want to measure the decay of dependence between locations as a function of distance, so the variogram is particularly useful for my studies.
I am trying to use the gstat library, but I am a bit confused about certain parameters. As far as I understand, the (empirical) variogram should only need as basic data:
The locations of the variables
Observations for these variables
And then other parameters like maximum distance, directions, ...
Now, the gstat::variogram() function requires as its first input an object of class gstat. Checking the documentation of the function gstat(), I see that it outputs an object of this class, but this function requires a formula argument, which is described as:
formula that defines the dependent variable as a linear model of independent variables; suppose the dependent variable has name z, for ordinary and simple kriging use the formula z~1; for simple kriging also define beta (see below); for universal kriging, suppose z is linearly dependent on x and y, use the formula z~x+y
Could someone explain to me what this formula is for?
try
methods(variogram)
and you'll see that gstat has several methods for variogram, only one of which requires a gstat object as its first argument.
Given a data.frame, the easiest is to use the formula method:
variogram(z~1, ~x+y, data)
which specifies that, in data, z is the observed variable of interest; ~1 specifies a constant mean model; and ~x+y specifies that the coordinates are found in columns x and y of data.
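A self-contained sketch using the meuse dataset shipped with the sp package (assuming sp and gstat are installed):
library(sp)
library(gstat)
data(meuse)  # soil measurements with coordinates in columns x and y
# empirical variogram of log(zinc) under a constant mean model
v <- variogram(log(zinc) ~ 1, ~x + y, data = meuse)
plot(v)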

pool.compare generates non-conformable arguments error

Alternate title: Model matrix and set of coefficients show different numbers of variables
I am using the mice package for R to do some analyses. I wanted to compare two models (held in mira objects) using pool.compare(), but I keep getting the following error:
Error in model.matrix(formula, data) %*% coefs : non-conformable arguments
The binary operator %*% indicates matrix multiplication in R.
The expression model.matrix(formula, data) produces "The design matrix for a regression-like model with the specified formula and data" (from the R Documentation for model.matrix {stats}).
In the error message, coefs is drawn from est1$qbar, where est1 is a mipo object, and the qbar element is "The average of complete data estimates. The multiple imputation estimate." (from the documentation for mipo-class {mice}).
In my case
est1$qbar is a numeric vector of length 36
data is a data.frame with 918 observations of 82 variables
formula is class 'formula' containing the formula for my model
model.matrix(formula, data) is a matrix with dimension 918 x 48.
How can I resolve/prevent this error?
As occasionally happens, I found the answer to my own question while writing the question.
The clue was that the estimates for categorical variables in est1$qbar only exist if that level of that variable was present in the data. Some of my variables are factor variables in which not every level is represented. This caused the warning "contrasts dropped from factor <variable name> due to missing levels", which I foolishly ignored.
On the other hand, looking at dimnames(model.matrix.temp)[[2]] shows that the model matrix has one column for each level of each factor variable, regardless of whether that level was present in the data. So, although the contrasts for missing factor levels are dropped when estimating the coefficients, those factor levels still appear in the model matrix. This means the model matrix has more columns than the length of est1$qbar (the vector of estimated coefficients), so the matrix multiplication is not going to work.
The answer here is to fix the factor variables so that there are no unused levels. This can be done with the factor() function (as explained here). Unfortunately, this needs to be done on the original dataset, prior to imputation.
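A minimal sketch of that fix, assuming a data frame df with a factor column group (both names are assumptions):
# re-create the factor so it keeps only the levels actually observed
df$group <- factor(df$group)
# or drop unused levels from every factor column at once
df <- droplevels(df)
# then impute and fit as before
imp <- mice::mice(df)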

Weighting the inverse of the variance in linear mixed model

I have a linear mixed model which is run 50 different times in a loop.
Each time the model is run, I want the response variable b to be weighted inversely with the variance. So if the variance of b is small, I want the weighting to be bigger and vice versa. This is a simplified version of the model:
model <- lme(b ~ type, random = ~1|replicate, weights = ~ I(1/b))
Here's the R data files:
b: https://www.dropbox.com/s/ziipdtsih5f0252/b.Rdata?dl=0
type: https://www.dropbox.com/s/90682ewib1lw06e/type.Rdata?dl=0
replicate: https://www.dropbox.com/s/kvrtao5i2g4v3ik/replicate.Rdata?dl=0
I'm trying to do this using the weights option in lme. Right now I have this as:
weights = ~ I(1/b).
But I don't think this is correct... maybe weights = ~ I(1/var(b))?
I also want to adjust this slightly as b consists of two types of data specified in the factor variable (of 2 levels) type.
I want to inversely weight the variance of each of these two levels separately. How could I do this?
I'm not sure it makes sense to talk about weighting the response variable in this manner. The descriptions I have found on the R-SIG-mixed-models mailing list refer to inverse weighting derived from the predictor variables, either the fixed effects or the random effects; the weighting is used when minimizing the deviations of the model fit from the response. There is a function that specifies a fixed variance structure (a sub-class of the varFunc family of functions), and it has a help page (linked from the weights section of the ?gls page):
?varFixed
?varFunc
It requires a formula object as its argument. So my original guess was:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~type) )
Which you proved incorrect. How about seeing if this works:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~1| type) )
My continuing guess is that this weighting is the default situation, so specifying these particular weights may not be needed; the inverse nature of the weighting is implied and does not need to be stated explicitly with "1/type". In the case of mixed models, the "correct" construction depends on the design and the prior science, none of which has been presented here, so this is really only a syntactic comment and not an endorsement of this model. I did not download the files; it seems rather odd to have three separate files and no code for linking them into a data frame. Generally one would want a single data object whose column names are used in the formulas of the regression function. (My untested prediction, again, is that you would see no change by omitting the weights argument entirely.)
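For the stated goal of a separate inverse-variance weight per level of type, nlme's varIdent structure is the usual tool. A hedged sketch (the data frame name dat is an assumption):
library(nlme)
# varIdent estimates a separate residual variance for each level of type;
# observations are then effectively weighted by the inverse of that variance
model <- lme(b ~ type, random = ~1|replicate,
             weights = varIdent(form = ~1|type), data = dat)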
