vegan::betadisper() extract distance and error associated with centroid - r

I am trying to construct a meta-regression to look at the distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same two grouping levels: before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with each distance to centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single standard deviation associated with each centroid.
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
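library(vegan)
library(dplyr)  # provides %>%, filter(), select() and recode_factor()
# Species and environmental data
# The file paths of the two read.delim() calls are empty in the post, so the
# built-in vegan copies of the data are used below instead (an assumption).
# dune2.spe <- read.delim ('', row.names = 1)
# dune2.env <- read.delim ('', row.names = 1)
data(dune)     # matrix with species data (20 samples in rows and 30 species in columns)
data(dune.env) # matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
# Select two grouping levels for 'Use'
# The filter step is missing from the post; keeping Pasture and Hayfield is an assumption.
dune_data <- cbind(dune, dune.env) %>%
  filter(Use %in% c("Pasture", "Hayfield")) %>%
  droplevels()
dune_data$Use <- recode_factor(dune_data$Use, "Pasture" = "Before", "Hayfield" = "After")
# The select steps are also missing; splitting by the original column names is an assumption.
dune_sp <- dune_data %>% select(all_of(names(dune)))
dune_en <- dune_data %>% select(all_of(names(dune.env)))
# Transform to relative species counts
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm = TRUE)
(dune_disper <- betadisper(dune_distmat, type = "centroid", group = dune_en$Use))
plot(dune_disper, label = FALSE)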
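One possible way to get n, mean and sd per group is to summarise the per-sample distances that betadisper() stores in its result. This is only a sketch and not part of the original post. It runs on the full built-in dune data with all three Use levels rather than the Before/After recoding, and it assumes that the "sd associated with the centroid" means the standard deviation of the sample-to-centroid distances within a group. The names disper, d, g and summary_tab are made up for illustration.

```r
library(vegan)

data(dune)
data(dune.env)

# Distances of each sample to its group centroid
disper <- betadisper(vegdist(dune, method = "bray"),
                     group = dune.env$Use, type = "centroid")

d <- disper$distances  # one distance per sample
g <- disper$group      # group membership of each sample

# Per-group n, mean distance to centroid and sd of those distances
summary_tab <- data.frame(
  group     = levels(g),
  n         = as.vector(table(g)),
  mean_dist = as.vector(tapply(d, g, mean)),
  sd_dist   = as.vector(tapply(d, g, sd))
)
summary_tab
```

Note that this summarises distances to the centroid within each group. The distance between two centroids is a different quantity and would have to be computed from disper$centroids.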
GAM distributed lag model with factor smooth interaction (by variable)

I'm trying to compare the climate response over the last 60 years of two subgroups of a plant (the factor variable subgroups, with 2 levels). The response of the two subgroups, which both grew on the same plots, is measured as the deviation from long-term growth (plant_growth). The available climate data are mean temperature (tmean) and mean precipitation (prec).
I formulated a distributed lag model using mgcv's gam() to test the hypothesis that the climate response differs between the plant subgroups:
climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                       te(prec, lag, by = subgroups) +
                       te(tmean, prec, lag, by = subgroups),
                     data = plant_data)
plant_data is a list that contains tmean, prec and lag as separate numeric matrices, subgroups as factor variable which distinguishes between subgroup A and B, a character variable giving the ID of the plant, and the numeric measured plant_growth as vector.
The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:
Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, :
factor `by' variables can not be used with matrix arguments.
I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.
I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, this approach treats the climate response of the two subgroups as if it were completely independent, which is not the case.
I was able to reproduce my problem with growth data from the treeclim package:
library("treeclim") # Data library
library("dplyr")    # %>%, select(), mutate(), rename(), left_join()
library("tidyr")    # pivot_wider()
data("muc_spruce") # Plant growth
data("muc_clim")   # Climate data
# Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp, prec))
# Format the growth data and add three new growth time series
growth <- muc_spruce %>%
  select(-samp.depth) %>%
  mutate(year = as.numeric(row.names(muc_spruce))) %>%
  mutate(ID = 1) %>%
  rename("plant_growth" = "mucstd")
additional_growth <- data.frame()
for (i in c(1:3)) {
  A <- growth %>%
    mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
    mutate(ID = ID + i)
  additional_growth <- rbind(additional_growth, A)
}
growth <- rbind(growth, additional_growth)
# Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) # clean
# Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))
# Format for gam input
plant_data <- list(lag = matrix(1:12, nrow(plant_data), 12, byrow = TRUE),
                   year = plant_data$year,
                   ID = plant_data$ID,
                   plant_growth = plant_data$plant_growth,
                   subgroups = as.factor(plant_data$subgroups),
                   tmean = data.matrix(plant_data[, c(4:15)]),
                   prec = data.matrix(plant_data[, c(16:27)]))
From ?mgcv::linear.functional.terms:
The mechanism is usable with random effect smooths which take factor arguments, by using a trick to create a 2D array of factors. Simply create a factor vector containing the columns of the factor matrix stacked end to end (column major order). Then reset the dimensions of this vector to create the appropriate 2D array: the first dimension should be the number of response data and the second the number of columns of the required factor matrix. You can not use matrix or data.matrix to set up the required matrix of factor levels. See example below:
## set up a `factor matrix'...
fac <- factor(sample(letters,n*2,replace=TRUE))
dim(fac) <- c(n,2)
You cannot create a factor matrix directly, though; you can create a factor and then modify its dims afterwards.
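Applied to a grouping factor like subgroups, the trick from the help page could look like the sketch below. The toy sizes (6 observations, 12 lags) and the name fac are assumptions for illustration. The help page describes this mechanism for random effect smooths; whether it also removes the error for a factor `by` variable is not confirmed by the quoted text and would need to be tested.

```r
# Toy sizes, chosen only for illustration
n     <- 6   # number of response observations
n_lag <- 12  # number of lag columns in tmean / prec

subgroups <- factor(rep(c("A", "B"), each = 3))

# Stack the factor once per lag column (column-major order) ...
fac <- rep(subgroups, times = n_lag)

# ... then reset the dimensions, so each row repeats its subgroup across all lags
dim(fac) <- c(n, n_lag)

is.factor(fac)  # still a factor, now with a dim attribute
dim(fac)
```

Using matrix() or data.matrix() instead would drop the factor class, which is why the dimensions are set on the factor itself.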

Is it possible to include NA for one missing value, and can I verify that each individual corresponds with my specified function

I have code here that generates a random spatial distribution of points, returns a distance column between every point and an infected individual and uses a function to calculate the probability of infection in the next time step. There are 60 hosts, one of which is infected. I would like to bind the values of Pi (which calculates infection probability) to my data frame with the original co-ordinates. Obviously one point is removed from the distance matrix, the infected individual. This value I would like to replace with NA in the main data frame as the next step in my code, and also to confirm that the co-ordinates correspond with the output of the function Pi.
So as it stands I am trying to attach a column of 59 rows to the main data frame of 60 rows.
# Create a spatial distribution with infected individuals
xcoord <- sample(1:100,60)
ycoord <- sample(1:100,60)
infectionstatus <- rep(0,60)
Df <- data.frame(xcoord, ycoord, infectionstatus)
a <- sample(1:60, 1)
Df$infectionstatus[a] <- 1
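# Calculate distance between infected individuals and susceptibles
# (pdist() with a metric argument is assumed to come from the rdist package)
distances <- pdist(Df[,1:2], metric = "euclidean")
position_infected_individual <- which(Df[,3]==1)
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
# Assign parameter values and calculate probability of infection
# (the values of alpha and beta are not given in the post and must be defined first)
Pi <- numeric(length(distance_from_infected))
for (p in seq_along(distance_from_infected)) {
  Pi[p] <- 1 - exp(-beta * exp(-alpha * distance_from_infected[p]))
}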
The obvious step is:
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
distance_from_infected <- c(NA, distances[-(position_infected_individual), position_infected_individual])
But you're setting yourself up for quite a few failures, because this relies on several assumptions:
That there is only one infected case
That the data frame can always be sorted so that the infected individual comes first
That NA makes "sense" for this kind of numeric summary
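A sketch that avoids the ordering assumption is to keep all 60 rows, compute the probability for every row, and then overwrite the infected row with NA by index. It uses base R's dist() instead of pdist(), and the values of alpha and beta are made up for illustration because the post does not give them. It still assumes exactly one infected individual.

```r
set.seed(1)

# Same set-up as in the question
Df <- data.frame(xcoord = sample(1:100, 60),
                 ycoord = sample(1:100, 60),
                 infectionstatus = 0)
Df$infectionstatus[sample(1:60, 1)] <- 1

# Hypothetical parameter values (not from the post)
alpha <- 0.1
beta  <- 0.5

# Full 60 x 60 Euclidean distance matrix, rows in the same order as Df
distances <- as.matrix(dist(Df[, c("xcoord", "ycoord")]))
infected  <- which(Df$infectionstatus == 1)

# One distance and one probability per row of Df, so the rows stay aligned
Df$distance_from_infected <- distances[, infected]
Df$Pi <- 1 - exp(-beta * exp(-alpha * Df$distance_from_infected))

# The infected individual gets NA instead of a probability
Df$distance_from_infected[infected] <- NA
Df$Pi[infected] <- NA
```

Because Pi is stored as a column of Df, each value sits next to the coordinates it was computed from, which makes the correspondence easy to check by eye.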

Looking for analysis that clusters like SIMPROF, but allows for many observations per category

I need to run a clustering or similarity analysis on some biological data, and I am looking for an output like the one SIMPROF gives, i.e. a dendrogram or hierarchical clustering.
However, I have 3200 observations/rows per group. SIMPROF (see the example here),
# Run simprof on the data (the remaining arguments of this call are missing from the post)
res <- simprof(data = usarrests)
# Graph the result
pl.color <- simprof.plot(res)
seems to expect only one observation per group (US state in this example).
Now, again, my biological data (140k rows total) has about 3200 obs per group.
I am trying to cluster together the groups that have a similar representation in the variables provided.
It is as if, in the example above, AK were represented by more than one observation.
What's my best bet for a function/package/analysis?
Example from a paper:
The solution became obvious upon further reflection.
Instead of using all observations (200k) in the long format, I combined longitude and sampling depth into one variable, used like sampling units along a transect. I thus ended up with 3800 columns of longitude-depth combinations and 61 rows for the taxa, with the abundance of the taxa as the value variable (if you want to cluster sampling units instead, you have to transpose the df). This is then feasible for hclust or SIMPROF, since the quadratic complexity now only applies to 61 rows (as opposed to the ~200k I tried at the beginning).
Here is some code:
library(dplyr)      # %>%, arrange(), select()
library(reshape2)   # dcast()
library(vegan)      # vegdist()
library(dendextend) # set(), as.ggdend()
library(ggplot2)    # ggplot()
d4 <- d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# make 1 variable of longitude and depth that can be used for all taxa measured, like
# community ecology sampling units
# (the line that creates sampling_units is missing from the post)
d5 <- d4 %>% select(PREDICTED_GROUP, CONCENTRATION_IND_M3, sampling_units)
# dcast data frame so that you get the taxa as rows, sampling units as columns with
# concentration/abundance as values.
d6 <- dcast(d5, PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7 <- d6 %>% na.omit()
# give the rownames the taxa names
rownames(d7) <- d7$PREDICTED_GROUP
# delete that variable that is no longer needed
d7$PREDICTED_GROUP <- NULL
# calculate the dissimilarity matrix with vegdist so you can use the Sorensen/Bray-Curtis index
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
# plot the cluster dendrogram with dendextend
dend <- clust1 %>% as.dendrogram %>%
  set("branches_k_color", k = 5) %>% set("branches_lwd", 0.5) %>%
  set("clear_leaves") %>% set("labels_colors", k = 5) %>%
  set("leaves_cex", 0.5) %>% set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
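The reshaping idea can be sketched on simulated data with base R only. Everything in this example is hypothetical: the data frame long, its column names, and the abundances are made up, and plain Euclidean dist() is swapped in for the Bray-Curtis distance from vegdist() to keep the sketch free of extra packages.

```r
set.seed(42)

# Simulated long-format data: 5 taxa, 4 longitudes x 2 depths, 5 replicates each
long <- data.frame(
  taxon     = rep(paste0("taxon_", 1:5), each = 40),
  longitude = rep(rep(c(-60, -50, -40, -30), each = 10), times = 5),
  depth     = rep(rep(c(10, 50), each = 5), times = 20),
  abundance = rpois(200, lambda = 5)
)

# Combine longitude and depth into a single sampling-unit label
long$sampling_unit <- paste(long$longitude, long$depth, sep = "_")

# Taxa as rows, sampling units as columns, mean abundance as values
wide <- tapply(long$abundance, list(long$taxon, long$sampling_unit), mean)

# The distance matrix is now only 5 x 5 instead of 200 x 200
clust  <- hclust(dist(wide), method = "ward.D2")
groups <- cutree(clust, k = 2)
```

Replicates that share a taxon and a sampling unit are averaged here; whether the mean, the sum, or another summary is appropriate depends on how the abundances were measured.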

R: Comparison between two hexbins with applying KL divergence

Suppose I have two data sets with different sizes, each data set contains x and y to locate each observation.
x1 <- runif(1000,-195.5,195.5)
y1 <- runif(1000,-49,49)
data1 <- data.frame(x1,y1)
x2 <- runif(2000,-195.5,195.5)
y2 <- runif(2000,-49,49)
data2 <- data.frame(x2,y2)
Here I generated two data sets with random locations within a specific area.
Then I generated a hexbin object for each dataset. I know that, to be able to trace observations back to their bins, I need to set IDs = TRUE:
hbin_1 <- hexbin(x=data1$x1,y=data1$y1,xbins=30,shape=98/391,IDs=TRUE)
hbin_2 <- hexbin(x=data2$x2,y=data2$y2,xbins=30,shape=98/391,IDs=TRUE)
In the next step, I want to apply the KL divergence to compare these two datasets. The problem is: how can I match the bins in the second dataset to those in the first? (I want to compare the bins at the same location in the two datasets.)
We can get a table of the cell names (bin numbers) with the corresponding count of observations in each bin by
tI1 <- table(hbin_1@cID)
tI2 <- table(hbin_2@cID)
The problem is that the bin numbers differ between dataset1 and dataset2. Even if we set the same xbins and shape in the function hexbin, we still get different bins for the two datasets. How can I compare the two datasets (or get bins at the same location)?
The function hexbin does not return empty bins. Hence, even if we set xbins, xbnds and ybnds to the same values, the returned hexbin results can differ between the two datasets.
We can use kde2d from the package MASS to perform two-dimensional kernel density estimation on a common grid.
library(MASS)
# Shared limits for both datasets (taken from the sampling area in the question)
xbnds <- c(-195.5, 195.5)
ybnds <- c(-49, 49)
b1 <- kde2d(data1$x1, data1$y1, lims = c(xbnds, ybnds))
b2 <- kde2d(data2$x2, data2$y2, lims = c(xbnds, ybnds))
This gives one vector of estimated densities per dataset. We normalise each vector by dividing it by its sum, and then apply the KL divergence to quantify the similarity of the two distributions.
z1 <- as.vector(b1$z)
z2 <- as.vector(b2$z)
# Divide by the sum of each vector (the original hard-coded the sums 0.01509942 and 0.01513236)
z1 <- z1 / sum(z1)
z2 <- z2 / sum(z2)
kullback.leibler(z1, z2)
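If binned counts are preferred over a kernel density estimate, the same idea of a shared grid can be sketched with base R. This swaps the hexagonal bins for a rectangular grid built with cut(), so that empty cells are kept and both datasets have identical cells. The grid size (30 x 10), the small constant eps, and all names in the block are assumptions for illustration.

```r
set.seed(1)
data1 <- data.frame(x = runif(1000, -195.5, 195.5), y = runif(1000, -49, 49))
data2 <- data.frame(x = runif(2000, -195.5, 195.5), y = runif(2000, -49, 49))

# One set of breaks shared by both datasets
x_breaks <- seq(-195.5, 195.5, length.out = 31)  # 30 cells in x
y_breaks <- seq(-49, 49, length.out = 11)        # 10 cells in y

# table() on two factors keeps empty cells, so both results have 300 cells
bin_counts <- function(d) {
  table(cut(d$x, x_breaks, include.lowest = TRUE),
        cut(d$y, y_breaks, include.lowest = TRUE))
}

# Small constant so that empty cells do not produce log(0) or division by zero
eps <- 1e-9
p <- as.vector(bin_counts(data1)) + eps
q <- as.vector(bin_counts(data2)) + eps
p <- p / sum(p)
q <- q / sum(q)

# Kullback-Leibler divergence of q from p (natural log)
kl_pq <- sum(p * log(p / q))
kl_pq
```

The result depends on the grid resolution and on eps, so it is only comparable between runs that use the same settings.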

Large dataset and autocorrelation computation

I have geographical data at the town level for 35,000 towns.
I want to estimate the impact of my covariates X on a dependent variable Y, taking spatial autocorrelation into account.
I first computed a weight matrix and then used the function spautolm from the package spdep (now in spatialreg), but it returned an error message because my dataset is too large.
Do you have any ideas on how I can fix this? Are there any equivalent functions that would work?
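library(sp)    # coordinates()
library(spdep) # knearneigh(), knn2nb(), nbdists(), nb2listw()
myvars <- c("longitude","latitude","Y","X")
newdata2 <- na.omit(X2000[myvars]) # drop rows with missing values in any of these variables
df <- data.frame(newdata2)
newdata3 <- unique(df) # drop fully duplicated rows
coordinates(newdata3) <- c("longitude","latitude") # set the coordinates
coords <- coordinates(newdata3)
Sy4_nb <- knn2nb(knearneigh(coords, k = 4)) # neighbour list from the 4 nearest neighbours
# idw is not defined in the post; inverse-distance weights are assumed here
idw <- lapply(nbdists(Sy4_nb, coords), function(d) 1 / d)
Sy4_lw_idwB <- nb2listw(Sy4_nb, glist = idw, style = "B") # weights list weighted by distance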
When I try to fit the model:
spautolm(formula = Y~X, data = newdata3, listw = Sy4_lw_idwB)
it returns: Error: cannot allocate vector of size 8.3 Gb
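The size in the error message is roughly what a dense n x n matrix of doubles would need, which can be checked with a little arithmetic. This is only a plausibility check under the assumptions that the failing allocation is a dense square matrix of 8-byte values and that R reports "Gb" in units of 1024^3 bytes; the post does not confirm either. The function name dense_matrix_gib is made up for illustration.

```r
# Memory needed for a dense n x n matrix of doubles, in units of 1024^3 bytes
dense_matrix_gib <- function(n) n^2 * 8 / 2^30

# All 35,000 towns
size_all_towns <- dense_matrix_gib(35000)

# Number of rows implied by an 8.3 Gb allocation
n_implied <- sqrt(8.3 * 2^30 / 8)

size_all_towns
n_implied
```

If the assumptions hold, the implied row count is a little below 35,000, which would fit a dataset that lost some rows to na.omit() and unique(). It also suggests that any approach that stores the full weight matrix densely will hit the same limit, whereas the neighbour list itself (4 neighbours per town) is small.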
