vegdist function cannot handle datasets of abundance containing 0 - r

As marine biologists, we need to determine whether the abundance of 4 fish species, counted three times over a year, differs from one artificial reef to another (reefs A, B, and C) and from one month to another (June, September, November). For each area, 3 replicates are generated (1, 2, 3).
Let's consider the gathered data (including the factors for better understanding) as follows:
data <- as.data.frame(matrix(NA, 27, 4, dimnames =
list(1:27, c("Diplodus sargus", "Chelon labrosus", "Oblada melanura", "Seriola dumerii"))))
#fish counts
data$`Diplodus sargus` <- as.numeric(c(0,0,0,0,0,0,0,0,0,5,0,0,3,0,0,0,0,1,0,0,0,0,0,0,4,0,0))
data$`Oblada melanura` <- as.numeric(c(0,0,0,10,0,0,0,0,0,0,0,0,10,5,0,0,0,0,1,0,2,3,0,2,0,0,0))
data$`Chelon labrosus`<- as.numeric(c(0,0,0,0,2,0,6,0,0,0,0,0,3,0,0,2,0,0,0,0,0,3,0,0,0,0,1))
data$`Seriola dumerii` <-as.numeric(c(4,0,2,0,1,1,0,0,9,0,0,0,0,0,3,0,0,7,0,0,0,8,0,0,0,1,0))
#factors
data$reef <- rep(c(rep("A", 3), rep("B",3), rep("C", 3)),3)
# each month covers 9 consecutive rows (3 reefs x 3 replicates), matching data$combined
data$month <- rep(c("June", "September", "November"), each = 9)
data$combined <- c(rep("JuneA", 3), rep("JuneB", 3), rep("JuneC", 3), rep("SepA", 3), rep("SepB", 3), rep("SepC", 3), rep("NovA", 3), rep("NovB", 3), rep("NovC", 3))
# replicates 1, 2, 3 within each month x reef combination
data$Replicate <- rep(c("1", "2", "3"), times = 9)
#square-root data
comp <- sqrt(data[, 1:4])
library(vegan)
mydist <- vegdist(comp, method = "bray")
pl.clust <- hclust(mydist, method = "complete")
Error in hclust(mydist, method = "complete") :
NA/NaN/Inf in foreign function call (arg 11)
The aim is to perform a permutational ANOVA on the Bray-Curtis similarities of the square-root-transformed data, in order to determine whether samples (assemblages of counted species) differ significantly depending on the factors (alone or combined). However, the vegdist function does not seem to handle samples in which every count is 0: it returns a dist object containing NaN, which in turn cannot be handled by hclust or by the adonis function. I thought of simply adding 1 to each count, since it is the differences between samples that matter and not the absolute values. However, mydist <- ecodist::bcdist(squared_data,rmzero=FALSE) gives a very different result from that first solution. Is anybody familiar with this issue, and how should it be handled correctly?
Thank you and looking forward to reading you
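For context, the NaN values can be traced to samples in which no fish were counted at all: the Bray-Curtis dissimilarity divides by the combined abundance of the two samples, so two empty samples give 0/0. The sketch below uses base R only, with a hand-written bray() helper (hypothetical, not part of vegan) to show this, followed by one commonly cited workaround, the zero-adjusted Bray-Curtis, which appends a constant dummy species to every sample. Whether that adjustment, or dropping the empty samples instead, suits this design is an assumption to check against the ecology.

```r
# Hand-written Bray-Curtis dissimilarity between two abundance vectors.
# bray() is an illustrative helper, not a vegan function.
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

empty_1  <- c(0, 0, 0, 0)  # a sample with no fish
empty_2  <- c(0, 0, 0, 0)  # another sample with no fish
sample_3 <- c(0, 0, 0, 2)  # a sample with one species present

d_empty <- bray(empty_1, empty_2)   # 0/0, which R reports as NaN
d_mixed <- bray(empty_1, sample_3)  # 2/2 = 1, well defined

# Zero-adjusted variant: append a dummy species with abundance 1 to each sample.
bray_adj <- function(x, y) bray(c(x, 1), c(y, 1))

d_empty_adj <- bray_adj(empty_1, empty_2)   # 0/2 = 0: two empty samples are identical
d_mixed_adj <- bray_adj(empty_1, sample_3)  # 2/4 = 0.5

print(c(d_empty, d_mixed, d_empty_adj, d_mixed_adj))
```

With vegan, the same idea would amount to adding a constant column to comp before calling vegdist(..., method = "bray"). This is not the same as adding 1 to every count: both leave the numerator unchanged, but adding 1 to every count raises the denominator by twice the number of species, whereas the dummy species raises it by only 2, which is why the two approaches give different distances.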

Related

Clustering ranking

I'm analyzing data in R where predictor variables are available but there is no response variable. Using unsupervised learning (k-means), I have identified patterns in the data. Now I need to rank the clusters according to their overall performance (for example, students' data on exam marks and co-curricular marks). What technique can I use after clustering in R?
The cluster component of the kmeans output gives the index of the cluster each data point belongs to. Example data taken from the kmeans documentation:
nclusters = 5
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)
Now, your evaluation function (e.g. mean of column values) can be applied to each cluster individually:
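for (i in 1:nclusters) {
  # column means of the points assigned to cluster i
  # (drop = FALSE keeps a matrix even if a cluster holds a single point)
  cat(i, colMeans(x[cl$cluster == i, , drop = FALSE]), '\n')
}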
Or, better still, use an aggregation function such as tapply or aggregate, e.g.:
aggregate(x, by=list(cluster=cl$cluster), FUN=mean)
which gives
  cluster          x          y
1       1  1.2468266  1.1499059
2       2 -0.2787117  0.0958023
3       3  0.5360855  1.0217910
4       4  1.0997776  0.7175210
5       5  0.2472313 -0.1193551
At this point you should be able to rank the values of the aggregation function as needed.
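As a minimal sketch of that last step, the per-cluster means can be collapsed into a single score and ordered. The score used here, the unweighted mean of the two column means, is purely an assumption for illustration; any function that reflects "overall performance" in your data could take its place. The seed is fixed only so the sketch is reproducible.

```r
set.seed(42)  # fixed seed so the sketch is reproducible
nclusters <- 5
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, nclusters, nstart = 25)

# Per-cluster means of each variable
centers <- aggregate(x, by = list(cluster = cl$cluster), FUN = mean)

# Hypothetical overall score: the unweighted mean of the per-variable means
centers$score <- rowMeans(centers[, c("x", "y")])

# Rank 1 = highest score
centers$rank <- rank(-centers$score)
ranked <- centers[order(centers$rank), ]
print(ranked)
```

If the variables are on different scales (for example exam marks out of 100 and co-curricular marks out of 10), it may be worth standardising them with scale() before computing the score.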

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First Stack Exchange post, so please bear with me. I'm trying to automate the creation of a list made up of many empty vectors of various, known lengths. The empty vectors will then be filled with simulated data. How can I automate the creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (the total number of fish counted in each net). We don't have individual fish weights, just the mean weight of all fish in each net, so I need to simulate the weights from a lognormal distribution. I'm therefore looking to fill the empty vector for each net, whose length equals the number of fish caught in that net, with weight data simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
# meanlog vector for each net (values not shown)
weight_per_net
# sdlog vector for each net (values not shown)
sd_per_net
for (i in 1:4) {
  simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
  print(simulated_weights)
}
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand, and then also write out all the net names in the list() function? There are far more nets than 4 so it would be extremely time consuming and inefficient to do it this way. I've tried several things from other posts like paste0(), other for loops, as.list(c()), all to no avail.
Thanks!
HM
It turns out you don't need the net1, net2, etc. variables at all. You can just do:
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function returns the list you need by calling the function once for each value of abundance.
We could also create 'simulated_weights' with split and rep:
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
                           rep(seq_along(abundance), abundance))
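Going one step further, the placeholder vectors can be skipped altogether by simulating the weights directly, one list element per net. The sketch below uses Map() for this; the meanlog and sdlog values are made up for illustration, since the real weight_per_net and sd_per_net are not given in the question. Note that rlnorm() takes the mean and standard deviation on the log scale, so observed mean weights would need converting first.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
abundance <- c(5, 10, 9, 20)

# Hypothetical log-scale parameters, one per net (not from the original post)
weight_per_net <- c(0.5, 0.8, 0.6, 1.0)   # meanlog
sd_per_net     <- c(0.2, 0.3, 0.25, 0.4)  # sdlog

# One call to rlnorm() per net; Map() walks the three vectors in parallel
simulated_weights <- Map(function(n, m, s) rlnorm(n, meanlog = m, sdlog = s),
                         abundance, weight_per_net, sd_per_net)
names(simulated_weights) <- paste0("net", seq_along(abundance))

print(lengths(simulated_weights))
```

Because the list is built from the vectors themselves, the same code works unchanged for any number of nets.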

Fitting spatial regression with repeated measures making incorrect neighbours

I am trying to fit a spatial lag model (spdep::lagsarlm) after building a neighbour distance matrix. I have two questions, because in every example I have read the model is fitted to data with a single observation (one row) per spatial location.
My dataset has a variable number of observations for each spatial point (but it's not temporal data), and I was wondering whether it is valid to proceed this way, especially when creating the distance matrix, because I get a warning:
Warning message:
In spdep::knearneigh(., k = 3, longlat = F) :
knearneigh: identical points found
Indeed when I plot the neighbours relationships, I get a wrong graph (I guess that the algorithm thinks that the repeated points are neighbours with themselves so they get isolated); when I filter only the first measure, the plot is OK.
library(sp); library(spdep); library(magrittr)  # magrittr provides %>%
set.seed(12345678)
df = data.frame('id'=rep(1:10, 3),
                'x'=rep(rnorm(10, 48, 0.1), 3),
                'y'=rep(rnorm(10, 2.3, 0.05),3),
                'response'=c(rnorm(5), rnorm(20, 1), rnorm(5)),
                'type.sensor'=rep(c(rep("a", 6), rep("b", 4)), 3))
coordinates(df)<-c("x", "y")
w <- df %>% spdep::knearneigh(k=3, longlat=F) %>% knn2nb
plot(w, coordinates(df))
df2 = head(df, 10) # I keep only the first measure for each location
w2 <- df2 %>% spdep::knearneigh(k=3, longlat=F) %>% knn2nb
plot(w2, coordinates(df2))
So I'm not very confident in the result of my lagsarlm model in the first case:
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w), type = "lag" )
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w2), type = "lag" )
However, if I try to fit my model with the larger dataset, but with the right neighbours matrix, it complains
Error in lagsarlm(response ~ type.sensor, data = df, listw = nb2listw(w2), :
Input data and weights have different dimensions
How can I deal with such data, in the end? Thanks.
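One possible way around the dimension mismatch, sketched here in base R, is to collapse the repeated measures to one row per location before fitting, so that the data have the same number of rows as the neighbour structure built from the unique points. This assumes that averaging the response per location is acceptable for the analysis, which discards the within-location variability; it is an illustration, not the only option.

```r
set.seed(12345678)
# Same layout as the example above: 10 locations, each measured 3 times
df <- data.frame(id = rep(1:10, 3),
                 x = rep(rnorm(10, 48, 0.1), 3),
                 y = rep(rnorm(10, 2.3, 0.05), 3),
                 response = c(rnorm(5), rnorm(20, 1), rnorm(5)),
                 type.sensor = rep(c(rep("a", 6), rep("b", 4)), 3))

# Rows whose coordinates already appeared earlier in the data
n_repeats <- sum(duplicated(df[, c("x", "y")]))

# One row per location: average the response, keep the location-level covariates
per_site <- aggregate(response ~ id + x + y + type.sensor, data = df, FUN = mean)
per_site <- per_site[order(per_site$id), ]

print(n_repeats)
print(dim(per_site))
```

The per_site data frame has 10 rows, so it would line up with a weights object such as nb2listw(w2) built from the 10 unique locations.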

How do I structure data to use R lmer

I am trying to do a trend analysis of reliability data. A typical case would be to determine whether a 10-year trend exists in the demand rate for a specified system at specified plants.
I am trying to generate a test case but am a bit confused about how to structure the data. The trend years range from 2004 to 2013. In my test case I have, for each year, 10 systems for which demands have been counted. I am using normally distributed demand counts with different means and variances each year. Of course real data will likely not have the same system count each year, and the demand counts are not necessarily normally distributed.
The following R code produces a data frame (df1) that seems reasonable to me:
yr <- 2004:2013
y2004 <- rnorm(10, 10, 3)
y2005 <- rnorm(10, 11, 2)
y2006 <- rnorm(10, 12, 1)
y2007 <- rnorm(10, 13, 5)
y2008 <- rnorm(10, 14, 3)
y2009 <- rnorm(10, 15, 4)
y2010 <- rnorm(10, 16, 1)
y2011 <- rnorm(10, 17, 2)
y2012 <- rnorm(10, 18, 4)
y2013 <- rnorm(10, 19, 1)
df1 <- data.frame(cbind(yr), y2004, y2005, y2006, y2007, y2008, y2009, y2010, y2011, y2012,y2013)
df2 <- data.frame(cbind(rep(0.0, 100), rep(0.0, 100)))
names(df2) <- c("x", "y")
k <-1
for (i in 1:10) {
  for (j in 1:10) {
    df2$x[k] <- df1$yr[i]
    df2$y[k] <- df1[j,i+1]
    k <- k + 1
  }
}
boxplot(y ~ x, df2)
Anyway, my first problem is the construction of df2 seems unnecessary given I already have the data in df1 - it's just that the call to lmer seems to require the organization of df2. My call to lmer looks like the following:
fit <- lmer(y ~ x + (1|x), data=df2)
So is there a way to use lmer without the construction of df2, using df1 directly? Or is there a better way to structure the data entirely?
My second problem is I am not really sure how to use lmer to do what I want to do. Basically I am looking to pool the count data for each year and fit the mean demand count each year with a straight line. The best fit should consider the variance in the data in each pooled year group. Am I going about it correctly?
Nearly all plotting and modeling functions in R require data in the "long" format (i.e., df2). So, if anything, I would skip the construction of df1. If you want to generate df2 more directly, you could do:
# the sd vector matches the rnorm() calls in the question (y2012 uses sd = 4)
df2 <- do.call("rbind.data.frame", Map(cbind,
    y=Map(function(n,m,s) rnorm(n,m,s), 10, 10:19, c(3,2,1,5,3,4,1,2,4,1)),
    x=2004:2013))
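If the data already exist in a wide layout with one column per year, base R's stack() can reshape them to the long format without an explicit double loop. The sketch below assumes hypothetical column names of the form y2004, ..., y2013 and simulated values; it then fits a plain linear trend with lm() rather than lmer(), simply to show that the long data frame is ready for modelling.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
yrs <- 2004:2013

# Hypothetical wide data: one column per year, 10 systems per column
wide <- as.data.frame(sapply(seq_along(yrs), function(i) rnorm(10, 9 + i, 2)))
names(wide) <- paste0("y", yrs)

# stack() returns two columns: values (the counts) and ind (the source column name)
long <- stack(wide)
long$x <- as.integer(sub("^y", "", as.character(long$ind)))  # recover the year
long <- data.frame(x = long$x, y = long$values)

# Plain linear trend in the yearly demand counts (not a mixed model)
fit <- lm(y ~ x, data = long)
print(coef(fit))
```

The resulting long data frame has the same x and y columns as df2, so it could be passed to boxplot(y ~ x, long) or to a mixed-model call in the same way.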

How to generate such random numbers in R

I want to generate bivariates in the following way. I have four lists with equal length n. I need to use the first two lists as means lists, and the latter two as variance lists, and generate normal bivariates.
For example, with n = 2, I have the lists (1, 2), (3, 4), (5, 6), (7, 8), and I need
matrix(c(rnorm(1, mean=1, sd=sqrt(5)), rnorm(1, mean=2, sd=sqrt(6)), rnorm(1, mean=3, sd=sqrt(7)), rnorm(1, mean=4, sd=sqrt(8))), ncol=2)
How can I do this in R in a more functional way?
Here is one way:
m <- 1:4
s <- 5:8
rnorm(n = 4, mean = m, sd = s)
[1] 4.599257 1.661132 16.987241 3.418957
This works because, like many R functions, rnorm() is 'vectorized', meaning that it allows you to call it once with vectors as arguments, rather than many times in a loop that iterates through the elements of the vectors.
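Your main task, then, is to convert the 'lists' in which you've got your arguments right now into vectors that can be passed to rnorm(). Note that rnorm() expects standard deviations in its sd argument, so if your lists hold variances, pass their square roots (e.g. sd = sqrt(s)).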
NOTE: If you want to produce more than one (let's say 3) random variate for each mean/sd combination, rnorm(n=rep(3,4), mean=m, sd=s) will not work. You'll have to either: (a) repeat elements of the m and s vectors like so: rnorm(n=3*4, mean=rep(m, each=3), sd=rep(s, each=3)); or (b) use mapply() as described in DWin's answer.
I'm taking you at your word that you have a list, i.e., an R list:
plist <- list( a=list(1, 2), b=list(3, 4), c=list(5, 6), d=list(7, 8))
means <-plist[c("a","b")] # or you could use means <- plist[1:2]
vars <- plist[c("c","d")]
mapply(rnorm, n=rep(1,4), unlist(means), unlist(vars))
#[1] 3.9382147 1.0502025 0.9554021 -7.3591917
You used the term bivariate. Did you really want to have x,y pairs that had a specific correlation?
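Putting the pieces together, here is a minimal sketch that treats the last two lists as variances, as the question states, and shapes the draws into the two-column layout the question asks for. The variable names are illustrative, and the draws are independent: no correlation between the two columns is imposed.

```r
set.seed(123)  # fixed seed so the sketch is reproducible
means <- 1:4   # the two mean lists, flattened: (1, 2) and (3, 4)
vars  <- 5:8   # the two variance lists, flattened: (5, 6) and (7, 8)

# One draw per mean/variance pair; rnorm() takes standard deviations
draws <- rnorm(n = length(means), mean = means, sd = sqrt(vars))
biv <- matrix(draws, ncol = 2)  # column 1 uses means 1, 2; column 2 uses means 3, 4

# Three draws per pair: repeat each parameter, then reshape (one column per pair)
k <- 3
many <- matrix(rnorm(n = k * length(means),
                     mean = rep(means, each = k),
                     sd = rep(sqrt(vars), each = k)),
               nrow = k)

print(dim(biv))
print(dim(many))
```

Each row of biv is one (x, y) pair, and each column of many holds the k draws for one mean/variance combination.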

Resources