Struggling to loop within a function - r

I am investigating the presence of convergent evolution in a dataset containing 20 trait variables. I need to test for convergence on each trait individually, so I am trying to set up a loop to do this. However, my loops throw errors and I am not sure why. Each variable runs through the function without problems when I pass it on its own.
I am using the package windex and am using the function test.windex. This function uses a data.frame containing information about which species are the focal ones and the traits that one wishes to test convergence on. The function also requires a phylogenetic tree for the species included in the data.frame.
To create a reproducible example I am using the sample data supplied with the package windex as my data is confidential (I have run the following and get the same errors as with my own data).
data(sample.data) #data.frame containing the focal species and variables
data(sample.tree) #phylogenetic tree
head(sample.data)
  species focals      bm1        bm2      bm3      ou1       ou2      ou3 bin
1      s4      0 13.03895 17.2201554 11.43644 13.03895 17.220155 11.43644   1
2      s7      0 18.22705 15.2427947 22.75245 18.22705 15.242795 22.75245   2
3      s8      0 12.38588 10.5858736 13.80216 12.38588 10.585874 13.80216   3
4      s9      0 24.79114  8.1148456 23.38717 24.79114  8.114846 23.38717   4
5     s11      1 28.20126 -2.9911114 15.63215 21.22215 21.936316 20.30037   5
6     s12      1 29.45775  0.9225023 12.09184 21.24501 20.952824 20.88650   6
#My loop: the traits argument is what needs to change for each trait variable in sample.data
sapply(sample.data[3:8], function(x) test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey"))
This code generates the error: Error in .subset(x, j) : only 0's may be mixed with negative subscripts
Called from: [.data.frame(dat, , traits)
I am assuming that it is some sort of indexing problem, but I just can't seem to find a way around it. I tried this instead, but got the same error:
traitsonly<-sample.data[3:8]
sapply(traitsonly, function(x) test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey"))
Any help would be greatly appreciated.

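According to the help page of test.windex, the traits argument expects a column number or column name, not the trait values themselves. Looping over sample.data[3:8] with sapply passes each column's values as x, so those values (some of them negative, e.g. in bm2) end up being used as column subscripts, which is consistent with the subscript error you are seeing.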
traits
Column numbers (or names) of the traits for which you want to calculate a Wheatsheaf index.
So either of these should work.
library(windex)
sapply(3:8, function(x)
test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey")
) -> plot
Or
sapply(names(sample.data)[3:8], function(x)
test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey")
) -> plot
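To see the difference without windex installed, here is a minimal sketch using a made-up data frame (toy) and a hypothetical stand-in function (col_mean) that, like the traits argument, expects a column selector. Neither name comes from the package.

```r
# Toy data frame; `a` contains a negative value, like bm2 in sample.data
toy <- data.frame(id = 1:3, a = c(1.5, -2, 3), b = c(4, 5, 6))

# Hypothetical stand-in for a function that takes a column selector
col_mean <- function(dat, traits) mean(dat[, traits])

# Looping over the data frame hands the function the column VALUES ...
value_classes <- sapply(toy[2:3], function(x) class(x))

# ... and using such values as column subscripts fails
failed <- inherits(try(toy[, toy$a], silent = TRUE), "try-error")

# Looping over names or positions passes a valid selector instead
by_names <- sapply(names(toy)[2:3], function(nm) col_mean(toy, nm))
by_index <- sapply(2:3, function(i) col_mean(toy, i))
```

Both by_names and by_index hold the same column means; the name-based version has the advantage that the result is labelled by trait.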

Related

Calculate Gamma diversity over complete dataset using Vegan package in R

I have some datasets for which I want to calculate gamma diversity as the Shannon H index.
Example dataset:
Site SpecA SpecB SpecC
1 4 0 0
2 3 2 4
3 1 1 1
Calculating the alpha diversity is as follows:
vegan::diversity(df, index = "shannon")
However, I want the diversity function to calculate one number for the complete dataset instead of one for each row. I can't wrap my head around this. My thought is that I need to write a function that collapses all the sites into a single row by taking the average abundance of each species, thus creating a data frame with one site containing all the species information:
site SpecA SpecB SpecC
1 2.6 1 1.6
This seems like a giant workaround for something that could be done with existing functions, but I don't know how. I hope someone can help with creating this data frame, or with some other way of applying the diversity() function to the complete data frame.
Regards
library(vegan)
data(BCI)
diversity(colSums(BCI)) # vector of sums is all you need
## vegan 2.6-0 in github has 'groups' argument for sampling units
diversity(BCI, groups = rep(1, nrow(BCI))) # one group, same result as above
diversity(BCI, groups = "a") # arg 'groups' recycled: same result as above
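For the small example in the question, the same idea can be sketched in base R without vegan, assuming the natural-log Shannon index H = -sum(p * log(p)). The helper name shannon_h is made up for this illustration.

```r
# Example data from the question (sites in rows, species in columns)
df <- data.frame(SpecA = c(4, 3, 1), SpecB = c(0, 2, 1), SpecC = c(0, 4, 1))

# Shannon index of one abundance vector; zero abundances contribute nothing
shannon_h <- function(abund) {
  p <- abund[abund > 0] / sum(abund)
  -sum(p * log(p))
}

# Pool all sites first, then compute a single index for the whole dataset
gamma_h <- shannon_h(colSums(df))

# Averaging instead of summing gives the same proportions, hence the same index
gamma_h_means <- shannon_h(colMeans(df))
```

Because the index depends only on relative abundances, the averaged one-row data frame described in the question and the column sums lead to the same value.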

Chi-squared test of independence on all combinations of columns in a dataframe in R

This is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging, and I'm very much a beginner on the programming/data manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that association is positive or negative. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each species-by-species comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species-by-species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy) {
  tab <- table(sppx, sppy)
  test <- chisq.test(tab)
  # statistic, p-value, and sign of (observed - expected) for the joint-presence cell
  result <- c(test$statistic, test$p.value,
              sign((tab - test$expected)[2, 2]))
  return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
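As a rough sketch of how the combn approach could produce a labelled results table, here is a self-contained version on simulated presence/absence data. The data frame toy, the helper chi_pair, and the min.expected column are assumptions made for this illustration, not part of the original question.

```r
# Simulated presence/absence data standing in for `plots`
set.seed(1)
toy <- data.frame(
  sp_a = rbinom(200, 1, 0.5),
  sp_b = rbinom(200, 1, 0.4),
  sp_c = rbinom(200, 1, 0.6)
)

# Test one pair of species; fixing the levels guarantees a 2x2 table
chi_pair <- function(x, y) {
  tab  <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  test <- suppressWarnings(chisq.test(tab))
  c(statistic    = unname(test$statistic),
    p.value      = test$p.value,
    direction    = sign(tab[2, 2] - test$expected[2, 2]),
    min.expected = min(test$expected))
}

# Every unordered pair of columns appears exactly once
pairs <- combn(names(toy), 2)
res   <- t(apply(pairs, 2, function(p) chi_pair(toy[[p[1]]], toy[[p[2]]])))

results <- data.frame(species1 = pairs[1, ], species2 = pairs[2, ], res)
```

Keeping min.expected alongside each test makes it possible to filter afterwards, for example with subset(results, min.expected > 5), rather than building that rule into the test function.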
The R code below runs a chi-squared test of one fixed factor variable against every column of a data frame (one of the two chisq.test arguments stays the same and is given explicitly).
Define your variable: change df$variable1 to the factor variable you want to test.
Define your data frame: change df to the data frame that contains all the factor variables to be tested against df$variable1.
The result (df2) holds the chi-squared statistic, degrees of freedom, and p-value for each variable-versus-column comparison.
The code was adapted from similar posts on Stack Overflow, none of which produced my desired outcome.
The "2" argument tells apply to work column-wise; see the MARGIN argument of apply.
df2 <- t(round(cbind(apply(df, 2, function(x) {
  ch <- chisq.test(df$variable1, x)
  c(unname(ch$statistic), ch$parameter, ch$p.value)
})), 3))

Faster alternative to populating a pre-allocated data frame using a for-loop

I am running a few different Monte Carlo simulations, all of which involve generating some data, fitting a model, and capturing several output variables from the fit of the model. Typically data are generated so that several characteristics vary (e.g., number of items, sample size), and models are fit so that several other characteristics vary (e.g., estimation method, model misspecification). I have no questions about generating the data or about how to actually fit the model. However, I know my method for populating my results data frame is very inefficient and I would like some help in improving this. My usual method is as follows:
1) Create data frame with as many rows as models I have to fit (e.g., number of iterations * different item lengths * different sample size lengths * different estimation methods * different types of model misspecification), and as many columns as I need to contain the identifying variables (sample size, number of items, etc.) and to capture all the output.
2) Use a for-loop to identify the particular combination of conditions and the particular iteration of said combination, fit the model, and populate the appropriate row of the data frame.
So I might start with something that looks like:
> head(df)
fit.model n.sample n.item distr.cond estim iteration df.chisq obt.chisq
1 1 100 3 1 ML 1 NA NA
1.1 1 100 3 1 ML 2 NA NA
1.2 1 100 3 1 ML 3 NA NA
1.3 1 100 3 1 ML 4 NA NA
1.4 1 100 3 1 ML 5 NA NA
1.5 1 100 3 1 ML 6 NA NA
where the last two columns capture results and need to be filled in, and the first six columns are necessary to identify each row uniquely. I then use a for-loop to go row-by-row, pick out the identifying characteristics of that iteration (which allows me to locate the appropriate data file and read it in, as well as to specify how to fit the model), do the model fitting, and then write to the NA columns the output desired. Then I just fill in the remaining NAs with the obtained values, for instance using df$obt.chisq[i] <- fitMeasures(fit,"chisq"), where the function fitMeasures extracts the particular value from the resulting fitted model fit.
Is it possible to vectorize this? I forget the terminology, but I recognize that in this case each iteration is completely independent of each other iteration, so that the particular order doesn't matter. It's time for a change in approach! Any help would be much appreciated.
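One common pattern for independent iterations, sketched here under assumptions, is to build the grid of conditions once, have a function return one small result row per condition, and bind the rows together at the end instead of assigning into a data frame inside the loop. The function run_one below is a hypothetical placeholder for "read the data, fit the model, extract the fit measures", and the grid is much smaller than a real design would be.

```r
# Design grid: one row per model to fit (hypothetical, reduced set of conditions)
conditions <- expand.grid(
  n.sample  = c(100, 200),
  n.item    = c(3, 5),
  iteration = 1:3
)

# Placeholder for the real work; returns one row of output measures
run_one <- function(n.sample, n.item, iteration) {
  set.seed(iteration)
  x <- matrix(rnorm(n.sample * n.item), ncol = n.item)
  data.frame(df.chisq  = n.item - 1,
             obt.chisq = n.sample * sum(colMeans(x)^2))
}

# Each row of the grid is handled independently of the others
out <- lapply(seq_len(nrow(conditions)), function(i) {
  do.call(run_one, as.list(conditions[i, ]))
})

# Bind the per-row results once, next to their identifying columns
results <- cbind(conditions, do.call(rbind, out))
```

This is not vectorization in the strict sense, since the model still has to be fitted once per row, but because the iterations do not depend on each other the lapply call could be swapped for parallel::mclapply or a similar parallel apply.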

R - Modified mosaic plot from descr package

I have a data frame db with 2 categorical variables: varA has 4 levels (0,1,2,3), varB has 2 levels (yes,no). varB has no values for level 0 of varA:
id varA varB
1 2 yes
2 3 no
3 3 no
4 1 yes
5 0 NA
6 1 no
7 2 no
8 3 yes
9 3 yes
10 2 no
I created a contingency table using CrossTable from the descr package and then a mosaic plot with the plot function:
table <- CrossTable(db$varA,db$varB, missing.include=FALSE)
plot(table,xlab="varA",ylab="varB")
I obtained this plot:
I would like to eliminate level 0 from the plot. I would also like to add two y-axes: one on the left of the plot with a scale from 0 to 1, and one on the right with a scale from 1 to 0.
Could you help me?
Well, that was annoying. There is no support for subsetting such a "CrossTable" object. If it were a well-behaved table-like object you would have been able to just pass table[ , -1] to the plot function. Instead you need to do the subsetting on the data before it is passed to CrossTable:
table <- with( na.omit(db), CrossTable( varA, varB, missing.include=TRUE))
plot(table, xlab="varA", ylab="varB")
BTW using the name table for a data-object is quite confusing to regular R users since the table function is one of our basic tools.
Personally I would avoid using that CrossTable function, since its output is so weird and cannot be managed with typical R functions. Yeah, I know it produces SAS-like output, but R users grow to love the compact output of the table function and the many matrix operations that are available for working with table objects. You may need to get your margin percentages by hand with prop.table.
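A minimal base-R sketch of that alternative, using the ten rows shown in the question. The object names complete and tab are chosen here for illustration (deliberately not table).

```r
# Data from the question
db <- data.frame(
  id   = 1:10,
  varA = c(2, 3, 3, 1, 0, 1, 2, 3, 3, 2),
  varB = c("yes", "no", "no", "yes", NA, "no", "no", "yes", "yes", "no")
)

# Dropping incomplete rows removes level 0 of varA, which has no varB values
complete <- na.omit(db)

# Ordinary contingency table; it can be subset and manipulated like a matrix
tab <- table(varA = complete$varA, varB = complete$varB)

# Row proportions computed by hand
row_props <- prop.table(tab, margin = 1)
```

Passing tab to mosaicplot(tab, xlab = "varA", ylab = "varB") would then draw the mosaic without level 0, and row_props holds the within-varA proportions that a 0-to-1 axis would display.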

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr) # provides tibble(), %>% and arrange()
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
# tibble() replaces the deprecated data_frame()
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
  arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA) # tibble() replaces the deprecated data_frame()
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df)) {
  index_val <- k_means_df$index[i]
  factor_val <- as.character(k_means_df$factors[i])
  cluster_vals <- cluster_vals %>%
    mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm, so the cluster labels are arbitrary. It is expected behaviour that the labels are neither consistent across runs nor ordered by center value.
But you can of course remap the labels however you like.
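A short sketch of such a remapping for the one-dimensional example in the question, assuming label 1 should go to the cluster with the largest center. The names ranks and ordered_cluster are made up here.

```r
x <- c(1.5, 0.2, 1.45, 1.4, 0.3, 0.3)

set.seed(42)
fit <- kmeans(x, centers = 2, nstart = 10)

# Rank the centers in decreasing order: rank 1 = largest center.
# Element i of `ranks` is the new label for original cluster i.
ranks <- rank(-fit$centers[, 1], ties.method = "first")

# Look up the new label for every point
ordered_cluster <- unname(ranks[fit$cluster])
```

For this well-separated example ordered_cluster comes out as 1 2 1 1 2 2 whichever labels kmeans happened to assign. Using rank(fit$centers[, 1]) instead would number the clusters in increasing order of their centers.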
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
