Transform community data into wide-format for vegan package - r

I am trying to analyze some community data with the vegan package. I have my data in the wrong format, and am looking for ways to change the format.
What I have is something like this:
Habitat Species Abundance
1 A 3
2 B 2
3 C 1
1 D 5
2 A 8
3 F 4
And what I think I need is:
Habitat Species A Species B Species C Species D Species F
1 3 0 0 5 0
2 8 ...... etc
3 0
Or is there any other format that vegan can take? I am trying to calculate similarity in species composition between habitats.

The function matrify() in the labdsv package does exactly this for community analyses.
It takes a data.frame in three-column form (sample.id, taxon, abundance), converts it into full matrix form, and returns it as a data.frame with the appropriate row names and column names.
In other words, it converts your data from long to wide format so that each row represents a sample (in your case "habitat"; sometimes this would be a "plot"), each column represents a species, and each cell shows the abundance of the given cell's species (column) in the given cell's habitat (row).
Example:
dat <- data.frame(Habitat = c('Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3'),
Species = c('Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3'),
Abundance = c(1,2,1,3,2,2,1))
print(dat)
Habitat Species Abundance
1 Hab1 Sp1 1
2 Hab1 Sp2 2
3 Hab2 Sp1 1
4 Hab2 Sp3 3
5 Hab2 Sp4 2
6 Hab3 Sp2 2
7 Hab3 Sp3 1
library(labdsv)
matrify(dat)
Sp1 Sp2 Sp3 Sp4
Hab1 1 2 0 0
Hab2 1 0 3 2
Hab3 0 2 1 0
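Since the goal is similarity in species composition between habitats, here is a minimal base R sketch of what can be done with the wide table. It computes Bray-Curtis dissimilarity by hand (sum of absolute differences divided by the sum of all abundances of the two rows) on the matrix shown above. The helper name bray() is made up for illustration; in practice the vegan function vegdist() would normally be used for this.

```r
# Wide community matrix, typed in from the matrify() output above
comm <- matrix(c(1, 2, 0, 0,
                 1, 0, 3, 2,
                 0, 2, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Hab1", "Hab2", "Hab3"),
                               c("Sp1", "Sp2", "Sp3", "Sp4")))

# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a habitat-by-habitat dissimilarity matrix
n <- nrow(comm)
bc <- matrix(0, n, n, dimnames = list(rownames(comm), rownames(comm)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray(comm[i, ], comm[j, ])
  }
}

# Similarity is one minus dissimilarity
similarity <- 1 - bc
round(similarity, 3)
```

Each row of comm is one habitat, which is the layout vegan expects, so the same object could be passed to vegan directly.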
Bonus:
I rewrote matrify many years ago so that it could handle longitudinal community data.
Specifically, my matrify2() function creates rows for each plot-year combination (i.e., resampled rows for the same plot) by duplicating plot (or habitat) row monikers and adding a Year column.
Below is the code:
#Create data.frame with PLOT, YEAR, and ABUNDANCE for each SPEC:
#Creates function that can sort the data.frame output by:
#Columns = individual SPECS, #Rows = plot by Year
#Note: Code modified from matrify() function from labdsv package (v. 1.6-1)
matrify2 <- function(data) {
  # Data must have columns: plot, SPEC, abundance measure, Year
  if (ncol(data) != 4)
    stop("data frame must have four column format")
  plt <- factor(data[, 1])
  spc <- factor(data[, 2])
  abu <- data[, 3]
  yrs <- factor(data[, 4])
  plt.codes <- sort(levels(factor(plt)))  ## sorted plot numbers
  spc.codes <- levels(factor(spc))        ## sorted SPEC names
  yrs.codes <- sort(levels(factor(yrs)))  ## sorted sampling Years
  ## Empty matrix with proper dimensions (unique plot x Year combinations by # of SPEC)
  taxa <- matrix(0, nrow = length(plt.codes) * length(yrs.codes), ncol = length(spc.codes))
  ## Plot and Year labels for every row, added as ID columns at the end of the function
  plt.list <- rep(plt.codes, length(yrs.codes))
  yrs.list <- rep(yrs.codes, each = length(plt.codes))
  col     <- match(spc, spc.codes)  ## column index of each record's SPEC in spc.codes
  row.plt <- match(plt, plt.codes)  ## rank of each record's plot in plt.codes
  row.yrs <- match(yrs, yrs.codes)  ## rank of each record's Year in yrs.codes
  for (i in seq_along(abu)) {
    ## Rows are ordered as all plots for the first Year, then all plots for the second Year, etc.
    row <- row.plt[i] + length(plt.codes) * (row.yrs[i] - 1)
    if (!is.na(abu[i])) {  ## Only use non-NA values (ignore all NA values)
      ## Add abundance of record i to the proper SPEC column and plot/Year row, summing duplicate records
      taxa[row, col[i]] <- sum(taxa[row, col[i]], abu[i])
    }
  }
  taxa <- data.frame(taxa)                 ## Convert to data.frame for easier manipulation
  taxa <- cbind(plt.list, yrs.list, taxa)  ## Add ID columns for plot and Year to each row
  names(taxa) <- c('Plot', 'Year', spc.codes)
  taxa
}
Example:
dat.y <- data.frame(Habitat = c('Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3','Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3'),
Species = c('Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3','Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3'),
Abundance = c(1,2,1,3,2,2,1,1,2,1,3,2,2,1),
Year = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2))
print(dat.y)
Habitat Species Abundance Year
1 Hab1 Sp1 1 1
2 Hab1 Sp2 2 1
3 Hab2 Sp1 1 1
4 Hab2 Sp3 3 1
5 Hab2 Sp4 2 1
6 Hab3 Sp2 2 1
7 Hab3 Sp3 1 1
8 Hab1 Sp1 1 2
9 Hab1 Sp2 2 2
10 Hab2 Sp1 1 2
11 Hab2 Sp3 3 2
12 Hab2 Sp4 2 2
13 Hab3 Sp2 2 2
14 Hab3 Sp3 1 2
matrify2(dat.y)
Plot Year Sp1 Sp2 Sp3 Sp4
1 Hab1 1 1 2 0 0
2 Hab2 1 1 0 3 2
3 Hab3 1 0 2 1 0
4 Hab1 2 1 2 0 0
5 Hab2 2 1 0 3 2
6 Hab3 2 0 2 1 0
Also, FYI, you should get to know labdsv according to the vegan documentation:
Together with the labdsv package, the vegan package provides most standard tools of descriptive community analysis.

You probably want to spread your data, filling species that are absent from a habitat with 0 rather than NA. For example:
library(tidyr)
mydata %>%
  spread(Species, Abundance, fill = 0)
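If you would rather avoid extra packages, a minimal base R sketch of the same long-to-wide step uses xtabs(), which sums abundances per habitat and species and puts 0 where a species was not recorded. The data frame below is typed in from the question; the object names dat and comm are just for illustration.

```r
# Long-format data from the question
dat <- data.frame(Habitat   = c(1, 2, 3, 1, 2, 3),
                  Species   = c("A", "B", "C", "D", "A", "F"),
                  Abundance = c(3, 2, 1, 5, 8, 4))

# Cross-tabulate: rows = habitats, columns = species, cells = summed abundance
comm <- xtabs(Abundance ~ Habitat + Species, data = dat)

# Convert the table to a plain data.frame, keeping the wide layout
comm <- as.data.frame.matrix(comm)
comm
```

The result has one row per habitat and one column per species, which is the sites-by-species layout used for community data.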

This is what I would do, using melt and dcast from the reshape2 package (load it with library(reshape2)):
Create a data sample: cc=data.frame(habitat=c(1,2,3,1,2,3),species=c('a','e','a','e','g','a'), abundance=sample(1:10000,6)).
Output looks like this (ignore the first column; it is the automatic row index that R prints. What matters are the other columns. The abundances are random, so your numbers will differ):
> cc
habitat species abundance
1 1 a 7814
2 2 e 7801
3 3 a 9510
4 1 e 7443
5 2 g 2160
6 3 a 4026
Now melt: m=melt(cc, id.vars=c("habitat","species")). Output:
habitat species variable value
1 1 a abundance 7814
2 2 e abundance 7801
3 3 a abundance 9510
4 1 e abundance 7443
5 2 g abundance 2160
6 3 a abundance 4026
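Now reshape: dcast(m,habitat~species,fun.aggregate=mean), which yields the table below. Combinations with no records become NaN; use fun.aggregate=sum instead to total the abundances and get 0 for those cells.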
habitat a e g
1 1 7814 7443 NaN
2 2 NaN 7801 2160
3 3 6768 NaN NaN
More info about reshape here.

Related

Testing/Training data sets stratified on two crossed variables

I have a data set which is crossed with respect to two categorical variables, and only 1 rep per combination:
> examp <- data.frame(group=rep(LETTERS[1:4], each=6), class=rep(LETTERS[16:21], times=4))
> table(examp$group, examp$class)
P Q R S T U
A 1 1 1 1 1 1
B 1 1 1 1 1 1
C 1 1 1 1 1 1
D 1 1 1 1 1 1
I need to create a testing/training data set (50/50 split) which balances both group and class.
I know I can use createDataPartition from the caret package to balance it in one of the two factors, but this leaves an imbalance in the other factor:
> library(caret)
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$group, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
test train
A 3 3
B 3 3
C 3 3
D 3 3
> table(examp$class, examp$valid)
test train
P 1 3
Q 2 2
R 2 2
S 2 2
T 2 2
U 3 1
>
>
> examp$valid <- "test"
> examp$valid[createDataPartition(examp$class, p=0.5, list=FALSE)] <- "train"
> table(examp$group, examp$valid)
test train
A 3 3
B 3 3
C 5 1
D 1 5
> table(examp$class, examp$valid)
test train
P 2 2
Q 2 2
R 2 2
S 2 2
T 2 2
U 2 2
How can I create a partition which is balanced in both factors? If I had multiple reps per group/class combination, I would stratify by interaction(group,class), but I cannot in this case since there is only one observation in each combo.
I propose this algorithm (it assumes an even number of groups and an even number of classes):
1. Randomly sort the unique group values (e.g., DBAC).
2. Iterate over adjacent pairs of the randomly sorted group values (e.g., first DB, then AC). For each pair:
   a. Randomly pick half of the class values.
   b. Assign the rows with the first group and in the selected half of class to TRAIN.
   c. Assign the rows with the second group and not in the selected half of class to TRAIN.
   d. Assign the remaining rows of the pair to TEST.
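A minimal base R sketch of this pairing idea follows. It rests on one stated assumption: within each pair of groups, the first group's rows in the selected half of the classes and the second group's rows in the other half go to training, and every other row of the pair goes to testing. It also assumes an even number of groups and of classes. The seed is only there to make the example repeatable.

```r
set.seed(1)

# Crossed design from the question: one row per group/class combination
examp <- data.frame(group = rep(LETTERS[1:4], each = 6),
                    class = rep(LETTERS[16:21], times = 4))

groups  <- sample(unique(examp$group))  # random order of groups
classes <- unique(examp$class)

examp$valid <- "test"                   # default; training rows are overwritten below
for (i in seq(1, length(groups) - 1, by = 2)) {
  # Random half of the classes for this pair of groups
  half <- sample(classes, length(classes) / 2)
  first  <- examp$group == groups[i]     &   examp$class %in% half
  second <- examp$group == groups[i + 1] & !(examp$class %in% half)
  examp$valid[first | second] <- "train"
}

table(examp$group, examp$valid)
table(examp$class, examp$valid)
```

Within a pair, every class contributes exactly one training row and one testing row, and every group gets half of its rows in each set, so both margins come out balanced.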

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low MaintenancePrice_medium MaintenancePrice_high MaintenancePrice_vhigh)
How would I combine the columns buying price_low, medium, etc. into one column called "BuyingPrice" with the order and their respective data in each column and the same with the maintenanceprice column?
library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low=rep(c(0,1), 10),
                 Buy_high=rep(c(0,1), 10))
one_column <- df %>%
  gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
It can be done with stack in base R :
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
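If the goal is a single ordered factor for the regression, and if the BuyingPrice_* columns are 0/1 indicators with exactly one 1 per row (an assumption, since the question does not show the data), a base R sketch could collapse them with max.col(). The data frame cars and its values below are invented purely for illustration.

```r
# Hypothetical example data: one 0/1 indicator column per price level
cars <- data.frame(
  BuyingPrice_low    = c(1, 0, 0, 0),
  BuyingPrice_medium = c(0, 1, 0, 0),
  BuyingPrice_high   = c(0, 0, 1, 0),
  BuyingPrice_vhigh  = c(0, 0, 0, 1)
)

# Indicator columns, in the order the levels should have
price_cols   <- grep("^BuyingPrice_", names(cars), value = TRUE)
price_levels <- sub("^BuyingPrice_", "", price_cols)

# For each row, the position of the column holding the 1
which_level <- max.col(as.matrix(cars[price_cols]), ties.method = "first")

# Single ordered factor replacing the indicator columns
cars$BuyingPrice <- factor(price_levels[which_level],
                           levels = price_levels, ordered = TRUE)
cars$BuyingPrice
```

The same steps with the prefix "MaintenancePrice_" would handle the maintenance columns.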

How do I identify the first zero in a group of ordered columns?

I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time-invariant columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like to have this column be "years since 1989" and can re-name columns accordingly.
A more general version of the question, for a series of time ordered columns, how do I identify the first column in which a given value occurs?
Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
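To get from the column position to "years since 1989", one possible sketch reads the year out of the column names. It assumes the columns are named total.YY with two-digit years in the 1900s, as in the example; the column name years.since.1989 is made up here.

```r
# Example data from the question
data <- data.frame(Name     = c("a", "b", "c", "d"),
                   total.89 = c(8, 1, 7, 2),
                   total.90 = c(6, 2, 9, 0),
                   total.91 = c(4, 4, 0, 0),
                   total.92 = c(0, 9, 0, 0))

# Position of the first zero in each row (NA if the row has no zero)
idx <- apply(data[, -1], 1, match, x = 0)

# Calendar year of each time column, taken from the column names
years <- as.integer(sub("total\\.", "", names(data)[-1])) + 1900

# Years elapsed since 1989 at the first zero
data$years.since.1989 <- years[idx] - 1989
data
```

Replacing x = 0 with another value covers the more general version of the question.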
Here is an option using max.col with rowSums
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2

How would I convert a species x site dataframe into a matrix? [duplicate]

I have a dataframe (in the form of an excel file) with rows of sampled sites and columns of each species (sp). A very standard community ecology species by sites matrix but in dataframe format.
Example data (note I added a column for site names as that's how it is in my excel file):
sites<-c("SiteA", "SiteB", "SiteC")
sp1<-c(0, 5, 2)
sp2<-c(0, 1, 2)
sp3<-c(1, 1, 4)
comm<-data.frame(sites,sp1,sp2,sp3)
In my situation I only have one of these dataframes or one "plot". I need to convert this dataframe into a matrix formatted like below:
sp site plot Abundance
1 sp1 A 1 0
2 sp2 A 1 0
3 sp3 A 1 1
4 sp1 B 1 5
5 sp2 B 1 1
6 sp3 B 1 1
7 sp1 C 1 2
8 sp2 C 1 2
9 sp3 C 1 4
I have looked into using techniques described in this previous post
(Collapse species matrix to site by species matrix)
but the end result is different from mine, where I need my matrix to ultimately look like what I showed above.
Any help would be greatly appreciated.
Using the reshape2 package:
library(reshape2)
comm.l <- melt(comm, id.vars = "sites", variable.name = "sp", value.name = "Abundance")
comm.l$plot <- 1
A solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
comm2 <- comm %>%
gather(sp, Abundance, starts_with("sp")) %>%
mutate(site = sub("Site", "", sites), plot = 1) %>%
select(sp, site, plot, Abundance) %>%
arrange(site)
comm2
sp site plot Abundance
1 sp1 A 1 0
2 sp2 A 1 0
3 sp3 A 1 1
4 sp1 B 1 5
5 sp2 B 1 1
6 sp3 B 1 1
7 sp1 C 1 2
8 sp2 C 1 2
9 sp3 C 1 4
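For comparison, here is a minimal base R sketch of the same wide-to-long step with no extra packages. It transposes the abundance columns so that the values come out site by site, in the row order shown in the question. The object names abund and comm.long are just for illustration.

```r
# Example data from the question
sites <- c("SiteA", "SiteB", "SiteC")
comm <- data.frame(sites,
                   sp1 = c(0, 5, 2),
                   sp2 = c(0, 1, 2),
                   sp3 = c(1, 1, 4))

# Abundance columns only, as a numeric matrix (rows = sites, columns = species)
abund <- as.matrix(comm[-1])

comm.long <- data.frame(
  sp        = rep(colnames(abund), times = nrow(abund)),             # sp1, sp2, sp3 for each site
  site      = rep(sub("Site", "", comm$sites), each = ncol(abund)),  # A, A, A, B, B, B, ...
  plot      = 1,
  Abundance = as.vector(t(abund))                                    # values read row by row
)
comm.long
```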

loop ordinal regression statistical analysis and save the data R

Could you please help me with a loop? I am relatively new to R.
The short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have a code for ordinal regression analysis for one participant for one trial (eye-tracking data - eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered regression model (link='probit', i.e. an ordered probit model)----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------exponentiated coefficients (these are odds ratios only under a logit link)--------------
exp(coef(m))
The eyeData file is a very large data set consisting of 91832 observations of 11 variables. In total there are 41 participants with 78 trials each. In my code I extract the data for one trial from one participant to run the analysis. However, it takes a long time to run the analysis manually for all trials for all participants. Could you please help me create a loop that will read in all 78 trials from all 41 participants and save the statistical output (I want to save summary(m), ci, and coef(m)) in one file?
Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly. Then you run the regressions and save the output as an R object:
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
  ss <- eyeData[eyeData$uniqueIdent == un, ]
  ss <- ss[!duplicated(ss$wordTar), ] #maybe do this outside the loop
  ss$lFreq <- log(ss$Freq) #you could do this outside the loop too
  #create DV
  ss$rankF <- as.factor(seq(nrow(ss)))
  m <- clm(rankF~lFreq*Len, data=ss, link='probit')
  seeSumm <- summary(m)
  ci <- confint(m)
  oddsR <- exp(coef(m))
  save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
  # add -un- to the output file name to be able to identify where it came from
}
Variations of this could include combining the output of every iteration in a list: create an empty list, named by identifier, before the loop, and after running the estimation and post-estimation commands store that iteration's results in it (here called "gatherRes"):
gatherRes <- setNames(vector(mode = "list", length = length(uniqueID)), uniqueID) ##before the loop
gatherRes[[un]] <- list(seeSumm, ci, oddsR) ##last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
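A minimal sketch of that function-plus-lapply pattern is shown below. Because the eye-tracking data are not available, it uses lm() on the built-in iris data as a stand-in for the clm() fit; the function name fit_one and the output file name are made up for illustration.

```r
# Stand-in for the per-trial analysis: fit a model and collect the pieces to keep
fit_one <- function(d) {
  m <- lm(Sepal.Width ~ Sepal.Length, data = d)
  list(summary = summary(m), confint = confint(m), coef = coef(m))
}

# One data frame per group; for the real data the grouping would be the
# participant/trial identifier instead of Species
pieces <- split(iris, iris$Species)

# Apply the analysis to every piece; the result is a named list of lists
results <- lapply(pieces, fit_one)

# Save everything in a single file (written to a temporary directory here)
out_file <- file.path(tempdir(), "results.rds")
saveRDS(results, file = out_file)

results$setosa$coef
```

readRDS(out_file) reloads the whole list later, so all statistics live in one file rather than one file per trial.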
Here is a solution using the plyr package (it is more concise than a for loop, though not necessarily faster).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
  m = lm(Sepal.Width ~ Sepal.Length, data = x)
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, using your variables of interest as grouping
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
  #Remove duplicates
  x = x[!duplicated(x$wordTar), ]
  # Make log frequencies
  x$lFreq <- log(x$Freq)
  # Make ranks
  x$rankF <- as.factor(seq(nrow(x)))
  # Fit model
  m <- clm(rankF~lFreq*Len, data=x, link='probit')
  # Return list of statistics
  return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)
