Combining factor levels via taking the mean - r

Newbie here.
I have a dataset with columns "YEAR" (2014-2019), "SITE" (7 SITES), "TRANSECT" (UPSTREAM,DOWNSTREAM), and about 50 Insect species columns containing counts of individuals. I want to average the upstream and downstream samples for each year and site. The end goal is a dataset with columns "YEAR", "SITE", and the 50 Insect species columns containing the mean of the upstream and downstream counts. I have tried several methods to do this but have been unsuccessful. The following code is the last thing I have tried.
INS_YxS<-aggregate(INV.MEANS[5:54], INV.MEANS[1:3], mean)
Columns 1-4 in this dataset are X, YEAR, SITE, TRANSECT. 5-54 are Insect Species.
The resulting dataset appeared to have the correct columns but it looks like it just removed the TRANSECT column without averaging the upstream and downstream species counts... Anyone know how to accomplish what I am trying to do?
Here is a visual representation of what my data looks like (table 1) and what I want it to look like (table 2): https://i.stack.imgur.com/WkX4e.png
Notice that in table 2 there is no TRANSECT column and that the new values in the insect columns are the means of the UPSTREAM and DOWNSTREAM TRANSECT rows for each YEAR and SITE, resulting in fewer rows.
Apologies, I am trying to find the best way to explain what I want to do...
I know the answer is out there and depends on me asking the correct question...
Thank you!!!

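Consider the formula version of aggregate with dot notation. Group by YEAR and SITE only, so the UPSTREAM and DOWNSTREAM rows are averaged together, and drop X and TRANSECT from the data so the dot expands only to the species columns:
INSECTS_MEANS <- aggregate(
  . ~ YEAR + SITE,
  data=INSECTS_COUNTS[, !(names(INSECTS_COUNTS) %in% c("X", "TRANSECT"))],
  FUN=mean, na.rm=TRUE,
  na.action=na.omit
)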
Otherwise you need to pass a list of grouping variables to the by argument, again leaving TRANSECT out so that it is averaged over:
INSECTS_MEANS <- aggregate(
  x = INSECTS_COUNTS[5:ncol(INSECTS_COUNTS)],
  by = list(
    YEAR = INSECTS_COUNTS$YEAR,
    SITE = INSECTS_COUNTS$SITE
  ),
  FUN=mean, na.rm=TRUE
)
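To check that the grouping does what the question asks, here is a minimal sketch on made-up data. The species columns Mayfly and Caddis and all values are hypothetical; only the X, YEAR, SITE, TRANSECT layout comes from the question. Grouping by YEAR and SITE alone, with TRANSECT left out, collapses the upstream and downstream rows into their mean.

```r
# Toy data standing in for the real counts (species names and values are made up)
INSECTS_COUNTS <- data.frame(
  X        = 1:4,
  YEAR     = c(2014, 2014, 2015, 2015),
  SITE     = c("A", "A", "A", "A"),
  TRANSECT = c("UPSTREAM", "DOWNSTREAM", "UPSTREAM", "DOWNSTREAM"),
  Mayfly   = c(2, 4, 10, 0),
  Caddis   = c(1, 3, 5, 7)
)

# Keep only the grouping columns and the species columns
keep <- !(names(INSECTS_COUNTS) %in% c("X", "TRANSECT"))

# Group by YEAR and SITE only, so the two transects are averaged together
INSECTS_MEANS <- aggregate(. ~ YEAR + SITE, data = INSECTS_COUNTS[keep], FUN = mean)
print(INSECTS_MEANS)
```

The result has one row per YEAR and SITE and no TRANSECT column, which is the shape described in the question.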

Related

Species-Area curves in R using Modified Whittaker Plots: how to write code to go from a binary occurrence matrix to species-area accumulation?

I used Modified Whittaker Plots to measure the rate of species accumulation as I increased my search area (https://www.researchgate.net/figure/Diagram-of-modified-Whittaker-plot-and-subplot-establishment_fig1_11253141).
The tricky part is that I need to count only new occurrences of species as they are encountered within a given Whittaker plot site. I don't want to just count the total number of species in each plot and then add them up, because I would be multi-counting common species that occur in more than one subplot.
I've tried looking around for anything similar, and while I can find information on simulating species-area curves, I can't find anything that helps me use real data (https://www.r-bloggers.com/2012/08/r-for-ecologists-simulating-species-area-curves-linear-vs-nonlinear-regression/).
I'm aware of the "specaccum" command in the vegan package, but am not sure how I would use it here.
Ultimately I need to go from a binary community matrix of species occurrences to a single column dependent variable of species richness that adds only occurrences of never-before-seen species to the total from the prior row. Hope that made sense!
Data can be found here, access set to anyone with the link: https://docs.google.com/spreadsheets/d/1gwyhgLNvTt1yriLX9qeMtsJnI5t540o-fmwzDUKGEAg/edit?usp=sharing
In my code I used a .csv . Unfortunately all I've been able to figure out is getting my data into a binary occurrence matrix and calculating individual species richness for each plot, but I can't figure out what the code-logic would be to tell R to proceed from the top row down, only adding new species occurrences to the richness column rather than just adding up the total number of species in a given subplot as it is now.
Code:
#Load full whitaker dataset
whitakern <- read.csv("WhitakerPlotsKern2019.csv")
#remove i.. from 1st column name... this may or may not be necessary with downloaded Google Sheets data
colnames(whitakern)[1] <- gsub('^...','',colnames(whitakern)[1])
#Plot # 9 was untreated/invaded and was the only replicate of its type; it must be removed
whitakern <- subset(whitakern, SiteType!="NoTrt")
#For species richness and abundance datasets to be combined in the same spreadsheet,
#richness hits were recorded as '.' rather than '1'. So we need to convert these back to usable binary counts:
whitakern[whitakern == '.'] <- '1'
#separate out count data from treatment variables
spac.spp <- whitakern[,-c(1:8)]
spac.trt <- whitakern[,c(1:8)]
#make sure count data is in correct (numeric) format
spac.spp[sapply(spac.spp, is.character)] <- lapply(spac.spp[sapply(spac.spp, is.character)], as.numeric)
#now we need to convert abundance counts to richness counts.
#Any abundance hit >1 needs to be replaced with 1 so we can calculate species richness rather than relative abundance
spac.spp[spac.spp > 1] <- 1
spac.trt$Richness <- rowSums(spac.spp[,c(1:93)], na.rm=TRUE)
#remove columns that are not de facto plant species, and remove the "Hits" column
spac.trt <- spac.trt[,-c(8:13)]
spac.dat <- cbind(spac.trt, spac.spp)
write.csv(spac.dat, "Whitaker-Data.csv")
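One possible way to express the "only count never-before-seen species" logic, sketched on a made-up occurrence matrix rather than the real spreadsheet. It assumes the rows are already in search order, all belong to a single site, and contain no NA values; the names occ, seen and cum_richness are illustrative only.

```r
# Made-up binary occurrence matrix: rows are subplots in search order, columns are species
occ <- matrix(
  c(1, 0, 0,
    1, 1, 0,
    0, 1, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("sp1", "sp2", "sp3"))
)

# cummax() down each column switches to 1 at a species' first occurrence and stays 1 afterwards
seen <- apply(occ, 2, cummax)

# Row sums of the "seen so far" matrix give the cumulative richness
cum_richness <- rowSums(seen)
print(cum_richness)
```

With several sites, the same step would have to be applied separately to each site's rows (for example after split()), so that a species seen at one site is still counted as new at the next.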

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate, I am very new to R and programming in general (<1m experience). I was recently given the opportunity to do data analysis on a project I wish to write-up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a dataframe, with columns of patient ID ('Cow ID'), location of sample ('QTR', either LH LF RH or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
For any given day, each of the four anatomic locations for each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example. But here the answer might be easy enough that it's not necessary...
You can use the same code to do what you want. If we look at the aggregate documentation ?aggregate we find that the second argument by is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
Returns the "double grouped" means
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
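Applied to the question's column names, this might look like the sketch below. The data frame is made up (two cows, two dates, SCC values 1 to 16) and only mimics the structure described in the question.

```r
# Made-up data with the same column names as in the question
cida_ams_scc_csv <- data.frame(
  `Cow ID` = rep(c(101, 102), each = 8),
  QTR      = rep(c("LH", "LF", "RH", "RF"), times = 4),
  Date     = rep(rep(c("05/10/21", "06/10/21"), each = 4), times = 2),
  SCC      = 1:16,
  check.names = FALSE  # keep the space in "Cow ID"
)

# Two grouping elements in the by list: one mean per cow and quarter
scc_means <- aggregate(
  cida_ams_scc_csv$SCC,
  by = list(Cow = cida_ams_scc_csv$`Cow ID`, QTR = cida_ams_scc_csv$QTR),
  FUN = mean
)
print(scc_means)
```

Naming the list elements (Cow, QTR) gives readable column names in the result; the means end up in a column called x.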
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
QTR=gl(
n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40
),
Date=rep(c("06/10/2021", "05/10/2021"), each=5),
SCC=runif(40),
Cow_ID=1:5)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
group_by(Cow_ID, QTR) %>%
summarise(grouped_mean=mean(SCC))
which returns one row per Cow_ID and QTR combination, with the mean SCC in the grouped_mean column.

How to calculate the average of different groups in a dataset using R

I have a dataset in R that I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second problem by simply finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error that notes replacement has 149 rows, data has 2925. I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
The tapply should work
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)
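The error in the question arises because tapply returns one value per year (149 values), not one per row (2925 rows). A minimal sketch of one way around this uses ave() from base R, which returns a vector as long as the data. The teams data frame below is a made-up stand-in, and the column name SBmean is hypothetical.

```r
# Made-up stand-in for the teams data (yearID and SB as in the question)
teams <- data.frame(
  yearID = c(1871, 1871, 1872, 1872, 1872),
  SB     = c(10, 20, 3, 6, 9)
)

# tapply() returns one value per year, which is why it cannot be assigned as a column
year_means <- tapply(teams$SB, teams$yearID, FUN = mean)

# ave() returns one value per row, repeating each year's mean for every entry of that year
teams$SBmean <- ave(teams$SB, teams$yearID, FUN = mean)
print(teams)
```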

R assign categorical variables to matrix

I have 5 categorical variables: age(5 levels), sex(2 levels), zone(4 levels), qmat(5 levels), and qsoc(5 levels) for a total of 1000 unique combinations. Each unique combination has a corresponding data value (e.g. population size). I would like to assign this data to a 1000 x 6 table where the first five columns contain the indices of age, sex, zone, qmat, qsoc and the 6th column holds the data value.
I would like to avoid using nested for loops which are inefficient in R (some of my datasets will have more than 1000 unique combinations). I know there exist many tools in R for parallel operations (but am not familiar with them). Is there an efficient way to perform the above variable assignment using parallel/vector operations? Any suggestions or references would be appreciated.
It's hard to tell what your original data looks like, but assuming that you have your data in a data frame, you may want to use aggregate().
# simulating a data frame
set.seed(1)
N = 9000
df = data.frame(pop=rnorm(N),
                age=sample(1:5, N, replace=TRUE),
                sex=sample(1:2, N, replace=TRUE)
)
# 'aggregate' this data frame by 'age' and 'sex'
newData = aggregate(pop ~ age + sex, data=df, FUN=sum)
The R function expand.grid() will solve my problem, e.g.:
expand.grid(list(age,sex,zone,qmat,qsoc))
Thanks for all the responses and I apologize for any possible vagueness in the wording of my question.
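For reference, a small sketch of what expand.grid() produces with the level counts from the question. The levels are represented by plain integer indices, and the value column is a placeholder, since the actual data values are not shown in the question.

```r
# Level counts taken from the question; the values themselves are just indices
combos <- expand.grid(
  age  = 1:5,
  sex  = 1:2,
  zone = 1:4,
  qmat = 1:5,
  qsoc = 1:5
)

# Sixth column for the data value, filled with a placeholder here
combos$value <- NA_real_

print(dim(combos))
```

This gives the 1000 x 6 table without any explicit loops; the first factor (age) varies fastest down the rows.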

recording time a taxa first appears: nested loops and conditional statements in R

Here is my example. Here is some hypothetical data resembling my own. Environmental data describes the metadata of the community data, which is made up of taxa abundances over years in different treatments.
#Elements of Environmental (meta) data
nTrt<-2
Trt<-c("High","High","High","Low","Low","Low")
Year<-c(1,2,3,1,2,3)
EnvData<-cbind(Trt,Year)
#Elements of community data
nTaxa<-2
Taxa1<-c(0,0,2,50,3,4)
Taxa2<-c(0,34,0,0,0,23)
CommData<-cbind(Taxa1,Taxa2)
#Elements of ideal data produced
Ideal_YearIntroduced<-array(0,dim=c(nTrt,nTaxa))
Taxa1_i<-c(2,1)
Taxa2_i<-c(2,3)
IdealData<-cbind(Taxa1_i,Taxa2_i)
rownames(IdealData)<-c("High","Low")
I want to know what the Year is (in EnvData) when a given taxon first appears in a particular treatment, i.e. the "introduction year". That is, if the taxon is there at year 1, I want it to record "1" in an array of Treatment x Taxa, but if that taxon in that treatment does not arrive until year 3 (which means it meets the condition that it is absent in year 2), I want it to record Year 3.
So I want these conditional statements to only loop within a treatment. In other words, I do not want it to record a taxon as being "introduced" if it is 0 in year 3 of one treatment and present in year 1 of the next.
I've approached this with several for loops, but the loops and conditional statements are getting out of hand, and there is now an error that I can't figure out. I may not be thinking of the i's and j's correctly.
The data itself is more complicated than this...has 6 years, 1102 taxa, many treatments.
#Get the index number where each treatment starts
Index<-which(EnvData[,2]==1)
TaxaIntro<-array(0,dim=dim(Comm_0)) #Array to hold results
for (i in 1:length(Index)) { #Loop through treatment (start at year 1 each time)
for (j in 1:3) { #Loop through years within a treatment
for (k in 1:ncol(CommData)) { #Loop through Taxa
if (CommData[Index[i],1]>0 ) { #If Taxa is present in Year 1...want to save that it was introduced at Year 1
TaxaIntro[i,k]<-EnvData[Index[i],2]
}
if (CommData[Index[i+j]]>0 && CommData[Index[((i+j)-j)]] ==0) { #Or if taxa is present in a year AND absent in the previous year
TaxaIntro[i,k]<-EnvData[Index[i+j],2]
}
}
}
}
With this example, I get an error related to my second conditional statement...I may be going about this the wrong way.
Any help would be greatly appreciated. I am open to other (non-loop) approaches, but please explain thoroughly as I'm not so well-versed.
Current error:
Error in if (CommData[Index[i + j]] > 0 & CommData[Index[((i + j) - j)]] == :
missing value where TRUE/FALSE needed
Based on your example, I think you could combine your environmental and community data into a single data.frame. Then you might approach your problem using functions from the package dplyr.
# Make combined dataset
dat = data.frame(EnvData, CommData)
Since you want to do the work separately for each Trt, you'll want to group_by that variable so that everything is done separately by group.
Then the problem is to find the first time each one of your Taxa columns contains a value greater than 0 and record which year that is. Because you want to do the same thing for many columns, you can use summarise_each. To get the desired summary, I used the function first to choose the first instance of Year where whatever Taxa column you are working with is greater than 0. The . refers to the Taxa columns. The last thing I did in summarise_each is to choose which columns I wanted to do this work on. In this case, you want to do this for all your Taxa columns, so I chose all columns that starts_with the word Taxa.
With chaining, this looks like:
library(dplyr)
dat %>%
group_by(Trt) %>%
summarise_each(funs(first(Year[. > 0])), starts_with("Taxa"))
The result is slightly different than yours, but I think this is correct based on the data provided (Taxa1 in High first seen in year 3 not year 2).
Source: local data frame [2 x 3]
Trt Taxa1 Taxa2
1 High 3 2
2 Low 1 3
The above code assumes that your dataset is already in order by Year. If it isn't, you can use arrange to set the order before summarising.
If you aren't used to chaining, the following code is the equivalent to above.
groupdat = group_by(dat, Trt)
summarise_each(groupdat, funs(first(Year[. > 0])), starts_with("Taxa"))
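summarise_each and funs are deprecated in current dplyr releases. A sketch of the same summary using across(), assuming dplyr 1.0.0 or later; the data frame is rebuilt directly from the question's example values so that Year is numeric.

```r
library(dplyr)

# Example data from the question, combined into one data frame
dat <- data.frame(
  Trt   = c("High", "High", "High", "Low", "Low", "Low"),
  Year  = c(1, 2, 3, 1, 2, 3),
  Taxa1 = c(0, 0, 2, 50, 3, 4),
  Taxa2 = c(0, 34, 0, 0, 0, 23)
)

# For each treatment, take the first Year in which each Taxa column is above 0
intro <- dat %>%
  group_by(Trt) %>%
  summarise(across(starts_with("Taxa"), ~ first(Year[.x > 0])))
print(intro)
```

As before, this relies on the rows being ordered by Year within each treatment.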

Resources