Applying a function to a group of a group (tapply in ddply) - r

My dataset looks like this:
d <- data.frame(year = rep(2000:2002, each = 40),
                month = rep(c(rep(1:12, 3), 5, 6, 7, 8), 3),
                species = rep(c(rep(letters[1:12], 3), "a", "b", "g", "l"), 3),
                species_group = NA,
                kg = round(rnorm(120, 15, 6), digits = 2))
d$species_group <- ifelse(d$species %in% letters[1:5], "A", "B")
I would like to have, per year and per species group (so excluding the levels of month and species), the mean weight and the number of species included. This works fine with ddply. However, I would also like to include a value for the “quality” of my data: a measure of whether the number of species per month is balanced, or whether, for example, more species are included during the summer months. Therefore I thought I might simply calculate the yearly standard deviation of the number of unique species per month.
I tried doing this with tapply in ddply as follows:
s=ddply(d,c("year","species_group"),function(x) cbind(n_species=length(unique(x$species)),
quality=tapply(x,x$month,sd(length(unique(x$species)))),
kg=sum(x$kg,na.rm=T)))
but this gives me an error:
Error in match.fun(FUN) : 'sd(length(unique(x$species)))' is not a function, character or symbol
What I would like to obtain is something like this:
output = data.frame(year = rep(2000:2002, each = 2),
                    species_group = rep(c("A", "B"), 3),
                    n_species = rep(c(7, 9), 3),
                    quality = round(rnorm(6, 2, 0.3), digits = 2),
                    kg = round(rnorm(6, 15, 6), digits = 2))
I cannot first use ddply by month, year, and species group, because then I would no longer know the number of unique species per year.
I suppose I could also calculate n_species and quality separately and put them together afterwards, but this would be a cumbersome approach.
How can I make my function work, or how can I do this more properly?
ANSWER:
The easiest solution came from shadow, who noted my mistake in the use of tapply (its FUN argument must be a function, not an already-evaluated expression). Furthermore, the standard error is more appropriate here than the standard deviation, giving the following formula:
s <- ddply(d, c("year", "species_group"), function(x) {
  n_per_month <- tapply(x$species, x$month, function(y) length(unique(y)))
  cbind(n_species = length(unique(x$species)),
        quality   = sd(n_per_month) / sqrt(length(n_per_month)),
        kg        = sum(x$kg, na.rm = TRUE))
})

It's not clear how you define your quality criterion, so here is how I would do this.
First, I define the quality criterion in a separate function. Note that your function should return a single value, not a vector (in your solution you are using tapply, which returns a vector).
## returns the mean of the sd variation per month
get_quality <- function(species, month) {
  mean(tapply(species, month, FUN = function(s) sd(as.integer(s))),
       na.rm = TRUE)
}
Then I use it within ddply. To simplify the code, I also create a function to be applied to each group.
ff <- function(x) {
  cbind(n_species = length(unique(x$species)),
        quality   = get_quality(x$species, x$month),
        kg        = sum(x$kg, na.rm = TRUE))
}
library(plyr)
s <- ddply(d, .(year, species_group), ff)
  year species_group n_species   quality     kg
1 2000             A         5 0.4000000 259.68
2 2000             B         7 0.2857143 318.24
3 2001             A         5 0.4000000 285.07
4 2001             B         7 0.2857143 351.54
5 2002             A         5 0.4000000 272.46
6 2002             B         7 0.2857143 331.45
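For readers using dplyr rather than plyr, here is a roughly equivalent sketch of the accepted formula (a sketch under assumptions: it assumes dplyr >= 1.0 for the .groups argument, and the name s2 is mine; n_distinct() is dplyr's count of unique values):
library(dplyr)
s2 <- d %>%
  group_by(year, species_group) %>%
  summarise(
    n_species = n_distinct(species),
    # standard error of the per-month counts of unique species
    quality   = {
      n_per_month <- tapply(species, month, n_distinct)
      sd(n_per_month) / sqrt(length(n_per_month))
    },
    kg        = sum(kg, na.rm = TRUE),
    .groups   = "drop"
  )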

Related

Calculate Gamma diversity over complete dataset using Vegan package in R

I have some datasets for which I want to calculate gamma diversity as the Shannon H index.
Example dataset:
Site SpecA SpecB SpecC
1 4 0 0
2 3 2 4
3 1 1 1
Calculating the alpha diversity is as follows:
vegan::diversity(df, index = "shannon")
However, I want this diversity function to calculate one number for the complete dataset instead of one for each row. I can't wrap my head around this. My thought is that I need to write a function to merge all the rows into one, taking the average abundance of each species, thus creating a dataframe with one site containing all the species information:
site SpecA SpecB SpecC
1 2.6 1 1.6
This seems like a giant workaround for something that could be done with existing functions, but I don't know how. I hope someone can help in creating this dataframe, or suggest some other method to apply the diversity() function over the complete dataframe.
Regards
library(vegan)
data(BCI)
diversity(colSums(BCI)) # vector of sums is all you need
## vegan 2.6-0 in github has 'groups' argument for sampling units
diversity(BCI, groups = rep(1, nrow(BCI))) # one group, same result as above
diversity(BCI, groups = "a") # arg 'groups' recycled: same result as above
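As a minimal check of the same idea on the asker's three-site example (a sketch; the table from the question is reconstructed with the Site column dropped so only abundances remain):
df <- data.frame(SpecA = c(4, 3, 1), SpecB = c(0, 2, 1), SpecC = c(0, 4, 1))
diversity(colSums(df))  # one Shannon H for the pooled dataset
Note that Shannon H depends only on relative abundances, so pooling with colSums() and the asker's idea of averaging per species give the same value.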

Explanation for aggregate and cbind function

First, I can't understand the aggregate and cbind functions; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this by
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command; please explain it to me very simply. Thank you!
cbind takes two or more tables (dataframes), puts them side by side, and makes them into one big table. So, for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it were, it's only a single column.
aggregate takes a table, divides it into groups by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (cbind(return), which doesn't really make sense here) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'll basically get the same table back, since you're grouping the data into groups of one row each. Actually, since your variables are an empty group, as far as I can tell, you'll get nothing back.
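To make the syntax concrete, here is a tiny hypothetical example in the spirit of the sales analogy above (the data and names are illustrative, not from the question):
# median sales per month, aggregated over days
sales_df <- data.frame(month = rep(1:2, each = 3),
                       day   = rep(1:3, 2),
                       sales = c(10, 12, 8, 20, 18, 22))
aggregate(sales ~ month, sales_df, median)
#   month sales
# 1     1    10
# 2     2    20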

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (the total number of observations for that variable is less than 3). Since R can count observations for each variable using describe(), can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version? Each version has different variables that will have low n's, and there are over 40 variables. Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
This is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
One suggested one-liner, DF being your dataframe:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
However, that line did not work for the asker. This function did the trick:
valid <- function(x) sum(!is.na(x))        # count non-missing observations
N <- apply(UIRCorrelation, 2, valid)       # n for each column
UIRCorrelation2 <- UIRCorrelation[N > 3]   # keep columns with n > 3
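For what it's worth, the same result can be sketched in one line, since colSums() on a logical matrix counts the TRUE values per column:
# keep columns with more than 3 non-missing values
UIRCorrelation2 <- UIRCorrelation[, colSums(!is.na(UIRCorrelation)) > 3, drop = FALSE]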

dplyr: Can a function called inside mutate find the element of a column from current row

I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient, and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years, which is to replace dat.
library(dplyr)  # provides %>% and mutate(); tibble() is re-exported from the tibble package
year <- rep(1:5, times = c(2, 2, 2, 2, 2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I try to do this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
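To see the failure mode in isolation, here is a minimal sketch with a hypothetical three-element vector:
x <- c(1.1, 1.2, 1.3)
x[[c(1, 2)]]  # Error: attempt to select more than one element
x[c(1, 2)]    # fine: returns 1.1 1.2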
You need to use [ instead of [[ for vector indexing. Also, year - 2 produces non-positive indices for the early years, which will cause further problems. Since the question's adjust vector has one entry per row, you can use replace() with a mask that indicates the rows to be modified, indexing adjust with the same mask:
df %>%
  mutate(dat = {
    mask <- year > 2
    # mirrors the question's ifelse, which carries year through for unaffected rows
    replace(year, mask, dat[mask] * adjust[mask])
  })
# A tibble: 10 x 2
#     year   dat
#    <int> <dbl>
#  1     1   1.0
#  2     1   1.0
#  3     2   2.0
#  4     2   2.0
#  5     3   5.5
#  6     3   6.6
#  7     4   8.4
#  8     4   9.6
#  9     5  11.7
# 10     5  13.0
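A more idiomatic dplyr alternative is to put the per-year coefficients in a small lookup table and join. This is a sketch under assumptions: per the question's adjust vector, years 3-5 get coefficients 1.1-1.3, and it follows the question's stated intent of leaving dat unchanged in unaffected years (unlike the output above, which carries year through); coef_tbl is a name introduced here.
library(dplyr)
coef_tbl <- tibble(year = 3:5, coef = 1 + 0.1 * 1:3)  # one row per affected year
df %>%
  left_join(coef_tbl, by = "year") %>%
  mutate(dat = dat * coalesce(coef, 1)) %>%  # missing coef means the year is unaffected
  select(-coef)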

Unsplit reduced data table based on two factors in R

Suppose I have a data frame in R where I would like to use two columns, "factor1" and "factor2", as factors, and I need to calculate the mean value of all other columns for each pair of those factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1    = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                          val2    = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back into a reduced table where there is only one value (the mean) per pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all columns other than the factors, you can use the formula syntax of aggregate():
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data; you were reducing the number of rows in every group to just one. Plus, unsplit really should be used with the exact same list of factors that was used to do the split, otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted (a sketch follows), but for a simple mean, aggregate() is probably best.
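For completeness, a minimal sketch of that split route (colMeans() works here because all of myDataFrame's columns are numeric; drop = TRUE skips the factor pairs that never occur):
splitDF  <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2), drop = TRUE)
meanRows <- lapply(splitDF, colMeans)
do.call(rbind, meanRows)  # one row of means per existing factor pair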
The same result can be obtained with summaryBy() in the doBy package. Although it's pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate? Note that its by argument must be a list:
aggregate(myDataFrame$valueColumn, by = list(myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$valueColumn, by = list(myDataFrame$factor2), FUN = mean)
