R: Sum each column by subgroup

I have a matrix of plant species occurrence data. The matrix is set up so that every column is a species and every row is a sampling location. I also have identifiers that group sampling locations based on certain environmental variables. I would like to compute column sums for each species, grouped by a specific environmental variable.
An example data set:
library(vegan)
data("dune")
data("dune.env")
dune$plot <- c(1:20); dune.env$plot <- c(1:20)
data <- merge(dune, dune.env)
So there are now 20 plots, with 30 species observed, and 5 associated environmental variables. I would like to generate the sum of the number of individuals observed per species, grouped by "Management". I have tried something like this:
library(tidyverse)
sums <- group_by(data, data$Management) %>% colSums(data[,(2:31)], na.rm = TRUE)
but I always get an error about incorrect dims. I am not sure how I would go about solving my problem. Ideally, the result would be a dataframe with 4 rows (1 for each management type) where all the species (cols 2:31) have been summed.

rowsum does what you need:
dat <- merge(dune, dune.env)
> rowsum(dat[,2:31], dat$Management)
Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu ...
BF 7 0 0 2 4 5 8 0 ...
HF 6 7 0 8 9 2 4 0 ...
NM 2 13 5 0 8 2 0 0 ...
SF 1 28 0 26 0 4 3 1 ...
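As a minimal sketch of what rowsum is doing here, the toy example below uses made-up counts (the names counts, group, sp1 and sp2 are hypothetical and not part of the dune data). It sums every column within each group and, for comparison, shows aggregate producing the same sums with the group as an ordinary column.

```r
# Hypothetical toy data: two species columns, four plots, two management types
counts <- data.frame(sp1 = c(1, 0, 3, 2), sp2 = c(0, 4, 1, 1))
group <- c("BF", "HF", "BF", "HF")

# rowsum() adds up the rows that share a group label, column by column
sums <- rowsum(counts, group)
print(sums)

# aggregate() gives the same sums, with the group kept as a regular column
sums_agg <- aggregate(counts, by = list(Management = group), FUN = sum)
print(sums_agg)
```

The rowsum result carries the group labels as row names, which is convenient for matrix-style work; the aggregate result may be easier to join back to other tables.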

Use data.table:
require(data.table)
a <- merge(dune, dune.env)
setDT(a)
a[, lapply(.SD, sum), by = Management, .SDcols = names(a)[2:31]]

Well, I was doing something very similar a few days ago:
How to obtain species richness and abundance for sites with multiple samples using dplyr
To modify the excellent answer given by @akrun:
df <- merge(dune, dune.env)
library(dplyr)
df2 <- df %>%
  group_by(Management) %>%
  summarise_at(vars(Achimill:Callcusp), sum)
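In dplyr 1.0.0 and later, summarise_at is superseded by across. The sketch below shows the same grouped column sum written that way; it uses a small made-up data frame (toy, sp1 and sp2 are hypothetical names) so that it runs without the vegan data.

```r
library(dplyr)

# Hypothetical toy data standing in for the merged dune table
toy <- data.frame(
  Management = c("BF", "HF", "BF", "HF"),
  sp1 = c(1, 0, 3, 2),
  sp2 = c(0, 4, 1, 1)
)

# Sum every species column within each management type
sums <- toy %>%
  group_by(Management) %>%
  summarise(across(sp1:sp2, sum))
print(sums)
```

For the dune data, the column range would presumably be Achimill:Callcusp, as in the answer above.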

Related

Why aren't stratum sample sizes maintained using rsample bootstraps?

[After seeing joran's comment below about the make_strata() function, I filed an issue with rsample on Github.]
I'm trying to take stratified bootstrap samples from a data frame. I want separate bootstrap samples to be taken within each stratum, so that the resulting bootstrap sample has the same number of observations in each stratum as the original data frame. However, that does not always happen when using the bootstraps() function of the rsample package. When I run this code:
library(rsample)
mydf <- data.frame(A=1:58, B=rep(1:4, c(6, 6, 23, 23)))
lboots <- bootstraps(mydf, times=3, strata="B")$splits
lbootsdf <- lapply(lboots, as.data.frame)
with(mydf, table(B))
lapply(lbootsdf, function(df) table(df$B))
These are the results I get:
B
1 2 3 4
6 6 23 23
$`1`
1 2 3 4
10 5 20 23
$`2`
1 2 3 4
3 8 24 23
$`3`
1 2 3 4
4 5 24 25
I was expecting to see 6 1's, 6 2's, 23 3's, and 23 4's in each of the three bootstrap samples.
How can I take the type of stratified bootstrap sample that I want?

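This doesn't use rsample::bootstraps but instead constructs the bootstrap samples explicitly.
library("dplyr")
library("tidyr")  # crossing() comes from tidyr
splits <- mydf %>%
  crossing(id = seq(2)) %>%  # one copy of the data per bootstrap replicate
  group_by(id, B) %>%  # resample within each replicate and stratum
  sample_n(n(), replace = TRUE) %>%
  ungroup()
Note that lboots$splits[[id]]$data are copies of the original data.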

It doesn't look as if you're doing a bootstrap sample because you're not estimating the sampling distribution of a statistic. What it seems to me that you're trying to do is a stratified sample (i.e. instead of a simple random sample) of the data stored in mydf$A using mydf$B as the strata.
The package dplyr has a function that is purpose-built for this scenario, sample_frac:
library(dplyr)
mydf <- data.frame(A=1:58, B=rep(1:4, c(6, 6, 23, 23)))
data_grouped_by_stratum <- mydf %>% group_by(B)
data_sampled_by_stratum <- data_grouped_by_stratum %>% sample_frac(size = 1, replace = TRUE)
# Now, a bit of cleanup on the resulting tibble object
df_of_data_sampled_by_stratum <- data_sampled_by_stratum %>% ungroup() %>% as.data.frame()
In the call to sample_frac, size=1 means that the fraction of the rows to sample within each group is 1; i.e. 100% of the group's rows.
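To check that resampling within each stratum really preserves the stratum sizes, here is a minimal base R sketch. The helper stratified_boot is a hypothetical name introduced for illustration; it is not part of rsample or dplyr.

```r
set.seed(1)
mydf <- data.frame(A = 1:58, B = rep(1:4, c(6, 6, 23, 23)))

# Hypothetical helper: resample row indices with replacement inside each stratum
stratified_boot <- function(df, strata) {
  rows_by_stratum <- split(seq_len(nrow(df)), df[[strata]])
  picked <- lapply(rows_by_stratum, function(rows) {
    # sample.int() avoids the surprise sample() has with length-one vectors
    rows[sample.int(length(rows), replace = TRUE)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

boot <- stratified_boot(mydf, "B")

# Stratum sizes in the bootstrap sample match the original data
print(table(mydf$B))
print(table(boot$B))
```

Because each stratum is resampled separately, the counts per stratum are fixed by construction; only which rows appear (and how often) varies between replicates.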

When I looked at the "B" component of the lboots object (made without subsetting splits), I saw consistency in the sampling fraction.
But that is apparently not the designed outcome, as joran points out. This appears to be a package in early development, since the documentation is not in sync with the code:
maintainer("rsample")
[1] "Max Kuhn <max@rstudio.com>"
lboots <- bootstraps(mydf, times=3, strata="B")
str(lboots)
table(lboots$splits[['1']]$data$B)
1 2 3 4
6 6 23 23
> table(lboots$splits[['2']]$data$B)
1 2 3 4
6 6 23 23
> table(lboots$splits[['3']]$data$B)
1 2 3 4
6 6 23 23

How do I get the difference of two groups in one dataframe (longtable) in R?

I have this given dataframe:
days classtype scores
1 1 a 49
2 1 b 47
3 2 a 36
4 2 b 41
It is produced by this code:
days=c(1,1,2,2)
classtype=c("a","b","a","b")
scores=c(49,47,36,41)
myData=data.frame(days,classtype,scores)
print(myData)
What lines do I need to add to the code in order to calculate the difference in scores of the two classes for each day? I want to get this output:
days difference_in_scores
1 1 2
2 2 -5

If the format of your data is consistently as you have shown then you can accomplish this very neatly using data.table:
library(data.table)
setDT(myData)
myData[, diff(scores), by = days]
days V1
1: 1 -2
2: 2 5
Or using just base-R:
aggregate(scores ~ days, myData, FUN = diff)
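Note that diff(scores) computes the second value minus the first (b minus a here), which has the opposite sign to the output requested in the question, and it relies on the rows being ordered a then b within each day. A minimal base R sketch that computes a minus b explicitly, without depending on row order, might look like this (the names a, b, m and result are introduced only for illustration):

```r
days <- c(1, 1, 2, 2)
classtype <- c("a", "b", "a", "b")
scores <- c(49, 47, 36, 41)
myData <- data.frame(days, classtype, scores)

# Split by class, then pair the rows up by day
a <- myData[myData$classtype == "a", ]
b <- myData[myData$classtype == "b", ]
m <- merge(a, b, by = "days", suffixes = c("_a", "_b"))

# Difference is class a minus class b for each day
result <- data.frame(days = m$days,
                     difference_in_scores = m$scores_a - m$scores_b)
print(result)
```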

One approach you could take:
library(dplyr)
library(reshape2)
days=c(1,1,2,2)
classtype=c("a","b","a","b")
scores=c(49,47,36,41)
myData=data.frame(days,classtype,scores)
myData %>%
  # convert the data to wide format
  dcast(days ~ classtype,
        value.var = "scores") %>%
  # calculate differences
  mutate(difference_in_scores = a - b) %>%
  # remove columns (just to match your desired output)
  select(days, difference_in_scores)

How to re-arrange a data.frame

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can take one of two values. Currently this nominal variable is a column. Instead, I would like to have two columns, representing the two values this nominal variable can take. Here is an example data frame. s is the nominal variable, with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr but I am not sure what the best route is and my data frame that I would like to re-arrange is quite large.

You want spread:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
n c t
1 1 23 11
2 2 5 6
3 3 16 12
4 4 3 41
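In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A minimal sketch of the same reshaping with the newer function, using the question's data:

```r
library(tidyr)

n <- c(1, 1, 2, 2, 3, 3, 4, 4)
s <- c("t", "c", "t", "c", "t", "c", "t", "c")
b <- c(11, 23, 6, 5, 12, 16, 41, 3)
mydata <- data.frame(n, s, b)

# One row per n, one column per level of s, filled with the values of b
new.data <- pivot_wider(mydata, names_from = s, values_from = b)
print(new.data)
```

One difference to be aware of: pivot_wider orders the new columns by first appearance in the data (t, then c), whereas spread sorts them.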

How about unstack(mydata, b~s):
c t
1 23 11
2 5 6
3 16 12
4 3 41

How to omit rows of two highest and the lowest value by group in R

This seems a very basic question, but I just can't seem to find the solution.
How do you remove the (three) rows with the two highest and the lowest values of a variable, grouped by several factors, in R? I have modified the airquality dataset a little to get an example (sorry, I am still a beginner):
set.seed(1)
airquality$var1 <- c(sample(1:3, 153, replace=T))
airquality$var2 <- c(sample(1:2, 153, replace=T))
airquality2 <- airquality
airquality2$Solar.R <- as.numeric(airquality2$Solar.R)
airquality2$Solar.R <- airquality2$Solar.R*2
airquality3 <- airquality
airquality3$Solar.R <- as.numeric(airquality3$Solar.R)
airquality3$Solar.R <- airquality3$Solar.R*2.5
test <- round(na.omit(rbind(airquality, airquality2, airquality3)))
test$var1 <- factor(test$var1)
test$var2 <- factor(test$var2)
head(test)
Which comes to:
head(test)
# Ozone Solar.R Wind Temp Month Day var1 var2
# 1 41 190 7 67 5 1 1 1
# 2 36 118 8 72 5 2 2 2
# 3 12 149 13 74 5 3 2 1
# 4 18 313 12 62 5 4 3 2
# 7 23 299 9 65 5 7 3 1
# 8 19 99 14 59 5 8 2 1
Now I would like to remove the rows with the two highest and the lowest values of Solar.R with something like group_by(Month, var1, var2). Since there are 30 factor combinations (5*3*2), 90 rows should be omitted. The rest of the data should stay the same. I looked at Min & Max, but could not get it to work. Any help would be gladly appreciated.

I think you're looking for slice:
library("dplyr")
sliced =
  test %>%
  group_by(Month, var1, var2) %>% # group
  arrange(Solar.R) %>% # order by Solar.R (row order is kept within each group)
  slice(3:(n() - 2)) # keep the 3rd through the 3rd-to-last row
nrow(sliced)
# [1] 233
Edit: I had 3:(n() - 3) at first, corrected to 3:(n() - 2). A nice sanity check is to think of (1:10)[3:(10 - 3)] vs (1:10)[3:(10 - 2)]. I didn't bother to read your simulation code, but when I checked things out with n_groups() I saw 27 groups, not 30 as stated in your question. (Perhaps a seed issue; with rawr's set.seed(1) there are 28 groups.)
More edits: Based on your edit, it looks like you want to omit the lowest value and the two highest values rather than the two lowest and two highest. Simply change 3:(n() - 2) to 2:(n() - 2) to make that adjustment.
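A small sanity check of the 2:(n() - 2) variant, as a sketch on made-up data (toy, g and v are hypothetical names). It assumes every group has at least four rows; with fewer, 2:(n() - 2) becomes a descending sequence and slice would not behave as intended.

```r
library(dplyr)

# Hypothetical toy data: two groups of six values each
toy <- data.frame(
  g = rep(c("x", "y"), each = 6),
  v = c(5, 1, 9, 3, 7, 2, 10, 60, 20, 50, 30, 40)
)

kept <- toy %>%
  group_by(g) %>%
  arrange(v, .by_group = TRUE) %>%  # sort ascending inside each group
  slice(2:(n() - 2)) %>%            # drop the lowest and the two highest
  ungroup()
print(kept)
```

Each group of six loses three rows, so three remain per group, which mirrors the "three rows per factor combination" count in the question.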

Here is a data.table way of doing this, though I guess the dplyr version would be more verbose.
require(data.table)
set.seed(1)
airquality$var1 <- c(sample(1:3, 153, replace=T))
airquality$var2 <- c(sample(1:2, 153, replace=T))
airquality2 <- airquality
airquality2$Solar.R <- as.numeric(airquality2$Solar.R)
airquality2$Solar.R <- airquality2$Solar.R*2
airquality3 <- airquality
airquality3$Solar.R <- as.numeric(airquality3$Solar.R)
airquality3$Solar.R <- airquality3$Solar.R*2.5
test <- round(na.omit(rbind(airquality, airquality2, airquality3)))
test$var1 <- factor(test$var1)
test$var2 <- factor(test$var2)
dt_test <- as.data.table(test)
dt_test[,.SD[order(-Solar.R)][c(3:(.N-1))],.(Month,var1,var2)]

We can also use .I to get the row index in data.table and then subset it based on that.
library(data.table)
i1 <- setDT(test)[order(Solar.R), .I[2:(.N-2)], .(Month, var1, var2)]$V1  # ascending order: drop the lowest and the two highest
test[i1]

Grouping an entire data set and aggregating

I have a dataset of 20 variables, V1, V2, V3, ..., V20, with 1,200 rows.
I want to average every four rows in my data frame, i.e. my output dataset should have 20 columns (V1, V2, V3, ..., V20) and 300 rows, each containing the average of a group of 4 rows.
I cannot use tapply because it takes one variable at a time; I want to average all 20 variables at once.
Is there an efficient way to do this? I want to use functions from the apply family and would like to avoid looping.

Using lapply with colMeans
set.seed(42)
dat <- as.data.frame(matrix(sample(1:20, 20*1200, replace=TRUE), ncol=20))
n <- seq_len(nrow(dat))
res <- do.call(rbind,lapply(split(dat, (n-1)%/%4 +1),colMeans, na.rm=TRUE))
dim(res)
#[1] 300 20
Explanation
The idea is to create a grouping variable that splits the dataset into a list of subsets, so that rows 1:4 go into the first subset, rows 5:8 into the second, and so on; the last subset holds rows 1197:1200. For ease of understanding, consider a subset of rows. Suppose your dataset has 10 rows:
n1 <- seq_len(10)
n1
#[1] 1 2 3 4 5 6 7 8 9 10
(n1-1) %/%4 #created a numeric index to split by group
# [1] 0 0 0 0 1 1 1 1 2 2
I added 1 to the above to start from 1 instead of 0
(n1-1) %/%4 +1
#[1] 1 1 1 1 2 2 2 2 3 3
You could also use gl, i.e.
gl(3, 4, 10)
For the full dataset, it would be
gl(300, 4)
Now, you can either split n1 by the newly created grouping index or the dataset
split(n1,(n1-1) %/%4 +1) # you can check the result of this
For a subset of 10 rows of the dataset
split(dat[1:10,], (n1-1) %/%4 +1)
and then use lapply along with colMeans to get the column means of each list element and rbind them using do.call(rbind,..)
Or
summarise_each from dplyr
library(dplyr)
res2 <- dat %>%
  mutate(N= (row_number()-1)%/%4+1) %>%
  group_by(N) %>%
  summarise_each(funs(mean=mean(., na.rm=TRUE))) %>%
  select(-N)
dim(res2)
#[1] 300 20
all.equal(as.data.frame(res), as.data.frame(res2), check.attributes=FALSE)
#[1] TRUE
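summarise_each and funs are deprecated in current dplyr releases. As a hedged sketch, the same computation can be written with across (dplyr 1.0.0 or later); res3 is a name introduced here for illustration.

```r
library(dplyr)

set.seed(42)
dat <- as.data.frame(matrix(sample(1:20, 20 * 1200, replace = TRUE), ncol = 20))

res3 <- dat %>%
  mutate(N = (row_number() - 1) %/% 4 + 1) %>%   # group index: rows 1-4 -> 1, 5-8 -> 2, ...
  group_by(N) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  select(-N)
dim(res3)
```

Inside a grouped summarise, across(everything(), ...) skips the grouping column, so only V1 to V20 are averaged.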
Or
Using data.table
library(data.table)
DT1 <- setDT(dat)[, N:=(seq_len(.N)-1)%/%4 +1][,
lapply(.SD, mean, na.rm=TRUE), by=N][,N:=NULL]
dim(DT1)
#[1] 300 20