Conditional summing across columns with dplyr - r

I have a data frame with four habitats sampled over eight months. Ten samples were collected from each habitat each month, and the number of individuals of each species in each sample was counted. The following code generates a smaller data frame of a similar structure.
# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
I want to sum the total number of individuals by month, across all species sampled. I'm using ddply (preferred) but I'm open to other suggestions.
The closest I get is to add together the sum of each column, as shown here.
library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))
# Month tot_by_mon
# 1 Jan 84
# 2 Feb 92
# 3 Mar 67
This works, but I wonder if there is a generic method to handle cases with an "unknown" number of species. That is, the first species always begins in the 4th column but the last species could be in the 10th or 42nd column. I do not want to hard code the actual species names into the summary function. Note that the species names vary widely, such as Doryflav and Pheibica.

Similar to @useR's answer with data.table's melt, you can use tidyr to reshape with gather:
library(tidyr)
library(dplyr)
gather(df, Species, Value, matches("Species")) %>%
group_by(Month) %>% summarise(z = sum(Value))
# A tibble: 3 x 2
Month z
<fctr> <int>
1 Jan 90
2 Feb 81
3 Mar 70
If you know the columns by position instead of a pattern to be "matched"...
gather(df, Species, Value, -(1:3)) %>%
group_by(Month) %>% summarise(z = sum(Value))
(Results shown using @akrun's set.seed(123) example data.)
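In current tidyr releases, gather() is superseded by pivot_longer(). A minimal sketch of the same reshape-then-sum idea, assuming tidyr 1.0.0 or later; the names Species, Value and totals are only illustrative:

```r
library(dplyr)
library(tidyr)

# Seeded example data with the same structure as in the question
set.seed(123)
df <- data.frame(
  Habitat = factor(rep(c("Dry", "Wet"), each = 6), levels = c("Dry", "Wet")),
  Month = factor(rep(rep(c("Jan", "Feb", "Mar"), each = 2), 2),
                 levels = c("Jan", "Feb", "Mar")),
  Sample = rep(c(1, 2), 6),
  Species1 = rpois(12, 6),
  Species2 = rpois(12, 6),
  Species3 = rpois(12, 6)
)

# Stack every column except the three id columns, then sum per month
totals <- df %>%
  pivot_longer(cols = -c(Habitat, Month, Sample),
               names_to = "Species", values_to = "Value") %>%
  group_by(Month) %>%
  summarise(tot_by_mon = sum(Value))
totals
```

Because the species columns are selected by exclusion, the same call should work however many species columns follow the first three.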

Here's another solution with data.table without needing to know the names of the "Species" columns:
library(data.table)
DT = melt(setDT(df), id.vars = c("Habitat", "Month", "Sample"))
DT[, .(tot_by_mon=sum(value)), by = "Month"]
or if you want it compact, here's a one-liner:
melt(setDT(df), 1:3)[, .(tot_by_mon=sum(value)), by = "Month"]
Result:
Month tot_by_mon
1: Jan 90
2: Feb 81
3: Mar 70
Data: (Setting seed to make example reproducible)
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

Suppose the species columns all start with Species; you can then select them by the prefix and sum using group_by %>% do:
library(tidyverse)
df %>%
group_by(Month) %>%
do(tot_by_mon = sum(select(., starts_with('Species')))) %>%
unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
If column names don't follow a pattern, you can select by column positions, for instance if Species columns go from 4th to the end of data frame:
df %>%
group_by(Month) %>%
do(tot_by_mon = sum(select(., 4:ncol(.)))) %>%
unnest()
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 63
#2 Feb 67
#3 Mar 58
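The position-based idea can also be sketched in base R without any packages, assuming (as in the question) that every column from the 4th onward holds species counts. The names species_cols and per_sample are illustrative only:

```r
# Seeded example data with the same structure as in the question
set.seed(123)
df <- data.frame(
  Habitat = factor(rep(c("Dry", "Wet"), each = 6), levels = c("Dry", "Wet")),
  Month = factor(rep(rep(c("Jan", "Feb", "Mar"), each = 2), 2),
                 levels = c("Jan", "Feb", "Mar")),
  Sample = rep(c(1, 2), 6),
  Species1 = rpois(12, 6),
  Species2 = rpois(12, 6),
  Species3 = rpois(12, 6)
)

# Assumption: columns 4 to the end are species counts
species_cols <- 4:ncol(df)

# Total individuals per sample (row), then summed within each month
per_sample <- rowSums(df[species_cols])
tot_by_mon <- tapply(per_sample, df$Month, sum)
tot_by_mon
```

Summing across rows first keeps the grouping step independent of the number of species columns.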

Here is another option with data.table without reshaping to 'long' format
library(data.table)
setDT(df)[, .(tot_by_mon = Reduce(`+`, lapply(.SD, sum))), Month,
.SDcols = Species1:Species3]
# Month tot_by_mon
#1: Jan 90
#2: Feb 81
#3: Mar 70
Or, with tidyverse, we can nest the species columns by month and sum each nested data frame with a purrr map function
library(dplyr)
library(purrr)
df %>%
group_by(Month) %>%
nest(starts_with('Species')) %>%
mutate(tot_by_mon = map_int(data, ~sum(unlist(.x)))) %>%
select(-data)
# A tibble: 3 x 2
# Month tot_by_mon
# <fctr> <int>
#1 Jan 90
#2 Feb 81
#3 Mar 70
data
set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2),
levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)
df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

Related

Mean of few months for a monthly data in r

I want to find the average of the months from Nov to March, say Nov 1982 to Mar 1983. For my result, I want one column with the year and another with the mean. If the mean is taken up to Mar 1983, I want the year to be shown as 1983 along with that mean.
This is what my data looks like.
I want my result to look like this.
1983 29.108
1984 26.012
I am not very familiar with R packages. If there is an easy way to do this, I would really appreciate any help. Thank you.
Here is one approach to get average of Nov-March every year.
library(dplyr)
df %>%
#Remove data for month April-October
filter(!between(month, 4, 10)) %>%
#arrange the data by year and month
arrange(year, month) %>%
#Remove 1st 3 months of the first year and
#last 2 months of last year
filter(!(year == min(year) & month %in% 1:3 |
year == max(year) & month %in% 11:12)) %>%
#Create a group column for every November entry
group_by(grp = cumsum(month == 11)) %>%
#Take average for each year
summarise(year = last(year),
value = mean(value)) %>%
select(-grp)
# A tibble: 2 x 2
# year value
# <int> <dbl>
#1 1982 0.308
#2 1983 -0.646
data
It is easier to help if you provide data in a reproducible format which can be copied easily.
set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12),month = 1:12,value = rnorm(36))
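Another way to express the Nov-Mar grouping is to label each row with the year in which its season ends, so that November and December are pushed into the following year. This is only a sketch under the assumption that month is numeric (1 to 12); season_year and season_means are hypothetical names, and incomplete seasons are dropped by requiring all five months:

```r
library(dplyr)

set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12),
                 month = rep(1:12, times = 3),
                 value = rnorm(36))

season_means <- df %>%
  # keep only November to March
  filter(month %in% c(11, 12, 1, 2, 3)) %>%
  # Nov and Dec belong to the season that ends in the next year
  mutate(season_year = year + (month >= 11)) %>%
  group_by(season_year) %>%
  # drop seasons that do not contain all five months
  filter(n() == 5) %>%
  summarise(value = mean(value))
season_means
```

With this labelling, the mean of Nov 1982 to Mar 1983 is reported against 1983, as the question asks.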
With dplyr
# remove the "#" at the beginning of the next line if dplyr or tidyverse is not installed
#install.packages("dplyr")
library(dplyr) #reading the library
colnames(df) <- c("year","month","value") #here I assumed your dataset is named df
df<- df%>%
group_by(year) %>%
summarize(av_value =mean(value))
You can do this as follows using tidyverse
require(tidyverse)
year <- rep(1982:1984, each = 12)
month <- rep(1:12, 3)
value <- runif(length(month))
dat <- data.frame(year, month, value)
head(dat)
dat looks like your data
The trick then is to group_by and summarise
dat %>%
group_by(year) %>%
summarise(value = mean(value))
Which gives you
# A tibble: 3 × 2
year value
<int> <dbl>
1 1982 0.450
2 1983 0.574
3 1984 0.398

Subsetting the lowest date by a factor

I have the following dataset:
id<-c("1a","1a","1a","1a","1a",
"2a","2a","2a","2a","2a",
"3a","3a","3a","3a","3a")
fch<-c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019")
dat<-c(25,35,48,97,112,
65,85,77,89,555,
58,98,25,45,336)
data<-as.data.frame(cbind(id,fch,dat))
My intention is to extract the row corresponding to the earliest date by the factor "id".
So my resulting data frame would look like this:
id<-c("1a","2a","3a")
fch<-c("01/01/2019","01/01/2015","01/01/2013")
dat<-c(48,85,98)
data_result<-as.data.frame(cbind(id,fch,dat))
This was my unsuccessful attempt:
DF1 <- data %>%
mutate(fch = as.Date(as.character(data$fch),format="%d/%m/%Y")) %>%
group_by(id) %>%
mutate(fch = min(fch)) %>%
ungroup
Slightly different method from @akrun. Note that one of the earliest dates in your data has two entries. Without a time there is no way to know which occurred first (or maybe you want both?).
library(tidyverse)
library(lubridate)
data.frame(id = c(rep("1a",5), rep("2a",5), rep("3a",5)),
fch = c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019"),
dat = c(25,35,48,97,112,65,85,77,89,555,58,98,25,45,336)) %>%
group_by(id) %>%
mutate(fch = dmy(fch)) %>%
filter(fch == min(fch)) %>%
ungroup()
# A tibble: 4 x 3
id fch dat
<chr> <date> <dbl>
1 1a 2019-01-01 48
2 1a 2019-01-01 112
3 2a 2015-01-01 85
4 3a 2013-01-01 98
We arrange the data by 'id' and by 'fch' converted to Date class, group by 'id', and use slice_head to get the first row of each group
library(dplyr)
library(lubridate)
data %>%
arrange(id, dmy(fch)) %>%
group_by(id) %>%
slice_head(n = 1) %>%
ungroup
Output:
# A tibble: 3 x 3
# id fch dat
# <chr> <chr> <dbl>
#1 1a 01/01/2019 48
#2 2a 01/01/2015 85
#3 3a 01/01/2013 98
NOTE: cbind returns a matrix by default, and a matrix can hold only a single type, so the numeric 'dat' is coerced to character. Instead, we can directly create the data.frame
data
data <- data.frame(id, fch, dat)
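As a further sketch, dplyr 1.0.0 and later provide slice_min(), which makes the handling of tied earliest dates explicit through its with_ties argument. The object name earliest is illustrative, and the dates are parsed with base as.Date rather than lubridate:

```r
library(dplyr)

data <- data.frame(
  id = rep(c("1a", "2a", "3a"), each = 5),
  fch = c("22/05/2020", "12/01/2020", "01/01/2019", "10/11/2020", "01/01/2019",
          "10/10/2015", "01/01/2015", "20/10/2015", "08/04/2020", "12/12/2019",
          "01/05/2020", "01/01/2013", "10/08/2019", "12/01/2020", "20/10/2019"),
  dat = c(25, 35, 48, 97, 112, 65, 85, 77, 89, 555, 58, 98, 25, 45, 336)
)

earliest <- data %>%
  # parse day/month/year strings into Date values
  mutate(fch = as.Date(fch, format = "%d/%m/%Y")) %>%
  group_by(id) %>%
  # with_ties = FALSE returns exactly one row per id;
  # with_ties = TRUE (the default) would keep both tied rows for id "1a"
  slice_min(fch, n = 1, with_ties = FALSE) %>%
  ungroup()
earliest
```

Whether one tied row or both should be kept depends on what the duplicated date means in the real data.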

Can You Iterate Through Columns AND Unique Variables of Each Column to create a summary in R?

Considering the example dataframe below, is it possible to iterate over each column, and the unique variable in each column to obtain a summary of the unique variables for each column?
sex <- c("M","F","M","M","F","F","F","M","M","F")
school <- c("north","north","central","south","south","south","central","north","north","south")
days_missed <- c(5,1,2,0,7,1,3,2,4,15)
df <- data.frame(sex, school, days_missed, stringsAsFactors = F)
In this example, I want to be able to create a summary of missed days by sex and school
My expected output would be one data frame for sex and one for school, similar to the following:
sex missed_days
M 13
F 27
school missed_days
north 12
central 5
south 23
I tried (without success):
for(i in seq_along(select(df,1:2)) {
output[[i]] <- sum(df$days_missed[[i]] )
}
Is there a way to accomplish what I am looking to do?
In base R you could do:
lapply(1:2,function(x)xtabs(days_missed~.,df[c(x,3)]))
[[1]]
sex
F M
27 13
[[2]]
school
central north south
5 12 23
Using tidyverse:
library(tidyverse)
map(df[-3],~xtabs(days_missed~.x,df))
$sex
.x
F M
27 13
$school
.x
central north south
5 12 23
If you must use summarize, then:
df %>%
summarise_at(vars(-days_missed), ~list(xtabs(days_missed~.x))) %>%
{t(.)[,1]}
$sex
.x
F M
27 13
$school
.x
central north south
5 12 23
In base R, you can use lapply along with tapply to get sum of days_missed by group.
lapply(df[-ncol(df)], function(x) tapply(df$days_missed, x, sum))
Or using tidyverse :
library(dplyr)
cols <- c('sex', 'school')
purrr::map(cols, ~df %>% group_by_at(.x) %>% summarise(sum = sum(days_missed)))
#[[1]]
# A tibble: 2 x 2
# sex sum
# <chr> <dbl>
#1 F 27
#2 M 13
#[[2]]
# A tibble: 3 x 2
# school sum
# <chr> <dbl>
#1 central 5
#2 north 12
#3 south 23
This returns a list of dataframes.
Here is a tidyverse approach
library(tidyverse)
sex <- c("M","F","M","M","F","F","F","M","M","F")
school <- c("north","north","central","south","south","south","central","north","north","south")
days_missed <- c(5,1,2,0,7,1,3,2,4,15)
df <- data.frame(sex, school, days_missed, stringsAsFactors = F)
df %>%
group_by(sex) %>%
summarise(missed_day = sum(days_missed))
df %>%
group_by(school) %>%
summarise(missed_day = sum(days_missed))
If you want to map all other features
simple_operation <- function(x,group) {
x %>%
group_by_at({{group}}) %>%
summarise(missed_day = sum(days_missed))
}
variable_names <-
df %>%
select(-days_missed) %>%
names()
map(.x = variable_names,.f = ~ simple_operation(x = df,group = .))
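If a single long table is acceptable instead of a list, one possible sketch is to stack the grouping columns with pivot_longer() and summarise once. This assumes all grouping columns share a type (character here); variable, level and summary_long are hypothetical names:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  sex = c("M", "F", "M", "M", "F", "F", "F", "M", "M", "F"),
  school = c("north", "north", "central", "south", "south",
             "south", "central", "north", "north", "south"),
  days_missed = c(5, 1, 2, 0, 7, 1, 3, 2, 4, 15),
  stringsAsFactors = FALSE
)

summary_long <- df %>%
  # one row per (grouping variable, level) pair for every observation
  pivot_longer(cols = -days_missed, names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(missed_days = sum(days_missed), .groups = "drop")
summary_long
```

Calling split(summary_long, summary_long$variable) afterwards would give back one data frame per grouping column.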

How to get the difference of a lagged variable by date?

Consider the following example:
library(tidyverse)
library(lubridate)
df = tibble(client_id = rep(1:3, each=24),
date = rep(seq(ymd("2016-01-01"), (ymd("2016-12-01") + years(1)), by='month'), 3),
expenditure = runif(72))
In df you have stored information on monthly expenditure from a bunch of clients for the past 2 years. Now you want to calculate the monthly difference between this year and the previous year for each client.
Is there any way of doing this maintaining the "long" format of the dataset? Here I show you the way I am doing it nowadays, which implies going wide:
df2 = df %>%
mutate(date2 = paste0('val_',
year(date),
formatC(month(date), width=2, flag="0"))) %>%
select(client_id, date2, expenditure) %>%
pivot_wider(names_from = date2,
values_from = expenditure)
df3 = (df2[,2:13] - df2[,14:25])
However, I find this unnecessarily complex, and in large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.
If you want to keep the data in long format, one way is to group by client_id and the month-day part of the date, then calculate the difference between the two years using diff.
library(dplyr)
df %>%
group_by(client_id, month_date = format(date, "%m-%d")) %>%
summarise(diff = -diff(expenditure))
# client_id month_date diff
# <int> <chr> <dbl>
# 1 1 01-01 0.278
# 2 1 02-01 -0.0421
# 3 1 03-01 0.0117
# 4 1 04-01 -0.0440
# 5 1 05-01 0.855
# 6 1 06-01 0.354
# 7 1 07-01 -0.226
# 8 1 08-01 0.506
# 9 1 09-01 0.119
#10 1 10-01 0.00819
# … with 26 more rows
An option with data.table, grouping by the same month-day key
library(data.table)
setDT(df)[, .(diff = -diff(expenditure)), .(client_id, month_date = format(date, "%m-%d"))]
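If every original row should be kept, a grouped lag is another possibility. The sketch below assumes one row per client and month, and builds the dates with base seq() instead of lubridate. Note that it computes the current year minus the previous year, the opposite sign of the -diff() results, and diff_vs_prev_year and yoy are hypothetical names:

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  client_id = rep(1:3, each = 24),
  date = rep(seq(as.Date("2016-01-01"), by = "month", length.out = 24), 3),
  expenditure = runif(72)
)

yoy <- df %>%
  arrange(client_id, date) %>%
  # within each client, rows sharing a calendar month are 12 months apart
  group_by(client_id, month = format(date, "%m")) %>%
  # first-year rows have no previous value, so their difference is NA
  mutate(diff_vs_prev_year = expenditure - dplyr::lag(expenditure)) %>%
  ungroup()
head(yoy)
```

The result stays in long format with the same number of rows as the input.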

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic with the base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
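When each id maps to exactly one species (an assumption to check against the messier real data), the label can also be carried through summarise directly by taking the first value per group. A minimal sketch, with df_summary as an illustrative name:

```r
library(dplyr)

set.seed(123)
df1 <- data.frame(
  id = sort(rep(letters[1:5], 2)),
  species = c("dog", "dog", "cat", "cat", "bird", "bird", "cat", "cat", "bee", "bee"),
  freq = rpois(10, lambda = 12),
  study = rep("UK", 10)
)

df_summary <- df1 %>%
  group_by(id) %>%
  summarise(
    # assumption: species and study are constant within an id
    species = first(species),
    study = first(study),
    meanFreq = mean(freq),
    minFreq = min(freq)
  )
df_summary
```

If an id can have several species, first() would silently hide all but one of them, so grouping by both id and species is the safer choice in that case.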
