R: rate of growth between different time points

I have the following example dataset for a biology project.
I want to compute the rate of growth of Number between 2 January and 4 January,
that is: rate = (number_at_0104 - number_at_0102) / number_at_0102
(in tidyverse, if possible).
a <- c("Date", "Specie", "Number")
b <- c("2020-01-01", "Dog", "3")
c <- c("2020-01-02", "Dog", "4")
d <- c("2020-01-03", "Dog", "5")
e <- c("2020-01-04", "Dog", "6")
f <- c("2020-01-01", "Cat", "3")
g <- c("2020-01-02", "Cat", "7")
h <- c("2020-01-03", "Cat", "8")
i <- c("2020-01-04", "Cat", "10")
df <- as.data.frame(rbind(b, c, d, e, f, g, h, i))
names(df) <- a
df$Date <- as.Date(df$Date)
df$Number <- as.integer(df$Number)
I want to calculate a rate of growth. I know this has been treated already, but I'm not sure whether I can apply it here.
Usually we use the lag() function, but I have some questions.
Can we tell the lag() function which lag to use (e.g. not the previous period but 4 periods before)?
My dataset is much bigger, and for some species (say cats) I want to compute the rate of growth between 20 February and 3 March, while for others (say dogs) I want to compute it between 4 April and 5 May. How can I do that?
Thank you in advance,
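Note on the lag() question: dplyr's lag() does take an n argument giving the number of positions to shift. A minimal sketch on the data above, assuming one row per day per species (so n = 2 reproduces the 2 January to 4 January comparison):
library(dplyr)
df %>%
  arrange(Species, Date) %>%
  group_by(Species) %>%
  mutate(rate = (Number - lag(Number, n = 2)) / lag(Number, n = 2))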

Using dplyr (note that between() is dplyr's own; %between% would require data.table):
start <- as.Date("2020-01-02")
end <- as.Date("2020-01-04")
df %>%
  filter(between(Date, start, end)) %>%
  arrange(Date, Species) %>%
  group_by(Species) %>%
  summarise(Growth = (last(Number) - first(Number)) / first(Number))
Output:
  Species Growth
  <fct>    <dbl>
1 Cat       0.25
2 Dog       0.5
Data:
Note that my test data already has Date as dates and Number as numeric values.
df <- data.frame(
  Date = rep(seq.Date(as.Date("2020-01-01"), as.Date("2020-01-04"), "days"), 2),
  Species = c(rep("Dog", 4), rep("Cat", 4)),
  Number = 3:10
)
If you want different lookups for every species, you can do something like the following. Define your lookups, and the output will be a tibble with Species, Growth and the period it was taken from.
lookups <- list(
  c("Species" = "Dog", "start" = "2020-01-01", "end" = "2020-01-04"),
  c("Species" = "Cat", "start" = "2020-01-02", "end" = "2020-01-04")
)
bind_rows(lapply(lookups, function(species) {
  df %>%
    filter(Species == species["Species"] &
             between(Date, as.Date(species["start"]), as.Date(species["end"]))) %>%
    arrange(Date, Species) %>%
    group_by(Species) %>%
    summarise(
      Growth = (last(Number) - first(Number)) / first(Number),
      Start = species["start"],
      End = species["end"]
    )
}))
# # A tibble: 2 x 4
# Species Growth Start End
# <chr> <dbl> <chr> <chr>
# 1 Dog 1 2020-01-01 2020-01-04
# 2 Cat 0.25 2020-01-02 2020-01-04
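Equivalently, purrr::map_dfr() can replace the bind_rows(lapply(...)) idiom; a sketch with the same lookups as above:
library(purrr)
map_dfr(lookups, function(species) {
  df %>%
    filter(Species == species["Species"] &
             between(Date, as.Date(species["start"]), as.Date(species["end"]))) %>%
    group_by(Species) %>%
    summarise(Growth = (last(Number) - first(Number)) / first(Number),
              Start = species["start"],
              End = species["end"])
})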

Since you want to have different input for different species and different start and end dates, I have created a function to do the job.
Function compute_rate_in_species()
x is a data frame
species is Species in x, which can be supplied as a vector (e.g. c("Dog", "Cat"))
start_date and end_date are self-explanatory
DF is a logical value indicating whether you want a data frame as output. If FALSE, a named vector will be the output.
If you would like different start_date and end_date values for different species, you'll have to run the function separately with your desired input (see the sketch after the outputs below).
library(tidyverse)
compute_rate_in_species <- function(x, species, start_date, end_date, DF = TRUE) {
  x <- x %>%
    filter(Species %in% species & Date %in% as.Date(c(start_date, end_date))) %>%
    group_by(Species) %>%
    summarize(Rate = (last(Number) - first(Number)) / first(Number))
  if (DF) {
    return(x)
  } else {
    x_vec <- setNames(x$Rate, x$Species)
    return(x_vec)
  }
}
Output
# DF = FALSE
compute_rate_in_species(df, c("Dog", "Cat"), "2020-01-01", "2020-01-03", DF = FALSE)
      Cat       Dog
1.6666667 0.6666667
# DF = TRUE
compute_rate_in_species(df, c("Dog", "Cat"), "2020-01-01", "2020-01-03", DF = TRUE)
# A tibble: 2 x 2
  Species  Rate
  <chr>   <dbl>
1 Cat      1.67
2 Dog      0.667
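As noted above, for species-specific date ranges the function can simply be called once per species and the results combined; a minimal sketch:
bind_rows(
  compute_rate_in_species(df, "Dog", "2020-01-01", "2020-01-04"),
  compute_rate_in_species(df, "Cat", "2020-01-02", "2020-01-04")
)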

You can use:
library(dplyr)
start_date <- as.Date("2020-01-02")
end_date <- as.Date("2020-01-04")
df %>%
  group_by(Species) %>%
  summarise(growth_rate = (Number[match(end_date, Date)] -
                             Number[match(start_date, Date)]) /
                            Number[match(start_date, Date)])
#  Species growth_rate
#  <chr>         <dbl>
#1 Cat           0.429
#2 Dog           0.5
You can replace the start_date and end_date with the dates of your choice.
Or a slightly more verbose but clearer version:
df %>%
  group_by(Species) %>%
  summarise(num_end = Number[match(end_date, Date)],
            num_start = Number[match(start_date, Date)],
            growth_rate = (num_end - num_start) / num_start)
#  Species num_end num_start growth_rate
#  <chr>     <int>     <int>       <dbl>
#1 Cat          10         7       0.429
#2 Dog           6         4       0.5
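If each species needs its own pair of dates, one option is to join them on from a small lookup table and reuse the same match() logic per group; a sketch, where the lookup tibble is a hypothetical example:
lookup <- tibble(
  Species = c("Dog", "Cat"),
  start_date = as.Date(c("2020-01-01", "2020-01-02")),
  end_date = as.Date(c("2020-01-04", "2020-01-04"))
)
df %>%
  inner_join(lookup, by = "Species") %>%
  group_by(Species) %>%
  summarise(growth_rate = (Number[match(first(end_date), Date)] -
                             Number[match(first(start_date), Date)]) /
                            Number[match(first(start_date), Date)])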


Count unique values per month in R

I have a dataset with dead bird records from field observers.
  Death.Date Observer Species Bird.ID
1 03/08/2021       DA      MF FC10682
2 15/08/2021       AG      MF FC10698
3 12/01/2022       DA      MF FC20957
4 09/02/2022       DA      MF FC10708
I want to produce a dataset from this with the number of unique Bird.ID per month, so I can produce a graph from it ("unique" because observers sometimes enter the same bird twice by mistake).
The output in this case would be:
Month Number of dead
08/2021 2
01/2022 1
02/2022 1
The idea is to use the distinct function but by month (knowing the value is in date format dd/mm/yyyy).
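That idea works directly: build a month key, keep the distinct month/bird pairs, then count. A sketch, assuming the records above are in a data frame df with Death.Date stored as character:
library(dplyr)
library(lubridate)
df %>%
  mutate(Month = format(dmy(Death.Date), "%m/%Y")) %>%
  distinct(Month, Bird.ID) %>%
  count(Month, name = "Number of dead")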
In case your Death.Date column is of character type, first transform it to date type with dmy().
Change the format to month and year.
Then group_by() and summarise() with n_distinct(), so a bird entered twice is not double-counted.
library(dplyr)
library(lubridate) # in case your Date is in character format
df %>%
  mutate(Death.Date = dmy(Death.Date)) %>% # you may not need this line
  mutate(Month = format(Death.Date, "%m/%Y")) %>%
  group_by(Month) %>%
  summarise(`Number of dead` = n_distinct(Bird.ID))
Month `Number of dead`
<chr> <int>
1 01/2022 1
2 02/2022 1
3 08/2021 2
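Since the goal is a graph, a variant that keeps Month as a real Date column (handy for a continuous time axis) could use lubridate::floor_date(); a sketch:
df %>%
  mutate(Month = floor_date(dmy(Death.Date), "month")) %>%
  group_by(Month) %>%
  summarise(`Number of dead` = n_distinct(Bird.ID))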
For completeness, this can be achieved using aggregate without any additional packages:
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
aggregate(
  x = df["Bird.ID"],
  by = list(death_month = format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y")),
  FUN = function(x) length(unique(x))
)
Notes:
The anonymous function function(x) length(unique(x)) provides the count of unique values.
The format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y") call ensures that the month/year string is produced.
data.table solution
library(data.table)
library(lubridate)
# Reproducible example with a duplicated bird
deadbirds <- data.table::data.table(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
  Observer = c("DA", "AG", "DA", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682")
)
# Clean dataset, option 1: delete all fully duplicated rows
deadbirds <- base::unique(deadbirds)
# Clean dataset, option 2: keep only the first row per bird (useful when
# duplicated records differ in otherwise irrelevant columns)
deadbirds <- deadbirds[
  j = .SD[1],
  by = c("Bird.ID")
]
# Death.Date as date
deadbirds <- deadbirds[
  j = Death.Date := lubridate::dmy(Death.Date)
]
# Create month.Death.Date (zero-padded, e.g. "08/2021")
deadbirds <- deadbirds[
  j = month.Death.Date := format(Death.Date, "%m/%Y")
]
# Count unique birds by month
deadbirds <- deadbirds[
  j = .(`Number of dead` = .N),
  by = month.Death.Date
]
A possible solution, based on tidyverse, lubridate and zoo::as.yearmon:
library(tidyverse)
library(lubridate)
library(zoo)
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
df %>%
  group_by(date = as.yearmon(dmy(Death.Date))) %>%
  summarise(nDead = n_distinct(Bird.ID), .groups = "drop")
#> # A tibble: 3 x 2
#> date nDead
#> <yearmon> <int>
#> 1 Aug 2021 2
#> 2 Jan 2022 1
#> 3 Feb 2022 1
You could use:
as.data.frame(table(format(as.Date(df$Death.Date,'%d/%m/%Y'), '%m/%Y')))
# Var1 Freq
# 1 01/2022 1
# 2 02/2022 1
# 3 08/2021 2
Data:
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)

Cumulative percentage per column of a dataframe

I have a data.frame made up of daily temperatures per julian day for a number of stations.
Minimal reproducible example data.frame:
TemperatureData <- data.frame(
  Julian_Day = 1:365,
  Station_1 = rnorm(365, mean = 10, sd = 2),
  Station_2 = rnorm(365, mean = 10, sd = 2),
  Station_3 = rnorm(365, mean = 10, sd = 2)
)
I would like to determine, for each station, the Julian day on which its cumulative value first exceeds a stated percentage of its total, with an output giving that day for every station.
For example, say Station 1 has a total value of 4000 and after 180 Julian days the cumulative value exceeds a set 50% threshold of that total; the same is repeated for each column of the data.frame (example of preferred output below).
Station_1 Station_2 Station_3
      180       183       179
I assume this would utilise the cumsum function in some capacity, but I am not sure how to implement it. Can anyone help with this?
Let me know if this doesn't make sense.
Base R solution (note that without set.seed() the simulated data, and hence the exact days, differ between runs):
TemperatureData <- data.frame(
  Julian_Day = 1:365,
  Station_1 = rnorm(365, mean = 10, sd = 2),
  Station_2 = rnorm(365, mean = 10, sd = 2),
  Station_3 = rnorm(365, mean = 10, sd = 2)
)
TemperatureData$Station_1 <- cumsum(TemperatureData$Station_1) / sum(TemperatureData$Station_1)
TemperatureData$Station_2 <- cumsum(TemperatureData$Station_2) / sum(TemperatureData$Station_2)
TemperatureData$Station_3 <- cumsum(TemperatureData$Station_3) / sum(TemperatureData$Station_3)
results <- c(
  "Station 1" = TemperatureData$Julian_Day[TemperatureData$Station_1 >= .5][1],
  "Station 2" = TemperatureData$Julian_Day[TemperatureData$Station_2 >= .5][1],
  "Station 3" = TemperatureData$Julian_Day[TemperatureData$Station_3 >= .5][1]
)
results
#> Station 1 Station 2 Station 3
#> 180 185 183
tidyverse solution:
library(dplyr)
TemperatureData %>%
  summarize(across(matches("Station"),
                   function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]))
data.table solution:
library(data.table)
setDT(TemperatureData)
TemperatureData[, lapply(.SD, function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]),
                .SDcols = patterns("Station")]
Here is a tidyverse way to do it. I figure there is a simpler method, and if I figure it out, I will post it.
library(dplyr)
library(tidyr)
TemperatureData %>%
  pivot_longer(cols = -Julian_Day, names_to = "Station") %>%
  group_by(Station) %>%
  arrange(Station, Julian_Day) %>%
  mutate(cumpct = cumsum(value) / sum(value)) %>%
  filter(cumpct >= 0.5) %>%
  slice(1) %>%
  ungroup() %>%
  select(Station, Julian_Day) %>%
  pivot_wider(names_from = Station, values_from = Julian_Day)
# A tibble: 1 x 3
  Station_1 Station_2 Station_3
      <int>     <int>     <int>
1       184       181       181

Error in group_by stating it's a character when looping through summarise in R

Still getting to grips with R. As a newbie I have been doing most of my coding manually, i.e. copying and pasting the same block of code 20 times; here I was trying to learn about looping and summarising. When I do the summarise with one individual dataset it works fine, but when I loop it, it states it's a character, so I added in as.numeric, to no avail. Any advice would be helpful.
Error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "character"
#educationYears
fiv0_education <- subset(sf_education, Year == '2005')
six0_education <- subset(sf_education, Year == '2006')
sev0_education <- subset(sf_education, Year == '2007')
eig0_education <- subset(sf_education, Year == '2008')
nin0_education <- subset(sf_education, Year == '2009')
ten_education <- subset(sf_education, Year == '2010')
one_education <- subset(sf_education, Year == '2011')
two_education <- subset(sf_education, Year == '2012')
thr_education <- subset(sf_education, Year == '2013')
for_education <- subset(sf_education, Year == '2014')
fiv_education <- subset(sf_education, Year == '2015')
six_education <- subset(sf_education, Year == '2016')
sev_education <- subset(sf_education, Year == '2017')
eig_education <- subset(sf_education, Year == '2018')
nin_education <- subset(sf_education, Year == '2019')
names <- c('fiv0', 'six0', 'sev0', 'eig0', 'nin0', 'ten', 'one', 'two', 'thr', 'for', 'fiv', 'six', 'sev', 'eig', 'nin')
test <- vector("list", length(names))
for (i in 1:length(names)) {
  test[i] <- paste(names[i], '_education', sep = "", collapse = NULL) %>%
    group_by(as.numeric(as.character(Kod))) %>%
    summarise(Count = sum(as.numeric(as.character(Count))))
}
The error occurs because paste() returns a character string, not the data frame whose name it builds; you need get() to fetch the object by that name. Here is a solution using the built-in data set iris as an example. I believe it's easy to adapt to the problem in the question.
1. A solution with a for loop, like in the question.
library(dplyr)
names <- c('fiv0', 'six0', 'sev0')
test <- vector("list", length(names))
for (i in 1:length(names)) {
  tmp <- paste0(names[i], '_education')
  test[[i]] <- get(tmp, envir = .GlobalEnv) %>%
    mutate(Count = as.numeric(as.character(Count))) %>%
    group_by(Kod) %>%
    summarise(Count = sum(Count))
}
test
#[[1]]
## A tibble: 4 x 2
# Kod Count
# <int> <dbl>
#1 1 1.6
#2 2 3.7
#3 3 2.4
#4 4 4.6
#
#[[2]]
## A tibble: 4 x 2
# Kod Count
# <int> <dbl>
#1 1 24.5
#2 2 27.2
#3 3 19.1
#4 4 30.5
#
#[[3]]
## A tibble: 4 x 2
# Kod Count
# <int> <dbl>
#1 1 15.9
#2 2 18.9
#3 3 15.5
#4 4 16
2. Here is another way, with purrr::map.
This code uses the data set already split into several pieces with subset; since mget() returns a named list, the results are named accordingly.
paste0(names, '_education') %>%
  mget(envir = .GlobalEnv) %>%
  purrr::map(
    function(X) {
      X %>%
        mutate(Count = as.numeric(as.character(Count))) %>%
        group_by(Kod) %>%
        summarise(Count = sum(Count))
    }
  )
3. Another purrr::map way, but this time starting from the original data set, with no need to subset multiple times first.
Note that the splitting column here is Species, in the question it's Year.
df1 %>%
  group_split(Species) %>%
  purrr::map(
    function(X) {
      X %>%
        mutate(Count = as.numeric(as.character(Count))) %>%
        group_by(Kod) %>%
        summarise(Count = sum(Count))
    }
  )
Data creation code.
set.seed(1234)
df1 <- iris[4:5]
names(df1)[1] <- "Count"
df1$Kod <- sample(4, 150, TRUE)
fiv0_education <- subset(df1, Species == 'setosa')
six0_education <- subset(df1, Species == 'virginica')
sev0_education <- subset(df1, Species == 'versicolor')

R - how to speed up a for loop with vectorised operations: practical problem

I've tried to create a minimal example, sorry.
Is there a way to speed up the process?
My procedures table has 4M rows. It has been processing for 15 hours and has populated only 1.5M rows.
Maybe using mutate? I don't know.
library(tidyverse)
library(lubridate)
frequencies <- tibble(
  id = 1:3,
  date_hour_initial = c(
    dmy_hms('01/01/2020 13:00:00'),
    dmy_hms('01/01/2020 15:00:00'),
    dmy_hms('02/01/2020 20:00:00')
  ),
  date_hour_final = c(
    dmy_hms('01/01/2020 18:00:00'),
    dmy_hms('01/01/2020 22:00:00'),
    dmy_hms('03/01/2020 05:00:00')
  ),
  id_person = c("1", "2", "2"),
  type_service = c("1", "3", "4")
) %>%
  mutate(
    intervalo = interval(
      date_hour_initial,
      date_hour_final
    )
  )
procedures <- tibble(
  id = 1:3,
  date_hour = c(
    dmy_hms('01/01/2020 17:00:00'),
    dmy_hms('01/01/2020 22:00:00'),
    dmy_hms('03/01/2020 03:00:00')
  ),
  id_person = c("1", "1", "2")
)
procedures$type_service <- vector(
  "character",
  nrow(procedures)
)
for (i in 1:nrow(procedures)) {
  response <- frequencies %>%
    filter(
      procedures$date_hour[i] %within% intervalo,
      id_person == procedures$id_person[i]
    ) %>%
    pull(type_service) %>%
    unique()
  if (length(response) == 1) {
    procedures$type_service[i] <- response
  } else {
    procedures$type_service[i] <- NA_character_
  }
}
Here's a dplyr solution without using loops:
library(tidyverse)
left_join(frequencies, procedures, by = "id_person") %>%
  mutate(type_service = ifelse(date_hour %within% intervalo, type_service.x, NA)) %>%
  select(id = id.y, date_hour, id_person, type_service) %>%
  group_by(id) %>%
  arrange(type_service) %>%
  filter(!duplicated(id)) %>%
  ungroup() %>%
  arrange(id)
#> # A tibble: 3 x 4
#> id date_hour id_person type_service
#> <int> <dttm> <chr> <chr>
#> 1 1 2020-01-01 17:00:00 1 1
#> 2 2 2020-01-01 22:00:00 1 NA
#> 3 3 2020-01-03 03:00:00 2 4
Here is an option using non-equi join in data.table:
# Non-equi join: for each procedures row, find the frequencies rows for the same
# person whose interval contains date_hour; by = .EACHI evaluates j once per
# procedures row, keeping the service only when the match is unambiguous.
procedures[, type_service :=
  frequencies[procedures,
              on = .(id_person, date_hour_initial <= date_hour, date_hour_final >= date_hour),
              by = .EACHI,
              if (length(x.type_service) == 1L) x.type_service]$V1
]
output:
id date_hour id_person type_service
1: 1 2020-01-01 17:00:00 1 1
2: 2 2020-01-01 22:00:00 1 <NA>
3: 3 2020-01-03 03:00:00 2 4
data:
library(data.table)
frequencies <- data.table(
  id = 1:3,
  date_hour_initial = as.POSIXct(c('01/01/2020 13:00:00', '01/01/2020 15:00:00', '02/01/2020 20:00:00'), format = "%d/%m/%Y %T"),
  date_hour_final = as.POSIXct(c('01/01/2020 18:00:00', '01/01/2020 22:00:00', '03/01/2020 05:00:00'), format = "%d/%m/%Y %T"),
  id_person = c("1", "2", "2"),
  type_service = c("1", "3", "4")
)
procedures <- data.table(
  id = 1:3,
  date_hour = as.POSIXct(c('01/01/2020 17:00:00', '01/01/2020 22:00:00', '03/01/2020 03:00:00'), format = "%d/%m/%Y %T"),
  id_person = c("1", "1", "2")
)
My guess is that this will take around a minute for 4 million rows.
Here is a solution using the fuzzyjoin package. The first step is to split the frequencies and procedures data frames by person id, dividing the large problem into many smaller ones. I did not add any error checking to ensure there is a corresponding match between the person ids of the two data frames.
Once the data frames are split, loop through each person id and use fuzzy_left_join() to match date_hour in procedures against intervalo in frequencies.
library(lubridate)
library(dplyr)
# divide and conquer:
# split the data frames into lists by person id
sfreq <- split(frequencies, frequencies$id_person)
sprocedures <- split(procedures, procedures$id_person)
library(fuzzyjoin)
# define the matching function
matfun <- function(x, y) {
  x %within% y
}
# define empty answer list
answer <- list()
# loop through all of the split groups
for (id in names(sfreq)) {
  print(id)
  # perform a fuzzy join between date_hour in procedures and the interval in frequencies
  answer[[id]] <- fuzzy_left_join(sprocedures[[id]], sfreq[[id]],
                                  by = c("date_hour" = "intervalo"),
                                  match_fun = matfun)
}
# combine all of the subsets into the final answer
finalanswer <- bind_rows(answer)

Count of cases per group, based on a min value of another variable, also per same groups

I have the following type of dataframe:
Person General_Type Specific_Type Age
A X XY 2
A Y YZ 3
A Y YY 3
B X XY 5
B Y YZ 6
B X XX 8
What I am trying to do:
For each person separately, I want to calculate the age where he first produces a Specific_Type YZ.
Then, I want to count all the cases where he produces a General_Type X, up to the age which I calculated earlier.
What I have so far:
The original data is in the dataframe 'data_file'.
I've managed to get the age using:
Person <- c('A', 'B')
df <- data.frame(Person)
library(dplyr)
Initial_Age <- data_file %>%
  group_by_(.dots = c("Person", "Specific_Type")) %>%
  filter(all(Specific_Type == "YZ")) %>%
  summarize(Age_Calc = min(Age))
df$Initial_Age <- Initial_Age$Age_Calc
I can get the total count per Person for each General_Type using the following:
total_count <- data_file %>%
  group_by(Person, General_Type) %>%
  filter(all(General_Type == "X")) %>%
  summarize(count = n())
But I'm not sure how to use the results from the former in the latter.
Specifically, what I am asking is how to calculate the count of General_Type X for each person, up to the age where he first produces a Specific_Type YZ.
I am currently using dplyr, but am open to other solutions if they're preferable.
This should do the job:
data_file %>%
  group_by(Person) %>%
  filter(Age <= first(Age[Specific_Type == "YZ"])) %>%
  summarise(count = sum(General_Type == "X"))
# # A tibble: 2 x 2
# Person count
# <chr> <int>
# 1 A 1
# 2 B 1
NB: If your data is not already sorted, use min instead of first.
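The unsorted variant would then look like this (a sketch):
data_file %>%
  group_by(Person) %>%
  filter(Age <= min(Age[Specific_Type == "YZ"])) %>%
  summarise(count = sum(General_Type == "X"))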
Data:
data_file <- read.table(text = "
Person General_Type Specific_Type Age
A X XY 2
A Y YZ 3
A Y YY 3
B X XY 5
B Y YZ 6
B X XX 8
", header = TRUE, stringsAsFactors = FALSE)
