Calculating changes from "yesterday's" value in R?

I have data on bird individuals and their feeding locations. The feeding locations move, and so I want to create a variable that calculates the distance from yesterday's feeding location to "today's" feeding options.
Here is a reprex that exemplifies what I'm talking about. The 'bird' column holds the bird individual IDs, and feedLoc holds the possible feeding locations for each day. Then there is the date of that observation. h (horizontal) and v (vertical) are the coordinates of the feeding locations on a grid. Finally, bp indicates whether that individual was identified at the feeding location or not.
reprex <- tibble(bird = c("A", "A", "A", "B", "B", "B", "C", "C"),
feedLoc = c("x","y", "x", "x", "y", "x", "y", "z"),
date = as.Date(c("2020-05-10", "2020-05-11", "2020-05-11",
"2020-05-24", "2020-05-25", "2020-05-25",
"2020-05-22", "2020-05-23")),
h = c(100, 123, 45, 75, 89, 64, 99, 101),
v = c(89, 23, 65, 92, 29, 90, 120, 34),
bp = c(1, 1, 0, 1, 0, 1, 1, 0))
Which produces this:
# A tibble: 8 × 6
bird feedLoc date h v bp
<chr> <chr> <date> <dbl> <dbl> <dbl>
1 A x 2020-05-10 100 89 1
2 A y 2020-05-11 123 23 1
3 A x 2020-05-11 45 65 0
4 B x 2020-05-24 75 92 1
5 B y 2020-05-25 89 29 0
6 B x 2020-05-25 64 90 1
7 C y 2020-05-22 99 120 1
8 C z 2020-05-23 101 34 0
My question is, I want to make a new variable that calculates the distance from yesterday's feeding choice (so, the rows where bp == 1 AND date == date - 1), to the current feeding location options for each bird individual using the coordinate data. How would I do this? Thanks!
I initially tried to group by bird and feedLoc IDs, arrange by date, and then lag the h and v variables so that I could use the distance formula to calculate the distance from yesterday's ant swarm choice. However, the issue is that in the data set the previous row, once arranged, is not always exactly "yesterday".

Create a dataframe filtered to bp == 1, add 1 to the date to match rows to the next day, then left_join() to your original data to compute distances:
library(dplyr)
yesterday <- reprex %>%
filter(bp == 1) %>%
transmute(bird, date = date + 1, h.yest = h, v.yest = v)
reprex %>%
left_join(yesterday, by = c("bird", "date")) %>%
mutate(
dist = sqrt((h - h.yest)^2 + (v - v.yest)^2)
) %>%
select(!h.yest:v.yest)
# A tibble: 8 × 7
bird feedLoc date h v bp dist
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A x 2020-05-10 100 89 1 NA
2 A y 2020-05-11 123 23 1 69.9
3 A x 2020-05-11 45 65 0 60.0
4 B x 2020-05-24 75 92 1 NA
5 B y 2020-05-25 89 29 0 64.5
6 B x 2020-05-25 64 90 1 11.2
7 C y 2020-05-22 99 120 1 NA
8 C z 2020-05-23 101 34 0 86.0
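For readers who want to see the same "shift the date, then join" idea without dplyr, here is a minimal base R sketch. It rebuilds the question's data as a plain data frame; the names chosen, h_yest and v_yest are illustrative and not from the original posts. Note that merge() sorts by the key columns, so the row order may differ from the input.

```r
# Base R sketch of the "shift the date, then join" idea (no packages needed).
reprex <- data.frame(
  bird    = c("A", "A", "A", "B", "B", "B", "C", "C"),
  feedLoc = c("x", "y", "x", "x", "y", "x", "y", "z"),
  date    = as.Date(c("2020-05-10", "2020-05-11", "2020-05-11",
                      "2020-05-24", "2020-05-25", "2020-05-25",
                      "2020-05-22", "2020-05-23")),
  h  = c(100, 123, 45, 75, 89, 64, 99, 101),
  v  = c(89, 23, 65, 92, 29, 90, 120, 34),
  bp = c(1, 1, 0, 1, 0, 1, 1, 0)
)

# Rows where the bird was actually seen, re-dated to the following day.
chosen <- reprex[reprex$bp == 1, ]
yesterday <- data.frame(bird = chosen$bird, date = chosen$date + 1,
                        h_yest = chosen$h, v_yest = chosen$v)

# all.x = TRUE keeps rows that have no "yesterday" match; their distance is NA.
joined <- merge(reprex, yesterday, by = c("bird", "date"), all.x = TRUE)
joined$dist <- sqrt((joined$h - joined$h_yest)^2 + (joined$v - joined$v_yest)^2)
joined[, c("bird", "feedLoc", "date", "h", "v", "bp", "dist")]
```

Because the match is on the exact date, a bird with a gap of more than one day between observations simply gets NA rather than a distance to an older location.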

Try something like this dplyr approach, which first restricts the manipulation to just bp == 1, then checks whether the feeding location is different and the previous date is exactly one day behind (lag(date) == date - 1), and then calculates the difference for h and v. After all that, it adds the bp == 0 data back in and rearranges (this approach saves a more convoluted case_when statement). If this isn't exactly what you need, post an example of the desired output and I will edit. Good luck!
library(dplyr)
reprex %>%
group_by(bird) %>%
filter(bp == 1) %>%
arrange(date) %>%
mutate(h_change = case_when(
feedLoc != lag(feedLoc) & lag(date) == date - 1 ~ h - lag(h)),
v_change = case_when(
feedLoc != lag(feedLoc) & lag(date) == date - 1 ~ v - lag(v)
)) %>%
right_join(reprex) %>% arrange(bird, date)
Output:
# bird feedLoc date h v bp h_change v_change
# <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A x 2020-05-10 100 89 1 NA NA
# 2 A y 2020-05-11 123 23 1 23 -66
# 3 A x 2020-05-11 45 65 0 NA NA
# 4 B x 2020-05-24 75 92 1 NA NA
# 5 B x 2020-05-25 64 90 1 NA NA
# 6 B y 2020-05-25 89 29 0 NA NA
# 7 C y 2020-05-22 99 120 1 NA NA
# 8 C z 2020-05-23 101 34 0 NA NA

Related

How to apply a function to each group of IDs based on conditions with dplyr

I am trying to apply a custom function over a group of IDs only when some conditions are met for each column of that group (data should be only numeric and their sum non-zero). Here is the reproducible example:
dat <- as.Date("2021/08/04")
len <- 5
seq(dat, by = "day", length.out = len)
input <- data.frame(
date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
var1 = c(2,3,4,6,7,8,9,3,5,6),
var2 = c(0, 0, 0, 0, 0, 1, 2, 3 ,4, 5),
var3 = c("hi", "hi", "hi", "hi", "hi", 1, 2, 3 ,4, 5)
)
Here is my custom Rescale function:
rescale = function(x,max_range=100){
return(((x-min(x))/(max(x)-min(x)))*max_range)
}
And this is the desired output:
output <- data.frame(
date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
var1 = c(0, 20, 40, 80, 100, 83.3, 100, 0, 33.3, 50),
var2 = c(0, 0, 0, 0, 0, 0, 25, 50 ,75, 100),
var3 = c("hi", "hi", "hi", "hi", "hi", 0, 25, 50 ,75, 100)
)
I am using the following lines to solve this, using dplyr:
out = input %>%
dplyr::group_by(id) %>%
dplyr::mutate_if(~is.numeric(.) && sum(.x) != 0 ,rescale) %>%
dplyr::arrange(date, .by_group = TRUE) %>%
dplyr::ungroup()
The problem with these lines is that conditions do not refer to the columns of each group exclusively, but to the whole column of the input table. Hence, the function is applied to the whole table-column although the conditions are met only for one ID. In this example function was applied for aa~var2 (which is not desired) and wasn't applied for bb~var3 (which is desired).
Could you please help me correct the code of out ? Thank you.
This gives you the desired output, but it is much harder than it needs to be because you want to preserve the "hi" values. A numeric column cannot have text in it, so you have to convert to numeric, handle NA values, rescale the non-NA values, then convert to character and rewrite the "hi" values in place. Furthermore, you need to go back at the end and re-convert the columns without "hi" in them back to numeric. You could avoid all this by having NA values instead of "hi", but anyway, if you really have to preserve the "hi" values, you can do:
library(dplyr)
input %>%
group_by(id) %>%
mutate(across(contains("var"), function(x) {
x_n <- suppressWarnings(as.numeric(x))
if(all(is.na(x_n))) return(x)
if(all(x_n == 0)) return(as.character(x_n))
x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
if(any(is.na(x_n))) {
x_n <- as.character(x_n)
x_n[is.na(x_n)] <- x[is.na(x_n)]
}
as.character(x_n)
})) %>%
ungroup() %>%
mutate(across(contains("var"), function(x) {
if(any(is.na(suppressWarnings(as.numeric(x))))) x else as.numeric(x)
}))
#> # A tibble: 10 x 5
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <chr>
#> 1 2021-08-04 aa 0 0 hi
#> 2 2021-08-05 aa 20 0 hi
#> 3 2021-08-06 aa 40 0 hi
#> 4 2021-08-07 aa 80 0 hi
#> 5 2021-08-08 aa 100 0 hi
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
If you are prepared to have NA values instead of "hi" (and thereby have numeric columns that you can actually perform calculations on), you can simplify to
input %>%
group_by(id) %>%
mutate(across(contains("var"), function(x) {
x_n <- suppressWarnings(as.numeric(x))
if(all(is.na(x_n))) return(NA)
if(all(x_n == 0)) return(x_n)
x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
x_n
}))
#> # A tibble: 10 x 5
#> # Groups: id [2]
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2021-08-04 aa 0 0 NA
#> 2 2021-08-05 aa 20 0 NA
#> 3 2021-08-06 aa 40 0 NA
#> 4 2021-08-07 aa 80 0 NA
#> 5 2021-08-08 aa 100 0 NA
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
Edit
Removing the complication of having a non-numeric column altogether by having var3 = c(1, 2, 3 ,4, 5,1, 2, 3 ,4, 5) as suggested in the comments by the OP makes this far easier:
input %>%
group_by(id) %>%
mutate(across(contains("var"), ~ if(all(.x == 0)) .x else rescale(.x)))
#> # A tibble: 10 x 5
#> # Groups: id [2]
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2021-08-04 aa 0 0 0
#> 2 2021-08-05 aa 20 0 25
#> 3 2021-08-06 aa 40 0 50
#> 4 2021-08-07 aa 80 0 75
#> 5 2021-08-08 aa 100 0 100
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
rescale = function(x,max_range=100){
if(min(x) == max(x)) return(x)
return(((x-min(x))/(max(x)-min(x)))*max_range)
}
input %>%
group_by(id)%>%
mutate(across(where(is.numeric), rescale))
# A tibble: 10 × 5
# Groups: id [2]
date id var1 var2 var3
<date> <chr> <dbl> <dbl> <chr>
1 2021-08-04 aa 0 0 hi
2 2021-08-05 aa 20 0 hi
3 2021-08-06 aa 40 0 hi
4 2021-08-07 aa 80 0 hi
5 2021-08-08 aa 100 0 hi
6 2021-08-04 bb 83.3 0 1
7 2021-08-05 bb 100 25 2
8 2021-08-06 bb 0 50 3
9 2021-08-07 bb 33.3 75 4
10 2021-08-08 bb 50 100 5
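The per-group behaviour of the guarded rescale() can also be checked in isolation. The sketch below is an assumption-laden illustration, not part of the original answer: it copies var1 and var2 from the question into plain vectors and uses base R's ave() to apply the function within each id.

```r
# Guarded rescale: a constant group (min == max) is returned unchanged,
# which avoids the 0/0 division that would otherwise produce NaN.
rescale <- function(x, max_range = 100) {
  if (min(x) == max(x)) return(x)
  (x - min(x)) / (max(x) - min(x)) * max_range
}

id   <- rep(c("aa", "bb"), each = 5)
var1 <- c(2, 3, 4, 6, 7, 8, 9, 3, 5, 6)
var2 <- c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5)

# ave() splits the vector by id, applies FUN to each piece,
# and returns the results in the original row order.
res1 <- ave(var1, id, FUN = rescale)
res2 <- ave(var2, id, FUN = rescale)
round(res1, 1)
res2
```

Here the all-zero "aa" block of var2 stays at zero, while the "bb" block is stretched to the 0 to 100 range.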

How to calculate difference in values grouped by 2 separate variables in R

Let's say we have a team variable, but we also have a time period 1 and a time period 2 variable, and a numeric grade 1-10. I want to mutate and add a variable that calculates the difference from time period 1 to time period 2.
How do I do this?
Visually the table looks like this:
[image of the table with Team, Time Period and Grade columns; not preserved]
There is a neat function in the data.table package called dcast( ) that allows you to transform your data from long to wide. In this case, you can use the Period variable to create 2 new columns, Period 1 and Period 2, where the values are the Grades.
library(data.table)
> data <- data.table(
+ Team = c("Team 1","Team 1","Team 2","Team 2","Team 3","Team 3"),
+ Period = c("Period 1","Period 2","Period 1","Period 2","Period 1","Period 2"),
+ Grade = c(75,87,42,35,10,95))
> data
Team Period Grade
1: Team 1 Period 1 75
2: Team 1 Period 2 87
3: Team 2 Period 1 42
4: Team 2 Period 2 35
5: Team 3 Period 1 10
6: Team 3 Period 2 95
> data2 <- dcast(
+ data = data,
+ Team ~ Period,
+ value.var = "Grade")
> data2
Team Period 1 Period 2
1: Team 1 75 87
2: Team 2 42 35
3: Team 3 10 95
> data2 <- data2[,Difference := `Period 2` - `Period 1`]
> data2
Team Period 1 Period 2 Difference
1: Team 1 75 87 12
2: Team 2 42 35 -7
3: Team 3 10 95 85
In tidyverse syntax, we would use pivot_wider and mutate:
library(tidyverse)
df %>%
pivot_wider(names_from = `Time Period`, values_from = Grade) %>%
mutate(difference = P2 - P1)
#> # A tibble: 3 x 4
#> Team P1 P2 difference
#> <chr> <dbl> <dbl> <dbl>
#> 1 Team 1 75 87 12
#> 2 Team 2 42 35 -7
#> 3 Team 3 10 95 85
Created on 2022-08-29 with reprex v2.0.2
Data used
df <- data.frame(Team = paste("Team", rep(1:3, each = 2)),
`Time Period` = rep(c("P1", "P2"), 3),
Grade = c(75, 87, 42, 35, 10, 95),
check.names = FALSE)
df
#> Team Time Period Grade
#> 1 Team 1 P1 75
#> 2 Team 1 P2 87
#> 3 Team 2 P1 42
#> 4 Team 2 P2 35
#> 5 Team 3 P1 10
#> 6 Team 3 P2 95
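If neither data.table nor tidyr is available, the same long-to-wide step can be approximated in base R by splitting the rows by period and merging them on Team. This is only a sketch: the column is named Period here (not `Time Period`) to avoid backticks, and the _P1/_P2 suffixes are an illustrative choice.

```r
df <- data.frame(Team   = paste("Team", rep(1:3, each = 2)),
                 Period = rep(c("P1", "P2"), 3),
                 Grade  = c(75, 87, 42, 35, 10, 95))

# One data frame per period, then a self-join on Team.
p1 <- df[df$Period == "P1", ]
p2 <- df[df$Period == "P2", ]
wide <- merge(p1, p2, by = "Team", suffixes = c("_P1", "_P2"))

# Difference from period 1 to period 2 for each team.
wide$difference <- wide$Grade_P2 - wide$Grade_P1
wide[, c("Team", "Grade_P1", "Grade_P2", "difference")]
```

A team that is missing one of the two periods would drop out of this inner join; passing all = TRUE to merge() would keep it with an NA difference instead.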

group_by and case_when() function for multiple conditions

I'm struggling with a problem in R. I want to create a new variable (qc), grouping by the variables NAME and PLOT and using case_when(): where "EH" > "PH", give me B, else give me Q, and so on.
I have a data set like this:
df <- tibble(
NAMEOFEXPERIMENT= c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
PLOT= c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2),
trait= c("EH","NP","NP","PH","PH","PL","PL","EH","EH","NP","NP","PH","PH","PL","PL"),
traitValue= c(125,36,36,240,"NA",36,36,90,110,35,33,215,190,36,31)
)
# A tibble: 15 x 4
NAME PLOT trait traitValue
<chr> <dbl> <chr> <chr>
1 A 2 EH 250
2 A 1 NP 36
3 A 2 NP 36
4 A 1 PH 240
5 A 2 PH 200
6 A 1 PL 36
7 A 2 PL 36
8 B 1 EH 90
9 B 2 EH 110
10 B 1 NP 35
11 B 2 NP 33
12 B 1 PH 215
13 B 2 PH 190
14 B 1 PL 36
15 B 2 PL 31
This is what I want to achieve: if "EH" > "PH" then give me B, else give me Q.
If "PL" > "NP" then give me B, else give me Q.
Thus, qc in row 4 should be empty, since there is no NAME "A", PLOT 1, trait "EH" value to compare with.
# A tibble: 15 x 4
NAME PLOT trait traitValue dc
<chr> <dbl> <chr> <chr> <chr>
1 A 2 EH 250 B
2 A 1 NP 36 Q
3 A 2 NP 36 Q
4 A 1 PH 240
5 A 2 PH 200 B
6 A 1 PL 36 Q
7 A 2 PL 36 Q
8 B 1 EH 90 Q
9 B 2 EH 110 Q
10 B 1 NP 35 B
11 B 2 NP 33 Q
12 B 1 PH 215 Q
13 B 2 PH 190 Q
14 B 1 PL 36 B
15 B 2 PL 31 Q
When I run this code
dt2 <- df %>%
group_by(NAME, PLOT) %>%
mutate(data_qc = case_when(
traitValue[trait == "EH"] > traitValue[trait == "PH"] ~ "B",
traitValue[trait == "EH"] < traitValue[trait == "PH"] ~ "Q",
traitValue[trait == "PL"] > traitValue[trait == "NP"] ~ "B",
traitValue[trait == "PL"] < traitValue[trait == "NP"] ~ "Q"
))
I got this Error
Error in `mutate()`:
! Problem while computing `data_qc = case_when(...)`.
i The error occurred in group 1: NAME = "A", PLOT = 1.
Caused by error in`case_when()`:
! `traitValue[trait == "EH"] > traitValue[trait == "PH"] ~ "B"`, traitValue[trait == "EH"] < traitValue[trait == "PH"] ~ "Q"`
must be length 3 or one, not 0.
I don't fully understand your constraints. You did not specify what should happen if "PH" > "EH" and "PL" > "NP" at the same time. In that case, will the final outcome be "B" or "Q"?
However, to get you started I wrote the following code:
## Loading the required libraries
library(dplyr)
library(tidyverse)
## Creating the dataframe
df <- data.frame(
NAMEOFEXPERIMENT= c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
PLOT= c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2),
trait= c("EH","NP","NP","PH","PH","PL","PL","EH","EH","NP","NP","PH","PH","PL","PL"),
traitValue= c(125,36,36,240,200,36,36,90,110,35,33,215,190,36,31)
)
## Removing duplicates (assign the result, otherwise df is left unchanged)
df <- unique(df)
## Pivot longer to wider
df %>%
pivot_wider(names_from = trait, values_from = traitValue) %>%
arrange(NAMEOFEXPERIMENT,PLOT) %>%
mutate(ConditionalValue1 = ifelse(EH>PH,"B", "Q"),
ConditionalValue2 = ifelse(PL>NP,"B", "Q"))
Output
# A tibble: 4 x 8
NAMEOFEXPERIMENT PLOT EH NP PH PL ConditionalValue1 ConditionalValue2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 A 1 NA 36 240 36 NA Q
2 A 2 125 36 200 36 Q Q
3 B 1 90 35 215 36 Q B
4 B 2 110 33 190 31 Q Q
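The "must be length 3 or one, not 0" error in the question appears to come from groups where a trait is absent, so that traitValue[trait == "EH"] has length zero. One possible way to stay in long format is a small helper that returns NA in that case. The sketch below assumes numeric traitValue (as in this answer's data frame); trait_val, eh_gt_ph and pl_gt_np are hypothetical names, not from the original posts.

```r
library(dplyr)

df <- data.frame(
  NAMEOFEXPERIMENT = c(rep("A", 7), rep("B", 8)),
  PLOT  = c(2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  trait = c("EH", "NP", "NP", "PH", "PH", "PL", "PL",
            "EH", "EH", "NP", "NP", "PH", "PH", "PL", "PL"),
  traitValue = c(125, 36, 36, 240, 200, 36, 36, 90, 110, 35, 33, 215, 190, 36, 31)
)

# Hypothetical helper: the value of one trait inside the current group,
# or NA when the group has no row for that trait.
trait_val <- function(value, trait, key) {
  out <- value[trait == key]
  if (length(out) == 0) NA_real_ else out[1]
}

qc <- df %>%
  group_by(NAMEOFEXPERIMENT, PLOT) %>%
  mutate(
    # Each comparison is a single value per group; mutate() recycles it.
    eh_gt_ph = if_else(trait_val(traitValue, trait, "EH") >
                         trait_val(traitValue, trait, "PH"), "B", "Q"),
    pl_gt_np = if_else(trait_val(traitValue, trait, "PL") >
                         trait_val(traitValue, trait, "NP"), "B", "Q")
  ) %>%
  ungroup()
qc
```

Groups without an "EH" row (experiment A, plot 1) get NA for eh_gt_ph instead of raising an error; how the two flags should be combined into one column is left open, as in the answer above.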

Filling in non-existing rows in R + dplyr [duplicate]

Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests that it is integer-like data, but your use of factor suggests ordinal or categorical. Because of that, you need to define authoritatively what the complete set of factor levels can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete won't work because you don't have all possible values in the data (47 missing).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
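As a variation on the join above: tidyr::complete() cannot invent week 47 from the data alone, but it does accept an explicit set of values for a column. The sketch below assumes the week labels are really numbers, so the factor is converted first and the gap is inferred with tidyr::full_seq(); treat it as one possible approach rather than the answer's method.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  FiscalWeek = as.factor(c(45, 46, 48, 48, 48)),
  Group      = c("A", "A", "A", "B", "C"),
  Amount     = c(1, 1, 1, 5, 6)
)

averages <- df %>%
  # Assumption: factor labels are week numbers, so convert via character.
  mutate(FiscalWeek = as.numeric(as.character(FiscalWeek))) %>%
  # Supply every week from min to max explicitly; missing rows get Amount = 0.
  complete(FiscalWeek = full_seq(FiscalWeek, period = 1), Group,
           fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
averages
```

If the fiscal calendar has weeks outside the observed minimum and maximum, the sequence would need to be written out by hand (for example 45:48) instead of inferred.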
You can try this. I hope this helps.
library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount)/range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))

Subtract value from tibble column based on another tibble

Say I have a tibble of values:
raw = tibble(
group = c("A", "B", "C", "A", "B", "C"),
value = c(10, 20, 30, 40, 50, 60)
)
# A tibble: 6 x 2
group value
<chr> <dbl>
1 A 10
2 B 20
3 C 30
4 A 40
5 B 50
6 C 60
I want to subtract a certain amount from each value in my tibble depending on which group it belongs to. The amounts I need to subtract are in another tibble:
corrections = tibble(
group = c("A", "B", "C"),
corr = c(0, 1, 2)
)
# A tibble: 3 x 2
group corr
<chr> <dbl>
1 A 0
2 B 1
3 C 2
What is the most elegant way to achieve this? The following works, but I feel like it is messy - surely there is another way?
mutate(raw, corrected = value - as_vector(corrections[corrections["group"] == group, "corr"]))
# A tibble: 6 x 3
group value corrected
<chr> <dbl> <dbl>
1 A 10 10
2 B 20 19
3 C 30 28
4 A 40 40
5 B 50 49
6 C 60 58
How about first joining raw and corrections and then calculating corrected?
library(dplyr)
left_join(raw, corrections, by = "group") %>%
mutate(corrected = value - corr) %>%
select(-corr)
#> # A tibble: 6 x 3
#> group value corrected
#> <chr> <dbl> <dbl>
#> 1 A 10 10
#> 2 B 20 19
#> 3 C 30 28
#> 4 A 40 40
#> 5 B 50 49
#> 6 C 60 58
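For a one-column lookup like this, base R's match() gives the same result without a join. This is a minimal sketch using plain data frames in place of the tibbles; idx is an illustrative name.

```r
raw <- data.frame(
  group = c("A", "B", "C", "A", "B", "C"),
  value = c(10, 20, 30, 40, 50, 60)
)
corrections <- data.frame(
  group = c("A", "B", "C"),
  corr  = c(0, 1, 2)
)

# For each row of raw, find the position of its group in corrections.
idx <- match(raw$group, corrections$group)

# Look up the correction by position and subtract it.
raw$corrected <- raw$value - corrections$corr[idx]
raw
```

A group that has no entry in corrections yields NA here, which mirrors what left_join() would produce.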
