I would like to know how many animals will show up on a specific day. The data below describes how people register their animals in advance.
For instance, 7 days ahead, someone registered their 4 cats to show up on 5/3/2019; 6 days ahead, another 9 cats were registered for 5/3/2019. So there will be 4+9=13 cats showing up on 5/3/2019.
When days_ahead = 0, it simply means someone registered on the event day itself. For instance, 4 wolves were registered for 5/1/2019 on 5/1/2019 (0 days ahead), so there will be 4 wolves that day.
library(dplyr)

set.seed(0)
animal = c(rep('cat', 5), rep('dog', 6), rep('wolf', 3))
date = sample(seq(as.Date("2019/5/1"), as.Date("2019/5/10"), by = "day"), 14, replace = TRUE)
days_ahead = sample(seq(0, 14), 14, replace = FALSE)
number = sample.int(10, 14, replace = TRUE)
dt = data.frame(animal, date, days_ahead, number) %>% arrange(animal, date)
The expected outcome should have the same first three columns as the example, but the fourth column should be an accumulated number for each date, accumulating over days_ahead.
I added an expected outcome here. The comments are used to explain the accumulated_number column.
I've considered a loop, but I'm not entirely sure how to loop over three variables (animal, date, and days_ahead). Any advice is appreciated!
The accumulated_number is somewhat easy with cumsum(). See this link for your comments field:
Cumulatively paste (concatenate) values grouped by another variable
dt %>%
  group_by(animal, date) %>%
  mutate(accumulated_number = cumsum(number),
         comments = Reduce(function(x1, x2) paste(x1, x2, sep = '+'),
                           as.character(number), accumulate = TRUE)) %>%
  ungroup()
Also, my dataset is slightly different from yours even with the same seed (possibly because the default behavior of sample() changed in R 3.6.0). Still, it seems to work.
# A tibble: 14 x 6
animal date days_ahead number accumulated_number comments
<fct> <date> <int> <int> <int> <chr>
1 cat 2019-05-03 10 9 9 9
2 cat 2019-05-04 6 4 4 4
3 cat 2019-05-06 8 5 5 5
4 cat 2019-05-09 5 4 4 4
5 cat 2019-05-10 13 6 6 6
6 dog 2019-05-01 0 2 2 2
7 dog 2019-05-03 3 5 5 5
8 dog 2019-05-07 1 7 7 7
9 dog 2019-05-07 9 8 15 7+8
10 dog 2019-05-09 12 2 2 2
11 dog 2019-05-10 7 9 9 9
12 wolf 2019-05-02 14 5 5 5
13 wolf 2019-05-03 11 8 8 8
14 wolf 2019-05-07 4 9 9 9
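If all you need is the final head count per animal and date (the 4+9=13 cats from the question), the last accumulated value per group is equivalent to a plain grouped sum. A minimal sketch (the column name total_number is my own choice):

dt %>%
  group_by(animal, date) %>%
  summarise(total_number = sum(number), .groups = "drop")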
I'm not sure I understand your question; is this what you want?
I'm adding an "animals_arriving" column and keeping the rest of dt.
library(dplyr)
library(lubridate)
dt %>%
  mutate(date_arrival = date + days(days_ahead)) %>%
  group_by(date = date_arrival) %>%
  summarise(animals_arriving = n()) %>%
  full_join(dt, by = "date")
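One caveat, as an assumption about the intended semantics: n() counts registration rows per arrival date, so if animals_arriving should count animals rather than registrations, summing the number column may be closer to what the question describes:

dt %>%
  mutate(date_arrival = date + days(days_ahead)) %>%
  group_by(date = date_arrival) %>%
  summarise(animals_arriving = sum(number)) %>%
  full_join(dt, by = "date")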
I'm new here, so maybe my question could be difficult to understand. I have some data with date information, and I need to compute the mean of the data within year ranges. But these year ranges are non-excluding (overlapping): for example, my first range is 2013-2015, then 2014-2016, then 2015-2017, etc. I think it could be done using a loop and dplyr, but I don't know how to do it. I'll be very thankful if someone can help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year) {
  Year_3 = c(i, i + 1, i + 2)
  db %>% group_by(Year_3)
  # ....etc
}
As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
library(dplyr)

df <- data.frame(year = 2013:2020, value = 1:8)

df %>%
  tidyr::uncount(3, .id = "grp") %>%
  mutate(group_start = year - grp + 1,
         group_name = paste0(group_start, "-", group_start + 2)) %>%
  group_by(group_name) %>%
  summarise(value = mean(value),
            n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
Or we might take a more algebraic approach, noting that the sum of a three-year period is the cumulative amount two years in the future minus the cumulative amount from the prior year. This approach excludes the partial ranges.
df %>%
  mutate(cuml = cumsum(value),
         value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
Below is my attempt at a minimal reproducible example. Briefly explained, I am using rollApply from the rowr package to calculate a function over a rolling window, and using data from two columns simultaneously. If possible, I would like to skip n steps between each time the function is calculated on a new window. I will try to make it clear what I mean in the example below.
Here is the example data:
library(dplyr)  # provides tibble(), mutate(), and %>% used below

df1 <- tibble(
  x = c(1:9),
  y = c(1:9),
  Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
                   "2015-08-29", "2015-09-05", "2015-09-12", "2015-09-19",
                   "2015-09-26", "2015-10-03"))
)
Here are the example functions:
# Sum the element-wise totals of the first two columns of a window
calc_ex <- function(y){
  sum(y[, 1] + y[, 2])
}

# Apply calc_ex over a rolling window of 3 rows (prepending NA for the
# first two rows) and attach the result as an estimate column
roll_calc_ex <- function(y){
  vec <- c(rep(NA, 2), rowr::rollApply(y, calc_ex, window = 3, minimum = 3))
  y <- y %>%
    mutate(estimate = vec)
  return(y)
}
Applying the function roll_calc_ex() to df1, I get the following output:
> roll_calc_ex(df1)
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 18
5 5 5 2015-09-05 24
6 6 6 2015-09-12 30
7 7 7 2015-09-19 36
8 8 8 2015-09-26 42
9 9 9 2015-10-03 48
Ideally, I would like to have a rolling window that skips n steps, say n = 2, to produce the following output:
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 NA
6 6 6 2015-09-12 30
7 7 7 2015-09-19 NA
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
Alternatively, instead of returning NA for every row skipped, the number from the previous calculation could be filled in (something I am planning to do later anyway using fill() from tidyverse).
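For completeness, a minimal sketch of that later fill() step, assuming the skipped rows appear as NA in an estimate column as in the desired output above (result_with_gaps is a hypothetical placeholder for that result):

library(dplyr)
library(tidyr)

# result_with_gaps is assumed to look like the desired output above,
# with NA in `estimate` for every skipped row
result_filled <- result_with_gaps %>%
  fill(estimate, .direction = "down")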
If this is possible to solve using, for example, rollapply() from the zoo package, that would also be interesting to hear. I am only using rowr::rollApply() because I need to apply the function to two columns simultaneously. I know it is possible to use runner() from the runner package, but in my more complicated problem I need to run parallel computations. I am using the furrr package for parallelization, and my code works well with rollApply(), but not with runner(). The problem I have with runner() is explained here: Problem with parallelization using furrr [and runner::runner()] in R.
Thanks to anyone that took the time to read this post. Any help will be much appreciated.
1) The rowr package was removed from CRAN, but we can use rollapplyr() (like rollapply(), except the r on the end means it defaults to right alignment) from zoo, which has a by.column= argument to specify whether processing is performed column by column (TRUE) or all columns are passed at once (FALSE), and a by= argument which causes skipping.
library(dplyr)
library(zoo)
mutate(df1, roll =
  rollapplyr(cbind(x, y), 3, calc_ex, fill = NA, by.column = FALSE, by = 2)
)
giving:
x y Date roll
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 24
6 6 6 2015-09-12 NA
7 7 7 2015-09-19 36
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
2) Using complex arithmetic would also work:
f <- function(v) calc_ex(cbind(Re(v), Im(v)))
mutate(df1, roll = rollapplyr(x + y * 1i, 3, f, fill = NA, by = 2))
3) And if we look into calc_ex, it could be written as follows (although this does not generalize):
mutate(df1, roll = rollapplyr(x + y, 3, sum, fill = NA, by = 2))
4) We could also consider using zoo objects rather than data frames:
z <- read.zoo(df1, index = "Date")
merge(z, roll = rollapplyr(z, 3, calc_ex, by.column = FALSE, by = 2))
If we were to use the slider package:
library(tidyverse)
library(slider)
df1 <- tibble(
x = c(1:9),
y = c(1:9),
Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
"2015-08-29","2015-09-05", "2015-09-12", "2015-09-19",
"2015-09-26", "2015-10-03")))
df1 |>
  mutate(rolling_sum = slide2_dbl(.x = x, .y = y, .f = sum,
                                  .step = 3, .before = 2, .complete = TRUE))
#> # A tibble: 9 x 4
#> x y Date rolling_sum
#> <int> <int> <date> <dbl>
#> 1 1 1 2015-08-08 NA
#> 2 2 2 2015-08-15 NA
#> 3 3 3 2015-08-22 12
#> 4 4 4 2015-08-29 NA
#> 5 5 5 2015-09-05 NA
#> 6 6 6 2015-09-12 30
#> 7 7 7 2015-09-19 NA
#> 8 8 8 2015-09-26 NA
#> 9 9 9 2015-10-03 48
Created on 2021-10-21 by the reprex package (v2.0.1)
I have a data frame that includes information about schools. The code below produces a toy example.
library(tidyverse)

df <- tibble(grade_range = c('1-3', '2-5', '5-12'),
             school = c('AAA', 'BBB', 'CCC'),
             score = c(100, 110, 150))
The current data has one row per school, with a single character variable indicating the range of grade levels. I'd like to have a longer dataset, with one row per school-by-grade combination. The code below does the job, but it feels like a clumsy workaround, and I'm wondering if there's a more efficient way to produce the same output.
df_long <- df %>%
  mutate(low_grade = as.numeric(str_remove(str_extract(grade_range, '[[:digit:]]+-'), '-')),
         high_grade = as.numeric(str_remove(str_extract(grade_range, '-[[:digit:]]+'), '-')),
         fake_join_var = 1) %>%
  left_join(data.frame(grade_level = c(1:12), fake_join_var = rep(1, 12))) %>%
  select(-fake_join_var) %>%
  filter(grade_level >= low_grade &
           grade_level <= high_grade)
(To be clear, df_long is exactly the output I want; I'm just wondering if there's a simpler way of producing it, maybe with purrr somehow?)
Since your code is based on the difference between low_grade and high_grade, you still have to extract the numerical value from the string.
However, after that, you can simply unnest() the sequence between the two.
Here is the code:
library(tidyverse)

df <- tibble(grade_range = c('1-3', '2-5', '5-12'),
             school = c('AAA', 'BBB', 'CCC'),
             score = c(100, 110, 150))

x <- df %>%
  mutate(
    low_grade = as.numeric(str_remove(str_extract(grade_range, '\\d+-'), '-')),
    high_grade = as.numeric(str_remove(str_extract(grade_range, '-\\d+'), '-')),
    grade_level = map2(low_grade, high_grade, seq)
  ) %>%
  unnest(grade_level)
x
#> # A tibble: 15 x 6
#> grade_range school score low_grade high_grade grade_level
#> <chr> <chr> <dbl> <dbl> <dbl> <int>
#> 1 1-3 AAA 100 1 3 1
#> 2 1-3 AAA 100 1 3 2
#> 3 1-3 AAA 100 1 3 3
#> 4 2-5 BBB 110 2 5 2
#> 5 2-5 BBB 110 2 5 3
#> 6 2-5 BBB 110 2 5 4
#> 7 2-5 BBB 110 2 5 5
#> 8 5-12 CCC 150 5 12 5
#> 9 5-12 CCC 150 5 12 6
#> 10 5-12 CCC 150 5 12 7
#> 11 5-12 CCC 150 5 12 8
#> 12 5-12 CCC 150 5 12 9
#> 13 5-12 CCC 150 5 12 10
#> 14 5-12 CCC 150 5 12 11
#> 15 5-12 CCC 150 5 12 12
waldo::compare(df_long, x)
#> v No differences
Created on 2021-10-01 by the reprex package (v2.0.0)
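If the nested str_extract()/str_remove() calls feel heavy, a possible variant (a sketch; the helper columns and their order may differ slightly from df_long) is to let tidyr::separate() split the range and convert the pieces to numbers in one step, then expand with the same map2()/unnest() idea:

library(tidyverse)

df %>%
  separate(grade_range, into = c("low_grade", "high_grade"),
           sep = "-", convert = TRUE, remove = FALSE) %>%
  mutate(grade_level = map2(low_grade, high_grade, seq)) %>%
  unnest(grade_level)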
I have a data frame as follows :
id <- c(1, 2, 3, 4, 5)
week1 <- c(234,567456, 134123, 13412421, 2345245)
week2 <- c(4234,5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data <- data.frame(id, week1, week2, week3)
I would like to find the percent change between week1 and week2, then between week2 and week3, and so on (my data frame is much larger, with about 27 columns).
I tried:
data$change1 <- (data$week2-data$week1)*100/data$week1
However, this would be tedious with a larger dataset.
Try the following:
library(tidyverse)

df <- gather(data, key = 'week', value = 'value', -id)
df$week <- as.integer(gsub('week', '', df$week))

df %>%
  group_by(id) %>%
  arrange(week) %>%
  mutate(perc_change = (value - lag(value, 1)) / lag(value, 1) * 100)
# A tibble: 15 x 4
# Groups: id [5]
id week value perc_change
<dbl> <int> <dbl> <dbl>
1 1 1 234 NA
2 2 1 567456 NA
3 3 1 134123 NA
4 4 1 13412421 NA
5 5 1 2345245 NA
6 1 2 4234 1709.
7 2 2 5123456 803.
8 3 2 454123 239.
9 4 2 12342421 -7.98
10 5 2 8394545 258.
11 1 3 1234 -70.9
12 2 3 234124 -95.4
13 3 3 12348 -97.3
14 4 3 9348522 -24.3
15 5 3 134534 -98.4
This works reasonably well, but it assumes that there is an observation every week; otherwise your percent change will be based on the last available week (so, if week 3 is missing, the value for week 4 will be a week-on-week change with week 2 as the basis).
(Edit: replaced substr with gsub)
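If missing weeks are a concern, one option (a sketch, assuming every id should have a row for every week) is to complete the panel first, so that a gap produces NA rather than a multi-week change:

library(tidyverse)

df %>%                                  # the long-format data created above (id, week, value)
  complete(id, week) %>%                # insert NA rows for any missing id/week pairs
  group_by(id) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(perc_change = (value - lag(value, 1)) / lag(value, 1) * 100)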
Sense checking:
For row 6, you see id 1. This is week 2 with a value of 4234. In week 1, id 1 had a value of 234. The difference is
(4234-234)/234
[1] 17.09402
So, that is aligned (multiplied by 100, it matches the 1709. shown above).
In the past, when I've needed to create a new variable in an R data frame that is partly based on a 'group_by' summary statistic, I've always used the following sequence:
(1) calculate 'group stats' from data in the base (ungrouped) data frame using group_by() and summarize()
(2) join the base data frame with the result of the previous step, then calculate the new variable value using mutate.
However, (after years of using dplyr!) I accidentally did the 'summarizing' in a mutate step and everything seemed to work. This is illustrated in Option #2 in the code snippet below. I'm assuming Option #2 is okay because I'm getting identical results using both options, and because I found similar examples searching the web today. However, I wasn't sure.
Is Option #2 acceptable practice, or is Option #1 preferred (and if so why)?
library(dplyr)

set.seed(123)
df <- tibble(year_ = c(rep(2019, 4), rep(2020, 4)),
             qtr_ = rep(c(1, 2, 3, 4), 2),
             foo = sample(seq(1:8)))
# Option 1: calc statistics then rejoin with input data
df_stats <- df %>%
  group_by(year_) %>%
  summarize(mean_foo = mean(foo))

df_with_stats <- left_join(df, df_stats) %>%
  mutate(dfoo = foo - mean_foo)

# Option 2: everything in one go
df_with_stats2 <- df %>%
  group_by(year_) %>%
  mutate(mean_foo = mean(foo),
         dfoo = foo - mean_foo)
df_with_stats
# A tibble: 8 x 5
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
df_with_stats2
# A tibble: 8 x 5
# Groups: year_ [2]
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
Option 2 is fine if you don't need the intermediate object anyway, and you don't even need to create mean_foo in your mutate statement:
df %>% group_by(year_) %>% mutate(dfoo = foo - mean(foo))
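One small difference worth noting: Option #2 returns a grouped tibble (see the # Groups: year_ [2] header in df_with_stats2 above), so it may be worth dropping the grouping once the new column is computed:

df %>%
  group_by(year_) %>%
  mutate(dfoo = foo - mean(foo)) %>%
  ungroup()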
Also, with data.table:
library(data.table)
setDT(df)[, dfoo := foo - mean(foo), by = year_]