Combining a loop with stacking dataframes created by a function - r

I'm doing some analysis with the baseballr package and want to be able to combine data frames using a loop.
For example, the following code using the standings_on_date_bref function gives me a table of division standings for the specified day (plus a manually added column for the date of those standings):
library("baseballr")
library("dplyr")
standings_on_date_bref(date = "04-28-2021", division = "NL West") %>%
  mutate(date = "04-28-2021")
Tm    W-L%    date
SFG   0.640   04-28-2021
LAD   0.640   04-28-2021
SDP   0.538   04-28-2021
ARI   0.500   04-28-2021
COL   0.375   04-28-2021
However, I'm interested in getting the standings for a whole range of days (which would end up being a data frame with rows = 5 teams * x number of days). For example, for 04-28-2021 to 04-29-2021, I'm hoping it would look something like this:
Tm    W-L%    date
SFG   0.640   04-28-2021
LAD   0.640   04-28-2021
SDP   0.538   04-28-2021
ARI   0.500   04-28-2021
COL   0.375   04-28-2021
SFG   0.640   04-29-2021
LAD   0.615   04-29-2021
SDP   0.538   04-29-2021
ARI   0.520   04-29-2021
COL   0.360   04-29-2021
I have tried to do so by implementing some sort of loop. This is what I've come up with so far, but in the end it just gives me the standings for the end date.
start <- as.Date("04-01-21", format = "%m-%d-%y")
end <- as.Date("04-03-21", format = "%m-%d-%y")
theDate <- start

while (theDate <= end) {
  all_standings <- standings_on_date_bref(date = theDate, division = "NL West") %>%
    mutate(date = theDate)
  theDate <- theDate + 1
}
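(The loop only returns the final day because all_standings is overwritten on every iteration, so only the last assignment survives.)

If you want to keep the while loop, one possible fix is to accumulate each day's result in a list and bind the pieces at the end; a minimal sketch, assuming dplyr::bind_rows and the same date range:

library(baseballr)
library(dplyr)

start <- as.Date("04-01-21", format = "%m-%d-%y")
end <- as.Date("04-03-21", format = "%m-%d-%y")

pieces <- list()          # one element per day
theDate <- start
while (theDate <= end) {
  pieces[[as.character(theDate)]] <-
    standings_on_date_bref(date = theDate, division = "NL West") %>%
    mutate(date = theDate)
  theDate <- theDate + 1
}
all_standings <- bind_rows(pieces)   # stack all days into one data frame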

You can try purrr, which handles this quite nicely with the map_dfr function:
library(baseballr)
library(dplyr)
library(purrr)

date_seq <- seq(as.Date("04-01-21", format = "%m-%d-%y"),
                as.Date("04-03-21", format = "%m-%d-%y"), by = "1 day")

map_dfr(.x = date_seq,
        .f = function(x) {
          standings_on_date_bref(date = x, division = "NL West") %>%
            mutate(date = x)
        })
#> # A tibble: 15 x 9
#> Tm W L `W-L%` GB RS RA `pythW-L%` date
#> <chr> <int> <int> <dbl> <chr> <int> <int> <dbl> <date>
#> 1 SDP 1 0 1 -- 8 7 0.561 2021-04-01
#> 2 COL 1 0 1 -- 8 5 0.703 2021-04-01
#> 3 ARI 0 1 0 1.0 7 8 0.439 2021-04-01
#> 4 SFG 0 1 0 1.0 7 8 0.439 2021-04-01
#> 5 LAD 0 1 0 1.0 5 8 0.297 2021-04-01
#> 6 SDP 2 0 1 -- 12 9 0.629 2021-04-02
#> 7 COL 1 1 0.5 1.0 14 16 0.439 2021-04-02
#> 8 SFG 1 1 0.5 1.0 13 11 0.576 2021-04-02
#> 9 LAD 1 1 0.5 1.0 16 14 0.561 2021-04-02
#> 10 ARI 0 2 0 2.0 9 12 0.371 2021-04-02
#> 11 SDP 3 0 1 -- 19 9 0.797 2021-04-03
#> 12 LAD 2 1 0.667 1.0 22 19 0.567 2021-04-03
#> 13 COL 1 2 0.333 2.0 19 22 0.433 2021-04-03
#> 14 SFG 1 2 0.333 2.0 13 15 0.435 2021-04-03
#> 15 ARI 0 3 0 3.0 9 19 0.203 2021-04-03
Created on 2022-01-02 by the reprex package (v2.0.1)
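A side note: in purrr >= 1.0.0, map_dfr() is superseded, and the same result can be written with map() plus list_rbind(). A sketch, assuming the date_seq defined above:

map(date_seq, function(x) {
  standings_on_date_bref(date = x, division = "NL West") %>%
    mutate(date = x)
}) %>%
  list_rbind()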

Related

Filter in dplyr interval of dates

I have the following simulated dataset in R:
library(tidyverse)
A = seq(from = as.Date("2021/1/1"),to=as.Date("2022/1/1"), length.out = 252)
length(A)
x = rnorm(252)
d = tibble(A,x);d
that looks like:
# A tibble: 252 × 2
A x
<date> <dbl>
1 2021-01-01 0.445
2 2021-01-02 -0.793
3 2021-01-03 -0.367
4 2021-01-05 1.64
5 2021-01-06 -1.15
6 2021-01-08 0.276
7 2021-01-09 1.09
8 2021-01-11 0.443
9 2021-01-12 -0.378
10 2021-01-14 0.203
# … with 242 more rows
This is one year of 252 trading days. Let's say I have a date of interest, which is:
start = as.Date("2021-05-23"); start
I want to filter the dataset so that the result is a new dataset starting from this date and containing the next 20 index dates (trading days, NOT calendar days), and then find the total number of indices the new dataset contains.
For example, counting everything from the starting date onward, I have:
d1 = d %>%
  dplyr::filter(A > start) %>%
  dplyr::summarise(n())
d1
# A tibble: 1 × 1
`n()`
<int>
1 98
but I only want the next 20 trading days after the starting date. How can I do that? Any help?
Perhaps a brute-force attempt:
d %>%
  filter(between(A, start, max(head(sort(A[A > start]), 20))))
# # A tibble: 20 x 2
# A x
# <date> <dbl>
# 1 2021-05-23 -0.185
# 2 2021-05-24 0.102
# 3 2021-05-26 0.429
# 4 2021-05-27 -1.21
# 5 2021-05-29 0.260
# 6 2021-05-30 0.479
# 7 2021-06-01 -0.623
# 8 2021-06-02 0.982
# 9 2021-06-04 -0.0533
# 10 2021-06-05 1.08
# 11 2021-06-07 -1.96
# 12 2021-06-08 -0.613
# 13 2021-06-09 -0.267
# 14 2021-06-11 -0.284
# 15 2021-06-12 0.0851
# 16 2021-06-14 0.355
# 17 2021-06-15 -0.635
# 18 2021-06-17 -0.606
# 19 2021-06-18 -0.485
# 20 2021-06-20 0.255
If you have duplicate dates, you may prefer to use head(sort(unique(A[A > start])), 20), depending on what "20 index dates" means.
And to find the number of indices, you can summarise or count as needed.
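For instance, a sketch reusing the same filter to return the row count directly:

d %>%
  filter(between(A, start, max(head(sort(A[A > start]), 20)))) %>%
  count()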
You could first sort by date, filter for days greater than the given date, and then take the top 20 records.
d1 = d %>%
  arrange(A) %>%
  filter(A > start) %>%
  head(20)
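A slightly more idiomatic dplyr alternative to head(20) here is slice_head(); a sketch, assuming dplyr >= 1.0.0:

d1 <- d %>%
  arrange(A) %>%
  filter(A > start) %>%
  slice_head(n = 20)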

Dividing a number within a month with the last observation in the previous month using dplyr

I am struggling to find the correct way of computing the relative return within a month using the last observation of the previous month. Data for reference:
set.seed(123)
Date = seq(as.Date("2021/12/31"), by = "day", length.out = 90)
Returns = runif(90, min=-0.02, max = 0.02)
mData = data.frame(Date, Returns)
Then, I would like to have a return column. For example, when calculating the return for February 3rd, it should use the values for the respective dates: 2022-02-03 / 2022-01-31 - 1. Likewise for, e.g., March 3rd: 2022-03-03 / 2022-02-28 - 1. So the question is: how can I keep each date's return within a month as the numerator while using the last observation of the previous month as the denominator, using dplyr?
Use a tmp column to carry the previous day's value (assuming the data are sorted by date), then within each year-month group take the first of those lagged values, which is the last observation of the previous month. Grouping is done on year-month in group_by.
mData %>%
  mutate(tmp = lag(Returns)) %>%
  group_by(dat = strftime(Date, format = "%Y-%m")) %>%
  mutate(tmp = first(tmp), result = Returns/tmp - 1) %>%
  ungroup() %>%
  select(-c(tmp, dat))
# A tibble: 90 × 5 # before select:
Date Returns result # tmp dat
<date> <dbl> <dbl> # <dbl> <chr>
1 2021-12-31 -0.00850 NA # NA 2021-12
2 2022-01-01 0.0115 -2.36 # -0.00850 2022-01
3 2022-01-02 -0.00364 -0.571 # -0.00850 2022-01
4 2022-01-03 0.0153 -2.80 # -0.00850 2022-01
5 2022-01-04 0.0176 -3.07 # -0.00850 2022-01
6 2022-01-05 -0.0182 1.14 # -0.00850 2022-01
7 2022-01-06 0.00112 -1.13 # -0.00850 2022-01
8 2022-01-07 0.0157 -2.85 # -0.00850 2022-01
9 2022-01-08 0.00206 -1.24 # -0.00850 2022-01
10 2022-01-09 -0.00174 -0.796 # -0.00850 2022-01
# … with 80 more rows
library(tidyverse)
library(lubridate)

set.seed(123)
Date = seq(as.Date("2021/12/31"), by = "day", length.out = 90)
Returns = runif(90, min = -0.02, max = 0.02)
mData = data.frame(Date, Returns)

mData |>
  group_by(month(Date)) |>
  mutate(last_return = last(Returns)) |>
  ungroup() |>
  nest(data = c(Date, Returns)) |>
  mutate(last_return_lag = lag(last_return)) |>
  unnest(data) |>
  mutate(x = Returns/last_return_lag)
#> # A tibble: 90 × 6
#> `month(Date)` last_return Date Returns last_return_lag x
#> <dbl> <dbl> <date> <dbl> <dbl> <dbl>
#> 1 12 -0.00850 2021-12-31 -0.00850 NA NA
#> 2 1 0.0161 2022-01-01 0.0115 -0.00850 -1.36
#> 3 1 0.0161 2022-01-02 -0.00364 -0.00850 0.429
#> 4 1 0.0161 2022-01-03 0.0153 -0.00850 -1.80
#> 5 1 0.0161 2022-01-04 0.0176 -0.00850 -2.07
#> 6 1 0.0161 2022-01-05 -0.0182 -0.00850 2.14
#> 7 1 0.0161 2022-01-06 0.00112 -0.00850 -0.132
#> 8 1 0.0161 2022-01-07 0.0157 -0.00850 -1.85
#> 9 1 0.0161 2022-01-08 0.00206 -0.00850 -0.242
#> 10 1 0.0161 2022-01-09 -0.00174 -0.00850 0.204
#> # … with 80 more rows
Created on 2022-02-03 by the reprex package (v2.0.1)
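Another possible pattern, not taken from either answer above, is to summarise the last return of each calendar month, lag it by one month, and join it back onto the daily data. A sketch, assuming lubridate's floor_date() and the mData frame defined above:

library(dplyr)
library(lubridate)

monthly_last <- mData %>%
  group_by(month = floor_date(Date, "month")) %>%
  summarise(last_return = last(Returns), .groups = "drop") %>%
  mutate(prev_month_last = lag(last_return))   # last observation of the previous month

mData %>%
  mutate(month = floor_date(Date, "month")) %>%
  left_join(select(monthly_last, month, prev_month_last), by = "month") %>%
  mutate(result = Returns / prev_month_last - 1)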

Create index column reverse group by column

I want to create a column that is sequenced like a rank, but not based on numerical values, like in the example below.
a <- rep(letters[1:3],each =3)
b <- round(rnorm(9,5,1),2)
tempdf <- data.frame(cbind(a,b))
tempdf
#> a b
#> 1 a 5.58
#> 2 a 3.68
#> 3 a 5.12
#> 4 b 3.28
#> 5 b 4.97
#> 6 b 6.57
#> 7 c 5.92
#> 8 c 5.25
#> 9 c 3.02
tempdf["c"] <- rep(1:3, each = 3)
tempdf
#> a b c
#> 1 a 5.58 1
#> 2 a 3.68 1
#> 3 a 5.12 1
#> 4 b 3.28 2
#> 5 b 4.97 2
#> 6 b 6.57 2
#> 7 c 5.92 3
#> 8 c 5.25 3
#> 9 c 3.02 3
Created on 2021-04-09 by the reprex package (v1.0.0)
My data actually looks more like this. I want to create an index of week number over multiple years. Please suggest better ways to do it.
library(dplyr)
library(lubridate)

a <- seq.Date(as.Date("2021-01-01"), as.Date("2021-02-28"), by = "1 day")
b <- round(rnorm(59, 5, 1), 2)
tempdf <- cbind.data.frame(a, b)

tempdf <- tempdf %>%
  mutate(weeks = week(a),
         month = month(a),
         year = year(a)) %>%
  # mutate(ymw = 10000*year + 100*month + weeks) %>%
  mutate(ymw = paste0(year, month, weeks))
tempdf
#> a b weeks month year ymw
#> 1 2021-01-01 6.78 1 1 2021 202111
#> 2 2021-01-02 4.17 1 1 2021 202111
#> 3 2021-01-03 5.65 1 1 2021 202111
#> 4 2021-01-04 5.20 1 1 2021 202111
#> 5 2021-01-05 4.55 1 1 2021 202111
#> 6 2021-01-06 5.07 1 1 2021 202111
#> 7 2021-01-07 6.29 1 1 2021 202111
#> 8 2021-01-08 6.01 2 1 2021 202112
#> 9 2021-01-09 4.45 2 1 2021 202112
#> 10 2021-01-10 5.35 2 1 2021 202112
#> 11 2021-01-11 5.10 2 1 2021 202112
#> 12 2021-01-12 4.34 2 1 2021 202112
#> 13 2021-01-13 4.47 2 1 2021 202112
#> 14 2021-01-14 6.03 2 1 2021 202112
#> 15 2021-01-15 6.55 3 1 2021 202113
#> 16 2021-01-16 5.60 3 1 2021 202113
#> 17 2021-01-17 5.54 3 1 2021 202113
Created on 2021-04-09 by the reprex package (v1.0.0)
I can think of two options, depending on what you ultimately intend to do with the column.
tempdf %>%
  mutate(
    weekIndex_1 = year + weeks/100,
    weekIndex_2 = floor(as.numeric(a)/7)
  )
#> a b weeks month year ymw weekIndex_1 weekIndex_2
#> 1 2021-01-01 7.30 1 1 2021 202111 2021.01 2661
#> 2 2021-01-02 4.53 1 1 2021 202111 2021.01 2661
#> 3 2021-01-03 5.21 1 1 2021 202111 2021.01 2661
#> 4 2021-01-04 6.74 1 1 2021 202111 2021.01 2661
#> 5 2021-01-05 4.53 1 1 2021 202111 2021.01 2661
#> 6 2021-01-06 5.56 1 1 2021 202111 2021.01 2661
#> 7 2021-01-07 5.09 1 1 2021 202111 2021.01 2662
#> 8 2021-01-08 4.82 2 1 2021 202112 2021.02 2662
#> 9 2021-01-09 5.65 2 1 2021 202112 2021.02 2662
#> 10 2021-01-10 4.46 2 1 2021 202112 2021.02 2662
Both will allow you to sort on the index. The difference is that weekIndex_1 tracks the year and resets the week number when the year changes, in a sense using semantic versioning for the date; this is very similar to what you did with the ymw column. With weekIndex_2 you are essentially counting weeks since the origin (1970-01-01), which accounts for the fact that years aren't exactly 52 weeks long. You get the sequential order but lose a bit of the year context; since you already have both of these in other columns (weeks and year), perhaps that isn't important.
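If all you need is a sequential week index across years, another option (a sketch, assuming lubridate is already loaded for floor_date()) is to rank the start date of each week directly:

tempdf %>%
  mutate(weekIndex_3 = dense_rank(floor_date(a, unit = "week")))

This yields 1, 2, 3, ... in calendar order and never resets at year boundaries.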

Summing up Certain Sequences of a Dataframe in R

I have several data frames of daily rates of different regions by age-groups:
Date 0-14 Rate 15-29 Rate 30-44 Rate 45-64 Rate 65-79 Rate 80+ Rate
2020-23-12 0 33.54 45.68 88.88 96.13 41.28
2020-24-12 0 25.14 35.28 66.14 90.28 38.41
It begins on a Wednesday (2020-23-12), and I have data from then up to the present.
I want to obtain weekly row sums of the rates from each Wednesday to the following Tuesday.
There should be a clever way of combining the aggregate, seq, and rowsum functions to do this in a few lines; otherwise I'll end up with a much longer approach.
I created some minimal data: three weeks with arbitrary numeric columns (no missing values). You can use tidyverse verbs to sum across columns, create week groups, and sum the row sums by week:
# Minimal data
MWE <- data.frame(date = c(outer(as.Date("12/23/20", "%m/%d/%y"), 0:20, `+`)),
                  column1 = runif(21, 0, 1),
                  column2 = runif(21, 0, 1))

library(tidyverse)

MWE %>%
  # Calculate row sum across all numeric columns
  mutate(sum = rowSums(across(where(is.numeric)))) %>%
  # Create week groups
  group_by(week = ceiling(row_number()/7)) %>%
  # Sum over all row sums per group
  summarise(rowSums_by_week = sum(sum))
# Intermediate result (after the mutate and group_by, before summarise()):
# Groups: week [3]
date column1 column2 sum week
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-12-23 0.449 0.759 1.21 1
2 2020-12-24 0.423 0.0956 0.519 1
3 2020-12-25 0.974 0.592 1.57 1
4 2020-12-26 0.798 0.250 1.05 1
5 2020-12-27 0.870 0.487 1.36 1
6 2020-12-28 0.952 0.345 1.30 1
7 2020-12-29 0.349 0.817 1.17 1
8 2020-12-30 0.227 0.727 0.954 2
9 2020-12-31 0.292 0.209 0.501 2
10 2021-01-01 0.678 0.276 0.954 2
# ... with 11 more rows
# Final result after summarise():
# A tibble: 3 x 2
week rowSums_by_week
<dbl> <dbl>
1 1 8.16
2 2 6.02
3 3 6.82
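Since the question mentions aggregate, seq, and rowsum, here is a base R sketch of the same idea, assuming the MWE data frame above and that every week spans exactly 7 rows:

week_id <- ceiling(seq_len(nrow(MWE)) / 7)                       # 1,1,...,1,2,2,...
rowsum(rowSums(MWE[sapply(MWE, is.numeric)]), group = week_id)   # weekly totals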

How to find both rows associated with a string in an R dataframe and subtract their mutual column values

In R, I have a dataframe that looks like this:
sample value gene tag isPTV
1 1120 3.4 arx1 1120|arx1 0
2 2123 2.3 mnf2 2123|mnf2 0
3 1129 1.9 trf4 1129|trf4 0
4 2198 0.2 brc1 2198|brc1 0
5 1120 2.1 arx1 1120|arx1 1
6 2123 0.4 mnf2 2123|mnf2 1
7 1129 1.2 trf4 1129|trf4 1
8 2198 0.9 brc1 2198|brc1 1
Here 0 means false and 1 means true. What I'm ultimately trying to do is create a dataframe that, for each tag, finds the absolute difference between the value numbers.
For instance, 1129|trf4 occurs in two separate rows. There's a value for when it isPTV and one for when it is not, so the absolute difference would be 1.9 - 1.2 = 0.7.
I started by trying to write a function that, for a given tag, would return both rows containing that tag:
getExprValue <- function(dataframe, tag) {
  return(dataframe[tag, ])
}
But this is not working, and I'm not very familiar with how you index dataframes in R.
What is the right way to do this?
UPDATE:
Solution 1 Attempt:
m_diff <- m %>% group_by(tag) %>% mutate(absDiff = abs(diff(value)))
Response:
Error in mutate_impl(.data, dots) : Column `absDiff` must be length 1 (the group size), not 0
Solution 2 Attempt:
with(df1, abs(ave(value, tag, FUN = diff)))
Response:
Error in x[i] <- value[[j]] : replacement has length zero
Edit: I just noticed that @akrun had a much simpler solution.
Create data with a structure similar to yours:
library(tidyverse)

dat <- tibble(
  sample = rep(sample(1000:3000, 10), 2),
  value = rnorm(20, 5, 1),
  gene = rep(letters[1:10], 2),
  tag = paste(sample, gene, sep = "|"),
  isPTV = rep(0:1, each = 10)
)
dat
#> # A tibble: 20 x 5
#> sample value gene tag isPTV
#> <int> <dbl> <chr> <chr> <int>
#> 1 2149 5.90 a 2149|a 0
#> 2 1027 5.46 b 1027|b 0
#> 3 1103 5.65 c 1103|c 0
#> 4 1884 4.86 d 1884|d 0
#> 5 2773 5.58 e 2773|e 0
#> 6 2948 6.98 f 2948|f 0
#> 7 2478 5.17 g 2478|g 0
#> 8 2724 6.71 h 2724|h 0
#> 9 1927 5.06 i 1927|i 0
#> 10 1081 4.39 j 1081|j 0
#> 11 2149 4.60 a 2149|a 1
#> 12 1027 2.97 b 1027|b 1
#> 13 1103 6.17 c 1103|c 1
#> 14 1884 5.83 d 1884|d 1
#> 15 2773 4.23 e 2773|e 1
#> 16 2948 6.48 f 2948|f 1
#> 17 2478 5.06 g 2478|g 1
#> 18 2724 5.32 h 2724|h 1
#> 19 1927 7.32 i 1927|i 1
#> 20 1081 4.73 j 1081|j 1
# @akrun's solution (much better than mine):
dat %>%
  group_by(tag) %>%
  mutate(absDiff = abs(diff(value)))
#> # A tibble: 20 x 6
#> # Groups: tag [10]
#> sample value gene tag isPTV absDiff
#> <int> <dbl> <chr> <chr> <int> <dbl>
#> 1 2149 5.90 a 2149|a 0 1.30
#> 2 1027 5.46 b 1027|b 0 2.49
#> 3 1103 5.65 c 1103|c 0 0.520
#> 4 1884 4.86 d 1884|d 0 0.974
#> 5 2773 5.58 e 2773|e 0 1.34
#> 6 2948 6.98 f 2948|f 0 0.502
#> 7 2478 5.17 g 2478|g 0 0.114
#> 8 2724 6.71 h 2724|h 0 1.39
#> 9 1927 5.06 i 1927|i 0 2.26
#> 10 1081 4.39 j 1081|j 0 0.337
#> 11 2149 4.60 a 2149|a 1 1.30
#> 12 1027 2.97 b 1027|b 1 2.49
#> 13 1103 6.17 c 1103|c 1 0.520
#> 14 1884 5.83 d 1884|d 1 0.974
#> 15 2773 4.23 e 2773|e 1 1.34
#> 16 2948 6.48 f 2948|f 1 0.502
#> 17 2478 5.06 g 2478|g 1 0.114
#> 18 2724 5.32 h 2724|h 1 1.39
#> 19 1927 7.32 i 1927|i 1 2.26
#> 20 1081 4.73 j 1081|j 1 0.337
My initial suggestion (unnecessarily complicated):
nested <- dat %>%
  group_by(tag) %>%
  nest()

nested %>%
  mutate(difference = map(data, ~ abs(diff(.$value)))) %>%
  select(-data) %>%
  unnest()
#> # A tibble: 10 x 2
#> tag difference
#> <chr> <dbl>
#> 1 2149|a 1.30
#> 2 1027|b 2.49
#> 3 1103|c 0.520
#> 4 1884|d 0.974
#> 5 2773|e 1.34
#> 6 2948|f 0.502
#> 7 2478|g 0.114
#> 8 2724|h 1.39
#> 9 1927|i 2.26
#> 10 1081|j 0.337
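A footnote on the error from the first attempt: abs(diff(value)) has length 0 whenever a tag occurs only once, which is exactly what mutate() complains about. A defensive sketch using the dat tibble above, returning one row per tag and guarding against singleton groups:

dat %>%
  group_by(tag) %>%
  summarise(
    absDiff = if (n() == 2) abs(diff(value)) else NA_real_,
    .groups = "drop"
  )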
