Calculate cumulative sum over time stamp with dplyr - r

I'm trying to calculate cumulative sums over a time span. Is there a way to calculate this in a single step? Any package recommendations?
activate_2019 <- activate_rate %>%
filter(
grepl("2019", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2020 <- activate_rate %>%
filter(
grepl("2019|2020", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2021 <- activate_rate %>%
filter(
grepl("2019|2020|2021", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)

Here is one method with the tidyverse:
Extract the unique years from the 'join_day' column.
Loop over those years and slice the rows of activate_rate up to the last row whose 'join_day' matches the current year.
Summarise by taking the mean of 'if_activate'.
Column-bind the outputs with map_dfc.
library(stringr)
library(dplyr)
library(purrr)
un1 <- str_extract_all(activate_rate$join_day, "\\d{4}") %>%
unlist %>%
unique %>%
as.integer %>%
sort
map_dfc(un1, ~ activate_rate %>%
arrange(as.Date(join_day)) %>%
slice(seq(max(grep(as.character(.x), join_day)))) %>%
summarise(!!str_c("proportion", .x) := mean(if_activate)))

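If I understand correctly, this should do the trick (floor_date() comes from lubridate and needs join_day as a Date; note that it returns one proportion per year rather than a running total):
library(dplyr)
library(lubridate)
activate_rate %>%
  mutate(year = floor_date(as.Date(join_day), unit = "year")) %>%
  group_by(year) %>%
  summarise(proportion = sum(if_activate) / n())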
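If the goal is the running proportion from the question (2019, then 2019 to 2020, then 2019 to 2021), one possible sketch is to count activations and users per year and divide the two cumulative sums. The toy data below is hypothetical, and join_day is assumed to be a Date column; the names activated, users and cumulative are made up for the illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per user, a join date and a 0/1 activation flag
activate_rate <- tibble(
  join_day    = as.Date(c("2019-03-01", "2019-07-15", "2020-01-10",
                          "2020-11-30", "2021-05-05", "2021-06-06")),
  if_activate = c(1, 0, 1, 1, 0, 0)
)

cumulative <- activate_rate %>%
  mutate(year = as.integer(format(join_day, "%Y"))) %>%   # year of joining
  group_by(year) %>%
  summarise(activated = sum(if_activate), users = n(), .groups = "drop") %>%
  arrange(year) %>%
  # running activations divided by running user count
  mutate(proportion = cumsum(activated) / cumsum(users))

cumulative
```

Each row of cumulative then corresponds to one of the activate_2019, activate_2020, ... objects in the question, without repeating the filter for every year.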

Related

R - Issue with Ranking and Grouping

I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean = mean(mmr)) %>%
arrange(mean) %>%
group_by(country)
tib2
My output is so close to where I need it to be, I just need to make each country have only one row (that has the mean ranking for each country).
Thank you!
Just repeat the same analysis, but after summarising by (country, year), group by country alone:
tib2 <- mmr %>%
  group_by(country, year) %>%
  summarise(mean_mmr = mean(mmr), .groups = "drop") %>%
  group_by(country) %>%
  summarise(mean_mmr = mean(mean_mmr)) %>%
  slice_min(mean_mmr, n = 10)
tib2
Not sure without the data, but does this work?
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean1 = mean(mmr)) %>%
ungroup() %>%
group_by(year) %>%
mutate(rank1 = rank(mean1)) %>%
ungroup() %>%
group_by(country) %>%
summarise(rank = mean(rank1))%>%
ungroup() %>%
arrange(rank) %>%
slice_head(n=10)
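To see the rank-then-average steps on something concrete, here is a minimal sketch on hypothetical toy data (three made-up countries, two years, and n = 2 instead of 10). The column names mean_mmr, rank_in_year and mean_rank are chosen for the illustration only.

```r
library(dplyr)

# Hypothetical toy data; country "A" has two observations in 2000
mmr <- tibble(
  country = c("A", "A", "B", "C", "A", "B", "C"),
  year    = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  mmr     = c(10, 30, 50, 5, 30, 40, 8)
)

mean_rank <- mmr %>%
  group_by(country, year) %>%
  summarise(mean_mmr = mean(mmr), .groups = "drop") %>%       # one value per country-year
  group_by(year) %>%
  mutate(rank_in_year = rank(mean_mmr)) %>%                   # rank within each year
  group_by(country) %>%
  summarise(mean_rank = mean(rank_in_year), .groups = "drop") %>%
  slice_min(mean_rank, n = 2)                                 # lowest mean ranks first

mean_rank
```

Note that slice_min() keeps ties by default, so it can return more than n rows when several countries share the same mean rank.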

Manipulating data.frame while using cycles and storing values in a list

I have two pieces of code that manipulate and filter (by date) my data.frame, and both work perfectly. Now I want to run the code not just for one day but for every day in this vector:
seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days') # .... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
filter(DATE < '2020-03-02')%>%
summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK,na.rm = T))
STAVPTF <- data.frame %>%
filter (DATE < '2020-03-02')%>%
group_by(CP) %>%
summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
select(DATE,CP,mnozstvi) %>%
rbind(KOKOKO)%>%
drop_na()
So instead of '2020-03-02' I want to plug in every day since '2020-03-02', one after another. I want each KOKOKO and STAVPTF created for a given day saved as a separate data.frame, with all of them stored in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
  filter(DATE < .x) %>%
  summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
Since this is a repeated cycle, a function would make it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
filtered <- dat %>%
filter({{date_col}} < date_val)
KOKOKO <- filtered %>%
summarize({{date_col}} := date_val,
CZK = sum({{Objem_col}}, na.rm = TRUE))
STAVPTF <- filtered %>%
group_by({{group_col}}) %>%
summarize(mnozstvi = last({{aktualni_col}}),
{{date_col}} := date_val) %>%
select({{date_col}}, {{group_col}}, mnozstvi) %>%
bind_rows(KOKOKO)%>%
drop_na()
return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example. That's easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO <-
  data.frame %>%
  arrange(DATE) %>% # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK, na.rm = TRUE), .groups = 'drop') %>% # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
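As a quick sanity check of the cumulative-sum idea, the sketch below compares it against the original filter-and-sum loop on hypothetical toy data. The data frame trades and the names by_cumsum and by_loop are made up for the example; only DATE and Objem.v.CZK come from the question.

```r
library(dplyr)

# Hypothetical toy data with the two columns used by KOKOKO
trades <- tibble(
  DATE        = as.Date(c("2020-03-02", "2020-03-02", "2020-03-03", "2020-03-05")),
  Objem.v.CZK = c(100, 50, -30, 20)
)

# Vectorised: total of all rows strictly before each DATE
by_cumsum <- trades %>%
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK, na.rm = TRUE), .groups = "drop") %>%
  arrange(DATE) %>%
  mutate(CZK = cumsum(CZK) - CZK)

# Loop: the original filter-and-sum, repeated once per date
dates   <- by_cumsum$DATE
by_loop <- vapply(
  seq_along(dates),
  function(i) sum(trades$Objem.v.CZK[trades$DATE < dates[i]], na.rm = TRUE),
  numeric(1)
)

all.equal(by_cumsum$CZK, by_loop)
```

One difference to keep in mind: the vectorised version only produces rows for dates that occur in the data, whereas the loop over a full daily sequence produces one result per calendar day.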
Even STAVPTF code can probably be rewritten without iterations. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE. Then this value is assigned to the next DATE:
STAVPTF <-
data.frame %>%
group_by(CP, DATE) %>%
summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups='drop_last') %>%
arrange(DATE) %>% # if necessary
mutate(DATE = lead(DATE))

Increase speed in finding "Increasing dice roll sequences"

The problem was: how many sequences of 9 dice rolls are increasing, i.e. non-decreasing (e.g. 223444556)? OK, I know the answer is given by choose(14,9), but I just wanted to play around with dplyr.
A fast but not elegant way:
library(tidyverse)
expand.grid(data.frame(matrix(rep(1:6,9),ncol=9))) %>%
filter(X1<=X2 & X2<=X3 &X3<=X4 &X4<=X5 &X5<=X6 &X6<=X7 &X7<=X8 &X8<=X9) %>% tally
I tried the following two alternatives (without explicit reference to variable names), but they're both very slow (and memory consuming). Can you help me optimize my code using tidyverse?
expand_grid(!!!data.frame(matrix(rep(1:6,9),ncol=9))) %>%
rownames_to_column(var = "grp") %>%
mutate(grp = as.numeric(grp)) %>%
pivot_longer (cols=!grp) %>%
group_by(grp) %>%
mutate(prev = lag(value)) %>%
filter(!is.na(prev)) %>%
transmute(dif=value-prev) %>%
summarize(res = all(dif >=0)) %>%
group_by(res) %>% summarize(n=n())
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
rowwise() %>%
mutate(asc = all(diff(c_across(cols = everything())) >= 0)) %>%
filter(asc==TRUE) %>% tally
This is also slow, but not memory consuming.
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
filter(pmap_lgl(.,~{
if(all(list(...) %>% flatten_dbl() %>% diff() >=0)) return(TRUE) else return(FALSE)
})) %>% tally
Here is a tidyverse approach that relies on purrr:
expand.grid(replicate(9, 1:6, FALSE)) %>%
filter(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`)) %>%
tally()
This is somewhat difficult to do in the context of pipes. We need both to compare columns n and n + 1 and to reduce the result down to a single logical vector. Then we need to filter the original dataset.
And if you were only interested in the tally, we could sum the logical vector.
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`))}
Finally, if you don't mind one more dependency, matrixStats can parallel what you were doing with one of your approaches:
library(matrixStats)
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(rowAlls(rowDiffs(as.matrix(.)) >= 0L))}
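To check the column-comparison approach without building all 6^9 rows, here is a small sketch that wraps it in a function and compares the count for shorter sequences against the closed form choose(sides + n_rolls - 1, n_rolls), of which choose(14, 9) is the 9-roll case. The function name count_nondecreasing and its arguments are hypothetical.

```r
library(purrr)

# Count non-decreasing sequences of n_rolls dice rolls by brute force.
# Assumes n_rolls >= 2, since neighbouring columns are compared pairwise.
count_nondecreasing <- function(n_rolls, sides = 6) {
  rolls <- expand.grid(replicate(n_rolls, seq_len(sides), simplify = FALSE))
  # Compare each column with its right-hand neighbour, then AND the results
  ok <- reduce(map2(rolls[-n_rolls], rolls[-1], ~ .x <= .y), `&`)
  sum(ok)
}

count_nondecreasing(4)   # brute-force count for 4 rolls
choose(6 + 4 - 1, 4)     # closed form for 4 rolls
```

The two numbers agree for small n_rolls, which is a cheap way to validate the filter logic before running it on the full 9-roll grid.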

Can't combine <character> and <double> on adding sort function

I am trying to sort the data by the median price, i.e. m, but when I added the sort function it threw this error:
Error: Can't combine `locationName` <character> and `m` <double>.
How can I sort the data by a newly mutated column, in my case m, the median price?
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
sort(m,decreasing = TRUE)
We can use sort within mutate
library(dplyr)
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
mutate(m = sort(m,decreasing = TRUE))
If the intention is to order the rows based on 'm', use arrange
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
arrange(desc(m))
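If one row per group is enough, a possible variant is to use summarise() instead of mutate() and then arrange(). The sketch below runs on hypothetical toy data (the locations, city and prices are made up) and uses plain filter() in place of filter_at().

```r
library(dplyr)

# Hypothetical toy listings
df <- tibble(
  locationName = c("X", "X", "Y", "Y", "Z"),
  area         = "10 Marla",
  city         = "Lahore",
  price        = c(100, 300, 50, 70, 500)
)

median_by_location <- df %>%
  filter(grepl("10 Marla", area)) %>%
  group_by(locationName, area, city) %>%
  summarise(m = median(price), .groups = "drop") %>%  # one row per group
  arrange(desc(m))                                    # order rows by median price

median_by_location
```

sort() works on a single vector, which is why piping a whole data frame into it fails; arrange() is the row-ordering verb for data frames.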

Summarise with multiple conditions based on years

I would like to create a set of columns holding the paper count for each year, which means applying multiple filter conditions inside summarise in dplyr.
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count', I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc..) based on each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
Does anybody have any suggestions, please?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)
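The same reshaping can also be sketched with count() and pivot_wider(), which tidyr offers as the newer alternative to spread(). The example below starts from hypothetical, already tokenised toy data (so it does not need unnest_tokens), and the papers_count_ prefix and tot_papers_count column follow the naming asked for in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per (word, paper year) occurrence
words <- tibble(
  word = c("soil", "soil", "water", "soil", "water", "carbon"),
  year = c(1990, 1991, 1991, 1991, 1992, 1990)
)

words_by_year <- words %>%
  filter(between(year, 1990, 2017)) %>%
  count(word, year, name = "papers_count") %>%          # rows per word and year
  pivot_wider(names_from = year, values_from = papers_count,
              names_prefix = "papers_count_", values_fill = 0) %>%
  # total across all year columns
  mutate(tot_papers_count = rowSums(across(starts_with("papers_count_")))) %>%
  arrange(desc(tot_papers_count))

words_by_year
```

This yields one row per word with columns papers_count_1990, papers_count_1991, and so on, plus the overall total.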
