Increase speed in finding "Increasing dice roll sequences" - r

The problem was: how many sequences of 9 dice rolls are increasing, in the sense of never decreasing (e.g. 223444556)? OK, I know the answer is given by choose(14,9), but I just wanted to play around with dplyr.
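As a sanity check on the closed form: a non-decreasing sequence of k rolls of an n-sided die is a multiset of size k, so there are choose(n + k - 1, k) of them. The sketch below evaluates the formula for 9 rolls and compares it against brute force for a smaller case (4 rolls), which is cheap to enumerate. The variable names are illustrative only.

```r
# Closed form (stars and bars) for 9 rolls of a 6-sided die
n_sides <- 6
n_rolls <- 9
formula_count <- choose(n_sides + n_rolls - 1, n_rolls)  # same as choose(14, 9)

# Brute-force check on a smaller case: all 6^4 sequences of 4 rolls
small_grid <- expand.grid(replicate(4, 1:6, simplify = FALSE))
brute_count <- sum(apply(small_grid, 1, function(r) all(diff(r) >= 0)))

formula_count                       # 2002
brute_count == choose(6 + 4 - 1, 4) # TRUE (both are 126)
```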
A fast but not elegant way:
library(tidyverse)
expand.grid(data.frame(matrix(rep(1:6,9),ncol=9))) %>%
filter(X1<=X2 & X2<=X3 &X3<=X4 &X4<=X5 &X5<=X6 &X6<=X7 &X7<=X8 &X8<=X9) %>% tally
I tried the following two alternatives (without explicit reference to variable names), but they're both very slow (and memory consuming). Can you help me optimize my code using tidyverse?
expand_grid(!!!data.frame(matrix(rep(1:6,9),ncol=9))) %>%
rownames_to_column(var = "grp") %>%
mutate(grp = as.numeric(grp)) %>%
pivot_longer (cols=!grp) %>%
group_by(grp) %>%
mutate(prev = lag(value)) %>%
filter(!is.na(prev)) %>%
transmute(dif=value-prev) %>%
summarize(res = all(dif >=0)) %>%
group_by(res) %>% summarize(n=n())
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
rowwise() %>%
mutate(asc = all(diff(c_across(cols = everything())) >= 0)) %>%
filter(asc==TRUE) %>% tally
This is also slow, but not memory consuming.
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
filter(pmap_lgl(.,~{
if(all(list(...) %>% flatten_dbl() %>% diff() >=0)) return(TRUE) else return(FALSE)
})) %>% tally

Here is a tidyverse approach that relies on purrr:
expand.grid(replicate(9, 1:6, FALSE)) %>%
filter(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`)) %>%
tally()
This is somewhat difficult to do in the context of pipes. We need to compare columns n and n + 1 and reduce those comparisons down to a single logical vector, and then use that vector to filter the original dataset.
And if you were only interested in the tally, we could sum the logical vector.
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`))}
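To make the map2/reduce step easier to follow, here is the same idea unrolled on a smaller grid (3 rolls instead of 9). This is only an illustrative sketch; the intermediate names pairwise and keep are not part of the answer above.

```r
library(purrr)

# All 6^3 sequences of 3 rolls
grid <- expand.grid(replicate(3, 1:6, simplify = FALSE))

# Compare each column with its right-hand neighbour:
# column 1 vs 2, column 2 vs 3 -> a list of 2 logical vectors
pairwise <- map2(grid[, -ncol(grid)], grid[, -1], ~ .x <= .y)

# Collapse the list with `&`: TRUE only where every comparison holds
keep <- reduce(pairwise, `&`)

sum(keep)  # 56, which equals choose(6 + 3 - 1, 3)
```

Because every step works on whole columns, the comparison is vectorised over all rows at once, which is why this is much faster than the rowwise() variants.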
Finally, if you don't mind one more dependency, matrixStats can parallel what you were doing in one of your approaches:
library(matrixStats)
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(rowAlls(rowDiffs(as.matrix(.)) >= 0L))}

Related

How do you efficiently group by multiple columns in dplyr

With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write it out if you can just parse their names.
across and curly-curly are the answer (even though grouping by all of your columns rarely makes sense):
cols = colnames(df)
df %>%
group_by(across({{cols}})) %>%
summarise(count = n())
You can use across with any of the tidy selectors. For example, if you want all columns:
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or if you want to pass a character vector of column names:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
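If the column names arrive as a function argument, the across(all_of(...)) pattern can be wrapped up as below. This is a minimal sketch: count_by is a hypothetical helper name, not a dplyr function, and .groups = "drop" is added only to return an ungrouped result.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 1, 3, 1, 4, 1, 5), b = c(2, 3, 4, 1, 2, 3, 4, 5))

# Hypothetical helper: count rows per combination of the named columns
count_by <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarise(count = n(), .groups = "drop")
}

res <- count_by(df, c("a", "b"))
res
```

Using all_of() makes it explicit that cols is an external character vector, and it errors if any of the names is missing from the data.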

Calculate cumulative sum over time stamp with dplyr

I'm trying to calculate cumulative sums over a time span. Is there a way to calculate this within a step? Any package recommendations?
activate_2019 <- activate_rate %>%
filter(
grepl("2019", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2020 <- activate_rate %>%
filter(
grepl("2019|2020", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2021 <- activate_rate %>%
filter(
grepl("2019|2020|2021", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
Here is one method with the tidyverse:
Extract the unique years from the 'join_day' column.
Loop over those and slice the rows of activate_rate up to the last row that matches the looped year in 'join_day'.
Summarise by taking the mean of 'if_activate'.
Bind the outputs column-wise, using the _dfc suffix in map (i.e. map_dfc).
library(stringr)
library(dplyr)
library(purrr)
un1 <- str_extract_all(activate_rate$join_day, "\\d{4}") %>%
unlist %>%
unique %>%
as.integer %>%
sort
map_dfc(un1, ~ activate_rate %>%
arrange(as.Date(join_day)) %>%
slice(seq(max(grep(as.character(.x), join_day)))) %>%
summarise(!!str_c("proportion", .x) := mean(if_activate)))
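If I understand correctly, this should do the trick (floor_date comes from lubridate; note that this gives the proportion within each year rather than a running total):
library(dplyr)
library(lubridate)
activate_rate %>%
mutate(year = floor_date(as.Date(join_day), unit = "year")) %>%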
group_by(year) %>%
summarise(proportion = sum(if_activate) / n())
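For the cumulative version the question asks about (2019, then 2019 to 2020, then 2019 to 2021), one possible single-pipeline sketch is to summarise per year first and then take running totals with cumsum(). The activate_rate data below is made up for illustration, and the sketch assumes join_day starts with a four-digit year.

```r
library(dplyr)

# Made-up stand-in for activate_rate (assumption: join_day begins with the year)
activate_rate <- data.frame(
  join_day    = c("2019-03-01", "2019-07-15", "2020-01-10", "2020-05-20", "2021-02-02"),
  if_activate = c(1, 0, 1, 1, 0)
)

cumulative <- activate_rate %>%
  mutate(year = as.integer(substr(join_day, 1, 4))) %>%
  group_by(year) %>%
  summarise(activated = sum(if_activate), total = n()) %>%
  arrange(year) %>%
  # running activated / running total = proportion over all years so far
  mutate(proportion = cumsum(activated) / cumsum(total))

cumulative$proportion  # 0.50 0.75 0.60
```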

How to reuse parts of long chain of pipe operators in R?

I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...
Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
Starting a piping sequence with a dot . creates what magrittr calls a functional sequence. In effect this is close to wrapping the steps in a function. See ?magrittr::functions and try magrittr::functions(subchain).
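A smaller sketch of the same mechanism, using only magrittr and base functions so it runs without any dataset. The name top3_sum is made up for this example.

```r
library(magrittr)

# A functional sequence: a reusable chain that starts with the dot placeholder
top3_sum <- . %>%
  sort(decreasing = TRUE) %>%
  head(3) %>%
  sum()

# It is called like any ordinary function
top3_sum(c(5, 1, 9, 3, 7))   # 21 (9 + 7 + 5)

# functions() lists the individual steps stored in the sequence
length(functions(top3_sum))  # 3
```

The result can also be dropped into the middle of a longer pipe, e.g. x %>% abs() %>% top3_sum().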

summing up the first 5 elements of a list

I have a data frame that contains a column with varying numbers of integer values. I need to take the first five of these values and sum them up. I found a way to do it for one, but can't seem to generalize it to loop through all:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
Or using dplyr and purrr:
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise:
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
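A quick check of the base R idea on made-up data, including rows with fewer than five values. The results data frame and the helper name first_n_sum are assumptions for illustration; vapply is used instead of sapply only to guarantee a numeric result.

```r
# Made-up stand-in for the results data frame
results <- data.frame(
  occupied = c("1,2,3,4,5,6,7", "10,20", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum of the first n comma-separated values in each string
first_n_sum <- function(x, n = 5) {
  vapply(
    strsplit(x, ",", fixed = TRUE),
    function(v) sum(as.numeric(head(v, n))),
    numeric(1)
  )
}

results$total_sum <- first_n_sum(results$occupied)
results$total_sum  # 15 30 3
```

head() simply returns everything when a row holds fewer than five values, so short rows need no special handling.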
We can use separate_rows to split the 'occupied' column and expand the rows, then group by row number and take the sum of the first five elements in each group:
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)

get sequence of group in R

I have already done what I need, but I am sure there is a better way to do it:
library(tidyverse)
library(schrute)
office <- schrute::theoffice
top_3_lines_per_episode <- office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>% ungroup()
epi_num<-top_3_lines_per_episode %>%
select(episode_name) %>%
unique() %>%
mutate(episode_num=row_number())
top_3_lines_per_episode %>%
inner_join(epi_num)
I want to generate the column episode_num, which the code above does. Is there a simpler way to do that? I have looked at group_indices on SO, but I think that's deprecated. I need a better way, preferably in the tidyverse.
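As far as I know group_indices still does the job here, although since dplyr 1.0.0 passing the grouping columns through ... is deprecated, and cur_group_id() is the suggested replacement inside a grouped mutate().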
library(dplyr)
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
mutate(episode_num = group_indices(., season,episode,episode_name,imdb_rating))
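On dplyr 1.0.0 or later, the same numbering can be sketched with cur_group_id(), which returns the index of the current group inside a grouped mutate(). The toy data frame below stands in for the schrute data so that the example is self-contained; its values are made up.

```r
library(dplyr)

# Toy stand-in for the per-episode character counts
toy <- data.frame(
  season    = c(1, 1, 1, 1, 2, 2),
  episode   = c(1, 1, 2, 2, 1, 1),
  character = c("Jim", "Pam", "Jim", "Dwight", "Pam", "Michael")
)

with_id <- toy %>%
  group_by(season, episode) %>%
  mutate(episode_num = cur_group_id()) %>%  # 1 for the first group, 2 for the next, ...
  ungroup()

with_id$episode_num  # 1 1 2 2 3 3
```

Groups are numbered in the sorted order of the grouping columns, so the ids follow season and then episode.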
Another option is to combine the columns with unite and then match to get episode_num.
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
tidyr::unite(temp, season,episode,episode_name,imdb_rating, remove = FALSE) %>%
mutate(episode_num = match(temp, unique(temp))) %>%
select(-temp)
