Increase speed in finding "Increasing dice roll sequences"

The problem was: how many sequences of 9 dice rolls are increasing, i.e. non-decreasing (e.g. 223444556)? OK, I know the answer is given by choose(14,9), but I just wanted to play around with dplyr.
A fast but not elegant way:
library(tidyverse)
expand.grid(data.frame(matrix(rep(1:6,9),ncol=9))) %>%
filter(X1<=X2 & X2<=X3 &X3<=X4 &X4<=X5 &X5<=X6 &X6<=X7 &X7<=X8 &X8<=X9) %>% tally
I tried the following two alternatives (without explicit reference to variable names), but they're both very slow (and memory consuming). Can you help me optimize my code using tidyverse?
expand_grid(!!!data.frame(matrix(rep(1:6,9),ncol=9))) %>%
rownames_to_column(var = "grp") %>%
mutate(grp = as.numeric(grp)) %>%
pivot_longer (cols=!grp) %>%
group_by(grp) %>%
mutate(prev = lag(value)) %>%
filter(!is.na(prev)) %>%
transmute(dif=value-prev) %>%
summarize(res = all(dif >=0)) %>%
group_by(res) %>% summarize(n=n())
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
rowwise() %>%
mutate(asc = all(diff(c_across(cols = everything())) >= 0)) %>%
filter(asc==TRUE) %>% tally
This is also slow, but not memory consuming.
9 %>%
rerun(1:6) %>% crossing(!!!.,.name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
filter(pmap_lgl(.,~{
if(all(list(...) %>% flatten_dbl() %>% diff() >=0)) return(TRUE) else return(FALSE)
})) %>% tally

Here is a tidyverse approach that relies on purrr:
expand.grid(replicate(9, 1:6, FALSE)) %>%
filter(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`)) %>%
tally()
This is somewhat difficult to do within a pipe: we need to compare each column n with column n + 1, reduce those comparisons down to a single logical vector, and then use that vector to filter the original dataset.
And if you are only interested in the tally, you can sum the logical vector directly.
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`))}
Finally, if you don't mind one more dependency, matrixStats can parallel what you were doing with one of your approaches:
library(matrixStats)
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(rowAlls(rowDiffs(as.matrix(.)) >= 0L))}
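As a sanity check on the closed form choose(14, 9) quoted in the question, the count can also be obtained without enumerating all 6^9 rows. The sketch below is a small base R dynamic-programming count; count_nondecreasing is a hypothetical helper name (not from any package), and it assumes at least one roll.

```r
# Hypothetical helper: count non-decreasing sequences of `n_rolls` rolls
# of a die with `n_faces` faces. Assumes n_rolls >= 1.
count_nondecreasing <- function(n_rolls, n_faces = 6) {
  # ways[k] = number of non-decreasing sequences of the current length
  # that end in face k; for length 1 there is exactly one per face
  ways <- rep(1, n_faces)
  for (i in seq_len(n_rolls - 1)) {
    # a sequence ending in face k extends any sequence ending in a face <= k
    ways <- cumsum(ways)
  }
  sum(ways)
}

# Compare the count with the closed form from the question
count_nondecreasing(9) == choose(14, 9)
```

Because each step is a cumsum over six values, this should run essentially instantly, which makes it a convenient reference value when timing the enumeration-based pipelines.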

Related

How do you efficiently group by multiple columns in dplyr

With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write it out if you can just parse their names.

across and curly-curly is the answer (even though it doesn't make sense to group_by using all your columns)
cols = colnames(df)
df %>%
group_by(across({{cols}})) %>%
summarise(count = n())

You can use across with any of the tidy selectors. For example if you want all columns
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or, if you want to pass a character vector of column names:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
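If the only goal is a count per group, count() can be given the same across() selection, which may save the separate group_by() and summarise() calls. A minimal sketch, assuming dplyr 1.0.0 or later and reusing the df from the question:

```r
library(dplyr)

# Same toy data frame as in the question
df <- data.frame(a = c(1, 2, 1, 3, 1, 4, 1, 5), b = c(2, 3, 4, 1, 2, 3, 4, 5))

# Column names held in a character vector
cols <- c("a", "b")

# count() groups by the selected columns and tallies rows in one step;
# `name` sets the name of the resulting count column
counts <- df %>%
  count(across(all_of(cols)), name = "count")

counts
```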

Calculate cumulative sum over time stamp with dplyr

I'm trying to calculate cumulative sums over a time span. Is there a way to calculate this within a step? Any package recommendations?
activate_2019 <- activate_rate %>%
filter(
grepl("2019", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2020 <- activate_rate %>%
filter(
grepl("2019|2020", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)
activate_2021 <- activate_rate %>%
filter(
grepl("2019|2020|2021", join_day)
) %>%
summarize(
proportion = sum(if_activate) /n()
)

Here is one method with tidyverse:
Extract the unique years from the 'join_day' column.
Loop over those years, slicing the rows of activate_rate up to the last row whose 'join_day' matches the current year.
Summarise by taking the mean of 'if_activate'.
Bind the outputs column-wise, i.e. use the _dfc variant of map.
library(stringr)
library(dplyr)
library(purrr)
un1 <- str_extract_all(activate_rate$join_day, "\\d{4}") %>%
unlist %>%
unique %>%
as.integer %>%
sort
map_dfc(un1, ~ activate_rate %>%
arrange(as.Date(join_day)) %>%
slice(seq(max(grep(as.character(.x), join_day)))) %>%
summarise(!!str_c("proportion", .x) := mean(if_activate)))

If I understand correctly, this should do the trick:
library(dplyr)
library(lubridate)
activate_rate %>%
mutate(year = floor_date(as.Date(join_day), unit = "year")) %>%
group_by(year) %>%
summarise(proportion = sum(if_activate) / n())
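The grouped summary above yields one proportion per year, whereas the question's three blocks each pool all years up to and including the current one. A cumulative variant could be sketched as follows. The toy activate_rate data here is invented for illustration, and the code assumes join_day is a Date or an ISO-formatted date string.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `activate_rate`
activate_rate <- data.frame(
  join_day = as.Date(c("2019-03-01", "2019-07-15", "2020-01-10",
                       "2020-11-30", "2021-05-05")),
  if_activate = c(1, 0, 1, 1, 0)
)

cumulative <- activate_rate %>%
  # extract the calendar year of each join date
  mutate(year = as.integer(format(as.Date(join_day), "%Y"))) %>%
  group_by(year) %>%
  # per-year numerator and denominator
  summarise(activated = sum(if_activate), users = n()) %>%
  arrange(year) %>%
  # running totals give the proportion over all years up to each year
  mutate(proportion = cumsum(activated) / cumsum(users))

cumulative
```

Keeping the numerator and denominator separate until the last step matters: a running mean of the per-year proportions would weight every year equally regardless of how many users joined in it.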

How to reuse parts of long chain of pipe operators in R?

I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...

Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
This uses a dot (.) as the start of the piping sequence. In effect it is close to wrapping the steps in a function; magrittr calls the result a functional sequence. See ?functions and try magrittr::functions(subchain).
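To make the "close to function wrapping" point concrete, here is a tiny sketch on a plain numeric vector. The name clean_mean is made up for illustration; the only package assumed is magrittr.

```r
library(magrittr)

# A functional sequence: a pipeline that starts with the `.` placeholder
clean_mean <- . %>%
  na.omit() %>%
  mean()

# The result is an ordinary function of one argument
is.function(clean_mean)

# It can be called directly or used as a pipe step; both give the same value
direct <- clean_mean(c(1, 2, NA, 3))
piped  <- c(1, 2, NA, 3) %>% clean_mean()

# functions() lists the individual steps stored in the sequence
steps <- functions(clean_mean)
length(steps)
```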

summing up the first 5 elements of a list

I have a data frame that contains a column with varying numbers of integer values. I need to take the first five of these values and sum them up. I found a way to do it for one, but can't seem to generalize it to loop through all:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()

In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
Or using dplyr and purrr
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise :
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
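For a quick check of the base R idea on data you can inspect by hand, here is a small sketch. The results data frame below is a made-up stand-in for the real one, and vapply is swapped in for sapply only so that the return type is guaranteed to be a numeric vector.

```r
# Hypothetical toy data: comma-separated integers stored as strings
results <- data.frame(
  occupied = c("1,2,3,4,5,6,7", "10,20", "3,3,3,3,3,3"),
  stringsAsFactors = FALSE
)

# Split each string, keep at most the first five pieces, convert and sum
results$total_sum <- vapply(
  strsplit(results$occupied, ","),
  function(x) sum(as.numeric(head(x, 5))),
  numeric(1)
)

results
```

Rows with fewer than five values are handled naturally, since head(x, 5) simply returns whatever is available.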

We can use separate_rows to split the 'occupied' column and expand the rows, then do a group by row number and get the sum of the first five elements
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)

get sequence of group in R

So I have already done what I need but I am sure that there is a better way to do that
library(tidyverse)
library(schrute)
office <- schrute::theoffice
top_3_lines_per_episode <- office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>% ungroup()
epi_num<-top_3_lines_per_episode %>%
select(episode_name) %>%
unique() %>%
mutate(episode_num=row_number())
top_3_lines_per_episode %>%
inner_join(epi_num)
I want to generate the episode_num column, which the epi_num join above already does, but I would like a simpler way. I have looked at group_indices on SO, but I think that's deprecated. I need a better way, preferably in tidyverse.

I am not aware of group_indices being deprecated; in fact, it seems to be a perfect option here.
library(dplyr)
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
mutate(episode_num = group_indices(., season,episode,episode_name,imdb_rating))
Another option is to combine the columns with unite and then match to get episode_num.
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
tidyr::unite(temp, season,episode,episode_name,imdb_rating, remove = FALSE) %>%
mutate(episode_num = match(temp, unique(temp))) %>%
select(-temp)
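On recent dplyr versions (1.0.0 and later, as far as I know) passing the grouping columns to group_indices() through ... triggers a deprecation message, and cur_group_id() inside a grouped mutate() is the suggested replacement. A minimal sketch on invented toy data (the toy data frame is not the schrute dataset):

```r
library(dplyr)

# Hypothetical toy data with one row per (season, episode, character)
toy <- data.frame(
  season    = c(1, 1, 1, 2, 2),
  episode   = c(1, 1, 2, 1, 1),
  character = c("A", "B", "A", "A", "C")
)

toy_num <- toy %>%
  group_by(season, episode) %>%
  # cur_group_id() returns the index of the current group
  mutate(episode_num = cur_group_id()) %>%
  ungroup()

toy_num
```

Group ids follow the sorted order of the grouping keys rather than order of appearance, so this matches the row_number() approach from the question only when the data is already ordered by season and episode.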
