I have a data frame with a column that holds a varying number of comma-separated integer values per row. I need to take the first five of these values and sum them up. I found a way to do it for one element, but can't seem to generalize it to loop through all of them:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And here is what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
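A minimal sketch of the kind of data frame assumed in the answers below (hypothetical values):
results <- data.frame(
  occupied = c("1,2,3,4,5,6", "10,20,30,40,50", "7,8,9"),
  stringsAsFactors = FALSE
)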
In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
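With the hypothetical results above, this gives one sum per row (15, 150 and 24), which can be assigned back as a new column:
results$total_sum <- sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))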
Or using dplyr and purrr:
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise():
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
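Inside rowwise(), each row is its own group, so strsplit(occupied, ",") sees a single string and returns a one-element list; [[1]] extracts that row's split vector before head() and sum() are applied.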
We can use separate_rows to split the 'occupied' column and expand the rows, then group by the row number and get the sum of the first five elements:
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)
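With the hypothetical results sketched earlier, each of these approaches yields a total_sum column containing 15, 150 and 24.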
With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns, all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write them all out explicitly when you could just pass their names.
across() and curly-curly are the answer (even though it doesn't make much sense to group_by() using all of your columns):
cols = colnames(df)
df %>%
group_by(across({{ cols }})) %>%
summarise(count = n())
You can use across() with any of the tidyselect helpers. For example, if you want all columns:
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or if you want a specific set of columns by name:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
The problem was: how many sequences of 9 dice rolls are increasing, i.e. non-decreasing (e.g. 223444556)? OK, I know the answer is given by choose(14, 9), but I just wanted to play around with dplyr.
A fast but not elegant way:
library(tidyverse)
expand.grid(data.frame(matrix(rep(1:6,9),ncol=9))) %>%
filter(X1 <= X2 & X2 <= X3 & X3 <= X4 & X4 <= X5 & X5 <= X6 & X6 <= X7 & X7 <= X8 & X8 <= X9) %>% tally()
I tried the following two alternatives (without explicit reference to variable names), but they're both very slow (and memory consuming). Can you help me optimize my code using tidyverse?
expand_grid(!!!data.frame(matrix(rep(1:6,9),ncol=9))) %>%
rownames_to_column(var = "grp") %>%
mutate(grp = as.numeric(grp)) %>%
pivot_longer(cols = !grp) %>%
group_by(grp) %>%
mutate(prev = lag(value)) %>%
filter(!is.na(prev)) %>%
transmute(dif=value-prev) %>%
summarize(res = all(dif >=0)) %>%
group_by(res) %>% summarize(n=n())
9 %>%
rerun(1:6) %>% crossing(!!!., .name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
rowwise() %>%
mutate(asc = all(diff(c_across(cols = everything())) >= 0)) %>%
filter(asc==TRUE) %>% tally
This is also slow, but not memory consuming.
9 %>%
rerun(1:6) %>% crossing(!!!., .name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
filter(pmap_lgl(.,~{
if(all(list(...) %>% flatten_dbl() %>% diff() >=0)) return(TRUE) else return(FALSE)
})) %>% tally
Here is a tidyverse approach that relies on purrr:
expand.grid(replicate(9, 1:6, FALSE)) %>%
filter(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`)) %>%
tally()
This is somewhat difficult to do in the context of pipes. We need to both compare columns n and n + 1 and reduce the comparisons down to a single logical vector. Then we need to filter the original dataset.
And if you are only interested in the tally, you could sum the logical vector instead:
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`))}
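To see what the map2()/reduce() step is doing, here is the same idea on a tiny, hypothetical 3-column frame:
library(purrr)
m <- data.frame(X1 = c(1, 3, 2), X2 = c(2, 2, 4), X3 = c(5, 4, 4))
# pairwise "column n <= column n + 1" comparisons, one logical vector per pair of columns
cmp <- map2(m[, -length(m)], m[, -1], ~ .x <= .y)
# reduce with `&` to one logical per row: TRUE only where every step is non-decreasing
reduce(cmp, `&`)
# TRUE FALSE TRUE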
Finally, if you don't mind one more dependency, matrixStats can parallel what you were doing with one of your approaches:
library(matrixStats)
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(rowAlls(rowDiffs(as.matrix(.)) >= 0L))}
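Whichever variant you use, the count can be checked against the combinatorial answer from the question:
choose(14, 9)
# [1] 2002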
I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...
Similar in syntax to the desired pseudocode:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
The trick is using a dot (.) as the start of the piping sequence. This is in effect close to wrapping it in a function, but the result is called a magrittr functional sequence. See ?magrittr::functions and try magrittr::functions(subchain) to inspect the individual steps.
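Because subchain is itself a one-argument function, it can also be called directly instead of being piped into:
subchain(group_by(starwars, gender))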
So I have already done what I need, but I am sure there is a better way to do it:
library(tidyverse)
library(schrute)
office <- schrute::theoffice
top_3_lines_per_episode <- office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>% ungroup()
epi_num<-top_3_lines_per_episode %>%
select(episode_name) %>%
unique() %>%
mutate(episode_num=row_number())
top_3_lines_per_episode %>%
inner_join(epi_num)
I want to generate an episode-number column, which the code above does. Is there a simpler way to do that? I have looked at group_indices from SO, but I think that's deprecated. I need a better way, preferably in the tidyverse.
I am not aware of group_indices being deprecated; in fact, it seems to be a perfect option here:
library(dplyr)
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
mutate(episode_num = group_indices(., season,episode,episode_name,imdb_rating))
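If your dplyr version warns that supplying variables to group_indices() is deprecated (dplyr >= 1.0.0), a sketch of the same idea is to use cur_group_id() while the data is still grouped:
office %>%
group_by(season, episode, episode_name, imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
mutate(episode_num = cur_group_id()) %>%
ungroup()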
Another option is to combine the columns with unite and then use match to get episode_num:
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
tidyr::unite(temp, season,episode,episode_name,imdb_rating, remove = FALSE) %>%
mutate(episode_num = match(temp, unique(temp))) %>%
select(-temp)
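Note that match(temp, unique(temp)) numbers episodes in order of first appearance, whereas group_indices() numbers groups in their sorted order, so the two ids can differ even though both uniquely identify each episode.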
I would like to create a set of columns based on the paper counts for each year, which means filtering on multiple conditions inside a dplyr summarise:
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count'. I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc.), one for each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
Please, does anybody have any suggestions?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)
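spread() still works but has been superseded by pivot_wider() in tidyr 1.0+. A sketch of the equivalent pipeline, assuming the same data with Keywords and year columns and unnest_tokens() from tidytext as in the question:
library(dplyr)
library(tidyr)
library(tidytext)
data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year, 1990, 2017)) %>%
count(year, word, name = "papers_count") %>%
pivot_wider(names_from = year, values_from = papers_count,
            values_fill = 0, names_prefix = "papers_count_")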