I have a data frame with a column that holds a varying number of comma-separated integer values per row. I need to take the first five of these values and sum them up. I found a way to do it for one element, but can't seem to generalize it to loop through all of them:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And here is what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
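A minimal sketch of the kind of data frame assumed in the answers below (hypothetical values):
results <- data.frame(
  occupied = c("1,2,3,4,5,6", "10,20,30,40,50", "7,8,9"),
  stringsAsFactors = FALSE
)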
In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
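With the hypothetical results above, this gives one sum per row (15, 150 and 24), which can be assigned back as a new column:
results$total_sum <- sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))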
Or using dplyr and purrr:
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise():
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
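Inside rowwise(), each row is its own group, so strsplit(occupied, ",") sees a single string and returns a one-element list; [[1]] extracts that row's split vector before head() and sum() are applied.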
We can use separate_rows to split the 'occupied' column and expand the rows, then group by the row number and get the sum of the first five elements:
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)
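With the hypothetical results sketched earlier, each of these approaches yields a total_sum column containing 15, 150 and 24.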
With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns, all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write them all out explicitly when you could just pass their names.
across() and curly-curly are the answer (even though it doesn't make much sense to group_by() using all of your columns):
cols = colnames(df)
df %>%
group_by(across({{ cols }})) %>%
summarise(count = n())
You can use across() with any of the tidyselect helpers. For example, if you want all columns:
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or if you want a specific set of columns by name:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
The problem was: how many sequences of 9 dice rolls are increasing, i.e. non-decreasing (e.g. 223444556)? OK, I know the answer is given by choose(14, 9), but I just wanted to play around with dplyr.
A fast but not elegant way:
library(tidyverse)
expand.grid(data.frame(matrix(rep(1:6,9),ncol=9))) %>%
filter(X1 <= X2 & X2 <= X3 & X3 <= X4 & X4 <= X5 & X5 <= X6 & X6 <= X7 & X7 <= X8 & X8 <= X9) %>% tally()
I tried the following two alternatives (without explicit reference to variable names), but they're both very slow (and memory consuming). Can you help me optimize my code using tidyverse?
expand_grid(!!!data.frame(matrix(rep(1:6,9),ncol=9))) %>%
rownames_to_column(var = "grp") %>%
mutate(grp = as.numeric(grp)) %>%
pivot_longer(cols = !grp) %>%
group_by(grp) %>%
mutate(prev = lag(value)) %>%
filter(!is.na(prev)) %>%
transmute(dif=value-prev) %>%
summarize(res = all(dif >=0)) %>%
group_by(res) %>% summarize(n=n())
9 %>%
rerun(1:6) %>% crossing(!!!., .name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
rowwise() %>%
mutate(asc = all(diff(c_across(cols = everything())) >= 0)) %>%
filter(asc==TRUE) %>% tally
This is also slow, but not memory consuming.
9 %>%
rerun(1:6) %>% crossing(!!!., .name_repair = "minimal") %>%
set_names(glue::glue('c{1:ncol(.)}')) %>%
filter(pmap_lgl(.,~{
if(all(list(...) %>% flatten_dbl() %>% diff() >=0)) return(TRUE) else return(FALSE)
})) %>% tally
Here is a tidyverse approach that relies on purrr:
expand.grid(replicate(9, 1:6, FALSE)) %>%
filter(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`)) %>%
tally()
This is somewhat difficult to do in the context of pipes. We need to both compare columns n and n + 1 and reduce the comparisons down to a single logical vector. Then we need to filter the original dataset.
And if you are only interested in the tally, you could sum the logical vector instead:
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(reduce(map2(.[, -length(.)], .[, -1], ~ .x <= .y), `&`))}
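To see what the map2()/reduce() step is doing, here is the same idea on a tiny, hypothetical 3-column frame:
library(purrr)
m <- data.frame(X1 = c(1, 3, 2), X2 = c(2, 2, 4), X3 = c(5, 4, 4))
# pairwise "column n <= column n + 1" comparisons, one logical vector per pair of columns
cmp <- map2(m[, -length(m)], m[, -1], ~ .x <= .y)
# reduce with `&` to one logical per row: TRUE only where every step is non-decreasing
reduce(cmp, `&`)
# TRUE FALSE TRUE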
Finally, if you don't mind one more dependency, matrixStats can parallel what you were doing with one of your approaches:
library(matrixStats)
expand.grid(replicate(9, 1:6, FALSE)) %>%
{sum(rowAlls(rowDiffs(as.matrix(.)) >= 0L))}
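Whichever variant you use, the count can be checked against the combinatorial answer from the question:
choose(14, 9)
# [1] 2002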
I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...
Similar in syntax to the desired pseudocode:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
The trick is using a dot (.) as the start of the piping sequence. This is in effect close to wrapping it in a function, but the result is called a magrittr functional sequence. See ?magrittr::functions and try magrittr::functions(subchain) to inspect the individual steps.
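Because subchain is itself a one-argument function, it can also be called directly instead of being piped into:
subchain(group_by(starwars, gender))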
So I have already done what I need, but I am sure there is a better way to do it:
library(tidyverse)
library(schrute)
office <- schrute::theoffice
top_3_lines_per_episode <- office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>% ungroup()
epi_num<-top_3_lines_per_episode %>%
select(episode_name) %>%
unique() %>%
mutate(episode_num=row_number())
top_3_lines_per_episode %>%
inner_join(epi_num)
I want to generate an episode-number column, which the code above does. Is there a simpler way to do that? I have looked at group_indices from SO, but I think that's deprecated. I need a better way, preferably in the tidyverse.
I am not aware of group_indices being deprecated; in fact, it seems to be a perfect option here:
library(dplyr)
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
mutate(episode_num = group_indices(., season,episode,episode_name,imdb_rating))
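If your dplyr version warns that supplying variables to group_indices() is deprecated (dplyr >= 1.0.0), a sketch of the same idea is to use cur_group_id() while the data is still grouped:
office %>%
group_by(season, episode, episode_name, imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
mutate(episode_num = cur_group_id()) %>%
ungroup()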
Another option is to combine the columns with unite and then use match to get episode_num:
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup %>%
tidyr::unite(temp, season,episode,episode_name,imdb_rating, remove = FALSE) %>%
mutate(episode_num = match(temp, unique(temp))) %>%
select(-temp)
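Note that match(temp, unique(temp)) numbers episodes in order of first appearance, whereas group_indices() numbers groups in their sorted order, so the two ids can differ even though both uniquely identify each episode.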
I would like to create a set of columns based on the paper counts for each year, which means filtering on multiple conditions inside a dplyr summarise:
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count'. I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc.), one for each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
Please, does anybody have any suggestions?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)
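spread() still works but has been superseded by pivot_wider() in tidyr 1.0+. A sketch of the equivalent pipeline, assuming the same data with Keywords and year columns and unnest_tokens() from tidytext as in the question:
library(dplyr)
library(tidyr)
library(tidytext)
data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year, 1990, 2017)) %>%
count(year, word, name = "papers_count") %>%
pivot_wider(names_from = year, values_from = papers_count,
            values_fill = 0, names_prefix = "papers_count_")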