I am trying to figure out a dplyr specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3
It's easier if you do it in one go. Your approach is not 'wrong', it is just that seq_len needs one integer, and you are giving a vector (n), so seq_len corrects it by using the first value.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
We could use rowid directly if the intention is to create a sequence and group size is just intermediate column
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using a column after it is created is that it is no longer a single row as showed in #Maëls post. If we need to do that, use first as seq_len is not vectorized and here it is not needed as well
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
A base R option using ave (work in a similar way as group_by in dplyr)
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B
I have a dataset with consistently named columns and I would like to take the average of the columns by their group e.g.,
library(dplyr)
library(purrr)
library(glue)
df <- tibble(`1_x_blind` = 1:3,
`1_y_blind` = 7:9,
`2_x_blind` = 4:6,
`2_y_blind` = 5:7)
df %>%
mutate(`1_overall_test` = rowMeans(select(., matches(glue("^1_.*_blind$")))))
#> # A tibble: 3 x 5
#> `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_test`
#> <int> <int> <int> <int> <dbl>
#> 1 1 7 4 5 4
#> 2 2 8 5 6 5
#> 3 3 9 6 7 6
This method works fine. The next step for me would be to scale it so that I can do the entire series of columns e.g., something like
df %>%
mutate(overall_blind = map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$"))))))
#> Error: Problem with `mutate()` input `overall_blind`.
#> x no applicable method for 'select' applied to an object of class "c('integer', 'numeric')"
#> ℹ Input `overall_blind` is `map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$")))))`.
I think the problem here is that select is confusing the . operator. Is it possible to map over a series of column names in this way? Ideally I'd like the column names to follow the {.x}_overall pattern as in the example above.
Update Here's a cleaner way that doesn't require rename or bind_cols:
map_dfc(1:2,
function(x) df %>%
select(matches(glue("^{x}_.*_blind$"))) %>%
mutate("{x}_overall_blind" := rowMeans(.))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
<int> <int> <dbl> <int> <int> <dbl>
1 1 7 4 4 5 4.5
2 2 8 5 5 6 5.5
3 3 9 6 6 7 6.5
Previous
Here's a map approach.
The challenge is mutating two new columns based on separate groups of existing columns. Easiest just to do that in its own map_dfc() and then bind that to the existing df.
df %>%
bind_cols(
map_dfc(1:2, ~rowMeans(df %>% select(matches(glue("^{.x}_.*_blind$"))))) %>%
rename_with(~paste0(str_replace(., "\\...", ""), "_overall_blind"))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
<int> <int> <int> <int> <dbl> <dbl>
1 1 7 4 5 4 4.5
2 2 8 5 6 5 5.5
3 3 9 6 7 6 6.5
And here's a way to get your rowwise column-group averages using pivots, which avoids regex and mutate/map operations:
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
separate(name, c("grp"), sep = "_", extra = "drop") %>%
group_by(row, grp) %>%
summarise(overall_blind = mean(value)) %>%
ungroup() %>%
pivot_wider(id_cols = row, names_from = grp, values_from = overall_blind,
names_glue = "{grp}_{.value}") %>%
bind_cols(df)
# A tibble: 3 x 6
`1_overall_blind` `2_overall_blind` `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind`
<dbl> <dbl> <int> <int> <int> <int>
1 4 4.5 1 7 4 5
2 5 5.5 2 8 5 6
3 6 6.5 3 9 6 7
Here is one solution:
map_dfc(1:2, function(x) {
select(df, matches(glue("^{x}_.*_blind$"))) %>%
mutate(overall_blind = rowMeans(select(., matches(glue("^{x}_.*_blind$"))))) %>%
# General but not perfect names
# set_names(paste0(x, "_", names(.)))
# Hand-tailored names
set_names(c(names(.)[1], names(.)[2], paste0(x, "_", names(.)[3])))
})
#> # A tibble: 3 x 6
#> `1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
#> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 7 4 4 5 4.5
#> 2 2 8 5 5 6 5.5
#> 3 3 9 6 6 7 6.5
I added two possibilities of naming the overall_blind column for each group, one more general but not perfect names (it duplicates the 1_ or 2_ for the data columns), and another that gives the names that you want but require knowing in advance the number of columns per group.
We can use split.default to split the data into list of datasets based on the column name pattern, then get the rowMeans and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
df %>%
split.default(readr::parse_number(names(.))) %>%
map_dfc(rowMeans) %>%
set_names(str_c(names(.), "_overall_blind")) %>%
bind_cols(df, .)
# A tibble: 3 x 6
# `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
# <int> <int> <int> <int> <dbl> <dbl>
#1 1 7 4 5 4 4.5
#2 2 8 5 6 5 5.5
#3 3 9 6 7 6 6.5
I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
If you notice, there are multiple values for the same id. What I'd like to do is get the value that are only 3 and 6 only if the IDs are the same. for eg. ID "1103-5" has both 3 and 6, so it should be in the list, but not "2347-2"
I'm using R
One method I tried is the following, but it gives me everything with value 3 and 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6
d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs where there is both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values. In these cases, if both 3 and 6 are present, it will be included in the result.
If you want to exclude those lines that remain which done equal 3 or 6, add this:
filter(d,value==3 | value==6)
If you want to exclude IDs that also have 3 and 6 as values but also have OTHER values, use this:
filter(d,any(value==3),any(value==6),value==3 | value==6)
Not sure if this is what you want. We can filter rows that equal to either 3 or 6 then convert from long to wide format and keep only columns which have both 3 and 6 values. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#> rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 6 6 6 6 3 3
#> 2 2 3 3 3 3 3 6 6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> # `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#> id value
#> <chr> <dbl>
#> 1 1103-5 6
#> 2 1103-5 3
#> 3 1104-2 6
#> 4 1104-2 3
#> 5 1104-4 6
#> 6 1104-4 3
#> 7 1106-2 6
#> 8 1106-2 3
#> 9 1106-3 6
#> 10 1106-3 3
#> 11 2294-1 3
#> 12 2294-1 6
#> 13 2294-2 3
#> 14 2294-2 6
#> 15 2294-3 3
#> 16 2294-3 6
#> 17 2294-4 3
#> 18 2294-4 6
#> 19 2294-5 3
#> 20 2294-5 6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).
I have a data frame with columns labeled sales1, sales2, price1, price2 and I want to calculate revenues by multiplying sales1 * price1 and so-on across each number in an iterative fashion.
data <- data_frame(
"sales1" = c(1, 2, 3),
"sales2" = c(2, 3, 4),
"price1" = c(3, 2, 2),
"price2" = c(3, 3, 5))
data
# A tibble: 3 x 4
# sales1 sales2 price1 price2
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 3
#2 2 3 2 3
#3 3 4 2 5
Why doesn't the following code work?
data %>%
mutate (
for (i in seq_along(1:2)) {
paste0("revenue",i) = paste0("sales",i) * paste0("price",i)
}
)
Assuming your columns are already ordered (sales1, sales2, price1, price2). We can split the dataframe in two parts and then multiply them
data[grep("sales", names(data))] * data[grep("price", names(data))]
# sales1 sales2
#1 3 6
#2 4 9
#3 6 20
If the columns are not already sorted according to their names, we can sort them by using order and then use above command.
data <- data[order(names(data))]
This answer is not brief. For that, #RonakShah's existing answer is the one to look at!
My response is intended to address a broader concern regarding the difficulty of trying to do this in the tidyverse. My understanding is this is difficult because the data is not currently in a "tidy" format. Instead, you can create a tidy data frame like so:
library(tidyverse)
tidy_df <- data %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
extract(key, c("variable", "id"), "([a-z]+)([0-9]+)") %>%
spread(variable, value)
Which then makes the final calculation straightforward
tidy_df %>% mutate(revenue = sales * price)
#> # A tibble: 6 x 5
#> rowname id price sales revenue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 3 1 3
#> 2 1 2 3 2 6
#> 3 2 1 2 2 4
#> 4 2 2 3 3 9
#> 5 3 1 2 3 6
#> 6 3 2 5 4 20
If you need to get the data back into the original format you can although this feels clunky to me (I'm sure this can be improved in someway).
tidy_df %>% mutate(revenue = sales * price) %>%
gather(key, value, -c(rowname, id)) %>%
unite(key, key, id, sep = "") %>%
spread(key, value) %>%
select(starts_with("price"),
starts_with("sales"),
starts_with("revenue"))
#> # A tibble: 3 x 6
#> price1 price2 sales1 sales2 revenue1 revenue2
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3 1 2 3 6
#> 2 2 3 2 3 4 9
#> 3 2 5 3 4 6 20
I wanted to get all unique pairwise combinations of a unique string column of a dataframe using the tidyverse (ideally).
Here is a dummy example:
library(tidyverse)
a <- letters[1:3] %>%
tibble::as_tibble()
a
#> # A tibble: 3 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2"))
#> # A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a a
#> 2 a b
#> 3 a c
#> 4 b a
#> 5 b b
#> 6 b c
#> 7 c a
#> 8 c b
#> 9 c c
Is there a way to remove 'duplicate' combinations here. That is have the output be the following in this example:
# A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a b
#> 2 a c
#> 3 b c
I was hoping there would be a nice purrr::map or filter approach to pipe into to complete the above.
EDIT: There are similar questions to this one e.g. here, marked by #Sotos. Here I am specifically looking for tidyverse (purrr, dplyr) ways to complete the pipeline I have setup. The other answers use various other packages that I do not want to include as dependencies.
wish there was a better way, but I usually use this...
library(tidyverse)
df <- tibble(value = letters[1:3])
df %>%
expand(value, value1 = value) %>%
filter(value < value1)
# # A tibble: 3 x 2
# value value1
# <chr> <chr>
# 1 a b
# 2 a c
# 3 b c
Something like this?
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2")) %>%
rowwise() %>%
mutate(words1 = sort(c(words1, words2))[1], # sort order of words for each row
words2 = sort(c(words1, words2))[2]) %>%
filter(words1 != words2) %>% # remove word combinations with itself
unique() # remove duplicates
# A tibble: 3 x 2
words1 words2
<chr> <chr>
1 a b
2 a c
3 b c