I have a data frame with columns labeled sales1, sales2, price1, price2, and I want to calculate revenues by multiplying sales1 * price1, sales2 * price2, and so on for each number, iteratively.
data <- tibble(
"sales1" = c(1, 2, 3),
"sales2" = c(2, 3, 4),
"price1" = c(3, 2, 2),
"price2" = c(3, 3, 5))
data
# A tibble: 3 x 4
# sales1 sales2 price1 price2
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 3
#2 2 3 2 3
#3 3 4 2 5
Why doesn't the following code work?
data %>%
mutate (
for (i in seq_along(1:2)) {
paste0("revenue",i) = paste0("sales",i) * paste0("price",i)
}
)
Assuming your columns are already ordered (sales1, sales2, price1, price2), we can split the data frame into two parts and then multiply them:
data[grep("sales", names(data))] * data[grep("price", names(data))]
# sales1 sales2
#1 3 6
#2 4 9
#3 6 20
If the columns are not already sorted by name, we can sort them with order and then use the command above.
data <- data[order(names(data))]
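As for why the loop in the question fails: paste0() returns character strings, so the right-hand side tries to multiply two strings rather than two columns, and a function call such as paste0("revenue", i) cannot serve as a column name on the left-hand side of = inside mutate(). A minimal base R sketch of the iterative version, assuming the suffixes run from 1 to 2 as in the example data, looks the columns up by name with [[ ]]:

```r
# Example data from the question, as a plain data frame
data <- data.frame(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

# Build each column name as a string, then use [[ ]] to read/write that column
for (i in 1:2) {
  data[[paste0("revenue", i)]] <-
    data[[paste0("sales", i)]] * data[[paste0("price", i)]]
}

data
```

This adds revenue1 and revenue2 next to the original columns and does not depend on the column order.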
This answer is not brief; for a short solution, @RonakShah's existing answer is the one to look at!
My response is intended to address a broader concern regarding the difficulty of trying to do this in the tidyverse. My understanding is this is difficult because the data is not currently in a "tidy" format. Instead, you can create a tidy data frame like so:
library(tidyverse)
tidy_df <- data %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
extract(key, c("variable", "id"), "([a-z]+)([0-9]+)") %>%
spread(variable, value)
Which then makes the final calculation straightforward
tidy_df %>% mutate(revenue = sales * price)
#> # A tibble: 6 x 5
#> rowname id price sales revenue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 3 1 3
#> 2 1 2 3 2 6
#> 3 2 1 2 2 4
#> 4 2 2 3 3 9
#> 5 3 1 2 3 6
#> 6 3 2 5 4 20
If you need to get the data back into the original format, you can, although this feels clunky to me (I'm sure it can be improved in some way).
tidy_df %>% mutate(revenue = sales * price) %>%
gather(key, value, -c(rowname, id)) %>%
unite(key, key, id, sep = "") %>%
spread(key, value) %>%
select(starts_with("price"),
starts_with("sales"),
starts_with("revenue"))
#> # A tibble: 3 x 6
#> price1 price2 sales1 sales2 revenue1 revenue2
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3 1 2 3 6
#> 2 2 3 2 3 4 9
#> 3 2 5 3 4 6 20
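With more recent versions of tidyr, the same round trip can be written with pivot_longer() and pivot_wider(). This is only a sketch, assuming a tidyr version that provides these functions (1.0.0 or later); the helper column row is a name introduced here for illustration. The special ".value" entry in names_to tells pivot_longer() to turn the letter part of each name (sales, price) into its own column:

```r
library(dplyr)
library(tidyr)

data <- tibble(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

# Long ("tidy") form: one row per original row and id, with sales and price columns
tidy_df <- data %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row,
               names_to = c(".value", "id"),
               names_pattern = "([a-z]+)([0-9]+)") %>%
  mutate(revenue = sales * price)

# Back to the wide layout: sales1, sales2, price1, price2, revenue1, revenue2
wide <- tidy_df %>%
  pivot_wider(names_from = id,
              values_from = c(sales, price, revenue),
              names_sep = "")

wide
```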
Let's say I have a data frame like this:
df <- tibble(ID = c(1, 1, 1, 1, 1), v1 = c(3, 5, 1, 0, 1), v2 = c(10, 6, 1, 20, 23), Time = c(as.POSIXct("1900-01-01 10:00:00"), as.POSIXct("1900-01-01 11:00:00"), as.POSIXct("1900-01-01 13:00:00"), as.POSIXct("1900-01-01 16:00:00"), as.POSIXct("1900-01-01 20:00:00"))) %>% group_by(ID)
# A tibble: 5 x 4
# Groups: ID [1]
ID v1 v2 Time
<dbl> <dbl> <dbl> <dttm>
1 1 3 10 1900-01-01 10:00:00
2 1 5 6 1900-01-01 11:00:00
3 1 1 1 1900-01-01 13:00:00
4 1 0 20 1900-01-01 16:00:00
5 1 1 23 1900-01-01 20:00:00
In words, this is a simple timeseries of a specific ID with two values v1 and v2 per time.
As is quite common in machine learning, I want to aggregate the last n timesteps into one feature vector. For each previous timestep there should be a time reference in hours for when that data point occurred. For the first row, where no previous timestep is available, the data should be filled with zeros.
Let's make an example. In this case n=2, that is, I want to aggregate the current time step (t2) and the previous one (t1) together:
# A tibble: 5 x 6
ID v1_t1 v2_t1 time_t1 v1_t2 v2_t2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 NA 3 10
2 1 3 10 1 5 6
3 1 5 6 2 1 1
4 1 1 1 3 0 20
5 1 0 20 4 1 23
I want to keep this as generic as possible, so that both n and the number of data columns can change. Any idea how to do this?
Thanks :)
Using dplyr::lag and dplyr::across you could do:
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
df %>%
group_by(ID) %>%
mutate(time_t1 = lubridate::hour(Time) - lag(lubridate::hour(Time))) %>%
mutate(across(c(v1, v2), .fns = list(t2 = ~.x, t1 = ~lag(.x, default = 0)))) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID time_t1 v1_t2 v1_t1 v2_t2 v2_t1
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
UPDATE Here is a more generic approach which uses function factories to create lists of functions that can then be passed to the .fns argument of across. I haven't tested the more general case, but it should work for any n (number of lags to include) and for any number of data columns. Note that with the naming used here, t1 holds the current value and t2 the value lagged by one step, i.e. the reverse of the labels in the question.
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
fun_factory1 <- function(n) {
function(x) {
lubridate::hour(x) - lag(lubridate::hour(x), n = n)
}
}
fun_factory2 <- function(n) {
function(x) {
lag(x, n = n, default = 0)
}
}
n <- 2
fns1 <- lapply(seq(n - 1), fun_factory1)
names(fns1) <- paste0("t", seq(n - 1))
fns2 <- lapply(seq(n) - 1, fun_factory2)
names(fns2) <- paste0("t", seq(n))
df %>%
group_by(ID) %>%
mutate(across(Time, .fns = fns1)) %>%
mutate(across(c(v1, v2), .fns = fns2)) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID Time_t1 v1_t1 v1_t2 v2_t1 v2_t2
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
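One caveat on the time column: subtracting lubridate::hour() values compares hours of the day only, so it gives the intended result for the example data but not when consecutive rows fall on different days. A minimal sketch of an alternative, assuming the elapsed time in hours is what is wanted, uses difftime() on the timestamps themselves. The tibble ts and its values are hypothetical and chosen to cross midnight:

```r
library(dplyr)

# Hypothetical series that crosses midnight
ts <- tibble(
  ID = c(1, 1, 1),
  Time = as.POSIXct(c("1900-01-01 22:00:00",
                      "1900-01-02 01:00:00",
                      "1900-01-02 05:00:00"), tz = "UTC")
)

res <- ts %>%
  group_by(ID) %>%
  # Elapsed hours since the previous row of the same ID (NA for the first row)
  mutate(time_t1 = as.numeric(difftime(Time, lag(Time), units = "hours"))) %>%
  ungroup()

res
```

Here the second row is 3 hours after the first, whereas a difference of hour-of-day values would give 1 - 22 = -21.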
I have this tibble:
library(tibble)
library(dplyr)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
I want to add a row to each group AND assign values to the new row BUT with a function: here, the new row in each group should get A = 4 and B = the first value of column B within the group, using first(B). Desired output:
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
I have tried so far:
If I add a row to an ungrouped tibble with add_row -> this works perfectly!
df %>%
add_row(A=4, B=4)
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
4 NA 4 4
If I try to use add_row in a grouped tibble -> this does not work:
df %>%
group_by(id) %>%
add_row(A=4, B=4)
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
According to this post Add row in each group using dplyr and add_row() we could use group_modify -> this works great:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=4, .x))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 4
5 two 2 5
6 two 4 4
I want to assign to column B the first value of column B (or it can be any function: min(B), max(B), etc.) -> this does not work:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(B), .x))
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'first': object 'B' not found
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
summarise(add_row(cur_data(), A = 4, B = first(cur_data()$B)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: id [3]
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Or
df %>%
group_by(id) %>%
group_split() %>%
map_dfr(~ add_row(.,id = first(.$id), A = 4, B = first(.$B)))
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Created on 2022-01-02 by the reprex package (v2.0.1)
Maybe this is an option
library(dplyr)
df %>%
group_by(id) %>%
summarise( A=c(A,4), B=c(B,first(B)) ) %>%
ungroup
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 6 x 3
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
According to the documentation of the function group_modify, if you use a formula, you must use ". or .x to refer to the subset of rows of .tbl for the given group;" that's why you used .x inside the add_row function. To be entirely consistent, you have to do it also within the first function.
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(.x$B), .x))
# A tibble: 6 x 3
# Groups: id [3]
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
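Using first(.$B) gives the same result. first(df$B) only appears to work for part of this example: it refers to the whole ungrouped data frame, so every new row would get B = 4 rather than the first B of its own group.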
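The same pattern works for other per-group summaries, as the question asks. As a sketch, assuming a hypothetical data frame df_multi in which one group has more than one row, max(.x$B) fills the new row with the largest B of its own group:

```r
library(dplyr)
library(tibble)

# Hypothetical data: group "one" has two rows
df_multi <- tibble(id = c("one", "one", "two"),
                   A  = c(1, 2, 3),
                   B  = c(4, 7, 5))

out <- df_multi %>%
  group_by(id) %>%
  # .x is the current group's rows (without the grouping column id)
  group_modify(~ add_row(.x, A = 4, B = max(.x$B))) %>%
  ungroup()

out
```

Each group gains one row with A = 4; in group "one" the new row gets B = 7, in group "two" it gets B = 5.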
A possible solution:
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
slice(rep(1, 2)) %>% # duplicate the first row of each group
mutate(A = if_else(row_number() > 1, 4, A)) %>% # the copy gets A = 4; its B is already the group's first B
ungroup
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
I am trying to transform df into df2. I did it in a very patchy way using df3. Is there a simpler and more elegant way of doing it?
library(tidyverse)
# I want to transform df
df <- tibble(id = c(1, 2, 1, 2, 1, 2),
time = c('t1', 't1', 't2', 't2', 't3', 't3'),
value = c(2, 3, 6, 4, 5, 7))
df
#> # A tibble: 6 x 3
#> id time value
#> <dbl> <chr> <dbl>
#> 1 1 t1 2
#> 2 2 t1 3
#> 3 1 t2 6
#> 4 2 t2 4
#> 5 1 t3 5
#> 6 2 t3 7
# into df2
df2 <- tibble(id = c(1, 2, 1, 2),
t = c(2, 3, 6, 4),
r = c(6, 4, 5, 7))
df2
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 2 6
#> 2 2 3 4
#> 3 1 6 5
#> 4 2 4 7
# This is how I did it, but I think there should be a better way
df3 <- df %>% pivot_wider(names_from = time, values_from = value)
b <- tibble(id = numeric(), t = numeric(), r = numeric())
for (i in 2:3){
a <- df3[,c(1,i,i+1)]
colnames(a) <- c('id', 't', 'r')
b <- bind_rows(a, b)
}
b
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 6 5
#> 2 2 4 7
#> 3 1 2 6
#> 4 2 3 4
Created on 2020-11-25 by the reprex package (v0.3.0)
For each id, you can use lead to pick the next value, store it in column r, and then drop the rows with NA.
library(dplyr)
df %>%
group_by(id) %>%
mutate(t = value,
r = lead(value)) %>%
na.omit() %>%
select(id, t, r)
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 2 3 4
#3 1 6 5
#4 2 4 7
We can use summarise from dplyr version >= 1.0. Previously, it was constrained to return a single observation per group. From version 1.0 onward that is no longer the case: it can return any number of rows per group, i.e. the result can be shorter or longer than the original number of rows.
library(dplyr)
df %>%
group_by(id) %>%
summarise(t = value[-n()], r = value[-1], .groups = 'drop')
Output:
# A tibble: 4 x 3
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 1 6 5
#3 2 3 4
#4 2 4 7
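In dplyr 1.1.0 and later, returning several rows per group from summarise() is deprecated in favour of reframe(). A sketch of the same computation, assuming dplyr >= 1.1.0 is installed:

```r
library(dplyr)

df <- tibble(id = c(1, 2, 1, 2, 1, 2),
             time = c('t1', 't1', 't2', 't2', 't3', 't3'),
             value = c(2, 3, 6, 4, 5, 7))

res <- df %>%
  group_by(id) %>%
  # t: every value except the last; r: every value except the first
  reframe(t = value[-n()], r = value[-1])

res
```

reframe() always returns an ungrouped data frame, so no .groups argument is needed.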
I have a dataset with consistently named columns and I would like to take the average of the columns by their group e.g.,
library(dplyr)
library(purrr)
library(glue)
df <- tibble(`1_x_blind` = 1:3,
`1_y_blind` = 7:9,
`2_x_blind` = 4:6,
`2_y_blind` = 5:7)
df %>%
mutate(`1_overall_test` = rowMeans(select(., matches(glue("^1_.*_blind$")))))
#> # A tibble: 3 x 5
#> `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_test`
#> <int> <int> <int> <int> <dbl>
#> 1 1 7 4 5 4
#> 2 2 8 5 6 5
#> 3 3 9 6 7 6
This method works fine. The next step for me would be to scale it so that I can do the entire series of columns e.g., something like
df %>%
mutate(overall_blind = map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$"))))))
#> Error: Problem with `mutate()` input `overall_blind`.
#> x no applicable method for 'select' applied to an object of class "c('integer', 'numeric')"
#> ℹ Input `overall_blind` is `map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$")))))`.
I think the problem here is that select is confusing the . operator. Is it possible to map over a series of column names in this way? Ideally I'd like the column names to follow the {.x}_overall pattern as in the example above.
Update Here's a cleaner way that doesn't require rename or bind_cols:
map_dfc(1:2,
function(x) df %>%
select(matches(glue("^{x}_.*_blind$"))) %>%
mutate("{x}_overall_blind" := rowMeans(.))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
<int> <int> <dbl> <int> <int> <dbl>
1 1 7 4 4 5 4.5
2 2 8 5 5 6 5.5
3 3 9 6 6 7 6.5
Previous
Here's a map approach.
The challenge is mutating two new columns based on separate groups of existing columns. Easiest just to do that in its own map_dfc() and then bind that to the existing df.
df %>%
bind_cols(
map_dfc(1:2, ~rowMeans(df %>% select(matches(glue("^{.x}_.*_blind$"))))) %>%
rename_with(~paste0(str_replace(., "\\...", ""), "_overall_blind"))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
<int> <int> <int> <int> <dbl> <dbl>
1 1 7 4 5 4 4.5
2 2 8 5 6 5 5.5
3 3 9 6 7 6 6.5
And here's a way to get your rowwise column-group averages using pivots, which avoids regex and mutate/map operations:
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
separate(name, c("grp"), sep = "_", extra = "drop") %>%
group_by(row, grp) %>%
summarise(overall_blind = mean(value)) %>%
ungroup() %>%
pivot_wider(id_cols = row, names_from = grp, values_from = overall_blind,
names_glue = "{grp}_{.value}") %>%
bind_cols(df)
# A tibble: 3 x 6
`1_overall_blind` `2_overall_blind` `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind`
<dbl> <dbl> <int> <int> <int> <int>
1 4 4.5 1 7 4 5
2 5 5.5 2 8 5 6
3 6 6.5 3 9 6 7
Here is one solution:
map_dfc(1:2, function(x) {
select(df, matches(glue("^{x}_.*_blind$"))) %>%
mutate(overall_blind = rowMeans(select(., matches(glue("^{x}_.*_blind$"))))) %>%
# General but not perfect names
# set_names(paste0(x, "_", names(.)))
# Hand-tailored names
set_names(c(names(.)[1], names(.)[2], paste0(x, "_", names(.)[3])))
})
#> # A tibble: 3 x 6
#> `1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
#> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 7 4 4 5 4.5
#> 2 2 8 5 5 6 5.5
#> 3 3 9 6 6 7 6.5
I added two ways of naming the overall_blind column for each group: one that is more general but gives imperfect names (it duplicates the 1_ or 2_ prefix on the data columns), and another that gives the names you want but requires knowing the number of columns per group in advance.
We can use split.default to split the data into list of datasets based on the column name pattern, then get the rowMeans and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
df %>%
split.default(readr::parse_number(names(.))) %>%
map_dfc(rowMeans) %>%
set_names(str_c(names(.), "_overall_blind")) %>%
bind_cols(df, .)
# A tibble: 3 x 6
# `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
# <int> <int> <int> <int> <dbl> <dbl>
#1 1 7 4 5 4 4.5
#2 2 8 5 6 5 5.5
#3 3 9 6 7 6 6.5
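On the error in the question: inside a purrr formula such as ~rowMeans(select(., ...)), the . refers to the element being mapped over (1 or 2), not to the data frame coming down the pipe, which is why select() ends up being applied to an integer. A minimal sketch that avoids the clash, assuming the group numbers are known up front, uses a named function argument (g, a name chosen here) and refers to df explicitly:

```r
library(dplyr)
library(purrr)

df <- tibble(`1_x_blind` = 1:3,
             `1_y_blind` = 7:9,
             `2_x_blind` = 4:6,
             `2_y_blind` = 5:7)

groups <- 1:2

overall <- groups %>%
  # g is the group number; df is referenced by name, so there is no ambiguity
  map(function(g) {
    unname(rowMeans(select(df, matches(paste0("^", g, "_.*_blind$")))))
  }) %>%
  set_names(paste0(groups, "_overall_blind")) %>%
  as_tibble()

result <- bind_cols(df, overall)
result
```

The means are computed from the original df only, so the newly created 1_overall_blind column cannot be picked up by the pattern for group 1.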
I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
If you notice, there are multiple values for the same id. What I'd like to do is get the rows whose value is 3 or 6, but only for IDs that have both values. For example, ID "1103-5" has both 3 and 6, so it should be in the list, but "2347-2" should not.
I'm using R
One method I tried is the following, but it gives me everything with value 3 and 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6
d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs where there is both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values. In these cases, if both 3 and 6 are present, it will be included in the result.
If you want to exclude the remaining rows that do not equal 3 or 6, add this:
filter(d,value==3 | value==6)
To do both at once, i.e. keep only IDs that have both 3 and 6 and drop any rows with OTHER values within those IDs, use this:
filter(d,any(value==3),any(value==6),value==3 | value==6)
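If the goal were instead to drop every ID that contains any value other than 3 or 6, a group-level all() condition would do that. The sketch below contrasts the two readings on a small hypothetical data frame toy (its ids and values are made up for illustration):

```r
library(dplyr)

# Hypothetical data: "a" has 6 and 3, "b" has 9, 3 and 6, "c" has 9 and 6
toy <- data.frame(id = c("a", "a", "b", "b", "b", "c", "c"),
                  value = c(6, 3, 9, 3, 6, 9, 6))

# Keep IDs that have both 3 and 6, then drop rows with other values
both <- toy %>%
  group_by(id) %>%
  filter(all(c(3, 6) %in% value), value %in% c(3, 6)) %>%
  ungroup()

# Keep only IDs whose values are exclusively 3 and 6
strict <- toy %>%
  group_by(id) %>%
  filter(all(c(3, 6) %in% value), all(value %in% c(3, 6))) %>%
  ungroup()

both
strict
```

The first version keeps "a" and the 3 and 6 rows of "b", which matches the output requested in the question; the second keeps only "a".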
Not sure if this is what you want. We can filter the rows that are equal to either 3 or 6, then convert from long to wide format and keep only the columns which have both 3 and 6 values. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#> rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 6 6 6 6 3 3
#> 2 2 3 3 3 3 3 6 6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> # `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#> id value
#> <chr> <dbl>
#> 1 1103-5 6
#> 2 1103-5 3
#> 3 1104-2 6
#> 4 1104-2 3
#> 5 1104-4 6
#> 6 1104-4 3
#> 7 1106-2 6
#> 8 1106-2 3
#> 9 1106-3 6
#> 10 1106-3 3
#> 11 2294-1 3
#> 12 2294-1 6
#> 13 2294-2 3
#> 14 2294-2 6
#> 15 2294-3 3
#> 16 2294-3 6
#> 17 2294-4 3
#> 18 2294-4 6
#> 19 2294-5 3
#> 20 2294-5 6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).
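The same selection can also be sketched without reshaping, in base R, by first finding the IDs that contain both 3 and 6 and then subsetting. The data frame d_small below is a hand-picked subset of the question's data, and has_both and keep_ids are names introduced here for illustration:

```r
# Three IDs taken from the question's data
d_small <- data.frame(
  id = c("1103-5", "1103-5", "2294-2", "2294-2", "2294-2", "2347-2", "2347-2"),
  value = c(6, 3, 9, 3, 6, 9, 6)
)

# For each id: does it contain both a 3 and a 6?
has_both <- tapply(d_small$value, d_small$id, function(v) all(c(3, 6) %in% v))
keep_ids <- names(has_both)[has_both]

# Keep the 3/6 rows of the qualifying ids
result <- d_small[d_small$id %in% keep_ids & d_small$value %in% c(3, 6), ]
result
```

Here "1103-5" and "2294-2" qualify, while "2347-2" (only 9 and 6) is dropped.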