I have a table like this:
ID   Date        Status
101  2020-09-14  1
102  2020-09-14  1
103  2020-09-14  1
104  2020-09-14  2
105  2020-09-14  2
106  2020-09-14  2
But want a table like this:
Status  ID           Date
1       101,102,103  2020-09-14, 2020-09-14, 2020-09-14
2       104,105,106  2020-09-14, 2020-09-14, 2020-09-14
Code that I'm currently using (note: Date is in yyyy-mm-dd format before running the code):
g1 <- df1 %>%
  mutate(Date = as.Date(Date, format = '%Y-%m-%d')) %>%
  group_by(Status) %>%
  summarise_at(c("ID", "Date"), list)
This seems to work, except that the Date column in the new table is not in yyyy-mm-dd format. For example, 2021-06-10 is converted to 18788.
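The 18788 is the Date's underlying day count (days since 1970-01-01). When list-columns are later flattened with c() or unlist(), the Date class is silently dropped and only that numeric representation survives; the exact flattening step is an assumption about where the coercion happens in your pipeline. A minimal illustration:

```r
d <- as.Date("2021-06-10")
unclass(d)       # 18788 -- days since 1970-01-01
unlist(list(d))  # also 18788: unlist() strips the Date class
format(d)        # "2021-06-10" -- convert to character to keep the display
```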
A possible solution:
library(tidyverse)
df %>%
group_by(Status) %>%
summarise(ID = str_c(ID, collapse = ","), Date = str_c(Date, collapse = ","))
#> # A tibble: 2 × 3
#> Status ID Date
#> <int> <chr> <chr>
#> 1 1 101,102,103 2020-09-14,2020-09-14,2020-09-14
#> 2 2 104,105,106 2020-09-14,2020-09-14,2020-09-14
A more succinct alternative:
library(tidyverse)
df %>%
group_by(Status) %>%
summarise(across(c(ID, Date), ~ str_c(.x, collapse = ",")))
#> # A tibble: 2 × 3
#> Status ID Date
#> <int> <chr> <chr>
#> 1 1 101,102,103 2020-09-14,2020-09-14,2020-09-14
#> 2 2 104,105,106 2020-09-14,2020-09-14,2020-09-14
I'm not sure why this has been so difficult, but I've exhausted my R knowledge. I'm trying to return the date from a column into a new column if it falls between two dates. It must be done with SQL-friendly verbs (i.e. dplyr).
sample <- data.frame(
  id = c(1, 1, 2, 3, 4),
  paint = c('zwbc',
            'zbbb',
            'zwbs',
            'aass',
            'zwbc'),
  date = c('2020-03-01',
           '2020-04-01',
           '2019-01-01',
           '2019-12-31',
           '2020-05-01'))
I've tried the following:
sam2 <- sample %>%
  group_by(id) %>%
  mutate(flag = if_else(paint == 'zwbc', 1, 0)) %>%
  mutate(paint_date = if_else(flag == 1 + (date > '2020-1-1' & date < '2020-1-1'), date, NULL)) %>%
  ungroup()
Does this solve your problem?
library(tidyverse)
sample <- data.frame(
id = c(1, 1, 2, 3, 4),
paint = c('zwbc',
'zbbb',
'zwbs',
'aass',
'zwbc'),
date = c('2020-03-01',
'2020-04-01',
'2019-01-01',
'2019-12-31',
'2020-05-01'))
sample %>%
group_by(id) %>%
mutate(flag = if_else(paint == 'zwbc', 1, 0))
#> # A tibble: 5 × 4
#> # Groups: id [4]
#> id paint date flag
#> <dbl> <chr> <chr> <dbl>
#> 1 1 zwbc 2020-03-01 1
#> 2 1 zbbb 2020-04-01 0
#> 3 2 zwbs 2019-01-01 0
#> 4 3 aass 2019-12-31 0
#> 5 4 zwbc 2020-05-01 1
# To exclude the bottom row (date > specified date cutoff)
sample %>%
group_by(id) %>%
mutate(flag = if_else(paint == 'zwbc', 1, 0)) %>%
mutate(paint_date = if_else(flag == 1 & date > '2020-01-01' & date < '2020-04-01', date, NULL)) %>%
ungroup()
#> # A tibble: 5 × 5
#> id paint date flag paint_date
#> <dbl> <chr> <chr> <dbl> <chr>
#> 1 1 zwbc 2020-03-01 1 2020-03-01
#> 2 1 zbbb 2020-04-01 0 <NA>
#> 3 2 zwbs 2019-01-01 0 <NA>
#> 4 3 aass 2019-12-31 0 <NA>
#> 5 4 zwbc 2020-05-01 1 <NA>
Created on 2022-10-12 by the reprex package (v2.0.1)
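One caveat: the comparisons above work because ISO yyyy-mm-dd strings sort lexicographically in date order; padding matters, so a string like '2020-1-1' would not compare correctly. Converting to Date first is safer. A sketch of the same filter on real dates (the cutoff dates are the ones from the answer above):

```r
library(dplyr)

sample %>%
  mutate(date = as.Date(date)) %>%
  mutate(paint_date = if_else(paint == 'zwbc' &
                                between(date, as.Date('2020-01-01'), as.Date('2020-04-01')),
                              date,
                              as.Date(NA)))
```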
I have a dataframe (df) like this, with 82 SKUs numbered M1 to M82.
SKU date sales
M1 2-jan 4
M2 2-jan 5
M1 3-jan 8
M82 3-jan 1
...
M82 31-dec 9
I want to filter each SKU separately and then group_by(date) and summarise(sales_perday = sum(sales)).
Something like this:
for (i in SKU) {
  SKU_M[i] <- df %>%
    filter(SKU == SKU_M[i]) %>%
    group_by(date) %>%
    summarise(sales_perday = sum(sales))
}
The expected output is 82 data frames, one per SKU.
I did this below for one SKU, but I want it for all 82 in an easy way.
M50 <- df %>% filter(SKU == 'M50') %>% group_by(date) %>% summarise(sales_perday = sum(sales))
You probably want to group by multiple columns:
library(tidyverse)
data <- tribble(
~SKU, ~date, ~sales,
"M1", "2-jan",4,
"M2", "2-jan",5,
"M1", "3-jan",8
)
# the concise way
data %>%
group_by(SKU, date) %>%
summarise(sales_perday = sum(sales))
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> # A tibble: 3 × 3
#> # Groups: SKU [2]
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M1 2-jan 4
#> 2 M1 3-jan 8
#> 3 M2 2-jan 5
# if one really wants multiple data frames
data %>%
group_by(SKU, date) %>%
summarise(sales_perday = sum(sales)) %>%
nest(-SKU) %>%
pull(data)
#> Warning: All elements of `...` must be named.
#> Did you want `data = -SKU`?
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> [[1]]
#> # A tibble: 2 × 2
#> date sales_perday
#> <chr> <dbl>
#> 1 2-jan 4
#> 2 3-jan 8
#>
#> [[2]]
#> # A tibble: 1 × 2
#> date sales_perday
#> <chr> <dbl>
#> 1 2-jan 5
Created on 2022-06-08 by the reprex package (v2.0.0)
Another option with split() (note that SKU must be kept in the grouping so it is still available to split on):
df <- df |>
  group_by(SKU, date) |>
  summarise(sales_perday = sum(sales))
split(df, df$SKU)
If you really do want separate data frames, then after grouping by SKU and date, and then summarizing, use group_split() to partition by SKU.
library(tidyverse)
df <- tribble(
~SKU, ~date, ~sales,
"M1", "2-jan",4,
"M2", "2-jan",5,
"M1", "3-jan",8
)
df |>
group_by(SKU, date) |>
summarise(sales_perday = sum(sales)) |>
group_split()
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> <list_of<
#> tbl_df<
#> SKU : character
#> date : character
#> sales_perday: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 2 × 3
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M1 2-jan 4
#> 2 M1 3-jan 8
#>
#> [[2]]
#> # A tibble: 1 × 3
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M2 2-jan 5
I have a tibble, df. I would like to group it and then use dplyr::pull to create vectors from the grouped dataframe. I have provided a reprex below.
df is the base tibble, and my desired output is reflected by df2; I just don't know how to get there programmatically. I tried pull, but it did not seem to recognize the group_by and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base R? Note - new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)
Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT If new_col should be a list of vectors, then you could do `summarise(new_col = list(c(new_col)))`:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)
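A slightly shorter route to the same collapsed table is to sort within each group and summarise once (a sketch; it assumes ties in type do not matter, and it computes new_col before name is overwritten, since summarise() expressions can see earlier results):

```r
df %>%
  group_by(group) %>%
  arrange(type, .by_group = TRUE) %>%
  summarise(new_col = paste(name, collapse = ","),
            name = first(name),
            type = first(type)) %>%
  relocate(group, name, type, new_col)
```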
I am trying to replicate the tidyr::complete function in sparklyr. I have a dataframe with some missing rows that I have to fill in. In dplyr/tidyr I can do:
data <- tibble(
"id" = c(1,1,2,2),
"dates" = c("2020-01-01", "2020-01-03", "2020-01-01", "2020-01-03"),
"values" = c(3,4,7,8))
# A tibble: 4 x 3
id dates values
<dbl> <chr> <dbl>
1 1 2020-01-01 3
2 1 2020-01-03 4
3 2 2020-01-01 7
4 2 2020-01-03 8
data %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
# A tibble: 6 x 3
# Groups: id [2]
id dates values
<dbl> <date> <dbl>
1 1 2020-01-01 3
2 1 2020-01-02 NA
3 1 2020-01-03 4
4 2 2020-01-01 7
5 2 2020-01-02 NA
6 2 2020-01-03 8
However the complete function does not exist in sparklyr.
data_spark %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
Error in UseMethod("complete_") :
no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
Is there a way to set a UDF or to achieve a similar result?
Thank you
Under the hood tidyr::complete just performs a full join followed by optional NA fill. You can replicate its effects by using sdf_copy_to to create a new sdf that is just a single column seq.Date between your start and end date, and then perform a full_join between that and your dataset.
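A minimal sketch of that idea, assuming the date range is small enough to generate locally (the hard-coded range and example data here are illustrative, not your actual dataset):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
data_spark <- copy_to(sc, data.frame(
  id = c(1, 1, 2, 2),
  dates = c("2020-01-02", "2020-01-04", "2020-01-01", "2020-01-03"),
  values = c(1, 2, 3, 4)))

# Full date scaffold built locally, then copied to the cluster
all_dates <- data.frame(
  dates = as.character(seq.Date(as.Date("2020-01-01"),
                                as.Date("2020-01-04"),
                                by = "day")))
dates_sdf <- sdf_copy_to(sc, all_dates, name = "all_dates", overwrite = TRUE)

# Cross the scaffold with the distinct ids, then join the data back on
dates_sdf %>%
  mutate(join_by = TRUE) %>%
  full_join(data_spark %>% distinct(id) %>% mutate(join_by = TRUE),
            by = "join_by") %>%
  select(-join_by) %>%
  left_join(data_spark, by = c("id", "dates"))
```

The dates column is kept as character on both sides so the join types match; cast to date afterwards if needed.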
Here's a method that does all of the work in Spark.
library(sparklyr)
sc <- spark_connect(master = "local")
data <- tibble(
id = c(1, 1, 2, 2),
dates = c("2020-01-02", "2020-01-04", "2020-01-01", "2020-01-03"),
values = c(1, 2, 3, 4)
)
data_spark <- copy_to(sc, data)
We need to generate all combinations of dates and id. To do this, we need to know the total number of days and the first date.
days_info <-
data_spark %>%
summarise(
first_date = min(dates),
total_days = datediff(max(dates), min(dates))
) %>%
collect()
days_info
#> # A tibble: 1 x 2
#> first_date total_days
#> <chr> <int>
#> 1 2020-01-01 3
sdf_seq can be used to generate a sequence in Spark. This can be used to get the combinations of dates and id.
dates_id_combinations <-
sdf_seq(
sc,
from = 0,
to = days_info$total_days,
repartition = 1
) %>%
transmute(
dates = date_add(local(days_info$first_date), id),
join_by = TRUE
) %>%
full_join(data_spark %>% distinct(id) %>% mutate(join_by = TRUE)) %>%
select(dates, id)
dates_id_combinations
#> # Source: spark<?> [?? x 2]
#> dates id
#> <date> <dbl>
#> 1 2020-01-01 1
#> 2 2020-01-01 2
#> 3 2020-01-02 1
#> 4 2020-01-02 2
#> 5 2020-01-03 1
#> 6 2020-01-03 2
#> 7 2020-01-04 1
#> 8 2020-01-04 2
full_join the original data frame and the combination data frame. Then filter based on the min/max date for each group.
data_spark %>%
group_by(id) %>%
mutate(first_date = min(dates), last_date = max(dates)) %>%
full_join(dates_id_combinations) %>%
filter(dates >= min(first_date), dates <= max(last_date)) %>%
arrange(id, dates) %>%
select(id, dates)
#> # Source: spark<?> [?? x 2]
#> # Groups: id
#> # Ordered by: id, dates
#> id dates
#> <dbl> <chr>
#> 1 1 2020-01-02
#> 2 1 2020-01-03
#> 3 1 2020-01-04
#> 4 2 2020-01-01
#> 5 2 2020-01-02
#> 6 2 2020-01-03
I have a large data.frame that I am trying to spread. A toy example looks like this.
data = data.frame(date = rep(c("2019", "2020"), 2), ticker = c("SPY", "SPY", "MSFT", "MSFT"), value = c(1, 2, 3, 4))
head(data)
date ticker value
1 2019 SPY 1
2 2020 SPY 2
3 2019 MSFT 3
4 2020 MSFT 4
I would like to spread it so the data.frame looks like this.
spread(data, key = ticker, value = value)
date MSFT SPY
1 2019 3 1
2 2020 4 2
However, when I do this on my actual data.frame, I get an error.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 18204 rows:
* 30341, 166871
* 30342, 166872
* 30343, 166873
* 30344, 166874
* 30345, 166875
* 30346, 166876
* 30347, 166877
* 30348, 166878
* 30349, 166879
* 30350, 166880
* 30351, 166881
* 30352, 166882
Below is a head and tail of my data.frame
head(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2008-02-01 SPY NA
2 2008-02-04 SPY NA
3 2008-02-05 SPY NA
4 2008-02-06 SPY NA
5 2008-02-07 SPY NA
6 2008-02-08 SPY -0.0478
tail(df)
ref.date ticker weeklyReturn
<date> <chr> <dbl>
1 2020-02-12 MDYV 0.00293
2 2020-02-13 MDYV 0.00917
3 2020-02-14 MDYV 0.0179
4 2020-02-18 MDYV 0.0107
5 2020-02-19 MDYV 0.00422
6 2020-02-20 MDYV 0.00347
You can use the dplyr and tidyr packages. To get rid of that error, you first have to sum the values for each group.
data %>%
group_by(date, ticker) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = ticker, values_from = value)
# date MSFT SPY
# <fct> <dbl> <dbl>
# 1 2019 3 1
# 2 2020 4 2
As said in the comments, you have multiple values for the same date-ticker combination. You need to decide what to do with them.
Here with a reprex:
library(tidyr)
library(dplyr)
# your data is more like:
data = data.frame(
date = c(2019, rep(c("2019", "2020"), 2)),
ticker = c("SPY", "SPY", "SPY", "MSFT", "MSFT"),
value = c(8, 1, 2, 3, 4))
# With two values for same date-ticker combination
data
#> date ticker value
#> 1 2019 SPY 8
#> 2 2019 SPY 1
#> 3 2020 SPY 2
#> 4 2019 MSFT 3
#> 5 2020 MSFT 4
# Results in error
data %>%
spread(ticker, value)
#> Error: Each row of output must be identified by a unique combination of keys.
#> Keys are shared for 2 rows:
#> * 1, 2
# New pivot_wider() Creates list-columns for duplicates
data %>%
pivot_wider(names_from = ticker, values_from = value)
#> Warning: Values in `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(value = list)` to suppress this warning.
#> * Use `values_fn = list(value = length)` to identify where the duplicates arise
#> * Use `values_fn = list(value = summary_fun)` to summarise duplicates
#> # A tibble: 2 x 3
#> date SPY MSFT
#> <fct> <list> <list>
#> 1 2019 <dbl [2]> <dbl [1]>
#> 2 2020 <dbl [1]> <dbl [1]>
# Otherwise, decide how to summarise duplicates yourself, with mean() for instance
data %>%
group_by(date, ticker) %>%
summarise(value = mean(value, na.rm = TRUE)) %>%
spread(ticker, value)
#> # A tibble: 2 x 3
#> # Groups: date [2]
#> date MSFT SPY
#> <fct> <dbl> <dbl>
#> 1 2019 3 4.5
#> 2 2020 4 2
Created on 2020-02-22 by the reprex package (v0.3.0)
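As an aside, recent tidyr versions (>= 1.1, an assumption about your setup) let pivot_wider() aggregate the duplicates directly via values_fn, collapsing the group_by/summarise step into the reshape itself:

```r
data %>%
  pivot_wider(names_from = ticker,
              values_from = value,
              values_fn = mean)
```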