Add/Merge/Melt only specific columns and give out one unique row - R

I am trying to transform a dataset that has multiple product sales per date. In the end I want to keep only unique rows, with the sum of the product sales per day.
My MRE:
df <- data.frame(created = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                                     "2020-01-02", "2020-01-03", "2020-01-03"),
                                   "%Y-%m-%d"),
                 soldUnits = c(1, 1, 1, 1, 1, 1),
                 Weekday = c("Mo", "Mo", "Tu", "Tu", "Th", "Th"),
                 Sunshinehours = c(7.8, 7.8, 6.0, 6.0, 8.0, 8.0))
Which looks like this:
created    soldUnits Weekday Sunshinehours
2020-01-01         1      Mo           7.8
2020-01-01         1      Mo           7.8
2020-01-02         1      Tu           6.0
2020-01-02         1      Tu           6.0
2020-01-03         1      Th           8.0
2020-01-03         1      Th           8.0
And should look like this after transforming:
created    soldUnits Weekday Sunshinehours
2020-01-01         2      Mo           7.8
2020-01-02         2      Tu           6.0
2020-01-03         2      Th           8.0
I tried aggregate() and group_by(), but without success, because my other columns were dropped.
Does anyone have an idea how I can transform and clean up my dataset according to the specifications I mentioned?
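For reference, aggregate() can keep the other columns if they are all placed on the right-hand side of a formula, so nothing gets dropped. A minimal base-R sketch using the df above (the rows come back sorted by the grouping columns, so the order may differ):

aggregate(soldUnits ~ created + Weekday + Sunshinehours, data = df, FUN = sum)
#>      created Weekday Sunshinehours soldUnits
#> 1 2020-01-02      Tu           6.0         2
#> 2 2020-01-01      Mo           7.8         2
#> 3 2020-01-03      Th           8.0         2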

This can work: count() with wt = soldUnits sums that column instead of counting rows.
library(tidyverse)
df %>%
  group_by(created) %>%
  count(Weekday, Sunshinehours, wt = soldUnits, name = "soldUnits")
#> # A tibble: 3 × 4
#> # Groups:   created [3]
#>   created    Weekday Sunshinehours soldUnits
#>   <date>     <chr>           <dbl>     <dbl>
#> 1 2020-01-01 Mo                7.8         2
#> 2 2020-01-02 Tu                6           2
#> 3 2020-01-03 Th                8           2
Created on 2021-12-04 by the reprex package (v2.0.1)

Applying different functions to different columns (or sets of columns) can be done with collap() from the collapse package, using its fast grouped functions fmean and fsum:
library(collapse)
collap(df, ~ created + Weekday,
       custom = list(fmean = "Sunshinehours", fsum = "soldUnits"))
     created soldUnits Weekday Sunshinehours
1 2020-01-01         2      Mo           7.8
2 2020-01-02         2      Tu           6.0
3 2020-01-03         2      Th           8.0

Another dplyr approach: grouping by Weekday and Sunshinehours as well is safe here, because both are constant within each date.
df %>%
  group_by(created, Weekday, Sunshinehours) %>%
  summarise(soldUnits = sum(soldUnits))
  created    Weekday Sunshinehours soldUnits
  <date>     <chr>           <dbl>     <dbl>
1 2020-01-01 Mo                7.8         2
2 2020-01-02 Tu                6           2
3 2020-01-03 Th                8           2

Using base R and dplyr:
library(dplyr)
df1 <- aggregate(df["Sunshinehours"], by = df["created"], mean)
df2 <- aggregate(df["soldUnits"], by = df["created"], sum)
df3 <- inner_join(df1, df2)
# carry Weekday over by matching on date; assigning levels(df$Weekday)
# would order the weekdays alphabetically and scramble them across rows
df3$Weekday <- df$Weekday[match(df3$created, df$created)]
     created Sunshinehours soldUnits Weekday
1 2020-01-01           7.8         2      Mo
2 2020-01-02           6.0         2      Tu
3 2020-01-03           8.0         2      Th

Related

if_else with sequence of conditions

I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare the dates in df to those comparison dates and do something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
  mutate(new_var = case_when(date < comparison[1] ~ 1,
                             date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require manually writing out all sequential conditions, and I'm wondering if I couldn't just automate it. I thought about creating a match lookup first (i.e. take the comparison vector, then add the respective new_var number (1, 2, and so on)) and then matching it against my data, but I only know how to do that for exact matches and don't know how I can add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
You can use findInterval as follows:
df %>% mutate(new_var = findInterval(date, comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
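To see why the + 1 is needed (a minimal check using the comparison vector built above): findInterval() returns, for each date, how many cut-points are less than or equal to it, so adding 1 turns that count into the 1-based index of the first comparison date the value still falls below.
findInterval(as_date(c("2019-11-20", "2020-04-01")), comparison)
#> [1] 0 5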

How to calculate the ratio of a condition based on another condition

I have a simplified data frame like this:
date       state hour
2020-01-01 A     6
2020-01-01 B     3
2020-01-02 A     4
2020-01-02 B     3.5
2020-01-03 A     5
2020-01-03 B     2.5
For each date there are two states, and I want to calculate the daily ratio of state A to state B in hours.
For example,
date ratio
2020-01-01 2
2020-01-02 1.143
2020-01-03 2
How do I get this result? Thank you!
With the help of match you can do:
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(ratio = hour[match('A', state)] / hour[match('B', state)])
#  date       ratio
#  <chr>      <dbl>
#1 2020-01-01  2
#2 2020-01-02  1.14
#3 2020-01-03  2
You can use xtabs:
tt <- xtabs(hour ~ date + state, x)
data.frame(dimnames(tt)[1], ratio = tt[,1] / tt[,2])
# date ratio
#2020-01-01 2020-01-01 2.000000
#2020-01-02 2020-01-02 1.142857
#2020-01-03 2020-01-03 2.000000
Data:
x <- data.frame(date = c("2020-01-01", "2020-01-01", "2020-01-02",
                         "2020-01-02", "2020-01-03", "2020-01-03"),
                state = c("A", "B", "A", "B", "A", "B"),
                hour = c(6, 3, 4, 3.5, 5, 2.5))
A data.table option
> setDT(df)[, .(ratio = Reduce(`/`, hour[order(state)])), date]
date ratio
1: 2020-01-01 2.000000
2: 2020-01-02 1.142857
3: 2020-01-03 2.000000
You can also use the following solution, although it is somewhat similar to the one posted by @Ronak Shah.
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = state, values_from = hour) %>%
  group_by(date) %>%
  summarise(ratio = A / B)
# A tibble: 3 x 2
date ratio
<chr> <dbl>
1 2020-01-01 2
2 2020-01-02 1.14
3 2020-01-03 2
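For completeness, the same wide-then-divide idea in base R (a sketch using the x data frame from the xtabs answer): tapply() builds the same date-by-state matrix, and the ratio is then one vectorised division.
tt <- tapply(x$hour, list(x$date, x$state), sum)
data.frame(date = rownames(tt), ratio = tt[, "A"] / tt[, "B"])
#                  date    ratio
# 2020-01-01 2020-01-01 2.000000
# 2020-01-02 2020-01-02 1.142857
# 2020-01-03 2020-01-03 2.000000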

Update the new date column using the existing date column with -1 day

I have a set of patient IDs and a date column. I want to fill the Date1 column with the next row's Date (within the same ID) minus one day. For example:
ID Date       Date1
1  23-10-2017 23-09-2018
1  24-09-2018 28-08-2019
1  29-08-2019 -
2  30-05-2016 11-06-2017
2  12-06-2017 12-07-2018
2  13-07-2018 -
I don't know if I get what you want, but if you just want each date less one day, this is the code:
x <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                Date = as.Date(c("23-10-2017", "24-09-2018", "29-08-2019",
                                 "30-05-2016", "12-06-2017", "13-07-2018"),
                               "%d-%m-%Y"))
x$Date1 <- x$Date - 1
Shift by one row per group, then subtract one day:
library(data.table)
dt1 <- fread("
ID Date
1 23-10-2017
1 24-09-2018
1 29-08-2019
2 30-05-2016
2 12-06-2017
2 13-07-2018")
# convert to Date (note %Y: the years are four digits)
dt1[, Date := as.Date(Date, "%d-%m-%Y")]
# shift per group (negative n means lead), then minus 1 day
dt1[, Date1 := shift(Date, -1) - 1, by = ID]
dt1
#    ID       Date      Date1
# 1:  1 2017-10-23 2018-09-23
# 2:  1 2018-09-24 2019-08-28
# 3:  1 2019-08-29       <NA>
# 4:  2 2016-05-30 2017-06-11
# 5:  2 2017-06-12 2018-07-12
# 6:  2 2018-07-13       <NA>
Try using lead():
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Date1 = lead(Date) - 1)
# A tibble: 6 x 3
# Groups: ID [2]
ID Date Date1
<int> <date> <date>
1 1 2017-10-23 2018-09-23
2 1 2018-09-24 2019-08-28
3 1 2019-08-29 NA
4 2 2016-05-30 2017-06-11
5 2 2017-06-12 2018-07-12
6 2 2018-07-13 NA
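The same lead-then-subtract idea also works in base R, shown here as a sketch on the x data frame from the first answer (ave() strips the Date class, so it has to be restored):
x$Date1 <- as.Date(ave(as.numeric(x$Date), x$ID,
                       FUN = function(d) c(d[-1], NA)) - 1,
                   origin = "1970-01-01")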

How to convert a single date column into three individual columns (y, m, d)?

I have a large dataset with thousands of dates in the ymd format. I want to convert this column into three individual columns for year, month, and day. There are literally thousands of dates, so I am trying to do this with a single command for the entire dataset.
You can use the year(), month(), and day() extractors in lubridate for this. Here's an example:
library('dplyr')
library('tibble')
library('lubridate')
## create some data
df <- tibble(date = seq(ymd(20190101), ymd(20191231), by = '7 days'))
which yields
> df
# A tibble: 53 x 1
date
<date>
1 2019-01-01
2 2019-01-08
3 2019-01-15
4 2019-01-22
5 2019-01-29
6 2019-02-05
7 2019-02-12
8 2019-02-19
9 2019-02-26
10 2019-03-05
# … with 43 more rows
Then mutate df using the relevant extractor function:
df <- mutate(df,
             year = year(date),
             month = month(date),
             day = day(date))
This results in:
> df
# A tibble: 53 x 4
date year month day
<date> <dbl> <dbl> <int>
1 2019-01-01 2019 1 1
2 2019-01-08 2019 1 8
3 2019-01-15 2019 1 15
4 2019-01-22 2019 1 22
5 2019-01-29 2019 1 29
6 2019-02-05 2019 2 5
7 2019-02-12 2019 2 12
8 2019-02-19 2019 2 19
9 2019-02-26 2019 2 26
10 2019-03-05 2019 3 5
# … with 43 more rows
If you only want the new three columns, use transmute() instead of mutate().
Using lubridate but without having to specify a separator:
library(tidyverse)
df <- tibble(d = c('2019/3/18','2018/10/29'))
df %>%
  mutate(
    date = lubridate::ymd(d),
    year = lubridate::year(date),
    month = lubridate::month(date),
    day = lubridate::day(date)
  )
Note that you can change the first entry from ymd to fit other formats.
A slightly different tidyverse solution that requires less code could be:
Code
library(dplyr)
library(lubridate)
tibble(date = "2018-05-01") %>%
  mutate_at(vars(date), lst(year, month, day))
Result
# A tibble: 1 x 4
date year month day
<chr> <dbl> <dbl> <int>
1 2018-05-01 2018 5 1
#Data
d = data.frame(date = c("2019-01-01", "2019-02-01", "2012/03/04"))
library(lubridate)
cbind(d,
      read.table(header = FALSE,
                 sep = "-",
                 text = as.character(ymd(d$date))))
# date V1 V2 V3
#1 2019-01-01 2019 1 1
#2 2019-02-01 2019 2 1
#3 2012/03/04 2012 3 4
OR
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
  mutate(date2 = as.character(ymd(date))) %>%
  separate(date2, c("year", "month", "day"), "-")
# date year month day
#1 2019-01-01 2019 01 01
#2 2019-02-01 2019 02 01
#3 2012/03/04 2012 03 04
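For completeness, a base-R sketch with no lubridate at all, assuming the column has already been parsed as Date:
dd <- as.Date(c("2019-01-01", "2019-02-01"))
data.frame(date = dd,
           year = as.integer(format(dd, "%Y")),
           month = as.integer(format(dd, "%m")),
           day = as.integer(format(dd, "%d")))
#         date year month day
# 1 2019-01-01 2019     1     1
# 2 2019-02-01 2019     2     1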

R fuzzyjoin on most recent previous record

I want to join two tables A and B by ID and find in B the most recent date that precedes A's date.
After some searching it seems that fuzzyjoin allows joining on date ranges:
library(fuzzyjoin)
fuzzy_left_join(A, B,
                by = c("ID" = "ID",
                       "date" = "date"),
                match_fun = list("==", "<"))
The problem is that this will return many records (if they exist in B), when I just want the most recent. Any idea how to proceed?
EDIT:
A <- data.frame(ID = c(1, 2, 3),
                date = c('2019-04-03', '2019-05-13', '2019-05-27'))
B <- data.frame(ID = c(1, 1, 2, 3, 4),
                date = c('2018-01-01', '2019-01-01', '2019-02-20',
                         '2019-06-01', '2019-01-01'),
                value = c(1, 1.5, 1.2, 3.7, 4))
> A
ID date
1 1 2019-04-03
2 2 2019-05-13
3 3 2019-05-27
> B
ID date value
1 1 2018-01-01 1.0
2 1 2019-01-01 1.5
3 2 2019-02-20 1.2
4 3 2019-06-01 3.7
5 4 2019-01-01 4.0
Expected output :
ID date value
1 1 2019-04-03 1.5
2 2 2019-05-13 1.2
3 3 2019-05-27 NA
Another option is a rolling join from data.table: with roll = Inf, each row of A is matched to the B row having the largest date that is still less than or equal to A's date.
library(data.table)
setDT(A)[, date := as.Date(date)]
setDT(B)[, date := as.Date(date)]
B[A, on = .(ID, date), roll = Inf]
output:
ID date value
1: 1 2019-04-03 1.5
2: 2 2019-05-13 1.2
3: 3 2019-05-27 NA
We can do the fuzzy join, group by A's ID, and then keep only the last row of each group, which is the most recent match here because B is sorted by date:
library(dplyr)
library(fuzzyjoin)
fuzzy_left_join(A, B,
                by = c("ID" = "ID", "date" = "date"),
                match_fun = list(`==`, `>`)) %>%
  group_by(ID.x) %>%
  slice(n()) %>%
  select(-ends_with('y')) %>%
  rename_at(vars(ends_with('x')), ~ gsub('\\.x$', '', .))
# A tibble: 3 x 3
# Groups: ID.x [3]
ID date value
<dbl> <date> <dbl>
1 1 2019-04-03 1.5
2 2 2019-05-13 1.2
3 3 2019-05-27 NA
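If you have dplyr 1.1.0 or later available (an assumption about your setup), this can also be expressed as a non-equi join with join_by() and its closest() helper, with no fuzzyjoin needed:
library(dplyr)
A$date <- as.Date(A$date)
B$date <- as.Date(B$date)
# for each row of A, keep the single B row with the closest date <= A's date
left_join(A, B, by = join_by(ID, closest(date >= date)),
          suffix = c("", ".B"))
The B-side date comes back as a date.B column; drop it if you only need value.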
