df is my current dataset, and I want to insert every date from 1 Jan 2020 to 4 Jan 2020 for each location.
df<-data.frame(location=c("x","x","y"),date=c("2020-01-01","2020-01-04","2020-01-03"))
This is what my expected dataset looks like.
expected_df<-data.frame(location=c("x","x","x","x","y","y","y","y"),date=c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03","2020-01-04"))
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04
We can use complete from tidyr
library(dplyr)
library(tidyr)
start <- as.Date('2020-01-01')
end <- as.Date('2020-01-04')
df %>%
mutate(date = as.Date(date)) %>%
complete(location, date = seq(start, end, by = "1 day"))
# location date
# <fct> <date>
#1 x 2020-01-01
#2 x 2020-01-02
#3 x 2020-01-03
#4 x 2020-01-04
#5 y 2020-01-01
#6 y 2020-01-02
#7 y 2020-01-03
#8 y 2020-01-04
Set stringsAsFactors = FALSE in your data frame so those values are not converted to factors (this matters on R versions before 4.0.0; from 4.0.0 on it is the default).
df <- data.frame(location=c("x","x","y"), date=c("2020-01-01","2020-01-04","2020-01-03"), stringsAsFactors = FALSE)
'['(
expand.grid(
date = seq.Date(from=min(as.Date(df$date)), to=max(as.Date(df$date)), by = "day"),
location = unique(df$location)
),
c(2,1)
)
Output
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04
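If the data frame carries further columns that should survive the expansion, the grid of all location/date pairs can be joined back onto the original rows. The following is only a minimal base R sketch of that idea; the `sales` column and the object names `obs`, `grid` and `full` are hypothetical and not part of the question.

```r
# Hypothetical input: same shape as df, plus an illustrative `sales` column
obs <- data.frame(
  location = c("x", "x", "y"),
  date     = as.Date(c("2020-01-01", "2020-01-04", "2020-01-03")),
  sales    = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Every location/date combination in the requested window
grid <- expand.grid(
  location = unique(obs$location),
  date     = seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "day"),
  stringsAsFactors = FALSE
)

# Left join: combinations absent from `obs` get NA in `sales`
full <- merge(grid, obs, by = c("location", "date"), all.x = TRUE)
full <- full[order(full$location, full$date), ]
```

Rows that were not in the original data end up with NA in the extra column, which can then be replaced or kept as needed.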
I am trying to transform a dataset that has multiple product sales per date. In the end I want to keep one row per date, with the product sales summed per day.
My MRE:
df <- data.frame(created = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"), "%Y-%m-%d", tz = "GMT"),
soldUnits = c(1, 1, 1, 1, 1, 1),
Weekday = c("Mo","Mo","Tu","Tu","Th","Th"),
Sunshinehours = c(7.8,7.8,6.0,6.0,8.0,8.0))
Which looks like this:
created soldUnits Weekday Sunshinehours
2020-01-01 1 Mo 7.8
2020-01-01 1 Mo 7.8
2020-01-02 1 Tu 6.0
2020-01-02 1 Tu 6.0
2020-01-03 1 Th 8.0
2020-01-03 1 Th 8.0
And should look like this after transforming:
created soldUnits Weekday Sunshinehours
2020-01-01 2 Mo 7.8
2020-01-02 2 Tu 6.0
2020-01-03 2 Th 8.0
I tried aggregate() and group_by but without success because my data was dropped.
Does anyone have an idea how I can transform and clean up my dataset according to the specifications I mentioned?
This can work:
library(tidyverse)
df %>%
group_by(created) %>%
count(Weekday, Sunshinehours, wt = soldUnits,name = "soldUnits")
#> # A tibble: 3 × 4
#> # Groups: created [3]
#> created Weekday Sunshinehours soldUnits
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Mo 7.8 2
#> 2 2020-01-02 Tu 6 2
#> 3 2020-01-03 Th 8 2
Created on 2021-12-04 by the reprex package (v2.0.1)
Applying different functions to different columns (or sets of columns) can be done with collap from the collapse package:
library(collapse)
collap(df, ~ created + Weekday,
custom = list(fmean = "Sunshinehours", fsum = "soldUnits"))
created soldUnits Weekday Sunshinehours
1 2020-01-01 2 Mo 7.8
2 2020-01-02 2 Tu 6.0
3 2020-01-03 2 Th 8.0
Another dplyr approach:
df %>%
group_by(created, Weekday, Sunshinehours) %>%
summarise(soldUnits = sum(soldUnits))
created Weekday Sunshinehours soldUnits
<date> <chr> <dbl> <dbl>
1 2020-01-01 Mo 7.8 2
2 2020-01-02 Tu 6 2
3 2020-01-03 Th 8 2
Using base R and dplyr
df1 = aggregate(df["Sunshinehours"], by=df["created"], mean)
df2 = aggregate(df["soldUnits"], by=df["created"], sum)
df3 = inner_join(df1, df2, by = "created")
# attach each date's own Weekday; levels() would sort the labels
# alphabetically (Mo, Th, Tu) and mislabel the last two dates
df3 = inner_join(df3, unique(df[c("created", "Weekday")]), by = "created")
created Sunshinehours soldUnits Weekday
1 2020-01-01 7.8 2 Mo
2 2020-01-02 6.0 2 Tu
3 2020-01-03 8.0 2 Th
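Since Weekday and Sunshinehours are constant within each date in this example, a single base R aggregate() call with all three columns on the grouping side of the formula should also keep them. This is a sketch under that assumption, using the names `sales` and `daily` for illustration. Note that the formula interface drops rows containing NA by default, which may be one reason data appears to go missing.

```r
# Data copied from the question's MRE
sales <- data.frame(
  created = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                      "2020-01-02", "2020-01-03", "2020-01-03")),
  soldUnits = c(1, 1, 1, 1, 1, 1),
  Weekday = c("Mo", "Mo", "Tu", "Tu", "Th", "Th"),
  Sunshinehours = c(7.8, 7.8, 6.0, 6.0, 8.0, 8.0)
)

# Sum soldUnits within each created/Weekday/Sunshinehours combination
daily <- aggregate(soldUnits ~ created + Weekday + Sunshinehours,
                   data = sales, FUN = sum)

# aggregate() orders rows by the grouping columns; re-sort by date
daily <- daily[order(daily$created), ]
```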
I have a simplified data frame like this
date state hour
2020-01-01 A 6
2020-01-01 B 3
2020-01-02 A 4
2020-01-02 B 3.5
2020-01-03 A 5
2020-01-03 B 2.5
For each date, there are two states. For each day, I want to calculate the ratio of hour for state A to hour for state B.
For example,
date ratio
2020-01-01 2
2020-01-02 1.143
2020-01-03 2
How do I get this result? Thank you!
With the help of match you can do:
library(dplyr)
df %>%
group_by(date) %>%
summarise(ratio = hour[match('A', state)]/hour[match('B', state)])
# date ratio
# <chr> <dbl>
#1 2020-01-01 2
#2 2020-01-02 1.14
#3 2020-01-03 2
You can use xtabs:
tt <- xtabs(hour ~ date + state, x)
data.frame(dimnames(tt)[1], ratio = tt[,1] / tt[,2])
# date ratio
#2020-01-01 2020-01-01 2.000000
#2020-01-02 2020-01-02 1.142857
#2020-01-03 2020-01-03 2.000000
Data:
x <- data.frame(date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03"), state = c("A", "B",
"A", "B", "A", "B"), hour = c(6, 3, 4, 3.5, 5, 2.5))
A data.table option
library(data.table)
setDT(df)[, .(ratio = Reduce(`/`, hour[order(state)])), date]
date ratio
1: 2020-01-01 2.000000
2: 2020-01-02 1.142857
3: 2020-01-03 2.000000
You can also use the following solution, although it is somewhat similar to the one posted by @Ronak Shah.
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = state, values_from = hour) %>%
group_by(date) %>%
summarise(ratio = A/B)
# A tibble: 3 x 2
date ratio
<chr> <dbl>
1 2020-01-01 2
2 2020-01-02 1.14
3 2020-01-03 2
I have a dataset with dates of implementation.
ID Date_implemented
345 2020-01-01
2 2020-01-01
67 2020-01-02
380 2020-01-02
9 2020-01-02
176 2020-01-03
I want to create a new column that alternates between 1 and -1 each time the date changes. For example, my dataset would look like this:
ID Date C
345 2020-01-01 1
2 2020-01-01 1
67 2020-01-02 -1
380 2020-01-02 -1
9 2020-01-02 -1
176 2020-01-03 1
I have tried with
rep(c(1,-1),length.out=length(Date))
but it does not give me the results above but rather alternates each row between 1 and -1.
Any idea?
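A tidyverse-based solution could look as follows. It keys on whether the last digit of the date is even, so it matches the expected output only when that parity happens to alternate between distinct dates, as it does for the consecutive days in this example.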
library(dplyr)
library(stringr)
df %>%
mutate(C = if_else(as.double(str_sub(Date_implemented, -1)) %% 2 == 0, -1, 1))
# ID Date_implemented C
# <dbl> <chr> <dbl>
# 1 345 2020-01-01 1
# 2 2 2020-01-01 1
# 3 67 2020-01-02 -1
# 4 380 2020-01-02 -1
# 5 9 2020-01-02 -1
# 6 176 2020-01-03 1
Data
df <- structure(list(ID = c(345, 2, 67, 380, 9, 176), Date_implemented = c("2020-01-01",
"2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Base R option:
grp <- cumsum(c(TRUE, (dat$Date_implemented[-1] != dat$Date_implemented[-nrow(dat)])))
grp
# [1] 1 1 2 2 2 3
dat$C <- ifelse(grp %% 2 == 1, 1, -1)
dat
# ID Date_implemented C
# 1 345 2020-01-01 1
# 2 2 2020-01-01 1
# 3 67 2020-01-02 -1
# 4 380 2020-01-02 -1
# 5 9 2020-01-02 -1
# 6 176 2020-01-03 1
If you want to stick with the rep() function, I believe this will give you what you want. Here df$Date refers to your date column.
rep(rep(c(1,-1), length.out = length(unique(df$Date))), times = table(df$Date))
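One caveat with counting via table() is that it orders its counts by value, so the result lines up with the rows only if the data are already sorted by date. A variant that works on runs of equal values in row order can be sketched with rle(); the vector `dates` below is a stand-in for the question's Date_implemented column, and `runs`, `signs` and `C` are illustrative names.

```r
# Stand-in for the question's Date_implemented column
dates <- c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-02", "2020-01-02", "2020-01-03")

# Run-length encoding: one entry per block of identical consecutive dates
runs <- rle(dates)

# One sign per run, alternating 1, -1, 1, ...
signs <- rep(c(1, -1), length.out = length(runs$lengths))

# Expand each sign to the length of its run
C <- rep(signs, times = runs$lengths)
```

Because rle() only looks at consecutive rows, a date that reappears later in unsorted data would start a new run and flip the sign again.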
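You can use lag to compare each value to the previous one and cumsum to count the changes; the parity of that running count gives the alternating 1 and -1.
library(dplyr)
df %>%
  mutate(grp = cumsum(Date_implemented != lag(Date_implemented, default = first(Date_implemented))),
         C = ifelse(grp %% 2 == 0, 1, -1))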
I have a data frame with a time window on each row. The time window is identified by a start_date and an end_date for each ID.
For each calendar day, I would like to know how many IDs have a time window spanning that day.
Example data
data <- data.frame(
id = c("A","B","C"),
start_date = as.POSIXct(c("2020-01-01 01:00:00", "2020-01-02 01:00:00", "2020-01-03 01:00:00")),
end_date = as.POSIXct(c("2020-01-04 01:00:00", "2020-01-03 01:00:00", "2020-01-06 01:00:00")),
stringsAsFactors = FALSE
)
data
id start_date end_date
1 A 2020-01-01 01:00:00 2020-01-04 01:00:00
2 B 2020-01-02 01:00:00 2020-01-03 01:00:00
3 C 2020-01-03 01:00:00 2020-01-06 01:00:00
The output I am looking for is to aggregate this into days with number of IDs present on each day.
day number_of_ids
2020-01-01 1
2020-01-02 2
2020-01-03 3
2020-01-04 2
2020-01-05 1
2020-01-06 1
Any help much appreciated.
We build the sequence of dates between each row's 'start_date' and 'end_date' as a list column, unnest that list column, then group by 'day' and count the distinct 'id' values with n_distinct in summarise.
library(dplyr)
library(purrr)
library(tidyr)
data %>%
transmute(id, day = map2(as.Date(start_date), as.Date(end_date),
~ seq(.x, .y, by = 'day'))) %>%
unnest(c(day)) %>%
group_by(day) %>%
summarise(number_of_ids = n_distinct(id))
# A tibble: 6 x 2
# day number_of_ids
# <date> <int>
#1 2020-01-01 1
#2 2020-01-02 2
#3 2020-01-03 3
#4 2020-01-04 2
#5 2020-01-05 1
#6 2020-01-06 1
In base R you could do:
a <- with(data, setNames(Map( function(x, y) format(seq(x,y,'1 day'), '%F'), start_date, end_date),id))
aggregate(ind~values, stack(a), length)
values ind
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 2
5 2020-01-05 1
6 2020-01-06 1
I'm attempting to create a separate table from my original data that contains all of the dates between two dates, which are stored in separate columns of my original table. I have done this successfully with a loop, but I am sure there's a more efficient way. The example data I have provided has only 3 rows, but the real data set I'm working with has > 500,000, so I can't afford inefficiency.
Example:
df <- data.frame(
id = c('A','B','C'),
fromDate = c('2020-01-01','2020-02-01','2020-03-05'),
toDate = c('2020-01-10','2020-02-03','2020-03-06')
)
#output
------------------------------
id fromDate toDate
---- ------------ ------------
A 2020-01-01 2020-01-10
B 2020-02-01 2020-02-03
C 2020-03-05 2020-03-06
------------------------------
#current solution
results <- data.frame(id = NULL,timespan = NULL)
for(i in 1:nrow(df)){
results <- rbind(
results,
data.frame(id = df$id[i], timespan = seq(as.Date(df$fromDate[i]),as.Date(df$toDate[i]),by = 'days'))
)
}
#results
-----------------
id timespan
---- ------------
A 2020-01-01
A 2020-01-02
A 2020-01-03
A 2020-01-04
A 2020-01-05
A 2020-01-06
A 2020-01-07
A 2020-01-08
A 2020-01-09
A 2020-01-10
B 2020-02-01
B 2020-02-02
B 2020-02-03
C 2020-03-05
C 2020-03-06
-----------------
Any suggestions on how to speed this up for scale?
This will probably be rather slow for such a large number of rows, regardless of how you do it. I'd try to avoid expanding the ranges at all if possible.
Anyway, you can use package data.table for efficient "apply-by-group":
library(data.table)
setDT(df)
df[, c("fromDate", "toDate") := lapply(.(fromDate, toDate), as.Date)]
results <- df[, seq(fromDate, toDate, by = "1 day"), by = id]
# id V1
# 1: A 2020-01-01
# 2: A 2020-01-02
# 3: A 2020-01-03
# 4: A 2020-01-04
# 5: A 2020-01-05
# 6: A 2020-01-06
# 7: A 2020-01-07
# 8: A 2020-01-08
# 9: A 2020-01-09
#10: A 2020-01-10
#11: B 2020-02-01
#12: B 2020-02-02
#13: B 2020-02-03
#14: C 2020-03-05
#15: C 2020-03-06
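Much of the cost of the loop in the question comes from growing `results` with rbind() on every iteration. As a rough base R sketch, the expansion can also be written without any per-row call by repeating each start date and adding day offsets. It assumes toDate is never earlier than fromDate; the names `spans` and `n_days` are illustrative.

```r
# Data copied from the question, with the date columns converted up front
spans <- data.frame(
  id       = c("A", "B", "C"),
  fromDate = as.Date(c("2020-01-01", "2020-02-01", "2020-03-05")),
  toDate   = as.Date(c("2020-01-10", "2020-02-03", "2020-03-06")),
  stringsAsFactors = FALSE
)

# Number of days covered by each row, endpoints included
# (assumes toDate >= fromDate on every row)
n_days <- as.integer(spans$toDate - spans$fromDate) + 1L

# Repeat each id and start date n_days times, then add offsets 0, 1, ..., n_days - 1
results <- data.frame(
  id       = rep(spans$id, times = n_days),
  timespan = rep(spans$fromDate, times = n_days) + (sequence(n_days) - 1L),
  stringsAsFactors = FALSE
)
```

Whether this beats the data.table version on 500,000 rows would need to be measured on the real data.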
Using dplyr, tidyr and padr, assuming the date columns are actually dates (otherwise convert them with as.Date first). Passing group = "id" makes pad fill the gaps within each id rather than across the whole date range.
library(dplyr)
library(tidyr)
library(padr)
df %>% pivot_longer(cols = c(fromDate, toDate), values_to = "timespan") %>%
select(-name) %>%
pad(interval = "day", group = "id")
# A tibble: 15 x 2
id timespan
<chr> <date>
1 A 2020-01-01
2 A 2020-01-02
3 A 2020-01-03
4 A 2020-01-04
5 A 2020-01-05
6 A 2020-01-06
7 A 2020-01-07
8 A 2020-01-08
9 A 2020-01-09
10 A 2020-01-10
# ... with 5 more rows
In terms of speed, the data.table answer by @Roland might be the better solution, since the result will run into millions of records.