I want to insert rows between two dates by group. My current way of doing it is complicated: I insert the missing values by last observation carried forward and then merge. I was wondering whether there is an easier way to achieve this.
# sample data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
dt<-data.frame(user,dummy,date)
user dummy date
1 A 1 2017-01-03
2 A 1 2017-01-06
3 B 1 2016-05-01
4 B 1 2016-05-03
5 B 1 2016-05-05
Desired output
Using dplyr and tidyr, a one-line solution:
library(dplyr)
library(tidyr)
dt %>% group_by(user) %>% complete(date=full_seq(date,1),fill=list(dummy=0))
# A tibble: 9 x 3
# Groups: user [2]
user date dummy
<fctr> <date> <dbl>
1 A 2017-01-03 1
2 A 2017-01-04 0
3 A 2017-01-05 0
4 A 2017-01-06 1
5 B 2016-05-01 1
6 B 2016-05-02 0
7 B 2016-05-03 1
8 B 2016-05-04 0
9 B 2016-05-05 1
You can try this with data.table:
library(data.table)
setDT(dt)
tmp <- dt[, .(date = seq.Date(min(date), max(date), by = '1 day')), by = 'user']
dt <- merge(tmp, dt, by = c('user', 'date'), all.x = TRUE)
dt[, dummy := ifelse(is.na(dummy), 0, dummy)]
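As a side note, the last step can also be written as a by-reference update on just the NA rows, which is the more common data.table idiom:
dt[is.na(dummy), dummy := 0]  # replace NA dummies with 0 in place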
We can use the tidyverse to achieve this task.
library(tidyverse)
dt2 <- dt %>%
  group_by(user) %>%
  do(date = seq(from = min(.$date), to = max(.$date), by = 1)) %>%
  unnest() %>%
  left_join(dt, by = c("user", "date")) %>%
  replace_na(list(dummy = 0)) %>%
  select(colnames(dt))
dt2
# A tibble: 9 x 3
user dummy date
<fctr> <dbl> <date>
1 A 1 2017-01-03
2 A 0 2017-01-04
3 A 0 2017-01-05
4 A 1 2017-01-06
5 B 1 2016-05-01
6 B 0 2016-05-02
7 B 1 2016-05-03
8 B 0 2016-05-04
9 B 1 2016-05-05
The simplest way that I have found to do this is with the padr library.
library(padr)
library(dplyr)  # for the pipe
library(tidyr)  # for replace_na()

dt_padded <- pad(dt, group = "user", by = "date") %>%
  replace_na(list(dummy = 0))
A Base R (not quite as elegant) solution:
# Data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
df1 <-data.frame(user,dummy,date)
# Solution
do.call(rbind, lapply(split(df1, df1$user), function(df) {
  dff <- data.frame(user = df$user[1], dummy = 0,
                    date = seq.Date(min(df$date), max(df$date), 'day'))
  dff[dff$date %in% df$date, "dummy"] <- df$dummy[1]
  dff
}))
# user dummy date
# A 1 2017-01-03
# A 0 2017-01-04
# A 0 2017-01-05
# A 1 2017-01-06
# B 1 2016-05-01
# B 0 2016-05-02
# B 1 2016-05-03
# B 0 2016-05-04
# B 1 2016-05-05
Assuming your data is called df1 and you want to add all the dates between two days, try this:
library(dplyr)
df2 <- data.frame(date = seq.Date(as.Date("2015-01-03"), as.Date("2015-01-06"), by = "day"))
left_join(df2, df1, by = "date")
If you're simply trying to add a single new record, rbind() works as well.
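For example, appending one extra row to dt might look like this (the new date and dummy value are made up purely for illustration):
new_row <- data.frame(user = "A", dummy = 0, date = as.Date("2017-01-07"))  # hypothetical record
dt <- rbind(dt, new_row)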
I am trying to count how many consecutive previous days a value appeared, across two columns. In the example this means counting the consecutive days a team made an appearance (in either Hteam or Ateam) prior to that date. The aim is to produce additional columns for both the home and away teams showing these counts.
Test data:
data<- data.frame(
Date= c("2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"),
Hteam= c("A","D","B","A","C","A"),
Ateam= c("B","C","A","C","B","C"))
Date Hteam Ateam
1 2018-01-01 A B
2 2018-01-01 D C
3 2018-01-02 B A
4 2018-01-03 A C
5 2018-01-04 C B
6 2018-01-05 A C
The aim would end up looking like:
Date Hteam Ateam Hdays Adays
1 2018-01-01 A B 0 0
2 2018-01-01 D C 0 0
3 2018-01-02 B A 1 1
4 2018-01-03 A C 2 0
5 2018-01-04 C B 1 0
6 2018-01-05 A C 0 2
In my searching I haven't found an example close enough to adapt to this situation. I feel like I should be using rollapply or a dplyr grouping, but I can't get close to a solution.
Thanks.
Maybe the following gives what you want, assuming the data is sorted by Date and missing days are not considered.
# all teams appearing anywhere in the data
t1 <- unique(unlist(data[-1]))
# for each date, a logical vector: did each team in t1 play that day?
t2 <- do.call(rbind, lapply(split(data[-1], data$Date), function(x) t1 %in% unlist(x)))
# per team, the length of the current run of consecutive appearance days, minus 1
t3 <- apply(t2, 2, function(x) ave(x, cumsum(!x), FUN = cumsum)) - 1
# look up the count for each row's home and away team
data.frame(data,
           Hdays = t3[cbind(match(data$Date, rownames(t3)), match(data$Hteam, t1))],
           Adays = t3[cbind(match(data$Date, rownames(t3)), match(data$Ateam, t1))])
# Date Hteam Ateam Hdays Adays
#1 2018-01-01 A B 0 0
#2 2018-01-01 D C 0 0
#3 2018-01-02 B A 1 1
#4 2018-01-03 A C 2 0
#5 2018-01-04 C B 1 0
#6 2018-01-05 A C 0 2
I think your expected output is incorrect. Namely, row 5's "C" occurs twice above it, but has a 1.
Here's a tidyverse version:
library(dplyr)
library(tidyr)
data %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-c(Date, rn), names_to = "x", values_to = "team") %>%
  mutate(x = gsub("team$", "", x)) %>%
  group_by(team) %>%
  mutate(days = row_number() - 1) %>%
  ungroup() %>%
  pivot_wider(c(Date, rn), names_from = x, values_from = c(team, days)) %>%
  select(-rn)
# # A tibble: 6 x 5
# Date team_H team_A days_H days_A
# <chr> <chr> <chr> <dbl> <dbl>
# 1 2018-01-01 A B 0 0
# 2 2018-01-01 D C 0 0
# 3 2018-01-02 B A 1 1
# 4 2018-01-03 A C 2 1
# 5 2018-01-04 C B 2 2
# 6 2018-01-05 A C 3 3
I have the following situation.
library(lubridate)  # months(1) below relies on lubridate's period arithmetic

df <- rbind(
  data.frame(thisDate = seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by = "day")),
  data.frame(thisDate = seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by = "day")))
df <- cbind(df, lastMonth = as.Date(format(as.Date(df$thisDate - months(1)), "%Y-%m-01")))
df <- cbind(df, prod1Quantity = 1:4)
I have quantities for different days of a month for an unknown number of products. I want one column for every product containing that product's total quantity over the whole previous month, so the output would look like the one below, i.e. the previous month's total joined back on via lastMonth. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across data.table generate multiple columns and summarize them, which appears to do what I need, but I just don't understand how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
  mutate(thisDate_forJoin = as.Date(format(thisDate, "%Y-%m-01")))

final_df <- temp_df %>%
  mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
  left_join(temp_df %>%
              group_by(thisDate_forJoin) %>%
              summarise_if(is.numeric, sum),
            by = "thisDate_forJoin") %>%
  select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise product quantity and then joining on the month of lastMonth and thisDate. The lubridate::month function is used to extract the month from a date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
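One caveat worth noting: grouping on month() alone pools the same calendar month from different years. If the data spanned more than one year, grouping (and joining) on the first day of the month instead would keep the years apart; a sketch of the changed grouping step:
df %>% group_by(month = floor_date(thisDate, "month")) %>%
  summarise(prodQuantMonth = sum(prod1Quantity))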
Given a data.frame that contains a time series and one or more grouping fields, we have several time series - one for each grouping combination.
But some dates are missing.
So, what's the easiest (most "tidyverse") way of adding these dates with the right grouping values?
Normally I would say I generate a data.frame with all dates and do a full_join with my time series. But now we have to do it for each combination of grouping values -- and fill in the grouping values.
Let's look at an example:
First I create a data.frame with missing values:
library(dplyr)
library(lubridate)
set.seed(1234)
# The time series should run from 2017-01-01 to 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1 <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2 <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)
# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
select(date, d1, d2)
# and add two value columns
df$v1 <- runif(200)
df$v2 <- runif(200)
# group by the dimension columns
df <- df %>%
group_by(d1, d2)
# create missing dates
df.missing <- df %>%
filter(v1 <= 0.8)
# So 2017-01-01 and 2017-01-10 are now missing for the group A, 5
df.missing %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 8 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
So to add the missing dates I generate a data.frame with all dates:
start <- min(df.missing$date)
end <- max(df.missing$date)
all.dates <- data.frame(date=seq.Date(start, end, by="day"))
Now I want to do something like this (remember: df.missing is grouped by d1 and d2):
df.missing %>%
  do(my_join(.))
So let's define my_join():
my_join <- function(data) {
  # get the value of both dimensions
  d1.set <- data$d1[[1]]
  d2.set <- data$d2[[1]]

  tmp <- full_join(data, all.dates) %>%
    # First we need to ungroup. Otherwise we can't change d1 and d2 because they are grouping variables
    ungroup() %>%
    mutate(
      d1 = d1.set,
      d2 = d2.set
    ) %>%
    group_by(d1, d2)

  return(tmp)
}
Now we can call my_join() for each combination and have a look at "A/5"
df.missing %>%
do(my_join(.)) %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 10 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
9 2017-01-01 A 5 NA NA
10 2017-01-10 A 5 NA NA
Great! That's what we were looking for.
But we need to define d1 and d2 in my_join and it feels a little bit clumsy.
So, is there a more tidyverse way of doing this?
P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e
tidyr has some great tools for these sorts of problems. Take a look at complete.
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
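If zeros rather than NA are wanted in the newly created rows, complete() also accepts a fill argument; a sketch keeping everything else the same:
df.missing %>%
  ungroup() %>%
  complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"),
           fill = list(v1 = 0, v2 = 0))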
The tsibble package's fill_gaps() function should do the job easily.
library(tsibble)
df.missing %>%
# tsibble format
as_tsibble(key = c(d1, d2), index = date) %>%
# fill gaps
fill_gaps(.full = TRUE)
Here's a tidyverse way starting with df.missing
library(tidyverse)
ans <- df.missing %>%
  nest(date) %>%
  mutate(data = map(data, ~seq.Date(start, end, by = "day"))) %>%
  unnest(data) %>%
  rename(date = data) %>%
  left_join(., df.missing, by = c("date", "d1", "d2"))
ans %>% filter(d1 == "A" & d2 == 5)
Output
d1 d2 date v1 v2
<fctr> <dbl> <date> <dbl> <dbl>
1 A 5 2017-01-01 NA NA
2 A 5 2017-01-02 0.21879954 0.1335497
3 A 5 2017-01-03 0.32977018 0.9802127
4 A 5 2017-01-04 0.23902573 0.1206089
5 A 5 2017-01-05 0.19617465 0.7378315
6 A 5 2017-01-06 0.13373890 0.9493668
7 A 5 2017-01-07 0.48613541 0.3392834
8 A 5 2017-01-08 0.35698708 0.3696965
9 A 5 2017-01-09 0.08498474 0.8354756
10 A 5 2017-01-10 NA NA
Here's an alternative approach that uses expand.grid and dplyr verbs
with(df.missing, expand.grid(unique(date), unique(d1), unique(d2))) %>%
setNames(c("date", "d1", "d2")) %>%
left_join(., df.missing, by=c("date","d1","d2"))
output (head)
date d1 d2 v1 v2
1 2017-01-01 A 1 0.113703411 0.660754634
2 2017-01-02 A 1 0.316612455 0.422330675
3 2017-01-03 A 1 0.553333591 0.424109178
4 2017-01-04 A 1 NA NA
5 2017-01-05 A 1 NA NA
6 2017-01-06 A 1 0.035456727 0.352998502
Here read.zoo creates a wide-form zoo object, and we merge the missing dates onto it. Then we convert that back to a long data frame using fortify.zoo and spread out v1 and v2 using spread.
Note that:
if we can assume that each date appears in at least one combination of the split variables, i.e. sort(unique(df.missing$date)) contains all the dates, then we could omit the merge line and no joins would have to be done at all. The test data df.missing shown in the question does have this property:
all(all.dates$date %in% df.missing$date)
## [1] TRUE
we could stop after the merge (or after read.zoo if each date is present at least once as in prior point) if a wide form zoo object can be used as that already has all the dates.
In the code below the line marked ### can be omitted with the development version of zoo (1.8.1):
library(dplyr)
library(tidyr)
library(zoo)
split.vars <- c("d1", "d2")
df.missing %>%
  as.data.frame %>%   ###
  read.zoo(split = split.vars) %>%
  merge(zoo(, seq(start(.), end(.), "day"))) %>%
  fortify.zoo(melt = TRUE) %>%
  separate(Series, c("v", split.vars)) %>%
  spread(v, Value)
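For comparison, the shortened wide-form variant described in the notes above (stopping after the merge) might look like this sketch, using the same df.missing and split.vars:
z <- df.missing %>%
  as.data.frame %>%
  read.zoo(split = split.vars)
# add the missing dates as all-NA rows and keep the wide zoo object
z <- merge(z, zoo(, seq(start(z), end(z), "day")))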
Update: Note the simplification in zoo 1.8.1.
I need to fill $Year with the missing values of the sequence, within each level of the factor $Country. The $Count column can just be padded out with 0s.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
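For example, if the data also had a second value column, only the defaults list needs to grow (the Sales column here is hypothetical):
def = list(Count = 0L, Sales = 0)  # the remaining steps stay exactly the same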
Another base R idea is to split on Country, use setdiff to find the missing values from seq(max(Year)), and rbind them to the original data frame. Then use do.call to rbind the list back into a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
  x <- rbind(i, data.frame(Country = i$Country[1],
                           Year = setdiff(seq(max(i$Year)), i$Year),
                           Count = 0));
  x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
setkey(DT, Country, Year)
DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
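If zeros are wanted instead of the NAs, one more by-reference update finishes the job (a sketch, assuming the join result is assigned to res):
res <- DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
res[is.na(Count), Count := 0]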
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges this onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
I have a data set of this format
Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09
I need to find the average difference between the dates of those orders which have been placed more than once, i.e., Frequency > 1.
require(dplyr)
# loading the data
df0 <- read.table(text =
'Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09',
stringsAsFactors = F,
header = T)
# putting the date in the right format
df0$Order_Dt <- as.Date(df0$Order_Dt)
# obtaining the averages (grouping before the lag so the differences stay within each order)
df0 %>% filter(Frequency > 1) %>%
  arrange(., Order_Name, Order_Dt) %>%
  group_by(Order_Name) %>%
  mutate(diff_date = Order_Dt - lag(Order_Dt)) %>%
  summarise(avg_days = mean(diff_date, na.rm = T))
# A tibble: 2 × 2
  Order_Name avg_days
  <chr>      <time>
1 A          102 days
2 C           79 days
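If plain numbers are preferred over the difftime printout, the mean can be wrapped in as.numeric(), e.g. replacing the final summarise() call with:
summarise(avg_days = as.numeric(mean(diff_date, na.rm = TRUE)))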