Combine two data frames in R without repeated entries - r

I have two data frames containing row entries with respective dates. Data frame 1 contains observations collected from 2010 to 2017.
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2017-12-29 22
2017-12-30 32
2017-12-31 25
Data frame 2 contains observations collected from 2015 to 2020.
dates A
2015-01-01 20
2015-01-02 29
2015-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Both data frames have missing observations for some days. I wish to combine them to fill in the missing data and obtain a complete time series up to 2020 without any repeated entries, like the following data frame:
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Using merge(df1, df2, by = 'dates') or full_join(df1, df2, by = 'dates') creates duplicate entries or two columns A.x and A.y which is not expected.

Try the code below
dfout <- unique(rbind(df1,df2))
dfout <- dfout[order(dfout$dates),]
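Note that unique(rbind(...)) only drops rows that are identical in every column, and it does not add rows for days missing from both inputs. A minimal base R sketch of adding those days, assuming a left join onto a daily calendar is acceptable; the toy values and the calendar name are made up for illustration:

```r
# Toy data (hypothetical values); 2015-01-02 appears in both with the same A
df1 <- data.frame(dates = as.Date(c("2015-01-01", "2015-01-02")), A = c(20, 29))
df2 <- data.frame(dates = as.Date(c("2015-01-02", "2015-01-04")), A = c(29, 34))

# Stack the frames and drop rows that are exact duplicates
dfout <- unique(rbind(df1, df2))

# Build a complete daily calendar and left-join the data onto it,
# so days missing from both inputs show up with A = NA
calendar <- data.frame(dates = seq(min(dfout$dates), max(dfout$dates), by = "day"))
dfout <- merge(calendar, dfout, by = "dates", all.x = TRUE)
```

Here 2015-01-03 is absent from both inputs, so it appears in dfout with A set to NA.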

Combine df1 and df2. If a date appears in both data frames, take the mean of its A values, then use complete to fill in the missing dates.
library(dplyr)
library(tidyr)
df1 %>%
bind_rows(df2) %>%
mutate(dates = as.Date(dates)) %>%
group_by(dates) %>%
summarise(A = mean(A)) %>%
complete(dates = seq(min(dates), max(dates), by = 'day'))

If your df is really just two columns, you should be able to bind_rows, group_by, and distinct to remove duplicates.
library(dplyr)
df <- bind_rows(df1, df2) %>%
group_by(dates, A) %>%
distinct(dates)
Edit: This will not work if you have data that doesn't agree between the dataframes on a single date. If you have two records for 1/1/15 and they have different A values, those will both be retained.
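If the two data frames can disagree on a date, one possible way around this is to keep a single row per date with distinct(dates, .keep_all = TRUE). This is a sketch under the assumption that the first data frame's value should win; the toy values are hypothetical:

```r
library(dplyr)

# Toy data (hypothetical values); the two frames disagree on 2015-01-02
df1 <- data.frame(dates = as.Date(c("2015-01-01", "2015-01-02")), A = c(20, 29))
df2 <- data.frame(dates = as.Date(c("2015-01-02", "2015-01-03")), A = c(31, 34))

df <- bind_rows(df1, df2) %>%
  distinct(dates, .keep_all = TRUE) %>%  # keeps the first row seen for each date
  arrange(dates)
```

Because df1 is bound first, its value (29) is the one retained for 2015-01-02. Swap the order in bind_rows if df2 should take precedence.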

Related

Is there an R function for finding a list of all dates between two values, then inserting them as rows?

I have a dataframe in the following format:
Contract_Begin Contract_End FP
2020-01-01 2020-01-31 5
2020-01-01 2020-03-31 6
If Contract_End - Contract_Begin is more than 1 month, I want to insert the additional months as rows below. Here is the desired output.
Contract_Begin Contract_End FP
2020-01-01 2020-01-31 5
2020-01-01 6
2020-02-01 6
2020-03-01 6
Trying to accomplish in R as a part of pre data processing. Any help is greatly appreciated.
We can use map2 to get the sequence of dates from 'Contract_Begin','Contract_End' and then unnest the listcolumn created by map2 and expand the rows
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
mutate_at(1:2, as.Date) %>%
mutate(Contract_Begin = map2(Contract_Begin, Contract_End, seq,
by = "1 month")) %>%
unnest(c(Contract_Begin))
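The same expansion can be sketched without purrr or tidyr. This is only an illustration in base R, built from the example data in the question; it drops Contract_End from the result rather than leaving it blank, and the name expanded is made up:

```r
# Example data from the question
df1 <- data.frame(
  Contract_Begin = as.Date(c("2020-01-01", "2020-01-01")),
  Contract_End   = as.Date(c("2020-01-31", "2020-03-31")),
  FP             = c(5, 6)
)

# For each contract, generate one row per month start between begin and end
expanded <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  data.frame(
    Contract_Begin = seq(df1$Contract_Begin[i], df1$Contract_End[i], by = "1 month"),
    FP             = df1$FP[i]
  )
}))
```

The one-month contract stays as a single row, while the three-month contract becomes three rows starting on the first of January, February and March.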

How do I create a daily time series using data that isn't taken daily

I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to bring the last observation forward like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created one with just the actual data, then another with the last observation brought forward.
Thanks for all the help!
Edit: I will also need any NAs that are populated to be changed to 0.
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
I've used as a starting point the 1st day of the month and year of minimum date, and maximum the maximum date; this can be of course adjusted as needed.
EDIT: @Sotos has an even better suggestion for a more concise approach (making better use of the format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
1. Create a data.frame of successive dates.
2. Merge it with your original data.frame.
3. Use the na.locf function from zoo to carry your data forward.
Here is the code. I use lubridate to work with date.
library(lubridate)
library(zoo) # provides as.yearmon, used below
df$Date <- mdy(df$Date)
successive <- data.frame(Date = seq(as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by = "days"))
successive is a data frame of successive dates. Now the merging:
result <- merge(df, successive, all.y = TRUE, by = "Date")
And the forward propagation:
library(zoo)
result$Data <- na.locf(result$Data,na.rm = F)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
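For readers who prefer not to depend on zoo or lubridate, the merge and carry-forward steps can be sketched in base R. This is an illustration only: the names grid, actual, filled and idx are made up, and it assumes the series should start on the first day of the first observed month. It produces the two tables the question asks for.

```r
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = TRUE, stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Daily grid from the first of the first month to the last observed date
first_day <- as.Date(format(min(df$Date), "%Y-%m-01"))
grid <- data.frame(Date = seq(first_day, max(df$Date), by = "day"))

# Table 1: observed values only, NA on days without a measurement
actual <- merge(grid, df, by = "Date", all.x = TRUE)

# Table 2: last observation carried forward.
# idx holds, for each row, the position of the most recent non-NA value
# (0 if there is none yet); the leading NA in c(NA, ...) absorbs the zeros.
idx <- cummax(seq_along(actual$Data) * !is.na(actual$Data))
filled <- actual
filled$Data <- c(NA, actual$Data)[idx + 1]

# Days before the first observation have nothing to carry forward; set them to 0
filled$Data[is.na(filled$Data)] <- 0
```

If the NAs in the first table should also become 0, the same is.na replacement can be applied to actual$Data.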
Assume that the result should start at the first of the month of the first date and end at the last date, and that the input data frame is DF, shown reproducibly in the Note at the end. Convert DF to a zoo object z and create a grid of dates g. Merge them to give zoo objects z0 (zero filled) and zz (na.locf filled). Optionally convert back to data frames, or leave the results as zoo objects for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables and if that refers to the data.table package then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
Variations
1. I wasn't clear on whether the question was asking for a zero filled answer and an na.locf filled answer or just an na.locf filled answer whose remaining NA values are 0 filled but assumed the former case. If you want to fill the NAs that are left in the na.locf filled answer then add:
zz[is.na(zz)] <- 0
2. If you want to end at the end of the last month rather than at the last date, replace end(z) with as.Date(as.yearmon(end(z)), frac = 1).
3. If you want to start at the first date rather than the first of the month of the first date, replace as.Date(as.yearmon(start(z))) with start(z).
4. As an alternative to (3), to start at the first date and end at the last date we could simply convert to ts and back. Note that we need to restore Date class on the second line below since ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)

Counting and grouping with dplyr

My goal is simply to count the number of records in each hour of each day. I thought a simple solution could be found with the dplyr or data.table packages:
My data set is extremely simple:
> head(test)
id date hour
1 14869663 2018-01-24 17
2 14869664 2018-01-24 17
3 14869665 2018-01-24 17
4 14869666 2018-01-24 17
5 14869667 2018-01-24 17
6 14869668 2018-01-24 17
I only need to group by two variables (date and hour) and count. The id doesn't matter. However, these two methods in dplyr do not seem to produce the desired result (a data frame of the same length of the input data, which includes millions of records, is the output). What am I doing wrong here?
test %>% group_by(date, hour) %>% mutate(count = n())
test %>% add_count(date, hour)
The output would look something like this
> head(output)
n_records date hour
1 700 2018-01-24 0
2 750 2018-01-24 1
3 730 2018-01-24 2
4 700 2018-01-24 3
5 721 2018-01-24 4
6 753 2018-01-24 5
and so on
any suggestions?
This seems to do the trick:
library(dplyr)
starwars %>%
group_by(gender, species) %>%
count
It appears (h/t to Frank) that the count function can take the grouping fields directly:
starwars %>% count(gender, species)
using data.table,
test[, .N, by=.(date, hour)]
Base
aggregate(name ~ gender + species, data = starwars, length)
If we want to treat NAs as a group:
species1 <- factor(starwars$species, exclude = "")
gender1 <- factor(starwars$gender, exclude = "")
aggregate(name ~ gender1 + species1, data = starwars, length)
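Applied to the shape of data in the question, the difference is that mutate and add_count keep every input row, while summarise and count collapse each group to one row. A small sketch with made-up rows standing in for the real test data:

```r
library(dplyr)

# Toy stand-in for the question's data (hypothetical values)
test <- data.frame(
  id   = 1:5,
  date = as.Date(c("2018-01-24", "2018-01-24", "2018-01-24", "2018-01-25", "2018-01-25")),
  hour = c(17, 17, 18, 9, 9)
)

# summarise() returns one row per (date, hour) group
per_hour <- test %>%
  group_by(date, hour) %>%
  summarise(n_records = n(), .groups = "drop")

# count() is shorthand for the same thing; name sets the output column
per_hour2 <- test %>% count(date, hour, name = "n_records")
```

Both results have three rows here, one per distinct date and hour combination. The .groups and name arguments assume a reasonably recent dplyr (1.0 or later).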

Convert dplyr chain into a function

Given a column of dates, this will count the number of records in each month
library(dplyr)
library(lubridate)
samp <- tbl_df(seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day"))
freq <- samp %>%
filter(!is.na(value)) %>%
transmute(month = floor_date(value, "month")) %>%
group_by(month) %>% summarise(adds = n())
freq
# A tibble: 12 x 2
month adds
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
>
I would like to convert this to a function, so that I can perform the operation on a number of variables. Have read the vignette on dplyr programming, but continue to have issues.
My attempt;
library(rlang)
count_x_month <- function(df, var, name){
var <- enquo(var)
name <- enquo(name)
df %>%
filter(!is.na(!!var)) %>%
transmute(month := floor_date(!!var, "month")) %>%
group_by(month) %>% summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
Error message;
Error: invalid argument type
Making this version of the function work will be a big help. More broadly, other ways to achieve the objective would be welcome.
One way to state the problem; given a dataframe of customers and first purchase dates, count the number of customers purchasing for the first time in each month.
update: The selected answer works in dplyr 0.7.4, but the rstudio environment I have access to has dplyr 0.5.0. What modifications are required to 'backport' this function?
You forgot to quo_name it
library(rlang)
count_x_month <- function(df, var, name){
var <- enquo(var)
name <- enquo(name)
name <- quo_name(name)
df %>%
filter(!is.na(!!var)) %>%
transmute(month := floor_date(!!var, "month")) %>%
group_by(month) %>%
summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
# A tibble: 12 x 2
month out
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
See "Different input and output variable" section of "Programming with dplyr":
We create the new names by pasting together strings, so we need
quo_name() to convert the input expression to a string.
The error is caused by summarise(df, !!name := n()) and is solved by replacing the second line of the function with
name <- substitute(name)
The reason, as far as I understand it is, that a quosure is not only its name, but it carries with it the environment from where it came. This makes sense when specifying column names in functions. The function must know from which data frame (=environment in this case) the column comes to replace the name with the values.
However, name shall take a new name, specified by the user. There is nothing to replace it with. I suspect if using name <- enquo(name), R wants to replace !!name by values instead of just putting in the new name. Therefore it complains that on the LHS there is no name (because R replaced it by values(?))
I'm not sure whether substitute is the idiomatic "programming with dplyr" way, though. Comments are welcome.
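On newer versions the quoting can be avoided for the output name altogether. The following is a sketch, assuming dplyr 1.0 or later with rlang 0.4 or later (so it does not help with the 0.5.0 backport): the input column is forwarded with the curly-curly operator and the output name is passed as a plain string to count.

```r
library(dplyr)
library(lubridate)

# var is an unquoted column name; name is a string for the count column
count_x_month <- function(df, var, name = "adds") {
  df %>%
    filter(!is.na({{ var }})) %>%
    transmute(month = floor_date({{ var }}, "month")) %>%
    count(month, name = name)
}

samp <- tibble(value = seq(as.Date("2017-01-01"), as.Date("2017-12-01"), by = "day"))
freq2 <- count_x_month(samp, value, "out")
```

Passing the name as a string sidesteps the question of how to unquote on the left-hand side of :=, at the cost of a slightly different calling convention from the original attempt.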
Create a dataframe showing customer IDs and first purchase dates:
dates <- seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day")
dates_rep <- c(dates,dates,dates)
cust_ids <- paste('id_', floor(runif(length(dates_rep), min=0, max=100000)))
cust_frame <- data.frame(ID=cust_ids, FP_DATE=dates_rep)
head(cust_frame)
Use the plyr package to aggregate by FP_DATE:
library(plyr)
count(cust_frame, c('FP_DATE'))
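Therefore, given a dataframe of customers and first purchase dates, we get a count of the number of customers purchasing for the first time on each date. To count by month instead, round FP_DATE down to the month first (for example with lubridate's floor_date, as in the question).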
You can extend this to aggregate across any number of features in your dataset:
count(cust_frame, c('FP_DATE', 'feature_b', 'feature_c', 'feature_d', 'feature_e'))

sub-setting rows based on dates in R

I am trying to solve a problem in R. I have 2 data frames which look like this:
df1 <-
Date Rainfall_Duration
6/14/2016 10
6/15/2016 20
6/17/2016 10
8/16/2016 30
8/19/2016 40
df2 <-
Date Removal.Rate
6/17/2016 64.7
6/30/2016 22.63
7/14/2016 18.18
8/19/2016 27.87
I want to look up the dates from df2 in df1 and get their corresponding Rainfall_Duration data. For example, I want to look for the 1st date of df2 in df1 and subset the rows in df1 for that specific date and the 7 days prior to it. Additionally, for 6/30/2016 (in df2) there are no dates available in df1 within its 7-day range. In this case I just want to extract the same results as for the previous date in df2 (6/17/2016). The same logic goes for 7/14/2016 (df2).
The output should look like this:
df3<-
Rate.Removal.Date Date Rainfall_Duration
6/17/2016 6/14/2016 10
6/17/2016 6/15/2016 20
6/17/2016 6/17/2016 10
6/30/2016 6/14/2016 10
6/30/2016 6/15/2016 20
6/30/2016 6/17/2016 10
7/14/2016 6/14/2016 10
7/14/2016 6/15/2016 20
7/14/2016 6/17/2016 10
8/19/2016 8/16/2016 30
8/19/2016 8/19/2016 40
I could subset data for the 7 days range. But could not do it when no dates are available in that range. I have the following code:
library(plyr)
library (dplyr)
df1$Date <- as.Date(df1$Date,format = "%m/%d/%Y")
df2$Date <- as.Date(df2$Date,format = "%m/%d/%Y")
df3 <- lapply(df2$Date, function(x){
filter(df1, between(Date, x-7, x))
})
names(df3) <- as.character(df2$Date)
bind_rows(df3, .id = "Rate.Removal.Date")
df3 <- ldply (df3, data.frame, .id = "Rate.Removal.Date")
I hope I could explain my problem properly. I would highly appreciate if someone can help me out with this code or a new one. Thanks in advance.
I would approach this by explicitly generating all of the dates on which you want to collect the rainfall duration then binding.
To that end, I split each row and generated a rainfallDate column for each that included the seven previous days. Then, I bound the rows back together and used left_join to get the rainfall data. Finally, I filtered out the rows with missing rainfall information.
df2 %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
x
, rainfallDate = seq(x$Date - 7, x$Date, 1)
)
}) %>%
bind_rows() %>%
left_join(df1, by = c(rainfallDate = "Date")) %>%
filter(!is.na(Rainfall_Duration))
gives
Date Removal.Rate rainfallDate Rainfall_Duration
1 2016-06-17 64.70 2016-06-14 10
2 2016-06-17 64.70 2016-06-15 20
3 2016-06-17 64.70 2016-06-17 10
4 2016-08-19 27.87 2016-08-16 30
5 2016-08-19 27.87 2016-08-19 40
Note that if the dates without rainfall information are actually 0's, you could skip the filter line and use replace_na from tidyr to set them to explicit zeros (e.g., if you want average rainfall or to include the removal dates without any days of rainfall ahead of them).
An alternative is possible if you are actually interested in a summary value for each date (e.g., total rain in the past seven days). First, I would generate a rainfall dataset that actually had entries for every day of observation. For that, I used complete from tidyr to add an explicit zero entry for each day then used rollapply from zoo to calculate the sum in the rolling previous seven days.
completeRainfall <-
df1 %>%
complete(Date = full_seq(c(as.Date("2016-06-10"), max(Date)), 1)
, fill = list(Rainfall_Duration = 0)) %>%
mutate(prevSeven = rollapply(Rainfall_Duration, 7, sum
, fill = NA, align = "right"))
Then, a simple join will give you both that day's rainfall and the summarized total from the past seven days. This is particularly useful if your Dates for df1 are ever within seven days of each other to avoid copying the same rows multiple times.
left_join(
df2
, completeRainfall
)
Gives
Date Removal.Rate Rainfall_Duration prevSeven
1 2016-06-17 64.70 10 40
2 2016-06-30 22.63 0 0
3 2016-07-14 18.18 0 0
4 2016-08-19 27.87 40 70
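The question also asks that a removal date with no rainfall in its 7-day window reuse the rows found for the previous removal date. A minimal base R sketch of that fallback, using the example data from the question; the names pieces, previous and window are made up, and it assumes df2 is sorted by date:

```r
df1 <- data.frame(
  Date = as.Date(c("6/14/2016", "6/15/2016", "6/17/2016", "8/16/2016", "8/19/2016"),
                 format = "%m/%d/%Y"),
  Rainfall_Duration = c(10, 20, 10, 30, 40)
)
df2 <- data.frame(
  Date = as.Date(c("6/17/2016", "6/30/2016", "7/14/2016", "8/19/2016"),
                 format = "%m/%d/%Y"),
  Removal.Rate = c(64.7, 22.63, 18.18, 27.87)
)

pieces <- vector("list", nrow(df2))
previous <- df1[0, ]  # empty until some window has matched
for (i in seq_len(nrow(df2))) {
  d <- df2$Date[i]
  # Rows of df1 within the removal date and the 7 days before it
  window <- df1[df1$Date >= d - 7 & df1$Date <= d, ]
  # No rainfall in range: fall back to the rows used for the previous date
  if (nrow(window) == 0) window <- previous
  previous <- window
  pieces[[i]] <- data.frame(Rate.Removal.Date = rep(d, nrow(window)),
                            window, row.names = NULL)
}
df3 <- do.call(rbind, pieces)
```

With this data, 6/30/2016 and 7/14/2016 both inherit the three rows found for 6/17/2016, giving the 11 rows shown in the question's desired output.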
