Merge data frames whilst summing common columns in R

My problem is very similar to the one posted here.
The difference is that they knew the columns that would be conflicting, whereas I need a generic method that won't know in advance which columns conflict.
example:
TABLE1
Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5
TABLE2
Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2
Table 2 only has dates, so it is applied to every row in Table 1 that matches the date, regardless of time.
I would like the merge to sum the conflicting columns into one. The result should look like this:
TABLE3
Date Time ColumnA ColumnB ColumnC
01/01/2013 08:00 110 330 1
01/01/2013 08:30 115 325 1
01/01/2013 09:00 120 320 1
02/01/2013 08:00 225 415 2
02/01/2013 08:30 230 410 2
02/01/2013 09:00 235 405 2
At the moment my standard merge just creates duplicate columns of "ColumnA.x", "ColumnA.y", "ColumnB.x", "ColumnB.y".
Any help is much appreciated

If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:
merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
  merged_Columns <- unique(c(names(.df1), names(.df2)))
  merged_df1 <- data.frame(matrix(nrow = nrow(.df1), ncol = length(merged_Columns)))
  names(merged_df1) <- merged_Columns
  for (column in merged_Columns){
    if (column %in% .id_Columns | !column %in% names(.df2)){
      # id columns and columns that exist only in .df1 are copied straight from .df1
      merged_df1[, column] <- .df1[, column]
    } else if (!column %in% names(.df1)){
      # columns that exist only in .df2 are looked up via the match columns
      merged_df1[, column] <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
    } else {
      # conflicting columns: add the matched .df2 values to the .df1 values
      df1_Values <- .df1[, column]
      df2_Values <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
      df2_Values[is.na(df2_Values)] <- 0
      merged_df1[, column] <- df1_Values + df2_Values
    }
  }
  return(merged_df1)
}
This function assumes you have a table '.df1' that is a master of sorts, and you want to merge data from a second table '.df2' that has rows that match one or more of the rows in '.df1'. The columns to preserve from the master table '.df1' are passed as a vector '.id_Columns', and the columns that provide the match for merging the two tables are passed as a vector '.match_Columns'.
For your example, it would work like this:
merge_Sum(table1, table2, c("Date","Time"), "Date")
# Date Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00 110 330 1
# 2 01/01/2013 08:30 115 325 1
# 3 01/01/2013 09:00 120 320 1
# 4 02/01/2013 08:00 225 415 2
# 5 02/01/2013 08:30 230 410 2
# 6 02/01/2013 09:00 235 405 2
In plain language, this function first finds the full set of unique columns and makes an empty data frame in the shape of the master table '.df1' to later hold the merged data. Then, for the '.id_Columns', the data is copied from '.df1' into the new merged data frame. For the other columns, any data that exists in '.df1' is added to any existing data in '.df2', where the rows in '.df2' are matched based on the '.match_Columns'.
There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.
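For comparison, here is a rough base-R sketch of the same idea (not part of the original answer, and assuming a single set of join keys as in the example): let merge() create the .x/.y pairs and then sum each pair programmatically.
merge_sum_base <- function(df1, df2, by) {
  m <- merge(df1, df2, by = by, all.x = TRUE, suffixes = c(".x", ".y"))
  # columns present in both inputs (other than the join keys) come out as .x/.y pairs
  common <- setdiff(intersect(names(df1), names(df2)), by)
  for (col in common) {
    x <- m[[paste0(col, ".x")]]
    y <- m[[paste0(col, ".y")]]
    y[is.na(y)] <- 0                        # unmatched rows contribute 0
    m[[col]] <- x + y
    m[[paste0(col, ".x")]] <- NULL
    m[[paste0(col, ".y")]] <- NULL
  }
  m
}
merge_sum_base(table1, table2, "Date")      # column order may differ from the example output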
Update: The original function assumed a many-to-one relationship between table1 and table2, and the OP requested the allowance of a many-to-none relationship, also. The code has been updated with a slightly less efficient but 100% more flexible logic.
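To illustrate the many-to-none case with the example data (a hypothetical check, not taken from the original post), you could drop one date from table2 and re-run the function:
# hypothetical subset of table2 that has no row for 02/01/2013
table2_partial <- table2[table2$Date == "01/01/2013", ]
merge_Sum(table1, table2_partial, c("Date", "Time"), "Date")
# rows dated 02/01/2013 keep their original ColumnA/ColumnB values
# (the missing match contributes 0) and get NA for ColumnC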

A data.table solution:
dt1 <- data.table(read.table(header=T, text="Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5"))
dt2 <- data.table(read.table(header=T, text="Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2"))
setkey(dt1, "Date")
setkey(dt2, "Date")
# Note: The ColumnC assignment has to come before the summing operations,
# else it throws an error (see below)
dt1[dt2, `:=`(ColumnC = i.ColumnC, ColumnA = ColumnA + i.ColumnA,
              ColumnB = ColumnB + i.ColumnB)]
# Date Time ColumnA ColumnB ColumnC
# 1: 01/01/2013 08:00 110 330 1
# 2: 01/01/2013 08:30 115 325 1
# 3: 01/01/2013 09:00 120 320 1
# 4: 02/01/2013 08:00 225 415 2
# 5: 02/01/2013 08:30 230 410 2
# 6: 02/01/2013 09:00 235 405 2
I'm not sure why placing the ColumnC assignment at the end throws this error. Perhaps MatthewDowle could explain the cause of this error.
dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB,
              ColumnC = i.ColumnC)]
Error in `[.data.table`(dt1, dt2, `:=`(ColumnA = ColumnA + i.ColumnA, :
Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'NULL'
Update from v1.8.9:
o Mixing adding new with updating existing columns into one :=() by group; i.e.,
DT[,:=(existingCol=...,newCol=...), by=...] now works without error or
segfault, #2778 and #2528. Many thanks to Arun for reporting both with reproducible examples. Tests added.
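Going by the changelog entry above, with data.table 1.8.9 or later the originally failing ordering should also work. A quick check might look like this (assuming dt1 and dt2 are re-created from the original data first, so the columns are not summed on top of already-updated values):
packageVersion("data.table")   # should be >= 1.8.9
dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB,
              ColumnC = i.ColumnC)]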

I wrote the package safejoin which solves this very succinctly:
#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
safe_full_join(df1, df2, by = "Date", conflict = `+`)
# Date Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00 110 330 1
# 2 01/01/2013 08:30 115 325 1
# 3 01/01/2013 09:00 120 320 1
# 4 02/01/2013 08:00 225 415 2
# 5 02/01/2013 08:30 230 410 2
# 6 02/01/2013 09:00 235 405 2
In case of conflict, the function `+` is applied to each pair of conflicting columns.
Data
df1 <- read.table(header=T, text="Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5")
df2 <- read.table(header=T, text="Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2")

Related

R: Velocity/Aggregation - excess unique counts of column B per column A within certain time periods?

I'm exploring ways to identify when a count exceeds a certain threshold within a certain time period.
For example, let's say we have 4 columns - Transaction, Time, Email and CC. Throughout the data set, we want to identify WHICH user emails (Email) are involved with more than 2 credit cards (CC) within ANY 60 minute period. Ideally, we would also like to know at WHAT (Transaction) this threshold is broken.
The end goal is to know something like this -
'CBC' used its 3rd (CC) in <= 60 minutes at 'Transaction' 50.
Simulated data:
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
> head(DF)
Transaction Time Email CC
1 1 2020-01-01 00:00:05 CBB 1057
2 2 2020-01-01 00:04:40 DBD 1157
3 3 2020-01-01 00:08:11 DCB 1081
4 4 2020-01-01 00:09:39 ADB 1176
5 5 2020-01-01 00:11:39 ADC 1188
6 6 2020-01-01 00:13:45 ACD 1009
This seems to be a pretty unique question, as I'm essentially checking for excess/risky aggregation/counts throughout a data set.
An early dplyr attempt to set this up is as follows -
Counts_DF <- DF %>%
  group_by(Email) %>%
  mutate(HourInter = cut(Time, breaks = "60 min")) %>%
  group_by(Email, HourInter) %>%
  summarize(Diff_Cards = n_distinct(CC)) %>%
  arrange(desc(Diff_Cards)) %>%
  filter(Diff_Cards > 2)
> head(Counts_DF)
# A tibble: 5 x 3
# Groups: Email [5]
Email HourInter Diff_Cards
<fct> <chr> <int>
1 ABB 2020-01-01 01:22:00 3
2 BAC 2020-01-01 00:54:00 3
3 CAB 2020-01-01 00:35:00 3
4 CBC 2020-01-01 00:14:00 3
5 DAB 2020-01-01 01:41:00 3
However, I'm unsure what the 'HourInter' column is really doing and there is clearly no (Transaction) info available.
I've seen other questions for aggregations under static time intervals for just one column, but this is clearly a bit different. Any help with this would be greatly appreciated.
Here is a data.table approach:
library( data.table )
# make DF a data.table, set keys for optimised joining
setDT( DF, key = c( "Email", "Time" ) )
# get CC used in the hour window, and the number of unique CC used in the last hour, by Email, by row
DF[ DF,
    # get desired values, suppress immediate output using {}
    c( "cc_last_hour", "unique_cc_last_hour" ) := {
      # temporary subset, with all DF values with the same Email, from the last hour
      val = DF[ Email == i.Email &
                  Time %between% c( i.Time - lubridate::hours(1), i.Time ) ]$CC
      # get values
      list( paste0( val, collapse = "-" ),
            uniqueN( val ) )
    },
    # do the above for each row
    by = .EACHI ]
# now subset rows where `unique_cc_last_hour` exceeds 2
DF[ unique_cc_last_hour > 2, ]
# Transaction Time Email CC cc_last_hour unique_cc_last_hour
# 1: 66 2020-01-01 01:35:32 AAD 1199 1152-1020-1199 3
# 2: 78 2020-01-01 02:00:16 AAD 1152 1152-1020-1199-1152 3
# 3: 53 2020-01-01 01:24:46 BAA 1096 1080-1140-1096 3
# 4: 87 2020-01-01 02:15:24 BAA 1029 1140-1096-1029 3
# 5: 90 2020-01-01 02:19:30 BAA 1120 1096-1029-1120 3
# 6: 33 2020-01-01 00:55:52 BBC 1031 1196-1169-1031 3
# 7: 64 2020-01-01 01:34:58 BDD 1093 1154-1052-1093 3
# 8: 68 2020-01-01 01:40:07 CBC 1085 1022-1052-1085 3
# 9: 38 2020-01-01 01:03:34 CCA 1073 1090-1142-1073 3
#10: 21 2020-01-01 00:35:54 DBB 1025 1194-1042-1025 3
#11: 91 2020-01-01 02:20:33 DDA 1109 1115-1024-1109 3
Update based on OP's comment below:
First, create some sample data with a transaction amount.
#sample data with an added Amount
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
Amount <- sample( 50:100, 100, replace = TRUE )
DF <- data.frame(Time, Email, CC, Amount)
DF <- tibble::rowid_to_column(DF, "Transaction")
Here is the code that also calculates the sum of Amount for the past hour.
A bit more explanation of the functionality of the code:
make DF a data.table
'loop' over each row of DF
for each row, take the Email and Time of that row and...
... create a temporary subset of DF, where the Email is the same, and the Time is between Time - 1 hour and Time
join on this subset, creating new columns "cc_hr", "un_cc_hr" and "am_hr", which get their values from a list. So paste0( val$CC, collapse = "-" ) fills the first column (i.e. "cc_hr"), uniqueN( val$CC ) fills the second column (i.e. "un_cc_hr"), and the sum of the amount ("am_hr") gets calculated by sum( val$Amount ).
As you can see, it does not calculate the score for every 60-minute interval, but instead it defines the end of an interval based on the Time of a Transaction, and then looks for Transactions with the same Email within the hour before that Time.
I assumed this is the behaviour you are looking for, and that you're not interested in periods where nothing happens.
library( data.table )
# make DF a data.table, set keys for optimised joining
setDT( DF, key = c( "Email", "Time" ) )
# self join
DF[ DF,
    # get desired values, suppress immediate output using {}
    c( "cc_hr", "un_cc_hr", "am_hr" ) := {
      # create a temporary subset of DF, named val,
      # with all DF's rows with the same Email, from the last hour
      val = DF[ Email == i.Email &
                  Time %between% c( i.Time - lubridate::hours(1), i.Time ) ]
      # get values
      list( paste0( val$CC, collapse = "-" ),
            uniqueN( val$CC ),
            sum( val$Amount ) )  # <-- calculate the amount of all transactions
    },
    # do the above for each row of DF
    by = .EACHI ]
Sample output:
#find all Transactions where, in the past hour,
# 1. the number of unique CC used > 2, OR
# 2. the total amount paid > 180
DF[ un_cc_hr > 2 | am_hr > 180, ]
# Transaction Time Email CC Amount cc_hr un_cc_hr am_hr
# 1: 80 2020-01-01 02:03:05 AAB 1021 94 1089-1021 2 194
# 2: 66 2020-01-01 01:35:32 AAD 1199 60 1152-1020-1199 3 209
# 3: 78 2020-01-01 02:00:16 AAD 1152 63 1152-1020-1199-1152 3 272
# 4: 27 2020-01-01 00:40:50 BAA 1080 100 1169-1080 2 186
# 5: 53 2020-01-01 01:24:46 BAA 1096 100 1080-1140-1096 3 259
# 6: 87 2020-01-01 02:15:24 BAA 1029 71 1140-1096-1029 3 230
# 7: 90 2020-01-01 02:19:30 BAA 1120 93 1096-1029-1120 3 264
# 8: 33 2020-01-01 00:55:52 BBC 1031 55 1196-1169-1031 3 171
# 9: 64 2020-01-01 01:34:58 BDD 1093 78 1154-1052-1093 3 212
# 10: 42 2020-01-01 01:08:04 CBC 1052 96 1022-1052 2 194
# 11: 68 2020-01-01 01:40:07 CBC 1085 100 1022-1052-1085 3 294
# 12: 38 2020-01-01 01:03:34 CCA 1073 81 1090-1142-1073 3 226
# 13: 98 2020-01-01 02:40:40 CCC 1121 86 1158-1121 2 183
# 14: 21 2020-01-01 00:35:54 DBB 1025 67 1194-1042-1025 3 212
# 15: 91 2020-01-01 02:20:33 DDA 1109 99 1115-1024-1109 3 236
You could always make the problem a bit easier by extracting the date and hour feature:
library(stringi)
library(tidyverse)
library(lubridate)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
DF %>%
  mutate(Date = as.Date(Time),
         Hour = hour(Time)) %>%
  group_by(Date, Hour, Email) %>%
  summarise(Diff_Cards = n_distinct(CC)) %>%
  filter(Diff_Cards > 2) %>%
  arrange(desc(Diff_Cards))

Is there a function to join 2 datasets on multiple column where one of the columns is an inexact numerical match?

I am trying to join 2 datasets together via multiple columns, and one of the columns is an inexact numerical match. The other columns match perfectly because the data are essentially the same but from different sources. The numerical mismatch in the final column is time based, and the mismatch appears to be that the first dataset has recorded timestamps approximately 1 to 4 seconds before the second dataset.
The 1st dataset has approximately 100 more observations than the second, and I am trying to extract information from the first to add to the second. The reason for this is that the first dataset, with the extra information that I need, only has millisecond granularity, whereas the second dataset, without the required information, has nanosecond granularity.
sample1:
1 date time seconds 2 3 ID1 ID2
CBA 2015-01-02 10:30 24 50 200 2 3
CBA 2015-01-02 10:30 24 51 40 2 4
CBA 2015-01-02 10:30 25 50 30 5 2
CBA 2015-01-02 10:30 25 50 100 6 4
CBA 2015-01-02 10:30 25 51 60 1 6
sample2:
1 date time seconds 2 3
CBA 2015-01-02 10:30 24 50 200
CBA 2015-01-02 10:30 24 51 40
CBA 2015-01-02 10:30 25 50 30
CBA 2015-01-02 10:30 26 50 100
CBA 2015-01-02 10:30 26 51 60
The first thing I did was make sure that each dataset had the same column headers and the information contained in the columns that I am trying to match are of the same class and are in the same format.
For the timestamp, I separated this column in both datasets so there is now one column each for "date" and "time" (hours:minutes:seconds), with a fourth column showing the milliseconds / nanoseconds respectively. Note: I am trying to match the "date" column along with other columns whose information mostly matches; the date isn't the problem, because I can see that most of it matches. The time column is the one that is causing problems. I then tried using dplyr to mutate the seconds into a separate column. When I try to do an inner_join with date, time, seconds, and the other 3 columns, I end up with around 1000 fewer observations than I should. When I do an inner_join with the aforementioned columns except the seconds column, I end up with around 2000 more observations than I should. I suspect this is because inner_join is matching multiple times(?).
I have also tried right_join and left_join to which I have similar issues with the seconds column. A full_join literally just adds the datasets together which I expected.
Another thing I tried was the merge function. This gave the same results as the inner_join both when I included and excluded the seconds columns in the join.
Lastly, I tried fuzzyjoin but for some reason this killed the memory in my laptop and crashed it.
A couple of things I haven't tried are: using a for loop to get the info from the first dataset into the second, because the final datasets that I need to join will likely have hundreds of millions of observations. match also doesn't appear to work, because it only seems to work on 1 column, not the 6 columns that I need matched.
This gives 1000 fewer observations than it should because it's not picking up the inexact matches on the seconds:
merged_data <- inner_join(sample1, sample2,
                          by = c('1' = '1', 'Date' = 'Date', 'Time' = 'Time',
                                 'seconds' = 'seconds', '2' = '2', '3' = '3')
                          ) %>% as_tibble()
This gives me approximately 2000 more observations than I should have:
merged_data <- inner_join(sample1, sample2,
                          by = c('1' = '1', 'Date' = 'Date',
                                 'Time' = 'Time', '2' = '2', '3' = '3')
                          ) %>% as_tibble()
And doing this shows me the observations that didn't match by putting them below the rows that did match:
merged_data <- inner_join(sample1, sample2,
                          by = c('1' = '1', 'Date' = 'Date',
                                 'Time' = 'Time', 'seconds' = 'seconds',
                                 '2' = '2', '3' = '3'), all.x = TRUE
                          ) %>% as_tibble()
This fuzzyjoin code kills the memory on my laptop and crashes it:
merged_data <- fuzzy_join(sample1, sample2,
                          by = c('1' = '1', 'Date' = 'Date',
                                 'Time' = 'Time', 'seconds' = 'seconds',
                                 '2' = '2', '3' = '3'),
                          match_fun = list(`==`, `==`, `==`, `<=`, `==`, `==`)
                          ) %>% as_tibble()
Expected results:
merged_data
1 date time seconds 2 3 ID1 ID2
CBA 2015-01-02 10:30 24 50 200 2 3
CBA 2015-01-02 10:30 24 51 40 2 4
CBA 2015-01-02 10:30 25 50 30 5 2
CBA 2015-01-02 10:30 26 50 100 6 4
CBA 2015-01-02 10:30 26 51 60 1 6
Actual results when seconds are included on the inner_join and merge:
merged_data
1 date time seconds 2 3 ID1 ID2
CBA 2015-01-02 10:30 24 50 200 2 3
CBA 2015-01-02 10:30 24 51 40 2 4
CBA 2015-01-02 10:30 25 50 30 5 2
Actual results from the merge with all.x = TRUE:
merged_data
1 date time seconds 2 3 ID1 ID2
CBA 2015-01-02 10:30 24 50 200 2 3
CBA 2015-01-02 10:30 24 51 40 2 4
CBA 2015-01-02 10:30 25 50 30 5 2
CBA 2015-01-02 10:30 26 50 100 6 4
CBA 2015-01-02 10:30 26 51 60 1 6
CBA 2015-01-02 10:30 25 50 100 6 4
CBA 2015-01-02 10:30 25 51 60 1 6
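For what it's worth, one approach not tried above (sketched here purely as an illustration under my assumptions, not a tested answer) is a data.table rolling join: each sample2 row is matched to the sample1 row whose seconds value is the closest earlier one within a tolerance, while the other columns must match exactly.
library(data.table)
setDT(sample1)
setDT(sample2)
# keys: exact-match columns first, the inexact column (seconds) last,
# because roll applies to the last key column
setkeyv(sample1, c("1", "date", "time", "2", "3", "seconds"))
setkeyv(sample2, c("1", "date", "time", "2", "3", "seconds"))
# for each sample2 row, take the sample1 row with identical exact keys and the
# nearest seconds value at most 4 seconds earlier (roll = 4 caps the gap);
# note this assumes the rows to be matched share the same hh:mm value
merged_data <- sample1[sample2, roll = 4]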

How do I count across a date range in an R data.table

ID FROM TO
1881 11/02/2013 11/02/2013
3090 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
If I have a data.table of vacation dates, FROM and TO, I want to know how many people were on vacation in any given month.
I tried:
dt[, .N, by=.(year(FROM), month(FROM))]
but obviously it would exclude people who were on vacation across two months, i.e. someone on vacation FROM JAN TO FEB would only show up in the JAN count and not the FEB count, even though they are still on vacation in FEB.
The output of the above code showing year, month and number is exactly what I'm looking for otherwise.
year month N
1: 2013 2 17570
2: 2013 9 16924
3: 2014 11 18809
4: 2013 7 16984
5: 2015 6 14401
6: 2015 12 10239
7: 2014 3 19346
8: 2013 5 14864
EDIT: I want every month someone is away on vacation counted. So ID 111 would be counted in June, July, August and Sept.
EDIT 2:
Running uwe's code on the full dataset produces the Total Count column below.
Subsetting the full data set for people on vacation for <= 30 days and > 30 days produces the counts in the respective columns below. These columns added to each other should equal the Total Count and therefore the DIFFERENCE should be 0 but this isn't the case.
month Total count <=30 >30 (<=30) + (>30) DIFFERENCE
01/02/2012 899 4 895 899 0
01/03/2012 3966 2320 1646 3966 0
01/04/2012 8684 6637 2086 8723 39
01/05/2012 10287 7586 2750 10336 49
01/06/2012 12018 9080 3000 12080 62
The OP has not specified what the exact rules are for counting, for instance, how to count if the same ID has multiple non-overlapping periods of vacation in the same month.
The solution below is based on the following rules:
Each ID may appear in more than one row.
For each row, the total number of months between FROM and TO is counted (including the FROM and TO months). E.g., ID 111 is counted in the months of June, July, August, and September 2015.
Vacations on the last and first day of a month are counted in full, e.g., a vacation starting on May 31 and ending on June 1 is counted in both months.
If an ID has multiple periods of vacation in one month it is only counted once.
To verify that the code implements these rules, I had to enhance the sample dataset provided by the OP with additional use cases (see Data section below).
library(data.table)
library(lubridate)
# coerce dt to data.table object and character dates to class Date
setDT(dt)[, (2:3) := lapply(.SD, dmy), .SDcols = 2:3]
# for each row, create sequence of first days of months
dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")), by = .(ID, rowid(ID))][
  # count the number of unique IDs per month, order result by month
  , uniqueN(ID), keyby = month]
month V1
1: 2013-02-01 1
2: 2013-07-01 1
3: 2013-09-01 2
4: 2014-11-01 1
5: 2014-12-01 1
6: 2015-06-01 1
7: 2015-07-01 1
8: 2015-08-01 1
9: 2015-09-01 1
10: 2015-11-01 1
11: 2015-12-01 1
12: 2016-06-01 1
13: 2016-07-01 1
14: 2016-08-01 1
15: 2016-09-01 1
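If the year/month layout from the question is preferred, the monthly result could be reshaped along these lines (a sketch that assumes the chain above is first stored in a variable; the column name N is my own):
res <- dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")),
          by = .(ID, rowid(ID))][, .(N = uniqueN(ID)), keyby = month]
res[, .(year = year(month), month = month(month), N)]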
Data
Based on OP's sample dataset but extended by additional use cases:
library(data.table)
dt <- fread(
"ID FROM TO
1881 11/02/2013 11/02/2013
1881 23/02/2013 24/02/2013
3090 09/09/2013 09/09/2013
3091 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
111 25/11/2015 05/12/2015
11 25/06/2016 01/09/2016"
)
For the data given above, you can do:
melt(dat, 1)[, value := as.Date(sub("\\d+", "20", value), "%d/%m/%Y")][
  , seq(value[1], value[2], by = "1 month"), by = ID][, .N, by = .(year(V1), month(V1))]
year month N
1: 2013 2 1
2: 2013 9 1
3: 2014 11 1
4: 2014 12 1
5: 2013 7 1
6: 2015 6 1
7: 2015 7 1
8: 2015 8 1
9: 2015 9 1

Extracting rows based on ID and date. R-base

I have 2 data frames: one with a list of IDs and dates for 700 persons, and another with 400,000 rows with a date and several other variables for over 1000 persons.
example df1:
ID date
1010 2014-05-31
1011 2015-08-27
1015 2011-04-15
...
example df2:
ID Date Operationcode
1010 2008-01-03 456
1010 2016-06-09 1234
1010 1999-10-04 123186
1010 2017-02-30 71181
1010 2005-05-05 201
1011 2008-04-02 46
1011 2009-09-09 1231
1515 2017-xx-xx 156
1015 2013-xx-xx 123
1615 1998-xx-xx 123
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
1015 2016-xx-xx 213
Now I want to create a df3 where I only keep the rows from df2 where the date is before the date in df1 (when matched by ID).
So I get:
ID Date Operationcode
1010 2008-01-03 456
1010 1999-10-04 123186
1010 2005-05-05 201
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
I've tried
df3 <- subset(df1, ID %in% df2$ID & df2$date < df1$date)
but it keeps giving me an error where the lengths in the last part, df2$date < df1$date, don't match, and when I take a sample test (looking up the operation codes for one ID) I can see that I miss a lot of rows before the date from df1. Any ideas or solutions?
AND I only have base R, as it's the hospital's computer, which doesn't allow any downloading -.-
In base R you could do something like this...
df3 <- merge(df2,df1,by="ID",all.x=TRUE) #merge in df1 date column
df3 <- df3[as.Date(df3$Date)<as.Date(df3$date),] #remove rows with invalid dates
#note that 'Date' is the df2 column, 'date' is the df1 version
df3 <- df3[!is.na(df3$ID),] #remove NA rows
df3$date <- NULL #remove df1 date column
df3
ID Date Operationcode
1 1010 2008-01-03 456
2 1010 1999-10-04 123186
3 1010 2005-05-05 201
6 1011 2009-09-09 1231
7 1011 2008-04-02 46
I'm not sure what is supposed to happen with the dates with xx in your data. Are they real? If they appear in the actual data, they will need special handling as otherwise they will not be converted to proper date format, so the calculation fails.
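If those "xx" dates do occur in the real df2, one possible way to deal with them (an assumption on my part, not part of the answer above) is to parse with an explicit format, so that unparseable strings become NA, and then drop those rows before the comparison:
# strings like "2013-xx-xx" become NA instead of raising an error
df2$Date_parsed <- as.Date(df2$Date, format = "%Y-%m-%d")
df2_clean <- df2[!is.na(df2$Date_parsed), ]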

How to group sales data by 4 days from yesterday to the start date in R?

Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question is: how do I group the data from 19/04/2017 back to 11/03/2017 into 4 equal parts, with the sales summed for each part, in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data, ep, sum)
It's not working. However, it takes the start date to the current date, but I need the data gathered from yesterday (19/4/2017) back to the start date and split into 4 equal parts.
Could anyone kindly guide me soon?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in Q and additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might have a different intention: to group the data in periods of four days length each.)
Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
  , start_date_of_period := cut.Date(Date, 4)][
  , .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: If you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, with mdy("4/20/2017"), which is the last day in the sample data set supplied by the OP.
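Putting that note into practice, a reproducible version of the chained statement might look like this (assuming Book1 is read in fresh from the question's data):
library(data.table)
library(lubridate)
# the fixed last date of the sample data replaces today()
setDT(Book1)[, Date := mdy(Book1$Date)][Date != mdy("4/20/2017")][
  , start_date_of_period := cut.Date(Date, 4)][
  , .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]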
