I have two dataframes in R.
Release dataframe
Date Product
2011-01-13 A
2011-02-15 A
2011-01-14 B
2011-02-15 B
Casedata dataframe
Date Product Numberofcases
2011-01-13 A 50
2011-01-12 A 20
2011-01-11 A 100
2011-01-10 A 120
2011-01-09 A 150
2011-01-08 A 180
2011-01-07 A 200
2011-01-06 A 220
2011-01-23 A 500
2011-01-31 A 450
2011-02-08 A 50
2011-02-09 A 1000
2011-02-10 A 1200
2011-02-11 A 1500
2011-02-12 A 1800
2011-02-13 A 2000
2011-02-14 A 2200
2011-02-15 A 5000
2011-01-31 A 4500
:::
:::
2011-01-15 B 1000
My requirement is: for every product release date (from the release dataframe), obtain the corresponding sum(Numberofcases) for the week prior to the release date (in the casedata dataframe). That is, for product A and release date 2011-01-13, it should be the sum of all cases in the previous week (from 2011-01-06 to 2011-01-13), i.e. 50+20+100+120+150+180+200+220 = 1040.
Releasedate Product Numberofcasesoneweekpriorrelease
2011-01-13 A 1040
2011-02-15 A 14750
2011-01-14 B ...
2011-02-15 B ...
What I have tried:
beforerelease <- sqldf("select product,release.date_release,sum(numberofcasescreated) as numberofcasesbeforerelease from release left join casedata using (product) where date_case>=weekbeforerelease and date_case<=date_release group by product,date_release")
finaldf <- merge(beforerelease,afterelease,by=c("monthyear","product"))
I am stuck and it is not giving me the expected outcome. Can somebody help me?
Using the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows (assuming all Date columns are of class Date):
require(data.table)
setDT(release)[, Date2 := Date-7L]
setDT(casedata)[release, on = .(Product, Date >= Date2, Date <= Date),
.(count = sum(Numberofcases)), by = .EACHI]
# Product Date Date count
# 1: A 2011-01-06 2011-01-13 1040
# 2: A 2011-02-08 2011-02-15 14750
# 3: B 2011-01-07 2011-01-14 NA
# 4: B 2011-02-08 2011-02-15 NA
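Note that in the 'Used data' shown at the bottom of the next answer the Date columns are stored as character, so a quick conversion would be needed first (a minimal preparatory sketch, assuming the column names shown above):
# convert the character date columns to class Date before joining
release$Date <- as.Date(release$Date)
casedata$Date <- as.Date(casedata$Date)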
With the data.table package you could follow two approaches:
1) Using the foverlaps functionality:
library(data.table)
# convert to a 'data.table' with 'setDT()'
# and create a release window
setDT(release)[, `:=` (bdat = as.Date(Date)-7, edat = as.Date(Date))][, Date := NULL]
# convert to a 'data.table' and create a 2nd date column for use with 'foverlaps'
setDT(casedata)[, `:=` (bdat = as.Date(Date), edat = as.Date(Date))][, Date := NULL]
# set the key for use in 'foverlaps'
setkey(release, Product, bdat, edat)
setkey(casedata, Product, bdat, edat)
# do an overlap join ('foverlaps') and summarise
foverlaps(casedata, release, type = 'within', nomatch = 0L)[, .(cases.prior.release = sum(Numberofcases)), by = .(Product, release.date = edat)]
which gives (product B is absent because the sample casedata below contains only product A rows, and nomatch = 0L keeps matched rows only):
Product release.date cases.prior.release
1: A 2011-01-13 1040
2: A 2011-02-15 14750
2) Using the standard join functionality of data.table:
setDT(release)
setDT(casedata)
casedata[, Date := as.Date(Date)
][release[, `:=` (Date = as.Date(Date), idx = .I)
][, .(dates = seq(Date-7,Date,'day')), by = .(Product,idx)],
on = c('Product', Date = 'dates'), nomatch = 0L
][, .(releasedate = Date[.N], cases.prior.release = sum(Numberofcases)), by = .(Product,idx)
][, idx := NULL]
which will get you the same result.
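To see what the inner expression does: it expands the release table to one row per day of each release window before the equi join. Here is a sketch of the intermediate result for the first release, using the 'Used data' below (note that := also adds the helper columns to release by reference):
release[, `:=` (Date = as.Date(Date), idx = .I)
        ][, .(dates = seq(Date - 7, Date, 'day')), by = .(Product, idx)]
#    Product idx      dates
# 1:       A   1 2011-01-06
# 2:       A   1 2011-01-07
#                ...
# 8:       A   1 2011-01-13
# eight rows per release window, which are then equi-joined to casedata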
Used data:
release <- structure(list(Date = c("2011-01-13", "2011-02-15", "2011-01-14", "2011-02-15"),
Product = c("A", "A", "B", "B")),
.Names = c("Date", "Product"), class = "data.frame", row.names = c(NA, -4L))
casedata <- structure(list(Date = c("2011-01-13", "2011-01-12", "2011-01-11", "2011-01-10", "2011-01-09", "2011-01-08", "2011-01-07", "2011-01-06", "2011-01-23", "2011-01-31", "2011-02-08", "2011-02-09", "2011-02-10", "2011-02-11", "2011-02-12", "2011-02-13", "2011-02-14", "2011-02-15", "2011-01-31"),
Product = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"),
Numberofcases = c(50L, 20L, 100L, 120L, 150L, 180L, 200L, 220L, 500L, 450L, 50L, 1000L, 1200L, 1500L, 1800L, 2000L, 2200L, 5000L, 4500L)),
.Names = c("Date", "Product", "Numberofcases"), class = "data.frame", row.names = c(NA, -19L))
I want to create a variable that adds the values from the other columns based on the value of the YEAR variable: sum the YR_ columns up to and including the column matching YEAR (if YEAR is later than the last column, sum them all). That is, if variable YEAR = 2013, then add columns YR_2006, YR_2007, YR_2008, YR_2009, YR_2010, YR_2011, so for group A the sum would be 12,793.
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011
A 2013 NA 636 3653 4759 3745 NA
B 2019 1417 2176 3005 2045 2088 1849
C 2007 4218 3622 4651 4574 4122 4711
E 2017 5956 6031 6032 4885 5400 582
Here is an option with apply and MARGIN = 1 to loop over the rows: get the index where the 'YEAR' value matches the column names, build a sequence from the 2nd element up to that index, subset the values, and take the sum.
df1$Sum <- apply(df1[-1], 1, function(x)
    sum(x[2:c(grep(as.character(x[1]), names(x)[-1]) + 1,
              length(x))[1]], na.rm = TRUE))
df1$Sum
#[1] 12793 12580 7840 28886
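To trace the first row (GROUP A, YEAR 2013) by hand, here is a small sketch of what happens inside the anonymous function:
# x is the first row of df1[-1] as a named numeric vector
x <- c(YEAR = 2013, YR_2006 = NA, YR_2007 = 636, YR_2008 = 3653,
       YR_2009 = 4759, YR_2010 = 3745, YR_2011 = NA)
grep(as.character(x[1]), names(x)[-1])   # integer(0): no column name contains 2013
c(grep(as.character(x[1]), names(x)[-1]) + 1, length(x))[1]   # so it falls back to 7
sum(x[2:7], na.rm = TRUE)                # 12793: all YR_ columns are summed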
Or we can use a vectorized option with rowSums after using replace to set some elements of each row to NA, based on matching the 'YEAR' column against the column names that start with 'YR_':
i1 <- startsWith(names(df1), "YR_")
i2 <- match(df1$YEAR, sub("YR_", "", names(df1)[i1]), nomatch = sum(i1))
rowSums(replace(df1[i1], col(df1[i1]) > i2[row(df1[i1])], NA), na.rm = TRUE)
#[1] 12793 12580 7840 28886
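A quick check of the masking index for these data:
i2
# [1] 6 6 2 6
# GROUP C's YEAR (2007) matches the 2nd YR_ column, so columns 3:6 of its row
# are set to NA before rowSums; the other years are absent from the column
# names and fall back to nomatch = sum(i1) = 6, keeping all six columns.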
Or use the tidyverse to reshape to 'long' format with pivot_longer and then do a group_by sum after slicing the rows based on the match:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("YR_"), values_drop_na = TRUE) %>%
group_by(GROUP) %>%
slice(seq(match(first(YEAR), readr::parse_number(name), nomatch = n()))) %>%
summarise(Sum = sum(value)) %>%
left_join(df1, .)
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011 Sum
1 A 2013 NA 636 3653 4759 3745 NA 12793
2 B 2019 1417 2176 3005 2045 2088 1849 12580
3 C 2007 4218 3622 4651 4574 4122 4711 7840
4 E 2017 5956 6031 6032 4885 5400 582 28886
data
df1 <- structure(list(GROUP = c("A", "B", "C", "E"), YEAR = c(2013L,
2019L, 2007L, 2017L), YR_2006 = c(NA, 1417L, 4218L, 5956L), YR_2007 = c(636L,
2176L, 3622L, 6031L), YR_2008 = c(3653L, 3005L, 4651L, 6032L),
YR_2009 = c(4759L, 2045L, 4574L, 4885L), YR_2010 = c(3745L,
2088L, 4122L, 5400L), YR_2011 = c(NA, 1849L, 4711L, 582L)),
class = "data.frame", row.names = c(NA,
-4L))
I have multiple grouping variables (id) and I want to filter each one with its own specific date.
mydata <- structure(list(Line = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("Line", "Start", "End"), row.names = c(NA, -9L), class = "data.frame")
I could do it individually with the following but how do I tie this together into one line?
mydata %>% filter(Line == "A" & Start >= as.POSIXct("2013-01-01 00:00:00"))
mydata %>% filter(Line == "B" & Start >= as.POSIXct("2013-01-13 00:00:00"))
mydata %>% filter(Line == "C" & Start >= as.POSIXct("2013-01-23 00:00:00"))
If there are many dates, then we can loop over them with map2:
library(dplyr)
library(purrr)
v1 <- unique(mydata$Line)
dates <- as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23"))
mydata %>%
  filter(map2(v1, dates, ~ Line == .x & Start >= .y) %>%
           reduce(`|`))
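Alternatively, the same filter can be written as an ordinary join against a small lookup table of cutoff dates (a sketch; fc is built the same way as in the next answer):
library(dplyr)
fc <- data.frame(Line = c("A", "B", "C"),
                 dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23")))
mydata %>%
  inner_join(fc, by = "Line") %>%
  filter(Start >= dates) %>%
  select(-dates)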
If there are many dates, I suggest using a non-equi join, either with SQL (package sqldf) or with data.table.
For this, a table with filter conditions is created, e.g.,
fc <- data.frame(Line = LETTERS[1:3],
dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23")))
fc
Line dates
1 A 2013-01-01
2 B 2013-01-13
3 C 2013-01-23
(Note that dates is of type POSIXct to be in line with Start and End)
sqldf
library(sqldf)
sqldf("select mydata.* from mydata join fc on mydata.Line = fc.Line and mydata.Start >= fc.dates")
Line Start End
1 A 2013-01-01 12:01:00 2013-01-02 08:07:10
2 A 2013-01-03 14:51:14 2013-01-05 06:55:12
3 A 2013-01-05 08:07:24 2013-01-07 14:13:33
4 B 2013-01-14 09:22:43 2013-01-20 10:36:40
5 C 2013-01-30 10:33:20 2013-01-30 18:49:11
BTW,
sqldf("select mydata.* from mydata, fc where mydata.Line = fc.Line and mydata.Start >= fc.dates")
returns the same result.
data.table
library(data.table)
setDT(mydata)[mydata[fc, on = .(Line, Start >= dates ), which = TRUE]]
Line Start End
1: A 2013-01-01 12:01:00 2013-01-02 08:07:10
2: A 2013-01-03 14:51:14 2013-01-05 06:55:12
3: A 2013-01-05 08:07:24 2013-01-07 14:13:33
4: B 2013-01-14 09:22:43 2013-01-20 10:36:40
5: C 2013-01-30 10:33:20 2013-01-30 18:49:11
The expression
mydata[fc, on = .(Line, Start >= dates ), which = TRUE]
returns the indices of the rows of mydata which fulfill the conditions
[1] 1 2 3 6 9
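The indirection through which = TRUE is deliberate: it returns row numbers and then subsets mydata, so the original Start timestamps are preserved. In a direct non-equi join the Start column would instead display the dates values bound from fc, as this sketch for comparison shows:
# direct non-equi join: same matched rows, but Start now shows fc's cutoff dates
setDT(mydata)[fc, on = .(Line, Start >= dates), nomatch = 0L]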
I have 2 tables. Below are the sample tables and the desired output.
Table1:
Start Date End Date Country
2017-01-04 2017-01-06 id
2017-02-13 2017-02-15 ng
Table2:
Transaction Date Country Cost Product
2017-01-04 id 111 21
2017-01-05 id 200 34
2017-02-14 ng 213 45
2017-02-15 ng 314 32
2017-02-18 ng 515 26
Output:
Start Date End Date Country Cost Product
2017-01-04 2017-01-06 id 311 55
2017-02-13 2017-02-15 ng 527 77
The problem is to merge the two tables where the transaction date lies between the start date and the end date and the country matches, and then to add up the values of Cost and Product.
This calls for fuzzy joins. Below are two examples.
Using the dplyr and fuzzyjoin packages:
library(dplyr)
library(fuzzyjoin)

fuzzy_left_join(df1, df2,
c("Country" = "Country",
"Start_Date" = "Transaction_Date",
"End_Date" = "Transaction_Date"),
list(`==`, `<=`,`>=`)) %>%
group_by(Country.x, Start_Date, End_Date) %>%
summarise(Cost = sum(Cost),
Product = sum(Product))
# A tibble: 2 x 5
# Groups: Country.x, Start_Date [?]
Country.x Start_Date End_Date Cost Product
<chr> <date> <date> <int> <int>
1 id 2017-01-04 2017-01-06 311 55
2 ng 2017-02-13 2017-02-15 527 77
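Since fuzzy_left_join keeps the join columns from both inputs, the grouping column comes back as Country.x. If a clean Country column is wanted, a small cleanup step could be appended (a sketch, where result stands for the summarised tibble above):
result %>%
  ungroup() %>%
  rename(Country = Country.x)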
Using data.table:
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt2[dt1, on=.(Country = Country,
Transaction_Date >= Start_Date,
Transaction_Date <= End_Date),
.(Cost = sum(Cost), Product = sum(Product)),
by=.EACHI]
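In the by = .EACHI result, the two range columns inherit the name of the x column (Transaction_Date) while actually holding the Start_Date and End_Date values; renaming by position restores the original labels (a sketch):
res <- dt2[dt1, on=.(Country,
                     Transaction_Date >= Start_Date,
                     Transaction_Date <= End_Date),
           .(Cost = sum(Cost), Product = sum(Product)), by=.EACHI]
setnames(res, 2:3, c("Start_Date", "End_Date"))
res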
data:
df1 <- structure(list(Start_Date = structure(c(17170, 17210), class = "Date"),
End_Date = structure(c(17172, 17212), class = "Date"), Country = c("id",
"ng")), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(Transaction_Date = structure(c(17170, 17171, 17211,
17212, 17215), class = "Date"), Country = c("id", "id", "ng",
"ng", "ng"), Cost = c(111L, 200L, 213L, 314L, 515L), Product = c(21L,
34L, 45L, 32L, 26L)), row.names = c(NA, -5L), class = "data.frame")
Not sure if you can use any of the merge operations here, but one way using mapply is to subset the rows based on the condition and take the sums of the Cost and Product columns.
df1[c("Cost", "Product")] <- t(mapply(function(x, y, z) {
inds <- df2$Transaction_Date >= x & df2$Transaction_Date <= y & df2$Country == z
c(sum(df2$Cost[inds]), sum(df2$Product[inds]))
},df1$Start_Date, df1$End_Date, df1$Country))
df1
# Start_Date End_Date Country Cost Product
#1 2017-01-04 2017-01-06 id 311 55
#2 2017-02-13 2017-02-15 ng 527 77
I have a dataframe which looks like:
Count_ID Stats Date
123 A 10-01-2017
123 A 12-01-2017
123 B 15-01-2017
456 B 18-01-2017
456 C 17-01-2017
789 A 20-01-2017
486 A 25-01-2017
486 A 28-01-2017
I want to add Status and Count columns to the dataframe which give me the statuses described below.
Take the oldest row (by date) for each Count_ID that has Stats "A"; if any row with the same Count_ID value (e.g. 123) has a date greater than that earlier Stats "A" row, show "False" in the Status column.
If there are multiple rows with the same Count_ID value (e.g. 123), check the Stats "A" rows: any rows with that Count_ID, whether their Stats is "A" or not, whose date is greater than those Stats "A" rows should get status "False".
If there are multiple rows with the same Count_ID having Stats "A" with a date difference of less than 30 days (w.r.t. the previous row for that Count_ID by date), show status "False-B".
In the Count column, show the difference in days between a row and the previous row with the same Count_ID.
Where no condition applies, show "-".
Required Output:
Count_ID Stats Date Status Count
123 A 10-01-2017 False-B 0
123 A 12-01-2017 False-B 2
123 B 15-01-2017 False 3
456 B 18-01-2017 - 0
456 C 17-01-2017 False 1
789 A 20-01-2017 - 0
486 A 25-01-2017 False-B 0
486 A 28-01-2017 False-B 3
Dput:
structure(list(Count_ID = c(123L, 123L, 123L, 456L, 456L, 789L,
486L, 486L), Stats = c("A", "A", "B", "B", "C", "A", "A", "A"
), Date = c("10/01/2017", "12/01/2017", "15/01/2017", "18/01/2017",
"17/01/2017", "20/01/2017", "25/01/2017", "28/01/2017")), .Names = c("Count_ID",
"Stats", "Date"), class = "data.frame", row.names = c(NA, -8L
))
If I understood the question correctly, then you can try this:
library(dplyr)
df %>%
group_by(Count_ID) %>%
mutate(Count = c(0, abs(as.numeric(diff(Date)))),
Status = ifelse((Date==min(Date[Stats=='A']) | Date>min(Date[Stats=='A'])) & (n()>1), "FALSE", "-")) %>%
mutate(Status = ifelse(Stats=='A' & Count < 30 & Status=='FALSE', 'FALSE-B', Status)) %>%
data.frame()
Note that the condition for row 5 is not clear, so I have left it as -. I am not sure how you want to handle this row, as there is no Stats = 'A' for Count_ID = 456.
Output is:
Count_ID Stats Date Count Status
1 123 A 2017-01-10 0 FALSE-B
2 123 A 2017-01-12 2 FALSE-B
3 123 B 2017-01-15 3 FALSE
4 456 B 2017-01-18 0 -
5 456 C 2017-01-17 1 -
6 789 A 2017-01-20 0 -
7 486 A 2017-01-25 0 FALSE-B
8 486 A 2017-01-28 3 FALSE-B
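One caveat: for Count_ID 456 there is no Stats 'A' row, so min(Date[Stats == 'A']) evaluates to Inf with a warning, and both comparisons come out FALSE, which is what produces the '-' rows above. A guarded variant of the first mutate avoids the warning (a sketch):
df %>%
  group_by(Count_ID) %>%
  mutate(Count = c(0, abs(as.numeric(diff(Date)))),
         # handle groups with no Stats == 'A' rows explicitly
         Status = if (!any(Stats == 'A')) "-" else
           ifelse(Date >= min(Date[Stats == 'A']) & n() > 1, "FALSE", "-")) %>%
  mutate(Status = ifelse(Stats == 'A' & Count < 30 & Status == 'FALSE', 'FALSE-B', Status)) %>%
  data.frame()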
Sample data:
df <- structure(list(Count_ID = c(123L, 123L, 123L, 456L, 456L, 789L,
486L, 486L), Stats = c("A", "A", "B", "B", "C", "A", "A", "A"
), Date = structure(c(17176, 17178, 17181, 17184, 17183, 17186,
17191, 17194), class = "Date")), .Names = c("Count_ID", "Stats",
"Date"), row.names = c(NA, -8L), class = "data.frame")
Let's say we have two tables:
A table of budgets:
Item Budget
A 900
B 350
C 100
D 0
bDT = structure(list(Item = c("A", "B", "C", "D"), Budget = c(900L,
350L, 100L, 0L)), .Names = c("Item", "Budget"), row.names = c(NA,
-4L), class = "data.frame")
and a table of expected expenses by item per date.
Item Date Expense
A 2017-08-24 850
B 2017-08-18 300
B 2017-08-11 50
C 2017-08-18 50
C 2017-08-11 100
D 2017-08-01 500
expDF = structure(list(Item = c("A", "B", "B", "C", "C", "D"), Date = structure(c(17402,
17396, 17389, 17396, 17389, 17379), class = "Date"), Expense = c(850L,
300L, 50L, 50L, 100L, 500L)), .Names = c("Item", "Date", "Expense"
), row.names = c(NA, -6L), class = "data.frame")
I'm looking to summarize the amount we can spend per item per date like this:
Item Date Spend
A 8/24/2017 850
B 8/18/2017 300
B 8/11/2017 50
C 8/18/2017 50
C 8/11/2017 50
D 8/1/2017 0
This works:
library(data.table)
setDT(bDF); setDT(expDF)
expDF[bDF, on=.(Item), Spending :=
pmin(
Expense,
pmax(
0,
Budget - cumsum(shift(Expense, fill=0))
)
)
, by=.EACHI]
Item Date Expense Spending
1: A 2017-08-24 850 850
2: B 2017-08-18 300 300
3: B 2017-08-11 50 50
4: C 2017-08-18 50 50
5: C 2017-08-11 100 50
6: D 2017-08-01 500 0
How it works
cumsum(shift(Expense, fill = 0)) is prior spending**
pmax(0, Budget - prior spending) is the remaining budget
pmin(Expense, remaining budget) is the current spending
The data.table syntax x[i, on=, j, by=.EACHI] is a join. In this case j takes the form v := expr, which adds a new column to x. See ?data.table for details.
** Well, "prior" in ordering of the table. I'll ignore the OP's weird reversed dates.
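Worked through for item C (Budget 100, expenses 50 then 100 in table order) as a quick arithmetic check:
prior  <- cumsum(shift(c(50, 100), fill = 0))  # c(0, 50): spending before each row
remain <- pmax(0, 100 - prior)                 # c(100, 50): budget left at each row
pmin(c(50, 100), remain)                       # c(50, 50): matches rows 4-5 above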