I am tracking the date/time when I say goodnight to Alexa. The entries are super weird and unhelpful strings, not dates and times:
Sample data:
January 25, 2021 at 12:03AM
January 25, 2021 at 11:27PM
January 26, 2021 at 11:17PM
Alexa just dumps these unconventional date/time strings into A1-A??? on the first tab.
I am using this formula to show my average bedtime each month:
=QUERY(ARRAYFORMULA(
  IF(LEN(A2:A), {
    MONTH(REGEXEXTRACT(A2:A, "\D+") & 1),
    REGEXEXTRACT(A2:A, "\D+"),
    IF(TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")) > 0.5,
       TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")),
       TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")) + 1)
  }, "")),
  "select Col1, Col2, avg(Col3)
   where Col1 is not null
   group by Col1, Col2
   order by Col1 asc
   label Col1 '#', Col2 'Month', avg(Col3) 'Average bedtime'")
But really, I don’t care so much about weekends as I do about weeknights. I'm stumped on how to adjust the formula so that it only counts Sun-Thu nights.
To make it trickier... if I went to bed after midnight on a Thursday (gasp), that should still be included.
Turning to those who have madder skills than me!
Thanks for your help,
Drew
I think this should take into account the edge cases mentioned above:
cell B2:
=arrayformula(
regexreplace(A2:A, "^([\w, ]+) at ([\w: ]+)$", "$1 $2")
)
cell C2:
=arrayformula(
query(
{
text(B2:B - (B2:B - int(B2:B) < timevalue("4:00 AM")), "mmmm"),
text(B2:B - (B2:B - int(B2:B) < timevalue("4:00 AM")), "ddd"),
timevalue(B2:B) + (B2:B - int(B2:B) < timevalue("4:00 AM"))
},
"select Col1, avg(Col3)
where Col2 matches 'Sun|Mon|Tue|Wed|Thu'
group by Col1
pivot Col2",
0
)
)
Format the result cells as Format > Number > Time.
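If you want to sanity-check the same idea outside Sheets, here is a minimal R sketch of the same parse plus the before-4-AM "count it as the previous night" shift. The variable names are mine, the 4 AM cutoff mirrors the formula above, and the %B/%p parsing assumes an English locale:
x <- c("January 25, 2021 at 12:03AM", "January 25, 2021 at 11:27PM")
dt <- as.POSIXct(x, format = "%B %d, %Y at %I:%M%p", tz = "UTC")
# anything before 4 AM counts toward the previous night
night <- as.Date(dt) - (format(dt, "%H:%M") < "04:00")
data.frame(night, weekday = weekdays(night), bedtime = format(dt, "%H:%M"))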
Here is a partial solution.
=ArrayFormula(query(if(A2:A = "",, {
month(REGEXEXTRACT(A2:A, "(.*) at")),
REGEXEXTRACT(A2:A, "\D+"),
REGEXEXTRACT(A2:A, "\d+:.*") + (REGEXEXTRACT(A2:A, "\D\D$")="AM"),
mod(REGEXEXTRACT(A2:A, "(.*) at") + REGEXEXTRACT(A2:A, "at (.*)")-1, 7)
}),
"Select Col1,Col2 ,avg(Col3) where Col4 > 0.5 and Col4 < 5.5
group by Col1, Col2 Order By Col1 asc label Col1 '#', Col2 'Month', avg(Col3)
'Average bedtime'
"))
It does not address the year and end-of-month issues raised by Erik: any before-noon entries on the first of a month should be counted toward the previous month.
Col4 = 0.5 corresponds to Sunday noon and 5.5 to Friday noon. Ed
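For anyone puzzled by that scale, here is a rough R illustration of the same arithmetic (my own sketch, not Ed's): Google Sheets serial dates count days from 1899-12-30, so MOD(date + time - 1, 7) puts Sunday at 0 and Friday at 5, and the time-of-day fraction then gives 0.5 for Sunday noon and 5.5 for Friday noon. The helper names are illustrative only.
sheets_serial <- function(d) as.numeric(as.Date(d) - as.Date("1899-12-30"))
col4 <- function(d, frac) (sheets_serial(d) - 1 + frac) %% 7
col4("2021-01-24", 0.5)  # a Sunday at noon -> 0.5
col4("2021-01-29", 0.5)  # a Friday at noon -> 5.5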
I have a database of cash register transactions. The records are split by Products in a Basket:
Date Hour Cust Prod Basket Spend
1| 20160416 8 C1 P1 B2 10
2| 20160416 8 C1 P2 B2 20
3| 20160115 15 C1 P3 B1 30
4| 20160115 15 C1 P2 B1 50
5| 20161023 11 C1 P4 B3 60
I would like to see:
DaysSinceLastVisit Cust Basket Spend
NULL C1 B1 30
92 C1 B2 80
190 C1 B3 60
AND
AvgDaysBetweenVisits Cust AvgSpent
141 C1 56.57
I can't figure out how to perform aggregate functions on Dates during a GROUP BY. All the other posts on SO seem to deal with tables that already have two columns for start/end dates [1] [2] [3].
Here's what I have tried so far:
SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function
Also, it should be noted that I'm running this in R with sqldf, which uses SQLite syntax, so I'd prefer an SQLite solution.
Try this-
df <- data.frame("Date"=c("20160416","20160416","20160115","20160115","20161023"),
"Hour"=c(8,8,15,15,11), "Cust"=c("C1","C1","C1","C1","C1"),
"Prod"=c("P1","P2","P3","P2","P4"), "Basket"=c("B2","B2","B1","B1","B3"),
"Spend"=c(10,20,30,50,60))
df$Date <- as.Date(df$Date, format = "%Y%m%d")
# Aggregate the data first
df2 <- aggregate(Spend ~ Date + Cust + Basket, data = df, FUN = sum)
# Now get days since last visit
df2$Date <- c(0, diff(df2$Date, 1))
# And finally
df3 <- aggregate(cbind(Date, Spend) ~ Cust, data = df2, FUN = mean)
days_since_last_visit here is computed with respect to today's date/time, as that is more practical. However, if you take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. The best way to handle that part would be in a cursor; it can be done in a query too, but it gets a bit more complex.
select round( julianday('now') - min ( julianday (substr(date,1,4) || "-"||substr(date,5,2) || "-"|| substr(date,7) ) ) ,2 ) days_since_last_visit,
date, cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date
Average, per customer, of the days between each visit date and today, along with the average spend:
select round(avg( julian_days) ,2) average_days , cust, round(avg(total_spend) ,2) average_spent
from
( select julianday('now') - min ( julianday (substr(date,1,4) || "-"||substr(date,5,2) || "-"|| substr(date,7) ) ) julian_days, date,
cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date )
group by cust
Create and insert script, for reference only:
create table customer ( date text , hour integer, cust text, prod text, basket text, spend integer )
insert into customer ( date, hour, cust, prod, basket, spend ) values ( "20161023", 11, "C1", "P4", "B3",60)
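Since the question is run from R via sqldf, here is a minimal sketch of executing the first query from there. The `customer` data frame below is just the sample data re-typed with lowercase column names to match the query; it is not part of the original answer:
library(sqldf)
customer <- data.frame(date = c("20160416", "20160416", "20160115", "20160115", "20161023"),
                       hour = c(8, 8, 15, 15, 11), cust = "C1",
                       prod = c("P1", "P2", "P3", "P2", "P4"),
                       basket = c("B2", "B2", "B1", "B1", "B3"),
                       spend = c(10, 20, 30, 50, 60),
                       stringsAsFactors = FALSE)
sqldf("select round(julianday('now') - min(julianday(substr(date,1,4) || '-' ||
         substr(date,5,2) || '-' || substr(date,7))), 2) days_since_last_visit,
         date, cust, basket, sum(spend) total_spend
       from customer
       group by cust, basket, date")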
This uses SQLite via sqldf as in the question.
We first define three tables (existing only for the duration of the SQL statement) in the WITH clause:
aa is table a with an additional Julian date column suitable for differencing
tab_days is a table that uses aa to compute the differenced days via an appropriately aggregated self-join
tab_sum_spend is a table that contains the Spend sums
Finally we join the last two and sort appropriately.
library(sqldf)
# see note at end for a in reproducible form
t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
substr(Date, 5, 2) || '-' ||
substr(Date, 7, 2)) juldate,
*
FROM a),
tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
FROM aa a1
LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
GROUP BY a1.Cust, a1.Date, a1.Basket),
tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
FROM aa
GROUP BY Cust, Date, Basket)
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER BY Cust, Date, Basket
")
t1
## Days Cust Basket Spend
## 1 <NA> C1 B1 80
## 2 92.0 C1 B2 30
## 3 190.0 C1 B3 60
and for the second question:
sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1")
## AvgDays Cust AvgSpend
## 1 141 C1 56.66667
Note: The data.frame a in reproducible form is:
Lines <- "Date Hour Cust Prod Basket Spend
1 20160416 8 C1 P1 B2 10
2 20160416 8 C1 P2 B2 20
3 20160115 15 C1 P3 B1 30
4 20160115 15 C1 P2 B1 50
5 20161023 11 C1 P4 B3 60"
a <- read.table(text = Lines, as.is = TRUE)
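If there were more than one customer, the second query would just need a GROUP BY (a small addition of mine, not in the original answer):
sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")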
Sample Data:
product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
price <- c(2.49,2.49,2.49,1.743,2.49,2.49, 2.093,2.093,2.11,2.11,2.11, 2.97)
date <- c("2/23/15","2/23/15", '3/16/15','3/16/15','5/16/15', "6/18/15", "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
sampleData <- data.frame(product_id, qty_ordered, price, date)
I would like to identify every time when a change in a price occurred. Also, I would like to sum() the total qty_ordered between those two price change dates. For example,
For product_id == "1000", a price change occurred on 3/16/15, from $2.49 to $1.743. The total qty_ordered for that first run at $2.49 is 1+2+1 = 4;
the difference between those two earliest price-change dates (2/23/15 to 3/16/15) is 21 days.
So the New Data Frame should be:
product_id sum_qty_ordered price date_diff
1000 4 2.490 21
1000 1 1.743 61
1000 2 2.490 33
Here is what I have tried:
NOTE: for this case, a simple dplyr::group_by will not work, since it ignores the date effect.
1) I found this code in Determine when columns of a data.frame change value and return indices of the change:
It is meant to identify every row where the price changes, i.e. the first date of each new price for each product.
IndexedChanged <- c(1, which(rowSums(sapply(sampleData[, 3, drop = FALSE], diff)) != 0) + 1)
sampleData[IndexedChanged,]
However, I am not sure how to calculate the sum(qty_ordered) and the date difference for each of those entries if I use that code.
2) I tried to write a WHILE loop to temporarily store each batch of product_id, price, and date range (i.e. a subset of the data frame with one product_id, one price, and all entries from the earliest date of that price until the last date before it changed),
and then summarise that subset to get sum(qty_ordered) and the date diff.
However, I always get confused between WHILE and FOR, so my code has some problems in it. Here is my code:
Create an empty data frame for later data storage:
NewData_Ready <- data.frame(
product_id = character(),
price = double(),
early_date = as.Date(character()),
last_date=as.Date(character()),
total_qty_demanded = double(),
stringsAsFactors=FALSE)
Create a temp table to store each batch of price/order entries:
temp_dataset <- data.frame(
product_id = character(),
qty_ordered = double(),
price = double(),
date=as.Date(character()),
stringsAsFactors=FALSE)
Loop:
This is messy... and probably doesn't make sense, so I really need help with this.
for ( i in unique(sampleData$product_id)){
#for each unique product_id in the dataset, we are gonna loop through it based on product_id
#for first product_id which is "1000"
temp_table <- sampleData[sampleData$product_id == i, ] #subset dataset by ONE single product_id
#this dataset only has product of "1000" entries
#starting a new for loop to loop through the entire entries for this product
for ( p in 1:length(temp_table$product_id)){
current_price <- temp_table$price[p] #assign current_price to the first price value
#assign $2.49 to current price.
min_date <- temp_table$date[p] #assign the first date when the first price change
#assign 2015-2-23 to min_date which is the earliest date when price is $2.49
while (current_price == temp_table$price[p+1]){
#while the next price is the same as the first price
#that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
#then execute the following statement
temp_dataset <- rbind(temp_dataset, temp_table[p,])
#if the WHILE loop is TRUE, means every 2 entries have the same price
#then combine each entry when price is the same in temp_table with the temp_dataset
#if the WHILE loop is FALSE, means one entry's price is different from the next one
#then stop the statement at the above, but do the following
current_price <- temp_table$price[p+1]
#this will reassign the current_price to the next price, and restart the WHILE loop
by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
NewRow <- dplyr::summarise(by_idPrice,
early_date = min(date),
last_date = max(date),
total_qty_demanded = sum(qty_ordered))
NewData_Ready <- rbind(NewData_Ready, NewRow)
}
}
}
I have searched a lot of related questions but have not found anything that addresses this problem yet. If you have some suggestions, please let me know.
I would greatly appreciate your time and help!
Here is my R version:
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
Using data.table:
library(data.table)
setDT(sampleData)
Some Preprocessing:
sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
Based on how you calculate date diff, we are better off creating a range of dates for each row:
sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
Then create a new ID for every change in price:
sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
Then calculate your groupwise functions, by product and price run:
sampleData[,
.(
price = unique(price),
sum_qty = sum(qty_ordered),
date_diff = max(lastdate) - min(firstdate)
),
by = .(
product_id,
price_id
)
]
product_id price_id price sum_qty date_diff
1: 1000 0 2.490 4 21 days
2: 1000 1 1.743 1 61 days
3: 1000 2 2.490 2 33 days
4: 1002 0 2.093 3 28 days
5: 1002 1 2.110 4 31 days
6: 1002 2 2.970 1 0 days
I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1
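For reference, the adjusted aggregation would be (a sketch of mine; identical to the call above except for the + 1):
sampleData[,
  .(
    price = unique(price),
    sum_qty = sum(qty_ordered),
    date_diff = max(lastdate) - min(firstdate) + 1
  ),
  by = .(product_id, price_id)
]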
I have a R dataframe containing date info about generic events:
id;start_date;end_date.
Sometimes the same event may occur the same day (1) or at a distance of one day (2), for example:
(1)
1001;2016-05-07;2016-05-11
1001;2016-05-11;2016-05-14
(2)
1001;2016-05-07;2016-05-11
1001;2016-05-12;2016-05-14
In the first case the event "1001" ends and restarts the same day, while in the second case that event ends on 2016-05-11 and starts again the day after. I'd like to delete the second occurrence of the event in both cases.
If the second occurrence is at a distance of two or more days, it's ok to preserve the second occurrence. How can I do this in R?
Thank you in advance.
Partial solution, with my guess at what the data look like:
library(data.table)
dat <- data.table(id = c(1001,1001,1001,1001),
start_date = as.Date(c("2016-05-07", "2016-05-11", "2016-05-07", "2016-05-12")),
end_date = as.Date(c("2016-05-11", "2016-05-14", "2016-05-11", "2016-05-14")))
dat2 <- data.table(id = c(dat$id, NA),
start_date = c(dat$start_date, NA),
end_date = c(as.Date(NA), dat$end_date))
dat2[, dif := end_date - start_date]
Then you can just remove rows where dif >= -1, I guess (dif is 0 when the next occurrence starts the same day the previous one ended, and -1 when it starts the day after).
I've used the data.table package, but you can just do dat2$dif <- dat2$end_date - dat2$start_date.
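A slightly fuller sketch along the same lines, meant for the real data rather than the doubled-up example table above: order each id's occurrences by start_date, compare each start with the previous end within the same id, and drop rows whose gap is 0 or 1 days. The prev_end column name is mine.
library(data.table)
setkey(dat, id, start_date)                          # order occurrences within each id
dat[, prev_end := shift(end_date), by = id]          # previous occurrence's end date
dat[is.na(prev_end) | start_date - prev_end >= 2]    # keep first rows and gaps of 2+ days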
I have a column of strings in my data set formatted as year-week (e.g. '201401' is equivalent to 7 April 2014, the first fiscal week of the year).
I am trying to convert these to proper dates so I can manipulate them later; however, I always receive the same date for a given year, specifically the 14th of April.
e.g.
test_set <- c('201401', '201402', '201403')
as.Date(test_set, '%Y%U')
gives me:
[1] "2014-04-14" "2014-04-14" "2014-04-14"
Try something like this:
> test_set <- c('201401', '201402', '201403')
>
> extractDate <- function(dateString, fiscalStart = as.Date("2014-04-01")) {
+ week <- substr(dateString, 5, 6)
+ currentDate <- fiscalStart + 7 * as.numeric(week) - 1
+ currentDate
+ }
>
> extractDate(test_set)
[1] "2014-04-07" "2014-04-14" "2014-04-21"
Basically, I'm extracting the week number, converting it to days, and then adding that number of days to the start of the fiscal year (less 1 day to make things line up).
Not 100% sure what your desired output is, but this may work:
as.Date(paste0(substr(test_set, 1, 4), "-04-07")) +
(as.numeric(substr(test_set, 5, 6)) - 1) * 7
# [1] "2014-04-07" "2014-04-14" "2014-04-21"
I would like a function that counts the number of occurrences of a specific weekday per month,
e.g. Nov '13 -> 5 Fridays, while Dec '13 would return 4 Fridays.
Is there an elegant function that would return this?
library(lubridate)
num_days <- function(date){
x <- as.Date(date)
start <- floor_date(x, "month")
count <- days_in_month(x)
d <- wday(start)  # day of week of the 1st (Sunday = 1)
sol <- ifelse(d > 4, 5, 4)  # rough estimate: if the 1st falls on Thu, Fri or Sat, assume 5 Fridays
sol
}
num_days("2013-08-01")
num_days(today())
What would be a better way to do this?
1) Here d is the input, a Date class object, e.g. d <- Sys.Date(). The result gives the number of Fridays in the year/month that contains d. Replace 5 with 1 to get the number of Mondays:
first <- as.Date(cut(d, "month"))
last <- as.Date(cut(first + 31, "month")) - 1
sum(format(seq(first, last, "day"), "%w") == 5)
2) Alternatively, replace the last line with the following line. Here, the first term is the number of Fridays from the Epoch to the next Friday on or after the first of the next month, and the second term is the number of Fridays from the Epoch to the next Friday on or after the first of d's month. Again, replace all 5's with 1's to get the count of Mondays.
ceiling(as.numeric(last + 1 - 5 + 4) / 7) - ceiling(as.numeric(first - 5 + 4) / 7)
The second solution is slightly longer (although it has the same number of lines) but it has the advantage of being vectorized, i.e. d could be a vector of dates.
UPDATE: Added second solution.
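A quick sanity check of both variants against November 2013 (five Fridays), just to show they agree; the intermediate names are the same ones used in the answer above:
d <- as.Date("2013-11-15")
first <- as.Date(cut(d, "month"))
last <- as.Date(cut(first + 31, "month")) - 1
sum(format(seq(first, last, "day"), "%w") == 5)                                     # 5
ceiling(as.numeric(last + 1 - 5 + 4) / 7) - ceiling(as.numeric(first - 5 + 4) / 7)  # 5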
There are a number of ways to do it. Here is one:
countFridays <- function(y, m) {
fr <- as.Date(paste(y, m, "01", sep="-"))
to <- fr + 31
dt <- seq(fr, to, by="1 day")
df <- data.frame(date=dt, mon=as.POSIXlt(dt)$mon, wday=as.POSIXlt(dt)$wday)
df <- subset(df, df$wday==5 & df$mon==df[1,"mon"])
return(nrow(df))
}
It creates the first of the month, and a day in the next month.
It then creates a data frame of month index (on a 0 to 11 range, but we only use it for comparison) and weekday.
We then subset to rows that are a) in the same month and b) on a Friday. That is your result set, and
we return the number of rows as your answer.
Note that this only uses base R code.
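For example (my calls, matching the cases in the question):
countFridays(2013, 11)  # 5
countFridays(2013, 12)  # 4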
Without using lubridate:
#arguments to pass to function:
whichweekday <- 5
whichmonth <- 11
whichyear <- 2013
#function code:
firstday <- as.Date(paste('01',whichmonth,whichyear,sep="-"),'%d-%m-%Y')
lastday <- if(whichmonth == 12) { as.Date(paste('31','12',whichyear,sep="-"),'%d-%m-%Y') } else { seq(firstday, length=2, by="1 month")[2]-1 }
sum(
strftime(
seq.Date(
from = firstday,
to = lastday,
by = "day"),
'%w'
) == whichweekday)
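Wrapped into a reusable function, with the month end computed the same way for December as for every other month (the function and argument names are mine, not the original poster's):
count_weekday <- function(year, month, weekday = 5) {  # 5 = Friday in strftime's %w
  firstday <- as.Date(sprintf("%d-%02d-01", year, month))
  lastday  <- seq(firstday, length = 2, by = "1 month")[2] - 1
  sum(strftime(seq(firstday, lastday, by = "day"), "%w") == weekday)
}
count_weekday(2013, 11)  # 5 Fridays
count_weekday(2013, 12)  # 4 Fridays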