I am trying to find the average daily occurrence rate based on occurrence counts over the last 7, 14, 30, 60, and 90 days.
For example, below is the data for two events over the last 7, 14, 30, 60, and 90 days:
| Event | 7 Days  | 14 Days | 30 Days | 60 Days  | 90 Days  |
|-------|---------|---------|---------|----------|----------|
| 1     | 2 times | 4 times | 8 times | 18 times | 19 times |
| 2     | 3 times | 6 times | 7 times | 10 times | 11 times |
Is it as simple as ((2/7) + (4/14) + (8/30) + (18/60) + (19/90)) / 5 for the first event?
This can also be thought of in terms of items 1 and 2 with their sales counts over the last 7/14/30/60/90 days, where we need to find the daily sales rate for each item.
Daily average in the last n days = (Total number of events in last n days)/(n)
For Event 1:
If you want to find the daily average over the last 7 days, it is 2/7; over the last 14 days it is 4/14, and so on. However, if you want to find the overall daily average you would have:
Overall daily average = (Total number of events)/(Total number of days)
For Event 1 that is 19/90, which is the same as the daily average over the last 90 days.
The expression you wrote is not meaningful: the windows overlap, since the last 14 days include the last 7 days, the last 30 days include the last 14, and so on, so averaging the per-window rates double-counts recent events.
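To see why averaging the per-window rates is not meaningful, here is a short Python sketch (illustrative only; the counts come from the table above) comparing the per-window daily rates with the overall daily average:

```python
# Cumulative event counts for Event 1 over trailing windows, from the table above
windows = [7, 14, 30, 60, 90]
counts = [2, 4, 8, 18, 19]

# Daily average within each trailing window
per_window_rates = [c / n for c, n in zip(counts, windows)]

# The overall daily average uses only the longest window:
# total events divided by total days
overall = counts[-1] / windows[-1]  # 19 / 90

# Averaging the per-window rates (the expression from the question)
# is a different quantity, because the windows overlap
questionable = sum(per_window_rates) / len(per_window_rates)

print(round(overall, 4))       # 0.2111
print(round(questionable, 4))  # 0.2698
```

The two numbers differ because the shorter windows, which cover only the most recent days, are counted again inside every longer window.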
I created a summary for my data and worked out percentages of occurrences per category.
Now I want to sum a subset of categories to show their combined value. For example, I want to be able to say that 51.1% of all occurrences fall within the categories 30, 60, and 120 days (the sum of rows #6, #9, and #3). The data frame is named "Summary_2".
  Category   Count  Percent
1 1 day          4     3.3%
8 5 days         5     4.1%
4 180 days       8     6.5%
5 240 days       9     7.3%
2 10 days       15    12.2%
3 120 days      18    14.6%
6 30 days       19    15.4%
7 360 days      19    15.4%
9 60 days       26    21.1%
This is a summary of tickets. I arbitrarily want to say that 50% of our tickets are resolved within 2 months, 30% are resolved in 180 to 360 days, and 20% are resolved within 10 days.
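As a sketch of the arithmetic (in Python for illustration; the counts are taken from the printed summary), the combined share is the subset's counts summed and divided by the total count. Note that computing from the raw counts gives 51.2%, while summing the already-rounded percentages gives the 51.1% quoted:

```python
# Counts per category, as printed in the Summary_2 data frame above
counts = {
    "1 day": 4, "5 days": 5, "180 days": 8, "240 days": 9,
    "10 days": 15, "120 days": 18, "30 days": 19,
    "360 days": 19, "60 days": 26,
}

total = sum(counts.values())  # 123 tickets in all

# Combined share of the 30-, 60-, and 120-day categories
subset = ["30 days", "60 days", "120 days"]
share = sum(counts[c] for c in subset) / total

print(f"{share:.1%}")  # 51.2%
```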
I can't find a way to define a time variable in Stata that represents two-month periods. I found in other forums how to define intervals of 3 months (quarters) or semesters, but that is not what I'm looking for.
I have a data set like this
year month observation
2000 1 40
2000 2 10
2000 3 50
2000 4 10
I created a variable bi_month as
year month bi_month observation
2000 1 1 40
2000 2 1 10
2000 3 2 50
2000 4 2 10
but here I'm not able to use the following code, nor the tsset command, because Stata has no built-in definition for bimonthly data:
gen mdate = ym(year, bi_month)
format mdate %tm
because Stata reads bi_month as indicating months from 1 to 12.
Bi-monthly (or bimonthly) doesn't seem to me an especially transparent term. I recommend twice-monthly and two-monthly for the two possible interpretations.
The main issue here is, it seems, wanting to work with an aggregation of monthly data to two-monthly data, specifically intervals Jan-Feb, ..., Nov-Dec. To that end I suggest representing two-month periods by the first month of each.
clear
input year month whatever
2000 1 40
2000 2 10
2000 3 50
2000 4 10
end
gen mdate = ym(year, month)
gen m2date = 2 * floor(mdate/2)
format m*date %tm
list
+-------------------------------------------+
| year month whatever mdate m2date |
|-------------------------------------------|
1. | 2000 1 40 2000m1 2000m1 |
2. | 2000 2 10 2000m2 2000m1 |
3. | 2000 3 50 2000m3 2000m3 |
4. | 2000 4 10 2000m4 2000m3 |
+-------------------------------------------+
Now such data can't be tsset or xtset using the new two-monthly date because each such date doesn't occur uniquely in the dataset.
But supposing that you reduce your dataset so that each two-monthly date occurs just once (or, maximally, once per panel). Now tsset or xtset is within reach, and the needed twist is just to set delta(2).
collapse whatever, by(year m2date)
tsset m2date, delta(2)
list
+--------------------------+
| year m2date whatever |
|--------------------------|
1. | 2000 2000m1 25 |
2. | 2000 2000m3 30 |
+--------------------------+
Representing each two-month period by the second month of each is equally systematic. Just add 1 to the recipe for m2date above.
Note: Strictly xtset requires only a panel identifier and doesn't insist on times occurring at most once for each panel. I am not sure that is widely useful, but it's another story.
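The core trick here, 2 * floor(mdate/2), is language-independent. As an illustration outside Stata, here is a minimal Python sketch of the same bucketing and aggregation (the month_index helper is just a stand-in for Stata's ym(), and the rows come from the example data):

```python
from collections import defaultdict

# Monthly observations: (year, month, value), as in the example data
rows = [(2000, 1, 40), (2000, 2, 10), (2000, 3, 50), (2000, 4, 10)]

# Integer month index (months since year 0), analogous to Stata's ym()
def month_index(year, month):
    return year * 12 + (month - 1)

# Bucket each month into its two-month period, represented by the
# first month of the pair: 2 * floor(index / 2)
buckets = defaultdict(list)
for year, month, value in rows:
    buckets[2 * (month_index(year, month) // 2)].append(value)

# Aggregate within each bucket (here: mean, matching collapse's default)
result = {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
print(result)  # {24000: 25.0, 24002: 30.0}
```

The two bucket means, 25 and 30, match the collapsed Stata dataset above; representing each period by its second month is just a matter of adding 1 to the bucket key.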
I have daily OHLC data for US stocks. I would like to derive a weekly time series from it and compute an SMA and EMA. The requirement is to create one weekly time series from the maximum high per week, and another from the minimum low per week. I would then compute their SMA and EMA and assign the values to every day of the week (one period forward). So, first problem first: how do I get the weekly series from the daily data using R (any package)? Better still, show me an algorithm for it in any language (Golang preferred); I can rewrite it in Golang if needed.
Date        High Low Week(High) Week(Low) WkSMAHigh(2DP) WkSMALow(2DP)
(the SMA columns are shifted one period forward)
Dec 24 Fri 6 3 8 3 5.5 1.5
Dec 23 Thu 7 5 5.5 1.5
Dec 22 Wed 8 5 5.5 1.5
Dec 21 Tue 4 4 5.5 1.5
Assume Holiday (Dec 20)
Dec 17 Fri 4 3 6 2 None
Dec 16 Thu 4 3
Dec 15 Wed 5 2
Dec 14 Tue 6 4
Dec 13 Mon 6 4
Dec 10 Fri 5 1 5 1 None
Dec 9 Thu 4 3
Dec 8 Wed 3 2
Assume Holiday (Dec 6 & 7)
I'd start by generating a column which specifies which week it is.
You could use the lubridate package to do this, though that would require converting your dates into Date types. It has a function called week which returns the number of full 7-day periods that have passed since Jan 1st, plus 1. However, I don't know whether this data spans several years. Besides, I think there's a simpler way to do this.
The example I'll give below will simply do it by creating a column which just repeats an integer 7 times up to the length of your data frame.
Pretend your data frame is called ohlcData.
# Create a sequence 7 at a time all the way up to the end of the data frame
# I limit the sequence to the length nrow(ohlcData) so the rounding error
# doesn't make the vectors uneven lengths
ohlcData$Week <- rep(seq(1, ceiling(nrow(ohlcData)/7)), each = 7)[1:nrow(ohlcData)]
With that created we can then go ahead and use the plyr package, which has a really useful function called ddply. This function applies a function to columns of data grouped by another column of data. In this case we will apply the max and min functions to your data based on its grouping by our new column Week.
library(plyr)
weekMax <- ddply(ohlcData[,c("Week", "High")], "Week", numcolwise(max))
weekMin <- ddply(ohlcData[,c("Week", "Low")], "Week", numcolwise(min))
That will then give you the min and max of each week. The dataframe returned for both weekMax and weekMin will have 2 columns, Week and the value. Combine these however you see fit. Perhaps weekExtreme <- cbind(weekMax, weekMin[,2]). If you want to be able to marry up date ranges to the week numbers it will just be every 7th date starting with whatever your first date was.
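Since the question asked for an algorithm in any language, here is a minimal Python sketch of the weekly aggregation step: it groups daily bars by ISO calendar week (Monday through Sunday) and takes the max high and min low per week, so holidays and missing trading days are handled naturally. The sample bars are made up for illustration:

```python
from datetime import date
from collections import defaultdict

# Daily bars: (date, high, low) -- a small sample; holidays are simply absent
bars = [
    (date(2021, 12, 21), 4, 4), (date(2021, 12, 22), 8, 5),
    (date(2021, 12, 23), 7, 5), (date(2021, 12, 24), 6, 3),
]

# Group by ISO (year, week); ISO weeks run Monday-Sunday
weeks = defaultdict(lambda: [float("-inf"), float("inf")])
for d, high, low in bars:
    key = tuple(d.isocalendar())[:2]   # (ISO year, ISO week number)
    wk = weeks[key]
    wk[0] = max(wk[0], high)           # weekly max of the highs
    wk[1] = min(wk[1], low)            # weekly min of the lows

for key, (wk_high, wk_low) in sorted(weeks.items()):
    print(key, wk_high, wk_low)
```

Computing the SMA/EMA over the resulting weekly values and shifting one period forward can then be done on the aggregated series.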
Was wondering how I would use R to calculate the below.
Assuming a CSV with the following purchase data:
| Customer ID | Purchase Date |
| 1 | 01/01/2017 |
| 2 | 01/01/2017 |
| 3 | 01/01/2017 |
| 4 | 01/01/2017 |
| 1 | 02/01/2017 |
| 2 | 03/01/2017 |
| 2 | 07/01/2017 |
I want to figure out the average time between repurchases by customer.
The math would be like the one below:
| Customer ID | AVG repurchase |
| 1           | 30 days        |  = (02/01 - 01/01) / 1 order
| 2           | 90 days        |  = ((03/01 - 01/01) + (07/01 - 03/01)) / 2 orders
| 3           | n/a            |
| 4           | n/a            |
The output would be the total average across customers -- so: 60 days = (30 avg for customer1 + 90 avg for customer2) / 2 customers.
I've assumed you have read your CSV into a data frame named df, and I've renamed your variables using snake case. Having variables with a space in the name can be inconvenient, which is why many use either snake case or camel case naming conventions.
Here is a base R solution:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
[1] 60.75
You may notice that we get 60.75 rather than 60 as you expected. This is because there are 31 days between customer 1's purchases (31 days in January until February 1), and similarly for customer 2's purchases -- there are not always 30 days in a month.
Explanation
by(df$purchase_date, df$customer_id, diff)
The by() function applies another function to data by groupings. Here, we are applying diff() to df$purchase_date by the unique values of df$customer_id. By itself, this would result in the following output:
df$customer_id: 1
Time difference of 31 days
-----------------------------------------------------------
df$customer_id: 2
Time differences in days
[1] 59 122
We then use
sapply(by(df$purchase_date, df$customer_id, diff), mean)
to apply mean() to the elements of the previous result. This gives us each customer's average time to repurchase:
1 2 3 4
31.0 90.5 NaN NaN
(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN values, so we use:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
which will average the previous results, ignoring missing values (which, in R include NaN values).
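For comparison, the same pipeline (split dates by customer, difference consecutive dates, average per customer, then average those averages) can be sketched in plain Python using the purchase data from the question:

```python
from datetime import date
from collections import defaultdict

# (customer_id, purchase_date) pairs from the question (dates were mm/dd/yyyy)
purchases = [
    (1, date(2017, 1, 1)), (2, date(2017, 1, 1)),
    (3, date(2017, 1, 1)), (4, date(2017, 1, 1)),
    (1, date(2017, 2, 1)), (2, date(2017, 3, 1)),
    (2, date(2017, 7, 1)),
]

# Group purchase dates by customer (input is already in date order)
by_customer = defaultdict(list)
for cid, d in purchases:
    by_customer[cid].append(d)

# Per-customer mean gap in days; customers with one purchase yield None
def mean_gap(dates):
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return sum(gaps) / len(gaps) if gaps else None

per_customer = {cid: mean_gap(ds) for cid, ds in by_customer.items()}

# Average the per-customer averages, skipping customers without repurchases
valid = [g for g in per_customer.values() if g is not None]
print(per_customer)              # {1: 31.0, 2: 90.5, 3: None, 4: None}
print(sum(valid) / len(valid))   # 60.75
```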
Here's another solution with dplyr + lubridate:
library(dplyr)
library(lubridate)
df %>%
mutate(Purchase_Date = mdy(Purchase_Date)) %>%
group_by(Customer_ID) %>%
summarize(AVG_Repurchase = sum(difftime(Purchase_Date,
lag(Purchase_Date), units = "days"),
na.rm=TRUE)/(n()-1))
or with data.table:
library(data.table)
setDT(df)[, Purchase_Date := mdy(Purchase_Date)]
df[, .(AVG_Repurchase = sum(difftime(Purchase_Date,
shift(Purchase_Date), units = "days"),
na.rm=TRUE)/(.N-1)), by = "Customer_ID"]
Result:
# A tibble: 4 x 2
Customer_ID AVG_Repurchase
<dbl> <time>
1 1 31.0 days
2 2 90.5 days
3 3 NaN days
4 4 NaN days
Customer_ID AVG_Repurchase
1: 1 31.0 days
2: 2 90.5 days
3: 3 NaN days
4: 4 NaN days
Note:
I first parsed Purchase_Date from its mm/dd/yyyy format with mdy(), then grouped by Customer_ID. Finally, for each Customer_ID, I calculated the mean difference in days between each Purchase_Date and its lag.
Data:
df = structure(list(Customer_ID = c(1, 2, 3, 4, 1, 2, 2), Purchase_Date = c(" 01/01/2017",
" 01/01/2017", " 01/01/2017", " 01/01/2017", " 02/01/2017", " 03/01/2017",
" 07/01/2017")), .Names = c("Customer_ID", "Purchase_Date"), class = "data.frame", row.names = c(NA,
-7L))
I would like to get a numeric vector of the time gaps (in minutes) between goals scored by a soccer team.
df <- data.frame(game=c(1,2,3,4,5,6,6,6,7),goaltime=c(NA,35,51,NA,NA,2,81,90,15))
NA indicates no goal was scored by the team in that game. The earliest a goal can be scored in a game is minute 1.
Each game lasts 90 minutes in total, so the output vector should be
c(125,106,221,79,9,15,75)
You can try:
diff(c(0,setdiff(90*(df$game-1)+df$goaltime,NA),90*max(df$game)))
#[1] 125 106 221 79 9 15 75
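To unpack that one-liner: it places every goal on a single cumulative clock at 90 minutes per game, drops the NAs, and differences consecutive goal times, with minute 0 at the front and the end of the last game at the back. A Python sketch of the same logic:

```python
# (game, goal_time) pairs from the question; None means no goal in that game
games = [(1, None), (2, 35), (3, 51), (4, None), (5, None),
         (6, 2), (6, 81), (6, 90), (7, 15)]

# Put each goal on one cumulative clock: 90 minutes per completed game
goal_times = [90 * (g - 1) + t for g, t in games if t is not None]

# Gaps between consecutive goals, from minute 0 through the end of game 7
clock = [0] + goal_times + [90 * max(g for g, _ in games)]
gaps = [b - a for a, b in zip(clock, clock[1:])]
print(gaps)  # [125, 106, 221, 79, 9, 15, 75]
```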