I performed static N2O chamber measurements that I would now like to analyse using the gasfluxes package (https://cran.r-project.org/web/packages/gasfluxes/gasfluxes.pdf).
I measured different samples (POTS) during 10-minute intervals. Each sample was measured twice a day (SESSION: AM, PM) for 9 days. The N2O analyzer saved a concentration reading (conc.) every second.
My data now looks like this:
DATE POT SESSION TIME Concentration
1: 2017-10-18T00:00:00Z O11 AM 10:16:00.746 0.3512232
2: 2017-10-18T00:00:00Z O11 AM 10:16:01.382 0.3498687
3: 2017-10-18T00:00:00Z O11 AM 10:16:02.124 0.3482681
4: 2017-10-18T00:00:00Z O11 AM 10:16:03.216 0.3459306
5: 2017-10-18T00:00:00Z O11 AM 10:16:04.009 0.3459124
6: 2017-10-18T00:00:00Z O11 AM 10:16:04.326 0.3456660
To use the package, I need to calculate closing times from the exact time (TIME) data points. The time should look like this (table taken from the package PDF linked above):
serie V A time C
1: ID1 0.522625 1 0.0000000 0.3317823
2: ID1 0.522625 1 0.3333333 0.3304053
3: ID1 0.522625 1 0.6666667 0.3394311
4: ID1 0.522625 1 1.0000000 0.4469102
5: ID2 0.523625 1 0.0000000 0.4572708
How can I calculate this for each individual 10-minute measurement period for each pot? Basically, it should list the increasing number of seconds, since my machine measured the concentration every second.
My idea is to group by "POT", "DATE" and "SESSION", which creates a unique identifier for one complete chamber measurement, and then loop over the groups.
I also learned that I should use lubridate, as I'm working with times (https://data.library.virginia.edu/working-with-dates-and-time-in-r-using-the-lubridate-package/). However, I still don't know how to calculate time durations for my case. Do I need to write a loop?
I tried something like this, but I always get error messages (see my former question, "R: Calculate measurement time-points for separate samples"):
df.HMR %>%
  group_by(DATE, Series, Session) %>%
  mutate(dt = as.POSIXct(df.HMR$TIME, format = "%H:%M:%S"),
         time_diff = dt - lag(dt))
Error message: Column dt must be length 838 (the group size) or one, not 379698
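I suspect the problem is the df.HMR$ prefix inside mutate(): it passes the full 379698-row column to every 838-row group. A minimal sketch of what I am trying to get (assuming dplyr and lubridate are loaded, the column names shown above, and that TIME is a character column that lubridate::hms() can parse):
library(dplyr)
library(lubridate)

df.HMR %>%
  group_by(DATE, POT, SESSION) %>%
  mutate(secs = period_to_seconds(hms(TIME)),  # seconds since midnight
         time = secs - first(secs))            # elapsed seconds since chamber closing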
Can anyone help me or suggest another approach?
Any help is very welcome.
Many thanks!
The BTYD package in R looks very useful for predicting future customer behavior based on past transactions.
However, the walk-through only illustrates predicting how many transactions a customer will make in an upcoming period, for example in the next year or month.
Is there a way to use this package to create a prediction for the date on which a customer will purchase, and the expected amount of the purchase?
For example, using the sample data set available in the BTYD package:
cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- dc.ReadLines(cdnowElog, cust.idx = 2, date.idx = 3, sales.idx = 5)
# Change to date format
elog$date <- as.Date(elog$date, "%Y%m%d")
elog[1:3,]
# cust date sales
# 1 1 1997-01-01 29.33
# 2 1 1997-01-18 29.73
# 3 1 1997-08-02 14.96
I would want an output that has the customer number, expected next date of purchase, and expected purchase amount.
# cust exp_date exp_sales
# 1 1998-02-23 19.35
# 2 1997-09-12 39.83
# 3 1998-01-05 24.56
Or can this package only predict the expected number of transactions in a time period, not the date itself or the spend amount? Is there a better approach for what I want to achieve?
I apologize if this question seems very basic, but I couldn't find the answer to this conceptual question in the documentation.
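From what I can tell from the reference manual, the closest approximation is to combine the Pareto/NBD forecast of the number of transactions with the gamma-gamma spend model; neither component outputs a calendar date directly. A sketch, assuming params, x, t.x, T.cal and m.x come from the standard vignette steps (dc.ElogToCbsCbt etc.):
library(BTYD)

# expected number of transactions per customer over the next 52 weeks
exp.trans <- pnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                                  x, t.x, T.cal)

# expected spend per transaction from the gamma-gamma spend model
spend.params <- spend.EstimateParameters(m.x, x)
exp.spend <- spend.expected.value(spend.params, m.x, x)

# expected sales over the window; a rough proxy for the next purchase date
# would be today plus 7 * T.star / exp.trans days (not a direct model output)
exp.sales <- exp.trans * exp.spend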
I have a data.table that looks like this:
dt
id month balance
1: 1 4 100
2: 1 5 50
3: 2 4 200
4: 2 5 135
5: 3 4 100
6: 3 5 100
7: 4 5 300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column that gives me the difference in a client's balance between months 4 and 5, so I know the transactions carried out from one month to the next.
This new variable should let me know that Client 1 drew 50, Client 2 drew 65 and Client 3 didn't do anything in aggregate terms between April and May. Client 4 is a new client who joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1, "up")
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most), the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[,transaction := balance - shift(balance,1), by = id]
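For reference, a self-contained version using the sample data from the question (only the data.table() call is added here):
library(data.table)
dt <- data.table(id      = c(1, 1, 2, 2, 3, 3, 4),
                 month   = c(4, 5, 4, 5, 4, 5, 5),
                 balance = c(100, 50, 200, 135, 100, 100, 300))

dt[, transaction := balance - shift(balance, 1), by = id]
# month-5 rows: Client 1 -50, Client 2 -65, Client 3 0; Client 4 NA (new client)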
Big thanks to @Ryan and @Onyambu for helping.
I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/29/2008 0.01463 0.01231 4
02/29/2008 0.13456 0.52552 2
02/29/2008 0.34534 0.36663 1
02/29/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically, I need to calculate the date-specific correlation between return1 and rank (so the correlation on 01/31/2008, 02/29/2008, and so on). I know I can split the data using the split() function, but I am unsure how to get the date-specific correlation from there. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing the calculations is time-consuming but, more importantly, more susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is:
library(dplyr)

a %>%
  group_by(Date) %>%
  summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).
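For instance, a rank-based Spearman correlation instead of the default Pearson:
a %>%
  group_by(Date) %>%
  summarise(Correlation = cor(return1, rank, method = "spearman"))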
I am quite new to R and have studied several posts and websites about time series and moving averages, but I simply cannot find a useful hint on averaging over a specific period of time.
My data is a table read via read.csv, with a date-time in one column and several other columns of values. The time steps in the data are not constant: sometimes 5 minutes, sometimes 2 hours. E.g.:
2014-01-25 14:50:00, 4, 8
2014-01-25 14:55:00, 3, 7
2014-01-25 15:00:00, 1, 4
2014-01-25 15:20:24, 12, 34
2014-01-25 17:19:00, 150, 225
2014-01-25 19:00:00, 300, 400
2014-01-25 21:00:00, NA, NA
2014-01-25 23:19:00, 312, 405
So I am looking for an averaging plot that:
1. calculates the data average over arbitrary intervals such as 30 minutes, 1 hour, 1 day, etc., so finer time steps are aggregated and coarser time steps are disaggregated;
2. (removed, since it is trivial to get the value per hour from a time series D averaged over X hours as D/X);
3. does not take data flagged as NA into account, i.e. the function should not interpolate/smooth through NA gaps, and a line plot should not connect the points across an NA gap.
I already tried
aggregate(list(value1 = data$value1, value2 = data$value2),
          list(time = cut(data$time, "1 hour")), sum)
but this does not fulfill needs 1 and 3 and is not able to disaggregate 2-hourly data steps.
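The closest I have come for need 1 is switching the aggregation function to a mean that skips NAs (a sketch; it assumes data$time is already POSIXct, and it still does not disaggregate the 2-hourly steps):
half.hourly <- aggregate(
  list(value1 = data$value1, value2 = data$value2),
  list(time = cut(data$time, "30 mins")),
  FUN = mean, na.rm = TRUE
)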
Answering point 3: plot automatically skips NA values and breaks the line.
Try this example:
plot(c(1:5, NA, NA, 6:10), type = "l")
Now, if you want to 'smooth' or average over time intervals purely for graphical purposes, it's probably easiest to start by separating your data at each NA and then running a spline or other smoothing operation on each subsection separately, as sketched below.
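A sketch of that idea, with hypothetical numeric vectors x and y (y containing the NA gaps): label each contiguous non-NA run, then smooth and draw each run separately.
run_id <- cumsum(is.na(y))           # every NA starts a new run id
ok <- !is.na(y)
plot(x, y, type = "n")               # empty canvas with the right axes
for (idx in split(which(ok), run_id[ok])) {
  if (length(idx) >= 4) {
    lines(smooth.spline(x[idx], y[idx]))  # smooth the longer runs
  } else {
    lines(x[idx], y[idx])                 # too short to smooth; draw as-is
  }
}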
I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?
First, I shall say that I disagree with the previous answer: for a subscription still active today, the tenure should be considered neither as the tenure up until today nor as NA. What do we know exactly about those subscriptions? We know they have lasted up until today; although we don't know exactly how long they will last, we know their lifetimes are longer than their tenure up to today.
This is a situation known as right-censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, event=status, type="interval2")
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of the computational details can be found at http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
From the Surv help page: interval-censored data can be represented in two ways. For the first, use type = "interval" and the codes shown above; in that usage, the value of the time2 argument is ignored unless event = 3. The second approach is to think of each observation as a time interval: (-infinity, t) for left-censored, (t, infinity) for right-censored, (t, t) for exact, and (t1, t2) for an interval. This is the approach used for type = "interval2", with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring date.
NA won't work with the survival object; I think those cases will simply be omitted. That is not what you want, because these cases contain important information about the survival.
SQL code to get the time till the event (use in the SELECT part of the query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW: I would use the difference in days for my analysis; it does not make sense to round the time off to months.
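An R equivalent of that SQL, as a sketch (assuming start_date and end_date are Date columns in a data frame called subscriptions, and a missing end_date means the subscription is still active):
end <- subscriptions$end_date
end[is.na(end)] <- Sys.Date()  # censor active subscriptions at today

subscriptions$tenure_in_days <- as.numeric(end - subscriptions$start_date)
subscriptions$status <- ifelse(is.na(subscriptions$end_date), 0, 1)  # 1 = cancelled, 0 = censored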
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise, I believe your encoding of the data is correct: the status of 0 for id 2 indicates it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).