Generate time-series of any frequency in R

I used to generate time-series in R using the xts package:
library(xts)
seq <- timeBasedSeq('2015-06-01/2015-06-05 23')
z <- xts(1:length(seq), seq)
After a few tweaks I find it easy to generate data at a default rate of 1 hour, 1 minute or 1 second, but the help page ?timeBasedSeq does not clearly explain how to generate data at other frequencies. Say I want to generate data at a 10-minute rate: where should I put M (minutes) and 10 in the command above to get 10-minute data? The option M is mentioned in the help pages.

This isn't currently possible. The "by" value is essentially hard-coded at 1 unit. It should be possible to patch the code so you can specify the "by" component as something like "10M" for 10 minutes, since seq.POSIXct would accept a by = "10 mins" value.
If you want, you can create an issue for this feature request and we can discuss details of what this patch should include.
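Until such a patch exists, one workaround (a sketch of mine, not part of timeBasedSeq) is to build the index with base seq(), which does accept arbitrary steps like "10 min", and then wrap it in xts() exactly as in the question:

```r
# Workaround sketch: build a 10-minute index with base seq(), which accepts
# arbitrary "by" steps, then wrap it in xts() as in the question.
idx <- seq(from = as.POSIXct("2015-06-01 00:00:00"),
           to   = as.POSIXct("2015-06-05 23:00:00"),
           by   = "10 min")
length(idx)  # one point every 10 minutes
# z <- xts(seq_along(idx), order.by = idx)
```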

Related

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data <- data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                   day = c(1:5, 1:10),
                   device_repaired = sample(0:1, 15, replace = TRUE),
                   device_replaced = sample(0:1, 15, replace = TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset the dataset: rows for this device within the window of
  # fpreviousdays days strictly before fday
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation <- NA
  } else {
    calculation <- sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown considerably, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply call.
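One hedged sketch of such an apply approach, reusing the question's getCalculation() (trimmed to one attribute): mapply() walks the row-varying arguments in lockstep, while MoreArgs carries the fixed ones, which is precisely the "extra arguments" mechanism asked about; parallel::mcmapply() is a drop-in parallel variant.

```r
set.seed(1)
data <- data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                   day = c(1:5, 1:10),
                   device_repaired = sample(0:1, 15, replace = TRUE))
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) &
                 device_id == fdeviceid)
  if (nrow(df) < fpreviousdays) NA else sum(df[, fattribute])
}
# mapply() supplies one (fday, fdeviceid) pair per row; the constant
# arguments travel in MoreArgs
data$repaired_last_2d <- mapply(getCalculation,
                                fday = data$day,
                                fdeviceid = data$device_id,
                                MoreArgs = list(fdata = data,
                                                fattribute = "device_repaired",
                                                fpreviousdays = 2))
```

This keeps the per-row function unchanged, so swapping mapply for parallel::mcmapply(..., mc.cores = n) parallelizes it without further edits.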

R: data.table. How to save dates properly with fwrite?

I have a dataset. I can choose to load it on R from a Stata file or from a SPSS file.
In both cases it's loaded properly with the haven package.
The dates are recognized properly.
But when I save it to disk with data.table's fwrite function,
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
I have a problem: the dates disappear and are converted to different numbers. For example, the date 1967-08-06 is saved in the csv file as -879.
I've also tried playing with fwrite options, such as quote=FALSE, with no success.
I've uploaded a small sample of the files: the SPSS, the Stata, and the saved csv.
And this is the code, to make things easier for you:
library(haven)
library(data.table)
ppp <- read_sav("pspss.sav") # choose one of these two.
ppp <- read_dta("pstata.dta") # choose one of these two.
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
The real whole table has more than one thousand variables and one million individuals. That's why I would like to use a fast way to do things.
http://www73.zippyshare.com/v/OwzwbyQq/file.html
This is for @ArtificialBreeze:
> head(my)
# A tibble: 6 x 9
ID_2006_2011 TIS FECHA_NAC_2006 año2006 Edad_31_12_2006 SEXO_2006
<dbl> <chr> <date> <date> <dbl> <chr>
1 1.60701e+11 BBNR670806504015 1967-08-06 2006-12-31 39 M
2 1.60701e+11 BCBD580954916014 1958-09-14 2006-12-31 48 F
3 1.60701e+11 BCBL451245916015 1945-12-05 2006-12-31 61 F
4 1.60701e+11 BCGR610904916012 1961-09-04 2006-12-31 45 M
5 1.60701e+11 BCMR580148916015 1958-01-08 2006-12-31 48 F
6 1.60701e+11 BCMX530356917018 1953-03-16 2006-12-31 53 F
# ... with 3 more variables: PAIS_NAC_2006 <dbl>, FECHA_ALTA_TIS_2006 <date>,
# FECHA_ALTA_TIS_2006n <date>
Since this question was asked 6 months ago, fwrite has improved and been released to CRAN. I believe it should work as you wanted now; i.e. fast, direct and convenient date formatting. It now has the dateTimeAs argument as follows, copied from fwrite's manual page for v1.10.0 as on CRAN now. As time progresses, please check the latest version of the manual page.
====
dateTimeAs : How Date/IDate, ITime and POSIXct items are written.
"ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.
"squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).
"epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.
"write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.
The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.
====
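To make the "squash" arithmetic in the excerpt concrete, here is a small illustration of those integer div/mod helpers (the variable names are mine, not data.table's):

```r
d <- 19670806L                  # a "squash"-formatted date, yyyymmdd
yyyy   <- d %/% 10000L          # year:  1967
mm     <- d %/% 100L %% 100L    # month: 8
dd     <- d %% 100L             # day:   6
yyyymm <- d %/% 100L            # 196708, handy for grouping by month
```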
I had the same problem, and I just changed the date column to as.character before writing, and then changed it back to as.Date after reading. I don't know how it influences read and write times, but it was a good enough solution for me.
These numbers make sense :) It seems that fwrite writes dates in their underlying numeric coding, whose origin is "1970-01-01".
When you read your data back, you can simply change the numbers into dates with this code:
my$FECHA_NAC_2006<-as.Date(as.numeric(my$FECHA_NAC_2006),origin="1970-01-01")
For example
as.Date(-879,origin="1970-01-01")
[1] "1967-08-06"
Since it seems there is no simple solution, I'm trying to store the column classes and change them back again.
I take the original dataset ppp,
areDates <- (sapply(ppp, class) == "Date")
I save it to a file and can read it back next time.
ppp <- fread("ppp.csv", encoding="UTF-8")
And now I change the classes of the newly read dataset back to the original one.
ppp[,names(ppp)[areDates] := lapply(.SD,as.Date),
.SDcols = areDates ]
Maybe someone can write it better with a for loop and the command set.
ppp[,lapply(.SD, setattr, "class", "Date") ,
.SDcols = areDates]
It can also be written with positions instead of a vector of TRUE and FALSE.
You need to add the argument dateTimeAs = "ISO".
By adding the dateTimeAs argument and specifying the appropriate option, you will get dates written to your csv file in the desired format AND with their respective time zone.
This is particularly important when dealing with POSIXct variables, which are time zone dependent. Omitting this argument may shift the dates and times written to the csv file by the hour difference between time zones. Thus, for POSIXct date/time variables you will need to add dateTimeAs = "write.csv"; unfortunately this option can be slow (https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/fwrite?). Good luck!!!
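For completeness, a minimal round trip showing the argument (assuming data.table >= 1.10.0, where fwrite gained dateTimeAs):

```r
library(data.table)
dt <- data.table(id = 1:2,
                 birth = as.Date(c("1967-08-06", "1958-09-14")))
f <- tempfile(fileext = ".csv")
fwrite(dt, f, dateTimeAs = "ISO")  # dates are written as text, not day counts
readLines(f)
```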

How can I compute data rates in R with millisecond precision data?

I'm trying to take data from a CSV file that looks like this:
datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074
and compute rates of change of the bytes value (which represents the number of bytes downloaded so far into a file), in a way that accurately reflects my detailed time data for when I took the sample (which should approximately be every 1/10 of a second, though for various reasons, I expect that to be imperfect).
For example, in the above sampling, the second row got (86594-52594=)34000 additional bytes over the first, in (.554342-.453565=).100777 seconds, thus yielding (34000/0.100777=)337,378 bytes/second.
A second example is that the last row compared to its predecessor got (1743074-1245314=)497760 bytes in (.259264-.158559=).100705 seconds, thus yielding (497760/.100705=)4,942,753 bytes/sec.
I'd like to get a graph of these rates over time, and I'm fairly new to R, and not quite figuring out how to get what I want.
I found some related questions that seem like they might get me close:
How to parse milliseconds in R?
Need to calculate Rate of Change of two data sets over time individually and Net rate of Change
Apply a function to a specified range; Rate of Change
How do I calculate a monthly rate of change from a daily time series in R?
But none of them seem to quite get me there... When I try using strptime, I seem to lose the precision (even using %OS); plus, I'm just not sure how to plot this as a series of deltas with timestamps associated with them... And the stuff in that one answer (second link, the answer with the AAPL stock delta graph) about diff(...) and -nrow(...) makes sense to me at a conceptual level, but not deeply enough that I understand how to apply it in this case.
I think I may have gotten close, but would love to see what others come up with. What options do I have for this? Anything that could show a rolling average (over, say, a second or 5 seconds), and/or using nice SI units (KB/s, MB/s, etc.)?
Edit:
I think I may be pretty close (or even getting the basic question answered) with:
my_data <- read.csv("my_data.csv")
my_deltas <- diff(my_data$bytes)
# %OS parses the seconds together with their fractional part, so the
# format must not also contain %S; digits.secs controls how they print
options(digits.secs = 6)
my_times <- strptime(my_data$datetime, "%Y-%m-%dT%H:%M:%OS")
my_times <- my_times[2:nrow(my_data)]
df <- data.frame(my_times, my_deltas)
plot(df, type = 'l', xlab = "When", ylab = "bytes/s")
It's not terribly pretty (especially the y axis labels, and the fact that, with a longer data file, it's all pretty crammed with spikes), though, and it's not getting the sub-second precision, which might actually be OK for the larger problem (in the bigger graph, you can't tell, whereas with the sample data above, you really can), but still is not quite what I was hoping for... so, input still welcomed.
A possible solution:
# reading the data
df <- read.table(text="datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074", header=TRUE, sep=",")
# formatting & preparing the data
df$bytes <- as.numeric(df$bytes)
df$datetime <- gsub("T", " ", df$datetime)
df$datetime <- strptime(df$datetime, "%Y-%m-%d %H:%M:%OS")
# seconds within the current minute; fine for this sample, but diff()
# would break across a minute boundary
df$sec <- as.numeric(format(df$datetime, "%OS6"))
# calculating the change in bytes per second
df$difftime <- c(NA, diff(df$sec))
df$diffbytes <- c(NA, diff(df$bytes))
df$bytespersec <- df$diffbytes / df$difftime
# creating the plot
library(ggplot2)
ggplot(df, aes(x=sec,y=bytespersec/1000000)) +
geom_line() +
geom_point() +
labs(title="Change in bytes\n", x="\nWhen", y="MB/s\n") +
theme_bw()
which produces a line plot of MB/s over the sampled interval.
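For the rolling-average part of the question, a base-R sketch (the byte counts are from the sample above, the timestamps simplified to an even 0.1 s grid, and the 3-sample window is an arbitrary choice): keeping full POSIXct times avoids the minute-boundary issue, and stats::filter() gives a centred moving average.

```r
options(digits.secs = 6)                    # show fractional seconds
t <- as.POSIXct("2014-10-24 10:38:49.453565") + seq(0, 0.8, by = 0.1)
bytes <- c(52594, 86594, 196754, 272914, 373554,
           544914, 952914, 1245314, 1743074)
rate <- diff(bytes) / as.numeric(diff(t))   # bytes per second between samples
smooth <- stats::filter(rate, rep(1/3, 3))  # centred 3-sample rolling mean
```

zoo::rollmean() would do the same smoothing if the zoo package is already loaded.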

Create warnings in R

I want to write a script which makes R usable for "everybody" for this particular kind of analysis. Is there a way to create warnings?
time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3
For example, if the value is 0 at least 3 times in a row (or better, within a set period of time, e.g. 3 days), give a warning and name the date(s). Maybe create something like a report, if I am combining conditions.
In general: measurement data are read via read.csv and the dates are then set with as.POSIXct (xts/zoo). I want the "user" to get a clear message if the values are changing etc., or if they are 0 for a long time.
The second step would be sending emails - maybe running on a server later.
Additional Questions:
I now have a df in xts: is it possible to check whether a value is greater than a threshold? It's not working because it's not an atomic vector.
Thanks
Try this.
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")
# check run lengths AND run values, so a run of three 5s doesn't also warn
runs <- rle(x$value)
if (any(runs$lengths >= 3 & runs$values == 0)) warning("I noticed some dates have value 0 at least three times.")
Warning message:
I noticed some dates have value 0 at least three times.
I'll leave it to you as a training exercise to paste a warning message that would also give you the date(s).

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time, by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs; I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri = levels(u$uri)[l],
                        hits = ratio(u[as.numeric(u$uri) == l, ])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
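A hedged sketch of what that data.table route could look like (column names taken from the question's sample; ten-second buckets as in the aggregate() answer):

```r
library(data.table)
u <- data.table(time = c(1355683900, 1355683900, 1355683905, 1355683906),
                uri = c("/some/uri", "/some/other/uri",
                        "/some/other/uri", "/some/uri"),
                action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS"))
# top 50 URIs by row count, then % TCP_HIT per URI per 10-second bucket
top50 <- head(u[, .N, by = uri][order(-N)], 50)$uri
hits <- u[uri %in% top50,
          .(hit_pct = mean(action == "TCP_HIT")),
          by = .(uri, bucket = time %/% 10)]
```

On ~20M rows the grouped mean() runs in C, so it should be far faster than looping, and the result feeds straight into a ggplot2 line plot.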
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}