Speed up conversion of 2 million rows of date strings to POSIXct - r

I have a csv which includes about 2 million rows of date strings in the format:
2012/11/13 21:10:00
Let's call that column csv$Date.and.Time.
I want to convert these dates (and their accompanying data) to xts as fast as possible.
I have written a script which performs the conversion just fine (see below), but it's terribly slow and I'd like to speed this up as much as possible.
Here is my current methodology. Does anyone have any suggestions on how to make this faster?
dt <- as.POSIXct(csv$Date.and.Time,tz="UTC")
idx <- format(dt,tz=z,usetz=TRUE)
So the script converts these date strings to POSIXct. It then does a timezone conversion using format() (z is a variable holding the timezone I am converting to). I then do a regular xts call to make this an xts series with the rest of the data in the csv.
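For reference, a minimal sketch of that xts step (the column selection and the tzone argument are assumptions, not the exact call from the question, and the remaining columns are assumed to be numeric):
library(xts)
x <- xts(as.matrix(csv[, setdiff(names(csv), "Date.and.Time")]), order.by = dt, tzone = z)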
This works 100%. It's just very, very slow. I've tried running this in parallel (it doesn't do anything; if anything it makes it worse). What do I mean by 'slow'?
   user  system elapsed
155.246  16.430 171.650
That's on a 3 GHz, 16 GB RAM 2012 MacBook Pro. I can get about half that time on a similar processor with 32 GB of RAM on a Windows 7 machine.
I'm sure someone has a better idea - I'm open to suggestions via Rcpp etc. However, ideally the solution works with the csv rather than some other method, like setting up a database. Having said that, I'm up to doing this via whatever method is going to give the fastest conversion.
I'd be super appreciative of any help at all. Thanks in advance.

You want the small and simple fasttime package by Simon, which does this in the fastest possible way---by not calling time-parsing functions but just using C-level string operations.
It does not support as many formats as strptime; in fact, it doesn't even take a format string. But well-formed ISO-style timestamps, that is yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
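A minimal sketch of that approach, assuming the fasttime package is installed (fastPOSIXct() always interprets its input as UTC; the tz argument only sets how the result is displayed):
library(fasttime)
dt  <- fastPOSIXct(csv$Date.and.Time, tz = "UTC")  # parse via C-level string scanning
idx <- format(dt, tz = z, usetz = TRUE)            # then render in the target timezone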

Try using lubridate - it does all date time parsing using regular expressions, so not only is it much faster, it's also much more flexible.
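A hedged sketch of what that would look like here (ymd_hms() guesses the separators, so the / format from the question should parse without a format string):
library(lubridate)
dt <- ymd_hms(csv$Date.and.Time, tz = "UTC")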

Related

How to convert string as YYYY-MM-DD:HH:mm:SS:sss into epoch time?

I have a string as YYYY-MM-DD:HH:mm:SS:sss (e.g. 2017-10-11:04:36:26.376). Now I want to convert it into epoch time. What would be a programmatic approach for this?
I am programming in C++ and am able to extract the information into variables.
It turns out there is a formula, but it's fairly ugly. I originally implemented something similar in BASIC 2.0 in 1982 (when each byte counted), and later converted it to Perl:
sub datestar {
    $_ = shift;
    /^(....)(..)(..)/;                  # capture year, month, day from "YYYYMMDD"
    $fy = ($1 - ($2 < 3));              # shift Jan/Feb into the previous year
    $jd = $fy*365 + int($fy/4) - int($fy/100) + int($fy/400)
        + int(((($2 - 3 + 12*($2 < 3))*30.6) + .5) + $3);
    return(86400*($jd - 719469));       # 719469 is the day count of 1970-01-01
}
Note that this takes something like "20171011", not "2017-10-11", and doesn't convert hours/minutes/seconds (which are easy to convert).
As always, double-check code before use, and use it as a template to write your own code if you really want to.
However, you would be infinitely better off using your programming language's existing functions to do this.
As others said, the formula is complex and would make the whole code a mess. To avoid that, I calculate the number of days from the input date to 01-01-2000. Since I know the epoch time of 01-01-2000, finding the number of days (accounting for leap years) lets me compute the total epoch time.
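To illustrate the arithmetic of that approach (shown in R, the language used elsewhere on this page, rather than C++; the string layout is taken from the question, and 946684800 is the Unix epoch time of 2000-01-01 00:00:00 UTC):
s    <- "2017-10-11:04:36:26.376"
d    <- as.Date(substr(s, 1, 10))               # calendar part
days <- as.numeric(d - as.Date("2000-01-01"))   # whole days since 2000-01-01, leap years included
secs <- as.numeric(substr(s, 12, 13)) * 3600 +
        as.numeric(substr(s, 15, 16)) * 60 +
        as.numeric(substr(s, 18, 23))           # seconds within the day
epoch <- 946684800 + days * 86400 + secs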

Want only the time portion of a date-time object in R

I have a vector of times in R, all_symbols$Time, and I am trying to find out how to get JUST the times (or convert the times to strings without losing information). I use
strptime(all_symbol$Time[j], format="%H:%M:%S")
which for some reason assumes the date is today and returns
[1] "2013-10-18 09:34:16"
Date and time formatting in R is quite annoying. I am trying to get the time only, without adding too many packages (really any -- I am on a school computer where I cannot install libraries).
Once you use strptime you will of necessity get a date-time object, and the default behavior when there is no date in the format string is to assume today's date. If you don't like that, you will need to prepend a string that is the date of your choice.
@James's suggestion is equivalent to what I was going to suggest:
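For example (a hedged illustration; the prepended date here is arbitrary):
strptime(paste("1970-01-01", "09:34:16"), format = "%Y-%m-%d %H:%M:%S")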
format(all_symbol$Time[j], format="%H:%M:%S")
The only package I know of that has time classes (i.e time of day with no associated date value) is package:chron. However I find that using format as a way to output character values from POSIXt objects lends itself well to functions that require factor input.
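A small example of that (the timestamp is the one from the question; format() hands back a plain character vector, so no date is carried along):
x <- as.POSIXct("2013-10-18 09:34:16")
format(x, format = "%H:%M:%S")
# [1] "09:34:16"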
In the decade since this was written there is now a package named “hms” that has some sort of facility for hours, minutes, and seconds.
hms: Pretty Time of Day
Implements an S3 class for storing and formatting time-of-day values, based on the 'difftime' class.
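A hedged example, assuming a recent version of hms that provides as_hms():
library(hms)
as_hms("09:34:16")   # a time of day with no date attached
as_hms(Sys.time())   # drops the date portion of a POSIXct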
I came across the same problem recently and found this and other posts, like "R: How to handle times without dates?", inspiring. I'd like to contribute a little for whoever has a similar question.
If you only want to use base R, take advantage of as.Date(..., format = "...") to transform your date into a standard format. Then you can use substr to extract the time, e.g. substr("2013-10-01 01:23:45 UTC", 12, 16) gives you 01:23.
If you can use package lubridate, functions like mdy_hms will make life much easier. And substr works most of the time.
If you want to compare times, that should work as long as they are Date or POSIXt objects. If you only want the time part, you can force it into a numeric (you may need to transform it back later), e.g. as.numeric(hm("00:01")) gives 60, which means it is 60 seconds after 00:00:00; as.numeric(hm("23:59")) gives 86340.

RapidMiner: converting a Unix timestamp

Does anybody know a way to convert a Unix timestamp to a date_time attribute?
I tried using R extensions (my operators are mainly written in R), such as the as.POSIXct function, to convert the timestamps, but it seems that RapidMiner doesn't like it and keeps ignoring it.
Any help is appreciated
Thanks
A little-known feature of generating attributes is that the input attribute can also be the output attribute, so no new one is created. In addition, the type of the attribute is changed.
In other words, a construction like this would work as long as the input is milliseconds since the epoch.
unixtime = date_parse(unixtime)
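If you would rather do the conversion on the R side (as the question suggests), a minimal sketch is to divide the millisecond timestamps by 1000 and set the origin explicitly (the example value is made up):
ts_ms <- 1352841000000                                       # hypothetical milliseconds since the epoch
as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "UTC")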

Alternative to sqlite OR a better way to handle date / time fields in sqlite

My data tends to be medium to large but never qualifies as "BIG" data. The data is almost always complexly relational. For the purposes I'm talking about here, 10-50 tables with a total size of 1-10 GB. Nothing more. When I deal with data bigger than this, I'll stick it into Postgres or SQL Server.
Overall, I like SQLite, but the data I work with has lots and lots of date/datetime fields, and dealing with date fields in SQLite makes my head hurt; when I move data back and forth between R and SQLite, my dates often get mangled.
I am looking for either a file-based alternative to SQLite that is easy to work with from R,
OR
Better techniques/packages for moving data in/out of SQLite and R without mangling the dates. My goal is to stop mangling my dates. For example, when I use dbWriteTable from the RSQLite package my dates are usually messed up in a way that makes them impossible to work with.
My primary workstation is running Ubuntu but I work in an office dominated by Windows. If suggesting an alternative to SQLite, +++ for an alternative that works on both platforms (or more).
Use epoch times and dates (days from the origin, seconds from the origin). Converting epochs into R POSIXct or Date is fast (strings are very slow).
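A hedged sketch of that round trip with RSQLite (the table and column names are made up for illustration):
library(RSQLite)
con <- dbConnect(SQLite(), "mydata.sqlite")

df <- data.frame(day = as.integer(Sys.Date()),   # days since 1970-01-01
                 ts  = as.numeric(Sys.time()))   # seconds since 1970-01-01
dbWriteTable(con, "events", df, overwrite = TRUE)

out <- dbGetQuery(con, "SELECT day, ts FROM events")
out$day <- as.Date(out$day, origin = "1970-01-01")
out$ts  <- as.POSIXct(out$ts, origin = "1970-01-01", tz = "UTC")
dbDisconnect(con)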
Edit: Another alternative, after re-reading and considering the size of your data:
You could simply save the tables directly in R format, perhaps with a small piece of extra metadata describing the key relationships between tables. You would have to create your own conventions and all, but it's definitely smoother (no impedance mismatches).
Also, I'm personally very partial to the package data.table. It's fast and has a syntax which is pure R but maps nicely onto SQL concepts. E.g. in dt[i, j, by=list(...)], i corresponds to "where", j corresponds to "select", and by to "group by", and there are facilities for joins as well, although I wrote infix wrappers around those so they were easier to remember.
I typically do my data processing work exclusively in R (after an initial pull from SQLite), and I find data.table faster and more practical than massive sqldf queries.
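A toy illustration of that mapping (not from the original answer):
library(data.table)
dt <- data.table(grp = c("a", "a", "b"), val = 1:3)
# SQL: SELECT grp, SUM(val) AS total FROM dt WHERE val > 1 GROUP BY grp
dt[val > 1, list(total = sum(val)), by = grp]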
http://datatable.r-forge.r-project.org/
SQLite wants to read the data in the standard format "YYYY-MM-DD HH:MM:SS" (you can omit the time part if you don't need it)---I don't know of a way to read arbitrary date strings. This results in a normalized date being stored.
On output, you want to format the date using sqlite functions to whatever your other software needs---check the options of strftime().
For instance, Octave likes the day number since year 0, so if I have a table mydata with a column mydate, I'd do
select julianday(mydate)-1721059.666667 from mydata
The magic number is julianday("0000-01-01T00:00:00-04:00") and compensates for the fact that julianday counts from 4713 BC, whereas Octave counts from year 0.

Out of memory when modifying a big R data.frame

I have a big data frame taking up about 900 MB of RAM. Then I tried to modify it like this:
dataframe[[17]][37544]=0
It seems that this makes R use more than 3 GB of RAM and R complains, "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).
I found that this way is better:
dataframe[37544, 17]=0
but R's memory footprint still doubles and the command takes quite some time to run.
Coming from a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17]=0 should complete in a blink without costing any extra memory (only one cell should be modified). What is R doing for those commands I posted? What is the right way to modify some elements of a data frame without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran's suggestion of data.table, here are some links. Your object, at 900 MB, is manageable in RAM even in 32-bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to a matrix (appropriate for use inside loops, for example). See the latest NEWS for more details and an example. Also see ?":=", which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness:
require(data.table)
DT = as.data.table(dataframe)
# say column 17 is named 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q := 0]                   # using the column name (often preferred)
DT[37544, 17 := 0, with = FALSE]    # using the column number
col = "Q"
DT[37544, col := 0, with = FALSE]   # variable holding the name
col = 17
DT[37544, col := 0, with = FALSE]   # variable holding the number
set(DT, 37544L, 17L, 0)             # using set(i, j, value) in v1.8.0
set(DT, 37544L, "Q", 0)
But, please do see linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N MB/GB/... in size, you need around 3*N of RAM. Such is life with an interpreted system.
Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) relatively little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away without the contiguous allocation R otherwise insists on.
Data frames are the worst structure you can choose to make modifications to. Due to the quite complex handling of all their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this - it has been discussed at length several times.
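You can watch this happen with tracemem(), which prints a message each time R duplicates the object (a hedged illustration with made-up data):
df <- data.frame(a = runif(1e6), b = runif(1e6))
tracemem(df)
df[1, 1] <- 0   # triggers one or more tracemem reports as copies are made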
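A minimal sketch of that idea (dimensions chosen arbitrarily; bigmemory keeps the data outside R's normal heap, so element assignment does not copy the whole object):
library(bigmemory)
bm <- big.matrix(nrow = 2e6, ncol = 17, type = "double", init = 0)
bm[37544, 17] <- 1   # modified in place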
The practical rule is to never use data frames for large data, unless you treat them read-only. You will be orders of magnitude more efficient if you either work on vectors or matrices.
There is a type of object called an ffdf in the ff package, which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
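A hedged sketch, assuming the ff package is installed and using the dataframe object from the question:
library(ff)
fdf <- as.ffdf(dataframe)   # columns become disk-backed ff vectors
fdf[[17]][37544] <- 0       # modify one element of column 17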
You can also try the RSQLite package.
