Test Data for Time and Date Parsing (Varied Formats) - datetime

Any ideas how I can get a varied set of time / date strings to test a parser?
The idea is to see how wide a range of different formats can be parsed. Note that I am looking for different formats, so simply extracting all timestamps from a bunch of emails isn't that useful (since the format is fixed by RFC 2822).
[Also, I am not sure this is appropriate for SO, sorry, so please feel free to suggest an alternative place to ask.]

You'll probably have to create your own list. But here are some resources describing some of the various formats you might encounter:
http://www.hackcraft.net/web/datetime/
http://en.wikipedia.org/wiki/Date_format_by_country
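If you do end up building your own list, one cheap way to seed it is to render the same instant through several strftime-style formats and collect the results. A minimal R sketch (the format list below is just an illustrative starting point, not exhaustive):

d <- as.POSIXct("2010-02-06 07:34:12", tz = "UTC")
fmts <- c("%Y-%m-%d %H:%M:%S",          # ISO-like
          "%d/%m/%Y %H:%M",             # day-first, common in Europe
          "%m/%d/%y %I:%M %p",          # month-first with AM/PM
          "%d %b %Y",                   # e.g. 06 Feb 2010
          "%a, %d %b %Y %H:%M:%S %z",   # RFC 2822 style
          "%Y%m%dT%H%M%SZ")             # compact ISO 8601
sapply(fmts, function(f) format(d, f))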

Related

Custom column design

I have a raw dataset and the columns are not clearly defined at all. When I go to import the data using read.table in R, it automatically tries to approximate where the columns begin and end, but it gets them wrong. I know the number of characters per variable, but I am not sure how to specify the widths myself, as one would in Excel (=LEFT(x,3) or =MID(x,4,1), etc.). Some variables are separated by spaces, some aren't. It is not consistent.
FYI: The document was originally ".dat", then I saved the file as a ".R" file.
Here is an example of my data
Any help is much appreciated! Let me know.
You can use read_fwf from the great readr package to specify the fixed widths per variable.
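For instance, a minimal sketch (the widths, column names, and file name below are hypothetical, since the real layout isn't shown in the question):

library(readr)
# three fixed-width columns: 3, 1 and 5 characters respectively
cols <- fwf_widths(c(3, 1, 5), col_names = c("id", "flag", "value"))
dat  <- read_fwf("mydata.dat", col_positions = cols)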

Weka Apriori No Large Itemset and Rules Found

I am trying to do Apriori association mining with WEKA (I use 3.7) on a given database table.
So I exported two columns (orderLineNumber and productCode) and loaded them into WEKA. So far I haven't had a single successful attempt; it always ends with "No large itemsets and rules found!"
I also tried converting the CSV into an ARFF file first using the ARFF converter and still got the same message.
I tried the database loader in WEKA as well; the data loaded just fine but still gave the same result.
The only filter I have applied in preprocessing is the NumericToNominal filter.
What have I done wrong here? I suspect it is my ARFF format. Thank you.
Update
After further trials, I found out that I had exported the wrong column and was missing one filter step, "Denormalize". I installed the plugin via the package manager and denormalized my data after converting it to nominal first.
I then compared the results with the "Supermarket" sample's results. The only differences are that my output comes with 'f' instead of 't' (as shown below) and the confidence value always seems to be 100%.
First of all, OrderLine is the wrong column.
Obviously, the position on the printed bill is not very important.
Secondly, the file format is not appropriate.
You want one line for every order and one column for every possible item in the @data section. To save memory, it may be helpful to use a sparse format (do not forget to set the flags appropriately).
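For illustration, a hypothetical sparse ARFF file along those lines (the attribute names and the t/f encoding are made up here; in sparse @data rows, any attribute not listed falls back to the first value of its declaration, so f must come first):

@relation orders
@attribute apple {f, t}
@attribute banana {f, t}
@attribute milk {f, t}
@attribute diapers {f, t}
@attribute beer {f, t}
@data
{0 t, 1 t}
{2 t, 3 t, 4 t}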
Other tools like ELKI can process input formats like the following, which may be easier to use (ELKI was also a lot faster than Weka):
apple banana
milk diapers beer
But last I checked, ELKI would "only" find the frequent itemsets (the harder part), not compute association rules. I then used a tiny Python script to produce the actual association rules as desired.
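For the record, a rough sketch in R (not the original Python script) of how rules can be derived once the frequent itemsets and their supports are known, using confidence(A => B) = support(A and B) / support(A); the itemsets below are invented toy values:

itemsets <- list(
  list(items = c("milk"), support = 0.40),
  list(items = c("beer"), support = 0.30),
  list(items = c("milk", "beer"), support = 0.25)
)
support <- function(x) {
  for (s in itemsets) if (setequal(s$items, x)) return(s$support)
  NA_real_
}
min_conf <- 0.5
for (s in itemsets) {
  if (length(s$items) < 2) next
  for (k in 1:(length(s$items) - 1)) {
    for (lhs in combn(s$items, k, simplify = FALSE)) {
      conf <- s$support / support(lhs)
      if (!is.na(conf) && conf >= min_conf)
        cat(paste(lhs, collapse = ","), "=>",
            paste(setdiff(s$items, lhs), collapse = ","),
            sprintf("(confidence %.2f)\n", conf))
    }
  }
}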

Specify end of record (EOL) delimiter while importing from a text file?

I'm trying to import into R a large number of pipe-delimited files that were created in a Windows environment, with CR+LF as the end-of-record (=EOL) delimiter. However, they also have CRs scattered about periodically, which results in frequent, inappropriately split lines. Ideally, I want an efficient way to solve this problem from within R, either by finding a way to specify the EOL delimiter when I import, or, if necessary, by reading in the text file and excising the CRs before any parsing of lines is done.
The creators of the files comment on this problem and recommend adding "TERMSTR= CRLF" into your SAS code, and I can find lots of discussions of how to do this in other languages as well. For R, however, all I can find is this discussion, here on stackoverflow:
Possible to change the record delimiter in R?
The sample problem given is a great match for my problem. The solution identified is nice for their specific situation of having a single file like this, but for me it would require coding up separate scripts for importing each of the dozens of files, since each has different primary keys that would need to be recognized after the fact to repair the inappropriate import. Alternatively, I could open each file in something like Notepad++ to remove the extra CRs, but again, that seems quite inefficient, and it would have to be repeated by hand every time the initial data set was updated by its producers.
Given how frequent a problem this seems to be for people, and the existence of hard-coded solutions in other programming languages, I'm confused as to why there isn't a fix in R and feel like I must be missing something. It seems clear (I think?) that there's no way to do this directly from read.table or even from readLines, but is there perhaps a way to do this using scan that I'm missing?
Thanks for any thoughts!
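For what it's worth, a minimal sketch of the "read the text in and excise the CRs before any parsing" route described in the question; the file name and header = TRUE are placeholders/assumptions, while the pipe separator comes from the question:

# read the whole file as raw text, normalise line endings, drop stray CRs
size <- file.info("myfile.txt")$size
txt  <- readChar("myfile.txt", size, useBytes = TRUE)
txt  <- gsub("\r\n", "\n", txt, fixed = TRUE)  # keep the real record ends
txt  <- gsub("\r", "", txt, fixed = TRUE)      # remove the scattered CRs
dat  <- read.table(text = txt, sep = "|", header = TRUE,
                   stringsAsFactors = FALSE)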

Speed up conversion of 2 million rows of date strings to POSIXct

I have a csv which includes about 2 million rows of date strings in the format:
2012/11/13 21:10:00
Let's call that csv$Date.and.Time.
I want to convert these dates (and their accompanying data) to xts as fast as possible.
I have written a script which performs the conversion just fine (see below), but it's terribly slow and I'd like to speed this up as much as possible.
Here is my current methodology. Does anyone have any suggestions on how to make this faster?
dt <- as.POSIXct(csv$Date.and.Time,tz="UTC")
idx <- format(dt,tz=z,usetz=TRUE)
So the script converts these date strings to POSIXct. It then does a timezone conversion using format (z is a variable representing the TZ to which I am converting). I then do a regular xts call to make this an xts series with the rest of the data in the csv.
This works 100%. It's just very, very slow. I've tried running this in parallel (it doesn't do anything; if anything it makes it worse). What do I mean by 'slow'?
user system elapsed
155.246 16.430 171.650
That's on a 3 GHz, 16 GB RAM 2012 MacBook Pro. I can get about half that on a similar processor with 32 GB of RAM on a Win7 machine.
I'm sure someone has a better idea - I'm open to suggestions via Rcpp etc. However, ideally the solution works with the csv rather than some other method, like setting up a database. Having said that, I'm up to doing this via whatever method is going to give the fastest conversion.
I'd be super appreciative of any help at all. Thanks in advance.
You want the small and simple fasttime package by Simon, which does this in the fastest possible way: not by calling time-parsing functions but by using C-level string functions directly.
It does not support as many formats as strptime. In fact, it doesn't even take a format string. But well-formed ISO format variants, that is, yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
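A hedged sketch of that route; if the "/" separator turns out not to be accepted, a cheap gsub to the ISO "-" form keeps the fast C-level path usable:

library(fasttime)
x  <- csv$Date.and.Time                    # e.g. "2012/11/13 21:10:00"
dt <- fastPOSIXct(gsub("/", "-", x, fixed = TRUE), tz = "UTC")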
Try using lubridate: it does all date-time parsing using regular expressions, so not only is it much faster, it's also much more flexible.
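A minimal sketch of that suggestion (assuming the single format shown in the question); with_tz then plays the role of the format-based timezone conversion:

library(lubridate)
dt  <- ymd_hms(csv$Date.and.Time, tz = "UTC")  # parses "2012/11/13 21:10:00"
idx <- with_tz(dt, tzone = z)                  # z as defined in the question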

dos date/time calculation

I am working on a project that involves converting data into DOS date and time. Using a hex editor (Hex Workshop) I have looked through the file manually and found the values I am looking for; however, I am unsure how they are calculated. I am told that the int16 value 15430 corresponds to the date 06/02/2010, but I can see no correlation. The value 15430 also corresponds to the time 07:34:12, but I am lost as to how it is calculated. Any help with these calculations would be very welcome.
You need to look at the bits in those numbers.
See here for details:
http://www.vsft.com/hal/dostime.htm
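For reference, a small R sketch of the standard DOS packed date/time bit layout, decoding the value from the question:

v <- 15430L
# date layout: bits 15-9 = years since 1980, bits 8-5 = month, bits 4-0 = day
c(year  = 1980L + bitwShiftR(v, 9),
  month = bitwAnd(bitwShiftR(v, 5), 0x0F),
  day   = bitwAnd(v, 0x1F))                 # -> 2010, 2, 6   (06/02/2010)
# time layout: bits 15-11 = hours, bits 10-5 = minutes, bits 4-0 = seconds / 2
c(hour   = bitwShiftR(v, 11),
  minute = bitwAnd(bitwShiftR(v, 5), 0x3F),
  second = 2L * bitwAnd(v, 0x1F))           # -> 7, 34, 12    (07:34:12)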
I know this post is very old, but I think the time 07:34:12 corresponds to 15436 (not 15430).
