Generating "Hovmöller" style diagram from dataset with gaps in R

What I have is data in a tab delimited txt file in the following format (http://pastebin.com/XN3y9Wek):
Date Time Flow (L/h)
...
6/10/15 05:19:05 -0.175148624605041
6/10/15 05:34:05 -0.170297042615798
...
7/10/15 07:34:08 -0.033833540932291
7/10/15 07:49:08 -0.0256913011453011
...
The data currently ranges from 6/10/15 to 22/11/15. Measurements occur approximately every 15 minutes, but occasional data loss means that the number of data points differs from day to day. There are also longer gaps (for example, from the evening of 16/11 to the morning of 17/11) due to logger malfunction.
From this data I would like to create a figure similar to this one, as it offers a very nice seasonal representation of a large amount of data (my full dataset spans several years):
It's similar in style to a Hovmöller diagram. I have tried experimenting with R and the lattice package, but I struggle with the gaps in my datasets and the irregular number of data points per day.
Any help you can offer me, an R beginner, would be greatly appreciated!
(If it would be possible in PHP or Javascript, feel free to post this as well)
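One possible approach, sketched in base R rather than lattice: bin every reading into a calendar day and a 15-minute slot of that day, place the values on a complete day x slot grid, and leave cells without a reading as NA so that gaps simply show up as blank cells. The synthetic data, the names flow, grid and plot_hovmoller, and the file name in the comment are all assumptions made for illustration; the real file may need different parsing.

```r
# Synthetic stand-in for the logger file: one reading every 15 minutes,
# with a multi-hour outage and some randomly dropped rows.
set.seed(1)
stamps <- seq(as.POSIXct("2015-10-06 05:19:05", tz = "UTC"),
              as.POSIXct("2015-10-12 23:49:05", tz = "UTC"), by = "15 min")
flow <- data.frame(
  stamp = stamps,
  value = sin(2 * pi * as.numeric(format(stamps, "%H", tz = "UTC")) / 24)
)
outage <- flow$stamp > as.POSIXct("2015-10-09 18:00:00", tz = "UTC") &
          flow$stamp < as.POSIXct("2015-10-10 06:00:00", tz = "UTC")
flow <- flow[!outage, ]
flow <- flow[-sample(nrow(flow), 20), ]

# For the real file, something along these lines should give the same columns
# (hypothetical file name, day/month/year order assumed):
# raw   <- read.delim("flow.txt")
# stamp <- as.POSIXct(paste(raw$Date, raw$Time),
#                     format = "%d/%m/%y %H:%M:%S", tz = "UTC")

# Bin each reading into a calendar day and a 15-minute slot (0..95).
day     <- as.Date(flow$stamp, tz = "UTC")
hh      <- as.numeric(format(flow$stamp, "%H", tz = "UTC"))
mm      <- as.numeric(format(flow$stamp, "%M", tz = "UTC"))
slot    <- (hh * 60 + mm) %/% 15
day_idx <- as.numeric(day - min(day)) + 1

# Complete day x slot grid; cells without a reading stay NA.
days <- seq(min(day), max(day), by = "day")
grid <- matrix(NA_real_, nrow = length(days), ncol = 96)
agg  <- aggregate(value ~ day_idx + slot,
                  data = data.frame(day_idx, slot, value = flow$value),
                  FUN = mean)
grid[cbind(agg$day_idx, agg$slot + 1)] <- agg$value

# Days on the x axis, time of day on the y axis; NA cells are left blank.
plot_hovmoller <- function(grid, days) {
  image(x = as.numeric(days), y = (0:95) / 4, z = grid,
        xaxt = "n", xlab = "Day", ylab = "Hour of day",
        col = heat.colors(50))
  axis.Date(1, at = days, format = "%d/%m")
}
if (interactive()) plot_hovmoller(grid, days)
```

Because the grid is built from the full sequence of days, a logger outage spanning a night or several days needs no special handling: those cells are never filled.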

Related

Efficient way to compile NC4 file information from separate files in R

I am currently trying to compile temperature information from the WDFE5 Data set which is quite large in size and am struggling to find an efficient way to meet my goal. My main goals are to:
Determine the max temperature for individual days for each individual grid cell
Change the time step from hourly to daily and from UTC to MST.
The data set is stored in monthly NC4 files and contains the temperature data in a 3-dimensional matrix (time lat lon). My main question is whether there is an efficient way to compile this data to meet my goals, or to manipulate the NC4 files so they are easier to work with (somehow merge the monthly files into one mega file?).
I have tried two rather convoluted ways to catch holes between months (Example : due to the time conversion, some dates end up spanning between two months, which requires me to read in the next file and then continuing to read the data).
My first attempt read one monthly file at a time, using pmax() to get the maximum of each grid cell across the 24 hourly time steps of a day, and then repeated the process for the next day. I have been using ncvar_get() with start and count to read only one time step at a time. To catch days that span two months, I wrote a convoluted function that merges the two by calculating how many hourly steps are left in one month and how many are needed from the next.
My second attempt still used pmax(), but filled the holes between months differently. I built a date vector from the time variable, one entry per hourly time step, and matched steps by day. While this seems better, it still has to read multiple NC4 files, which gets very convoluted compared to reading a single NC4 file with all the needed information.
In the end, I ran a few test cases and both solutions seem to work, but they run extremely slowly and seem very overcomplicated to me. I was wondering if anyone had suggestions on how to better set up the NC4 files for reading and time conversion.
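A minimal sketch of the grouping step, using a synthetic array in place of the NC4 data. It assumes the hourly values from consecutive monthly files have already been read and joined along the time dimension, that the array is ordered [lon, lat, time], and that MST means a fixed UTC-7 offset; the names temp, times_utc and daily_max are made up for illustration. Once the hours are in one array with one time vector, the month boundary no longer needs special treatment.

```r
# Synthetic hourly data spanning a month boundary: temp[lon, lat, time].
times_utc <- seq(as.POSIXct("2020-01-30 00:00:00", tz = "UTC"),
                 as.POSIXct("2020-02-02 23:00:00", tz = "UTC"), by = "hour")
set.seed(42)
temp <- array(rnorm(3 * 2 * length(times_utc), mean = 270, sd = 5),
              dim = c(3, 2, length(times_utc)))

# Assumption: MST is a fixed UTC-7 offset. Shifting the clock by 7 hours and
# reading the date in UTC gives the local calendar day of every hourly step.
local_day <- as.Date(times_utc - 7 * 3600, tz = "UTC")

# Keep only local days for which all 24 hourly steps are present.
steps_per_day <- table(local_day)
full_days <- as.Date(names(steps_per_day)[steps_per_day == 24])

# Maximum over the time dimension for each grid cell, one slice per local day.
daily_max <- vapply(seq_along(full_days), function(i) {
  apply(temp[, , local_day == full_days[i], drop = FALSE], c(1, 2), max)
}, FUN.VALUE = matrix(0, dim(temp)[1], dim(temp)[2]))

dim(daily_max)  # lon x lat x number of complete local days
```

Reading a whole month per ncvar_get() call, instead of one time step at a time, may also reduce the run time considerably, memory permitting.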

SOM Data preparation

Good day.
I am three months into R and RStudio but am getting the hang of things. I am implementing a SOM solution with 38k records/observations using Kohonen SuperSOM, following Self-Organising Maps for Customer Segmentation using R.
My data have no missing values but almost 60 columns, many of which are dummyVars (I received the data in this format).
I have removed the ONE char Column (URL)
My Y column (as I understand it) is "shares" (How many times it was shared)
My data only consist of numerical data (dummyVars are of course 1 or 0)
I have Centered and Scaled my data (entire dataFrame)
As per the example I followed, I did convert the entire DF to a matrix
My problem is that my SOM takes ages to train even with multi-core processing, and my progress graph does not reach a nice flat-ish plateau: it does come down nicely but is still very erratic. All my other graphs are extremely high in population and there is no nice clustering. I have even tried 500 iterations with a 100x100 grid ;-(
I think/guess it is because of the huge number of columns, mostly dummyVars, e.g. dayOfWeek.Monday, dayOfWeek.Tuesday, category.LifeStile, category.Computers, etc.
What am I to do?
Should I convert the dummyVars back into another format? How, and why?
Please do not just give me a section of code, as I would like to understand why I need to do what.
Thanx
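On the "how" part only, here is a small base R sketch of collapsing a group of 0/1 dummy columns back into a single factor. It assumes the columns follow the prefix.level naming seen in the question and that exactly one column per group is 1 in every row; the helper collapse_dummies and the toy values are hypothetical. Whether a map trained on the collapsed data behaves better is not something this sketch demonstrates.

```r
# Toy frame with two one-hot groups, named like the columns in the question.
df <- data.frame(
  dayOfWeek.Monday    = c(1, 0, 0),
  dayOfWeek.Tuesday   = c(0, 1, 0),
  dayOfWeek.Wednesday = c(0, 0, 1),
  category.LifeStile  = c(1, 0, 1),
  category.Computers  = c(0, 1, 0),
  shares              = c(120, 4500, 870)
)

# Replace all columns named "<prefix>.<level>" by one factor column "<prefix>".
# Assumes exactly one 1 per row in the group; a row of all zeros would be
# assigned the first level, so check for that case in real data.
collapse_dummies <- function(data, prefix) {
  pattern <- paste0("^", prefix, "\\.")
  cols    <- grep(pattern, names(data), value = TRUE)
  lev     <- sub(pattern, "", cols)
  picked  <- max.col(as.matrix(data[cols]), ties.method = "first")
  data[[prefix]] <- factor(lev[picked], levels = lev)
  data[setdiff(names(data), cols)]
}

compact <- collapse_dummies(collapse_dummies(df, "dayOfWeek"), "category")
str(compact)
```

The reasoning behind trying this: with roughly 60 columns of which most are 0/1 indicators, a handful of categorical variables is spread over many dimensions, and collapsing them makes it visible how many distinct variables the data really contain before deciding how each should enter the map.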

Arranging different data points by date in R?

I'm quite new to R. Let me explain my situation clearly. I have excel files of data collected in time increments. The first column is the time collected. The second is the value. This data extends for quite a few years, so there is a lot of it across different files.
I would like to plot these values against time, but the files have gaps. Data wasn't recorded all of the time in each file, occasionally missing days or weeks. My files end up being inconsistent, with one looking like this:
2/10/11 430....
2/11/11 437....
4/6/11 309....
and the other looking like this:
2/10/11 456....
2/13/11 333....
2/14/11 287....
and so on...
How can I space these points appropriately along the x axis, so that they are not all pushed together?
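A short sketch of one way to do this: convert the first column to R's Date class, so that the plotting functions place each point at its true position in time and the gaps appear as empty stretches. The two data frames mimic the samples above; the values are shortened placeholders for the truncated numbers, month/day/year order is assumed, and the names to_series and plot_series are made up.

```r
# Stand-ins for two of the Excel files (values are placeholders).
file_a <- data.frame(date = c("2/10/11", "2/11/11", "4/6/11"),
                     value = c(430, 437, 309))
file_b <- data.frame(date = c("2/10/11", "2/13/11", "2/14/11"),
                     value = c(456, 333, 287))

# Parse the text dates (assumed month/day/2-digit year) and sort by date.
to_series <- function(d) {
  d$date <- as.Date(d$date, format = "%m/%d/%y")
  d[order(d$date), ]
}
a <- to_series(file_a)
b <- to_series(file_b)

# With Date values on the x axis, spacing follows the calendar, not row order.
plot_series <- function(a, b) {
  plot(a$date, a$value, type = "b",
       xlim = range(c(a$date, b$date)),
       ylim = range(c(a$value, b$value)),
       xlab = "Date", ylab = "Value")
  lines(b$date, b$value, type = "b", lty = 2)
}
if (interactive()) plot_series(a, b)
```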

How to download intraday stock market data with R

All,
I'm looking to download stock data either from Yahoo or Google on 15 - 60 minute intervals for as much history as I can get. I've come up with a crude solution as follows:
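library(RCurl)
# Fetch the raw text: 15-minute bars (i=900 seconds) for AAPL
tmp <- getURL('https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL')
# One element per line of the response
tmp <- strsplit(tmp,'\n')
tmp <- tmp[[1]]
# Drop the first 8 lines of the response
tmp <- tmp[-c(1:8)]
# Split each line into fields and stack them into a character matrix
tmp <- strsplit(tmp,',')
tmp <- do.call('rbind',tmp)
# Convert to numeric; fields that are not plain numbers become NA
tmp <- apply(tmp,2,as.numeric)
# Keep only rows without NA (use '!' here: '-' on a logical vector does not drop the intended rows)
tmp <- tmp[!apply(tmp,1,function(x) any(is.na(x))),]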
Given the amount of data I'm looking to import, I worry that this could be computationally expensive. I also can't, for the life of me, understand how the time stamps are coded in Yahoo and Google.
So my question is twofold--what's a simple, elegant way to quickly ingest data for a series of stocks into R, and how do I interpret the time stamping on the Google/Yahoo files that I would be using?
I will try to answer the timestamp question first. Please note this is my interpretation and I could be wrong.
Using the link in your example https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL I get the following data:
EXCHANGE%3DNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=900
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-300
a1357828200,528.5999,528.62,528.14,528.55,129259
1,522.63,528.72,522,528.6499,2054578
2,523.11,523.69,520.75,522.77,1422586
3,520.48,523.11,519.6501,523.09,1130409
4,518.28,520.579,517.86,520.34,1215466
5,518.8501,519.48,517.33,517.94,832100
6,518.685,520.22,518.63,518.85,565411
7,516.55,519.2,516.55,518.64,617281
...
...
Note the first value of the first column, a1357828200. My intuition was that this has something to do with POSIXct. Hence a quick check:
> as.POSIXct(1357828200, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 14:30:00 EST"
So my intuition seems to be correct, but the time seems to be off. There is one more piece of information in the data: TIMEZONE_OFFSET=-300. So if we offset our timestamps by this amount we should get:
as.POSIXct(1357828200-300*60, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 09:30:00 EST"
Note that I didn't know which day's data you had requested, but a quick check on Google Finance reveals that those were indeed the price levels on 10th Jan 2013.
The remaining values in the first column seem to be some sort of offset from the first row's value.
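A hedged sketch of how the first column could be decoded under that interpretation. It assumes that an a-prefixed value is a plain Unix timestamp (seconds since 1970-01-01 UTC), that every other value k means "k intervals after the most recent a-prefixed row", and that the first data row is always a-prefixed; the variable names are made up. Read this way, 1357828200 is 14:30 UTC, which is 09:30 in New York, so displaying the time in the exchange's zone gives the market open without subtracting TIMEZONE_OFFSET by hand.

```r
# A few data rows copied from the sample response above.
rows <- c("a1357828200,528.5999,528.62,528.14,528.55,129259",
          "1,522.63,528.72,522,528.6499,2054578",
          "2,523.11,523.69,520.75,522.77,1422586",
          "3,520.48,523.11,519.6501,523.09,1130409")
interval <- 900  # INTERVAL header value, in seconds

fields <- strsplit(rows, ",", fixed = TRUE)
stamp  <- vapply(fields, `[`, character(1), 1)
is_abs <- startsWith(stamp, "a")
num    <- as.numeric(sub("^a", "", stamp))

# Carry the most recent absolute stamp forward to the rows that follow it
# (assumes the first row is absolute, otherwise the lengths will not match).
base  <- num[is_abs][cumsum(is_abs)]
epoch <- ifelse(is_abs, num, base + num * interval)

# Build the times in UTC, then display them in the exchange's time zone.
when <- as.POSIXct(epoch, origin = "1970-01-01", tz = "UTC")
format(when, "%Y-%m-%d %H:%M", tz = "America/New_York")
```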
So downloading and standardizing the data ended up being much more of a bear than I figured it would: about 150 lines of code. The problem is that while Google provides the past 50 trading days of data for all exchange-traded stocks, the time stamps within the days are not standardized: an index of '1', for example, could refer to either the first or the second time increment on the first trading day in the data set. Even worse, stocks that trade at low volumes only have entries where a transaction is recorded. For a high-volume stock like AAPL that's no problem, but for low-volume small caps it means that your series will be missing much, if not the majority, of the data. This was problematic because I need all the stock series to lie neatly on top of each other for the analysis I'm doing.
Fortunately, there is still a general structure to the data. Using this link:
https://www.google.com/finance/getprices?i=1800&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL
and changing the stock ticker at the end will give you the past 50 trading days in half-hourly increments. POSIX time stamps, very helpfully decoded by #geektrader, appear in the timestamp column at 3-week intervals. Though the timestamp indexes don't invariably correspond in a convenient 1:1 manner (I almost suspect this was intentional on Google's part), there is a pattern. For example, for the half-hourly series that I looked at, the first trading day of every three-week increment uniformly has timestamp indexes running in the 1:15 neighborhood. This could be 1:13, 1:14, 2:15; it all depends on the stock. I'm not sure what the 14th and 15th entries are: I suspect they are either daily summaries or after-hours trading info. The point is that there's no consistent pattern you can bank on.
The first stamp in a trading day, sadly, does not always contain the opening data. The same goes for the last entry and the closing data. I found that the only way to know what actually represents the trading data is to compare the numbers to the series on Google Finance.
After days of futilely trying to figure out how to pry a 1:1 mapping pattern from the data, I settled on a "ballpark" strategy. I scraped AAPL's data (a very high-volume traded stock) and set its timestamp indexes within each trading day as the reference values for the entire market. All days had a minimum of 13 increments, corresponding to the 6.5-hour trading day, but some had 14 or 15. Where this was the case I just truncated by taking the first 13 indexes. From there I used a while loop to progress through the downloaded data of each stock ticker and compare its timestamp indexes within a given trading day to the AAPL timestamps. I kept the overlap, gap-filled the missing data, and cut out the non-overlapping portions.
Sounds like a simple fix, but for low-volume stocks with sparse transaction data there were literally dozens of special cases that I had to bake in and lots of data to interpolate. I got some pretty bizarre results for some of these that I know are incorrect. For high-volume, mid- and large-cap stocks, however, the solution worked brilliantly: for the most part the series synced up very neatly with the AAPL data and matched their Google Finance profiles perfectly.
There's no way around the fact that this method introduces some error, and I still need to fine-tune the method for sparse small-caps. That said, shifting a series by a half hour or gap-filling a single time increment introduces a very minor amount of error relative to the overall movement of the market and the stock. I am confident that this data set I have is "good enough" to allow me to get relevant answers to some questions that I have. Getting this stuff commercially costs literally thousands of dollars.
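For readers who want the gist without the 150 lines, here is a compact sketch of the "keep the overlap, gap-fill, cut the rest" step for one trading day. It is not the original code: it uses merge() instead of a while loop, fills gaps by carrying the last observed value forward (one possible choice, the post does not say which fill was used), and all names and numbers are invented.

```r
# Reference increments within one trading day (13 half-hours, as for AAPL).
ref_idx <- 1:13

# A sparse low-volume stock: missing increments plus one index outside the grid.
stock <- data.frame(idx   = c(2, 3, 6, 7, 12, 14),
                    close = c(10.1, 10.2, 10.0, 9.9, 10.4, 10.5))

# Left join onto the reference grid: overlapping indexes are kept, missing
# ones become NA, and non-overlapping ones (idx 14) are dropped.
aligned <- merge(data.frame(idx = ref_idx), stock, by = "idx", all.x = TRUE)

# Last observation carried forward; leading gaps stay NA.
locf <- function(x) {
  seen <- !is.na(x)
  c(NA, x[seen])[cumsum(seen) + 1]
}
aligned$close <- locf(aligned$close)
aligned
```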
Thoughts or suggestions?
Why not load the data from Quandl? E.g.
library(Quandl)
Quandl('YAHOO/AAPL')
Update: sorry, I have just realized that only daily data is fetched with Quandl - but I leave my answer here as Quandl is really easy to query in similar cases
For the timezone offset, try:
as.POSIXct(1357828200, origin = '1970-01-01', tz=Sys.timezone(location = TRUE))
(The tz will automatically adjust according to your location)

Plotting hundreds of hours of data with gnuplot

I am trying to plot data from a simulation that tracks simulation time in (hours):(minutes):(seconds) format, but does not turn (hours) into days - so (hours) can be in the hundreds. When gnuplot plots data by time, however ("set xdata time"), it only plots up to 99 hours in one continuous plot; after that, it loops back around and starts overplotting hour 100+ near the beginning (and even then, does weird stuff). Does anyone know why this happens and/or how to get around it?
I also looked into reading the components of the time column (which is the 3rd field of data on each line, but not necessarily a fixed number of characters into the line) in as 3 simple numbers (integers), then converting to a real number, which happens to be a decimal version of the time (e.g., 107:45:00 -> 107.75), which would be fine for the plot, but I haven't been able to figure out how to get gnuplot to do that, either.
Any other ideas are welcome. (I would rather not alter the original file, due to the additional complexity of multiple versions of each file, having to teach others how to convert the file and how to figure out the plot didn't work because they didn't convert the file, etc.)
Version 2 of MathGL (a GPL plotting library) has time ticks which can be set as you want (using the standard strftime() format). However, it is currently in beta; a stable version should appear in October 2011.
