Plotting Timeseries with map() from purrr

I am trying to generate a plot that is similar to this:
A walkthrough is provided here -> https://medium.com/@erickramer/beautiful-data-science-with-functional-programming-and-r-a3f72059500b
However, the code supplied there isn't generating a plot for me; instead I get this error:
> forecasts1 = tsdf %>%
+ map(auto.arima) %>%
+ map(forecast, h=10)
Error in is.constant(x) :
(list) object cannot be coerced to type 'double'
This is despite the fact that I have replicated their data formatting precisely. Here are our datasets for comparison:
> str(tsdf)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 89 obs. of 1 variable:
$ time_series:List of 89
..$ 1_1 : Time-Series from 2013 to 2017: 8981338 10707490 11410597 10816217 12263765 ...
..$ 1_10 : Time-Series from 2013 to 2017: 12645212 13510638 13133558 13542970 16074675 ...
..$ 1_2 : Time-Series from 2013 to 2017: 19028892 20626896 19952328 20865263 22547313 ...
..$ 1_3 : Time-Series from 2013 to 2017: 7081624 8317481 8374427 8330653 9643845 ...
..$ 1_4 : Time-Series from 2013 to 2017: 25421637 30934941 30756101 27977317 32417608 ...
And the provided example data (upon which the code did work, according to the website):
> str(time_series)
List of 9
$ Germany : Time-Series [1:52] from 1960 to 2011: 684721 716424 749838 ...
$ Singapore : Time-Series [1:52] from 1960 to 2011: 7208 7795 8349 ...
$ Finland : Time-Series [1:37] from 1975 to 2011: 85842 86137 86344 ...
I can't seem to figure it out, though it may have something to do with the fact that their time series share a single common endpoint, while my time series end at several different months.
Any help with this is greatly appreciated!
* UPDATE *
After applying akrun's suggestion I kept only the time_series list column, like so:
tsdf <- akrun %>%
select(time_series)
I then fit the model like this:
tsdf$time_series %>% map(auto.arima) %>%
map(forecast, h=12)
...and then the plot...
... looks awful.
Do I need to rescale the y-axis? Or apply some sort of differencing to the data before fitting the ARIMA models? I really appreciate any suggestions!
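The original error, by the way, comes from how map() iterates over a data frame: like lapply(), it loops over columns, so auto.arima() is handed the entire list column at once rather than each series in turn. A minimal base-R sketch of the difference (using lapply() in place of map(), with made-up series):

```r
# A data frame with a single list column of time series, mimicking tsdf above
tsdf <- data.frame(row.names = 1:2)
tsdf$time_series <- list(
  ts(rnorm(48), start = 2013, frequency = 12),
  ts(rnorm(48), start = 2013, frequency = 12)
)

# Iterating over the data frame yields ONE element: the whole list column.
# This is what auto.arima() receives, hence
# "(list) object cannot be coerced to type 'double'".
length(lapply(tsdf, identity))            # 1

# Iterating over the list column yields one element per series
sapply(tsdf$time_series, is.ts)           # TRUE TRUE
```

So `tsdf$time_series %>% map(auto.arima)` (or `tsdf %>% pull(time_series) %>% map(auto.arima)`) works where `tsdf %>% map(auto.arima)` does not.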

Related

Read cell values without formatting into R with googlesheets

Would like to be able to read Google Sheets cell values into R with googlesheets package, but without any cell formatting applied (e.g. comma separators, percentage conversion, etc.).
Have tried gs_read() without specifying a range, which uses gs_read_csv(), which will "request the data from the Sheets API via the exportcsv link". Can't find a way to tell it to provide underlying cell value without formatting applied.
Similarly, tried gs_read() and specifying a range, which uses gs_read_cellfeed(). But can't find a way to indicate that I want un-formatted cell values.
Note: I'm not after the formulas in any cells, just the values without any formatting applied.
Example:
(it looks like I'm not able to post images)
Here's a screenshot of an example Google Sheet:
https://www.dropbox.com/s/qff05u8nn3do33n/Screenshot%202015-07-26%2008.42.58.png?dl=0
First and third columns are numeric with no formatting applied, 2nd column applies comma separators for thousands, 4th column applies percentage formatting.
Reading this sheet with the following code:
library(googlesheets)
gs <- gs_title("GoogleSheets Test")
ws <- gs_read(gs, ws = "Sheet1")
yields:
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 4 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
Would like to be able to read a worksheet that has formatting applied (like columns 2 and 4), but read the unformatted values (like columns 1 and 3).
At this point, I think your best bet is to fix the imported data like so:
> ws$Number_fixed <- type.convert(gsub(',', '', ws$Number_wFormat))
> ws$Percent_fixed <- type.convert(gsub('%', '', ws$Percent_wFormat)) / 100
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 6 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
$ Number_fixed : int 123456 123457 123458
$ Percent_fixed : num 0.123 0.234 0.346
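The same clean-up can be wrapped in a small base-R helper; the function name strip_gs_format is just illustrative, and it assumes the only formatting in play is thousands separators and percent signs:

```r
# Hypothetical helper: strip thousands separators and percent signs,
# convert to numeric, and rescale percentages to fractions
strip_gs_format <- function(x) {
  is_pct <- grepl("%", x)
  num <- as.numeric(gsub("[,%]", "", x))
  ifelse(is_pct, num / 100, num)
}

strip_gs_format(c("123,456", "12.34%"))  # returns 123456 and 0.1234
```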
I had some hope that post-processing with functions from readr would be a decent answer, but it looks like percentages and "currency" style numbers are open issues there too.
I have opened an issue to solve this better in googlesheets, one way or another.

Hidden Markov Model in R - Predict the next observation with RHmm

This is my first post on StackOverflow and I could use a little help... Please forgive me if I am not following the correct posting protocols.
There is another question on StackOverflow on which I am heavily basing my work, but I can't quite figure out how to adapt the code. Most importantly, I am looking at the solution provided to that question.
Here is the link:
Getting the next observation from a HMM gaussian mixture distribution
Some background:
RHmm - version 2.1.0 downloaded from R Forge.
RStudio - 0.98.953
R - 3.0.2 32 bit
I am trying to figure out the following issues with my code:
How do I amend the solution from the link above (prediction of the next observation) to work with my Baum-Welch model?
Ex. hm_model <- HMMFit(obs=TWII_Train, nStates=5)
The R/RStudio session aborts when I run the mixture version, hm_model <- HMMFit(obs=TWII_Train, dis="MIXTURE", nStates=5, nMixt=4). Can you recreate the error and propose a workaround?
Here is my R code:
library(quantmod)
library(RHmm)
getSymbols("^TWII")
TWII_Subset <- window(TWII, start=as.Date("2012-01-01"), end = as.Date("2013-04-01"))
TWII_Train <- cbind(TWII_Subset$TWII.Close - TWII_Subset$TWII.Open,
                    TWII_Subset$TWII.Volume)
hm_model <- HMMFit(obs=TWII_Train, nStates=5)
VitPath <- viterbi(hm_model, TWII_Train)
I'm not a user of this package and this is not really an answer, but a comment would obscure some of the structures. It appears that the "proportion" value of your model is missing (so the structures are different). The "mean" value looks like this:
$ mean :List of 5
..$ : num [1:2] 6.72 3.34e+06
..$ : num [1:2] -12.4 2420174.5
..$ : num [1:2] -2.4 1832546.5
..$ : num [1:2] -10.4 1432636.1
..$ : num [1:2] 5.02 1.96e+06
I also suspect that you should be using 2 and 5 rather than 4 and 5 for m and n. Look at the rest of the model with:
str(hm_model)

Reducing data in data frame to plot data in R

I'm very new to programming so I apologise in advance for my lack of R know-how. I'm a PhD student interested in pupillometry, and I have just recorded the pupil response of participants performing a listening task in two conditions (Easy and Hard). The pupil response interest period for each trial is around 20 seconds, and I would like to be able to plot this data for each participant in R. The eyetracker sampling rate is 1000 Hz and each participant completed 92 trials, so the data that I currently have for each participant runs to close to 2 million rows. I have tried to plot this using ggplot2 but, as expected, the graph is very cluttered.
I've been trying to work out a way of reducing the data so that I can plot it on R. Ideally, I would like to take the mean pupil size value for every 1000 samples (i.e. 1 second of recording) averaged across all 92 trials for each participant. With this information, I would then create a new dataframe for plotting the average slope from 1-20 seconds for the two listening conditions (Easy and Hard).
Here is the current structure of my data frame;
> str(ppt53data)
'data.frame': 1915391 obs. of 6 variables:
$ RECORDING_SESSION_LABEL: Factor w/ 1 level "ppt53": 1 1 1 1 1 1 1 1 1 1 ...
$ listening_condition : Factor w/ 2 levels "Easy","Hard": 2 2 2 2 2 2 2 2 2 2 ...
$ RIGHT_PUPIL_SIZE : Factor w/ 3690 levels ".","0.00","1000.00",..: 3266 3264 3263 3262 3262 3260 3257 3254 3252 3252 ...
$ TIMESTAMP : num 262587 262588 262589 262590 262591 ...
$ TRIAL_START_TIME : int 262587 262587 262587 262587 262587 262587 262587 262587 262587 262587 ...
$ TrialTime : num 0 1 2 3 4 5 6 7 8 9 ...
- attr(*, "na.action")=Class 'omit' Named int [1:278344] 873 874 875 876 877 878 879 880 881 882 ...
.. ..- attr(*, "names")= chr [1:278344] "873" "874" "875" "876" ...
The 'TrialTime' variable specifies the sample (i.e. millisecond) in each trial. Can anyone advise me about which step I should take next? I figure it would make sense to arrange my data into separate data frames which would allow me to calculate the mean values that I want (across trials and for every 1000 samples). However, I'm not sure what is the most efficient/best way of doing this.
I'm sorry that I can't be any more specific. Any rough guidance would be much appreciated.
I think for such a large block of data with many aggregation levels you will need to use data.table. I may have mis-structured your data, but hopefully this will give you the idea:
require(data.table)
require(ggplot2)
# 100 patients * 20,000 observations (1-20,000 ms) = 2,000,000 rows
ppt53data <- data.frame(
  RECORDING_SESSION_LABEL = paste0("pat-", rep(1:100, each = 20000)), # patients
  listening_condition = sample(c("Easy", "Hard"), 2000000, replace = TRUE), # Easy/Hard
  RIGHT_PUPIL_SIZE = rnorm(2000000, 3000, 500), # pupil size
  TrialTime = rep(1:20000, 100) # ms from start
)
# group in 1000ms blocks
ppt53data$group<-cut(ppt53data$TrialTime,c(0,seq(1000,20000,1000),Inf))
unique(ppt53data$group)
#convert frame to table
dt.ppt53data<-data.table(ppt53data)
#index
setkey(dt.ppt53data, RECORDING_SESSION_LABEL, group)
#create data.frame of aggregated plot data
plot.data<-data.frame(dt.ppt53data[,list(RIGHT_PUPIL_SIZE=mean(RIGHT_PUPIL_SIZE)),by=list(group)])
#plot with ggplot2
ggplot(plot.data) +
  geom_bar(aes(group, RIGHT_PUPIL_SIZE, fill = group), stat = "identity") +
  theme(axis.text.x = element_text(angle = -90)) +
  coord_cartesian(ylim = c(2995, 3005))
Some rough guidance:
library(plyr)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(RIGHT_PUPIL_SIZE))
This tells it to calculate the mean size of the right pupil for each unique TrialTime. Perhaps seeing how this works would help you figure out how to describe what you need?
Assuming that within each TrialTime there are more than 1000 observations, you can randomly select:
set.seed(42)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(sample(RIGHT_PUPIL_SIZE,1000)))
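For reference, the 1-second binning you describe can also be done in base R with integer division and aggregate(); a small sketch on simulated data (column names follow the question, the values are made up):

```r
set.seed(1)
# Simulated slice: 3 trials x 3000 ms, two listening conditions
ppt <- data.frame(
  listening_condition = rep(c("Easy", "Hard", "Easy"), each = 3000),
  RIGHT_PUPIL_SIZE    = rnorm(9000, 3000, 500),
  TrialTime           = rep(0:2999, 3)
)

# 1000-ms bins: ms 0-999 -> second 1, 1000-1999 -> second 2, ...
ppt$second <- ppt$TrialTime %/% 1000 + 1

# Mean pupil size per second and condition, ready for plotting
agg <- aggregate(RIGHT_PUPIL_SIZE ~ second + listening_condition,
                 data = ppt, FUN = mean)
agg  # 6 rows: 3 seconds x 2 conditions
```

The resulting small data frame can then be passed straight to ggplot2 with `second` on the x-axis and one line per condition.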

Regarding creating a data set in accordance with a given data format

I am learning to use the topicmodels package (and R as well), and explored one of its example data sets by using
str(testdata)
'data.frame': 3104 obs. of 5 variables:
$ Article_ID: int 41246 41257 41268 41279 41290 41302 41314 41333 41344 41355 ...
$ Date : chr "1-Jan-96" "2-Jan-96" "3-Jan-96" "4-Jan-96" ...
$ Title : chr "Nation's Smaller Jails Struggle To Cope With Surge in Inmates" "FEDERAL IMPASSE SADDLING STATES WITH INDECISION" "Long, Costly Prelude Does Little To Alter Plot of Presidential Race" "Top Leader of the Bosnian Serbs Now Under Attack From Within" ...
$ Subject : chr "Jails overwhelmed with hardened criminals" "Federal budget impasse affect on states" "Contenders for 1996 Presedential elections" "Bosnian Serb leader criticized from within" ...
$ Topic.Code: int 12 20 20 19 1 19 1 1 20 15 ...
If I want to create a data set according to the above format in R, how to do that?
testdata is a data.frame, one of the fundamental R objects. You should probably start here: http://cran.r-project.org/doc/manuals/R-intro.pdf.
Some functions for creating data.frames are data.frame, read.table, read.csv. For each of these you can access their documentation by typing ?data.frame for example. Good luck.
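For example, a two-row data frame matching the layout in the question could be built like this (the titles and subjects here are placeholders, not the real data set):

```r
# Minimal data.frame in the same shape as str(testdata) above
testdata <- data.frame(
  Article_ID = c(41246L, 41257L),
  Date       = c("1-Jan-96", "2-Jan-96"),
  Title      = c("Some headline", "Another headline"),
  Subject    = c("Short subject line", "Another subject line"),
  Topic.Code = c(12L, 20L),
  stringsAsFactors = FALSE  # keep the text columns as chr, as in the original
)
str(testdata)  # 'data.frame': 2 obs. of 5 variables
```

To load a larger table from a file in this shape, read.csv(..., stringsAsFactors = FALSE) produces the same column types.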

R: xts object - determine the date of the first data point

I have a large number of time series variables (stock prices) on which I want to perform various analytics. The problem is that not all variables have the same number of prices in the date range I am interested in, because some stocks came into existence at different points in time.
As such, I am trying to return the date of the first data element in each of the xts variables, but my current way of doing it is very ugly. I was wondering if there is a function I could call to return that date via some sort of indexing.
i.e
> str(IBM)
An ‘xts’ object from 2004-01-02 to 2011-04-25 containing:
Data: num [1:1841, 1] 25.1 25.6 25.6 25.3 25.4 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "IBM.Adjusted"
Indexed by objects of class: [Date] TZ:
xts Attributes:
List of 2
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2011-04-26 14:35:02"
I am looking for a clean way to grab 2004-01-02 from the above object for example.
I appreciate the help. Thank you.
I imagine this would work:
min(index(IBM))
You can use the start function:
> library(quantmod)
> getSymbols("IBM")
[1] "IBM"
> start(IBM)
[1] "2007-01-03"
