Plotting time series in R

I am working with data, 1st two columns are dates, 3rd column is symbol, and 4th and 5th columns are prices.
So, I created a subset of the data as follows:
test.sub<-subset(test,V3=="GOOG",select=c(V1,V4))
and then I try to plot a time series chart using the following
as.ts(test.sub)
plot(test.sub)
well, it gives me a scatter plot - not what I was looking for.
so, I tried plot(test.sub[1],test.sub[2])
and now I get the following error:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
To make sure the no. of rows were same, I ran nrow(test.sub[1]) and nrow(test.sub[2]) and they both return equal rows, so as a newcomer to R, I am not sure what the fix is.
I also ran plot.ts(test.sub) and that works, but it doesn't show me the dates in the x-axis, which it was doing with plot(test.sub) and which is what I would like to see.
test.sub[1]
V1
1107 2011-Aug-24
1206 2011-Aug-25
1307 2011-Aug-26
1408 2011-Aug-29
1510 2011-Aug-30
1613 2011-Aug-31
1718 2011-Sep-01
1823 2011-Sep-02
1929 2011-Sep-06
2035 2011-Sep-07
2143 2011-Sep-08
2251 2011-Sep-09
2359 2011-Sep-13
2470 2011-Sep-14
2581 2011-Sep-15
2692 2011-Sep-16
2785 2011-Sep-19
2869 2011-Sep-20
2965 2011-Sep-21
3062 2011-Sep-22
3160 2011-Sep-23
3258 2011-Sep-26
3356 2011-Sep-27
3455 2011-Sep-28
3555 2011-Sep-29
3655 2011-Sep-30
3755 2011-Oct-03
3856 2011-Oct-04
3957 2011-Oct-05
4059 2011-Oct-06
4164 2011-Oct-07
4269 2011-Oct-10
4374 2011-Oct-11
4479 2011-Oct-12
4584 2011-Oct-13
4689 2011-Oct-14
str(test.sub)
'data.frame': 35 obs. of 2 variables:
$ V1:Class 'Date' num [1:35] NA NA NA NA NA NA NA NA NA NA ...
$ V4: num 0.475 0.452 0.423 0.418 0.403 ...
head(test.sub)
V1 V4
1212 <NA> 0.474697
1313 <NA> 0.451907
1414 <NA> 0.423184
1516 <NA> 0.417709
1620 <NA> 0.402966
1725 <NA> 0.414264
Now that this is working, I'd like to add a 3rd variable to plot a 3d chart - any suggestions how I can do that. thx!

So I think there are a few things going on here that are worth talking through:
first, some example data:
test <- data.frame(End = Sys.Date()+1:5,
Start = Sys.Date()+0:4,
tck = rep("GOOG",5),
EndP= 1:5,
StartP= 0:4)
test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
First, note that test and test.sub are both data frames, so calls like test.sub[1] don't really "mean" anything to R.** It's more R-ish to write test.sub[,1] by virtue of consistency with other R structures. If you compare the results of str(test.sub[1]) and str(test.sub[,1]) you'll see that R treats them slightly differently.
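A minimal sketch of that difference, on made-up data (the dates and values below are placeholders, not the asker's data). It also hints at where the "lengths differ" error comes from: the length of a data frame is its number of columns, not its number of rows.

```r
# Placeholder data shaped like test.sub above
test.sub <- data.frame(End = as.Date("2011-10-19") + 0:4, EndP = 1:5)

one_col_df  <- test.sub[1]    # single bracket, one index: still a data frame
as_vector   <- test.sub[, 1]  # row/column indexing: the column itself (a Date vector)
also_vector <- test.sub[[1]]  # double bracket: the same vector

class(one_col_df)   # "data.frame"
class(as_vector)    # "Date"
length(one_col_df)  # 1, the number of columns
length(as_vector)   # 5, the number of observations
```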
You said you typed:
as.ts(test.sub)
plot(test.sub)
I'd guess you have extensive experience with some sort of OO-language; and while R does have some OO flavor to it, it doesn't apply here. Rather than transforming test.sub to something of class ts, this just does the transformation and throws it away, then moves on to plot the data frame you started with. It's an easy fix though:
test.sub.ts <- as.ts(test.sub)
plot(test.sub.ts)
But, this probably isn't what you were looking for either. Rather, R creates a time series that has two variables called "End" (which is the date now coerced to an integer) and "EndP". Funny business like this is part of the reason time series packages like zoo and xts have caught on so I'll detail them instead a little further down.
(Unfortunately, to the best of my understanding, R doesn't keep date stamps with its default ts class, choosing instead to keep start and end dates as well as a frequency. For more general time series work, this is rarely flexible enough)
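Here is a rough sketch of both points on placeholder data (df and its values are made up for illustration): calling as.ts() without assigning the result leaves the original object untouched, and the resulting ts object only remembers a start, an end and a frequency rather than the dates themselves.

```r
# Placeholder data: five consecutive dates and a price-like column
df <- data.frame(End = as.Date("2011-10-19") + 0:4, EndP = 1:5)

invisible(as.ts(df))  # the converted object is created and then discarded
class(df)             # still "data.frame"

df_ts <- as.ts(df)    # keep the result by assigning it
is.ts(df_ts)          # TRUE
tsp(df_ts)            # start, end, frequency: the only time information kept
```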
You could perhaps get what you wanted by typing
plot(test.sub[,1], test.sub[,2])
instead of
plot(test.sub[1], test.sub[2])
since the latter runs into trouble because you are passing two one-column data frames instead of two vectors (even though it looks like you are passing vectors).***
Anyways, with xts (and similarly for zoo):
library(xts) # You may need to install this
xtemp <- xts(test.sub[,2], test.sub[,1]) # Create the xts object
plot(xtemp)
# Dispatches a xts plot method which does all sorts of nice time series things
Hope some of this helps and sorry for the inline code that's not identified as such: still getting used to stack overflow.
Michael
**In reality, they access the lists that are used to structure a data frame internally, but that's more a code nuance than something worth relying on.
***The nitty-gritty is that when you pass plot(test.sub[1], test.sub[2]) to R, it dispatches the method plot.data.frame which takes a single data frame and tries to interpret the second data frame as an additional plot parameter which gets misinterpreted somewhere way down the line, giving your error.

The reason that you get the Error about different x and y lengths is immediately apparent if you do a traceback immediately upon raising the error:
> plot(test.sub[1],test.sub[2])
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
> traceback()
6: stop("'x' and 'y' lengths differ")
5: xy.coords(x, y, xlabel, ylabel, log)
4: plot.default(x1, ...)
3: plot(x1, ...)
2: plot.data.frame(test.sub[1], test.sub[2])
1: plot(test.sub[1], test.sub[2])
The problems in your call are manifold. First, as mentioned by @mweylandt, test.sub[1] is a data frame with a single component, not a vector comprising the contents of the first component of test.sub.
From the traceback, we see that the plot.data.frame method was called. R is quite happy to plot a data frame as long as it has at least two columns. R took you at your word and passed test.sub[1] (as a data.frame) on to plot() - test.sub[2] never gets a look in. test.sub[1] is eventually passed on to xy.coords() which correctly informs you that you have lots of rows for x but 0 rows for y because test.sub[1] only contains a single component.
It would have worked if you'd done plot(test.sub[,1], test.sub[,2], type = "l") or used the formula interface to name the variables plot(V4 ~ V1, data = test.sub, type = "l") as I show in my other Answer.

Surely it is easier to use the formula interface:
> test <- data.frame(End = Sys.Date()+1:5,
+ Start = Sys.Date()+0:4,
+ tck = rep("GOOG",5),
+ EndP= 1:5,
+ StartP= 0:4)
>
> test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
> head(test.sub)
End EndP
1 2011-10-19 1
2 2011-10-20 2
3 2011-10-21 3
4 2011-10-22 4
5 2011-10-23 5
> plot(EndP ~ End, data = test.sub, type = "l")
I work extensively with time series type data and rarely, if ever, have any need for the "ts" class of objects. Packages zoo and xts are very useful, but if all you want to do is plot the data, i) get the date/time information correctly formatted/set-up as a "Date" or "POSIXt" class object, and then ii) just plot it using standard graphics and type = "l" (or type = "b" or type = "o" if you want to see the observation times).
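For the data in the question, step i) probably matters most: the str() output there shows V1 as all NA, which suggests the "2011-Aug-24" strings were converted without a matching format. A sketch, assuming the dates start out as character strings in that format and that an English locale is in use (%b matches the abbreviated month name in the current locale); the five rows below are illustrative only.

```r
# Illustrative rows in the format shown in the question; the V4 values are
# copied from the str() output there and paired with dates only for demonstration.
test.sub <- data.frame(
  V1 = c("2011-Aug-24", "2011-Aug-25", "2011-Aug-26", "2011-Aug-29", "2011-Aug-30"),
  V4 = c(0.475, 0.452, 0.423, 0.418, 0.403),
  stringsAsFactors = FALSE
)

# %Y = four-digit year, %b = abbreviated month name, %d = day of month
test.sub$V1 <- as.Date(test.sub$V1, format = "%Y-%b-%d")

# With a proper Date column, the formula interface draws dates on the x-axis
plot(V4 ~ V1, data = test.sub, type = "l", xlab = "Date", ylab = "Price")
```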

Is there a way to import the results of HSD.test from agricolae directly into geom_text() in a ggplot2?

I'm creating figures that show the efficacy of several warning signals relative to the event they warn about. The figure is based off a dataframe which is produced by a function that runs a model multiple times and collates the results like this:
t type label early
4 847 alarm alarm.1 41
2 849 alarm alarm.2 39
6 853 alarm alarm.3 35
5 923 alarm alarm.4 -35
7 1003 alarm alarm.5 -115
But with a dozen alarms and a value for each alarm n times (typically 20 - 100), with each value being slightly different depending on random and stochastic variables built into the model.
I'm putting the results in an lm
a.lm <- lm(log(early + 500) ~ label, data = alarm.data)
and after checking the assumptions are met, running a 1 way anova
anova(a.lm)
then a tukey post hoc test
HSD.test(a.lm, trt = "label", console = TRUE)
Which produces
log(early + 500) groups
alarm.1 6.031453 a
alarm.2 6.015221 a
alarm.3 6.008366 b
alarm.4 5.995150 b
alarm.5 5.921384 c
I have a function which generates a ggplot2 figure based on the collated data, to which I am then manually adding + geom_text(label = c("a", "a", "b", "b", "c")) or whatever the appropriate letters are. Is there a way to generalise that last step by pulling the letters directly from the result of HSD.test? If I put the results of HSD.test into an object
a.test <- HSD.test(a.lm, trt = "label", console = TRUE)
I can call the results using a.test$groups and calling the letter groupings specifically using a.test$groups$groups but I don't know enough about manipulating lists to make that useful to me. Whilst the order of the labels in the ggplot is predictable, the order of the groups in the HSD.test result isn't and can vary a lot between iterations of the model running function.
If anyone has any insights I would be grateful.
Okay I actually bumped into a solution just after I posted the question.
If you take the output of the HSD.test and make it into an object
a.test <- HSD.test(ram.lm, trt = "label")
Then convert the groups list into a dataframe
a.df <- as.data.frame(a.test$groups)
The row index is the alarm names rather than numbers
a.df
log(early + 500) groups
alarm.1 6.849082 a
alarm.2 6.842465 a
alarm.3 6.837438 a
alarm.4 6.836437 a
alarm.5 6.812714 a
so they can be called specifically into geom_text inside the function
a.plot +
  geom_text(label = c(a.df["alarm.1", 2],
                      a.df["alarm.2", 2],
                      a.df["alarm.3", 2],
                      a.df["alarm.4", 2],
                      a.df["alarm.5", 2]))
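Because the row names carry the alarm labels, the same lookup can be done in one step by indexing with a vector of names in plot order. The sketch below uses a hand-made stand-in for a.test$groups (its values and the name plot_order are hypothetical) so that it runs without agricolae.

```r
# Stand-in for as.data.frame(a.test$groups); rows are deliberately not in
# plot order, as happens when the output is sorted by mean.
a.df <- data.frame(
  means  = c(6.03, 6.02, 6.01, 6.00, 5.92),
  groups = c("a", "a", "b", "b", "c"),
  row.names = c("alarm.3", "alarm.1", "alarm.5", "alarm.2", "alarm.4"),
  stringsAsFactors = FALSE
)

# Hypothetical: the order in which the labels appear in the ggplot
plot_order <- paste0("alarm.", 1:5)

# Row-name indexing returns the letters in plot order, whatever the row order is
letters_in_plot_order <- as.character(a.df[plot_order, "groups"])
letters_in_plot_order
```

In the real function, plot_order could plausibly be levels(alarm.data$label), assuming label is a factor whose level order matches the plot; the result can then be passed to geom_text(label = ...).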
Even though this doesn't use the same functions to get the compact letter display, I think it may be a more efficient way of doing it. (Make sure to unfold the code via the "Code" button above the ggplots.)

Changing multiple columns of a data frame from class 'character' to class 'time' using chron

I have a data frame with multiple columns, some of which I need to change to 'time' class using chron so that I can retrieve basic statistics. These columns are currently times stored as characters and formatted like this: hh:mm.
Here is a subset of it as well as the list of columns that need to change:
> Data
DATE FLT TYPE REG AC DEP ARR STD STA ATD ATA
1 15-01-02 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:00 13:59
2 15-01-04 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:17 14:13
3 15-01-05 953 J C-GCPT 73M YVQ YEV 12:00 12:55 13:20 14:14
Time_list <-c("STD","STA","ATD","ATA")
Here is what I have done to change only one column (and it works):
Data$ATA <- paste0(Data$ATA, ':00')
Data$ATA<-chron(times.=Data$ATA)
class(Data$ATA)
[1] "times"
However, I would prefer to be able to do all the columns at the same time since there are many of them. I've tried multiple techniques and some seem to work for the first part, which is pasting ':00', but it always goes wrong for the second part, using chron. I seem to have a length problem that I don't understand.
Using dmap
Data[,Time_list]<-
Data%>%
select(one_of(Time_list)) %>%
dmap(paste0,':00')
Data[,Time_list]<-
Data %>%
select(one_of(Time_list)) %>%
dmap(chron,times.=Data[,Time_list])
**Error in .f(.d[[i]], ...) :
.d[[i]] and Data[, Time_list] must have equal lengths**
Using apply
YEVdata[,(Time_list)] <- lapply(YEVdata[,(Time_list)], paste0,':00')
Data[,(Time_list)] <- lapply(Data[,(Time_list)], chron, times. =Data[,(Time_list)])
**Error in FUN(X[[i]], ...) :
X[[i]] and Data[, (Time_list)] must have equal lengths**
Using a forloop
I tried using a for loop, but I'm just a beginner and couldn't get anywhere.
Using "simple" solution from another Stack Overflow question.
It just made a mess, even pasting.
Efficiently transform multiple columns of a data frame
Any ideas in plain beginner language would be very appreciated! If it is possible to nest both operations, it would be even better!
dplyr::mutate_at would work for this situation. You define the variables you want to mutate and then define the function you want to use.
You can do the pasting and converting to a time in a single step within funs using the . notation and nesting functions.
library(chron)
library(dplyr)
Data = mutate_at(Data, Time_list, funs(chron(times. = paste0(., ":00"))))
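The same nesting idea also works with plain lapply, which may help explain the length error in the question: in lapply(Data[, Time_list], chron, times. = Data[, Time_list]) each column is passed as chron's first argument (dates.) while the whole four-column data frame is passed as times., so the lengths cannot match. A sketch on a cut-down copy of the example data, wrapping both steps in an anonymous function (note that funs() has been deprecated in newer dplyr releases, so a base-R version may be the more durable one):

```r
library(chron)

# Cut-down copy of the example data: only the time columns, stored as character
Data <- data.frame(
  STD = c("12:00", "12:00", "12:00"),
  STA = c("12:55", "12:55", "12:55"),
  ATD = c("13:00", "13:17", "13:20"),
  ATA = c("13:59", "14:13", "14:14"),
  stringsAsFactors = FALSE
)
Time_list <- c("STD", "STA", "ATD", "ATA")

# For each column: append the seconds, then convert to a chron "times" object
Data[Time_list] <- lapply(Data[Time_list],
                          function(x) chron(times. = paste0(x, ":00")))

sapply(Data[Time_list], class)
```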

Plot a histogram of subset of a data

The image shows the screenshot of the .txt file of the data.
The data consists of 2,075,259 rows and 9 columns
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?".
This is the code I was using to try to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code.
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
You need to provide the separator (";") explicitly, and your types aren't what you think they are. Observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
# use >= and <= so that both requested days are kept; column 3 is Global_active_power
my_data <- data[data$Date >= bottom.date & data$Date <= top.date,3]
hist(my_data)
This gives the histogram. Hope that helps.
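To try the approach without downloading the full file, here is a small sketch on an inline stand-in for the data (the five rows are made up; only the column names, the ";" separator and the "?" missing-value code follow the question).

```r
# Made-up rows mimicking the layout of household_power_consumption.txt
raw <- "Date;Time;Global_active_power
31/1/2007;23:59:00;1.2
1/2/2007;00:00:00;0.326
1/2/2007;00:01:00;?
2/2/2007;23:59:00;3.68
3/2/2007;00:00:00;3.6"

toy <- read.table(text = raw, header = TRUE, sep = ";",
                  na.strings = "?", stringsAsFactors = FALSE)
toy$Date <- as.Date(toy$Date, format = "%d/%m/%Y")

# Inclusive bounds keep both 2007-02-01 and 2007-02-02
keep  <- toy$Date >= as.Date("2007-02-01") & toy$Date <= as.Date("2007-02-02")
power <- toy$Global_active_power[keep]  # numeric vector, "?" already read as NA

hist(power)  # hist() ignores the NA
```

Passing a plain numeric vector to hist() rather than a one-column data frame such as my_data["Global_active_power"] is the other half of the fix.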
Given you have 2m rows (though not too many columns), you're firmly into fread territory;
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 ? ? ? ? ? ? NA
Once we set na.strings, we sidestep having to replace ? as NA later:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 NA NA NA NA NA NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.
4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

rolling Hurst exponent

I am pretty new to R and I have been trying to compute a rolling Hurst exponent to no avail. I have installed packages fArma (for the Hurst) and zoo (for the rollapply). The data is in a dataframe called 'data' and is variable 'returns'. The following code for the Hurst works great:
rsFit(data$returns, levels = 50, minnpts = 3, cut.off = 10^c(0.7, 2.5), doplot = FALSE, trace = FALSE, title = NULL, description = NULL)
Below is my attempt at a rolling Hurst exponent of window size 230, which generates an error.
rollapply(data$returns, 230, (rsFit(data$returns, levels = 50, minnpts = 3, cut.off = 10^c(0.7, 2.5), doplot = FALSE, trace = FALSE, title = NULL, description = NULL)))
Any help with the code would be much appreciated. I am trying to calculate the Hurst exponent over a 230 period window, that rolls forward 1 period at a time.
The data is;
returns
1 -0.002002003
2 -0.002006019
3 0.000000000
4 0.000000000
5 -0.009077218
6 -0.003044142
7 -0.002034589
8 0.004065046
9 0.002026343
10 0.001011634
11 0.001010612
12 0.000000000
13 -0.001010612
14 -0.001011634
15 0.003031837
16 -0.001009591
17 0.001009591
18 -0.002020203
I'm really not familiar with the fArma package or its functions, but I noticed a couple of major issues with your code.
You are trying to use rollapply incorrectly; specifically, your third argument, FUN=rsFit(data$returns), is a call to rsFit rather than the function itself. In general, if you are applying a (one-parameter) function foo on a data object x with rollapply, your function call should be rollapply(x,some_integer,foo).
So in your case, you would have
rollapply(data$returns,230,rsFit)
since it is acceptable to call rsFit with only one argument (the first one, x, as shown in the help file ?rsFit).
The width of 230 that you are specifying in rollapply is much too large - your sample data, data$returns, only has a length of 18 - the window size has to be less than the length of the data. One option is to use a smaller width: I tried a couple of small values (5,10,...) with your data, but this produced errors. Like I said, I'm not familiar with the functions in fArma, but I suspect rsFit requires more than 5 or 10 observations at a time. A better solution would be to use a larger sample of data, which is shown below.
Even having made the changes described above, you will encounter one more issue. From the Value section (i.e. return value) in ?rollapply:
"A object of the same class as data with the results of the rolling function."
Typically this is some type of simple object, e.g. a vector, matrix, etc... depending on your input. However, rsFit returns an S4 class fHURST object, which rollapply is apparently not able to deal with. This is not surprising, since fHURST objects have a fairly complex structure - try running str(rsFit(data$returns)) and note all of the various slots it contains. Basically, the simple solution for this is, instead of returning the entire fHURST object calculated in rollapply, to return just the component / slot that you need. Again, I've never used rsFit and I don't have the time to read into the theoretical underpinnings of Hurst exponents, but below I assumed you were mainly concerned with the estimated coefficient values occupying the @hurst slot of the fHURST objects.
As noted above, I made a toy data set that is much larger than 18 observations so that I could keep the width=230 in rollapply.
library(fArma)
library(zoo)
##
set.seed(123)
data2 <- rnorm(690)
##
data2.ra <- rollapply(data2,230,function(x){
  # S4 slots are accessed with @
  hSlot <- rsFit(x)@hurst
  result <- data.frame(
    H = hSlot$H,
    beta = hSlot$beta,
    Estimate.intercept = hSlot$diag[1,1],
    Estimate.X = hSlot$diag[2,1])
  result
})
##
> head(data2.ra)
H beta Estimate.intercept Estimate.X
1 0.6257476 0.6257476 -0.143363281 0.6257476
2 0.6627804 0.6627804 -0.193806373 0.6627804
3 0.6235309 0.6235309 -0.133828565 0.6235309
4 0.5683417 0.5683417 -0.055960572 0.5683417
5 0.5520769 0.5520769 -0.027270395 0.5520769
6 0.5334170 0.5334170 -0.003523383 0.5334170
> dim(data2.ra)
[1] 461 4
> 690 - (230-1)
[1] 461
The result has 461 rows, since applying rollapply with a window size k to an object of length n yields n - (k-1) windows. Of course, you can change the body of the anonymous function (function(x){...}) used in rollapply above to suit your needs.
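The mechanics can be checked with a much cheaper function than rsFit. The sketch below is only an illustration on random placeholder data: it passes a function (not the result of a call) to rollapply, returns one number per window, and confirms the n - (k-1) count.

```r
library(zoo)

set.seed(1)
x <- rnorm(20)  # placeholder data, n = 20
k <- 5          # window width

# FUN is a function that receives each window in turn
roll_sd <- rollapply(x, width = k, FUN = function(w) sd(w))

length(roll_sd)             # number of complete windows
length(x) - (k - 1)         # the same count, from the formula
```

Under the assumptions made in the answer above, swapping sd(w) for rsFit(w)@hurst$H should likewise give one Hurst estimate per window, provided each window holds enough observations for rsFit.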

Circular-linear regression with covariates in R

I have data showing when an animal came to a survey station. example csv file here The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function of program circular, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this thread about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with this data was to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information theoretics method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12. This is an operation that has a non-defined outcome:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So, it seems that you cannot use data of this class as input for the circular package. I could not find any mention of POSIXlt data in the examples. Maybe you need to specify the timestamps simply as a number, not as a POSIXlt object.
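One way to do that, sketched here as an assumption rather than a tested recipe for lm.circular, is to reduce each timestamp to decimal hours of the day and hand that number to circular(). The three timestamps are taken from the sample rows in the question; the time zone is set to UTC purely to keep the example reproducible.

```r
stamps <- as.POSIXct(c("6/12/2013 14:01:00",
                       "6/13/2013 1:21:00",
                       "6/15/2013 11:08:00"),
                     format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Decimal hour of day: a plain numeric vector, no POSIXt class attached
lt <- as.POSIXlt(stamps)
hours_decimal <- lt$hour + lt$min / 60 + lt$sec / 3600
hours_decimal

# Only runs if the circular package is installed
if (requireNamespace("circular", quietly = TRUE)) {
  y <- circular::circular(hours_decimal, units = "hours", template = "clock24")
  print(y)
}
```

This drops the calendar date, which seems acceptable when the question is about peak activity times within the day; the existing HourOfDay column is a coarser version of the same quantity.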
