I have a data frame called EWMA_SD252 with 3561 obs. of 102 variables (daily volatilities of 100 stocks since 2000). Here is a sample:
           Data     IBOV    ABEV3    AEDU3    ALLL3
3000 2012-02-09 16.88756 15.00696 33.46089 25.04788
3001 2012-02-10 18.72925 14.55346 32.72209 24.93913
3002 2012-02-13 20.87183 15.25370 31.91537 24.28962
3003 2012-02-14 20.60184 14.86653 31.04094 28.18687
3004 2012-02-15 20.07140 14.56653 37.45965 33.47379
3005 2012-02-16 19.99611 16.80995 37.36497 32.46208
3006 2012-02-17 19.39035 17.31730 38.85145 31.50452
What I am trying to do is, with a single command, subset an interval for a particular stock using date references and also plot a chart for the same interval. So far I was able to do the subset part, but I am stuck on plotting the chart. Here is what I have coded so far:
Getting the Date Interval and the stock name :
datas = function(x,y,z){
  intervalo_datas(as.Date(x,"%d/%m/%Y"),as.Date(y,"%d/%m/%Y"),z)
}
Subsetting the Data :
intervalo_datas <- function(x,y,z){
  cbind(as.data.frame(EWMA_SD252[,1]),as.data.frame(EWMA_SD252[,z]))[EWMA_SD252$Data >= x & EWMA_SD252$Data <= y,]
}
Now I am stuck. Is it possible, using a function, to get the ABEV3 data.frame and plot a chart with dates on the x-axis and volatility on the y-axis, using just the command below?
ABEV3 = datas("09/02/2012","17/02/2012","ABEV3")
I think you should use the xts package. It is well suited to:
manipulating time series, especially financial time series
subsetting time series
plotting time series
So I would create an xts object using your data. Then I wrap the subset/plot in a single function like what you tried to do.
library(xts)
# dat holds the question's data (EWMA_SD252), with the dates in column Data
dat_ts <- xts(dat[,-1],as.Date(dat$Data))
plot_data <-
  function(start,end,stock)
    plot(dat_ts[paste(start,end,sep='/'),stock])
You can call it like this :
plot_data('2012-02-09','2012-02-14','IBOV')
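To make the subsetting step concrete, here is a minimal sketch of the date-range indexing that plot_data relies on. The small dat data frame is a hypothetical stand-in built from the first rows of the question's sample, and the xts package is assumed to be installed.

```r
library(xts)

# Hypothetical stand-in for the question's data (first rows of the sample)
dat <- data.frame(
  Data  = as.Date(c("2012-02-09", "2012-02-10", "2012-02-13",
                    "2012-02-14", "2012-02-15")),
  IBOV  = c(16.88756, 18.72925, 20.87183, 20.60184, 20.07140),
  ABEV3 = c(15.00696, 14.55346, 15.25370, 14.86653, 14.56653)
)

# Values in the xts object, dates as its index
dat_ts <- xts(dat[, -1], order.by = dat$Data)

# "start/end" selects all rows whose index falls in that range, inclusive
window_ts <- dat_ts["2012-02-10/2012-02-14", "IBOV"]
print(window_ts)
```

The same "start/end" string is what paste(start, end, sep = '/') builds inside plot_data.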
You could use ggplot2 and reshape2 to make a function that automatically plots an arbitrary quantity of stocks:
plot_stocks <- function(data, date1, date2, stocks){
  require(ggplot2)
  require(reshape2)
  date1 <- as.Date(date1, "%d/%m/%Y")
  date2 <- as.Date(date2, "%d/%m/%Y")
  # inclusive bounds, so the start and end dates themselves are kept
  data <- data[data$Data >= date1 & data$Data <= date2, c("Data", stocks)]
  data <- melt(data, id="Data")
  names(data) <- c("Data", "Stock", "Value")
  ggplot(data, aes(Data, Value, color=Stock)) + geom_line()
}
Plotting one stock "ABEV3":
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", "ABEV3")
Plotting three stocks:
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", c("IBOV", "ABEV3", "AEDU3"))
You can further customize your function by adding other geoms, such as geom_smooth.
(I'm assuming your EWMA_SD252 data.frame's Data column is already Date class. Convert it if it's not already.)
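If the Data column is not yet a Date, the conversion might look like the sketch below. It assumes the dates are stored as "YYYY-MM-DD" strings, as in the printed sample. The small data frame is a hypothetical stand-in for the real one.

```r
# Hypothetical stand-in with the dates stored as character strings
EWMA_SD252 <- data.frame(
  Data  = c("2012-02-09", "2012-02-10", "2012-02-13"),
  IBOV  = c(16.88756, 18.72925, 20.87183),
  ABEV3 = c(15.00696, 14.55346, 15.25370),
  stringsAsFactors = FALSE
)

# Convert only if the column is not already of class Date
if (!inherits(EWMA_SD252$Data, "Date")) {
  EWMA_SD252$Data <- as.Date(EWMA_SD252$Data, format = "%Y-%m-%d")
}

class(EWMA_SD252$Data)
```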
It looks like you're trying to plot a particular column of your data.frame for a given date interval. It will be much easier for others to read your code (and you too in 6 months!) if you use variable names that are more descriptive than x, y, and z, e.g. date0, date1, column.
Let's rewrite your function. If EWMA_SD252 is already a data.frame, then you don't need to cbind individual columns of it into a data.frame. Giving a data argument makes things more flexible as well. All your datas function does is convert to Dates and call intervalo_datas, so we should wrap that up as well.
intervalo_datas <- function(date0, date1, column_name, data = EWMA_SD252) {
  # inherits() is base R; is.Date() would need the lubridate package
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  cols <- c(1, which(names(data) == column_name))
  # subset the data argument using the renamed date bounds
  return(data[data$Data >= date0 & data$Data <= date1, cols])
}
Now you should be able to get a subset this way
ABEV3 = intervalo_datas("09/02/2012", "17/02/2012", "ABEV3")
And plot like this.
plot(ABEV3[, 1], ABEV3[, 2])
If you want the subsetting function to also plot, just add the plot command before the return line (but define the subset first!). Using something like xts as agstudy recommends will simplify things and handle the dates better on the axis labels.
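As a rough sketch of that last suggestion, the function below subsets first, then plots, then returns the subset invisibly. The name intervalo_plot and the toy data are hypothetical, and the dates are assumed to be passed as "dd/mm/yyyy" strings as in the question.

```r
# Hypothetical stand-in for the real data frame
EWMA_SD252 <- data.frame(
  Data  = as.Date(c("2012-02-09", "2012-02-10", "2012-02-13", "2012-02-14")),
  ABEV3 = c(15.00696, 14.55346, 15.25370, 14.86653)
)

# intervalo_plot is a hypothetical name: subset, plot, and return the subset
intervalo_plot <- function(date0, date1, column_name, data = EWMA_SD252) {
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  keep <- data$Data >= date0 & data$Data <= date1
  subset_df <- data[keep, c("Data", column_name)]  # define the subset first
  plot(subset_df[[1]], subset_df[[2]], type = "l",
       xlab = "Date", ylab = column_name)
  invisible(subset_df)  # returned without printing
}

ABEV3 <- intervalo_plot("09/02/2012", "13/02/2012", "ABEV3")
```

Returning with invisible() keeps the assignment form from the question working while avoiding a printout when the function is called only for the plot.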
I have a CSV file containing data as follows:
date, group, integer_value
The dates run from 01-January-2013 to 31-October-2015 for the 20 groups contained in the data.
I want to create a time series for each of the 20 groups. But the dates are not continuous and have sporadic gaps, hence
group4series <- ts(group4, frequency = 365.25, start = c(2013,1,1))
works from a programming point of view but is not correct because of the gaps in the data.
How can I use the 'date' column of the data to create the time series instead of the usual 'frequency' parameter of the 'ts()' function?
Thanks!
You could use zoo::zoo instead of ts.
Since you don't provide sample data, let's generate daily data, and remove some days to introduce "gaps".
set.seed(2018)
dates <- seq(as.Date("2015/12/01"), as.Date("2016/07/01"), by = "1 day")
dates <- dates[sample(length(dates), 100)]
We construct a sample data.frame
df <- data.frame(
  dates = dates,
  val = cumsum(runif(length(dates))))
To turn df into a zoo timeseries, you can do the following
library(zoo)
ts <- with(df, zoo(val, dates))
Let's plot the timeseries
plot.zoo(ts)
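For the 20 groups in the question, one possible extension is to build one zoo series per group. This is only a sketch: csv_data is a hypothetical stand-in for the CSV contents, with the column names date, group and integer_value taken from the question, and the zoo package is assumed to be installed.

```r
library(zoo)

# Hypothetical stand-in for the CSV: two groups, dates with gaps
set.seed(1)
csv_data <- data.frame(
  date          = rep(as.Date("2013-01-01") + c(0, 1, 4, 9), times = 2),
  group         = rep(c("group1", "group2"), each = 4),
  integer_value = sample(1:100, 8)
)

# One zoo series per group, indexed by the actual dates (gaps are kept as gaps)
series_by_group <- lapply(
  split(csv_data, csv_data$group),
  function(g) zoo(g$integer_value, order.by = g$date)
)

index(series_by_group$group1)
```

Because each series carries its own date index, no frequency argument is needed and missing days are simply absent rather than shifted.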
I have hourly time series data for three homes (H1, H2, H3) over five consecutive days, created as
library(xts)
library(ggplot2)
set.seed(123)
dt <- data.frame(H1 = rnorm(24*5,200,2),H2 = rnorm(24*5,150,2),H3 = rnorm(24*5,50,2)) # hourly data of three homes for 5 days
timestamp <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-01-05 23:59:59"), by = "hour") # create timestamp
dt$timestamp <- timestamp
Now I want to plot data homewise in facet form; accordingly I melt dataframe as
tempdf <- reshape2::melt(dt,id.vars="timestamp") # melt data for faceting
colnames(tempdf) <- c("time","var","val") # rename so as not to result in conflict with another melt inside geom_line
Within each facet (for each home), I want to see the values of all the five days in line plot form (each facet should contain 5 lines corresponding to different days). Accordingly,
ggplot(tempdf) + facet_wrap(~var) +
  geom_line(data = function(x) {
    locdat <- xts(x$val,x$time)# create timeseries object for easy splitting
    sub <- split.xts(locdat,f="days") # split data daywise of considered home
    sub2 <- sapply(sub, function(y) return(coredata(y))) # arrange data in matrix form
    df_sub2 <- as.data.frame(sub2)
    df_sub2$timestamp <- index(sub[[1]]) # forcing same timestamp for all days [okay with me]
    df_melt <- reshape2::melt(df_sub2,id.vars="timestamp") # melt to plot inside each facet
    #return(df_melt)
    df_melt
  }, aes(x=timestamp, y=value,group=variable,color=variable),inherit.aes = FALSE)
I have forced the same timestamp for all the days of a home to make plotting simple.
The only problem with the resulting plot is that it shows the same data in all the facets. Ideally, the H1 facet should contain data of home 1 only and the H2 facet should contain data of home 2. I know that I am not passing home-wise data to geom_line(). Can anyone help me do this correctly?
I think that you may find it more efficient to modify the data outside the call to ggplot rather than inside it (allows closer inspection of what is happening at each step, at least in my opinion).
Here, I am using lubridate to generate two new columns. The first holds only the date (and not the time) to allow faceting on that. The second holds the full datetime, but I then modify the date so that they are all the same. This leaves only the times as mattering (and we can suppress the chosen date in the plot).
library(lubridate)
tempdf$day <- date(tempdf$time)
tempdf$forPlotTime <- tempdf$time
date(tempdf$forPlotTime) <- "2016-01-01"
Then, I can pass that modified data.frame to ggplot. You will likely want to modify colors/labels, but this should get you a pretty good start.
ggplot(tempdf
, aes(x = forPlotTime
, y = val
, col = as.factor(day))) +
geom_line() +
facet_wrap(~var) +
scale_x_datetime(date_breaks = "6 hours"
, date_labels = "%H:%M")
This generates one facet per home, with one colored line per day.
I've gotten fairly good with the *apply family of functions, and I've recently learned to use the do.call("rbind", by(... as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
  x <- x[order(x$datadate), ] # sort data into ascending by date
  var <- x[,2] # the specific variable name in data.frame x hereby ignored
  var.name <- paste(names(x)[2], "lag", sep = '.') # keep variable name
  if(length(var)>1){ # no lagging if single observation
    lagged <- c(NA, var[1:(length(var)-1)])
    datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
    datediff <- x$datadate - datelag
    y <- data.frame(x$datadate, datediff, lagged) # join lagged variable and difference in YYYYMMDD data
    y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA # 2 or more full years difference
    y <- y[, c('x.datadate', 'lagged')]
    names(y) <- c("datadate", var.name)
  } else { y <- c(x$datadate[1], NA); names(y) <- c("datadate", var.name) }
  return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){#' lag vector when all data points are present, in order
  if(length(x)>1){
    y <- c(NA, x[1:(length(x)-1)])
  } else {y <- NA}
  return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messes up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!
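One possible way to keep the gap check while avoiding a per-firm function call is to lag the whole column at once and then blank out the entries that cross a firm boundary or a large date gap. The sketch below is only an illustration under stated assumptions: the toy df stands in for the Compustat data, lag_by_firm is a hypothetical helper name, and the 20000 threshold on the YYYYMMDD integers is taken from lag.vec above.

```r
# Hypothetical toy data standing in for the Compustat frame df
df <- data.frame(
  gvkey    = c(1, 1, 1, 2, 2, 3),
  datadate = c(20011231L, 20021231L, 20051231L, 20011231L, 20021231L, 20011231L),
  ni       = c(10, 11, 12, 20, 21, 30)
)

# Sort once, by firm and then by date
df <- df[order(df$gvkey, df$datadate), ]

# lag_by_firm is a hypothetical helper: a whole-column lag with two masks
lag_by_firm <- function(value, id, date, max_gap = 20000) {
  n <- length(value)
  if (n < 2) return(rep(NA, n))
  prev <- seq_len(n - 1)
  lagged    <- c(NA, value[prev])          # shift everything down by one row
  same_firm <- c(FALSE, id[-1] == id[prev]) # TRUE if previous row is the same firm
  gap       <- c(NA, date[-1] - date[prev]) # difference in YYYYMMDD integers
  # drop lags that cross firms or span two or more full years
  lagged[!same_firm | (!is.na(gap) & gap >= max_gap)] <- NA
  lagged
}

df$ni.lag <- lag_by_firm(df$ni, df$gvkey, df$datadate)
df
```

Since the result is a plain vector in the same row order as the sorted df, it can be assigned directly as a new column, and the same call can be repeated for each variable to lag.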
I have a dataframe with several columns:
state
county
year
Then x, y, and z, where x, y, and z are observations unique to the triplet listed above. I am looking for a sane way to store this in a time series and xts will not let me since there are multiple observations for each time index. I have looked at the hts package, but am having trouble figuring out how to get my data into it from the dataframe.
(Yes, I did post the same question on Quora, and was advised to bring it here!)
One option is to reshape your data so you have a column for every State-County combination. This allows you to construct an xts matrix :
require(reshape)
Opt1 <- as.data.frame(cast(Data, Date ~ county + State, value="Val"))
rownames(Opt1) <- Opt1$Date
Opt1$Date <- NULL
as.xts(Opt1)
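To show the shape this reshaping aims for without depending on the reshape package, here is a small base R sketch using tapply. The toy Data frame is hypothetical and only mirrors the column names used in this answer (State, county, Date, Val).

```r
# Hypothetical toy data: 2 states x 2 counties x 3 monthly dates
Data <- data.frame(
  State  = rep(c("a", "b"), each = 6),
  county = rep(c("d", "e"), times = 6),
  Date   = rep(rep(seq(as.Date("2011-01-01"), by = "month", length.out = 3),
                   each = 2), 2),
  Val    = seq_len(12)
)

# One row per date, one column per county_State combination
wide <- with(Data, tapply(Val,
                          list(format(Date), paste(county, State, sep = "_")),
                          FUN = sum))
wide
```

The result is a matrix whose row names are the dates, which is the layout that a conversion to xts expects.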
Alternatively, you could work with a list of xts objects, each time making sure that you have the correct format as asked by xts. Same goes for any of the other timeseries packages. A possible solution would be :
Opt2 <-
  with(Data,
    by(Data,list(county,State,year),
      function(x){
        rownames(x) <- x$Date
        x <- x["Val"]
        as.xts(x)
      }
    )
  )
Which would allow something like :
Opt2[["d","b","2012"]]
to select a specific time series. You can use all xts options on that. You can loop through the counties, states and years to construct plots like this one :
Code for plot :
counties <- dimnames(Opt2)[[1]]
states <- dimnames(Opt2)[[2]]
years <- dimnames(Opt2)[[3]]
op <- par(mfrow=c(3,6))
apply(
  expand.grid(counties,states,years),1,
  function(i){
    plot(Opt2[[i[1],i[2],i[3]]],main=paste(i,collapse="-"))
    invisible()
  }
)
par(op)
Test-data :
Data <- data.frame( State = rep(letters[1:3],each=90),
                    county = rep(letters[4:6],90),
                    Date = rep(seq(as.Date("2011-01-01"),by="month",length.out=30),each=3),
                    Val = runif(270)
)
Data$year <- as.POSIXlt(Data$Date)$year + 1900
I have data that looks like this:
"date", "sunrise"
2009-01-01, 05:31
2009-01-02, 05:31
2009-01-03, 05:33
2009-01-05, 05:34
....
2009-12-31, 05:29
and I want to plot this in R, with "date" as the x-axis, and "sunrise" as the y-axis.
You need to work a bit harder to get R to draw a suitable plot (i.e. get suitable axes). Say I have data similar to yours (here in a csv file for convenience):
"date","sunrise"
2009-01-01,05:31
2009-01-02,05:31
2009-01-03,05:33
2009-01-05,05:34
2009-01-06,05:35
2009-01-07,05:36
2009-01-08,05:37
2009-01-09,05:38
2009-01-10,05:39
2009-01-11,05:40
2009-01-12,05:40
2009-01-13,05:41
We can read the data in and format it appropriately so R knows the special nature of the data. The read.csv() call includes argument colClasses so R doesn't convert the dates/times into factors.
dat <- read.csv("foo.txt", colClasses = "character")
## Now convert the imported data to appropriate types
dat <- within(dat, {
  date <- as.Date(date) ## no need for 'format' argument as data in correct format
  sunrise <- as.POSIXct(sunrise, format = "%H:%M")
})
str(dat)
Now comes the slightly tricky bit as R gets the axes wrong (or perhaps better to say they aren't what we want) if you just do
plot(sunrise ~ date, data = dat)
## or
with(dat, plot(date, sunrise))
The first version gets both axes wrong, and the second can dispatch correctly on the dates so gets the x-axis correct, but the y-axis labels are not right.
So, suppress the plotting of the axes, and then add them yourself using axis.FOO functions where FOO is Date or POSIXct:
plot(sunrise ~ date, data = dat, axes = FALSE)
with(dat, axis.POSIXct(x = sunrise, side = 2, format = "%H:%M"))
with(dat, axis.Date(x = date, side = 1))
box() ## complete the plot frame
HTH
I think you can use the as.Date and as.POSIXct functions to convert the two columns to the proper format (the format parameter of as.POSIXct should be set to "%H:%M").
The standard plot function should then be able to deal with times and dates by itself.
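A minimal sketch of that conversion is shown below. The two-row data frame is a hypothetical stand-in for the sunrise file, and tz = "UTC" is an added assumption so that the parsed times do not depend on the local time zone.

```r
# Hypothetical stand-in for the sunrise data, read in as character columns
dat <- data.frame(
  date    = c("2009-01-01", "2009-01-02"),
  sunrise = c("05:31", "05:33"),
  stringsAsFactors = FALSE
)

dat$date    <- as.Date(dat$date)  # already in the default YYYY-MM-DD format
dat$sunrise <- as.POSIXct(dat$sunrise, format = "%H:%M", tz = "UTC")

str(dat)
```

Note that as.POSIXct fills in the current date for the time-only strings, so only the time-of-day part of sunrise is meaningful.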