I have a dataframe with several columns:
state
county
year
Then x, y, and z, where x, y, and z are observations unique to the (state, county, year) triplet listed above. I am looking for a sane way to store this as a time series, and xts will not let me, since there are multiple observations for each time index. I have looked at the hts package, but am having trouble figuring out how to get my data into it from the dataframe.
(Yes, I did post the same question on Quora, and was advised to bring it here!)
One option is to reshape your data so you have a column for every State-County combination. This allows you to construct an xts matrix:
require(reshape)
Opt1 <- as.data.frame(cast(Data, Date ~ county + State, value="Val"))
rownames(Opt1) <- Opt1$Date
Opt1$Date <- NULL
as.xts(Opt1)
Alternatively, you could work with a list of xts objects, each time making sure that you have the correct format required by xts. The same goes for any of the other time series packages. A possible solution would be:
Opt2 <-
  with(Data,
    by(Data, list(county, State, year),
      function(x){
        rownames(x) <- x$Date
        x <- x["Val"]
        as.xts(x)
      }
    )
  )
This would allow something like:
Opt2[["d","b","2012"]]
to select a specific time series. You can use all the usual xts operations on it. You can loop through the counties, states, and years to construct plots like this one:
Code for the plot:
counties <- dimnames(Opt2)[[1]]
states   <- dimnames(Opt2)[[2]]
years    <- dimnames(Opt2)[[3]]

op <- par(mfrow = c(3, 6))
apply(
  expand.grid(counties, states, years), 1,
  function(i){
    plot(Opt2[[i[1], i[2], i[3]]], main = paste(i, collapse = "-"))
    invisible()
  }
)
par(op)
Test data:
Data <- data.frame(
  State  = rep(letters[1:3], each = 90),
  county = rep(letters[4:6], 90),
  Date   = rep(seq(as.Date("2011-01-01"), by = "month", length.out = 30), each = 3),
  Val    = runif(270)
)
Data$year <- as.POSIXlt(Data$Date)$year + 1900
I am using data from the gapminder package to run an analysis (code below). I need to pick two continents out of the five available.
library(gapminder)
part3 <- gapminder
continent1 <- subset(part3, continent == "Asia")
continent2 <- subset(part3, continent =="Africa")
#As I'm going to t-test I need two factors - picking two continents
part3c <- rbind(continent1, continent2)
Question: Is there a way for the user to pick the continents for the analysis, e.g. some code that says "pick two from the five available", so that the analysis can be run with different combinations?
Something like the output you get from filtering data in an Excel pivot table, or do I need to hard-code the continents each time, as above?
Do you want something like this?
The function combn returns the combinations of the elements of a vector or list, in the case below taken two at a time, and applies a function to each combination. The function test_fun first makes sure the two groups are of the same size, then runs the t-test.
In the example call, I test equality of lifeExp by continent but any other column can be tested.
test_fun <- function(X, col){
  cols <- c(col, "continent")
  n <- min(nrow(X[[1]]), nrow(X[[2]]))
  Y <- lapply(X, \(y) {
    if(nrow(y) > n)
      y[sample(nrow(y), n), cols]
    else y[cols]
  })
  Y <- do.call(rbind, Y)
  t.test(get(col) ~ continent, Y)
}
sp_part3 <- split(part3, part3$continent)
combn(sp_part3, 2, test_fun, simplify = FALSE, col = "lifeExp")
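If you only want one specific pair rather than all ten combinations, the same function can be called directly on a two-element subset of the split list; the continent names below are simply the two from the question:
# run the test for a single user-chosen pair of continents
test_fun(sp_part3[c("Asia", "Africa")], col = "lifeExp")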
I've gotten fairly good with the *apply family of functions, and I've recently learned to use do.call("rbind", by(...)) as a wrapper for tapply. I'm working with a large data set (Compustat), and I have a function (see below) that generates a new column of lagged variables, which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
  x <- x[order(x$datadate), ]   # sort data into ascending order by date
  var <- x[, 2]                 # the specific variable name in data.frame x hereby ignored
  var.name <- paste(names(x)[2], "lag", sep = '.')  # keep variable name
  if(length(var) > 1){          # no lagging if single observation
    lagged <- c(NA, var[1:(length(var)-1)])
    datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
    datediff <- x$datadate - datelag
    y <- data.frame(x$datadate, datediff, lagged)   # join lagged variable and difference in YYYYMMDD data
    y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA   # 2 or more full years' difference
    y <- y[, c('x.datadate', 'lagged')]
    names(y) <- c("datadate", var.name)
  } else {
    y <- c(x$datadate[1], NA)
    names(y) <- c("datadate", var.name)
  }
  return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
#' lag vector when all data points are present, in order
lag.vec.seq <- function(x){
  if(length(x) > 1){
    y <- c(NA, x[1:(length(x)-1)])
  } else {
    y <- NA
  }
  return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see, the main difference is that the tapply approach doesn't include any datadate information, and so the function assumes that all data points are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
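For what it's worth, the same gap-aware lag can also be expressed in vectorized form. The sketch below uses data.table's shift() and is only an untested illustration (the ni.lag and datediff column names are made up here), not a drop-in replacement for lag.vec:
library(data.table)

DT <- as.data.table(df)        # df holds gvkey, datadate (integer YYYYMMDD), ni, ...
setkey(DT, gvkey, datadate)    # sort once: by firm, then by date

# lag ni within each firm, then blank out lags that are 2+ calendar years old
DT[, ni.lag   := shift(ni), by = gvkey]
DT[, datediff := datadate - shift(datadate), by = gvkey]
DT[!is.na(datediff) & datediff >= 20000, ni.lag := NA_real_]  # assumes ni is numeric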
Some notes:
1) The first order() command in the function is probably unnecessary, since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R will mess up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
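As for note 2, one way to find out empirically is to profile a single call with base R's Rprof()/summaryRprof(); the sketch below only reuses the df and lag.vec objects defined above:
# profile one run of the slow approach to see where the time goes
Rprof("lag_vec_profile.out")
ni_lag <- do.call("rbind", by(df[, c("datadate", "ni")], df$gvkey, lag.vec))
Rprof(NULL)
summaryRprof("lag_vec_profile.out")$by.self   # self time per function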
Thank you!
I have a dataframe called EWMA_SD252 with 3561 obs. of 102 variables (daily volatilities of 100 stocks since 2000); here is a sample:
Data IBOV ABEV3 AEDU3 ALLL3
3000 2012-02-09 16.88756 15.00696 33.46089 25.04788
3001 2012-02-10 18.72925 14.55346 32.72209 24.93913
3002 2012-02-13 20.87183 15.25370 31.91537 24.28962
3003 2012-02-14 20.60184 14.86653 31.04094 28.18687
3004 2012-02-15 20.07140 14.56653 37.45965 33.47379
3005 2012-02-16 19.99611 16.80995 37.36497 32.46208
3006 2012-02-17 19.39035 17.31730 38.85145 31.50452
What I am trying to do, using a single command, is subset an interval for a particular stock using date references and also plot a chart for the same interval. So far I have managed the subset part, but now I am stuck on plotting the chart. Here is my code so far:
Getting the date interval and the stock name:
datas = function(x, y, z){
  intervalo_datas(as.Date(x, "%d/%m/%Y"), as.Date(y, "%d/%m/%Y"), z)
}
Subsetting the data:
intervalo_datas <- function(x,y,z){
cbind(as.data.frame(EWMA_SD252[,1]),as.data.frame(EWMA_SD252[,z]))[EWMA_SD252$Data >= x & EWMA_SD252$Data <= y,]
}
Now I am stuck. Is it possible, with a single function, to get the ABEV3 data frame and plot a chart with dates on the x-axis and volatility on the y-axis, using just the command below?
ABEV3 = datas("09/02/2012","17/02/2012","ABEV3")
I think you should use the xts package. It is well suited to:
manipulating time series, especially financial time series
subsetting time series
plotting time series
So I would create an xts object from your data, then wrap the subset/plot in a single function like the one you tried to write.
library(xts)

# dat is your EWMA_SD252 data.frame; drop the Data column and use it as the index
dat_ts <- xts(dat[, -1], as.Date(dat$Data))

plot_data <- function(start, end, stock)
  plot(dat_ts[paste(start, end, sep = '/'), stock])
You can call it like this:
plot_data('2012-02-09','2012-02-14','IBOV')
You could use ggplot2 and reshape2 to make a function that automatically plots an arbitrary number of stocks:
plot_stocks <- function(data, date1, date2, stocks){
  require(ggplot2)
  require(reshape2)
  date1 <- as.Date(date1, "%d/%m/%Y")
  date2 <- as.Date(date2, "%d/%m/%Y")
  data <- data[data$Data > date1 & data$Data < date2, c("Data", stocks)]
  data <- melt(data, id = "Data")
  names(data) <- c("Data", "Stock", "Value")
  ggplot(data, aes(Data, Value, color = Stock)) + geom_line()
}
Plotting one stock "ABEV3":
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", "ABEV3")
Plotting three stocks:
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", c("IBOV", "ABEV3", "AEDU3"))
You can further personalize your function by adding other geoms, like geom_smooth, etc.
(I'm assuming your EWMA_SD252 data.frame's Data column is already Date class. Convert it if it's not already.)
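For example, a minimal sketch of that idea: swap the last line of plot_stocks() above for the expression below to layer a smoother over each stock's line (geom_smooth(se = FALSE) suppresses the confidence ribbon).
# same data as prepared inside plot_stocks(), with a trend layer added
ggplot(data, aes(Data, Value, color = Stock)) +
  geom_line() +
  geom_smooth(se = FALSE)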
It looks like you're trying to plot a particular column of your data.frame for a given date interval. It will be much easier for others to read your code (and for you too in 6 months!) if you use variable names that are more descriptive than x, y, and z, e.g. date0, date1, column.
Let's rewrite your function. If EWMA_SD252 is already a data.frame, then you don't need to cbind individual columns of it into a data.frame. Giving a data argument makes things more flexible as well. All your datas function does is convert to Dates and call intervalo_datas, so we should wrap that up as well.
intervalo_datas <- function(date0, date1, column_name, data = EWMA_SD252) {
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  cols <- c(1, which(names(data) == column_name))
  return(data[data$Data >= date0 & data$Data <= date1, cols])
}
Now you should be able to get a subset this way
ABEV3 = intervalo_datas("09/02/2012", "17/02/2012", "ABEV3")
And plot like this.
plot(ABEV3[, 1], ABEV3[, 2])
If you want the subsetting function to also plot, just add the plot command before the return line (but define the subset first!). Using something like xts as agstudy recommends will simplify things and handle the dates better on the axis labels.
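A minimal sketch of that suggestion, reusing the intervalo_datas() definition above (the local variable sub is introduced here purely for illustration):
intervalo_datas <- function(date0, date1, column_name, data = EWMA_SD252) {
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  cols <- c(1, which(names(data) == column_name))
  sub <- data[data$Data >= date0 & data$Data <= date1, cols]  # define the subset first
  plot(sub[, 1], sub[, 2], type = "l",
       xlab = "Data", ylab = column_name)                     # then plot it
  return(sub)
}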
Suppose I need to apply an MA(5) to a batch of market data stored in an xts object. I can easily pull the subset of data I want smoothed with xts subsetting:
x['2013-12-05 17:00:01/2013-12-06 17:00:00']
However, I need an additional 5 observations prior to the first one in my subset to "prime" the filter. Is there an easy way to do this?
The only thing I have been able to figure out is really ugly, with explicit row numbers (here using xts sample data):
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
x$rn <- row(x[,1])
frst <- first(x['2007-05-18'])$rn
finl <- last(x['2007-06-09'])$rn
ans <- x[(frst-5):finl,]
Can I just say bleah? Somebody help me.
UPDATE: by popular request, a short example that applies an MA(5) to the daily data in sample_matrix:
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)$Close
calc_weights <- function(x) {
  ## replace rnorm with sophisticated analysis
  wgts <- matrix(rnorm(5, 0, 0.5), nrow = 1)
  xts(wgts, index(last(x)))
}

smooth_days <- function(x, wgts) {
  w <- wgts[index(last(x))]
  out <- filter(x, w, sides = 1)
  xts(out, index(x))
}
set.seed(1.23456789)
wgts <- apply.weekly(x, calc_weights)
lapply(split(x, f='weeks'), smooth_days, wgts)
For brevity, only the final week's output:
[[26]]
[,1]
2007-06-25 NA
2007-06-26 NA
2007-06-27 NA
2007-06-28 NA
2007-06-29 -9.581503
2007-06-30 -9.581208
The NAs here are my problem. I want to recalculate my weights for each week of data, and apply those new weights to the upcoming week. Rinse, repeat. In real life, I replace the lapply with some ugly stuff with row indexes, but I'm sure there's a better way.
To define the problem clearly: this appears to be a conflict between the desire to run an analysis on non-overlapping time periods (weeks, in this case) and the need for overlapping periods of data (two weeks, in this case) to perform the calculation.
Here's one way to do this using endpoints and a for loop. You could still use the which.i=TRUE suggestion in my comment, but integer subsetting is faster.
y <- x*NA                    # pre-allocate result
ep <- endpoints(x, "weeks")  # time points where parameters change
set.seed(1.23456789)
for(i in seq_along(ep)[-(1:2)]) {
  rng1 <- ep[i-1]:ep[i]      # obs to calc weights
  rng2 <- ep[i-2]:ep[i]      # "prime" obs
  wgts <- calc_weights(x[rng1])
  # calc smooth_days on rng2, but only keep rng1 results
  y[rng1] <- smooth_days(x[rng2], wgts)[index(x[rng1])]
}
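With each week now primed by the preceding weeks, ordinary xts date-range subsetting can be used to inspect the final week of y (the exact numbers will differ from the lapply output above unless the same weights happen to be drawn):
# final week of the smoothed series; the leading NAs should now be filled
y["2007-06-25/2007-06-30"]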
Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing an R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to the values of the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the by() function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a set of user-provided statistical functions to each metric within a class for a given year, month, and day (e.g., the mean [function] low temperature [metric] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).

I want the end result to be new data frames containing values for every date within a region and a year range, for each metric and statistical function applied. I am very close to this result using the aggregate() function, but I am having trouble getting reasonable output from it: it currently returns NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale") # Statistical functions to apply
x <- "C:/Data/" # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp") # Metrics to subset the data
z <- c(1970:2000) # Year range to subset the data
CSVstnClass <- data.frame(CSVstations, CSVclasses)

by(CSVstnClass, CSVstnClass[,2], function(a){            # Station list by class
  suppressWarnings(assign(paste(a[,2]), paste(a[,1]), envir = .GlobalEnv))
  apply(a, 1, function(b){                               # Data frame list, row-wise
    classData <- data.frame()
    sapply(y, function(d){                               # Element list
      CSV_DF <- read.csv(paste(x, b[2], "/", b[1], ".csv", sep = ""))  # Read in CSV files as data frames
      CSV_DF1 <- CSV_DF[!is.na("Value")]
      CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d), ]
      assign(paste(b[2], "_", d, sep = ""), CSV_DF2, envir = .GlobalEnv)
      if(nrow(CSV_DF2) > 0){                             # Remove empty data frames
        classData <<- rbind(classData, CSV_DF2)          # Bind all data frames by row for a class and element
        assign(paste(b[2], "_", d, "_bound", sep = ""), classData, envir = .GlobalEnv)
        sapply(w, function(g){                           # Function list
          # Aggregate results of bound data frame for each unique date
          dataFunc <- aggregate(Value ~ Year + Month + Day + Element, data = classData, FUN = g, na.action = na.pass)
          assign(paste(b[2], "_", d, "_", g, sep = ""), dataFunc, envir = .GlobalEnv)
        })
      }
    })
  })
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)

# Read in CSV files as data frames
raw <- lapply(paths, read.csv)
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
!is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
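If you then want a single data frame per statistic rather than a list with one element per input file, a small follow-up step could stack them; combined_means and combined_sds below are just illustrative names:
# stack the per-file aggregates into one data frame per statistic
combined_means <- do.call(rbind, means)
combined_sds   <- do.call(rbind, sds)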