I have a time series of returns that is approximately 20 years long. Based on this time series, I want to compute some kind of moving bootstrap to calculate the mean return at every observation.
Let me demonstrate this with an example:
Let's say we have data starting at 01.01.1990 and I want to compute the bootstrap means starting at 01.01.1991.
At 01.01.1991 I want to compute the mean based on the returns between 01.01.1990 and 01.01.1991.
Then, on 02.01.1991 I also want to take into account the return of 02.01.1991, and therefore want to calculate the bootstrap mean based on the returns from 01.01.1990 to 02.01.1991.
To sum up, the data window for my bootstrap should grow by one observation as it moves through the time series.
I hope that you can understand what I am trying to say.
I would appreciate any help.
Cheers
Sven
So I managed to answer the question myself.
Let's say we want to get the means calculated with bootstrap starting at 01.01.1991, which is the 300th observation in our sample
(overall we have 1000 observations in our time series);
then the code is the following:
h <- rep(1, 1000)  # placeholder vector; entries before the start stay 1
for (i in 300:1000) {
  # resample the first i returns 5000 times with replacement and average
  h[i] <- mean(sample(rawdata$retoil[1:i], 5000, replace = TRUE))
}
The first 299 rows of h (the loop starts at i = 300) are 1s and can be deleted at the end.
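If you prefer to avoid the placeholder 1s, a minimal sketch of the same expanding-window bootstrap using sapply (assuming the same rawdata$retoil column and the same start at observation 300) would be:
h <- sapply(300:1000, function(i) {
  # resample the first i returns 5000 times with replacement, then average
  mean(sample(rawdata$retoil[1:i], 5000, replace = TRUE))
})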
Hope I could help some of you :)
What is the correct program flow to write different-sized data frames to the same worksheet while ensuring that only the most recently written data values are visible?
Here was my original sequence:
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(sheet_name)
wks = sh.worksheet_by_title(wks_name)
wks.set_dataframe(df, (1, 1))
The problem with the above sequence is that if the 1st write was 3800 rows x 12 cols and the 2nd write was 2400 rows x 12 cols, the worksheet would still show data from the prior write for rows beyond 2400.
My 2nd solution (basically a hack just to get it to work for me):
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(spreadsheet_name)
wks = sh.worksheet_by_title(sheet_name)
sh.del_worksheet(wks)
sh.add_worksheet(sheet_name, rows=len(df) + 1, cols=len(df.columns))
wks = sh.worksheet_by_title(sheet_name)
wks.set_dataframe(df, (1, 1))
The above sequence basically does what I want, but I do not like having to delete the worksheet (I lose all my manual formatting). I know there must be a correct way to accomplish this, but I do not know the pygsheets API very well.
Would a more advanced pygsheets user please advise the proper program flow and methods to use?
TIA,
--Rj
fit=True will basically resize the sheet to fit your data frame. So if you want to keep the sheet at the same size, you can clear the sheet before the next write; it would be easier than your second solution. Also, if you just want to clear the range you had written earlier, you can pass a range to the clear function.
wks.set_dataframe(df, (1, 1))  # first write
wks.clear()                    # wipe old values so stale rows from a larger prior write don't linger
wks.set_dataframe(df, (1, 1))  # next write
I used to generate time series in R using the xts package as follows:
library(xts)
seq <- timeBasedSeq('2015-06-01/2015-06-05 23')
z <- xts(1:length(seq),seq)
After a bit of tweaking, I find it easy to generate data at a default rate of 1 hour, 1 minute, or 1 second. Reading the help page ?timeBasedSeq does not clearly mention how to generate data at other frequencies. Say I want to generate data at a 10-minute rate. Where should I mention M (minutes) and 10 in the said command to generate 10-minute data? The option M is mentioned in the help pages.
This isn't currently possible. The "by" value is essentially hard-coded at 1 unit. It should be possible to patch the code so you can specify the "by" component as something like "10M" for 10 minutes, since seq.POSIXct would accept a by = "10 mins" value.
If you want, you can create an issue for this feature request and we can discuss details of what this patch should include.
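In the meantime, a minimal workaround sketch is to build the index with seq.POSIXct directly and wrap it in xts (assuming the same date range as the example above):
library(xts)
# build a 10-minute index by hand, since timeBasedSeq only steps by 1 unit
idx <- seq(as.POSIXct("2015-06-01 00:00"), as.POSIXct("2015-06-05 23:00"), by = "10 mins")
z <- xts(seq_along(idx), idx)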
I'm new to R, and my problem is I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs; I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri = levels(u$uri)[l],
                        hits = ratio(u[as.numeric(u$uri) == l, ])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
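For instance, a minimal sketch of both filters, assuming the data frame u from above (the cutoff of 50 is just the value named in the question):
# keep only the 50 most frequent URIs before computing ratios
top50 <- names(sort(table(u$uri), decreasing = TRUE))[1:50]
u50 <- u[u$uri %in% top50, ]
# or compute the ratio for a single URI directly
ratio(u[u$uri == "/some/uri", ])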
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here--data.table is just faster.
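As a rough sketch (not a benchmark), the same per-URI hit ratio in data.table might look like this, assuming the data frame u from the other answer; the column names are the ones shown in the question:
library(data.table)
dt <- as.data.table(u)
# hit ratio per URI per ten-second bucket, in one grouped pass
hits <- dt[, .(hit_ratio = mean(action == "TCP_HIT")), by = .(uri, bucket = time %/% 10)]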
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for (i in seq_along(h)) {
  name <- unlist(h[[i]][1])                            # URI for this list element
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))   # the hits data frame
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")          # line plot with points
  title(main = name)
}
I have some data formatted as below. I have done some analysis on this and would like to be able to plot the price development in the same graph as the analyzed data.
This requires me to have the same x-axis for both series.
So I would like to aggregate the "shares" column in increments of, say, 150 shares, and add the "finalprice" and "time" to this.
The aggregation should include the latest time and price, so if the aggregation needs to occur over two or more rows of data then the last row should provide the price and time data.
My question is how to create a new vector with 150 shares per row.
The length of the vector will equal sum(shares)/150.
Is there an easy way to do this? Thanks in advance.
Edit:
I thought about expanding the observations using rep(finalprice, shares) and then taking every 150th value of the expanded vector.
Data sample:
"date","ord","shares","finalprice","time","stock"
20120702,E,2000,99.35,540.84753333,500
20120702,E,28000,99.35,540.84753333,500
20120702,E,50,99.5,542.03073333,500
20120702,E,13874,99.5,542.29411667,500
20120702,E,292,99.5,542.30191667,500
20120702,E,784,99.5,542.30193333,500
20120702,E,13300,99.35,543.04805,500
20120702,E,16658,99.35,543.04805,500
20120702,E,42,99.5,543.04805,500
20120702,E,400,99.4,546.17173333,500
20120702,E,100,99.4,547.07,500
20120702,E,2219,99.3,549.47988333,500
20120702,E,781,99.3,549.5238,500
20120702,E,50,99.3,553.4052,500
20120702,E,1500,99.35,559.86275,500
20120702,E,103,99.5,567.56726667,500
20120702,E,1105,99.7,573.93326667,500
20120702,E,4100,99.5,582.2657,500
20120702,E,900,99.5,582.2657,500
20120702,E,1024,99.45,582.43891667,500
20120702,E,8214,99.45,582.43891667,500
20120702,E,10762,99.45,582.43895,500
20120702,E,1250,99.6,586.86446667,500
20120702,E,5000,99.45,594.39061667,500
20120702,E,20000,99.45,594.39061667,500
20120702,E,15000,99.45,594.39061667,500
20120702,E,4000,99.45,601.34491667,500
20120702,E,8700,99.45,603.53608333,500
20120702,E,3290,99.6,609.23213333,500
I think I got it solved.
expand <- rep(finalprice, shares)  # repeat each price once per share traded
Increment <- expand[seq(from = 1, to = length(expand), by = 150)]  # price at every 150th share (the 1st, 151st, 301st, ...)
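A slightly extended sketch of the same idea that also carries the time column, assuming the data sample above is read into a data frame called trades (a hypothetical name); starting the index at 150 instead of 1 makes each increment take the price and time of the last share in its block, matching the requirement stated above:
# index of the last share in each 150-share block
idx <- seq(from = 150, to = sum(trades$shares), by = 150)
price <- rep(trades$finalprice, trades$shares)[idx]
times <- rep(trades$time, trades$shares)[idx]
increments <- data.frame(time = times, finalprice = price)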