Altering the data on diagrams - r

I am working on a statistics project and I came across one tiny issue. I have to work out a tactic for bidding on horse racing. We were given a set of 300 races with data such as the time, the favourite's odds, and whether the favourite won or not. So I thought I could order the table in ascending order by time and see how the wins are spread over the day. The races run from 13:35 to 21:20, so I divided the time into 1-hour chunks (except the first chunk, which is only 25 min, and the last chunk, which is only 20 min) and assigned the wins to those chunks. Then I added up all the wins inside each chunk and presented that on a chart, which looks like this:
It fits my tactic, which is to bet only between 14:00 and 16:00, because after that the bookie starts setting the odds in his own favour and taking back the money he has just paid out. The thing is that along the bottom it says 2, 4, 6, 8... but I want it to read 13:35-14:00, 14:00-15:00, etc. How can I do that, given that the code I used is this:
plot(chunksVector, type="o", col="blue", xlab="Time", ylab="Wins")
I've been struggling with this one for some time now. Is there any way to alter the code?
P.S.: "chunks" is just the name I gave to the wins grouped into one-hour bins, so I basically have chunk13.35_14.00, chunk14.00_15.00, chunk15.00_16.00, etc. What I want to alter is only the x axis.
I want it to look like this:

The quick answer is:
plot(chunksVector, type="o", col="blue", xlab="Time", ylab="Wins", xaxt="n")
axis(1, at=c(2,4,6,8), labels=c('14:00', '16:00', '18:00', '20:00'))
There are probably a few other ways of doing this by messing about with time-series packages.
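If you want every chunk labelled with its own time range, here is a minimal sketch along the same lines (the win counts and the nine-chunk layout are made up to match the description above):
chunksVector <- c(3, 5, 7, 6, 4, 2, 1, 2, 1)   # made-up win totals per chunk
chunkLabels <- c("13:35-14:00", "14:00-15:00", "15:00-16:00", "16:00-17:00",
                 "17:00-18:00", "18:00-19:00", "19:00-20:00", "20:00-21:00",
                 "21:00-21:20")
plot(chunksVector, type="o", col="blue", xlab="Time", ylab="Wins", xaxt="n")
axis(1, at=seq_along(chunksVector), labels=chunkLabels, cex.axis=0.7, las=2)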

Instead of messing about with x labels, convert x to factor:
df <- data.frame(x=factor(c('14:00', '16:00', '18:00', '20:00')),
                 y=c(1,2,3,4))
plot(df)

Related

R-programming. Percentiles

So I have estimated 1000 cash flows 10 years ahead. These cash flows are contained in a 1000*10 matrix.
For analytical purposes I want to examine the different percentiles/quantiles in terms of cash flows.
Part of my R code:
plot(timeline, TOTKS[1,], ylim=range(-15000000, 100000000), type="l",
     ylab="Cash Flow", xlab="year")
for (i in 1:total.simulations) {
  lines(timeline, TOTKS[i,])
}
Example:
I want to color the lowest 10% of cash flows in the plot above.
I want to examine those 10% in a new matrix.
Any suggestions?
Regards
By "lowest cashflows" do you mean "lowest total cash flow over all the years" or "lowest single point of cash flow over the years" or ... ?
I'm guessing the former, though what you mean might be something I didn't guess at all. :)
Also, I don't have any data from you, so what's below is guesswork based on your description. In future, you'll get better answers faster if you can put some sample or fake data (in the correct structure) into the question using dput(). But I hope this gets you started.
Edited - changed 10 lowest to lowest 10%
Due to inattention and insufficient coffee, I answered the wrong question. Fixed, I think.
First, find the total cash flow of each simulation. Since each row of TOTKS is one simulation, add up the rows:
total.cashflow <- rowSums(TOTKS, na.rm=TRUE)
Then find which of them fall in the lowest 10%:
rank.cashflow.le10 <- which(
  total.cashflow <= quantile(total.cashflow, 0.1)
)
Add them to the plot (transposing so each selected simulation becomes a column, which is what matlines expects):
matlines(timeline, t(TOTKS[rank.cashflow.le10,]), col="red")
To examine those as a separate matrix (but you might not need to, with a vector of indices):
low10TOTKS <- TOTKS[rank.cashflow.le10,]
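Putting it all together on fake data in the structure you describe (1000 simulations in the rows, 10 years in the columns; all the numbers below are made up):
set.seed(1)
total.simulations <- 1000
timeline <- 1:10
TOTKS <- matrix(rnorm(total.simulations * 10, mean=5e6, sd=2e7),
                nrow=total.simulations)

plot(timeline, TOTKS[1,], ylim=range(TOTKS), type="l",
     ylab="Cash Flow", xlab="year")
for (i in 1:total.simulations) lines(timeline, TOTKS[i,])

total.cashflow <- rowSums(TOTKS, na.rm=TRUE)
rank.cashflow.le10 <- which(total.cashflow <= quantile(total.cashflow, 0.1))
matlines(timeline, t(TOTKS[rank.cashflow.le10,]), col="red")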
Hope that helps.

How to compare cumulative counter vs the best, average and worst using Graphite?

I have a counter that measures the number of items sold every 10 minutes.
I currently use this to track the cumulative number of items:
alias(integral(app.items_sold), 'Today')
And it looks like this:
Now, what I want to do is show how well we are doing TODAY vs the best, avg (or maybe median) and worst day we've had over the past, say, 90 days.
I tried something like this:
alias(integral(maxSeries(timeStack(app.items_sold, '1d', 0, 90))),'Max')
alias(integral(averageSeries(timeStack(app.items_sold, '1d', 0,90))), 'Avg')
alias(integral(minSeries(timeStack(app.items_sold, '1d',0, 90))), 'Min')
which looks great, but it actually shows the cumulative sum of the per-interval max, avg and min taken across all the stacked series, rather than the best, average and worst single day.
Can anyone suggest a way to achieve what I'm looking for?
i.e. determine what the best (and worst and median) day was for the past 90 days and plot that. Can it be done using purely Graphite functions?
Thanks.
The answer was to just flip the order of the function calls (apply integral to each stacked day first, then take maxSeries of the results):
Thanks to turner on the grafana#groups.io board for the answer.
alias(maxSeries(integral(timeStack(app.items_sold, '1d', 0, 90))),'Max')
alias(averageSeries(integral(timeStack(app.items_sold, '1d', 0,90))), 'Avg')
alias(minSeries(integral(timeStack(app.items_sold, '1d',0, 90))), 'Min')

How can I compute data rates in R with millisecond precision data?

I'm trying to take data from a CSV file that looks like this:
datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074
and compute rates of change of the bytes value (which represents the number of bytes downloaded so far into a file), in a way that accurately reflects my detailed time data for when I took the sample (which should approximately be every 1/10 of a second, though for various reasons, I expect that to be imperfect).
For example, in the above sampling, the second row got (86594-52594=)34000 additional bytes over the first, in (.554342-.453565=).100777 seconds, thus yielding (34000/0.100777=)337,378 bytes/second.
A second example is that the last row compared to its predecessor got (1743074-1245314=)497760 bytes in (.259264-.158559=).100705 seconds, thus yielding (497760/.100705=)4,942,753 bytes/sec.
I'd like to get a graph of these rates over time, and I'm fairly new to R, and not quite figuring out how to get what I want.
I found some related questions that seem like they might get me close:
How to parse milliseconds in R?
Need to calculate Rate of Change of two data sets over time individually and Net rate of Change
Apply a function to a specified range; Rate of Change
How do I calculate a monthly rate of change from a daily time series in R?
But none of them seem to quite get me there... When I try using strptime, I seem to lose the precision (even using %OS); plus, I'm just not sure how to plot this as a series of deltas with timestamps associated with them... And the stuff in that one answer (second link, the answer with the AAPL stock delta graph) about diff(...) and -nrow(...) makes sense to me at a conceptual level, but not deeply enough that I understand how to apply it in this case.
I think I may have gotten close, but would love to see what others come up with. What options do I have for this? Anything that could show a rolling average (over, say, a second or 5 seconds), and/or using nice SI units (KB/s, MB/s, etc.)?
Edit:
I think I may be pretty close (or even getting the basic question answered) with:
my_data <- read.csv("my_data.csv")
my_deltas <- diff(my_data$bytes)
my_times <- strptime(my_data$datetime, "%Y-%m-%dT%H:%M:%S.%OS")
my_times <- my_times[2:nrow(my_data)]
df <- data.frame(my_times,my_deltas)
plot(df, type='l', xlab="When", ylab="bytes/s")
It's not terribly pretty, though (especially the y axis labels, and the fact that with a longer data file it's all pretty crammed with spikes), and it isn't getting the sub-second precision. That might actually be OK for the larger problem (in the bigger graph you can't tell, whereas with the sample data above you really can), but it's still not quite what I was hoping for... so, input still welcomed.
A possible solution:
# reading the data
df <- read.table(text="datetime,bytes
2014-10-24T10:38:49.453565,52594
2014-10-24T10:38:49.554342,86594
2014-10-24T10:38:49.655055,196754
2014-10-24T10:38:49.755772,272914
2014-10-24T10:38:49.856477,373554
2014-10-24T10:38:49.957182,544914
2014-10-24T10:38:50.057873,952914
2014-10-24T10:38:50.158559,1245314
2014-10-24T10:38:50.259264,1743074", header=TRUE, sep=",")
# formatting & preparing the data
df$bytes <- as.numeric(df$bytes)
df$datetime <- gsub("T"," ",df$datetime)
df$datetime <- strptime(df$datetime, "%Y-%m-%d %H:%M:%OS")
df$sec <- as.numeric(format(df$datetime, "%OS6"))  # fractional seconds within the minute (all samples here fall in the same minute)
# calculating the change in bytes per second
df$difftime <- c(NA,diff(df$sec))
df$diffbytes <- c(NA,diff(df$bytes))
df$bytespersec <- df$diffbytes / df$difftime
# creating the plot
library(ggplot2)
ggplot(df, aes(x=sec, y=bytespersec/1000000)) +
  geom_line() +
  geom_point() +
  labs(title="Change in bytes\n", x="\nWhen", y="MB/s\n") +
  theme_bw()
which gives:
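For the rolling-average and SI-units part of the question, here is a rough additional sketch of my own (not part of the answer above). It reuses df from the code above and assumes the zoo package; working from the full timestamps also keeps the rates correct across minute boundaries:
library(zoo)

options(digits.secs=6)                         # keep fractional seconds visible when printing
tm <- as.POSIXct(df$datetime)                  # full timestamps, sub-second precision preserved
rate <- diff(df$bytes) / as.numeric(diff(tm), units="secs")   # bytes per second
rate.smooth <- rollmean(rate, k=3, fill=NA)    # rolling mean over 3 samples (~0.3 s here)

plot(tm[-1], rate / 1e6, type="l", xlab="When", ylab="MB/s")
lines(tm[-1], rate.smooth / 1e6, col="red")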

Calculate percentage over time on very large data frames

I'm new to R and my problem is I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is an epoch timestamp, and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time        uri              action
1355683900  /some/uri        TCP_HIT
1355683900  /some/other/uri  TCP_HIT
1355683905  /some/other/uri  TCP_MISS
1355683906  /some/uri        TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of the factor levels, TCP_HIT=1 and TCP_MISS=2, since alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN=function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri=levels(u$uri)[l],
                        hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
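As a hedged sketch (mine, not part of the answer), the same hit-ratio calculation in data.table might look roughly like this, assuming the data frame u from above scaled up to the full data:
library(data.table)
dt <- as.data.table(u)

# top 50 URIs by number of observations
uri.counts <- dt[, .N, by=uri][order(-N)]
top.uris <- head(uri.counts, 50)$uri

# % TCP_HIT per URI per ten-second bucket
hits <- dt[uri %in% top.uris,
           .(hit.pct = mean(action == "TCP_HIT")),
           by=.(uri, bucket = time %/% 10)]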
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine through here, but this is what I came up with. It makes a basic line plot. It's plotting the actual values; I haven't done any conversions.
# h holds the per-URI results (URI name + hit-ratio table), like the lapply() output above
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])                           # the URI
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))  # its time/ratio table
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}

Drawing a Square Line Chart using quantmod

Is there a way to get quantmod to draw a square line chart?
I've tried modifying my time series so that each data point is replicated one second before the next data point (hoping this would approximate a square line), but quantmod seems to place the data on the x axis sequentially and evenly spaced, without regard to the actual values of x (i.e. the horizontal space between one point and the next is the same whether the delta-T is 1 second or 1 minute).
I suppose I could convert my time series from a sparse to a dense one (one entry per second instead of one entry per change in value), but this seems very kludgy and should be unnecessary.
I'm constructing my time series thus:
library(quantmod)
myNumericVector <- c(3,7,2,9,4)
myDateTimeStrings <- paste("2011-10-31", c("5:26:00", "5:26:10", "5:26:40", "5:26:50", "5:27:00"))
myXts <- xts(myNumericVector, order.by=as.POSIXct(myDateTimeStrings))
And drawing the chart like so:
chartSeries(myXts, type="line", show.grid="true", theme=chartTheme("black"))
To illustrate what I have vs. what I want, the result looks like the blue line below but I'd like something more like the green:
Also, for the curious, here is the code that replicates points in the time series such that the gap between one value and the next are as small as possible:
mySquareDateTimes <- rep(as.POSIXct(myDateTimeStrings),2)[-1]
mySquareDateTimes[seq(2,8,by=2)] <- mySquareDateTimes[seq(2,8,by=2)] - 1
mySquareXts <- xts(rep(myNumericVector,each=2)[-10], order.by=mySquareDateTimes)
chartSeries(mySquareXts, type="line", show.grid="true", theme=chartTheme("black"))
The results are less than ideal.
You want a step line type, line.type="s":
chartSeries(myXts, line.type="s")
See ?plot, specifically "type" under ... in the Arguments section (you may want "S" instead of "s").
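For comparison, a small sketch of my own (not from the answer): the same series drawn with base graphics, where type="s" also gives a square line and the x axis respects the real time gaps:
plot(index(myXts), drop(coredata(myXts)), type="s",
     xlab="Time", ylab="Value", col="green")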
