How to graph requests per second from a web log file using R

I'm trying to graph requests per second using our Apache log files.
I've massaged the log down to a simple listing of the timestamps, one entry per request.
04:02:28
04:02:28
04:02:28
04:02:29
...
I can't quite figure out how to get R to recognize these as times and aggregate them per second.
Thanks for any help

The lubridate package makes working with dates and time very easy.
Here is an example using the hms() function of lubridate. hms() converts a character string into a data frame with separate columns for hours, minutes and seconds. There are similar functions for mdy (month-day-year), dmy (day-month-year), ms (minutes-seconds)... you get the point.
library(lubridate)
data <- c("04:02:28", "04:02:28", "04:02:28", "04:02:29")
times <- hms(data)
times$second
[1] 28 28 28 29
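At this point, times is a straightforward data frame (in the lubridate version used here; newer releases return an S4 Period object, where second(times) extracts the seconds), and you can isolate any column you wish: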
str(times)
Classes 'period' and 'data.frame': 4 obs. of 6 variables:
$ year : num 0 0 0 0
$ month : num 0 0 0 0
$ day : num 0 0 0 0
$ hour : num 4 4 4 4
$ minute: num 2 2 2 2
$ second: num 28 28 28 29

It seems to me that since you already have timestamps at one-second granularity, all you need to do is a frequency count of the timestamps, plotted in the original time order. Say timeStamps is your vector of timestamps; then you would do:
plot(c( table( timeStamps ) ) )
I'm assuming you want to plot the number of log messages in each one-second interval over a certain period, and that the HMS timestamps all fall within one day. Note that the table function produces a frequency count of its argument and sorts the distinct values, which for zero-padded HH:MM:SS strings is also time order.
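As a minimal sketch of the frequency-count idea above, assuming the timestamps are zero-padded HH:MM:SS strings from a single day (the fifth sample value is made up for illustration): table() only reports seconds that actually occur, so this version first builds the full sequence of seconds and counts against it, which makes idle seconds show up as zeros.

```r
# Sample timestamps; the last entry is invented to create a gap at 04:02:30
timeStamps <- c("04:02:28", "04:02:28", "04:02:28", "04:02:29", "04:02:31")

# Parse as times (R attaches today's date, which is harmless here)
parsed <- as.POSIXct(timeStamps, format = "%H:%M:%S", tz = "UTC")

# Every second from the first to the last request, including idle ones
allSeconds <- seq(min(parsed), max(parsed), by = "1 sec")

# Count against the full set of seconds so missing seconds become 0
counts <- table(factor(timeStamps, levels = format(allSeconds, "%H:%M:%S")))

# One vertical line per second, with a proper time axis
plot(allSeconds, as.vector(counts), type = "h",
     xlab = "Time", ylab = "Requests per second")
```

If the log spans more than one day, the date would have to be kept in the timestamps as well, since HH:MM:SS alone cannot tell the days apart.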

I'm not exactly sure how to do this correctly, but here is one possible way that may help you.
1. Instead of strings, get the data from the database as UNIX timestamps, which denote the number of seconds since 1970-01-01.
2. Use hist(data) to plot a histogram, for example. Or you may use the melt command from the reshape2 package and then dcast to create a data frame where one column is the timestamp and another column holds the number of transactions at that time.
3. Use as.POSIXlt(your.unix.timestamps, origin="1970-01-01", tz="GMT") to convert the timestamps to date-time structures that R understands.
4. Then add labels to the plot using the data from point 3, formatted with format.
Here's an example:
# original data
data.timestamps = c(1297977452, 1297977452, 1297977453, 1297977454, 1297977454, 1297977454, 1297977455, 1297977455)
data.unique.timestamps = unique(data.timestamps)
# get the labels
data.labels = format(as.POSIXlt(data.unique.timestamps, origin="1970-01-01", tz="GMT"), "%H:%M:%S")
# plot the histogram without axes
hist(data.timestamps, axes=FALSE)
# add axes manually
axis(2)
axis(1, at=unique(data.timestamps), labels=data.labels)
--
Hope this helps

Related

How to resolve this specific kind of rbind(deparse.level, ...): replacement has length zero error?

Problem
I am doing a rbind and I am experiencing the following error as shown below:
Error in rbind(deparse.level, ...) : replacement has length zero
How may I resolve this error?
Details
The code is inside a loop. I am going through a data frame, and if a row has a null value for steps, I replace it with something and move on. That was working great until I got to row 289, which as far as I can tell looks just like the rows before it (except that the interval is 0 instead of 2355, but when I changed it to 2356 just to check whether that was the cause, I still got the same error, so this seems to be some sort of formatting issue).
The code is as follows:
rbind(df, row)
I have a data frame, let's call it df for short, that looks something like this.
steps interval dates
283 2.6037736 2330 2012-10-01
284 4.6981132 2335 2012-10-01
285 3.3018868 2340 2012-10-01
286 0.6415094 2345 2012-10-01
287 0.2264151 2350 2012-10-01
288 1.0754717 2355 2012-10-01
The row is as follows:
steps interval dates
289 0 0 2012-10-02
Here are some details about df and row as follows:
'data.frame': 288 obs. of 3 variables:
$ steps :'data.frame': 288 obs. of 1 variable:
..$ steps: num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ dates : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
and
'data.frame': 1 obs. of 3 variables:
$ steps : int 0
$ interval: int 0
$ dates : Date, format: "2012-10-02"
Update
I think the problem is that my steps is a dataframe within a dataframe. It should be just a column of numeric values. Here is some code that specifies the intended definition for df. Please see:
df <- data.frame(steps = numeric, date = date, interval = integer)
> str(df)
'data.frame': 0 obs. of 3 variables:
$ steps :function (...)
$ date :function (...)
$ interval:function (...)
The following cannot be easily duplicated, I apologize in advance, but I can describe what the code is doing. I have a dataset that has NA in steps, and I use row$steps <- subset(avgStepsByInterval, interval == row$interval, steps) to replace that NA with a number. For example, the idea is that:
steps interval dates
289 NA 0 2012-10-02
Becomes
steps interval dates
289 0 0 2012-10-02
The loop I am using is as follows:
for(i in 1:size)
{
  index <- i
  row <- data[i,]
  isNa <- is.na(row$steps)
  if(isNa)
  {
    row$steps <- subset(avgStepsByInterval, interval == row$interval, steps)
  }
  df <- rbind(df, row)
}
Don't worry about trying to duplicate the above code, I don't think that is important, but I am hoping it can help clue folks into something obviously wrong that I might be doing. If not, let me know and I'll see if I can create some replicable code (which will take some time). It might be blowing up because steps is somehow a dataframe within a dataframe in df. Which is not right. Our row definition is correct.
Attempts
I have tried the following approaches to no avail:
Format row with a do.call method
Cast row to data.frame (which it already is)
Manipulating values in steps and interval in row.
I'm thinking there is obviously something wrong with my formatting, but I can't fix a problem if I don't know what is wrong. The above is all I was able to gather from debugging the code. I'm frankly at a loss and would appreciate it if someone with more experience could tell me what is going wrong and how best to address it. Any assistance, questions, or comments would be most welcome.
Simple Answer
Do the following:
row$steps <- as.numeric(subset(avgStepsByInterval, interval == row$interval, steps))
Explanation
The problem was that steps was a data frame object. Said another way, we were trying to bind a row that had the desired data frame definition to a data frame df that didn't have the proper definition.
This is because
row$steps <- subset(avgStepsByInterval, interval == row$interval, steps) returns a data frame object, even when it selects a single cell. So we simply need to rewrite it as follows:
row$steps <- as.numeric(subset(avgStepsByInterval, interval == row$interval, steps))
This means the definition for df and row will match.
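To make the type mismatch concrete, here is a small sketch with made-up data (the values in avgStepsByInterval are illustrative, and activity is a hypothetical stand-in for the question's data). It shows that subset() with a column selection returns a data frame while $ extracts a plain numeric vector, and it fills the NA values without a loop by looking up each interval with match().

```r
# Illustrative lookup table of average steps per interval (values invented)
avgStepsByInterval <- data.frame(interval = c(0L, 5L, 10L),
                                 steps = c(1.7, 0.3, 0.1))

# Hypothetical stand-in for the question's data, with NA in steps
activity <- data.frame(steps = c(NA, 4, NA),
                       interval = c(0L, 5L, 10L),
                       dates = as.Date("2012-10-02"))

# Selecting a column through subset() keeps the data frame wrapper
lookedUp <- subset(avgStepsByInterval, interval == 0L, steps)
class(lookedUp)  # a data frame, not a number

# Extracting with $ gives a plain numeric vector instead
value <- subset(avgStepsByInterval, interval == 0L)$steps

# Vectorized fill: find each missing row's interval in the lookup table
isMissing <- is.na(activity$steps)
activity$steps[isMissing] <-
  avgStepsByInterval$steps[match(activity$interval[isMissing],
                                 avgStepsByInterval$interval)]
str(activity)
```

Filling the column in one step like this also avoids growing df with rbind() on every iteration of the loop.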
Summary
If you are searching on Stack Overflow and seeing this error, be sure that the objects you are trying to rbind have a matching and intended structure.

Extracting Categorical Variables from Data Source in Tableau

I currently have a table that looks like this:
Date Variable Value
1995-10-01 X 50
1995-10-01 Y 60
1995-08-03 X 70
1995-08-03 Y 90
And want to reshape it so that it looks like this:
Date X Y
1995-10-01 50 60
1995-08-03 70 90
This is easily doable in R using the cast function from the reshape package with the command df <- cast(df, ... ~ variable). I have two questions:
1) Can this form of dataset modification be done using a calculated field with an R script?
2) Is there a native way for such modification to be done in Tableau?
Any help would be much appreciated.
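For reference, the reshape described above can also be sketched in base R without the reshape package. This is only an illustration of the intended long-to-wide result, using the four rows from the question; the Value. prefix that reshape() adds to the new columns is stripped afterwards.

```r
# The long table from the question
df <- data.frame(
  Date = as.Date(c("1995-10-01", "1995-10-01", "1995-08-03", "1995-08-03")),
  Variable = c("X", "Y", "X", "Y"),
  Value = c(50, 60, 70, 90)
)

# One row per Date, one column per level of Variable
wide <- reshape(df, idvar = "Date", timevar = "Variable", direction = "wide")

# reshape() names the new columns Value.X and Value.Y; drop the prefix
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```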
This is your data as it would be set up by default:
Step 1 Image
All you need to do is move the Variable field up to the Columns section:
Step 2 Image

Compose different factors in dataframe

I have a dataframe that looks like this:
Sensor NewValue NewDate
1 iphone/NuhKZFrx/noise 1.00000 2015-10-20 23:26:14
2 iphone/NuhKZFrx/noiseS 58.63411 2015-10-20 23:26:14
3 iphone/wlhAlrPQ/noise 0.00000 2015-10-21 08:03:28
4 iphone/wlhAlrPQ/noiseS 65.26167 2015-10-21 08:03:28
[...]
with the following datatypes:
'data.frame': 405 obs. of 3 variables:
$ Sensor : Factor w/ 28 levels "iphone/5mZU0HWz/noise",..: 11 12 23 24 9 10 23 24 21 22 ...
$ NewValue: num 1 58.6 0 65.3 3 ...
$ NewDate : POSIXct, format: "2015-10-20 23:26:13" "2015-10-20 23:26:14" "2015-10-21 08:03:28" "2015-10-21 08:03:28" .
The Sensor field is set up like this: <model>/<uniqueID>/<type>. And I want to find out if there is a correlation between noise and noiseS for each uniqueID at a given time.
For a single uniqueID it works fine since there are only two factors. I tried to use xtabs(NewValue~NewDate+Sensor, data=dataNoises) but that gives me zeros since there aren't values for every ID at any time ...
What could I do to somehow compose the factors so that I only have one factor for noise and one for noiseS? Or is there an easier way to solve this problem?
What I want to do is the following:
Date noise noiseS
2015-10-20 23:26:14 1 58.63
2015-10-20 23:29:10 4 78.33
And then compute the Pearson correlation coefficient between noise and noiseS.
If I understand your question correctly, you just want a 2-level factor that distinguishes between noise and noiseS?
That can easily be achieved by defining a new column in the data frame and populating it with the output of grepl(). A minimal working example:
a <- "blahblahblahblahnoise"
aa <- "blahblahblahblahnoiseS"
b <- "noiseS"
type <- vector()
type[1] <- grepl(b, a)
type[2] <- grepl(b, aa)
type <- as.factor(type)
This two-level factor would let you build a simple model of the means for noise (type[i]==FALSE) and noiseS (type[i]==TRUE), but would not let you evaluate the CORRELATION between the types for a given UniqueID and time. One way to do this would be to create separate columns for data with type==FALSE and type==TRUE, where rows correspond to a specific UniqueID+time combination. In this case, you would need to think carefully about what you want to learn and when you assume data to be independent. For example, if you want to learn whether noise and noiseS are correlated across time for a given uniqueID, then you would need to make a separate factor for uniqueID and include it in your model as an effect (possibly a random effect, depending on your purposes and your data).
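The "separate columns" idea above could be sketched as follows. This is one possible approach under stated assumptions, not the only way: the sample rows reuse values shown in the question, the uniqueID attached to the 23:29:10 reading is invented, and a noise reading is paired with a noiseS reading only when both carry exactly the same uniqueID and timestamp (readings that differ by a second would need rounding first).

```r
# Sample data in the question's layout (uniqueID of the last pair is assumed)
dataNoises <- data.frame(
  Sensor = factor(c("iphone/NuhKZFrx/noise", "iphone/NuhKZFrx/noiseS",
                    "iphone/wlhAlrPQ/noise", "iphone/wlhAlrPQ/noiseS",
                    "iphone/NuhKZFrx/noise", "iphone/NuhKZFrx/noiseS")),
  NewValue = c(1, 58.63411, 0, 65.26167, 4, 78.33),
  NewDate = as.POSIXct(c("2015-10-20 23:26:14", "2015-10-20 23:26:14",
                         "2015-10-21 08:03:28", "2015-10-21 08:03:28",
                         "2015-10-20 23:29:10", "2015-10-20 23:29:10"),
                       tz = "UTC")
)

# Split <model>/<uniqueID>/<type> into its parts
parts <- do.call(rbind, strsplit(as.character(dataNoises$Sensor), "/", fixed = TRUE))
dataNoises$uniqueID <- parts[, 2]
dataNoises$type <- parts[, 3]

# Put noise and noiseS side by side for each uniqueID and timestamp
keep <- c("uniqueID", "NewDate", "NewValue")
noise <- dataNoises[dataNoises$type == "noise", keep]
noiseS <- dataNoises[dataNoises$type == "noiseS", keep]
paired <- merge(noise, noiseS, by = c("uniqueID", "NewDate"),
                suffixes = c(".noise", ".noiseS"))

# Pearson correlation across all paired readings
r <- cor(paired$NewValue.noise, paired$NewValue.noiseS, method = "pearson")
print(paired)
print(r)
```

To get one coefficient per uniqueID instead, the same cor() call could be applied to each piece of split(paired, paired$uniqueID), provided every ID has at least two paired readings.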

How do I plot data by splitting it into 5 second intervals?

I'm completely new to R, and I have been tasked with writing a script that plots the protocols used by a simulated network of users as histograms by a) identifying the protocols they use and b) splitting everything into 5-second intervals and generating a graph for each protocol used.
Currently we have
data$bucket <- cut(as.numeric(format(data$DateTime, "%H%M")),
c(0,600, 2000, 2359),
labels=c("00:00-06:00", "06:00-20:00", "20:00-23:59")) #Split date into dates that are needed to be
to split the codes into 3-zones for another function.
What should the code be changed to for 5 second intervals?
Sorry if the question isn't very clear, and thank you
The histogram function hist() can aggregate and/or plot all by itself, so you really don't need cut().
Let's create 1,000 random time stamps across one hour:
set.seed(1)
foo <- as.POSIXct("2014-12-17 00:00:00")+runif(1000)*60*60
(Look at ?POSIXct on how R treats POSIX time objects. In particular, note that "+" assumes you want to add seconds, which is why I am multiplying by 60^2.)
Next, define the breakpoints in 5 second intervals:
breaks <- seq(as.POSIXct("2014-12-17 00:00:00"),
as.POSIXct("2014-12-17 01:00:00"),by="5 sec")
(This time, look at ?seq.POSIXt.)
Now we can plot the histogram. Note how we assign the output of hist() to an object bar:
bar <- hist(foo,breaks)
(If you don't want the plot, but only the bucket counts, use plot=FALSE.)
?hist tells you that hist() (invisibly) returns the counts per bucket. We can look at this by accessing the counts component of bar:
bar$counts
[1] 1 2 0 1 0 1 1 2 3 3 0 ...
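The question also asks for one graph per protocol. As a rough sketch of that part, here is an alternative that uses cut() and table() so the counts for all protocols land in a single table. The data frame, its Protocol column, and the protocol names are assumptions made up for the example; only DateTime comes from the question.

```r
set.seed(1)
start <- as.POSIXct("2014-12-17 00:00:00", tz = "UTC")

# Hypothetical capture: 200 events over one minute, each with a protocol
data <- data.frame(
  DateTime = start + runif(200) * 60,
  Protocol = sample(c("TCP", "UDP", "ICMP"), 200, replace = TRUE)
)

# Breakpoints every 5 seconds across the minute
breaks <- seq(start, start + 60, by = "5 sec")

# Rows are protocols, columns are 5-second buckets
counts <- table(data$Protocol, cut(data$DateTime, breaks = breaks))

# One bar chart per protocol, labelled with each bucket's start time
bucketLabels <- format(breaks[-length(breaks)], "%H:%M:%S")
for (p in rownames(counts)) {
  barplot(counts[p, ], names.arg = bucketLabels, las = 2,
          main = p, ylab = "Count per 5 seconds")
}
```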

Calculate sum of counts per min from data frame in R

I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in a data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
I also want to convert the results data frame into JSON. I used the rjson package, which converted the entire data frame into a single JSON element. When I inserted it into MongoDB, there was only one _id for all 43000 records. What did I do wrong?
You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXlt(results$dates))
This converts your data frame to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc.) in xts that apply functions to different time frames, but there isn't one for minutes. Fortunately the code for those functions is super simple, so just run
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
Sorry, I don't know the answer to your other question.
Caveat: this is untested, but here is my solution for getting the counts per minute. Maybe someone will chime in on the JSON part; I'm not familiar with that.
Here's my example time series and count:
now<-Sys.time() # starting point for the example series
tseq<-seq(now,length.out=130, by="sec")
count<-rep(1, 130)
we find the index of where our minutes switch via the following
mins<-c(0,diff(floor(cumsum(c(0,diff(tseq)))/60)))
indxs<-which(mins%in%1)
Let me break that down (as there are many things nested in there):
First, we diff over the time sequence, then add a 0 on the front because we lose an observation with diff.
Second, we take the cumulative sum of the diff-ed vector, giving us the elapsed seconds in each spot (this could probably also be done by a simple format call over the vector of times).
Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
Fourth, floor it so we get integers.
Fifth, diff that vector so we get 0's except for 1's where the minute switches.
Sixth, add a 0 to that vector since we lose an observation with the diff.
Finally, get the indices of the 1's with the which call.
Then we find the starts and ends of our minutes:
startpoints<-indxs
endpoints<-c(indxs[2:length(indxs)], length(mins))
then we simply sum over the corresponding subset
mapply(function(start, end) sum(count[start:end]), start=startpoints, end=endpoints)
#[1] 61 10
We get 61 for the first point because we include the 0th and 60th second for the first subset
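Since the approach above is marked untested by its author, a simpler base R cross-check may be useful. The sketch below assumes a results data frame shaped like the one in the question (a factor column dates and a numeric column count); the five sample rows are invented. It truncates each timestamp to the minute and sums count within each minute.

```r
# Invented sample in the question's layout: factor dates, numeric count
results <- data.frame(
  dates = factor(c("2011-11-15 02:00:01", "2011-11-15 02:00:30",
                   "2011-11-15 02:01:05", "2011-11-15 02:01:59",
                   "2011-11-15 02:03:10")),
  count = c(1, 2, 1, 1, 3)
)

# Factors must go through as.character() before being parsed as times
stamp <- as.POSIXct(as.character(results$dates),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Truncate to the minute by formatting without the seconds
results$minute <- format(stamp, "%Y-%m-%d %H:%M")

# Total count per minute
perMinute <- aggregate(count ~ minute, data = results, FUN = sum)
print(perMinute)
```

Minutes with no records (02:02 here) do not appear in the result; they would have to be added separately if zero rows are needed.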

Resources