I'm completely new to R, and I have been tasked with writing a script that plots the protocols used by a simulated network of users as histograms by a) identifying the protocols they use and b) splitting everything into 5-second intervals and generating a graph for each protocol used.
Currently we have
data$bucket <- cut(as.numeric(format(data$DateTime, "%H%M")),
c(0,600, 2000, 2359),
labels=c("00:00-06:00", "06:00-20:00", "20:00-23:59")) #Split date into dates that are needed to be
to split the codes into 3-zones for another function.
What should the code be changed to for 5 second intervals?
Sorry if the question isn't very clear, and thank you
The histogram function hist() can aggregate and/or plot all by itself, so you really don't need cut().
Let's create 1,000 random time stamps across one hour:
set.seed(1)
foo <- as.POSIXct("2014-12-17 00:00:00")+runif(1000)*60*60
(Look at ?POSIXct on how R treats POSIX time objects. In particular, note that "+" assumes you want to add seconds, which is why I am multiplying by 60^2.)
Next, define the breakpoints in 5 second intervals:
breaks <- seq(as.POSIXct("2014-12-17 00:00:00"),
as.POSIXct("2014-12-17 01:00:00"),by="5 sec")
(This time, look at ?seq.POSIXt.)
Now we can plot the histogram. Note how we assign the output of hist() to an object bar:
bar <- hist(foo,breaks)
(If you don't want the plot, but only the bucket counts, use plot=FALSE.)
?hist tells you that hist() (invisibly) returns an object that includes the counts per bucket. We can look at these by accessing the counts component of bar:
bar$counts
[1] 1 2 0 1 0 1 1 2 3 3 0 ...
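The question also asks for one result per protocol. A minimal sketch of how the same hist() call could be applied per protocol, assuming the data has a POSIXct column and a protocol column (the names DateTime and Protocol and the protocol values below are made up for illustration):

```r
# Assumed example data: 1,000 time stamps in one hour, each tagged with a
# hypothetical protocol name. Real column names and values may differ.
set.seed(1)
start <- as.POSIXct("2014-12-17 00:00:00", tz = "UTC")
data <- data.frame(
  DateTime = start + runif(1000) * 60 * 60,
  Protocol = sample(c("TCP", "UDP", "ICMP"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Breakpoints every 5 seconds across the hour
breaks <- seq(start, start + 60 * 60, by = "5 sec")

# Split the time stamps by protocol, then count per 5-second bucket
counts_by_protocol <- lapply(
  split(data$DateTime, data$Protocol),
  function(times) hist(times, breaks, plot = FALSE)$counts
)
```

Each element of counts_by_protocol is a vector of bucket counts for one protocol. Dropping plot = FALSE should draw one histogram per protocol instead.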
I have a dataframe of intervals stored as strings:
interval
1 '(-inf-57142.8]'
2 '(57142.8-94002.6]'
3 '(94002.6-130862.4]'
4 '(130862.4-167722.2]'
5 '(167722.2-204582]'
6 '(204582-241441.8]'
7 '(241441.8-278301.6]'
8 '(278301.6-315161.4]'
9 '(315161.4-352021.2]'
10 '(352021.2-inf)'
I want to map any given number to interval "bins", using the intervals stored in the dataframe above and the index as the bin number, i.e.:
- 57142.8 would map to 1
- 57142.9 would map to 2
- 130862.5 would map to 4
- 352021.2 would map to 9
- 352021.3 would map to 10
- etc.
The intervals are generated dynamically using a discretize function.
Are there any simple R tools for helping to achieve this?
Or anything that deals with intervals stored as strings?
Thanks In Advance
I resolved this using gsub and findInterval; it may be useful to others.
Get the boundaries from the strings described in the original question above:
boundaries <- as.numeric(gsub("\\(-inf-|\\(\\d+[.]*\\d+[-]+|\\'|\\]", "", intervals$interval)[1:9])
Get the interval position:
findInterval(value_to_test,boundaries[1:9],rightmost.closed = FALSE,all.inside = TRUE)
The endpoints '(-inf-57142.8]' and '(352021.2-inf)' are dealt with separately as special cases. If value_to_test lands on a boundary, its interval position is also a special case and is adjusted by -1.
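As an alternative sketch (not the approach above), the special cases might be avoided by letting findInterval treat the intervals as left-open, which matches the "(a-b]" notation. This assumes R 3.3.0 or later for the left.open argument and that no boundary is negative, since the parsing below keys on the last "-" in each string; the helper name bin_of is hypothetical.

```r
# Interval strings as in the question
intervals <- data.frame(
  interval = c("(-inf-57142.8]", "(57142.8-94002.6]", "(94002.6-130862.4]",
               "(130862.4-167722.2]", "(167722.2-204582]", "(204582-241441.8]",
               "(241441.8-278301.6]", "(278301.6-315161.4]",
               "(315161.4-352021.2]", "(352021.2-inf)"),
  stringsAsFactors = FALSE
)

# Keep only the upper bound: strip quotes and closing brackets, then drop
# everything up to the last "-" (assumes non-negative boundaries)
upper <- gsub("'|\\]|\\)", "", intervals$interval)
upper <- sub("^.*-", "", upper)

# The last interval is open-ended, so its "inf" upper bound is not needed
boundaries <- as.numeric(upper[-length(upper)])

# Hypothetical helper: left.open = TRUE puts x into (b[i], b[i + 1]],
# so adding 1 gives the row index of the matching interval
bin_of <- function(x) findInterval(x, boundaries, left.open = TRUE) + 1L

bin_of(c(57142.8, 57142.9, 130862.5, 352021.2, 352021.3))
```

With the examples from the question this should give bins 1, 2, 4, 9 and 10, including the values that sit exactly on a boundary.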
I have 3 tiers of product for which I'm creating a hierarchical forecast using the gts function from the hts package in R.
My tiers are:
PL1: A3
PL2: AT
PL3: ATA,ATB,ATD,ATH,ATI,ATJ
In reality I have many more, but I limited the structure to this subset as I'm just learning this package. Each PL3 has 40 time observations.
Following this tutorial from Hyndsight, I was able to get something working. However I don't think I'm specifying the character argument correctly.
myts=ts(matrix(data.agg$SalesUnits,ncol=6,nrow=40))
blnames <- unique(paste(data.agg$Group.2, # PL2
data.agg$Group.3, # PL3
data.agg$Group.4, # PL4
sep=""))
colnames(myts)=blnames
gy=gts(myts,characters=c(2,2,3))
fc=forecast(gy)
According to the documentation, specifying a numeric vector for characters implies a non-hierarchy?
Because none of these is hierarchical, we could specify characters = list(3, 1, 1), or as a simple numeric vector: characters = c(3, 1, 1). This implies its non-hierarchical structure and its characters segments
I can't figure out how I'm supposed to specify the correct character argument. When I try to use lists, the function fails. While my code works as written, I don't think it's correct because the output says there are only 2 levels:
Grouped Time Series
2 Levels
Number of groups at each level: 1 6
Total number of series: 7
Number of observations in each historical series: 40
Number of forecasts per series: 10
My mistake. I was using gts where I should have been using hts. That resolved my issue.
Just picking up R and I have the following question:
Say I have the following data.frame:
v1 v2 v3
3 16 a
44 457 d
5 23 d
34 122 c
12 222 a
...and so on
I would like to create a histogram or barchart for this in R, but instead of having the x-axis be one of the numeric values, I would like a count by v3. (2 a, 1 c, 2 d...etc.)
If I do hist(dataFrame$v3), I get the error that 'x' must be numeric.
Why can't it count the instances of each different string like it can for the other columns?
What would be the simplest code for this?
OK. First of all, you should know exactly what a histogram is. It is not a plot of counts. It is a visualization for continuous variables that estimates the underlying probability density function. So do not try to use hist on categorical data. (That's why hist tells you that the value you pass must be numeric.)
If you just want counts of discrete values, that's just a basic bar plot. You can calculate counts of values in R for discrete data using table and then plot that with the basic barplot() command.
barplot(table(dataFrame$v3))
If you want to require a minimum number of observations, try
tbl<-table(dataFrame$v3)
atleast <- function(i) {function(x) x>=i}
barplot(Filter(atleast(10), tbl))
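For reference, here is a small runnable sketch using the five rows shown in the question. The threshold of 2 is only chosen so that the tiny example keeps something, and the name frequent is made up for illustration.

```r
# The five example rows from the question
dataFrame <- data.frame(
  v1 = c(3, 44, 5, 34, 12),
  v2 = c(16, 457, 23, 122, 222),
  v3 = c("a", "d", "d", "c", "a"),
  stringsAsFactors = FALSE
)

# Count how often each value of v3 occurs, then plot the counts
tbl <- table(dataFrame$v3)
barplot(tbl)

# Keep only the values seen at least twice; plain logical indexing is an
# alternative to the Filter() call
frequent <- tbl[tbl >= 2]
barplot(frequent)
```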
I am doing some very basic plots for exploratory analyses, and have successfully created a for loop to do most of the work for me. I have 12 years of data, 5 different categories (Cat1-Cat5), and 3 different variables (say X, Y, Z). The loops that I have written so far give me the histogram of each of the variables by year (so X in year 1 - X in year 12, for example).
I partitioned my data in 2 ways - by category, and by year as follows:
Cat.1<-subset(data,Category==1) #Similar code for categories 2-5
categories<-list(Cat.1,Cat.2,Cat.3,Cat.4,Cat.5)
Year.1<-subset(data,Year==1)
years<-list(Year.1,Year.2, ... , Year.12)
Now, with the data partitioned this way I have set up loops:
for (i in seq_along(categories))
{
store.data<-categories[[i]]
hist(store.data$X)
}
What I would like to do is have an external loop that deals with the 3 variables:
variables<-list(X,Y,Z)
for (j in seq_along(variables))
{
#insert above for loop here
}
The desired output would be the output of all of the histograms for each year and each variable. I realize that I can just add in lines to the original for loop:
hist(store.data$Y)
hist(store.data$Z)
But, eventually I will be running analyses (ANOVA, t-test, etc) on the data and I plan on having the same setup. By having the external loop that deals with which variable the internal loop works on, I should have much less code to write in theory.
This short solution gives you the histograms, but doesn't name them to tell you which histogram relates to which category. The histograms will be named by variable, and the order in which the histograms are generated will correspond to the numerical order of your categories. It doesn't look like you're labeling your histograms in the code you posted, so this may not be a problem for you.
category = rep(1:5,20)
X = rnorm(100)
Y = rexp(100)
Z = rgamma(100,5)
require(data.table)
DT = data.table(category, X, Y, Z)
DT[,lapply(.SD, hist), by=category]
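If labeled plots and the nested-loop structure from the question are preferred, a base R sketch could look like the following. It assumes the variables are addressed by column name (as strings) and uses made-up data; the counter n_plotted is only there to show how many histograms were drawn.

```r
# Made-up data with the same shape as in the question
set.seed(42)
data <- data.frame(
  Category = rep(1:5, 20),
  X = rnorm(100),
  Y = rexp(100),
  Z = rgamma(100, 5)
)

variables <- c("X", "Y", "Z")              # column names, not the columns
categories <- split(data, data$Category)   # one data frame per category

n_plotted <- 0
for (v in variables) {                     # outer loop: which variable
  for (k in names(categories)) {           # inner loop: which category
    hist(categories[[k]][[v]],
         main = paste("Category", k, "-", v),
         xlab = v)
    n_plotted <- n_plotted + 1
  }
}
```

The same pair of loops could later wrap a t-test or ANOVA call instead of hist(), since the inner body only needs the current subset and the variable name.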
Given a data frame with a column that contains strings, I would like to plot the frequency of strings that match a certain pattern. For example:
strings <- c("abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
df <- as.data.frame(strings)
df
strings
1 abcd
2 defd
3 hfjfjcd
4 kgjgcdjrye
5 yryriiir
6 twtettecd
I would like to plot the frequency of the strings that contain the pattern "cd".
Anyone with a quick solution?
I presume from your question that you meant to have some entries that appear more than once, so I've added one duplicate string:
x <- c("abcd","abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
To find only those strings that contain a specific pattern, use grep or grepl:
y <- x[grepl("cd", x)]
To get a table of frequencies, you can use table:
table(y)
y
abcd hfjfjcd kgjgcdjrye twtettecd
2 1 1 1
And you can plot it using plot or barplot as follows:
barplot(table(y))
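The same steps applied directly to a data frame column might look like this. It is a sketch that assumes the column is called strings, as in the question, and treats "cd" as a literal substring via fixed = TRUE; the names has_cd and freq are made up.

```r
# Example data frame with one duplicate, as in the vector above
df <- data.frame(
  strings = c("abcd", "abcd", "defd", "hfjfjcd", "kgjgcdjrye",
              "yryriiir", "twtettecd"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose string contains the literal substring "cd"
has_cd <- grepl("cd", df$strings, fixed = TRUE)

# Frequency of each matching string, then a bar plot of those counts
freq <- table(df$strings[has_cd])
barplot(freq)
```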
Others have already mentioned grepl. Here is an implementation with plot.density, using grepl to get a 0/1 indicator of the matches:
plot( density(0+grepl("cd", strings)) )
If you don't like the extension of the density plot beyond the range there are other methods in the 'logspline' package that allow one to get sharp border at range extremes. Searching RSiteSearch
Check the "kernlab" package.
You can define a kernel (pattern), which could be any kind of string, and count the matches later on.