Hmisc - cut2 - create factors from times - r

I'm trying to use the cut2() function from the Hmisc package to create a factor based on time periods.
Here's some code:
library(Hmisc)
i.time <- as.POSIXct("2013-07-16 13:55:14 CEST")
f.time <- i.time+as.difftime(1, units="hours")
data.points <- seq(from=i.time, to=f.time, by="1 sec")
cut.points <- seq(from=i.time, to=f.time, by="60 sec")
intervals <- cut2(x=data.points, cuts=cut.points, minmax=TRUE)
I expected intervals to be created such that each point in data.point were placed in a interval of time.
But there are some NA values in the end:
> tail(intervals, 1)
[1] <NA>
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ... [2013-07-16 14:54:14,2013-07-16 14:55:14]
I was expecting that the option minmax=TRUE would make sure that hte cuts included all the values in data.points.
Can anyone clarify what's going on here? How can I use the cut2 function to generate a factor that includes all the values in the data?

The reason I use cut2 in preference to cut is that its default for "right" is the way I expect it to work (left-closed intervals). Looking at the code I see that when 'cuts' is present in the argument list, then the cut function is used with a shifted set of cuts that has the effect of making the intervals left-closed, and then the code relabels the factor to change the "("'s to ["'s, but then does not use include.lowest = TRUE. This has the effect of turning the last value into <NA>. Frankly, I see this as a bug. After looking at this more closely I see that cut2's help page does not promise to handle either Date or date-time objects, so "bug" is too strong. It completely fails with Date objects and it appears to be only an accident that is is almost correct with POSIXct objects. (This implementation is somewhat surprising to me in that I always assumed that it was just using cut( ... , right=FALSE, include.lowest=TRUE).)
You can alter the code and one idea I had was to extend the range back to the right end point in the original data by changing this line:
r <- range(x, na.rm = TRUE)
To this line:
r <- range(c(x,max(x)+min(diff(x.unique))/2), na.rm = TRUE)
It's not exactly the result I expected since you get a new category at the right end because the penultimate interval was still open on the right.
intervals <- cut3(x=data.points, cuts=cut.points, minmax=TRUE)
> tail(intervals, 1)
[1] 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14) 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
A different idea gives a more satisfactory result. Change only this line:
y <- cut(x, k2)
To to this:
y <- cut(x, k2, include.lowest=TRUE)
Giving the expected right and left closed interval and no NA:
tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14] [2013-07-16 14:54:14,2013-07-16 14:55:14]
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
Note: include.lowest=TRUE with right=FALSE, will actually become include.highest. And I'm scratching my head about why I am actually getting the desired behavior in this case when I did not also need to do something with the 'right' parameter. I sent Frank Harrell a message, and he is willing to consider revisions to the code to handle other cases. I'm working on that.
Why this is an issue: The labeling for cut.POSIXt and cut.Date differs from the labeling of cut.numeric (actually cut.default) results. The former two label strategy is to just reprot the beginnings of the intervals whereas the labeling from cut.numeric includes "[" and ")" and the ends of the intervals. Compare the output from these:
levels( cut(0+1:100, 3) )
levels( cut(Sys.time()+1:100, 3) )
levels( cut(Sys.Date()+1:100, 3) )

from ??cut2:
minmax :
if cuts is specified but min(x) < min(cuts) or max(x) > max(cuts),
augments cuts to include min and max x
Checking your arguments:
x=data.points
cuts=cut.points
r <- range(x, na.rm = TRUE)
(r[1] < min(cuts) | (r[2] > max(cuts)))
FALSE ## no need to include mean and max
So here setting minmax don't change the result. But here a result using cut by setting include.lowest=TRUE) :
res <- cut(x=data.points, breaks=cut.points, include.lowest=TRUE)
table(is.na(res))

Related

Difference between acf and cor correlations

I try to replicate the results from acf() with manually using cor(), and I get differences which I cannot understand.
Consider the following toy example:
# two slightly different lines, with a bit of noise
# both are casted into time series
foo <- ts(seq(1:20) + rnorm(20))
bar <- ts((1.5*seq(1:20)-1) + rnorm(20))
foobar=cbind(foo, bar)
# lagged (auto)correlations, up to a lag of 10
pippo <- acf(foobar, 10)
# look at the results; I see the two 'slices' of the resulting
# datacube: [ , , 1] and [ , , 2]. So far, so good.
pippo$acf
# 1st slice: [1,1,1] = 1 (obviously), and
# [2,1,1] = 0.82796109 (autocorrelation of foo with itself, lagged 1)
# but if I do:
cor(foobar[1:19, 1], foobar[2:20, 1])
> [1] 0.9669717
obviously, 0.82796109 is quite different from 0.9669717,
and I cannot understand why. The same happens with cross-correlation
between the two time series, one lagged with respect to the other.
Even taking into account the
(n-1/n) unbiased/biased factor that sometimes creeps in.
What am I missing here?
Many thanks in advance to whomever wants to help. And my apologies
in advance if my question should come from a lack of understanding.
Luca
Environment: R 4.1.12, running on Win7

Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working to translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays and only the last is saved by tarr after it.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things which are happening in the matlab script that are not in the R:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda].
You will need to use a different method as Matlab and R handle this type of thing differently.
One glaring thing that stands out to me, is that you seem to be using R: tarr[i] and M: tarr(i, :) as equals where these are very different, as what I think you are trying to achieve is all the columns in a given row i so in R that would look like tarr[i, ]
Now the use of min is also different as R: min() will return the minimum of the matrix (just one number) and M: min() returns the minimum value of each column. So for this in R you can use the Rfast package Rfast::colMins.
The stairs part is something I am not familiar with much but something like ggplot2::qplot(..., geom = "step") may work.
Now I have tried to create something that works in R but am not sure really what the required output is. But nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick try to achieve something!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see I have sorted the min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem since from the Matlab code 0:size(tarr2, 1)-1 gives the number of rows less 1 but I cant figure out why if grabbing colMins (and there are 40 columns) we would create around 20 steps. But I might be completely misunderstanding! Also I have change T to T0 since in R T exists as TRUE and is not good to overwrite!
Hope this helps!
I downloaded GNU Octave today to actually run the MatLab code. After looking at the code running, I made a few tweeks to the great answer by #Croote
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweeks -- seems to be closer to the original
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
df1 <- cbind(min_for_each_col, 1:length(min_for_each_col)) %>% as.data.frame
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
I'm not too familiar with renewal processes or matlab so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesnt seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints an aleatory number of arrays ".
The eight line seems to overwrite the assignment of tarr with the computation on the right. Thus it takes the single value of tarr[i] and subtracts from it the natural log of runif(proc) divided by 4 (lambda) -- which gives 40 different values. These fourty different values from the last time through the loop are stored in tarr.
If you want to store all fourty values from each time through the loop, I'd suggest storing it in say a matrix or dataframe instead. If that's what you want to do, here's an example of storing it in a matrix:
for(i in 1:nrow(yourMatrix)){
//computations
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
the eleveth line closes the loop statement.
Then after the loop tarr2 is assigned 1/tarr. Again this will be 40 values from the last time through the loop (line 8)
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
Hope this helped! Let me know if there's more I can do to help.

Difference between mean and manual calculation in R?

I am writing a simple function in R to calculate percentage differences between two input numbers.
pdiff <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/((a+b)/2))*100
}
pdiffa <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/mean(a,b))*100
}
When you run it with a random value of a and b, the functions give different results
x <- 5
y <- 10
pdiff(x,y) #gives 66%
pdiffa(x,y) #gives 100%
When I go into the code, apparently the values of (x+y)/2 = 7.5 and mean(x,y) = 5 differ......Am I missing something really obvious and stupid here?
This is due to a nasty "gotcha" in the mean() function (not listed in the list of R traps, but probably should be): you want mean(c(a,b)), not mean(a,b). From ?mean:
mean(x, ...)
[snip snip snip]
... further arguments passed to or from other methods.
So what happens if you call mean(5,10)? mean calls the mean.default method, which has trim as its second argument:
trim the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
The last phrase "values of trim outside that range are taken as the nearest endpoint" means that values of trim larger than 0.5 are set to 0.5, which means that we're asking mean to throw away 50% of the data on either end of the data set, which means that all that's left is the median. Debugging our way through mean.default, we see that we indeed end up at this code ...
if (trim >= 0.5)
return(stats::median(x, na.rm = FALSE))
So mean(c(x,<value_greater_than_0.5>)) returns the median of c(5), which is just 5 ...
Try mean(5, 10) by itself.
mean(5, 10)
[1] 5
Now try mean(c(5, 10)).
mean(c(5, 10))
[1] 7.5
mean takes a vector as its first argument.

Finding a value in an interval

Sorry if this is a basic question. Have been trying to figure this out but not being able to.
I have a vector of values called sym.
> head(sym)
[,1]
val 3.652166e-05
val -2.094026e-05
val 4.583950e-05
val 6.570184e-06
val -1.431486e-05
val -5.339604e-06
These I put in intervals by using factor on cut function on sym.
factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))
[1] (2.82e-05,5.28e-05] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06]
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (7.74e-05,0.000102]
Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]
So clearly, four intervals were created in factorx. Now I have a new value tmp=3.7e-0.6.
My question is how can I find which interval in the above does it belongs to? I tried to use findInterval() but seems it does not work on factors like factorx.
Thanks
If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a size. Not that had those values been in the set originally, you may have had different breaks, and it is possible that your new values may be outside all the levels of your existing data which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different way to calculate breaks. Here are b1() and b2()
b1<-function(x, n=nclass.Sturges(x)) {
#like default cut()
nb <- as.integer(n + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0)
dx <- abs(rx[1L])
seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
b2<-function(x, n=nclass.Sturges(x)) {
#like default hist()
pretty(range(x), n=n)
}
So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor
mybreaks <- b1(x)
factorx <- cut(x,breaks=mybreaks))
(Note that's you don't have to wrap cut() in factor() as cut() already returns a factor. Now, if you get new values, you can classify them using findInterval() and the special breaks vector you've already prepared
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() will tell you which level of the previous factor you created that each item belongs to. It will return 0 if the number is smaller than your previous observations, but it will return the largest category for anything greater than the largest level of mybreaks. This behavior is somewhat different that cut() which return NA. The last group in cut() is right-closed where findInterval leaves the right-end open.

Getting strings recognized as variable names in R

Related: Strings as variable references in R
Possibly related: Concatenate expressions to subset a dataframe
I've simplified the question per the comment request. Here goes with some example data.
dat <- data.frame(num=1:10,sq=(1:10)^2,cu=(1:10)^3)
set1 <- subset(dat,num>5)
set2 <- subset(dat,num<=5)
Now, I'd like to make a bubble plot from these. I have a more complicated data set with 3+ colors and complicated subsets, but I do something like this:
symbols(set1$sq,set1$cu,circles=set1$num,bg="red")
symbols(set2$sq,set2$cu,circles=set2$num,bg="blue",add=T)
I'd like to do a for loop like this:
colors <- c("red","blue")
sets <- c("set1","set2")
vars <- c("sq","cu","num")
for (i in 1:length(sets)) {
symbols(sets[[i]][,sq],sets[[i]][,cu],circles=sets[[i]][,num],
bg=colors[[i]],add=T)
}
I know you can have a variable evaluated to specify the column (like var="cu"; set1[,var]; I want to know how to get a variable to specify the data.frame itself (and another to evaluate the column).
Update: Ran across this post on r-bloggers which has this example:
x <- 42
eval(parse(text = "x"))
[1] 42
I'm able to do something like this now:
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
In fiddling with this, I'm finding it interesting that the following are not equivalent:
vars <- data.frame("var1","var2")
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
eval(parse(text=paste(set[[1]],"[,vars[[1]]]",sep="")))
I actually have to do this:
eval(parse(text=paste(set[[1]],"[,as.character(vars[[1]])]",sep="")))
Update2: The above works to output values... but not in trying to plot. I can't do:
for (i in 1:length(set)) {
symbols(eval(parse(text=paste(set[[i]],"$",var1,sep=""))),
eval(parse(text=paste(set[[i]],"$",var2,sep=""))),
circles=paste(set[[i]],".","circles",sep=""),
fg="white",bg=colors[[i]],add=T)
}
I get invalid symbol coordinates. I checked the class of set[[1]] and it's a factor. If I do is.numeric(as.numeric(set[[1]])) I get TRUE. Even if I add that above prior to the eval statement, I still get the error. Oddly, though, I can do this:
set.xvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var1,sep=""))))
set.yvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var2,sep=""))))
symbols(xvars,yvars,circles=data$var3)
Why different behavior when stored as a variable vs. executed within the symbol function?
You found one answer, i.e. eval(parse()) . You can also investigate do.call() which is often simpler to implement. Keep in mind the useful as.name() tool as well, for converting strings to variable names.
The basic answer to the question in the title is eval(as.symbol(variable_name_as_string)) as Josh O'Brien uses. e.g.
var.name = "x"
assign(var.name, 5)
eval(as.symbol(var.name)) # outputs 5
Or more simply:
get(var.name) # 5
Without any example data, it really is difficult to know exactly what you are wanting. For instance, I can't at all divine what your object set (or is it sets) looks like.
That said, does the following help at all?
set1 <- data.frame(x = 4:6, y = 6:4, z = c(1, 3, 5))
plot(1:10, type="n")
XX <- "set1"
with(eval(as.symbol(XX)), symbols(x, y, circles = z, add=TRUE))
EDIT:
Now that I see your real task, here is a one-liner that'll do everything you want without requiring any for() loops:
with(dat, symbols(sq, cu, circles = num,
bg = c("red", "blue")[(num>5) + 1]))
The one bit of code that may feel odd is the bit specifying the background color. Try out these two lines to see how it works:
c(TRUE, FALSE) + 1
# [1] 2 1
c("red", "blue")[c(F, F, T, T) + 1]
# [1] "red" "red" "blue" "blue"
If you want to use a string as a variable name, you can use assign:
var1="string_name"
assign(var1, c(5,4,5,6,7))
string_name
[1] 5 4 5 6 7
Subsetting the data and combining them back is unnecessary. So are loops since those operations are vectorized. From your previous edit, I'm guessing you are doing all of this to make bubble plots. If that is correct, perhaps the example below will help you. If this is way off, I can just delete the answer.
library(ggplot2)
# let's look at the included dataset named trees.
# ?trees for a description
data(trees)
ggplot(trees,aes(Height,Volume)) + geom_point(aes(size=Girth))
# Great, now how do we color the bubbles by groups?
# For this example, I'll divide Volume into three groups: lo, med, high
trees$set[trees$Volume<=22.7]="lo"
trees$set[trees$Volume>22.7 & trees$Volume<=45.4]="med"
trees$set[trees$Volume>45.4]="high"
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth))
# Instead of just circles scaled by Girth, let's also change the symbol
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=set))
# Now let's choose a specific symbol for each set. Full list of symbols at ?pch
trees$symbol[trees$Volume<=22.7]=1
trees$symbol[trees$Volume>22.7 & trees$Volume<=45.4]=2
trees$symbol[trees$Volume>45.4]=3
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=symbol))
What works best for me is using quote() and eval() together.
For example, let's print each column using a for loop:
Columns <- names(dat)
for (i in 1:ncol(dat)){
dat[, eval(quote(Columns[i]))] %>% print
}

Resources