I have a problem with documentation and the return value of plot() for factors. I'd like to add a horizontal line with the mean value to the plot, but I fail to compute it. I was hoping to be able to use the value of the plot, but I failed. For example:
> x<-sample(5, 10, replace=TRUE)
> x
[1] 3 5 1 4 5 4 2 4 1 5
> y<-plot(factor(x))
> y
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
Obviously the domain and range are all integers, so what do these numbers returned by plot really mean, and how can I get the mean bar height?
Of course (if there's no more elegant solution) I can iterate over the factor levels, counting the number of items for each, and then take the mean of those counts. Also, if you use hist() instead of plot(), the solution is very simple: abline(h=mean(hist(x)$counts))
To add a horizontal line, just say:
abline(h = mean(x))
abline(h = whatever) gives you a horizontal line. abline(v = whatever) gives you a vertical line.
Probably the (most?) complicated solution invented by myself is this:
abline(h=mean(unlist(lapply(min(x):max(x), function(ff) length(which(x == ff))))))
Of course this solution only works if x is a numeric vector of integer values; if x is a factor, replace min(x):max(x) with levels(x).
And (harder for me to understand) a simpler solution seems to be (from @Marco Sandri):
abline(h=length(x)/length(y))
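As far as I can tell, plot() on a factor hands the level counts to barplot() and returns whatever barplot() returns: the x positions of the bar midpoints (0.7, 1.9, ... with the default bar width and spacing), not the bar heights. The heights are just the counts from table(), so a minimal sketch of the mean-height line, using the sample values from the question, could be:

```r
x <- c(3, 5, 1, 4, 5, 4, 2, 4, 1, 5)    # the sample shown in the question
counts <- table(factor(x))               # bar heights: number of items per level
mids <- barplot(counts)                  # returns the bar midpoints on the x axis
mean_height <- mean(as.vector(counts))   # same value as length(x) / nlevels(factor(x))
abline(h = mean_height, lty = 2)         # horizontal line at the mean bar height
```

This is also why length(x)/length(y) works: y holds one midpoint per bar, so length(y) is the number of bars.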
I've been practicing basics in R (3.6.3) and I'm stuck trying to understand this problem for hours already. This was the exercise:
Step 1: Generate sequence of data between 1 and 3 of total length 100; #use the jitter function (with a large factor) to add noise to your data
Step 2: Compute the vector of rolling averages roll.mean with the average of 5 consecutive points. This vector has only 96 averages.
Step 3: add the vector of these averages to your plot
Step 4: generalize step 2 and step 3 by making a function with parameters consec (default=5) and y.
y88 = seq(1,3,0.02)
y = jitter(y88, 120, set.seed(1))
y = y[-99] # removed one guy so y can have 100 elements, as asked
roll.meanT = rep(0,96)
for (i in 1:length(roll.meanT)) # my 'reference i' is roll.mean[i], not y[i]
{
roll.meanT[i] = (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5
}
plot(y)
lines(roll.meanT, col=3, lwd=2)
This produced this plot:
Then I proceeded to generalize using a function (the exercise asks me to generalize steps 2 and 3, so the data creation step was ignored and I consider y to remain constant):
fun50 = function(consec=5,y)
{
roll.mean <- rep(NA,96) # Apparently, we just leave NA's as NA's, since length(y) is always greater than length(roll.mean)
for (i in 1:96)
{
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
}
plot(y)
lines(roll.mean, col=3, lwd=2)
}
Which gave me a completely different plot:
When I manually try to see if mean(y[1:5]) produces the right mean, it does. I know I could have already used the mean() function in the first part, but I would really like to get the same results using (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5 or mean(y[1:5],......).
You have the line
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
I believe your intention is to grab the values with indices i to (i+consec-1). Unfortunately for you, the : operator takes precedence over the binary arithmetic operators, so i:i+consec-1 is evaluated as (i:i)+consec-1.
> 1:1+5-1 #(this is what your code would do for i=1, consec=5)
[1] 5
> (1:1)+5-1 # this is what it's actually doing for you
[1] 5
> 2:2+5-1 #(this is what your code would do for i=2, consec=5)
[1] 6
> 3:3+5-1 #(this is what your code would do for i=3, consec=5)
[1] 7
> 3:(3+5-1) #(this is what you want your code to do for i=3, consec=5)
[1] 3 4 5 6 7
So to fix it, just add some parentheses:
roll.mean[i] <- mean(y[i:(i+consec-1)]) # Using mean(), I'm able to generalize.
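Note that the function still hard-codes 96, which only matches consec = 5 and length(y) = 100. A sketch of a version that derives the output length from its arguments could look like this (roll_mean and rm5 are names made up here, and the y below is just noise-free stand-in data):

```r
roll_mean <- function(y, consec = 5) {
  stopifnot(consec >= 1, consec <= length(y))
  n_out <- length(y) - consec + 1            # 96 when length(y) is 100 and consec is 5
  vapply(seq_len(n_out),
         function(i) mean(y[i:(i + consec - 1)]),  # parentheses force indices i..(i+consec-1)
         numeric(1))
}

y <- seq(1, 3, length.out = 100)             # stand-in data without jitter
rm5 <- roll_mean(y)                          # rolling averages of 5 consecutive points
plot(y)
lines(rm5, col = 3, lwd = 2)
```

Keeping the plotting outside the function is only a design choice for this sketch; the exercise's fun50 can call roll_mean() and then plot() and lines() as before.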
I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.
This question: splitting a continuous variable into equal sized groups describes a number of techniques for related situations. For instance, if I start with
x = c(1,5,3,12,5,6,7)
I can use cut() to get:
cut(x, 3, labels = FALSE)
[1] 1 2 1 3 2 2 2
This is undesirable because the values of the factor are just sequential integers; they have no direct relation to the underlying original values in my vector.
Another possibility is cut2: for instance:
library(Hmisc)
cut2(x, g = 3, levels.mean = TRUE)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
This is better because now the return values relate to the values of the bins. It is still less than ideal, though, since:
(a) it yields a factor which then needs to be converted to numeric (see, e.g.), which is both slow and awkward code-wise.
(b) Ideally I'd like to be able to choose whether to use the top or bottom end points of the intervals, instead of just the means.
I know that there are also options using regex on the factors returned from cut or cut2 to get the top or bottom points of the intervals. These too seem overly cumbersome.
Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?
My current best effort is as follows:
MyDiscretize = function(x, N_Bins){
f = cut2(x, g = N_Bins, levels.mean = TRUE)
return(as.numeric(levels(f))[f])
}
My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.
Edit:
To clarify: my desired output would be:
(a) an equivalent to what I can achieve right now in the example with cut2 but without needing to convert the factor to numeric.
(b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the midpoint.
Use ave like this:
Given:
x = c(1,5,3,12,5,6,7)
Mean:
ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
Min:
ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7
Max:
ave(x,cut2(x,g = 3), FUN = max)
[1] 5 5 5 12 5 6 12
Or standard deviation:
ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854 NA 3.535534
Note the NA result when an interval contains only one data point.
Hope this is what you need.
NOTE:
Parameter g in cut2 is the number of quantile groups. Groups might not have the same number of data points, and the intervals might not have the same length.
On the other hand, cut splits the range into several intervals of equal length.
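If you want to avoid Hmisc and factors altogether, a rough base-R sketch of the same idea is to take quantile break points, assign bins with findInterval(), and summarise with ave(). The function name quantile_bins and its value argument are invented for this sketch, and its grouping at tied break points is not guaranteed to match what cut2 produces:

```r
quantile_bins <- function(x, n_bins, value = c("mean", "low", "high")) {
  value <- match.arg(value)
  # Break points at equally spaced quantiles; unique() guards against ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1),
                            names = FALSE))
  # Integer bin index per element; the maximum falls in the last bin
  bin <- findInterval(x, breaks, rightmost.closed = TRUE)
  fun <- switch(value, mean = mean, low = min, high = max)
  ave(x, bin, FUN = fun)   # numeric vector, same length as x
}

x <- c(1, 5, 3, 12, 5, 6, 7)
lows  <- quantile_bins(x, 3, "low")
highs <- quantile_bins(x, 3, "high")
means <- quantile_bins(x, 3)
```

Here "low" and "high" are the smallest and largest observed values in each bin, as in the ave() calls above, not the theoretical interval endpoints.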
Maybe not very elegant, but it should be efficient. Try this function:
myCut<-function(x,breaks,retValues=c("means","highs","lows")) {
retValues<-match.arg(retValues)
if (length(breaks)!=1) stop("breaks must be a single number")
breaks<-as.integer(breaks)
if (is.na(breaks)||breaks<2) stop("breaks must be greater than or equal to 2")
intervals<-seq(min(x),max(x),length.out=breaks+1)
bins<-findInterval(x,intervals,all.inside=TRUE)
if (retValues=="means") return(rowMeans(cbind(intervals[-(breaks+1)],intervals[-1]))[bins])
if (retValues=="highs") return(intervals[-1][bins])
intervals[-(breaks+1)][bins]
}
x = c(1,5,3,12,5,6,7)
myCut(x,3)
#[1] 2.833333 6.500000 2.833333 10.166667 6.500000 6.500000 6.500000
myCut(x,3,"highs")
#[1] 4.666667 8.333333 4.666667 12.000000 8.333333 8.333333 8.333333
myCut(x,3,"lows")
#[1] 1.000000 4.666667 1.000000 8.333333 4.666667 4.666667 4.666667
I've created an interaction plot and realized that one line breaks in the middle. That is because two levels on the x axis have no value for that particular factor, although they do have values for the other factors. It looks like this (assume the lines shown are actually connected):
[Sketch of the plot: a line over x axis levels 1 to 7 that is broken around level 4]
Basically there is a value for 4, but not for 3 and 5, so that's why it looks as it does. How do I correct this? I have set everything as a factor.
I read that I should use na.rm, but that doesn't work and the plot looks the same.
Thanks
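Not an answer to the exact plot, but a minimal sketch of why the line breaks, assuming the empty cells end up as NA means: line drawing in base graphics stops at an NA and starts again after it, so an isolated point between two NAs gets no segment on either side. Setting na.rm = TRUE inside mean() only changes how each cell mean is computed; it does not create a value for a cell with no data. The vectors x and y below are made-up illustration data. Dropping the NA points before drawing joins the remaining means:

```r
x <- 1:7
y <- c(2.0, 2.5, NA, 3.5, NA, 3.0, 2.0)   # hypothetical cell means; levels 3 and 5 are empty
plot(x, y, type = "b", xlab = "x axis", ylab = "mean")  # line is broken around level 4
ok <- !is.na(y)                           # keep only the levels that have a mean
lines(x[ok], y[ok], lty = 2)              # dashed line joins them across the gaps
```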
I'm trying to make a boxplot of a list of values in ggplot2, but the problem is that it doesn't know how to deal with lists. What should I try?
E.g.:
k <- list(c(1,2,3,4,5),c(1,2,3,4),c(1,3,6,8,14),c(1,3,7,8,10,37))
k
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 3 6 8 14
[[4]]
[1] 1 3 7 8 10 37
If I pass k as an argument to boxplot() it will handle it flawlessly and produce a nice (well not so nice... hehehe) boxplot with the range of all the values as the Y-axis and the list index (each element) as the X-axis.
How should I achieve the exact same effect with ggplot2? I think that data frames or matrices are not an option because the vectors are of different lengths.
Thanks
The answer is that you don't. ggplot2 is designed to work with data frames, particularly long form data frames. That means you need your data as one tall vector, with a grouping factor:
library(ggplot2)
# one tall vector of values plus a grouping factor, one letter per list element
d <- data.frame(x = unlist(k),
grp = rep(letters[1:length(k)],times = sapply(k,length)))
ggplot(d,aes(x = grp, y = x)) + geom_boxplot()
And as pointed out in the comments, melt achieves the same result as this manual reshaping and is much simpler. I guess I like to make things difficult.
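If you would rather not load another package for the reshaping, base R's stack() does much the same job on a named list. This is a swapped-in alternative to melt, not what the comments suggested, and the letter names are just placeholders; stack() calls its columns values and ind:

```r
k <- list(c(1, 2, 3, 4, 5), c(1, 2, 3, 4), c(1, 3, 6, 8, 14), c(1, 3, 7, 8, 10, 37))
names(k) <- letters[seq_along(k)]   # stack() needs names to build the grouping factor
d <- stack(k)                       # long data frame with columns 'values' and 'ind'

# Build the plot only if ggplot2 is installed; print(p) draws it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(d, ggplot2::aes(x = ind, y = values)) + ggplot2::geom_boxplot()
}
```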
I have a set of user recommendations:
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show basic properties: mean, median, quartiles, min and max.
But it gives back a summary of each column separately. I refrain from using data.frame because the factors 'Star' are ordered.
How can I tell R that Star is an ordered factor of numeric scores and Votes holds their frequencies?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what's the problem. Why shouldn't you use data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should convert your data.frame to a vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on @Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric and then calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify: when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? E.g. mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
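For completeness, here is a small base-R sketch of the same idea that skips the factor entirely, assuming Star can be treated as a plain numeric score: repeat each score once per vote and summarise the expanded vector. The variable name scores is made up here; weighted.mean() from the stats package gives the mean directly from the two columns:

```r
review <- matrix(c(5:1, 10, 2, 1, 1, 2), nrow = 5, ncol = 2,
                 dimnames = list(NULL, c("Star", "Votes")))

# One entry per vote: ten 5s, two 4s, one 3, one 2, two 1s
scores <- rep(review[, "Star"], times = review[, "Votes"])

summary(scores)                                          # min, quartiles, median, mean, max
weighted.mean(review[, "Star"], w = review[, "Votes"])   # mean without expanding the votes
```

Expanding the votes is fine for small counts like these; with very large vote counts the weighted functions avoid building a long vector.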