I have a question about plots. For example we have variable a and b, we plot this in R and you get the point. Now, I want to make a range of best/highest point. Is there a way to generate a ranking in the point? I thought maybe something with mean?
Thanks!
a<- c(1,3,7,5,3,8,4,5,3,6,9,4,2,6,3)
b<- c(5,3,7,2,7,2,5,2,7,3,6,2,1,1,9)
plot(a,b)
Based on your comment to get the positions of the points with the 5 highest b values, use order:
order(b,decreasing=T)[1:5]
[1] 15 3 5 9 11
And you can use this to get the relevant a and b values:
a[order(b,decreasing=T)[1:5]]
[1] 3 7 3 3 9
b[order(b,decreasing=T)[1:5]]
[1] 9 7 7 7 6
You can use this also to highlight them in the plot:
high <- order(b,decreasing=T)[1:5]
col <- rep("black",length(b))
col[high] <- "red"
plot(a,b,col=col)
Note that there is some overplotting here (2 values at (3,7))
Related
I have data saved in a text file with couple thousands line. Each line only has one value. Like this
52312
2
3
4
5
7
9
4
5
3
The first value is always roughly 10.000 times bigger than all the other values.
I can read in the data with data<-read.table("data.txt")
When I just use plot(data) all the data have the same y-value, resulting in a line, where the x values just represent the values given from the data.
What I want, however, is that the x-value represents the linenumber and y-value the actual data. So for the above example my values would be (1,52312), (2,2), (3,3), (4,4), (5,5), (6,7), (7,9), (8,4), (9,5), (10,3).
Also, since the first value is way higher than all the other values, I'd like to use a log scale for the y-axis.
Sorry, very new to R.
set.seed(1000)
df = data.frame(a=c(9999999,sample(2:78,77,replace = F)))
plot(x=1:nrow(df), y=log(df$a))
i) set.seed(1000) helps you reproduce the same random numbers from sample() each time you run this code. It makes code reproducible.
ii) type ?sample in R console for documentation.
iii) since you wanted the x-axis to be linenumber - I create it using ":" operator. 1:3 = 1,2,3. Similarily I created a "id" index using 1:nrow(df) which will create based on the dimension of your data.
iv) for log ,just use it simple :). read more about ?plot and its parameters
Try this:
df
x y
1 1 52312
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 9
8 8 4
9 9 5
10 10 3
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point(size=2) + scale_y_log10()
Consider the following frequency data:
> table(income)
income
3 5 6 7 8 5000
2 7 2 2 2 1
When I type >hist(income) I get the following histogram
So as you can see, the fact that most income values are concentrated around 5 and there is one value quite distant from the others makes the histogram not look very good. MS Excel can consider the 5000 value as of another category, so the data would like this instead:
> table(income)
income
3 5 6 7 8 more
2 7 2 2 2 1
So plotting this as a histogram would look much better, so you can see the frequency within a shorter range:
Is there anyway to do this either with the hist() function or others functions from lattice or ggplot2? I do however, don't want to overwrite the values that exceed a certain threshold, so as I do lose any information.
Thanks a lot!
Data generation:
income <- c(rep(3,2), rep(5,7), rep(6,2), rep(7,2), rep(8,2), 5000)
Function for preparing data for plotting:
nice.data <- function(x, threshold=10){
x[x>threshold] <- "More"
x
}
Plotting:
library(ggplot2)
ggplot() + geom_histogram(aes(x=nice.data(income))) + xlab("Income")
Result:
Users
I have a distance matrix dMat and want to find the 5 nearest samples to the first one. What function can I use in R? I know how to find the closest sample (cf. 3rd line of code), but can't figure out how to get the other 4 samples.
The code:
Mat <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))
which(dMat[,1]==min(dMat[,1]))
The 3rd line of code finds the index of the closest sample to the first sample.
Thanks for any help!
Best,
Chega
You can use order to do this:
head(order(dMat[-1,1]),5)+1
[1] 10 3 4 8 6
Note that I removed the first one, as you presumably don't want to include the fact that your reference point is 0 distance away from itself.
Alternative using sort:
sort(dMat[,1], index.return = TRUE)$ix[1:6]
It would be nice to add a set.seed(.) when using random numbers in matrix so that we could show the results are identical. I will skip the results here.
Edit (correct solution): The above solution will only work if the first element is always the smallest! Here's the correct solution that will always give the 5 closest values to the first element of the column:
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
Example:
> dMat <- matrix(c(70,4,2,1,6,80,90,100,3), ncol=1)
# James' solution
> head(order(dMat[-1,1]),5) + 1
[1] 4 3 9 2 5 # values are 1,2,3,4,6 (wrong)
# old sort solution
> sort(dMat[,1], index.return = TRUE)$ix[1:6]
[1] 4 3 9 2 5 1 # values are 1,2,3,4,6,70 (wrong)
# Correct solution
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
[1] 6 7 8 5 2 # values are 80,90,100,6,4 (right)
I have some data that I want to display as a box plot using ggplot2. It's basically counts, stratified by two other variables. Here's an example of the data (in reality there's a lot more, but the structure is the same):
TAG Count Condition
A 5 1
A 6 1
A 6 1
A 6 2
A 7 2
A 7 2
B 1 1
B 2 1
B 2 1
B 12 2
B 8 2
B 10 2
C 10 1
C 12 1
C 13 1
C 7 2
C 6 2
C 10 2
For each Tag, there are a fixed number of observations in condition 1, and condition 2 (here it's 3, but in the real data it's much more). I want a box plot like the following ('s' is a dataframe arranged as above):
ggplot(s, aes(x=TAG, y=Count, fill=factor(Condition))) + geom_boxplot()
This is fine, but I want to be able to order the x-axis by the p-value from a Wilcoxon test for each Tag. For example, with the above data, the values would be (for the tags A,B, and C respectively):
> wilcox.test(c(5,6,6),c(6,7,7))$p.value
[1] 0.1572992
> wilcox.test(c(1,2,2),c(12,8,10))$p.value
[1] 0.0765225
> wilcox.test(c(10,12,13),c(7,6,10))$p.value
[1] 0.1211833
Which would induce the ordering A,C,B on the x-axis (largest to smallest). But I don't know how to go about adding this information into my data (specifically, attaching a p-value at just the tag level, rather than adding a whole extra column), or how to use it to change the x-axis order. Any help greatly appreciated.
Here is a way do it. The first step is to calculate the p-values for each TAG. We do this by using ddply which splits the data by TAG, and calculates the p-value using the formula interface to wilcox.test. The plot statement reorders the TAG based on its p-value.
library(ggplot2); library(plyr)
dfr2 <- ddply(dfr, .(TAG), transform,
pval = wilcox.test(Count ~ Condition)$p.value)
qplot(reorder(TAG, pval), Count, fill = factor(Condition), geom = 'boxplot',
data = dfr2)
I'm trying to do a boxplot of a list of values at ggplot2, but the problem is that it doesn't know how to deal with lists, what should I try ?
E.g.:
k <- list(c(1,2,3,4,5),c(1,2,3,4),c(1,3,6,8,14),c(1,3,7,8,10,37))
k
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 3 6 8 14
[[4]]
[1] 1 3 7 8 10 37
If I pass k as an argument to boxplot() it will handle it flawlessly and produce a nice (well not so nice... hehehe) boxplot with the range of all the values as the Y-axis and the list index (each element) as the X-axis.
How should I achieve the exact same effect with ggplot2 ? I think that dataframes or matrices are not an option because the vectors are of different length.
Thanks
The answer is that you don't. ggplot2 is designed to work with data frames, particularly long form data frames. That means you need your data as one tall vector, with a grouping factor:
d <- data.frame(x = unlist(k),
grp = rep(letters[1:length(k)],times = sapply(k,length)))
ggplot(d,aes(x = grp, y = x)) + geom_boxplot()
And as pointed out in the comments, melt achieves the same result as this manual reshaping and is much simpler. I guess I like to make things difficult.