I am trying to plot multiple lines using ggplot2. My data is stored in a data frame as follows:
> rs
time 1 2 3 4
1 200 17230622635 17280401147 17296993985 17313586822
2 400 22328386154 22456712709 22499488227 22542263745
3 600 28958840968 29186097622 29261849840 29337602058
4 800 40251281810 40650094691 40783032318 40915969945
5 1000 73705771414 74612829244 74915181854 75217534464
I would like to use the "time" column as the x value. Other columns are y values of points in different lines. In the data above, there are 4 lines, each line consists of 5 points. More specifically, the first line has points (200, 17230622635), (400, 22328386154), (600, 28958840968), etc. The second line has points (200, 17280401147), (400, 22456712709), etc. (If you need further explanation of the data format, see P.S. in the end.)
To generate similar data, you could use the following code:
rs = data.frame(seq(200, 1000, by=200), runif(5), runif(5), runif(5))
names(rs)=c("time", 1:3)
I followed some examples on Stack Overflow and tried to use reshape2 and ggplot2 to make this plot:
I first melt the data into a "long-format":
library('reshape2')
library('ggplot2')
melted = melt(rs, id.vars="time")
Then I plot the data using the following statement:
ggplot() + geom_line(data=melted, aes(x="time", y="value", group="variable"))
However, I got an empty graph with no points or lines.
Can anyone help me to see what's wrong with my procedure?
P.S.
About the data format:
You can imagine there are many students in a class, and we have their scores on several quizzes. Each row contains one quiz's data: the first column is the quiz number, and the remaining columns are the students' scores, one column per student. For each student, we want to plot a line showing how his/her scores change across quizzes; each point is the score of one quiz for a certain student. Since there are multiple students, we would like to draw multiple lines.
About the melted data:
Specific to the data I show above, the data I got from the melt() function is:
> melted
time variable value
1 200 1 17230622635
2 400 1 22328386154
3 600 1 28958840968
4 800 1 40251281810
5 1000 1 73705771414
6 200 2 17280401147
7 400 2 22456712709
8 600 2 29186097622
9 800 2 40650094691
10 1000 2 74612829244
11 200 3 17296993985
12 400 3 22499488227
13 600 3 29261849840
14 800 3 40783032318
15 1000 3 74915181854
16 200 4 17313586822
17 400 4 22542263745
18 600 4 29337602058
19 800 4 40915969945
20 1000 4 75217534464
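Drop the quotes. Inside aes(), a quoted name such as "time" is treated as a constant string rather than as a column of the data, so nothing sensible is drawn: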
ggplot(data=melted, aes(x=time, y=value, group=variable)) + geom_line()
see: ggplot aesthetics
Another option is to keep the quoted names and use aes_string(), although it is soft-deprecated in ggplot2 3.0.0 and later.
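For reference, here is a minimal end-to-end sketch of the fix. The random data is only a stand-in for the question's rs, and the names p1, p2, x_col and y_col are made up for illustration. The second plot assumes a ggplot2 version that provides the .data pronoun (3.0.0 or later), which is one way to map columns whose names are only available as strings.

```r
library(reshape2)
library(ggplot2)

# Stand-in data (assumption): a time column plus three series, as in the question.
set.seed(1)
rs <- data.frame(seq(200, 1000, by = 200), runif(5), runif(5), runif(5))
names(rs) <- c("time", 1:3)

# Long format: one row per (time, series) pair.
melted <- melt(rs, id.vars = "time")

# Unquoted names inside aes() refer to columns of `melted`.
p1 <- ggplot(melted, aes(x = time, y = value,
                         group = variable, colour = variable)) +
  geom_line()

# If the column names are held in strings, look them up with the .data pronoun.
x_col <- "time"
y_col <- "value"
p2 <- ggplot(melted, aes(x = .data[[x_col]], y = .data[[y_col]],
                         group = variable)) +
  geom_line()

print(p1)
```

Mapping colour to variable is optional; group alone is enough to get one line per series.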
Related
I have data saved in a text file with a couple of thousand lines. Each line has only one value, like this:
52312
2
3
4
5
7
9
4
5
3
The first value is always roughly 10,000 times bigger than all the other values.
I can read in the data with data<-read.table("data.txt")
When I just use plot(data) all the data have the same y-value, resulting in a line, where the x values just represent the values given from the data.
What I want, however, is for the x-value to represent the line number and the y-value the actual data. So for the above example my values would be (1,52312), (2,2), (3,3), (4,4), (5,5), (6,7), (7,9), (8,4), (9,5), (10,3).
Also, since the first value is way higher than all the other values, I'd like to use a log scale for the y-axis.
Sorry, very new to R.
set.seed(1000)
df = data.frame(a=c(9999999,sample(2:78,77,replace = F)))
plot(x=1:nrow(df), y=log(df$a))
i) set.seed(1000) makes sample() return the same random numbers each time you run this code, so the example is reproducible.
ii) Type ?sample in the R console for documentation.
iii) Since you wanted the x-axis to be the line number, I created it with the ":" operator: 1:3 gives 1, 2, 3. Similarly, 1:nrow(df) creates an index whose length matches the number of rows in your data.
iv) For the log scale, simply apply log() to the values before plotting. Read ?plot for more about its parameters.
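As a variation on the answer above, the sketch below reads the ten sample values from the question and lets plot() draw the y-axis itself on a log scale via its log argument, so the tick labels stay in the original units. The inline text= input stands in for the real data.txt, which is an assumption made only to keep the example self-contained.

```r
# Stand-in for read.table("data.txt"): one value per line, no header.
data <- read.table(text = "52312
2
3
4
5
7
9
4
5
3")

# A single unnamed column is called V1 by read.table().
line_number <- seq_len(nrow(data))

# log = "y" puts the y-axis on a log scale without transforming the data.
plot(line_number, data$V1, log = "y",
     xlab = "Line number", ylab = "Value (log scale)")
```

With the real file, replacing the text= argument by the file name should be the only change needed.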
Try this:
df
x y
1 1 52312
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 9
8 8 4
9 9 5
10 10 3
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point(size=2) + scale_y_log10()
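The answer above prints df without showing how it was built. One possible construction, assuming the values are already in a numeric vector (called y here purely for illustration), is:

```r
library(ggplot2)

# Sample values from the question; in practice these would come from the file.
y <- c(52312, 2, 3, 4, 5, 7, 9, 4, 5, 3)

# x is simply the position of each value, i.e. its line number.
df <- data.frame(x = seq_along(y), y = y)

p <- ggplot(df, aes(x, y)) + geom_point(size = 2) + scale_y_log10()
print(p)
```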
Consider the following frequency data:
> table(income)
income
3 5 6 7 8 5000
2 7 2 2 2 1
When I type >hist(income) I get the following histogram
As you can see, most income values are concentrated around 5 and one value is far from the others, which makes the histogram hard to read. MS Excel can treat the 5000 value as a separate category, so the data would look like this instead:
> table(income)
income
3 5 6 7 8 more
2 7 2 2 2 1
So plotting this as a histogram would look much better, so you can see the frequency within a shorter range:
Is there any way to do this, either with the hist() function or with other functions from lattice or ggplot2? However, I don't want to overwrite the values that exceed a certain threshold, so that I don't lose any information.
Thanks a lot!
Data generation:
income <- c(rep(3,2), rep(5,7), rep(6,2), rep(7,2), rep(8,2), 5000)
Function for preparing data for plotting:
nice.data <- function(x, threshold=10){
x[x>threshold] <- "More"
x
}
Plotting:
library(ggplot2)
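# nice.data() returns character values, so use geom_bar(), which counts discrete values;
# geom_histogram() requires a continuous x in current ggplot2
ggplot() + geom_bar(aes(x=nice.data(income))) + xlab("Income")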
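One caveat with converting to character is that categories are then sorted alphabetically, so a value such as "10" would come before "3". A possible variation, sketched here with base R only, returns a factor with explicit level order and leaves the original income vector untouched. The helper name bin_income and its label argument are made up for this example.

```r
income <- c(rep(3, 2), rep(5, 7), rep(6, 2), rep(7, 2), rep(8, 2), 5000)

# Hypothetical helper: values above the threshold are shown as one category,
# and the numeric categories keep their numeric order.
bin_income <- function(x, threshold = 10, label = "More") {
  kept <- sort(unique(x[x <= threshold]))
  factor(ifelse(x > threshold, label, as.character(x)),
         levels = c(as.character(kept), label))
}

income_binned <- bin_income(income)  # `income` itself is not modified
counts <- table(income_binned)
barplot(counts, xlab = "Income", ylab = "Frequency")
```

The same factor can be passed to ggplot2 as the x aesthetic of geom_bar() if a ggplot2 figure is preferred.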
Could you please help me to solve this problem:
I have a database like below:
Animal Milk Age
1 11.96703591 1
1 13.41236333 2
1 14.85769075 3
1 16.30301817 4
2 17.74834559 1
2 19.08465881 2
2 20.42097204 3
2 14.66094662 4
2 14.70197368 5
3 14.74300075 1
3 14.78402781 2
3 14.82505488 3
3 14.86608194 4
3 14.90710901 5
I want to make a plot of milk versus age, so I use the function plot(Milk~Age, data=mydata)
My question is how I can make the same plot (Milk~Age) for each individual animal using only one function call. I have about 200 animals, and I would rather not run the code 200 times to produce 200 curves.
Thanks
Phuong
One approach would be to use the ggplot2 library and make an individual facet for each animal. As you have many animals, you can change ncol= or nrow= in facet_wrap() to get a better view.
library(ggplot2)
ggplot(df,aes(x=Age,y=Milk))+geom_point()+facet_wrap(~Animal)
The following code should create as many plots as you have unique Animal values, and store them in separate PDF files in the working directory:
invisible(by(df, df$Animal, function(tmpdf) {
pdf(paste0("plot",tmpdf$Animal[1],".pdf"))
plot(Milk~Age, data=tmpdf, main=tmpdf$Animal[1])
dev.off()
}))
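If one file per animal is too many files, a variation is to write every plot to a single multi-page PDF. The sketch below uses base R only; the data frame reproduces the sample rows from the question under the name mydata, and the output file name and the use of tempdir() are assumptions made for illustration.

```r
# Sample rows from the question.
mydata <- data.frame(
  Animal = rep(1:3, times = c(4, 5, 5)),
  Milk = c(11.96703591, 13.41236333, 14.85769075, 16.30301817,
           17.74834559, 19.08465881, 20.42097204, 14.66094662, 14.70197368,
           14.74300075, 14.78402781, 14.82505488, 14.86608194, 14.90710901),
  Age = c(1:4, 1:5, 1:5)
)

# Hypothetical output location; any writable path works.
out_file <- file.path(tempdir(), "milk_by_animal.pdf")

pdf(out_file)  # each plot() call below starts a new page
for (piece in split(mydata, mydata$Animal)) {
  plot(Milk ~ Age, data = piece, type = "b",
       main = paste("Animal", piece$Animal[1]))
}
dev.off()
```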
I would suggest ggplot from the ggplot2 package. Animal is wrapped in factor() so that each animal gets its own discrete color instead of a shade on a continuous gradient:
ggplot(df,aes(x=Age,y=Milk, color=factor(Animal)))+geom_point()
edit1: Actually, this would lose clarity with 200 animals. Did you want all these data points in one graph, or spread out across 200 graphs? If the latter, then I agree with Didzis.
I cannot seem to figure out how to get a nice barplot that contains the data from two tables that have a different number of columns.
The tables in question are something like (snipped some data from the end):
> tab1
1 2 3 6 8 31
5872 1525 831 521 299 4
> tab2
1 2 3 4 22
7874 422 2 5 1
Note the column names and sizes are different. When I just do barplot() on one of these tables it comes out with the plot I'd like (showing the column names as the X-axis, frequencies on Y-axis). But, I would like these two side by side.
I've gotten as far as creating a data frame containing both variables as columns and the different row names in the first column (with data.frame() and merge()), but when I plot this the X-axis seems to be all wrong. Attempting to reorder the columns gives me an exception about lengths differing.
Code:
combined <- merge(data.frame(tab1), data.frame(tab2), by = c('Var1'), all=T)
barplot(t(combined[,2:3]), names.arg = combined[,1], beside=T)
This shows a plot, but not all labels are present and the value for position 26 is plotted after 33.
Is there any simple way to get this plot working? A ggplot2 solution would be nice.
You can put all your data in one data frame (as in example).
df<-data.frame(group=rep(c("A","B"),times=c(2,3)),
values=c(23,56,345,6,7),xval=c(1,2,1,2,8))
group values xval
1 A 23 1
2 A 56 2
3 B 345 1
4 B 6 2
5 B 7 8
Then ggplot() with geom_bar() can be used to plot the data.
ggplot(df,aes(xval,values,fill=group))+
geom_bar(stat="identity",position="dodge")
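To connect this to the tables in the question, here is a sketch of how the combined long data frame might be built. The named vectors stand in for tab1 and tab2 (names() works the same way on a one-dimensional table), and the helper to_long is made up for this example. Converting the labels to integers before making them a factor keeps the x-axis in numeric order rather than alphabetical order.

```r
library(ggplot2)

# Stand-ins for the (snipped) tables from the question.
tab1 <- c("1" = 5872, "2" = 1525, "3" = 831, "6" = 521, "8" = 299, "31" = 4)
tab2 <- c("1" = 7874, "2" = 422, "3" = 2, "4" = 5, "22" = 1)

# Hypothetical helper: one row per category, tagged with its source table.
to_long <- function(tab, label) {
  data.frame(group = label,
             xval = as.integer(names(tab)),
             values = as.vector(tab))
}

df <- rbind(to_long(tab1, "tab1"), to_long(tab2, "tab2"))

# A factor with numerically sorted levels gives evenly spaced, ordered bars.
df$xval <- factor(df$xval, levels = sort(unique(df$xval)))

p <- ggplot(df, aes(xval, values, fill = group)) +
  geom_bar(stat = "identity", position = "dodge")
print(p)
```

Categories that appear in only one table simply get a single bar, so no merge or padding with missing values is needed.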
I have some data that I want to display as a box plot using ggplot2. It's basically counts, stratified by two other variables. Here's an example of the data (in reality there's a lot more, but the structure is the same):
TAG Count Condition
A 5 1
A 6 1
A 6 1
A 6 2
A 7 2
A 7 2
B 1 1
B 2 1
B 2 1
B 12 2
B 8 2
B 10 2
C 10 1
C 12 1
C 13 1
C 7 2
C 6 2
C 10 2
For each Tag, there are a fixed number of observations in condition 1, and condition 2 (here it's 3, but in the real data it's much more). I want a box plot like the following ('s' is a dataframe arranged as above):
ggplot(s, aes(x=TAG, y=Count, fill=factor(Condition))) + geom_boxplot()
This is fine, but I want to be able to order the x-axis by the p-value from a Wilcoxon test for each Tag. For example, with the above data, the values would be (for the tags A,B, and C respectively):
> wilcox.test(c(5,6,6),c(6,7,7))$p.value
[1] 0.1572992
> wilcox.test(c(1,2,2),c(12,8,10))$p.value
[1] 0.0765225
> wilcox.test(c(10,12,13),c(7,6,10))$p.value
[1] 0.1211833
Which would induce the ordering A,C,B on the x-axis (largest to smallest). But I don't know how to go about adding this information into my data (specifically, attaching a p-value at just the tag level, rather than adding a whole extra column), or how to use it to change the x-axis order. Any help greatly appreciated.
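Here is a way to do it. The first step is to calculate the p-value for each TAG. We do this with ddply, which splits the data by TAG and calculates the p-value using the formula interface to wilcox.test (dfr here is the data frame called s in the question). The plot statement then reorders TAG by its p-value; reorder() sorts in increasing order, so use reorder(TAG, -pval) to put the largest p-value first.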
library(ggplot2); library(plyr)
dfr2 <- ddply(dfr, .(TAG), transform,
pval = wilcox.test(Count ~ Condition)$p.value)
qplot(reorder(TAG, pval), Count, fill = factor(Condition), geom = 'boxplot',
data = dfr2)
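For readers without plyr, roughly the same idea can be sketched with base R: compute one p-value per TAG, then set the factor levels of TAG in the desired order so that no extra column is needed. The data frame below reproduces the sample from the question under the name s, and the name pvals is made up for this example. The tied values make wilcox.test() fall back to a normal approximation and emit warnings, which are suppressed here.

```r
library(ggplot2)

# Sample data from the question.
s <- data.frame(
  TAG = rep(c("A", "B", "C"), each = 6),
  Count = c(5, 6, 6, 6, 7, 7,
            1, 2, 2, 12, 8, 10,
            10, 12, 13, 7, 6, 10),
  Condition = rep(rep(1:2, each = 3), times = 3)
)

# One p-value per TAG, kept as a named vector rather than a data column.
pvals <- sapply(split(s, s$TAG), function(d) {
  suppressWarnings(wilcox.test(Count ~ Condition, data = d)$p.value)
})

# Order the x-axis from largest to smallest p-value.
s$TAG <- factor(s$TAG, levels = names(sort(pvals, decreasing = TRUE)))

p <- ggplot(s, aes(x = TAG, y = Count, fill = factor(Condition))) +
  geom_boxplot()
print(p)
```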