Overlay two differently formatted qplots in ggplot2 - r

I have two scatterplots, based on different but related data, created using qplot() from ggplot2. (Learning ggplot hasn't been a priority because qplot has been sufficient for my needs up to now). What I want to do is superimpose/overlay the two charts so that the x,y data for each is plotted in the same plot space. The complication is that I want each plot to retain its formatting/aesthetics.
That data in question are row and column scores from correspondence analysis - corresp() from MASS - so the number of data rows (i.e. samples or taxa) differ between the two datasets. I can plot the two score sets together easily. Either by combing the two datasets or, even easier, just using the biplot() function.
However, I have been using qplot to get the plots looking exactly as I need them; with samples plotted as colour-coded symbols and taxa as labels:
PlotSample <- qplot(DataCorresp$rscore[,1], DataCorresp$rscore[,2],
colour=factor(DataAll$ColourCode)) +
scale_colour_manual(values = c("black","darkgoldenrod2",
"deepskyblue2","deeppink2"))
and
PlotTaxa <- qplot(DataCorresp$cscore[,1], DataCorresp$cscore[,2],
label=colnames(DataCorresp), size=10, geom=“text”)
Can anyone suggest a way by which either
the two plots (PlotSample and PlotTaxa) can be superimposed atop of each other,
the two datasets (DataCorresp$rscore and DataCorresp$cscore) can be plotted together but formatted in their different ways, or
another function (e.g. biplot()) that could be used to achieve my aim.
Example of workflow using a extremely simplified and made-up dataset:
> require(MASS)
> require(ggplot2)
> alldata<-read.csv("Fake data.csv",header=T,row.name=1)
> selectdata<-alldata[,2:10]
> alldata
Period Species.1 Species.2 Species.3 Species.4 Species.5 Species.6
Sample-1 Early 50 87 97 12 60 49
Sample-2 Early 41 90 36 52 36 27
Sample-3 Early 87 56 82 45 56 13
Sample-4 Early 37 47 78 29 53 34
Sample-5 Early 58 70 34 35 8 21
Sample-6 Early 94 82 48 16 27 26
Sample-7 Early 91 69 50 57 24 13
Sample-8 Early 63 38 86 20 28 11
Sample-9 Middle 4 19 55 99 86 38
Sample-10 Middle 29 25 10 93 37 54
Sample-11 Middle 48 12 59 73 39 92
Sample-12 Middle 31 6 34 81 39 54
Sample-13 Middle 29 40 26 52 34 84
Sample-14 Middle 1 46 15 97 67 41
Sample-15 Late 43 47 30 18 60 23
Sample-16 Late 45 10 49 2 2 45
Sample-17 Late 14 8 51 36 58 51
Sample-18 Late 41 51 32 47 23 43
Sample-19 Late 43 17 6 54 4 12
Sample-20 Late 20 25 1 29 35 2
Species.7 Species.8 Species.9
Sample-1 41 39 57
Sample-2 59 4 45
Sample-3 10 56 5
Sample-4 59 30 39
Sample-5 9 29 57
Sample-6 29 24 35
Sample-7 22 4 42
Sample-8 31 19 40
Sample-9 17 7 57
Sample-10 6 9 29
Sample-11 34 20 0
Sample-12 56 41 59
Sample-13 6 31 13
Sample-14 25 12 28
Sample-15 60 75 84
Sample-16 32 69 34
Sample-17 48 53 56
Sample-18 80 86 46
Sample-19 50 70 82
Sample-20 57 84 70
> biplot(selectca,cex=c(0.6,0.6))
> selectca<-corresp(selectdata,nf=5)
> PlotSample <- qplot(selectca$rscore[,1], selectca$rscore[,2], colour=factor(alldata$Period) )
> PlotTaxa<-qplot(selectca$cscore[,1], selectca$cscore[,2], label=colnames(selectdata), size=10, geom="text")
The biplot will produce this plot: /r/10wk1a8/5
The PlotSample appears as such: /r/i29cba/5
The PlotTaxa appears as such: /r/245bl9d/5
EDIT so don't have enough rep to post pictures and tinypic links not accepted (despite https://meta.stackexchange.com/questions/60563/how-to-upload-images-on-stack-overflow). So if you add tinypic's URL to the start of those codes above you'll get there.
Essentially I want to creat the biplot plot but with samples colour coded as they are in PlotSample.

Have a look at Gavin Simpsons ggvegan-package!
require(vegan)
require(ggvegan)
# some data
data(dune)
# CA
mod <- cca(dune)
# plot
autoplot(mod, geom = 'text')
For a finer control (or if you want to stick with corresp(), you may also want to take a look at the code of the two involved functions fortify.cca (which wraps the data in the cca objects into a useable format for ggplot) and autoplot.cca for creating the plot.
I you want to do it from scratch, you'll have to wrap both scores (sites and species) into one data.frame (see how fortify.cca does this and extract the relevant values from the corresp() object) and use this to build the plot.

Related

In R topicmodels package, how could we get the topics' distributions over terms?

I'm running LDA by using topicmodels package.
lda.model = LDA(dtm, k,control = list(em = list(iter.max = 1000, tol = 10^-4)))
apps.terms<-terms(lda.model,15)
head(apps.terms)
Topic.1 Topic.2 Topic.3 Topic.4 Topic.5
1 38 55 187 38 38
2 40 38 171 40 35
3 55 35 178 56 44
4 49 49 74 35 55
5 35 44 177 190 52
6 44 53 80 55 49
These code get the 15 terms order by their proportion. If I didn't badly understand the LDA algorithm. Each topic is a distribution over terms.So I want to know the exact distribution over these terms. For example. Topic.1 is 30% related to 38, 20% related to 40 ..etc. Is there any way to get it by using topicmodels package?
It sounds like you want the posterior probabilities for each document.
lda.inf <- posterior(lda.model,dtm)

Error while plotting a tree with some squirrels using trees package

I am using the package trees found here, by #jbaums and explained in this post.
My data are the following:
the tree is composed by
the trunk
Trunk
[1] 13.60415
and the branches
Tree
TreeBranchLength TreeBranchID
1 10.004269 1
2 7.994269 2
3 9.028834 11
4 10.817401 12
5 8.551311 111
6 10.599798 112
7 11.073243 121
8 13.367392 122
9 9.625431 1111
10 10.793569 1112
11 9.896499 11121
12 8.687741 11122
13 7.791180 1211
14 12.506105 1212
15 6.768478 1221
16 10.441796 1222
17 10.751892 1121
18 9.458651 1122
19 10.768509 11221
20 10.150673 11222
21 12.377448 111211
22 12.235136 111212
23 9.074079 11211
24 9.996334 11212
25 9.807019 112221
26 10.895809 112222
27 6.741274 1122211
28 15.841272 1122212
29 5.753920 11222111
30 8.846389 11222112
31 11.925961 112111
32 9.780776 112112
33 8.207965 12221
34 10.079375 12222
the 50 squirrel populations -
Populations
PopulationPositionOnBranch PopulationBranchID ID
1 10.6321655 112111 1
2 1.0644897 1 2
3 3.9315473 1 3
4 1.0310244 0 4
5 9.1768846 0 5
6 13.4267181 0 6
7 7.9461528 0 7
8 6.0533401 121 8
9 2.1227425 121 9
10 1.8256787 121 10
11 4.7332588 11222112 11
12 4.4837432 11222112 12
13 4.6200834 11222112 13
14 2.5622276 1221 14
15 1.2446683 1221 15
16 7.0674052 111 16
17 1.3854674 111 17
18 4.8735635 111 18
19 9.5007998 1222 19
20 6.6373468 1222 20
21 12.6757728 122 21
22 4.2685465 122 22
23 3.9806540 2 23
24 3.1025403 2 24
25 3.9119065 11122 25
26 1.5527653 11122 26
27 1.6687957 11122 27
28 8.0697456 1122 28
29 6.7871391 1122 29
30 9.8050713 111212 30
31 8.5226920 111212 31
32 3.6113379 111212 32
33 7.3184965 111211 33
34 8.6142984 111211 34
35 1.3550870 1211 35
36 8.3650639 12 36
37 4.6411446 112112 37
38 3.2985541 112112 38
39 12.2344148 1212 39
40 9.0290776 1212 40
41 1.3900249 1121 41
42 0.9261425 1122212 42
43 15.2522199 1122212 43
44 4.0253771 12222 44
45 8.7507678 11222 45
46 4.6289841 1122211 46
47 9.1799522 112 47
48 5.1293838 12221 48
49 1.1543080 12221 49
50 10.1014837 112222 50
the code to produce the plot
g <- germinate(list(trunk.height=Trunk,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30))
xy <- squirrels(g, Populations$PopulationBranchID, pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
, which produces
As you can see on the plot bellow population 43 (blue arrow) is out of the tree.. It seems that the length of the branches on the plot do not correspond to the data. For example the branch (left green arrow) on which are populations 38 and 37 is longer than the one where population 43 is (right green arrow), that is not the case in the data. What am I doing wrong? Have I understood correctly how to use trees?
On studying the germinate function it seems to me that the Tree values that you are passing to it needs to be sorted on TreeBranchId field in the ascending order.
The BranchID: 1122212 where you have placed 43 is not the actual 1122212 branch.
Due to the order in which you have fed the values in the Tree, the function is somehow messing the location of branch.
I was curious to see if I increase the length of Branch ID: 1122212, will it change the branch where 43 is placed, and guess what? it didn't. The branch which actually showed an increase in length was the branch where you have placed 37 and 38.
So this hint pointed out that something was wrong with germinate function. On further debugging I was able to make it work using the below code.
Tree<-read.csv("treeBranch.csv")
Tree<-Tree[order(Tree$TreeBranchID),]
g <- germinate(list(trunk.height=15,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID,pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)

Mean and SD in R

maybe it is a very easy question. This is my data.frame:
> read.table("text.txt")
V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860
It means that I have 22516 sequences with length 26, 17129 sequences with length 28, etc. I would like to know the sequence length mean and its standard deviation. I know how to do it, but I know to do it creating a list full of 26 repeated 22516 times and so on... and then compute the mean and SD. However, I thing there is a easier method. Any idea?
Thanks.
For mean: (V1 %*% V2)/sum(V2)
For SD: sqrt(((V1-(V1 %*% V2)/sum(V2))**2 %*% V2)/sum(V2))
I do not find mean(rep(V1,V2)) # 61.902 and sd(rep(V1,V2)) # 14.23891 that complex, but alternatively you might try:
weighted.mean(V1,V2) # 61.902
# recipe from http://www.ltcconline.net/greenl/courses/201/descstat/meansdgrouped.htm
sqrt((sum((V1^2)*V2)-(sum(V1*V2)^2)/sum(V2))/(sum(V2)-1)) # 14.23891
Step1: Set up data:
dat.df <- read.table(text="id V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860",header=T)
Step2: Convert to data.table (only for simplicity and laziness in typing)
library(data.table)
dat <- data.table(dat.df)
Step3: Set up new columns with products, and use them to find mean
dat[,pr:=V1*V2]
dat[,v1sq:=as.numeric(V1*V1*V2)]
dat.Mean <- sum(dat$pr)/sum(dat$V2)
dat.SD <- sqrt( (sum(dat$v1sq)/sum(dat$V2)) - dat.Mean^2)
Hope this helps!!
MEAN = (V1*V2)/sum(V2)
SD = sqrt((V1*V1*V2)/sum(V2) - MEAN^2)

How to colour bars in a ggplot2 graph considering the order in a numeric variable of one data frame

Hi everybody I am trying to make a ggplot graph with a special element to add value to the graphic. I have the next data frame:
Name x
1 33374984
2 33209955
3 30250000
4 25897811
5 14995677
6 8200000
7 7878540
8 7571275
9 7188469
10 6756588
11 6592099
12 6538284
13 6248395
14 6116285
15 6045148
16 5745422
17 5670931
18 5494094
19 5282317
20 4695526
21 4641434
22 4595977
23 4572251
24 4535916
25 4506470
26 4445051
27 4444925
28 4393253
29 4383303
30 4349352
31 4278734
32 4134470
33 4102491
34 4010763
35 3589238
36 3573466
37 3524591
38 3497869
39 3340105
40 3292353
41 3234672
42 3217727
43 3202363
44 3170645
45 3119682
46 3092471
47 3033967
48 3000145
49 2994306
50 2943389
51 2940969
52 2936803
53 2923212
54 2884965
55 2863525
56 2846172
57 2839577
58 2819895
59 2809083
60 2804884
61 2752059
62 2750962
63 2740161
64 2718946
65 2580859
66 2580822
67 2553712
68 2490213
69 2425135
70 2406000
71 2405486
72 2387143
73 2384597
74 2381402
75 2372381
76 2308623
77 2299046
78 2287879
79 2260205
80 2245436
81 2208582
82 2203883
83 2176060
84 2169769
85 2136766
86 2121242
87 2115891
88 2106713
89 2084986
90 2049367
91 2031410
92 2023409
93 2015622
94 1999140
95 1901045
96 1900000
97 1891783
98 1863410
99 1859789
100 1851046
I made a simple ggplot bar graphic using this code:
png(filename = "Delta.png", width = 800, height = 1000)
Delta=ggplot(DeltaPBO)+
geom_bar(aes(x=Name,y=x,fill=x),
stat='identity',position='dodge')+theme(axis.text.x=element_text(angle=80,hjust=1,vjust=1,colour="grey20",face="bold",size=12),axis.text.y=element_text(colour="grey20",face="bold",hjust=1,vjust=0.8),axis.title.x=element_text(colour="grey20",face="bold",size=15),axis.title.y=element_text(colour="grey20",face="bold",size=15))+xlab('Cliente')+ylab('Total')+ ggtitle("My graph")+theme(plot.title = element_text(lineheight=3, face="bold", color="black", size=29))
print(Delta)
dev.off()
And I got this graph:
My problem is the next x is in descending order and I want to show in the graphic different bar colors for each 25 observations, for example observation 1 to 25 should be in yellow because they have higher values than observations 26 to 50; observations 26 to 50 should have green colour due to they are higher than observations 51 to 75; observations 51 to 75 should have blue colour because they are higher than observations 76 to 100 and finally observations 76 to 100 should have red colour because they have the minor values.
I would like to make this in R but I don't have enough knowledge in ggplot. Thanks for your help.
You can add new variable that contains grouping names for every 25 observations in your data. Then use this new variable for the fill of bars. With scale_fill_manual() set fill colors as you need.
DeltaPBO$group<-rep(c("A","B","C","D"),each=25)
ggplot(DeltaPBO)+
geom_bar(aes(x=Name,y=x,fill=as.factor(group)),
stat='identity',position='dodge')+
scale_fill_manual(values=c("yellow","green","blue","red"))

In R: Indexing vectors by boolean comparison of a value in range: index==c(min : max)

In R, let's say we have a vector
area = c(rep(c(26:30), 5), rep(c(500:504), 5), rep(c(550:554), 5), rep(c(76:80), 5)) and another vector yield = c(1:100).
Now, say I want to index like so:
> yield[area==27]
[1] 2 7 12 17 22
> yield[area==501]
[1] 27 32 37 42 47
No problem, right? But weird things start happening when I try to index it by using c(A, B). (and even weirder when I try c(min:max) ...)
> yield[area==c(27,501)]
[1] 7 17 32 42
What I'm expecting is of course the instances that are present in both of the other examples, not just some weird combination of them. This works when I can use the pipe OR operator:
> yield[area==27 | area==501]
[1] 2 7 12 17 22 27 32 37 42 47
But what if I'm working with a range? Say I want index it by the range c(27:503)? In my real example there are a lot more data points and ranges, so it makes more sense, please don't suggest I do it by hand, which would essentially mean:
yield[area==27 | area==28 | area==29 | ... | area==303 | ... | area==500 | area==501]
There must be a better way...
You want to use %in%. Also notice that c(27:503) and 27:503 yield the same object.
> yield[area %in% 27:503]
[1] 2 3 4 5 7 8 9 10 12 13 14 15 17
[14] 18 19 20 22 23 24 25 26 27 28 29 31 32
[27] 33 34 36 37 38 39 41 42 43 44 46 47 48
[40] 49 76 77 78 79 80 81 82 83 84 85 86 87
[53] 88 89 90 91 92 93 94 95 96 97 98 99 100
Why not use subset?
subset(yield, area > 26 & area < 504) ## for indexes
subset(area, area > 26 & area < 504) ## for values

Resources