How to remove duplicate x-axis labels in R - r

I am trying to obtain a barplot representing mean percentage of coloration (valores) grouped both by sex and size intervals (class). However, labels in the x-axis appear duplicated. I would like to get one single label ("50-55" for the first and second columns together, "55-60" for the third and fourth columns together, and so on) for each class level. How could I do this?
Here is my code:
par(mar=c(7,4,4,2)+0.1)
class<-factor(coloration$clase.2,levels=c("50-55","55-60","60-65","65-70","70-75","75-80"))
sex<-factor(coloration$sexo,levels=c("M","H"))
valores<-coloration$perc.greenblue
graf<-barplot(tapply(valores,list(sex,class),mean),beside=T,axes=F,ylim=c(0,50),col=c(grey.colors(2)),axisnames=F ,xlab=("Sex and size"),ylab=("% mean coloration"),las=1)
axis(2,at=c(0,5,10,15,20,25,30,35,40,45,50),labels=c(0,5,10,15,20,25,30,35,40,45,50),las=1)
labs<-as.character(class)
text(graf,par("usr")[3]-0.25,srt=0,adj = c(0,2),labels=labs,xpd=T,cex=1)
legend(locator(1),c("Adult males","Adult females"),fill=c(grey.colors(2)),bty="n")
EDIT: here's some reproducible code:
structure(list(edad = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ADU", class = "factor"),
sexo = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("H", "M"), class = "factor"),
clase.2 = structure(c(2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L,
6L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L), .Label = c("50-55",
"55-60", "60-65", "65-70", "70-75", "75-80"), class = "factor"),
perc.greenblue = c(0.09, 0.32, 12.8, 94.32, 34.83, 0.04,
45.83, 12.34, 0.75, 34.82, 0.5, 0.05, 3.46, 0, 1.72, 0.07,
0.09, 0.2)), row.names = c(9L, 10L, 12L, 13L, 48L, 49L, 109L,
110L, 194L, 195L, 263L, 264L, 266L, 267L, 332L, 333L, 408L, 409L
), class = "data.frame")

Related

ggplot2 loop graph with conditional subsets

Data description:
I have a data set that is in long format with multiple different grouping variables (in data example: StandID and simID)
What I am trying to do:
I need to create simple scatter plots (x=predicted, y=observed) from this dataset for multiple columns based on a unique grouping variable.
An example of what I am trying to do using just standard plot is
obs=subset(example,simID=="OBS_OBS_OBS")
csfnw=example[example$simID== "CS_F_NW",]
plot(obs$X1HR,csfnw$X1HR)
I would need to do this for all simID and columns 9-14. (12 graphs total from data example)
What I have tried:
The problem I am running into is the y axis needs to remain the same, while cycling through the different subsets for the x axis.
I will admit up front, I have no idea what would be the best approach for this... I thought this would be easy for a split second because the data is already in long format and I would just be pointing to a subset of the data.
1) My original approach was to try and just splice up the data so that each simID had its own data frame, and compare it against the observation dataframe but I don't know how I would then pass it to ggplot.
2) My second idea was to make some kind of makeGraph function containing all the aesthetics I wanted essentially and use some kind of apply on it to pass everything through the function, but I could get neither to work.
makePlot=function(dat,x,y) {
ggplot(data=dat,aes(x=x,y=y))+geom_point(shape=Treat)+theme_bw()
}
What I could get to work was just breaking down the dataframe into the vectors of the variables I would then pass to some kind of loop/apply
sims=levels(example$simID)
sims2=sims[sims != "OBS_OBS_OBS"]
fuel_classes=colnames(example)[9:14]
Thank you
Data example:
example=structure(list(Year = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L), .Label = c("2001", "2002", "2003", "2004", "2005",
"2013", "2014", "2015"), class = "factor"), StandID = structure(c(10L,
2L, 6L, 22L, 14L, 18L, 34L, 26L, 30L, 10L, 2L, 6L, 22L, 14L,
18L, 34L, 26L, 30L, 10L, 2L, 6L, 22L, 14L, 18L, 34L, 26L, 30L
), .Label = c("1NB", "1NC", "1NT", "1NTB", "1RB", "1RC", "1RT",
"1RTB", "1SB", "1SC", "1ST", "1STB", "2NB", "2NC", "2NT", "2NTB",
"2RB", "2RC", "2RT", "2RTB", "2SB", "2SC", "2ST", "2STB", "3NB",
"3NC", "3NT", "3NTB", "3RB", "3RC", "3RT", "3RTB", "3SB", "3SC",
"3ST", "3STB"), class = "factor"), Block = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), Aspect = structure(c(3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L), .Label = c("N", "R", "S"), class = "factor"),
Treat = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("B", "C", "T", "TB"), class = "factor"),
Variant = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("CS", "OBS", "SN"), class = "factor"),
Fuels = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("F", "NF", "OBS"), class = "factor"),
Weather = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("NW", "OBS", "W"), class = "factor"),
X1HR = c(0.321666667, 0.177777778, 0.216111111, 0.280555556,
0.255555556, 0.251666667, 0.296666667, 0.231111111, 0.22,
0.27556628, 0.298042506, 0.440185249, 0.36150676, 0.398630172,
0.367523015, 0.345717251, 0.349305987, 0.412227929, 0.242860824,
0.258737177, 0.394024998, 0.287317872, 0.321927488, 0.281322986,
0.313588411, 0.303123146, 0.383658946), X10HR = c(0.440555556,
0.32, 0.266666667, 0.292222222, 0.496666667, 0.334444444,
0.564444444, 0.424444444, 0.432777778, 0.775042951, 0.832148314,
1.08174026, 1.023838878, 0.976997674, 0.844206274, 0.929837704,
1.0527215, 1.089246511, 0.88642776, 0.920596302, 1.209707737,
1.083737493, 1.077612877, 0.92481339, 1.041637182, 1.149550319,
1.229776621), X100HR = c(0.953888889, 1.379444444, 0.881666667,
1.640555556, 2.321666667, 1.122222222, 1.907777778, 1.633888889,
1.208333333, 1.832724094, 2.149356842, 2.364475727, 2.493232965,
2.262988567, 1.903909683, 2.135747433, 2.256677628, 2.288722038,
1.997704744, 2.087135553, 2.524872541, 2.34671092, 2.338253498,
2.06796217, 2.176314831, 2.580271006, 2.857197046), X1000HR = c(4.766666667,
8.342222222, 3.803333333, 8.057777778, 10.11444444, 6.931111111,
6.980555556, 13.20611111, 1.853333333, 3.389177084, 4.915714741,
2.795267582, 2.48227787, 2.218413353, 1.64684248, 2.716156483,
2.913746119, 2.238629341, 3.449863434, 3.432626724, 3.617531776,
3.641639471, 3.453454971, 3.176793337, 3.459602833, 3.871166945,
2.683447838), LITTER = c(2.4, 2.219444444, 2.772222222, 2.596666667,
2.693888889, 2.226111111, 2.552222222, 3.109444444, 2.963333333,
2.882233381, 3.025934696, 3.174396992, 3.291081667, 2.897673607,
2.737119675, 2.987895727, 3.679605484, 2.769756079, 2.882241249,
3.02594161, 3.174404144, 3.291091681, 2.897681713, 2.737129688,
2.987901449, 3.679611444, 2.769766569), DUFF = c(1.483333333,
1.723888889, 0.901666667, 1.520555556, 1.49, 1.366111111,
0.551666667, 1.056111111, 0.786111111, 2.034614563, 2.349547148,
1.685223818, 2.301301956, 2.609308243, 2.21895647, 2.043699026,
2.142618418, 0.953421116, 4.968493462, 4.990526676, 5.012362003,
5.023665905, 4.974074364, 4.947199821, 4.976779461, 5.082509995,
3.55211544), simID = structure(c(5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("CS_F_NW", "CS_F_W",
"CS_NF_NW", "CS_NF_W", "OBS_OBS_OBS", "SN_F_NW", "SN_F_W",
"SN_NF_NW", "SN_NF_W"), class = "factor")), .Names = c("Year",
"StandID", "Block", "Aspect", "Treat", "Variant", "Fuels", "Weather",
"X1HR", "X10HR", "X100HR", "X1000HR", "LITTER", "DUFF", "simID"
), row.names = c(37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 127L, 128L, 129L,
130L, 131L, 132L, 133L, 134L, 135L), class = "data.frame")
You were actually on the right track. If all plots are the same, just make one function and then use loops to loop over the subsets. For your example this can be done like this:
library(ggplot2)
# the plot function
plotFun = function(dat, title) {
ggplot(data=dat) +
geom_point(aes(x = x, y = y), shape=18) +
ggtitle(title) +
theme_bw()
}
# columns of interest
colIdx = 9:14
# split on all values of simID
dfList = split(example, example$simID)
# simID has never appearing factors. These are removed
dfList = dfList[lapply(dfList, nrow) != 0]
# make empty array for saving plots
plotList = array(list(), dim = c(length(dfList), length(dfList), length(colIdx)),
dimnames = list(names(dfList), names(dfList), names(example)[colIdx]))
# the first two loops loop over all unique combinations of dfList
for (i in 2:length(dfList)) {
for (j in 1:(i-1)) {
# loop over target variables
for (k in seq_along(colIdx)) {
# store variables to plot in a temporary dataframe
tempDf = data.frame(x = dfList[[i]][, colIdx[k]],
y = dfList[[j]][, colIdx[k]])
# add a title so we can see in the plot what is plotted vs what
title = paste0(names(dfList)[i], ":", names(dfList[[i]])[colIdx[k]], " VS ",
names(dfList)[j], ":", names(dfList[[j]])[colIdx[k]])
# make and save plot
plotList[[i, j, k]] = plotFun(tempDf, title)
}
}
}
# call the plots like this
plotList[[2, 1, 4]]
# Note that we only filled the lower triangle of combinations
# therefore indexing with [[1, 1, 1]] just returns NULL
plotList[, , 1]
This process can probably be more optimized, but when creating graphs I would go for clarity above speed since speed usually isn't an issue.

Faceting bars in ggplot2

I have this problem: I want to build a stacked bar plot with the faceting capabilities, so I can compare the distribution of frequencies for five common categories, within two different objects, separated according to three groups. I have six objects, five categories and three groups. The problem is that each group has only two different and exclusive objects to plot, but so far I can only produce a plot in which the six objects are plotted across the three groups. This is not optimal, since for each group I have four objects with no data.
Is it possible to plot just two objects for each group with the faceting capabilities?
EDITED
This is my data:
structure(list(Face = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L), .Label = c("LGH002", "LGH003", "LGM009",
"SCM018", "VAH022", "VAM028"), class = "factor"), Race = structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
.Label = c("1. Amerindian", "2. White", "3. Mestizo", "4. Other races",
"5. Cannot tell"), class = "factor"), Count = c(19L, 0L, 13L, 8L, 0L, 2L,
7L, 23L, 6L, 2L, 1L, 1L, 29L, 6L, 3L, 29L, 0L, 11L, 0L, 0L, 0L, 38L, 1L, 0L,
1L, 0L, 30L, 9L, 0L, 1L), Density = c(0.475, 0, 0.325, 0.2, 0,
0.05, 0.175, 0.575, 0.15, 0.05, 0.025, 0.025, 0.725, 0.15,
0.075, 0.725, 0, 0.275, 0, 0, 0, 0.95, 0.025, 0, 0.025, 0,
0.75, 0.225, 0, 0.025), School = structure(c(1L, 1L, 1L,
1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Municipal",
"Private Fee-Paying", "Private-Voucher"), class = "factor")),
.Names =c("Face", "Race", "Count", "Density", "School"),
class = "data.frame", row.names = c(NA, -30L))
This is the code I'm using to build the plot:
P <- ggplot(data = races.df, aes(x = Face, y = Density, fill = Race)) +
geom_bar(stat="identity") +
scale_y_continuous(labels=percent)
P + facet_grid(School ~ ., scales="free") + coord_flip()
As you can imagine, I only want to see the x-values "SCM018" and "LGH002" in "Municipal"; "LGM009" and "LGH003" in "Private-Voucher"; and "VAH022" and "VAM028" in "Private Fee-Paying" (only two objects per group). Is it possible? Any help?
All the best,
Mauricio.

Reshape a large matrix with missing values and multiple vars of interest [duplicate]

This question already has answers here:
Convert data from long format to wide format with multiple measure columns
(6 answers)
Closed 4 years ago.
I need to reorganize a large dataset into a specific format for further analysis. Right now the data are in long format, with multiple records through time for each point. I need to reshape the data so that each point has a single record, but it will add many new columns of the time-specific data. I’ve looked at previous similar posts but I need to ultimately convert several of the current variables into columns, and I can’t find an example of such. Is there a way to accomplish this in a single reshape, or will I have to do several and then concatenate the new columns back together? Another wrinkle before I post the example is that not all points were sampled at each time-step, so I need those values to show up as NA. For example, (see data below) SitePoint A1 was not sampled at all in 2012, SitePoint A10 was not sampled during the first round in 2012, but K83 was sampled all nine times.
mydatain <- structure(list(SitePoint = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L), .Label = c("A1", "A10", "K145", "K83", "T15",
"T213"), class = "factor"), Year_Rotation = structure(c(1L, 2L,
3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 1L, 2L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 1L, 7L), .Label = c("2010_1", "2010_2",
"2010_3", "2011_1", "2011_2", "2011_3", "2012_1", "2012_2", "2012_3"
), class = "factor"), MR_Fire = structure(c(5L, 6L, 6L, 2L, 9L,
9L, 5L, 6L, 6L, 2L, 9L, 9L, 7L, 8L, 16L, 17L, 21L, 22L, 23L,
25L, 3L, 4L, 10L, 11L, 12L, 13L, 14L, 15L, 18L, 19L, 20L, 1L,
2L, 2L, 5L, 6L, 6L, 11L, 11L, 12L, 7L, 24L), .Label = c("0",
"1", "10", "11", "12", "13", "14", "15", "2", "23", "24", "25",
"35", "36", "37", "39", "40", "47", "48", "49", "51", "52", "53",
"8", "9"), class = "factor"), fire_seas = structure(c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L), .Label = c("dry", "fire", "wet"
), class = "factor"), OptTSF = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 1L)), .Names = c("SitePoint", "Year_Rotation", "MR_Fire",
"fire_seas", "OptTSF"), row.names = c(31L, 32L, 33L, 34L, 35L,
36L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 10543L, 10544L,
10545L, 10546L, 10547L, 10548L, 10549L, 10550L, 14988L, 14989L,
14990L, 14991L, 14992L, 14993L, 14994L, 14995L, 14996L, 17370L,
17371L, 17372L, 17373L, 17374L, 17375L, 17376L, 17377L, 17378L,
19353L, 19354L), class = "data.frame")
Ultimately I need something like this:
myfinal <- structure(list(SitePoint = structure(1:6, .Label = c("A1", "A10",
"K145", "K83", "T15", "T213"), class = "factor"), MR_Fire_2010_1 = c(12L,
12L, 39L, 23L, 0L, 14L), MR_Fire_2010_2 = c(13L, 13L, 40L, 24L,
1L, NA), MR_Fire_2010_3 = c(13L, 13L, NA, 25L, 1L, NA), MR_Fire_2011_1 = c(1L,
1L, 51L, 35L, 12L, NA), MR_Fire_2011_2 = c(2L, 2L, 52L, 36L,
13L, NA), MR_Fire_2011_3 = c(2L, 2L, 53L, 37L, 13L, NA), MR_Fire_2012_1 = c(NA,
NA, 9L, 47L, 24L, 8L), MR_Fire_2012_2 = c(NA, 14L, 10L, 48L,
24L, NA), MR_Fire_2012_3 = c(NA, 15L, 11L, 49L, 25L, NA), season_2010_1 = structure(c(2L,
2L, 1L, 2L, 2L, 1L), .Label = c("dry", "fire"), class = "factor"),
season_2010_2 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2010_3 = structure(c(1L,
1L, NA, 1L, 1L, NA), .Label = "fire", class = "factor"),
season_2011_1 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2011_2 = structure(c(2L,
2L, 1L, 2L, 2L, NA), .Label = c("dry", "fire"), class = "factor"),
season_2011_3 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2012_1 = structure(c(NA,
NA, 2L, 1L, 1L, 2L), .Label = c("fire", "wet"), class = "factor"),
season_2012_2 = structure(c(NA, 1L, 2L, 1L, 1L, NA), .Label = c("fire",
"wet"), class = "factor"), season_2012_3 = structure(c(NA,
1L, 2L, 1L, 1L, NA), .Label = c("fire", "wet"), class = "factor"),
OptTSF_2010_1 = c(1L, 1L, 0L, 1L, 1L, 1L), OptTSF_2010_2 = c(1L,
1L, 0L, 1L, 1L, NA), OptTSF_2010_3 = c(1L, 1L, NA, 1L, 1L,
NA), OptTSF_2011_1 = c(1L, 1L, 0L, 0L, 1L, NA), OptTSF_2011_2 = c(1L,
1L, 0L, 0L, 1L, NA), OptTSF_2011_3 = c(1L, 1L, 0L, 0L, 1L,
NA), OptTSF_2012_1 = c(NA, NA, 1L, 0L, 0L, 1L), OptTSF_2012_2 = c(NA,
1L, 1L, 0L, 0L, NA), OptTSF_2012_3 = c(NA, 1L, 1L, 0L, 0L,
NA)), .Names = c("SitePoint", "MR_Fire_2010_1", "MR_Fire_2010_2",
"MR_Fire_2010_3", "MR_Fire_2011_1", "MR_Fire_2011_2", "MR_Fire_2011_3",
"MR_Fire_2012_1", "MR_Fire_2012_2", "MR_Fire_2012_3", "season_2010_1",
"season_2010_2", "season_2010_3", "season_2011_1", "season_2011_2",
"season_2011_3", "season_2012_1", "season_2012_2", "season_2012_3",
"OptTSF_2010_1", "OptTSF_2010_2", "OptTSF_2010_3", "OptTSF_2011_1",
"OptTSF_2011_2", "OptTSF_2011_3", "OptTSF_2012_1", "OptTSF_2012_2",
"OptTSF_2012_3"), class = "data.frame", row.names = c(NA, -6L
))
The actual dataset is about 23656 records X 15 variables, so doing it by hand is likely to cause major headaches and potential for mistakes. Any help or suggestions are appreciated. If this has been answered elsewhere, apologies. I couldn’t find anything directly applicable; everything seemed to related to three columns and only one of those being extracted as new variables. Thanks.
SP
dcast from the devel version of data.table i.e., v1.9.5 can cast multiple columns simultaneously. It can be installed from here.
library(data.table) ## v1.9.5+
dcast(setDT(mydatain), SitePoint~Year_Rotation,
value.var=c('MR_Fire', 'fire_seas', 'OptTSF'))
You can use reshape to change the structure of your dataframe from long to wide using the following code:
reshape(mydatain,timevar="Year_Rotation",idvar="SitePoint",direction="wide")

How to use multiple symbols in plots based on different variables in R?

I have created a PCA for measurements collected on individual from four locations placed on four substrates with three replicates. I have the sex (male or female)and "karyotype" (factor with three possible categories) and the calculated the first two PC scores for each individual.
I would like to make a plot where male and female have different symbols and the colour of the symbols is dependent on the karotype. I have created a plot with the code below that gives me one symbol colour coded for the three karyotypes and put 95% confidence elispses around the males and females.
How can I change the symbol for each sex and keeping the colouring dependent on the karytype? I would also like to have this reflected in the legend.
One last question. Is it possible to add an arrow for each PC (not each individual) from the origin similar to those found in ordination plots?
Sample Data:
test <- structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Kampinge", "Kaseberga", "Molle", "Steninge"
), class = "factor"), Substrate = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle",
"Steninge"), class = "factor"), Replicate = structure(c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"),
Sex = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Female", "Male"), class = "factor"), Karyotype = structure(c(3L,
4L, 3L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 4L,
3L, 3L, 4L, 4L, 3L, 4L, 3L, 4L, 3L), .Label = c("", "BB",
"BD", "DD"), class = "factor"), Wing_Length = c(1439L, 1224L,
1558L, 1508L, 1286L, 1560L, 1377L, 1486L, 1638L, 1475L, 1703L,
1726L, 1668L, 1405L, 1737L, 1419L, 1530L, 1508L, 1525L, 1326L,
1609L, 1357L, 1830L, 1476L, 1661L), Leg_Length = c(465L,
357L, 610L, 415L, 343L, 560L, 435L, 390L, 425L, 514L, 693L,
695L, 657L, 454L, 661L, 382L, 431L, 531L, 435L, 387L, 407L,
414L, 752L, 524L, 650L), Development_Time = c(15, 15, 12,
12, 12, 12, 12, 12, 12, 15, 15, 15, 15, 15, 15, 15, 11, 12,
14, 12, 14, 14, 14, 11, 11), PC1 = c(-281.031806232855, -515.247908786317,
-96.7283446465637, -260.171340782501, -476.664849753781,
-127.267190895631, -347.839240839062, -293.08530374415, -154.026702195308,
-221.98257463847, 67.7504074590983, 86.6778734586525, 17.8073498265326,
-314.171132928964, 73.3068216627556, -349.616320093329, -233.030545551831,
-185.761623361004, -234.30046275676, -417.754317941649, -187.820500930148,
-376.653043663908, 203.025275308178, -214.80078992031, 7.94703091626344
), PC2 = c(-78.3082792875783, -133.370219905995, -113.211488986839,
4.31036861466361, -82.8593541869054, -73.5708675263244, -95.0643731443612,
9.37702847686542, 80.0290301136235, -92.8061497557789, -83.8731164047719,
-70.6537733486393, -78.706783632851, -91.6793310834752, -37.5144466525303,
-27.4637667171696, 6.14809390611532, -84.6794844768708, -0.127837123829732,
-90.9556028004192, 75.2353710655562, -91.7834027435658, -47.669385541585,
-99.8362257341741, -77.8269478596591)), .Names = c("Location",
"Substrate", "Replicate", "Sex", "Karyotype", "Wing_Length",
"Leg_Length", "Development_Time", "PC1", "PC2"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 30L, 31L), class = "data.frame")
## Plot
par(mfrow=c(1,1), mar=c(4,4,2,1), pty = "s")
plot(test$PC1, test$PC2, xlab="PC1", ylab="PC2", pch=16, col=as.numeric(test[,"Karyotype"]),
xlim = c(-1000, 1000), ylim = c(-250, 250), las=1, cex.lab = 1.5, cex.axis = 1.25, main = NULL)
ordiellipse(test[,9:10], test$Sex, conf=0.95, col="black", cex=1.75, label=TRUE)
legend("bottomright", pch=16, col=unique(as.numeric(test[,"Karyotype"])), legend=unique(test[,"Karyotype"]), cex = 1.75)
Replace your pch plot argument by something like :
pch=ifelse(test$Sex=='Male',15,19)
Try with ggplot:
library(ggplot2)
ggplot(test, aes(x=PC1, y=PC2, color=Karyotype, shape=Sex, group=Sex))+geom_point(size=5)+stat_ellipse()

New variable based on conditional arithmetic by group

I have a data.frame df where I want to create a new variable that is the proportion of another by group. That is for each Species ID Plot Sub paring I'd like to find the proportion of Area by Type. If Type = 0, then PropArea == 1, if Type does not equal 0 (i.e. 1 or 2), then, for example, PropArea = Area (Type 1) / Area (Type 0). An sample data.frame is below. I know how to do this with if statements in excel, but was hoping to find a way to do this within r.
df <- structure(list(Species = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("BFGR", "RNNN"), class = "factor"),
ID = c(201L, 201L, 201L, 201L, 201L, 201L, 219L, 219L, 219L,
219L, 219L, 219L, 220L, 220L), Plot = c(1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Sub = c(2L, 2L, 2L,
2L, 3L, 3L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L), Type = c(0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 2L), Area = c(0.78,
0.445, 0.023, 0.015, 0.79, 0.235, 1.29, 1.29, 2.555, 1.065,
1.365, 1.365, 2.678, 1.305), PropArea = c(1, 0.570512821,
1, 0.652173913, 1, 0.297468354, 1, 1, 1, 0.416829746, 1,
1, 1, 0.487303958)), .Names = c("Species", "ID", "Plot",
"Sub", "Type", "Area", "PropArea"), class = "data.frame", row.names = c(NA,
-14L))
## A more complete data set
df_more <- structure(list(Species = structure(c(3L, 3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ACRU", "DIVI",
"LIST", "LITU", "PEPA", "QULA"), class = "factor"), ID = c(205L,
205L, 205L, 205L, 205L, 205L, 219L, 219L, 219L, 219L, 219L, 219L,
219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L,
219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L, 219L,
219L, 219L, 221L, 221L, 222L, 222L, 222L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L,
222L, 222L, 222L, 222L, 222L, 222L, 227L, 227L, 227L, 227L, 227L,
227L, 227L, 227L, 227L, 227L, 227L, 227L, 228L, 228L, 228L, 228L,
228L, 228L, 228L, 228L, 228L, 228L, 228L, 228L, 228L, 228L, 228L,
228L, 228L, 228L, 228L, 228L, 228L, 228L, 228L, 229L, 229L, 229L,
229L, 229L, 229L, 229L, 229L, 229L, 229L, 229L, 229L, 229L, 229L
), Plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), Sub = c(2L, 2L, 3L, 3L, 4L, 4L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L, 11L, 12L, 12L, 13L, 13L, 14L, 14L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L,
6L, 6L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 7L, 8L,
8L, 8L, 9L, 9L, 9L, 10L, 10L, 11L, 11L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L, 6L, 7L), Type = c(0L, 1L, 0L, 1L, 0L,
1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 1L, 1L, 2L, 2L, 0L, 0L, 1L, 1L, 2L, 2L, 0L, 0L, 1L,
1L, 2L, 2L, 0L, 0L, 1L, 1L, 2L, 2L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 2L,
0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 2L, 0L,
1L, 2L, 0L, 1L, 1L, 2L, 0L, 1L, 2L, 0L), Area = c(5.67, 3.24,
6.65, 4.26, 10.24, 1.31, 1.12, 1.23, 1.23, 0.88, 0.86, 0.86,
0.11, 1.36, 1.36, 1.17, 2.33, 2.33, 1.15, 1.15, 1.23, 1.23, 1.27,
1.27, 0.97, 0.97, 1.39, 1.39, 1.07, 1.07, 1.49, 1.49, 1.33, 1.33,
2.35, 2.35, 1.8, 1.8, 7.5, 7.42, 6.35, 6.82, 0.37, 0.48, 8.67,
8.57, 5.47, 5.66, 2.35, 2.42, 11.99, 12.8, 6.18, 6.19, 2.56,
2.71, 25.77, 25.6, 16.01, 16.56, 3.36, 3.35, 1.08, 0.12, 5.34,
5.34, 6.15, 6.15, 6.93, 6.93, 8.91, 8.91, 10.91, 10.91, 2.31,
1.21, 3.2, 2.42, 2.41, 2.41, 2.32, 2.32, 2.48, 2.48, 0.7, 2.89,
2.89, 1.27, 3.66, 3.66, 0.75, 8, 8, 8.85, 8.85, 11.22, 11.22,
5.08, 2.96, 0.22, 5, 3.01, 0.92, 6.94, 3.88, 4.48, 1.18, 9.03,
4.19, 0.5, 9.97)), .Names = c("Species", "ID", "Plot", "Sub",
"Type", "Area"), row.names = c(NA, 111L), class = "data.frame")
As long as you're OK with your data.frame being resorted, this should work:
library(plyr)
df2 <- ddply(df_more, .(Species, ID, Plot, Sub), function(groupdf) {
denominator <- groupdf[groupdf$Type==0,"Area"]
if(length(denominator) == 0) denominator <- groupdf[groupdf$Type==1,"Area"]
transform(groupdf, PropArea=Area/denominator)
})
And if you want to keep the same ordering, add these lines:
df1 <- df2[match(
interaction(df[c("Species", "ID", "Plot", "Sub", "Type")]),
interaction(df2[c("Species", "ID", "Plot", "Sub", "Type")])),]
If you can guarantee alternation of 0s with 1s and 2s like in your example, you could use ifelse:
df$PropArea <- ifelse(df$Type == 0, 1, df$Area / c(1, df$Area[-nrow(df)]))
There are duplicates in the df_more dataset.
E.g. DIVI/22/1/2/0 is having an area of both 7.50 and 7.42.
This will lead to errors.

Resources