How do you draw a linear trend line over geom_bar in ggplot2?

I need to create a stacked ggplot bar plot from this data set, with a linear trend line drawn over it:
dput(t)
structure(list(Date = structure(c(16436, 16436, 16436, 16467,
16467, 16467, 16467, 16467, 16679, 16679, 16679, 16679, 16679
), class = "Date"), Applicatio = structure(c(4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 3L, 4L, 1L, 2L, 3L), .Label = c("DB", "Opt",
"Tom", "Web"), class = "factor"), Code = structure(c(1L, 2L,
4L, 3L, 1L, 2L, 4L, 3L, 3L, 1L, 2L, 4L, 3L), .Label = c("ch",
"db", "tt", "zz"), class = "factor"), m = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("2015-01",
"2015-02", "2015-09"), class = "factor"), count = c(1L, 3L, 1L,
4L, 1L, 7L, 1L, 9L, 1L, 6L, 4L, 7L, 9L), Total = c(1L, 12L, 1L,
2L, 1L, 20L, 1L, 7L, 7L, 9L, 50L, 3L, 6L)), .Names = c("Date",
"Applicatio", "Code", "m", "count", "Total"), row.names = c(NA,
-13L), class = "data.frame")
I am trying this:
ggplot(subset(t, Date> as.Date(c("2015-01-01", format="%Y-%m-%d"))), aes(m,fill=Code))+geom_bar()+
geom_smooth(aes(m,Total),method="lm", se=FALSE)+
guides(colour=FALSE)

I am not entirely sure what you are trying to achieve, but it looks like you want this:
ggplot(subset(t, Date > as.Date("2015-01-01", format="%Y-%m-%d")), aes(m,fill=Code))+geom_bar()+
geom_smooth(aes(m,Total,group=1),method="lm", se=FALSE)+
guides(colour=FALSE)
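Basically, the c() call inside subset() was not needed: it turned format="%Y-%m-%d" into a second element of the date vector instead of passing it to as.Date() as an argument. You also needed group=1 inside aes() of geom_smooth(), as mentioned by the warning. Because m is a factor, the rows are otherwise split into separate groups, and no single line can be fitted across the bars.
So yes, you can draw a linear fit over geom_bar.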
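As a rough illustration of what group=1 changes, the base R sketch below mimics the fit on toy data. The data values and the pos column are made up for the example, and it assumes, as far as I understand ggplot2's discrete scales, that factor levels are placed at integer positions 1, 2, 3 before the regression is fitted.

```r
# Toy data (hypothetical values): month stored as a factor, as in the question.
d <- data.frame(
  m     = factor(c("2015-01", "2015-01", "2015-02", "2015-02", "2015-09", "2015-09")),
  Total = c(1, 12, 2, 20, 7, 50)
)

# A discrete x axis places the factor levels at integer positions 1, 2, 3.
d$pos <- as.integer(d$m)

# With group = 1 every row belongs to one group, so a single regression is fitted.
fit <- lm(Total ~ pos, data = d)
coef(fit)

# Grouped by factor level instead, each group sits at a single x position,
# so there is nothing to fit a line through.
split(d$pos, d$m)
```

Note that the months are treated as equally spaced here, so 2015-09 sits right next to 2015-02; use the Date column on the x axis if the real time gaps matter.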

Related

cld() output has a wrong order of factor levels

I am using the cld() function with emmeans in R, but the order of the factor levels in the output differs from the order I set. Before calling cld(), the by.years output is in the desired order, but after cld() the output is in the alphabetical order Light - Moderate - No. I also checked cld.years$Grazing.intensity, and the levels are correct. Is there a way to specify the order of factor levels in the cld() output? Any help is appreciated.
# sample data
plants <- structure(list(Grazing.intensity = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 3L), .Label = c("Light-grazing", "Moderate-grazing", "No-grazing"), class = "factor"), Grazing.intensity1 = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 3L), .Label = c("LG", "MG", "NG"), class = "factor"), Years = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("Dry-year", "Wet-year"), class = "factor"), Month = structure(c(2L, 2L, 2L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 2L, 2L, 3L), .Label = c("Aug.", "Jul.", "Sept."), class = "factor"), Plots = c(1L, 3L, 8L, 6L, 9L, 7L, 2L, 2L, 10L, 10L, 7L, 7L, 9L, 4L, 2L), Species.richness = c(8L, 6L, 10L, 11L, 9L, 5L, 7L, 13L, 10L, 6L, 5L, 5L, 14L, 8L, 10L)), class = "data.frame", row.names = c(NA, -15L))
# set the order of factor levels
plants$Grazing.intensity <- factor(plants$Grazing.intensity, levels =
c('No-grazing','Light-grazing','Moderate-grazing'))
library(lme4)     # lmer()
library(emmeans)  # emmeans()
library(multcomp) # cld() generic
lmer.mod <- lmer(Species.richness ~ Grazing.intensity*Years + (1|Month), data = plants)
by.years <- emmeans(lmer.mod, specs = ~ Grazing.intensity:Years, by = 'Years', type = "response")
# display cld
cld.years <- cld(by.years, Letters = letters)
This is my first time posting sample data on Stack Overflow, so it may be wrong. I used dput().
I solved the issue. The order changed because cld() displays the levels in increasing order of emmean by default. I set sort = FALSE, and the result was displayed in the original level order. I should have read the documentation more thoroughly.
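A minimal sketch of the same fix on a built-in data set, assuming the emmeans and multcomp packages (plus multcompView, which emmeans uses for the letters) are installed. The model here is a plain lm() on warpbreaks rather than the lmer() model from the question, purely to keep the example small.

```r
library(emmeans)   # emmeans() and the cld() method for its results
library(multcomp)  # provides the cld() generic

# Built-in data set; tension has its levels in the order L, M, H.
mod <- lm(breaks ~ tension, data = warpbreaks)
emm <- emmeans(mod, ~ tension)

# Default: rows are sorted by increasing estimated mean.
cld(emm, Letters = letters)

# sort = FALSE keeps the rows in factor-level order.
res <- cld(emm, Letters = letters, sort = FALSE)
res
```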

Why is assign() behaving oddly in a for() loop with dplyr pipes in R?

I need to apply different functions in a loop to data frames in my Global Environment and save the output of each "run" of the loop in a new data frame whose name includes the initial name.
To this end, I'm using assign() with a for() loop. It works well, except when I use the dplyr pipe %>%. The function itself works, but the name assigned to the output data frame is wrong. How can I fix this issue with %>%? If it cannot be fixed, can I replace assign() with another function?
This works well:
code1:
for(i in unique(table$V1)){
assign(paste0(i, "_target"),table[grepl(i,table$V1),])
}
Explanation: selects the unique entries in column 1 of "table" and subsets the rows matching each entry into a new data frame per entry. Output: the new data frame is named "entry name" + "_target".
This doesn't work well (and I would like to know why):
code2:
for(i in mget(ls(pattern = "_target"))){
assign(paste0(i, "_slim"),data.frame(i %>% group_by(Sample.Name) %>% summarise(Mean_dC=mean(C__))))
}
Explanation: selects all data frames in the Global Environment whose names contain "_target". For each data frame, it computes the mean of the values (C__) for entries with the same Sample.Name. Expected output: the new data frame is named "entry name_target" + "_slim". Actual output: the new data frame contains the means per Sample.Name, but is named "c(seemingly random numbers)_slim".
code2 input:
STA_target <- structure(list(Well = structure(c(8L, 9L, 10L, 21L, 22L, 23L,
33L, 34L, 35L, 46L, 47L, 48L, 58L, 59L, 60L, 73L, 74L, 75L, 85L,
86L, 87L, 97L, 98L, 99L), .Label = c("", "A1", "A10", "A11",
"A12", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "Analysis Type",
"B1", "B10", "B11", "B12", "B2", "B3", "B4", "B5", "B6", "B7",
"B8", "B9", "C1", "C10", "C11", "C12", "C2", "C3", "C4", "C5",
"C6", "C7", "C8", "C9", "Chemistry", "D1", "D10", "D11", "D12",
"D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "E1", "E10",
"E11", "E12", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9",
"Endogenous Control", "Experiment File Name", "Experiment Run End Time",
"F1", "F10", "F11", "F12", "F2", "F3", "F4", "F5", "F6", "F7",
"F8", "F9", "G1", "G10", "G11", "G12", "G2", "G3", "G4", "G5",
"G6", "G7", "G8", "G9", "H1", "H10", "H11", "H12", "H2", "H3",
"H4", "H5", "H6", "H7", "H8", "H9", "Instrument Type", "Passive Reference",
"Reference Sample", "RQ Min/Max Confidence Level", "Well"), class = "factor"),
Sample.Name = c("Control_in", "Control_in", "Control_in",
"Sample2_in", "Sample2_in", "Sample2_in", "Sample5_in", "Sample5_in",
"Sample5_in", "Sample3_in", "Sample3_in", "Sample3_in", "Control_c",
"Control_c", "Control_c", "Sample2_c", "Sample2_c", "Sample2_c",
"Sample3_c", "Sample3_c", "Sample3_c", "Sample5_c", "Sample5_c",
"Sample5_c"), Target.Name = c("STA", "STA", "STA", "STA",
"STA", "STA", "STA", "STA", "STA", "STA", "STA", "STA", "STA",
"STA", "STA", "STA", "STA", "STA", "STA", "STA", "STA", "STA",
"STA", "STA"), Task = structure(c(3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), .Label = c("", "Task", "UNKNOWN"), class = "factor"),
Reporter = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("", "Reporter", "SYBR"), class = "factor"),
Quencher = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("", "None", "Quencher"), class = "factor"),
RQ = structure(c(12L, 12L, 12L, 8L, 8L, 8L, 6L, 6L, 6L, 11L,
11L, 11L, 1L, 1L, 1L, 5L, 5L, 5L, 14L, 14L, 14L, 18L, 18L,
18L), .Label = c("", "0.706286132", "0.714652956", "0.724364996",
"0.7665869", "0.828774512", "0.838611245", "0.846661508",
"0.863589227", "0.896049678", "0.929288268", "1", "1.829339266",
"15.57538891", "17.64183807", "27.67574501", "3.064466953",
"34.78881073", "41.82569504", "8.117406845", "8.884188652",
"RQ"), class = "factor"), RQ.Min = structure(c(9L, 9L, 9L,
7L, 7L, 7L, 8L, 8L, 8L, 10L, 10L, 10L, 1L, 1L, 1L, 2L, 2L,
2L, 21L, 21L, 21L, 17L, 17L, 17L), .Label = c("", "0.032458056",
"0.429091513", "0.460811675", "0.541289926", "0.611138761",
"0.674698055", "0.71383971", "0.742018044", "0.753834546",
"0.772591949", "0.7868222", "0.803419232", "0.820919514",
"0.826185584", "0.989573121", "22.58564949", "27.2142868",
"4.501103401", "4.745172024", "4.843928814", "4.979007244",
"9.076541901", "RQ Min"), class = "factor"), RQ.Max = structure(c(13L,
13L, 13L, 8L, 8L, 8L, 6L, 6L, 6L, 9L, 9L, 9L, 1L, 1L, 1L,
16L, 16L, 16L, 19L, 19L, 19L, 20L, 20L, 20L), .Label = c("",
"0.858568788", "0.910271943", "0.943540215", "0.947846115",
"0.962214947", "0.971821666", "1.062453985", "1.145578504",
"1.162549496", "1.218146205", "1.244680166", "1.347676158",
"14.63914394", "15.85231876", "18.10507202", "20.37916756",
"3.381742954", "50.08181381", "53.58541107", "64.28199768",
"65.58969879", "84.38751984", "RQ Max"), class = "factor"),
C_ = c(25.48042297, 25.4738903, 25.83390617, 25.7304306,
25.78297043, 25.41260529, 25.49670792, 25.52298164, 25.6956234,
25.34812355, 25.51462555, 25.15455437, 0, 0, 0, 32.29237366,
37.10370636, 32.22016525, 29.50172043, 30.18544579, 29.91492081,
25.14842796, 24.89806747, 24.99397278), C_.Mean = c(25.59607506,
25.59607506, 25.59607506, 25.64200401, 25.64200401, 25.64200401,
25.57177162, 25.57177162, 25.57177162, 25.33910179, 25.33910179,
25.33910179, NA, NA, NA, 33.87208176, 33.87208176, 33.87208176,
29.86736107, 29.86736107, 29.86736107, 25.01348877, 25.01348877,
25.01348877), C_.SD = structure(c(21L, 21L, 21L, 20L, 20L,
20L, 12L, 12L, 12L, 19L, 19L, 19L, 1L, 1L, 1L, 31L, 31L,
31L, 23L, 23L, 23L, 14L, 14L, 14L), .Label = c("", "0.039937571",
"0.043110434", "0.049541138", "0.05469643", "0.061177365",
"0.066671595", "0.07365533", "0.079849631", "0.082057081",
"0.095515646", "0.108060829", "0.120047837", "0.126316145",
"0.129658803", "0.130481929", "0.142733917", "0.172286868",
"0.180205062", "0.200392827", "0.205995336", "0.236968249",
"0.344334781", "0.36769405", "0.413046211", "0.445171326",
"0.514641941", "0.640576839", "0.895943522", "0.993181109",
"2.798901796", "C_ SD"), class = "factor"), `_C_` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "_C_"), class = "factor"),
`_C_.Mean` = structure(c(8L, 8L, 8L, 5L, 5L, 5L, 4L, 4L,
4L, 7L, 7L, 7L, 1L, 1L, 1L, 3L, 3L, 3L, 13L, 13L, 13L, 14L,
14L, 14L), .Label = c("", "_C_ Mean", "-0.577166259", "-0.68969661",
"-0.720502198", "-0.776381195", "-0.85484314", "-0.96064502",
"-1.058534026", "-2.04822278", "-2.545912504", "-3.293611526",
"-4.921841145", "-6.081196308", "0.477069855", "1.373315215",
"2.092705965", "2.244637728", "2.251055479", "2.346632004",
"2.456220627", "2.557917356", "2.729323149", "2.746313095"
), class = "factor"), `_C_.SE` = structure(c(13L, 13L, 13L,
11L, 11L, 11L, 6L, 6L, 6L, 9L, 9L, 9L, 1L, 1L, 1L, 24L, 24L,
24L, 21L, 21L, 21L, 15L, 15L, 15L), .Label = c("", "_C_ SE",
"0.042180877", "0.042606823", "0.048373949", "0.077573851",
"0.088320434", "0.102536619", "0.108728357", "0.113733612",
"0.117972165", "0.144372106", "0.155044988", "0.223316222",
"0.224465802", "0.258952528", "0.300881863", "0.306413502",
"0.319273174", "0.579304695", "0.606897891", "0.635279417",
"0.682336032", "1.643036604"), class = "factor"), HK.Control._C_.Mean = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "HK Control _C_ Mean"
), class = "factor"), HK.Control._C_.SE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "HK Control _C_ SE"
), class = "factor"), `__C_` = structure(c(12L, 12L, 12L,
16L, 16L, 16L, 18L, 18L, 18L, 13L, 13L, 13L, 1L, 1L, 1L,
19L, 19L, 19L, 7L, 7L, 7L, 10L, 10L, 10L), .Label = c("",
"__C_", "-0.871322632", "-1.61563623", "-3.021018982", "-3.15124011",
"-3.961196184", "-4.140928745", "-4.790550232", "-5.120551586",
"-5.38631773", "0", "0.105801903", "0.15834935", "0.211582825",
"0.240142822", "0.253925949", "0.27094841", "0.383478791",
"0.465211242", "0.484685272", "0.501675308"), class = "factor"),
Automatic.Ct.Threshold = structure(c(3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("", "Automatic Ct Threshold",
"TRUE"), class = "factor"), Ct.Threshold = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "0.056211855",
"0.208910329", "0.693888608", "0.704941193", "Ct Threshold"
), class = "factor"), Automatic.Baseline = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "Automatic Baseline",
"TRUE"), class = "factor"), Baseline.Start = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "3", "Baseline Start"
), class = "factor"), Baseline.End = structure(c(3L, 3L,
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 13L, 14L, 14L, 8L,
12L, 8L, 6L, 7L, 7L, 3L, 3L, 3L), .Label = c("", "21", "22",
"23", "25", "26", "27", "29", "30", "31", "32", "34", "35",
"39", "Baseline End"), class = "factor"), Efficiency = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "1", "Efficiency"
), class = "factor"), Comments = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Comments"), class = "factor"),
HIGHSD = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("", "HIGHSD", "N", "Y"), class = "factor"),
NOAMP = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("",
"N", "NOAMP", "Y"), class = "factor"), OUTLIERRG = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "N", "OUTLIERRG",
"Y"), class = "factor"), EXPFAIL = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "EXPFAIL", "N", "Y"
), class = "factor")), .Names = c("Well", "Sample.Name",
"Target.Name", "Task", "Reporter", "Quencher", "RQ", "RQ.Min",
"RQ.Max", "C_", "C_.Mean", "C_.SD", "_C_", "_C_.Mean", "_C_.SE",
"HK.Control._C_.Mean", "HK.Control._C_.SE", "__C_", "Automatic.Ct.Threshold",
"Ct.Threshold", "Automatic.Baseline", "Baseline.Start", "Baseline.End",
"Efficiency", "Comments", "HIGHSD", "NOAMP", "OUTLIERRG", "EXPFAIL"
), row.names = c(12L, 13L, 14L, 24L, 25L, 26L, 36L, 37L, 38L,
48L, 49L, 50L, 60L, 61L, 62L, 72L, 73L, 74L, 84L, 85L, 86L, 96L,
97L, 98L), class = "data.frame")
code2 "output":
> dput(`c(8, 9, 10, 21, 22, 23, 33, 34, 35, 46, 47, 48, 58, 59, 60, 73, 74, 75, 85, 86, 87, 97, 98, 99)_slim`)
structure(list(Group.1 = c("Sample2_c", "Sample2_in", "Sample3_c",
"Sample5_in", "Control_c", "Control_in", "Sample5_c", "Sample3_in"
), x = c(33.8720817566667, 25.6420021066667, 29.8673623433333,
25.5717709866667, 0, 25.5960731466667, 25.0134894033333, 25.3391011566667
)), .Names = c("Group.1", "x"), row.names = c(NA, -8L), class = "data.frame")
I don't know if this is really the output because of the given name. But the expected output should be something like that with the correct name: STA_slim
Thank you for your time
First of all, I strongly suggest you avoid assign() in your R code. It's much better to use one of the many mapping/apply functions in R to build related data in lists. Using get/assign is a sign that you are not doing things in a very R-like way.
Your problem really has nothing to do with dplyr; it's about what you are looping over. When you do
for(i in mget(ls(pattern = "_target"))){
assign(paste0(i, "_slim"),data.frame(i %>% group_by(Sample.Name) %>% summarise(Mean_dC=mean(C__))))
}
that i isn't the name of the data frame: because you called mget(), it's the data frame itself. It doesn't make sense to paste that into a new name.
To "fix" this, you could do
for(i in ls(pattern = "_target")){
assign(paste0(i, "_slim"),data.frame(get(i) %>% group_by(Sample.Name) %>% summarise(Mean_dC=mean(C__))))
}
But even then you don't have a column named C__ in your example data set. You have C_ or _C_ or __C_ (what do these names even mean??). So you'd need to fix that.
The better, list-based way would be
library(dplyr)
slim <- lapply(mget(ls(pattern = "_target$")), function(x) {
  x %>% group_by(Sample.Name) %>% summarise(Mean_dC = mean(C_))
})
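To make the name-versus-object distinction concrete, here is a small self-contained sketch. The two data frames and their values are made up, GAP_target is a hypothetical second object, and base R's aggregate() is swapped in for the dplyr pipeline so the example needs no packages.

```r
# Two toy data frames standing in for the real *_target objects (values are made up).
STA_target <- data.frame(Sample.Name = c("a", "a", "b"), C_ = c(1, 3, 5))
GAP_target <- data.frame(Sample.Name = c("a", "b", "b"), C_ = c(2, 4, 8))

# mget() returns a named list: the names are strings, the elements are data frames.
targets <- mget(ls(pattern = "_target$"))
names(targets)

# Pasting a data frame (rather than its name) yields one string per column,
# which is where a name like "c(...)_slim" comes from.
bad_name <- paste0(targets[[1]], "_slim")
length(bad_name) == ncol(targets[[1]])

# Summarise every element and keep the results together in one named list.
slim <- lapply(targets, function(x) aggregate(C_ ~ Sample.Name, data = x, FUN = mean))
names(slim) <- sub("_target$", "_slim", names(slim))
slim$STA_slim
```

Keeping the results in one list means later steps can again use lapply() over slim instead of looking up objects by constructed names.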

Using ggplot to map mean values by group

Using an example dataframe:
df <- structure(list(value = c(10L, 8L, 6L, 4L, 2L, 9L, 7L, 5L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 3L, 3L, 4L, 5L, 2L, 2L, 4L, 6L, 4L, 7L,
3L, 5L, 4L, 6L, 3L), length = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L), wave = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L)), .Names = c("value", "length", "wave"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-30L), spec = structure(list(cols = structure(list(value = structure(list(), class = c("collector_integer",
"collector")), length = structure(list(), class = c("collector_integer",
"collector")), wave = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("value", "length", "wave")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
I wish to plot the average 'value' (as a line graph) by 'length' for each group (wave).
Is this possible directly from ggplot, or do I need to do the preliminary analysis first?
I would have otherwise used:
ggplot(df, aes(x=length, y=value, color=wave)) + geom_point(shape=1)
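We can use stat_summary for this task (the argument is called fun in ggplot2 3.3.0 and later; older versions used fun.y, which is now deprecated)
library(ggplot2)
ggplot(df, aes(x = length, y = value, col = as.factor(wave))) +
stat_summary(geom = "line", fun = mean)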
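For comparison, the "preliminary analysis first" route asked about in the question can be sketched in base R. The toy data frame below uses made-up values in the same shape as df, and aggregate() computes one mean per length/wave combination, which should be the same numbers the summary line is drawn through.

```r
# Toy data in the same shape as df (hypothetical values).
toy <- data.frame(
  value  = c(10, 8, 9, 7, 1, 2, 3, 4),
  length = c(1, 2, 1, 2, 1, 2, 1, 2),
  wave   = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# One mean per length/wave combination.
means <- aggregate(value ~ length + wave, data = toy, FUN = mean)
means
```

The resulting data frame could then be passed to ggplot() with geom_line(), using wave (as a factor) for the colour aesthetic.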

Test data with new factor levels gives an error when predicting with a logit model but not with C5

I don't know how the two models handle factor levels, but the logit model won't predict and gives an error message about new factor levels, whereas predicting with C5 works fine. I created the train and test sets from a single data frame, and the levels in both match each other.
I am seeking an explanation of this behaviour and a solution for it. I understand that coefficients cannot be calculated for levels that appear only in the test set, but setting them to NULL should be okay, I think.
Here is a bit of the code. I used the loop below to match the levels of hold and train; tr is the dataset to be split into train and test.
tr=structure(
list(
production_year = c(
2007L, 2010L, 2010L, 2008L,
2007L, 2008L, 2008L, 2008L, 2007L, 2011L, 2009L, 2009L, 2009L,
2008L, 2007L, 2007L, 2010L, 2009L, 2008L, 2008L, 2010L, 2010L,
2007L, 2010L, 2009L, 2008L, 2007L, 2007L, 2008L, 2007L, 2010L,
2011L, 2010L, 2007L, 2009L, 2009L, 2008L, 2008L, 2010L, 2011L
), movie_sequel = structure(
c(
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("0", "1"), class = "factor"
), creative_type = structure(
c(
1L,
4L, 1L, 4L, 5L, 1L, 1L, 6L, 2L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 8L, 1L, 7L, 1L, 1L, 3L, 1L, 1L, 2L, 4L, 4L, 1L, 1L, 4L, 5L,
5L, 1L, 4L, 1L, 1L, 1L, 1L
), .Label = c(
"Contemporary Fiction",
"Dramatization", "Factual", "Fantasy", "Historical Fiction",
"Kids Fiction", "Science Fiction", "Super Hero"
), class = "factor"
),
source = structure(
c(
6L, 2L, 6L, 7L, 2L, 6L, 6L, 6L, 4L,
6L, 2L, 7L, 6L, 6L, 6L, 3L, 6L, 6L, 1L, 2L, 6L, 5L, 6L, 5L,
5L, 6L, 4L, 2L, 2L, 6L, 6L, 2L, 7L, 4L, 6L, 5L, 6L, 2L, 6L,
6L
), .Label = c(
"Based on Comic/Graphic Novel", "Based on Fiction Book/Short Story",
"Based on Folk Tale/Legend/Fairytale", "Based on Real Life Events",
"Based on TV", "Original Screenplay", "Remake"
), class = "factor"
),
production_method = structure(
c(
3L, 3L, 3L, 3L, 3L, 3L, 3L,
2L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L
), .Label = c(
"Animation/Live Action", "Digital Animation",
"Live Action", "Stop-Motion Animation"
), class = "factor"
),
genre = structure(
c(
3L, 1L, 4L, 5L, 1L, 4L, 3L, 3L, 4L, 5L,
2L, 7L, 6L, 5L, 7L, 3L, 3L, 7L, 1L, 7L, 7L, 3L, 4L, 3L, 3L,
6L, 4L, 2L, 1L, 2L, 6L, 4L, 7L, 1L, 4L, 2L, 3L, 7L, 7L, 5L
), .Label = c(
"Action", "Adventure", "Comedy", "Drama", "Horror",
"Romantic Comedy", "Thriller/Suspense"
), class = "factor"
),
language = structure(
c(
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L
), .Label = c("Danish", "English"), class = "factor"
),
movie_board_rating_display_name = structure(
c(
3L, 3L, 3L,
2L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 3L, 2L, 3L, 3L,
3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 3L,
2L, 3L, 1L, 2L, 3L, 3L, 2L
), .Label = c("PG", "PG-13", "R"), class = "factor"
), movie_release_pattern_display_name = structure(
c(
4L,
4L, 3L, 4L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L,
4L, 4L, 2L, 3L, 4L, 4L, 4L, 3L, 4L
), .Label = c("Exclusive",
"Expands Wide", "Limited", "Wide"), class = "factor"
), Category1 = structure(
c(
1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("0", "1"), class = "factor"
)
), .Names = c(
"production_year",
"movie_sequel", "creative_type", "source", "production_method",
"genre", "language", "movie_board_rating_display_name", "movie_release_pattern_display_name",
"Category1"
), row.names = c(
506L, 474L, 1011L, 569L, 737L, 1124L,
602L, 717L, 747L, 977L, 284L, 620L, 100L, 301L, 514L, 865L, 828L,
283L, 921L, 839L, 15L, 937L, 931L, 201L, 273L, 507L, 1180L, 689L,
276L, 649L, 603L, 22L, 555L, 974L, 552L, 500L, 216L, 312L, 796L,
682L
), class = "data.frame"
)
train=tr[1:25,] # training data
hold=tr[26:40,] # test data
for(i in 1:ncol(train)){
if(is.factor(train[,i])){
hold[,i] <- factor(hold[,i],levels=levels(train[,i]))
}
}
m.glm=glm(Category1 ~ ., data = train, family = 'binomial')
labels=hold$Category1
hold$Category1=NULL
p=predict(m.glm, hold)
All the levels:
structure(list(production_year = 2011L, movie_sequel = structure(1L, .Label = c("0",
"1"), class = "factor"), creative_type = structure(5L, .Label = c("Contemporary Fiction",
"Dramatization", "Factual", "Fantasy", "Historical Fiction",
"Kids Fiction", "Multiple Creative Types", "Science Fiction",
"Super Hero"), class = "factor"), source = structure(14L, .Label = c("Based on Comic/Graphic Novel",
"Based on Factual Book/Article", "Based on Fiction Book/Short Story",
"Based on Folk Tale/Legend/Fairytale", "Based on Game", "Based on Musical or Opera",
"Based on Play", "Based on Real Life Events", "Based on Short Film",
"Based on Theme Park Ride", "Based on Toy", "Based on TV", "Compilation",
"Original Screenplay", "Remake", "Spin-Off"), class = "factor"),
production_method = structure(4L, .Label = c("Animation/Live Action",
"Digital Animation", "Hand Animation", "Live Action", "Multiple Production Methods",
"Stop-Motion Animation"), class = "factor"), genre = structure(13L, .Label = c("Action",
"Adventure", "Black Comedy", "Comedy", "Concert/Performance",
"Documentary", "Drama", "Horror", "Multiple Genres", "Musical",
"Romantic Comedy", "Thriller/Suspense", "Western"), class = "factor"),
language = structure(3L, .Label = c("Arabic", "Danish", "English",
"Farsi", "French", "German", "Hebrew", "Hindi", "Italian",
"Japanese", "Norwegian", "Polish", "Portuguese", "Silent",
"Spanish", "Swedish"), class = "factor"), movie_board_rating_display_name = structure(6L, .Label = c("G",
"NC-17", "Not Rated", "PG", "PG-13", "R"), class = "factor"),
movie_release_pattern_display_name = structure(7L, .Label = c("Exclusive",
"Expands Wide", "IMAX", "Limited", "Oscar Qualifying Run",
"Special Engagement", "Wide"), class = "factor"), Category1 = structure(1L, .Label = c("0",
"1"), class = "factor")), .Names = c("production_year", "movie_sequel",
"creative_type", "source", "production_method", "genre", "language",
"movie_board_rating_display_name", "movie_release_pattern_display_name",
"Category1"), row.names = 304L, class = "data.frame")
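The way I see it, you will have to exclude the rows whose factor levels were not present in the data used to fit the model. In this example those appear to be the levels "Exclusive" and "Expands Wide" of movie_release_pattern_display_name, which occur in hold but not in the 25 training rows: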
predict(m.glm, hold[!hold$movie_release_pattern_display_name %in% c("Exclusive", "Expands Wide"), ])
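Rather than hard-coding the unseen levels, the check can be driven by the levels the fitted model actually recorded. The sketch below is one way to do that on toy data (all values and variable names are made up); it relies on the xlevels component that glm() stores on the fitted object and returns NA for rows it cannot score.

```r
# Toy training and hold-out sets (hypothetical values); level "c" appears only in hold.
train <- data.frame(
  y = c(0, 1, 1, 1, 0, 0, 1, 0),
  g = factor(c("a", "b", "a", "b", "a", "b", "a", "b")),
  x = c(1.2, 3.4, 2.1, 4.8, 3.9, 1.7, 2.5, 3.0)
)
hold <- data.frame(g = factor(c("a", "c", "b")), x = c(2.0, 3.0, 4.0))

fit <- glm(y ~ g + x, data = train, family = binomial)

# fit$xlevels lists, per factor, the levels that were present when fitting.
ok <- rep(TRUE, nrow(hold))
for (v in names(fit$xlevels)) {
  ok <- ok & as.character(hold[[v]]) %in% fit$xlevels[[v]]
}

# Predict only for rows whose levels the model has seen; leave the rest as NA.
pred <- rep(NA_real_, nrow(hold))
pred[ok] <- predict(fit, newdata = hold[ok, , drop = FALSE], type = "response")
pred
```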

How to use multiple symbols in plots based on different variables in R?

I have created a PCA for measurements collected on individuals from four locations placed on four substrates with three replicates. I have the sex (male or female) and "karyotype" (a factor with three possible categories), and I have calculated the first two PC scores for each individual.
I would like to make a plot where males and females have different symbols and the colour of the symbols depends on the karyotype. The code below gives me a single symbol, colour coded for the three karyotypes, with 95% confidence ellipses around the males and females.
How can I change the symbol for each sex while keeping the colouring dependent on the karyotype? I would also like to have this reflected in the legend.
One last question: is it possible to add an arrow for each PC (not each individual) from the origin, similar to those found in ordination plots?
Sample Data:
test <- structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Kampinge", "Kaseberga", "Molle", "Steninge"
), class = "factor"), Substrate = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle",
"Steninge"), class = "factor"), Replicate = structure(c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"),
Sex = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Female", "Male"), class = "factor"), Karyotype = structure(c(3L,
4L, 3L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 4L,
3L, 3L, 4L, 4L, 3L, 4L, 3L, 4L, 3L), .Label = c("", "BB",
"BD", "DD"), class = "factor"), Wing_Length = c(1439L, 1224L,
1558L, 1508L, 1286L, 1560L, 1377L, 1486L, 1638L, 1475L, 1703L,
1726L, 1668L, 1405L, 1737L, 1419L, 1530L, 1508L, 1525L, 1326L,
1609L, 1357L, 1830L, 1476L, 1661L), Leg_Length = c(465L,
357L, 610L, 415L, 343L, 560L, 435L, 390L, 425L, 514L, 693L,
695L, 657L, 454L, 661L, 382L, 431L, 531L, 435L, 387L, 407L,
414L, 752L, 524L, 650L), Development_Time = c(15, 15, 12,
12, 12, 12, 12, 12, 12, 15, 15, 15, 15, 15, 15, 15, 11, 12,
14, 12, 14, 14, 14, 11, 11), PC1 = c(-281.031806232855, -515.247908786317,
-96.7283446465637, -260.171340782501, -476.664849753781,
-127.267190895631, -347.839240839062, -293.08530374415, -154.026702195308,
-221.98257463847, 67.7504074590983, 86.6778734586525, 17.8073498265326,
-314.171132928964, 73.3068216627556, -349.616320093329, -233.030545551831,
-185.761623361004, -234.30046275676, -417.754317941649, -187.820500930148,
-376.653043663908, 203.025275308178, -214.80078992031, 7.94703091626344
), PC2 = c(-78.3082792875783, -133.370219905995, -113.211488986839,
4.31036861466361, -82.8593541869054, -73.5708675263244, -95.0643731443612,
9.37702847686542, 80.0290301136235, -92.8061497557789, -83.8731164047719,
-70.6537733486393, -78.706783632851, -91.6793310834752, -37.5144466525303,
-27.4637667171696, 6.14809390611532, -84.6794844768708, -0.127837123829732,
-90.9556028004192, 75.2353710655562, -91.7834027435658, -47.669385541585,
-99.8362257341741, -77.8269478596591)), .Names = c("Location",
"Substrate", "Replicate", "Sex", "Karyotype", "Wing_Length",
"Leg_Length", "Development_Time", "PC1", "PC2"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 30L, 31L), class = "data.frame")
## Plot
library(vegan) # for ordiellipse()
par(mfrow=c(1,1), mar=c(4,4,2,1), pty = "s")
plot(test$PC1, test$PC2, xlab="PC1", ylab="PC2", pch=16, col=as.numeric(test[,"Karyotype"]),
xlim = c(-1000, 1000), ylim = c(-250, 250), las=1, cex.lab = 1.5, cex.axis = 1.25, main = NULL)
ordiellipse(test[,9:10], test$Sex, conf=0.95, col="black", cex=1.75, label=TRUE)
legend("bottomright", pch=16, col=unique(as.numeric(test[,"Karyotype"])), legend=unique(test[,"Karyotype"]), cex = 1.75)
Replace the pch argument in your plot() call with something like:
pch=ifelse(test$Sex=='Male',15,19)
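To also cover the legend part of the question, one possible base-graphics sketch is shown below. The toy data, the colour choices and the legend positions are all assumptions made for the example; the idea is simply to look up symbol and colour through named vectors and draw one legend per encoding.

```r
# Toy data (hypothetical values) with the same kinds of columns as `test`.
toy <- data.frame(
  PC1 = c(-2, -1, 0, 1, 2, 3),
  PC2 = c(1, -1, 2, 0, -2, 1),
  Sex = factor(c("Male", "Female", "Male", "Female", "Male", "Female")),
  Karyotype = factor(c("BB", "BD", "DD", "BB", "BD", "DD"))
)

# Look up the symbol by sex and the colour by karyotype through named vectors.
sex_pch  <- c(Female = 19, Male = 15)
kary_col <- c(BB = "black", BD = "red", DD = "blue")
pch_vec <- unname(sex_pch[as.character(toy$Sex)])
col_vec <- unname(kary_col[as.character(toy$Karyotype)])

plot(toy$PC1, toy$PC2, xlab = "PC1", ylab = "PC2", pch = pch_vec, col = col_vec)

# Two legends: one for the colour encoding, one for the symbol encoding.
legend("bottomright", legend = names(kary_col), col = kary_col, pch = 16, title = "Karyotype")
legend("topright", legend = names(sex_pch), pch = sex_pch, title = "Sex")
```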
Try with ggplot:
library(ggplot2)
ggplot(test, aes(x=PC1, y=PC2, color=Karyotype, shape=Sex, group=Sex))+geom_point(size=5)+stat_ellipse()
