R software: error when using cozigam() function - r

I am modelling the potential distribution of a species using the COZIGAM package. I have a response variable ("pb", which records where the species is present) and predictor variables (e.g. altitude, temperature, precipitation).
When I run this call:
# devtools::install_github('AndrewLJackson/COZIGAM')
coz.model <- cozigam(formula=pb ~ s(altitude) + s(combustible) + s(distribution) + s(e1) + s(e2) + s(e3) + s(euc.human) + s(euc.river) + s(fccarb) + s(fccmat) + s(forarb) + s(aspect) + s(slope) + s(precipitation) + s(radiation) + s(tipestr_class) + s(tipestr_forest) + s(tmean), data=sdmdata2, family=poisson)
I get the following error:
Error in as.matrix(x) : object 'altitude' not found
However, when I run as.matrix(sdmdata2), the 'altitude' variable exists in my matrix. The output of dput(head(sdmdata2)) is:
structure(list(X = 1:6, pb = c(2L, 2L, 2L, 2L, 2L, 2L), altitude = c(879L,
1094L, 1035L, 410L, 342L, 665L), combustible = c(6L, 6L, 3L,
0L, 3L, 3L), distribution = c(6L, 6L, 6L, 0L, 6L, 0L), e1 = c(4L,
4L, 2L, 0L, 4L, 0L), e2 = c(0L, 0L, 2L, 0L, 2L, 0L), e3 = c(0L,
0L, 4L, 0L, 2L, 0L), euc.human = c(790.569397, 3201.562012, 1750,
250, 250, 1952.562012), euc.river = c(0, 4069.705078, 353.5534058,
1030.776001, 559.0170288, 0), fccarb = c(90L, 70L, 40L, 0L, 30L,
0L), fccmat = c(5L, 10L, 35L, 0L, 60L, 80L), forarb = c(1L, 1L,
2L, 0L, 5L, 0L), aspect = c(6L, 8L, 6L, 4L, 3L, 3L), slope = c(5L,
3L, 5L, 2L, 6L, 5L), precipitation = c(87.01500702, 79.57628632,
81.86239624, 75.10630798, 49.58106995, 69.55927277), radiation = c(160.1408997,
163.4971008, 161.8542938, 157.9179993, 159.2113953, 160.6203003
), tipestr_class = c(1L, 1L, 1L, 7L, 1L, 2L), tipestr_forest = c(6L,
6L, 6L, 0L, 6L, 0L), tmean = c(141.7760925, 134.9530029, 141.9192047,
171.9972992, 186.2566986, 157.0391998)), .Names = c("X", "pb",
"altitude", "combustible", "distribution", "e1", "e2", "e3", "euc.human",
"euc.river", "fccarb", "fccmat", "forarb", "aspect", "slope",
"precipitation", "radiation", "tipestr_class", "tipestr_forest",
"tmean"), row.names = c(NA, 6L), class = "data.frame")
Does anyone know what the problem is?


How to add different boxplots to the same plot based on different data sources in ggplot /R?

Please find my data below. Note that the picture below is an example of the design I wish to copy and does not correspond to my data specifically.
My data is stored in p. I have a continuous covariate p$ki67pro, which denotes the percentage of cells actively dividing in a tumor sample (thus ranging from 0 to 100). I have three tumor stages, which correspond to p$WHO.Grade==1,2,3. Each sample represents a tumor patient who either had a recurrence (p$recurrence==1) or did not (p$recurrence==0).
Therefore:
head(p)
WHO.Grade recurrence ki67pro
1 1 0 1
2 2 0 12
3 1 0 3
9 1 0 3
10 1 0 5
11 1 0 3
I wish to produce the boxplot below. As you can see, there are four positions on the x-axis: one for each p$WHO.Grade and one for All samples. There are two boxplots at each position.
Per p$WHO.Grade and All, I want one boxplot to represent p$ki67pro for recurrent tumors (p$recurrence==1) and the other boxplot to represent p$ki67pro for non-recurrent tumors (p$recurrence==0).
I.e.
p$ki67pro[p$WHO.Grade==1 & p$recurrence==0] versus
p$ki67pro[p$WHO.Grade==1 & p$recurrence==1]
p$ki67pro[p$WHO.Grade==2 & p$recurrence==0] versus
p$ki67pro[p$WHO.Grade==2 & p$recurrence==1]
p$ki67pro[p$WHO.Grade==3 & p$recurrence==0] versus
p$ki67pro[p$WHO.Grade==3 & p$recurrence==1]
And for All
p$ki67pro[p$recurrence==0] versus
p$ki67pro[p$recurrence==1]
I have used the following script so far, but I can't figure out how to get All included. Please note that there is only one case with p$WHO.Grade==3.
df <- data.frame(x = as.factor(c(p$WHO.Grade)),
y = c(p$ki67pro),
f = rep(c("ki67pro"), c(nrow(p))))
df <- df[!is.na(df$x),]
ggplot(df) +
geom_boxplot(aes(x, y, fill = f, colour = f), outlier.alpha = 0, position = position_dodge(width = 0.78)) +
scale_x_discrete(name = "", label=c("WHO-I","WHO-II","WHO-III","All")) +
scale_y_continuous(name="x", breaks=seq(0,30,5), limits=c(0,30)) +
stat_boxplot(aes(x, y, colour = f), geom = "errorbar", width = 0.3,position = position_dodge(0.7753)) +
geom_point(aes(x, y, fill = f, colour = f), size = 3, shape = 21, position = position_jitterdodge()) +
scale_fill_manual(values = c("#edf1f9", "#fcebeb"), name = "",
labels = c("", "")) +
scale_colour_manual(values = c("#1C73C2", "red"), name = "",
labels = c("","")) + theme(legend.position="none")
My Data p
p <- structure(list(WHO.Grade = c(1L, 2L, 1L, 1L, 1L, 1L, 3L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), recurrence = c(0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), ki67pro = c(1L, 12L,
3L, 3L, 5L, 3L, 20L, 25L, 7L, 4L, 5L, 12L, 3L, 15L, 4L, 5L, 7L,
8L, 3L, 12L, 10L, 4L, 10L, 7L, 3L, 2L, 3L, 7L, 4L, 7L, 10L, 4L,
5L, 5L, 3L, 5L, 2L, 5L, 3L, 3L, 3L, 4L, 4L, 3L, 2L, 5L, 1L, 5L,
2L, 3L, 1L, 2L, 3L, 3L, 5L, 4L, 20L, 5L, 0L, 4L, 3L, 0L, 3L,
4L, 1L, 2L, 20L, 2L, 3L, 5L, 4L, 8L, 1L, 4L, 5L, 4L, 3L, 6L,
12L, 3L, 4L, 4L, 2L, 5L, 3L, 3L, 3L, 2L, 5L, 4L, 2L, 3L, 4L,
3L, 3L, 2L, 2L, 4L, 7L, 4L, 3L, 4L, 2L, 3L, 6L, 2L, 3L, 10L,
5L, 10L, 3L, 10L, 3L, 4L, 5L, 2L, 4L, 3L, 4L, 4L, 4L, 5L, 3L,
12L, 5L, 4L, 3L, 2L, 4L, 3L, 4L, 2L, 1L, 6L, 1L, 4L, 12L, 3L,
4L, 3L, 2L, 6L, 5L, 4L, 3L, 4L, 4L, 4L, 3L, 5L, 4L, 5L, 4L, 1L,
3L, 3L, 4L, 0L, 3L)), class = "data.frame", row.names = c(1L,
2L, 3L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L,
34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 44L, 45L, 46L, 47L, 48L,
49L, 50L, 51L, 52L, 53L, 54L, 55L, 57L, 59L, 60L, 61L, 62L, 63L,
64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L,
77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L, 87L, 89L, 90L, 91L,
92L, 93L, 94L, 96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L,
105L, 106L, 107L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L,
117L, 118L, 119L, 120L, 121L, 123L, 124L, 125L, 126L, 127L, 128L,
130L, 131L, 132L, 133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L,
141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L, 149L, 150L, 151L,
152L, 153L, 154L, 155L, 156L, 157L, 158L, 159L, 160L, 161L, 162L,
163L, 164L, 165L, 166L, 167L, 168L, 169L, 170L, 171L, 172L, 173L,
174L, 175L))
A trick that can be used is to create a new level in WHO.Grade, since it only has 3 levels. This should be a temporary level, so a good way of doing it is with the dplyr function mutate.
Note that there is no need to create a new dataframe df.
library(ggplot2)
library(dplyr)
p %>%
bind_rows(p %>% mutate(WHO.Grade = 4)) %>%
mutate(WHO.Grade = factor(WHO.Grade),
recurrence = factor(recurrence)) %>%
ggplot(aes(WHO.Grade, ki67pro,
fill = recurrence, colour = recurrence)) +
geom_boxplot(outlier.alpha = 0,
position = position_dodge(width = 0.78, preserve = "single")) +
geom_point(size = 3, shape = 21,
position = position_jitterdodge()) +
scale_x_discrete(name = "",
label = c("WHO-I","WHO-II","WHO-III","All")) +
scale_y_continuous(name = "x", breaks=seq(0,30,5), limits=c(0,30)) +
scale_fill_manual(values = c("#edf1f9", "#fcebeb"), name = "",
labels = c("", "")) +
scale_colour_manual(values = c("#1C73C2", "red"), name = "",
labels = c("","")) +
theme(legend.position="none")
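The idea behind the extra level can be checked on a small made-up data frame before plotting. The sketch below uses base R only; p_toy and its values are invented for illustration and are not the asker's data.

```r
# Hypothetical toy data with the same column names as p
p_toy <- data.frame(
  WHO.Grade  = c(1L, 1L, 2L, 3L),
  recurrence = c(0L, 1L, 0L, 1L),
  ki67pro    = c(3L, 5L, 12L, 20L)
)

# Copy every row and relabel the copy as the extra "All" group
p_all <- p_toy
p_all$WHO.Grade <- "All"

# Make the grade a character column so both pieces stack cleanly
p_toy$WHO.Grade <- as.character(p_toy$WHO.Grade)
stacked <- rbind(p_toy, p_all)

# Fix the level order so "All" is drawn last on the x-axis
stacked$WHO.Grade <- factor(stacked$WHO.Grade,
                            levels = c("1", "2", "3", "All"))

# Each original row now appears twice: once in its grade, once in "All"
table(stacked$WHO.Grade, stacked$recurrence)
```

Because every observation is counted once under its own grade and once under "All", the "All" boxes summarise the full sample, split by recurrence.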
What about something like this:
# here you duplicate your original data
p1 <- p
# how to catch the all
p1$WHO.Grade <- 'all'
p <- rbind(p1,p)
library(ggplot2)
ggplot(p) +
geom_boxplot(aes(as.factor(WHO.Grade),
y = ki67pro,
fill = factor(recurrence) ,
color = factor(recurrence) ),
outlier.alpha = 0 , position = position_dodge(width = 0.78)) +
# from here it's more or less your code
scale_x_discrete(name = "", label=c("WHO-I","WHO-II","WHO-III","All")) +
scale_y_continuous(name="x", breaks=seq(0,30,5), limits=c(0,30)) +
stat_boxplot(aes(as.factor(WHO.Grade),
y = ki67pro,
color = factor(recurrence) ),
geom = "errorbar", width = 0.3,position = position_dodge(0.7753)) +
geom_point(aes(as.factor(WHO.Grade),
y = ki67pro,
color = factor(recurrence) ),
size = 3, shape = 21, position = position_jitterdodge()) +
scale_fill_manual(values = c("#edf1f9", "#fcebeb"), name = "",
labels = c("", "")) +
scale_colour_manual(values = c("#1C73C2", "red"), name = "",
labels = c("","")) +
theme(legend.position="none",
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
If your dataset is too large to simply double it in size, you can create two plots and put them next to each other via grid.arrange().
library(ggplot2)
library(gridExtra)
#the data
df <- data.frame(x = as.factor(c(p$WHO.Grade)),
y = p$ki67pro,
f = as.factor(p$recurrence))
df <- df[!is.na(df$x),]
# plot 1
plot1 <- ggplot(df) +
geom_boxplot(aes(x, y, fill = f, colour = f), outlier.alpha = 0, position = position_dodge(width = 0.78)) +
scale_x_discrete(name = "", label=c("WHO-I","WHO-II","WHO-III","All")) +
scale_y_continuous(name="x", breaks=seq(0,30,5), limits=c(0,30)) +
stat_boxplot(aes(x, y, colour = f), geom = "errorbar", width = 0.3,position = position_dodge(0.7753)) +
geom_point(aes(x, y, fill = f, colour = f), size = 3, shape = 21, position = position_jitterdodge()) +
scale_fill_manual(values = c("#edf1f9", "#fcebeb"), name = "",
labels = c("", "")) +
scale_colour_manual(values = c("#1C73C2", "red"), name = "",
labels = c("","")) + theme(legend.position="none") +
theme(plot.margin = unit(c(1,-0.5,1, 1), "cm"))
#plot 2
plot2 <- ggplot(df) +
geom_boxplot(aes(x = "All", y = y, fill = f, colour = f), outlier.alpha = 0, position = position_dodge(width = 0.78)) +
scale_x_discrete(name = "") +
scale_y_continuous(name="x", breaks=seq(0,30,5), limits=c(0,30)) +
stat_boxplot(aes(x = "All", y = y, colour = f), geom = "errorbar", width = 0.3,position = position_dodge(0.7753)) +
geom_point(aes(x = "All", y = y, fill = f, colour = f), size = 3, shape = 21, position = position_jitterdodge()) +
scale_fill_manual(values = c("#edf1f9", "#fcebeb"), name = "",
labels = c("", "")) +
scale_colour_manual(values = c("#1C73C2", "red"), name = "",
labels = c("","")) + theme(legend.position="none") +
theme(axis.line.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
plot.margin = unit(c(1,1,1, -0.5), "cm"))
#put it together
lm <- rbind(c(1,1,1,2))
grid.arrange(plot1, plot2, layout_matrix = lm)
If I understood correctly, you just want to show all of your data in the last boxplot.
You can do this easily by just duplicating the data while creating the data frame and labelling the duplicate with All.
df <- data.frame(x = as.factor(c(p$WHO.Grade, rep("All", nrow(p)))),
y = rep(c(p$ki67pro), 2),
f = "ki67pro")
The plotting remains the same and you can easily add recurrence.
However, the plot you're showing above looks weird as the All boxplot doesn't contain all the data.

ggplot2 with splitting by groups in R [duplicate]

I am trying to draw a scatterplot of two variables, split by two grouping variables:
ggplot(terr, aes(x = Killed, y = Terr..Attacks,group=Religion,Macro.Region)) +
geom_point() +
geom_smooth()
but I didn't get the result I expected.
How can I create a scatterplot by groups?
terr=structure(list(Macro.Region = structure(c(5L, 4L, 4L, 3L, 4L,
6L, 1L, 2L, 4L, 3L, 6L, 5L, 4L, 4L, 3L, 4L, 6L, 1L, 2L, 4L, 3L,
6L), .Label = c("Arab Countries", "Asia", "Eastern Europe and post-Soviet",
"Latin America", "Sub-Saharan Africa", "Western States"), class = "factor"),
Killed = c(0L, 0L, 0L, 6L, 0L, 0L, 1L, 76L, 0L, 0L, 36L,
0L, 0L, 0L, 6L, 0L, 0L, 1L, 76L, 0L, 0L, 36L), Terr..Attacks = c(2L,
0L, 2L, 2L, 0L, 9L, 3L, 88L, 0L, 0L, 6L, 2L, 0L, 2L, 2L,
0L, 9L, 3L, 88L, 0L, 0L, 6L), Religion = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Christianity", "Islam"
), class = "factor"), GDP.capita = c(6813L, 26198L, 20677L,
9098L, NA, 49882L, 51846L, 4207L, 17508L, 18616L, 46301L,
6813L, 26198L, 20677L, 9098L, NA, 49882L, 51846L, 4207L,
17508L, 18616L, 46301L)), class = "data.frame", row.names = c(NA,
-22L))
ggplot(terr, aes(x = Killed, y = Terr..Attacks)) +
geom_point(alpha=1/4) +
facet_wrap(Religion ~ Macro.Region)
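As an alternative to faceting, the two grouping variables can be combined into a single factor with interaction() and mapped to colour. This is only a sketch on invented data (toy, Attacks and grp are hypothetical names), and the plotting step is skipped if ggplot2 is not installed.

```r
# Hypothetical toy data loosely shaped like terr
toy <- data.frame(
  Killed   = c(0, 6, 1, 76),
  Attacks  = c(2, 2, 3, 88),
  Religion = c("Christianity", "Christianity", "Islam", "Islam"),
  Region   = c("Asia", "Western States", "Asia", "Arab Countries")
)

# One level per observed Religion/Region combination;
# drop = TRUE discards combinations that never occur
toy$grp <- interaction(toy$Religion, toy$Region, drop = TRUE)
levels(toy$grp)

# Optional: colour the points by the combined group
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(toy, aes(Killed, Attacks, colour = grp)) +
    geom_point()
}
```

Faceting, as in the answer above, gives one panel per combination; the combined factor keeps everything in one panel and distinguishes groups by colour instead.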

calculate the sum per 2 columns

I have the following data frame:
all <- structure(list(counts = c(0L, 0L, 3L, 0L, 2L, 0L), counts = c(0L,
2L, 1L, 0L, 5L, 1L), counts = c(1L, 9L, 17L, 0L, 7L, 2L), counts = c(2L,
1L, 13L, 0L, 7L, 5L), counts = c(1L, 1L, 3L, 0L, 2L, 10L), counts = c(0L,
2L, 2L, 0L, 8L, 9L), counts = c(0L, 4L, 4L, 0L, 4L, 0L), counts = c(0L,
2L, 3L, 0L, 7L, 1L), counts = c(0L, 2L, 0L, 0L, 3L, 8L), counts = c(1L,
3L, 3L, 0L, 4L, 13L), counts = c(0L, 6L, 12L, 0L, 3L, 2L), counts = c(0L,
7L, 6L, 0L, 4L, 2L), counts = c(1L, 0L, 1L, 0L, 2L, 5L), counts = c(1L,
1L, 2L, 0L, 3L, 6L), counts = c(0L, 2L, 1L, 1L, 2L, 0L), counts = c(0L,
4L, 1L, 0L, 4L, 0L), counts = c(0L, 2L, 1L, 0L, 3L, 3L), counts = c(0L,
1L, 1L, 0L, 2L, 1L), counts = c(0L, 3L, 1L, 0L, 5L, 0L), counts = c(0L,
4L, 5L, 0L, 1L, 0L), counts = c(0L, 2L, 5L, 0L, 8L, 23L), counts = c(0L,
0L, 2L, 0L, 1L, 7L), counts = c(1L, 0L, 0L, 0L, 1L, 2L), counts = c(0L,
0L, 0L, 0L, 1L, 0L)), .Names = c("counts", "counts", "counts",
"counts", "counts", "counts", "counts", "counts", "counts", "counts",
"counts", "counts", "counts", "counts", "counts", "counts", "counts",
"counts", "counts", "counts", "counts", "counts", "counts", "counts"
), row.names = c("1/2-SBSRNA4", "A1BG", "A1BG-AS1", "A1CF", "A2LD1",
"A2M"), class = "data.frame")
In this data frame I need the sum of every two columns. In its simplest form this can be done with all[1] + all[2], all[3] + all[4], etc., and at the end I could cbind the new frames together, but I know this can be done with something like aggregate or apply. I just have not managed to get it working. My best attempt so far is: allfinal <-aggregate( all ,FUN = sum,by=[1:2] ). I know this is not how it should work, but I can't figure out how to use aggregate or (s)apply correctly to do this. Any tips are appreciated!
As output I want a data frame that holds the sum of each pair of columns in a single column. The data frame now has 24 columns, so I should end up with 12 columns.
you can try this:
t(rowsum(t(all), gl(ncol(all)/2, 2)))
hth
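To see what the one-liner does, it may help to run it on a tiny made-up data frame (toy below is hypothetical). rowsum() adds up rows that share a group label, so the data is transposed first, grouped with gl() into consecutive pairs, and transposed back.

```r
# Hypothetical 3 x 4 data frame; columns a+b and c+d should be paired
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# gl(2, 2) yields the labels 1, 1, 2, 2: one label per original column
groups <- gl(ncol(toy) / 2, 2)

# Transpose so columns become rows, sum rows within each label,
# then transpose back so the pair sums are columns again
pair_sums <- t(rowsum(t(toy), groups))
pair_sums

# Cross-check the first pair against the direct sum
all(pair_sums[, 1] == toy$a + toy$b)
```

The result is a matrix rather than a data frame; wrapping it in as.data.frame() would convert it back if needed.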

Error using predict with klaR package, NaiveBayes

I'm using the klaR package's predict method as mentioned in the post Naive bayes in R:
nb_testpred <- predict(mynb, newdata=testdata)
mynb is my Naive Bayes model, developed on traindata; testdata is the remaining data.
However, I get this error:
Error in FUN(1:10[[4L]], ...) : subscript out of bounds
I'm not sure what's going on - testdata has fewer rows than traindata, and the same number of columns.
For reference, my code looks like this:
ind <- sample(2, nrow(mydata), replace=TRUE, prob=c(0.9,0.1))
traindata <- mydata[ind==1,]
testdata <- mydata[ind==2,]
myformula <- as.factor(dep) ~ X1 + as.factor(X2) + as.factor(X3) + as.factor(X4) + X5 + as.factor(X6) + as.factor(date) + as.factor(hour)
mynb <- NaiveBayes(myformula, data=traindata)
nb_testpred <- predict(mynb, newdata=testdata) #where I'm getting an error...
A sample of the data is here (the original file has 100,000+ rows):
sampledata <- structure(list(dep = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), X1 = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), X2 = c(200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L,
200L, 200L), X3 = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c(".", "1400000", "2400000", "900000"), class = "factor"), X4 = c(0L, 0L, 0L, 3L, 4L, 5L, 5L, 5L, 5L, 0L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 0L), X5 = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), X6 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("9/23/2012",
"9/24/2012"), class = "factor"), hour = c(18L, 17L, 23L, 8L, 1L, 19L, 19L, 16L, 22L, 2L, 12L, 16L, 15L, 9L, 1L, 9L,
13L, 19L)), .Names = c("dep", "X1", "X2", "X3", "X4", "X5", "X6", "date", "hour"), class = "data.frame", row.names = c(NA, -18L))
Any help would be greatly appreciated!
You can proceed as follows:
traindata$dep=factor(traindata$dep)
mynb <- NaiveBayes(dep~.,traindata)
Then it works. However, you should clean your data to avoid constant columns.
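The answer does not show how to find constant columns. One possible way, sketched here in base R on an invented data frame (toy and its values are hypothetical), is to count the distinct values in each column and drop those with only one.

```r
# Hypothetical training data; X2 has the same value in every row
toy <- data.frame(
  dep = c(1L, 0L, 0L, 1L),
  X2  = c(200L, 200L, 200L, 200L),
  X4  = c(0L, 3L, 4L, 5L)
)

# TRUE for columns that contain at most one distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1,
                      logical(1))
names(toy)[is_constant]

# Keep only the columns that actually vary
cleaned <- toy[, !is_constant, drop = FALSE]
names(cleaned)
```

In the sample data from the question, X2 is 200 in every row, so it would be flagged by a check like this.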

edits in a ggplot2, geom = "line"

I have a line plot of some event at a hospital that I have been struggling with.
The challenges I haven't solved yet are: 1) sorting the lines on the plot so that the patient lines are ordered by Assessment date, 2) coloring the lines by the variable 'openCase', and 3) removing the Discharge point (the blue square) for cases that are in the year 2014 (or after some other arbitrary cutoff date).
Any help would be appreciated.
Here is my sample data,
library(ggplot2)
library(plyr)
df <- data.frame(
date = seq(Sys.Date(), len= 156, by="5 day")[sample(156, 78)],
openCase = rep(0:1, 39),
patients = factor(rep(1:26, 3), labels = LETTERS)
)
df <- ddply(df, "patients", mutate, visit = order(date))
df$visit <- as.factor(df$visit)
levels(df$visit) <- c("Assessment (1)", "Treatment (2)", "Discharge (3)")
qplot(date, patients, data = df, geom = "line") +
geom_point(aes(colour = visit), size = 2, shape=0)
I'm aware that my example data is not perfect, as some of the assessment dates fall after the treatments and some of the discharge dates fall before the assessments, but that is part of the challenge: my base data is messed up.
What it looks like at the moment,
Update 2012-04-30 16:30:13 PDT
My data is delivered from a database and looks something like this,
df <- structure(list(date = structure(c(15965L, 15680L, 16135L, 15730L,
15920L, 15705L, 16110L, 15530L, 15575L, 15905L, 16140L, 15795L,
15955L, 15945L, 16205L, 15675L, 15525L, 15830L, 15625L, 15725L,
15855L, 15840L, 15615L, 15500L, 15780L, 15765L, 15610L, 15690L,
16080L, 15570L, 15685L, 16175L, 15740L, 15600L, 15985L, 15485L,
15605L, 16115L, 15535L, 15755L, 16145L, 16040L, 15970L, 16000L,
16075L, 15995L, 16010L, 15990L, 15665L, 15895L, 15865L, 16120L,
15880L, 15930L, 16055L, 15820L, 15650L, 16155L, 15700L, 15640L,
15505L, 15750L, 15800L, 15775L, 15825L, 15635L, 16150L, 15860L,
16100L, 15475L, 16050L, 15785L, 15495L, 15810L, 15805L, 15490L,
15460L, 16085L), class = "Date"), openCase = c(0L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L), patients = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L,
6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L,
11L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L,
16L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 18L, 19L, 19L, 19L, 20L,
20L, 20L, 21L, 21L, 21L, 22L, 22L, 22L, 23L, 23L, 23L, 24L, 24L,
24L, 25L, 25L, 25L, 26L, 26L, 26L), .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"), class = "factor"),
visit = structure(c(2L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, 1L, 3L,
1L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 2L, 3L, 3L, 2L, 1L, 3L,
2L, 1L, 3L, 1L, 2L, 1L, 3L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L,
3L, 2L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 3L,
2L, 1L, 3L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L, 1L, 1L,
3L, 2L, 1L, 3L, 2L, 2L, 1L, 3L), .Label = c("zym", "xov", "poi"
), class = "factor")), .Names = c("date", "openCase", "patients",
"visit"), row.names = c(NA, -78L), class = "data.frame")
The number of levels in visit, and their specific labels, will most likely change, so I would like code that ranks or sorts based on my existing data (visit) instead of generating new variables.
This is part-way:
Starting from after your initial definition of the data.
First, I think you want rank(date) rather than order(date) -- it made more sense to me, anyway.
df <- ddply(df, "patients", mutate, visit = rank(date))
df$visit <- as.factor(df$visit)
levels(df$visit) <- c("Assessment (1)", "Treatment (2)", "Discharge (3)")
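The difference between order() and rank() is easy to see on three made-up dates (the vector d is hypothetical). order() returns the row indices that would sort the vector, whereas rank() returns each element's position in the sorted sequence, which is what a per-row visit number needs.

```r
# Three visits for one hypothetical patient, not in chronological order
d <- as.Date(c("2013-03-01", "2013-01-01", "2013-02-01"))

# Indices that would sort d: the 2nd element is earliest, then 3rd, then 1st
order(d)

# Position of each element once sorted: the 1st element is the latest visit
rank(d)
```

With order(), the first row would be labelled visit 2 even though it is the latest date; with rank(), it is labelled visit 3.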
Reorder patients by minimum date value (= Assessment date):
df$patients <- reorder(df$patients,df$date,function(x) min(as.numeric(x)))
Create a new data set missing the Discharge point, where they are after Jan 1 2014 (if you wanted to drop the Discharge point for cases that were assessed after a given date, you'd need to use ddply):
df2 <- subset(df,!(visit=="Discharge (3)" & date > as.Date("2014-01-01")))
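The reordering and subsetting steps can be tried on a small invented data set first. In this sketch, visits and its values are hypothetical, and the visit labels are shortened for brevity.

```r
# Hypothetical data: three patients, two visits each
visits <- data.frame(
  patients = factor(c("A", "A", "B", "B", "C", "C")),
  date     = as.Date(c("2013-05-01", "2013-06-01",
                       "2013-01-10", "2014-02-01",
                       "2013-03-15", "2013-04-01")),
  visit    = rep(c("Assessment", "Discharge"), 3)
)

# Order the patient levels by each patient's earliest date
visits$patients <- reorder(visits$patients, visits$date,
                           function(x) min(as.numeric(x)))
levels(visits$patients)

# Drop Discharge rows that fall after the cutoff date
cutoff <- as.Date("2014-01-01")
shown <- visits[!(visits$visit == "Discharge" & visits$date > cutoff), ]
nrow(shown)
```

Patient B has the earliest assessment and so becomes the first level, and B's discharge in February 2014 is the only row removed.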
As @Joran pointed out above, it's a bit hard to get two separate colour scales for different variables, but this sort-of works (you have to make openCase into a factor in order to combine it with the colour scale for visit):
ggplot(df, aes(date, patients)) + geom_line(aes(colour=factor(openCase))) +
geom_point(data=df2,aes(colour = visit), size = 2, shape=0)
Alternately (and I think this is prettier anyway), you could code openCase with line type:
ggplot(df, aes(date, patients)) + geom_line(aes(linetype=factor(openCase))) +
geom_point(data=df2,aes(colour = visit), size = 2, shape=0)
I'm still not sure I understand what is wrong with @Ben's answer, but I'll try adding one of my own. Starting with the df given in the edit.
Create a new variable Visit (note the capital V) which is Assessment/Treatment/Discharge based on the ordering of the dates given. This is @Ben's code, just re-written.
df <- ddply(df, "patients", mutate,
Visit = factor(rank(date),
levels = 1:3,
labels=c("Assessment (1)", "Treatment (2)", "Discharge (3)")))
I don't understand how this relates to the visit column in the data originally; in fact, the original visit column is not used hereafter:
> table(df$Visit, df$visit)
zym xov poi
Assessment (1) 16 7 3
Treatment (2) 3 16 7
Discharge (3) 7 3 16
Reorder the patients (again copying Ben):
df$patients <- reorder(df$patients,df$date,function(x) min(as.numeric(x)))
Determine the subset of points that should be shown (same idea as Ben, but different code)
df2 <- df[!((df$Visit == "Discharge (3)") & (df$date > as.Date("2014-01-01"))),]
To add something new, here is a way to make the lines different colors without impacting the legend
ggplot(df, aes(date, patients)) +
geom_blank() +
geom_line(data = df[df$openCase == 0,], colour = "black") +
geom_line(data = df[df$openCase == 1,], colour = "red") +
geom_point(data = df2, aes(colour = Visit), size = 2, shape = 0)
