Error using predict with klaR package, NaiveBayes

I'm using the klaR package's predict method as mentioned in the post Naive bayes in R:
nb_testpred <- predict(mynb, newdata=testdata)
mynb is my Naive Bayes model, developed on traindata; testdata is the remaining data.
However, I get this error:
Error in FUN(1:10[[4L]], ...) : subscript out of bounds
I'm not sure what's going on - testdata has fewer rows than traindata, and the same number of columns.
For reference, my code looks like this:
ind <- sample(2, nrow(mydata), replace=TRUE, prob=c(0.9,0.1))
traindata <- mydata[ind==1,]
testdata <- mydata[ind==2,]
myformula <- as.factor(dep) ~ X1 + as.factor(X2) + as.factor(X3) + as.factor(X4) + X5 + as.factor(X6) + as.factor(date) + as.factor(hour)
mynb <- NaiveBayes(myformula, data=traindata)
nb_testpred <- predict(mynb, newdata=testdata) #where I'm getting an error...
A sample of the data is here (the original file has 100,000+ rows):
sampledata <- structure(list(dep = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), X1 = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), X2 = c(200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L,
200L, 200L), X3 = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c(".", "1400000", "2400000", "900000"), class = "factor"), X4 = c(0L, 0L, 0L, 3L, 4L, 5L, 5L, 5L, 5L, 0L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 0L), X5 = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), X6 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("9/23/2012",
"9/24/2012"), class = "factor"), hour = c(18L, 17L, 23L, 8L, 1L, 19L, 19L, 16L, 22L, 2L, 12L, 16L, 15L, 9L, 1L, 9L,
13L, 19L)), .Names = c("dep", "X1", "X2", "X3", "X4", "X5", "X6", "date", "hour"), class = "data.frame", row.names = c(NA, -18L))
Any help would be greatly appreciated!

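You can proceed as follows: convert the response to a factor in the data frame itself rather than inside the formula, then fit on all remaining columns:
traindata$dep <- factor(traindata$dep)
mynb <- NaiveBayes(dep ~ ., traindata)
Then it works. However, you should also clean your data to avoid constant columns (in the sample, X2 is always 200).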
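The advice above (convert in the data, not in the formula, and drop constant columns) can be sketched in base R as follows. This is only an illustration: `prep_nb` is a hypothetical helper, the toy `mydata` is made up, and which columns to treat as factors is an assumption taken from the question's formula.

```r
# Toy stand-in for `mydata` (made-up values); X2 is constant on purpose.
mydata <- data.frame(
  dep  = c(1L, 0L, 0L, 1L, 0L, 1L),
  X1   = c("B", "A", "A", "B", "A", "B"),
  X2   = rep(200L, 6),
  X4   = c(0L, 3L, 5L, 0L, 5L, 0L),
  hour = c(18L, 8L, 19L, 2L, 12L, 19L)
)

# Hypothetical helper: turn the listed columns into factors, then drop
# every column that takes only one distinct value.
prep_nb <- function(df, factor_cols) {
  df[factor_cols] <- lapply(df[factor_cols], factor)
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df[, keep, drop = FALSE]
}

# Convert BEFORE splitting, so train and test share column names and levels.
clean <- prep_nb(mydata, c("dep", "X1", "X2", "X4", "hour"))

set.seed(1)
ind <- sample(2, nrow(clean), replace = TRUE, prob = c(0.9, 0.1))
traindata <- clean[ind == 1, ]
testdata  <- clean[ind == 2, ]
```

With the data prepared this way, the calls from the answer, `NaiveBayes(dep ~ ., data = traindata)` followed by `predict(mynb, newdata = testdata)`, see identically named columns in both data frames.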

Related

R - Non-numeric argument to binary operator when dividing variable by participant-wise mean of that variable

I have a problem with some code.
My goal is to divide each value of the latency variable by the mean latency of the corresponding participant. That is, I want to divide all latencies of participant 1 by the mean latency of participant 1, all latencies of participant 2 by the mean latency of participant 2, and so forth.
The error message is:
"Error in latency/mean(latency, na.rm = TRUE) :
non-numeric argument to binary operator
In addition: Warning messages:
1: In mean.default(latency) :
argument is not numeric or logical: returning NA
2: In mean.default(latency, na.rm = TRUE) :
argument is not numeric or logical: returning NA"
Importantly, the code works with another dataset.
Below find the code I used to try and achieve this and some example data to reproduce the error:
rt = df2$latency)
as.numeric(df2$latency)
na.omit(df2$latency)
temp <- ddply(xy, c('participant'), transform, avg = mean(latency),
x = latency / mean(latency, na.rm = TRUE)
#Example Data
> dput(head (df2, 20))
structure(list(participant = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), timestamp = c(1547125307L,
1547125307L, 1547125307L, 1547125307L, 1547125307L, 1547125307L,
1547125307L, 1547125307L, 1547125307L, 1547125307L, 1547125307L,
1547125307L, 1547125307L, 1547125307L, 1547125307L, 1547125307L,
1547125307L, 1547125307L, 1547125307L, 1547125307L), dominance = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), blocknum = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), trialnum = 1:20,
condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CFS", class = "factor"),
eyesidecfs = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("lefteye",
"righteye"), class = "factor"), stimside = structure(c(2L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 2L, 2L), .Label = c("left", "right"), class = "factor"),
stimpos = c(-97L, -97L, -83L, -55L, -47L, -1L, 61L, 46L,
-4L, 28L, -60L, 16L, 11L, 77L, 96L, 52L, -29L, 23L, 84L,
93L), stimulus = structure(c(6L, 41L, 13L, 45L, 1L, 45L,
40L, 44L, 19L, 38L, 13L, 35L, 39L, 16L, 3L, 33L, 25L, 4L,
2L, 9L), .Label = c("attr_male_0.bmp", "attr_male_1.bmp",
"attr_male_10.bmp", "attr_male_11.bmp", "attr_male_12.bmp",
"attr_male_13.bmp", "attr_male_14.bmp", "attr_male_15.bmp",
"attr_male_16.bmp", "attr_male_17.bmp", "attr_male_18.bmp",
"attr_male_19.bmp", "attr_male_2.bmp", "attr_male_20.bmp",
"attr_male_21.bmp", "attr_male_3.bmp", "attr_male_4.bmp",
"attr_male_5.bmp", "attr_male_6.bmp", "attr_male_7.bmp",
"attr_male_8.bmp", "attr_male_9.bmp", "practice_0.png", "practice_1.png",
"unattr_male_0.bmp", "unattr_male_1.bmp", "unattr_male_10.bmp",
"unattr_male_11.bmp", "unattr_male_12.bmp", "unattr_male_13.bmp",
"unattr_male_14.bmp", "unattr_male_15.bmp", "unattr_male_16.bmp",
"unattr_male_17.bmp", "unattr_male_18.bmp", "unattr_male_19.bmp",
"unattr_male_2.bmp", "unattr_male_20.bmp", "unattr_male_21.bmp",
"unattr_male_3.bmp", "unattr_male_4.bmp", "unattr_male_5.bmp",
"unattr_male_6.bmp", "unattr_male_7.bmp", "unattr_male_8.bmp",
"unattr_male_9.bmp"), class = "factor"), response = structure(c(1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 1L, 1L), .Label = c("num_5", "s", "None"), class = "factor"),
correct = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), latency = c("0.957696499963",
"0.791598779233", "1.10196583883", "1.47500942541", "1.10195224516",
"0.874937406699", "0.974977383185", "0.891667885011", "0.925115802807",
"1.29170322855", "1.10850134231", "1.27520744911", "2.82531718331",
"1.40841043117", "2.24205221134", "1.1019921939", "0.74171666964",
"1.32521745017", "1.12505149643", "0.891592148851"), stimulus2 = structure(c(6L,
41L, 13L, 45L, 1L, 45L, 40L, 44L, 19L, 38L, 13L, 35L, 39L,
16L, 3L, 33L, 25L, 4L, 2L, 9L), .Label = c("attr_male_0.bmp",
"attr_male_1.bmp", "attr_male_10.bmp", "attr_male_11.bmp",
"attr_male_12.bmp", "attr_male_13.bmp", "attr_male_14.bmp",
"attr_male_15.bmp", "attr_male_16.bmp", "attr_male_17.bmp",
"attr_male_18.bmp", "attr_male_19.bmp", "attr_male_2.bmp",
"attr_male_20.bmp", "attr_male_21.bmp", "attr_male_3.bmp",
"attr_male_4.bmp", "attr_male_5.bmp", "attr_male_6.bmp",
"attr_male_7.bmp", "attr_male_8.bmp", "attr_male_9.bmp",
"practice_0.png", "practice_1.png", "unattr_male_0.bmp",
"unattr_male_1.bmp", "unattr_male_10.bmp", "unattr_male_11.bmp",
"unattr_male_12.bmp", "unattr_male_13.bmp", "unattr_male_14.bmp",
"unattr_male_15.bmp", "unattr_male_16.bmp", "unattr_male_17.bmp",
"unattr_male_18.bmp", "unattr_male_19.bmp", "unattr_male_2.bmp",
"unattr_male_20.bmp", "unattr_male_21.bmp", "unattr_male_3.bmp",
"unattr_male_4.bmp", "unattr_male_5.bmp", "unattr_male_6.bmp",
"unattr_male_7.bmp", "unattr_male_8.bmp", "unattr_male_9.bmp"
), class = "factor"), Group = c(1, 2, 1, 2, 1, 2, 2, 2, 1,
2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1)), row.names = 5:24, class = "data.frame")
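The dput output above shows latency stored as quoted strings, i.e. as a character vector, and in the posted code the result of as.numeric() is never assigned back. A minimal sketch of the intended computation, assuming that is the cause, could look like this. The small data frame `df` is made up for illustration, and base R's ave() is used here in place of ddply.

```r
# Made-up data with latency stored as character, as in the dput output.
df <- data.frame(
  participant = c(1L, 1L, 2L, 2L),
  latency     = c("1.0", "3.0", "2.0", "6.0"),
  stringsAsFactors = FALSE
)

# Assign the converted values back; calling as.numeric() alone changes nothing.
df$latency <- as.numeric(df$latency)

# Per-participant mean, repeated for every row of that participant.
participant_mean <- ave(df$latency, df$participant,
                        FUN = function(x) mean(x, na.rm = TRUE))

# Each latency relative to its participant's mean.
df$rel_latency <- df$latency / participant_mean
```

If the real column contains entries that cannot be parsed as numbers, as.numeric() turns them into NA with a warning, which is why na.rm = TRUE is kept in the mean.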

ggplot2 with splitting by groups in R [duplicate]

I am trying to draw a scatterplot of two variables, split by two grouping variables:
ggplot(terr, aes(x = Killed, y = Terr..Attacks,group=Religion,Macro.Region)) +
geom_point() +
geom_smooth()
but I didn't get the result I expected.
How can I create a scatterplot by groups?
terr=structure(list(Macro.Region = structure(c(5L, 4L, 4L, 3L, 4L,
6L, 1L, 2L, 4L, 3L, 6L, 5L, 4L, 4L, 3L, 4L, 6L, 1L, 2L, 4L, 3L,
6L), .Label = c("Arab Countries", "Asia", "Eastern Europe and post-Soviet",
"Latin America", "Sub-Saharan Africa", "Western States"), class = "factor"),
Killed = c(0L, 0L, 0L, 6L, 0L, 0L, 1L, 76L, 0L, 0L, 36L,
0L, 0L, 0L, 6L, 0L, 0L, 1L, 76L, 0L, 0L, 36L), Terr..Attacks = c(2L,
0L, 2L, 2L, 0L, 9L, 3L, 88L, 0L, 0L, 6L, 2L, 0L, 2L, 2L,
0L, 9L, 3L, 88L, 0L, 0L, 6L), Religion = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Christianity", "Islam"
), class = "factor"), GDP.capita = c(6813L, 26198L, 20677L,
9098L, NA, 49882L, 51846L, 4207L, 17508L, 18616L, 46301L,
6813L, 26198L, 20677L, 9098L, NA, 49882L, 51846L, 4207L,
17508L, 18616L, 46301L)), class = "data.frame", row.names = c(NA,
-22L))

ggplot(terr, aes(x = Killed, y = Terr..Attacks)) +
geom_point(alpha=1/4) +
facet_wrap(Religion ~ Macro.Region)
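As an alternative to faceting, the two grouping variables can be combined into a single factor and mapped to colour. The sketch below is only one possible approach; `terr_small` is a made-up subset with the same column names as terr, and `grp` is a hypothetical helper column.

```r
# Made-up subset with the same column names as `terr`.
terr_small <- data.frame(
  Killed        = c(0L, 6L, 1L, 76L, 36L, 0L),
  Terr..Attacks = c(2L, 2L, 3L, 88L, 6L, 9L),
  Religion      = c("Christianity", "Christianity", "Islam",
                    "Islam", "Christianity", "Christianity"),
  Macro.Region  = c("Sub-Saharan Africa", "Eastern Europe and post-Soviet",
                    "Arab Countries", "Asia",
                    "Western States", "Western States")
)

# One factor level per observed Religion / Macro.Region combination.
terr_small$grp <- interaction(terr_small$Religion, terr_small$Macro.Region,
                              sep = " / ", drop = TRUE)

# Plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(terr_small, aes(x = Killed, y = Terr..Attacks, colour = grp)) +
    geom_point()
  print(p)
}
```

Colour works well for a handful of groups; with many combinations the faceted version above is usually easier to read.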

Bootstrapping eigenvalues for nonlinear PCA in r

I am running nonlinear PCA in R, using the homals package. Here is a chunk of the code I am using as an example:
res1 <- homals(data = mydata, rank = 1, ndim = 9, level = "nominal")
res1 <- rescale(res1)
I want to generate 1000 bootstrap estimates of the eigenvalues in this analysis (with replacement), but I can't figure out the code. Does anyone have any suggestions?
Sample data:
dput(head(mydata, 30))
structure(list(`W age` = c(45L, 43L, 42L, 36L, 19L, 38L, 21L,
27L, 45L, 38L, 42L, 44L, 42L, 38L, 26L, 48L, 39L, 37L, 39L, 26L,
24L, 46L, 39L, 48L, 40L, 38L, 29L, 24L, 43L, 31L), `W education` = c(1L,
2L, 3L, 3L, 4L, 2L, 3L, 2L, 1L, 1L, 1L, 4L, 2L, 3L, 2L, 1L, 2L,
2L, 2L, 3L, 3L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 1L, 3L), `H education` = c(3L,
3L, 2L, 3L, 4L, 3L, 3L, 3L, 1L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 2L,
2L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 4L), `N children` = c(10L,
7L, 9L, 8L, 0L, 6L, 1L, 3L, 8L, 2L, 4L, 1L, 1L, 2L, 0L, 7L, 6L,
8L, 5L, 1L, 0L, 1L, 1L, 5L, 8L, 1L, 0L, 0L, 8L, 2L), `W religion` = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), `W employment` = c(1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L), `H occupation` = c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 1L, 3L, 2L, 4L, 2L, 2L,
2L, 2L, 4L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 2L, 2L, 1L), `Standard of living` =
c(4L,
4L, 3L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 3L, 4L, 3L, 3L, 1L, 4L, 4L,
3L, 1L, 1L, 1L, 4L, 4L, 4L, 3L, 4L, 4L, 2L, 4L, 4L), Media = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Contraceptive = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("W age",
"W education", "H education", "N children", "W religion", "W employment",
"H occupation", "Standard of living", "Media", "Contraceptive"
), row.names = c(NA, 30L), class = "data.frame")
I was given the rescale function to use with the homals package, to do optimal scaling. Here is the function:
rescale <- function(res) {
# Rescale homals results to proper scaling
n <- nrow(res$objscores)
m <- length(res$catscores)
res$objscores <- (n * m)^0.5 * res$objscores
res$scoremat <- (n * m)^0.5 * res$scoremat
res$catscores <- lapply(res$catscores, FUN = function(x) (n * m)^0.5 * x)
res$cat.centroids <- lapply(res$cat.centroids, FUN = function(x) (n * m)^0.5 * x)
res$low.rank <- lapply(res$low.rank, FUN = function(x) n^0.5 * x)
res$loadings <- lapply(res$loadings, FUN = function(x) m^0.5 * x)
res$discrim <- lapply(res$discrim, FUN = function(x) (n * m)^0.5 * x)
res$eigenvalues <- n * res$eigenvalues
return(res)
}

The standard way to bootstrap in R is to use the boot package, one of the recommended packages that ship with R.
I am not very satisfied with the code that follows because it throws lots of warnings, but maybe this is due to the dataset I tested it with. I used the dataset and the 3rd example in help("homals").
I ran only 10 bootstrap replicates.
library(homals)
library(boot)
boot_eigen <- function(data, indices){
d <- data[indices, ]
res <- homals(d, active = c(rep(TRUE, 4), FALSE), sets = list(c(1,3,4),2,5))
res$eigenvalues
}
data(galo)
set.seed(7578) # Make the results reproducible
eig <- boot(galo, boot_eigen, R = 10)
eig
#
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = galo, statistic = boot_eigen, R = 10)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.1874958 0.03547116 0.005511776
#t2* 0.2210821 -0.02478596 0.005741331
colMeans(eig$t)
#[1] 0.2229669 0.1962961
If this doesn't run properly in your case either, please say so and I will delete the answer.
EDIT.
In response to the discussion in the comments, I have changed the function boot_eigen: the call to homals now follows the code in the question, and rescale is called before returning.
boot_eigen <- function(data, indices){
d <- data[indices, ]
res <- homals(data = d, rank = 1, ndim = 9, level = "nominal")
res <- rescale(res)
res$eigenvalues
}
set.seed(7578) # Make the results reproducible
eig <- boot(mydata, boot_eigen, R = 10)
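Once boot() has run, the replicates are stored in eig$t, one row per replicate and one column per eigenvalue, so standard errors and percentile intervals can be read off directly. The sketch below shows the pattern with 1000 replicates. Because homals may not be installed everywhere, it uses the eigenvalues of a correlation matrix of the built-in mtcars data as a stand-in statistic; that substitution is an assumption for illustration, not the nonlinear PCA from the question.

```r
library(boot)

# Stand-in statistic: eigenvalues of the correlation matrix of the resample.
# In the real analysis this would be the homals + rescale version above.
boot_eigen_demo <- function(data, indices) {
  d <- data[indices, , drop = FALSE]
  eigen(cor(d), symmetric = TRUE, only.values = TRUE)$values
}

set.seed(7578)  # make the results reproducible
eig_demo <- boot(mtcars[, 1:4], boot_eigen_demo, R = 1000)

# Bootstrap standard error of each eigenvalue.
se <- apply(eig_demo$t, 2, sd)

# 95% percentile interval per eigenvalue (rows: lower and upper bound).
ci <- apply(eig_demo$t, 2, quantile, probs = c(0.025, 0.975))
```

For the question's model, replacing `boot_eigen_demo` with `boot_eigen` and `mtcars[, 1:4]` with `mydata` should give the same layout in `eig$t`, although the run time with R = 1000 may be considerable.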

statistic test on univariate time series without replicates in R

I have the following data from an experiment in which I want to find out how a bacterium reacts, on two similar levels (nucleic acids), to a treatment.
Treatment happened after the sampling on day 0 (vertical dashed line). As you can see, it became more abundant (the line is the average, the dots are the measured triplicates). I have 3 technical replicates (doing the lab work 3 times on the same sample) but no biological replicates.
For publication purposes, I want to show that the induced change is significant. So far I have used a two-tailed t-test for heteroscedastic samples, using the 3 sample points from day -25 to 0 as sample group 1 and the 5 sample points from day 3 to 17 as sample group 2 (this is the range where most of my bacteria reacted).
Afterwards I applied the Bonferroni correction to the p-values to correct for multiple testing. But is this the correct way, and is it possible with only technical replicates?
I'm finding many hints on fitting models to my graph, but I only want to test for a statistically significant difference between before and after treatment. So I'm looking for the correct statistics and also how to apply them in R. Any help appreciated!
Here is the plot:
require(ggplot2)
require(scales)
ggplot(data=sample_data, aes(x=days-69,y=value,colour=nucleic_acid,group=nucleic_acid,lty=nucleic_acid))+
geom_vline(aes(xintercept=0),linetype="dashed", size=1.2)+
geom_point(aes(),colour="black")+
stat_summary(aes(colour=nucleic_acid),colour="black",fun.y="mean", geom="line", size=1.5)+
scale_linetype_manual(values=c("dna"=1,"cdna"=4),
name="Nucleic acid ",
breaks=c("cdna","dna"),
labels=c("16S rRNA","16S rDNA"))+
scale_x_continuous(breaks = scales::pretty_breaks(n = 20))+
theme_bw()+
scale_y_continuous(label= function(x) {ifelse(x==0, "0", parse(text=gsub("[+]", "", gsub("e", " %*% 10^", scientific_format()(x)))))})+
theme(axis.title.y = element_text(angle=90,vjust=0.5))+
theme(axis.text=element_text(size=12))+
theme(legend.text=element_text(size=11))+
theme(panel.grid.major=element_line(colour = NA, size = 0.2))+
theme(panel.grid.minor=element_line(colour = NA, size = 0.5))+
theme(legend.position="bottom")+
theme(legend.background = element_rect(fill="grey90",linetype="solid"))+
labs(x="Days",
y=expression(atop("Absolute abundance in cell equivalents",bgroup("[",relative~abundance~x~cells~mL^{-1},"]"))))
and here is my data:
sample_data<-structure(list(time = c(10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L,
11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L,
13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L,
15L, 15L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L,
18L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 4L, 4L,
4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
9L, 9L), days = c(83L, 83L, 83L, 83L, 83L, 83L, 86L, 86L, 86L,
86L, 86L, 86L, 91L, 91L, 91L, 91L, 91L, 91L, 98L, 98L, 98L, 98L,
98L, 98L, 105L, 105L, 105L, 105L, 105L, 105L, 112L, 112L, 112L,
112L, 112L, 112L, 119L, 119L, 119L, 119L, 119L, 119L, 126L, 126L,
126L, 126L, 133L, 133L, 133L, 133L, 133L, 133L, 140L, 140L, 140L,
140L, 140L, 140L, 44L, 44L, 44L, 44L, 44L, 44L, 62L, 62L, 62L,
62L, 62L, 62L, 69L, 69L, 69L, 69L, 69L, 69L, 72L, 72L, 72L, 72L,
72L, 72L, 76L, 76L, 76L, 76L, 76L, 76L, 79L, 79L, 79L, 79L, 79L,
79L), parallel = c(3L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 3L, 2L, 1L,
1L, 2L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 1L,
1L, 3L, 2L, 1L, 1L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 3L,
1L, 1L, 3L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 2L,
3L, 3L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L, 3L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 2L, 3L,
3L, 1L, 2L), nucleic_acid = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("cdna", "dna"), class = "factor"),
habitat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "water", class = "factor"),
value = c(5316639.62, 6402573.912, 6294710.95, 2369809.996,
2679661.691, 2105693.166, 2108794.224, 2487177.041, 6021765.438,
5524939.499, 6016021.786, 2628427.206, 3164229.113, 896068.7656,
2966515.364, 4436008.425, 1860580.149, 3911309.508, 888489.0268,
1004334.365, 1141636.992, 961140.0729, 1072009.18, 1134997.852,
668013.4333, 459645.1058, 645944.1129, 702293.6865, 590620.3693,
642136.7523, 932531.1588, 1224299.065, 1502344.5, 1545034.46,
1122002.798, 1411050.57, 1465061.711, 1378876.488, 810348.2823,
1361496.248, 1056558.288, 897876.4169, 931519.9524, 1165768.09,
957873.9045, 746011.7558, 624116.5603, 522209.2283, 551120.1371,
440096.4446, 565108.4447, 373304.8604, 266595.7171, 333767.4042,
185612.6681, 144899.8736, 173739.3969, 211490.827, 223815.0867,
296455.4243, 1278759.217, 247292.4355, 1171554.199, 1146278.577,
227443.8462, 233542.6719, 253224.2629, 875040.4892, 1151921.616,
1285744.479, 355381.9156, 110724.7928, 252238.9632, 912865.3372,
608269.6498, 500307.5301, 774955.9598, 1374106.94, 3121909.308,
1071086.757, 3033665.589, 2984567.998, 1396313.444, 1356465.773,
4480581.956, 4273141.231, 4957691.655, 1910056.657, 5520085.32,
5094686.657, 5990052.759, 2272441.566, 1513268.608, 1821716.75
), treatment2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Treatment", class = "factor")), .Names = c("time",
"days", "parallel", "nucleic_acid", "habitat", "value", "treatment2"
), class = "data.frame", row.names = c(51243L, 51244L, 51245L,
51246L, 51247L, 51248L, 51255L, 51256L, 51257L, 51258L, 51259L,
51260L, 51267L, 51268L, 51269L, 51270L, 51271L, 51272L, 51279L,
51280L, 51281L, 51282L, 51283L, 51284L, 51291L, 51292L, 51293L,
51294L, 51295L, 51296L, 51303L, 51304L, 51305L, 51306L, 51307L,
51308L, 51315L, 51316L, 51317L, 51318L, 51319L, 51320L, 51326L,
51327L, 51328L, 51329L, 51336L, 51337L, 51338L, 51339L, 51340L,
51341L, 51348L, 51349L, 51350L, 51351L, 51352L, 51353L, 51360L,
51361L, 51362L, 51363L, 51364L, 51365L, 51372L, 51373L, 51374L,
51375L, 51376L, 51377L, 51384L, 51385L, 51386L, 51387L, 51388L,
51389L, 51396L, 51397L, 51398L, 51399L, 51400L, 51401L, 51408L,
51409L, 51410L, 51411L, 51412L, 51413L, 51420L, 51421L, 51422L,
51423L, 51424L, 51425L))

If you want to test for significance of the effect of your treatment and you know how to fit model(s) to your data, you can fit one model that includes the treatment effect and one that doesn't, and then compare the two nested models, for example by means of a likelihood ratio test.
In R this is pretty straightforward (for simplicity I assume a linear model, which may not be the best choice for your data; note that for two nested lm fits, anova reports an F-test by default):
# Fit the model with and without the treatment effect
model_effect <- lm(y ~ Time + Treatment, data)
model_null <- lm(y ~ Time, data)
# Compare the nested models (smaller model first)
anova(model_null, model_effect)
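To make the comparison concrete, here is a small self-contained sketch on simulated data. Everything in it is an assumption for illustration: the values are made up, the sampling days are taken from the question (day -25 to day 17, three rows per day), and the rows are treated as independent observations, which technical replicates strictly are not.

```r
set.seed(42)

# Simulated stand-in data: 8 sampling days, 3 measurements each.
d <- data.frame(Time = rep(c(-25, -7, 0, 3, 7, 10, 14, 17), each = 3))

# Before/after indicator; treatment happens after the sampling on day 0.
d$Treatment <- factor(ifelse(d$Time > 0, "after", "before"),
                      levels = c("before", "after"))

# Made-up response: linear trend plus a jump after treatment plus noise.
d$y <- 1e6 + 2e4 * d$Time +
  ifelse(d$Treatment == "after", 3e6, 0) +
  rnorm(nrow(d), sd = 2e5)

model_null   <- lm(y ~ Time, data = d)
model_effect <- lm(y ~ Time + Treatment, data = d)

# F-test for the added Treatment term.
cmp <- anova(model_null, model_effect)
print(cmp)
```

With real data, one option is to average the technical replicates per day before fitting, so that each sampling day contributes a single observation; whether that is appropriate depends on the experimental design.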

Reshape a large matrix with missing values and multiple vars of interest [duplicate]

I need to reorganize a large dataset into a specific format for further analysis. Right now the data are in long format, with multiple records through time for each point. I need to reshape the data so that each point has a single record, with many new columns holding the time-specific data. I've looked at previous similar posts, but I ultimately need to convert several of the current variables into columns, and I can't find an example of that. Is there a way to accomplish this in a single reshape, or will I have to do several and then bind the new columns back together?
Another wrinkle: not all points were sampled at each time step, so I need those values to show up as NA. For example (see the data below), SitePoint A1 was not sampled at all in 2012, SitePoint A10 was not sampled during the first round in 2012, but K83 was sampled all nine times.
mydatain <- structure(list(SitePoint = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L), .Label = c("A1", "A10", "K145", "K83", "T15",
"T213"), class = "factor"), Year_Rotation = structure(c(1L, 2L,
3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 1L, 2L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 1L, 7L), .Label = c("2010_1", "2010_2",
"2010_3", "2011_1", "2011_2", "2011_3", "2012_1", "2012_2", "2012_3"
), class = "factor"), MR_Fire = structure(c(5L, 6L, 6L, 2L, 9L,
9L, 5L, 6L, 6L, 2L, 9L, 9L, 7L, 8L, 16L, 17L, 21L, 22L, 23L,
25L, 3L, 4L, 10L, 11L, 12L, 13L, 14L, 15L, 18L, 19L, 20L, 1L,
2L, 2L, 5L, 6L, 6L, 11L, 11L, 12L, 7L, 24L), .Label = c("0",
"1", "10", "11", "12", "13", "14", "15", "2", "23", "24", "25",
"35", "36", "37", "39", "40", "47", "48", "49", "51", "52", "53",
"8", "9"), class = "factor"), fire_seas = structure(c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L), .Label = c("dry", "fire", "wet"
), class = "factor"), OptTSF = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 1L)), .Names = c("SitePoint", "Year_Rotation", "MR_Fire",
"fire_seas", "OptTSF"), row.names = c(31L, 32L, 33L, 34L, 35L,
36L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 10543L, 10544L,
10545L, 10546L, 10547L, 10548L, 10549L, 10550L, 14988L, 14989L,
14990L, 14991L, 14992L, 14993L, 14994L, 14995L, 14996L, 17370L,
17371L, 17372L, 17373L, 17374L, 17375L, 17376L, 17377L, 17378L,
19353L, 19354L), class = "data.frame")
Ultimately I need something like this:
myfinal <- structure(list(SitePoint = structure(1:6, .Label = c("A1", "A10",
"K145", "K83", "T15", "T213"), class = "factor"), MR_Fire_2010_1 = c(12L,
12L, 39L, 23L, 0L, 14L), MR_Fire_2010_2 = c(13L, 13L, 40L, 24L,
1L, NA), MR_Fire_2010_3 = c(13L, 13L, NA, 25L, 1L, NA), MR_Fire_2011_1 = c(1L,
1L, 51L, 35L, 12L, NA), MR_Fire_2011_2 = c(2L, 2L, 52L, 36L,
13L, NA), MR_Fire_2011_3 = c(2L, 2L, 53L, 37L, 13L, NA), MR_Fire_2012_1 = c(NA,
NA, 9L, 47L, 24L, 8L), MR_Fire_2012_2 = c(NA, 14L, 10L, 48L,
24L, NA), MR_Fire_2012_3 = c(NA, 15L, 11L, 49L, 25L, NA), season_2010_1 = structure(c(2L,
2L, 1L, 2L, 2L, 1L), .Label = c("dry", "fire"), class = "factor"),
season_2010_2 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2010_3 = structure(c(1L,
1L, NA, 1L, 1L, NA), .Label = "fire", class = "factor"),
season_2011_1 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2011_2 = structure(c(2L,
2L, 1L, 2L, 2L, NA), .Label = c("dry", "fire"), class = "factor"),
season_2011_3 = structure(c(2L, 2L, 1L, 2L, 2L, NA), .Label = c("dry",
"fire"), class = "factor"), season_2012_1 = structure(c(NA,
NA, 2L, 1L, 1L, 2L), .Label = c("fire", "wet"), class = "factor"),
season_2012_2 = structure(c(NA, 1L, 2L, 1L, 1L, NA), .Label = c("fire",
"wet"), class = "factor"), season_2012_3 = structure(c(NA,
1L, 2L, 1L, 1L, NA), .Label = c("fire", "wet"), class = "factor"),
OptTSF_2010_1 = c(1L, 1L, 0L, 1L, 1L, 1L), OptTSF_2010_2 = c(1L,
1L, 0L, 1L, 1L, NA), OptTSF_2010_3 = c(1L, 1L, NA, 1L, 1L,
NA), OptTSF_2011_1 = c(1L, 1L, 0L, 0L, 1L, NA), OptTSF_2011_2 = c(1L,
1L, 0L, 0L, 1L, NA), OptTSF_2011_3 = c(1L, 1L, 0L, 0L, 1L,
NA), OptTSF_2012_1 = c(NA, NA, 1L, 0L, 0L, 1L), OptTSF_2012_2 = c(NA,
1L, 1L, 0L, 0L, NA), OptTSF_2012_3 = c(NA, 1L, 1L, 0L, 0L,
NA)), .Names = c("SitePoint", "MR_Fire_2010_1", "MR_Fire_2010_2",
"MR_Fire_2010_3", "MR_Fire_2011_1", "MR_Fire_2011_2", "MR_Fire_2011_3",
"MR_Fire_2012_1", "MR_Fire_2012_2", "MR_Fire_2012_3", "season_2010_1",
"season_2010_2", "season_2010_3", "season_2011_1", "season_2011_2",
"season_2011_3", "season_2012_1", "season_2012_2", "season_2012_3",
"OptTSF_2010_1", "OptTSF_2010_2", "OptTSF_2010_3", "OptTSF_2011_1",
"OptTSF_2011_2", "OptTSF_2011_3", "OptTSF_2012_1", "OptTSF_2012_2",
"OptTSF_2012_3"), class = "data.frame", row.names = c(NA, -6L
))
The actual dataset is about 23656 records x 15 variables, so doing it by hand is likely to cause major headaches and mistakes. Any help or suggestions are appreciated. If this has been answered elsewhere, apologies. I couldn't find anything directly applicable; everything seemed to relate to three columns, with only one of those being extracted as new variables. Thanks.
SP

dcast from data.table (v1.9.5, the development version at the time of writing, or later) can cast multiple columns simultaneously.
library(data.table) ## v1.9.5+
dcast(setDT(mydatain), SitePoint~Year_Rotation,
value.var=c('MR_Fire', 'fire_seas', 'OptTSF'))

You can use reshape to change the structure of your dataframe from long to wide using the following code:
reshape(mydatain,timevar="Year_Rotation",idvar="SitePoint",direction="wide")
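A tiny made-up example shows what the base reshape call does with several measure columns and with missing time points. The data frame `long` below is hypothetical and only mirrors the column names of the question.

```r
# Made-up long data: A10 has no record for 2010_2.
long <- data.frame(
  SitePoint     = c("A1", "A1", "A10"),
  Year_Rotation = c("2010_1", "2010_2", "2010_1"),
  MR_Fire       = c(12L, 13L, 12L),
  OptTSF        = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Every column that is neither idvar nor timevar is spread into one
# column per time point, named <variable>.<time>.
wide <- reshape(long, timevar = "Year_Rotation", idvar = "SitePoint",
                direction = "wide")
print(wide)
```

Time points that were not sampled for a given SitePoint come out as NA, which is the behaviour asked for in the question. The new columns are named with a dot separator (for example MR_Fire.2010_1); the sep argument of reshape can change that.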
