The output of my classification predictive model is as follows:
a <- c(1,1.1,1,1,2,0.9,1.1,1,1.1,1) ## Class A
b <- c(2,2.1,1.9,1.7,2,2,3,2,2,2) ## Class B
c <- c(3,3,3.1,3.6,3.2,2,3.1,3,3,3) ## Class C
x <- data.frame(c(a,b,c))
x$color <- rep(c("red","green","blue"),times=c(10,10,10))
I am trying to find out at which position the class type changes. In this scenario, the class type changes at positions 11 and 21.
I have tried the following packages : CPM, EPC, QCC but it is not giving me what I desire.
What's the best way to detect that the class type has changed?
What about
library(strucchange)
( bp <- breakpoints( x[, 1] ~ 1) )
# Optimal 3-segment partition:
#
# Call:
# breakpoints.formula(formula = x[, 1] ~ 1)
#
# Breakpoints at observation number:
# 10 20
#
# Corresponding to breakdates:
# 0.3333333 0.6666667
bp$breakpoints
# [1] 10 20
(via)
Related
You can see I'm a beginner at this when I'm not even able to reproduce my problem with a dummy dataset... Anyways, here goes: I want to calculate tetrachoric correlations between one grouping variable and multiple other variables. Like this:
library(psych)
set.seed(42)
n <- 16
dat <- data.frame(id=1:n,
group=c(rep("a", times=5), rep("b", times=3)),
x=sample(1:2, n, replace=TRUE),
y=sample(1:2, n, replace=TRUE),
z=sample(1:2, n, replace=TRUE))
dat
id group x y z
1 1 a 1 1 2
2 2 a 1 2 2
3 3 a 1 1 2
4 4 a 1 2 2
5 5 a 2 1 1
6 6 b 2 2 1
7 7 b 2 1 1
8 8 b 2 1 1
tetrachoric(as.matrix(dat[,c("group","y")]))
Now with this example (not with my actual dataset) I get an error which I'm unable to solve:
Error in apply(x, 2, function(x) min(x, na.rm = TRUE)) :
dim(X) must have a positive length
In addition: Warning messages:
1: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion
2: In tetrachoric(as.matrix(dat[, c("group", "y")])) :
Item = group had no variance and was deleted
My question is still what would be the best solution to get all the correlations with a single piece of code? Thank you for help!
The help file for tetrachoric says "The tetrachoric correlation is the inferred Pearson Correlation from a two x two table with the assumption of bivariate normality", so presumably you need to pass it a 2x2 table. You could write a little function that would hand the tetrachoric the appropriate table and collect the results:
myfun <- function(x,y, ...){
tabs <- lapply(seq_along(y), function(i)table(x,y[,i]))
l <- lapply(tabs, function(x)tetrachoric(x, ...))
rho <- sapply(l, function(x)x$rho)
tau <- sapply(l, function(x)x$tau)
colnames(tau) <- colnames(y)
names(rho) <- colnames(y)
ret <- list(rho = rho ,
tau = tau)
ret
}
myfun(dat$group, dat[,c("x", "y", "z")])
# $rho
# x y z
# 0.5397901 -0.2605839 0.6200705
#
# $tau
# x y z
# a 0.3186394 0.3186394 0.2690661
# 1 0.1573107 0.1573107 -0.6045853
I'm trying to come up with a mechanism to prevent empty results in my lm() output. To be exact, I want to first find them and then prevent them from being entered into a new lm() call.
For example, in the example below, cf.type99:time3 & cf.type99:time4 in the output of tmp are NAs. Thus, after detecting them, I want to prevent them from being included in the second lm() call (new_tmp).
Is this possible in R?
p.s. I want new_tmp not to fit cf.type99:time3 & cf.type99:time4 in its model.matrix(), and not just deleting the NA's physically from its output.
d5 <- read.csv('https://raw.githubusercontent.com/rnorouzian/m/master/v14.csv')
d5[c("cf.type","time")] <- lapply(d5[c("cf.type","time")], as.factor)
tmp <- lm(dint~cf.type*time, data = d5)
(coef.na <- is.na(coef(tmp))) # detects the `NA` in the output:
# cf.type2:time3 cf.type3:time3 cf.type8:time3 cf.type99:time3 cf.type1:time4
# FALSE FALSE FALSE TRUE FALSE
# cf.type2:time4 cf.type3:time4 cf.type8:time4 cf.type99:time4
# FALSE FALSE FALSE TRUE
new_tmp <- lm(dint~cf.type*time, data = d5) # NOW PREVENT `cf.type99:time3` & `cf.type99:time4` from entering new_tmp
You could drop the columns off the model.matrix,
X.star <- model.matrix(tmp)[,!is.na(tmp$coefficients)]
then use lm.fit to calculate coefficients,
y <- d5$dint
fit.star <- lm.fit(X.star, y)
(beta.hat <- fit.star$coe)
# (Intercept) cf.type1 cf.type2 cf.type3 cf.type8
# 0.72824752 -0.20633963 0.72317588 -0.08369434 0.17387840
# cf.type99 time2 time3 time4 cf.type1:time2
# -0.53932949 -0.49471558 -0.34560481 0.01830757 0.28441094
# cf.type2:time2 cf.type3:time2 cf.type8:time2 cf.type99:time2 cf.type1:time3
# -0.44254640 0.77070527 0.71423737 0.51203670 1.15793712
# cf.type2:time3 cf.type3:time3 cf.type8:time3 cf.type1:time4 cf.type2:time4
# -0.66306413 0.18818360 -0.12287313 -0.06825029 -1.06941453
# cf.type3:time4 cf.type8:time4
# -0.37337606 -0.06063658
and optionally standard errors by hand.
## Standard errors
y.hat <- X.star %*% beta.hat
e.hat <- y - y.hat
n <- dim(X.star)[2]
m <- dim(X.star)[1]
sig2 <- (t(e.hat) %*% e.hat) / (m - (n + 1))
res.se <- sqrt(sig2)
diag.xx <- diag(solve(t(X.star) %*% X.star))
beta.se <- as.vector(res.se)*sqrt(diag.xx)
Result
cbind(Estimate=beta.hat, `Std. Error`=beta.se, `t value`=beta.hat/beta.se,
`Pr(>|t|)`=2*pt(-abs(beta.hat/beta.se), df=fit.star$df.residual))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.72824752 0.1420687 5.12602302 6.252427e-07
# cf.type1 -0.20633963 0.2460702 -0.83853955 4.025946e-01
# cf.type2 0.72317588 0.4921405 1.46945012 1.430713e-01
# cf.type3 -0.08369434 0.2751149 -0.30421597 7.612372e-01
# cf.type8 0.17387840 0.2231278 0.77927713 4.366141e-01
# cf.type99 -0.53932949 0.6935709 -0.77761265 4.375931e-01
# time2 -0.49471558 0.1813017 -2.72868700 6.847850e-03
# time3 -0.34560481 0.1834099 -1.88432993 6.077612e-02
# time4 0.01830757 0.2391373 0.07655675 9.390424e-01
# cf.type1:time2 0.28441094 0.3152098 0.90229103 3.678421e-01
# cf.type2:time2 -0.44254640 0.5208981 -0.84958339 3.964364e-01
# cf.type3:time2 0.77070527 0.4209305 1.83095614 6.839524e-02
# cf.type8:time2 0.71423737 0.4165119 1.71480662 8.772179e-02
# cf.type99:time2 0.51203670 0.8460178 0.60523161 5.456192e-01
# cf.type1:time3 1.15793712 0.4035018 2.86971958 4.489380e-03
# cf.type2:time3 -0.66306413 0.5522639 -1.20062913 2.311248e-01
# cf.type3:time3 0.18818360 0.4218428 0.44609884 6.559437e-01
# cf.type8:time3 -0.12287313 0.3559152 -0.34523149 7.302345e-01
# cf.type1:time4 -0.06825029 0.3912267 -0.17445200 8.616630e-01
# cf.type2:time4 -1.06941453 0.6066406 -1.76284686 7.924862e-02
# cf.type3:time4 -0.37337606 0.4196728 -0.88968377 3.745613e-01
# cf.type8:time4 -0.06063658 0.3477727 -0.17435692 8.617377e-01
I have defined a function to calculate the relationship between height (h) and diameter (dbh) of trees based on equations extracted from 2 publications. My goal is to use the relationship established in paper 1 (Xiangtao) to predict the values of variables in an equation in paper 2 (Marechaux and Chave). I would like to test to see over what diameter range [x:y] the generated nls() curve of paper 2 fits paper 1. Currently, I keep getting an error (I believe in plot())
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
if I use anything except x=1 for [x:y] i.e. dbh.min:dbh.max
My function is as follows:
# Plant.Functional.Type constants...
Dsb1 <- 2.09
Dsb2 <- 0.54
Db1 <- 0.93
Db2 <- 0.84
BDb1 <- 2.66
BDb2 <- 0.48
Eb1 <- 1.41
Eb2 <- 0.65
# # # # # # # # # # # # # # # # # # # # # # # # # # #
Generate.curve <- function(b1, b2, dbh.min, dbh.max){
# calculate Xiangtao's allometry...
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
h = b1*dbh^(b2)
tmp_h[dbh] = h
}
# plot to check curve
plot(dbh.min:dbh.max, tmp_h)
# define secondary function for Marechaux and Chave allometry
h_fxn <- function(hlim,dbh,ah){
h = hlim * (dbh / (dbh + ah))
return(h)
}
# use nonlinear least squares model to solve for ah and hlim
# set model inputs
start.ah <- 1
start.hlim <- 5
tmp_v <- cbind(dbh.min:dbh.max,tmp_h)
tmp.fit <- nls(tmp_h ~ h_fxn(hlim,dbh.min:dbh.max,ah), start = list(hlim = start.hlim,
ah = start.ah), algorithm = "port", upper = list(hlim = 75, ah = 99))
# seems to be no way of extracting ah and hlim from tmp.fit via subset
# extract manually and then check fit with
# lines(dbh.min:dbh.max, hlim * (dbh.min:dbh.max/(dbh.min:dbh.max + ah)))
# for equation h = hlim * (dbh / (dbh + ah)) from Marechaux and Chave
return(tmp.fit)
}
# # # # # # # # # # # # # # # # # # # # # # # # # # #
This works great for
Generate.curve(Dsb1,Dsb2,1,100)
lines(1:100, 36.75 * (1:100/(1:100 + 52.51)))
But I would like to be able to examine the curve fit in ranges such as [80:100] as well.
I have been trying to figure out why Generate.curve(Dsb1,Dsb2,80,100) returns an error for about 3 days now. Thanks for any help.
Your problem lies in this section:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
h = b1*dbh^(b2)
tmp_h[dbh] = h
}
Think about what happens when you set dbh.min to 80 and dbh.max to 100:
tmp_h <- 80:100
for (dbh in 80:100)
{
h = b1*dbh^(b2)
tmp_h[dbh] = h
}
What happens on the first cycle of the loop? Well, tmp_h is length 20, but on the first cycle, dbh is 80, and you are assigning a number to tmp_h[dbh], which is tmp_h[80]. By the time the loop has finished, tmp_h will have the correct values stored, but they will be in the indices 80:100. So tmp_h will have the numbers 80:100 stored in the first 21 indices, then a bunch of NAs then the correct numbers in the last 21 indices.
So change it to:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
h = b1*dbh^(b2)
tmp_h[dbh - dbh.min + 1] = h
}
and it will work.
However, you don't actually need a loop at all here, since R uses vectorized operations, so this whole section can be replaced with:
tmp_h <- b1 * (dbh.min:dbh.max)^(b2)
and then when you do
Generate.curve(Dsb1,Dsb2,80,100)
lines(80:100, 36.75 * (80:100/(80:100 + 52.51)))
you get this:
I have data set from a incomplete lattice design study that I have imported into R from excel and would like to conduct a PBIB.test. However, after running the function as shown below, the output shows object Area not found, even after repeated times.
library("agricolae", lib.loc = "~/R/win-library/3.3")
Rdata2 <- PBIB.test("BlockNo", "AccNo", "Rep", Area, k = 9, c("REML"), console = TRUE)
Error in data.frame(v1 = 1, y) : object 'Area' not found
What is the problem?
See below for a sampleĀ application of PBIB.test, based on the agricolae tutorial.
First, create some sample data.
# Construct the alpha design with 30 treatments, 2 repetitions, and block size = 3
Genotype <- c(paste("gen0", 1:9, sep= ""), paste("gen", 10:30, sep= ""));
r <- 2;
k <- 3;
s <- 10;
b <- s * r;
book <- design.alpha(Genotype, k, r,seed = 5);
# Source dataframe
df <- book$book;
Create a vector of response values.
# Response variable
response <- c(
5,2,7,6,4,9,7,6,7,9,6,2,1,1,3,2,4,6,7,9,8,7,6,4,3,2,2,1,1,2,
1,1,2,4,5,6,7,8,6,5,4,3,1,1,2,5,4,2,7,6,6,5,6,4,5,7,6,5,5,4);
Run PBIB.test
model <- with(df, PBIB.test(block, Genotype, replication, response, k = 3, method="REML"))
head(model);
#$ANOVA
#Analysis of Variance Table
#
#Response: yield
# Df Sum Sq Mean Sq F value Pr(>F)
#Genotype 29 72.006 2.4830 1.2396 0.3668
#Residuals 11 22.034 2.0031
#
#$method
#[1] "Residual (restricted) maximum likelihood"
#
#$parameters
# test name.t treatments blockSize blocks r alpha
# PBIB-lsd Genotype 30 3 10 2 0.05
#
#$statistics
# Efficiency Mean CV
# 0.6170213 4.533333 31.22004
#
#$model
#Linear mixed-effects model fit by REML
# Data: NULL
# Log-restricted-likelihood: -73.82968
# Fixed: y ~ trt.adj
# (Intercept) trt.adjgen02 trt.adjgen03 trt.adjgen04 trt.adjgen05 trt.adjgen06
# 6.5047533 -3.6252940 -0.7701618 -2.5264354 -3.1633495 -1.9413054
#trt.adjgen07 trt.adjgen08 trt.adjgen09 trt.adjgen10 trt.adjgen11 trt.adjgen12
# -3.0096514 -4.0648738 -3.5051139 -2.8765561 -1.7111335 -1.6308755
#trt.adjgen13 trt.adjgen14 trt.adjgen15 trt.adjgen16 trt.adjgen17 trt.adjgen18
# -2.2187974 -2.3393290 -2.0807215 -0.3122845 -3.4526453 -1.0320169
#trt.adjgen19 trt.adjgen20 trt.adjgen21 trt.adjgen22 trt.adjgen23 trt.adjgen24
# -3.1257616 0.2101325 -1.7632411 -1.9177848 -1.0500345 -2.5612960
#trt.adjgen25 trt.adjgen26 trt.adjgen27 trt.adjgen28 trt.adjgen29 trt.adjgen30
# -4.3184716 -2.3071359 1.2239927 -1.3643068 -1.4354599 -0.4726870
#
#Random effects:
# Formula: ~1 | replication
# (Intercept)
#StdDev: 8.969587e-05
#
# Formula: ~1 | block.adj %in% replication
# (Intercept) Residual
#StdDev: 1.683459 1.415308
#
#Number of Observations: 60
#Number of Groups:
# replication block.adj %in% replication
# 2 20
#
#$Fstat
# Fit Statistics
#AIC 213.65937
#BIC 259.89888
#-2 Res Log Likelihood -73.82968
R has the ability to allow you to defined a formula with a transformation function applied to one of the variables. For example in the following formula, the logarithmic transformation will be applied to variable b.
y ~ a + log(b) + c
How does one define their own formula functions in R without the use of I()? For example applying the user defined function foo to a in the following formula.
y ~ foo(a) + b
You can use your own functions in R formulas.
An example function:
foo <- function(x)
log(x) ^ 2
Data:
set.seed(1)
dat <- data.frame(y = rnorm(5), x = rgamma(5, 2))
Create model matrix based on formula and data:
mod <- model.matrix(y ~ 1 + foo(x), data = dat)
mod
# (Intercept) foo(x)
# 1 1 0.16837521
# 2 1 0.02222275
# 3 1 0.68509896
# 4 1 0.01936180
# 5 1 0.14758002
# attr(,"assign")
# [1] 0 1
The function is indeed applied to the data:
identical(foo(dat$x), unname(mod[ , "foo(x)"]))
# [1] TRUE