I'm trying to come up with a mechanism to prevent empty (NA) coefficients in my lm() output. To be exact, I want to first detect them and then prevent them from being entered into a new lm() call.
For example, in the example below, cf.type99:time3 & cf.type99:time4 in the output of tmp are NAs. Thus, after detecting them, I want to prevent them from being included in the second lm() call (new_tmp).
Is this possible in R?
P.S. I want new_tmp not to fit cf.type99:time3 and cf.type99:time4 in its model.matrix() at all, not just to have the NAs deleted from its output.
d5 <- read.csv('https://raw.githubusercontent.com/rnorouzian/m/master/v14.csv')
d5[c("cf.type","time")] <- lapply(d5[c("cf.type","time")], as.factor)
tmp <- lm(dint~cf.type*time, data = d5)
(coef.na <- is.na(coef(tmp))) # detects the `NA` in the output:
# cf.type2:time3 cf.type3:time3 cf.type8:time3 cf.type99:time3 cf.type1:time4
# FALSE FALSE FALSE TRUE FALSE
# cf.type2:time4 cf.type3:time4 cf.type8:time4 cf.type99:time4
# FALSE FALSE FALSE TRUE
new_tmp <- lm(dint~cf.type*time, data = d5) # NOW PREVENT `cf.type99:time3` & `cf.type99:time4` from entering new_tmp
You could drop those columns from the model.matrix,
X.star <- model.matrix(tmp)[,!is.na(tmp$coefficients)]
then use lm.fit to calculate coefficients,
y <- d5$dint
fit.star <- lm.fit(X.star, y)
(beta.hat <- fit.star$coefficients)
# (Intercept) cf.type1 cf.type2 cf.type3 cf.type8
# 0.72824752 -0.20633963 0.72317588 -0.08369434 0.17387840
# cf.type99 time2 time3 time4 cf.type1:time2
# -0.53932949 -0.49471558 -0.34560481 0.01830757 0.28441094
# cf.type2:time2 cf.type3:time2 cf.type8:time2 cf.type99:time2 cf.type1:time3
# -0.44254640 0.77070527 0.71423737 0.51203670 1.15793712
# cf.type2:time3 cf.type3:time3 cf.type8:time3 cf.type1:time4 cf.type2:time4
# -0.66306413 0.18818360 -0.12287313 -0.06825029 -1.06941453
# cf.type3:time4 cf.type8:time4
# -0.37337606 -0.06063658
and optionally standard errors by hand.
## Standard errors
y.hat <- X.star %*% beta.hat                 # fitted values
e.hat <- y - y.hat                           # residuals
n <- ncol(X.star)                            # number of estimated coefficients (intercept already included)
m <- nrow(X.star)                            # number of observations
sig2 <- sum(e.hat^2) / (m - n)               # residual variance; same df as fit.star$df.residual
res.se <- sqrt(sig2)                         # residual standard error
diag.xx <- diag(solve(t(X.star) %*% X.star))
beta.se <- res.se * sqrt(diag.xx)            # coefficient standard errors
Result
cbind(Estimate=beta.hat, `Std. Error`=beta.se, `t value`=beta.hat/beta.se,
`Pr(>|t|)`=2*pt(-abs(beta.hat/beta.se), df=fit.star$df.residual))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.72824752 0.1420687 5.12602302 6.252427e-07
# cf.type1 -0.20633963 0.2460702 -0.83853955 4.025946e-01
# cf.type2 0.72317588 0.4921405 1.46945012 1.430713e-01
# cf.type3 -0.08369434 0.2751149 -0.30421597 7.612372e-01
# cf.type8 0.17387840 0.2231278 0.77927713 4.366141e-01
# cf.type99 -0.53932949 0.6935709 -0.77761265 4.375931e-01
# time2 -0.49471558 0.1813017 -2.72868700 6.847850e-03
# time3 -0.34560481 0.1834099 -1.88432993 6.077612e-02
# time4 0.01830757 0.2391373 0.07655675 9.390424e-01
# cf.type1:time2 0.28441094 0.3152098 0.90229103 3.678421e-01
# cf.type2:time2 -0.44254640 0.5208981 -0.84958339 3.964364e-01
# cf.type3:time2 0.77070527 0.4209305 1.83095614 6.839524e-02
# cf.type8:time2 0.71423737 0.4165119 1.71480662 8.772179e-02
# cf.type99:time2 0.51203670 0.8460178 0.60523161 5.456192e-01
# cf.type1:time3 1.15793712 0.4035018 2.86971958 4.489380e-03
# cf.type2:time3 -0.66306413 0.5522639 -1.20062913 2.311248e-01
# cf.type3:time3 0.18818360 0.4218428 0.44609884 6.559437e-01
# cf.type8:time3 -0.12287313 0.3559152 -0.34523149 7.302345e-01
# cf.type1:time4 -0.06825029 0.3912267 -0.17445200 8.616630e-01
# cf.type2:time4 -1.06941453 0.6066406 -1.76284686 7.924862e-02
# cf.type3:time4 -0.37337606 0.4196728 -0.88968377 3.745613e-01
# cf.type8:time4 -0.06063658 0.3477727 -0.17435692 8.617377e-01
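If you would rather end up with a genuine lm object (so that summary(), vcov() and friends work) instead of the bare lm.fit() output, one option is to hand the reduced model matrix straight to lm(). A minimal sketch, assuming X.star and d5 from above:
# fit on the reduced design matrix; -1 because X.star already contains the intercept column
new_tmp <- lm(d5$dint ~ X.star - 1)
coef(new_tmp)     # same estimates as beta.hat, with "X.star" prefixed to the coefficient names
summary(new_tmp)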
I have defined a function to calculate the relationship between height (h) and diameter (dbh) of trees, based on equations extracted from two publications. My goal is to use the relationship established in paper 1 (Xiangtao) to predict the values of variables in an equation in paper 2 (Marechaux and Chave). I would like to test over what diameter range [x:y] the nls() curve generated for paper 2 fits paper 1. Currently, I keep getting an error (I believe in plot()):
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
whenever I use anything except x = 1 for [x:y], i.e. dbh.min:dbh.max.
My function is as follows:
# Plant.Functional.Type constants...
Dsb1 <- 2.09
Dsb2 <- 0.54
Db1 <- 0.93
Db2 <- 0.84
BDb1 <- 2.66
BDb2 <- 0.48
Eb1 <- 1.41
Eb2 <- 0.65
# # # # # # # # # # # # # # # # # # # # # # # # # # #
Generate.curve <- function(b1, b2, dbh.min, dbh.max){
  # calculate Xiangtao's allometry...
  tmp_h <- c(dbh.min:dbh.max)
  for (dbh in dbh.min:dbh.max)
  {
    h = b1*dbh^(b2)
    tmp_h[dbh] = h
  }
  # plot to check curve
  plot(dbh.min:dbh.max, tmp_h)
  # define secondary function for Marechaux and Chave allometry
  h_fxn <- function(hlim, dbh, ah){
    h = hlim * (dbh / (dbh + ah))
    return(h)
  }
  # use nonlinear least squares model to solve for ah and hlim
  # set model inputs
  start.ah <- 1
  start.hlim <- 5
  tmp_v <- cbind(dbh.min:dbh.max, tmp_h)
  tmp.fit <- nls(tmp_h ~ h_fxn(hlim, dbh.min:dbh.max, ah),
                 start = list(hlim = start.hlim, ah = start.ah),
                 algorithm = "port", upper = list(hlim = 75, ah = 99))
  # seems to be no way of extracting ah and hlim from tmp.fit via subset
  # extract manually and then check fit with
  # lines(dbh.min:dbh.max, hlim * (dbh.min:dbh.max/(dbh.min:dbh.max + ah)))
  # for equation h = hlim * (dbh / (dbh + ah)) from Marechaux and Chave
  return(tmp.fit)
}
# # # # # # # # # # # # # # # # # # # # # # # # # # #
This works great for
Generate.curve(Dsb1,Dsb2,1,100)
lines(1:100, 36.75 * (1:100/(1:100 + 52.51)))
But I would like to be able to examine the curve fit in ranges such as [80:100] as well.
For about 3 days now I have been trying to figure out why Generate.curve(Dsb1,Dsb2,80,100) returns an error. Thanks for any help.
Your problem lies in this section:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
  h = b1*dbh^(b2)
  tmp_h[dbh] = h
}
Think about what happens when you set dbh.min to 80 and dbh.max to 100:
tmp_h <- 80:100
for (dbh in 80:100)
{
  h = b1*dbh^(b2)
  tmp_h[dbh] = h
}
What happens on the first cycle of the loop? Well, tmp_h is length 21, but on the first cycle dbh is 80, and you are assigning a number to tmp_h[dbh], which is tmp_h[80]. By the time the loop has finished, tmp_h will have the correct values stored, but they will be at the indices 80:100. So tmp_h will have the numbers 80:100 stored in the first 21 indices, then a bunch of NAs, then the correct heights in the last 21 indices. The plot(dbh.min:dbh.max, tmp_h) call is then given an x vector of length 21 and a y vector of length 100, which is exactly the "'x' and 'y' lengths differ" error you see.
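A quick sketch of that silent growth at the console (the first assignment here just uses the Dsb constants from the question):
tmp_h <- 80:100
length(tmp_h)                  # 21
tmp_h[80] <- 2.09 * 80^0.54    # what the first pass of the loop does
length(tmp_h)                  # 80 -- the vector has silently grown
sum(is.na(tmp_h))              # 58 NAs now sit in positions 22:79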
So change it to:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
  h = b1*dbh^(b2)
  tmp_h[dbh - dbh.min + 1] = h
}
and it will work.
However, you don't actually need a loop at all here, since R uses vectorized operations, so this whole section can be replaced with:
tmp_h <- b1 * (dbh.min:dbh.max)^(b2)
and then when you do
Generate.curve(Dsb1,Dsb2,80,100)
lines(80:100, 36.75 * (80:100/(80:100 + 52.51)))
you get a plot of Xiangtao's points over the 80:100 range with the fitted Marechaux and Chave curve overlaid.
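For completeness, here is a sketch of a fully vectorised Generate.curve (same start values and bounds as in the question; only the loop is replaced):
Generate.curve <- function(b1, b2, dbh.min, dbh.max) {
  dbh <- dbh.min:dbh.max
  # Xiangtao's allometry, vectorised
  tmp_h <- b1 * dbh^b2
  # plot to check curve
  plot(dbh, tmp_h)
  # Marechaux and Chave allometry: h = hlim * dbh / (dbh + ah)
  h_fxn <- function(hlim, dbh, ah) hlim * (dbh / (dbh + ah))
  # nonlinear least squares fit for hlim and ah
  nls(tmp_h ~ h_fxn(hlim, dbh, ah),
      start = list(hlim = 5, ah = 1),
      algorithm = "port", upper = c(hlim = 75, ah = 99))
}
Generate.curve(Dsb1, Dsb2, 80, 100) then works for any diameter range, and the lines() overlay can be added as before.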
I want to create 1000 samples of 200 bivariate normally distributed vectors
library(MASS) # for mvrnorm
set.seed(42)  # for sake of reproducibility
mu <- c(1, 1)
S <- matrix(c(0.56, 0.4,
0.4, 1), nrow=2, ncol=2, byrow=TRUE)
bivn <- mvrnorm(200, mu=mu, Sigma=S)
so that I can run OLS regressions on each sample and therefore get 1000 estimators. I tried this
library(MASS)
bivn_1000 <- replicate(1000, mvrnorm(200, mu=mu, Sigma=S), simplify=FALSE)
but I am stuck there, because now I don't know how to proceed to run the regression for each sample.
I would appreciate the help to know how to run these 1000 regressions and then extract the coefficients.
We could write a custom regression function.
regFun1 <- function(x) summary(lm(x[, 1] ~ x[, 2]))
which we can loop over the data with lapply:
l1 <- lapply(bivn_1000, regFun1)
The coefficients are saved inside a list and can be extracted like so:
l1[[1]]$coefficients # for the first regression
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.5554601 0.06082924 9.131466 7.969277e-17
# x[, 2] 0.4797568 0.04255711 11.273246 4.322184e-23
Edit:
If we solely want the estimators without statistics, we adjust the output of the function accordingly.
regFun2 <- function(x) summary(lm(x[, 1] ~ x[, 2]))$coef[, 1]
Since we may want the output in matrix form we use sapply next.
m2 <- t(sapply(bivn_1000, regFun2))
head(m2)
# (Intercept) x[, 2]
# [1,] 0.6315558 0.4389721
# [2,] 0.5514555 0.4840933
# [3,] 0.6782464 0.3250800
# [4,] 0.6350999 0.3848747
# [5,] 0.5899311 0.3645237
# [6,] 0.6263678 0.3825725
where
dim(m2)
# [1] 1000 2
assures us that we have our 1,000 estimates.
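As a quick sanity check: for this setup the theoretical OLS slope is Cov(X1, X2)/Var(X2) = 0.4/1 = 0.4 and the intercept is 1 - 0.4 * 1 = 0.6, so the average of the 1,000 estimates should land close to those values.
colMeans(m2)      # should be approximately c(0.6, 0.4)
apply(m2, 2, sd)  # spread of the estimates across the 1,000 samples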
I have a data set from an incomplete lattice design study that I imported into R from Excel, and I would like to run a PBIB.test. However, after running the function as shown below, the output says object 'Area' not found, even after repeated attempts.
library("agricolae", lib.loc = "~/R/win-library/3.3")
Rdata2 <- PBIB.test("BlockNo", "AccNo", "Rep", Area, k = 9, c("REML"), console = TRUE)
Error in data.frame(v1 = 1, y) : object 'Area' not found
What is the problem?
See below for a sample application of PBIB.test, based on the agricolae tutorial.
First, create some sample data.
# Construct the alpha design with 30 treatments, 2 repetitions, and block size = 3
Genotype <- c(paste("gen0", 1:9, sep= ""), paste("gen", 10:30, sep= ""));
r <- 2;
k <- 3;
s <- 10;
b <- s * r;
book <- design.alpha(Genotype, k, r, seed = 5);
# Source dataframe
df <- book$book;
Create a vector of response values.
# Response variable
response <- c(
5,2,7,6,4,9,7,6,7,9,6,2,1,1,3,2,4,6,7,9,8,7,6,4,3,2,2,1,1,2,
1,1,2,4,5,6,7,8,6,5,4,3,1,1,2,5,4,2,7,6,6,5,6,4,5,7,6,5,5,4);
Run PBIB.test
model <- with(df, PBIB.test(block, Genotype, replication, response, k = 3, method="REML"))
head(model);
#$ANOVA
#Analysis of Variance Table
#
#Response: yield
# Df Sum Sq Mean Sq F value Pr(>F)
#Genotype 29 72.006 2.4830 1.2396 0.3668
#Residuals 11 22.034 2.0031
#
#$method
#[1] "Residual (restricted) maximum likelihood"
#
#$parameters
# test name.t treatments blockSize blocks r alpha
# PBIB-lsd Genotype 30 3 10 2 0.05
#
#$statistics
# Efficiency Mean CV
# 0.6170213 4.533333 31.22004
#
#$model
#Linear mixed-effects model fit by REML
# Data: NULL
# Log-restricted-likelihood: -73.82968
# Fixed: y ~ trt.adj
# (Intercept) trt.adjgen02 trt.adjgen03 trt.adjgen04 trt.adjgen05 trt.adjgen06
# 6.5047533 -3.6252940 -0.7701618 -2.5264354 -3.1633495 -1.9413054
#trt.adjgen07 trt.adjgen08 trt.adjgen09 trt.adjgen10 trt.adjgen11 trt.adjgen12
# -3.0096514 -4.0648738 -3.5051139 -2.8765561 -1.7111335 -1.6308755
#trt.adjgen13 trt.adjgen14 trt.adjgen15 trt.adjgen16 trt.adjgen17 trt.adjgen18
# -2.2187974 -2.3393290 -2.0807215 -0.3122845 -3.4526453 -1.0320169
#trt.adjgen19 trt.adjgen20 trt.adjgen21 trt.adjgen22 trt.adjgen23 trt.adjgen24
# -3.1257616 0.2101325 -1.7632411 -1.9177848 -1.0500345 -2.5612960
#trt.adjgen25 trt.adjgen26 trt.adjgen27 trt.adjgen28 trt.adjgen29 trt.adjgen30
# -4.3184716 -2.3071359 1.2239927 -1.3643068 -1.4354599 -0.4726870
#
#Random effects:
# Formula: ~1 | replication
# (Intercept)
#StdDev: 8.969587e-05
#
# Formula: ~1 | block.adj %in% replication
# (Intercept) Residual
#StdDev: 1.683459 1.415308
#
#Number of Observations: 60
#Number of Groups:
# replication block.adj %in% replication
# 2 20
#
#$Fstat
# Fit Statistics
#AIC 213.65937
#BIC 259.89888
#-2 Res Log Likelihood -73.82968
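As for the error in the question: PBIB.test() works on the actual data vectors, so the bare name Area is looked up in the workspace, where it does not exist (and the quoted "BlockNo", "AccNo", "Rep" are most likely not doing what you intend either, since the function has no data argument). A likely fix, sketched under the assumption that your imported data frame is called mydata, is to pass the columns explicitly, mirroring the with() call above:
# hypothetical: 'mydata' stands for whatever your imported data frame is called
Rdata2 <- with(mydata,
               PBIB.test(BlockNo, AccNo, Rep, Area, k = 9, method = "REML", console = TRUE))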
The output of my classification predictive model is as follows:
a <- c(1,1.1,1,1,2,0.9,1.1,1,1.1,1) ## Class A
b <- c(2,2.1,1.9,1.7,2,2,3,2,2,2) ## Class B
c <- c(3,3,3.1,3.6,3.2,2,3.1,3,3,3) ## Class C
x <- data.frame(c(a,b,c))
x$color <- rep(c("red","green","blue"),times=c(10,10,10))
I am trying to find out at which position the class type changes. In this scenario, the class type changes at positions 11 and 21.
I have tried the following packages: CPM, EPC, and QCC, but they do not give me what I want.
What's the best way to detect that the class type has changed?
What about
library(strucchange)
( bp <- breakpoints( x[, 1] ~ 1) )
# Optimal 3-segment partition:
#
# Call:
# breakpoints.formula(formula = x[, 1] ~ 1)
#
# Breakpoints at observation number:
# 10 20
#
# Corresponding to breakdates:
# 0.3333333 0.6666667
bp$breakpoints
# [1] 10 20
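If the class labels themselves are available (as in x$color above), no changepoint model is needed; a base R sketch that reads the change positions straight off the label vector:
# rows where the label differs from the row before it
which(x$color[-1] != x$color[-nrow(x)]) + 1
# [1] 11 21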
Let's consider the following vectors, combined into a data frame:
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))) )
I know I could use
pairwise.t.test(x$data,
x$Group,
pool.sd=FALSE)
to get pairwise comparisons like
Pairwise comparisons using t tests with non-pooled SD
data: x$data and x$Group
ctrl x1 x2
x1 0.08522 - -
x2 0.99678 0.10469 -
x3 0.00065 0.99678 2.8e-05
P value adjustment method: holm
However, I am not interested in every possible combination of vectors. I am seeking a way to compare the ctrl vector with every other vector, while taking alpha inflation into account. I'd like to avoid
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x1']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x2']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x3']), var.equal=T)
and then performing a manual correction for multiple comparisons. What would be the best way to do so?
You can use p.adjust to correct multiple p-values for multiple comparisons (the default method is Holm; set method = "bonferroni" for a Bonferroni adjustment). You should not bundle those unequal-length vectors into a data frame; use a list instead.
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
> lapply( list(x1,x2,x3), function(x) t.test(x,ctrl)$p.value)
[[1]]
[1] 0.2464039
[[2]]
[1] 0.8576423
[[3]]
[1] 0.0144275
> p.adjust( .Last.value)
[1] 0.4928077 0.8576423 0.0432825
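If you want the adjustment to be an explicit Bonferroni correction rather than p.adjust's default Holm method, name the method; a sketch using the same p-values:
pvals <- sapply(list(x1, x2, x3), function(x) t.test(x, ctrl)$p.value)
p.adjust(pvals, method = "holm")        # the default, as used above
p.adjust(pvals, method = "bonferroni")  # classical Bonferroni correction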
@BondedDust's answer looks great. Here is a slightly more involved solution in case you really need to work with data frames.
library(dplyr)
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))), stringsAsFactors = F )
# provide the combinations you want
# set1 with all from set2
set1 = c("ctrl")
set2 = c("x1","x2","x3")
dt_res =
  data.frame(expand.grid(set1, set2)) %>%  # create combinations
  mutate(test_id = row_number()) %>%       # create a test id
  group_by(test_id) %>%                    # group by test id, so everything from now on is performed for each test separately
  do({x_temp = x[(x$Group == .$Var1 | x$Group == .$Var2), ]  # for each test id keep groups of interest
      x_temp = data.frame(x_temp)}) %>%
  do(test = t.test(data ~ Group, data = .)) # perform the test and save it
# you create a dataset that has the test id and a column with t.tests results as elements
dt_res
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# test_id test
# 1 1 <S3:htest>
# 2 2 <S3:htest>
# 3 3 <S3:htest>
# get all tests as a list
dt_res$test
# [[1]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -1.9776, df = 58.36, p-value = 0.05271
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.894829477 0.005371207
# sample estimates:
# mean in group ctrl mean in group x1
# -0.447213560 -0.002484425
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -2.3549, df = 100.68, p-value = 0.02047
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.71174095 -0.06087081
# sample estimates:
# mean in group ctrl mean in group x2
# -0.44721356 -0.06090768
#
#
# [[3]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -5.4235, df = 101.12, p-value = 4.001e-07
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -1.2171386 -0.5652189
# sample estimates:
# mean in group ctrl mean in group x3
# -0.4472136 0.4439652
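To handle the alpha-inflation part of the question, the p-values can be pulled out of the list column and adjusted in one step; a sketch, assuming dt_res from above:
pvals <- sapply(dt_res$test, function(tt) tt$p.value)
p.adjust(pvals, method = "holm")  # or method = "bonferroni"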
PS: It's always interesting to work with p-values and alpha corrections. How to approach them is a bit of a philosophical issue, and some people agree while others disagree. Personally, I tend to correct alpha based on all possible comparisons I could make after an experiment, because you never know when you'll come back to investigate other pairs. Imagine what happens if in the future people decide that you have to go back and compare the winning group (let's say x1) with x2 and x3. You'll focus on those pairs and you'll again correct alpha based on those comparisons. But on the whole you performed all possible comparisons, apart from x2 vs x3! You may write your reports or publish findings that should have been a bit stricter on the alpha correction.