This question already has an answer here:
How to write linearly dependent column in a matrix in terms of linearly independent columns?
(1 answer)
Closed 7 years ago.
I have a data frame in which I know certain columns are an exact linear combination of some of the other columns, but I don't know which columns they are.
A B C D E G
1 -8453 319 3363 -16382 8290 2683
2 2269 -5687 5810 6626 5857 1283
3 8381 5725 1099 -6145 8507 1393
4 -2248 3936 5394 -10503 1803 7910
5 9579 4210 4027 4049 5235 112
6 7351 3717 2357 -1357 5458 1890
7 -8323 -9181 7914 -2417 2252 8937
8 731 -5936 5948 -4190 7621 9184
9 -7419 5345 218 -20339 7139 654
10 -9353 4583 444 -22751 6108 3151
DT <- structure(list(A = c(-6381L, 6029L, 171L, 6451L, -8843L, -4651L,
-4142L, -9292L, -5857L, 3378L), B = c(-9170L, 6601L, -4307L,
8391L, -5360L, 3783L, 4481L, 3990L, 5308L, -8744L), C = c(7899L,
1031L, 8288L, 2034L, 2146L, 2862L, 4911L, 1808L, 4351L, 287L),
D = c(4772L, -12577L, 7358L, -10506L, -15314L, -17401L, -7939L,
-29133L, -17846L, 5631L), E = c(15L, 5708L, 5272L, 5651L,
8126L, 8805L, 20L, 9129L, 3786L, 5498L), G = c(5901L, 7328L,
136L, 4949L, 5851L, 3024L, 4207L, 8530L, 7246L, 1280L)), class = "data.frame", row.names = c(NA,
-10L), .Names = c("A", "B", "C", "D", "E", "G"))
My initial reaction was to loop through the columns of DT and fit an lm of each column on the remaining ones, searching for r.squared == 1, but I was wondering whether there are functions for this specific task.
My first guess ended up working pretty well
> output <- lm(A ~ C + D + E + G + B, data = DT)
> summary(output)
Call:
lm(formula = A ~ C + D + E + G + B, data = DT)
Residuals:
1 2 3 4 5 6 7 8
-4.80e-12 1.59e-12 3.61e-12 -2.82e-12 2.79e-12 -5.58e-12 1.49e-12 -8.34e-14
9 10
3.40e-12 4.10e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.75e-13 8.62e-12 7.00e-02 0.95
C -1.00e+00 7.90e-16 -1.27e+15 <2e-16 ***
D 1.00e+00 3.94e-16 2.54e+15 <2e-16 ***
E 1.00e+00 9.46e-16 1.06e+15 <2e-16 ***
G 1.00e+00 1.17e-15 8.51e+14 <2e-16 ***
B 1.00e+00 3.85e-16 2.60e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.99e-12 on 4 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.53e+30 on 5 and 4 DF, p-value: <2e-16
Warning message:
In summary.lm(output) : essentially perfect fit: summary may be unreliable
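An alternative to scanning lm fits, if helpful: the exact dependencies are exactly the null space of the data matrix, so an SVD finds them all at once. Below is a minimal sketch in Python/numpy with a made-up matrix (the idea carries over to R via svd() or Matrix::rankMatrix); the weights and seed are illustrative assumptions, not your data.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.integers(-10000, 10000, size=(10, 5)).astype(float)
# append a column that is an exact linear combination of the first five
X = np.column_stack([B, B @ np.array([1.0, -1.0, 1.0, 1.0, 1.0])])

# right-singular vectors for (near-)zero singular values form a basis
# of the exact dependencies, i.e. weight vectors w with X @ w == 0
_, s, vt = np.linalg.svd(X)
dependencies = vt[s < 1e-8 * s[0]]

print(np.linalg.matrix_rank(X))             # 5 -> one dependency among 6 columns
print(np.allclose(X @ dependencies[0], 0))  # True
```

Each row of `dependencies` gives the weights of one exact linear relation among the columns, which is what a linfinder-style function reports.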
I would challenge your claim (or at least what I initially thought was your claim). My first tool for investigating it was Hmisc::rcorr, which calculates all of the pairwise correlation coefficients. If one column of any pair were a linear function of the other, the correlation coefficient would be 1.0.
> rcorr(data.matrix(DT))
A B C D E G
A 1.00 0.22 -0.28 0.40 -0.05 -0.35
B 0.22 1.00 -0.32 -0.67 0.18 0.44
C -0.28 -0.32 1.00 0.49 -0.58 -0.27
D 0.40 -0.67 0.49 1.00 -0.55 -0.72
E -0.05 0.18 -0.58 -0.55 1.00 0.07
G -0.35 0.44 -0.27 -0.72 0.07 1.00
As it turns out, it takes all 6 columns to produce the linear dependence, since removing any one column leaves the sub-matrix at full rank:
library(Matrix)  # for rankMatrix
sapply(1:6, function(i) rankMatrix(as.matrix(DT[-i])) )
[1] 5 5 5 5 5 5
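The same "drop each column and check the rank" loop can be sketched outside R as well; here is a Python/numpy version on a hypothetical matrix built the same way (one column equal to a ±1 combination of all five others), which reproduces the all-5 pattern above:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.integers(-10000, 10000, size=(10, 5)).astype(float)
# column 0 = B1 + B2 - B3 + B4 + B5, so it depends on ALL five others
X = np.column_stack([B @ np.array([1.0, 1.0, -1.0, 1.0, 1.0]), B])

print(np.linalg.matrix_rank(X))  # 5: one dependency among the 6 columns
# dropping any single column still leaves a full-rank 10x5 sub-matrix,
# because the dependency involves all six columns at once
print([int(np.linalg.matrix_rank(np.delete(X, i, axis=1))) for i in range(6)])
```

This is why no single deletion reveals the dependency: it only disappears when the whole set of involved columns is present.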
Playing around with Roland's comment to see what the coefficients would be that give complete linear dependence:
sapply(LETTERS[1:5], function(col) round( lm(as.formula(paste0(col, " ~ .")), data = DT)$coef,4) )
A B C D E
(Intercept) 0 0 0 0 0
B 1 1 -1 1 1
C -1 1 1 -1 -1
D 1 -1 1 1 1
E 1 -1 1 -1 -1
G 1 -1 1 -1 -1
@Hugh: Be sure to cite StackOverflow in your homework assignment writeup ;-)
Here's a way of making similar matrices:
res <- replicate(5, sample((-10000):10000, 10) )
res2 <- res %*% sample(c(-1,1) , 5, repl=TRUE)
res3 <- cbind(res2, res)
And then checking a couple of them with Dason's linfinder:
> linfinder(data.matrix(res3))
[1] "Column_6 = -1*Column_1 + -1*Column_2 + -1*Column_3 + -1*Column_4 + -1*Column_5"
> res2 <- res %*% sample(c(-1,1) , 5, repl=TRUE)
> res3 <- cbind(res2, res)
> linfinder(data.matrix(res3))
[1] "Column_6 = -1*Column_1 + -0.999999999999999*Column_2 + 0.999999999999999*Column_3 + 0.999999999999999*Column_4 + 0.999999999999999*Column_5"
Related
I have done an incomplete factorial design (fractional factorial design) experiment on different fertilizer applications.
Here is the format of the data (image excerpt omitted):
I want to do an ANOVA in R using the function aov. I have 450 data points in total; 'Location' has 5 levels, N has 3, and F1, F2, F3, and F4 have two each.
Here is the code that I am using:
ANOVA1<-aov(PlantWeight~Location*N*F1*F2*F3*F4, data = data)
summary(ANOVA1)
The independent variables F1, F2, F3, and F4 are not applied in a factorial manner: each sample has either one of F1, F2, F3, F4 or nothing applied. In the cases where no F1, F2, F3, or F4 fertiliser was applied, the value 0 has been put in every column; this is the control to which each of F1, F2, F3, and F4 will be compared. If F1 has been applied, then column F1 will read 1 and the F2, F3, and F4 columns will read NA.
When I try to run this ANOVA I get this error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Another approach I had was to put an 'x' instead of 'NA'. This has issues because it assumes x is a factor level when it is not. It seemed to work fine except that it would always ignore F4.
ANOVA2<-aov(PlantWeight~((F1*Location*N)+(F2*Location*N)+(F3*Location*N)+
(F4*Location*N)), data = data)
summary(ANOVA2)
Results:
Df Sum Sq Mean Sq F value Pr(>F)
F1 2 10.3 5.13 5.742 0.00351 **
Location 6 798.6 133.11 149.027 < 2e-16 ***
N 2 579.6 289.82 324.485 < 2e-16 ***
F2 1 0.3 0.33 0.364 0.54667
F3 1 0.4 0.44 0.489 0.48466
F1:Location 10 26.5 2.65 2.962 0.00135 **
F1:N 4 6.6 1.66 1.857 0.11737
Location:N 10 113.5 11.35 12.707 < 2e-16 ***
Location:F2 5 6.5 1.30 1.461 0.20188
N:F2 2 2.7 1.37 1.537 0.21641
Location:F3 5 33.6 6.72 7.529 9.73e-07 ***
N:F3 2 2.5 1.23 1.375 0.25409
F1:Location:N 20 12.4 0.62 0.696 0.83029
F2:Location:N 10 18.9 1.89 2.113 0.02284 *
F3:Location:N 10 26.8 2.68 3.001 0.00118 **
Residuals 359 320.6 0.89
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Any help on how to approach this would be wonderful!
I have a data set like this
iu      sample  obs
1.5625  s       0.312
1.5625  s       0.302
3.125   s       0.335
3.125   s       0.333
6.25    s       0.423
6.25    s       0.391
12.5    s       0.562
12.5    s       0.56
25      s       0.84
25      s       0.843
50      s       1.202
50      s       1.185
100     s       1.408
100     s       1.338
200     s       1.42
200     s       1.37
1.5625  t       0.317
1.5625  t       0.313
3.125   t       0.345
3.125   t       0.343
6.25    t       0.413
6.25    t       0.404
12.5    t       0.577
12.5    t       0.557
25      t       0.863
25      t       0.862
50      t       1.22
50      t       1.197
100     t       1.395
100     t       1.364
200     t       1.425
200     t       1.415
I want to use R to recreate the SAS code below. I believe this SAS code performs a nonlinear fit for each subset, where three parameters are shared and one parameter differs between samples.
proc nlin data=assay;
model obs=D+(A-D)/(1+(iu/((Cs*(sample="S")+Ct*(sample="T"))))**(B));
parms D=1 B=1 Cs=1 Ct=1 A=1;
run;
So I wrote something like this, and got:
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c[sample]) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25, 25), d = 1.4))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
But without [sample], the model can be fitted:
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25), d = 1.4))
summary(nlm_1)
Formula: obs ~ (a - d)/(1 + (iu/c)^b) + d
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.31590 0.00824 38.34 <2e-16 ***
b 1.83368 0.06962 26.34 <2e-16 ***
c 25.58422 0.55494 46.10 <2e-16 ***
d 1.44777 0.01171 123.63 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02049 on 28 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.721e-06
I don't get it. Could someone tell me what's wrong with my code, and how I can achieve my goal in R? Thanks!
Thanks to @akrun. After converting csf_1$sample to a factor, I finally got what I wanted.
csf_1[, 2] <- as.factor(c(rep("s", 16), rep("t", 16)))
nlm_1 <- nls(obs ~ (a - d) / (1 + (iu / c[sample]) ^ b) + d, data = csf_1, start = list(a = 0.3, b = 1.8, c = c(25, 25), d = 1.4))
summary(nlm_1)
Formula: obs ~ (a - d)/(1 + (iu/c[sample])^b) + d
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.315874 0.008102 38.99 <2e-16 ***
b 1.833303 0.068432 26.79 <2e-16 ***
c1 26.075317 0.656779 39.70 <2e-16 ***
c2 25.114050 0.632787 39.69 <2e-16 ***
d 1.447901 0.011518 125.71 <2e-16 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02015 on 27 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 6.225e-06
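For a cross-check outside R, the same four-parameter logistic with a per-sample C (the Cs/Ct of the SAS code) can be fitted with scipy; this is a sketch assuming scipy is available, using the data from the question:

```python
import numpy as np
from scipy.optimize import curve_fit

iu_half = [1.5625, 1.5625, 3.125, 3.125, 6.25, 6.25, 12.5, 12.5,
           25, 25, 50, 50, 100, 100, 200, 200]
obs = np.array([0.312, 0.302, 0.335, 0.333, 0.423, 0.391, 0.562, 0.56,
                0.84, 0.843, 1.202, 1.185, 1.408, 1.338, 1.42, 1.37,     # sample "s"
                0.317, 0.313, 0.345, 0.343, 0.413, 0.404, 0.577, 0.557,
                0.863, 0.862, 1.22, 1.197, 1.395, 1.364, 1.425, 1.415])  # sample "t"
iu = np.array(iu_half * 2)
is_t = np.repeat([0.0, 1.0], 16)  # indicator for sample "t"

def model(X, a, b, c_s, c_t, d):
    iu, is_t = X
    c = np.where(is_t == 1, c_t, c_s)   # per-sample C, like Cs/Ct in SAS, c[sample] in nls
    return d + (a - d) / (1 + (iu / c) ** b)

popt, _ = curve_fit(model, (iu, is_t), obs, p0=[0.3, 1.8, 25, 25, 1.4])
# popt should land near nls's estimates: a~0.316, b~1.83, c_s~26.1, c_t~25.1, d~1.45
```

Both fits minimize the same sum of squares, so the estimates should agree to several digits.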
I am trying to reproduce a result from R in Stata (Please note that the data below is fictitious and serves just as an example). For some reason however, Stata appears to deal with certain issues differently than R. It chooses different dummy variables to kick out in case of multicollinearity.
I have posted a related question dealing with the statistical implications of these country-year dummies being removed here.
In the example below, R kicks out 3 dummies, while Stata kicks out 2, leading to different results. Check for example the coefficients and p-values for vote and votewon.
In essence, all I want to know is how to tell either R or Stata which variables to drop, so that they both do the same.
Data
The data looks as follows:
library(data.table)
library(dplyr)
library(foreign)
library(censReg)
library(wooldridge)
data('mroz')
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
DT <- data.table( country = rep(sample(country, length(mroz), replace = T), each = n),
year = c(replicate(length(mroz), sample(year, n))))
x <- DT
DT <- rbind(DT, DT); DT <- rbind(DT, DT); DT <- rbind(DT, DT) ; DT <- rbind(DT, DT); DT <- rbind(DT, x)
mroz <- mroz[-c(749:753),]
DT <- cbind(mroz, DT)
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
rm(mroz,x, country, year)
The lm regression in R
summary(lm(educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year), data=DT))
Call:
lm(formula = educ ~ exper + I(exper^2) + vote + votewon + country:as.factor(year),
data = DT)
Residuals:
Min 1Q Median 3Q Max
-7.450 -0.805 -0.268 0.954 5.332
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.170064 0.418578 26.69 < 0.0000000000000002 ***
exper 0.103880 0.029912 3.47 0.00055 ***
I(exper^2) -0.002965 0.000966 -3.07 0.00222 **
vote 0.576865 0.504540 1.14 0.25327
votewon 0.622522 0.636241 0.98 0.32818
countryA:as.factor(year)2005 -0.196348 0.503245 -0.39 0.69653
countryB:as.factor(year)2005 -0.530681 0.616653 -0.86 0.38975
countryC:as.factor(year)2005 0.650166 0.552019 1.18 0.23926
countryD:as.factor(year)2005 -0.515195 0.638060 -0.81 0.41968
countryE:as.factor(year)2005 0.731681 0.502807 1.46 0.14605
countryG:as.factor(year)2005 0.213345 0.674642 0.32 0.75192
countryH:as.factor(year)2005 -0.811374 0.637254 -1.27 0.20334
countryI:as.factor(year)2005 0.584787 0.503606 1.16 0.24594
countryJ:as.factor(year)2005 0.554397 0.674789 0.82 0.41158
countryA:as.factor(year)2010 0.388603 0.503358 0.77 0.44035
countryB:as.factor(year)2010 -0.727834 0.617210 -1.18 0.23869
countryC:as.factor(year)2010 -0.308601 0.504041 -0.61 0.54056
countryD:as.factor(year)2010 0.785603 0.503165 1.56 0.11888
countryE:as.factor(year)2010 0.280305 0.452293 0.62 0.53562
countryG:as.factor(year)2010 0.672074 0.674721 1.00 0.31954
countryH:as.factor(year)2010 NA NA NA NA
countryI:as.factor(year)2010 NA NA NA NA
countryJ:as.factor(year)2010 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.3 on 728 degrees of freedom
Multiple R-squared: 0.037, Adjusted R-squared: 0.0119
F-statistic: 1.47 on 19 and 728 DF, p-value: 0.0882
Same regression in Stata
write.dta(DT, "C:/Users/.../mroz_adapted.dta")
encode country, gen(n_country)
reg educ c.exper c.exper#c.exper vote votewon n_country#i.year
note: 9.n_country#2010.year omitted because of collinearity
note: 10.n_country#2010.year omitted because of collinearity
Source | SS df MS Number of obs = 748
-------------+---------------------------------- F(21, 726) = 1.80
Model | 192.989406 21 9.18997171 Prob > F = 0.0154
Residual | 3705.47583 726 5.1039612 R-squared = 0.0495
-------------+---------------------------------- Adj R-squared = 0.0220
Total | 3898.46524 747 5.21882897 Root MSE = 2.2592
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .1109858 .0297829 3.73 0.000 .052515 .1694567
|
c.exper#c.exper | -.0031891 .000963 -3.31 0.001 -.0050796 -.0012986
|
vote | .0697273 .4477115 0.16 0.876 -.8092365 .9486911
votewon | -.0147825 .6329659 -0.02 0.981 -1.257445 1.227879
|
n_country#year |
A#2010 | .0858634 .4475956 0.19 0.848 -.7928728 .9645997
B#2005 | -.4950677 .5003744 -0.99 0.323 -1.477421 .4872858
B#2010 | .0951657 .5010335 0.19 0.849 -.8884818 1.078813
C#2005 | -.5162827 .447755 -1.15 0.249 -1.395332 .3627664
C#2010 | -.0151834 .4478624 -0.03 0.973 -.8944434 .8640767
D#2005 | .3664596 .5008503 0.73 0.465 -.6168283 1.349747
D#2010 | .5119858 .500727 1.02 0.307 -.4710599 1.495031
E#2005 | .5837942 .6717616 0.87 0.385 -.7350329 1.902621
E#2010 | .185601 .5010855 0.37 0.711 -.7981486 1.169351
F#2005 | .5987978 .6333009 0.95 0.345 -.6445219 1.842117
F#2010 | .4853639 .7763936 0.63 0.532 -1.038881 2.009608
G#2005 | -.3341302 .6328998 -0.53 0.598 -1.576663 .9084021
G#2010 | .2873193 .6334566 0.45 0.650 -.956306 1.530945
H#2005 | -.4365233 .4195984 -1.04 0.299 -1.260294 .3872479
H#2010 | -.1683725 .6134262 -0.27 0.784 -1.372673 1.035928
I#2005 | -.39264 .7755549 -0.51 0.613 -1.915238 1.129958
I#2010 | 0 (omitted)
J#2005 | 1.036108 .4476018 2.31 0.021 .1573591 1.914856
J#2010 | 0 (omitted)
|
_cons | 11.58369 .350721 33.03 0.000 10.89514 12.27224
---------------------------------------------------------------------------------
Just to address your question about which "variables to kick out": I guess you meant which combination of interaction terms is used as the reference group for calculating the regression coefficients.
By default, Stata uses the combination of the lowest values of the two variables as the reference, while R uses the highest values. I use the Stata auto data to demonstrate this:
# In R
webuse::webuse("auto")
auto$foreign = as.factor(auto$foreign)
auto$rep78 = as.factor(auto$rep78)
# Model
r_model <- lm(mpg ~ rep78:foreign, data=auto)
broom::tidy(r_model)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.3 1.65 15.9 2.09e-23
2 rep781:foreign0 -5.33 3.88 -1.38 1.74e- 1
3 rep782:foreign0 -7.21 2.41 -2.99 4.01e- 3
4 rep783:foreign0 -7.33 1.91 -3.84 2.94e- 4
5 rep784:foreign0 -7.89 2.34 -3.37 1.29e- 3
6 rep785:foreign0 5.67 3.88 1.46 1.49e- 1
7 rep781:foreign1 NA NA NA NA
8 rep782:foreign1 NA NA NA NA
9 rep783:foreign1 -3.00 3.31 -0.907 3.68e- 1
10 rep784:foreign1 -1.44 2.34 -0.618 5.39e- 1
11 rep785:foreign1 NA NA NA NA
In Stata:
. reg mpg i.foreign#i.rep78
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | -1.875 3.921166 -0.48 0.634 -9.715855 5.965855
Domestic#3 | -2 3.634773 -0.55 0.584 -9.268178 5.268178
Domestic#4 | -2.555556 3.877352 -0.66 0.512 -10.3088 5.19769
Domestic#5 | 11 4.959926 2.22 0.030 1.082015 20.91798
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 2.333333 4.527772 0.52 0.608 -6.720507 11.38717
Foreign#4 | 3.888889 3.877352 1.00 0.320 -3.864357 11.64213
Foreign#5 | 5.333333 3.877352 1.38 0.174 -2.419912 13.08658
|
_cons | 21 3.507197 5.99 0.000 13.98693 28.01307
-------------------------------------------------------------------------------
To reproduce the previous R results in Stata, we can recode those two variables, foreign and rep78:
. reg mpg i.foreign2#i.rep2
note: 0b.foreign2#1.rep2 identifies no observations in the sample
note: 0b.foreign2#2.rep2 identifies no observations in the sample
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 4.88
Model | 839.550121 7 119.935732 Prob > F = 0.0002
Residual | 1500.65278 61 24.6008652 R-squared = 0.3588
-------------+---------------------------------- Adj R-squared = 0.2852
Total | 2340.2029 68 34.4147485 Root MSE = 4.9599
-------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign2#rep2 |
0 1 | 0 (empty)
0 2 | 0 (empty)
0 3 | -3 3.306617 -0.91 0.368 -9.61199 3.61199
0 4 | -1.444444 2.338132 -0.62 0.539 -6.119827 3.230938
1 0 | 5.666667 3.877352 1.46 0.149 -2.086579 13.41991
1 1 | -5.333333 3.877352 -1.38 0.174 -13.08658 2.419912
1 2 | -7.208333 2.410091 -2.99 0.004 -12.02761 -2.389059
1 3 | -7.333333 1.909076 -3.84 0.000 -11.15077 -3.515899
1 4 | -7.888889 2.338132 -3.37 0.001 -12.56427 -3.213506
|
_cons | 26.33333 1.653309 15.93 0.000 23.02734 29.63933
-------------------------------------------------------------------------------
The same approach applies to reproducing Stata results in R: just redefine the levels of those two factor variables.
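The reason both choices are legitimate is that the dropped dummies are arbitrary: any full-column-rank subset of the design spans the same space, so the coefficients differ but the fit does not. A small numerical sketch (Python/numpy, made-up data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
g = rng.integers(0, 3, n)                        # a 3-level factor
dummies = np.eye(3)[g]                           # all 3 dummy columns
X_full = np.column_stack([np.ones(n), dummies])  # rank 3, not 4: intercept + all dummies
y = g + rng.normal(size=n)

def fitted_after_dropping(j):
    X = np.delete(X_full, j, axis=1)             # drop one column, as R/Stata do internally
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# dropping different columns changes the coefficients but not the fitted values
print(np.allclose(fitted_after_dropping(1), fitted_after_dropping(3)))  # True
```

So R and Stata are estimating the same model; only the labeling of the reference group differs.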
I have the following data from an experiment
> spears
treatment length
1 Control 94.7
2 Control 96.1
3 Control 86.5
4 Control 98.5
5 Control 94.9
6 IAA 89.9
7 IAA 94.0
8 IAA 99.1
9 IAA 92.8
10 IAA 99.4
11 ABA 96.8
12 ABA 87.8
13 ABA 89.1
14 ABA 91.1
15 ABA 89.4
16 GA3 99.1
17 GA3 95.3
18 GA3 94.6
19 GA3 93.1
20 GA3 95.7
21 CPPU 104.4
22 CPPU 98.9
23 CPPU 98.9
24 CPPU 106.5
25 CPPU 104.8
And I want to compare all the treatments against the "Control" treatment using the following code:
mod0 <-aov( length ~ treatment, data = spears)
summary(mod0)
library(multcomp)
spears_dun <- glht(mod0,linfct = mcp(treatment = "Dunnett"), alternative = "greater")
summary(spears_dun)
However, it is taking the first treatment in alphabetical order (ABA) as the control instead of the "Control" treatment.
The results are as follows:
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Dunnett Contrasts
Fit: aov(formula = length ~ treatment, data = spears)
Linear Hypotheses:
Estimate Std. Error t value Pr(>t)
Control - ABA <= 0 3.300 2.325 1.419 0.2240
CPPU - ABA <= 0 11.860 2.325 5.101 <0.001 ***
GA3 - ABA <= 0 4.720 2.325 2.030 0.0833 .
IAA - ABA <= 0 4.200 2.325 1.806 0.1230
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
How can I make the comparison against "Control"?
Thanks.
You can force the treatment contrast order when you make the column a factor: set the levels in any order you like, and the first level will be the base.
spears$treatment <- factor(spears$treatment,
levels = c("Control", "ABA", "CPPU", "GA3", "IAA"))
contrasts(spears$treatment)
#> ABA CPPU GA3 IAA
#> Control 0 0 0 0
#> ABA 1 0 0 0
#> CPPU 0 1 0 0
#> GA3 0 0 1 0
#> IAA 0 0 0 1
mod0 <-aov( length ~ treatment, data = spears)
library(multcomp)
spears_dun <- glht(mod0,linfct = mcp(treatment = "Dunnett"), alternative = "greater")
summary(spears_dun)
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Dunnett Contrasts
#>
#>
#> Fit: aov(formula = length ~ treatment, data = spears)
#>
#> Linear Hypotheses:
#> Estimate Std. Error t value Pr(>t)
#> ABA - Control <= 0 -3.300 2.325 -1.419 0.99232
#> CPPU - Control <= 0 8.560 2.325 3.682 0.00272 **
#> GA3 - Control <= 0 1.420 2.325 0.611 0.55407
#> IAA - Control <= 0 0.900 2.325 0.387 0.65279
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
The treatment column must be a factor, and you have to set its reference level to "Control" with the relevel function:
library(multcomp)
dat <- data.frame(
treatment = c("Control", "Control", "ABA", "ABA", "X", "X"),
length = c(1, 2, 3, 4, 5, 6),
stringsAsFactors = TRUE
)
dat$treatment <- relevel(dat$treatment, ref = "Control")
amod <- aov(length ~ treatment, data = dat)
glht(amod, linfct = mcp(treatment = "Dunnett"))
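As a sanity check on what glht is estimating: the Dunnett estimates are just treatment-minus-control mean differences. A quick sketch (Python, reusing the spears values from the question):

```python
import numpy as np

length = {
    "Control": [94.7, 96.1, 86.5, 98.5, 94.9],
    "IAA":     [89.9, 94.0, 99.1, 92.8, 99.4],
    "ABA":     [96.8, 87.8, 89.1, 91.1, 89.4],
    "GA3":     [99.1, 95.3, 94.6, 93.1, 95.7],
    "CPPU":    [104.4, 98.9, 98.9, 106.5, 104.8],
}
control = float(np.mean(length["Control"]))
diffs = {t: round(float(np.mean(v)) - control, 3)
         for t, v in length.items() if t != "Control"}
print(diffs)
# matches glht's estimates: IAA 0.9, ABA -3.3, GA3 1.42, CPPU 8.56
```

The multiplicity adjustment only affects the p-values and confidence intervals, not these point estimates.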
Here are two dataframes, data1 and data2
data1
id A B C D E F G
1 id1 1.00 0.31 -3.20 2.50 3.1 -0.300 -0.214
2 id2 0.40 -2.30 0.24 -1.47 3.2 0.152 -0.140
3 id3 1.30 -3.20 2.00 -0.62 2.3 0.460 1.320
4 id4 -0.71 0.98 2.10 1.20 -1.5 0.870 -1.550
5 id5 2.10 -1.57 0.24 1.70 -1.2 -1.300 1.980
> data2
factor constant
1 A -0.321
2 B 1.732
3 C 1.230
4 D 3.200
5 E -0.980
6 F -1.400
7 G -0.300
Actually, data1 is a large data set with ids up to 1000 and factors up to Z.
data2 also has the factors from A to Z and a corresponding constant for each.
I want to multiply the value of each factor in data1 by the constant in data2 corresponding to that factor, for all factors, and then put the row-wise total of these products into a new variable 'total' in data1.
For example, the 'total' for 'id1' = (A value 1.00 x A constant -0.321) + (B 0.31 x 1.732) + (C -3.20 x 1.230) + (D 2.50 x 3.200) + (E 3.1 x -0.980) + (F -0.300 x -1.400) + (G -0.214 x -0.300).
If you have ordered the column names in data1 and the rows in data2 in the same order, you can do:
t(t(data1[-1]) * data2$constant)
# A B C D E F G
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940
Or if you need the totals:
res = t(t(data1[-1]) * data2$constant)
res = cbind(res, total = rowSums(res))
res
# A B C D E F G total
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642 1.72612
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420 -11.82760
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960 -8.77770
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650 9.06527
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940 4.74386
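For what it's worth, the whole operation is a single matrix–vector product; here is the same computation sketched in Python/numpy (the R code above is the idiomatic way in R):

```python
import numpy as np

values = np.array([          # data1 without the id column, rows id1..id5
    [ 1.00,  0.31, -3.20,  2.50,  3.1, -0.300, -0.214],
    [ 0.40, -2.30,  0.24, -1.47,  3.2,  0.152, -0.140],
    [ 1.30, -3.20,  2.00, -0.62,  2.3,  0.460,  1.320],
    [-0.71,  0.98,  2.10,  1.20, -1.5,  0.870, -1.550],
    [ 2.10, -1.57,  0.24,  1.70, -1.2, -1.300,  1.980],
])
constants = np.array([-0.321, 1.732, 1.230, 3.200, -0.980, -1.400, -0.300])

per_factor = values * constants   # broadcasts like t(t(data1[-1]) * data2$constant)
total = per_factor.sum(axis=1)    # equivalently: values @ constants
# total matches the R answer: 1.72612, -11.8276, -8.7777, 9.06527, 4.74386
```

Broadcasting multiplies each row of `values` elementwise by `constants`, so no explicit transpose is needed.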