Boxplot of a matrix by factor - r

I want to create a boxplot from a matrix with several variables grouped by two levels of a factor.
Some sample data:
mymatrix = structure(list(Treatment = structure(c(1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 2L), .Label = c("con", "treat"), class = "factor"),
c1 = c(13L, 93L, 6L, 3L, 45L, 1L, 69L, 38L, 23L, 48L, 82L
), c5 = c(33L, 79L, 3L, 5L, 17L, 22L, 94L, 99L, 85L, 74L,
9L), c3 = c(96L, 52L, 0L, 6L, 60L, 14L, 69L, 96L, 57L, 99L,
39L), c8 = c(40L, 27L, 94L, 68L, 76L, 73L, 88L, 45L, 67L,
95L, 85L), c12 = c(20L, 14L, 53L, 9L, 93L, 1L, 12L, 45L,
59L, 38L, 25L)), .Names = c("Treatment", "c1", "c5", "c3",
"c8", "c12"), class = "data.frame", row.names = c("1a", "1b",
"2a", "2b", "3a", "3b", "4a", "4b", "5a", "5b", "5c"))
I was able to get a boxplot for each variable, but I cannot manage to group them at the same time:
boxplot(as.matrix(mymatrix[,2:6]))
boxplot(as.matrix(mymatrix[,2:6])~Treatment, data=mymatrix)
Thanks in advance for any help.

v <- stack(mymatrix[-1])
v$Treatment <- mymatrix$Treatment
boxplot(values~Treatment+ind, v)
The first part will give us a data.frame like this:
values ind
1 13 c1
2 93 c1
...
11 82 c1
12 33 c5
...
22 9 c5
23 96 c3
...
55 25 c12
Then we append the Treatment column, and just plot as usual.
Update: using melt(), as suggested by Drew.
library(reshape2)  # the older reshape package also provides melt()
v <- melt(mymatrix, id.vars="Treatment")
boxplot(value~Treatment+variable, v)

Personally I like to use the ggplot2/reshape2 approach - it is maybe a little tougher to learn at first, but once you get good at it, I think it makes things much easier.
Note that your 'matrix' is not actually a matrix, it is a data frame. This is convenient, because the approach I suggest only works with data frames.
str(mymatrix)
'data.frame': 11 obs. of 6 variables:
...
First, 'reshape' it to 'long' format, where each row represents a single observation:
library(reshape2)
dfm <- melt(mymatrix, id.vars="Treatment")
(My convention is to append the letter m to the name of any melted data frame.)
Next, make the plot using ggplot2. I've mapped the Treatment column to the x axis, and the c1-c12 columns (named variable after reshaping) to the fill color, but the syntax of ggplot2 allows you to easily change that up:
library(ggplot2)
ggplot(dfm, aes(x=Treatment, y=value, fill=variable)) +
  geom_boxplot()

Related

How to calculate column by column power law regressions [duplicate]

This question already has answers here:
Fitting a linear model with multiple LHS
(1 answer)
Fast pairwise simple linear regression between variables in a data frame
(1 answer)
Closed 6 months ago.
Using the following code
model <- lm(log(y)~log(x))
I am able to get the coefficients for a power law fit of the form y = ax^b: the coefficient of log(x) in the model is b, and e^intercept is a.
For V1 and V2 I get: Intercept=0.4272 log(x)=0.6009
Then: y = (e^0.4272)x^0.6009 = 1.5330x^0.6009
For the data:
data
structure(list(V1 = c(900L, 450L, 225L, 113L, 56L, 28L, 14L),
V2 = c(3L, 3L, 3L, 3L, 2L, 2L, 2L), V3 = c(27L, 30L, 17L,
14L, 9L, 7L, 5L), V4 = c(15L, 11L, 8L, 6L, 4L, 3L, 2L), V5 = c(50L,
38L, 23L, 14L, 8L, 5L, 4L), V6 = c(75L, 38L, 38L, 23L, 19L,
7L, 5L), V7 = c(82L, 50L, 45L, 38L, 19L, 9L, 7L), V8 = c(60L,
50L, 23L, 14L, 11L, 7L, 5L), V9 = c(129L, 64L, 56L, 38L,
19L, 28L, 14L), V10 = c(180L, 150L, 75L, 56L, 56L, 28L, 14L
), V11 = c(900L, 450L, 225L, 113L, 56L, 28L, 14L)), row.names = c(NA,
7L), class = "data.frame")
I would like the program to produce a data frame with the a and b values, always taking V1 as x and each of V2 to V11 in turn as y.
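A minimal sketch of one way to do this, assuming the log-log approach described above and a matrix response in lm() (which fits one regression per column). The object names dat, fits and power_fits are illustrative, and only a few of the question's columns are reproduced to keep the example short.

```r
# Subset of the question's data: V1 is x, the remaining columns are y.
dat <- data.frame(
  V1  = c(900, 450, 225, 113, 56, 28, 14),
  V3  = c(27, 30, 17, 14, 9, 7, 5),
  V4  = c(15, 11, 8, 6, 4, 3, 2),
  V11 = c(900, 450, 225, 113, 56, 28, 14)
)

# A matrix on the left-hand side makes lm() fit one model per column.
y <- log(as.matrix(dat[-1]))
fits <- lm(y ~ log(V1), data = dat)

# coef() is a 2 x k matrix: row 1 holds intercepts, row 2 holds slopes.
co <- coef(fits)
power_fits <- data.frame(
  variable = colnames(co),
  a = exp(co[1, ]),  # a = e^intercept
  b = co[2, ],       # b = slope of log(y) on log(x)
  row.names = NULL
)
print(power_fits)
```

Since V11 is a copy of V1, its row should come out as a = 1 and b = 1, which gives a quick sanity check on the back-transformation.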

Creating new column based on values in one column over another

I have some data
structure(list(Factor = c(0L, 1L, 0L, 1L, 1L, 0L, 1L), Col_A = c(45L,
23L, 35L, 43L, 42L, 23L, 11L), Col_B = c(85L, 67L, 55L, 40L,
27L, 85L, 12L), New_Column = c(45L, 67L, 35L, 40L, 27L, 23L,
12L)), class = "data.frame", row.names = c(NA, -7L))
Pretend that the 4th column is not there. I need to write a script that, based on the value in the Factor column, takes a value from either Col_A or Col_B and puts it in New_Column. If the value in Factor is 0 it should take the value in Col_A (so New_Column in the first row is 45); if it is 1 it should take the value in Col_B.
A base solution:
df$New_Column <- ifelse(df$Factor == 0, df$Col_A, df$Col_B)
We can also use row/column (matrix) indexing to get the value (here the data frame is called df1):
df1$New_Column <- df1[2:3][cbind(seq_len(nrow(df1)), df1$Factor + 1)]
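As a quick check, the two approaches above can be run side by side on the question's data. This is only a sketch: the names via_ifelse, via_index and idx are illustrative, and the data frame is rebuilt here without the fourth column.

```r
# The question's data, without the New_Column that we want to create.
df <- data.frame(
  Factor = c(0L, 1L, 0L, 1L, 1L, 0L, 1L),
  Col_A  = c(45L, 23L, 35L, 43L, 42L, 23L, 11L),
  Col_B  = c(85L, 67L, 55L, 40L, 27L, 85L, 12L)
)

# Approach 1: vectorised ifelse().
via_ifelse <- ifelse(df$Factor == 0, df$Col_A, df$Col_B)

# Approach 2: a two-column index matrix of (row, column) pairs.
# Factor 0 selects column 1 (Col_A), Factor 1 selects column 2 (Col_B).
idx <- cbind(seq_len(nrow(df)), df$Factor + 1)
via_index <- as.matrix(df[c("Col_A", "Col_B")])[idx]

# Both approaches should agree.
stopifnot(identical(via_ifelse, via_index))

df$New_Column <- via_ifelse
print(df)
```

The ifelse() version is probably easier to read; the index-matrix version generalises more naturally when there are more than two source columns.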

Multiple imputation separated by group

In my data example
data=structure(list(groupvar = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 1L), v1 = c(27L, 52L, 92L, 86L, NA, 19L, 94L, NA, 26L, 94L,
NA, 58L, 96L, 74L, 8L, 66L, 65L, 41L, 70L, 21L, 64L, 40L, 17L,
7L, NA, 14L, 63L), v2 = c(59L, 91L, 45L, 40L, 56L, 17L, 72L,
78L, 19L, 62L, 87L, NA, 79L, 62L, 40L, 67L, 93L, 1L, 64L, 22L,
NA, 98L, 44L, 85L, 67L, 88L, 92L), v3 = c(97L, 15L, 27L, 55L,
86L, 66L, NA, 61L, 27L, 47L, 93L, 68L, 72L, 4L, 35L, 69L, 65L,
NA, 83L, 60L, 42L, NA, 90L, 81L, NA, 27L, 60L)), .Names = c("groupvar",
"v1", "v2", "v3"), class = "data.frame", row.names = c(NA, -27L
))
groupvar identifies two groups (1 and 2). I have many variables, but only three are shown here.
There are many missing values in these variables.
How can I perform multiple imputation for each variable (the variables may be numeric, integer, and so on), but separately for each group, using MICE?
Edit
A simple imp <- mice(data) does not give the output I need, because I need the imputation done by group.
I want the result to look like this:
groupvar v1 v2 v3
1 27 59 97
1 52 91 15
1 92 45 27
1 86 40 55
1 *64* 56 86
2 7 85 81
2 58 *61,8* 68
2 64 *61,8* 42
Values marked with asterisks are examples of imputed values.
Convert 'groupvar' to a factor:
data <- structure(list(groupvar = as.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 1L)),
v1 = c(27L, 52L, 92L, 86L, NA, 19L, 94L, NA, 26L, 94L,
NA, 58L, 96L, 74L, 8L, 66L, 65L, 41L, 70L, 21L, 64L, 40L, 17L,
7L, NA, 14L, 63L),
v2 = c(59L, 91L, 45L, 40L, 56L, 17L, 72L,
78L, 19L, 62L, 87L, NA, 79L, 62L, 40L, 67L, 93L, 1L, 64L, 22L,
NA, 98L, 44L, 85L, 67L, 88L, 92L),
v3 = c(97L, 15L, 27L, 55L,
86L, 66L, NA, 61L, 27L, 47L, 93L, 68L, 72L, 4L, 35L, 69L, 65L,
NA, 83L, 60L, 42L, NA, 90L, 81L, NA, 27L, 60L)),
.Names = c("groupvar",
"v1", "v2", "v3"), class = "data.frame", row.names = c(NA, -27L
))
Then run mice (assuming the mice package is installed):
library(mice)
imp <- mice(data)
complete(imp)
groupvar v1 v2 v3
1 1 27 59 97
2 1 52 91 15
3 1 92 45 27
4 1 86 40 55
5 1 21 56 86
6 1 19 17 66
7 1 94 72 4
8 1 66 78 61
9 1 26 19 27
10 2 94 62 47
11 2 8 87 93
12 2 58 72 68
13 2 96 79 72
14 2 74 62 4
15 2 8 40 35
16 2 66 67 69
17 2 65 93 65
18 2 41 1 47
19 2 70 64 83
20 2 21 22 60
21 2 64 62 42
22 1 40 98 27
23 1 17 44 90
24 2 7 85 81
25 1 63 67 55
26 2 14 88 27
27 1 63 92 60
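If "by group" is read as fitting a separate imputation model within each group, one possible sketch is to run mice() on each group's rows and write the completed values back. This is an assumption about what the question wants, not part of the answer above; the names vars, rows and result, and the seed value, are illustrative. Only the first of the m completed data sets is extracted here, and with so few rows per group the imputations will be noisy.

```r
library(mice)

# The question's data, rebuilt as a plain data frame.
data <- data.frame(
  groupvar = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 1, 1, 2, 1, 2, 1),
  v1 = c(27, 52, 92, 86, NA, 19, 94, NA, 26, 94, NA, 58, 96, 74, 8,
         66, 65, 41, 70, 21, 64, 40, 17, 7, NA, 14, 63),
  v2 = c(59, 91, 45, 40, 56, 17, 72, 78, 19, 62, 87, NA, 79, 62, 40,
         67, 93, 1, 64, 22, NA, 98, 44, 85, 67, 88, 92),
  v3 = c(97, 15, 27, 55, 86, 66, NA, 61, 27, 47, 93, 68, 72, 4, 35,
         69, 65, NA, 83, 60, 42, NA, 90, 81, NA, 27, 60)
)

vars <- c("v1", "v2", "v3")  # columns to impute
result <- data

for (g in unique(data$groupvar)) {
  rows <- data$groupvar == g
  # Impute using only this group's rows; groupvar is left out because
  # it is constant within a group.
  imp <- mice(data[rows, vars], m = 5, printFlag = FALSE, seed = 123)
  # Take the first completed data set and write it back in place.
  result[rows, vars] <- complete(imp, 1)
}

print(result)
```

For a proper multiple-imputation analysis you would keep all m completed data sets per group and pool the results rather than using a single one.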

Wrong standard deviations for predictions in predict.lm in R? [duplicate]

This question already has answers here:
How does predict.lm() compute confidence interval and prediction interval?
(2 answers)
Closed 4 years ago.
With the following set up, why does one get the same standard deviations in both cases, namely: 1.396411?
Regression:
CopierDataRegression <- lm(V1~V2, data=CopierData1)
Intervals:
X6 <- data.frame(V2=6)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="confidence", level=0.90)
predict(CopierDataRegression, X6, se.fit=TRUE, interval="prediction", level=0.90)
Both give the same result for se.fit.
One gets the correct standard deviations for the predictions with the following code:
z <- predict(CopierDataRegression, X6, se.fit=TRUE)
sqrt(z$se.fit^2 + z$residual.scale^2)
but I don't understand why this formula adds the residual standard deviation when computing the standard errors. Could someone explain this?
Data:
CopierData1 <- structure(list(V1 = c(20L, 60L, 46L, 41L, 12L, 137L, 68L, 89L,
4L, 32L, 144L, 156L, 93L, 36L, 72L, 100L, 105L, 131L, 127L, 57L,
66L, 101L, 109L, 74L, 134L, 112L, 18L, 73L, 111L, 96L, 123L,
90L, 20L, 28L, 3L, 57L, 86L, 132L, 112L, 27L, 131L, 34L, 27L,
61L, 77L), V2 = c(2L, 4L, 3L, 2L, 1L, 10L, 5L, 5L, 1L, 2L, 9L,
10L, 6L, 3L, 4L, 8L, 7L, 8L, 10L, 4L, 5L, 7L, 7L, 5L, 9L, 7L,
2L, 5L, 7L, 6L, 8L, 5L, 2L, 2L, 1L, 4L, 5L, 9L, 7L, 1L, 9L, 2L,
2L, 4L, 5L)), .Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -45L))
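When you make a prediction, you have to account for two sources of error: the error in the estimated coefficients due to sampling, and the noise term. The confidence interval only accounts for the former. See the linked answer on how predict.lm() computes the confidence interval and prediction interval.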
Further, they do not give the same result for the bounds:
> predict(CopierDataRegression, X6,
+ se.fit=TRUE, interval="confidence", level=0.90)$fit
fit lwr upr
1 89.63133 87.28387 91.9788
> predict(CopierDataRegression, X6,
+ se.fit=TRUE, interval="prediction", level=0.90)$fit
fit lwr upr
1 89.63133 74.46433 104.7983
se.fit only gives you the standard error of the predicted mean, not the standard deviation of the error term, as documented in ?predict.lm:
se.fit standard error of predicted means
residual.scale residual standard deviations
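The relationship between the two quantities can be checked numerically. The sketch below uses R's built-in cars data set rather than the question's copier data (an assumption made only to keep the example self-contained) and rebuilds the 90% prediction interval by hand from se.fit and residual.scale; the names se_pred, tcrit, manual and pint are illustrative.

```r
# Simple linear regression on a built-in data set.
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

z <- predict(fit, new, se.fit = TRUE)

# Standard error of a new observation: uncertainty in the fitted mean
# plus the residual (noise) variance.
se_pred <- sqrt(z$se.fit^2 + z$residual.scale^2)

# 90% two-sided interval uses the 95% t quantile on the residual df.
tcrit <- qt(0.95, df = z$df)
manual <- unname(c(z$fit - tcrit * se_pred, z$fit + tcrit * se_pred))

# The interval reported by predict() itself.
pint <- predict(fit, new, interval = "prediction", level = 0.90)

print(manual)
print(pint)
stopifnot(isTRUE(all.equal(manual, unname(pint[1, c("lwr", "upr")]))))
```

Replacing se_pred with z$se.fit in the same calculation reproduces the confidence interval instead, which is why se.fit is identical in both calls while the bounds differ.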

Adding legend and structuring data for ggplot

In the data included below I have three sites (AAA,BBB,CCC) and individuals within each site (7, 12, 7 respectively). For each individual I have observed values (ObsValues) and three sets of predicted values each with a standard error. I have 26 rows (i.e. 26 individuals) and 9 columns.
The data is included here through dput()
help <- structure(list(StudyArea = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("AAA", "BBB", "CCC"), class = "factor"),
Ind = structure(1:26, .Label = c("AAA_F01", "AAA_F17", "AAA_F33",
"AAA_F49", "AAA_F65", "AAA_F81", "AAA_F97", "BBB_P01", "BBB_P02",
"BBB_P03", "BBB_P04", "BBB_P05", "BBB_P06", "BBB_P07", "BBB_P08",
"BBB_P09", "BBB_P10", "BBB_P11", "BBB_P12", "CCC_F02", "CCC_F03",
"CCC_F04", "CCC_F05", "CCC_F06", "CCC_F07", "CCC_F08"), class = "factor"),
ObsValues = c(22L, 50L, 8L, 15L, 54L, 30L, 11L, 90L, 6L,
53L, 9L, 42L, 72L, 40L, 60L, 58L, 1L, 20L, 37L, 2L, 50L,
68L, 20L, 19L, 58L, 5L), AAAPred = c(28L, 52L, 6L, 15L, 35L,
31L, 13L, 79L, 6L, 58L, 5L, 42L, 88L, 49L, 68L, 60L, 1L,
26L, 46L, 0L, 34L, 71L, 20L, 15L, 35L, 5L), AAAPredSE = c(3.5027829,
4.7852191, 1.231803, 2.5244013, 4.873907, 3.8854192, 2.3532752,
6.3444402, 1.7387295, 5.605111, 1.667818, 4.4709107, 7.0437967,
5.447496, 6.0840486, 5.4371275, 0.8156916, 3.5153847, 4.698754,
0, 3.8901103, 5.993616, 3.1720272, 2.6777869, 4.5647313,
1.4864128), BBBPred = c(14L, 43L, 5L, 13L, 26L, 32L, 14L,
80L, 5L, 62L, 4L, 44L, 67L, 44L, 55L, 42L, 1L, 20L, 47L,
0L, 26L, 51L, 15L, 16L, 34L, 6L), BBBPredSE = c(3.1873435,
4.8782831, 1.3739863, 2.5752273, 4.4155679, 3.8102168, 2.3419518,
6.364606, 1.7096028, 5.6333421, 1.5861323, 4.4951428, 6.6046699,
5.302902, 5.9244328, 5.1887055, 0.8268689, 3.4014041, 4.6600598,
0, 3.8510512, 5.5776686, 3.0569531, 2.6358433, 4.5273782,
1.4263518), CCCPred = c(29L, 53L, 7L, 15L, 44L, 32L, 15L,
86L, 8L, 61L, 5L, 46L, 99L, 54L, 74L, 67L, 1L, 30L, 51L,
1L, 37L, 94L, 21L, 17L, 36L, 6L), CCCPredSE = c(3.4634488,
4.7953389, 0.9484051, 2.5207022, 5.053452, 3.8072731, 2.2764727,
6.3605968, 1.6044067, 5.590048, 1.6611899, 4.4183913, 7.0124638,
5.6495918, 6.1091934, 5.4797929, 0.8135164, 3.4353934, 4.6261147,
0.8187396, 3.7936333, 5.6512378, 3.1686123, 2.633179, 4.5841921,
1.3989955)), .Names = c("StudyArea", "Ind", "ObsValues",
"AAAPred", "AAAPredSE", "BBBPred", "BBBPredSE", "CCCPred", "CCCPredSE"
), class = "data.frame", row.names = c(NA, -26L))
The head() and dim() of help are below too
head(help)
StudyArea Ind ObsValues AAAPred AAAPredSE BBBPred BBBPredSE CCCPred CCCPredSE
1 AAA AAA_F01 22 28 3.502783 14 3.187343 29 3.4634488
2 AAA AAA_F17 50 52 4.785219 43 4.878283 53 4.7953389
3 AAA AAA_F33 8 6 1.231803 5 1.373986 7 0.9484051
4 AAA AAA_F49 15 15 2.524401 13 2.575227 15 2.5207022
5 AAA AAA_F65 54 35 4.873907 26 4.415568 44 5.0534520
6 AAA AAA_F81 30 31 3.885419 32 3.810217 32 3.8072731
dim(help)
[1] 26 9
I am a relative newcomer to ggplot and am trying to make a plot that displays the observed and predicted values for each individual, with a different color for each StudyArea. I can manually add points and force the color with the code below; however, this feels rather clunky and also does not produce a legend, as I have not specified color in aes().
require(ggplot2)
ggplot(help, aes(x=Ind, y=ObsValues))+
geom_point(color="red", pch = "*", cex = 10)+
geom_point(aes(y = AAAPred), color="blue")+
geom_errorbar(aes(ymin=AAAPred-AAAPredSE, ymax=AAAPred+AAAPredSE), color = "blue")+
geom_point(aes(y = BBBPred), color="darkgreen")+
geom_errorbar(aes(ymin=BBBPred-BBBPredSE, ymax=BBBPred+BBBPredSE), color = "darkgreen")+
geom_point(aes(y = CCCPred), color="black")+
geom_errorbar(aes(ymin=CCCPred-CCCPredSE, ymax=CCCPred+CCCPredSE), color = "black")+
theme(axis.text.x=element_text(angle=30, hjust=1))
In the resulting figure, the asterisks are the observed values and the points are the predicted values, one from each StudyArea.
I tried to melt() the data, but ran into more problems plotting. That being said, I suspect melt()ing or reshape()ing is the best option.
Any suggestions on how best to restructure the help data so that I can plot the observed and predicted values for each individual, with a different color for each StudyArea, would be greatly appreciated.
I also hope to produce a legend, which is likely the default once the data is correctly formatted.
Note: The resulting figure is indeed very busy and will likely be simplified once I get a better handle on ggplot.
Thanks in advance.
Try this:
library(reshape2)
x.value <- melt(help,id.vars=1:3, measure.vars=c(4,6,8))
x.se <- melt(help,id.vars=1:3, measure.vars=c(5,7,9))
gg <- data.frame(x.value,se=x.se$value)
ggplot(gg)+
geom_point(aes(x=Ind, y=ObsValues),size=5,shape=18)+
geom_point(aes(x=Ind, y=value, color=variable),size=3, shape=1)+
geom_errorbar(aes(x=Ind, ymin=value-se, ymax=value+se, color=variable))+
theme(axis.text.x=element_text(angle=-90))
Edit: Response to @B.Davis' questions below:
You have to group the ObsValues by StudyArea, not variable. But when you do that you get six colors, three for StudyArea and three for the predictor groups (variable). If we give the predictor groups (e.g., AAAPred, etc.) the same names as the StudyArea groups (e.g. AAA, etc.), then ggplot just generates three colors.
gg$variable <- substring(gg$variable,1,3) # removes "Pred" from group names
ggplot(gg)+
geom_point(aes(x=Ind, y=ObsValues, color=StudyArea),size=5,shape=18)+
geom_point(aes(x=Ind, y=value, color=variable),size=3, shape=1)+
geom_errorbar(aes(x=Ind, ymin=value-se, ymax=value+se, color=variable))+
theme(axis.text.x=element_text(angle=-90))
Similar to @jlhoward's solution, but I chose to treat ObsValues as a variable so that it appears in the legend.
# help is the data frame from the question
x.value <- melt(help,id.vars=1:2, measure.vars=c(3,4,6,8))
x.se <- melt(help,id.vars=1:2, measure.vars=c(3,5,7,9))
gg <- data.frame(x.value,se=x.se$value)
ggplot(gg)+
geom_point(aes(x=Ind, y=value, color=variable),size=3, shape=1)+
geom_errorbar(data= subset(gg,variable!='ObsValues'),
aes(x=Ind, ymin=value-se, ymax=value+se, color=variable))+
theme(axis.text.x=element_text(angle=-90))
This is a little clumsy, but gets you what you want:
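# jlhoward's melting is more elegant.
require(reshape2)
require(plyr)  # provides rename()
# Long table of observed and predicted values (Ind is used as the id variable)
melted.points <- melt(help[,c('Ind','ObsValues','AAAPred','BBBPred','CCCPred')])
melted.points$observed <- ifelse(melted.points$variable=='ObsValues','observed','predicted')
# Long table of standard errors, with names matched to the prediction columns
melted.points.se <- melt(help[,c('Ind','AAAPredSE','BBBPredSE','CCCPredSE')])
melted.points.se$variable <- gsub('SE','',melted.points.se$variable)
# Left join: observed values have no standard error, so se is NA for them
help2 <- merge(melted.points,melted.points.se,by=c('Ind','variable'),all.x=TRUE)
help2 <- rename(help2,c(value.x='value',value.y='se'))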
And now the actual plot:
ggplot(help2,aes(x=Ind,y=value,color=variable,size=observed,shape=observed,ymin=value-se,ymax=value+se)) +
geom_point() +
geom_errorbar(size=1) +
scale_colour_manual(values = c("red","blue","darkgreen", "black")) +
scale_size_manual(values=c(observed=4,predicted=3)) +
scale_shape_manual(values=c(observed=8,predicted=16))
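Since the question also mentions reshape(), here is a base R sketch of the same restructuring, pairing each prediction column with its standard-error column in one step. It uses a small made-up subset of the columns (the toy data frame and the names long and model are illustrative, and the values are rounded), so treat it as an outline rather than a drop-in replacement for the answers above.

```r
# Toy wide data: one row per individual, paired value/SE columns.
toy <- data.frame(
  Ind       = c("AAA_F01", "AAA_F17", "BBB_P01"),
  ObsValues = c(22, 50, 90),
  AAAPred   = c(28, 52, 79),
  AAAPredSE = c(3.5, 4.8, 6.3),
  BBBPred   = c(14, 43, 80),
  BBBPredSE = c(3.2, 4.9, 6.4)
)

# Each element of 'varying' is one set of columns that is stacked into
# the corresponding name in 'v.names'; 'times' labels the source model.
long <- reshape(
  toy,
  direction = "long",
  varying   = list(c("AAAPred", "BBBPred"), c("AAAPredSE", "BBBPredSE")),
  v.names   = c("value", "se"),
  timevar   = "model",
  times     = c("AAA", "BBB"),
  idvar     = "Ind"
)
rownames(long) <- NULL
print(long)
```

The resulting long data frame has one row per individual and model, with value and se side by side, so it can be passed to ggplot() with color mapped to model and the error bars computed from value and se.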
