Plotting multiple variables in one boxplot - r

I have a data frame with 4 columns and 10 rows, like so:
Total, var1, var2, var3
104, 35, 33, 36
106, 38, 32, 36
93, 34, 27, 32
98, 31, 32, 35
101, 34, 32, 35
106, 38, 32, 36
82, 32, 23, 27
100, 38, 30, 32
111, 34, 39, 38
89, 35, 27, 27
and I would like to produce a boxplot where each column is plotted as a separate boxplot but on the same graph. Ideally I would also like to colour code these and add some jitter to show the individual data points.
So far I have tried to use the melt function from reshape2, but I haven't had much luck.
I hope this is clear; it's been giving me lots of headaches. Thanks for your help.

With your data like this:
> head(data)
Total var1 var2 var3
1 104 35 33 36
2 106 38 32 36
3 93 34 27 32
4 98 31 32 35
5 101 34 32 35
6 106 38 32 36
then this bit of ggplot2:
library(ggplot2)
ggplot(reshape2::melt(data), aes(x = variable, y = value, col = variable)) +
  geom_boxplot() +
  geom_jitter(height = 0, col = "black")
gets you a plot with one boxplot per column and the individual points jittered on top.
I don't see the point of colouring things when the position and the axis label are sufficient, but whatever. Also, if you colour the points by variable as well, they get lost against the boxplots, so I kept them black.
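If you'd rather avoid reshape2, here is a minimal sketch of the same plot using tidyr instead (assuming the same data frame named data):
library(ggplot2)
library(tidyr)
# reshape all four columns to long format, then plot as before
long <- pivot_longer(data, cols = everything(), names_to = "variable", values_to = "value")
ggplot(long, aes(x = variable, y = value, col = variable)) +
  geom_boxplot(outlier.shape = NA) +  # hide boxplot outliers so points aren't drawn twice
  geom_jitter(width = 0.2, height = 0, col = "black")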

Is there an R function to calculate the chi-square value for each cell in a table?

I have been using R to calculate the chi-square value for a table of count data using chisq.test from the stats package. This has returned a chi-square value for the whole table.
W=c(98, 354, 105, 28)
WF=c(13, 108, 34, 6)
FNS=c(108, 438, 138, 24)
F=c(22, 61, 24, 2)
P=c(7, 48, 28, 4)
C=c(15, 68, 30, 4)
D=c(25, 106, 53, 5)
HD=c(39, 277, 122, 29)
Grade=cbind(W, WF, FNS, F, P, C, D, HD)
rownames(Grade)=c("3", "4", "5", "6")
Grade.chi=chisq.test(Grade)
Grade.chi
#Chi-squared approximation may be incorrect
# Pearson's Chi-squared test
#
#data: Grade
#X-squared = 54.274, df = 21, p-value = 9.012e-05
What I would like to calculate is the chi-square value of each cell, so that I can replace the count data in this table with chi-square values:
      W  WF FNS  F  P  C   D  HD
"3"  98  13 108 22  7 15  25  39
"4" 354 108 438 61 48 68 106 277
"5" 105  34 138 24 28 30  53 122
"6"  28   6  24  2  4  4   5  29
Is there a pre-existing function I can use, or will I need to "manually" calculate it for each cell?
I feel it might be similar to this post, but I'm not sure how to adapt it: Chi-square p value matrix in r
Any and all help appreciated - I'm still a beginner with R.
If you save the chi-square test result in a variable, say abc, then you have everything you need to compute the per-cell values:
(abc$observed - abc$expected)^2 / abc$expected
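For example, with the Grade table from the question, a small sketch:
abc <- chisq.test(Grade)
# per-cell chi-square contributions (equivalently abc$residuals^2)
cell.chi <- (abc$observed - abc$expected)^2 / abc$expected
round(cell.chi, 2)
# the contributions sum back to the overall statistic
sum(cell.chi)  # matches abc$statistic (54.274 in the question's output)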

How to create a new column based on other columns with if conditions in r

I'm not able to find a way to generate a new column with if conditions for each group of events in a column.
The column called "BF" represents the value of the flow column at position (i-3), and the same BF is used for every row in an "event" group. For example, in row 5 the value of "BF" is 39, which is the 3rd previous value of the flow column (the flow in row 2), and it applies to all the "2"s in the event column.
The problem is that BF[i] can't be bigger than flow[i]. If BF[i] is bigger than flow[i], then BF should be the (i-4), (i-5), (i-6)... value of flow, until BF[i] is equal to or smaller than flow[i]. For example, in row 10 the value of the column "BF" is bigger than the value of the column "flow"; therefore the value of BF_1 (the column I want to create) in row 10 is 37, which is the closest lower value of flow, in this case flow[i-6].
As an example, we have the following dataframe:
flow<- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event<- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF<- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a<- data.frame(flow, event, BF)
This is the desired output I'm looking for. I want to create the BF_1 column.
flow event BF BF_1
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
Is there a possible way to generate the column BF_1? Please let me know any thoughts. I have been working with for loops and if conditions, but I am not able to hold the BF value for the entire group in the event column.
The coding is a bit inefficient (could have used dplyr etc.), but it will do the work and matches the BF_1 column given:
flow <- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event <- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF <- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a <- data.frame(flow, event, BF)
a$BF_1 <- NA  # default to NA first
for (i in 1:length(unique(a$event))) {
  if (is.na(a[a$event == i, "BF"][1])) next
  if (a[a$event == i, "BF"][1] < a[a$event == i, "flow"][1]) {
    a[a$event == i, "BF_1"] <- a[a$event == i, "BF"][1]
  }
  if (a[a$event == i, "BF"][1] > a[a$event == i, "flow"][1]) {
    head <- min(which(a$event == i)) - 6
    if (head < 1) head <- 1  # making sure it doesn't overflow to row 0
    # fill with the minimum of the flow column over the look-back window
    a[a$event == i, "BF_1"] <- min(a[head:min(which(a$event == i)), "flow"])
  }
}
a
One tidyverse possibility could be:
library(dplyr)
library(tidyr)

a %>%
  left_join(crossing(a, a) %>%
              filter(event > event1) %>%
              group_by(event) %>%
              filter(flow == first(flow)) %>%
              slice(1:(n() - 3)) %>%
              slice(which.max(cumsum(flow > flow1))) %>%
              ungroup() %>%
              transmute(event,
                        flow_flag = flow1),
            by = c("event" = "event")) %>%
  mutate(BF_1 = ifelse(lag(flow, 3) > flow, flow_flag, lag(flow, 3))) %>%
  group_by(event) %>%
  mutate(BF_1 = first(BF_1)) %>%
  select(-flow_flag)
flow event BF BF_1
<dbl> <dbl> <dbl> <dbl>
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
It may be overcomplicated, but what it does is: first, create all combinations of values (as the desired value can theoretically be anywhere in the data); second, identify the first case per group that fulfils the condition (not taking the previous 3rd value into account); finally, combine this with the original df and, if the 3rd previous value per group fulfils the condition, return it, otherwise return the first value that is smaller than the actual value.

Normal probability plot for multiple factors

I'm trying to run diagnostics for normality in a 2^4 factorial problem with two replicates. Here is my code:
n = 2
A <- factor(c(rep("-", 1*n), rep("+", 1*n)))
B <- factor(c(rep("-", 2*n), rep("+", 2*n)))
C <- factor(c(rep("-", 4*n), rep("+", 4*n)))
D <- factor(c(rep("-", 8*n), rep("+", 8*n)))
obs <- c(90, 93,
74, 78,
81, 85,
83, 80,
77, 78,
81, 80,
88, 82,
73, 70,
98, 95,
72, 76,
87, 83,
85, 86,
99, 90,
79, 75,
87, 84,
80, 80)
df <- data.frame(A, B, C, D, obs)
model <- aov(obs ~ A*B*C*D, data = df)
summary(model)
par(mfrow=c(1,2))
qqnorm(resid(model), ylab = "Residuals", xlab = "Quantiles", pch = 16)
qqline(resid(model))
plot(resid(model) ~ fitted(model), ylab = "Residual", xlab = "Predicted", pch = 16)
abline(0,0)
The ANOVA table is giving me the correct values, but when I analyze the normality conditions using a normal Q-Q plot, it incorrectly gives me symmetric residuals. I have noticed that I only run into this issue when I am analyzing four or more interactions. All the residual plots for three interactions or fewer have the correct expected output with the same code.
Any help would be greatly appreciated
I'm unclear why you think the residuals in the saturated model should not be "symmetric". With two replicates per cell and the full A*B*C*D model, the fitted value in each cell is the cell mean, so the two residuals in a cell are equal and opposite. You can look at them directly with:
> print( sort(model$residuals), digits=4)
26 14 5 19 22 28
-4.500e+00 -3.000e+00 -2.000e+00 -2.000e+00 -2.000e+00 -2.000e+00
3 1 16 18 30 8
-2.000e+00 -1.500e+00 -1.500e+00 -1.500e+00 -1.500e+00 -1.500e+00
23 12 9 31 32 10
-5.000e-01 -5.000e-01 -5.000e-01 6.947e-17 6.947e-17 5.000e-01
11 24 7 29 15 17
5.000e-01 5.000e-01 1.500e+00 1.500e+00 1.500e+00 1.500e+00
2 20 21 27 4 6
1.500e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00
13 25
3.000e+00 4.500e+00
They look pretty symmetric in the sense of having paired values on either side of the median. Another way to display numeric values might be:
table( abs(sort(model$residuals)) )
6.9468775251052e-17 0.499999999999995 0.499999999999998
2 2 1
0.499999999999999 0.5 0.500000000000002
1 1 1
1.49999999999999 1.5 1.50000000000006
2 6 2
2 3 4.5
10 2 2
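As a quick check of that pairing (a sketch, assuming model and df from the question), the two replicate residuals within each treatment cell should sum to (numerically) zero, because the saturated model fits each cell mean exactly:
# sum of the two replicate residuals in each of the 16 cells: all ~0
with(df, tapply(resid(model), interaction(A, B, C, D), sum))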

assign value to a variable rather than using if statement

Right now, I have a dataset consisting of the variables Gbcode and ncnty
> str(dt)
'data.frame': 840 obs. of 8 variables:
$ Gbcode : Factor w/ 28 levels "11","12","13",..: 21 22 23 24 25 26 27 28 16 17 ...
$ ncounty : num 0 0 0 0 0 0 0 0 0 0 ...
I want to do the following thing:
if a data record has Gbcode equal to 11, then assign 20 to its ncnty
Gbcode : 11, 12, 13, 14, 15, 21, 22, 23, 31, 32, 33
Corresponding ncnty: 20, 19, 198, 131, 112, 102, 60, 145, 22, 115, 95
I am wondering whether there is a better solution than writing an if statement, which in this case would take many lines, maybe close to 20 lines of code.
This is a merge operation as far as I can tell. Make a little lookup table with your Gbcode/ncnty data, and then merge it in.
# lookup table
lkup <- data.frame(Gbcode=c(11,12,13),ncnty=c(20,19,198))
#example data
dt <- data.frame(Gbcode=c(11,13,12,11,13,12,12))
dt
# Gbcode
#1 11
#2 13
#3 12
#4 11
#5 13
#6 12
#7 12
Merge:
merge(dt, lkup, by="Gbcode", all.x=TRUE)
# Gbcode ncnty
#1 11 20
#2 11 20
#3 12 19
#4 12 19
#5 12 19
#6 13 198
#7 13 198
It is sometimes preferable to use match for this sort of thing too:
dt$ncnty <- lkup$ncnty[match(dt$Gbcode,lkup$Gbcode)]
This could be more elegant, but should do the trick.
Gbcodes <- as.character(c(11, 12, 13, 14, 15, 21, 22, 23, 31, 32, 33))
ncounties <- c(20, 19, 198, 131, 112, 102, 60, 145, 22, 115, 95)
for (i in 1:length(Gbcodes)) dt$ncounty[dt$Gbcode == Gbcodes[i]] <- ncounties[i]
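A compact variant of the same lookup idea is a named vector (a sketch, reusing the Gbcodes/ncounties vectors above):
# build a named vector once, then index it by the (character) Gbcode
lookup <- setNames(ncounties, Gbcodes)
dt$ncounty <- unname(lookup[as.character(dt$Gbcode)])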

R xts object - subset data points for 5 consecutive seconds

I have a large xts object and want to subset on the seconds in the time index, but only where there is a sequence of at least 5 consecutive seconds. I have up to 8 data points per second (which shouldn't be counted as 5 consecutive points, as they are measured within the same second).
And_sub_xts is my xts object
> str(And_sub_xts)
An ‘xts’ object on 2010-04-09 20:32:56/2010-04-26 06:56:57 containing:
Data: chr [1:164421, 1:11] "0.255416" "0.168836" "0.212126" "0.229442" "0.238100" "0.212126" "0.168836" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:11] "CalSurge" "CalSway" "CalHeave" "Stat_Surge" ...
Indexed by objects of class: [POSIXct,POSIXt] TZ:
xts Attributes:
NULL
and the first 100 values of .indexsec(And_sub_xts) are
56 8 23 34 40 40 41 42 25 27 34 35 38 38 40 40 41 56 59 59 19 19 20 20 20 20 22 22 23 23 24 24 24 25 25 26 27 27 27 27 27 28 28 30 30 30 37 38 40 40 41 44 44 46 46 47 48 51 52 54 54 54 54 55 56 59 1 4 4 4 6 6 6 6 7 7 11 12 12 14 14 15 16 16 17 18 18 19 19 21 21 22 22 23 23 25 25 26 26 26
I marked the keeps in bold, so the subset should just consist of these data points.
I just realized that theoretically it could happen that there are some data points distributed like this:
2010-04-09 20:32:20
2010-04-09 20:32:20
2010-04-09 20:32:21
2010-04-09 20:32:22
2010-04-09 20:32:22
2010-04-09 20:40:22
2010-04-09 22:52:23
2010-04-10 20:52:24
which wouldn't be 5 consecutive seconds, but you can't account for this with the .indexsec command - maybe somebody knows a way to get around this.
Thanks for your help!
Here's one way to do it. x is sample data that contains index values with seconds equal to your first 100 values.
require(xts)
# sample data
s <- c(56, 8, 23, 34, 40, 40, 41, 42, 25, 27, 34, 35, 38, 38, 40,
40, 41, 56, 59, 59, 19, 19, 20, 20, 20, 20, 22, 22, 23, 23, 24,
24, 24, 25, 25, 26, 27, 27, 27, 27, 27, 28, 28, 30, 30, 30, 37,
38, 40, 40, 41, 44, 44, 46, 46, 47, 48, 51, 52, 54, 54, 54, 54,
55, 56, 59, 1, 4, 4, 4, 6, 6, 6, 6, 7, 7, 11, 12, 12, 14, 14,
15, 16, 16, 17, 18, 18, 19, 19, 21, 21, 22, 22, 23, 23, 25, 25,
26, 26, 26)
S <- cumsum(ifelse(c(0, diff(s)) < 0, 1, 0)) * 60 + s
x <- .xts(seq_along(S), S, tzone="UTC")
The basic idea is to aggregate your data to 1-second resolution, so you can use rle (run-length encoding) to find the consecutive 5-second observations. Then find the first and last timestamps of the sets of 5-second observations in your aggregated data, and then find the locations of those timestamps in your original data. Finally, use the locations of the timestamps in your original data to create sets of sequences you can use to subset the consecutive 5-second groups of observations.
# aggregate data to 1-second resolution
oneSec <- period.apply(x, endpoints(x, 'seconds'), identity)
# find the runs of 5 or more consecutive one-second increments
consec <- rle(diff(.index(oneSec)))
gte5s <- consec$lengths >= 5
# get the location of the first obs of the run in the 1-second data
begLoc <- cumsum(c(1,consec$lengths))[gte5s]
endLoc <- begLoc + consec$lengths[gte5s]
# get the timestamp of the first and last obs from the original data
beg <- lapply(index(oneSec)[begLoc], function(i) first(x[i, which.i=TRUE]))
end <- lapply(index(oneSec)[endLoc], function(i) last(x[i, which.i=TRUE]))
# create index vector between each value in 'beg' and 'end'
loc <- unlist(mapply(seq, beg, end))
# subset original object using index vector
X <- x[loc,]
