I'm trying to run diagnostics for normality in a 2^4 factorial problem with two replicates. Here is my code:
n = 2
A <- factor(c(rep("-", 1*n), rep("+", 1*n)))
B <- factor(c(rep("-", 2*n), rep("+", 2*n)))
C <- factor(c(rep("-", 4*n), rep("+", 4*n)))
D <- factor(c(rep("-", 8*n), rep("+", 8*n)))
obs <- c(90, 93,
74, 78,
81, 85,
83, 80,
77, 78,
81, 80,
88, 82,
73, 70,
98, 95,
72, 76,
87, 83,
85, 86,
99, 90,
79, 75,
87, 84,
80, 80)
df <- data.frame(A, B, C, D, obs)
model <- aov(obs ~ A*B*C*D, data = df)
summary(model)
par(mfrow=c(1,2))
qqnorm(resid(model), ylab = "Residuals", xlab = "Quantiles", pch = 16)
qqline(resid(model))
plot(resid(model) ~ fitted(model), ylab = "Residual", xlab = "Predicted", pch = 16)
abline(0,0)
The ANOVA table is giving me the correct values, but when I analyze the normality conditions with a normal Q-Q plot, the residuals come out suspiciously symmetric, which I think is incorrect. I have noticed that I only run into this issue when the model includes the four-way interaction; all the residual plots for three interactions or fewer have the expected output with the same code.
Any help would be greatly appreciated
I'm unclear why you think the residuals in the saturated model should not be "symmetric"; you can look at them directly with:
> print( sort(model$residuals), digits=4)
26 14 5 19 22 28
-4.500e+00 -3.000e+00 -2.000e+00 -2.000e+00 -2.000e+00 -2.000e+00
3 1 16 18 30 8
-2.000e+00 -1.500e+00 -1.500e+00 -1.500e+00 -1.500e+00 -1.500e+00
23 12 9 31 32 10
-5.000e-01 -5.000e-01 -5.000e-01 6.947e-17 6.947e-17 5.000e-01
11 24 7 29 15 17
5.000e-01 5.000e-01 1.500e+00 1.500e+00 1.500e+00 1.500e+00
2 20 21 27 4 6
1.500e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00 2.000e+00
13 25
3.000e+00 4.500e+00
They look pretty symmetric in the sense of having paired values on either side of the median. Another way to display numeric values might be:
table( abs(sort(model$residuals)) )
6.9468775251052e-17 0.499999999999995 0.499999999999998
2 2 1
0.499999999999999 0.5 0.500000000000002
1 1 1
1.49999999999999 1.5 1.50000000000006
2 6 2
2 3 4.5
10 2 2
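The values around 6.9e-17 are just floating-point noise standing in for exact zeros. More to the point, the pairwise symmetry is forced by the model itself: in the saturated 2^4 model with two replicates, every fitted value is the mean of its two replicates, so the two residuals in each cell are plus and minus half the within-cell difference. A quick check, reusing the df and model objects from your code:
# Fitted values in the saturated model are the cell means, so residuals
# necessarily come in +/- pairs -- the symmetry is expected, not a bug
cellmeans <- ave(df$obs, df$A, df$B, df$C, df$D)
all.equal(unname(fitted(model)), cellmeans)  # TRUE
table(round(abs(resid(model)), 10))          # rounding removes the 1e-17 noise
With fewer interactions in the model the fit is no longer saturated, so the residuals are not forced into these pairs, which is why those plots look different.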
I have been using R to calculate the chi-square value for a table of count data using chisq.test from the stats package. This has returned a chi-square value for the whole table.
W=c(98, 354, 105, 28)
WF=c(13, 108, 34, 6)
FNS=c(108, 438, 138, 24)
F=c(22, 61, 24, 2)
P=c(7, 48, 28, 4)
C=c(15, 68, 30, 4)
D=c(25, 106, 53, 5)
HD=c(39, 277, 122, 29)
Grade=cbind(W, WF, FNS, F, P, C, D, HD)
rownames(Grade)=c("3", "4", "5", "6")
Grade.chi=chisq.test(Grade)
Grade.chi
#Chi-squared approximation may be incorrect
# Pearson's Chi-squared test
#
#data: Grade
#X-squared = 54.274, df = 21, p-value = 9.012e-05
What I would like to calculate is the chi-square value of each cell, so that I can replace the count data in this table with chi-square values:
W WF FNS F P C D HD
"3" 98 13 108 22 7 15 25 39
"4" 354 108 438 61 48 68 106 277
"5" 105 34 138 24 28 30 53 122
"6" 28 6 24 2 4 4 5 29
Is there a pre-existing function I can use, or will I need to "manually" calculate it for each cell?
I feel it might be similar to this post, but I'm not sure how to adapt it: Chi-square p value matrix in r
Any and all help appreciated - I'm still a beginner with R.
If you save the chi-square test in a variable, say abc, then you have all you need to compute the per-cell values:
(abc$observed-abc$expected)^2/abc$expected
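Equivalently, chisq.test() already stores the Pearson residuals in its result, and squaring them gives exactly these per-cell contributions. With the Grade.chi object from the question:
# Per-cell contributions to the chi-square statistic
Grade.chi$residuals^2        # same as (observed - expected)^2 / expected
sum(Grade.chi$residuals^2)   # adds back up to X-squared = 54.274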
I am not able to find a way to generate a new column based on if-conditions applied per group of events in a column.
The column called "BF" represents the (i-3)th value of the flow column and is the same BF for each "event" group. For example, in row 5 the value of "BF" is 39, which is the previous 3rd value of the flow column (the flow in row 2), for all the "2"s in the event column.
The problem is that BF[i] can't be bigger than flow[i]. If BF[i] is bigger than flow[i], then BF should be the (i-4)th or (i-5)th or (i-6)th... value of flow, until BF[i] is equal to or smaller than flow[i]. For example, in row 10 the value of the "BF" column is bigger than the value of the "flow" column; therefore, the value of BF_1 (the column I want to create) in row 10 is 37, which is the closest lower value of flow, in this case flow[i-6].
As an example, we have the following dataframe:
flow<- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event<- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF<- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a<- data.frame(flow, event, BF)
This is the desired output I'm looking for. I want to create the BF_1 column.
flow event BF BF_1
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
Is there a possible way to generate the column BF_1? Please let me know any thoughts. I am working with for loops and if conditions, but I am not able to hold the BF value for the entire group in the event column.
The coding is a bit inefficient (one could have used dplyr etc.), but it will do the work and it matches the BF_1 column given:
flow <- c(40, 39, 38, 37, 50, 49, 46, 44, 43, 45, 40, 30, 80, 75, 50, 55, 53, 51, 49, 100)
event <- c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6)
BF <- c(NA, NA, NA, NA, 39, 39, 39, 39, 39, 46, 46, 46, 45, 45, 45, 80, 80, 80, 80, 53)
a <- data.frame(flow, event, BF)
a$BF_1 <- NA #default to NA first
for (i in unique(a$event)) {
  grp <- which(a$event == i)                  # rows belonging to this event
  bf <- a$BF[grp[1]]
  if (is.na(bf)) next                         # first event has no BF: leave NA
  if (bf <= a$flow[grp[1]]) {
    a$BF_1[grp] <- bf                         # BF already valid, keep it
  } else {
    start <- max(grp[1] - 6, 1)               # look back to flow[i-6] at most, never past row 1
    a$BF_1[grp] <- min(a$flow[start:grp[1]])  # smallest flow in the lookback window
  }
}
a
One tidyverse possibility could be:
library(dplyr)
library(tidyr)
a %>%
left_join(crossing(a, a) %>%
filter(event > event1) %>%
group_by(event) %>%
filter(flow == first(flow)) %>%
slice(1:(n() - 3)) %>%
slice(which.max(cumsum(flow > flow1))) %>%
ungroup() %>%
transmute(event,
flow_flag = flow1), by = c("event" = "event")) %>%
mutate(BF_1 = ifelse(lag(flow, 3) > flow, flow_flag, lag(flow, 3))) %>%
group_by(event) %>%
mutate(BF_1 = first(BF_1)) %>%
select(-flow_flag)
flow event BF BF_1
<dbl> <dbl> <dbl> <dbl>
1 40 1 NA NA
2 39 1 NA NA
3 38 1 NA NA
4 37 1 NA NA
5 50 2 39 39
6 49 2 39 39
7 46 2 39 39
8 44 2 39 39
9 43 2 39 39
10 45 3 46 37
11 40 3 46 37
12 30 3 46 37
13 80 4 45 45
14 75 4 45 45
15 50 4 45 45
16 55 5 80 30
17 53 5 80 30
18 51 5 80 30
19 49 5 80 30
20 100 6 53 53
It could be overcomplicated, but what it does is: first, it creates all combinations of values (as the desired value can theoretically be anywhere in the data). Second, it identifies the first case per group fulfilling the condition (not taking into account the previous 3rd value). Finally, it combines this with the original df: if the 3rd previous value per group fulfils the condition, it is returned; otherwise, the first value fulfilling the condition of being smaller than the actual value is returned.
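For comparison, here is a minimal base-R sketch that implements the stated rule directly, walking back from flow[i-3] until the value no longer exceeds flow[i]. The helper name bf1_for_group is made up for illustration; on this data it reproduces the BF_1 column above:
bf1_for_group <- function(flow, i) {
  k <- i - 3                                      # start at the previous 3rd value
  while (k >= 1 && flow[k] > flow[i]) k <- k - 1  # step back until it fits
  if (k >= 1) flow[k] else NA_real_               # NA when there is no history
}
firsts <- tapply(seq_len(nrow(a)), a$event, min)  # first row of each event group
vals <- sapply(firsts, function(i) bf1_for_group(a$flow, i))
a$BF_1 <- unname(vals[as.character(a$event)])     # spread group value to all rows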
I have a datasheet with four columns and ten rows, like so:
Total, var1, var2, var3
104, 35, 33, 36
106, 38, 32, 36
93, 34, 27, 32
98, 31, 32, 35
101, 34, 32, 35
106, 38, 32, 36
82, 32, 23, 27
100, 38, 30, 32
111, 34, 39, 38
89, 35, 27, 27
and I would like to produce a boxplot where each column is plotted as a separate boxplot but on the same graph. Ideally I would also like to colour code these and add some jitter to show the individual data points.
So far I have tried to use the melt functionality from reshape2, but I haven't had much luck.
I hope this is clear; it's been giving me lots of headaches. Thanks for your help.
With your data like this:
> head(data)
Total var1 var2 var3
1 104 35 33 36
2 106 38 32 36
3 93 34 27 32
4 98 31 32 35
5 101 34 32 35
6 106 38 32 36
then this bit of ggplot2:
library(ggplot2)
ggplot(reshape2::melt(data), aes(x=variable, y=value, col=variable)) + geom_boxplot() + geom_jitter(height=0,col="black")
gets you one boxplot per column, coloured by variable, with the raw data points jittered on top.
I don't see the point of colouring things when the position and the axis label are sufficient, but whatever. Also, if you colour the points by variable as well, you lose them against the boxplot, so I kept them black.
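For a fully self-contained version, the data frame can be built straight from the datasheet in the question:
library(ggplot2)
# Values copied from the question's datasheet
data <- data.frame(
  Total = c(104, 106, 93, 98, 101, 106, 82, 100, 111, 89),
  var1  = c(35, 38, 34, 31, 34, 38, 32, 38, 34, 35),
  var2  = c(33, 32, 27, 32, 32, 32, 23, 30, 39, 27),
  var3  = c(36, 36, 32, 35, 35, 36, 27, 32, 38, 27)
)
# melt() stacks the columns into (variable, value) pairs: one box per column
ggplot(reshape2::melt(data), aes(x = variable, y = value, col = variable)) +
  geom_boxplot() +
  geom_jitter(height = 0, col = "black")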
Right now, I have a dataset consisting of the variables Gbcode and ncnty:
> str(dt)
'data.frame': 840 obs. of 8 variables:
$ Gbcode : Factor w/ 28 levels "11","12","13",..: 21 22 23 24 25 26 27 28 16 17 ...
$ ncounty : num 0 0 0 0 0 0 0 0 0 0 ...
I want to do the following thing:
if a data record has Gbcode equal to 11, then assign 20 to its ncnty
Gbcode : 11, 12, 13, 14, 15, 21, 22, 23, 31, 32, 33
Corresponding ncnty: 20, 19, 198, 131, 112, 102, 60, 145, 22, 115, 95
I am wondering whether there is any better solution than writing an if statement, which would take many lines in this case, maybe just under 20 lines of code.
This is a merge operation as far as I can tell. Make a little lookup table with your Gbcode/ncnty data, and then merge it in.
# lookup table
lkup <- data.frame(Gbcode=c(11,12,13),ncnty=c(20,19,198))
#example data
dt <- data.frame(Gbcode=c(11,13,12,11,13,12,12))
dt
# Gbcode
#1 11
#2 13
#3 12
#4 11
#5 13
#6 12
#7 12
Merge:
merge(dt, lkup, by="Gbcode", all.x=TRUE)
# Gbcode ncnty
#1 11 20
#2 11 20
#3 12 19
#4 12 19
#5 12 19
#6 13 198
#7 13 198
It is sometimes preferable to use match for this sort of thing too:
dt$ncnty <- lkup$ncnty[match(dt$Gbcode,lkup$Gbcode)]
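With the full mapping from the question, that looks like this (match() converts a factor Gbcode and the numeric lookup column to character before comparing, so the factor coding shown in str(dt) is not a problem):
lkup <- data.frame(
  Gbcode = c(11, 12, 13, 14, 15, 21, 22, 23, 31, 32, 33),
  ncnty  = c(20, 19, 198, 131, 112, 102, 60, 145, 22, 115, 95)
)
# Rows whose Gbcode has no entry in lkup get NA
dt$ncounty <- lkup$ncnty[match(dt$Gbcode, lkup$Gbcode)]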
This could be more elegant, but should do the trick.
Gbcodes <- as.character(c(11, 12, 13, 14, 15, 21, 22, 23, 31, 32, 33))
ncounties <- c(20, 19, 198, 131, 112, 102, 60, 145, 22, 115, 95)
for(i in 1:length(Gbcodes)) dt$ncounty[dt$Gbcode == Gbcodes[i]] <- ncounties[i]
I'm new to R. I need to generate a simple frequency table (as in books) with cumulative frequency and relative frequency.
So I want to generate from some simple data like
> x
[1] 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10 12 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10
[36] 12 15 19 20 22 20 19 19 19
a table like:
frequency cumulative relative
(9.99,11.7] 2 2 0.04545455
(11.7,13.4] 2 4 0.04545455
(13.4,15.1] 1 5 0.02272727
(15.1,16.9] 10 15 0.22727273
(16.9,18.6] 22 37 0.50000000
(18.6,20.3] 6 43 0.13636364
(20.3,22] 1 44 0.02272727
I know it should be simple, but I don't know how.
I got some results using this code:
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
as.matrix(table(factorx))
You're close! There are a few functions that will make this easy for you, namely cumsum() and prop.table(). Here's how I'd probably put this together. I make some random data, but the point is the same:
#Fake data
x <- sample(10:20, 44, TRUE)
#Your code
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
#Tabulate and turn into data.frame
xout <- as.data.frame(table(factorx))
#Add cumFreq and proportions
xout <- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
#-----
factorx Freq cumFreq relative
1 (9.99,11.4] 11 11 0.25000000
2 (11.4,12.9] 3 14 0.06818182
3 (12.9,14.3] 11 25 0.25000000
4 (14.3,15.7] 2 27 0.04545455
5 (15.7,17.1] 6 33 0.13636364
6 (17.1,18.6] 3 36 0.06818182
7 (18.6,20] 8 44 0.18181818
The base functions table, cumsum and prop.table should get you there:
cbind( Freq=table(x), Cumul=cumsum(table(x)), relative=prop.table(table(x)))
Freq Cumul relative
10 2 2 0.04545455
12 2 4 0.04545455
15 1 5 0.02272727
16 10 15 0.22727273
17 16 31 0.36363636
18 6 37 0.13636364
19 4 41 0.09090909
20 2 43 0.04545455
22 1 44 0.02272727
With cbind and naming of the columns to your liking, this should be pretty easy for you in the future. The output from the table function is a matrix, so this result is also a matrix. If this were being done on something big, it would be more efficient to do this:
tbl <- table(x)
cbind( Freq=tbl, Cumul=cumsum(tbl), relative=prop.table(tbl))
If you are looking for something pre-packaged, consider the freq() function from the descr package.
library(descr)
x = c(sample(10:20, 44, TRUE))
freq(x, plot = FALSE)
Or to get cumulative percents, use the ordered() function
freq(ordered(x), plot = FALSE)
To add a "cumulative frequencies" column:
tab = as.data.frame(freq(ordered(x), plot = FALSE))
CumFreq = cumsum(tab[-dim(tab)[1],]$Frequency)
tab$CumFreq = c(CumFreq, NA)
tab
If your data has missing values, a valid percent column is added to the table.
x = c(sample(10:20, 44, TRUE), NA, NA)
freq(ordered(x), plot = FALSE)
Yet another possibility:
library(SciencesPo)
x = c(sample(10:20, 50, TRUE))
freq(x)
My suggestion is to check out the agricolae package:
library(agricolae)
weight <- c(68, 53, 69.5, 55, 71, 63, 76.5, 65.5, 69, 75, 76, 57, 70.5,
            71.5, 56, 81.5, 69, 59, 67.5, 61, 68, 59.5, 56.5, 73,
            61, 72.5, 71.5, 59.5, 74.5, 63)
h1<- graph.freq(weight,col="yellow",frequency=1,las=2,xlab="h1")
print(summary(h1),row.names=FALSE)