I am trying to run an ANOVA on a 5x5 design where each subject saw only 5 of the 25 possible conditions. So subject A saw
1,1; 2,2; 3,3; 4,4; 5,5
and subject B saw
1,2; 2,3; 3,4; 4,5; 5,1 and so on.
Is it incorrect to use a within-subjects ANOVA, since it's not a true within-subjects design?
This is what I was using:
aov(Risk ~ Moral * Age + Error(ID/(Moral * Age)), data = Home)
and I get
Warning in aov(Risk ~ Moral * Age + Error(ID/(Moral * Age)), data = Home) :
Error() model is singular
Here is a sample of the data
ID Age Moral Risk
1 90 two affair 2
2 90 eight vol 2
3 90 ten invol 2
4 90 four work 3
5 90 six gym 4
6 117 two affair 9
7 117 eight vol 2
8 117 ten invol 3
9 117 four work 9
10 117 six gym 10
11 70 two affair 3
12 70 eight vol 3
13 70 ten invol 2
14 70 four work 6
15 70 six gym 6
16 51 two affair 4
17 51 eight vol 3
18 51 ten invol 3
19 51 four work 4
20 51 six gym 6
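For what it's worth, that warning typically appears when the design is incomplete: each subject contributes only one observation per Moral x Age cell, so the within-subject error strata in Error(ID/(Moral * Age)) cannot all be estimated. A minimal sketch of one common alternative, assuming the data frame is named Home as above, is a mixed model with a random intercept per subject:

library(lme4)
fit <- lmer(Risk ~ Moral * Age + (1 | ID), data = Home)  # random intercept per subject
anova(fit)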
I want to explore the relationship between the abundance of an organism and several possible explanatory factors, and I have doubts about which variables should be treated as fixed or random in the GLMM.
I have a dataset with the number of snails at different sites within a national park (all sites are under the same climatic conditions), but there are local parameters whose effects on snail abundance haven't been studied yet.
This is a longitudinal study with repeated measures over time (every month, for almost two years). The number of snails was counted in the field, always at the same 21 sites (each site has a 6x6 square-meter plot, delimited with wooden stakes).
In case it could influence the analysis, note that some parameters may vary over time, such as the vegetation cover in each plot or the presence of the snail's natural predator (measured as yes/no values). Others, however, are always the same because they are specific to each site, such as the distance to the nearest riverbed or the type of soil.
Here is a subset of my data:
> snail.data
site time snails vegetation_cover predator type_soil distant_riverbed
1 1 1 9 NA n 1 13
2 1 2 7 0.8 n 1 13
3 1 3 13 1.4 n 1 13
4 1 4 14 0.6 n 1 13
5 1 5 12 1.6 n 1 13
10 2 1 0 NA n 1 136
11 2 2 0 0.0 n 1 136
12 2 3 0 0.0 n 1 136
13 2 4 0 0.0 n 1 136
14 2 5 0 0.0 n 1 136
19 3 1 1 NA n 2 201
20 3 2 0 0.0 n 2 201
21 3 3 0 0.0 y 2 201
22 3 4 3 0.0 n 2 201
23 3 5 2 0.0 n 2 201
28 4 1 0 NA n 2 104
29 4 2 0 0.0 n 2 104
30 4 3 0 0.0 y 2 104
31 4 4 0 0.0 n 2 104
32 4 5 0 0.0 n 2 104
37 5 1 1 NA n 3 65
38 5 2 0 2.4 n 3 65
39 5 3 3 2.2 n 3 65
40 5 4 2 2.2 n 3 65
41 5 5 4 2.0 y 3 65
46 6 1 1 NA n 3 78
47 6 2 2 3.0 n 3 78
48 6 3 7 2.8 n 3 78
49 6 4 3 1.8 n 3 78
50 6 5 6 1.2 y 3 78
55 7 1 14 NA n 3 91
56 7 2 21 2.8 n 3 91
57 7 3 16 2.6 n 3 91
58 7 4 15 1.6 n 3 91
59 7 5 8 2.0 n 3 91
So I'm interested in investigating whether the number of snails is significantly different at each site and whether those differences are related to some specific parameters.
So far the best statistical approach I have found is a generalized linear mixed model, but I'm struggling to choose the correct fixed and random variables. My reasoning is: although I'm checking for differences among sites (by comparing the number of snails), the focus of the study is the other parameters mentioned above, so site would be a random factor.
My question, then, is: should 'site' and 'time' be considered random factors, with the local parameters as the fixed variables? Should I include interactions between time and the other factors?
I have set up my command as follows:
library(lme4)
mixed_model <- glmer(snails ~ vegetation_cover + predator + type_soil +
                       distant_riverbed + (1 | site) + (1 | time),
                     data = snail.data, family = poisson)
Would it be the correct syntax for what I have described?
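If an interaction with time is of interest, one hedged way to write it, keeping time as a fixed covariate rather than a random factor (an illustrative choice, not the only one), would be:

# sketch only: predator * time expands to predator + time + predator:time
mixed_model2 <- glmer(snails ~ vegetation_cover + predator * time +
                        type_soil + distant_riverbed + (1 | site),
                      data = snail.data, family = poisson)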
I have a dataset with 121 rows and about 10 columns. One of these columns corresponds to Station, another to Depth, and the rest to chemical variables (temperature, salinity, etc.). I want to calculate the integrated value of these chemical properties by station, using the function oce::integrateTrapezoid. It's my first time writing a loop, so I don't know how. Could you help me?
library(oce)
stations <- unique(datos$Station)
dA <- matrix(NA, nrow = length(stations), ncol = 1, dimnames = list(stations, "temp"))
for (s in stations) {
  sub <- datos[datos$Station == s, ]  # rows belonging to this station
  # integrate temp over depth (swap in your real column names, e.g. Profundidad..m.)
  dA[as.character(s), "temp"] <- integrateTrapezoid(sub$Depth, sub$temp)
}
Station Depth temp
1       10    28
1       50    25
1       100   15
1       150   10
2       9     27
2       45    24
2       98    14
2       152   11
3       11    28.7
3       48    23
3       102   14
3       148   9
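If the loop feels awkward, an equivalent split/apply sketch (assuming the same column names as in the sample above) is:

# per-station trapezoidal integral of temp over Depth
sapply(split(datos, datos$Station),
       function(d) oce::integrateTrapezoid(d$Depth, d$temp))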
I am struggling with how to describe level 2 data in my Multilevel Model in R.
I am using the nlme package.
I have longitudinal data with repeated measures. I have repeated observations for every subject across many days.
The Goal:
Level 1 would be the individual observations within each subject ID.
Level 2 would be the differences in overall means between subject IDs (clusters).
I am trying to determine whether test scores are significantly affected by study time, and whether that effect differs within subjects and between subjects.
How would I write the script if I want to test "between subjects"?
Here is my script for the Level 1 model:
model1 <- lme(fixed = TestScore ~ Studytime, random = ~ 1 | SubjectID,
              data = dataframe, na.action = na.omit)
Below is my example dataframe
`Subject ID` Observations TestScore Studytime
1 1 1 50 600
2 1 2 72 900
3 1 3 82 627
4 1 4 90 1000
5 1 5 81 300
6 1 6 37 333
7 2 1 93 900
8 2 2 97 1000
9 2 3 99 1200
10 2 4 85 600
11 3 1 92 800
12 3 2 73 900
13 3 3 81 1000
14 3 4 96 980
15 3 5 99 1300
16 4 1 47 600
17 4 2 77 900
18 4 3 85 950
I appreciate the help!
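For what it's worth, a common sketch for separating the two levels is person-mean centering, assuming the data frame and column names above (with SubjectID written without the space):

# between part: each subject's mean study time; within part: deviation from it
dataframe$Studytime_btw <- ave(dataframe$Studytime, dataframe$SubjectID)
dataframe$Studytime_wth <- dataframe$Studytime - dataframe$Studytime_btw
model2 <- lme(fixed = TestScore ~ Studytime_btw + Studytime_wth,
              random = ~ 1 | SubjectID, data = dataframe, na.action = na.omit)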
So I have a csv file with column headers ID, Score, and Age.
So in R I have,
data <- read.csv(file.choose(), header=T)
attach(data)
I would like to create two new vectors with the scores of people whose ages are below 70 and above 70 years old. I thought there was a nice, quick way to do this, but I can't find it anywhere. Thanks for any help.
Example of what data looks like
ID, Score, Age
1, 20, 77
2, 32, 65
.... etc
And I am trying to make two vectors: one with the scores of all people younger than 70, and one with the scores of all people older than 70.
Assuming data looks like this:
Score Age
1 1 29
2 5 39
3 8 40
4 3 89
5 5 31
6 6 23
7 7 75
8 3 3
9 2 23
10 6 54
.. . ..
you can use
df_old <- data[data$Age >= 70,]
df_young <- data[data$Age < 70,]
which gives you
> df_old
Score Age
4 3 89
7 7 75
11 7 97
13 3 101
16 5 89
18 5 89
19 4 96
20 3 97
21 8 75
and
> df_young
Score Age
1 1 29
2 5 39
3 8 40
5 5 31
6 6 23
8 3 3
9 2 23
10 6 54
12 4 23
14 2 23
15 4 45
17 7 53
PS: if you only want the scores and not the age, you could use this:
df_old <- data[data$Age >= 70, "Score"]
df_young <- data[data$Age < 70, "Score"]
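An equivalent base-R alternative, assuming the same data frame, uses subset() with drop = TRUE to get plain vectors of scores directly:

# same result; drop = TRUE returns a vector instead of a one-column data frame
scores_old   <- subset(data, Age >= 70, select = Score, drop = TRUE)
scores_young <- subset(data, Age < 70,  select = Score, drop = TRUE)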
I am trying to encircle the datapoints of a scatterplot (using ggplot2), so that (1) 100% of my datapoints and (2) 80% of my datapoints are inside that circle. (See 1 - like in this sketch; please excuse the lazy execution with the snipping tool.)
Here is my dummy-dataset:
x y
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
3 1
3 2
3 3
3 4
4 1
4 2
4 3
5 1
5 2
5 3
5 4
5 9
5 10
6 1
6 2
6 3
6 4
6 5
6 6
6 8
6 9
6 10
7 1
7 2
7 3
7 4
7 5
7 6
7 7
7 8
7 9
7 10
8 2
8 3
8 4
8 5
8 6
8 7
8 8
8 9
I have tried multiple approaches to achieve this, but nothing really satisfies what I want to accomplish.
My first approach was geom_density2d(). However, I have trouble interpreting the results, as I don't really know what the levels mean.
I tried the following:
ggplot(myData, aes(x,y)) + geom_point() + geom_density2d(bins=4, aes(colour=..level..))
Which results in this plot 2:
It is good in that it captures the dent in the contours. However, I don't know how I would get one hull that encircles 100% of my data and a second hull that encircles 80% of my data.
My second approach was to use the geom_encircle() function of the ggalt package. This results in the following plot 3
This time, all of my datapoints are encircled - so far so good. But the "dent" from the geom_contour() plot is not present, and I don't know how to add an encirclement that covers only 80% of my datapoints.
My third approach was using the geom_bag() function (described here).
ggplot(myData, aes(x,y)) + geom_point() + geom_bag(prop=0.9) + geom_bag(prop=0.8)
(with geom_bag() I cannot use prop=1.0 to cover all datapoints; however, setting it to 0.9 is sufficient)
This yields the following plot 5:
This time, again, the dent is not present. Another problem is that setting prop to two different values (e.g. 0.8 and 0.7) can yield the exact same outcome. Finally, the hull is not smooth like the geom_contour() one.
How can I produce a plot (with ggplot2) that looks like my sketch in 1?
Thanks in advance!
____________________________________________
EDIT:
The actual dataset to show the real distribution of my datapoints:
x y
1 -19.397412 47.544324
2 -8.213419 69.892953
3 -29.926849 39.743923
4 -75.377447 79.817208
5 -9.215048 40.705533
6 -42.868995 45.721222
7 -85.590572 84.058463
8 -62.544121 69.371364
9 -60.209205 64.546267
10 3.598963 20.109707
11 -4.552074 61.3339
12 -197.619021 52.225312
13 -147.133639 56.96088
14 -59.402414 56.487012
15 -68.361091 46.811878
16 -105.556485 57.603839
17 -94.354948 32.706933
18 -107.26281 28.477637
19 -155.692967 35.106937
20 -80.819257 30.664812
21 -142.055086 33.728788
22 -118.353934 27.362929
23 -114.634413 31.501665
24 -113.470642 29.136781
25 -181.380891 41.046883
26 -171.106218 23.359443
27 -156.720415 35.450407
28 -165.042839 29.349575
29 -92.869955 25.478965
30 -114.78719 23.860353
31 -134.115204 25.491367
32 -109.430656 19.105614
33 -120.451655 25.97992
34 -87.570713 21.111895
35 -91.222139 22.484895
36 -208.979695 38.311266
37 -98.814223 16.121487
38 -201.812263 49.547512
39 -168.948464 39.583593
40 -112.44335 20.979357
41 -174.138029 28.470047
42 -220.936718 33.452972
43 -169.687859 33.173458
44 -157.119306 38.573987
45 -150.682075 41.66627
46 -77.397116 27.220171
47 -177.559527 53.278523
48 -61.212396 6.796908
49 -94.602774 24.669706
50 -204.333869 37.002679
51 -124.442364 31.519392
52 -165.722504 39.464188
53 -57.849212 23.973774
54 -106.643382 38.560785
55 -90.679094 29.863184
56 -132.476054 31.988021
57 -188.33621 29.658416
58 -136.247184 38.870171
59 -59.929772 20.626164
60 -121.020003 33.862312
61 -82.968422 33.033312
62 -79.130004 32.800121
63 -51.463395 23.452366
64 -63.819269 27.257994
65 -64.02259 27.711516
66 -66.876407 18.156063
67 -68.175454 22.996369
68 -108.640035 29.915306
69 -21.512647 16.930815
70 -66.902542 17.177093
71 -160.262625 33.061052
72 -41.672641 30.510433
73 -83.31784 28.965415
74 -132.410284 22.843924
75 -54.724716 10.642682
76 -69.688094 30.798878
77 -120.775133 24.597096
78 -78.655551 30.368373
79 -68.299767 35.937048
80 -45.037891 21.636422
81 -49.679704 19.508719
82 -62.018393 76.199247
83 -113.777141 27.730892
84 -74.630501 49.062317
85 -95.154793 37.279829
86 -65.229569 46.26744
87 -42.139223 16.38709
88 -94.186408 28.708069
89 -100.920471 27.533579
90 -66.332707 22.573064
91 -26.419725 13.948061
92 -152.704377 34.165409
93 -50.309209 22.032052
94 -125.896489 34.411915
95 -119.304969 28.786249
96 -41.689412 37.314049
97 -99.936438 31.363461
98 -74.807901 24.259652
This yields the following plot 6:
And I would like to show that most of my datapoints are in the lower part, but still encircle all the data, something like in 7:
____________________________________________
EDIT2:
The "ultimate goal" would be to compare those both contours, without the corresponding datapoints, to another dataset, to see whether there are overlaps, but without overcrowding the resulting plots with too many datapoints.