Multiple t-tests on independent groups with a large dataframe - R

I've seen many similar posts, but the vast majority of them are at least 3 years old and I'm not really sure they apply to my situation, so here we go.
A colleague asked for my help on a multiple t-test on her project.
Basically she has a 20-observation x 30-variable dataframe that looks like this:
| Group | Lipid 1 | Lipid 2 | ... | Lipid 28 |
| ----- | ------- | ------- | --- | -------- |
| A     | ...     | ...     | ... | ...      |
| B     | ...     | ...     | ... | ...      |
| ...   | ...     | ...     | ... | ...      |
| B     | ...     | ...     | ... | ...      |
What we want to do is a group comparison for each lipid (meaning a t-test for Lipid 1 between groups A and B, then a t-test for Lipid 2, and so on).
We do not want to compare lipids with each other.
And of course, we'd like not to have to copy/paste the same 3 lines of code, especially since we've got two other dataframes with the same variables but different conditions.
I've tried one solution I saw on here, but it gives me an error I'm not sure I understand:
sapply(foetal[,2:20], function(i) t.test(i ~ foetal$ID))
Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In mean.default(x) : argument is not numeric or logical: returning NA
2: In var(x) : NAs introduced by coercion
3: In mean.default(y) : argument is not numeric or logical: returning NA
4: In var(y) : NAs introduced by coercion
Another solution I saw would be to use the gather function to get one column with the lipid names and one column with the values, then create a list-column, spread the dataframe, and mutate a new column containing the p-value of the t-test.
tips %>%
  select(tip, total_bill, sex) %>%
  gather(key = variable, value = value, -sex) %>%
  group_by(sex, variable) %>%
  summarise(value = list(value)) %>%
  spread(sex, value) %>%
  group_by(variable) %>%
  mutate(p_value = t.test(unlist(Female), unlist(Male))$p.value,
         t_value = t.test(unlist(Female), unlist(Male))$statistic)
(https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/)
I'm honestly not sure what to do. Does anyone have tips or anything?
Here's the dput() for the data.... Not really sure why it's necessary though...
dput(dummy)
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"),
Lipid.1 = c(0.737, 0.419, 0.468, 0.805, 1.036, 0.825, 0.286,
1.166, 0.898, 0.504, 1.433, 0.41, 0.325, 0.866, 0.337, 0.876,
0.636, 0.953, 0.481, 0.602), Lipid.2 = c(0.001, 0.017, 0.013,
0.025, 0.018, 0.003, 0.007, NA, 0.01, 0.002, 0.01, 0.022,
0.005, NA, 0.018, NA, 0.015, 0.016, NA, 0.01), Lipid.3 = c(0.035,
0.018, 0.036, 0.024, 0.023, 0.027, 0.036, 0.037, 0.013, 0.037,
0.03, 0.04, 0.038, 0.033, 0.016, 0.034, 0.029, 0.033, 0.018,
0.029), Lipid.4 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), Lipid.5 = c(0.09,
0.099, 0.12, 0.058, 0.136, 0.103, 0.153, 0.148, 0.047, 0.085,
0.098, 0.133, 0.099, 0.121, 0.084, 0.065, 0.11, 0.088, 0.065,
0.043), Lipid.6 = c(0.39, 0.555, 0.568, 0.6, 0.626, 0.378,
0.657, 0.57, 0.271, 0.41, 0.474, 0.617, 0.491, 0.738, 0.459,
0.365, 0.499, 0.388, 0.271, 0.275), Lipid.7 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Lipid.8 = c(0.186, 0.197, 0.191, 0.125, 0.209,
0.107, 0.174, 0.143, 0.055, 0.134, 0.148, 0.193, 0.184, 0.213,
0.134, 0.085, 0.165, 0.215, 0.163, 0.061), Lipid.9 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, "0,007"), Lipid.10 = c("0,242", "0,254", "0,134",
"0,226", "0,243", "0,122", "0,082", "0,119", "0,098", "0,093",
"0,27", "0,284", "0,258", "0,236", "0,173", "0,106", "0,138",
"0,066", "0,072", "0,081"), Lipid.11 = c("0,053", "0,114",
"0,038", "0,094", "0,073", "0,067", "0,028", "0,022", "0,021",
"0,05", "0,085", "0,102", "0,122", "0,096", "0,027", "0,03",
NA, "0,078", "0,066", NA), Lipid.12 = c(0.223, 0.261, 0.258,
0.212, 0.168, 0.101, 0.191, 0.09, 0.195, 0.082, 0.155, 0.2,
0.167, 0.231, 0.145, 0.089, 0.239, 0.141, 0.106, 0.124),
Lipid.13 = c(0.737, 0.763, 0.707, 0.587, 0.545, 0.317, 0.74,
0.602, 0.481, 0.531, 0.632, 0.448, 0.62, 0.766, 0.397, 0.623,
0.997, 0.578, 0.418, 0.412), Lipid.14 = c(0.683, 0.666, 0.507,
0.366, 0.443, 0.266, 0.493, 0.345, 0.368, 0.355, 0.432, 0.411,
0.491, 0.565, 0.357, 0.285, 0.604, 0.426, 0.538, 0.295),
Lipid.15 = c(0.911, 1.017, 0.503, 0.76, 0.741, 0.486, 0.648,
0.581, 0.955, 0.515, 0.932, 0.707, 0.626, 0.928, 0.836, 0.537,
0.654, 0.351, 0.498, 0.529), Lipid.16 = c(0.148, 0.116, 0.069,
0.104, 0.091, 0.064, 0.093, 0.123, 0.11, 0.097, 0.283, 0.076,
0.095, 0.194, 0.06, 0.061, 0.086, 0.051, 0.064, 0.059), Lipid.17 = c("0,155",
"0,274", "0,149", "0,127", "0,174", "nd", "0,109", "0,134",
"0,1", "0,09", "0,25", "0,112", "0,088", "0,243", "0,092",
"0,073", "0,153", "0,12", "0,14", "0,06"), Lipid.18 = c(3.143,
3.441, 4.359, 1.945, 2.573, 2.267, 3.585, 3.405, 2.296, 1.998,
3.468, 2.98, 3.626, 3.635, 3.236, 2.092, 2.586, 2.08, 1.718,
1.736), Lipid.19 = c(37.993, 36.148, 40.244, 30.395, 37.339,
35.742, 47.316, 47.555, 34.351, 32.377, 38.694, 39.413, 36.114,
41.235, 32.779, 32.222, 36.418, 36.918, 33.334, 31.421),
Lipid.20 = c(6.613, 5.913, 9.662, 3.789, 7.485, 6.297, 8.254,
8.07, 4.905, 5.686, 7.742, 7.533, 6.875, 7.908, 7.022, 5.446,
6.1, 6.782, 6.062, 6.089), Lipid.21 = c(7.235, 6.759, 8.331,
4.931, 6.558, 4.186, 5.99, 5.629, 3.066, 3.439, 7.102, 7.655,
6.606, 7.858, 5.804, 3.135, 3.218, 3.639, 2.975, 3.13), Lipid.22 = c(6.453,
6.664, 9.048, 4.341, 8.03, 7.599, 10.24, 10.954, 5.873, 6.687,
8.005, 8.908, 6.708, 8.06, 5.931, 6.083, 5.734, 5.587, 5.388,
6.088), Lipid.23 = c(4.943, 3.164, 5.153, 2.51, 4.071, 5.255,
7.636, 8.376, 4.726, 5.56, 4.762, 5.044, 4.549, 4.875, 4.57,
5.147, 4.396, 4.031, 3.556, 4.38), Lipid.24 = c(3.973, 4.279,
5.928, 3.066, 4.95, 4.667, 7.949, 7.268, 4.948, 3.72, 5.137,
5.539, 4.006, 5.276, 3.909, 4.163, 4.954, 5.02, 3.961, 4.201
), Lipid.25 = c(7.638, 5.224, 8.417, 3.902, 7.267, 6.007,
8.256, 7.457, 4.801, 4.86, 7.581, 8.173, 7.57, 8.591, 7.482,
5.091, 5.651, 6.577, 5.415, 5.76), Lipid.26 = c(10.225, 8.293,
13.188, 5.607, 10.993, 4.491, 5.767, 5.011, 3.589, 3.145,
11.471, 12.183, 9.686, 12.562, 9.697, 3.34, 4.186, 4.485,
3.23, 4.229), Lipid.27 = c(5.848, 4.856, 6.503, 3.534, 5.358,
8.933, 14.034, 12.806, 7.781, 8.094, 6.765, 6.867, 5.539,
7.772, 5.883, 7.832, 8.607, 7.586, 6.628, 7.563), Lipid.28 = c(32.941,
30.579, 31.358, 15.861, 30.353, 25.222, 35.662, 34.035, 20.338,
24.682, 30.698, 34.024, 31.608, 37.539, 24.901, 20.131, 23.126,
30.803, 25.639, 18.935)), class = "data.frame", row.names = c(NA,
-20L))

If you would like to have the full t-test output, you could just loop over the columns:
If we start with your df:
data <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"),
Lipid.1 = c(0.737, 0.419, 0.468, 0.805, 1.036, 0.825, 0.286,
1.166, 0.898, 0.504, 1.433, 0.41, 0.325, 0.866, 0.337, 0.876,
0.636, 0.953, 0.481, 0.602), Lipid.2 = c(0.001, 0.017, 0.013,
0.025, 0.018, 0.003, 0.007, NA, 0.01, 0.002, 0.01, 0.022,
0.005, NA, 0.018, NA, 0.015, 0.016, NA, 0.01), Lipid.3 = c(0.035,
0.018, 0.036, 0.024, 0.023, 0.027, 0.036, 0.037, 0.013, 0.037,
0.03, 0.04, 0.038, 0.033, 0.016, 0.034, 0.029, 0.033, 0.018,
0.029), Lipid.4 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), Lipid.5 = c(0.09,
0.099, 0.12, 0.058, 0.136, 0.103, 0.153, 0.148, 0.047, 0.085,
0.098, 0.133, 0.099, 0.121, 0.084, 0.065, 0.11, 0.088, 0.065,
0.043), Lipid.6 = c(0.39, 0.555, 0.568, 0.6, 0.626, 0.378,
0.657, 0.57, 0.271, 0.41, 0.474, 0.617, 0.491, 0.738, 0.459,
0.365, 0.499, 0.388, 0.271, 0.275), Lipid.7 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Lipid.8 = c(0.186, 0.197, 0.191, 0.125, 0.209,
0.107, 0.174, 0.143, 0.055, 0.134, 0.148, 0.193, 0.184, 0.213,
0.134, 0.085, 0.165, 0.215, 0.163, 0.061), Lipid.9 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, "0,007"), Lipid.10 = c("0,242", "0,254", "0,134",
"0,226", "0,243", "0,122", "0,082", "0,119", "0,098", "0,093",
"0,27", "0,284", "0,258", "0,236", "0,173", "0,106", "0,138",
"0,066", "0,072", "0,081"), Lipid.11 = c("0,053", "0,114",
"0,038", "0,094", "0,073", "0,067", "0,028", "0,022", "0,021",
"0,05", "0,085", "0,102", "0,122", "0,096", "0,027", "0,03",
NA, "0,078", "0,066", NA), Lipid.12 = c(0.223, 0.261, 0.258,
0.212, 0.168, 0.101, 0.191, 0.09, 0.195, 0.082, 0.155, 0.2,
0.167, 0.231, 0.145, 0.089, 0.239, 0.141, 0.106, 0.124),
Lipid.13 = c(0.737, 0.763, 0.707, 0.587, 0.545, 0.317, 0.74,
0.602, 0.481, 0.531, 0.632, 0.448, 0.62, 0.766, 0.397, 0.623,
0.997, 0.578, 0.418, 0.412), Lipid.14 = c(0.683, 0.666, 0.507,
0.366, 0.443, 0.266, 0.493, 0.345, 0.368, 0.355, 0.432, 0.411,
0.491, 0.565, 0.357, 0.285, 0.604, 0.426, 0.538, 0.295),
Lipid.15 = c(0.911, 1.017, 0.503, 0.76, 0.741, 0.486, 0.648,
0.581, 0.955, 0.515, 0.932, 0.707, 0.626, 0.928, 0.836, 0.537,
0.654, 0.351, 0.498, 0.529), Lipid.16 = c(0.148, 0.116, 0.069,
0.104, 0.091, 0.064, 0.093, 0.123, 0.11, 0.097, 0.283, 0.076,
0.095, 0.194, 0.06, 0.061, 0.086, 0.051, 0.064, 0.059), Lipid.17 = c("0,155",
"0,274", "0,149", "0,127", "0,174", "nd", "0,109", "0,134",
"0,1", "0,09", "0,25", "0,112", "0,088", "0,243", "0,092",
"0,073", "0,153", "0,12", "0,14", "0,06"), Lipid.18 = c(3.143,
3.441, 4.359, 1.945, 2.573, 2.267, 3.585, 3.405, 2.296, 1.998,
3.468, 2.98, 3.626, 3.635, 3.236, 2.092, 2.586, 2.08, 1.718,
1.736), Lipid.19 = c(37.993, 36.148, 40.244, 30.395, 37.339,
35.742, 47.316, 47.555, 34.351, 32.377, 38.694, 39.413, 36.114,
41.235, 32.779, 32.222, 36.418, 36.918, 33.334, 31.421),
Lipid.20 = c(6.613, 5.913, 9.662, 3.789, 7.485, 6.297, 8.254,
8.07, 4.905, 5.686, 7.742, 7.533, 6.875, 7.908, 7.022, 5.446,
6.1, 6.782, 6.062, 6.089), Lipid.21 = c(7.235, 6.759, 8.331,
4.931, 6.558, 4.186, 5.99, 5.629, 3.066, 3.439, 7.102, 7.655,
6.606, 7.858, 5.804, 3.135, 3.218, 3.639, 2.975, 3.13), Lipid.22 = c(6.453,
6.664, 9.048, 4.341, 8.03, 7.599, 10.24, 10.954, 5.873, 6.687,
8.005, 8.908, 6.708, 8.06, 5.931, 6.083, 5.734, 5.587, 5.388,
6.088), Lipid.23 = c(4.943, 3.164, 5.153, 2.51, 4.071, 5.255,
7.636, 8.376, 4.726, 5.56, 4.762, 5.044, 4.549, 4.875, 4.57,
5.147, 4.396, 4.031, 3.556, 4.38), Lipid.24 = c(3.973, 4.279,
5.928, 3.066, 4.95, 4.667, 7.949, 7.268, 4.948, 3.72, 5.137,
5.539, 4.006, 5.276, 3.909, 4.163, 4.954, 5.02, 3.961, 4.201
), Lipid.25 = c(7.638, 5.224, 8.417, 3.902, 7.267, 6.007,
8.256, 7.457, 4.801, 4.86, 7.581, 8.173, 7.57, 8.591, 7.482,
5.091, 5.651, 6.577, 5.415, 5.76), Lipid.26 = c(10.225, 8.293,
13.188, 5.607, 10.993, 4.491, 5.767, 5.011, 3.589, 3.145,
11.471, 12.183, 9.686, 12.562, 9.697, 3.34, 4.186, 4.485,
3.23, 4.229), Lipid.27 = c(5.848, 4.856, 6.503, 3.534, 5.358,
8.933, 14.034, 12.806, 7.781, 8.094, 6.765, 6.867, 5.539,
7.772, 5.883, 7.832, 8.607, 7.586, 6.628, 7.563), Lipid.28 = c(32.941,
30.579, 31.358, 15.861, 30.353, 25.222, 35.662, 34.035, 20.338,
24.682, 30.698, 34.024, 31.608, 37.539, 24.901, 20.131, 23.126,
30.803, 25.639, 18.935)), class = "data.frame", row.names = c(NA,
-20L))
Clean up the df:
# remove the columns which only contain NA:
data$Lipid.4 <- NULL
data$Lipid.7 <- NULL
data$Lipid.9 <- NULL
# convert from string to numeric (I do it now manually with each column. You could use a for-loop)
data$Lipid.10 <- gsub(",", ".", data$Lipid.10) # convert comma to dot
data$Lipid.10 <- as.numeric(data$Lipid.10) # convert from string to numeric
data$Lipid.11 <- gsub(",", ".", data$Lipid.11)
data$Lipid.11 <- as.numeric(data$Lipid.11)
data$Lipid.17 <- gsub(",", ".", data$Lipid.17)
data$Lipid.17 <- as.numeric(data$Lipid.17)
# get the lipid column names
all_lipids <- colnames(data)
all_lipids <- all_lipids[all_lipids != "ID"] # we don't need the ID column for the loop
# now loop over each column and perform a t-test
for (column in all_lipids) {
  print(column)
  print(t.test(data[, column] ~ data$ID))
}
You get for each lipid:
[1] "Lipid.1"
Welch Two Sample t-test
data: data[, column] by data$ID
t = 0.15843, df = 17.391, p-value = 0.8759
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2766112 0.3216112
sample estimates:
mean in group A mean in group B
0.7144 0.6919
And just a final comment: you perform a lot of comparisons, so you may want to correct for multiple testing.
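For instance, here is a minimal sketch (not part of the original answer) that collects one p-value per lipid with sapply() instead of printing the full output, and then adjusts them with p.adjust(); it assumes the cleaned data and the all_lipids vector created above:
# collect one p-value per lipid (sketch; assumes `data` and `all_lipids` from above)
p_values <- sapply(all_lipids, function(column) {
  t.test(data[, column] ~ data$ID)$p.value
})
# adjust for multiple testing, e.g. with the Benjamini-Hochberg (FDR) method
p_adjusted <- p.adjust(p_values, method = "BH")
data.frame(lipid = all_lipids, p_raw = p_values, p_BH = p_adjusted)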

Let's start with the data: what you pasted in is dirty! Instead of numbers, you have strings. For example, Lipid.10:
Lipid.10 = c("0,242", "0,254", "0,134",
"0,226", "0,243", "0,122", "0,082", "0,119", "0,098", "0,093",
"0,27", "0,284", "0,258", "0,236", "0,173", "0,106", "0,138",
"0,066", "0,072", "0,081")
Besides, you have variables that contain only NA values:
Lipid.4 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)
So I had to clean them up a bit.
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"),
Lipid.1 = c(0.737, 0.419, 0.468, 0.805, 1.036, 0.825, 0.286,
1.166, 0.898, 0.504, 1.433, 0.41, 0.325, 0.866, 0.337, 0.876,
0.636, 0.953, 0.481, 0.602), Lipid.2 = c(0.001, 0.017, 0.013,
0.025, 0.018, 0.003, 0.007, NA, 0.01, 0.002, 0.01, 0.022,
0.005, NA, 0.018, NA, 0.015, 0.016, NA, 0.01), Lipid.3 = c(0.035,
0.018, 0.036, 0.024, 0.023, 0.027, 0.036, 0.037, 0.013, 0.037,
0.03, 0.04, 0.038, 0.033, 0.016, 0.034, 0.029, 0.033, 0.018,
0.029), Lipid.4 = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), Lipid.5 = c(0.09,
0.099, 0.12, 0.058, 0.136, 0.103, 0.153, 0.148, 0.047, 0.085,
0.098, 0.133, 0.099, 0.121, 0.084, 0.065, 0.11, 0.088, 0.065,
0.043), Lipid.6 = c(0.39, 0.555, 0.568, 0.6, 0.626, 0.378,
0.657, 0.57, 0.271, 0.41, 0.474, 0.617, 0.491, 0.738, 0.459,
0.365, 0.499, 0.388, 0.271, 0.275), Lipid.7 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Lipid.8 = c(0.186, 0.197, 0.191, 0.125, 0.209,
0.107, 0.174, 0.143, 0.055, 0.134, 0.148, 0.193, 0.184, 0.213,
0.134, 0.085, 0.165, 0.215, 0.163, 0.061), Lipid.9 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0.007), Lipid.10 = c(0.242, 0.254, 0.134, 0.226,
0.243, 0.122, 0.082, 0.119, 0.098, 0.093, 0.27, 0.284, 0.258,
0.236, 0.173, 0.106, 0.138, 0.066, 0.072, 0.081), Lipid.11 = c(0.053,
0.114, 0.038, 0.094, 0.073, 0.067, 0.028, 0.022, 0.021, 0.05,
0.085, 0.102, 0.122, 0.096, 0.027, 0.03, NA, 0.078, 0.066,
NA), Lipid.12 = c(0.223, 0.261, 0.258, 0.212, 0.168, 0.101,
0.191, 0.09, 0.195, 0.082, 0.155, 0.2, 0.167, 0.231, 0.145,
0.089, 0.239, 0.141, 0.106, 0.124), Lipid.13 = c(0.737, 0.763,
0.707, 0.587, 0.545, 0.317, 0.74, 0.602, 0.481, 0.531, 0.632,
0.448, 0.62, 0.766, 0.397, 0.623, 0.997, 0.578, 0.418, 0.412
), Lipid.14 = c(0.683, 0.666, 0.507, 0.366, 0.443, 0.266,
0.493, 0.345, 0.368, 0.355, 0.432, 0.411, 0.491, 0.565, 0.357,
0.285, 0.604, 0.426, 0.538, 0.295), Lipid.15 = c(0.911, 1.017,
0.503, 0.76, 0.741, 0.486, 0.648, 0.581, 0.955, 0.515, 0.932,
0.707, 0.626, 0.928, 0.836, 0.537, 0.654, 0.351, 0.498, 0.529
), Lipid.16 = c(0.148, 0.116, 0.069, 0.104, 0.091, 0.064,
0.093, 0.123, 0.11, 0.097, 0.283, 0.076, 0.095, 0.194, 0.06,
0.061, 0.086, 0.051, 0.064, 0.059), Lipid.17 = c(0.155, 0.274,
0.149, 0.127, 0.174, NA, 0.109, 0.134, 0.1, 0.09, 0.25, 0.112,
0.088, 0.243, 0.092, 0.073, 0.153, 0.12, 0.14, 0.06), Lipid.18 = c(3.143,
3.441, 4.359, 1.945, 2.573, 2.267, 3.585, 3.405, 2.296, 1.998,
3.468, 2.98, 3.626, 3.635, 3.236, 2.092, 2.586, 2.08, 1.718,
1.736), Lipid.19 = c(37.993, 36.148, 40.244, 30.395, 37.339,
35.742, 47.316, 47.555, 34.351, 32.377, 38.694, 39.413, 36.114,
41.235, 32.779, 32.222, 36.418, 36.918, 33.334, 31.421),
Lipid.20 = c(6.613, 5.913, 9.662, 3.789, 7.485, 6.297, 8.254,
8.07, 4.905, 5.686, 7.742, 7.533, 6.875, 7.908, 7.022, 5.446,
6.1, 6.782, 6.062, 6.089), Lipid.21 = c(7.235, 6.759, 8.331,
4.931, 6.558, 4.186, 5.99, 5.629, 3.066, 3.439, 7.102, 7.655,
6.606, 7.858, 5.804, 3.135, 3.218, 3.639, 2.975, 3.13), Lipid.22 = c(6.453,
6.664, 9.048, 4.341, 8.03, 7.599, 10.24, 10.954, 5.873, 6.687,
8.005, 8.908, 6.708, 8.06, 5.931, 6.083, 5.734, 5.587, 5.388,
6.088), Lipid.23 = c(4.943, 3.164, 5.153, 2.51, 4.071, 5.255,
7.636, 8.376, 4.726, 5.56, 4.762, 5.044, 4.549, 4.875, 4.57,
5.147, 4.396, 4.031, 3.556, 4.38), Lipid.24 = c(3.973, 4.279,
5.928, 3.066, 4.95, 4.667, 7.949, 7.268, 4.948, 3.72, 5.137,
5.539, 4.006, 5.276, 3.909, 4.163, 4.954, 5.02, 3.961, 4.201
), Lipid.25 = c(7.638, 5.224, 8.417, 3.902, 7.267, 6.007,
8.256, 7.457, 4.801, 4.86, 7.581, 8.173, 7.57, 8.591, 7.482,
5.091, 5.651, 6.577, 5.415, 5.76), Lipid.26 = c(10.225, 8.293,
13.188, 5.607, 10.993, 4.491, 5.767, 5.011, 3.589, 3.145,
11.471, 12.183, 9.686, 12.562, 9.697, 3.34, 4.186, 4.485,
3.23, 4.229), Lipid.27 = c(5.848, 4.856, 6.503, 3.534, 5.358,
8.933, 14.034, 12.806, 7.781, 8.094, 6.765, 6.867, 5.539,
7.772, 5.883, 7.832, 8.607, 7.586, 6.628, 7.563), Lipid.28 = c(32.941,
30.579, 31.358, 15.861, 30.353, 25.222, 35.662, 34.035, 20.338,
24.682, 30.698, 34.024, 31.608, 37.539, 24.901, 20.131, 23.126,
30.803, 25.639, 18.935)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
The rest is easy.
library(tidyverse)
ft = function(data){
  tryCatch({
    tout = t.test(data$val ~ data$ID)
    tibble(
      t = tout$statistic,
      p = tout$p.value,
      stderr = tout$stderr
    )
  }, error = function(msg){
    return(tibble(t = NA, p = NA, stderr = NA))
  })
}
df %>%
  pivot_longer(starts_with("Lipid"), names_to = "Lipid", values_to = "val") %>%
  group_by(Lipid) %>%
  nest() %>%
  mutate(testt = map(data, ft)) %>%
  select(Lipid, testt) %>%
  unnest(testt)
output
# A tibble: 28 x 4
# Groups: Lipid [28]
Lipid t p stderr
<chr> <dbl> <dbl> <dbl>
1 Lipid.1 0.158 0.876 0.142
2 Lipid.2 -0.870 0.399 0.00350
3 Lipid.3 -0.377 0.711 0.00372
4 Lipid.4 NA NA NA
5 Lipid.5 0.930 0.366 0.0143
6 Lipid.6 0.730 0.475 0.0614
7 Lipid.7 NA NA NA
8 Lipid.8 -0.180 0.859 0.0223
9 Lipid.9 NA NA NA
10 Lipid.10 -0.200 0.844 0.0355
# ... with 18 more rows
Customize the ft function as needed.
I had to use the tryCatch function in ft because of variables that contain only NA values.
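As a possible simplification (a sketch, not part of the original answer, and assuming dplyr >= 1.0 for select(where(...))), you could drop the columns that are entirely NA before pivoting; note that tryCatch is still useful for columns like Lipid.9, which have too few non-missing values in one group:
# drop lipid columns that contain only NA values (sketch)
df_clean <- df %>% select(where(~ !all(is.na(.x))))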

You can use the multtest library in R too, for multiple two-sample t-tests, as shown in the following code:
library(multtest)
df <- as.data.frame(t(as.matrix(dummy)))
X <- apply(as.matrix.noquote(df[2:nrow(df),]), 2, as.numeric)
cl <- ifelse(df[1,] == 'A', 1, 0) # class labels
welch_t_stat <- mt.teststat(X, cl, test='t')
welch_t_stat
# [1] 0.15843467 -0.86954194 -0.37680666 NA 0.92978706 0.72969094 NA -0.17962582 NA NA NA
# [12] 0.69705527 0.16001073 0.15733921 0.59540273 -0.05557413 NA 0.52706460 0.99860493 -0.14561137 0.58894166 1.25114061
# [23] 1.03458080 0.86540315 -0.62788116 -0.28806189 0.60206042 0.12954702
As can be seen from the above result, there are 28 Welch t-tests performed for 28 lipids in the dataframe.
Since you obtained individual t-statistics, you can now compute the p-values and apply FWER corrections (Bonferroni / Holm) or FDR corrections (Benjamini & Hochberg), which is useful when you have a large number of tests:
raw_p <- 2 * (1 - pnorm(abs(welch_t_stat))) # raw p-values assuming normal
# or use pt() with appropriate df
procedures <- c("Bonferroni", "Holm", "BH")
adjusted <- mt.rawp2adjp(raw_p, procedures)
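mt.rawp2adjp() returns the adjusted p-values sorted by increasing raw p-value, together with an index back to the original order, and the all-NA lipids may need to be dropped from raw_p first. As a sketch based on the multtest documentation:
# keep only the lipids with a defined statistic, then adjust (sketch)
ok <- !is.na(raw_p)
adjusted <- mt.rawp2adjp(raw_p[ok], procedures)
# map the sorted adjusted p-values back to the original lipid order
adjp_ordered <- adjusted$adjp[order(adjusted$index), ]
rownames(adjp_ordered) <- colnames(dummy)[-1][ok]  # assumes lipids are columns 2:29 of `dummy`
head(adjp_ordered)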

Related

How do you filter out individuals on a figure after creating a PCA plot in Factoextra?

I am a research student coming to grips with R for the first time.
I am trying to make a PCA plot from a series of body measurements; the specimen names and a subspecies tag (BIN) are in separate columns. The BIN column contains the BIN ID for each sample.
The difficulty I am facing is filtering out individuals with certain BINs.
My desired output is to produce a PCA plot identical to the one below but only displaying the named BINs ("ACZ5516", "ADF3772") and not the remaining BINs.
Revised image
#import data set
library(FactoMineR)   # for PCA()
library(factoextra)   # for fviz_pca_ind()
Anotylus <- read.csv("DataSO.csv", header = TRUE, sep = ",",
                     row.names = 1)
#row.names sets specimen ID as specimen name
#set BIN as factor
Anotylus$BIN <- as.factor(Anotylus$BIN)
# Number of BINs and number of individuals in each
table(Anotylus["BIN"])
#create PCA of data set, excludes column for BIN (column 12)
Ano.pca <- PCA(Anotylus[, c(1:11)], graph = FALSE)
#visualise PCA with all individuals in the d.f.
fviz_pca_ind(Ano.pca,
             geom.ind = "point",
             col.ind = Anotylus$BIN,
             repel = TRUE,
             legend.title = "BIN",
             addEllipses = TRUE)
#With individuals from selected BINs
top <- list(name = c("ACZ5516", "ADF3772"))
fviz_pca_ind(Ano.pca,
             geom.ind = "point",
             col.ind = Anotylus$BIN,
             select.ind = top,
             repel = TRUE,
             legend.title = "BIN",
             addEllipses = TRUE)
#no samples visible at all
#would like to see only the two named
I have tried using a subset of the data, but the principal-component variation changes and produces a different result.
How do I filter the individuals displayed to a curated list?
Any advice or guidance is deeply appreciated!
Best,
Dante
Sample data set below
> dput(Anotylus)
structure(list(Total.Anten.Length..mm. = c(0.66, 0.635, 0.676,
0.559, 1.249, 0.675, 0.704, 0.649, 0.661, 0.795, 0.836, 0.888,
0.941, 0.781, 0.899, 0.918, 0.854, 0.834, 0.888, 0.884, 0.879,
0.776, 0.954, 0.853, 0.96, 0.527, 0.515, 0.653, 0.491, 0.474,
0.538, 0.694, 1.01, 0.53, 0.641, 0.509, 0.918, 0.849, 0.452,
0.536), Body.Length...mm. = c(1.842, 1.664, 1.901, 1.917, 3.061,
1.961, 1.862, 1.99, 1.85, 1.449, 2.455, 2.077, 2.578, 2.478,
2.798, 2.589, 2.291, 2.882, 2.472, 2.55, 2.53, 2.757, 2.689,
2.166, 2.894, 1.944, 1.48, 2.385, 1.715, 1.674, 1.532, 2.27,
2.598, 1.677, 1.67, 1.68, 2.374, 2.877, 1.699, 1.656),
Eye.Area..mm2. = c(0.01,
0.009, 0.01, 0.006, 0.026, 0.007, 0.01, 0.01, 0.009, 0.006, 0.016,
0.014, 0.015, 0.018, 0.02, 0.016, 0.019, 0.015, 0.013, 0.011,
0.015, 0.014, 0.017, 0.014, 0.012, 0.007, 0.006, 0.02, 0.007,
0.006, 0.005, 0.013, 0.013, 0.006, 0.007, 0.005, 0.013, 0.006,
0.008, 0.005), Eye.Width..mm. = c(0.046, 0.036, 0.054, 0.033,
0.071, 0.04, 0.046, 0.047, 0.044, 0.05, 0.059, 0.053, 0.073,
0.063, 0.068, 0.051, 0.044, 0.07, 0.064, 0.061, 0.054, 0.042,
0.038, 0.059, 0.059, 0.043, 0.046, 0.079, 0.037, 0.035, 0.037,
0.054, 0.047, 0.045, 0.045, 0.028, 0.05, 0.037, 0.043, 0.045),
Head.Width..mm. = c(0.359, 0.362, 0.377, 0.317, 0.731, 0.456,
0.38, 0.414, 0.359, 0.453, 0.568, 0.449, 0.519, 0.517, 0.516,
0.515, 0.512, 0.513, 0.511, 0.456, 0.503, 0.474, 0.598, 0.453,
0.574, 0.309, 0.306, 0.574, 0.314, 0.298, 0.295, 0.386, 0.557,
0.289, 0.318, 0.306, 0.505, 0.291, 0.298, 0.263),
Pronotum.Width..mm. = c(0.413,
0.455, 0.439, 0.352, 0.741, 0.462, 0.467, 0.461, 0.442, 0.493,
0.573, 0.549, 0.584, 0.617, 0.632, 0.61, 0.614, 0.624, 0.631,
0.533, 0.587, 0.562, 0.609, 0.522, 0.621, 0.342, 0.341, 0.598,
0.336, 0.314, 0.331, 0.467, 0.547, 0.343, 0.342, 0.317, 0.545,
0.328, 0.329, 0.284), Pronotum.Length..mm. = c(0.304, 0.326,
0.334, 0.24, 0.48, 0.317, 0.303, 0.329, 0.302, 0.36, 0.418,
0.383, 0.424, 0.428, 0.399, 0.442, 0.404, 0.461, 0.435, 0.376,
0.393, 0.403, 0.373, 0.41, 0.435, 0.259, 0.247, 0.403, 0.257,
0.252, 0.23, 0.387, 0.388, 0.248, 0.26, 0.215, 0.336, 0.223,
0.231, 0.247), Elytra.Width..mm. = c(0.558, 0.552, 0.586,
0.43, 0.854, 0.506, 0.528, 0.586, 0.548, 0.54, 0.75, 0.716,
0.794, 0.816, 0.746, 0.82, 0.786, 0.8, 0.722, 0.69, 0.758,
0.766, 0.736, 0.668, 0.852, 0.468, 0.462, 0.741, 0.461, 0.323,
0.406, 0.637, 0.617, 0.41, 0.366, 0.422, 0.718, 0.42, 0.408,
0.278), Elytra.Length..mm. = c(0.469, 0.437, 0.386, 0.346,
0.631, 0.428, 0.464, 0.451, 0.445, 0.532, 0.583, 0.543, 0.558,
0.62, 0.625, 0.623, 0.613, 0.605, 0.623, 0.588, 0.606, 0.48,
0.568, 0.568, 0.598, 0.373, 0.352, 0.516, 0.365, 0.326, 0.327,
0.502, 0.464, 0.346, 0.344, 0.319, 0.519, 0.346, 0.329, 0.346
), Pronotum.Value = c(0.288, 0.319, 0.306, 0.331, 0.179,
0.278, 0.224, 0.211, 0.204, 0.273, 0.26, 0.33, 0.241, 0.218,
0.203, 0.209, 0.241, 0.227, 0.31, 0.236, 0.341, 0.288, 0.283,
0.263, 0.279, 0.173, 0.162, 0.22, 0.183, 0.209, 0.193, 0.185,
0.236, 0.181, 0.172, 0.227, 0.275, 0.164, 0.21, 0.217),
Elytra.Value = c(0.314,
0.319, 0.393, 0.243, 0.205, 0.297, 0.21, 0.205, 0.244, 0.359,
0.288, 0.335, 0.375, 0.291, 0.243, 0.238, 0.288, 0.283, 0.351,
0.271, 0.48, 0.415, 0.325, 0.294, 0.193, 0.182, 0.271, 0.237,
0.216, 0.246, 0.214, 0.193, 0.233, 0.205, 0.18, 0.262, 0.225,
0.176, 0.303, 0.251), BIN = structure(c(1L, 1L, 1L, 3L, 8L,
1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 7L, 3L, 3L, 6L, 3L, 3L, 3L, 2L, 5L, 3L, 3L,
3L, 5L, 3L, 3L, 3L), .Label = c("ACZ5516", "ACZ5742", "ADF3772",
"ADF4138", "ADG1201", "ADH9095", "ADI3175", "ADR2790"), class =
"factor")), row.names = c("CCDB-22214-D03",
"CCDB-22214-D06", "CCDB-22214-D08", "CCDB-22214-G09", "CCDB-22214-
H02",
"CCDB-22214-H09", "CCDB-22215-A11", "CCDB-22215-A12", "CCDB-22215-
F04",
"CCDB-23850-B07", "CCDB-23851-C04", "CCDB-23851-C05", "CCDB-23851-
C11",
"CCDB-23851-C12", "CCDB-23851-D02", "CCDB-23851-D03", "CCDB-23851-
D04",
"CCDB-23851-D06", "CCDB-23851-E08", "CCDB-23851-E09", "CCDB-23851-
E11",
"CCDB-23851-F03", "CCDB-23851-G05", "CCDB-23851-G09", "CCDB-23858-
B08",
"CCDB-23858-G12", "CCDB-23858-H01", "CCDB-23859-B10", "CCDB-23859-
E07",
"CCDB-23859-E10", "CCDB-23859-E11", "CCDB-25504-E04", "CCDB-25505-
E02",
"CCDB-25510-B12", "CCDB-25510-D02", "CCDB-25510-E09", "CCDB-25511-
B06",
"CCDB-25511-B12", "CCDB-25511-E11", "CCDB-25512-E12"), class =
"data.frame")
Apparently factoextra "produces ggplot2-based elegant data visualization with less typing". From what I can tell, fviz_pca_ind essentially plots the PCA coordinates for each individual point and computes a multivariate normal ellipse for each group.
Here's a replication of the plot you attached, in stripped-down ggplot code:
library(ggplot2)
# construct a plotting data frame with the BIN identifier and each individual's PCA coordinates
df <- cbind.data.frame(BIN = Anotylus$BIN, Ano.pca$ind$coord)
ggplot(df, aes(x = Dim.1, y = Dim.2, color = BIN)) +
  geom_point() +
  stat_ellipse(type = "norm")
Note that, as there are only 1 or 2 points for every BIN other than ACZ5516 and ADF3772, you get "Too few points to calculate an ellipse" warnings and no ellipse is plotted for those groups.
In order to "hide" the other BINs in your figure, you can either plot only the BINs you want, or create a new grouping (ACZ5516, ADF3772, and Others) in the plotting data and give the points you do not want to focus on a less visible colour.
library(dplyr)
# Plot only BIN ACZ5516 and ADF3772
df %>%
  filter(BIN %in% c("ACZ5516", "ADF3772")) %>%
  ggplot(aes(x = Dim.1, y = Dim.2, color = BIN)) +
  geom_point() +
  stat_ellipse(type = "norm")
# Create a new grouping for BIN other than ACZ5516 and ADF3772
df2 <- df %>%
  mutate(BIN = ifelse(BIN %in% c("ACZ5516", "ADF3772"), as.character(BIN), "Others"))
df2 %>%
  ggplot(aes(x = Dim.1, y = Dim.2, color = BIN)) +
  geom_point() +
  stat_ellipse(data = df %>% filter(BIN %in% c("ACZ5516", "ADF3772")), type = "norm") +
  scale_colour_manual(values = c("darkgreen", "orange", "gray"))

Need Help Making an Ordihull

I have been collaborating on this code that creates an NMDS plot and I want to add shaded polygons of the points. However, the ordihull code keeps returning the following error. Why would the argument be of length zero?
Error in if (n < 4) return(colMeans(x[-n, , drop = FALSE])) : argument is of length zero
> m1 <- metaMDS(d1)
> m2 <- metaMDS(d2)
> m3 <- metaMDS(d3)
> mdat <- data.frame(m3$points)
> mdat$site <- substr(rownames(mdat), 1, 1)
> mdat$col <- ifelse(mdat$site == "D", "red",
                     ifelse(mdat$site == "H", "blue", "green"))
> plot(mdat[,1], mdat[,2], pch=16, col=mdat$col, display = "sites",
xlab="NMDS1", ylab="NMDS2", xlim=c(-0.2, 0.2),
ylim=c(-0.2, 0.2), main= "Phylum")
> ordihull(mdat[,1], mdat[,2], display="sites", label=T,
lwd=2, draw="polygon",col= c("blue", "red", "green"))
Here is the dput:
> structure(list(p__Proteobacteria = c(44.807, 40.907, 36.558,36.811,
39.401, 40.114, 45.911, 43.133, 30.137, 27.734, 26.722,
31.261), p__Actinobacteria = c(26.819, 34.651, 40.904, 38.847,
39.446, 37.523, 29.881, 29.251, 31.783, 23.641, 34.918, 31.308
), p__Acidobacteria = c(8.48, 6.6, 5.934, 6.609, 5.89, 7.567,
5.795, 6.666, 10.616, 10.709, 8.988, 11.794), p__Bacteroidetes =
c(7.56, 8.189, 5.363, 6.223, 4.716, 3.613, 4.65, 5.2, 4.281, 2.785,
2.808, 3.271), p__Gemmatimonadetes = c(3.529, 2.108, 1.213, 1.193,
1.541, 1.439, 1.006, 1.171, 5.794, 4.107, 4.001, 2.747),
p__Chloroflexi = c(2.686, 2.987, 2.979, 3.049, 4.128, 4.564, 5.304,
4.624, 3.669, 2.775, 4.534, 4.94), p__Bacteria_unclassified =
c(2.38, 1.869, 1.579, 1.247, 2.3, 2.108, 1.36, 1.193, 3.126, 1.885,
2.987, 2.37), p__Firmicutes = c(0.998, 0.807, 2.76, 2.962, 0.866,
1.32, 1.651, 2.073, 1.099, 1.046, 1.3, 1.302), p__Verrucomicrobia =
c(0.676, 0.404, 0.32, 0.35, 0.293, 0.239, 0.188, 0.261, 0.521,
0.726, 0.52, 0.397), p__Nitrospirae = c(0.464, 0.244, 0.198, 0.208,
0.016, 0.032, 0.024, 0.042, 0.296, 0.103, 0.229, 0.211),
p__Candidatus_Saccharibacteria = c(0.421, 0.511, 0.456, 0.552,
0.523, 0.6, 0.842, 1.016, 0.672, 0.636, 0.465, 0.736),
p__Planctomycetes = c(0.392, 0.267, 0.354, 0.285, 0.275, 0.356,
0.285, 0.276, 0.33, 0.438, 0.552, 0.365), p__Fibrobacteres = c(0.14,
0.074, 0.007, 0.009, 0.072, 0.044, 0.136, 0.079, 0.117, 0.018,
0.167, 0.065), p__Candidatus_Latescibacteria = c(0.113, 0.059,
0.017, 0.005, 0.004, 0.017, 0.015, 0.009, 0, 0.011, 0.007, 0.018
), p__Latescibacteria = c(0.085, 0.04, 0.01, 0.004, 0.012, 0.015,
0.033, 0.015, 0.012, 0.016, 0.011, 0.018), p__Cyanobacteria =
c(0.079, 0.048, 1.071, 1.372, 0.32, 0.19, 2.629, 4.689, 7.133,
22.963, 11.417, 8.767), p__Thermodesulfobacteria = c(0.068, 0.057,
0.115, 0.103, 0.008, 0.01, 0.015, 0.007, 0.01, 0.003, 0.002, 0.013),
p__Elusimicrobia = c(0.059, 0.021, 0.012, 0.001, 0.004, 0.002,
0.015, 0.017, 0, 0.002, 0.005, 0.006), p__Chlorobi = c(0.052,
0.025, 0.002, 0.012, 0.029, 0.046, 0.033, 0.04, 0.05, 0.02,
0.046, 0.025), p__Armatimonadetes = c(0.046, 0.053, 0.051,
0.072, 0.076, 0.095, 0.048, 0.053, 0.197, 0.159, 0.128, 0.125
), p__Spirochaetes = c(0.035, 0.021, 0.002, 0.001, 0, 0.002,
0.024, 0.039, 0, 0, 0, 0), p__Parcubacteria = c(0.03, 0.013,
0, 0, 0.01, 0.015, 0.042, 0.037, 0.032, 0.059, 0.053, 0.011
), p__Chlamydiae = c(0.028, 0.017, 0.046, 0.05, 0.014, 0.007,
0.021, 0.022, 0.07, 0.074, 0.08, 0.152)), class = "data.frame",
row.names = c("D15B", "D610B", "D15F", "D610F", "HR15B", "HR610B",
"HR15F", "HR610F", "C15B", "C610B", "C15F", "C610F"))
Here is the code:
> library(vegan)   # for vegdist(), metaMDS(), envfit(), ordihull()
> phylum.dat <- structure(...)   # the dput output shown above
> x <- data.frame(tax=names(phylum.dat), nsites=apply(phylum.dat, 2, function(x){length(which(x>0))}))
> d1 <- vegdist(phylum.dat, method = "jaccard", binary = TRUE)
> d2 <- vegdist(log1p(phylum.dat), method = "jaccard")
> logit_phylum <- as.matrix(phylum.dat+1)/100
> d3 <- qlogis(logit_phylum)
> d3 <- d3+abs(min(d3))
> d3 <- vegdist(d3, method = "jaccard")
> m1 <- metaMDS(d1)
> m2 <- metaMDS(d2)
> m3 <- metaMDS(d3)
> e1 <- envfit(m3, phylum.dat)
> exy <- data.frame(tax=names(phylum.dat),
                    x=e1$vectors$arrows[,1],
                    y=e1$vectors$arrows[,2],
                    pval=e1$vectors$pvals,
                    r=e1$vectors$r)
> rownames(exy) <- NULL
> exy <- exy[order(-exy$r),]
> mdat <- data.frame(m3$points)
> mdat$site <- substr(rownames(mdat), 1, 1)
> mdat$col <- ifelse(mdat$site == "D", "red",
                     ifelse(mdat$site == "H", "blue", "green"))
> mdat$rad <- sqrt((mdat$MDS1^2) + (mdat$MDS2^2))
> max(mdat$rad)
> exy$x2 <- 0.17 * exy$r * exy$x
> exy$y2 <- 0.17 * exy$r * exy$y
> exy$adj <- ifelse(exy$x < 0, 1, 0)
> plot(mdat[,1], mdat[,2], pch=16, col=mdat$col,
       xlab="NMDS1", ylab="NMDS2", xlim=c(-0.2, 0.2),
       ylim=c(-0.2, 0.2), main= "Phylum")
> ordihull(mdat[,1], mdat[,2], display="sites", label=T,
           lwd=2, draw="polygon", col= c("blue", "red", "green"))

Removing NAs from ggplot x-axis in ggplot2

I would like to get rid of the whole NA block (highlighted here).
I tried na.omit and na.rm = TRUE unsuccessfully.
Here is the code I used:
library(readxl)
library(tidyr)
data <- read_excel("Documents/TFB/xlsx_geochimie/solfatara_maj.xlsx")
View(data)
data <- gather(data, FeO:`Fe2O3(T)`, key = "Element", value = "Pourcentage")
library(ggplot2)
level_order <- factor(data$Element, levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
ggplot(data = data, mapping = aes(x = level_order, y = data$Pourcentage, colour = data$Ech)) + geom_point() + geom_line(group = data$Ech) + scale_y_log10()
And here is my original file
https://drive.google.com/file/d/1bZi7fPWebbpodD1LFScoEcWt5Bs-cqhb/view?usp=sharing
If I run your code and look at data that goes into ggplot:
table(data$Element)
Al2O3 CaO Fe2O3 Fe2O3(T) FeO K2O LOI LOI2 MgO MnO
12 12 12 12 12 12 12 12 12 12
Na2O P2O5 SiO2 SO4 TiO2 Total Total 2 Total N Total S
12 12 12 12 12 12 12 12 12
You have included Total (and the other non-element columns) in the melted data frame, which I guess is not intended. Hence, when you apply factor and these "Total..." values are not included in the levels, they become NA.
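As a minimal illustration of that behaviour (hypothetical values):
# values not listed in `levels` are turned into NA by factor()
factor(c("SiO2", "Total"), levels = c("SiO2", "TiO2"))
# [1] SiO2 <NA>
# Levels: SiO2 TiO2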
So we can do it from scratch:
data <- read_excel("solfatara_maj.xlsx")
The data:
structure(list(Ech = c("AGN 1A", "AGN 2A", "AGN 3B", "SOL 4B",
"SOL 8Ag", "SOL 8Ab", "SOL 16A", "SOL 16B", "SOL 16C", "SOL 22 A",
"SOL 22D", "SOL 25B"), FeO = c(0.2, 0.8, 1.7, 0.3, 1.7, NA, 0.2,
NA, 0.1, 0.7, 1.3, 2), `Total S` = c(5.96, 45.3, 0.22, 17.3,
NA, NA, NA, NA, NA, NA, 2.37, 0.36), SO4 = c(NA, 6.72, NA, 4.08,
0.06, 0.16, 42.2, 35.2, 37.8, 0.32, 6.57, NA), `Total N` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 15.2, NA, NA), SiO2 = c(50.2,
31.05, 56.47, 62.14, 61.36, 75.66, 8.41, 21.74, 17.44, 13.52,
19.62, 56.35), Al2O3 = c(15.53, 7.7, 17.56, 4.44, 17.75, 10.92,
31.92, 26.38, 27.66, 0.64, 3.85, 17.28), Fe2O3 = c(0.49, 0.63,
2.06, NA, 1.76, 0.11, 0.64, 0.88, 1.71, NA, 1.32, 2.67), MnO = c(0.01,
0.01, 0.13, 0.01, 0.09, 0.01, 0.01, 0.01, 0.01, 0.005, 0.04,
0.12), MgO = c(0.06, 0.07, 0.88, 0.03, 0.97, 0.05, 0.04, 0.07,
0.03, 0.02, 1.85, 1.63), CaO = c(0.2, 0.09, 3.34, 0.09, 2.58,
0.57, 0.2, 0.26, 0.15, 0.06, 35.66, 4.79), Na2O = c(0.15, 0.14,
3.23, 0.13, 3.18, 2.04, 0.68, 0.68, 0.55, 0.05, 0.45, 3.11),
K2O = c(4.39, 1.98, 8, 1.26, 8.59, 5.94, 8.2, 6.97, 8.04,
0.2, 0.89, 7.65), TiO2 = c(0.42, 0.27, 0.46, 0.79, 0.55,
0.16, 0.09, 0.22, 0.16, 0.222, 0.34, 0.53), P2O5 = c(0.11,
0.09, 0.18, 0.08, 0.07, 0.07, 0.85, 0.68, 0.62, NA, 0.14,
0.28), LOI = c(27.77, 57.06, 6.13, 29.03, 1.38, 4.92, 42.58,
37.58, 38.76, NA, 26.99, 3.92), LOI2 = c(27.79, 57.15, 6.32,
29.06, 1.57, 4.93, 42.6, 37.59, 38.77, 0.08, 27.13, 4.15),
Total = c(99.52, 99.88, 100.2, 98.25, 99.99, 100.5, 93.81,
95.57, 95.23, 15.25, 92.45, 100.3), `Total 2` = c(99.54,
99.96, 100.3, 98.28, 100.2, 100.6, 93.83, 95.58, 95.24, 15.33,
92.59, 100.6), `Fe2O3(T)` = c(0.71, 1.52, 3.95, 0.27, 3.65,
0.22, 0.87, 0.99, 1.82, 0.61, 2.76, 4.9)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
First we set the plotting levels like you did:
plotlvls = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O")
Then we select only these columns, plus Ech. Note that I use pivot_longer() because gather() is being superseded. Then we do the factoring too:
plotdf = data %>% select(c(plotlvls, "Ech")) %>%
  pivot_longer(-Ech, names_to = "Element", values_to = "Pourcentage") %>%
  mutate(Element = factor(Element, levels = plotlvls))
Finally we plot, and there are no NAs:
ggplot(data=plotdf,mapping=aes(x=Element,y=Pourcentage,colour=Ech))+
geom_point()+geom_line(aes(group=Ech)) +scale_y_log10()
1. Create reproducible minimal data
data <- data.frame(Element = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O",NA),
Pourcentage = 1:10,
Ech = c("AGN 1A", "SOL 16"))
2. Set factor levels for the variable 'Element'
data$Element <- factor(data$Element,levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
3. Remove rows containing NA in the variable 'Element'
data <- data[!is.na(data$Element), ]
4. Plot data using ggplot2 (ggplot2 syntax uses NSE (non-standard evaluation), which means you don't have to pass the variable names as strings or use the $ notation):
ggplot(data=data,aes(x=Element,y=Pourcentage,colour=Ech)) +
geom_point() +
geom_line(aes(group=Ech)) +
scale_y_log10()

Averaging the replicate data in omics / biostatistics

I have a dataframe for gene expression data. Samples are named as Genotype_Time_Replicate (e.g. AOX_1h_4).
E.g. data set
df <- structure(list(ID = c("AT5G54740.1", "AT5G55730.2", "AT5G57655.2", "AT5G64100.1", "AT5G64260.1", "AT5G67360.1", "AT1G30630.1", "AT1G62380.1", "AT1G70830.1", "AT3G14990.1", "AT4G18800.1", "AT4G24510.1", "AT5G15650.1", "AT5G19820.1", "AT5G59840.1", "AT5G47200.1", "AT1G12840.1", "AT1G76030.1", "AT1G78900.2", "AT3G42050.1", "AT4G11150.1", "AT1G11860.2", "AT1G17290.1" ),
Location = c("extracellular", "extracellular", "extracellular", "extracellular", "extracellular", "extracellular", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "ER", "ER", "ER", "mitochondrion", "mitochondrion", "mitochondrion", "mitochondrion", "mitochondrion"),
AOX_1h_1 = c(0.844651873, 0.50954096, 1.12e-08, 0.012981372, 0.978148381, 0.027579578, 0.068010151, 0.410629215, 0.253838635, 0.033631788, 0.335713512, 0.982799013, 0.025910457, 0.793810264, 0.762431665, 0.152154436, 0.027114103, 0.000227, 1.07e-05, 0.721209032, 0.086281162, 0.483130711, 0.014795515),
AOX_1h_2 = c(0.894623378, 0.011521413, 1.62e-06, 0.085249729, 0.02863972, 0.956962154, 0.225208718, 0.932679767, 0.002574192, 0.071700671, 0.233682544, 0.936572874, 1.12e-05, 0.241658735, 0.865205515, 0.000537, 0.103471292, 8.66e-07, 1.22e-08, 0.950878446, 0.145012176, 0.092919172, 0.599713247),
AOX_1h_3 = c(0.880951025, 0.00145276, 8.59e-10, 0.087023475, 0.675527672, 0.765543306, 0.305860948, 0.899172011, 0.020973476, 0.542988545, 0.735571562, 0.157569324, 0.025488075, 0.071006507, 0.262324019, 0.080470612, 0.0436526, 6.65e-09, 5.63e-10, 0.020557091, 0.069577215, 0.005502212, 0.852099232),
AOX_1h_4 = c(0.980823252, 0.158123518, 0.00210702, 0.006317657, 0.30496173, 0.489709702, 0.091469807, 0.958443361, 0.015583593, 0.566165972, 0.66746161, 0.935102341, 0.087733288, 0.744313619, 0.021169383, 0.633250945, 0.257489406, 0.024345088, 0.000355, 0.226279179, 0.004038493, 0.479275204, 0.703522761),
AOX_2h_1 = c(0.006474022, 0.246530998, 5.38e-06, 0.47169153, 0.305973663, 0.466202566, 0.191733645, 0.016121487, 0.234839116, 0.043866023, 0.089819656, 0.107934599, 2.09e-06, 0.413229678, 0.464078018, 0.004118766, 0.774970986, 3.79e-07, 2.3e-10, 0.428591262, 0.002326292, 0.385580707, 0.106216066),
AOX_2h_2 = c(0.166169729, 0.005721199, 7.77e-08, 0.099146712, 0.457164663, 0.481987525, 7.4e-05, 0.969805081, 0.100894997, 0.062103337, 0.095718425, 0.001686206, 0.009710516, 0.134651787, 0.887036569, 0.459218152, 0.074576369, 3.88e-09, 3.31e-15, 0.409645805, 0.064874307, 0.346371524, 0.449444779),
AOX_2h_3 = c(1.06e-05, 0.576589898, 4.03e-08, 0.787468189, 0.971119601, 0.432593753, 0.000274, 0.86932399, 0.08657663, 4.22e-06, 0.071190008, 0.697384316, 0.161623604, 0.422628778, 0.299545652, 0.767867006, 0.00295567, 0.078724176, 4.33e-09, 0.988576028, 0.080278831, 0.66505527, 0.014158693),
AOX_2h_4 = c(0.010356719, 0.026506539, 9.48e-09, 0.91009296, 0.302464488, 0.894377768, 0.742233323, 0.75032613, 0.175841127, 0.000721, 0.356904918, 0.461234653, 1.08e-05, 0.65800831, 0.360085919, 0.004814238, 0.174670947, 0.004246734, 7.31e-11, 0.778725214, 0.051334623, 0.10212841, 0.155831664 ),
AOX_6h_1 = c(0.271681878, 0.004822226, 1.87e-11, 0.616969208, 0.158860224, 0.684690326, 0.011798791, 0.564591916, 0.000314, 4.79e-06, 0.299871385, 0.001909713, 0.00682428, 0.039107415, 0.574143284, 0.061532691, 0.050483892, 2.28e-08, 1.92e-12, 0.058747794, 0.027147473, 0.196608218, 0.513693112),
AOX_6h_2 = c(5.72e-12, 0.719814288, 0.140016259, 0.927094438, 0.841229414, 0.224510089, 0.026567282, 0.242981965, 0.459311076, 0.038295888, 0.127935565, 0.453746728, 0.005023732, 0.554532387, 0.280899096, 0.336458018, 0.002024021, 0.793915731, 0.012838565, 0.873716549, 0.10097853, 0.237426815, 0.003711539),
AOX_6h_3 = c(3.16e-12, 0.780424491, 0.031315419, 0.363891436, 0.09562579, 0.104833988, 3.52e-05, 0.104196756, 0.870952423, 0.002036134, 0.016480622, 0.671475063, 2.3e-05, 0.00256744, 0.66263641, 0.005026601, 0.57280276, 0.058724117, 6.4e-10, 0.030965264, 0.005301006, 0.622027012, 0.371659724),
AOX_6h_4 = c(7.99e-10, 0.290847169, 0.001319424, 0.347344795, 0.743846306, 0.470908425, 0.00033, 0.016149973, 0.080036584, 0.020899676, 0.00723071, 0.187288769, 0.042514886, 0.00150443, 0.059344154, 0.06554177, 0.112601764, 0.000379, 2.36e-10, 0.78131093, 0.105861995, 0.174370801, 0.05570041 ),
WT_1h_1 = c(0.857, 0.809, 2.31e-05, 0.286, 0.87, 0.396, 0.539, 0.787, 0.73, 0.427, 0.764, 0.87, 0.386, 0.852, 0.848, 0.661, 0.393, 0.0415, 0.00611, 0.843, 0.576, 0.804, 0.304 ),
WT_1h_2 = c(0.898, 0.509, 0.0192, 0.729, 0.616, 0.902, 0.811, 0.9, 0.343, 0.712, 0.814, 0.901, 0.0446, 0.816, 0.896, 0.217, 0.747, 0.0143, 0.000964, 0.901, 0.776, 0.737, 0.876 ),
WT_1h_3 = c(0.939, 0.627, 0.0104, 0.867, 0.932, 0.935, 0.91, 0.939, 0.803, 0.926, 0.934, 0.888, 0.813, 0.859, 0.905, 0.864, 0.838, 0.0223, 0.00917, 0.802, 0.858, 0.724, 0.938 ),
WT_1h_4 = c(0.911, 0.782, 0.298, 0.396, 0.837, 0.871, 0.727, 0.91, 0.506, 0.88, 0.89, 0.909, 0.723, 0.896, 0.547, 0.887, 0.824, 0.566, 0.175, 0.814, 0.348, 0.869, 0.893),
WT_2h_1 = c(0.748, 0.911, 0.231, 0.929, 0.917, 0.928, 0.903, 0.801, 0.909, 0.849, 0.878, 0.884, 0.183, 0.925, 0.928, 0.719, 0.941, 0.108, 0.00817, 0.926, 0.678, 0.923, 0.884),
WT_2h_2 = c(0.935, 0.851, 0.163, 0.925, 0.951, 0.952, 0.63, 0.963, 0.926, 0.916, 0.925, 0.804, 0.868, 0.931, 0.961, 0.951, 0.92, 0.0706, 0.000265, 0.95, 0.917, 0.947, 0.951),
WT_2h_3 = c(0.0197, 0.894, 0.000613, 0.911, 0.922, 0.877, 0.122, 0.916, 0.739, 0.0125, 0.718, 0.905, 0.801, 0.875, 0.852, 0.91, 0.302, 0.729, 0.00015, 0.923, 0.731, 0.902, 0.504),
WT_2h_4 = c(0.696, 0.765, 0.0142, 0.931, 0.893, 0.931, 0.925, 0.925, 0.87, 0.45, 0.899, 0.908, 0.144, 0.921, 0.899, 0.631, 0.87, 0.62, 0.0014, 0.926, 0.807, 0.844, 0.865),
WT_6h_1 = c(0.898, 0.727, 0.00395, 0.921, 0.881, 0.924, 0.776, 0.919, 0.542, 0.234, 0.901, 0.67, 0.747, 0.83, 0.919, 0.848, 0.841, 0.056, 0.00144, 0.846, 0.815, 0.888, 0.916),
WT_6h_2 = c(2.38e-09, 0.88, 0.708, 0.898, 0.891, 0.768, 0.443, 0.777, 0.843, 0.505, 0.695, 0.842, 0.208, 0.859, 0.794, 0.813, 0.14, 0.887, 0.326, 0.894, 0.661, 0.775, 0.182),
WT_6h_3 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
WT_6h_4 = c(0.0357, 0.953, 0.792, 0.956, 0.967, 0.96, 0.711, 0.892, 0.931, 0.899, 0.866, 0.946, 0.917, 0.799, 0.925, 0.927, 0.938, 0.72, 0.025, 0.967, 0.936, 0.945, 0.923)),
class = "data.frame", row.names = c(NA, -23L))
I want to summarize the data for each organelle (averaged by organelle and over the samples' replicates) and plot the wild-type and mutant data side by side, with standard error, for each time point.
library(reshape2)   # for melt()
library(stringr)    # for str_replace_all()
library(ggpubr)     # for ggbarplot() and stat_compare_means()
melted <- melt(df)
head(melted)
melted$variable <- str_replace_all(melted$variable, '_[0-9]$', '')
melted$variable <- factor(melted$variable, levels = c("WT_1h","AOX_1h","WT_2h","AOX_2h","WT_6h","AOX_6h"))
my_comparisons <- list(c("WT_1h","AOX_1h"), c("WT_2h","AOX_2h"), c("WT_6h","AOX_6h"))
ggbarplot(melted, x = "variable", y = "value", add = "mean_se",
          color = "variable", palette = c("grey","black","grey","black","grey","black"),
          facet.by = "Location") +
  stat_compare_means(comparisons = my_comparisons, label = "p.signif")
How can I use the tidyverse (dplyr / tidyr) to follow this pathway instead of the above scripts?
You can use different functions to normalise this data. In this example I use gather() alongside stringr functions to split the sample name, which packs three pieces of information (genotype, time point, replicate) into one character column.
library(tidyverse)   # gather(), map_chr(), str_split(), group_by(), summarise()
# `dat` is the data frame from the question (called `df` there)
dat %>%
  gather(key, value, -ID, -Location) %>%
  mutate(type = map_chr(str_split(key, "_"), ~ .x[1]),
         hour = map_chr(str_split(key, "_"), ~ .x[2]),
         n = map_chr(str_split(key, "_"), ~ .x[3])) %>%
  group_by(type, hour) %>%
  summarise(mean = mean(value))
Gives
# A tibble: 6 x 3
# Groups: type [?]
type hour mean
<chr> <chr> <dbl>
1 AOX 1h 0.3235302
2 AOX 2h 0.2709910
3 AOX 6h 0.2226648
4 WT 1h 0.6633866
5 WT 2h 0.7263108
6 WT 6h 0.7915662
This you can use in ggplot() to make a nice barplot.
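For example, here is a minimal sketch (not part of the original answer) that also adds a standard error, since the question asked for mean and SE, and draws the barplot with plain ggplot2:
# summarise mean and standard error per genotype and time point (sketch)
summ <- dat %>%
  gather(key, value, -ID, -Location) %>%
  mutate(type = map_chr(str_split(key, "_"), ~ .x[1]),
         hour = map_chr(str_split(key, "_"), ~ .x[2])) %>%
  group_by(type, hour) %>%
  summarise(mean = mean(value), se = sd(value) / sqrt(n()))

ggplot(summ, aes(x = hour, y = mean, fill = type)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = position_dodge(width = 0.9), width = 0.2)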
To get it in a table you can use
dat %>%
  gather(key, value, -ID, -Location) %>%
  mutate(type = map_chr(str_split(key, "_"), ~ .x[1]),
         hour = map_chr(str_split(key, "_"), ~ .x[2]),
         n = map_chr(str_split(key, "_"), ~ .x[3])) %>%
  group_by(type, hour) %>%
  summarise(mean = mean(value)) %>%
  spread(type, mean)
to get
# A tibble: 3 x 3
hour AOX WT
* <chr> <dbl> <dbl>
1 1h 0.3235302 0.6633866
2 2h 0.2709910 0.7263108
3 6h 0.2226648 0.7915662
Another version going from the df object:
The df object is a list, and the expression values after cbind are of character type, so you can do:
tb <- as_tibble(do.call(cbind, df)) %>%
mutate_at(3:14, as.numeric)
NB: usually for gene expression data it is easier to read in count data using read_tsv or read.table and combine into a matrix, data.frame, or tibble.
NBB: the df object specified has no "WT" samples (from my copy/paste, anyway), so I renamed the last 4 samples in tb as "WT_1h" replicates:
colnames(tb)[11:14] <- paste0("WT_1h_",c(1:4))
Create means from replicates with a function:
rowMeanNrep <- function(tb, nm){
  varname <- paste0(nm, "_mean")
  selectn <- grep(nm, colnames(tb))
  tb %>%
    dplyr::mutate(!!varname := rowMeans(dplyr::select(., !!selectn)))
}
Specify which time points to use, and apply:
tps <- c("AOX_1h", "WT_1h")
tb_1h_mean <- cbind(tb_1h[,1:2],
do.call(cbind, lapply(tps, function(f){
rowMeanNrep(tb=tb, nm=f) %>%
dplyr::select(paste0(f, "_mean"))
}))
)
A final NB: think about using boxplots instead of barplots; see this paper.
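For example, a minimal boxplot sketch along those lines (assuming the melted data frame built in the question):
# boxplots show the spread of the replicates instead of just mean and SE
ggplot(melted, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ Location)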

Plot conditional density curve `P(Y|X)` along a linear regression line

This is my data frame, with two columns Y (response) and X (covariate):
## Editor edit: use `dat` not `data`
dat <- structure(list(Y = c(NA, -1.793, -0.642, 1.189, -0.823, -1.715,
1.623, 0.964, 0.395, -3.736, -0.47, 2.366, 0.634, -0.701, -1.692,
0.155, 2.502, -2.292, 1.967, -2.326, -1.476, 1.464, 1.45, -0.797,
1.27, 2.515, -0.765, 0.261, 0.423, 1.698, -2.734, 0.743, -2.39,
0.365, 2.981, -1.185, -0.57, 2.638, -1.046, 1.931, 4.583, -1.276,
1.075, 2.893, -1.602, 1.801, 2.405, -5.236, 2.214, 1.295, 1.438,
-0.638, 0.716, 1.004, -1.328, -1.759, -1.315, 1.053, 1.958, -2.034,
2.936, -0.078, -0.676, -2.312, -0.404, -4.091, -2.456, 0.984,
-1.648, 0.517, 0.545, -3.406, -2.077, 4.263, -0.352, -1.107,
-2.478, -0.718, 2.622, 1.611, -4.913, -2.117, -1.34, -4.006,
-1.668, -1.934, 0.972, 3.572, -3.332, 1.094, -0.273, 1.078, -0.587,
-1.25, -4.231, -0.439, 1.776, -2.077, 1.892, -1.069, 4.682, 1.665,
1.793, -2.133, 1.651, -0.065, 2.277, 0.792, -3.469, 1.48, 0.958,
-4.68, -2.909, 1.169, -0.941, -1.863, 1.814, -2.082, -3.087,
0.505, -0.013, -0.12, -0.082, -1.944, 1.094, -1.418, -1.273,
0.741, -1.001, -1.945, 1.026, 3.24, 0.131, -0.061, 0.086, 0.35,
0.22, -0.704, 0.466, 8.255, 2.302, 9.819, 5.162, 6.51, -0.275,
1.141, -0.56, -3.324, -8.456, -2.105, -0.666, 1.707, 1.886, -3.018,
0.441, 1.612, 0.774, 5.122, 0.362, -0.903, 5.21, -2.927, -4.572,
1.882, -2.5, -1.449, 2.627, -0.532, -2.279, -1.534, 1.459, -3.975,
1.328, 2.491, -2.221, 0.811, 4.423, -3.55, 2.592, 1.196, -1.529,
-1.222, -0.019, -1.62, 5.356, -1.885, 0.105, -1.366, -1.652,
0.233, 0.523, -1.416, 2.495, 4.35, -0.033, -2.468, 2.623, -0.039,
0.043, -2.015, -4.58, 0.793, -1.938, -1.105, 0.776, -1.953, 0.521,
-1.276, 0.666, -1.919, 1.268, 1.646, 2.413, 1.323, 2.135, 0.435,
3.747, -2.855, 4.021, -3.459, 0.705, -3.018, 0.779, 1.452, 1.523,
-1.938, 2.564, 2.108, 3.832, 1.77, -3.087, -1.902, 0.644, 8.507
), X = c(0.056, 0.053, 0.033, 0.053, 0.062, 0.09, 0.11, 0.124,
0.129, 0.129, 0.133, 0.155, 0.143, 0.155, 0.166, 0.151, 0.144,
0.168, 0.171, 0.162, 0.168, 0.169, 0.117, 0.105, 0.075, 0.057,
0.031, 0.038, 0.034, -0.016, -0.001, -0.031, -0.001, -0.004,
-0.056, -0.016, 0.007, 0.015, -0.016, -0.016, -0.053, -0.059,
-0.054, -0.048, -0.051, -0.052, -0.072, -0.063, 0.02, 0.034,
0.043, 0.084, 0.092, 0.111, 0.131, 0.102, 0.167, 0.162, 0.167,
0.187, 0.165, 0.179, 0.177, 0.192, 0.191, 0.183, 0.179, 0.176,
0.19, 0.188, 0.215, 0.221, 0.203, 0.2, 0.191, 0.188, 0.19, 0.228,
0.195, 0.204, 0.221, 0.218, 0.224, 0.233, 0.23, 0.258, 0.268,
0.291, 0.275, 0.27, 0.276, 0.276, 0.248, 0.228, 0.223, 0.218,
0.169, 0.188, 0.159, 0.156, 0.15, 0.117, 0.088, 0.068, 0.057,
0.035, 0.021, 0.014, -0.005, -0.014, -0.029, -0.043, -0.046,
-0.068, -0.073, -0.042, -0.04, -0.027, -0.018, -0.021, 0.002,
0.002, 0.006, 0.015, 0.022, 0.039, 0.044, 0.055, 0.064, 0.096,
0.093, 0.089, 0.173, 0.203, 0.216, 0.208, 0.225, 0.245, 0.23,
0.218, -0.267, 0.193, -0.013, 0.087, 0.04, 0.012, -0.008, 0.004,
0.01, 0.002, 0.008, 0.006, 0.013, 0.018, 0.019, 0.018, 0.021,
0.024, 0.017, 0.015, -0.005, 0.002, 0.014, 0.021, 0.022, 0.022,
0.02, 0.025, 0.021, 0.027, 0.034, 0.041, 0.04, 0.038, 0.033,
0.034, 0.031, 0.029, 0.029, 0.029, 0.022, 0.021, 0.019, 0.021,
0.016, 0.007, 0.002, 0.011, 0.01, 0.01, 0.003, 0.009, 0.015,
0.018, 0.017, 0.021, 0.021, 0.021, 0.022, 0.023, 0.025, 0.022,
0.022, 0.019, 0.02, 0.023, 0.022, 0.024, 0.022, 0.025, 0.025,
0.022, 0.027, 0.024, 0.016, 0.024, 0.018, 0.024, 0.021, 0.021,
0.021, 0.021, 0.022, 0.016, 0.015, 0.017, -0.017, -0.009, -0.003,
-0.012, -0.009, -0.008, -0.024, -0.023)), .Names = c("Y", "X"
), row.names = c(NA, -234L), class = "data.frame")
With this I run an OLS regression: lm(dat[,1] ~ dat[,2]).
At a set of values, X = quantile(dat[,2], c(0.1, 0.5, 0.7)), I would like to plot a graph similar to the following, with the conditional density P(Y|X) displayed along the regression line.
How can I do this in R? Is it even possible?
I call your dataset dat. Don't use data as it masks R function data.
dat <- na.omit(dat) ## retain only complete cases
## use proper formula rather than `$` or `[,]`;
## otherwise you get trouble in prediction with `predict.lm`
fit <- lm(Y ~ X, dat)
## prediction point, as given in your question
xp <- quantile(dat$X, probs = c(0.1, 0.5, 0.7), names = FALSE)
## make prediction and only keep `$fit` and `$se.fit`
pred <- predict.lm(fit, newdata = data.frame(X = xp), se.fit = TRUE)[1:2]
#$fit
# 1 2 3
#0.20456154 0.14319857 0.00678734
#
#$se.fit
# 1 2 3
#0.2205000 0.1789353 0.1819308
To understand the theory behind the following, read Plotting conditional density of prediction after linear regression. Now I am going to use the mapply function to apply the same computation to multiple points:
## a function to make 101 sample points from conditional density
f <- function (mu, sig) {
  x <- seq(mu - 3.2 * sig, mu + 3.2 * sig, length = 101)
  dx <- dnorm(x, mu, sig)
  cbind(x, dx)
}
## apply `f` to all `xp`
lst <- mapply(f, pred[[1]], pred[[2]], SIMPLIFY = FALSE)
## To plot rotated density curve, we basically want to plot `(dx, x)`
## but scaling `(alpha * dx, x)` is needed for good scaling with regression line
## Also to plot rotated density along the regression line,
## a shift is needed: `(alpha * dx + xp, x)`
## The following function adds rotated, scaled density to a regression line
## a "for-loop" is used for readability, with no loss of efficiency.
## (make sure there is an existing plot; otherwise you get `plot.new` error!!)
addrsd <- function (xp, lst, alpha = 1) {
  for (i in 1:length(xp)) {
    x0 <- xp[i]; mat <- lst[[i]]
    dx. <- alpha * mat[, 2] + x0  ## rescale and shift
    x. <- mat[, 1]
    lines(dx., x., col = "gray")  ## rotate and plot
    segments(x0, x.[1], x0, x.[101], col = "gray")  ## a local axis
  }
}
Now let's see the picture:
## This is one simple way to draw the regression line
## A better way is to generate and grid and predict on the grid
## In later example I will show this
plot(dat$X, fit$fitted, type = "l", ylim = c(-0.6, 1))
## we try `alpha = 0.01`;
## you can also try `alpha = 1` in raw scale to see what it looks like
addrsd(xp, lst, 0.01)
Note that we have only scaled the height of the density, not its span. The span roughly corresponds to a confidence band and should not be scaled. Consider further overlaying a confidence band on the plot. If the use of matplot is not clear, read How do I change colours of confidence interval lines when using matlines for prediction plot?
## A grid is necessary for nice regression plot
X.grid <- seq(min(dat$X), max(dat$X), length = 101)
## 95%-CI based on t-statistic
CI <- predict.lm(fit, newdata = data.frame(X = X.grid), interval = "confidence")
## use `matplot`
matplot(X.grid, CI, type = "l", col = c(1, 2, 2), lty = c(1, 2, 2))
## add rotated, scaled conditional density
addrsd(xp, lst, 0.01)
You see that the span of the density curve agrees with the confidence ribbon.
