Concatenating two cells in R with a ± sign - r

I want to calculate the mean and standard deviation from pairwise mirrored matrices in a list and write a table for further text processing:
mean_SG<- as.data.frame(lapply(list_SG, function(x) mean(x[upper.tri(x)])))
sd_SG <- as.data.frame(lapply(list_SG, function(x) sd(x[upper.tri(x)])))
write.table(t(rbind(round(mean_SG,3),round(sd_SG,3))), "SG.txt")
My idea is to directly concatenate the numeric values from mean_SG and sd_SG with the plus-minus symbol ± and write this in a single column with write.table. Is that possible in R?
Here is some data:
SG <- structure(c(85, 84.016, 82.9, 79, 85.167, 83.467, 78.5, 83.051,
80.064, 81.436, 79.94, 83.731, 83.468, 82.775, 83.294, 81.608,
82.176, 84.138, 82.6, 85.325, 82.297, 81.546, 83.569, 84.561,
87.039, 92.45, 86.35, 83.153, 84.447, 81.899, 81.972, 81.32,
81.949, 82.101, 0.656, 0.966, 1.833, NA, 0.643, 0.459, 0.608,
1.189, 1.024, 0.848, 1.207, 0.66, 0.757, 1.235, 0.872, 1.308,
0.958, 1.151, 0.914, 1.302, 0.708, 0.79, 1.349, 0.799, 1.297,
2.554, 0.55, 1.041, 1.216, 1.065, 0.981, 0.937, 1.133, 1.302), .Dim = c(34L,
2L), .Dimnames = list(c("X19_vs_11B.2", "X19_vs_AT.s3.28", "X19_vs_B276.D12",
"X19_vs_BP.U1C.1g10", "X19_vs_d142", "X19_vs_FFCH5909", "X19_vs_GBS.L1.B05",
"X19_vs_SG01", "X19_vs_SG02", "X19_vs_SG03", "X19_vs_SG04", "X19_vs_SG05",
"X19_vs_SG06", "X19_vs_SG07a", "X19_vs_SG07b", "X19_vs_SG08.Aca",
"X19_vs_SG08.Holo", "X19_vs_SG09", "X19_vs_SG10", "X19_vs_SG11",
"X19_vs_SG12", "X19_vs_SG13", "X19_vs_SG15", "X19_vs_SG17", "X19_vs_SG18",
"X19_vs_SG19", "X19_vs_SG20", "X19_vs_SG21", "X19_vs_SG22", "X19_vs_SG23",
"X19_vs_SG25", "X19_vs_SG26", "X19_vs_ThAna", "X19_vs_TPD.58"
), c("1", "2"))

I like sprintf for this. It allows you to specify the number of digits.
sprintf("%.3f \U00B1 %.3f", SG[,1], SG[,2])
#[1] "85.000 ± 0.656" "84.016 ± 0.966" "82.900 ± 1.833" "79.000 ± NA" "85.167 ± 0.643" "83.467 ± 0.459" ...
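To get this into a single column with write.table, the formatted strings can go into a one-column data frame first. This is a minimal sketch on a two-row stand-in for SG; the column name mean_sd and the temporary output path are assumptions for illustration, not part of the original code.

```r
# Two-row stand-in for the SG matrix from the question
SG <- matrix(c(85, 84.016, 0.656, 0.966), ncol = 2,
             dimnames = list(c("X19_vs_11B.2", "X19_vs_AT.s3.28"), c("1", "2")))

# One character column holding "mean ± sd" (mean_sd is a made-up column name)
out <- data.frame(mean_sd = sprintf("%.3f \u00B1 %.3f", SG[, 1], SG[, 2]),
                  row.names = rownames(SG), stringsAsFactors = FALSE)

# quote = FALSE leaves the text unquoted; a tab separator keeps the
# spaces around the sign inside a single field
out_file <- tempfile(fileext = ".txt")  # hypothetical path; use "SG.txt" in practice
write.table(out, out_file, quote = FALSE, sep = "\t", fileEncoding = "UTF-8")
```

Whether the ± character displays correctly afterwards depends on the encoding the reading program expects, so fileEncoding may need adjusting.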


How to implement k-fold cross-validation while forcing linear regression of predicted to real values to 1:1 line

I'm trying to train y as a polynomial function of x so that when the predicted y values are linearly regressed against the real y values, the relationship is on the 1:1 line (diagram - The image on the right uses geom_smooth(method="lm") for demonstration, but with SMA from the lmodel2() function, the regression line is 1:1). I'm kind of a stats amateur so I'm aware there might be problems with this, but without forcing the model tends to overestimate low values and underestimate high values. My question is: How do I introduce k-fold cross-validation using an existing package like caret or cvms? It seems like they need a model object to be returned and I can't figure out how to code my problem like that. Is there some way I can train the model by minimizing my custom metric and still return a model object with ypred and use it in k-fold CV?
This is my code for calculating the coefficients without k-fold CV:
data <- data.frame(
x = c(1.514, 1.514, 1.825, 1.281, 1.118, 1.279, 1.835, 1.819, 0.462, 1.53, 1.004, 1.19, 1.275, 0.428, 0.313, 0.909, 0.995, 0.995, 0.706, 0.563, 0.827, 0.65, 0.747, 1.013, 1.013, 1.163, 1.091, 1.163, 1.091, 0.955, 0.955, 2.044, 2.044, 1.777, 1.777, 1.434, 1.393, 1.324, 0.981, 0.845, 1.595, 1.595, 1.517, 1.517, 1.403, 1.403, 0.793, 0.793, 1.016, 0.901, 0.847, 1.054, 0.877, 1.639, 1.639, 1.268, 1.268, 0.842, 0.842, 0.827, 0.777, 1.024, 1.238, 1.238, 1.702, 1.702, 0.673, 0.673, 1.256, 1.256, 0.898, 0.898, 0.66, 0.933, 0.827, 0.836, 1.122, 1.5, 1.5, 1.44, 1.44, 0.671, 0.671, 0.486, 0.486, 1.051, 1.051, 0.971, 0.538, 0.971, 0.538, 1.012, 1.012, 0.776, 0.776, 0.854, 0.854, 0.74, 0.989, 0.989),
y = c(0.19, 0.18, 0.816, 2.568, 0.885, 0.521, 0.268, 0.885, 4.781, 1.648, 0.989, 1.614, 1.492, 0.679, 2.256, 3.17, 1.926, 1.631, 0.462, 2.48, 0.658, 0.355, 0.373, 2.31, 3.263, 1.374, 1.374, 2.637, 2.637, 2.073, 2.298, 0.257, 0.292, 0.359, 0.329, 1.329, 1.272, 3.752, 1.784, 0.76, 0.458, 0.488, 0.387, 0.387, 3.401, 1.458, 8.945, 9.12, 0.308, 0.386, 0.405, 6.444, 3.17, 0.458, 0.47, 0.572, 0.589, 1.961, 1.909, 0.636, 0.32, 1.664, 0.756, 0.851, 0.403, 0.232, 23.112, 22.042, 0.745, 0.477, 2.349, 3.01, 0.39, 0.246, 0.43, 1.407, 1.358, 0.235, 0.215, 0.595, 0.685, 2.539, 2.128, 8.097, 5.372, 0.644, 0.626, 17.715, 17.715, 6.851, 6.851, 2.146, 1.842, 3.147, 2.95, 1.127, 1.019, 8.954, 0.796, 0.758),
stringsAsFactors = FALSE)
optim_results <- optim(par = c(a0 = 0.3, a1 = -3.8, a2 = -1, a3 = 1, a4 = 1),
fn = function (params, x, y) {
params <- as.list(params)
ypred <- with(params, (a0 + (a1*x) + (a2*x^2) + (a3*x^3) + (a4*x^4)))
mod <- suppressMessages(lmodel2::lmodel2(ypred ~ y))$regression.results[3,]
line <- mod$Slope * y + mod$Intercept
return(sum((y - line)^2))},
x = log10(data$x),
y = log10(data$y))
cf <- as.numeric(optim_results$par)
data <- data %>% dplyr::mutate(ypred = 10^(cf[1] + cf[2]*log10(x) + cf[3]*log10(x)^2 + cf[4]*log10(x)^3 + cf[5]*log10(x)^4))
str(data)
Great question!
cvms::cross_validate_fn() allows you to cross-validate custom functions. You just have to wrap your code in a model function and a predict function, like so:
EDIT: Added extraction of model parameters from the optim() output. optim() returns a list, which we convert to a class and then tell coef() how to extract the coefficients for that class.
library(dplyr)
library(groupdata2)
library(cvms)
# Set seed for reproducibility
set.seed(2)
data <- data.frame(
x = c(1.514, 1.514, 1.825, 1.281, 1.118, 1.279, 1.835, 1.819, 0.462, 1.53, 1.004, 1.19, 1.275, 0.428, 0.313, 0.909, 0.995, 0.995, 0.706, 0.563, 0.827, 0.65, 0.747, 1.013, 1.013, 1.163, 1.091, 1.163, 1.091, 0.955, 0.955, 2.044, 2.044, 1.777, 1.777, 1.434, 1.393, 1.324, 0.981, 0.845, 1.595, 1.595, 1.517, 1.517, 1.403, 1.403, 0.793, 0.793, 1.016, 0.901, 0.847, 1.054, 0.877, 1.639, 1.639, 1.268, 1.268, 0.842, 0.842, 0.827, 0.777, 1.024, 1.238, 1.238, 1.702, 1.702, 0.673, 0.673, 1.256, 1.256, 0.898, 0.898, 0.66, 0.933, 0.827, 0.836, 1.122, 1.5, 1.5, 1.44, 1.44, 0.671, 0.671, 0.486, 0.486, 1.051, 1.051, 0.971, 0.538, 0.971, 0.538, 1.012, 1.012, 0.776, 0.776, 0.854, 0.854, 0.74, 0.989, 0.989),
y = c(0.19, 0.18, 0.816, 2.568, 0.885, 0.521, 0.268, 0.885, 4.781, 1.648, 0.989, 1.614, 1.492, 0.679, 2.256, 3.17, 1.926, 1.631, 0.462, 2.48, 0.658, 0.355, 0.373, 2.31, 3.263, 1.374, 1.374, 2.637, 2.637, 2.073, 2.298, 0.257, 0.292, 0.359, 0.329, 1.329, 1.272, 3.752, 1.784, 0.76, 0.458, 0.488, 0.387, 0.387, 3.401, 1.458, 8.945, 9.12, 0.308, 0.386, 0.405, 6.444, 3.17, 0.458, 0.47, 0.572, 0.589, 1.961, 1.909, 0.636, 0.32, 1.664, 0.756, 0.851, 0.403, 0.232, 23.112, 22.042, 0.745, 0.477, 2.349, 3.01, 0.39, 0.246, 0.43, 1.407, 1.358, 0.235, 0.215, 0.595, 0.685, 2.539, 2.128, 8.097, 5.372, 0.644, 0.626, 17.715, 17.715, 6.851, 6.851, 2.146, 1.842, 3.147, 2.95, 1.127, 1.019, 8.954, 0.796, 0.758),
stringsAsFactors = FALSE)
# Fold data
# Will do 10-fold repeated cross-validation (10 reps)
data <- fold(
data = data,
k = 10, # Num folds
num_fold_cols = 10 # Num repetitions
)
# Write a model function from your code
# This ignores the formula and hyperparameters but
# you could pass values through those if you wanted
# to try different formulas or hyperparameter values
model_fn <- function(train_data, formula, hyperparameters){
out <- optim(par = c(a0 = 0.3, a1 = -3.8, a2 = -1, a3 = 1, a4 = 1),
fn = function (params, x, y) {
params <- as.list(params)
ypred <- with(params, (a0 + (a1*x) + (a2*x^2) + (a3*x^3) + (a4*x^4)))
mod <- suppressMessages(lmodel2::lmodel2(ypred ~ y))$regression.results[3,]
line <- mod$Slope * y + mod$Intercept
return(sum((y - line)^2))},
x = log10(train_data$x),
y = log10(train_data$y))
# Convert output to an S3 class
# so we can extract parameters with coef()
class(out) <- "OptimModel"
out
}
# Tell coef() how to extract the parameters
# This can be modified if you need more info from the optim() output
# Just return a named list
# (the ... argument keeps the signature consistent with the coef() generic)
coef.OptimModel <- function(object, ...) {
object$par
}
# Write a predict function from your code
predict_fn <- function(test_data, model, formula, hyperparameters, train_data){
cf <- as.numeric(model$par)
test_data %>%
dplyr::mutate(
ypred = 10^(cf[1] + cf[2]*log10(x) + cf[3]*log10(x)^2 + cf[4]*log10(x)^3 + cf[5]*log10(x)^4)
) %>%
.[["ypred"]]
}
# Cross-validate the model
cv <- cross_validate_fn(
data = data,
model_fn = model_fn,
predict_fn = predict_fn,
formulas = c("y ~ x"), # Not currently used by the model function
fold_cols = paste0('.folds_', seq_len(10)),
type = 'gaussian'
)
#> Will cross-validate 1 models. This requires fitting 100 model instances.
# Check output
cv
# A tibble: 1 × 17
Fixed RMSE MAE NRMSE(I…¹ RRSE RAE RMSLE Predic…² Results Coeffi…³ Folds
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list> <int>
1 x 4.00 2.31 2.66 1.47 1.17 0.662 <tibble> <tibble> <tibble> 100
# … with 6 more variables: `Fold Columns` <int>, `Convergence Warnings` <int>,
# `Other Warnings` <int>, `Warnings and Messages` <list>, Process <list>,
# Dependent <chr>, and abbreviated variable names ¹​`NRMSE(IQR)`,
# ²​Predictions, ³​Coefficients
# ℹ Use `colnames()` to see all variable names
Created on 2022-10-15 with reprex v2.0.2
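The S3 step from the EDIT note can be tried in isolation. The sketch below is only an illustration: it replaces the SMA-based loss with a plain least-squares loss on made-up data so that it runs without lmodel2 or cvms, and toy_x and toy_y are hypothetical names.

```r
# Made-up data on an exact line y = 1 + 2x
toy_x <- 1:10
toy_y <- 1 + 2 * toy_x

# Fit intercept a and slope b with optim(); least squares stands in
# for the custom SMA-based metric used above
fit <- optim(par = c(a = 0, b = 0),
             fn = function(p, x, y) sum((y - (p[["a"]] + p[["b"]] * x))^2),
             x = toy_x, y = toy_y)

# optim() returns a plain list; giving it a class lets coef() dispatch on it
class(fit) <- "OptimModel"
coef.OptimModel <- function(object, ...) object$par

coef(fit)  # named vector with elements a and b
```

Any other element of the optim() output (for example value or convergence) could be returned from the same method if it is needed downstream.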

Colour a Q-Q plot comparing two distributions by quartiles in R

I am trying to construct a Q-Q plot comparing two distributions, with the 99th percentile colored like the following example:
However I am not sure how to achieve this, here is a subset of my data:
dfw <- structure(list(Date.Time = structure(c(848502000, 848509200,
848512800, 848520000, 848523600, 848530800, 848534400, 848541600,
848545200, 848552400, 848556000, 848563200, 848566800, 848574000,
848577600, 848588400, 848595600, 848599200, 848606400, 848610000,
848617200, 848620800, 848628000, 848631600, 848638800, 848642400,
848649600, 848653200, 848660400, 848664000, 848674800, 848682000,
848685600, 848692800, 848696400, 848703600, 848707200, 848714400,
848718000, 848725200, 848728800, 848736000, 848739600, 848746800,
848750400, 848761200, 848768400, 848772000, 848779200, 848782800,
848790000, 848793600, 848800800, 848804400, 848811600, 848815200,
848822400, 848826000, 848833200, 848847600, 848854800, 848858400,
848865600, 848869200, 848876400, 848880000, 848887200, 848890800,
848898000, 848901600, 848908800, 848912400, 848919600, 848923200,
848934000, 848941200, 848944800, 848952000, 848955600, 848962800,
848966400, 848973600, 848977200, 848984400, 848988000, 848995200,
848998800, 849006000, 849009600, 853682400, 853686000, 853714800,
853718400, 853725600, 853729200, 853736400, 853750800, 853758000,
853761600, 853768800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Hs.Mod = c(1.5960001, 1.5600001, 1.5480001, 1.552, 1.552,
1.534, 1.5180001, 1.462, 1.4260001, 1.3740001, 1.36, 1.3340001,
1.3080001, 1.2720001, 1.256, 1.218, 1.212, 1.21, 1.2060001,
1.2160001, 1.248, 1.25, 1.25, 1.264, 1.3560001, 1.394, 1.3700001,
1.332, 1.2900001, 1.2800001, 1.268, 1.2800001, 1.2800001,
1.3240001, 1.3240001, 1.286, 1.2540001, 1.19, 1.172, 1.1700001,
1.1860001, 1.2340001, 1.274, 1.3640001, 1.4120001, 1.58,
1.6580001, 1.6660001, 1.682, 1.748, 1.9280001, 1.9800001,
2.026, 2.052, 2.214, 2.328, 2.4320002, 2.39, 2.2180002, 1.9080001,
1.792, 1.7400001, 1.7140001, 1.7140001, 1.692, 1.6680001,
1.608, 1.58, 1.536, 1.524, 1.5640001, 1.5760001, 1.6100001,
1.6240001, 1.6120001, 1.5920001, 1.58, 1.542, 1.5200001,
1.48, 1.4640001, 1.4260001, 1.406, 1.386, 1.3820001, 1.34,
1.312, 1.268, 1.248, 1.8080001, 1.7960001, 1.644, 1.6420001,
1.6600001, 1.6880001, 1.7820001, 2.138, 2.2740002, 2.2940001,
2.252), Hs.Obs = c(1.741, 1.524, 1.618, 1.658, 1.697, 1.822,
1.792, 1.463, 1.433, 1.376, 1.208, 1.299, 1.255, 1.304, 1.328,
1.182, 1.282, 1.293, 1.228, 1.281, 1.45, 1.356, 1.501, 1.5,
1.356, 1.477, 1.408, 1.544, 1.497, 1.768, 2.04, 2.074, 2.042,
2.147, 2.224, 2.022, 2.017, 2.047, 2.353, 2.597, 2.838, 2.67,
2.762, 2.687, 2.734, 2.738, 2.938, 2.795, 2.549, 2.669, 2.447,
2.676, 2.577, 2.383, 2.362, 2.284, 2.341, 2.33, 2.397, 2.498,
2.317, 2.373, 2.377, 2.362, 2.218, 2.226, 1.97, 2.087, 1.874,
2.116, 2.022, 1.886, 2.046, 1.879, 1.638, 1.677, 1.638, 1.647,
1.551, 1.596, 1.591, 1.384, 1.345, 1.522, 1.469, 1.503, 1.459,
1.327, 1.453, 2.448, 2.235, 2.104, 1.958, 2.118, 2.209, 2.034,
2.229, 2.505, 2.163, 2.372)), row.names = c(NA, 100L), class = "data.frame")
Code to make the Q-Q plot:
ggplot(data=dfw, aes(x=sort(Hs.Obs), y=sort(Hs.Mod))) + geom_point(shape = 1, size =2) + xlab('Obs') + ylab('Model')+
theme_bw()+
geom_abline(linetype=2)
Code attempting to colour the 99th percentile:
ggplot(data=dfw, aes(x=sort(Hs.Obs), y=sort(Hs.Mod), col=cut(Hs.Mod,quantile(Hs.Mod, probs = .99)))) +
geom_point(shape = 1, size =2) + xlab('Obs') + ylab('Model')+
theme_bw()+
geom_abline(linetype=2)
Resulting plot:
I'm looking for some help to sort this out, as the attempts I have tried aren't working.
Thanks in advance!
Try this:
dfw %>%
mutate(qq = quantile(Hs.Mod, probs = c(0.99)),
qq_gt99 = ifelse(qq<= Hs.Mod, 1, 0)) %>%
ggplot(aes(x=Hs.Mod, y = Hs.Obs, col= as.factor(qq_gt99))) + geom_point()
Or, if you order the observations first:
dfw %>%
mutate(mod_ordered = sort(Hs.Mod),
obs_ordered = sort(Hs.Obs),
qq = quantile(mod_ordered, probs = c(0.99)),
qq_gt99 = ifelse(qq<= mod_ordered, 1, 0)) %>%
ggplot(aes(x=mod_ordered, y = obs_ordered, col=
as.factor(qq_gt99))) + geom_point()
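The same idea in base R, as a rough sketch: sort both variables to get the Q-Q pairs, then flag the pairs at or above the 99th percentile of the model values. The obs and mod vectors are simulated stand-ins for Hs.Obs and Hs.Mod, not the real data.

```r
set.seed(1)
obs <- rgamma(100, shape = 4, rate = 2)    # made-up stand-in for Hs.Obs
mod <- rgamma(100, shape = 4, rate = 2.2)  # made-up stand-in for Hs.Mod

# Q-Q pairs: the i-th smallest observation against the i-th smallest model value
qq <- data.frame(obs = sort(obs), mod = sort(mod))

# Flag pairs whose model value reaches the 99th percentile
qq$above_p99 <- qq$mod >= quantile(qq$mod, probs = 0.99)

plot(qq$obs, qq$mod,
     col = ifelse(qq$above_p99, "red", "black"),
     xlab = "Obs", ylab = "Model")
abline(0, 1, lty = 2)  # 1:1 reference line
```

With only 100 points a single pair ends up above the 99th percentile, so a lower cut-off such as 0.9 may be easier to see on a small subset.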
I'm using the dplyr library to filter() the data by quantile():
The code:
library(dplyr)
library(ggplot2)
ggplot()+
geom_point(data=filter(dfw,Hs.Obs>quantile(dfw$Hs.Mod,.99)),aes(x=sort(Hs.Obs),y=sort(Hs.Mod), col="Quantile 99%"))+
geom_point(data=filter(dfw,Hs.Obs<quantile(dfw$Hs.Mod,.99)),aes(x=sort(Hs.Obs),y=sort(Hs.Mod), col="Quantile 1-98%"))+
geom_abline(linetype=2)+xlab('Obs') + ylab('Model')+ theme_bw()

Removing NAs from ggplot x-axis in ggplot2

I would like to get rid of the whole NA block (highlighted here).
I tried na.omit and na.rm = TRUE unsuccessfully.
Here is the code I used:
library(readxl)
data <- read_excel("Documents/TFB/xlsx_geochimie/solfatara_maj.xlsx")
View(data)
data <- gather(data,FeO:`Fe2O3(T)`,key = "Element",value="Pourcentage")
library(ggplot2)
level_order <- factor(data$Element,levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
ggplot(data=data,mapping=aes(x=level_order,y=data$Pourcentage,colour=data$Ech))+geom_point()+geom_line(group=data$Ech) +scale_y_log10()
And here is my original file
https://drive.google.com/file/d/1bZi7fPWebbpodD1LFScoEcWt5Bs-cqhb/view?usp=sharing
If I run your code and look at data that goes into ggplot:
table(data$Element)
Al2O3 CaO Fe2O3 Fe2O3(T) FeO K2O LOI LOI2 MgO MnO
12 12 12 12 12 12 12 12 12 12
Na2O P2O5 SiO2 SO4 TiO2 Total Total 2 Total N Total S
12 12 12 12 12 12 12 12 12
The melted data frame includes columns such as Total, LOI and SO4, which I guess is not intended. When you call factor() on Element, these names are not among the levels you supply, so they become NA.
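A tiny base R illustration of that behaviour, using a made-up vector of element names rather than the real file:

```r
# Made-up element names; "Total" and "LOI" are not in the level set
elements <- c("SiO2", "FeO", "Total", "LOI")

# Values missing from `levels` are silently turned into NA
f <- factor(elements, levels = c("SiO2", "FeO"))
is.na(f)

# Dropping those entries before plotting removes the NA category
f_clean <- f[!is.na(f)]
```

ggplot2 then draws NA as its own category on a discrete axis, which is the block seen in the plot.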
So we can do it from scratch:
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
data <- read_excel("solfatara_maj.xlsx")
The data:
structure(list(Ech = c("AGN 1A", "AGN 2A", "AGN 3B", "SOL 4B",
"SOL 8Ag", "SOL 8Ab", "SOL 16A", "SOL 16B", "SOL 16C", "SOL 22 A",
"SOL 22D", "SOL 25B"), FeO = c(0.2, 0.8, 1.7, 0.3, 1.7, NA, 0.2,
NA, 0.1, 0.7, 1.3, 2), `Total S` = c(5.96, 45.3, 0.22, 17.3,
NA, NA, NA, NA, NA, NA, 2.37, 0.36), SO4 = c(NA, 6.72, NA, 4.08,
0.06, 0.16, 42.2, 35.2, 37.8, 0.32, 6.57, NA), `Total N` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 15.2, NA, NA), SiO2 = c(50.2,
31.05, 56.47, 62.14, 61.36, 75.66, 8.41, 21.74, 17.44, 13.52,
19.62, 56.35), Al2O3 = c(15.53, 7.7, 17.56, 4.44, 17.75, 10.92,
31.92, 26.38, 27.66, 0.64, 3.85, 17.28), Fe2O3 = c(0.49, 0.63,
2.06, NA, 1.76, 0.11, 0.64, 0.88, 1.71, NA, 1.32, 2.67), MnO = c(0.01,
0.01, 0.13, 0.01, 0.09, 0.01, 0.01, 0.01, 0.01, 0.005, 0.04,
0.12), MgO = c(0.06, 0.07, 0.88, 0.03, 0.97, 0.05, 0.04, 0.07,
0.03, 0.02, 1.85, 1.63), CaO = c(0.2, 0.09, 3.34, 0.09, 2.58,
0.57, 0.2, 0.26, 0.15, 0.06, 35.66, 4.79), Na2O = c(0.15, 0.14,
3.23, 0.13, 3.18, 2.04, 0.68, 0.68, 0.55, 0.05, 0.45, 3.11),
K2O = c(4.39, 1.98, 8, 1.26, 8.59, 5.94, 8.2, 6.97, 8.04,
0.2, 0.89, 7.65), TiO2 = c(0.42, 0.27, 0.46, 0.79, 0.55,
0.16, 0.09, 0.22, 0.16, 0.222, 0.34, 0.53), P2O5 = c(0.11,
0.09, 0.18, 0.08, 0.07, 0.07, 0.85, 0.68, 0.62, NA, 0.14,
0.28), LOI = c(27.77, 57.06, 6.13, 29.03, 1.38, 4.92, 42.58,
37.58, 38.76, NA, 26.99, 3.92), LOI2 = c(27.79, 57.15, 6.32,
29.06, 1.57, 4.93, 42.6, 37.59, 38.77, 0.08, 27.13, 4.15),
Total = c(99.52, 99.88, 100.2, 98.25, 99.99, 100.5, 93.81,
95.57, 95.23, 15.25, 92.45, 100.3), `Total 2` = c(99.54,
99.96, 100.3, 98.28, 100.2, 100.6, 93.83, 95.58, 95.24, 15.33,
92.59, 100.6), `Fe2O3(T)` = c(0.71, 1.52, 3.95, 0.27, 3.65,
0.22, 0.87, 0.99, 1.82, 0.61, 2.76, 4.9)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
First we set the plotting level like you did:
plotlvls = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O")
Then we select only these columns plus Ech. Note that I use pivot_longer() because gather() is superseded, and we convert Element to a factor in the same step:
plotdf = data %>% select(c(plotlvls,"Ech")) %>%
pivot_longer(-Ech,names_to = "Element",values_to = "Pourcentage") %>%
mutate(Element=factor(Element,levels=plotlvls))
Finally we plot, and there are no NAs:
ggplot(data=plotdf,mapping=aes(x=Element,y=Pourcentage,colour=Ech))+
geom_point()+geom_line(aes(group=Ech)) +scale_y_log10()
1. Create reproducible minimal data
data <- data.frame(Element = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O",NA),
Pourcentage = 1:10,
Ech = c("AGN 1A", "SOL 16"))
2. Set factor levels for variable 'Element'
data$Element <- factor(data$Element,levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
3. Remove rows containing NA in the variable 'Element'
data <- data[!is.na(data$Element), ]
4. Plot data using ggplot2 (ggplot2 syntax uses NSE (non-standard evaluation), which means you don't have to pass the variable names as strings or use the $ notation):
ggplot(data=data,aes(x=Element,y=Pourcentage,colour=Ech)) +
geom_point() +
geom_line(aes(group=Ech)) +
scale_y_log10()

Averaging the replicate data in omics / biostatistics

I have a dataframe for gene expression data. Samples are named as Genotype_Time_Replicate (e.g. AOX_1h_4).
E.g. data set
df <- structure(list(ID = c("AT5G54740.1", "AT5G55730.2", "AT5G57655.2", "AT5G64100.1", "AT5G64260.1", "AT5G67360.1", "AT1G30630.1", "AT1G62380.1", "AT1G70830.1", "AT3G14990.1", "AT4G18800.1", "AT4G24510.1", "AT5G15650.1", "AT5G19820.1", "AT5G59840.1", "AT5G47200.1", "AT1G12840.1", "AT1G76030.1", "AT1G78900.2", "AT3G42050.1", "AT4G11150.1", "AT1G11860.2", "AT1G17290.1" ),
Location = c("extracellular", "extracellular", "extracellular", "extracellular", "extracellular", "extracellular", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "golgi", "ER", "ER", "ER", "mitochondrion", "mitochondrion", "mitochondrion", "mitochondrion", "mitochondrion"),
AOX_1h_1 = c(0.844651873, 0.50954096, 1.12e-08, 0.012981372, 0.978148381, 0.027579578, 0.068010151, 0.410629215, 0.253838635, 0.033631788, 0.335713512, 0.982799013, 0.025910457, 0.793810264, 0.762431665, 0.152154436, 0.027114103, 0.000227, 1.07e-05, 0.721209032, 0.086281162, 0.483130711, 0.014795515),
AOX_1h_2 = c(0.894623378, 0.011521413, 1.62e-06, 0.085249729, 0.02863972, 0.956962154, 0.225208718, 0.932679767, 0.002574192, 0.071700671, 0.233682544, 0.936572874, 1.12e-05, 0.241658735, 0.865205515, 0.000537, 0.103471292, 8.66e-07, 1.22e-08, 0.950878446, 0.145012176, 0.092919172, 0.599713247),
AOX_1h_3 = c(0.880951025, 0.00145276, 8.59e-10, 0.087023475, 0.675527672, 0.765543306, 0.305860948, 0.899172011, 0.020973476, 0.542988545, 0.735571562, 0.157569324, 0.025488075, 0.071006507, 0.262324019, 0.080470612, 0.0436526, 6.65e-09, 5.63e-10, 0.020557091, 0.069577215, 0.005502212, 0.852099232),
AOX_1h_4 = c(0.980823252, 0.158123518, 0.00210702, 0.006317657, 0.30496173, 0.489709702, 0.091469807, 0.958443361, 0.015583593, 0.566165972, 0.66746161, 0.935102341, 0.087733288, 0.744313619, 0.021169383, 0.633250945, 0.257489406, 0.024345088, 0.000355, 0.226279179, 0.004038493, 0.479275204, 0.703522761),
AOX_2h_1 = c(0.006474022, 0.246530998, 5.38e-06, 0.47169153, 0.305973663, 0.466202566, 0.191733645, 0.016121487, 0.234839116, 0.043866023, 0.089819656, 0.107934599, 2.09e-06, 0.413229678, 0.464078018, 0.004118766, 0.774970986, 3.79e-07, 2.3e-10, 0.428591262, 0.002326292, 0.385580707, 0.106216066),
AOX_2h_2 = c(0.166169729, 0.005721199, 7.77e-08, 0.099146712, 0.457164663, 0.481987525, 7.4e-05, 0.969805081, 0.100894997, 0.062103337, 0.095718425, 0.001686206, 0.009710516, 0.134651787, 0.887036569, 0.459218152, 0.074576369, 3.88e-09, 3.31e-15, 0.409645805, 0.064874307, 0.346371524, 0.449444779),
AOX_2h_3 = c(1.06e-05, 0.576589898, 4.03e-08, 0.787468189, 0.971119601, 0.432593753, 0.000274, 0.86932399, 0.08657663, 4.22e-06, 0.071190008, 0.697384316, 0.161623604, 0.422628778, 0.299545652, 0.767867006, 0.00295567, 0.078724176, 4.33e-09, 0.988576028, 0.080278831, 0.66505527, 0.014158693),
AOX_2h_4 = c(0.010356719, 0.026506539, 9.48e-09, 0.91009296, 0.302464488, 0.894377768, 0.742233323, 0.75032613, 0.175841127, 0.000721, 0.356904918, 0.461234653, 1.08e-05, 0.65800831, 0.360085919, 0.004814238, 0.174670947, 0.004246734, 7.31e-11, 0.778725214, 0.051334623, 0.10212841, 0.155831664 ),
AOX_6h_1 = c(0.271681878, 0.004822226, 1.87e-11, 0.616969208, 0.158860224, 0.684690326, 0.011798791, 0.564591916, 0.000314, 4.79e-06, 0.299871385, 0.001909713, 0.00682428, 0.039107415, 0.574143284, 0.061532691, 0.050483892, 2.28e-08, 1.92e-12, 0.058747794, 0.027147473, 0.196608218, 0.513693112),
AOX_6h_2 = c(5.72e-12, 0.719814288, 0.140016259, 0.927094438, 0.841229414, 0.224510089, 0.026567282, 0.242981965, 0.459311076, 0.038295888, 0.127935565, 0.453746728, 0.005023732, 0.554532387, 0.280899096, 0.336458018, 0.002024021, 0.793915731, 0.012838565, 0.873716549, 0.10097853, 0.237426815, 0.003711539),
AOX_6h_3 = c(3.16e-12, 0.780424491, 0.031315419, 0.363891436, 0.09562579, 0.104833988, 3.52e-05, 0.104196756, 0.870952423, 0.002036134, 0.016480622, 0.671475063, 2.3e-05, 0.00256744, 0.66263641, 0.005026601, 0.57280276, 0.058724117, 6.4e-10, 0.030965264, 0.005301006, 0.622027012, 0.371659724),
AOX_6h_4 = c(7.99e-10, 0.290847169, 0.001319424, 0.347344795, 0.743846306, 0.470908425, 0.00033, 0.016149973, 0.080036584, 0.020899676, 0.00723071, 0.187288769, 0.042514886, 0.00150443, 0.059344154, 0.06554177, 0.112601764, 0.000379, 2.36e-10, 0.78131093, 0.105861995, 0.174370801, 0.05570041 ),
WT_1h_1 = c(0.857, 0.809, 2.31e-05, 0.286, 0.87, 0.396, 0.539, 0.787, 0.73, 0.427, 0.764, 0.87, 0.386, 0.852, 0.848, 0.661, 0.393, 0.0415, 0.00611, 0.843, 0.576, 0.804, 0.304 ),
WT_1h_2 = c(0.898, 0.509, 0.0192, 0.729, 0.616, 0.902, 0.811, 0.9, 0.343, 0.712, 0.814, 0.901, 0.0446, 0.816, 0.896, 0.217, 0.747, 0.0143, 0.000964, 0.901, 0.776, 0.737, 0.876 ),
WT_1h_3 = c(0.939, 0.627, 0.0104, 0.867, 0.932, 0.935, 0.91, 0.939, 0.803, 0.926, 0.934, 0.888, 0.813, 0.859, 0.905, 0.864, 0.838, 0.0223, 0.00917, 0.802, 0.858, 0.724, 0.938 ),
WT_1h_4 = c(0.911, 0.782, 0.298, 0.396, 0.837, 0.871, 0.727, 0.91, 0.506, 0.88, 0.89, 0.909, 0.723, 0.896, 0.547, 0.887, 0.824, 0.566, 0.175, 0.814, 0.348, 0.869, 0.893),
WT_2h_1 = c(0.748, 0.911, 0.231, 0.929, 0.917, 0.928, 0.903, 0.801, 0.909, 0.849, 0.878, 0.884, 0.183, 0.925, 0.928, 0.719, 0.941, 0.108, 0.00817, 0.926, 0.678, 0.923, 0.884),
WT_2h_2 = c(0.935, 0.851, 0.163, 0.925, 0.951, 0.952, 0.63, 0.963, 0.926, 0.916, 0.925, 0.804, 0.868, 0.931, 0.961, 0.951, 0.92, 0.0706, 0.000265, 0.95, 0.917, 0.947, 0.951),
WT_2h_3 = c(0.0197, 0.894, 0.000613, 0.911, 0.922, 0.877, 0.122, 0.916, 0.739, 0.0125, 0.718, 0.905, 0.801, 0.875, 0.852, 0.91, 0.302, 0.729, 0.00015, 0.923, 0.731, 0.902, 0.504),
WT_2h_4 = c(0.696, 0.765, 0.0142, 0.931, 0.893, 0.931, 0.925, 0.925, 0.87, 0.45, 0.899, 0.908, 0.144, 0.921, 0.899, 0.631, 0.87, 0.62, 0.0014, 0.926, 0.807, 0.844, 0.865),
WT_6h_1 = c(0.898, 0.727, 0.00395, 0.921, 0.881, 0.924, 0.776, 0.919, 0.542, 0.234, 0.901, 0.67, 0.747, 0.83, 0.919, 0.848, 0.841, 0.056, 0.00144, 0.846, 0.815, 0.888, 0.916),
WT_6h_2 = c(2.38e-09, 0.88, 0.708, 0.898, 0.891, 0.768, 0.443, 0.777, 0.843, 0.505, 0.695, 0.842, 0.208, 0.859, 0.794, 0.813, 0.14, 0.887, 0.326, 0.894, 0.661, 0.775, 0.182),
WT_6h_3 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
WT_6h_4 = c(0.0357, 0.953, 0.792, 0.956, 0.967, 0.96, 0.711, 0.892, 0.931, 0.899, 0.866, 0.946, 0.917, 0.799, 0.925, 0.927, 0.938, 0.72, 0.025, 0.967, 0.936, 0.945, 0.923)),
class = "data.frame", row.names = c(NA, -23L))
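I want to summarize the data for each organelle (averaged by organelle and samples' replicates) and plot the wild-type and mutant data side by side with standard error for each time point:
library(reshape2)  # assumed source of melt()
library(stringr)   # str_replace_all()
library(ggpubr)    # ggbarplot(), stat_compare_means()
melted <- melt(df)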
head(melted)
melted$variable<- str_replace_all(melted$variable, '_[0-9]$', '')
melted$variable <- factor(melted$variable,levels=c("WT_1h","AOX_1h","WT_2h","AOX_2h","WT_6h","AOX_6h"))
my_comparisons <- list( c("WT_1h","AOX_1h"), c("WT_2h","AOX_2h"),c("WT_6h","AOX_6h"))
ggbarplot(melted, x = "variable", y = "value", add = "mean_se",
color = "variable", palette = c("grey","black","grey","black","grey","black"),
facet.by = "Location")+
stat_compare_means(comparisons = my_comparisons, label = "p.signif")
How can I use tidyverse (dplyr / tidyr) to follow this pathway instead of the scripts above?
You can reshape this data in several ways. In this example I use gather() together with stringr functions to split the sample name, which packs three pieces of information (genotype, time, replicate) into one string.
library(tidyverse)
df %>%
gather(key, value, -ID, -Location) %>%
mutate(type = map_chr(str_split(key,"_"),~.x[1]),
hour = map_chr(str_split(key,"_"),~.x[2]),
n = map_chr(str_split(key,"_"),~.x[3])) %>%
group_by(type, hour) %>%
summarise(mean = mean(value))
Gives
# A tibble: 6 x 3
# Groups: type [?]
type hour mean
<chr> <chr> <dbl>
1 AOX 1h 0.3235302
2 AOX 2h 0.2709910
3 AOX 6h 0.2226648
4 WT 1h 0.6633866
5 WT 2h 0.7263108
6 WT 6h 0.7915662
This you can use in ggplot() to make a nice barplot.
To get it in a table you can use
df %>%
gather(key, value, -ID, -Location) %>%
mutate(type = map_chr(str_split(key,"_"),~.x[1]),
hour = map_chr(str_split(key,"_"),~.x[2]),
n = map_chr(str_split(key,"_"),~.x[3])) %>%
group_by(type, hour) %>%
summarise(mean = mean(value)) %>%
spread(type, mean)
to get
# A tibble: 3 x 3
hour AOX WT
* <chr> <dbl> <dbl>
1 1h 0.3235302 0.6633866
2 2h 0.2709910 0.7263108
3 6h 0.2226648 0.7915662
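Since the question also asks for averages per organelle with a standard error, here is a base R sketch of the same split-and-summarise idea grouped by Location as well. It runs on a small made-up table in the same Genotype_Time_Replicate layout; toy, se and summary_tbl are hypothetical names, and the standard error is taken over all gene-by-replicate values in a group, which is an assumption about what is wanted.

```r
# Made-up wide table in the Genotype_Time_Replicate layout
toy <- data.frame(
  ID = c("g1", "g2", "g3", "g4"),
  Location = c("golgi", "golgi", "ER", "ER"),
  WT_1h_1 = c(0.8, 0.6, 0.4, 0.2),
  WT_1h_2 = c(0.6, 0.4, 0.2, 0.4),
  AOX_1h_1 = c(0.2, 0.4, 0.6, 0.8),
  AOX_1h_2 = c(0.4, 0.2, 0.8, 0.6),
  stringsAsFactors = FALSE
)
sample_cols <- setdiff(names(toy), c("ID", "Location"))

# Stack to long format: one row per gene and sample
long <- data.frame(
  ID = rep(toy$ID, times = length(sample_cols)),
  Location = rep(toy$Location, times = length(sample_cols)),
  sample = rep(sample_cols, each = nrow(toy)),
  value = unlist(toy[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split "Genotype_Time_Replicate" into its parts
parts <- do.call(rbind, strsplit(long$sample, "_", fixed = TRUE))
long$type <- parts[, 1]
long$hour <- parts[, 2]

# Mean and standard error per organelle, genotype and time point
se <- function(x) sd(x) / sqrt(length(x))
means <- aggregate(value ~ Location + type + hour, data = long, FUN = mean)
ses <- aggregate(value ~ Location + type + hour, data = long, FUN = se)
summary_tbl <- merge(means, ses, by = c("Location", "type", "hour"),
                     suffixes = c("_mean", "_se"))
summary_tbl
```

The resulting value_mean and value_se columns can be passed to a bar plot with error bars, faceted by Location.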
Another version going from the df object:
The df object is a list, and expression values after cbind are character type, so you can do
tb <- as_tibble(do.call(cbind, df)) %>%
mutate_at(3:14, as.numeric)
NB that usually for gene expression data it is easier to read in count data using read_tsv or read.table and combine into matrix, data.frame or tibble.
NBB the df object specified has no "WT" samples (from my copy/paste anyway) so I renamed last 4 samples in tb as "WT_1h" replicates
colnames(tb)[11:14] <- paste0("WT_1h_",c(1:4))
Create means from replicates by function
rowMeanNrep <- function(tb, nm){
varname <- paste0(nm, "_mean")
selectn <- grep(nm, colnames(tb))
tb %>%
dplyr::mutate(!!varname := rowMeans(dplyr::select(., !!selectn)))
}
Specify which timepoints to use, and apply
tps <- c("AOX_1h", "WT_1h")
tb_1h_mean <- cbind(tb[,1:2],
do.call(cbind, lapply(tps, function(f){
rowMeanNrep(tb=tb, nm=f) %>%
dplyr::select(paste0(f, "_mean"))
}))
)
A final NB, think about using boxplots instead of barplots, see this paper

Find point of systematic decrease in R

I have the following data frame:
df <- structure(list(x = c(1059.6, 1061.4, 1063.4, 1064.9, 1066.3,
1068, 1069.8, 1071.4, 1072.9, 1074.4, 1075.9, 1077.5, 1079.1,
1080.5, 1082.1, 1083.8, 1085.1, 1086.7, 1088.1, 1089.5, 1091.6,
1093.1, 1094.5, 1095.8, 1097.1, 1098.4, 1099.8, 1101.1, 1102.5,
1103.9, 1105.3, 1106.6, 1108, 1109.4, 1110.8, 1112.2, 1113.7,
1115.2, 1116.5, 1117.9, 1119.1, 1120.4, 1121.8, 1123.1, 1124.8,
1126.2, 1127.4, 1128.8, 1130.2, 1131.8, 1133.3, 1134.6, 1138.5,
1141.2, 1142.4, 1143.6, 1144.8, 1146.8, 1148.2, 1149.6, 1150.9,
1152.2, 1153.4, 1154.7, 1155.9, 1157.1, 1158.3, 1159.5, 1161.9,
1163.4, 1164.7, 1166, 1167.2, 1169, 1170.3, 1171.5, 1172.8, 1173.9,
1175.1, 1176.8, 1178, 1179.2, 1180.3, 1181.6, 1182.8, 1184.1,
1185.8, 1187, 1188.2, 1189.4, 1190.5, 1191.8, 1193, 1194.3, 1195.5,
1205.8, 1206.9, 1208, 1209, 1210.2, 1211.3, 1212.4, 1213.6, 1214.7,
1217.2, 1218.6, 1222.3, 1223.6, 1224.7, 1225.9, 1227.1, 1228.2,
1229.3, 1230.4, 1231.6, 1232.7, 1233.6, 1234.6, 1235.7, 1236.9,
1238.4, 1239.5, 1240.6, 1241.6, 1242.7, 1243.7, 1244.8, 1245.9,
1247, 1248.1, 1249.2, 1250.3, 1251.3, 1252.6, 1253.7, 1254.8,
1255.8, 1256.8, 1257.8, 1258.8, 1261.4, 1262.5, 1263.5, 1264.5,
1265.6, 1266.6, 1267.8, 1268.8, 1270.1, 1271.1, 1272.1, 1273.2,
1274.1, 1275.2, 1276.3, 1279, 1280, 1281, 1282.1, 1283.1, 1284.1,
1285, 1286, 1287, 1288, 1289, 1290, 1291.1, 1292.3, 1293.3, 1294.4,
1298.6, 1299.6, 1300.5, 1301.5, 1302.5, 1303.5, 1304.6, 1305.5,
1306.4, 1307.6, 1308.6, 1309.7, 1310.7, 1311.7, 1312.7, 1315.2,
1316.3, 1317.3, 1318.3, 1319.3, 1320.3, 1321.3, 1322.3, 1323.2,
1326.8, 1327.8, 1329, 1330, 1331, 1332, 1333, 1333.9, 1335, 1336,
1337.3, 1338.3, 1339.3, 1340.5, 1341.6, 1342.7, 1343.8, 1344.9,
1345.9, 1346.8, 1347.8, 1348.8, 1350, 1351.1, 1352, 1353.3, 1354.3,
1355.3, 1356.2, 1357.1, 1358, 1359.2, 1360.2, 1364.4, 1365.5,
1366.6, 1367.6, 1368.7, 1369.8, 1371, 1372, 1373, 1374.1, 1375,
1376, 1376.9, 1377.8, 1378.7, 1379.6, 1380.5, 1381.4, 1382.3,
1383.3, 1384.2, 1385.2, 1387.6, 1388.5, 1389.5, 1390.4, 1391.4,
1392.5, 1393.6, 1394.6, 1395.6, 1397, 1397.9, 1398.8, 1399.8,
1400.6, 1401.6, 1402.5, 1403.4, 1404.2, 1405.1, 1407.4, 1408.3,
1409.2, 1410.1, 1411.2, 1412.2, 1413.2, 1414.2, 1415.6, 1416.7,
1417.8, 1418.9, 1420.2, 1421.5, 1424.6, 1425.7, 1427, 1428.1,
1429.3, 1430.7, 1431.9, 1433.1, 1434.5, 1435.7, 1436.8, 1438,
1439.4, 1440.6, 1441.9, 1443, 1444.4, 1445.6, 1447.3, 1448.5,
1449.7, 1450.9, 1452.1, 1453.2, 1454.5, 1455.6, 1456.8, 1458.1,
1459.3, 1460.3, 1461.4, 1462.4, 1463.9, 1465.1, 1466.3, 1469.8,
1471.1, 1472.6, 1473.8, 1475, 1476.2, 1477.5, 1479.1, 1480.7,
1482, 1483.2, 1484.9, 1486.2, 1487.5, 1488.8, 1490, 1491.3, 1492.4,
1503, 1504.3, 1506.3, 1507.5, 1508.8, 1510.2, 1511.4, 1512.5,
1513.8, 1515.6, 1517.1, 1520.1, 1523.9, 1526.5, 1527.9, 1529.8,
1531.2, 1532.4, 1533.7, 1536, 1537.4, 1538.8, 1540.2, 1541.5,
1542.9, 1544.2, 1545.6, 1546.9, 1548.3, 1549.7, 1551.1, 1552.7,
1554.1, 1556.4, 1557.8, 1559.2, 1560.6, 1562, 1563.4, 1564.7,
1566.2, 1567.5, 1568.9, 1570.2, 1571.4, 1573.9, 1576.7, 1581.5,
1582.8, 1584.7, 1586.2, 1587.7, 1589.3, 1591, 1592.8, 1594.7,
1596.4, 1598.5, 1600.6, 1602.4, 1604.6, 1606.9, 1609, 1611, 1612.6,
1614.4, 1616.3, 1618.6, 1620.6, 1622.4, 1624.5, 1627.2, 1629.3,
1631.4, 1635, 1636.9, 1638.6, 1640.5, 1642.1, 1643.7, 1645.5,
1647.1, 1648.7, 1650.9, 1653, 1655.2, 1657.1, 1659.1, 1661.5,
1663.6, 1665.9, 1668.1, 1671.7, 1674, 1676.2, 1678.1, 1679.7,
1681.6, 1683.6, 1685.7, 1688, 1693.7, 1695.7, 1697.6, 1699.7,
1701.7, 1704.1), y = c(1.876, 2.027, 2.087, 2.231, 2.18, 1.922,
1.921, 1.851, 1.961, 2.035, 2.043, 2.043, 1.838, 2.032, 2.112,
1.976, 2.046, 2.117, 2.062, 2.07, 1.748, 1.917, 2.092, 2.283,
2.158, 2.119, 2.023, 1.971, 1.882, 2.058, 2.141, 2.241, 2.079,
1.946, 1.959, 2.117, 1.923, 2.015, 2.066, 1.98, 2.091, 1.929,
1.987, 1.852, 1.935, 2.127, 1.982, 2.182, 2.099, 2.03, 1.912,
1.998, 2.491, 2.359, 2.188, 1.965, 1.906, 1.772, 1.927, 2.077,
2.381, 2.191, 2.089, 2.086, 2.017, 2.028, 1.832, 1.88, 2.053,
2.177, 1.995, 2.045, 2.116, 1.961, 1.99, 2.227, 2.235, 2.208,
2.249, 1.992, 2.045, 2.152, 2.237, 2.239, 2.247, 2.114, 1.956,
2.042, 1.926, 2.396, 2.184, 2.208, 2.016, 2.177, 2.29, 2.469,
2.502, 2.115, 2.081, 2.091, 2.188, 2.118, 2.179, 2.067, 1.962,
2.181, 2.246, 2.526, 2.145, 1.961, 2.299, 2.306, 2.34, 2.133,
1.974, 1.997, 2.47, 2.24, 2.247, 2.137, 1.965, 2.232, 2.225,
2.417, 2.362, 2.155, 2.034, 2.151, 2.176, 2.183, 2.372, 2.145,
2.284, 1.967, 2.299, 2.299, 2.183, 2.292, 2.193, 2.249, 2.32,
2.333, 2.286, 2.216, 2.233, 2.453, 2.373, 2.284, 2.074, 2.014,
2.153, 2.353, 2.465, 2.373, 2.181, 2.424, 2.334, 2.349, 2.39,
2.513, 2.526, 2.268, 2.098, 2.326, 2.385, 2.306, 2.378, 2.126,
2.191, 2.363, 2.222, 2.723, 2.686, 2.4, 2.251, 2.121, 2.104,
2.16, 2.333, 2.151, 2.116, 2.136, 2.293, 2.281, 2.313, 2.374,
2.585, 2.521, 2.656, 2.66, 2.399, 2.442, 2.413, 2.528, 2.212,
2.58, 2.667, 2.153, 2.736, 2.486, 2.406, 2.39, 2.403, 2.504,
2.502, 2.158, 2.617, 2.434, 2.364, 2.497, 2.456, 2.263, 2.432,
2.562, 2.453, 2.249, 2.18, 2.141, 2.324, 2.176, 2.184, 2.153,
2.332, 2.202, 2.332, 2.125, 2.156, 2.189, 2.71, 2.458, 2.502,
2.285, 2.527, 2.437, 2.418, 2.507, 2.087, 2.321, 2.701, 2.486,
2.389, 2.335, 2.26, 2.108, 2.164, 2.286, 2.103, 2.257, 2.137,
2.076, 2.378, 2.637, 2.446, 2.448, 2.539, 2.253, 2.099, 2.59,
2.405, 2.219, 2.542, 2.532, 2.507, 2.439, 2.463, 2.342, 2.329,
2.436, 2.511, 2.557, 2.603, 2.5, 2.428, 2.204, 2.307, 2.174,
2.193, 1.793, 2.116, 2.107, 2.209, 1.967, 1.834, 2.713, 2.647,
2.379, 2.229, 2.11, 1.964, 1.985, 2.162, 1.996, 2.074, 1.994,
1.839, 1.838, 1.743, 1.668, 1.91, 1.735, 1.714, 1.421, 1.767,
1.816, 1.755, 1.755, 1.698, 1.608, 1.556, 1.511, 1.394, 1.425,
1.579, 1.495, 1.627, 1.305, 1.471, 1.469, 1.67, 1.697, 1.42,
1.483, 1.274, 1.341, 1.235, 1.295, 1.401, 1.463, 1.313, 1.176,
1.333, 1.373, 1.299, 1.086, 1.139, 1.237, 1.303, 1.143, 1.13,
1.114, 1.096, 1.248, 1.302, 1.19, 1.069, 1.1, 1.027, 0.897, 1.09,
0.922, 1.116, 0.963, 1.011, 1.053, 1.025, 0.985, 0.981, 1.025,
1.117, 1.141, 1.135, 1.068, 0.982, 1.028, 1.06, 1.004, 1.112,
1.108, 1.04, 0.857, 0.91, 0.98, 1.081, 1.025, 0.996, 0.931, 1,
1.074, 0.987, 0.996, 1.125, 0.9, 0.607, 1.17, 1.08, 1, 0.909,
0.841, 0.924, 0.818, 0.846, 0.732, 1.006, 0.717, 0.594, 0.786,
0.685, 0.619, 0.684, 0.69, 0.633, 0.564, 0.689, 0.555, 0.445,
0.696, 0.677, 0.729, 0.541, 0.362, 0.312, 0.568, 0.711, 0.515,
0.622, 0.583, 0.631, 0.645, 0.696, 0.535, 0.424, 0.469, 0.519,
0.511, 0.485, 0.436, 0.412, 0.351, 0.556, 0.255, 0.519, 0.399,
0.497, 0.477, 0.564, 0.462, 0.433, 0.616, 0.547, 0.42, 0.499,
0.415, 0.368)), row.names = c(NA, -443L), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("x", "y"))
Plot:
I need to find the point at which y starts to systematically decrease.
I know that the real point is x == 1405. However, is there a way to automatically detect it?
I am not expecting to find the exact x point. A really good approximation would do the job.
I already tried a breakpoint analysis with the segmented package, but without much success. The best estimate I could get was x == 1363, and I am looking for a closer approximation.
Here's how to get a fitted smooth of the data using loess. When you say "starts to systematically decrease," I think you mean something like "when the slope gets negative beyond a certain threshold," since it seems to me that it visually peaks and starts to decline around the 1350s. I could manually get the peak to occur later by smoothing more than the default, using span = 0.4.
library(broom)
fit <- loess(y ~ x, df, span = 0.4)
df_aug <- augment(fit)
Using that model, the peak looks to be in the 1370s.
library(dplyr); library(ggplot2)
df_aug %>% filter(.fitted == max(.fitted))
# # A tibble: 1 x 5
#       y     x .fitted .se.fit .resid
#   <dbl> <dbl>   <dbl>   <dbl>  <dbl>
# 1  2.09  1373    2.39  0.0181 -0.307
I presume you could get a better result if you can more definitively describe what model should be used to define "systematically decrease."
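One way to make "systematically decrease" concrete is to assume a specific shape, for example a flat level followed by a linear decline, and pick the breakpoint that minimises the residual sum of squares. The sketch below does this with a plain grid search in base R. The simulated series, the candidate grid, and the flat-then-declining shape are all assumptions for illustration; the real data also rises before it falls, so a rising first segment might fit it better.

```r
# Simulated stand-in data (an assumption, not the data from the question):
# flat at 2.2 until x = 1400, then a steady linear decline plus noise.
set.seed(2)
x <- seq(1000, 1700, by = 2)
y <- ifelse(x < 1400, 2.2, 2.2 - 0.006 * (x - 1400)) +
  rnorm(length(x), sd = 0.05)

# Hypothetical grid of candidate breakpoints
candidates <- seq(1100, 1600, by = 5)

# For each candidate b, fit y = a + s * max(x - b, 0) and record the
# residual sum of squares of that "hinge" model.
rss <- sapply(candidates, function(b) {
  hinge <- pmax(x - b, 0)
  sum(resid(lm(y ~ hinge))^2)
})

# The candidate with the smallest residual sum of squares is the estimate
best <- candidates[which.min(rss)]
best
```

The estimate is only as good as the assumed shape: if the pre-break segment is not flat, the hinge model will shift the breakpoint to compensate.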
Alternatively, you might extract the slope and acceleration from the loess curve, but it's not clear that would get you much closer to your expected result:
# Extract slope & acceleration
df_aug_slope <- df_aug %>%
  mutate(slope = (.fitted - lag(.fitted)) /
                   (x - lag(x)),
         curve = (slope - lag(slope)) /
                   (x - lag(x)))
ggplot(df_aug_slope, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = .fitted), color = "red") +
  geom_line(aes(y = slope * 100), color = "blue") +
  geom_line(aes(y = curve * 1000), color = "green") +
  geom_vline(xintercept = 1405, lty = "dashed") +
  theme_minimal()
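To turn the slope into an automatic detection rule, one option is to report the first x after which the smoothed slope stays below some negative threshold for the rest of the series. The following is a minimal sketch in base R, assuming simulated data and a hypothetical threshold of -0.002; both would need to be replaced and tuned for the real data.

```r
# Simulated stand-in data (an assumption, not the data from the question):
# a gentle rise until x = 1400, then a steady decline plus noise.
set.seed(1)
x <- seq(1000, 1700, by = 2)
y <- ifelse(x < 1400,
            2 + 0.001 * (x - 1000),
            2.4 - 0.007 * (x - 1400)) +
  rnorm(length(x), sd = 0.05)
sim <- data.frame(x = x, y = y)

# Same kind of smooth as above
fit <- loess(y ~ x, sim, span = 0.4)
fitted_y <- predict(fit)

# Finite-difference slope of the smooth (first element has no predecessor)
slope <- c(NA, diff(fitted_y) / diff(sim$x))

# Hypothetical threshold: slope more negative than -0.002 y-units per x-unit
threshold <- -0.002
below <- !is.na(slope) & slope < threshold

# TRUE only for the trailing run where the slope never rises back above
# the threshold; the start of that run is taken as the onset.
stays_below <- rev(cumprod(rev(below))) == 1
onset <- sim$x[which(stays_below)[1]]
onset
```

Requiring the slope to stay below the threshold, rather than taking the first crossing, keeps brief dips in the smooth from being reported as the onset. The result depends on both span and threshold, so it is worth checking how much the estimate moves when either is varied.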
