How to specify two thresholds for ggplot? - r

I have a dataset called volcano that looks like this:
DiffMean P.value
-0.0246757556 0.1
0.0050993889 0.002
-0.0169992614 0.008
0.0039905857 0.03
-0.0081568420 0.02
-0.0279989935 0.03
0.0313951281 0.44
-0.0097932018 0.22
-0.1033745673 0.003
0.1143251388 0.02
-0.0738617112 0.004
-0.0011579184 0.1
-0.0008561962 0.022
0.0435398270 0.11
-0.0380242369 0.05
0.1533720177 0.03
I want to plot this using ggplot, but I want the colors to be red if DiffMean < 0 and P.value < 0.05 or blue if DiffMean > 0 and P.value < 0.05.
What I have so far is:
volcano$threshold = as.factor(abs(volcano$DiffMean)>0 & volcano$p.value.adj< 0.05)
ggplot(data=volcano, aes(x=DiffMean, y=-1*log10(p.value), colour=threshold)) +
geom_point(aes(alpha=0.4, size=1.75)) +
xlim(c(-1,1)) + ylim(c(0,25))
But I don't know how to use these two thresholds.

I would do something like this:
# The column is named P.value in the data shown above (R is case-sensitive)
# 1 = significant and positive, -1 = significant and negative, 0 = everything else
volcano$threshold <-
factor(ifelse(volcano$DiffMean>0 & volcano$P.value< 0.05,
1,
ifelse(volcano$DiffMean<0 & volcano$P.value< 0.05,
-1,
0)
))
library(ggplot2)
ggplot(data=volcano, aes(x=DiffMean, y=-1*log10(P.value), colour=threshold)) +
geom_point(alpha=0.4, size=5) +
scale_y_log10()
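To get exactly red and blue, the factor levels can be mapped to colours by name. This is a minimal sketch on four rows of the data above; the level names "down", "up" and "ns" and the grey colour for non-significant points are my own choices, not something from the question.

```r
# Four rows taken from the question's data
volcano <- data.frame(
  DiffMean = c(-0.0246757556, 0.0050993889, -0.1033745673, 0.0313951281),
  P.value  = c(0.1, 0.002, 0.003, 0.44)
)

# Three-way classification: significant and negative, significant and positive, rest
significant <- volcano$P.value < 0.05
volcano$threshold <- factor(
  ifelse(significant & volcano$DiffMean < 0, "down",
         ifelse(significant & volcano$DiffMean > 0, "up", "ns")),
  levels = c("down", "up", "ns")  # hypothetical labels
)

# Named vector so each level gets the colour asked for in the question
threshold_colours <- c(down = "red", up = "blue", ns = "grey60")

# Build the plot only if ggplot2 is installed; call print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(volcano, aes(x = DiffMean, y = -log10(P.value), colour = threshold)) +
    geom_point(alpha = 0.4, size = 5) +
    scale_colour_manual(values = threshold_colours)
}
```

Naming the colour vector ties each colour to a level, so the mapping does not depend on the order of the levels.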

Related

ROC Curve Plot using R (Error code: Predictor must be numeric or ordered)

I am trying to make a ROC curve using pROC with the two columns below (the list goes on to over 300 entries):
Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e., to be a TRUE POSITIVE, Actual_Findings_% would be LESS than 0.4 AND Predicted_Finding_Prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
roc(
response = data$"Actual_Findings_%" <0.4,
predictor = data$"Predicted_Finding_Prob"
),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve"
)
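Now, importantly: the ROC curve is there to show you what happens when you vary your classification threshold. So the one thing you are doing wrong is enforcing a single threshold yourself by setting the predictor > 0.5. That turns the predictor into a logical vector, which is neither numeric nor ordered, hence the error. Pass the raw predicted probabilities instead.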
This does however give a perfect separation, which is nice I guess. (Though bad for educational purposes.)
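To see what "varying the threshold" means, here is a rough base-R sketch that computes the ROC points by hand for the ten rows above. It does not use pROC, and the variable names are mine; it is only meant to illustrate the idea, not to replace roc().

```r
# The ten rows shown in the question
actual    <- c(0.23, 0.48, 0.26, 0.23, 0.48, 0.47, 0.23, 0.6868, 0.77, 0.31)
predicted <- c(0.6, 0.3, 0.62, 0.6, 0.3, 0.3, 0.6, 0.25, 0.15, 0.55)

# The response is thresholded once; the predictor stays numeric
response <- actual < 0.4

# A thresholded predictor is logical, which is what triggers the error
is.numeric(predicted > 0.5)

# Sweep every candidate cut-off, from "nothing is positive" downwards
thresholds <- c(Inf, sort(unique(predicted), decreasing = TRUE))
tpr <- sapply(thresholds, function(th) mean(predicted[response] >= th))
fpr <- sapply(thresholds, function(th) mean(predicted[!response] >= th))
roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```

Each row of roc_points is one point on the curve, and the area comes out as 1 for these rows because the two groups do not overlap.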

How to substitute some characters with a space between them in R

I'm trying to substitute some characters by some strings, but when I try this happens:
Group <- "ABC"
A <- "0.25 0.65 0.48"
B <- "0.054 0.41 0.09"
C <- "0.8 0.047 0.34"
Group <- gsub("A", A, Group)
Group <- gsub("B", B, Group)
Group <- gsub("C", C, Group)
Group
When I group them there is no space between A, B and C. The above code results in:
0.25 0.65 0.480.054 0.41 0.090.8 0.047 0.34
I want the output to be like this:
0.25 0.65 0.48 0.054 0.41 0.09 0.8 0.047 0.34
I will appreciate if you can help me with this.
There are several syntactical errors, but let me present you what I think you are trying to accomplish:
Group <- 'ABC'
A <- paste(0.25, 0.65, 0.48)
Group = gsub('A', A, Group)
[1] "0.25 0.65 0.48BC"
EDIT: Seeing your reformatted question, I would say the only change is to put a space between your Group letters:
Group <- 'A B C'
Or paste an empty character at the end of all groups of numbers:
A <- paste(0.25, 0.65, 0.48, "")
You can transform Group a bit, i.e., trimws(gsub(""," ",Group)), so that a " " is inserted between the characters in Group.
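Another way to look at it, as a sketch rather than the only approach: split Group into single letters and look each one up in a named vector, then join with a space. The names values and letters_in_group are made up for this example.

```r
# Replacement strings, keyed by the letter they stand for
values <- c(A = "0.25 0.65 0.48",
            B = "0.054 0.41 0.09",
            C = "0.8 0.047 0.34")

Group <- "ABC"

# Split "ABC" into "A", "B", "C"
letters_in_group <- strsplit(Group, "")[[1]]

# Look up each letter and join the pieces with a single space
result <- paste(values[letters_in_group], collapse = " ")
result
```

This avoids running gsub three times and gives a space between the groups without leaving one at the end.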
Just use paste; its default separator is already a space, so no collapse argument is needed:
A <- "0.25 0.65 0.48"
B <- "0.054 0.41 0.09"
C <- "0.8 0.047 0.34"
paste(A, B, C)
"0.25 0.65 0.48 0.054 0.41 0.09 0.8 0.047 0.34"

Creating a pseudo-random list with r

Given I have 4 different values
intensities <- c(0.1,-0.1,0.05,-0.05)
My goal is to randomly sample every value 5 times but positive and negative values should alternate, e.g.
resultingList = (0.1, -0.05, 0.05, -0.05, 0.1, -0.1, ...)
Does anybody know an elegant way to do this in R?
Maybe something like this
# seed
set.seed(123)
plus <- rep(intensities[intensities >= 0], each = 5)
minus <- rep(intensities[intensities < 0], each = 5)
out <- numeric(length(plus) + length(minus))
out[seq(1, length(out), 2)] <- sample(plus)
out[seq(2, length(out), 2)] <- sample(minus)
out
# [1] 0.10 -0.05 0.05 -0.10 0.10 -0.05 0.05 -0.05 0.05 -0.10 0.10 -0.05 0.05 -0.05 0.05 -0.10
# [17] 0.10 -0.10 0.10 -0.10
If the intensities you are sampling from come in +/- pairs, you could just sample from the positive values and then flip the sign of every other number drawn:
N <- 5
positiveIntensities <- c(0.1, 0.05)
resultingList <- sample(positiveIntensities, N, replace = TRUE) * (-1)^(0:(N-1))
Here is my solution: a custom function whose argument n is the length of the output. ceiling() and floor() set how many values go into the odd and even positions.
mySample <- function(x, n){
res <- numeric(n)
# odd positions get non-negative values, even positions get negative ones
res[seq(1, n, 2)] <- sample(x[x >= 0], ceiling(n / 2), TRUE)
res[seq(2, n, 2)] <- sample(x[x < 0], floor(n / 2), TRUE)
return(res)
}
intensities <- c(0.1, -0.1, 0.05, -0.05)
mySample(intensities, 10)
# [1] 0.10 -0.10 0.05 -0.05 0.10 -0.05 0.05 -0.05 0.05 -0.10
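If every value has to appear exactly 5 times (the sampling-with-replacement versions do not guarantee that), the interleaving idea from the first answer can be wrapped in a small function. A sketch, assuming there are as many positive as negative values; alternating_sample is a made-up name.

```r
alternating_sample <- function(values, times = 5) {
  # Shuffle the repeated positive and negative values separately
  plus  <- sample(rep(values[values > 0], each = times))
  minus <- sample(rep(values[values < 0], each = times))
  stopifnot(length(plus) == length(minus))
  # rbind() stacks them in two rows; reading column by column interleaves them
  as.vector(rbind(plus, minus))
}

set.seed(42)
out <- alternating_sample(c(0.1, -0.1, 0.05, -0.05))

# Checks: 20 values, signs alternate, each value drawn exactly 5 times
length(out)
all(out[c(TRUE, FALSE)] > 0)
all(out[c(FALSE, TRUE)] < 0)
table(out)
```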

Smooth.Pspline yields the same results with different spar values

I am trying to determine the best value of spar to implement across a dataset by reducing the root mean square error between test and training replicates on the same raw data. My test and training replicates look like this:
Traindataset
t dist
-0.008 NA
-0.006 0
-0.004 0
-0.002 0
0 0
0.002 0.000165038
0.004 0.000686934
0.006 0.001168098
0.008 0.001928885
0.01 0.003147262
0.012 0.004054971
0.014 0.005605361
0.016 0.007192645
0.018 0.009504648
0.02 0.011498809
0.022 0.013013655
0.024 0.01342625
Testdataset
t dist
-0.008 NA
-0.006 0
-0.004 0
-0.002 0
0 0
0.002 0
0.004 0.000481184
0.006 0.001306409
0.008 0.002590156
0.01 0.003328259
0.012 0.004429246
0.014 0.005012768
0.016 0.005829698
0.018 0.006567801
0.02 0.008030102
0.022 0.009617453
0.024 0.011202827
I need the spline to be 5th order so I can accurately predict the 3rd derivative, so I am using smooth.Pspline (from the pspline package) instead of the more common smooth.spline. I attempted using a variant of the solution outlined here (using root mean squared error of predicting testdataset from traindataset instead of cross validation sum of squares within one dataset). My code looks like this:
library(pspline)
# Root mean square error between predictions (m) and observations (o)
RMSE <- function(m, o){
sqrt(mean((m - o)^2))
}
# Fit on the training replicate with the given spar, then score on the test replicate
Psplinermse <- function(spar){
trainmod <- smooth.Pspline(traindataset$t, traindataset$dist, norder = 5,
spar = spar)
testpreddist <- predict(trainmod,testdataset$t)[,1]
RMSE(testpreddist, testdataset$dist)
}
spars <- seq(0, 1, by = 0.001)
rmsevals <- rep(0, length(spars))
for (i in seq_along(spars)){
rmsevals[i] <- Psplinermse(spars[i])
}
plot(spars, rmsevals, type = 'l', xlab = 'spar', ylab = 'RMSE' )
The issue I am having is that for pspline, the values of RMSE are the same for any spar above 0 [graph of spar vs RMSE]. When I dug into just the predictions line of code, I realized I am getting the exact same predicted values of dist for any spar above 0. Any ideas on why this might be are greatly appreciated.
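Two things that may be worth checking, offered as guesses rather than a diagnosis. First, both dist columns start with NA, and mean() returns NA as soon as one value is missing, so an NA-safe RMSE is a small change. Second, if spar in smooth.Pspline is a raw penalty weight rather than a 0 to 1 scaled value (an assumption I have not confirmed against the pspline documentation), a log-spaced grid would probe much smaller values than a step of 0.001 does. The predicted values below are invented for illustration only.

```r
# RMSE that ignores pairs where either value is missing
rmse_na <- function(m, o) {
  sqrt(mean((m - o)^2, na.rm = TRUE))
}

# First few test observations from the question, plus made-up predictions
observed  <- c(NA, 0, 0.000481184, 0.001306409)
predicted <- c(0.0001, 0, 0.0005, 0.0012)   # hypothetical values

plain   <- sqrt(mean((predicted - observed)^2))  # NA because of the missing value
na_safe <- rmse_na(predicted, observed)          # finite

# Log-spaced candidate values for spar, from 1e-12 up to 1
spars_log <- 10^seq(-12, 0, length.out = 25)
```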

R- tables package - error subscript out of bounds

I need to create a fancy table and export it as png. I'm trying the tables package in R. I need to group "variacion" by groups of agents ("agentes") who had a positive variation vs the rest. I want the mean, the sd and the number of agents who fulfill these conditions.
My table is:
agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615
My output should be (including format):
You can arrive at those numbers by doing, for example:
sd(t_agentes1$variacion[t_agentes1$variacion<=0])
And the result is the last number in the table for the column sd: 0.126
So in tables library:
library(tables)
X<-t_agentes1$variacion
latex( tabular( (X > 0) + (X < 0) + 1
+ ~ ((n = 1) + X*(mean + sd + length)) ) )
But I get the error:
non-numeric argument to binary operator
Also when I try the first example of the package I get the same error
tabular( (Species + 1) ~ (n=1) + Format(digits=2)*
+ + (Sepal.Length + Sepal.Width)*(mean + sd), data=iris )
Error in e[[3]] : subscript out of bounds
I really don't understand the parameters of this package. Is there a way to do the grouping? I'm really lost with this so any help would be really appreciated. Thanks.
X <- read.table(header = TRUE, text="agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615")
X <- within(X, variation <- factor(variacion > 0, levels = c(TRUE, FALSE),
labels = c('variation > 0',
'variation <= 0')))
library(tables)
# latex(
# tabular(Heading() * variation ~
# Justify(l) * (Heading() * Format(digits = 2) * variacion * (mean + sd) + (number = (n = 1))),
# data = X))
latex(
tabular(Heading() * variation ~
Justify(l) * (Heading() * variacion * (Format(digits = 2) * mean + Format(digits = 2) *sd) + (number = (n = 1))),
data = X))
# mean sd number
# variation $>$ 0 0.20 0.048 3
# variation $\\leq$ 0 -0.16 0.126 2
Gives me
Without prettifying the results:
tabular((X > 0) + (X < 0) ~ mean*X + sd*X + length*X)
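A possible explanation for the two errors in the question, stated as an assumption since I have not run the original session: the leading + on the continuation lines looks like the console prompt that was copied along with the code. With it, R sees a number plus a formula in the first call, and a stray unary plus in the second. Both effects can be reproduced in base R without the tables package:

```r
# Adding a formula to a number fails, as in the first error message
err1 <- tryCatch(1 + ~x, error = function(e) conditionMessage(e))

# A unary plus parses as a call with two elements: the operator and one operand
e <- quote(+x)
n <- length(e)

# Asking such a call for a third element fails, as in the second error message
err2 <- tryCatch(e[[3]], error = function(e) conditionMessage(e))

err1
n
err2
```

If that is the cause, deleting the first + at the start of each continuation line should make the package example run.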
