I’m trying to characterize how prominent a selection is.
Suppose you poll 10 people for their favorite color and get the following responses:
r = c(rep("blue",5),rep("green",4),rep("red",1))
And you make a contingency table:
tab = table(r)
If you take the density of this...
d = density(tab)
If you then ask someone else and they say "red", you characterize the prominence of this response by the integral of the probability density function over values less than or equal to that response's count:
get.prominence = function(new.response){
rate = tab[new.response]
window.index = tail(which(d$x<=rate),n=1)
sum(d$y[1:window.index])/sum(d$y)
}
get.prominence("red") # .16
get.prominence("blue") # .77
Is this a reasonable way of doing this? There must be a known technique
for this and I don't know the name of it.
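For comparison, integrating a kernel density estimate up to a point gives a smoothed version of the empirical cumulative distribution function (ECDF) of the counts. A minimal sketch of the unsmoothed quantity, using the same poll data, might look like this (prominence_ecdf is a hypothetical name, not an established method):

```r
# Same poll data as above
r <- c(rep("blue", 5), rep("green", 4), rep("red", 1))
tab <- table(r)

# Empirical CDF of the observed counts (5, 4, 1)
Fn <- ecdf(as.vector(tab))

# Hypothetical helper: share of categories whose count is <= this response's count
prominence_ecdf <- function(response) {
  Fn(as.numeric(tab[response]))
}

prominence_ecdf("red")    # 1 of 3 categories has a count <= 1
prominence_ecdf("blue")   # all 3 categories have a count <= 5
```

With only three categories the ECDF is very coarse, which is presumably part of the motivation for smoothing it with density().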
I am currently trying to implement the SARSA algorithm, as described in Sutton, Barto "Reinforcement Learning, An Introduction" on a gridworld with a windy upstream. (I am using the same environment as Sutton, Barto - p.130.) Basically, there are 70 fields and one can move in four directions: up, down, left or right. On some states, a wind will cause the movement to shift up one step. The reward is -1 for each timestep, where the goal has not been reached.
I implemented the environment and everything seems to be working fine. However, the learning algorithm does not seem to work. The authors of the book claim that when using certain parameters, the algorithm converges to a near optimal solution after about 150 episodes learned. This is not the case for my code (written in Julia v1.1.0)
g = GridWorld()
α = 0.5
γ = 1
ϵ = 0.1
Q = zeros(70,4)
for episode in 1:100000
isDone = false
S = g.start
A = eps_greedy(Q,ϵ,S)
while !isDone
(R,isDone) = action(g,A)
S´ = g.position
A´ = eps_greedy(Q,ϵ,S´)
Q[S,A] += α*(R + γ* Q[S´,A´] - Q[S,A])
S = S´
A = A´
end
end
The object g stores the current state, which gets changed according to action A when calling action(g,A). The function eps_greedy(Q,epsilon,S) just takes a current state and chooses an epsilon-greedy action from the action-value function Q.
The problem is: The longer I train, the lower the action values, stored in Q, will get. For example, training for about 2000 episodes, the action values of the starting state are all similar at approximately -950. Training for 20000 will yield action values of around -10000.
I don't think this is supposed to happen, but I am not quite sure what causes the problem. Any help would be greatly appreciated!
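For reference, here is a self-contained sketch of the same SARSA loop, written in R rather than Julia. The grid layout (7 x 10, wind strengths per column, start and goal cells) is assumed from the book's windy gridworld example, and all function names are hypothetical. Two details are made explicit that the code above leaves to GridWorld: the position is reset at the start of every episode, and the action values of the terminal state are never updated, so they stay at 0. Whether either of these is related to the problem described above cannot be determined from the snippet alone.

```r
# Windy gridworld, layout assumed from the book's example
n_rows <- 7; n_cols <- 10
wind   <- c(0, 0, 0, 1, 1, 1, 2, 2, 1, 0)              # upward push per column
start  <- c(4, 1); goal <- c(4, 8)                     # (row, column), row 1 is the top
moves  <- list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))   # up, down, left, right

state_id <- function(pos) (pos[1] - 1) * n_cols + pos[2]

step <- function(pos, a) {
  new <- pos + moves[[a]]
  new[1] <- new[1] - wind[pos[2]]        # wind of the column being left
  new[1] <- min(max(new[1], 1), n_rows)  # stay inside the grid
  new[2] <- min(max(new[2], 1), n_cols)
  list(pos = new, reward = -1, done = all(new == goal))
}

eps_greedy <- function(Q, eps, s) {
  if (runif(1) < eps) return(sample.int(4, 1))
  best <- which(Q[s, ] == max(Q[s, ]))
  best[sample.int(length(best), 1)]      # break ties at random
}

sarsa <- function(n_episodes, alpha = 0.5, gamma = 1, eps = 0.1) {
  Q <- matrix(0, n_rows * n_cols, 4)
  steps <- integer(n_episodes)
  for (ep in seq_len(n_episodes)) {
    pos <- start                         # reset the environment every episode
    s <- state_id(pos)
    a <- eps_greedy(Q, eps, s)
    done <- FALSE
    while (!done) {
      out <- step(pos, a)
      s2 <- state_id(out$pos)
      a2 <- eps_greedy(Q, eps, s2)
      # Q[s2, ] is all zeros when s2 is the goal, because it is never updated
      Q[s, a] <- Q[s, a] + alpha * (out$reward + gamma * Q[s2, a2] - Q[s, a])
      pos <- out$pos; s <- s2; a <- a2; done <- out$done
      steps[ep] <- steps[ep] + 1L
    }
  }
  list(Q = Q, steps = steps)
}

set.seed(1)
fit <- sarsa(200)
fit$Q[state_id(start), ]   # action values of the start state
tail(fit$steps)            # episode lengths at the end of training
```

Tracking the number of steps per episode, as done here, is a simple way to see whether learning makes progress independently of the magnitude of the action values.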
I'm looking for a way to identify genes that are significantly stable across conditions. In other words, the opposite of standard DE analysis.
Standard DE splits genes in two categories: significantly changing on one side, everything else, "the rest", on the other.
"The rest", however, contains both genes that actually do not change, and genes for which the confidence in the change is not sufficient to call them differential.
What I want is to find those that do not change, or in other words, those for which I can confidently say that there's no change across my conditions.
I know this is possible in DEseq by providing an alternative null-hypothesis, but I have to integrate this as an extra step into someone else's pipeline that already uses limma, and I'd like to stick to it.
Ideally I would like to test for both DE and non changing genes in a similar way, something conceptually similar to changing the H0 in DEseq.
At the moment the code to test for DE goes like:
# shaping data
comparison <- eBayes(lmFit(my_data, weights = my.weights^2))
results <- limma::topTable(comparison, sort.by = "t",
coef = 1, number = Inf)
As an example, I'd love something like the following, but anything conceptually similar would do.
comparison <- eBayes(lmFit(my_data, weights = my.weights^2), ALTERNATIVE_H0 = my_H0)
I know treat() allows one to specify an interval null hypothesis by providing a fold change. Citing the manual: "it uses an interval null hypothesis, where the interval is [-lfc,lfc]".
However this still tests for change from a central interval around 0, while the intervals I would like to test against are [-inf,-lfc] + [lfc,inf].
Is there any option I'm missing?
Thanks!
You can try to use the confidence interval of the logFC to select your genes, but I must say this depends heavily on the number of samples you have and on how strong the biological variance is. Below I show an example of how it can be done.
First we use DESeq2 to generate an example dataset. We set betaSD so that a small proportion of genes show differences between conditions:
library(DESeq2)
library(limma)
set.seed(100)
dds = makeExampleDESeqDataSet(n=2000,betaSD=1)
#pull out the data
DF = colData(dds)
# get out the true fold change
FC = mcols(dds)
Now we can run limma-voom on this dataset,
V = voom(counts(dds),model.matrix(~condition,data=DF))
fit = lmFit(V,model.matrix(~condition,data=DF))
fit = eBayes(fit)
# get the results, in this case, we are interested in the 2nd coef
res = topTable(fit,coef=2,n=nrow(V),confint=TRUE)
So there is an option to collect the 95% confidence interval of the fold change in the function topTable. We do that and compare against the true FC:
# fill in the true fold change
res$true_FC = FC[rownames(res),"trueBeta"]
We can look at how the estimated and true differ:
plot(res$logFC,res$true_FC)
So let's say we want to find genes where we are confident that the absolute log fold change is < 1. We can do:
tabResults = function(tab,fc_cutoff){
true_unchange = abs(tab$true_FC)<fc_cutoff
pred_unchange = tab$CI.L>(-fc_cutoff) & tab$CI.R <fc_cutoff
list(
X = table(pred_unchange,true_unchange),
expression_distr = aggregate(
tab$AveExpr ~ pred_unchange+true_unchange,data=tab,mean
))
}
tabResults(res,1)$X
             true_unchange
pred_unchange FALSE TRUE
        FALSE   617 1249
        TRUE      7  127
The above result tells us that if we restrict to genes whose 95% confidence interval lies within +/- 1 logFC, we get 134 hits, of which 7 are false (actual absolute log fold change of at least 1).
And the reason we miss some truly unchanging genes is that they are expressed a bit lower, while most of those we correctly predicted to be unchanging have high expression:
tabResults(res,1)$expression_distr
  pred_unchange true_unchange tab$AveExpr
1         FALSE         FALSE    7.102364
2          TRUE         FALSE    8.737670
3         FALSE          TRUE    6.867615
4          TRUE          TRUE   10.042866
We can use a lower FC cutoff, but we also end up with fewer genes:
tabResults(res,0.7)$X
             true_unchange
pred_unchange FALSE TRUE
        FALSE   964 1016
        TRUE      1   19
The confidence interval depends a lot on the number of samples you have. So a cutoff of 1 for one dataset would mean something different for another.
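To illustrate that dependence, here is a small base R sketch of the selection rule for a single gene. It uses plain t quantiles rather than limma's moderated statistics, and ci_within is a hypothetical helper, so treat it as an illustration only. Requiring the 95% interval to lie inside (-bound, bound) corresponds to two one-sided tests (TOST) at level 0.025 each.

```r
# Hypothetical helper: is the confidence interval entirely inside (-bound, bound)?
ci_within <- function(est, se, df, bound, level = 0.95) {
  half <- qt(1 - (1 - level) / 2, df) * se   # half-width of the interval
  (est - half > -bound) & (est + half < bound)
}

# Same point estimate, but few samples: large standard error, wide interval
ci_within(est = 0.2, se = 0.5, df = 4, bound = 1)    # FALSE

# Same point estimate, more samples: smaller standard error, narrow interval
ci_within(est = 0.2, se = 0.1, df = 40, bound = 1)   # TRUE
```

The same estimated logFC of 0.2 passes or fails the cutoff purely because of the interval width, which is why a cutoff tuned on one dataset does not carry over to another.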
So I would say that if you have a dataset at hand, you can first run DESeq2 on it, obtain the mean-variance relationship, and simulate data as I have done. That gives a rough idea of which fold change cutoff is reasonable and how many genes you can expect to get, and you can decide from there.
I have a mixed type data set, so I wanted to try kamila clustering. It is easy to apply it, but I would like a plot to decide the number of clusters similar to knee-plot.
data <- read.csv("binarymat.csv",header=FALSE,sep=";")
conInd <- c(9)
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,
calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
From the kamila package documentation:
Setting calcNumClust to ’ps’ uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC based methods for large
sample sizes.
In your call you have specified only a single value for numClust, so you are not actually selecting the number of clusters: you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example, numClust = 2 : 7 and also the method for selecting the number of clusters.
If you also want to select the number of clusters, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.
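As for how the number is decided: the prediction strength method of Tibshirani & Walther picks the largest number of clusters whose prediction strength is still above a threshold. The sketch below shows that general rule with made-up prediction strength values and a hypothetical helper pick_k; it does not call kamila, and kamila's exact rule (for example, how it uses standard errors) may differ.

```r
# Made-up average prediction strengths for k = 2..7, for illustration only
k_values  <- 2:7
avg_ps    <- c(0.92, 0.81, 0.74, 0.62, 0.41, 0.35)
threshold <- 0.5   # plays the role of predStrThresh

# Hypothetical helper: largest k whose prediction strength reaches the threshold
pick_k <- function(k, ps, threshold) {
  ok <- ps >= threshold
  if (!any(ok)) return(1L)   # no clustering is supported
  max(k[ok])
}

best_k <- pick_k(k_values, avg_ps, threshold)
best_k

# Knee-style plot: prediction strength against the number of clusters
plot(k_values, avg_ps, type = "b",
     xlab = "Number of clusters", ylab = "Prediction strength")
abline(h = threshold, lty = 2)
```

Reading the plot, the selected value is the right-most point that still lies on or above the dashed threshold line.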
I am trying to do a maximization in R that I have previously done in Excel with the solver. The problem is that I don't know how to approach it (I don't have a good level in R).
Let me describe my data. I have 26 Swiss cantons and the Swiss government (which is the sum of the values of the 26 cantons) with their population and their "wealth". So I have 27 observations per variable. I'm not sure the following descriptions are useful, but I include them anyway. From this, I calculate some variables with while loops. For each canton [i]:
resource potential = mean(wealth2011 [i],wealth2012 [i],wealth2013 [i])
population mean = mean(population2011 [i],population2012 [i],population2013 [i])
resource potential per capita = 1000*resource potential [i]/population [i]
resource index = 100*resource potential capita [i]/resource potential capita [swiss government]
Here a little example of the kind of loops I used:
RI=0
i = 1
while(i<28){
RI[i]=resource_potential_capita[i]/resource_potential_capita[27]*100
i = i+1
}
The resource index (RI) for the Swiss government (i = 27) is 100 because we divide the resource potential per capita of the Swiss government (when i = 27) by itself and multiply by 100. Hence, all cantons that have RI>100 are rich cantons and the others (RI<100) are poor cantons. Up to here, there was no problem. I just explained how I built my dataset.
Now the problem that I face: I have to create the variable weighted difference (wd). It takes the value of:
0 if RI>100 (rich canton)
(100-RI[i])^(1+P)*Pop[i] if RI<100 (poor canton)
I create this variable like this: (sorry for the weakness of the code, I did my best).
wd=-1
i = 1
a = 0
c = 0
tot = 0
while(i<28){
if(i == 27) {
wd[i] = a
} else if (RI[i] < 100) {
wd[i] = (100-RI[i])^(1+P)*Pop[i]
c = wd[i]
a = a+c
} else {
wd[i]= 0
}
i = i+1
}
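The loop above can also be written without explicit iteration. The following is a minimal sketch with made-up numbers (4 entries instead of 27, the last one playing the role of the aggregate); weighted_difference is a hypothetical name. Wrapping the computation in a function of P is convenient because P is what gets varied later.

```r
# Hypothetical helper: weighted difference for every unit, with the last
# element replaced by the sum over all other units (the aggregate)
weighted_difference <- function(RI, Pop, P) {
  n <- length(RI)
  # pmax() makes the gap 0 for rich units (RI >= 100), so their wd is 0
  wd <- pmax(100 - RI, 0)^(1 + P) * Pop
  wd[n] <- sum(wd[-n])
  wd
}

# Made-up example data; the last entry is the aggregate with RI = 100
RI  <- c(80, 120, 90, 100)
Pop <- c(10, 20, 30, 60)

weighted_difference(RI, Pop, P = 1)
```

For P = 1 the two poor units contribute 20^2 * 10 and 10^2 * 30, and the last element holds their sum.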
However, I don't know the value of "p". It is a value between 0 and 1. To find the value of p, I have to do a maximization using the following features:
RI_26 = 65.9, it is the minimum of RI in my data
RI_min = 100-((x*wd [27])/((1+p)*z*100))^(1/p), where x and z are fixed values (x = 8'677, z = 4'075'977'077) and wd [27] the sum of wd for each canton.
We have p in two equations: RI_min and wd. To solve it in Excel, I used the Excel solver with the following features:
p_dot = RI_26/RI_min* p ==> p_dot =[65.9/100-((x* wd [27])/((1+p)*z*100))^(1/p)]*p
RI_26 = RI_min ==>65.9 =100-((x*wd [27])/((1+p)*z*100))^(1/p)
In Excel, p is my variable cell (the only value allowed to change), p_dot is my objective to define and RI_26 = RI_min is my constraint.
So I would like to maximize p and I don't know how to do this in R. My main problem is the presence of p in RI_min and wd. We need to do an iteration to solve it but this is too far from my skills.
Is anyone able to help me with the information I provided?
You should look into the optim function.
Here I will try to give you a really simple explanation since you said you don't have a really good level in R.
Assume I have a function f(x) that I want to maximize, so I want to find the parameter x that gives the maximum value of f(x).
First thing to do will be to define the function, in R you can do this with:
myfunction<- function(x) {...}
Having defined the function I can optimize it with the command:
optim(par,myfunction)
where par is the vector of initial parameters of the function, and myfunction is the function that needs to be optimized. Bear in mind that optim performs minimization, however it will maximize if control$fnscale is negative. Another strategy will be to change the function (i.e. changing the sign) to suit the problem.
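As a small illustration of both strategies, here is a sketch on a toy function whose maximum is known (f is made up for this example and has nothing to do with the canton data). For a single parameter, optimize() is an alternative to optim().

```r
# Toy function with its maximum at x = 2, where f(2) = 3
f <- function(x) -(x - 2)^2 + 3

# Strategy 1: let optim() maximize by setting a negative fnscale
res_max <- optim(par = 0, fn = f, method = "BFGS",
                 control = list(fnscale = -1))
res_max$par     # location of the maximum
res_max$value   # value of f at the maximum

# Strategy 2: minimize the negated function instead
res_min <- optim(par = 0, fn = function(x) -f(x), method = "BFGS")
res_min$par

# One-dimensional alternative: optimize() over an interval
res_1d <- optimize(f, interval = c(-10, 10), maximum = TRUE)
res_1d$maximum
```

The method = "BFGS" argument is used here because optim's default Nelder-Mead method is not recommended for one-dimensional problems.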
Hope that this helps,
Marco
From the description you provided, if I'm not mistaken, it looks like all you need to do is solve an equation.
In particular you have the following two expressions:
RI_min = 100-((x*y)/((1+p)*z*100))^(1/p)
and, since x,y,z are fixed, the only variable is p.
Moreover, setting RI_26 = RI_min yields:
65.9 =100-((x*y)/((1+p)*z*100))^(1/p)
Plugging in the values of x, y and z you have provided yields
p=0.526639915936052
I don't understand what exactly you are trying to maximize.
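A one-dimensional equation like this can be solved numerically with uniroot(). The sketch below uses the x and z from the question, but y is a hypothetical placeholder for wd[27], chosen so that the root lands near the value quoted above. In the original problem wd[27] itself depends on p, so in practice y would have to be recomputed inside ri_min for each p.

```r
x <- 8677
z <- 4075977077
y <- 4.6e8       # hypothetical placeholder for wd[27], not a value from the question
target <- 65.9   # RI_26

# RI_min as a function of p, with y held fixed
ri_min <- function(p) 100 - ((x * y) / ((1 + p) * z * 100))^(1 / p)

# Find p in (0.1, 1) such that RI_min(p) equals the target
sol <- uniroot(function(p) ri_min(p) - target,
               interval = c(0.1, 1), tol = 1e-10)
p_hat <- sol$root
p_hat
ri_min(p_hat)    # should reproduce the target
```

uniroot() needs the function to change sign over the interval, which is the case here for these inputs. With other values of y the interval may have to be adjusted.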
I'm trying to fit the information from the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)). The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variable. Thus, the information is in the correct orientation to be fitted to the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this.
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
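The large-R approximation above can be used to get starting values: regressing ln(G) on ln(R) in the tail gives a slope of about -(C+1) and an intercept of about -(C+1)*ln(B). The sketch below tries this on synthetic data generated from the model itself with known B and C (it does not use the spatstat data), so it only shows the mechanics.

```r
set.seed(42)

# Synthetic data from the model with known parameters and small noise
B_true <- 2; C_true <- 2
R <- seq(0.05, 5, length.out = 200)
G <- 1 / (1 + B_true^2 * R^2)^((C_true + 1) / 2)
G <- G * exp(rnorm(length(R), sd = 0.01))
dat <- data.frame(R = R, G = G)

# Linear fit on the tail, where B^2 * R^2 is large
tail_fit  <- lm(log(G) ~ log(R), data = subset(dat, R > 2))
intercept <- coef(tail_fit)[[1]]
slope     <- coef(tail_fit)[[2]]

# slope ~ -(C+1) and intercept ~ -(C+1)*ln(B) = slope * ln(B)
C0 <- -slope - 1
B0 <- exp(intercept / slope)

# Use the rough estimates as starting values for nls
fit <- nls(G ~ 1 / (1 + B^2 * R^2)^((C + 1) / 2),
           data = dat, start = list(B = B0, C = C0))
coef(fit)
```

How far into the tail the linear fit should start (R > 2 here) is a judgment call that depends on the scale of the data.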
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.