fuzzywuzzy package: the mathematical formula of its functions

I want to know the mathematical formulas used by the package "fuzzywuzzy" for:
- ratio
- partial_ratio
- token_set_ratio
For example, I want to be able to calculate the partial_ratio between "Mark!" and "mark".

I found this in the source code!
So:
ratio = 2.0*M / T, where M = number of matching characters and T = total number of elements in both sequences
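For the example in the question, here is a worked calculation (assuming no preprocessing: if I read the source correctly, fuzz.ratio and fuzz.partial_ratio are case-sensitive and do not strip punctuation by default):

ratio("Mark!", "mark"): the matching blocks cover only "ark", so M = 3 and T = 5 + 4 = 9, giving 2.0*3/9 ≈ 0.667, reported as 67.

partial_ratio("Mark!", "mark"): the shorter string (length 4) is scored against the best-aligned length-4 substring of "Mark!" ("Mark" or "ark!"); either way M = 3 and T = 4 + 4 = 8, giving 2.0*3/8 = 0.75, reported as 75.

token_set_ratio, by contrast, applies full_process (lower-casing and stripping non-alphanumeric characters) by default, so both strings reduce to "mark" and the score is 100.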

Optimize within for loop cannot find function

I've got a function, KozakTaper, that returns the diameter of a tree trunk at a given height (DHT). There's no algebraic way to rearrange the original taper equation to return DHT at a given diameter (4 inches, for my purposes)...enter R! (using 3.4.3 on Windows 10)
My approach was to use a for loop to iterate likely values of DHT (25-100% of total tree height, HT), and then use optimize to choose the one that returns a diameter closest to 4". Too bad I get the error message Error in f(arg, ...) : could not find function "f".
Here's a shortened definition of KozakTaper along with my best attempt so far.
KozakTaper <- function(Bark, SPP, DHT, DBH, HT, Planted) {
  if (Bark == 'ob' & SPP == 'AB') {
    a0_tap = 1.0693567631
    a1_tap = 0.9975021951
    a2_tap = -0.01282775
    b1_tap = 0.3921013594
    b2_tap = -1.054622304
    b3_tap = 0.7758393514
    b4_tap = 4.1034897617
    b5_tap = 0.1185960455
    b6_tap = -1.080697381
    b7_tap = 0
  } else if (Bark == 'ob' & SPP == 'RS') {
    a0_tap = 0.8758
    a1_tap = 0.992
    a2_tap = 0.0633
    b1_tap = 0.4128
    b2_tap = -0.6877
    b3_tap = 0.4413
    b4_tap = 1.1818
    b5_tap = 0.1131
    b6_tap = -0.4356
    b7_tap = 0.1042
  } else {
    a0_tap = 1.1263776728
    a1_tap = 0.9485083275
    a2_tap = 0.0371321602
    b1_tap = 0.7662525552
    b2_tap = -0.028147685
    b3_tap = 0.2334044323
    b4_tap = 4.8569609081
    b5_tap = 0.0753180483
    b6_tap = -0.205052535
    b7_tap = 0
  }
  p = 1.3/HT
  z = DHT/HT
  Xi = (1 - z^(1/3))/(1 - p^(1/3))
  Qi = 1 - z^(1/3)
  y = (a0_tap * (DBH^a1_tap) * (HT^a2_tap)) *
    Xi^(b1_tap * z^4 + b2_tap * exp(-DBH/HT) +
        b3_tap * Xi^0.1 + b4_tap * (1/DBH) + b5_tap * HT^Qi +
        b6_tap * Xi + b7_tap * Planted)
  return(round(y, 4))
}
HT <- .3048*85 # converting from English to metric (sorry, it's forestry)
for (i in c((HT*.25):(HT+1))) {
  d <- KozakTaper(Bark='ob', SPP='RS', DHT=i, DBH=2.54*19, HT=.3048*85, Planted=0)
  frame <- na.omit(d)
  optimize(f=abs(10.16-d), interval=frame, lower=1, upper=90,
           maximum = FALSE,
           tol = .Machine$double.eps^0.25)
}
Eventually I would like this code to iterate through a csv and return i for the best d, which will require some rearranging, but I figured I should make it work for one tree first.
When I print d I get multiple values, so it is iterating through i, but it gets held up at the optimize function.
Defining frame was my most recent tactic, because d returns one NaN at the end, but it may not be the best input for interval. I've tried interval=c((HT*.25):(HT+1)), defining KozakTaper within the for loop, and defining f prior to the optimize, but I get the same error. Suggestions for what part I should target (or other approaches) are appreciated!
-KB
Forestry Research Fellow, Appalachian Mountain Club.
MS, University of Maine
Edit with a follow-up question:
I'm now trying to run this script for each row of a csv, "Input." The row contains the values for KozakTaper, and I've called them with this:
Input=read.csv...
Input$Opt=0
o <- optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                     SPP='Input$Species',
                                                     DHT=x,
                                                     DBH=(2.54*Input$DBH),
                                                     HT=(.3048*Input$Ht),
                                                     Planted=0)),
              lower=Input$Ht*.25, upper=Input$Ht+1,
              maximum = FALSE, tol = .Machine$double.eps^0.25)
Input$Opt <- o$minimum
Input$Mht <- Input$Opt/.3048 # converting back to English
Input$Ht and Input$DBH are numeric; Input$Species is a factor.
However, I get the error invalid function value in 'optimize'. I get it whether I define "o" or just run optimize. Oddly, when I don't call values from the row but instead use the code from the answer, it tells me object 'HT' not found. I have the awful feeling this is due to some obvious/careless error on my part, but I'm not finding posts about this error with optimize. If you notice what I've done wrong, your explanation will be appreciated!
I'm not an expert on optimize, but I see three issues:
1) Your call to KozakTaper does not iterate through the range you specify in the loop.
2) KozakTaper returns a single number, not a vector.
3) You haven't given optimize a function, but an expression.
So what is happening is that you are not giving optimize anything to iterate over.
All you should need is this:
optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                SPP='RS',
                                                DHT=x,
                                                DBH=2.54*19,
                                                HT=.3048*85,
                                                Planted=0)),
         lower=HT*.25, upper=HT+1,
         maximum = FALSE, tol = .Machine$double.eps^0.25)
$minimum
[1] 22.67713 ##Hopefully this is the right answer
$objective
[1] 0
optimize() will now substitute values of x between lower and upper, trying to minimize the absolute difference.
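For the follow-up, a minimal row-wise sketch, assuming a data frame Input with columns Species, DBH (inches) and Ht (feet) as described, and a hypothetical file name "trees.csv". The key fixes are passing the current row's values rather than whole columns, and not quoting Input$Species:

Input <- read.csv("trees.csv")  # hypothetical file name
Input$Opt <- NA
for (r in seq_len(nrow(Input))) {
  ht_m <- 0.3048 * Input$Ht[r]  # feet to metres
  o <- optimize(f = function(x) abs(10.16 - KozakTaper(Bark = 'ob',
                                                       SPP = as.character(Input$Species[r]),  # unquoted, per row
                                                       DHT = x,
                                                       DBH = 2.54 * Input$DBH[r],  # inches to cm
                                                       HT = ht_m,
                                                       Planted = 0)),
                lower = ht_m * 0.25, upper = ht_m + 1,
                maximum = FALSE, tol = .Machine$double.eps^0.25)
  Input$Opt[r] <- o$minimum
}
Input$Mht <- Input$Opt / 0.3048  # metres back to feet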

Iteration / Maximization Excel solver in R

I am trying to do a maximization in R that I have previously done in Excel with the Solver. The problem is that I don't know how to approach it (I don't have a good level in R).
Let's talk a bit about my data. I have the 26 Swiss cantons and the Swiss government (which is the sum of the values of the 26 cantons), with their population and their "wealth". So I have 27 observations per variable. I'm not sure the following descriptions are useful, but I include them anyway. From these, I calculate some variables with while loops. For each canton [i]:
resource potential = mean(wealth2011 [i],wealth2012 [i],wealth2013 [i])
population mean = mean(population2011 [i],population2012 [i],population2013 [i])
resource potential per capita = 1000*resource potential [i]/population [i]
resource index = 100*resource potential per capita [i]/resource potential per capita [Swiss government]
Here a little example of the kind of loops I used:
RI = 0
i = 1
while (i < 28) {
  RI[i] = resource_potential_capita[i] / resource_potential_capita[27] * 100
  i = i + 1
}
The resource index (RI) of the Swiss government (i = 27) is 100, because we divide the resource potential per capita of the Swiss government by itself and multiply by 100. Hence, all cantons with RI > 100 are rich cantons and the others (RI < 100) are poor cantons. Up to here there was no problem; I've just explained how I built my dataset.
Now the problem that I face: I have to create the variable weighted difference (wd). It takes the value of:
0 if RI>100 (rich canton)
(100-RI[i])^(1+P)*Pop[i] if RI<100 (poor canton)
I create this variable like this: (sorry for the weakness of the code, I did my best).
wd = -1
i = 1
a = 0
c = 0
tot = 0
while (i < 28) {
  if (i == 27) {
    wd[i] = a
  } else if (RI[i] < 100) {
    wd[i] = (100 - RI[i])^(1 + P) * Pop[i]
    c = wd[i]
    a = a + c
  } else {
    wd[i] = 0
  }
  i = i + 1
}
However, I don't know the value of "p". It is a value between 0 and 1. To find the value of p, I have to do a maximization using the following features:
RI_26 = 65.9; it is the minimum of RI in my data.
RI_min = 100 - ((x*wd[27])/((1+p)*z*100))^(1/p), where x and z are fixed values (x = 8'677, z = 4'075'977'077) and wd[27] is the sum of wd over all cantons.
We have p in two equations: RI_min and wd. To solve this in Excel, I used the Excel Solver with the following setup:
p_dot = RI_26/RI_min * p ==> p_dot = [65.9/(100 - ((x*wd[27])/((1+p)*z*100))^(1/p))]*p
RI_26 = RI_min ==> 65.9 = 100 - ((x*wd[27])/((1+p)*z*100))^(1/p)
In Excel, p is my variable cell (the only value allowed to change), p_dot is my objective to define, and RI_26 = RI_min is my constraint.
So I would like to maximize p, and I don't know how to do this in R. My main problem is the presence of p in both RI_min and wd. We need to iterate to solve it, but this is too far beyond my skills.
Is anyone able to help me with the information I provided?
You should look into the optim function.
Here I will try to give you a really simple explanation since you said you don't have a really good level in R.
Assume I have a function f(x) that I want to maximize; therefore I want to find the parameter x that gives me the maximum value of f(x).
First thing to do will be to define the function, in R you can do this with:
myfunction<- function(x) {...}
Having defined the function I can optimize it with the command:
optim(par,myfunction)
where par is the vector of initial parameters of the function, and myfunction is the function that needs to be optimized. Bear in mind that optim performs minimization; however, it will maximize if control$fnscale is negative. Another strategy is to change the function (i.e. flip its sign) to suit the problem.
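For instance, a minimal toy sketch (the quadratic here is made up purely for illustration):

# maximize f(x) = -(x - 2)^2 + 5, whose maximum is at x = 2
f <- function(x) -(x - 2)^2 + 5
res <- optim(par = 0, fn = f, method = "BFGS", control = list(fnscale = -1))
res$par   # ~2
res$value # ~5

(method = "BFGS" is used because the default Nelder-Mead method warns on one-dimensional problems.)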
Hope that this helps,
Marco
From the description you provided, if I'm not mistaken, it looks like everything you need to do is solve a single equation.
In particular you have the following expression:
RI_min = 100-((x*y)/((1+p)*z*100))^(1/p)
and, since x, y and z are fixed, the only variable is p.
Moreover, imposing RI_26 = RI_min yields:
65.9 = 100-((x*y)/((1+p)*z*100))^(1/p)
Plugging in the values of x, y and z you have provided, this gives
p=0.526639915936052
I don't understand what exactly you are trying to maximize.
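If you do want to solve the equation numerically in R, it can be handed to uniroot; a minimal sketch, assuming wd[27] has already been computed by the loop above (its numeric value is not given in the question):

# x and z as given in the question; y is the wd[27] sum from the loop
x <- 8677
z <- 4075977077
y <- wd[27]
f <- function(p) 100 - ((x * y) / ((1 + p) * z * 100))^(1/p) - 65.9
# uniroot needs an interval on which f changes sign; widen it if this errors
uniroot(f, interval = c(0.01, 0.99))$root

Note that wd itself depends on p, so in practice you would wrap the wd loop and this root-finding step together and iterate until p stabilizes.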

How to leverage Convex Optimization for Portfolio Optimization in Julia

I'm trying to use Julia (0.5) and Convex.jl (with the ECOS solver) to figure out, given a portfolio of 2 stocks, how I can distribute my allocations (in percent) across both stocks such that I maximize my portfolio return and minimize my risk (std dev of returns). I want to maximize what is known as the Sharpe ratio, a quantity driven by the percentages I hold in each of the 2 stocks. So I want to MAXIMIZE the Sharpe ratio and have the solver figure out the optimal allocation (I want it to tell me I need x% of stock 1 and 1-x% of stock 2). The only real constraint is that the percent allocations sum to 100%. I have code below that runs, but does not give me the optimal weights/allocations I'm expecting (which is 36.3% for Supertech and 63.7% for Slowpoke). The solver instead comes back with 50/50.
My intuition is that I either have the objective function modeled incorrectly for the solver, or I need to do more with constraints. I don't have a good grasp on convex optimization so I'm winging it. Also, my objective function uses the variable.value attribute to get the correct output and I suspect I need to be working with the Variable expression object instead.
Question is, is what I'm trying to achieve something the Convex solver is designed for and I just have to model the objective function and constraints better, or do I have to just iterate the weights and brute force it?
Code with comments:
using Convex, ECOS
Supertech = [-.2; .1; .3; .5];
Slowpoke = [.05; .2; -.12; .09];
A = reshape([Supertech; Slowpoke],4,2)
mlen = size(A)[1]
R = vec(mean(A,1))
n=rank(A)
w = Variable(n)
c1 = sum(w) == 1;
λ = .01
w.value = [λ; 1-λ]
sharpe_ratio = sqrt(mlen) * (w.value' * R) / sqrt(sum(vec(w.value' .* w.value) .* vec(cov(A,1,false))))
# sharpe_ratio will be maximized at 1.80519 when w.value = [λ, 1-λ] where λ = .363
p = maximize(sharpe_ratio,c1);
solve!(p, ECOSSolver(verbose = false)); # when verbose=true, it says the problem is 'degenerate' because I don't have enough constraints...
println(w.value) # expecting to get [.363; .637] here but I get [0.5; 0.5]
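Regarding the brute-force fallback mentioned at the end of the question: a simple grid search over the allocation does reproduce the expected split. A sketch (written in R rather than Julia, purely for illustration; the covariance is the population version, matching cov(A, 1, false) above):

Supertech <- c(-0.20, 0.10, 0.30, 0.50)
Slowpoke  <- c(0.05, 0.20, -0.12, 0.09)
A  <- cbind(Supertech, Slowpoke)
mu <- colMeans(A)
S  <- cov(A) * (nrow(A) - 1) / nrow(A)  # population covariance
lambda <- seq(0, 1, by = 0.0005)        # allocation to Supertech
sharpe <- sapply(lambda, function(l) {
  w <- c(l, 1 - l)
  sqrt(nrow(A)) * sum(w * mu) / sqrt(drop(t(w) %*% S %*% w))
})
lambda[which.max(sharpe)]  # ~0.363
max(sharpe)                # ~1.805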

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)). The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variable. Thus, the information is in the correct orientation to be fitted by the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe occurred because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted, your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this.
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
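A sketch of that idea for getting starting values, assuming the large-B^2*R^2 regime describes most of your data (points with G equal to 0 or 1, or R = 0, are dropped because the logs are undefined or uninformative there):

# GvsR_dataframe as built in the question: columns R and G (G already reversed)
dat  <- subset(GvsR_dataframe, G > 0 & G < 1 & R > 0)
fit0 <- lm(log(G) ~ log(R), data = dat)
# in ln(G) = -(C+1)*(ln(B) + ln(R)): slope = -(C+1), intercept = -(C+1)*ln(B)
C_start <- -coef(fit0)[[2]] - 1
B_start <- exp(coef(fit0)[[1]] / coef(fit0)[[2]])
themodel <- nls(G ~ 1 / (1 + (B^2)*(R^2))^((C+1)/2),
                data = GvsR_dataframe,
                start = list(B = B_start, C = C_start))

Whether these starting values are good enough depends on how well the approximation holds for your data; the caveat above applies here too.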
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.

Help understanding a definite integral

I am trying to translate a function in a book into code, using MATLAB and C#.
I am first trying to get the function to work properly in MATLAB.
Here are the instructions:
The variables are:
xt and m can be ignored.
zMax = Maximum Sensor Range (100)
zkt = Sensor Measurement (49)
zkt* = What sensor measurement should have been (50)
oHit = Std Deviation of my measurement (5)
I have written the first formula, N(zkt;zkt*,oHit) in MATLAB as this:
hitProbabilty = (1/sqrt( 2*pi * (oHit^2) ))...
* exp(-0.5 * (((zkt- zktStar) ^ 2) / (oHit^2)) );
This gives me the Gaussian curve I expect.
I have an issue with the definite integral below: I do not understand how to turn it into a real number, because I get horrible values out of my code, which is this:
func = #(x) hitProbabilty * zkt * x;
normaliser = quad(func, 0, max) ^ -1;
hitProbabilty = normaliser * hitProbabilty;
Can someone help me with this integral? It is supposed to normalize my curve, but it just goes crazy.... (I am doing this for zkt 0:1:100, with everything else the same, and graphing the probability it should output.)
You should use the error function ERF (available in basic MATLAB)
EDIT1:
As @Jim Brissom mentioned, the cumulative distribution function (CDF) is related to the error function by:
normcdf(X) = (1 + erf(X/sqrt(2))) / 2, where X~N(0,1)
Note that NORMCDF requires the Statistics Toolbox
EDIT2:
I think there's been some confusion, judging by the comments. The above only computes the normalizing factor; if you want to compute the final probability over a certain range of values, you should do this:
zMax = 100; %# Maximum Sensor Range
zktStar = 50; %# What sensor measurement should have been
oHit = 5; %# Std Deviation of my measurement
%# p(0<z<zMax) = p(z<zMax) - p(z<0)
ncdf = diff( normcdf([0 zMax], zktStar, oHit) );
normaliser = 1 ./ ncdf;
zkt = linspace(0,zMax,500); %# Sensor Measurement, 500 values in [0,zMax]
hitProbabilty = normpdf(zkt, zktStar, oHit) * normaliser;
plot(zkt, hitProbabilty)
xlabel('z^k_t'), ylabel('P_{hit}(z^k_t)'), title('Measurement Probability')
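As a sanity check with the question's numbers (zktStar = 50, oHit = 5, zMax = 100): the range covers (100-50)/5 = 10 standard deviations on either side of the mean, so p(0 < z < zMax) = normcdf(10) - normcdf(-10) is 1 to within double precision, and the normalizing factor is essentially 1. It only starts to matter when zktStar lies within a few oHit of 0 or zMax.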
The N in your code is just the well-known Gaussian or normal distribution. I am mentioning this because, since you re-implemented it in Matlab, it seems you missed that it is already implemented there.
Integrating the normal distribution yields a cumulative distribution function, available in Matlab for the normal distribution via normcdf. The normal CDF can be written in terms of erf, which is probably what Amro was talking about.
Using normcdf avoids integrating manually.
In case you still need the result of the integral, here it is from Mathematica. The calculation is:
hitProbabilty[zkt_] := (1/Sqrt[2*Pi*oHit^2])*Exp[-0.5*(((zkt - zktStar)^2)/(oHit^2))];
Integrate[hitProbabilty[zkt], {zkt, 0, zMax}];
The result is (just for copy/paste)
((1.2533141373155001*oHit*zktStar*Erf[(0.7071067811865476*Sqrt[zktStar^2])/oHit])/
Sqrt[zktStar^2] +
(1.2533141373155001*oHit*(zMax-zktStar)*Erf[(0.7071067811865476*Sqrt[(zMax-zktStar)^2])/oHit])/
Sqrt[(zMax-zktStar)^2])/(2*oHit*Sqrt[2*Pi])
Where Erf[] is the error function
HTH!
