R ClustOfVar cutree height error

ClustOfVar is a package for clustering columns.
mtcars[] = lapply(mtcars,as.character)
fit = ClustOfVar::hclustvar(X.quali = mtcars)
labels = cutree(fit,h=0.5)
But when I call cutree on some fits, I get the following message.
Warning: Error in cutree: the 'height' component of 'tree' is not sorted (increasingly)
The data on which this problem occurs is both private and big. The fit is generated successfully. I will try my best to find a reproducible data example but if you have met this kind of problem before, please let me know how to deal with it.

With certain linkage methods, the tree can be non-monotonic, and then this error occurs. If I'm not mistaken, centroid linkage is particularly prone to this.
The problem looks like this: C is created by merging A and B. Subtree A is created at height x, subtree B at height y, and the merge into C happens at height z. Usually z > x and z > y. But with some linkages this property can be violated, e.g. z < x. If we now want to cut the tree at a height h with z < h < x, we get a contradiction: subtree C should stay merged (because z < h), but A should be split (because h < x), and A is part of C.
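A minimal base-R sketch of such an inversion, using stats::hclust with centroid linkage on three made-up points instead of ClustOfVar (the coordinates are assumptions chosen for illustration). Cutting by height fails with the message from the question, while cutting by number of clusters still runs:

```r
# Three made-up points: A and B are closest, C sits above their midpoint.
pts <- rbind(A = c(0, 0), B = c(1, 0), C = c(0.5, 0.9))

# Centroid linkage is meant to be used with squared Euclidean distances.
fit <- hclust(dist(pts)^2, method = "centroid")

print(fit$height)               # the second merge is lower than the first
print(is.unsorted(fit$height))  # TRUE: the tree is not monotone

# Cutting by height triggers the error from the question ...
res <- tryCatch(cutree(fit, h = 0.9), error = function(e) conditionMessage(e))
print(res)

# ... while cutting by the number of clusters does not check the heights.
labels <- cutree(fit, k = 2)
print(labels)
```

Whether a cut by k is meaningful on a non-monotone tree is a separate judgment call; the sketch only shows where the error comes from.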

Related

Why am I getting an error with plinear algorithm for non-linear regression in R?

I have a two-column table of X and Y values on which I would like to do non-linear regression with R.
NP delta_f_norm
3.125E-08 1.305366836
6.25E-08 0
0.000000125 3.048361059
0.00000025 2.709158322
0.0000005 2.919379441
0.000001 42.8860945
0.000002 49.75418233
0.000004 50.89313017
0.000008 50.18050031
0.000016 49.67195257
0.000032 48.89396054
0.000064 48.00787709
0.0000006 16.50229042
0.0000007 8.906829316
0.0000008 14.2697833
2.74E-08 -0.913767771
4.11E-08 -0.942489364
6.17E-08 0.586660918
9.24E-08 -0.080955695
1.387E-07 1.672777115
2.081E-07 0.880006555
3.121E-07 13.23952061
4.682E-07 44.73003305
7.023E-07 57.11640257
1.0535E-06 54.09032726
1.5802E-06 58.71029183
2.3704E-06 56.85467325
3.5556E-06 57.83003606
5.3333E-06 53.71761902
0.000008 53.55511726
I import the plain text data, normalize the Y values and change the scale on the x values:
install.packages("tidyverse")
library(tidyverse)
# load in the data points, make sure the working directory is set correctly
# I have already trimmed data manually, so it is just tab separated, x values in the left
# column, y values in the right, with the first line containing the name of the variable
bind_curve <- read_tsv("MST_data.txt")
view(bind_curve)
# normalize curve to max
# as fractional occupancy of binding sites
bind_curve$delta_f_norm <- bind_curve$delta_f_norm/max(bind_curve$delta_f_norm)
# change units from molar to micromolar
bind_curve$NP <- bind_curve$NP*1e06
# due to the way the plinear algorithm works, y values cannot be zero, so we have to change them to very small values
bind_curve$delta_f_norm[bind_curve$delta_f_norm == 0] <- 1e-10
# here Ka is the apparent Kd and n is the Hill coefficient; the starting values were
# guesstimated by looking at the data
view(bind_curve)
hill_model <- nls((delta_f_norm ~ 1/(((Ka/NP)^n)+1)), data = bind_curve, start = list(Ka=700, n=2), algorithm = "plinear")
summary(hill_model)
This gives the following error:
Error in chol2inv(object$m$Rmat()) :
element (2, 2) is zero, so the inverse cannot be computed
This makes no sense, as element (2,2) was 0 when it was imported, but I specifically overwrote it with a small non-zero value to allow inversion. Inspection of the data frame before creating the non-linear model even shows the value is not 0, so why is it reporting that it is? Is this an issue where bind_curve exists in 2 different namespaces or something? That's the only possible way I can think that this would happen.
OK, I forgot to convert the units of my initial Ka guess when I changed the units of the NP data (700 vs. 0.7), so my starting values were very far off, which must have been what caused it to fail. I don't understand what that has to do with 0 values in the data, but either way it's fixed.
A mod can delete this post. I'm a moron :p
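A rough way to see why a far-off Ka can produce a singular matrix, even though no data value is zero: when Ka is much larger than every NP value, the model column and its derivative with respect to Ka point in almost the same direction. The sketch below is a diagnostic under stated assumptions, not a reproduction of the nls internals; the helper names (hill, design, cond) and the NP grid are made up for illustration.

```r
# Hill model from the question (the linear scale factor of "plinear" is left out).
hill <- function(NP, Ka, n) 1 / ((Ka / NP)^n + 1)

# Model column plus its analytic derivatives with respect to Ka and n.
design <- function(NP, Ka, n) {
  f <- hill(NP, Ka, n)
  X <- cbind(f   = f,
             dKa = -(n / Ka) * f * (1 - f),
             dn  = f * (1 - f) * log(NP / Ka))
  # Scale each column to unit length so that only the directions matter.
  sweep(X, 2, sqrt(colSums(X^2)), "/")
}

# Condition number: ratio of the largest to the smallest singular value.
cond <- function(X) {
  d <- svd(X)$d
  max(d) / min(d)
}

NP <- 0.03125 * 2^(0:11)  # a dilution series on the rescaled axis

cond_bad  <- cond(design(NP, Ka = 700, n = 2))  # start value in the old units
cond_good <- cond(design(NP, Ka = 0.7, n = 2))  # start value in the new units
print(c(bad = cond_bad, good = cond_good))
```

A much larger condition number for Ka = 700 means the columns are nearly linearly dependent there, which is the kind of situation in which a triangular factor can end up with a zero on its diagonal. The "element (2, 2)" in the error would then refer to that factor, not to a cell of the data frame.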

How to remove two data points from a data set that have a large influence on the regression model

I have found two outlier data points in my data set but I don't know how to remove them. All of the guides that I have found online seem to emphasize plotting the data but my question does not require plotting, it only takes regression model fitting. I am having great difficulty finding out how to remove the two data points from my data set and then fitting the new data set with a new model.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method show log transformation)
# inside a formula (x1)^2 reduces to x1, so use the precomputed squares instead
y.regression=lm(y~x1+x2+x1_square+x2_square+(x1*x2))
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which(mod$y==max(mod$y))]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
I think you can (if execution time and corpus size allow it) loop over your data and copy or remove elements by your criterion to obtain the desired result, e.g. in Python-style pseudocode:
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary_outlier_value
        # keep it because it is OK :)
        corpus_list_without_outliers.append(elem)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I'd see the situation, but you can replace the if above with a removal step to avoid creating a second list, e.g.
# rebuild the list in place, dropping the outliers :(
corpus_list[:] = [elem for elem in corpus_list if elem.speed <= 10000]
print(corpus_list)
# regression algorithm after
Hope it helped you!
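In R itself, removing rows and refitting comes down to negative row indexing plus a second lm call. A minimal sketch on made-up data (the data frame dat and its two planted outliers are assumptions, standing in for lathe1):

```r
set.seed(1)
# Made-up data: a linear trend plus two planted outliers in rows 9 and 10.
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20, sd = 0.2)
dat$y[c(9, 10)] <- dat$y[c(9, 10)] + c(6, -6)

fit_all <- lm(y ~ x, data = dat)

# Rows with the two largest Cook's distances.
cd <- cooks.distance(fit_all)
drop_rows <- order(cd, decreasing = TRUE)[1:2]
print(sort(drop_rows))

# Negative indexing removes those rows; refit the same formula on the rest.
dat_trim <- dat[-drop_rows, ]
fit_trim <- lm(y ~ x, data = dat_trim)

print(nobs(fit_all))
print(nobs(fit_trim))
```

For the question's data the same pattern would presumably be lathe1[-c(9, 10), ], with the model written against the data frame columns (via the data argument) so that the refit actually uses the trimmed rows instead of the free-standing vectors y, x1 and x2.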

Optimize within for loop cannot find function

I've got a function, KozakTaper, that returns the diameter of a tree trunk at a given height (DHT). There's no algebraic way to rearrange the original taper equation to return DHT at a given diameter (4 inches, for my purposes)...enter R! (using 3.4.3 on Windows 10)
My approach was to use a for loop to iterate likely values of DHT (25-100% of total tree height, HT), and then use optimize to choose the one that returns a diameter closest to 4". Too bad I get the error message Error in f(arg, ...) : could not find function "f".
Here's a shortened definition of KozakTaper along with my best attempt so far.
KozakTaper=function(Bark,SPP,DHT,DBH,HT,Planted){
  if(Bark=='ob' & SPP=='AB'){
    a0_tap=1.0693567631
    a1_tap=0.9975021951
    a2_tap=-0.01282775
    b1_tap=0.3921013594
    b2_tap=-1.054622304
    b3_tap=0.7758393514
    b4_tap=4.1034897617
    b5_tap=0.1185960455
    b6_tap=-1.080697381
    b7_tap=0
  } else if(Bark=='ob' & SPP=='RS'){
    a0_tap=0.8758
    a1_tap=0.992
    a2_tap=0.0633
    b1_tap=0.4128
    b2_tap=-0.6877
    b3_tap=0.4413
    b4_tap=1.1818
    b5_tap=0.1131
    b6_tap=-0.4356
    b7_tap=0.1042
  } else {
    a0_tap=1.1263776728
    a1_tap=0.9485083275
    a2_tap=0.0371321602
    b1_tap=0.7662525552
    b2_tap=-0.028147685
    b3_tap=0.2334044323
    b4_tap=4.8569609081
    b5_tap=0.0753180483
    b6_tap=-0.205052535
    b7_tap=0
  }
  p = 1.3/HT
  z = DHT/HT
  Xi = (1 - z^(1/3))/(1 - p^(1/3))
  Qi = 1 - z^(1/3)
  y = (a0_tap * (DBH^a1_tap) * (HT^a2_tap)) * Xi^(b1_tap * z^4 + b2_tap * (exp(-DBH/HT)) +
    b3_tap * Xi^0.1 + b4_tap * (1/DBH) + b5_tap * HT^Qi + b6_tap * Xi + b7_tap*Planted)
  return(y=round(y,4))
}
HT <- .3048*85 #converting from english to metric (sorry, it's forestry)
for (i in c((HT*.25):(HT+1))) {
d <- KozakTaper(Bark='ob',SPP='RS',DHT=i,DBH=2.54*19,HT=.3048*85,Planted=0)
frame <- na.omit(d)
optimize(f=abs(10.16-d), interval=frame, lower=1, upper=90,
maximum = FALSE,
tol = .Machine$double.eps^0.25)
}
Eventually I would like this code to iterate through a csv and return i for the best d, which will require some rearranging, but I figured I should make it work for one tree first.
When I print d I get multiple values, so it is iterating through i, but it gets held up at the optimize function.
Defining frame was my most recent tactic, because d returns one NaN at the end, but it may not be the best input for interval. I've tried interval=c((HT*.25):(HT+1)), defining KozakTaper within the for loop, and defining f prior to the optimize, but I get the same error. Suggestions for what part I should target (or other approaches) are appreciated!
-KB
Forestry Research Fellow, Appalachian Mountain Club.
MS, University of Maine
**Edit with a follow-up question:
I'm now trying to run this script for each row of a csv, "Input." The row contains the values for KozakTaper, and I've called them with this:
Input=read.csv...
Input$Opt=0
o <- optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
SPP='Input$Species',
DHT=x,
DBH=(2.54*Input$DBH),
HT=(.3048*Input$Ht),
Planted=0)),
lower=Input$Ht*.25, upper=Input$Ht+1,
maximum = FALSE, tol = .Machine$double.eps^0.25)
Input$Opt <- o$minimum
Input$Mht <- Input$Opt/.3048 # converting back to English
Input$Ht and Input$DBH are numeric; Input$Species is factor.
However, I get the error invalid function value in 'optimize'. I get it whether I define "o" or just run optimize. Oddly, when I don't call values from the row but instead use the code from the answer, it tells me object 'HT' not found. I have the awful feeling this is due to some obvious/careless error on my part, but I'm not finding posts about this error with optimize. If you notice what I've done wrong, your explanation will be appreciated!
I'm not an expert on optimize, but I see three issues: 1) your call to KozakTaper does not iterate through the range you specify in the loop. 2) KozakTaper returns a single number, not a vector. 3) You haven't given optimize a function but an expression.
So what is happening is that you are not giving optimize anything to iterate over.
All you should need is this:
optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
SPP='RS',
DHT=x,
DBH=2.54*19,
HT=.3048*85,
Planted=0)),
lower=HT*.25, upper=HT+1,
maximum = FALSE, tol = .Machine$double.eps^0.25)
$minimum
[1] 22.67713 ##Hopefully this is the right answer
$objective
[1] 0
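optimize will now search for x between lower and upper (it combines golden section search with parabolic interpolation), trying to minimize the difference. Note that HT has to be defined before the call (HT <- .3048*85), because lower and upper refer to it.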
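For the follow-up with a csv: optimize expects the objective to return a single number, so passing whole columns such as Input$DBH likely makes it return a vector, which is what "invalid function value in 'optimize'" complains about (and SPP='Input$Species' is a literal string, not the column). One way around it is to call optimize once per row. The sketch below uses a hypothetical stand-in, toy_taper, instead of KozakTaper, and made-up tree measurements, so that it runs on its own:

```r
# Hypothetical stand-in for KozakTaper: diameter shrinks linearly with height.
toy_taper <- function(DHT, DBH, HT) DBH * (1 - DHT / HT)

# Made-up input, already in metric units (DBH in cm, HT in m).
Input <- data.frame(DBH = c(48.3, 35.6, 27.9), HT = c(25.9, 21.3, 18.3))
target <- 10.16  # 4 inches in cm

# optimize() handles one tree at a time, so call it once per row.
Input$Opt <- mapply(function(dbh, ht) {
  optimize(function(x) abs(target - toy_taper(x, dbh, ht)),
           lower = 0.25 * ht, upper = ht)$minimum
}, Input$DBH, Input$HT)

print(Input)
```

With the real function, the anonymous function inside mapply would call KozakTaper with the row's species, DBH and height, after converting each to metric units.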

OpenMDAO 1.2.0 implicit component

I'm new to OpenMDAO and I'm still learning how to formulate problems.
For a simple example, let's say I have 3 input variables with the given bounds:
1 <= x <= 10
0 <= y <= 10
1 <= z <= 10
and I have 4 outputs, defined as:
f1 = x * y
f2 = 2 * z
g1 = x + y - 1
g2 = z
My goal is to minimize f1 * g1 while enforcing the constraints f1 = f2 and g1 = g2. For example, one solution is x=3, y=4, z=6 (no idea if this is optimal).
For this simple problem, you can probably just feed the output equality constraints to the driver. However, for my actual problem it's hard to find an initial starting point that satisfies all the constraints, and as a result the optimizer fails to do anything. I figure maybe I could define y and z as states in an implicit component and have a nonlinear solver figure out the right values of y and z given x, then feed x to the optimization driver.
Is this a possible approach? If so, what will the implicit component look like in this case? I looked at the Sellar problem tutorial but I wasn't able to translate it to this case.
You could create an implicit component if you want. In that case, you would define an apply_nonlinear method in your component, which computes the residuals. That is done with the Sellar problem here.
In your case since you have a 2 equation set of residuals which are both dependent on the state variables, I suggest you create a single array state variable of length 2, call it foo (I used a new variable to avoid any confusion, but name it whatever you want!). Then you will define two residuals, one for each element of the residual array of the new state variable.
Something like:
resids['foo'][0] = params['x'] * unknowns['foo'][0] - 2 * unknowns['foo'][1]
resids['foo'][1] = params['x'] + unknowns['foo'][0] - 1 - unknowns['foo'][1]
If you wanted to keep the state variable names separate you could, and it will still work. You'll just have to arbitrarily assign one residual equation to one variable and one to the other.
Then the only thing left is to add a nonlinear solver to the group containing your implicit component, and it should work. If you choose to use a Newton solver, you'll either need to set fd_options['force_fd'] = True or define derivatives of your residuals with respect to all params and state variables.
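For this toy problem the two residuals can also be solved by hand, which gives a quick check on what the solver should converge to. A worked derivation, writing y for foo[0] and z for foo[1] and setting both residuals to zero:

```latex
\begin{aligned}
x\,y - 2z &= 0, \qquad x + y - 1 - z = 0 \\
z &= x + y - 1 \\
x\,y - 2(x + y - 1) &= 0 \;\Longrightarrow\; y\,(x - 2) = 2\,(x - 1) \\
y &= \frac{2\,(x - 1)}{x - 2}, \qquad z = x + y - 1 = \frac{x\,(x - 1)}{x - 2} \qquad (x \neq 2)
\end{aligned}
```

At x = 3 this gives y = 4 and z = 6, matching the solution quoted in the question. The states are undefined at x = 2, so a solver started near that value may struggle.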

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the information from the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)) . The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variable. Thus, the information is in the correct orientation to be fitted to the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted, your problem is most likely the starting values. There are two strategies you could use:
1) Use brute force to find starting values. See package nls2 for a function to do this.
2) Try to get a sensible guess for starting values. Depending on your values, it could be possible to linearize the model:
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
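The large-B^2*R^2 approximation above can be turned into starting values with an ordinary linear fit: the slope of ln(G) against ln(R) estimates -(C+1), and the intercept estimates -(C+1)*ln(B). A minimal sketch on simulated data (the true values B = 2, C = 2, the noise level and the cut-off R >= 2 are assumptions for illustration, not taken from the spatstat example):

```r
set.seed(42)
# Simulated data from the model with known B = 2, C = 2 and small
# multiplicative noise (made-up values, not the spatstat data).
R <- seq(0.05, 5, by = 0.05)
G <- (1 + 2^2 * R^2)^(-(2 + 1) / 2) * exp(rnorm(length(R), sd = 0.01))
dat <- data.frame(R = R, G = G)

# Large-R regime: ln(G) ~ -(C+1)*ln(B) - (C+1)*ln(R), a straight line in ln(R).
lin <- lm(log(G) ~ log(R), data = subset(dat, R >= 2))
slope <- coef(lin)[["log(R)"]]
intercept <- coef(lin)[["(Intercept)"]]

# slope = -(C+1) and intercept = -(C+1)*ln(B), so intercept/slope = ln(B).
start <- list(B = exp(intercept / slope), C = -slope - 1)
print(unlist(start))

# Use the rough estimates as starting values for the full nonlinear fit.
fit <- nls(G ~ 1 / (1 + B^2 * R^2)^((C + 1) / 2), data = dat, start = start)
print(coef(fit))
```

The linear fit is only approximate, because B^2*R^2 is not very large over the whole subset, but it only needs to land close enough for nls to take over.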
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.
