cv.tree() in R, deviance

Hello everyone!
I need help with one function in R, cv.tree. I have built a regression tree, and now I need to find the optimal alpha parameter and the corresponding subtree for pruning. I know that cross-validation is usually used for this, but I don't understand what the y-axis of the plot shows. The x-axis is the size of each tree in the sequence, and the y-axis shows cv.tree$dev. What is this value? I would be grateful for any help, and also for an explanation of how this value is calculated.
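For orientation, here is a minimal sketch of the workflow the question describes. It assumes the tree package is installed and uses the built-in iris data purely as a stand-in for the asker's data. As far as I understand the package, dev is the deviance summed over the held-out folds, which for a regression tree is the sum of squared prediction errors on observations that were left out when the tree was grown.

```r
# Assumes the 'tree' package is installed: install.packages("tree")
library(tree)

set.seed(123)  # the fold assignment in cv.tree() is random

# Regression tree on stand-in data (iris is only an illustration)
fit <- tree(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# K-fold cross-validation along the cost-complexity pruning sequence
cv <- cv.tree(fit, FUN = prune.tree, K = 10)

# cv$size: number of terminal nodes of each subtree in the sequence
# cv$k:    cost-complexity parameter (alpha) for each subtree; the first is -Inf
# cv$dev:  cross-validated deviance for each subtree
plot(cv$size, cv$dev, type = "b",
     xlab = "size (terminal nodes)", ylab = "cross-validated deviance")

# Pick the subtree with the smallest cross-validated deviance
best      <- which.min(cv$dev)
best_size <- cv$size[best]
best_k    <- cv$k[best]
pruned    <- prune.tree(fit, best = best_size)
```

Here best_k plays the role of the alpha in the question, and pruned is the corresponding subtree.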

Related

Single linkage hierarchical clustering - boxplots on height of the branches to detect outliers

Before running k-means clustering for consumer segmentation, I want to identify and delete the outliers in my sample. I tried hierarchical clustering with the single linkage algorithm. The problem is that my sample has more than 800 cases, so in my plot (a single linkage dendrogram) the labels are printed on top of each other and are not readable. It is therefore impossible for me to identify the outliers clearly just by looking at the graph :-/
Here they say you can create boxplots based on the branch distance to identify outliers in a more objective way. I thought that would also be a great way to make the row numbers of the outliers in my dataset readable, but I am struggling to create the boxplots.
https://link.springer.com/article/10.1186/s12859-017-1645-5/figures/3
Does anyone know how to write the code to get the boxplots based on the height of the branches?
This is the code I use for clustering; the plot is attached.
dr_dist <- dist(dr_ma_cluster[, 148:154])
hc_dr <- hclust(dr_dist, method = "single") # single linkage
plot(hc_dr, labels = row.names(dr_ma_cluster))
This is my failed attempt at the boxplot, as I don't know how to address the branch height:
> boxplot(hc_dr)
Error in x[floor(d)] + x[ceiling(d)] :
non-numeric argument for binary operator
> boxplot(hc_dr[,c(148:154)])
Error in hc_dr[, c(148:154)] : Incorrect number of dimensions
Here is another way to draw the graph (with an automated outlier detection approach), but it makes the readability even worse with large datasets.
Another piece of code to plot the tree, even less readable for large datasets:
Delete outliers automatically of a calculated agglomerative hierarchical clustering data
Thanks for any help!!
boxplot(hc_dr$height) as suggested by StupidWolf was the simple thing I was looking for.
Unfortunately, I did not manage to label the outlier dots with the row names from the original data frame. The row names from the branch height table were useless, as they were assigned in ascending order.
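One possible way to recover the row names behind the flagged heights is to go through the merge matrix of the hclust object, where negative entries are single observations joined at that step. The sketch below uses synthetic data with hypothetical row names (case1, case2, ...), not the asker's dr_ma_cluster, and assumes that an outlier joins the tree late as a single observation, which is typical for single linkage.

```r
# Synthetic data: 30 ordinary points plus two far-away points (illustration only)
set.seed(1)
x <- rbind(matrix(rnorm(60), ncol = 2), c(8, 8), c(-9, 7))
rownames(x) <- paste0("case", seq_len(nrow(x)))

hc <- hclust(dist(x), method = "single")

# Merge heights that the boxplot rule (1.5 * IQR) flags as outliers
out_heights <- boxplot.stats(hc$height)$out

# Merge steps that happen at one of the flagged heights
steps <- which(hc$height %in% out_heights)

# In hc$merge, a negative entry -j means observation j was joined at that step
idx <- unlist(lapply(steps, function(i) {
  m <- hc$merge[i, ]
  -m[m < 0]
}))
outlier_cases <- hc$labels[idx]

boxplot(hc$height, main = "Single linkage merge heights")
print(outlier_cases)
```

With real data, hc$labels carries the row names of the matrix passed to dist(), so outlier_cases can be used directly to drop rows before k-means.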
hang = 0.0001 gave the dendrogram a better look, but the labels were still unreadable because they still overlapped.
If anyone has a similar problem, check R Shiny, zoomable dendrogram program
The code given in the answer there was very easy to adapt and produces a zoomable dendrogram, which makes it easy to identify the relevant cases (the outliers). For details, search for dendextend, as proposed by csgroen.
Together, the boxplot and this tool let me identify the row names of the outliers after single linkage clustering, so that I could delete them before k-means clustering.

Does this curve represent non-linearity in my residuals vs fitted plot? (simple linear regression)

Hi,
I am running a simple linear regression model in R at the moment and wanted to check my assumptions. As seen by the plot, my red line does not appear to be flat and instead curved in places.
I am having a little difficulty interpreting this - does this imply non-linearity? And if so, what does this say about my data?
Thank you.
The observation marked 19 on your graph (bottom right corner) seems to have considerable influence and is pulling your line down more than the other points are pulling it up. The relationship looks linear overall. Either increasing the sample size, so that a single point carries less weight, or removing the outlier(s) should fix the problem without compromising the story your data is telling you, and should give you the flat line you're looking for.
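A quick way to check whether a single observation is driving the curve is to look at Cook's distance and refit without that point. The sketch below uses simulated data in which observation 19 is deliberately shifted; the variable names are made up for illustration.

```r
# Simulated data with a truly linear relationship (illustration only)
set.seed(42)
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20, sd = 0.5)
d$y[19] <- d$y[19] - 6   # deliberately turn observation 19 into an outlier

fit <- lm(y ~ x, data = d)
plot(fit, which = 1)     # residuals vs fitted, with the red smoother

# Cook's distance measures how much each point moves the fitted line
cd <- cooks.distance(fit)
most_influential <- unname(which.max(cd))

# Refit without the suspicious point and compare the diagnostic plot
fit_wo <- lm(y ~ x, data = d[-most_influential, ])
plot(fit_wo, which = 1)
```

If the smoother flattens after the refit, the curvature came from that point rather than from a non-linear relationship.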

Value for mixture distributions crossing using 'mixdist' in R

I have a plot (below) generated using the package "mixdist" and would like to know the exact value at which the two distributions cross one another rather than just estimating from the plot. I haven't come across this in any of the output information. Can this be obtained through mixdist?
Thanks for any help
Use the locator() function: click on the point in the graph, then press the 'Esc' key to get the coordinates.
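If a numeric answer is preferred over clicking, the crossing can also be found by root-finding on the difference of the weighted component densities. This is a different technique from locator(), sketched here with base R's uniroot(). The proportions, means and standard deviations below are hypothetical placeholders for two normal components; in practice they would be taken from the fitted mixture.

```r
# Hypothetical fitted parameters of a two-component normal mixture
p     <- c(0.4, 0.6)    # mixing proportions
mu    <- c(10, 15)      # component means
sigma <- c(2, 1.5)      # component standard deviations

# Difference between the two weighted densities; zero where the curves cross
f <- function(x) {
  p[1] * dnorm(x, mu[1], sigma[1]) - p[2] * dnorm(x, mu[2], sigma[2])
}

# Look for the crossing between the two means, where the sign of f changes
cross <- uniroot(f, interval = mu, tol = 1e-8)$root
print(cross)
```

Two normal curves with different standard deviations can cross twice, so the search interval matters; the interval between the means picks the crossing that is usually of interest.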

Forest plot from cox object

Please be tolerant :) I am a dummy user of R, and I am using the code and sample data from a previous post to learn how to make the forest plot shown there:
Optimal/efficient plotting of survival/regression analysis results
I was wondering whether it is possible to set a user-defined x-axis scale with the code shown there. So far the x-axis scale is chosen automatically.
Thank you for any tips.
I'm unimpressed with the precision of the documentation, since one might assume that the limits argument takes values on the relative risk scale rather than on the log-transformed scale. One gets a ridiculous result if that is done. That quibble notwithstanding, it's relatively easy to use that parameter to create an expanded plot:
install.packages('devtools') # then use it to get the current package
# executing the install and load of the package referenced at the top of that answer
print(forest_model(lung_cox, limits=log( c(.5, 50) ) ))
Trying for a lower limit of 0 on the relative risk scale is not sensible: it would imply a -Inf value on the log-transformed scale. Trying for a lower value, say log(0.001), confuses the pretty printing of the scale in my tests.

Splitting lme residual plot into separate boxplots

Using the basic plot function (plot.intervals.lmList) from an lme model (called meef1), I produced a massive graph of boxplots. My vector v2andv3commoditycombined has 98 levels.
plot(meef1, v2andv3commoditycombined~resid(.))
I would like to separate by the grouping values of my variable v2andv3commoditycombined to either graph them separately, order them, or exclude some. I'm not sure if there is code to do this or if I have to extract information from the lme output. If that is the case, I'm not sure what to extract to create the boxplots as extracting the residuals returns only one value for each level. If this is impossible, any advice on how to space out the commodity names would be equally helpful.
Thank you.
For each level of v2andv3commoditycombined, what exactly would you like your Y axis and your X axis to be? Since you're splitting the plots by v2andv3commoditycombined, you obviously can't also use that as one of your axes.
Let's pretend you just want the traditional plot of residuals on the Y axis against fitted values on the X axis, in a separate panel for each of the 98 levels. You can change the code to plot whatever it is you actually want to plot.
As per ?plot.lme, you would do something like this:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined);
Make sure you stretch out your plot window beforehand so that it's nice and big, otherwise you might get an error saying something about margins. The following might produce a better-looking plot:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined,pch='.',cex=1.5,abline=0);
Since it wasn't clear from your question I went ahead and assumed you're interested in the individual level residuals (i.e. how much each datapoint differs from the predicted value given its random variables), and that you have one level of nesting in your random formula. If you want population residuals (i.e. how much each datapoint differs from the average predicted value), change both instances of level to say level=0. If you have K levels of nesting, change them to level=K and good luck.
I also assumed you wanted standardized residuals (because you can use the convenient rule of thumb that absolute values greater than 3 are possible outliers, regardless of what scale the original data are on). If not, see ?residuals.lme for other valid options for the type argument.
Oh, and the names of your variables suggest that you're looking at some sort of financial time series. If so, have a look at ACF(meef1) to see if there is a lot of autocorrelation. If there is, you could remedy it by instead fitting a model where the response (Y) variable is diff(...) of the original variable. If you're seeing really skewed residuals, you might consider log-transforming your response variable before taking the diff.
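The log-then-diff transformation has to be done within each group, so that the last value of one series is never differenced against the first value of the next. A minimal base R sketch follows. The data frame and its columns (commodity, time, price) are hypothetical, and the rows are assumed to be sorted by time within each commodity.

```r
# Hypothetical long-format data: one price series per commodity
set.seed(7)
d <- data.frame(
  commodity = rep(c("A", "B"), each = 6),
  time      = rep(1:6, times = 2),
  price     = exp(3 + cumsum(rnorm(12, sd = 0.1)))
)

# Log-difference within each commodity; the first observation of each
# series has no predecessor, so it becomes NA
d$dlog_price <- ave(log(d$price), d$commodity,
                    FUN = function(v) c(NA, diff(v)))

# Rows usable as the response in the refitted model
d_model <- d[!is.na(d$dlog_price), ]
```

The transformed column would then replace the original response in the lme call, after which ACF() can be checked again.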
