Normalise positively skewed data with a heavy-tailed distribution - r

I am trying to figure out how to normalise some positively skewed data.
data
I really need it to have some semblance of a normal distribution, but I have already tried log-transforming it and that simply does not work. I get this kind of distribution:
log.data
I also tried sqrt(), but still no joy.
Should I just get rid of some of the extreme values on the tail? Why is log() not really doing much in terms of normalising my data?

Log-transforming your data won't necessarily make it unskewed, but it does compress the range of the data along the axis it is applied to. Read this paper about using log transformations.
In your case, according to your image, a simple log transformation compressed your x-axis from a range of about 1.2e+07 to a range of about 0.2.
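To make that point concrete, here is a minimal sketch on simulated data (an assumption, since the asker's data is not available). It draws heavy-tailed Pareto values, log-transforms them, and compares range and sample skewness. The `skewness` helper is defined here for illustration and is not from any package.

```r
# Simulated heavy-tailed data (an assumption; not the asker's data).
set.seed(1)
x <- 1 / runif(5000)   # Pareto with shape 1: very heavy right tail
log_x <- log(x)        # exponentially distributed, so still right-skewed

# Sample skewness: third central moment divided by the cubed standard deviation.
# (Hypothetical helper written for this example.)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

range_raw <- diff(range(x))
range_log <- diff(range(log_x))
skew_raw  <- skewness(x)
skew_log  <- skewness(log_x)

print(c(range_raw = range_raw, range_log = range_log))
print(c(skew_raw = skew_raw, skew_log = skew_log))
```

The log shrinks the range a great deal, yet the transformed values remain clearly right-skewed, which is the behaviour described above.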

Related

Does this curve represent non-linearity in my residuals vs fitted plot? (simple linear regression)

Hi,
I am running a simple linear regression model in R at the moment and wanted to check my assumptions. As the plot shows, my red line is not flat and instead curves in places.
I am having a little difficulty interpreting this - does this imply non-linearity? And if so, what does this say about my data?
Thank you.
The observation marked 19 on your graph (bottom right corner) seems to have significant influence and is pulling your line down more than the other points are pulling it up. The relationship looks linear overall. Either diluting that outlier by increasing the sample size (law of large numbers) or removing the outlier(s) should fix your problem without compromising the story your data is trying to tell you, and should give you the nice graph you're looking for.
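One way to check whether a single observation is driving the curve is to look at Cook's distance and refit without that point. The sketch below uses simulated data (an assumption, with a hypothetical outlier placed at observation 19 to mirror the plot), not the asker's data.

```r
# Simulated data (an assumption): a clean linear trend plus one shifted point.
set.seed(42)
dat <- data.frame(x = 1:20)
dat$y <- 2 * dat$x + rnorm(20)
dat$y[19] <- dat$y[19] - 15   # hypothetical outlier at observation 19

fit <- lm(y ~ x, data = dat)

# Cook's distance measures how much each observation moves the fitted values.
cd <- cooks.distance(fit)
most_influential <- unname(which.max(cd))
print(most_influential)

# Refit without that observation and compare the slopes.
fit_drop <- lm(y ~ x, data = dat[-most_influential, ])
slope_with    <- coef(fit)[["x"]]
slope_without <- coef(fit_drop)[["x"]]
print(c(with_point = slope_with, without_point = slope_without))
```

If the slope and the residuals-vs-fitted line change noticeably once the point is dropped, the curvature is more likely due to that observation than to genuine non-linearity. `plot(fit, which = 1)` redraws the residuals-vs-fitted plot for either model.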

Applying window constraint in DTW

I use the dtw package in R.
I am comparing simulation results against experimental results.
I need to adjust only for the time shift (not the scale of either the x-axis or the y-axis), because the timing of events can differ a little.
I use the following call:
alignment<-dtw(exp, sim, keep=TRUE,
window.type='sakoechiba',
window.size=1);
Even though I set window.size=1, points that are quite far from each other are still matched together.
How can I fix this issue? Thanks.

Forest plot from cox object

Please be tolerant :) I am a beginner with R, and I am using the code and sample data from a previous post to learn how to make a forest plot:
Optimal/efficient plotting of survival/regression analysis results
I was wondering whether it is possible to set a user-defined x-axis scale with the code shown there. So far the x-axis scale seems to be set automatically.
Thank you for any tips.
I'm unimpressed with the precision of the documentation, since one might assume that the limits argument takes values on the relative risk scale rather than on the log-transformed scale. One gets a ridiculous result if that is done. That quibble notwithstanding, it's relatively easy to use that parameter to create an expanded plot:
install.packages('devtools') # then use it to get the current package
# executing the install and load of the package referenced at the top of that answer
print(forest_model(lung_cox, limits=log( c(.5, 50) ) ))
Trying for a lower limit of 0 on the relative risk scale is not sensible: it would imply a value of -Inf on the log-transformed scale. Trying for a very low value, say log(0.001), confuses the pretty-printing of the scale in my tests.
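The arithmetic behind those limits can be checked in plain R without the plotting package. This is only a sketch of the log/exp round trip; the variable names are made up for illustration.

```r
# Axis limits expressed on the log scale, as in the call above.
limits <- log(c(0.5, 50))
print(limits)

# Back-transforming recovers the limits on the relative-risk scale.
back <- exp(limits)
print(back)

# A lower limit of 0 on the relative-risk scale has no finite logarithm.
lower_zero <- log(0)
print(lower_zero)   # -Inf
```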

how to fit baseline/background in R

I am trying to fit the background shape in NMR spectra. For this I have been using the loess function so far.
First I try to identify all the peaks (which works more or less) and remove them from the spectrum. Then I try to fit the rest of the spectrum with the loess function.
My problem now is that if the removal of peaks doesn't work perfectly there are still some points left which are clearly not background.
Is there a way to tell the fit not to go over the data, i.e. having the fitted line always below the data points (which is clearly what you want from a baseline)? My hope is that, if I am able to constrain the fit to be below the data points I can find suitable parameters, so that the remaining points from the peaks are ignored.
Thanks
John

Splitting lme residual plot into separate boxplots

Using the basic plot function (plot.intervals.lmList) from an lme model (called meef1), I produced a massive graph of boxplots. My vector v2andv3commoditycombined has 98 levels.
plot(meef1, v2andv3commoditycombined~resid(.))
I would like to separate by the grouping values of my variable v2andv3commoditycombined to either graph them separately, order them, or exclude some. I'm not sure if there is code to do this or if I have to extract information from the lme output. If that is the case, I'm not sure what to extract to create the boxplots as extracting the residuals returns only one value for each level. If this is impossible, any advice on how to space out the commodity names would be equally helpful.
Thank you.
For each level of v2andv3commoditycombined, what exactly would you like your Y axis and your X axis to be? Since you're splitting the plots by v2andv3commoditycombined, you obviously can't also use that as one of your axes.
Let's pretend you just want to do the traditional plot, with residuals on the Y axis and fitted values on the X axis, in a separate panel for each of the 98 levels. You can change the code to plot whatever it is you actually want to plot.
As per ?plot.lme, you would do something like this:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined);
Make sure you stretch out your plot window beforehand so that it's nice and big, otherwise you might get an error saying something about margins. The following might produce a better-looking plot:
plot(meef1,resid(.,type='pearson',level=1)~fitted(.,level=1)|v2andv3commoditycombined,pch='.',cex=1.5,abline=0);
Since it wasn't clear from your question, I went ahead and assumed you're interested in the individual-level residuals (i.e. how much each data point differs from the predicted value given its random effects), and that you have one level of nesting in your random formula. If you want population residuals (i.e. how much each data point differs from the average predicted value), change both instances of level to say level=0. If you have K levels of nesting, change them to level=K and good luck.
I also assumed you wanted standardized residuals (because you can use the convenient rule of thumb that absolute values greater than 3 are possible outliers, regardless of what scale the original data are on). If not, see ?residuals.lme for other valid options for the type argument.
Oh, and the name of your variables suggests that you're looking at some sort of financial time series. If so, have a look at ACF(meef1) to see if there is a lot of autocorrelation. If there is, you could remedy it by instead fitting a model where the response (Y) variable is diff(...) the original variable. If you're seeing really skewed residuals, you might consider log-transforming your response variable before taking the diff.
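For anyone who wants to try this without the original data, here is a minimal sketch using the `Orthodont` dataset that ships with nlme. That dataset, the model formula, and the three subjects kept at the end are assumptions standing in for `meef1` and `v2andv3commoditycombined`. It also shows that the residuals can be extracted per observation and then subset by group for ordinary boxplots.

```r
library(nlme)

# Stand-in model (an assumption): random intercept per Subject.
fm <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Subject)

# Standardized residuals vs fitted values, one panel per grouping level.
p <- plot(fm,
          resid(., type = "pearson", level = 1) ~ fitted(., level = 1) | Subject,
          abline = 0)
print(p)

# One residual per observation, paired with its group, so that groups
# can be ordered or excluded before plotting.
res_df <- data.frame(
  res = as.numeric(resid(fm, type = "pearson", level = 1)),
  grp = Orthodont$Subject
)
keep <- c("M01", "M02", "F01")   # hypothetical subset of levels
sub_df <- droplevels(res_df[res_df$grp %in% keep, ])
boxplot(res ~ grp, data = sub_df, horizontal = TRUE,
        xlab = "Pearson residual", ylab = "Subject")
```

With 98 levels, the same subsetting step can be repeated in chunks of a few levels at a time so that each plot stays readable.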

Resources