Transform two-sided skewed data into a normal distribution in R

I have some two-sided skewed data like the sample generated below, and I don't know how to transform it so that it is normally distributed or homoscedastic. I have tried several transformations (log, log+1, exponential, sqrt), but nothing seems to work. Any help will be greatly appreciated. TIA.
x <- rep(0,20)
y <- rep(1,50)
z <- replicate(50,max(sample(1000,2,replace=TRUE)))
z <- round(z/max(z),2)
hist(c(x,y,z))
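One observation that may explain why none of those transformations help, offered as a sketch rather than a definitive answer: the sample contains 20 exact zeros and at least 50 exact ones, and any monotone transformation maps tied values to tied values, so the two spikes survive. The snippet below applies a rank-based inverse normal transform (my choice for illustration, not something from the question; the names v and rint are hypothetical) and checks that the share of observations sitting at the minimum is unchanged.

```r
set.seed(1)
x <- rep(0, 20)
y <- rep(1, 50)
z <- replicate(50, max(sample(1000, 2, replace = TRUE)))
z <- round(z / max(z), 2)
v <- c(x, y, z)

# Rank-based inverse normal transform: ranks mapped to normal quantiles.
# Tied values share one (average) rank, so they share one transformed value.
rint <- qnorm((rank(v, ties.method = "average") - 0.5) / length(v))

# Share of observations at the lower spike before and after transforming
share_before <- mean(v == min(v))
share_after  <- mean(rint == min(rint))
c(before = share_before, after = share_after)

hist(rint)  # the spikes at both ends are still there
```

If the spikes are a real feature of the data, modelling them explicitly may be more useful than looking for a transformation, but that depends on what the analysis needs.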

Related

PCA scores vs. Varimax-rotated PCA scores

I have performed PCA using prcomp in R with my databases of 75-76 indicator variables and 7232 companies, including NAs. Before applying the function, I centred my data, but did not rescale them because they are all indicator variables. (Is my reasoning correct?)
After that I varimax-rotated the loadings of the 2 or 3 first principal components following the instructions by amoeba here.
Since I had centred, but not rescaled my data, I changed the code to:
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
VarimaxLoadings <- Varimax_results$loadings  # rotated loadings
invLoadings <- t(pracma::pinv(VarimaxLoadings))
scores <- scale(DatosPCA, scale = FALSE) %*% invLoadings
Now I am trying to figure out why the scores given by "prcomp" and the scores obtained using the code above are not the same.
I am probably missing some theoretical background, so I would be grateful if someone could tell me if the scores are supposed to be the same and, in that case, what I am doing wrong in my code. If they are not supposed to be the same, which ones should I use?
Thank you very much!
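A minimal sketch of why the two sets of scores differ, run on the built-in iris data because DatosPCA is not available (iris, k = 2 and all variable names below are assumptions for illustration). If I read the algebra correctly, scores computed through the inverse of the rotated loadings are the standardized prcomp scores multiplied by the varimax rotation matrix, so they are not expected to match the unrotated, unstandardized values in pca$x. The pseudo-inverse is written out in base R here instead of using pracma::pinv.

```r
# Centre but do not rescale, as in the question
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
pca <- prcomp(X, center = FALSE, scale. = FALSE)
k <- 2  # number of components to rotate (hypothetical choice)

# Raw loadings: eigenvectors scaled by the component standard deviations
rawLoadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k], k, k)
rot <- varimax(rawLoadings, normalize = FALSE)
rotatedLoadings <- unclass(rot$loadings)

# Pseudo-inverse of a full-column-rank matrix: (L'L)^{-1} L'
invLoadings <- t(solve(crossprod(rotatedLoadings), t(rotatedLoadings)))
rotatedScores <- X %*% invLoadings

# Standardized prcomp scores, then rotated by the varimax rotation matrix
standardizedScores <- sweep(pca$x[, 1:k], 2, pca$sdev[1:k], "/")
same <- all.equal(rotatedScores, standardizedScores %*% rot$rotmat,
                  check.attributes = FALSE)
same
```

Which set to use depends on the goal: the prcomp scores go with the unrotated components, while the rotated scores go with the varimax-rotated loadings.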

Exponential regression with R (and negative values)

I am trying to fit an exponential curve to a set of data points, without success.
plot(time, val) # look at data
exponential.model <- lm(log(val) ~ time) # compute model
fit <- exp(predict(exponential.model, list(time = time))) # create the fitted curve
plot(time, val) # plot it again
lines(time, fit, lwd = 2) # show the fitted line
My only problem is that my data contains negative values, so log(val) produces a lot of NaN values and the model computation crashes.
I know that my data does not necessarily look exponential, but I want to see the fit anyway. Another program shows me that val=27.1331*exp(-time/2.88031) is a nice fit, but I do not know what I am doing wrong in R.
I want to compute it with R.
I had the idea of shifting the data so that no negative values remain, but the result is poor and almost certainly wrong.
plot(time, val + 20) # look at data
exponential.model <- lm(log(val + 20) ~ time) # compute model
fit <- exp(predict(exponential.model, list(time = time))) # create the fitted curve
plot(time, val) # plot it again
lines(time, fit - 20, lwd = 2) # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val) ~ time) # compute model
The log(val) term rescales the values so that a linear model can be applied. Since this is not possible for my values, a non-linear model (nls) has to be used instead.
exponential.model <- nls(val ~ a*exp(b*time), start = c(a = 30, b = -0.1))
This worked fine for me.
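For anyone who wants to try this without the original data, here is a self-contained sketch on simulated values. The simulation is an assumption: it uses the curve reported by the other program (27.1331*exp(-time/2.88031)) plus Gaussian noise, which pushes some values below zero. Because nls works on val directly, the negative values are not a problem.

```r
set.seed(42)
time <- seq(0, 20, by = 0.5)
# Simulated data (assumption): the reported curve plus noise
val <- 27.1331 * exp(-time / 2.88031) + rnorm(length(time), sd = 1)

# Non-linear least squares on the untransformed values; no log() needed
exponential.model <- nls(val ~ a * exp(b * time),
                         start = list(a = 30, b = -0.1))
est <- coef(exponential.model)
est

plot(time, val)                                   # look at data
lines(time, predict(exponential.model), lwd = 2)  # show the fitted curve
```

Note that b here corresponds to -1/2.88031 in the other program's parameterization. The starting values matter for nls; the ones above are taken from the answer and happen to be close enough for this simulated data.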
(Image: satisfying fit)

wavelet package Inverse DWT fails to reconstruct series?

I'm using the wavelets package, and noticed that when I try
library("wavelets")
x <- rnorm(100)
y <- idwt(dwt(x))
plot(x, y)
the reconstruction y is apparently not equal to the original x.
Is this to be expected?
For some context, I'm trying to do a (regularized) logistic regression using the wavelet transforms of a bunch of series. I then want to map the regression coefficient back into the original time series space, to see which time points were used in the discrimination.
But I can't seem to even reconstruct the original series. I might be completely misunderstanding things, can anyone shed some light on this?
Following the help file ?dwt, you can modify your script as follows:
library(wavelets)
set.seed(42)
x <- rnorm(100)
y <- idwt(dwt(x, n.levels=3, boundary="reflection", fast=FALSE))
plot(x, y)
abline(0,1)

Unable to plot PCA data in R. Are scores defined by a given object/name to plot them specifically?

I have run a simple PCA function using code that was passed down through the institution. It outputs the scores, loadings, eigenvalues, % eigenvalues, number of principal components, column means, standard deviations, and finally the starting data. In the output file the scores are preceded by [[1]]. I am trying to plot these scores, but I am unsure how to get at the data from this point. I assumed the scores were assigned to this [[1]], or that something in the code defined them. The relevant part of the code is presented below:
# perform pca on x
x.svd <- svd(x);
x.R <- x.svd$u %*% diag(x.svd$d);
x.C <- t(x.svd$v);
x.EV <- x.svd$d * x.svd$d
x.EVpct <- x.EV/sum(x.EV);
x.EV <- x.EV[1:sm];
x.EVpct <- x.EVpct[1:sm];
x.CumEVpct <- x.EVpct;
x.R is the part of the code that computes the scores, but that will not work with the plot function either. Hopefully someone understands what I am struggling to ask. Any help is much appreciated. Thank you for your time.
The easiest thing to do would be:
pc <- prcomp(x)
plot(pc$x[, 1:2])
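To connect this with the svd-based code in the question, here is a small sketch on random data (the data and the names xc, s and scores are assumptions for illustration). On column-centred data, u %*% diag(d) gives the same scores as prcomp's x, up to the sign of each column, and either matrix can be plotted by selecting two columns. The [[1]] in the printed output suggests the function returns a list, in which case the scores would presumably be reachable as result[[1]], where result is whatever name the function's output was assigned to.

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)

# Centre the columns, then take the SVD as in the question's code
xc <- scale(x, center = TRUE, scale = FALSE)
s <- svd(xc)
scores <- s$u %*% diag(s$d)  # same construction as x.R

# prcomp centres by default; its scores agree up to column signs
pc <- prcomp(x)
agree <- all.equal(abs(scores), abs(unname(pc$x)))
agree

# Plot the first two score columns directly
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")
```

If the original x was not centred before svd, the scores will differ from prcomp's, which is one more reason to prefer prcomp unless the legacy output is needed.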

Generate stochastic random deviates from a density object with R

I have a density object dd created like this:
x1 <- rnorm(1000)
x2 <- rnorm(1000, 3, 2)
x <- rbind(x1, x2)
dd <- density(x)
plot(dd)
Which produces this very non-Gaussian distribution:
(Image: http://www.cerebralmastication.com/wp-content/uploads/2009/09/nongaus.png)
I would ultimately like to get random deviates from this distribution similar to how rnorm gets deviates from a normal distribution.
The way I am trying to crack this is to get the CDF of my kernel and then get it to tell me the variate if I pass it a cumulative probability (inverse CDF). That way I can turn a vector of uniform random variates into draws from the density.
It seems like what I am trying to do should be something basic that others have done before me. Is there a simple way or a simple function to do this? I hate reinventing the wheel.
FWIW I found this R Help article but I can't grok what they are doing and the final output does not seem to produce what I am after. But it could be a step along the way that I just don't understand.
I've considered just going with a Johnson distribution from the SuppDists package, but Johnson won't give me the nice bimodal hump which my data has.
Alternative approach:
sample(x, n, replace = TRUE)
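A possible refinement, sketched under the assumption that the default Gaussian kernel was used: resampling x alone only ever returns observed values, whereas adding kernel noise with standard deviation equal to the bandwidth dd$bw gives draws from the kernel density estimate itself (a smoothed bootstrap). The sample size n and the name draws are hypothetical.

```r
set.seed(1)
x <- c(rnorm(1000), rnorm(1000, 3, 2))
dd <- density(x)  # default Gaussian kernel; dd$bw is the kernel's sd

n <- 5000
# Pick a data point at random, then jitter it with the kernel
draws <- sample(x, n, replace = TRUE) + rnorm(n, mean = 0, sd = dd$bw)

plot(dd)                               # the original density estimate
lines(density(draws), lty = 2)         # density of the simulated draws
```

This avoids building a CDF at all, at the cost of needing the original data and not just the density object.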
This is just a mixture of normals. So why not something like:
rmnorm <- function(n, mean, sd, prob) {
  nmix <- length(mean)
  if (length(sd) != nmix) stop("lengths should be the same.")
  y <- sample(1:nmix, n, prob = prob, replace = TRUE)
  mean.mix <- mean[y]
  sd.mix <- sd[y]
  rnorm(n, mean.mix, sd.mix)
}
plot(density(rmnorm(10000,mean=c(0,3), sd=c(1,2), prob=c(.5,.5))))
This should be fine if all you need are samples from this mixture distribution.
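For the inverse-CDF route described in the question, a rough sketch that works from the density object alone: accumulate dd$y into an approximate CDF on the grid dd$x, invert it by linear interpolation, and feed it uniform random numbers. The grid size n = 2048 and the names cdf, inv_cdf and draws are assumptions, and the result is only as accurate as the grid.

```r
set.seed(1)
x <- c(rnorm(1000), rnorm(1000, 3, 2))
dd <- density(x, n = 2048)

# Approximate CDF on the density grid, normalized to end at 1
cdf <- cumsum(dd$y) / sum(dd$y)

# Drop repeated CDF values so the inverse is well defined
keep <- !duplicated(cdf)
inv_cdf <- approxfun(cdf[keep], dd$x[keep], rule = 2)

# Uniform variates in, draws from the estimated density out
draws <- inv_cdf(runif(5000))

plot(dd)
lines(density(draws), lty = 2)
```

rule = 2 makes uniform values below the first CDF point map to the left edge of the grid instead of NA.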
