Converting R script to SAS

I want to add noise to a dataset. This is a fairly straightforward procedure in R. I sample from a Laplace distribution and then add/multiply/whatever that vector to the vector I want to add noise to.
The issue is, my colleague is asking for the code in SAS. I have not used SAS since graduate school and my project has been put on hold until I can get my colleague up to speed in SAS.
My code is pretty simple:
library(rmutil)
vector <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
vector_prop <- vector / sum(vector)     # rescale to proportions
noise <- rlaplace(9, m = 1, s = .1)     # 9 draws from a Laplace(m = 1, s = 0.1)
new_vector <- vector_prop * noise       # apply the noise multiplicatively
I am turning the vector I want to add noise to into proportions, then drawing from a Laplace distribution. Finally, I multiply those draws by the proportion vector.
Any ideas would be helpful, as the SAS documentation was difficult to follow. I imagine they feel the same way about the R documentation.

Assuming your data is in a data set called have with a variable called vector_prop, the following code is likely correct. Because of the nature of random-number streams you can't replicate the R output exactly, though; don't you end up with a different data set each time anyway?
data want;
    set have;
    call streaminit(24); * fix the random-number stream for reproducibility;
    new_var = vector_prop * rand('laplace', 1, 0.1);
run;
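If the aim is to check the SAS output against R, it also helps to make the R side reproducible by fixing its seed (the two random-number streams still won't match each other, of course). A minimal R sketch using rmutil::rlaplace with the same parameters as above:

library(rmutil)
set.seed(24)                             # makes the R draws repeatable
vector <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
vector_prop <- vector / sum(vector)
new_vector <- vector_prop * rlaplace(length(vector_prop), m = 1, s = 0.1)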

Related

Removing Outliers 3 SDs from the mean of a monoexponential function in R

I have a large data set of exercising subjects' oxygen consumption over time (x = Time, y = VO2). This data fits a monoexponential function.
Here is a brief sample data frame:
VO2 <- c(11.71,9.84,17.96,18.87,14.58,13.38,5.89,20.28,20.03,31.17,22.07,30.29,29.08,32.89,29.01,29.21,32.42,25.47,30.51,37.86,23.48,40.27,36.25,19.34,36.53,35.19,22.45,30.23,3,19.48,25.35,22.74)
Time <- c(0,2,27,29,31,33,39,77,80,94,99,131,133,134,135,149,167,170,177,178,183,184,192,222,239,241,244,245,250,251,255,256)
DF <- data.frame(VO2,Time)
A plot of the data is omitted here; note that this sample data set is much smaller than the full data set (and therefore might not fit a function as well).
I am somewhat new to R and very much not a mathematical expert. I would appreciate your help with the two goals of this data set.
1. Based on typical conventions of the laboratory I work in, this data should be fit to a monoexponential function. I would love some insight into fitting data to a function such as this. Note that I have many similar data sets (for different subjects) and need to fit a monoexponential function to each of them, so it would be best if the fit could be applied generically across my data sets.
2. Based on this monoexponential function, I would like to identify and remove any outlying points. Here I will define an outlier as any point more than 3 standard deviations from the fitted monoexponential function.
So far, I have this (unsuccessful) code to fit a function to the above data. Not only does it not fit well, but I am also unable to create a smooth function.
fit <- lm(VO2 ~ poly(Time, 2, raw = TRUE), data = DF)
xx <- seq(1, 250, length = 32)
plot(Time, VO2, pch = 19, ylim = c(0, 50))
lines(xx, predict(fit, newdata = data.frame(Time = xx)), col = "red")
Thank you to all the individuals who have commented and provided their valuable feedback. As I continue to learn and research, I will add to this post with successful/less successful attempts at the code for this process. Thank you for your knowledge, assistance and understanding.
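One possible approach, not from the original post, is to fit the monoexponential directly with nls() and then flag points whose residuals exceed 3 residual standard deviations. The parameterisation VO2 = baseline + amplitude * (1 - exp(-Time / tau)) and the starting values below are assumptions, and nls() can be finicky, so they may need tuning for the full data sets:

# assumed monoexponential form; starting values are rough guesses
fit_exp <- nls(VO2 ~ baseline + amplitude * (1 - exp(-Time / tau)),
               data  = DF,
               start = list(baseline  = min(DF$VO2),
                            amplitude = diff(range(DF$VO2)),
                            tau       = 50))

res      <- resid(fit_exp)
outliers <- abs(res) > 3 * sd(res)       # more than 3 SD from the fitted curve
DF_clean <- DF[!outliers, ]

# visual check: data, fitted curve, flagged points
plot(DF$Time, DF$VO2, pch = 19, ylim = c(0, 50), xlab = "Time", ylab = "VO2")
curve(predict(fit_exp, newdata = data.frame(Time = x)), add = TRUE, col = "red")
points(DF$Time[outliers], DF$VO2[outliers], col = "blue", pch = 4, cex = 2)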

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy dependencies (ImageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r <- seq(0, 360) / 360 * (2 * pi)   # angles covering a full circle (361 points)
x <- cos(r)
y <- sin(r)
z <- outer(x, y, "*")               # clean signal
noise <- 0.3 * matrix(runif(length(x) * length(y)), nrow = length(x))
zz <- z + noise                     # noisy matrix
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be to use a non-linear prediction method and take the fitted values from the model.
For example, a polynomial regression recovers something close to the original data (shown in purple in the original post's figure).
Following the same logic, you can do the same thing for every column of the zz matrix:
predictions <- matrix(, nrow = nrow(zz), ncol = 0)   # container for the fitted columns
for (i in 1:ncol(zz)) {
  # degree-2 polynomial fit of column i against the row index
  pred <- as.matrix(fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Or you could use some other, more powerful non-linear prediction method (SVM, ANN, etc.) to get a more accurate model.
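If you want something closer to a true low-pass filter rather than a regression fit, a separable moving-average (box) filter needs only base R. A minimal sketch, where the window width k is an arbitrary choice and the unpadded edges come back as NA:

k <- 21                                  # window width (odd); wider = smoother
w <- rep(1 / k, k)                       # box kernel, sums to 1
# filter down the columns, then across the rows
smooth_cols <- apply(zz, 2, function(col) stats::filter(col, w, sides = 2))
zz_smooth   <- t(apply(smooth_cols, 1, function(row) stats::filter(row, w, sides = 2)))

par(mfrow = c(1, 2))
image(zz,        main = "Noisy")
image(zz_smooth, main = "Box-filtered")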

Discrepancy in Cubic Spline Interpolation, R & MATLAB

I am trying to replicate MATLAB's spline() function using the spline() function from R's stats package (see ?splinefun), without having full access to MATLAB (I don't have a licence for it). I am able to input into R all of the data that would be present in MATLAB, but my spline output differs from MATLAB's by an average of .0036 (max difference .0342, min difference -.0056, standard deviation .0094). My main question is: how does MATLAB's formula compare to R's, and is that where my calculation discrepancy might come from?
The first part of my code feeds the Excel spreadsheet into R, then calculates the variables needed to get tau and the quick delta. After this, I run the spline calculation and then rotate the output for export back into Excel. Below is the essential script, plus some data to try out to see if there is something flawed in my calculation. I use spline() with method="natural", as it returns the closest values to MATLAB's model.
# establishing what tau is for the quick-delta calculation
today <- Sys.Date()
month <- as.Date("2016-05-01")   # expiry month; Tau assumes this date is in the future relative to today
difday <- difftime(month, today, units = c("days"))
Tau <- as.numeric((month - today) / 365)
Pu  <- as.numeric(1.94)
Vol <- as.numeric(.4261)
# Pf is the representation of my fixed strike prices, the points used for interpolation
Pf <- c(Pu - .3, Pu - .25, Pu - .2, Pu - .1, Pu, Pu + .1, Pu + .2, Pu + .25, Pu + .3)
qDtable <- data.frame(matrix(ncol = length(Pf), nrow = length(month)))
colnames(qDtable) <- c(Pf)
rownames(qDtable) <- format.Date(month)
# my quick-delta calculation & table as a result
qD <- data.frame(pnorm(log(Pf / Pu) / (Vol * sqrt(Tau))))
Qd <- t(qD[1:length(Pf), 1])   # this sample has length(Pf) = 9 strikes rather than the 24 in the full data
qDtable[1, ] <- c(Qd)
# setting up for spline interpolation
qDpoint <- as.numeric(qDtable[1, 1:length(Pf)])
ncsibyPf <- data.frame(matrix(ncol = length(Pf), nrow = length(month)))
colnames(ncsibyPf) <- Pf
rownames(ncsibyPf) <- format.Date(month)
qDvol <- data.frame(matrix(ncol = 14, nrow = 2))
colnames(qDvol) <- c("", 0, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, 1)
rownames(qDvol) <- c("qD_grid", format.Date(month))   # two rows: quick-delta grid and the vols for this month
qDvol[1, 2:14] <- c(0, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95, 1)   # grid points (mirrors the column headers)
qDvol[2, 2:14] <- c(.59612, .51112, .46112, .45612, .44612, .42612, .42612, .42612, .42612, .42612, .42612, .42612, .42612)
# x is the quick-vol point
x <- as.numeric(qDvol[1, 2:14])
# y is the vol at the quick-vol point
y <- as.numeric(qDvol[2, 2:14])
ncsivol <- data.frame(spline(x, y, xout = qDpoint, method = "natural"))
nroutput <- t(ncsivol[1:length(Pf), 2])
ncsibyPf[1, ] <- c(nroutput)
The essential data points for this spline run are all included (I think), and everything should line up correctly. Thank you for your help ahead of time!
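There is no posted answer here, but one likely contributor to the gap, worth checking rather than a definitive diagnosis, is the end condition: MATLAB's spline() uses the not-a-knot condition, whereas R's spline()/splinefun() offer "fmm" (the default) and "natural", neither of which matches not-a-knot. Comparing R's two methods on the same knots gives a feel for how much the end condition alone can move the interpolated values:

# same knots and evaluation points as above; only the end condition changes
fit_fmm     <- spline(x, y, xout = qDpoint, method = "fmm")
fit_natural <- spline(x, y, xout = qDpoint, method = "natural")
round(fit_fmm$y - fit_natural$y, 4)   # differences are usually largest near the ends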

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot, omitted here), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were taken 6 h apart. Within each profile the observations are spaced vertically by 0.1 m (and the 5 profiles by 6 h in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example (figure omitted).
I find the R documentation on this not so great, so what I have done so far is use the MTS package with its ccm function to create cross-correlation matrices. However, interpreting the resulting figures is rather difficult with such sparse documentation, and I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 = read.csv(d1)
# USING package MTS
mod1<-ccm(d2,lag=1000,level=T)
#USING base R
acf(d2,lag.max=1000)
# MQ plot also from MTS package
mq(d2,lag=1000)
The ccm command, the mq command, and the acf command each produce diagnostic plots (figures omitted here).
My question now is whether I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it calculates the autocovariance or autocorrelation, which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that for my data, given the 0.1 m intervals of species sightings and the fact that many lags up to 100-150 are significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of the data frame d2. Something like:
cc <- vector("list", choose(dim(d2)[2], 2))
par(mfrow = c(ceiling(choose(dim(d2)[2], 2) / 2), 2))
cnt <- 1
for (i in 1:(dim(d2)[2] - 1)) {
  for (j in (i + 1):dim(d2)[2]) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCF's and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
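For example, a minimal sketch of the mutual-information route with infotheo; the default equal-frequency binning in discretize() is an assumption worth revisiting for your data:

library(infotheo)

d_disc <- discretize(d2)                              # bin each profile (equal-frequency by default)
mi_12  <- mutinformation(d_disc[, 1], d_disc[, 2])    # MI between the first two profiles
mi_all <- mutinformation(d_disc)                      # full pairwise MI matrix across profiles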

Trying to do a simulation in R

I'm pretty new to R, so I hope you can help me!
I'm trying to do a simulation for my Bachelor's thesis, where I want to simulate how a stock evolves.
I've done the simulation in Excel, but the problem is that I can't make that large of a simulation, as the program crashes! Therefore I'm trying in R.
The stock evolves as follows (everything except $\epsilon$ consists of constants which are known):
$$W_{t+\Delta t} = W_t\, e^{r \Delta t}\left(1+\pi\left(e^{(\sigma \lambda -0.5\sigma^2) \Delta t+\sigma \epsilon_{t+\Delta t} \sqrt{\Delta t}}-1\right)\right)$$
The only thing here which is stochastic is $\epsilon$, a standard normal shock, N(0,1) (the increment of a Brownian motion).
What I've done in Excel:
Made 100 samples with a size of 40. All these samples are standard normally distributed: N(0,1).
These outcomes are then used to calculate how the stock is affected (the normal draws represent the shocks from the economy).
My problem in R:
I've tried generating the random draws like this:
x <- rnorm(1000, mean = 0, sd = 1)
So I have 1000 samples, which are normally distributed. Now I don't know how to put these results into the formula I have for the evolution of my stock. Can anyone help?
Using R for (discrete) simulation
There are two aspects to your question: conceptual and coding.
Let's deal with the conceptual first, starting with the meaning of your equation:
1. Conceptual issues
The first thing to note is that your evolution equation is continuous in time, so running your simulation as described above means accepting a discretisation of the problem. Whether or not that is appropriate depends on your model and how you have obtained the evolution equation.
If you do run a discrete simulation, then the key decision you have to make is what stepsize $\Delta t$ you will use. You can explore different step-sizes to observe the effect of step-size, or you can proceed analytically and attempt to derive an appropriate step-size.
Once you have your step-size, your simulation consists of pulling new shocks (samples of your standard normal distribution), and evolving the equation iteratively until the desired time has elapsed. The final state $W_t$ is then available for you to analyse however you wish. (If you retain all of the $W_t$, you have a distribution of the trajectory of the system as well, which you can analyse.)
So:
your $x$ are a sampled distribution of your shocks, i.e. they are $\epsilon_{t=0}$.
To simulate the evolution of the $W_t$, you will need some initial condition $W_0$. What this is depends on what you're modelling. If you're modelling the likely values of a single stock starting at an initial price $W_0$, then your initial state is a 1000 element vector with constant value.
Now evaluate your equation, plugging in all your constants, $W_0$, and your initial shocks $\epsilon_0 = x$ to get the distribution of prices $W_1$.
Repeat: sample $x$ again -- this is now $\epsilon_1$. Plugging this in, gives you $W_2$ etc.
2. Coding the simulation (simple example)
One of the useful features of R is that most operators work element-wise over vectors.
So you can pretty much type in your equation more or less as it is.
I've made a few assumptions about the parameters in your equation, and I've ignored the $\pi$ function -- you can add that in later.
So you end up with code that looks something like this:
dt <- 0.5 # step-size
r <- 1 # parameters
lambda <- 1
sigma <- 1 # std deviation
w0 <- rep(1,1000) # presumed initial condition -- prices start at 1
# Show an example iteration -- incorporate into one line for production code...
x <- rnorm(1000,mean=0,sd=1) # random shock
w1 <- w0*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*x*sqrt(dt) -1)) # evolution
When you're ready to let the simulation run, then merge the last two lines, i.e. include the sampling statement in the evolution statement. You then get one line of code which you can run manually or embed into a loop, along with any other analysis you want to run.
# General simulation step
w <- w*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*rnorm(1000,mean=0,sd=1)*sqrt(dt) -1))
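For example, a minimal loop for the full run might look like the sketch below (40 steps is an assumption, chosen to mirror the sample size of 40 mentioned for the Excel version):

n_steps <- 40
w <- w0                                   # start from the initial condition
for (step in 1:n_steps) {
  # one evolution step; fresh shocks are drawn inside the update
  w <- w*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
        sigma*rnorm(1000,mean=0,sd=1)*sqrt(dt) -1))
}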
You can also easily visualise the changes and obtain summary statistics:
hist(w)
summary(w)
Of course, you'll still need to work through the details of what you actually want to model and how you want to go about analysing it --- and you've got the $\pi$ function to deal with --- but this should get you started toward using R for discrete simulation.
