Use of set.seed() in statistics (R)

This is an elementary question, so I apologize if I am missing something obvious. I'm in an advanced statistics class, albeit the first at my university to use R. The question is primarily to get us used to R: it asks us to calculate the log of the square root of 125 and to use the set.seed() function. I am confused about the set.seed() aspect. I understand that it is used for random number generation in simulations, but I don't understand where it is applied within the code. This is what I did:
125 %>%
log() %>%
sqrt() %>%
set.seed(100)
Is this how it is supposed to be used?

No. Someone will probably give a fuller answer, but:
set.seed() does not affect the log() or sqrt() operations at all. Maybe this was supposed to be a two-part question ("(1) calculate the log of the square root of 125; (2) use set.seed() to set the state of the pseudo-random number generator")
You can use the pipe (%>%) operator to compose log() and sqrt(), but watch the order: as written, 125 %>% log() %>% sqrt() computes sqrt(log(125)), and that result is then passed as the first argument to set.seed(), which is not what you want. The log of the square root would be 125 %>% sqrt() %>% log(). And (unless you have been specifically instructed to use the pipe for some reason) it's overkill anyway; you might as well write it in the more "normal" way, log(sqrt(125)).
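For what it's worth, the usual pattern is to call set.seed() once, on its own line, before any random draws. A minimal sketch (the seed value and the rnorm() draw are arbitrary, just for illustration):

set.seed(100)   # fix the RNG state so the draws below are reproducible
rnorm(3)        # three standard-normal draws; identical on every run
log(sqrt(125))  # the deterministic part of the exercise; no seed involved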

Related

R numerical method similar to vpasolve in MATLAB

I am trying to solve a numerical equation in R and would like a method that performs similarly to vpasolve in MATLAB. I have a nonlinear equation (involving a lot of log functions) which, when solved in R with uniroot, gives me a completely different answer from what vpasolve gives in MATLAB.
First, a word of caution: it's often much more productive to learn that there's a better way to do something than to insist on the way you are used to doing it.
Edit:
I went back to MATLAB and realized that the "vpa" collection is using extended precision. Is that absolutely necessary for your purposes? If not, then my suggestions below may suffice.
If you do require extended precision, then perhaps the Rmpfr::unirootR function will suffice. I would point out that, since all these solvers generate an approximate solution (as opposed to an analytic one), the use of extended-precision operations seems a bit pointless.
Next, you need to determine whether MATLAB's vpasolve or R's uniroot is getting you the correct answer. Or maybe you are simply converging to a root that's not the one you want, in which case you need to read up on setting limits on the starting conditions or the search region.
Finally, in addition to uniroot, I recommend you learn to use the R packages BB (which provides BBsolve), nleqslv, rootSolve, and ktsolve (disclaimer: I am the owner and maintainer of ktsolve). These packages are quite flexible and may lead you to better solutions to your original problem.
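For reference, a minimal uniroot() sketch; the equation below is a made-up stand-in for the OP's log-heavy one, and the bracketing interval is an assumption (uniroot needs a sign change across it):

f <- function(x) log(x) + 2 * log(x + 1) - 3   # hypothetical nonlinear equation
uniroot(f, interval = c(0.1, 100))$root        # root of f over the given interval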

SageMath numerical evaluation

I'm using SageMath to do some calculations, and I've found that its numerical evaluation is quite different from Python's.
For example, evalf() no longer works; instead there are n() and gp().
My questions are:
What are the different ways of doing numerical evaluation in Sage, and how do they differ?
What is the difference between n() and gp()? Why does the latter seem to be much slower?
I assume you are referring to SymPy when you say evalf. Anyway, n() or numerical_approx() is the equivalent. See the documentation. The default is 53 bits of precision.
You shouldn't be thinking of gp() though, unless you really want to use the GP/Pari interpreter or convert something to GP.

In R, how can I make a vector y whose components are drawn from a normal distribution?

I am a novice at R programming.
I would like to ask the experts here a question about some R code.
First, let a vector x be c(2,5,3,6,5).
I would like to make another vector y whose i-th component is drawn from N(sum(x[1:i]), 1),
i.e., the i-th component of y follows a normal distribution with variance 1 and mean equal to the sum of x[1] (= 2) through x[i], for i = 1, 2, 3, 4, 5.
For example, the third component of y follows a normal distribution with mean x[1]+x[2]+x[3] = 2+5+3 = 10 and variance 1.
I want R code that makes the vector y described above "without using repetition syntax such as for, while, etc."
Since I am new to R and have a poor sense of computational statistics, I cannot seem to hit on a suitable piece of code at all.
Thank you very much in advance for your answers.
You can do
rnorm(length(x), mean = cumsum(x), sd = 1)
rnorm is part of the *norm family of functions associated with the normal distribution. To see how a function with a known name works, use
help("rnorm") # or ?rnorm
cumsum takes the cumulative sum of a vector.
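Putting rnorm() and cumsum() together for the OP's x, a reproducible version might look like this (the seed value is an arbitrary choice):

x <- c(2, 5, 3, 6, 5)
cumsum(x)                                # 2 7 10 16 21: the means for y[1], ..., y[5]
set.seed(42)                             # arbitrary seed, only for reproducibility
y <- rnorm(length(x), mean = cumsum(x), sd = 1)
y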
Finding functionality
In R, it's generally a safe bet that most functionality you can think of has been implemented by someone already. So, for example, in the OP's case, it is not necessary to roll a custom loop.
The same naming convention as *norm is followed for other distributions, e.g., rbinom. You can follow the link at the bottom of ?rnorm to reach ?Distributions, which lists others in base R.
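For example, the four prefixes d, p, q, and r give the density, distribution function, quantile function, and random generation, respectively:

dnorm(0)      # density of N(0, 1) at 0
pnorm(1.96)   # P(Z <= 1.96), about 0.975
qnorm(0.975)  # the 0.975 quantile, about 1.96
rnorm(1)      # one random draw from N(0, 1)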
If you are starting from scratch and don't know the names of any related functions, consider using the built-in search tools, like:
help.search("normal distribution") # or ??"normal distribution"
If this reveals nothing and yet you still think a function must exist, consider installing and loading the sos package, which allows
findFn("{cumulative mean}") # or ???"{cumulative mean}"
findFn("{the pareto distribution}") # or ???"{the pareto distribution}"
Beyond that, there are other good online resources, like Google. However, asking about functionality on Stack Overflow is a risky proposition: if the implementation of the desired functionality is nonexistent or unknown to folks here, the question will not be received well (downvoted and closed as a "tool request"). Stack Overflow's new "Documentation" subsite will hopefully prove to be a resource for finding R functions as well.

R: Winsorizing (robustHD) not compatible with NAs?

I want to use the winsorize function provided in the robustHD package, but it does not seem to work with NAs, as can be seen in this example:
library(robustHD)  # provides winsorize()
## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier
x[11] <- NA        # adding NA
## winsorize data
x
winsorize(x)
I googled the problem but didn't find a solution, or even anyone with a similar problem. Is winsorizing perhaps considered a "bad" technique, or how else can you explain this lack of information?
If you only have a vector to winsorize, the winsor2 function defined here can easily be modified by setting na.rm = TRUE for the median and mad functions in the code. That provides the same functionality as winsorize{robustHD}, with one difference: winsorize calls robStandardize, which includes some adjustment for very small values. I don't understand what it's doing, so caveat emptor if you forgo it.
If you want to winsorize the individual columns of a matrix (as opposed to the multivariate winsorization using a tolerance ellipse available as another option in winsorize) you should be able to poach the necessary code from winsorize.default and standardize. They do the same thing as winsor2 but in matrix form. Again, you need to add your own na.rm = TRUE settings into the functions as needed.
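As a rough illustration of the vector case, here is a minimal NA-tolerant sketch in the spirit of winsor2; the function name, the constant of 2, and the median/MAD centering are my assumptions here, not robustHD's exact algorithm:

winsor_na <- function(x, const = 2) {
  center <- median(x, na.rm = TRUE)        # robust center, ignoring NAs
  scale  <- mad(x, na.rm = TRUE)           # robust scale, ignoring NAs
  pmin(pmax(x, center - const * scale),    # clamp into [center - c*s, center + c*s]
       center + const * scale)             # NAs pass through unchanged
}
winsor_na(c(25, rnorm(9), NA))             # outlier gets clamped, the NA is kept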
A few thoughts that may be useful:
Stack Overflow is a programming board, where programming-related questions are asked and answered. For questions about whether certain statistical procedures are appropriate or considered "bad", you are more likely to find knowledgeable people on Cross Validated.
A statistical method and the implementation of that method in a certain software environment are often rather independent. That is to say, if the developer of a package has not included certain features (e.g., NA handling), this does not mean much for the method per se. Having said that, of course it can. The only way to be sure whether the omission of a feature was intentional is to ask the developer of the package. If the question is more about the statistics and the validity of the method in the presence of missing values, Cross Validated is likely to be more helpful.
I don't know why you can't find any information on this topic. I can confidently say though that this is the very first time I have heard the term "winsorized". I actually had to look it up, and I can surely say that I have never encountered this approach, and I would personally never use it.
A simple solution to your problem, from a computational point of view, would be to omit all incomplete cases before you start working with the function. It also makes intuitive sense that cases with missing values cannot easily be winsorized: the computation of the mean and standard deviation has to be done on the complete cases anyway, and it is then unclear which value to assign to the cases with missing values, since they are not necessarily outliers, even though they could be.
If omitting incomplete cases is not an option for you, you may want to look for imputation methods (on CV).
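In code, the omission route is a one-liner (winsorize here being the robustHD function from the question):

library(robustHD)
x_complete <- na.omit(x)   # drop the incomplete cases first
winsorize(x_complete)      # then winsorize only the observed values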

How can I do blind fitting on a list of x, y value pairs if I don't know the form of f(x) = y?

If I have a function f(x) = y that I don't know the form of, and if I have a long list of x and y value pairs (potentially thousands of them), is there a program/package/library that will generate potential forms of f(x)?
Obviously there's a lot of ambiguity to the possible forms of any f(x), so something that produces many non-trivial unique answers (in reduced terms) would be ideal, but something that could produce at least one answer would also be good.
If x and y are derived from observational data (i.e., experimental results), are there programs that can create approximate forms of f(x)? On the other hand, if you know beforehand that there is a completely deterministic relationship between x and y (as with the input and output of a pseudo-random number generator), are there programs that can create exact forms of f(x)?
Soooo, I found the answer to my own question. Cornell has released a piece of software for doing exactly this kind of blind fitting called Eureqa. It has to be one of the most polished pieces of software that I've ever seen come out of an academic lab. It's seriously pretty nifty. Check it out:
It's even got turnkey integration with Amazon's ec2 clusters, so you can offload some of the heavy computational lifting from your local computer onto the cloud at the push of a button for a very reasonable fee.
I think that I'm going to have to learn more about GUI programming so that I can steal its interface.
(This is more of a numerical methods question.) If there is some kind of observable pattern (you can kinda see the function), then yes, there are several ways you can approximate the original function, but they'll be just that, approximations.
What you want to do is called interpolation. Two very simple (and not very good) methods are Newton's and Lagrange's methods of interpolation. They both work on the same principle but are implemented differently (Lagrange's is iterative, Newton's is recursive, for one).
If there's not much going on between any two of your data points (i.e., the actual function doesn't have any "bumps" whose "peaks" are not represented by one of your data points), then the spline method of interpolation is one of the best choices you can make. It's a bit harder to implement, but it produces nice results.
Edit: Depending on your specific problem, the methods above might be overkill; sometimes linear interpolation (where you just connect points with straight lines) is a perfectly good solution.
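In R, for instance, both the simple and the spline variants are built in; a quick sketch on made-up points:

x <- 0:4
y <- c(0, 1, 4, 9, 16)        # hypothetical observations
approx(x, y, xout = 2.5)$y    # linear interpolation at x = 2.5
spline(x, y, xout = 2.5)$y    # cubic-spline interpolation at the same point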
It depends.
If you're using data acquired from the real world, then statistical regression techniques can provide you with tools to evaluate the best fit. If you have several hypotheses about the form of the function, you can use regression to discover the "best" fit, though you may need to be careful about over-fitting: sometimes the best fit (highest correlation) for a specific dataset completely fails to work for future observations.
If, on the other hand, the data was generated synthetically (say, you know it was generated by a polynomial), then you can use polynomial curve-fitting methods that will give you the exact answer you need.
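As a sketch of the regression route in R (the quadratic hypothesis and the synthetic data are assumptions for illustration):

set.seed(1)                             # arbitrary seed, for reproducibility
d <- data.frame(x = 1:20)
d$y <- 3 * d$x^2 + rnorm(20, sd = 5)    # synthetic "observations"
fit <- lm(y ~ poly(x, 2), data = d)     # least-squares fit of the hypothesized form
summary(fit)$r.squared                  # one rough measure of fit quality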
Yes, there are such things.
If you plot the values and see that there's some functional relationship that makes sense, you can use least squares fitting to calculate the parameter values that minimize the error.
If you don't know what the function should look like, you can use simple spline or interpolation schemes.
You can also use software to guess what the function should be. Maybe something like Maxima can help.
Wolfram Alpha can help you guess:
http://blog.wolframalpha.com/2011/05/17/plotting-functions-and-graphs-in-wolframalpha/
Polynomial interpolation is the way to go if you have a totally arbitrary set of points:
http://en.wikipedia.org/wiki/Polynomial_interpolation
If your set is nearly linear, then regression will give you a good approximation.
Creating an exact form from the x's and y's alone is mostly impossible.
Note that what you are trying to achieve is at the heart of many machine-learning algorithms, and therefore you might find what you are looking for in specialized libraries.
A list of x/y values N items long can always be generated by a polynomial of degree N-1 (assuming no two x values are the same). See this article for more details:
http://en.wikipedia.org/wiki/Polynomial_interpolation
Some lists may also match other function types, such as exponential, sinusoidal, and many others. It is impossible to find the "simplest" matching function, but the best you can do is go through a list of common ones (exponential, sinusoidal, etc.) and, if none of them match, interpolate a polynomial.
I'm not aware of any software that can do this for you, though.
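That said, the degree-(N-1) claim above is easy to check by hand, e.g. in R, by solving the Vandermonde system for the interpolating polynomial (the data points here are hypothetical):

x <- c(1, 2, 3, 4)
y <- c(2, 3, 5, 8)                      # four points, so degree 3 suffices
V <- outer(x, 0:(length(x) - 1), `^`)   # Vandermonde matrix
a <- solve(V, y)                        # polynomial coefficients a0, a1, a2, a3
sum(a * 2.5^(0:(length(a) - 1)))        # evaluate the polynomial at x = 2.5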
