R calculate percentile using Stata definition [closed] - r

I did two different analyses, one with R and another with Stata, based on a percentile calculation. However, I get a mismatch between the two results because R and Stata use different methods to calculate percentiles. Do you know whether I can use Stata's percentile definition in R?

R has at least 9 definitions of quantiles, and percentiles are just quantile(.) * 100. This link suggests that the corresponding quantile type would be type=4. I was unable to find a percentile or quantile function documented in the Base Stata Manual, but I would welcome correction if that is in error.
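For concreteness, here is a small sketch of how R's quantile types can be compared; the data are made up, and which type (if any) matches Stata is exactly the open question above.

# made-up data, just to show the syntax
set.seed(42)
x <- rnorm(25)

# R's default is type = 7; type = 4 is the candidate mentioned above
quantile(x, probs = 0.25, type = 7)
quantile(x, probs = 0.25, type = 4)

# compare all nine types at the 25th percentile
sapply(1:9, function(k) quantile(x, probs = 0.25, type = k))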
Nick Cox is right. The quantile (the value in the data domain) at probability of 0.25 is the 25th percentile. The question appears unclear on both sides of the R-Stata divide because the original efforts in R were being done with the ecdf function in an unspecified manner. Fortunately the poster was satisfied by being pointed toward the R quantile function.
After looking at the Version 13 Stata Manual section on centile, I'm not sure it matches up with any of the R quantile methods although it would appear to match the type=4 method for percentiles away from the "extremes":
By default, centile estimates Cq for the variables in varlist and for the values of q given in centile(numlist). It makes no assumptions about the distribution of X, and, if necessary, uses linear interpolation between neighboring sample values. Extreme centiles (for example, the 99th centile in samples smaller than 100) are fixed at the minimum or maximum sample value.

Related

Machine Learning Classification with only binary numbers [closed]

I have 50 predictors and 1 target variable. All my predictors and my target variable are binary (0s and 1s). I am performing my analysis using R.
I will be implementing four algorithms.
1. RF
2. Log Reg
3. SVM
4. LDA
I have the following questions:
I converted them all into factors. How should I treat my variables beforehand, before feeding them into the other algorithms?
I used the caret package to train my model, and it takes a very long time. I do practice ML regularly, but I don't know how to proceed when all the variables are binary.
How do I remove collinear variables?
I'm mostly a Python user rather than an R user, but I expect the general approach is the same:
1. Check your columns. Remove a column if zeroes or ones make up more than 95% of its values (you can later try thresholds of 2.5% or even 1% for the minority value).
2. Run a simple random forest with default settings and get the feature importances; columns that turn out to be unimportant you can then handle with LDA (see the R sketch after this list).
3. Check the target column. If it's highly unbalanced, try oversampling or downsampling, or use a classification method that can handle an unbalanced target (like XGBoost).
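A minimal R sketch of steps 1 and 2, assuming the data frame is called dat with numeric 0/1 predictors and a binary target column y (these names are illustrative, not from the question):

library(randomForest)

# step 1: drop near-constant columns (more than 95% zeroes or more than 95% ones)
predictors <- setdiff(names(dat), "y")
near_constant <- sapply(dat[predictors], function(col) {
  p <- mean(col == 1)
  p > 0.95 || p < 0.05
})
dat_reduced <- dat[, c("y", predictors[!near_constant])]

# step 2: default random forest, then inspect variable importance
dat_reduced$y <- factor(dat_reduced$y)
rf <- randomForest(y ~ ., data = dat_reduced)
importance(rf)   # mean decrease in Gini by default
varImpPlot(rf)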
For linear regression you'll need to calculate the correlation matrix and remove correlated columns; the other methods can live without this.
Check whether SVM (or SVC) supports all features being boolean, but it usually works very well for binary classification.
I'd also advise trying a neural network.
PS: About collinear variables. I wrote some Python code for this in my project; the idea is simple and you can do the same:
- compute (and plot) the correlation matrix
- find pairs whose correlation exceeds some threshold
- from each such pair, remove the column that has the lower correlation with the target variable (you can also check that the column you want to remove is not important; otherwise try another way, perhaps combining the columns)
In my code I ran this algorithm iteratively for different thresholds, from 0.99 down to 0.9, and it works well. A rough R translation follows.
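Here is that procedure sketched in R, assuming a data frame dat with numeric 0/1 predictors plus a target column y (names and thresholds are illustrative):

prune_collinear <- function(dat, target = "y", thresholds = seq(0.99, 0.90, by = -0.01)) {
  for (thr in thresholds) {
    predictors <- setdiff(names(dat), target)
    cors <- cor(dat[predictors])                     # correlation matrix of the predictors
    cors[lower.tri(cors, diag = TRUE)] <- 0          # consider each pair only once
    pairs <- which(abs(cors) > thr, arr.ind = TRUE)  # pairs above the threshold
    drop <- character(0)
    for (i in seq_len(nrow(pairs))) {
      a <- predictors[pairs[i, 1]]
      b <- predictors[pairs[i, 2]]
      # drop whichever of the pair is less correlated with the target
      weaker <- if (abs(cor(dat[[a]], dat[[target]])) < abs(cor(dat[[b]], dat[[target]]))) a else b
      drop <- c(drop, weaker)
    }
    dat <- dat[, setdiff(names(dat), unique(drop))]
  }
  dat
}

dat_pruned <- prune_collinear(dat)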

R Student-t (location 0 scale 1) [closed]

So I've read about fitting a Student-t in R using MLE, but it always appears to be the case that the location and scale parameters are of the utmost interest. I just want to fit a Student-t (as described by Wikipedia) to data that is usually considered to be distributed like a standard normal, so I can assume the mean is 0 and the scale is 1. How can I do this in R?
If you "assume" your location and scale parameters, you are not "fitting" a distribution to the data, you are simply assuming that the data follows a certain distribution.
"Fitting" a distribution to some data means finding "appropriate" parameters of this distribution so that it "accurately" models your data. Maximum likelihood estimation is a method to find point-estimates of the parameters based on some data.
The easiest way to fit a classic distribution such as student-t is to use the function fitdistr from the MASS package, which uses MLE.
Assuming you have some data:
library("MASS")
# generating some data following a normal dist
x <- rnorm(100)
# fitting a t dist, although this makes little sense here
# since you know x comes from a normal dist...
fitdistr(x, densfun="t", df=length(x)-1)
Note that the Student-t density is parameterised by the location m, the scale s and the degrees of freedom df. In the call above, df is not estimated; it is fixed in advance, here based on the sample size.
The output of fitdistr contains the fitted values for m and s. If you store the output in an object, you can programmatically access all sorts of information about the fit.
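For example, continuing with the x simulated above:

fit <- fitdistr(x, densfun = "t", df = length(x) - 1)
fit$estimate   # fitted location m and scale s
fit$sd         # their standard errors
fit$loglik     # maximised log-likelihood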
The question now is whether fitting a t dist is what you really want to do. If the data is normal, why would you want to fit a t dist?
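If you really do want a t density with location 0 and scale 1, so that only the degrees of freedom remain to be estimated, one simple sketch (fixing m = 0 and s = 1 and maximising the log-likelihood over df alone) is:

# assumes x holds the data; location and scale are fixed at 0 and 1
neg_loglik <- function(df) -sum(dt(x, df = df, log = TRUE))
fit_df <- optimize(neg_loglik, interval = c(0.5, 200))
fit_df$minimum   # MLE of the degrees of freedom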
You are looking for the function t.test.
x <- rnorm(100)
t.test(x)
I put a sample here
EDIT
I think I misunderstood your question slightly. Use t.test for hypothesis testing about the location of your population density (here a standard normal).
As for fitting the parameters of a t distribution, you should not do this unless your data comes from a t distribution. If you know your data comes from a standard normal distribution, you already know the location and scale parameters, so what's the sense?

How to estimate missing DV using its own estimation model within a linear model? [closed]

This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing in about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I want to make sure I incorporate the known error in that estimation model. The standard errors of the intercept and of each coefficient have been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject in your dataset you impute the waist circumference by first using the published model and the BMI, age, and gender to determine the expected value, and then you add some simulated random noise to that; you'll have to read through the publication to determine the numerical value of that noise. Once you've filled in every missing value, you perform whatever statistical computation you want to run, and save the standard errors.
Now, you create a second dataset, derived from your original dataset with missing values, once again using the published model to impute the expected values, along with some random noise -- since the noise is random, the imputed values for this dataset should be different from the imputed values for the first dataset. Do your statistical computation again and save the standard errors, which will be a little different from those from the first imputed dataset, since the imputed values contain random noise.
Repeat this a number of times. Finally, average the saved standard errors, and this will give you an estimate of the standard error that incorporates the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: on a low level, for each iteration you are using the published model to create a simulated dataset with noisy imputed values for missing data, which then gives you a simulated standard error, and then on a high level you repeat the process to obtain a sample of such simulated standard errors, which you then average to get your overall estimate.
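A rough R sketch of that two-level scheme, assuming the published model has the form waist = b0 + b1*BMI + b2*age + b3*male; every number, column name, and the analysis formula below is a placeholder to be replaced with the real published values and your own model:

# placeholder coefficients and standard errors -- substitute the published values
b     <- c(intercept = 10, bmi = 2.0, age = 0.1, male = 5)
b_se  <- c(intercept = 1.5, bmi = 0.2, age = 0.05, male = 1.0)
sigma <- 6    # residual SD of the published model, if reported

n_imp <- 50
coefs <- ses <- numeric(n_imp)

for (m in 1:n_imp) {
  dat_m <- dat                       # dat: your data, with waist partly missing
  miss  <- is.na(dat_m$waist)
  # draw noisy coefficients, then add residual noise to the predicted values
  b_m  <- rnorm(length(b), mean = b, sd = b_se)
  pred <- b_m[1] + b_m[2] * dat_m$bmi[miss] +
          b_m[3] * dat_m$age[miss] + b_m[4] * dat_m$male[miss]
  dat_m$waist[miss] <- pred + rnorm(sum(miss), mean = 0, sd = sigma)

  # your own analysis model (illustrative formula)
  fit_m    <- lm(waist ~ exposure, data = dat_m)
  coefs[m] <- coef(fit_m)["exposure"]
  ses[m]   <- coef(summary(fit_m))["exposure", "Std. Error"]
}

mean(coefs)   # pooled estimate
mean(ses)     # averaged standard error, as described above
# (formal multiple imputation, e.g. Rubin's rules, also adds a between-imputation term)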
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.

Test for significance in a time series using R [closed]

Given a simplified example time series looking at a population by year
Year<-c(2001,2002,2003,2004,2005,2006)
Pop<-c(1,4,7,9,20,21)
DF<-data.frame(Year,Pop)
What is the best method to test for significance in terms of change between years, i.e. which years are significantly different from each other?
As #joran mentioned, this is really a statistics question rather than a programming question. You could try asking on http://stats.stackexchange.com to obtain more statistical expertise.
In brief, however, two approaches come to mind immediately:
If you fit a regression line to the population vs. year and get a statistically significant slope, that would indicate that there is an overall trend in population over the years; i.e. use lm() in R, like this: lmPop <- lm(Pop ~ Year, data = DF).
You could divide the time period into blocks (e.g. the first three years and the last three years), and assume that the population figures for the years in each block are all estimates of the mean population during that block of years. That would give you a mean and a standard deviation of the population for each block of years, which would let you do a t-test, like this: t.test(Pop[1:3], Pop[4:6]).
Both of these approaches suffer from some potential difficulties, and the validity of each would depend on the nature of the data that you're examining. For the sample data, however, the first approach suggests that there is a trend over time at the 95% confidence level (p = 0.00214 for the slope coefficient), while the second approach suggests that the null hypothesis of no difference in means cannot be rejected at the 95% confidence level (p = 0.06332).
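For reference, both approaches on the sample data look like this (the p-values quoted above are read off outputs of this kind):

# approach 1: linear trend
lmPop <- lm(Pop ~ Year, data = DF)
summary(lmPop)    # look at the p-value for the Year coefficient

# approach 2: compare the first three and last three years
t.test(DF$Pop[1:3], DF$Pop[4:6])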
They're all significantly different from each other. 1 is significantly different from 4, 4 is significantly different from 7 and so on.
Wait, that's not what you meant? Well, that's all the information you've given us. As a statistician, I can't do anything more with it.
So now you tell us something else. "Are any of the values significantly different from a straight line where the variation in the Pop values are independent Normally distributed values with mean 0 and the same variance?" or something.
Simply put, just a bunch of numbers can not be the subject of a statistical analysis. Working with a statistician you need to agree on a model for the data, and then the statistical methods can answer questions about significance and uncertainty.
I think that's often the thing non-statisticians don't get. They go "here's my numbers, is this significant?" - which usually means typing them into SPSS and getting a p-value out.
[have flagged this Q for transfer to stats.stackexchange.com where it belongs]

How is uniformity expressed? [closed]

I don't know anything about statistics, and it was difficult for me to find a clear way to describe my question.
I am doing some initial research on a system that will measure the uniformity of electricity across a conductor. Basically we need to measure how evenly a signal is spread out on a surface.
I was doing research on how to determine uniformity of a data set and came across this question which is promising. However I realized that I don't know what unit to use to express uniformity. For example, if I take 100 equally spaced measurements in a grid pattern on the surface of an object and want to describe how uniform the values are, how would you say it?
"98% uniform?" - what does that mean? 98% of what?
"The signal is very evenly dispersed" - OK, great... but there must be a more specific or scientific way to communicate that... how "evenly"? What is a numeric representation of that statement?
Statistics and math are not my thing so if this seems like a dumb question, be gentle...
You are looking for the Variance. From Wikipedia:
In probability theory and statistics, the variance is a measure of how far a set of numbers are spread out from each other. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean
Recipe for calculating the Variance:
1) Calculate the Mean of your dataset
2) For each point, calculate (X - Mean)^2
3) Add up all those (X - Mean)^2
4) Divide that sum by the number of points
5) That is it
The variance gives you an idea of how "equal" your points are. A variance of zero means all points are equal, and the variance increases as the points spread out.
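A quick R illustration of the recipe, using made-up measurements:

# 100 made-up grid measurements (illustrative only)
x <- rnorm(100, mean = 230, sd = 3)

m <- mean(x)                              # step 1
pop_var <- sum((x - m)^2) / length(x)     # steps 2-4: population variance
pop_var

# note: R's built-in var() divides by n - 1 (sample variance),
# so it differs slightly from the recipe above
var(x)
sd(x)   # standard deviation, in the same units as the measurements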
Edit
Here you may find better algorithms (more numerically stable) for calculating the variance.
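One such numerically stable approach is Welford's online algorithm; a small sketch:

# Welford's online algorithm: a single pass, numerically stable
welford_variance <- function(x) {
  n <- 0; mean_x <- 0; M2 <- 0
  for (xi in x) {
    n      <- n + 1
    delta  <- xi - mean_x
    mean_x <- mean_x + delta / n
    M2     <- M2 + delta * (xi - mean_x)
  }
  M2 / n   # population variance (use M2 / (n - 1) for the sample variance)
}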
One has to first define "uniformity". Does it mean lack of variance in the data? Or does it also mean other things like lack of average change across a surface or over time?
If it's simply lack of variance in data, then the variance method already described is the ticket.
If you are also concerned about average "shift" in measurement across the surface, you could do a linear (or in this case a "cylindrical" or "planar") fit of the data to determine whether there's a general trend up or down in the data in either of two dimensions. (If the conductor is cylindrical, then radially and axially. If it's planar, then x/y.)
These three parameters, then, would give a reasonable uniformity measure by the above definition: overall variance (that belisarius described), and "flatness" in each of two dimensions.
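For a planar surface, a minimal sketch of that trend check (the grid coordinates and measurements below are made up):

# made-up 10 x 10 grid of measurements on a planar surface
surf <- expand.grid(x = 1:10, y = 1:10)
surf$signal <- 5 + 0.02 * surf$x - 0.01 * surf$y + rnorm(100, sd = 0.1)

# overall spread
var(surf$signal)

# planar fit: significant x or y coefficients suggest a systematic trend
# (non-uniformity) across the surface rather than just random scatter
summary(lm(signal ~ x + y, data = surf))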
