Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
Given a simplified example time series looking at a population by year
Year<-c(2001,2002,2003,2004,2005,2006)
Pop<-c(1,4,7,9,20,21)
DF<-data.frame(Year,Pop)
What is the best method to test for significance in terms of change between years/ which years are significantly different from each other?
As #joran mentioned, this is really a statistics question rather than a programming question. You could try asking on http://stats.stackexchange.com to obtain more statistical expertise.
In brief, however, two approaches come to mind immediately:
If you fit a regression line to the population vs. year and have a statistically significant slope, that would indicate that there is an overall trend in population over the years, i.e. use lm() in R, like this lmPop <- lm(Pop ~ Year,data=DF).
You could divide the time period into blocks (e.g. the first three years and the last three years), and assume that the population figures for the years in each block are all estimates of the mean population during that block of years. That would give you a mean and a standard deviation of the population for each block of years, which would let you do a t-test, like this: t.test(Pop[1:3],Pop[4:6]).
Both of these approaches suffer from some potential difficulties and the validity of each would depend on the nature of the data that you're examining. For the sample data, however, the first approach suggests that there appears to be a trend over time at a 95% confidence level (p=0.00214 for the slope coefficient) while the second approach suggests that the null hypothesis that there is no difference in means cannot be falsified at the 95% confidence level (p = 0.06332).
They're all significantly different from each other. 1 is significantly different from 4, 4 is significantly different from 7 and so on.
Wait, that's not what you meant? Well, that's all the information you've given us. As a statistician, I can't work with anything more.
So now you tell us something else. "Are any of the values significantly different from a straight line where the variation in the Pop values are independent Normally distributed values with mean 0 and the same variance?" or something.
Simply put, just a bunch of numbers can not be the subject of a statistical analysis. Working with a statistician you need to agree on a model for the data, and then the statistical methods can answer questions about significance and uncertainty.
I think that's often the thing non-statisticians don't get. They go "here's my numbers, is this significant?" - which usually means typing them into SPSS and getting a p-value out.
[have flagged this Q for transfer to stats.stackexchange.com where it belongs]
Related
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 10 months ago.
Improve this question
I'm trying to learn the intricacies of linear regression for prediction, and I'd like to ask two questions:
I've got one dependent variable (call it X) and, let's say, ten independent variables. I can use lm() to generate a model. But my question is this: is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X? I assumed the latter, but after several hours of reading online I am now unsure.
If the aim is to discover the best combination of predictors of X, then (once I've identified that combination) how is a combination plotted properly? Plotting one line is easy, but for a combination would it be proper to (a) plot ten distinct regression lines (one per independent variable) or (b) plot a single line that somehow represents the combination? I've provided the summary() I'm working with in case it facilitates answering this question.
Is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X?
This depends mainly on the situation/context you are in. If you are always going to have access to these predictors, then yes, you'd like to identify the best model that will (likely) use a combination of these predictors. Obviously you want to keep in mind issues like overfitting and make sure the predictors you include are actually contributing something meaningful to your model, but there's no reason not to include multiple predictors if they make your model meaningfully better.
However, in many real world scenarios predictors are not free. It might cost $10,000 to collect each predictor and the organization you are working for only has the budget to collect one predictor. Thus, you might only be interested in the single best predictor because it is not practical to collect more than one going forward. In this case you'd also just be interested in how well that variable predicts in a simple regression, not a multiple regression, since you won't be controlling for other variables in the future anyway (but looking at the multiple regression results could still provide insight).
how is a combination plotted properly?
Again, this depends on context. However, in most cases you probably don't want to plot 10 regression lines because that's too overwhelming to look at and you will probably never have 10 variables that meaningfully contribute to your model. I'm actually kind of surprised your adjusted R^2 is not lower given you have quite a few variables so close to zero, unless they're just on massive scales.
First, who is viewing this graph? Is it you? If so, what information do you need to see that isn't being conveyed by the beta parameters? If it's someone else, who are they? Are they a stakeholder who knows nothing about statistics? If that's the case, you want a pretty simple graph that drives home your main point. Second, what is the purpose of your predictions and how does the process you are predicting unfold in the real world? Let's say I'm predicting how well people perform on the job given their scores on some different selection measures. The first thing you need to consider is, how is that selection happening? Are candidates screened on their answers to some personality questions and only the top scorers get an interview? In that case, it might be useful to create multiple graphs that show that process. However, candidates might be reviewed holistically and assigned a sum score based on all these predictors. In that case one regression line makes sense because you are interested in how these predictors act in concert.
There is no one answer to this question because the answers depend on the reason you're doing a regression in the first place. Once you identify the reason you're trying to predict this thing and the context that the process is happening in you should probably be able to determine what makes most sense. There is no "right" answer you'll find in a textbook because most real life problems are not in textbooks.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I have been following the last DESeq2 pipeline to perform an RNAseq analysis. My problem is the rin of the experimental samples is quite low compared to the control ones. Iread a paper in which they perform RNAseq analysis with time-course RNA degradation and conclude that including RIN value as a covariate can mitigate some of the effects of low rin in samples.
My question is how I should construct the design in the DESeq2 object:
~conditions+rin
~conditions*rin
~conditions:rin
none of them... :)
I cannot find proper resources where explain how to construct these models (I am new to the field...) and I recognise I crashed against a wall with these kinds of things. I would appreciate also some links to good resources to be able to understand which one is correct and why.
Thank you so much
Turns out to be quite long for typing in a comment.
It depends on your data.
First of all, counts ~conditions:rin does not make sense in your case, because conditions is categorical. You cannot fit only an interaction term model.
I would go with counts ~condition + rin, this assumes there is a condition effect and a linear effect from rin. And the counts' dependency of rin is independent of condition.
As you mentioned, rin in one of the conditions is quite low, but is there any reason to suspect the relationship between rin and counts to differ in the two conditions? If you fit counts ~condition * rin, you are assuming a condition effect and a rin effect that is different in conditions. Meaning a different slope for rin effect if you plot counts vs rin. You need to take a few genes out, and see whether this is true. And also, for fitting this model, you need quite a lot of samples to estimate the effects accurately. See if both of these holds
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Improve this question
This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing in about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I'm wanting to make sure I incorporate the known error in that estimation model. The standard error of the intercept and of each coefficient has been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject in your dataset you impute the waist circumference by first using the published model and the BMI, age, and gender to determine the expected value, and then you add some simulated random noise to that; you'll have to read through the publication to determine the numerical value of that noise. Once you've filled in every missing value, then you just perform whatever statistical computation you want to run, and save the standard errors. Now, you create a second dataset, derived from your original dataset with missing values, once again using the published model to impute the expected values, along with some random noise -- since the noise is random, the imputed values for this dataset should be different from the imputed values for the first dataset. Now do your statistical computation, and save the standard errors, which will be a little different than those from the first imputed dataset, since the imputed values contain random noise. Repeat for a bunch of times. Finally, average the saved standard errors, and this will give you an estimate for the standard error incorporating the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: on a low level, for each iteration you are using the published model to create a simulated dataset with noisy imputed values for missing data, which then gives you a simulated standard error, and then on a high level you repeat the process to obtain a sample of such simulated standard errors, which you then average to get your overall estimate.
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I am working with R to create some linear models (using lm()) on the data that i have collected. Now I am not that good at statistics and am finding it difficult to understand the summary of the linear model that is generated through R.
I mean the residual value : Min , 1Q, Median, 3Q, Max
My question is: what do these values mean and how can I know from these values if my model is good ?
This are some of the residual values that I have.
Min: -4725611 1Q:-2161468 median:-1352080 3Q:3007561 Max:6035077
One fundamental assumption of linear regression (and the associated hypothesis tests in particular) is that residuals are normal distributed with expected value zero. A slight violation of this assumption is not problematic, as the statistics is pretty robust. However, the distribution should be at least symmetric.
The best way to judge if the assumption of normality is fullfilled, is to plot the residuals. There are many different diagnostic plots available, e.g., you can do the following:
fit <- lm(y~x)
plot(fit)
This will give you a plot of residuals vs. fitted values and a qq-plot of standardized residuals. The quantiles given by summary(fit) are useful for a quick check if residuals are symmetric. There, min and max values are not that important, but the median should be close to zero and the first and third quartil should have similar absolute values. Of course, this check only makes sense if you have a sufficient number of values.
If residuals are not normal distributed there are several possibilities to deal with that, e.g.,
transformations,
generalized linear models,
or a non-linear model could be more appropriate.
There are many good books on linear regression and even some good web tutorials. I suggest to read at least one of those carefully.
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I don't know anything about statistics and it was difficult for me to find A way to describe my question that was clear.
I am doing some initial research on a system that will measure the uniformity of electricity across a conductor. Basically we need to measure how evenly a signal is spread out on a surface.
I was doing research on how to determine uniformity of a data set and came across this question which is promising. However I realized that I don't know what unit to use to express uniformity. For example, if I take 100 equally spaced measurements in a grid pattern on the surface of an object and want to describe how uniform the values are, how would you say it?
"98% uniform?" - what does that mean? 98% of what?
"The signal is very evenly dispersed" - OK, great... but there must be a more specific or scientific way to communicate that... how "evenly"? What is a numeric representation of that statement?
Statistics and math are not my thing so if this seems like a dumb question, be gentle...
You are looking for the Variance. From Wikipedia:
In probability theory and statistics, the variance is a measure of how far a set of numbers are spread out from each other. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean
Recipe for calculating the Variance:
1) Calculate the Mean of your dataset
2) For each point, calculate (X - Mean)^2
3) Add up all those (X - Mean)^2
4) Divide the by the number of points
5) That is it
The Variance gives you an idea of how "equal" your points are. A Variance of zero, means all points are equal, and then increases as the points spread out.
Edit
Here you may find better algorithms (more numerically stable) for calculating the variance.
One has to first define "uniformity". Does it mean lack of variance in the data? Or does it also mean other things like lack of average change across a surface or over time?
If it's simply lack of variance in data, then the variance method already described is the ticket.
If you are also concerned about average "shift" in measurement across the surface, you could do a linear (or in this case a "cylindrical" or "planar") fit of the data to determine whether there's a general trend up or down in the data in either of two dimensions. (If the conductor is cylindrical, then radially and axially. If it's planar, then x/y.)
These three parameters, then, would give a reasonable uniformity measure by the above definition: overall variance (that belisarius described), and "flatness" in each of two dimensions.