I am investigating the relationship between body measurements and overall weight in a set of biological specimens using regression equations. I have been comparing my results to previous studies, which did not draw their measurement data and body weights from the same series of individuals. Instead, those studies either used the mean values reported for each species in the previously published literature (with body measurements and weights drawn from different sets of individuals) or simply took the midpoint of the reported ranges of body measurements.
I am trying to figure out how to introduce a small amount of random error into my data to simulate the effect of drawing measurement and weight data from different sources. For example, I would like to alter every value by roughly +/- 5% of its actual value (which is close to the difference I see between my measurements and the literature measurements) and then check how much that affects the accuracy statistics. I know there is the jitter() function, but it only seems to be intended for plotting.
There is a jitter() function in base R that lets you add random noise to data.
x <- 1:10
set.seed(123)
jitter(x)
#[1] 0.915 2.115 2.964 4.153 5.176 5.818 7.011 8.157 9.021 9.983
Check ?jitter which explains different ways to control the noise added.
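For example, a quick illustration of the factor and amount arguments (no particular values are implied by the question; these are just for demonstration):
set.seed(123)
jitter(x, factor = 2)   # roughly double the default amount of noise
jitter(x, amount = 0.5) # uniform noise drawn from [-0.5, 0.5]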
This is straightforward if you know what the error looks like, i.e. how your error is distributed. Is the error normally distributed? Uniform?
v1 <- rep(100, 10) # measurements with no noise
v1_n <- v1 + rnorm(10, 0, 20) # error sampled from a normal distribution with mean 0 and sd 20
v1_u <- v1 + runif(10, -5, 5) # error sampled from a uniform distribution with min -5 and max 5 (mean 0)
v1_n
[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
v1_u
[1] 104.34705 97.12143 101.51674 96.25555 97.67221 98.86114 95.13390 98.82388 103.69691 98.40349
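For the roughly +/- 5% relative error described in the question, the same idea can be applied multiplicatively (a sketch; the 5% bound is taken from the question, and the uniform distribution is just one reasonable choice):
v1_p <- v1 * (1 + runif(10, -0.05, 0.05)) # each value perturbed by up to +/- 5% of itself
v1_p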
I have run a simple linear regression in R with two variables and obtained the following relation:
y = 30000 + 1.95x
which fits reasonably well. My only concern is that, practically speaking, the point (0,0) should be included in the model.
Is there any mathematical help I can get, please?
I needed to post the data somehow, so here it is; this should give a better view of the problem.
There are more data sets like this available. This is data collected for a marketing strategy.
The objective is to obtain a relation between sales and spend so that we can predict the spend needed to achieve a given amount of sales.
All help will be appreciated.
This is not an answer, but rather a comment with graphics.
I converted the month data to "elapsed months", starting with 1 for the first month, then 2, then 3, etc. This allowed me to view the data in 3D, and as you can see from the 3D scatterplot below, both Spend and Sales are related to the number of months that have passed. I also scaled the financial data to thousands so I could read the plots more easily.
I fit the data to a simple flat surface equation of the form "z = f(x,y)" as shown below, as this equation was suggested to me by the scatterplot. My fit of this data gave me the equation
Sales (thousands) = a + b * Months + c * Spend(thousands)
with fitted parameters
a = 2.1934871882483066E+02
b = 6.3389747441412403E+01
c = 1.0011902575903093E+00
for the following data:
Month Spend Sales
1 120.499 327.341
2 168.666 548.424
3 334.308 978.437
4 311.963 885.522
5 275.592 696.238
6 405.845 1268.859
7 399.824 1054.429
8 343.622 1193.147
9 619.030 1118.420
10 541.674 985.816
11 701.460 1263.009
12 957.681 1960.920
13 479.050 1240.943
14 552.718 1821.106
15 633.517 1959.944
16 527.424 2351.679
17 1050.231 2419.749
18 583.889 2104.677
19 322.356 1373.471
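For reference, a comparable flat-surface fit can be reproduced in R with lm (a sketch, assuming the table above has been read into a data frame dat with columns Month, Spend and Sales):
fit <- lm(Sales ~ Month + Spend, data = dat)
coef(fit) # should be close to the a, b, c values reported above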
If you want to include the point (0,0) in your regression line, this means setting the intercept to zero.
In R you can achieve this by
mod_nointercept <- lm(y ~ 0 + x)
In this model only the slope (beta) is fitted; the intercept (alpha) is fixed at zero.
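Applied to the data posted above (a sketch, again assuming the table is in a data frame dat):
mod_full        <- lm(Sales ~ Spend, data = dat)     # ordinary fit with an intercept
mod_nointercept <- lm(Sales ~ 0 + Spend, data = dat) # regression forced through (0, 0)
summary(mod_nointercept)                             # only the slope is estimated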
I've trained a model to predict a certain variable. When I now use this model to predict that variable and compare the predictions to the actual values, I get the two following distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried plotting the difference predicted - actual, and also the difference relative to the mean of the respective group, but neither approach has produced the result I want. With the plotted distribution it is especially important to be able to make the statement above, i.e. that 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's consider the two distributions as df_actual, df_predicted, then calculate
# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the relative % difference by :
x_diff = mean((df_diff$x - df_actual$x) / df_actual$x) * 100
y_diff = mean((df_diff$y - df_actual$y) / df_actual$y) * 100
This will give you the percentage difference (positive or negative) in x as well as y. This is just my opinion; also follow this thread for displaying and measuring the area between two distribution curves.
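As a rough illustration of measuring the area shared by the two distribution curves (a sketch; it assumes df_actual$x and df_predicted$x hold the raw values):
d_a <- density(df_actual$x)
d_p <- density(df_predicted$x, from = min(d_a$x), to = max(d_a$x), n = length(d_a$x))
overlap <- sum(pmin(d_a$y, d_p$y)) * diff(d_a$x)[1] # approximate area lying under both curves
overlap # close to 1 means near-identical distributions; smaller means more separation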
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs; otherwise the actual - predicted differences will be overshadowed by the variance of the actual (and predicted) values alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.
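Putting it together (a sketch using the toy layout above; the column and object names are illustrative):
merged   <- merge(df[df$type == "actual", ], df[df$type == "predicted", ], by = "x")
pct_diff <- (merged$y.y - merged$y.x) / merged$y.x * 100 # predicted vs actual, in percent

plot(density(pct_diff), main = "Percent difference, predicted vs actual")
quantile(pct_diff, c(0.25, 0.75)) # the middle 50% of predictions fall between these bounds

ks.test(df$y[df$type == "actual"], df$y[df$type == "predicted"]) # as suggested above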
I've successfully completed an analysis in rpart with 0-1 outcome data, weighting the data to deal with the problem of a scarce response. When I plot the tree using prp, I want the node labels to show the true proportions rather than the weighted proportions. Is this possible?
A sample data set below (note that I am working with many more factors than I'm using here!)
require(rpart)
require(rpart.plot)
set.seed(1001)
x<-rnorm(1000)
y<-rbinom(1000,size=1,prob=1/(1+exp(-x)))
z<-10+rnorm(1000)
weights<-ifelse(y==0,1,z)
rpartfun <- rpart(y ~ x, weights = weights, method = "class", control = list(cp = 0))
rparttrim<- prune(rpartfun,cp=rpartfun$cptable[which.min(rpartfun$cptable[,"xerror"]),"CP"])
prp(rparttrim,extra=104)
[I would produce the image I get from that here, but I don't have enough reputation]
I would like that first node (and indeed all the nodes!) to show 0.65 and 0.35 (the true proportions) instead of 0.28 and 0.72 (the weighted proportions).
I have some already-tabulated survey data imported into a data frame, and I can make bar charts from it with ggplot.
X X.1 X.2
3 Less than 1 year 7
4 1-5 years 45
5 6-10 years 84
6 11-15 years 104
7 16 or more years 249
ggplot(responses[3:7,], aes(x = factor(X), y = X.2)) + geom_bar(stat = "identity")
I would like to overlay a normal curve on the bar chart, and add a horizontal box-and-whisker plot below it, but I am unsure of the correct way to do this without the individual observations. It should be possible... I think. The example output I am trying to emulate is here: http://t.co/yOqRmOj5
I look forward to learning a new trick for this if there is one, or to hearing from anyone else who has encountered it.
To save anyone else having to download the 134 page PDF, here is an example of the graph referenced in the question.
In this example the data come from a Likert scale, so the original data can be extrapolated, and a normal curve and boxplot are at least interpretable. However, there are plots where the horizontal scale is nominal; normal curves make no sense in those cases.
Your question is about an ordinal scale. From this summarized data alone it is not reasonable to try to fit a normal curve. You could treat each entry as located at the center point of its range (0.5 years, 3 years, 8 years, etc.), but there is no way to reasonably assign a value to the highest group (and worse, it is your largest group, so its contribution is not negligible). You need the original data to make any reasonable approximation.
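Just to illustrate the center-point idea (a sketch; the value chosen for the open-ended top category is necessarily arbitrary):
midpoints <- c(0.5, 3, 8, 13, 20) # 20 for "16 or more years" is purely a guess
counts    <- c(7, 45, 84, 104, 249)
pseudo    <- rep(midpoints, counts) # expand the summary table into pseudo-observations
summary(pseudo)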
If you just want a density estimation based on the data that you have, then the oldlogspline function in the logspline package can fit density estimates to interval censored data:
mymat <- cbind( c(0,1,5.5,10.5, 15.5), c(1,5.5,10.5, 15.5, Inf) )[rep(1:5, c(7,45,84,104,249)),]
library(logspline)
fit <- oldlogspline(interval=mymat[mymat[,2] < 100,],
right=mymat[ mymat[,2]>100, 1], lbound=0)
fit2 <- oldlogspline.to.logspline(fit)
hist( mymat[,1]+0.5, breaks=c(0,1,5.5,10.5,15.5,60), main='', xlab='Years')
plot(fit2, add=TRUE, col='blue')
If you want a normal distribution, then the survreg function in the survival package will fit interval censored data:
library(survival)
mymat2 <- mymat
mymat2[ mymat2>100 ] <- NA
fit3 <- survreg( Surv(mymat2[,1], mymat2[,2], type='interval2') ~ 1,
dist='gaussian', control=survreg.control(maxiter=100) )
curve( dnorm(x, coef(fit3), fit3$scale), from=0, to=60, col='green', add=TRUE)
Though a different distribution may fit better:
fit4 <- survreg( Surv(mymat2[,1]+.01, mymat2[,2], type='interval2') ~ 1,
dist='weibull', control=survreg.control(maxiter=100) )
curve( dweibull(x, scale=exp(coef(fit4)), shape=1/fit4$scale),
from=0, to=60, col='red', add=TRUE)
You could also fit a discrete distribution using fitdistr in MASS:
library(MASS)
tmpfun <- function(x, size, prob) {
ifelse(x==0, dnbinom(0,size,prob),
ifelse(x < 5, pnbinom(5,size,prob)-pnbinom(0,size,prob),
ifelse(x < 10, pnbinom(10,size,prob)-pnbinom(5,size,prob),
ifelse(x < 15, pnbinom(15,size,prob)-pnbinom(10,size,prob),
pnbinom(15,size,prob, lower.tail=FALSE)))))
}
fit5 <- fitdistr( mymat[,1], tmpfun, start=list(size=6, prob=0.28) )
lines(0:60, dnbinom(0:60, fit5$estimate[1], fit5$estimate[2]),
type='h', col='orange')
If you wanted something a little fuzzier, such that 5.5 years could have been reported as either 5 or 6 years, and missing or "don't know" responses could be used to some degree (under some assumptions), then the EM algorithm could be used to estimate the parameters. But this is a lot more complicated, and you would need to specify your assumptions about how the actual values translate into the observed values.
There might be a better way to look at that data. Since it is constrained by design to be integer valued, perhaps fitting a Poisson or Negative Binomial distribution might be more sensible. I think you should ponder the fact that the X values in the data you present are somewhat arbitrary. There appears to be no good reason to think that 3 is the most appropriate value for the lowest category. Why not 1?
And then, of course, you need to explain what the data refer to. They do not look at all Normal, or even Poisson, distributed. They are very left skewed, and there are not a lot of left-skewed distributions in common usage (despite there being an infinite number of possible such distributions).
If you just want to demonstrate how non-Normal this data is, even ignoring the fact that you would be fitting a truncated version of a Normal distribution, then take a look at this plotting exercise:
barp <- barplot( dat$X.2)
barp
# this is what barplot returns and is then used as the x-values for a call to lines.
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
lines(barp, 1000*dnorm(seq(3,7), 7,2))