Disk space prediction using R

I have 2 vectors
days = c(1, 2, 3, 4, 5, 6, 7)
pct_used = c(22.3, 22.1, 22.1, 22.1, 55.660198413, 56.001746032, 55.988769841)
fit <- lm(days ~ poly(pct_used,2,raw=TRUE))
prediction <- predict(fit, data.frame(pct_used=85))
days_remain <- (prediction - tail(days,1))
pct_used is the percentage of disk space in use, so this code is meant to predict when disk usage will reach 85.
The prediction returned is 325.something, which seems weird to me. Does that mean it will take 325 days for pct_used to reach 85?
Where am I going wrong?

Try this to see what is happening:
plot(pct_used, days)
lines(pct_used, predict(fit))
plot(pct_used, days, xlim=c(min(pct_used), 85) ,ylim= c(-50,350))
lines(seq(min(pct_used), 85, length=50), predict(fit, newdata=data.frame(
pct_used=seq( min(pct_used), 85, length=50))))
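To make the extrapolation concrete, here is a minimal sketch (not part of the original answer) that tabulates the fitted curve at a few points inside and beyond the observed range. The data frame name `grid` is introduced only for illustration. The observed values sit in two tight clusters near 22 and 56, so the quadratic term is estimated from very little information about curvature, and a prediction out at 85 can land far from anything in the data.

```r
# Data and model exactly as in the question
days <- c(1, 2, 3, 4, 5, 6, 7)
pct_used <- c(22.3, 22.1, 22.1, 22.1, 55.660198413, 56.001746032, 55.988769841)
fit <- lm(days ~ poly(pct_used, 2, raw = TRUE))

# 'grid' is a hypothetical helper: the two ends of the observed range,
# plus two values well outside it
grid <- data.frame(pct_used = c(min(pct_used), max(pct_used), 70, 85))
grid$predicted_day <- predict(fit, newdata = grid)

# Compare the predictions inside the data range with those beyond it
print(grid)
```

If the last rows are far larger than the observed days (1 to 7), that is the polynomial being extrapolated, not a meaningful forecast.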

Related

predict() should display one value, but generates far too many values

I have a dataset of the German soccer league that lists every team in the league with its player value, goals and points. The Freiburg team has scored 19 goals with a player value of 1.12. Using the linear model I created, I want to predict how many goals Freiburg could expect with a player value of 5.
If I run the code below, the function returns not one value but 18, one for each team. How can I change it so that I get only the value for Freiburg? (The prediction from the linear model should be 27.52.)
m3 <- lm(bundesliga$Goals ~ bundesliga$PlayerValue)
summary(m3)
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)
Dataset:
You have specified your model in a way that R discourages.
The preferred way is:
m3 <- lm(Goals ~ PlayerValue, data=bundesliga)
Then the prediction works as expected using your command:
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)
# 1
#27.52412
Although the help page of lm does say that the data argument is optional, specifying it in the model allows other functions, such as predict, to work. There is a note in the help page of predict.lm:
Note
Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit). A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.
This is why your original command doesn't work and you get the warning message:
predict(m3, newdata = nd)
1 2 3 4 5 6 7 8 9
40.06574 28.31378 26.08416 25.45708 25.31773 25.22483 24.22614 23.55261 23.36681
10 11 12 13 14 15 16 17 18
21.60169 20.51011 20.23140 20.25463 19.58110 19.48820 18.60564 18.60564 18.51274
#Warning message:
#'newdata' had 1 row but variables found have 18 rows
The predictor in your model is named bundesliga$PlayerValue, not PlayerValue, so predict cannot match it to the PlayerValue column in newdata and instead evaluates it on the original 18 rows.
Data:
bundesliga <- structure(list(PlayerValue = c(10.4, 5.34, 4.38, 4.11, 4.05, 4.01,
3.58, 3.29, 3.21, 2.45, 1.98, 1.86, 1.87, 1.58, 1.54, 1.16, 1.16, 1.12),
Goals = c(34, 32, 34, 35, 32, 16, 26, 27, 23, 13, 10, 21, 22, 18, 24, 21, 12, 19)),
class = "data.frame", row.names = c(NA, -18L))
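The difference between the two model specifications can be seen in the coefficient names. This is a small sketch using the data above; the object names `bad` and `good` are introduced here only for illustration.

```r
bundesliga <- data.frame(
  PlayerValue = c(10.4, 5.34, 4.38, 4.11, 4.05, 4.01, 3.58, 3.29, 3.21,
                  2.45, 1.98, 1.86, 1.87, 1.58, 1.54, 1.16, 1.16, 1.12),
  Goals = c(34, 32, 34, 35, 32, 16, 26, 27, 23, 13, 10, 21, 22, 18, 24,
            21, 12, 19)
)

# Discouraged form: the term is literally called "bundesliga$PlayerValue"
bad <- lm(bundesliga$Goals ~ bundesliga$PlayerValue)
names(coef(bad))

# Preferred form: the term is called "PlayerValue", matching newdata
good <- lm(Goals ~ PlayerValue, data = bundesliga)
names(coef(good))

# With matching names, predict returns a single value
pred <- predict(good, newdata = data.frame(PlayerValue = 5))
pred
```

Because newdata has no column called bundesliga$PlayerValue, a prediction from `bad` falls back to the 18 original values, which is where the warning comes from.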

Problems with ks.test and ties

I have a distribution, for example:
d
#[1] 4 22 15 5 9 5 11 15 21 14 14 23 6 9 17 2 7 10 4
Or, the vector d in dput format.
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
When I apply ks.test:
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
This gives the following warning:
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
ties should not be present for the Kolmogorov-Smirnov test
I tried using unique(d), but obviously that drops values from my data, and I don't want that to happen.
In other examples online the same warning appears, but there the test still shows results along with the warning, whereas I only get the message and no ks.test values.
Can anyone help?
Your result is stored in gamma; the warning does not stop the test from running.
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
ties should not be present for the Kolmogorov-Smirnov test
gamma
One-sample Kolmogorov-Smirnov test
data: d
D = 0.14549, p-value = 0.816
alternative hypothesis: two-sided
You can find an explanation of the warning in the help page, ?ks.test:
The presence of ties always generates a warning, since continuous
distributions do not generate them. If the ties arose from rounding
the tests may be approximately valid, but even modest amounts of
rounding can have a significant effect on the calculated statistic.
As you can see, some rounding has been applied to your data, so the test is only "approximately" valid.
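As a minimal sketch of the point above: the value returned by ks.test is an ordinary list-like object, so the statistic and p-value can be read from it directly, and the ties that trigger the warning can be listed. The name `res` is used here instead of gamma only to avoid masking the base function gamma(); suppressWarnings is used purely to keep the output short, not because the warning should be ignored.

```r
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)

# The test runs to completion; only the ties warning is silenced here
res <- suppressWarnings(
  ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
)

res$statistic  # the D statistic
res$p.value    # the p-value

# Which values are tied (appear more than once)?
any(duplicated(d))
counts <- table(d)
counts[counts > 1]
```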

How to avoid the error caused by the minus sign of southern latitudes when converting a mixed set of latitudes from degrees and minutes to decimal degrees in R

I have a data frame with a latitude column that mixes northern (positive) and southern (negative) degrees, and a separate minutes column. I used a simple lat + (minutes/60) to convert this into decimal degrees.
lat1 <- c(7, -7, 6, -0, -1, 6, 8, -7, 6, 6)
lat2 <- c(7.4, 55.7, 32.6, 8.9, 47.5, 25.6, 6.8, 45.7, 24.6, 7.6)
ifelse(lat1<0,(lat <- lat1-(lat2/60)),(lat <- lat1+(lat2/60)))
[1] 7.1233333 -7.9283333 6.5433333 0.1483333 -1.7916667 6.4266667
[7] 8.1133333 -7.7616667 6.4100000 6.1266667
This result is correct, but:
> lat
[1] 7.1233333 -6.0716667 6.5433333 0.1483333 -0.2083333 6.4266667
[7] 8.1133333 -6.2383333 6.4100000 6.1266667
The ifelse statement prints the correct result in the R console, but it does not store that result in the vector "lat".
I need to add minutes/60 to the degrees if the degree value is positive and subtract minutes/60 from the degrees if the degree value is negative.
The correct syntax to use ifelse here is the following:
lat <- ifelse(lat1 < 0, lat1 - lat2 / 60, lat1 + lat2 / 60)
ifelse returns a vector whose component i is lat1[i] - lat2[i] / 60 if lat1[i] < 0 and lat1[i] + lat2[i] / 60 otherwise, so you don't need to put the assignment inside.
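A short sketch of why the original attempt stored the wrong values: ifelse evaluates its "yes" argument and then its "no" argument in full, so an assignment placed inside both branches is simply overwritten by whichever runs last. The variable `tmp` is a name introduced here for illustration.

```r
lat1 <- c(7, -7, 6, -0, -1, 6, 8, -7, 6, 6)
lat2 <- c(7.4, 55.7, 32.6, 8.9, 47.5, 25.6, 6.8, 45.7, 24.6, 7.6)

# Correct: assign the value that ifelse returns
lat <- ifelse(lat1 < 0, lat1 - lat2 / 60, lat1 + lat2 / 60)
lat

# Original pattern: both assignments run on the whole vector,
# and the second one ("no" branch) is what remains in tmp
invisible(ifelse(lat1 < 0, (tmp <- lat1 - lat2 / 60), (tmp <- lat1 + lat2 / 60)))
identical(tmp, lat1 + lat2 / 60)
```

Note that `-0 < 0` is FALSE in R, so the fourth entry is treated as a northern latitude. If 0 degrees south can occur in the data, the hemisphere would have to be recorded separately from the sign of the degrees.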

Plot variables as slope of line between points

Due to the nature of my specification, the results of my regression coefficients provide the slope (change in yield) between two points; therefore, I would like to plot these coefficients using the slope of a line between these two points with the first point (0, -0.7620) as the intercept. Please note this is a programming question; not a statistics question.
I'm not entirely sure how to implement this in base graphics or ggplot and would appreciate any help. Here is some sample data.
Sample Data:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29), y = c(-0.762,-0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035) )
Output:
x y
1 0 -7.620e-01
2 5 -4.340e-04
3 8 1.580e-03
4 10 8.220e-05
5 12 -2.940e-03
6 15 2.460e-03
7 20 -5.210e-04
8 25 -9.287e-05
9 29 -1.035e-02
Example:
You can use cumsum, the cumulative sum, to calculate intermediate values
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29), y = cumsum(c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035)))
plot(df$x,df$y)
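For reference, a minimal sketch of what cumsum does to the sample data, assuming (as the answer above does) that the coefficients should simply be accumulated starting from the intercept. The column name `y_cum` is introduced here for illustration.

```r
df <- data.frame(
  x = c(0, 5, 8, 10, 12, 15, 20, 25, 29),
  y = c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294,
        0.00246, -0.000521, -0.00009287, -0.01035)
)

# Running total: element i is y[1] + y[2] + ... + y[i]
df$y_cum <- cumsum(df$y)
df
```

Calling plot(df$x, df$y_cum, type = "b") would then draw the points joined by line segments.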

How to calculate a mean value from multiple maximal values

I have a variable, e.g. c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15).
How can I calculate a mean value from random maximal values in R?
For example, I would like to calculate the mean of three maximal values.
First step: draw a sample of 3 from your data and store it in x.
Second step: calculate the mean of the sample.
Try:
dat <- c(0,8,7,15, 85, 12, 46, 12, 10, 15,15)
x <- sample(dat,3)
x
mean(x)
Possible output:
> x <- sample(dat,3)
> x
[1] 85 15 0
> mean(x)
[1] 33.33333
If you mean the three highest values, just sort your vector and subset:
> mean(sort(c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15), decreasing = TRUE)[1:3])
[1] 48.66667
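The sort-and-subset approach can be wrapped in a small helper so the number of values is a parameter. This is only a sketch; the function name `top_mean` and its argument `n` are hypothetical, not from any package.

```r
dat <- c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15)

# Mean of the n largest values of x (hypothetical helper)
top_mean <- function(x, n = 3) {
  largest <- head(sort(x, decreasing = TRUE), n)  # n highest values
  mean(largest)
}

top_mean(dat)       # mean of 85, 46 and 15
top_mean(dat, n = 1)  # just the maximum
```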
