What values to look at in cross-validated linear regression in the DAAG package in R

I performed the following on a data set that contains 151 variables with 161 observations:
> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)
And got the following results:
fold 1
Observations in test set: 16
7 11 12 24 33 38 52 67 72
Predicted 49.6 44.1 26.4 39.8 53.3 40.33 47.8 56.7 58.5
cvpred 575.0 -113.2 640.7 -1045.8 876.7 -5.93 2183.0 -129.7 212.6
RT..seconds. 42.0 44.0 44.0 45.0 45.0 46.00 49.0 56.0 58.0
CV residual -533.0 157.2 -596.7 1090.8 -831.7 51.93 -2134.0 185.7 -154.6
What I want to do is compare the predicted results to the actual experimental results, so I can plot a graph of the two against each other to show how similar they are. Am I right in assuming I would do this by using the values in the Predicted row as my predicted results, and not the cvpred?
I only ask because when I performed the very same thing in the caret package, the predicted and observed values came out far more different from one another:
library(caret)
ctrl <- trainControl(method = "cv", savePred=T, classProb=T)
mod <- train(RT..seconds.~., data=cadets, method = "lm", trControl = ctrl)
mod$pred
pred obs rowIndex .parameter Resample
1 141.2 42 6 none Fold01
2 -504.0 42 7 none Fold01
3 1196.1 44 16 none Fold01
4 45.0 45 27 none Fold01
5 262.2 45 35 none Fold01
6 570.9 52 58 none Fold01
7 -166.3 53 61 none Fold01
8 -1579.1 59 77 none Fold01
9 2699.0 60 79 none Fold01
The model shouldn't be this inaccurate, as I originally started from 1664 variables and reduced them using random forest so that only variables with a variable importance greater than 1 were kept, which massively reduced my dataset from 162 * 1664 to 162 * 151.
If someone could explain this to me I'd be grateful, thanks

I think there are a few areas of confusion here; let me try to clear them up for you.
The "Predicted" section from cv.lm does not correspond to results from cross-validation. If you're interested in cross-validation then you need to look at your "cvpred" results; "Predicted" corresponds to predictions from the model fit using all of your data.
The reason there is such a large difference between your predictions and your CV predictions is likely that your final model is overfitting, which should illustrate why cross-validation is so important.
I believe that you are fitting your cv.lm model incorrectly. I've never used the package, but I think you want to pass in something like cv.lm(df = cadets, RT..seconds.~., m = 10) rather than your fit object. I'm not sure why you see such a large difference between your cvpred and Predicted values in the example above, but these results tell me that passing in a fitted model leads to using a model that was fit on all of the data for each CV fold:
library(DAAG)
fit <- lm(Sepal.Length ~ ., data=iris)
mod1 <- cv.lm(df=iris,fit,m=10)
mod2 <- cv.lm(df=iris,Sepal.Length ~ .,m=10)
> sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
[1] 0.318
> sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
[1] 5.94
> sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
[1] 0.0311
> sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
[1] 5.94
The reason there is such a difference from your caret results is that you were looking at the "Predicted" section. "cvpred" should line up closely with caret (although make sure to match up the row indices on your CV results), and if you want to line up the "Predicted" results with caret you will need to get your predictions using something like predict(mod, cadets).
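To make the comparison concrete, here is a minimal sketch (assuming your cadets data frame and the mod object from caret above). cv.lm invisibly returns its input data frame with cvpred and Predicted columns appended, so you can plot the cross-validated predictions against the observed response directly:
library(DAAG)
out <- cv.lm(df = cadets, RT..seconds. ~ ., m = 10, printit = FALSE, plotit = FALSE)
plot(out$RT..seconds., out$cvpred,
     xlab = "Observed RT (seconds)", ylab = "CV prediction")
abline(0, 1, lty = 2)  # points near this line indicate good predictions
# caret stores the equivalent pairs in mod$pred (use rowIndex to map back to rows)
plot(mod$pred$obs, mod$pred$pred,
     xlab = "Observed RT (seconds)", ylab = "CV prediction")
abline(0, 1, lty = 2)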


Creating and plotting confidence intervals

I have fitted a Gaussian GLM model to my data, and I now wish to create 95% CIs and plot them with my data. I'm having a couple of issues: when plotting, I can't get the intervals to capture my data; they just seem to plot the same line as the model without capturing the data points. I'm also unsure that I've created my CIs for the mean the correct way here. I've entered my data and code below, in case anyone knows how to fix this.
The data used:
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
             data = aids,
             family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
     xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
     xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
The outcome I currently get shows no data points; the model line is correct here, but the CIs aren't, as they just trace the model line instead of capturing the data, so I think the CIs are made incorrectly somewhere.
Edit: response to the OP providing the full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a Poisson GLM is just not a good model. To say it differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one, actually). The code below uses the segmented package to fit a Poisson GLM with 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred :=
       predict(
         segmented(glm(cases ~ date, .SD, family = poisson), seg.Z = ~date, npsi = 1),
         type = 'response', se.fit = TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the count of breakpoints, but not where they are; the algorithm figures that out for you. So here we see a regime prior to 3Q87 which is well modeled by a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
  geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
  geom_line(aes(y=pred))+
  geom_point(aes(y=cases))+
  facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the Poisson GLM, the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
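For example (a sketch using the same aids data as above): segmented fits inherit from glm, so AIC and BIC can be compared directly across breakpoint counts, with lower values better.
fits <- lapply(1:4, \(n)
  segmented(glm(cases ~ date, aids, family = poisson), seg.Z = ~date, npsi = n))
data.frame(npsi = 1:4, AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))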
Finally, there is also the mcp package, which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
  geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
  geom_line(aes(y=pred))+
  geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
  stat_smooth(method = glm, method.args=list(family=poisson))+
  geom_point()
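One last note on the code in the question: with family = poisson(link='log'), the back-transform to the mean scale is exp(), not the inverse logit exp(u)/(1+exp(u)) used there, and the CI curves should be drawn on one plot rather than in a second plot() call. A minimal base-R fix, assuming the model3 and my_preds objects from the question:
upper <- exp(my_preds$fit + 1.96 * my_preds$se.fit)  # exp() inverts the log link
lower <- exp(my_preds$fit - 1.96 * my_preds$se.fit)
plot(aids$date, aids$cases, xlab = 'Date', ylab = 'Cases', pch = 20)
lines(aids$date, exp(my_preds$fit), lwd = 2)  # fitted mean on the response scale
lines(aids$date, upper, lwd = 2, lty = 2)     # 95% envelope
lines(aids$date, lower, lwd = 2, lty = 2)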

How to perform repeated k-fold cross validation in R with DAAG package?

I have created a 3-fold linear regression model using the HousePrices data set of the DAAG package. I have read some of the threads here and on Cross Validated, and it was mentioned multiple times that the cross-validation must be repeated many times (like 50 or 100) for robustness. I'm not sure what that means. Does it mean to simply run the code 50 times and calculate the average of the overall ms?
> cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area+bedrooms),
+ m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
+ main="Small symbols show cross-validation predicted values",
+ legend.pos="topleft", printit = TRUE)
Analysis of Variance Table
Response: sale.price
Df Sum Sq Mean Sq F value Pr(>F)
area 1 18566 18566 17.0 0.0014 **
bedrooms 1 17065 17065 15.6 0.0019 **
Residuals 12 13114 1093
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fold 1
Observations in test set: 5
11 20 21 22 23
Predicted 206 249 259.8 293.3 378
cvpred 204 188 199.3 234.7 262
sale.price 215 255 260.0 293.0 375
CV residual 11 67 60.7 58.3 113
Sum of squares = 24351 Mean square = 4870 n = 5
fold 2
Observations in test set: 5
10 13 14 17 18
Predicted 220.5 193.6 228.8 236.6 218.0
cvpred 226.1 204.9 232.6 238.8 224.1
sale.price 215.0 112.7 185.0 276.0 260.0
CV residual -11.1 -92.2 -47.6 37.2 35.9
Sum of squares = 13563 Mean square = 2713 n = 5
fold 3
Observations in test set: 5
9 12 15 16 19
Predicted 190.5 286.3 208.6 193.3 204
cvpred 174.8 312.5 200.8 178.9 194
sale.price 192.0 274.0 212.0 220.0 222
CV residual 17.2 -38.5 11.2 41.1 27
Sum of squares = 4323 Mean square = 865 n = 5
Overall (Sum over all 5 folds)
ms
2816
Every time I repeat it I get this same ms=2816. Can someone please explain what exactly it means to repeat the CV 100 times? Because repeating this code 100 times doesn't seem to change the ms.
Repeating this code 100 times will not change anything. You have set a seed, which means that your sets are always the same sets: with three folds, you will have the same three folds, so all 100 times you will get the same mean square error.
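If the goal is the repeated cross-validation those threads describe, that means re-running the CV with different random partitions and averaging the results, which requires varying the seed. A minimal sketch (cv.lm invisibly returns the data with a cvpred column, so the overall mean square can be computed directly); note this averages over partitions for a more stable error estimate, which is different from the seed-shopping warned against below:
library(DAAG)
ms <- sapply(1:100, function(s) {
  out <- cv.lm(data = DAAG::houseprices,
               form.lm = formula(sale.price ~ area + bedrooms),
               m = 3, seed = s, printit = FALSE, plotit = FALSE)
  mean((out$sale.price - out$cvpred)^2)
})
mean(ms)  # average CV mean square over 100 random partitions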
It also does not seem like you have enough samples for 50 or 100 folds to be appropriate, and there is NO set number of folds that is appropriate across all sets of data.
The number of folds should be reasonable such that you have sufficient testing data.
Also, you do not want to run multiple CV models with different seeds to try to find the best-performing seed, because that form of error hacking is a proxy for overfitting.
You should groom your data well, engineer and transform your variables properly, pick a reasonable number of folds, set a seed so your stakeholders can repeat your findings, and then build your model.

ARIMA: Number of regressors does not match fitted model, Error in forecast.forecast_ARIMA(fit, xreg = ) in R

I have a time series object named timeseries2 which is as shown below:
timeseries2
Time Series:
Start = 1
End = 49
Frequency = 1
sum_profit sum_quantity sum_discount sum_Segment sum_Ship_mode
1 2424.1125 269 9.45 145 105
2 866.1925 163 8.05 100 79
3 123.4122 527 23.15 329 223
4 3313.2568 543 17.20 352 207
5 2636.2171 468 18.65 277 208
6 5316.8660 506 21.42 245 212
I fit the model with y = the sum_profit column and x = the other columns (sum_quantity, sum_discount, sum_Segment and sum_Ship_mode), then try to forecast the next 8 periods. I am getting the error shown below:
(fit <- auto.arima(timeseries2[,"sum_profit"],
xreg=timeseries2[,c(2:5)]))
fcast <- forecast(fit, xreg=rep(mean(timeseries2[,c(2:5)]),8))
Error in forecast.forecast_ARIMA(fit, xreg = rep(mean(timeseries2[,
c(2:5)]), : Number of regressors does not match fitted model
This error appears because rep(mean(timeseries2[,c(2:5)]),8) produces a plain vector, whereas your ARIMA model was fitted with four external regressors and therefore expects a matrix with four columns. The following adjustment will run:
fcast <- forecast(fit, xreg=matrix(rep(mean(timeseries2[,c(2:5)]),8),ncol=4))
Of course, this will only give you a 2-period forecast, since the 8 repeated values really form 2 observations of 4 regressors, but that is easily solved (see the sketch below). You will get a warning unless you provide names to the matrix columns that match your original data, but this is safely ignored if you check your input properly.
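For an actual 8-period forecast, a sketch along the same lines (assuming the same fit and timeseries2 objects): build an 8-row matrix that repeats each regressor's own mean down its column, with column names matching the fitted model.
library(forecast)
xreg_future <- matrix(rep(colMeans(timeseries2[, 2:5]), each = 8), nrow = 8,
                      dimnames = list(NULL, colnames(timeseries2)[2:5]))
fcast <- forecast(fit, xreg = xreg_future)  # 8 rows -> 8 forecast periods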

H2O deeplearning with class imbalance

I am using the H2O deeplearning feed-forward deep neural network for binary classification. My classes are highly imbalanced and I want to use parameters like
balance_classes, class_sampling_factors
Can anybody give me a reproducible example of how to initialize these parameters to handle class imbalance problems?
First, here is the full, reproducible example:
library(h2o)
h2o.init()
data(iris) #Not required?
iris <- iris[1:120,] #Remove 60% of virginica
summary(iris$Species) #50/50/20
d <- as.h2o(iris)
splits = h2o.splitFrame(d,0.8,c("train","test"), seed=77)
train = splits[[1]]
test = splits[[2]]
summary(train$Species) #41/41/14
summary(test$Species) #9/9/6
m1 = h2o.randomForest(1:4, 5, train, model_id ="RF_defaults", seed=1)
h2o.confusionMatrix(m1)
m2 = h2o.randomForest(1:4, 5, train, model_id ="RF_balanced", seed=1,
balance_classes = TRUE)
h2o.confusionMatrix(m2)
m3 = h2o.randomForest(1:4, 5, train, model_id ="RF_balanced2", seed=1,
balance_classes = TRUE,
class_sampling_factors = c(1, 1, 2.5)
)
h2o.confusionMatrix(m3)
The first lines initialize H2O, then I deliberately modify the iris data set to throw away 60% of one of the 3 classes, to create an imbalance.
The next few lines load that data into H2O and create an 80%/20% train/test data split. The seed was chosen deliberately, so that in the training data virginica is 14.58% of the data, compared to 16.67% in the original data, and 25% in the test data.
I then train three random forest models. m1 is all defaults, and its confusion matrix looks like this:
setosa versicolor virginica Error Rate
setosa 41 0 0 0.0000 = 0 / 41
versicolor 0 39 2 0.0488 = 2 / 41
virginica 0 1 13 0.0714 = 1 / 14
Totals 41 40 15 0.0312 = 3 / 96
Nothing to see here: it uses the data it finds.
Now here is the same output for m2, which switches on balance_classes. You can see it has over-sampled the virginica class to get the classes as balanced as possible. (The right-most column says 41, 41, 40 instead of 41, 41, 14 as in the previous output.)
setosa versicolor virginica Error Rate
setosa 41 0 0 0.0000 = 0 / 41
versicolor 0 41 0 0.0000 = 0 / 41
virginica 0 2 38 0.0500 = 2 / 40
Totals 41 43 38 0.0164 = 2 / 122
In m3 we still switch on balance_classes, but we also tell it the truth of the situation: the actual data is 16.67% virginica, not the 14.58% it sees in the train data. The confusion matrix for m3 shows that it therefore turned the 14 virginica samples into 37 samples instead of 40.
setosa versicolor virginica Error Rate
setosa 41 0 0 0.0000 = 0 / 41
versicolor 0 41 0 0.0000 = 0 / 41
virginica 0 2 35 0.0541 = 2 / 37
Totals 41 43 35 0.0168 = 2 / 119
How did I know to write c(1, 1, 2.5), and not c(2.5, 1, 1) or c(1, 2.5, 1)? The docs say it must be in "lexicographic order". You can find out what that order is with:
h2o.levels(train$Species)
which tells me:
[1] "setosa" "versicolor" "virginica"
The opinion bit: balance_classes is good to switch on, but class_sampling_factors should only be used when you have a really good reason to believe that your training data is not representative.
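Since the question asked about h2o.deeplearning specifically: both arguments carry over unchanged. A sketch on the same train frame (note that deep learning results also vary from run to run unless reproducible = TRUE is set, which forces slower single-threaded training):
m4 <- h2o.deeplearning(1:4, 5, train, model_id = "DL_balanced", seed = 1,
                       balance_classes = TRUE,
                       class_sampling_factors = c(1, 1, 2.5))
h2o.confusionMatrix(m4)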
NOTE: Code and explanation adapted from my upcoming book, Practical Machine Learning with H2O.

Curve fitting this data in R?

For a few days I've been working on this problem and I'm stuck ...
I have performed a number of Monte Carlo simulations in R which give an output y for each input x, and there is clearly some simple relationship between x and y, so I want to identify the formula and its parameters. But I can't seem to get a good overall fit for both the 'Low x' and 'High x' series, e.g. using a logarithm like this:
dat = data.frame(x=x, y=y)
fit = nls(y~a*log10(x)+b, data=dat, start=list(a=-0.8,b=-2), trace=TRUE)
I have also tried to fit (log10(x), 10^y) instead, which gives a good fit but the reverse transformation doesn't fit (x, y) very well.
Can anyone solve this?
Please explain how you found the solution.
Thanks!
EDIT:
Thanks for all the quick feedback!
I am not aware of a theoretical model for what I'm simulating so I have no basis for comparison. I simply don't know the true relationship between x and y. I'm not a statistician, by the way.
The underlying model is sort of a stochastic feedback-growth model. My objective is to determine the long-term growth-rate g given some input x>0, so the output of the system grows exponentially by the rate 1+g in each iteration. The system has a stochastic production in each iteration based on the system's size; a fraction of this production is output and the rest is kept in the system, as determined by another stochastic variable. From MC simulation I have found the growth-rates of the system output to be log-normally distributed for every x I have tested, and the y's in the data series are the log-means of the growth-rates g. As x goes towards infinity, g goes towards zero. As x goes towards zero, g goes towards infinity.
I would like a function that could calculate y from x. I actually only need a function for low x, say, in the range 0 to 10. I was able to fit that quite well by y=1.556 * x^-0.4 -3.58, but it didn't fit well for large x. I'd like a function that is general for all x>0. I have also tried Spacedman's poly fit (thanks!) but it doesn't fit well enough in the crucial range x=1 to 6.
Any ideas?
EDIT 2:
I have experimented some more, also with the detailed suggestions by Grothendieck (thanks!) After some consideration I decided that since I don't have a theoretical basis for choosing one function over another, and I'm most likely only interested in x-values between 1 and 6, I ought to use a simple function that fits well. So I just used y~a*x^b+c and made a note that it doesn't fit for high x. I may seek the community's help again when the first draft of the paper is finished. Perhaps one of you can spot the theoretical relationship between x and y once you see the Monte Carlo model.
Thanks again!
Low x series:
x y
1 0.2 -0.7031864
2 0.3 -1.0533648
3 0.4 -1.3019655
4 0.5 -1.4919278
5 0.6 -1.6369545
6 0.7 -1.7477481
7 0.8 -1.8497117
8 0.9 -1.9300209
9 1.0 -2.0036842
10 1.1 -2.0659970
11 1.2 -2.1224324
12 1.3 -2.1693986
13 1.4 -2.2162889
14 1.5 -2.2548485
15 1.6 -2.2953162
16 1.7 -2.3249750
17 1.8 -2.3570141
18 1.9 -2.3872684
19 2.0 -2.4133978
20 2.1 -2.4359624
21 2.2 -2.4597122
22 2.3 -2.4818787
23 2.4 -2.5019371
24 2.5 -2.5173966
25 2.6 -2.5378936
26 2.7 -2.5549524
27 2.8 -2.5677939
28 2.9 -2.5865958
29 3.0 -2.5952558
30 3.1 -2.6120607
31 3.2 -2.6216831
32 3.3 -2.6370452
33 3.4 -2.6474608
34 3.5 -2.6576862
35 3.6 -2.6655606
36 3.7 -2.6763866
37 3.8 -2.6881303
38 3.9 -2.6932310
39 4.0 -2.7073198
40 4.1 -2.7165035
41 4.2 -2.7204063
42 4.3 -2.7278532
43 4.4 -2.7321731
44 4.5 -2.7444773
45 4.6 -2.7490365
46 4.7 -2.7554178
47 4.8 -2.7611471
48 4.9 -2.7719188
49 5.0 -2.7739299
50 5.1 -2.7807113
51 5.2 -2.7870781
52 5.3 -2.7950429
53 5.4 -2.7975677
54 5.5 -2.7990999
55 5.6 -2.8095955
56 5.7 -2.8142453
57 5.8 -2.8162046
58 5.9 -2.8240594
59 6.0 -2.8272394
60 6.1 -2.8338866
61 6.2 -2.8382038
62 6.3 -2.8401935
63 6.4 -2.8444915
64 6.5 -2.8448382
65 6.6 -2.8512086
66 6.7 -2.8550240
67 6.8 -2.8592950
68 6.9 -2.8622220
69 7.0 -2.8660817
70 7.1 -2.8710430
71 7.2 -2.8736998
72 7.3 -2.8764701
73 7.4 -2.8818748
74 7.5 -2.8832696
75 7.6 -2.8833351
76 7.7 -2.8891867
77 7.8 -2.8926849
78 7.9 -2.8944987
79 8.0 -2.8996780
80 8.1 -2.9011012
81 8.2 -2.9053911
82 8.3 -2.9063661
83 8.4 -2.9092228
84 8.5 -2.9135426
85 8.6 -2.9101730
86 8.7 -2.9186316
87 8.8 -2.9199631
88 8.9 -2.9199856
89 9.0 -2.9239220
90 9.1 -2.9240167
91 9.2 -2.9284608
92 9.3 -2.9294951
93 9.4 -2.9310985
94 9.5 -2.9352370
95 9.6 -2.9403694
96 9.7 -2.9395336
97 9.8 -2.9404153
98 9.9 -2.9437564
99 10.0 -2.9452175
High x series:
x y
1 2.000000e-01 -0.701301
2 2.517851e-01 -0.907446
3 3.169786e-01 -1.104863
4 3.990525e-01 -1.304556
5 5.023773e-01 -1.496033
6 6.324555e-01 -1.674629
7 7.962143e-01 -1.842118
8 1.002374e+00 -1.998864
9 1.261915e+00 -2.153993
10 1.588656e+00 -2.287607
11 2.000000e+00 -2.415137
12 2.517851e+00 -2.522978
13 3.169786e+00 -2.621386
14 3.990525e+00 -2.701105
15 5.023773e+00 -2.778751
16 6.324555e+00 -2.841699
17 7.962143e+00 -2.900664
18 1.002374e+01 -2.947035
19 1.261915e+01 -2.993301
20 1.588656e+01 -3.033517
21 2.000000e+01 -3.072003
22 2.517851e+01 -3.102536
23 3.169786e+01 -3.138539
24 3.990525e+01 -3.167577
25 5.023773e+01 -3.200739
26 6.324555e+01 -3.233111
27 7.962143e+01 -3.259738
28 1.002374e+02 -3.291657
29 1.261915e+02 -3.324449
30 1.588656e+02 -3.349988
31 2.000000e+02 -3.380031
32 2.517851e+02 -3.405850
33 3.169786e+02 -3.438225
34 3.990525e+02 -3.467420
35 5.023773e+02 -3.496026
36 6.324555e+02 -3.531125
37 7.962143e+02 -3.558215
38 1.002374e+03 -3.587526
39 1.261915e+03 -3.616800
40 1.588656e+03 -3.648891
41 2.000000e+03 -3.684342
42 2.517851e+03 -3.716174
43 3.169786e+03 -3.752631
44 3.990525e+03 -3.786956
45 5.023773e+03 -3.819529
46 6.324555e+03 -3.857214
47 7.962143e+03 -3.899199
48 1.002374e+04 -3.937206
49 1.261915e+04 -3.968795
50 1.588656e+04 -4.015991
51 2.000000e+04 -4.055811
52 2.517851e+04 -4.098894
53 3.169786e+04 -4.135608
54 3.990525e+04 -4.190248
55 5.023773e+04 -4.237104
56 6.324555e+04 -4.286103
57 7.962143e+04 -4.332090
58 1.002374e+05 -4.392748
59 1.261915e+05 -4.446233
60 1.588656e+05 -4.497845
61 2.000000e+05 -4.568541
62 2.517851e+05 -4.628460
63 3.169786e+05 -4.686546
64 3.990525e+05 -4.759202
65 5.023773e+05 -4.826938
66 6.324555e+05 -4.912130
67 7.962143e+05 -4.985855
68 1.002374e+06 -5.070668
69 1.261915e+06 -5.143341
70 1.588656e+06 -5.261585
71 2.000000e+06 -5.343636
72 2.517851e+06 -5.447189
73 3.169786e+06 -5.559962
74 3.990525e+06 -5.683828
75 5.023773e+06 -5.799319
76 6.324555e+06 -5.929599
77 7.962143e+06 -6.065907
78 1.002374e+07 -6.200967
79 1.261915e+07 -6.361633
80 1.588656e+07 -6.509538
81 2.000000e+07 -6.682960
82 2.517851e+07 -6.887793
83 3.169786e+07 -7.026138
84 3.990525e+07 -7.227990
85 5.023773e+07 -7.413960
86 6.324555e+07 -7.620247
87 7.962143e+07 -7.815754
88 1.002374e+08 -8.020447
89 1.261915e+08 -8.229911
90 1.588656e+08 -8.447927
91 2.000000e+08 -8.665613
Without an idea of the underlying process you may as well just fit a polynomial with as many terms as you like. You don't seem to be testing a hypothesis (e.g., gravitational strength is inverse-square related to distance), so you can fish all you like for functional forms; the data is unlikely to tell you which one is 'right'.
So if I read your data into a data frame with x and y components I can do:
data$lx = log(data$x)
plot(data$lx, data$y)                  # needs at least a cubic polynomial
m1 = lm(y ~ poly(lx, 3), data = data)  # fit a cubic
points(data$lx, fitted(m1), pch = 19)
and the fitted points are pretty close. Change the polynomial degree from 3 to 7 and the points are identical. Does that mean that your Y values are really coming from a 7-degree polynomial of your X values? No. But you've got a curve that goes through the points.
At this scale, you may as well just join adjacent points up with a straight line; your plot is that smooth. But without an underlying theory of why Y depends on X (like an inverse square law, or exponential growth, or something), all you are doing is joining the dots, and there are infinitely many ways of doing that.
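That said, if all you need is a function that returns y for any x inside the simulated range, an interpolator gives you exactly that without assuming any functional form. A sketch on the same data frame:
f <- splinefun(data$x, data$y)  # smooth interpolator through the simulated points
f(3.7)                          # y at x = 3.7, for example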
Regressing x/y vs. x

Plotting y vs. x for the low data and playing around a bit, it seems that x/y is approximately linear in x, so try regressing x/y against x. This gives us a relationship based on only two parameters:
y = x / (a + b * x)
where a and b are the regression coefficients.
> lm(x / y ~ x, lo.data)
Call:
lm(formula = x/y ~ x, data = lo.data)
Coefficients:
(Intercept) x
-0.1877 -0.3216
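To see the implied curve, plug the two coefficients back into y = x / (a + b * x) and overlay it on the data (a sketch; lo.data is assumed to hold the low-x series as columns x and y):
co <- coef(lm(x / y ~ x, lo.data))  # co[1] = a (intercept), co[2] = b (slope)
plot(y ~ x, lo.data)
curve(x / (co[1] + co[2] * x), add = TRUE)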
MM.2

The above can be transformed into the MM.2 model in the drc R package. As seen below, this model has a high R2. We also calculate the AIC, which we can use to compare to other models (lower is better):
> library(drc)
> fm.mm2 <- drm(y ~ x, data = lo.data, fct = MM.2())
> cor(fitted(fm.mm2), lo.data$y)^2
[1] 0.9986303
> AIC(fm.mm2)
[1] -535.7969
CRS.6

This suggests we try a few other drc models; of the ones we tried, CRS.6 has a particularly low AIC and seems to fit well visually:
> fm.crs6 <- drm(y ~ x, data = lo.data, fct = CRS.6())
> AIC(fm.crs6)
[1] -942.7866
> plot(fm.crs6) # see output below
This gives us a range of models to choose from: the 2-parameter MM.2 model, which does not fit quite as well (according to AIC) as CRS.6 but still fits well and has the advantage of only two parameters, or the 6-parameter CRS.6 model with its superior AIC. Note that AIC already penalizes models for having more parameters, so a better AIC is not simply a consequence of having more parameters.
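For reference, AIC = 2k - 2*log(L), where k is the number of parameters and L is the maximized likelihood, so each extra parameter must buy roughly one unit of log-likelihood to pay for itself; CRS.6 beating MM.2 by about 400 AIC units means its improvement in fit is far beyond that penalty.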
Other

If it's believed that both the low and high series should follow the same model form, then finding a single form that fits both well might be used as another criterion for picking a model. In addition to the drc models, there are also some yield-density models in (2.1), (2.2), (2.3) and (2.4) of Akbar et al., IRJFE, 2010, which look similar to the MM.2 model and could be tried.
UPDATED: reworked this around the drc package.
