How to add nearest coordinate points from one file to another using the RANN package in R

I am trying to use the RANN package to find, for each point in one file, the nearest coordinate point in a second file, and then add the matched values to the first file.
My 1st file -> fire
lat lon frp
30.037 80.572 38.5
23.671 85.008 7.2
22.791 86.206 11.4
23.755 86.421 5.6
23.673 86.088 4.2
23.768 86.392 8.4
23.789 86.243 7.8
23.805 86.327 6.4
23.682 86.085 7.8
23.68 86.095 5.7
21.194 81.41 19
16.95 81.912 8
16.952 81.898 11.5
16.899 81.682 10.6
12.994 79.651 16.1
9.2 77.603 14.5
12.291 77.346 20.5
17.996 79.708 13.9
17.998 79.718 29.6
16.61 81.266 6.6
16.499 81.2 6.8
19.505 81.784 22.4
18.322 80.555 7.7
19.506 81.794 28.2
21.081 81.957 8.7
21.223 82.127 9.4
20.918 81.025 6.3
19.861 82.123 9.3
20.62 75.049 11.6
and 2nd file -> wind
lat lon si10 u10 v10
40 60 3.5927058834376 -0.874587879393667 -0.375465368327018
40 60.125 3.59519876134577 -0.836646189656238 -0.388624092937835
40 60.25 3.59769163925393 -0.798704499918809 -0.401782817548651
40 60.375 3.6001845171621 -0.76076281018138 -0.414941542159468
40 60.5 3.60246965524458 -0.722821120443951 -0.428380239634345
40 60.625 3.60496253315275 -0.684585309080651 -0.441538964245162
40 60.75 3.60766315088659 -0.646937740969094 -0.454977661720038
40 60.875 3.60911732966636 -0.609878416109279 -0.468976304923035
40 61 3.608701850015 -0.575172064256437 -0.484934758174451
40 61.125 3.60807863053795 -0.540759834029467 -0.500893211425867
40 61.25 3.60787089071227 -0.506053482176625 -0.516851664677283
40 61.375 3.60745541106091 -0.471641251949655 -0.533090090792759
40 61.5 3.60703993140955 -0.437229021722684 -0.548768571180115
40 61.625 3.60662445175819 -0.402522669869843 -0.565006997295591
40 61.75 3.60454705350139 -0.398993210359384 -0.579285613362648
40 61.875 3.60163869594186 -0.411346318645989 -0.592724310837524
40 62 3.59873033838234 -0.423405305306722 -0.606163008312401
40 62.125 3.59540650117145 -0.435758413593327 -0.619601705787278
40 62.25 3.59249814361192 -0.44781740025406 -0.633320376126214
40 62.375 3.5895897860524 -0.460170508540664 -0.646759073601091
40 62.5 3.58668142849287 -0.471935373575526 -0.660197771075968
40 62.625 3.57546347790613 -0.509288820061212 -0.666357174085286
40 62.75 3.56445326714507 -0.546642266546898 -0.672236604230545
40 62.875 3.55323531655832 -0.584289834658455 -0.678675980103923
40 63 3.54201736597158 -0.621643281144141 -0.684835383113241
40 63.125 3.53100715521052 -0.658996727629827 -0.69099478612256
40 63.25 3.51978920462378 -0.696350174115513 -0.697154189131878
40 63.375 3.50005392118414 -0.726644701580281 -0.692954596170979
40 63.5 3.46266075256166 -0.743115512629088 -0.668037011269646
I want to add wind$si10, wind$u10 and wind$v10 into the fire file at the coordinates nearest to each frp value. First, I tried it with the variable si10 only, because in the RANN package both the fire and wind inputs should have the same number of columns. So I used the following code with si10 only:
library(RANN)
read.table(file.choose(), sep="\t", header = T) -> wind_jan
read.table(file.choose(), sep="\t", header = T) -> fire_jan
names(fire_jan)
names(wind_jan)
closest <- RANN::nn2(data = wind_jan, query = fire_jan, k = 1)
closest
fire_jan$wind_lat <- wind_jan[closest$nn.idx, "lat"]
fire_jan$wind_lon <- wind_jan[closest$nn.idx, "lon"]
fire_jan$WS <- wind_jan[closest$nn.idx, "si10"]
With the above code I can extract si10 values at the coordinates nearest to each fire$frp point, but when I apply the same code to the u10 and v10 variables in the wind file, the extracted values do not land on the same coordinates as they did for si10.
How can I fix this code?

You call closest_u$nn.id, which doesn't exist (the list component returned by nn2 is nn.idx). Maybe there is also an error in your labels when reading the wind data frame? Could that be the error?
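A minimal sketch of the usual fix, assuming the column names shown in the question: because nn2 treats every column as a coordinate, match on the lat/lon columns alone (so both inputs have the same dimensions) and then reuse the single index for every wind variable.
library(RANN)
# Match on coordinates only; nn2() treats every column as a dimension
closest <- nn2(data  = wind_jan[, c("lat", "lon")],
               query = fire_jan[, c("lat", "lon")],
               k = 1)
idx <- closest$nn.idx[, 1]   # row of the nearest wind point, per fire point
# The same index works for all three wind variables
fire_jan$wind_lat <- wind_jan$lat[idx]
fire_jan$wind_lon <- wind_jan$lon[idx]
fire_jan$WS       <- wind_jan$si10[idx]
fire_jan$u10      <- wind_jan$u10[idx]
fire_jan$v10      <- wind_jan$v10[idx]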

How to add a new column in data frame using calculation in R?

I want to add a new column with a calculation. Given the data frame below,
Env<- c("High_inoc","High_NO_inoc","Low_inoc", "Low_NO_inoc")
CV1<- c(30,150,20,100)
CV2<- c(74,99,49,73)
CV3<- c(78,106,56,69)
CV4<- c(86,92,66,70)
CV5<- c(74,98,57,79)
Data <- data.frame(Env, CV1, CV2, CV3, CV4, CV5)
library(dplyr)   # for %>% and select()
Data$Mean <- rowMeans(Data %>% select(-Env))
Data <- rbind(Data, c("Mean", colMeans(Data %>% select(-Env))))
I'd like to add a new column named 'Env_index' computed as each value of the 'Mean' column minus the overall mean (76.3), i.e. 68.4 - 76.3, 109 - 76.3, ..., 78.2 - 76.3.
So, I did like this and obtained what I want.
Data$Env_index <- c(68.4-76.3,109-76.3,49.6-76.3,78.2-76.3, 76.3-76.3)
But I want to calculate it directly with code, so if I write it like this,
Data$Env_index <- with (data, data$Mean - 76.3)
It generates an error. Could you let me know how to do the calculation?
Thanks,
The earlier rbind(Data, c("Mean", ...)) coerced every column, including Mean, to character, which is why the subtraction fails. To make the calculation dynamic so that it works on any data, convert Mean back to numeric and subtract its last value:
Data$Mean <- as.numeric(Data$Mean)
Data$Env_index <- Data$Mean - Data$Mean[nrow(Data)]
Data
# Env CV1 CV2 CV3 CV4 CV5 Mean Env_index
#1 High_inoc 30 74 78 86 74 68.4 -7.9
#2 High_NO_inoc 150 99 106 92 98 109.0 32.7
#3 Low_inoc 20 49 56 66 57 49.6 -26.7
#4 Low_NO_inoc 100 73 69 70 79 78.2 1.9
#5 Mean 75 73.75 77.25 78.5 77 76.3 0.0
Data$Mean[nrow(Data)] will select last value of Data$Mean.
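An alternative sketch that avoids the coercion in the first place: append the mean row as a one-row data frame rather than a character vector, so Mean stays numeric throughout (this redoes the question's setup under that assumption, and relies on R >= 4.0, where Env stays character by default):
Data <- data.frame(Env, CV1, CV2, CV3, CV4, CV5)
Data$Mean <- rowMeans(Data[-1])
# rbind() a one-row data frame instead of c("Mean", ...), so the
# numeric columns are never coerced to character
Data <- rbind(Data, data.frame(Env = "Mean", as.list(colMeans(Data[-1]))))
Data$Env_index <- Data$Mean - Data$Mean[nrow(Data)]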

Error message in Treeclim

I'm attempting to use the treeclim package to analyze my tree-ring growth data and climate. I measured the ring widths in CooRecorder, grouped them into series in CDendro, and read them into RStudio using dplR's read.rwl function. However, I keep getting an error message reading
"Error in dcc(Plot92.crn, Site92PRISM, selection = -6:9, method = "response", :
Overlapping time span of chrono and climate records is smaller than number of parameters! Consider adapting the number of parameters to a maximum of 100."
I have 100 years of monthly climate data that looks like below:
# head(Site92PRISM)
year month ppt tmax tmin tmean vpdmin..hPa. vpdmax..hPa. site
1 1915 01 0.97 26.1 12.3 19.2 0.97 2.32 92
2 1915 02 1.20 31.5 16.2 23.9 1.03 3.30 92
3 1915 03 2.51 36.0 17.0 26.5 0.97 4.69 92
4 1915 04 3.45 48.9 26.3 37.6 1.14 8.13 92
5 1915 05 3.95 44.6 29.1 36.9 0.94 5.58 92
6 1915 06 6.64 51.0 31.5 41.3 1.04 7.93 92
And my chronology, made in dplR looks like below:
#head(Plot92.crn)
CAMstd samp.depth
1840 0.7180693 1
1841 0.3175528 1
1842 0.5729651 1
1843 0.9785082 1
1844 0.7676334 1
1845 0.3633687 1
Where am I going wrong? Both files contain data from 1915-2015.
I posted a similar question to the author in the Google group for the package (https://groups.google.com/forum/#!forum/treeclim).
What you need to make sure of is that the number of parameters (n_param) is less than or equal to the sample size of your dendrochronological data. By 'number of parameters' I mean the total number of columns in the climatic-variable matrices.
For instance, in the following analysis:
resp <- dcc(chrono = my_chrono,
climate = list(precip, temp),
boot = 'stationary')
You need to make sure that the following is TRUE :
length(unique(rownames(my_chrono))) >= (ncol(precip)-1) + (ncol(temp)-1)
ncol(precip)-1 and not ncol(precip) because the first column of the matrix is YEAR. Also note that in my example the years in my_chrono are the same years as in precip and temp, which doesn't have to be the case to run the function (it will automatically take the common years).
Finally, if the previous line of code gives you FALSE, you can reduce the number of parameters with the selection argument, like this:
resp <- dcc(chrono = my_chrono,
climate = list(precip, temp),
selection = .range(6:12,'prec') + .range(6:12, 'temp'),
var_names = c('prec', 'temp'),
boot = 'stationary')
Because the dcc function automatically takes all months from previous June to current September (i.e. .range(-6:9)), you may need to reduce that range.
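Applied to the question above, this arithmetic plausibly explains the error; the assumption here (mine, not the author's) is that dcc counts every non-year/month column of Site92PRISM, including the site identifier, as a climate variable:
n_months <- 16                     # previous June through current September
n_vars   <- ncol(Site92PRISM) - 2  # 7 columns besides year and month
n_months * n_vars                  # 112, which exceeds the ~100 overlapping years
Dropping the constant site column (an identifier, not a climate variable) would reduce the count to 16 * 6 = 96, under the limit; narrowing selection reduces it further.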

R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user that's transitioning to R and there's one Stata crutch that I find hard to give up. This is because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last section is the part that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06", and then returns the value of the variable varX06. Additionally, when I write `i', Stata returns the string "varX" and not the literal string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara
Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)), #categorical variable
varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function for each year, and unlist(...) converts the returned list into a vector.
Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
df[[i]] <- NA
df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
This method reorders your data but is a one-liner, which may or may not be better for you (assume d is your data frame):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would be something like the following:
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))
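A closer line-for-line analogue of the Stata macro loop, as a sketch: build each source column name with paste0() and index columns by string, which is the R counterpart of Stata's `i' substitution.
for (i in c("varX", "varY")) {
  d[[i]] <- NA_real_              # gen `i' = .
  for (yr in 6:8) {
    rows <- d$year == yr
    # replace `i' = `i'0`yr' if year == `yr'
    d[[i]][rows] <- d[[paste0(i, "0", yr)]][rows]
  }
}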
Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the desired result (which you can cbind to your_df if you like)
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
Benchmarking: A microbenchmark comparison of the three posted solutions appears below. Note, though, that the functions are passed without parentheses, so microbenchmark times only the symbol lookup (hence the nanosecond results); to time the actual computations, call the functions, i.e. microbenchmark(arvi1000(), jlhoward(), Thomas()).
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
> library(microbenchmark)
> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
expr min lq mean median uq max neval
arvi1000 37 39 43.73 40 42 380 100
jlhoward 38 40 46.35 41 42 377 100
Thomas 37 40 56.99 41 42 1590 100

Quickest way to find the maximum value from one column with multiple duplicates in others? [duplicate]

This question already has answers here:
How to get the maximum value by group
(5 answers)
In reality I have a very large data frame. One column contains an ID and another contains a value associated with that ID. However, each ID occurs multiple times with differing values, and I wish to record the maximum value for each ID while discarding the rest. Here is a replicable example using the quakes dataset in R:
data <- as.data.frame(quakes)
##Create output matrix
output <- matrix(NA, length(unique(data[,5])), 2)
colnames(output) <- c("Station ID", "Max Mag")
##Grab unique station IDs
uni <- unique(data[,5])
##Go through each station ID and record the maximum magnitude
for (i in 1:dim(output)[1])
{
sub.data <- data[which(data[,5]==uni[i]),]
##Put station ID in column 1
output[i,1] <- uni[i]
##Put biggest magnitude in column 2
output[i,2] <- max(sub.data[,4])
}
Considering that my real data frames have hundreds of thousands of rows, this is a slow process. Is there a quicker way to execute such a task?
Any help much appreciated!
library(plyr)
ddply(data, "stations", function(data){data[which.max(data$mag),]})
lat long depth mag stations
1 -27.21 182.43 55 4.6 10
2 -27.60 182.40 61 4.6 11
3 -16.24 168.02 53 4.7 12
4 -27.38 181.70 80 4.8 13
-----
You can also use:
> data2 <- data[order(data$mag,decreasing=T),]
> data2[!duplicated(data2$stations),]
lat long depth mag stations
152 -15.56 167.62 127 6.4 122
15 -20.70 169.92 139 6.1 94
17 -13.64 165.96 50 6.0 83
870 -12.23 167.02 242 6.0 132
1000 -21.59 170.56 165 6.0 119
558 -22.91 183.95 64 5.9 118
109 -22.55 185.90 42 5.7 76
151 -23.34 184.50 56 5.7 106
176 -32.22 180.20 216 5.7 90
275 -22.13 180.38 577 5.7 104
Also :
> library(data.table)
> data <- data.table(data)
> data[,.SD[which.max(mag)],by=stations]
stations lat long depth mag
1: 41 -23.46 180.11 539 5.0
2: 15 -13.40 166.90 228 4.8
3: 43 -26.00 184.10 42 5.4
4: 19 -19.70 186.20 47 4.8
5: 11 -27.60 182.40 61 4.6
---
98: 77 -21.19 181.58 490 5.0
99: 132 -12.23 167.02 242 6.0
100: 115 -17.85 181.44 589 5.6
101: 121 -20.25 184.75 107 5.6
102: 110 -19.33 186.16 44 5.4
data.table works better for large datasets.
You could try tapply, too (see the note after the code):
tapply(data$mag, data$stations, FUN=max)
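Note that tapply returns a named vector rather than a data frame; to get two-column output comparable to the other answers (with data as the original data frame), one option is:
res <- tapply(data$mag, data$stations, FUN = max)
data.frame(stations = as.integer(names(res)), mag = unname(res))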
You can try the 'dplyr' package as well, which is much faster and easier to use than 'plyr'. It implements what Hadley Wickham calls "a grammar of data manipulation", chaining operations together with the pipe operator, like so:
library(dplyr)
df <- as.data.frame(quakes)
df %>%
group_by(stations) %>%
summarise(Max = max(mag)) %>%
arrange(desc(Max)) %>%
head(5)
Source: local data frame [5 x 2]
stations Max
1 122 6.4
2 94 6.1
3 132 6.0
4 119 6.0
5 83 6.0

Curve fitting this data in R?

For a few days I've been working on this problem and I'm stuck ...
I have performed a number of Monte Carlo simulations in R which give an output y for each input x, and there is clearly some simple relationship between x and y, so I want to identify the formula and its parameters. But I can't seem to get a good overall fit for both the 'Low x' and 'High x' series, e.g. using a logarithm like this:
dat = data.frame(x=x, y=y)
fit = nls(y~a*log10(x)+b, data=dat, start=list(a=-0.8,b=-2), trace=TRUE)
I have also tried to fit (log10(x), 10^y) instead, which gives a good fit but the reverse transformation doesn't fit (x, y) very well.
Can anyone solve this?
Please explain how you found the solution.
Thanks!
EDIT:
Thanks for all the quick feedback!
I am not aware of a theoretical model for what I'm simulating so I have no basis for comparison. I simply don't know the true relationship between x and y. I'm not a statistician, by the way.
The underlying model is sort of a stochastic feedback-growth model. My objective is to determine the long-term growth rate g given some input x>0, so the output of the system grows exponentially at the rate 1+g in each iteration. The system has a stochastic production in each iteration based on the system's size; a fraction of this production is output and the rest is kept in the system, determined by another stochastic variable. From MC simulation I have found the growth rates of the system output to be log-normally distributed for every x I have tested, and the y's in the data series are the log-means of the growth rates g. As x goes towards infinity, g goes towards zero. As x goes towards zero, g goes towards infinity.
I would like a function that could calculate y from x. I actually only need a function for low x, say, in the range 0 to 10. I was able to fit that quite well by y=1.556 * x^-0.4 -3.58, but it didn't fit well for large x. I'd like a function that is general for all x>0. I have also tried Spacedman's poly fit (thanks!) but it doesn't fit well enough in the crucial range x=1 to 6.
Any ideas?
EDIT 2:
I have experimented some more, also with the detailed suggestions by Grothendieck (thanks!) After some consideration I decided that since I don't have a theoretical basis for choosing one function over another, and I'm most likely only interested in x-values between 1 and 6, I ought to use a simple function that fits well. So I just used y~a*x^b+c and made a note that it doesn't fit for high x. I may seek the community's help again when the first draft of the paper is finished. Perhaps one of you can spot the theoretical relationship between x and y once you see the Monte Carlo model.
Thanks again!
Low x series:
x y
1 0.2 -0.7031864
2 0.3 -1.0533648
3 0.4 -1.3019655
4 0.5 -1.4919278
5 0.6 -1.6369545
6 0.7 -1.7477481
7 0.8 -1.8497117
8 0.9 -1.9300209
9 1.0 -2.0036842
10 1.1 -2.0659970
11 1.2 -2.1224324
12 1.3 -2.1693986
13 1.4 -2.2162889
14 1.5 -2.2548485
15 1.6 -2.2953162
16 1.7 -2.3249750
17 1.8 -2.3570141
18 1.9 -2.3872684
19 2.0 -2.4133978
20 2.1 -2.4359624
21 2.2 -2.4597122
22 2.3 -2.4818787
23 2.4 -2.5019371
24 2.5 -2.5173966
25 2.6 -2.5378936
26 2.7 -2.5549524
27 2.8 -2.5677939
28 2.9 -2.5865958
29 3.0 -2.5952558
30 3.1 -2.6120607
31 3.2 -2.6216831
32 3.3 -2.6370452
33 3.4 -2.6474608
34 3.5 -2.6576862
35 3.6 -2.6655606
36 3.7 -2.6763866
37 3.8 -2.6881303
38 3.9 -2.6932310
39 4.0 -2.7073198
40 4.1 -2.7165035
41 4.2 -2.7204063
42 4.3 -2.7278532
43 4.4 -2.7321731
44 4.5 -2.7444773
45 4.6 -2.7490365
46 4.7 -2.7554178
47 4.8 -2.7611471
48 4.9 -2.7719188
49 5.0 -2.7739299
50 5.1 -2.7807113
51 5.2 -2.7870781
52 5.3 -2.7950429
53 5.4 -2.7975677
54 5.5 -2.7990999
55 5.6 -2.8095955
56 5.7 -2.8142453
57 5.8 -2.8162046
58 5.9 -2.8240594
59 6.0 -2.8272394
60 6.1 -2.8338866
61 6.2 -2.8382038
62 6.3 -2.8401935
63 6.4 -2.8444915
64 6.5 -2.8448382
65 6.6 -2.8512086
66 6.7 -2.8550240
67 6.8 -2.8592950
68 6.9 -2.8622220
69 7.0 -2.8660817
70 7.1 -2.8710430
71 7.2 -2.8736998
72 7.3 -2.8764701
73 7.4 -2.8818748
74 7.5 -2.8832696
75 7.6 -2.8833351
76 7.7 -2.8891867
77 7.8 -2.8926849
78 7.9 -2.8944987
79 8.0 -2.8996780
80 8.1 -2.9011012
81 8.2 -2.9053911
82 8.3 -2.9063661
83 8.4 -2.9092228
84 8.5 -2.9135426
85 8.6 -2.9101730
86 8.7 -2.9186316
87 8.8 -2.9199631
88 8.9 -2.9199856
89 9.0 -2.9239220
90 9.1 -2.9240167
91 9.2 -2.9284608
92 9.3 -2.9294951
93 9.4 -2.9310985
94 9.5 -2.9352370
95 9.6 -2.9403694
96 9.7 -2.9395336
97 9.8 -2.9404153
98 9.9 -2.9437564
99 10.0 -2.9452175
High x series:
x y
1 2.000000e-01 -0.701301
2 2.517851e-01 -0.907446
3 3.169786e-01 -1.104863
4 3.990525e-01 -1.304556
5 5.023773e-01 -1.496033
6 6.324555e-01 -1.674629
7 7.962143e-01 -1.842118
8 1.002374e+00 -1.998864
9 1.261915e+00 -2.153993
10 1.588656e+00 -2.287607
11 2.000000e+00 -2.415137
12 2.517851e+00 -2.522978
13 3.169786e+00 -2.621386
14 3.990525e+00 -2.701105
15 5.023773e+00 -2.778751
16 6.324555e+00 -2.841699
17 7.962143e+00 -2.900664
18 1.002374e+01 -2.947035
19 1.261915e+01 -2.993301
20 1.588656e+01 -3.033517
21 2.000000e+01 -3.072003
22 2.517851e+01 -3.102536
23 3.169786e+01 -3.138539
24 3.990525e+01 -3.167577
25 5.023773e+01 -3.200739
26 6.324555e+01 -3.233111
27 7.962143e+01 -3.259738
28 1.002374e+02 -3.291657
29 1.261915e+02 -3.324449
30 1.588656e+02 -3.349988
31 2.000000e+02 -3.380031
32 2.517851e+02 -3.405850
33 3.169786e+02 -3.438225
34 3.990525e+02 -3.467420
35 5.023773e+02 -3.496026
36 6.324555e+02 -3.531125
37 7.962143e+02 -3.558215
38 1.002374e+03 -3.587526
39 1.261915e+03 -3.616800
40 1.588656e+03 -3.648891
41 2.000000e+03 -3.684342
42 2.517851e+03 -3.716174
43 3.169786e+03 -3.752631
44 3.990525e+03 -3.786956
45 5.023773e+03 -3.819529
46 6.324555e+03 -3.857214
47 7.962143e+03 -3.899199
48 1.002374e+04 -3.937206
49 1.261915e+04 -3.968795
50 1.588656e+04 -4.015991
51 2.000000e+04 -4.055811
52 2.517851e+04 -4.098894
53 3.169786e+04 -4.135608
54 3.990525e+04 -4.190248
55 5.023773e+04 -4.237104
56 6.324555e+04 -4.286103
57 7.962143e+04 -4.332090
58 1.002374e+05 -4.392748
59 1.261915e+05 -4.446233
60 1.588656e+05 -4.497845
61 2.000000e+05 -4.568541
62 2.517851e+05 -4.628460
63 3.169786e+05 -4.686546
64 3.990525e+05 -4.759202
65 5.023773e+05 -4.826938
66 6.324555e+05 -4.912130
67 7.962143e+05 -4.985855
68 1.002374e+06 -5.070668
69 1.261915e+06 -5.143341
70 1.588656e+06 -5.261585
71 2.000000e+06 -5.343636
72 2.517851e+06 -5.447189
73 3.169786e+06 -5.559962
74 3.990525e+06 -5.683828
75 5.023773e+06 -5.799319
76 6.324555e+06 -5.929599
77 7.962143e+06 -6.065907
78 1.002374e+07 -6.200967
79 1.261915e+07 -6.361633
80 1.588656e+07 -6.509538
81 2.000000e+07 -6.682960
82 2.517851e+07 -6.887793
83 3.169786e+07 -7.026138
84 3.990525e+07 -7.227990
85 5.023773e+07 -7.413960
86 6.324555e+07 -7.620247
87 7.962143e+07 -7.815754
88 1.002374e+08 -8.020447
89 1.261915e+08 -8.229911
90 1.588656e+08 -8.447927
91 2.000000e+08 -8.665613
Without an idea of the underlying process you may as well just fit a polynomial with as many terms as you like. You don't seem to be testing a hypothesis (e.g., that gravitational strength follows an inverse-square law with distance), so you can fish all you like for functional forms; the data are unlikely to tell you which one is 'right'.
So if I read your data into a data frame with x and y components I can do:
data$lx=log(data$x)
plot(data$lx,data$y) # needs at least a cubic polynomial
m1 = lm(y~poly(lx,3),data=data) # fit a cubic
points(data$lx,fitted(m1),pch=19)
and the fitted points are pretty close. Change the polynomial degree from 3 to 7 and the points are identical. Does that mean that your Y values are really coming from a 7-degree polynomial of your X values? No. But you've got a curve that goes through the points.
At this scale you may as well just join adjacent points with straight lines, your plot is so smooth. But without an underlying theory of why Y depends on X (like an inverse-square law, or exponential growth, or something), all you are doing is joining the dots, and there are infinitely many ways of doing that.
Regressing x/y vs. x
Plotting y vs. x for the low data and playing around a bit, it seems that x/y is approximately linear in x, so try regressing x/y against x, which gives us a relationship based on only two parameters:
y = x / (a + b * x)
where a and b are the regression coefficients.
> lm(x / y ~ x, lo.data)
Call:
lm(formula = x/y ~ x, data = lo.data)
Coefficients:
(Intercept) x
-0.1877 -0.3216
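To see the implied curve on the original scale, plug the coefficients back into y = x / (a + b * x); a quick sketch, with lo.data being the low-x series read into a data frame:
fit <- lm(x / y ~ x, lo.data)
a <- coef(fit)[[1]]   # intercept
b <- coef(fit)[[2]]   # slope
plot(y ~ x, lo.data)
curve(x / (a + b * x), add = TRUE, col = "red")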
MM.2
The above can be transformed into the MM.2 model in the drc R package. As seen below, this model has a high R2. We also calculate the AIC, which we can use to compare it with other models (lower is better):
> library(drc)
> fm.mm2 <- drm(y ~ x, data = lo.data, fct = MM.2())
> cor(fitted(fm.mm2), lo.data$y)^2
[1] 0.9986303
> AIC(fm.mm2)
[1] -535.7969
CRS.6
This suggests trying a few other drc models; of the ones we tried, CRS.6 has a particularly low AIC and seems to fit well visually:
> fm.crs6 <- drm(y ~ x, data = lo.data, fct = CRS.6())
> AIC(fm.crs6)
[1] -942.7866
> plot(fm.crs6) # visual check of the fit (figure not reproduced here)
This gives us a range of models to choose from: the 2-parameter MM.2 model, which is not as good a fit (according to AIC) as CRS.6 but still fits quite well and has the advantage of only two parameters, or the 6-parameter CRS.6 model with its superior AIC. Note that AIC already penalizes models for having more parameters, so a better AIC is not a consequence of having more of them.
Other
If it's believed that both the low and high series should have the same model form, then finding a single form that fits both well might be used as another criterion for picking one. In addition to the drc models, the yield-density models in (2.1), (2.2), (2.3) and (2.4) of Akbar et al., IRJFE, 2010 look similar to the MM.2 model and could also be tried.
UPDATED: reworked this around the drc package.
