I would like to create a wrapper for the Levenberg-Marquardt nonlinear least-squares function nls.lm (from the minpack.lm package), similar to nls2 (from the nls2 package), to give a brute-force method for evaluating the fit of a model to observed data.
The idea is to create a range of starting-value combinations and either:
1. pass these to a function, then compare the function output to the observed data to compute an R^2 value for each starting-value combination, and run the nls.lm fit from the best of them; or
2. run nls.lm on all combinations and select the best returned fit.
I wanted to do this without looping, and after inspiration from here I am trying to use nested dataframes, with one column for the parameter input list, one for the values returned by my function, one for the R^2 values, and one for the best-fit models, something like:
df
# start_val fun_out R^2
# 1 {a=2,b=2} {22,24,26...} 0.8
# 2 {a=3,b=5} {35,38,41...} 0.6
This is the code I have so far:
require(dplyr); require(tidyr)
foo <- function(x,a,b) a*x^2+b # function I am fitting
x <- 1:10 # independent variable
y_obs <- foo(x,1.5,2.5) + rnorm(length(x),0,10) # observed data (dependent variable)
start_range <- data.frame(a=c(1,2),b=c(2,3)) # range of allowed starting points for fitting
reps <- 2 # number of starting points to generate
# Create a data frame of starting points
df <- as.data.frame(sapply(start_range, function(x) runif(reps, min = x[[1]], max = x[[2]]))) %>%
  mutate(id = seq_len(reps)) %>% # fudge to make nest behave as I want
  nest(1:ncol(start_range)) %>%
  mutate(data = as.list(data)) %>%
  as.data.frame()
df
# id data
# 1 1 1.316356, 2.662923
# 2 2 1.059356, 2.723081
I get stuck now trying to pass the parameters in data to the function foo(). I've tried using do.call(), and even with constant parameters the following error appears:
mutate(df,y=do.call(foo,list(x,1,2)))
# Error: wrong result size (5), expected 2 or 1
Is there a way to create columns of a dataframe which contain lists directly without using nest()?
Also, when trying to create the list to pass to do.call() using the dataframe columns, how do you create a list where the first element is the vector x, the second is the parameter a, and the third is the parameter b? The following splits the list down the column:
mutate(df,my_list=list(x,data))
# id data my_list
# 1 1 1.316356, 2.662923 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2 2 1.059356, 2.723081 1.316356, 2.662923, 1.059356, 2.723081
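One possible sketch of both sub-questions (assuming the df built above and the purrr package; df_direct, df2 and my_list are hypothetical names): list columns can be assigned directly without nest(), and map() can build a per-row list holding the vector x and the parameters a and b.
library(purrr)
# a list column assigned directly, one {a, b} list per row
df_direct <- data.frame(id = seq_len(reps))
df_direct$data <- lapply(seq_len(reps), function(i)
  list(a = runif(1, start_range$a[1], start_range$a[2]),
       b = runif(1, start_range$b[1], start_range$b[2])))
# a per-row list whose elements are x, a and b (one for each call of foo())
df2 <- mutate(df, my_list = map(data, ~ list(x, .x$a, .x$b)))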
Running nls2 with algorithm = "random-search", all = TRUE, and the specified maxiter will evaluate foo at maxiter random points and return starting_fits, the fits at those points. It consists of a set of "nls"-class objects evaluated at each of the randomly chosen starting values. It does not run an optimization from each of these starting values but just returns the "nls" object at each; that is, nls is not run. Now, for each starting fit, run nlsLM, giving fits, a list of nlsLM fits; from those, build data (a data frame with one row per run) and show the row(s) with the least RSS.
If we only want to pick the best starting value and run nlsLM once from it, use the alternate code near the end.
library(nls2)
library(minpack.lm) # for nlsLM
fo <- y_obs ~ foo(x, a, b)
starting_fits <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps), all = TRUE)
fits <- lapply(starting_fits, function(fit) nlsLM(fo, start = coef(fit)))
data <- data.frame(RSS = sapply(fits, deviance), t(sapply(fits, coef)),
start = t(sapply(starting_fits, coef)))
# data$fits <- fits # optional to store each row's fitted object in that row
subset(data, RSS == min(RSS)) # minimum(s)
giving:
RSS a b start.a start.b
2 706.3956 1.396616 7.226525 1.681819 2.768374
R squared is used for linear regression. It is not valid for nonlinear regression. Residual sum of squares (RSS) is shown above instead.
Alternately, if you just want to pick out the best starting value and run nlsLM on that, omit the all = TRUE argument from the nls2 call and do this. If you need the coefficients and RSS for later code, try coef(fit) and deviance(fit).
starting_fit <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps))
fit <- nlsLM(fo, start = coef(starting_fit))
Note 1: If you are getting errors from nlsLM, try replacing nlsLM(...) with try(nlsLM(...)). This will issue error messages (use try(..., silent = TRUE) if you don't want them) but will not stop processing.
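For instance, a sketch of that pattern applied to the fits above (assuming failed fits are simply dropped):
fits <- lapply(starting_fits, function(fit)
  try(nlsLM(fo, start = coef(fit)), silent = TRUE))
fits <- Filter(function(f) !inherits(f, "try-error"), fits) # keep only successful fits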
Note 2: I assume that the foo shown in the question is just an example and the real function is more complex. The foo shown is linear in the coefficients, so one could use lm for it; no nonlinear optimization is needed.
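For example, since foo(x, a, b) = a*x^2 + b is linear in a and b, a sketch of the lm route:
lm(y_obs ~ I(x^2)) # the intercept estimates b; the I(x^2) coefficient estimates a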
An approach like this perhaps?
library(dplyr)
library(purrr)
foo2 <- function(x,data) data$a*x^2+data$b
r2 <- function(e, o) 1 - sum((o - e)^2) / sum((o - mean(o))^2) # R^2 of expected e vs. observed o
df <- as.data.frame(sapply(start_range, function(x) runif(reps, min = x[[1]], max = x[[2]]))) %>%
  mutate(id = seq_len(reps)) %>% # fudge to make nest behave as I want
  nest(1:ncol(start_range))
df %>%
  mutate(fun_out = map(data, foo2, x = x),
         R2 = map(fun_out, r2, o = y_obs))
Result:
# A tibble: 3 x 4
id data fun_out R2
<int> <list> <list> <list>
1 1 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
2 2 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
3 3 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
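From there, one possible continuation (a sketch, assuming minpack.lm and the r2 above; map_dbl gives a plain numeric column that is easy to filter on):
library(minpack.lm)
best <- df %>%
  mutate(fun_out = map(data, foo2, x = x),
         R2 = map_dbl(fun_out, r2, o = y_obs)) %>%
  filter(R2 == max(R2)) # row(s) with the best starting values
fit <- nlsLM(y_obs ~ foo(x, a, b), start = as.list(best$data[[1]]))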
I have a dataset of 104 samples (2 classes) and 182 variables, and I am to carry out LDA on the dataset. My strategy involves first carrying out PCA in order to reduce dimensionality; this leaves me with 104 PCs. Now, what I want to do is carry out LDA on the PCs, first with the number of PCs equal to 1, storing the misclassification rates in a data frame object, then for 2, 3, and so on up to ~50 PCs; the exact number is not important. I have created a for-loop to try to solve this, but I end up with a data frame whose only row is for the final value I chose for my PCs. Here is the code I have so far:
# required packages
library(MASS)
library(class)
library(tidyverse)
# reading in and cleaning data
og_data <- read.csv("data.csv")
og_data <- og_data[, -1]
og_data$tumour <- unclass(as.factor(og_data$tumour))
# standardizing
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
k=10 # this is for the for-loop
grouping <- c(rep(1, 62), rep(2, 42)) # a vector denoting the true class of the samples
pca <- prcomp(st_data[, -1])
df_misclassification <- tibble(i=as.numeric(),
misclassification_rate_1=as.numeric(),
misclassification_rate_2=as.numeric())
for (i in k){
a <- as.data.frame(pca$x[, 1:i])
b <- lda(a, grouping=grouping, CV=TRUE)
c <- table(list(predicted=b$class, observed=grouping)) # confusion matrix
d <- t(as.data.frame(diag(c) / rowSums(c))) # misclassification rate for each class
df_misclassification <- df_misclassification %>%
add_row(i=i,
misclassification_rate_1=d[, 1],
misclassification_rate_2=d[, 2])
}
Running the above for k=10 leaves me with the following data frame:
# A tibble: 1 x 3
i misclassification_rate_1 misclassification_rate_2
<dbl> <dbl> <dbl>
1 10 0.952 0.951
I would like the table to have 10 rows, one for each number of PCs used. There is some overwriting in the for-loop but I have no idea how to fix this. Any help would be much appreciated. Thank you.
My for-loop was wrong. It should have been for (i in 1:k).
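For reference, the corrected loop header (the body stays exactly as in the question):
for (i in 1:k) { # 'for (i in k)' iterates over the single value 10, so the body ran only once
  # ... loop body unchanged ...
}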
I wish to collapse the output of spdep::moran.test into a single string that is regularly structured with variable names and values, can be saved as a text value in a dataframe, is human-readable in the RStudio console, and can be copy-pasted into MS Word to form a table without too many additional manual adjustments. (I have multiple tests to run and wish to copy-paste their output in one go.)
In the course of looking for a solution, I stumbled upon the report package, which claims to turn an htest-class object into a "report" (I don't know what this looks like in R) and thus may address my goal to some extent. However, the report function doesn't work on the output of moran.test, as shown in the code below.
I am exploring, and there are probably alternative, more straightforward approaches which I haven't considered. Thus my question is twofold: 1. solve the immediate issue with report, and/or 2. provide an alternative, more efficient solution to my goal.
The data preparation below is drawn from https://mgimond.github.io/simple_moransI_example.
library(sf)
library(spdep)
library(report)
# Load shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
# Prevent error "old-style crs object detected; please recreate object with a recent sf::st_crs()"
st_crs(s) <- st_crs(s)
# Define neighboring polygons
nb <- poly2nb(s, queen=TRUE)
# Assign weights to the neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run Moran’s I test
(mt <- moran.test(s$Income,lw, alternative="greater"))
#Moran I test under randomisation
#data: s$Income
#weights: lw
#Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
#alternative hypothesis: greater
#sample estimates:
# Moran I statistic Expectation Variance
#0.68279551 -0.04000000 0.01525284
# Moran’s I test is of class htest required by function report::report
class(mt)
#[1] "htest"
# Function report::report returns an error
report(mt)
#Error in `$<-.data.frame`(`*tmp*`, "tau", value = c(`Moran I statistic` = 0.68279551202875, :
# replacement has 3 rows, data has 1
The desired output could look something like:
"P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
The point is the names and values, not the separators. This is based on my current assumptions of how to approach this task, which are probably imperfect.
You might want to take a look at the broom package:
broom::tidy(mt)
#> # A tibble: 1 x 7
#> estimate1 estimate2 estimate3 statistic p.value method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.683 -0.04 0.0153 5.85 2.42e-9 Moran I test u… greater
library(tidyverse)
mt %>%
broom::tidy() %>%
as.list() %>%
enframe() %>%
mutate(value = value %>% as.character()) %>% unite(data, sep = "=") %>%
pull(data) %>%
paste0(collapse = ", ")
#> [1] "estimate1=0.68279551202875, estimate2=-0.04, estimate3=0.0152528397222445, statistic=c(`Moran I statistic standard deviate` = 5.85248209823413), p.value=2.42145194022024e-09, method=Moran I test under randomisation, alternative=greater"
You can make a table and create a csv file from multiple tests (e.g. having multiple objects of class htest like mt, mt2 and mt3):
list(mt, mt2, mt3) %>% map(broom::tidy) %>% bind_rows() %>% write_csv("tests.csv")
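If the exact single-string format from the question is wanted, a direct sketch without broom also works (fmt_moran is a hypothetical helper; moran.test stores its three sample estimates in mt$estimate):
fmt_moran <- function(ht) {
  est <- ht$estimate # c(Moran I statistic, Expectation, Variance)
  sprintf("P-value %.3e | Statistic %.8f | Expectation %.8f | Variance %.8f",
          ht$p.value, est[[1]], est[[2]], est[[3]])
}
fmt_moran(mt)
#> [1] "P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"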
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I generate simulated realisations of this model and obtain the x, y, and marks attributes of the simulations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realisations of the same experiment.
sim <- simulate(model) allows me to generate simulated realisations of this model, and plot(sim, axes = TRUE) lets me plot them. sim itself is a hyperframe object:
> sim
Hyperframe:
Sim1
1 (ppp)
2 (ppp)
3 (ppp)
How can I access the values (x, y, and marks) in this hyperframe? My goal is to generate a large number of independent realisations of my model and to use the simulated values for another task. Is there a practical way to obtain, retrieve, and save these values?
Since you say you want to simulate this many times, I have shown the code here with two simulations (rather than one, as in the question):
library(spatstat)
data <- list(amacrine, amacrine, amacrine)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
sim <- simulate(model, nsim = 2)
#> Generating simulated realisations of 3 models..
#> 1, 2, 3.
Now sim is a hyperframe with 2 columns (one for each simulation). Each
column is a list of 3 point patterns. To get the three sets of coordinates
and marks for the first simulation use as.data.frame on each point pattern:
co1 <- lapply(sim$Sim1, as.data.frame)
Then co1 is a list of length 3, and we can print out the first few
coordinates with the head() command, e.g. the coordinates of the third
point pattern:
head(co1[[3]])
#> x y marks
#> 1 0.4942587 0.7889985 off
#> 2 0.6987270 0.7637359 on
#> 3 0.3926415 0.6819965 on
#> 4 0.7982686 0.9060733 off
#> 5 1.3507722 0.9731363 on
#> 6 0.6450985 0.6924126 on
We can extract the coordinates and marks for each simulation by another lapply that
runs over every simulation (in this case 2):
co <- lapply(sim, function(x) lapply(x, as.data.frame))
Now co is a list with 2 elements, and each element is a list of 3 sets of
coordinates:
length(co)
#> [1] 2
length(co[[2]])
#> [1] 3
head(co[[2]][[3]])
#> x y marks
#> 1 0.1660580 0.04180501 on
#> 2 0.7840025 0.71727782 on
#> 3 1.2011733 0.17109112 on
#> 4 1.0429867 0.49284639 on
#> 5 1.1411869 0.86711072 off
#> 6 1.0375942 0.06427601 on
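To retrieve and save everything in one go, a sketch that binds all coordinates into a single data frame tagged with simulation and pattern indices (the file name is arbitrary):
all_df <- do.call(rbind, lapply(seq_along(co), function(i)
  do.call(rbind, lapply(seq_along(co[[i]]), function(j)
    cbind(sim = i, pattern = j, co[[i]][[j]])))))
saveRDS(all_df, "simulations.rds")
# or: write.csv(all_df, "simulations.csv", row.names = FALSE)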
I want to create a predict function which predicts which cluster an observation belongs to.
data(iris)
mydata=iris
m=mydata[1:4]
train=head(m,100)
xNew=head(m,10)
rownames(train)<-1:nrow(train)
norm_eucl=function(train)
train/apply(train,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(train)
result=kmeans(m_norm,3,30)
predict.kmean <- function(cluster, newdata)
{
simMat <- m_norm(rbind(cluster, newdata),
sel=(1:nrow(newdata)) + nrow(cluster))[1:nrow(cluster), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.kmean(m_norm, x[result$cluster, ], xNew)
After running this I get the error:
Error in predict.kmean(m_norm, x[result$cluster, ], xNew) :
unused argument (xNew)
I understand that I am building the function incorrectly, since I'm just learning to do this, but I can't see where exactly the problem is.
Indeed, I want to adapt the similar function from apcluster (I had seen a similar topic, but for apcluster):
predict.apcluster <- function(s, exemplars, newdata)
{
simMat <- s(rbind(exemplars, newdata),
sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
How can I do this?
Rather than trying to replicate something, let's come up with our own function. For a given vector x, we want to assign a cluster using some prior k-means output. Given how the k-means algorithm works, what we want is to find which cluster's center is closest to x. That can be done as
predict.kmeans <- function(x, newdata)
apply(newdata, 1, function(r) which.min(colSums((t(x$centers) - r)^2)))
That is, we go over newdata row by row and compute the corresponding row's distance to each of the centers and find the minimal one. Then, e.g.,
head(predict(result, train / sqrt(rowSums(train^2))), 3)
# 1 2 3
# 2 2 2
all.equal(predict(result, train / sqrt(rowSums(train^2))), result$cluster)
# [1] TRUE
which confirms that our predicting function assigned all the same clusters to the training observations. Then also
predict(result, xNew / sqrt(rowSums(xNew^2)))
# 1 2 3 4 5 6 7 8 9 10
# 2 2 2 2 2 2 2 2 2 2
Notice also that I'm calling simply predict rather than predict.kmeans. That is because result is of class kmeans and the right method is automatically chosen. Also notice how I normalize the data in a vectorized manner, without using apply.
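One caveat worth a sketch: new data must receive the same row-wise normalization as the training data before predicting (normalize_rows is a hypothetical helper):
normalize_rows <- function(m) m / sqrt(rowSums(m^2))
predict(result, normalize_rows(xNew)) # equivalent to the xNew call above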
I am trying to learn how to make my own functions with formula objects. I am mostly confused about how to parse them.
Let's say I have the following:
gigl <- function(formula, data, family = gaussian())
Using the R dataset BOD
> BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
It is easy to fit a linear model with lm
> lm(Time ~ demand, data = BOD)
Call:
lm(formula = Time ~ demand)
Coefficients:
(Intercept) demand
-1.8905 0.3746
How can I make my own function by parsing a formula?
For example, if I had
>gigl(Time~demand, data=BOD)
How can I parse the components? I don't really care what the function gigl does. I just want to know how to work with the formula.
Edit
Due to questions about a concrete example, let's try the following: say that I want to use the inputs from a formula to build a cor() matrix. So from the above I would see the result of cor(Time, demand), and if more variables were added I would see the complete cor() of all inputs.
Here's a function that takes a formula and transforms it into a call to the cor() function, then evaluates that call in an environment consisting of the data:
f <- function(form,data) {
form[[1]] <- quote(cor)
eval(form,data)
}
f(demand~Time,BOD)
## [1] 0.8030693
The rlang package can make it easier to work with formulas in the tidyeval paradigm. For example you can do
library(rlang)
mycor <- function(form, data) {
v1 <- f_lhs(form)
v2 <- f_rhs(form)
d <- enquo(data)
qq <- expr(with(!!d, cor(!!v1, !!v2)))
eval_tidy(qq)
}
mycor(disp~drat, mtcars)
# [1] -0.7102139
with(mtcars, cor(disp, drat))
# [1] -0.7102139
The f_lhs()/f_rhs() functions help to extract the left-hand side and right-hand side of the formula respectively. Then we can use expr() and the !! operator to re-assemble those pieces into a new function call, and finally evaluate that new call with eval_tidy().
Not sure what you're trying to do, but you could take a look at the terms of a formula:
fm <- formula(Time ~ demand)
tms <- terms(fm)
tms
#Time ~ demand
#attr(,"variables")
#list(Time, demand)
#attr(,"factors")
# demand
#Time 0
#demand 1
#attr(,"term.labels")
#[1] "demand"
#attr(,"order")
#[1] 1
#attr(,"intercept")
#[1] 1
#attr(,"response")
#[1] 1
#attr(,".Environment")
#<environment: R_GlobalEnv>
From tms you could extract relevant entries and attributes. For example,
attr(tms, "variables");
#list(Time, demand)
This assumes that two variables are used (expressions are not allowed). Assuming that the two variables are in the formula and that they can appear on the right, the left, or both, all.vars (which gets the variable names) and get_all_vars (which gets the content) can be useful:
gig1 <- function(formula, data) cor(data[all.vars(formula)])
gig1(demand ~ Time, BOD)
giving:
demand Time
demand 1.0000000 0.8030693
Time 0.8030693 1.0000000
or
gig2 <- function(formula, data) cor(get_all_vars(formula, data))
gig2(demand ~ Time, BOD)
giving:
demand Time
demand 1.0000000 0.8030693
Time 0.8030693 1.0000000
You might want to look at the source of lm and the Formula package for more ideas.
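For completeness, a sketch of the model.frame() idiom that lm itself relies on (gig3 is a hypothetical name): model.frame() evaluates the formula's variables in data and returns them as a data frame that cor() can consume directly.
gig3 <- function(formula, data) cor(model.frame(formula, data))
gig3(demand ~ Time, BOD)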