R using rgp symbolicRegression for equation discovery

I am trying to use the rgp package for equation discovery:
library(rgp)
x = c (1:100)
y = 5*x+3*sin(x)+4*x^2+75
data1 = data.frame(x,y)
newFuncSet <- functionSet("+","-","*")
result1 <- symbolicRegression(y ~ x, data = data1, functionSet = newFuncSet, stopCondition = makeStepsStopCondition(2000))
plot(data1$y, col=1, type="l"); points(predict(result1, newdata = data1), col=2, type="l")
model <- result1$population[[which.min(result1$fitnessValues)]]
However, I keep getting an error message. I would be grateful if you could point out the errors I have made above.
Useful references (it would be great to have this in R):
https://www.researchgate.net/publication/237050734_Improving_Genetic_Programming_Based_Symbolic_Regression_Using_Deterministic_Machine_Learning

The problem is that R treats the x vector as integer, which causes type problems further on. Try converting x to numeric explicitly:
x <- as.numeric(1:100)
It worked for me.
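To see the type difference, here is a minimal base-R sketch. It does not call rgp at all; it only rebuilds the data frame from the question with a numeric x so the column types can be inspected before they are passed to symbolicRegression().

```r
# 1:100 is stored as integer; as.numeric() converts it to double.
x_int <- 1:100
x_num <- as.numeric(1:100)

class(x_int)  # "integer"
class(x_num)  # "numeric"

# Rebuild the data from the question with the numeric x.
y <- 5 * x_num + 3 * sin(x_num) + 4 * x_num^2 + 75
data1 <- data.frame(x = x_num, y = y)

# Both columns are now double, which is what the fix relies on.
sapply(data1, typeof)
```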

Related

An R function cannot work in local environment of other functions

I use the MatchIt package for propensity score matching. It can generate a matched dataset after matching using the get_matches() function.
However, if I do not run the get_matches() function in the global environment but include it in any other function, the matched data cannot be found in the local environment. (These prove to be misleading information. There is nothing wrong with MatchIt's output. Answer by Noah explains my question better.)
For producing my data
dataGen <- function(b0,b1,n = 2000,cor = 0){
# covariate
sigma <- matrix(rep(cor,9),3,3)
diag(sigma) <- rep(1,3)
cov <- MASS::mvrnorm(n, rep(0,3), sigma)
# error
error <- rnorm(n,0,sqrt(18))
# treatment variable
logit <- b0+b1*cov[,1]+0.3*cov[,2]+cov[,3]
p <- 1/(1+exp(-logit))
treat <- rbinom(n,1,p)
# outcome variable
y <- error+treat+cov[,1]+cov[,2]
data <- as.data.frame(cbind(cov,treat,y))
return(data)
}
set.seed(1)
data <- dataGen(b0=-0.92, b1=0.8, 900)
The following works; est.m.WLS() can use m.data.
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2){
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the next version does not work. It reports
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
m.data <- MatchIt::get_matches(m.out,data=data)
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
est.m.WLS(m.out,fm2)
I want to run parallel loops using the groundhog library for simulation purposes, and the get_matches() function also does not work inside a foreach() %dopar% {...} environment.
res=foreach(s = 1:7,.combine="rbind")%dopar%{
m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using a for() loop directly does not run into any problem since it works in the global environment, but it is too slow. I really hope to run the thousand simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem arises when lmtest::coeftest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested to be fixed. The problem is that expand.model.frame(), a function that searches for the dataset that the variables supplied to cluster could be in, looks for the data in the environment of the model formula (here the global environment), but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.
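The lookup failure described above can be reproduced with base R alone. The sketch below calls stats::expand.model.frame() directly, without MatchIt, lmtest or sandwich. The names fit_inside, local_df and g are made up for illustration. It shows the data lookup failing when the data frame lives only inside a function, and the same kind of workaround as in the answer: pass the column itself rather than a formula.

```r
# The formula is created outside the function, so its environment
# is not the function's local frame.
fm <- y ~ x

fit_inside <- function(fm) {
  # local_df (hypothetical) exists only inside this function call.
  local_df <- data.frame(y = rnorm(20), x = rnorm(20),
                         g = rep(1:4, each = 5))
  model <- lm(fm, data = local_df)

  # expand.model.frame() re-evaluates the `data` argument of the lm() call
  # in environment(formula(model)), where local_df does not exist.
  failed <- tryCatch(expand.model.frame(model, ~ g),
                     error = function(e) conditionMessage(e))

  # Workaround in the spirit of the answer: hand over the column itself,
  # as in cluster = m.data["subclass"].
  cluster <- local_df["g"]

  list(failed = failed, cluster = cluster)
}

set.seed(1)
res <- fit_inside(fm)
res$failed         # error message: the data frame could not be found
nrow(res$cluster)  # 20
```

Applied to the question, this means keeping get_matches() inside est.m.WLS() and changing only the cluster argument of the coeftest() call.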

R - object formula not found inside a function

I created a function to roll apply an exponentially weighted least-squares using the dynlm package. Here is the code:
residualization<-function(df,formula_ref, size){
rollapply(df,
width=size,
FUN = ewma_regression,
formula_ref = formula_ref,
by.column=FALSE, align="right")
}
ewma_regression<-function(x,formula_ref) {
n<-nrow(x)
weights <- 0.06*0.94^(seq(n-1,0,by=-1))
t <- dynlm(formula=as.formula(formula_ref), data = as.zoo(x),weights = weights)
return(t$residuals)
}
However when I run this code on my dataset, it shows the problem:
Error in as.formula(formula_ref) : object 'formula_ref' not found
When I try to debug it, inside the environment of the function, the variable formula_ref does exist! However even inside the debug mode, I cannot run the dynlm regression even if I try to set formula_ref to a temporary formula object.
Can anyone help me out? I know it might be a silly mistake, but I can't find it!
A reproducible example would be:
dates<-seq.Date(from=as.Date("2010-01-01"), length.out = 1000, by="day")
teste1<-data.frame(x=rnorm(1000),y=rnorm(1000)*5)
teste2<-xts(teste1,order.by = dates)
formula.test<- y ~ x + I(x^2)
teste3<-residualization(df=teste2,formula_ref = formula.test, size=100)
You can just wrap y ~ x + I(x^2) in quotation marks ("y ~ x + I(x^2)").
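A minimal sketch of what that looks like. It only shows that as.formula() accepts a character string, so the as.formula(formula_ref) call inside ewma_regression() can stay unchanged. It uses plain lm() on made-up toy data rather than dynlm, so it does not reproduce the original error.

```r
# Formula passed around as a string instead of a formula object.
formula_chr <- "y ~ x + I(x^2)"

# as.formula() turns the string into a regular formula.
f <- as.formula(formula_chr)
class(f)     # "formula"
all.vars(f)  # "y" "x"

# Toy data (hypothetical) to show the converted formula is usable in a fit.
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 2 * toy$x + toy$x^2 + rnorm(50)
fit <- lm(f, data = toy)
names(coef(fit))
```

In the reproducible example this would mean defining formula.test <- "y ~ x + I(x^2)" and calling residualization() as before.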

R Harmonic Prediction Failing - newdata structure

I am forecasting a time series using harmonic regression created as such:
(Packages used: tseries, forecast, TSA, plyr)
airp <- AirPassengers
TIME <- 1:length(airp)
SIN <- COS <- matrix(nrow = length(TIME), ncol = 6,0)
for (i in 1:6){
SIN[,i] <- sin(2*pi*i*TIME/12)
COS[,i] <- cos(2*pi*i*TIME/12)
}
SIN <- SIN[,-6]
decomp.seasonal <- decompose(airp)$seasonal
seasonalfit <- lm(airp ~ SIN + COS)
The fitting works just fine. The problem occurs when forecasting.
TIME.NEW <- seq(length(TIME)+1, length(TIME)+12, by=1)
SINNEW <- COSNEW <- matrix(nrow=length(TIME.NEW), ncol = 6, 0)
for (i in 1:6) {
SINNEW[,i] <- sin(2*pi*i*TIME.NEW/12)
COSNEW[,i] <- cos(2*pi*i*TIME.NEW/12)
}
SINNEW <- SINNEW[,-6]
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW, SIN = SINNEW, COS = COSNEW)
seasonal.predictions <- predict(seasonalfit, newdata = prediction.harmonic.dataframe)
This causes the warning:
Warning message:
'newdata' had 12 rows but variables found have 144 rows
I went through and found that the names were SIN.1, SIN.2, et cetera, instead of SIN1 and SIN2... So I manually changed those and it still didn't work. I also manually removed the SIN.6 because it, for some reason, was still there.
Help?
Edit: I have gone through the similar posts as well, and the answers in those questions did not fix my problem.
Trying to predict with a data.frame after fitting an lm model with variables not inside a data.frame (especially matrices) is not fun. It's better if you always fit your model from data in a data.frame.
For example if you did
seasonalfit <- lm(airp ~ ., data.frame(airp=airp,SIN=SIN,COS=COS))
Then your predict would work.
Alternatively you can try to cram matrices into data.frames but this is generally a bad idea. You would do
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW,
SIN = I(SINNEW), COS = I(COSNEW))
The I() (or AsIs function) will keep them as matrices.
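Here is a self-contained sketch of the first suggestion, using only base R and the built-in AirPassengers data. The helper name harmonics is made up for illustration. Because the same helper builds the columns for both fitting and forecasting, the column names (SIN.1, ..., COS.6) match automatically.

```r
airp <- AirPassengers
TIME <- seq_along(airp)

# Hypothetical helper: harmonic regressors for time index t.
# sin(2*pi*6*t/12) is zero for integer t, so only 5 sine terms are kept.
harmonics <- function(t) {
  S <- sapply(1:5, function(i) sin(2 * pi * i * t / 12))
  C <- sapply(1:6, function(i) cos(2 * pi * i * t / 12))
  data.frame(SIN = S, COS = C)  # columns SIN.1..SIN.5, COS.1..COS.6
}

# Fit from a data.frame so predict() can match columns by name.
train <- data.frame(airp = as.numeric(airp), harmonics(TIME))
seasonalfit <- lm(airp ~ ., data = train)

# Forecast the next 12 months with identically named columns.
TIME.NEW <- seq(length(TIME) + 1, length(TIME) + 12)
seasonal.predictions <- predict(seasonalfit, newdata = harmonics(TIME.NEW))
length(seasonal.predictions)  # 12
```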

Selecting multiple variables into model starting with common name in R

In SAS we can select multiple variables that share a common name prefix using the colon (:) operator. I want to do the same in R for modeling purposes.
Any suggestions?
There are probably many ways to do this. Here is one with a regular expression that doesn't do exactly what you want, but might do the trick:
x1 = rnorm(100)
x2 = rnorm(100)
z = rnorm(100)
a = rnorm(100)
y = x1+x2+z
d = data.frame(x1,x2,z,y)
X = as.matrix(d[,grepl("x",colnames(d))])
head(X)
m = lm(y~X+a)
summary(m)
as.formula(paste("y~", paste(names(mydata)[substr(names(mydata), 1, 1)=="x"], collapse="+"))) -> myformula
gives a formula object myformula for a regression of y on all variables with names beginning with x in the data frame mydata that you can use in models, e.g. lm(myformula, data=mydata). So you're not sub-setting the data frame, which can be a nuisance when it's big.
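A runnable variant of the same idea, as a sketch with made-up data: grep() with an anchored pattern picks the names by prefix (of any length, not just one character), and reformulate() from base R builds the formula without pasting strings by hand.

```r
set.seed(1)
# Toy data (hypothetical): two predictors starting with "x", one that does not.
mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100), z = rnorm(100))
mydata$y <- mydata$x1 + mydata$x2 + mydata$z + rnorm(100)

# Names beginning with the prefix "x".
x_vars <- grep("^x", names(mydata), value = TRUE)
x_vars  # "x1" "x2"

# Build y ~ x1 + x2 and fit without subsetting the data frame.
myformula <- reformulate(x_vars, response = "y")
m <- lm(myformula, data = mydata)
names(coef(m))  # "(Intercept)" "x1" "x2"
```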

object 'panel.bpplot' not found error in R

I'm fairly new to R and I'm trying to create a lattice bwplot, however I'm getting an error saying object 'panel.bpplot' not found. I have tried using the following example from R documentation:
set.seed(13)
x <- rnorm(1000)
g <- sample(1:6, 1000, replace=TRUE)
x[g==1][1:20] <- rnorm(20)+3 # contaminate 20 x's for group 1
# default trellis box plot
require(lattice)
bwplot(g ~ x)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=TRUE)
The result is that the first plot is created (bwplot(g ~ x)), but when second one tries to run I get:
Error in bwplot.formula(g ~ x, panel = panel.bpplot, probs = seq(0.01, :
object 'panel.bpplot' not found
Any help will be very much appreciated!
panel.bpplot is a function in the Hmisc package, so you need to attach that package before plotting.
library(Hmisc)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=TRUE)
