lapply error - PGLS (caper) with multiple comparative.data

I need help with the following problem.
I generated a list containing 1000 comparative.data objects and I want to run 1000 PGLS models, one on each of them. I tried to do this with lapply, using the following code:
pg <- lapply(obj, function(z){pgls(formula = y ~ x, cd[[z]], lambda = "ML")})
obj is a list of 1000 data frames with my data. cd is my list of 1000 comparative.data objects.
When I ran this code, the following error was returned:
Error in pgls(formula = y ~ x, cd[[z]], lambda = "ML") :
object 'z' not found
I cannot see where the error comes from.
Thanks in advance.
More information
obj is used to generate the comparative.data objects. To generate the 1000 comparative.data objects from the 1000 data frames in the obj list, I used:
cd <- lapply(1:1000, function(x) comparative.data(phy = phylogeny,
data = as.data.frame(obj[[x]]),
names.col = species_name,
vcv=T, vcv.dim=3))
To run one pgls for the hundredth comparative.data the code is:
mod <- pgls(formula = y ~ x, cd[[100]], lambda = "ML")
Calling the hundredth obj and hundredth cd
obj[[100]]
# A tibble: 136 x 3
# Groups: Binomial, herbivores [136]
Binomial herbivores tm
* <chr> <dbl> <dbl>
1 Abies_alba 30. 0.896
2 Abies_balsamea 2. 0.990
3 Abies_borisii-regis 1. 0.940
4 Alcea_rosea 7. 0.972
5 Amaranthus_caudatus 1. 0.173
6 Amaranthus_hybridus_subsp._cruentus 1. 0.310
7 Aquilegia_vulgaris 9. 0.365
8 Arabidopsis_thaliana 8. 0.00280
9 Arabis_alpina 2. 0.978
10 Ariocarpus_fissuratus 1. 0.930
# ... with 126 more rows
cd[[100]]
Comparative dataset of 136 taxa:
Phylogeny: tree
136 tips, 134 internal nodes
chr [1:136] "Mercurialis_annua" "Manihot_esculenta"
"Malpighia_emarginata" "Comarum_palustre" ...
VCV matrix present:
VCV.array [1:136, 1:136, 1:16] 61.9 189.3 189.3 189.3 189.3 ...
Data: as.data.frame(obj[[x]])
$ herbivores: num [1:136] 4 1 1 5 19 21 7 4 4 2 ...
$ tm : num [1:136] 0.516 0.915 1.013 0.46 0.236 ...

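Since cd was created from obj, there is no need to reference obj in the lapply call. In the original call, z is an element of obj (a data frame), not an integer position, so cd[[z]] does not select the matching comparative.data. Instead, pass your list of comparative.data directly and iterate over its elements:
# BELOW d IS THE comparative.data OBJECT PASSED INTO THE LAPPLY LOOP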
pg_list <- lapply(cd, function(d) pgls(formula = y ~ x, d, lambda = "ML"))
Or by index:
# BELOW i IS INTEGER VALUE PASSED INTO LAPPLY LOOP
pg_list <- lapply(seq_along(cd), function(i) pgls(formula = y ~ x, cd[[i]], lambda = "ML"))
Alternatively, you can combine both lapply calls, assuming you do not need the intermediate cd list for other purposes:
# BELOW df IS THE DATA FRAME PASSED INTO THE LAPPLY LOOP
pg_list <- lapply(obj, function(df) {
  # Build the comparative.data for this data frame, then fit the model on it
  cd_i <- comparative.data(phy = phylogeny,
                           data = as.data.frame(df),
                           names.col = species_name,
                           vcv = TRUE, vcv.dim = 3)
  pgls(formula = y ~ x, cd_i, lambda = "ML")
})
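The difference between iterating over elements and iterating over positions can be seen on a toy list, without caper. This is only a sketch: the list toy and its column v are hypothetical stand-ins for cd.

```r
# Hypothetical stand-in for `cd`: a list of three small data frames
toy <- list(data.frame(v = 1:2), data.frame(v = 3:4), data.frame(v = 5:6))

# Iterating over elements: each `d` is one data frame from the list
by_element <- lapply(toy, function(d) sum(d$v))

# Iterating over positions: each `i` is an integer used to index the list
by_index <- lapply(seq_along(toy), function(i) sum(toy[[i]]$v))

# Both styles visit the same elements in the same order
identical(by_element, by_index)
```

Either style works as long as the function argument is used consistently: as the object itself in the first case, as an index into the list in the second.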

Related

How to optimize my function by dropping loops

I have the following function that uses nested loops, and I'm not sure how to make it run more efficiently. It runs fine for 100 simulations, but when I ran it for 2000 simulations it took almost 12 seconds.
The code generates any number of Brownian motion simulations and works well. The issue is that once the number of simulations grows to 500 or more it starts to bog down, and at 2000 it is quite slow (about 12 seconds).
Here is the function:
ts_brownian_motion <- function(.time = 100, .num_sims = 10, .delta_time = 1,
                               .initial_value = 0) {
  # TidyEval ----
  T <- as.numeric(.time)
  N <- as.numeric(.num_sims)
  delta_t <- as.numeric(.delta_time)
  initial_value <- as.numeric(.initial_value)
  # Checks ----
  if (!is.numeric(T) | !is.numeric(N) | !is.numeric(delta_t) | !is.numeric(initial_value)){
    rlang::abort(
      message = "All parameters must be numeric values.",
      use_cli_format = TRUE
    )
  }
  # Initialize empty data.frame to store the simulations
  sim_data <- data.frame()
  # Generate N simulations
  for (i in 1:N) {
    # Initialize the current simulation with a starting value of 0
    sim <- c(initial_value)
    # Generate the brownian motion values for each time step
    for (t in 1:(T / delta_t)) {
      sim <- c(sim, sim[t] + rnorm(1, mean = 0, sd = sqrt(delta_t)))
    }
    # Bind the time steps, simulation values, and simulation number together in a data.frame and add it to the result
    sim_data <- rbind(
      sim_data,
      data.frame(
        t = seq(0, T, delta_t),
        y = sim,
        sim_number = i
      )
    )
  }
  # Clean up
  sim_data <- sim_data %>%
    dplyr::as_tibble() %>%
    dplyr::mutate(sim_number = forcats::as_factor(sim_number)) %>%
    dplyr::select(sim_number, t, y)
  # Return ----
  attr(sim_data, ".time") <- .time
  attr(sim_data, ".num_sims") <- .num_sims
  attr(sim_data, ".delta_time") <- .delta_time
  attr(sim_data, ".initial_value") <- .initial_value
  return(sim_data)
}
Here is some output of the function:
> ts_brownian_motion(.time = 10, .num_sims = 25)
# A tibble: 275 × 3
sim_number t y
<fct> <dbl> <dbl>
1 1 0 0
2 1 1 -2.13
3 1 2 -1.08
4 1 3 0.0728
5 1 4 0.562
6 1 5 0.255
7 1 6 -1.28
8 1 7 -1.76
9 1 8 -0.770
10 1 9 -0.536
# … with 265 more rows
# ℹ Use `print(n = ...)` to see more rows

As suggested in the comments, if you want speed, you should use cumsum: draw all the increments at once and accumulate them, instead of growing a vector one step at a time. You need to be clear about which type of Brownian motion you want (arithmetic or geometric). For geometric Brownian motion, you'll need to correct the approximation error by adjusting the mean. As an example, the NMOF package (which I maintain) contains a function gbm that implements geometric Brownian motion through cumsum. Here is an example call for 2000 paths with 100 timesteps each.
library("NMOF")
library("zoo") ## for plotting
timesteps <- 100
system.time(b <- NMOF::gbm(2000, tau = 1, timesteps = 100, r = 0, v = 1))
## user system elapsed
## 0.013 0.000 0.013
dim(b) ## each column is one path, starting at time zero
## [1] 101 2000
plot(zoo(b[, 1:5], 0:timesteps), plot.type = "single")
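For the arithmetic case in the question, the cumsum idea could be sketched in base R roughly as follows. The function name brownian_paths and its argument names are illustrative assumptions; they are not part of NMOF or of the original function, and the sketch returns a plain data frame rather than a tibble with attributes.

```r
# Sketch (assumption: arithmetic Brownian motion, as in the question's loop)
brownian_paths <- function(time = 100, num_sims = 10, delta_time = 1,
                           initial_value = 0) {
  n_steps <- as.integer(round(time / delta_time))
  # Draw every increment for every path in one call: one column per path
  increments <- matrix(
    rnorm(n_steps * num_sims, mean = 0, sd = sqrt(delta_time)),
    nrow = n_steps, ncol = num_sims
  )
  # Accumulate increments down each column; prepend a zero row for time 0
  paths <- rbind(0, apply(increments, 2, cumsum)) + initial_value
  # Long format: the matrix is unrolled column by column (path by path)
  data.frame(
    sim_number = factor(rep(seq_len(num_sims), each = n_steps + 1)),
    t = rep((0:n_steps) * delta_time, times = num_sims),
    y = as.vector(paths)
  )
}

sims <- brownian_paths(time = 10, num_sims = 25)
dim(sims)  # 275 rows (11 time points x 25 paths), 3 columns
```

The design choice is to avoid both growing sim with c() and growing sim_data with rbind() inside loops, since each of those copies the accumulated data on every iteration.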

Problem passing extra arguments to lapply() with uniroot function

I'm trying to find the roots for specific values of Y with uniroot(). The Y values are all in one column of a data frame, and I want to create a new column with the root found for each Y in that column via lapply().
The function that uniroot() takes as its argument subtracts the Y value from the fitted polynomial, and that Y value is passed as an extra argument to uniroot() (as described on the uniroot help page).
After a couple of hours trying to figure out what was happening, I realized that the value lapply() feeds to the function is the Y, but it's being read as the argument "interval" inside uniroot(), thus giving me errors about this argument.
I think I could implement this another way, but it'd be much better and simpler if this way has a solution.
pol_mod <- lm(abs_p ~ poly(patron, 5, raw = TRUE), data = bradford)
a <- as.numeric(coefficients(pol_mod)[6])
b <- as.numeric(coefficients(pol_mod)[5])
c <- as.numeric(coefficients(pol_mod)[4])
d <- as.numeric(coefficients(pol_mod)[3])
e <- as.numeric(coefficients(pol_mod)[2])
f <- as.numeric(coefficients(pol_mod)[1])
fs <- function (x,y) {a*x^5 + b*x^4 + c*x^3 + d*x^2 + e*x + f - y}
interpol <- function (y, fs) {
return(uniroot(fs,y=y, interval=c(0,2000)))
}
bradford$concentracion <- lapply(bradford$abs_m, interpol, fs=fs)
The error I'm getting:
Error in uniroot(fs, y = y, interval = c(0, 2000)) :
f.lower = f(lower) is NA
Needless to say, everything works when applied outside of lapply().
I'd be really happy if someone could lend a hand! Thanks in advance!
EDIT: This is what the data frame looks like.
bradford
# A tibble: 9 x 3
patron abs_p abs_m
<dbl> <dbl> <dbl>
1 0 0 1.57
2 25 0.041 1.27
3 125 0.215 1.59
4 250 0.405 1.61
5 500 0.675 0.447
6 750 0.97 0.441
7 1000 1.23 NA
8 1500 1.71 NA
9 2000 2.04 NA
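Judging from the data frame shown, abs_m contains NA in rows 7 to 9. With y = NA, the function value at the lower bound is NA, which is the condition the error message describes. A minimal sketch, assuming those NA entries are the cause: the linear fs below is a hypothetical stand-in for the fitted polynomial, and only the NA guard and the way y is passed are the point.

```r
# Hypothetical monotone curve standing in for the fitted polynomial
fs <- function(x, y) 0.001 * x - y

# Return the root for a single y, or NA when y itself is missing
interpol <- function(y, fs) {
  if (is.na(y)) return(NA_real_)
  # `y = y` is forwarded by uniroot() to fs() through its `...` argument
  uniroot(fs, interval = c(0, 2000), y = y)$root
}

ys <- c(1.57, 0.447, NA)  # two abs_m values from the table, plus one NA
roots <- vapply(ys, interpol, numeric(1), fs = fs)
roots
```

Using vapply() with numeric(1) and extracting $root gives a plain numeric vector, which fits into a data frame column more naturally than the list of uniroot() results that lapply() returns.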

I'm trying to tabulate the branches of a binary tree (party) into a dataframe in R

After fitting a Tree with party::ctree() I want to create a table to characterise the branches.
I have fitted these variables
> summary(juridicos_segmentar)
actividad_economica
Financieras : 89
Gubernamental : 48
Sector Primario : 34
Sector Secundario:596
Sector Terciario :669
ingresos_cut
(-Inf,1.03e+08] :931
(1.03e+08,4.19e+08]:252
(4.19e+08,1.61e+09]:144
(1.61e+09, Inf] :109
egresos_cut
(-Inf,6e+07] :922
(6e+07,2.67e+08] :256
(2.67e+08,1.03e+09]:132
(1.03e+09, Inf] :126
patrimonio_cut
(-Inf,2.72e+08] :718
(2.72e+08,1.46e+09]:359
(1.46e+09,5.83e+09]:191
(5.83e+09, Inf] :168
op_ingreso_cut
(-Inf,3] :1308
(3,7] : 53
(7,22] : 44
(22, Inf]: 31
The first one is categorical and the others are ordinal, and I fitted them to another factor variable
> summary(as.factor(segmento))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
27 66 30 39 36 33 39 15 84 70 271 247 101 34 100 74 47 25 48 50
I used the following code
library(party)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
to get this tree
> fit_jur
Conditional inference tree with 31 terminal nodes
Response: cluster
Inputs: actividad_economica, ingresos_cut, egresos_cut, patrimonio_cut, op_ingreso_cut
Number of observations: 1436
1) actividad_economica == {Financieras}; criterion = 1, statistic = 4588.487
2) ingresos_cut <= (4.19e+08,1.61e+09]; criterion = 1, statistic = 62.896
3) egresos_cut <= (6e+07,2.67e+08]; criterion = 1, statistic = 22.314
4)* weights = 70
3) egresos_cut > (6e+07,2.67e+08]
5)* weights = 10
2) ingresos_cut > (4.19e+08,1.61e+09]
6)* weights = 9
[Figure: plot of part of the tree]
What I want is a table where every row is a path from the root node to a leaf, giving the prediction for the variable segmento, and every column is the split condition on one variable. Something like this:
actividad economica ingresos (rango) egresos (rango) patrimonio (rango) operaciones de ingreso segmento
Sector Primario <=261.000.000 18
Sector Primario >261.000.000 20
The problem is that there are several leaves to characterise, and sometimes a variable appears several times in one path, so I'd like to intersect the conditions, i.e. intersect the ranges.
I've thought of data.tree::ToDataFrameTable, but I have no idea how it works with party.
Thank you very much!
library(partykit)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
pathpred <- function(object, ...)
{
## coerce to "party" object if necessary
if(!inherits(object, "party")) object <- as.party(object)
## get standard predictions (response/prob) and collect in data frame
rval <- data.frame(response = predict(object, type = "response", ...))
rval$prob <- predict(object, type = "prob", ...)
## get rules for each node
rls <- partykit:::.list.rules.party(object)
## get predicted node and select corresponding rule
rval$rule <- rls[as.character(predict(object, type = "node", ...))]
return(rval)
}
ct_pred_jur <- unique(pathpred(fit_jur)[c(1,3)])
write.csv2(ct_pred_jur,'parametrizacion_juridicos.csv')
Thank you, Achim Zeileis, for pointing me in this direction. I couldn't intersect the rules on the same variable, i.e. evaluate the '&'s. That problem is still open.

You can convert both the party class (from partykit) and BinaryTree (from party) to a data.tree, and use it for conversion to a data frame and/or for printing. For example:
library(party)
library(data.tree) # provides as.Node() and ToDataFrameTable()
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
tree <- as.Node(airct)
df <- ToDataFrameTable(tree,
"pathString",
"label",
criterion = function(x) round(x$criterion$maxcriterion, 3),
statistic = function(x) round(max(x$criterion$statistic), 3)
)
df
This will print like so:
pathString label criterion statistic
1 1/2/3 weights = 10 0.000 0.000
2 1/2/4/5 weights = 48 0.936 6.141
3 1/2/4/6 weights = 21 0.891 5.182
4 1/7/8 weights = 30 0.675 3.159
5 1/7/9 weights = 7 0.000 0.000
Plotting:
#print subtree
subtree <- Clone(tree$`2`)
SetNodeStyle(subtree,
style = "filled,rounded",
shape = "box",
fillcolor = "GreenYellow",
fontname = "helvetica",
label = function(x) x$label,
tooltip = function(x) round(x$criterion$maxcriterion, 3))
plot(subtree)
[Figure: the plotted subtree]

How to separate data frame in R into two separate data frames for Stepwise Regression in SuperLearner

I have a dataset 162 x 152. What I want to do is use stepwise regression, incorporating cross validation on the dataset to create a model and to test how accurate that model is.
ID RT (seconds) 76_TI2 114_DECC 120_Lop 212_PCD 236_X3Av
4281 38 4.086 1.2 2.322 0 0.195
4952 40 2.732 0.815 1.837 1.113 0.13
4823 41 4.049 1.153 2.117 2.354 0.094
3840 41 4.049 1.153 2.117 3.838 0.117
3665 42 4.56 1.224 2.128 2.38 0.246
3591 42 2.96 0.909 1.686 0.972 0.138
This is part of the dataset I have. I want to construct a model where my Y variable is RT (seconds) and my predictors are the other 151 variables in my dataset. I was told to use the SuperLearner package, and the call for that is:
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
The problem is that I'm still rather new to R. So far, I have been reading my data in and applying other machine learning algorithms to it as follows:
mydata <- read.csv("filepathway")
fit <- lm(RT..seconds~., data=mydata)
So how do I go about separating the RT seconds column from the rest of my data so that I can pass them in as X and Y? I.e. something along the lines of:
mydata <- read.csv("filepathway")
mydata$RT..seconds. = Y #separating my Y response variable
Alltheother151variables = X #separating all of my X predictor variables (all 151 of them)
SL.library <- c("SL.step")
test <- CV.SuperLearner(Y (i.e RT seconds column), X (all the other 151 variables that corresponds to the RT values), V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
I hope this all makes sense. Thanks!

If the response variable is in the first column, you can simply use:
Y <- mydata[ , 1 ]
X <- mydata[ , -1 ]
The first argument of [ (the row number) is empty, so we keep all the rows,
and the second is either 1 (the first column) or -1 (everything but the first column).
If your response variable is elsewhere, you can use the column names instead:
Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]
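The name-based split can be tried on a miniature version of the data. This is a sketch only: the data frame below is a hypothetical three-row excerpt using the first rows shown in the question, with column names as read.csv() would typically mangle them, and dropping ID from the predictors is an assumption about the intended model.

```r
# Hypothetical miniature of the data set
mydata <- data.frame(
  ID           = c(4281, 4952, 4823),
  RT..seconds. = c(38, 40, 41),
  X76_TI2      = c(4.086, 2.732, 4.049),
  X114_DECC    = c(1.2, 0.815, 1.153)
)

response <- "RT..seconds."

# Y: the response as a plain numeric vector
Y <- mydata[[response]]

# X: every other column except ID (assumed to be an identifier, not a predictor);
# drop = FALSE keeps X a data frame even if only one column remains
X <- mydata[, setdiff(colnames(mydata), c(response, "ID")), drop = FALSE]

str(X)
```

Y and X built this way have the shapes that the CV.SuperLearner(Y = Y, X = X, ...) call in the question expects: a vector and a data frame with one row per observation.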

Lasso r code - what is wrong with it?

I am attempting to carry out lasso regression using the lars package but cannot seem to get the lars call to work. I have entered this code:
diabetes<-read.table("diabetes.txt", header=TRUE)
diabetes
library(lars)
diabetes.lasso = lars(diabetes$x, diabetes$y, type = "lasso")
However, I get an error message of :
Error in rep(1, n) : invalid 'times' argument.
I have tried entering it like this:
diabetes<-read.table("diabetes.txt", header=TRUE)
library(lars)
data(diabetes)
diabetes.lasso = lars(age+sex+bmi+map+td+ldl+hdl+tch+ltg+glu, y, type = "lasso")
But then I get the error message:
'Error in lars(age+sex + bmi + map + td + ldl + hdl + tch + ltg + glu, y, type = "lasso") :
object 'age' not found'
Where am I going wrong?
EDIT: Data - as below but with another 5 columns.
ldl hdl tch ltg glu
1 -0.034820763 -0.043400846 -0.002592262 0.019908421 -0.017646125
2 -0.019163340 0.074411564 -0.039493383 -0.068329744 -0.092204050
3 -0.034194466 -0.032355932 -0.002592262 0.002863771 -0.025930339
4 0.024990593 -0.036037570 0.034308859 0.022692023 -0.009361911
5 0.015596140 0.008142084 -0.002592262 -0.031991445 -0.046640874

I think some of the confusion may have to do with the fact that the diabetes data set that comes with the lars package has an unusual structure.
library(lars)
data(diabetes)
sapply(diabetes,class)
## x y x2
## "AsIs" "numeric" "AsIs"
sapply(diabetes,dim)
## $x
## [1] 442 10
##
## $y
## NULL
##
## $x2
## [1] 442 64
In other words, diabetes is a data frame containing "columns" which are themselves matrices. In this case, with(diabetes,lars(x,y,type="lasso")) or lars(diabetes$x,diabetes$y,type="lasso") work fine. (But just lars(x,y,type="lasso") won't, because R doesn't know to look for the x and y variables within the diabetes data frame.)
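The matrix-inside-a-data-frame structure can be reproduced with a small example. This is a sketch with a hypothetical toy data frame, not the lars data; it only shows why x has to be reached through the data frame.

```r
# Hypothetical toy data: a numeric response and a two-column predictor matrix
m <- matrix(c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

# I() keeps the matrix as a single "column" instead of splitting it up
toy <- data.frame(y = c(1.5, 2.0, 0.5), x = I(m))

ncol(toy)          # the data frame has two columns: y and x
dim(toy$x)         # x is itself a matrix, reached with $
with(toy, dim(x))  # with() makes the columns visible by name
```

The same applies to the lars call: x and y only exist inside the data frame, so they must be reached with $ or with().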
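However, if you are reading in your own data, you'll have to separate the response variable and the predictor matrix yourself, something like
# keep every column except the response y (note the comma: this selects columns, not rows)
X <- as.matrix(mydiabetes[, names(mydiabetes) != "y"])
mydiabetes.lasso = lars(X, mydiabetes$y, type = "lasso")
Or you might be able to use
# drop the intercept column that model.matrix() adds; lars() fits its own intercept by default
X <- model.matrix(y ~ ., data = mydiabetes)[, -1]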

lars::lars does not appear to have a formula interface, which means you cannot use the formula specification for the column names (and furthermore it does not accept a "data=" argument). For more information on this and other "data mining" topics, you might want to get a copy of the classic text: "Elements of Statistical Learning". Try this:
# this obviously assumes require(lars) and data(diabetes) have been executed.
> diabetes.lasso = with( diabetes, lars(x, y, type = "lasso"))
> summary(diabetes.lasso)
LARS/LASSO
Call: lars(x = x, y = y, type = "lasso")
Df Rss Cp
0 1 2621009 453.7263
1 2 2510465 418.0322
2 3 1700369 143.8012
3 4 1527165 86.7411
4 5 1365734 33.6957
5 6 1324118 21.5052
6 7 1308932 18.3270
7 8 1275355 8.8775
8 9 1270233 9.1311
9 10 1269390 10.8435
10 11 1264977 11.3390
11 10 1264765 9.2668
12 11 1263983 11.0000
