R: How to fit gamlss in a foor loop with a variable (character) - r

I have a tricky problem. I have a dataframe with more than 1000 variables and want to fit each variable to age using fp smoothing function. I know how to use gamlss() for a specific variable (vari), but that's not practical to repeat this explicitly for more than 1000 times. Moreover, I want to plot the fitting for all 1000 variable in a single figure. What I did is:
variables <- colnames(data)[7:dim(data)[2]]
for(vari in variables) {
print("ROI is:")
print(vari)
model_fem <- gamlss(vari ~ fp(age), family=GG, data=females)
But I got errors:
Error in model.frame.default(formula = vari ~ fp(age), data = females) :
variable lengths differ (found for 'fp(age)')
I think the tricky part is from fp(). I have tried to use as.formula, it didn't work. Also because females$vari return NULL, that's why we got this error.
Do you have any solution for this?
Thank you

Character values are very different from formuals. Formulas contain symbols and you need to properly rebuild them to make them dynamic. There are lots of different ways to do that, but here's one that uses reformulate to turn characters into formulas and update() to modify a base formula.
variables <- colnames(data)[7:dim(data)[2]]
form_resp <- ~ fp(age)
for(vari in variables) {
print("ROI is:")
form_model <- update(form, reformulate(".", response=vari))
print(form_model)
model_fem <- gamlss(form_model, family=GG, data=females)
}

Related

Hmisc: How do I include the results of reformM as an argument in aregImpute?

I am using aregImpute from the Hmisc package to impute missing values in a dataset. Since the results are not invariant to the order of variables in the formula, I have used the reformM function to generate a number of random formula permutations:
reformM(~ x1 + x2 + x3..., data = d, nperm = Z)
The function returns a list.
I have reviewed the help file, but am unclear on how the results can be passed to aregImpute.
Can the object be passed such that aregImpute will iterate over the different combinations to assess any variability? That is, can the more than one permutation be passed to aregImpute? If not, do I simply create a new formula using the new order of variables?
You can certainly use map to pass reformM to aregImpute. Below, I provide a minimal example. That said, I do think it is a better idea to reorder variables such that variables with higher missing data rates are first imputed.
library(missForest)
library(tidyverse)
iris_miss <- prodNA(iris,0.10) ## make every variable missing at 10%
## factorial(ncol(iris_miss)) ## number of possible permutations
permlist <- reformM( as.formula(paste(" ~ ", names(iris_miss) %>% str_flatten( " + "))), data=iris_miss, nperm=10 ) ## get 10 permutations
map(permlist, aregImpute, nk=3, n.impute=5, data= iris_miss)

Writing a function in R to plot ROC curve using pROC

I'm trying to write a function to plot ROC curves based on different scoring systems I have to predict an outcome.
I have a dataframe data_all, with columns "score_1" and "Threshold.2000". I generate a ROC curve as desired with the following:
plot.roc(data_all$Threshold.2000, data_all$score_1)
My goal is to generate a ROC curve for a number of different outcomes (e.g. Threshold.1000) and scores (score_1, score_2 etc), but am initially trying to set it up just for different scores. My function is as follows:
roc_plot <- function(dataframe_of_interest, score_of_interest) {
plot.roc(dataframe_of_interest$Threshold.2000, dataframe_of_interest$score_of_interest)}
I get the following error: Error in roc.default(x, predictor, plot =
TRUE, ...) : No valid data provided.
I'd be very grateful if someone can spot why my function doesn't work! I'm a python coder and new-ish to R, and haven't had much luck trying a number of different things. Thanks very much.
EDIT:
Here is the same example with mtcars so it's reproducible:
data(mtcars)
plot.roc(mtcars$vs, mtcars$mpg) # --> makes correct graph
roc_plot <- function(dataframe_of_interest, score_of_interest) {
plot.roc(dataframe_of_interest$mpg, dataframe_of_interest$score_of_interest)}
Outcome:
Error in roc.default(x, predictor, plot = TRUE, ...) : No valid data provided.
roc_plot(mtcars, vs)
Here's one solution that works as desired (i.e. lets the user specify different values for score_of_interest):
library(pROC)
data(mtcars)
plot.roc(mtcars$vs, mtcars$mpg) # --> makes correct graph
# expects `score_of_interest` to be a string!!!
roc_plot <- function(dataframe_of_interest, score_of_interest) {
plot.roc(dataframe_of_interest$vs, dataframe_of_interest[, score_of_interest])
}
roc_plot(mtcars, 'mpg')
roc_plot(mtcars, 'cyl')
Note that your error was not resulting from an incorrect column name, it was resulting from an incorrect use of the data.frame class. Notice what happens with a simpler function:
foo <- function(x, col_name) {
head(x$col_name)
}
foo(mtcars, mpg)
## NULL
This returns NULL. So in your original function when you tried to supply plot.roc with dataframe_of_interest$score_of_interest you were actually feeding plot.roc a NULL.
There are several ways to extract a column from a data.frame by the column name when that name is stored in an object (which is what you're doing when you pass it as an argument in a function). Perhaps the easiest way is to remember that a data.frame is like a 2D array-type object and so we can use familiar object[i, j] syntax, but we ask for all rows and we specify the column by name, e.g., mtcars[, 'mpg']. This still works if we assign the string 'mpg' to an object:
x <- 'mpg'
mtcars[, x]
So that's how I produced my solution. Going a step further, it's not hard to imagine how you would be able to supply both a score_of_interest and a threshold_of_interest:
roc_plot2 <- function(dataframe_of_interest, threshold_of_interest, score_of_interest) {
plot.roc(dataframe_of_interest[, threshold_of_interest],
dataframe_of_interest[, score_of_interest])
}
roc_plot2(mtcars, 'vs', 'mpg')

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been though several tutorials and tried many iterations of each issue. What I have provided here is I think the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single column data frame to a vector.
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for analysis, but here's a oneline solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to set the categorical variables to unique integers. Most preferably, not use them. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers doesn't do this for ID (Why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation. Almost logical, since there are only three vectors used. The assignment of integers to the categorical vector has probably no meaning at all. The function assigns from top to bottom unique integers to the following unique character string. I am also not really sure which question you want to answer. Based on this you can organize the data frame.

Linear regression with multiple lag independent variables

I am trying to undertake a linear regression on multiple lagged independent variables. I am trying to automate the part where specifying the number of lags i.e. 1,3,5,etc. would automatically update the code below and provide results with lags defined in a previous step.
My code without any 'lag' automated operation is as follows. In this instance, i have specified 2 lags :
base::summary(stats::lm(ABX_2000$Returns ~ stats::lag(as.ts(ABX_2000$Returns),1) +
stats::lag(as.ts(ABX_2000$Returns),2)))
This code works!
I defined a function as follows::
# function to accept multiple lags
lm_lags_multiple <- function(ds,lags=2){
base::summary(stats::lm(ds ~ paste0("stats::lag(as.ts(ds,k=(", 1:lags, ")))", collapse = " + ")))
}
# run function
lm_lags_multiple(ds=ABX_2000$Returns,lags=2)
On running the above function, i receive an Error message noting:
Variable lengths are different.
I don't know how to solve this Error? Is there a lambda function equivalent in R as in Python?
Let's try this code:
lm_lags_multiple <- function(ds,lags=2){
lst <- list()
for (i in 1:lags){
lst[i] <- paste0("stats::lag(as.ts(ABX_2000$Returns),",i,")")
}
base::summary(stats::lm(as.formula(paste0("ds ~",paste(Reduce(c,lst), collapse = "+")))))
}
Pls don't forget to let us know if it worked :)

How to use a string as a formula in r

I'm trying to do an ANOVA of all of my data frame columns against time_of_day which is a factor. The rest of my columns are all doubles and of equal length.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(paste(i, "~ time_of_day"), data = data_in)
}
x = x+1
}
dev.off()
Running this code gives me this error:
Error: $ operator is invalid for atomic vectors
Where is my code calling $? How can I fix this? Sorry, I'm new to r and am quite lost.
My research question is to see if time of day has an affect on brain volume at different ROIs in the brain. Time of day is divided into three categories of morning, afternoon or night.
Edit: SOLVED
treating the string as a formula will allow this to run although I have been advised to not have this many independent values as it will inflate the statistical results of the model. I am not removing this incase someone has a similar problem with the aov() call.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(as.formula(paste(i, "~ time_of_day")), data = data_in)
}
x = x+1
}
dev.off()
I guess your problem is that you don't have an ANOVA formula integrated into your aov() function. See the following working example:
data_in <- data.frame(c(1,2,3),c(4,5,6),c(7,8,9))
names(data_in) <- c("first","second","third")
for (i in seq_along(names(data_in))){
test <- aov(data_in$first ~ data_in$second, data = data_in)
print(summary(test))
}
However, it seems that you tried to calculate an ANOVA for each column, whereas you need at least two variables. That is, a nominal scaled condition variable and an interval scaled dependent variable (e.g. gender and weight). So I'm generally wondering if an ANOVA is the correct method for your question. Anyways, in order to answer this question, sample data and a summary of your research question would be needed.

Resources