I am trying to write a generic function that constructs a formula for linear regression. I want the function to create the formula either
using user-defined variables, or
using all the variables present in the data frame.
I can create the formula from all the variables in the data frame, but I am stuck on the user-defined case: I do not know how to capture the variables the user supplies so that I can later use them to build the formula.
The function I have so far is this:
lmformula <- function (data, IndepVariable = character, VariableList = TRUE){
if (VariableList) {
newlist <- list()
newlist <- # Here is where I do not know exactly what to do to extract the variables defined by the user
DependVariables <- newlist
f <- as.formula(paste(IndepVariable, "~", paste((DependVariables), collapse = '+')))
}else {
names(data) <- make.names(colnames(data))
DependVariables <- names(data)[!colnames(data)%in% IndepVariable]
f <- as.formula(paste(IndepVariable,"~", paste((DependVariables), collapse = '+')))
return (f)
}
}
Any hint would be greatly appreciated.
The only thing that changes is how you get the independent variables:
If the user specifies them, use that character vector directly.
Otherwise, take all the variables other than the dependent variable (which you are already doing).
Note: As Roland mentioned, the formula has the form dependentVariable ~ independentVariable1 + independentVariable2 + independentVariable3, so the dependent variable goes on the left-hand side.
# creating mock data
data <- data.frame(col1 = numeric(0), col2 = numeric(0), col3 = numeric(0), col4 = numeric(0))
# the function
lmformula <- function (data, DepVariable, IndepVariable, VariableList = TRUE) {
if (!VariableList) {
IndepVariable <- names(data)[!names(data) %in% DepVariable]
}
f <- as.formula(paste(DepVariable,"~", paste(IndepVariable, collapse = '+')))
return (f)
}
# working examples
lmformula(data = data, DepVariable = "col1", VariableList = FALSE)
lmformula(data = data, DepVariable = "col1", IndepVariable = c("col2", "col3"), VariableList = TRUE)
Hope it helps!
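As a possible alternative to pasting strings together, base R's reformulate() builds the same kind of formula from a character vector. The sketch below is only an illustration: the name lmformula2 and the use of NULL as the "take every other column" default are assumptions, not part of the answer above.

```r
# Hypothetical variant of lmformula(); NULL means "use all other columns"
lmformula2 <- function(data, DepVariable, IndepVariable = NULL) {
  if (is.null(IndepVariable)) {
    IndepVariable <- setdiff(names(data), DepVariable)
  }
  # Fail early if a requested column does not exist in the data
  stopifnot(all(c(DepVariable, IndepVariable) %in% names(data)))
  # reformulate() assembles response ~ term1 + term2 + ...
  reformulate(IndepVariable, response = DepVariable)
}

# Same mock data as above
data <- data.frame(col1 = numeric(0), col2 = numeric(0),
                   col3 = numeric(0), col4 = numeric(0))

f_all  <- lmformula2(data, "col1")
f_some <- lmformula2(data, "col1", c("col2", "col3"))
```

With NULL as the default, a separate VariableList flag is no longer needed: leaving IndepVariable out selects every column except the dependent one.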
I have a function for which I would like to keep every argument fixed apart from a single one.
ls <- score_model_compound(data, pred, tmp$Prediction, score= "log")
bs <- score_model_compound(data, pred, tmp$Prediction, score="Brier")
ss <- score_model_compound(data, pred, score="spherical")
What I would like is something like:
ls = data.frame()
ls <- score_model_compound(data, pred, score= c("log", "Brier", "spherical"))
Is there a function I can use, like apply(), that lets me do this?
Thank you
You can create some kind of wrapping function with only the first argument being the one you want to vary and then pass it to lapply:
## Creating the wrapping function
my.wrapping.function <- function(score, data, pred, tmp) {
return(score_model_compound(data = data,
pred = pred,
tmp = tmp,
score = score))
}
## Making the list of variables
my_variables <- as.list(c("log", "Brier", "spherical"))
## Running the function for all the variables (with the set specific arguments)
result_list <- lapply(my_variables,
my.wrapping.function,
data = data, pred = pred, tmp = tmp$Prediction)
And finally, to transform it into a data.frame (or matrix), you can use the do.call(cbind, ...) function on the results:
## Combining the results into a table
my_results_table <- do.call(cbind, result_list)
Does that answer your question?
mapply() to the rescue:
score_v = c('spherical', 'log', 'Brier')
l = mapply(
score_model_compound, # function
score = score_v, # variable argument vector
MoreArgs = list(data = data, # fixed arguments
pred = pred),
SIMPLIFY = FALSE # don't simplify
)
You will probably have to tweak it a little yourself, since you didn't provide a reproducible example. mapply(SIMPLIFY = FALSE) returns a list. If the function returns data.frames, the resulting list of data.frames can subsequently be bound with e.g. data.table::rbindlist().
Alternatively you could just use a loop:
l = list()
for (i in seq_along(score_v)) {
sc = score_v[i]
message('Modeling: ', sc)
smc = score_model_compound(data, pred, score = sc)
l[[i]] = smc
names(l)[i] = sc
}
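Because the question has no reproducible example, here is a self-contained sketch of the same idea using lapply() with an anonymous function. score_stub() is a hypothetical stand-in for score_model_compound(), whose real signature and return value are not shown in the question, and the toy data are made up for illustration.

```r
# Hypothetical stand-in: the real score_model_compound() is not shown
score_stub <- function(data, pred, score) {
  data.frame(score = score, n = nrow(data), mean_pred = mean(pred))
}

# Toy inputs, assumed for illustration only
data <- data.frame(y = c(0, 1, 1))
pred <- c(0.2, 0.7, 0.9)
score_v <- c("log", "Brier", "spherical")

# Vary only `score`; data and pred stay fixed for every call
res_list <- lapply(score_v, function(s) score_stub(data, pred, score = s))
names(res_list) <- score_v

# One row per score, assuming each call returns a one-row data.frame
res_table <- do.call(rbind, res_list)
```

Whether rbind or cbind is the right way to combine the pieces depends on what the real function returns.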
When called inside a custom function, RVAideMemoire::Anova.clm does not find the 'df1' data object that was passed to ordinal::clm (apparently because it searches for 'df1' in the global environment):
library(ordinal)
library(RVAideMemoire)
set.seed(1)
df <- data.frame(x = factor(sample(1:2, 100, replace=TRUE)),
y = factor(sample(1:5, 100, replace=TRUE), ordered=TRUE))
clm_function <- function(dv, gv, df1){
model <- ordinal::clm(as.formula(paste0(dv, " ~ ", gv)), data = df1)
result <- RVAideMemoire::Anova.clm(model, type = "II")
return(result)
}
clm_function(dv = "y", gv = "x", df1 = df)
Error in is.data.frame(data) : object 'df1' not found
One can sidestep this error by using assign to put 'df1' in the global environment temporarily:
clm_function_alt <- function(dv, gv, df1){
assign("df1", df1, envir=globalenv()) # ASSIGN HERE
model <- ordinal::clm(as.formula(paste0(dv, " ~ ", gv)), data = df1)
result <- RVAideMemoire::Anova.clm(model, type = "II")
rm(df1, pos = 1) # REMOVE HERE
return(result)
}
clm_function_alt(dv = "y", gv = "x", df1 = df)
LyzandeR’s answer in this post implies that assigning ‘df1’ to the global environment outside the function is the way to go.
But I am wondering whether there is something potentially problematic with using assign from inside a function, as I have shown. For instance, it would overwrite and then remove any existing global object named ‘df1’.
If so, is there a way to instruct RVAideMemoire::Anova.clm to search for ‘df1’ in the execution environment?
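One possible direction, sketched here with base R's lm() and update() instead of the two packages above: if the model is fitted through do.call(), the call stored in the model contains the data itself rather than the local name 'df1', so code that later re-evaluates that call no longer needs to find 'df1' anywhere. Whether RVAideMemoire::Anova.clm and ordinal::clm behave the same way is an assumption that is not tested here; the function names fit_plain and fit_embedded are hypothetical.

```r
set.seed(1)
dat <- data.frame(x = rnorm(20))
dat$y <- 2 * dat$x + rnorm(20)

# The stored call is lm(formula = y ~ x, data = df1): it refers to the local name
fit_plain <- function(df1) {
  lm(y ~ x, data = df1)
}

# do.call() evaluates its arguments first, so the stored call holds the data itself
fit_embedded <- function(df1) {
  do.call("lm", list(formula = y ~ x, data = df1))
}

# update() re-evaluates the stored call outside the function that created it
plain_ok    <- !inherits(try(update(fit_plain(dat)), silent = TRUE), "try-error")
embedded_ok <- !inherits(try(update(fit_embedded(dat)), silent = TRUE), "try-error")
```

In a fresh session with no global object named df1, plain_ok is FALSE and embedded_ok is TRUE. The trade-off is that printing the model shows the whole data frame inside the call.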
I'm writing a function to get some descriptive stats from a data frame. The function takes three arguments: the data set, a set of numerical variables, and a set of character variables. The function returns the required result when both the numerical and the character variables are supplied. However, when one of these arguments is missing, I'd like the function to still return a list with two components, with NULL in the component for the missing argument.
Here's the code I've written. Please let me know if you have an answer.
table1 <- function(dat, numvar, charvar){
result_n <- numeric()
result_c <- data.frame()
#This is the original table function for numerical values
for (i in 1:length(numvar)) {
new_row <- c(round(mean(dat[[numvar[i]]],na.rm = T),2) ,
round(median(dat[[numvar[i]]],na.rm = T),2),
round(sd(dat[[numvar[i]]],na.rm = T),2),
length(dat[[numvar[i]]])-sum(is.na(dat[[numvar[i]]])),
sum(is.na(dat[[numvar[i]]])))
result_n <- rbind(result_n,new_row)
}
rownames(result_n) <- numvar
colnames(result_n) <- c("Mean", "Median", "SD", "N", "N_miss")
#This is the new table for char values
for (i in 1:length(charvar)) {
tab.dat <- as.data.frame(table(dat[charvar[i]],useNA = "ifany" ))
a1 <- as.character(tab.dat$Var1)
a1[3] <- "NMiss"
one.table <- data.frame(
Varname = c(charvar[i], rep(" ", nrow(tab.dat)-1)),
group = a1,
count= tab.dat$Freq)
result_c <- rbind(result_c, one.table)
}
result_list <- list(numericStats = result_n, FactorStats =result_c)
return(result_list)
}
You can give the arguments default values in the function definition:
table1 <- function(dat = NULL, numvar = NULL, charvar = NULL) {...
From there, the function body can check which arguments were left as NULL and handle each case accordingly.
Here's the answer:
table1 <- function(dat, numvar=NULL, charvar=NULL){
result_n <- numeric()
result_c <- data.frame()
#This is the original table function for numerical values
#missing() is a base R function that tests whether an argument was supplied in the call
if(!missing(numvar)) {for (i in 1:length(numvar)) {
new_row <- c(round(mean(dat[[numvar[i]]],na.rm = T),2) ,
round(median(dat[[numvar[i]]],na.rm = T),2),
round(sd(dat[[numvar[i]]],na.rm = T),2),
length(dat[[numvar[i]]])-sum(is.na(dat[[numvar[i]]])),
sum(is.na(dat[[numvar[i]]])))
result_n <- rbind(result_n,new_row)
}
rownames(result_n) <- numvar
colnames(result_n) <- c("Mean", "Median", "SD", "N", "N_miss")}
#This is the new table for char values
#missing() is a base R function that tests whether an argument was supplied in the call
if(!missing(charvar)) {for (i in 1:length(charvar)) {
tab.dat <- as.data.frame(table(dat[charvar[i]],useNA = "ifany" ))
a1 <- as.character(tab.dat$Var1)
a1[3] <- "NMiss"
one.table <- data.frame(
Varname = c(charvar[i], rep(" ", nrow(tab.dat)-1)),
group = a1,
count= tab.dat$Freq)
result_c <- rbind(result_c, one.table)
}}
result_list <- list(numericStats = result_n, FactorStats =result_c)
return(result_list)
}
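Note that the version above returns an empty numeric vector or an empty data frame for the part that was skipped, not NULL. A minimal sketch of one way to get NULL components, as the question asks, is to start both results at NULL and test the arguments with is.null(). The name table1_sketch and the toy data are assumptions for illustration; the NMiss label is applied to whichever group is NA instead of to a fixed position.

```r
table1_sketch <- function(dat, numvar = NULL, charvar = NULL) {
  result_n <- NULL
  result_c <- NULL
  if (!is.null(numvar)) {
    # One row of summary statistics per numerical variable
    result_n <- t(sapply(numvar, function(v) {
      x <- dat[[v]]
      c(Mean   = round(mean(x, na.rm = TRUE), 2),
        Median = round(median(x, na.rm = TRUE), 2),
        SD     = round(sd(x, na.rm = TRUE), 2),
        N      = sum(!is.na(x)),
        N_miss = sum(is.na(x)))
    }))
  }
  if (!is.null(charvar)) {
    # One block of counts per character variable, NA counted as "NMiss"
    result_c <- do.call(rbind, lapply(charvar, function(v) {
      tab <- table(dat[[v]], useNA = "ifany")
      groups <- names(tab)
      groups[is.na(groups)] <- "NMiss"
      data.frame(Varname = v, group = groups, count = as.vector(tab))
    }))
  }
  # list() keeps a NULL element, so the result always has two components
  list(numericStats = result_n, FactorStats = result_c)
}

# Toy data, assumed for illustration only
toy <- data.frame(age = c(10, 20, NA), sex = c("F", "M", NA))
only_num  <- table1_sketch(toy, numvar = "age")
only_char <- table1_sketch(toy, charvar = "sex")
```

Using is.null() instead of missing() also covers the case where a caller passes NULL explicitly.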
I am running a large number of correlations on populations over time. I have split the data accordingly and put the pieces through a function with lapply. I want to put the output of each correlation into a data frame (i.e., each row will hold the info for one correlation, with the columns: correlation's name, p-value, t statistic, df, CIs, and corcoeff).
I have two issues:
I don't know how to extract the name of each correlation, i.e. the name of the group created by the split.
I can get my function to run the correlation on the split (600+ correlations), but I can't get it to write the results into the data frame. To clarify: when I run the function without the loop, it does all 600 correlations, one for each group. However, when I add the loop, it produces NULL for all the groups in the split.
Here is what I have thus far:
> head(Birds) #Shortened for this Post
Location Species Year Longitude Latitude Section Total Percent Family
1 Chiswell A Kittiwake 1976 -149.5847 59.59559 Central 310 16.78397 Gull
BigSplit<-split(Birds,list(Birds$Family, Birds$Location,
Birds$Section,Birds$Species), drop=T) #A list of Dataframes
#Make empty data frame
resultcor <- data.frame(Name = character(),
tvalue = character(),
degreeF = character(),
pvalue = character(),
CIs = character(),
corcoeff = character(),stringsAsFactors = F)
WorkFunc <- function(dataset) {
data.name = substitute(dataset) #Use "dataset" as substitute for actual dataset name
#Correlation between Year and population Percent
try({
correlation <- cor.test(dataset$Year, dataset$Percent, method = "pearson")
}, silent = TRUE)
for (i in 1:nrow(resultcor)) {
resultcor$Name[i] <- ??? #These ??? are not in the code, just highlighting Issue 1
resultcor$tvalue[i] <- correlation$dataset$statistic
resultcor$degreeF[i] <- correlation$dataset$parameter
resultcor$pvalue[i] <- correlation$dataset$p.value
resultcor$CIs[i] <- correlation$dataset$conf.int
resultcor$corcoeff[i] <- correlation$dataset$estimate
}
}
lapply(BigSplit, WorkFunc)
Any help would be appreciated, Thanks!
Consider using Map (a wrapper around mapply), passing both the elements and the names of BigSplit. Map returns a list of data frames that you can then row-bind at the end with do.call(). The code below assumes BigSplit is a named list.
WorkFunc <- function(dataset, dataname) {
# Correlation between Year and population Percent
tryCatch({
correlation <- cor.test(dataset$Year, dataset$Percent, method = "pearson")
CIs <- correlation$conf.int
return(data.frame(
Name = dataname,
tvalue = correlation$statistic,
degreeF = correlation$parameter,
pvalue = correlation$p.value,
CI_lower = ifelse(is.null(CIs), NA, CIs[[1]]),
CI_higher = ifelse(is.null(CIs), NA, CIs[[2]]),
corcoeff = correlation$estimate
)
)
}, error = function(e)
return(data.frame(
Name = character(0),
tvalue = numeric(0),
degreeF = numeric(0),
pvalue = numeric(0),
CI_lower = numeric(0),
CI_higher = numeric(0),
corcoeff = numeric(0)
)
)
)
}
# BUILD CORRELATION DATAFRAMES INTO LIST
cor_df_list <- Map(WorkFunc, BigSplit, names(BigSplit))
cor_df_list <- mapply(WorkFunc, BigSplit, names(BigSplit), SIMPLIFY=FALSE) # EQUIVALENT
# ROW BIND ALL DATAFRAMES TO FINAL LARGE DATAFRAME
finaldf <- do.call(rbind, cor_df_list)
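To try the pattern without the Birds data, here is a small sketch on the built-in mtcars data set, split by cyl. The function name cor_row and the choice of mpg and wt as the correlated columns are assumptions made only for this illustration.

```r
# Hypothetical helper: one result row per group, labelled with the group name
cor_row <- function(dataset, dataname) {
  ct <- cor.test(dataset$mpg, dataset$wt, method = "pearson")
  data.frame(
    Name      = dataname,
    tvalue    = unname(ct$statistic),
    degreeF   = unname(ct$parameter),
    pvalue    = ct$p.value,
    CI_lower  = ct$conf.int[1],
    CI_higher = ct$conf.int[2],
    corcoeff  = unname(ct$estimate)
  )
}

groups <- split(mtcars, mtcars$cyl)   # named list: "4", "6", "8"

# Map() walks the data frames and their names in parallel
cor_rows  <- Map(cor_row, groups, names(groups))
cor_table <- do.call(rbind, cor_rows)
```

Each element of the split contributes exactly one row, so cor_table has one row per group, with the group name in the Name column.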
I'm not very familiar with how R functions handle the variables passed to them.
Here's the problem:
I want to build a function whose ... arguments are column names of the data frame, to be passed on to table().
f <- function (data, ...){
T <- with(data, table(...)) # ... variables input
return(T)
}
How can I make this work?
Thanks a lot for answering!
The order of evaluation doesn't quite work right with with(), apparently. Here's an alternative that should work (using sample data from @DavidArenburg):
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
xx <- lapply(substitute(...()), eval, data, parent.frame())
T <- do.call(table, xx)
return(T)
}
f(data = data1, a,b)
It is often far easier to avoid non-standard evaluation and use character strings to reference the columns within a data.frame.
set.seed(1)
data1 <- data.frame(a = sample(5,5), b = sample(5,5))
f <- function (data, ...) {
do.call(table,data[unlist(list(...))])
}
# the following calls to `f` return the same results
f(data = data1, 'a','b')
f(data = data1, c('a','b'))
a <- c('a','b')
f(data = data1, a)