The logic is similar to the content-based recommender,
content
undesirable
desirable
user_1
...
user_10
1
3.00
2.77
0.11
NA
...
5000
2.50
2.11
NA
0.12
I need to run the model for undesirable and desirable as independent values and each user as the dependent value, thus I need run 10 times to fit the model and predict each user's NA value.
This is the code that I hard coding, but I wonder how to use for loop, I just searched for several methods but they do not work for me...
the data as 'test'
hard code
#fit model
fit_1 = lm(user_1 ~ undesirable + desirable, data = test)
...
fit_10 = lm(user_10 ~ undesirable + desirable, data = test)
#prediction
u_1_na = test[is.na(test$user_1), c('user_1', 'undesirable', 'desirable')]
result1 = predict(fit_1, newdata = u_1_na)
which(result1 == max(result1))
max(result1)
...
u_10_na = test[is.na(test$user_10), c('user_10', 'undesirable', 'desirable')]
result10 = predict(fit_10, newdata = u_10_na)
which(result10 == max(result10))
max(result10)
#make to csv file
apply each max predict value to csv.
this is what I try for now(for loop)
mod_summaries <- list()
for(i in 1:10) {
predictors_i <- colnames(data)[1:10]
mod_summaries[[i - 1]] <- summary(
lm(predictors_i ~ ., test[ , c("undesirable", 'desirable')]))
}
An apply method:
mod_summaries_lapply <-
lapply(
colnames(mtcars),
FUN = function(x)
summary(lm(reformulate(".", response = x), data = mtcars))
)
A for loop method to make linear models for each column. The key is the reformulate() function, which creates the formula from strings. In the question, the function is made of a string and results in error invalid term in model formula. The string needs to be evaluated with eval() . This example uses the mtcars dataset.
mod_summaries <- list()
for(i in 1:11) {
predictors_i <- colnames(mtcars)[i]
mod_summaries[[i]] <- summary(lm(reformulate(".", response = predictors_i), data=mtcars))
#summary(lm(reformulate(". -1", response = predictors_i), data=mtcars)) # -1 to exclude intercept
#summary(lm(as.formula(paste(predictors_i, "~ .")), data=mtcars)) # a "paste as formula" method
}
You could use the function as.formula together with the paste function to create your formula. Following is an example
formula_lm <- as.formula(
paste(response_var,
paste(expl_var, collapse = " + "),
sep = " ~ "))
This implies that you have more than one explanatory variable (separated in the paste with +). If you only have one, omit the second paste.
With the created formula, you can use the lm funciton like this:
lm(formula_lm, data)
Edit: the vector expl_var would in your case include the undesirable and desirable variable.
Avoid the loop. Make your data tidy. Something like:
library(tidyverse)
test %>%
select(-content) %>%
pivot_longer(
starts_with("user"),
names_to="user",
values_to="value"
) %>%
group_by(user) %>%
group_map(
function(.x, .y) {
summary(lm(user ~ ., data=.x))
}
)
Untested code since your example is not reproducible.
Related
I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)
I am trying to create a function that allows me to pass outcome and predictor variable names as strings into the lm() regression function. I have actually asked this before here, but I learned a new technique here and would like to try and apply the same idea in this new format.
Here is the process
library(tidyverse)
# toy data
df <- tibble(f1 = factor(rep(letters[1:3],5)),
c1 = rnorm(15),
out1 = rnorm(15))
# pass the relevant inputs into new objects like in a function
d <- df
outcome <- "out1"
predictors <- c("f1", "c1")
# now create the model formula to be entered into the model
form <- as.formula(
paste(outcome,
paste(predictors, collapse = " + "),
sep = " ~ "))
# now pass the formula into the model
model <- eval(bquote( lm(.(form),
data = d) ))
model
# Call:
# lm(formula = out1 ~ f1 + c1, data = d)
#
# Coefficients:
# (Intercept) f1b f1c c1
# 0.16304 -0.01790 -0.32620 -0.07239
So this all works nicely, an adaptable way of passing variables into lm(). But what if we want to apply special contrast coding to the factorial variable? I tried
model <- eval(bquote( lm(.(form),
data = d,
contrasts = list(predictors[1] = contr.treatment(3)) %>% setNames(predictors[1])) ))
But got this error
Error: unexpected '=' in:
" data = d,
contrasts = list(predictors[1] ="
Any help much appreciated.
Reducing this to the command generating the error:
list(predictors[1] = contr.treatment(3))
Results in:
Error: unexpected '=' in "list(predictors[1] ="
list() seems to choke when the left-hand side naming is a variable that needs to be evaluated.
Your approach of using setNames() works, but needs to be wrapped around the list construction step itself.
setNames(list(contr.treatment(3)), predictors[1])
Output is a named list containing a contrast matrix:
$f1
2 3
1 0 0
2 1 0
3 0 1
I'm trying to apply a custom function to a nested dataframe
I want to apply a machine learning algorithm to predict NA values
After doing a bit of reading online, it seemed that the map function would be the most applicable here
I have a section of code that nests the dataframe and then splits the data into a test (data3) and train (data2) set - with the test dataset containing all the null values for the column to be predicted, and the train containing all the values that are not null to be used to train the ML model
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))))
Here is the function I intend to use:
my_function (test,train) {
et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
test1 <- test
test1[ , "mean_night_flow"] <- 0
pred <- predict(et, newdata = test1[, "mean_night_flow"])
test1[ , "mean_night_flow"] <- pred
return(test1)
I have tried the following code, however it does not work:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map(data3, data2, ~my_function(.x,.y)))
It gives the following error:
Error: Index 1 must have length 1, not 33
This is suggests that it expects a column rather than a whole dataframe. How can I get this to work?
Many thanks
Without testing on your data, I think you're using the wrong map function. purrr::map works on one argument (one list, one vector, whatever) and returns a list. You are passing it two values (data3 and data2), so we need to use:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map2(data3, data2, ~my_function(.x,.y)))
If you find yourself needing more than two, you need pmap. You can use pmap for 1 or 2 arguments, it's effectively the same. The two biggest differences when migrating from map to pmap are:
your arguments need to be enclosed within a list, so
map2(data3, data12, ...)
becomes
pmap(list(data3, data12), ...)
you refer to them with double-dot number position, ..1, ..2, ..3, etc, so
~ my_function(.x, .y)
becomes
~ my_function(..1, ..2)
An alternative that simplifies your overall flow just a little.
my_function (test, train = NULL, fld = "mean_night_flow") {
if (is.null(train)) {
train <- test[ !is.na(test[[fld]]),, drop = FALSE ]
test <- test[ is.na(test[[fld]]),, drop = FALSE ]
}
et <- extraTrees(x = train, y = train[, fld], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
test1 <- test
test1[ , fld] <- 0
pred <- predict(et, newdata = test1[, fld])
test1[ , fld] <- pred
return(test1)
}
which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))
(It's important to name fld=, since otherwise it will be confused for train.)
If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
Note: I suspect your function in under-tested or incomplete. The fact that you assign all 0 to your test1[,"mean_night_flow"] and then use those zeroes in your call to predict seems suspect. I might be missing something, but I would expect perhaps
test1 <- test
pred <- predict(et, newdata = test1)
test1[ , fld] <- pred
return(test1)
(though copying to test1 using tibble or data.frame is mostly unnecessary, since it is copied in-place and the original frame is untouched; I would be more cautious if you were using class data.table).
I have a function in R which includes multiple other functions, including a custom one. I then use lapply to run the combined function across multiple variables. However, when the output is produced it is in the order of
function1: variable a, variable b, variable c
function2: variable a, variable b, variable c
When what I would like is for it to be the other way around:
variable a: function 1, function 2...
variable b: function 1, function 2...
I have recreated an example below using the mtcars dataset, with number of cylinders as a predictor variable, and vs and am as outcome variables.
library(datasets)
library(tidyverse)
library(skimr)
library(car)
data(mtcars)
mtcars_binary <- mtcars %>%
dplyr::select(cyl, vs, am)
# logistic regression function
logistic.regression <- function(logmodel) {
dev <- logmodel$deviance
null.dev <- logmodel$null.deviance
modelN <- length(logmodel$fitted.values)
R.lemeshow <- 1 - dev / null.dev
R.coxsnell <- 1 - exp ( -(null.dev - dev) / modelN)
R.nagelkerke <- R.coxsnell / ( 1 - ( exp (-(null.dev / modelN))))
cat("Logistic Regression\n")
cat("Hosmer and Lemeshow R^2 ", round(R.lemeshow, 3), "\n")
cat("Cox and Snell R^2 ", round(R.coxsnell, 3), "\n")
cat("Nagelkerke R^2" , round(R.nagelkerke, 3), "\n")
}
# all logistic regression results
log_regression_tests1 <- function(df_vars, df_data) {
glm_summary <- glm(df_data[,df_vars] ~ df_data[,1], data = df_data, family = binomial, na.action = "na.omit")
glm_print <- print(glm_summary)
log_results <- logistic.regression(glm_summary)
blr_coefficients <- exp(glm_summary$coefficients)
blr_confint <- exp(confint(glm_summary))
list(glm_summary = glm_summary, glm_print = glm_print, log_results = log_results, blr_coefficients = blr_coefficients, blr_confint = blr_confint)
}
log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE)
log_regression_results1
When I do this, the output is being produced as:
glm_summary: vs, am
log_results: vs, am
etc. etc.
When what I would like for the output to be ordered is:
vs: all function outputs
am: all function outputs
In addition, when I run this line of code, log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE) I get only the results of the logistic regression function, but when I print the overall results log_regression_results1 I get the remaining output, could anyone explain why?
Finally, the glm_summary function is not producing all of the output which it should. When I run the functions independently on a single variable, like so
glm_vs <- glm(vs ~ cyl, data = mtcars_binary, family = binomial, na.action = "na.omit")
summary(glm_vs)
logistic.regression(glm_vs)
exp(glm_vs$vs)
exp(confint(glm_vs))
it also produces the standard error, z value, and p value for summary(glm_vs) which it does not do embedded in the function, even though I have ```glm_print <- print(glm_summary)' included. Is there a way to get the output for the full summary function within the log_regression_tests1 function?
when I run your code up to log_regression_results1 I got exactly what you ask for:
summary(log_regression_results1)
Length Class Mode
vs 5 -none- list
am 5 -none- list
maybe you meant to ask the other way round?
I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What i would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
with lapply, i tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ wt')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)