I'm using lm in R to do simple multiple linear regression. Here's an example model:
m <- lm(formula = t ~ a + b + 0, data = df1)
where t, a and b are columns in df1. This model calculates 2 coefficients, let's call them a.coef and b.coef. If I then use this model to predict some other data, say in df2, I can get the predicted values like so:
predict(m, df2)
if I have the columns a and b in df2 as well. It essentially returns
df2$a * a.coef + df2$b * b.coef
What I'd like, however, are the columns df2$a * a.coef and df2$b * b.coef. R sums them and gives me the answer, but I'd like to see how the scaling affects these values.
Is there a convenient way to do this in R (especially in lm or predict.lm), or will I have to code it manually? I played with the terms argument in predict.lm, but I couldn't get anywhere.
Thanks for the help!
EDIT
I wrote this function:
scaled.fn <- function(dt, x, y, i) {
  # dt is a data.table
  # x is the dependent column (col name as string)
  # y are the predictor columns (col names as character vector)
  # i is the name of the column to scale, as string
  rhs <- paste(y, collapse = " + ")
  my.formula <- paste(x, " ~ ", rhs, sep = "")
  m <- lm(formula = my.formula, data = dt)
  # column names in dt are named in y
  return(dt[, get(i) * coef(m)[i]])
}
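For reference, a hypothetical call (assuming df1 is converted to a data.table; note this formula fits an intercept, unlike the original + 0 model):
library(data.table)
dt1 <- as.data.table(df1)
scaled.fn(dt1, "t", c("a", "b"), "a")  # returns df1$a * a.coef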
Try this:
sweep(df2, MARGIN = 2, coef(m), '*')
EDIT: more specific solution:
sweep(df2[,c("a","b")], MARGIN = 2, coef(m), '*')
Related
I am trying to write a function to calculate the correlation between the variables age and views in a data frame, for each category of gender.
My data frame is called tv_viewing and has 5 columns: adhd (numeric), sex (factor, boy/girl), famsize (factor with 4 levels: 1 child, 2 children, 3 children, 4+ children), age (numeric) and views (numeric, amount of television watched).
I have gotten this far:
partcorr <- function(tv_viewing, age, views, sex) {
  corrs <- list()
  for (i in levels(tv_viewing[, sex])) {
    corrs[i] <- round(cor(tv_viewing[tv_viewing[, sex] == i, age],
                          tv_viewing[tv_viewing[, sex] == i, views],
                          method = "pearson"), digits = 2)
  }
  return()
}
Or, more generally,
partcorr <- function(data, x, y, cat) {
  corrs <- list()
  for (i in levels(data[, cat])) {
    corrs[i] <- round(cor(data[data[, cat] == i, x],
                          data[data[, cat] == i, y],
                          method = "pearson"), digits = 2)
  }
  return()
}
But this is not working. What am I doing wrong?
You can use the function by() and do not need a for loop. With your data and columns, it should work like this:
by(tv_viewing[, c("age", "views")], tv_viewing$sex, cor)
Or, as I do not have access to your data, here is a reproducible example using the iris data set:
by(iris[, 1:2], iris$Species, cor)
Maybe your code did not work because you did not quote your column names (tv_viewing[, sex] will not read your column sex, but tv_viewing[, "sex"] does).
EDIT: If you need the correlations in one vector, you can use this code:
c(by(iris[, 1:2], iris$Species, function(x) cor(x)[2, 1]))
partcorr <- function(df, col1, col2, cat,
                     digits = 2, method = "pearson", ...) {
  dfs <- split(df[, c(col1, col2)], df[, cat])
  sapply(dfs, function(sdf) round(cor(sdf[, col1], sdf[, col2],
                                      method = method, ...),
                                  digits = digits))
}
This creates, for each category, a sub-data.frame of the original data containing only the col1 and col2 columns, collected in a list of data frames (dfs). It then loops over this list and calculates the correlation between the col1 and col2 columns of each sub-data.frame, using the given method. The ... means you can pass any additional arguments for the cor function to partcorr; they are forwarded to the cor call. The number of digits used for rounding can also be changed.
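A hypothetical call on iris (the column and factor names here are just for illustration):
partcorr(iris, "Sepal.Length", "Sepal.Width", "Species")
# one rounded correlation per species, as a named vector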
I know this kind of question must exist somewhere, but I couldn't find it. I have the variables a, b, c and d, and I want to write a loop that regresses, then appends the next variable and regresses again:
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i) {
  x <- as.formula(paste("mpg",
                        paste(colnames(mtcars)[2:i], collapse = "+"),
                        sep = "~"))
  lm(formula = x, data = mtcars)
})
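You can then inspect the fits in one go, e.g.:
lapply(myFits, summary)                               # full summaries
sapply(myFits, function(fit) summary(fit)$r.squared)  # just the R^2 values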
Note: this looks like a duplicate post; I have seen a better solution for this type of question but cannot find it at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # an example list of vectors of indep. variables
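For example, a quick way (fitting the models directly) to pull the coefficients out of each formula:
lapply(formulae, function(f) coef(lm(f, data = dat)))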
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
  my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
  reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
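As the comment in the snippet suggests, vars can also be derived from the data instead of typed manually; a small sketch, assuming Y is the response column:
vars = setdiff(names(data), 'Y')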
I am printing a table to PDF with much success: a standard hierarchical regression with three steps. However, my question is twofold: 1) how do I add asterisks to mark significant p values on the covariates, and 2) how do I remove rows like AIC etc.?
At this point, I am just opening the PDF in Word to edit the table, but I thought someone might have a solution.
H_regression <- apa_print(list(Step1 = model1,
                               Step2 = model2,
                               Step3 = model3),
                          boot_samples = 0)
I'll use the example from the papaja documentation.
mod1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
mod2 <- update(mod1, formula = . ~ . + Petal.Length)
mod3 <- update(mod2, formula = . ~ . + Petal.Width)
moi <- list(Baseline = mod1, Length = mod2, Both = mod3)
h_reg <- apa_print(moi, boot_samples = 0)
h_reg_table <- h_reg$table
2) how do I remove rows like AIC
The table returned by apa_print() is a data.frame with some additional information. Hence, you can index and subset it as you would any other table. You can select rows by name (see below) or by row number.
# Remove rows
rows_to_remove <- c("$\\mathrm{AIC}$", "$\\mathrm{BIC}$")
h_reg_table <- h_reg_table[!rownames(h_reg_table) %in% rows_to_remove, ]
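Equivalently, you can drop rows by position; the positions below are hypothetical, so check the row names first:
rownames(h_reg_table)                     # inspect available rows
# h_reg_table <- h_reg_table[-c(7, 8), ]  # hypothetical row positions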
1) How do I add the asterisk to mark sig p values on the covariates
There is currently no way to highlight significant predictors (I'm not a fan of this practice). But here's some code that will let you add highlighting after the fact. The following function takes the formatted table, the list of the compared models and a character symbol to highlight significant predictors as input.
# Define custom function
highlight_sig_predictors <- function(x, models, symbol) {
  n_coefs <- sapply(models, function(y) length(coef(y)))
  for (i in seq_along(models)) {
    sig_stars <- rep(FALSE, max(n_coefs))
    sig_stars[1:n_coefs[i]] <- apply(confint(models[[i]]), 1,
                                     function(y) all(y > 0) || all(y < 0))
    x[1:max(n_coefs), i] <- paste0(x[1:max(n_coefs), i],
                                   ifelse(sig_stars, symbol, paste0("\\phantom{", symbol, "}")))
  }
  x
}
Now this function can be used to customize the table returned by apa_print().
# Add significance symbols to predictors
h_reg_table <- highlight_sig_predictors(h_reg_table, moi, symbol = "*")
# Print table
apa_table(h_reg_table, escape = FALSE, align = c("lrrr"))
I am trying to smooth out my data for each variable in the data frame. Let's say it looks like this:
data <- data.frame(v1 = c(0.5, 1.1, 2.9, 3.4, 4.1, 5.7, 6.3, 7.4, 6.9, 8.5, 9.1),
                   v2 = c(0.1, 0.8, 0.5, 1.1, 1.9, 2.4, 0.8, 3.4, 2.9, 3.1, 4.2),
                   v3 = c(1.3, 2.1, 0.8, 4.1, 5.9, 8.1, 4.3, 9.1, 9.2, 8.4, 7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis, yaxis) {
  # Specify my "y" and "x"
  x <- data$xaxis
  y <- data$yaxis
  f <- function(x, a, b, d) {(a*x^2) + (b*x) + d}
  order_two <- nls(y ~ f(x, a, b, d), start = c(a = 1, b = 1, d = 1))
  co2 <- coef(order_two)
  data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.
You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5, 1.1, 2.9, 3.4, 4.1, 5.7, 6.3, 7.4, 6.9, 8.5, 9.1),
                   v2 = c(0.1, 0.8, 0.5, 1.1, 1.9, 2.4, 0.8, 3.4, 2.9, 3.1, 4.2),
                   v3 = c(1.3, 2.1, 0.8, 4.1, 5.9, 8.1, 4.3, 9.1, 9.2, 8.4, 7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = ncol(data))
# create a loop running over the columns of data
for (i in 1:ncol(data)) {
  models[[i]] = lm(data[, i] ~ poly(x, 2, raw = TRUE))
}
You can also use lapply instead of the for-loop, as stated in the comments.
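A minimal lapply sketch of the same thing (x is picked up from the enclosing environment):
models <- lapply(data, function(y) lm(y ~ poly(x, 2, raw = TRUE)))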
Use predict() to get the values of the models:
smoothed_v1 = predict(models[[1]], newdata = data.frame(x = x))
Edit:
Regarding your comment, you can store the new values in data with:
for (i in length(models):1) {
  data <- cbind(predict(models[[i]], newdata = data.frame(x = x)), data)
  # set the name for the new column
  names(data)[1] = paste("pred_v", i, sep = "")
}
(The loop runs backwards so that, after all the cbind calls, pred_v1 ends up as the first column.)
I have many X and Y variables (something like 500 × 500). The following is just small example data:
yvars <- data.frame(Yv1 = rnorm(100, 5, 3),
                    Y2  = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = TRUE),
                    X2  = sample(c(1, 0, -1), 100, replace = TRUE),
                    Xv3 = sample(c(1, 0, -1), 100, replace = TRUE),
                    D   = sample(c(1, 0, -1), 100, replace = TRUE))
I want to extract the p-values and make a matrix like this:
Yv1 Y2 Yv3
Xv1
X2
Xv3
D
Here is my attempt to loop the process:
prob = NULL
anova.pmat <- function(x) {
  mydata <- data.frame(yvar = yvars[, x], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                          data = mydata))$`Pr(>F)`[1]
  }
}
sapply(yvars, anova.pmat)
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution?
Edit:
For the first Y variable:
prob <- NULL
mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
  prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                        data = mydata))$`Pr(>F)`[1]
}
prob
[1] 0.4995179 0.4067040 0.4181571 0.6291167
Edit again:
for (j in seq(length(yvars))) {
  prob <- NULL
  mydata <- data.frame(yvar = yvars[, j], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                          data = mydata))$`Pr(>F)`[1]
  }
}
This gives the same result as above!
Here is an approach that uses plyr to loop over the columns of a data frame (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value and arranging the results into a matrix. Adding the row/column names is just a finishing touch.
library("plyr")
probs <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y ~ x))$`Pr(>F)`[1]
  })
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)
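If you prefer to avoid plyr, a base-R sketch of the same double loop (rows are named after xvars, columns after yvars):
probs <- sapply(yvars, function(y) {
  sapply(xvars, function(x) anova(lm(y ~ x))$`Pr(>F)`[1])
})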
Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn, since we need every Y × X pair rather than pairs within a single set) and running a linear model in each case:
dfrm <- data.frame(y = gl(ncol(yvars), ncol(xvars), labels = names(yvars)),
                   x = gl(ncol(xvars), 1, labels = names(xvars)),
                   pval = NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[, 1:2], 1,
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, ncol = 3, dimnames = list(levels(dfrm$x), levels(dfrm$y)))
Sidenote: with high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is faster to compute the matrix of Pearson correlations for all pairwise comparisons and transform each r statistic into a Fisher-Snedecor F using the relation F = ν_a·r² / (1 − r²), where the degrees of freedom are ν_a = (n − 2) − #{i : x_i = NA or y_i = NA}, i.e. (n − 2) minus the number of pairwise missing values. (If there are no missing values, r² is just the usual R² of the simple regression.)
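A rough sketch of that shortcut in R (assuming no missing values, so ν = n − 2 for every pair):
r  <- cor(xvars, yvars)                        # 4 x 3 matrix of Pearson r
nu <- nrow(xvars) - 2                          # residual degrees of freedom
Fstat <- nu * r^2 / (1 - r^2)
pvals <- pf(Fstat, 1, nu, lower.tail = FALSE)  # matches the anova(lm(...)) p-values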