I have a formula which I would like to use to create a model matrix, but for my use I need to stop the user from adding an intercept as this will be taken care of at a later stage in the regression. How can I remove the intercept from the formula and is there a better option than update?
You can do this a few ways. The first option specified below is probably the best way of going about this.
# Create dataset and form for example
dta <- data.frame(y = rnorm(3), x = rnorm(3), z = rnorm(3))
form <- y ~ x + z
# Default behaviour: the intercept column is included
(X <- model.matrix(form, dta))
# Option 1 (my default option)
tf <- terms(form)
attr(tf, "intercept") <- 0
model.matrix(tf, dta)
# Option 2
X[, !colnames(X) %in% "(Intercept)"]
# Option 3
form2 <- update(form, . ~ . - 1)
model.matrix(form2, dta)
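The three options agree for purely numeric predictors, but they can differ once a factor is involved. The sketch below is only an illustration: the factor `g` and the object names `X1`, `X2`, `X3` are not from the original example. Removing the intercept through the terms object or through `update()` lets `model.matrix()` recode the first factor with one indicator per level, while dropping the `(Intercept)` column afterwards keeps the contrast coding.

```r
# Illustrative data with one numeric predictor and one factor (assumed, not from the question)
set.seed(42)
dta <- data.frame(y = rnorm(6), x = rnorm(6),
                  g = factor(rep(c("a", "b", "c"), 2)))
form <- y ~ x + g

# Option 1: switch off the intercept in the terms object
tf <- terms(form)
attr(tf, "intercept") <- 0
X1 <- model.matrix(tf, dta)

# Option 2: build the full matrix, then drop the intercept column
X_full <- model.matrix(form, dta)
X2 <- X_full[, colnames(X_full) != "(Intercept)", drop = FALSE]

# Option 3: remove the intercept with update()
X3 <- model.matrix(update(form, . ~ . - 1), dta)

colnames(X1)  # options 1 and 3: every level of g gets its own column
colnames(X2)  # option 2: the reference level of g is still absorbed
```

Which behaviour is wanted depends on how the intercept is handled later, so it is worth checking the column names before passing the matrix on.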
For some reason, my model is not running. I created a model matrix to run a simple model with the neuralnet package. I know it can be challenging to debug other people's code, especially without the data, but in case you think you can help, here is the code:
library(tidyverse)
library(neuralnet)
#Activity 1 Load Data
featchannels <-read.csv("features_channel.csv")
trainTargets <-read.table("traintargets.txt")
# Activity 2: Normalize every column of the features dataset using min-max
# normalization to range [0-1].
normalized <- function(x) {
return((x-min(x)) /(max(x) -min(x)))
}
featchannels <- normalized(featchannels)
# Activity 3: Add a target feature named response to the features dataset
# with 0-1 values read from trainTargets.txt, with 1 indicating P300
# response and 0 otherwise.
colnames(trainTargets)[1] <- "State"
featchannels <- cbind(featchannels, trainTargets)
# Changing rows to P300 and others.
featchannels <- within(featchannels, State <- factor(State, labels =
c("Other", "P300")))
featchannels$State <- as.factor(featchannels$State)
# Activity 4: Take the first 3840 rows of the dataset as the training data set, and
# the remaining 960 rows as the testing data set.
training <- featchannels[1:3840,]
testing <- featchannels[3841:4800,]
# Activity 6
# Create the model matrix before running the model
df_comb_training <- training
y <- model.matrix(~ df_comb_training$State + 0, data = df_comb_training[,
c('State'), drop=FALSE])
# fix up names for as.formula
y_feats <- gsub("^[^ ]+\\$", "", colnames(y))
colnames(y) <- y_feats
df_comb_training <- df_comb_training[, !(colnames(df_comb_training) ==
"State")]
feats <- colnames(df_comb_training)
df_comb_training <- cbind(y, df_comb_training)
# Concatenate strings
f <- paste(feats, collapse=' + ')
y_f <- paste(y_feats, collapse=' + ')
f <- paste(y_f, '~', f)
# Convert to formula
f <- as.formula(f)
model_h5 <- neuralnet(f, df_comb_training, stepmax = 1e+08, hidden = 5)
I am trying to smooth out my data for each variable in the data frame. Let's say it looks like this:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis,yaxis){
# Specify my "y" and "x"
x <- data$xaxis
y <- data$yaxis
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.
You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = (ncol(data)))
# create a loop running over the columns of data
for (i in 1:(ncol(data))){
models[[i]] = lm(data[,i] ~ poly(x,2, raw = TRUE))}
You can also use lapply instead of the for-loop, as stated in the comments.
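A minimal sketch of the `lapply` variant, assuming the same example data as above; the names `preds` and the `pred_` prefix are only illustrative. Because a data frame is a list of columns, `lapply` fits one quadratic model per column and returns a named list of models.

```r
data <- data.frame(v1 = c(0.5, 1.1, 2.9, 3.4, 4.1, 5.7, 6.3, 7.4, 6.9, 8.5, 9.1),
                   v2 = c(0.1, 0.8, 0.5, 1.1, 1.9, 2.4, 0.8, 3.4, 2.9, 3.1, 4.2),
                   v3 = c(1.3, 2.1, 0.8, 4.1, 5.9, 8.1, 4.3, 9.1, 9.2, 8.4, 7.4))
x <- seq_len(nrow(data))

# One quadratic fit per column; the list keeps the column names (v1, v2, v3)
models <- lapply(data, function(v) lm(v ~ poly(x, 2, raw = TRUE)))

# Fitted values for every column, collected into a data frame
preds <- as.data.frame(lapply(models, fitted))
names(preds) <- paste0("pred_", names(models))

# Append the smoothed columns to the original data
data <- cbind(data, preds)
```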
Use predict() to get the values of the models:
smoothed_v1 = predict(models[[1]], newdata=data.frame(x = x))
Edit:
Regarding your comment - you can store the new values in data with:
for (i in (length(models):1)){
data <- cbind(predict(models[[i]], newdata=data.frame(x = x)), data)
# set the name for the new column
names(data)[1] = paste("pred_v",i, sep ="")}
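If you would rather keep the `nls()` fit from the question, the main obstacle is that `data$xaxis` looks for a column literally called `xaxis`. A possible sketch, under the assumption that column names are passed as strings and looked up with `[[`; the helper `fit_quadratic` and the parameter name `d0` are hypothetical names chosen for this example.

```r
data <- data.frame(v1 = c(0.5, 1.1, 2.9, 3.4, 4.1, 5.7, 6.3, 7.4, 6.9, 8.5, 9.1),
                   v2 = c(0.1, 0.8, 0.5, 1.1, 1.9, 2.4, 0.8, 3.4, 2.9, 3.1, 4.2),
                   v3 = c(1.3, 2.1, 0.8, 4.1, 5.9, 8.1, 4.3, 9.1, 9.2, 8.4, 7.4))
data$x <- seq_len(nrow(data))

# Hypothetical helper: fit a * x^2 + b * x + d0 to one column, return fitted values
fit_quadratic <- function(df, yname, xname = "x") {
  d <- data.frame(x = df[[xname]], y = df[[yname]])  # [[ ]] accepts a name stored in a variable
  fit <- nls(y ~ a * x^2 + b * x + d0, data = d,
             start = c(a = 1, b = 1, d0 = 1))
  as.numeric(predict(fit))
}

# Loop over the variable names instead of calling the function by hand
vars <- c("v1", "v2", "v3")
for (v in vars) {
  data[[paste0("order_two_predicted_", v)]] <- fit_quadratic(data, v)
}
```

The function returns its result instead of assigning into `data` from the inside, since an assignment inside a function only changes a local copy.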
I managed to apply a linear regression for each subject of my data frame and paste the values into a new dataframe using a for-loop. However, I think there should be a more readable way of achieving my result using an apply function, but all my attempts fail. This is how I do it:
numberOfFiles <- length(resultsHick$subject)
intslop <- data.frame(matrix(0,numberOfFiles,4))
intslop <- rename(intslop,
subject = X1,
intercept = X2,
slope = X3,
Rsquare = X4)
cond <- c(0:3)
allSubjects <- resultsHick$subject
for (i in allSubjects)
{intslop[i,1] <- i
yvalues <- t(subset(resultsHick,
subject == i,
select = c(H0meanRT, H1meanRT, H2meanRT, H258meanRT)))
fit <- lm(yvalues ~ cond)
intercept <- fit$coefficients[1]
slope <- fit$coefficients[2]
rsquared <- summary(fit)$r.squared
intslop[i,2] <- intercept
intslop[i,3] <- slope
intslop[i,4] <- rsquared
}
The result should look the same as
> head(intslop)
subject intercept slope Rsquare
1 1 221.3555 54.98290 0.9871209
2 2 259.4947 66.33344 0.9781499
3 3 227.8693 47.28699 0.9537868
4 4 257.7355 80.71935 0.9729132
5 5 197.4659 49.57882 0.9730409
6 6 339.1649 61.63161 0.8213179
...
Does anybody know a more readable way of writing this code using an apply function?
One common pattern I use to replace for loops that aggregate data.frames is:
do.call(
rbind,
lapply(1:numberOfDataFrames,
FUN = function(i) {
print(paste("Processing index:", i)) # helpful to see how slow/fast
temp_df <- do_some_work[i] # placeholder: build the data.frame for index i
temp_df$intercept <- 1 # placeholder: fill the remaining columns the same way
return(temp_df) # key is to return a data.frame for each index.
}
)
)
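Applied to the per-subject regression, the pattern might look like the sketch below. The `resultsHick` values here are simulated stand-ins (the real data are not shown in the question); only the column names are taken from the original code.

```r
# Simulated stand-in for resultsHick (assumption: one row per subject)
set.seed(1)
resultsHick <- data.frame(subject    = 1:5,
                          H0meanRT   = rnorm(5, 220, 20),
                          H1meanRT   = rnorm(5, 280, 20),
                          H2meanRT   = rnorm(5, 340, 20),
                          H258meanRT = rnorm(5, 400, 20))

cond <- 0:3
rt_cols <- c("H0meanRT", "H1meanRT", "H2meanRT", "H258meanRT")

# One single-row data.frame per subject, stacked with rbind
intslop <- do.call(rbind, lapply(seq_len(nrow(resultsHick)), function(i) {
  yvalues <- unlist(resultsHick[i, rt_cols])   # the four mean RTs of this subject
  fit <- lm(yvalues ~ cond)
  data.frame(subject   = resultsHick$subject[i],
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]),
             Rsquare   = summary(fit)$r.squared)
}))

head(intslop)
```

Indexing by row position rather than by subject ID also avoids gaps in `intslop` when the IDs are not consecutive.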
In a regression model is it possible to include an interaction with only one dummy variable of a factor? For example, suppose I have:
x: numerical vector taking 3 values (1, 2 and 3)
y: response variable
z: numerical vector
Is it possible to build a model like:
y ~ factor(x) + factor(x) : z
but only include the interaction with one level of X? I realize that I could create a separate dummy variable for each level of x, but I would like to simplify things if possible.
Really appreciate any input!!
One key point you're missing is that when you see a significant effect for something like x2:z, that doesn't mean that x interacts with z when x == 2, it means that the difference between x == 2 and x == 1 (or whatever your reference level is) interacts with z. It's not a level of x that is interacting with z, it's one of the contrasts that has been set for x.
So for a 3 level factor with default treatment contrasts:
df <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
df$x <- factor(df$x)
contrasts(df$x)
2 3
1 0 0
2 1 0
3 0 1
if you really think that only the first contrast is important, you can create a new variable that compares x == 2 to x == 1, and ignores x == 3:
df$x_1vs2 <- NA
df$x_1vs2[df$x == 1] <- 0
df$x_1vs2[df$x == 2] <- 1
df$x_1vs2[df$x == 3] <- NA
And then run your regression using that:
lm(y ~ x_1vs2 + x_1vs2:z, data = df)
X <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
lm(y ~ factor(x) + factor(x):z, data=X)
Is it what you want?
Something like this may be what you need:
y~factor(x)+factor(x=='SomeLevel'):z
If x is already coded as a factor in your data, something like
y ~ x + I(x=='some_level'):z
Or if x is of numeric type in your data frame, then
y ~ as.factor(x) + I(as.factor(x)=='some_level'):z
Or to only model some subset of the data (within a single level x is constant, so it drops out of the formula) try:
lm(y ~ z, data = subset(df, x=='some_level'))
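One detail worth checking with the indicator approach: a logical or factor indicator is itself treated as a two-level factor, so without a main effect for z the model may end up with a separate z slope for both TRUE and FALSE. A sketch that forces a single interaction coefficient by using a numeric 0/1 indicator instead; the data are simulated, and the column name `is2` and the choice of level "2" are assumptions made for this example.

```r
# Simulated data: z only affects y when x == 2
set.seed(1)
df <- data.frame(x = factor(sample(1:3, 60, replace = TRUE)), z = rnorm(60))
df$y <- rnorm(60) + ifelse(df$x == "2", 1.5 * df$z, 0)

# Numeric 0/1 indicator for the level of interest
df$is2 <- as.numeric(df$x == "2")

# Main effect of x, plus a z slope that is only active when x == 2
fit <- lm(y ~ x + is2:z, data = df)
names(coef(fit))
```

Here the product `is2:z` is an ordinary numeric column that equals z for rows with x == 2 and zero elsewhere, which constrains the z slope to zero for the other levels.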
I want to write a condition that checks whether a certain variable is present in a linear model.
Example: if there is a B in the linear model
model <- lm(Y ~ A + B + C)
I want to do something. I have used the summary function before to refer to R-squared.
summary(model)$r.squared
Probably I am looking for something like this
if (B %in% summary(model)$xxx)
or
if (B %in% summary(model)[xxx])
But I can't find xxx. Please help =)
Try this:
if ("B" %in% all.vars(formula(model))) ...
Another way:
if ("B" %in% names(coef(model)))
Yet another way:
if ("B" %in% variable.names(model)) ...
One option is to grab the model terms from the fitted model and interrogate the term.labels attribute. Using some dummy data:
set.seed(1)
DF <- data.frame(Y = rnorm(100), A = rnorm(100), B = rnorm(100), C = rnorm(100))
model <- lm(Y ~ A + B + C, data = DF)
The terms object contains the labels in an attribute:
> attr(terms(model), "term.labels")
[1] "A" "B" "C"
So check if "B" is in that set of labels:
> if("B" %in% attr(terms(model), "term.labels")) {
+ summary(model)$r.squared
+ }
[1] 0.003134009
A (somewhat inelegant) possible solution would be:
length(grep("\\bB\\b",formula(model))) > 0
where \\b matches the word boundary and B is the variable name you're looking for.
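The approaches above do not always give the same answer, so here is a small comparison on simulated data. The helper `has_term` is a hypothetical name, and B is made a factor here on purpose to show where the checks diverge.

```r
set.seed(1)
DF <- data.frame(Y = rnorm(50), A = rnorm(50),
                 B = factor(sample(c("u", "v"), 50, replace = TRUE)),
                 C = rnorm(50))
model <- lm(Y ~ A + B + C, data = DF)

# Hypothetical helper: is `var` one of the model's term labels?
has_term <- function(model, var) {
  var %in% attr(terms(model), "term.labels")
}

has_term(model, "B")                 # checks the terms of the model
"B" %in% names(coef(model))          # coefficient names carry the factor level, e.g. "Bv"
"Y" %in% all.vars(formula(model))    # all.vars() also lists the response
```

For numeric predictors the checks coincide; with factors or when the response name matters, the term labels are probably the most direct thing to test against.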