Apologies for what is probably a very basic question.
I have created a linear model for a massive meteorological dataset using multiple regression. My goal is to use that model to "predict" data during a certain period using predictors 1, 2 and 3. I will then compare those predicted data to the observed data for that period.
My approach thus far has been to create a new vector for the predicted values and loop through the vector, creating predicted values based on the extracted coefficients of the linear model. Then, I will simply subtract the predicted values from the observed values. For some reason, this approach results in the new predicted vector being NULL. Any idea how I could approach this?
A sample is below. "data" refers to the dataset containing the predictors.
coef <- coefficients(multipleRegressionModel)
predictedValues=c()
for(i in 1:length(data$timePeriod)){
predictedValues[i] = append(predictedValues, data$coef[1]+data$predictor1[i]*data$coef[2]+data$predictor2[i]*data$coef[3]+
data$predictor3[i]*data$coef[4])
}
diff=c()
diff=observedValues - predictedValues
It looks like you are making this more difficult than it needs to be. R has a predict() function that does all of this for you. If you had a sample data.frame like so:
set.seed(26)
mydf = data.frame (a=1:20 , b = rnorm(20),
c = 1:20 + runif(20,2,3)*runif(20, 2, 3),
d = 1:20 + rpois(20,5)*runif(1:20)*sin(1:20))
And you wanted to train on some rows, and test on the others
trainRows<-sample(1:20, 16)
mydf.train<-mydf[trainRows,]
mydf.test<-mydf[-trainRows,]
Then fit the model and predict
model<-lm(a~b+c+d, data = mydf.train)
summary(model) #gives info about your model.
mydf.test$pred<-predict(model, newdata = mydf.test)
MSE<-mean((mydf.test$pred-mydf.test$a)^2) #calculate mean squared error
MSE
#[1] 0.06321
View the predictions with mydf.test$pred
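As for why the loop in the question ends up with NULL: data$coef looks for a column named coef inside data, which does not exist, so it is NULL, the arithmetic yields a zero-length result, and nothing is ever stored. If you do want to build the predictions from the coefficients yourself, here is a minimal sketch (my own illustration, reusing the simulated data frame above; X_test and manual_pred are just illustrative names) that multiplies the design matrix by the coefficient vector and checks the result against predict():

```r
set.seed(26)
mydf <- data.frame(a = 1:20, b = rnorm(20),
                   c = 1:20 + runif(20, 2, 3) * runif(20, 2, 3),
                   d = 1:20 + rpois(20, 5) * runif(20) * sin(1:20))
trainRows <- sample(1:20, 16)
mydf.train <- mydf[trainRows, ]
mydf.test  <- mydf[-trainRows, ]

model <- lm(a ~ b + c + d, data = mydf.train)

# Design matrix for the test rows: an intercept column plus b, c and d
X_test <- model.matrix(~ b + c + d, data = mydf.test)

# Predictions are the design matrix times the coefficient vector
manual_pred <- drop(X_test %*% coef(model))

# Should agree with predict() up to floating-point error
all.equal(unname(manual_pred), unname(predict(model, newdata = mydf.test)))
```

In practice predict() is still the safer choice, since it also handles factors, interactions and transformed terms for you.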
Here is a simple example using a glm on the mtcars data.
Line<- #setting up the linear model function
function (train_dat, test_dat, variables, y_var, family = "gaussian")
{
fm <- as.formula(paste(y_var, "~", paste(variables, collapse = "+"))) #formula
glm1 <- glm(fm, data = train_dat, family = family) #run the model
predict(glm1, newdata = test_dat) #return the predictions for the test data
}
data(mtcars)
y_var<-'mpg'
x_vars<-setdiff(names(mtcars),y_var)
mtcars[,'linear_prediction']<-Line(mtcars,mtcars,x_vars,y_var)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb linear_prediction
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22.59951
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 22.11189
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.25064
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.23740
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.69343
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 20.38304
I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant=1 variable (you'll see why when we get to the coefficients)
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficient matrix (the Fisher's discriminant linear functions)
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in the test data by the respective coefficients to get three function scores per row. Here is how the values would be calculated:
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a sumproduct of the coefficients against each row, done three times, once for each function.
So the resulting df/matrix should have the same number of rows as the test data, with three function-score variables:
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
It's not ideal, but at the moment I'm taking the data out and doing this in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks
Are you just looking for the inner product?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
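The matrix product only gives the right scores if the columns of both objects are in the same order. If the two files might not line up, one option is to reorder the coefficient columns by name first. This is only a sketch under that assumption; coefs_shuffled and coefs_aligned are names I made up for the illustration:

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
                    mpg  = c(4.01, 3.49, 3.69),
                    disp = c(0.14, 0.15, 0.22),
                    hp   = c(0.13, 0.15, 0.20))
rownames(coefs) <- c("Function1", "Function2", "Function3")

# Deliberately shuffle the coefficient columns to mimic a mismatched file
coefs_shuffled <- coefs[, c("hp", "constant", "disp", "mpg")]

# Check both objects hold the same variables, then reorder by name
stopifnot(setequal(colnames(coefs_shuffled), colnames(testdata)))
coefs_aligned <- coefs_shuffled[, colnames(testdata)]

# One row per observation, one column per discriminant function
scores <- as.matrix(testdata) %*% t(as.matrix(coefs_aligned))
round(scores[1, ], 2)
```

The first row should match the hand calculation in the question.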
I am trying to get R to run the same code for every question in a dataset. It has 50 questions with yes (1) / no (0) answers and about 500 respondents, each identified as male (1) or female (0). Each respondent also has a "Score": the number of yes (1) answers they gave. I have produced the plot for one question in R, but I want to produce it for all 50 questions without changing the code every time and running it 50 times. The code that I am using is below. dataset is the Excel file that I made, with gender, Q001-Q052 points, and score as columns, and 500 rows of responses.
>LRmod01<-glm(dataset$'Q001points'~dataset$Score+dataset$Gender,data=dataset,family=binomial(link="logit"))
>summary(LRmod01)
>LRodds01<-cbind("Odds-Ratio"=exp(LRmod01$coefficients),exp(confint(LRmod01)))
>View(LRodds01)
>LR.pred.probs01<-predict(LRmod01,type="response")
>View(LR.pred.probs01)
>scatter.smooth(dataset$Score,logit(LR.pred.probs01))
>scatter.smooth(dataset$Score,(LR.pred.probs01),main="Logistic Regression for Question 001", xlab="Number of Questions Yes on Exam", ylab="Log Odds for Question 001",ylim=range(0,1,na.rm=TRUE))
I want to do this coding above but for all 50 questions. Right now it only runs for Q01 and I know that it only will because of the coding "dataset$'Q001points'" part. Should I use a loop for this and if so how?
Suppose we are using dataset mtcars:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Now we want to fit several linear models like
lm(mpg ~ cyl + disp, data=mtcars)
but with mpg replaced in turn by each of the other columns.
lst_model <- lapply(colnames(mtcars)[-3:-1],
function(x) lm(get(x) ~ cyl + disp, data=mtcars))
lst_model <- setNames(lst_model, colnames(mtcars)[-3:-1])
gives a list of models
$hp
Call:
lm(formula = get(x) ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
-32.4317 24.5145 0.1189
$drat
Call:
lm(formula = get(x) ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
4.607278 -0.095280 -0.001825
[...]
Every element of the list lst_model is named after the left hand side variable, i.e. you get the model for hp ~ cyl + disp by lst_model[["hp"]].
> lst_model[["hp"]]
Call:
lm(formula = get(x) ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
-32.4317 24.5145 0.1189
is the same as
> lm(hp ~ cyl + disp, data=mtcars)
Call:
lm(formula = hp ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
-32.4317 24.5145 0.1189
So for example if you want to get the fitted values for a model
model <- lm(hp ~ cyl + disp, data=mtcars)
you type model$fitted.values.
In the case of lst_model you use lst_model[["hp"]][["fitted.values"]] to get the same result. Because [[ indexes recursively when it is given a vector, you can also write lst_model[[c("hp", "fitted.values")]], which is the same.
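Applied to the questionnaire data in the question, the same lapply() idea might look like the sketch below. I don't have the real data, so the data frame is simulated and the column names (Q001points and so on, Score, Gender) are assumptions taken from the description; fit_one is a hypothetical helper. reformulate() builds each formula from strings, so the stored call shows the actual variable name instead of get(x):

```r
set.seed(1)
n   <- 200
n_q <- 10

# Simulated stand-in for the real survey data (hypothetical values)
answers <- matrix(rbinom(n * n_q, 1, 0.5), nrow = n)
colnames(answers) <- sprintf("Q%03dpoints", seq_len(n_q))
dataset <- data.frame(Gender = rbinom(n, 1, 0.5), answers)
dataset$Score <- rowSums(answers)

question_cols <- grep("^Q[0-9]+points$", names(dataset), value = TRUE)

fit_one <- function(q) {
  # reformulate() builds e.g. Q001points ~ Score + Gender from strings
  fm <- reformulate(c("Score", "Gender"), response = q)
  glm(fm, data = dataset, family = binomial(link = "logit"))
}

# One fitted model per question, named after the question column
models <- setNames(lapply(question_cols, fit_one), question_cols)

# Predicted probabilities: one row per respondent, one column per question
pred_probs <- sapply(models, predict, type = "response")
dim(pred_probs)
```

From there, a for loop over question_cols can call scatter.smooth(dataset$Score, pred_probs[, q]) with the title built by paste(), so the plotting code is written only once.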
I have a script that creates a column so that I know which rule should be applied to each row in a dataframe.
EndoSubset$FU_Group<-ifelse(EndoSubset$IMorNoIM=="No_IM","Rule1",
ifelse(EndoSubset$IMorNoIM=="IM","Rule2",
ifelse(EndoSubset$IMorNoIM=="AnotherIM","Rule3",
"NoRules")))
I want to make this into a function so that there can be any number of rules and any number of conditions for a column so it could be:
EndoSubset$FU_Group<-ifelse(EndoSubset$IMorNoIM=="No_IM","Rule1",
ifelse(EndoSubset$IMorNoIM=="IM","Rule2",
ifelse(EndoSubset$IMorNoIM=="AnotherIM","Rule3",
ifelse(EndoSubset$IMorNoIM=="SomeOtherIM","Rule4",
ifelse(EndoSubset$IMorNoIM=="LotsOfIM","Rule5",
"NoRules")))))
I understand that I can use the ellipsis for this, but I don't understand how to use it for both the condition strings ("No_IM", "IM", "AnotherIM", etc.) and the rule strings ("Rule1", "Rule2", "Rule3", etc.) at the same time.
This answer is based upon another, incomplete answer that has been deleted.
You can use case_when() from the dplyr package to achieve this. It takes an arbitrary number of conditions. Since you don't give a reproducible example, I show how this works with mtcars:
library(dplyr)
mtcars$cyl_group <- case_when(mtcars$cyl == 4 ~ "Rule1",
mtcars$cyl == 6 ~ "Rule2",
TRUE ~ "NoRules")
mtcars[2:5, ]
## mpg cyl disp hp drat wt qsec vs am gear carb cyl_group
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Rule2
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Rule1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Rule2
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NoRules
As you can see, you can easily connect a condition with a value using ~. Your two examples can probably be solved like this (I cannot check this, since you don't give your data):
EndoSubset$FU_Group <- case_when(EndoSubset$IMorNoIM == "No_IM" ~ "Rule1",
EndoSubset$IMorNoIM == "IM" ~ "Rule2",
EndoSubset$IMorNoIM == "AnotherIM" ~ "Rule3",
TRUE ~ "NoRules")
EndoSubset$FU_Group <- case_when(EndoSubset$IMorNoIM == "No_IM" ~ "Rule1",
EndoSubset$IMorNoIM == "IM" ~ "Rule2",
EndoSubset$IMorNoIM == "AnotherIM" ~ "Rule3",
EndoSubset$IMorNoIM == "SomeOtherIM" ~ "Rule4",
EndoSubset$IMorNoIM == "LotsOfIM" ~ "Rule5",
TRUE ~ "NoRules")
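If every condition is a simple equality test on the same column, as in your examples, a named lookup vector is another way to allow any number of rules without nesting anything. This is a base R sketch rather than part of dplyr; assign_rules is a hypothetical helper and the input values are made up because I don't have your data:

```r
# Hypothetical helper: map values to rule labels via a named lookup vector
assign_rules <- function(x, rules, default = "NoRules") {
  # Indexing by name returns NA for values that have no rule
  out <- unname(rules[as.character(x)])
  out[is.na(out)] <- default
  out
}

# Names are the values to match, elements are the rule labels
rules <- c(No_IM = "Rule1", IM = "Rule2", AnotherIM = "Rule3",
           SomeOtherIM = "Rule4", LotsOfIM = "Rule5")

# Made-up values standing in for EndoSubset$IMorNoIM
IMorNoIM <- c("IM", "No_IM", "LotsOfIM", "Unknown", NA)
assign_rules(IMorNoIM, rules)
```

Adding a rule then means adding one element to rules. Note that missing values get the default label here, which matches what the TRUE ~ "NoRules" branch of case_when() does.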
I know I can plot a dendrogram as follows
library(cluster)
d <- mtcars
d[,8:11] <- lapply(d[,8:11], as.factor)
gdist <- daisy(d, metric = c("gower"), stand = FALSE)
dendro <- hclust(gdist, method = "average")
plot(as.dendrogram(dendro))
However I have some groups identified (eg. by an iterative classification method), given as the last column in d
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
d$Group <- G
head(d)
mpg cyl disp hp drat wt qsec vs am gear carb Group
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 4
I am trying to plot all the dendrograms together on the same plot with the same scale. The groups with only a single member (groups 6 and 7) also need to be plotted.
I am able to plot individual dendrograms for subset of the data except when number of members in a group is only one. But I don't think this is the right approach.
layout(matrix(1:9, 3,3,byrow=TRUE))
gdist <- as.matrix(gdist)
N <- max(G)
for (i in 1:N){
rc_tokeep <- row.names(subset(d, G==i))
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend))
}
The loop gives this error for the last two groups (6 and 7), which have only a single member.
Error in hclust(dis, method = "average") :
must have n >= 2 objects to cluster
Essentially I want to reproduce this type of plot. The clusters with single members are also plotted here.
If you want to mimic the last few graphs, you can do something like this:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #The idea is to catch the groups with one single element to plot them differently
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,
xlim=c(.8,0),axes=FALSE) # giving the same xlim will scale all of them, here i used 0.8 to fit your data but you can change it to whatever
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5) #I don't know how you intend to compute the length of the branch in a group of 1 element, you might want to change that
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
}
}
With your example it gives:
If you want to add the scale you can add a grid in all graphs and a scale in the last one:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}
}
axis(1,at=seq(0,.8,.1)) #Here the axis
And finally if you want to even the spaces between the different branches in the resulting plot, you can use table(d$Group) to get the number of members of each group and use it as a height for each subplot:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1), heights=c(3,table(d$Group),3)) #Plus the height of the empty spaces.
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3)
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3)
}
}
axis(1,at=seq(0,.8,.1))
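The 0.8 in xlim was picked by eye for this data. If you would rather derive the shared limit from the data, one possibility is to take the largest merge height over all groups first. This is just a sketch; group_heights and shared_xlim are names I made up, and the setup repeats the question's data so that it runs on its own:

```r
library(cluster)

d <- mtcars
d[, 8:11] <- lapply(d[, 8:11], as.factor)
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
gdist <- as.matrix(daisy(d, metric = "gower"))

# Largest merge height within each group; 0 for single-member groups
group_heights <- sapply(sort(unique(G)), function(i) {
  members <- rownames(d)[G == i]
  if (length(members) < 2) return(0)
  max(hclust(as.dist(gdist[members, members]), method = "average")$height)
})

# Shared, reversed x-axis limit that fits every group's dendrogram
shared_xlim <- c(max(group_heights), 0)
shared_xlim
```

You can then pass xlim=shared_xlim to every plot() call in the loop instead of c(.8,0).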
When running a regression analysis in R (using glm) cases are removed due to 'missingness' of the data. Is there any way to flag which cases have been removed? I would ideally like to remove these from my original dataframe.
Many thanks
The model fit object returned by glm() records the row numbers of the data that it excludes for their incompleteness. They are a bit buried but you can retrieve them like this:
## Example data.frame with some missing data
df <- mtcars[1:6, 1:5]
df[cbind(1:5,1:5)] <- NA
df
# mpg cyl disp hp drat
# Mazda RX4 NA 6 160 110 3.90
# Mazda RX4 Wag 21.0 NA 160 110 3.90
# Datsun 710 22.8 4 NA 93 3.85
# Hornet 4 Drive 21.4 6 258 NA 3.08
# Hornet Sportabout 18.7 8 360 175 NA
# Valiant 18.1 6 225 105 2.76
## Fit an example model, and learn which rows it excluded
f <- glm(mpg~drat,weight=disp, data=df)
as.numeric(na.action(f))
# [1] 1 3 5
Alternatively, to get the row indices without having to fit the model, use the same strategy with the output of model.frame():
as.numeric(na.action(model.frame(mpg~drat,weight=disp, data=df)))
# [1] 1 3 5
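To go from those indices to a flag, or to a copy of the data without the dropped rows, something like the following should work. It is a sketch on the same example data; excluded and df_used are names I chose. Using %in% rather than negative indexing keeps it safe when no rows were dropped, because na.action() returns NULL in that case:

```r
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

f <- glm(mpg ~ drat, weights = disp, data = df)

# TRUE for rows the model dropped; all FALSE if na.action(f) is NULL
excluded <- seq_len(nrow(df)) %in% as.integer(na.action(f))

# Keep only the rows that were actually used in the fit
df_used <- df[!excluded, ]
rownames(df_used)
```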
Without a reproducible example I can't provide code tailored to your problem, but here's a generic method that should work. Assume your data frame is called df and your variables are called y, x1, x2, etc. And assume you want y, x1, x3, and x6 in your model.
# Make a vector of the variables that you want to include in your glm model
# (Be sure to include any weighting or subsetting variables as well, per Josh's comment)
glm.vars = c("y","x1","x3","x6")
# Create a new data frame that includes only those rows with no missing values
# for the variables that are in your model
df.glm = df[complete.cases(df[ , glm.vars]), ]
Also, if you want to see just the rows that have at least one missing value, do the following (note the addition of ! (the "not" operator)):
df[!complete.cases(df[ , glm.vars]), ]
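If you would rather not type the variable names by hand, all.vars() can pull them out of the model formula. Here is a sketch using a small mtcars subset with some values blanked out as stand-in data; model_formula is an illustrative name, and the weighting variable (disp here, purely as an example) still has to be added manually because it is not part of the formula:

```r
# Stand-in data: six rows of mtcars with one NA in each of the first five rows
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

model_formula <- mpg ~ drat

# all.vars() extracts the variable names used in the formula;
# the weighting variable is appended by hand
glm.vars <- c(all.vars(model_formula), "disp")

# Rows with no missing values in any of the model variables
df.glm <- df[complete.cases(df[ , glm.vars]), ]
rownames(df.glm)
```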