I'm very new to multilevel modeling and to analyzing repeated-measures data. I am trying to figure out whether my model, fit with the lme4 package, is set up correctly to answer my question: does ius moderate the relationship between na and worry?
Variables
id - subject id
count - time variable; day of collection
worry - outcome (collected daily, continuous variable)
na - predictor (collected daily, continuous variable)
ius - moderator (collected at baseline, continuous variable)
I also created lag variables for na (lag_na) and worry (lag_worry) so I can control for the previous day's na and worry, though I'm not sure whether this was the right thing to do.
Here is my code:
library(lme4)
# Here's an example dataset:
Dataset <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), levels = c("1", "2", "3",
"5"), class = "factor"), count = c(1, 2, 3, 1, 2, 3, 4, 1, 2,
3, 4, 1, 2, 3, 4, 5, 6), na = c(1, 0, 0, 18, 13, 3, 5, 29, 15,
19, 21, 3, 5, 2, 2, 18, 19), worry = c(0, 1, 0, 0, 0, 0, 0, 2,
2, 1, 2, 0, 0, 3, 0, 4, 3), ius = c(35, 35, 35, 65, 65, 65, 65,
44, 44, 44, 44, 53, 53, 53, 53, 53, 53), lag_na = c(NA, 1, 0,
NA, 18, 13, 3, NA, 29, 15, 19, NA, 3, 5, 2, 2, 18), lag_worry = c(NA,
0, 1, NA, 0, 0, 0, NA, 2, 2, 1, NA, 0, 0, 3, 0, 4)), row.names = c(NA,
-17L), groups = structure(list(id = structure(1:4, levels = c("1",
"2", "3", "5"), class = "factor"), .rows = structure(list(1:3,
4:7, 8:11, 12:17), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
model <- lmer(worry ~ na*ius + lag_na + lag_worry + count + (1 | id), REML=FALSE, data = Dataset)
For a variable to be an "effect moderator" (at least as the term is used in epidemiologic discussion), there would need to be a material change in the predictions between models with and without the interaction term. You have a model with an interaction between ius and na:
> model <- lmer(worry ~ na*ius + lag_na + lag_worry + count + (1 | id), REML=FALSE, data = Dataset)
> model
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: worry ~ na * ius + lag_na + lag_worry + count + (1 | id)
Data: Dataset
AIC BIC logLik deviance df.resid
49.0113 54.0958 -15.5056 31.0113 4
Random effects:
Groups Name Std.Dev.
id (Intercept) 0.7525
Residual 0.6096
Number of obs: 13, groups: id, 4
Fixed Effects:
(Intercept) na ius lag_na lag_worry count na:ius
2.185346 -0.169745 -0.092891 0.128599 -0.871304 1.004783 0.003021
# Now remove the interaction term
> model <- lmer(worry ~ na + ius + lag_na + lag_worry + count + (1 | id), REML=FALSE, data = Dataset)
> model
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: worry ~ na + ius + lag_na + lag_worry + count + (1 | id)
Data: Dataset
AIC BIC logLik deviance df.resid
47.4562 51.9758 -15.7281 31.4562 5
Random effects:
Groups Name Std.Dev.
id (Intercept) 0.7212
Residual 0.6325
Number of obs: 13, groups: id, 4
Fixed Effects:
(Intercept) na ius lag_na lag_worry count
1.474439 -0.006056 -0.076298 0.122280 -0.840278 0.951945
From what I can see there is almost no change in measures of global fit (AIC, BIC or deviance). Do you want to proceed further in determining what the differences in predictions are with such a small dataset? There would be a difference in the predictions between these two models, but there seems to be little evidence that they would be considered "material". The method of examining what the data shows versus the respective models is described in this post to the stats.SE forum: https://stats.stackexchange.com/questions/33059/testing-for-moderation-with-continuous-vs-categorical-moderators/33090#33090
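As a rough check on the deviance comparison above, the two maximum-likelihood fits can be compared with a likelihood-ratio statistic built from the printed deviances. This is only a sketch: it copies the deviance values from the output above and assumes the usual chi-square approximation with one degree of freedom (the single na:ius term), which is shaky with 13 observations in 4 groups.

```r
# Deviances copied from the two printed ML fits above.
dev_with_interaction    <- 31.0113
dev_without_interaction <- 31.4562

# Likelihood-ratio statistic: drop in deviance from adding na:ius.
lr_stat <- dev_without_interaction - dev_with_interaction

# One extra parameter (na:ius), so compare against chi-square with 1 df.
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

lr_stat   # 0.4449
p_value   # roughly 0.5
```

If the two fits are kept under different names (both are called model above, so the names would need to be changed), calling anova() on the pair of lmer fits should report the same kind of comparison.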
Plot worry on the y-axis against na on the x-axis as a scatterplot. Then, for the non-interaction model, plot the single line at the mean of ius. You're going to find some difficulty in doing this sensibly because the values of ius are not at all normally distributed. I discovered this when I went to color the points in a scatterplot:
findInterval(Dataset$ius, c(30,45, 52, 66))
[1] 1 1 1 3 3 3 3 1 1 1 1 3 3 3 3 3 3
> table(Dataset$ius)
35 44 53 65
3 4 6 4
When you plot the points with the four groups, you find that the ranges of the outcome and the predictor within groups of identical ius measures are much smaller than the full dataset ranges. It really makes little sense to use an interaction model with continuous variables in this setting:
png(); plot(worry~jitter(na,3), Dataset, col=2+findInterval(Dataset$ius, c(30,36, 52, 56, 66))); dev.off()
So I see two compelling reasons not to use this analysis as evidence for effect moderation. Whether you want to build a categorical prediction model might be determined by how much more data could be gathered. This seems to be a pretty sparse dataset for any conclusions, but there is a suggestion of some sort of grouping effect.
I want to calculate the fold change between the thyroid and testes dataframes using TPM values and provide the top 10 genes overexpressed in testes tissue (testes$gene_id in the testes dataframe).
In my code below, I first calculated the fold change and stored it as a numeric vector tpm.foldchange, but then I don't know how to sort the gene_id column of the testes dataframe based on the sorted fold-change values in tpm.foldchange.
# Parse the gene results file from the testes and thyroid output
thyroid <- read.table("thyroid.genes.results", header=T, sep="\t")
testes <- read.table("testes.genes.results", header=T, sep="\t")
# Extract the TPM values
# Add one to each value and log them (base 2)
library(tidyverse)
thyroid.tpm <- log2(thyroid %>% pull(TPM) + 1)
testes.tpm <- log2(testes %>% pull(TPM) + 1)
# Pearson's correlation coefficient between thyroid and testes using TPM
cor(thyroid.tpm, testes.tpm, method="pearson")
# Calculate fold change between the testes and thyroid tissue TPM values and provide top 10 genes that are overexpressed in testes
library(gtools)
tpm.foldchange <- foldchange(testes.tpm, thyroid.tpm)
#tpm.df <- merge(testes.tpm, tpm.foldchange)
tpm.sorted <- sort(tpm.foldchange, decreasing=T)
tpm.sortedgenes <- testes[order(factor(testes$TPM, levels=tpm.sorted)),]
tpm.top10genes <- head(tpm.sortedgenes, 10)
testes[order(factor(testes$TPM, levels=tpm.sorted)),]
I initially wanted to sort after merging like this:
tpm.df <- merge(testes.tpm, tpm.foldchange)
tpm.sorted <- sort(tpm.df$tpm.foldchange, decreasing=T)
but it raised an error:
Error: cannot allocate vector of size 8.0 Gb
thyroid dataframe:
# Show only the first 20 rows, first column, and 6th column of thyroid dataframe
dput(thyroid[1:20, c(1,6)])
structure(list(gene_id = c("gene0_DDX11L1", "gene1_WASH7P", "gene100_C1orf233",
"gene1000_ZC3H12A", "gene10000_CD86", "gene10001_CASR", "gene10003_CSTA",
"gene10004_CCDC58", "gene10005_FAM162A", "gene10006_WDR5B", "gene10007_LOC102723582",
"gene10008_KPNA1", "gene1001_MIR6732", "gene10010_PARP9", "gene10011_DTX3L",
"gene10012_PARP15", "gene10015_PARP14", "gene10016_HSPBAP1",
"gene10017_DIRC2", "gene10018_LOC100129550"), TPM = c(0, 45.96,
2.72, 2.4, 1.67, 5.14, 4.33, 47.68, 81.1, 10.12, 0.96, 45.21,
0, 19.63, 15.06, 0.49, 21.76, 12.16, 19.37, 5.3)), row.names = c(NA,
20L), class = "data.frame")
testes dataframe:
# Show only the first 20 rows, first column, and 6th column of testes dataframe
dput(testes[1:20, c(1,6)])
structure(list(gene_id = c("gene0_DDX11L1", "gene1_WASH7P", "gene100_C1orf233",
"gene1000_ZC3H12A", "gene10000_CD86", "gene10001_CASR", "gene10003_CSTA",
"gene10004_CCDC58", "gene10005_FAM162A", "gene10006_WDR5B", "gene10007_LOC102723582",
"gene10008_KPNA1", "gene1001_MIR6732", "gene10010_PARP9", "gene10011_DTX3L",
"gene10012_PARP15", "gene10015_PARP14", "gene10016_HSPBAP1",
"gene10017_DIRC2", "gene10018_LOC100129550"), TPM = c(2.33, 47.56,
9.45, 2.03, 3.09, 0.11, 3.73, 28.52, 120.65, 6.89, 1.38, 30.89,
0, 20.39, 13.66, 0.59, 9.62, 22.04, 7.42, 2.53)), row.names = c(NA,
20L), class = "data.frame")
Based on Akrun's comment, I've attempted:
library(gtools)
tpm.foldchange <- foldchange(thyroid.tpm, testes.tpm)
testes.sorted <- testes %>%
left_join(thyroid, by="gene_id") %>%
mutate(TPM=testes.tpm, tpm.foldchange, .keep="unused") %>%
slice_max(n=10, order_by=tpm.foldchange)
Output:
> dim(testes.sorted)
[1] 304 15
> dput(testes.sorted[1:10,])
structure(list(gene_id = c("gene10075_LOC101927056", "gene10311_A4GNT",
"gene10394_SLC9A9-AS1", "gene10504_SUCNR1", "gene10511_TMEM14E",
"gene10798_LOC102724550", "gene10990_FLJ42393", "gene11054_DPPA2P3",
"gene11065_GP5", "gene11400_USP17L12"), transcript_id.s..x = c("rna28860_NR_125396.1,rna28861_NR_125395.1",
"rna29540_NM_016161.2", "rna29785_NR_048544.1", "rna30020_NM_033050.4",
"rna30060_NM_001123228.1", "rna30716_NR_110826.1", "rna31241_NR_024413.1",
"rna31390_NR_027764.1", "rna31430_NM_004488.2", "rna32519_NM_001256853.1"
), length.x = c(659, 1771, 518, 1650, 1293, 2957, 2266, 1146,
3493, 1593), effective_length.x = c(413.57, 1525.5, 272.62, 1404.5,
1047.5, 2711.5, 2020.5, 900.5, 3247.5, 1347.5), expected_count.x = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0.12), TPM.x = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), FPKM.x = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), transcript_id.s..y = c("rna28860_NR_125396.1,rna28861_NR_125395.1",
"rna29540_NM_016161.2", "rna29785_NR_048544.1", "rna30020_NM_033050.4",
"rna30060_NM_001123228.1", "rna30716_NR_110826.1", "rna31241_NR_024413.1",
"rna31390_NR_027764.1", "rna31430_NM_004488.2", "rna32519_NM_001256853.1"
), length.y = c(796, 1771, 518, 1650, 1293, 2957, 2266, 1146,
3493, 1593), effective_length.y = c(535.05, 1510.04, 257.15,
1389.04, 1032.04, 2696.04, 2005.04, 885.04, 3232.04, 1332.04),
expected_count.y = c(9, 3, 2, 233, 2, 2, 36, 2, 35, 1.91),
TPM.y = c(0.58, 0.07, 0.27, 5.8, 0.07, 0.03, 0.62, 0.08,
0.37, 0.05), FPKM.y = c(0.29, 0.03, 0.14, 2.94, 0.03, 0.01,
0.31, 0.04, 0.19, 0.03), TPM = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), tpm.foldchange = c(Inf, Inf, Inf, Inf, Inf, Inf, Inf,
Inf, Inf, Inf)), row.names = c(NA, 10L), class = "data.frame")
This code returns a dataframe with (304, 15) dimensions. But I'm only looking for the top ten genes. Also, please note that thyroid.tpm is the log2-transformed TPM values.
If we want to order by the foldchange, do a join first, and arrange based on the foldchange between the 'TPM' columns:
library(dplyr)
library(gtools)
testes2 <- testes %>%
left_join(thyroid, by = 'gene_id') %>%
mutate(across(starts_with("TPM"), ~ log(.x + 1),
.names = "tpm_{.col}")) %>%
mutate(foldchange = foldchange(tpm_TPM.x, tpm_TPM.y)) %>%
filter(is.finite(foldchange)) %>%
arrange(desc(foldchange)) %>%
dplyr::select(gene_id, TPM = TPM.x, foldchange) %>%
slice_head(n = 10)
If we want to select the top 10 foldchange rows, use slice_max:
testes %>%
left_join(thyroid, by = 'gene_id') %>%
mutate(TPM = TPM.x, foldchange = foldchange(log(TPM.x + 1), log(TPM.y + 1)),
.keep = "unused") %>%
filter(is.finite(foldchange)) %>%
slice_max(n = 10, order_by = foldchange, with_ties = FALSE)
-output
gene_id TPM foldchange
1 gene100_C1orf233 9.45 1.786222
2 gene10000_CD86 3.09 1.434249
3 gene10007_LOC102723582 1.38 1.288517
4 gene10016_HSPBAP1 22.04 1.217311
5 gene10012_PARP15 0.59 1.162893
6 gene10005_FAM162A 120.65 1.089205
7 gene10010_PARP9 20.39 1.011953
8 gene1_WASH7P 47.56 1.008704
9 gene10011_DTX3L 13.66 -1.033968
10 gene10003_CSTA 3.73 -1.076854
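For comparison, here is a base R sketch of the same ranking idea. Note that it swaps in a different measure: instead of gtools::foldchange() applied to logged values, it uses the difference of log2(TPM + 1) values, which is a common way to express a log2 fold change and avoids the infinite ratios that had to be filtered out above. The vectors are the first five rows of the sample data from the question, and the variable names are illustrative only.

```r
# First five rows of the sample data shown in the question.
gene_id <- c("gene0_DDX11L1", "gene1_WASH7P", "gene100_C1orf233",
             "gene1000_ZC3H12A", "gene10000_CD86")
thyroid_tpm <- c(0, 45.96, 2.72, 2.4, 1.67)
testes_tpm  <- c(2.33, 47.56, 9.45, 2.03, 3.09)

# log2 fold change as a difference of log2(TPM + 1) values (testes vs thyroid).
log2_fc <- log2(testes_tpm + 1) - log2(thyroid_tpm + 1)

# Sort the gene ids by decreasing fold change and keep the top 3.
ranked <- data.frame(gene_id, log2_fc, stringsAsFactors = FALSE)
ranked <- ranked[order(ranked$log2_fc, decreasing = TRUE), ]
top_genes <- head(ranked, 3)
top_genes
```

The key step for the original question is order(): it returns row positions, so indexing the data frame with it keeps each gene_id attached to its own fold change.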
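The merge results in a memory error because it was done on two plain vectors. With no common column to join on, merge() creates a Cartesian join, pairing every element of one vector with every element of the other.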
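A small illustration of that behaviour, using two made-up three-element vectors rather than the real TPM data:

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# merge() coerces each vector to a one-column data frame. With no shared
# column names there is nothing to join on, so every row of x is paired
# with every row of y.
crossed <- merge(x, y)
nrow(crossed)   # 9, i.e. length(x) * length(y)

# Vectors of equal length that are already row-aligned can be bound directly.
aligned <- data.frame(x = x, y = y)
nrow(aligned)   # 3
```

With tens of thousands of genes per vector, the crossed result would have that count squared as rows, which is consistent with the multi-gigabyte allocation in the error message.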
I am completely new to R shiny apps and I am currently trying to create a simple app to visualize some gene expression data from an RNA-sequencing experiment. I am trying to pass a textInput (a gene name) to the UI, so that the server can take it as an input to get the data from a data frame and plot it for that specific gene using ggplot. My code (so far) is as follows:
library(shiny)
library(dplyr)
library(ggplot2)
library(readr)  # provides read_csv()
ui <- fluidPage(
textInput("gene", "Please enter a gene of interest (Examples: IL2, CD28, LDLR):"),
plotOutput("plot")
)
server <- function(input, output) {
data1 <- reactive({
#Read data file and convert to a matrix
raw_df <- read_csv("C:/path/file.csv")
matsymbol <- as.matrix(raw_df[, 2:21])
row.names(matsymbol) <- raw_df$...1
## The gene is found in the matsymbol to extract all normalized read counts
## A matrix of dim=5x4 is formed
gene_counts <- t(matrix(matsymbol[input$gene,], nrow=4))
## Define rownames for the matrix
rownames(gene_counts) <- c("Non-stimulated",
"Stimulated, 24h",
"Stimulated, 48h",
"Stimulated, 48h + LV",
"Stimulated, 72h + LV")
## Calculate rowMeans and rowSDs for each row in the matrix
row_means_gene <- rowMeans(gene_counts)
row_sds_gene <- rowSds(gene_counts)
## Collect to a dataframe which can be used for ggplot
df_gene <- as.data.frame(cbind(row_means_gene, row_sds_gene))
})
output$plot <- renderPlot({
req(data1())
## Plot the expression using ggplot
p_gene <- ggplot(df_gene, aes(x=rownames(df_gene), y=row_means_gene, fill = rownames(df_gene))) +
geom_bar(stat="identity", color="grey", position=position_dodge(), width = 0.7) +
geom_errorbar(aes(ymin=row_means_gene-row_sds_gene, ymax=row_means_gene+row_sds_gene), width=0.2,
position=position_dodge(.9), color = "#404040") +
scale_fill_manual("Condition", values = c("Non-stimulated" = "blue",
"Stimulated, 24h" = "red",
"Stimulated, 48h" = "green",
"Stimulated, 48h + LV" = "yellow",
"Stimulated, 72h + LV" = "black")) +
labs(x="Condition", y = "Normalized expression (read counts) +/- s.d.") +
ggtitle(label = "Normalized expression") +
theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 30, hjust=1))
p_gene
})
}
shinyApp(ui, server)
The data from the raw_df (first col are gene names):
...1 A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 D1 D2 D3 D4 E1 E2 E3 E4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 DDX11L1 135. 2.45 24.0 8.77 0 9.14e-1 5.85e0 9.27e-1 1.12e0 1.97e+0 0 5.03e0 7.05e0 2.03e0 9.27e-1 3.27e+0 8.53e-1 0 9.38e-1 1.90e0
2 MIR6859-1 1.16 24.5 32.0 13.2 1.03e+1 9.14e-1 1.42e1 9.27e+0 3.35e0 0 8.76e+0 3.35e0 6.04e0 2.03e0 1.20e+1 3.27e+0 0 1.66e1 2.06e+1 1.05e1
3 DDX11L1 59.2 6.12 34.7 23.4 1.87e+0 1.83e+0 8.35e0 7.41e+0 1.12e0 4.92e+0 1.75e+0 8.38e0 5.03e0 3.05e0 3.71e+0 4.90e+0 1.11e+1 2.38e0 5.63e+0 7.61e0
4 MIR6859-2 16.2 6.12 2.67 10.2 9.36e-1 5.48e+0 4.18e0 9.27e-1 2.23e0 0 8.76e-1 0 5.03e0 1.02e0 0 8.17e-1 1.71e+0 3.96e0 3.75e+0 1.90e0
5 FAM87B 55.7 44.1 56.0 142. 3.74e+0 8.22e+0 1.67e0 2.59e+1 3.35e0 9.84e-1 5.25e+0 2.18e1 1.01e0 7.12e0 2.78e+0 2.29e+1 1.54e+1 7.92e0 1.59e+1 3.52e1
6 LINC00115 81.2 73.4 30.7 127. 8.42e+0 9.14e+0 1.42e1 1.76e+1 1.12e1 6.89e+0 7.88e+0 1.01e1 9.06e0 9.16e0 1.39e+1 1.63e+1 7.68e+0 1.74e1 2.72e+1 3.61e1
Data from the df_gene (example gene):
row_means_gene row_sds_gene
Non-stimulated 0.0000 0.0000
Stimulated, 24h 2692.3108 2627.0944
Stimulated, 48h 827.6718 823.3256
Stimulated, 48h + LV 1762.1866 1995.1224
Stimulated, 72h + LV 122.9962 144.6482
I learned that I should have my data wrapped in reactive() when using input$; however, I am running into several problems. First, I am getting the error "error in evaluating the argument 'x' in selecting a method for function 't': subscript out of bounds". I know that it has something to do with the reactivity, and I have looked around on multiple forums for an answer, but I just can't figure out how to put it right. Second, I am not quite sure what to pass to ggplot: the dataframe (df_gene) is inside the reactive element data1, so how can I tell it to find it? Can I use something like data1()$df_gene?
I really hope this is not trivial/a stupid question, otherwise I apologize.
Thank you so much in advance and all the best,
Lasse
I think it is not related to reactivity, but first a caveat: I rarely work with row names, so I may be missing something.
However, I have checked this (rownames, subsetting) and see that:
raw_df <- data.frame(
stringsAsFactors = FALSE,
...1 = c("DDX11L1",
"MIR6859-1","DDX11L1","MIR6859-2","FAM87B",
"LINC00115"),
A1 = c(135, 1.16, 59.2, 16.2, 55.7, 81.2),
A2 = c(2.45, 24.5, 6.12, 6.12, 44.1, 73.4),
A3 = c(24, 32, 34.7, 2.67, 56, 30.7),
A4 = c(8.77, 13.2, 23.4, 10.2, 142, 127),
B1 = c(0, 10.3, 1.87, 0.936, 3.74, 8.42),
B2 = c(0.914, 0.914, 1.83, 5.48, 8.22, 9.14),
B3 = c(5.85, 14.2, 8.35, 4.18, 1.67, 14.2),
B4 = c(0.927, 9.27, 7.41, 0.927, 25.9, 17.6),
C1 = c(1.12, 3.35, 1.12, 2.23, 3.35, 11.2),
C2 = c(1.97, 0, 4.92, 0, 0.984, 6.89),
C3 = c(0, 8.76, 1.75, 0.876, 5.25, 7.88),
C4 = c(5.03, 3.35, 8.38, 0, 21.8, 10.1),
D1 = c(7.05, 6.04, 5.03, 5.03, 1.01, 9.06),
D2 = c(2.03, 2.03, 3.05, 1.02, 7.12, 9.16),
D3 = c(0.927, 12, 3.71, 0, 2.78, 13.9),
D4 = c(3.27, 3.27, 4.9, 0.817, 22.9, 16.3),
E1 = c(0.853, 0, 11.1, 1.71, 15.4, 7.68),
E2 = c(0, 16.6, 2.38, 3.96, 7.92, 17.4),
E3 = c(0.938, 20.6, 5.63, 3.75, 15.9, 27.2),
E4 = c(1.9, 10.5, 7.61, 1.9, 35.2, 36.1)
)
matsymbol <- as.matrix(raw_df[, 2:21])
row.names(matsymbol) <- raw_df$...1
input_gene <- "What should I choose?"
t(matrix(matsymbol[input_gene ,], nrow=4))
This gives you the same error if input$gene (input_gene above) does not exist in the data frame / matrix. More precisely, the code below gives you the same error:
input_gene <- ""
t(matrix(matsymbol[input_gene ,], nrow=4))
This is because an empty textInput (and it is empty when the app starts) has the value "".
You can use the code below to get an empty matrix instead of an error:
t(matrix(matsymbol[rownames(matsymbol) == input_gene ,], nrow=4))
But you will have another problem if you get a matrix with 0 rows. The next line, i.e.:
rownames(gene_counts) <- c("Non-stimulated",
"Stimulated, 24h",
"Stimulated, 48h",
"Stimulated, 48h + LV",
"Stimulated, 72h + LV")
won't work, because you will get NULL for rownames(), as in this example:
rownames(matrix()[FALSE, ])
In other words, you would first need to check that you have a matrix with 5 rows before setting the length-5 character vector as its row names. Alternatively, it may be better to let the user choose the gene from a predefined list of genes. Check out the selectInput() function. However, because you are reading the file in the server part, you would need to familiarize yourself with the updateSelectInput() function as well. Think about it, try it if you think it is worth doing, and if you need help with "How to display values in selectInput()?" you can always ask another question here.
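The check described above could be sketched as a small helper that runs before any row names are assigned. The function name get_gene_counts and the toy matrix are hypothetical and not part of the original app. The sketch assumes the 5 conditions x 4 replicates layout from the question.

```r
# Hypothetical helper: returns a 5 x 4 matrix of counts for `gene`,
# or NULL when the gene is not a row name (including the empty string "").
get_gene_counts <- function(matsymbol, gene) {
  if (length(gene) != 1 || !(gene %in% rownames(matsymbol))) {
    return(NULL)
  }
  # 20 values for the gene, laid out as 5 conditions x 4 replicates.
  gene_counts <- t(matrix(matsymbol[gene, ], nrow = 4))
  rownames(gene_counts) <- c("Non-stimulated", "Stimulated, 24h",
                             "Stimulated, 48h", "Stimulated, 48h + LV",
                             "Stimulated, 72h + LV")
  gene_counts
}

# Toy matrix with made-up values: 2 genes x 20 samples (A1..E4).
toy <- matrix(1:40, nrow = 2, byrow = TRUE,
              dimnames = list(c("IL2", "CD28"),
                              paste0(rep(LETTERS[1:5], each = 4), 1:4)))

get_gene_counts(toy, "")           # NULL instead of "subscript out of bounds"
dim(get_gene_counts(toy, "IL2"))   # 5 4
```

In the app, the reactive returns the value of its last expression, so if data1 ends with the data frame, then data1() is that data frame and could be assigned inside renderPlot (for example df_gene <- data1()) rather than accessed as data1()$df_gene. Since req() treats NULL as missing, a NULL from a helper like this would simply leave the plot empty until a valid gene is typed.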
Nota bene: I had a problem with the function rowSds(). I found that it is from the genefilter package, but I couldn't install this package (not available for my version of R), so I didn't check the next steps in your app.
Does anyone know how to fit a quadratic (or higher order) model on a continuous variable and do quantile regression on it in R? Additionally, how do you tell what level of tau fits the data better?
The values for "den" are fish densities (count/m^3) and salinity = salinity (ppt). The full dataset is 1500 observations and I'd like to predict fish density using salinity. A plot with all the data looks semi-quadratic, but I'd like to compare that fit to others using quantile regression. I just can't figure out how to make the relationship in the model non-linear. Is it den ~ salinity + salinity^2?
df <- structure(list(den = c(0, 12, 8.33, 5, 0, 0, 1, 1.33, 0, 3), salinity = c(37, 35, 36, 39, 36, 37, 35, 38, 36, 37)), row.names = c(86L,
240L, 394L, 548L, 702L, 856L, 1010L, 1164L, 1318L, 1472L), class = "data.frame")
library(quantreg)
quantreg75 <- rq(den ~ salinity, data=df, tau=0.75)
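On the formula question: inside an R formula, ^ is a formula operator rather than arithmetic, so salinity^2 on its own does not add a squared term. A short base R sketch of the difference, using the sample data above (the object names plain, quad and X are illustrative):

```r
df <- data.frame(
  den      = c(0, 12, 8.33, 5, 0, 0, 1, 1.33, 0, 3),
  salinity = c(37, 35, 36, 39, 36, 37, 35, 38, 36, 37)
)

# In a formula, ^ means "cross the terms", so salinity^2 collapses to salinity.
plain <- attr(terms(den ~ salinity + salinity^2), "term.labels")

# I() protects the arithmetic, giving a genuine squared term.
quad <- attr(terms(den ~ salinity + I(salinity^2)), "term.labels")

plain   # "salinity"
quad    # "salinity" "I(salinity^2)"

# The design matrix for the quadratic model has three columns:
# intercept, linear term, squared term.
X <- model.matrix(den ~ salinity + I(salinity^2), data = df)
ncol(X)   # 3
```

The same formula can then be handed to rq(), for example rq(den ~ salinity + I(salinity^2), data = df, tau = 0.75); poly(salinity, 2) is another way to write the quadratic. As for tau, each value targets a different conditional quantile, so fits at different tau values describe different parts of the distribution rather than competing for the best fit.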
I am working on an exercise asking me "Plot the residuals against Y_hat, each predictor variable, and each two-factor interaction term on separate graphs." Here is a snippet of the data set I am using:
> dput(head(Commercial_Properties, 10))
structure(list(Rental_Rates = c(13.5, 12, 10.5, 15, 14, 10.5,
14, 16.5, 17.5, 16.5), Age = c(1, 14, 16, 4, 11, 15, 2, 1, 1,
8), Op_Expense_Tax = c(5.02, 8.19, 3, 10.7, 8.97, 9.45, 8, 6.62,
6.2, 11.78), Vacancy_Rate = c(0.14, 0.27, 0, 0.05, 0.07, 0.24,
0.19, 0.6, 0, 0.03), Total_Sq_Ft = c(123000, 104079, 39998, 57112,
60000, 101385, 31300, 248172, 215000, 251015), residuals = c(`1` = -1.03567244005944,
`2` = -1.51380641405037, `3` = -0.591053402133659, `4` = -0.133568082335235,
`5` = 0.313283765150399, `6` = -3.18718522392237, `7` = -0.538356748944345,
`8` = 0.236302385996349, `9` = 1.98922037248654, `10` = 0.105829602747806
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
From here I created the proper linear model that includes two factor interaction terms:
commercial_properties_lm_two_degree_interaction <-
lm(data=Commercial_Properties,
formula=Rental_Rates ~ (Age + Op_Expense_Tax + Vacancy_Rate + Total_Sq_Ft)^2)
Next what I was hoping to accomplish was to plot the residuals not just of the linear terms, but also of the interaction terms. I attempted to do this using the residualPlots() function in the car package
library(car)
residualPlots(model=commercial_properties_lm_two_degree_interaction,
terms=~ (Age + Op_Expense_Tax + Vacancy_Rate + Total_Sq_Ft)^2)
When applied in this way, the output only included the residual plots against the linear terms; it didn't plot any interactions. So I then attempted to do it manually, but I got an error:
residualPlots(model=commercial_properties_lm_two_degree_interaction,
terms=~ Age + Op_Expense_Tax + Vacancy_Rate + Tota_Sq_Ft +
Age:Op_Expense_Tax + Age:Vacancy_Rate)
Error in termsToMf(model, terms) : argument 'terms' not interpretable.
Doing things completely manually, I was able to get an interaction plot. For example:
with(data=Commercial_Properties, plot(x=Op_Expense_Tax * Vacancy_Rate, y=residuals))
This plotted successfully. My issue is that while I can do this manually for a reasonably small number of variables, it will get extremely tedious as the number of variables grows.
So my question is whether there is an existing function in R that makes residual plots of the interaction terms, or whether I am left with doing it completely manually or, most likely, writing some sort of loop.
Note, I'm not asking about partial residuals. I haven't gotten to that point in my text I'm using. Just plain interaction terms against residuals.
You could do an eval(parse()) approach using the 'term.labels' attribute.
With gsub(':', '*', a[grep(':', a)]) pull out the interaction terms and replace : with * so it can be evaluated.
a <- attr(terms(commercial_properties_lm_two_degree_interaction), 'term.labels')
op <- par(mfrow=c(2, 3))
with(Commercial_Properties,
lapply(gsub(':', '*', a[grep(':', a)]), function(x)
plot(eval(parse(text=x)), residuals, xlab=x)))
par(op)
Edit
This is how we would do this with a for loop in R:
inter <- gsub(':', '*', a[grep(':', a)])  # renamed from 'as' to avoid masking base::as
op <- par(mfrow=c(2, 3))
for (x in inter) {
with(Commercial_Properties,
plot(eval(parse(text=x)), residuals, xlab=x)
)
}
par(op)
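An alternative sketch that avoids eval(parse()) is to take the interaction columns straight from the model matrix, where each column named with a ":" already holds the product of its two predictors. To stay self-contained, the example below makes two assumptions that differ from the original: it uses only the ten sample rows from the question, and it fits a reduced model with three predictors so that the fit is not saturated. The names fit, mm and inter_cols are illustrative.

```r
# Ten sample rows from the question (Total_Sq_Ft omitted for the reduced model).
Commercial_Properties <- data.frame(
  Rental_Rates   = c(13.5, 12, 10.5, 15, 14, 10.5, 14, 16.5, 17.5, 16.5),
  Age            = c(1, 14, 16, 4, 11, 15, 2, 1, 1, 8),
  Op_Expense_Tax = c(5.02, 8.19, 3, 10.7, 8.97, 9.45, 8, 6.62, 6.2, 11.78),
  Vacancy_Rate   = c(0.14, 0.27, 0, 0.05, 0.07, 0.24, 0.19, 0.6, 0, 0.03)
)

# Reduced two-factor interaction model (assumption: three predictors only).
fit <- lm(Rental_Rates ~ (Age + Op_Expense_Tax + Vacancy_Rate)^2,
          data = Commercial_Properties)

# Columns of the design matrix whose names contain ":" are the interactions.
mm <- model.matrix(fit)
inter_cols <- grep(":", colnames(mm), fixed = TRUE, value = TRUE)

# One residual plot per interaction term.
op <- par(mfrow = c(1, 3))
for (term in inter_cols) {
  plot(mm[, term], resid(fit), xlab = term, ylab = "Residuals")
}
par(op)
```

Because the columns come from the fitted model itself, the same loop should carry over to the full four-predictor model without listing the terms by hand.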