Plotting regressions from slope and intercept (lattice or ggplot2) - r
I have a microarray dataset on which I performed a limma lmFit() test. If you haven't heard of it before, it's a powerful linear model package that tests for differential gene expression across >20k genes. You can extract the slope and intercept from the model for each one of these genes.
My problem is: given a table of slope and intercept values, how do I match a plot (I don't mind either ggplot2's geom_abline, lattice's panel.abline, or an alternative if necessary) with its corresponding slope and intercept?
My table (call it "slopeInt") has intercept as column 1 and slope as column 2, and has row names that correspond to the name of the gene. Their names look like this:
"202586_at" "202769_at" "203201_at" "214970_s_at" "219155_at"
These names match my gene names in another table ("Data") containing some details about my samples (I have 24 samples with different IDs and Time/Treatment combination) and the gene expression values.
It's in the long format with the gene names (as above) repeating every 24 rows (different expression levels for same gene, for each one of my samples):
ID Time Treatment Gene_name Gene_exp
... ... ... ... ...
I have eight genes in total that I want to plot, and the names in my Data$Gene_name match the row names of my slopeInt table. I can also merge the two tables together, that's not a problem. But I tried the following two approaches to get one graph per gene, each with the appropriate regression line, to no avail:
Using ggplot2:
ggplot(Data, aes(x = Time, y = Gene_exp, group = Time, color = Treatment)) +
facet_wrap(~ Gene_name, scales = "free_x") +
geom_point() +
geom_abline(intercept = Intercept, slope = Time), data = slopeInt) +
theme(panel.grid.major.y = element_blank())
And also using Lattice:
xyplot(Gene_exp ~ Time| Gene_name, Data,
jitter.data = T,
panel = function(...){
panel.xyplot(...)
panel.abline(a = slopeInt[,1], b = slopeInt[,2])},
layout = c(4, 2))
I've tried multiple other methods in the actual geom_abline() and panel.abline() arguments, including some for loops, but I am not experienced in R and I cannot get it to work. I can also provide the data file in a wide format (separate columns for each gene).
Any help and further directions will be greatly appreciated!!!
Here is some code for a reproducible example:
Data <- data.frame(
ID = rep(1:24, 8),
Time = (rep(rep(c(1, 2, 4, 24), each = 3), 8)),
Treatment = rep(rep(c("control", "smoking"), each = 12), 8),
Gene_name = rep(c("202586_at", "202769_at", "203201_at", "214970_s_at",
"219155_at", "220165_at", "224483_s_at", "227559_at"), each = 24),
Gene_exp = rnorm(192))
slopeInt <- data.frame(
Intercept = rnorm(8),
Slope = rnorm(8))
row.names(slopeInt) <- c("202586_at", "202769_at", "203201_at",
"214970_s_at", "219155_at", "220165_at", "224483_s_at", "227559_at")
With lattice, this should work:
xyplot(Gene_exp ~ Time| Gene_name, Data, slopeInt=slopeInt,
jitter.data = T,
panel = function(..., slopeInt){
panel.xyplot(...)
grp <- trellis.last.object()$condlevels[[1]][which.packet()]
panel.abline(a = slopeInt[grp,1], b = slopeInt[grp,2])
},
layout = c(4, 2)
)
Calling set.seed(15) before generating the sample data makes the resulting plot reproducible (the plot image is not included here).
The "trick" here is to use trellis.last.object()$condlevels to determine which conditioning block we are currently in. Then we use that information to extract the right slope information from the additional data we now pass in via a parameter. I thought there was a more elegant way to determine the current values of the conditioning variables but if there is I cannot remember it at this time.
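One possible alternative, offered only as a sketch: compute the factor levels of the conditioning variable up front and index them with which.packet() inside the panel function, so the panel does not have to go through trellis.last.object(). The names genes and gene_levels are introduced here for illustration, and the example assumes a single conditioning variable whose levels match the row names of slopeInt.

```r
library(lattice)

set.seed(15)
genes <- c("202586_at", "202769_at", "203201_at", "214970_s_at",
           "219155_at", "220165_at", "224483_s_at", "227559_at")

# Sample data shaped like the question's (24 rows per gene)
Data <- data.frame(
  Time      = rep(rep(c(1, 2, 4, 24), each = 3), times = 16),
  Gene_name = factor(rep(genes, each = 24), levels = genes),
  Gene_exp  = rnorm(192)
)
slopeInt <- data.frame(Intercept = rnorm(8), Slope = rnorm(8),
                       row.names = genes)

# Levels of the conditioning factor, in the order lattice draws the packets
gene_levels <- levels(Data$Gene_name)

p <- xyplot(Gene_exp ~ Time | Gene_name, data = Data,
  panel = function(...) {
    panel.xyplot(...)
    # which.packet() gives the level index of each conditioning variable;
    # with one conditioning variable, the first element is the gene index
    gene <- gene_levels[which.packet()[1]]
    panel.abline(a = slopeInt[gene, "Intercept"],
                 b = slopeInt[gene, "Slope"])
  },
  layout = c(4, 2))

print(p)
```

Looking the coefficients up by gene name rather than by row position means the lines should stay correct even if slopeInt is sorted differently from the factor levels.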
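If you specify Gene_name as a column in slopeInt, then it works [as I understand you want it to], because facet_wrap() uses the faceting variable to decide which panel each row of a layer's data belongs to. Note also a few other changes to the ggplot call: group = Time is dropped, the intercept and slope are mapped inside aes(), and slope refers to the Slope column rather than Time.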
slopeInt$Gene_name <- rownames(slopeInt)
ggplot(Data, aes(x = Time, y = Gene_exp, color = Treatment)) +
facet_wrap(~ Gene_name, scales = "free_x") +
geom_point() +
geom_abline(aes(intercept = Intercept, slope = Slope), data = slopeInt) +
theme(panel.grid.major.y = element_blank())
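For completeness, here is a self-contained sketch of the same idea using the question's sample data, with one added sanity check that every faceted gene has a matching row of coefficients. The check and the variable name genes are additions for illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(15)
genes <- c("202586_at", "202769_at", "203201_at", "214970_s_at",
           "219155_at", "220165_at", "224483_s_at", "227559_at")

Data <- data.frame(
  ID        = rep(1:24, times = 8),
  Time      = rep(rep(c(1, 2, 4, 24), each = 3), times = 16),
  Treatment = rep(rep(c("control", "smoking"), each = 12), times = 8),
  Gene_name = rep(genes, each = 24),
  Gene_exp  = rnorm(192)
)

# One row of coefficients per gene, with the gene name as a real column
slopeInt <- data.frame(Gene_name = genes,
                       Intercept = rnorm(8),
                       Slope     = rnorm(8))

# Assumed sanity check: the two tables must agree on the set of genes
stopifnot(setequal(unique(Data$Gene_name), slopeInt$Gene_name))

p <- ggplot(Data, aes(x = Time, y = Gene_exp, color = Treatment)) +
  facet_wrap(~ Gene_name, scales = "free_x") +
  geom_point() +
  # geom_abline reads its own data; Gene_name routes each line to its panel
  geom_abline(aes(intercept = Intercept, slope = Slope), data = slopeInt) +
  theme(panel.grid.major.y = element_blank())

print(p)
```

If a gene were missing from slopeInt, its panel would simply show points with no line, which is why the explicit check can be worth the extra line.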
Related
Plotting a scatterplot with two continous variables but generating 'Discrete value supplied to continuous scale error' -updated
I am trying to plot a scatterplot with continuous variables on both the x and y axes. Here is a simulation that pretty accurately represents what the data would look like (there are a couple of vectors with data items that are integers in the real thing, but that's immaterial here):
library(tidyverse)
set.seed(243)
DemoDataTable <- tibble(Treatment = rep(c(0, 1), times = 20), ## each column is repeated 20 times, not 20 times for both
  CorrectAnswers = rnorm(40, 1, 2),
  Age = rnorm(40, 20, 5),
  PositiveAffect = rnorm(40, 1.5, 1),
  NegativeAffect = rnorm(40, 2, 1),
  TreatmentExpectancy = rnorm(40, 2, 1)) %>%
  mutate(Treatment = as.factor(Treatment))
Positive affect = 'positive affect2', negative affect = 'negative affect2' in the code below.
I am aware there are some NaN items in the x and y variables, but have attempted to remove them; the rest are continuous data on both axes. I have also checked the data frame using the glimpse function: only Treatment is showing up as a factor, the rest are coded as 'double', which to my understanding means numeric data.
The code:
ggplot(Data object, aes(x = 'Negative affect2', y = 'Positive affect2'), na.rm = TRUE) +
  geom_point(colour = "blue") +
  scale_x_continuous(name = "Negative affect") +
  scale_y_continuous(name = "Positive affect") +
  theme_minimal()
However, when I attempt to run the code, it gives the error message: 'Discrete value supplied to continuous scale'
I have also tried to use the na.omit function in place of na.rm but this has made no difference.
EDIT: I have changed the column names from "Negative Affect2" and "Positive Affect2" to "NegativeAffect" and "PositiveAffect" but this has made no difference to the error.
I have found that when running the code for the scatterplot line by line (as per below), I immediately hit problems when I get to line 2, as it will only give me a single data point on the x and y axes (whether I insert na.rm = T or not).
ggplot(RTdataset2, aes(x = "NegativeAffect", y = "PositiveAffect", na.rm = T)) +
  geom_point(colour = "blue")
However, I have not used the summarise function from dplyr at any point, so I am not sure why this is, as it should plot all the data points from the sample against each other. I believe this is where the problem arises, as otherwise it would not state that I am trying to apply a discrete variable to a continuous scale as per below that.
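A likely cause, shown here as a small sketch rather than a confirmed diagnosis of the asker's real data: inside aes(), a quoted name such as "NegativeAffect" is a constant string, not a reference to a column, so every row is mapped to the same discrete value. That would explain both the single point and the discrete/continuous scale error. The data frame demo below is made up for illustration.

```r
library(ggplot2)

set.seed(243)
# Hypothetical stand-in for the asker's data
demo <- data.frame(NegativeAffect = rnorm(40, 2, 1),
                   PositiveAffect = rnorm(40, 1.5, 1))

# Quoted names: x and y are the same string on every row, hence one point
p_bad <- ggplot(demo, aes(x = "NegativeAffect", y = "PositiveAffect")) +
  geom_point(colour = "blue")

# Bare names: x and y are the numeric columns, so continuous scales apply
p_ok <- ggplot(demo, aes(x = NegativeAffect, y = PositiveAffect)) +
  geom_point(colour = "blue", na.rm = TRUE) +
  scale_x_continuous(name = "Negative affect") +
  scale_y_continuous(name = "Positive affect") +
  theme_minimal()

print(p_ok)
```

Column names that contain spaces can be referenced with backticks instead of quotes, for example aes(x = `Negative affect2`). Note also that na.rm belongs in the geom call, not in ggplot() or aes().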
How do I plot the Variable Importance of my trained rpart decision tree model?
I trained a model using rpart and I want to generate a plot displaying the Variable Importance for the variables it used for the decision tree, but I cannot figure out how. I was able to extract the Variable Importance. I've tried ggplot but none of the information shows up. I tried using the plot() function on it, but it only gives me a flat graph. I also tried plot.default, which is a little better but still not what I want.
Here's the rpart model training:
argIDCART = rpart(Argument ~ ., data = trainSparse, method = "class")
Got the variable importance into a data frame.
argPlot <- as.data.frame(argIDCART$variable.importance)
Here is a section of what that prints:
argIDCART$variable.importance
noth 23.339346
humanitarian 16.584430
council 13.140252
law 11.347241
presid 11.231916
treati 9.945111
support 8.670958
I'd like to plot a graph that shows the variable/feature name and its numerical importance. I just can't get it to do that. It appears to only have one column. I tried separating them using the separate function, but can't do that either.
ggplot(argPlot, aes(x = "variable importance", y = "feature"))
Just prints blank. The other plots look really bad.
plot.default(argPlot)
Looks like it plots the points, but doesn't put the variable name.
Since there is no reproducible example available, I based my response on a dataset that ships with rpart, using the ggplot2 package and other packages for data manipulation.
library(rpart)
library(tidyverse)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
df <- data.frame(imp = fit$variable.importance)
df2 <- df %>%
  tibble::rownames_to_column() %>%
  dplyr::rename("variable" = rowname) %>%
  dplyr::arrange(imp) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(df2) +
  geom_col(aes(x = variable, y = imp), col = "black", show.legend = F) +
  coord_flip() +
  scale_fill_grey() +
  theme_bw()
ggplot2::ggplot(df2) +
  geom_segment(aes(x = variable, y = 0, xend = variable, yend = imp), size = 1.5, alpha = 0.7) +
  geom_point(aes(x = variable, y = imp, col = variable), size = 4, show.legend = F) +
  coord_flip() +
  theme_bw()
If you want to see the variable names, it may be best to use them as the labels on the x-axis.
plot(argIDCART$variable.importance, xlab="variable", ylab="Importance", xaxt = "n", pch=20)
axis(1, at=1:7, labels=row.names(argIDCART))
(You may need to resize the window to see the labels properly.)
If you have a lot of variables, you may want to rotate the variable names so that they do not overlap.
par(mar=c(7,4,3,2))
plot(argIDCART$variable.importance, xlab="variable", ylab="Importance", xaxt = "n", pch=20)
axis(1, at=1:7, labels=row.names(argIDCART), las=2)
Data
argIDCART = read.table(text="variable.importance
noth 23.339346
humanitarian 16.584430
council 13.140252
law 11.347241
presid 11.231916
treati 9.945111
support 8.670958",
header=TRUE)
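The base-graphics answer above reads the labels from the row names of a data frame. With a fitted rpart object, variable.importance is a named numeric vector, so the labels come from names() instead. The following is a minimal sketch using the kyphosis data that ships with rpart (not the asker's argIDCART model), assuming a horizontal bar chart is an acceptable presentation.

```r
library(rpart)

# Fit a small tree on a dataset bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Named numeric vector: names are the variables, values the importances
imp <- sort(fit$variable.importance)

# Widen the left margin so the variable names fit, then draw the bars;
# barplot() picks up the labels from names(imp) automatically
par(mar = c(4, 8, 2, 1))
barplot(imp, horiz = TRUE, las = 1, xlab = "Importance")
```

Sorting before plotting puts the most important variable at the top of the horizontal chart.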
Plotting multiple box plots as a single graph in R
I am trying to plot multiple box plots as a single graph. The data comes from a survey on which I have run a Wilcoxon test. I have four or five questions, and for each question I want to plot the respondent scores for two sets as box plots (two groups per question). I am thinking of using ggplot2. My data looks like this:
q1o <- c(4,4,5,4,4,4,4,5,4,5,4,4,5,4,4,4,5,5,5,5,5,5,5,5,5,3,4,4,3,4)
q1s <- c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,5,4,4)
q2o <- c(3,3,3,4,3,4,4,3,3,3,4,4,3,4,3,3,4,3,3,3,3,4,4,4,4,3,3,3,3,4)
q2s <- c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,3,4,4)
.... ....
q1 means question 1 and q2 means question 2. I also want to know how to arrange these box plots as needed, for example in one row or two rows.
This should get you started. Unfortunately you don't provide a minimal example with sample data, so I will generate some random sample data.
# Generate sample data
set.seed(2017);
df <- cbind.data.frame(
  value = rnorm(1000),
  Label = sample(c("Good", "Bad"), 1000, replace = T),
  variable = sample(paste0("F", 5:11), 1000, replace = T));
# ggplot
library(tidyverse);
df %>%
  mutate(variable = factor(variable, levels = paste0("F", 5:11))) %>%
  ggplot(aes(variable, value, fill = Label)) +
  geom_boxplot(position = position_dodge()) +
  facet_wrap(~ variable, ncol = 3, scales = "free");
You can specify the number of columns and rows in your 2d panel layout through the arguments ncol and nrow, respectively, of facet_wrap. Many more details and examples can be found if you follow ?geom_boxplot and ?facet_wrap.
Update 1
A boxplot based on your sample data doesn't make too much sense, because your data are not continuous. But ignoring that, you could do the following:
df <- data.frame(
  q1o = c(4,4,5,4,4,4,4,5,4,5,4,4,5,4,4,4,5,5,5,5,5,5,5,5,5,3,4,4,3,4),
  q1s = c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,5,4,4),
  q2o = c(3,3,3,4,3,4,4,3,3,3,4,4,3,4,3,3,4,3,3,3,3,4,4,4,4,3,3,3,3,4),
  q2s = c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,3,4,4));
df %>%
  gather(key, value, 1:4) %>%
  mutate(
    variable = ifelse(grepl("q1", key), "F1", "F2"),
    Label = ifelse(grepl("o$", key), "Bad", "Good")) %>%
  ggplot(aes(variable, value, fill = Label)) +
  geom_boxplot(position = position_dodge()) +
  facet_wrap(~ variable, ncol = 3, scales = "free");
Update 2
One way of visualising discrete data would be in a mosaicplot.
mosaicplot(table(df2));
The plot shows the count of value (as filled rectangles) per Variable per Label. See ?mosaicplot for details.
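As a variation on the reshaping step, here is a sketch that uses tidyr::pivot_longer() in place of gather() and splits each column name into a question and a set. It uses only the first six scores of each vector from the question to stay short, and the column names question, set, and score are chosen here for illustration.

```r
library(tidyr)
library(ggplot2)

# First six responses per column, taken from the question's data
df <- data.frame(
  q1o = c(4, 4, 5, 4, 4, 4),
  q1s = c(5, 4, 4, 5, 5, 5),
  q2o = c(3, 3, 3, 4, 3, 4),
  q2s = c(5, 4, 4, 5, 5, 5)
)

# Split names such as "q1o" into question = "q1" and set = "o"
long <- pivot_longer(df, cols = everything(),
                     names_to = c("question", "set"),
                     names_pattern = "(q\\d+)(o|s)",
                     values_to = "score")

# One panel per question; nrow controls whether panels share a row
p <- ggplot(long, aes(x = question, y = score, fill = set)) +
  geom_boxplot() +
  facet_wrap(~ question, nrow = 1, scales = "free_x")

print(p)
```

Changing nrow = 1 to nrow = 2 stacks the question panels in two rows instead of one.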
How to add a trendline to a boxplot of counts(y axis) and ids(x axis) when x axis is ordered
df1 <- data.frame(a=c(1,4,7), b=c(3, 5, 6), c=c(1, 1, 4), d=c(2 ,6 ,3))
df2<-data.frame(id=c("a","f","f","b","b","c","c","c","d","d"), var=c(12,20,15,18,10,30,5,8,5,5))
mediorder <- with(df2, reorder(id, -var, median))
boxplot(var~mediorder, data = df2)
fc = levels(as.factor(mediorder))
ndf1= df1[,intersect(fc, colnames(df1))]
ln<-lm( #confused here
boxplot(ndf1)
abline(ln)
I have the above boxplot (ndf1) with an x-axis ordered according to medians from another data frame, and I would like to add a trendline to it. I am confused since it doesn't have an x and y variable to refer to, just columns with counts. Also, the ordering is causing me problems.
EDITED for clarification: I am building on the question here: How to match an ordered list (e.g., levels(as.factor(x)) ) to another dataframe in which only some columns match? All I would like to do is fit a trend line to ndf1.
Something like this should do. It's fairly easy using ggplot2. However, your data/question are a bit confusing, e.g. some factors (a,d) have one data point only. Is this what you want?
df2$id <- factor(df2$id , levels = levels(mediorder))
library(ggplot2)
ggplot(data = df2, aes(x = id, y = var)) +
  geom_boxplot() +
  geom_smooth(method = "lm", aes(group = 1), se = F)
Subsetting ggplot2 graph using facet_grid()
I am trying to get individual trajectories and a fitted trajectory per group across repeated measurements. Toy data below:
set.seed(124)
ID <- factor(rep(1:21, times = 3))
Group <- rep(c("A", "B", "C"), times = 21)
score <- rnorm(63, 25, 3)
session <- rep(c("s1","s2", "s3"), each = 21)
df <- data.frame(ID, Group, session, score)
Now plot trajectories across the three repeated measures for each individual and derive a fitted slope for the whole sample.
c <- ggplot(df, aes(x = session, y = score, group = ID, colour = ID)) +
  geom_smooth(method = "lm", se = FALSE) +
  stat_smooth(aes(group = 1), se = FALSE, method = "lm", color = "red")
c
Now I want to break this plot up into three plots by group. There is the long way, where you subset the data frame by group and do three separate graphs. However, I would like to do it all in one graph, same as above, except separated by group. I tried:
c + facet_grid(.~Group)
But it comes out blank. Something is missing here and I don't know what it is.