I'm an EViews user, and EViews can draw a scatter plot matrix very easily.
In the following graph, I have 13 groups of data, and EViews plots one group against each of the other 12, giving 12 panels in one graph, each with a regression line.
How can I produce the same graph in RStudio?
Here is an example on how to do the requested plot in ggplot:
First some data:
z <- matrix(rnorm(1000), ncol= 10)
The basic idea here is to convert the wide matrix to long format, where the variable that is compared to all the others is duplicated as many times as there are other variables. Each of these other variables gets a specific label in the key column. ggplot2 works best with data in this format.
library(tidyverse)
z %>%
  as.data.frame() %>% # convert the matrix to a data frame; columns are named V1..V10
  gather(key, value, 2:10) %>% # convert to long format, specifying variable columns 2:10
  mutate(key = factor(key, levels = paste0("V", 2:10))) %>% # set levels so facets are ordered V2..V10 (V10 would otherwise sort before V2)
  ggplot() +
  geom_point(aes(value, V1)) + # plot points
  geom_smooth(aes(value, V1), method = "lm", se = FALSE) + # plot lm fit without the standard error band
  facet_wrap(~key) # one panel per key
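Since gather() is superseded in current tidyr, the same reshape can also be written with pivot_longer(). This is only a sketch of that variant on the same kind of simulated matrix; the object names long and p are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(42)
z <- matrix(rnorm(1000), ncol = 10)

# as.data.frame() names the columns V1..V10; keep V1 as the common y variable
# and stack the remaining nine columns into key/value pairs.
long <- as.data.frame(z) %>%
  pivot_longer(cols = -V1, names_to = "key", values_to = "value") %>%
  mutate(key = factor(key, levels = paste0("V", 2:10)))

# One panel per comparison variable, each with a linear fit.
p <- ggplot(long, aes(value, V1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~key)
```

Printing p draws the nine panels; with 13 groups, as in the question, the same code would give 12 panels.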
I'm using ggplot to create a scatterplot of a data frame. The x and y axes are two columns in the frame, and the following code gives me a scatter plot:
ggplot(df,aes(x=Season,y=type))+
geom_point(fill="blue")
But the points are all the same size. I want the size of each point to depend on the number of rows that match that combination of x and y. Does anyone know how to do this?
You haven't provided any sample data, so I'm generating some:
library('tidyverse')
df <- data.frame(Season = sample(c('W', 'S'), size=20, replace=T),
type=sample(c('A', 'B'), size=20, replace=T))
df %>%
group_by(Season, type) %>%
summarise(count = n()) %>%
ggplot(aes(x=Season, y=type, size=count)) +
geom_point(col="blue")
The idea is to count all the occurrences of each Season and type combination, and then map that new count field to size in your ggplot.
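As a possible shortcut, ggplot2 also provides geom_count(), which counts the observations at each x/y location and maps that count to point size, so the summarising step can be skipped. The sketch below uses the same kind of made-up data; the counts table is computed only to show what the point sizes represent.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
df <- data.frame(
  Season = sample(c("W", "S"), size = 20, replace = TRUE),
  type   = sample(c("A", "B"), size = 20, replace = TRUE)
)

# geom_count() does the counting internally and sizes points by it.
p <- ggplot(df, aes(x = Season, y = type)) +
  geom_count(colour = "blue")

# The same counts as an explicit table, for inspection.
counts <- count(df, Season, type, name = "count")
```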
I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset) to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engineering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
mulvey2015 %>%
Biobase::assayData() %>%
magrittr::extract2("exprs") %>%
data.frame(check.names = FALSE) %>%
tibble::rownames_to_column("prot_id") %>%
mutate(.,
cl = kmeans(select(., -prot_id),
centers = 12,
nstart = 10) %>%
magrittr::extract2("cluster") %>%
as.factor()) %>%
pivot_longer(cols = !c(prot_id, cl),
names_to = "Timepoint",
values_to = "Expression") %>%
ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
geom_line(aes(group = prot_id)) +
facet_wrap(~ cl, ncol = 4)
As for your questions, pivot_longer is usually quite performant unless it fails to find unique combinations in the keys or runs into problems with data type conversion. The plot can be improved by:
tweaking the alpha parameter of geom_line (e.g. alpha = 0.5), in order to give an idea of the density of lines
finding a good abbreviation and order for Timepoint
changing axis.text.x orientation
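The three tweaks listed above can be sketched as follows. To keep the example light it runs on a small simulated expression matrix rather than mulvey2015, so the matrix, the cluster count of 3, and the names expr, long and timepoints are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
timepoints <- c("0hr", "16hr", "24hr", "48hr", "72hr")

# Simulated stand-in for the expression data: 60 proteins x 5 timepoints.
expr <- matrix(rnorm(60 * 5), ncol = 5,
               dimnames = list(paste0("prot", 1:60),
                               paste0("rep1_", timepoints)))

long <- as.data.frame(expr) %>%
  tibble::rownames_to_column("prot_id") %>%
  mutate(cl = factor(kmeans(expr, centers = 3, nstart = 10)$cluster)) %>%
  pivot_longer(cols = !c(prot_id, cl),
               names_to = "Timepoint",
               values_to = "Expression") %>%
  # Shorten the labels and fix their order so the x-axis follows time.
  mutate(Timepoint = factor(sub("^rep1_", "", Timepoint), levels = timepoints))

p <- ggplot(long, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id), alpha = 0.5) +   # semi-transparent lines show density
  facet_wrap(~ cl) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotated tick labels
```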
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
Add the kmeans $cluster column to the dfsa_mul2 data frame (here ksa_mul holds the kmeans() output). Only convert clus to a factor after executing pivot_longer.
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
pivot_longer(cols = -c("protID", "clus"),
names_to = "samples",
values_to = "expression") %>%
ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
geom_line(aes(group = protID)) +
facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by @sbarbit.
I have a data frame consisting of six variables: one two-level grouping variable indicating treatment status and five binary (0/1) variables. I would like to plot the proportion of successes, with 95% confidence intervals as error bars, for each binary variable, using separate dots and colors for each treatment group.
I'm currently plotting these as shown below.
df2 <-
df %>%
select(., c(q1_active, # select variables
q2_appt,
q2_trmt,
q2_img,
q2_tele,
q2_trav))
df3 <-
df2 %>%
pivot_longer(cols = starts_with("q2"),
names_to = "variable",
names_prefix = "q2",
values_to = "values")
se <- function(x) sqrt(var(x)/length(x)) #creates function to calculate standard error of the mean
df4 <-
df3 %>%
group_by(variable, q1_active) %>% # group by both binom variable and treatment status
mutate(means=mean(values)) %>% # calculate proportions for binomial variables
mutate(se=se(values)) %>% # calculates std error
distinct(means, .keep_all=TRUE) %>%
ungroup() %>%
drop_na() # there is one "NA" group in the treatment variable I do not need
pos <- position_dodge(.5)
p2 <-
df4 %>%
ggplot(., aes(x=variable, y=means)) +
geom_point(aes(colour=as.factor(q1_active)),position=pos) +
geom_errorbar(aes(ymin=means-(1.96*se), ymax=means+(1.96*se),
colour=as.factor(q1_active),
group=as.factor(q1_active)),
width=.2, position=pos) +
labs(title="Title Here",
subtitle="Subtitle Here",
x="",
y="")
The plot looks okay. I know the proportions are correct because I've double-checked the "means" variable.
However, I'm unsure whether I'm calculating the standard error correctly for these proportions. Additionally (as you can likely see), when I run the plot, one proportion has zero frequency. I would like to calculate and plot the Wilson interval for these proportions instead of the standard error I have used.
Could someone guide me on how to correctly calculate the Wilson (or "exact") confidence interval for these binomial proportions, either before or after I pivot my data frame, and how to plot the intervals using ggplot?
I'm relatively new to coding and R, so please forgive any sloppy code or misunderstandings. And please let me know if you need clarification on anything. Thank you in advance.
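One possible approach, sketched under assumptions: count successes and trials per group, compute the Wilson score interval by hand, and pass the lower and upper bounds to geom_errorbar(). The data frame below is simulated, and the helper name wilson_ci() is hypothetical; only the column names q1_active and q2_* come from the question. Base R's prop.test(x, n, correct = FALSE) reports the same interval, and binom.test() gives the exact (Clopper-Pearson) one.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for the real data (assumption): one treatment flag
# and three binary outcomes, one of which has no successes at all.
set.seed(1)
df <- data.frame(
  q1_active = sample(c(0, 1), 200, replace = TRUE),
  q2_appt   = rbinom(200, 1, 0.6),
  q2_trmt   = rbinom(200, 1, 0.3),
  q2_img    = 0
)

# Wilson score interval for x successes out of n trials (hypothetical helper).
wilson_ci <- function(x, n, conf = 0.95) {
  z      <- qnorm(1 - (1 - conf) / 2)
  p      <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  data.frame(lower = pmax(0, centre - half),
             upper = pmin(1, centre + half))
}

# One row per outcome and treatment group, with proportion and interval.
ci <- df %>%
  pivot_longer(starts_with("q2"), names_to = "variable",
               names_prefix = "q2_", values_to = "value") %>%
  drop_na() %>%
  group_by(variable, q1_active) %>%
  summarise(successes = sum(value), n = n(), .groups = "drop") %>%
  mutate(prop  = successes / n,
         lower = wilson_ci(successes, n)$lower,
         upper = wilson_ci(successes, n)$upper)

pos <- position_dodge(0.5)
p <- ggplot(ci, aes(x = variable, y = prop, colour = factor(q1_active))) +
  geom_point(position = pos) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, position = pos) +
  labs(x = NULL, y = "Proportion", colour = "q1_active")
```

Unlike mean ± 1.96 × SE, the Wilson interval stays inside [0, 1] and has non-zero width even for the group with zero successes, which is the case the question runs into.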
I have a time series data frame similar to the data created below. Measurements of 5 variables are taken on each individual, and individuals have unique ID numbers. Note that in this data set every individual has the same length (1000 observations each), but in my real data set each individual has a different number of observations. For each individual, I want to plot all 5 variables on top of one another (i.e. all on the y axis) against time (x axis). I want to print each of these plots to an external document of some kind (pdf, or whatever is recommended for this application) with one plot per page, meaning each individual will have its own page with a single plot. I want these time series plots to be "interactive", in that I can move my mouse over a point and it will tell me the time of that data point. My goal is to explore the association between peaks, valleys, and other regions across the 5 variables. I am not sure if ggplot2 is still the best tool for this, but I would like the plots to be aesthetically appealing so that patterns in the data are easier to see. Also, is exporting these plots to a pdf the most sensible route? Or would I be better off using an R notebook or some other application?
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five))
Try using the plotly package. And since you want it to be interactive, you'll want to export as something like html rather than pdf.
To produce a single faceted plot (note I added stringsAsFactors = FALSE to your sample data):
library(tidyverse)
library(plotly)
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five),
stringsAsFactors = FALSE)
data_long <- data %>%
gather(variable,
value,
one:five) %>%
mutate(time = as.numeric(time),
value = as.numeric(value))
plot <- data_long %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
facet_wrap(~ID)
interactive_plot <- ggplotly(plot)
htmlwidgets::saveWidget(interactive_plot, "example.html")
If you want to produce and export an interactive plot for every ID programmatically:
walk(unique(data_long$ID),
~ htmlwidgets::saveWidget(ggplotly(data_long %>%
filter(ID == .x) %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
labs(title = paste(.x))),
paste("plot_for_ID_", .x, ".html", sep = "")))
Edit: I changed map() to walk() so that the plots are produced without console output (previously just a list with 3 empty elements).
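A side note on the sample data: cbind() builds a character matrix, so every column ends up as text, which is why the as.numeric() conversions are needed. A sketch that builds the data frame directly keeps the numeric types and also makes it easy to give each individual a different number of observations, as in the real data. The lengths in n_obs are made up for illustration, and pivot_longer() is used in place of gather().

```r
library(tidyr)
library(ggplot2)

set.seed(1)
# Hypothetical number of observations per individual.
n_obs <- c(A = 1000, B = 800, C = 600)
total <- sum(n_obs)

# data.frame() without cbind() keeps ID as character and the rest numeric.
data <- data.frame(
  ID    = rep(names(n_obs), times = n_obs),
  time  = sequence(unname(n_obs)),   # 1..1000, 1..800, 1..600
  one   = rnorm(total),
  two   = rnorm(total),
  three = rnorm(total),
  four  = rnorm(total),
  five  = rnorm(total)
)

data_long <- pivot_longer(data, cols = one:five,
                          names_to = "variable", values_to = "value")

# Static version of the faceted plot; it can be passed to plotly::ggplotly()
# in the same way as in the answer above.
p <- ggplot(data_long, aes(x = time, y = value, colour = variable)) +
  geom_line() +
  facet_wrap(~ID, scales = "free_x")
```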
I'm trying to do a plot with facets with some data from a previous model. As a simple example:
t=1:10;
x1=t^2;
x2=sqrt(t);
y1=sin(t);
y2=cos(t);
How can I plot this data in a 2x2 grid, where the rows are one factor (levels x and y, plotted with different colors) and the columns are another factor (levels 1 and 2, plotted with different linetypes)?
Note: t is the common variable for the X axis of all subplots.
ggplot will be more helpful if the data can be first put into tidy form. df is your data, df_tidy is that data in tidy form, where the series is identified in one column that can be mapped in ggplot -- in this case to the facet.
library(tidyverse)
df <- tibble(
t=1:10,
x1=t^2,
x2=sqrt(t),
y1=sin(t),
y2=cos(t),
)
df_tidy <- df %>%
gather(series, value, -t)
ggplot(df_tidy, aes(t, value)) +
geom_line() +
facet_wrap(~series, scales = "free_y")
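To get closer to the 2x2 layout described in the question, the series name can be split into its letter and its number, and the two parts used as the row and column facets. This is a sketch only; the column names var and idx are invented here, and pivot_longer() stands in for gather().

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- tibble::tibble(
  t  = 1:10,
  x1 = t^2,
  x2 = sqrt(t),
  y1 = sin(t),
  y2 = cos(t)
)

df_grid <- df %>%
  pivot_longer(-t, names_to = "series", values_to = "value") %>%
  # Split "x1" into var = "x" and idx = "1" at character position 1.
  separate(series, into = c("var", "idx"), sep = 1)

# Rows: x / y (colour); columns: 1 / 2 (linetype).
p <- ggplot(df_grid, aes(t, value, colour = var, linetype = idx)) +
  geom_line() +
  facet_grid(var ~ idx, scales = "free_y")
```

Note that facet_grid() frees the y scale per row only, so x1 and x2 share an axis; if each panel needs its own scale, facet_wrap(var ~ idx, scales = "free_y") is an alternative.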