I have a huge data frame and I would like to make some plots to get an idea of the associations among different variables. I cannot use
pairs(data)
, because that would give me 400+ plots. However, there's one response variable y I'm particularly interested in. Thus, I'd like to plot y against all variables, which would reduce the number of plots from n^2 to n. How can I do it?
EDIT: I add an example for the sake of clarity. Let's say I have the dataframe
foo=data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
and my response variable is x3. Then I'd like to generate four plots arranged in a row, respectively x1 vs x3, x2 vs x3, an histogram of x3 and finally x4 vs x3. I know how to make each plot
plot(foo$x1,foo$x3)
plot(foo$x2,foo$x3)
hist(foo$x3)
plot(foo$x4,foo$x3)
However I have no idea how to arrange them in a row. Also, it would be great if there was a way to automatically make all the n plots, without having to call the command plot (or hist) each time. When n=4, it's not that big of an issue, but I usually deal with n=20+ variables, so it can be a drag.
Could do reshape2/ggplot2/gridExtra packages combination. This way you don't need to specify the number of plots. This code will work on any number of explaining variables without any modifications
foo <- data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
library(reshape2)
foo2 <- melt(foo, "x3")
library(ggplot2)
p1 <- ggplot(foo2, aes(value, x3)) + geom_point() + facet_grid(.~variable)
p2 <- ggplot(foo, aes(x = x3)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The package tidyr helps doing this efficiently. please refer here for more options
data %>%
gather(-y_value, key = "some_var_name", value = "some_value_name") %>%
ggplot(aes(x = some_value_name, y = y_value)) +
geom_point() +
facet_wrap(~ some_var_name, scales = "free")
you would get something like this
If your goal is only to get an idea of the associations among different variables, you can also use:
plot(y~., data = foo)
It is not as nice as using ggplot and it doesn't automatically put all the graphs in one window (although you can change that using par(mfrow = c(a, b)), but it is a quick way to get what you want.
I faced the same problem, and I don't have any experience of ggplot2, so I created a function using plot which takes the data frame, and the variables to be plotted as arguments and generate graphs.
dfplot <- function(data.frame, xvar, yvars=NULL)
{
df <- data.frame
if (is.null(yvars)) {
yvars = names(data.frame[which(names(data.frame)!=xvar)])
}
if (length(yvars) > 25) {
print("Warning: number of variables to be plotted exceeds 25, only first 25 will be plotted")
yvars = yvars[1:25]
}
#choose a format to display charts
ncharts <- length(yvars)
nrows = ceiling(sqrt(ncharts))
ncols = ceiling(ncharts/nrows)
par(mfrow = c(nrows,ncols))
for(i in 1:ncharts){
plot(df[,xvar],df[,yvars[i]],main=yvars[i], xlab = xvar, ylab = "")
}
}
Notes:
You can provide the list of variables to be plotted as yvars,
otherwise it will plot all (or first 25, whichever is less) the variables in the data frame against xvar.
Margins were going out of bounds if the number of plots exceeds 25,
so I kept a limit to plot 25 charts only. Any suggestions to nicely
handle this are welcome.
Also the y axis labels are removed as titles of the graphs take care
of it. x axis label is set to xvar.
Related
Using a dataset, I have created the following plot:
I'm trying to create the following plot:
Specifically, I am trying to incorporate Twitter names over the first image. To do this, I have a dataset with each name in and a value that corresponds to a point on the axes. A snippet looks something like:
Name Score
#tedcruz 0.108
#RealBenCarson 0.119
Does anyone know how I can plot this data (from one CSV file) over my original graph (which is constructed from data in a different CSV file)? The reason that I am confused is because in ggplot2, you specify the data you want to use at the start, so I am not sure how to incorporate other data.
Thank you.
The question you ask about ggplot combining source of data to plot different element is answered in this post here
Now, I don't know for sure how this is going to apply to your specific data. Here I want to show you an example that might help you to go forward.
Imagine we have two data.frames (see bellow) and we want to obtain a plot similar to the one you presented.
data1 <- data.frame(list(
x=seq(-4, 4, 0.1),
y=dnorm(x = seq(-4, 4, 0.1))))
data2 <- data.frame(list(
"name"=c("name1", "name2"),
"Score" = c(-1, 1)))
The first step is to find the "y" coordinates of the names in the second data.frame (data2). To do this I added a y column to data2. y is defined here as a range of points from the may value of y to the min value of y with some space for aesthetics.
range_y = max(data1$y) - min(data1$y)
space_y = range_y * 0.05
data2$y <- seq(from = max(data1$y)-space, to = min(data1$y)+space, length.out = nrow(data2))
Then we can use ggplot() to plot data1 and data2 following some plot designs. For the current example I did this:
library(ggplot2)
p <- ggplot(data=data1, aes(x=x, y=y)) +
geom_point() + # for the data1 just plot the points
geom_pointrange(data=data2, aes(x=Score, y=y, xmin=Score-0.5, xmax=Score+0.5)) +
geom_text(data = data2, aes(x = Score, y = y+(range_y*0.05), label=name))
p
which gave this following plot:
facet_grid and facet_wrap have the scales parameter, which as far as I know allows each plot to adjust the scales of the x and/or y axis to the data being plotted. Since according to the grammar of ggplot x and y are just two among many aesthetics, and there's a scale for each aesthetic, I figured it would be reasonable to have the option of letting each aesthetic be free, but so far I didn't find a way to do it.
I was trying to set it in particular for the Size, since sometimes a variables lives in a different order of magnitude depending on the group I'm using for the facet, and having the same scale for every group blocks the possibility of seeing within-group variation.
A reproducible example:
set.seed(1)
x <- runif(20,0,1)
y <- runif(20,0,1)
groups <- c(rep('small', 10), rep('big', 10))
size_small <- runif(10,0,1)
size_big <- runif(10,0,1) * 1000
df <- data.frame(x, y, groups, sizes = c(size_small, size_big))
And an auxiliary function for plotting:
basic_plot <- function(df) ggplot(df) +
geom_point(aes(x, y, size = sizes, color = groups)) +
scale_color_manual(values = c('big' = 'red', 'small' = 'blue')) +
coord_cartesian(xlim=c(0,1), ylim=c(0,1))
If I we plot the data as is, we get the following:
basic_plot(df)
Non faceted plot
The blue dots are relatively small, but there is nothing we can do.
If we add the facet:
basic_plot(df) +
facet_grid(~groups, scales = 'free')
Faceted plot
The blue dots continue being small. But I would like to take advantage of the fact that I'm dividing the data in two, and allow the size scale to adjust to the data of each plot. I would like to have something like the following:
plot_big <- basic_plot(df[df$groups == 'big',])
plot_small <- basic_plot(df[df$groups == 'small',])
grid.arrange(plot_big, plot_small, ncol = 2)
What I want
Can it be done without resorting to this kind of micromanaging, or a manual rescaling of the sizes like the following?
df %>%
group_by(groups) %>%
mutate(maximo = max(sizes),
sizes = scale(sizes, center = F)) %>%
basic_plot() +
facet_grid(~groups)
I can manage to do those things, I'm just trying to see if I'm not missing another option, or if I'm misunderstanding the grammar of graphics.
Thank you for your time!
As mentioned, original plot aesthetics are maintained when calling facet_wrap. Since you need grouped graphs, consider base::by (the subsetting data frame function) wrapped in do.call:
do.call(grid.arrange,
args=list(grobs=by(df, df$groups, basic_plot),
ncol=2,
top="Grouped Point Plots"))
Should you need to share a legend, I always use this wrapper from #Steven Lockton's answer
do.call(grid_arrange_shared_legend, by(df, df$groups, basic_plot))
I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionaly the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any work around would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
The problem only lies int the use of test_dat$y inside the color aes. Never use $ in aes, ggplot will mess up.
Anyway, I think you plot would improve if you use a geom_hline for the benchmark, instead of hacking in a single value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")
I started learning R for data analysis and, most importantly, for data visualisation.
Since I am still in the switching process, I am trying to reproduce the activities I was doing with Graphpad Prism or Origin Pro in R. In most of the cases everything was smooth, but I could not find a smart solution for plotting multiple y columns in a single graph.
What I usually get from the softwares I use for data visualisations look like this:
Each single black trace is a measurement, and I would like to obtain the same plot in R. In Prism or Origin, this will take a single copy-paste in a XY graph.
I exported the matrix of data (one X, which indicates the time, and multiple Y values, which are the traces you see in the image).
I imported my data in R with the following commands:
library(ggplot2) #loaded ggplot2
Data <- read.csv("Directory/File.txt", header=F, sep="") #imported data
DF <- data.frame(Data) #transformed data into data frame
If I plot my data now, I obtain a series of columns, where the first one (called V1) is the X axis and all the others (V2 to V140) are the traces I want to put on the same graph.
To plot the data, I tried different solutions:
ggplot(data=DF, aes(x=DF$V1, y=DF[V2:V140]))+geom_line()+theme_bw() #did not work
plot(DF, xy.coords(x=DF$V1, y=DF$V2:V140)) #gives me an error
plot(DF, xy.coords(x=V1, y=c(V2:V10))) #gives me an error
I tried the matplot, without success, following the EZH guide:
The code I used is the following: matplot(x=DF$V1, type="l", lty = 2:100)
The only solution I found would be to individually plot a command for each single column, but it is a crazy solution. The number of columns varies among my data, and manually enter commands for 140 columns is insane.
What would you suggest?
Thank you in advance.
Here there are also some data attached.Data: single X, multiple Y
I tried using the matplot(). I used a very sample data which has no trend at all. so th eoutput from my code shall look terrible, but my main focus is on the code. Since you have already tried matplot() ,just recheck with below solution if you had done it right!
set.seed(100)
df = matrix(sample(1:685765,50000,replace = T),ncol = 100)
colnames(df)=c("x",paste0("y", 1:99))
dt=as.data.frame(df)
matplot(dt[["x"]], y = dt[,c(paste0("y",1:99))], type = "l")
If you want to plot in base R, you have to make a plot and add lines one at a time, however that isn't hard to do.
we start by making some sample data. Since the data in the link seemed to all be on the same scale, I will assume your data frame only has y values and the x value is stored separately.
plotData <- as.data.frame(matrix(sort(rnorm(500)),ncol = 5))
xval <- sort(sample(200, 100))
Now we can initialize a plot with the first column.
plot(xval, plotData[[1]], type = "l",
ylim = c(min(plotData), max(plotData)))
type = "l" makes a line plot instead of a scatter plot
ylim = c(min(plotData), max(plotData)) makes sure the y-axis will fit all the data.
Now we can add the rest of the values.
apply(plotData[-1], 2, lines, x = xval)
plotData[-1] removes the column we already plotted,
apply function with 2 as the second parameter means we want to execute a function on every column,
lines defines the function we are applying to the columns. lines adds a new line to the current plot.
x = xval passes an extra parameter (x) to the lines function.
if you wat to plot the data using ggplot2, the data should be transformed to long format;
library(ggplot2)
library(reshape2)
dat <- read.delim('AP.txt', header = F)
# plotting only first 9 traces
# my rstudio will crach if I plot the full data;
df <- melt(dat[1:10], id.vars = 'V1')
ggplot(df, aes(x = V1, y = value, color = variable)) + geom_line()
# if you want all traces to be in same colour, you can use
ggplot(df, aes(x = V1, y = value, group = variable)) + geom_line()
I am new to R. Forgive me if this if this question has an obvious answer but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data=data, aes(x = AGE,y = BW, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_line()
require(plyr)
dlply(mtcars, .(cyl), function(x) p %+% x)
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[1]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page. This is working with the list of plots named plots. It saves them all to one document, one plot per page. I didn't change any of the defaults in pdf, but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. This is done in do, and the output will have a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
group_by(cyl) %>%
do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>
cyl plots
1 4 <S3:gg, ggplot>
2 6 <S3:gg, ggplot>
3 8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
d_ply(dat, .var = "participant_id", .fun = function(x) {
# Generate the desired plot
ggplot(x, aes(x = phase, y = result)) +
geom_point() +
geom_line()
# Save it to a file named after the participant
# Putting it in a subdirectory is prudent
ggsave(file.path("plots", paste0(x$participant_id, ".png")))
})
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
lapply(temp, function(X) ggplot(X, ...))
Where X is your subsetted data
Keep in mind you may have to explicitly print the ggplot object (print(ggplot(X, ..)))