I started learning R for data analysis and, most importantly, for data visualisation.
Since I am still in the switching process, I am trying to reproduce the activities I was doing with Graphpad Prism or Origin Pro in R. In most of the cases everything was smooth, but I could not find a smart solution for plotting multiple y columns in a single graph.
What I usually get from the softwares I use for data visualisations look like this:
Each single black trace is a measurement, and I would like to obtain the same plot in R. In Prism or Origin, this will take a single copy-paste in a XY graph.
I exported the matrix of data (one X, which indicates the time, and multiple Y values, which are the traces you see in the image).
I imported my data in R with the following commands:
library(ggplot2) #loaded ggplot2
Data <- read.csv("Directory/File.txt", header=F, sep="") #imported data
DF <- data.frame(Data) #transformed data into data frame
If I plot my data now, I obtain a series of columns, where the first one (called V1) is the X axis and all the others (V2 to V140) are the traces I want to put on the same graph.
To plot the data, I tried different solutions:
ggplot(data=DF, aes(x=DF$V1, y=DF[V2:V140]))+geom_line()+theme_bw() #did not work
plot(DF, xy.coords(x=DF$V1, y=DF$V2:V140)) #gives me an error
plot(DF, xy.coords(x=V1, y=c(V2:V10))) #gives me an error
I tried the matplot, without success, following the EZH guide:
The code I used is the following: matplot(x=DF$V1, type="l", lty = 2:100)
The only solution I found would be to individually plot a command for each single column, but it is a crazy solution. The number of columns varies among my data, and manually enter commands for 140 columns is insane.
What would you suggest?
Thank you in advance.
Here there are also some data attached.Data: single X, multiple Y
I tried using the matplot(). I used a very sample data which has no trend at all. so th eoutput from my code shall look terrible, but my main focus is on the code. Since you have already tried matplot() ,just recheck with below solution if you had done it right!
set.seed(100)
df = matrix(sample(1:685765,50000,replace = T),ncol = 100)
colnames(df)=c("x",paste0("y", 1:99))
dt=as.data.frame(df)
matplot(dt[["x"]], y = dt[,c(paste0("y",1:99))], type = "l")
If you want to plot in base R, you have to make a plot and add lines one at a time, however that isn't hard to do.
we start by making some sample data. Since the data in the link seemed to all be on the same scale, I will assume your data frame only has y values and the x value is stored separately.
plotData <- as.data.frame(matrix(sort(rnorm(500)),ncol = 5))
xval <- sort(sample(200, 100))
Now we can initialize a plot with the first column.
plot(xval, plotData[[1]], type = "l",
ylim = c(min(plotData), max(plotData)))
type = "l" makes a line plot instead of a scatter plot
ylim = c(min(plotData), max(plotData)) makes sure the y-axis will fit all the data.
Now we can add the rest of the values.
apply(plotData[-1], 2, lines, x = xval)
plotData[-1] removes the column we already plotted,
apply function with 2 as the second parameter means we want to execute a function on every column,
lines defines the function we are applying to the columns. lines adds a new line to the current plot.
x = xval passes an extra parameter (x) to the lines function.
if you wat to plot the data using ggplot2, the data should be transformed to long format;
library(ggplot2)
library(reshape2)
dat <- read.delim('AP.txt', header = F)
# plotting only first 9 traces
# my rstudio will crach if I plot the full data;
df <- melt(dat[1:10], id.vars = 'V1')
ggplot(df, aes(x = V1, y = value, color = variable)) + geom_line()
# if you want all traces to be in same colour, you can use
ggplot(df, aes(x = V1, y = value, group = variable)) + geom_line()
Related
Using a dataset, I have created the following plot:
I'm trying to create the following plot:
Specifically, I am trying to incorporate Twitter names over the first image. To do this, I have a dataset with each name in and a value that corresponds to a point on the axes. A snippet looks something like:
Name Score
#tedcruz 0.108
#RealBenCarson 0.119
Does anyone know how I can plot this data (from one CSV file) over my original graph (which is constructed from data in a different CSV file)? The reason that I am confused is because in ggplot2, you specify the data you want to use at the start, so I am not sure how to incorporate other data.
Thank you.
The question you ask about ggplot combining source of data to plot different element is answered in this post here
Now, I don't know for sure how this is going to apply to your specific data. Here I want to show you an example that might help you to go forward.
Imagine we have two data.frames (see bellow) and we want to obtain a plot similar to the one you presented.
data1 <- data.frame(list(
x=seq(-4, 4, 0.1),
y=dnorm(x = seq(-4, 4, 0.1))))
data2 <- data.frame(list(
"name"=c("name1", "name2"),
"Score" = c(-1, 1)))
The first step is to find the "y" coordinates of the names in the second data.frame (data2). To do this I added a y column to data2. y is defined here as a range of points from the may value of y to the min value of y with some space for aesthetics.
range_y = max(data1$y) - min(data1$y)
space_y = range_y * 0.05
data2$y <- seq(from = max(data1$y)-space, to = min(data1$y)+space, length.out = nrow(data2))
Then we can use ggplot() to plot data1 and data2 following some plot designs. For the current example I did this:
library(ggplot2)
p <- ggplot(data=data1, aes(x=x, y=y)) +
geom_point() + # for the data1 just plot the points
geom_pointrange(data=data2, aes(x=Score, y=y, xmin=Score-0.5, xmax=Score+0.5)) +
geom_text(data = data2, aes(x = Score, y = y+(range_y*0.05), label=name))
p
which gave this following plot:
I create a dummy timeseries xts object with missing data on date 2-09-2015 as:
library(xts)
library(ggplot2)
library(scales)
set.seed(123)
seq <- seq(as.POSIXct("2015-09-01"),as.POSIXct("2015-09-02"), by = "1 hour")
ob1 <- xts(rnorm(length(seq),150,5),seq)
seq2 <- seq(as.POSIXct("2015-09-03"),as.POSIXct("2015-09-05"), by = "1 hour")
ob2 <- xts(rnorm(length(seq2),170,5),seq2)
final_ob <- rbind(ob1,ob2)
plot(final_ob)
# with ggplot
df <- data.frame(time = index(final_ob), val = coredata(final_ob) )
ggplot(df, aes(time, val)) + geom_line()+ scale_x_datetime(labels = date_format("%Y-%m-%d"))
After plotting my data looks like this:
The red coloured rectangular portion represents the date on which data is missing. How should I show that data was missing on this day in the main plot?
I think I should show this missing data with a different colour. But, I don't know how should I process data to reflect the missing data behaviour in the main plot.
Thanks for the great reproducible example.
I think you are best off to omit that line in your "missing" portion. If you have a straight line (even in a different colour) it suggests that data was gathered in that interval, that happened to fall on that straight line. If you omit the line in that interval then it is clear that there is no data there.
The problem is that you want the hourly data to be connected by lines, and then no lines in the "missing data section" - so you need some way to detect that missing data section.
You have not given a criteria for this in your question, so based on your example I will say that each line on the plot should consist of data at hourly intervals; if there's a break of more than an hour then there should be a new line. You will have to adjust this criteria to your specific problem. All we're doing is splitting up your dataframe into bits that get plotted by the same line.
So first create a variable that says which "group" (ie line) each data is in:
df$grp <- factor(c(0, cumsum(diff(df$time) > 1)))
Then you can use the group= aesthetic which geom_line uses to split up lines:
ggplot(df, aes(time, val)) + geom_line(aes(group=grp)) + # <-- only change
scale_x_datetime(labels = date_format("%Y-%m-%d"))
I am trying to build from a question similar to mine (and from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord or somehow add two layers of ggparcoord on top of the other. The detailed description of the problem and format of data I have for the solution to work is provided below.
I have a dataset with thousand of lines, lets call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaying x I use the following code. First I add y to x such that the 5 rows are on the bottom of the final dataframe. This ensures ggparcoord will put them last and therefore stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the question advice I referred such that I can color differently the centroids and therefore can tell it apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck with is from this stage and onwards. On my original data, plotting solely x doesn't lead to much insight since it is a heavy load of lines (on this data this is equivalent to using ggparcoord above on x instead of df:
By reducing alphaLines considerably (0.05), I can naturally see some clusters due to the overlapping of the lines (this is again running ggparcoord on x reducing alphaLines):
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything it is on a single dataframe, applying such a high value for alphaLine makes the centroid lines disappear. My only option is then to use ggparcoord (as provided above) on df without decreasing the alphaValue:
My goal is to have the red lines (centroid lines) on top of the second figure with very low alpha. There are two ways I thought so far but couldn't get it working:
(1) Is there any way to create a column on the dataframe, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoords and "sum them up" hoping to overlay but an error was raised.
The question may contain too much detail, but I thought this could motivate better the applicability of the answer to serve the interest of other readers.
The answer I am looking for would use the provided data variables on the current format and generate the plot I am looking for. Better ways to reconstruct the data is also welcomed, but using the current structure is preferred.
In this case I think it easier to just use ggplot, and build the graph yourself. We make slight adjustments to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can now map any attribute to cluster that you like.
library(dplyr)
library(tidyr)
# I start the same as you
x <- data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
geom_line() +
scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
theme_minimal()
I have a huge data frame and I would like to make some plots to get an idea of the associations among different variables. I cannot use
pairs(data)
, because that would give me 400+ plots. However, there's one response variable y I'm particularly interested in. Thus, I'd like to plot y against all variables, which would reduce the number of plots from n^2 to n. How can I do it?
EDIT: I add an example for the sake of clarity. Let's say I have the dataframe
foo=data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
and my response variable is x3. Then I'd like to generate four plots arranged in a row, respectively x1 vs x3, x2 vs x3, an histogram of x3 and finally x4 vs x3. I know how to make each plot
plot(foo$x1,foo$x3)
plot(foo$x2,foo$x3)
hist(foo$x3)
plot(foo$x4,foo$x3)
However I have no idea how to arrange them in a row. Also, it would be great if there was a way to automatically make all the n plots, without having to call the command plot (or hist) each time. When n=4, it's not that big of an issue, but I usually deal with n=20+ variables, so it can be a drag.
Could do reshape2/ggplot2/gridExtra packages combination. This way you don't need to specify the number of plots. This code will work on any number of explaining variables without any modifications
foo <- data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
library(reshape2)
foo2 <- melt(foo, "x3")
library(ggplot2)
p1 <- ggplot(foo2, aes(value, x3)) + geom_point() + facet_grid(.~variable)
p2 <- ggplot(foo, aes(x = x3)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The package tidyr helps doing this efficiently. please refer here for more options
data %>%
gather(-y_value, key = "some_var_name", value = "some_value_name") %>%
ggplot(aes(x = some_value_name, y = y_value)) +
geom_point() +
facet_wrap(~ some_var_name, scales = "free")
you would get something like this
If your goal is only to get an idea of the associations among different variables, you can also use:
plot(y~., data = foo)
It is not as nice as using ggplot and it doesn't automatically put all the graphs in one window (although you can change that using par(mfrow = c(a, b)), but it is a quick way to get what you want.
I faced the same problem, and I don't have any experience of ggplot2, so I created a function using plot which takes the data frame, and the variables to be plotted as arguments and generate graphs.
dfplot <- function(data.frame, xvar, yvars=NULL)
{
df <- data.frame
if (is.null(yvars)) {
yvars = names(data.frame[which(names(data.frame)!=xvar)])
}
if (length(yvars) > 25) {
print("Warning: number of variables to be plotted exceeds 25, only first 25 will be plotted")
yvars = yvars[1:25]
}
#choose a format to display charts
ncharts <- length(yvars)
nrows = ceiling(sqrt(ncharts))
ncols = ceiling(ncharts/nrows)
par(mfrow = c(nrows,ncols))
for(i in 1:ncharts){
plot(df[,xvar],df[,yvars[i]],main=yvars[i], xlab = xvar, ylab = "")
}
}
Notes:
You can provide the list of variables to be plotted as yvars,
otherwise it will plot all (or first 25, whichever is less) the variables in the data frame against xvar.
Margins were going out of bounds if the number of plots exceeds 25,
so I kept a limit to plot 25 charts only. Any suggestions to nicely
handle this are welcome.
Also the y axis labels are removed as titles of the graphs take care
of it. x axis label is set to xvar.
I am trying to do the comparison of my observed and modeled data sets for two stations. One station is called station "red" and another is called "blue". I was able to create the facets but when I tried to add two series in one facet, only one facet got updated while other didn't.
This means for blue only one series is plotted and for red two series are plotted.
The code I used is as follows:
# install.packages("RCurl", dependencies = TRUE)
require(RCurl)
out <- postForm("https://dl.dropbox.com/s/ainioj2nn47sis4/watersurf1.csv?dl=1", format="csv")
watersurf <- read.csv(textConnection(out))
watersurf[1:100,]
watersurf$coupleid <- factor(rep(unlist(by(watersurf$id,watersurf$group1,
function(x) {ave(as.numeric(unique(x)),FUN=seq_along)}
)),each=6239))
p <- ggplot(data=watersurf,aes(x=time,y=data,group=id))+geom_line(aes(linetype=group1),size=1)+facet_wrap(~coupleid)
p
Is it also possible to add a third series in the graph but of unequal length (i.e not same interval)?
The output is
I followed the example on this page to create the graphs.
http://www.ats.ucla.edu/stat/r/faq/growth.htm
Is this what you are looking for,
ggplot(data = watersurf, aes( x = time, y = data))
+ geom_line(aes(linetype = group1, colour = group1), size = 0.2)
+ facet_wrap(~ id)