plot lines using qplot - r

I want to plot multiple lines on the same plot using qplot in the ggplot2 package.
But I'm having some problems with it.
Using the old plot and lines functions, I would do something like
m<-cbind(1:4,5:8,-(5:8))
colnames(m)<-c("time","y1","y2")
m<-as.data.frame(m)
> m
time y1 y2
1 1 5 -5
2 2 6 -6
3 3 7 -7
4 4 8 -8
plot(x=m$time,y=m$y1,type='l',ylim=range(m[,-1]))
lines(x=m$time,y=m$y2)
Thanks

Using the reshape package to melt m:
library(reshape)
library(ggplot2)
m2 <- melt(m, id = "time")
p <- ggplot(m2, aes(x = time, y = value, color = variable))
p + geom_line() + ylab("y")
You could rename the columns in the new data.frame to your liking. The trick here is to have a factor that denotes each of the lines you want to plot.
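If you would rather not load reshape, the same long format can be built with base R. The following is only a sketch: it uses stack() in place of melt(), and the column names value and variable are chosen here to match the melt() output. The plot is drawn only if ggplot2 is installed.

```r
# Same toy data as in the question
m <- data.frame(time = 1:4, y1 = 5:8, y2 = -(5:8))

# stack() puts y1 and y2 into one column ("values") and records the
# source column in a factor ("ind"); time is repeated once per series
m_long <- data.frame(time = rep(m$time, times = 2),
                     stack(m[, c("y1", "y2")]))
names(m_long) <- c("time", "value", "variable")

# One line per level of the factor `variable`
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(m_long, aes(x = time, y = value, color = variable)) +
    geom_line() +
    ylab("y")
  print(p)
}
```

Either way, the point is the same: one row per (time, series) pair, plus a factor that tells ggplot2 which line each row belongs to.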

Related

geom_line with x-axis in order of appearance

This is the first question I have posted here, so please excuse me if I don't provide all the information right away.
I'm trying to build a line graph with two lines:
y1 <- c(1000,1500,1000,1500,2000,3000,4000)
y2 <- c(1100,1400,900,1500,2000,2500,3500)
x <- c(49,50,51,1,2,49,50)
df <- data.frame(y1,y2,x)
Imagine x being calendar weeks; I skipped the weeks between 3 and 48 of the second year.
Now I want to build a line graph that displays the x-axis values (time series) in this order.
First I tried a really simple approach:
p <- ggplot()
p <- p + geom_line(data=df,aes(x=x,y=y1))
p <- p + geom_line(data=df,aes(x=x,y=y2), color = "red")
p
Problem: R sorts the x values and also lumps identical week numbers together.
I then tried to change the x values to make them unique, e.g. 49/19, 50/19, but R still changes the order. The same happens if I use geom_path instead of geom_line.
I then tried to change x to a factor and use scale_x_discrete, but I couldn't figure out how to do it; either the lines or the x labels were always missing.
I hope you can give me some kind of advice.
Many thanks,
Andre
You can prefix your x value with the year and pad the week number with a zero using str_pad() from stringr, so that the labels sort correctly from 01 all the way to 52:
library(tidyr)
library(stringr)
library(ggplot2)
df$week = paste(rep(c("2019","2020"),c(3,4)),
str_pad(df$x,2,pad="0"),sep="_")
Pivot this to long format, so that we get a legend:
pivot_longer(df[,c("week","y1","y2")],-week)
# A tibble: 14 x 3
week name value
<chr> <chr> <dbl>
1 2019_49 y1 1000
2 2019_49 y2 1100
3 2019_50 y1 1500
4 2019_50 y2 1400
5 2019_51 y1 1000
6 2019_51 y2 900
7 2020_01 y1 1500
8 2020_01 y2 1500
9 2020_02 y1 2000
Then use this directly in ggplot:
ggplot(pivot_longer(df[,c("week","y1","y2")],-week),
aes(x=week,y=value,group=name,col=name)) +
geom_line() + scale_color_manual(values=c("black","red"))
One approach is to replace x with a sequence of integers and then apply the x-axis labels afterwards.
library(ggplot2)
ggplot(data = df, aes(x = seq(1,nrow(df)))) +
geom_line(aes(y=y1)) +
geom_line(aes(y=y2), color = "red") +
scale_x_continuous(breaks = seq(1,nrow(df)),
labels = as.character(df$x)) +
labs(x = "Week")
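The factor route mentioned in the question can also be made to work. Below is a minimal sketch, assuming (as in the first answer) that the first three weeks belong to 2019 and the remaining four to 2020; the label column and the week/year format are made up for illustration. The two ingredients are factor levels in order of appearance and group = 1, which tells geom_line to connect points across the discrete levels.

```r
y1 <- c(1000, 1500, 1000, 1500, 2000, 3000, 4000)
y2 <- c(1100, 1400, 900, 1500, 2000, 2500, 3500)
x  <- c(49, 50, 51, 1, 2, 49, 50)
df <- data.frame(y1, y2, x)

# Assumption: weeks 1-3 of the vector are in 2019, the rest in 2020
df$label <- paste(df$x, rep(c("19", "20"), c(3, 4)), sep = "/")

# Levels in order of appearance, so the axis keeps the original sequence
df$label <- factor(df$label, levels = unique(df$label))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # group = 1 puts all points in one group so they are joined by a line
  p <- ggplot(df, aes(x = label, group = 1)) +
    geom_line(aes(y = y1)) +
    geom_line(aes(y = y2), color = "red") +
    labs(x = "Week", y = "Value")
  print(p)
}
```

Without group = 1, each factor level forms its own group of a single point, which is probably why the lines went missing.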

R: a plot to visualise correlation of two groups with one measurement?

I have two groups with one measurement variable.
I would like to plot them on one graph to see if they show a correlation or they overlap.
The measurement for both groups is on the same scale.
I thought of doing a scatter plot, but in this case, I thought it would just give me a straight line as I only have one measurement.
Could I get some ideas and suggestions please?
You can unstack the data.
set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2,100))
head(df)
var gp
1 37.92934 1
2 52.77429 1
3 60.84441 1
4 26.54302 1
5 54.29125 1
6 55.06056 1
unstack(df)
X1 X2
1 37.92934 54.14524
2 52.77429 45.25282
3 60.84441 50.65993
4 26.54302 44.97522
5 54.29125 41.74001
6 55.06056 51.66989
And then plot this.
library(ggplot2)
library(dplyr)
unstack(df) %>% ggplot(aes(x=X1, y=X2)) +
geom_point() +
geom_smooth(method="lm")
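One caveat: unstack() pairs the groups purely by row position, so the scatter plot and any correlation only mean something if the i-th value in group 1 really belongs with the i-th value in group 2. A rough base R sketch of both views, using the same simulated data (the names wide and r are just for illustration):

```r
set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2, 100))

# One column per group; rows are paired by position only
wide <- unstack(df, var ~ gp)

# Correlation of the paired values (meaningful only for genuinely paired data)
r <- cor(wide$X1, wide$X2)
print(r)

# If the observations are not paired, compare the two distributions instead
boxplot(var ~ gp, data = df, xlab = "group", ylab = "measurement")
```

If the data are not paired, the boxplot (or overlaid histograms) answers the "do they overlap" part of the question more honestly than a scatter plot.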

K-Means Clustering Method

Description of the data
I am trying to produce in R a suitable graphical display of the cluster means.
How can I place the attributes on the x-axis and treat the means for each cluster as trajectories over the items?
All the data is continuous.
What about the following approach: since your variables are on a similar measurement scale (e.g. a Likert scale), you could show the distribution of each variable within each cluster (e.g. with boxplots) and visually compare the distributions by using the same axis limits for every cluster.
This can be accomplished by putting your data in a suitable format and using the ggplot2 package to generate the plot, as shown below.
Step 1: Generate simulated data to mimic the numeric data you have
The generated data contains four non-negative integer variables and a cluster variable with 3 clusters.
library(ggplot2)
set.seed(1717) # make the simulated data repeatable
N = 100
nclusters = 3
cluster = as.numeric( cut(rnorm(N), breaks=nclusters, label=seq_len(nclusters)) )
df = data.frame(cluster=cluster,
x1=floor(cluster + runif(N)*5),
x2=floor(runif(N)*5),
x3=floor((nclusters-cluster) + runif(N)*5),
x4=floor(cluster + runif(N)*5))
df$cluster = factor(df$cluster) # define cluster as factor to ease plotting code below
tail(df)
table(df$cluster)
whose output is:
cluster x1 x2 x3 x4
95 2 5 2 5 2
96 3 5 4 0 3
97 3 3 3 1 7
98 2 5 4 3 3
99 3 6 1 1 7
100 3 5 1 2 5
1 2 3
15 64 21
i.e., out of the 100 simulated cases, the data contains 15 cases in cluster 1, 64 cases in cluster 2 and 21 cases in cluster 3.
Step 2: Prepare the data for plotting
Here we use reshape() from the stats package to transpose the dataset from wide to long so that the four numeric variables (x1, x2, x3, x4) are placed into one single column, suitable for generating a boxplot for each of the four variables which are then grouped by the cluster variable.
vars2transpose = c("x1", "x2","x3", "x4")
df.long = reshape(df, direction="long", idvar="id",
varying=list(vars2transpose),
timevar="var", times=vars2transpose, v.names="value")
head(df.long)
table(df.long$cluster)
whose output is:
cluster var value id
1.x1 1 x1 5 1
2.x1 1 x1 3 2
3.x1 3 x1 5 3
4.x1 1 x1 1 4
5.x1 2 x1 3 5
6.x1 1 x1 2 6
1 2 3
60 256 84
Note that the number of cases in each cluster has increased 4-fold (i.e. the number of numeric variables) since the data is now in transposed long format.
Step 3: Create the variables' boxplots by cluster with line-connected means
We plot horizontal boxplots for each variable x1, x2, x3, x4 that show their distribution in each cluster, and mark the mean values with connected red crosses (the trajectories you are after).
gg <- ggplot(df.long, aes(x=var, y=value))
gg + facet_grid(cluster ~ ., labeller=label_both) +
geom_boxplot(aes(fill=cluster)) +
# fun replaces fun.y, which is deprecated since ggplot2 3.3.0
stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
which generates the following graph.
The graph might get packed with the many variables you have, so you may want to:
either show vertical boxplots by removing the last coord_flip() line
or remove the boxplots altogether and just show the connected red crosses by eliminating the geom_boxplot() line.
And if you want to compare each variable side by side among the different clusters, you can swap the grouping and x-axis variables as follows:
gg <- ggplot(df.long, aes(x=cluster, y=value))
gg + facet_grid(var ~ ., labeller=label_both) +
geom_boxplot(aes(group=cluster, fill=cluster)) +
# fun replaces fun.y, which is deprecated since ggplot2 3.3.0
stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
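If you only want the cluster means as trajectories over the variables, with no boxplots, you can also compute the means first and plot them directly. This is a sketch under the same simulation assumptions as above; means and means_long are names chosen here, and aggregate() plus stack() stand in for whatever summarising tool you prefer.

```r
set.seed(1717)  # same simulated data as in Step 1
N <- 100
nclusters <- 3
cluster <- cut(rnorm(N), breaks = nclusters, labels = FALSE)
df <- data.frame(cluster = factor(cluster),
                 x1 = floor(cluster + runif(N) * 5),
                 x2 = floor(runif(N) * 5),
                 x3 = floor((nclusters - cluster) + runif(N) * 5),
                 x4 = floor(cluster + runif(N) * 5))

# Mean of every variable within each cluster (one row per cluster)
means <- aggregate(. ~ cluster, data = df, FUN = mean)

# Long format: one row per cluster/variable pair
vars <- c("x1", "x2", "x3", "x4")
means_long <- data.frame(cluster = rep(means$cluster, times = length(vars)),
                         stack(means[, vars]))
names(means_long) <- c("cluster", "mean", "var")

# Variables on the x-axis, one trajectory per cluster
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(means_long, aes(x = var, y = mean,
                              group = cluster, color = cluster)) +
    geom_line() +
    geom_point()
  print(p)
}
```

This puts all clusters in one panel, which makes the trajectories easy to compare but hides the spread within each cluster.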

Two graphs on R with distinct values

Is there a way to plot 2 data frames in R even though they have different values? For example:
data1
[hour] [value]
1 5
2 4
3 3
4 4
data2
[hour] [value]
1 4
2 8
4 9
5 2
I would like to show the hours 1, 2, 3, 4, 5 on the x axis and the corresponding values on the y axis.
Thanks :)
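You just need two lines of code: plot() draws the first data set and points() adds the second one to the same axes. No par(new=TRUE) is needed, because points() already draws on the existing plot.
plot(data2$hour,data2$value,xlab='hour',ylab='value')
points(data1$hour,data1$value,pch=19)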
Hope it helps!
Try:
plot(data1, type="l", xlim = c(0,6), ylim = c(0,10))
lines(data2)
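Rather than hard-coding xlim and ylim, you can compute them from both data frames so that nothing gets clipped if the data change. A small sketch along those lines (colours, line types and legend position are arbitrary choices):

```r
data1 <- data.frame(hour = c(1, 2, 3, 4), value = c(5, 4, 3, 4))
data2 <- data.frame(hour = c(1, 2, 4, 5), value = c(4, 8, 9, 2))

# Axis limits that cover both data sets
xlim <- range(data1$hour, data2$hour)
ylim <- range(data1$value, data2$value)

# First series sets up the axes, second one is added on top
plot(data1$hour, data1$value, type = "b",
     xlim = xlim, ylim = ylim, xlab = "hour", ylab = "value")
lines(data2$hour, data2$value, type = "b", col = "red", lty = 2)
legend("topleft", legend = c("data1", "data2"),
       col = c("black", "red"), lty = c(1, 2), pch = 1)
```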
Here is one way.
library(ggplot2)
data1 <- data.frame(hour=c(1,2,3,4),value=c(5,4,3,4))
data2 <- data.frame(hour=c(1,2,4,5),value=c(4,8,9,2))
data3 <- rbind(data1,data2)
data3$data <- c(rep("data1",4),rep("data2",4))
#try this
ggplot(data3,aes(x=hour,y=value))+
geom_point()+
facet_wrap(~data)+
theme_bw()
Here is another way (with colours and lines):
#or this
ggplot(data3,aes(x=hour,y=value,col=data))+
geom_point()+
geom_line()+
theme_bw()
We can use:
par(mfrow=c(1,2))
With this setting, the next two plots are drawn side by side in one row, so each data frame gets its own panel.

force boxplots from geom_boxplot to constant width

I'm making a boxplot in which x and fill are mapped to different variables, a bit like this:
ggplot(mpg, aes(x=as.factor(cyl), y=cty, fill=as.factor(drv))) +
geom_boxplot()
As in the example above, the widths of my boxes come out differently at different x values, because I do not have all possible combinations of x and fill values.
I would like for all the boxes to be the same width. Can this be done (ideally without manipulating the underlying data frame, because I fear that adding fake data will cause me confusion during further analysis)?
My first thought was
+ geom_boxplot(width=0.5)
but this doesn't help; it adjusts the width of the full set of boxplots for a given x factor level.
This post almost seems relevant, but I don't quite see how to apply it to my situation. Using + scale_fill_discrete(drop=FALSE) doesn't seem to change the widths of the bars.
The problem arises because some combinations of factor levels are not present in the data. The number of data points for all combinations of the levels of cyl and drv can be checked via xtabs:
tab <- xtabs( ~ drv + cyl, mpg)
tab
# cyl
# drv 4 5 6 8
# 4 23 0 32 48
# f 58 4 43 1
# r 0 0 4 21
There are three empty cells. I will add fake data to override the visualization problems.
Check the range of the dependent variable (y-axis). The fake data needs to be out of this range.
range(mpg$cty)
# [1] 9 35
Create a subset of mpg with the data needed for the plot:
tmp <- mpg[c("cyl", "drv", "cty")]
Create an index for the empty cells:
idx <- which(tab == 0, arr.ind = TRUE)
idx
# row col
# r 3 1
# 4 1 2
# r 3 2
Create three fake lines (with -1 as value for cty):
fakeLines <- apply(idx, 1,
function(x)
setNames(data.frame(as.integer(dimnames(tab)[[2]][x[2]]),
dimnames(tab)[[1]][x[1]],
-1),
names(tmp)))
fakeLines
# $r
# cyl drv cty
# 1 4 r -1
#
# $`4`
# cyl drv cty
# 1 5 4 -1
#
# $r
# cyl drv cty
# 1 5 r -1
Add the rows to the existing data:
tmp2 <- rbind(tmp, do.call(rbind, fakeLines))
Plot:
library(ggplot2)
ggplot(tmp2, aes(x = as.factor(cyl), y = cty, fill = as.factor(drv))) +
geom_boxplot() +
coord_cartesian(ylim = c(min(tmp$cty - 3), max(tmp$cty) + 3))
# The axis limits have to be changed to suppress displaying the fake data.
You can now use the position_dodge() function with preserve = "single", which keeps every box at the same width without touching the data:
ggplot(mpg, aes(x=as.factor(cyl), y=cty, fill=as.factor(drv))) +
geom_boxplot(position = position_dodge(preserve = "single"))
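A closely related option is position_dodge2(), which also accepts preserve = "single". As far as I know, the practical difference is in placement: position_dodge() keeps the boxes aligned to the left within each x value, while position_dodge2() centres them. A sketch, assuming a ggplot2 version that provides position_dodge2():

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Same plot as in the question, but every box keeps the width of a
  # single box, even where a cyl/drv combination is missing
  p_dodge2 <- ggplot(mpg, aes(x = factor(cyl), y = cty, fill = drv)) +
    geom_boxplot(position = position_dodge2(preserve = "single"))
  print(p_dodge2)
}
```

Which of the two looks better is a matter of taste; neither requires adding fake rows to the data frame.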
Just use the facet_grid() function; it makes things a lot easier to visualize:
ggplot(mpg, aes(x=as.factor(drv), y=cty, fill=as.factor(drv))) +
geom_boxplot() +
facet_grid(.~cyl)
Note that I switched from x=as.factor(cyl) to x=as.factor(drv).
Once you have done this, you can change the way the strips are displayed and remove the margins between the panels, so it can easily be made to look like your expected display.
By the way, you don't even need as.factor() here: drv is a character column, which ggplot() already treats as discrete. Dropping it improves the readability of your code.
