Calculate the run length of a variable and plot with ggplot - r

I'm using ggplot to plot an ordered sequence of numbers that is colored by a factor. For example, given this fake data:
# Generate fake data
library(dplyr)
library(ggplot2)  # needed for the plots below
set.seed(12345)
plot.data <- data.frame(fitted = rnorm(20),
actual = sample(0:1, 20, replace=TRUE)) %>%
arrange(fitted)
head(plot.data)
fitted actual
1 -1.8179560 0
2 -0.9193220 1
3 -0.8863575 1
4 -0.7505320 1
5 -0.4534972 1
6 -0.3315776 0
I can easily plot the actual column from rows 1–20 as colored lines:
# Plot with lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1))
The gist of this plot is to show how often the actual numbers appear sequentially across the range of fitted values. As you can see in the image, sequential 0s and 1s are readily seen as sequential blue and red vertical lines.
However, I'd like to move away from the lines and use geom_rect instead to create bands for the sequential number. I can fake this with really thick lineranges:
# Fake rectangular regions with thick lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1), size=10)
But the size of these lines is dependent on the number of observations—if they're too thick, they'll overlap. Additionally, doing this means that there are a bunch of extraneous graphical elements that are plotted (i.e. sequential rectangular sections are really just a bunch of line segments that bleed into each other). It would be better to use geom_rect instead.
However, geom_rect requires that data include minimum and maximum values for x, meaning that I need to reshape actual to look something like this instead:
xmin xmax colour
0 1 red
1 5 blue
I need to programmatically calculate the run length of each color to mark the beginning and end of that color. I know that R has the rle() function, which is likely the best option for calculating the run length, but I'm unsure about how to split the run length into two columns (xmin and xmax).
What's the best way to calculate the run length of a variable so that geom_rect can plot it correctly?

Thanks to @baptiste, it seems that the best way to go about this is to condense the data into just those rows where actual changes:
condensed <- plot.data %>%
mutate(x = seq_along(actual), change = c(0, diff(actual))) %>%
subset(change != 0 ) %>% select(-change)
first.row <- plot.data[1,] %>% mutate(x = 0)
condensed.plot.data <- rbind(first.row, condensed) %>%
mutate(xmax = lead(x),
xmax = ifelse(is.na(xmax), max(x) + 1, xmax)) %>%
rename(xmin = x)
condensed.plot.data
# fitted actual xmin xmax
# 1 -1.8179560 0 0 2
# 2 -0.9193220 1 2 6
# 3 -0.3315776 0 6 9
# 4 -0.1162478 1 9 11
# 5 0.2987237 0 11 14
# 6 0.5855288 1 14 15
# 7 0.6058875 0 15 20
# 8 1.8173120 1 20 21
ggplot(condensed.plot.data) +
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=0, ymax=1, fill=factor(actual)))
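Since the question mentions rle(), the same kind of table can also be derived from it directly. The following is a minimal base R sketch, not the method used in the answer above. The vector actual is a small illustrative input (the first six values shown by head(plot.data)), and the names runs and rects are made up for this example. It follows the convention of the table in the question, where observation i spans i - 1 to i, so the boundaries differ slightly from those in condensed.plot.data.

```r
# Illustrative input: the first six values of plot.data$actual shown above
actual <- c(0, 1, 1, 1, 1, 0)

# rle() returns the length and the value of each run of identical values
runs <- rle(actual)

# The right edge of each run is the cumulative count of observations so far;
# the left edge is the right edge minus the run's own length
xmax <- cumsum(runs$lengths)
xmin <- xmax - runs$lengths

rects <- data.frame(xmin = xmin, xmax = xmax, actual = runs$values)
rects
```

A data frame like rects can then be passed to the same geom_rect(aes(xmin=xmin, xmax=xmax, ymin=0, ymax=1, fill=factor(actual))) call used above.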

Related

Can you change the shape of the points in ggeffects or sjPlot?

I can't figure out how to change the shape of the points in sjPlot or ggeffects.
In sjPlot, I have:
p<-plot_model(my.lm, type = "pred", terms = c("Var1", "Var2", "Var3"))
This makes the points of Var2 different colors, but I need them to be different shapes.
I can use ggeffects for this, but I still can't figure out how to make a graph equivalent to the plot_model one in sjPlot with different point shapes.
g<-ggpredict(my.lm, terms=c("Var1","Var2", "Var3"))
Thank you
There doesn't seem to be a simple way to do this (like an argument within plot_model()). So, it seems the only option here is the "nuclear option" - which is changing the values in the $data layer of the plot object. It gets a bit "hacky", but it does work.
I'll demonstrate based on an adaptation of an example in this vignette here.
library(ggplot2)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
data(efc)
y <- ifelse(efc$neg_c_7 < median(na.omit(efc$neg_c_7)), 0, 1)
df <- data.frame(
y = to_factor(y),
sex = to_factor(efc$c161sex),
dep = to_factor(efc$e42dep),
barthel = efc$barthtot,
education = to_factor(efc$c172code)
)
set_label(df$y) <- "High Negative Impact"
fit <- glm(y ~., data = df, family = binomial(link = "logit"))
p <- plot_model(fit, colors = "black")
p
Our goal is to change the shape of those points to something else. First step is to use ggplot_build() to pull the information from the object we just created, p.
qq <- ggplot_build(p)
The object, qq, is a list containing 3 layers: data, layout, and plot. Herein, we want to change the information contained in qq$data, which has the aesthetics after they have been mapped. qq$data is a list itself, containing 3 tables. Each of these tables corresponds to a particular geom in the underlying ggplot2 call. The first table qq$data[[1]], is for the vline geom. The third table qq$data[[3]] is for all the line segments behind the points (you'll note they have xmin, xmax, ymin, and ymax columns). The second table is the one we want, which is for the point geoms:
> qq$data[[2]]
colour fill x y PANEL group shape size alpha stroke
1 black black 7 0.28191638 1 7 19 2.5 NA 0.5
2 black black 6 0.21074532 1 6 19 2.5 NA 0.5
3 black black 5 0.48863767 1 5 19 2.5 NA 0.5
4 black black 4 0.39529220 1 4 19 2.5 NA 0.5
5 black black 3 -0.01294270 1 1 19 2.5 NA 0.5
6 black black 2 0.09837842 1 3 19 2.5 NA 0.5
7 black black 1 0.12316583 1 2 19 2.5 NA 0.5
You can see all points are made with shape=19. We can change that, then rebuild the plot from our modified qq object like so:
qq$data[[2]]$shape <- 5
plot(ggplot_gtable(qq))
You can also change all sorts of other things manually in the plot if you want:
qq$data[[1]]$colour <- "blue" # change color of vertical line
qq$data[[2]]$shape[3:5] <- 19 # points 3-5 are now circles again
qq$data[[2]]$colour[4:7] <- "red" # last 4 points are red (the column is named colour, not color)
qq$data[[2]]$size[c(2,4,6)] <- 4 # some points are bigger
plot(ggplot_gtable(qq))
A word on saving these plots to file, if you like to use ggsave(...): You cannot save the output of plot(ggplot_gtable(qq)) directly using ggsave(), since the default value for plot= in ggsave() is last_plot(), which does not work here when using plot(...). Therefore, if you want to save with ggsave(), you can do the following:
p <- ggplot_gtable(qq) # save to a large gtable
ggsave("name.png", plot = p) # reference the gtable object
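The ggplot_build() technique is not specific to sjPlot. Here is a minimal sketch of the same steps on a plain ggplot2 object, assuming only that ggplot2 is installed. The mtcars plot is a stand-in chosen for this example and has a single layer, so the points live in the first data table rather than the second.

```r
library(ggplot2)

# Any ggplot object works here; mtcars is only a stand-in for the model plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Build the plot so the computed per-layer data becomes accessible
built <- ggplot_build(p)

# This plot has a single layer (the points), so its data is the first element
built$data[[1]]$shape <- 5   # shape 5 is an open diamond

# Convert the modified build back into a drawable gtable and draw it
gt <- ggplot_gtable(built)
plot(gt)
```

Inspecting built$data before editing it, as done above with qq$data, is the safest way to find out which table belongs to which geom.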

Create an image filled chart in R using ggplot [duplicate]

I'm trying to create a chart like the image below; essentially it's a filled shape denoting a percentage (most likely an image of a human, but this could be anything theoretically).
I've managed to do this in Excel, albeit with a lot of botching of bar charts. But is there a way to do this in R, preferably using ggplot?
I've read the similar question Use an image as area fill in an R plot, which doesn't quite do the same thing, and I cannot envisage a solution using this method.
Any help is appreciated!
EDIT:
As pointed out, this has been answered using plotly: Use a custom icon in plotly's pie chart. Is this possible to do with ggplot?
Note I stole this image from a google image search for infographics.
You really only need to modify the plotting command from the answer that @PoGibas linked:
library(png)
library(ggplot2)
genderselection <- read.table(text="
Gender Freq
F 70
M 30
", header=T)
pcts <- round(prop.table(genderselection$Freq)*100)
# Load png file from imgur as binary
con <- url("https://i.imgur.com/vFDSFYX.png",
open='rb')
rawpng <- readBin(con, what='raw', n=50000)
close(con)
img <- readPNG(rawpng)
h <- dim(img)[1]
w <- dim(img)[2]
# Find the rows where the feet start and the head ends
pos1 <- which(apply(img[,,1], 1, function(y) any(y==1)))
mn1 <- min(pos1)
mx1 <- max(pos1)
pospctM <- round((mx1-mn1)*pcts[2]/100+mn1)
pospctF <- round((mx1-mn1)*pcts[1]/100+mn1)
# Fill bodies with a different color according to percentages
# Note that this relies on the fact that the png is a bitmap.
# The png is expressed as a matrix with a cell for each pixel
# and 3 layers for r,g,b.
dim(img)
#> [1] 360 360 3
# Create a 2d matrix by just taking the red values
# Image is black and white so black corresponds to 0
# white corresponds to 1. Then change the values of
# the cells to correspond to one of three categories.
imgmtx <- img[h:1,,1]
whitemtx <- (imgmtx==1)
colmtx <- matrix(rep(FALSE,h*w),nrow=h)
midpt <- round(w/2)-10
colmtx[mx1:pospctM,1:midpt] <- TRUE
colmtx[mx1:pospctF,(midpt+1):w] <- TRUE
imgmtx[whitemtx & colmtx] <- 0.5
# Need to melt the matrix into a data.frame that ggplot can understand
df <- reshape2::melt(imgmtx)
head(df)
#> Var1 Var2 value
#> 1 1 1 0
#> 2 2 1 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 1 0
#> 6 6 1 0
cols <- c(rgb(255,255,255,maxColorValue = 255),
rgb(209,230,244,maxColorValue = 255),
rgb(42,128,183,maxColorValue = 255))
# Then use a heatmap with 3 colours for background, and percentage fills
# Converting the fill value to a factor causes a discrete scale.
# geom_tile takes three columns: x, y, fill corresponding to
# x-coord, y-coord, and colour of the cell.
ggplot(df, aes(x = Var2, y = Var1, fill = factor(value)))+
geom_tile() +
scale_fill_manual(values = cols) +
theme(legend.position = "none")
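The reshaping step is the part that makes the matrix usable by geom_tile, so it may help to see it on a tiny input. This is a base R sketch under stated assumptions: the 2 x 3 matrix m is a made-up stand-in for imgmtx, and the data frame is built by hand with row() and col() rather than with reshape2::melt, which should yield the same three columns for a matrix without dimnames.

```r
# A tiny stand-in for imgmtx: 2 rows, 3 columns, three distinct values
m <- matrix(c(0, 0.5, 1, 1, 0.5, 0), nrow = 2)

# One row per cell: row index, column index, and the cell's value
long <- data.frame(
  Var1  = as.vector(row(m)),
  Var2  = as.vector(col(m)),
  value = as.vector(m)
)
long
```

Each row of long describes one pixel, which is the x, y, fill layout that the geom_tile call above expects (Var2 as x, Var1 as y, value as fill).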

Approach for creating plotting means from data frame

Trying to develop a flexible script to plot mean of continuous variable observations 'score' as a function of discrete time points 'day' from data frame.
I can do this by creating subsets, but I have a big set of data with many factor vectors like 'day,' so would like to get vectors or a data frame for each factor and its corresponding mean.
I have a data frame structured like this:
subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA
I tried tapply(), but couldn't get it to output vectors or tables that have appropriate headers and also handle NAs.
Looking for a simple bit of code to get two vectors or a data frame with which to plot mean of 'score' as a function of factor 'day'.
So the plot will have point for average score on each day 0, 7, and 14.
I have seen a lot of posts for doing this directly with ggplot, but it seems useful to know how to do, and I need to see the output to make sure it is handling NAs correctly.
If you are able to help, please include explanatory annotations in your script. Thanks!
I think tapply should be able to handle this; you can amend the function to remove NAs:
df=data.frame("subject"=rep(1:3,3), "day"=as.factor(rep(c(0,7,14),each=3)),
"score"=c(99.13,NA,86.87,73.71,82.42,84.45,66.88,83.73,NA))
res = with(df, tapply(score, day, function(x) mean(x,na.rm=T)))
EDIT to get day and score as vectors
day=as.numeric(names(res))
day
0 7 14
score=as.numeric(res)
score
93.00000 80.19333 75.30500
Plot in base R:
plot(x=as.numeric(as.character(df$day)),y=df$score,type="p")
lines(x=as.numeric(names(res)),y=res, col="red") # convert the day labels back to numbers for the x-axis
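If a data frame is preferred over separate vectors, aggregate() is one base R alternative. This is a minimal sketch, not part of the answer above; the name means is made up for the example, and day is kept numeric here (instead of a factor) so it can be plotted directly.

```r
df <- data.frame(
  subject = rep(1:3, 3),
  day     = rep(c(0, 7, 14), each = 3),
  score   = c(99.13, NA, 86.87, 73.71, 82.42, 84.45, 66.88, 83.73, NA)
)

# The formula interface drops rows with NA in score (na.action = na.omit)
# before computing the mean within each day
means <- aggregate(score ~ day, data = df, FUN = mean)
means

# day stays numeric here, so it can go straight onto the x-axis
plot(means$day, means$score, type = "b")
```

Printing means shows one row per day with its mean score, which makes it easy to check the NA handling against the values computed with tapply.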
It's not entirely clear what you are trying to achieve. Here I will show how to use the ggplot2 package to create a point plot with the mean for each group, assuming that dt is your data frame.
library(ggplot2)
ggplot(dt, aes(x = day, y = score, color = factor(subject))) + # Specify x, y and color information
geom_point(size = 3) + # plot the point and specify the size is 3
scale_color_brewer(name = "Subject",
type = "qual",
palette = "Pastel1") + # Format the color of points and the legend using ColorBrewer
scale_x_continuous(breaks = c(0, 7, 14)) + # Set the breaks on x-axis
stat_summary(fun = "mean", # 'fun' replaces 'fun.y', which is deprecated since ggplot2 3.3.0
color = "red",
geom = "point",
size = 5,
shape = 8) + # Compute mean of each group and plot it
theme_classic() # Specify the theme
Warning messages: 1: Removed 2 rows containing non-finite values
(stat_summary). 2: Removed 2 rows containing missing values
(geom_point).
If you run the above code, you will get the warning messages shown above along with the plot. The warnings mean that the NA values have been removed, so you don't need to remove them from the dataset yourself.
DATA
dt <- read.table(text = "subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA",
header = TRUE, stringsAsFactors = FALSE)

R ggplot Facet Wrap

I'm trying to facet wrap this scatter plot by the y axis. For example, if the y-axis goes up to 1000, I would like to separate this graph into 4, the first where the y-axis goes from 0-250, the next 251-500, the next 501-750, and the last 751-1000. Is this type of facet wrapping possible?
library(ggplot2)
A
nrow(A)
# 1000
ncol(A)
# 3
head(A)
# Track Base Location
# 1 1 A 1
# 2 1 C 2
# 3 1 G 3
# 4 1 G 4
# 5 1 A 5
# 6 1 A 6
p <- ggplot(data = A, aes(y=Track, x=Location)) +
geom_point(aes(colour=Base),shape=15,size=2)
print(p)
This is what I have right now. As you can see, it doesn't look aesthetically pleasing.
You can - you just have to make an indicator variable which shows which facet each point should belong in.
(Quick aside - what you have placed in your question is still not a reproducible example - what we are after is something we can copy-paste into our terminals that will still demonstrate your problem. For example, in my answer I've shown you a reproducible example where I only have 100 rows in A rather than your 1000, and my intervals are different to yours, but the key is you can copy-paste it straight to your terminal and you can easily see how it will extend to your problem).
e.g.
# here's a reproducible example, with 100 data points
A <- data.frame(Track=1, Location=1:100, Base=factor(sample(c('A', 'T', 'C', 'G'), 100, replace=T)))
library(ggplot2)
ggplot(transform(A, panel=cut(Location, seq(0, 100, by=25), include.lowest=T)),
aes(x=Track, y=Location)) +
geom_point(aes(colour=Base),shape=15,size=2) +
facet_wrap( ~ panel, scales='free')
Key points:
transform(A, panel=...) adds an extra column "panel" into your dataframe (try running that on its own to see)
cut(Location, seq(0, 100, by=25), include.lowest=T) makes a factor that indicates which interval each Location is in. The intervals here are [0,25], (25,50], (50,75] and (75,100] (include.lowest makes sure that 0 is included in the first interval). For the breaks you mentioned in your question you'd use something like seq(0, 1000, by=250)
the facet_wrap(~ panel,...) makes one facet per interval
the scales='free' in the facet_wrap makes it so that the y scales may be different in each panel (leave it out and they all get plotted on a common scale - just try leaving it out and you will see what I mean).
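To see what the cut() call produces on its own, here is a small sketch with a handful of hand-picked locations. The vector loc is made up for illustration and includes the boundary values, so the effect of include.lowest is visible.

```r
# Hand-picked locations, including values that sit exactly on the breaks
loc <- c(0, 25, 26, 50, 75, 100)

# Same breaks as in the answer: 0, 25, 50, 75, 100
panel <- cut(loc, seq(0, 100, by = 25), include.lowest = TRUE)

levels(panel)  # the interval labels that facet_wrap will use as panel titles
table(panel)   # how many locations fall into each interval
```

Without include.lowest = TRUE, the location 0 would fall outside the first interval and become NA.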

R how to bin weighted data

Hi, I'm trying to draw a histogram in ggplot, but my data doesn't list every observation, only the distinct values and their number of occurrences.
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
df <- data.frame(value,weight)
df
value weight
1 1 8976
2 2 10857
3 3 10770
4 4 14075
5 5 18075
6 6 20757
7 7 24770
8 8 14556
9 9 11235
10 10 8042
Does anybody know how to bin the values, or how to plot a histogram of the binned values?
I want to get something that would look like
bin weight
1 1-2 19833
2 3-4 24845
...
I would add another variable that designates the binning and then
df$group <- rep(c("1-2", "3-4", "5-6", "7-8", "9-10"), each = 2)
draw it using ggplot.
ggplot(df, aes(y = weight, x = group)) + stat_summary(fun = "sum", geom = "bar")
Here's one method for binning the data up:
df$bin <- findInterval(df$value,seq(1,max(df$value),2))
result <- aggregate(df["weight"],df["bin"],sum)
# get your named bins automatically without specifying them individually
result$bin <- tapply(df$value,df$bin,function(x) paste0(x,collapse="-"))
# result
bin weight
1 1-2 19833
2 3-4 24845
3 5-6 38832
4 7-8 39326
5 9-10 19277
# barplot it (base example since Roman has covered ggplot)
with(result,barplot(weight,names.arg=bin))
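Another way to bin is cut() with explicit breaks and labels, followed by aggregate(). This is a sketch of an alternative, not a correction of the method above; the break points and labels are written out by hand and would need adjusting for other bin widths.

```r
value  <- 1:10
weight <- c(8976, 10857, 10770, 14075, 18075, 20757, 24770, 14556, 11235, 8042)
df <- data.frame(value, weight)

# Breaks at 0, 2, ..., 10 give right-closed intervals (0,2], (2,4], ...
# so values 1-2 fall in the first bin, 3-4 in the second, and so on
df$bin <- cut(df$value, breaks = seq(0, 10, by = 2),
              labels = c("1-2", "3-4", "5-6", "7-8", "9-10"))

# Sum the weights within each bin
binned <- aggregate(weight ~ bin, data = df, FUN = sum)
binned
```

Because bin is a factor with levels in the order given, the bars keep that order when binned is passed to barplot or ggplot.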
Just expand your data:
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
dat = rep(value,weight)
# plot result
histres = hist(dat)
And histres contains some potentially useful information if you want details of the histogram data.
