So I have the following dataset -
dat <- structure(list(cases = c(2L, 6L, 10L, 8L, 12L, 9L, 28L, 28L,
36L, 32L, 46L, 47L), qrt = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L), date = c(83, 83.25, 83.5, 83.75, 84, 84.25,
84.5, 84.75, 85, 85.25, 85.5, 85.75)), .Names = c("cases", "qrt",
"date"), class = "data.frame", row.names = c(NA, -12L))
cases qrt date
2 1 83.00
6 2 83.25
10 3 83.50
8 4 83.75
12 1 84.00
9 2 84.25
28 3 84.50
28 4 84.75
36 1 85.00
32 2 85.25
46 3 85.50
47 4 85.75
There are more data points, but to make things look a bit simpler I omitted them.
And to this dataset I have fit a GLM:
fit <- glm(cases~date+qrt, family = poisson, data = dat)
Basically, I would like to plot the fitted values that this GLM produces so that the result looks like this (this is actually the plot for the full data set; the black circles are the original data and the empty circles are the fitted values)
with the repeating values of qrt on the x-axis. I assume I'd have to use the predict() function and then plot the resulting values, but when I tried this the x-axis ran from 1 to 12 instead of repeating 1,2,3,4,1,2,3,4 etc. Also, how would you plot the original data over the fitted values, as in the plot above?
It is not difficult. Just use axis to control axis display:
## disable "x-axis" when `plot` fitted values
## remember to set decent `ylim` for your plot
plot(fit$fitted, xaxt = "n", xlab = "qtr", ylab = "cases", main = "GLM",
ylim = range(c(fit$fitted, dat$cases)) )
## manually add "x-axis", with "labels" and "las"
axis(1, at = 1:12, labels = rep.int(1:4, 3), las = 2)
## add original observed cases
points(dat$cases, pch = 19)
You don't need to use predict here. Your quarterly time series has no gaps or missing values, so the fitted values stored in the fitted model fit are all you need.
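To illustrate the point about predict, here is a minimal sketch using the data from the question. It checks that the stored fitted values match predict(fit, type = "response") on the original rows, and shows where predict becomes useful: for rows that were not in the fitting data. The data frame new_dat and its two future quarters are assumptions made up for this illustration. The axis labels are taken from dat$qrt rather than hard-coded, which should carry over to the full data set.

```r
# Data and model from the question
dat <- data.frame(
  cases = c(2, 6, 10, 8, 12, 9, 28, 28, 36, 32, 46, 47),
  qrt   = rep(1:4, times = 3),
  date  = seq(83, 85.75, by = 0.25)
)
fit <- glm(cases ~ date + qrt, family = poisson, data = dat)

# On the rows used for fitting, both routes give the same values
same <- isTRUE(all.equal(fitted(fit), predict(fit, type = "response")))

# predict() is only needed for rows outside the fitting data
# (new_dat is a hypothetical example: the next two quarters)
new_dat  <- data.frame(date = c(86, 86.25), qrt = c(1, 2))
new_pred <- predict(fit, newdata = new_dat, type = "response")

# Same plot as above, with axis labels taken from the data
plot(fitted(fit), xaxt = "n", xlab = "qrt", ylab = "cases",
     ylim = range(c(fitted(fit), dat$cases)))
axis(1, at = seq_len(nrow(dat)), labels = dat$qrt, las = 2)
points(dat$cases, pch = 19)
```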
with ggplot:
df <- rbind(data.frame(index=as.factor(1:nrow(dat)), value=dat$cases, cases='actual'),
data.frame(index=as.factor(1:nrow(dat)), value=predict(fit, type='response'), cases='predicted'))
library(ggplot2)
ggplot(df, aes(index, value, color=cases)) + geom_point(size=3) +
scale_color_manual(values=c('black', 'gray')) +
scale_y_continuous(breaks=seq(0, max(df$value)+5, 5)) + theme_bw()
For a sample data frame df, pred_value and real_value are the monthly predicted and actual values of a variable, and acc_level is the accuracy level of each prediction compared with the actual value for the corresponding month; the smaller the value, the more accurate the prediction:
df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2022/3/31", "2022/4/30",
"2022/5/31"), class = "factor"), pred_value = c(2721.8, 2721.8,
2705.5, 2500, 2900.05, 2795.66, 2694.45, 2855.36, 2300, 2799.82,
2307.36, 2810.71, 3032.91), real_value = c(2736.2, 2736.2, 2736.2,
2736.2, 2736.2, 2759.98, 2759.98, 2759.98, 2759.98, 3000, 3000,
3000, 3000), acc_level = c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L,
2L, 3L, 2L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
Out:
date pred_value real_value acc_level
1 2022/3/31 2721.80 2736.20 1
2 2022/3/31 2721.80 2736.20 1
3 2022/3/31 2705.50 2736.20 2
4 2022/3/31 2500.00 2736.20 3
5 2022/3/31 2900.05 2736.20 3
6 2022/4/30 2795.66 2759.98 1
7 2022/4/30 2694.45 2759.98 2
8 2022/4/30 2855.36 2759.98 2
9 2022/4/30 2300.00 2759.98 3
10 2022/5/31 2799.82 3000.00 2
11 2022/5/31 2307.36 3000.00 3
12 2022/5/31 2810.71 3000.00 2
13 2022/5/31 3032.91 3000.00 1
I've plotted the predicted values with code below:
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=acc_level)) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
theme_bw()
Beyond what I've done above, how could I also plot the actual values for each month as a red line with red points? Thanks.
Reference:
How to add 4 groups to make Categorical scatter plot with mean segments?
We can add the actuals using additional layers. To make the line show up, we need to specify that the points are part of the same series.
Because the x axis is discrete, ggplot assumes by default that the data points do not belong to the same group. Alternatively, we could convert the date variable to a Date type, for example with aes(x = as.Date(date), ...).
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=as.factor(acc_level))) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
geom_point(aes(y = real_value), size=2, color = "red") +
geom_line(aes(y = real_value, group = 1), color = "red") +
scale_color_manual(values = c("yellow", "magenta", "cyan"),
name = "Acc Level") +
theme_bw()
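The as.Date alternative mentioned above could look roughly like this. It is a sketch on a shortened copy of the question's data (six of the thirteen rows), and it assumes the dates are stored as "YYYY/M/D" text as in the dput output. With a continuous date axis there is no discrete variable left to split the red line into groups, so group = 1 should no longer be needed.

```r
library(ggplot2)

# Shortened copy of the question's data (six of the original rows)
df <- data.frame(
  date       = factor(c("2022/3/31", "2022/3/31", "2022/4/30",
                        "2022/4/30", "2022/5/31", "2022/5/31")),
  pred_value = c(2721.8, 2500, 2795.66, 2300, 2799.82, 3032.91),
  real_value = c(2736.2, 2736.2, 2759.98, 2759.98, 3000, 3000),
  acc_level  = c(1L, 3L, 1L, 3L, 2L, 1L)
)

# Convert the factor to a real Date; go through character first
df$date <- as.Date(as.character(df$date), format = "%Y/%m/%d")

p <- ggplot(df, aes(x = date)) +
  geom_point(aes(y = pred_value, color = factor(acc_level)),
             size = 2, alpha = 0.7) +
  geom_point(aes(y = real_value), size = 2, color = "red") +
  geom_line(aes(y = real_value), color = "red") +   # no group = 1 here
  labs(color = "Acc Level") +
  theme_bw()
print(p)
```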
I want to plot census data to compare data for each race over multiple years.
My data frame has years 1950-2010 (every 10 years) as the rows and race as the columns. The data at the cross section is the percentage of that race in a given year.
I want my line graph to plot the years on the x axis and race on the y axis. So with my 5 "race" variables, there would be 5 lines of different colors all plotted on the same graph.
I have tried to watch videos and scoured all over here but nothing I find seems to work the way I want it to.
Edit:
I refactored the code and built my own data frame instead of having it return a matrix.
However, I want the right side to say "Race" and then have my 5 lines. I am working on getting one line to show up at all before doing the other 4.
new dataframe
returned plot
Edit:
This is what I have figured out so far in my code:
Allston <- ggplot(data = dataAllston, aes(Year, White.pct, group = 1)) + geom_point(aes(color = "orange")) + geom_line(aes(color = "orange"))
I want to scale the Y axis from 0 to 1 in 0.2 increments and have the Y label be "Race" instead of the individual labels. And more than just relabeling: I want the graph to represent the actual increases and decreases, as opposed to the straight diagonal line going down that it shows now.
I think it will take me longer to learn how to make the reproducible code than it will to make tweaks.
new returned plot
Edit:
dput(dataAllston)
returns
structure(list(Year = c(1950, 1960, 1970, 1980, 1990, 2000, 2010
), White.pct = structure(7:1, .Label = c("57.0", "59.0", "63.0",
"78.0", "90.8", "98.0", "98.3"), class = "factor"), BlackOrAA.pct =
structure(c(2L,
1L, 3L, 4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00",
"9.00"), class = "factor"), Hispanic.pct = structure(c(1L, 1L,
3L, 4L, 2L, 2L, 2L), .Label = c("0.00", "13.0", "3.10", "6.00"
), class = "factor"), AsianOrPI.pct = structure(c(1L, 1L, 5L,
6L, 2L, 3L, 4L), .Label = c("0.00", "14.0", "18.0", "20.0", "3.20",
"9.00"), class = "factor"), Other.pct = structure(c(2L, 1L, 3L,
4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00", "9.00"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
result from dput(data)
You first need to reshape your dataset into a longer format, for example with the pivot_longer function from tidyr. At the end, your data should look like the output shown after the code.
As your columns are factors (except Year), the first line converts them all to numeric, which is more appropriate for plotting.
library(dplyr)
library(tidyr)
Reshaped_DF <- dataAllston %>% mutate_at(vars(ends_with(".pct")), ~as.numeric(as.character(.))) %>%
pivot_longer(-Year, names_to = "Races", values_to = "values")
# A tibble: 35 x 3
Year Races values
<dbl> <chr> <dbl>
1 1950 White.pct 98.3
2 1950 BlackOrAA.pct 1.3
3 1950 Hispanic.pct 0
4 1950 AsianOrPI.pct 0
5 1950 Other.pct 1.3
6 1960 White.pct 98
7 1960 BlackOrAA.pct 1.2
8 1960 Hispanic.pct 0
9 1960 AsianOrPI.pct 0
10 1960 Other.pct 1.2
# … with 25 more rows
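The as.numeric(as.character(.)) step is what removes the straight diagonal line from the question. A small base R sketch, using the White.pct values from the dput output, shows the difference: calling as.numeric directly on a factor returns the internal level codes, not the numbers printed in the labels.

```r
# White.pct values from the question, stored as a factor
white <- factor(c("98.3", "98.0", "90.8", "78.0", "63.0", "59.0", "57.0"))

# Direct conversion returns the level codes: 7 6 5 4 3 2 1
codes <- as.numeric(white)

# Going through character recovers the actual percentages
values <- as.numeric(as.character(white))

print(codes)
print(values)
```

Plotting codes against Year gives exactly the evenly spaced descending line described in the question, while values follows the real percentages.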
Then, you can plot it in ggplot2 by doing:
library(ggplot2)
ggplot(Reshaped_DF,aes(x = Year, y = values, color = Races, group = Races))+
geom_line()+
geom_point()+
ylab("Percentage")
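The question also asked for fixed y-axis increments and a legend titled "Race". One possible way to do that, sketched here on the first four rows of the reshaped output, is scale_y_continuous() for the breaks and labs() for the legend title. Since the values are percentages on a 0 to 100 scale, the sketch assumes breaks every 20 rather than 0.2.

```r
library(ggplot2)

# First rows of the reshaped data shown above (two races, two years)
long <- data.frame(
  Year   = c(1950, 1950, 1960, 1960),
  Races  = c("White.pct", "BlackOrAA.pct", "White.pct", "BlackOrAA.pct"),
  values = c(98.3, 1.3, 98.0, 1.2)
)

p <- ggplot(long, aes(x = Year, y = values, color = Races, group = Races)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
  labs(y = "Percentage", color = "Race")   # legend title
print(p)
```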
Does it answer your question?
If not, please consider providing a reproducible example of your dataset that people can easily copy/paste. See this guide: How to make a great R reproducible example
I'm new to R and I would appreciate your help.
I have a 3 columns df that looks like this:
> head(data)
V.hit J.hit frequency
1 IGHV1-62-3*00 IGHJ2*00 0.51937442
2 IGHV5-17*00 IGHJ3*00 0.18853542
3 IGHV3-5*00 IGHJ1*00 0.09777304
4 IGHV2-9*00 IGHJ3*00 0.03040866
5 IGHV5-12*00 IGHJ4*00 0.02900040
6 IGHV5-12*00 IGHJ2*00 0.00910554
This is just part of the data, as an example. I want to create a heat map with "V.hit" on the x-axis and "J.hit" on the y-axis, where the values of the heat map are the frequency (I'm interested in the frequency of each V+J combination). I tried to use this code for the interpolation:
library(akima)
newData <- with(data, interp(x = `V hit`, y = `J hit`, z = frequency))
but I'm getting this error:
Error in interp.old(x, y, z, xo, yo, ncp = 0, extrap = FALSE, duplicate = duplicate, :
missing values and Infs not allowed
so I don't know how to deal with it. I want to achieve this final output:
> head(fld)
# A tibble: 6 x 5
...1 `IGHJ1*00` `IGHJ2*00` `IGHJ3*00` `IGHJ4*00`
<chr> <dbl> <dbl> <dbl> <dbl>
1 IGHV10-1*00 0.00233 0.00192 NA 0.000512
2 IGHV1-14*00 NA NA 0.00104 NA
3 IGHV1-18*00 NA 0.000914 NA NA
4 IGHV1-18*00 NA NA 0.000131 NA
5 IGHV1-19*00 0.0000131 NA NA NA
6 IGHV1-26*00 NA 0.000214 NA NA
where cells that are NA are set to 0.
Then I assume I will be able to use the heatmap function to create my heat map. Any help would be really appreciated!
Here is an idea using geom_tile(). I called your data foo. I created all possible combinations of V.hit and J.hit using complete() and asked it to fill the missing values with 0. Then I used geom_tile() to draw the heat map. You may want to adjust the order of the factor levels, if necessary.
library(tidyverse)
complete(foo, V.hit, nesting(J.hit), fill = list(frequency = 0)) %>%
ggplot(aes(x = J.hit, y = V.hit, fill = frequency)) +
geom_tile()
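If the wide V.hit by J.hit table from the question is what is needed (for example to pass to the base heatmap function), a base R sketch with xtabs could look like this. It uses the six rows printed in the question; combinations that do not occur come out as 0 automatically. Note that xtabs sums the frequency when a combination appears more than once, which is an assumption about how duplicates should be handled.

```r
# The six rows shown in the question
data <- data.frame(
  V.hit = c("IGHV1-62-3*00", "IGHV5-17*00", "IGHV3-5*00",
            "IGHV2-9*00", "IGHV5-12*00", "IGHV5-12*00"),
  J.hit = c("IGHJ2*00", "IGHJ3*00", "IGHJ1*00",
            "IGHJ3*00", "IGHJ4*00", "IGHJ2*00"),
  frequency = c(0.51937442, 0.18853542, 0.09777304,
                0.03040866, 0.02900040, 0.00910554)
)

# Cross-tabulate: rows are V.hit, columns are J.hit, missing cells are 0
tab <- xtabs(frequency ~ V.hit + J.hit, data = data)

# Plain numeric matrix with the same row and column names
m <- matrix(tab, nrow = nrow(tab), dimnames = dimnames(tab))

# No clustering and no row scaling, so colors reflect raw frequencies
heatmap(m, Rowv = NA, Colv = NA, scale = "none")
```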
In base R we could adapt @GregSnow's solution for a correlation matrix to a frequency heatmap.
First, we cut the vector, say into quartiles (the default in quantile) and get factor values.
dat$freq.fac <- cut(dat$frequency, quantile(dat$frequency, na.rm=TRUE), include.lowest=T)
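As a quick illustration of what this cut step does, here is a sketch on a made-up vector x (not part of the toy data). quantile returns five break points by default, so cut produces four bins, and include.lowest = TRUE keeps the minimum inside the first bin instead of turning it into NA.

```r
# Hypothetical vector used only to illustrate the binning
x <- 1:8

# Default quantile() breaks: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(x)

# Four bins; the smallest value is kept in the first one
bins   <- cut(x, breaks, include.lowest = TRUE)
counts <- table(bins)

# Without include.lowest the minimum falls outside the first interval
has_na <- anyNA(cut(x, breaks))

print(counts)
```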
Second, to prepare the colors, we copy the factor column and relabel its levels with the built-in heat.colors, plus a white color for the zero values.
dat <- within(dat, {
freq.col <- freq.fac
levels(freq.col) <- c(heat.colors(length(levels(dat$freq.fac)), rev=T), "#FFFFFF")
})
Third, apply white color to NAs or zero value respectively.
dat$freq.col[is.na(dat$freq.col)] <- "#FFFFFF"
dat$frequency[is.na(dat$frequency)] <- 0
Fourth, apply xtabs and create a color matrix and match colors and levels after.
dat.x <- xtabs(frequency ~ v.hit + j.hit, dat)
col.m <- matrix(dat$freq.col[match(dat$frequency, as.vector(dat.x))], nrow=nrow(dat.x))
Finally plot using rasterImage function.
op <- par(mar=c(.5, 4, 4, 3)+.1) ## adapt outer margins
plot.new()
plot.window(xlim=c(0, 5), ylim=c(0, 5))
rasterImage(col.m, 0, 1, 5, 5, interpolate=FALSE)
rect(0, 1, 5, 5) ## frame it with a box
## numbers in the cells
text(col(round(dat.x, 3)) - .5, 5.45 - row(round(dat.x, 3))*.8, round(dat.x, 3))
mtext("Frequency heatmap", 3, 2, font=2, cex=1.2) ## title
mtext(rownames(dat.x), 2, at=5.45 -(1:5)*.8, las=2) ## y-axis
mtext(colnames(dat.x), 3, at=(1:5)-.5) ## x-axis (upper)
## a legend
legend(-.15, .75, legend=c("Frequency:\t", 0, paste("<", seq(.25, 1, .25))), horiz=TRUE,
pch=c(NA, rep(22, 5)), col=1, pt.bg=c(NA, levels(dat$freq.col)[c(5, 1:4)]),
bty="n", xpd=TRUE, cex=.75, text.font=2)
par(op) ## reset margins
Yields
Toy data:
dat <- structure(list(v.hit = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"),
j.hit = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L
), .Label = c("F", "G", "H", "I", "J"), class = "factor"),
frequency = c(NA, NA, 0.717618508264422, NA, NA, 0.777445221319795,
NA, 0.212142521282658, 0.651673766085878, 0.125555095961317,
NA, 0.386114092543721, 0.0133903331588954, NA, 0.86969084572047,
0.34034899668768, 0.482080115471035, NA, 0.493541307048872,
0.186217601411045, 0.827373318606988, NA, 0.79423986072652,
0.107943625887856, NA)), row.names = c(NA, -25L), class = "data.frame")
You can interpolate with a linear model if the variables correlate.
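library(plotly)  ## needed for plot_ly() and layout() below
mdl <- lm(z ~ ., df)
out <- NULL
## step through 100 equal increments across the range of x (and of y below)
for(x in seq(min(df$x), max(df$x), (max(df$x) - min(df$x))/100 )){
tmp <- c()
for(y in seq(min(df$y), max(df$y), (max(df$y) - min(df$y))/100 )){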
h <- predict(
mdl,
data.frame(x = x, y = y)
)
tmp = c(tmp, h)
}
if(is.null(out)){
out = as.matrix(tmp)
}else{
out = cbind(out, tmp)
}
}
fig <- plot_ly(z = out, colorscale = "Hot", type = "heatmap")
fig <- fig %>% layout(
title = "Interpolated Heatmap of Z Given x, y",
xaxis = list(
title = "x"
),
yaxis = list(
title = "y"
)
)
fig
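The same grid of predictions can probably be built without the nested loops by calling predict once on expand.grid. The sketch below assumes a data frame df with numeric columns x, y and z; the values are made up so that z is exactly linear in x and y, and base image() is used instead of plotly only to keep the example dependency-free.

```r
# Hypothetical data: z is an exact linear function of x and y
df <- data.frame(x = c(0, 1, 2, 0, 1, 2), y = c(0, 0, 0, 1, 1, 1))
df$z <- 1 + 2 * df$x + 3 * df$y

mdl <- lm(z ~ ., df)

# 101 grid points per axis, matching 100 equal steps
xs <- seq(min(df$x), max(df$x), length.out = 101)
ys <- seq(min(df$y), max(df$y), length.out = 101)

# One predict() call over every (x, y) combination; x varies fastest
grid <- expand.grid(x = xs, y = ys)
pred <- predict(mdl, newdata = grid)

# z_xy has rows indexed by x and columns by y; out matches the loop
# version above (rows indexed by y, columns by x)
z_xy <- matrix(pred, nrow = length(xs))
out  <- t(z_xy)

image(xs, ys, z_xy, col = heat.colors(50), xlab = "x", ylab = "y")
```

The matrix out can be passed to plot_ly(z = out, type = "heatmap") in the same way as the loop result.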
I'd like to plot standard deviations of the mean(z)/mean(b) which are grouped by two factors $angle and $treatment:
z= Tracer angle treatment
60 0 S
51 0 S
56.415 15 X
56.410 15 X
b=Tracer angle treatment
21 0 S
15 0 S
16.415 15 X
26.410 15 X
So far I've calculated the mean for each variable based on angle and treatment:
aggmeanz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=mean)
aggmeanb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=mean)
It now looks like this:
aggmeanz
angle treatment x
1 0 S 0.09088021
2 30 S 0.18463353
3 60 S 0.08784315
4 80 S 0.09127198
5 90 S 0.12679296
6 0 X 2.68670392
7 15 X 0.50440692
8 30 X 0.83564470
9 60 X 0.52856956
10 80 X 0.63220093
11 90 X 1.70123025
But when I come to plot it, I can't quite get what I'm after
ggplot(aggmeanz, aes(x=aggmeanz$angle,y=aggmeanz$x/aggmeanb$x, colour=treatment)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=0.1, ymax=1.15),
width=.2,
position=position_dodge(.9)) +
theme(panel.grid.minor = element_blank()) +
theme_bw()
EDIT:
dput(aggmeanz)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.0841582902523, 61.2014237854156, 42.9900742785269,
42.4688447229277, 41.3354173870287, 45.7164231791512, 55.3943182966382,
55.0574951462903, 48.1575625699563, 60.5527200655174, 45.8412287451211
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
> dput(aggmeanb)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.26325504249, 61.751655279608, 43.1687113436753,
43.4147408285209, 41.9113698082799, 46.2800894420131, 55.1550995335947,
54.7531592595068, 47.3280215294235, 62.4629068516043, 44.2590192583692
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
EDIT 2: I calculated the standard dev as follows:
aggstdevz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=sd)
aggstdevb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=sd)
Any thoughts would be much appreciated,
Cheers
As others have noted, you'll need to join the two data frames together. There are also some quirks in the dput data you showed, so I've renamed some columns to make sure that they join appropriately and match what you've attempted. NOTE: You'll need to name the two means differently so that they don't get merged together or cause conflicts.
names(aggmeanb)[names(aggmeanb) == "x"] = "mean_b"
names(aggmeanb)[names(aggmeanb) == "time"] = "angle"
names(aggmeanz)[names(aggmeanz) == "x"] = "mean_z"
names(aggmeanz)[names(aggmeanz) == "time"] = "angle"
library(plyr)  # join() is provided by the plyr package
joined_data = join(aggmeanb, aggmeanz)
joined_data$divmean = joined_data$mean_b/joined_data$mean_z
> head(joined_data)
angle treatment mean_b mean_z divmean
1 0 S 56.26326 56.08416 1.003193
2 30 S 61.75166 61.20142 1.008991
3 60 S 43.16871 42.99007 1.004155
4 80 S 43.41474 42.46884 1.022273
5 90 S 41.91137 41.33542 1.013934
6 0 X 46.28009 45.71642 1.012330
ggplot(joined_data, aes(factor(angle), divmean)) +
geom_boxplot() +
theme(panel.grid.minor = element_blank()) +
theme_bw()
It might be that the data you've included is just a bit of your real data set, but as is there's only one data point per angle-treatment group. However, when you are using a fuller dataset, you can try something like:
ggplot(joined_data, aes(factor(angle), divmean, group = treatment)) +
geom_boxplot() +
facet_grid(.~angle, scales = "free_x")
That will group the boxes by angle and then allow you to fill them by treatment.
Think about the problem in two steps:
1) Create a data frame (say data) which contains all the information you would like to visualize. In this case, this seems to be the two factors (angle, treatment), the mean group differences (say dif) and standard errors (say ste).
2) Visualize this information.
Step 2) will be easy. This should probably produce something very similar to your sketch.
ggplot(data, aes(x=angle, y=dif, colour=treatment)) +
geom_point(position=position_dodge(0.1)) +
geom_errorbar(aes(ymin=dif-ste, ymax=dif+ste), width=.1, position=position_dodge(0.1)) +
theme_bw()
However, at this point, you do not provide enough information to get help with Step 1. Try to include code which produces your original data (or the type of data you have) instead of copy-pasting chunks of your data output or pasting the aggregated data which lacks standard errors.
Combining your two aggregated data frames and generating random numbers for standard error produces the graph below:
#I imported your two aggregated data frames from your dput output.
data <- cbind(aggmeanb, aggmeanz$x, rnorm(11))
names(data) <- c("angle", "treatment", "meanb", "meanz", "ste")  # column 3 comes from aggmeanb, column 4 from aggmeanz
data$dif <- data$meanz - data$meanb
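Step 1 could be sketched roughly as follows, assuming the raw data look like the z excerpt in the question (one Tracer value per row, with angle and treatment columns). The helper se and the column names mean_z and ste_z are made up for this illustration, and only one of the two variables is shown. The standard error here is the usual sd divided by the square root of the group size; real standard errors would replace the rnorm placeholder used above.

```r
# The four rows of z shown in the question
z <- data.frame(
  Tracer    = c(60, 51, 56.415, 56.410),
  angle     = c(0, 0, 15, 15),
  treatment = c("S", "S", "X", "X")
)

# Hypothetical helper: standard error of the mean
se <- function(v) sd(v) / sqrt(length(v))

# Group mean and standard error per angle/treatment combination
z_mean <- aggregate(Tracer ~ angle + treatment, data = z, FUN = mean)
z_se   <- aggregate(Tracer ~ angle + treatment, data = z, FUN = se)
names(z_mean)[3] <- "mean_z"
names(z_se)[3]   <- "ste_z"

# One row per group, with both summaries side by side
stats <- merge(z_mean, z_se, by = c("angle", "treatment"))
print(stats)
```

Repeating this for b and merging on angle and treatment would give a data frame with the columns the plotting code expects.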
This is what my dataframe looks like:
Persnr Date AmountHolidays
1 55312 X201101 2
2 55312 X201102 4.5
3 55312 X201103 5
etc.
What I want is a graph that shows the amount of holidays (on the y-axis) for each period (Date on the x-axis) for a specific person (Persnr). Basically, it's a pivot chart in R. As far as I know, it is not possible to create such a graph.
Something like this is my desired result:
http://imgur.com/62VsYdJ
Is it possible to create such a graph in R in the first place? If not, what is the best way to visualise this data in R?
Thanks in advance.
Something like this could do the trick?
dat <- read.table(text="Persnr Date AmountHolidays
55312 2011-01-01 2
55312 2011-02-01 4.5
55312 2011-03-01 5
55313 2011-01-01 4
55313 2011-02-01 2.5
55313 2011-03-01 6", header=TRUE)
dat$Date <- as.POSIXct(dat$Date)
dat$Persnr <- as.factor(dat$Persnr)
# Build a primary graph
plot(AmountHolidays ~ Date, data = dat[dat$Persnr==55312,], type="l", col="red",
xlim = c(1293858000, 1299301200), ylim=c(0,8))
# Add additional lines to it
lines(AmountHolidays ~ Date, data = dat[dat$Persnr==55313,], type="l", col="blue")
# Build and place a legend
legend(x=as.POSIXct("2011-02-19"), y=2.2, legend = levels(dat$Persnr),fill = c("red", "blue"))
To set X coordinates, you can either use as.POSIXct("YYYY-MM-DD") or as.numeric(as.POSIXct("YYYY-MM-DD")), as I did for the xlim values.
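A short sketch of that conversion, in case the hard-coded numbers look opaque. The numeric value of a POSIXct is the number of seconds since 1970-01-01 UTC, so it depends on the time zone used to interpret the date. The time zones below are assumptions chosen for illustration; the xlim values above are consistent with a zone five hours behind UTC.

```r
# Midnight on 2011-01-01, interpreted in UTC
t_utc <- as.POSIXct("2011-01-01", tz = "UTC")
secs  <- as.numeric(t_utc)   # seconds since 1970-01-01 UTC

# The same calendar date in another zone maps to a different number,
# shifted by that zone's UTC offset
t_ny   <- as.POSIXct("2011-01-01", tz = "America/New_York")
offset <- as.numeric(t_ny) - secs

# Date objects avoid the time zone issue; their unit is days
days <- as.numeric(as.Date("2011-01-01"))

print(c(secs = secs, offset = offset, days = days))
```

Passing POSIXct objects directly to xlim, for example xlim = as.POSIXct(c("2011-01-01", "2011-03-05")), avoids the hard-coded numbers altogether.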
You can try with package ggplot2:
First option
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line(aes(colour=Persnr)) + scale_colour_discrete()
or
Second option
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line() + facet_grid(~Persnr)
One advantage is that you don't need a separate line of code per Persnr, or even to know the names or the number of Persnr values.
example:
first option
second option
Data:
dat <- structure(list(Persnr = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("54000",
"55312"), class = "factor"), Date = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("2011-01-01", "2011-02-01", "2011-03-01"), class = "factor"),
AmountHolidays = c(5, 4.5, 2, 3, 6, 7)), .Names = c("Persnr",
"Date", "AmountHolidays"), row.names = c(3L, 5L, 6L, 1L, 2L,
4L), class = "data.frame")