Related
I have a dataframe with 4 columns. I am plotting total car crashes vs. total losses, colored by the column "distance". I can generate a colored plot, but on inspection the points are not colored the way they should be. Since there are many duplicate distance values, I created a palette from the unique values of the distance column and assigned it as the color for both the legend and plot(). However, the plot colors are wrong: for example, distance 600 is colored yellow in the legend, but the corresponding dot is red. I think the problem is that I need one color per row of the dataframe (58 in total), but I have duplicate values and they are not originally sorted. Basically, I need colors from the lowest distance to the highest distance, in increasing shades of a color, that match the crashes and losses data properly.
Below is the minimum reproducible dataframe and my code that works but is flawed.
# creating dataframe
year <- data.frame(year = seq(1946,2003,1))
crashes <- data.frame(crashes = c(386,317,294,287,266,245,268,296,226,265,243,239,183,212,195,224,170,169,140,147,111,119,100,115,128,111,80,77,68,69,84,72,90,82,59,67,45,59,50,64,55,63,56,56,57,68,34,32,26,21,20,30,35,28,22,27,34,NA))
losses <- data.frame(losses = c(432,423,341,291,282,288,387,323,229,305,244,333,200,215,211,245,197,177,153,152,115,189,124,129,133,120,91,90,69,78,88,77,95,98,62,70,45,62,70,68,65,73,90,65,61,74,39,33,31,22,21,39,35,58,25,36,40,NA))
distance <- data.frame(distance = c(600,571,589,613,618,605,605,610,608,584,605,615,605,597,603,600,578,560,541,500,478,459,449,447,452,444,431,433,452,436,426,425,430,426,430,417,372,401,389,418,414,397,443,436,431,439,430,425,415,423,437,463,487,505,503,508,516,529))
df <- cbind(year,crashes,losses,distance)
palette <- heat.colors(length(unique(df[order(df$distance),]$distance)))
plot(df$crashes,df$losses, main = "Crashes,Losses and Distance",xlab = "Crashes", ylab = "Losses", col = palette)
#legend
legend(x = 401,y = 450, legend = unique(df[order(df$distance),]$distance) , cex=.3, fill = palette, xpd=TRUE)
Hi, from what I understand you are trying to give each distance a unique colour. That is easy to do, but I would highly recommend first adding a grouping factor for the distances instead of using the 47 unique levels of distance, which would result in 47 different colours (you can try it by using 'distance' instead of 'distance_new' in the code below).
Here is my code for this:
# same df as you posted:
df <- cbind(year,crashes,losses,distance)
# this is necessary if you would want to use 'distance' for coloring
df$distance<- as.factor(df$distance)
## imo better add a grouping factor distance_new
## (this is automatically stored as class factor)
df$distance_new<-cut(as.numeric(df$distance),4, labels = c("very low","low","medium","high"))
# Now the plot
plot(df$crashes, df$losses, main = "Crashes,Losses and Distance",
     xlab = "Crashes", ylab = "Losses",
     col = df$distance_new, pch = 19)
# add a legend
legend("bottomright", legend = unique(df$distance_new),
       fill = unique(df$distance_new), cex = 0.5, title = "Distance")
View(df)
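If you do want a continuous shade from the lowest to the highest distance rather than four groups (closer to what was originally asked), here is a minimal sketch, assuming distance is still numeric (i.e., run it before the as.factor() conversion above). The vector name row_cols is mine, not from the question; match() gives every row the colour of its sorted distance value, so duplicates share a colour and the point order no longer matters.
# one colour per unique distance, mapped back to every row via match()
pal <- heat.colors(length(unique(df$distance)))
row_cols <- pal[match(df$distance, sort(unique(df$distance)))]
plot(df$crashes, df$losses, main = "Crashes,Losses and Distance",
     xlab = "Crashes", ylab = "Losses", col = row_cols, pch = 19)
With 47 unique values a full legend is barely readable, which is another argument for the grouped approach above.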
The Problem
I have data that I would like to plot in a line graph with a log scale on the y-axis using ggplot2. Unfortunately, some of my values go all the way down to zero. The data represent relative occurrences of a feature in dependence on some parameters. The value zero occurs when that feature is not observed in a sample, which means that it occurs very seldom, or indeed never. These zero values cause a problem in the log plot.
The following code illustrates the problem on a simplified data set. In reality the data set consists of more points, so the curve looks smoother, and also more values for the parameter p.
library(ggplot2)
dat <- data.frame(x=rep(c(0, 1, 2, 3), 2),
                  y=c(1e0, 1e-1, 1e-4, 0,
                      1e-1, 1e-3, 0, 0),
                  p=c(rep('a', 4), rep('b', 4)))
qplot(data=dat, x=x, y=y, colour=p, log="y", geom=c("line", "point"))
Given the data above, we would expect two lines: the first should have three finite points on a log plot, the second only two.
However, as you can see, this produces a very misleading plot. It looks like the blue and red lines are both converging to a value between 1e-4 and 1e-3. The reason is that log(0) gives -Inf, which ggplot just puts on the lower axis.
My Question
What's the best way to deal with this in R with ggplot2? By best I mean in terms of efficiency and being idiomatic R (I'm fairly new to R).
The plot should indicate that these curves go down to "very small" after x=2 (red) or x=1 (blue), respectively; ideally with a vertical line downwards from the last finite point. What I mean by that is demonstrated in the following.
My Attempt
Here I'll describe what I've come up with. However, given that I'm fairly new to R, I suspect that there might be a much better way.
library(ggplot2)
library(scales)
dat <- data.frame(x=rep(c(0, 1, 2, 3), 2),
                  y=c(1e0, 1e-1, 1e-4, 0,
                      1e-1, 1e-3, 0, 0),
                  p=c(rep('a', 4), rep('b', 4)))
Same data as above.
Now, I'm going through each unique parameter p, find the x coordinate of the last finite point, and assign it to the x coordinates of all points where y is zero. That is to achieve a vertical line.
for (p in unique(dat$p)) {
  dat$x[dat$p == p & dat$y == 0] <- dat$x[head(which(dat$p == p & dat$y == 0), 1) - 1]
}
At this point the plot looks as follows.
The vertical lines are there. However, there are also points. These are misleading as they indicate that there was an actual data point there, which is not true.
To remove the points, I duplicate the y data (which seems wasteful) into a new column, let's call it yp, and replace zero by NA. Then I use that new yp as the y aesthetic for geom_point.
dat$yp <- dat$y
dat$yp[dat$y == 0] <- NA
ggplot(dat, aes(x=x, y=y, colour=p)) +
  geom_line() +
  geom_point(aes(y=yp)) +
  scale_y_continuous(trans=log10_trans(),
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)))
Here I've used ggplot instead of qplot so that I can give different aesthetics to geom_line and geom_point.
Finally, the plot looks like this.
What is the right way to do this?
For me, I use
+ scale_y_continuous(trans=scales::pseudo_log_trans(base = 10))
If you're using ggplot, you can use scales::pseudo_log_trans() as your transformation object. This will replace your -Inf with 0.
From the docs (https://scales.r-lib.org/reference/pseudo_log_trans.html),
A transformation mapping numbers to a signed logarithmic scale with a smooth transition to linear scale around 0.
pseudo_log_trans(sigma = 1, base = exp(1))
For example, my scale expression looks like this:
+ scale_fill_gradient(name = "n occurrences", trans="pseudo_log")
Unconfirmed, but you probably need to include the scales library:
require("scales")
The simplest way would be to add a small value to each of the numbers. For example:
library(dplyr)
library(ggplot2)
df <- mutate(df, log_var = log(var + 0.01))
ggplot(df, aes(x = log_var)) + geom_histogram()
I have measurements of approximately 1000 variables in 2 groups with 10 replicates in each, in other words I have 2 dataframes with 10 columns and 1000 rows in each.
I would like to show the distribution of my measurements in the two groups, to pick up variables that differ significantly between them. My initial idea was to do a large scatter plot where the x-coordinate would be the variable index, the y-coordinate the measurement, and the points color-coded. It doesn't quite work as expected, however: I get a scatter plot matrix instead.
I tried to go with a boxplot,
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 1000, 10))
boxplot(t(log2(ratios1)), horizontal = T)
which sort of works, but all the lines for the boxes make the plot undecipherable, even for a single group (see figure below). Then I tried to remove the boxes and add the points afterwards, as suggested here
boxplot(t(log2(ratios1)), horizontal = T, border = "white")
points(t(log2(ratios1)), pch=1)
But that didn't quite work either, as I only got the first variable drawn on the graph.
How can I display this type of information?
First of all, columns correspond to variables and rows to observations, not the other way around.
set.seed(42)
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 10, 1000))
You could plot quantiles like this:
library(reshape2)
ratios2 <- melt(ratios1)
library(ggplot2)
ggplot(ratios2, aes(x = as.numeric(variable), y = value)) +
  stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.025, 0.5, 0.975))), c("ymin", "y", "ymax"))),
               color = "blue") +
  stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.25, 0.5, 0.75))), c("ymin", "y", "ymax"))),
               color = "red") +
  xlab("variable")
There are no groups in your data, so I don't know what to do with that. Maybe you could facet by group. However, I don't think this kind of plot would be very useful for your goal of "pick[ing] up variables that differ significantly between the groups". I would do a hypothesis test with the appropriate correction for alpha error inflation.
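For example, a minimal sketch of that testing approach, assuming the two groups are matrices group1 and group2 (hypothetical names) with one row per variable and one column per replicate:
set.seed(42)
group1 <- matrix(rnorm(10000), 1000, 10)             # 1000 variables x 10 replicates
group2 <- matrix(rnorm(10000, mean = 0.5), 1000, 10)
# row-wise t-tests, then Benjamini-Hochberg correction for multiple testing
p_raw <- sapply(seq_len(nrow(group1)),
                function(i) t.test(group1[i, ], group2[i, ])$p.value)
p_adj <- p.adjust(p_raw, method = "BH")
which(p_adj < 0.05)  # indices of variables that differ between the groups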
I have a couple of box and whisker plots in R. In both, the x-axis corresponds to one categorical variable whilst the grouping colours correspond to the other.
If I draw both plots with an untransformed y-axis, they are both fine. However, if I try to square-root transform the y-axis (using coord_trans(y = "sqrt")), one of those graphs is still fine whilst the other drops the lines corresponding to the median in most boxes (except those for which there are only two groups and where the boxes are therefore slightly wider; see "Numbers" 1 and 2 on the first plot). Further, for the graph that does not draw properly, if I reduce the number of categories on my x-axis (hence making the boxes wider again), the median lines appear again.
Is this a bug with coord_trans (if so, how can I get around it) or a problem with my code?
Thank you very much for any suggestion.
library(car)
library(gplots)
library(plyr)
library(ggplot2)
library(gridExtra)
library(gdata)
Category = factor(c(rep(1, times = 3240), rep(2, times = 2160)),
                  labels = c("A","B"), levels = c(1,2))
ID = factor(rep(seq(from = 1, to = 45), each = 120))
Months = factor(rep(seq(from = 1, to = 3), each = 40, times = 45),
                labels = c("Jan","Feb","Mar"), levels = c(1:3))
Obs = rnorm(5400, mean = 25, sd = 15)
Data = data.frame(Category, ID, Months, Obs)
Data = subset(Data, (Data$Category=="B") | !(Data$ID %in% c(1,2)) |
                    (Data$Months %in% c("Jan","Feb")))
for (j in 1:2) {
  sel = which(Data$Category == unique(levels(Data$Category))[j])
  Observ = Data$Obs[sel]
  Month = Data$Months[sel]
  Number = droplevels(Data$ID[sel])
  Data_used = data.frame(Number, Month, Observ)
  plot1 = ggplot(Data_used, aes(Number, Observ)) +
    geom_boxplot(aes(fill = Month), na.rm = TRUE) +
    scale_y_continuous(breaks = c(0,20,40,60,80,100), limits = c(0,115)) +
    coord_trans(y = "sqrt")
  plot(plot1)
}
@Dennis is correct in his comment that scale_y_sqrt() will correct this. Because the median and quartiles are order statistics, it doesn't matter whether the data are transformed before or after calculating them.
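For the code in the question, that amounts to replacing coord_trans(y = "sqrt") with scale_y_sqrt(); a sketch of just the changed plotting call inside the loop:
# transform via the scale: the boxplot stats are computed on transformed data,
# and since medians and quartiles are order statistics, computing them before
# or after the transform gives the same boxes
plot1 = ggplot(Data_used, aes(Number, Observ)) +
  geom_boxplot(aes(fill = Month), na.rm = TRUE) +
  scale_y_sqrt(breaks = c(0, 20, 40, 60, 80, 100), limits = c(0, 115))
plot(plot1)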
I'm trying to get a fine-grained visualisation of critical values I got from post hoc Tukey. There are some good guidelines out there for visualizing pairwise comparisons, but I need something more refined. The idea is that I would have a plot where each small square would represent a critical value from the matrix below, coded in such a manner that:
if the value is higher than or equal to 5.45, it's a black square;
if the value is lower than or equal to -5.45, it's a gray square;
if the value is between -5.45 and 5.45, it's a white square.
The data matrix is here.
Or maybe you have a better suggestion for how to visualize those critical values?
EDIT: Following comments from @Aaron and @DWin, I want to provide a bit more context for the above data and justification for my question. I am looking at the mean ratings of acceptability for seven virtual characters, each of them animated at 5 different levels. So, I have two factors there: character (7 levels) and motion (5 levels). Because I found an interaction between those two factors, I decided to look at the differences between the means for all the characters at all levels of motion, which resulted in this massive matrix as the output of post hoc Tukey. It's probably too much detail now, but please don't throw me out to Cross Validated, they will eat me alive...
This is fairly straightforward with image:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
image(x=1:35, y=1:35, as.matrix(d), breaks=c(min(d), -5.45, 5.45, max(d)),
col=c("grey", "white", "black"))
For just half, set half to missing with d[upper.tri(d)] <- NA and add na.rm=TRUE to the min and max functions.
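Put together, that variation might look like this:
d[upper.tri(d)] <- NA  # blank out the mirrored upper triangle
image(x=1:35, y=1:35, d,
      breaks=c(min(d, na.rm=TRUE), -5.45, 5.45, max(d, na.rm=TRUE)),
      col=c("grey", "white", "black"))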
Here is a ggplot2 solution. I'm sure there are simpler ways to accomplish this -- I guess I got carried away!
library(reshape2)
library(ggplot2)
# Load data.
postH = read.table("~/Downloads/postH.dat")
names(postH) = paste("item", 1:35, sep="")  # add column names
postH$item_id_x = paste("item", 1:35, sep="")  # add id column
# Convert data.frame to long form.
data_long = melt(postH, id.vars="item_id_x", variable.name="item_id_y")
# Convert to factor, controlling the order of the factor levels.
data_long$item_id_y = factor(as.character(data_long$item_id_y),
                             levels=paste("item", 1:35, sep=""))
data_long$item_id_x = factor(as.character(data_long$item_id_x),
                             levels=paste("item", 1:35, sep=""))
# Create critical value labels in a new column.
data_long$critical_level = ifelse(data_long$value >= 5.45, "high",
                                  ifelse(data_long$value <= -5.45, "low", "middle"))
# Convert the labels to a factor, controlling the order of the factor levels.
data_long$critical_level = factor(data_long$critical_level,
                                  levels=c("high", "middle", "low"))
# Named vector for ggplot's scale_fill_manual.
critical_level_colors = c(high="black", middle="grey80", low="white")
# Calculate grid line positions manually.
x_grid_lines = seq(0.5, length(levels(data_long$item_id_x)), 1)
y_grid_lines = seq(0.5, length(levels(data_long$item_id_y)), 1)
# Create plot.
plot_1 = ggplot(data_long, aes(xmin=as.integer(item_id_x) - 0.5,
                               xmax=as.integer(item_id_x) + 0.5,
                               ymin=as.integer(item_id_y) - 0.5,
                               ymax=as.integer(item_id_y) + 0.5,
                               fill=critical_level)) +
  theme_bw() +
  theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
  coord_cartesian(xlim=c(min(x_grid_lines), max(x_grid_lines)),
                  ylim=c(min(y_grid_lines), max(y_grid_lines))) +
  scale_x_continuous(breaks=seq(1, length(levels(data_long$item_id_x))),
                     labels=levels(data_long$item_id_x)) +
  scale_y_continuous(breaks=seq(1, length(levels(data_long$item_id_y))),
                     labels=levels(data_long$item_id_y)) +
  scale_fill_manual(name="Critical Values", values=critical_level_colors) +
  geom_rect() +
  geom_hline(yintercept=y_grid_lines, colour="grey40", size=0.15) +
  geom_vline(xintercept=x_grid_lines, colour="grey40", size=0.15) +
  theme(axis.text.y=element_text(size=9),
        axis.text.x=element_text(size=9, angle=90)) +
  labs(title="Critical Values Matrix")
# Save to pdf file.
pdf("plot_1.pdf", height=8.5, width=8.5)
print(plot_1)
dev.off()
If you set this up with findInterval as an index into the bg, col, and/or pch arguments (although they are all squares at the moment), you should find the code fairly compact and understandable.
You'll need to get the data in long format first; here's one way:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
dat <- within(as.data.frame(as.table(d)), {
  Var1 <- as.numeric(Var1)
  Var2 <- as.numeric(Var2)
})
Then the code is as follows; pch=22 uses filled squares, bg sets the fill color of the square, col sets the border color, and cex=1.5 just makes them a little bigger than the default.
plot(dat$Var1, dat$Var2,
     bg = c("grey", "white", "black")[1 + findInterval(dat$Freq, c(-5.45, 5.45))],
     col = "white", cex = 1.5, pch = 22)
You need the 1+ in there because the values would be 0,1,2 and your indices need to start with 1.
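A quick illustration of that indexing, using hypothetical values on either side of the cutoffs:
findInterval(c(-10, 0, 10), c(-5.45, 5.45))      # returns 0 1 2
1 + findInterval(c(-10, 0, 10), c(-5.45, 5.45))  # returns 1 2 3, valid R indices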
To close this out, I used the majority of the suggestions from @DWin and @Aaron to create the plot below. The lightest level of gray stands for non-significant values. I also used rect to create lines above the axis names to better differentiate between conditions:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
# remove the upper half of the values (as they are mirrored values)
d[upper.tri(d)] <- NA
dat <- within(as.data.frame(as.table(d)), {
  Var1 <- as.numeric(Var1)
  Var2 <- as.numeric(Var2)
})
par(mar = c(6, 3, 3, 6))
colPh = c("gray50", "gray90", "black")
plot(dat$Var1, dat$Var2, bg = colPh[1 + findInterval(dat$Freq, c(-5.45, 5.45))],
     col = "white", cex = 1.2, pch = 21, axes = FALSE, xlab = "", ylab = "")
labDis <- rep(c("A", "B", "C", "D", "E"), times = 7)
labChar <- c(1:7)
axis(1, at = 1:35, labels = labDis, cex.axis = 0.5, tick = FALSE, line = -1.4)
axis(1, at = seq(3, 33, 5), labels = labChar, tick = FALSE)
# drawing lines above the bottom axis for better identification
rect(1, 0, 5, 0); rect(6, 0, 10, 0); rect(11, 0, 15, 0)
rect(16, 0, 20, 0); rect(21, 0, 25, 0); rect(26, 0, 30, 0)
rect(31, 0, 35, 0)
axis(4, at = 1:35, labels = labDis, cex.axis = 0.5, tick = FALSE, line = -1.4)
axis(4, at = seq(3, 33, 5), labels = labChar, tick = FALSE)
# drawing lines along the right axis for better identification
rect(36, 1, 36, 5); rect(36, 6, 36, 10); rect(36, 11, 36, 15)
rect(36, 16, 36, 20); rect(36, 21, 36, 25); rect(36, 26, 36, 30)
rect(36, 31, 36, 35)
legend("topleft", legend = c("not significant", "p<0.01", "p<0.05"), pch = 16,
       col = c("gray90", "gray50", "black"), cex = 0.7, bty = "n")