ggplot dropping zeros from boxplot? - r

Hi. From reading and playing around with some data, it seems that ggplot might drop zeros when it draws plots such as boxplots. Apparently it has problems handling zeros on a log scale. When I make boxplots I constantly get the two warnings below. I assume the second is the removal of NAs, but the first looks like it might be dropping zeros:
Removed x rows containing non-finite values (stat_boxplot)
Removed x rows containing missing values (stat_summary)
for example
library(ggplot2)
df = read.table(text="X1 X1.1 X1.2 X1.3 X2 X2.1 X2.2 X2.3
1 0 3 4 3 2 3 1
2 'NA' 5 5 5 2 1 2
2 'NA' 2 1 2 1 2 5", header=TRUE)
library(reshape2) # provides melt()
dfmelt <- melt(df)
ggplot(dfmelt, aes(variable, value, fill=variable)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90))+
scale_x_discrete(labels=c('C1','C2','C3','C4','C5','C6','C7','C8'))+
scale_fill_manual(values=rep(c("red","green","blue","yellow"),2))+
stat_summary(fun.y = median, geom = "point", position = position_dodge(width = .9))+
scale_y_log10()
Does this only happen with a log scale? Could it affect the boxplot itself, in both its position and its median? Could data with several zeros and nonzero values have all the zeros dropped, shifting the box? And if so, what is the best way to handle it so that ggplot doesn't end up distorting my data?
thanks

log(0) is undefined (R returns -Inf for it), which is most likely why ggplot removes those values. There is simply no way to represent 0 on a log scale.
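To see how dropped zeros can shift the box, here is a minimal base R sketch. The values are made up for illustration and are not from the question. It counts the values that become non-finite under log10 and compares the median before and after they are discarded. The log1p line is only one possible workaround, not something the original answer proposes.

```r
# Hypothetical values with several zeros (not from the original post)
value <- c(0, 0, 0, 1, 2, 5, 10)

logged <- log10(value)        # each zero becomes -Inf
finite <- is.finite(logged)   # FALSE for the three zeros

n_dropped   <- sum(!finite)           # values a log-scaled stat cannot use
median_all  <- median(value)          # median of the full data
median_kept <- median(value[finite])  # median of what survives the log scale

# One possible workaround: log1p(x) = log(1 + x) stays finite at zero
shifted <- log1p(value)
```

With these numbers the median moves from 1 to 3.5 once the zeros are gone, so the box can shift noticeably. Whether a log(1 + x) style transform suits your data is a judgment call.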

Related

arithmetic operations and labelling in ggplot or R

I have a file that looks like this
2 3 LOGIC:A
2 5 LOGIC:A
3 4 LOGIC:Z
I plotted column 1 on x axis vs column 2 on y with column 3 acting as a legend
ggplot(Data, aes(V1, V2, col = V3)) + geom_point()
However, is it possible in ggplot itself to subtract column 1 from column 2 and label the rows with the 10 highest absolute differences, using the column 3 values as labels on those scatter points? I don't want to label the entire dataset, just the top 10 highest deltas.
You can try this (if your original data frame is Data):
library(dplyr)
library(ggplot2)
Data$sub <- abs(Data$V2 - Data$V1)
Data2<- Data %>%
top_n(10,sub)
ggplot()+ geom_text(data=Data2,aes(V1,V2-0.1,label=V3))+
geom_point(data=Data,aes(V1,V2))
With the dplyr library you can filter the top values of a data frame.
You can change "0.1" to a value that works better in your plot.
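If you would rather not depend on dplyr, the same selection can be sketched in base R. The data frame below is hypothetical, apart from the three rows shown in the question, and n_labels is an illustrative name. Unlike top_n(), this version does not keep extra rows when values tie.

```r
# Hypothetical data in the shape described in the question
Data <- data.frame(
  V1 = c(2, 2, 3, 1, 5),
  V2 = c(3, 5, 4, 9, 5),
  V3 = c("LOGIC:A", "LOGIC:A", "LOGIC:Z", "LOGIC:B", "LOGIC:C")
)

# Absolute difference between the two columns
Data$delta <- abs(Data$V2 - Data$V1)

# Keep the n_labels rows with the largest delta (2 here, 10 in the question)
n_labels <- 2
ordered  <- Data[order(Data$delta, decreasing = TRUE), ]
top_rows <- ordered[seq_len(min(n_labels, nrow(ordered))), ]
```

top_rows can then be passed as the data argument of geom_text(), as in the answer above.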

Approach for creating plotting means from data frame

I am trying to develop a flexible script to plot the mean of a continuous variable, 'score', as a function of discrete time points, 'day', from a data frame.
I can do this by creating subsets, but I have a big set of data with many factor vectors like 'day,' so would like to get vectors or a data frame for each factor and its corresponding mean.
I have a data frame structured like this:
subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA
I tried tapply(), but couldn't get it to output vectors or tables that have appropriate headers and also handle NAs.
Looking for a simple bit of code to get two vectors or a data frame with which to plot mean of 'score' as a function of factor 'day'.
So the plot will have point for average score on each day 0, 7, and 14.
I have seen a lot of posts for doing this directly with ggplot, but it seems useful to know how to do, and I need to see the output to make sure it is handling NAs correctly.
If you are able to help, please include explanatory annotations in your script. Thanks!
I think tapply should be able to handle this, you can amend the function to remove NAs:
df=data.frame("subject"=rep(1:3,3), "day"=as.factor(rep(c(0,7,14),each=3)),
"score"=c(99.13,NA,86.87,73.71,82.42,84.45,66.88,83.73,NA))
res = with(df, tapply(score, day, function(x) mean(x,na.rm=T)))
EDIT to get day and score as vectors
day=as.numeric(names(res))
day
0 7 14
score=as.numeric(res)
score
93.00000 80.19333 75.30500
Plot in base R:
plot(x=as.numeric(as.character(df$day)),y=df$score,type="p")
lines(x=as.numeric(names(res)),y=res, col="red")
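Since the question also asks for a data frame of each day and its mean, here is a minimal alternative sketch using aggregate() from base R. With the formula interface, rows whose score is NA are dropped by default (na.action = na.omit), so no custom function is needed. The name day_means is illustrative.

```r
# Data frame from the question, with day kept numeric
df <- data.frame(
  subject = rep(1:3, 3),
  day     = rep(c(0, 7, 14), each = 3),
  score   = c(99.13, NA, 86.87, 73.71, 82.42, 84.45, 66.88, 83.73, NA)
)

# One row per day with the mean score; NA scores are omitted by default
day_means <- aggregate(score ~ day, data = df, FUN = mean)
```

day_means has the columns day and score, so plot(day_means$day, day_means$score) draws one point per day.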
It is not entirely clear what you are trying to achieve. Here I will show how to use the ggplot2 package to create a point plot with the mean for each group, assuming that dt is your data frame.
library(ggplot2)
ggplot(dt, aes(x = day, y = score, color = factor(subject))) + # Specify x, y and color information
geom_point(size = 3) + # plot the point and specify the size is 3
scale_color_brewer(name = "Subject",
type = "qual",
palette = "Pastel1") + # Format the color of points and the legend using ColorBrewer
scale_x_continuous(breaks = c(0, 7, 14)) + # Set the breaks on x-axis
stat_summary(fun.y = "mean",
color = "red",
geom = "point",
size = 5,
shape = 8) + # Compute mean of each group and plot it
theme_classic() # Specify the theme
Warning messages:
1: Removed 2 rows containing non-finite values (stat_summary).
2: Removed 2 rows containing missing values (geom_point).
If you run the code above, you will get these warnings along with the plot. The warnings mean that the NA values have been removed, so you don't need to remove them from the dataset yourself.
DATA
dt <- read.table(text = "subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA",
header = TRUE, stringsAsFactors = FALSE)

using geom_bar to plot the sum of values by criteria in R

I'm new to R and I am trying to use ggplot to create a set of bar graphs, one per id, shown together. Each bar must represent the sum of the values in column d by month-year (column c). Column d has NA values as well as numeric values.
My dataframe, df, is something like this, but it has actually around 10000 rows:
#Example of my data
a=c(1,1,1,1,1,1,1,1,3)
b=c("2007-12-03", "2007-12-10", "2007-12-17", "2007-12-24", "2008-01-07", "2008-01-14", "2008-01-21", "2008-01-28","2008-02-04")
c=format(as.Date(b),"%m-%Y") # b is character, so convert to Date before formatting
d=c(NA,NA,NA,NA,NA,4.80, 0.00, 5.04, 3.84)
df=data.frame(a,b,c,d)
df
a b c d
1 1 2007-12-03 12-2007 NA
2 1 2007-12-10 12-2007 NA
3 1 2007-12-17 12-2007 NA
4 1 2007-12-24 12-2007 NA
5 1 2008-01-07 01-2008 NA
6 1 2008-01-14 01-2008 4.80
7 1 2008-01-21 01-2008 0.00
8 1 2008-01-28 01-2008 5.04
9 3 2008-02-04 02-2008 3.84
I tried to do my graph using this:
mplot<-ggplot(df,aes(y=d,x=c))+
geom_bar()+
theme(axis.text.x = element_text(angle=90, vjust=0.5))+
facet_wrap(~ a)
I read from the help of geom_bar():
"geom_bar uses stat_count by default: it counts the number of cases at each x position"
So, I thought it would work like that, but I'm getting this error:
Error: stat_count() must not be used with a y aesthetic.
For the sample I'm providing, I would like to have the graph for id 1 that shows the months with NA empty and the 01-2008 with 9.84. Then for the second id, I would like to have again the months with NA empty and 02-2008 with 3.84.
I also tried to sum the data per month with aggregate and sum before plotting, and then to use identity in the stat parameter of geom_bar, but I'm getting NA in some months and I don't know the reason.
I really appreciate your help.
You should use geom_col not geom_bar. See the help text:
There are two types of bar charts: geom_bar makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col instead. geom_bar uses stat_count by default: it counts the number of cases at each x position. geom_col uses stat_identity: it leaves the data as is.
So your final line of code should be:
ggplot(df, aes(y=d, x=c)) + geom_col() + theme(axis.text.x = element_text(angle=90, vjust=0.5))+facet_wrap(~ a)
Do you want something like this:
mplot = ggplot(df, aes(x = b, y = d))+
geom_bar(stat = "identity", position = "dodge")+
facet_wrap(~ a)
mplot
I am using x = b instead of x = c for now.
No need to use geom_col as suggested by @Jan. Simply use the weight aesthetic instead:
ggplot(iris, aes(Species, weight=Sepal.Width)) + geom_bar() + ggtitle("summed sepal width")
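On the remark in the question that summing with aggregate before plotting gave NA for some months: a likely cause (an assumption, since that code was not posted) is that sum() returns NA whenever a group contains an NA, unless na.rm = TRUE is passed. A minimal base R sketch using the sample data, with the month column built through as.Date(); the names df_bar, plain and clean are illustrative.

```r
# Sample data from the question; dates parsed so the month label can be derived
a <- c(1, 1, 1, 1, 1, 1, 1, 1, 3)
b <- as.Date(c("2007-12-03", "2007-12-10", "2007-12-17", "2007-12-24",
               "2008-01-07", "2008-01-14", "2008-01-21", "2008-01-28",
               "2008-02-04"))
d <- c(NA, NA, NA, NA, NA, 4.80, 0.00, 5.04, 3.84)
df_bar <- data.frame(a = a, b = b, c = format(b, "%m-%Y"), d = d)

# Plain sum: any NA in a group makes the whole group NA
plain <- aggregate(df_bar["d"], by = df_bar[c("a", "c")], FUN = sum)

# na.rm = TRUE is passed on to sum(), so NAs are skipped;
# a month with only NAs then sums to 0
clean <- aggregate(df_bar["d"], by = df_bar[c("a", "c")], FUN = sum, na.rm = TRUE)
```

clean can then be plotted with geom_col() or geom_bar(stat = "identity"); months that sum to 0 show up as empty bars.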

Calculate the run length of a variable and plot with ggplot

I'm using ggplot to plot an ordered sequence of numbers that is colored by a factor. For example, given this fake data:
# Generate fake data
library(dplyr)
set.seed(12345)
plot.data <- data.frame(fitted = rnorm(20),
actual = sample(0:1, 20, replace=TRUE)) %>%
arrange(fitted)
head(plot.data)
fitted actual
1 -1.8179560 0
2 -0.9193220 1
3 -0.8863575 1
4 -0.7505320 1
5 -0.4534972 1
6 -0.3315776 0
I can easily plot the actual column from rows 1–20 as colored lines:
# Plot with lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1))
The gist of this plot is to show how often the actual numbers appear sequentially across the range of fitted values. As you can see in the image, sequential 0s and 1s are readily seen as sequential blue and red vertical lines.
However, I'd like to move away from the lines and use geom_rect instead to create bands for the sequential number. I can fake this with really thick lineranges:
# Fake rectangular regions with thick lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1), size=10)
But the size of these lines is dependent on the number of observations—if they're too thick, they'll overlap. Additionally, doing this means that there are a bunch of extraneous graphical elements that are plotted (i.e. sequential rectangular sections are really just a bunch of line segments that bleed into each other). It would be better to use geom_rect instead.
However, geom_rect requires that data include minimum and maximum values for x, meaning that I need to reshape actual to look something like this instead:
xmin xmax colour
0 1 red
1 5 blue
I need to programmatically calculate the run length of each color to mark the beginning and end of that color. I know that R has the rle() function, which is likely the best option for calculating the run length, but I'm unsure about how to split the run length into two columns (xmin and xmax).
What's the best way to calculate the run length of a variable so that geom_rect can plot it correctly?
Thanks to @baptiste, it seems that the best way to go about this is to condense the data into just those rows where the value of actual changes:
condensed <- plot.data %>%
mutate(x = seq_along(actual), change = c(0, diff(actual))) %>%
subset(change != 0 ) %>% select(-change)
first.row <- plot.data[1,] %>% mutate(x = 0)
condensed.plot.data <- rbind(first.row, condensed) %>%
mutate(xmax = lead(x),
xmax = ifelse(is.na(xmax), max(x) + 1, xmax)) %>%
rename(xmin = x)
condensed.plot.data
# fitted actual xmin xmax
# 1 -1.8179560 0 0 2
# 2 -0.9193220 1 2 6
# 3 -0.3315776 0 6 9
# 4 -0.1162478 1 9 11
# 5 0.2987237 0 11 14
# 6 0.5855288 1 14 15
# 7 0.6058875 0 15 20
# 8 1.8173120 1 20 21
ggplot(condensed.plot.data) +
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=0, ymax=1, fill=factor(actual)))
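Since the question mentions rle(), here is a minimal sketch of that route as an alternative to the diff-based approach above. The actual vector is hypothetical. The cumulative sum of the run lengths gives the right edge of each band, and subtracting the run length gives the left edge.

```r
# Hypothetical sequence of 0/1 outcomes, already ordered by fitted value
actual <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1)

runs <- rle(actual)            # run lengths and the value of each run

xmax <- cumsum(runs$lengths)   # right edge of each band
xmin <- xmax - runs$lengths    # left edge of each band

bands <- data.frame(xmin = xmin, xmax = xmax, actual = runs$values)
```

bands has one row per run and can be passed to geom_rect() with the same aesthetics as above.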

R plot function - axes for a line chart

Assume the following frequency table in R, which comes from a survey:
1 2 3 4 5 8
m 5 16 3 16 5 0
f 12 25 3 10 3 1
NA 1 0 0 0 0 0
The rows stand for the gender of the survey respondent (male/female/no answer). The columns represent the answers to a question on a 5-point scale (let's say: 1 = agree fully, 2 = agree somewhat, 3 = neither agree nor disagree, 4 = disagree somewhat, 5 = disagree fully, 8 = no answer).
The data is stored in a dataframe called "slm", the gender variable is called "sex", the other variable is called "tv_serien".
My problem is that I can't find a proper (in my opinion) way to create a line chart where the x-axis represents the 5-point scale (plus the "don't know" answers) and the y-axis represents the frequencies for every point on the scale. Furthermore, I want to create two lines (one for males, one for females).
My solution so far is the following:
I create a plot without plotting the "content" and the x-axis:
plot(slm$tv_serien, xlim = c(1,6), ylim = c(0,100), type = "n", xaxt = "n")
The problem here is that it feels like cheating to specify xlim = c(1,6), because the raw scores of slm$tv_serien are 100 values. I also tried to plot the variable via plot(factor(slm$tv_serien)...), but then it would still create a metric scale from 1 to 8 (because the "don't know" answer is 8).
So my first question is how to tell R that it should take the six distinct values (1 to 5 and 8) and take that as the x-axis?
I create the new x axis with proper labels:
axis(1, 1:6, labels = c("1", "2", "3", "4", "5", "DK"))
At least that works pretty well. ;-)
Next I create the line for the males:
lines(1:5, table(slm$tv_serien[slm$sex == 1]), col = "blue")
The problem here is that there is no DK (=8) answer, so I manually have to specify x = 1:5 instead of 1:6 in the "normal" case. My question here is, how to tell R to also draw the line for nonexisting values? For example, what would have happened, if no male had answered with 3, but I want a continuous line?
At last I create the line for females, which works well:
lines(1:6, table(slm$tv_serien[slm$sex == 2]), col = "red")
To summarize:
How can I tell R to take the 6 distinct values of slm$tv_serien as the x axis?
How can I draw continuous lines even if the line contains "0"?
Thanks for your help!
PS: Attached you will find the current plot for the above-mentioned functions.
PPS: I tried to make a list from "1." to "4." but it seems that every new list element started again with "1.". Sorry.
Edit: Response to OP's comment.
This directly creates a line chart of OP's data. Below this is the original answer using ggplot, which produces a far superior output.
Given the frequency table you provided,
df <- data.frame(t(freqTable)) # transpose (more suitable for plotting)
df <- cbind(Response=rownames(df),df) # add row names as first column
plot(as.numeric(df$Response),df$f,type="b",col="red",
xaxt="n", ylab="Count",xlab="Response")
lines(as.numeric(df$Response),df$m,type="b",col="blue")
axis(1,at=c(1,2,3,4,5,6),labels=c("Str.Agr.","Sl.Agr","Neither","Sl.Disagr","Str.Disagr","NA"))
Produces this, which seems like what you were looking for.
Original Answer:
Not quite what you asked for, but converting your frequency table to a data frame, df
df <- data.frame(freqTable)
df <- cbind(Gender=rownames(df),df) # append rownames (Gender)
df <- df[-3,] # drop unknown gender
df
# Gender X1 X2 X3 X4 X5 X8
# m m 5 16 3 16 5 0
# f f 12 25 3 10 3 1
library(ggplot2)
library(reshape2)
gg=melt(df)
labels <- c("Agree\nFully","Somewhat\nAgree","Neither Agree\nnor Disagree","Somewhat\nDisagree","Disagree\nFully", "No Answer")
ggp <- ggplot(gg,aes(x=variable,y=value))
ggp <- ggp + geom_bar(aes(fill=Gender), position="dodge", stat="identity")
ggp <- ggp + scale_x_discrete(labels=labels)
ggp <- ggp + theme(axis.text.x = element_text(angle=90, vjust=0.5))
ggp <- ggp + labs(x="", y="Frequency")
ggp
Produces this:
Or, this, which is much better:
ggp + facet_grid(Gender~.)
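For the second part of the question, drawing a continuous line when some answers never occur, one option not shown in the answer above is to convert the answers to a factor with explicit levels before calling table(). Unused levels are then counted as 0 instead of being left out. A minimal sketch with made-up responses; answers and scale_levels are illustrative names.

```r
# Hypothetical answers for one group; nobody chose 3 or 8 ("don't know")
answers <- c(1, 2, 2, 4, 5, 5, 1, 2)

# All values the scale can take, in the order they should appear on the axis
scale_levels <- c(1, 2, 3, 4, 5, 8)

# table() on a factor reports every level, including those with zero counts
counts <- table(factor(answers, levels = scale_levels))
```

Because counts always has six entries, lines(1:6, as.vector(counts)) works for every group, and the six levels map onto positions 1 to 6 of the x-axis without hard-coding which answers happen to exist.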
