Reminding R that integer is a factor when producing Boxplot - r

Good afternoon,
This is my 1st question here and every attempt is being made to be thorough.
I am working with a large data set (casualtiesdf) in R and I am trying to produce a boxplot, using ggplot2, of the variable Age_of_Casualty by the Casualty_Severity variable. The problem is that R thinks the Casualty_Severity variable is an integer. Casualty_Severity is coded in the data as the numbers 1, 2, 3.
Below you can see that I've tried to rename each integer to the named category to which it corresponds and then converted the variable into a factor.
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 1] <- "Fatal"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 2] <- "Serious"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 3] <- "Slight"
casualtiesdf$Casualty_Severity <- as.factor(casualtiesdf$Casualty_Severity)
When I try doing the Boxplot, however...
> ggplot(data = casualtiesdf, aes(x = Age_of_Casualty,
+ y = casualtiesdf$Casualty_Severity)) +
+ geom_boxplot()
I get: "Warning message:position_dodge requires non-overlapping x intervals"
I typed this message into Google and Stack Overflow seems to advise putting the categorical variable on the x axis (yes, I'm still very confused with my x's and y's...) so I tried:
ggplot(data = casualtiesdf, aes(x = Casualtiesdf$Casualty_Severity,
y = Age_of_Casualty +
geom_boxplot()
and get error message "Error: object 'Age_of_Casualty' not found"
I then went for thinking that maybe I have to put the as.factor in the plot code:
ggplot(data = casualtiesdf, aes(x = casualtiesdf$Casualty_Severity
as.factor(casualtiesdf$Casualty_Severity))) y = Age_of_Casualty) +
geom_boxplot()
and get error message "unexpected symbol in: geom_boxplot() ggplot"
Any help with this is greatly appreciated!

Is Age_of_Casualty also part of the data frame? If not, consider adding it as a column of the df, either by merging or by a separate assignment.
I created a dummy dataframe, with two variables
casualtiesdf <- data.frame(Casualty_Severity=c(1,2,1,1,2,3,1,3),
Age_of_Casualty = c(31,32,32,33,33,33,35,35))
I then created another variable to store Casualty_Severity as a factor
casualtiesdf$Casualty_Severity_factor <- factor(x = casualtiesdf$Casualty_Severity,
levels = c(1,2,3),
labels = c("Fatal","Serious","Slight"))
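Before plotting, it can help to confirm that the conversion did what you expect. The sketch below reuses the dummy data from this answer; the name severity_counts is only illustrative and not part of the original code.

```r
# Dummy data, same values as in the answer above
casualtiesdf <- data.frame(
  Casualty_Severity = c(1, 2, 1, 1, 2, 3, 1, 3),
  Age_of_Casualty   = c(31, 32, 32, 33, 33, 33, 35, 35)
)
casualtiesdf$Casualty_Severity_factor <- factor(
  x = casualtiesdf$Casualty_Severity,
  levels = c(1, 2, 3),
  labels = c("Fatal", "Serious", "Slight")
)

# The new column should be a factor with three named levels
str(casualtiesdf$Casualty_Severity_factor)

# Cross-tabulate the old codes against the new labels:
# every code should land in exactly one label column
severity_counts <- table(
  code  = casualtiesdf$Casualty_Severity,
  label = casualtiesdf$Casualty_Severity_factor
)
print(severity_counts)
```

If a code shows up under the wrong label, or a row sums to zero, the levels and labels arguments are probably out of step with the data.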
With that, I can then do the box plot, with the casualty_severity as X-axis
library("ggplot2")
ggplot(data = casualtiesdf,
aes(x= Casualty_Severity_factor, y = Age_of_Casualty)) +
geom_boxplot()
This should give you some plot like this

So it makes sense that R reports a syntax error in your third example: unexpected symbol in: geom_boxplot() means "I have no idea what to do with that ...))) y = business."
In your first example, Age_of_Casualty is mistakenly assigned to X. It is really the variable whose distribution you want to analyze, so it should be the Y variable.
So you're right, you need to establish Casualty_Severity as a Factor and make sure to ascribe the two variables to X and Y correctly. Something like this:
# Creating dummy data
AC.rand <- sample(15:90, 500, replace = TRUE)
CS.rand <- sample(1:3, 500, replace = TRUE)
# Combine them into a dataframe, define the "Severity" variable as a Factor
casualtiesdf <- data.frame(Casualty_Severity = factor(CS.rand), Age_of_Casualty = AC.rand)
# Name the Levels of the "Severity" variable (optional; the plot works without it)
levels(casualtiesdf$Casualty_Severity) <- c("Fatal", "Serious", "Slight")
# Severity (the factor) goes on X, age (the distribution of interest) on Y
library(ggplot2)
g <- ggplot(data = casualtiesdf, aes(x = Casualty_Severity, y = Age_of_Casualty))
g <- g + geom_boxplot()
When I mocked up 500 rows of data I get something like:
I'm an SO noob, too, so let's learn together! :)

Related

ggplot2 does not plot multiple groups of a variable, only plots one line

I would like to make a plot with multiple lines corresponding to the different groups of the variable "Prob" (0.1, 0.5 and 0.9) using ggplot. However, when I run the code, it plots only one line instead of 3. Thanks for the help :)
Here is my code:
Prob <- c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9)
nit <- c(0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999,0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999,0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999)
greek <- log((1-Prob)/Prob)/-10
italian <- ((0.997-nit)/(0.997-0.97))^3
Temp<-c(rep(25,111))
GT <- ((30-Temp)/(30-3.3))^3
GH <- 1-GT-italian
acid <- (-1*(((sign(GH)*(abs(GH)^(1/3)))*(7-5))-7))
Species<-c(rep("Case",111))
data <- as.data.frame(cbind(Prob,greek,GT,GH,italian, Temp,acid,nit, Species))
ggplot() +
geom_line(data = data, aes_string(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)
The answer has two parts:
In your data frame data, the columns that should be numeric are not numeric.
The reason why you only see one line.
Fixing the Data Frame and Using aes() in place of aes_string()
I noticed something was odd when I saw that you use as.data.frame(cbind(... to make your data frame and aes_string(... within the ggplot portion. If you do a quick check on data via str(data), you'll see that all of the columns in data are characters, whereas the vectors you prepared in the environment are numeric. For example, acid is numeric, yet data$acid is a character.
The reason is that you bind the columns into a data frame with as.data.frame(cbind(.... cbind() builds a matrix, and a matrix can hold only one type, so the character vector Species forces every column to be coerced to character and you lose the numeric nature of the data. This is also why you have to use aes_string(...) to make it work instead of aes(). To bind vectors together into a data frame, use data.frame(..., not as.data.frame(cbind(....
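A minimal sketch of the coercion described above, using two made-up vectors (num and chr are illustrative names, not from the question):

```r
num <- c(1.5, 2.5, 3.5)
chr <- c("a", "b", "c")

# cbind() returns a matrix; mixing numeric and character input
# makes the whole matrix character
m <- cbind(num, chr)
typeof(m)

bad  <- as.data.frame(m)      # num is no longer numeric in this data frame
good <- data.frame(num, chr)  # each column keeps its own type

str(bad)
str(good)
```

Comparing the two str() outputs shows the difference directly: only the data frame built with data.frame() still has a numeric num column.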
To fix all this, bind your columns together like this + the ggplot code:
data <- data.frame(Prob,greek,GT,GH,italian, Temp,acid,nit, Species)
# data <- as.data.frame(cbind(Prob,greek,GT,GH,italian, Temp,acid,nit, Species))
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)
Why is there only one line?
The simple answer to why you only see one line is that the lines for all values of data$Prob are identical. Neither acid nor nit depends on Prob (only greek does, and greek is not plotted), so what you see is the effect of overplotting: the line for data$Prob == 0.1 is the same line as for data$Prob == 0.5 and data$Prob == 0.9.
To demonstrate this, let's separate them. Prob could have been created by repeating 0.1, 0.5, and 0.9 each 37 times in a row, so I'll create a vector of the same shape and use it as a multiplication factor for data$nit, which will separate our lines:
my_factor <- rep(c(1,1.1,1.5), each=37) # our multiplication factor
data$nit <- data$nit * my_factor # new nit column
# same plot code
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)
There ya go. We have all lines there, you just could not see them due to overplotting. You can convince yourself of this without the multiplication business and the original data by comparing the plots for each data$Prob:
# use original dataset as above
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8) +
facet_wrap(~Prob)
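The same claim can also be checked numerically, without any plotting. This is only a sketch: it rebuilds nit with seq() on the assumption that the question's values are 37 evenly spaced points from 0.9 to 0.9999999, and the names dat, groups and identical_to_first are made up for the example.

```r
# Rebuild the plotted columns (assumed equivalent to the question's vectors)
nit  <- rep(seq(0.9, 0.9999999, length.out = 37), times = 3)
Prob <- rep(c(0.1, 0.5, 0.9), each = 37)
Temp <- rep(25, 111)

italian <- ((0.997 - nit) / (0.997 - 0.97))^3
GT      <- ((30 - Temp) / (30 - 3.3))^3
GH      <- 1 - GT - italian
acid    <- -1 * ((sign(GH) * abs(GH)^(1/3)) * (7 - 5) - 7)

dat <- data.frame(Prob, acid, nit)

# Split the x/y columns by Prob and compare every group with the first one
groups <- split(dat[c("acid", "nit")], dat$Prob)
identical_to_first <- vapply(groups, function(g) {
  isTRUE(all.equal(g$acid, groups[[1]]$acid)) &&
    isTRUE(all.equal(g$nit, groups[[1]]$nit))
}, logical(1))

print(identical_to_first)
```

If every entry is TRUE, the three groups trace exactly the same path, which is why only one line is visible.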

How do I loop a ggplot2 function to export and save about 40 plots?

I am trying to loop a ggplot2 plot with a linear regression line over it. It works when I type the y column name manually, but the loop method I am trying does not work. It is definitely not a dataset issue.
I've tried many solutions from various websites on how to loop a ggplot and the one I've attempted is the simplest I could find that almost does the job.
The code that works is the following:
plots <- ggplot(Everything.any, mapping = aes(x = stock_VWRETD, y = stock_10065)) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
But I do not want to do this another 40 times (and then 5 times more for other reasons). The code that I found online and tried to adapt to my needs is the following:
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in seq_along(nm)){
plots <- ggplot(z, mapping = aes(x = stock_VWRETD, y = nm[i])) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1",nm[i],".png",sep=" "))
}
}
plotRegression(Everything.any)
I expected the usual Stock returns vs Market returns graph. Instead, the y-axis shows a single value, the name of the respective column, and the Market values are plotted as normal but along a straight line at that one y-axis value. Please let me know what I am doing wrong.
Desired Plot:
Actual Plot:
Sample Data is available on Google Drive here:
https://drive.google.com/open?id=1Xa1RQQaDm0pGSf3Y-h5ZR0uTWE-NqHtt
The problem is that when you assign variables to aesthetics in aes, you mix bare names and strings. In this example, both X and Y are supposed to be variables in z:
aes(x = stock_VWRETD, y = nm[i])
You refer to stock_VWRETD using a bare name (as required with aes), however for y=, you provide the name as a character vector produced by colnames. See what happens when we replicate this with the iris dataset:
ggplot(iris, aes(Petal.Length, 'Sepal.Length')) + geom_point()
Since aes expects variable names to be given as bare names, it doesn't interpret 'Sepal.Length' as a variable in iris but as a separate vector (consisting of a single character value) which holds the y-values for each point.
What can you do? Here are 2 options that both give the proper plot
1) Use aes_string and change both variable names to character:
ggplot(iris, aes_string('Petal.Length', 'Sepal.Length')) + geom_point()
2) Use square bracket subsetting to manually extract the appropriate variable:
ggplot(iris, aes(Petal.Length, .data[['Sepal.Length']])) + geom_point()
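Applied to the loop from the question, option 2 might look like the sketch below. It uses iris as stand-in data; plot_regressions, x_col and out_dir are hypothetical names introduced here, and the files go to a temporary directory only so the example is safe to run. Since aes_string() is soft-deprecated in recent ggplot2 releases, the .data pronoun is probably the more future-proof of the two options.

```r
library(ggplot2)

# Hypothetical helper: one scatter plot with a fitted line per numeric
# column of `z`, each plotted against the column named in `x_col`
plot_regressions <- function(z, x_col, out_dir = tempdir()) {
  is_num <- vapply(z, is.numeric, logical(1))
  y_cols <- setdiff(names(z)[is_num], x_col)
  files  <- character(0)
  for (y_col in y_cols) {
    # .data[[...]] looks the column up by its name held in a string
    p <- ggplot(z, aes(x = .data[[x_col]], y = .data[[y_col]])) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ x) +
      labs(x = x_col, y = y_col)
    f <- file.path(out_dir, paste0("regression_", y_col, ".png"))
    ggsave(filename = f, plot = p, width = 6, height = 4)
    files <- c(files, f)
  }
  invisible(files)
}

saved <- plot_regressions(iris, "Sepal.Length")
print(basename(saved))
```

Skipping non-numeric columns and the x column itself is a design choice of this sketch; the original loop iterates over every column name.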
You need to use aes_string instead of aes, with double quotes around your x variable; then you can use your loop variable directly. You can also simplify your for loop by iterating over the column names themselves. Here is an example using iris.
library(ggplot2)
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in nm){
plots <- ggplot(z, mapping = aes_string(x = "Sepal.Length", y = i)) +
geom_point()+
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1_",i,".png",sep=""))
}
}
myiris<-iris
plotRegression(myiris)

How to set y=rows and x=columns in ggplot2?

Before I get to my question, I should point out that I am new in R, and this question might be simplicity itself for an experienced user.
I want to use ggplot2 to take full advantage of all the functionalities therein. However, I have encountered a problem that I have not been able to solve.
If I have a data frame as follows:
df = as.data.frame(cbind(rnorm(100,35:65),rnorm(100,25:35),rnorm(100,15:20),rnorm(100,5:10),rnorm(100,0:5)))
header = c("A","B","C","D","E")
names(df) = make.names(header)
Plotting the data with the values in each column on the Y axis and the columns along the X axis can readily be done in base R, e.g. like this:
par(mfrow=c(2,1))
stripchart(df, vertical = TRUE, method = 'jitter')
boxplot(df)
The picture shows the stripchart & boxplot of the data
However, the same cannot readily be done in ggplot2, as x and y inputs are required. All examples I have found plot one column vs another column, or process the data into the column format. Yet I want to set y as the rows in my df and x as the columns. How can this be accomplished?
You'll need to reshape your data in order to get those graphs. I think this is what you're looking for:
> library(ggplot2)
> library(reshape2)
> df = as.data.frame(cbind(rnorm(100,35:65),rnorm(100,25:35),rnorm(100,15:20),rnorm(100,5:10),rnorm(100,0:5)))
> header = c("A","B","C","D","E")
> names(df) = make.names(header)
> df = melt(df)
No id variables; using all as measure variables
> head(df)
variable value
1 A 36.75505
2 A 35.68714
3 A 36.44952
4 A 38.77236
5 A 39.79136
6 A 39.39672
> ggplot(df, aes(x = variable, y = value))
> ggplot(df, aes(x = variable, y = value)) + geom_boxplot()
> ggplot(df, aes(x = variable, y = value)) + geom_point(shape = 0, size = 20)
Here is the box plot:
Here is the strip chart:
You can change the mappings inside aes(). See the ggplot2 documentation for more info.
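If you would rather not depend on reshape2, base R's stack() does a similar wide-to-long reshape. The following is a sketch with smaller made-up data (wide and long are illustrative names); note that stack() names its columns values and ind rather than value and variable.

```r
library(ggplot2)

set.seed(42)
# Made-up wide data: one column per group
wide <- data.frame(
  A = rnorm(100, mean = 50),
  B = rnorm(100, mean = 30),
  C = rnorm(100, mean = 17)
)

# stack() returns a long data frame with the columns `values` and `ind`
long <- stack(wide)
head(long)

# Box plot with jittered points on top, similar in spirit to
# boxplot() plus stripchart(method = "jitter") in base R
p <- ggplot(long, aes(x = ind, y = values)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4)
print(p)
```

geom_jitter() is used here for the strip chart because it spreads the points horizontally within each group, which is what method = 'jitter' does in the base R version.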

Getting counts on bins in a heat map using R

This question follows from these two topics:
How to use stat_bin2d() to compute counts labels in ggplot2?
How to show the numeric cell values in heat map cells in r
In the first topic, a user wants to use stat_bin2d to generate a heat map, and then wants the count of each bin written on top of the heat map. The method the user initially wants to use doesn't work; the best answer states that stat_bin2d is designed to work with geom = "rect" rather than "text". No satisfactory response is given.
The second question is almost identical to the first, with one crucial difference: the variables in the second question are text, not numeric. The answer produces the desired result, placing the count value for a bin over the bin in a stat_bin2d heat map.
To compare the two methods I've prepared the following code:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))
We know this first gives you the error:
"Error: geom_text requires the following missing aesthetics: x, y".
Same issue as in the first question. Interestingly, changing from stat_bin2d to stat_binhex works fine:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_hex() +
stat_binhex(geom="text", aes(label=..count..))
Which is great and all, but generally, I don't think hex binning is very clear, and for my purposes it won't work for the data I'm trying to describe. I really want to use stat_bin2d.
To get this to work, I've prepared the following workaround based on the second answer:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
x_t<-as.character(round(data$x)) # bin by rounding to the nearest integer
y_t<-as.character(round(data$y))
x_x<-as.character(seq(-3,3,1)) # axis order for the discrete bins
y_y<-as.character(seq(-3,3,1))
data<-cbind(data,x_t,y_t)
ggplot(data, aes(x = x_t, y = y_t)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))+
scale_x_discrete(limits =x_x) +
scale_y_discrete(limits=y_y)
This workaround lets you bin numerical data, but you have to determine the bin width yourself (I did it via rounding) before bringing the data into ggplot. I actually figured it out while writing this question, so I may as well finish.
This is the result: (turns out I can't post images)
So my real question is: does anyone have a better way to do this? I'm happy I at least got it to work, but so far I haven't seen an answer for putting labels on stat_bin2d bins when using a numerical variable.
Does anyone have a method for passing the x and y arguments on to geom_text from stat_bin2d without having to use a workaround? Can anyone explain why it works with text variables but not with numbers?
Another workaround (but perhaps less work). Similar to the ..count.. method, you can extract the counts from the plot object in two steps.
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
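A third possibility, sketched here under the assumption that fixed-width bins with edges chosen up front are acceptable, is to do the binning in base R with cut() and table() and then draw the tiles and labels yourself. The names breaks, xbin, ybin and counts are illustrative; points outside the chosen edges are dropped.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Assumed bin edges: width 1, covering -4 to 4
breaks <- seq(-4, 4, by = 1)
dat$xbin <- cut(dat$x, breaks)
dat$ybin <- cut(dat$y, breaks)

# Count the points in every (xbin, ybin) cell; the result has a Freq column
counts <- as.data.frame(table(xbin = dat$xbin, ybin = dat$ybin))
counts <- counts[counts$Freq > 0, ]  # keep only non-empty cells

# Tiles coloured by count, with the count printed on each tile
p <- ggplot(counts, aes(x = xbin, y = ybin, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), colour = "white")
print(p)
```

Because the axes are discrete here, the tick labels are the interval names produced by cut(), such as (-1,0], rather than a continuous numeric scale.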

Transform color scale, but keep a nice legend with ggplot2

I have seen somewhat similar questions to this, but I'd like to ask my specific question as directly as I can:
I have a scatter plot with a "z" variable encoded into a color scale:
library(ggplot2)
myData <- data.frame(x = rnorm(1000),
y = rnorm(1000))
myData$z <- with(myData, x * y)
badVersion <- ggplot(myData,
aes(x = x, y = y, colour = z))
badVersion <- badVersion + geom_point()
print(badVersion)
Which produces this:
As you can see, since the "z" variable is normally distributed, very few of the points are colored with the "extreme" colors of the distribution. This is as it should be, but I am interested in emphasizing difference. One way to do this would be to use:
betterVersion <- ggplot(myData,
aes(x = x, y = y, colour = rank(z)))
betterVersion <- betterVersion + geom_point()
print(betterVersion)
Which produces this:
By applying rank() to the "z" variable, I get a much greater emphasis on minor differences within the "z" variable. One could imagine using any transformation here, instead of rank, but you get the idea.
My question is, essentially, what is the most straightforward way, or the most "true ggplot2" way, of getting a legend in the original units (units of z, as opposed to the rank of z), while maintaining the transformed version of the colored points?
I have a feeling this uses rescaler() somehow, but it is not clear to me how to use rescaler() with arbitrary transformations, etc. In general, more clear examples would be useful.
Thanks in advance for your time.
Have a look at the package scales
especially
?trans
I think that a transformation that maps the colour given the probability of getting the value or more extreme should be reasonable (basically pnorm(z))
I think that scale_colour_continuous(trans = probability_trans(distribution = 'norm')) should work, but it throws warnings.
So I defined a new transformation (see ?trans_new)
I have to define a transformation and an inverse
library(scales)
norm_trans <- function(){
trans_new('norm', function(x) pnorm(x), function(x) qnorm(x))
}
badVersion + geom_point() + scale_colour_continuous(trans = 'norm')
Using the supplied probability_trans throws a warning and doesn't seem to work
# this throws a warning
badVersion + geom_point() +
scale_colour_continuous(trans = probability_trans(distribution = 'norm'))
## Warning message:
## In qfun(x, ...) : NaNs produced
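To see why both a transformation and an inverse are needed, it may help to call them directly. The sketch below rebuilds the norm_trans defined above and pushes a few made-up z values through it; z, p and back are illustrative names.

```r
library(scales)

# Same definition as in the answer: pnorm() forward, qnorm() back
norm_trans <- function() {
  trans_new("norm",
            transform = function(x) pnorm(x),
            inverse   = function(x) qnorm(x))
}

tr <- norm_trans()
z  <- c(-2, -0.5, 0, 1, 2.5)

# Forward: data values are mapped to probabilities in (0, 1),
# which is the space the colour gradient is laid out in
p <- tr$transform(z)
print(p)

# Inverse: positions in that space are mapped back to the original
# units, which is what lets the legend be labelled in units of z
back <- tr$inverse(p)
print(back)
```

The forward step spreads out values near the centre of the distribution, and the inverse step is what keeps the legend readable in the original units.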

Resources