box-plot not working with factor data - r

I'm trying to create a simple boxplot of some survey data.
Data
The data is survey data, and each row has a response recorded 1-5.
**Example Data**
Race= 2,2,3,2,5
Rating = 1,1,3,5,5
Converting to factors
df$Race = factor(df$Race)
df$Rating = factor(df$Rating)
Assigning each factor variable levels
levels(df$Race) = c("Asian/Pacific Islander", "White" , "American Indian/Eskimo", "Black/African American", "Other","NA")
levels(df$Rating) = c("Poor","Below Avg.","Neutral","Good","Excellent", "NA")
ggplot(df, aes(x=Race, y=Rating)) + geom_boxplot()
Using the full data I get a result like this.
Please let me know why this turns out funky. Also, how can I remove the NAs? I'm brand new to R, so if you see something else that I am doing wrong or poorly, please let me know! Thanks!
UPDATE
Using @jlhoward's code provided in the comments, I can generate the following, but it plots them all the same and does not plot "White".
ggplot(df, aes(x=Race, y=as.numeric(Rating))) + geom_boxplot() +scale_y_continuous(labels=df$Rating,breaks=as.integer(df$Rating))

If I understand correctly, you want the factor levels ("Poor", "Below Avg" etc.) to appear on the Y axis, but you actually want the "rating" boxplot to be computed with numerical values. Is that correct?
If that is the case, do not convert your "rating" variable into a factor before passing it to ggplot (leave it numeric), and then simply label the y axis according to your factor levels.
(A reproducible example would help answer the question more fully).
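A minimal sketch of that suggestion, using the five example rows from the question. It assumes the codes 1 to 5 map to the labels in the order the question lists them, which the question does not confirm. Note that assigning with levels(x) <- ... relabels the existing levels by position, so naming the codes explicitly through factor(levels = , labels = ) avoids mislabelling when some codes are absent from the data.

```r
library(ggplot2)

# Example rows from the question; the code-to-label mapping below is an assumption
df <- data.frame(
  Race   = c(2, 2, 3, 2, 5),
  Rating = c(1, 1, 3, 5, 5)
)

race_labels   <- c("Asian/Pacific Islander", "White", "American Indian/Eskimo",
                   "Black/African American", "Other")
rating_labels <- c("Poor", "Below Avg.", "Neutral", "Good", "Excellent")

# Map each code to its label explicitly; codes outside 1:5 become NA
df$Race <- factor(df$Race, levels = 1:5, labels = race_labels)

# Drop rows with a missing race or rating before plotting
df <- df[!is.na(df$Race) & !is.na(df$Rating), ]

# Rating stays numeric so the box plot statistics can be computed;
# only the axis labels use the rating names
p <- ggplot(df, aes(x = Race, y = Rating)) +
  geom_boxplot() +
  scale_y_continuous(breaks = 1:5, labels = rating_labels, limits = c(1, 5))
```

Printing p draws the plot. Races with no rows in the data are not drawn unless scale_x_discrete(drop = FALSE) is added.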

Related

How to plot a gg barplot for a single factor column?

My data frame has 621 rows and each column describes something about it. I'm trying to do an exploratory data analysis where I plot all the data as a bar plot.
I have a factor column called phenotype, which has 86 levels which describe the main condition in my cohort. I want to plot this out as 86 separate bar plots, each with the total number of people who have that condition on ggplot.
I've attached a screenshot of my data below. I basically want the x axis to show the condition name, like 'Bardet-Biedl Syndrome', 'Classic Ehlers Danlos Syndrome' etc., and the y axis to show the number of people who have that condition, such as 3, 4, 5, as displayed below. I got the below data by doing
table(data.frame$Phenotype)
I'm using the below code to generate my ggplot
ggplot(tiering, aes(x = Phenotype, y = count(tiering$Phenotype))) +
  theme_bw() +
  geom_bar(stat = "identity")
I'm sure the answer is out there, but I've looked on the R help websites and I can't seem to figure this out, so would be very grateful for the help.
EDIT: I got to a barplot with the help of the below code. I'm now trying to reorder the bars/columns in decreasing order and tried this method, but it hasn't worked. Would anyone have any suggestions?
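One possible approach, sketched on a made-up stand-in for the tiering data frame (the phenotype names and counts here are hypothetical): tabulate the factor first, then plot the counts with geom_col() and order the bars with reorder().

```r
library(ggplot2)

# Hypothetical stand-in for `tiering`; the real data has 621 rows and 86 levels
tiering <- data.frame(
  Phenotype = factor(c("A", "B", "B", "C", "C", "C"))
)

# One row per phenotype with its count in the column `Freq`
counts <- as.data.frame(table(Phenotype = tiering$Phenotype))

# reorder(Phenotype, -Freq) sorts the bars from most to least frequent
p <- ggplot(counts, aes(x = reorder(Phenotype, -Freq), y = Freq)) +
  geom_col() +
  theme_bw() +
  labs(x = "Phenotype", y = "Number of people") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

With 86 levels the rotated axis text will probably still be crowded, so adding coord_flip() to get horizontal bars may read better.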

Ordering Facets in a plot based on a column in the dataset

So, I have a dataset which looks like this.
I'm tasked with creating a smooth faceted visualization that shows each coral's bleaching rate at each site, which I've done like this:
(I FULLY realize that this code might be bad and have some mistakes in it and I'd really appreciate it if people could tell me ways to improve it or correct some grave errors in it).
coral_data <- read.csv("file.csv")
#options(warn=-1)
library(ggplot2)
ggplot(coral_data, aes(x=year, y=value, colour=coralType, group=coralType)) +
  geom_smooth(method="lm", se=FALSE) +
  scale_x_continuous(name="Year", breaks=c(2010, 2013, 2016)) +
  # value is numeric, so the y scale must be continuous, not discrete
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  facet_grid(coralType ~ location, scales="free") +
  expand_limits(y=0) +
  # x is year and y is the bleaching rate (the labels were swapped)
  labs(x="\nYear", y="Bleaching Rate", title="Coral Bleaching for different corals at different sites over the years\n")
But I also have to order the facets by latitude (currently the order is site01, site02, etc., but I want the faceted sites to be ordered by their latitude values, ascending or descending), and sadly I have no idea how to do that.
Thus, could someone please tell me how to go about doing this?
Consider ordering your data frame by latitude, then re-assigning the location factor variable with its levels defined in the new order using unique:
# ORDER DATA FRAME BY ASCENDING LATITUDE
coral_data <- with(coral_data, coral_data[order(latitude),])
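# OR: ORDER DATA FRAME BY DESCENDING LATITUDE
# (order(rev(latitude)) does not sort in descending order; use decreasing = TRUE)
coral_data <- with(coral_data, coral_data[order(latitude, decreasing = TRUE),])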
# ASSIGN site AS FACTOR WITH DEFINED LEVELS
coral_data$location <- with(coral_data, factor(as.character(location), levels = unique(location)))
ggplot(coral_data, ...)
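To see the effect of the level order, here is a small sketch on made-up data. The column names follow the question, but the sites and latitude values are hypothetical.

```r
# Hypothetical toy data: three sites, two measurements each
coral_data <- data.frame(
  location = c("site01", "site02", "site03", "site01", "site02", "site03"),
  latitude = c(-10.5, -23.1, -16.2, -10.5, -23.1, -16.2),
  value    = c(30, 45, 50, 35, 40, 60)
)

# Sort rows by ascending latitude, then fix the level order to match
asc <- coral_data[order(coral_data$latitude), ]
asc$location <- factor(as.character(asc$location),
                       levels = unique(as.character(asc$location)))
levels(asc$location)
# "site02" "site03" "site01"

# Same idea for descending latitude
desc <- coral_data[order(coral_data$latitude, decreasing = TRUE), ]
desc$location <- factor(as.character(desc$location),
                        levels = unique(as.character(desc$location)))
levels(desc$location)
# "site01" "site03" "site02"
```

facet_grid() lays out panels in factor level order, so the facets follow whichever ordering was applied. As an alternative to sorting the rows, reorder(location, latitude) sets the levels directly.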

ggplot - how to make errorbar with respect to gender

A few words at the beginning: I have just started my journey with R, and after my initial experience I am really keen on learning more! But I've encountered a huge problem, and searching on Google doesn't seem to help. Maybe some good soul here could guide me with your wisdom :)
So, I've been trying to make an error bar plot in R using ggplot. The problem is that the y axis holds a continuous variable (marital satisfaction), the x axis is a factor (with three levels), and I also want to add gender to the plot (what is more, all of it has to be black and white, which makes it doubly hard).
What I want to do is to show the means and standard deviation of marital satisfaction in three different religions (Christian, Muslim, atheistic) with respect to gender (male, female). Do you have any idea how to do it?
Thanks in advance! <3
I've already tried making a boxplot with my data, but such a plot doesn't provide any useful information, and after googling I think an error bar plot would fit the data better.
Here is what it looks like:
# assign the recoded factors back to the data frame so the labels are kept
ds$Religion <- factor(ds$Religion, levels = c(2, 4, 6), labels = c("Christian", "Muslim", "atheistic"))
ds$Sex <- factor(ds$Sex, levels = c(0, 1), labels = c("Male", "Female"))
obj1 <- ggplot(data=ds, aes(y=Marital, x=factor(Religion), fill=factor(Sex))) + geom_boxplot()
obj1+labs(x="Religious affiliation", y="Marital satisfaction", fill="Sex") -> obj2
obj2 + scale_x_discrete(labels = c('Christian','Muslim','atheistic')) -> obj3
obj3 + scale_fill_discrete(name = "Sex", labels = c("Male", "Female")) -> obj4
obj4 + scale_y_continuous(expand = c(0, 1)) -> obj5
I copy pasted my data here:
https://textsaver.flap.tv/lists/2l4k
The simple answer to this question is that you should create the summary statistics beforehand. After that picking the visualization becomes trivial.
Assuming you are interested in using a tidyverse solution I would proceed as follows:
library(tidyverse)
ds %>%
group_by(Religion, Sex) %>%
summarize(meanVal = mean(Marital),
sdVal = sd(Marital)) -> ds.summarized
ds.summarized will have one row for each Religion-Sex combination and provides the mean and the sd of marital satisfaction in that group. You can then plot it using geom_errorbar with the aesthetics y=meanVal, ymin=meanVal - sdVal and ymax=meanVal + sdVal. One final remark: you may want to use standard errors instead of standard deviations.
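The plotting step could look roughly like this. The data frame below is a made-up stand-in for ds (the real data is behind the link in the question), and using shape and linetype rather than colour is just one way to keep the plot black and white.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same column names as in the question
ds <- data.frame(
  Religion = factor(rep(c("Christian", "Muslim", "atheistic"), each = 4)),
  Sex      = factor(rep(c("Male", "Female"), times = 6)),
  Marital  = c(5, 6, 7, 8, 4, 5, 6, 7, 3, 5, 4, 6)
)

# Mean and standard deviation per Religion-Sex group
ds.summarized <- ds %>%
  group_by(Religion, Sex) %>%
  summarize(meanVal = mean(Marital), sdVal = sd(Marital), .groups = "drop")

# Shift the two sexes sideways so their bars do not overlap
dodge <- position_dodge(width = 0.4)

p <- ggplot(ds.summarized,
            aes(x = Religion, y = meanVal, shape = Sex, linetype = Sex)) +
  geom_errorbar(aes(ymin = meanVal - sdVal, ymax = meanVal + sdVal),
                width = 0.2, position = dodge) +
  geom_point(size = 3, position = dodge) +
  labs(x = "Religious affiliation", y = "Marital satisfaction") +
  theme_bw()
```

To show standard errors instead, sdVal could be replaced by sd(Marital) / sqrt(n()) in the summarize() call.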

ggplot2 in R: Calculate percentage and make a graph that might be a geom_area plot

I'm a beginner in R, so please be patient with me if there are very obvious mistakes in my code and for my question! For a homework problem, I am struggling to make what I think is a geom_area plot look like this:
As background, we are using the diamonds dataframe from ggplot2 library. We were given the plot and asked to reproduce it. My biggest problem is with the y-axis. The graph given indicated that the y-axis represents density, which I think is the percentage/proportion of each clarity grade given the title. Originally, I thought perhaps I needed to create a new dataframe with "Price" and "Clarity Proportion (or, density)", but I wasn't sure how to do that. The professor hinted that we should not need to create a new variable for this problem.
Here's what I have so far. It produces the error message: "In Ops.ordered(left, right): '/' is not meaningful for ordered factors":
set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #these were given in the homework
d + geom_area(aes(x = price, y = lapply(count(diamonds$clarity), FUN = count(diamonds$clarity)/53940), colour = clarity), position = "fill") +
labs(title = "Clarity Proportion by Price")
I know my y-axis is wrong, but I'm just not sure how to transform it. Your explanation and insight are greatly appreciated!
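One possible reading, offered as a guess since the target plot is not reproduced here: a density estimate of price per clarity grade, stacked and rescaled so the grades fill the full height at every price. That needs no new variable, because geom_density() computes the y values itself.

```r
library(ggplot2)

set.seed(123)
# Same 5000-row sample as in the homework setup
sampled <- diamonds[sample(nrow(diamonds), 5000), ]
d <- ggplot(sampled)

# position = "fill" rescales the stacked densities to sum to 1 at each price
p <- d +
  geom_density(aes(x = price, fill = clarity), position = "fill") +
  labs(title = "Clarity Proportion by Price", y = "density")
```

With the default y, each grade's density is normalised on its own. Mapping y = after_stat(count) inside aes() would instead weight each grade by how many diamonds it has, which may or may not match the plot that was given.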

Avoid overlapping x-axis labels with ggplot? [duplicate]

I'm having some trouble with qplot in R. I am trying to plot data from a data frame. When I execute the command below the plot gets bunched up on the left side (see the image below). The data frame only has 963 rows so I don't think size is the issue, but I can use the same command on a smaller data frame and it looks fine. Any ideas?
library(ggplot2)
qplot(x=variable,
y=value,
data=data,
color=Classification,
main="Average MapQ Scores")
Or similarly:
ggplot(data = data, aes(x = variable, y = value, color = Classification)) +
  geom_point()
Your column value is likely a factor, when it should be a numeric. This causes each categorical value of value to be given its own entry on the y-axis, thus producing the effect you've noticed.
You should coerce it to numeric:
data$value <- as.numeric(as.character(data$value))
Note that there is probably a good reason it has been interpreted as a factor and not a numeric, possibly because it has some entries that are not pure numeric values (maybe 1,000 or 1000 m or some other character entry among the numbers). The consequence of the coercion may be a loss of information, so be warned or cleanse the data thoroughly.
Also, you appear to have the same problem on the x-axis.
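A small sketch of why the as.character() step matters and how to find the entries that get lost in the coercion. The values here are made up for illustration.

```r
# Hypothetical column that was read in as a factor because of one bad entry
value <- factor(c("10", "25", "1,000", "7"))

# Converting the factor directly returns the internal level codes,
# not the numbers that were typed in
codes <- as.numeric(value)

# Going through character keeps the real values; entries that are not
# plain numbers become NA (R also emits a warning, suppressed here)
converted <- suppressWarnings(as.numeric(as.character(value)))

# Inspect the entries that could not be converted before deciding how to clean them
bad_entries <- as.character(value[is.na(converted)])

# One possible cleanup for this particular case: strip thousands separators first
cleaned <- as.numeric(gsub(",", "", as.character(value)))
```

Here bad_entries contains "1,000", and cleaned holds 10, 25, 1000 and 7.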
