I am a new R user.
I am having a difficult time figuring out how to combine several bar plots into one graph.
For example,
Suppose the top five professions in China are government employees, CEOs, doctors, athletes, and artists, with incomes (in dollars) of 20,000, 17,000, 15,000, 14,000, and 13,000 respectively, while the top five professions in the US are doctors, athletes, artists, lawyers, and teachers, with incomes (in dollars) of 40,000, 35,000, 30,000, 25,000, and 20,000 respectively.
I want to show the differences in one graph.
How am I supposed to do that? Note that the two lists do not contain the same professions.
The answer to the question is fairly straightforward. As you are a new R user, I recommend you make liberal use of the 'ggplot2' package. For many R users, this one package is enough.
To get the "combined" barchart described in the original post, the answer is to put all of the data into one dataset and then add grouping variables, like so:
Step 1: Make the dataset.
data <- read.table(text="
Country,Profession,Income
China,Government employee,20000
China,CEO,17000
China,Doctor,15000
China,Athlete,14000
China,Artist,13000
USA,Doctor,40000
USA,Athlete,35000
USA,Artist,30000
USA,Lawyer,25000
USA,Teacher,20000", header=TRUE, sep=",")
You'll notice I'm using the 'read.table' function here. This is not required and is purely for readability in this example. The important part is that we have our values (Income) and our grouping variables (Country, Profession).
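Since 'read.table' is optional here, the same data frame could also be built directly with data.frame(). This is only a sketch of an equivalent construction, using the values from the question:

```r
# Same data as above, built with data.frame() instead of read.table().
data <- data.frame(
  Country    = rep(c("China", "USA"), each = 5),
  Profession = c("Government employee", "CEO", "Doctor", "Athlete", "Artist",
                 "Doctor", "Athlete", "Artist", "Lawyer", "Teacher"),
  Income     = c(20000, 17000, 15000, 14000, 13000,
                 40000, 35000, 30000, 25000, 20000)
)

# Inspect the structure: one row per Country/Profession pair.
str(data)
```

Either way, each row holds one value (Income) together with its two grouping variables.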
Step 2: Create a barchart with Income as the height of the bars, Profession as the x-axis, and color the bars by Country.
library(ggplot2)
ggplot(data, aes(x=Profession, y=Income, fill=Country)) +
  geom_bar(stat="identity", position="dodge") +
  theme(axis.text.x = element_text(angle = 90))
Here we are first loading the 'ggplot2' package. You may need to install this.
Then, we specify what data we want to use and how to separate it.
ggplot(data, aes(x=Profession, y=Income, fill=Country))
This tells 'ggplot' to use our dataset in the 'data' data frame. The aes() command specifies how 'ggplot' should read the data. We map the grouping variable Profession onto the x-axis, map the Income onto the y-axis, and change the color (fill) of each bar according to the grouping variable Country.
Next, we specify what kind of barchart we want.
geom_bar(stat="identity", position="dodge")
This tells 'ggplot' to make a barchart (geom_bar()). By default, the 'geom_bar' function counts the rows in each group and uses that count as the bar height (stat="count"), much like a histogram, but we already have the totals we want to use. We tell it to plot the Income values exactly as they are by setting stat="identity". Finally, I made a judgement call about how to display the data and decided to place the bars side by side when a single profession has multiple income values (position="dodge").
Finally, we need to rotate the x-axis labels, since some of them are quite long. We do this with a simple 'theme' command that changes the rotation of the x-axis text elements.
theme(axis.text.x = element_text(angle = 90))
We chain all of these commands together with the +, and it's done!
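As a side note, recent versions of 'ggplot2' also provide geom_col(), which plots the y values as given and so behaves like geom_bar(stat="identity"). A minimal sketch of the same chart, with the dataset rebuilt inline so the snippet runs on its own:

```r
library(ggplot2)

# Rebuild the Step 1 data so this snippet is self-contained.
data <- data.frame(
  Country    = rep(c("China", "USA"), each = 5),
  Profession = c("Government employee", "CEO", "Doctor", "Athlete", "Artist",
                 "Doctor", "Athlete", "Artist", "Lawyer", "Teacher"),
  Income     = c(20000, 17000, 15000, 14000, 13000,
                 40000, 35000, 30000, 25000, 20000)
)

# geom_col() uses Income directly as the bar height.
p <- ggplot(data, aes(x = Profession, y = Income, fill = Country)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(p)
```

Professions that appear in only one country get a single bar, and the three shared professions get two bars side by side.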
I'm trying to make a gene expression profile plot in R. My input data is a data frame where column 1 holds the gene names and columns 2 to 18 are different cancer types. Here is a small subset of the data.
What I want is a graph that has the samples on the x-axis and, along the y-axis, an expression line for each gene.
Something that looks like this.
But instead of time points on the x-axis, it should have cancer types (the columns).
So far I've tried ggplot() and geneprofiler(), but I failed over and over.
Any help will be greatly appreciated.
Data Format
The current format of the data is referred to as wide format, but ggplot requires long format data. The tidyr package (part of the tidyverse) has functions for converting between wide and long formats. In this case, you want the function tidyr::pivot_longer. For example, if you have the data in a data.frame (or tibble) called df_gene_expr, the pivot would go something like
library(tidyverse)
df_gene_expr %>%
  pivot_longer(cols=2:18, names_to="cancer_type", values_to="gene_expr") %>%
  filter(ID == "ABCA8") %>%
  ggplot(aes(x=cancer_type, y=gene_expr)) +
  geom_point()
where we single out the one gene "ABCA8". Change the geom_point() to whatever geometry you actually want (perhaps geom_bar(stat='identity')).
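To make the wide-to-long conversion concrete, here is a small sketch on made-up data. Only the gene name "ABCA8" comes from the example above. The other gene IDs, the cancer-type column names, and all values are invented for illustration, and the real data would have 17 cancer-type columns rather than three.

```r
library(tidyr)

# Made-up wide data: one row per gene, one column per cancer type.
# Column names and values are hypothetical.
df_gene_expr <- data.frame(
  ID   = c("ABCA8", "GENE2", "GENE3"),
  BRCA = c(-1.2, 0.4, 2.1),
  LUAD = c(-0.8, 0.1, 1.7),
  COAD = c(-1.5, 0.6, 1.9)
)

# Long format: one row per gene / cancer-type combination.
df_long <- pivot_longer(df_gene_expr, cols = -ID,
                        names_to = "cancer_type", values_to = "gene_expr")
df_long
```

The three genes and three cancer types become nine rows, each with its own gene_expr value, which is the shape ggplot expects.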
Discrete Trendline
I'm not sure that geom_smooth is entirely appropriate - it is designed with continuous-continuous data in mind. Instead, I'd recommend stat_summary.
There's a slight trick to this because of the discrete cancer_type on the x-axis. Namely, the cancer_type variable should be a factor, but we will use the underlying codes for the x-values in stat_summary. Otherwise, it would complain that using a geom='line' doesn't make sense.
Something along the lines:
ggplot(df_long, aes(x=cancer_type, y=gene_expr)) +
  geom_hline(yintercept=0, linetype=4, color="red") +
  geom_line(aes(group=ID), size=0.5, alpha=0.3, color="black") +
  stat_summary(aes(x=as.numeric(cancer_type)), fun=mean, geom='line',
               size=2, color='orange')
Output from Fake Data
Technically, this same trick (aes(x=as.numeric(cancer_type))) could be applied equally well to geom_smooth, but I think it still makes more sense to use stat_summary, which lets one explicitly pick the statistic to be computed. For example, the median might be a more appropriate summary function than the mean in this context.
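A sketch of the median variant on simulated data follows. The gene IDs, cancer-type labels, and values are all made up, and the fun= argument assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Simulated long-format data (hypothetical IDs, labels and values).
# expand.grid() returns factors, so cancer_type is already a factor.
set.seed(1)
df_long <- expand.grid(ID = paste0("gene", 1:20),
                       cancer_type = c("BRCA", "COAD", "LUAD"))
df_long$gene_expr <- rnorm(nrow(df_long))

p <- ggplot(df_long, aes(x = cancer_type, y = gene_expr)) +
  # One faint line per gene.
  geom_line(aes(group = ID), alpha = 0.3) +
  # Median per cancer type, drawn as a single line over the factor codes.
  stat_summary(aes(x = as.numeric(cancer_type)), fun = median,
               geom = "line", color = "orange")

print(p)
```

Swapping median for mean, or for any other function that returns a single number, changes only the fun= argument.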
First time asking a question on Stack Overflow
I am having trouble creating a bar plot which I hope to be able to filter based on certain fields. I am using some sensitive data, but I will try to be as clear as possible with the type of data I am using. The data is set up in a hierarchy of sorts, for example the data is as such:
Project -> Subproject -> Sub-SubProject, and a value for each month (jan, feb, march, etc), for a total of 15 columns.
Each row in the csv has a value for each element, so the Project column has a lot of repeated values since it's the top of the hierarchy, with subproject having a fair bit of repeated values as well since it's one level lower.
My goal is to create a bar chart that groups each unique value in the hierarchy, while having the months on the x axis, and values for the months on the y.
So all the values with the same Project, Subproject, will be grouped together, showing the months on the x axis.
I have tried using the ggplot2 library to group the values based on the hierarchy, but it doesn't look the best and it's aggregating the values rather than showing the unique value for each entry.
plot <- ggplot(data=data, aes(x= Sub-Project, y = January, fill = Sub-SubProject)) +
  geom_bar(stat="identity", position = "dodge") +
  facet_grid(~Project, scales = "free_x", space = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(aes(label=Capacity.1),hjust=0, vjust=0)
I want to avoid using fill, since I would like to conditionally set the color on my own, but that is a problem for another time. I was able to somewhat replicate what I am looking for on Tableau, but now the deliverable must be in R.
In general, I would like no aggregation, but a unique bar for each entry, grouped by the hierarchy I said above.
I've written something in R using ggplot2 and don't know why it behaves as it does.
If I plot my data using geom_point and geom_line, it is supposed to draw lines through those points. But instead of connecting all the points, it only connects those that are on a horizontal line. I don't know how to handle this.
This is a simple version of the code:
date<-c("2014-07-01","2014-07-02","2014-07-03",
"2014-07-04","2014-07-05","2014-07-06",
"2014-07-07")
mbR<- c(160,163,169,169,169,169,169)
mbL<- c(166,166,166,166,NA, NA, NA)
mb<-data.frame(mbR,mbL)
mb<-data.frame(t(as.Date(date)),mb)
colnames(mb)<-c("Datum","R","L")
mb$Datum<-date
plot1<-ggplot(mb,aes(x=mb$Datum,y=mb$R))+
geom_point(data=mb,aes(x=mb$Datum,y=mb$R,color="R",size=2),
group=mb$R,position="dodge")+
geom_line(data=mb,aes(y=mb$R,color="R",group=mb$R))+
geom_point(aes(y=mb$L,color="L",size=2),position="dodge")
plot1
I used group, otherwise I wouldn't have been able to draw any lines, still it doesn't do what I intended.
I hope you guys can help me out a little. :) It may be a minor fault.
First, melt your data to long format and then plot it. The column called variable in the melted data is the category (R or L). The column called value stores the data values for each instance of R and L. We group and color the data by variable in the call to ggplot, which gives us separate lines/points for R and L.
Also, you only need to provide the data frame and column mappings in the initial call to ggplot. They will carry through to geom_point and geom_line. Furthermore, when you provide the column names, you don't need to (and shouldn't) include the name of the data frame, because you've already specified the data frame in the data argument to ggplot.
library(reshape2)
mb.l = melt(mb, id.vars="Datum")
ggplot(data=mb.l, aes(x=Datum, y=value, group=variable, color=variable)) +
  geom_point(size=2) +
  geom_line()
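For reference, here is a self-contained sketch of the whole thing using the values from the question. One change that is not part of the answer above: Datum is stored as a Date rather than as text, so ggplot uses a date axis. The na.rm=TRUE arguments are likewise my addition, to silence the warnings about the missing L values.

```r
library(ggplot2)
library(reshape2)

# Data from the question, with Datum stored as a real Date.
mb <- data.frame(
  Datum = as.Date("2014-07-01") + 0:6,
  R = c(160, 163, 169, 169, 169, 169, 169),
  L = c(166, 166, 166, 166, NA, NA, NA)
)

# Long format: one row per date and series (R or L).
mb.l <- melt(mb, id.vars = "Datum")
head(mb.l)

p <- ggplot(mb.l, aes(x = Datum, y = value, group = variable, color = variable)) +
  geom_point(size = 2, na.rm = TRUE) +
  geom_line(na.rm = TRUE)

print(p)
```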
I'm trying to plot three different lines with ggplot2 that show the temporal series of three different variable (max, min and mean temperature). The problem comes when I want to set the color for each line/variable.
ggplot command:
plot.temp=ggplot(data,aes_string(x="date", y="Tmed",colour="Tmed")) + geom_line() +
geom_line(data=data,aes_string(x="date",y="Tmin",colour="Tmin")) +
geom_line(data=data,aes_string(x="date",y="Tmax",colour="Tmax"))
I don't use group as each row stands for one single time, not available categories here.
I have tried different options found in other posts, like scale_color_manual, but then a "Continuous value supplied to discrete scale" error message appears.
You can find data file at http://ubuntuone.com/45LqzkMHWYp7I0d47oOJ02 that can be easily read with data = read.csv(filename,header=T, sep=",",na.strings="-99.9")
I'd just like to set the color line manually but can't find the way.
Thanks in advance.
First, you need to convert date to a Date object, because at the moment it is treated as a factor. When date is a factor, each date value is treated as a separate group.
data$date<-as.Date(data$date)
Second, because you are using aes_string(), colour="Tmed" is interpreted as a colour that depends on the actual Tmed values. Use aes() instead, and put quotes only around the colour= value. There is also no need to repeat the data= argument in each geom_line(), because they all use the same data frame.
ggplot(data,aes(x=date, y=Tmed,colour="Tmed")) + geom_line() +
  geom_line(aes(y=Tmin,colour="Tmin")) +
  geom_line(aes(y=Tmax,colour="Tmax"))
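Once colour is mapped to the constant strings "Tmed", "Tmin" and "Tmax", the colour scale is discrete, so scale_colour_manual() should no longer raise the "Continuous value supplied to discrete scale" error. A sketch with made-up temperatures follows. The column names follow the question, while the values, the legend title, and the colour choices are arbitrary.

```r
library(ggplot2)

# Made-up data with the same column names as in the question.
data <- data.frame(date = as.Date("2013-01-01") + 0:9)
data$Tmed <- c(5, 6, 7, 6, 8, 9, 7, 6, 5, 4)
data$Tmin <- data$Tmed - 4
data$Tmax <- data$Tmed + 5

p <- ggplot(data, aes(x = date)) +
  geom_line(aes(y = Tmed, colour = "Tmed")) +
  geom_line(aes(y = Tmin, colour = "Tmin")) +
  geom_line(aes(y = Tmax, colour = "Tmax")) +
  # Named vector: legend label -> line colour.
  scale_colour_manual(name = "Variable",
                      values = c(Tmed = "black", Tmin = "blue", Tmax = "red"))

print(p)
```

The names in values= must match the strings used inside aes().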
Of course you can also melt your data and then you will need only one geom_line() call (but still you need to change date column).
library(reshape2)
data2<-melt(data[,1:4],id.vars="date")
ggplot(data2,aes(date,value,color=variable))+geom_line()
I'm having some trouble producing what I think should be a fairly straightforward ggplot2 graph.
I have some experimental data in a data frame. Each data entry is identified by the system that was being measured, and the instance (problem) it was run on. Each entry also has a value measured for the particular system and instance.
For instance:
mydata <- data.frame(System=c("a","b","a","b","a","b"), Instance=factor(c(1,1,2,2,3,3)), Value=c(10,5,4,2,7,8))
Now, I'd like to plot this data in a bar plot where the x-axis contains the instance identifier, and the color of the bar indicates which system the value is for. The bar heights should be weighted by the value in the data frame.
This almost does what I want:
qplot(data=mydata, weight=Value, Instance, fill=System, position="dodge")
The final thing that I would like to do is reorder the bars so they are sorted by the value of system A. However, I can't figure out an elegant way to do this.
My first instinct was to use qplot(data=mydata, weight=Value, reorder(Instance, Value), fill=System, position="dodge"), but this will order by the mean value for each instance, and I just want to use the value from A. I could use qplot(data=mydata, weight=Value, reorder(Instance, Value, function(x) { x[1] } ), fill=System, position="dodge") to order the instances by "the first value", but this is dangerous (what if the order changes?) and unclear to a reader.
What is a more elegant solution?
I'm sure there is a better way than this, but making Instance an ordered factor works, and would continue to work even if the data changes:
qplot(data=mydata, weight=Value,
      ordered(Instance,
              levels=mydata[System=='a','Instance'] [order(mydata[System=='a','Value'])]),
      fill=System, position="dodge")
Perhaps a slightly more elegant way of writing the same thing:
qplot(data=mydata, weight=Value,
      ordered(Instance,
              levels=Instance [System=='a'] [order(Value [System=='a'])]), # Corrected
      fill=System, position="dodge")
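Another option, which may be easier for a reader to follow, is to compute the level order up front and then plot with ggplot() directly. This is only a sketch of the same idea, not a different algorithm. The helper names sys_a and lvls are made up, and geom_col() replaces the weight= approach used with qplot.

```r
library(ggplot2)

mydata <- data.frame(System = c("a", "b", "a", "b", "a", "b"),
                     Instance = factor(c(1, 1, 2, 2, 3, 3)),
                     Value = c(10, 5, 4, 2, 7, 8))

# Order the instances by the value measured for system "a".
sys_a <- mydata[mydata$System == "a", ]
lvls  <- as.character(sys_a$Instance[order(sys_a$Value)])
mydata$Instance <- factor(mydata$Instance, levels = lvls)

# Bars are drawn in the order of the factor levels.
p <- ggplot(mydata, aes(x = Instance, y = Value, fill = System)) +
  geom_col(position = "dodge")

print(p)
```

Because the order is derived from the rows where System is "a", it does not depend on the row order of the data frame.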