ggplot2 stacked bar graph using rows as datapoints [duplicate] - r

This question already has an answer here:
Grouping & Visualizing cumulative features in R
(1 answer)
Closed 6 years ago.
I have a set of data that I would like to plot like this:
This was plotted using LibreOffice Calc in Ubuntu. I have tried to do this in R using the following code:
ggplot(DATA, aes(x="Samples", y="Count", fill=factor(Sample1)))+geom_bar(stat="identity")
This does not give me a stacked bar graph for each sample, but rather one single graph. I have had a similar question, that used a different dataframe, that was answered here. However, in this problem I don't have just one sample, but information for at least three. In LibreOffice Calc or Excel I can choose the stacked bar graph option and then choose to use rows as the data series. How can I achieve a similar graph in ggplot2?
Here is the dataframe/object for which I am trying to produce the graph:
Aminoacid Sequence,Sample1,Sample2,Sample3
Sequence 1,16,10,33
Sequence 2,2,2,7
Sequence 3,1,1,6
Sequence 4,4,1,1
Sequence 5,1,2,4
Sequence 6,4,3,14
Sequence 7,2,2,2
Sequence 8,8,5,12
Sequence 9,1,3,17
Sequence 10,7,1,4
Sequence 11,1,1,1
Sequence 12,1,1,2
Sequence 13,1,1,1
Sequence 14,1,2,2
Sequence 15,5,4,7
Sequence 16,3,1,8
Sequence 17,7,5,20
Sequence 18,3,3,21
Sequence 19,2,1,5
Sequence 20,1,1,1
Sequence 21,2,2,5
Sequence 22,1,1,3
Sequence 23,4,2,9
Sequence 24,2,1,1
Sequence 25,4,4,3
Sequence 26,4,1,3
I copied the content of a .csv file, is that reproducible enough? It worked for me to just use read.csv(.file) in R.
Edit:
Thank you for redirecting me to another post with a very similar problem, I did not find that before. That post brought me a lot closer to the solution. I had to change the code just a little to fit my problem, but here is the solution:
library(reshape2)
library(ggplot2)
df <- read.csv("example.csv")
df2 <- melt(df, id="Aminoacid.Sequence")
ggplot(df2, aes(x=variable, y=value, fill=Aminoacid.Sequence))+geom_bar(stat="identity")
Using variable on the x-axis makes one bar for each sample (Sample1-Sample3 in the example). Using y=value puts the value in each cell for that sample on the y-axis. And most importantly, using fill=Aminoacid.Sequence stacks the values for each sequence on top of each other, giving me the same graph as seen in the screenshot above!
Thank you for your help!

Try something along the following lines:
library(reshape2)
df <- melt(DATA) # you probably need to adjust the id.vars here...
ggplot(df, aes(x=variable, y=value)) + geom_bar(stat="identity")
Note that you need to adjust the ggplot and the melt code somewhat, but since you haven't provided sample data, no one can provide the actual code necessary. The above provides the basic approach on how to deal with these multiple columns representing your samples, though. melt will "stack" the columns on top of each other, and create a column with the old variable name. This you can then use as x for ggplot.
Note that if you have other data in the data frame as well, melt will also stack these. For that reason you will need to adjust the commands to fit your data.
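As a minimal sketch of that adjustment, the block below uses the first three rows of the question's table plus a hypothetical extra column, Note, that should not be stacked. Naming id.vars and measure.vars explicitly tells melt which columns identify a row and which ones hold the values.

```r
library(reshape2)

# First three rows of the question's table; Note is a hypothetical extra column
DATA <- data.frame(
  Aminoacid.Sequence = c("Sequence 1", "Sequence 2", "Sequence 3"),
  Sample1 = c(16, 2, 1),
  Sample2 = c(10, 2, 1),
  Sample3 = c(33, 7, 6),
  Note    = c("a", "b", "c")
)

# id.vars stays as-is; only the columns named in measure.vars are stacked.
# Columns in neither list (here: Note) are left out of the result.
long <- melt(DATA,
             id.vars      = "Aminoacid.Sequence",
             measure.vars = c("Sample1", "Sample2", "Sample3"))

head(long)
```

The result has one row per sequence and sample, with the old column name in variable and the cell content in value.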
Edit: using your data:
library(reshape2)
library(ggplot2)
### reading your data:
# df <- read.table(file="clipboard", header=T, sep=",")
df2 <- melt(df)
head(df2)
  Aminoacid.Sequence variable value
1         Sequence 1   preDLI    16
2         Sequence 2   preDLI     2
3         Sequence 3   preDLI     1
4         Sequence 4   preDLI     4
5         Sequence 5   preDLI     1
6         Sequence 6   preDLI     4
This can be used as in:
ggplot(df2, aes(x=variable, y=value, fill=Aminoacid.Sequence)) + geom_bar(stat="identity")
I am sure you want to change some details about the graph, such as the colors, but this should answer your initial question.

Related

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA player height vs draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
y = height_in,
group = cut(x = draft_year, breaks = 5))) +
geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my boxgraph. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being numericized--it's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input--check your data frame to make sure the "Undrafted" rows are really there, they might have gotten dropped or converted to NA at some point. But that's a good thing for cut, because cut will only work on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replace NAs in the bin column with "Undrafted".
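A rough sketch of that suggestion in base R follows. The four rows are made-up stand-ins for nba_data, and it assumes draft_year is stored as text with the literal value "Undrafted" for undrafted players; the column name draft_bin is also an assumption.

```r
# Hypothetical stand-in for nba_data; draft_year is text because of "Undrafted"
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78),
  stringsAsFactors = FALSE
)

# Non-numeric entries become NA; the coercion warning is expected here
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# A vector of break points gives 5-year bins; right = FALSE makes them
# closed on the left, e.g. [1995,2000)
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5),
            right = FALSE, dig.lab = 4)

# Convert to character so the NA entries can be given their own label
nba_data$draft_bin <- as.character(bins)
nba_data$draft_bin[is.na(nba_data$draft_bin)] <- "Undrafted"

nba_data
```

The new draft_bin column can then be mapped directly to the x-axis, for example aes(x = draft_bin, y = height_in), so each bin gets its own box without a group aesthetic.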
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))

Changing plotting order of points in R / ggplot2

I have the following code to plot a large dataset (450k) in ggplot2
x<-ggplot()+
geom_point(data=data_Male,aes(x=a,y=b),color="Turquoise",position=position_jitter(w=0.2,h=1),alpha=0.1,size=.5,show.legend=TRUE)+
geom_point(data=data_Female,aes(x=a,y=b),color="#FF9999",position=position_jitter(w=0.2,h=1),alpha=0.1,size=.5,show.legend=TRUE)+
theme_bw()
x<-x+geom_smooth(data=data_Male,aes(x=a,y=b,alpha="Male"),method="lm",colour="Blue",linetype=1,se=T)+
geom_smooth(data=data_Female,aes(x=a,y=b,alpha="Female"),method="lm",colour="Dark Red",linetype=5,se=T)+
geom_smooth(data=data_All,aes(x=a,y=b,alpha="All"),method="lm",colour="Black",linetype=3,se=T)+
scale_fill_discrete(name="Key",labels=c("Female","Male","All"))+
scale_colour_discrete(name="Plot Colour",labels=c("Female","Male","All"))+
scale_alpha_manual(name="Key",
values=c(1,1,1),
breaks=c("Female","Male","All"),
guide=guide_legend(override.aes=list(linetype=c(5,1,3),name="Key",
shape=c(16,16,NA),
color=c("Dark Red","Blue","Black"),
fill=c("#FF9999","Turquoise",NA))))
How can I change the order in which points are plotted? I have seen answered questions here dealing with a single dataframe but I am working with several dataframes so I cannot re-order the rows or ask ggplot to plot by certain criteria from within the dataframe. You can see an example of the kind of problem that this causes in the attached picture: the Female points are plotted on top of the Male points. Ideally I would like to be able to plot all the points in a random order, so that one "cloud" of points is not plotted on top of the other, obscuring it (N.B. the image shown doesn't include the "All" line).
Any help would be appreciated. Thank you.
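I believe this is not possible with separate data frames, because each geom_point layer is drawn completely before the next one. The following should work though:
You'd have to combine the two data frames into one, df, with a column that records which group each row came from. The new data frame will be sorted by male and female.
You can then shuffle the rows of the new data frame: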
set.seed(42)
rows <- sample(nrow(df))
male_female_mixed <- df[rows, ]
Then you can plot male_female_mixed
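Put together, the approach might look like the sketch below. The random data is a hypothetical stand-in for data_Male and data_Female, and the sex column is a name introduced here to record the group; the colours are taken from the question.

```r
library(ggplot2)

# Hypothetical stand-ins for the question's data_Male and data_Female
set.seed(42)
data_Male   <- data.frame(a = rnorm(100), b = rnorm(100))
data_Female <- data.frame(a = rnorm(100), b = rnorm(100))

# Record the group in a column (the name "sex" is an assumption), then stack
data_Male$sex   <- "Male"
data_Female$sex <- "Female"
df <- rbind(data_Male, data_Female)

# Shuffle the rows so neither group is systematically drawn last
male_female_mixed <- df[sample(nrow(df)), ]

# A single point layer draws rows in data order, so the groups interleave
p <- ggplot(male_female_mixed, aes(x = a, y = b, colour = sex)) +
  geom_point(alpha = 0.1, size = 0.5,
             position = position_jitter(width = 0.2, height = 1)) +
  scale_colour_manual(values = c(Male = "turquoise", Female = "#FF9999")) +
  theme_bw()
print(p)
```

Mapping colour to the sex column also produces the legend automatically, which replaces the manual alpha-scale workaround for the points.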

How to Plot line graph in R with the following Data

I want a line graph of around 145 data observations using R, the format of data is as below
Date Total Confirmed Total Deceased
3-Mar 6 0
4-Mar 28 0
5-Mar 30 5
.
.
.
141 more obs like this
I'm new to ggplot2 in R, so I don't know how to get the graph. I tried plotting it, but the dates on the x-axis overlapped and were not readable. I want a line graph of the Total Confirmed column and the Total Deceased column together, with dates on the x-axis. Please also tell me how to colour the lines; I want a colourful graph. Thank you so much.
Similar questions like this give a lot of errors, so I would like an answer for my specific requirements.
There are a lot of resources to help you create what you are looking to do - and even quite a few questions already answered here. However, I understand it's tough starting out, so here's a quick example to get you started.
Sample Data:
df <- data.frame(
dates=c('2020-01-01','2020-02-01','2020-03-03','2020-03-14','2020-04-01'),
var1=c(13,15,18,29,40),
var2=c(5,8,11,13,18)
)
If you are plotting by date on your x axis, you need to ensure that df$dates is formatted as a "Date" class (or one of the other date-like classes). You can do that via:
df$dates <- as.Date(df$dates, format='%Y-%m-%d')
The format= argument of as.Date() should follow the conventions indicated in strptime(). Just type ?strptime in your console and you can see in the help for that function how the various terms are defined for format=.
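To illustrate how format= behaves, here is a small base R sketch. The "3-Mar" strings mirror the layout in the question; since they carry no year, 2020 is appended as an assumption, and %b assumes an English locale for month abbreviations.

```r
# ISO-style string, as in the sample data
iso <- as.Date("2020-03-14", format = "%Y-%m-%d")

# Day-month strings like "3-Mar" have no year, so one is appended.
# The year 2020 is an assumption; %b depends on the current locale.
short  <- c("3-Mar", "4-Mar", "5-Mar")
parsed <- as.Date(paste(short, "2020", sep = "-"), format = "%d-%b-%Y")

# A string that does not match the format becomes NA instead of an error
bad <- as.Date("14/03/2020", format = "%Y-%m-%d")

class(iso)
parsed
is.na(bad)
```

Checking for NA values after the conversion is a quick way to spot a format string that does not match the data.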
The next step is very important: recognize that the data is in "wide" format, not "long" format. ggplot2 and the related packages work best with data in what is known as Tidy Data format. In your data, the measure is a number of people, and the type of measure is either cases or deaths. So "number of people" is spread over two columns, and the "type of measure" is stuck in the column names when it should be a variable in the dataset. Your goal should be to gather() those two columns together and create two new columns: (1) one to indicate if the number is "cases" or "deaths", and (2) the number itself. In the example I've shown you can do this via:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key='var_name', value='number', -dates)
The result is that the data frame has columns for:
dates: unchanged
var_name: contains either var1 or var2 as a character class
number: the actual number
Finally, for the plot, the code is quite simple. You apply dates to the x aesthetic, number to y, and use var_name to differentiate color for the line geom:
ggplot(df, aes(x=dates, y=number)) +
geom_line(aes(color=var_name))
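The same steps can also be sketched with pivot_longer(), which newer versions of tidyr recommend over gather(), together with two additions for the points raised in the question: scale_x_date() with rotated labels to keep the dates readable, and scale_colour_manual() to pick the line colours. The colour names and the break interval are arbitrary choices, not something the data requires.

```r
library(tidyr)
library(ggplot2)

# Same sample data as above, with dates already converted to Date
df <- data.frame(
  dates = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03",
                    "2020-03-14", "2020-04-01")),
  var1 = c(13, 15, 18, 29, 40),
  var2 = c(5, 8, 11, 13, 18)
)

# Wide to long: every column except dates is stacked into var_name/number
long <- pivot_longer(df, cols = -dates,
                     names_to = "var_name", values_to = "number")

p <- ggplot(long, aes(x = dates, y = number, colour = var_name)) +
  geom_line() +
  # Arbitrary colours, one per series
  scale_colour_manual(values = c(var1 = "steelblue", var2 = "firebrick")) +
  # Fewer, shorter date labels, rotated so they do not overlap
  scale_x_date(date_breaks = "1 month", date_labels = "%d %b") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

For around 145 daily observations, a wider interval such as date_breaks = "2 weeks" would probably be easier to read.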

R:ggplot2 get both columns on one plot [duplicate]

This question already has answers here:
Plotting two variables as lines using ggplot2 on the same graph
(5 answers)
Closed 8 years ago.
I have a csv file with many columns; the first column is user_id. The other columns are related to different actions that the user has taken. I want to plot two columns from this csv file on one ggplot with lines.
userid Action1TakenTimes Action2TakenTimes
1 0 4
2 6 4
3 0 1
4 8 23
5 4 3
6 1 1
I have converted the csv file to an R data table and did the simple plots below, but I want to do a ggplot with smooth lines connecting the points.
plot(log(mytable.data$Action1TakenTimes))
plot(log(mytable.data$Action2TakenTimes))
I went over following tutorial but couldn't find a similar example:
http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf
Like this?
library(ggplot2)
library(reshape2)
gg <- melt(mytable.data,id="userid")
ggplot(gg,aes(x=userid,y=log(value),color=variable))+geom_line()
ggplot expects the data in so-called "long" format, with all the values in one column, and a second column which distinguishes the different groups. Your data is in "wide" format, with the different groups in different columns. To convert, use melt(...) in the reshape2 package.
This is a very common pattern with ggplot.
One problem with your data is that you're taking log(0), which produces -Inf. Smoothing is meaningless in that situation. If there were no infinities you could add +stat_smooth() to the end of the ggplot(...) line to generate a loess smoothed curve.
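One possible way around the infinities, sketched below, is to plot log1p(value), that is log(1 + value), which is defined at zero. This is a substitution for illustration and changes the scale slightly; it is not what the answer above used. The data frame reproduces the six rows from the question.

```r
library(ggplot2)
library(reshape2)

# The six rows shown in the question
mytable.data <- data.frame(
  userid = 1:6,
  Action1TakenTimes = c(0, 6, 0, 8, 4, 1),
  Action2TakenTimes = c(4, 4, 1, 23, 3, 1)
)

gg <- melt(mytable.data, id = "userid")

# log1p(0) is 0, so no -Inf values reach the smoother
p <- ggplot(gg, aes(x = userid, y = log1p(value), colour = variable)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE)
print(p)
```

With only six points per group the loess fit may emit warnings; on the full dataset that should be less of an issue.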

How do I do a Barplot of already tabled data?

I have input data with very many observations per day.
I can make the barplot I want with 1 day by using table().
If I try to load more than 1 day and call table() I run out of memory.
I'm trying to table() each day and concatenate the tables into totals I can then barplot later. But I just cannot work out how to take the already tabled data and barplot each day as a stacked column.
After looping and consolidating I end up with something like this: 2 days of observations. (the Freq column is the default from the previous table() calls)
What is the best way to do a stacked barplot when my data ends up like this?
> data.frame(CLIENT=c("Mr Fluffy","Peppa Pig","Mr Fluffy","Dr Who"), Freq=c(18414000,9000000,7000000,15000000), DAY=c("2011-11-03","2011-11-03","2011-11-04","2011-11-04"))
CLIENT Freq DAY
1 Mr Fluffy 18414000 2011-11-03
2 Peppa Pig 9000000 2011-11-03
3 Mr Fluffy 7000000 2011-11-04
4 Dr Who 15000000 2011-11-04
>
> # What should I put here?
I'm assuming that you are using base graphics since you mention barplot. Here is an approach using that:
wide <- reshape(dat, idvar="CLIENT", timevar="DAY", direction="wide")
wide[is.na(wide)] <- 0  # client/day pairs with no row come out as NA, which breaks stacking
barplot(as.matrix(wide[-1]), beside=FALSE)
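Another base R option, shown here as a sketch, is xtabs(), which builds the CLIENT by DAY matrix from the Freq column directly and fills client/day pairs that have no row with 0. The name dat for the question's data frame is an assumption, matching the code above.

```r
# The data frame from the question, stored under the assumed name dat
dat <- data.frame(
  CLIENT = c("Mr Fluffy", "Peppa Pig", "Mr Fluffy", "Dr Who"),
  Freq   = c(18414000, 9000000, 7000000, 15000000),
  DAY    = c("2011-11-03", "2011-11-03", "2011-11-04", "2011-11-04")
)

# Sum Freq for every CLIENT/DAY pair; missing pairs become 0
tab <- xtabs(Freq ~ CLIENT + DAY, data = dat)

# One stacked column per day, one segment per client
barplot(tab, beside = FALSE, legend.text = TRUE, ylab = "Freq")
```

Because xtabs() sums Freq, this also works if the consolidated data still contains several rows for the same client and day.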
Alternatively, using ggplot2:
library("ggplot2")
ggplot(dat, aes(x=DAY, y=Freq)) +
geom_bar(aes(fill=CLIENT), stat="identity", position="stack")
Try ggplot2:
ggplot(df,aes(DAY,fill=CLIENT,weight=Freq))+geom_bar()
Shamelessly ripped from here:
http://had.co.nz/ggplot2/geom_bar.html
