R - change X axis values to categories

I am trying to draw a scatter dot plot for this data
head(data)
Subject Length Verdict
1 2 4575 Partial
2 2 5060 Partial
3 2 8978 5'DEFECT
4 2 7224 Partial
5 2 7224 Partial
6 7 8978 5'DEFECT
I get a scatter dot plot as such:
I have patients 1,2,6,7,10 for example. R is taking the names of my subjects and using them as an x-value. I want to change that so the data points appear above each patient (not treated as a value, but rather as a category).
Appreciate your help!
Here's the code I wrote to get this scatter dot plot:
ggplot(final,
aes(x=Subject,y=Length,colour=Verdict,shape=Verdict), group=Subject) +
geom_point(position=position_jitter(width=0.1,height=0)) +
scale_shape_manual(values=c(5,0,1,4,6)) +
scale_colour_manual(values=c("blue","red","green","black","violet")) +
scale_y_continuous(breaks=c(2000,4000,6000,8000,10000)) +
labs(y="Amplicon Size in bps")

Using the sample data you posted (the 6 observations), you can do this easily with as.factor or as.character. I recommend converting your Subject variable to a character first (since it sounds like you don't want it to be treated as numeric anyway).
This should work:
data$Subject <- as.character(data$Subject)
ggplot(data,
aes(x=Subject,y=Length,colour=Verdict,shape=Verdict), group=Subject) +
geom_point(position=position_jitter(width=0.1,height=0)) +
scale_shape_manual(values=c(5,0,1,4,6)) +
scale_colour_manual(values=c("blue","red","green","black","violet")) +
scale_y_continuous(breaks=c(2000,4000,6000,8000,10000)) +
labs(y="Amplicon Size in bps")
Data:
data <- structure(list(Subject = c(2L, 2L, 2L, 2L, 2L, 7L), Length =
c(4575L, 5060L, 8978L, 7224L, 7224L, 8978L), Verdict = c("Partial", "Partial",
"5'DEFECT", "Partial", "Partial", "5'DEFECT")), .Names = c("Subject",
"Length", "Verdict"), row.names = c(NA, -6L), class = "data.frame")

You could use as.factor(data$Subject) and scale_x_discrete("Subject"). Here is a dummy example:
data <- read.table(textConnection('i Subject Length Verdict
1 2 4575 Partial
2 2 5060 Partial
3 2 8978 5DEFECT
4 2 7224 Partial
5 2 7224 Partial
6 7 8978 5DEFECT'), header = TRUE, stringsAsFactors = FALSE)
data$Subject <- as.factor(data$Subject)
p = ggplot(data) + geom_point(aes(x=Subject,y=Length, colour=Verdict))+ scale_shape_manual(values= c(5,0,1,4,6))
p = p + scale_colour_manual(values=c("blue","red","green","black","violet"))
p = p+ scale_x_discrete("Subject")
p = p + scale_y_continuous(breaks=c(2000,4000,6000,8000,10000))+labs(y="Amplicon Size in bps")
p
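One practical difference between as.character and as.factor is the ordering of the x axis. The sketch below uses a made-up subject vector (the IDs 1, 2, 6, 7, 10 mentioned in the question, not the real data) to show that character values sort alphabetically, while a factor with explicit levels keeps numeric order.

```r
# Hypothetical subject IDs, chosen to match the patients named in the question
subject <- c(2L, 2L, 10L, 7L, 1L, 6L)

# As character, a discrete axis sorts the labels as text ("10" before "2")
as_char <- as.character(subject)
char_order <- sort(unique(as_char))

# As a factor with explicit levels, the axis follows the numeric order
subject_f <- factor(subject, levels = sort(unique(subject)))
factor_order <- levels(subject_f)

print(char_order)
print(factor_order)
```

Passing subject_f as the x aesthetic should then place the categories in the order given by its levels.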

Related

Plotting Arbitrary Functions by Group in R

I have a dataset (test_df) that looks like:
Species  TreatmentA  TreatmentB  X0  L   K
Apple    Hot         Cloudy      1   2   3
Apple    Cold        Cloudy      4   5   6
Orange   Hot         Sunny       7   8   9
Orange   Cold        Sunny       10  11  12
I would like to display the effect of the treatments by using the X0, L, and K values as coefficients in a standard logistic function and plotting the same species across various treatments on the same plot. I would like a grid of plots with the logistic curves for each species on its own plot, with each treatment then being grouped by color within every plot. In the above example, Plot1.Grid1 would have 2 logistic curves corresponding to Apple Hot and Apple Cold, and Plot1.Grid2 would have 2 logistic curves corresponding to Orange Hot and Orange Cold.
The below code will create a single logistic function curve which can then be layered, but manually adding the layers for multiple treatments is tedious.
testx0 <- 1
testL <- 2
testk <- 3
days <- seq(from = -5, to = 5, by = 1)
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
ggplot()+aes(x = days, y = functionmultitest(days,testL,testk,testx0))+geom_line()
The method described in (https://statisticsglobe.com/draw-multiple-function-curves-to-same-plot-in-r) works for dataframes with few species or treatments, but it becomes very tedious to individually define the curves if you have many treatments/species. Is there a way to programmatically pass the list of coefficients and have ggplot handle the grouping?
Thank you!
Your current code shows how to compute the curve for a single row in your data frame. What you can do is pre-compute the curve for each row and then feed to ggplot.
Setup:
# Packages
library(ggplot2)
# Your days vector
days <- seq(from = -5, to = 5, by = 1)
# Your sample data frame above
df = structure(list(Species = c("Apple", "Apple", "Orange", "Orange"
), TreatmentA = c("Hot", "Cold", "Hot", "Cold"), TreatmentB = c("Cloudy",
"Cloudy", "Sunny", "Sunny"), X0 = c(1L, 4L, 7L, 10L), L = c(2L,
5L, 8L, 11L), K = c(3L, 6L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-4L))
# Your function
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
We'll "expand" each row of your data frame with the days vector:
# Define first a data frame of days:
days_df = data.frame(days = days)
# Perform a cross join
df_all = merge(days_df, df, all = T)
At this point, you will have a data frame where each original row is repeated once for every value in days.
Now, just as you did for one row, we'll compute the value of the function for each row and store in the df_all as result:
df_all$result = mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
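Because functionmultitest only uses vectorised arithmetic, the same column can probably be computed without mapply. The sketch below rebuilds the sample data (coefficients copied from the question) and checks that the cross join has one row per day and coefficient set, and that both ways of calling the function agree.

```r
functionmultitest <- function(x, testL, testK, testX0) {
  testL / (1 + exp(-testK * (x - testX0)))
}

days <- seq(from = -5, to = 5, by = 1)
df <- data.frame(
  Species    = c("Apple", "Apple", "Orange", "Orange"),
  TreatmentA = c("Hot", "Cold", "Hot", "Cold"),
  X0 = c(1, 4, 7, 10),
  L  = c(2, 5, 8, 11),
  K  = c(3, 6, 9, 12)
)

# merge() with no shared column names returns the Cartesian product
df_all <- merge(data.frame(days = days), df, all = TRUE)

# Element-wise call via mapply, and the direct vectorised call
via_mapply <- mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
vectorised <- functionmultitest(df_all$days, df_all$L, df_all$K, df_all$X0)

print(nrow(df_all))
print(all.equal(via_mapply, vectorised))
```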
I'm not sure how you intended to handle treatmentA and treatmentB, so I'll just combine for illustration purposes:
df_all$combined_treatment = paste0(df_all$TreatmentA, "-", df_all$TreatmentB)
We can now feed this data frame to ggplot, set the color to be combined_treatment, and use the facet_grid function to split by species
ggplot(data = df_all, aes(x = days, y = result, color = combined_treatment))+
geom_line() +
facet_grid(Species ~ ., scales = "free")
The result is as follows:

R: Multiple bar plots of mean value vs. month vs. genre [closed]

I have the following data-frame, where variable is 10 different genre categories of movies, eg. drama, comedy etc.
> head(grossGenreMonthLong)
Gross ReleasedMonth variable value
5 33508485 2 drama 1
6 67192859 2 drama 1
8 37865 4 drama 1
9 76665507 1 drama 1
10 221594911 2 drama 1
12 446438 2 drama 1
Reproducible dataframe:
dput(head(grossGenreMonthLong))
structure(list(Gross = c(33508485, 67192859, 37865, 76665507,
221594911, 446438), ReleasedMonth = c(2, 2, 4, 1, 2, 2), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("drama", "comedy", "short", "romance",
"action", "crime", "thriller", "documentary", "adventure", "animation"
), class = "factor"), value = c(1, 1, 1, 1, 1, 1)), .Names = c("Gross",
"ReleasedMonth", "variable", "value"), row.names = c(5L, 6L,
8L, 9L, 10L, 12L), class = "data.frame")
I would like to calculate the mean gross vs. month for each of the 10 genres and plot them in separate bar charts using facets (varying by genre).
In other words, what's a quick way to plot 10 bar charts of mean gross vs. month for each of the 10 genres?
You should provide a reproducible example to make it easier for us to help you. dput(my.dataframe) is one way to do it, or you can generate an example dataframe like below. Since you haven't given us a reproducible example, I'm going to put on my telepathy hat and assume the "variable" column in your screenshot is the genre.
n = 100
movies <- data.frame(
genre=sample(letters[1:10], n, replace=T),
gross=runif(n, min=1, max=1e7),
month=sample(12, n, replace=T)
)
head(movies)
# genre gross month
# 1 e 5545765.4 1
# 2 f 3240897.3 3
# 3 f 1438741.9 5
# 4 h 9101261.0 6
# 5 h 926170.8 7
# 6 f 2750921.9 1
(My genres are 'a', 'b', etc).
To do a plot of average gross per month, you will need to calculate average gross per month. One such way to do so is using the plyr package (there is also data.table, dplyr, ...)
library(plyr)
monthly.avg.gross <- ddply(movies, # the input dataframe
.(genre, month), # group by these
summarize, avgGross=mean(gross)) # do this.
The dataframe monthly.avg.gross now has one row per (month, genre) with a column avgGross that has the average gross in that (month, genre).
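If you would rather not load plyr, the same summary can be sketched with base R's aggregate(). The data here is simulated in the same way as the movies example, with set.seed added so it is repeatable, and the column name avgGross is kept only for consistency with the ddply call.

```r
set.seed(1)
n <- 100
movies <- data.frame(
  genre = sample(letters[1:10], n, replace = TRUE),
  gross = runif(n, min = 1, max = 1e7),
  month = sample(12, n, replace = TRUE)
)

# One row per (genre, month) combination that occurs in the data
monthly.avg.gross <- aggregate(gross ~ genre + month, data = movies, FUN = mean)
names(monthly.avg.gross)[names(monthly.avg.gross) == "gross"] <- "avgGross"

head(monthly.avg.gross)
```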
Now it's a matter of plotting. You have hinted at "facet" so I assume you're using ggplot.
library(ggplot2)
ggplot(monthly.avg.gross, aes(x=month, y=avgGross)) +
geom_point() +
facet_wrap(~ genre)
You can do stuff like add month labels and treat month as a factor instead of a number like here, but that's peripheral to your question.
Thank you very much @mathematical.coffee. I was able to adapt your answer to produce the appropriate bar charts.
meanGrossGenreMonth = ddply(grossGenreMonthLong,
.(ReleasedMonth, variable),
summarise,
mean.Gross = mean(Gross, na.rm = TRUE))
# plot bar plots with facets
# Each "+" must end a line; a line that starts with "+" is parsed as a new expression
ggplot(meanGrossGenreMonth, aes(x = factor(ReleasedMonth), y = mean.Gross)) +
geom_bar(stat = "identity") + facet_wrap(~ variable) + ylab("mean Gross ($)") +
xlab("Month") + ggtitle("Mean gross revenue vs. month released by Genre")

ggplot2 line chart gives "geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?"

With this data frame ("df"):
year pollution
1 1999 346.82000
2 2002 134.30882
3 2005 130.43038
4 2008 88.27546
I try to create a line chart like this:
plot5 <- ggplot(df, aes(year, pollution)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Particulate matter emissions (tons)", title = "Motor vehicle emissions in Baltimore")
The error I get is:
geom_path: Each group consist of only one observation. Do you need to
adjust the group aesthetic?
The chart appears as a scatter plot even though I want a line chart. I tried to replace geom_line() with geom_line(aes(group = year)) but that didn't work.
In an answer I was told to convert year to a factor variable. I did and the problem persists. This is the output of str(df) and dput(df):
'data.frame': 4 obs. of 2 variables:
$ year : num 1 2 3 4
$ pollution: num [1:4(1d)] 346.8 134.3 130.4 88.3
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1999" "2002" "2005" "2008"
structure(list(year = c(1, 2, 3, 4), pollution = structure(c(346.82,
134.308821199349, 130.430379885892, 88.275457392443), .Dim = 4L, .Dimnames = list(
c("1999", "2002", "2005", "2008")))), .Names = c("year",
"pollution"), row.names = c(NA, -4L), class = "data.frame")
You only have to add group = 1 into the ggplot or geom_line aes().
For line graphs, the data points must be grouped so that it knows which points to connect. In this case, it is simple -- all points should be connected, so group=1. When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
Reference: Cookbook for R, Chapter: Graphs Bar_and_line_graphs_(ggplot2), Line graphs.
Try this:
plot5 <- ggplot(df, aes(year, pollution, group = 1)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Particulate matter emissions (tons)",
title = "Motor vehicle emissions in Baltimore")
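To see what group = 1 changes, the sketch below compares the groups ggplot2 assigns with and without it. It assumes, for illustration only, that year is stored as a factor (a discrete x); the pollution values are rounded from the question.

```r
library(ggplot2)

# Assumption for illustration: year is a factor, i.e. a discrete x variable
df <- data.frame(
  year = factor(c(1999, 2002, 2005, 2008)),
  pollution = c(346.82, 134.31, 130.43, 88.28)
)

p_default <- ggplot(df, aes(year, pollution)) + geom_line()
p_grouped <- ggplot(df, aes(year, pollution, group = 1)) + geom_line()

# layer_data() returns the data ggplot2 computed for a layer,
# including the group each row was assigned to
groups_default <- layer_data(p_default, 1)$group
groups_grouped <- layer_data(p_grouped, 1)$group

print(length(unique(groups_default)))  # number of groups without group = 1
print(length(unique(groups_grouped)))  # number of groups with group = 1
```

With a discrete x and no explicit grouping, each level ends up in its own group, so no group has two points to join.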
You get this error because one of your variables is actually a factor variable. Execute
str(df)
to check this.
Then do this double variable change to keep the year numbers instead of transforming into "1,2,3,4" level numbers:
df$year <- as.numeric(as.character(df$year))
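The reason for the double conversion can be shown on a small factor. This is a sketch with a stand-alone vector (not the asker's data frame): as.numeric on a factor returns the internal level codes, while going through as.character first recovers the labels.

```r
year <- factor(c("1999", "2002", "2005", "2008"))

codes <- as.numeric(year)                # internal level codes
years <- as.numeric(as.character(year))  # the labels, parsed as numbers

print(codes)  # 1 2 3 4
print(years)  # 1999 2002 2005 2008
```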
EDIT: it appears that your data.frame has a variable of class "array", which might cause the problem.
Try then:
df <- data.frame(apply(df, 2, unclass))
and plot again?
I had a similar problem with the data frame:
group time weight.loss
1 Control wl1 4.500000
2 Diet wl1 5.333333
3 DietEx wl1 6.200000
4 Control wl2 3.333333
5 Diet wl2 3.916667
6 DietEx wl2 6.100000
7 Control wl3 2.083333
8 Diet wl3 2.250000
9 DietEx wl3 2.200000
I think the variable for the x axis should be numeric, so that geom_line knows how to connect the points to draw the line.
After I changed the 2nd column to numeric:
group time weight.loss
1 Control 1 4.500000
2 Diet 1 5.333333
3 DietEx 1 6.200000
4 Control 2 3.333333
5 Diet 2 3.916667
6 DietEx 2 6.100000
7 Control 3 2.083333
8 Diet 3 2.250000
9 DietEx 3 2.200000
then it works.
Start up R in a fresh session and paste this in:
library(ggplot2)
df <- structure(list(year = c(1, 2, 3, 4), pollution = structure(c(346.82,
134.308821199349, 130.430379885892, 88.275457392443), .Dim = 4L, .Dimnames = list(
c("1999", "2002", "2005", "2008")))), .Names = c("year",
"pollution"), row.names = c(NA, -4L), class = "data.frame")
df[] <- lapply(df, as.numeric) # make all columns numeric
ggplot(df, aes(year, pollution)) +
geom_point() +
geom_line() +
labs(x = "Year",
y = "Particulate matter emissions (tons)",
title = "Motor vehicle emissions in Baltimore")
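In the dput output the real years survive only as the dimnames of the pollution column, while year holds 1 to 4. If the axis should show the actual years, one possible approach is to read them back from those names before flattening the column. This is a sketch under that assumption, with rounded values:

```r
df <- structure(list(
  year = c(1, 2, 3, 4),
  pollution = structure(c(346.82, 134.31, 130.43, 88.28),
    .Dim = 4L, .Dimnames = list(c("1999", "2002", "2005", "2008")))
), row.names = c(NA, -4L), class = "data.frame")

# The column is a 1-d array, not a plain numeric vector
print(is.array(df$pollution))

# names() of a 1-d array returns its dimnames, which hold the years here
df$year <- as.numeric(names(df$pollution))

# as.numeric() drops the dim and dimnames attributes
df$pollution <- as.numeric(df$pollution)

str(df)
```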
I got a similar prompt. It was because I had specified the x-axis in terms of some percentage (for example: 10%A, 20%B,....).
So an alternate approach could be that you multiply these values and write them in the simplest form.
I found this can also occur if most of the data plotted is outside the axis limits. In that case, adjust the axis scales accordingly.

Plotting a multidimensional Data Set

I have a 2 dimensional data set (matrix/data frame) that looks like this
779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00
The 779, 482, 859, 1156 are values that I want to draw on the x-axis.
The rest of the values in each column are the values that correspond to each x.
Now I want to plot the entire data set, so that I have a graph with the following points
(779,56916) , (779, 41784)......
(482,78968) , (482, 64440)..... and so on
The way I did it so far is like this (it gives me the plot I am looking for)
plot(colnames(resultsSummary),resultsSummary[1,],ylim=c(0,80000),pch=6)
points(colnames(resultsSummary),resultsSummary[2,],pch=3)
points(colnames(resultsSummary),resultsSummary[3,])
and so on..... plotting row by row
I am sure there is a better way to do it, but I don't know how. Any suggestions?
DF <- read.table(text=" 779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00",
header=TRUE, check.names=FALSE)
m <- as.matrix(DF)
# t(m) has one column per row of m, so use one plotting symbol per row of m
matplot(as.integer(colnames(m)),
t(m), pch=seq_len(nrow(m)))
Following also works:
ddf = structure(list(var = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("maxs",
"Mean_Cost", "Means_Stdv", "Means+Stdv", "mins"), class = "factor"),
X779 = c(56916, 41784.7, 31863.18, 21941.66, 21088), X482 = c(78968,
64440.83, 44407.4, 24373.97, 13768), X859 = c(51156, 38319.1,
29365.78, 20412.45, 24132), X1156 = c(44827.01, 42767.14,
38711.29, 34655.43, 31452)), .Names = c("var", "X779", "X482",
"X859", "X1156"), class = "data.frame", row.names = c(NA, -5L
))
ddf
var X779 X482 X859 X1156
1 maxs 56916.00 78968.00 51156.00 44827.01
2 Means+Stdv 41784.70 64440.83 38319.10 42767.14
3 Mean_Cost 31863.18 44407.40 29365.78 38711.29
4 Means_Stdv 21941.66 24373.97 20412.45 34655.43
5 mins 21088.00 13768.00 24132.00 31452.00
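library(reshape2) # provides melt()
library(ggplot2)
# sub() strips the leading "X"; substr(..., 2, 4) would truncate "X1156" to 115
ddf[6,2:5]=as.numeric(sub("^X","",names(ddf)[2:5]))
ddf2 = data.frame(t(ddf))
ddf2 = ddf2[-1,]
mm = melt(ddf2, id='X6')
# t() turned every value into text, so convert back to numbers before plotting
mm$X6 = as.numeric(as.character(mm$X6))
mm$value = as.numeric(as.character(mm$value))
ggplot(mm)+geom_point(aes(x=X6, y=value, color=variable))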
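A base-R route to the same long format, without transposing or reshape2, is to treat the matrix as a table. This is a sketch: the dimnames labels stat and x are names chosen here for illustration, and the values are copied from the question.

```r
m <- matrix(
  c(56916.00, 41784.70, 31863.18, 21941.66, 21088.00,
    78968.00, 64440.83, 44407.40, 24373.97, 13768.00,
    51156.00, 38319.10, 29365.78, 20412.45, 24132.00,
    44827.01, 42767.14, 38711.29, 34655.43, 31452.00),
  nrow = 5,
  dimnames = list(
    stat = c("maxs", "Means+Stdv", "Mean_Cost", "Means_Stdv", "mins"),
    x = c("779", "482", "859", "1156")
  )
)

# One row per (stat, x) cell; the cell value goes into the "value" column
long <- as.data.frame(as.table(m), responseName = "value",
                      stringsAsFactors = FALSE)
long$x <- as.integer(long$x)

head(long)
```

The result has numeric x and value columns plus a stat label, so it can be handed to ggplot with aes(x = x, y = value, color = stat).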

How do I plot boxplots of two different series?

I have 2 dataframes sharing the same row IDs but with different columns
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where I have in x axis the chrom and coord combined (so 3 points) and for each x value 2 boxplots side by side corresponding to the two dataframes ?
What is the best way of doing this ? Should I merge the two dataframes together somehow in order to get only one and loop over the boxplots rendering by 3 columns ?
Any idea on how this can be done ?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframe in order to get the same number of column but got lost on how to this properly
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape to melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what I'd do
library(reshape2) # provides melt()
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# name the combined table mm so that it matches the lines below and the plot call
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
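The factor-ordering step can be seen in isolation in the sketch below. It uses a small made-up table (the row with chrom 2 is hypothetical, added to show the effect) and named columns instead of positions. Sorting the chrom/coord pairs numerically before pasting keeps "2:..." ahead of "10:...", which a plain alphabetical sort of the labels would not.

```r
# Made-up positions; the chrom = 2 row is hypothetical
dd <- data.frame(
  chrom = c(11L, 10L, 10L, 2L),
  coord = c(104972190L, 38894616L, 3178881L, 500L)
)

# Unique pairs, sorted by chromosome and then by coordinate
ord <- unique(dd[order(dd$chrom, dd$coord), c("chrom", "coord")])

# Labels built from the sorted pairs become the factor levels
dd$pos <- factor(paste(dd$chrom, dd$coord, sep = ":"),
                 levels = paste(ord$chrom, ord$coord, sep = ":"))

print(levels(dd$pos))
```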
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like
