I'm not an expert with the ggplot2 package, and I have a subset selection problem.
Here is the code that produces this kind of graph:
g <- ggplot(merged_data,aes_string(x=Order,fill=var.y)) +
scale_y_continuous(expand = c(0.05,0)) +
xlab(paste("Order","Total number of sequences",sep=" - ")) +
ggtitle(main.str) +
geom_bar(position="fill",
subset = .(Order != ""),
width=0.6,hjust =0)+
geom_text(stat="bin",
subset = .(Order != ""),
color="black", hjust=1, vjust = 0.5, size=2,
aes_string(fill=NULL,x = Order,y = "0", label="..count.."))+
coord_flip()
For geom_bar and geom_text I select a subset of the data that removes empty names:
subset = .(eval(parse(text=var.x)) != "")
This is a simple example with only 2 bars.
Here is the data:
Collector<- c("BK","YE_LD","BK","JB","JB",
"BK","BK","BK","JB","YE_LD")
Order<-c("A","B","B","B","A",
"B","B","A","B","B")
data <- data.frame(Order,Collector)
Now I want to add a cutoff to my subset, so that only the values of Order with a minimum number of counts are shown.
So if I set the cutoff to 4, I should get only the bar at the bottom, which has 7 counts; the bar at the top, with 3 counts, should not appear.
I have no idea how to do this.
Thanks for your help.
You could create a subset of the data and use this new object in ggplot. The following command removes all Order values with fewer than four data points:
subset(data, Order %in% names(which(table(Order) >= 4)))
   Order Collector
2      B     YE_LD
3      B        BK
4      B        JB
6      B        BK
7      B        BK
9      B        JB
10     B     YE_LD
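To make the cutoff reusable, the same idea can be wrapped in a small function. This is only a sketch: keep_min_count is a hypothetical helper name (it is not part of base R or ggplot2), and it assumes the grouping column is passed as a string.

```r
# Example data from the question
Collector <- c("BK", "YE_LD", "BK", "JB", "JB",
               "BK", "BK", "BK", "JB", "YE_LD")
Order <- c("A", "B", "B", "B", "A",
           "B", "B", "A", "B", "B")
data <- data.frame(Order, Collector)

# Hypothetical helper: keep rows whose value in `column` is non-empty
# and occurs at least `cutoff` times in the data
keep_min_count <- function(df, column, cutoff) {
  counts <- table(df[[column]])
  frequent <- names(counts)[counts >= cutoff]
  keep <- df[[column]] %in% frequent & df[[column]] != ""
  # droplevels() removes unused factor levels so they do not reappear as empty bars
  droplevels(df[keep, , drop = FALSE])
}

filtered <- keep_min_count(data, "Order", cutoff = 4)
```

The filtered data frame can then be passed to ggplot() in place of the full data, so the cutoff no longer has to be expressed inside each geom.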
Related
I'm pretty new to R and I have a problem with plotting a barplot out of my data which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with the conditions 1 to 10, each randomly generated 10 times (10 times condition 1, 10 times condition 8,...). Each of the conditions also has a answer which could be H for Hit, M for Miss or W for wrong.
I want to plot the number of Hits for each condition in a barplot (for example 8 Hits out of 10 for condition 1,...) for that I tried to do the following in ggplot2
ggplot(data=test, aes(x=test$condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1))
And it looked like this:
This is actually exactly what I need, except for the red color which covers everything. You can see that conditions 3 to 5 have no blue bar, because there are no hits for these conditions.
Is there any way to get rid of the red color, and maybe to count the number of hits for the different conditions? I tried the count function of dplyr, but it only showed me the number of H when there were some for that particular condition. Conditions 3 to 5 were just "ignored" by count; there wasn't even a 0 in the output, but I'd still need those numbers for the plot.
I'm sorry for this particularly long post, but I'm really at the end of my knowledge here. I'm open to suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
answer = c(sample(c("H", "M", "W"), 50, replace = T),
sample(c("M", "W"), 50, replace = T)))
hit_counts <- test %>%
mutate(condition = as.factor(condition)) %>%
filter(answer == "H") %>%
count(condition, .drop = F)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for where you have your y-values already, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
geom_col()
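If dplyr is not available, the zero counts can also be kept in base R by tabulating a factor whose levels list every condition. The following is a minimal sketch on a small made-up data set (four conditions, with condition 3 deliberately given no hits); it is not the sample data used above.

```r
# Made-up example: condition 3 has no "H" answers
test <- data.frame(
  condition = rep(1:4, each = 3),
  answer = c("H", "H", "M",
             "H", "W", "M",
             "W", "M", "M",
             "H", "H", "H")
)

# A factor keeps all of its levels when subset, so table() reports zeros too
condition <- factor(test$condition, levels = 1:4)
hits <- table(condition[test$answer == "H"])

hit_counts <- data.frame(
  condition = factor(names(hits), levels = levels(condition)),
  n = as.integer(hits)
)
hit_counts$n
# 2 1 0 3
```

The resulting hit_counts has the same shape as the dplyr version, so it can be plotted with geom_col in the same way.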
One option is to just filter out anything but where answer == "H" from your dataset, and then plot.
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
answer = sample(c('H', 'M', 'W'), 100, replace = T))
ggplot(data=test) +
geom_bar(aes(x = condition, fill = answer), position = "dodge") +
labs(x="Conditions", y="Hitrate") +
coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
scale_x_continuous(breaks=seq(1,10,1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data=test, aes(x=condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1)) +
scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
guides(fill = "none")
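To avoid relying on the order of FALSE and TRUE, the colors can also be given as a named vector. This is a sketch on a small made-up data set, assuming a reasonably recent ggplot2; the color name "transparent" is a standard R color.

```r
library(ggplot2)

# Made-up example data: condition 3 has no hits
test <- data.frame(
  condition = rep(1:4, each = 3),
  answer = c("H", "H", "M",
             "H", "W", "M",
             "W", "M", "M",
             "H", "H", "H")
)

# Names in `values` are matched against the fill values ("FALSE"/"TRUE"),
# so the order in which they are written no longer matters
p <- ggplot(test, aes(x = factor(condition), fill = answer == "H")) +
  geom_bar() +
  scale_fill_manual(values = c("TRUE" = "cadetblue", "FALSE" = "transparent"),
                    guide = "none") +
  labs(x = "Conditions", y = "Hits")
```

Note that the transparent segments still take up space in the stack, so the visible part of each bar does not necessarily start at zero; counting the hits first and using geom_col avoids that.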
I have imported data in this form:
Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A
v <- read.table("myfile", header = TRUE)
I have a vector that looks like this:
x <- c(50,100)
And without some other aesthetic stuff I am plotting column 1 vs column 2 labeled with column 3.
p <- ggplot(v, aes(x=sample1, y=sample2, alpha=0.5, label=identity)) +
geom_point() +
geom_text_repel(aes(label=ifelse(sample2>0.007 |sample1>0.007 ,as.character(identity),''))) +
I would like to somehow indicate those points that contain a number in their ID, found within the vector x. I was thinking this could be done with color, but it doesn't really matter to me as long as there is a difference between the two types of points.
So for instance if the points containing a number in x were to be colored red, the first point would be red because it has 50 in the ID and the second point would not be, because 200 is not a value in x.
You could add in a TRUE/FALSE value as a column and use that as a color. I had to remove your label = ... aes since that's not an aes in ggplot2. Also everything is transparent because you use aes(alpha = 0.5):
library(ggrepel)
library(ggplot2)
vafs$col <- grepl(paste0(x,collapse = "|"), vafs$Identity)
p <- ggplot(vafs, aes(x=Sample1, y=Sample2, alpha=0.5, color = col)) +
geom_point() +
geom_text_repel(aes(label=ifelse(Sample2>0.007 |Sample1>0.007 ,as.character(Identity),'')))
I came up with the following solution:
vafs<-read.table(text="Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A", header=T)
vec <- c(50,100)
vafs$vec<- sapply(vafs$Identity, FUN=function(x)
ifelse(length(grep(pattern=paste(vec,collapse="|"), x))>0,1,0))
vafs$vec <- as.factor(vafs$vec)
ggplot(vafs, aes(x=Sample1, y=Sample2, label=Identity, col=vec)) + geom_point(alpha=0.5)
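One caveat with both pattern-based approaches: a pattern such as "50|100" also matches longer numbers that merely contain those digits. The sketch below assumes the ID always has the form chromosome-position-base and compares the position field exactly; the third row (chr11-500-G) is a hypothetical addition to show the difference.

```r
vafs <- read.table(text = "Sample1 Sample2 Identity
1 2 chr11-50-T
3 4 chr11-200-A
5 6 chr11-500-G", header = TRUE, stringsAsFactors = FALSE)
x <- c(50, 100)

# Pattern match: also flags chr11-500-G, because "500" contains "50"
loose <- grepl(paste0(x, collapse = "|"), vafs$Identity)

# Exact match: take the second "-"-separated field and compare it as a number
position <- as.numeric(sapply(strsplit(vafs$Identity, "-", fixed = TRUE), `[`, 2))
vafs$in_x <- position %in% x
```

The in_x column can be mapped to color in the same way as the col or vec columns above.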
I have a large dataframe, where a variable id (first column) recurs with different values in the second column. My idea is to order the dataframe, to split it into a list and then lapply a function which cbinds the sequence 1:nrows(variable id) to each group. My code so far:
DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF <- splitstackshape::getanID(DF, "cell")[] # thanks to akrun's answer
ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = cell)) +
geom_line(aes(group = cell)) + geom_point(size=0.1)
}
And the plot looks like this:
Most cells increase in area, only some decrease. This is only a first try to visualize my data, so what you can't see very well is that the areas drop down periodically due to cell division.
Additional question:
There is a problem I didn't take into account beforehand, which is that after a cell division a new cell is added to the data.frame and is handed the initial index 1 (you see in the image that all cells start from .id=1, not later), which is not what I want - it needs to inherit the index of its creation time. First thing that comes into my mind is that I could use a parsing mechanism that does this job for a newly added cell variable:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition which may ease the problem: fixed number of cells at the beginning (32). Another solution would be to cut away all data before the last daughter cell is created.
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L # Indexing
title <- getwd()
myplot <- ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = factor(cell))) +
geom_line(aes(group = cell)) + geom_line(size=0.1) + theme(legend.position="none") + ggtitle(title)
#save the plot
image=myplot
ggsave(file="cell_areas_time.svg", plot=image, width=10, height=8)
}
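To see what the indexing line does, here is a small sketch on a made-up cell vector. It assumes the rows are ordered by time and that the cell numbers increase within each time point, so every drop in the cell number marks the start of a new time point.

```r
# Made-up cell column: three time points, cell 4 appears from the second one on
cell <- c(1, 2, 3,
          1, 2, 3, 4,
          1, 2, 3, 4)

# diff(cell) < 0 is TRUE exactly where the cell number drops back,
# and cumsum() turns those restarts into a running time index
time_id <- c(0, cumsum(diff(cell) < 0)) + 1L
time_id
# 1 1 1 2 2 2 2 3 3 3 3
```

Cell 4 therefore gets its first index at time point 2 rather than 1, which is the behavior asked for in the additional question.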
We can use getanID from splitstackshape
library(splitstackshape)
getanID(DF, "cell")[]
There's a much easier method to accomplish that goal: use ave with seq_along, passing a plain vector (not the whole data frame) as the first argument.
DF$group_seq <- ave(seq_along(DF[,1]), DF[,1], FUN = seq_along)
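The error in the question most likely comes from length(), which returns the number of columns when applied to a data frame, so the sequence does not match the number of rows in each group. A minimal sketch of both routes follows, using made-up area values and the column names cell and area from the question.

```r
# Made-up data in the same layout as the question
DF <- data.frame(cell = c(1, 2, 3, 1, 2, 3, 1, 2),
                 area = c(121.2, 81.4, 81.6, 121.3, 80.8, 81.2, 122.2, 81.0))

# Route 1: per-group counter without splitting the data frame
DF$counter <- ave(seq_along(DF$cell), DF$cell, FUN = seq_along)

# Route 2: the split/lapply idea from the question, using nrow() instead of length()
parts <- lapply(split(DF[c("cell", "area")], DF$cell),
                function(d) cbind(d, counter = seq_len(nrow(d))))
DF2 <- do.call(rbind, parts)  # rows are now grouped by cell
```

Route 1 keeps the original row order, while route 2 returns the rows grouped by cell, as in the desired output shown in the question.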
I've looked all over stack and other sites to fix my code but can't see what's wrong. I am trying to plot 2 lines on the same graph on ggplot that are portions of 2 different columns. For example, I have a column of length 8 of which the first four rows are M (male) and the last four rows are F (female). I have two columns of data and one column for condition (factor).
ModelMF <- data.frame(ProbGender, ProbCond, ProbMF, Act_pct)
where:
ProbGender ProbCond ProbMF Act_pct
M 0 .75 .71
M 10 .67 .69
M 20 .61 .54
M 30 .81 .77
F 0 .88 .82
F 10 .73 .71
F 20 .67 .71
F 30 .60 .63
I have tried the following but I keep getting errors (see below):
ggplot(data = ModelMF, aes(x = ProbCond)) + geom_line(data =
ModelMF[ModelMF$ProbGender=="M",], aes(y=ProbMF), color = 'col1') +
geom_point(data = ModelMF[ModelMF$ProbGender=="M",], aes(y = ProbMF)) +
geom_line(data = ModelMF[ModelMF$ProbGender=="M",], aes(y=Act_pct), color =
'col2') + geom_point(data = ModelMF[ModelMF$ProbGender=="M",], aes(y =
Act_pct)) + scale_color_manual(values = c('col1' = 'darkblue', 'col2' ='lightblue'))
Preferably I would like to be able to create a custom legend that lets me map the colors as I've attempted to do using scale_color_manual, but I get the following error:
Error in grDevices::col2rgb(colour, TRUE) : invalid color name 'col1'
I'm not sure if it is due to the fact that I'm subsetting data within the df or something else I'm just missing? Also if I add the female lines I assume I can simply follow the same procedure?
Thanks in advance.
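A possible explanation, offered as a sketch rather than a definitive answer: color = 'col1' outside aes() is treated as a literal color name, which is why col2rgb complains. If the label is instead mapped inside aes(), scale_color_manual can translate it into a real color and a legend is drawn. The example below rebuilds the data from the table above and uses the column names as the legend labels, which is an assumption about the desired legend.

```r
library(ggplot2)

ModelMF <- data.frame(
  ProbGender = rep(c("M", "F"), each = 4),
  ProbCond   = rep(c(0, 10, 20, 30), times = 2),
  ProbMF     = c(.75, .67, .61, .81, .88, .73, .67, .60),
  Act_pct    = c(.71, .69, .54, .77, .82, .71, .71, .63)
)
males <- ModelMF[ModelMF$ProbGender == "M", ]

# The constant strings inside aes() are legend keys, not color names;
# scale_color_manual() maps each key to an actual color
p <- ggplot(males, aes(x = ProbCond)) +
  geom_line(aes(y = ProbMF, color = "ProbMF")) +
  geom_point(aes(y = ProbMF, color = "ProbMF")) +
  geom_line(aes(y = Act_pct, color = "Act_pct")) +
  geom_point(aes(y = Act_pct, color = "Act_pct")) +
  scale_color_manual(name = NULL,
                     values = c(ProbMF = "darkblue", Act_pct = "lightblue")) +
  labs(y = "Proportion")
```

The female lines could be added by following the same pattern with a second subset, or by reshaping the data to long format and mapping color and linetype to the variable and gender columns.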
For the following matrix of order 11*8 stored in an object named Results:
Delta UE RE LS PT SP JS JS+
SRE0 0.000000 1 3.8730275 2.2721219 1.0062884 1.0047529 1.0317746 1.0318688
SRE1 0.100065 1 2.2478516 2.0595205 1.0502708 1.0453288 1.0436898 1.0764224
SRE2 0.200385 1 1.5838920 1.8793306 1.0359049 1.0437888 1.0529307 1.0753217
SRE3 0.300075 1 0.9129295 1.5360455 0.9946433 1.0320438 1.0063378 1.0654772
SRE4 0.400175 1 0.6434000 1.3150935 0.9530553 1.0172104 1.0107737 1.0564151
SRE5 0.500138 1 0.6063778 1.2876456 0.9455131 1.0165491 0.9994965 1.0553198
SRE6 0.600200 1 0.3710599 0.9537165 0.8730835 0.9945211 0.9346991 1.0369921
SRE7 0.699500 1 0.3312944 0.8793348 0.8535376 0.9914288 0.9046180 1.0314705
SRE8 0.800285 1 0.2338423 0.6966505 0.7831482 0.9657499 0.8445466 1.0169138
SRE9 0.900020 1 0.1665775 0.5328803 0.7024265 0.9296520 0.7989161 0.9850603
SRE10 1.000074 1 0.1550065 0.5047066 0.6849924 0.9231919 0.7765414 0.9821768
I want to plot the last 7 columns of this matrix as lines against the first column in a single graph, such that each column has either a different color or a different line type. The first column, named Delta, should be on the X-axis, while the rest of the columns should be on the Y-axis.
The basic idea I'd take is to change your Results object from wide to long format, to pass to ggplot. I like to use Hadley Wickham's reshape2 library. It has a function, melt, which will stack your data appropriately, then you can choose to group the lines by the different variables.
library(ggplot2)
library(reshape2) # install.packages("reshape2")
R = data.frame(Delta = c(1,2), UE = c(1,1), RE = c(3.8, 2.4))
meltR = melt(R, id = "Delta")
ggplot(meltR, aes(x = Delta, y = value, group = variable, colour = variable)) +
geom_line()
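If installing reshape2 is not an option, the same wide-to-long step can be sketched in base R with stack(). The example uses only the first three rows and four columns of the Results matrix from the question; the names wide, stacked, and long are arbitrary.

```r
# First three rows and four columns of the matrix from the question
Results <- matrix(
  c(0.000000, 1, 3.8730275, 2.2721219,
    0.100065, 1, 2.2478516, 2.0595205,
    0.200385, 1, 1.5838920, 1.8793306),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("SRE0", "SRE1", "SRE2"), c("Delta", "UE", "RE", "LS"))
)

wide <- as.data.frame(Results)

# stack() puts all non-Delta columns on top of each other:
# `values` holds the numbers, `ind` the original column name
stacked <- stack(wide[-1])
long <- data.frame(Delta    = rep(wide$Delta, times = ncol(wide) - 1),
                   variable = stacked$ind,
                   value    = stacked$values)
```

The long data frame has the same columns as the melt output, so it can be passed to the ggplot call above unchanged.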
Try:
matplot(m[,1],m[,-1],type='l')
where m is your matrix.
The ggplot2 package can accomplish this easily.
You just need to have a separate command for every column.
From the start
Results
Delta UE RE LS PT SP JS JS2
SRE0 0.000000 1 3.8730275 2.2721219 1.006288 1.004753 1.031775 1.031869
SRE1 0.100065 1 2.2478516 2.0595205 1.050271 1.045329 1.043690 1.076422
SRE2 0.200385 1 1.5838920 1.8793306 1.035905 1.043789 1.052931 1.075322
SRE3 0.300075 1 0.9129295 1.5360455 1.994643 1.032044 1.006338 1.065477
SRE4 0.400175 1 0.6434000 1.3150935 1.953055 1.017210 1.010774 1.056415
SRE5 0.500138 1 0.6063778 1.2876456 1.945513 1.016549 1.999497 1.055320
SRE6 0.600200 1 0.3710599 0.9537165 1.873083 1.994521 1.934699 1.036992
SRE7 0.699500 1 0.3312944 0.8793348 1.853538 1.991429 1.904618 1.031470
SRE8 0.800285 1 0.2338423 0.6966505 1.783148 1.965750 1.844547 1.016914
SRE9 0.900020 1 0.1665775 0.5328803 1.702427 1.929652 1.798916 1.985060
SRE10 1.000074 1 0.1550065 0.5047066 1.684992 1.923192 1.776541 1.982177
class(Results)
[1] "Matrix"
Note that I renamed the "JS+" column to "JS2" to avoid errors in R, since "+" is not allowed in a syntactic name.
Convert to data.frame
Assign Results to a new object, specifically a data.frame.
newResults <- as.data.frame(Results)
newResults
Delta UE RE LS PT SP JS JS2
SRE0 0.000000 1 3.8730275 2.2721219 1.006288 1.004753 1.031775 1.031869
SRE1 0.100065 1 2.2478516 2.0595205 1.050271 1.045329 1.043690 1.076422
SRE2 0.200385 1 1.5838920 1.8793306 1.035905 1.043789 1.052931 1.075322
SRE3 0.300075 1 0.9129295 1.5360455 1.994643 1.032044 1.006338 1.065477
SRE4 0.400175 1 0.6434000 1.3150935 1.953055 1.017210 1.010774 1.056415
SRE5 0.500138 1 0.6063778 1.2876456 1.945513 1.016549 1.999497 1.055320
SRE6 0.600200 1 0.3710599 0.9537165 1.873083 1.994521 1.934699 1.036992
SRE7 0.699500 1 0.3312944 0.8793348 1.853538 1.991429 1.904618 1.031470
SRE8 0.800285 1 0.2338423 0.6966505 1.783148 1.965750 1.844547 1.016914
SRE9 0.900020 1 0.1665775 0.5328803 1.702427 1.929652 1.798916 1.985060
SRE10 1.000074 1 0.1550065 0.5047066 1.684992 1.923192 1.776541 1.982177
class(newResults)
[1] "data.frame"
Now it's formatted as a data.frame so it will be easier to work with.
Create Lines
library(ggplot2)
ggplot(data = newResults, aes(x = Delta)) +
geom_line(aes(y = UE)) +
geom_line(aes(y = RE)) +
geom_line(aes(y = LS)) +
geom_line(aes(y = PT)) +
geom_line(aes(y = SP)) +
geom_line(aes(y = JS)) +
geom_line(aes(y = JS2)) +
labs(y = "") # Delete or change y axis title if desired.
You can also choose your own color for each line by passing color = "..." to each geom_line() call, outside the aes() function, for example geom_line(aes(y = UE), color = "red").