Highlighting regions in ggplot2 barplot fulfilling a condition - r

I want to plot a horizontal barplot using ggplot2 and highlight regions satisfying a particular criterion.
In this case, if any "Term" for point "E15.5-E18.5_up_down" has more than twice the number of samples of point "P22-P29_up_down", or vice versa, highlight that label or region.
I have a data frame in the following format:
CLID CLSZ GOID NodeSize SampleMatch Phyper Padj Term Ont SampleKeys
E15.5-E18.5_up_down 1364 GO:0007568 289 20 0.141830716154421 1 aging BP ENSMUSG00000049932 ENSMUSG00000046352 ENSMUSG00000078249 ENSMUSG00000039428 ENSMUSG00000014030 ENSMUSG00000039323 ENSMUSG00000026185 ENSMUSG00000027513 ENSMUSG00000023224 ENSMUSG00000037411 ENSMUSG00000020429 ENSMUSG00000020897 ENSMUSG00000025486 ENSMUSG00000021477 ENSMUSG00000019987 ENSMUSG00000023067 ENSMUSG00000031980 ENSMUSG00000023070 ENSMUSG00000025747 ENSMUSG00000079017
E15.5-E18.5_up_down 1364 GO:0006397 416 3 0.999999969537913 1 mRNA processing BP ENSMUSG00000027510 ENSMUSG00000021210 ENSMUSG00000027951
P22-P29_up_down 476 GO:0007568 289 11 0.0333771791166823 1 aging BP ENSMUSG00000049932 ENSMUSG00000037664 ENSMUSG00000026879 ENSMUSG00000026185 ENSMUSG00000026043 ENSMUSG00000060600 ENSMUSG00000022508 ENSMUSG00000020897 ENSMUSG00000028702 ENSMUSG00000030562 ENSMUSG00000021670
P22-P29_up_down 476 GO:0006397 416 2 0.998137879564768 1 mRNA processing BP ENSMUSG00000024007 ENSMUSG00000039878
reduced to (only those terms which are necessary for plotting):
CLID SampleMatch Term
E15.5-E18.5_up_down 20 aging
P22-P29_up_down 2 mRNA processing
E15.5-E18.5_up_down 3 mRNA processing
P22-P29_up_down 11 aging
I would prefer a general approach that works with any condition, not just the one I need for this scenario. One way I imagined is to use sapply for each pair of CLID/Term and create another column that stores, as a boolean, whether the criterion is fulfilled, but I still cannot find a way to highlight the values. What would be the most efficient way to achieve this?
Pseudo-code for my approach:
for (i in CLID) {
  for (k in CLID) {
    if (Term[i] == Term[k]) {
      # check if the SampleMatch count for any CLID/Term pair is significantly
      # higher than that of the corresponding CLID/Term pair
      condition = check(Term[i], Term[k])
      if (condition == TRUE) {
        highlight(term)
      }
    }
  }
}
In the end I want something like this (highlighting the label or column):
or like this: Highlight data individually with facet_grid in R.
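One way to sketch the "extra boolean column" idea from the question: compute the flag once per Term, then map it to an aesthetic. The helper name flag_terms, its ratio parameter, and the choice of alpha for highlighting are assumptions made for illustration, not part of the original post.

```r
# Reduced example data from the question
df <- data.frame(
  CLID = c("E15.5-E18.5_up_down", "P22-P29_up_down",
           "E15.5-E18.5_up_down", "P22-P29_up_down"),
  SampleMatch = c(20, 2, 3, 11),
  Term = c("aging", "mRNA processing", "mRNA processing", "aging"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a Term is flagged when its largest count exceeds
# `ratio` times its smallest count. Replace the inner function with any
# other criterion on the per-Term counts.
flag_terms <- function(d, ratio = 2) {
  as.logical(ave(d$SampleMatch, d$Term,
                 FUN = function(counts) max(counts) > ratio * min(counts)))
}
df$highlight <- flag_terms(df)

# Flagged bars are drawn opaque, the rest faded (assumes ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = Term, y = SampleMatch, fill = CLID, alpha = highlight)) +
    geom_col(position = "dodge") +
    scale_alpha_manual(values = c(`FALSE` = 0.4, `TRUE` = 1)) +
    coord_flip()
  # print(p) draws the plot
}
```

With the four rows shown, neither term passes the factor-of-two threshold (20 vs 11 and 3 vs 2), so every bar stays faded; calling flag_terms(df, ratio = 1.5) flags only "aging".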

Related

Calculating a ratio in a ggplot2 graph while retaining faceting variables

So I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio: the number of widgets produced per worker on a given day or period. I've got my data structured in a single data frame, with each widget produced each day by each worker in its own record, and other workers that worked that day but didn't produce a widget also in their own record, along with various metadata.
Something like this:
widget_ind employee_active_ind employee_id day product_type employee_bu
1 1 123 6/1/2021 pc americas
0 1 234 6/1/2021 mac emea
0 1 345 6/1/2021 mac apac
1 1 444 6/1/2021 mac americas
1 1 333 6/1/2021 pc emea
0 1 356 6/1/2021 pc americas
I'm trying to find the ratio of widget_inds to employee_active_inds, over time, while retaining the metadata, so that I can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas', ], aes(y = (widget_ind / employee_active_ind), x = day)) +
  geom_bar(stat = 'identity', position = 'stack') +
  facet_wrap(product_type ~ ., scales = 'fixed') # change these to look at different cuts of metadata
print(plot)
Retaining the metadata is appealing rather than making individual dataframes summarizing by the various combinations, but the results with no faceting aren't even correct (e.g. the ggplot is showing a barchart with a height of ~18 widgets per person; creating a summarized dataframe with no faceting is showing a ratio of less than 1 widget per person).
I'm currently getting this warning when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
Which doesn't make sense since in my data frame both widget_ind and employee_active_ind have no NA values, so calculating the ratio of the two should always work?
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, where you may not do any work, so you wouldn't be counted as active on that day). I think I need to rethink my data structure. Even so, I'm assuming here that ggplot2 is acting like it would for any bar chart: it takes the number in each widget_ind record for a given day (along with any facets and filters), sums that set, and displays the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have someone out on a given day, you'd never have everyone out. But that isn't what ggplot is doing, is it?
I agree with MrFlick, especially on the question concerning employee_active_ind values of 0. If you have them, the row-wise division produces Inf (or NaN for 0/0) instead of a finite ratio, which would explain the rows ggplot2 reports as removed.
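A minimal sketch of the alternative the edit hints at: sum both indicators per group first and divide once, instead of dividing row by row and letting the stacked bars add up per-row ratios. It assumes one row per active employee per day; if an employee can appear in several rows (one per widget), the denominator would need a count of distinct employee_id instead. The name totals is made up for this example.

```r
# Rows from the question's example table
df <- data.frame(
  widget_ind          = c(1, 0, 0, 1, 1, 0),
  employee_active_ind = c(1, 1, 1, 1, 1, 1),
  employee_id         = c(123, 234, 345, 444, 333, 356),
  day                 = "6/1/2021",
  product_type        = c("pc", "mac", "mac", "mac", "pc", "pc"),
  employee_bu         = c("americas", "emea", "apac", "americas", "emea", "americas"),
  stringsAsFactors = FALSE
)

# Row-wise division is not finite wherever employee_active_ind is 0
non_finite <- c(1 / 0, 0 / 0)   # Inf and NaN

# Sum widgets and active employees per group, keeping the metadata columns
# that will be used for filtering or faceting
totals <- aggregate(
  cbind(widget_ind, employee_active_ind) ~ day + product_type + employee_bu,
  data = df, FUN = sum
)

# One ratio per group: widgets produced per active employee
totals$ratio <- totals$widget_ind / totals$employee_active_ind

# The summary can then be plotted and faceted as before, for example:
# ggplot(totals[totals$employee_bu == "americas", ], aes(x = day, y = ratio)) +
#   geom_col() + facet_wrap(~ product_type)
```

The grouping columns in the formula decide which cuts remain available for faceting, so adding or dropping a metadata column there changes the level at which the ratio is computed.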

number of items to replace is not a multiple of replacement length in weighting

I am using multinomial regression to get the probability of belonging to four sub-groups for 500,000 regions.
The data.frame looks like this:
Regions groupadmit mid_pop
1 2 1764
2 3 1254
25 1 1452
674 4 2665
3001 2 1097
56 3 9864
98 1 2675
500,000 .... .....
I wrote the following code:
library(nnet)
mlogit <- multinom(groupadmit ~ mid_pop, data = admissionLSOA1)
probs <- predict(mlogit, type = "probs")
The code works fine up to this point, giving the probability of belonging to each group (1, 2, 3, 4) for each observation (region).
Probabilities:
Regions groupadmit1 groupadmit2 groupadmit3 groupadmit4
52 0.2484091 0.2494408 0.2505393 0.2516109
97 0.2483949 0.2494358 0.2505441 0.2516252
1300 0.2483253 0.2494112 0.2505676 0.251695
287 0.2483623 0.2494242 0.2505551 0.2516584
500,000 .... ..... .... ....
But when I go to weight the sample (regions) according to their probability, I get the following warning:
Warning message:
In wts[groupadmit == 1] <- probs[groupadmit == 1, 1]/probs[groupadmit == :
number of items to replace is not a multiple of replacement length
What I am doing is weighting the regions by the probability of belonging to groupadmit one relative to the probability of belonging to their own groupadmit, in order to balance out any selection bias. It is very similar to inverse probability weighting. The code is:
wts[groupadmit==1] <- probs[groupadmit==1,1]/probs[groupadmit==1,1]
wts[groupadmit==2] <- probs[groupadmit==2,1]/probs[groupadmit==2,2]
wts[groupadmit==3] <- probs[groupadmit==3,1]/probs[groupadmit==3,3]
wts[groupadmit==4] <- probs[groupadmit==4,1]/probs[groupadmit==4,4]
But the above warning comes up whenever I do the analysis.
Could someone please help me understand why I get this warning and how I can solve it?
Many thanks in advance
Why does R complain?
Warning message:
In wts[groupadmit == 1] <- probs[groupadmit == 1, 1]/probs[groupadmit == :
number of items to replace is not a multiple of replacement length
It means that the right-hand side of your assignment (<-) does not match the length of what you have on the left-hand side, which is wts[groupadmit==1].
Therefore, I suggest you run:
length(probs[groupadmit==1,1]/probs[groupadmit==1,1])
and then
length(wts[groupadmit==1])
I suppose this will show that the left-hand side is smaller.
Then simply run
wts[groupadmit==1] <- probs[groupadmit==1,1]/probs[groupadmit==1,1]
and finally print
wts[groupadmit==1]
Solution:
A quick fix is to use rbind to build your wts:
wts <- rbind(probs[groupadmit==1,1]/probs[groupadmit==1,1],
             probs[groupadmit==2,1]/probs[groupadmit==2,2],
             probs[groupadmit==3,1]/probs[groupadmit==3,3],
             probs[groupadmit==4,1]/probs[groupadmit==4,4])
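An alternative sketch that keeps exactly one weight per region: look up each row's own-group probability with matrix indexing and divide once. The small probs matrix and groupadmit vector below are hypothetical stand-ins for the question's objects. One possible cause of the warning is that groupadmit (or wts) in the workspace does not have the same length as nrow(probs), for example if rows were dropped when fitting the model, so the sketch checks that first.

```r
# Hypothetical stand-ins: 'probs' as returned by predict(mlogit, type = "probs"),
# 'groupadmit' as the observed group (1 to 4) of each region
probs <- matrix(c(0.40, 0.30, 0.20, 0.10,
                  0.25, 0.25, 0.25, 0.25,
                  0.10, 0.20, 0.30, 0.40),
                ncol = 4, byrow = TRUE)
groupadmit <- c(2, 1, 4)

# Both objects must describe the same regions, in the same order
stopifnot(length(groupadmit) == nrow(probs))

# For every row, pick the probability of the group that row belongs to
# (use as.integer(groupadmit) first if groupadmit is a factor)
own_prob <- probs[cbind(seq_len(nrow(probs)), groupadmit)]

# Weight = P(group 1) / P(own group); regions in group 1 get weight 1
wts <- probs[, 1] / own_prob
```

Because wts is built in one step from full-length vectors, no element-wise replacement happens and the length mismatch cannot arise silently.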

How to create an interval file defined by values from another file - for circos imaging of WGS data

I am trying to depict my whole-genome sequence (WGS) data of my parasite, using the circos software.
One of the elements I would like to depict is the areas of the reference genome for which I do not have sequencing data from my parasite.
In order to do this, I have used Samtools to create an mpileup file, from which I have extracted the positions where the sequence depth = 0. I therefore have a file that looks like this:
$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
This means that there are 3 positions in chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.
Because my files are enormous (up to 3 million lines) and a lot of the unsequenced positions come in intervals, I would like to create an interval file from the above data. Also, circos requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:
$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101
I have searched a bunch, but I have only found questions pertaining to grouping data by pre-defined intervals (e.g. group purchases occurring over a period of 6 months, patients by age etc).
So if anybody can help me out, I will be extremely happy!
Sidsel
Consider using bedtools. Specifically the bedtools merge sub-command:
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
From this page, it would seem to do what you want:
bedtools merge combines overlapping or “book-ended” features in an
interval file into a single feature which spans all of the combined
features.
Moreover, you can use the -d option to specify the maximum distance between features to be merged:
-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features
are merged.
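The grouping described in the question can also be sketched in base R: a new interval starts whenever the chromosome changes or the position is not the previous position plus one. This is only an illustration on the example rows; the column names chrom and pos and the output file name are assumptions, and the rows are assumed to be sorted by chromosome and then position.

```r
# Example rows from the question (positions with depth = 0)
depth0 <- data.frame(
  chrom = c("chr_1", "chr_1", "chr_1", "chr_2", "chr_2", "chr_2", "chr_2", "chr_2"),
  pos   = c(1, 2, 3, 67, 68, 1099, 1100, 1101),
  stringsAsFactors = FALSE
)

n <- nrow(depth0)

# TRUE where a row opens a new interval: first row, chromosome change,
# or a gap between consecutive positions
new_run <- c(TRUE,
             depth0$chrom[-1] != depth0$chrom[-n] | diff(depth0$pos) != 1)

# TRUE where a row closes an interval: the next row opens one, or it is the last row
is_end <- c(new_run[-1], TRUE)

intervals <- data.frame(
  chrom = depth0$chrom[new_run],
  start = depth0$pos[new_run],
  end   = depth0$pos[is_end],
  stringsAsFactors = FALSE
)

# Writing the result as a tab-separated file (hypothetical file name):
# write.table(intervals, "no_coverage_intervals.txt", sep = "\t",
#             quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Everything is vectorised, so a few million rows should not need an explicit loop.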

cummeRbund csHeatmap column user-defined order

I am using the R package cummeRbund (from Bioconductor) to visualize RNA-seq data. I created a CuffGeneSet instance called "DEG_genes" that contains 662 genes that are significantly differentially expressed between males and females. My goal is to create a heatmap using csHeatmap() in which the male and female samples (replicates) are separated but with a specific user-defined order within the sex category.
I used:
> DEG<-diffData(genes(cuff)) # take differentially expressed genes
> DEG_significant<-subset(DEG,significant=='yes') # retain only significant changes
> DEG_sign_IDs <- DEG_significant$gene_id # retrieve IDs
> DEG_genes<-getGenes(cuff,DEG_sign_IDs) # get CuffGeneSet instance
> hmap<-csHeatmap(DEG_genes,clustering='none',labRow=F,replicates=T)
This gives me ALMOST what I want: the heatmap shows Females on the left and Males on the right, but they are alphabetically ordered (Female_0,Female_1,Female_10,Female_11,Female_12...Female_19,Female_2,Female_20,Female_21..,Female_29 on the left and similarly for males Male_0,Male_1,Male_10...Male_19,Male_2,Male_20...etc on the right) and I want them to be in a specific order (clusterReps). I created a test vector with replicate names in a specific order (Males on the left with 0 and 6 exchanged, and Females on the right) as follows:
clusterReps<-c("Male_6","Male_1","Male_2","Male_3","Male_4","Male_5","Male_0","Male_7","Male_8","Male_9","Male_10","Male_11","Male_12","Male_13","Male_14","Male_15","Male_16","Male_17","Male_18","Male_19","Male_20","Male_21","Male_22","Male_23","Male_24","Male_25","Male_26","Male_27","Male_28","Male_29","Male_30","Male_31","Male_32","Male_33","Female_0","Female_1","Female_2","Female_3","Female_4","Female_5","Female_6","Female_7","Female_8","Female_9","Female_10","Female_11","Female_12","Female_13","Female_14","Female_15","Female_16","Female_17","Female_18","Female_19","Female_20","Female_21","Female_22","Female_23","Female_24","Female_25","Female_26","Female_27","Female_28")
I would like the data to be exactly the same except for the order of the columns, which must follow the order of the "clusterReps" vector. Knowing that the heatmap is a ggplot, I have looked everywhere for a solution over the last 2 days but with no success (despite a closely resembling problem with heatmap.2() instead of csHeatmap() on stackoverflow; I tried to get a replicate fpkm matrix and use heatmap.2 but could only use heatmap_2, and some options were not accepted).
Using:
> hmap<-hmap+scale_x_discrete(limits=clusterReps)
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
only changes the x-axis labels but not the actual data (the heatmap remains identical).
Is there a similar function that rearranges the columns and not just labels?
Thanks in advance for your help, I'm not familiar with handling ggplot objects, and in particular heatmaps from cummeRbund.
EDIT:
Here is what I can give as further information:
> DEG_genes
CuffGeneSet instance for 662 genes
Slots:
annotation
fpkm
repFpkm
diff
count
isoforms CuffFeatureSet instance of size 930
TSS CuffFeatureSet instance of size 785
CDS CuffFeatureSet instance of size 230
promoters CuffFeatureSet instance of size 662
splicing CuffFeatureSet instance of size 785
relCDS CuffFeatureSet instance of size 662
> summary(DEG_genes)
Length Class Mode
662 CuffGeneSet S4
I am afraid I can't give more information for the moment, please let me know if you want me to execute a command and report the output if it can help.
I am not very fluent in R, but I was having the same problem. To solve it I made a script that renames all my sample names in all the files inside the cuffdiff folder to something that will give the right order when sorted alphabetically, and then rebuild the database.

Choose higher values from two columns after extracting the number, R

I have a data frame (451 obs of 8 variables) that has two columns (6&7) that look like this:
Major Minor
C:726 T:2
A:687 G:41
T:3 C:725
I want to create one column that summarises this. To do this, I don't care about the letters in each cell, but I want the larger number to remain, whatever row it's in. i.e. I want it to look like this:
Summary_column
726
687
725
Not necessary, but for those who wonder what I'm doing: this is the output from a programme called VCFtools. It has a count function that counts alleles in a VCF, but sometimes it names the allele as "Minor" when it is clearly more common.
Thanks for your help!
I would do something like this :
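extract <- function(v) {
  # drop everything up to the colon and convert to a number,
  # so that pmax() compares values rather than strings
  as.numeric(gsub("^.*:", "", v))
}
# d is the data frame holding the Major and Minor columns
within(d, Summary_column <- pmax(extract(Major), extract(Minor)))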
Which gives :
Major Minor Summary_column
1 C:726 T:2 726
2 A:687 G:41 687
3 T:3 C:725 725
