I am trying to depict the whole-genome sequencing (WGS) data of my parasite using the Circos software.
One of the elements I would like to depict is the regions of the reference genome for which I do not have sequencing data from my parasite.
In order to do this, I have used Samtools to create an mpileup file, from which I have extracted the positions where the sequencing depth = 0. I therefore have a file that looks like this:
$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
This means that there are 3 positions in chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.
Because my files are enormous (up to 3 million lines) and a lot of the unsequenced positions occur in consecutive runs, I would like to create an interval file from the above data. Circos also requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:
$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101
I have searched a bunch, but I have only found questions pertaining to grouping data by pre-defined intervals (e.g. group purchases occurring over a period of 6 months, patients by age etc).
So if anybody can help me out, I will be extremely happy!
Sidsel
Consider using bedtools. Specifically the bedtools merge sub-command:
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
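From this page, it would seem to do what you want, provided you first convert your per-position file into a sorted BED file (one interval per position; BED starts are 0-based and ends are exclusive, so position p becomes start p-1, end p):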
bedtools merge combines overlapping or “book-ended” features in an
interval file into a single feature which spans all of the combined
features.
Moreover, you can use the -d option to specify the maximum distance between features to merge:
-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.
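If you would rather not convert to BED first, the same collapsing can be done directly on the position file. Below is a minimal Python sketch, not a bedtools feature: it assumes the input is whitespace-separated, has no header line, and is already sorted by chromosome and position as in the question. The function name collapse_positions is made up for illustration.

```python
def collapse_positions(lines):
    """Collapse 'chrom pos depth' records into (chrom, start, end) runs.

    Assumption: records are sorted by chromosome, then by position.
    """
    intervals = []
    chrom = start = end = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        cur_chrom, cur_pos = fields[0], int(fields[1])
        if cur_chrom == chrom and cur_pos == end + 1:
            end = cur_pos  # this position extends the current run
        else:
            if chrom is not None:
                intervals.append((chrom, start, end))  # close the previous run
            chrom, start, end = cur_chrom, cur_pos, cur_pos
    if chrom is not None:
        intervals.append((chrom, start, end))  # close the final run
    return intervals


# The example data from the question (header line removed).
example = """chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0""".splitlines()

for chrom, start, end in collapse_positions(example):
    print(chrom, start, end)
```

Because the function only iterates over its input once, an open file handle can be passed in place of the list, so a 3-million-line file does not need to be loaded into memory first.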
To put it simply, I have three columns in Excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle appear as dots in 2D (based on their x and y values), with a different color per vehicle.
In my real file I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e. paste the list into Excel and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data, but it requires the use of VB. Moreover, I am not sure whether it has actually been answered.
You could try this free online tool: www.cloudyexcel.com/excel-to-graph/
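Whatever tool ends up drawing the chart, the step that has to be automated is splitting the rows into one series per vehicle. As a rough illustration outside Excel, here is a minimal Python sketch of that step; the function name series_by_vehicle is hypothetical, and the rows are the sample data from the question.

```python
from collections import defaultdict


def series_by_vehicle(rows):
    """Group (vehicle, x, y) rows into one list of (x, y) points per vehicle."""
    series = defaultdict(list)
    for vehicle, x, y in rows:
        series[vehicle].append((x, y))
    return dict(series)


# Sample rows from the question: (Vehicle, x, y).
rows = [
    (1, 10, 10),
    (1, 15, 12),
    (1, 12, 9),
    (2, 8, 7),
    (2, 11, 6),
    (3, 7, 12),
]

for vehicle, points in series_by_vehicle(rows).items():
    print(vehicle, points)
```

Each resulting list can then be handed to a plotting library as its own scatter series, which gives one color per vehicle regardless of how many customers each vehicle has.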
I have a dataset like this:
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
MQ is a data.frame that shows the production plan for a coming week: Model is the model to be produced and Quantity is how many units of it are planned.
C150A = data.frame( Material=c("A0015", "A0071", "Z00071", "Z00080","Z00090",
"Z00012","SZ0001"), Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"),
Number=c(1,1,1,1,2,4))
DG18= data.frame( Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),
Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),
Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),
Number=c(1,1,1,2,3,4))
These are the models which could be produced.
My first problem is that, based on the production plan MQ, I want to automatically look up the needed models and multiply Quantity by Number, to know how many of each component (Material) are needed.
The output could be a data.frame in which the needed components are combined over all models noted in the production plan (different models can share some components and differ in others, and the amount needed of each component can also differ).
Material_Master= data.frame( Material=c( "A0013", "A001","A0015", "A0071", "A0075",
"A0078", "Z00071", "Z00080", "Z00090", "Z00091",
"Z00012","Z00091","Z00010","Z00012","Z00017","SZ0001",
"SZ0005","SZ0005","SZ0000","SZ0008","SZ0009"),
Number=c(20000,180000,250000,480000,250000,170000,
690000,1800000,17000,45000,12000,5000, 5000,
8000,16000,17000,45000,88000,7500,12000,45000))
In the last step, the created data.frame should be merged with the Material_Master data, which lists all important components together with their stock.
In my example, all components needed for the production are also listed in Material_Master, but it can happen that a component is missing from Material_Master; in that case, just ignore this component.
The output should compare the needed amount of each component with its actual stock and report any component for which the need exceeds the stock.
Thank you for your help.
This should work:
mods <- do.call(rbind,lapply(MQ$Model,function(x)cbind(Model=x,get(x))))
full_plan <- merge(mods,MQ,by="Model")
material_plan <- with(full_plan,aggregate(Quantity*Number,by=list(Material),sum))
# Group.1 x
# 1 A0014 7800
# 2 A0015 5000
# 3 A0071 8800
# 4 A0075 4000
# 5 SZ0000 15200
# 6 SZ0001 20000
# 7 SZ0005 16000
# 8 Z00010 12000
# 9 Z00011 7600
# 10 Z00012 5000
# 11 Z00071 5000
# 12 Z00080 8800
# 13 Z00085 4000
# 14 Z00090 13000
# 15 Z00091 3800
The first line gets each of your models and stacks them, along with the model name. The second line merges back to get the Quantity, and the third aggregates.
I went ahead and made a usable example by trimming off the 1 at the beginning of each Number in your latter models. Also, I read the Model and Material columns in as character instead of factor.
options(stringsAsFactors=FALSE)
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
C150A = data.frame(Material=c("A0015","A0071","Z00071","Z00080","Z00090","Z00012","SZ0001"),Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"), Number=c(1,1,1,1,2,4))
DG18= data.frame(Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),Number=c(1,1,1,2,3,4))
Edit: Added the required stringsAsFactors option, as identified by @RicardoSaporta.
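The stock comparison asked for at the end of the question is not covered above. The logic is the same in any language, so here is a minimal sketch in Python rather than R. The function names material_needs and shortages are made up for illustration, and the stock dictionary contains only those Material_Master entries that appear exactly once in the question, since it is not clear how the duplicated entries should be combined.

```python
def material_needs(plan, boms):
    """Total required amount per material: sum of Quantity * Number over the plan."""
    needs = {}
    for model, quantity in plan.items():
        for material, number in boms[model].items():
            needs[material] = needs.get(material, 0) + quantity * number
    return needs


def shortages(needs, stock):
    """Missing amount per material; materials absent from stock are ignored."""
    return {
        material: needed - stock[material]
        for material, needed in needs.items()
        if material in stock and needed > stock[material]
    }


# Production plan and bills of materials, copied from the question.
plan = {"C150A": 5000, "B174": 3800, "DG18": 4000}
boms = {
    "C150A": {"A0015": 1, "A0071": 1, "Z00071": 1, "Z00080": 1,
              "Z00090": 1, "Z00012": 1, "SZ0001": 4},
    "B174": {"A0014": 1, "A0071": 1, "Z00080": 1, "Z00091": 1,
             "Z00011": 2, "SZ0000": 4},
    "DG18": {"A0014": 1, "A0075": 1, "Z00085": 1, "Z00090": 2,
             "Z00010": 3, "SZ0005": 4},
}

# Assumption: only the unambiguous (non-duplicated) Material_Master entries.
stock = {
    "A0015": 250000, "A0071": 480000, "A0075": 250000,
    "Z00071": 690000, "Z00080": 1800000, "Z00090": 17000,
    "Z00010": 5000, "SZ0001": 17000, "SZ0000": 7500,
}

needs = material_needs(plan, boms)
for material, missing in sorted(shortages(needs, stock).items()):
    print(material, "short by", missing)
```

With this subset of the stock data, the report lists SZ0000, SZ0001 and Z00010 as short. Materials such as A0014 that have no stock entry are skipped, as requested in the question.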
I am new to R and want to analyze miRNA expression from a data set of 3 groups. Can anyone help me out?
In this case I got miRNAs from other species (also present on the Affy chips) among the top expressed genes. Now I want to select only the human miRNAs. Please help me.
Thanks in advance
Summary
I'm not entirely sure what your data frame looks like, given that I haven't worked with Affy chips before. Let me try to summarize what I think you have told us. You have a data frame with a list of all of the microRNAs on the Affy chip, along with their expression data. You want to select a subset of these microRNAs that are unique to humans.
Possible solution 1
You do not state whether or not your data frame contains a variable that identifies whether or not these microRNAs are indeed from humans. If it does have this information, all you would need to do is subset your data based on this identifier. Type help(subset) or help(Extract) for more information on how to do this.
Possible solution 2
If your data frame does not contain such an identifier, you will first need to make a list of all known human microRNAs. You could retrieve these manually from the online miRBase website (and then import them into R), or you could download them from Ensembl using the R package biomaRt. To do the latter, after loading biomaRt, you might type this command:
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
The above code requests that R download the mirbase identifier, gene ID, start position, and chromosome name for all microRNAs in the miRBase catalog. (Note that you would have to specify the human Ensembl mart in an earlier command, which I have not shown).
Once you have downloaded this information, you could use a merge command or perhaps a which command to pull the appropriate microRNAs from your Affy chip data.
Recommendations
This all might sound a bit complicated. If you haven't already, I recommend that you spend some time working through exercises on biomaRt and Bioconductor. Information about these packages, and how to install them, is available at the links below:
Bioconductor, http://www.bioconductor.org/install/
Database mining with biomaRt, http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf
You might consider asking for this question to be migrated to Biostar. I think you would get better responses there. Also, consider editing your question to provide more information about your data. Good luck.
Edit to my original answer
In reference to your comment made at 2012-02-26 22:08:02, try the following:
## Load biomaRt package
library(biomaRt)
## Specify which "mart" (i.e., source of genetic data) that you want to use
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
## You can then ask the system what attributes are available for download
listAttributes(ensembl)
name description
58 mirbase_accession miRBase Accession(s)
59 mirbase_id miRBase ID(s)
60 mirbase_gene_name miRBase gene name
61 mirbase_transcript_name miRBase transcript
Above I have pasted part of the output from the listAttributes() command, which shows the relevant miRBase options. Now you can try the following code:
## Download microRNA data
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
## Check how much we downloaded
> dim(miRNA)
[1] 715 4
## Peek at the head of our data
> head(miRNA)
mirbase_id ensembl_gene_id start_position chromosome_name
1 hsa-mir-320c-1 ENSG00000221493 19263471 18
2 hsa-mir-133a-1 ENSG00000207786 19405659 18
3 hsa-mir-1-2 ENSG00000207694 19408965 18
4 hsa-mir-320c-2 ENSG00000212051 21901650 18
5 hsa-mir-187 ENSG00000207797 33484781 18
6 hsa-mir-1539 ENSG00000222690 47013743 18
## Check which chromosomes are contributing to our data
> table(miRNA$chromosome_name)
1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X
50 27 26 25 15 59 26 15 35 7 85 23 32 5 16 31 23 30 17 33 27 28 80
Now your challenge will be to use this downloaded data to parse your original Affy data frame. Again, read the help files for the merge, Extract, and which functions to give it a try yourself first.