How to create an interval file defined by values from another file - for circos imaging of WGS data - unix

I am trying to depict the whole-genome sequencing (WGS) data of my parasite using the Circos software.
One of the elements I would like to depict is the areas of the reference genome for which I do not have sequencing data from my parasite.
In order to do this, I have used Samtools to create an mpileup file, from which I have extracted the positions where the sequence depth = 0. I therefore have a file that looks like this:
$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
This means that there are 3 positions in chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.
Because my files are enormous (up to 3 million lines) and a lot of the unsequenced positions come in consecutive runs, I would like to create an interval file from the above data. Circos also requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:
$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101
I have searched a bunch, but I have only found questions pertaining to grouping data by pre-defined intervals (e.g. group purchases occurring over a period of 6 months, patients by age etc).
So if anybody can help me out, I will be extremely happy!
Sidsel

Consider using bedtools. Specifically the bedtools merge sub-command:
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
From this page, it would seem to do what you want:
bedtools merge combines overlapping or “book-ended” features in an
interval file into a single feature which spans all of the combined
features.
Moreover, you can use the -d option to specify the maximum distance between features to merge:
-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features
are merged.
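If bedtools is not at hand, the same collapsing can be sketched in plain R. This is only an illustration of the idea, not what bedtools does internally: the column names chrom, pos and depth are assumptions, and the data frame simply repeats the example rows from the question.

```r
# Hypothetical input: the zero-depth positions from the question,
# with assumed column names chrom, pos and depth.
zero_depth <- data.frame(
  chrom = c("chr_1", "chr_1", "chr_1", "chr_2", "chr_2", "chr_2", "chr_2", "chr_2"),
  pos   = c(1, 2, 3, 67, 68, 1099, 1100, 1101),
  depth = 0
)

# Sort so that consecutive positions sit next to each other.
zero_depth <- zero_depth[order(zero_depth$chrom, zero_depth$pos), ]

# A new run starts on the first row, when the chromosome changes,
# or when the position is not exactly one more than the previous one.
n <- nrow(zero_depth)
new_run <- c(TRUE,
             zero_depth$chrom[-1] != zero_depth$chrom[-n] |
               diff(zero_depth$pos) != 1)

# A run ends on the row just before the next run starts, or on the last row.
run_end <- c(new_run[-1], TRUE)

intervals <- data.frame(
  chrom = zero_depth$chrom[new_run],
  start = zero_depth$pos[new_run],
  end   = zero_depth$pos[run_end]
)
print(intervals)
```

The result could then be written out with write.table(intervals, quote = FALSE, row.names = FALSE, col.names = FALSE) to get the three-column layout shown in the question.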


How to exclude zeroes for ggplot2 geom_line function in R in Power Bi

I have an R visual in Power BI. I use this visual to show a scatter plot with points (geom_point) and lines (geom_line).
The data (Matlab_min_head - Matlab_max_head - Matlab) used for the lines has a lot of unused zeroes in it, which I would like to omit because they make the line dip.
How can I exclude the zeroes?
I'm not an expert in R, but I understand that in Power BI the data has to be adjusted while the graph is being built. I cannot filter my data beforehand and then plot the result.
The slightly simplified code (without all the formatting) is as follows:
Plot1<-ggplot(dataset, aes(x=Capacity_recalculated, y=Head_recalculated))
Plot1<-Plot1+geom_point(aes(colour=Head_recalculated))
#these are the 3 black lines:
Plot1<-Plot1+geom_line(aes(y=Matlab_min_head, x=Matlab_min_Q))
Plot1<-Plot1+geom_line(aes(y=Matlab_max_head,x=Matlab_max_Q))
Plot1<-Plot1+geom_line(aes(y=Matlab_head, x=Matlab_Q))
Plot1
Update:
So I educated myself on how to use subset after Ismail's answer.
I think I know where the problem lies:
My data looks like this:
Project: Head_recalculated: Matlab_min_head: Matlab_head:
1 10 0 0
1 20 0 0
...
Matlab 60 0 60
Matlab 70 0 70
......
Matlab_min_head 50 50 0
Matlab_min_head 60 60 0
......
So if I filter using:
Plot1<-ggplot(subset(dataset, Matlab_head>0))
or
Plot1<-ggplot(subset(dataset, Matlab_head !=0))
I essentially remove the other Matlab_min_head column and the other Project data from the dataset as it is 0.
Would there be an option to only let the subset remove values in the Matlab_head column (and not the complete dataset)?
You could simply remove the zeroes with subset(dataset, Head_recalculated != 0) in your first ggplot call:
Plot1 <- ggplot(subset(dataset,Head_recalculated != 0), aes(x=Capacity_recalculated, y=Head_recalculated))
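For the case described in the update, where only one line should ignore its zeroes, one option is to filter per layer, since each ggplot2 geom accepts its own data argument. The sketch below assumes a small made-up stand-in for dataset; the Capacity_recalculated and Matlab_Q values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the Power BI `dataset`, shaped like the sample
# in the question (Capacity_recalculated and Matlab_Q values are made up).
dataset <- data.frame(
  Project               = c("1", "1", "Matlab", "Matlab"),
  Capacity_recalculated = c(5, 8, 0, 0),
  Head_recalculated     = c(10, 20, 60, 70),
  Matlab_Q              = c(0, 0, 4, 9),
  Matlab_head           = c(0, 0, 60, 70)
)

# Keep every row for the points ...
Plot1 <- ggplot(dataset, aes(x = Capacity_recalculated, y = Head_recalculated)) +
  geom_point(aes(colour = Head_recalculated))

# ... but give the line layer its own filtered copy of the data.
matlab_rows <- subset(dataset, Matlab_head != 0)
Plot1 <- Plot1 + geom_line(data = matlab_rows, aes(x = Matlab_Q, y = Matlab_head))
Plot1
```

The same pattern would be repeated for Matlab_min_head and Matlab_max_head, each with its own subset, so that filtering one column does not drop rows needed by another layer.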

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set in which each row is a segment, for example
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
1          | 2           | 3           | 100         | 3           | 0.1
2          | 0           | 6           | 150         | 5           | 0.3
3          | 0           | 3           | 200         | 6           | 0.56
4          | 1           | 4           | 103         | 4           | 0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense, for example in the above case, the 4 segments into 2 big segments, based on the 4 attributes and on the target variable. I am currently dealing with 15k segments and want only 10 segments, with each final segment based on the target and also having a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID in SPSS (if not using autogrow) will generally split the data in a 70:30 ratio, building the tree on 70% of the data and testing on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a few big segments as explained before. My question is whether I can use CART (rpart in R) for the same. There is an explicit option 'subset' in the rpart function in R, but I am not sure whether leaving it out will ensure that CART uses 100% of my data. I am relatively new to R, hence this very basic question.
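As a rough check of the subset behaviour, the sketch below fits rpart on the four example rows without passing subset and then looks at how many rows the root node saw. The control values are assumptions picked only so that this tiny example is allowed to split; they are not tuned for 15k segments.

```r
library(rpart)

# The four example segments from the question.
segments <- data.frame(
  Segment_ID  = 1:4,
  Attribute_1 = c(2, 0, 0, 1),
  Attribute_2 = c(3, 6, 3, 4),
  Attribute_3 = c(100, 150, 200, 103),
  Attribute_4 = c(3, 5, 6, 4),
  Target      = c(0.1, 0.3, 0.56, 0.23)
)

# No `subset` argument is given, so all rows of `segments` go into the fit.
# The control values are assumptions that let this tiny example split at
# most once; they are not recommendations for real data.
fit <- rpart(
  Target ~ Attribute_1 + Attribute_2 + Attribute_3 + Attribute_4,
  data    = segments,
  method  = "anova",
  control = rpart.control(minsplit = 2, minbucket = 1, cp = 0,
                          maxdepth = 1, xval = 0)
)

# The first row of fit$frame is the root node; n is its observation count.
print(fit$frame$n[1])

# fit$where gives, for each input row, the leaf it ended up in,
# which can serve as the "big segment" label.
segments$big_segment <- fit$where
print(segments[, c("Segment_ID", "Target", "big_segment")])
```

On real data, the number of leaves would be steered through controls such as cp, minbucket and maxdepth, or by pruning a larger tree, until roughly the desired number of big segments remains.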

for loops (R) within data frame; trying to get top 5 locations with corresponding number of people

first off:
Suppose I have a dataset that has variables like to_location_id, from_location_id, gender, and age. Now, if I want to know the overall top 5 locations people like to visit, I do this:
#most popular 5 locations to go to
top<-as.data.frame(sort(table(mydata$to_location_id), decreasing = TRUE)[1:5])
> top
sort(table(mydata$to_location_id), decreasing = TRUE)[1:5]
3 18544
9 18395
76 15457
5 14342
1 13898
*This gives the 5 most popular locations to go to overall in the dataset:
locations 3, 9, 76, 5, and 1.
**Similarly, I can also get the 5 most popular locations to come from overall.
Now suppose that there are 100 unique location ids (in both the from and to location ids). For each location, I want to know the top 5 most popular to locations and the top 5 most popular from locations. I know I need a loop, but I'm not sure how to do it. I have tried this (no luck):
for(i in unique(mydata$to_location_id)){
as.data.frame(sort(table(mydata$to_location_id),decreasing = TRUE)[1:5])
}
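The loop above never uses i and does not store its result, so every pass recomputes the overall table and discards it. One possible way to get a per-location ranking is to split the ids by location and rank each group; the sketch below uses a made-up miniature of mydata and a hypothetical helper named top_n.

```r
# Hypothetical miniature of `mydata`; only the two id columns matter here.
mydata <- data.frame(
  from_location_id = c(1, 1, 1, 1, 2, 2, 2),
  to_location_id   = c(3, 3, 9, 5, 1, 1, 3)
)

# Count the ids within one group and keep the n most frequent.
top_n <- function(ids, n = 5) {
  head(sort(table(ids), decreasing = TRUE), n)
}

# split() groups the destination ids by origin; lapply() ranks each group.
top_to_by_from <- lapply(split(mydata$to_location_id, mydata$from_location_id), top_n)

# The same idea the other way round: top origins for each destination.
top_from_by_to <- lapply(split(mydata$from_location_id, mydata$to_location_id), top_n)

# Most popular destinations for people leaving location 1.
print(top_to_by_from[["1"]])
```

Each result is a named list keyed by location id, so top_to_by_from[["3"]] would hold the ranking for location 3 in the real data.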

igraph does not show the right network I imported

I would like to run some social network analysis (SNA). I work with RStudio and the igraph package.
My input data is from a text file (created from Excel as a tab-separated text file).
The data file has 3 columns. The 1st and 2nd columns are network data (vertices) and the 3rd column is the weight for each edge. I use airport connections data that looks like this:
1 54 28382 (Airport ID Origin Airport / Airport ID Destination Airport / Passenger number as a weight)
I loaded it with these commands:
USAN_num1 <- read.table('USAN_num.txt', header=T)
USAN_g_num1 <- graph.data.frame(USAN_num1)
> summary(USAN_g_num1)
Vertices: 626
Edges: 7078
Directed: TRUE
No graph attributes.
Vertex attributes: name.
Edge attributes: PAX.
Data looks like this:
ORIGN DESTN PAX
1 1 604 646
2 2 42 3736
3 2 118 5189
Now to the problem that occurred:
My network consists of 6 different clusters when I check it with igraph. Even when I create a graphical picture of my network, it has 6 separate parts. That makes no sense, since my data should be connected into one network. I checked through my dataset and there really are no separate sub-networks.
Here is the cluster characteristics I get:
$csize
[1] 5 608 2 4 5 2
$no
[1] 6
One vertex in a small cluster is even a huge airport that should be connected to many others and not just 1 other...
UPDATE:
I now updated to the newest igraph version but it still does not work.
I uploaded an exemplary part of my data as a .txt file here: USAN_numS.txt
Would be great if someone has an idea on what I did wrong.
Thank you
So, as I said above in my comment, a possible source of confusion is that your graph has symbolic vertex names that are actually numbers and don't match igraph's vertex ids. The workaround is to drop the vertex names, or to specify them explicitly when creating the graph, so that they match the igraph vertex ids.
But your graph really does have multiple components. See the following code, where I check in the original table that two vertices appear exactly once, and they form a component of two by themselves.
Maybe the network really has multiple components, or there are mistakes in the file.
library(igraph)
USAN_num1 <- read.table('USAN_numS.txt', header=T)
USAN_g_num1 <- graph.data.frame(USAN_num1,
vertices=data.frame(id=1:max(USAN_num1[,1:2])))
clu <- clusters(USAN_g_num1)
clu$csize
## [1] 5 607 2 4 5 1 2 1
## The '1's appear because we counted the vertices that are
## not in the table
## Third component has two vertices only, let's check them in the
## original table
which(clu$membership == 3)
## [1] 64 617
## List the table rows where any of these two appear
USAN_num1[ USAN_num1[,1] %in% c(64, 617) | USAN_num1[,2] %in% c(64, 617), ]
## ORIGN DESTN PAX
## 691 64 617 636

Lines between certain points in a plot, based on the data? (with R)

I have done my research and googling but have yet to find a solution to the following problem. I have quite often found solutions to R-related issues on this forum, so I thought I'd give it a try and hope that somebody can suggest something. I need it for my PhD thesis; anybody whose code or suggestions I use will naturally be acknowledged and credited.
So: I need to draw lines/segments to connect points in a plot (of multidimensional scaling, specifically) in R (SPSS-based solutions are welcome as well) - but not between all points, just those that represent properties/variables that at least one data item shares - the placement of the lines should be based on the same data that the plot itself is based on. Let me exemplify; below are some fictional data with dummy variables, where '1' means that the item has the property:
"properties"
a b c
"items" ---------
tree | 1 1 0
house | 0 1 1
hut | 0 1 1
book | 1 0 0
The plot is a multidimensional scaling plot (distances are to be interpreted as dissimilarities). This is the logic:
there's a line between A and B, because there is at least one item/variable ("tree") in the data that has both properties;
there is a line between B and C, because there is at least one item in the data ("house" and "hut") that has both properties;
there is an item ("book") that has only one property (A), so it does not affect the placement of the lines
importantly, there is no line between A and C because there are no items in the data that have both properties.
What I am looking for is a way to add the grey lines automatically/computationally that I have for now drawn manually on the plot above. The automatic drawing should be based on the data as described above. With a small data set, drawing the lines manually is no problem, but becomes a problem when there are tens of such "properties" and hundreds of items/rows of data.
Any ideas? Some R code (commented if possible) would be most welcome!
EDIT: It seems I forgot something very important. First, the solution proposed by @GaborCsardi below works perfectly with the example data, thanks for that! But I forgot to mention that the linking of the points should also be "conservative", with as few connecting lines as possible. For example, if there is an item that has all the "properties", it should not create lines between every pair of property points in the plot just because of that, if the points are already connected by other items, even indirectly. So a plot based on the following data should not be a full triangle, even though item1 has all three properties:
A B C
item1 1 1 1
item2 1 1 0
item3 0 1 1
Instead, A,B and B,C should be connected by a line, but a line between A and C would be excessive, as they are already indirectly connected (through B). Could this be done with incidence graphs?
This is very easy if you use graphs, and create the projection of the bipartite graph that you have in your table. E.g.
library(igraph)
## Some example data
mat <- " properties
items a b c
tree 1 1 0
house 0 1 1
hut 0 1 1
book 1 0 0
"
tab <- read.table(textConnection(mat), skip=1,
header=TRUE, row.names=1)
## Create a bipartite graph
graph <- graph.incidence(as.matrix(tab))
## Project the bipartite graph
proj <- bipartite.projection(graph)
## Plot one of the projections, the one you need
## happens to be the second one
plot(proj$proj2)
## Minimum spanning tree of the projection
plot(minimum.spanning.tree(proj$proj2))
For more information see the manual pages, i.e. ?"igraph-package", ?graph.incidence, ?bipartite.projection and ?plot.igraph.
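To see what the projection amounts to, the shared-property counts can also be worked out in base R with a matrix cross-product. This is only a sketch of the underlying arithmetic on the same example table, not a replacement for the igraph calls above; the names shared, pairs and edges are chosen here for illustration.

```r
# Same example table as above, built directly as a 0/1 matrix.
tab <- matrix(
  c(1, 1, 0,
    0, 1, 1,
    0, 1, 1,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("tree", "house", "hut", "book"), c("a", "b", "c"))
)

# t(tab) %*% tab: entry [i, j] counts the items that have both
# property i and property j.
shared <- crossprod(tab)

# Keep each unordered pair once and drop pairs with no shared item.
pairs <- which(shared > 0 & upper.tri(shared), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(shared)[pairs[, "row"]],
  to     = colnames(shared)[pairs[, "col"]],
  shared = shared[pairs]
)
print(edges)
```

Given a matrix of MDS coordinates with the property names as row names (not shown here), each row of edges could then be drawn onto the existing plot with segments().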
