R: Combine fragments below certain length

I have a bed file containing restriction fragments of the mouse genome. Each fragment has a different length/width, like this:
chr start end width
1 chr1 0 3000534 3000534
2 chr1 3000535 3000799 264
3 chr1 3000800 3001209 409
4 chr1 3001210 3001496 286
5 chr1 3001497 3002121 624
Is it possible to combine shorter fragments (< 500 bp) with adjacent fragments using R (see example below), and if so, how?
chr start end width
1 chr1 0 3000534 3000534
2 chr1 3000535 3001209 673
3 chr1 3001210 3002121 910
Note, I don't want to filter out fragments under a certain length, so subsetting the data is not an option.
I hope my question is not too confusing…

Here is a first solution. It supposes that chr stays the same, and it filters out the last fragment if it is < 500 bp (the result is the data frame you put in your example):
mydata <- data.frame(chr = rep("chr1", 6),
                     start = c(0, 3000535, 3000800, 3001210, 3001497, 3002122),
                     end = c(3000534, 3000799, 3001209, 3001496, 3002121, 3002134),
                     width = c(3000534, 264, 409, 286, 624, 12),
                     stringsAsFactors = FALSE)
i <- 1
while (i < nrow(mydata)) {
  if (mydata$width[i] >= 500) {
    i <- i + 1
  } else {
    # merge the short fragment into the next one
    mydata$end[i] <- mydata$end[i + 1]
    mydata$width[i] <- sum(mydata$width[i:(i + 1)])
    mydata <- mydata[-(i + 1), ]
  }
}
# a trailing short fragment has nothing left to merge with, so drop it
if (mydata$width[i] < 500) mydata <- mydata[-i, ]
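For readers who want to sanity-check the merge logic outside R, here is a minimal Python sketch of the same greedy pass (the function name and tuple layout are made up for illustration):

```python
def merge_short(frags, min_width=500):
    """Greedily merge fragments shorter than min_width into the next one.

    frags: list of (chrom, start, end, width) tuples, sorted by position.
    A trailing fragment that is still short is dropped, mirroring the
    final check in the R answer.
    """
    frags = list(frags)
    i = 0
    while i < len(frags) - 1:
        chrom, start, end, width = frags[i]
        if width >= min_width:
            i += 1
        else:
            # absorb the next fragment: extend the end, add the widths
            _, _, next_end, next_width = frags[i + 1]
            frags[i] = (chrom, start, next_end, width + next_width)
            del frags[i + 1]
    if frags and frags[-1][3] < min_width:
        frags.pop()
    return frags
```

On the example data this yields the three fragments with widths 3000534, 673 and 910, as in the question.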

Related

How to define a column class in R, dataframe

I have a data frame like this
scf0001 123 4567 - 4350
scf0001 4474 4878 + 376 *
scf0002 5375 10571 c 5006
scf0003 11370 16986 c 5536
scf0003 16256 17000 - 789 *
When I attempt to read it as a table in R, the error is "line 2 did not have 5 elements", so it is not recognizing the star (*). I tried to read it as a string with na.string="*", but that didn't work, so I think I have to define the column class with colClasses=. I tried something like colClasses=[, c(6)="character"], but it doesn't work either: Error: unexpected '['

How to read only a subset of columns into a dataframe in julia?

I have a file like this:
chr1 47727 47778 2PJ3LS1:190:C5R7BACXX:3:2202:6839:64070 1 +
chr1 48132 48183 2PJ3LS1:190:C5R7BACXX:3:2109:14612:23955 60 +
chr1 49316 49367 2PJ3LS1:190:C5R7BACXX:3:1107:8369:30676 1 +
chr1 57049 57100 2PJ3LS1:190:C5R7BACXX:3:1205:2852:33393 60 -
chr1 59296 59347 2PJ3LS1:190:C5R7BACXX:3:2306:14160:96792 1 -
chr1 62116 62165 2PJ3LS1:190:C5R7BACXX:3:1203:3949:66047 60 +
chr1 64636 64687 2PJ3LS1:190:C5R7BACXX:3:2112:11315:75619 60 -
chr1 108831 108882 2PJ3LS1:190:C5R7BACXX:3:2211:11748:76230 60 +
chr1 150522 150573 2PJ3LS1:190:C5R7BACXX:3:2108:11820:88376 60 -
chr1 180744 180794 2PJ3LS1:190:C5R7BACXX:3:2115:5327:39987 60 -
I do not care about columns 4 and 5. Is it possible to ignore these when reading my giant file? There is nothing in CSV.read which allows this.
Well - this is not a CSV file so I would rather parse it directly (this is not maximally efficient, but in most cases it should be good enough):
df = DataFrame(a=String[], b=Int[], c=Int[])
for line in eachline("filename.txt")
    a, b, c = split(line)
    push!(df, (String(a), parse(Int, b), parse(Int, c)))
end
EDIT: if you also want column 6, use this (I assume the last column is a Char):
df = DataFrame(a=String[], b=Int[], c=Int[], d=Char[])
for line in eachline("filename.txt")
    a, b, c, _, _, d = split(line)
    push!(df, (String(a), parse(Int, b), parse(Int, c), d[1]))
end
If you're on Linux, you could use awk to create another file with only the columns you want. In Julia:
run(pipeline(`awk '{print $1, $2, $3, $6}' filename.txt`, "otherfile.txt"))
df = CSV.read("otherfile.txt",delim=" ")
I don't know if it's the most efficient approach, since it involves creating an intermediate file, but it is quite simple and lets you keep the file if you need it later on. To remove the intermediate file, just run rm("otherfile.txt").
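The same column-skipping idea works in any language that can stream the file; as an illustration, here is a minimal Python sketch (hypothetical file name) that keeps only columns 1-3 and 6 without ever holding the dropped columns:

```python
def read_columns(path, keep=(0, 1, 2, 5)):
    """Stream a whitespace-delimited file, keeping only the selected columns.

    keep: zero-based indices of the columns to retain.
    Returns a list of rows, each a list of strings.
    """
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            rows.append([fields[i] for i in keep])
    return rows
```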

GameTheory package: Convert data frame of games to Coalition Set

I am looking to explore the GameTheory package from CRAN, but I would appreciate help in converting my data (a data frame of unique combinations and results) into the required coalition object. I believe the precursor to this is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has n ~ 30 'players' and a large number of unique combinations (say 1000), for which I have 1 and 0 indicators describing the combinations. The data is sparsely populated, in that I do not have values for all combinations, but I will assume combinations not described have zero value. I plan to have one specific 'player' who appears in all combinations and acts as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as players {1,2} base win rate (7.5%) by the stakes they feature in, plus player {1} base win rate (0.00333%) by the combinations he features in that player {2} does not - i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example row 4 would be the combination of {1,2,3} so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529
which, descriptively, is the {1,2} success rate times the stakes for the combinations it appears in that {3} does not, {1,3} times those where {2} does not feature, {1,2,3} times their own occurrences, and the base player {1} times the cases where neither {2} nor {3} occurs.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this in to the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
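No answer is recorded here, but the exclusive-set logic described above is mechanical enough to sketch. Below is a rough Python illustration (the names and data layout are mine, not the GameTheory API): each row of the table contributes its stake times the success rate of the largest sub-coalition of that row that is present in the coalition being valued.

```python
from itertools import combinations

# The example table: member set (n1..n4), stakes, success rate as a fraction.
rows = [
    (frozenset({1, 2}), 800, 0.075),
    (frozenset({1, 3}), 850, 0.0529),
    (frozenset({1, 4}), 150000, 0.0001),
    (frozenset({1, 2, 3}), 300, 0.0833),
    (frozenset({1, 2, 4}), 1800, 0.0361),
    (frozenset({1, 3, 4}), 1900, 0.0289),
    (frozenset({1, 2, 3, 4}), 700, 0.0571),
    (frozenset({1}), 3000000, 0.0000333),
]
rate = {members: r for members, _, r in rows}

def value(coalition):
    """Exclusive-set valuation: every row contributes its stake times the
    success rate of the intersection of the row's members with the coalition."""
    if 1 not in coalition:          # player 1 is the universal baseline
        return 0.0
    return sum(stakes * rate[members & coalition]
               for members, stakes, _ in rows)

# All coalitions ordered by size, the ordering DefineGame expects.
players = [1, 2, 3, 4]
coalitions = [c for k in range(1, 5) for c in combinations(players, k)]
v = [value(frozenset(c)) for c in coalitions]
```

Up to the asker's rounding, v reproduces the expected vector: v({1}) is about 105, v({1,2}) about 375, v({1,2,3}) about 529, and all coalitions without player 1 are zero.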

Stata counting substring

My table looks like this:
ID AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
ID AQ_ATC amountATC
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by which I mean that if there is "C05" there won't be another partial code such as "C05A1".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so.
My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
    local i = 0
    quietly foreach x of local ATCvals {
        if strpos(`y', `"`x'"') == 1 {
            local i = `i' + 1
            replace amountATC = `i'
        }
    }
}
My idea was to use a counter i and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write i into amountATC, and after iterating over the entire table, i will equal the number of occurrences of the substring. Then I reset i to 0 and continue with the next AQ_ATC.
At least that's how I intended it to work; what it did in the end is set all amountATC values to 1.
I also attempted looking into different egen-functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.
It seems as if you come from another language and insist on using loops when they are not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
Edit
With your edit the same principles apply. Below is code that adjusts to your change and is slightly shorter (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
<snip>
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.
The by: prefix is fundamental and very helpful; notice that you can use it after defining appropriate groups. I created the groups taking advantage of the fact that all the short (sub)strings have a missing ID.
Then (by the groups just defined) you only want to add up one value (observation) per amountATC. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does.
Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Edit2
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)
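The tag-and-sum idea is not Stata-specific. As a cross-check, here is a small Python sketch of the same computation (the data layout is mine, with missing IDs represented as None): each distinct full code's amount is summed once under its short prefix.

```python
rows = [
    (None, "A05", 1),
    (123, "A05AA02", 2525),
    (234, "A05AA02", 2525),
    (991, "A05AD39", 190),
    (None, "C10", 1),
    (441, "C10AA11", 4330),
    (229, "C10AA22", 3100),
    (None, "C05AA", 1),
    (441, "C05AA03", 130),
]

# Short codes are the rows with a missing ID.
short_codes = [code for id_, code, _ in rows if id_ is None]

def prefix_total(short, rows):
    """Sum amountATC over distinct full codes that start with `short`."""
    seen = {}
    for id_, code, amount in rows:
        if id_ is not None and code.startswith(short):
            seen[code] = amount      # repeated codes are counted only once
    return sum(seen.values())

totals = {s: prefix_total(s, rows) for s in short_codes}
```

This reproduces the expected amounts: 2715 for "A05", 7430 for "C10" and 130 for "C05AA".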

AWK: extract lines if column in file 1 falls within a range declared in two columns in other file

Currently I'm struggling with an AWK problem that I haven't been able to solve yet. I have one huge file (30GB) with genomic data that holds a list of positions (declared in col 1 and 2), and a second file that holds a number of ranges (declared in col 3, 4 and 5). I want to extract all lines in the first file where the position falls within a range declared in the second file. As a position is only unique within a certain chromosome (chr), it first has to be tested whether the chromosomes are identical (i.e. col 1 in file 1 matches col 3 in file 2).
file 1
chromosome position another....hundred.....columns
chr1 816 .....
chr1 991 .....
chr2 816 .....
chr2 880 .....
chr2 18768 .....
...
chr22 9736286 .....
file 2
name identifier chromosome start end
GENE1 ucsc.86 chr1 800 900
GENE2 ucsc.45 chr2 700 1700
GENE3 ucsc.46 chr2 18000 19000
expected output
chromosome position another....hundred.....columns
chr1 816 .....
chr2 816 .....
chr2 880 .....
chr2 18768 .....
A summary of what I intend to do (half-coded):
if ($1 (in file 1) matches $3 (in file 2)) {   ## test if in the correct chr
    if ($2 (in file 1) >= $4 && $2 <= $5 (in file 2)) {   ## test if pos is in the range
        print $0 (in file 1)   ## if so, print the row from file 1
    }
}
I kind of understand how to solve this problem by putting file1 in an array and using position as the index, but then I still have a problem with the chr, and besides, file1 is way too big to put in an array (although I have 128GB of RAM). I've tried some things with multi-dimensional arrays but couldn't really figure out how to do that either.
Thanks a lot for all your help.
Update 8/5/14
Added a third line in file 2 containing another range on the same chromosome as the second line. This line is skipped in the script below.
It'd be something like this, untested:
awk '
NR==FNR{ start[$3] = $4; end[$3] = $5; next }
(FNR==1) || ( ($1 in start) && ($2 >= start[$1]) && ($2 <= end[$1]) )
' file2 file1
The change in your data set actually modified the question greatly. You introduced an element which was used as a key, and since keys have to be unique, it got overwritten.
For your data set, you are better off making composite keys. Something like:
awk '
NR==FNR { range[$3,$4,$5]; next }
FNR==1
{
    for (x in range) {
        split(x, check, SUBSEP)
        if ($1 == check[1] && $2 >= check[2] && $2 <= check[3]) print $0
    }
}
' file2 file1
chromosome position another....hundred.....columns
chr1 816 .....
chr2 816 .....
chr2 880 .....
chr2 18768 .....
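For a 30GB positions file, the same lookup can also be done streaming, holding only the (small) range file in memory as per-chromosome lists; this avoids putting file 1 in an array at all. A minimal Python sketch under those assumptions (hypothetical file names; multiple ranges per chromosome are handled naturally):

```python
from collections import defaultdict

def load_ranges(path):
    """Map chromosome -> list of (start, end) ranges from file 2."""
    ranges = defaultdict(list)
    with open(path) as fh:
        next(fh)                                  # skip the header line
        for line in fh:
            _, _, chrom, start, end = line.split()
            ranges[chrom].append((int(start), int(end)))
    return ranges

def filter_positions(pos_path, ranges):
    """Yield the header plus every line whose position falls inside
    any range recorded for that line's chromosome."""
    with open(pos_path) as fh:
        yield next(fh)                            # keep the header line
        for line in fh:
            fields = line.split()
            chrom, pos = fields[0], int(fields[1])
            if any(s <= pos <= e for s, e in ranges.get(chrom, [])):
                yield line
```

A linear scan of the ranges per line is fine when file 2 is small; for many ranges per chromosome, the lists could be sorted and searched by bisection instead.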
