Finding sub intervals from interval data frames - r

I have two data.frames with coordinates of linear intervals, which correspond to ids. Each id has several linear intervals. One of the data.frames is called exon.df:
exon.df <- data.frame(id=c(rep("id1",4),rep("id2",3),rep("id3",5)),
start=c(10,20,30,40,100,200,300,1000,2000,3000,4000,5000),
end=c(15,25,35,45,150,250,350,1500,2500,3500,4500,5500))
And the other cds.df:
cds.df <- data.frame(id=c(rep("id1",3),rep("id2",3),rep("id3",3)),
start=c(20,30,40,125,200,300,2250,3000,4000),
end=c(25,35,45,150,250,325,2500,3500,4250))
They both have the same ids, but the intervals of cds.df are contained within those of exon.df. The intervals in exon.df are exons of genes (parts of the genome which are copied and stitched together to make a transcript of the gene), and those in cds.df are the parts of these exons that will be translated to protein, since exons of the gene transcript also contain parts that will not be translated (Un-Translated Regions - utr). These utrs can only be located at the start and end of the gene transcript. The utr at the start is called the 5'utr and the utr at the end is called the 3'utr. A utr may either not exist at all, or span anywhere from part of a single exon to several exons at each end of the gene.
This means that the 5'utr of an id starts from the first position of the id's first interval in exon.df and runs to one position before its first interval in cds.df, and includes all the exons in exon.df in between if such exist. Similarly, the 3'utr of an id starts one position after its last interval in cds.df and runs to the last position of its last interval in exon.df, and includes all the exons in exon.df in between if such exist.
It's also possible that an id will not have either or both utrs if the first position of its first interval in cds.df is its first position in its first interval in exon.df, and similarly if its last position of its last interval in cds.df is its last position in its last interval in exon.df.
I'm looking for a fast way to retrieve these 5'utr and 3'utr intervals given exon.df and cds.df.
Here's what the outcome for this example should be:
utr5.df <- data.frame(id=c("id1","id2","id3","id3"),
start=c(10,100,1000,2000),
end=c(15,124,1500,2249))
utr3.df <- data.frame(id=c("id2","id3","id3"),
start=c(326,4251,5000),
end=c(350,4500,5500))

Do you know about Bioconductor? It's an add-on for R, specifically for the biosciences. It has a package called GenomicRanges, with which you can create a GRanges object that contains all exons, and another object that contains all CDSs.
You can then take the set difference of these two objects to get the UTRs. See the "setops-methods" section of the GenomicRanges documentation. You want the 'setdiff' function.
So: transform your data.frames into GRanges objects, then issue something like utrs <- setdiff(exons, cds). Note that this returns the 5' and 3' UTR intervals together, so you would still need to split them by their position relative to the CDS.
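If installing Bioconductor is not an option, the same set-difference idea can be sketched in base R. This is only a sketch under assumptions that are not stated in the answer above: every id in exon.df also appears in cds.df, and the 5' end lies at the lower coordinates (all transcripts on the forward strand). The helper names cds.first, cds.last, first, last, is5 and is3 are made up for illustration.

```r
exon.df <- data.frame(id = c(rep("id1", 4), rep("id2", 3), rep("id3", 5)),
                      start = c(10, 20, 30, 40, 100, 200, 300, 1000, 2000, 3000, 4000, 5000),
                      end = c(15, 25, 35, 45, 150, 250, 350, 1500, 2500, 3500, 4500, 5500))
cds.df <- data.frame(id = c(rep("id1", 3), rep("id2", 3), rep("id3", 3)),
                     start = c(20, 30, 40, 125, 200, 300, 2250, 3000, 4000),
                     end = c(25, 35, 45, 150, 250, 325, 2500, 3500, 4250))

# First and last coding position of each id
cds.first <- tapply(cds.df$start, cds.df$id, min)
cds.last  <- tapply(cds.df$end, cds.df$id, max)

# Look these up for every exon row (assumes each exon id occurs in cds.df)
first <- as.vector(cds.first[as.character(exon.df$id)])
last  <- as.vector(cds.last[as.character(exon.df$id)])

# 5'utr: exons starting before the CDS, clipped one position before it
is5 <- exon.df$start < first
utr5.df <- data.frame(id = exon.df$id[is5],
                      start = exon.df$start[is5],
                      end = pmin(exon.df$end[is5], first[is5] - 1))

# 3'utr: exons ending after the CDS, clipped one position after it
is3 <- exon.df$end > last
utr3.df <- data.frame(id = exon.df$id[is3],
                      start = pmax(exon.df$start[is3], last[is3] + 1),
                      end = exon.df$end[is3])
```

Everything here is vectorised over the rows of exon.df, so it should scale reasonably to many ids. For reverse-strand transcripts the roles of utr5.df and utr3.df would have to be swapped.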

Related

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are pulling out the two matrices of interest and comparing them element by element. The result is a logical (TRUE/FALSE) matrix, which R treats as a 0/1 matrix in arithmetic. We can then multiply it by whatever coefficient we want to represent that situation.
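The comparison above covers two matrices. For a longer list (e.g. 99 randomized matrices plus the observed one), one possible generalisation is to stack the matrices into a 3-d array and rank along the third dimension. This is a sketch, assuming the observed matrix is the last element of the list; the names arr, obs and rank.obs are made up for illustration.

```r
ls <- list(matrix(10:18, 3, 3), matrix(18:10, 3, 3))

# Stack the matrices into a rows x cols x n array
arr <- array(unlist(ls), dim = c(dim(ls[[1]]), length(ls)))

# Position of the observed matrix in the list (assumed to be last)
obs <- length(ls)

# For every cell, rank the values across the list and keep the observed rank
rank.obs <- apply(arr, c(1, 2), function(v) rank(v)[obs])
```

rank() gives tied values their average rank by default, which matches the 1.5 used for equal entries above. A higher rank means the observed value is larger relative to the randomized ones.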

Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code.
What I am guaranteed is that the "(var)x" (domain) of the variable is always between 0 and 1. The "(var)y" (co-domain) can vary but is also bounded, but within a larger range.
I am trying to get an average over the "(var)x" but over the different variables.
I would like some kind of selective averaging, not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and avgd_y, whose first element would be found by taking the values of ay and by at 0.12 and averaging them with the corresponding value of qy, and so forth for all the values in the vector with the largest number of elements.
How can I do this in R ?
P.S: This is a toy dataset. My dataset is spread over files and I am reading them with a custom function, but the raw data is available as shown in the code above.
Edit:
Some clarification:
avgd_y would have the length of the largest vector. For example, in the case above, avgd_y would be (ay'+by'+qy)/3, where ay' and by' are the vectors c(ay(qx(i))) and c(by(qx(i))) for i from 1 to the length of qx, i.e. ay and by interpolated at the data points of qx.
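The clarification above could be sketched with linear interpolation via approx() from base R. This is one possible reading rather than a definitive answer. It assumes linear interpolation is acceptable, and that points of qx outside the range of ax or bx should take the nearest end value (rule = 2), a choice the question does not specify. The names ay.i and by.i are made up for illustration.

```r
ax <- c(0.11, 0.22, 0.33, 0.44, 0.55, 0.68, 0.89)
ay <- c(0.2, 0.4, 0.5, 0.42, 0.5, 0.43, 0.6)
bx <- c(0.14, 0.23, 0.46, 0.51, 0.78, 0.91)
by <- c(0.1, 0.2, 0.52, 0.46, 0.4, 0.41)
qx <- c(0.12, 0.27, 0.36, 0.48, 0.51, 0.76, 0.79, 0.97)
qy <- c(0.03, 0.2, 0.52, 0.4, 0.45, 0.48, 0.61, 0.9)

# Use the x-values of the longest vector as the common grid
avgd_x <- qx

# Interpolate ay and by at the grid points; rule = 2 holds the end
# values constant outside the observed range (an assumption)
ay.i <- approx(ax, ay, xout = avgd_x, rule = 2)$y
by.i <- approx(bx, by, xout = avgd_x, rule = 2)$y

# Element-wise mean of the three series on the common grid
avgd_y <- (ay.i + by.i + qy) / 3
```

With the default rule = 1, approx() would instead return NA for grid points outside the range of ax or bx, and those would propagate into avgd_y.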

How to perform the same operation(s) on all elements in a list

I have a large list with 317 elements. Each element contains a varying number of cases. These elements all have the exact same categories, but they have different numbers for all of them.
Each element has five categories:
Location
Species 1 count
Species 2 count
Species 3 count
Total species count
I originally had a dataframe that had all of the records in one, but I split it based on location as I am trying to find the proportion of the three species for each site (hence the 317 elements: there were 317 different locations, so it was split into that many).
I just want to perform the same operation on every element, receiving a number for each of them. I don't know how to calculate the proportion, but I do not need help with that. I just want to perform the same function on every single element in the list I have.
This is the code so far that I want to execute for every single element. I need to add the proportions code, but I will do that when I find out how to work it out.
##df = name of the large list
df$location <- df$location[!( ((df$species1) + (df$species2) + (df$species3)) != (df$totalSpecies) ),]
##remove any records where the three species do not equal the total
Thank you in advance!
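One common way to run the same operation on every element of a list is lapply(). The following is a minimal sketch, assuming each element is a data frame with the columns species1, species2, species3 and totalSpecies used in the snippet above. The toy list sites and the helper clean_site are hypothetical and stand in for the real 317-element list.

```r
# Hypothetical toy list standing in for the real list split by location
sites <- list(
  siteA = data.frame(location = "siteA",
                     species1 = c(1, 2), species2 = c(3, 1),
                     species3 = c(0, 2), totalSpecies = c(4, 6)),
  siteB = data.frame(location = "siteB",
                     species1 = 5, species2 = 0,
                     species3 = 5, totalSpecies = 10)
)

# Keep only the records where the three species add up to the total
clean_site <- function(d) {
  ok <- d$species1 + d$species2 + d$species3 == d$totalSpecies
  d[ok, , drop = FALSE]
}

# Apply the same function to every element of the list
cleaned <- lapply(sites, clean_site)

# sapply() returns one number per element, here the rows kept per site
n.kept <- sapply(cleaned, nrow)
```

Any per-site calculation, such as the proportions, could go into a function of the same shape and be applied with lapply() or sapply() in the same way.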

COUNTIF where criterion is a specific sequence of cells

I'm doing some work with arithmetic sequences modulo P, in which the sequences become periodic under the modulo. My worksheet generates a sequence mod P with the first term being 0, the second term being a number K (referencing another cell), and the following terms following the recurrence relation. The period of the sequence (number of values before it repeats itself) is related to the ratio P/K. So, for example, if P=2 and K=1, I get the sequence {0,1,1,0,1,1,0,1,1,...}, which has a period of 3, so when P/K=2, the period is 3.
I currently have a formula which uses the COUNTIF function to count the number of zeroes in the range, which is then divided out of the total range, currently an arbitrary size of 120, and this gives me the correct period for many ratios of P/K. Most of the time, however, the sequence generated exhibits semi-periodicity and sometimes even quasi-periodicity, such as in the case of K=1 and modulo 9: {0,1,1,2,3,5,8,4,3,7,1,8,0,8,8,7,6,4,1,5,6,2,8,1,...}, where P/K=9, the period is 24, and the semi-period is 12 (because of the 0,8,8,... part of the sequence). In such cases, my current COUNTIF formula thinks the full period is 12, even though it should be 24, because it counts the zeroes which define the semi-period.
What I would like to do is adjust the formula so that instead of the criterion for counting being 0, it would only count triplet sequences of cells in the pattern 0,K,K.
My current formula:
=QUOTIENT(120,(COUNTIF(B2:DQ2,0)))
So if I have =QUOTIENT(120,(COUNTIF(B2:DQ2,*X*))) I want the "X", which is currently 0, to reference a specific sequence of cells, namely the first three of the overall series, so something like: =QUOTIENT(120,(COUNTIF(B2:DQ2,(0,C2,D2)))) although obviously that criterion is not in remotely the correct syntax.
I'm not well-versed in writing macros, so that would probably be out of the question.
I would do this with four helper rows plus the final formula. Someone more clever than I am might be able to do it in one cell with an array formula; but compared to array formulas I think the helper rows are easier to understand and, if desired, tweak.
Once this is set up, if you're always going to use three as your criterion, you can hide the helper rows (to hide a row, right-click on the gray number label on the left side of the spreadsheet, and choose "hide").
So your sequence is in row 2, starting in column B. We'll set up the first helper row in row 3, starting in column C. In cell C3 put the formula =C2=$B$2. This will evaluate to FALSE, which is equivalent to 0. Copy and paste that formula all the way to cell DQ3 (or however many columns you want to run it). Cells below a sequence number equal to the first number in the sequence will evaluate to TRUE, which is equivalent to 1.
The next two helper rows are very similar. In cell D4 put the formula =D2=$C$2 and copy and paste to cell DQ4. This row tests which cells are equal to the second number in the sequence.
In cell E5 put the formula =E2=$D$2 and copy and paste to cell DQ5, showing which cells are equal to the third number in the sequence.
The last helper row is a little different, so I left an empty row after the first three helpers. In cell E7 I put the formula =SUM(C3,D4,E5); copy and paste that over to column DQ. This counts how many matches were found in the previous three helper rows. If all three match, the result of this formula will be 3 and your criterion for determining the period will have been fulfilled.
Now to show the period: in the cell you want to have this number, put the formula =MATCH(3,E7:DQ7,0). This searches the last (fourth) helper row looking for a cell that is equal to 3. (Obviously you could modify this method to match only the first two sequence numbers, or to match more than 3, and then you'd adjust the first parameter in the MATCH formula.) The last parameter in this MATCH formula is 0 because the helper row is not sorted. The return value is the index of the first match: a match in E7 would be index 1, a match in F7 would be index 2, etc.
I tested this in LibreOffice 4.4.4.3.
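For cross-checking the spreadsheet result, the helper-row logic can be mirrored in a few lines of R. This is a sketch only. It assumes the recurrence is "each term is the sum of the previous two, modulo P", which is what the example sequences in the question suggest, and the function name seq_period is made up for illustration.

```r
seq_period <- function(K, P, n = 120) {
  # Build the sequence: 0, K, then the sum of the previous two terms mod P
  s <- numeric(n)
  s[1] <- 0
  s[2] <- K %% P
  for (i in 3:n) s[i] <- (s[i - 1] + s[i - 2]) %% P

  # Find the first shift d at which the first three terms reappear,
  # which is what MATCH(3, ...) does on the fourth helper row
  for (d in 1:(n - 3)) {
    if (all(s[(d + 1):(d + 3)] == s[1:3])) return(d)
  }
  NA  # no repeat found within n terms
}

period.mod2 <- seq_period(K = 1, P = 2)
period.mod9 <- seq_period(K = 1, P = 9)
```

Because all three leading terms must match, the 0,8,8 stretch in the modulo 9 example is not counted as a full period.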

R Compare each data value of a column to rest of the values in the column?

I would like to create a function that looks at a column of values, looks at each value individually, and assesses which of the other data points' values is closest to that data point.
I'm guessing it could be done by checking the length of the data frame and making a list of the respective length in steps of 1, then using that list to reference which cell is being analysed against the rest of the column, though I don't know how to implement that.
eg.
data:
20
17
29
33
1) is closest to 2)
2) is closest to 1)
3) is closest to 4)
4) is closest to 3)
I found this example which tests for similarity, but I'd like to know what letter it assigns to.
x=c(1:100)
your.number=5.43
which(abs(x-your.number)==min(abs(x-your.number)))
Also, if you know how I could do this, could you explain the parts of the code and what they mean?
I wrote a quick function that does the same thing as the code you provided.
The code you provided takes the absolute value of the difference between your number and each value in the vector, and compares that to the minimum value from that vector. This is essentially what the which.min function that I use below does (note that which.min returns only the first index if there are ties). I go through my steps below. Hope this helps.
Make up some data
a = 1:100
yourNumber = 6
Where Num is your number, and x is a vector
getClosest=function(x, Num){
return(which.min(abs(x-Num)))
}
Then if you run this command, it should return the index for the value of the vector that corresponds to the closest value to your specified number.
getClosest(x=a, Num=yourNumber)
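To compare every value in a column against all the other values, as in the question's example data, one option is a matrix of pairwise distances. This is a sketch using the four example values; the names d and closest are made up for illustration.

```r
x <- c(20, 17, 29, 33)

# Absolute difference between every pair of values
d <- abs(outer(x, x, "-"))

# A value is always closest to itself, so rule that out
diag(d) <- Inf

# For each row, the index of the nearest other value
closest <- apply(d, 1, which.min)
```

Here closest[i] is the position of the value nearest to x[i], so x[closest] gives the nearest values themselves. For a very long column the full distance matrix can get large, in which case sorting the values first may be a better route.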
