I have a mock-up dataframe representing some of the confidential data I have and it looks like this:
Name Value
1. AaaaBaCCCaaa.x 1
2. AbbAbbKalllNBN.y 2
3. CCCdddEfffFg.x 8
4. ZZZtTThGGtGGGG.y 1
...
9. AAAHHHhhhhIIIIII.x 2
10. RRRRmmmmJJJJJJJ.y 3
11. MMMMMnnnnNNNNrrrr.x 4
...
What's important to notice here is that the Name variable contains ordinal numbers (e.g. 1., 2., 10.) at the beginning of the string and either .x or .y at the end of the string. Also, the length of the Name string is not the same in each row.
How can I remove the number from the beginning of each string in the Name variable, along with the period and the space that come after it? It's very important for me to get rid of them because I need to use the separate function on this data afterwards to split off the x and y at the end of the string. If that period after the number is still at the beginning of the string, separate will fail.
I wanted to use substr but I didn't know how to do it since, for example, 10. is longer than 9. and I don't know which values I would put into the start and stop arguments.
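One possible approach, offered as a sketch rather than a confirmed answer from this thread: instead of substr, use sub with a regular expression anchored at the start of the string, so the width of the number no longer matters. The data frame df below is a small stand-in for the mock-up above, not the real data.

```r
# Stand-in for the mock-up data (a few rows only)
df <- data.frame(
  Name = c("1. AaaaBaCCCaaa.x", "2. AbbAbbKalllNBN.y", "10. RRRRmmmmJJJJJJJ.y"),
  Value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# ^          anchor at the start of the string
# [0-9]+     one or more digits (so "9." and "10." both match)
# \\.        a literal period
# [[:space:]]*  any whitespace that follows
df$Name <- sub("^[0-9]+\\.[[:space:]]*", "", df$Name)

print(df$Name)
# [1] "AaaaBaCCCaaa.x"    "AbbAbbKalllNBN.y"  "RRRRmmmmJJJJJJJ.y"
```

Because sub only replaces the first match and the pattern is anchored with ^, the .x or .y suffix at the end is left alone for separate to use afterwards.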
I have two data.frames with coordinates of linear intervals, which correspond to ids. Each id has several linear intervals. One of the data.frames is called exon.df:
exon.df <- data.frame(id=c(rep("id1",4),rep("id2",3),rep("id3",5)),
start=c(10,20,30,40,100,200,300,1000,2000,3000,4000,5000),
end=c(15,25,35,45,150,250,350,1500,2500,3500,4500,5500))
And the other cds.df:
cds.df <- data.frame(id=c(rep("id1",3),rep("id2",3),rep("id3",3)),
start=c(20,30,40,125,200,300,2250,3000,4000),
end=c(25,35,45,150,250,325,2500,3500,4250))
They both have the same ids but the intervals of cds.df are contained within those of exon.df. The intervals in exon.df are exons of genes (parts of the genome which are copied and stitched together to make a transcript of the gene), and those in cds.df are the parts of these exons that will be translated to protein, since exons of the gene transcript also contain parts that will not be translated (Un-Translated Regions - utr). These utr's can only be located at the start and end of the gene transcript. The utr at the start is called the 5'utr and the utr at the end is called the 3'utr. A utr may either not exist at all, or span anywhere from part of a single exon to several exons at each end of the gene.
This means that the 5'utr of an id starts from the first position of its first interval in exon.df and runs to one position before its first interval in cds.df, and includes all the exons in exon.df in between if such exist. Similarly, the 3'utr of an id starts one position after its last interval in cds.df and runs to the last position of its last interval in exon.df, and includes all the exons in exon.df in between if such exist.
It's also possible that an id will not have either or both utrs if the first position of its first interval in cds.df is its first position in its first interval in exon.df, and similarly if its last position of its last interval in cds.df is its last position in its last interval in exon.df.
I'm looking for a fast way to retrieve these 5'utr and 3'utr intervals given exon.df and cds.df.
Here's what the outcome for this example should be:
utr5.df <- data.frame(id=c("id1","id2","id3","id3"),
start=c(10,100,1000,2000),
end=c(15,124,1500,2249))
utr3.df <- data.frame(id=c("id2","id3","id3"),
start=c(326,4251,5000),
end=c(350,4500,5500))
Do you know about Bioconductor? It's an add-on for R, specifically for the biosciences. It has a package called GenomicRanges, with which you can create a GRanges object that contains all Exons, and another object that contains all CDSs.
You can then take the set difference of these two objects to get the UTRs. See the "setops-methods" section of the GenomicRanges documentation; you want the 'setdiff' function.
So: Transform your data.frames into GRanges objects, then issue something like utrs <- setdiff(exons, cds)
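If installing Bioconductor is not an option, the same interval difference can be sketched in base R. This is a swapped-in alternative, not the GenomicRanges method above, and it relies on the assumptions stated in the question: every cds interval lies inside an exon, and untranslated parts occur only before the first or after the last coding position of each id. The helper names (cds.min, first.cds, is5 and so on) are made up for illustration.

```r
exon.df <- data.frame(id = c(rep("id1", 4), rep("id2", 3), rep("id3", 5)),
                      start = c(10, 20, 30, 40, 100, 200, 300, 1000, 2000, 3000, 4000, 5000),
                      end = c(15, 25, 35, 45, 150, 250, 350, 1500, 2500, 3500, 4500, 5500))
cds.df <- data.frame(id = c(rep("id1", 3), rep("id2", 3), rep("id3", 3)),
                     start = c(20, 30, 40, 125, 200, 300, 2250, 3000, 4000),
                     end = c(25, 35, 45, 150, 250, 325, 2500, 3500, 4250))

# First and last coding position per id
cds.min <- tapply(cds.df$start, cds.df$id, min)
cds.max <- tapply(cds.df$end, cds.df$id, max)

# Look those positions up for every exon row
first.cds <- as.numeric(cds.min[as.character(exon.df$id)])
last.cds <- as.numeric(cds.max[as.character(exon.df$id)])

# 5'utr: exons that start before the first coding position,
# clipped so they end one position before it
is5 <- exon.df$start < first.cds
utr5.df <- data.frame(id = exon.df$id[is5],
                      start = exon.df$start[is5],
                      end = pmin(exon.df$end[is5], first.cds[is5] - 1))

# 3'utr: exons that end after the last coding position,
# clipped so they start one position after it
is3 <- exon.df$end > last.cds
utr3.df <- data.frame(id = exon.df$id[is3],
                      start = pmax(exon.df$start[is3], last.cds[is3] + 1),
                      end = exon.df$end[is3])

print(utr5.df)
print(utr3.df)
```

On the example data this yields the four 5'utr rows and three 3'utr rows listed in the question. Everything is vectorised apart from the two tapply calls, so it should scale reasonably, though that has not been benchmarked here.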
I'm not sure how to describe this question, so I'm just going to write a bit of code here to illustrate what I'm trying to achieve.
numberVector = c(56,23,10,26,11,9,33,60,71,1)
xaxisVector = c(1:length(numberVector))
booleanVector = c(FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)
plot(xaxisVector,numberVector)
abline(a=50,b=0,col="red")
points(xaxisVector[booleanVector],numberVector[booleanVector],col="blue",pch=20)
As you can see, the above code produces a graph that looks like below.
Every time the value in the numberVector goes from a value above 50 to a value below 50, I highlight the dot in blue. (e.g. from 56 to 23, 23 is highlighted)
And likewise every time the value in the numberVector goes from a value below 50 to above 50, I highlight the dot in blue. (e.g. from 33 to 60, 60 is highlighted)
I've manually typed the booleans in booleanVector. But how would I generate such a vector of booleans given any vector like numberVector?
We can look at the difference in the signs of the values minus fifty. For example:
booleanVector2 <- c(FALSE, diff(sign(numberVector-50))!=0)
all(booleanVector==booleanVector2)
# [1] TRUE
The sign(x-50) basically keeps track of whether each value is above or below the line. The diff() looks at the difference between consecutive pairs of values to detect changes. We add in a leading FALSE since we assume the first value starts on one side of the line and so cannot be a crossing.
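To make the intermediate steps visible, here is the same idea unrolled and wrapped in a small helper. The function name crossesThreshold and its threshold argument are made up for this sketch.

```r
numberVector <- c(56, 23, 10, 26, 11, 9, 33, 60, 71, 1)

# Step 1: +1 above the line, -1 below it
side <- sign(numberVector - 50)
print(side)
# [1]  1 -1 -1 -1 -1 -1 -1  1  1 -1

# Step 2: non-zero wherever consecutive values sit on different sides
print(diff(side))
# [1] -2  0  0  0  0  0  2  0 -2

# Hypothetical helper generalising the threshold
crossesThreshold <- function(x, threshold = 50) {
  c(FALSE, diff(sign(x - threshold)) != 0)
}

booleanVector3 <- crossesThreshold(numberVector)
print(booleanVector3)
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
```

One caveat: a value exactly equal to the threshold gives sign 0, so moving onto the line and moving off it each count as a change. If that matters for your data, decide which side such values belong to before calling sign.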
I'm doing some work with arithmetic sequences modulo P, in which the sequences become periodic under the modulo. My worksheet generates a sequence mod P with the first term being 0, the second term being a number K (referencing another cell), and the following terms following the recurrence relation. The period of the sequence (number of values before it repeats itself) is related to the ratio P/K, so, for example, if P=2 and K=1, I get the sequence {0,1,1,0,1,1,0,1,1,...}, which has a period of 3, so when P/K=2, the period is 3.
I currently have a formula which uses the COUNTIF function to count the number of zeroes in the range, which is then divided out of the total range, currently an arbitrary size of 120, and this gives me the correct period for many ratios of P/K. Most of the time, however, the sequence generated exhibits semi-periodicity and sometimes even quasi-periodicity, such as in the case of K=1 and modulo 9: {0,1,1,2,3,5,8,4,3,7,1,8,0,8,8,7,6,4,1,5,6,2,8,1,...}, where P/K=9, the period is 24, and the semi-period is 12 (because of the 0,8,8,... part of the sequence). In such cases, my current COUNTIF formula thinks the full period is 12, even though it should be 24, because it counts the zeroes which define the semi-period.
What I would like to do is adjust the formula so that instead of the criterion for counting being 0, it would only count triplet sequences of cells in the pattern 0,K,K.
My current formula:
=QUOTIENT(120,(COUNTIF(B2:DQ2,0)))
So if I have =QUOTIENT(120,(COUNTIF(B2:DQ2,*X*))) I want the "X", which is currently 0, to reference a specific sequence of cells, namely the first three of the overall series, so something like: =QUOTIENT(120,(COUNTIF(B2:DQ2,(0,C2,D2)))) although obviously that criterion is not in remotely the correct syntax.
I'm not well-versed in writing macros, so that would probably be out of the question.
I would do this with four helper rows plus the final formula. Someone more clever than I am might be able to do it in one cell with an array formula; but compared to array formulas I think the helper rows are easier to understand and, if desired, tweak.
Once this is set up, if you're always going to use three as your criterion, you can hide the helper rows (to hide a row, right-click on the gray number label on the left side of the spreadsheet, and choose "hide").
So your sequence is in row 2, starting in column B. We'll set up the first helper row in row 3, starting in column C. In cell C3 put the formula =C2=$B$2. This will evaluate to FALSE, which is equivalent to 0. Copy and paste that formula all the way to cell DQ3 (or however many columns you want to run it). Cells below a sequence number equal to the first number in the sequence will evaluate to TRUE, which is equivalent to 1.
The next two helper rows are very similar. In cell D4 put the formula =D2=$C$2 and copy and paste to cell DQ4. This row tests which cells are equal to the second number in the sequence.
In cell E5 put the formula =E2=$D$2 and copy and paste to cell DQ5, showing which cells are equal to the third number in the sequence.
The last helper row is a little different, so I left an empty row after the first three helpers. In cell E7 I put the formula =SUM(C3,D4,E5); copy and paste that over to column DQ. This counts how many matches were found in the previous three helper rows. If all three match, the result of this formula will be 3 and your criterion for determining the period will have been fulfilled.
Now to show the period: in the cell you want to have this number, put the formula =MATCH(3,E7:DQ7,0). This searches the last (fourth) helper row looking for a cell that is equal to 3. (Obviously you could modify this method to match only the first two sequence numbers, or to match more than 3, and then you'd adjust the first parameter in the MATCH formula.) The last parameter in this MATCH formula is 0 because the helper row is not sorted. The return value is the index of the first match: a match in E7 would be index 1, a match in F7 would be index 2, etc. Because E7 tests the triplet that starts one cell after the start of the sequence, that index is the period itself.
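For anyone who wants to check the helper-row logic outside the spreadsheet, here is a rough sketch of the same test in R. It assumes the recurrence is "each term is the sum of the previous two, modulo P", which is inferred from the example sequences in the question and not stated there. The function names make_sequence and find_period are made up for this sketch.

```r
# Assumption: term[i] = (term[i-1] + term[i-2]) mod P, starting 0, K
make_sequence <- function(P, K, n = 120) {
  s <- numeric(n)
  s[1] <- 0
  s[2] <- K %% P
  for (i in 3:n) s[i] <- (s[i - 1] + s[i - 2]) %% P
  s
}

# Smallest shift k at which the next `width` terms equal the first `width`
# terms; this is what the helper rows plus MATCH(3, ...) compute
find_period <- function(s, width = 3) {
  offsets <- seq_len(length(s) - width)
  hits <- vapply(offsets,
                 function(k) all(s[(k + 1):(k + width)] == s[1:width]),
                 logical(1))
  match(TRUE, hits)  # NA if the pattern never recurs in the range
}

s9 <- make_sequence(P = 9, K = 1)

# The COUNTIF approach from the question: counts every zero,
# so it reports the semi-period
print(120 %/% sum(s9 == 0))
# [1] 12

# Matching the opening triplet 0, K, K gives the full period
print(find_period(s9))
# [1] 24

print(find_period(make_sequence(P = 2, K = 1)))
# [1] 3
```

As with the MATCH formula, the result is only meaningful if the range (here 120 terms) is long enough to contain at least one full period.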
I tested this in LibreOffice 4.4.4.3.
I have a set of identically dimensioned tables for 322 areas. I need to sum these tables to 29 higher level areas, where each higher level area contains a varying number of lower level areas.
I am proposing to compute the sum in a loop, so the first task is to determine the number of lower level areas to be summed (each of which has a character identifier).
So for example the first list of lower level areas is a list of four:
lad_list_black
[1] "00CW" "00CU" "00CS" "00CR"
The higher area lists are differentiated by the last term -- in this case "black". The next is "bucks" (ie lad_list_bucks), etc.
I was proposing to use a loop which counted the number of lower level areas in its first step -- something like
nam <- c("black","bucks")
lad_list_black <- c("00CW","00CU","00CS","00CR")
for(i in 1:1){
eval(parse(text=length(paste("lad_list_",nam[i],sep=""))))}
but when I tested it outside the loop, the result was:
eval(parse(text=length(paste("lad_list_",nam[1],sep=""))))
[1] 1
which is not correct, since:
length(lad_list_black)
[1] 4
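What appears to be happening: length() is applied to the result of paste(), which is a single string ("lad_list_black"), so it returns 1, and parse/eval then just evaluates the number 1. The sketch below shows two ways to count the elements of the variable that the string names. The lad_list_bucks values are invented placeholders, since the question does not list them.

```r
nam <- c("black", "bucks")
lad_list_black <- c("00CW", "00CU", "00CS", "00CR")
lad_list_bucks <- c("11AA", "11AB")  # hypothetical codes, for illustration only

# Option 1: keep eval(parse()), but build the whole call as text
# so that length() is part of the string being parsed
n1 <- eval(parse(text = paste0("length(lad_list_", nam[1], ")")))
print(n1)
# [1] 4

# Option 2: look the variable up by name with get(), no parsing needed
n2 <- length(get(paste0("lad_list_", nam[1])))
print(n2)
# [1] 4

# The same lookup for every higher level area at once
counts <- sapply(nam, function(n) length(get(paste0("lad_list_", n))))
print(counts)
# black bucks 
#     4     2 
```

If you control how the lists are created, storing them in one named list (e.g. lad_lists$black) avoids the name-pasting altogether, and lengths(lad_lists) gives all the counts directly.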