Delete all rows in a file that fit between certain headers? - r

I would like to delete all of the rows that sit between certain headers in this example text file.
fileConn <- file("sample.txt")
one <- "*Keyword"
two <- "*Node"
three <- "$ Node,X,Y,Z"
four <- "1,639982.78040607,4733827.5104821,0"
five <- "2,639757.59709573,4733830.43494066,0"
six <- "3,639738.81268144,4733834.3619618,0"
seven <- "*End"
writeLines(c(one, two, three, four, five, six, seven), fileConn)
close(fileConn)
sample <- readLines("sample.txt")
What I am looking to do is delete all of the rows/lines between "*Node" and "*End". Since I am dealing with files with different lengths of rows between these headers, the deletion method needs to be based on headers only. I have no idea how to do this since I've only deleted rows in dataframes referenced by row numbers previously. Any clues?
Expected output is:
*Keyword
*Node
*End

readLines returns a vector, not a data frame, so we can create the sample input more simply:
sample = c("*Keyword",
"*Node",
"$ Node,X,Y,Z",
"1,639982.78040607,4733827.5104821,0",
"2,639757.59709573,4733830.43494066,0",
"3,639738.81268144,4733834.3619618,0",
"*End")
Find the starting and ending headers, and remove the elements in between with negative indexing:
node = which(sample == "*Node")
end = which(sample == "*End")
result = sample[-seq(from = node + 1, to = end - 1)]
result
# [1] "*Keyword" "*Node" "*End"
This assumes there is a single *Node and a single *End line. It also assumes that there is at least one line to delete. You may want to create a more robust solution with some handling for those special cases, e.g.,
delete_between = function(input, start, end) {
  start_index = which(input == start)
  end_index = which(input == end)
  if (length(start_index) == 0 || length(end_index) == 0) {
    warning("No start or end found, returning input as-is")
    return(input)
  }
  if (length(start_index) > 1 || length(end_index) > 1) {
    stop("Multiple starts or ends found.")
  }
  if (start_index == end_index - 1) {
    return(input)
  }
  return(input[-seq(from = start_index + 1, to = end_index - 1)])
}
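For example, applied to the sample vector created above:

delete_between(sample, "*Node", "*End")
# [1] "*Keyword" "*Node"    "*End"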

Related

Creating dictionary from a '.fasta' file containing several genes from an organism

I have a '.txt' file in which a list of genes is given along with their sequences. I need to create a dictionary in which the keys are the names of the genes and the values are the sequences.
I want the output of the dictionary to be this:
dict = {'sequence1': 'AATTGGCC', 'sequence2': 'AAGGCCTT', ...}
So this is what I tried, but I ran into some problems:
dictionary = {}
accesion_number = ""
sequentie = ""
with open("6EP.fasta", "r") as proteoom:
    for line in proteoom:
        if line.startswith(">"):
            line.strip()
            dictionary[accesion_number] = sequentie
            sequentie = ""
        else:
            sequentie = sequentie + line.rstrip().strip("\n").strip("\r")
    dictionary[accesion_number] = sequentie
Does anyone know what went wrong here, and how I can fix it?
Thanks in advance!
I can think of two ways to do this:
High memory usage
If the file is not too large, you can use readlines() and then use the indexes like so:
IDs = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    raw_data = f.readlines()
    for i, l in enumerate(raw_data):
        if l[0] == '>':
            IDs.append(l)
            sequences.append(raw_data[i + 1])
Low memory usage
Now, if you don't want to load the contents of the file into memory, I think you can read the file twice, saving the index of every ID line plus one, like so:
1. Get the '>' lines and their indexes; each sequence index will be the ID index plus one.
2. Check whether the line number is in the indexes list and, if so, append the content to your variable.
Here I'm taking advantage of the fact that the index list is, by definition, sorted.
IDs = []
indexes = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if l.startswith('>'):
            IDs.append(l)          # Get your IDs
            indexes.append(i + 1)  # Get the index of the ID + 1
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if indexes and i == indexes[0]:  # Check whether the line matches the next saved index
            sequences.append(l)          # Get your sequence
            indexes.pop(0)               # Remove the first element of the indexes
I hope this helps! ;)
Code
ids = []
seq = []
char = ['_', ':', '*', '#'] #invalid in sequence
seqs = ''
with open('fasta.txt', 'r') as f:  # open sample fasta
    for line in f:
        if line.startswith('>'):
            ids.append(line.strip('\n'))
            if seqs != '':        # if there's a previous seq
                seq.append(seqs)  # append the seq
                seqs = ''         # then start a new seq
        elif line not in char:
            seqs = seqs + line.strip('\n')  # build seq with each line until the next '>'
seq.append(seqs)  # append any remaining seq
print(ids)
print(seq)
Result
['>SeqABCD [organism=Mus musculus]', '>SeqABCDE [organism=Plasmodium falciparum]']
['ACGTCAGTCACGTACGTCAGTTCAGTC...', 'GGTACTGCAAAGTTCTTCCGCCTGATTA...']
Sample File
>SeqABCD [organism=Mus musculus]
ACGTCAGTCACGTACGTCAGTTCAGTCARYSTYSATCASMBMBDH
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAA
AAATATCATTTGAGGTCAATACAAATCCTATTTCTATCGTTTTT
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAAT
>SeqABCDE [organism=Plasmodium falciparum]
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTT
TTGTTTTGCTTCTTTGAAGTAGTTTCTCTTTGCAAAATTCCTCTT
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCGGTACTGCAA
AGTCAATTTTATATAATTTAATCAAATAAATAAGTTTATGGTTAA
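Since the question ultimately asks for a dictionary keyed by gene name, the parallel lists produced by either answer can be zipped together. A minimal sketch using the ids and seq lists from the code above (stripping the leading '>' from the keys is an assumption about the desired key format):

# Build the dictionary the question asked for from the parallel id/sequence lists.
# Keys keep the full header line minus the '>' prefix; adjust if only the gene name is wanted.
gene_dict = {name.lstrip('>'): sequence for name, sequence in zip(ids, seq)}
print(gene_dict)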

A regular expression to filter out repetitive numbers for R data frame columns

I have a data frame with many columns and rows and I need to filter based on the value of two columns (Lat and Lon). I need a regular expression which:
1. Removes any row for which either the Lat or Lon column does not have at least three decimal places. So the first row (human) would be filtered out, because even though Lon has three decimal places, Lat does not.
2. Removes any row for which the decimal places are redundant. By redundant I mean three repeats of the same digit continuing to the end of the number. But if the redundancy starts after the third decimal it doesn't matter, and if the redundancy is eventually followed by a different digit it doesn't matter.
Type <-c("human","camera","ebird","museum", "specimen", "gbif")
Lat <- c(34.67, 34.66,34.6666666, 34.666582, 34.56666, 34.586666)
Lon <- c(9.888,9.88,9.8761,9.888064, 9.78888,9.318888)
x = data.frame(cbind(Type,Lat,Lon))
Here's how each row would fare under the regex:
1. human: fails because Lat only has two decimal places, even though Lon passes.
2. camera: fails because both columns only have two decimal places.
3. ebird: fails because Lat repeats the same digit starting at the first decimal place, and the repetition continues to the end of the number.
4. museum: passes the regex.
5. specimen: fails because the repeated digit starts at the second decimal place and continues for at least 3 repetitions all the way to the end.
6. gbif: passes the regex.
So the resulting data frame from this regex filter would be:
Type <-c("museum","gbif")
Lat <- c(34.666582, 34.586666)
Lon <- c(9.888064, 9.318888)
x = data.frame(cbind(Type,Lat,Lon))
The function below outputs the desired data frame and meets all of the requirements stated above. It uses str_extract() and str_split() from the stringr package.
library(stringr)  # for str_extract() and str_split()

check.expressions <- function(data){
  data$pass <- FALSE
  for(i in 1:nrow(data)){
    if(nchar(str_extract(data$Lon[i], "(?<=\\.).*")) < 3 | nchar(str_extract(data$Lat[i], "(?<=\\.).*")) < 3){
      next
    } else {
      lon <- unlist(str_split(str_extract(data$Lon[i], "(?<=\\.).*"), ""))
      lat <- unlist(str_split(str_extract(data$Lat[i], "(?<=\\.).*"), ""))
      if(lon[1] == lon[2] && lon[2] == lon[3]){
        if(length(lon) > 3){
          if(lon[3] != lon[length(lon)]){
            data$pass[i] <- TRUE
            next
          } else {
            next
          }
        }
        next
      }
      if(lat[1] == lat[2] && lat[2] == lat[3]){
        if(length(lat) > 3){
          if(lat[3] != lat[length(lat)]){
            data$pass[i] <- TRUE
            next
          } else {
            next
          }
        }
        next
      }
      if(length(lon) > 4){
        if(lon[2] == lon[3] && lon[3] == lon[4]){
          if(lon[4] != lon[length(lon)]){
            data$pass[i] <- TRUE
            next
          } else {
            next
          }
        }
      }
      if(length(lat) > 4){
        if(lat[2] == lat[3] && lat[3] == lat[4]){
          if(lat[4] != lat[length(lat)]){
            data$pass[i] <- TRUE
            next
          }
        }
      }
      data$pass[i] <- TRUE
    }
  }
  data <- data[data$pass == TRUE, ]
  return(data)
}
The function call is then simply:
check.expressions(x) -> x.out
which would produce:
> check.expressions(x) -> x.out
> x.out
    Type       Lat      Lon pass
4 museum 34.666582 9.888064 TRUE
6   gbif 34.586666 9.318888 TRUE
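Since the question literally asks for a regular expression, here is a vectorized sketch of the same filter using grepl() (my own illustration, not part of the answer above, assuming the redundancy rule described in the question: a run of three or more identical digits that reaches the end of the decimals and starts at the first or second decimal place disqualifies a row):

library(stringr)

dec <- function(v) str_extract(as.character(v), "(?<=\\.)\\d+")  # digits after the decimal point
bad <- function(d) {
  is.na(d) | nchar(d) < 3 |                      # fewer than three decimal places
    grepl("^(\\d)\\1{2,}$", d, perl = TRUE) |    # one digit repeated from the 1st decimal to the end
    grepl("^\\d(\\d)\\1{2,}$", d, perl = TRUE)   # one digit repeated from the 2nd decimal to the end
}
x.out2 <- x[!(bad(dec(x$Lat)) | bad(dec(x$Lon))), ]

On the sample data this keeps the museum and gbif rows, matching the output of check.expressions().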

Multiple if statements within loop in R

I have an R script that retrieves CSV files from a daily email in my Outlook and then, depending on whether the date in the email subject is greater than a set date, moves them to a specific folder.
The code splits the subject line to extract the date, which can sit in one of two positions in the string due to recent changes.
I have an if statement that successfully locates the date within the string in either case, but I can't then use a second if statement to check whether the output of the first if statement is greater than the sample date.
Below is the code I am trying to execute (I have included reproducible data):
# Test data
testLoop <- c("[EXTERNAL] Test Promo Sessions was executed at 28062019 100005",
              "[EXTERNAL] Test Promo Sessions was executed at 29062019 100023",
              "Test Promo Sessions was executed at 30062019 100007",
              "Test Promo Sessions was executed at 01072019 100043",
              "Test Promo Sessions was executed at 02072019 100049",
              "Test Promo Sessions was executed at 03072019 100001")

# Example date
todaysDateFormatted2 <- '30062019'

# Loop
for(i in testLoop){
  if(if(nchar(i) == 51){
       strptime(sapply(strsplit(i, "\\s+"), "[", 7),"%d%m%Y")
     } else {
       strptime(sapply(strsplit(i, "\\s+"), "[", 8),"%d%m%Y")
     } > strptime(todaysDateFormatted2,"%d%m%Y")){
    print("greater than - move file")
  } else {
    print("not greater than - do nothing")
  }
}
When attempting to execute this code, I get the following error; however, I'm not sure how to interpret it:
[1] "not greater than - do nothing"
[1] "not greater than - do nothing"
Error in if (if (nchar(i) == 51) { :
argument is not interpretable as logical
In addition: Warning message:
In if (if (nchar(i) == 51) { :
the condition has length > 1 and only the first element will be used
There were several flaws in your code. The nested if was odd, and the result of strptime() goes nowhere unless you assign it to something, below t; the else branch needs to assign to t as well. With t assigned, you can compare it to todaysDateFormatted2 and print the result for each iteration.
for (i in testLoop) {
  if (nchar(i) == 51) {
    t <- strptime(sapply(strsplit(i, "\\s+"), "[", 7),"%d%m%Y")
  } else {
    t <- strptime(sapply(strsplit(i, "\\s+"), "[", 8),"%d%m%Y")
  }
  if (t > strptime(todaysDateFormatted2,"%d%m%Y")) {
    print("greater than - move file")
  } else {
    print("not greater than - do nothing")
  }
}
# [1] "not greater than - do nothing"
# [1] "not greater than - do nothing"
# [1] "not greater than - do nothing"
# [1] "greater than - move file"
# [1] "greater than - move file"
# [1] "greater than - move file"
The code in the OP fails because R does not consistently resolve the inner if() statement to a vector of length 1, which causes the outer if() to fail as described in the OP.
If the intent of the code is to decide whether to move a file based on the date in a file name, a simpler version of the code can accomplish what is desired. Here, we reduce the levels of nesting by using lapply() and saving the output from the original inner if() clause to an object. We then compare the saved object to the object representing today's date and write a message to the R log.
# Test data
testLoop <- c("[EXTERNAL] Test Promo Sessions was executed at 28062019 100005",
              "[EXTERNAL] Test Promo Sessions was executed at 29062019 100023",
              "Test Promo Sessions was executed at 30062019 100007",
              "Test Promo Sessions was executed at 01072019 100043",
              "Test Promo Sessions was executed at 02072019 100049",
              "Test Promo Sessions was executed at 03072019 100001")

# Example date
todaysDateFormatted2 <- '30062019'

datesProcessed <- lapply(testLoop, function(x){
  if(nchar(x) == 51) y <- strptime(sapply(strsplit(x, "\\s+"), "[", 7),"%d%m%Y")
  else y <- strptime(sapply(strsplit(x, "\\s+"), "[", 8),"%d%m%Y")
  if(y > strptime(todaysDateFormatted2,"%d%m%Y")) message("greater than - move file")
  else message("not greater than - do nothing")
  y
})
...and the output:
> datesProcessed <- lapply(testLoop,function(x){
+ if(nchar(x) == 51) y <- strptime(sapply(strsplit(x, "\\s+"), "[", 7),"%d%m%Y")
+ else y <- strptime(sapply(strsplit(x, "\\s+"), "[", 8),"%d%m%Y")
+ if(y > strptime(todaysDateFormatted2,"%d%m%Y")) message("greater than - move file")
+ else message("not greater than - do nothing")
+ y
+ })
not greater than - do nothing
not greater than - do nothing
not greater than - do nothing
greater than - move file
greater than - move file
greater than - move file
>
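As an aside (not part of either answer), the nchar(i) == 51 position check can be avoided altogether by extracting the 8-digit date with a regular expression, so it no longer matters where it sits in the subject line. A minimal sketch:

# Pull the 8-digit date that precedes the trailing 6-digit time, wherever it appears.
dates <- as.Date(sub(".* ([0-9]{8}) [0-9]{6}$", "\\1", testLoop), format = "%d%m%Y")
moveFile <- dates > as.Date(todaysDateFormatted2, format = "%d%m%Y")
moveFile
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE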

R -- screening Excel rows according to characteristics of multiple cells

I am trying to eliminate all rows in Excel that have the following features:
First column is an integer
Second column begins with an integer
Third column is empty
The code I have written appears to run indefinitely. CAS.MULT is the name of my dataframe.
for (i in 1:nrow(CAS.MULT)) {
  testInteger <- function(x) {
    test <- all.equal(x, as.integer(x), check.attributes = FALSE)
    if (test == TRUE) {
      return(TRUE)
    } else {
      return(FALSE)
    }
  }
  if (testInteger(as.integer(CAS.MULT[i,1])) == TRUE) {
    if (testInteger(as.integer(substring(CAS.MULT[i,2],1,1))) == TRUE) {
      if (CAS.MULT[i,3] == '') {
        CAS.MULT <- data.frame(CAS.MULT[-i,])
      }
    }
  }
}
You should be very wary of deleting rows within a for loop; it often leads to undesired behavior. There are a number of ways you could handle this. For instance, you can flag the rows for deletion and then delete them afterwards.
Another thing I noticed is that you are converting your columns to integers before passing them to your function that tests whether they are integers, so you will incorrectly return TRUE for all values passed to the function.
Maybe something like this would work (without a reproducible example it's hard to say if it will work or not):
toDelete <- numeric(0)
for (i in 1:nrow(CAS.MULT)) {
  testInteger <- function(x) {
    test <- all.equal(x, as.integer(x), check.attributes = FALSE)
    if (test == TRUE) {
      return(TRUE)
    } else {
      return(FALSE)
    }
  }
  if (testInteger(CAS.MULT[i,1]) == TRUE) {
    if (testInteger(substring(CAS.MULT[i,2],1,1)) == TRUE) {
      if (CAS.MULT[i,3] == '') {
        toDelete <- c(toDelete, i)
      }
    }
  }
}
if (length(toDelete) > 0) CAS.MULT <- CAS.MULT[-toDelete, ]  # guard against an empty toDelete
Hard to be sure without testing my code on your data, but this might work. Instead of a loop, the code below uses logical indexing based on the conditions you specified in your question. This is vectorized (meaning it operates on the entire data frame at once, rather than by row) and is much faster than looping row by row:
# All three conditions must hold for a row to be dropped, matching the nested ifs above.
CAS.MULT.screened = CAS.MULT[!(CAS.MULT[,1] %% 1 == 0 &
                               as.numeric(substring(CAS.MULT[,2],1,1)) %% 1 == 0 &
                               CAS.MULT[,3] == ""), ]
For more on checking whether a value is an integer, see this SO question.
One other thing: Just for future reference, for efficiency you should define your function outside the loop, rather than recreating the function every time through the loop.
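For instance, the helper from the answer above could be defined once before the loop. A small sketch of that refactor (using isTRUE() to collapse the TRUE/FALSE branching, otherwise unchanged):

# Defined once, outside the loop, so it is not recreated on every iteration.
testInteger <- function(x) {
  isTRUE(all.equal(x, as.integer(x), check.attributes = FALSE))
}

The loop body then calls testInteger() exactly as before.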

Finding duplicate values in r

So, in a string containing multiple 1's (for example "1101101101"), the number '1' can appear at several positions that repeat at a regular interval. What I want is that interval, which here would be (3).
This is not a complete answer, but some ideas (partly based on comments):
z <- "1101101101"
zz <- as.numeric(strsplit(z,"")[[1]])
Compute autocorrelation function and draw plot: in this case I'm getting the periodicity=3 pretty crudely as the first point at which there is an increase followed by a decrease ...
a1 <- acf(zz)
first.peak <- which(diff(sign(diff(a1$acf[,,1])))==-2)[1]
Now we know the periodicity is 3; create runs of 3 with embed() and analyze their similarities:
ee <- embed(zz,first.peak)
pp <- apply(ee,1,paste,collapse="")
mm <- outer(pp,pp,"==")
aa <- apply(mm[!duplicated(mm),],1,which)
sapply(aa,length) ## 3 3 2 ## number of repeats
sapply(aa,function(x) unique(diff(x))) ## 3 3 3
The following code does exactly what you ask for. Try it with str_groups('1101101101'). It returns a list of 3-vectors. Note that the first triple is (1, 3, 4) because the character at the 10th position is also a 1.
Final version, optimized and without errors
str_groups <- function (s) {
  digits <- as.numeric(strsplit(s, '')[[1]])
  index1 <- which(digits == 1)
  len <- length(digits)
  back <- length(index1)
  if (back == 0) return(list())
  maxpitch <- (len - 1) %/% 2
  patterns <- matrix(0, len, maxpitch)
  result <- list()
  for (pitch in 1:maxpitch) {
    divisors <- which(pitch %% 1:(pitch %/% 2) == 0)
    while (index1[back] > len - 2 * pitch) {
      back <- back - 1
      if (back == 0) return(result)
    }
    for (startpos in index1[1:back]) {
      if (patterns[startpos, pitch] != 0) next
      pos <- seq(startpos, len, pitch)
      if (digits[pos[2]] != 1 || digits[pos[3]] != 1) next
      repeats <- length(pos)
      if (repeats > 3) for (i in 4:repeats) {
        if (digits[pos[i]] != 1) {
          repeats <- i - 1
          break
        }
      }
      continue <- F
      for (subpitch in divisors) {
        sublen <- patterns[startpos, subpitch]
        if (sublen > pitch / subpitch * (repeats - 1)) {
          continue <- T
          break
        }
      }
      if (continue) next
      for (i in 1:repeats) patterns[pos[i], pitch] <- repeats - i + 1
      result <- append(result, list(c(startpos, pitch, repeats)))
    }
  }
  return(result)
}
Note: this algorithm has roughly quadratic runtime complexity, so if you make your strings twice as long, it will take four times as much time to find all patterns on average.
Pseudocode version
To aid understanding of the code. For particulars of R functions such as which, consult the R online documentation, for example by running ?which on the R command line.
PROCEDURE str_groups WITH INPUT $s (a string of the form /(0|1)*/):
    digits := array containing the digits in $s
    index1 := positions of the digits in $s that are equal to 1
    len := pointer to last item in $digits
    back := pointer to last item in $index1
    IF there are no items in $index1, EXIT WITH empty list
    maxpitch := the greatest possible interval between 1-digits, given $len
    patterns := array with $len rows and $maxpitch columns, initially all zero
    result := array of triplets, initially empty
    FOR EACH possible $pitch FROM 1 TO $maxpitch:
        divisors := array of divisors of $pitch (including 1, excluding $pitch)
        UPDATE $back TO the last position at which a pattern could start;
            IF no such position remains, EXIT WITH result
        FOR EACH possible $startpos IN $index1 up to $back:
            IF $startpos is marked as part of a pattern, SKIP TO NEXT $startpos
            pos := possible positions of pattern members given $startpos, $pitch
            IF either the 2nd or 3rd $pos is not 1, SKIP TO NEXT $startpos
            repeats := the number of positions in $pos
            IF there are more than 3 positions in $pos THEN
                count how long the pattern continues
                UPDATE $repeats TO the length of the pattern
            END IF (more than 3 positions)
            FOR EACH possible $subpitch IN $divisors:
                check $patterns for pattern with interval $subpitch at $startpos
                IF such a pattern is found AND it envelopes the current pattern,
                    SKIP TO NEXT $startpos
                    (using helper variable $continue to cross two loop levels)
                END IF (pattern found)
            END FOR (subpitch)
            FOR EACH consecutive position IN the pattern:
                UPDATE $patterns at row of position and column of $pitch TO ...
                    ... the remaining length of the pattern at that position
            END FOR (position)
            APPEND the triplet ($startpos, $pitch, $repeats) TO $result
        END FOR (startpos)
    END FOR (pitch)
    EXIT WITH $result
END PROCEDURE (str_groups)
Perhaps the following route will help:
Convert the string to a vector of integers:
v <- as.integer(strsplit(s, "")[[1]])
Repeatedly convert this vector to matrices with varying numbers of rows...
m <- matrix(v, nrow=...)
...and use rle to find relevant patterns in the rows of the matrix m:
rle(m[1, ]); rle(m[2, ]); ...
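A minimal sketch of that route (my own illustration, not from the answer, using the example string from the first answer and only reporting runs of at least three 1s):

v <- as.integer(strsplit("1101101101", "")[[1]])

for (n in 2:(length(v) %/% 2)) {
  # Pad with 0s so the length is a multiple of n, then wrap into n rows.
  padded <- c(v, rep(0, (n - length(v) %% n) %% n))
  m <- matrix(padded, nrow = n)
  # A row that is a single run of three or more 1s means the 1s at those
  # positions recur with period n.
  hits <- apply(m, 1, function(r) {
    runs <- rle(r)
    length(runs$values) == 1 && runs$values[1] == 1 && runs$lengths[1] >= 3
  })
  if (any(hits)) cat("period", n, "starting at position", which(hits)[1], "\n")
}
# period 3 starting at position 1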
