I have a data file here, which is imported into R by:
eya4_lagan_HM_cp <- "E:/blahblah/eya4_lagan_HM_cp.txt"
eya4_lagan_HM_cp <- readChar(eya4_lagan_HM_cp, file.info(eya4_lagan_HM_cp)$size)
Label the first string with position "1" and the last string as position "311,522" (note the sequence contains in total 311,522 characters). I have two queries which are closely related.
Query 1)
Now I have a data file with a list of positions here. The positions are read in "pairs", that is, take the first pair 44184 and 44216 as an example. I wish to delete the subsequence from position 44184 (inclusive) to position 44216 (inclusive) from the previous sequence eya4_lagan_HM_cp and in its place, insert the character #. In other words, substitute the subsequence from 44184 to 44216 with #. I would like to do this with the rest of the pairs, that is, for 151795 and 151844, I want to delete from position 151795 (inclusive) to 151844 (inclusive) in eya4_lagan_HM_cp and replace it with #, and so on.
Query 2)
Now I would like to do something slightly different with the data file with the list of positions. Take the first pair as an example again. I would like to insert a # right before position 44184, in other words, insert a # between positions 44183 and 44184 in eya4_lagan_HM_cp and then I would like to insert a # right after position 44216, i.e., insert a # between positions 44216 and 44217. I would like to repeat this procedure for all position pairs. So for the next pair, I would like a # right before 151795 and a # right after 151844.
Thank you.
e <- eya4_lagan_HM_cp <- readChar("eya4_lagan_HM_cp.txt", file.info("eya4_lagan_HM_cp.txt")$size)
pairs <- as.numeric(readLines("CDS coordinates.txt"))
idx1 <- pairs[seq(1, length(pairs), 2)]
idx2 <- pairs[seq(2, length(pairs), 2)]
e.split <- strsplit(e, "")[[1]]
# no1
hashIndices <- unlist(mapply(seq, from=idx1, to=idx2))
e.split[hashIndices] <- "#"
e.new <- paste(e.split, collapse="")
# no2
for (idx in c(idx1, idx2+1))
e.split <- c(e.split[1:(idx-1)], "#", e.split[idx:length(e.split)])
e.new <- paste(e.split, collapse="")
Edit:
Another try with reference to the comment: After e.split <- strsplit(e, "")[[1]] either
# no1
deleteIndices <- unlist(mapply(seq, from=idx1+1, to=idx2))
e.split[idx1] <- "#"
e.new <- paste(e.split[-deleteIndices], collapse="")
or
# no2
for (idx in c(idx1, idx2+2))
e.split <- c(e.split[1:(idx-1)], "#", e.split[idx:length(e.split)])
e.new <- paste(e.split, collapse="")
If you can assume the strings that are being replaced are unique, you might try a combination of substr() and gsub(). (If you only had to do the replacement once, you would only need substr.) For example if you loaded your pairs of positions into a 2-column matrix pp your query 1 could be
for(i in 1:nrow(pp)) {
ss <- substr(eya4_lagan_HM_cp,start=pp[i,1],stop=pp[i,2])
eya4_lagan_HM_cp = gsub(ss,"#",eya4_lagan_HM_cp)
}
and query 2
for(i in 1:nrow(pp)) {
ss <- substr(eya4_lagan_HM_cp,start=pp[i,1],stop=pp[i,2])
eya4_lagan_HM_cp <- gsub(ss,paste("#",ss,"#",sep=""),eya4_lagan_HM_cp)
}
If you can't assume the strings to be replaced will be unique, you could explode out the string eya4_lagan_HM_cp into a vector of character strings:
vv <-unlist(strsplit(eya4_lagan_HM_cp,split=""))
use vector subsetting to remove/insert, e.g., for query 1,
new.vv <- c(vv[1:(pp[1,1]-1)],"#")
for(i in 1:(nrow(pp)-1)) {
new.vv <-c(new.vv,vv[(pp[i,2]+1):(pp[(i+1),1]-1)],"#")
}
new.vv <- c(new.vv,vv[(pp[2,nrow(pp)]+1):length(vv)])
and then paste back together as one string
eya4_lagan_HM_cp <- paste(new.vv,sep="")
Related
I have a data frame 'key_words' with vectors of pairs of words
key_words <- data.frame( c1 = ('word1','word2'), c2 = ('word3, word4'), c3 = ('word5','word6'))
I would like to search for these pairs of key words in a character column 'text' in another data frame 'x' where each row can be a few sentences long. I want to grab the word following two consecutive matches of a column in the key_words data frame and insert that value into a table at the same index of where the match was found. For example, if 'word1' and 'word2' are found one after the other in text[1] then I want to grab the word that comes after in text[1] and insert it into table[1].
I have tried splitting each row in 'text' into a list, separating by a single space so that each word has its own index in each row. I have the following idea which seems very inefficient and I'm running into problems where the character value temp_list[k] is of length 0.
x <- x %>% mutate(text = strsplit(text, " "))
for (i in 1:ncol(key_words)) {
word1 <- key_words[i, 1]
word2 <- key_words[i, 2]
for (j in 1:length(x$text)) {
temp_list <- as.list(unlist(x$text[[j]]))
for (k in 1:length(temp_list))
if (word1 == temp_list[k]) {
if (word2 == temp_list[k + 1]) {
table$word_found[j] <- temp_list[k + 2]
}
}
}
Is there a better way to do this or can I search through the text column for 'word1 word2' and grab the next word which can be any length? I'm new to R and coding in general, but I know I should be avoiding nested loops like this. Any help would be appreciated, thanks!!
I would suggest that you create a small function like this one, that returns the word following the occurrence of the pair 'w1 w2'
get_word_after_pair <- function(text,w1,w2) {
stringr::str_extract(text, paste0("(?<=\\b", w1, "\\s", w2, "\\b\\s)\\w*(?=\\b)"))
}
and then you can do this
data.frame(
lapply(key_words, function(x) get_word_after_pair(texttable$text,x[1],x[2]))
)
Input:
(keywords is a list of word pairs, texttable is a frame with a column text)
key_words <- list( pair1 = c('has','important'), pair2 = c('sentence','has'), pair3 = c('third','sentence'))
texttable = data.frame(text=c("this sentence has important words that we must find",
"this second sentence has important words to find",
"this is the third sentence and it also has important words within")
)
Output:
pair1 pair2 pair3
1 words important <NA>
2 words important <NA>
3 words <NA> and
I have a set of files which I had named incorrectly. The file name is as follows.
Generation_Flux_0_Model_200.txt
Generation_Flux_101_Model_43.txt
Generation_Flux_11_Model_3.txt
I need to replace the second number (the model number) by adding 1 to the existing number. So the correct names would be
Generation_Flux_0_Model_201.txt
Generation_Flux_101_Model_44.txt
Generation_Flux_11_Model_4.txt
This is the code I wrote. I would like to know how to specify the position of the number (replace second number in the string with the new number)?
reNameModelNumber <- function(modelName){
#get the current model number
modelNumber = as.numeric(unlist(str_extract_all(modelName, "\\d+"))[2])
#increment it by 1
newModelNumber = modelNumber + 1
#building the new name with gsub
newModelName = gsub(" regex ", newModelNumber, modelName)
#rename
file.rename(modelName, newModelName)
}
reactionModels = list.files(pattern = "^Generation_Flux_\\d+_Model_\\d+.txt$")
sapply(reactionFiles, function(x) reNameModelNumber(x))
We can use gsubfn to incremement by 1. Capture the digits ((\\d+))
followed by a . and 'txt' at the end ($`) of the string, and replace it by adding 1 to it
library(gsubfn)
gsubfn("(\\d+)\\.txt$", ~ as.numeric(x) + 1, str1)
#[1] "Generation_Flux_0_Model_201" "Generation_Flux_101_Model_44"
#[3] "Generation_Flux_11_Model_4"
data
str1 <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")
Answering the question, if you want to increment a certain number inside a string, you may use
> library(gsubfn)
> nth = 2
> reactionFiles <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt", "Generation_Flux_11_Model_3.txt")
> gsubfn(paste0("^((?:\\D*\\d+){", nth-1, "}\\D*)(\\d+)"), function(x,y,z) paste0(x, as.numeric(y) + 1), reactionFiles)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt" "Generation_Flux_11_Model_4.txt"
nth here is the number of the digit chunk to increment.
Pattern details
^((?:\\D*\\d+){n}\\D*) - Capturing group 1 (the value is accessed in the gsubfn method via x):
(?:\\D*\\d+){n} - an n occurrences of
\\D* - 0 or more chars other than digits
\\d+ - 1+ digits
\\D* - 0+ non-digits
(\\d+) - Capturing group 2 (the value is accessed in the gsubfn method via y): one or more digits
Using base-R.
data <- c( # Just an example
"Generation_Flux_0_Model_200.txt",
"Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt"
)
fixNameModel <- function(data){
n <- length(data)
# get the current model number and increment it by 1
newn = as.integer(sub(".+_(\\d+)\\.txt", "\\1", data)) + 1L
#building the new name with gsub
newModelName <- vector(mode = "character", length = n)
for (i in 1:n) {
newModelName[i] <- gsub("\\d+\\.txt$", paste0(newn[i], ".txt"), data[i])
}
newModelName
}
fixNameModel(data)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
You can now do something like file.rename(modelName, fixNameModel(modelName))
EDIT:
Here is a bit neater version but makes stronger assumptions instead:
fixNameModel2 <- function(data) {
sapply(
strsplit(data, "_|\\."),
function(x) {
x[5] <- as.integer(x[5]) + 1L
x <- paste0(x, collapse = "_")
gsub("_txt", ".txt", x, fixed = TRUE)
}
)
}
Assuming that the digit always occurs before the extension, as is mentioned in the comments, here is another base R solution that is a little bit simpler.
sapply(regmatches(tmp, regexec("\\d+(?=\\.)", tmp, perl=TRUE), invert=NA),
function(x) paste0(c(x[1], as.integer(x[2]) + 1L, x[3]), collapse=""))
This returns
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
regexec with the invert=NA a list of indices where each list element is the index matching the portions of the full with the matched element returned as the second indexed element. regmatches takes this information and returns a list of character vectors that breaks up the original string along the matches. Feed this list to sapply, convert the second element to integer and increment. Then paste the result to return an atomic vector.
The regex "\d+(?=\.)" uses a perl look behind, "(?=\.)", looking for the dot without capturing it, but capturing the digits with "\d+".
data
tmp <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")
Assume a character vector like the following
file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt
Desired output:
file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw
I would like to compare the elements and remove parts of the string from start and end as much as possible but keep them unique.
The above one is just an example. The parts to be removed are not common to all elements. I need a general solution independent of the strings in the above example.
So far I have been able to chuck off parts that are common to all elements, provided the separator and the resulting split parts are of same length. Here is the function,
mf <- function(x,sep){
xsplit = strsplit(x,split = sep)
xdfm <- as.data.frame(do.call(rbind,xsplit))
res <- list()
for (i in 1:ncol(xdfm)){
if (!all(xdfm[,i] == xdfm[1,i])){
res[[length(res)+1]] <- as.character(xdfm[,i])
}
}
res <- as.data.frame(do.call(rbind,res))
res <- apply(res,2,function(x) paste(x,collapse="_"))
return(res)
}
Applying the above function:
a = c("a_samples.txt","b_samples.txt")
mf(a,"_")
V1 V2
"a" "b"
2.
> b = c("apple.fruit.txt","orange.fruit.txt")
> mf(b,sep = "\\.")
V1 V2
"apple" "orange"
If the resulting split parts are not same length, this doesn't work.
What about
files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files
... which yields
[1] "file1_p1_analysed" "file1_p1_raw" "f2_file2_p1_analysed" "f3_file3_p1_raw"
This removes the _samples.txt part from your strings.
Why not:
strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")
sapply(strings, function(x) {
pattern <- ".*(file[0-9].*)_samples\\.txt"
gsub(x, pattern = pattern, replacement = "\\1")
})
Things that match between ( and ) can be called back as a group in the replacement with backwards referencing. You can do this with \\1. You can even specify multiple groups!
Seeing your comment on Jan's answer. Why not define your static bits and paste together a pattern and always surround them with parentheses? Then you can always call \\i in the replacement of gsub.
An input is a string describing, for example, a range of seats in cinema where rows are named by double letters.
As the output, I would like to see a vector containing ALL individual seats.
Thanks!
Here is a solution:
# create a MWE:
v <- "AA1-AA5"
name <- unlist(strsplit(v, '-'))
# get numbers only
ind <- as.numeric(gsub(x = name, pattern = "[[:alpha:]]", replacement = ""))
# create new vector
names <- paste0("AA", ind[1]:ind[2])
I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. the intention is to check if all the characters of the shorter string exist in the other string. If that is the case, I should get a 100% match otherwise a percentage representing the percentage of characters that matched.
I tried using levenshteinSim in RecordLinkage package but it gives a number corresponding to the number of changes required to change one string to another.
install.packages("RecordLinkage")
require(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")
#[1] 0.3636364
I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.
Here is one approach
s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"
compare <- function(s1, s2) {
c1 <- unique(strsplit(s1, "")[[1]])
c2 <- unique(strsplit(s2, "")[[1]])
length(intersect(c1,c2))/length(c1)
}
compare(s1,s2)
#1
It may be a little slow, though. And it considers the space character as character, too. Use Vectorize to apply on a column:
dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
If the characters to be considered are only letters you could use:
comp <- function(s1, s2){
in1 = letters %in% strsplit(tolower(s1), "")[[1]]
in2 = letters %in% strsplit(tolower(s2), "")[[1]]
sum(in1 & in2)/sum(in1)
}