I would like to use R to compare written text and extract sections which differ between the elements.
Consider two text paragraphs a and b, where one is a modified version of the other:
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."
I want to compare the two strings and receive as output the part that is unique to each of them, preferably separately for both input strings.
Expected output:
stringdiff <- list(a = " This part is old.", b = "This string is updated. ")
> stringdiff
$a
[1] " This part is old."
$b
[1] "This string is updated. "
I've tried a solution from Extract characters that differ between two strings, but this only compares unique characters. The answer in Simple Comparing of two texts in R comes closer, but still only compares unique words.
Is there any way to get the expected output without too much of a hassle?
We concatenate both strings and split at the space after the . to create a list of sentences ('lst'). Unlisting 'lst' and taking the unique elements gives 'un1'; setdiff then returns, for each list element, the sentences of 'un1' that it does not contain:
lst <- strsplit(c(a= a, b = b), "(?<=[.])\\s", perl = TRUE)
un1 <- unique(unlist(lst))
lapply(lst, setdiff, x= un1)
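A close variant of the same strsplit() idea that returns the sentence unique to each input string (the stray leading/trailing spaces shown in the expected output are lost, since splitting consumes the separator):

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

# Split each paragraph into sentences at the space that follows a "."
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)

# Compare each element against the other: sentences unique to a, then to b
stringdiff <- Map(setdiff, lst, rev(lst))
stringdiff
#> $a
#> [1] "This part is old."
#>
#> $b
#> [1] "This string is updated."
```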
Related
I'm trying to merge two strings based on their matching suffix/prefix. For example, given the two strings 'a' and 'b' below, I'm first using Biostrings::pairwiseAlignment to get their common suffix/prefix, which in this case is "cutie". I then need to merge the two strings. Concatenating would not help because I would get repeats.
This is all I have for now:
a <- "bahahahallocutie"
b <- "cutiepalaohaha"
pairwiseAlignment(a, b, type = "overlap")
Which gives me:
Overlap PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [12] cutie
subject: [1] cutie
score: 17.20587
What I want is to merge the two strings by the pattern that is the suffix of one and the prefix of the other:
"bahahahallocutiepalaohaha"
You may extract the pattern from the result of pairwiseAlignment, remove it from both strings with gsub, and then paste0 the pieces together to get the desired merged string. Note that in your final code you'll need to account for the order of the original strings.
library(Biostrings)
pat <- pairwiseAlignment(a, b, type = "overlap")#pattern
paste0(gsub(pat, "", a), pat, gsub(pat, "", b))
# [1] "bahahahallocutiepalaohaha"
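If Biostrings is not at hand, the same overlap merge can be sketched in base R by scanning for the longest suffix of `a` that is also a prefix of `b` (a hypothetical helper, not taken from the answer above):

```r
a <- "bahahahallocutie"
b <- "cutiepalaohaha"

# Try overlap lengths from longest to shortest; merge at the first match
merge_overlap <- function(a, b) {
  for (k in seq(min(nchar(a), nchar(b)), 1)) {
    if (substring(a, nchar(a) - k + 1) == substring(b, 1, k)) {
      return(paste0(a, substring(b, k + 1)))
    }
  }
  paste0(a, b)  # no overlap found: plain concatenation
}

merge_overlap(a, b)
#> [1] "bahahahallocutiepalaohaha"
```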
Here is the example data:
example_sentences <- data.frame(doc_id = c(1,2,3),
sentence_id = c(1,2,3),
sentence = c("problem not fixed","i like your service and would tell others","peope are nice however the product is rubbish"))
matching_df <- data.frame(x = c("not","and","however"))
Created on 2019-01-07 by the reprex package (v0.2.1)
I want to add/insert a comma just before certain words in a character string. For example, if my string is:
problem not fixed.
I want to convert this to
problem, not fixed.
The matching_df data frame contains the words to match (these are coordinating conjunctions), so if a word from matching_df$x is found in the sentence, a comma + space should be inserted before the detected word.
I have looked at the stringr package but am not sure how to achieve this.
I've no idea what the data frame you're talking about looks like, but I made a simple data frame containing some phrases here:
df <- data.frame(strings = c("problems not fixed.","Help how are you"),stringsAsFactors = FALSE)
I then made a vector of words to put a comma after:
words <- c("problems","no","whereas","however","but")
Then I put the data frame of phrases through a simple for loop, using gsub to substitute the word for a word + comma:
for (i in seq_along(df$strings)) {
  # words in this phrase that also appear in the lookup vector
  findWords <- intersect(unlist(strsplit(df$strings[i], " ")), words)
  if (length(findWords) > 0) {  # intersect() returns character(0), never NULL
    for (j in findWords) {
      # accumulate substitutions in df$strings[i] so multiple matches all stick
      df$strings[i] <- gsub(j, paste0(j, ","), df$strings[i])
    }
  }
}
Output:
df
strings
1 problems, not fixed.
2 Help how are you
The gsubfn function in the gsubfn package takes a regular expression as the first argument and a list (or certain other objects) as the second argument where the names of the list are strings to be matched and the values in the list are the replacement strings.
library(gsubfn)
gsubfn("\\w+", as.list(setNames(paste0(matching_df$x, ","), matching_df$x)),
format(example_sentences$sentence))
giving:
[1] "problem not, fixed "
[2] "i like your service and, would tell others "
[3] "peope are nice however, the product is rubbish"
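Since the question asks for the comma *before* the conjunction and mentions stringr, here is a base-R sketch (not from either answer above) that builds one alternation pattern from matching_df and inserts ", " in front of each match:

```r
example_sentences <- data.frame(
  doc_id = c(1, 2, 3),
  sentence_id = c(1, 2, 3),
  sentence = c("problem not fixed",
               "i like your service and would tell others",
               "peope are nice however the product is rubbish"),
  stringsAsFactors = FALSE)
matching_df <- data.frame(x = c("not", "and", "however"), stringsAsFactors = FALSE)

# One alternation pattern: a space followed by any matching whole word
pat <- paste0("\\s(", paste(matching_df$x, collapse = "|"), ")\\b")
res <- gsub(pat, ", \\1", example_sentences$sentence, perl = TRUE)
res
#> [1] "problem, not fixed"
#> [2] "i like your service, and would tell others"
#> [3] "peope are nice, however the product is rubbish"
```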
I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new at programming, so please be patient :D...
I have a (in reality much larger) text string which looks like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I have looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R to look at what's inside the brackets, look up that same number in the lookup table, and give out what's in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex lookaround to match only numbers that are inside square brackets:
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test 299, test [3]."
Or without regex lookaround, by pasting the square brackets onto each column of 'lookup':
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a 2-column table) into a data frame. If your source data is a flat text file, you can easily use read.csv to obtain a data frame. In the example below, I hard-code a data frame with just two entries. Then I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test [5] test. Test [3]."
Note: As #Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) have replacement values that also appear as the numbers of other markers. That is, if replacing a key with a value produces yet another key, there could be problems. If this is a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])
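For convenience, the regmatches<- approach from the last answer as a self-contained, runnable snippet with the question's data (with orderedNbr coerced to character before assignment):

```r
string <- "Test test [438] test. Test 299, test [82]."
lookup <- data.frame(Number = c(270, 299, 82, 314, 438), orderedNbr = 1:5)

patt <- "(?<=\\[)\\d+(?=\\])"            # digits inside square brackets only
m <- gregexpr(patt, string, perl = TRUE)
v <- as.integer(unlist(regmatches(string, m)))

# Swap every bracketed number for its orderedNbr in a single pass
regmatches(string, m) <- list(as.character(lookup$orderedNbr[match(v, lookup$Number)]))
string
#> [1] "Test test [5] test. Test 299, test [3]."
```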
I need to write a function that finds the most common word in a string of text, where I define a "word" as any sequence of letters.
It should return the most common word.
For general purposes, it is better to use boundary("word") in stringr:
library(stringr)
most_common_word <- function(s){
  # unlist() the str_split() result so table() counts the individual words
  which.max(table(unlist(str_split(s, boundary("word")))))
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
Here is a function I designed. Notice that I split the string on white space, removed any leading or trailing white space, removed ".", and converted all upper case to lower case. Finally, if there is a tie, I always report the first word. These are assumptions you should think about for your own analysis.
# Create example string
string <- "This is a very short sentence. It has only a few words."
library(stringr)
most_common_word <- function(string){
string1 <- str_split(string, pattern = " ")[[1]] # Split the string
string2 <- str_trim(string1) # Remove white space
string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
string4 <- tolower(string3) # Convert to lower case
word_count <- table(string4) # Count the word number
return(names(word_count[which.max(word_count)][1])) # Report the most common word
}
most_common_word(string)
[1] "a"
Hope this helps:
most_common_word <- function(x){
  # Split the string into single words for counting
  splitTest <- strsplit(x, " ")[[1]]
  # Count the words
  count <- table(splitTest)
  # Sort to select only the highest value, which is the first one
  count <- count[order(count, decreasing = TRUE)][1]
  # Return the desired character.
  # By changing this you can choose whether it shows the number of times a word repeats
  return(names(count))
}
You can use return(count) to show the word plus the number of times it repeats. This function has problems when two words are repeated the same number of times, so beware of that.
The order function puts the highest count first (when used with decreasing = TRUE); ties are then sorted alphabetically by name. If the words 'a' and 'b' are repeated the same number of times, only 'a' gets displayed by the most_common_word function.
Using the tidytext package, taking advantage of established parsing functions:
library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
unnest_tokens(data.frame(sentence = test_sentence,
stringsAsFactors = FALSE), word, sentence) %>%
count(word, sort = TRUE)
}
word_count("This is a very short sentence. It has only a few words.")
This gives you a table with all the word counts. You can adapt the function to obtain just the top one, but be aware that there will sometimes be ties for first, so perhaps it should be flexible enough to extract multiple winners.
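For comparison, a package-free sketch under the same assumptions discussed above (lower-casing, splitting on non-letters, first maximum wins ties):

```r
most_common_word <- function(s) {
  words <- tolower(unlist(strsplit(s, "[^[:alpha:]]+")))  # split on non-letters
  words <- words[nzchar(words)]                           # drop empty tokens
  names(which.max(table(words)))                          # first maximum wins ties
}

most_common_word("This is a very short sentence. It has only a few words: a, a. a")
#> [1] "a"
```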
I'm trying to concatenate multiple rows into one.
Each row either starts with a ">Gene Identifier" or contains sequence information:
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
Here I just put two genes, but there are hundreds of genes following this.
Basically I will leave the gene identifiers as they are, but I want to concatenate the sequences when they are split across multiple rows.
Therefore, the final results should look like this:
The sequences were concatenated into one row, without any space in between.
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
By using the "paste" function in R, I was able to achieve this manually,
i.e. paste(dat[2,1], dat[3,1], sep="")
However, I have a list of hundreds of genes, so I need a way to concatenate the rows automatically.
I was thinking of a for loop: basically, if the row starts with ">", skip it; if it does not start with ">", concatenate.
But I'm not an expert in bioinformatics/R, so it is hard for me to write a script that achieves this.
Any help would be greatly appreciated!
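The for loop sketched in the question can be written directly in base R; here is a minimal version with a short stand-in lines vector (in practice you would use lines <- readLines("your_file.fasta")):

```r
# Stand-in for readLines("your_file.fasta")
lines <- c(">g1|id1", "GCGGG", "AGGCG", ">g2|id2", "GCAGT")

out <- character(0)
for (line in lines) {
  if (startsWith(line, ">")) {
    out <- c(out, line, "")  # keep the identifier, open an empty sequence slot
  } else {
    # append this chunk to the current gene's sequence
    out[length(out)] <- paste0(out[length(out)], line)
  }
}
out
```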
Reading the lines in as they appear in the question, we can flag each identifier row, build a group id per gene with cumsum, and collapse each group's sequence rows:
Lines <-
readLines(textConnection(">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"))
geneIdx <- grepl("\\|", Lines)  # TRUE on the identifier rows
grp <- cumsum(geneIdx)          # one group number per gene
grp
#[1] 1 1 1 2 2 2 2 2 2
tapply(Lines, grp, FUN=function(x) c(x[1], paste(x[-1], collapse="") ) )
#----------------------
$`1`
[1] ">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714"
[2] "GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA"
$`2`
[1] ">Laptm4a|ENSMUSG00000020585|ENSMUST00000020909"
[2] "GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
Would regular expressions do the trick? The regular expression below deletes newlines (\\n) not followed by > ((?!>) being a negative lookahead).
text <-">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
cat(text)
cat(gsub("\\n(?!>)", "", text, perl=TRUE))
Result
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT