Match sentences containing a given word - sqlite

I have a sentence column in my table. I wish to select all sentences which contain a given word.
Words contain only the following letters: a-z, áéíóú
The only other character is a single space, separating each word in the sentence. There are no spaces at the start or end of a sentence. So sentences look like this:
"i am here"
"no im here"
Selecting sentences containing the word "i" should match only the first sentence above.
How should I select these rows from my table?

Based upon the information you have provided, there are four permutations you need to cover.
The word you are searching for is the first word in the sentence. In this case there will be no space before the word but there will be a space after the word.
The word you are searching for is neither the first word nor the last word in the sentence. In this case there will be a space on either side of the word.
The word you are searching for is the last word in the sentence. In this case there will be a space before the word but none after the word.
The word you are searching for is the entire sentence. In this case there will be no spaces at all.
So your query would look something like this ...
SELECT * FROM YourTableName
WHERE SentenceCol LIKE 'YOUR-SEARCH-WORD %'
OR SentenceCol LIKE '% YOUR-SEARCH-WORD %'
OR SentenceCol LIKE '% YOUR-SEARCH-WORD'
OR SentenceCol = 'YOUR-SEARCH-WORD'
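A quick way to sanity-check the pattern logic (a sketch using Python's sqlite3, with hypothetical table and column names; the final equality test covers a sentence that consists of the search word alone):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sentences (SentenceCol TEXT)")  # hypothetical names
con.executemany("INSERT INTO Sentences VALUES (?)",
                [("i am here",), ("no im here",), ("i",)])

word = "i"
rows = con.execute(
    """SELECT SentenceCol FROM Sentences
       WHERE SentenceCol LIKE ? || ' %'
          OR SentenceCol LIKE '% ' || ? || ' %'
          OR SentenceCol LIKE '% ' || ?
          OR SentenceCol = ?""",
    (word, word, word, word)).fetchall()

# Only "i am here" and the single-word sentence "i" come back;
# "no im here" is excluded because " im " is not " i ".
print([r[0] for r in rows])
```

Equivalently, you can pad both sides of the column and use a single test: ' ' || SentenceCol || ' ' LIKE '% ' || ? || ' %'.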

Related

How to remove characters between space and specific character in R

I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg") I end up with
[1] "This"
I know that what's happening is R is finding the first instance of a space and removing everything between that space and ".jpg" but I want it to be the first space immediately before ".jpg". My final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. The text continues here."
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.
You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the start of the pattern: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
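The same pattern works identically outside stringr; here is a minimal sketch in Python's re module, using the string from the question:

```python
import re

myString = ("This is my string I scraped from the web. I want to remove all "
            "instances of a picture. picture-file.jpg. The text continues "
            "here. picture-file2.jpg")

# Remove each run of non-whitespace ending in ".jpg",
# plus any whitespace immediately before it.
result = re.sub(r"\s*\S*\.jpg", "", myString)
print(result)
```

Note the doubled period left behind after "picture", which the question says is acceptable since punctuation is stripped later.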

Insert spaces between words that have meaning in R

I want to put spaces between words that have meaning in R.
For example I want to change this sentence :
sentence<-c("haveagoodday!")
to this one :
"have a good day !"
Is it possible?

Extract text from word and convert into Dataframe

I need to extract a specific portion of text that is in a Word (.docx). The document has the following structure:
Question 1:
How many ítems…
 two
 four
 five
 ten
Explanation:
There are four ítems in the bag.
Question 2:
How many books…
 two
 four
 five
Explanation:
There are four books in the bag.
With this information I have to create a Dataframe like this one:
I'm able to open the document, extract the text and print the lines starting with  , but I'm not able to extract the rest of the string of interest and create the Dataframe.
My code is:
import docx
import re

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

text = getText('document.docx')
strings = re.findall(r" (.+)\n", text)
Any help?
Thanks in advance
I would suggest you expand your regular expression to include all of the information you need. In this case I think you'll need two passes - one to get each question, and a second to parse the possible answers.
Take a look at your source text and break it down into the parts you need. Each item starts with Question n:, then a line for the actual question, multiple lines for each possible response, followed by Explanation and a line for the explanation. We'll use the grouping operator to extract the parts of interest.
The Question line can be described by the following pattern:
"Question ([0-9]+):\n"
The line that represents the actual question is just text:
"(.+)\n"
The collection of possible responses is a series of lines beginning with a special character (I've replaced it with '*' because I can't tell what character it is from the post), allowing for possible whitespace:
\*\s*.+\n
but we can get the whole list of them using a combination of grouping including the non-capturing group:
((?:\*\s*.+\n)+)
That causes any number of matching lines to be captured as a single group.
Finally you have "Explanation" possibly preceded by some whitespace, and followed by a line of text:
\s*Explanation:\n(.+)\n
If we put these all together, our regex pattern is
r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
Parsing this:
patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
matches = re.findall(patt, text)
yields:
[('1',
'How many ítems…',
'* two\n* four\n* five\n* ten\n',
'There are four ítems in the bag.'),
('2',
'How many books…',
'* two\n* four\n* five\n',
'There are four books in the bag.')]
Where each entry is a tuple. The 3rd item in each tuple is a text of all of the answers as a group, which you'll need to further break down.
The regex to match your answers (using the character '*') is:
\*\s*(.+)\n
Grouping it to eliminate the character, we can use:
r"(?:\*\s*(.+)\n)"
Finally, using a list comprehension we can replace the string value for the answers with a list:
matches = [(x[0], x[1], re.findall(r"(?:\*\s*(.+)\n)", x[2]), x[3]) for x in matches]
Yielding the result:
[('1',
'How many ítems…',
['two', 'four', 'five', 'ten'],
'There are four ítems in the bag.'),
('2',
'How many books…',
['two', 'four', 'five'],
'There are four books in the bag.')]
Now you should be prepared to massage that into your dataframe.
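For that last step, a minimal sketch with pandas (the column names are my own invention; pick whatever your target layout needs):

```python
import pandas as pd

# The list of tuples produced by the two regex passes above.
matches = [('1', 'How many ítems…', ['two', 'four', 'five', 'ten'],
            'There are four ítems in the bag.'),
           ('2', 'How many books…', ['two', 'four', 'five'],
            'There are four books in the bag.')]

# One row per question; the answers column holds a list per row.
df = pd.DataFrame(matches, columns=["number", "question", "answers", "explanation"])
print(df)
```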

How to prevent code from detecting and pulling patterns within words (Example: I want 'one' detected but not 'one' in the word al'one')?

I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being detected:
nums <- '(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?
Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b.
\b matches word boundaries, so this expression will never match inside a word.
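For example, in Python's re (a sketch, not the poster's exact code: the word list is shortened and "Ten" is added so the sample sentence matches, re.IGNORECASE mirrors the lowercase example text, and the optional space is grouped with the scale word so a trailing space isn't captured when no scale word follows):

```python
import re

nums = (r"\b(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten)"
        r"(\s(Hundred|Thousand|Million|Billion|Trillion))?\b")
lines = "Get ten eggs. That is what is written. I am one and alone."

# "ten" in "written" and "one" in "alone" are skipped: no word boundary there.
result = re.sub(nums, lambda m: f"<<{m.group(0)}>>", lines, flags=re.IGNORECASE)
print(result)
```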

SQLite: which character can be ignored with FTS match in one word

I need to find a special character that, when put in the middle of a word, SQLite FTS MATCH will ignore as if it did not exist, e.g.:
Text body: book's
If my match string is 'books', I need "book's" to be returned.
No problem using the porter or simple tokenizer.
I tried many characters for this, like book!s, book?s, book|s, book,s, book:s…, but when searching with MATCH for 'books', none of these are returned.
I don't understand why.
I am using contentless FTS4 tables and external content FTS4 tables; my text body has many such characters in each word, and they should be ignored when searching.
I cannot change the match query because I do not know where the special character in the word is. Also, I need the original word length to stay equal to the length of the FTS index word so I can use matchinfo() or snippet(); as such, I cannot remove these characters from the text body.
The default tokenizers do not ignore punctuation characters but treat them as word separators.
So the text body or match string book's will end up as two words, book and s.
These will never match a single word like books.
To ignore characters like ', you have to install your own custom tokenizer.
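You can see the separator behaviour directly (a minimal sketch using Python's sqlite3, assuming your SQLite build includes FTS4):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=simple)")
con.execute("INSERT INTO docs(body) VALUES ('book''s')")

# The simple tokenizer splits book's into the two words book and s,
# so 'books' finds nothing while 'book' matches the row.
no_hits = con.execute("SELECT count(*) FROM docs WHERE body MATCH 'books'").fetchone()[0]
one_hit = con.execute("SELECT count(*) FROM docs WHERE body MATCH 'book'").fetchone()[0]
print(no_hits, one_hit)
```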
