Returning text between a starting and ending regular expression - r

I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted: the full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not demarcated. The best I can figure is that the full text ends where a variety of metadata tags begin, such as Subject:, CREDIT:, or Credit:.
So, I can certainly get the start of the article. But, I am having a great deal of difficulty finding a way to select the text between the start and the end of the full text.
This is complicated by two factors. First, the ending string varies, although I feel I could settle on something like '^[:alnum:]{5,}: ' to capture the ending. Second, similar tags appear before the start of the full text. How do I get R to return only the text between the Full text regex and the ending regex?
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
My current attempt is here:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thank you.

This finds the element matching 'Full text:', then the first later element containing ':', and returns everything from the start up to (but not including) that element.
get_text <- function(x){
  start <- grep('Full text:', x)
  end <- grep(':', x)
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
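For readers working outside R, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the function name `get_text` mirrors the R answer, and the tag pattern (five or more letters or digits followed by a colon and a space) follows the question's guess rather than any confirmed format of the database export. Unlike the R version, it only treats a line as the end if it looks like a metadata tag, so a colon inside the article body does not cut the text short.

```python
import re

# Assumed shape of a metadata tag line, e.g. "Subject: ..." or "Credit: ..."
TAG = re.compile(r"^[A-Za-z0-9]{5,}: ")

def get_text(lines):
    """Return lines from 'Full text:' up to (not including) the next tag line."""
    # Index of the line that starts the article body
    start = next(i for i, line in enumerate(lines) if line.startswith("Full text:"))
    # Search for the closing tag only after the start line;
    # fall back to the end of the input if no tag line follows
    end = next(
        (i for i in range(start + 1, len(lines)) if TAG.match(lines[i])),
        len(lines),
    )
    return lines[start:end]

test = [
    "Document 1", "Article title", "Author: Author Name", "https://a/url",
    "Abstract: none", "Full text: some article text that I need to capture",
    "the second line of the article that I need to capture",
    "Subject: A subject", "Publication: Publication", "Location: A country",
]
# Same document, but the article ends at a "Credit:" tag instead
test2 = [line.replace("Subject:", "Credit:") for line in test]

print(get_text(test))
print(get_text(test2))
```

Because the end is searched for only after the start index, tags such as Author: or Abstract: that appear before the full text are ignored.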

Related

Judge whitespace or number using pyparsing

I am working on parsing structured text files by pyparsing and I have a problem judging whitespace or numerical number. My file looks like this:
RECORD 0001
TITLE (Main Reference Title)
AUTHOR (M.Brown)
Some files have more than one author then
RECORD 0002
TITLE (Main Reference Title 1)
AUTHOR 1(S.Red)
2(B.White)
I would like to parse files and convert them into dictionary format.
{"RECORD": "001",
"TITLE": "Main Reference Title 1",
"AUTHOR": {"1": "M.Brown"}
}
{"RECORD": "002",
"TITLE": "Main Reference Title 2",
"AUTHOR": {"1": "S.Red", "2": "B.White"}
}
I tried to parse the AUTHOR field by pyparsing (tried both 2.4.7 and 3.0.0b3). Following is the simplified version of my code.
from pyparsing import *
flag = White(" ",exact=1).set_parse_action(replace_with("1")) | Word(nums,exact=1)
flaged_field = Group(flag + restOfLine)
next_line = White(" ",exact=8).suppress() + flaged_field
authors_columns = (
    Keyword("AUTHOR").suppress()
    + White(" ", exact=2).suppress()
    + flaged_field  # parse first row
    + ZeroOrMore(next_line)  # parse next row
)
authors = authors_columns.search_string(f)
Here, 'f' contains all lines read from the file. With this code, I could only parse the author names that have numbering flags.
[]
[[['1', '(S.Red)'],['2','(B.White)']]]
However, if I only parse with whitespace
flag = White(" ",exact=1).set_parse_action(replace_with("1"))
it worked correctly for the files without numbering flags.
['1', '(M.Brown)']
[]
The number (or whitespace) in [9:10] has a meaning in my format, and I want to determine whether it is a whitespace or a digit (limited to 9 at most). I also replaced "|" with "^", swapped the order of the alternatives, and tried
flag = Word(nums+" ")
as well, but none of these works for me. Why doesn't matching White(" ") or Word(nums) work with my code? Could someone help me or give me an idea of how to solve this?
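This was solved by adding leave_whitespace(), which stops pyparsing from skipping leading whitespace before the expression, so White(" ") gets a chance to match the space.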
flag = (White(" ",exact=1).set_parse_action(replace_with("0")) | Word(nums,exact=1)).leave_whitespace()

Scrapy can't manage to request text with neither CSS or xPath

I've been trying to extract some text for a while now, and while everything works fine, there is something I can't manage to get.
Take this website : https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000
I want to get the texts from the class=listing-main-characteristics__number nodes (below the picture, the box with "2 chambres 1 salle de bain Aire habitable (s-sol exclu) 1,030 pi2 (95,69m2)"). There are 3 elements with that class in the page ("2", "1" and "1,030 pi² (95,69 m²)"). I've tried a bunch of options in XPath and CSS, but none has worked, and some gave back strange answers.
For example, with :
response.xpath('//span[@class="listing-main-characteristics__number"]').getall()
I get :
['<span class="listing-main-characteristics__number">\n 2\n </span>', '<span class="listing-main-characteristics__number">\n 1\n </span>']
For example, something else that works just fine on the same webpage :
response.xpath('//div[@property="description"]/p/text()').getall()
If I get all the spans with this query :
response.css('span::text').getall()
I can find the texts mentioned at the beginning in the output. But from this:
response.css('span[class=listing-main-characteristics__number]::text').getall()
I only get this
['\n 2\n ', '\n 1\n ']
Could someone clue me in with what kind of selection I would need? Thank you so much!
Here is the XPath to use (add /text() if you want the associated text):
//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]
In Scrapy:
response.xpath("//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]").getall()
Below is sample Python code (using Selenium):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
url = "https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000#description"
driver.get(url)
# get the output elements, then read the text from each of them
outputs = driver.find_elements(By.XPATH, "//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]")
for output in outputs:
    # replace newline characters with spaces and trim the text
    print(output.text.replace("\n", ' ').strip())
Output:
2 chambres
1 salle de bain
1,030 pi² (95,69 m²)
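The selection logic can be tried without a browser on a small, hypothetical fragment. The markup below is an assumption, reconstructed only from the class names used in the XPath above and not copied from the real page. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `|` union), so the two branches are queried separately. The helper `clean` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, built only to match the class names in the XPath above
fragment = """
<div data-label="#description">
  <div class="listing-main-characteristics__label">
    <span class="listing-main-characteristics__number">
      2
    </span>
    chambres
  </div>
  <div class="listing-main-characteristics__item-dimensions">
    <span>Aire habitable</span>
    <span>1,030 pi² (95,69 m²)</span>
  </div>
</div>
"""

def clean(element):
    """Join all text inside the element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())

root = ET.fromstring(fragment.strip())
# '@class' tests an attribute value; '[2]' picks the second span child (1-based)
labels = root.findall(".//div[@class='listing-main-characteristics__label']")
dims = root.findall(
    ".//div[@class='listing-main-characteristics__item-dimensions']/span[2]"
)
results = [clean(el) for el in labels + dims]
print(results)
```

Collapsing whitespace with `split()` and `join()` also removes the `\n` padding seen in the raw `::text` results from the question.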

Why are keywords not parsed first and omitted from free text matching?

I thought I understood pyparsing's logic, but cannot figure out why the bottom example is failing.
I'm trying to parse open text comments where a product or set of products can be mentioned either in the beginning or the end of the comment. Product names can also be omitted from the comment.
The output should be a list of the mentioned products and the description regarding them.
Below are some test cases. The parser is identifying everything as 'description' instead of first picking up the products (isn't that what the negative lookahead is supposed to do?)
What's wrong in my understanding?
import pyparsing as pp
products_list = ['aaa', 'bbb', 'ccc']
products = pp.OneOrMore(' '.join(products_list))
word = ~products + pp.Word(pp.alphas)
description = pp.OneOrMore(word)
comment_expr = (pp.Optional(products("product1")) + description("description") + pp.Optional(products("product2")))
matches = comment_expr.scanString("""\
aaa is a good product
I prefer aaa
No comment
aaa bbb are both good products""")
for match in matches:
    print(match)
The expected results would be:
product1: aaa, description: is a good product
product2: aaa, description: I prefer
description: No comment
product1: [aaa, bbb] description: are both good products
Pyparsing's shortcut equivalence between strings and Literals is intended to be a convenience, but sometimes it results in unexpected and unwanted circumstances. In these lines:
products_list = ['aaa', 'bbb', 'ccc']
products = pp.OneOrMore(' '.join(products_list))
I'm pretty sure you wanted products to match on any product. But instead, OneOrMore gets passed this as its argument:
' '.join(products_list)
This is purely a string expression, resulting in the string "aaa bbb ccc". Passing this to OneOrMore, you are saying that products is one or more instances of the string "aaa bbb ccc".
To get the lookahead, you need to change products to:
products = pp.oneOf(products_list)
or even better:
products = pp.MatchFirst(pp.Keyword(p) for p in products_list)
Then your negative lookahead will work better.

How to Store All Text in Between Two Index Positions of Same String in VBScript?

So I am going off memory here because I cannot see the code I am trying to figure this out for at the moment, but I am working with some old VB Script code where there is a data connection that is set like this:
set objCommand = Server.CreateObject("ADODB.command")
and I have a field from the database that is being stored in a variable like this:
Items = RsData("Item")
This specific field in the database is a long string of text (i.e. "This is part of a string of text…Header One: Here is text after header one… Header Two: Here is more text after header two").
There are certain parts of the text that I wish to store as a variable that are between two index positions in the long string of text within that field. They are separated by headers that are stored in the text field above like this: “Header One:” and “Header Two:”, and I want to capture all text that occurs in between those two headers of text and store them into their own variable (i.e. “Here is text after header one…”).
How do I achieve this? I have tried to use the InStr function to find the index, but as I understand it, it only returns the position where a specific part of the string begins. Am I wrong in thinking this? Because of that, I am also having trouble getting the Mid function to work. Can someone please show me an example of how this is supposed to work? Remember, I am only going off memory, so please forgive me for being unable to provide better code examples now. I hope my question makes sense!
I am hopeful that someone can help me with an answer tonight so I can try this out tomorrow when I am near the code again! Thank you for your efforts and any help offered!
You can extract all the substrings starting with the text Header and ending just before either the next Header or end-of-string. I have used a regular expression to implement that, and it is working for me. Have a look at the code below. If I find a simpler (non-regex) solution, I will update the answer.
Code:
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
set objReg = new Regexp
objReg.Global = true
objReg.IgnoreCase = false
objReg.pattern = "Header[^:]+:([\s\S]*?)(?=Header|$)" '<---Regex Pattern. Explained later.
set objMatches = objReg.Execute(strTest)
Dim arrHeaderValues() '<-----This array contains all the required values
i=-1
for each objMatch in objMatches
    i = i+1
    Redim Preserve arrHeaderValues(i)
    arrHeaderValues(i) = objMatch.subMatches.item(0) '<---item(0) indicates the 1st group of each match
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
set objReg = Nothing
Regex Explanation:
Header - matches Header literally
[^:]+: - matches 1+ occurrences of any character that is not a :. This is then followed by matching a :. So far, keeping the above 2 points in mind, we have matched strings like Header One:, Header Two:, Header blabla123: etc. Now, whatever comes after this match is relevant to us. So we will capture that inside a Group as shown in the next breakup.
([\s\S]*?)(?=Header|$) - matches and captures everything(including newlines) until either the next Header or the end-of-the-string(represented by $)
([\s\S]*?) - matches 0+ occurrences of any character, as few as possible (lazy), and captures the whole match in Group 1
(?=Header|$) - a positive lookahead: asserts that what follows is another instance of the string Header or the end of the string, without consuming or capturing it
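The same pattern can be checked quickly in Python, whose `re` module supports the lazy quantifier and the lookahead used here. This is a minimal sketch on a shortened, made-up sample string; the variable names are illustrative, and VBScript's RegExp engine may differ from Python's in other details.

```python
import re

# Shortened, made-up sample in the same "Header <name>: <text>" layout
text = (
    "Header One: Some random text Header Two: Some more text "
    "Header Three: Final text 123"
)

# Group 1 lazily captures everything up to the next "Header" or end of string;
# the lookahead checks what follows without consuming it
pattern = re.compile(r"Header[^:]+:([\s\S]*?)(?=Header|$)")

# strip() removes the spaces left around each captured value
values = [match.group(1).strip() for match in pattern.finditer(text)]
print(values)
```

Because the lookahead does not consume the next Header, each following match can start exactly at that Header.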
Alternative Solution(non-regex):
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
arrTemp = split(strTest,"Header") 'Split using the text Header
j=-1
Dim arrHeaderValues()
for i=0 to ubound(arrTemp)
    strTemp = arrTemp(i)
    intTemp = instr(1,strTemp,":") 'Find the position of : in each array value
    if(intTemp>0) then
        j = j+1
        Redim preserve arrHeaderValues(j)
        arrHeaderValues(j) = mid(strTemp,intTemp+1) 'Store the desired value in array
    end if
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
If you don't want to store the values in an array, you can use the Execute statement to create variables with different names at run time and store the values in them.

How does a 'textarea' input control insert \r\n into text?

I have text in a database that doesn't contain '\r\n', but when I assign this text to a textarea input control, it shows line breaks in the textarea.
Please let me know which carriage-return characters it finds in the text and how '\r\n' gets added.
Initially, before the text is put in the textarea, there is no '\r\n' in it; but after putting it in the textarea and saving, the text is saved with '\r\n'.
When I'm writing a text/string to the database, I run it through my clean() function. When I am pulling text/strings back out of the database to display, I run them through dirty(). I save both of these as include files as "i_fn_clean.asp" and "i_fun_dirty.asp". Here they are:
function clean(FixWhat)
if (isempty(FixWhat) or isnull(FixWhat) or FixWhat="") then
FixWhat=""
else
apos=chr(39)
quot=chr(34)
FixWhat=trim(FixWhat)
FixWhat=replace(FixWhat," "," ",1,-1,1)
FixWhat=replace(FixWhat,"''",apos & apos,1,-1,1)
FixWhat=replace(FixWhat,"'''",apos,1,-1,1)
FixWhat=replace(FixWhat,VBNullChar,"",1,-1,1)
FixWhat=replace(FixWhat,VBNullString,"",1,-1,1)
FixWhat=replace(FixWhat,VBTab," ",1,-1,1)
FixWhat=replace(FixWhat,VBVerticalTab," ",1,-1,1)
FixWhat=replace(FixWhat,"&","&",1,-1,1)
FixWhat=replace(FixWhat,"amp;","&",1,-1,1)
FixWhat=replace(FixWhat,"&","&",1,-1,1)
FixWhat=replace(FixWhat,"&&","&",1,-1,1)
FixWhat=replace(FixWhat,"&&","&",1,-1,1)
FixWhat=replace(FixWhat,"<","<",1,-1,1)
FixWhat=replace(FixWhat,">",">",1,-1,1)
FixWhat=replace(FixWhat,"/","/",1,-1,1)
FixWhat=replace(FixWhat,"’",apos,1,-1,1)
FixWhat=replace(FixWhat,"’",apos,1,-1,1)
FixWhat=replace(FixWhat,"`",apos,1,-1,1)
'FixWhat=replace(FixWhat,chr(145),apos,1,-1,1)
'FixWhat=replace(FixWhat,chr(146),apos,1,-1,1)
'FixWhat=replace(FixWhat,chr(180),apos,1,-1,1)
'FixWhat=replace(FixWhat,chr(184),apos,1,-1,1)
'quotes
'FixWhat=replace(FixWhat,chr(132),quot,1,-1,1)
'FixWhat=replace(FixWhat,chr(147),quot,1,-1,1)
'FixWhat=replace(FixWhat,chr(148),quot,1,-1,1)
'FixWhat=replace(FixWhat,chr(152),quot,1,-1,1)
'FixWhat=replace(FixWhat,chr(168),quot,1,-1,1)
'hyphens
'FixWhat=replace(FixWhat,chr(150),"-",1,-1,1)
'FixWhat=replace(FixWhat,chr(151),"--",1,-1,1)
'dot dot dot
'FixWhat=replace(FixWhat,chr(133),"...",1,-1,1)
FixWhat=replace(FixWhat,vbCrLf & vbCrLf,vbCrLf,1,-1,1)
FixWhat=replace(FixWhat,"[quote]","&quot;",1,-1,1)
FixWhat=replace(FixWhat,quot,"&quot;",1,-1,1)
FixWhat=replace(FixWhat,"'","'",1,-1,1)
end if
clean=FixWhat
End Function
Function dirty(FixWhat)
if (isnull(FixWhat) or FixWhat="") then
FixWhat=""
else
FixWhat=trim(FixWhat)
FixWhat=replace(FixWhat," "," ",1,-1,1)
FixWhat=replace(FixWhat,"’","'",1,-1,1)
FixWhat=replace(FixWhat,"&apos;","'",1,-1,1)
FixWhat=replace(FixWhat,"%27","'",1,-1,1)
FixWhat=replace(FixWhat,"'","'",1,-1,1)
FixWhat=replace(FixWhat,"’","'",1,-1,1)
FixWhat=replace(FixWhat,"/","/",1,-1,1)
FixWhat=replace(FixWhat,"''''","'''",1,-1,1)
FixWhat=replace(FixWhat,"&quot;",chr(34),1,-1,1)
FixWhat=replace(FixWhat,"%22",chr(34),1,-1,1)
FixWhat=replace(FixWhat,chr(13) & chr(10),"",1,-1,1)
FixWhat=replace(FixWhat,"&","&",1,-1,1)
FixWhat=replace(FixWhat,"amp;","&",1,-1,1)
FixWhat=replace(FixWhat,"&","&",1,-1,1)
FixWhat=replace(FixWhat,"&&","&",1,-1,1)
FixWhat=replace(FixWhat,"&&","&",1,-1,1)
FixWhat=replace(FixWhat,"<","<",1,-1,1)
FixWhat=replace(FixWhat,">",">",1,-1,1)
'FixWhat=replace(FixWhat,"&","&",1,-1,1)
end if
dirty=FixWhat
End Function
