Scrapy can't manage to extract text with either CSS or XPath

I've been trying to extract some text for a while now, and while everything works fine, there is something I can't manage to get.
Take this website : https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000
I want to get the text from the class=listing-main-characteristics__number nodes (below the picture, the box with "2 chambres 1 salle de bain Aire habitable (s-sol exclu) 1,030 pi² (95,69 m²)"). There are 3 elements with that class in the page ("2", "1" and "1,030 pi² (95,69 m²)"). I've tried a bunch of options in XPath and CSS, but none has worked, and some gave back strange answers.
For example, with :
response.xpath('//span[@class="listing-main-characteristics__number"]').getall()
I get :
['<span class="listing-main-characteristics__number">\n 2\n </span>', '<span class="listing-main-characteristics__number">\n 1\n </span>']
For example, something else that works just fine on the same webpage :
response.xpath('//div[@property="description"]/p/text()').getall()
If I get all the spans with this query :
response.css('span::text').getall()
I can find the texts mentioned at the beginning in the output. But from this :
response.css('span[class=listing-main-characteristics__number]::text').getall()
I only get this
['\n 2\n ', '\n 1\n ']
Could someone clue me in with what kind of selection I would need? Thank you so much!

Here is the XPath that you can use:
//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]
(Add /text() if you want the associated text.)
response.xpath("//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]").getall()
Below is sample Python code using Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumption: any installed WebDriver works here
url = "https://duproprio.com/fr/montreal/pierrefonds-roxboro/condo-a-vendre/hab-305-5221-rue-riviera-854000#description"
driver.get(url)
# get the output elements then we will get the text from them
outputs = driver.find_elements(By.XPATH, "//div[@data-label='#description']//div[@class='listing-main-characteristics__label']|//div[@data-label='#description']//div[@class='listing-main-characteristics__item-dimensions']/span[2]")
for output in outputs:
    # replace the new line character with space and trim the text
    print(output.text.replace("\n", ' ').strip())
driver.quit()
Output:
2 chambres
1 salle de bain
1,030 pi² (95,69 m²)
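As a side note, the strings that getall() returns in Scrapy keep the newlines and padding from the HTML, as in the question's output. A minimal sketch of cleaning them up in plain Python (normalize is just an illustrative helper name, not part of Scrapy):

```python
# Raw strings as returned by .getall() in the question (whitespace is preserved).
raw_texts = ['\n 2\n ', '\n 1\n ']


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces and trim."""
    return " ".join(text.split())


cleaned = [normalize(t) for t in raw_texts]
print(cleaned)  # ['2', '1']
```

The same helper can be applied to whatever list the XPath above returns once /text() is appended.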

Speech_contexts phrase list not working in google speech.SpeechAsyncClient.streaming_recognize

Unable to make speech_contexts phrase lists work with speech.SpeechAsyncClient in Google Speech-to-Text.
The transcription works, but the phrase list appears to be ignored.
Is there any config that needs to be in-place?
When using speech.SpeechAsyncClient (version 2.17.2, in Python), I created a phrase list:
speech_contexts {
phrases: "Burrito"
boost: 10.0
}
speech_contexts {
phrases: "burrito"
boost: 5.0
}
I expected the audio for the word 'burrito' to be transcribed as 'Burrito'. However, it continued to be 'burrito'. I also tried various phrase lists, but the recognition seems to ignore them (same result with and without a phrase list).
I verified that the proper speech_contexts value is being sent in the streaming_config/RecognitionConfig like this:
Recognitionconfig = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    #encoding = cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="latest_long",
    #enable_word_confidence=True,
    speech_contexts=speech_contexts #this contains the phrase list
)
#The first message is the following streaming_config and is then followed by audio
streaming_config = speech.StreamingRecognitionConfig(
    config=Recognitionconfig, interim_results=True
)
Try using model adaptation to improve the accuracy of your transcription results. It also uses RecognitionConfig for the request body. Also, follow this format when using SpeechContext:
{
  "phrases": [
    string
  ],
  "boost": number
}
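To make the shape above concrete, here is a minimal sketch that builds a request-body dictionary with one SpeechContext entry. It assumes the camelCase field names of the REST/JSON representation, and build_recognition_config is a hypothetical helper, not a function of the client library. It only illustrates the format and is not a confirmed fix for the ignored phrase list.

```python
import json


def build_recognition_config(phrases, boost):
    """Return a dict shaped like a RecognitionConfig body with one SpeechContext."""
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "latest_long",
        # "phrases" is a list of strings; "boost" is a single number for the whole list.
        "speechContexts": [
            {"phrases": list(phrases), "boost": float(boost)}
        ],
    }


config = build_recognition_config(["Burrito"], 10.0)
print(json.dumps(config, indent=2))
```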

Returning text between a starting and ending regular expression

I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted: the full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not demarcated. The best that I can figure is that the full text ends with a variety of metadata tags that look like: Subject:, CREDIT:, Credit:.
So, I can certainly get the start of the article. But, I am having a great deal of difficulty finding a way to select the text between the start and the end of the full text.
This is complicated by two factors. First, obviously the ending string varies, although I feel like I could settle on something like '^[:alnum:]{5,}: ' and that would capture the ending. But the other complicating factor is that there are similar tags that appear prior to the start of the full text. How do I get R to only return the text between the Full text regex and the ending regex?
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
My current attempt is here:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thank you.
This just searches for the element matching 'Full text:', then the next element after that matching ':'
get_text <- function(x){
  start <- grep('Full text:', x)
  end <- grep(':', x)
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
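The same idea can be sketched in Python with a stricter end condition than a bare ':'. This assumes a metadata tag looks like the pattern proposed in the question (five or more alphanumerics followed by ": " at the start of a line); an article line that happens to match would still end the text early.

```python
import re

# Assumed tag pattern from the question: 5+ alphanumerics, then ": ", at line start.
TAG = re.compile(r"^[A-Za-z0-9]{5,}: ")


def get_text(lines):
    """Return the lines from 'Full text:' up to (not including) the next tag line."""
    start = next(i for i, line in enumerate(lines) if line.startswith("Full text:"))
    end = len(lines)  # fall back to the end of the document if no tag follows
    for i in range(start + 1, len(lines)):
        if TAG.match(lines[i]):
            end = i
            break
    return lines[start:end]


test = ['Document 1', 'Article title', 'Author: Author Name', 'https://a/url',
        'Abstract: none', 'Full text: some article text that I need to capture',
        'the second line of the article that I need to capture',
        'Subject: A subject', 'Publication: Publication', 'Location: A country']
print(get_text(test))
```

Because the search for the end tag starts after the 'Full text:' line, the earlier Author: and Abstract: tags are never considered.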

How to Store All Text in Between Two Index Positions of Same String in VBScript?

So I am going off memory here because I cannot see the code I am trying to figure this out for at the moment, but I am working with some old VB Script code where there is a data connection that is set like this:
set objCommand = Server.CreateObject("ADODB.command")
and I have a field from the database that is being stored in a variable like this:
Items = RsData("Item")
This specific field in the database is a long string of text:
(i.e. “This is part of a string of text…Header One: Here is text after header one… Header Two: Here is more text after header two”).
There are certain parts of the text that I wish to store as a variable that are between two index positions in the long string of text within that field. They are separated by headers that are stored in the text field above like this: “Header One:” and “Header Two:”, and I want to capture all text that occurs in between those two headers of text and store them into their own variable (i.e. “Here is text after header one…”).
How do I achieve this? I have tried to use the InStr method to set the index but from how I understand how this works it will only count the beginning of where a specific part of the string occurs. Am I wrong in my thinking of this? Since that is the case, I am also having trouble getting the Mid function to work. Can someone please show me an example of how this is supposed to work? Remember, I am only going off of memory so please forgive me that I am unable to provide better code examples now. I hope my question makes sense!
I am hopeful that someone can help me with an answer tonight so I can try this out tomorrow when I am near the code again! Thank you for your efforts and any help offered!
You can extract all the substrings starting with the text Header and ending just before either the next Header or end-of-string. I have used a regular expression to implement that and it is working for me. Have a look at the code below. If I find a simpler (non-regex) solution, I will update the answer.
Code:
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
set objReg = new Regexp
objReg.Global = true
objReg.IgnoreCase = false
objReg.pattern = "Header[^:]+:([\s\S]*?)(?=Header|$)" '<---Regex Pattern. Explained later.
set objMatches = objReg.Execute(strTest)
Dim arrHeaderValues() '<-----This array contains all the required values
i=-1
for each objMatch in objMatches
    i = i+1
    Redim Preserve arrHeaderValues(i)
    arrHeaderValues(i) = objMatch.subMatches.item(0) '<---item(0) indicates the 1st group of each match
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
set objReg = Nothing
Regex Explanation:
Header - matches Header literally
[^:]+: - matches 1+ occurrences of any character that is not a :, followed by a :. So far, this matches strings like Header One:, Header Two:, Header blabla123: etc. Whatever comes after this match is what we want, so we capture it in a group, as shown in the next breakdown.
([\s\S]*?)(?=Header|$) - matches and captures everything (including newlines) until either the next Header or the end of the string (represented by $)
([\s\S]*?) - lazily matches 0+ occurrences of any character and captures the match in Group 1
(?=Header|$) - a positive lookahead: stops the lazy match at the next instance of the string Header or at the end of the string, without including it in the match
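For readers who want to try the pattern outside VBScript, here is a minimal sketch of the same regex in Python. The header_values name and the shortened sample string are illustrative only; the surrounding spaces are trimmed with strip(), which the VBScript version does not do.

```python
import re

# Same pattern as above: lazy capture up to the next "Header" or end of string.
PATTERN = re.compile(r"Header[^:]+:([\s\S]*?)(?=Header|$)")


def header_values(text):
    """Return the text following each 'Header ...:' label, trimmed of whitespace."""
    return [value.strip() for value in PATTERN.findall(text)]


sample = "Header One: Some random text Header Two: Some more text"
print(header_values(sample))  # ['Some random text', 'Some more text']
```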
Alternative Solution (non-regex):
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
arrTemp = split(strTest,"Header") 'Split using the text Header
j=-1
Dim arrHeaderValues()
for i=0 to ubound(arrTemp)
    strTemp = arrTemp(i)
    intTemp = instr(1,strTemp,":") 'Find the position of : in each array value
    if(intTemp>0) then
        j = j+1
        Redim preserve arrHeaderValues(j)
        arrHeaderValues(j) = mid(strTemp,intTemp+1) 'Store the desired value in array
    end if
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
If you don't want to store the values in an array, you can use the Execute statement to create variables with different names at run time and store the values in them.
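For the specific case in the question (only the text between "Header One:" and "Header Two:"), the index arithmetic is the part that usually trips people up: the search returns the position where the header starts, so the header's length has to be added before slicing. A minimal sketch of that logic in Python, where str.find plays the role of InStr and slicing the role of Mid (text_between is an illustrative name, and Python positions are 0-based while InStr is 1-based):

```python
def text_between(text, start_marker, end_marker):
    """Return the text after start_marker and before end_marker, or None if no start."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # skip past the header itself
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no closing header: take everything to the end
    return text[start:end].strip()


sample = ("This is part of a string of text... Header One: Here is text after "
          "header one... Header Two: Here is more text after header two")
print(text_between(sample, "Header One:", "Header Two:"))
```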

xpathSApply skip if text equals postseason

I'm running into a road block here and I can't figure out what I'm doing wrong. I need to skip over the link if the text equals postseason. The text is in the second li in the xpaths below in my code.
I tried li[not(.,"postseason")] as I thought that is what I needed to exclude the postseason link but it doesn't work.
This link shows an example of what I want to exclude, under standard batting > game logs > postseason:
http://www.baseball-reference.com/players/j/jeterde01.shtml
Place this http://www.baseball-reference.com/players/j/jeterde01.shtml in playerURLs and you should see the postseason link returned. How can I skip over the postseason link? Thanks!
library(XML)  # provides htmlParse, xpathSApply, xmlValue and xmlGetAttr
#GET YEARS PLAYED LINKS
yplist = NULL
playerURLs <- paste("http://www.baseball-reference.com",datafile17[,c("hrefs")],sep="")
for(thisplayerURL in playerURLs){
  doc <- htmlParse(thisplayerURL)
  yplinks <- data.frame(
    names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div//ul/li[2]/ul/li/a',xmlValue),
    hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li/a',xmlGetAttr,'href'))
  yplist = rbind(yplist, yplinks)
}
I'm not familiar with R specifically, but from an XPath point of view, you can use a . != "..." or not(contains(.,"...")) predicate to exclude elements with a specific inner text value.
The following will exclude <li> elements whose inner text exactly equals "postseason":
li[. != "postseason"]
This one will exclude <li> elements whose inner text contains "postseason":
li[not(contains(.,"postseason"))]

Multiline Text in Flex 4

In Flex I want to create a text file, and it is working, but the problem is that all the input is written on one line.
Here is the code:
addText.text="[ \r\n";
addText.text=addText.text+"] \r\n";
fileRef.save(addText.text, "data.txt");
The current result looks like this:
[]
How can I make it look like this?
[
]
I would start by trying:
addText.text = "[ \n ]";
It normally works in all cases...
Good luck (:
