Scrapy - Remove comma and whitespace from getall() results - web-scraping

Would there be an effective way to directly remove the commas from the results yielded via getall()?
As an example, the data I'm trying to retrieve is in this format:
<div>
Text 1
<br>
Text 2
<br>
Text 3
</div>
My current selector for this is:
response.xpath("//div//text()").getall()
This does get the correct data, but the results come out as:
Text 1,
Text 2,
Text 3
instead of
Text 1
Text 2
Text 3
I understand that they come back as a list, which is the reason for the commas, but is there a direct function to remove them without affecting commas in the text itself?

I'm just going to leave the solution I used in case someone needs it:
tc = response.xpath("//div//text()").getall()  # XPath selector; returns a list of text nodes
tcl = "".join(tc)  # join the list into a single string

Related

Remove all numbers in the text column but error

I have a data set A, and I use select() to pick out the text column, then str_replace_all() to delete all numbers.
This is the code I wrote. I expect the text column to contain no numbers, showing only the text column without the other columns.
B<-select(A,text)%>%
text = str_replace_all(text,
pattern ="\d",
replacement="")
B
I'm not sure what's wrong with it...
Also, if I want to keep the other columns, how can I drop the select() call and still keep the revised text column?
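A minimal sketch of a likely fix (assuming A is a data frame with a text column, and that dplyr and stringr are the packages in play): the pipe needs mutate() so the column is transformed in place, which also keeps all the other columns, and the digit class must be written "\\d" in an R string:
library(dplyr)
library(stringr)
# mutate() rewrites the text column in place and keeps every other column;
# note the doubled backslash: "\\d" is how a regex digit class is written in R
B <- A %>%
  mutate(text = str_replace_all(text, pattern = "\\d", replacement = ""))
B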

Replacing strings in vector: Every instance replaced by previous found instance

I'm working with a lot of text files I have loaded into R and I'm trying to replace every instance (or tag) of </SPEAKER> with a certain string found earlier in the text file.
Example:
"<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"
I'd like to replace every instance of "</SPEAKER>" with the name of, say "<BOB>" and "<HARRY>" based on the NAME that has been found earlier, so I'd get this at the end:
"<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"
I was thinking of looping through the vector text but as I only have limited experience with R, I wouldn't know how to tackle this.
If anyone has any suggestions for how to do this, possibly even outside of R using Notepad++ or another text/tag editor, I'd most appreciate any help.
Thanks!
Match <, word characters (capturing them in capture group 1), >, and then the shortest string (capturing it in capture group 2) up to </SPEAKER>; replace all of that with <, capture group 1, >, capture group 2, </, capture group 1, >.
This gives
x <- "<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"
gsub("<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", x)
## [1] "<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"
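Since gsub is vectorised, the same call also covers the "lots of text files" case without an explicit loop; here texts is a hypothetical character vector holding one file's contents per element:
texts <- c("<BOB> Hi </SPEAKER>",
           "<ANN> Bye </SPEAKER> <TOM> Ok </SPEAKER>")
gsub("<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", texts)
## [1] "<BOB> Hi </BOB>"                  "<ANN> Bye </ANN> <TOM> Ok </TOM>"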

How to extract sections of specific text from PDF files into R data frames? Complex

Any advice will be appreciated; this is time sensitive. I have PDF reports that are mostly blocks of text. They are long reports (~50-100 pages). I'm trying to write an R script that is capable of extracting specific sections of these PDF reports using start/stop positional strings. NOTE: reports vary in length. Short example:
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
So the goal in this example is to extract the paragraph below SECTION 2 and store it as a field/data point, and likewise for SECTION 11. Note the document is in PDF format.
I have tried pdftools, tm, and stringr; I've spent 20+ hours searching for solutions and tutorials on how to do this. I know it is possible, as I have done it in SAS before...
Please see the code below; I added comments with questions. I believe regex will be part of the solution, but I'm lost.
# Init Step
libs <- c("tm", "class", "stringr", "testthat", "pdftools")
lapply(libs, require, character.only = TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# Converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss of how to parse the values I want. I have seen things like:
sectionxyz <- str_extract_all(textFile, #??? )
rm_between()
# 1) How do I loop through each page of the PDF file?
# 2) How do I identify start/stop positions for the section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
#    and then add the result to a data field?
# 4) Sections in the PDF will be long runs of text (i.e. 100+ words into a field)
NEW------
So I have been able to:
-Prep doc correctly
-Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
The length(grep()) statements verify that only one instance is found. From here I am kind of lost as to how to use gsub or similar to extract the portion of data I want. I tried:
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
So far, little progress, but progress anyways.
Provided you can come up with a systematic way to identify the sections you need, you could, as you indicated, use regex to extract the text you want.
In your above example, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) ought to work.
Now you could define patterns dynamically using paste and iterate through all files; each result can then be saved in your data.frame, list, etc. (a fuller sketch follows at the end of this answer).
Here is a more detailed explanation of the pattern:
Firstly, .* is a way of matching "anything". If you want to match digits you can use \\d or, equivalently, [0-9]. There is a short intro to Regex in R (which I found quite useful) where you can find several other character classes.
.* at the edges of the pattern means that there can be text before/after
(.*) denotes the content we want (so here matching any content as .* is used). Basically it means extract "anything" between SECTION 11 and 12.
\\. means the dot and \n is the "newline" metacharacter (as before "12.", a new line is started)
In Regex you can create groupings within your pattern using the brackets, i.e. gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") will return 21:37, or gsub("([A-z]) \\d+","\\1","hello 123") will give hello.
Now the second argument in gsub can be, and often is, used to provide a substitute, i.e. something to replace the matched pattern with. Here, however, we do not want any substitute; we want to extract something. \\1 means extract the first grouping, i.e. what is inside the first pair of brackets (you could have multiple groupings).
Finally, string is the string from which we want to extract, i.e. the text of the PDF file.
Now if you want to perform something similar in a loop you could do the following:
# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."
first <- "SECTION 11" # first and last can be dynamically assigned
last <- "12\\." # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."
pat <- paste0(".*", first, "(.*)", "\n", last, ".*")
# "\n" is added to stop before the newline; it could be omitted
# (then "\n" might appear in the extraction)
gsub(pat,"\\1",string) # returns the same as above
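Putting the pieces together, a minimal sketch of such a loop might look like the following; the start/stop markers and the results data frame are illustrative assumptions, not patterns taken from the real reports:
library(pdftools)
filename <- "~/pdf_test/test.pdf"
# collapse all pages into one string so a section can span page breaks
full_text <- paste(pdf_text(filename), collapse = "\n")
# illustrative start/stop markers; in practice build these from the
# section headers that actually appear in the reports
first_pats <- c("02\\. SECTION 2", "11\\. SECTION 11")
last_pats  <- c("03\\.", "12\\.")
results <- data.frame(section = c("SECTION 2", "SECTION 11"),
                      text = NA_character_, stringsAsFactors = FALSE)
for (i in seq_along(first_pats)) {
  pat <- paste0(".*", first_pats[i], "(.*)\n", last_pats[i], ".*")
  results$text[i] <- gsub(pat, "\\1", full_text)
}
results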

How to remove the words that are contained within some tags in R

Suppose A is a data frame whose structure is as follows:
Row no C1
1 <p>I'd like to check if an uploaded file is </p>
2 <p>Is there a way to</p>
3 <p>I am import matlab file and construct</p> <pre><code>Error in model.frame.default(formula = expert_data_frame$t_labels ~ .,</code></pre>
For the column C1, what I am doing is turning the rows into a corpus with the tm package and then using functions like removeWhitespace and removeStopwords. But how do I remove the words within a specific tag? In the above example I want to remove the words that are within the <code>...</code> tags, but I am unable to do so.
The correct answer is to use an HTML parser, but that requires more explanation. You can also get this done in an incorrect way with the qdap package:
library(qdap)
genX(A$C1, "<code>", "</code>")
## [1] "<p>I'd like to check if an uploaded file is </p>"
## [2] "<p>Is there a way to</p>"
## [3] "<p>I am import matlab file and construct</p> <pre></pre>"
In a pinch, you could do:
A$C1 <- gsub('<code>.*?</code>', '', A$C1)
However, there are many caveats to parsing HTML with regular expressions.
For example, if I had the string '<code># this is a </code> tag</code>', the last ' tag</code>' would not be stripped, because the non-greedy match stops at the first closing tag.
If I adjusted the regex to use .* instead of .*? to get around this, the string '<code>some code</code> and some text and <code>some more code</code>' would have everything stripped from it, even the (legitimate) text between the two code blocks.
What it boils down to is what you know about A$C1. Can you rely on it not to have more than one code block in one string (or more than one occurrence of '</code>')? Then use <code>.*</code>. Can you rely on the string '</code>' never appearing within a code block? Then use <code>.*?</code>.
If you really want to be sure, you can actually parse the HTML with the XML package (can you rely on the contents of A$C1 to be well-formed HTML, i.e. no missing tags?).
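For completeness, a minimal sketch of the parser route; this uses the xml2 package rather than XML (an assumption, either can do the job), dropping every <code> node and re-serialising what is left:
library(xml2)
strip_code <- function(html) {
  # wrap the fragment so it parses even without a single root element
  doc <- read_html(paste0("<div id='wrap'>", html, "</div>"))
  # drop every <code> node from the parsed tree
  xml_remove(xml_find_all(doc, ".//code"))
  # re-serialise the remaining children of the wrapper div
  kids <- xml_children(xml_find_first(doc, "//div[@id='wrap']"))
  paste(vapply(kids, as.character, character(1)), collapse = " ")
}
A$C1 <- vapply(A$C1, strip_code, character(1))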

displaying text from InnerText

When I try to display the InnerText from an XML element, I get something like this:
I need this spacing\r\n\r\n\r\nsecond lot of spacing\r\n\r\nMore spacing\r\n\r\n
I know you can replace \r\n with <br>, but is there no function that handles the HTML conversion for you, and why does it use \r and \n? Many thanks.
You can use the <pre> tag - it will show the text as-is, like you see it in a text editor:
For example:
<pre><%=MyText%></pre>
Better practice for ASP.NET is:
<pre id="myPlaceholder" runat="server"></pre>
Then assign its value from code behind:
myPlaceholder.InnerHtml = MyText;
As for your question "why does it use \r and \n": those are carriage return and line feed characters, aka newline characters. When you have text like:
line 1
line 2
Then code reading it will give "line 1\nline 2" or "line 1\r\nline 2", depending on how exactly it is stored.
