How to web scrape data from the second page? - web-scraping

I want to scrape data using functions from the rvest library. Loading the website and using a form works fine, but when I try to access the data I only get 35 of the more than 70 available table entries; the data are automatically split into pages of 35.
On the actual website, these data are reachable with a next-page button (CSS selector .rgPageNext or, as XPath, //*[contains(concat(" ", @class, " "), concat(" ", "rgPageNext", " "))]). However, I do not know how to include this in the code to get access to the other entries.
# library
library(rvest)
# load html session
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
mainPage <-'https://www.tsis.fli.de/Reports/Info_SO.aspx?ts=416&guid=d04e759d-7cf0-48da-b730-4d7f397531e2'
# use form on session
session <- html_session(mainPage, httr::user_agent(ua))
pgform <- html_form(session)[[1]]
pgform$fields[['ctl00$ContentPlaceHolder1$rblActiveMode']]$value = "1"
filledform <- set_values(pgform)
result <- submit_form(session, filledform)
# load the entries 1-35
result %>%
  html_nodes("#ctl00_ContentPlaceHolder1_rgSO_ctl00") %>%
  html_text()
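The page is an ASP.NET WebForms application, and the rgPageNext button fires a JavaScript __doPostBack(target, argument) call that resubmits the form with the hidden __EVENTTARGET field set to the pager control's id. One way forward, then, is to resubmit the form yourself with that field filled in; in rvest that means writing the id into the form's __EVENTTARGET field before calling submit_form() again on the result. Below is a sketch of the mechanism, shown in Python with requests purely for illustration; the __EVENTTARGET value is a placeholder that must be read from the next-page button's actual __doPostBack(...) href on the live page.
import requests
from bs4 import BeautifulSoup

url = 'https://www.tsis.fli.de/Reports/Info_SO.aspx?ts=416&guid=d04e759d-7cf0-48da-b730-4d7f397531e2'
s = requests.Session()
soup = BeautifulSoup(s.get(url).text, 'html.parser')

# Re-post every form field, including the ASP.NET hidden state fields
# (__VIEWSTATE, __EVENTVALIDATION, ...), just as the browser would.
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

# Placeholder id: copy the real target out of the button's
# href="javascript:__doPostBack('...','')" attribute.
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$rgSO'
payload['__EVENTARGUMENT'] = ''

page2 = s.post(url, data=payload)  # the response should hold entries 36 onward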

Related

I am scraping a Japanese site, but instead of Japanese characters the output shows garbled numbers and symbols

I wrote my web-scraping code below.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.52'
}
url = 'https://www.mlit.go.jp/'
r = requests.get(url, headers=headers)  # headers must be the headers= keyword, not a positional dict
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('div', {'class': 'clearfix'})[1].text
The output is shown below:
'\n\n\n\n\n\nã\x83\x88ã\x83\x94ã\x83\x83ã\x82¯ã\x82¹ å\x9b½å\x9c\x9f交é\x80\x9aç\x9c\x81ã\x81®æ´»å\x8b\x95\n'
I want the Japanese sentences instead of this garbled output.
Can anybody help me?
I tried encoding and decoding with UTF-8, CP932, and Shift-JIS; nothing worked.
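A likely cause: when the server's Content-Type header does not declare a charset, requests falls back to ISO-8859-1 for r.text, and UTF-8 bytes decoded as Latin-1 produce exactly the ã\x83... pattern shown above. Two common fixes, sketched below: set r.encoding from requests' own charset detection before reading r.text, or hand BeautifulSoup the raw bytes and let it sniff the charset from the page's <meta> declaration.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mlit.go.jp/')

# Option 1: replace the ISO-8859-1 fallback with the detected encoding.
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'html.parser')

# Option 2: pass raw bytes; BeautifulSoup detects the charset itself.
soup = BeautifulSoup(r.content, 'html.parser')

print(soup.find_all('div', {'class': 'clearfix'})[1].text)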

bs4 find sub-text / find_next

On the linked page there is a section called "COSEWIC Assessment report". This section has bold text heading each category, followed by non-bold text containing the information for that category. I am looking to scrape the non-bold text using bs4.
The bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title for each category using result = s.find("strong", text=re.compile("Scientific name")).
That said, I would then like to scrape the information under each such header. Inspecting the HTML for that section shows:
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
So, from a starting point of having located the "Scientific name" part, how do I get the "Anarta edwarsii" part?
I thought maybe bs4's find_next_sibling() would work, or something of the sort, but so far nothing has been successful. Also important to note: I cannot use the text to look up the element, because I have to repeat the process for many different species; the header remains constant but its sub-text changes.
Thanks!!
You can use next_siblings as a result set, iterate with a list comprehension, and join() the results:
' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()
Output:
'"Anarta edwarsii"'
Alternative example:
Select the <p> that contains the string "Scientific name:", get its stripped_strings as a list (['Scientific name:', 'Anarta edwardsii']), and pick the last element:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'Referer': 'https://www.google.com/'
}
r = requests.get('https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html',headers=headers)
soup = BeautifulSoup(r.text,'lxml')
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]
Output:
'"Anarta edwarsii"'

Read_csv ignoring delimiters

I have a problem handling a questionnaire export that I need to import into R. The problem is that the following entry contains commas.
"Mozilla/5.0 (Windows NT 6.2; Win64;" x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome
When I specify that commas should be used as delimiters, it splits the entry at "KHTML, like Gecko" and at "...36",Windows 8", but it somehow does not split at other commas such as "4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,".
I do not understand why these cases are handled differently and how to resolve the issue.
Any help would be greatly appreciated.
4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,0h 57m 2s,1595340364,1595341255,5,5,,Terminating,1,"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 ( like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome 73.0.3683.75,1920,,DKJK1995,1,1,1,4,2,2,6,6,7,1,2,5,5,4,4,2,2,3,2,2,3,4,3,3,5,5,5,2,2,4,4,4,5,6,6,6,1,2,3,3,4,3,5,6,6,2,3,4,5,4,4,4,2,3,2,3,5,5,5,6,2,2,1,2,3,5,5,4,5,2,2,3,100,0,100,0,20,72,100,0,100,0,0,0,100,0,100,0,4,5,5,4,4,4,3,3,2,4,4,5,5,6,6,4,4,4,4,4,5,2,2,3,3,5,3,3,5,5,2,2,2,2,4,3,3,4,5,2,3,2,3,5,3,2,2,2,2,2,2,6,6,6,24,1,2,90,60,90,6,Deutschland,4,1,2,2,1,,1,1,"1,6,5,8,3,2,7,4","6,5,7,3,1,8,4,2","2,1","1,2,4,3",3,56,92,70,54,25,64,29,132,22,34,27,136,15,66,32,109,16,46,36,132,13,31,20,132,10,40,20,107,8,25,16,84,12,27,19,91,7,148,329,355,233,185,29,216,72,"4,2,6,7,5,1,8,3","2,3,1,7,4,8,6,5","5,6,7,1,3,8,4,2","8,3,11,13,10,2,16,5","7,4,11,13,9,2,16,6","7,4,11,13,9,2,16,6";
The problem is that the line you showed has mismatched quotes. From the comments, it appears that the quote in Win64;" is not supposed to end the field, which should continue to Safari/537.36".
Fixing this needs three steps: read the file without interpreting fields, clean up the bad strings, then parse the result as CSV. For example, if your file is called data.csv, you could do it like this:
# Read the raw lines with no field parsing
lines <- readLines("data.csv")
# Remove the stray quote after Win64; (sub() replaces the first match per line)
cleaned <- sub('Win64;"', 'Win64;', lines)
# Parse the repaired text as CSV
data <- read.csv(textConnection(cleaned))
I don't know if the clean-up line will work in general: you'll need to find out where the spurious quotes are, and work out a pattern to match and remove them.

Difficulty with importing txt file into R

I'm currently trying to import a txt file into R, and none of read.table, read_table, or read_table2 worked. With read.table, I just got an error that there were more columns than data. With read_table I got two columns of data, and with read_table2 I got too many columns, split in a way that made no sense, with column names and data fields broken apart.
This is a sample of the headers and first lines from the txt file:
Source Generator Generator Identifier Created Timestamp Created Date Recorded Timestamp Recorded Date Visit ID URL ID Referrer ID Domain Protocol URL Title Top Domain Search Terms Transition Type Timestamp Date
27-strong-rooks-knitted-patiently Web Historian - Community Edition/1.3.6 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 web-historian 1517096769 2018-01-27T23:46:09.773127+00:00 1527424157 2018-05-27T12:29:17.479092+00:00 96844 0 american.edu https https://engage.american.edu/learn/login/index.php
Thanks!

How should an HTTP header field be parsed?

I'm trying to parse an HTTP header field according to the ABNF rule header-field specified in the relevant section of RFC 7230. These rules are:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
; obsolete line folding
; see Section 3.2.4
(obs-text is just high-order bytes 0x80 to 0xff).
The problem I'm facing is that the header-field rule seems to fail when applied to the User-Agent string that Chrome sets in responsive mode:
User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Mobile Safari/537.36
The issue stems from the lone '5': when the parser reaches the final 's' in "Nexus", field-content consumes the 's', the following space, and the '5', leaving the parsing cursor at the space directly after. That is:
Parsed:  ______________]
Data:    ...6.0; Nexus 5 Build/MRA58N...
Cursor:                 ^
Since field-content does not allow leading whitespace, the rule fails to match the whole header field, which leads the parser to fail on the rest of the message.
It is obvious to me that HTTP header values should be able to contain single characters surrounded by whitespace, yet this seems to be disallowed by my reading of the spec.
I have searched online but have not found anything relevant, so I'm assuming it's a mistake on my part. Where is my mistake, and how should the rule actually be interpreted?
For RFCs, you can find errata as indicated on the front page:
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc7230.
This one likely is https://www.rfc-editor.org/errata/eid4189 - see https://github.com/httpwg/http-core/issues/19 for more information.
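For reference, the corrected rule proposed in that erratum (and later adopted in RFC 9110) allows field-vchar inside the inner repetition as well, so an interior character flanked by spaces no longer breaks the match:
field-content = field-vchar [ 1*( SP / HTAB / field-vchar ) field-vchar ]
Under this rule, field-content consumes the whole value up to its last non-whitespace character, and the Nexus 5 User-Agent string above parses as a single field-content.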
