read_csv ignoring delimiters - R

I have a problem handling a questionnaire that I need to import into R. The problem is that the following entry contains commas:
"Mozilla/5.0 (Windows NT 6.2; Win64;" x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome
When I specify that commas should be used as delimiters, it splits the entry at "KHTML, like Gecko" and at "...36",Windows 8". But it somehow does not split at other commas, such as those in "4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,".
I do not understand why these cases are handled differently and how to resolve the issue.
Any help would be greatly appreciated.
4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,0h 57m 2s,1595340364,1595341255,5,5,,Terminating,1,"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 ( like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome 73.0.3683.75,1920,,DKJK1995,1,1,1,4,2,2,6,6,7,1,2,5,5,4,4,2,2,3,2,2,3,4,3,3,5,5,5,2,2,4,4,4,5,6,6,6,1,2,3,3,4,3,5,6,6,2,3,4,5,4,4,4,2,3,2,3,5,5,5,6,2,2,1,2,3,5,5,4,5,2,2,3,100,0,100,0,20,72,100,0,100,0,0,0,100,0,100,0,4,5,5,4,4,4,3,3,2,4,4,5,5,6,6,4,4,4,4,4,5,2,2,3,3,5,3,3,5,5,2,2,2,2,4,3,3,4,5,2,3,2,3,5,3,2,2,2,2,2,2,6,6,6,24,1,2,90,60,90,6,Deutschland,4,1,2,2,1,,1,1,"1,6,5,8,3,2,7,4","6,5,7,3,1,8,4,2","2,1","1,2,4,3",3,56,92,70,54,25,64,29,132,22,34,27,136,15,66,32,109,16,46,36,132,13,31,20,132,10,40,20,107,8,25,16,84,12,27,19,91,7,148,329,355,233,185,29,216,72,"4,2,6,7,5,1,8,3","2,3,1,7,4,8,6,5","5,6,7,1,3,8,4,2","8,3,11,13,10,2,16,5","7,4,11,13,9,2,16,6","7,4,11,13,9,2,16,6";

The problem is that the line you showed has mismatched quotes. From the comments, it appears that the quote in Win64;" is not supposed to end the field, which is supposed to continue to Safari/537.36".
Fixing this will need three steps: read the file without interpreting fields, clean up the bad strings, then read it again as a CSV. For example, if your file is called data.csv, you could do it like this:
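# Read the raw lines without interpreting any fields
lines <- readLines("data.csv")
# Remove the stray quote after "Win64;" (fixed = TRUE matches the text literally)
cleaned <- sub('Win64;"', 'Win64;', lines, fixed = TRUE)
# Parse the cleaned lines as CSV
data <- read.csv(textConnection(cleaned))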
I don't know if the line to clean it will work in general: you'll need to find out where those spurious quotes are, and work out a pattern to match and remove them.

Related

I am scraping a Japanese website, but instead of Japanese characters the output shows numbers and symbols

My web scraping code is below.
import requests
import bs4
from bs4 import BeautifulSoup
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.52'
}
url = 'https://www.mlit.go.jp/'
r = requests.get(url, headers=headers)  # headers must be passed by keyword; the second positional argument is params
soup = bs4.BeautifulSoup(r.text,'html.parser')
soup.find_all('div',{'class':'clearfix'})[1].text
The output is shown below:
'\n\n\n\n\n\nã\x83\x88ã\x83\x94ã\x83\x83ã\x82¯ã\x82¹ å\x9b½å\x9c\x9f交é\x80\x9aç\x9c\x81ã\x81®æ´»å\x8b\x95\n'
I want Japanese sentences instead of this kind of data.
Can anybody help me?
I tried encoding and decoding with UTF-8, CP932, and Shift_JIS. Nothing worked.
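The output above looks like UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1): for example, 'ã\x83\x88' is what the three UTF-8 bytes of 'ト' look like when read as Latin-1. The following is a minimal offline sketch of that effect and of how to reverse it; the helper name fix_mojibake is hypothetical, and it assumes the page really is UTF-8.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as ISO-8859-1 (Latin-1)."""
    # Latin-1 maps every byte 0-255 to one character, so encoding recovers the raw bytes
    return text.encode("latin-1").decode("utf-8")


# Simulate the problem: UTF-8 bytes decoded with the wrong codec
original = "トピックス"
garbled = original.encode("utf-8").decode("latin-1")

print(repr(garbled))          # starts with 'ã\x83\x88', as in the question's output
print(fix_mojibake(garbled))  # the original Japanese text again
```

With requests, it is usually simpler to avoid the wrong decoding in the first place, for instance by setting r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding) before reading r.text, or by passing r.content to BeautifulSoup so that it detects the encoding itself.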

bs4 find sub-text / find_next

On the linked page, there is a section called "COSEWIC Assessment report". This section has bold text that heads each category, followed by non-bold text containing the information for that category. I am looking to scrape the non-bold text using bs4.
In the HTML, the bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title for each category using result = s.find("strong", text=re.compile("Scientific name")).
That said, I would then like to scrape the information under that header for each given header. If I inspect the HTML for that section it looks like this.
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
So, starting from having located the "Scientific name" part, how do I get the "Anarta edwarsii" part?
I thought bs4's find_next_sibling() or something similar might work, but so far nothing has been successful. It is also important to note that I cannot use the text itself to look up the element, because I have to repeat the process for many different species: the header stays constant, but the text under it changes.
Thanks!!
You can use next_siblings as a result set, iterate over it with a list comprehension, and join() the results:
' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()
Output:
'"Anarta edwarsii"'
Alternative example:
Select the <p> that contains the string "Scientific name:", get its stripped_strings as a list ['Scientific name:', 'Anarta edwardsii'], and pick the last (here, the second) element:
import requests
from bs4 import BeautifulSoup
headers ={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'Referer': 'https://www.google.com/'
}
r = requests.get('https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html',headers=headers)
soup = BeautifulSoup(r.text,'lxml')
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]
Output:
'"Anarta edwarsii"'
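If you would rather keep the find() call from the question as the starting point, the same idea can be sketched offline against the HTML fragment shown in the question. This is only an illustration: the helper name text_after_header is hypothetical, the fragment is copied from the question rather than fetched from the live page, and the built-in html.parser is used so that lxml is not required.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# HTML fragment copied from the question (not fetched from the live page)
html = """
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def text_after_header(soup, header):
    """Return the plain text that follows the <strong> header, or None if absent."""
    strong = soup.find("strong", string=re.compile(header))
    if strong is None:
        return None
    # Keep only the text nodes among the siblings; this skips the <br> tag
    parts = [str(sib).strip() for sib in strong.next_siblings
             if isinstance(sib, NavigableString)]
    return " ".join(part for part in parts if part)


print(text_after_header(soup, "Scientific name"))
```

Because the lookup is keyed on the header text only, the same call can be reused for other species pages where the header stays the same and the text under it changes.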

Difficulty with importing txt file into R

I'm currently trying to import a txt file into R, and none of read.table, read_table, or read_table2 worked. With read.table, I just got an error that there were more columns than data. With read_table I got two columns of data, and with read_table2 I got too many columns, split in a way that made no sense, with column names and data fields broken up as per the image below.
This is a sample of the headers and first lines from the txt file:
Source Generator Generator Identifier Created Timestamp Created Date Recorded Timestamp Recorded Date Visit ID URL ID Referrer ID Domain Protocol URL Title Top Domain Search Terms Transition Type Timestamp Date
27-strong-rooks-knitted-patiently Web Historian - Community Edition/1.3.6 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 web-historian 1517096769 2018-01-27T23:46:09.773127+00:00 1527424157 2018-05-27T12:29:17.479092+00:00 96844 0 american.edu https https://engage.american.edu/learn/login/index.php
[Image: the result with read_table2]
Thanks!

How to web scrape data from the second page?

I want to web scrape data using the functions from the rvest library. Loading the website and using a form works fine. But when I want to access the data I only get access to 35 of the available >70 table entries. The data are automatically split into chunks of 35 entries.
On the actual website, these data are reached with a next-page button (CSS selector .rgPageNext, or as XPath //*[contains(concat( " ", @class, " " ), concat( " ", "rgPageNext", " " ))]). However, I do not know how to include this in the code to get access to the other entries.
# library
library(rvest)
# load html session
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
mainPage <-'https://www.tsis.fli.de/Reports/Info_SO.aspx?ts=416&guid=d04e759d-7cf0-48da-b730-4d7f397531e2'
# use form on session
session <- html_session(mainPage, httr::user_agent(ua))
pgform <- html_form(session)[[1]]
pgform$fields[['ctl00$ContentPlaceHolder1$rblActiveMode']]$value = "1"
filledform <- set_values(pgform)
result <- submit_form(session, filledform)
# load the entries 1-35
result %>%
html_nodes("#ctl00_ContentPlaceHolder1_rgSO_ctl00") %>%
html_text()

Regular expression in Rstudio's "Find in files"

How do I use regular expressions in Rstudio's "Find in files"?
Searching for literal numbers works just fine:
But when I try to use a regular expression to find a number, I can't:
The documentation does not mention which type of regex is needed:
https://support.rstudio.com/hc/en-us/articles/200710523-Navigating-Code
So maybe I am using a wrong flavor of regex?
The RStudio documentation is (to be kind) sorely lacking a reference describing the regex syntax that is supported within RStudio find and replace dialogs.
However, in answer to your question about how to find numeric digits, either of the following works in the "Find in Files" dialog with "Regular expression" option ticked:
[0-9]
[[:digit:]]
Unfortunately, as you found, \d does not work. In fact, on my current version* \d simply finds the letter 'd' or 'D'.
\s works as expected, so perhaps \d not working is a bug in RStudio?
*RStudio version I'm using:
Version 1.1.463 – © 2009-2018 RStudio, Inc.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) RStudio Safari/538.1 Qt/5.4.0
RStudio uses POSIX Basic Regular Expressions for its "Find in files" functionality (because it is using grep under the hood).
