Difficulty with importing txt file into R

I'm currently trying to import a txt file into R, and none of read.table, read_table, or read_table2 worked. With read.table, I just got an error that there were more columns than data. With read_table I got two columns of data, and with read_table2 I got too many columns, split in a way that made no sense, with column names and data fields split up as per the image below.
This is a sample of the headers and first lines from the txt file:
Source Generator Generator Identifier Created Timestamp Created Date Recorded Timestamp Recorded Date Visit ID URL ID Referrer ID Domain Protocol URL Title Top Domain Search Terms Transition Type Timestamp Date
27-strong-rooks-knitted-patiently Web Historian - Community Edition/1.3.6 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 web-historian 1517096769 2018-01-27T23:46:09.773127+00:00 1527424157 2018-05-27T12:29:17.479092+00:00 96844 0 american.edu https https://engage.american.edu/learn/login/index.php
(Image omitted: the read_table2 result, with column names and data fields split across the wrong columns.)
Thanks!
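For what it's worth, the header row and the sample line suggest a tab-separated export: whitespace-driven readers will shred multi-word fields such as the User-Agent string, which matches the symptoms described. That diagnosis is a guess, not something confirmed in the question. A quick way to test it, sketched here in Python/pandas (in R, passing sep = "\t" to read.table, or using read.delim or read_tsv, would be the equivalent check; "web_historian.txt" is a placeholder filename):

import pandas as pd

# If the export really is tab-separated, each header lands in its own
# column and the User-Agent survives as a single field.
df = pd.read_csv("web_historian.txt", sep="\t")
print(df.columns.tolist())
print(df.head(1))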

Related

I am doing web scraping of a Japanese site, but instead of showing Japanese characters, the output shows numbers and symbols

I wrote the web-scraping code below.
import requests
import bs4
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.52'
}
url = 'https://www.mlit.go.jp/'
r = requests.get(url, headers=headers)  # headers must be passed via the headers= keyword argument
soup = bs4.BeautifulSoup(r.text, 'html.parser')
soup.find_all('div', {'class': 'clearfix'})[1].text
The output is shown below:
'\n\n\n\n\n\nã\x83\x88ã\x83\x94ã\x83\x83ã\x82¯ã\x82¹ å\x9b½å\x9c\x9f交é\x80\x9aç\x9c\x81ã\x81®æ´»å\x8b\x95\n'
I want Japanese sentences instead of this kind of output.
Can anybody help me?
I tried encoding and decoding with UTF-8, cp932, and Shift-JIS. Nothing worked.
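The output shown is the classic signature of UTF-8 bytes decoded as Latin-1: when a server's Content-Type header carries no charset, requests falls back to ISO-8859-1 for r.text. A minimal sketch of that likely fix (the shortened User-Agent is a placeholder; that the page is really UTF-8 is my assumption):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string
r = requests.get('https://www.mlit.go.jp/', headers=headers)

# Tell requests the real encoding before reading r.text; apparent_encoding
# sniffs the response body instead of trusting the missing header.
r.encoding = r.apparent_encoding  # or simply: r.encoding = 'utf-8'

soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find_all('div', {'class': 'clearfix'})[1].text)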

bs4 find sub-text / find_next

On the linked page, there is a section called "COSEWIC Assessment report". This section has emboldened text that heads each category and then non-bold text containing the information for that category. I am looking to scrape the non-bold text using bs4.
The bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title for each category using result = s.find("strong", text=re.compile("Scientific name")).
That said, I would then like to scrape the information under each given header. If I inspect the HTML for that section, it looks like this:
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
So, from a starting point of having located the "Scientific name" part, how do I get the "Anarta edwarsii" part?
I thought maybe bs4's find_next_sibling() would work, or something of the sort, but so far nothing has been successful. Also important to note is that I cannot use the text to look up the element, because I have to repeat the process for many different species. The header remains constant but its sub-text will change.
Thanks!!
You can use next_siblings as a result set, iterate over it with a list comprehension, and join() the results:
' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()
Output:
'"Anarta edwarsii"'
Alternative example:
Select the <p> that contains the string "Scientific name:", get its stripped_strings as the list ['Scientific name:', 'Anarta edwardsii'], and pick the last element:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://www.google.com/'
}
r = requests.get('https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# the last stripped string in the <p> is the value that follows the header
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]
Output:
'"Anarta edwarsii"'

read_csv ignoring delimiters

I have a problem handling a questionnaire that I need to import into R. The problem is that the following entry contains commas.
"Mozilla/5.0 (Windows NT 6.2; Win64;" x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome
When I specify that commas should be used as delimiters, it splits the entry at "KHTML, like Gecko" and at "...36",Windows 8". But it somehow does not split at other commas, such as "4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,".
I do not understand why these cases are handled differently and how to resolve the issue.
Any help would be greatly appreciated.
4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,0h 57m 2s,1595340364,1595341255,5,5,,Terminating,1,"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 ( like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome 73.0.3683.75,1920,,DKJK1995,1,1,1,4,2,2,6,6,7,1,2,5,5,4,4,2,2,3,2,2,3,4,3,3,5,5,5,2,2,4,4,4,5,6,6,6,1,2,3,3,4,3,5,6,6,2,3,4,5,4,4,4,2,3,2,3,5,5,5,6,2,2,1,2,3,5,5,4,5,2,2,3,100,0,100,0,20,72,100,0,100,0,0,0,100,0,100,0,4,5,5,4,4,4,3,3,2,4,4,5,5,6,6,4,4,4,4,4,5,2,2,3,3,5,3,3,5,5,2,2,2,2,4,3,3,4,5,2,3,2,3,5,3,2,2,2,2,2,2,6,6,6,24,1,2,90,60,90,6,Deutschland,4,1,2,2,1,,1,1,"1,6,5,8,3,2,7,4","6,5,7,3,1,8,4,2","2,1","1,2,4,3",3,56,92,70,54,25,64,29,132,22,34,27,136,15,66,32,109,16,46,36,132,13,31,20,132,10,40,20,107,8,25,16,84,12,27,19,91,7,148,329,355,233,185,29,216,72,"4,2,6,7,5,1,8,3","2,3,1,7,4,8,6,5","5,6,7,1,3,8,4,2","8,3,11,13,10,2,16,5","7,4,11,13,9,2,16,6","7,4,11,13,9,2,16,6";
The problem is that the line you showed has mismatched quotes. From the comments, it appears that the quote in Win64;" is not supposed to end the field, which is supposed to continue to Safari/537.36".
Fixing this will need three steps: read the file without interpreting fields, clean up the bad strings, then read it again as a CSV. For example, if your file is called data.csv, you could do it like this:
lines <- readLines("data.csv")              # read raw lines without parsing fields
cleaned <- sub('Win64;"', 'Win64;', lines)  # drop the stray quote from each affected line
data <- read.csv(textConnection(cleaned))   # parse the repaired text as CSV
I don't know whether that cleanup line will work in general: you'll need to find out where the spurious quotes are and work out a pattern to match and remove them.

How to specify a special encoding for German characters in pd.read_sql

After some issues trying to connect to the database via Jupyter, the connection works now.
However, when querying the database, there are issues with 2 of the tables (the others work well).
FYI: I'm working with data from Germany, so maybe the problem is with the special characters ü, ä, ö?
I've been thinking of telling the connection to use a special encoding, but I have no idea how to do it.
import cx_Oracle
import pandas as pd
# The connection
conn = cx_Oracle.Connection(user='', password='', dsn='')
# The Query
SQL_Query = pd.read_sql_query('''select * from CB_CONTRACTS''', conn)
# Define the DF
df_CRH = pd.DataFrame(SQL_Query)
# Display the DF
df_CRH.head()
The output should be a data frame containing the full query result.
As I said, it works for 8 of 10 tables; it is just these 2 that I've been struggling with.
The Error Message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 23: character maps to <undefined>
cx_Oracle.connect() accepts encoding and nencoding parameters; see the cx_Oracle.connect() documentation. Alternatively, you can set the NLS_LANG environment variable before starting Python. Update: cx_Oracle version 8 uses the UTF-8 character set by default and will ignore the character set component of NLS_LANG; see Character sets and globalization.
Cursor.var() has an encodingErrors parameter; see Cursor.var and https://github.com/oracle/python-cx_Oracle/issues/162
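A sketch of the first suggestion, assuming cx_Oracle 7.x, where the connection still honours these parameters (on version 8+ they are ignored and UTF-8 is the default anyway); the empty credentials are the question's placeholders:

import cx_Oracle
import pandas as pd

# Ask the driver to return strings as UTF-8 instead of relying on the
# client's NLS_LANG character set, which is where the 'charmap' error arises.
conn = cx_Oracle.connect(user='', password='', dsn='',
                         encoding='UTF-8', nencoding='UTF-8')

df_CRH = pd.read_sql_query('select * from CB_CONTRACTS', conn)
df_CRH.head()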

Why have all my right-to-left letters been corrupted?

I have web application solutions, and some pages contain Arabic letters. After moving the solutions to another PC, all the Arabic letters were converted to corrupted letters, something like this:
ÈíÇäÇÊ ÇáØÇáÈ
How can I fix this problem?
It's not corrupted. The files are saved in the windows-1256 code page, and VS.NET, using the current culture of the OS (windows-1256), shows those letters correctly. If the regional settings of the new system are not configured for windows-1256, you will see the letters above. You have 2 options here:
Change the regional settings of Windows to Arabic (windows-1256)
Or convert your files to UTF-8 with signature, this way:
string data = File.ReadAllText(path, Encoding.GetEncoding("windows-1256"));  // decode with the original code page
File.WriteAllText(path, data, Encoding.UTF8);  // Encoding.UTF8 writes the UTF-8 signature (BOM)
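If it's easier to fix the files outside Visual Studio, the same two-step re-encode can be sketched in Python ('page.aspx' is a placeholder path; 'utf-8-sig' is the codec that writes the signature/BOM):

from pathlib import Path

path = Path('page.aspx')  # placeholder: one of the affected source files

# Decode the bytes as windows-1256, then rewrite as UTF-8 with a BOM so
# any machine picks the right encoding regardless of its locale.
text = path.read_text(encoding='windows-1256')
path.write_text(text, encoding='utf-8-sig')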
