Why do I get garbled characters?

Why do I get garbled characters when parsing a web page?
I have used encoding="big-5\\IGNORE" to get normal characters, but it doesn't work.
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
How should I revise my code to turn the garbled characters into normal text?
@MartinMorgan (below) suggested using
htmlParse(url,isURL=TRUE,encoding="big-5")
Here is an example of what is going on:
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock
The total number of records should be 1335; in the case above it is only 309, so many records appear to have been lost.
This is a complicated problem. There are a number of issues:
A badly-formed HTML file
The page is not standard, well-formed HTML; let me prove my point.
Please run:
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
Now try opening the downloaded file stockbig-5 with Firefox.
An iconv limitation in R
If an HTML file is well formed, you can use:
data=readLines(file)
datachange=iconv(data,from="source encoding",to="target encoding//IGNORE")
When an HTML file is not well formed, that approach fails. In this example, please run:
data=readLines("stockbig-5")
A warning will occur:
1: In readLines("stockbig-5") :
  invalid input found on input connection 'stockbig-5'
You can't use R's iconv function to change the encoding of a badly-formed HTML file.
You can, however, do this in the shell.

I solved it myself; it took one long, hard night.
System: Debian 6 (UTF-8 locale) + R 2.15 (UTF-8 locale) + GNOME Terminal (UTF-8 locale).
Here is the code:
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
system('iconv -f big-5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')
data=htmlParse("stockutf-8",isURL=FALSE,encoding="utf-8\\IGNORE")
tdata=xpathApply(data,"//table[#class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock
I would like my code to be more elegant; the shell command embedded in R code is arguably ugly:
system('iconv -f big-5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')
I tried to replace it with pure R code but failed. How can this be done in pure R?
You can reproduce the result on your computer with this code.
Half done, half successful; I'll keep trying.
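Here is one pure-R sketch of that iconv step (an untested assumption: the downloaded file contains no embedded NUL bytes, so rawToChar succeeds; iconv's sub argument plays the role of the shell's //IGNORE):
# Read raw bytes so readLines never tries (and fails) to re-encode them.
bytes <- readBin("stockbig-5", what = "raw", n = file.info("stockbig-5")$size)
txt <- rawToChar(bytes)
# sub = "" drops unconvertible bytes, mimicking the shell iconv //IGNORE flag.
utf8 <- iconv(txt, from = "big-5", to = "UTF-8", sub = "")
writeLines(utf8, "stockutf-8", useBytes = TRUE)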

Related

Read .sql into R with Spanish characters (á, é, í, ó, ú, ñ, etc)

So, I've been struggling with this for a while now and can't seem to google my way out of it. I'm trying to read a .sql file into R; I always do that to avoid putting 100+ lines of SQL in my R scripts. I usually do this:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- read_file("path/to/script.sql")
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
However, this time my sql script has some Spanish characters in it. Say something like this:
select tree_id, tree
from forest.trees
where species = 'árbol'
When I read this script into R and run the query, it just doesn't return anything, but if I copy and paste the SQL script into an R string, it works! So the problem seems to be in the line where I read the script into R.
I tried changing the string's encoding in a couple of ways:
# none of these work
query <- read_file("path/to/script.sql")
Encoding(query) <- "latin1"
query <- readLines("path/to/script.sql", encoding = "latin1")
query <- paste0(query, collapse = " ")
Unfortunately I don't have a public database to offer to anyone reading this. I'm connecting to a PostgreSQL 11 database.
--- UPDATE ----
I'm on a windows 10 machine, with US locale.
When I use the read_file function, the contents of query look fine and the Spanish characters print as they should, but when I pass it to dbGetQuery it just doesn't fetch anything.
I tried forcing the "latin1" encoding because I found online that this often fixes Spanish characters in R. When I did, the Spanish characters printed out wrong, so I didn't expect it to work, and it didn't.
The character values in my database are UTF-8 encoded.
Just to be completely clear, all my attempts to read the .sql script haven't worked, however this does work:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
# df actually has results
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
The encoding argument to readLines only declares how the strings should be marked after they are read; it does not re-encode the file's contents. Opening the file through a connection with an explicit encoding does the actual conversion. Try this instead:
filetext <- readLines(file("path/to/script.sql", encoding = "latin1"))
See this answer for more details: R: can't read unicode text files even when specifying the encoding
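Since the question already uses readr, an equivalent fix (a sketch along the same lines, reusing the question's placeholder path) is to declare the encoding through read_file's locale argument:
# Declare the input encoding so readr converts the bytes while reading.
query <- readr::read_file("path/to/script.sql",
                          locale = readr::locale(encoding = "latin1"))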
So after some time to think about it, I wondered why the solution proposed by MrFlick didn't work. I checked the encoding of the file created by this chunk:
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
write_lines(query, "test.sql")
After checking what encoding test.sql had, it turned out to be ANSI, which didn't look right. So I manually changed my original script.sql encoding to ANSI. After that it worked totally fine.
This solution, however, didn't work when I cloned my repo in an Ubuntu environment; on Ubuntu there was no problem with the original UTF-8 encoding.
Hope this helps anyone dealing with this on Windows.
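As a side note beyond the original answers: for anyone wanting to check a file's encoding without leaving R, readr ships a guesser:
# Returns a table of candidate encodings with confidence scores.
readr::guess_encoding("path/to/script.sql")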

How to do encoding in R: why ’ appears instead of apostrophes (') and how to resolve it

Hi, I am trying to do text mining in R version 3.4.2.
I am importing .txt files from a local drive using the VCorpus command, but after running the following code
cname <- file.path("C:", "texts")
cname
dir(cname)
library(readr)
library(tm)
docs <- VCorpus(DirSource(cname))
summary(docs)
inspect(docs[1])
writeLines(as.character(docs[1]))
Output:
Well, the election, it came out really well. Next time we**’**ll triple the number and so on
The ’ was originally an apostrophe ('). How can I convert it back or recover the original text in RStudio?
I would appreciate it if someone could help me.
Thanks in advance.
Encoding issues are not easy to solve, since they depend on various factors (file encoding, encoding settings during loading, etc.). As a first step you might try the following line; if we are lucky it solves your problem.
Encoding(your_text) <- "UTF-8"
Otherwise, other solutions have to be checked, e.g., using stri_trans from the stringi package, or replacing wrong symbols by brute force via gsub(falsecharacter, desiredcharacter, your_text, fixed = TRUE) (there are debugging tables for mojibake, e.g., on i18nqa.com). Another option is sketched below.
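If the source files are known to be UTF-8 (an assumption about the data, not something stated in the question), you can declare the encoding when the corpus is built, since tm's DirSource accepts an encoding argument:
# Declare the file encoding up front so tm decodes the texts correctly.
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"))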
I solved this a different way.
I found that apostrophes that looked like this: ' would render properly, while ones that looked slightly different, like this: ’ would not.
So, for any text that I was printing, I converted ’ to ' like this:
mytext <- gsub("’", "'", mytext )
Tada... no more issues with "’".

Can't read vcf-derived file in R: "no lines available in input"

I have a vcf file and I want to extract the header, which is the only line that has the pattern '#CHROM' in it. So, in the Mac terminal I typed the following:
grep '#CHROM' file.vcf > headerlinevcf.txt
I can see it in the terminal or in vi, and I found what I expected: all the columns of that line (i.e., the text of the header). It looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 ...
Now, I try to read it into R as a vector, because I have to do some other stuff to it, and the following error comes up:
> headerline<-as.vector(read.table('headerlinevcf.txt'))
Error in read.table("headerlinevcf.txt") : no lines available in input
I tried read.delim using a tab and then a space as the separator, but it didn't work. I also tried:
headerline <- read.table('headerlinevcf.txt')
And also gives back the same error.
I also tried the readLines command, which gives me this:
headerline <- readLines('headerlinevcf.txt')
> headerline[1]
[1] "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tHG00099\tHG00100\tHG00101\tHG00102\tHG00103\tHG00105\tHG00106\tHG00107\tHG00... <truncated>
It seems that the VCF file (and thus this VCF-derived file) has some strange delimitation.
A friend tried reading it in Python, changing the '\t' characters into spaces, and opening the new file in R, but the same error came up again.
I don't know this format well enough to spot the problem easily, so I've been struggling with it for the past couple of days. Please, if someone knows what's happening, lend me a hand! Thanks in advance.
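One plausible cause, offered here as an educated guess rather than something established in the thread: read.table uses '#' as its default comment character, so a line beginning with '#CHROM' is treated entirely as a comment, leaving no lines to read. Disabling comment handling may fix it:
# Hypothetical fix: turn off comment parsing so the leading '#' survives.
headerline <- read.table("headerlinevcf.txt", sep = "\t",
                         comment.char = "", stringsAsFactors = FALSE)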

Problems with reading a txt file (EOF within quoted string)

I am trying to use read.table() to import this TXT file into R (it contains information about meteorological stations provided by the WMO):
However, when I try to use
tmp <- read.table(file=...,sep=";",header=FALSE)
I get this warning:
EOF within quoted string
and only 3514 of the 6702 lines appear in tmp. From a quick look at the text file, I couldn't find any seemingly problematic characters.
As suggested in other threads, I also tried quote="". The EOF warning disappeared, but still only 3514 lines are imported.
Any advice on how I can get read.table() to work for this particular txt file?
It looks like your data actually has 11548 rows. This works:
read.table(url('http://weather.noaa.gov/data/nsd_bbsss.txt'),
           sep=';', quote=NULL, comment.char='', header=FALSE)
Edit: updated according to @MrFlick's comments below.
The problem is the "^M" (carriage return) characters, which R will not recognize. To load the file, you only need to specify the encoding, like this:
read.table("nsd_bbsss.txt",sep=";",header=F,encoding="latin1",quote="",comment='',colClasses=rep("character",14)) -> data
But line 8638 has more than 14 columns, which differs from the other lines and may lead to an error message. A way to locate such lines is sketched below.
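To find malformed lines yourself, base R's count.fields can help; this sketch reuses the separator settings from the answer above:
# Count fields per line with the same parsing options as read.table above.
n_fields <- count.fields("nsd_bbsss.txt", sep=";", quote="", comment.char="")
which(n_fields != 14)  # line numbers whose field count deviates from the expected 14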

Extracting an html table in another language using R

I am using R to extract HTML tables from a website.
However, the HTML table is in Hindi and the text is displayed as Unicode escape codes.
Is there any way I can set or install the font family and get the actual text instead of the Unicode codes?
The code I follow is :
library('XML')
table<-readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
table[[which.max(n.rows)]]
The sample site is : http://mpbhuabhilekh.nic.in/bhunakshaweb/reports/mpror.jsp?base=wz/CP8M/wj/DP8I/wz/CoA==&vsrno=26-03-02-00049-082&year=2013&plotno=71
The output comes as :
"< U+092A>"
etc.
Note: For some reason, readHTMLTable works only when I remove the first two unwanted tables from the HTML file. So if you want to test with the file, please edit out the first two tables or simply delete the first two table headers.
Any help will be appreciated. Thanks
Update:
The issue seems to be related to the locale set in R on Windows machines. I am unable to figure out how to get it working, though!
The solution I have found for this locale-related bug is to set the corresponding encoding:
library('XML')
table<-readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
output <- table[[which.max(n.rows)]]
for (n in names(output)) Encoding(levels(output[[n]])) <- "UTF-16"
The output in the R console might still look like gibberish, but the advantage is that once you export the dataset (say, to a CSV), it all appears in Hindi in other editors.
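For the export step, declaring the encoding explicitly is a reasonable precaution (a suggestion beyond the original answer; the output file name is arbitrary), since write.csv accepts a fileEncoding argument:
# Write the recovered table as UTF-8 so other editors decode the Hindi text correctly.
write.csv(output, "hindi_table.csv", row.names = FALSE, fileEncoding = "UTF-8")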
