Issues with latin characters using htmlparse in r

Issues with latin characters using htmlparse in r - r

I'm having some encoding problems when trying to webscrape a government page in portuguese. This is my code:
library("RCurl")
library("XML")
html = getURL("http://sei.cade.gov.br/sei/institucional/pesquisa/documento_consulta_externa.php?u0r2HDE7WIdiBH3O1y0Dr6krqmN-VVCNjJtZWrdX1mgt3CiIC_RM90F01GwwNk20muowNXaYKrI2Ob8UQUkAoA,,")
par = htmlParse(html)
x = xpathSApply(par, "//strong", xmlValue)[1]
print(x)
[1] "NOTA TÃ‰CNICA NÂº 58/2017/CGAA6/SGA2/SG/CADE"
I've tried some things, like adding encoding="latin1" and encoding="UTF-8" to the htmlParse, and adding .encoding="latin" and .encoding="UTF-8" to the getURL.
My system seems to be set to the right location, as Sys.getlocale() gives me
Sys.getlocale()
[1] "LC_COLLATE=Portuguese_Brazil.1252;LC_CTYPE=Portuguese_Brazil.1252;LC_MONETARY=Portuguese_Brazil.1252;LC_NUMERIC=C;LC_TIME=Portuguese_Brazil.1252"
I'm out of ideas here, and would appreciate any help.

I was able to get this to work using your code with one addition.
## Your code
library("RCurl")
library("XML")
html = getURL("http://sei.cade.gov.br/sei/institucional/pesquisa/documento_consulta_externa.php?u0r2HDE7WIdiBH3O1y0Dr6krqmN-VVCNjJtZWrdX1mgt3CiIC_RM90F01GwwNk20muowNXaYKrI2Ob8UQUkAoA,,")
par = htmlParse(html)
x = xpathSApply(par, "//strong", xmlValue)[1]
## Addition
x2 = iconv(x, from="UTF-8", to="latin1")
print(x2)
"NOTA TÉCNICA Nº 58/2017/CGAA6/SGA2/SG/CADE"

Related

Not getting the right text after stemming in text analysis (Swedish)

I am having problem with getting the right text after stemming in R.
Eg. 'papper' should show as 'papper' but instead shows up as 'papp', 'projekt' becomes 'projek'.
The frequency cloud generated thus shows these shortened versions which loses the actual meaning or becomes incomprehensible.
What can I do to get rid of this problem? I am using the latest version of snowball(0.6.0).
R Code:
library(tm)
library(SnowballC)
text_example <- c("projekt", "papper", "arbete")
stem_doc <- stemDocument(text_example, language="sv")
stem_doc
Expected:
stem_doc
[1] "projekt" "papper" "arbete"
Actual:
stem_doc
[1] "projek" "papp" "arbet"

What you describe here is actually not stemming but is called lemmatization (see #Newl's link for the difference).
To get the correct lemmas, you can use the R package UDPipe, which is a wrapper around the UDPipe C++ library.
Here is a quick example of how you would do what you want:
# install.packages("udpipe")
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe
udmodel_swed <- udpipe_load_model(file = dl$file_model)
text_example <- c("projekt", "papper", "arbete")
x <- udpipe_annotate(udmodel_swed, x = text_example)
x <- as.data.frame(x)
x$lemma
#> [1] "projekt" "papper" "arbete"

r - source script file that contains unicode (Farsi) character

write the text below in a buffer and save it as a .r script:
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
then try these lines to source() it:
script <- "path/to/script.R"
file(script,
encoding = "UTF-8") %>%
readLines() # works fine
file(script,
encoding = "UTF-8") %>%
source() # works fine
source(script) # the Farsi letters in the environment are misrepresented
source(script,
encoding = "UTF-8") # gives error
The last line throws error. I tried to debug it and I believe there is a bug in the source function, in the following lines:
...
loc <- utils::localeToCharset()[1L]
...
The error occurs at .Internal(parse( line.
...
exprs <- if (!from_file) {
if (length(lines))
.Internal(parse(stdin(), n = -1, lines, "?",
srcfile, encoding))
else expression()
}
else .Internal(parse(file, n = -1, NULL, "?", srcfile,
encoding))
...
The exact error is:
Error in source(script, encoding = "UTF-8") :
script.R:2:17: unexpected INCOMPLETE_STRING
1: #' #export
2: letters_fa <- c('
^

The solution to this problem is to either change the OS Locale to a native Locale (e.g. Persian in this case) or use R built-in function Sys.setlocale(locale="Persian") to change an R session native Locale.

Use source without specifying the encoding, and then modify the vector's encoding with Encoding:
source(script)
letters_fa
# [1] "Ø§Ù„Ù\u0081" "Ø¨" "Ù¾" "Øª" "Ø«"
# [6] "Ø¬" "Ú†" "Ø" "Ø®" "Ø±"
# [11] "Ø²" "Ø¯"
Encoding(letters_fa) <- "UTF-8"
letters_fa
# [1] "الف" "ب" "پ" "ت" "ث" "ج" "چ" "ح" "خ" "ر" "ز" "د"

Unused argument (pretty = TRUE) in R

I had this working completely fine and changed nothing, but when I ran it again I got the error code:
Error in toJSON(json_data, pretty = TRUE) :
unused argument (pretty = TRUE)
in reference to the line below. When I print(json_data) everything prints out fine so I know it has nothing to do with the data. I'm pretty lost, any ideas? I reinstalled all of the packages I am using (jsonlite, rjson, etc.) and as far as my research shows pretty = TRUE should be working just fine. My code is below:
x <- toJSON(json_data, pretty = TRUE)
Here is an example of the data:
$`#context`
[1] "context.jsonld"
$archiveType
[1] "Marine sediment"
$dataSetName
[1] "2005-804.devernal.2013"
$pub
$pub[[1]]
$pub[[1]]$pubYear
[1] "2013"

R tm package: utf-8 text

I would like to create a wordcloud for non-english text in utf-8 (actually, it's in kazakh language).
The text is displayed absolutely right in inspect function of the tm package.
However, when I search for word frequency everything is displayed incorrectly:
The problem is that the text is displayed with coded characters instead of words. Cyrillic characters are displayed correctly. Consquently the wordcloud becomes a complete mess.
Is it possible to assign encoding to the tm function somehow? I tried this, but the text on its own is fine, the problem is with using tm package.
Let a sample text be:
Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді.
Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді.
Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық.
Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы.
Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.
My simple code is this:
(Based on onertipaday.blogspot.com tutorials:)
require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
1 2
44 4
findFreqTerms(ap.tdm, lowfreq=2)
[1] "<U+04D9>лем" "арман" "еді"
[4] "м<U+04D9><U+04A3>гілік"
Those words should be: "Әлем", арман", "еді", "мәңгілік". They are displayed correctly in inspect(ap.corpus) output.
Highly appreciate any help! :)

The problem comes from the default tokenizer. tm by default uses scan_tokenizer which it looses encoding(maybe you should contact the maintainer to add an encoding argument).
scan_tokenizer function (x) {
scan(text = x, what = "character", quote = "", quiet = TRUE) }
One solution is to provide your own tokenizer to create the matrix terms. I am using strsplit:
scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))
Then you get the result well encoded:
findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"

Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using :
Encoding(text) <- "UTF-8"
Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), it'll throw an error.
text <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."
Encoding(text)
# [1] "unknown"
Encoding(text) <- "UTF-8"
# [1] "UTF-8"
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
content(ap.corpus[[1]])
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
print(table(ap.d$freq))
# 1 2 3
# 62 5 1
print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
It worked for me, hope it does for you too.

List and description of all packages in CRAN from within R

I can get a list of all the available packages with the function:
ap <- available.packages()
But how can I also get a description of these packages from within R, so I can have a data.frame with two columns: package and description?

Edit of an almost ten-year old accepted answer. What you likely want is not to scrape (unless you want to practice scraping) but use an existing interface: tools::CRAN_package_db(). Example:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
The function brings (currently) 66 columns back of which the of interest here are a part.
I actually think you want "Package" and "Title" as the "Description" can run to several lines. So here is the former, just put "Description" in the final subset if you really want "Description":
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>

Dirk has provided an answer that is terrific and after finishing my solution and then seeing his I debated for some time posting my solution for fear of looking silly. But I decided to post it anyway for two reasons:
it is informative to beginning scrapers like myself
it took me a while to do and so why not :)
I approached this thinking I'd need to do some web scraping and choose crantastic as the site to scrape from. First I'll provide the code and then two scraping resources that have been very helpful to me as I learn:
library(RCurl)
library(XML)
URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = T,
strip.white = T, as.is = FALSE, sep = ",", na.strings = c("999",
"NA", " "))[, 1])
Trim <- function(x) {
gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 #for demo purpose; use:
#len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])
scraper <- function(urls){ #function to grab description
doc <- htmlTreeParse(urls, useInternalNodes=TRUE)
nodes <- getNodeSet(doc, "//p")[[3]]
return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { #replace errors with NA
if(class(x)[1] != "XMLInternalElementNode"){
NA
} else {
Trim(gsub("\\s+", " ", xmlValue(x)))
}
}
)
pack_n_desc <- data.frame(package=packs[seq_len(len.samps)],
description=info2) #make a dataframe of it all
Resources:
talkstats.com thread on web scraping (great beginner
examples)
w3schools.com site on html stuff (very
helpful)

I wanted to try to do this using a HTML scraper (rvest) as an exercise, since the available.packages() in OP doesn't contain the package Descriptions.
library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)
P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE) # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE) # XML: The Description
P1 <- P1[lengths(P1) > 0 & P1 != ""] # Remove NULL and empty ("") items
length(P1); length(P2);
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
However, although working quite well for small array/dataframe list (subset), I ran into a display problem with the full list, where the data would be shown either column-by-column or unaligned. I would have been great to have this paged and properly formatted in a new window somehow. I tried using page, but I couldn't get it to work very well.
EDIT:
The recommended method is not the above, but rather using Dirk's suggestion (from the comments below):
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(db[,1], db[,52])
colnames(mdf) <- c("Package", "Description")
print(mdf, right=FALSE, row.names=FALSE)
However, this still suffers from the display problem mentioned...

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Issues with latin characters using htmlparse in r - r

Related

Not getting the right text after stemming in text analysis (Swedish)

r - source script file that contains unicode (Farsi) character

Unused argument (pretty = TRUE) in R

R tm package: utf-8 text

List and description of all packages in CRAN from within R

Categories

Resources