tokenizing a list doesn't work with UTF8

tokenizing a list doesn't work with UTF8 - r

I extract some data from Oracle DB to do some text mining. My data is UTF8 and vocab can't handle it.
library(text2vec);
library(DBI);
Sys.setenv(TZ="+03:00");
drv=dbDriver("Oracle");
con=dbConnect(drv,username="user","pass",dbname="IP:port/servicename");
list=dbGetQuery(con,statement = "select * from test");
it_list = itoken(list$FNAME,
preprocessor = tolower,
tokenizer = word_tokenizer,
ids = list$ID,
progressbar = FALSE);
vocab = create_vocabulary(it_list, ngram = c(ngram_min = 1L, ngram_max =2L));
but just English word exists in vocab.
list variable object exists in this link (can be loaded with load())
I use windows
R.version:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string Oracle Distribution of R version 3.3.0 (2016-05-03)
nickname Supposedly Educational

Thanks for reporting. This is actually an issue with base::strsplit() which is used for basic tokenization.
I suggest you to use stringi package for regex with strong UTF-8 support. Or simply use tokenizers - good solution for tokenization on top of stringi.
For example you can use tokenizers::tokenize_words as drop-in replacement of word_tokenizer
tokenizers::tokenize_words("پوشاک بانک لي ")
# "پوشاک" "بانک" "لي"
For some reason base::strsplit() doesn't consider theses arabic symbols as "alphanumeric" ([[:alnum:]]).
strsplit("i was. there", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# "i" "was" "there"
strsplit("پوشاک بانک لي ", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# character(0)

Related

read scientific format numbers from .csv file

I read into R a CSV file with large numbers formatted in scientific notation.
I used a couple of statistical R functions (MSE and RMSPE) on the numbers and got an incorrect answer (I checked it in Excel).
When I changed the format in the CSV file to ordinary number format, i.e. with lots of zeroes, the R functions calculated correctly.
What was I doing wrong?
Thanks for any insights,
Claire
UPDATE: console output added. I am using R4.0.2. I have imported two CSV files, one called MPERRORS.csv with the original scientific notation format and the second called CBERRORS.csv saved in number format. I believe the issue is to do with the conversion in Excel of scientific notation format numbers.
Code is below and I have also pasted in results. If you look at number 6.89E+11 it shows as 689000000000 in the formula bar but if you convert it to number format you get 689116020736. Apologies if this is wrong, I am a newbie with minimal R experience as you will have guessed.
CLAIRE
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
MPERRORS RESULTS
3.359375e+20 MSE function for MPERRORS
3.359375e+20 mse function for MPERRORS
0.01878106 RMSPE function
0.991949 R2 function
0.9916312 gofR2 function
CBERRORS results
2.94363e+20 MSE
2.94363e+20 mse
0.01805762 RMSPE
0.9929211 R2
enter code here
version
library(ineq)
library(Metrics)
library(MLmetrics)
library(ehaGoF)
ERRORS1<-read.csv(file = 'MPerrors.csv')
ERRORS2<-read.csv(file = 'CBerrors.csv')
str(ERRORS1)
str(ERRORS2)
hist1<-ERRORS1[,2]
base1<-ERRORS1[,3]
print(hist1)
dput(head(ERRORS1,10))
MSE(base1,hist1)
mse(base1,hist1)
RMSPE(base1,hist1)
R2_Score(base1,hist1)
gofRSq(base1,hist1, dgt = 7)
hist2<-ERRORS2[,2]
base2<-ERRORS2[,3]
print(hist2)
dput(head(ERRORS2,10))
MSE(base2,hist2)
mse(base2,hist2)
RMSPE(base2,hist2)
R2_Score(base2,hist2)
gofRSq(base2,hist2, dgt = 7)
# MPERRORS FIRST 10 LINES
structure(list(Time..Year. = 1990:1999, real.gdp.at.market.prices =
c(6.89e+11, 7.51e+11, 7.27e+11, 7.55e+11, 7.85e+11, 7.99e+11, 8.53e+11,
8.95e+11,
9.67e+11, 1.02e+12), X..BusinessAsUsual = c(6.79e+11, 7.25e+11,
7.31e+11, 7.66e+11, 7.76e+11, 7.86e+11, 8.26e+11, 8.84e+11, 9.56e+11,
1.01e+12), Diff = c(9.93e+09, 2.54e+10, -4.32e+09, -1.05e+10,
9.4e+09, 1.36e+10, 2.7e+10, 1.02e+10, 1.13e+10, 1.49e+10)), row.names =
c(NA,
10L), class = "data.frame")
#CBERRORS FIRST 10 LINES
structure(list(Time..Year. = 1990:1999, real.gdp.at.market.prices =
c(689116020736,
750739980288, 726938025984, 755445989376, 785442996224, 799333023744,
852837007360, 894628003840, 966879019008, 1021999972352), X..BusinessAsUsual
= c(679182532608,
725334294528, 731261042688, 765934698496, 776039104512, 785780506624,
825845153792, 884472348672, 955611414528, 1007061172224), Diff = c(9.93e+09,
2.54e+10, -4.32e+09, -1.05e+10, 9.4e+09, 1.36e+10, 2.7e+10, 1.02e+10,
1.13e+10, 1.49e+10)), row.names = c(NA, 10L), class = "data.frame")

Error in Running NLRX (NetLogo) in Manjaro (Arch) Linux

I am attempting to run an NLRX simulation in Manjaro Linux (RNetLogo wouldn't work for some reason either), and am running into the following error when attempting to set up an dummy experiment:
cp: cannot stat '~/.netlogo/NetLogo 6.1.1/netlogo-headless.sh': No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sh: /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
Error in util_gather_results(nl, outfile, seed, siminputrow) :
Temporary output file /tmp/Rtmpj15Yf7/nlrx5493_1365385ab03157.csvnot found. On unix systems this can happen if the default system temp folder is used.
Try reassigning the default temp folder for this R session (unixtools package).
In addition: Warning message:
In system(NLcall, wait = TRUE) : error in running command
Given that I am running R 4.0.0, the Unixtools package doesn't work, so that's out of the question. How would I go about fixing this?
Code for those curious:
library(nlrx)
netlogopath <- file.path("~/.netlogo/NetLogo 6.1.1")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("/home/out")
nl <- nl(nlversion = "6.0.3",
nlpath = netlogopath,
modelpath = modelpath,
jvmmem = 1024)
nl#experiment <- experiment(expname="wolf-sheep",
outpath=outpath,
repetition=1,
tickmetrics="true",
idsetup="setup",
idgo="go",
runtime=50,
evalticks=seq(40,50),
metrics=c("count sheep", "count wolves", "count patches with [pcolor = green]"),
variables = list('initial-number-sheep' = list(min=50, max=150, qfun="qunif"),
'initial-number-wolves' = list(min=50, max=150, qfun="qunif")),
constants = list("model-version" = "\"sheep-wolves-grass\"",
"grass-regrowth-time" = 30,
"sheep-gain-from-food" = 4,
"wolf-gain-from-food" = 20,
"sheep-reproduce" = 4,
"wolf-reproduce" = 5,
"show-energy?" = "false"))
nl#simdesign <- simdesign_lhs(nl=nl,
samples=100,
nseeds=3,
precision=3)
results <- run_nl_all(nl = nl)
R Version for those who may want it:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day

In case others find this helpful: I have encountered similar errors as the result of file path misspecification. For instance, double check model path. You may need to drop app/.

How to use custom font family in UpsetR plots?

I am trying to create set plot using the UpSetR package; however, I'd like to control the family of fonts. The ideal approach would be using theme function from ggplot2 but this is not supported at the moment by UpSetR (there's an open issue from 2016 on GitHub here) and results in NULL.
Example to create test plot:
# R version ---------------------------------------------------------------
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
# system x86_64, mingw32
# status
# major 3
# minor 5.1
# year 2018
# month 07
# day 02
# svn rev 74947
# language R
# version.string R version 3.5.1 (2018-07-02)
# nickname Feather Spray
# Package versions --------------------------------------------------------
# Assumes packages are already installed
packageVersion(pkg = "extrafont") == "0.17"
packageVersion(pkg = "UpSetR") == "1.3.3"
packageVersion(pkg = "ggplot2") == "3.1.0"
# Load UpSetR -------------------------------------------------------------
library(UpSetR)
library(extrafont)
library(ggplot2)
# Example -----------------------------------------------------------------
movies <- read.csv( system.file("extdata", "movies.csv", package = "UpSetR"), header=T, sep=";" )
upset(data = movies,
order.by = "freq",
keep.order = TRUE,
mainbar.y.label = "Example plot",
point.size = 4,
line.size = 1,
sets.x.label = NULL)
Going forward, the ideal would be where UpSetR supports layers / + theme() function from ggplot2; however, the UpSetR is not able to use "+" "layer name" logic. For example, if + theme(text = element_text(family = "Times New Roman")) were added at the end of the call above, it would return NULL and produce no plot.
Can you please suggest any workaround (or customization of function in package) that would support custom fonts in the example plot above produced by UpSetR? Alternatively, is there a way to force default font family in all plots without specifying any family arguments manually?

One way of achieving this with another UpSet implementation would be:
# install if needed
if(!require(ggplot2movies)) install.packages('ggplot2movies')
if(!require(devtools)) install.packages('devtools')
devtools::install_github('krassowski/complex-upset')
movies = as.data.frame(ggplot2movies::movies)
genres = c('Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short')
library(ComplexUpset)
library(ggplot2)
upset(
movies, genres, min_size=45, width_ratio=0.1,
themes=upset_default_themes(text=element_text(family='Times New Roman'))
)
Disclamer: I am the author of this implementation.

How is the correct use of stemDocument?

I have already read this and this questions, but I still didn't understand the use of stemDocument in tm_map. Let's follow this example:
q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
readerControl = list(language = "pt",
load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
it does work! But if I use:
> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
it doesn't work. Why so?

Unfortunately you stumbled on a bug. stemDocument works if you pass on the language when you do:
stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"
But when using this in tm_map, the function starts of with stemDocument.PlainTextDocument. In this function the language of the corpus is checked against the language you supply in the function. This works correctly. But at the end of this function everything is passed on to the function stemDocument.character, but without the language component. In stemDocument.character the default language is specified as English. So within the tm_map call (or the DocumentTermMatrix) the language you supply with it will revert back to English and the stemming doesn't work correctly.
A workaround could be using the package quanteda:
library(quanteda)
my_dfm <- dfm(x = c("poder", "pode"))
my_dfm <- dfm_wordstem(my_dfm, language = "pt")
my_dfm
Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
features
docs pod
text1 1
text2 1
Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.

Calling applescript in R

Is there any way to run an applescript within R?
I found this reference in an R FAQ on CRAN
From release 1.3.1 R has partial support for AppleScripts. This means two things: you can run applescripts from inside R using the command applescript() (see the corresponding help)
But in my current version of R
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
neither applescript() nor ?applescript() returns anything.
Thanks, Simon

Those features aren't in modern R versions (IIRC they harken back to pre-macOS/Mac OS X days).
However, the applescript() function performed no magic:
applescript <- function(script_source, extra_args = c()) {
script_source <- paste0(script_source, collapse = "\n")
tf <- tempfile(fileext = ".applescript")
on.exit(unlink(tf), add=TRUE)
cat(script_source, file = tf)
osascript <- Sys.which("osascript")
args <- c(extra_args, tf)
system2(
command = osascript,
args = args,
stdout = TRUE
) -> res
invisible(res)
}
So you can do anything with it, like open a folder:
applescript(
sprintf(
'tell app "Finder" to open POSIX file "%s"',
Sys.getenv("R_DOC_DIR")
)
)
or, query an app and return data:
res <- applescript('
tell application "iTunes"
set r_name to name of current track
set r_artist to artist of current track
end
return "artist=" & r_artist & "\ntrack=" & r_name
')
print(res)
## [1] "artist=NICO Touches the Walls" "track=Hologram"
For (mebbe) easier usage (I say "mebbe" as the pkg relies on reticulate for some things) I added this to the macthekinfe macOS-centric R package.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

tokenizing a list doesn't work with UTF8 - r

Related

read scientific format numbers from .csv file

Error in Running NLRX (NetLogo) in Manjaro (Arch) Linux

How to use custom font family in UpsetR plots?

How is the correct use of stemDocument?

Calling applescript in R

Categories

Resources