I read into R a CSV file with large numbers formatted in scientific notation.
I used a couple of statistical R functions (MSE and RMSPE) on the numbers and got an incorrect answer (I checked it in Excel).
When I changed the format in the CSV file to ordinary number format, i.e. with lots of zeroes, the R functions calculated correctly.
What was I doing wrong?
Thanks for any insights,
Claire
UPDATE: console output added. I am using R 4.0.2. I have imported two CSV files: one called MPERRORS.csv with the original scientific notation format, and a second called CBERRORS.csv saved in number format. I believe the issue has to do with how Excel converts numbers stored in scientific notation format.
Code is below, and I have also pasted in the results. If you look at the number 6.89E+11, it shows as 689000000000 in the formula bar, but if you convert it to number format you get 689116020736. Apologies if this is wrong; I am a newbie with minimal R experience, as you will have guessed.
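A quick check in R, using the two representations of the first value quoted above, shows the size of the discrepancy:

# First 1990 value as stored in each CSV file (values from the dput output below)
sci  <- 6.89e+11        # MPERRORS.csv, scientific notation (3 significant figures)
full <- 689116020736    # CBERRORS.csv, plain number format
sci - full
# [1] -116020736   roughly 1e8 of each observation is rounded away,
# so the two files genuinely contain different numbers and the metrics differ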
CLAIRE
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
MPERRORS results
3.359375e+20 MSE function for MPERRORS
3.359375e+20 mse function for MPERRORS
0.01878106 RMSPE function
0.991949 R2_Score function
0.9916312 gofRSq function
CBERRORS results
2.94363e+20 MSE
2.94363e+20 mse
0.01805762 RMSPE
0.9929211 R2_Score
version
library(ineq)
library(Metrics)    # mse()
library(MLmetrics)  # MSE(), RMSPE(), R2_Score()
library(ehaGoF)     # gofRSq()

# ERRORS1: original scientific-notation export; ERRORS2: plain number format
ERRORS1 <- read.csv(file = 'MPerrors.csv')
ERRORS2 <- read.csv(file = 'CBerrors.csv')
str(ERRORS1)
str(ERRORS2)

hist1 <- ERRORS1[, 2]   # real GDP at market prices
base1 <- ERRORS1[, 3]   # business-as-usual series
print(hist1)
dput(head(ERRORS1, 10))
MSE(base1, hist1)
mse(base1, hist1)
RMSPE(base1, hist1)
R2_Score(base1, hist1)
gofRSq(base1, hist1, dgt = 7)

hist2 <- ERRORS2[, 2]
base2 <- ERRORS2[, 3]
print(hist2)
dput(head(ERRORS2, 10))
MSE(base2, hist2)
mse(base2, hist2)
RMSPE(base2, hist2)
R2_Score(base2, hist2)
gofRSq(base2, hist2, dgt = 7)
# MPERRORS, first 10 lines
structure(list(
  Time..Year. = 1990:1999,
  real.gdp.at.market.prices = c(6.89e+11, 7.51e+11, 7.27e+11, 7.55e+11,
    7.85e+11, 7.99e+11, 8.53e+11, 8.95e+11, 9.67e+11, 1.02e+12),
  X..BusinessAsUsual = c(6.79e+11, 7.25e+11, 7.31e+11, 7.66e+11, 7.76e+11,
    7.86e+11, 8.26e+11, 8.84e+11, 9.56e+11, 1.01e+12),
  Diff = c(9.93e+09, 2.54e+10, -4.32e+09, -1.05e+10, 9.4e+09, 1.36e+10,
    2.7e+10, 1.02e+10, 1.13e+10, 1.49e+10)
), row.names = c(NA, 10L), class = "data.frame")

# CBERRORS, first 10 lines
structure(list(
  Time..Year. = 1990:1999,
  real.gdp.at.market.prices = c(689116020736, 750739980288, 726938025984,
    755445989376, 785442996224, 799333023744, 852837007360, 894628003840,
    966879019008, 1021999972352),
  X..BusinessAsUsual = c(679182532608, 725334294528, 731261042688,
    765934698496, 776039104512, 785780506624, 825845153792, 884472348672,
    955611414528, 1007061172224),
  Diff = c(9.93e+09, 2.54e+10, -4.32e+09, -1.05e+10, 9.4e+09, 1.36e+10,
    2.7e+10, 1.02e+10, 1.13e+10, 1.49e+10)
), row.names = c(NA, 10L), class = "data.frame")
I'm reading a bib file extracted from Google Scholar with the BIB <- bibtex::read.bib("file.bib") command, and this creates a list object. If I use paste(BIB) or as.character(BIB), the console shows lines like this for every item in the list:
"list(title = "A Lealdade no Sistema Financeiro Portugu{\\^e}s", author = list(list(given = c("Francisco", "José", "dos", "Santos", "Mota", "Ferreira"), family = "Guerra", role = NULL, email = NULL, comment = NULL)), year = "2017", school = "Universidade de Coimbra")"
And if I use print() it shows:
Guerra FJdSMF (2017). A Lealdade no Sistema Financeiro Português. Ph.D. thesis,
Universidade de Coimbra.
I need to extract the second kind into a new character string, but no command I try works. I've tried A <- paste(print(BIB)), A <- as.character(print(BIB)), and just A <- print(BIB); I just get the first kind of line or an identical object.
I have already tried opening the same file with bib2df::bib2df(), but it has some problems with the encoding and with the data frame's columns and rows.
Try format(BIB). For example:
bib <- read.bib( package = "bibtex" )
x <- format(bib)
x
# [1] "R Development Core Team (2009). _R: A Language and Environment for\nStatistical Computing_. R Foundation for Statistical Computing, Vienna,\nAustria. ISBN 3-900051-07-0, <http://www.R-project.org>."
I found this by looking at class(BIB), seeing "bibentry", and then listing all methods that recognize that object with methods(class = "bibentry"); format seemed like a good candidate.
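To illustrate that discovery workflow (a sketch; the exact method list depends on your R version and loaded packages):

class(BIB)
# [1] "bibentry"
methods(class = "bibentry")
# among them format.bibentry, print.bibentry, sort.bibentry, toBibtex.bibentry;
# format() is the one that returns the formatted entries as a character vector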
I am attempting to run an nlrx simulation on Manjaro Linux (RNetLogo wouldn't work either, for some reason), and I am running into the following error when attempting to set up a dummy experiment:
cp: cannot stat '~/.netlogo/NetLogo 6.1.1/netlogo-headless.sh': No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sh: /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
Error in util_gather_results(nl, outfile, seed, siminputrow) :
Temporary output file /tmp/Rtmpj15Yf7/nlrx5493_1365385ab03157.csv not found. On unix systems this can happen if the default system temp folder is used.
Try reassigning the default temp folder for this R session (unixtools package).
In addition: Warning message:
In system(NLcall, wait = TRUE) : error in running command
Given that I am running R 4.0.0, the unixtools package doesn't work, so that's out of the question. How would I go about fixing this?
Code for those curious:
library(nlrx)
netlogopath <- file.path("~/.netlogo/NetLogo 6.1.1")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("/home/out")
nl <- nl(nlversion = "6.0.3",
nlpath = netlogopath,
modelpath = modelpath,
jvmmem = 1024)
nl@experiment <- experiment(expname = "wolf-sheep",
                            outpath = outpath,
                            repetition = 1,
                            tickmetrics = "true",
                            idsetup = "setup",
                            idgo = "go",
                            runtime = 50,
                            evalticks = seq(40, 50),
                            metrics = c("count sheep", "count wolves", "count patches with [pcolor = green]"),
                            variables = list('initial-number-sheep' = list(min = 50, max = 150, qfun = "qunif"),
                                             'initial-number-wolves' = list(min = 50, max = 150, qfun = "qunif")),
                            constants = list("model-version" = "\"sheep-wolves-grass\"",
                                             "grass-regrowth-time" = 30,
                                             "sheep-gain-from-food" = 4,
                                             "wolf-gain-from-food" = 20,
                                             "sheep-reproduce" = 4,
                                             "wolf-reproduce" = 5,
                                             "show-energy?" = "false"))
nl@simdesign <- simdesign_lhs(nl = nl,
                              samples = 100,
                              nseeds = 3,
                              precision = 3)
results <- run_nl_all(nl = nl)
R Version for those who may want it:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
In case others find this helpful: I have encountered similar errors as a result of file path misspecification. For instance, double-check the model path; you may need to drop app/.
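A minimal sketch of that fix (hypothetical paths; verify against your own installation). Note also that the nl object in the question sets nlversion = "6.0.3" while the path points at NetLogo 6.1.1; the two should agree:

library(nlrx)

netlogopath <- file.path("~/.netlogo/NetLogo 6.1.1")
# models/ directly under the install root, without the app/ component
modelpath <- file.path(netlogopath, "models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
file.exists(modelpath)  # check this is TRUE before building the nl object

nl <- nl(nlversion = "6.1.1",  # match the installed version
         nlpath = netlogopath,
         modelpath = modelpath,
         jvmmem = 1024)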
I extract some data from an Oracle DB to do some text mining. My data is UTF-8, and vocab can't handle it.
library(text2vec)
library(DBI)

Sys.setenv(TZ = "+03:00")
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "user", "pass", dbname = "IP:port/servicename")
list <- dbGetQuery(con, statement = "select * from test")

it_list <- itoken(list$FNAME,
                  preprocessor = tolower,
                  tokenizer = word_tokenizer,
                  ids = list$ID,
                  progressbar = FALSE)
vocab <- create_vocabulary(it_list, ngram = c(ngram_min = 1L, ngram_max = 2L))
but only English words end up in vocab.
The list object is available at this link (it can be loaded with load()).
I use Windows.
R.version:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string Oracle Distribution of R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
Thanks for reporting. This is actually an issue with base::strsplit(), which is used for basic tokenization.
I suggest using the stringi package for regex with strong UTF-8 support, or simply the tokenizers package, which is a good solution for tokenization built on top of stringi.
For example, you can use tokenizers::tokenize_words as a drop-in replacement for word_tokenizer:
tokenizers::tokenize_words("پوشاک بانک لي ")
# "پوشاک" "بانک" "لي"
For some reason base::strsplit() doesn't consider these Arabic symbols "alphanumeric" ([[:alnum:]]):
strsplit("i was. there", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# "i" "was" "there"
strsplit("پوشاک بانک لي ", "\\W") %>% lapply(function(x) x[nchar(x) > 0])
# character(0)
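Plugged into the pipeline from the question, only the tokenizer argument needs to change (a sketch, assuming the same list data frame as above):

library(text2vec)
library(tokenizers)

it_list <- itoken(list$FNAME,
                  preprocessor = tolower,
                  tokenizer = tokenizers::tokenize_words,  # UTF-8 aware
                  ids = list$ID,
                  progressbar = FALSE)
vocab <- create_vocabulary(it_list, ngram = c(ngram_min = 1L, ngram_max = 2L))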
I am comparing the performance of R and Apache Spark on a local machine, and R seems to be doing much better. Is that because I am not using a cluster, or am I doing something wrong?
Create data (create_data.R):
# Build a CSV of random dates, IDs, sales and categories
options <- commandArgs(trailingOnly = TRUE)
rows <- as.numeric(options[1])
perday <- 365 / (rows - 1) * 6
dates <- seq(as.Date("2010-01-01"), as.Date("2015-12-31"), by = perday)
rows <- length(dates)
ids <- sample(paste0("ID", 1:10000), rows, replace = TRUE)
sales <- rpois(rows, 50)
categories <- sample(paste("Category", sprintf("%02d", 1:10)), rows, replace = TRUE)
data <- data.frame(dates, ids, sales, categories)
write.csv(data, "/home/phil/performance/data.csv", row.names = FALSE)
Test R (cut.R):
suppressMessages(suppressWarnings(require(dplyr, quietly = TRUE)))
data <- read.csv("data.csv")
# Earliest date (lowest ID within it) = first purchase
first_purchase <- head(data[order(data$dates, data$ids), ], 1)
print(first_purchase)
Test Spark (cut.py):
from pyspark import SparkContext
sc = SparkContext("local")
rdd = sc.textFile("data.csv", 2)
# Get rid of header
header = rdd.take(1)[0]
rdd = rdd.filter(lambda line: line != header)
rdd = rdd.map(lambda line: line.split(","))
first_purchase = rdd.takeOrdered(1, lambda x: [x[0],x[1]])[0]
print(first_purchase)
Run complete test (run_tests.sh):
echo "Creating data"
Rscript create_data.R 5000000
wc -l data.csv
echo "Testing R"
time Rscript cut.R
echo "Testing Spark"
time spark-submit cut.py
Output of the tests:
$ . run_tests.sh
Creating data
5000001 data.csv
Testing R
dates ids sales categories
1264 2010-01-01 ID10 60 Category 01
real 0m12.689s
user 0m12.498s
sys 0m0.187s
Testing Spark
[u'2010-01-01', u'"ID10"', u'60', u'"Category 01"']
real 0m17.029s
user 0m7.388s
sys 0m0.392s
I am running this on Ubuntu in a VirtualBox with Windows 7 as the host system, if that makes a difference.
Spark is a distributed computing framework, and its model is to break the work down into pieces (tasks); those tasks are scheduled, serialized, and shipped based on the DAG derived from the dependencies between the functional transformations defined on the RDD.
All that machinery comes with an overhead cost, even in local mode. Compared to that, it is not unexpected that R, which was designed for single-node execution, works faster.
Try the same comparison on a cluster... oh... wait... R only runs on a single node (but not for long anymore).
I have the following code:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip)
Which produces the following error:
Error in fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + :
Expected sep (',') but new line, EOF (or other non printing character) ends field 14 on line 1003 when detecting types: 10066652 Транспорт Автомобили с пробегом Nissan R Nessa, 1998 Тарантас в отличном состоянии. на прошлой неделе возили на тех. Обслуживание. В дорожных неприятностях не был участником. Детали кузова без коцок и терок. Предназначалась для поездок на природу, Отдам только в добрые руки. В салон не поставлю не звоните "{""Марка"":""Nissan"", ""Модель"":""R Nessa"", ""Год выпуска"":""1998"", ""Пробег"":""180 000 - 189 999"", ""Тип кузова"":""Минивэн"", ""Цвет"":""Оранжевый"", ""Объём двигателя"":""2.4"", ""Коробка передач"":""Механическая
I have tried changing it to this:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + 2))
which is based on what I read in a similar question, skip and autostart in fread.
However, it produces an error similar to the one above.
How can I skip the first 1000 rows and read the next thousand? My expected output is 1000 rows in total: skipping the first thousand rows of the file and reading the second thousand.
Note: reading the file with raw_test <- fread("avito_test.tsv", nrows = 1000, skip = -1) works well for getting only the first thousand rows, but I am trying to get only the second thousand.
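One workaround sketch (an assumption on my part, not verified against this file): since reading from the top works, read the first 2000 rows in a single call, which sidesteps the type detection that fails when skipping into the file, then drop the first thousand:

library(data.table)
raw_first2k <- fread("avito_test.tsv", nrows = 2000, skip = -1)
raw_test <- raw_first2k[1001:2000]  # keep only the second thousand rows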
Edit: The data is publicly available at http://www.kaggle.com/c/avito-prohibited-content/data
Edit: Environment and package info:
> packageVersion("data.table")
[1] ‘1.9.3’
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)