Sparklyr - Decimal precision 8 exceeds max precision 7 - r

I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235):
java.lang.IllegalArgumentException: requirement failed: Decimal
precision 8 exceeds max precision 7
data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)
It's a big data set, about 5.8 million of records, with my dataset I have data of types Int, num and chr.

I think you have a couple options depending on the spark version that you're using
Spark >=1.6.1
from here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html
it seems, you can specifically specify your schema to force it to use doubles
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema<- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
source = "csv", header="true", schema = csvSchema)
Spark < 1.6.1
consider test.csv
1,a,4.1234567890
2,b,9.0987654321
you can easily make this more efficient, but I think you get the gist
linesplit <- function(x){
tmp <- strsplit(x,",")
return ( tmp)
}
lineconvert <- function(x){
arow <- x[[1]]
converted <- list(as.integer(arow[1]), as.character(arow[2]),as.double(arow[3]))
return (converted)
}
rdd <- SparkR:::textFile(sc,'/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl,lineconvert)
ddf <- createDataFrame(sqlContext,ll2)
head(ddf)
_1 _2 _3
1 1 a 4.1234567890
2 2 b 9.0987654321
NOTE: the SparkR::: methods are private for a reason, the docs say 'be careful when you use this'

Related

How to remove warnings in sqldf when using update, delete or alter table

Below is a reproducible example with warnings. I have done some research and some say that RSQLITE version causes this but not sure which version, so is there any way to prevent these warnings in sqldf. Thanks in advance
(mt <- mtcars[1:5,1:5])
sqldf(c('update mt set cyl=5 where cyl>5', 'select * from mt'))
Warning message:
In result_fetch(res#ptr, n = n) :
SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
You can suppress warnings globally and then reset after your code has run:
How to suppress warnings globally in an R Script
This appears to work:
oldw <- getOption("warn")
options(warn = -1)
(mt <- mtcars[1:5,1:5])
sqldf(c('update mt set cyl=5 where cyl>5', 'select * from mt'))
options(warn = oldw)
This wrapper muffles that particular warning without interfering with others.
sqldf2 <- function(...) {
withCallingHandlers(sqldf(...), warning =
function(w) if (grepl("SQL statements must be issued with dbExecute", w))
invokeRestart("muffleWarning") else w)
}
sql <- c("update BOD set Time = 1 where Time = 2", "select * from BOD")
sqldf2(sql)
giving the following with no warning:
Time demand
1 1 8.3
2 1 10.3 <-- Time was updated from 2 to 1 on this line
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8

Read in large text file in chunks

I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large txt file "vectors.txt" (800mb) I'm trying to read into R. Having tried various methods I have failed to read in this vector to memory.
So, I was researching ways of reading it in in chunks. I know that the dim of the resulting data frame should be 300K * 300. If I was able to read in the file e.g. 10K lines at a time and then save each chunk as an RDS file I would be able to loop over the results and get what I need, albeit just a little slower with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html

Sparklyr: sdf_copy_to fails with 350 MB dataset

I'm facing a problem trying to write 2 dataset using sparklyr::spark_write_csv(). This is my configuration:
# Configure cluster
config <- spark_config()
config$spark.yarn.keytab <- "mykeytab.keytab"
config$spark.yarn.principal <- "myyarnprincipal"
config$sparklyr.gateway.start.timeout <- 10
config$spark.executor.instances <- 2
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.driver.memory <- "4G"
config$spark.kryoserializer.buffer.max <- "1G"
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf.cloudera.hdfs')
Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf.cloudera.yarn')
# Configure cluster
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.0')
Once the spark context is successfully created, I'm trying to save 2 datasets on hdfs using spark_write_csv(). As an intermediate step I need to transform the dataframe into a tbl_spark.
Unfortunately, I'm able to correctly save only the first one, meanwhile the second one (which is bigger but absolutely not big for hadoop standards i.e. 360 MB) takes a long time and finally crashes.
# load datasets
tmp_small <- read.csv("first_one.csv", sep = "|") # 13 MB
tmp_big <- read.csv("second_one.csv", sep = "|") # 352 MB
tmp_small_Spark <- sdf_copy_to(sc, tmp_small, "tmp_small", memory = F, overwrite = T)
tables_preview <- dbGetQuery(sc, "SHOW TABLES")
tmp_big_Spark <- sdf_copy_to(sc, tmp_big, "tmp_big", memory = F, overwrite = T) # fail!!
tables_preview <- dbGetQuery(sc, "SHOW TABLES")
It is probably a configuration problem but I can't figure it out.
This is the error:
|================================================================================| 100% 352 MB
Error in invoke_method.spark_shell_connection(sc, TRUE, class, method, :
No status is returned. Spark R backend might have failed.
Thanks
I was also having issues loading larger files. Try adding this to the spark connection config file:
config$spark.rpc.message.maxSize <- 512
It's a workaround, though.

subsetting data.cube inside custom function

I am trying to make a function of my own to subset a data.cube in R, and format the result automatically for some predefined plots I aim to build.
This is my function.
require(data.table)
require(data.cube)
secciona <- function(cubo = NULL,
fecha_valor = list(),
loc_valor = list(),
prod_valor = list(),
drop = FALSE){
cubo[fecha_valor, loc_valor, prod_valor, drop = drop]
## The line above will really be an asignment of type y <- format(cubo[...drop])
## Rest of code which will end up plotting the subset of the function
}
The thing is I keep on getting the error: Error in eval(expr, envir, enclos) : object 'fecha_valor' not found
What is most strange for me, is that on the console everything works fine, but not when defined inside the subsetting function of mine.
In console:
> dc[list(as.Date("2013/01/01"))]
> dc[list(as.Date("2013/01/01")),]
> dc[list(as.Date("2013/01/01")),,]
> dc[list(as.Date("2013/01/01")),list(),list()]
all give as result:
<data.cube>
fact:
5627 rows x 2 dimensions x 1 measures (0.32 MB)
dimensions:
localizacion : 4 entities x 3 levels (0.01 MB)
producto : 153994 entities x 3 levels (21.29 MB)
total size: 21.61 MB
But whenever I try
secciona(dc)
secciona(dc, fecha_valor = list(as.Date("2013/01/01")))
secciona(dc, fecha_valor = list())
I always get the error above mentioned.
Any ideas why this is happening? should I proceed in else way for my approach of editing the subset for plotting?
This is the standard issue that R users will face when dealing with non-standard evaluation. This is a consequence of Computing on the language R language feature.
[.data.cube function expects to be used in interactive way, that extends the flexibility of the arguments passed to it, but gives some restrictions. In that aspect it is similar to [.data.table when passing expressions from wrapper function to [ subset operator. I've added dummy example to make it reproducible.
I see you are already using data.cube-oop branch, so just to clarify for other readers. data.cube-oop branch is 92 commits ahead of master branch, to install use the following.
install.packages("data.cube", repos = paste0("https://", c(
"jangorecki.gitlab.io/data.cube",
"Rdatatable.github.io/data.table",
"cran.rstudio.com"
)))
library(data.cube)
set.seed(1)
ar = array(rnorm(8,10,5), rep(2,3),
dimnames = list(color = c("green","red"),
year = c("2014","2015"),
country = c("IN","UK"))) # sorted
dc = as.data.cube(ar)
f = function(color=list(), year=list(), country=list(), drop=FALSE){
expr = substitute(
dc[color=.color, year=.year, country=.country, drop=.drop],
list(.color=color, .year=year, .country=country, .drop=drop)
)
eval(expr)
}
f(year=list(c("2014","2015")), country="UK")
#<data.cube>
#fact:
# 4 rows x 3 dimensions x 1 measures (0.00 MB)
#dimensions:
# color : 2 entities x 1 levels (0.00 MB)
# year : 2 entities x 1 levels (0.00 MB)
# country : 1 entities x 1 levels (0.00 MB)
#total size: 0.01 MB
You can track the expression just by putting print(expr) before/instead eval(expr).
Read more about non-standard evaluation:
- R Language Definition: Computing on the language
- Advanced R: Non-standard evaluation
- manual of substitute function
And some related SO questions:
- Passing on non-standard evaluation arguments to the subset function
- In R, why is [ better than subset?

using package snow's parRapply: argument missing error

I want to find documents whose similarity between other doucuments are larger than a given value(0.1) by cutting documents into blocks.
library(tm)
data("crude")
sample.dtm <- DocumentTermMatrix(
crude, control=list(
weighting=function(x) weightTfIdf(x, normalize=FALSE),
stopwords=TRUE
)
)
step = 5
n = nrow(sample.dtm)
block = n %/% step
start = (c(1:block)-1)*step+1
end = start+step-1
j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))
ij <- cbind(i,j)
library(skmeans)
getdocs <- function(k){
ci <- c(start[k[[1]]]:end[k[[1]]])
cj <- c(start[k[[2]]]:end[k[[2]]])
combi <- sample.dtm[ci]
combj < -sample.dtm[cj]
rownames(combi)<-ci
rownames(combj)<-cj
comb<-c(combi,combj)
sim<-1-skmeans_xdist(comb)
cat("Block", k[[1]], "with Block", k[[2]], "\n")
flush.console()
tri.sim<-upper.tri(sim,diag=F)
results<-tri.sim & sim>0.1
docs<-apply(results,1,function(x) length(x[x==TRUE]))
docnames<-names(docs)[docs>0]
gc()
return (docnames)
}
It works well when using apply
system.time(rmdocs<-apply(ij,1,getdocs))
When using parRapply
library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))
Error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
Timing stopped at: 0.01 0 0.04
It seems that sample.dtm coundn't be used in parRapply. I'm confused. Can anyone help me? Thanks!
In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.
You can load packages on the cluster workers with the clusterEvalQ function:
clusterEvalQ(cl, { library(tm); library(skmeans) })
After doing that, rownames(combi)<-ci will work correctly.
Also, if you want to see the output from cat, you should use the makeCluster outfile argument:
cl <- makeCluster(2, outfile='')

Resources