Error reading parquet file using sparklyr library in R

I am trying to read a parquet file from the Databricks FileStore.
library(sparklyr)
# parquet_dir has been pre-defined
parquet_dir = '/dbfs/FileStore/test/flc_next.parquet'
# List the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)
"/dbfs/FileStore/test/flc_next.parquet/_committed_6244562942368589642"
[2] "/dbfs/FileStore/test/flc_next.parquet/_started_6244562942368589642"
[3] "/dbfs/FileStore/test/flc_next.parquet/_SUCCESS"
[4] "/dbfs/FileStore/test/flc_next.parquet/part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6769e6-925-1-c000.snappy.parquet"
# Show the filenames and their sizes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 4 × 2
filename size_bytes
<chr> <dbl>
1 _committed_6244562942368589642 124
2 _started_6244562942368589642 0
3 _SUCCESS 0
4 part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6… 248643
# Import the data into Spark
timbre_tbl <- spark_read_parquet("flc_next.parquet", parquet_dir)
Error : $ operator is invalid for atomic vectors
I would appreciate any help/suggestions.
Thanks in advance

The first argument of spark_read_parquet expects a Spark connection; see sparklyr::spark_connect. If you are running the code in Databricks, this should work:
sc <- spark_connect(method = "databricks")
timbre_tbl <- spark_read_parquet(sc, "flc_next.parquet", parquet_dir)
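For readability, the same call can use named arguments (a minimal sketch; in sparklyr's signature, name is the label the table is registered under in Spark and path points at the parquet directory):
timbre_tbl <- spark_read_parquet(sc, name = "flc_next", path = parquet_dir)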

Related

Character string limit reached early using fread from data.table, CSV lines look fine

I'm reading a moderately big CSV with fread, but it errors out with "R character strings are limited to 2^31-1 bytes". readLines works fine, however. Pinning down the faulty line (2956), I am not sure what's going on: it doesn't seem longer than the next one, for example.
Any thoughts about what is going on and how to overcome this, preferably checking the CSV in advance to avoid an fread error?
library(data.table)
options(timeout=10000)
download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv",
destfile = "trip_data.csv", mode = "wb")
dt = fread("trip_data.csv")
#> Error in fread("trip_data.csv"): R character strings are limited to 2^31-1 bytes
lines = readLines("trip_data.csv")
dt2955 = fread("trip_data.csv", nrows = 2955)
#> Warning in fread("trip_data.csv", nrows = 2955): Previous fread() session was
#> not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
dt2956 = fread("trip_data.csv", nrows = 2956)
#> Error in fread("trip_data.csv", nrows = 2956): R character strings are limited to 2^31-1 bytes
lines[2955]
#> [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"
lines[2956]
#> [1] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"
Created on 2022-02-12 by the reprex package (v2.0.1)
When trying to read part of the file (around 100k rows) I got:
Warning message:
In fread("trip_data.csv", skip = i * 500, nrows = 500) :
Stopped early on line 2958. Expected 18 fields but found 19. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9>>
After removing the malformed line I was able to read at least 100k rows:
# Start from an empty data.table and grow it chunk by chunk
da = data.table(check.names = FALSE)
for (i in 0:200) {
  print(i * 500)
  # Read 500 rows at a time; fill = TRUE pads rows with the wrong field count
  dt = fread("trip_data.csv", skip = i * 500, nrows = 500, fill = TRUE)
  da <- rbind(da, dt, use.names = FALSE)
}
str(da)
Classes ‘data.table’ and 'data.frame': 101000 obs. of 18 variables:
$ vendor_id : chr "" "CMT" "CMT" "CMT" ...
$ pickup_datetime : POSIXct, format: NA "2010-03-22 17:05:03" "2010-03-22 19:24:29" ...
$ dropoff_datetime : POSIXct, format: NA "2010-03-22 17:22:51" "2010-03-22 19:40:13" ...
$ passenger_count : int NA 1 1 1 3 1 1 1 1 1 ...
[...]
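A slightly more efficient variant of the same idea (a sketch; rbindlist() binds all chunks in one call instead of growing da inside the loop, and data.table's rbind is built on it anyway):
# Collect the 500-row chunks in a list, then bind them once
chunks <- lapply(0:200, function(i) {
  fread("trip_data.csv", skip = i * 500, nrows = 500, fill = TRUE)
})
da <- rbindlist(chunks, use.names = FALSE)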
Then you can read the file line by line, check the number of fields in each, and bind only the valid rows into a data table.
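For the advance check you asked about, here is a minimal sketch with base R's count.fields() (assuming 18 fields per line, as the fread warning above indicates):
# Count comma-separated fields on every line before calling fread()
n_fields <- count.fields("trip_data.csv", sep = ",")
# Lines deviating from the expected 18 fields are the troublemakers
which(n_fields != 18)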
Regards,
Grzegorz

Error Message when Opening a CSV File in R

I attempted to open a CSV file in R using RStudio, but got this warning message:
In readLines("persons.csv") : incomplete final line found on 'persons.csv'
Please, what is wrong with the file, and how can I fix it?
You can likely ignore this, as it probably still worked. Here is an example without a final newline, which gives that warning, and another with the final newline, which does not. Both worked.
cat("a,b\n1,2", file = "test1.csv")
read.csv("test1.csv")
## a b
## 1 1 2
## Warning message:
## In read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'test1.csv'
cat("a,b\n1,2\n", file = "test2.csv")
read.csv("test2.csv")
## a b
## 1 1 2
To address this try one of these:
Just ignore it as it probably worked.
Bring the file into a text editor and write it out again. That often eliminates the warning.
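Equivalently, rewrite it from R itself (a minimal sketch: readLines() reads the content despite the missing newline, and writeLines() terminates every line it writes, including the last):
txt <- readLines("persons.csv", warn = FALSE)  # warn = FALSE silences the message while reading
writeLines(txt, "persons.csv")                 # rewrites the file with a proper final newline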
Use readr::read_csv. The indicated argument eliminates many messages that are otherwise output by that command.
library(readr)
read_csv("test1.csv", show_col_types = FALSE)
## # A tibble: 1 x 2
## a b
## <dbl> <dbl>
## 1 1 2
Use data.table::fread. It won't give that message.
library(data.table)
fread("test1.csv", data.table = FALSE)
## a b
## 1 1 2
From the Windows cmd line use this (note dot)
echo. >> test1.csv
or under bash (no dot)
echo >> test.csv

Using code_to_plan and target(..., format = "fst") in drake

I really like using the code_to_plan function when constructing drake plans. I also really like using target(..., format = "fst") for big files. However, I am struggling to combine these two workflows. For example, if I have this _drake.R file:
# Data --------------------------------------------------------------------
data_plan = code_to_plan("code/01-data/data.R")
join_plan = code_to_plan("code/01-data/merging.R")
# Cleaning ----------------------------------------------------------------
cleaning_plan = code_to_plan("code/02-cleaning/remove_na.R")
# Model -------------------------------------------------------------------
model_plan = code_to_plan("code/03-model/model.R")
# Combine Plans
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
)
config <- drake_config(dplan)
This works fine when called with r_make(r_args = list(show = TRUE))
As I understand it, though, target can only be used within a drake_plan. If I try something like this:
dplan2 <- drake_plan(full_plan = target(dplan, format = "fst"))
config <- drake_config(dplan2)
I get an r_make error like this:
target full_plan
Error in fst::write_fst(x = value$value, path = tmp) :
Unknown type found in column.
In addition: Warning message:
You selected fst format for target full_plan, so drake will convert it from class c("drake_plan", "tbl_df", "tbl", "data.frame") to a plain data frame.
Error:
-->
in process 18712
See .Last.error.trace for a stack trace.
So ultimately my question is where does one specify special data formats for targets when you are using code_to_plan?
Edit
Using @landau's helpful suggestion, I defined this function:
add_target_format <- function(plan) {
  # Get a list of named commands.
  commands <- plan$command
  names(commands) <- plan$target
  # Turn it into a good plan.
  do.call(drake_plan, commands)
}
So that this would work:
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
) %>%
  add_target_format()
It is possible, but not convenient. Here is a workaround.
writeLines(
  c(
    "x <- small_data()",
    "y <- target(large_data(), format = \"fst\")"
  ),
  "script.R"
)
cat(readLines("script.R"), sep = "\n")
#> x <- small_data()
#> y <- target(large_data(), format = "fst")
library(drake)
# Produces a plan, but does not process target().
bad_plan <- code_to_plan("script.R")
bad_plan
#> # A tibble: 2 x 2
#> target command
#> <chr> <expr>
#> 1 x small_data()
#> 2 y target(large_data(), format = "fst")
# Get a list of named commands.
commands <- bad_plan$command
names(commands) <- bad_plan$target
# Turn it into a good plan.
good_plan <- do.call(drake_plan, commands)
good_plan
#> # A tibble: 2 x 3
#> target command format
#> <chr> <expr> <chr>
#> 1 x small_data() <NA>
#> 2 y large_data() fst
Created on 2019-12-18 by the reprex package (v0.3.0)

R: trouble assigning values to a dynamic variable in a dataframe

I am trying to assign values to a dataframe column chosen by the user. The user specifies the name of the column, let's call it x, in the dataframe df. For simplicity, I want to assign a value of 3 to everything in that column. The simplified code is:
variableName <- paste("df$", x, sep="")
eval(parse(text=variableName)) <- 3
But I get an error:
Error in file(filename, "r") : cannot open the connection
In addition: Warning message:
In file(filename, "r") :
cannot open file 'df$x': No such file or directory
I've tried all kinds of remedies to no avail. If I simply try to print the values of the column:
eval(parse(text=variableName))
I get no errors and it prints out ok. It's only when I try to give that column a value that I get the error. Any help would be appreciated.
I believe the issue is that there is no way to use the result of eval() on the LHS of an assignment.
df = data.frame(foo = 1:5,
                bar = -3)
x = "bar"
variableName <- paste("df$", x, sep="")
eval(parse(text=variableName)) <- 3
#> Warning in file(filename, "r"): cannot open file 'df$bar': No such file or
#> directory
#> Error in file(filename, "r"): cannot open the connection
## This error is a bit misleading. Breaking it apart I get a different error.
eval(expression(df$bar)) <- 3
#> Error in eval(expression(df$bar)) <- 3: could not find function "eval<-"
## And it works if you put it all in the string to be parsed.
ex1 <- paste0("df$", x, "<-3")
eval(parse(text=ex1))
df
#> foo bar
#> 1 1 3
#> 2 2 3
#> 3 3 3
#> 4 4 3
#> 5 5 3
## But I doubt that's the best way to do it!
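Indeed, the idiomatic alternative (a suggestion beyond the code above) is to skip eval()/parse() entirely and index the column by name with [[, which accepts a character string:
df <- data.frame(foo = 1:5, bar = -3)
x <- "bar"
df[[x]] <- 3   # `[[` takes the column name as a string, so no parsing is needed
df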

Sparklyr - Decimal precision 8 exceeds max precision 7

I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235):
java.lang.IllegalArgumentException: requirement failed: Decimal
precision 8 exceeds max precision 7
data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)
It's a big dataset, about 5.8 million records; the columns are of types int, num, and chr.
I think you have a couple of options, depending on the Spark version you're using.
Spark >= 1.6.1
From here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html
it seems you can explicitly specify your schema to force it to use doubles:
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema<- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
source = "csv", header="true", schema = csvSchema)
Spark < 1.6.1
Consider test.csv:
1,a,4.1234567890
2,b,9.0987654321
You can easily make this more efficient, but I think you get the gist:
linesplit <- function(x) {
  tmp <- strsplit(x, ",")
  return(tmp)
}
lineconvert <- function(x) {
  arow <- x[[1]]
  converted <- list(as.integer(arow[1]), as.character(arow[2]), as.double(arow[3]))
  return(converted)
}
rdd <- SparkR:::textFile(sc,'/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl,lineconvert)
ddf <- createDataFrame(sqlContext,ll2)
head(ddf)
_1 _2 _3
1 1 a 4.1234567890
2 2 b 9.0987654321
NOTE: the SparkR::: methods are private for a reason; the docs say "be careful when you use this".
