Collecting a sparklyr Long Integer into an R Dataframe

How do you collect a Spark table into R using sparklyr while preserving long integers without having to convert them to strings beforehand?
My understanding is that R (via the bit64 package) has an integer64 type that can handle large integer values. Spark handles such values with its LongType, but when I collect a Spark table containing a Long I get a double on the R side. The issue with doubles is that they can lose precision.
I have attached an image to show the discrepancy that happens when collecting the data frame. If I convert the value into a string, it is collected perfectly. But if I don't, it turns into a double and then loses precision.
I was wondering if there were some Spark configs or some options I have to set somewhere to get sparklyr to collect it as an integer64.
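As a point of reference, here is a minimal sketch of the string-based workaround described above, paired with bit64 on the R side; the connection, the table name and the LongType column id are hypothetical placeholders, not part of the original question:
library(sparklyr)
library(dplyr)
library(bit64)
sc <- spark_connect(master = "local")
my_tbl <- tbl(sc, "some_table")           # hypothetical Spark table reference
result <- my_tbl %>%
  mutate(id_chr = as.character(id)) %>%   # cast the Long to a string inside Spark
  select(-id) %>%
  collect() %>%                           # character values survive collection intact
  mutate(id = as.integer64(id_chr)) %>%   # convert to integer64 on the R side
  select(-id_chr)
This avoids the double conversion entirely, at the cost of an extra cast per row.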

Related

ROracle fetch large integer values

When using ROracle to fetch data from a database, I am running into an issue trying to fetch large integers (up to 21 digits). The database column has the format NUMBER(38,0).
Fetching them through a simple SELECT does not work; the numbers get garbled from the 12th digit on.
I can circumvent this by converting them to characters (to_char(COLUMN_NAME)), but this is far from ideal.
A solution from an Oracle forum that converts to binary double (cast(COLUMN_NAME as binary_double)) does not work in my case.
Do you have a hint towards data types to use?
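For what it's worth, a sketch of the to_char() workaround mentioned above combined with bit64, so the values end up as integer64 rather than character in R; the connection details and names are placeholders. Note that integer64 only covers about 19 digits, so values using the full 21 positions would still need to stay as character (or use an arbitrary-precision type):
library(ROracle)
library(bit64)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "user", password = "pass", dbname = "db")  # placeholders
# Pull the NUMBER(38,0) column as text so nothing is rounded in transit,
# then convert on the R side.
res <- dbGetQuery(con, "SELECT to_char(COLUMN_NAME) AS big_id FROM MY_TABLE")
res$big_id <- as.integer64(res$big_id)   # exact only up to ~9.2e18 (19 digits)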

Is there a size limit in DataBricks for converting an R dataframe to a Spark dataframe?

I am new to Stack Overflow and have tried many ways to solve this error, without any success. My problem: I CAN convert subsets of an R dataframe to a Spark dataframe, but not the whole dataframe. Similar, but not identical, questions include:
Not able to to convert R data frame to Spark DataFrame and
Is there any size limit for Spark-Dataframe to process/hold columns at a time?
Here some information about the R dataframe:
library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"
dim(df)
[1] 101368 25
class(df)
[1] "data.frame"
When converting this to a Spark Dataframe:
sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) : Error in handleErrors(returnStatus, conn) :
Error in handleErrors(returnStatus, conn) :
However, when I subset the R dataframe, it does NOT result in an error:
sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])
class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"
How can I write the whole dataframe to a Spark DataFrame? (I want to saveAsTable afterwards).
I was thinking about a problem with capacity but I do not have a clue how to solve it.
Thanks a lot!!
In general you'll see poor performance when converting from R dataframes to Spark dataframes, and vice versa. Objects are represented differently in memory in Spark and R, and there is significant expansion of the object size when converting from one to the other. This often blows out the memory of the driver, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a couple of options.
Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert between the R and Spark representations. The link I provided has instructions on how to set this up on Databricks.
Write the dataframe to disk as parquet (or CSV) and then read it into Spark directly; you can use the arrow library in R to do this (a sketch follows after this list).
Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type (or ask your admin to do it) for your cluster; make sure you pick one with a lot of memory. For reference, I tested collecting a 2 GB dataset and needed a 30 GB+ driver. With Arrow that requirement comes down dramatically.
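A rough sketch of the parquet route from the list above, assuming the arrow package is available and the path is reachable both from R and from Spark (on Databricks the /dbfs FUSE mount versus the dbfs:/ URI distinction is assumed here, and the path itself is a placeholder):
library(arrow)
library(SparkR)
# Write the R data.frame to disk with arrow...
write_parquet(df, "/dbfs/tmp/df.parquet")
# ...then let Spark read the file directly, bypassing the in-memory conversion.
sdf <- read.df("dbfs:/tmp/df.parquet", source = "parquet")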
Anecdotally, there is a limit on the size of table that SparkR will convert between a DataFrame and a data.table, and it is memory-dependent. It is also far smaller than I would have expected, around 50,000 rows in my work.
I had to convert some very large data.tables to DataFrames and ended up writing a script to chunk them into smaller pieces to get around this (a sketch of the idea follows below). Initially I chunked by a fixed number of rows, but the error returned when a very wide table was converted, so my work-around was to cap the total number of elements (rows × columns) being converted per chunk.
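A sketch of that chunking idea, assuming SparkR 2.x where union() appends two SparkDataFrames; the chunk size is arbitrary and would need tuning:
library(SparkR)
chunk_size <- 10000                                   # rows per chunk; tune for your driver
splits <- split(df, ceiling(seq_len(nrow(df)) / chunk_size))
sdf_list <- lapply(splits, as.DataFrame)              # convert each chunk separately
sdf <- Reduce(union, sdf_list)                        # stack the pieces back together in Spark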

R - Converting and joining integer and numeric variables

I have a problem with a join in R. I've tried to create a reproducible example, but every one I've created works as intended, and I have no idea which part of the problem to recreate. The dput is too large to provide in full; is there a way I can attach a file?
It is a problem with joining on different data types, integer and numeric. Most of the join happens as expected, but some rows do not join. This was eventually solved by exporting the data to Excel, changing the offending numeric variable to the "Number" format with no decimal places, saving, and importing back into R, where it is now an integer.
Is there an R equivalent of this step? as.integer() and as.numeric() did not produce the same result as opening the file in Excel and converting.
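One R-side step worth trying, offered only as a sketch since the actual data is not available: Excel's "Number, 0 decimal places" format effectively rounds the values, whereas as.integer() truncates, so rounding the numeric key before coercing it may reproduce the Excel step (df_num, df_int and key are hypothetical names):
library(dplyr)
# note: as.integer() overflows above ~2.1e9; keep the key numeric or use bit64 for larger IDs
df_num <- df_num %>% mutate(key = as.integer(round(key)))   # round first, then coerce
joined <- inner_join(df_num, df_int, by = "key")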

general to number in R

I have data in Excel, and after reading it into R it appears as follows:
lob2 lob3
1.86E+12 7.58E+12
I want it as:
lob2 lob3
1857529190776.75 7587529190776.75
This difference causes me to get different results when doing my analysis later on.
How is the data stored in Excel (does it think it is a number, a string, a date, etc.)?
How are you getting the data from Excel to R? If you save the data as a .csv file and then read it into R, look at the intermediate file: Excel is known to abbreviate values when saving, and R would then see character strings instead of numbers. You need to find a way to tell Excel to export the data in the correct format with the correct precision.
If you are using a package (there is more than one), then look into the details of that package for how to grab the numbers correctly (you may need to make changes in Excel so that it knows they are numbers).
Lastly, what does the str function on your R object say? It could be that R is storing the proper numbers and only displaying the short version as mentioned in the comments. Or, it could be that R received strings that did not convert nicely to numbers and is storing them as characters or factors. The str function will let you see how your data is stored in R, and therefore how to convert or display it correctly.
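A short sketch of those checks, using a made-up value of the same magnitude; keep in mind that a double only carries about 15-16 significant digits, so values like 1857529190776.75 sit close to the limit of what can be stored exactly:
x <- 1857529190776.75           # hypothetical value, same order of magnitude
str(x)                          # shows whether it is stored as num or as chr/factor
print(x, digits = 16)           # prints the full stored value instead of 1.858e+12
format(x, scientific = FALSE)   # renders it without scientific notation
options(scipen = 999)           # or suppress scientific notation globally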

Are as.character() and paste() limited by the size of the numeric values they are given?

I'm running into some problems with the R function as.character() and paste(): they do not give back what they're being fed...
as.character(1415584236544311111)
## [1] "1415584236544311040"
paste(1415584236544311111)
## [1] "1415584236544311040"
what could be the problem or a workaround to paste my number as a string?
update
I found that using the bit64 library allowed me to retain the extra digits I needed with the function as.integer64().
Remember that numbers are stored in a fixed number of bytes, based on the hardware you are running on. Can you show that your very big integer is treated properly by normal arithmetic operations? If not, you're probably trying to store a number too large for the number of bytes your R installation uses for integers. The number you see is just what could fit.
You could try storing the number as a double, which is technically less precise but can represent larger numbers using scientific notation.
EDIT
Consider the answers in long/bigint/decimal equivalent datatype in R which list solutions including arbitrary precision packages.
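To tie the update and the answer together, a short sketch: the literal is parsed as a double before as.character() or as.integer64() ever sees it, so the digits have to arrive as a string:
library(bit64)
as.integer64(1415584236544311111)     # still wrong: the double literal has already lost precision
as.integer64("1415584236544311111")   # correct: the digits arrive as a string
as.character(as.integer64("1415584236544311111"))
## [1] "1415584236544311111"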
