I am trying to read a CSV (~18,000,000 rows, ~1,000 columns) into arrow (in R) with open_dataset, pre-specifying a schema. In some places the CSV was generated incorrectly and values don't match the intended schema (for example, some cells where the individual's age (int) should have been entered contain the individual's name (string) instead). I want every age value that can't be parsed as an integer to be set to NA.
The default behaviour of open_dataset is to throw the following error:
CSV conversion error to int8: invalid value
Is there a way to get a missing value (NA) instead of an error when a value can't be parsed according to the schema?
Here is an example of code that generates the error:
library(tidyverse)
library(arrow)
#Write csv
tibble(age = c(1,2,"StackOverflow",5)) %>%
write_csv("example.csv")
#Read the csv
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1) %>%
collect()
I know that I can specify the null_values inside the CsvConvertOptions if I know them in advance, as follows:
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1,
convert_options = CsvConvertOptions$create(null_values = "StackOverflow")) %>%
collect()
However, this feels pretty inefficient: since I don't know the mistakes a priori, it seems I would need to go through the data twice (once to find the offending values and once to read the data with the schema and the right null_values).
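The desired outcome can be illustrated in base R on the toy column. This is only a sketch of the intended semantics on a small in-memory vector, not an arrow solution: it assumes the column could first be read as strings, and raw_age is a made-up name.

```r
# Toy version of the column: every value arrives as text, as in the CSV.
raw_age <- c("1", "2", "StackOverflow", "5")

# as.integer() yields NA (with a warning) for strings that cannot be parsed
# as numbers; suppressWarnings() silences that expected warning.
# Caveat: a string such as "1.5" is truncated to 1 rather than turned into NA.
age <- suppressWarnings(as.integer(raw_age))

print(age)
```

Whether an equivalent lenient conversion is available inside open_dataset itself is exactly what the question asks.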
I created a table in Oracle like
Create table t1
(id_record NUMERIC GENERATED AS IDENTITY START WITH 500000 NOT NULL,
col1 numeric(2,0),
col2 varchar(10),
primary key(id_record))
where id_record is an identity column whose value is generated automatically when rows are appended to the table.
I create a data.frame in R with 2 columns (table_in_R <- data.frame(col1, col2)). I omit the values of the data frame for simplicity.
When I append data from R to Oracle db using the following code
dbWriteTable(con, 't1', table_in_R,
append =T, row.names=F, overwrite = F)
where con is a connection object, the error ORA-00947 is raised and no data is appended.
When I modify the code slightly (append = FALSE, overwrite = TRUE):
dbWriteTable(con_dwh, 't1', table_in_R,
append =FALSE, row.names=F, overwrite = TRUE)
the data is appended, but the identity column id_record is dropped.
How can I append data to Oracle db without dropping the identity column?
I'd never (based on this answer) recommend this one-step approach, where dbWriteTable directly maintains the target table.
Instead I'd recommend a two-step approach, where the R part fills a temporary table (with overwrite = T, i.e. DROP and CREATE):
df <- data.frame(col1 = c(as.integer(1),as.integer(0)), col2 = c('x',NA))
dbWriteTable(jdbcConnection,"TEMP", df, rownames=FALSE, overwrite = TRUE, append = FALSE)
In the second step you simply add the new rows to the target table using
insert into t1(col1,col2) select col1,col2 from temp;
You may run it directly on a database connection or from R:
res <- dbSendUpdate(jdbcConnection,"insert into t1(col1,col2) select col1,col2 from temp")
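The two-step approach above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the Oracle setup itself: an in-memory SQLite database (via the DBI and RSQLite packages) stands in for Oracle, AUTOINCREMENT stands in for the identity column, and the staging table is named stage rather than temp.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for Oracle here.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Target table with an auto-generated key (SQLite's counterpart of an identity column).
dbExecute(con, "CREATE TABLE t1 (
  id_record INTEGER PRIMARY KEY AUTOINCREMENT,
  col1 INTEGER,
  col2 TEXT)")

df <- data.frame(col1 = c(1L, 0L), col2 = c("x", NA), stringsAsFactors = FALSE)

# Step 1: (re)create the staging table from the data frame.
dbWriteTable(con, "stage", df, overwrite = TRUE)

# Step 2: copy only the data columns; the database fills id_record itself.
dbExecute(con, "INSERT INTO t1 (col1, col2) SELECT col1, col2 FROM stage")

result <- dbGetQuery(con, "SELECT * FROM t1")
dbDisconnect(con)
print(result)
```

Because the INSERT names only col1 and col2, the key column never passes through R, which is the point of the staging step.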
Note there is a workaround anyway:
Define the identity column as
id_record NUMERIC GENERATED BY DEFAULT ON NULL AS IDENTITY
With this configuration the identity column supplies the next sequence value in place of a NULL, but you will still run into the above-linked problem of inserting NULL into a NUMBER column.
So the second trick is to use a character NA in the data.frame: add the identity column to your data.frame and fill it entirely with as.character(NA).
df <- data.frame(id_record =c(as.character(NA),as.character(NA) ), col1 = c(as.integer(1),as.integer(0)), col2 = c('x',NA))
dbWriteTable(jdbcConnection,"T1", df, rownames=FALSE, overwrite = F, append = T)
This works fine in my test, but as mentioned I'd recommend the two-step approach.
I am using RJDBC and dbWriteTable to write a data.table into an existing SQL Server database table.
Here is my sample data: mtcars
After connecting to the DB, I use dbWriteTable to create a DB table "mtcars".
dbWriteTable(conn, "mtcars", mtcars[1:5, ])
Next I use append=T to insert two rows:
dbWriteTable(conn.pre.alg, "mtcars",mtcars[6:7, ], append = T)
Then I set a NA in a row:
mtcars[8, 2] = NA
I can insert the record without any problem.
dbWriteTable(conn.pre.alg, "mtcars",mtcars[8, ], append = T)
But when I set NA in two rows and try to insert both:
mtcars[9:10, 2] = NA
dbWriteTable(conn.pre.alg, "mtcars",mtcars[9:10, ], append = T)
I get an error:
Error in .local(conn, statement, ...) :
execute JDBC update query failed in dbSendUpdate (The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 4 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision.)
I tried to set field.types, but I still get the same error.
I use the sqldf package to build an SQLite database; my matrix is 2880 x 1951. When I write the table to the SQLite database, it fails with Error in rsqlite_send_query(conn@ptr, statement) : too many SQL variables. I read on the SQLite website that the number of variables is limited to 999. Is there a simple way to increase this limit?
Here is my syntax:
db <- dbConnect(SQLite(), dbname="xxx.sqlite")
bunch_vis <- read.csv("xxx.csv")
dbWriteTable(db, name = "xxx", value = xxx,
row.names = FALSE, header = TRUE)
and the output:
Error in rsqlite_send_query(conn@ptr, statement) : too many SQL variables
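If the limit is hit because the table has more columns (1951) than the 999 variables allowed per statement, which is an assumption about the cause, one workaround that avoids changing the limit is to store the matrix in long form: one row per cell, three columns in total. The sketch below uses base R only, with a smaller random matrix as a hypothetical stand-in for the real data; the names wide, long, row_id, col_id, and value are made up.

```r
# Hypothetical stand-in for the wide data: 10 rows, 1951 columns.
wide <- matrix(rnorm(10 * 1951), nrow = 10, ncol = 1951)

# One row per matrix cell; as.vector() walks the matrix column by column,
# so row indices cycle fastest and column indices change slowest.
long <- data.frame(
  row_id = rep(seq_len(nrow(wide)), times = ncol(wide)),
  col_id = rep(seq_len(ncol(wide)), each = nrow(wide)),
  value  = as.vector(wide)
)

dim(long)
```

A three-column data frame like long needs only three variables per inserted row, at the cost of many more rows.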
Similar symptoms to "create a table from a dataframe using RODBC".
First Attempt
I'm connecting to ORACLE successfully with readOnly=FALSE, and can retrieve data using sqlQuery. I created a test table RTest with a single NUMBER(16,8) field called Value. I can insert into the table from R using:
sqlQuery(channel,"insert into RTest (value) values(25)")
So I appear to have WRITE permissions.
Next Attempt
Following several internet examples, I created an R data.frame named test with a single row and a single column named Value, and attempted:
sqlSave(channel, test, "RTest", fast=FALSE, append=TRUE)
and also tried it with safer=TRUE. I receive an Oracle error:
ORA-00922 - missing or invalid option. and RODBC error: "Could not SQLExecDirect 'CREATE TABLE DSN=...."
So, sqlSave appears to be attempting to create the table. I've added and deleted options: rownames=FALSE, colnames=FALSE, safer=TRUE/FALSE, varTypes=list, and varInfo=list to no avail.
Final Attempt
Next I deleted the table in ORACLE and used sqlSave with safer=TRUE, which should have created the table based on the data.frame structure, but I receive the same error.
Eventually, I need to periodically read data files and process in R then upload 100+ fields with millions of rows. Queries against an Oracle table will help with memory requirements to perform analyses and preclude reading the data into R from zipped files every time.
The help articles on inserting a data.frame into ORACLE via RODBC appear simple with default options. I thought this was a no-brainer, but this error keeps cropping up. Any clues?
New Information
Some progress:
1) Noted that the connection uses case="nochange", while ORACLE defaults to upper case for table names and field names. My sqlSave attempts used lower case. The error changed once I matched the case of the table name.
2) Discovered through trial and error that ORACLE requires a two-part name, USERNAME.TableName.
3) R numeric columns are double; the Oracle fields were defined as NUMERIC(16,8) [and later (34,17)], which should hold the necessary digits. R generated data type errors when trying to convert binary double, so I switched the Oracle fields to binary double and that error disappeared.
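The R-side preparation implied by these three points can be sketched as follows. It is a base R illustration only and does not touch the database; "MYUSER" is a placeholder schema name, and whether upper-casing the column names is appropriate depends on how the Oracle table was created (quoted mixed-case names would need to be matched exactly instead).

```r
# Data frame with the state names as a regular column.
Arrests2 <- data.frame(USArrests, State = rownames(USArrests),
                       row.names = NULL, stringsAsFactors = FALSE)

# Point 1: match Oracle's default upper-case identifiers.
names(Arrests2) <- toupper(names(Arrests2))

# Point 2: qualify the table name with the schema owner.
# "MYUSER" is a placeholder, not a real account.
qualified_name <- paste("MYUSER", "RTEST", sep = ".")

# Point 3: inspect the storage types R will hand to the driver.
print(sapply(Arrests2, typeof))
```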
Response to Information Requests and New Problem Description
See following session text. Note that the local server name was replaced with {DummyServerName}, and the user name with {UserName}.
**Create ORACLE table to match base R data frame USArrests**
## in Oracle create table to match data types in demo data.frame USArrests
SQL> CREATE TABLE "{UserName}"."RTEST"
2 ("Murder" BINARY_DOUBLE,
3 "Assault" DECIMAL,
4 "UrbanPop" DECIMAL,
5 "Rape" BINARY_DOUBLE,
6 "State" VARCHAR2(255 BYTE)
7 );
Table created.
SQL> desc RTEST
Name Null? Type
----------------------------------------- -------- ---------------
Murder BINARY_DOUBLE
Assault NUMBER(38)
UrbanPop NUMBER(38)
Rape BINARY_DOUBLE
State VARCHAR2(255)
**Connection Details**
## establish ORACLE connection in R
library(RODBC)
channel <- odbcConnect("{DummyServerName}", readOnly=FALSE, connection="TNS", case="nochange", believeNRows=FALSE)
odbcGetInfo(channel)
DBMS_Name DBMS_Ver Driver_ODBC_Ver Data_Source_Name Driver_Name
"Oracle" "11.02.0030" "03.52" "Default" "SQORA32.DLL"
Driver_Ver ODBC_Ver Server_Name
"11.02.0003" "03.80.0000" "{DummyServerName}"
The odbcGetInfo results indicate a successful connection.
**Insert manually into a table from R**
### manually insert value into table to show write capability
y <- sqlQuery(channel,"insert into {UserName}.RTEST values(25,26,27,28,'JUNK')")
y
character(0)
examine the table
x <- sqlQuery(channel,"select * from RTest")
x
Murder Assault UrbanPop Rape State
1 25 26 27 28 JUNK
So a manual insertion is successful. I can insert from R, but so far not from sqlSave.
Attempt to add value from data.frame using sqlSave
##use data frame from R examples
attach(USArrests)
The following objects are masked from USArrests (pos = 3):
Assault, Murder, Rape, UrbanPop
##capture the state names from the rownames to match ORACLE table structure
State<-rownames(USArrests)
Arrests2<-cbind(USArrests[,1:4],State)
typeof(Arrests2)
[1] "list"
class(Arrests2)
[1] "data.frame"
Now attempt to populate the ORACLE table using sqlSave...
sqlSave(channel,Arrests2,tablename="{UserName}.RTEST", fast=FALSE, safer=TRUE,rownames=FALSE)
Error in sqlSave(channel, Arrests2, tablename = "{UserName}.RTEST", fast = FALSE, :
table ‘{UserName}.RTEST’ already exists
**OK - safer=TRUE should have appended, but see if forcing an append works using append=TRUE instead**
sqlSave(channel,Arrests2,tablename="{UserName}.RTEST", fast=FALSE, append=TRUE, rownames=FALSE)
Error in sqlSave(channel, Arrests2, tablename = "{UserName}.RTEST", fast = FALSE, :
unable to append to table ‘{UserName}.RTEST’
**Now let R create the table**
##In ORACLE...
SQL> drop table RTEST;
Table dropped.
back in R - safer=TRUE should create the table if it does not exist
apply sqlSave using ?sqlSave defaults, omit rownames to match table structure
sqlSave(channel,Arrests2,tablename="{UserName}.RTEST", fast=FALSE, safer=TRUE,rownames=FALSE)
Error in sqlSave(channel, Arrests2, tablename = "{UserName}.RTEST", fast = FALSE, :
HY000 902 [Oracle][ODBC][Ora]ORA-00902: invalid datatype
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE {UserName}.RTEST
(DSN={DummyServerName}MurderDSN={DummyServerName} binary_double,
DSN={DummyServerName}AssaultDSN={DummyServerName} decimal,
DSN={DummyServerName}UrbanPopDSN={DummyServerName} decimal,
DSN={DummyServerName}RapeDSN={DummyServerName} binary_double,
DSN={DummyServerName}StateDSN={DummyServerName} varchar(255))'
**Follow the sqlSave example exactly - this should create table USArrests**
sqlSave(channel, USArrests, rownames = "state", addPK=TRUE)
Error in sqlSave(channel, USArrests, rownames = "state", addPK = TRUE) :
HY000 922 [Oracle][ODBC][Ora]ORA-00922: missing or invalid option
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE DSN={DummyServerName}USArrestsDSN={DummyServerName}
(DSN={DummyServerName}stateDSN={DummyServerName} varchar(255) NOT NULL PRIMARY KEY,
DSN={DummyServerName}MurderDSN={DummyServerName} binary_double,
DSN={DummyServerName}AssaultDSN={DummyServerName} decimal,
DSN={DummyServerName}UrbanPopDSN={DummyServerName} decimal,
DSN={DummyServerName}RapeDSN={DummyServerName} binary_double)'
**Perhaps the uname qualifier is needed in tablename - force the table names**
sqlSave(channel, USArrests, tablename="{UserName}.USARRESTS", rownames = "state", addPK=TRUE)
Error in sqlSave(channel, USArrests, tablename = "{UserName}.USARRESTS", :
HY000 902 [Oracle][ODBC][Ora]ORA-00902: invalid datatype
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE {UserName}.USARRESTS
(DSN={DummyServerName}stateDSN={DummyServerName} varchar(255) NOT NULL PRIMARY KEY,
DSN={DummyServerName}MurderDSN={DummyServerName} binary_double,
DSN={DummyServerName}AssaultDSN={DummyServerName} decimal,
DSN={DummyServerName}UrbanPopDSN={DummyServerName} decimal,
DSN={DummyServerName}RapeDSN={DummyServerName} binary_double)'
**Finally, keep the uname qualifier, force the table names, and use safer=TRUE, which should create the table**
sqlSave(channel, USArrests, tablename="{UserName}.USARRESTS", rownames = "state", addPK=TRUE, safer=TRUE)
Error in sqlSave(channel, USArrests, tablename = "{UserName}.USARRESTS", :
HY000 902 [Oracle][ODBC][Ora]ORA-00902: invalid datatype
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE {UserName}.USARRESTS
(DSN={DummyServerName}stateDSN={DummyServerName} varchar(255) NOT NULL PRIMARY KEY,
DSN={DummyServerName}MurderDSN={DummyServerName} binary_double,
DSN={DummyServerName}AssaultDSN={DummyServerName} decimal,
DSN={DummyServerName}UrbanPopDSN={DummyServerName} decimal,
DSN={DummyServerName}RapeDSN={DummyServerName} binary_double)'
Still confused. Closest success was with existing table and "safer=true" but the insert failed. Examining the sqlSave code that generates the error message shows that error occurs when sqlwrite returns -1, which I assume is a failure. Failure of sqlSave examples may indicate a local ORACLE