Julia: insert a DataFrame into a database via ODBC without a slow row-by-row solution

I want to insert a DataFrame into a table using ODBC.jl. The table already exists, and it seems that I can't use the function ODBC.load with it (even with append=true).
Usually I insert DataFrames with copyIn by loading the DataFrame as a CSV, but with plain ODBC it seems I can't do that.
The last thing I found is:
stmt = ODBC.prepare(dsn, "INSERT INTO cool_table VALUES(?, ?, ?)")
for row = 1:size(df, 1)
    ODBC.execute!(stmt, [df[row, x] for x = 1:size(df, 2)])
end
but this goes row by row, and it takes incredibly long to insert everything.
I also tried to do it myself, like this:
using Tables
using IterTools: imap   # imap comes from IterTools

_prepare_field(x::Any) = x
_prepare_field(x::Missing) = ""
_prepare_field(x::AbstractString) = string('\'', x, '\'')   # naive quoting, no escaping

row_names = join(string.(Tables.columnnames(table)), ",")
# each row must be parenthesized for a multi-row VALUES clause
row_strings = imap(Tables.eachrow(table)) do row
    "(" * join((_prepare_field(x) for x in row), ",") * ")"
end
query = "INSERT INTO $tableName($row_names) VALUES $(join(row_strings, ",\n"))"
@debug "Inserting $(nrow(df)) rows into $tableName"
DBInterface.execute(db.conn, query)
but it throws an error because of a "," not at the right place, at something like column 9232123, which I can't find because I have too many lines. I guess _prepare_field doesn't cover all the ways strings need to be escaped, but I can't find another way.
Did I miss something? Is there an easier way to do it?
Thank you
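One route worth trying (a sketch, assuming a recent ODBC.jl where connections are ODBC.Connection objects and the DBInterface API is available): DBInterface.executemany binds every row of the DataFrame against a single prepared statement in one call, instead of issuing one execute per row:

using ODBC, DBInterface, Tables

conn = ODBC.Connection("mydsn")   # hypothetical DSN name
stmt = DBInterface.prepare(conn, "INSERT INTO cool_table VALUES(?, ?, ?)")
# columntable turns the DataFrame into a NamedTuple of column vectors;
# executemany treats each index across those vectors as one parameter set
DBInterface.executemany(stmt, Tables.columntable(df))
DBInterface.close!(stmt)

Whether this is actually faster depends on how the driver implements it, so it is worth benchmarking against the explicit loop.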

Related

R, ClickHouse: Expected: FixedString(34). Got: UInt64: While processing

I am trying to query data from a ClickHouse database from R, filtering by a subset of IDs. Here is the example:
library(data.table)
library(RClickhouse)
library(DBI)
subset <- paste(traffic[, unique(IDs)][1:30], collapse = ',')
conClickHouse <- DBI::dbConnect('here is the connection')
DataX <- dbGetQuery(conClickHouse,
                    paste0("select * from database where IDs in (", subset, ")"))
As a result I get the error:
DB::Exception: Type mismatch in IN or VALUES section. Expected: FixedString(34).
Got: UInt64: While processing (IDs IN ....
Any help is appreciated
Thanks to the comment of @DennyCrane, this query subsets properly:
select * from database where toFixedString(IDs,34) in
(toFixedString(ID1, 34), toFixedString(ID2, 34))
https://clickhouse.tech/docs/en/sql-reference/functions/#strong-typing
Strong Typing
In contrast to standard SQL, ClickHouse has strong typing. In other words, it doesn’t make implicit conversions between types. Each function works for a specific set of types. This means that sometimes you need to use type conversion functions.
https://clickhouse.tech/docs/en/sql-reference/functions/type-conversion-functions/#tofixedstrings-n
select * from (select 'x' B ) where B in (select toFixedString('x',1))
DB::Exception: Types of column 1 in section IN don't match: String on the left, FixedString(1) on the right.
Use a cast, either toString or toFixedString:
select * from (select 'x' B ) where toFixedString(B,1) in (select toFixedString('x',1))
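The toString variant goes the other way, casting the FixedString side down to a plain String (same idea, illustrative only):
select * from (select 'x' B ) where B in (select toString(toFixedString('x',1)))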

How to read fixed width files using Spark in R

I need to read a 10GB fixed width file to a dataframe. How can I do it using Spark in R?
Suppose my text data is the following:
text <- c("0001BRAjonh ",
          "0002USAmarina ",
          "0003GBPcharles")
I want the first 4 characters to be associated with the column "ID" of a data frame; characters 5-7 with a column "Country"; and characters 8-14 with a column "Name".
I would use the function read.fwf if the dataset were small, but that is not the case.
I can read the file as a text file using the sparklyr::spark_read_text function, but I don't know how to map the values in the file to the columns of a data frame properly.
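A sketch of one way to do that mapping from R (assumes a local Spark connection; the file path is hypothetical). spark_read_text yields a single column named line, and substr inside a dplyr pipeline is translated to Spark SQL's 1-based substring:

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
raw <- spark_read_text(sc, name = "raw", path = "path/to/fixed_width.txt")
df  <- raw %>%
  transmute(ID      = substr(line, 1, 4),
            Country = substr(line, 5, 7),
            Name    = substr(line, 8, 14))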
EDIT: Forgot to say that substring starts at 1 and arrays start at 0, because reasons.
Going through and adding the code I talked about in the comment above.
The process is dynamic and is based on a Hive table called Input_Table. The table has 5 columns: Table_Name, Column_Name, Column_Ordinal_Position, Column_Start, and Column_Length. It is external, so any user can change it, or drop and remove any file in the folder location. I quickly built this from scratch so as not to post actual code; does everything make sense?
// Load the input DataFrame and the Hive table. For the Hive table we make sure
// to take only the correct columns, in the correct order.
import spark.implicits._

val inputDF = spark.read.format(recordFormat).option("header", "false")
  .load(folderLocation + "/" + tableName + "." + tableFormat)
  .rdd.toDF("Odd_Long_Name")
val inputSchemaDF = spark.sql("select * from Input_Table where Table_Name = '" + tableName + "'")
  .sort($"Column_Ordinal_Position")

// Build arrays from the columns; rdd.map(...).collect turns a DataFrame column
// into an array of strings that we can iterate through.
val columnNameArray = inputSchemaDF.selectExpr("Column_Name").rdd.map(x => x.mkString).collect
val columnStartArray = inputSchemaDF.selectExpr("Column_Start").rdd.map(x => x.mkString).collect
val columnLengthArray = inputSchemaDF.selectExpr("Column_Length").rdd.map(x => x.mkString).collect

// Make the iterators as well as other variables that are meant to be overwritten
var columnAllocationIterator = 1
var localCommand = ""
var commandArray = Array("")

// Loop over as many columns as the input table has
while (columnAllocationIterator <= columnNameArray.length) {
  // Overwrite the string command with the new command; "Odd_Long_Name" was too
  // accurate not to place into the code
  localCommand = "substring(Odd_Long_Name, " + columnStartArray(columnAllocationIterator - 1) +
    ", " + columnLengthArray(columnAllocationIterator - 1) + ") as " +
    columnNameArray(columnAllocationIterator - 1)
  // The first time through, overwrite the command array; otherwise append to it
  if (columnAllocationIterator == 1) {
    commandArray = Array(localCommand)
  } else {
    commandArray = commandArray ++ Array(localCommand)
  }
  // I really like iterating my iterators like this
  columnAllocationIterator = columnAllocationIterator + 1
}

// Run all elements of the string array independently against the table
val finalDF = inputDF.selectExpr(commandArray: _*)
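For the sample data in the question, Input_Table would hold rows like these (table name and values illustrative only):

Table_Name  Column_Name  Column_Ordinal_Position  Column_Start  Column_Length
sample      ID           1                        1             4
sample      Country      2                        5             3
sample      Name         3                        8             7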

Parameters and NULL

I'm having trouble passing NULL as an INSERT query parameter using RPostgres and RPostgreSQL:
In PostgreSQL:
create table foo (ival int, tval text, bval bytea);
In R:
This works:
res <- dbSendQuery(con, "INSERT INTO foo VALUES($1, $2, $3)",
                   params = list(ival = 1,
                                 tval = 'not quite null',
                                 bval = charToRaw('asdf')))
But this throws an error:
res <- dbSendQuery(con, "INSERT INTO foo VALUES($1, $2, $3)",
                   params = list(ival = NULL,
                                 tval = 'not quite null',
                                 bval = charToRaw('asdf')))
Using RPostgres, the error message is:
Error: expecting a string
Under RPostgreSQL, the error is:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: invalid input
syntax for integer: "NULL"
)
Substituting NA would be fine with me, but it isn't a work-around - a literal 'NA' gets written to the database.
Using e.g. integer(0) gives the same "expecting a string" message.
You can use NULLIF directly in your insert statement:
res <- dbSendQuery(con, "INSERT INTO foo VALUES(NULLIF($1, 'NULL')::integer, $2, $3)",
                   params = list(ival = NULL,
                                 tval = 'not quite null',
                                 bval = charToRaw('asdf')))
This works with NA as well.
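Separately, a typed NA may be worth a try: the DBI spec says drivers should translate NA to SQL NULL, and newer RPostgres releases do so when the NA carries a type (a sketch, not verified against every version):

res <- dbSendQuery(con, "INSERT INTO foo VALUES($1, $2, $3)",
                   params = list(ival = NA_integer_,   # typed NA, not bare NULL
                                 tval = 'not quite null',
                                 bval = charToRaw('asdf')))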
One option to work around not knowing how to express a NULL value in R that the PostgreSQL package can successfully translate is simply to not specify the column whose value you want to be NULL in the database.
So in your example you could use this:
res <- dbSendQuery(con, "INSERT INTO foo (tval, bval) VALUES($1, $2)",
                   params = list(tval = 'not quite null',
                                 bval = charToRaw('asdf')))
when you want ival to have a NULL value. This of course assumes that ival in your table is nullable, which may not be the case.
Thanks all for the help. Tim's answer is a good one, and I used it to catch the integer values. I went a different route for the rest of it, writing a function in PostgreSQL to handle most of this. It looks roughly like:
CREATE OR REPLACE FUNCTION add_stuff(ii integer, tt text, bb bytea)
RETURNS integer
AS
$$
DECLARE
    bb_comp bytea;
    rows integer;
BEGIN
    bb_comp = convert_to('NA', 'UTF8'); -- my database is in UTF8.
    -- front-end catches ii is NA; RPostgres blows up
    -- trying to convert 'NA' to integer.
    tt = nullif(tt, 'NA');
    bb = nullif(bb, bb_comp);
    INSERT INTO foo VALUES (ii, tt, bb);
    GET DIAGNOSTICS rows = ROW_COUNT;
    RETURN rows;
END;
$$
LANGUAGE plpgsql VOLATILE;
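Called from R with the 'NA' sentinels, it then does the right thing (illustrative call; parameter syntax as in the examples above, and charToRaw('NA') matches the bb_comp bytes):

res <- dbGetQuery(con, "SELECT add_stuff($1, $2, $3)",
                  params = list(1L, 'NA', charToRaw('NA')))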
Now to have a look at the RPostgres source and see if there's an easy-enough way to make it handle NULL / NA a bit more easily. Hoping that it's missing because nobody thought of it, not because it's super-tricky. :)
This will give the "wrong" answer if someone is trying to put literally 'NA' into the database and mean something other than NULL / NA (e.g. NA = "North America"); given our use case, that seems very unlikely. We'll see in six months time.

sqlite - how do I get a one row result back? (luaSQLite3)

How can I get a single row result (e.g. in the form of a table/array) back from a SQL statement, using Lua SQLite (LuaSQLite3)? For example this one:
SELECT * FROM sqlite_master WHERE name ='myTable';
So far I note:
using "nrows"/"rows" it gives an iterator back
using "exec" it doesn't seem to give a result back(?)
Specific questions are then:
Q1 - How to get a single row (say first row) result back?
Q2 - How to get row count? (e.g. num_rows_returned = db:XXXX(sql))
In order to get a single row use the db:first_row method. Like so.
row = db:first_row("SELECT `id` FROM `table`")
print(row.id)
In order to get the row count use the SQL COUNT statement. Like so.
row = db:first_row("SELECT COUNT(`id`) AS count FROM `table`")
print(row.count)
EDIT: Ah, sorry for that; first_row isn't actually available. Here are some methods that should work.
You can also use db:nrows. Like so.
rows = db:nrows("SELECT `id` FROM `table`")
row = rows[1]
print(row.id)
We can also modify this to get the number of rows.
rows = db:nrows("SELECT COUNT(`id`) AS count FROM `table`")
row = rows[1]
print(row.count)
Here is a demo of getting the returned count:
> require "lsqlite3"
> db = sqlite3.open":memory:"
> db:exec "create table foo (x,y,z);"
> for x in db:urows "select count(*) from foo" do print(x) end
0
> db:exec "insert into foo values (10,11,12);"
> for x in db:urows "select count(*) from foo" do print(x) end
1
>
Just loop over the iterator you get back from rows or whichever function you use, but put a break at the end of the loop body so you only iterate once.
Getting the count is all about using SQL. You compute it with the SELECT statement:
SELECT count(*) FROM ...
This will return one row containing a single value: the number of rows matched by the query.
This is similar to what I'm using in my project and works well for me.
local query = "SELECT content FROM playerData WHERE name = 'myTable' LIMIT 1"
local queryResultTable = {}
local queryFunction = function(userData, numberOfColumns, columnValues, columnTitles)
for i = 1, numberOfColumns do
queryResultTable[columnTitles[i]] = columnValues[i]
end
end
db:exec(query, queryFunction)
for k,v in pairs(queryResultTable) do
print(k,v)
end
You can even concatenate values into the query so you can place it inside a generic method/function:
local query = "SELECT * FROM ZQuestionTable WHERE ConceptNumber = "..conceptNumber.." AND QuestionNumber = "..questionNumber.." LIMIT 1"

How can I use the same name for a column in the output of my sqlite query as a value I am comparing to in the join's ON clause?

My original question
When I execute the following query in SQLite, I get this error:
Query Error: misuse of aggregate: sum() Unable to execute statement
When I change the name of the "Loan" column to something like loan_amount the error goes away and my query works fine. Why is there a problem with "Loan"?
select
t.*
, coalesce(sum(ded0.after_tax_ded_amt), 0) as "Loan"
, coalesce(sum(ded1.after_tax_ded_amt), 0) as ee_advance_amount
from totals t
left join totals as ded0
on t.ee_ssn = ded0.ee_ssn
and t.deduction_code = "Loan"
and ded0.deduction_code = "Loan"
left join totals as ded1
on t.ee_ssn = ded1.ee_ssn
and t.deduction_code = "EE Advance"
and ded1.deduction_code = "EE Advance"
group by t.ee_ssn;
Mid-post revelation
I'm pretty sure I figured out why I get the error: is it because I am comparing to "Loan" in the ON clause of my joins?
If so, how can I still use the word "Loan" for my column name in the output of my query?
I'd guess that your real problem is quote misuse. Single quotes in SQL are for quoting string literals; double quotes are for quoting column and table names that need to be case-sensitive or contain odd characters. SQLite is fairly forgiving of odd syntax, so it is probably making a guess about what "Loan" means and guessing incorrectly: here it resolves "Loan" to your aggregate output column rather than a string, which puts sum() inside the ON clause and triggers the "misuse of aggregate" error. Try this:
select
t.*
, coalesce(sum(ded0.after_tax_ded_amt), 0) as "Loan"
, coalesce(sum(ded1.after_tax_ded_amt), 0) as ee_advance_amount
from totals t
left join totals as ded0
on t.ee_ssn = ded0.ee_ssn
and t.deduction_code = 'Loan'
and ded0.deduction_code = 'Loan'
left join totals as ded1
on t.ee_ssn = ded1.ee_ssn
and t.deduction_code = 'EE Advance'
and ded1.deduction_code = 'EE Advance'
group by t.ee_ssn;
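A quick way to see the two quote styles side by side (illustrative; relies on SQLite's lenient fallback for unknown double-quoted identifiers):
select 'Loan' = "Loan";  -- returns 1 here, but only because "Loan" matched no column and fell back to a string literal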
