In Cosmos DB, I need to insert a large amount of data into a new container.
create_table_sql = f"""
CREATE TABLE IF NOT EXISTS cosmosCatalog.`{cosmosDatabaseName}`.{cosmosContainerName}
USING cosmos.oltp
OPTIONS(spark.cosmos.database = '{cosmosDatabaseName}')
TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '10000', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');
"""
spark.sql(create_table_sql)
# Read data with spark
data = (
spark.read.format("csv")
.options(header="True", inferSchema="True", delimiter=";")
.load(spark_file_path)
)
cfg = {
"spark.cosmos.accountEndpoint": "https://XXXXXXXXXX.documents.azure.com:443/",
"spark.cosmos.accountKey": "XXXXXXXXXXXXXXXXXXXXXX",
"spark.cosmos.database": cosmosDatabaseName,
"spark.cosmos.container": cosmosContainerName,
}
data.write.format("cosmos.oltp").options(**cfg).mode("APPEND").save()
Then, after this insert, I would like to change the manual throughput of this container.
alter_table = f"""
ALTER TABLE cosmosCatalog.`{cosmosDatabaseName}`.{cosmosContainerName}
SET TBLPROPERTIES( manualThroughput = '400');
"""
spark.sql(alter_table)
Py4JJavaError: An error occurred while calling o342.sql.
: java.lang.UnsupportedOperationException
I can find no documentation online on how to change TBLPROPERTIES for a Cosmos DB table in Spark SQL. I know I can edit the throughput in the Azure Portal and with the Azure CLI, but I would like to keep everything in Spark SQL.
This is not supported by the Spark connector for the NoSQL API, so you might need to track the issue here. For now you would have to do it through a CLI command, the portal, or an SDK (for example, the Java SDK).
FYI: a Cosmos DB NoSQL API container is not the same thing as a table in SQL, so ALTER TABLE commands for container properties will not work.
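If the goal is simply to stay inside the same notebook rather than Spark SQL specifically, a minimal sketch using the azure-cosmos Python SDK (reusing the placeholder endpoint, key, and variable names from the question) could look like this:
# Sketch only: changes provisioned manual throughput via the azure-cosmos SDK,
# not via Spark SQL. Endpoint and key are the placeholders from the question.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://XXXXXXXXXX.documents.azure.com:443/",
    credential="XXXXXXXXXXXXXXXXXXXXXX",
)
database = client.get_database_client(cosmosDatabaseName)
container = database.get_container_client(cosmosContainerName)

# Scale the container's manual throughput back down to 400 RU/s after the bulk load.
container.replace_throughput(400)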
I'm attempting to send a batch of PartiQL statements in the NodeJS AWS SDK v3. The statement works fine for a single ExecuteStatementCommand, but the Batch command doesn't.
The statement looks like
const statement = `
SELECT *
FROM "my-table"
WHERE "partitionKey" = '1234'
AND "filterKey" = '5678'
`
This code snippet works as expected:
const result = await dynamodbClient.send(new ExecuteStatementCommand(
{ Statement: statement}
))
The batch snippet does not:
const result = await dynamodbClient.send(new BatchExecuteStatementCommand({
Statements: [
{
Statement: statement
}
]
}))
The batch call produces the following error:
"Code": "ValidationError",
"Message": "Select statements within BatchExecuteStatement must specify the primary key in the where clause."
Any insight is greatly appreciated. Thanks for taking the time to read my question!
Seems like what I needed was a rubber duck.
DynamoDB primary keys consist of a partition key plus an optional sort key. My particular table has a sort key, which is missing from the statement. Batch statements cannot filter responses, and each statement must match a single item in the database.
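For illustration, the statement then has to name the full primary key; "sortKey" below is a stand-in for the table's actual sort key attribute, and any extra filtering on "filterKey" would have to happen outside the batch call:
const statement = `
    SELECT *
    FROM "my-table"
    WHERE "partitionKey" = '1234'
      AND "sortKey" = '<sort key value>'
`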
The pandas code I have used for connecting to Teradata is:
import pandas as pd
import teradatasql

# config is an already-loaded configparser.ConfigParser instance
database = config.get('Teradata connection', 'database')
host = config.get('Teradata connection', 'host')
user = config.get('Teradata connection', 'user')
pwd = config.get('Teradata connection', 'pwd')

with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    query1 = "SELECT * FROM {}.{}".format(database, tables)
    df = pd.read_sql_query(query1, connect)
Now I need to use the Dask library to load this big data, as an alternative to pandas.
Please suggest a method for connecting to Teradata in the same way.
Teradata appears to have a SQLAlchemy dialect (the teradatasqlalchemy package), so you should be able to install that, set your connection string appropriately, and use Dask's existing read_sql_table function.
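A minimal sketch under those assumptions (teradatasqlalchemy installed, and a numeric or datetime column to partition on; "id" and npartitions=10 below are hypothetical):
import dask.dataframe as dd

# "teradatasql://" is the dialect name registered by the teradatasqlalchemy package.
con = "teradatasql://{}:{}@{}".format(user, pwd, host)
df = dd.read_sql_table(tables, con, index_col="id", schema=database, npartitions=10)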
Alternatively, you could do this by hand: you need to decide on a set of conditions which will partition the data for you, each partition being small enough for your workers to handle. Then you can make a set of partitions and combine into a dataframe as follows
import dask
import dask.dataframe as dd
import pandas as pd
import teradatasql

def get_part(condition):
    with teradatasql.connect(host=host, user=user, password=pwd) as connect:
        query1 = "SELECT * FROM {}.{} WHERE {}".format(database, tables, condition)
        return pd.read_sql_query(query1, connect)

parts = [dask.delayed(get_part)(cond) for cond in conditions]
df = dd.from_delayed(parts)
(Ideally, you can derive the meta= parameter for from_delayed beforehand, perhaps by fetching the first 10 rows of the original query.)
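For example, a sketch of deriving it from a small sample (SELECT TOP 10 is Teradata syntax; the empty slice keeps just the column names and dtypes):
with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    sample = pd.read_sql_query(
        "SELECT TOP 10 * FROM {}.{}".format(database, tables), connect
    )
df = dd.from_delayed(parts, meta=sample.iloc[:0])  # empty frame with the right schema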
I am trying to implement an upsert of an item into DynamoDB with optimistic locking. I have the update portion working with a ConditionExpression that checks the version, but this fails for the initial save, because the ConditionExpression is false when the item does not yet exist. Is it possible to write the ConditionExpression so that it handles both situations?
My code:
import copy

from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

result = copy.copy(user)
table = get_db_table()
current_version = result.get_version()
result.update_version()
try:
    table.put_item(
        Item=result.to_table_item(),
        ConditionExpression=Attr(result.get_version_key()).eq(current_version)
    )
except ClientError as error:
    logger.error(
        "Saving to db failed with '%s'",
        str(error))
    # Restore version
    result.set_version(current_version)
    raise Exception(ErrorCode.DB_SAVE) from error
return result
Basically, the equality comparison can only succeed if the attribute already exists, so the condition also needs to allow the case where it does not exist yet. Your condition expression string should be
attribute_not_exists(current_version) OR current_version = :expected_current_version
Using Boto3, you can create this using
Attr(result.get_version_key()).not_exists() | Attr(result.get_version_key()).eq(current_version)
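Slotted into the put_item call from the question (reusing the same helper methods), a sketch would be:
from boto3.dynamodb.conditions import Attr

version_key = result.get_version_key()
table.put_item(
    Item=result.to_table_item(),
    ConditionExpression=(
        Attr(version_key).not_exists() | Attr(version_key).eq(current_version)
    ),
)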
I'm trying to implement a Guice module that lets Guacamole use SQLite as a backend. The Guacamole project has a generic JDBC base module, which lets you implement modules for specific datastores with less code; most of the lines of code end up being in mapper XML files. The project provides PostgreSQL and MySQL implementations.
I based this SQLite module off of the MySQL module. For the mapper XML files, SQLite and MySQL are similar enough that I didn't have to make any changes. However, when I try to use the SQLite module, I get this error:
### Error querying database. Cause: java.lang.ArrayIndexOutOfBoundsException: 2
### The error may exist in org/apache/guacamole/auth/jdbc/connectiongroup/ConnectionGroupMapper.xml
### The error may involve defaultParameterMap
### The error occurred while setting parameters
### SQL: SELECT guacamole_connection_group.connection_group_id, connection_group_name, parent_id, type, max_connections, max_connections_per_user, enable_session_affinity FROM guacamole_connection_group JOIN guacamole_connection_group_permission ON guacamole_connection_group_permission.connection_group_id = guacamole_connection_group.connection_group_id WHERE guacamole_connection_group.connection_group_id IN ( ? ) AND user_id = ? AND permission = 'READ'; SELECT parent_id, guacamole_connection_group.connection_group_id FROM guacamole_connection_group JOIN guacamole_connection_group_permission ON guacamole_connection_group_permission.connection_group_id = guacamole_connection_group.connection_group_id WHERE parent_id IN ( ? ) AND user_id = ? AND permission = 'READ'; SELECT parent_id, guacamole_connection.connection_id FROM guacamole_connection JOIN guacamole_connection_permission ON guacamole_connection_permission.connection_id = guacamole_connection.connection_id WHERE parent_id IN ( ? ) AND user_id = ? AND permission = 'READ';
### Cause: java.lang.ArrayIndexOutOfBoundsException: 2
It looks like the problem is that two parameters are passed to the query, but each is repeated three times. When MyBatis generates the PreparedStatement, it acts as if there are six parameters that need to be passed in.
Here's the query it has a problem with:
<!-- Select multiple connection groups by identifier only if readable -->
<select id="selectReadable" resultMap="ConnectionGroupResultMap"
resultSets="connectionGroups,childConnectionGroups,childConnections">
SELECT
guacamole_connection_group.connection_group_id,
connection_group_name,
parent_id,
type,
max_connections,
max_connections_per_user,
enable_session_affinity
FROM guacamole_connection_group
JOIN guacamole_connection_group_permission ON guacamole_connection_group_permission.connection_group_id = guacamole_connection_group.connection_group_id
WHERE guacamole_connection_group.connection_group_id IN
<foreach collection="identifiers" item="identifier"
open="(" separator="," close=")">
#{identifier,jdbcType=VARCHAR}
</foreach>
AND user_id = #{user.objectID,jdbcType=INTEGER}
AND permission = 'READ';
SELECT parent_id, guacamole_connection_group.connection_group_id
FROM guacamole_connection_group
JOIN guacamole_connection_group_permission ON guacamole_connection_group_permission.connection_group_id = guacamole_connection_group.connection_group_id
WHERE parent_id IN
<foreach collection="identifiers" item="identifier"
open="(" separator="," close=")">
#{identifier,jdbcType=VARCHAR}
</foreach>
AND user_id = #{user.objectID,jdbcType=INTEGER}
AND permission = 'READ';
SELECT parent_id, guacamole_connection.connection_id
FROM guacamole_connection
JOIN guacamole_connection_permission ON guacamole_connection_permission.connection_id = guacamole_connection.connection_id
WHERE parent_id IN
<foreach collection="identifiers" item="identifier"
open="(" separator="," close=")">
#{identifier,jdbcType=VARCHAR}
</foreach>
AND user_id = #{user.objectID,jdbcType=INTEGER}
AND permission = 'READ';
</select>
If I manually populate the parameters, I can execute this against the SQLite database. Also, the MySQL version works fine.
What the heck is going on? What can I do to debug this? Is it a MyBatis problem or something with the JDBC connector?
If it helps, you can see the code for the module here.
Here's the method signature in the mapper interface related to this query. The full mapper classes for ConnectionGroup are here and here. The full mapper XML for my SQLite module is here.
Collection<ModelType> selectReadable(#Param("user") UserModel user,
#Param("identifiers") Collection<String> identifiers);
This is what the ConnectionGroupResultMap looks like:
<resultMap id="ConnectionGroupResultMap" type="org.apache.guacamole.auth.jdbc.connectiongroup.ConnectionGroupModel" >
<!-- Connection group properties -->
<id column="connection_group_id" property="objectID" jdbcType="INTEGER"/>
<result column="connection_group_name" property="name" jdbcType="VARCHAR"/>
<result column="parent_id" property="parentIdentifier" jdbcType="INTEGER"/>
<result column="type" property="type" jdbcType="VARCHAR"
javaType="org.apache.guacamole.net.auth.ConnectionGroup$Type"/>
<result column="max_connections" property="maxConnections" jdbcType="INTEGER"/>
<result column="max_connections_per_user" property="maxConnectionsPerUser" jdbcType="INTEGER"/>
<result column="enable_session_affinity" property="sessionAffinityEnabled" jdbcType="BOOLEAN"/>
<!-- Child connection groups -->
<collection property="connectionGroupIdentifiers" resultSet="childConnectionGroups" ofType="java.lang.String"
column="connection_group_id" foreignColumn="parent_id">
<result column="connection_group_id"/>
</collection>
<!-- Child connections -->
<collection property="connectionIdentifiers" resultSet="childConnections" ofType="java.lang.String"
column="connection_group_id" foreignColumn="parent_id">
<result column="connection_id"/>
</collection>
</resultMap>
David,
I know this is a little old, but I'm also trying to implement a SQLite JDBC module, and ran into exactly the same problem. I've managed to track down the source of the issue, and have filed an issue on the JDBC SQLite github page:
https://github.com/xerial/sqlite-jdbc/issues/277
Basically, the SQLite JDBC driver computes one of its array sizes based on the parameter count for a prepared statement. However, in the cases you mentioned, there are multiple SELECT statements which take the same parameters, so the array needs to be x * y (x = parameter count, y = number of select statements) rather than just the number of parameters. When it tries to prepare the statement, it hits the first position beyond the parameter count and generates this exception.
The other JDBC modules - MySQL, PostgreSQL, and SQL Server, as well as Oracle and H2 that I'm messing with - seem to handle this situation correctly (well, Oracle is a little...special...), but the SQLite driver does not.
I was able to work around the issue in a really, really kludgy way: I created two different result maps, one for the generic select and one for the select that checks for READ permissions, then broke each of the queries out into its own <select> block and called those from the collections inside the result map. It's nowhere near as elegant as the existing code, but it works.
I set up a SQLite db with the same schema as my existing SQL Server db and noted the following...
SQLite field names (and presumably everything else) are case sensitive.
MicroLite's SqlBuilder appears to insert the prefix 'dbo.' before the table name, which SQLite doesn't like...
This query works...
query = new SqlQuery("SELECT [ClubID], [Name] FROM [Clubs] WHERE [ClubID] = #p0", 3);
clubs = session.Fetch<MicroLiteClub>(query);
This one doesn't...
query = SqlBuilder.Select("*")
.From(typeof(MicroLiteClub))
.Where("ClubID = #p0", 3)
.OrWhere("ClubID = #p1", 22)
.OrderByDescending("Name")
.ToSqlQuery();
clubs = session.Fetch<MicroLiteClub>(query);
MicroLite logged: "no such table: dbo.Clubs"
This is happening because SQLite doesn't support table schemas like MS SQL Server does.
In the hand-crafted query, you are not specifying a schema for the table (FROM [Clubs]); however, in your mapping attribute you will have specified dbo as the schema, like this:
[Table(schema: "dbo", name: "Clubs")]
The SqlBuilder doesn't know which SQL dialect is in use, so if a schema is present on the table mapping, it will be used. This means it generates FROM [dbo].[Clubs]. To rectify this, simply remove the schema value from the TableAttribute, as it is optional from MicroLite 2.1 onwards.
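For example, assuming the mapping attribute shown above, the schema-free form would just be:
[Table(name: "Clubs")]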
On a side note, MicroLite 2.1 introduced support for In in the SqlBuilder fluent API so you could change:
.Where("ClubID = #p0", 3)
.OrWhere("ClubID = #p1", 22)
to
.Where("ClubID").In(3, 22)