When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (INT96); I gather the data is split into 8 bytes for nanoseconds within the day and 4 bytes for the Julian day.
This does not correspond to any Parquet logical type, so the schema in the Parquet file gives no indication that the column is anything but an integer.
My question is, how does Spark know to load such a column as a timestamp as opposed to a big integer?
The semantics are determined from the metadata. We'll need some imports:
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
example data:
val path = "/tmp/ts"
Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)
and Hadoop configuration:
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
Now we can access Spark metadata:
ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData
  .get("org.apache.spark.sql.parquet.row.metadata")
and the result is:
String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
Equivalent information can be stored in the Metastore as well.
According to the official documentation this is used to achieve compatibility with Hive and Impala:
Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
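For illustration, here is a minimal PySpark sketch of setting that flag explicitly (the examples above are Scala, but the property name is the same, and it already defaults to true):

from pyspark.sql import SparkSession

# Explicitly enable interpreting Parquet INT96 columns as timestamps (the default behaviour).
spark = (SparkSession.builder
         .appName("int96-timestamp-demo")
         .config("spark.sql.parquet.int96AsTimestamp", "true")
         .getOrCreate())

# Reading the data written above: "ts" comes back as a timestamp, not a raw integer.
spark.read.parquet("/tmp/ts").printSchema()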
I'm trying to move a parquet file that was written out in R using the arrow library to Bigtable. I have validated the arrow package installation and made sure that the snappy codec is available using codec_is_available("snappy").
For some reason in the third step of the workflow I run into the following error:
Error message from worker: java.lang.RuntimeException:
org.apache.beam.sdk.util.UserCodeException:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 1 in file
ReadableFile{
metadata=Metadata{
resourceId=gs://mybucket/userdata_2.parquet,
sizeBytes=85550,
isReadSeekEfficient=true,
checksum=null,
lastModifiedMillis=0}, compression=UNCOMPRESSED}
It is unclear to me why it gives this error, and also why it says compression=UNCOMPRESSED when the file has been compressed with snappy.
I have tried changing the arrow version from 1.0 to 2.0, and I have tried different compression codecs, including uncompressed (even though the uncompressed format does not seem to be supported by Google Dataflow). The error stays the same.
Using a utility like parquet-tools gives no indication that there is anything wrong with the files I'm uploading.
Is there any special requirement on the parquet format for Google Dataflow that I'm missing here? I've iterated through the ones available to me in the arrow package to no avail.
I was also seeing this error when trying to use my own pyarrow-generated parquets with the parquet_to_bigtable Dataflow template.
The issue boiled down to schema mismatches. While the data in my parquet matched the expected format perfectly, and printing the known-good and my own versions showed exactly the same contents, parquet files also carry metadata that describes the schema, like so:
➜ ~ parq my_pyarrow_generated.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x12d7164c0>
required group field_id=-1 schema {
  optional binary field_id=-1 key;
  optional group field_id=-1 cells (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional binary field_id=-1 family (String);
        optional binary field_id=-1 qualifier;
        optional double field_id=-1 timestamp;
        optional binary field_id=-1 value;
      }
    }
  }
}
I knew this schema probably wasn't precisely what they use themselves, so to get an understanding of how far off I was from what was needed, I used the inverse template bigtable_to_parquet to get a sample parquet file that has the correct metadata encoded within it:
➜ ~ parq dataflow_bigtable_to_parquet.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x1205c6a80>
required group field_id=-1 com.google.cloud.teleport.bigtable.BigtableRow {
  required binary field_id=-1 key;
  required group field_id=-1 cells (List) {
    repeated group field_id=-1 array {
      required binary field_id=-1 family (String);
      required binary field_id=-1 qualifier;
      required int64 field_id=-1 timestamp;
      required binary field_id=-1 value;
    }
  }
}
As seen, the schemas are very close, but not exact.
With this, though, we can build a simple workaround. It's not pretty, but I'm still actively debugging this and it's what finally worked.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Reuse the schema from the sample parquet produced by bigtable_to_parquet
bigtable_schema_parquet = pq.read_table(pa.BufferReader(bigtable_to_parquet_file_bytes))
keys = []
cells = []
.......
df = pd.DataFrame({'key': keys, 'cells': cells})
# Force the known-good schema instead of letting pyarrow infer one
table = pa.Table.from_pandas(df, schema=bigtable_schema_parquet.schema)
tl;dr: Use the bigtable_to_parquet dataflow template to get a sample parquet that has the schema that the parquet_to_bigtable input must use. Then load that schema in-memory and pass it to from_pandas to overwrite whatever schema it would have otherwise inferred
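If the sample file is sitting on disk rather than in a byte buffer, the same idea can be expressed a bit more compactly; this is a sketch under the assumption that the file and output names below are illustrative, not the template's actual names:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read only the schema from the reference parquet produced by bigtable_to_parquet.
reference_schema = pq.read_schema("dataflow_bigtable_to_parquet.parquet")

# Build your own rows (empty here for brevity), then force the reference schema
# instead of whatever pyarrow would infer.
df = pd.DataFrame({"key": [], "cells": []})
table = pa.Table.from_pandas(df, schema=reference_schema, preserve_index=False)
pq.write_table(table, "my_pyarrow_generated.parquet")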
I have a BigQuery table which always has exactly one record. I need to fetch that record and store it in a Python variable.
Schema of the BigQuery table:
filename
b
filename is the column name and 'b' is the value stored in it; b is a string.
I want a Python variable (say 'p') such that p = b.
Please help me with the Airflow DAG.
Whilst I do not have your DAG code, I will share how to fetch the data from your BigQuery table and store it in a variable using the Python API.
Following the documentation, make sure you have the client library installed in your instance. I have used a public dataset and dummy data for demonstration purposes. The following code uses the client library to create a BigQuery client and perform two queries. Then, since each query result contains just one value, the data is stored in two different variables. Below is the code:
from google.cloud import bigquery
import pandas
client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
#Query result is an INT64
query_1 = """
SELECT COUNT(a.id) as count
FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
"""
#Query result is a STRING
query_2 = """SELECT "Jack Sparrow" as name """
res_1 = client.query(query_1)
res_2 = client.query(query_2)
#storing the query result (INT64) in a variable
for row in res_1:
    var_1 = row.count

#storing the query result (STRING) in a variable
for row in res_2:
    var_2 = row.name
print("Checking the var_1: {} . Now checking var_2: {}".format(var_1,var_2))
And the output,
Checking the var_1: 29468374 . Now checking var_2: Jack Sparrow
Notice that var_1 and var_2 are accessible afterwards as ordinary Python variables, one being an INTEGER and the other a STRING. You can therefore incorporate this piece of code (with your own query) into your DAG. I want to stress that you need to look at the python_operator in order to implement it.
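For instance, a minimal DAG sketch along those lines; the project, dataset and table names are placeholders, and the returned value is automatically pushed to XCom so downstream tasks can use it:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from google.cloud import bigquery

def fetch_filename():
    client = bigquery.Client()
    # Placeholder table; it is expected to hold a single row with a single column.
    query = "SELECT filename FROM `my_project.my_dataset.my_table` LIMIT 1"
    for row in client.query(query):
        return row.filename  # the return value is pushed to XCom automatically

with DAG(dag_id="fetch_bigquery_value",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    fetch_task = PythonOperator(task_id="fetch_value",
                                python_callable=fetch_filename)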
I have the below hard-coded string, which I read from our configuration table in MySQL. (There are many hard-coded {} placeholders in our database; I can't go through MySQL to identify all the variables and handle them manually in the script.)
string_from_sql_table = '/distribution/{partner}/{process_date}/'
I have the below predefined variables in my Python script.
from datetime import datetime
string_from_sql_table = '/distribution/{partner}/{process_date}/'  # to keep it simple, the code that reads the string from the MySQL table is not included
partner = 'startupco'
process_date = datetime.today().strftime('%Y%m%d_%H%M')
I get the below output when I print(string_from_sql_table):
/distribution/{partner}/{process_date}/
How do I get the below expected output without changing the input string_from_sql_table:
/distribution/startupco/20191011_1317/
Simple:

from datetime import datetime

'/distribution/{partner}/{process_date}/'.format(
    partner='startupco',
    process_date=datetime.today().strftime('%Y%m%d_%H%M')
)
Read more on the topic here: https://docs.python.org/3/library/stdtypes.html#str.format
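Since the strings come from a configuration table and you may not know in advance which placeholders each one uses, you can also keep the known values in a single dict and unpack it; a small sketch (any extra names in the dict are illustrative):

from datetime import datetime

known_values = {
    'partner': 'startupco',
    'process_date': datetime.today().strftime('%Y%m%d_%H%M'),
}

string_from_sql_table = '/distribution/{partner}/{process_date}/'
# Keys in the dict that the string does not use are simply ignored;
# a placeholder missing from the dict would raise a KeyError.
print(string_from_sql_table.format(**known_values))
# -> /distribution/startupco/20191011_1317/ (the timestamp reflects the current time)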
I am trying to load my tables from PostgreSQL into Spark.
I have successfully read a table from PostgreSQL into Spark using JDBC.
I have code written in R which I want to use on that table, but I cannot access the data in R.
I am using the following code to connect:
val pgDF_table = spark.read
.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
.option("dbtable", "survey_results")
.option("user", "prashant")
.option("password","pandey")
.load()
pgDF_table.show
Is there any option like spark.write?
In SparkR, you can read data from JDBC using the following code:
read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
upperBound = NULL, numPartitions = 0L, predicates = list(), ...)
Arguments
`url`: JDBC database URL of the form jdbc:subprotocol:subname.
`tableName`: the name of the table in the external database.
`partitionColumn`: the name of a column of integral type that will be used for partitioning.
`lowerBound`: the minimum value of `partitionColumn` used to decide partition stride.
`upperBound`: the maximum value of `partitionColumn` used to decide partition stride.
`numPartitions`: the number of partitions. This, together with `lowerBound` (inclusive) and `upperBound` (exclusive), forms the partition strides for the generated WHERE clause expressions used to split the column `partitionColumn` evenly. Defaults to SparkContext.defaultParallelism when unset.
`predicates`: a list of conditions in the WHERE clause; each one defines one partition.
Data can be written to JDBC using the following code:
write.jdbc(x, url, tableName, mode = "error", ...)
Arguments
`x`: a SparkDataFrame.
`url`: JDBC database url of the form jdbc:subprotocol:subname.
`tableName`: the name of the table in the external database.
`mode`: one of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default).
`...`: additional JDBC database connection properties.
The JDBC driver must be on the Spark classpath.
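If you end up on the Python side of Spark rather than SparkR, the same round trip is available through the DataFrame reader/writer; a minimal PySpark sketch reusing the connection details from the question (the target table name is made up):

# Assumes an existing SparkSession `spark` and the PostgreSQL JDBC driver on the classpath.
jdbc_opts = {
    "driver": "org.postgresql.Driver",
    "url": "jdbc:postgresql://10.128.0.4:5432/sparkDB",
    "user": "prashant",
    "password": "pandey",
}

df = spark.read.format("jdbc").options(dbtable="survey_results", **jdbc_opts).load()

(df.write.format("jdbc")
   .options(dbtable="survey_results_copy", **jdbc_opts)  # hypothetical target table
   .mode("append")
   .save())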
I am trying to store binary data in a sqlite database using the Twisted adbapi. However, when I run a query to store the data, I get an error:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
After googling a bit, I found the answer for a normal sqlite connection:
con = sqlite3.connect(...)
con.text_factory = str
However, I can't find an equivalent setting to use with a twisted adbapi sqlite connection:
dbpool = adbapi.ConnectionPool("sqlite3", "data.db", check_same_thread=False)
I would appreciate any help!
I figured it out. In order to make changes to the connection after it opens, you have to use the cp_openfun parameter for the ConnectionPool. The following code worked:
def set_text_factory(conn):
    conn.text_factory = str

dbpool = adbapi.ConnectionPool("sqlite3", "data.db", check_same_thread=False,
                               cp_openfun=set_text_factory)
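As a follow-up, a small sketch of using the pool once it is configured (the table and column names are made up; adbapi returns Deferreds, so results arrive via callbacks):

from twisted.enterprise import adbapi
from twisted.internet import reactor

def set_text_factory(conn):
    conn.text_factory = str

dbpool = adbapi.ConnectionPool("sqlite3", "data.db", check_same_thread=False,
                               cp_openfun=set_text_factory)

def print_rows(rows):
    # Hypothetical "files" table with name/data columns.
    for name, data in rows:
        print(name, len(data))

d = dbpool.runQuery("SELECT name, data FROM files")
d.addCallback(print_rows)
d.addBoth(lambda _: reactor.stop())
reactor.run()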