Strategy error when trying to update the RocksDB default column family options - rocksdb

I am trying to update the TTL option (in Scala) for the default column family in RocksDB.
Here is the code snippet I am using:
db = RocksDB.open(options, dir)
val optionsBuilder = db.getOptions()
optionsBuilder.setTtl(30)
db.setOptions(optionsBuilder.build())
But I am getting an Error parsing:: blob_compression_type error:
org.rocksdb.RocksDBException: Error parsing:: blob_compression_type
at org.rocksdb.RocksDB.setOptions(Native Method)
at org.rocksdb.RocksDB.setOptions(RocksDB.java:3436)
I checked the default options; the blob compression type is NO_COMPRESSION. Switching to another compression type doesn't work either:
optionsBuilder.setBlobCompressionType(CompressionType.SNAPPY_COMPRESSION)
Any suggestions on how to update the default column family options in RocksDB?
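For reference, RocksJava also offers a narrower path for changing mutable column family options at runtime through MutableColumnFamilyOptions, which avoids serializing the full Options object back through setOptions. A minimal sketch in Scala; whether ttl is exposed by MutableColumnFamilyOptionsBuilder depends on the RocksJava version, so treat setTtl below as an assumption to verify:

import org.rocksdb.MutableColumnFamilyOptions

// Build only the option(s) to change; the overload without a ColumnFamilyHandle
// targets the default column family.
// Assumption: setTtl is available on MutableColumnFamilyOptionsBuilder in your RocksJava version.
val mutableOpts = MutableColumnFamilyOptions.builder()
  .setTtl(30)
  .build()

db.setOptions(mutableOpts)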

Related

The "Parquet Files on Cloud Storage to Cloud Bigtable" DataFlow template cannot read parquet files

I'm trying to move a parquet file that is written out in R using the arrow library to BigTable. I have validated the arrow package installation and made sure that the snappy codec is available using codec_is_available("snappy").
For some reason in the third step of the workflow I run into the following error:
Error message from worker: java.lang.RuntimeException:
org.apache.beam.sdk.util.UserCodeException:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 1 in file
ReadableFile{
metadata=Metadata{
resourceId=gs://mybucket/userdata_2.parquet,
sizeBytes=85550,
isReadSeekEfficient=true,
checksum=null,
lastModifiedMillis=0}, compression=UNCOMPRESSED}
It is unclear to me why it gives this error, but also why it says compression=UNCOMPRESSED. The file has been compressed with snappy.
I have tried to change the arrow version from 1.0 to 2.0, and have tried to change compression codecs, including uncompressed (even though the uncompressed format does not seem to be supported by Google Data Flow). The error stays the same.
Using a utility like parquet-tools gives no indication that there is anything wrong with the files I'm uploading.
Is there any special requirement for the parquet format in Google Data Flow that I'm missing here? I've iterated through the ones available to me in the arrow package to no avail.
I was also seeing this error when trying to use my own pyarrow-generated parquets with the parquet_to_bigtable dataflow template.
The issue boiled down to schema mismatches. While the data in the parquet matched the expected format perfectly, and printing a known-good file and my own versions showed the exact same contents, parquet files also contain metadata that describes the schema, like so:
➜ ~ parq my_pyarrow_generated.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x12d7164c0>
required group field_id=-1 schema {
  optional binary field_id=-1 key;
  optional group field_id=-1 cells (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional binary field_id=-1 family (String);
        optional binary field_id=-1 qualifier;
        optional double field_id=-1 timestamp;
        optional binary field_id=-1 value;
      }
    }
  }
}
I knew this schema probably wasn't precisely what they use themselves, so to get an understanding of how far off I was from what was needed, I used the inverse template bigtable_to_parquet to get a sample parquet file that has the correct metadata encoded within it:
➜ ~ parq dataflow_bigtable_to_parquet.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x1205c6a80>
required group field_id=-1 com.google.cloud.teleport.bigtable.BigtableRow {
  required binary field_id=-1 key;
  required group field_id=-1 cells (List) {
    repeated group field_id=-1 array {
      required binary field_id=-1 family (String);
      required binary field_id=-1 qualifier;
      required int64 field_id=-1 timestamp;
      required binary field_id=-1 value;
    }
  }
}
As seen, the schemas are very close, but not exact.
With this, we can build a simple workaround. It's not pretty, but I'm still actively debugging this right now, and this is what finally worked:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Borrow the schema from the reference parquet produced by bigtable_to_parquet.
bigtable_schema_parquet = pq.read_table(pa.BufferReader(bigtable_to_parquet_file_bytes))
keys = []
cells = []
.......
# Force the borrowed schema instead of letting from_pandas infer one.
df = pd.DataFrame({'key': keys, 'cells': cells})
table = pa.Table.from_pandas(df, schema=bigtable_schema_parquet.schema)
tl;dr: Use the bigtable_to_parquet dataflow template to get a sample parquet that has the schema that the parquet_to_bigtable input must use. Then load that schema in-memory and pass it to from_pandas to overwrite whatever schema it would have otherwise inferred

Adding a column to MongoDB from R via mongolite gives persistent error

I want to add a column to a MongoDB collection via R. The collection has a tabular format and is already relatively big (14,000,000 entries, 140 columns).
The function I am currently using is
function(collection, name, value) {
  mongolite::mongo(collection)$update(
    "{}",
    paste0("{\"$set\":{\"", name, "\": ", value, "}}"),
    multiple = TRUE
  )
  invisible(NULL)
}
It does work so far. (It takes about 5-10 minutes, which is OK, although it would be nice if the speed could be improved somehow.)
However, it also persistently gives me the following error, which interrupts the execution of the rest of the script. The error message reads:
Error: Failed to send "update" command with database "test": Failed to read 4 bytes: socket error or timeout
Any help on resolving this error would be appreciated. (If there are ways to improve the performance of the update itself, I'd also be more than happy for any advice.)
The default socket timeout is 5 minutes.
You can override the default by setting sockettimeoutms directly in your connection URI:
mongoURI <- paste0("mongodb://", user, ":", pass, "@", mongoHost, ":", mongoPort, "/", db, "?sockettimeoutms=<something large enough in milliseconds>")
mcon <- mongo(mongoCollection, url=mongoURI)
mcon$update(...)
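For what it's worth, the socket timeout is a standard MongoDB connection string option, so the same override works from other clients too. Purely as an illustration with the official Scala driver (assuming the org.mongodb.scala dependency is available; host, credentials, and the 10-minute value are placeholders):

import org.mongodb.scala.MongoClient

// socketTimeoutMS in the URI plays the same role as sockettimeoutms in mongolite.
// Placeholder credentials and host; 600000 ms = 10 minutes.
val client = MongoClient("mongodb://user:pass@localhost:27017/test?socketTimeoutMS=600000")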

Slackr: x Problem with `id` - Cannot send messages

I am not an admin, so I can't change the scopes. I can send slackr_bot messages to a channel I set up during the creation of the app in the UI, but doing the below does not work. Has anyone found a solution to this?
I created a txt file called test.txt. Within that txt file it looks like this:
api_token: xxxxxxxxxxxx
channel: #channel_name
username: myusername
incoming_webhook_url: https://hooks.slack.com/services/xxxxxxxxxxx/xxxxxxxxxxxxx
Then I want to simply send a message, but eventually I would like to run the function
ggslackr(qplot(mpg, wt, data=mtcars))
slackr_setup(config_file = "test.txt")
my_message <- paste("I'm sending a Slack message at", Sys.time(), "from my R script.")
slackr_msg(my_message, channel = "#channel_name", as_user=F)
Here is the error message:
Error: Join columns must be present in data.
x Problem with `id`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In structure(vars, groups = group_vars, class = c("dplyr_sel_vars", :
Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
Consider 'structure(list(), *)' instead.
Edit #2:
Okay, I learned some things regarding packages. If I had to do this over, I'd have gone to their github repo and read the issue tracker.
The reason is that slackr appears to have a few issues related to changes in Slack's API.
Also, since the major update of R (version 4.x), a lot of packages got broken.
My sense is that our issue is with a line of code inside a slackr function (slackr_util.r, iirc) that calls a dplyr join looking for a particular id that does not exist.
So, I'm going to watch the issue tracker and see what comes of it.
Edit: Try slackr_bot(my_message, channel = "#general"); it worked as advertised!
But ggslackr continues to fail.
I'm having the same issue. I found a debugging starting point in another thread:
`rlang::last_error()`
When I run that, I get:
Backtrace:
1. slackr::slackr_msg(my_message, channel = "#general")
5. slackr::slackr_chtrans(channel)
6. slackr::slackr_ims(api_token)
8. dplyr:::left_join.data.frame(users, ims, by = "id", copy = TRUE)
9. dplyr:::join_mutate(...)
10. dplyr:::join_cols(...)
11. dplyr:::standardise_join_by(by, x_names = x_names, y_names = y_names)
12. dplyr:::check_join_vars(by$y, y_names)
So, step 8 there is a join by id, which I suppose implies that 'id' is missing.
Yet if I run slackr::slackrSetup(echo=TRUE), as mentioned on the GitHub issue tracker, I get the following:
{
"SLACK_CHANNEL": ["#general"],
"SLACK_USERNAME": ["slackr_brian"],
"SLACK_ICON_EMOJI": ["NA"],
"SLACK_INCOMING_URL_PREFIX": ["https://hooks.xxxxxxx"],
"SLACK_API_TOKEN": ["token secret"]
}
I'm not sure where to go from here, as the issue tracker conversation mentions confirming that webhooks go to the correct channel and becomes very user specific.
So, that's as far as I have gotten.

Can we use JDBC to write data from postgresql to Spark?

I am trying to load my PostgreSQL tables into Spark.
I have successfully read a table from PostgreSQL into Spark using JDBC.
I have code written in R which I want to use on the table, but I cannot access the data in R.
I am using the following code to connect:
val pgDF_table = spark.read
.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
.option("dbtable", "survey_results")
.option("user", "prashant")
.option("password","pandey")
.load()
pgDF_table.show
Is there any option like spark.write?
In SparkR, you can read data from JDBC using the following code:
read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
upperBound = NULL, numPartitions = 0L, predicates = list(), ...)
Arguments
`url`: JDBC database url of the form 'jdbc:subprotocol:subname'
`tableName`: the name of the table in the external database
`partitionColumn`: the name of a column of integral type that will be used for partitioning
`lowerBound`: the minimum value of 'partitionColumn' used to decide partition stride
`upperBound`: the maximum value of 'partitionColumn' used to decide partition stride
`numPartitions`: the number of partitions. This, along with 'lowerBound' (inclusive) and 'upperBound' (exclusive), forms partition strides for generated WHERE clause expressions used to split the column 'partitionColumn' evenly. This defaults to SparkContext.defaultParallelism when unset.
`predicates`: a list of conditions in the WHERE clause; each one defines one partition
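For comparison with the Scala reader used in the question, the same partitioning parameters are passed as reader options there. A sketch; the partition column survey_id is hypothetical, and the bounds should span its real range:

// Hypothetical partitioned read: splits survey_results into 4 parallel JDBC queries on survey_id.
val partitionedDF = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
  .option("dbtable", "survey_results")
  .option("user", "prashant")
  .option("password", "pandey")
  .option("partitionColumn", "survey_id")   // hypothetical integral column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "4")
  .load()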
Data can be written to JDBC using the following code:
write.jdbc(x, url, tableName, mode = "error", ...)
Arguments
`x`: a SparkDataFrame.
`url`: JDBC database url of the form jdbc:subprotocol:subname.
`tableName`: the name of the table in the external database.
`mode`: one of the save modes 'append', 'overwrite', 'error', 'ignore' ('error' by default).
`...`: additional JDBC database connection properties.
The JDBC driver must be on the Spark classpath.
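Since the reading code in the question is Scala, here is the corresponding write through the Scala DataFrameWriter as well; a sketch reusing the question's connection details, with a hypothetical target table name:

// Write the DataFrame back to PostgreSQL over JDBC.
// mode can be "append", "overwrite", "error" (default) or "ignore".
pgDF_table.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
  .option("dbtable", "survey_results_copy")   // hypothetical target table
  .option("user", "prashant")
  .option("password", "pandey")
  .mode("append")
  .save()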

Overwrite a Spark DataFrame into location

I want to save my Spark DataFrame into a directory using a spark_write_* function like this:
spark_write_csv(df, "file:///home/me/dir/")
but if the directory is already there, I get an error:
ERROR: org.apache.spark.sql.AnalysisException: path file:/home/me/dir/ already exists.;
When I'm working on the same data, I want to overwrite this directory. How can I achieve this? In the documentation there is one parameter:
mode: Specifies the behavior when data or table already exists.
but it doesn't say what value you should use.
The mode parameter should simply have the value "overwrite":
spark_write_csv(df, "file:///home/me/dir/", mode = "overwrite")
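For reference, the same behaviour in native Spark (Scala) goes through DataFrameWriter.mode; a small sketch, assuming a Scala-side DataFrame df rather than the sparklyr table handle:

// Equivalent of mode = "overwrite" in spark_write_csv: replace the directory if it already exists.
df.write
  .mode("overwrite")
  .option("header", "true")
  .csv("file:///home/me/dir/")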

Resources