Update JSONB in Postgres using R

As the title states, I am trying to update a JSONB field from inside R. I have several changes to apply. The original dataset is created by a third-party application but needs to be corrected for several rows.
The following statement is working fine as a database statement:
UPDATE histories set meta=jsonb_set(meta,'{product}','"55-AB"') WHERE id = 17983;
Now, I need to update the "product" field for several different ids.
Assume the following dataframe as an example:
df <- data.frame(product = c("55-AB", "567-C", "UTG-98"),
                 id = c(17983, 54388, 20000))
Usually I would use sql_glue from the glue package, but I run out of quotes when generating the above queries dynamically:
sql_glue("UPDATE histories set meta=jsonb_set(meta,'{product}','"{`df$product`}"') WHERE id = {`df$id`};")
Error: unexpected '{' in "sql_glue("UPDATE histories set meta=jsonb_set(meta,'{pesticide}','"{"
I am running into problems with the quoting. Any idea how to get around this?
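One way around the nested quotes is to skip string interpolation entirely and let DBI bind the values. A minimal sketch, assuming an open RPostgres connection `con` (table and column names are taken from the question):

library(DBI)
# con: an open connection from DBI::dbConnect(RPostgres::Postgres(), ...)

# Parameterized statement: DBI does the quoting, so there is no quote
# nesting in R. to_jsonb($1::text) converts the bound value into a JSON
# value for jsonb_set(). With vectors in `params`, the statement is
# executed once per row of df.
dbExecute(con,
          "UPDATE histories
              SET meta = jsonb_set(meta, '{product}', to_jsonb($1::text))
            WHERE id = $2",
          params = list(df$product, df$id))

If you prefer glue, note that the exported helper is glue_sql(); it quotes character values itself, and the literal braces in the JSON path must be doubled ('{{product}}'), because glue treats { } as interpolation markers.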

Related

Rails - Create and operate on a temporary table?

My background is in data science with R, but in my current position I'm pulling data through Rails and ActiveRecord. I want to perform transformations on my data, create new columns, and save the result in a temporary way that allows me to continue querying it like a regular table, but without actually making changes to the database.
In R, this might look something like:
new_table <- old_table[old_table$date >= '2020-01-01', ]
new_table$average <- mean(new_table$value)
I would take this new_table and perform any number of queries I could have done to the old_table, and once I close my app I expect this temporary table to be removed as well.
This particular transformation is simple and wouldn't require a new table, but, for example, there are a number of tables I'd like to join with my new_table. It would be easier to perform my transformations once and then join, rather than joining the old_table and performing the transformation each time.
Since your question is vague, I'll give a general answer that might not fit your use case; it's a best guess at this point. There are numerous ways to use the DB connection in Rails to query directly, as referenced in the link in my comments above. But as an experiment I wanted to see if this would work, and it does, at least with a project that is using Postgres. I wanted it to be DB-agnostic, so I'm avoiding calling the DB connection directly.
First create a temporary class in the Rails console:
rails c
Loading development environment (Rails...
class MyTempTable < ActiveRecord::Base
end
=> nil
EDIT:
In addition to the method below, you can also do this to create the table:
MyTempTable.find_by_sql('create temp table my_temp_tables AS select...')
This will create the temp table directly from a query. You could then use a join statement if you wanted data from more than one table in the new temp table, and you can add any additional columns you want
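For example, a sketch of that join variant (orders and customers are hypothetical tables standing in for your real ones):

# Build the temp table from a join in one statement; the selected
# columns, including the aliased one, become the temp table's columns.
MyTempTable.find_by_sql(<<~SQL)
  CREATE TEMP TABLE my_temp_tables AS
    SELECT o.id, o.total, c.name AS customer_name
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     WHERE o.purchased >= '2020-01-01'
SQL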
End Edit
Now you have a class that will act like a table with the usual ActiveRecord methods. Rails now assumes there is a table in the DB called my_temp_tables (must be plural). You can then create a temp table (if your DBMS supports temp tables) like this:
MyTempTable.find_by_sql('create temp table my_temp_tables(col1, col2... ')
Now you have a temp table with the columns you want. You can then do SQL operations using
MyTempTable.find_by_sql('INSERT INTO my_temp_tables SELECT * FROM ....')
You can then treat MyTempTable like any other model in Rails. If you wanted all the columns from one table joined with some columns from another, you can create the temp table as above; you just have to create all the columns first (at least in Postgres; in MSSQL you can probably create the temp table by inserting directly from a select/join statement). If you are new to Rails, you can grab column names from existing tables like this:
some_columns = SomeTable.column_names
=> ["id", "name", "serial", "purchased", ...]
Now you have an array of the column names, so you don't have to type all of them. You can list out the columns you want from the various tables, cut and paste them into the create temp table... statement, then INSERT the joined data into MyTempTable (or build the column list programmatically, as in the sketch below).
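A small sketch of that column-list step (assumes the SomeTable model from above; the WHERE 1=0 trick creates an empty table with exactly those columns):

# Build the column list programmatically instead of pasting it by hand.
cols = SomeTable.column_names.join(", ")

# Create an empty temp table with those columns, ready for the
# INSERT ... SELECT step shown earlier.
MyTempTable.find_by_sql(
  "CREATE TEMP TABLE my_temp_tables AS SELECT #{cols} FROM some_tables WHERE 1=0"
)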
If you do much of this regularly, you'll probably want to keep a listing of all your column names in a text file. You can also create Rake tasks that do all of this and save the data in some format, or send it off to wherever it is supposed to go. That way you have it all in a file you can just run: it will create the temp tables, do the work, and when it closes out, the temporary classes and tables go away.
You might want to investigate some Ruby gems; there are probably existing gems that do some of what you want. But as a proof of concept this works. You could also spin up a local Rails app and use scripting to import the data you want into tables, then just flush and recreate them at will.
Any Rails gurus that know of a better way, please add an answer or edit this one. This is mostly a thought experiment for me since I wanted to see if it was possible.
If you want to create views that you can access later on you could use a gem like https://github.com/scenic-views/scenic
Or something like this might be of interest: https://github.com/igorkasyanchuk/rails_db
Sounds like you're keen on the benefits of having some structure and tools available to work on the data, but don't want the data persisted in a DB table.
Maybe use a model without a table, like the sketch below.
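A minimal sketch of a tableless model, assuming Rails 5.2+ (where ActiveModel::Attributes is available) and a hypothetical OldTable model:

# A plain Ruby class that quacks like a model but has no table.
class ReportRow
  include ActiveModel::Model
  include ActiveModel::Attributes

  attribute :date,    :date
  attribute :value,   :float
  attribute :average, :float
end

# Build rows in memory from a real table, then enrich them; nothing
# here touches the database beyond the initial read.
rows = OldTable.where("date >= ?", "2020-01-01").map do |r|
  ReportRow.new(date: r.date, value: r.value)
end
avg = rows.sum(&:value) / rows.size
rows.each { |r| r.average = avg }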

Include a hashtag in dbGetQuery()

I'm trying to use RJDBC to connect to a SAP HANA database and query for a temporary table, which is stored with a #-prefix:
test <- dbGetQuery(jdbcConnection,
                   "SELECT * FROM #CONTROL_TBL")
# Error in [...]: invalid table name: Could not find table/view #CONTROL_TBL in schema USER
If I execute the SQL statement in HANA, it works perfectly fine. I'm also able to query permanent tables, so I assume that R doesn't pass the hashtag through. Inserting escapes like "SELECT * FROM \\#CONTROL_TBL" didn't solve my problem, however.
It's not possible to query for the data of a local or global temporary table from a different session, since they are by definition session-specific. In the case of a global temporary table one can query for the metadata of the table because they are shared across sessions.
Source: Tutorial for HANA temporary tables
You have to double-quote the table name because it contains special characters; see SAP Help, identifiers, for details.
test <- dbGetQuery(jdbcConnection,
                   'SELECT * FROM "#CONTROL_TBL"')
See also related discussion on stackoverflow.
OK, local temporary tables are only ever visible to the session in which they were defined, while global temporary tables are visible just like normal tables, but their data is session-private.
So, if you created the local temp table (name starts with #) in a different session, then it's no wonder it cannot be found.
For your example, the question is: why do you need a temporary table in the first place?
Instead, you could e.g. define a view or a table function to select data from, as sketched below.
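A minimal sketch of the view alternative (control_source and its columns are hypothetical stand-ins for whatever populates the temp table today):

-- Replace the session-local temp table with a view: views are plain
-- catalog objects, so they are visible from the RJDBC session too.
CREATE VIEW "CONTROL_VW" AS
    SELECT ctrl_id,
           ctrl_value
      FROM control_source
     WHERE active = 1;

The R side then becomes an ordinary query: dbGetQuery(jdbcConnection, 'SELECT * FROM "CONTROL_VW"').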

UNION of tables using bigquery LegacySQL

I'm trying, without luck, to query for the union of two tables of events using legacySQL, as standardSQL is not yet supported in Data Studio.
In standardSQL that would be something like:
SELECT
*
FROM
`com_myapp_ANDROID.app_events_*`,
`com_myapp_IOS.app_events_*`
However, in legacySQL I get an error when trying to refer to app_events_*. How do I include all the tables of my events, so I can filter them afterwards in Data Studio, if I can't use the wildcard?
I've tried something like:
select * from (TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events_"'))
But I'm not sure this is the right approach; I get:
Cannot output multiple independently repeated fields at the same time.
Found user_dim_user_properties_value_index and event_dim_date
Edit: in the end this is the resulting query, as you can't use FLATTEN directly with TABLE_QUERY:
select
*
from
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')),user_dim.user_properties),
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_IOS, 'table_id CONTAINS "app_events"')),user_dim.user_properties)
Table wildcards don't work in legacy SQL, as you have guessed, so you have to use the TABLE_QUERY() function.
Your approach is right, but the first parameter of the TABLE_QUERY function should be the dataset name, not the first part of the table name. Assuming your dataset name is app_events, that would look like this:
TABLE_QUERY(app_events,'table_id CONTAINS "app_events"')
In legacySQL the union table operator is the comma:
select * from [table1],[table2]
For TABLE_QUERY, you include the dataset name as the first param and the expression as the second:
select * from (TABLE_QUERY([dataset], 'table_id CONTAINS "event"'))
To read more about how to debug TABLE_QUERY, see this linked answer.
The Web UI automatically flattens the results for you, but when there are independently repeated fields you need to flatten with the FLATTEN wrapper.
It takes two params, a table and a repeated field, e.g. FLATTEN(table, tags).
Also, if TABLE_QUERY is involved, you probably need to subselect, like:
select
*
from
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')),user_dim.user_properties)
That particular issue you are experiencing is not UNION-related: you will see the same error message even with just one table, if the table has multiple independently repeated fields and you try to output them at once. This scenario is specific to legacy SQL and can be resolved with the FLATTEN clause.
At the same time, you most likely don't actually mean to use SELECT *, which is what causes all those repeated fields to appear in the output at the same time. If you can narrow down your output list, you have a chance of avoiding the error; if a few independently repeated fields still end up in the output, you can use the FLATTEN technique. A narrowed-output example is sketched below.
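For illustration, a sketch of the narrowed-output variant (the field names assume the standard Firebase app_events export schema, where among the selected fields only event_dim is repeated):

-- Only one repeated field (event_dim) appears in the output, so no
-- FLATTEN is needed; user_dim.app_info is a plain nested record.
SELECT
  user_dim.app_info.app_platform,
  event_dim.name,
  event_dim.date
FROM
  (TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')),
  (TABLE_QUERY(com_myapp_IOS, 'table_id CONTAINS "app_events"'))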

How to Handle BQ GA Export Changes?

I'm trying to reprocess ga_sessions_yyyymmdd data, but am finding that ga_sessions never used to have a field called [channelGrouping], although it does in more recent data.
So my jobs work fine for the latest version of ga_sessions, but when I try to reprocess earlier ga_sessions data the job fails, as it's missing the [channelGrouping] field.
Obviously this is usually what you want, but in this case it's not. I want to make sure I'm sticking to the latest ga_sessions schema, and would like the job to just set missing columns to null where they did not exist.
Is there any way around this?
Perhaps I need to make an empty table called ga_sessions_template_latest and union it onto whatever ga_sessions_ daily table I'm handling; maybe this will 'upgrade' the old ga_sessions to the new structure.
Attached is a screenshot of exactly what I mean (my union idea will actually be horrible due to the nested fields in ga_sessions).
I don't have such a script yet. But since the tables are under your project, you are able to update them. You can write a script that updates the schema on every table missing columns, using the most recent schema as the reference.
I envision a script that gets the most recent table's schema, then goes back one by one through the past tables, does a compare, identifies the missing columns, defines them as nullable (not required), and applies the additional columns by updating each table's schema. Data won't be modified; you will just have additional columns with null values.
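A sketch of such a script using the google-cloud-bigquery Python client (the project, dataset, and table IDs are placeholders; this handles top-level columns only, and added columns must be NULLABLE):

from google.cloud import bigquery

client = bigquery.Client()

# The most recent export carries the full, current schema.
latest = client.get_table("my-project.my_dataset.ga_sessions_20180101")

# Walk the older daily tables and patch in whatever is missing.
for table_ref in client.list_tables("my_dataset"):
    table = client.get_table(table_ref)
    have = {f.name for f in table.schema}
    missing = [f for f in latest.schema if f.name not in have]
    if missing:
        # Appending NULLABLE fields is an allowed schema change;
        # existing rows simply get NULL for the new columns.
        table.schema = list(table.schema) + missing
        client.update_table(table, ["schema"])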
You can also try out some of this from the Web UI.

Determine flyway variables from earlier SQL step

I'd like to use Flyway for a DB update, in a situation where a DB already exists with productive data in it. The problem I'm looking at now (and I have not found a nice solution yet) is the following:
There is an existing DB table with numeric IDs, e.g.
create table objects ( obj_id number, ...)
There is a sequence "obj_seq" to allocate new obj_ids
During my DB migration I need to introduce a few new objects, hence I need new object IDs. However, I do not know at development time what ID numbers these will be.
There is a DB trigger which later references these IDs. To improve performance, I'd like to avoid determining the actual IDs every time the trigger runs, and instead put the IDs directly into the trigger.
Example (very simplified) of what I have in mind:
insert into objects (obj_id, ...) values (obj_seq.nextval, ...)
select obj_seq.currval from dual
-> store this in variable "newID"
create trigger on some_other_table
when new.id = newID
...
Now, is it possible to dynamically determine/use such variables? I have seen the Flyway placeholders, but my understanding is that I cannot set them dynamically as in the example above.
I could use a Java-based migration script and do whatever string magic I like, so that would be a way of doing it, but maybe there is a more elegant way using SQL?
If the table you are updating contains only reference data, get rid of the sequence and assign the IDs manually.
If it contains a mix of reference and user data, you need to select the id based on values in other columns.
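A minimal sketch of the first option (Oracle syntax; obj_name, the migration file name, and the trigger body are hypothetical): because the rows are reference data, the IDs are fixed at development time instead of drawn from obj_seq, so the trigger can use them as literals with no runtime lookup.

-- V2__add_reference_objects.sql: fixed IDs, no sequence needed.
INSERT INTO objects (obj_id, obj_name) VALUES (1001, 'NEW_OBJECT_A');
INSERT INTO objects (obj_id, obj_name) VALUES (1002, 'NEW_OBJECT_B');

-- The trigger hard-codes the IDs, which are now known at development time.
CREATE OR REPLACE TRIGGER trg_some_other_table
  BEFORE INSERT ON some_other_table
  FOR EACH ROW
  WHEN (new.id = 1001 OR new.id = 1002)  -- the manually assigned obj_ids
BEGIN
  NULL;  -- actual trigger logic goes here
END;
/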
