Load data from R into PostgreSQL database without losing constraints

I'm trying to import some dataframe from R into my PostgreSQL database. The tables are defined as follows:
CREATE TABLE fact_citedpubs(
citedpubs_id serial CONSTRAINT fact_citedpubs_pk PRIMARY KEY NOT NULL,
originID integer REFERENCES dim_country(country_id),
yearID integer REFERENCES dim_year(year_id),
citecount double precision
);
In my dataframe I have values for originID, yearID and citecount. My dataframe looks like this:
 YEAR | GEO_DESC | OBS_VALUE
------+----------+-----------
    8 |        1 |  13.29400
   17 |        2 |   4.42005
   17 |        1 |  12.95001
   15 |        1 |  11.61365
   14 |        1 |  13.48174
To import this dataframe into the PostgreSQL database I use the function dbWriteTable(con, 'fact_citedpubs', citations, overwrite = TRUE).
Because overwrite = TRUE drops and recreates the table, PostgreSQL loses all the constraints that were set earlier (primary keys, foreign keys and data types). Is there any other way to import data into a PostgreSQL database from R while keeping the constraints that were set in advance?
Many thanks!
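One approach is to create the table once with its constraints and then load with dbWriteTable(con, 'fact_citedpubs', citations, append = TRUE) instead of overwrite = TRUE; append issues INSERTs against the existing table rather than dropping it. The effect is the same in any DBMS; here is a minimal sketch of the idea in Python with SQLite (purely illustrative, not the actual PostgreSQL setup):

```python
import sqlite3

# Illustration (in SQLite rather than PostgreSQL) of why appending rows
# preserves constraints while dropping and recreating the table would not.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_citedpubs(
    citedpubs_id INTEGER PRIMARY KEY,
    originid INTEGER,
    yearid INTEGER,
    citecount REAL CHECK (citecount >= 0))""")

rows = [(8, 1, 13.29400), (17, 2, 4.42005), (17, 1, 12.95001)]

# Append: the table definition (and its constraints) stays untouched.
con.executemany(
    "INSERT INTO fact_citedpubs(yearid, originid, citecount) VALUES (?,?,?)",
    rows)

# The CHECK constraint is still enforced after the append:
try:
    con.execute("INSERT INTO fact_citedpubs(yearid, originid, citecount) "
                "VALUES (1, 1, -5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```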

Related

Query all primary roads from PostGis Database filled with OSM Data within a boundary

I imported OpenStreetMap data through osm2pgsql into PgSQL (PostGIS)
and I would like to get an SF object from the data containing
all primary roads (geometry) within an area (bbox) into R.
I got lost, since I would also like to have relations and nodes,
and I'm not sure whether a query on planet_osm_roads alone will be sufficient, or how the imported data structure differs from the OSM XML data I normally work with.
I understand it is probably a somewhat broad question, but
I would appreciate a starting point, so to say, to understand the query language better.
This is my approach, but sadly I get an error:
conn <- RPostgreSQL::dbConnect("PostgreSQL", host = MYHOST,
dbname = "osm_data", user = "postgres", password = MYPASSWORD)
pgPostGIS(conn)
a<-pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", gid = "osm_id",
other.cols = FALSE, clauses = "WHERE highway = 'primary' && ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)")
a<-st_as_sf(a)
This is the error I get:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "ST_MakeEnvelope"
LINE 2: ...lic"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnv...
^
)
Error in pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", :
No geometries found.
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create execute: SELECT DISTINCT a.geo AS type
FROM (SELECT ST_GeometryType("way") as geo FROM "public"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)) a;
This is the db:
osm_data=# \d
List of relations
Schema | Name | Type | Owner
----------+--------------------+----------+----------
public | geography_columns | view | postgres
public | geometry_columns | view | postgres
public | planet_osm_line | table | postgres
public | planet_osm_nodes | table | postgres
public | planet_osm_point | table | postgres
public | planet_osm_polygon | table | postgres
public | planet_osm_rels | table | postgres
public | planet_osm_roads | table | postgres
public | planet_osm_ways | table | postgres
public | spatial_ref_sys | table | postgres
topology | layer | table | postgres
topology | topology | table | postgres
topology | topology_id_seq | sequence | postgres
schema_name table_name geom_column geometry_type type
1 public planet_osm_line way LINESTRING GEOMETRY
2 public planet_osm_point way POINT GEOMETRY
3 public planet_osm_polygon way GEOMETRY GEOMETRY
4 public planet_osm_roads way LINESTRING GEOMETRY
Table "public.planet_osm_roads"
Column | Type | Collation | Nullable | Default
--------------------+---------------------------+-----------+----------+---------
osm_id | bigint | | |
access | text | | |
addr:housename | text | | |
addr:housenumber | text | | |
addr:interpolation | text | | |
admin_level | text | | |
aerialway | text | | |
aeroway | text | | |
amenity | text | | |
area | text | | |
barrier | text | | |
bicycle | text | | |
brand | text | | |
bridge | text | | |
boundary | text | | |
building | text | | |
construction | text | | |
Your query looks just fine. Check the following example:
WITH planet_osm_roads (highway,way) AS (
VALUES
('primary','SRID=3857;POINT(1283861.57 6113504.88)'::geometry), --inside your bbox
('secondary','SRID=3857;POINT(1286919.06 6067184.04)'::geometry) --somewhere else ..
)
SELECT highway,ST_AsText(way)
FROM planet_osm_roads
WHERE
highway IN ('primary','secondary','tertiary') AND
planet_osm_roads.way && ST_Transform(
ST_MakeEnvelope(11.2364353533134,47.8050651144447,11.8882527806375,48.2423300001326, 4326),3857
);
highway | st_astext
---------+------------------------------
primary | POINT(1283861.57 6113504.88)
This image illustrates the BBOX and the points used in the example above
Check the documentation for more information on the bbox intersection operator &&.
However, there are a few things to consider.
In case you're creating the BBOX yourself in order to have an area for ST_Contains, consider simply using ST_DWithin. It will basically do the same, but you only have to provide a reference point and the distance.
Depending on the distribution of highway types in the table planet_osm_roads and considering that your queries will always look for either primary,secondary or tertiary highways, creating a partial index could significantly improve performance. As the documentation says:
A partial index is an index built over a subset of a table; the subset
is defined by a conditional expression (called the predicate of the
partial index). The index contains entries only for those table rows
that satisfy the predicate. Partial indexes are a specialized feature,
but there are several situations in which they are useful.
So try something like this:
CREATE INDEX idx_planet_osm_roads_way ON planet_osm_roads USING gist(way)
WHERE highway IN ('primary','secondary','tertiary');
The highway column also needs to be indexed. So try this:
CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway);
.. or even another partial index, in case you can't delete the other data but you don't need it for anything:
CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway)
WHERE highway IN ('primary','secondary','tertiary');
You can always identify bottlenecks and check whether the query planner is using your index with EXPLAIN.
Further reading
Getting all Buildings in range of 5 miles from specified coordinates
Buffers (Circle) in PostGIS
How can I set data from one table to another according to the spatial relation of geometries in these tables
I figured it out.
This is how you get an SF object out of a PostGIS database filled with OSM data, within the 11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326 BBOX:
library(sf)
library(RPostgreSQL)
library(tictoc)
pw <- MYPASSWORD
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "osm_data",
host = "localhost", port = 5432,
user = "postgres", password = pw)
tic()
sf_data = st_read(con, query = "SELECT osm_id, name, highway, way
FROM planet_osm_roads
WHERE highway IN ('primary', 'secondary', 'tertiary')
AND ST_Contains(
      ST_Transform(
        ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326, 4326)
      , 3857)
    , planet_osm_roads.way);")
toc()
RPostgreSQL::dbDisconnect(con)
Note that in my first attempt I chained the highway conditions with OR; since AND binds tighter than OR, the ST_Contains filter would then only apply to the 'tertiary' branch, so the highway conditions need an IN list (or parentheses) for the BBOX to be considered for all of them.
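The operator-precedence trap here (AND binds tighter than OR) is easy to reproduce in any SQL engine; a small illustrative sketch in Python with SQLite (hypothetical roads table, standing in for planet_osm_roads):

```python
import sqlite3

# Demonstrates that AND binds tighter than OR: in
#   a = 1 OR a = 2 AND b = 1
# the b filter only applies to the a = 2 branch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE roads(highway TEXT, in_bbox INTEGER)")
con.executemany("INSERT INTO roads VALUES (?, ?)",
                [("primary", 0), ("secondary", 0), ("tertiary", 0),
                 ("primary", 1)])

# Unparenthesized OR/AND chain: rows outside the bbox still slip through,
# because in_bbox = 1 only constrains the 'tertiary' branch.
loose = con.execute(
    "SELECT count(*) FROM roads "
    "WHERE highway = 'primary' OR highway = 'secondary' "
    "OR highway = 'tertiary' AND in_bbox = 1").fetchone()[0]

# IN (or explicit parentheses) applies the bbox filter to every type.
strict = con.execute(
    "SELECT count(*) FROM roads "
    "WHERE highway IN ('primary', 'secondary', 'tertiary') "
    "AND in_bbox = 1").fetchone()[0]

print(loose, strict)  # 3 1
```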

Ingest different csv with different fields to an ADX table

Wondering if it's possible to ingest multiple CSVs with a different number/type of fields into a single ADX table? (Refer to the CSV samples below.)
Is there any way for me to use the header of the CSV as the fields?
CSV type sample:
Type A
+--------+-----+--------+
| Name | Age | Uni |
+--------+-----+--------+
| Hazriq | 27 | UNITEN |
+--------+-----+--------+
Type B
+------+------+-----+
| Name | Uni | Age |
+------+------+-----+
| John | UNIx | 31 |
+------+------+-----+
Type C
+------+------+--------------+-----+
| Name | Uni | Hometown | Age |
+------+------+--------------+-----+
| Jane | UNIt | Kuala Lumpur | 31 |
+------+------+--------------+-----+
Yes, you can create multiple CSV mappings and provide the applicable mapping for a given file. You cannot use the CSV headers. Here is an example:
.create table demo(name:string, age:int, uni:string, hometown:string)
//define the mappings
.create table demo ingestion csv mapping "typeA" '[{"Name":"name", "Ordinal":0}, {"Name":"age", "Ordinal":1}, {"Name":"uni", "Ordinal":2}]'
.create table demo ingestion csv mapping "typeB" '[{"Name":"name", "Ordinal":0}, {"Name":"uni", "Ordinal":1}, {"Name":"age", "Ordinal":2}]'
.create table demo ingestion csv mapping "typeC" '[{"Name":"name", "Ordinal":0}, {"Name":"uni", "Ordinal":1}, {"Name":"age", "Ordinal":3}, {"Name":"hometown", "Ordinal":2} ]'
//Ingest some test data
.ingest inline into table demo with (csvMappingReference="typeA", pushbypull=true) <| Hazriq,27,UNITEN
.ingest inline into table demo with (csvMappingReference="typeB", pushbypull=true) <| John,UNIx,31
.ingest inline into table demo with (csvMappingReference="typeC", pushbypull=true) <| Jane,UNIt,Kuala Lumpur,31
//test
demo
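Since the service won't read the header row for you, one client-side workaround is to peek at the header before ingesting and pick the matching csvMappingReference yourself. A minimal sketch (the mapping names match the example above; the header-to-mapping table is an assumption about your files):

```python
import csv
import io

# Map each known header layout to the csv mapping name defined in ADX.
HEADER_TO_MAPPING = {
    ("Name", "Age", "Uni"): "typeA",
    ("Name", "Uni", "Age"): "typeB",
    ("Name", "Uni", "Hometown", "Age"): "typeC",
}

def choose_mapping(csv_text):
    """Return the ADX csv mapping name for a file, based on its header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return HEADER_TO_MAPPING[tuple(h.strip() for h in header)]

print(choose_mapping("Name,Age,Uni\nHazriq,27,UNITEN"))                    # typeA
print(choose_mapping("Name,Uni,Hometown,Age\nJane,UNIt,Kuala Lumpur,31"))  # typeC
```

Remember to skip the header line itself at ingestion time (e.g. with the ignoreFirstRecord ingestion property) so it doesn't land in the table as data.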

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
hash TEXT,dX INTEGER,dY INTEGER,dZ INTEGER);
Typical content would be
# select * from visits;
urid | hash | dx | dY | dZ
------+-----------+-------+--------+------
1 | 'abcd' | 10 | 10 | 10
2 | 'abcd' | 11 | 11 | 11
3 | 'bcde' | 7 | 7 | 7
4 | 'abcd' | 13 | 13 | 13
5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid for the table row which satisfies the constraint
hash = 'abcd' AND nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ))
with the smallest deviation - in the sense of smallest sum of absolute distances
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
 urid | hash   | dX | dY | dZ | deviation
------+--------+----+----+----+-----------
    1 | 'abcd' | 10 | 10 | 10 |         6
    2 | 'abcd' | 11 | 11 | 11 |         3
    4 | 'abcd' | 13 | 13 | 13 |         3
in which case I would like to have reported urid = 2 or urid = 4 - I don't actually care which one gets reported.
Left to my own devices I would fetch the full set of matching rows and then drill down to the one that matches my secondary constraint - smallest deviation - in my own Java code. However, I suspect that is not necessary and it can be done in SQL alone. My knowledge of SQL is sadly too limited here. I hope that someone here can put me on the right path.
I now have managed to do the following
CREATE TEMP TABLE h1(urid INTEGER, devi INTEGER);
INSERT INTO h1
SELECT urid, (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi
FROM visits WHERE hash = 'abcd';
which gives
-- SELECT * FROM h1
 urid | devi
------+------
    1 |    6
    2 |    3
    4 |    3
following which I issue
SELECT urid FROM h1 ORDER BY devi ASC LIMIT 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
, (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
FROM visits
WHERE hash=:hashval AND devi < :nearby
ORDER BY devi
LIMIT 1
Line by line: first you list the columns and computed values you want from the visits table (the :tx-style names are placeholders; in your code you prepare a statement and then bind values to the placeholders before executing it).
Then in the WHERE clause you restrict what rows get returned to those matching the particular hash (That column should have an index for best results... CREATE INDEX visits_idx_hash ON visits(hash) for example), and that have a devi that is less than the value of the :nearby placeholder. (I think devi < :nearby is clearer than :nearby >= devi).
Then you say that you want those results sorted in increasing order according to devi, and LIMIT the returned results to a single row because you don't care about any others (If there are no rows that meet the WHERE constraints, nothing is returned).
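The query above can be exercised end-to-end against the sample data from the question; here is a quick sketch in Python's sqlite3 (with the deviation expression repeated in the WHERE clause so it also runs on engines that don't resolve column aliases there):

```python
import sqlite3

# Rebuild the sample visits table from the question.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE visits(
    urid INTEGER PRIMARY KEY AUTOINCREMENT,
    hash TEXT, dX INTEGER, dY INTEGER, dZ INTEGER)""")
con.executemany("INSERT INTO visits(hash, dX, dY, dZ) VALUES (?,?,?,?)",
                [("abcd", 10, 10, 10), ("abcd", 11, 11, 11),
                 ("bcde", 7, 7, 7), ("abcd", 13, 13, 13),
                 ("defg", 20, 21, 17)])

# Single query: filter by hash and nearby, keep the row with the
# smallest sum of absolute distances.
row = con.execute(
    """SELECT urid,
              (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
       FROM visits
       WHERE hash = :hashval
         AND (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) < :nearby
       ORDER BY devi
       LIMIT 1""",
    {"tx": 12, "ty": 12, "tz": 12, "hashval": "abcd", "nearby": 7}).fetchone()

print(row)  # deviation 3; either urid 2 or urid 4, ties are unordered
```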

SQLite UPDATE returns empty

I'm trying to update a table column from another table with the code below.
Now the editor says '39 rows affected' and I can see something happened, because some cells changed from null to empty (nothing shows), while others are still null.
What could be wrong here?
Why does it not update properly?
PS: I checked manually that the values are not empty in the column to check for.
UPDATE CANZ_CONC
SET EAN = (SELECT t1.EAN_nummer FROM ArtLev_CONC t1 WHERE t1.Artikelcode_leverancier = Artikelcode_leverancier)
WHERE ARTNMR IN (SELECT t1.Artikelcode_leverancier FROM Artlev_CONC t1 WHERE t1.Artikelcode_leverancier = ARTNMR);
Edit:
Table 2 is like:
NMR | EAN | CUSTOM
-------------------------------
1 | 987 | A
2 | 654 | B
3 | 321 | C
Table 1 is like:
NMR | EAN | CUSTOM
-------------------------------
1 | null | null
2 | null | null
5 | null | null
After the UPDATE table1 is like
NMR | EAN | CUSTOM
-------------------------------
1 | | null
2 | | null
5 | null | null
I've got this working.
I guess my data was corrupted after all.
Since it is about 330,000 rows it was not very easy to spot.
But it dawned on me when the loading of the data took about 10 minutes!
It used to be about 40-60 seconds.
So I ended up back at the drawing board for the initial CSV file.
I also saw the columns had not been given a data type, so I altered that as well.
Thanks for the help!
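For reference, a correlated UPDATE of the shape used here can be sketched in Python's sqlite3. The key point is that the outer table's column must be qualified inside the subquery; an unqualified name resolves to the subquery's own table, so the comparison just matches the column against itself. (Table and column names below are simplified, illustrative versions of the ones in the question.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE canz_conc(artnmr INTEGER, ean TEXT)")
con.execute("CREATE TABLE artlev_conc(artikelcode INTEGER, ean_nummer TEXT)")
con.executemany("INSERT INTO canz_conc VALUES (?, ?)",
                [(1, None), (2, None), (5, None)])
con.executemany("INSERT INTO artlev_conc VALUES (?, ?)",
                [(1, "987"), (2, "654"), (3, "321")])

# Correlate on the OUTER table's column (canz_conc.artnmr); writing just
# "artikelcode = artikelcode" would compare artlev_conc with itself.
con.execute("""
    UPDATE canz_conc
    SET ean = (SELECT t1.ean_nummer FROM artlev_conc t1
               WHERE t1.artikelcode = canz_conc.artnmr)
    WHERE artnmr IN (SELECT artikelcode FROM artlev_conc)
""")

print(con.execute("SELECT artnmr, ean FROM canz_conc ORDER BY artnmr").fetchall())
# [(1, '987'), (2, '654'), (5, None)]
```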

Performance issue on CLOB column

I'm facing an issue with a table which has a CLOB column.
The table is just a 15-column table, with one column as CLOB.
When I do a SELECT on the table excluding the CLOB column, it takes only 15 min, but if I include this column the SELECT query runs for 2 hrs.
I have checked the plan and found that both queries, with and without the column, use the same plan.
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 330K| 61M| 147K (1)| 00:29:34 | | |
| 1 | PARTITION RANGE ALL | | 330K| 61M| 147K (1)| 00:29:34 | 1 | 50 |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID| CC_CONSUMER_EV_PRFL | 330K| 61M| 147K (1)| 00:29:34 | 1 | 50 |
|* 3 | INDEX RANGE SCAN | CC_CON_EV_P_EV_TYPE_BTIDX | 337K| | 811 (1)| 00:00:10 | 1 | 50 |
Below are the stats I collected.
 Stats                                  | Without CLOB column | With CLOB column
----------------------------------------+---------------------+------------------
 recursive calls                        |                   0 |                1
 db block gets                          |                   0 |                0
 consistent gets                        |             1374615 |          3131269
 physical reads                         |              103874 |          1042358
 redo size                              |                   0 |                0
 bytes sent via SQL*Net to client       |           449499347 |       3209044367
 bytes received via SQL*Net from client |             1148445 |       1288482930
 SQL*Net roundtrips to/from client      |              104373 |          2215166
 sorts (memory)                         |                     |
 sorts (disk)                           |                     |
 rows processed                         |             1565567 |          1565567
I'm planning to perform the steps below; is it worth trying?
1) Gather stats on the table and retry
2) compress the table and retry
