Delete layer from GeoPackage - r

I am trying to permanently delete a vector layer from a GeoPackage file using the sf package. By "delete" I mean remove entirely, NOT overwrite or update. I am aware of the delete_layer option, but as I understand it, that only deletes a layer immediately before replacing it with a new layer of the same name.
Unfortunately I have written a layer whose name uses non-standard encoding to a GeoPackage, which effectively makes the entire .gpkg file unreadable in QGIS. Hence I am trying to find a way to remove the layer via R.

A GeoPackage is also an SQLite database, so you can use RSQLite database functions to remove tables.
Set up a test:
> d1 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
> d2 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
> d3 = st_as_sf(data.frame(x=runif(10),y=runif(10),z=1:10), coords=c("x","y"))
Write those to a GPKG:
> st_write(d1,"deletes.gpkg","d1")
Writing layer `d1' to data source `deletes.gpkg' using driver `GPKG'
features: 10
fields: 1
geometry type: Point
> st_write(d2,"deletes.gpkg","d2",quiet=TRUE)
> st_write(d3,"deletes.gpkg","d3",quiet=TRUE)
Now to delete, use the RSQLite package (from CRAN), create a database connection:
library(RSQLite)
db = SQLite()
con = dbConnect(db,"./deletes.gpkg")
and remove the table:
dbRemoveTable(con, "d2")
There's one tiny problem - this removes the table but does not remove the metadata that the GeoPackage uses to record that the table is a spatial layer. Hence you get warnings like this from GDAL tools:
$ ogrinfo -so -al deletes.gpkg
ERROR 1: Table or view 'd2' does not exist
Warning 1: unable to read table definition for 'd2'
QGIS happily reads the remaining two layers correctly, though. I think this can be worked around in R by loading the SpatiaLite extension module alongside SQLite, or by manually removing the deleted layer's rows from the metadata tables gpkg_geometry_columns and maybe gpkg_ogr_contents, but nothing seems to break hard with those things not updated.
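If you do want to tidy up, here is a minimal sketch of that metadata cleanup, assuming the layer removed above was d2. The GeoPackage spec's gpkg_contents and gpkg_geometry_columns tables both key on a table_name column; gpkg_ogr_contents is a GDAL-specific feature-count table that may not be present in every file. dbExecute() and dbListTables() are DBI generics, available once RSQLite is loaded.
# remove the leftover metadata rows for the deleted layer "d2"
dbExecute(con, "DELETE FROM gpkg_contents WHERE table_name = 'd2'")
dbExecute(con, "DELETE FROM gpkg_geometry_columns WHERE table_name = 'd2'")
# GDAL's optional per-layer feature-count table, if it exists
if ("gpkg_ogr_contents" %in% dbListTables(con))
  dbExecute(con, "DELETE FROM gpkg_ogr_contents WHERE table_name = 'd2'")
dbDisconnect(con)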

Related

Trying to read 20GB of data, read.csv.sql Produces Errors

I have a 20GB dataset in csv format and I am trying to trim it down with a read.csv.sql command.
I am successfully able to load the first 10,000 observations with the following command:
testframe = read.csv(file.choose(),nrows = 10000)
(The column names were shown in a screenshot in the original question; they include a country column, which is used below.)
I then tried to build my trimmed down dataset with the following command, and get an error:
reduced = read.csv.sql(file.choose(),
sql = 'select * from file where "country" = "Poland" OR
country = "Germany" OR country = "France" OR country = "Spain"',
header = TRUE,
eol = "\n")
The error is: Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: C:\Users\feded\Desktop\AWS\biodiversity-data\occurence.csv line 262 expected 37 columns of data but found 38
Why is it that I can load the first 10,000 observations with ease and problems arise with the second command? I hope you have all the information needed to be able to provide some help on this issue.
Note that with the latest versions of all the packages involved, read.csv.sql is working again.
RSQLite had made breaking changes in its interface to SQLite, which meant that read.csv.sql, and any other software that read files into SQLite from R through the old interface, no longer worked. (Other aspects of sqldf still worked.)
findstr/grep
If the only reason you are doing this is to cut the file down to the 4 countries indicated, perhaps you could just preprocess the csv file like this on Windows, assuming that abc.csv is your csv file, that it is in the current directory, and that XYZ is a string that appears in the header line (so the header row is kept):
DF <- read.csv(pipe('findstr "XYZ France Germany Poland Spain" abc.csv'))
On other platforms use grep:
DF <- read.csv(pipe('grep -E "XYZ|France|Germany|Poland|Spain" abc.csv'))
The above could retrieve some extra rows if those words can also appear in fields other than the intended one, but if that is a concern, subset or dplyr::filter can be used once the data is in R to narrow it down to just the desired rows, as sketched below.
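For example, a minimal post-filtering step, assuming the relevant column is called country as in the question:
# keep only rows whose country field is exactly one of the four countries;
# this drops any stray rows the coarse findstr/grep pre-filter let through
DF <- subset(DF, country %in% c("France", "Germany", "Poland", "Spain"))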
Other utilities
There are also numerous command line utilities that can be used as an alternative to findstr and grep such as sed, awk/gawk (mentioned in the comments) and utilities specifically geared to csv files such as csvfix (C++), miller (go), csvkit (python), csvtk (go) and xsv (rust).
xsv
Taking xsv as an example, binaries can be downloaded from its releases page, and then we can write the following, assuming xsv is in the current directory or on the path. This instructs xsv to extract the rows for which the indicated regular expression matches the country column.
cmd <- 'xsv search -s country "France|Germany|Poland|Spain" abc.csv'
DF <- read.csv(pipe(cmd))
SQLite command line tool
You can use the SQLite command line program to read the file into an SQLite database, which it will create for you. Download the sqlite command line tools for your platform from sqlite.org and unpack them. Then, from the command line (not from R), run something like this to create the abc.db SQLite database from abc.csv:
sqlite3 --csv abc.db ".import abc.csv abc"
Then, assuming that the database is in the current directory, run this in R:
library(sqldf)
sqldf("select count(*) from abc", dbname = "abc.db")
I am not sure that SQLite is a good choice for such a large file, but you can try it. Once the data is in abc.db you can also do the country filtering there rather than in R, as sketched below.
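A minimal sketch, assuming the .import command above created a table called abc with a country column:
library(sqldf)
# pull only the four countries of interest out of the on-disk database
reduced <- sqldf("select * from abc
                  where country in ('France', 'Germany', 'Poland', 'Spain')",
                 dbname = "abc.db")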
H2
Another possibility, if you have sufficient memory to hold the data (possibly after cutting it down with findstr/grep/xsv or another command line utility), is to use the H2 database backend to sqldf from R.
If sqldf sees that the RH2 package containing the H2 driver is loaded, it will use that instead of SQLite. (It would also be possible to use the MySQL or PostgreSQL backends, but these are more involved to install, so we won't cover them, although they are much more likely to be able to handle the size you have.)
Note that the RH2 driver requires the rJava R package, which in turn requires Java itself, although Java is very easy to install. The H2 database itself is included in the RH2 package, so it does not have to be installed separately. Also, the first time in a session that you access Java code through rJava it has to load Java, which takes some time, but thereafter it is faster within that session.
library(RH2)
library(sqldf)
abc3 <- sqldf("select * from csvread('abc.csv') limit 3") |>
type.convert(as.is = TRUE)

Speed up INSERT of 1 million+ rows into Postgres via R using COPY?

I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. In order to do so I am preparing a multi-row INSERT string using R.
query <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
UPDATE %s.meta_data_unlocalized
SET meta_data = md_updates.meta_data
FROM md_updates
WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
COMMIT;", md_values, schema, schema, schema, schema)
DBI::dbGetQuery(con,query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem: I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.
I did some research and found quite some information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
The answers from Erwin Brandstetter and Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.
The following works for me:
sql <- sprintf('CREATE TABLE
md_updates(ts_key varchar, meta_data hstore);
COPY md_updates FROM STDIN;')
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (compared with the multi-row INSERT I have)?
Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk at all?
I am using PostgreSQL 9.5 on OS X and 9.5 on RHEL respectively.
RPostgreSQL has a "CopyInDataframe" function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
Where table foo has the same columns as dataframe df
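Applied to the question's workflow, a rough sketch might look like the following. It assumes md_updates_df is an R data frame with a ts_key character column and a meta_data column holding hstore-formatted text (e.g. 'a=>1, b=>2'), and that schema holds the target schema name; the COPY replaces the huge multi-row VALUES list, and the UPDATE then runs as in the question.
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user = "...", password = "...",
                 dbname = "...", host = "...")

# per-connection staging table that the data frame is COPYed into
dbSendQuery(con, "CREATE TEMPORARY TABLE md_updates (ts_key varchar, meta_data hstore)")
dbSendQuery(con, "COPY md_updates FROM STDIN")
postgresqlCopyInDataframe(con, md_updates_df)

# update the real table from the staging table, as in the original query
dbSendQuery(con, sprintf(
  "UPDATE %s.meta_data_unlocalized AS t
      SET meta_data = u.meta_data
     FROM md_updates AS u
    WHERE u.ts_key = t.ts_key", schema))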

Can I import PostGIS raster data type into R by using the RPostgreSQL-package?

I have a PostgreSQL / PostGIS table with 30 rows (only 3 were shown in the original post) and 3 columns: rid, rast and filename, where rast is the PostGIS raster data type. It's the EFSA CAPRI data set, by the way, if somebody's familiar with it.
Can I import the raster data type from PostGIS into R with the help of the RPostgreSQL package (see the code below), or do I inevitably have to use the rgdal package as described by @Jot eN?
require(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "")
dbGetQuery(con, "SELECT rid, rast, filename FROM schema.capri")
Importing it without any transformation does not work, and neither does St_AsText(rast), which works for the PostGIS geometry data type.
If this is still relevant: at the University of Florida, David Bucklin and I have released the rpostgis package, which provides bi-directional transfer of vector and raster data between PostGIS and R. The package does not rely on GDAL (or rgdal) and should be platform independent.
Assuming that you already have a functional connection con established through RPostgreSQL, you can import PostGIS raster data type into R using the function pgGetRast, for instance:
library(rpostgis)
my_raster <- pgGetRast(con, c("schema", "raster_table"))
The function assumes that the raster tiles are stored in the column "rast" by default (as is the case for you), but you can change that with the rast argument. Depending on the size of the raster and other considerations, this may be significantly slower (but a lot more flexible) than using rgdal; we are still working on it, but that is the cost of providing a "pure R" solution. You can also use the boundary argument if you are only interested in a subset of the entire raster, which will significantly cut down the loading time.
Note also that there is pgGetGeom for points/lines/polygons, instead of using St_AsText; a minimal example follows.
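A one-line sketch, assuming a vector table "schema"."geom_table" whose geometries are stored in the package's default geometry column:
library(rpostgis)
# reads the table's features into an R spatial object, analogous to pgGetRast above
my_features <- pgGetGeom(con, c("schema", "geom_table"))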
You have an answer on gis.stackexchange page - https://gis.stackexchange.com/a/118401/20955:
library('raster')
library('rgdal')
dsn="PG:dbname='plots' host=localhost user='test' password='test' port=5432 schema='gisdata' table='map' mode=2"
ras <- readGDAL(dsn) # Get your file as SpatialGridDataFrame
ras2 <- raster(ras,1) # Convert the first Band to Raster
plot(ras2)
Additional info could be found here https://rpubs.com/dgolicher/6373

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from HIVE tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive, manually put it into a file, and use it on the Unix file system for ad hoc analysis. However, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that, and what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
connectionString=OdbcConnString,
stringsAsFactors=TRUE,
useFastRead = TRUE,
rowsPerRead=150000)
xdfFile <- "airlineHWS.xdf"
if(file.exists(xdfFile)) file.remove(xdfFile)
Flights<-rxImport(odbcDS, outFile=xdfFile,overwrite=TRUE)
rxGetInfo(data="airlineHWS.xdf", getVarInfo=TRUE,numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, that approach won't cope.
I would suggest keeping everything on disk by using external tables in Hive and then using the backing data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in text-file (csv, tab, etc.) format:
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as a csv file in HDFS. Next, copy the original table to the external table using the insert overwrite command:
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backing data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location; go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an .xdf file from hive_data (using it as the input to rxImport with an RxXdfData file as the output) for more efficient further processing; above all, the data has never touched R's memory.
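A minimal sketch of that last step; the output path is a placeholder, and on HDFS you may also need to pass a fileSystem argument (e.g. RxHdfsFileSystem()) to the data sources:
# hypothetical location for the composite .xdf output file
hive_xdf <- RxXdfData("/your/hdfs/location/hive_data.xdf")

# convert the delimited text data to xdf chunk-wise, without building an R data frame
rxImport(inData = hive_data, outFile = hive_xdf, overwrite = TRUE)

rxGetInfo(hive_xdf, getVarInfo = TRUE, numRows = 5)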

save file in XYZ format as vector (GML or shp)

I am using the QGIS software. I would like to show the value of each raster cell as a label.
My idea (I don't know of any plugin or QGIS functionality which makes it easier) is to export the raster using gdal2xyz.py into a coordinates-value format and then save it as a vector layer (GML or shapefile). For this second task, I tried to use gdal_polygonize.py:
gdal_polygonize.py rainfXYZ.txt rainf.shp
Creating output rainf.shp of format GML.
0...10...20...30...40...50...60...70...80...90...100 - done.
Unfortunately I am unable to load the created file (even if I change the extension to .gml); the ogr2ogr tool doesn't even recognize the format.
Yes, sorry, I forgot to add that information.
In general, after preparing the CSV file (using gdal2xyz.py with the -csv option), I need to add one line at the beginning of it:
"Longitude,Latitude,Value" (without the quotes)
Then I need to create a VRT file which contains:
<OGRVRTDataSource>
  <OGRVRTLayer name="Shapefile_name">
    <SrcDataSource>Shapefile_name.csv</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
Then run the command ogr2ogr -select Value Shapefile_name.shp Shapefile_name.vrt. I got the shapefile (in my case evap_OBC.shp) and two other associated files.
For the sake of archive completeness, this question was also asked on the GDAL mailing list in the thread "save raster as point-vector file". It seems Chaitanya provided a solution for it.

Resources