Extracting vector graphics (lines and points) with pdfclown

I want to extract vector graphics (lines and points) out of a PDF with pdfclown. I have tried to wrap my head around the graphics sample, but I cannot figure out how the object model works for this. Can anyone please explain the relationships?

You are right: up to the PDF Clown 0.1 series, high-level path modelling was not implemented (it would have been derived from ContentScanner.GraphicsWrapper).
The next release (0.2 series, due next month) will support the high-level representation of all graphics contents, including path objects (PathElement), through the new ContentModeller. Here is an example:
import java.awt.geom.GeneralPath;
import java.util.List;
import org.pdfclown.documents.contents.elements.ContentModeller;
import org.pdfclown.documents.contents.elements.GraphicsElement;
import org.pdfclown.documents.contents.elements.PathElement;
import org.pdfclown.documents.contents.objects.Path;
// ContentMarker also needs an import; its package is not shown here.
for(GraphicsElement<?> element : ContentModeller.model(page, Path.class))
{
  PathElement pathElement = (PathElement)element;
  List<ContentMarker> markers = pathElement.getMarkers();
  GeneralPath path = pathElement.getPath();
}
In the meantime, you can extract the low-level representation of the vector graphics by iterating over the content stream with ContentScanner, as suggested in ContentScanningSample (available in the downloadable distribution), and looking for path-related operations (BeginSubpath, DrawLine, DrawRectangle, DrawCurve, ...).


How to handle every (polygon) item in a shapefile as a single geometry?

I import the packages:
import rasterio
from rasterio.mask import mask
import geopandas as gpd
open a shapefile:
gdf = gpd.read_file(shpfilepath+clipshape)
and open a raster file:
img = rasterio.open(f'{rstfilepath}raw_immutable/SuperView/{SV_filename}{ext}')
Then I loop over the polygons:
for poly_gon in gdf.geometry:
    out_image, out_transform = mask(img, poly_gon, crop=True)
but this fails:
TypeError: 'Polygon' object is not iterable
I cannot find how to handle every polygon in the shapefile (5 in my case) to be the polygon to clip the raster image.
How about nesting your results? First create an empty object such as a dict, then fill it like this:
empt_dict1 = {}
for i in range(len(gdf.geometry)):
    empt_dict1[i] = dict()
    # mask() expects an iterable of shapes, so the single polygon is wrapped in a list
    empt_dict1[i][0], empt_dict1[i][1] = mask(img, [gdf.geometry.iloc[i]], crop=True)
Your expected clips are in each sub-object of the empt_dict1 dict.
I don't have a working gdf right now, so I'm not sure whether plain gdf.geometry[i] would also work; .iloc selects by position regardless of the index labels.
Old answer
If I understand correctly, you want to use the whole area of all the polygons at the same time. How about merging them into a single one using a temporary layer, like below? PS: I tried to use your names given that you don't provide any data.
gdf["dummy"] = 0  # same value in every row, so dissolve merges all polygons
gdf_tempo = gdf.dissolve(by="dummy")
out_image, out_transform = mask(img, gdf_tempo.geometry, crop=True)
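For the per-polygon case in the question, the TypeError arises because rasterio's mask iterates over its shapes argument, so a single Polygon has to be wrapped in a list. A minimal sketch of that loop; clip_each and its mask_fn parameter are hypothetical names introduced here so the loop can be tried with a stand-in function instead of a real raster:

```python
def clip_each(src, geometries, mask_fn=None):
    """Clip an open raster once per geometry; returns a list of (image, transform)."""
    if mask_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in mask_fn.
        from rasterio.mask import mask as mask_fn
    clips = []
    for geom in geometries:
        # mask() expects an iterable of shapes, so wrap the single geometry.
        out_image, out_transform = mask_fn(src, [geom], crop=True)
        clips.append((out_image, out_transform))
    return clips
```

With the names from the question, this would be called as clip_each(img, gdf.geometry), giving one clip per polygon.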

Exporting embeddings per epoch in Keras

I am trying to get access to the output of the embedding layer (the n-dimensional vectors) in Keras on a per-epoch basis. There doesn't seem to be a specific callback for this. I've tried the TensorBoard callback, since it provides an option for logging the embeddings on each epoch, but when I find the log files, I can't read them. They are probably files that can be accessed only by TensorBoard for visualization purposes. I need the embedding vectors to be saved in a format I can use later on outside Keras, like a TSV file. Is there a way I could do this?
Thanks a lot!
OK, so I figured out how to do this, with much needed help from Nazmul Hasan on how to format the name to be updated with each epoch. Essentially I created a custom callback:
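import io
# Assumes keras is already imported, e.g. "from tensorflow import keras".
encoder = info.features['text'].encoder  # from the dataset info; not used below
class CustomCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Weight matrix of the embedding layer, assumed to be the first layer.
        weights = self.model.layers[0].get_weights()[0]
        # One file per epoch, one tab-separated vector per line.
        with io.open('vecs_{}.tsv'.format(epoch), 'w', encoding='utf-8') as out_v:
            for vec in weights[1:]:  # skip 0, it's padding.
                out_v.write('\t'.join([str(x) for x in vec]) + "\n")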
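The file-writing step of the callback can be tried on its own, without a model. A minimal sketch, assuming the weights arrive as a sequence of rows (for instance the array returned by get_weights()[0]); write_vectors_tsv and its parameters are hypothetical names, not part of Keras:

```python
import io

def write_vectors_tsv(vectors, epoch, pattern='vecs_{}.tsv', skip_padding=True):
    """Write one tab-separated embedding vector per line; return the file name."""
    path = pattern.format(epoch)
    # Row 0 is assumed to be the padding vector, as in the callback above.
    rows = vectors[1:] if skip_padding else vectors
    with io.open(path, 'w', encoding='utf-8') as out_v:
        for vec in rows:
            out_v.write('\t'.join(str(x) for x in vec) + '\n')
    return path
```

Inside on_epoch_end this could be called as write_vectors_tsv(self.model.layers[0].get_weights()[0], epoch), which keeps the per-epoch file naming in one place.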

Geopandas to_file gives blank prj file

I am trying to use GeoPandas for a (only slightly) more complex project, but at the moment I'm failing to write out a simple shapefile with a single point in it in a projected manner.
The following code results in a shapefile that looks generally good - but the .prj is empty:
import pandas as pd
from geopandas import GeoDataFrame
from shapely.geometry import Point
geometry = [Point(xy) for xy in zip(df.x, df.y)]
crs = {'init': 'epsg:4326'}
geo_df = GeoDataFrame(df, crs=crs, geometry=geometry)
The CSV has 2 rows and 2 columns (a header row, then lon and lat in the second row).
Am I missing something obvious? I've hunted through Stack Overflow, the GeoPandas docs, etc. All seem to imply to_file() should work just fine.
In the long run, the goal is to create a few functions for my students to use in a lab - one that draws a line along a lat or lon the width / height of the US, another that clips the line to polygons (the states), so that the students can figure out the widest spot in each state as a gentle introduction to working with spatial data. I'm trying to avoid arcpy as it's Python 2, and I thought (and think) I was doing the right thing by teaching them the ways of Python 3. I'd like them to be able to debug their methodologies by being able to open the line in Arc though, hence this test.
So, after playing with this, I've determined that under the current version of Anaconda the problem is with crs = {'init': 'epsg:4326'} on Windows machines. This works fine on Macs, but has not worked on any of my or my students' Windows systems. Changing this line to make use of the proj4 string crs = {'proj': 'latlong', 'ellps': 'WGS84', 'datum': 'WGS84', 'no_defs': True} instead works just fine. More of a workaround than an actual solution, but, it seems to consistently work.
I always use the from_epsg function from the fiona library.
>>> from fiona.crs import from_epsg
>>> from_epsg(4326)
{'init': 'epsg:4326', 'no_defs': True}
I've never had any problems using it. Keep in mind that some local projections are missing, but that shouldn't be a problem in your case.
Another user and I had a similar issue using fiona, and the issue for me was the GDAL_DATA environment variable not being set correctly. To reiterate my answer there: for reference, I'm using Anaconda, the Spyder IDE, Fiona 1.8.4, Python 3.6.8, and GDAL 2.3.3.
While Anaconda usually sets the GDAL_DATA variable upon entering the virtual environment, using another IDE like Spyder will not preserve it, which causes issues where fiona (and I assume GeoPandas) can't export the CRS correctly.
You can test this fix by printing out an EPSG to WKT transformation before and after setting the GDAL_DATA variable explicitly.
Without setting GDAL_DATA:
import os
print('GDAL_DATA' in os.environ)
from osgeo import osr
srs = osr.SpatialReference() # Declare a new SpatialReference
srs.ImportFromEPSG(3413) # Import the EPSG code into the new object srs
print(srs.ExportToWkt()) # Print the result before transformation to ESRI WKT (prints nothing)
Results in:
With setting GDAL_DATA:
import os
os.environ['GDAL_DATA'] = 'D:\\ProgramData\\Anaconda3\\envs\\cfm\\Library\\share\\gdal'
print('GDAL_DATA' in os.environ)
from osgeo import osr
srs = osr.SpatialReference() # Declare a new SpatialReference
srs.ImportFromEPSG(3413) # Import the EPSG code into the new object srs
print(srs.ExportToWkt()) # Print the result before transformation to ESRI WKT (now prints the WKT)
Results in:
PROJCS["WGS 84 / NSIDC Sea Ice Polar Stereographic North",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Polar_Stereographic"],PARAMETER["latitude_of_origin",70],PARAMETER["central_meridian",-45],PARAMETER["scale_factor",1],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["X",EAST],AXIS["Y",NORTH],AUTHORITY["EPSG","3413"]]
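Before reaching for osr, a quick look at the environment can show whether GDAL_DATA is the culprit. A minimal sketch using only the standard library; gdal_data_status is a hypothetical helper name, and it only checks that the variable is set and points at an existing directory, not that the directory holds valid GDAL support files:

```python
import os

def gdal_data_status(environ=None):
    """Return 'unset', 'missing-dir' or 'ok' for the GDAL_DATA variable."""
    environ = os.environ if environ is None else environ
    path = environ.get('GDAL_DATA')
    if not path:
        return 'unset'
    if not os.path.isdir(path):
        return 'missing-dir'
    return 'ok'

print(gdal_data_status())
```

Running this both from the Anaconda prompt and from inside the IDE should make it visible when the IDE has dropped the variable.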

Checkpointing DataFrames in SparkR

I am looping over a number of CSV data files using R/Spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the data gets longer and longer, as Spark remembers all the previous datasets and filter()-s.
Is there a way to do checkpointing in the SparkR API? I have learned that Spark 2.1 has checkpointing for DataFrames, but this does not seem to be available from R.
We had the same issue with Scala/GraphX on a quite large graph (a few billion data points) while searching for connected components.
I'm not sure what is available in R for your specific version, but a usual workaround is to break the lineage by "saving" the data then reloading it. In our case, we break the lineage every 15 iterations:
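import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph
// saveGraph and loadGraph are our own helpers (not shown): they write the graph to disk and read it back.
def refreshGraph[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED], checkpointDir: String, iterationCount: Int, numPartitions: Int): Graph[VD, ED] = {
  val path = checkpointDir + "/iter-" + iterationCount
  saveGraph(g, path)
  loadGraph(path, numPartitions)
}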
An incomplete solution/workaround is to collect() your DataFrame into an R object and later re-parallelize it with createDataFrame(). This works well for small data, but for larger datasets it becomes too slow and complains about tasks that are too large.
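The save-and-reload pattern does not depend on the API used. A generic sketch in Python, where combine and refresh are hypothetical callables supplied by the caller: combine merges the retained rows of one file into the running result, and refresh stands for whatever breaks the lineage (for example writing the data to disk and reading it back, or collect() followed by createDataFrame()):

```python
def accumulate_with_refresh(batches, combine, refresh, every=15):
    """Fold batches into one result, calling refresh after every `every` batches."""
    result = None
    for count, batch in enumerate(batches, start=1):
        result = batch if result is None else combine(result, batch)
        if count % every == 0:
            # Swap in a freshly materialised copy so that the history of
            # earlier transformations is dropped.
            result = refresh(result, count)
    return result
```

The interval of 15 mirrors the figure mentioned above; a suitable value depends on how expensive the refresh step is compared with the growing lineage.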

How to Import SQLite data (gathered by an Android device) into either Octave or MatLab?

I have some data gathered by an Android phone and it is stored in SQLite format in an SQLite file. I would like to play around with this data (analysing it) using either MatLab or Octave. The SQLite data is stored as a file.
I was wondering what commands you would use to import this data into MatLab, that is, to put it into a vector or matrix. Do I need any special toolboxes or packages, like the Database Package, to access the SQL format?
There is the mksqlite tool.
I've used it personally and had some issues getting the correct version for my version of Matlab, but after that, no problems. You can even query the database file directly to reduce the amount of data you import into Matlab.
Although mksqlite looks nice it is not available for Octave, and may not be suitable as a long-term solution. Exporting the tables to CSV-files is an option, but the importing (into Octave) can be quite slow for larger data sets because of the string-parsing involved.
As an alternative, I ended up writing a small Python script to convert my SQLite table into a MAT file, which is fast to load into either Matlab or Octave. MAT files are platform-neutral binary files, and the method works both for columns with numbers and strings.
import sqlite3
import scipy.io
conn = sqlite3.connect('my_data.db')
csr = conn.cursor()
res = csr.execute('SELECT * FROM MY_TABLE')
db_parms = list(map(lambda x: x[0], res.description))
# Remove those variables in db_parms you do not want to export
X = {}
for prm in db_parms:
    csr.execute('SELECT "%s" FROM MY_TABLE' % (prm))
    v = csr.fetchall()
    # v is now a list of 1-tuples
    X[prm] = list(*zip(*v))
scipy.io.savemat('my_data.mat', X)
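The per-column queries can also be collapsed into a single SELECT whose rows are transposed in Python. A minimal sketch using only the standard library sqlite3 module; table_to_columns is a hypothetical helper, the in-memory table is made-up sample data, and the table name is interpolated into the SQL, so it should only be used with trusted names:

```python
import sqlite3

def table_to_columns(conn, table):
    """Return {column_name: [values, ...]} for every column of `table`."""
    cur = conn.execute('SELECT * FROM "%s"' % table)
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # zip(*rows) transposes rows into columns; an empty table gives empty lists.
    columns = list(zip(*rows)) if rows else [()] * len(names)
    return {name: list(col) for name, col in zip(names, columns)}

# Small in-memory table standing in for the real database file.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE MY_TABLE (t REAL, label TEXT)')
conn.executemany('INSERT INTO MY_TABLE VALUES (?, ?)', [(0.5, 'a'), (1.5, 'b')])
print(table_to_columns(conn, 'MY_TABLE'))
```

The returned dict has the same shape as X in the script above, so it could be handed to scipy.io.savemat in the same way.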
