Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RFC 86: Column-oriented read API for vector layers #5830

Merged
merged 11 commits into from Jun 14, 2022

Conversation

rouault
Copy link
Member

@rouault rouault commented Jun 1, 2022

This RFC describes the addition of new methods to the OGRLayer class to retrieve batches of features with a column-oriented memory layout, that suits formats that have that organization or downstream consumers that expect data to be presented
in such a way, in particular the Apache Arrow, Pandas / GeoPandas ecosystem, R spatial packages, and many modern (data analytics focused) databases / engines which are column oriented (eg Snowflake, Google BigQuery, ..)

(Semi-)rendered version of the RFC at: https://github.com/rouault/gdal/blob/rfc_86/doc/source/development/rfc/rfc86_column_oriented_api.rst

@rouault rouault marked this pull request as draft June 1, 2022 08:44
@rouault rouault added this to the 3.6.0 milestone Jun 1, 2022
@pitrou
Copy link

pitrou commented Jun 1, 2022

@jorisvandenbossche pinged me on this. I can just say that usage of the C stream interface looks basically sane (I'm not competent on geo topics otherwise :-)).

Co-authored-by: Alan D. Snow <alansnow21@gmail.com>
@pramsey
Copy link
Contributor

pramsey commented Jun 1, 2022

What are the other (not Parquet, not Arrow) drivers that might produce this? I guess the API seems so specific to Arrow, and if I want to consume just Arrow, why would I come to a format abstraction library?

@jorisvandenbossche
Copy link
Contributor

jorisvandenbossche commented Jun 2, 2022

AFAIU all drivers can produce this, they only might not have a specialized implementation (so falling back to the generic implementation using the existing GetNextFeature underneath).
In the current POC of Even, he also included a specialized implementation for FlatGeoBuf (besides Parquet/Arrow, see Impacted Drivers section in the RFC), which gives a 2-3x speedup compared to the generic implementation of this API for the included benchmark. And in this benchmark, even this generic implementation of the proposed API gives a speed-up compared to the current fiona/pyogrio bindings to get this data in Python memory.

You are correct that if you just want to consume an Arrow file, it might still be faster / easier to use an arrow library to read that file. But so personally for me (and for geopandas), this proposal is mostly exciting for all other formats (geopandas will probably keep a specialized read_parquet that does not go through GDAL but uses pyarrow directly):

  • If you need the data in a columnar layout (as we do for reading the data into a geopandas GeoDataFrame, but the same is true for R sf, columnar databases, ..), it is both more efficient and more convenient that the GetNextFeature iteration happens in GDAL than in the bindings, even if the file format you are reading is row-oriented.
  • For some file formats (even row-oriented ones, like FlatGeoBuf in the POC), GDAL can include a specialized driver-specific implementation of the proposed API, providing an additional speed-up.

@rouault
Copy link
Member Author

rouault commented Jun 2, 2022

Not much to add to what @jorisvandenbossche said : the main purpose is not that much to have something convenient for column oriented formats, but more to have some convenient for column oriented consumers. The default implementation of GetArrowStream() will work with any OGR driver, using their row-oriented GetNextFeature() implementation.

There are a couple other drivers that could potentially benefit from a column-oriented consumption API (but definitely not in the scope of implementation I foresee for the RFC)

  • the FITS driver for its vector/tabular side: https://gdal.org/drivers/raster/fits.html#binary-table-support . I remember that at the time I implemented the classic GetNextFeature() API performance suffered on large tables due to having to seek at many different locations in the file to retrieve the information of a row. CC @grizonnetm if he's still interested in that topic
  • the HANA driver maybe ? But I don't know if the API to consume the database are optimized for column oriented retrieval. CC @mrylov

@paleolimbot
Copy link

Another benefit that does not depend on the driver (or whether or not the underlying data start as columnar or rowwise) is that the struct ArrowArrayStream is understood by a growing number of libraries (Arrow R and Python bindings are an obvious example but DuckDB can also scan a struct ArrowArrayStream as a queryable table). For example, in R or Python one could skip implementing any OGR type -> R/Python type conversions and just use pyarrow or R's arrow bindings (I tried this in R and it worked like magic! https://github.com/paleolimbot/rfc86 ).

rouault added a commit to rouault/gdal that referenced this pull request Jun 4, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
rouault added a commit to rouault/gdal that referenced this pull request Jun 4, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 6, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 6, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
@mrylov
Copy link
Contributor

mrylov commented Jun 7, 2022

@rouault, unfortunately there is no optimized API in HANA for column oriented retrieval.

@rouault
Copy link
Member Author

rouault commented Jun 7, 2022

Updated to reflect the fact that the implementation has now a GeoPackage specialized implementation, which is extremely fast !

// Do something useful
array.release(&array);
}
stream.release(&stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this method require the explicit object pointer argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The release callbacks may need to use the private_data member of the ArrowArray and ArrowArrayStream structures (those are C structures, not C++ objects)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok. Do you think it's worth having a gdal specific wrapper around those structures so that we can provide a nicer public api which hides this kind of detail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth having a gdal specific wrapper around those structures so that we can provide a nicer public api which hides this kind of detail?

The Arrow C stream/array interfaces defines a standard, and I'd want to avoid defining something GDAL specific on top of it. Anything that would make it more friendly to consume it would apply to any software that would produce Arrow C streams/arrays, so that doesn't really belong to GDAL specificially. @paleolimbot raised a discussion about that in https://lists.apache.org/thread/6vt7btpbrkm554otxqz7oxzdlpfwkhwc . He has prototyped
https://github.com/paleolimbot/gpkg/tree/master/src/arrow-hpp for the writing side. Not sure if there's an equivalent on the reading side (except using the arrow-cpp library itself to ingest Arrow C arrays and expose corresponding C++ objects, but that's an heavy dependency)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's general agreement from the Arrow end of things that there should be a library to make what you are doing here easier (both writing and reading). What you are doing here has been very helpful to push that forward in the discussions that we're having about this...I'm hoping to convert the mailing list discussion into a PR in the next few days although it will be some time before it's ready to use. It will basically be the arrow-hpp thing that Even linked to but using C idioms.

I separated out some C++ helpers here: https://github.com/paleolimbot/geoarrow-cpp , which could eventually help on the Geo-specific reading side (on pause at the moment while the lower-level Arrow library gets sorted).

All of these are some months away from being useful here, and reading Even's implementations has been incredibly helpful both in writing code and in demonstrating that people actually want to do this kind of thing.

by default other formats (particularly the Arrow driver that can return geometries
encoded according to the GeoArrow specification (using a list of coordinates).
The GEOMETRY_ENCODING=WKB option can be passed to force the use of WKB (through
the default implementation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm not super-clear on here -- if the requested geometry format matches the underlying dataset's geometry format, does that mean there'll be absolutely no parsing/re-interpretation of the geometry on GDAL's side?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that mean there'll be absolutely no parsing/re-interpretation of the geometry on GDAL's side?

yes, this is typically the case for GeoParquet that uses WKB as the storage format. And also for GeoPackage where only the GeoPackage geometry header is stripped from the SQlite blob to get a WKB geometry. No OGRGeometry object is instanciated then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should precise that when the default implementation of OGRLayer is used, as GetNextFeature() is used underneath and returns a OGRGeometry, if the underlying format would happen to use WKB for storage, you'd have: WKB --> OGRGeometry (in GetNextFeature() -> WKB (in OGRLayer::GetNextArrowArray()). That would for example be the case for the PostGIS driver currently (with EWKB as the source encoding)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I realise this is off-topic, sorry!)

yes, this is typically the case for GeoParquet that uses WKB as the storage format. And also for GeoPackage where only the GeoPackage geometry header is stripped from the SQlite blob to get a WKB geometry.

What's your thoughts regarding having a similar short-cut approach for the non-column oriented apis? (and for the scenario you've described above) I.e. being able to read features from a gpkg where the geometry is left as just the wkb binary content, and no interpretation is made on gdal's side. In the gpkg case this could be a good optimisation opportunity for some clients, in that they could request the wkb content directly if they are going to be constructing their own geometry representation of the features (ie gpkg -> wkb -> client geometry classes ) and want to avoid the extra representation (gpkg -> wkb -> ogr -> client geometry )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, but I'm not sure how to implement this in a generic option. I guess this could be an open option implemented in the drivers where that would make sense that would turn the geometry column as a regular Binary field. I'm not sure how to inform client of which column(s) has the geometry. maybe some new field in the OGRGeomFieldDefn class.

@nyalldawson
Copy link
Collaborator

Another question -- it's not mentioned anywhere how this would interact with any filters applied to a layer, eg via OGR_L_SetAttributeFilter/OGR_L_SetSpatialFilter. I'm assuming that OGR level filters will be incompatible with the columnar api?

@rouault
Copy link
Member Author

rouault commented Jun 8, 2022

Another question -- it's not mentioned anywhere how this would interact with any filters applied to a layer, eg via OGR_L_SetAttributeFilter/OGR_L_SetSpatialFilter. I'm assuming that OGR level filters will be incompatible with the columnar api?

" The method may take into account ignored fields set with SetIgnoredFields() (the
default implementation does), and should take into account filters set with
SetSpatialFilter() and SetAttributeFilter(). Note however that specialized implementations
may fallback to the default (slower) implementation when filters are set."

``ARROW:extension:name`` set to ``WKB``. Specialized implementations may output
by default other formats (particularly the Arrow driver that can return geometries
encoded according to the GeoArrow specification (using a list of coordinates).
The GEOMETRY_ENCODING=WKB option can be passed to force the use of WKB (through
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it in theory (later) also be possible to enable that you can specify GEOMETRY_ENCODING=GEOARROW to force the use of this encoding, even if the file itself is not using it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes one could imagine such an extension

@nyalldawson
Copy link
Collaborator

@rouault

The method may take into account ignored fields set with SetIgnoredFields() (the
default implementation does), and should take into account filters set with
SetSpatialFilter() and SetAttributeFilter(). Note however that specialized implementations
may fallback to the default (slower) implementation when filters are set.

Gosh, how did I miss that?! 🤦

a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 10, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 10, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
@rouault rouault marked this pull request as ready for review June 14, 2022 13:17
@rouault rouault merged commit 3cd73b1 into OSGeo:master Jun 14, 2022
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 15, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 15, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 15, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
a0x8o pushed a commit to a0x8o/gdal that referenced this pull request Jun 15, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
g8sqh pushed a commit to g8sqh/gdal that referenced this pull request Aug 20, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 15.2 s down to 8.1 s
g8sqh pushed a commit to g8sqh/gdal that referenced this pull request Aug 20, 2022
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg
file mentioned in RFC 86 (OSGeo#5830) runs
from 8.1 s (previous commit) to 7.5 s
g8sqh pushed a commit to g8sqh/gdal that referenced this pull request Aug 20, 2022
@rouault
Copy link
Member Author

rouault commented Oct 20, 2022

Note: the oddities with the bench_ogr_batch.cpp benchmarker were GeoPackage was faster than GeoParquet were just an implementation issue fixed per #6558 . The corrected situation is that GeoParquet has the lead and perf of GeoPackage and FlatGeobuf is pretty close, which is expected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants