FOSS4GNA Day 3
TileReduce: Distributed Spatial Data Processing in JavaScript
Speaker: Morgan Herlocker (Actually Tim Channell @tcql) Room: 301A 10:30 - 11:05 Abstract Slides
Details
- written in nodejs - browser use case not yet shown, too much data
- distributes analysis across cores: 1 worker per cpu cores
- splits input into tiles, or sources from mbtiles
- tiles processed through map scripts (map reduce-ish)
- data sources: exclusively mapbox vector tiles
- data made availabe as vector-tile-js or geojson in mapping function
Uses
- changeset analysis on osm
- map vandalism
- automated osm analysis
- new feature detection from alternate sources
weirdness
- vector tiles made for rendering, not analysis.
- must take that into account when generating
- use low zoom levels: more tiles with less data slower than less tiles with more data
- vector tiles buffered for rendering, but analysis must not use buffer
- not actually mapreduce… but similar-ish
Questions:
- Simplification issues? on by defaul creating mbtiles
- Turn it off for analysis
- Performance compared to spark
- Haven’t looked at results
- Happy with what they’ve got: 12-20 mins global job
- Uneven data, why use rigid tile sizes
- Not wasting enough time processing low data/empty tiles to make it worthwhile
GeoWave: How Space Filling Curves accelerate ingest and query of Geospatial data
Speaker: Eric Robertson - Booz Allen Hamilton Room: 301A 11:15 - 11:50 Abstract Slides
Architecture
- Distributed sorted key value store
- Data locality preservation through the stack
Problem
- Index 2D in 1D index (Or more dimensions!)
- space filling curve: traverses entire n-dimensional space in order
- precision based on bits assigned to each dimension
- looks like a quad-tree but with rigid/tiled bounds
- bounding box snaps outward
Intervals (polys/multipolys)
- tiered indexing
- kinda like tile zoom scheme for different levels?
- tiers create duplicates…
- optimal duplicates at 2^d
Unbounded dimensions (time)
- Binned to year, each year has a Hilbert curve
- Bin ID as part of row ID in k:v store
Query processing
Features within distace of point
- look up row index of area intersecting the buffered point
- filter results by query params before returning to client
Intent
- Reasonable beyond spatial
- pluggable back end
- self-describing
Ingest
- Supports Karka, local files, distributed files
- Vectors/grid, GPX, Mapnik, etc…
- Can generate statistics when ingesting? some…
Questions
- Indexing against long dimenions?
- Time to calculate curve index, then lookup fast
3D Tiles: Beyond 2D Tiling
Speaker: Sean Lilley (Cesium), Patrick Cozzi (Cesium) Room: 302A 13:30 - 14:05 Abstract
Perhaps even bigger than Cesium itself?
- Cesium focused on defense/aerospace at first
Problems
- 2D: all level tiles same amount of data
- 3D: non-uniform data distribution in spatial area
3D Tiles
- open spec for streaming heterogeous datasets
- spatial data structure defined in JSON
- declarative styling
- each tile maintains spatial coherence
- indexed into tree (quadtree, grid, k-d tree)
- non-uniform spatial dist of children
Status
- Expectation this Fall
- Multiple projects currently in use
Technical implmentation
- tileset.json (bounds/hierarchy)
- points to external tiles
- can be nested to load smaller tileset.json of area
- tile itself in binary format
- multiple tile formats for different use cases (and composite tiles)
Lessons learned deploying a big-data geospatial system on AWS
Speaker: Eric Wenger - Applied Research Associates, Inc. Room: 301A 14:15 - 14:50 Abstract Slides
ARA is hiring
WALDO: semi automated ground level image geolocation
Challenges:
- Search area up to 500,000km2
- Fast (<= 2.5 hours)
Seems like instead of architecturing the application for the cloud, they simply ported the existing app to the cloud and encountered all the problems due to inherent differences in architecture.
Consuming NEXRAD (Doppler Radar) using Containers on Amazon Elastic Beanstalk
Speaker: Mark Korver - Amazon - mkorver@amazon.com Room: 301A 15:00 - 15:35 Abstract
- Beanstalk config.json, Docker conatiner, python Flask app.
- Subscribe queue (SQS) to SNS feed for new data
- Worker spins up instance with new data added, writes geotiff to another bucket.
Really just seems like queued data processing of public data…
Geo(Mesa/Wave/Trellis/Jinni): Processing Geospatial Data at Scale @locationtech
Speaker: Rob Emanuele [Azavea] Room: 301A 16:15 - 16:50 Abstract Slides
What kind of large data?
- Landsat 8 is 355TB and counting on Amazon
- OSM = 75GB compressed
Processing large data at scale
- Hadoop
- Large organizational support (Yahoo, Twitter, FB)
- Top level Apache proj in 2008
- Spark
- Hadoop: running iterative algorithm over large dataset, io every time between node and master
- Spark: keep results in memory… didn’t hear, just less data io with memory use
- Distributed computation engine
- Hadoop: running iterative algorithm over large dataset, io every time between node and master
- Accumulo
- Motivated from Google Bigtable
Geojinni (Spatial Hadoop)
- Provides spatial language to Hadoop
- Indexs
- Operations
Geomesa
- Basically: Geo + Accumulo access through GeoServer
- Can also utilize streaming with Kafka/Storm
- Integrated with SparkSQL
GeoTrellis
- Scala library to geospatial data types
- focus on raster
- Storage HDFS, Accumulo, S3 (Cassandra in dev)
- Temporally stacked tiles
Example
- DEM coloramp + hillshade with NLCD geotiff data
- Outputs png to xyz format
Geowave
- Basically: Geo + Accumulo access through GeoServer
Differences: see slides
geotrellis/geodocker-cluster
Check it out for running these tools in Docker
GeoMesa, GeoBench, and SFCurve: Measuring and improving BigGeo performance
Speaker: Jim Hughes Room: 301A Abstract
- Integrates MapReduce and Spark
- Primarily vector data
SFCurve
- row-major
- really bad locality
- z-order
- interweave the bits
- Hilbert
Query planning
- Distributed processing on Tablet servers