Basil Veerman A Personal Page
About Archive Feed

FOSS4GNA Day 3

TileReduce: Distributed Spatial Data Processing in JavaScript

Speaker: Morgan Herlocker (Actually Tim Channell @tcql) Room: 301A 10:30 - 11:05 Abstract Slides

Details

  • written in nodejs - browser use case not yet shown, too much data
  • distributes analysis across cores: 1 worker per cpu cores
  • splits input into tiles, or sources from mbtiles
  • tiles processed through map scripts (map reduce-ish)
  • data sources: exclusively mapbox vector tiles
  • data made availabe as vector-tile-js or geojson in mapping function

Uses

  • changeset analysis on osm
    • map vandalism
  • automated osm analysis
    • new feature detection from alternate sources

weirdness

  • vector tiles made for rendering, not analysis.
    • must take that into account when generating
  • use low zoom levels: more tiles with less data slower than less tiles with more data
  • vector tiles buffered for rendering, but analysis must not use buffer
  • not actually mapreduce… but similar-ish

Questions:

  • Simplification issues? on by defaul creating mbtiles
    • Turn it off for analysis
  • Performance compared to spark
    • Haven’t looked at results
    • Happy with what they’ve got: 12-20 mins global job
  • Uneven data, why use rigid tile sizes
    • Not wasting enough time processing low data/empty tiles to make it worthwhile

TileReduce

GeoWave: How Space Filling Curves accelerate ingest and query of Geospatial data

Speaker: Eric Robertson - Booz Allen Hamilton Room: 301A 11:15 - 11:50 Abstract Slides

Architecture

  • Distributed sorted key value store
  • Data locality preservation through the stack

Problem

  • Index 2D in 1D index (Or more dimensions!)
    • space filling curve: traverses entire n-dimensional space in order
    • precision based on bits assigned to each dimension
    • looks like a quad-tree but with rigid/tiled bounds
    • bounding box snaps outward

Intervals (polys/multipolys)

  • tiered indexing
    • kinda like tile zoom scheme for different levels?
  • tiers create duplicates…
    • optimal duplicates at 2^d

Unbounded dimensions (time)

  • Binned to year, each year has a Hilbert curve
  • Bin ID as part of row ID in k:v store

Query processing

Features within distace of point

  1. look up row index of area intersecting the buffered point
  2. filter results by query params before returning to client

Intent

  • Reasonable beyond spatial
  • pluggable back end
  • self-describing

Ingest

  • Supports Karka, local files, distributed files
    • Vectors/grid, GPX, Mapnik, etc…
  • Can generate statistics when ingesting? some…

Questions

  • Indexing against long dimenions?
    • Time to calculate curve index, then lookup fast

3D Tiles: Beyond 2D Tiling

Speaker: Sean Lilley (Cesium), Patrick Cozzi (Cesium) Room: 302A 13:30 - 14:05 Abstract

Perhaps even bigger than Cesium itself?

  • Cesium focused on defense/aerospace at first

Problems

  • 2D: all level tiles same amount of data
  • 3D: non-uniform data distribution in spatial area

3D Tiles

  • open spec for streaming heterogeous datasets
  • spatial data structure defined in JSON
  • declarative styling
  • each tile maintains spatial coherence
    • indexed into tree (quadtree, grid, k-d tree)
    • non-uniform spatial dist of children

Status

  • Expectation this Fall
  • Multiple projects currently in use

Technical implmentation

  • tileset.json (bounds/hierarchy)
    • points to external tiles
    • can be nested to load smaller tileset.json of area
  • tile itself in binary format
  • multiple tile formats for different use cases (and composite tiles)

Lessons learned deploying a big-data geospatial system on AWS

Speaker: Eric Wenger - Applied Research Associates, Inc. Room: 301A 14:15 - 14:50 Abstract Slides

ARA is hiring

WALDO: semi automated ground level image geolocation

Challenges:

  • Search area up to 500,000km2
  • Fast (<= 2.5 hours)

Seems like instead of architecturing the application for the cloud, they simply ported the existing app to the cloud and encountered all the problems due to inherent differences in architecture.

Consuming NEXRAD (Doppler Radar) using Containers on Amazon Elastic Beanstalk

Speaker: Mark Korver - Amazon - mkorver@amazon.com Room: 301A 15:00 - 15:35 Abstract

  1. Beanstalk config.json, Docker conatiner, python Flask app.
  2. Subscribe queue (SQS) to SNS feed for new data
  3. Worker spins up instance with new data added, writes geotiff to another bucket.

Really just seems like queued data processing of public data…

Geo(Mesa/Wave/Trellis/Jinni): Processing Geospatial Data at Scale @locationtech

Speaker: Rob Emanuele [Azavea] Room: 301A 16:15 - 16:50 Abstract Slides

What kind of large data?

  • Landsat 8 is 355TB and counting on Amazon
  • OSM = 75GB compressed

Processing large data at scale

  • Hadoop
    • Large organizational support (Yahoo, Twitter, FB)
    • Top level Apache proj in 2008
  • Spark
    • Hadoop: running iterative algorithm over large dataset, io every time between node and master
      • Spark: keep results in memory… didn’t hear, just less data io with memory use
    • Distributed computation engine
  • Accumulo
    • Motivated from Google Bigtable

Geojinni (Spatial Hadoop)

  • Provides spatial language to Hadoop
    • Indexs
    • Operations

Geomesa

  • Basically: Geo + Accumulo access through GeoServer
  • Can also utilize streaming with Kafka/Storm
  • Integrated with SparkSQL

GeoTrellis

  • Scala library to geospatial data types
  • focus on raster
  • Storage HDFS, Accumulo, S3 (Cassandra in dev)
  • Temporally stacked tiles

Example

  • DEM coloramp + hillshade with NLCD geotiff data
    • Outputs png to xyz format

Geowave

  • Basically: Geo + Accumulo access through GeoServer

Differences: see slides

geotrellis/geodocker-cluster

Check it out for running these tools in Docker

GeoMesa, GeoBench, and SFCurve: Measuring and improving BigGeo performance

Speaker: Jim Hughes Room: 301A Abstract

  • Integrates MapReduce and Spark
  • Primarily vector data

SFCurve

  • row-major
    • really bad locality
  • z-order
    • interweave the bits
  • Hilbert

Query planning

  • Distributed processing on Tablet servers