FOSS4GNA Day 3

04 May 2016

TileReduce: Distributed Spatial Data Processing in JavaScript

Speaker: Morgan Herlocker (Actually Tim Channell @tcql) Room: 301A 10:30 - 11:05 Abstract Slides

Details

written in Node.js - browser use case not yet shown, too much data
distributes analysis across cores: 1 worker per CPU core
splits input into tiles, or sources from mbtiles
tiles processed through map scripts (map reduce-ish)
data sources: exclusively mapbox vector tiles
data made available as vector-tile-js or geojson in mapping function
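
A rough Python sketch of the pattern described above (TileReduce itself is Node.js, and this is not its API). It splits a bounding box into slippy-map tiles, maps a function over them in a process pool, and reduces the results. `count_features` and `run_job` are hypothetical names; the map step here is a placeholder that does not read any real tile data.

```python
import math
from concurrent.futures import ProcessPoolExecutor
from functools import reduce


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) slippy-map tile containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern/southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west, south, east, north, zoom):
    """List every (x, y, zoom) tile touching the bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [(x, y, zoom)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


def count_features(tile):
    """Hypothetical map step: a real job would load the tile's features here."""
    return 1  # placeholder value, one per tile


def run_job(tiles, map_fn, reduce_fn, initial, workers=None):
    """Map over tiles in worker processes, then reduce in the parent."""
    # workers=None lets the pool size itself from the machine's CPU count.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return reduce(reduce_fn, pool.map(map_fn, tiles), initial)
```

Something like `run_job(tiles_for_bbox(-10, 35, 5, 45, 6), count_features, lambda a, b: a + b, 0)` would then run the whole job; on platforms that spawn worker processes, call it under an `if __name__ == "__main__":` guard.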

Uses

changeset analysis on osm
- map vandalism
automated osm analysis
- new feature detection from alternate sources

weirdness

vector tiles made for rendering, not analysis.
- must take that into account when generating
use low zoom levels: many tiles with little data are slower than fewer tiles with more data
vector tiles buffered for rendering, but analysis must not use buffer
not actually mapreduce… but similar-ish
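
A minimal sketch of the buffer point, assuming tile-local integer coordinates where the tile proper spans 0 up to (but not including) the extent, and buffered geometry falls outside that range. The 4096 default and the function names are assumptions for illustration, and only points are handled; lines and polygons would need real clipping.

```python
def inside_tile(point, extent=4096):
    """True if a tile-local (x, y) point lies within the tile, not its buffer."""
    x, y = point
    return 0 <= x < extent and 0 <= y < extent


def drop_buffered_points(points, extent=4096):
    """Keep only points inside the tile so neighbours are not double counted."""
    return [p for p in points if inside_tile(p, extent)]
```

Without a filter like this, a feature sitting in the buffer would be counted once by its own tile and again by the neighbouring tile.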

Questions:

Simplification issues? on by default when creating mbtiles
- Turn it off for analysis
Performance compared to spark
- Haven’t looked at results
- Happy with what they’ve got: 12-20 mins global job
Uneven data, why use rigid tile sizes
- Not wasting enough time processing low data/empty tiles to make variable tile sizes worthwhile

TileReduce

GeoWave: How Space Filling Curves accelerate ingest and query of Geospatial data

Speaker: Eric Robertson - Booz Allen Hamilton Room: 301A 11:15 - 11:50 Abstract Slides

Architecture

Distributed sorted key value store
Data locality preservation through the stack

Problem

Index 2D in 1D index (Or more dimensions!)
- space filling curve: traverses entire n-dimensional space in order
- precision based on bits assigned to each dimension
- looks like a quad-tree but with rigid/tiled bounds
- bounding box snaps outward
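
A small sketch of the idea, using a Z-order (bit-interleaved) curve as the simplest stand-in for whichever curve GeoWave actually uses. Each dimension is quantized to a fixed number of bits, the bits are interleaved into one sortable key, and a query range is snapped outward to cell boundaries. All function names are made up for illustration.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to a cell number using `bits` bits of precision."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)  # clamp so value == hi stays in range


def interleave(x, y, bits):
    """Interleave the bits of two cell numbers into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def z_index(lon, lat, bits=16):
    """1D key for a lon/lat point; more bits means smaller cells."""
    return interleave(quantize(lon, -180, 180, bits),
                      quantize(lat, -90, 90, bits), bits)


def snap_outward(vmin, vmax, lo, hi, bits):
    """Grow a 1D range to the edges of the cells it touches."""
    size = (hi - lo) / (1 << bits)
    first = quantize(vmin, lo, hi, bits)
    last = quantize(vmax, lo, hi, bits)
    return lo + first * size, lo + (last + 1) * size
```

With 2 bits per dimension, a longitude range of 10 to 20 snaps outward to 0 to 90, which is why coarse precision returns extra candidates that have to be filtered later.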

Intervals (polys/multipolys)

tiered indexing
- kinda like tile zoom scheme for different levels?
tiers create duplicates…
- optimal duplicates at 2^d

Unbounded dimensions (time)

Binned to year, each year has a Hilbert curve
Bin ID as part of row ID in k:v store
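
One way this might look, as a sketch only: the year bin is packed in front of the within-year curve index so that rows sort by year first and by curve position second. The 2-byte/8-byte layout and the names are assumptions, not GeoWave's actual row ID format, and the curve index is taken as an input rather than computed.

```python
import struct
from datetime import datetime, timezone


def year_bin(ts):
    """Bin an (aware) timestamp to its UTC year."""
    return ts.astimezone(timezone.utc).year


def make_row_id(ts, curve_index):
    """Big-endian bytes: 2-byte year bin followed by 8-byte curve index."""
    return struct.pack(">HQ", year_bin(ts), curve_index)
```

Because the key is big-endian, plain byte-wise sorting in the key-value store keeps each year's rows contiguous.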

Query processing

Features within distance of point

look up row index of area intersecting the buffered point
filter results by query params before returning to client
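
The two steps can be sketched with an in-memory grid standing in for the row index: first find the cells that intersect the buffered point, then filter the candidates by exact distance and any other query parameters. Planar coordinates, the `CELL` size, and the feature dictionaries are all assumptions for illustration.

```python
import math
from collections import defaultdict

CELL = 1.0  # hypothetical cell size, in the same planar units as the data


def cell_of(x, y):
    """Grid cell containing a point."""
    return (math.floor(x / CELL), math.floor(y / CELL))


def build_index(features):
    """Group features by the cell they fall in (stand-in for a row index)."""
    index = defaultdict(list)
    for f in features:
        index[cell_of(f["x"], f["y"])].append(f)
    return index


def within_distance(index, px, py, radius, predicate=lambda f: True):
    """Coarse lookup by cell, then exact distance and attribute filtering."""
    cx0, cy0 = cell_of(px - radius, py - radius)
    cx1, cy1 = cell_of(px + radius, py + radius)
    hits = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for f in index.get((cx, cy), []):
                close = math.hypot(f["x"] - px, f["y"] - py) <= radius
                if close and predicate(f):
                    hits.append(f)
    return hits
```

The cell lookup over-selects on purpose; the exact filter runs before anything is returned to the client.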

Intent

Reasonable beyond spatial
pluggable back end
self-describing

Ingest

Supports Kafka, local files, distributed files
- Vectors/grid, GPX, Mapnik, etc…
Can generate statistics when ingesting? some…

Questions

Indexing against long dimensions?
- Time to calculate curve index, then lookup fast

3D Tiles: Beyond 2D Tiling

Speaker: Sean Lilley (Cesium), Patrick Cozzi (Cesium) Room: 302A 13:30 - 14:05 Abstract

Perhaps even bigger than Cesium itself?

Cesium focused on defense/aerospace at first

Problems

2D: all level tiles same amount of data
3D: non-uniform data distribution in spatial area

3D Tiles

open spec for streaming heterogeneous datasets
spatial data structure defined in JSON
declarative styling
each tile maintains spatial coherence
- indexed into tree (quadtree, grid, k-d tree)
- non-uniform spatial dist of children

Status

Expectation this Fall
Multiple projects currently in use

Technical implementation

tileset.json (bounds/hierarchy)
- points to external tiles
- can be nested to load smaller tileset.json of area
tile itself in binary format
multiple tile formats for different use cases (and composite tiles)
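
A hedged sketch of walking such a hierarchy. The field names (`root`, `children`, `content`, `url`) and the file names are assumptions based on how the structure was described, not a quote from the spec, which was still a draft at the time. A content URL ending in `.json` is treated here as a nested tileset.

```python
def collect_tile_urls(tile):
    """Depth-first list of every content URL under a tile node."""
    urls = []
    content = tile.get("content")
    if content and "url" in content:
        urls.append(content["url"])
    for child in tile.get("children", []):
        urls.extend(collect_tile_urls(child))
    return urls


def is_external_tileset(url):
    """Assumption: nested tilesets are referenced by a .json URL."""
    return url.endswith(".json")


# Made-up example tileset, for illustration only.
tileset = {
    "root": {
        "content": {"url": "root.b3dm"},
        "children": [
            {"content": {"url": "buildings/tileset.json"}},
            {"children": [{"content": {"url": "points/0.pnts"}}]},
        ],
    }
}

urls = collect_tile_urls(tileset["root"])
nested = [u for u in urls if is_external_tileset(u)]
```

A client would fetch the binary tiles directly and load each nested tileset.json only when its area comes into view.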

Lessons learned deploying a big-data geospatial system on AWS

Speaker: Eric Wenger - Applied Research Associates, Inc. Room: 301A 14:15 - 14:50 Abstract Slides

ARA is hiring

WALDO: semi automated ground level image geolocation

Challenges:

Search area up to 500,000 km²
Fast (<= 2.5 hours)

Seems like instead of architecting the application for the cloud, they simply ported the existing app to the cloud and ran into all the problems caused by the inherent differences in architecture.

Consuming NEXRAD (Doppler Radar) using Containers on Amazon Elastic Beanstalk

Speaker: Mark Korver - Amazon - mkorver@amazon.com Room: 301A 15:00 - 15:35 Abstract

Beanstalk config.json, Docker container, Python Flask app.
Subscribe queue (SQS) to SNS feed for new data
Worker spins up instance with new data added, writes geotiff to another bucket.
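
A sketch of what the worker's first step might look like, assuming the SNS feed delivers S3-style event notifications and that the queue receives them wrapped in the usual SNS JSON envelope (no raw message delivery). The bucket name, object key, and function name are made up, and no AWS calls are made here.

```python
import json


def object_refs_from_sqs_body(body):
    """Return (bucket, key) pairs from an SNS-wrapped S3 event in an SQS body."""
    envelope = json.loads(body)               # outer SNS envelope
    event = json.loads(envelope["Message"])   # inner event, stored as a string
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


# Made-up sample message, for illustration only.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-radar-bucket"},
                "object": {"key": "2016/05/04/EXAMPLE/scan-0001"}}}
    ]
}
sample_body = json.dumps({"Type": "Notification",
                          "Message": json.dumps(sample_event)})

refs = object_refs_from_sqs_body(sample_body)
```

Each (bucket, key) pair would then be downloaded, converted, and written as a geotiff to the output bucket.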

Really just seems like queued data processing of public data…

Geo(Mesa/Wave/Trellis/Jinni): Processing Geospatial Data at Scale @locationtech

Speaker: Rob Emanuele [Azavea] Room: 301A 16:15 - 16:50 Abstract Slides

What kind of large data?

Landsat 8 is 355TB and counting on Amazon
OSM = 75GB compressed

Processing large data at scale

Hadoop
- Large organizational support (Yahoo, Twitter, FB)
- Top level Apache proj in 2008
Spark
- Hadoop: running iterative algorithm over large dataset, io every time between node and master
  - Spark: keep results in memory… didn’t hear, just less data io with memory use
- Distributed computation engine
Accumulo
- Motivated from Google Bigtable

Geojinni (Spatial Hadoop)

Provides spatial language to Hadoop
- Indexes
- Operations

Geomesa

Basically: Geo + Accumulo access through GeoServer
Can also utilize streaming with Kafka/Storm
Integrated with SparkSQL

GeoTrellis

Scala library for geospatial data types
focus on raster
Storage HDFS, Accumulo, S3 (Cassandra in dev)
Temporally stacked tiles

Example

DEM color ramp + hillshade with NLCD geotiff data
- Outputs png to xyz format

Geowave

Basically: Geo + Accumulo access through GeoServer

Differences: see slides

geotrellis/geodocker-cluster

Check it out for running these tools in Docker

GeoMesa, GeoBench, and SFCurve: Measuring and improving BigGeo performance

Speaker: Jim Hughes Room: 301A Abstract

Integrates MapReduce and Spark
Primarily vector data

SFCurve

row-major
- really bad locality
z-order
- interleave the bits
Hilbert
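
To make the locality difference concrete, here is a sketch comparing row-major order with a Hilbert curve on a small grid. The Hilbert mapping is the widely published iterative rotate-and-flip algorithm, not SFCurve's own code. `max_jump` is a made-up metric: the largest grid distance between two cells that are consecutive in index order.

```python
def row_major_index(n, x, y):
    """Row-major order: finish one row, then jump back to the start of the next."""
    return y * n + x


def hilbert_index(n, x, y):
    """Hilbert distance of cell (x, y) on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def max_jump(n, index_fn):
    """Largest Manhattan distance between cells adjacent in index order."""
    cells = sorted((index_fn(n, x, y), x, y)
                   for x in range(n) for y in range(n))
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (_, x0, y0), (_, x1, y1) in zip(cells, cells[1:]))
```

On an 8 x 8 grid the Hilbert curve never moves more than one cell between consecutive indexes, while row-major jumps across the whole row at every row end.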

Query planning

Distributed processing on Tablet servers