2016 IEEE International Conference on Big Data (Big Data)

IBM PAIRS Curated Big Data Service for Accelerated Geospatial Data Analytics and Discovery

Siyuan Lu*, Xiaoyan Shao, Marcus Freitag, Levente J. Klein, Jason Renwick, Fernando J. Marianno, Conrad Albrecht, Hendrik F. Hamann
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
*e-mail: [email protected]

Abstract—IBM's Physical Analytics Integrated Data Repository and Services (PAIRS) is a geospatial Big Data service. PAIRS contains a massive amount of curated geospatial (or more precisely spatio-temporal) data from a large number of public and private data resources, and also supports user contributed data layers. PAIRS offers an easy-to-use platform for both rapid assembly and retrieval of geospatial datasets or performing complex analytics, lowering time-to-discovery significantly by reducing the data curation and management burden. In this paper, we review recent progress with PAIRS and showcase a few exemplary analytical applications which the authors are able to build with relative ease leveraging this technology.

Keywords—big data analytics; GIS; Hadoop & HBase for geospatial data; data management systems; machine learning

I. INTRODUCTION

Recent years have seen explosive growth in the availability of geospatial data from both the public and private sectors, including geoscientific model outputs [1-3], satellite imagery [4-8], aerial survey data [9, 10], and measurement data collected through the IoT (internet of things) [11, 12]. While such growth creates unprecedented opportunities for analytics that derive value by cross-pollinating datasets from multiple disciplines, it also raises enormous challenges for data management [13, 14]. Indeed, to date, unlike in many other areas of IT, binary files such as HDF, GRIB, or GeoTIFF remain the most popular way of storing geospatial data such as satellite images and numerical weather prediction model outputs. While file-based storage provides reasonable storage efficiency and basic read and write capability, it places a heavy burden on end-users, who must assemble data distributed across a large number of files of different formats and harmonize projections from different sources. Moreover, the query and search capability at the binary file level is not sufficient to support analytics needs. A Big Data infrastructure in which data is indexed beyond the file level is needed to provide more powerful query support and rapid discovery capabilities.

To meet these requirements, IBM's Physical Analytics Integrated Data Repository and Services (PAIRS, https://pairs.res.ibm.com) is a big geospatial data and insights service developed to provide an easy-to-use one-stop shop for rapid assembly, retrieval, and analytics of datasets for multi-disciplinary use cases. The technical implementation of PAIRS was introduced previously [15]. Fig. 1 shows a brief overview of PAIRS. From bottom to top of Fig. 1, PAIRS ingests raw data from a large number of governmental, public, and private sources as they become available. Multiple agents constantly check for data availability at numerous FTP sites and web pages. Once an update of a dataset becomes available, the raw data are immediately downloaded, reprojected onto a set of nested global grids, and stored on a Hadoop/HBase system distributed across a core cluster of servers, which enables cost-effective hosting and management of petabytes of data. PAIRS leverages efficient data indexing methods which result in spatially and temporally linked data layers, both for data from 2D grids (e.g. satellite images, weather, soil, land use) and from point locations (e.g. social media data, measurements from distributed sensor networks). As all data layers are aligned on nested global grids, any location on the globe can be queried in an intuitive and consistent manner. Query results are provided to end-users in standard file formats (GeoTIFF, CSV, JSON), so that the complexity of the file formats and map projections of the raw data sources is hidden from them. Table 1 summarizes the key features of PAIRS in contrast to a conventional geographic information system (GIS).

978-1-4673-9005-7/16/$31.00 ©2016 IEEE


Figure 1. Schematic of high-level architecture of Physical Analytics Integrated Data Repository and Services (PAIRS).
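To make the idea of nested global grids more concrete, the following minimal sketch maps a longitude/latitude pair to a cell index at several resolution levels. The grid definition (cell size halving at each level, origin at the south-west corner of the globe) and the function names `cell_index` and `parent` are assumptions made for illustration only; the actual PAIRS grid and key design are described in [15] and are not reproduced here.

```python
import math

def cell_index(lon, lat, level, base_deg=1.0):
    """Return (row, col) of the grid cell containing (lon, lat).

    Hypothetical grid: cells are base_deg / 2**level degrees wide, with the
    origin at (-180, -90). This is an illustrative assumption, not the
    actual PAIRS grid definition.
    """
    if not (-180.0 <= lon < 180.0 and -90.0 <= lat < 90.0):
        raise ValueError("coordinates out of range")
    size = base_deg / (2 ** level)
    col = math.floor((lon + 180.0) / size)
    row = math.floor((lat + 90.0) / size)
    return row, col

def parent(row, col):
    """Cell at the next coarser level that contains the given cell."""
    return row // 2, col // 2

# A point near Yorktown Heights, NY (approximate coordinates).
fine = cell_index(-73.78, 41.27, level=3)
coarse = cell_index(-73.78, 41.27, level=2)
# Nesting: the parent of the fine cell is the coarse cell at the same point.
nested = parent(*fine) == coarse
```

Because every level subdivides the level above it, layers stored at different resolutions can be related by integer arithmetic on cell indices, which is one reason aligned grids make cross-layer queries straightforward.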

Table 1. Salient features of PAIRS in contrast with conventional GIS.

Table 2. Datasets currently available through PAIRS.

Table 2 lists the production and beta-version datasets currently available on PAIRS. Users may also upload custom and proprietary data layers, which in turn can be used along with the existing data layers to support additional analytics. For many use cases, by relieving end-users of the data management burden, PAIRS may significantly accelerate the development of analytics applications. For instance, the value of IoT (sensor) data can often be amplified by the spatio-temporal contextual data from PAIRS. Although sensors gather the most critical data required, they rarely provide all the inputs needed by a predictive model. For example, when a soil moisture sensor network is deployed for precision irrigation, the measurements may be complemented by soil property, terrain, and forecasted weather data (irradiance, humidity, wind, precipitation, etc.) retrieved from PAIRS to model the optimal irrigation prescription.

Other unique features of PAIRS are its scalability and its capability for fast cross-layer data discovery [15]. For example, a query such as "Show me all urban areas where it will be sunny for the next 10 days, where the population density is larger than 500 people per square mile, and where there are at least two coffee shops per square mile" requires filtering and querying multiple layers across different spatial and temporal scales. A user may perform such multi-layer queries on the PAIRS platform without the need to retrieve individual data layers. In this paper, we report a few examples of remote sensing analytics, which were developed in short periods of time with relative ease by leveraging the PAIRS data capability.
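The logic of such a multi-layer query can be sketched as a per-cell conjunction of filters over layers that share one grid. The example below is not the PAIRS query API; the layer names, the helper `multi_layer_query`, and all cell values are hypothetical and chosen only to mirror the coffee-shop query in the text.

```python
def multi_layer_query(layers, predicates):
    """Return (row, col) of cells where every predicate holds.

    layers: dict name -> 2D list; all layers share one grid (same shape).
    predicates: dict name -> function(value) -> bool.
    """
    names = list(predicates)
    first = layers[names[0]]
    hits = []
    for r in range(len(first)):
        for c in range(len(first[0])):
            if all(predicates[n](layers[n][r][c]) for n in names):
                hits.append((r, c))
    return hits

# Hypothetical 2 x 3 grid; every value below is made up for illustration.
layers = {
    "is_urban":           [[1, 1, 0], [1, 0, 1]],
    "sunny_days_of_10":   [[10, 7, 10], [10, 10, 10]],
    "pop_per_sq_mile":    [[800, 900, 50], [450, 700, 1200]],
    "coffee_per_sq_mile": [[3, 4, 0], [2, 5, 2]],
}
predicates = {
    "is_urban": lambda v: v == 1,              # urban areas only
    "sunny_days_of_10": lambda v: v == 10,     # sunny on all 10 forecast days
    "pop_per_sq_mile": lambda v: v > 500,      # population density threshold
    "coffee_per_sq_mile": lambda v: v >= 2,    # at least two coffee shops
}
matches = multi_layer_query(layers, predicates)
```

Since the layers are already aligned on a common grid, the query reduces to cell-wise filtering; no reprojection or file assembly is needed at query time.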

*SMT = self-learning weather modeling technology

II. ANALYTICS AND DATA DISCOVERY USING PAIRS

A. Satellite Image Resolution Enhancement

The spatial resolution and the revisit frequency of satellites usually compete with each other. The usefulness of high-resolution satellite images is often limited by infrequent revisits, which are further complicated by cloud cover. PAIRS provides rapid access to time series of stacks of satellite images that have been reprojected (using bilinear or other interpolation) onto a common grid system. This provides an opportunity for resolution enhancement by applying multiframe super-resolution techniques [16, 17]. An example of super-resolution reconstruction of MODIS satellite images is shown in Fig. 2. Panels (A) and (B) show, respectively, an original 250 m resolution image (band 2, near infrared) near Newburgh, IN and a super-resolution image constructed from nine consecutive MODIS images of the same area using the POCS (projections onto convex sets) method [18, 19]. The resolution enhancement, nominally 3x, is evident when comparing the two images. For example, the boundary of the Ohio River becomes better defined in the super-resolution image, and the small island in the Ohio River becomes distinguishable.

Figure 2. MODIS satellite images of band 2 (NIR) near Newburgh, Indiana: (A) an original 250 m resolution image, and (B) a super-resolution image constructed from 9 consecutive MODIS images using the POCS method.

B. Satellite Remote Sensing for Methane Detection

Satellite images are often acquired in spectral bands which carry signatures of chemical or biological phenomena pertaining to various economic activities. The most well-studied example is the Normalized Difference Vegetation Index (NDVI), which can track crop development on farms [20] and the health status of forests [21]. Combinations of other spectral bands can, for example, track water quality [22], drought [23], and changes in urban settlement due to humanitarian issues [24].
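As a concrete instance of a band combination, the NDVI mentioned in the text is conventionally computed from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). The sketch below applies this formula per pixel; the function names and sample reflectance values are illustrative assumptions, and the mapping from sensor bands to NIR and red is not specified in this paper.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: reflectance in the near-infrared and red bands.
    Returns None when both are zero, where the index is undefined.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total

def ndvi_grid(nir_band, red_band):
    """Apply ndvi() cell by cell to two aligned 2D lists."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]

# Hypothetical reflectances: dense vegetation reflects strongly in the NIR
# and weakly in the red, giving NDVI close to 1; bare surfaces sit near 0.
vegetated = ndvi(0.5, 0.1)
bare = ndvi(0.2, 0.2)
```

Because the index is a normalized ratio, it lies between -1 and 1 for non-negative reflectances, which makes values comparable across acquisition dates when the inputs are on a common grid.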

Landsat 8 Band 6 acquires images in the 1.57-1.65 μm wavelength window, which is sensitive to absorption by certain gases. Specifically, methane has a weak absorption close to 1.65 μm [25]. Moreover, since the wavelength window is relatively wide, multiple other gases including CO2 and NO2 also absorb in this band. Fig. 3 shows a side-by-side comparison of a high-resolution map and a Band 6 image of an area in Kansas retrieved from PAIRS. The area contains multiple livestock farms and natural gas well pads. The comparison indicates that both livestock farms and well pads have distinguishable Band 6 signals with respect to the background. Such sites can thus be identified by querying PAIRS with a filter that selects Landsat Band 6 pixel values above a certain threshold. For enforcing compliance with greenhouse gas emission policies, this technique can be refined and applied to track changes on the surface, identify newly developed sites, or validate whether sites have indeed been shut down as required. Moreover, one may further consider ingesting methane concentrations measured by sensor networks into PAIRS. When correlated with satellite (Landsat and others) and wind data, this may enable building a remote-sensing-based model for assessing methane emission in areas where ground sensors are lacking.

Figure 3. Landsat imagery of Kansas with a combination of livestock farms and the corresponding Landsat 8 Band 6 signatures, indicating high signal above the livestock farms and gas well pads.

C. Crop Acreage Estimate

Spot and futures pricing of commodity crops such as corn, soy, wheat, or cotton is heavily driven by changes in supply forecasts, since these can throw off any existing supply-demand balance. Extreme examples are the effects of droughts or large-scale flooding, but even more moderate adjustments of supply estimates can lead to strong price movements. The US Department of Agriculture (USDA) periodically updates estimates for planted crop acreage and crop yield [26]. These numbers are acted upon by all players in the crop futures markets, such as farmers, wholesalers, distributors, and financial institutions. More accurate or timely estimates of actual crop production can thus assist these players in more profitable short-term trading or long-term investment.

Figure 4. Corn area prediction for Madison County, NE. Black dots show acreage from after-the-fact USDA report. Red dots show prediction at the end of September. The mean absolute error for 2009 to 2015 is 1.5% normalized by total acreage.
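The error metric quoted in the caption of Fig. 4, a mean absolute error normalized by total acreage, can be written out explicitly. The sketch below shows one plausible reading of that metric: the absolute differences between predicted and reported acreage are averaged over the years and divided by a total acreage. The function name and all numbers are hypothetical placeholders (not USDA data or results from this paper), and the exact definition of "total acreage" is an assumption.

```python
def normalized_mae(predicted, actual, total_acreage):
    """Mean absolute error of predictions, as a fraction of total acreage."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(abs_errors) / len(abs_errors) / total_acreage

# Hypothetical acreages (in acres) for three years; not USDA data.
predicted = [101_000, 98_000, 105_000]
actual = [100_000, 100_000, 102_000]
total = 200_000  # hypothetical total acreage used for normalization
error = normalized_mae(predicted, actual, total)  # fraction, e.g. 0.01 = 1%
```

Normalizing by a fixed total rather than by each year's reported value keeps the metric comparable across years in which the planted area of a single crop varies.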


The PAIRS platform contains several datasets (Table 2) that make predictive modeling of crop acreage and yield efficient, including historical crop planting maps (Cropscape) [27], historical multispectral satellite data (e.g. Landsat, MODIS) [28], and weather and seasonal climate forecasts. For instance, for crop acreage estimates, we build a pre-season baseline model that includes long-term trends in planted crop acreage as well as historical crop-rotation patterns derived from historical Cropscape data. During the season, this model is updated using time series of the well-established Normalized Difference Vegetation Index (NDVI) as observed from satellites [29]. For example, corn and soy are differentiated by corn's higher NDVI shortly after planting and soy's higher NDVI late in the growing season. A preliminary study of a few typical counties in the US Corn Belt indicates that the prediction error for corn and soy is ~1.5% (Fig. 4), normalized by total acreage. Studies of yield forecasting that utilize predicted acreage together with weather and climate forecasts are underway. The ease of accessing complementary contextual data sources in one platform (PAIRS) makes such otherwise tedious analytics much more convenient.

D. Drone Imagery

Beyond satellite imagery, in cases where a higher spatial resolution is required, drones offer a viable alternative with the advantage of 3-dimensional observation of objects and imaging from different observation angles. Using a DJI Phantom 3 Standard, we acquired images with a spatial resolution finer than 2 cm. Though the acquisition and processing of high-resolution drone imagery is analogous to that of satellite imagery in many ways, the massive data volume acquired by drones and the processing time required pose a significant challenge. PAIRS provides a data platform for hosting and parallelized image processing of massive amounts of drone data.

For processing, open-source tools such as OpenCV [30] and GDAL [31] are used to stitch and georeference the drone images uploaded to PAIRS. The process initially involves applying image correction algorithms to remove angular or perspective distortions, extracting key and unique image features, and then matching the features between adjacent images.

Figure 5. Stitched orthomosaic of IBM T. J. Watson Research Center site in Yorktown Heights, NY from more than 100 individual drone snapshot images.

Once adequate matches are identified, the images are combined to produce a stitched mosaic. The stitched mosaic is then georeferenced by matching the image to known ground control points stored on PAIRS. Such drone images hosted on PAIRS (Fig. 5) can subsequently be overlaid with other data layers (elevation, soil property, satellite, weather, etc.) for various analytics applications, including change detection, construction monitoring, pest management for crops, and irrigation scheduling.
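The georeferencing step can be illustrated with a minimal sketch that fits an affine transform from pixel coordinates to map coordinates using exactly three ground control points (GCPs). This is a simplified stand-in, not the OpenCV/GDAL workflow used by the authors: the function names, the GCP coordinates, and the 2 cm pixel size are hypothetical, and in practice more control points with a least-squares fit would typically be used.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3 x 3 linear system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    out = []
    for k in range(3):
        m = [row[:] for row in a]
        for i in range(3):
            m[i][k] = b[i]  # replace column k with the right-hand side
        out.append(det3(m) / d)
    return out

def fit_affine(gcps):
    """gcps: three ((col, row), (x, y)) pairs -> affine coefficients.

    Model: x = a*col + b*row + c and y = d*col + e*row + f.
    """
    a = [[col, row, 1.0] for (col, row), _ in gcps]
    xs = solve3(a, [xy[0] for _, xy in gcps])
    ys = solve3(a, [xy[1] for _, xy in gcps])
    return xs, ys

def pixel_to_map(coeffs, col, row):
    """Map a pixel position to map coordinates with the fitted transform."""
    xs, ys = coeffs
    return (xs[0] * col + xs[1] * row + xs[2],
            ys[0] * col + ys[1] * row + ys[2])

# Hypothetical GCPs: north-up mosaic, 0.02 m pixels, made-up map coordinates.
gcps = [((0, 0), (500000.0, 4570000.0)),
        ((1000, 0), (500020.0, 4570000.0)),
        ((0, 1000), (500000.0, 4569980.0))]
coeffs = fit_affine(gcps)
center = pixel_to_map(coeffs, 500, 500)
```

Once such a transform is known, every mosaic pixel has a map coordinate, which is what allows the drone layer to be overlaid with other gridded layers.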

III. CONCLUSION

IBM's Physical Analytics Integrated Data Repository and Services (PAIRS) is a geospatial Big Data service. PAIRS curates and pre-processes a massive amount of geospatial data from a large number of public and private data sources and also supports integration of user-contributed data layers. PAIRS offers an easy-to-use platform for both rapid assembly and retrieval of geospatial datasets and for performing complex analytics, lowering time-to-discovery significantly by reducing the data management burden. An introduction to PAIRS and a tutorial demonstration of PAIRS can be found on YouTube [32, 33].

REFERENCES

[1] K. J. W. McCaffrey, R. R. Jones, and R. E. Holdsworth, "Unlocking the spatial dimension: digital technologies and the future of geoscience fieldwork," J. Geol. Soc. London, vol. 162, pp. 927-938, Nov. 2005.
[2] T. Toutin, "ASTER DEMs for geomatic and geoscientific applications: a review," Int. J. Remote Sens., vol. 29, pp. 1855-1875, 2008.
[3] O. Conrad, B. Bechtel, and M. Bock, "System for Automated Geoscientific Analyses (SAGA) v. 2.1.4," Geoscientific Model Development, vol. 8, pp. 1991-2007, 2015.
[4] F. D. van der Meer, H. M. A. van der Werff, and F. J. A. van Ruitenbeek, "Multi- and hyperspectral geologic remote sensing: A review," Int. J. Appl. Earth Obs., vol. 14, pp. 112-128, Feb. 2012.
[5] P. Coppin, I. Jonckheere, and K. Nackaerts, "Digital change detection methods in ecosystem monitoring: a review," Int. J. Remote Sens., vol. 25, pp. 1565-1596, May 2004.
[6] M. S. Moran, Y. Inoue, and E. M. Barnes, "Opportunities and limitations for image-based remote sensing in precision crop management," Remote Sens. Environ., vol. 61, pp. 319-346, Sep. 1997.
[7] H. Nagendra, "Using remote sensing to assess biodiversity," Int. J. Remote Sens., vol. 22, pp. 2377-2400, Aug. 2001.
[8] M. Hussain, D. Chen, and A. Cheng, "Change detection from remotely sensed images: From pixel-based to object-based approaches," ISPRS J. Photogramm., vol. 80, pp. 91-106, Jun. 2013.
[9] A. C. Watts, V. G. Ambrosia, and E. A. Hinkley, "Unmanned Aircraft Systems in Remote Sensing and Scientific Research: Classification and Considerations of Use," Remote Sens., vol. 4, pp. 1671-1692, Jun. 2012.
[10] S. Harwin and A. Lucieer, "Assessing the Accuracy of Georeferenced Point Clouds Produced via Multi-View Stereopsis from Unmanned Aerial Vehicle (UAV) Imagery," Remote Sens., vol. 4, pp. 1573-1599, Jun. 2012.
[11] S. K. Datta, C. Bonnet, and N. Nikaein, "An IoT Gateway Centric Architecture to Provide Novel M2M Services," 2014 IEEE World Forum on Internet of Things (WF-IoT), pp. 514-519, 2014.
[12] W.-T. Sung and M.-H. Tsai, "Data fusion of multi-sensor for IOT precise measurement based on improved PSO algorithms," Comput. Math. Appl., vol. 64, pp. 1450-1461, Sep. 2012.
[13] S. Li, S. Dragicevic, F. A. Castro, M. Sester, S. Winter, A. Coltekin, et al., "Geospatial big data handling theory and methods: A review and research challenges," ISPRS J. Photogramm., vol. 115, pp. 119-133, May 2016.
[14] C. Shahabi, F. Banaei-Kashani, A. Khoshgozaran, L. Nocera, and S. Xing, "GeoDec: A Framework to Visualize and Query Geospatial Data for Decision-Making," IEEE Multimedia, vol. 17, pp. 14-22, Jul.-Sep. 2010.
[15] L. J. Klein, F. J. Marianno, C. M. Albrecht, M. Freitag, S. Lu, N. Hinds, et al., "PAIRS: A scalable geo-spatial data analytics platform," 2015 IEEE Conference on Big Data, pp. 1290-1298, 2015.
[16] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Advances and challenges in super-resolution," Int. J. Imag. Syst. Tech., vol. 14, pp. 47-57, 2004.
[17] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: a technical overview," IEEE Signal Proc. Mag., vol. 20, pp. 21-36, 2003.
[18] A. E. Galbraith, J. Theiler, K. J. Thome, and R. W. Ziolkowski, "Resolution enhancement of multilook imagery for the multispectral thermal imager," IEEE T. Geosci. Remote, vol. 43, pp. 1964-1977, 2005.
[19] H. Stark and P. Oskoui, "High-resolution image recovery from image-plane arrays, using convex projections," J. Opt. Soc. Am. A, vol. 6, pp. 1715-1726, 1989.
[20] B. D. Wardlow and S. L. Egbert, "Large-area crop mapping using time-series MODIS 250 m NDVI data: An assessment for the US Central Great Plains," Remote Sens. Environ., vol. 112, pp. 1096-1116, 2008.
[21] M. Hansen, R. Dubayah, and R. DeFries, "Classification trees: an alternative to traditional land cover classifiers," Int. J. Remote Sens., vol. 17, pp. 1075-1081, 1996.
[22] R. G. Lathrop, "Landsat Thematic Mapper monitoring of turbid inland water quality," Photogramm. Eng. Rem. S., p. 58, 1992.
[23] A. Martha and W. Kustas, "Thermal remote sensing of drought and evapotranspiration," EOS T. Am. Geophys. Un., vol. 89, pp. 233-234, 2008.
[24] E. Prins, "Use of low cost Landsat ETM+ to spot burnt villages in Darfur, Sudan," Int. J. Remote Sens., vol. 29, pp. 1207-1214, 2008.
[25] R. C. Nelson, E. K. Plyler, and W. S. Benedict, "Absorption spectra of methane in the near infrared," J. Res. Nat. Bur. Stand., vol. 41, p. 615, 1948.
[26] "United States Department of Agriculture (USDA) Farm Service Agency."
[27] "USDA National Agricultural Statistics Service Cropland Data Layer."
[28] "Data available from the U.S. Geological Survey. NASA Land Processes Distributed Active Archive Center (LP DAAC) Products."
[29] F. Kriegler, W. Malila, R. Nalepka, and W. Richardson, "Preprocessing transformations and their effects on multispectral recognition," in Remote Sensing of Environment, VI, 1969, p. 97.
[30] (2016). Welcome to openCV documentation. Available: http://docs.opencv.org/2.4/index.html
[31] (2016). GDAL - Geospatial Data Abstraction Library (Version 1.11.2 ed.). Available: http://gdal.osgeo.org
[32] PAIRS demos. Available: https://www.youtube.com/watch?v=MlPhTKE189s
[33] PAIRS Introduction. Available: https://www.youtube.com/watch?v=Nxwi6x0ObT0