Little “Vs” of Big Data

Veracity & variability of location-based big data.

Thu Sep 21 09:31:00 EDT 2017

Big Data has been defined by the “Three Vs,” the three characteristics behind the most challenging situations faced by IT managers and data scientists: velocity (the speed at which data accumulates), volume (the sheer amount of new data) and variety (the numerous data types, both structured and unstructured, such as video). Two more “Vs,” veracity and variability, were later added to make “Five Vs,” and visualization and value then brought the total to “Seven Vs.” My two favorite “Vs” are variability and veracity because both are significant to the analysis of location-based data.

When I worked in geological remote sensing during the first part of my career, we were constantly fighting Big Data, even though we didn’t have a name for the amount of data we were processing. Satellite imagery was inherently “big” compared to the storage capacity of the computers we used. A single Landsat-1 multi-spectral image data file covering approximately 100 square miles was about 50 MB uncompressed. In 1980, that was huge, especially when the typical image-processing computer had only 64 KB of random access memory.

Today, there are more Earth observation satellites (EOS) and newer “smallsats,” the size of dorm room refrigerators, orbiting Earth. Add unmanned aerial vehicles (drones) flying at low altitudes and you quickly see the data deluge. For example, DigitalGlobe estimates that its EOS constellation orbits the Earth 16 times each day, capturing over 3 million square kilometers in the process. That equals over two petabytes yearly.
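As a rough back-of-envelope check on those figures, the sketch below works out what they imply per day and per square kilometer. It assumes the 3 million square kilometers is a daily total, that a petabyte is 10^15 bytes, and that collection is steady over 365 days; none of this is stated by DigitalGlobe, and the function names are illustrative only.

```python
# Back-of-envelope arithmetic on the figures quoted above.
# Assumptions (not from the source): the area figure is per day,
# 1 PB = 10**15 bytes, and collection is uniform across the year.
AREA_PER_DAY_KM2 = 3_000_000
YEARLY_VOLUME_BYTES = 2 * 10**15
DAYS_PER_YEAR = 365


def daily_volume_terabytes() -> float:
    """Average data collected per day, in terabytes (10**12 bytes)."""
    return YEARLY_VOLUME_BYTES / DAYS_PER_YEAR / 10**12


def megabytes_per_km2() -> float:
    """Implied average data density, in megabytes per square kilometer."""
    yearly_area_km2 = AREA_PER_DAY_KM2 * DAYS_PER_YEAR
    return YEARLY_VOLUME_BYTES / yearly_area_km2 / 10**6


if __name__ == "__main__":
    print(f"{daily_volume_terabytes():.1f} TB per day")
    print(f"{megabytes_per_km2():.2f} MB per square kilometer")
```

Under those assumptions the constellation averages roughly 5.5 terabytes a day, or a little under 2 megabytes for every square kilometer imaged.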

Why would you want that much data? It is all about variability. The Earth changes every moment of every day. Temperature, precipitation, man-made intrusions and sunlight angle can all alter measurements captured from soil, vegetation or structures. Farmers adjust water and fertilizer; crop yields are impacted and commodity prices change. Mining operations adjust extraction procedures based on new exploration information. Urban “heat islands” affect energy capacity and utilization, leading utility companies to change their distribution requirements. These changes are happening every second, and technology is now able to keep pace through sensors, both EOS and ground-based. Sensor data is fed to geospatially enabled big data frameworks that process the information and trigger actions for agricultural, mining and other applications.

Then there is veracity: keeping the data accurate. The simplest example is contact information that enters a marketing automation system with false names and inaccurate details. How many times have you seen the names “Mickey Mouse” or “John Smith,” or the address “101 Main Street,” entered into a client database? It is doubtful that “Mickey Mouse” is correct, but what about “John Smith”? Mr. Smith may well be a genuine customer, or the name may have been entered precisely because it is so common. In addition, how many “101 Main Streets” are there in the world? Establishing the truth requires ancillary data, such as additional transactions, location or time, to confirm the record. Misspellings, transposed words and errant text characters all undermine data credibility. If your organization cannot trust its data, its analytical processes will fail.
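A minimal sketch of that kind of check might look like the following. The lookup sets, the `Contact` record and the `veracity_flags` function are all hypothetical, invented here for illustration; a production data quality tool would rely on far richer reference data and fuzzy matching.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical lookup sets, for illustration only.
PLACEHOLDER_NAMES = {"mickey mouse"}
COMMON_NAMES = {"john smith"}
GENERIC_ADDRESSES = {"101 main street"}


@dataclass
class Contact:
    name: str
    address: str
    transactions: int = 0  # ancillary evidence: recorded purchases


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def veracity_flags(contact: Contact) -> List[str]:
    """Return the reasons, if any, to doubt a contact record."""
    flags = []
    name = normalize(contact.name)
    address = normalize(contact.address)
    if name in PLACEHOLDER_NAMES:
        flags.append("placeholder name")
    elif name in COMMON_NAMES and contact.transactions == 0:
        # A common name is plausible, so only flag it when no
        # ancillary data backs it up.
        flags.append("common name, no corroborating transactions")
    if address in GENERIC_ADDRESSES:
        flags.append("generic address")
    return flags


if __name__ == "__main__":
    print(veracity_flags(Contact("Mickey Mouse", "101 Main Street")))
    print(veracity_flags(Contact("John Smith", "42 Elm Road", transactions=3)))
```

The point of the design is that “John Smith” is not rejected outright: the record is questioned only when nothing else, here a transaction history, corroborates it.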

The volume of data is just one problem. Veracity and variability also contribute to poor data hygiene and create the need for data quality and location validation solutions. Pitney Bowes Spectrum for Big Data meets the demands of these other data quirks. Download the Location Intelligence for Big Data ebook to find out more.