Two intersecting trends are reshaping the insurance industry today: the explosion of data available for analysis beyond a company's internal sources, and the accelerating adoption of data science to complement the traditional analytic disciplines of actuarial science and underwriting.
Data science uses a different set of tools, such as predictive analytics, machine learning algorithms and neural nets, to extract insight from data — and in most cases, the more data available, the better the results. Insurance companies are looking to data science to address a wide range of use cases, from detecting fraud and predicting claims to improving risk assessment and refining pricing.
However, the value that data science can bring to insurance companies and the speed at which they can make data-driven innovations depends in no small part on the quality of the external data.
The 80/20 rule in data science
There’s a reason that many data science platforms include built-in data preparation tools, and that standalone toolsets also abound. It’s the same reason conventional wisdom has it that data scientists spend 80 percent of their time on data preparation and only 20 percent on analysis.
The reason? All too often, the data is dirty. It's incorrect or in the wrong format for analysis or modeling, so it needs cleansing, validation and reformatting before it can be used.
Consider, for example, an address: a key data point for P&C insurance companies in particular, which rely on location-based data. An address can have up to six fields that need to be parsed. While tools like Python can parse them, they won't tell you whether the contents are correct. In our experience, most raw datasets that include addresses, internal data included, require cleansing prior to use.
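To make the parse-versus-validate distinction concrete, here is a minimal Python sketch. The regex and the simplified four-field layout (street, city, state, ZIP) are assumptions for illustration; real-world addresses are far messier, and production systems use dedicated parsing and address-verification services rather than a hand-rolled pattern.

```python
import re

# Assumed simplified layout: "street, city, ST 12345[-6789]".
ADDRESS_RE = re.compile(
    r"^(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5})(?:-\d{4})?$"
)

def parse_address(raw):
    """Split a one-line address into named fields, or return None.

    Parsing succeeds or fails on *shape* alone; it cannot tell a real
    street from an invented one. That check is validation, a separate
    (and much harder) step requiring reference data.
    """
    m = ADDRESS_RE.match(raw.strip())
    return m.groupdict() if m else None

# Parses cleanly, but nothing here verifies the street actually exists:
fields = parse_address("123 Imaginary Blvd, Springfield, IL 62701")
```

The example makes the article's point directly: `fields` comes back well-formed even though the street is fictitious, which is why parsed data still needs validation against authoritative reference data before use.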
And that’s just six fields out of hundreds or thousands that could go into a risk analysis or pricing model — all of which will need to be evaluated, cleansed, formatted, organized, annotated, stored and made available for use through the data curation process. It’s easy to see why this can consume up to 80 percent of a data scientist’s time.
Addressing the problem with already curated data and Location Master Data Management
Whether they acquire "free" third-party data (open source data from, for example, a government agency like a census bureau) or purchase it from a commercial data provider, insurance companies are making major investments in data. That investment may be the purchase price of commercial data, the high internal cost of dealing with stale, inaccurate and nonstandard "free" data, or both.
Pitney Bowes helps companies optimize data investments with insurance industry-specific datasets that have already been curated by our data experts and organized according to the principles of Location Master Data Management. This curation is consistent across all the location-based datasets that an insurer would use to assess and predict risk, enhance pricing efficiency and understand the marketplace.
This ready-to-use data is provided in flat tables complete with an accurate address and hyper-precise geocoding. That means it can be readily analyzed using a wide range of tools, from spreadsheets to machine learning algorithms. With the geocoding element, the data can also be visualized as discrete data layers in a mapping application.
Location Master Data Management adds a new element: a unique, unchanging ID, the pbKey™, for each address in a dataset. This enables Pitney Bowes to match and link data to each persistent ID, without ambiguity, across multiple datasets. Altogether, more than 9,000 data attributes are available.
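A short sketch shows why a persistent ID makes linking trivial. The IDs and attribute tables below are invented for illustration (they are not real pbKey values or Pitney Bowes data structures); the point is that records sharing the same stable key can be joined directly, with no fuzzy address matching.

```python
# Two hypothetical datasets keyed on the same invented persistent IDs.
property_attrs = {
    "P0000001": {"address": "123 Main St, Anytown, OH 44101", "year_built": 1987},
    "P0000002": {"address": "456 Oak Ave, Anytown, OH 44101", "year_built": 2004},
}
flood_risk = {
    "P0000001": {"flood_zone": "AE"},
    "P0000002": {"flood_zone": "X"},
}

def link_on_id(*tables):
    """Merge records that share the same persistent ID across tables."""
    linked = {}
    for table in tables:
        for key, attrs in table.items():
            linked.setdefault(key, {}).update(attrs)
    return linked

records = link_on_id(property_attrs, flood_risk)
```

Because both tables carry the same key, each merged record accumulates attributes from every source unambiguously; without a shared persistent ID, the join would have to fall back on matching free-text addresses, which is exactly the error-prone cleansing work described above.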
Learn more about how curated data and Location Master Data Management are helping insurance company data scientists understand exposures, reduce risk and increase profitability. Read Mastering Location Data: Close, But Not Quite There from Harvard Business Review Analytic Services.