Mastering Data Variety
This post was written in conjunction with Nik Bates-Haus; Nabil Hachem, Ph.D.; Matt Holzapfel; and Mark Marinelli of Tamr, Inc.
Data variety — the middle child of the three Vs of Big Data — is in big trouble.
It’s in the critical path of enterprise data becoming an asset. And it’s been slow to benefit from the kind of technology advancements experienced by its “easier” siblings, data volume and data velocity.
Meanwhile, most enterprises have unconsciously built up extreme data variety over the last 50+ years. So extreme that people at large enterprises may not even realize how much variety they actually have until they are elbows-deep in a data-integration project or high-profile strategic-analytics initiative.
A big part of the problem is structured data — essentially tables of data.
“Structured data” may sound like it’s organized. However, from the broadest perspective and how people in enterprises want to use it, it isn’t. Successful enterprises have mother lodes of diverse, high-value structured data capable of providing operational savings and business-transforming analytics. But the data is stored in multiple, heterogeneous relational databases created and evolved by people over time to serve specific business needs, processes and applications. This extreme variety guarantees radical heterogeneity, data duplication and errors, all of which have to be continuously resolved and curated before enterprises can turn all their data into a true asset. (Note: unstructured data, such as text, has a set of related and complementary problems — but that’s a post for another day.)
My Tamr co-founder Mike Stonebraker calls data variety “the 800-pound gorilla in the corner”. This is apt: because of the complexity and unpredictability of variety in tabular data, enterprises have often been reluctant to poke it too hard. Many just continue to treat the problem with traditional data-integration methods like MDM, data warehouses and ETL, doing the same things but somehow expecting a different outcome. Without getting into a detailed discussion of the limitations of these methods (my trusted colleague Matt Holzapfel does a nice job of that elsewhere), the fundamental challenge is that traditional approaches do not efficiently combine the power of the machine (models and rules) with input and effort from humans.
It *is* possible to solve the data variety problem. But it requires first fully understanding the problem, then embracing a new approach. This approach must artfully combine models, rules and human expertise to curate data broadly across the enterprise while addressing the problem of tabular data variety.
Understanding Data Variety
Data variety is a complicated problem because it’s fundamentally driven by the nature of humans, how they’re organized in large enterprises, and (frankly) the baggage that often comes with organizational and personnel changes over time.
People create structured databases to solve specific business problems and/or automate business processes in their business units. They’re not thinking about the broader picture across the entire organization or company — in fact, it’s contrary to their immediate goals. Moreover, once the database is “done,” there’s usually no provision or budget for cleanup, changes or updates. The data structures in these legacy enterprise systems set like concrete and are difficult or impossible to change. The differences in these static, solidified operational databases become obvious when you try to bring data together: differences in coding of business units across a large corporation, differences in how customers are classified and organized across business units, and so on.
The result? Tens, hundreds or even thousands of idiosyncratic enterprise data sources that are difficult to change, each reflecting a person’s (or group’s) goals, skills, knowledge, unique lens on their problem, proprietary ownership of data systems, habits, and so on.
To truly understand the magnitude of this problem, let’s walk through the core, atomic components of a structured, relational database. Each of these components can be varied in ways that make the resolution, consistency and integrity of data across the enterprise difficult or impossible. They are:
- Tables
- Columns/Attributes
- Rows/Records
- Data Types
- Relationships
- Languages
- Mistakes/Nulls
Tables: Structured data is organized into tables. Each table has columns and rows of data associated with it, as in a spreadsheet. All that tabular data is arranged to support an application and make it perform well. People creating and using that application now have “views” of the entities and related attributes (fields) that matter to them, like customers or transactions. All or some of this data could have business value across the enterprise, but the idiosyncrasy of semantics in each database makes the data difficult or impossible for people outside its original application or context to consume. Now, multiply this scenario by 10 or 100 or, in some large enterprises, 1,000 application-specific databases.
Columns/Attributes: In a typical customer database, for example, columns might include name, company, address, phone number and so on. A retail bank might have five different customer databases, each with a column for “customer name.” But one system may use “full name”, another may have “first name” and “last name”, and yet another might just have “FN”. This variety makes it difficult to answer the question “What data do we have about our customers?”, a vital first step in being able to answer analytical questions such as “How many customers do we have?”
Columns package data characteristics (attributes) in a way that’s highly idiosyncratic to that application. In creating their applications, people may code column/attribute values to do data validation or comply with some mandated external standard. Database professionals often manipulate the way that columns of data are organized in tables to optimize performance for their applications. This can make the columns look less intuitive than our simple customer data examples above. This manipulation also significantly increases the variety of the columns in database tables. All of this creates ever-increasing variety in the columns of data across 10s, 100s and 1000s of tables, and with it a lack of transparency across the enterprise that impedes the ability of people (or even machines) to understand the data.
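To make the column problem concrete, here is a minimal sketch of mapping source-specific column names onto shared attribute names. The source system names, the extra columns and the synonym table are all hypothetical, invented for illustration from the bank example above; treating “FN” as a customer name is itself an assumption, since the abbreviation is ambiguous. A real mapping would be far larger and would need a model or a person to resolve the unknowns.

```python
# Hypothetical synonym table: source column name -> shared attribute name.
# Mapping "fn" to customer_name is an assumption; it could also mean "first name".
COLUMN_SYNONYMS = {
    "customer name": "customer_name",
    "full name": "customer_name",
    "fn": "customer_name",
    "first name": "first_name",
    "last name": "last_name",
}


def normalize_columns(columns):
    """Map each source column to a shared attribute, or None if unknown."""
    return {col: COLUMN_SYNONYMS.get(col.strip().lower()) for col in columns}


# Hypothetical source systems and their column headers.
sources = {
    "retail_crm": ["Full Name", "Phone"],
    "mortgage_db": ["First Name", "Last Name"],
    "cards_db": ["FN", "Acct_Cd"],
}

mapped = {name: normalize_columns(cols) for name, cols in sources.items()}

# Columns nobody has mapped yet: these are the ones that need human attention.
unmapped = sorted(
    col for m in mapped.values() for col, target in m.items() if target is None
)
print(unmapped)
```

Even in this toy case, a hand-written lookup table only covers the names someone already thought of, which is why the approach stops scaling as the number of sources grows.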
Rows/Records: Rows/records provide a multi-attribute view (essentially the identity) of an entity such as a customer, part or transaction, again highly idiosyncratic to that application. Everything about a customer is represented in a “row” or “record” in a database. However, across the enterprise there’s no standard definition of what constitutes a “customer”.
Therefore, in our bank example, the same customer might have multiple customer IDs across various data sources, each with a slightly different representation of their name and address. Which one is correct? Furthermore, the meaning of the underlying data changes. For example, if one of your customers gets bought by another of your customers, how do you “merge” those two customers into one? Which ID is appropriate to use? Now multiply that question across the 10s, 100s and 1000s of tables of customer data and you begin to see how quickly this problem explodes. Resolving these types of problems often leads you to a fundamental challenge in computer science referred to as the “N Squared problem”: comparing every record against every other record takes a number of comparisons that grows with the square of the number of records.
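A quick sketch of why the “N Squared problem” bites: checking every record against every other record takes n(n-1)/2 comparisons. The records below are made up for illustration.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Three made-up customer records from three hypothetical systems.
records = [
    {"id": "A-1", "name": "Acme Corp"},
    {"id": "B-7", "name": "ACME Corporation"},
    {"id": "C-3", "name": "Globex"},
]

# Every pair has to be considered if we compare naively.
pairs = list(combinations(records, 2))
print(len(pairs), pair_count(len(records)))

# The count grows quadratically with the number of records.
for n in (1_000, 1_000_000):
    print(n, pair_count(n))
```

At a million records the naive approach already needs roughly half a trillion comparisons, so any practical system has to avoid comparing most pairs at all.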
Data Types: Each of the columns in a table contains a single type of data. The attributes across a row contain many different types of data such as dates, variable-length character strings, and numbers. Beyond the rampant customization mentioned above under Columns/Attributes, people sometimes misuse these data types, creating even more variety. For example: When a manufacturing company acquires another manufacturer, the two companies may have different data types associated with their part numbers (alphanumeric, e.g., QRAX-7321A, versus numeric, e.g., 19820). If the company shifts to all alphanumeric part numbers, with the format “XXX-1234X”, they may soon find random dummy values being introduced to old part numbers to adhere to the new data type (e.g., “19820” becomes “YXY-1982X”), confusing naming conventions and hurting searchability.
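The part-number example can be sketched with a simple format check. This assumes “XXX-1234X” means three uppercase letters, a hyphen, four digits and one uppercase letter; the function and category names are hypothetical.

```python
import re

# Assumed reading of the "XXX-1234X" format: 3 letters, hyphen, 4 digits, 1 letter.
NEW_FORMAT = re.compile(r"[A-Z]{3}-\d{4}[A-Z]")


def classify(part_number):
    """Label a part number by the shape of its value, not by its meaning."""
    if NEW_FORMAT.fullmatch(part_number):
        return "new_format"
    if part_number.isdigit():
        return "legacy_numeric"
    return "other"


for value in ("19820", "YXY-1982X", "QRAX-7321A"):
    print(value, classify(value))
```

Note that the padded value “YXY-1982X” passes the format check even though its letters carry no meaning. Conforming to a data type says nothing about whether the value is still searchable or traceable to the original part.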
Relationships: There are physical and logical relationships between tables. The physical relationship usually manifests itself as a primary key in one table and foreign keys that reference it in a second table. There are also logical relationships: a column of customer names in one table can refer to the same real-world customers as a name column in a different table. Logical entities have a point of reference in the real world: people, locations, customers, products.
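The difference between the two kinds of relationship can be sketched with Python’s built-in sqlite3 module. The tables, columns and rows are hypothetical. The first query follows a declared foreign key; the second has no key to follow and falls back on comparing names, which is an assumption about what “the same customer” means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Physical relationship: accounts.customer_id is a foreign key to customers.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE accounts ("
    "account_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), "
    "balance REAL)"
)
# Logical relationship only: leads has names but no key into customers.
conn.execute("CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, contact_name TEXT)")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Jane Doe"), (2, "Raj Patel")])
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [(10, 1, 250.0), (11, 2, 90.0)])
conn.executemany("INSERT INTO leads VALUES (?, ?)", [(100, "jane doe"), (101, "Ana Silva")])

# Follow the declared key.
physical = conn.execute(
    "SELECT c.name, a.account_id FROM customers c "
    "JOIN accounts a ON a.customer_id = c.customer_id ORDER BY a.account_id"
).fetchall()

# No key exists, so guess the link from case-insensitive names.
logical = conn.execute(
    "SELECT c.customer_id, l.lead_id FROM customers c "
    "JOIN leads l ON lower(c.name) = lower(l.contact_name)"
).fetchall()

print(physical)
print(logical)
```

The key-based join is exact by construction. The name-based join is only as good as the matching rule, and an exact string comparison like this one misses nicknames, typos and reordered names.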
It’s complex to optimize physical and logical relationships across many different databases to support the required features/functions of the applications AND the needs of users AND database performance. The result? Too often a tangled web of relationships that are difficult or impossible for normal people — even normal engineers — to untangle. The extreme variety of the relationships and conflicting interests of those relationships is often what makes large-scale data a challenging software engineering and computer-science problem.
For example: a large bank may have 5 to 10 different customer IDs tied to tables. One division may want to use its customer ID as the standard, another wants to use its own customer ID and so on. This shouldn’t matter but it does, technically. Trying to do analytics across sources requires resolving not only the technical issues associated with these competing IDs, but also the human and political issues associated with various groups giving up control of their IDs to others.
Resolving relationships — both physical and logical across large numbers of tables/databases — is one of the most interesting and difficult problems in database systems. But it’s also one of the most valuable for large enterprises when solved.
Languages: Large enterprises are fundamentally global. Their data sources are global and contain data in many languages. To manage a global business, it’s essential to manage the variety of language in data. Regardless of the mix of languages in sources or the mix of preferred languages of data consumers, enterprises must ensure all data consumers are able to (1) trust the integrity of the data irrespective of language and (2) easily adjust language in their data to meet their business needs.
For example: At a high level, the variety of languages used to describe parts data in different countries could prevent people from easily understanding if a part record stored in one table means the same part as a record stored in another table from a different system in a different country. Resolving this can be powerful economically if you are trying to optimize the purchase or inventory of these parts on a global basis at large scale.
Another example of language variety: Films and television shows often have different titles in different countries — a language-plus-semantic difference that usually requires translation. Think back to the language-translation problems of the web as recently as 10 years ago; today we just use Google Translate. (We at Tamr are working to build the equivalent of Google Translate for structured data.)
Mistakes/Nulls: Both are inevitable (remember: all roads lead back to people). And the problem goes beyond outright mistakes and nulls. For example: An application may require a salesperson to enter a phone number before closing a sale. He doesn’t know the number, so he enters 123-456-7890. With a few keystrokes, he’s made the data worse (and possibly in perpetuity). A good default rule: all data is dirty and must be systematically verified before consumption. It’s time to accept this reality and put in place proactive mechanisms to monitor, validate and fix data. Just because data was entered into a table (by a machine or a human) does not mean it has veracity.
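As a small illustration of a “proactive mechanism”, here is a sketch of a check that flags missing or placeholder phone numbers. It assumes 10-digit numbers, and the list of placeholder values is a hypothetical starting point, not an exhaustive rule set.

```python
import re

# Hypothetical list of values people type just to get past a required field.
PLACEHOLDER_DIGITS = {"1234567890", "0000000000", "9999999999"}


def phone_issue(raw):
    """Return a label describing what is wrong with a phone value, or None."""
    if raw is None or not raw.strip():
        return "missing"
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) != 10:  # assumption: 10-digit numbers
        return "wrong_length"
    if digits in PLACEHOLDER_DIGITS or len(set(digits)) == 1:
        return "placeholder"
    return None


for value in ("123-456-7890", None, "555-0142", "617-555-0142"):
    print(repr(value), phone_issue(value))
```

A check like this catches the obvious dummy values, but a well-formed number can still be wrong, which is why validation has to be ongoing rather than a one-time gate at data entry.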
To summarize: there are 10s, 100s, 1000s or sometimes even 10s of 1000s of these structured-data “silos” in large enterprises. Each silo has its own unique, idiosyncratic collection of tables, columns, rows, relationships, and so on. There’s a lot of dirty and duplicate information in various languages, in such critical areas as customer records or parts catalogs (surprise). Silos continue to both evolve organically and proliferate with business events (M&As, leadership changes) and technology changes (new database technologies, a deepening legacy IT burden), essentially creating what Mike Stonebraker calls “Database Decay” or “Database Entropy.”
I often refer to the state of data broadly in large enterprises as “random data salad.” This is in stark contrast to how managers in large, hierarchical companies think about their data (i.e., “It’s all organized and good because it’s in SAP.”)
Random Data Salad
Got the picture? To put it in a real-life context, imagine a global industrial company trying to reduce or optimize purchasing across its supply chain. It wants to make sure it’s getting the lowest possible price for items purchased across its multi-billion-dollar annual spend. It needs a unified, constantly updated view of what it’s spending across its supply chain, which spans multiple business units and 75+ transactional systems in hundreds of countries with dozens of languages. Traditional data integration approaches simply cannot efficiently (1) identify the similarities (item description) and differences (item price) across all the data, (2) resolve them and (3) produce the higher-level business entity view (item or part) necessary to identify savings opportunities and negotiate deals with lowest-price supplier(s).
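The three steps in that scenario can be sketched on toy data. The purchase records, system names and the normalization rule below are all hypothetical; stripping case and punctuation stands in for the much harder job of recognizing that two descriptions mean the same item.

```python
from collections import defaultdict

# Made-up purchase lines from three hypothetical transactional systems.
purchases = [
    {"system": "erp_eu", "item": "Hex Bolt M8 x 40", "supplier": "S1", "unit_price": 0.42},
    {"system": "erp_us", "item": "hex bolt m8x40", "supplier": "S2", "unit_price": 0.31},
    {"system": "erp_apac", "item": "HEX-BOLT M8 X 40", "supplier": "S3", "unit_price": 0.55},
    {"system": "erp_us", "item": "Gasket 20mm", "supplier": "S2", "unit_price": 1.10},
]


def item_key(description):
    """Crude similarity key: lowercase, letters and digits only."""
    return "".join(ch for ch in description.lower() if ch.isalnum())


# (1) and (2): group lines whose descriptions resolve to the same item.
by_item = defaultdict(list)
for line in purchases:
    by_item[item_key(line["item"])].append(line)

# (3): per item, find the cheapest supplier and the price spread.
offers = {}
for key, lines in by_item.items():
    if len(lines) < 2:
        continue  # nothing to compare
    cheapest = min(lines, key=lambda r: r["unit_price"])
    priciest = max(lines, key=lambda r: r["unit_price"])
    spread = round(priciest["unit_price"] - cheapest["unit_price"], 2)
    offers[key] = (cheapest["supplier"], spread)

print(offers)
```

The grouping step is where the real difficulty lives. A rule this simple breaks as soon as descriptions differ by abbreviation, word order or language, which is the case across dozens of systems.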
We see scenarios like this in many businesses spanning many industries.
Solving the Problem: What Internet Companies Have Done
Internet companies like Google, Facebook, Instagram and LinkedIn have intense data variety problems.
For them, it’s about getting structured data out of many diverse web sites and data sources to deliver answers to our web searches or respond to our requests (clicks) at lightning speed.
Obviously, Google can’t tell the data source owners/webmasters how to format their data to play nice with Google. Therefore, the internet companies have invested heavily in very large data curation tools, organizations and processes. They’ve built machines (models) and active-learning systems, artfully integrated human curators into thoughtful curation processes, and established bi-directional feedback from information consumers. I think the best example of this is Google Knowledge Graph, truly one of the most exceptional information resources on the planet today.
Google Knowledge Graph works, obviously. (Just go to Google and type in JJ Abrams and see how amazing the results are in the right-hand panel.)
Google Knowledge Graph is the type of infrastructure that large enterprises are going to have to implement to curate structured tabular data at scale within their enterprises. The organizational structures, roles, technologies, techniques and principles are powerful but unfamiliar to most large enterprises.
Now, what if this established design pattern could be applied to the problem of managing data variety in enterprises, where structured data essentially runs the business?
Solving the Problem: What Some Enterprises Are Doing
The Holy Grail for enterprises, of course, is a single, continually updated view of each of the business entities (“masters”) that are critical to the enterprise: customers, products, parts, suppliers and so on. Tamr makes this possible by using machine learning and statistics to dramatically lower the cost and reduce the complexity of the mastering process.
While it’s still early in the evolution of this new approach, Tamr customers are already seeing remarkable results.
- For GE, a Tamr-“mastered” unified view of its suppliers enables procurement officers to get GE’s best terms with any given supplier, realizing $80M in hard cost savings in the first 12 months of using Tamr. When it combined its unified supplier view with a mastered view of its parts, GE was able to identify $300M in annual direct spend reduction opportunities by shifting purchasing to its most cost-effective suppliers.
- Toyota Motor Europe “mastered” its data variety, integrating 25+ million customer records from data spanning 30+ countries to achieve an entity-level view. This approach helps them deliver superior, unified customer experiences no matter where their customers go.
- Pharmaceutical manufacturer GSK created a unified data lake from fragmented research domain silos. This made it easier to access and use data for exploratory analysis and decision-making about new medicines.
When I look at results like these (and these are just a few examples), my mind spins at how much trapped value and savings are likely lurking in Global 2000 companies — if they would just embrace their data variety like these customers have. The low-hanging fruit possibilities alone are mind-boggling. And it’s still early, providing a lot of opportunities for enterprises willing to think a bit differently.
Data variety is a problem that today can be solved with the right approach, moving enterprises much closer to data-as-an-asset. We believe that our approach can deliver on the nirvana of decentralized control of data silos with data optimization for the business.
Solving the Problem for Large Enterprises
At Tamr, we’re working to enable many of the world’s largest enterprises to unify key data to realize maximum strategic and operational benefit. We’re helping them empower their people to consume accurate, up-to-date, unified data distilled from many silos to deliver transformational analytical and operational outcomes.
Tamr’s enterprise data unification platform uses human-guided machine learning (ML) to clean, label and connect structured data, avoiding problems downstream in delivering readily digestible data for analytics at scale. ML models do the heavy lifting of data integration, taking a probabilistic approach (“scientific guessing”) that invokes just the right amount of human expertise if and when needed. This makes data integration very efficient (unlike internet companies, most enterprises can’t afford huge rooms full of dedicated data curators) and highly scalable (unlike those known-non-scalable technologies MDM and ETL).
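The general idea of a probabilistic approach that calls on people only when needed can be sketched as confidence-based routing. This is not Tamr’s implementation: the string-similarity score from Python’s difflib is a simple stand-in for a trained model, and the thresholds and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for a model score: string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def route(a, b, accept=0.9, reject=0.5):
    """Decide automatically when confident, otherwise ask a person.

    The accept and reject thresholds are illustrative assumptions.
    """
    score = similarity(a, b)
    if score >= accept:
        return "auto_match"
    if score <= reject:
        return "auto_non_match"
    return "ask_expert"


candidate_pairs = [
    ("Acme Corp", "acme corp"),
    ("Acme Corp", "Acme Corporation"),
    ("abc", "xyz"),
]
for left, right in candidate_pairs:
    print(left, "|", right, "->", route(left, right))
```

Only the uncertain middle band reaches a human. In a learning system, the answers people give on those cases would feed back into the model, which is how the share of cases needing review can shrink over time.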
Because Tamr’s models are constantly learning with use, they get smarter and smarter over time while data gets cleaner and cleaner. The need for human involvement steadily decreases. Analytic velocity increases. Adding new data sources becomes significantly easier over time.
We also mimicked DevOps early on, embracing and adopting DataOps as core to how people will manage data as an asset in the future using our platform. Like the internet companies did with DevOps, we’ve steadily improved upon DataOps, which is the Agile, repeatable and scalable creation of data supply chains or pipelines.
The combination of these principles enables us to help large enterprises turn their inflexible legacy data infrastructures and structured data into much-more-agile data assets. Enterprises can save time, save money, and empower all kinds of data consumers with high-quality, business-transforming data for their operations and analytics.
To learn more about how Tamr addresses big data variety, please contact us or request a demo.
 In a 2017 Harvard Business Review article, Leandro DalleMule and Tom Davenport noted: “Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions — and less than 1% of its unstructured data is analyzed or used at all.”