At Tamr we spend a lot of time cleaning up dirty data. Often this work requires mastering key types of entities (people, customers, suppliers, products, etc.). Much of the work we do with the Global 2000 is very similar to what the large internet companies do every day to make the web more searchable and accessible. My favorite example: when you search for JJ Abrams on Google, you get the amazing writer/director/producer (my hero). Google has used a combination of machines and humans to make sure you get exactly the JJ you are looking for in the info box on the right-hand side of your search results (sample below).
The image below is an example of what happens when you don’t have clean data around key entities. My friend Katie Rae is disambiguated appropriately and associated with an article about MIT, but the reference to me? Not so much.
The news item from LinkedIn refers to a story about “Andy Palmer,” but the machine has mistakenly associated me (Andy Palmer the software guy) with a story about Andy Palmer the automotive guy. LinkedIn doesn’t make this mistake often, but it does happen. Consistently and systematically disambiguating people can be a tough problem, but it isn’t a new one. We do this work at Tamr every day for G2000 companies that are trying to disambiguate and master their customers, suppliers, and employees. You name it and we’ll clean it up. A big part of large companies using their data as an asset is the relatively unsexy but historically challenging work of cleaning up core data and making sure that when you refer to a key entity (a person, company, product, etc.) you actually have the right person, company, or product. Much to the chagrin of our growth team at Tamr, I refer to our work as enterprise data plumbing: unsexy but valuable when you’ve got dirty water/data ;)
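To make the "two Andy Palmers" problem concrete, here is a toy sketch of context-based disambiguation in Python. It is not how LinkedIn, Google, or Tamr actually do it. The `Candidate` class, the `disambiguate` function, the entity IDs, and the context terms are all hypothetical, made up for illustration. The idea is simply to score each candidate who shares the name by how many of its known context terms appear in the article, and to decline to link anyone when the evidence is weak or tied, so that a human can take a look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A known entity that a name mention could refer to (hypothetical schema)."""
    entity_id: str
    name: str
    context_terms: frozenset  # words we expect to see near this entity


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def disambiguate(mention_name, article_text, candidates, min_score=1):
    """Return the best-matching candidate, or None if the match is weak or tied."""
    words = tokenize(article_text)
    scored = [
        (len(c.context_terms & words), c)
        for c in candidates
        if c.name.lower() == mention_name.lower()
    ]
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best = scored[0]
    if best_score < min_score:
        return None  # not enough evidence: send to a human reviewer
    if len(scored) > 1 and scored[1][0] == best_score:
        return None  # tie between candidates: send to a human reviewer
    return best


# Illustrative data only: the context terms are assumptions, not real profiles.
candidates = [
    Candidate("person-1", "Andy Palmer", frozenset({"software", "data", "startup", "tamr"})),
    Candidate("person-2", "Andy Palmer", frozenset({"automotive", "car", "cars", "carmaker"})),
]

article = "Andy Palmer discusses the future of electric cars at the automotive summit."
match = disambiguate("Andy Palmer", article, candidates)
print(match.entity_id if match else "needs human review")
```

Real systems use far richer signals than word overlap, but the shape is the same: score the candidates, link automatically when the evidence is clear, and fall back to people when it isn't.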
I’ve been working on similar problems since the 1980s. My first real job was building a database of product information for my family’s truck distribution business. I still remember entering data from hundreds of physical product data sheets — firetrucks, street sweepers, snow plows — I spent a full summer getting to know everything about Pierce fire engines.
Many of the methods that we use at Tamr are similar to the methods that LinkedIn, Google and other big internet companies use every day. When Ihab, Mike, George, Daniel and Alex started working on Tamr, we often described what we were doing as bringing a well-understood design pattern from internet companies to the Global 2000, so that they could clean up their data much more easily and quickly.