
Tuesday, January 24, 2006

On the Quality of Metadata 

In most environments, spending time creating metadata (data about other data, that is) is pretty much considered a pain your manager makes you go through. There are a few rare exceptions: environments where people like to write metadata, where it's their passion, where they genuinely enjoy doing it and, you can hear it coming, where they normally do a better job at it.

I work in one of these environments (a library) and I work with others (museums and other educational institutions, for example), and one thing that always comes up is "the quality of their metadata".

This leads to interesting remarks I've heard, like "I envy you, libraries have much better metadata than museums" or "I've found a typo in your collection. - Yeah, we know, we still have a lot of cleanup work to do" (this after they'd spent an hour telling me how great their dataset was).

I couldn't figure out a pattern among all these people talking about metadata, but I felt deep in my gut that there was something connecting them, until very recently, when Ben (who finally joined our team! yay!) and I started discussing this.

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is subjective and perceptual, but the effect is constant: every time we showed this to people who cared about the data more than about the software we were writing, they could not understand why we were so excited about such a system, when the data was clearly so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird, because it might well be that we hit a general rule, one that is not a function of how much thinking you put into the data mappings or ontology crosswalks. Talking to Ben helped me understand why.

First, let's start by noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care, and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically translates into bad metadata.

Note how this is nothing but a syllogism, yet it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadata sets about, say, music, each of them very coherent and well polished. The first encodes artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist or band name. But when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".
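
To make the effect concrete, here is a minimal sketch of that merge (the titles and the in-memory representation are mine, invented purely for illustration):

    # Two internally coherent catalogs of song -> artist, each with its
    # own (consistent) convention for spelling artist names.
    catalog_a = {
        "Yesterday": "Beatles, The",
        "Imagine": "Lennon, John",
    }
    catalog_b = {
        "Help!": "The Beatles",
        "Jealous Guy": "John Lennon",
    }

    # A naive merge: every song keeps the artist spelling of its source.
    merged = {**catalog_a, **catalog_b}

    # Each catalog had exactly one spelling per artist; the merge has two.
    print(sorted(set(merged.values())))
    # ['Beatles, The', 'John Lennon', 'Lennon, John', 'The Beatles']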

The result of merging two high quality datasets is, in general, another dataset with higher "quantity" but lower "quality". And note that, as you can see, the ontological crosswalks or mappings were done "right", where by "right" I mean that both sides of the ontological equation would have agreed that "The Beatles" or "Beatles, The" is the band name associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?". Deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique, identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently, and they are happy campers.

I'm left implementing the task, which normally consists of calling a web service somewhere that returns the identifier for that particular string. Of course, this is a blind process (mostly based on string distances of some sort), so you can safely blame it on them if something goes wrong and "Yesterday" ends up being written by "Bangles, The". I could just stop thinking and do the work, but what if such thesauri don't exist?
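
A sketch of what that blind step looks like, using Python's difflib as a stand-in for the string-distance matching such a web service would do internally (the authority list and the identifiers are invented for illustration):

    import difflib

    # A toy controlled vocabulary: preferred label -> identifier.
    # In practice this would live behind someone else's web service.
    AUTHORITY = {
        "The Beatles": "urn:example:artist/the-beatles",
        "The Bangles": "urn:example:artist/the-bangles",
        "John Lennon": "urn:example:artist/john-lennon",
    }

    def lookup(name, cutoff=0.6):
        """Return the identifier of the closest authority record, or None."""
        # Normalize "Surname, The" style labels to natural order first.
        if name.endswith(", The"):
            name = "The " + name[: -len(", The")]
        match = difflib.get_close_matches(name, list(AUTHORITY), n=1, cutoff=cutoff)
        return AUTHORITY[match[0]] if match else None

    print(lookup("Beatles, The"))  # urn:example:artist/the-beatles
    print(lookup("Banglez"))       # still matches The Bangles -- fuzzy wins
    print(lookup("Ringo"))         # None: nothing close enough

Whether "Beatles" quietly drifts toward "Bangles" is entirely a function of that cutoff, which is exactly why the process is blind.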

I put my semantic web hat on and think that "author" could be treated as a weak inverse functional property of a song, meaning that if, by chance, there is even a single overlap in song titles between the two datasets (say "Yesterday" is authored by both "Beatles, The" and "The Beatles"), then I can infer a "candidate equivalence" between "Beatles, The" and "The Beatles" without knowing anything else.
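
Here is a minimal sketch of that inference, again on invented data: every title that appears under more than one artist string yields a pair of "candidate equivalent" artist strings.

    from collections import defaultdict
    from itertools import combinations

    # (song title, artist string) pairs, as they come out of each dataset.
    dataset_a = [("Yesterday", "Beatles, The"), ("Imagine", "Lennon, John")]
    dataset_b = [("Yesterday", "The Beatles"), ("Imagine", "John Lennon")]

    def candidate_equivalences(*datasets):
        """Treat 'author' as a weak inverse functional property of the title:
        the same title under two different artist strings suggests (but does
        not prove) that the two strings name the same artist."""
        artists_by_title = defaultdict(set)
        for dataset in datasets:
            for title, artist in dataset:
                artists_by_title[title].add(artist)
        candidates = set()
        for title, artists in artists_by_title.items():
            for a, b in combinations(sorted(artists), 2):
                candidates.add((a, b))
        return candidates

    print(candidate_equivalences(dataset_a, dataset_b))
    # {('Beatles, The', 'The Beatles'), ('John Lennon', 'Lennon, John')}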

This weak inverse functional property and the resulting "candidate equivalence" are clearly not Description Logics and therefore not part of OWL, but they are a step forward: the cycle is completed by placing a human being in front of the screen, showing her all the "candidate equivalences" with two buttons, "yes" and "no", and letting her decide whether or not the two make sense. The software does the boring and easily computable rest of the work.
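
The human side of that cycle can be as small as this (a command-line stand-in for the two buttons, reusing the candidate_equivalences sketch above):

    def review(candidates):
        """Show each candidate equivalence and record the reviewer's verdict."""
        confirmed = []
        for a, b in sorted(candidates):
            answer = input('Same artist? "%s" / "%s" [y/n] ' % (a, b))
            if answer.strip().lower().startswith("y"):
                confirmed.append((a, b))
        return confirmed

    # Confirmed pairs can then be collapsed onto a single identifier (or
    # recorded as owl:sameAs-style links), while rejected ones are dropped.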

Why a *weak* inverse functional property? Well, mostly because it is entirely possible for two groups to have written songs with the same title. Yet, most of the time, this is not the case.

Again, the semantic web people would tell you that you should never treat a literal as a URI: even if two songs have the same "title" literal, they might not necessarily be the same song.

So, here we are, with two datasets that we know contain music information, but that have different identifiers for songs and artists, different spellings for their literal labels, and different format encodings of the music files (so that using file hashcodes as identifiers doesn't work). [Yes, dear metadata lover, I know FRBR and I know it gets way worse than this; I'm just using this as an example, so please bear with me.]

Independently, the two datasets are very coherent, and a lot of time, money and energy was spent on them. Together, and even assuming the ontology/schema crosswalks were done in a way the owners of the two datasets would agree with (which is not a given at all, but let's assume it for now), they look and feel like a total mess (especially when browsing them with a faceted browser like Longwell).

The standard solution in the library/museum world is to map against a higher order taxonomy, something that brings order to the mix. But either no such thing exists for that particular metadata field, or it does but it was incredibly expensive to make and maintain, and such taxonomies have a tendency, almost by definition, to become very hard to displace once you commit to one of them.

I'm naturally allergic to hard-to-displace control hubs, but a well behaved license of use might make such a vocabulary/web service very appealing.

But I find it a little naive to think that we can solve the perceived drop in metadata quality caused by dataset merging by always resorting to a higher level authority: this is exactly the platonic semantic cage that some people fear when they hear about the semantic web. It might work in those rare environments where people find such control necessary and even a little comforting, and therefore don't have a problem spending time and money on restoring the metadata quality of the mix, but it won't work at a global, world-wide-web scale.

And there is already evidence of that!

I need to give you a little background first. Years ago, the Open Archives Initiative created something called "PMH" (Protocol for Metadata Harvesting): a lightweight way to ask a web site to dump all its metadata content to you. Today, most people do the same with RSS (which has no protocol for metadata harvesting; you just keep polling it), but PMH is a little smarter (and not that much more complex). DSpace implements OAI-PMH.
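
For the curious, a harvesting client really is that lightweight; here is a rough sketch that pages through a repository's ListRecords responses in plain Dublin Core (the repository URL is a placeholder; the verbs, prefixes and namespaces follow the OAI-PMH 2.0 spec):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"
    BASE_URL = "http://repository.example.org/oai"  # placeholder endpoint

    def harvest(base_url):
        """Yield the Dublin Core titles of every record in the repository."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                yield [t.text for t in record.iter(DC + "title")]
            # Large result sets are paged; follow the resumptionToken.
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # for titles in harvest(BASE_URL):
    #     print(titles)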

There is a registry of all the OAI-PMH archives, and there are indexers/crawlers that consume that information (Google Scholar among them). Both OAI-PMH and DSpace were originally designed to work with Dublin Core only, but OAI-PMH was later extended to support all kinds of metadata, and DSpace was modified by many of the institutions that use it to support other metadata schemas.

The registry contains a very interesting page: the list of distinct metadata schemas used by the 898 repositories currently in the registry (summing up to more than 6 million items). [Note how this is still an XML world, so schemas are identified by the URL location of the XML Schema and not by the URI of the namespace used.]

A few things are worth noting:

- The distribution follows a power law (I suspect the distribution of RDF ontology use will follow the same pattern in the future: a few used a lot, a lot used rarely).
- Dublin Core is way more fragmented out there than people ever want to admit (and there goes your common-denominator semantic cage).
- Crosswalks between these fragmented and refined versions of Dublin Core would increase ontological overlap with minimal effort (here is where speaking of RDF instead of XML starts to become appealing; writing n^2 XSLT scripts gets out of hand pretty fast).

But what about the coherence/quality of the metadata found in these repositories? Individually, it is probably very high. Merged, it probably feels no different from the web itself, which is why Google Scholar doesn't feel any smarter than Google itself (rather the opposite, sometimes).

I find this discovery a little ironic: the semantic web, by adding more structure to the model but increasing the diversity of the ontological space, might become even *more* messy than the current web, not less. Google built their empire on the <a> tag, but at least there was an <a> tag to work with in every HTML page! The RSS world is already starting to see that Babel happening: Apple, Yahoo, Google and Microsoft jumped on the bandwagon and started adding their own RSS extensions, and I don't think Sam's validator is going to stop the distribution of the RSS variations from following a power law; it will just help make the distribution's slope steeper, but that's about it.

So, are we doomed to turn the web into a Babel of languages? And are we doomed to dilute the quality of pure data islands simply by mixing them together?

Luckily, no, not really.

What's missing is the feedback loop: a way for people to inject information back into the system that keeps the system stable. Mixing high quality metadata results in lower quality metadata, but the individual qualities are not lost, just diluted. Additional information/energy needs to be injected into the system for the quality to return to its previous level (or a higher one!). This energy can be the one already condensed in the efforts made to create controlled vocabularies and mapping services, or it can be distributed across a bunch of people united by common goals/interests and by social practices that keep the system stable, trustworthy and socio-economically feasible.

Both the open source development model and the Wikipedia development model are examples of such socio-economically feasible systems, although they might not scale to the size we need/want for an entire semantic web.

The semantic web has a lot of people working on the technological guts, but very few on the social practices that might make it happen. I suspect this is going to change soon and solutions might come from unexpected places.
