
Account from OGF25 Repositories Workshop: Creating a Repository Standard?

Friday, March 20th, 2009

04 March 2009

Catania, Sicily, Italy

Open Grid Forum 25th Conference (OGF25)

 

It’s not entirely clear when I figured out that I was sitting on a standards body panel discussing the creation of a standard for digital repositories. I’m pretty sure it finally clicked sometime after the session was over, once I had consumed a couple of glasses of wine.

 

I still don’t see what I contributed to the conversation, though the other participants assured me that my comments were useful.  The experience reminds me of my friend, let’s call him Josh, a community organizer who was recently pulled onto one of the Obama administration’s advisory panels.  Shortly after joining the advisory panel, Josh confessed that at the end of most calls he has to follow up with a friend and ask “Ok. So what exactly did we just decide and who is responsible for doing what?”

 

The panel discussion started by making observations that we’re all familiar with:

  • the importance, and associated challenges, of unique identifiers and persistent URIs
  • search, retrieval, and management are separate concerns, each with its own appropriate standards (e.g. SWORD, OpenSearch)
  • cloud computing is very different from cloud storage

After this, I quickly found myself in completely unfamiliar waters when the conversation abruptly turned to the creation of a standard for digital repositories. I thought “Pshaw. We don’t need Yet Another Standard. Where did this come from?” In fact, the whole field of repositories is so new that the prospect of a repository standard seems absurdly premature to me. Discussion on the panel homed in on the two obvious contenders for a standard: 1) metadata requirements, and 2) functionality profiles (a list of features necessary in order for a repository system to be deemed compliant and interoperable). From my perspective, repositories already swim in a glut of metadata standards (as well as non-standard, ad-hoc metadata) and, by nature, must embrace heterogeneous metadata. The second notion, that of functionality profiles, sounds like something that few will read and none will understand. To be honest, the entire discussion confused me. I did my best to contribute where I could.

 

After the workshop ended, I had a chance to catch my breath and discuss the panel with a couple of people.  Eventually, I came to look at the whole scenario from a different perspective and had a mild change of heart.  In a discussion with Neil Chue Hong, a very smart guy from Edinburgh, I started thinking about all the informal conclusions that frame discussions between developers at conferences like Dev8D, Code4Lib and RepoCamp.  I then thought about all the little architectural wins and failures that I see in software like Flickr, YouTube, Hulu and (woah) ABC’s full episode player.  After all, these are repositories too. Within a few moments of pondering, an initial list of obvious basic guidelines shone through quite clearly.

  • give permanent, unique URIs to all content you expose, even if you intend to limit access to that content based on geography or time of access
  • support linking with versioning or datetime info
  • expose a RESTful API
  • give preference to AtomPub
  • consider ORE when you need to express aggregations of data
  • provide linked data (RDF) endpoints
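To make the first few guidelines a little more concrete, here is a minimal Python sketch of what permanent URIs, version-specific links, and an Atom representation might look like together. Everything specific in it is an assumption of mine for illustration: the host `repo.example.org`, the `/items/.../versions/...` path layout, and the function names are hypothetical, and the Atom entry is only a fragment (a complete entry would also need an author, for instance).

```python
from datetime import datetime, timezone
from urllib.parse import quote
import xml.etree.ElementTree as ET

BASE = "https://repo.example.org"   # hypothetical repository host
ATOM = "http://www.w3.org/2005/Atom"  # the Atom namespace URI

def item_uri(item_id: str) -> str:
    """Permanent URI for an item. It never encodes access rules, so it
    stays stable even if access is later limited by geography or time."""
    return f"{BASE}/items/{quote(item_id, safe='')}"

def version_uri(item_id: str, version: int) -> str:
    """Link to one specific version of an item (hypothetical path layout)."""
    return f"{item_uri(item_id)}/versions/{version}"

def atom_entry(item_id: str, title: str, version: int, updated: datetime) -> str:
    """Serialize a partial Atom entry whose id is the permanent URI and
    whose link points at the specific version."""
    ET.register_namespace("atom", ATOM)
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = item_uri(item_id)
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # Atom expects an RFC 3339 timestamp; isoformat() on a
    # timezone-aware datetime produces one.
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated.isoformat()
    ET.SubElement(entry, f"{{{ATOM}}}link",
                  rel="alternate", href=version_uri(item_id, version))
    return ET.tostring(entry, encoding="unicode")

if __name__ == "__main__":
    when = datetime(2009, 3, 4, 12, 0, tzinfo=timezone.utc)
    print(atom_entry("ogf25 notes", "Workshop notes", 2, when))
```

The point of the sketch is the separation of concerns: the identifier is assigned once and never changes, while versions and representations hang off it as links.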

Some open topics also seem like obvious fodder for discussion:

  • what query language(s) to use in search APIs
  • navigating the difference between standards and interoperability
  • leveraging standards where possible (e.g. SWORD)

These are merely the things that seem obvious to me right away.  What would happen if we got the really smart people talking in this vein?  

 

I think this warrants further exploration and, strange though it is, I expect that the outputs of such exploration might resemble the stuff of standards bodies (be it a recommendation, a community document, or a standard).  Possibly I have been infected with that odd standards-wonk bug, or possibly I’m just catching up with the rest of the world in acknowledging the inevitable.

Session Hopping, LinkedData, and Data APIs at OGF25 in Sicily

Friday, March 20th, 2009

04 March 2009

Catania, Sicily, Italy

Open Grid Forum 25th Conference (OGF25)

 

[Note: I'm posting my backlog of updates from the past 2 months of travel.  An update specifically about the OGF Repositories Workshop will follow shortly]

 

I made it to the conference center in Catania, Sicily a few hours before the OGF Repositories Workshop.  Immediately upon arriving I met Nick Ferguson, coordinator of the workshop, and had a nice chat with Neil Chue Hong about repositories, ORE, and grid computing vs. cloud computing.   After that I was left to kill time until the workshop by sitting in on one of the OGF sessions.  At first, I stepped into what I thought was the Earth Sciences session, but it turned out to be the Computational Chemistry session and went way over my head.  I then passed through a handful of other random presentations before settling on a room where about 30 people were having a discussion about XQuery.

 

I soon discerned that this group was hammering out the spec for some sort of standard data systems interface. When I arrived, they had been debating the merits and demerits of XPath/XQuery vs. SQL as a query language. The conversation quickly stumbled into the pit of interoperability hell. Standard interjections abounded: “Some implementations won’t have that data to return…”, “you will have to expose user info in order to support that…” mumble mumble “… we didn’t do it that way because one unnamed vendor couldn’t support it…” I nearly laughed out loud when an attendee from the back of the room interrupted the discussion declaring “But in most situations, you should only be returning items owned by the current user.”
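For readers who haven't lived through this debate, here is a rough Python sketch of the same "only the current user's items" query expressed both ways, using only the standard library. The record fields (`job_id`, `owner`, `cpu_seconds`) and the function names are invented for illustration and are not taken from the working group's spec. Note also that `xml.etree.ElementTree` supports only a small subset of XPath, so this stands in for a real XPath/XQuery engine rather than representing one.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical usage records; the real schema under discussion is unknown to me.
RECORDS = [
    {"job_id": "j1", "owner": "alice", "cpu_seconds": 120},
    {"job_id": "j2", "owner": "bob", "cpu_seconds": 45},
    {"job_id": "j3", "owner": "alice", "cpu_seconds": 300},
]

def jobs_via_xpath(owner: str) -> list[str]:
    """Filter by owner with an XPath-style attribute predicate."""
    root = ET.Element("records")
    for r in RECORDS:
        ET.SubElement(root, "record", job_id=r["job_id"],
                      owner=r["owner"], cpu_seconds=str(r["cpu_seconds"]))
    # The value is interpolated directly into the expression, which is
    # only safe here because the sample owners contain no quote characters.
    matches = root.findall(f".//record[@owner='{owner}']")
    return [m.get("job_id") for m in matches]

def jobs_via_sql(owner: str) -> list[str]:
    """Filter by owner with a parameterized SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (job_id TEXT, owner TEXT, cpu_seconds INTEGER)")
    conn.executemany(
        "INSERT INTO records VALUES (:job_id, :owner, :cpu_seconds)", RECORDS)
    rows = conn.execute(
        "SELECT job_id FROM records WHERE owner = ? ORDER BY job_id",
        (owner,)).fetchall()
    conn.close()
    return [row[0] for row in rows]

if __name__ == "__main__":
    print(jobs_via_xpath("alice"))
    print(jobs_via_sql("alice"))
```

Both functions return the same job ids for a given owner. The query itself is trivial in either language; the hard questions in the room were about what every implementation can be required to store and expose.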

 

I still had no idea what data they were attempting to expose.  (I later learned that it was the RUS-WG, who are defining a standard interface for retrieving job usage records … Obscure indeed.)    The 90-minute discussion ended up having nearly nothing to do with the actual data these people want to work with.  Instead, the conversation was entirely dominated by the travails of navigating the strange space of Data API design.

 

Meanwhile, serendipitously, I was using this downtime (and the conference wifi access) to finally read George Thomas’s slides about recovery.gov publishing open data. Though I missed the presentation, the slides spell out the project’s intentions pretty clearly. They’re full of references to REST, Atom, RDFa and the LOD cloud. The contrast between the exposition before my eyes and the discussion filtering in through my ears was fascinating. In particular, one of Thomas’s slides jumped out at me. The slide, titled “Follow the dollar, not the person”, showed a semantic model for users, user groups, and posts in a bulletin-board style Community Forum system. It was totally readable, totally understandable, precise, and flexible, and it used an ontology that lends itself to re-use.

 

Over the past year, I have satirically placed a golden halo above “linked data” in my mind.  As I sat in the RUS-WG session, light fell upon that halo and it glowed.

 

This experience, as well as consequent discussions at OGF, has left me with a distinct sense that there’s a pattern here.  We are all, of our own accord and in our own little techno-fiefdoms, attempting to do the same things and running into the same challenges.  I think that the previously obscure field of digital repositories has valuable perspective to provide and many pieces of wisdom to share in this domain.  I hope to see more public discourse about these topics, and I know who to start prodding to speak up.  Watch this space.

 

 

Post Script:

 

The morning following my OGF session-hopping experience, I realized that the track I had passed over, innocuously titled “HEP”, was a meeting of the High Energy Physics community.  In particular, it was primarily a discussion about how they are going to handle processing the data outputs from the LHC experiments when they fire up the collider later this year.  /me kicks himself for missing this.