Agile Languages & Fedora — Update from OR09

May 20th, 2009
Written by Matt Zumwalt

Leading up to this year’s Open Repositories, it became clear that there was demand for a BOF (Birds of a Feather) session focused on agile languages and Fedora.  I pitched the idea in an email to a couple of colleagues beforehand and then announced the BOF at my presentation on Monday morning.  Rather than restricting it to Fedora projects, I billed it as Agile Languages and Repositories.  About 30 people showed up.  The split was pretty even between Ruby, Python, and PHP developers.  About a third seemed to be Java developers in the process of defecting.  In addition to people doing stuff with Fedora, there were a handful of DSpace developers and possibly a couple who maintain EPrints repositories.

For the first half of the BOF we sat in mixed groups, eating our lunches and each talking about the work we do.  We then split up by language (Ruby, Python, PHP) and discussed language-specific topics.  For that second half I sat at the Ruby table where we talked about ActiveFedora, JRuby, RDF support for Ruby, MODS support for Ruby, Solr (solr-ruby and RSolr), and how Blacklight fits into the mix. 

I closed the conversation by asking if we should set up email lists for collaboration.  It seemed reasonable to set up a general mailing list for the solutions community as well as a list specifically for people doing stuff with Ruby, Fedora repositories, and (most likely) ActiveFedora.  I also resolved to encourage the creation of Python-oriented and PHP-oriented equivalents.  For now I have created two lists on Google Groups.  The first one, Fedora Commons Create, is for general discourse about creating client applications for Fedora.  The second, ActiveFedora / Ruby + Fedora Commons, is for Ruby-specific collaboration.

In the end, I was really pleased to realize that for the first time we had a substantial group of people interested in each of the main interpreted languages (Ruby, Python, PHP) and each group had at least one open source Fedora-based project to use as a starting point for their conversations.  The Ruby group had ActiveFedora, the Python group had Ben O’Steen’s work and Peter Herndon’s Django integration, and the PHP/Drupal people had Islandora & Fez to start from. 

This was a comfortable step forward from the scenario as it was a year ago.


Trainspotting as an explanation of the Semantic Web

May 9th, 2009
Written by Matt Zumwalt

I just came across this post on Russell Davies’ blog titled I like things to be numbered.  It’s an extract from an episode of the BBC radio show The Museum of Curiosity. In it, a railway enthusiast named Chris Donald explains the beauty that trainspotters find in the fact that the railway companies assign numbers to absolutely everything, including clocks. [listen to the mp3]

Listening to Chris Donald speak, I couldn’t resist seeing the connection to Linked Data and the Semantic Web.  He nails the key concepts of the beauty and the loaded possibilities that come from being able to trace the connections between things.  The three-minute account even draws out the aspect of Semantic Web that tends to make people squeamish:

[...] I like things to be numbered.  I don’t know why; I just do.  The idea that every bridge had a number attached to it appeals to me and it appeals to a lot of people. [...]  It’s all quantifiable.  They know how many trains there are because they’re all numbered.  They have a book with all the numbers in it.  It’s all very controlled and they can understand it and it’s very two-dimensional. [...] with trains - you stand on the platform and you look at the track and you know that that metal bit of track on the floor is touching every train that you’re looking for and you understand that it’s a puzzle that can be solved.

I frequently find myself trying to adequately characterize the distinction between Semantic Web and Linked Data.  Is it just a re-branding of the concepts?  Is it an offshoot of the greater phenomenon?  In this little account by an avid trainspotter, I see a wonderful way to point out the distinction.  

The past 15 years of noise about Semantic Web have had the ring of this trainspotter’s “I like things to be numbered [...]  It’s all very controlled and [I] can understand it. [...] it’s a puzzle that can be solved.”  While there is nothing wrong with this per se, it is only going to motivate certain types of people.  Further, it lends itself to visions of grandeur that quickly wander into a quagmire of failed logic and, to be honest, treads close to the intellectual foundations of fascism.

Meanwhile, the burgeoning Linked Data movement is much more akin to railroad engineers saying “Well, we numbered everything out of necessity.  Might as well let the rest of the world make sense of those connections too.  Who knows what they’ll get out of it, but it certainly doesn’t hurt us to share the data.”

There’s one key place where I wish to differ with Mr. Donald, and I think a lot of Linked Data people will agree.  He describes this world of connections as being very orderly, controlled, and two-dimensional.  He says this because he is only looking at a single set of data from a single perspective.  As soon as you open your eyes to the growing cloud of linked open data, the landscape becomes much more akin to a wilderness, or possibly a garden, where the surface may seem simple and pretty while the world underneath is thriving with the complex, messy stuff of life.

Ads deserve permalinks too

April 27th, 2009
Written by Matt Zumwalt

Twice already this weekend I have wanted to reference an advertisement in a blog post and been unable to link to that ad.  I do all of my television watching online, usually on sites like Hulu that provide permalinks for shows, episodes, comments, etc.  However, I have yet to find a permalink for any of the advertisements on those sites.  In our media-conscious world, we are almost as likely to discuss an advert as we are to discuss an episode of a television show or whatever other video content the ad was attached to.  Let’s say the mantra: If it’s valuable, it deserves a permalink.  If people might discuss it, it deserves a permalink.

As our modes of media consumption change, advertisers are being forced to adapt.  The most effective adapters have taken to creating ad content that is independently valuable in the eyes of their target consumers.  This is a great thing and it should be encouraged.  We should and do reward innovative advertisers by talking about and sharing their ads.  This would be so much easier to do if they gave us permalinks to use.

You may ask how this is different from viral video advertising, or you may point out that many ads (especially funny ones) already show up on YouTube.  Well, there is certainly a strong connection, but there is an important difference in attitude.  The notion of viral videos has built up in the venue of YouTube and social networks.  It carries a connotation of low production values, simplistic themes, and playing to the finicky, meme-obsessed banality of the crowd. The stuff of viral videos gets dumped into the swamp that is YouTube (or one of its clones) and is left to fester.  Those who watch the videos are treated like little more than flies.  I, the consumer, am reduced to a tick on the view count and possibly a comment on the generic video viewer page.  By allowing a YouTube URL to be the de facto identifier of your content, you’re basically conceding that the content doesn’t deserve to be treated with any distinction.

Our cultural shortsightedness regarding the web, its possibilities, and its future usage is comical.  

Note: In the sense that I’m using it here, permalink == URI (Uniform Resource Identifier).  Yes, this post is basically an argument that ads, like everything else, should be treated as nodes in the semantic web (aka Linked Data Cloud).

Account from OGF25 Repositories Workshop: Creating a Repository Standard?

March 20th, 2009
Written by matt

04 March 2009

Catania, Sicily, Italy

Open Grid Forum 25th Conference (OGF25)

 

It’s not entirely clear when I figured out that I was sitting on a standards body panel discussing the creation of a digital repository related standard.  I’m pretty sure it finally clicked sometime after the session was over, once I had consumed a couple glasses of wine.

 

I still don’t see what I contributed to the conversation, though the other participants assured me that my comments were useful.  The experience reminds me of my friend, let’s call him Josh, a community organizer who was recently pulled onto one of the Obama administration’s advisory panels.  Shortly after joining the advisory panel, Josh confessed that at the end of most calls he has to follow up with a friend and ask “Ok. So what exactly did we just decide and who is responsible for doing what?”

 

The panel discussion started by making observations that we’re all familiar with:

  • the importance, and associated challenges, of unique identifiers and persistent URIs
  • search, retrieval, and management are separate concerns, each with appropriate standards associated (e.g. SWORD, OpenSearch)
  • cloud computing is very different from cloud storage

After this, I quickly found myself in completely unfamiliar waters when conversation abruptly turned to the creation of a standard for digital repositories. I thought “Pshaw.  We don’t need Yet Another Standard.  Where did this come from?”  In fact, the whole field of repositories is so new that the prospect of a repository standard seems absurdly premature to me.  Discussion on the panel homed in on the two obvious contenders for a standard: 1) metadata requirements, and 2) functionality profiles (a list of features necessary in order for a repository system to be deemed compliant and interoperable).  From my perspective, repositories already swim in a glut of metadata standards (as well as non-standard, ad-hoc metadata) and, by nature, must embrace heterogeneous metadata.  The second notion, that of functionality profiles, sounds like something that few will read and none will understand.  To be honest, the entire discussion confused me.  I did my best to contribute to the discussion where I could.

 

After the workshop ended, I had a chance to catch my breath and discuss the panel with a couple of people.  Eventually, I came to look at the whole scenario from a different perspective and had a mild change of heart.  In a discussion with Neil Chue Hong, a very smart guy from Edinburgh, I started thinking about all the informal conclusions that frame discussions between developers at conferences like Dev8D, Code4Lib and RepoCamp.  I then thought about all the little architectural wins and failures that I see in software like Flickr, YouTube, Hulu and (woah) ABC’s full episode player.  After all, these are repositories too. Within a few moments of pondering, an initial list of obvious basic guidelines shone through quite clearly.

  • give permanent, unique URIs to all content you expose, even if you intend to limit access to that content based on geography or time of access
  • support linking with versioning or datetime info
  • expose a RESTful API
  • give preference to AtomPub
  • consider ORE when you need to express aggregations of data
  • provide linked data (RDF) endpoints
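To make the first two guidelines a little more concrete, here is a minimal Ruby sketch of what "a permanent URI plus a datetime-qualified link" could look like.  Everything specific in it is my own assumption for illustration: the host repository.example.org, the items/ path layout, and the choice to express a version by appending a UTC timestamp are all hypothetical, not something any repository system prescribes.

```ruby
require "uri"
require "time"

# Hypothetical base address and path layout, chosen only for illustration.
BASE = "http://repository.example.org/"

# Permanent URI for an item: derived from nothing but its stable identifier,
# so it does not change when access rules, storage, or software change.
def permalink(id)
  URI.join(BASE, "items/#{URI.encode_www_form_component(id)}")
end

# The same item as it existed at a given moment.  Appending a UTC ISO 8601
# timestamp is one possible convention (an assumption, not a standard).
def permalink_at(id, time)
  URI("#{permalink(id)}/#{time.getutc.iso8601}")
end

puts permalink("demo:1")
puts permalink_at("demo:1", Time.utc(2009, 3, 4, 12, 0, 0))
```

The point of the sketch is only that the identifier and the version are both addressable; whether access is then granted for a given request is a separate decision made when the URI is resolved.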

Some open topics also seem like obvious fodder for discussion:

  • what query language(s) to use in search APIs
  • navigating the difference between standards and interoperability
  • leveraging standards where possible (e.g. SWORD)

These are merely the things that seem obvious to me right away.  What would happen if we got the really smart people talking in this vein?  

 

I think this warrants further exploration and, strange though it is, I expect that the outputs of such exploration might resemble the stuff of standards bodies (be it a recommendation, a community document, or a standard).  Possibly I have been infected with that odd standards-wonk bug, or possibly I’m just catching up with the rest of the world in acknowledging the inevitable.

Session Hopping, LinkedData, and Data APIs at OGF25 in Sicily

March 20th, 2009
Written by matt

04 March 2009

Catania, Sicily, Italy

Open Grid Forum 25th Conference (OGF25)

 

[Note: I'm posting my backlog of updates from the past 2 months of travel.  An update specifically about the OGF Repositories Workshop will follow shortly]

 

I made it to the conference center in Catania, Sicily a few hours before the OGF Repositories Workshop.  Immediately upon arriving I met Nick Ferguson, coordinator of the workshop, and had a nice chat with Neil Chue Hong about repositories, ORE, and grid computing vs. cloud computing.   After that I was left to kill time until the workshop by sitting in on one of the OGF sessions.  At first, I stepped into what I thought was the Earth Sciences session, but it turned out to be the Computational Chemistry session and went way over my head.  I then passed through a handful of other random presentations before settling on a room where about 30 people were having a discussion about XQuery.

 

I soon discerned that this group was hammering out the spec for some sort of standard data systems interface.  When I arrived, they had been debating the merits and demerits of XPath/XQuery vs. SQL as a query language.  The conversation quickly stumbled into the pit of interoperability hell.  Standard interjections abounded: “Some implementations won’t have that data to return…”, “you will have to expose user info in order to support that…” mumble mumble “… we didn’t do it that way because one unnamed vendor couldn’t support it…”  I nearly laughed out loud when an attendee from the back of the room interrupted the discussion declaring “But in most situations, you should only be returning items owned by the current user.”

 

I still had no idea what data they were attempting to expose.  (I later learned that it was the RUS-WG, who are defining a standard interface for retrieving job usage records … Obscure indeed.)    The 90-minute discussion ended up having nearly nothing to do with the actual data these people want to work with.  Instead, the conversation was entirely dominated by the travails of navigating the strange space of Data API design.

 

Meanwhile, serendipitously, I was using this downtime (and the conference wifi access) to finally read George Thomas’s slides about recovery.gov publishing open data.  Though I missed the presentation, the slides spell out the project’s intentions pretty clearly.  They’re full of references to REST, Atom, RDFa and the LOD cloud.  I experienced such a fascinating contrast between the exposition before my eyes and the discussion filtering in through my ears.  In particular, one of Thomas’s slides jumped out at me.  The slide, titled “Follow the dollar, not the person”, showed a semantic model for users, user groups, and posts in a bulletin-board style Community Forum system.  It was totally readable, totally understandable, precise, flexible, and using an ontology that lends itself to re-use.

 

Over the past year, I have satirically placed a golden halo above “linked data” in my mind.  As I sat in the RUS-WG session, light fell upon that halo and it glowed.

 

This experience, as well as consequent discussions at OGF, has left me with a distinct sense that there’s a pattern here.  We are all, of our own accord and in our own little techno-fiefdoms, attempting to do the same things and running into the same challenges.  I think that the previously obscure field of digital repositories has valuable perspective to provide and many pieces of wisdom to share in this domain.  I hope to see more public discourse about these topics, and I know who to start prodding to speak up.  Watch this space.

 

 

Post Script:

 

The morning following my OGF session-hopping experience, I realized that the track I had passed over, innocuously titled “HEP”, was a meeting of the High Energy Physics community.  In particular, it was primarily a discussion about how they are going to handle processing the data outputs from the LHC experiments when they fire up the collider later this year.  /me kicks himself for missing this.

Even the NYTimes is Noticing DAM

February 10th, 2009
Written by Matt Zumwalt

Following from last week’s post about the Snarkmarket Book Project, here’s even stronger evidence of the sudden increase in mainstream attention that real content management has garnered.  In Digital Archivists, Now in Demand, the New York Times Jobs section talks about our “nascent” discipline of Digital Asset Management and discusses the career possibilities in the field.

A friend of mine sent me a link to the article asking “Is this the kind of thing your company works on? Sounds interesting.” It will be really amusing when everyone talks about this stuff like they have always dealt with it.

… in Honor of SOA

February 4th, 2009
Written by matt

In a clever marketing move, The Burton Group have held a wake in honor of Service Oriented Architecture (SOA).  They’ve also set up one of those cute custom shortened URLs: http://tinyurl.com/SOAWake. From the announcement:

[...] It’s time to declare that SOA is dead and move on to the more practical matter of bringing up its offspring: Services.

This great find was brought to our attention by Ben O’Steen’s twitter feed.

The Wave Builds: Thinkers beyond the library world suddenly start talking about digital curation.

February 4th, 2009
Written by matt

To give you a sense of the sudden traction that our area of expertise has deservedly gained, check out the Snarkmarket Book Project which was posted only yesterday. It has already garnered over 100 pitches of subject matter for the “New Liberal Arts” and more than a third of them concern Digital Curation and/or Internet Archivists.

The Librarian Avengers in the crowd will especially relish this comment by Matt Thompson:

“Library science” is a fusty old term that increasingly fails to fit an ever-expanding and ever-more-important range of skills. “Knowledge management” is weighed down by the awful word “management.” In Matt University, we’d rebrand it “knowledge mastery” or something similarly grandiose. After all, this is becoming critical. How do we capture, structure, sift and preserve enormous bodies of information?

A different kind of long tail

November 20th, 2008
Written by matt

This morning I was reading a bio of Ezra Koenig in Salon’s Sexiest Men Living Series.  The bio had a link to an entry about Koenig’s band, Vampire Weekend, in NPR’s Global Hit Podcast.  After listening to the NPR entry about Vampire Weekend, I habitually added the podcast to my RSS reader.  I was impressed to discover that the feed has 765 entries.  That’s five entries a week going back to 29 November 2005; three years of trends in global music culture, right there at my fingertips.

To my eyes, this seems like a wonderful new meaning of “long tail”.  It’s something that we’ve seen pending on the horizon but it’s now finally beginning to manifest.  Our civilization documents itself so thoroughly that we can grab a detailed background on nearly any topic.  This has been true in a cursory way for a while, manifesting in sites like Wikipedia, but the internet is quickly reaching an information saturation point and architectural maturity that allows us to view the entire web as a living, self-documenting wiki.

Up until now, the long tail has mainly referred to an economic construct: by reducing barriers to entry into markets, by radically reducing distribution costs, and by increasing the opportunities for direct engagement between producer and consumer, the internet has made it profitable (or at least financially tenable) to cater to the countless minority niche interests in any given market.

I see a different kind of long tail coming to prevalence now.  Where the economic long tail is far reaching, the long tail of information runs deep.  As with the NPR podcast, we can look back in time and find a wealth of source material.  Armed with 20/20 hindsight, we can view and review the many ways that our civilization has chosen to express itself.  Published materials no longer die a day or a week after their creation; instead they stay alive for us to find them, or find new meaning in them, in the future.  Even better, we have begun to resurrect the materials that might have been presumed dead, destined to spend eternity on a dark dusty shelf.

When I look at that NPR podcast, I see context.  I see one thread in a complex history that I get to explore and rearrange at my own leisure.  Each of us sees this ocean of information differently, and each time we dip our hands into its depths we return with our own fresh version of the story, woven from the many disparate threads (and the gems upon them) that lie beneath the surface.

Preview of ActiveFedora DSL

October 6th, 2008
Written by matt

We have been working hard on creating a Domain Specific Language for declaring object models in ActiveFedora.  We settled on a syntax based on DataMapper.

Here are sample Model declarations for Audio Records and Oral Histories that we are using in a current project.  Keep in mind that this is just a teaser.  The syntax is likely to change over the next few months.

require 'active-fedora'

class AudioRecord
    include ActiveFedora::Model

    relationship "parents", :is_part_of, [nil, :oral_history]
    # Also considering...
    # has n, :parents, {:predicate => :is_part_of, :likely_types => [nil, :oral_history]}
    # OR
    # is_part_of [:oral_history]

    property "date_recorded", :date
    property "file_name", :string
    property "duration", :string
    property "notes", :text

    datastream "compressed", ["audio/mpeg"], :multiple => true
    datastream "uncompressed", ["audio/wav", "audio/aiff"], :multiple => true
end
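For readers who have not used DataMapper-style declarations, here is a rough sketch of how class-level macros like property can be built in plain Ruby.  This is not ActiveFedora's implementation: the names SketchModel and AudioRecordSketch are made up for the example, and the macro only records the declaration and defines accessors, which is far less than a real model layer would do.

```ruby
# Hypothetical stand-in for a declarative model DSL; NOT ActiveFedora's code.
module SketchModel
  # When the module is included, give the including class the macros below.
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declared properties keyed by name, e.g. { "notes" => :text }.
    def properties
      @properties ||= {}
    end

    # Class macro: remember the declaration, then define a reader and writer.
    def property(name, type)
      properties[name.to_s] = type
      attr_accessor name
    end
  end
end

class AudioRecordSketch
  include SketchModel

  property "file_name", :string
  property "notes", :text
end

record = AudioRecordSketch.new
record.file_name = "interview_01.mp3"
puts record.file_name
puts AudioRecordSketch.properties.inspect
```

Because the declarations are stored on the class, other code (serializers, form builders, validators) can inspect them later, which is most of the appeal of declaring a model this way.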

Note that we are making it possible to inject custom methods into a class that search against RDF predicates.  This way, thanks to the relationship "parts" declaration in the OralHistory model below (the one marked :inbound => true), calling oral_history.parts will return everything pointing at the oral history object with info:fedora/isPartOf.  We are also thinking of supporting constraint parameters like oral_history.parts(:type => AudioRecord), which would only return the parts that are of type AudioRecord.

require 'active-fedora'

class OralHistory
    # Imitating DataMapper ...
    include ActiveFedora::Model

    relationship "parts", :is_part_of, [:audio_record], :inbound => true

    # These are all the properties that don't quite fit into Qualified DC
    # Put them on the object itself (in the properties datastream) for now.
    property "alt_title", :string
    property "narrator",  :string
    property "interviewer", :integer
    property "transcript_editor", :text
    property "bio", :string
    property "notes", :text
    property "hard_copy_availability", :text
    property "hard_copy_location", :text

    has_metadata "dublin_core", :type => ActiveFedora::MetadataDatastream::QualifiedDublinCore do |m|
      # Default :multiple => true, :refinements => :none
      #
      # on retrieval, these will be pluralized and returned as arrays
      # e.g. subject_entries = my_oral_history.dublin_core.subjects
      #
      # aiming to use method_missing to support calling methods like
      # my_oral_history.subjects  OR   my_oral_history.titles  OR EVEN my_oral_history.title whenever possible
      m.field "identifier", :string, :refinements => ["info:fedora", "info:doi"]
      m.field "title", :text, {:multiple => false, :required => true}
      m.field "subject", :text, :refinements => ["dcterms:LCSH", :none]
      m.field "date", :date
      m.field "language", :text
      m.field "location", :text
      m.field "coverage", :text, :refinements => ["dcterms:TGN"]
      m.field "temporal", :text, :refinements => ["dcterms:Period"]
      m.field "abstract", :text
      m.field "rights", :text
      m.field "type", :text
      m.field "SizeOrDuration", :text
      m.field "format", :text
      m.field "medium", :text
    end

    has_metadata "significant_passages" do |m|
      m.field "significant_passage", :text
    end

    has_metadata "sensitive_passages" do |m|
      m.field "sensitive_passage", :text
    end
end
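To illustrate the inbound relationship idea described above, here is a small self-contained Ruby sketch.  It is only a guess at the general shape, not ActiveFedora code: TripleStoreSketch is a hypothetical in-memory stand-in for the repository's relationship index, and inbound_relationship, OralHistorySketch, AudioPart and TranscriptPart are names invented for the example.

```ruby
# Hypothetical in-memory stand-in for a repository's relationship index.
# Each entry is [subject, predicate, object]; this is NOT ActiveFedora's API.
class TripleStoreSketch
  def initialize
    @triples = []
  end

  def add(subject, predicate, object)
    @triples << [subject, predicate, object]
  end

  # Every subject that points AT the given object with the given predicate.
  def subjects_pointing_at(object, predicate)
    @triples.select { |_, pred, obj| pred == predicate && obj.equal?(object) }
            .map(&:first)
  end
end

STORE = TripleStoreSketch.new

module InboundSketch
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Class macro: inject an instance method that searches for inbound links,
    # optionally narrowed with :type => SomeClass.
    def inbound_relationship(name, predicate)
      define_method(name) do |opts = {}|
        found = STORE.subjects_pointing_at(self, predicate)
        opts[:type] ? found.select { |s| s.is_a?(opts[:type]) } : found
      end
    end
  end
end

class AudioPart; end
class TranscriptPart; end

class OralHistorySketch
  include InboundSketch
  inbound_relationship :parts, :is_part_of
end

history = OralHistorySketch.new
audio = AudioPart.new
transcript = TranscriptPart.new
STORE.add(audio, :is_part_of, history)
STORE.add(transcript, :is_part_of, history)

puts history.parts.length
puts history.parts(:type => AudioPart).length
```

The oral history object itself stores nothing about its parts here; the method is answered entirely by searching for things that point at it, which is what "inbound" means in this context.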