
Live Blog: Second Morning Session – Peter Burnhill and Herbert Van de Sompel

Posted: September 2nd, 2010 | Author: Nicola Osborne | Filed under: Live Blog, Planning, Videos


::Please note that this is a live blog and various typos, corrections, links and images will be added in due course – all comments or suggestions of useful additions welcomed though::

Peter Burnhill, Director of EDINA, begins his introduction by alerting us that OpenDepot.org launches today!

OpenDepot is a deposit facility that uses APIs on OpenDOAR and ROMEO so that institutions can be found and publisher policies can be understood. It includes that all-important “request from author” button as well. This work has come out of various projects looking at the depositor as user. Some of that thinking around transforming content into other things is really valuable – lots from this morning for repository managers here. But we wanted to set up a repository with some sense of permanence and good quality metadata – without that, findability and reuse are compromised. Theo Andrew and Ian Stuart have been working on OpenAccess Repository Junction and OpenDepot. The key notion of OpenDepot is that it is a place where *anyone* can put stuff. It has a clear redirect so that anyone who has an institutional repository will be directed to use that instead. It is a memorable one stop shop that will direct your users back to you to deposit their data. So remember: “Put it in the Depot!”

Now, my duty proper is to chair Herbert Van de Sompel’s talk. He’s going to talk to us about Memento. Herbert has made a big splash in scholarly communication already. You’ll be aware of OpenURL and its importance in identifying the appropriate copy. Herbert and his colleagues were responsible for designing that particular misuse of HTTP. He’s gone on to work with URIs and the Semantic Web but he’ll be talking today about Memento.

Herbert Van de Sompel – Memento: Time Travel for the Web

This will be a very different talk to this morning’s session but I think you will see that we have much in common around using web applications.

Memento is about accessing the web of the past – adding a time dimension to the web. Making it as easy to look at material in the past as in the present.

Session 2 of Repository Fringe - Peter Burnhill

Tim Berners-Lee talks about Generic versus Specific Resources. There are resources that change all the time – Generic Resources – and there are those that are static (for instance items in a web archive) – Time Specific Resources. We know this intuitively but HTTP doesn’t deal with this information. Unfortunately the pre-web semantics around time have been lost on the web. Our time information now is about recent modifications and retrieval. That information is not even trustworthy. You can easily get responses from servers that don’t include a time statement. But there are archival resources out there.


For instance an old version of the CNN.com homepage from the Internet Archive (from September 11th). It has a URI. Wikipedia also archives pages and provides a URI for a specific moment in time (not just the generic URI for that topic). So these are available but getting to them is not that easy. First of all you need to know what web archives exist and where they all are. They work like a search engine: you get a hit list of dates with archival material. And that’s how you find that page – a search and navigational exercise. It is very similar in content management systems, and Wikipedia is a good example. There is a browse and edit history for the page. Thousands of edits for some pages. Hard work to look through. None of this happens at a protocol level. Navigating is a bit odd also. Looking at Wikipedia again, we see a link to the Pentagon; if I click on it I go to the *current* version of the website. This might be OK but if you are browsing in time you may want to see the Pentagon page from that date. We also have a problem with the web archive – links do not always work and content may be missing. And the way archives work, they only point inward rather than to other archives or copies of content.


So that was the problem in finding and navigating this information. There is another problem with how these pages link together, or do not link to each other. There is a website called “WebCite” where you can archive a website. In the user interface the archived copy points to the current page, but there is no link at the HTTP level. The URI is opaque and says nothing of the origins of the page. On the Internet Archive we see a page from the Library of Congress. There is no link between past and present, but the URI gives some more information. Still, there are no links to archived versions of resources, neither at the HTTP level nor in the URI. This is the space in which Memento provides a solution. We are going to find a better link between the past and the present web by solving some of the problems I have described.

So firstly we will link and codify existing ad-hoc methods to create linkage from the past to the present (this is relatively easy) but we will also look further out to the rest of the web and this is much more complex. So if we look at normal HTTP flow we see requests and responses. This is the web without a time dimension. We do the same thing with a time-sensitive resource, but we use two different URIs to distinguish the past and the present. We could solve the confusion of the same website having different IDs by having some sort of unifying URI (URI-G) – a time gate.

This concept is not entirely new. Usually this is to do with brokering information – your browser asks for particular formats or preferences when it looks at the web. Your client communicates that to the server. Content negotiation exists in four dimensions. Memento is introducing a fifth one: time. So you can negotiate for the resource in time. So here you have a depiction of the solution. Here you see current CNN.com and then we also see archived versions. So we say that at the end of the web archive we add a resource, a time gate, to negotiate in time. So, depending on the time you specify to the time gate, you get another version of the resource. Having introduced this we can use a time slider at the protocol level. But we still remember that we must connect to the current version and the URI by which this thing is really known. We use a thing called a Link Header here. This lets you specify typed links at the protocol level.
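As a rough illustration of negotiating in time, the sketch below builds the request header a client might send to a time gate and reads typed links out of a Link header. The header name Accept-Datetime and the relation names "original" and "memento" are taken from the later Memento specification (RFC 7089) rather than from this talk, archive.example.org is a made-up host, and the parser is a simplification that only handles quoted parameters.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Simplified Link header grammar: <target>; name="value"; name="value", <target>...
# Assumption: every parameter value is quoted (real headers may be looser).
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def datetime_negotiation_headers(when):
    """Build the request header that asks a time gate for a version near `when`.

    `when` must be a timezone-aware datetime; it is sent as an HTTP date in GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


def parse_link_header(value):
    """Split a Link header value into a list of dicts, one per typed link."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = dict(PARAM_RE.findall(params))
        attrs["uri"] = target
        links.append(attrs)
    return links


if __name__ == "__main__":
    print(datetime_negotiation_headers(datetime(2001, 9, 11, 20, 0, tzinfo=timezone.utc)))
    example = (
        '<http://www.cnn.com/>; rel="original", '
        '<http://archive.example.org/20010911203350/http://www.cnn.com/>; '
        'rel="memento"; datetime="Tue, 11 Sep 2001 20:33:50 GMT"'
    )
    for link in parse_link_header(example):
        print(link)
```

The regular expression is used instead of splitting on commas because the datetime parameter itself contains a comma.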

This is the basic architecture and idea of the project. This can also be applied to content management systems (e.g. Wikipedia) and in fact it is somewhat easier than web archives.

Memento HTTP Flow

Here we see the communication back and forth between requests and brokers to find appropriate representations in time. This is really how the entire flow works in Memento. This is a depiction of the entire concept here. The URI points to the timegate and then you negotiate content to find the appropriate “memento” in time. This means that you can view multiple versions and the current version and link between them. Herbert is currently writing the RFC for this concept (he shows us the ASCII version of his beautiful diagram!).
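The flow described here can be walked through with a small in-memory simulation: no network is involved, and the three functions stand in for the original server, the timegate and the archive. The holdings, the archive.example.org host and the URI layout are invented for illustration, and the header names (Vary, Memento-Datetime, the "timegate" and "original" link relations) follow the later RFC 7089 rather than anything shown on the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

TIMEGATE_PREFIX = "http://archive.example.org/timegate/"  # hypothetical

# Hypothetical holdings: original URI -> list of (capture time, memento URI).
HOLDINGS = {
    "http://www.cnn.com/": [
        (datetime(2001, 9, 11, 20, 33, tzinfo=timezone.utc),
         "http://archive.example.org/20010911203300/http://www.cnn.com/"),
        (datetime(2005, 8, 29, 12, 0, tzinfo=timezone.utc),
         "http://archive.example.org/20050829120000/http://www.cnn.com/"),
        (datetime(2010, 9, 1, 9, 0, tzinfo=timezone.utc),
         "http://archive.example.org/20100901090000/http://www.cnn.com/"),
    ]
}


def get_original(uri):
    """Step 1: the current resource advertises its timegate in a Link header."""
    link = f'<{TIMEGATE_PREFIX}{uri}>; rel="timegate"'
    return {"status": 200, "headers": {"Link": link}}


def get_timegate(timegate_uri, accept_datetime):
    """Step 2: the timegate redirects to the capture closest to the requested time."""
    original = timegate_uri[len(TIMEGATE_PREFIX):]
    _, memento_uri = min(HOLDINGS[original], key=lambda c: abs(c[0] - accept_datetime))
    return {"status": 302, "headers": {"Location": memento_uri, "Vary": "accept-datetime"}}


def get_memento(memento_uri):
    """Step 3: the memento states its capture time and links back to the original."""
    for original, captures in HOLDINGS.items():
        for captured, uri in captures:
            if uri == memento_uri:
                headers = {
                    "Memento-Datetime": format_datetime(captured, usegmt=True),
                    "Link": f'<{original}>; rel="original"',
                }
                return {"status": 200, "headers": headers}
    return {"status": 404, "headers": {}}


def time_travel(uri, when):
    """Follow the three steps and return (memento URI, final response)."""
    link = get_original(uri)["headers"]["Link"]
    timegate_uri = link[link.index("<") + 1:link.index(">")]
    location = get_timegate(timegate_uri, when)["headers"]["Location"]
    return location, get_memento(location)


if __name__ == "__main__":
    print(time_travel("http://www.cnn.com/", datetime(2005, 9, 1, tzinfo=timezone.utc)))
```

Picking the closest capture in either direction is one possible policy; a real timegate could equally prefer the nearest capture before the requested time.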

When we sent in a first paper on this project we were asked by one peer reviewer:

“Is there any statistics to show that many or a good number of web users would like to get obsolete data or resources?”

Our paper was rejected! Now this reviewer would get laughed out of the room here but there are people that think this way and don’t see the usefulness of this work… so…

If we look at the example of Hurricane Katrina: you could look at a documentary about the event but you could also experience it as it happened. And you see the change in headlines and the escalation. It is powerful to revisit contemporary materials in this way and experience events in that way. It’s cute, social scientists may be interested. But this also matters in terms of data.

This is a time series analysis across DBPedia Versions – this was collected using DBPedia (linked data version of Wikipedia) and comparing past to present versions. This was done using Memento.

Resource versioning is important to repositories etc. Let’s say a resource sees the light of day at t0, then you see versions at t1, t2 etc. In many applications you actually want version numbers and snapshots as materials evolve. On the web you do this with new URIs at the various places in time. This mechanism is used a lot on the web – this is in recommendations of W3C etc. This exists. But if you augment with Memento really wonderful things can happen that allow you to follow your nose to navigate through time. The time gate lets you negotiate in time purely through HTTP referencing. We did various “Picture of the Day” snapshots when we started this work so that you could see the different images AND store data on each access etc. Each image links to a timegate and other versions. And we created a movie through all those images to give a sense of change and time travel. This is fun! Doing this via one URI with lots of links is great. Applied to data this is serious stuff.
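To make "follow your nose" concrete, here is a sketch of the Link header one version (t0, t1, t2...) might carry so that a client can step to the timegate and to neighbouring versions. The relation names ("first memento", "prev memento", "next memento", "last memento") are borrowed from the later RFC 7089, the example.org URIs are invented, and the function name is hypothetical. For simplicity each relation is emitted as its own link even when two of them point at the same version.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def version_links(original, timegate, versions, index):
    """Return a Link header value for versions[index].

    `versions` is a list of (timezone-aware UTC datetime, URI), oldest first.
    """
    def memento(rel, item):
        when, uri = item
        return f'<{uri}>; rel="{rel}"; datetime="{format_datetime(when, usegmt=True)}"'

    links = [
        f'<{original}>; rel="original"',
        f'<{timegate}>; rel="timegate"',
        memento("first memento", versions[0]),
    ]
    if index > 0:  # there is an earlier version to step back to
        links.append(memento("prev memento", versions[index - 1]))
    if index < len(versions) - 1:  # there is a later version to step forward to
        links.append(memento("next memento", versions[index + 1]))
    links.append(memento("last memento", versions[-1]))
    return ", ".join(links)


if __name__ == "__main__":
    # Hypothetical "Picture of the Day" snapshots.
    snapshots = [
        (datetime(2010, 1, 1, tzinfo=timezone.utc), "http://example.org/potd/v0"),
        (datetime(2010, 1, 2, tzinfo=timezone.utc), "http://example.org/potd/v1"),
        (datetime(2010, 1, 3, tzinfo=timezone.utc), "http://example.org/potd/v2"),
    ]
    print(version_links("http://example.org/potd",
                        "http://example.org/timegate/potd", snapshots, 1))
```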

So we applied this concept to linked data. This is the notion of a non-information resource that links to an information resource. Applying a TimeGate to this we can grab current and previous information on some key concept (e.g. City of Paris). We did work on this at Los Alamos working with DBPedia. So for example Herbert shows us Edinburgh, with a natty timeslider that allows you to browse past versions on DBPedia. This is exactly what we did for the GDP graph – a simple set of data pulled out and graphed via Google. This is linked data but this could apply to scientific data that evolves over time. And I will be meeting tomorrow with the creator of the XARCH tool that takes snapshots of data over time. If we can achieve that it would be the first demonstration of Memento concepts applied to scientific data.

There is a Mozilla Firefox plugin that makes your browser Memento-capable (mementoweb.org). There is also a plugin for MediaWiki installs so if you use MediaWiki at your institution I beg you to install the Memento extension and lobby MediaWiki to get this more widely available. We are also working on mobile browsers and plugins for Internet Explorer (an intern from Microsoft is working on Memento this summer).

So, to conclude, this is all about the tremendous value of the URI. In the early days of the web the URI was an access point to a page. Then linked data turned the URI into both an access point for a page and a source of current information about something. With Memento we are augmenting the power of the URI to connect the past to the present.


Q&A

Q – James Toon) Can you model other data in Memento (??)**

A) The huge difference between Memento and current web archives is that in Memento each URI is dereferenced in its own right. It remains the URI as it is. In archives URIs are rewritten to point to internal links in the archive. In Memento each URI is dereferenced in time. So you can reassemble a page from many different web archives. So if you know the URIs of all the data you want to access and if the relevant systems are Memento compliant then you can assemble that data.
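A minimal sketch of that reassembly idea, assuming we already know which captures each archive holds: every URI on a page is resolved on its own, against all archives, for the same target date. The archive names, holdings and function names are invented for illustration, and choosing the capture closest in time is just one possible rule.

```python
from datetime import datetime, timezone


def best_memento(uri, when, archives):
    """Pick the capture of `uri` closest to `when` across all archives, or None."""
    candidates = [
        (abs(captured - when), captured, memento_uri, name)
        for name, holdings in archives.items()
        for captured, memento_uri in holdings.get(uri, [])
    ]
    if not candidates:
        return None
    _, captured, memento_uri, name = min(candidates)
    return {"archive": name, "captured": captured, "uri": memento_uri}


def reassemble(resource_uris, when, archives):
    """Resolve each URI independently, so parts may come from different archives."""
    return {uri: best_memento(uri, when, archives) for uri in resource_uris}


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical holdings: archive name -> original URI -> [(capture time, memento URI)].
    archives = {
        "archive-a": {
            "http://example.org/page": [
                (datetime(2009, 5, 1, tzinfo=utc), "http://a.example/2009/page"),
            ],
        },
        "archive-b": {
            "http://example.org/page": [
                (datetime(2007, 1, 1, tzinfo=utc), "http://b.example/2007/page"),
            ],
            "http://example.org/logo.png": [
                (datetime(2009, 4, 20, tzinfo=utc), "http://b.example/2009/logo.png"),
            ],
        },
    }
    wanted = ["http://example.org/page", "http://example.org/logo.png"]
    for uri, hit in reassemble(wanted, datetime(2009, 6, 1, tzinfo=utc), archives).items():
        print(uri, "->", hit)
```

A URI that no archive holds maps to None, which matches the point made earlier in the talk that content may simply be missing.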

Q – PB) you started with apparent web, now you look beyond that into almost versioning of any object on the web?

A) Yes, so the obvious target – it was funded by LC – was web archiving. But when we started on that we realised we are versioning data and that means you can apply this idea all over the place. So some of the most interesting applications will be around data and sciences.

Q – Sheila) So with your GDP data, can you track the changes in time – how estimates change?

Comment – PB) In time series data there are often lots of changing estimates that are revised over time and there is information at one time that then changes.

A) Yes, that could definitely be viewed and consulted. All of that is possible BUT you have to think about that up front and know that you will need that sort of graph. You need to design that capability in your systems for that research question but Memento gives you the tool to do that.

Q) arbitrary points in time but how about specific versions – e.g. Version 1, Version 2, etc.

A) We have received the question of asking for an event or a version, rather than a date in time, quite a few times. For example looking for the day that Michael Jackson died – ask the system, it finds the date and returns that to you **ask re: Synch3**. We will be getting further LC

Q) I’ve been trying the plugin – so if you set the slider does that setting persist?

A) If you use your timeslider you will continuously ask for that day until you move it again. In the plugin you actually see the difference between the date you ask for and the date you get back. At the moment Memento only stays relevant in the tab it is open in – as we received requests for clear differentiation – but perhaps that could be a setting at a later stage so that you could browse in every tab for a given date you have set on the slider.
