As you may have noticed, we have now moved over to a new blog for the 2011 Repository Fringe. We’ll be retaining this blog here as an archive of the 2010 Repository Fringe, but for all the information on the 2011 event have a look at the new site and our just-added call for contributions: http://www.repositoryfringe.org/
As we mentioned at the end of March we are currently planning the Repository Fringe 2011 and we now have a wee update to report.
We are hoping to again have a two day repository fringe event and one day of workshops/meetings this year so grab those diaries and pencil 3rd, 4th and 5th August in as the expected Repository Fringe dates.
The dates also cunningly dovetail with the Edinburgh Festival Fringe preview week, so once we have finalised the details we’ll let you know here on the blog and through the usual mailing lists, as it will be important to book travel and accommodation early.
As ever if you are interested in taking part (with a talk, pecha kucha, roundtable, workshop etc) or sponsoring the event it’s never too early to give us a wee shout (email@example.com).
It’s been a while since we last posted here, so it seems high time for an update.
We are currently discussing Repository Fringe 2011 and would very much welcome any suggestions of speakers for this year’s event. Right now we’re at the eating biscuits, making wishlists and investigating venues and costs stage but we’ll update you when there is firm news to report. We are particularly keen to speak to potential sponsors for the event – please get in touch with the organising team via: firstname.lastname@example.org
Finally you will see that a few posts may have moved around – we had a glitch in the blog which tried to give every post a new time stamp of today and our fix for this may have slightly altered the order of posts. Sincere apologies for this. We have also moved the CoverItLive coverage from RepoFringe2010 away from the top of the page but it is still safe and sound and available to view.
The Linking Articles into Research Data Round Table Session took place on day two of the Repository Fringe (Friday 3rd September 2010) from 11am to 12.30pm. The session was facilitated by Robin Rice of the University of Edinburgh Data Library and EDINA. Philip Hunter kept notes of the session and has kindly provided this report:
Robin introduced the session and the motive behind it, namely the University of Edinburgh’s internally funded LAIRD project. After brief introductions from the 21 participants, the discussion began, with questions brainstormed by the LAIRD team.
‘Should we encourage academics to publish the data set that goes with a particular publication, rather than the whole of the data they’ve produced?’
It was observed that a limited dataset allows you to see the wood for the trees, and that is what you should show to back up your findings. Including a dataset could be seen as an extension of the peer review process, but what you choose to exclude is of significance. Selecting the dataset is not an entirely objective decision.
Size of datasets was discussed as an issue. Some will be curated for reuse rather than to support an article. One suggested approach would involve maintaining the whole original database, plus including filtered views to show what is being used. However, current procedures for linking don’t recognize that you can link to two things, one as a subset of the other. We need this facility.
Currently if supplementary data is offered to journals, they may say that they can’t support this.
There is a danger with very large datasets (terabytes of data) that academics are attempting to get around the cost of archiving data with IT services. They want to pass these costs on to the library, but to have the same standard of archiving service from IT. Studies show that lots of moderately sized items cost more than a few large ones. It can be a significant cost because it is ad hoc, and not counted as a cost. Users don’t…
‘How do Repository managers want to accept storage – for open access only, or archiving?’
In Oxford everything is in a repository (Oxford Databank). Researchers understand they are paying for storage, and that they have to do this.
It is about managing an expectation from the researchers – people have an idea of what they get from a repository. But data might not be readable without extra work. The researcher has an expectation of what they are getting from the system, even if it is the same underlying system providing different services.
Will this scale if offloaded onto researchers? Oxford places responsibility for choosing formats with the researcher. If zoologists use specific metadata, the repository supplies generic metadata to the world (not everyone has the expertise to offer guidance on very specific metadata).
This was challenged. If you get one thing wrong migrating a dataset, it is null and void (and that includes metadata). The idea was floated that you might keep the original in order to be able to correct the dataset. Though this would be expensive.
On the other hand, lots of useful things can still be done with things which have lost contextual information and metadata. If you worry too much about doing it perfectly, you do nothing. People can still do something by putting in time and effort. To scale this is worrying however. And it may be an incomplete set, though this may be unknown.
It is possible that the dataset has been repurposed for something else. So the authenticity of the data needs to be defended by someone. It was suggested that the scale of data is more of a problem than its authenticity.
Are most people putting data in repositories for Open Access purposes rather than preservation? There was no clear view about this.
‘How could linking datasets and publications improve scientific practice?’
Presenting the underlying data improves verification of the work. One dataset may support two different theses, so publication may make it possible to create the alternative thesis. If the data is available for rechecking there is less likelihood of data being falsified and misused. But much data will not be checked. Sometimes it is difficult to check because you don’t have the exact same dataset.
What is the current reward structure for datasets? Some journals require datasets to be available, and some evidence of the impact of research will be required by research councils. We might ask re-users of data to cite the paper associated with this data. In medicine there might be third party validation of the work and what you do. And as mentioned before, the dataset might have more than one use. It is normal to cite a paper which is being criticized.
The more new research you can do with the same data, the cheaper it becomes (this is definitely true in astronomy). And cost is high on people’s agendas. The ESRC has required that you offer your data to them for this sort of purpose, but they can reject the data if it is poor.
‘DOIs – for pointing to datasets, what is most important – bibliographic citation or permanent identifier? Are DOIs preferable to handles?’
It is a good idea to be able to cite the dataset directly. But if the dataset is particularly large, it would be good to be able to point to part of it. The type of linking really depends on how the users want the data to be used. It was pointed out that Herbert van de Sompel was talking about the levels of access yesterday, and granularity of access is an issue. Should datasets have to be separate collections? So far we use IDs only at item level. Is one identifier per file enough?
Data subsetting was mentioned, and is usually quite complex. Who sets the boundaries on the dataset is important – we might be uncomfortable about someone who didn’t know the data doing this, though good documentation might make it possible to figure things out.
Publishers want to be able to track citations of datasets (DataCite has embraced DOIs). Some of us are still looking to see what the minimum requirement might be. DOIs cost money, and so far are not popular in the linked data community. The exact costs of using DOIs need to be well known (and aren’t).
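For illustration, a DataCite-style dataset citation can be assembled from a handful of fields. Everything in this sketch – the names, the title, and the DOI – is invented, and the format is only an approximation of DataCite’s recommended citation form:

```python
# Minimal sketch of a DataCite-style dataset citation.
# All names, titles, and the DOI below are invented for illustration.

def cite_dataset(creators, year, title, publisher, version, doi):
    """Format a citation roughly following the DataCite recommended
    form: Creator (Year): Title. Version. Publisher. DOI."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {version}. {publisher}. https://doi.org/{doi}"

citation = cite_dataset(
    creators=["Smith, J.", "Jones, A."],
    year=2010,
    title="Survey responses on repository use",
    publisher="Edinburgh DataShare",
    version="Version 1.0",
    doi="10.1234/example-dataset",  # hypothetical DOI
)
print(citation)
```

The point of minting a DOI is that the final link in the citation stays resolvable even if the repository’s own URLs change.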
‘Which metadata is required for linking outputs to datasets, and can ontologies help?’
We have to consider what metadata is required to link outputs to datasets, and what is valuable at the end of that link to the researcher, etc.
In some cases the information comes from the researchers, including the fields they require (groups come with their own ‘kooky’ markup). Government data all comes with its own ontology – if it is well described, then you can reuse it. It is also possible to find datasets through linked data graphs. EPrints is fully linkable – everything is specified as a URI. You can add arbitrary metadata in the form of RDF triples. Though with current data models, linking data is not easy.
The Edinburgh DataShare project was looking at Dublin Core for metadata – DC now has RDF triples in it and has become ‘DC Terms’. They are trying to make a ‘baby step’ to connect the Research Publications Service (RPS) and Edinburgh DataShare. Though it remains possible to make references without explicit assertion of the relationship of the output to the dataset (e.g., ‘has relation’).
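To make that ‘baby step’ concrete, the link between an article record and a dataset record can be expressed as a DC Terms triple. The sketch below uses plain Python tuples rather than an RDF library, and the two record URIs are hypothetical:

```python
# A triple is (subject, predicate, object). dcterms:relation asserts a
# generic link between an article record (e.g. in RPS) and a dataset
# record (e.g. in DataShare); dcterms:isReferencedBy is a tighter inverse.
DCTERMS = "http://purl.org/dc/terms/"

article = "http://example.org/rps/article/42"        # hypothetical URI
dataset = "http://example.org/datashare/dataset/7"   # hypothetical URI

triples = [
    (article, DCTERMS + "relation", dataset),
    (dataset, DCTERMS + "isReferencedBy", article),
]

# Print in an N-Triples-style serialisation.
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Because both ends of the triple are URIs, the same assertion works whichever repository holds it, which is exactly what makes cross-repository linking feasible.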
‘Citation alerts and other tools?’
Often constrained by policy guidelines. Citation alerts should be one part of a tool for guiding people to where something has gone. The DCC has some data management technology. If tools are provided, say for humanities, hosted along with the db, it makes matters a lot easier. Perhaps we should get access to the entire research area with hosted tools – not all users want to open ports from desktops, so this would encourage people to put their data into the hosted network structure.
An interesting model – but apart from the SURF example, we don’t have much evidence for it so far. We do need some way of creating complex objects, and we should be open to the idea of this.
Ian Stuart reports back from the Round Table on geo-tagging:
The first question that was asked – and it was a question that reappeared in several guises through the conversation – was: what is tagged, the picture or the contents of the picture? If one is on the London Eye and takes a picture of Big Ben, which is the “location”?
Another interesting variation on that theme, more specific to the researcher, is geo-referencing both the location of the people and the place of interest: it would be good to find out where the Archaeologists that have expertise in Northern Persia come from.
This led to several questions: “Where is a place?”; “what about non-terrestrial places?”; and “what about the place as it was at <time>?”.
The “Where is a place?” question has a whole gamut of meanings: what do we mean by ‘Edinburgh’? Is it where the “authority” says it is, or where the crowdsource says it is?
How accurate does a location need to be? We all agreed that “location” was essentially a fuzzy point in space/time, and the accuracy of that fuzziness depended on the context of the question: where is Edinburgh for someone in the area? For someone in London? For an Alaskan? Another issue is that different cultures will always use their local reference systems (be it OSGB, Tutsi klavics, or a colloquial name).
How do you cope with fictional places (Mordor) or non-terrestrial places (the Moon)? And how do you cope with the way areas change over time: cities grow; rivers move; and conceptual areas (“the red light district”) fluctuate and move in a very informal way.
We recognised that “finding” is pretty good now – Google, Streetmap and GeoCrossWalk can all use different scales… But deposit/definition is lacking.
The conversation then moved on to how we get that location data, and we were reminded that people do NOT like entering metadata: seven fields is acceptable, but only if four of them are auto-populated!
In the future, repositories (databases of things with metadata) will not be catalogued… Like YouTube, Flickr and Slideshare, academic repositories will be self-deposited and self-catalogued… (to reflect on changes that have happened in the last few years: just as nobody sends handwritten notes to a secretary any more, so we will stop getting other people to do the depositing for us)… However, getting metadata out of academics is like getting blood out of a stone!
As we drew closer to the end of our time, we reached a consensus that what we wanted was the application to give us some suggestions, so we can mutate our location from there: get the application to mine the metadata & binary object, and display the locations that the system thinks it refers to. We reckoned that people would be more inclined to de-select the places that are wrong rather than write down the places that are right…. and that the real trick is to get positive feedback: show good uses for geo-located data and people will be encouraged to provide geo-locating information items…. in the self-interested belief that this will raise their own profile.
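A minimal sketch of this “suggest, then de-select” idea, assuming a toy in-memory gazetteer rather than a real geocoding service (all place data and the sample abstract are illustrative):

```python
# Toy gazetteer: place name -> (lat, lon). A real system would mine the
# metadata and binary object against a geocoding service; these three
# entries exist only to illustrate the workflow.
GAZETTEER = {
    "Edinburgh": (55.95, -3.19),
    "London": (51.51, -0.13),
    "Big Ben": (51.50, -0.12),
}

def suggest_locations(text):
    """Return candidate (name, coords) pairs the text appears to mention."""
    return [(name, coords) for name, coords in GAZETTEER.items() if name in text]

abstract = "University of Edinburgh fieldwork: photographs of Big Ben from the London Eye."
candidates = suggest_locations(abstract)

# The depositor de-selects wrong hits rather than typing correct ones:
# here "Edinburgh" is the depositor's affiliation, not the item's subject.
rejected = {"Edinburgh"}
kept = [(name, coords) for name, coords in candidates if name not in rejected]
print(kept)
```

Striking out two false hits is far less effort than keying in coordinates, which is why the de-select pattern was felt to stand a better chance with depositors.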
::Please note that this was noted as a live blog and various typos, corrections, links and images will be added in due course – all comments or suggestions of useful additions welcomed::
‘We imagined teaching and learning repositories – imagine what impact they could make in your institution’ Yvonne Howard/Patrick McSweeny University of Southampton
Yvonne wanted to call this session to talk about teaching and learning activities and repositories. We hear a lot about research uses but much less about learning and teaching, even though 50% of our funding, as universities, is for learning and teaching. So Southampton have been looking at what is going on in this area.
Repositories help as they are a place that you put stuff. We have a number of projects. RunShare has produced various projects including EdShare – that’s our generic name for free, open-source learning and teaching repository software funded by JISC. So, if you are going to have a teaching and learning repository then make it work for that; make it fit what people really do.
So what to do with a problem like teaching and learning repositories? Teaching and Learning repositories started around 2002/3 when learning objects were all the rage. The idea was to share well built stuff that could just run. Cue image of learning object – a big lump of content – notes, images, quizzes etc. It’s a lump and an intentional lump. Learning objects are kind of about learning something in 21 days. But that may be what distance learning is about, not what universities are about.
Worse, the metadata includes words like scaffolding and discourse structure, all wrapped in XML in a lump.
Phil, CETIS, interjects to say that this is disingenuous – learning objects were supposed to be the granular successor to CD-ROM materials…
Yvonne: Except it’s not very granular – you can’t take those learning object packages apart.
Phil: Yes, you can if you know how.
Patrick: A pedagogist could take that apart but a teacher may not be able to.
Phil: Well that’s about the tools available. You can’t claim that the concept is poor, there are issues with how the concept was/has been realized but you can’t write off learning objects as a concept so sweepingly.
Yvonne: People have had high hopes of learning objects. In talking to teaching staff, responses included people saying that .zip files are recognizable but they don’t know what to do with that object… There was confusion there.
So, what is a repository?
When we set up the repository at Southampton it was very much about research and a representation of the research of the institution and about archiving:
- storing things safely, in a permanent and official way.
- indexing – it is a curated and monitored space.
But these tasks around deposit etc. are mainly done for you. Few people use the repository unless they are updating their own CV with their publications to date.
Teaching staff who were not using learning objects or repositories were, however, using sharing sites like YouTube, Flickr etc.
So what does that tell us about teaching staff? They understand the granular reuse paradigm but they are picking materials from other sites.
These sharing sites offer hosting of the stuff you are using, they let you organize stuff – something I don’t think about in my research repository – and they give you a sense of community: you can trace and browse creators’ work. There are web 2.0 elements that facilitate sharing in a really useful way. But these sites aren’t about altruism – the intent of posting may well not be reuse. Sharing is a side effect, not the motivating driver.
Having learned those tough lessons, we thought: do we need to be like the social sharing sites – where things are ephemeral, not about archiving? So we thought about what you do on YouTube. Deposit is simple and metadata is simple and minimal. So if you go to EdShare the only thing you have to put in is the title. Other metadata is useful but not obligatory. But we do try to derive metadata from contributed data as much as possible – the system already knows lots of information: depositor name, affiliation, etc. The only other thing you have to do is decide who will see your content – is it public or not? And then we also provide hosting – so you have inline preview of content, videos, etc., with no need to download first. Adding information about the creator and their profile to all videos enhances the relevance of what you have added.
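A “title only, derive the rest” deposit flow might be sketched as below. The field names and defaults here are assumptions for illustration, not EdShare’s actual schema:

```python
# Hypothetical deposit flow: the depositor supplies only a title;
# everything else is derived from their login session or left optional.

def make_record(title, session, visibility="public", **optional):
    """Build a deposit record from one required field plus context."""
    if not title:
        raise ValueError("Title is the only required field")
    record = {
        "title": title,
        "depositor": session["name"],      # derived from login, not typed in
        "affiliation": session["school"],  # derived from login, not typed in
        "visibility": visibility,          # who sees it: public or restricted
    }
    record.update(optional)  # tags, description, etc. remain optional
    return record

session = {"name": "A. Lecturer", "school": "School of Informatics"}
record = make_record("Week 3 slides", session, tags=["lecture", "slides"])
print(record["depositor"], record["visibility"])
```

The design choice mirrors the YouTube lesson described above: every field beyond the title is either derived automatically or optional, so the deposit form never becomes the barrier.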
Yes, you can use Zip but it will unpack in the system as a group of items. You can upload multiple files to a single record. You can also have collections – a bit like channels and groupings of materials on YouTube and Flickr. It’s a virtual collection – everything stays where it is and can appear in any number of collections, but it is a way to organise them. And you can tag items.
This is a flexible approach that means that people can start out without organising their work well but their habits and better organisation can emerge.
Phil: What you are talking about converges with where the traditional approach is now. IMS content packaging’s next draft is very much in this direction. A learning object isn’t the zip or the metadata – that is just content packaging, the content is the learning object.
[There was further discussion of what a learning object is, and what is packaging at this point.]
Phil: One thing we agree on is that teaching staff should never have been told about packages etc. Teaching staff should have been focused on the content.
Yvonne: The community aspect of the teaching and learning repository means that you can point students to collections or materials but also your own homepage for the repository – the public face but also a private space with stats, recent uploads, your identity, the links to your materials. Also a set of things you use most often. And also comments and feedback on resources [very like Flickr] and a tag cloud. A way to make you aware of what’s going on with your stuff.
So the idea is to support a living life cycle – you create material, you share it, you get feedback, and you make changes and develop those materials. The idea is that things change over time. Teachers have a set of needs that is different – materials aren’t static.
We are at a more advanced stage – thinking about how people like to work, not just about the technology.
Remixing and re-use
Remixing and reuse does add value to content. So how does this work in practice? For instance, Rosie creates a resource of video, interviews and teaching notes. Sam takes the same content but adds newspaper content. Tara takes Sam’s object, ditches Rosie’s teaching notes and adds a student exercise – this is a more flexible sort of sharing.
Something Open this way comes
- MIT OpenCourseware
- iTunesU and YouTubeEDU
Now part of much wider ecosystem of open teaching and learning resources.
And OER was a real mandate (strongly supported by the previous government) to create and share materials.
There are now five repositories using the EdShare software (which is based on EPrints).
So, I want you to think about what teaching and learning repositories mean to librarians – how will they impact your institution? How do they impact what libraries offer? If we have T&L repositories we don’t necessarily have the idea of a curational management process – I deposit my working material myself, much in the same way that you create your own PowerPoints and are responsible for putting them on your machine. Where do the things librarians can offer fit in, and what might they do? When you think about OER, and especially the high-stakes examples… where are those resources going to come from? We are going to put up video of Sir Tim Berners-Lee on iTunesU – that’s a great promotional item for informatics and engineering at Southampton.
Lorna, CETIS: iTunesU is NOT open. It is one of the most closed sites that you could mention. It does not host open educational resources.
Yvonne: OK, but you can use it to advertise your institution. Where does that material come from? Does it float up from what is already there? I think that really helps academics’ esteem and the reputation of the organisation. Maybe the librarians and institutional technologists come in to look at this material. If we make OER content, how do we check the quality, and how do we gather attention metadata back? This will all be important to institutions in the future.
Yvonne, Dundee University: Who checks the copyright and rights of the stuff we share? A lot of my job is about the legal aspect of sharing.
Yvonne: Part of the upload process is licensing and making people aware of what can be shared and what cannot. We try to make sure people are aware of what they are sharing. If you are sharing your own material we assert a CC licence – you can change this to another if you like, but by default we assert one. We have run workshops on attitudes to copyright and we have had to talk about mature risk management in this area and strategies for checking the rights of materials – some staff are risk averse and some have no sense of risk.
Patrick: You can flag any resource as having some sort of problem – you can flag to say that there may be a copyright infringement. Education is an important part of this. Librarians’ roles are increasingly about advocacy and less about the old tasks of cataloguing and assisting research. Awareness raising internally is now a big part of the librarian role. Part of that advocacy is about training and running workshops around copyright, and about getting feedback on repositories, including departmental feedback. A trick institutional repositories missed is that they went for mandates rather than advocacy and engagement.
Lorna: Your presentation reminds me of one done by Charles Duncan about 10 years ago, before he was working on developing an electronic training suite. What was the thinking behind replicating the functionality of Web 2.0 services rather than just guiding users on how to use those existing tools?
Yvonne: Some of our teaching staff are quite naive and their working practice is quite simple. Some staff are still typing and photocopying notes. We can have slightly unreal understandings of what teaching staff will do online.
Lorna: So you thought an institutional service would work better?
Yvonne: Yes, more comfortable for staff and for institutions.
Lorna: The “agoraphobia of openness” – a nice phrase coined by Peter Burnhill of EDINA. It’s an interesting tension, though, because if you look at MIT OpenCourseWare, their success is quite striking.
Neil, LSE: That institutional comfort level and hand-holding is appreciated by many though.
Yvonne: And there are institutional restrictions on sites like YouTube for some – that needs to be considered.
William, Glasgow: We’ve found examples of that as well. Had videos uploaded to YouTube on how to use the library that foreign students cannot access.
Lorna: And you also prefer to manage access and authentication in house?
Patrick: And it’s also about branding. You lose that institutional presence on YouTube. From the perspective of the vice chancellor it has more value when the URL and banner reflect the institution. Also you can use the site to find showcase materials to then move into other spaces like iTunesU.
Lorna: But iTunesU is a smoke and mirrors trick: you can share content but there is nothing open about it.
Darius: You do sacrifice some things for that wide and simple distribution.
Lorna: It is open versus closed.
Patrick: Also, if iTunesU materials also occur in EdShare then you can of course access the source materials.
Lorna: iTunesU is not going away; we’ll see more usage, of course.
Yvonne: So again, what does this mean for librarians? We are seeing reputation-conscious, excited Vice Chancellors looking at sites like iTunesU. I think there will be a big difference to the way academics work.
Neil: Has there been more collaboration as a result of EdShare?
Patrick: We’ve changed what you see about people and who can edit materials – requests from depositors to share editing with other named people or other department members so that there can be additions and changes.
Yvonne: Another element of collaboration is about peer review. The Humbox project from the arts and humanities teaching areas – they build materials collectively and used discussion to improve and develop these – almost as peer review before depositing in Jorum. People are working together and the sense of live objects.
Phil: Sounds like the system behind OpenLearn – they also work together before publishing out to the public. But where does public access to anything that looks like a repository come in? Jorum looks like a repository. But where does taking stuff from OpenLearn and putting it into Jorum fit in? What does anyone get out of having materials in multiple places?
Yvonne: Just different ways to use the content and to find it. We have an ecosystem here. It’s all about having as many different routes to information as possible, depending on where people look. Not a problem. If things are networked together then that has more value – we want to know the usage and we want to know that our presence is branded.
Phil: Not just about many spaces, it’s also about the costs. JorumOpen doesn’t come free. You have to show that you add value by putting it into JorumOpen – it may be that things are easier to find, or that the packaging is different…
Yvonne: I think it is about re-usability. People think they know how their materials will be reused but they don’t. Re-use cases may be very different to expectations.
Phil: But how does taking material out of context enhance an item. How does chopping up content from a package on OpenLearn and making it available in a repository make it more useful?
Yvonne: Well, it’s just there to find online.
Phil: But it’s hard to find without a lot of metadata.
Patrick: In many cases the content is a PDF and a PowerPoint as well – that provides some metadata.
Phil: If shared without context you have to have good metadata to be found though. And what you get back to is what you started with: something people can find by searching Google.
Patrick: I agree that there are better things than duplicating content across the web. We are looking at how people using EdShare can remix and reuse materials found in other spaces. It would be nice if, after starting with our own repositories, we added in spaces like Jorum – that’s some value add. It’s not so useful for OpenLearn content copied into Jorum, but pushing from a less featured repository into Jorum would make sense.
Graeme West: Value isn’t about copying, it is about both tracking origins and seeing reuse at once.
Lorna: You mentioned users coming in – how do they reach you. Via Google or via your interface?
Patrick: Our stats are a mixed bag. About 30% come in from Google and that seems quite low. Many come in deliberately to the site but that is probably because academics refer students and colleagues to us.
Phil: Would be good to compare internal and external users. The next CETIS event at the end of October will be about this. So, why do you want an ecosystem of repositories?
Lorna: Well you want a repository ecosystem – repositories that link out to all sorts of spaces. You don’t want an inward looking ecosystem. You want the repository to link into your VLE, blogs, etc.
Patrick: Actually the other reason for low Google entry may well be that links to EdShare often come from the VLE not external sites.
Neil: That’s not to say that institutional repositories and subject repositories don’t have value.
Lorna: It’s about how people search and find content. They may come in via Google to any of these systems.
Yvonne: It would be nice to hear from librarians on this?
Neil: We just got money from JISC for a project called DELILAH – to create a repository for teaching materials from postgraduate teaching programmes. I am here out of general interest but will be thinking in depth on this.
Yvonne: Please do let me know if you have any suggestions etc.
Nicola, EDINA: I’m curious about whether sharing actually takes place amongst your academics? And does reflection on the quality of teaching materials in use take place?
Yvonne: It is much more about hosting learning and teaching materials. It is not central to your work to create a resource for other teachers to use; sharing with students is the motivation. The VLE is good for delivering courses but not for managing resources. Our idea is that the teaching and learning repository sits there but is then surfaced in VLEs. EdShare is there to improve the resources that teachers use – easily hosted, gathered, put together and made available online. Easier than personal management. You can use peer review if wanted. In some of our early workshops we asked what people would put in. People would say that they had materials from teaching over the last 4 or 5 years but that these were not ready yet – odd, since they use them in teaching. So there is a cultural change about sharing and developing your own work. And seeing that materials are useful to others can help boost esteem. There are deliberately no star ratings, as teachers found that too invasive.
Nicola: In general in academia it is curious how much reflection and quality assurance takes place around research, but how little reflection there is around teaching.
Yvonne, Dundee: How do you encourage that?
Yvonne: It’s supposition but perhaps seeing other materials does lead to comparison or cultural changes that can be made. It surfaces good content.
Phil: It was pointed out to me the other day that if you look at OpenCourseWare, some of that material isn’t very good. And yet it still raised their profile. You can work it the other way: you can ask “is your material as good as this that comes from MIT?” and they can say “yes it is”.
Yvonne: There are lots of possibilities for researching practice and changing culture here – but you need the teaching and learning repository first to do that.
Lorna: There is an issue of digital literacy and of students making critical judgements on what they are using and how they use it. This is a crucial skill for students. There are far more complex decisions to make compared with old reading lists or with providing only a walled-garden environment.
Yvonne, Dundee: I support the teaching of medicine and there it is about evidence basis of medicine and about journals and peer reviewed materials and research so what would they share?
Yvonne: but in classrooms what do lecturers actually use in teaching?
Annie: There is real resistance to e-resources sometimes. Classics and philosophy can be very challenging, for instance. I’ve also worked with educational resources – but reusing things is difficult as you want to express yourself in your own voice. It’s so important to have your own way of expressing things.
Yvonne: This is what we’ve heard from teachers. Very few teachers are happy to take a package and reuse it. They may use a small item but they always want their own items.
Annie: Culturally there are big wins in not working in isolation, or even institutionally, but in connecting globally…
Yvonne: that’s why we don’t say we’re about sharing, it’s about improving teaching materials.
Darius: as a student I get 12 weeks of lectures – by week 12 I don’t remember the first 4 weeks. We also curate our own resources – we use a wiki and keep our own notes and form our own learning. We want an introduction to the subject and information for further resources. But we want to create our own resources. If you experience the materials you remember it, if you read it you don’t. As a student you don’t want to be spoon fed.
Annie: there is always a space for an inspirational lecture – even if notes and presentation don’t exist.
Nicola: some teachers send out lecture notes and resources – not just readings – ahead of time so that you experience the lecture in an informed way; it is not just about grabbing the notes at the time.
Yvonne: indeed that happens at Southampton.
Darius: EdShare does facilitate sharing of materials. Not just about the notes but also links to assistive materials, papers, etc. used in making a resource. As a student I want to look at some of those source materials. EdShare lets you add in files and combine them nicely. It’s a nice way of doing it.
And with that the session closed in lots of smaller conversations…
Kevin Ashley is introducing our next speaker Michael Fourman. He is based in the University of Edinburgh Informatics department, and is a former head of department. Michael is an exception amongst his peers – at FLOC all these computer scientists were along but not tweeting, yet Michael was one of the four or five who did!
Michael Fourman – Topic Models
Michael had hoped to be here this morning. Yesterday he wasn’t here as he was listening to Lord Carter explaining why 2Mb/second isn’t good enough but is all they could get away with. Anyway…
When we started up iDeaLab last year we wanted to see who was here and who does what. I asked my machine learning colleagues about this and either they or Google pointed me in the direction of topic models. People have structures and networks of models. And today I want to talk to you about those ideas because, partly funded by JISC, partly funded by the university and partly fuelled by optimism, we are working on our own topic models.
Since that original work a paper came out:
There is no one way to consistently classify documents, and there are many different tools and systems. So topic models give you new tools to explore and browse document collections. The central problem is structure. You can find things by keywords, but too many or too few things will have those keywords, so as a finding mechanism it’s tricky. So we want an automated index of ideas. The Blei/Lafferty paper took the full run of the journal Science (1980–2002) from JSTOR, took each article and tried to analyse its topics: let’s find 50 topics that occur. Each topic is a set of words with given frequencies – so one topic may feature words like “computer” etc. The topics are all there with different frequencies. The top 10 or 20 most used words do give you a feel for what you are looking at; as a human you can judge a paper to some extent. Then take those words, grab a Google image to fit them, and you get that sense of an emerging idea. Topics really do correspond to something you might recognise the articles to be about.
So how do you find these Topic Models?
Firstly you forget the structure of the document. That will come in eventually, perhaps, but even looking at a document as a bag of words with different frequencies gives a good sense. A topic is a frequency distribution over words. Each document is generated from a topic mixture – each document in this model is generated from lots and lots of topics. Some words occur a lot in some documents and not in others. The question is how we can express or account for the variation of topics between documents in terms of a model that sees each document as a particular mixture of topics in certain percentages. So if we take a topic as a frequency distribution over words and then look at a document as a percentage mixture of topics, then you have the idea.
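As a rough illustration of that idea (the topics, words and weights below are made up, not from the talk), a document’s expected word frequencies can be computed as a weighted blend of per-topic word distributions:

```python
from collections import Counter

# Two hypothetical topics, each a probability distribution over words.
topics = {
    "computing": {"computer": 0.5, "data": 0.3, "model": 0.2},
    "biology":   {"gene": 0.6, "cell": 0.3, "model": 0.1},
}

def expected_word_freq(mixture):
    """Blend the topic distributions by the document's topic proportions."""
    freq = Counter()
    for topic, weight in mixture.items():
        for word, p in topics[topic].items():
            freq[word] += weight * p
    return dict(freq)

# A document that is 70% "computing" and 30% "biology".
doc = expected_word_freq({"computing": 0.7, "biology": 0.3})
```

Each word’s expected frequency is simply the sum, over topics, of (topic weight × word probability in that topic) – which is the generative picture the talk describes, minus the structure of the document.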
This leads to Latent Dirichlet Allocation, which models all the variables that tell you how the observed frequencies might arise from a collection of topics.
Then you do Bayesian Magic!
You think about a quantity – you think it’s probably 0. The black line on the graph represents something that you’re not sure of: your prior idea of the value. The blue line represents the fact that we made an observation and saw the value 1. But we say hmm… we saw the value 1; how likely would that have been if the value was 0? So we have the likelihood of the observation if the value was such and such, and you infer the likely actual value from these two measures. What you get is a distribution over an educated range of values. This is what happens in this topic model. So you can look at the probability of a given corpus, and given the likelihoods and expectations you can find that more appropriate range of values. So you correct the probabilities by blending expectation with observation.
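A toy discrete version of that blending (the numbers here are invented purely for illustration): the posterior is the prior reweighted by the likelihood of what was observed, then renormalised.

```python
# Prior: we think the hidden value is probably 0.
prior = {0: 0.8, 1: 0.2}

# Likelihood of actually observing "1" under each possible hidden value.
likelihood = {0: 0.1, 1: 0.6}

# Bayes: posterior ∝ prior × likelihood, renormalised to sum to 1.
unnorm = {v: prior[v] * likelihood[v] for v in prior}
total = sum(unnorm.values())
posterior = {v: p / total for v, p in unnorm.items()}
```

Even though the prior strongly favoured 0, the observation shifts the posterior towards 1 – the “blending expectation with observation” described above.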
Rather than give a value to this we apply something called Monte Carlo methods. If you are serious about gambling you watch a roulette wheel for a very very long time and you will see that the probability of a number coming up is not as expected. On the basis of what is observed you can beat the theoretical statistics. You can also bet on financial markets in similar way.
We’ve done the same with Topic Models. You take samples from this posterior distribution. The examples of topics and mixtures that explain the observed word frequencies.
Applying this Monte Carlo method to word frequencies we get a lovely set of topics – UMass have given the world an open source way to use Monte Carlo in this way.
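To make the sampling idea concrete – as a hedged sketch rather than anything resembling a production toolkit such as the UMass software – a minimal collapsed Gibbs sampler for LDA might look like this (the documents and vocabulary are invented):

```python
import random

def gibbs_lda(docs, n_topics, vocab, iters=200, alpha=0.1, beta=0.01, seed=42):
    """A minimal collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = random.Random(seed)
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    # z[d][n]: topic currently assigned to the n-th word of document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # total words per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove this word's current assignment, then resample a topic
                # proportional to (doc-topic count) * (topic-word probability).
                k = z[d][n]
                ndk[d][k] -= 1; nkw[k][widx[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][widx[w]] + beta)
                           / (nk[j] + V * beta) for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    return ndk, nkw

docs = [["robot", "sensor", "robot"], ["food", "farm", "food"],
        ["robot", "sensor"], ["farm", "food"]]
vocab = ["robot", "sensor", "food", "farm"]
ndk, nkw = gibbs_lda(docs, n_topics=2, vocab=vocab)
```

After a few hundred iterations the samples of topic assignments tend to settle, which matches the convergence behaviour mentioned in the Q&A below.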
Experiments @ UoE
So we applied this method to the first 6000 docs in the repository. Two topics – one came back on robotics, one on Africa, workers, food etc. We have documents that can be distinguished by this method.
So, for a demo, here are informatics papers divided into three topics (on homepages) and the distribution of papers. The middle of the triangle here is sparse, as the model lets you choose how you set that parameter and how many topics you take. Michael wanted to see if he could distinguish people – this was slightly encouraging. He then tried an interface for the whole repository (or everything he had). This has 6000 URLs. If you click on a URL you get the document. If you click to the left you can see what topics occur and how that compares to other documents. So some are similar, some look stand alone. This is not a production version but it’s a test and, as I know the people, those I’ve looked at here are well represented by those similarities. And in the graph currently on screen, clicking on a paper reorganizes the related topics. Currently the spatial relationship on screen shows the relevance of connections between topics.
Michael closes by apologizing for the swiftness of his tour of topic modeling. Through the JISC funded project with this interface we are looking for structures that might be useful for people outside the university. Any feedback, questions or ideas for the types of interfaces are welcome.
Q – Richard of Symplectic) How long do you need to play Monte Carlo game to get statistically significant results?
A) All night! A few hundred iterations seems to be enough to converge. Most of the things I’ve done here were done on my laptop. I ran some on a desktop DICE machine, and for production models it would be great to have a meatier Mac or faster machine, but this work isn’t…
Q – Kevin Ashley) I know you’re not worried about the semantics…
A) I didn’t say that – I said that this method doesn’t worry about the semantics.
Q – Kevin) some would say the frequency of Fig in one document is noise you don’t need
A) Machine learning friends would say you need more data. There is no way that the word in this context will get confused with others. In that paper they did discuss looking at comparing document and word frequency to filter out common words. But my machine learning friends say do that as a post processing stage otherwise you are taking away data
Q – James Toon) What use cases do you see this in?
A) I think that this use case is pretty good for discovering content of a journal or article or publication as per this analysis. Loads of structure around lots of documents in a collection with these topics. So you can think about images as a method of interaction for items. Lots of possibility for enhancing interface. My two use cases are browsing documents – I have the fond memory of browsing the library as a UG reading books I didn’t expect to find. Much harder to do that on the web. If you start following these things you start finding bits of the web that you didn’t know about. Serendipitous discovery in DSpace doesn’t work, but would be nice to have on top. The other use case is about what’s going on in a particular environment and who is working on similar things
Q – James) And there is a use case with image navigation around foreign language discovery
A) I only did this quickly. But there is work in the school to break up videos into subjects and index against Wikipedia. It gives such a lot of structure to the videos in a way that it’s hard to do without unstructured searching – I don’t think you could do that with triples
Q – Patrick) Another use case would be where to publish a paper based on what topics it hits. Any work on co-occurrence?
A) Only at document level here but there is work on near co-location but on a different understanding of topic and distribution. Haven’t seen anything definitive but very interesting idea. When you see a document coloured word by word according to topic that can be good for the reader but can be problematic. Monte Carlo method is quite a subtle way to understand the data
Q) any tie up between this work and data extraction work?
A) I haven’t seen any but that doesn’t mean there isn’t. Any way to disambiguate tokens and count them would work with this model though
Q – Phil) Any idea how deep you could go with this? Could you stick to disciplines, sub disciplines etc?
A) Yes, applying to just informatics and 128 topics seems to work and make sense. And there have been attempts to build that into the model so you can specify sub topics etc.
Ian Stuart is chairing our second set of Pecha Kuchas this afternoon – and there will be a prize for the best again.
Robin Taylor/Ianthe Hind – University of Edinburgh
DSpace for the REF – the requirement is that publications, and publications per researcher, are expressed in XML in such a format that they can be filtered down to the ones the researchers want considered for the REF.
The problem with repositories full of papers however is that you have the issue of author disambiguation. Perhaps you don’t need to, perhaps you let the researchers do this themselves so you enable them but don’t fix all the issues.
The structure of the publications repository reflects the structure of the university. There are collections for each researcher to deposit in to. There is a collection for a researcher – the crucial bit. So if you look at a collection you see the publications listed there – a tie between researcher and publications. But how did we do that?
DSpace has a mechanism that lets you “map” items into multiple collections. You can search in other collections – the Informatics collection, say – and add items to your own collection. Not only have we avoided author disambiguation – the researcher is the one that identifies their publications. At that stage we’re pretty much home and dry. You now have a REF community and a Researcher community. There is a JISC funded project, Readiness for REF, that should enable this to go further.
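The per-researcher XML export described above might be generated along these lines – purely as a sketch: the element names are hypothetical, not the actual REF schema or DSpace’s export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-researcher export: each researcher "collection" lists
# the publications mapped into it, identified here by a handle.
def researcher_xml(name, publications):
    root = ET.Element("researcher", name=name)
    for title, handle in publications:
        pub = ET.SubElement(root, "publication", handle=handle)
        pub.text = title
    return ET.tostring(root, encoding="unicode")

record = researcher_xml("A. Researcher",
                        [("Paper on repositories", "1842/1234")])
```

Because the researcher’s collection only *maps* items held elsewhere, an export like this is just a flattening of that mapping into XML for downstream filtering.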
Sarah Molloy – Queen Mary University London – Reposit project
Sarah Molloy from Queen Mary University of London is talking on the JISC RePosit project, which looks at ways to simplify deposit and increase engagement with repositories. There is a large group of partners here – Symplectic are a partner as well, as they provide research management software. Some of the partners are piloting at the moment, others are in production already.
We hope to change the world with this project! We want to increase deposit and we want evidence that making deposit easier really works, and to understand the relationship between research management systems and repositories. But this will only work collaboratively. Linking a CRIS to the repository really may be a great method, so sharing experience – technical and otherwise – will be key.
Outputs will not be software specific but relevant to the wider community. There will be a survey. There will be training materials and guidance. Also advocacy – strategies learned from project workshops etc.
Have set up a Google Group for the project. You can find out on that group more about the project. Also on Twitter @JISCRePosit and a blog. And what can you do? Join our presences, send us your comments and feedback and just get involved!
Robin Rice – University of Edinburgh/EDINA. Developing Services to Support Research Data Management and Sharing
Talking about Research Data Management (RDM) and data sharing.
We have the data library, Edinburgh DataShare and three recent JISC projects in this area. What is a data library? Look to Wikipedia…
The data library has been finding, accessing, using and analyzing data for 25 years here. The DISC-UK DataShare project led to Edinburgh DataShare (DSpace-based and compatible with the publications repository) and lots of thinking in this area.
BUT there is a lack of clarity about ethics, rights and ownership, and also fears of errors being found by users. Also fear of scooping, lack of documentation, and lack of awareness of sharing data, not just publications.
So Data Audit Framework recommended further work on this basis
DAF pilots found inadequate storage and sharing, a lack of good practice guidance, and a lack of clarity about roles and responsibilities both within schools and the university as a whole. We had to respond by developing services and support for RDM and looking at whether agreement on a university policy is possible. The Information Services website now includes guidance in this area. The MANagement TRAining work – the MANTRA project – looks at both specialist and generic training. We will produce videos of researchers talking about data management. We need clarity on ownership and on transfer processes and procedures.
Also keen to ensure data management plans are in place from the outset that meet legal compliance and ensure secure storage and back up.
This work is part of the JISC Managing Research Data Programme so there will be lots coming out of this to learn from.
Q – Patrick) Kind of interested in both your presentations Sarah and Robin. What are main services academics actually want on their data? How are you facilitating that?
A) we had an interesting discussion in one of the previous round tables about this actually
Q) so I think the low deposit is because the repository doesn’t give the academics anything they want.
A) we think there are some benefits even just having a URI to point to for Data…
Q) yes but what does the user want – not benefits they can get but what do they want?
A) we have a long way to go with open access and open data. they are more interested in secure storage than sharing. The sharing is important to researchers but they tend to focus on the data they want from others so we’re having to find the right way of dealing with that
Ian) are we as far along with data as we are with publications?
Kevin) We are in a different place. Some areas we’re ahead, others where we are behind
Q – Pablo) Robin – the idea of sharing of items around is brilliant but made me think about mapping.
A) not practical to do as an institution if you are huge but for small teaching led institutions it could work – especially for those without CRIS. But if your academics or research school admins join in the work load is reduced. REF is a big stick so no matter how much work it does get done!
Phil Barker – JISC CETIS
Talking about OERs – Open Educational Resources that can be taken by anyone for free. HEFCE have invested about £9 million in OER. A few years back I was on the board for an institutional learning objects repository. Interoperability and metadata were key. I struggled to think of a successful one at the time, but should have thought of MIT OpenCourseWare. But OER is a better approach.
The drivers before were about accounting for what was done and achieved. But there was failure due to lack of buy-in, and also more focus on systems and the repository than on the needs of stakeholders. Sharing materials on the open web makes them available both internally and to the world.
Institutional buy-in? Well, it is becoming apparent that there are good benefits. The OU have found a direct link between OpenLearn and course enrolment. This gives a shopfront on courses in a way an IR cannot. OER also give a far better idea of what the university does than any prospectus. There is also the feeling of social responsibility, and making resources available is part of that – these are publicly funded institutions – but it is also a strong motivator.
Legalities – active management of IPR is required. It should be happening anyway but often doesn’t. Some things are simpler with OER though: registration/authentication is much easier for open materials.
Some institutions may provide objects as part of an original course – that gives huge context even though there are disadvantages. We struggle for standard terms for this stuff. Even if material is presented as a whole course it shouldn’t stop reuse of individual components.
OER and content open to the web solve a lot of problems. A diversity of approaches seems to work, and all have been trialled in the JISC work. http://www.cetis.ac.uk
Anne Robertson/Addy Pope – EDINA ‘ShareGeo 12 months on and going open’
ShareGeo is a repository for sharing geospatial data. Last year my colleague Guy McGarva introduced ShareGeo, and I am going to summarize what we have achieved since last year.
There is a wealth of different geospatial data types. ShareGeo is a spatially enabled DSpace repository customized to allow bounding box based searches using OpenLayers. We also include validation of metadata.
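At its core, a bounding box search like this reduces to a rectangle intersection test against each dataset’s spatial extent. A minimal sketch (the dataset names and coordinates below are invented):

```python
def bbox_intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Hypothetical dataset footprints: (min_lon, min_lat, max_lon, max_lat)
datasets = {
    "edinburgh_roads": (-3.4, 55.8, -3.0, 56.0),
    "london_landuse":  (-0.5, 51.3,  0.3, 51.7),
}

query = (-3.5, 55.9, -3.2, 55.95)   # a search box drawn over Edinburgh
hits = [name for name, box in datasets.items() if bbox_intersects(box, query)]
```

A production system would index these extents spatially rather than scanning them, but the overlap test is the same.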
ShareGeo is a JISC funded service and has been a closed service, with around 3500 registered users. We have had decent download levels but poor upload levels. So early in 2009 we looked at strategies to increase deposit. The two options were ShareGeo Open and desktop deposit.
One of the drivers for ShareGeo Open was the Ordnance Survey’s changed position on open data. ShareGeo Open has just launched! There are around 50 data sets so far. To start with you need an @ac.uk email address to register, but even this restriction may change. We still automatically detect and extract location metadata, and it looks much like ShareGeo, but there is an additional/different licence option limiting the choices to open ones. It is also connected to Go-Geo!, allowing upload or download from there. We are in the process of releasing a desktop GIS plugin, starting with ESRI, the most popular desktop GIS. The desktop depositor does much the same as the online deposit tool, with validation etc. It uses the SWORD protocol to provide one click deposit. We hope to launch this fairly soon.
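As a sketch of what a SWORD-style deposit involves: the client packages the files and POSTs them with packaging metadata to a collection’s deposit URI. The packaging identifier and file names below are illustrative placeholders, not ShareGeo’s actual configuration.

```python
import io
import zipfile

def make_deposit(files, filename="deposit.zip"):
    """Build a zip package and SWORD-style headers for a deposit (sketch)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"filename={filename}",
        # SWORD v1.x packaging identifier; the right value depends on the server.
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
    }
    return headers, buf.getvalue()

headers, body = make_deposit({"data.shp": b"...", "mets.xml": b"<mets/>"})
# A client would then POST `body` with `headers` to the collection's deposit URI.
```

This is what makes “one click” deposit possible from a desktop GIS: the plugin assembles the package and the HTTP request on the user’s behalf.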
Future technical improvements
- Visualize actual data with plug-in app
- Data in other formats, particularly web services
- User annotation
Quite optimistic, both because of OS OpenData but also the EU INSPIRE requirements – both should lead to a greater deposit rate. http://www.sharegeo.ac.uk
SONEX – Pablo de Castro
Pablo almost changed his talk title to reflect danger of cycling in Edinburgh!
SONEX is looking at ingest of data into repositories. The group is:
- Peter Burnhill, EDINA
- Richard Jones, Symplectic
- Mogens Sandfaer, Danish Technical University (DTU Copenhagen)
- Pablo de Castro (Chair), Carlos III University Madrid
The driver was the IRW 09 workshop, which gathered 100 key repository people in Amsterdam. There were four strands there: Repository Handshake, citation, author/institutional identifiers, and repository organisation worldwide. The Repository Handshake strand has led to SONEX. Our paper will be at ECDL in Glasgow next week and will look at progress so far.
The group has no deliverables – doing analysis, not coding – but we connect to other projects. SONEX has gathered deposit use cases covering publisher, author and CRIS driven deposit, e.g. the OA-RJ and PEER workflows. Also CRIS–IR integration work – looking at ways to integrate metadata from a CRIS into Dublin Core repositories.
Readiness for REF
Looked at possibilities institutions are devising to deal with the REF.
Expanding OA-RJ idea for international deposit.
Finally core SONEX use case – Deposit via personal software. Includes article authoring plug-in for Word.
JISC Deposit Call – SONEX will work with selected projects from that call: firstly DepositMO; RePosit; and the JISC DURA project, with Mendeley, Symplectic and Cambridge working on it. There will be a pool of projects to provide a general picture of repository deposit.
Currently trying to widen deposit use case analysis for new projects. Also disseminating work.
We welcome comments and questions and would like to thank JISC. Also NHS for the speedy bike injury treatment.
::Please note that this is a live blog and various typos, corrections, links and images will be added in due course – all comments or suggestions of useful additions welcomed though::
Simon Bains, Head of the Digital Library at University of Edinburgh is kicking off our day with the presentation of the prize for the best Day One Pecha Kucha session which goes to… Robbie Ireland and Toby Hanning of Glasgow University
Chris Awre is the Head of Information Management in “Library and Learning Innovation” at the University of Hull and will be talking to us about Hydra… see the Wikipedia entry and the enforcement of mandates?!
Chris has begun by saying that he’s been given the hangover slot so he’s offering some key hangover cures – deep fried canary anyone?
So what is Hydra?
It is a collaborative project between the University of Hull, the University of Virginia, Stanford University and Fedora Commons/DuraSpace. It is unfunded but is about meeting the common needs of the partners. It is very much about reusable frameworks for richer repository enabled solutions. We set up in 2008 and set a time limit of 3 years, so the project will continue to 2011 when there will be reflection, though there is an expectation that the project will continue beyond this.
There are some fundamental assumptions to Hydra. For instance, that no single institution can resource the development of digital content management solutions by itself. As a mid size university Hull can only do so much on its own, so connections to the Java and Sapphire communities have been useful for finding solutions, and working with partners on repositories has huge benefits. But a collaborative project can be tough in terms of compromise, so we have also had to find ways to ensure that each partner has specific needs met. Similarly, no single system can do everything: for efficient development and collaborative solutions one needs a common repository framework.
“If you want to go fast, go alone. If you want to go far, get together”
The Hydra group met and discussed needs and ways in which problems could be collaboratively solved. What was wanted might not go the whole way so we wanted to find common principles on which others can build for their own specific needs. We are trying to enable developers and adopters by extending and enhancing the core. And the intent is to build solution bundles for reuse elsewhere.
Origins of Hydra
Hull has benefited from JISC funding for repositories with the RepoMMan and REMAP projects – both were about making repositories part of day to day work embedded in academic life, the latter around preservation using workflows built with web services. Speaking about these projects at OR08 in Southampton, the group from Hull caught the eye of Thorny Staples at Fedora Commons, who knew of similar issues at Stanford. As follow up some face to face meetings emerged. This has led to regular meet ups, on average a face to face meeting every 3 to 4 months. This tends to be in the US as many of the partners are there, but it has been well worth Hull’s while to do this. These big universities in the US have huge developer teams, so we benefit enormously from this work. In addition to face to face meetings we also do bi-weekly Skype and Elluminate calls; it is not always possible to resolve issues this way or to fully understand tone (hence face to face meetings), but they do (time difference allowing) let you share screens around the group. There are also email lists, a wiki, a Sakai collaboration site at Virginia and a publicly joinable Google Group (Hydra Tech).
A key part of the contribution model of the project is that the partners contribute what they are able. Hull has been able to contribute documentation and has fed significantly into the architectural planning. Virginia has been leading project management and meeting hosting, and they also originated Blacklight. Stanford lead on software development and UI, and through them we have outsourced to MediaShelf. Fedora Commons/DuraSpace have led strategic direction and links to related Fedora developments. Travel is funded by each institution as required and feasible, and that is a mutually understood situation.
The initial meeting of Hydra formed the agreement of a mutual desire to work towards a reusable framework for multipurpose, multifunction, multi-institutional repository-enabled solutions.
Initial dissemination at OR09 (Atlanta), further dissemination has taken place and there has also been an initial beta software release in August 2010.
A repository should be an enabler, not a constraint. Repositories have different use cases. They can be used for the management of digital content at different stages of that content’s lifecycle – so understanding and enabling multiple interactions is key. Four use cases were identified to work on.
Hull focuses on a highly varied institutional repository where multiple types of content are accessed through the same user interface.
Multifunction – the use cases comprise some common but also some unique service functions. We try to reuse what is known so that the best sharing of experience can take place. However some functions are specific, so building these onto a core makes a good deal of sense, as this maximizes flexibility and sharing.
Therefore the Hydra Philosophy does fit the name – one body with many heads.
- Tailored apps and workflows for different content types contexts and user interactions
- Common repository infrastructure
- Flexible atomistic data models
- “Lego brick” structure – can build things on top of each other
- Building on work of Fluid (a University of Toronto project around UI components)
Components of Hydra Frameworks
- Fedora repository layer supporting object management and persistence
- Hydra plugin, a Ruby on Rails library that provides create, update and delete actions against Fedora objects – a combination of Ruby gems including ActiveFedora and Opinionated Metadata
- Solr provides fast access to indexed content – plus Solrizer indexer
- Blacklight – very much about the access layer that allows you to retrieve over Solr
The project takes a CRUD approach and supports workflows over the repository across the lifecycle of the content. You can swap out any one software component to meet your needs.
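On the access side, a Blacklight-style front end ultimately issues faceted queries against Solr. As an illustrative sketch (the field names, core name and URL below are hypothetical, not Hydra’s actual schema), a client might build such a query like this:

```python
from urllib.parse import urlencode

def solr_query(base, q, facets=(), rows=10):
    """Build a Solr /select URL with optional faceting (illustrative sketch)."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    if facets:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on.
        params += [("facet.field", f) for f in facets]
    return base + "/select?" + urlencode(params)

url = solr_query("http://localhost:8983/solr/repo", "title:hydra",
                 facets=["content_type", "author"])
```

The response’s facet counts are what drive the filter sidebar in the demo described below; the repository layer (Fedora) holds the objects, while Solr serves the fast indexed view of them.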
Hydra Heads – full stack, user facing solutions using the framework: specialized interfaces for specific users, content types and workflows
Hydrangea is the reference example of a Hydra Head – focused on journal articles and datasets (the software released in August was Hydrangea 0.1, with a 1.0 RC due in early October).
Future Reference Examples
- AIMS – born digital archives: an inter-institutional model for stewardship – a Mellon funded project for 2009–11 that recognizes that much material is now born (and stays) digital. This builds on the work of SALT, an early Stanford-specific Hydra head. We are looking to generalize the work from SALT for wider applications
- EEMs – everyday electronic materials – a Stanford project around grey literature and random digital artifacts
- Digitisation workflow – a Virginia interest
- Institutional Repositories (includes ETDs) – Hull is focusing on working on institutional repository needs
Why these technologies?
Fedora – all partners use this, and Fedora Commons brought them together
Solr – very powerful indexing tool and used by…
Blacklight – developed at Virginia, a next gen library catalogue interface adaptable to repository content. An aside on this: Blacklight is a community developed project and it has a very active developers list with good community principles. It is also being used in multiple contexts, from library catalogues to WGBH and the Northwest Digital Archives. Blacklight also now has a Strategic Advisory Group formed from users. Blacklight supports any kind of record or metadata, you can create object-specific behaviours, tailored views and approaches are supported, and it is easy to customize. Testing is a core community principle for Blacklight – and also for Hydra. Blacklight aims to “put the library in control of its collections”.
Ruby – Blacklight is based on Ruby and so is ActiveFedora. We had no Ruby experience at Hull but have been able to get up to speed quite quickly
Generally the unifying theme is an Open Approach.
- All the technology choices are open source
- Current technology framework heavily Fedora and Ruby orientated – but models and approach can be used more widely
- Feedback on the project and ideas are very much welcomed.
- Ultimate objective is to intertwine tech and framework.
Cue a live Demo!
Chris looks at the University of Hull repository (interface is Blacklight) and searches to find an eThesis with attached videos (attachments are children of the parent record of an object). Also faceted browse as well. Chris likes that Blacklight shows the build up of search terms and elements very clearly so it is user friendly.
Hull are involved in a JISC project to see if Blacklight can be used on the main library catalogue. This allows faceted browse and filtering – it looks somewhat like WorldCat. The advanced search includes significant advanced options.
Also a demo of Hydrangea at Stanford. This looks like Blacklight but also allows the viewing of more complex information on items – including deposit date etc. They are still working on the UI, but the idea is that articles contain rich information on the related journal as well. You can also edit records and add records here. Clicking on fields enables editing. No saving required – the edits are saved straight back when you click away. We are developing an intermediate stage where you edit but only save when you commit the changes – to save on memory.
Find out more:
- Wiki: http://wiki.duraspace.org/display/hydra
- list: email@example.com
- code: http://github.com/projecthydra/hydrangea
- JIRA: http://jira.duraspace.org/browse/HYDRA
- Meet: Hydra & Blacklight Camp, Minneapolis, Oct 4–8 and DLF Fall Forum, Palo Alto (California), Nov 1–3
Q – James Toon) How do you deal with changes – can you roll back to previous versions?
A) Yes. Fedora tracks all changes so roll back is always possible. But how that is enabled in the interface is still to be thought about.
Q – James Toon) But what do you do if you link to a particular identifier and there is a change?
A) You can roll back to previous versions of that item; the links always go to the current version. However there may be other aspects to work out re: linking.
Q – William Nixon) Is Blacklight a viable alternative to other library catalogue interfaces?
A) We are guided and influenced by the way Virginia and Stanford are using Blacklight. It is very flexible. Our current test work has barely scratched the surface of what Blacklight can do. But also, why can’t we use a single interface onto our catalogue, our repository and indeed our archives at some point?