Sep 26 2011, 20:02 · DCC Roadshow Oxford
 

Below is a copy of a blog entry I put on the DCC website after attending their Roadshow in September…

 

As a rather out-of-practice ecologist I was looking forward to the DCC Roadshow with a little trepidation. Would the material be aimed at digital curators writing code to fine-tune their repository functionality? Or would it be of help to someone like me… drafted in from a different area to work for a few months trying to get to grips with the management and sharing of spatial data for the JISC-funded IGIBS project.

Well, all concerns were rapidly dispelled as the first day got underway and the programme was delivered with a broad church in mind. We were accommodated in the rather glorious Wadham College and I began to feel like a well-cared-for data object: securely stored in safe surroundings, my metadata obviously carefully read, as I was professionally transformed into a more up-to-date format just before my existing file type became redundant.

It was clear that Oxford had taken data management seriously and was a centre for developing new ideas and services to aid its researchers and the wider community. Particularly inspiring was a presentation from David Shotton, who had used his position as Director of the Bioinformatics Research Group to add considerable professional weight to the case for developing a professional data management infrastructure. He was leading a project to manage, publish and cite datasets. Using the theme of infectious diseases was the icing on the cake for me. The freeing of data on this subject provides both an “academic good” and a very obvious “public good”, especially for countries with less access to the journals and data than more developed regions, which perversely have less public need for them.

Themes that came out of the day were the importance of standards in metadata to bridge across disciplines, and the need for institutional repositories to hold the “long tail” data that most researchers produce en route to paper publication. It wasn’t until later that evening, when I found myself not in the pub but spending an hour watching Bryan Heidorn on YouTube presenting a version of his “Curating the Dark Data in the Long Tail of Science” paper, that I realised how the day had really got my remaining grey cells up and firing.

Day two consisted mainly of group work, where I began to understand the roles and problems facing many of the different data managers attending the Roadshow. The need for researchers to fully engage with the data management service providers became apparent as the group sessions developed. There are whole armies (well, maybe platoons) of professional data experts trying to herd the cat-like academic data producers into the pen of good practice. Having met the Roadshow participants and come to understand their roles and expertise, it is hard to see how there could be a problem with data management. It was not until I came back to Aberystwyth’s IGES department and saw individuals trying to finish off theses and papers, as well as prepare for the new academic term, that I really understood the problem. As was suggested on day one of the event, it is only when data management arrives at the top of an individual’s priority list that it actually gets done. I have since been thinking about starting a project using “Nudge” theory to move data management up the priority tree in an academic department.

Day three focused on the tools available to aid researchers in their data management. The DCC data management planning tool shone out as a valuable asset for meeting the funding councils’ increasing demands on funding bids. For me it will provide the framework on which to hang my thoughts and conclusions about managing spatial data, and some of the structure for my work on the IGIBS project.

Now back at Aberystwyth and slipping into my routine again, the measure of success for the Roadshow, at least from my perspective, will be both the new sections I feel able to write in my final report and the changes in working practices that come about as a result of my attendance. Well, it’s too early to say, but as I have just moved metadata creation for some soon-to-be-used images and shapefiles to the top of my priority tree, maybe the digital curation my brain received from the DCC is beginning to show.

 

 

Aug 08 2011, 12:29 · Publish Data or be Damned
 

My apologies to the Duke of Wellington for mutating his often-quoted retort to a jilted mistress about his intimate letters (“publish and be damned”), but the sentiments in the original statement do suggest the power of the publication process to give information a life of its own.

If geospatial data were published in a similar way to research findings (or even to letters from the rich and famous to their mistresses), then data management and the academic spatial data infrastructure (SDI) would be an even more rapidly developing entity, one with the commitment of every academic researcher who generates such data. OK, a sweeping, unrealistic statement, but this is how I came to the thought…

One of the pillars of modern-day science is the peer review process. A piece of original research passes, through the publication process, many stages of refinement until the researcher is satisfied that it stands a chance of acceptance by an appropriate journal. During this process it will be proofread, checked for errors and formatted in the appropriate way. Then, after input from independent referees, it will be further improved and finally, if judged acceptable, the research is published where it can be accessed, seen and discussed by the wider scientific community (or any community that wishes).

After publication it is archived and catalogued so that it can be found online or in hard copy and can be used and quoted by anybody who wishes. There may be a network of people and libraries with subscriptions to the journals, who will see the newly published articles appear on their shelves, desks or screens every few months. Finally, if the research is worthy, it may be used as a component of more research and go on to help develop the knowledge base.

Concomitant with this process is the recognition given to the researcher and the associated institution, which can result in promotion for the former and extra funding for the latter. This rather idealised description of the peer review process is something that the majority of academic staff and postgraduate students are fully engaged with and committed to. You might have to forgive my simplistic view, but all I wish to establish is the principle that the publication process, and the recognition it attracts, drives the quality, accessibility and reuse of research findings.

Now let’s consider how geospatial data is managed. It’s not so easy to simplify the process, as there is much greater variation. Some important data will be lodged with data centres, where it may have a guaranteed 10-year life span (if it’s lucky), or maybe the metadata will be put in a discoverable place with a series of hurdles to cross before anybody can get access to the data itself. Quite often the data will never leave the IT systems of the institution in which it was created; rather, it will take second place to the research publications and may not be made accessible at all. This has been shown to stem from worries over intellectual property, from a lack of awareness of suitable data management and publication methods, and from a lack of recognition for such activities. One thing is for sure: most data will not have the same exposure to the science community as the research findings it supported.

The long-term life expectancy of such data is also likely to be shorter than that of its wordy cousin, the research paper. In fact it may not even exist after its collector has moved posts, retired or suffered a serious IT problem. It is very unlikely to be archived as well as the printed word, its creator is much less likely to have received credit for collecting it, and the institution he or she works for is unlikely to receive improved research ratings or extra funding for generating it.

Now imagine a world where data is King (or at least Queen alongside the research-paper King) and research funding and university chairs are partially reliant on the proper publication of peer-reviewed data sets. I think that a fully functional academic SDI, with all the bells and whistles you could want, would be a reality within the next decade. In the same way that JANET has developed, and continues to develop, in the UK, with its fast speeds and links to other national networks, so would the SDI.

Obviously this isn’t an original idea: there are a few journals dedicated solely to data publication, and there are strong policy statements all the way from Government, through the publicly funded research councils, and even to a few universities, that make the publication and accessibility of data a priority. What is missing is any real incentive for researchers to treat data in the same way as research findings, and until this gulf is filled, data will remain the poor relation of the academic publication world. If the translation of the INSPIRE Directive (see my previous post) into European governments’ actions includes universities, then maybe it will provide some significant infill for this gulf and move the discussion from “why should I?” to “how do I?”


Aug 03 2011, 10:48 · Shibboleth: report of early-stage development using biometric ID
 

I guess that anyone reading this will already know that Shibboleth is an open-source federated user-authentication infrastructure that allows control of user sign-on within and between organisations. I had a vague memory from childhood Sunday School classes that there was an older meaning to the word, apart from the Times-crossword definition of a feature that betrays your social or ethnic origins.

So, just in case there is a person left who hasn’t Googled “Shibboleth”, here are the conclusions from five minutes of browsing.

The origin is Hebrew, from the late Bronze Age (around 1100 BC). It was used as a test of tribal origin to filter out a group called the Ephraimites from the Gileadites. In a nutshell, anyone attempting to cross the River Jordan after a particularly nasty battle was asked to pronounce the word “shibboleth” (meaning stream or ear of wheat). If you were unlucky enough to pronounce it “sibboleth” then you were for the chop, as you were clearly an Ephraimite (whose language lacked the “sh” sound) or were just unlucky enough to have a speech problem. This is supposed to have led to about 40,000 executions! So really it was an early form of biometric identification, which sounds more advanced than some of today’s security systems. Now I assume that, with modern-day software development, failing to pass the Shibb authentication only results in minor annoyance and no longer leads to capital punishment, but I will be a little more careful when entering my username and password from now on.
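For completeness, here is roughly what the modern test looks like from the application side. In a common deployment the Shibboleth Service Provider (for example Apache with mod_shib) handles the sign-on against the user’s home Identity Provider and then passes the released attributes, such as eduPersonPrincipalName (“eppn”), to the protected application in the request environment. A minimal sketch, assuming that style of deployment; the attribute names and responses are illustrative, not prescriptive:

```python
# Minimal WSGI app sitting behind a Shibboleth Service Provider.
# Assumes the SP (e.g. Apache + mod_shib) has already authenticated the user
# and injected released attributes such as 'eppn' into the request
# environment -- attribute names are deployment-specific and illustrative.

def application(environ, start_response):
    # Populated by the SP after a successful sign-on at the user's home IdP.
    user = environ.get("eppn") or environ.get("REMOTE_USER")

    if not user:
        # No identity released: refuse access (mercifully, no executions).
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Sorry, that sounded like 'Sibboleth'.\n"]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [f"Welcome across the Jordan, {user}.\n".encode("utf-8")]
```

The point being that a failed check now costs the user a 403 rather than their head.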

Aug 01 2011, 15:18 · Don’t get your Back-Up over Data Preservation
 

I have been reading and thinking about the relationship between long-term spatial data preservation and the short-term needs of day-to-day data security during the life of a research project. With research data being generated at ever faster rates, and the life cycles of supporting technologies getting shorter, data preservation is destined to be a continual problem requiring new and smarter solutions every few years. Just dealing with the new data that Earth observation satellites can produce, at terabytes per day and possibly in several formats as it passes through complex processing stages, is enough to make the issue a serious problem. The sentiment expressed by Moss points to researchers’ aspirations for their hard-won data:

“Scientists now want to keep everything, which they assume is digitally possible, in the belief that everything has value and can be retrieved and repurposed.” (Michael Moss, 2008)

The question is: will the technology and the resources exist to meet this aspiration?

It is very easy for a researcher in a Higher Education institution to secure data in the short term, either through their own arrangements or by using the services of a central IT department. That way data can be backed up to multiple locations and held on hardware that is up to date and covered by manufacturers’ warranties. I am sure this professional approach is found in most (or all) HE institutions. It is still up to individuals to avail themselves of these services, but there are few obstacles standing in the way. Even storage costs are falling, and for a few pounds per gigabyte a university department can store data on professionally managed institutional servers.
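One cheap habit that complements those backups is a fixity check: record a checksum for each file as it enters storage, then re-verify the copies periodically, so silent corruption is spotted while a healthy replica still exists. A minimal sketch in Python; the paths and manifest format are illustrative, not any particular institution’s practice:

```python
# Minimal fixity check: record SHA-256 checksums for a data directory, then
# re-verify them later to catch silent corruption in any backed-up copy.
# Paths and the manifest format are illustrative.
import hashlib
import json
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(data_dir: Path, manifest: Path) -> None:
    """Record a checksum for every file under data_dir."""
    sums = {str(p.relative_to(data_dir)): checksum(p)
            for p in sorted(data_dir.rglob("*")) if p.is_file()}
    manifest.write_text(json.dumps(sums, indent=2))

def verify(data_dir: Path, manifest: Path) -> list:
    """Return the files that are missing or no longer match their checksum."""
    expected = json.loads(manifest.read_text())
    bad = []
    for name, digest in expected.items():
        target = data_dir / name
        if not target.is_file() or checksum(target) != digest:
            bad.append(name)
    return bad

if __name__ == "__main__":
    root = Path("project_data")              # illustrative data location
    make_manifest(root, Path("manifest.json"))
    damaged = verify(root, Path("manifest.json"))
    print("all copies intact" if not damaged else f"check these: {damaged}")
```

Run against each backup location in turn and any copy that has rotted announces itself, rather than waiting to be discovered the day the data is needed.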

In the hierarchy of data preservation, the next levels up become harder for a researcher to arrange. Consider keeping spatial data for third parties to discover and use for the next five years. Immediately there is the need for precise and comprehensive metadata. This has been addressed in several ways, and the development of specific standards via UK AGMAP has given anybody who looks an easy lead into useful metadata creation. For this longer-term storage it may also be necessary to look outside your home institution to ensure suitable data curation and discoverability. But where do you put the data? It needs an accessible location that links the metadata and the data object and makes them discoverable by future researchers. The tools provided by GoGeo offer a solution: data can be described, and even lodged, within this service so it becomes searchable and accessible to other researchers.
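To make “precise and comprehensive metadata” a little more concrete, the sketch below shows the sort of minimal discovery record such standards ask for. The field names only gesture at what UK AGMAP asks for; the real profiles define many more elements and an XML encoding, and the data set described here is hypothetical:

```python
# Illustrative minimal discovery record for a spatial data set. Field names
# only approximate what UK AGMAP / UK GEMINI ask for (the real profiles are
# richer and XML-encoded); the data set itself is hypothetical.
record = {
    "title": "Dyfi catchment land-cover survey, summer 2011",
    "abstract": "Field-surveyed land-cover polygons for the Dyfi catchment.",
    "topic_category": "environment",
    "bounding_box": {"west": -4.05, "east": -3.60,
                     "south": 52.45, "north": 52.65},
    "temporal_extent": {"begin": "2011-06-01", "end": "2011-08-31"},
    "spatial_reference_system": "EPSG:27700",  # British National Grid
    "data_format": "ESRI Shapefile",
    "lineage": "Digitised from GPS field survey; see project report.",
    "responsible_party": "IGES, Aberystwyth University",
    "use_constraints": "Contact the data owner for licence terms.",
}
```

Even a record this small answers the questions a future user asks first: what is it, where and when was it collected, and whom to ask about reuse.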

So there is an infrastructure for this stage of the data management process, but it now needs the data producer to step outside their daily routine and work on tasks not always considered core by a busy academic looking to their next paper. This five-year horizon is also significant in that the European INSPIRE Directive will be in force for its Annex III data by 2013. This means that many university-generated geospatial data sets will need to comply with the INSPIRE standards promoting interoperability across boundaries. Possibly more difficult to achieve will be dealing with older data, which will also have to meet INSPIRE standards by the next decade.

Once we look beyond the next few years and start to focus on spatial data of high quality or significance, things get really interesting and much more challenging. It is very easy to talk of data archiving and curation as if there were standard, easily accessed facilities in every library. The more I have read, the more I have realised that it is a far more fluid and developing science than I appreciated.

Who decides which data are in need of professional curation, or which data we can afford to curate? These kinds of questions move the process beyond the researcher into the realms of professional librarians, data curators and government departments working to budgets and policies. All this leads to a further stream of questions. Can data be given to one institution to look after? Can we guarantee that any institution will be a permanent fixture? Will the metadata created during data collection still have sufficient context to be useful in 10, 50 or 100 years’ time? How will the increasing number of data objects be kept searchable and accessible? With hardware life cycles only a few years long, who will ensure the passing on of data to the next technology, and will that technology still support the data format? These questions only scratch the surface of the issues involved in designing future data curation methods and policies.

Let’s hope that the situation described in the quote below won’t apply to the early 21st century when we look back in 20 years’ time.

“In terms of preserving our digital cartographic heritage, the last quarter of the 20th century has some similarities to the dark ages. In many cases, only fragments or written descriptions of the digital maps exist. In other cases, the original data have disappeared or can no longer be accessed due to changes in technical procedures and tools.” (Markus Jobst, 2010)

It’s possible to take this timeline one stage further and consider which spatial data sets are so important to major scientific discoveries or advancements that they deserve preservation in the equivalent of a scientific museum, holding the essential heritage of our scientific community.

Now where did I save that first human genome I was given for safe keeping in 2003? Ah well, not to worry, it wasn’t geospatial data anyway; well, not unless the DNA donor had an address?! Oh, and it was a mapping project, so I had better find it…

Jul 15 2011, 12:02 · Dyfi Virtual Observatory
 

As you know from previous blog entries, the Dyfi Virtual Observatory (DVO) is a similar project to IGIBS and is also based at Aberystwyth. I had a meeting with Nicola from the DVO yesterday and we agreed to help each other out with data sets where possible. I thought it would be good to give a little more detail on their project so you can see how it complements IGIBS.

The DVO aims to provide anyone with an interest in the Dyfi catchment area with a resource to process spatial data for a range of uses. The plan is to hold data sets that can then be analysed, with the results used privately or publicly. To achieve this, and to add processing power, the project plans to make use of cloud computing. One example they are working on is flood-risk calculation, which has obvious overlap with the kinds of data IGIBS is looking at.

The focus of the project is more socioeconomic than that of IGIBS, but nevertheless any extra data sets we gain access to that are tailored to the Dyfi area will benefit IGIBS and the end users here at IGES.

The DVO website is imminent, but in the meantime you can see some impressive photos of the Dyfi area on their Flickr photostream.

Jun 02 2011
 

I hope all those concerned with getting some favourable coverage for the Biosffer Dyfi Biosphere are watching their televisions of an evening right now; specifically BBC2 for the next few weeks, as Springwatch, that stalwart of Auntie’s nature programming, is at Ynys-hir!

Isn’t it brilliant turning on the box and seeing the intrepid trio sitting out on that decking with the evening sun on Mynydd y Llyn in the background? I am unhappy to hear about heron numbers going down, shocked by the cannibal Barn Owls, and flabbergasted that nobody has pointed a camera at the Pied Flycatchers yet; did I miss that? Being from Scotland, I get such a kick out of seeing these wee beauties round here; they are pretty scarce elsewhere.

 

Pied Flycatcher – photo by Rainbirder (Steve Garvie) 

This work is licensed under a Creative Commons Attribution-ShareAlike 2.0 Generic License.

Seriously though, this is a great advert and national recognition for some of what’s best about the Dyfi area; I hope somebody steering Biosphere Reserve developments is taking notes.