HUTpubl - Structured publishing of research series

Tuija Sonkkila

Lecture

Good afternoon, ladies and gentlemen. First I would like to thank the organizers and Nordinfo for the opportunity to speak here today. Thank you also for the flexibility regarding the choice of language. I am going to use a non-Scandinavian language, namely English. After my talk, however, I am prepared to answer your questions - if there are any - in my (unfortunately rather halting) Swedish.

Good afternoon. In my presentation I'll give you an outline of what the acronym HUTpubl stands for, what we have done in the project, who we are, what we have learned so far, and what our plans are for the near future.

But first, some words about the Helsinki University of Technology. It is the biggest and oldest technical university in Finland, celebrating its 150th birthday next year, in 1999. There are 12 faculties, roughly 10,000 full-time students, and roughly 1,000 staff. As a publisher, HUT is a big non-commercial one. Annually, it publishes over 400 titles in over 200 different research series. The publications in the series consist of technical reports, theses, etc.

The lifespan of individual publications varies considerably, but they have one thing in common: the publishing process is decentralized to the degree that normally it is the writer herself who takes care of the whole process, from tapping the keyboard to storing the print run. There is no central publishing unit at HUT, no HUT University Press. In this sense, HUT differs from many other Finnish universities. It is therefore no real surprise that the twelve faculties at HUT are also quite independent when it comes to publishing procedures.

It is an open secret that the volume of academic publishing in Finland is at least partly tied to the principles of university funding (meaning the more your faculty publishes, the more you might be rewarded). That doesn't change the fact that publications are products of their organization, an essential part of its image, but also part of something bigger and more abstract: scientific memory. Therefore, all documents accepted for publication and considered worth archiving should have the right to be produced in a format that is as independent as possible of the constantly changing software and hardware environment, and that is easily transferable to different platforms and media. In addition, there is evidence that structured queries on full-text documents are a desirable feature. Some claim for instance that figures reveal what the author has really done whereas the text tells what he thinks he has done! Seriously speaking, research documents are very suitable material for structured IR requests. They are conservative in structure, meaning the logical parts of a document follow a certain, well-established pattern: overview, theory, test, results, discussion, literature. What is needed is a mechanism for marking up these logical parts of the document. It is here that the SGML standard family comes into the picture as one of the very few alternatives.
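To make the idea of a structured query concrete, here is a minimal sketch in Python. It assumes a report that has already been marked up along the logical pattern just described and that is available as XML rather than SGML. The element names (report, overview, results, figure, caption and so on) and the sample text are invented for illustration; they are not taken from the HUTpubl DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names are illustrative only and are
# not taken from the HUTpubl DTD.
SAMPLE = """
<report>
  <overview><p>We study paper drying.</p></overview>
  <theory><p>Heat transfer model.</p></theory>
  <test><p>Pilot machine runs.</p></test>
  <results>
    <p>Drying time fell by a fifth.</p>
    <figure id="f1"><caption>Drying time versus temperature</caption></figure>
  </results>
  <discussion><p>The model overestimates drying at low temperature.</p></discussion>
</report>
"""

def captions_in(root, section):
    """Return the figure captions found inside the named logical section."""
    return [c.text for c in root.findall(f"./{section}//figure/caption")]

def sections_mentioning(root, word):
    """Return the names of top-level sections whose text contains the word."""
    word = word.lower()
    return [child.tag for child in root
            if word in "".join(child.itertext()).lower()]

root = ET.fromstring(SAMPLE)
print(captions_in(root, "results"))
print(sections_mentioning(root, "drying"))
```

The point of the sketch is only that a query can be restricted to a logical part ("figures in the results") or can report which logical parts match, neither of which is possible on unstructured full text.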

In 1996, the HUT Library was granted funding from the Ministry of Education in Finland for a project on structured electronic publishing of HUT series. The grant was given from the special Information Society Program funds. At the time of writing, the second year of the project is coming to its end. One might say that the HUTpubl project is now at a crossroads. We are aiming at production with two series during the next 18 months, based on experience with the project pilot. We are well aware that much is still to be done and that there are surprises lurking round the corner. One of them is resource management: at the moment, HUTpubl is actually just two people (the project staff) and an Advisory Group of five. So we are small in numbers. Project funding covers the salaries of 1.5 people plus running costs, but not much more.

So, at this stage, HUTpubl might be called an idealistic underground movement. Given the characteristics of HUT as a busy and heterogeneous publishing environment, we assume there is no easy way to gain acceptance for archival and document management issues. Any top-down regulations are unacceptable, for instance. A special challenge is the fact that, from a cultural point of view, technical universities are slow in adopting innovations that relate to textual information. Exaggerating only a little, technology is about machines and programs, not documents. What HUT staff need is clear evidence that document management is worth the trouble, that it gives something extra, and that it works. This means a significant amount of lobbying but, and this is even more important, something that might perhaps be called "Request for University Comments", after the well-known procedure by which new Internet protocols are publicly evaluated before they are adopted. HUTpubl tries to gather a slowly but steadily growing network of people who are willing to act as forerunners by taking part in structured document production. It is hoped and believed that workable solutions gain acceptance and that acceptance brings more followers. In this work the project relies especially on the help, support and interest of three units of HUT: the Department of Automation and Systems Technology, the Department of Computer Science and Engineering, and the Administrative Office.

HUTpubl has a field-tested and evaluated pilot. It consists of the following technical elements:

As you can see, there are a few parts still missing, e.g. a search engine with a GUI and a mechanism for maintaining a document repository, to name just two. We have made plans for how to build the search engine; it would include sgrep from the SID project at the University of Helsinki and an SGML multi-purpose tool like Balise. We have other uses for Balise as well. The conversion table of FMSGML is not capable of handling multi-hierarchical structures. That became clear in the testing period. We also aim to switch from an SGML to an XML environment as soon as it seems appropriate. XML, Extensible Markup Language, received the status of W3C Recommendation last week. We have checked that the HUTpubl DTD should be fairly easily transferable to XML. Among other things, XML will be a missing link between HTML and SGML, and as such an eagerly awaited novelty.

The authoring part of the pilot was tested by three university staff members in October-November 1997. The main goal of the test was to find out whether the HUTpubl DTD was suitable for the structure of the tested documents, which were from three different departments or units of HUT. We also wanted to gather experience of MS Word and FMSGML as structured editors.

We came to the conclusion that after some revision the DTD could be accepted and that FMSGML was a robust editor to proceed with. One feature of FMSGML is that it offers different views of the document, placed in separate windows. The test group especially liked the hierarchical view, where you can manipulate text elements in blocks, move them around etc. FMSGML is a full-grown desktop publishing product, combining structured editing with a layout environment. This caused some distraction in the test group. It was sometimes difficult for them to concentrate on the content of the document, the information, when details of the output specifications were so close at hand. And yet, the idea of structured editing is to give freedom to the writer, to let him forget about layout, fonts etc. Layout would be taken care of elsewhere. But the principle of WYSIWYG (what you see is what you get) is deeply rooted in us. The principal output format of today's desktop editors is paper and, as you may well know, editors work under the supervision of printers. As Mrs. Eskola told us earlier today, we have to behave in a certain manner to be sure that the output of what we are writing is electronically correct.

In the case of Word we became a bit hesitant. Even though Word is a widely used desktop editor, its implementation of styles leaves much to be desired. It is clumsy to use, especially when the number of different styles is large, as it was in our case, where each of the central HUTpubl DTD elements needed a corresponding style. In addition, the author has no way to validate the structure of the document. As one of the test members put it: "It is like writing something and throwing it in a black hole, wishing it would be OK."
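The "black hole" problem can be illustrated with a small sketch in Python. It assumes that a document has been exported from the word processor as a list of (style name, text) pairs, that each style maps to one element, and that the front matter must appear in a fixed order. The style names, the element names and the ordering rule are all hypothetical; they are not the HUTpubl templates or the HUTpubl DTD, and a real SGML parser checks far more than this.

```python
# Hypothetical style-to-element table; neither the style names nor the
# element names come from the HUTpubl templates or DTD.
STYLE_TO_ELEMENT = {
    "HUT Title": "title",
    "HUT Author": "author",
    "HUT Abstract": "abstract",
    "HUT Heading 1": "heading",
    "HUT Body": "para",
}

# Assumed rule for illustration: the front matter must appear in this order
# before any body element.
REQUIRED_FRONT = ["title", "author", "abstract"]

def check_structure(paragraphs):
    """Map (style, text) pairs to elements and return a list of problems."""
    problems = []
    elements = []
    for number, (style, _text) in enumerate(paragraphs, start=1):
        element = STYLE_TO_ELEMENT.get(style)
        if element is None:
            problems.append(f"paragraph {number}: unmapped style {style!r}")
        else:
            elements.append(element)
    front = elements[:len(REQUIRED_FRONT)]
    if front != REQUIRED_FRONT:
        problems.append(f"front matter is {front}, expected {REQUIRED_FRONT}")
    return problems

good = [("HUT Title", "Drying"), ("HUT Author", "A. Writer"),
        ("HUT Abstract", "Summary."), ("HUT Body", "Text.")]
bad = [("HUT Title", "Drying"), ("Normal", "A. Writer"),
       ("HUT Abstract", "Summary."), ("HUT Body", "Text.")]

print(check_structure(good))
for problem in check_structure(bad):
    print(problem)
```

A check of this kind, run at the author's desk, is what a style-based template alone cannot offer: the author learns about a stray style or a missing element before the file is handed over for conversion.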

Nevertheless, HUTpubl does not expect all HUT researchers to leave their desktop editors and switch to FMSGML. Therefore, besides Word, we might also need template files for other popular editors like WordPerfect and TeX. Still, it is quite clear that if HUTpubl could offer HUT authors a genuine SGML editor, like Author/Editor from SoftQuad, and if it were accepted, the amount of conversion work would decrease significantly.

With this pilot HUTpubl aims at a production phase in 1998-1999. The target is two series published by the Laboratory of Media Technology at the Department of Automation and Systems Technology. But production is a big word, and in fact it means arranging at least five aspects of publishing:

It is important to note that we do not yet know who or what unit will take responsibility for structured publishing at HUT, or whether a whole new unit will be established at some stage. This might sound a little odd to you. But it stems from the special position of the HUTpubl project, which still exists in a no man's land - half research, half production. It does not yet have the status of an established part of HUT publishing, yet it tries to become one. At the moment, we have to proceed in small steps because of limited funding and without any real guarantee that it will continue. This is nothing new to projects currently under way in academia. There are three issues that affect projects like HUTpubl, where university services are developed with project funding from outside: the market for staff, particularly IT staff; the status of universities as employers; and the amount of risk-taking in terms of software.

IT staff are a scarce commodity, especially in small countries like Finland, but this is also a world-wide phenomenon. Big IT companies like Nokia could easily employ all IT students in Finland - and I can tell you that they almost do! Academic projects are thus forced to compete in an already heavily competitive market. How universities succeed in this may be read from surveys among university students: the only employers that are even more undesirable than universities are municipalities. One obvious reason for this is salary, but there is more to it than that. I am not going further into this, but I would only like to add that, on the other hand, the use of students as project workers is not unproblematic either. Their SGML knowledge normally covers only HTML, which means that the project timetable has to be adjusted to include learning periods of at least 1-2 months. Students are also sensitive to new job offers, which means that projects should have an arsenal of substitutes. And, in any case, students are students, working part-time, in the summer, or finally just graduating.

Nevertheless, SGML projects do need students because SGML knowledge, at least in Finland, seems to be divided between three separate groups of people: those who already have their hands full of SGML work, typically in a big company like Wärtsilä, Valmet, or Nokia; those with a long university career focusing on the theory of structured documents; and SGML consultants.

The greatest asset of IT students is the courage to tackle new, unexplored problems. They have also grown up working in a public-domain software environment. This is good to keep in mind, because structured publishing is much more than just editing and storing documents - it is, for example, about how to deliver the material on different platforms in the format that best suits each particular platform and is well received by the users.

In a recent SGML survey conducted in Finland it was stated that among the biggest challenges confronting SGML-based projects in the business, administrative and academic sectors alike are expensive software packages, especially in database management. This is perhaps no news. But it does not necessarily have to be like that. Let me give you an example. Last week, the Finnish SGML User's Group had its meeting in Helsinki. One of the speakers represented the Edita Group, one of the leading commercial publishers in Finland. The Electronic Publications Systems section of the company, which has a successful SGML production line, is happy with a very low-profile SGML document repository system consisting only of a number of Unix shell scripts.

There is a great deal of public domain software available that is capable of handling structured text in machine-readable format. A very comprehensive list was recently published on the web by Eila Kuikka and Erja Nikunen, from the University of Kuopio and Nokia Telecommunications, respectively.

But how does one put together a suitable package from both public-domain and commercial software? The question becomes even more urgent when XML-based software and applications start to pour onto the Internet. Of course much depends on your budget, but the answer is not that simple. On the one hand, we need the robustness and maturity of commercial software. On the other hand, we need to build an interesting and flexible environment for application development with the help of public domain software. The more we pay for software, the more our budget is tied. But the less we pay, the more we have to think about continuity and risk-taking issues.

In 1996, I had the opportunity to visit the Electronic Text Center at the Alderman Library at the University of Virginia in Charlottesville. The head of the Center, Mr. David Seaman, is himself the author of Perl scripts that take care of the on-the-fly document conversion from SGML to HTML in network delivery. This is one example of how risk-taking could be diminished by ensuring that technical know-how stays in the project.
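As a rough illustration of what on-the-fly conversion for network delivery involves, here is a minimal sketch. It is written in Python rather than Perl, it reads XML rather than SGML, and the element names and the element-to-HTML mapping are invented for the example. It is not a reconstruction of the Center's scripts.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical element-to-HTML mapping; not the Center's actual rules.
TAG_MAP = {"report": "div", "title": "h1", "section": "div",
           "heading": "h2", "para": "p", "emph": "em"}

def to_html(node):
    """Recursively render an element and its children as an HTML string."""
    tag = TAG_MAP.get(node.tag, "span")  # unknown elements fall back to span
    parts = [f"<{tag}>", html.escape(node.text or "")]
    for child in node:
        parts.append(to_html(child))
        # Text that follows a child element belongs to the parent.
        parts.append(html.escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)

source = ("<report><title>Drying &amp; pressing</title>"
          "<para>A <emph>short</emph> note.</para></report>")
page = to_html(ET.fromstring(source))
print(page)
```

Because the conversion happens at delivery time, only the structured source needs to be archived, and the mapping table can be changed without touching the documents. A script this small is also something project staff can read and maintain themselves, which is the point about keeping know-how in the project.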

The HUTpubl project has been in close contact with four other academic organisations in Finland, all with experience in SGML-related research or project activity: Oulu University Library, the Department of Computer Science and Information Systems at the University of Jyväskylä, more precisely its Digital Media unit (from where we have Riitta Kuisma here as a participant - Riitta, would you please stand up!), and Vaasa University Library (represented here by the Library Director Vuokko Palonen - Vuokko?). In December 1997, HUTpubl and the latter two organisations formed an umbrella project called RAJU.

The scope of RAJU covers research and production of structured documents in three main areas: electronic textbooks, master's theses and research series. The deliverable of RAJU is expected to be a report on the state of structured electronic publishing in Finnish universities, together with a list of guidelines. RAJU also welcomes other Nordic institutions. Therefore it has been a great pleasure for me to have had this opportunity today to speak to a Nordic audience. Thank you for listening!


Nordisk konferanse om elektronisk publisering