Tom Tyler, University of Denver Library
&
Mary Strouse, Howard University Law Library
Presented at 8th annual meeting, Innovative Users Group, April 30-May 2, 2000
With Release 2000, Innovative Interfaces, Inc. has responded to a long-standing request from the Innovative Users Group for a means of verifying hypertext links to external electronic resources in catalog records. Innovative's URL Checker is now available to customers of the Web Access Management (WAM) product. While the new URL Checker will enable many Innopac libraries to consider a program of URL maintenance for the first time, significant obstacles remain to achieving comprehensive, cost-effective management of hyperlinks in Web-based catalogs.
Hyperlink Maintenance in the Library Catalog
There are a number of reasons URLs in library catalog records may not function. Some are related to management of the resource serving environment (reorganization of a web site, changed server names, etc.) and others are related to the incorrect transcription of URL data into the MARC format. Checking or validating URLs, both initially and periodically thereafter, is necessary to ensure that links to external resources continue to function.
The identification of problem URLs is only the first step in the process of hyperlink maintenance in the library catalog. Temporary or transitory network and server conditions may result in some URLs being reported as bad which, on re-testing, are found to have no problems. Any systematic program of link maintenance should therefore incorporate some type of URL re-testing before bibliographic maintenance is undertaken.
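The re-testing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `retest` and its `check` callback are hypothetical names, and `check(url)` stands in for whatever link test a library already uses, returning True when the URL responds.

```python
import time

def retest(urls_reported_bad, check, attempts=3, pause=0.0):
    """Return only the URLs that fail on every one of `attempts` tries.

    `check` is a hypothetical callback: check(url) -> True if the URL works.
    `pause` is the number of seconds to wait between tries for one URL.
    """
    still_bad = []
    for url in urls_reported_bad:
        for attempt in range(attempts):
            if check(url):
                break  # the earlier failure was transitory
            if attempt < attempts - 1:
                time.sleep(pause)
        else:
            # No attempt succeeded, so the URL goes on for maintenance.
            still_bad.append(url)
    return still_bad
```

Only the URLs returned by `retest` would be passed on to bibliographic maintenance. In practice the pause between attempts would probably be hours or days rather than seconds.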
For URLs that require maintenance, a process must be initiated either to locate a URL that works and substitute it for the original in the bibliographic record, or to remove the nonfunctioning URL altogether. Because this may be a somewhat complex task, it is advantageous to work within an environment that permits moving easily between URL verification information, a web browser and the catalog database editing module.
URL Checking Software in General
URL checking software has been used for some time to validate hyperlinks in web pages. Automatic URL checking is performed by a type of "robot" software that is similar but not identical to web browsers. This software identifies URLs from a web page and requests header information from the server hosting the resource identified in the URL. Generally, URL checking robots use multiple simultaneous "sessions" or connections when engaged in this process. Robots do not actually load files represented by URLs but rely instead on a conversation with the remote server about its ability to serve the file in question. The result of this conversation is one or more status codes sent by the server to the robot. On the basis of these status codes the robot reports the availability of the URL being checked back to the link checking program.
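The header-only conversation described above might look like the following in Python's standard library. This is a minimal sketch, not the code of any product discussed here: the function names, the verdict labels, and the choice of eight worker threads are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def classify_status(code):
    """Map an HTTP status code (or None) to a coarse verdict label."""
    if code is None:
        return "no-response"
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client-error"
    if 500 <= code < 600:
        return "server-error"
    return "unknown"

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Report 3xx responses instead of silently following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def head_status(url, timeout=10):
    """Send a HEAD request (headers only, no file body) and return the code."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx responses arrive here
    except (urllib.error.URLError, OSError, ValueError):
        return None  # unreachable server, timeout, or malformed URL

def check_many(urls, fetch=head_status, workers=8):
    """Check several URLs over simultaneous connections."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(fetch, urls))
    return [(url, code, classify_status(code)) for url, code in zip(urls, codes)]
```

Some servers answer HEAD requests differently from ordinary page requests, which is one reason a robot's report may disagree with what a browser shows.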
Because this software requires a single HTML source of URLs to be checked, libraries have used a variety of means to extract URLs from their catalogs, convert them to hyperlinked statements in HTML, and then submit the resulting HTML page to robot validation. MarcXGen, software created by Tom Tyler at the University of Denver, has been used by a number of Innopac libraries to simplify this process.
Link checking robots designed to check single web pages or web pages at a single site typically share a feature which makes them especially problematic for use by libraries. Generally these robots will eliminate duplicate URLs to reduce load on the software and the network. For libraries this characteristic is a problem because a single URL may be associated with dozens of different records, each of which requires maintenance.
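One way a library can work around this is to keep its own map from each URL to the records that contain it, test each unique URL once, and then copy the result back to every record. The sketch below assumes the URLs have already been extracted as (record number, URL) pairs; the function names and sample record numbers are hypothetical.

```python
from collections import defaultdict

def group_records_by_url(record_urls):
    """Build {url: [record numbers]} from (record_number, url) pairs."""
    records_for_url = defaultdict(list)
    for record_number, url in record_urls:
        records_for_url[url].append(record_number)
    return dict(records_for_url)

def fan_out_results(records_for_url, url_status):
    """Repeat each unique URL's status for every record that carries it."""
    rows = []
    for url, record_numbers in records_for_url.items():
        for record_number in record_numbers:
            rows.append((record_number, url, url_status[url]))
    return sorted(rows)
```

With this approach the robot still tests only unique URLs, but the maintenance list names every record that needs editing.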
Another problematic feature of link checking robots is the handling of URL redirection. Redirection occurs when the contacted server sends the browser to a new URL. For example, when a document has been moved or renamed, a server administrator may set up either a temporary or permanent redirection page to point users to the new location. PURLs (Persistent Uniform Resource Locators) are another type of redirect. Subscription resources which use domain testing to identify authorized users also typically make use of redirects.
Most link checking software reports that a redirection exists, and records the new URL, but does not actually verify that the new URL is functioning. The situation is complicated by the fact that a redirect sometimes points to another redirect, so that only the third or fourth step connects to an actual resource.
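Following a chain of redirects to its end can be sketched as below. The sketch assumes a caller-supplied `fetch(url)` function (hypothetical) that returns a status code and the redirect target, if any; the limit of five hops is an arbitrary choice for illustration.

```python
from urllib.parse import urljoin

def resolve_redirects(url, fetch, max_hops=5):
    """Follow redirects from `url` and report where the chain ends.

    `fetch` is a hypothetical callback: fetch(url) -> (status_code, location),
    where location is the redirect target or None.
    """
    chain = [url]
    while True:
        status, location = fetch(chain[-1])
        if not (300 <= status < 400 and location):
            # Reached something that is not a redirect: the real answer.
            return {"final_url": chain[-1], "status": status,
                    "chain": chain, "resolved": True}
        next_url = urljoin(chain[-1], location)  # location may be relative
        if next_url in chain or len(chain) > max_hops:
            # A loop, or too many hops: give up and report it unresolved.
            return {"final_url": chain[-1], "status": status,
                    "chain": chain, "resolved": False}
        chain.append(next_url)
```

Keeping the whole chain in the result preserves the connection between the URL in the catalog record and the URL finally tested.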
Overview of Innovative's URL Verification Software
Innovative's URL Checker software scans records for the presence of at least one 856 field containing one or more subfield u. The text of every subfield u of every 856 field is treated as a URL. URLs present in other fields or in other subfields of the 856 are not checked, and 856 fields with no subfield u are ignored. Thus the URL Checker will not identify common coding errors, such as placing URLs in subfield a of the 856, and will not check a PURL's underlying URL when it appears in a 530 note, or in 856 subfields x or z. The link checker will separately verify every instance of a URL which recurs in several different catalog records.
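The selection rule described above can be illustrated with a short sketch. It is not Innovative's code. It assumes a simplified, hypothetical record representation: a list of (tag, field text) pairs in which every subfield is introduced by a "|" delimiter followed by its one-character code.

```python
def subfields(field_text, delimiter="|"):
    """Split '|uhttp://...|zNote' into [(code, value), ...].

    Any text before the first delimiter is ignored in this sketch.
    """
    pairs = []
    for chunk in field_text.split(delimiter)[1:]:
        if chunk:
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs

def urls_to_check(record):
    """Return every subfield u value from every 856 field, in order."""
    found = []
    for tag, field_text in record:
        if tag != "856":
            continue  # URLs in 530 notes and other fields are not seen
        found.extend(value for code, value in subfields(field_text)
                     if code == "u")
    return found
```

Under this rule, a URL miscoded in subfield a of an 856, or quoted in a 530 note, never reaches the checker.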
URL Checker has two distinct modes of operation: Automatic and Interactive. Automatic verification is designed to run on a preset schedule. One can also initiate URL verification at any time from the Web Access Management menu. In this "interactive" mode, URL Checker will verify a range of record numbers, or the contents of a review file. This feature allows for verifying URLs immediately after a tape load, or periodic checking of new records if URLs are not tested during cataloging.
The two modes differ predominantly in the format of their output. In automatic mode, URL Checker creates a URL Verification Report in HTML format. In interactive mode, the verification report is output directly to the character-based staff screen, from where it can be printed but not saved. Put another way, automatic URL verification resembles other Millennium products, particularly Web Management Reports, while interactive verification functions like existing database maintenance tools, such as heading change reports. In both modes, catalog records containing problem URLs are written to a system-created review file.
Only reported errors (non-functioning URLs and redirects) are included in the verification reports. In interactive mode, the total number of URLs checked is reported. Both the automatic and interactive reports include similar data: a record ID number (minus check digit), the verified URL, an error type code, and a replacement URL if provided by the queried server. Reported errors are sorted by error type, then by record number.
In the web-based URL Verification Report, clicking on a record number displays that catalog record in a second window. Clicking on a link in the URL column will connect to that resource. This is useful for verifying reported problems, because reported errors may be temporary situations that have since disappeared. Clicking on the link in the new URL column will verify that this URL is functional (something the verification software itself does not do). The text-based report in interactive mode cannot be used to display catalog records or to verify the operability of links, making it far less useful for maintenance work. Inexplicably, the text-based report truncates the verified URLs after 25 characters.
Neither report connects the user directly to an editable version of the catalog record containing a bad URL. To correct or replace URLs reported by URL Checker, the user must either call up the records individually in editing mode (by means of the record ID number plus "a" for the check digit), or use the report as an index while stepping through records in the matching review file.
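For libraries that rekey or capture the report data, the report's sort order and the record lookup key can be reproduced as follows. The dictionary keys, function names, and sample values are hypothetical; the only details taken from the report itself are the sort by error type and then record number, and the use of "a" in place of the check digit.

```python
def sort_report(rows):
    """Sort report rows by error type, then by record number.

    Record numbers are compared as strings, which gives numeric order
    only when they all have the same length.
    """
    return sorted(rows, key=lambda row: (row["error"], row["record"]))

def lookup_key(record_id):
    """Append 'a' in place of the check digit to call up a record."""
    return record_id + "a"
```

For example, a row for record b1000002 would be retrieved in editing mode with the key b1000002a.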
Performance issues for the initial release of URL Checker have included difficulties loading larger web-based reports, discrepancies in the sort order between reports and their associated review files, and a small number of redirects going unreported. The only customer options available in the initial release are the frequency of the automatic verification (default: weekly) and the precise day and time it is run (default: 2:00 am on Mondays).
Customers are already requesting a number of changes and enhancements, including:
- Optional suppression of PURLs and other redirects from the verification report
- The ability to specify sort order for the verification report
- Output of verification reports produced in interactive mode into the same HTML format used for the automatic reports
- The ability to output the report data in comma delimited (CSV) format
- Inclusion of data, at least in summary form, about URLs that were verified and found to work correctly.
The strengths of III's URL Checker are that it can run automatically, does not require complicated setup, and eliminates the step of exporting URLs outside the catalog before they can be checked. Compared to other link checkers, Innovative's initial product is deficient in several ways. It reports only errors, not the status of all URLs checked. For libraries with a large number of reported errors, the HTML file may be quite large (4 or 6 million bytes in length) and may take an extremely long time to load in a browser. Perhaps the greatest deficiency is the inability to export the report data to a spreadsheet or database for subsequent URL maintenance work.
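To show what an exportable report could look like, the sketch below writes the four data elements of the verification report as comma-delimited text that a spreadsheet can open. The column names and the function name are assumptions for illustration, not part of URL Checker.

```python
import csv
import io

# Hypothetical column names for the report's four data elements.
FIELDS = ["record", "url", "error", "new_url"]

def report_to_csv(rows):
    """Render report rows (dictionaries) as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing values, such as an absent replacement URL, become empty.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()
```

The csv module quotes any value that contains a comma, so URLs with commas in them survive the round trip into a spreadsheet.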
Libraries which have access to URL Checker, and which have put off tackling the issue of URL maintenance, may find this a fairly painless way to begin. Libraries that have already implemented a link maintenance procedure will gain little advantage by switching to URL Checker in its present form. Libraries which neither have nor want Web Access Management may prefer to implement the more cumbersome third-party link checking software available at low or no cost. Libraries with large numbers of URLs may prefer external link checkers because of the better data manipulation capabilities they offer. Even those libraries which currently have WAM software may benefit from using external link checkers for their initial cleanup, before implementing a regular maintenance program using URL Checker.
The table that follows compares Innovative's URL Checker in Automatic mode with other commonly used link checking software.
Comparison of Three Link Checkers Used in Innopac Libraries
| Feature | LinkBot | Xenu Link Sleuth | Innopac / WAM URL Verifier |
| --- | --- | --- | --- |
| Cost | $250-$500 | Free | Included in WAM |
| Tests URLs in: | Web page or pages | Web page or pages | Subfield u of 856 in bibliographic records |
| Speed | Very, very fast | Very fast | Slow |
| Scope of URLs tested | Unique URLs only | Unique URLs only | All URLs |
| Tests redirected URLs | No | Yes | No |
| Accuracy of results | Good | Good | Erratic |
| Requires preliminary web page creation with other software (e.g. MarcXGen) | Yes | Yes | No |
| Report formats | Proprietary, HTML, CSV | Proprietary, HTML, TSV | HTML, Bib Records in Review File |
| Data elements in Report/their labels | | | |
| URL | Link URL | Address | URL |
| Status Code/# | Link Status | Status | Error; New URL |
| Type | Type | Type | -- |
| Size | Size | Size | -- |
| URL Caption | Link Description | Title | -- |
| Last Modified | Last Modified | Date | -- |
| -- | Hits | -- | -- |
| Meta Title | Link Doc Title | -- | -- |
| Meta Author | Author | -- | -- |
| -- | -- | Level | -- |
| -- | -- | Links Out | -- |
| -- | -- | Links In | -- |
| -- | -- | Error | -- |
| Bib Record # | -- | -- | Title |
| Major issues & problems | Checks only unique URLs - a big problem for Bib URLs | Cannot connect results of tested redirects with original URLs | Large reports load very slowly; inability to export to work environment |
This document, visuals, and material related to this presentation will be found at:
http://www.du.edu/~ttyler/iug2000/
Last Update: Monday, March 21, 2000 - 5:43:22 PM (MST)