Marc URL Extractor and HTML Generator
Tom Tyler, University of Denver Library
· MarcXGen executable and library name template file
· Source file of USMarc bibliographic records having 856 fields in unblocked format
· Microsoft Windows 9x
MarcXGen extracts URLs from Marc bibliographic records and generates HTML code to create a single web page of hyperlinks that can be used with third party Link Checking software such as LinkBot and Xenu's Link Sleuth.
While MarcXGen has been adapted to interpret 9xx field data from several libraries using the Innopac ILS, it has been used successfully with files of Marc records exported from other library systems.
MarcXGen also creates separate files of delimited data that may be used to build a relational database environment that may simplify some maintenance tasks associated with bad or problem URLs in library database records.
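As an illustration only, the extraction described above can be sketched in Python. This is not MarcXGen's own code (its source is not distributed). It assumes the standard layout of an unblocked MARC record: a 24-byte leader, 12-byte directory entries, and the usual field, record, and subfield delimiter bytes. The function names are hypothetical.

```python
import html

FT, RT, SF = b"\x1e", b"\x1d", b"\x1f"  # field, record, subfield delimiters


def iter_records(data):
    """Yield each raw record; every unblocked record ends with 0x1D."""
    for chunk in data.split(RT):
        if chunk.strip():
            yield chunk


def urls_from_record(rec):
    """Return the subfield-u values of every 856 field in one record."""
    base = int(rec[12:17])        # leader bytes 12-16: base address of data
    directory = rec[24:base - 1]  # directory ends with a field terminator
    urls = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if len(entry) < 12:
            break                 # malformed directory; stop quietly
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        if tag != b"856":
            continue
        field = rec[base + start:base + start + length]
        # The first piece holds the indicators; the rest are subfields.
        for sub in field.split(SF)[1:]:
            if sub[:1] == b"u":
                urls.append(sub[1:].rstrip(FT).decode("latin-1"))
    return urls


def links_page(urls, title="URL list"):
    """Build one HTML page of hyperlinks for a link checker to crawl."""
    items = "\n".join(
        f'<li><a href="{html.escape(u, quote=True)}">{html.escape(u)}</a></li>'
        for u in urls
    )
    return (f"<html><head><title>{html.escape(title)}</title></head>"
            f"<body><ul>\n{items}\n</ul></body></html>")
```

To try it, read a .mar file in binary mode, pass its contents to iter_records, collect the results of urls_from_record, and write the string returned by links_page to a file with the extension ".htm".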
Documentation for MarcXGen, in MS Word format, is included among the files in the Zipped package available at the Download heading. A web version of the documentation is available at the following address:
The MarcXGen executable, documentation in MS Word format, and the template file for library name are available in Zipped format at the following address:
1. Be sure your web browser is set up to save-to-disk a file with the extension ".zip".
2. Use your web browser to go to the following URL:
3. Save the marcxgen.zip file to the folder/directory on your workstation where you will be processing the Marc output files from your library system.
4. Unzip marcxgen.zip using appropriate software (e.g. the PKZIP utility).
5. After unzipping the marcxgen.zip you should have the following files:
marcxgen.exe The program itself
libname.txt The text file the program needs to hold on to your library's name. You should edit this file with your library's name.
marcxgen.doc A readme file which corresponds to the file you are reading now.
The only configuration needed is to edit the libname.txt file. Replace all the asterisks with the name of your library. When saving the file be sure that you save it in text format.
Before editing: ******************************** Library
After editing: University of Denver Library
Before running MarcXGen you must have a file of unblocked records in Marc format in the same directory as the MarcXGen software. This file MUST have the extension of ".mar".
Also, the libname.txt should have been edited with your library's name as noted in the preceding section.
MarcXGen may be run from the DOS command line (i.e. MS-DOS Prompt) or from Windows Explorer. So, to get started either
From the DOS prompt, enter the command MARCXGEN and press the ENTER key, or
Select Marcxgen.exe from the folder/directory display if you use Windows Explorer
The program will prompt you to respond three times.
1. First, it will ask you for the file which holds Marc records exported from your Innopac system. This file MUST have the extension ".mar". When the program prompts you for the name DO NOT enter the ".mar". At the conclusion of the program there will be new files with the same name but with different extensions:
".htm" -- HTML source file to be used with link-checking software
".txt" -- Delimited text file of data elements for use with MS Access in record maintenance
".asc" -- Delimited text file of URL data from subfields a,z, & x; for use with MS Access
".skp" -- Records skipped due to excessive size - usually will contain no records
2. Next, the program will ask you to enter the current date - Enter the date in the format: Month Day, Year - e.g. June 14, 2000.
3. Next and finally, the program will ask for the name of the text file which holds your library's name. Enter the FULL NAME of this file. This file is distributed as libname.txt. This file should have been edited to include the name of your library.
The program should then run to its conclusion. Depending on the size of your input file, the characteristics of your computer's processor, etc. it will take anywhere from a few seconds to several minutes to conclude.
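The prerequisites and prompts above can be checked before a run with a short script. The following is a minimal sketch, not part of MarcXGen; the function names are hypothetical, and the only facts it relies on are the ones stated above (the ".mar" extension, the four output extensions, and the asterisks in the unedited libname.txt).

```python
from pathlib import Path

# Extensions of the files MarcXGen creates, as listed above.
OUTPUT_EXTENSIONS = (".htm", ".txt", ".asc", ".skp")


def expected_outputs(base_name):
    """Names of the files a run on base_name.mar should leave behind."""
    return [base_name + ext for ext in OUTPUT_EXTENSIONS]


def preflight(folder, base_name, libname_file="libname.txt"):
    """Return a list of problems to fix before starting MarcXGen."""
    folder = Path(folder)
    problems = []
    if base_name.lower().endswith(".mar"):
        problems.append("enter the file name without the .mar extension")
        base_name = base_name[:-4]
    if not (folder / (base_name + ".mar")).is_file():
        problems.append(f"{base_name}.mar not found in {folder}")
    lib = folder / libname_file
    if not lib.is_file():
        problems.append(f"{libname_file} not found in {folder}")
    elif "*" in lib.read_text(errors="replace"):
        problems.append(f"{libname_file} still contains asterisks")
    return problems
```

An empty list from preflight means the input file and the library name file are both in place; expected_outputs shows which files to look for afterwards.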
If the program concludes normally (i.e. without errors) then you should see a display similar to the following:
Bib Records processed ####
Total URLS ####
The following files have been created:
yourfile.htm - Single file in HTML format for use with link checker
yourfile.txt - Delimited data reported in HTML file
yourfile.asc - Delimited data for URLs found in subfields a,z, & x
yourfile.skp - Skipped records (generally should have length of 0)
End of File
MarcXGen Concluded normally
If an error occurred, then the following would appear at the bottom of the screen:
ERROR ## AT ####
Troubleshooting errors, at this point in time, will require:
1. The ERROR ## AT #### information that is displayed in an error condition; and,
2. Access to the source file of Marc records (i.e. the *.MAR file). If you can provide a copy (zipped if the file is large) of the *.MAR file, send an Email to with MarcXGen Error in the subject line and attach the *.MAR file or provide instructions on how/where it can be acquired in your message.
1. Revision history
Created 3/25/98 to generate an HTML page containing live hyperlinks for URLs in bibliographic records in an Innopac database
Revised 11/11/98 to extract Innopac BID's from 035 field, to better analyze / display 856 field data, and incorporate generic parameter input file data
Revised 12/19/98 to allow reading of bib portion only of large records (bib records with hundreds/thousands of item fields included); to use abbreviated text file; to incorporate changes in note displays; to allow display of subfield-3 data; to add dividers to give visual indication of records with multiple 856 fields.
Revised 3/18/99 to recognize 992 field for Innopac BID numbers and to show relative 856 fields with records and relative subfield-u's within each 856 field.
Revised 2/14/2000 to exclude removal of terminating periods ["."] at the end of 856 fields. This change is to reflect the way data from this field is actually loaded/read by Innopac.
Version 2.0 - Spring 2000 - Delimited file output features added.
2. Notes regarding selected Marc data elements used
INNOPAC Bibliographic Record Number - If found in the following fields/subfields:
Record Number from Marc 001 field
3. Sample Record
The numbers in the "curvy brackets" represent the relative number of the 856 field in the record (the number before the colon) and the relative number of the URL within the 856 field (the number after the colon).
If no ".b########" number appears following the "BID:" in the record separator, it means the program either 1) doesn't know where to look for the Innopac BID (i.e. which Marc tag) or 2) no such field is exported in the Marc output file.
If "()" displays in the record separator it means there is no 001 field in the Marc record. The caption (underlined hyperlink) for the URL will repeat these two numbers if they exist.
APPENDIX A: MarcXGen COPYRIGHT NOTICE
The freeware MarcXGen and its documentation is (C) Copyright by Thomas G. Tyler,(1998,1999,2000).
Permission to use, copy, and distribute this software and its documentation, in whole or in part, for any purpose (except as detailed hereunder) is hereby granted without fee, provided that the above copyright notice and this permission notice appear in all copies of the software and related documentation. Notices of copyright and/or attribution which appear in any file included in this distribution must remain intact.
You may not disassemble, decompose, reverse engineer, or alter this file or any of the other files in the package.
This software is provided as FREEWARE, and cannot be sold. This restriction does not apply to connect time charges, or flat rate connection/download fees for electronic bulletin board services. This software can not be bundled with any commercial package without express written permission from Thomas G. Tyler. Source code for this software is proprietary information of Thomas G. Tyler, the AUTHOR. No source-code license is available.
THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE, OTHER THAN TO THE EXTENT OF ANY UNAVOIDABLE STATUTORY LIABILITY.
APPENDIX B: Creating the file of Unblocked Marc records - Innopac
Step 1: Create a review file of bibliographic records with 856 fields
1. From MAIN MENU select MANAGEMENT information
2. From MANAGEMENT information select Create LISTS of records (note: password may be required)
3. Select a file number
4. When asked to Choose what kind of list you want to produce: select BIBLIOGRAPHIC list
5. When prompted to Find BIBLIOGRAPHIC records that satisfy the following conditions enter the exclamation point symbol (!) which is the code for Marc Tag.
6. At the display MARC TAG tttii|ssss enter "856" over the "ttt"; press Enter.
7. When prompted to Enter boolean condition (=, ~, >, <, G, L, W, N, H, X) enter the tilde (~) which represents not equal to.
8. When asked to Enter string ( limit of 50 characters ) do nothing; press Enter.
9. Respond to the prompt Enter action ( A for AND, O for OR, S to START search ) OR \ to enter a range for searching.
10. Enter the name of the file you will be creating; press Enter.
11. At the conclusion of the file creation process Quit back to main menu
1. From the MAIN MENU select ADDITIONAL system functions
2. From the ADDITIONAL SYSTEM FUNCTIONS select Read/write MARC records (note: password may be required at this point).
3. At the READ/WRITE MARC RECORDS menu select Output MARC records to another system using IFTS.
4. At the Output MARC Records screen select CREATE disk file of unblocked MARC records.
5. When prompted, enter the name of the file you will be creating (note: no file extension is allowed as part of the filename).
6. At the next display which prompts Specify records to be output : select from a BOOLEAN review file.
7. Select your review file from the review file display.
8. When prompted, select START sending records.
9. When the scrolling information concludes Quit back to the Output MARC Records.
1. Your new file of unblocked Marc records should now display on the Output MARC Records screen. From the options, select SEND a MARC file to another system using FTS.
2. Enter the number for your file.
3. At the FILE TRANSFER SOFTWARE select the IP number or name of the DOS/Windows workstation where you will be running the MarcXGen software. An FTP server application needs to be running on the workstation.
4. You will be asked to enter username and password for the remote site. Innopac's FTS should be in binary prompt mode. If this is the case, then you should see [bin][PROMPT] in the upper right hand corner of the screen.
5. From the options on the Put File At Remote Site select TRANSFER files.
6. When asked to Enter name of remote file reenter the filename with the ".mar" extension which is required by MarcXGen; press Enter.
7. After a delay, a MESSAGE BOX will appear on the screen indicating that the transfer has concluded. From the options select CONTINUE.
8. You may quit at this point - quitting all the way back to Main Menu and then exiting Innopac.
9. You are now ready to run MarcXGen on your workstation.
APPENDIX C: Using the *.TXT & *.ASC files with MS Access
1. Change Look in: to the directory where MarcXGen and its related files are located.
2. Create new or use existing MS Access database
3. Select File/Get External Data/Import
4. Select from File of type Text Files (*.txt;*.csv;*.tab;*.asc)
5. For File name select *.txt or *.asc where "*" is your input filename used with MarcXGen
6. At the Import Text Wizard screen, select Delimited - Characters such as comma or tab separate each field; click on Next>
7. At the next Import Text Wizard screen select Comma, quote mark in Text Qualifier box, and First Row Contains Field Names; click on Next>
8. At the next Import Text Wizard screen select In a New Table; click on Next>
9. At the next Import Text Wizard screen you will be asked to verify Field Name and Data Type for each field. All Data Types should be Text. Click on Next>
10. At the next Import Text Wizard screen select No Primary Key; Click on Next>
11. At the last Import Text Wizard screen add the following to the name that appears in the Import to Table: box:
· for *.txt add TXT YYMMDD where YYMMDD represents the date.
· for *.asc add ASC YYMMDD where YYMMDD represents the date.
12. Click on Finish
At the conclusion of Import you may get a message that reports the following:
"Error descriptions with associated row numbers of bad records can be found in the Microsoft Access table ..._ImportErrors."
Take a look at this table. Generally most errors are Field Truncation errors representing long titles that were truncated because of field size limitations in the table (255 characters). Errors of this type are of no concern.
However, if you have Unparsable Record errors you may want to use a text editor (e.g. MS EDIT from the DOS prompt or NoteTab) to correct the underlying problem in the records before reloading your file a second time. Generally this type of error is caused by quotation marks in the URL field. They must be removed. Quotation marks are appropriate only at the beginning and end of fields in a delimited file.
After any corrections have been made the ..._ImportErrors table may be deleted.
13. Your table is now ready to use.
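Stray quotation marks of the kind described above can also be located before importing. The sketch below is one possible check, written in Python rather than done in a text editor. It assumes the same settings chosen in the Import Text Wizard (comma delimiter, quote mark as text qualifier, field names in the first row) and assumes each record occupies a single line; the function name is hypothetical.

```python
import csv


def check_delimited(lines):
    """Return (line number, reason) for each line that will not import cleanly."""
    problems = []
    expected = None  # field count taken from the first (header) row
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        try:
            # strict=True rejects a quote mark in the middle of a field.
            row = next(csv.reader([line], strict=True))
        except csv.Error as exc:
            problems.append((number, f"unparsable: {exc}"))
            continue
        if expected is None:
            expected = len(row)
        elif len(row) != expected:
            problems.append((number, f"{len(row)} fields, expected {expected}"))
    return problems
```

Pass it an open file, for example open("yourfile.txt", newline=""), and correct the reported lines before running the import.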
1. For Innopac Marc records, the bibliographic record number (RECORD#) has been normalized to b plus the first 7 digits of the number. The 8th, or check-digit, has been replaced with the wildcard a.
2. Filing indicators have been ignored in the TITLE field
3. The EXTENSION field holds # extensions that were found in URLs during the creation of the MarcXGen file.
4. URL checkers, such as LinkBot and Xenu's Link Sleuth, ignore this part of the URL and it is stripped from the URL before the link-checking session begins. As these link checkers also check only one instance of duplicate URLs, removal of the extension data allows for easy and accurate update queries in MS Access for status and/or redirect information from the link checkers' delimited files.
5. The NEW URL field may be used to record working URLs after the table has been updated with the results of a link-checking session. It may also be used to record URLs found by manual searching on the web.
6. If the OCLC# is only numeric (i.e. no ocm at the beginning) and if the length is less than 8 characters, the number is left-filled with zeros to make a uniform 8 digits.
7. The TITLE field is a text field limited to 255 characters. This is to preserve the ability to sort on this field when a different view of data in the MS Access table is required.
8. The REL field, as in the HTML file created by MarcXGen, represents the relative 856 field and URL within that field for that bibliographic record.
9. In the *.TXT table, blank records in the URL field (with NO 856 URL in the REL field) usually are caused when:
· The 856 field was created prior to the definition of the subfield-u, or
· The cataloger mistakenly entered the URL in another subfield (usually -a or -z) [note: the *.ASC delimited file contains these
1. All notes for the *.TXT file apply with the exception of the note for the REL field.
2. The SUBFIELD contains the letter a, z, or x; this indicates the subfield of the 856 field where the URL was found.
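Notes 1 and 6 for the *.TXT file describe two normalizations that are easy to reproduce when matching other data against these tables. The following Python sketch shows one reading of those notes; it is not MarcXGen's code, the function names are hypothetical, and the input form ".b" followed by eight characters is an assumption.

```python
import re


def normalize_bib_number(raw):
    """Reduce an Innopac record number to b + 7 digits + the wildcard a."""
    match = re.search(r"b(\d{7})", raw)
    if match is None:
        return None  # no recognizable record number
    return "b" + match.group(1) + "a"


def normalize_oclc(raw):
    """Left-fill short, purely numeric OCLC numbers with zeros to 8 digits."""
    value = raw.strip()
    if value.isdigit() and len(value) < 8:
        return value.zfill(8)
    return value  # prefixed (ocm...) or already long enough: unchanged
```

Applying the same functions to record numbers from another source gives values that can be compared directly with the RECORD# and OCLC# fields.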
1. Change %5F to _ (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
2. Change %7E to ~ (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
3. Change %3A to : (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
4. Cut "#"-extensions to URLs and paste in EXTENSION field. Because these extensions are stripped by most link checkers, the post-link checking matching will be possible if they are removed before checking.
5. Check beginning of http entries in URL-sorted table. Correct any http//, http:///, etc. entries that are obviously incorrectly formed.
6. Check end of http entries in URL-sorted table. Correct any http:/www, http:www, htttp, etc. entries that are obviously incorrectly formed.
7. You may want to remove any mailto: entries you find.
8. Look for incorrectly formed GPO PURLs. Some examples:
http://www.access.gpo.gov/GPO/LPS2249 - change www to purl
http://purl.acces.gpo.gov/GPO/LPS2249 - change acces to access
http://purl.access.gpo.gov/GPO.LPS2153 - needs / following /GPO
http://purl.access.gpo.gov/GPO/LPS - no LPS number
http://purl.access.gpo.gov/GPO/LPS/1309 - remove / following LPS
http://purl.access/gpo.gov/GPO/LPS2965 - replace / with . before gpo
http://purl.accesss.gpo.gov/GPO/LPS4150 - change accesss to access
http://purl.acess.gpo.gov/GPO/LPS1363 - change acess to access
http://purl.gpo.gov/GPO//LPS547 - change // to / following GPO; While this PURL will be recognized by the server, adding .access after the purl will correspond to the format generally used by GPO
9. Visually inspecting domains and URLs will often turn up malformed URLs. Example: missing _ characters are easily spotted if other URLs from same domain/path include _s in their filenames.
10. Periods at the end of a URL should be removed.
11. Spaces anywhere in a URL should be removed (unless space is in #-extension segment)
12. If the URL does not have a query section (usually indicated by the presence of a ? in the URL) it generally attempts to look for a specific filename in the form filename.htm or filename.html or filename.txt or filename.pdf etc.
13. If the URL has only domain information or pathname information following the domain, the URL should end in a /; the assumed filename is index.html.
If, as in the following example, there is no terminating / the URL will be reported as a redirect error. To avoid this type of error with a link checker, add the / before checking.
Example: http://www.epa.gov/greenchemistry - needs terminating / to avoid error
14. Extraneous text fragments should be removed from URL field. Such fragments are often note information without subfielding.
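Several of the cleanup steps above are mechanical and could be scripted instead of done by hand in MS Access. The Python sketch below covers steps 1-6, 8, 10, 11 and 13 under stated assumptions: the GPO PURL pattern is inferred from the examples in step 8 only, the trailing-slash test is a rough heuristic that will also flag PURLs and similar paths, and all function names are hypothetical. Anything it flags is meant for manual review, not automatic correction.

```python
import re
from urllib.parse import urlsplit

ENCODED = (("%5F", "_"), ("%7E", "~"), ("%3A", ":"))  # steps 1-3
# Assumed well-formed shape, inferred from the examples in step 8.
GPO_PURL = re.compile(r"^http://purl\.access\.gpo\.gov/GPO/LPS\d+$")


def clean_url(raw):
    """Return (url, extension): steps 1-4, 10 and 11 applied to one URL."""
    url, sep, ext = raw.strip().partition("#")      # step 4: split off #-extension
    for encoded, char in ENCODED:
        url = re.sub(re.escape(encoded), char, url, flags=re.IGNORECASE)
    url = url.replace(" ", "").rstrip(".")          # steps 11 and 10
    return url, sep + ext


def malformed_scheme(url):
    """Flag http//, http:///, http:/www, htttp and similar (steps 5-6)."""
    low = url.lower()
    return low.startswith("htt") and re.match(r"https?://[^/]", low) is None


def suspect_gpo_purl(url):
    """Flag GPO PURL look-alikes that do not match the assumed pattern."""
    low = url.lower()
    return "gpo.gov" in low and "lps" in low and GPO_PURL.match(url) is None


def may_need_trailing_slash(url):
    """Heuristic for step 13: no query and no filename in the last segment."""
    parts = urlsplit(url)
    if not parts.netloc or parts.query or url.endswith("/"):
        return False
    return "." not in parts.path.rsplit("/", 1)[-1]
```

A typical pass would call clean_url on each value from the URL field, store the second element of the result in the EXTENSION field, and then list the URLs for which any of the three checks returns True.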
Thomas G. Tyler
Associate Director for Budget & Technical Planning
University of Denver Library