MarcXGen
Marc URL Extractor and HTML Generator
v.2.01 (6/9/2000)
Tom Tyler, University of Denver Library
Requirements:
· MarcXGen executable and library name template file
· Source file of USMarc bibliographic records having 856 fields in unblocked format
· Microsoft Windows 9x
Brief Description:
MarcXGen extracts URLs from Marc bibliographic records and generates HTML code to create a single web page of hyperlinks that can be used with third-party link-checking software such as LinkBot and Xenu's Link Sleuth.
While MarcXGen has been adapted to interpret 9xx field data from several libraries using the Innopac ILS, it has been used successfully with files of Marc records exported from other library systems.
MarcXGen also creates separate files of delimited data that may be used to build a relational database environment that may simplify some maintenance tasks associated with bad or problem URLs in library database records.
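The extraction step described above can be sketched in Python. This is only an illustration, not MarcXGen's own code (its source is proprietary). It assumes the standard Marc record layout (24-byte leader, 12-byte directory entries, field terminator 0x1E, record terminator 0x1D, subfield delimiter 0x1F), and every function name, including the make_record demo helper, is hypothetical.

```python
import html

FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter

def iter_records(data):
    """Yield each raw record from a buffer of unblocked Marc records."""
    pos = 0
    while pos + 24 <= len(data):
        length = int(data[pos:pos + 5])      # leader bytes 0-4: record length
        if length <= 0:
            break
        yield data[pos:pos + length]
        pos += length

def iter_fields(record):
    """Yield (tag, field_bytes) pairs by walking the record directory."""
    base = int(record[12:17])                # leader bytes 12-16: base address of data
    directory = record[24:base - 1]          # the directory ends with a field terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        # Drop the field terminator that closes each field.
        yield tag, record[base + start:base + start + length - 1]

def urls_from_record(record):
    """Return the subfield-u values of every 856 field in one record."""
    urls = []
    for tag, field in iter_fields(record):
        if tag != "856":
            continue
        for chunk in field.split(SUBFIELD)[1:]:   # the first chunk holds the indicators
            if chunk[:1] == b"u":
                # Assumption: latin-1 is good enough for URL text.
                urls.append(chunk[1:].decode("latin-1"))
    return urls

def links_page(urls, title="Links"):
    """Build one HTML page of hyperlinks for a link checker to crawl."""
    items = "\n".join(
        f'<a href="{html.escape(u, quote=True)}">{html.escape(u)}</a><br>' for u in urls
    )
    return f"<html><head><title>{html.escape(title)}</title></head>\n<body>\n{items}\n</body></html>"

def make_record(fields):
    """Demo helper: build a minimal record from (tag, data_bytes) pairs."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(data), len(body))
        body += data
    base = 24 + len(directory) + 1
    leader = b"%05dnam  22%05d a 4500" % (base + len(body) + 1, base)
    return leader + directory + FIELD_END + body + RECORD_END

if __name__ == "__main__":
    sample = make_record([("001", b"38584853"),
                          ("856", b"40\x1fuhttp://www.du.edu/~ttyler/freeware/marcxgen.htm")])
    for record in iter_records(sample):
        print(links_page(urls_from_record(record)))
```

A real file would be read with open(name, "rb").read() and passed to iter_records. The sketch does not handle the oversized records that MarcXGen writes to its ".skp" file.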
Documentation:
Documentation for MarcXGen, in MS Word format, is included among the files in the Zipped package available at the Download heading. A web version of the documentation is available at the following address: http://www.du.edu/~ttyler/freeware/marcxgen.htm
Download:
The MarcXGen executable, documentation in MS Word format, and the template file for library name are available in Zipped format at the following address: http://www.du.edu/~ttyler/freeware/marcxgen.zip
Unzipping MarcXGen.Zip
1. Be sure your web browser is set up to save-to-disk a file with the extension ".zip".
2. Use your web browser to go to the following URL: http://www.du.edu/~ttyler/freeware/marcxgen.zip
3. Save the marcxgen.zip file to the folder/directory on your workstation where you will be processing the Marc output files from your library system.
4. Unzip marcxgen.zip using appropriate software (e.g. the PKZIP utility).
5. After unzipping marcxgen.zip you should have the following files:
marcxgen.exe - The program itself
libname.txt - The text file that holds your library's name. You should edit this file with your library's name.
marcxgen.doc - A readme file which corresponds to the file you are reading now.
Configuring MarcXGen
The only configuration needed is to edit the libname.txt file. Replace all the asterisks with the name of your library. When saving the file, be sure that you save it in text format.
Example:
Before editing: ******************************** Library
After editing: University of Denver Library
Using MarcXGen:
Before running MarcXGen you must have a file of unblocked records in Marc format in the same directory as the MarcXGen software. This file MUST have the extension of ".mar".
Also, the libname.txt file should have been edited with your library's name as noted in the preceding section.
MarcXGen may be run from the DOS command line (i.e. MS-DOS Prompt) or from Windows Explorer. To get started, either:
· From the DOS prompt, enter the command MARCXGEN and press the ENTER key, or
· Select Marcxgen.exe from the folder/directory display if you use Windows Explorer.
The program will prompt you to respond three times.
1. First, it will ask you for the file which holds Marc records exported from your Innopac system. This file MUST have the extension ".mar". When the program prompts you for the name, DO NOT enter the ".mar". At the conclusion of the program there will be new files with the same name but with different extensions:
".htm" -- HTML source file to be used with link-checking software
".txt" -- Delimited text file of data elements for use with MS Access in record maintenance
".asc" -- Delimited text file of URL data from subfields a, z, & x; for use with MS Access
".skp" -- Records skipped due to excessive size - usually will contain no records
2. Next, the program will ask you to enter the current date. Enter the date in the format Month Day, Year (e.g. June 14, 2000).
3. Finally, the program will ask for the name of the text file which holds your library's name. Enter the FULL NAME of this file. This file is distributed as libname.txt and should have been edited to include the name of your library.
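As a convenience, the three answers and the output file names can be worked out ahead of time. The Python sketch below is not part of MarcXGen; the function names are hypothetical, and it only prepares the text you would type at the prompts.

```python
from datetime import date
from pathlib import Path

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
OUTPUT_EXTENSIONS = (".htm", ".txt", ".asc", ".skp")

def prompt_answers(mar_file, libname_file="libname.txt", today=None):
    """Return the three answers to type, in the order the program asks for them."""
    path = Path(mar_file)
    if path.suffix.lower() != ".mar":
        raise ValueError("the input file must have the .mar extension")
    today = today or date.today()
    # Prompt 1 wants the name WITHOUT ".mar"; prompt 2 wants "Month Day, Year".
    formatted = f"{MONTHS[today.month - 1]} {today.day}, {today.year}"
    return path.stem, formatted, libname_file

def expected_outputs(mar_file):
    """List the files that should exist after a normal run."""
    return [Path(mar_file).with_suffix(ext).name for ext in OUTPUT_EXTENSIONS]

if __name__ == "__main__":
    print(prompt_answers("yourfile.mar"))
    print(expected_outputs("yourfile.mar"))
```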
The program should then run to its conclusion. Depending on the size of your input file, the characteristics of your computer's processor, etc., it will take anywhere from a few seconds to several minutes to conclude.
If the program concludes normally (i.e. without errors) then you should see a display similar to the following:
Bib Records processed ####
Total URLS ####
The following files have been created:
yourfile.htm - Single file in HTML format for use with link checker
yourfile.txt - Delimited data reported in HTML file
yourfile.asc - Delimited data for URLs found in subfields a,z, & x
yourfile.skp - Skipped records (generally should have length of 0)
End of File
MarcXGen Concluded normally
If an error occurred, then the following would appear at the bottom of the screen:
ERROR ## AT ####
Troubleshooting errors, at this point in time, will require:
1. The ERROR ## AT #### information that is displayed in an error condition; and,
2. Access to the source file of Marc records (i.e. the *.MAR file). If you can provide a copy (zipped if the file is large) of the *.MAR file, send an Email to ttyler@du.edu with MarcXGen Error in the subject line and attach the *.MAR file or provide instructions on how/where it can be acquired in your message.
OTHER INFORMATION:
1. Revision history
Created 3/25/98 to generate an HTML page containing live hyperlinks for URLs in bibliographic records in an Innopac database.
Revised 11/11/98 to extract Innopac BID's from 035 field, to better analyze / display 856 field data, and incorporate generic parameter input file data.
Revised 12/19/98 to allow reading of bib portion only of large records (bib records with hundreds/thousands of item fields included); to use abbreviated text file; to incorporate changes in note displays; to allow display of subfield-3 data; to add dividers to give visual indication of records with multiple 856 fields.
Revised 3/18/99 to recognize 992 field for Innopac BID numbers and to show relative 856 fields with records and relative subfield-u's within each 856 field.
Revised 2/14/2000 to exclude removal of terminating periods ["."] at the end of 856 fields. This change is to reflect the way data from this field is actually loaded/read by Innopac.
Version 2.0 - Spring 2000 - Delimited file output features added.
2. Notes regarding selected Marc data elements used
INNOPAC Bibliographic Record Number - If found in the following fields/subfields:
035, subfield-a
907, subfield-a
935, subfield-a
948, subfield-a
992, subfield-a
Record Number from Marc 001 field
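The lookup described in these notes might be sketched as follows in Python. This is an illustration only. The data structure (a list of tag/value pairs), the function names, and the rule that a value must begin with ".b" to count as an Innopac BID are all assumptions; MarcXGen's actual logic is not published.

```python
# Tags where the Innopac bibliographic record number may appear, per the notes above.
BID_TAGS = ("035", "907", "935", "948", "992")

def find_bid(fields):
    """Return the first subfield-a value that looks like an Innopac BID, or "".

    `fields` is a list of (tag, value) pairs. For data fields the value is a
    dict of subfield code -> text; for control fields such as 001 it is a string.
    """
    for tag, value in fields:
        if tag in BID_TAGS and isinstance(value, dict):
            candidate = value.get("a", "").strip()
            # Assumption: an 035 may hold other numbers, so require the ".b" prefix.
            if candidate.startswith(".b"):
                return candidate
    return ""

def find_control_number(fields):
    """Return the Marc 001 record number, or "" when the record has no 001 field."""
    for tag, value in fields:
        if tag == "001" and isinstance(value, str):
            return value.strip()
    return ""

if __name__ == "__main__":
    record = [("001", " 38584853"),
              ("035", {"a": "(OCoLC)38584853"}),
              ("907", {"a": ".b16411080"})]
    print(find_bid(record), find_control_number(record))
```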
3. Sample Record
The numbers in the "curvy brackets" represent the relative number of the 856 field in the record (the number before the colon) and the relative number of the URL within the 856 field (the number after the colon).
If no ".b########" number appears following the "BID:" in the record separator, it means either 1) the program doesn't know where to look for the Innopac BID (i.e. which Marc tag) or 2) no such field is exported in the Marc output file.
If "()" displays in the record separator it means there is no 001 field in the Marc record. The caption (underlined hyperlink) for the URL will repeat these two numbers if they exist.
------------------------- BID: .b16411080 ( 38584853) -----------------------------
[ .b16411080 | 38584853 {1:1}] Job satisfaction among America's teachers effects of workplace conditions, background characteristics and teacher compensation /
Marc 856 Note: View Highlights of this publication
[ .b16411080 | 38584853 {2:1}] Job satisfaction among America's teachers effects of workplace conditions, background characteristics and teacher compensation /
Marc 856 Note: View this publication, Adobe Acrobat required
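If the captions ever need to be read back by a script, a pattern along the following lines could pull the pieces apart. This Python sketch is based only on the sample above. The regular expression and the function name are hypothetical, and the exact spacing MarcXGen uses when a number is missing is assumed, not documented.

```python
import re

# [ BID | 001 number {856 field:URL within field}] title
CAPTION = re.compile(
    r"\[\s*(?P<bid>[^|\s]*)\s*\|\s*(?P<control>[^{\s]*)\s*"
    r"\{(?P<field>\d+):(?P<url>\d+)\}\]\s*(?P<title>.*)"
)

def parse_caption(line):
    """Split one caption line into its parts, or return None if it does not match."""
    match = CAPTION.match(line.strip())
    if match is None:
        return None
    return {
        "bid": match.group("bid"),
        "control": match.group("control"),
        "field": int(match.group("field")),   # relative 856 field in the record
        "url": int(match.group("url")),       # relative URL within that 856 field
        "title": match.group("title").strip(),
    }

if __name__ == "__main__":
    print(parse_caption("[ .b16411080 | 38584853 {2:1}] Job satisfaction among America's teachers"))
```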
APPENDIX A: MarcXGen COPYRIGHT NOTICE
The freeware MarcXGen and its documentation is (C) Copyright by Thomas G. Tyler, (1998, 1999, 2000).
Permission to use, copy, and distribute this software and its documentation, in whole or in part, for any purpose (except as detailed hereunder) is hereby granted without fee, provided that the above copyright notice and this permission notice appear in all copies of the software and related documentation. Notices of copyright and/or attribution which appear in any file included in this distribution must remain intact.
You may not disassemble, decompose, reverse engineer, or alter this file or any of the other files in the package.
This software is provided as FREEWARE, and cannot be sold. This restriction does not apply to connect time charges, or flat rate connection/download fees for electronic bulletin board services. This software can not be bundled with any commercial package without express written permission from Thomas G. Tyler. Source code for this software is proprietary information of Thomas G. Tyler, the AUTHOR. No source-code license is available.
THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE, OTHER THAN TO THE EXTENT OF ANY UNAVOIDABLE STATUTORY LIABILITY.
APPENDIX B: Creating the file of Unblocked Marc records - Innopac
Step 1: Create a review file of bibliographic records with 856 fields
1. From MAIN MENU select MANAGEMENT information
2. From MANAGEMENT information select Create LISTS of records (note: password may be required)
3. Select a file number
4. When asked to Choose what kind of list you want to produce: select BIBLIOGRAPHIC list
5. When prompted to Find BIBLIOGRAPHIC records that satisfy the following conditions enter the exclamation point symbol (!) which is the code for Marc Tag.
6. At the display MARC TAG tttii|ssss enter "856" over the "ttt"; press Enter.
7. When prompted to Enter boolean condition (=, ~, >, <, G, L, W, N, H, X) enter the tilde (~) which represents not equal to.
8. When asked to Enter string ( limit of 50 characters ) do nothing; press Enter.
9. Respond to the prompt Enter action ( A for AND, O for OR, S to START search ) OR \ to enter a range for searching.
10. Enter the name of the file you will be creating; press Enter.
11. At the conclusion of the file creation process Quit back to main menu
Step 2: Create a disk file of unblocked Marc records
1. From the MAIN MENU select ADDITIONAL system functions
2. From the ADDITIONAL SYSTEM FUNCTIONS select Read/write MARC records (note: password may be required at this point).
3. At the READ/WRITE MARC RECORDS menu select Output MARC records to another system using IFTS.
4. At the Output MARC Records screen select CREATE disk file of unblocked MARC records.
5. When prompted, enter the name of the file you will be creating (note: no file extension is allowed as part of the filename).
6. At the next display which prompts Specify records to be output : select from a BOOLEAN review file.
7. Select your review file from the review file display.
8. When prompted, select START sending records.
9. When the scrolling information concludes Quit back to the Output MARC Records.
Step 3: Transfer the file to your workstation
1. Your new file of unblocked Marc records should now display on the Output MARC Records screen. From the options, select SEND a MARC file to another system using FTS.
2. Enter the number for your file.
3. At the FILE TRANSFER SOFTWARE select the IP number or name of the DOS/Windows workstation where you will be running the MarcXGen software. An FTP server application needs to be running on the workstation.
4. You will be asked to enter username and password for the remote site. Innopac's FTS should be in binary prompt mode. If this is the case, then you should see [bin][PROMPT] in the upper right hand corner of the screen.
5. From the options on the Put File At Remote Site select TRANSFER files.
6. When asked to Enter name of remote file reenter the filename with the ".mar" extension which is required by MarcXGen; press Enter.
7. After a delay, a MESSAGE BOX will appear on the screen indicating that the transfer has concluded. From the options select CONTINUE.
8. You may quit at this point - quitting all the way back to Main Menu and then exiting Innopac.
9. You are now ready to run MarcXGen on your workstation.
APPENDIX C: Using the *.TXT & *.ASC files with MS Access
1. Change Look in: to the directory where MarcXGen and its related files are located.
2. Create new or use existing MS Access database
3. Select File/Get External Data/Import
4. Select from File of type Text Files (*.txt;*.csv;*.tab;*.asc)
5. For File name select *.txt or *.asc where "*" is your input filename used with MarcXGen
6. At the Import Text Wizard screen, select Delimited - Characters such as comma or tab separate each field; click on Next>
7. At the next Import Text Wizard screen select Comma, quote mark in Text Qualifier box, and First Row Contains Field Names; click on Next>
8. At the next Import Text Wizard screen select In a New Table; click on Next>
9. At the next Import Text Wizard screen you will be asked to verify Field Name and Data Type for each field. All Data Types should be Text. Click on Next>
10. At the next Import Text Wizard screen select No Primary Key; Click on Next>
11. At the last Import Text Wizard screen add the following to the name that appears in the Import to Table: box:
· for *.txt add TXT YYMMDD where YYMMDD represents the date.
· for *.asc add ASC YYMMDD where YYMMDD represents the date.
12. Click on Finish
At the conclusion of Import you may get a message that reports the following:
"Error descriptions with associated row numbers of bad records can be found in the Microsoft Access table ..._ImportErrors."
Take a look at this table. Generally most errors are Field Truncation errors representing long titles that were truncated because of field size limitations in the table (255 characters). Errors of this type are of no concern.
However, if you have Unparsable Record errors you may want to use a text editor (e.g. MS EDIT from the DOS prompt or NoteTab) to correct the underlying problem in the records before reloading your file a second time. Generally this type of error is caused by quotation marks in the URL field. They must be removed. Quotation marks are appropriate only at the beginning and end of fields in a delimited file.
After any corrections have been made the ..._ImportErrors table may be deleted.
13. Your table is now ready to use.
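Records likely to cause Unparsable Record errors can be located before importing. The Python sketch below is one possible check, not a MarcXGen feature. It assumes the delimited file uses commas between fields and quotation marks around each field, as selected in the Import Text Wizard, and the function name is hypothetical.

```python
import re

# A quotation mark that is neither at the edge of the line nor next to a comma
# sits inside a field, which is where it causes trouble on import.
STRAY_QUOTE = re.compile(r'(?<=[^,])"(?=[^,])')

def lines_with_stray_quotes(lines):
    """Return the 1-based numbers of lines that contain a quote inside a field."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if STRAY_QUOTE.search(line.rstrip("\r\n")):
            bad.append(number)
    return bad

if __name__ == "__main__":
    sample = [
        '"URL","TITLE"',
        '"http://www.example.org/"page"","A title"',
        '"http://www.example.org/","Another title"',
    ]
    print(lines_with_stray_quotes(sample))
```

To check a real file, pass an open file object, for example lines_with_stray_quotes(open("yourfile.txt", encoding="latin-1")). The encoding shown is an assumption.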
Fields in the *.TXT table: URL, EXTENSION, RECORD#, OCLC#, REL, TITLE, NEW URL
1. For Innopac Marc records, the bibliographic record number (RECORD#) has been normalized to b plus the first 7 digits of the number. The 8th, or check-digit, has been replaced with the wildcard a.
2. Filing indicators have been ignored in the TITLE field
3. The EXTENSION field holds # extensions that were found in URLs during the creation of the MarcXGen file.
4. URL checkers, such as LinkBot and Xenu's Link Sleuth, ignore this part of the URL and it is stripped from the URL before the link-checking session begins. As these link checkers also check only one instance of duplicate URLs, removal of the extension data allows for easy and accurate update queries in MS Access for status and/or redirect information from the link checkers' delimited files.
5. The NEW URL field may be used to record working URLs after the table has been updated with the results of a link-checking session. It may also be used to record URLs found by manual searching on the web.
6. If the OCLC# is only numeric (i.e. no ocm at the beginning) and if the length is less than 8 characters, the number is left-filled with zeros to make a uniform 8 digits.
7. The TITLE field is a text field limited to 255 characters. This is to preserve the ability to sort on this field when a different view of data in the MS Access table is required.
8. The REL field, as in the HTML file created by MarcXGen, represents the relative 856 field and URL within that field for that bibliographic record.
9. In the *.TXT table, blank entries in the URL field (with NO 856 URL in the REL field) usually are caused when:
· The 856 field was created prior to the definition of the subfield-u, or
· The cataloger mistakenly entered the URL in another subfield (usually -a or -z) [note: the *.ASC delimited file contains these URLs]
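The normalizations described in notes 1, 3, and 6 above can be reproduced when preparing data from other sources to match against the table. The Python sketch below is illustrative only. The function names are hypothetical, and keeping the leading "#" in the EXTENSION value is an assumption.

```python
import re

def normalize_record_number(bid):
    """Note 1: "b" plus the first 7 digits, with the check digit replaced by "a"."""
    match = re.match(r"\.?b(\d{7})", bid.strip())
    return f"b{match.group(1)}a" if match else bid

def normalize_oclc(number):
    """Note 6: left-fill purely numeric OCLC numbers shorter than 8 digits with zeros."""
    number = number.strip()
    if number.isdigit() and len(number) < 8:
        return number.zfill(8)
    return number

def split_extension(url):
    """Note 3: separate a "#" extension from the rest of the URL."""
    base, sep, extension = url.partition("#")
    return base, sep + extension

if __name__ == "__main__":
    print(normalize_record_number(".b16411080"))
    print(normalize_oclc("1234567"))
    print(split_extension("http://www.example.org/report.htm#chapter2"))
```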
Fields in the *.ASC table: URL, EXTENSION, RECORD#, OCLC#, SUBFIELD, TITLE, NEW URL
1. All notes for the *.TXT file apply with the exception of the note for the REL field.
2. The SUBFIELD contains the letter a, z, or x; this indicates the subfield of the 856 field where the URL was found.
1. Change %5F to _ (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
2. Change %7E to ~ (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
3. Change %3A to : (unless you have reason to do otherwise). In MS Access, use Edit/Replace or CTRL-H; Set Search: to All; Remove check-mark from Match Whole Field and Match Case; Keep Search Only Current Field; Click on Replace All
4. Cut "#"-extensions from URLs and paste them into the EXTENSION field. Because these extensions are stripped by most link checkers, matching the link checker's results back to the table is possible only if the extensions are removed before checking.
5. Check the beginning of the block of http entries in the URL-sorted table. Correct any http//, http:///, etc. entries that are obviously incorrectly formed.
6. Check the end of the block of http entries in the URL-sorted table. Correct any http:/www, http:www, htttp, etc. entries that are obviously incorrectly formed.
7. You may want to remove any mailto: entries you find.
8. Look for incorrectly formed GPO PURLs. Some examples:
http://www.access.gpo.gov/GPO/LPS2249 - change www to purl
http://purl.acces.gpo.gov/GPO/LPS2249 - change acces to access
http://purl.access.gpo.gov/GPO.LPS2153 - needs / following /GPO
http://purl.access.gpo.gov/GPO/LPS - no LPS number
http://purl.access.gpo.gov/GPO/LPS/1309 - remove / following LPS
http://purl.access/gpo.gov/GPO/LPS2965 - replace / with . before gpo
http://purl.accesss.gpo.gov/GPO/LPS4150 - change accesss to access
http://purl.acess.gpo.gov/GPO/LPS1363 - change acess to access
http://purl.gpo.gov/GPO//LPS547 - change // to / following GPO. While this PURL will be recognized by the server, adding .access after the purl will correspond to the format generally used by GPO
9. Visually inspecting domains and URLs will often turn up malformed URLs. Example: missing _ characters are easily spotted if other URLs from same domain/path include _s in their filenames.
10. Periods at the end of a URL should be removed.
11. Spaces anywhere in a URL should be removed (unless space is in #-extension segment)
12. If the URL does not have a query section (usually indicated by the presence of a ? in the URL) it generally attempts to look for a specific filename in the form filename.htm or filename.html or filename.txt or filename.pdf etc.
13. If the URL has only domain information, or only pathname information following the domain, the URL should end in a /; the assumed filename is index.html.
If, as in the following example, there is no terminating / the URL will be reported as a redirect error. To avoid this type of error with a link checker, add the / before checking.
Example: http://www.epa.gov/greenchemistry - needs terminating / to avoid error
14. Extraneous text fragments should be removed from URL field. Such fragments are often note information without subfielding.
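Several of the clean-up tips above are mechanical enough to script outside MS Access. The Python sketch below covers tips 1-3, 8, 10, 11, and 13 under stated assumptions. The function names are hypothetical; the PURL pattern is inferred from the examples in tip 8 and is not an official GPO rule; and the trailing-slash test is a rough heuristic that only flags URLs for review.

```python
import re
from urllib.parse import urlsplit

ESCAPES = {"%5F": "_", "%7E": "~", "%3A": ":"}
# Shape shared by the corrected GPO PURLs in tip 8 (an inferred pattern).
GPO_PURL = re.compile(r"^http://purl\.access\.gpo\.gov/GPO/LPS\d+$")

def clean_url(url):
    """Apply tips 1-3, 10, and 11 to one URL and return the cleaned text."""
    url = url.strip()
    for escaped, plain in ESCAPES.items():
        # Case is ignored, matching the unchecked Match Case box in the tips.
        url = re.sub(re.escape(escaped), plain, url, flags=re.IGNORECASE)
    base, sep, extension = url.partition("#")
    # Spaces are removed everywhere except inside the "#" extension.
    base = base.replace(" ", "").rstrip(".")
    return base + sep + extension

def is_gpo_purl(url):
    """True when the URL has the usual GPO PURL shape."""
    return GPO_PURL.match(url) is not None

def missing_trailing_slash(url):
    """Flag URLs that name only a domain or directory but lack the final "/"."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    if parts.netloc.startswith("purl.") or parts.query or url.endswith("/"):
        return False    # PURLs and query URLs are left alone (assumption)
    last_segment = parts.path.rsplit("/", 1)[-1]
    return "." not in last_segment   # no filename such as index.html

if __name__ == "__main__":
    print(clean_url("http://www.example.org/%7Euser/my%5Ffile.htm."))
    print(is_gpo_purl("http://purl.acces.gpo.gov/GPO/LPS2249"))
    print(missing_trailing_slash("http://www.epa.gov/greenchemistry"))
```

Results from missing_trailing_slash should be reviewed by eye before any URL is changed, since a final path segment without a period is not always a directory.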
Thomas G. Tyler
Associate Director for Budget & Technical Planning
University of Denver Library
303-871-3334 (w)
303-871-3446 (fax)
Email: ttyler@du.edu
Web: http://www.du.edu/~ttyler/freeware/marcxgen.htm