Class Two: Information Structures
Database Fields: Definition (MIT Libraries)
A field is a container, a place where information is stored. That container may have various rules: dates only, numbers only, 4-digit year date only, control number only, any text, any text up to 255 characters, any text following strict entry rules, etc. Also, fields may iterate in some cases. There may be multiple authors, so databases will need to deal with this. Some databases may dump all author data into a single field. A better structure would be to put each author into a separate (iterating) field. Subjects also iterate.
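The ideas of entry rules and iterating fields can be sketched in a few lines of Python. This is only an illustration, not how any particular database implements fields: the `FieldRule` and `Record` classes, the field names, and the patterns are all hypothetical.

```python
from dataclasses import dataclass, field
import re


@dataclass
class FieldRule:
    """Hypothetical field definition: a name, an entry rule, and a repeat flag."""
    name: str
    pattern: str              # regular expression each value must match in full
    repeatable: bool = False  # True for iterating fields


@dataclass
class Record:
    rules: dict
    values: dict = field(default_factory=dict)

    def add(self, name, value):
        rule = self.rules[name]
        # Reject values that break the field's entry rule.
        if not re.fullmatch(rule.pattern, value):
            raise ValueError(f"{value!r} breaks the entry rule for {name}")
        existing = self.values.setdefault(name, [])
        # Reject a second value in a field that does not iterate.
        if existing and not rule.repeatable:
            raise ValueError(f"{name} does not iterate")
        existing.append(value)


RULES = {
    "year": FieldRule("year", r"\d{4}"),                          # 4-digit year only
    "title": FieldRule("title", r".{1,255}"),                     # any text up to 255 characters
    "author": FieldRule("author", r".{1,255}", repeatable=True),  # iterating field
}

record = Record(RULES)
record.add("year", "1998")
record.add("author", "Smith, Jane")
record.add("author", "Jones, Robert")
print(record.values["author"])
```

Because each author sits in its own occurrence of the field, each name can be indexed and searched separately, which is the advantage over dumping all author data into a single field.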
MARC 21 Concise Format for Bibliographic Data
OCLC Bibliographic Formats and Standards (3rd ed.)
Fixed Fields: OCLC
Fixed fields, by definition, contain fixed data in specific formats for the purpose of economy of record size. These letters and numerals contain volumes of information about the bibliographic record. Take a look at the 3-letter MARC Language Codes and you can see how a lot of information can be encoded in three letters.
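As a rough sketch of why fixed data is economical, the example below packs four values into nine characters and recovers them by position alone. The layout is invented for illustration and is not the actual OCLC or MARC fixed-field layout.

```python
# Hypothetical layout: (start, end) offsets within a 9-character string.
LAYOUT = {
    "year": (0, 4),       # 4-digit publication year
    "language": (4, 7),   # 3-letter language code
    "frequency": (7, 8),  # 1-letter frequency code
    "genre": (8, 9),      # 1-letter genre code
}


def parse_fixed(data):
    """Slice a fixed-length string into named values by position."""
    expected = max(end for _, end in LAYOUT.values())
    if len(data) != expected:
        raise ValueError(f"expected {expected} characters, got {len(data)}")
    return {name: data[start:end] for name, (start, end) in LAYOUT.items()}


print(parse_fixed("1998engqr"))
```

No labels or delimiters are stored; position carries the meaning, which is why the data must always appear in exactly the same format.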
Variable-length Fields: Webopedia
Iterating Fields - why would you ever want a field more than once?
Non-iterating fields - why would you ever NOT want a field more than once?
The MARC record is an ingenious invention from the 1960s that fakes a relational database structure.
Misc. Database Issues
- Data loads (dumps) - dirty data. Examples of bad data loads from EbscoHost: Art Abstracts | Library Literature (they fixed these problems after I reported them twice over the course of a year).
- OCR problems: Take a look at this search from Google Patents.
- Mapping / Conversion differences. Take a look at the same Medline record from three different vendors: PubMed [live PubMed record is here], OCLC FirstSearch, CSA. Here is an explanation of PubMed fields.
- Historic database issues. Look at these differing records for the same document from the Serial Set: LexisNexis Congressional | Readex Digital Serial Set. The LN Congressional records differ because indexing was done from title lists, not from examining the piece itself. See this article for further discussion.
Ordering of Indexed Information
Alphabetical Order: Alpha by author, alpha by title, etc.
Chronological Order: Peak keyword default
Classified Order: i.e. by call number (Dewey, Library of Congress, Superintendent of Documents)
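The three orderings above amount to sorting the same records on different keys. A minimal sketch follows; the records and Dewey numbers are made-up sample values, and showing the newest record first in chronological order is an assumption about what a database defaults to.

```python
# Made-up sample records, for illustration only.
records = [
    {"author": "Carter, Lee", "title": "Bird Songs", "year": 1995, "dewey": "598.1594"},
    {"author": "Adams, Ann", "title": "Coal Towns", "year": 2010, "dewey": "338.2724"},
    {"author": "Baker, Tom", "title": "Atlas of Rivers", "year": 1987, "dewey": "551.483"},
]

# Alphabetical order: by author, or by title.
by_author = sorted(records, key=lambda r: r["author"].lower())
by_title = sorted(records, key=lambda r: r["title"].lower())

# Chronological order: newest first (assumed default).
by_date = sorted(records, key=lambda r: r["year"], reverse=True)

# Classified order: by call number. Treating a bare Dewey number as a decimal is a
# simplification; cutters, LC, and SuDocs numbers need dedicated sort keys.
by_class = sorted(records, key=lambda r: float(r["dewey"]))

for label, ordering in [("author", by_author), ("date", by_date), ("class", by_class)]:
    print(label, [r["title"] for r in ordering])
```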
What Gets Indexed?
Books
Journal Articles
Book Reviews
Essays
Poetry
Short Stories
Music
Patents
Conference Papers
How is an index different from a catalog?
Types of Indexes
Classified Indexes: EconLit; MLA International Bibliography; UNCRD Publications (bibliography and index I created)
Cumulative Indexes: Not relevant in online world, but important in print world
Monthly catalogue, United States public documents (note that this record has a "cumulative index note" that says: "Subject index, 1900-1971. (Includes index to former and later titles.) 15 v.")
Concordances: "An alphabetical arrangement of the principal words contained in a book, with citations of the passages in which they occur." - OED
The Harvard concordance to Shakespeare
The New Strong's exhaustive concordance of the Bible
A critical Greek and English concordance of the New Testament (online)
First-line, Last-line Indexes: Columbia Granger's poetry indexes index first and last lines of poetry. Example of an online first-line index.
String Indexes: From the early days of computers. A KWIC index is a type of string index. KWIC stands for key word in context. See Wikipedia entry.
An abstract differs from an annotation and an executive summary.
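A KWIC index is simple enough to sketch directly. The version below is a minimal illustration rather than a reconstruction of any historical program: the stop list, the column width, and the sample titles are all assumptions, and punctuation attached to words is not handled.

```python
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # assumed stop list


def kwic(titles, width=30):
    """Return KWIC lines: each non-stopword becomes a keyword shown in its context."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOPWORDS:
                continue
            left = " ".join(words[:i])        # context before the keyword
            right = " ".join(words[i + 1:])   # context after the keyword
            entries.append((word.lower(), left, word, right))
    # File the entries alphabetically by keyword.
    entries.sort(key=lambda e: e[0])
    return [f"{left[-width:]:>{width}}  {word.upper()}  {right[:width]}"
            for _, left, word, right in entries]


titles = ["The Organization of Information", "Information Retrieval Today"]
for line in kwic(titles):
    print(line)
```

Each title appears once for every significant word it contains, with the keywords aligned in a central column so the eye can scan down them.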
Descriptive Abstracts
Informative Abstracts
Critical Abstracts
Author Abstracts: ex. Dissertation Abstracts
a) examine ERIC (CSA) records. Which fields are likely variable-length fields, and which are likely fixed fields? How is fixed field information searched differently from variable-length fields?
b) same as above, but with WorldCat.
Librarianship is all about standards. Standards are what make data principled and information findable. Without standards, every database would be doing its own thing. To some degree they already are, but it would be far worse.
Some databases follow different standards, which makes searching difficult and retrieval challenging. An example of this is LexisNexis Congressional. Congressional hearings are cataloged in library catalogs using AACR2R rules, but LexisNexis uses its own rules when creating its index. For example, when “United States” appears in a hearing title, LexisNexis enters “U.S.” When “Fiscal Year 1998” appears, it enters “FY98”. But when “Fiscal Year 2002” appears, it uses “FY2002”.
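One practical response to this kind of inconsistency is to generate the likely variants of a title before searching. The sketch below uses only the substitutions mentioned above. The rule that decides between a 2-digit and a 4-digit fiscal year is not known here, so the hypothetical `title_variants` function simply produces both, and the sample title is invented.

```python
import re


def title_variants(title):
    """Return the title plus abbreviated forms a searcher might also try."""
    abbreviated = title.replace("United States", "U.S.")
    variants = {title, abbreviated}
    for text in (title, abbreviated):
        # Try both a 2-digit and a 4-digit fiscal year, since both forms occur.
        variants.add(re.sub(r"Fiscal Year (\d{2})(\d{2})", r"FY\2", text))
        variants.add(re.sub(r"Fiscal Year (\d{4})", r"FY\1", text))
    return sorted(variants)


for variant in title_variants("United States Budget, Fiscal Year 1998"):
    print(variant)
```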
So, how is all this information to be organized? Clearly we do not merely want to repeat the full article. We need something to stand in place of the article: a surrogate. This surrogate must perform two functions. It must describe the article (extent, language, features, article type) and it must capture the “aboutness”. In cataloging, these have traditionally been called “descriptive cataloging” and “subject cataloging”. To economize on data storage space, and to regularize the ways of referring to descriptive features and subject “aboutness”, codes are very often used. These codes occupy fixed fields. They may be letters or numbers that the user sees spelled out.
For example, an article published in English may have eng in a fixed field. A book review may have a code designating the fact that the article is a book review. Rather than using the full term “book review”, the underlying record may simply have the code for book review. This saves space in terms of data storage and regularizes the form of entry within that particular database.
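The relationship between the stored code and what the user sees can be sketched as a simple lookup. The code tables below are illustrative only; in particular, the document-type codes `br` and `ja` are invented, and each database defines its own.

```python
# Illustrative code tables; the document-type codes are hypothetical.
LANGUAGE_LABELS = {"eng": "English", "fre": "French", "ger": "German"}
DOCTYPE_LABELS = {"br": "Book review", "ja": "Journal article"}


def display(record):
    """Spell out the coded values for the user; unknown codes are shown as stored."""
    shown = dict(record)
    shown["language"] = LANGUAGE_LABELS.get(record["language"], record["language"])
    shown["doctype"] = DOCTYPE_LABELS.get(record["doctype"], record["doctype"])
    return shown


stored = {"title": "Review of a sample book", "language": "eng", "doctype": "br"}
print(display(stored))
```

The stored record keeps only the short codes, while the display layer expands them, so every record in the database expresses "book review" in exactly the same way.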
Problems arise, however, in that there is little regularization from one database to another. In the world of cataloging books, a great degree of regularization exists. Catalogs generally follow MARC standards. But in the article indexing world, standards vary for a variety of reasons.
1. Databases are developed by different entities. The American Psychological Association develops the PsycInfo database, the Modern Language Association is responsible for the MLA International Bibliography, and the National Library of Medicine produces PubMed (Medline).
2. The underlying controlled vocabularies differ. Book cataloging, at least in the United States, tends to use pre-coordinated Library of Congress subject headings. But each database follows its own nomenclature, generally some kind of subject thesaurus.
3. The structure of the underlying records varies greatly. The scope of some databases is very broad, covering book reviews, dissertations or theses, scholarly articles, and popular articles, to name just a few possibilities. Other databases are narrower in scope, perhaps covering only scholarly articles.
Another consideration is that some databases are transforming a print product into an electronic one, while others are purely “born digital.” Products that were originally published in print formats often contain less developed information complexity, since their final output was to have been in print, where economy of space was important. Wilson products such as Readers' Guide to Periodical Literature and Social Sciences Index were issued for years in print with few subject headings and no abstracts. In recent years abstracts have been added, but the scant application of subject headings remains.
On the other hand, a born-digital database such as Readex's United States Congressional Serial Set does not need to economize on the application of subject descriptors. Dozens of descriptors can be applied without the worry of having to replicate the citation in dozens of places throughout a print index.
The importance of knowing about data structures cannot be overestimated. Successful searching is based upon knowledge of how information is stored, in what kinds of fields it is stored, and what the characteristics of those fields are.
Let's examine some uses of fixed fielded information.
Language information. AACR2 rules prescribe codes for encoding languages. These are used when cataloging books. Databases may or may not use the same list of codes, but may use another similar coding structure.
Date information. Dates generally live in a date field. Articles are simpler to catalog than books, since books may be multi-volume sets issued over a period of years. Articles are generally issued on a single date. However, date fields may include not only a year but also an issue time (fall, January, etc.).
Frequency. Quarterly, annually, weekly
Genre information.
Coverage period. To some databases, such as those covering history, coverage dates are extremely important. This includes the time period covered by the article. An article may cover the assassination of Abraham Lincoln (1865), it may cover the American Civil War (1861-1865), or it may cover the 1860s as a decade.
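Searching by coverage period comes down to testing whether two date ranges overlap. A minimal sketch, using the examples above as sample data (the function name and the tuple representation of a period are assumptions):

```python
def covers(coverage, query):
    """True when two inclusive (start_year, end_year) ranges overlap."""
    return coverage[0] <= query[1] and query[0] <= coverage[1]


# Coverage periods taken from the examples in the text.
articles = {
    "Assassination of Abraham Lincoln": (1865, 1865),
    "American Civil War": (1861, 1865),
    "The 1860s": (1860, 1869),
}

# Which articles cover any part of 1862-1864?
hits = [title for title, span in articles.items() if covers(span, (1862, 1864))]
print(hits)
```

Note that the coverage period is independent of the publication date: an article published in 1998 can still be retrieved by a search on the 1860s.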
Not all fields can economize on space as fixed fields do. Variable-length fields accommodate longer values that vary considerably from record to record. Fields such as author, title, and abstract can contain long values.
In the MARC21 world there must be one title (field 245), but other titles are possible as well: series title, uniform title, and some titles that occur with various kinds of books as physical objects: half-title, running title, spine title, cover title, etc. These are handled in various ways, but generally are included in the “title index”.
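How several title fields end up in one "title index" can be sketched as follows. Only field 245 is named in the text above; the other tags (240 for uniform title, 246 for varying forms of title, 490 for series statement) are my reading of MARC 21 and should be checked against the format documentation. The record itself is a simplified stand-in, not real MARC.

```python
# Assumed set of title-bearing tags; only 245 comes from the text above.
TITLE_TAGS = {"245", "240", "246", "490"}


def build_title_index(records):
    """Map each word found in any title field to the ids of records containing it."""
    index = {}
    for record_id, fields in records.items():
        for tag, value in fields:
            if tag in TITLE_TAGS:
                for word in value.lower().split():
                    index.setdefault(word, set()).add(record_id)
    return index


# Simplified sample record: a list of (tag, value) pairs.
records = {
    "rec1": [
        ("100", "Melville, Herman"),
        ("245", "Moby Dick"),
        ("246", "The Whale"),
    ],
}

index = build_title_index(records)
print(sorted(index))
```

A title search for "whale" finds the record even though that word appears only in the variant title, while the author's name stays out of the title index.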
Before the advent of the electronic index, access points for library materials were handled in the card catalog. Every distinct access point would require another card to be filed in the respective alphabetical location. Here is where the notion of “main entry” became most important. A cataloger would establish one main entry (usually the author, first author, or book title). The rules were quite complex, but were absolutely necessary.
In the electronic world, the number of access points need not be limited, since neither the economy of card filing nor the extent of a print publication is at issue. The cataloger or indexer can assign as many access points as are deemed necessary.