Class Two: Information Structures

1. Review of LIS4010 and maybe beyond....

Database Fields: Definition (MIT Libraries)

A field is a container, a place where information is stored. That container may have various rules: dates only, numbers only, 4-digit year date only, control number only, any text, any text up to 255 characters, any text following strict entry rules, etc. Also, fields may iterate in some cases. There may be multiple authors, so databases will need to deal with this. Some databases may dump all author data into a single field. A better structure would be to put each author into a separate (iterating) field. Subjects also iterate.
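As a minimal sketch of why an iterating field is the better structure, the following Python compares the two approaches. The record layout, field names, and author names are hypothetical and are not taken from any particular database.

```python
# Hypothetical records; field names and values are illustrative only.

# Structure 1: all author data dumped into a single text field.
dumped = {
    "title": "Example Book",
    "author": "Smith, Jane; Lee, Min; Garcia, Ana",
}

# Structure 2: the author field iterates (one value per occurrence).
iterating = {
    "title": "Example Book",
    "author": ["Smith, Jane", "Lee, Min", "Garcia, Ana"],
}


def has_author_dumped(record, name):
    # Only substring matching is possible, so partial names also match.
    return name in record["author"]


def has_author_iterating(record, name):
    # Each occurrence is compared as a whole value.
    return name in record["author"]


print(has_author_dumped(dumped, "Lee, Min"))        # True
print(has_author_dumped(dumped, "Lee, Mi"))         # True (a false hit)
print(has_author_iterating(iterating, "Lee, Min"))  # True
print(has_author_iterating(iterating, "Lee, Mi"))   # False
```

With the iterating structure, each author can also be indexed, sorted, and counted separately, which the single dumped field does not allow without further parsing.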

MARC 21 Concise Format for Bibliographic Data

OCLC Bibliographic Formats and Standards (3rd ed.)

Fixed Fields: OCLC

Fixed fields, by definition, contain fixed data in specific formats for the purpose of economy of record size. These letters and numerals contain volumes of information about the bibliographic record. Take a look at the 3-letter MARC Language Codes and you can see how a lot of information can be encoded in three letters.
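To make the idea of position-based coding concrete, here is a rough Python sketch that reads a few values out of a MARC 008 fixed field. The positions used (07-10 for Date 1, 15-17 for place of publication, 35-37 for language) follow the MARC 21 definition of field 008; the sample value and the abbreviated language table are invented for illustration.

```python
# Build a 40-character sample value piece by piece so the positions are explicit.
# The sample data is invented for illustration.
field_008 = (
    "850101"    # 00-05: date entered on file (yymmdd)
    + "s"       # 06: type of date (s = single known date)
    + "1985"    # 07-10: Date 1
    + "    "    # 11-14: Date 2 (blank)
    + "nyu"     # 15-17: place of publication code
    + " " * 17  # 18-34: material-specific positions (left blank here)
    + "eng"     # 35-37: language code
    + " "       # 38: modified record
    + "d"       # 39: cataloging source
)

# A tiny excerpt of a language code table; the real list is much longer.
LANGUAGES = {"eng": "English", "fre": "French", "ger": "German", "spa": "Spanish"}


def parse_008(value):
    """Pull a few coded values out of an 008 field by character position."""
    if len(value) != 40:
        raise ValueError("an 008 field must be exactly 40 characters")
    code = value[35:38]
    return {
        "date1": value[7:11],
        "place": value[15:18].strip(),
        "language_code": code,
        "language": LANGUAGES.get(code, "unknown"),
    }


print(parse_008(field_008))
```

Nothing in the field labels the values; meaning comes entirely from position, which is what keeps the record small.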

Variable-length Fields: Webopedia

Iterating Fields - why would you ever want a field more than once?

Non-iterating fields - why would you ever NOT want a field more than once?

The MARC record is an ingenious invention from the 1960s that fakes a relational database structure.

Misc. Database Issues

Data loads (dumps) - dirty data. Examples of bad data loads from EbscoHost: Art Abstracts | Library Literature (these problems were fixed after I reported them twice over the course of a year).

OCR problems: Take a look at this search from Google Patents.

Mapping / Conversion differences. Take a look at the same Medline record from three different vendors: PubMed [live PubMed record is here], OCLC FirstSearch, CSA. Here is an explanation of PubMed fields.

Historic database issues. Look at these differing records for the same document from the Serial Set: LexisNexis Congressional | Readex Digital Serial Set. The LN Congressional records differ because indexing was done from title lists, not from examining the piece itself. See this article for further discussion.

2. Indexes

Ordering of Indexed Information

Alphabetical Order: Alpha by author, alpha by title, etc.

Chronological Order: Peak keyword default

Classified Order: i.e. by call number (Dewey, Library of Congress, Superintendent of Documents)
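The three orderings above can be sketched as three different sort keys over the same set of records. The records below are invented, and the classified sort assumes Dewey numbers written with a three-digit class before the decimal point, so that plain string comparison gives shelf order; Library of Congress and SuDocs numbers would need more careful normalization.

```python
# Invented sample records for illustration.
records = [
    {"author": "Clark, R.", "title": "Indexing Basics", "year": 1998, "dewey": "025.3"},
    {"author": "Adams, P.", "title": "Search Strategies", "year": 2010, "dewey": "025.04"},
    {"author": "Baker, L.", "title": "Abstracting Today", "year": 2004, "dewey": "025.4"},
]

# Alphabetical order: by author, or by title.
by_author = sorted(records, key=lambda r: r["author"].lower())
by_title = sorted(records, key=lambda r: r["title"].lower())

# Chronological order: newest first here; oldest first is equally possible.
by_date = sorted(records, key=lambda r: r["year"], reverse=True)

# Classified order: by call number (assumes zero-padded Dewey numbers).
by_class = sorted(records, key=lambda r: r["dewey"])

for label, ordering in [
    ("author", by_author),
    ("title", by_title),
    ("date", by_date),
    ("class", by_class),
]:
    print(label, [r["title"] for r in ordering])
```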

What Gets Indexed?

Books

Journal Articles

Book Reviews

Essays

Poetry

Short Stories

Music

Patents

Conference Papers

How is an index different from a catalog?

Types of Indexes

Classified Indexes: EconLit; MLA International Bibliography; UNCRD Publications (bibliography and index I created)

Cumulative Indexes: Not relevant in online world, but important in print world

Monthly catalogue, United States public documents (note that this record has a "cumulative index note" that says: "Subject index, 1900-1971. (Includes index to former and later titles.) 15 v.")

Concordances: "An alphabetical arrangement of the principal words contained in a book, with citations of the passages in which they occur." - OED

The Harvard concordance to Shakespeare

The New Strong's exhaustive concordance of the Bible

A critical Greek and English concordance of the New Testament (online)
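A concordance of the kind defined above can be sketched in a few lines of Python. This is only an illustration: the sample text is invented, and the stop-word list (the words treated as not "principal") is an assumption.

```python
import re
from collections import defaultdict

# Assumed list of words too common to count as "principal words".
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "or", "that", "the", "to"}


def build_concordance(lines):
    """Map each principal word to the numbers of the lines where it occurs."""
    concordance = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in STOP_WORDS:
                concordance[word].append(number)
    # Alphabetical arrangement, as in a printed concordance.
    return dict(sorted(concordance.items()))


text = [
    "The index points to the record.",
    "The record describes the article.",
    "An abstract summarizes the article.",
]

concordance = build_concordance(text)
for word, line_numbers in concordance.items():
    for n in line_numbers:
        # Cite the passage in which the word occurs.
        print(f"{word:<12} line {n}: {text[n - 1]}")
```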

First-line, Last-line Indexes: Columbia Granger's poetry indexes index first and last lines of poetry. Example of an online first-line index.

String Indexes: From the early days of computers. A KWIC index is a type of string index. KWIC stands for key word in context. See Wikipedia entry.
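A KWIC index can be sketched as follows: every significant word of a title becomes an entry, sorted alphabetically and shown with the words around it. The titles and the stop-word list are invented for illustration, and real KWIC indexes vary in layout.

```python
# Assumed list of words that do not get their own entry.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_entries(titles):
    """Return (sort key, left context, keyword, right context) tuples."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((word.lower(), left, word, right))
    # Sort by keyword only; ties keep the original title order.
    return sorted(entries, key=lambda entry: entry[0])


def format_kwic(entries, width=30):
    """Line the keywords up in a column, with context on either side."""
    lines = []
    for _, left, word, right in entries:
        lines.append(f"{left[-width:]:>{width}}  {word}  {right[:width]}")
    return lines


titles = ["Indexing in the Digital Library", "The Library Catalog"]
entries = kwic_entries(titles)
for line in format_kwic(entries):
    print(line)
```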

3. Abstracts

An abstract differs from an annotation and an executive summary.

Descriptive Abstracts

Informative Abstracts

Critical Abstracts

Author Abstracts: ex. Dissertation Abstracts

4. Default Fields

How do you know what fields are being searched by default? Let's take Academic Search Complete as an example. When users enter the "Basic Search" mode they are simply presented with a search box and no choice of fields. When they enter "Advanced Search" mode they see a search box followed by a pull-down menu from which they can select which field to search. But the default selection says "Select a Field (optional)." In other words, they don't have to select a field; the system will select fields for them (the default fields). The question is: what fields are being searched?
Here are the available fields from the pull-down menu:

TX All Text
AU Author
TI Title
SU Subject Terms
AB Abstract or Author-Supplied Abstract
KW Author-Supplied Keywords
GE Geographic Terms
PE People
PS Reviews & Products
CO Company Entity
IC NAICS Code or Description
DN DUNS Number
TK Ticker Symbol
SO Journal Name
IS ISSN (No Dashes)
IB ISBN
AN Accession Number

How can we make diagnostic tests that prove whether a particular field is in the set of default search fields? Here is how:
1. Frame a search with a query that isolates a single record that contains a field you want to test. For example, say you want to test whether the DUNS Number is in the default set of fields. This is a number that identifies a business; thus you need to find a record about a business. The record with the title "Transnational Private Authority in Education Policy in Jordan and South Africa: The Case of Microsoft Corporation" is one such record. It has a DUNS number of 081466849.
2. Isolate the test record with a unique search. In the case above, searching by title is too imprecise. Best to search by accession number. In this case, the accession number is 74486309. So I stipulate in my search query that I am searching for 74486309 in Accession Number.
3. Now add to that search another search line (in Advanced Search mode) containing the DUNS number, 081466849, with no field selected, so that only the default fields are searched for it. The search works! Thus DUNS number is included in the set of default search fields, at least for Academic Search Complete. But now try the same test for Accession Number. Is it also in the set of default search fields?
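The logic of this diagnostic can be modeled with a toy search engine. The sketch below is not how Academic Search Complete works internally; the search functions, the second record, and the contents of DEFAULT_FIELDS are all hypothetical. Only the title, accession number, and DUNS number of the first record come from the example above.

```python
# Toy model of the diagnostic. The engine and its default field set are
# hypothetical; only the first record's values come from the example above.
RECORDS = [
    {
        "AN": "74486309",
        "TI": "Transnational Private Authority in Education Policy in Jordan "
              "and South Africa: The Case of Microsoft Corporation",
        "DN": "081466849",
    },
    {"AN": "10000001", "TI": "An unrelated record", "DN": ""},  # invented
]

# In a real database this set is what the searcher is trying to discover.
DEFAULT_FIELDS = {"TI", "AU", "SU", "AB", "DN"}


def matches(record, term, field=None):
    """Match a term in one named field, or in the default fields if none is given."""
    fields = [field] if field else DEFAULT_FIELDS
    return any(term.lower() in record.get(f, "").lower() for f in fields)


def search(lines):
    """Each line is (term, field); field=None means no field selected. Lines are ANDed."""
    return [r for r in RECORDS if all(matches(r, t, f) for t, f in lines)]


def in_default_fields(accession_number, test_value):
    # Line 1 isolates the test record by accession number.
    # Line 2 searches the test value with no field selected.
    hits = search([(accession_number, "AN"), (test_value, None)])
    return len(hits) == 1


print(in_default_fields("74486309", "081466849"))  # DUNS number test
print(in_default_fields("74486309", "74486309"))   # Accession Number test
```

In this toy model the first test succeeds and the second fails only because of how DEFAULT_FIELDS was set up; the point is the shape of the test, not the answer for any real database.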

In-Class Exercises

a) Examine ERIC (CSA) records. Which fields are likely variable-length fields, and which are likely fixed fields? How is fixed field information searched differently from variable-length fields?

b) Same as above, but with WorldCat.

Structures of Information

Librarianship is all about standards. Standards are what make data principled and information findable. Without standards every database would be doing its own thing. Well, they sort of are doing that now, but it would be infinitely worse.

Some databases follow different standards, making searching difficult and retrieval challenging. An example of this is LexisNexis Congressional. Congressional hearings are cataloged in library catalogs using AACR2R rules. But LexisNexis uses its own rules when creating its index. For example, when “United States” appears in a hearing title, LexisNexis enters “U.S.” When “Fiscal Year 1998” appears, they enter “FY98”. But when “Fiscal Year 2002” appears, they use “FY2002”.
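One way a searcher can cope with this kind of inconsistency is to try every plausible form of a title. The sketch below generates such variants from the substitutions described above. Because only two fiscal-year examples are known, it does not guess which abbreviation applies to which year and simply produces both; the sample title is invented.

```python
import re

FISCAL_YEAR = re.compile(r"Fiscal Year (\d{4})")


def search_variants(title):
    """Return the forms of a title worth trying when index conventions vary."""
    country_forms = {title, title.replace("United States", "U.S.")}
    variants = set()
    for form in country_forms:
        variants.add(form)
        # Four-digit style, as in "FY2002".
        variants.add(FISCAL_YEAR.sub(lambda m: "FY" + m.group(1), form))
        # Two-digit style, as in "FY98".
        variants.add(FISCAL_YEAR.sub(lambda m: "FY" + m.group(1)[2:], form))
    return sorted(variants)


# Invented sample title for illustration.
for variant in search_variants("United States Budget, Fiscal Year 1998"):
    print(variant)
```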

So, how is all this information to be organized? Clearly we do not merely want to repeat the full article. We need something to stand in place of the article: a surrogate. This surrogate must perform two functions. It must describe the article (extent, language, features, article type) and it must capture the “aboutness”. In cataloging, these have traditionally been called “descriptive cataloging” and “subject cataloging”. To economize on data storage “space” and to regularize the ways of referring to descriptive features and subject “aboutness”, codes are very often used. These codes occupy fixed fields. They may be letters or numbers that the user sees spelled out.

For example, an article published in English may have eng in a fixed field. A book review may have a code designating the fact that the article is a book review. Rather than using the full term “book review”, the underlying record may simply have the code for book review. This saves space in terms of data storage and regularizes the form of entry within that particular database.

Problems arise, however, in that there is little regularization from one database to another. In the world of cataloging books, a great degree of regularization exists. Catalogs generally follow MARC standards. But in the article indexing world, standards vary for a variety of reasons. 1. Databases are developed by different entities. The American Psychological Association develops the PsycInfo database, the Modern Language Association is responsible for the MLA International Bibliography, and the National Library of Medicine produces PubMed (Medline). 2. The underlying controlled vocabularies differ. Book cataloging, at least in the United States, tends to use pre-coordinated Library of Congress subject headings. But each database follows its own nomenclature, generally some kind of subject thesaurus. 3. The structure of the underlying records varies greatly. The scope of some databases is very broad, covering book reviews, dissertations or theses, scholarly articles, and popular articles, to name just a few possibilities. Other databases are narrower in scope, perhaps only covering scholarly articles.

Another consideration is that some databases are transforming a print product into an electronic one, while others are purely “born digital.” Products that were originally published in print formats often have less richly developed records, since their final output was to have been in print, where economy of space was important. Wilson products such as Readers' Guide to Periodical Literature and Social Sciences Index were issued for years in print with few subject headings and no abstracts. In recent years abstracts have been added, but the scant application of subject headings remains.

On the other hand, a born-digital database such as Readex's United States Congressional Serial Set does not need to economize on the application of subject descriptors. There can be dozens of descriptors applied without the worry of having to replicate the citation in dozens of places throughout a print index.

The importance of knowing about data structures cannot be overstated. Successful searching is based upon knowledge of how information is stored, what kinds of fields it is stored in, and the characteristics of those fields.

Fixed Fields

Let's examine some uses of fixed-field information.

Language information. AACR2 rules prescribe codes for encoding languages. These are used when cataloging books. Databases may or may not use the same list of codes; some use a similar coding structure of their own.

Date information. Dates generally live in a date field. Articles are simpler to catalog than books are, since books may be multi-volume sets issued over a period of years. Articles are generally issued on a single date. However, date fields may include not only a year but also an issue time (fall, January, etc.).

Frequency. Quarterly, annually, weekly

Genre information.

Coverage period. To some databases, such as those covering history, coverage dates are extremely important. This includes the time period covered by the article. An article may cover the assassination of Abraham Lincoln (1865), it may cover the American Civil War (1861-1865), or it may cover the 1860s as a decade.
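Searching by coverage period amounts to testing whether two date ranges overlap. The sketch below uses the three examples just mentioned; the record layout, with coverage stored as a (first year, last year) pair, is an assumption and not the structure of any particular database.

```python
# Coverage is stored as (first year, last year); this layout is an assumption.
ARTICLES = [
    {"title": "Article on the assassination of Abraham Lincoln", "coverage": (1865, 1865)},
    {"title": "Article on the American Civil War", "coverage": (1861, 1865)},
    {"title": "Article on the 1860s", "coverage": (1860, 1869)},
]


def covers(article, start, end=None):
    """True if the article's coverage period overlaps the years start..end."""
    end = start if end is None else end
    first, last = article["coverage"]
    return first <= end and start <= last


def search_by_period(start, end=None):
    return [a["title"] for a in ARTICLES if covers(a, start, end)]


print(search_by_period(1863))        # the Civil War and 1860s articles
print(search_by_period(1865))        # all three articles
print(search_by_period(1866, 1870))  # only the 1860s article
```

Note that coverage period is independent of publication date: an article published this year can still have a coverage period of 1861-1865.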

Variable Length Fields

Not all fields can economize on space as fixed fields do. Variable-length fields can accommodate longer values that vary considerably from record to record. Fields such as author, title, and abstract can contain long values.

In the MARC21 world there must be one title (field 245), but other titles are possible as well: series title, uniform title, and some titles that occur with various kinds of books as physical objects: half-title, running title, spine title, cover title, etc. These are handled in various ways, but generally are included in the “title index”.

Indexing before the Electronic Index

Before the advent of the electronic index, access points for library materials were handled in the card catalog. Every distinct access point would require another card to be filed in the respective alphabetical location. Here is where the notion of “main entry” became most important. A cataloger would establish one main entry (usually the author, first author, or book title). The rules were quite complex, but were absolutely necessary.

Indexing for the Electronic Index

In the electronic world, the number of access points created need not be a consideration, since neither economy of card filing nor the extent of a print publication is an issue. The cataloger or indexer can assign as many access points as are deemed necessary.
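The contrast between the two worlds can be sketched with a small index built in Python. In a card catalog, every access point counted below would mean one more card to type and file; in an electronic index it is just one more entry. The records and field names are invented for illustration.

```python
from collections import defaultdict

# Invented sample records.
RECORDS = {
    1: {
        "authors": ["Adams, P.", "Baker, L."],
        "title": "Search Strategies",
        "subjects": ["Information retrieval", "Databases"],
    },
    2: {
        "authors": ["Clark, R."],
        "title": "Indexing Basics",
        "subjects": ["Indexing", "Information retrieval"],
    },
}


def access_points(record):
    """Every author, the title, and every subject is an access point."""
    return record["authors"] + [record["title"]] + record["subjects"]


def build_index(records):
    """Map each access point to the ids of the records it leads to."""
    index = defaultdict(list)
    for record_id, record in records.items():
        for point in access_points(record):
            index[point].append(record_id)
    return dict(index)


index = build_index(RECORDS)

# In a card catalog, record 1 would need this many cards.
print(len(access_points(RECORDS[1])))

# One lookup retrieves every record sharing an access point.
print(index["Information retrieval"])
```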