Notes for Class 4

LIS4361 - The Hidden Internet

The Hidden Internet

Databases previously available on CD-ROM and other formats are migrating to the Web. This is true for commercial databases as well as for those freely available.

Databases most often return search results in the form of Web pages -- pages that are dynamically generated "on the fly". That is, they did not exist until called into existence in response to a query. The data existed in the database, but the Web page itself did not.
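
To make "on the fly" concrete, here is a minimal sketch of how a results page might be built from a query. The in-memory table, the sample records, and the function name render_results are assumptions made up for illustration; no real catalog is claimed to work exactly this way.

```python
import sqlite3
from html import escape


def render_results(conn, term):
    """Build an HTML results page for `term`.

    The page text does not exist anywhere until this function is called;
    only the rows in the database exist beforehand.
    """
    rows = conn.execute(
        "SELECT title, author FROM books WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(t)} / {escape(a)}</li>" for t, a in rows)
    return (
        f"<html><body><h1>Results for {escape(term)}</h1>"
        f"<ul>{items}</ul></body></html>"
    )


# Hypothetical in-memory catalog standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("The Color Purple", "Alice Walker"), ("Beloved", "Toni Morrison")],
)

page = render_results(conn, "Purple")
print(page)
```

A crawler that only follows links never submits the query, so it never sees a page like this one.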

The key to finding materials in these databases is an indirect search strategy. Rather than searching a Web search engine for direct answers, you can search the Web search engine indirectly -- looking for databases that would likely contain the kinds of information you are looking for.

An excellent way to illustrate this kind of searching is to examine online public access catalogs (OPACs) of libraries. OPACs usually provide access to hundreds of thousands of records, none of which are searchable using Web search engines.

Library Catalogs are Generally Hidden

Library online catalogs (OPACs) are an excellent example of the hidden Internet. Generally, you cannot search library holdings using search engines. There are, however, indirect ways of searching some library holdings. We'll get to that later.

DU's Catalog: http://catalog.du.edu/ (III Millennium)

CU Boulder: http://libraries.colorado.edu/ (III Millennium)

University of Iowa: http://infohawk.uiowa.edu/uiowa (Aleph - Ex Libris) - Note the proper way to give a URL here.

Harvard University | http://hollisweb.harvard.edu/

Melvyl (Univ. of California System) | http://melvyl.worldcat.org/ - now using Worldcat.org

LibDex: The Library Index | http://www.libdex.com/ - Find library OPACs

Libweb | http://lists.webjunction.org/libweb/ - Find library Web sites

Examples of Other Hidden, Freely Available Databases:

Storm Events Database
http://www.ncdc.noaa.gov/stormevents/

Terraserver
http://www.terraserver.com/

Art & Architecture Thesaurus
http://www.getty.edu/research/tools/vocabularies/aat/

Getty Thesaurus of Geographic Names Online
http://www.getty.edu/research/tools/vocabularies/tgn/

Library of Congress Thesauri (LIV and TGM II)
http://www.loc.gov/lexico/servlet/lexico

Union List of Artist Names Online (ULAN)
http://www.getty.edu/research/tools/vocabularies/ulan/
Google cannot see these records.

EconPapers
http://econpapers.repec.org/
Previously, this database was not searchable by search engines. Note how the URLs generated by a search within EconPapers differ from those generated by a search engine. You will need to "lift out of frames" within the database to derive the stable URL.

Handbook of Latin American Studies (HLAS)
http://lcweb2.loc.gov/hlas/
Use Google to test whether this database is open to search engines.

TRIS Online
http://ntlsearch.bts.gov/ - Another example of how to give the shortest URL, not the resultant URL.

The European Library
http://www.theeuropeanlibrary.org/ - Another example of how to give the shortest URL, not the resultant URL.
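
The "shortest URL, not the resultant URL" idea can be sketched in a few lines. This is only an illustration: the function name shortest_url, the keep_segments parameter, and the long result URLs are hypothetical, and the sketch assumes the database's entry point is the site root or a short leading path.

```python
from urllib.parse import urlsplit, urlunsplit


def shortest_url(resultant_url, keep_segments=0):
    """Trim a long result URL down to a short entry-point URL.

    Drops the query string and fragment, and keeps only the first
    `keep_segments` path segments (0 means the site root).
    """
    parts = urlsplit(resultant_url)
    segments = [s for s in parts.path.split("/") if s][:keep_segments]
    path = "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Hypothetical resultant URLs of the kind a database search might produce.
long_tris = "http://ntlsearch.bts.gov/tris/results.do?q=bridges&page=2#top"
long_iowa = "http://infohawk.uiowa.edu/uiowa/search?term=x"

print(shortest_url(long_tris))
print(shortest_url(long_iowa, keep_segments=1))
```

Always check the shortened URL in a browser; some databases do not live at the root of their host.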

Hidden Internet Directories

Open Directory Project: http://dmoz.org/
"The largest, most comprehensive human-edited directory of the Web."

CompletePlanet: The Deep Web Directory: http://aip.completeplanet.com/
Some of the listed databases are truly hidden; others are not. Either way, it is a directory of databases.

Pandia Power Search: http://www.pandia.com/powersearch - Also, see their list of directories. These may contain some hidden Internet databases.

Searching the Hidden Internet

Use an indirect search strategy.

Question: What were the high and low temperatures in Denver on January 1, 1975?
Indirect Strategy: Look for a climate database.

Question: Does the Denver Public Library have The Color Purple?
Indirect Strategy: Look for the DPL online catalog.

How to Make the Hidden Internet Less Hidden

Search Engine Optimization - give search engines a site map or an algorithm that tells them how to crawl the site. This can make hidden Internet content findable. I will discuss this in class.
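
As a rough idea of what a site map looks like, the sketch below writes a small XML file in the sitemaps.org format, listing one URL per database record. The catalog.example.edu record URLs and the function name build_sitemap are assumptions for illustration only; a real site would list its own stable record URLs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML with one <url><loc>...</loc></url> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = u
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical stable URLs for three database records.
record_urls = [f"http://catalog.example.edu/record/{n}" for n in (1, 2, 3)]
sitemap_xml = build_sitemap(record_urls)
print(sitemap_xml)
```

With a list like this, a crawler can reach each record directly instead of having to fill in a search form.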

The Deep Web (SUNY Albany)
This paper was formerly at:
http://library.albany.edu/internet/deepweb.html
Now, it is recoverable from the Internet Archive (The Wayback Machine):
http://web.archive.org/web/*/http://library.albany.edu/internet/deepweb.html
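
The Wayback Machine address above follows a simple pattern: the archive prefix followed by the original URL. A small sketch, with wayback_listing_url as a made-up helper name:

```python
WAYBACK_PREFIX = "http://web.archive.org/web/*/"


def wayback_listing_url(original_url):
    """Return the Wayback Machine address that lists captures of a URL."""
    return WAYBACK_PREFIX + original_url


listing = wayback_listing_url("http://library.albany.edu/internet/deepweb.html")
print(listing)
```

The same pattern can be tried with any other address that has gone dead, although there is no guarantee that a capture exists.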

Can you find a "live" copy of this paper on the Internet today? What methodology would you use to find it?

Blog Entry:
http://liblogs.albany.edu/library20/2006/11/the_future_of_the_deep_web.html

Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Exercises:

1. Try to find the documents within this collection in a search engine: http://www.dol.gov/oasam/library/digital/main.htm

Can you create a link to these documents?

2. Determine whether a database is hidden from search engines. Test these databases to see whether they are "hidden Internet" or not:

NOAA Photo Library: http://www.photolib.noaa.gov/
Medical Subject Headings (MeSH): http://www.nlm.nih.gov/mesh/
TRIS Online: http://ntlsearch.bts.gov/ - Another example of how to give the shortest URL, not the resultant URL.
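
One possible way to run this test is to take a distinctive phrase from a record you found inside the database and search for it with the site: operator limited to the database's host. The sketch below only builds the query; the helper names and the sample phrase are assumptions, and an empty result set is a hint rather than proof that the database is hidden.

```python
from urllib.parse import urlencode, urlsplit


def site_query(database_url, phrase):
    """Build a search-engine query limited to the database's host."""
    host = urlsplit(database_url).netloc
    return f'site:{host} "{phrase}"'


def google_search_url(query):
    """Build a Google search address for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# "lightning" is a placeholder; use a phrase from a record you actually found.
query = site_query("http://www.photolib.noaa.gov/", "lightning")
search_url = google_search_url(query)
print(query)
print(search_url)
```

If the phrase appears in the database but the search engine returns nothing for the site-limited query, the record is probably not indexed.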

3. Recover "lost" Internet content.

Guidance for highly qualified teachers in Colorado : section 1119, the No child left behind act of 2001
formerly at: http://www.cde.state.co.us/cdeunified/download/tiia%5Fhqtguidance020205.pdf