Appendix E - Suggested Modification for Incorporation of Thesaurus in twinIsles' Cataloguing/Retrieval Process


As described in the main body of the report, the twinIsles cataloguing/retrieval process does not make use of a formal thesaurus. Instead, synonymous and hierarchical relationships are defined manually. Synonymous terms are represented by allocating them identical keyword IDs, and hierarchy is represented by including broader terms as additional keywords.

This approach was considered sufficient for the purposes of launching the prototype. However, if twinIsles is to grow further, some form of automated cataloguing system should be implemented along with a formal thesaurus to which it, and the search mechanism, may refer.

This appendix suggests a modified database incorporating a formal thesaurus and the cataloguing and search systems which refer to it.

The Thesaurus
The suggested structure permits homographs within the terms (e.g. glasses could be something to drink out of or an aid for the visually impaired). It also models the hierarchical, synonymous and associative relationships (see 6.6.1).

The initial thesaurus could be generated from the initial set of images, or could be adapted from an existing thesaurus, a number of which are publicly available e.g. the Library of Congress Thesaurus for Graphic Materials [61] (see below). In either case it will be necessary to periodically update the thesaurus to accommodate new items which may be added to the collection.

Fig E.1 Proposed structure of modified database

Table       Fields                    Notes
PHOTO       ID, description, locID    locID: Location ID
PHCON       phID, conID
CONCEPT     ID, termID                termID: ID of preferred term
TERM        ID, term                  A term may be a single word or a phrase.
ADDTERM     conID, termID             This table models the synonymous relationship.
HIERARCHY   conID, lft, rgt           This table allows hierarchy to be modelled. See Celko [14].
RELCON      mainConID, relConID       This table models the associative relationship.
LOCATION    ID, location              location: Full location for caption.
LOCPLACE    locID, placeID
PLACE       ID, placeName             placeName: Name of a single place.

Fig E.2 Proposed Tables and Fields
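
The tables in Fig E.2 could be created along the following lines. This is a minimal sketch using Python's sqlite3 module. Only the table and field names come from the figure; the column types, the primary and foreign keys, and the helper name create_database are assumptions made for illustration. The final lines show one way the structure could accommodate a homograph: a single term acting as the preferred term of two distinct concepts.

```python
import sqlite3

# Column types and key constraints are assumptions; only the table and
# field names are taken from Fig E.2.
SCHEMA = """
CREATE TABLE LOCATION  (ID INTEGER PRIMARY KEY, location TEXT NOT NULL);
CREATE TABLE PLACE     (ID INTEGER PRIMARY KEY, placeName TEXT NOT NULL);
CREATE TABLE LOCPLACE  (locID INTEGER REFERENCES LOCATION(ID),
                        placeID INTEGER REFERENCES PLACE(ID));
CREATE TABLE PHOTO     (ID TEXT PRIMARY KEY, description TEXT,
                        locID INTEGER REFERENCES LOCATION(ID));
CREATE TABLE TERM      (ID INTEGER PRIMARY KEY, term TEXT NOT NULL);
CREATE TABLE CONCEPT   (ID INTEGER PRIMARY KEY,
                        termID INTEGER REFERENCES TERM(ID));
CREATE TABLE ADDTERM   (conID INTEGER REFERENCES CONCEPT(ID),
                        termID INTEGER REFERENCES TERM(ID));
CREATE TABLE HIERARCHY (conID INTEGER REFERENCES CONCEPT(ID),
                        lft INTEGER, rgt INTEGER);
CREATE TABLE RELCON    (mainConID INTEGER REFERENCES CONCEPT(ID),
                        relConID INTEGER REFERENCES CONCEPT(ID));
CREATE TABLE PHCON     (phID TEXT REFERENCES PHOTO(ID),
                        conID INTEGER REFERENCES CONCEPT(ID));
"""


def create_database(path=":memory:"):
    """Create an empty database with the proposed tables (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()

# Homograph: one term, "glasses", is the preferred term of two concepts
# (drinking vessels and spectacles). The IDs are made up for the example.
conn.execute("INSERT INTO TERM (ID, term) VALUES (1, 'glasses')")
conn.executemany("INSERT INTO CONCEPT (ID, termID) VALUES (?, ?)", [(1, 1), (2, 1)])

senses = [row[0] for row in conn.execute(
    "SELECT CONCEPT.ID FROM CONCEPT JOIN TERM ON CONCEPT.termID = TERM.ID "
    "WHERE TERM.term = 'glasses' ORDER BY CONCEPT.ID")]
print(senses)  # [1, 2]
```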


A distinction is made between concepts (underlying things being represented) and the terms (keywords and phrases) used to represent them, an approach inspired by Cross et al [17]. Photographs are indexed by concept, rather than term. This implicitly associates a photograph with all the terms describing its associated concept(s) as well as all those describing hierarchically related child concepts.

It was noted during the cataloguing of images for the prototype that some descriptors consist of more than one word e.g. "bullet train", "new year" etc. It is necessary to store these as phrases within the database, thus the indexing system must provide a mechanism for indicating such phrases, e.g. by enclosing them in double quotes.
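
The double-quote convention could be recognised with a small tokenizer such as the one sketched here. The function name split_terms and the exact definition of a "word" (letters, digits, underscores and hyphens) are assumptions, not part of the prototype.

```python
import re

# Either a double-quoted phrase (group 1) or a single unquoted word (group 2).
TOKEN = re.compile(r'"([^"]+)"|([\w-]+)')


def split_terms(text):
    """Return lower-cased terms, keeping each double-quoted span as one phrase."""
    terms = []
    for phrase, word in TOKEN.findall(text):
        term = phrase if phrase else word
        # Collapse any repeated whitespace inside a quoted phrase.
        terms.append(" ".join(term.lower().split()))
    return terms


print(split_terms('"bullet train" leaving Tokyo at "new year"'))
# ['bullet train', 'leaving', 'tokyo', 'at', 'new year']
```

An unmatched quote is simply skipped, so the words after it are treated as individual terms.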

The search mechanism must therefore also identify phrases. There are two means of implementing this requirement:

  • Every search string could be parsed into single words, 2-word phrases, 3-word phrases … n-word phrase (= entire search string). This would result in 0.5(n^2+n) search terms arising from an n-word query. This places the burden of effort onto the computer system.
  • The user could be requested to indicate phrases e.g. by placing them in double quotes (as is the convention on the leading search engines). This places the burden of effort onto the user. Provided images were also indexed under the individual words forming any associated phrases a reasonably satisfactory result set could be expected even where the user neglected to identify phrases.
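
The first option can be illustrated by generating every contiguous run of words in the query. The sketch below (the function name candidate_phrases is hypothetical) confirms the count of 0.5(n^2+n) for a three-word query.

```python
def candidate_phrases(query):
    """Return every contiguous run of words in the query, shortest start first."""
    words = query.split()
    n = len(words)
    return [" ".join(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


phrases = candidate_phrases("tokyo bullet train")
print(phrases)
# ['tokyo', 'tokyo bullet', 'tokyo bullet train', 'bullet', 'bullet train', 'train']

n = 3
print(len(phrases) == (n * n + n) // 2)  # True: 0.5(n^2+n) = 6 for n = 3
```

The quadratic growth is modest for typical short queries (a 5-word query yields 15 candidate terms), but each candidate still has to be looked up in the thesaurus.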

Note that a separate HIERARCHY table is required to model the fact that concepts may appear in multiple hierarchies.
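
The lft and rgt fields suggest the nested-set representation described by Celko [14], in which a concept's narrower concepts are those whose (lft, rgt) interval lies inside its own. The sketch below assumes that all hierarchies share a single numbering space and that a concept appearing in two hierarchies simply has two rows. The concept IDs and the function name narrower_concepts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HIERARCHY (conID INTEGER, lft INTEGER, rgt INTEGER)")

# Hypothetical concept IDs: 1 pets, 2 dogs, 3 goldfish, 4 budgies, 5 fish.
# Goldfish (3) has two rows because it sits under both pets and fish.
conn.executemany("INSERT INTO HIERARCHY VALUES (?, ?, ?)", [
    (1, 1, 8), (2, 2, 3), (3, 4, 5), (4, 6, 7),  # pets > dogs, goldfish, budgies
    (5, 9, 12), (3, 10, 11),                     # fish > goldfish
])


def narrower_concepts(conn, con_id):
    """Return IDs of all concepts nested inside any occurrence of con_id."""
    rows = conn.execute(
        """SELECT DISTINCT child.conID
           FROM HIERARCHY AS parent
           JOIN HIERARCHY AS child
             ON child.lft > parent.lft AND child.rgt < parent.rgt
           WHERE parent.conID = ?
           ORDER BY child.conID""",
        (con_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(narrower_concepts(conn, 1))  # [2, 3, 4]
print(narrower_concepts(conn, 5))  # [3]
```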

In order to estimate the usefulness of employing an existing thesaurus, e.g. the Library of Congress Thesaurus for Graphic Materials I, keywords from the prototype database were compared with terms in this thesaurus. The file containing the keywords from the prototype database was 7KB. The file containing the terms that did not match was 4KB (after removing terms which were singular forms of plurals which had matched), i.e. judged by file size, less than 50% of twinIsles' keywords matched terms in the Library of Congress Thesaurus.

The Cataloguing System
The following describes what is required, from both the user's and the system's perspective, of an automated image cataloguing system suitable for use with twinIsles' database.

User (cataloguer): Enters image ID (filename).
System:            Checks for uniqueness of ID.
System:            Checks for existence of image.
System:            Displays thumbnail of image to aid cataloguing.

User (cataloguer): Enters description (caption to be displayed with image).
System:            Parses description into individual words (i.e. removes extraneous punctuation, conjunctions, articles etc.).
System:            Consults thesaurus for synonyms and child terms.
System:            Displays all generated and found keywords/concepts.
                   (See note below.)

User (cataloguer): Reviews and edits displayed keywords/concepts, deleting and adding terms as appropriate.
System:            Updates database.
                   (See note below.)

User (cataloguer): Enters location by selecting it from a list or, where the location is not present in the list, by typing it (in full) in a text box.
System:            Updates PLACE and LOCPLACE tables in case of new location.
System:            Updates PHOTO record.

Note: Where new keywords are entered, their synonymous, hierarchical and associative relationships would need to be identified and added to the thesaurus. This could be done at the time of entry, or the new keywords could be saved and correctly placed in the thesaurus at some later time, possibly by another person. The former method has the advantage of making the relationships available for the remainder of the current cataloguing session.
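
The parsing step, and the separation of keywords already in the thesaurus from new ones awaiting placement, might look like the following sketch. The noise-word list, the function names and the sample thesaurus contents are all assumptions. Phrase handling is left out for brevity.

```python
import re

# Assumed noise-word list; a real system would use a fuller one.
NOISE_WORDS = {"a", "an", "and", "at", "in", "of", "on", "or", "the", "to", "with"}


def keywords_from_description(description):
    """Lower-case the caption, drop punctuation, noise words and duplicates."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    seen, keywords = set(), []
    for word in words:
        if word not in NOISE_WORDS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


def split_known_and_new(keywords, thesaurus_terms):
    """Separate keywords found in the thesaurus from those still to be placed."""
    known = [k for k in keywords if k in thesaurus_terms]
    new = [k for k in keywords if k not in thesaurus_terms]
    return known, new


keywords = keywords_from_description("The bullet train at Tokyo station, Japan.")
known, pending = split_known_and_new(keywords, {"train", "station", "japan"})
print(known)    # ['train', 'station', 'japan']
print(pending)  # ['bullet', 'tokyo']
```

The pending list corresponds to the new keywords discussed in the note above: it could be presented to the cataloguer immediately or stored for later placement in the thesaurus.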

In the case of a homogeneous collection, e.g. a museum of butterflies, the required thesaurus could probably be well defined from the beginning, i.e. only in rare cases would new terms need to be added. However, with a heterogeneous collection such as that of twinIsles which grows in an unpredictable manner, it is likely that the addition of new keywords would be a common requirement.

The cataloguing of collections (i.e. sets of similar or related items) would require a different procedure (and different cataloguing interface).


The Search System
The following is a description of what is required from a search system interfacing with a thesaurus.
1. The user enters a query.
2. The query is parsed into individual words and phrases, i.e. search terms. Noise words (e.g. a, and, the) and extraneous punctuation are removed.
3. An initial set of concepts is identified from the search terms (using the synonymous relationship).
4. The initial set of concepts is expanded to include narrower concepts, e.g. the concept of pets would be expanded to include dogs, goldfish and budgies.
5. Images matching the expanded concept set are retrieved.
6. The images are ranked in order of relevance.
7. Related concepts are identified.
8. Matching images are displayed.
9. The related concepts are displayed, giving the user the opportunity to expand the search.
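
Steps 2 to 7 above can be sketched with in-memory dictionaries standing in for the thesaurus tables. Everything here is illustrative: the sample data, the function name search, and in particular the ranking measure (number of matching concepts per image), which the appendix does not specify. Phrase detection is omitted for brevity.

```python
import re

NOISE_WORDS = {"a", "and", "the"}

# Hypothetical stand-ins for the database tables.
TERM_TO_CONCEPTS = {            # TERM/CONCEPT/ADDTERM: synonyms share a concept
    "pets": {"pets"}, "dogs": {"dogs"}, "hounds": {"dogs"},
    "goldfish": {"goldfish"}, "budgies": {"budgies"},
}
NARROWER = {"pets": {"dogs", "goldfish", "budgies"}}   # HIERARCHY
RELATED = {"pets": {"vets"}, "dogs": {"kennels"}}      # RELCON
PHOTO_CONCEPTS = {                                     # PHCON
    "img001.jpg": {"dogs"},
    "img002.jpg": {"goldfish", "budgies"},
    "img003.jpg": {"castles"},
}


def search(query):
    """Return (ranked image IDs, related concepts) for a free-text query."""
    # Step 2: parse into search terms, dropping punctuation and noise words.
    terms = [w for w in re.findall(r"[a-z0-9]+", query.lower())
             if w not in NOISE_WORDS]
    # Step 3: map terms to an initial set of concepts.
    initial = set()
    for term in terms:
        initial |= TERM_TO_CONCEPTS.get(term, set())
    # Step 4: expand to narrower concepts.
    expanded = set(initial)
    for concept in initial:
        expanded |= NARROWER.get(concept, set())
    # Steps 5 and 6: retrieve matching images, rank by number of matching
    # concepts (an assumed relevance measure), ties broken by image ID.
    scores = {photo: len(concepts & expanded)
              for photo, concepts in PHOTO_CONCEPTS.items()
              if concepts & expanded}
    ranked = sorted(scores, key=lambda photo: (-scores[photo], photo))
    # Step 7: identify related concepts not already covered by the search.
    related = set()
    for concept in initial:
        related |= RELATED.get(concept, set())
    return ranked, sorted(related - expanded)


images, related = search("the pets")
print(images)   # ['img002.jpg', 'img001.jpg']
print(related)  # ['vets']
```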

Variants on the above include:

  • Showing the concepts associated with each displayed image so that the user may search on one or more of these.
  • Allowing the user to select a number of displayed images as the basis for a further search. The search would be on the concept(s) associated with those images.
