Robertson Sparck-Jones

Okapi-Pack

Centre For Interactive Systems Research
City University
London EC1V 0BH

Appendix G: The Graphical User Interface to the BSS.

NOTE: Passage retrieval is a type of search (text databases only) in which the system may determine, for each document, the highest-weighted sub-passage whose weight differs from that of the whole document. The records in the sample databases provided with Okapi-Pack are too short to allow the necessary paragraph information to be generated. Nevertheless, the database conversion and indexing binaries provided with the system (convert_runtime, ix1 and ixf) will produce such information from suitable text databases.

The interface provided does include functionality to implement passage retrieval. This appendix thus includes a description of functionality that will not be found when searching the sample databases provided.

1. Interface Configuration Files.

The interface configuration files are stored in the directory specified by the environment variable:

GUI_CONFIG_FILES

.okapi_rc	File of parameters that determine the configuration of the interface
.okapi_db	Contains a list of available databases. The first line in this file will be the database that will be opened when okapi is run.

2. Databases.

There are two sample databases provided with the system:

1. med.sample	About 1000 records taken from a Medlars database downloaded from Cornell University.
2. cacm.sample	About 1300 records taken from a CACM database downloaded from Cornell University.

3. Terms.

The word "term" will be used to describe both single terms and groups of terms that are entered as phrases.

Terms extracted from documents marked as relevant by users will be referred to as RF (relevance feedback) terms.

4. User Input Of Query Terms

Users may enter one or more words into the term entry box, optionally followed by a phrase operator (a '+' sign) as the last non-space character on the line. All the terms entered will be stemmed and looked up in the database; stopwords and other non-indexed terms will be discarded. The treatment then depends on the operator:

None:

Each term entered will constitute one term in the query.

Phrase (+):

If all of the non-stopwords are index terms, an Okapi phrase (which constitutes a single term) will be formed according to the value of the environment variable BOTH_PHRASE_OPS, set in ".okapi_rc".

BOTH_PHRASE_OPS = 0 (FALSE)
Phrases will only be found if they exist as adjacent terms, possibly with intervening stopwords, in term input order. e.g.
"stock market"
"research and development"

BOTH_PHRASE_OPS = 1 (TRUE)

Two sets will be formed:

Set	Description	Postings	Weight	Examples
A	Adjacency (possibly with intervening stopwords) in term input order.	n(A)	w(A)	"stock market"; "stocks of market produce"
S	Within same sentence (SAMES) occurrence of the terms in any order.	n(S)	w(S)	"markets monitored by the Stock Exchange"

These are combined into one set (a single query term) using the bestmatch operator BM25 where appropriate, according to the following rules.

Sets Generated	Sets Used	Displayed Operator
n(A) = n(S) = 0	Both discarded
n(A) = 0, n(S) > 0	S(S), n(S), w(S)	(S)
n(A) > 0, n(A) = n(S)	S(A), n(A), w(A)	None
0 < n(A) < n(S)	S(A), n(A), w(A) and S(S)-S(A), n(S)-n(A), w(S)	(B)
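
The combination rules in the table can be sketched as a small function. This is an illustration only, not the BSS code: the function name and the returned tuples are hypothetical, and the weights are kept as the labels w(A) and w(S) rather than computed values. It assumes n(A) <= n(S), since adjacent terms necessarily occur in the same sentence.

```python
def combine_phrase_sets(n_a, n_s):
    """Return (sets_used, displayed_operator) for posting counts n(A) and n(S).

    Each entry in sets_used is (set label, postings, weight label).
    """
    if n_a > n_s:
        # Assumption: adjacency implies same-sentence occurrence, so A is a subset of S.
        raise ValueError("n(A) cannot exceed n(S)")
    if n_s == 0:
        return [], None                      # both sets discarded
    if n_a == 0:
        return [("S", n_s, "w(S)")], "(S)"   # only the same-sentence set is used
    if n_a == n_s:
        return [("A", n_a, "w(A)")], None    # adjacency set only, no operator shown
    # 0 < n(A) < n(S): adjacency set plus the remainder of the same-sentence set
    return [("A", n_a, "w(A)"), ("S-A", n_s - n_a, "w(S)")], "(B)"
```

For example, a phrase with 2 adjacent postings and 5 same-sentence postings yields the set A with 2 postings and the set S-A with 3 postings, displayed with the operator (B).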

The weight calculated for each term is a Robertson/Sparck-Jones F4 predictive weight, with halves (that is, with 0.5 added to each component of the formula). The weighting function allows each user-entered term to be assigned a "loaded" weight by assuming the existence of a set of "mythical rels" of which a fixed number contain the term. The number of "mythical rels" is called "bigrload", of which "rload" contain the term. The values used for "rload" and "bigrload" were 4 and 5 respectively.
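
A minimal sketch of such a weight is given below. The formula is the standard Robertson/Sparck-Jones weight with 0.5 corrections; how the BSS folds the "mythical rels" into the counts is not stated here, so the sketch assumes they are added both to the relevance counts and to the collection counts. The function name and argument names are hypothetical.

```python
import math

def rsj_weight(n, N, r=0, R=0, rload=4, bigrload=5):
    """Robertson/Sparck-Jones weight with 0.5 corrections and optional loading.

    n: documents containing the term      N: documents in the database
    r: relevant documents with the term   R: documents judged relevant
    rload, bigrload: "mythical rels" containing the term, and in total.
    Assumption: mythical rels are added to r and R and also to n and N,
    so the non-relevant counts (n - r and N - n - R + r) are unchanged.
    """
    rel_with = r + rload + 0.5
    rel_without = (R + bigrload) - (r + rload) + 0.5
    nonrel_with = n - r + 0.5
    nonrel_without = N - n - R + r + 0.5
    return math.log((rel_with * nonrel_without) / (nonrel_with * rel_without))
```

With rload = bigrload = 0 and no relevance information the function reduces to log((N - n + 0.5) / (n + 0.5)); with the loading of 4 out of 5 the same term receives a higher weight.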

In addition the system keeps its own list of phrases that are always used as such (e.g. "expert systems"). When a phrase is displayed in the working query window with nothing following it, this means that it is either in the GSL or, if not, that it exists as a true phrase.

5. Candidate Terms.

During the course of a search the system keeps a list of all distinct terms that are either entered by the user or extracted from documents judged as relevant. These terms are called "candidate terms".

6. Working Query.

The "Working Query" is the set of terms displayed in the working query window. It is generated incrementally (i.e. after each user-term entry and/or relevance judgement) as follows.

The entire set of user-entered and system-generated terms (the "termset") will be referred to by the letter Q. The number of members of Q is n(Q). Using similar terminology there will, at any time during a search, exist the following subsets of Q.

U	n(U)	user-entered terms.
E	n(E)	non-user terms extracted during relevance feedback.
C	n(C)	candidate terms, those members of Q that satisfy the query threshold conditions.
W	n(W)	members of the working query.

6.1. Incremental Query Expansion.

After each change to Q the working query (W) will be re-generated in the following three stages.

Q is sorted by the keys:

USER_TYPE - U(ser) or E(xtracted)	DESCENDING
Term Selection Value (TSV)	DESCENDING

C is formed from Q taking:

All members of U

Members of E that occur in at least LR_THRESHOLD (defaults to 2) relevant documents and have a TSV greater than or equal to TSV_FACTOR (defaults to 0.6) times the average TSV of all terms that have been in W.

W is formed from C taking:

The top N terms (N <= MAX_TERMS, a system-defined limit that defaults to 20), i.e. all user-entered terms plus the top "MAX_TERMS - n(U)" members of "E". If "n(U)" >= "MAX_TERMS" then W will be made up of members of U only. Users may alter the value of MAX_TERMS by clicking the Query Options button and selecting the appropriate menu item.

Terms are displayed in the working query window in descending order of TSV.
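
The three stages can be sketched as follows. This is a hedged illustration rather than the interface's own code: the Term class, the function name and the mean_tsv_in_w argument (standing for the average TSV of all terms that have been in W) are hypothetical, and the sketch assumes that all user-entered terms are kept even when n(U) exceeds MAX_TERMS.

```python
from dataclasses import dataclass

@dataclass
class Term:
    text: str
    user_entered: bool   # True for members of U, False for members of E
    tsv: float           # term selection value
    rel_docs: int = 0    # number of relevant documents containing the term

def build_working_query(termset, mean_tsv_in_w,
                        max_terms=20, lr_threshold=2, tsv_factor=0.6):
    # Stage 1: sort Q by user type (U before E), then by TSV, both descending.
    ranked = sorted(termset, key=lambda t: (t.user_entered, t.tsv), reverse=True)
    # Stage 2: C holds all of U plus the members of E that pass both thresholds.
    candidates = [
        t for t in ranked
        if t.user_entered
        or (t.rel_docs >= lr_threshold and t.tsv >= tsv_factor * mean_tsv_in_w)
    ]
    # Stage 3: W holds the user terms plus the top extracted terms that fit.
    user_terms = [t for t in candidates if t.user_entered]
    extracted = [t for t in candidates if not t.user_entered]
    room = max(0, max_terms - len(user_terms))
    working = user_terms + extracted[:room]
    # Terms are displayed in descending order of TSV.
    return candidates, sorted(working, key=lambda t: t.tsv, reverse=True)
```

Raising max_terms from 20 to 30 or 40 simply enlarges the slice taken from the extracted terms; it has no effect when C contains fewer terms than the limit.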

7. Searching The Database.

Each search, performed by clicking the "Search" button, marks the next iteration. The members of W are combined using a best match operation (BM25).

If the database is of type text and has been indexed to include positional information about paragraphs (not the case with either of the two sample databases), passage retrieval is applied to the document set generated, with parameters p_unit = 4, p_step = 2, k1 = 1.6 and b = 0.7. This will result in the system finding two passages for each document.

Document Part	Weight	Length
Full document	w(F)	L(F)
A Sub-Passage	w(P)	L(P)

There are two cases to consider.

L(F) = L(P)	The passage and the full document are the same and w(F) = w(P).
L(F) > L(P)	The passage is distinct from the full document; the weight assigned to the document will be the greater of w(F) and w(P).

8. The Hitlist

The result of a search is the set of all documents that contain at least one of the "working query" terms. Each document within this set is given a weight according to the bestmatch function BM25 which takes into account:

the occurrence of the query terms within the document,
the occurrence of the query terms within the entire database,
the length of each document.
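
As a rough illustration of how these three factors interact, the sketch below uses the commonly published form of the BM25 document-side function. It is not taken from the BSS source: the function names are hypothetical, the query-term-frequency component is omitted, and using k1 = 1.6 and b = 0.7 as defaults is an assumption (the values are those quoted for passage retrieval). The term weight passed in stands for the collection-based weight of the term.

```python
def bm25_term_score(term_weight, tf, doc_len, avg_doc_len, k1=1.6, b=0.7):
    """Contribution of one query term to a document's BM25 score.

    term_weight: weight derived from the term's occurrence in the whole database
    tf: occurrences of the term within the document
    doc_len, avg_doc_len: this document's length and the database average
    """
    # Length normalisation: longer-than-average documents need more occurrences.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return term_weight * (k1 + 1) * tf / (K + tf)

def bm25_document_score(query_weights, doc_tf, doc_len, avg_doc_len, k1=1.6, b=0.7):
    """Sum the contributions of all query terms that occur in the document."""
    return sum(
        bm25_term_score(weight, doc_tf.get(term, 0), doc_len, avg_doc_len, k1, b)
        for term, weight in query_weights.items()
    )
```

A document that contains none of the query terms scores zero, which matches the statement that only documents containing at least one working query term appear in the result set.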

The hitlist is made up of the top H ranked documents where H <= MAX_RECS_TO_SHOW. MAX_RECS_TO_SHOW is set in .okapi_rc and defaults to 50. Each document must satisfy the following conditions.

It is not a member of the current set of relevant documents i.e. it must not have already been seen in full by the searcher.
L(P) (or L(F) if there is no passage) must be less than DOC_THRESHOLD characters in length. DOC_THRESHOLD, which defaults to 10K characters, can be altered by clicking the Query Options button and selecting the appropriate menu item.
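
The selection can be sketched as a simple filter over the ranked document set. The dictionary keys and the function name are hypothetical; the sketch assumes that the two conditions are applied before the list is cut off at MAX_RECS_TO_SHOW, and that "10K characters" means 10,000.

```python
def build_hitlist(ranked_docs, relevant_ids,
                  max_recs_to_show=50, doc_threshold=10_000):
    """Pick hitlist entries from documents already sorted by descending weight.

    Each document is a dict with "docid", "doc_length" and, where a distinct
    passage exists, "passage_length" (all lengths in characters).
    """
    hitlist = []
    for doc in ranked_docs:
        if len(hitlist) >= max_recs_to_show:
            break
        if doc["docid"] in relevant_ids:
            continue  # already seen in full and judged relevant
        passage_length = doc.get("passage_length")
        shown_length = doc["doc_length"] if passage_length is None else passage_length
        if shown_length >= doc_threshold:
            continue  # L(P), or L(F) without a passage, is too long
        hitlist.append(doc)
    return hitlist
```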

An entry for each document consists of:

A header line made up of:

<record_no>	<docid>	<normalised_weight>	[<passage_length>]	<document_length>
rank within set	document key	system weight mapped onto the range 1..1000.	passage length	document length

The document and passage lengths are given to the nearest page, where a page is taken to be 2000 characters.

A system generated title, made up from approximately the first 150 characters from the start of the document.
Query term occurrence information: A count of the occurrences of the stems of each query term within the document. The stem of a query term may occur in different source forms within the document. The first source in the document for each stem will be shown.

The hitlist entry at the top of the window is the first unseen document in the list.

9. Showing Documents

Double-clicking anywhere in a document's hitlist entry will display the full document in a pop-up, scrollable text window. Query terms are highlighted in green. The passage (if there is one) is highlighted in light grey. The line of the document displayed at the top of the window depends upon the values of w(F|P) and L(F|P) (see the section on Searching The Database), i.e.

L(F) = L(P)

Don't highlight the passage. The line of the document at the top of the window is the first line containing a query term.

L(F) > L(P)

Highlight the passage. The line of the document at the top of the window is:

w(P) >= w(F)	the first line of the passage.
w(P) < w(F)	the first line containing a query term.
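
These display rules, together with the rule that a document takes the greater of w(F) and w(P), can be summarised in a short sketch. The function name, the returned dictionary and the two labels for the top line are hypothetical.

```python
def display_plan(w_full, len_full, w_passage=None, len_passage=None):
    """Decide passage highlighting and which line is shown at the top."""
    if len_passage is None or len_passage == len_full:
        # L(F) = L(P): the passage is the whole document, so nothing is highlighted.
        return {"highlight_passage": False,
                "top_line": "first_query_term",
                "document_weight": w_full}
    # L(F) > L(P): a distinct passage exists and is highlighted.
    top_line = "passage_start" if w_passage >= w_full else "first_query_term"
    return {"highlight_passage": True,
            "top_line": top_line,
            "document_weight": max(w_full, w_passage)}
```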

9.1. Making Relevance Judgements

At the bottom of the text window are three buttons to allow users to make a relevance judgment.

Button	Document Relevant?	RF Terms Extracted From:
Full Document	Yes	Entire document.
Passage Only	Yes	Highlighted passage only
Not Relevant	No	No RF terms

Searchers must make a relevance judgment before they may go on to any other part of the search process. Making a positive judgement (Yes if the database does not support passage retrieval, [ Full Document | Passage Only ] otherwise) results in query expansion: all indexed terms are extracted from the appropriate section of the document and merged with the current complete set of terms. The working query is then formed from the expanded query.

For a document shown from the current hitlist a relevance judgment can be altered at any time until a new search is made. The possible changes:

Original Judgement	New Judgement
[ Full Document ]	[ Passage Only ]
[ Passage Only ]	[ Full Document ]
[ Full Document | Passage Only ]	[ Not Relevant ]
[ Not Relevant ]	[ Full Document | Passage Only ]

will modify the set of RF terms appropriately.

10. Relevance Judgments Pool

The relevance judgments pool contains the ranked hitlist information for all documents currently judged as relevant. Any member of the current relevance judgments pool that exists in a document set for a new iteration has its weight set to its value in the latest document set; the display order is adjusted accordingly.

11. Removing Terms

Any term may be removed from the working query by double-clicking on its entry in the query window. Removed terms are displayed in the removed terms window. If n(C) > MAX_TERMS, other terms may be promoted to take the place of removed terms.

12. Reinstating Removed Terms

A removed term may be reinstated in the query by double-clicking on its entry in the removed terms window. However, as the working query changes, (i) its rank position may be different from that when it was removed, and (ii) in the case of extracted terms, it may not go in at all if n(C) >= MAX_TERMS.

13. Quitting

Quitting the search is achieved by clicking once on the "Exit" button.

14. Query Options

Certain search parameters and query states can be altered by the user. Some of these need to be altered by modifying the interface configuration script, .okapi_rc, while others can be changed interactively by clicking on the Query Options button during a search.

14.1. Editing the interface configuration script.

Various parameters are set as environment variables in the interface configuration script, .okapi_rc, and read using the appropriate C/Tcl function.
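
The interface itself reads these values through C/Tcl. Purely as an illustration of the pattern, the Python sketch below reads the parameters named in this appendix from the environment and falls back to the documented defaults. Which of these names are actually exported by .okapi_rc is an assumption, as is the default of 0 for BOTH_PHRASE_OPS and the reading of "10K" as 10,000; the function name is hypothetical.

```python
import os

# Defaults as stated in this appendix, except where marked as assumed.
DEFAULTS = {
    "MAX_TERMS": 20,
    "LR_THRESHOLD": 2,
    "TSV_FACTOR": 0.6,
    "MAX_RECS_TO_SHOW": 50,
    "DOC_THRESHOLD": 10_000,   # "10K characters"; exact value assumed
    "BOTH_PHRASE_OPS": 0,      # default not stated in the text; assumed
}

def read_settings(environ=None):
    """Return the settings, converting each value to the type of its default."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(name)
        settings[name] = default if raw is None else type(default)(raw)
    return settings
```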

14.2. The Query Options button.

This button provides access to functions that enable the user to interactively alter the current state of the working query. Clicking the button produces a pop-up menu with the following entries.

Clear Relevance Feedback
Clear Current Query

Set Working Query Size >

Fewer RF Terms
More RF Terms

Set LR Threshold >

Clear Relevance Feedback.

The set of relevance feedback (RF) terms, added automatically to Q by the system after each positive relevance judgement, is "removed" from Q (the entire set of user-entered and system-generated terms). The terms in fact remain in Q, but their values of little r (the number of relevant documents they occur in) are set to zero and they are "flagged" to indicate that they should not be members of the set of candidate terms (C) unless they are subsequently:
- re-input by the user, or
- found to satisfy the threshold conditions based upon new relevance judgements.

The current set of relevance judgements appears "greyed-out" in the relspool window to indicate that they were made before the relevance feedback was cleared. As before, these documents are not allowed to be members of any subsequent hitlist.

Clear Current Query.

In addition to the "removal" of the RF terms, all user-entered terms are similarly "removed".

Set Working Query Size.

The default maximum size of the working query (MAX_TERMS) is 20 terms. Often n(C) does not exceed this value. However, if it does, the only way the user can see the extra terms in the set is by removing one or more of the top 20 terms so that lower terms are promoted.

Clicking on the "Set Working Query Size" button allows the user to increase the maximum size of the working query to 30 or 40 terms during the course of a search, thus:
- enabling the user to see more than the top 20 terms from C without having to remove terms from W,
- allowing the user to increase the number of terms used to search the database if relevance feedback is producing enough good terms to warrant this.

Fewer RF Terms
More RF Terms

The number of RF terms that are allowed in the set of candidate terms (C) is controlled by the two parameters LR_THRESHOLD and TSV_FACTOR. These two menu options allow users to increase or decrease the value of TSV_FACTOR (by 0.1 per "click"), thereby potentially decreasing or increasing the number of RF terms allowable in the set.

Set LR Threshold >

In addition to modifying the value of TSV_FACTOR as described for the previous two menu options (Fewer and More RF Terms), the user can also alter the value of LR_THRESHOLD. This sets the minimum number of documents judged as relevant by the user that a term must occur in.

Clicking on the button "Set LR Threshold" presents a cascaded menu which allows the user to set LR_THRESHOLD to 1, 2, 3 or 4 relevant documents.
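
The effect of the Fewer RF Terms, More RF Terms and Set LR Threshold options on the two parameters can be sketched as below. The function names are hypothetical, and the lower bound of 0 on TSV_FACTOR is an assumption; the text does not state any limits for it.

```python
def adjust_tsv_factor(tsv_factor, fewer_terms, step=0.1):
    """Apply one click of "Fewer RF Terms" (True) or "More RF Terms" (False)."""
    # A higher factor raises the TSV bar, so fewer extracted terms qualify.
    new_value = tsv_factor + step if fewer_terms else tsv_factor - step
    return round(max(0.0, new_value), 1)

def set_lr_threshold(value):
    """Accept only the values offered by the cascaded menu."""
    if value not in (1, 2, 3, 4):
        raise ValueError("LR_THRESHOLD must be 1, 2, 3 or 4")
    return value
```

Starting from the default of 0.6, one click of Fewer RF Terms gives 0.7 and one click of More RF Terms gives 0.5.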

Last modified: 12th November 2001