NOTE: Passage retrieval is a type of search (available for text databases only) in which the system may determine, for each document, the highest-weighted sub-passage whose weight differs from that of the whole document. The records in the sample databases provided with Okapi-Pack are too short for the necessary paragraph information to be generated. Nevertheless, the database conversion and indexing binaries provided with the system (convert_runtime, ix1 and ixf) will produce such information from suitable text databases.
The interface provided does include passage retrieval functionality. This appendix therefore describes functionality that is not available when searching the sample databases provided.
.okapi_rc | File of parameters that determine the configuration of the interface |
.okapi_db | Contains a list of available databases. The first line in this file will be the database that will be opened when okapi is run. |
1. med.sample | About 1000 records taken from a Medlars database downloaded from Cornell University. |
2. cacm.sample | About 1300 records taken from a CACM database downloaded from Cornell University. |
Terms extracted from documents marked as relevant by users will be referred to as RF (relevance feedback) terms.
None: | Each term entered will constitute one term in the query. |
Phrase (+): | If all of the non-stopwords are index terms, an Okapi phrase (which constitutes a single term) will be formed according to the value of the environment variable BOTH_PHRASE_OPS set in ".okapi_rc". |
The weight calculated for each term is a Robertson/Sparck-Jones F4 predictive weight, with halves. The weighting function allows each user-entered term to be assigned a "loaded" weight by assuming the existence of a set of "mythical rels", of which a fixed number contain the term. The number of "mythical rels" is called "bigrload", of which "rload" contain the term. The values used for "rload" and "bigrload" were 4 and 5 respectively.
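As a rough illustration of the weighting just described, the following is a minimal Python sketch of the F4 weight with 0.5 corrections. The function name `rsj_weight` and its argument names are hypothetical. The way the "mythical rels" are folded into the counts (added to R and r, and also to N and n so that the non-relevant counts are unchanged) is an assumption, not something this appendix specifies.

```python
import math


def rsj_weight(N, n, R=0, r=0, rload=0, bigrload=0):
    """Robertson/Sparck-Jones F4 weight with 0.5 ("halves") corrections.

    N: documents in the collection      n: documents containing the term
    R: documents judged relevant        r: relevant documents containing the term
    rload, bigrload: "mythical rels" used to load user-entered terms.
    """
    # Assumption: the mythical rels are treated as extra documents, so they
    # are added to the collection counts as well as to the relevant counts.
    R_eff = R + bigrload
    r_eff = r + rload
    N_eff = N + bigrload
    n_eff = n + rload

    numerator = (r_eff + 0.5) * (N_eff - n_eff - R_eff + r_eff + 0.5)
    denominator = (n_eff - r_eff + 0.5) * (R_eff - r_eff + 0.5)
    return math.log(numerator / denominator)


# A term occurring in 10 of 1000 documents, before any relevance judgements.
unloaded = rsj_weight(N=1000, n=10)
# The same term entered by the user, loaded with rload = 4 and bigrload = 5.
loaded = rsj_weight(N=1000, n=10, rload=4, bigrload=5)
```

Under these assumptions the loaded weight is higher than the unloaded one, which is the intended effect of favouring terms the user typed in.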
In addition, the system keeps its own list of phrases that are always used as such (e.g. "expert systems"). When a phrase is displayed in the working query window with nothing following it, this means that it is either in the GSL or, if not, that it exists as a true phrase.
The entire set of user-entered and system-generated terms (the "termset") will be referred to by the letter Q. The number of members of Q is n(Q). Using similar terminology there will, at any time during a search, exist the following subsets of Q.
U | n(U) | user-entered terms. |
E | n(E) | non-user terms extracted during relevance feedback. |
C | n(C) | candidate terms, those members of Q that satisfy the query threshold conditions. |
W | n(W) | members of the working query. |
Q is sorted by the keys: |
|
C is formed from Q taking: |
|
W is formed from C taking: |
|
Terms are displayed in the working query window in descending order of TSV.
If the database is of type text and has been indexed to include positional information about paragraphs (not the case with either of the two sample databases), passage retrieval is applied to the document set generated, with parameters p_unit = 4, p_step = 2, k1 = 1.6 and b = 0.7. This will result in the system finding two passages for each document.
Document Part | Weight | Length |
---|---|---|
Full document | w(F) | L(F) |
A Sub-Passage | w(P) | L(P) |
There are two cases to consider.
L(F) = L(P) | The passage and the full document are the same and w(F) = w(P). |
L(F) > L(P) | The passage is distinct from the full document; the weight assigned to the document will be the greater of w(F) and w(P). |
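The two cases above can be sketched as a small Python function. The name `document_weight` and the returned "passage is distinct" flag are hypothetical and used only for illustration.

```python
def document_weight(w_full, len_full, w_passage, len_passage):
    """Return (weight assigned to the document, passage_is_distinct).

    w_full / len_full:       w(F) and L(F) for the full document
    w_passage / len_passage: w(P) and L(P) for the best sub-passage
    """
    if len_passage == len_full:
        # Case L(F) = L(P): the passage is the whole document, w(F) = w(P).
        return w_full, False
    # Case L(F) > L(P): take the greater of the two weights.
    return max(w_full, w_passage), True
```

For example, `document_weight(3.0, 10, 4.5, 4)` yields a document weight of 4.5 because the passage outweighs the full document.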
The hitlist is made up of the top H ranked documents, where H <= MAX_RECS_TO_SHOW. MAX_RECS_TO_SHOW is set in .okapi_rc and defaults to 50. Each document must satisfy the following conditions.
An entry for each document consists of:
<record_no> | <docid> | <normalised_weight> | [<passage_length>] | <document_length> |
---|---|---|---|---|
rank within set | document key | system weight mapped onto the range 1..1000 | passage length | document length |
The document and passage lengths are given to the nearest page, where a page is taken to be 2000 characters.
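A minimal sketch of the page calculation follows. The names `PAGE_CHARS` and `length_in_pages` are hypothetical, and rounding exact halves upwards is an assumption. Whether the interface displays a minimum of one page for very short documents is not stated here, so the sketch does not enforce one.

```python
PAGE_CHARS = 2000  # one page is taken to be 2000 characters


def length_in_pages(num_chars):
    """Length in pages, rounded to the nearest page (halves round up)."""
    return (num_chars + PAGE_CHARS // 2) // PAGE_CHARS
```

With this rule a 2999-character passage counts as 1 page and a 3000-character passage as 2 pages.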
The hitlist entry at the top of the window is the first unseen document in the list.
L(F) = L(P) | Don't highlight the passage. The line of the document at the top of the window is the first line containing a query term. |
L(F) > L(P) | Highlight the passage. The line of the document at the top of the window is: |
Button | Document Relevant? | RF Terms Extracted From: |
---|---|---|
Full Document | Yes | Entire document. |
Passage Only | Yes | Highlighted passage only |
Not Relevant | No | No RF terms |
Searchers must make a relevance judgement before they may go on to any other part of the search process. Making a positive judgement (Yes if the database does not support passage retrieval, [ Full Document | Passage Only ] otherwise) results in query expansion: all indexed terms are extracted from the appropriate section of the document and merged with the current complete set of terms. The working query is then formed from the expanded query.
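The expansion step might be sketched as follows. This is an illustration only: the function name `expand_query`, the representation of the term set as a dictionary mapping each term to r (the number of relevant documents it occurs in), and the assumption that a positive judgement raises r by one for every extracted term are all hypothetical.

```python
def expand_query(termset, document_terms, passage_terms, button):
    """Merge RF terms from one judged document into the term set.

    termset:        dict mapping term -> r (relevant documents containing it)
    document_terms: indexed terms of the entire document
    passage_terms:  indexed terms of the highlighted passage
    button:         "Full Document", "Passage Only" or "Not Relevant"
    """
    if button == "Not Relevant":
        return termset  # no RF terms are extracted
    source = document_terms if button == "Full Document" else passage_terms
    for term in set(source):  # count each term once per document
        termset[term] = termset.get(term, 0) + 1
    return termset


terms = {"heart": 1}
expand_query(terms, ["heart", "valve", "valve"], ["valve"], "Passage Only")
```

After the call above, only the passage term "valve" has been merged in; "heart" keeps its previous count because it lies outside the highlighted passage.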
For a document shown from the current hitlist, a relevance judgement can be altered at any time until a new search is made. The possible changes:
Original Judgement | New Judgement |
---|---|
[ Full Document ] | [ Passage Only ] |
[ Passage Only ] | [ Full Document ] |
[Full Document | Passage Only] | Not Relevant |
[ Not Relevant ] | [Full Document | Passage Only] |
will modify the set of RF terms appropriately.
The set of relevance feedback (RF) terms, added automatically to Q by the system after each positive relevance judgement, is "removed" from Q (the entire set of user-entered and system-generated terms). The terms in fact remain in Q, but their values of little r (the number of relevant documents in which they occur) are set to zero and they are "flagged" to indicate that they should not be members of the set of candidate terms (C), unless subsequently:
The current set of relevance judgements appears "greyed-out" in the relspool window to indicate that they were made before the relevance feedback was cleared. As before, these documents are not allowed to be members of any subsequent hitlist.
In addition to the "removal" of the RF terms, all user-entered terms are similarly "removed".
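The "removal" described above could be modelled roughly as below. The `Term` record, its field names and the `include_user_terms` switch are hypothetical; the sketch only mirrors the two stated effects (r reset to zero, and a flag that keeps the term out of C).

```python
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    user_entered: bool
    r: int = 0               # relevant documents the term occurs in
    excluded: bool = False   # flagged: must not be a candidate term


def clear_terms(termset, include_user_terms=False):
    """Flag RF terms (and optionally user-entered terms) as "removed".

    Terms stay in the term set Q; only their r value and flag change.
    """
    for term in termset:
        if include_user_terms or not term.user_entered:
            term.r = 0
            term.excluded = True
    return termset


q = [Term("asthma", user_entered=True, r=2), Term("inhaler", user_entered=False, r=3)]
clear_terms(q)
```

Calling `clear_terms(q, include_user_terms=True)` would correspond to the case in which user-entered terms are "removed" as well.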
The default maximum size of the working query (MAX_TERMS) is 20 terms. Often n(C) does not exceed this value. However, if it does, the only way the user can see the extra terms in the set is by removing one or more of the top 20 terms so that lower terms are promoted.
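The promotion effect can be illustrated with a short sketch. It assumes that the working query is simply the highest-TSV candidates up to the size limit; the function name `working_query` and the `(term, tsv)` pair representation are hypothetical.

```python
MAX_TERMS = 20  # default maximum size of the working query


def working_query(candidates, removed=(), max_terms=MAX_TERMS):
    """Pick the working query W from the candidate terms C.

    candidates: iterable of (term, tsv) pairs
    removed:    terms the user has removed from the working query
    """
    remaining = [c for c in candidates if c[0] not in removed]
    # Terms are ranked in descending order of TSV.
    remaining.sort(key=lambda c: c[1], reverse=True)
    return remaining[:max_terms]


# 25 candidate terms t0..t24 with descending TSV values.
candidates = [(f"t{i}", 100 - i) for i in range(25)]
w = working_query(candidates)                     # t0 .. t19
w_after = working_query(candidates, removed={"t0"})  # t1 .. t20
```

Removing `t0` frees one slot, so `t20`, which was previously hidden, is promoted into the working query.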
Clicking on the "Set Working Query Size" button allows the user to increase the maximum size of the working query to 30 or 40 terms during the course of a search, thus:
The number of RF terms that are allowed in the set of candidate terms (C) is controlled by two parameters, LR_THRESHOLD and TSV_FACTOR. These two menu options allow users to increase or decrease the value of TSV_FACTOR (by 0.1 per "click"), thereby potentially decreasing or increasing the number of RF terms allowable in the set.
In addition to modifying the value of TSV_FACTOR as described in the previous two menu options (Fewer and More RF Terms), the user can also alter the value of LR_THRESHOLD. This sets the minimum number of documents judged as relevant by the user in which the term must occur.
Clicking on the button "Set LR Threshold" presents a cascaded menu which allows the user to set LR_THRESHOLD to 1, 2, 3 or 4 relevant documents.
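The effect of these settings might be sketched as follows. All function names are hypothetical. Only the LR_THRESHOLD test is shown as a filter, because the exact threshold condition involving TSV_FACTOR is not given in this appendix; the TSV_FACTOR helpers show just the 0.1 step per click.

```python
ALLOWED_LR_THRESHOLDS = (1, 2, 3, 4)  # values offered by the cascaded menu
TSV_STEP = 0.1                        # change in TSV_FACTOR per "click"


def fewer_rf_terms(tsv_factor):
    """Raise TSV_FACTOR by one step, potentially admitting fewer RF terms."""
    return round(tsv_factor + TSV_STEP, 1)


def more_rf_terms(tsv_factor):
    """Lower TSV_FACTOR by one step, potentially admitting more RF terms."""
    return round(tsv_factor - TSV_STEP, 1)


def set_lr_threshold(value):
    """Validate a new LR_THRESHOLD against the menu choices."""
    if value not in ALLOWED_LR_THRESHOLDS:
        raise ValueError("LR_THRESHOLD must be 1, 2, 3 or 4")
    return value


def passes_lr_threshold(rf_terms, lr_threshold):
    """Keep RF terms occurring in at least lr_threshold relevant documents.

    rf_terms: dict mapping term -> r (relevant documents containing it)
    """
    return {t: r for t, r in rf_terms.items() if r >= lr_threshold}


rf = {"valve": 3, "aorta": 1, "stent": 2}
kept = passes_lr_threshold(rf, set_lr_threshold(2))
```

Here `kept` retains "valve" and "stent" but drops "aorta", which occurs in only one relevant document.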