This appendix gives a brief description of most of the commands available in the low-level interface to the Okapi BSS. Commands intended specifically for users wishing to implement relevance feedback systems are not included as they are regarded as being at a higher level; it may be better to construct your own functions using the low-level commands.
There are three functions in the library:
<command> | This must contain a valid BSS command. In the case of relevance feedback systems, it must be quite large if you are going to request parsing of long pieces of text. |
<response> | This will contain the result returned by <command>. It must be large enough to hold anything i0() may produce (possibly up to several megabytes; see ACCEPT and PREFER below). |
<return_code> |
This will be an integer:
|
The C source code for a straightforward example of the use of i0() is given in Appendix L. A more complex example can be seen in the source code for the GUI to the BSS described in The Interface to the BSS.
Other error messages are written to Err_file, and this may be output to stderr, written to a file or discarded, depending on the compilation. They will sometimes be useful to a system developer. It must be emphasised that the error-trapping system is not well developed. It is hoped that it will improve with time and experience.
There are a few cases where the <name>= is unnecessary and a few where it is invalid (e.g. set defaults).
White space is ignored around the '='.
Most arguments have default values. Some of these are settable using a SET command; the current value of all settable defaults can be obtained by using the SET command with no arguments. In previous versions of the interface some arguments were implicitly set by being used in a command. This is now true ONLY of the current set number (s=..., set=...) and the current record number (rec=..., record=...). These are automatically set to the most recently produced, or SHOWn, set, and to the next, or first, record of the current set.
-29, interface syntax error, no database specified
"System"-type failures (e.g. shortage of memory, missing files, shortage of file descriptors) are scarcely handled at all; this may be done sometime.
One error which can arise from most of the commands is NO_DB_OPEN.
responds with a line containing "<n> database(s)" followed by a line for each available database consisting of the database name followed by a very brief description.
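The response layout just described can be read back with a few lines of code. The following is a minimal Python sketch, not part of the BSS: the function name `parse_database_list` and the sample database names are hypothetical, and it assumes that a database name contains no white space and is separated from its description by white space.

```python
def parse_database_list(response):
    """Split a '<n> database(s)' response into (name, description) pairs."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty response")
    # First line has the form "<n> database(s)".
    count = int(lines[0].split()[0])
    entries = []
    for ln in lines[1:1 + count]:
        parts = ln.split(None, 1)  # assumed: name, white space, description
        name = parts[0]
        description = parts[1].strip() if len(parts) > 1 else ""
        entries.append((name, description))
    if len(entries) != count:
        raise ValueError("fewer database lines than announced")
    return entries


if __name__ == "__main__":
    sample = "2 database(s)\ndemo A small demonstration database\ntest Another one\n"
    for name, description in parse_database_list(sample):
        print(name, "-", description)
```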
Other INFO topics are described below.
Successfully choosing a database resets all values to their defaults (see SET) and destroys any existing search sets.
There is an optional argument dbflags=<value> or db_flags=<value> which affects the facilities which are loaded when the database is opened. The current status of this is unknown, and you are advised not to use it.
Parses the string into individual terms in a form suitable for looking up in the database. Used to determine what terms to look for with a FIND(1) command.
For a description of the "attr" qualifier see under FIND(1). The response is:
<number of non-stopped terms> [t=<term>] ..
The number may of course be 0. Only indexed terms are output. An output term may be a GSL token of the form @nnnn.
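A client will usually want the PARSE response as a count and a list of terms. The sketch below is one way to do this in Python; `parse_terms` is a hypothetical helper, and it assumes each term is introduced by white space followed by `t=`. Because only indexed terms are output, the count is returned separately and is not required to equal the number of terms.

```python
import re


def parse_terms(response):
    """Return (non_stopped_count, terms) from a PARSE response string."""
    head, *rest = re.split(r"\s+t=", response.strip())
    count = int(head)
    # Terms may be ordinary words or GSL tokens such as "@0123".
    terms = [term.strip() for term in rest]
    return count, terms


if __name__ == "__main__":
    print(parse_terms("2 t=dog t=@0123"))
```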
superparse [attr | a=<attribute mnemonic>] t=<string>
As PARSE except that more information is given, and terms are output even if they are not indexed (in which case they have GSL class H or F; see Appendix D: The GSL File). The response is:
<number of terms> [t=<term> c=<gsl class> s=<source>] ..
<source> is the portion of the input which gave rise to this term.
Applies the named function to the given term. There is no default stem function. The response is:
t=<stemmed term>
This utility function is independent of the current database (if any). Even if there is an open database the stemmed term may not occur in any of the database's indexes. Names of the available functions may be obtained using the DISPLAY command. Note that the <term> is treated by the stem function as if it were a single word.
Errors: an unrecognized stem function gives NO_SUCH_STEMFUNCTION
NB one term only, and the "t=<term>" must come last. The response is:
[ s<n> ] np=<np> t=<term found>
<term found> will be empty (nothing following the "t=") if type=0 and the specified term is not in the index, or if type is nonzero and an extremity of the index has been reached.
The system remembers the last term found by a FIND (1) command. This is its notion of "current term" (initially empty). If no term is specified in a FIND command the current term will be looked up. This is not useful unless the "type" qualifier (see below) is 2 or -2. If type is 2 the "empty" FIND command will find the next term in the index, if type is -2 the previous term. For example, to look at all the terms starting with "dog":
Qualifiers | Default | Settable? |
---|---|---|
save=y \| 1 \| n \| 0 | 1 (y and 1 are synonyms, as are n and 0) | yes |
type=-2 \| -1 \| 0 \| 1 \| 2 | 0 | yes |
attribute \| attr \| a=<attribute mnemonic> | default | yes |
save
If save=y or 1 (default) a set is made, otherwise no set is made. save=n may be faster if you only want to find the number of postings for a term, and there will be no unwanted set to be deleted.
type
Value | Function |
---|---|
-2 | finds the greatest term less than the specified term |
-1 | finds the greatest term not greater than the specified term |
0 (default) | finds the specified term or nothing |
1 | finds the least term not less than the specified term |
2 | finds the least term greater than the specified term |
An invalid type value causes a NO_SUCH_SEARCHTYPE error.
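The five type values, and the walk through all terms starting with "dog" mentioned earlier, can be simulated against an in-memory sorted list. This Python sketch only models the semantics in the table: `find_term` and `terms_with_prefix` are hypothetical names, the list stands in for a BSS index, and the real index ordering may differ from Python's string ordering.

```python
from bisect import bisect_left, bisect_right


def find_term(index, term, search_type=0):
    """Model of the 'type' qualifier on a sorted list of unique terms.

    Returns the term found, or "" when nothing qualifies.
    """
    lo = bisect_left(index, term)   # first position with index[pos] >= term
    hi = bisect_right(index, term)  # first position with index[pos] > term
    if search_type == 0:            # the specified term or nothing
        pos = lo if lo < hi else -1
    elif search_type == -2:         # greatest term less than term
        pos = lo - 1
    elif search_type == -1:         # greatest term not greater than term
        pos = hi - 1
    elif search_type == 1:          # least term not less than term
        pos = lo if lo < len(index) else -1
    elif search_type == 2:          # least term greater than term
        pos = hi if hi < len(index) else -1
    else:
        raise ValueError("NO_SUCH_SEARCHTYPE")
    return index[pos] if pos >= 0 else ""


def terms_with_prefix(index, prefix):
    """Start with type=1 on the prefix, then step forward with type=2."""
    found = []
    term = find_term(index, prefix, 1)
    while term and term.startswith(prefix):
        found.append(term)
        term = find_term(index, term, 2)
    return found


if __name__ == "__main__":
    index = ["cat", "dog", "dogged", "doghouse", "dot"]
    print(terms_with_prefix(index, "dog"))
```

Against the BSS itself the same loop would issue one FIND with type=1 and then repeated "empty" FIND commands with type=2, stopping when the returned term no longer starts with the prefix or is empty.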
a | attr | attribute
This determines the type of data which will be searched, i.e. which index, and, for some kinds of index, the class of objects within that index. There is always a default attribute, which may be referred to as "default" (e.g. in a SET command). Typically, this is some kind of general keyword attribute. It is defined by the first entry in the database's search_groups file. Usually there also exists an attribute called "dn" (document number). Apart from these two, the only way to determine what attributes there are for a database is to look at the search_groups parameter file for the database, although if the database manager is sensible they will have fairly self-evident names like "ti", "au", "ab".
A non-existent attribute mnemonic causes a NO_SUCH_ATTRIBUTE error.
NB sets and weights must come before options
<setnum> must be the number of a set which has been created and not deleted. Set numbers run from zero. There is a limit on the number of sets, currently about 1000. Deleted setnumbers are reused, always lowest free setnumber first. A non-existent set causes a NO_SUCH_SET error.
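The "lowest free setnumber first" rule can be modelled with a min-heap of released numbers. The class below is an illustrative Python sketch, not the BSS implementation; the class name `SetNumbers` and its `limit` parameter are hypothetical.

```python
import heapq


class SetNumbers:
    """Hand out set numbers from zero, reusing the lowest deleted one first."""

    def __init__(self, limit):
        self.limit = limit      # maximum number of sets (illustrative)
        self.next_unused = 0    # lowest number never yet handed out
        self.free = []          # min-heap of deleted set numbers
        self.live = set()

    def allocate(self):
        if self.free:
            number = heapq.heappop(self.free)
        elif self.next_unused < self.limit:
            number = self.next_unused
            self.next_unused += 1
        else:
            raise RuntimeError("no free set numbers")
        self.live.add(number)
        return number

    def delete(self, number):
        # Deleting a set that does not exist is silently ignored here.
        if number in self.live:
            self.live.remove(number)
            heapq.heappush(self.free, number)


if __name__ == "__main__":
    sets = SetNumbers(limit=4)
    print([sets.allocate() for _ in range(3)])  # three fresh numbers
    sets.delete(1)
    print(sets.allocate())                      # the deleted number is reused
```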
If the operation is of the best match type a weight must be given for each set which is not itself the result of a best match operation. Missing weight will give a SET_NO_WEIGHT error. If a weight is given for a set which is the result of a best match operation it is ignored (arguably this is wrong, and may in the future be changed).
Normally, weights should be determined by using the weight command. If weights are derived in some other way they must be smallish integers, typically in the range 1-200; weight "overflow" (> 32767) is not detected and will give spurious ordering of retrieved documents.
The response is:
[S<n> ] np=<np>
[S<n> ] np=<np> maxwt=<maxwt> nmaxwt=<# with max wt> ngw=<ngw> mpw=<max poss wt> mpw=<# with max poss wt>
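Both response forms are a run of `key=value` tokens with an optional leading set number, so one small parser covers them. This is a hedged Python sketch: `parse_find_response` is a hypothetical helper, values are left as strings because their numeric types are not specified here, and the fields are returned as a list of pairs because the layout above shows the key `mpw` twice.

```python
import re


def parse_find_response(response):
    """Return (set_number or None, [(key, value), ...])."""
    tokens = response.split()
    set_number = None
    # The set number is absent when no set was saved.
    if tokens and re.fullmatch(r"[Ss]\d+", tokens[0]):
        set_number = int(tokens[0][1:])
        tokens = tokens[1:]
    fields = []
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError("unexpected token: %r" % token)
        fields.append((key, value))
    return set_number, fields


if __name__ == "__main__":
    print(parse_find_response("S3 np=42"))
```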
Options
Options | Default | Settable? |
---|---|---|
op=<opname> | bm1 | Y |
aw=<cutoff weight> | -32767 | N |
gw=<weight breakpoint> | 0 | N |
target=<minimum required number of postings> | 0 (=unbounded) | N |
save=y \| 1 \| n \| 0 | 1 | Y |
k1=<real> (bm1100/1500/250/2500 only) | 1.2 | Y |
k2=<real> (bm1100/1500/250/2500 only) | 0.0 | Y |
bm25_b=<real> (bm250/2500 only) | 0.75 | Y |
nopos=0 \| 1 | 0 | Y |
p_unit=<int> (bm250 only) | 1 | Y |
p_step=<int> (bm250 only) | 1 | Y |
p_maxlen=<int> (bm250 only) | 20 | Y |
Options may be given in any order (but must follow all sets or set/weight pairs). Redundant options are ignored. If an option is repeated the last occurrence is used.
Special options
These cannot be used in conjunction with any other options (or with each other), and they only apply to a single set.
Options | Default | Settable? |
---|---|---|
top=<topnum> | none | N |
mark=<marknum> | 0 | N |
op
The following boolean and quasi-boolean operations work on all indexes.
and | A and B and C and ... |
not | A and not B and not C and not ... |
or | A or B or C or ... |
and2 | like and but doesn't affect posting weights in 1st set, nor does it add positional records from sets after the 1st |
not2 | like not but doesn't affect posting weights in 1st set |
'and2' and 'not2' are referred to as "Robertson limits".
The following group works on all indexes not of types 0 or 2 (all keyword indexes of current databases are OK).
samef | and within field |
sames | and within sentence |
adj | adjacent in forward order in same sentence but possibly with intervening stopwords |
adj2 | adjacent in forward order in same sentence with no intervening stopwords. Only works on indexes of types >= 3 |
Weighted operations
bm1 | stream weight is the supplied weight, summed over contributing streams; supported on all types of indexes |
bm11 | term weight for a document is the sum of:
for a constant k1, with a "global" correction
where nk is the number of terms in the search & k2 a constant. |
bm15 | As bm11 without the doclen and avedoclen components in the first part. |
bm25 |
As bm11 but with the doclen effect moderated (for nonzero b).
The term weight is:
|
bm250 | As bm2500 (below) but attempts to find best paragraph. Subject to a number of environment variables. "Text" type databases only |
The next three are for evaluation or development purposes where the values of the parameters k1 and k2 need to be controlled.
bm1100 | As bm11 but expects two parameters k1 and k2 |
bm1500 | As bm15 but expects two parameters k1 and k2 |
bm2500 | As bm25 but expects k1, k2, bm25_b. This is universal in the sense that it reduces to bm11, bm15 or bm1 for suitable values of b and k1 |
Note: bm1 with a global doclength correction may be obtained by using bm1100 or bm1500 with k1=0, k2!=0
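The way bm2500 reduces to the other functions can be checked numerically. The Python sketch below uses the commonly published Okapi BM25 term weight and global correction; this appendix does not spell the expressions out, so the exact form (including the `k1 + 1` scaling) is an assumption, the function names are hypothetical, and the BSS's integer weight arithmetic is not modelled.

```python
def bm2500_term_weight(w, tf, doclen, avedoclen, k1, b):
    """Assumed published form: w * (k1 + 1) * tf / (K + tf).

    K = k1 * ((1 - b) + b * doclen / avedoclen).
    b = 1 gives a bm11-style weight, b = 0 a bm15-style weight,
    and k1 = 0 returns the supplied weight w unchanged (bm1).
    """
    if tf == 0:
        return 0.0
    big_k = k1 * ((1.0 - b) + b * doclen / avedoclen)
    return w * (k1 + 1.0) * tf / (big_k + tf)


def global_correction(k2, nk, doclen, avedoclen):
    """Assumed global document-length correction; nk = number of query terms."""
    return k2 * nk * (avedoclen - doclen) / (avedoclen + doclen)


if __name__ == "__main__":
    # With k1 = 0 the term weight is just the supplied weight.
    print(bm2500_term_weight(50, 3, 100, 200, k1=0.0, b=0.75))
```

With `k1=0` and a nonzero `k2` only the global correction depends on document length, which matches the note on obtaining bm1 with a global doclength correction.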
Special operation
mark |
This operator accepts one or more sets, but no weights, and no options other than mark=<marknum>. It transcribes the left-hand operand, recording <marknum> in any output posting which is a member of any of the other operands (or, if <marknum> is negative, removing -<marknum> from it). If only one set is given all elements are marked. If the left-hand operand is already marked the new mark is superimposed (i.e. if positive, newmark = oldmark | mark; if negative, newmark = oldmark & ~mark). Note that if a marked set becomes an operand in a subsequent operation, marks are retained.
The default op is bm1, but may be set (see SET). Any other opname produces a NO_SUCH_OP error. |
nopos=0 | 1 | If nopos=1 no positional data or term frequencies are output. This speeds a large combine operation, reducing output substantially, and is sometimes useful in batch processing. If a set was produced with nopos=1 few subsequent operations can be done on it, nor can highlighted output be obtained. Default 0. Settable. |
aw | <cutoff weight> is just that. No documents with weight below aw will be retrieved. It defaults to -32767. aw is ignored if the operation is not of a weighted type. Not settable. |
gw | One gw value may be specified to give information on the number of postings with weights at least the gw value. For example, in the old interactive system, gw was often set to 2/3 maximum possible weight, and the number of postings reaching gw was reported as the number with "good" weight. It defaults to zero. gw is ignored if the operation is not of a weighted type. Not settable. |
target | If target is positive, output will be restricted as far as possible to the specified number of postings. More than the specified number of documents will usually be retrieved, but the user should discard any excess, as these will not be properly ordered (arguably this should be done by the BSS). It is guaranteed that the "best" documents will be retrieved, up to the target number. target is ignored if the operation is not of the best match type. Default 0. Not settable. |
save | As in FIND (1) |
k1, k2, bm25b, p_unit, p_step, p_maxlen |
These parameters control the document weight calculations in the best match ops bm1100/1500/2500/250. They are described individually below. The defaults are such that if none of them has been assigned a value since the current database was chosen bm1100/1500/2500 all behave like bm1. bm250 will retrieve the same documents and in the same order as bm1, but may find a subdocument with the same weight as the whole document. All these are settable. There is no validation. |
k1 | For ops bm1100/1500/250/2500. Ignored by other ops. Defaults to zero. Settable. (For bm11/15/25 k1 values are in the database parameter and cannot be varied.) |
k2 |
The global doclength correction parameter for ops
bm1100/1500/250/2500. Ignored by other ops. Defaults to
zero. Settable.
For bm11/15/25 k2-values are in the database parameter and cannot be varied. |
bm25_b | bm25b |
The b parameter for ops bm250/2500. Ignored by other
ops. Settable.
For bm25 b-values are in the database parameter and cannot be varied. |
p_unit, p_step, p_maxlen | Parameters only used by op bm250. Ignored by other ops. Default to (1, 1, 20) resp. Settable. |
top=<topnum> |
This option applies to only one set and no additional
options. It makes a new set consisting of the top-weighted
<topnum> elements of the set. If the input set has no
weights it produces the first <topnum> elements in
internal record number order. At present the output string is
defective, containing no weight information. Example:
|
mark=<marknum> | Makes a new set consisting of the elements of the input set which satisfy <input mark> & <mark> != 0 |
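The mark arithmetic is ordinary bit manipulation. The following Python sketch restates the two rules given for the mark operation and the mark= option; the function names are hypothetical, and a negative `marknum` is taken to mean "clear the bits of its absolute value".

```python
def apply_mark(oldmark, marknum):
    """Superimpose a mark: OR it in if positive, clear its bits if negative."""
    if marknum >= 0:
        return oldmark | marknum
    return oldmark & ~(-marknum)


def is_selected(input_mark, mark):
    """Selection rule of the mark= option: <input mark> & <mark> != 0."""
    return (input_mark & mark) != 0


if __name__ == "__main__":
    marked = apply_mark(0b001, 0b100)      # add bit 4 to a posting marked 1
    print(bin(marked), is_selected(marked, 0b100))
    cleared = apply_mark(marked, -0b100)   # remove bit 4 again
    print(bin(cleared), is_selected(cleared, 0b100))
```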
Note: aw and target are intended mainly to speed potentially lengthy searches; sets obtained using them are not suitable for inclusion in subsequent FIND (2) commands (except perhaps the "limiting" ops AND2 and NOT2). Setting a small target or a large aw may almost halve the processing time, as little output has to be done.
w | weight | fn=<fn #> n=<np> [ r=<r> bigr=<R> [ rload=<r_load> bigrload=<R_load> ] ] |
The response is a relevance weight as a 16-bit integer (10 * a log to base 2). n is the number of postings for the term (<np> from a FIND command). No default. Not settable.
fn | func | function |
Default 0. Not settable. |
r | The number of relevant documents containing the term which is being weighted. Default 0. Not settable. |
bigr | The number of relevant documents. Default 0. Settable. |
rload | Default 0. Settable. |
bigrload | Default 0. Settable. |
Note that the big-N argument (number of indexed documents) required by the Robertson/Sparck Jones functions cannot be supplied by the user; it is determined automatically. If there is a need for it a facility may be provided for allowing the user process to supply a big-N value.
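To give a feel for the "10 * a log to base 2" scale, the sketch below computes one widely published form of the Robertson/Sparck Jones weight with 0.5 corrections. It is an assumption that any of the BSS weight functions matches this form exactly; `rsj_weight` is a hypothetical name, big-N is passed explicitly only because this is a stand-alone illustration, and rload and bigrload are left out.

```python
import math


def rsj_weight(n, big_n, r=0, big_r=0):
    """Relevance weight scaled as 10 * log2(odds), rounded to an integer.

    n      number of postings for the term
    big_n  number of indexed documents
    r      relevant documents containing the term
    big_r  relevant documents
    """
    odds = ((r + 0.5) * (big_n - n - big_r + r + 0.5)) / (
        (n - r + 0.5) * (big_r - r + 0.5)
    )
    return round(10 * math.log2(odds))


if __name__ == "__main__":
    # A rarer term receives a larger weight than a common one.
    print(rsj_weight(10, 1000), rsj_weight(100, 1000))
```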
Weight errors
A function other than 0-2 will be reported as NO_SUCH_WEIGHTFUNC; failure to supply a value for n gives SYNTAX. Phoney values of n, r, bigr, rload and bigrload don't generate an error (perhaps they should); the weight is usually returned as 0 or 1.
A direct request by primary key (normally a sequential record number which runs from 1 to the number of records in the database). No set need have been formed. No default. Not settable. Error: RECORD_OUT_OF_RANGE_DB
A request for a record from a set. In this case records are numbered from zero to one less than the number in the set. <num> is the number of records requested. It defaults to 1. The set defaults to the most recent set formed or from which a record has been shown. <record> defaults to the next one, or the first if it is the first show command on a new default set. If set or record is specified they become the new default values. Set and record are not settable, except implicitly. Errors: NO_SUCH_SET, NO_SUCH_RECORD.
See also under MISSING COMMANDS.
There are a number of predefined formats, some only applicable to specific databases. The initial default format is 1, which works with any database, and delivers a complete record in the form
Empty fields are not shown. Where fields end with a linefeed (databases of type "text") there may be one or more empty lines between fields. Other databases (abstracting and indexing databases) do not normally contain linefeeds, and one linefeed is inserted at the end of each nonempty field. Other SHOW formats are shown in the following table. You should refer to the source code of the "okapi" gui (particularly hitlist.c, header.c and defs.h) for an example of the use of format 259.
Format | Description |
---|---|
0 | delivers the contents of field 1 only. By convention this is a document ID or <docno> (not the same as recnum). |
3 / 259 |
These are provisional formats and may change. They are intended
for applications where the originating system wants to process
the documents for itself. They deliver the entire unprocessed
text of a database record preceded by a header. The two formats
are identical except that 3 does not contain highlighting
records (the "number of highlight records" field
contains zero).
The header consists of a sequence of ASCII numbers right justified and character strings left justified in fields of fixed length as shown in the table below. |
value | datatype and length | offset from start of header |
---|---|---|
message length | %8d | 0 |
reserved | %32c | 8 |
set number | %6d | 40 |
record number within set (0-) | %10d | 46 |
internal record number | %10d | 56 |
weight | %10.3f | 66 |
mark | %3d | 76 |
unused | %7c | 79 |
offset of start of data | %6d | 86 |
number of directory records (D) | %3d | 92 |
[directory records] | [%19c] | 95 |
number of passage records (P) | %6d | 95 + 19D |
[passage records] | [%26c] | 101 + 19D |
number of highlight records (H) | %6d | 101 + 19D + 26P |
[highlight records] | [%17c] | 107 + 19D + 26P |
data | | 107 + 19D + 26P + 17H |
directory record | field number (1- ) %3d | offset from start of data %8d | field length %8d |
passage record | weight %10.3f | offset from start of data %8d | length %8d |
highlight record | code (unused) %1c | offset from start of data %8d | length %8d |
At present there are always either zero or two passage records. If there are two, one is for the whole record and the other is for the "best" passage; the one with the higher weight appears first, and it is this weight which appears in the entire record's "weight" field.
Highlight records will be ordered by increasing offset from start of data.
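The fixed-width layout above translates directly into slice arithmetic. The following Python sketch decodes a format 3 / 259 header from a byte buffer; `parse_header` and the dictionary keys are hypothetical names, and it assumes the buffer begins at offset 0 of the header and that all header fields are ASCII. Since the formats are described as provisional, treat this as an illustration of the table rather than a stable decoder.

```python
def _int(buf, start, width):
    return int(buf[start:start + width].decode("ascii"))


def _float(buf, start, width):
    return float(buf[start:start + width].decode("ascii"))


def parse_header(buf):
    """Decode the fixed-width header described in the table above."""
    hdr = {
        "message_length": _int(buf, 0, 8),
        "set": _int(buf, 40, 6),
        "record": _int(buf, 46, 10),
        "internal_record": _int(buf, 56, 10),
        "weight": _float(buf, 66, 10),
        "mark": _int(buf, 76, 3),
        "data_offset": _int(buf, 86, 6),
    }
    pos = 92
    ndir = _int(buf, pos, 3)
    pos += 3
    hdr["directory"] = []   # (field number, offset, length), 19 bytes each
    for _ in range(ndir):
        hdr["directory"].append(
            (_int(buf, pos, 3), _int(buf, pos + 3, 8), _int(buf, pos + 11, 8)))
        pos += 19
    npass = _int(buf, pos, 6)
    pos += 6
    hdr["passages"] = []    # (weight, offset, length), 26 bytes each
    for _ in range(npass):
        hdr["passages"].append(
            (_float(buf, pos, 10), _int(buf, pos + 10, 8), _int(buf, pos + 18, 8)))
        pos += 26
    nhigh = _int(buf, pos, 6)
    pos += 6
    hdr["highlights"] = []  # (code, offset, length), 17 bytes each
    for _ in range(nhigh):
        hdr["highlights"].append(
            (buf[pos:pos + 1].decode("ascii"),
             _int(buf, pos + 1, 8), _int(buf, pos + 9, 8)))
        pos += 17
    hdr["header_length"] = pos  # 107 + 19D + 26P + 17H
    return hdr
```

Offsets in the directory, passage and highlight records are relative to the start of the data, so a field's bytes would be `buf[hdr["header_length"] + offset : hdr["header_length"] + offset + length]` under the same assumptions.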
Format | Description |
---|---|
197 | shows field 1 and weight. |
100 |
Like format 197 but gives more information provided the set has paragraph information (was retrieved using bm250).
Example: SJMN91-06222010 190 190 4 13 188 0 32767
The second and subsequent fields are max(best passage weight, document weight), weight of best passage, best passage start para number, best passage finish para number, weight of whole document (the final two fields just indicate that the preceding weight applies to the whole document). In future more than one passage weight might be shown. Paragraph numbers are counted from zero. |
255 |
Tries to give information about the posting, rather
cryptically. It is really meant for diagnostic purposes, and may
not be supported in the future in its present form. It may,
however, be used to give within-document term
frequency. Provided the output is from an atomic set (a set from
looking up a single term or the result of an adj or adj2
operation) the first colon-delimited field is the number of
occurrences of the term or phrase in the record. The other
fields will be documented when I discover under what conditions
this format works!
Example:
S10 np=106057 t=system
show s=10 r=0 f=255
10:00000003:0:0:8a560003:......
(the term frequency is 10 in this record) |
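The format 100 line shown in the table above splits cleanly on white space. A minimal Python sketch follows; `parse_format_100` and the key names are hypothetical, weights are read as floats in case they are not always whole numbers, and the final two fields are kept uninterpreted.

```python
def parse_format_100(line):
    """Split a format 100 line into named fields."""
    fields = line.split()
    return {
        "docid": fields[0],
        "best_weight": float(fields[1]),       # max(passage, document) weight
        "passage_weight": float(fields[2]),
        "passage_first_para": int(fields[3]),  # paragraphs count from zero
        "passage_last_para": int(fields[4]),
        "document_weight": float(fields[5]),
        "trailing": [int(f) for f in fields[6:]],
    }


if __name__ == "__main__":
    rec = parse_format_100("SJMN91-06222010 190 190 4 13 188 0 32767")
    print(rec["docid"], rec["passage_first_para"], rec["passage_last_para"])
```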
Adding 256 to the format number highlights the terms by which the record has been retrieved if possible. The highlighting codes are vt100 at present, but this might in the future become settable, or be replaced by some universal code which the client will replace by whatever is locally required.
Deletes the sets given by number, or all sets if "all". No error is reported, even for an invalid set number.
Arguments | Initial default | Errors |
---|---|---|
a \| attr \| attribute=<attribute name> | default | NO_SUCH_ATTRIBUTE |
save=0 \| n \| 1 \| y \| Y \| t \| T (0 and n are equivalent) | 1 | none |
type \| search_type=-2 \| -1 \| 0 \| 1 \| 2 | 0 | none, but a value which can't be read as an integer in this range is ignored. |
nopos=0 \| 1 | 0 | none |
k1=<number> | 0.0 | anything which doesn't look like a number gives SYNTAX |
k2=<number> | 0.0 | anything which doesn't look like a number gives SYNTAX |
bm25b \| bm25_b=<number> | 1.0 | anything which doesn't look like a number gives SYNTAX |
passage_unit \| p_unit=<positive integer> | 1 | anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored. |
passage_step \| p_step=<positive integer> | 1 | anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored. |
passage_maxlen \| p_maxlen=<positive integer> | 20 | anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored. |
bigr=<non-negative integer> | 0 | anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored. |
bigrload=<non-negative integer> | 0 | anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored. |
rload=<non-negative integer> | 0 | anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored. |
op=<opname> | bm1 | NO_SUCH_OP, SYNTAX |
Exhibits various things.
display db | database | outputs "No database" if none open, otherwise the database name. |
display stemfunctions | stemfuncs | stemfns | Outputs the names of the available stemming functions, one per line, preceded by a line of the form "<n> function(s)". |
See also STEM.
displaystats | ds [s=<setnumber>]
Outputs "sum=<sum of weights in set> sumsq=<sum of squares of weights in set>". If no set is specified the "current" set is assumed. Using this command does not alter the system's notion of current set.
This command may be used on any set, but at present the only sets which may give nonzero statistics are ones obtained by using a Robertson limit (op=and2 or op=not2; see above). In addition the environment variable BSS_COMBINE_DO_STATS (see ENVIRONMENT VARIABLES below) must be set and nonzero.
Errors: NO_SUCH_SET
info database | <database_name> | Gives more detailed information on the current or named database. This may include type of data, number of fields and what they contain, number of records, mean and greatest length, and the information returned by "info attributes". |
info attributes | Responds with the preferred names of the available search attributes and a brief description of what each one accesses. |
info rn=<recnum> | info s=<setnum> r=<record> |
Return something of the form:
rn=<recnum> length=<bytes> fd=<fdnum> length=<bytes> paras=<num> ... |
The last info command applies mainly to databases of type
"text".
show |
[ highlight ] fd=<fdnum> [offset=<byte>
length=<bytes> ] | para=<start para> paras=<num> ] ... rn=<recnum> | s=<setnum> r=<record> |
The "para" and "paras" qualifiers would only apply to databases of type "text".
The maximum number of sets is 16384 (numbered 0-16383). Each set occupies at least 232 bytes of memory, and may also have one or two files in the $BSS_TEMPPATH directory. If a FIND operation would result in too many sets it fails with a "NO_FREE_SETS" error. The memory for sets is now dynamically allocated, but this memory is never freed until there is a successful subsequent CHOOSE command.
The maximum number of sets which can be combined in a single FIND operation is operating-system dependent. On the Sun OS 4.1 machines you are recommended not to exceed about 110 (it may appear to work with up to about 238, but results are likely to be spurious). With SOLARIS it should be OK to go up to about 238.
There have been problems PARSE-ing long input. This seems to have been cured by using flex instead of lex to compile the command parser. (But note that the command parser will always stop at a linefeed; you may need to replace linefeeds by blanks before using either of the PARSE commands; we haven't found a way of getting round this.)
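Replacing linefeeds before sending text to either PARSE command is a one-line job in most languages. A minimal Python sketch, with a hypothetical function name; treating carriage returns the same way is an assumption made here for text that arrives with DOS-style line endings.

```python
def flatten_for_parse(text):
    """Replace linefeeds (and, by assumption, carriage returns) by blanks."""
    return text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")


if __name__ == "__main__":
    print(repr(flatten_for_parse("information\nretrieval systems\n")))
```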
Three environment variables are read by the iinit() function. These are:
The low-level interface program (i1+) looks for additional environment variables. These are
BSS_DB | If set the interface will perform a CHOOSE command on the named database. |
BSS_ATTRIBUTE | If present the corresponding SET command will be issued after the specified database has been opened. |
NOTE: if BSS_ATTRIBUTE is set without BSS_DB, there will be a "NO_DB_OPEN" error.
The main use of these environment variables is in scripts which load the low-level interface with the database set to search on index "dn" before handing it over to the user.
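A wrapper of this kind only needs to set the two variables before starting the interface. The Python sketch below builds such an environment; `bss_environment` is a hypothetical helper, the database name in the usage comment is a placeholder, and the commented `subprocess.run` call assumes the `i1+` program is on the search path.

```python
import os


def bss_environment(database=None, attribute=None, base=None):
    """Return a copy of the environment with BSS_DB / BSS_ATTRIBUTE set."""
    env = dict(os.environ if base is None else base)
    if database is not None:
        env["BSS_DB"] = database
    if attribute is not None:
        env["BSS_ATTRIBUTE"] = attribute
    # BSS_ATTRIBUTE without BSS_DB would lead to a NO_DB_OPEN error.
    if "BSS_ATTRIBUTE" in env and "BSS_DB" not in env:
        raise ValueError("BSS_ATTRIBUTE is set but BSS_DB is not")
    return env


if __name__ == "__main__":
    env = bss_environment("mydb", "dn")  # "mydb" is a placeholder name
    print(env["BSS_DB"], env["BSS_ATTRIBUTE"])
    # To hand the interface over to the user, something like:
    #   import subprocess
    #   subprocess.run(["i1+"], env=env)
```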
A number of other environment variables are looked for by several BSS functions. Some of them control the parameters for passage searching, or for logistic regression (not documented here). Another is BSS_COMBINE_DO_STATS, which must be nonzero for the DISPLAYSTATS command to produce nonzero results.