Okapi-Pack

Centre For Interactive Systems Research
City University
London EC1V 0BH

Appendix J: BSS Commands Reference

This appendix gives a brief description of most of the commands available in the low-level interface to the Okapi BSS. Commands intended specifically for users wishing to implement relevance feedback systems are not included as they are regarded as being at a higher level; it may be better to construct your own functions using the low-level commands.

1. The BSS Library.

There is one library, libi0+.a. Corresponding to it is an executable, i1+, which may be used as a command-line interface, for trying things out, or in shell scripts. Some examples of this are shown in Section 4: Using the BSS.

There are three functions in the library:

iinit()
iexit()
i0()

1.1. iinit()

This function must be called before any commands are issued.

1.2. iexit()

This function cleans up (deletes temporary files, frees memory) at the end of a program. It doesn't perform an exit().

1.3. i0()

This function is called with two character arrays, i.e.

return_code = i0(command, response)

<command>

This must contain a valid BSS command. In the case of relevance feedback systems, it must be quite large if you are going to request parsing of long pieces of text.

<response>

This will contain the result returned by <command>. It must be large enough to hold anything i0() may produce (possibly up to several megabytes; see ACCEPT and PREFER below).

<return_code>

This will be an integer:

Zero: The <command> executed OK

Positive: A warning was issued (i.e. not a fatal error).

Negative: A definite error has been detected. There is a list of codes and messages in bss_errors.h, and the command perror() elicits the error message corresponding to the most recent error or warning. These messages are not designed to be shown to end users in this form.
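The three cases can be captured in a small helper on the calling side. The following is a minimal sketch in Python rather than C; the function name classify_return_code is hypothetical, and the positive code 3 is an arbitrary illustrative value (only -29 is quoted in this appendix).

```python
def classify_return_code(return_code: int) -> str:
    """Map an i0() return code onto the three cases described above."""
    if return_code == 0:
        return "ok"
    if return_code > 0:
        return "warning"   # not fatal; the response may still be usable
    return "error"         # look the code up in bss_errors.h


# 3 is an arbitrary positive value; -29 is the syntax error code quoted
# later in this appendix.
for code in (0, 3, -29):
    print(code, classify_return_code(code))
```

A caller would normally carry on after a warning and stop, or retry with a corrected command, after an error.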

The C source code for a straightforward example of the use of i0() is given in Appendix L. A more complex example of how this can be done can be seen in the source code for the GUI to the BSS described in The Interface to the BSS.

Other error messages are written to Err_file, and this may be output to stderr, written to a file or discarded, depending on the compilation. They will sometimes be useful to a system developer. It must be emphasised that the error-trapping system is not well developed. It is hoped that it will improve with time and experience.

2. Notes on commands and arguments

All commands, arguments, qualifiers etc., EXCEPT terms to be looked up or parsed, are case independent. Nearly all arguments and qualifiers are to be given in the form

<argument name>=value

There are a few cases where the <name>= is unnecessary and a few where it is invalid (e.g. set defaults).

White space is ignored around the '='.

Most arguments have default values. Some of these are settable using a SET command; the current value of all settable defaults can be obtained by using the SET command with no arguments. In previous versions of the interface some arguments were implicitly set by being used in a command. This is now true ONLY of the current set number (s=..., set=...) and the current record number (rec=..., record=...). These are automatically set to the most recently produced, or SHOWn, set; and to the next, or first, record of the current set.

3. Notes on errors

Where possible, errors are trapped by i0() itself (i.e. inside the BSS), in which case they often result in a fairly specific error code. Some errors, failure to supply a needed argument for example, cannot go beyond the interface parser. These are simply reported as SYNTAX, sometimes followed by the token at which the parser failed, for example

-29, interface syntax error, no database specified

"System"-type failures (e.g. shortage of memory, missing files, shortage of file descriptors) are scarcely handled at all; this may be done sometime.

One error which can arise from most of the commands is NO_DB_OPEN.

4. The Commands.

4.1. INFO | INF <topic>

info databases

responds with a line containing "<n> database(s)" followed by a line for each available database consisting of the database name followed by a very brief description.

Other INFO topics are described below.

4.2. CHOOSE | CH

choose [arg] <database_name>

Successfully choosing a database resets all values to their defaults (see SET) and destroys any existing search sets.

There is an optional argument dbflags=<value> or db_flags=<value> which affects the facilities which are loaded when the database is opened. The current status of this is unknown, and you are advised not to use it.

4.3. PARSE | P

parse [attribute | attr | a=<attribute mnemonic>] t=<string>

Parses the string into individual terms in a form suitable for looking up in the database. Used to determine what terms to look for with a FIND(1) command.

For a description of the "attr" qualifier see under FIND (1). The response is:

<number of non-stopped terms> [t=<term>] ..

The number may of course be 0. Only indexed terms are output. An output term may be a GSL token of the form @nnnn.

4.4. SUPERPARSE | SP

superparse [attr | a=<attribute mnemonic>] t=<string>

As PARSE except that more information is given, and terms are output even if they are not indexed (in which case they have GSL class H or F; see Appendix D: The GSL File). The response is:

<number of terms> [t=<term> c=< gsl class > s=<source>] ..

<source> is the portion of the input which gave rise to this term.

4.5. STEM

stem stemfunction | stemfunc | stemfn=<stem function name> t=<term>

Applies the named function to the given term. There is no default stem function. The response is:

t=<stemmed term>

This utility function is independent of the current database (if any). Even if there is an open database the stemmed term may not occur in any of the database's indexes. Names of the available functions may be obtained using the DISPLAY command. Note that the <term> is treated by the stem function as if it were a single word.

Errors: an unrecognized stem function gives NO_SUCH_STEMFUNCTION

4.6. FIND | F

4.6.1. FIND (1)

find <qualifiers> [t=<term>]

NB one term only, and the "t=<term>" must come last. The response is:

[ s<n> ] np=<np> t=<term found>

<term found> will be empty (nothing following the "t=") if type=0 and specified term is not in the index or type is nonzero and an extremity of the index has been reached.
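A caller usually needs the set number, the number of postings and the term from this response. The sketch below shows one way to pull them apart in Python. It assumes the three items are separated by white space exactly as in the template above, and that the set number is absent when save=n; the names FindResult and parse_find1_response are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class FindResult(NamedTuple):
    set_number: Optional[int]   # None when no set was made (save=n)
    postings: int               # the <np> value
    term: str                   # empty if nothing was found


# Optional "S<n>" (either case), then np=<np>, then t=<term> to end of line.
_FIND1 = re.compile(r"^\s*(?:[sS](\d+)\s+)?np=(\d+)\s+t=(.*)$")


def parse_find1_response(response: str) -> FindResult:
    match = _FIND1.match(response.rstrip("\n"))
    if match is None:
        raise ValueError(f"unrecognised FIND response: {response!r}")
    set_text, np_text, term = match.groups()
    set_number = int(set_text) if set_text is not None else None
    return FindResult(set_number, int(np_text), term)


print(parse_find1_response("S10 np=106057 t=system"))
print(parse_find1_response("np=0 t="))
```

The first sample line is the response shown in the format 255 example later in this appendix; the second is an assumed response for a term that is not in the index.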

The system remembers the last term found by a FIND (1) command. This is its notion of "current term" (initially empty). If no term is specified in a FIND command the current term will be looked up. This is not useful unless the "type" qualifier (see below) is 2 or -2. If type is 2 the "empty" FIND command will find the next term in the index; if type is -2, the previous term. For example, to look at all the terms starting with "dog", look up "dog" with type=1 and then issue FIND commands with type=2 and no term, stopping when the term found no longer starts with "dog".

Qualifiers	Default	Settable?
save=y | 1 | n | 0	1 (y and 1 are synonyms, as are n and 0)	yes
type=-2 | -1 | 0 | 1 | 2	0	yes
attribute | attr | a=<attribute mnemonic>	default	yes

save

If save=y or 1 (default) a set is made, otherwise no set is made. save=n may be faster if you only want to find the number of postings for a term, and there will be no unwanted set to be deleted.

type

Value	Function
-2	finds the greatest term less than the specified term
-1	finds the greatest term not greater than the specified term
0 (default)	finds the specified term or nothing
1	finds the least term not less than the specified term
2	finds the least term greater than the specified term

An invalid type value causes a NO_SUCH_SEARCHTYPE error.
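The five type values correspond to the usual lower-bound and upper-bound searches on a sorted list. The sketch below models the index as a sorted Python list, which is an assumption: the real index ordering may differ from Python's string ordering. The names find_term and terms_with_prefix are hypothetical, and the second function mimics walking through all terms that start with "dog".

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Sequence


def find_term(index: Sequence[str], term: str, search_type: int = 0) -> Optional[str]:
    """Return the term selected by the given type, or None if there is none."""
    lo = bisect_left(index, term)    # first position holding a term >= term
    hi = bisect_right(index, term)   # first position holding a term > term
    if search_type == 0:
        pos = lo if lo < hi else -1  # exact match or nothing
    elif search_type == 1:
        pos = lo                     # least term not less than term
    elif search_type == 2:
        pos = hi                     # least term greater than term
    elif search_type == -1:
        pos = hi - 1                 # greatest term not greater than term
    elif search_type == -2:
        pos = lo - 1                 # greatest term less than term
    else:
        raise ValueError("NO_SUCH_SEARCHTYPE")
    return index[pos] if 0 <= pos < len(index) else None


def terms_with_prefix(index: Sequence[str], prefix: str) -> List[str]:
    """Start with type=1, then step forward with type=2 while the prefix holds."""
    found = []
    term = find_term(index, prefix, 1)
    while term is not None and term.startswith(prefix):
        found.append(term)
        term = find_term(index, term, 2)
    return found


index = ["cat", "dog", "dogma", "dogs", "dove"]
print(terms_with_prefix(index, "dog"))
```

Returning None here stands in for the empty term that the BSS reports when an extremity of the index has been reached.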

a | attr | attribute

This determines the type of data which will be searched, i.e. which index, and, for some kinds of index, the class of objects within that index. There is always a default attribute, which may be referred to as "default" (e.g. in a SET command). Typically, this is some kind of general keyword attribute. It is defined by the first entry in the database's search_groups file. Usually there also exists an attribute called "dn" (document number). Apart from these two, the only way to determine what attributes there are for a database is to look at the search_groups parameter file for the database, although if the database manager is sensible they will have fairly self-evident names like "ti", "au", "ab".

A non-existent attribute mnemonic causes a NO_SUCH_ATTRIBUTE error.

4.6.2. FIND (2)

find s=<setnum> [w=<weight>] [s=<setnum> [w=<weight>]] [<options>]

NB sets and weights must come before options

<setnum> must be the number of a set which has been created and not deleted. Set numbers run from zero. There is a limit on the number of sets, currently about 1000. Deleted setnumbers are reused, always lowest free setnumber first. A non-existent set causes a NO_SUCH_SET error.
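The reuse rule (lowest free set number first) can be modelled with a min-heap of freed numbers. This is an illustrative sketch only, not the BSS implementation; the class SetNumberAllocator and its limit parameter are hypothetical.

```python
import heapq


class SetNumberAllocator:
    """Hands out set numbers from zero, reusing the lowest freed number first."""

    def __init__(self, limit: int) -> None:
        self._limit = limit      # maximum number of sets (assumed parameter)
        self._next_unused = 0    # lowest number never handed out so far
        self._freed = []         # min-heap of deleted set numbers
        self._live = set()       # numbers currently in use

    def allocate(self) -> int:
        if self._freed:
            number = heapq.heappop(self._freed)
        elif self._next_unused < self._limit:
            number = self._next_unused
            self._next_unused += 1
        else:
            raise RuntimeError("NO_FREE_SETS")
        self._live.add(number)
        return number

    def delete(self, number: int) -> None:
        # Deleting an unknown set is silently ignored.
        if number in self._live:
            self._live.remove(number)
            heapq.heappush(self._freed, number)


sets = SetNumberAllocator(limit=1000)
first_three = [sets.allocate() for _ in range(3)]   # 0, 1, 2
sets.delete(1)
sets.delete(0)
reused = [sets.allocate() for _ in range(3)]        # 0, 1, then 3
print(first_three, reused)
```

Because freed numbers always take priority over unused ones, a client should not assume that a larger set number means a more recent set.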

If the operation is of the best match type a weight must be given for each set which is not itself the result of a best match operation. Missing weight will give a SET_NO_WEIGHT error. If a weight is given for a set which is the result of a best match operation it is ignored (arguably this is wrong, and may in the future be changed).

Normally, weights should be determined by using the weight command. If weights are derived in some other way they must be smallish integers, typically in the range 1-200; weight "overflow" (> 32767) is not detected and will give spurious ordering of retrieved documents.

The response is:

non-weighted operation:
[S<n> ] np=<np>
weighted operation:
[S<n> ] np=<np> maxwt=<maxwt> nmaxwt=<# with max wt> ngw=<ngw> mpw=<max poss wt> mpw=<# with max poss wt>

Options

Options	Default	Settable?
op=<opname>	bm1	Y
aw=<cutoff weight>	-32767	N
gw=<weight breakpoint>	0	N
target=<minimum required number of postings>	0 (=unbounded)	N
save=y | 1 | n | 0	1	Y
k1=<real> (bm1100/1500/250/2500 only)	1.2	Y
k2=<real> (bm1100/1500/250/2500 only)	0.0	Y
bm25_b=<real> (bm250/2500 only)	0.75	Y
nopos=0 | 1	0	Y
p_unit=<int> (bm250 only)	1	Y
p_step=<int> (bm250 only)	1	Y
p_maxlen=<int> (bm250 only)	20	Y

Options may be given in any order (but must follow all sets or set/weight pairs). Redundant options are ignored. If an option is repeated the last occurrence is used.

Special options

These cannot be used in conjunction with any other options (or with each other), and they only apply to a single set.

Options	Default	Settable?
top=<topnum>	none	N
mark=<marknum>	0	N

The following boolean and quasi-boolean operations work on all indexes.

and	A and B and C and ...
not	A and not B and not C and not ...
or	A or B or C or ...
and2	like and but doesn't affect posting weights in 1st set, nor does it add positional records from sets after the 1st
not2	like not but doesn't affect posting weights in 1st set

'and2' and 'not2' are referred to as "Robertson limits".

The following group works on all indexes not of types 0 or 2 (all keyword indexes of current databases are OK).

samef	and within field
sames	and within sentence
adj	adjacent in forward order in same sentence but possibly with intervening stopwords
adj2	adjacent in forward order in same sentence with no intervening stopwords. Only works on indexes of types >= 3

Weighted operations

bm1	stream weight is the supplied weight, summed over contributing streams; supported on all types of indexes
bm11	term weight for a document is the sum of: stream weight * (k1 + 1) * tf / (k1 * doclen/avedoclen + tf), for a constant k1, with a "global" correction k2 * nk * (avedoclen - doclen) / (avedoclen + doclen), where nk is the number of terms in the search and k2 is a constant.
bm15	As bm11 without the doclen and avedoclen components in the first part.
bm25	As bm11 but with the doclen effect moderated (for nonzero b). The term weight is: stream weight * (k1 + 1) * tf / (k1 * (b * doclen/avedoclen + 1 - b) + tf)
bm250	As bm2500 (below) but attempts to find best paragraph. Subject to a number of environment variables. "Text" type databases only

The next three are for evaluation or development purposes where the values of the parameters k1 and k2 need to be controlled.

bm1100	As bm11 but expects two parameters k1 and k2
bm1500	As bm15 but expects two parameters k1 and k2
bm2500	As bm25 but expects k1, k2, bm25_b. This is universal in the sense that it reduces to bm11, bm15 or bm1 for suitable values of b and k1

Note: bm1 with a global doclength correction may be obtained by using bm1100 or bm1500 with k1=0, k2!=0
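The relationship between these functions is easiest to see numerically. The sketch below implements the bm2500 term weight and the global correction as given in the table above, then checks the stated reductions to bm11 (b=1), bm15 (b=0) and bm1 (k1=0). The function names are hypothetical, and how the BSS combines the correction with the term weights internally is not modelled here.

```python
def bm2500_term_weight(w: float, tf: int, doclen: float, avedoclen: float,
                       k1: float, b: float) -> float:
    """w * (k1 + 1) * tf / (k1 * (b * doclen/avedoclen + 1 - b) + tf)."""
    if tf == 0:
        return 0.0
    length_part = k1 * (b * doclen / avedoclen + 1.0 - b)
    return w * (k1 + 1.0) * tf / (length_part + tf)


def global_correction(k2: float, nk: int, doclen: float, avedoclen: float) -> float:
    """k2 * nk * (avedoclen - doclen) / (avedoclen + doclen)."""
    return k2 * nk * (avedoclen - doclen) / (avedoclen + doclen)


# Illustrative values: stream weight 10, tf 3, document twice the average length.
as_bm11 = bm2500_term_weight(10, 3, 200, 100, k1=1.0, b=1.0)   # 60 / 5
as_bm15 = bm2500_term_weight(10, 3, 200, 100, k1=1.0, b=0.0)   # 60 / 4
as_bm1 = bm2500_term_weight(10, 3, 200, 100, k1=0.0, b=0.75)   # just w
print(as_bm11, as_bm15, as_bm1)
```

With k1=0 the term frequency and document length drop out entirely, which is why the supplied weight alone determines the ranking in that case.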

Special operation

mark	This operator accepts one or more sets, but no weights, and no options other than mark=<marknum>. It transcribes the left-hand operand, recording <marknum> in (or, if <marknum> is negative, removing -<marknum> from) any output posting which is a member of any of the other operands. If only one set is given all elements are marked. If the left-hand operand is already marked the new mark is superimposed (i.e. if positive, newmark = oldmark | mark; if negative, newmark = oldmark & ~mark). Note that if a marked set becomes an operand in a subsequent operation, marks are retained.
op	The default op is bm1, but may be set (see SET). Any opname other than those listed above produces a NO_SUCH_OP error.
nopos=0 | 1	If nopos=1 no positional data or term frequencies are output. This speeds a large combine operation, reducing output substantially, and is sometimes useful in batch processing. If a set was produced with nopos=1 few subsequent operations can be done on it, nor can highlighted output be obtained. Default 0. Settable.
aw	<cutoff weight> is just that. No documents with weight below aw will be retrieved. It defaults to -32767. aw is ignored if the operation is not of a weighted type. Not settable.
gw	One gw value may be specified to give information on the number of postings with weights at least the gw value. For example, in the old interactive system, gw was often set to 2/3 maximum possible weight, and the number of postings reaching gw was reported as the number with "good" weight. It defaults to zero. gw is ignored if the operation is not of a weighted type. Not settable.
target	If target is positive, output will be restricted as far as possible to the specified number of postings. More than the specified number of documents will usually be retrieved, but the user should discard any excess, as these will not be properly ordered (arguably this should be done by the BSS). It is guaranteed that the "best" documents will be retrieved, up to the target number. target is ignored if the operation is not of the best match type. Default 0. Not settable.
save	As in FIND (1)
k1, k2, bm25b, p_unit, p_step, p_maxlen	These parameters control the document weight calculations in the best match ops bm1100/1500/2500/250. They are described individually below. The defaults are such that if none of them has been assigned a value since the current database was chosen bm1100/1500/2500 all behave like bm1. bm250 will retrieve the same documents and in the same order as bm1, but may find a subdocument with the same weight as the whole document. All these are settable. There is no validation.
k1	For ops bm1100/1500/250/2500. Ignored by other ops. Defaults to zero. Settable. (For bm11/15/25 k1 values are in the database parameter and cannot be varied.)
k2	The global doclength correction parameter for ops bm1100/1500/250/2500. Ignored by other ops. Defaults to zero. Settable. For bm11/15/25 k2-values are in the database parameter and cannot be varied.
bm25_b | bm25b	The b parameter for ops bm250/2500. Ignored by other ops. Settable. For bm25 b-values are in the database parameter and cannot be varied.
p_unit, p_step, p_maxlen	Parameters only used by op bm250. Ignored by other ops. Default to (1, 1, 20) resp. Settable.
top=<topnum>	This option applies to only one set and no additional options. It makes a new set consisting of the top-weighted <topnum> elements of the set. If the input set has no weights it produces the first <topnum> elements in internal record number order. At present the output string is defective, containing no weight information. Example: find s=10 top=1000
mark=<marknum>	Makes a new set consisting of the elements of the input set which satisfy <input mark> & <mark> != 0

Note: aw and target are intended mainly to speed potentially lengthy searches; sets obtained using them are not suitable for inclusion in subsequent FIND (2) commands (except perhaps the "limiting" ops AND2 and NOT2). Setting a small target or a large aw may almost halve the processing time, as little output has to be done.
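The mark arithmetic described for the mark operation and the mark=<marknum> option is plain bit manipulation. The sketch below assumes that a negative <marknum> clears the bits of its absolute value, which is one reading of "removing -<marknum>"; the function names are hypothetical.

```python
def apply_mark(old_mark: int, marknum: int) -> int:
    """Superimpose a mark on a posting, or remove it if marknum is negative."""
    if marknum >= 0:
        return old_mark | marknum      # newmark = oldmark | mark
    return old_mark & ~(-marknum)      # newmark = oldmark & ~mark


def has_mark(posting_mark: int, mark: int) -> bool:
    """Selection test used by mark=<marknum>: <input mark> & <mark> != 0."""
    return (posting_mark & mark) != 0


marked = apply_mark(0b001, 0b100)      # both bits now set
cleared = apply_mark(marked, -0b100)   # back to the original mark
print(bin(marked), bin(cleared), has_mark(marked, 0b100))
```

Using one bit per mark lets several independent marks coexist on the same posting.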

4.7. WEIGHT | W

weight | w fn=<fn #> n=<np> [ r=<r> bigr=<R> [ rload=<r_load> bigrload=<R_load> ] ]

The response is a relevance weight as a 16-bit integer (10 * a log to base 2). n is the number of postings for the term (<np> from a FIND command). No default. Not settable.

fn | func | function

fn=1	with n, r and R gives a Robertson/Sparck Jones (R/S-J) F4 predictive weight, with halves.
fn=0	is the same as fn=1 with r=R=0 (i.e. with no relevance information).
fn=2	adds r_load and R_load to r and R (resp.) in the numerator ("p"-portion) of the R/S-J F4 function.

Default 0. Not settable.

r

The number of relevant documents containing the term which is being weighted. Default 0. Not settable.

bigr

The number of relevant documents. Default 0. Settable.

rload

Default 0. Settable.

bigrload

Default 0. Settable.

Note that the big-N argument (number of indexed documents) required by the Robertson/Sparck Jones functions cannot be supplied by the user; it is determined automatically. If there is a need for it a facility may be provided for allowing the user process to supply a big-N value.
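For orientation, the sketch below computes the Robertson/Sparck Jones F4 weight "with halves" in its usual published form and scales it as 10 times a base-2 logarithm, as the response description suggests. It covers fn=0 and fn=1 only. The function name rsj_weight is hypothetical, big_n is passed explicitly here although the BSS determines it automatically, and the value is left as a float because the BSS's rounding to an integer is not specified in this appendix.

```python
import math


def rsj_weight(n: int, big_n: int, r: int = 0, big_r: int = 0) -> float:
    """10 * log2 of the F4 odds ratio, with 0.5 added to each cell."""
    p_odds = (r + 0.5) / (big_r - r + 0.5)                  # relevant documents
    q_odds = (n - r + 0.5) / (big_n - n - big_r + r + 0.5)  # the remainder
    return 10.0 * math.log2(p_odds / q_odds)


no_feedback = rsj_weight(n=10, big_n=1000)                 # fn=0: r = R = 0
with_feedback = rsj_weight(n=10, big_n=1000, r=5, big_r=8)
print(round(no_feedback, 1), round(with_feedback, 1))
```

With no relevance information the weight depends only on how rare the term is, so it behaves like an inverse document frequency weight.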

Weight errors

A function other than 0-2 will be reported as NO_SUCH_WEIGHTFUNC; failure to supply a value for n gives SYNTAX. Phoney values of n, r, bigr, rload and bigrload don't generate an error (perhaps they should); the weight is usually returned as 0 or 1.

4.8. SHOW | S

Document requests

show [format | f=<fmt #>] rn | recnum=<database record number>
A direct request by primary key (normally a sequential record number which runs from 1 to the number of records in the database). No set need have been formed. No default. Not settable. Error: RECORD_OUT_OF_RANGE_DB
show [format | f=<fmt #>] [n=<num>] [r=<record>] [s=<set #>]
A request for a record from a set. In this case records are numbered from zero to one less than the number in the set. <num> is the number of records requested. It defaults to 1. The set defaults to the most recent set formed or from which a record has been shown. <record> defaults to the next one, or the first if it is the first show command on a new default set. If set or record is specified they become the new default values. Set and record are not settable, except implicitly. Errors: NO_SUCH_SET, NO_SUCH_RECORD.

**SHOW formats**
Format	Description
0	delivers the contents of field 1 only. By convention this is a document ID or <docno> (not the same as recnum).
3 / 259	These are provisional formats and may change. They are intended for applications where the originating system wants to process the documents for itself. They deliver the entire unprocessed text of a database record preceded by a header. The two formats are identical except that 3 does not contain highlighting records (the "number of highlight records" field contains zero). The header consists of a sequence of ASCII numbers right justified and character strings left justified in fields of fixed length as shown in the table below.

**The structure of SHOW header information**
value	datatype and length	offset from start of header
message length	%8d	0
reserved	%32c	8
set number	%6d	40
record number within set (0-)	%10d	46
internal record number	%10d	56
weight	%10.3f	66
mark	%3d	76
unused	%7c	79
offset of start of data	%6d	86
number of directory records (D)	%3d	92
[directory records]	[%19c]	95
number of passage records (P)	%6d	95 + 19D
[passage records]	[%26c]	101 + 19D
number of highlight records (H)	%6d	101 + 19D + 26P
[highlight records]	[%17c]	107 + 19D + 26P
data		107 + 19D + 26P + 17H

**Directory, Passage and Highlight Records.**
directory record	field number (1- ) %3d	offset from start of data %8d	field length %8d
passage record	weight %10.3f	offset from start of data %8d	length %8d
highlight record	code (unused) %1c	offset from start of data %8d	length %8d
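Because every header field has a fixed width, a format 3 or 259 response can be unpacked by slicing. The sketch below follows the offsets and widths in the two tables above. It treats the response as a Python string (an assumption; a C caller would work on the character array directly), ignores the "message length" and "offset of start of data" fields, and uses hypothetical names throughout.

```python
from typing import List, NamedTuple, Tuple


class ShowRecord(NamedTuple):
    set_number: int
    record_in_set: int
    internal_record: int
    weight: float
    mark: int
    directory: List[Tuple[int, int, int]]     # (field number, offset, length)
    passages: List[Tuple[float, int, int]]    # (weight, offset, length)
    highlights: List[Tuple[str, int, int]]    # (code, offset, length)
    data: str


def _take(text: str, pos: int, width: int) -> Tuple[str, int]:
    """Return the next fixed-width field and the position just after it."""
    return text[pos:pos + width], pos + width


def parse_show_record(text: str) -> ShowRecord:
    set_number = int(text[40:46])
    record_in_set = int(text[46:56])
    internal_record = int(text[56:66])
    weight = float(text[66:76])
    mark = int(text[76:79])

    count, pos = _take(text, 92, 3)           # number of directory records
    directory = []
    for _ in range(int(count)):
        rec, pos = _take(text, pos, 19)
        directory.append((int(rec[0:3]), int(rec[3:11]), int(rec[11:19])))

    count, pos = _take(text, pos, 6)          # number of passage records
    passages = []
    for _ in range(int(count)):
        rec, pos = _take(text, pos, 26)
        passages.append((float(rec[0:10]), int(rec[10:18]), int(rec[18:26])))

    count, pos = _take(text, pos, 6)          # number of highlight records
    highlights = []
    for _ in range(int(count)):
        rec, pos = _take(text, pos, 17)
        highlights.append((rec[0], int(rec[1:9]), int(rec[9:17])))

    return ShowRecord(set_number, record_in_set, internal_record, weight, mark,
                      directory, passages, highlights, text[pos:])
```

After the three variable-length sections, pos equals 107 + 19D + 26P + 17H, the data offset given in the header table.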

**SHOW formats (contd.)**
Format	Description
197	shows field 1 and weight.
100	Like format 197 but gives more information provided the set has paragraph information (was retrieved using bm250). Example: SJMN91-06222010 190 190 4 13 188 0 32767 The second and subsequent fields are max(best passage weight, document weight), weight of best passage, best passage start para number, best passage finish para number, weight of whole document (the final two fields just indicate that the preceding weight applies to the whole document). In future more than one passage weight might be shown. Paragraph numbers are counted from zero.
255	Tries to give information about the posting, rather cryptically. It is really meant for diagnostic purposes, and may not be supported in the future in its present form. It may, however, be used to give within-document term frequency. Provided the output is from an atomic set (a set from looking up a single term or the result of an adj or adj2 operation) the first colon-delimited field is the number of occurrences of the term or phrase in the record. The other fields will be documented when I discover under what conditions this format works! Example: find t=system S10 np=106057 t=system show s=10 r=0 f=255 10:00000003:0:0:8a560003:...... (the term frequency is 10 in this record)
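The format 100 example above can be split into named fields as follows. This is a sketch that assumes the fields are separated by white space and that the document ID contains none; the names PassageInfo and parse_format_100 are hypothetical, and the two trailing fields are ignored.

```python
from typing import NamedTuple


class PassageInfo(NamedTuple):
    docid: str
    best_weight: float        # max(best passage weight, document weight)
    passage_weight: float     # weight of best passage
    first_para: int           # best passage start paragraph, counted from zero
    last_para: int            # best passage finish paragraph
    document_weight: float    # weight of whole document


def parse_format_100(line: str) -> PassageInfo:
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"too few fields in format 100 line: {line!r}")
    return PassageInfo(fields[0], float(fields[1]), float(fields[2]),
                       int(fields[3]), int(fields[4]), float(fields[5]))


info = parse_format_100("SJMN91-06222010 190 190 4 13 188 0 32767")
print(info.docid, info.first_para, info.last_para, info.document_weight)
```

In this example the best passage (paragraphs 4 to 13) outweighs the document as a whole, so the first weight equals the passage weight.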

4.9. DELETE | DEL

delete [[s=]<setnumber>] | all

Deletes the sets given by number, or all sets if "all" is given. There are no errors, even if a set number is invalid.

4.10. SET

Sets various defaults, listed below. Any number of arguments can be given. SET with no arguments displays the current default settings (this is the only SET command which works when there is no database open).

Arguments	Initial default	Errors
a | attr | attribute= <attribute name>	default	NO_SUCH_ATTRIBUTE
save=0 | n | 1 | y | Y | t | T (0 and n are equivalent, as are 1, y, Y, t and T)	1	none
type | search_type= -2 | -1 | 0 | 1 | 2	0	none, but a value which can't be read as an integer in this range is ignored.
nopos=0 | 1	0	none
k1=<number>	0.0	anything which doesn't look like a number gives SYNTAX
k2=<number>	0.0	anything which doesn't look like a number gives SYNTAX
bm25b | bm25_b=<number>	1.0	anything which doesn't look like a number gives SYNTAX
passage_unit | p_unit= <positive integer>	1	anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored.
passage_step | p_step= <positive integer>	1	anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored.
passage_maxlen | p_maxlen= <positive integer>	20	anything which doesn't look like a number gives SYNTAX; values less than 1 are ignored.
bigr=<non-negative integer>	0	anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored.
bigrload= <non-negative integer>	0	anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored.
rload= <non-negative integer>	0	anything which doesn't look like a number gives SYNTAX; values less than 0 are ignored.
op=<opname>	bm1	NO_SUCH_OP, SYNTAX

4.12. INFO sets

Not supported

4.13. DISPLAY | DISP | DI | D

Exhibits various things.

display db | database	Outputs "No database" if none is open, otherwise the database name.
display stemfunctions | stemfuncs | stemfns	Outputs the names of the available stemming functions, one per line, preceded by a line of the form "<n> function(s)".

5. SPECIAL PURPOSE OR PROVISIONAL COMMANDS

5.1. DISPLAYSTATS

displaystats | ds [s=<setnumber>]

Outputs "sum=<sum of weights in set> sumsq=<sum of squares of weights in set>". If no set is specified the "current" set is assumed. Using this command does not alter the system's notion of current set.

This command may be used on any set, but at present the only sets which may give nonzero statistics are ones obtained by using a Robertson limit (op=and2 or op=not2; see above). In addition the environment variable BSS_COMBINE_DO_STATS (see ENVIRONMENT VARIABLES below) must be set and nonzero.

Errors: NO_SUCH_SET

6. MISSING commands.

6.1. ACCEPT <length of longest acceptable response>

6.2. PREFER <length of reasonable response>

6.3. INFO

info database | <database_name>	Gives more detailed information on the current or named database. This may include type of data, number of fields and what they contain, number of records, mean and greatest length, and the information returned by "info attributes".
info attributes	Responds with the preferred names of the available search attributes and a brief description of what each one accesses.
info rn=<recnum> | info s=<setnum> r=<record>	Returns something of the form: rn=<recnum> length=<bytes> fd=<fdnum> length=<bytes> paras=<num> ...

The last info command applies mainly to databases of type "text".

6.4. SHOW

A more flexible system of record "show" qualifiers is being developed. This might take the form

show [ highlight ] fd=<fdnum> [offset=<byte> length=<bytes> ] |
para=<start para> paras=<num> ] ... rn=<recnum> |
s=<setnum> r=<record>

The "para" and "paras" qualifiers would only apply to databases of type "text".

7. Sizes, Problems and Limitations

The maximum number of sets is 16384 (numbered 0-16383). Each set occupies at least 232 bytes of memory, and may also have one or two files in the $BSS_TEMPPATH directory. If a FIND operation would result in too many sets it fails with a "NO_FREE_SETS" error. The memory for sets is now dynamically allocated, but this memory is never freed until there is a successful subsequent CHOOSE command.

The maximum number of sets which can be combined in a single FIND operation is operating-system dependent. On the Sun OS 4.1 machines you are recommended not to exceed about 110 (it may appear to work with up to about 238, but results are likely to be spurious). With SOLARIS it should be OK to go up to about 238.

There have been problems PARSE-ing long input. This seems to have been cured by using flex instead of lex to compile the command parser. (But note that the command parser will always stop at a linefeed; you may need to replace linefeeds by blanks before using either of the PARSE commands; we haven't found a way of getting round this.)
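A caller can apply the linefeed workaround when building the command string. The sketch below is a minimal Python illustration; the function name build_parse_command is hypothetical, and "ti" is used only as an example of an attribute mnemonic.

```python
def build_parse_command(text: str, attribute: str = "") -> str:
    """Build a PARSE command with linefeeds in the text replaced by blanks."""
    flattened = text.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    qualifier = f"a={attribute} " if attribute else ""
    # The t=<string> argument goes last, as in the syntax shown for PARSE.
    return f"parse {qualifier}t={flattened}"


print(build_parse_command("information\nretrieval"))
print(build_parse_command("okapi at\ntrec", attribute="ti"))
```

The same flattening applies to SUPERPARSE, since both commands go through the same command parser.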

8. ENVIRONMENT VARIABLES

Three environment variables are read by the iinit() function. These are:

BSS_TEMPPATH	The directory for temporary files. If an application is searching a large database many megabytes of temporary storage may be needed, and /tmp or /var/tmp may not have enough space.
BSS_PARMPATH	The directory where the database parameters are stored. There is no sensible default, so this will almost always need to be set.
BSS_LOCALBIBPATH	If this is set the BSS will look for text and index files there before going to the paths specified in the database parameters.

The low-level interface program (i1+) looks for additional environment variables. These are:

BSS_DB	If set the interface will perform a CHOOSE command on the named database.
BSS_ATTRIBUTE	If present the corresponding SET command will be issued after the specified database has been opened.

NOTE: if BSS_ATTRIBUTE is set without BSS_DB, there will be a "NO_DB_OPEN" error.

The main use of these environment variables is in scripts such as the following:

#!/bin/sh
BSS_DB=trec123_94
BSS_ATTRIBUTE=dn
export BSS_DB BSS_ATTRIBUTE
i1+

which loads the low level interface with the database set to search on index "dn" before handing it over to the user.

A number of other environment variables are looked for by several BSS functions. Some of them control the parameters for passage searching, or for logistic regression (not documented here). Another is BSS_COMBINE_DO_STATS , which must be nonzero for the DISPLAYSTATS command to produce nonzero results.

Last modified: 12th November 2001