mirror of
https://git.savannah.gnu.org/git/gperf.git
synced 2025-12-02 13:09:22 +00:00
Regenerated.
This commit is contained in:
395
doc/gperf_5.html
395
doc/gperf_5.html
@@ -1,92 +1,345 @@
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<!-- This HTML file has been created by texi2html 1.51
|
||||
from gperf.texi on 15 April 1998 -->
|
||||
from gperf.texi on 20 August 2000 -->
|
||||
|
||||
<TITLE>User's Guide to gperf - 2 Static search structures and GNU gperf</TITLE>
|
||||
<TITLE>Perfect Hash Function Generator - 3 High-Level Description of GNU gperf</TITLE>
|
||||
</HEAD>
|
||||
<BODY>
|
||||
Go to the <A HREF="gperf_1.html">first</A>, <A HREF="gperf_4.html">previous</A>, <A HREF="gperf_6.html">next</A>, <A HREF="gperf_11.html">last</A> section, <A HREF="gperf_toc.html">table of contents</A>.
|
||||
<P><HR><P>
|
||||
|
||||
|
||||
<H1><A NAME="SEC7" HREF="gperf_toc.html#TOC7">2 Static search structures and GNU <CODE>gperf</CODE></A></H1>
|
||||
<H1><A NAME="SEC7" HREF="gperf_toc.html#TOC7">3 High-Level Description of GNU <CODE>gperf</CODE></A></H1>
|
||||
|
||||
<P>
|
||||
A <STRONG>static search structure</STRONG> is an Abstract Data Type with certain
|
||||
fundamental operations, e.g., <EM>initialize</EM>, <EM>insert</EM>,
|
||||
and <EM>retrieve</EM>. Conceptually, all insertions occur before any
|
||||
retrievals. In practice, <CODE>gperf</CODE> generates a <CODE>static</CODE> array
|
||||
containing search set keywords and any associated attributes specified
|
||||
by the user. Thus, there is essentially no execution-time cost for the
|
||||
insertions. It is a useful data structure for representing <EM>static
|
||||
search sets</EM>. Static search sets occur frequently in software system
|
||||
applications. Typical static search sets include compiler reserved
|
||||
words, assembler instruction opcodes, and built-in shell interpreter
|
||||
commands. Search set members, called <STRONG>keywords</STRONG>, are inserted into
|
||||
the structure only once, usually during program initialization, and are
|
||||
not generally modified at run-time.
|
||||
The perfect hash function generator <CODE>gperf</CODE> reads a set of
|
||||
"keywords" from a <STRONG>keyfile</STRONG> (or from the standard input by
|
||||
default). It attempts to derive a perfect hashing function that
|
||||
recognizes a member of the <STRONG>static keyword set</STRONG> with at most a
|
||||
single probe into the lookup table. If <CODE>gperf</CODE> succeeds in
|
||||
generating such a function it produces a pair of C source code routines
|
||||
that perform hashing and table lookup recognition. All generated C code
|
||||
is directed to the standard output. Command-line options described
|
||||
below allow you to modify the input and output format to <CODE>gperf</CODE>.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
Numerous static search structure implementations exist, e.g.,
|
||||
arrays, linked lists, binary search trees, digital search tries, and
|
||||
hash tables. Different approaches offer trade-offs between space
|
||||
utilization and search time efficiency. For example, an <VAR>n</VAR> element
|
||||
sorted array is space efficient, though the average-case time
|
||||
complexity for retrieval operations using binary search is
|
||||
proportional to log <VAR>n</VAR>. Conversely, hash table implementations
|
||||
often locate a table entry in constant time, but typically impose
|
||||
additional memory overhead and exhibit poor worst case performance.
|
||||
|
||||
</P>
|
||||
|
||||
<P>
|
||||
<EM>Minimal perfect hash functions</EM> provide an optimal solution for a
|
||||
particular class of static search sets. A minimal perfect hash
|
||||
function is defined by two properties:
|
||||
|
||||
</P>
|
||||
|
||||
<UL>
|
||||
<LI>
|
||||
|
||||
It allows keyword recognition in a static search set using at most
|
||||
<EM>one</EM> probe into the hash table. This represents the "perfect"
|
||||
property.
|
||||
<LI>
|
||||
|
||||
The actual memory allocated to store the keywords is precisely large
|
||||
enough for the keyword set, and <EM>no larger</EM>. This is the
|
||||
"minimal" property.
|
||||
</UL>
|
||||
|
||||
<P>
|
||||
For most applications it is far easier to generate <EM>perfect</EM> hash
|
||||
functions than <EM>minimal perfect</EM> hash functions. Moreover,
|
||||
non-minimal perfect hash functions frequently execute faster than
|
||||
minimal ones in practice. This phenomena occurs since searching a
|
||||
sparse keyword table increases the probability of locating a "null"
|
||||
entry, thereby reducing string comparisons. <CODE>gperf</CODE>'s default
|
||||
behavior generates <EM>near-minimal</EM> perfect hash functions for
|
||||
keyword sets. However, <CODE>gperf</CODE> provides many options that permit
|
||||
user control over the degree of minimality and perfection.
|
||||
By default, <CODE>gperf</CODE> attempts to produce time-efficient code, with
|
||||
less emphasis on efficient space utilization. However, several options
|
||||
exist that permit trading-off execution time for storage space and vice
|
||||
versa. In particular, expanding the generated table size produces a
|
||||
sparse search structure, generally yielding faster searches.
|
||||
Conversely, you can direct <CODE>gperf</CODE> to utilize a C <CODE>switch</CODE>
|
||||
statement scheme that minimizes data space storage size. Furthermore,
|
||||
using a C <CODE>switch</CODE> may actually speed up the keyword retrieval time
|
||||
somewhat. Actual results depend on your C compiler, of course.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
Static search sets often exhibit relative stability over time. For
|
||||
example, Ada's 63 reserved words have remained constant for nearly a
|
||||
decade. It is therefore frequently worthwhile to expend concerted
|
||||
effort building an optimal search structure <EM>once</EM>, if it
|
||||
subsequently receives heavy use multiple times. <CODE>gperf</CODE> removes
|
||||
the drudgery associated with constructing time- and space-efficient
|
||||
search structures by hand. It has proven a useful and practical tool
|
||||
for serious programming projects. Output from <CODE>gperf</CODE> is currently
|
||||
used in several production and research compilers, including GNU C, GNU
|
||||
C++, GNU Pascal, and GNU Modula 3. The latter two compilers are not yet
|
||||
part of the official GNU distribution. Each compiler utilizes
|
||||
<CODE>gperf</CODE> to automatically generate static search structures that
|
||||
efficiently identify their respective reserved keywords.
|
||||
In general, <CODE>gperf</CODE> assigns values to the characters it is using
|
||||
for hashing until some set of values gives each keyword a unique value.
|
||||
A helpful heuristic is that the larger the hash value range, the easier
|
||||
it is for <CODE>gperf</CODE> to find and generate a perfect hash function.
|
||||
Experimentation is the key to getting the most from <CODE>gperf</CODE>.
|
||||
|
||||
</P>
|
||||
|
||||
|
||||
<H2><A NAME="SEC8" HREF="gperf_toc.html#TOC8">3.1 Input Format to <CODE>gperf</CODE></A></H2>
|
||||
<P>
|
||||
<A NAME="IDX4"></A>
|
||||
<A NAME="IDX5"></A>
|
||||
<A NAME="IDX6"></A>
|
||||
<A NAME="IDX7"></A>
|
||||
You can control the input keyfile format by varying certain command-line
|
||||
arguments, in particular the <SAMP>`-t'</SAMP> option. The input's appearance
|
||||
is similar to GNU utilities <CODE>flex</CODE> and <CODE>bison</CODE> (or UNIX
|
||||
utilities <CODE>lex</CODE> and <CODE>yacc</CODE>). Here's an outline of the general
|
||||
format:
|
||||
|
||||
</P>
|
||||
|
||||
<PRE>
|
||||
declarations
|
||||
%%
|
||||
keywords
|
||||
%%
|
||||
functions
|
||||
</PRE>
|
||||
|
||||
<P>
|
||||
<EM>Unlike</EM> <CODE>flex</CODE> or <CODE>bison</CODE>, all sections of
|
||||
<CODE>gperf</CODE>'s input are optional. The following sections describe the
|
||||
input format for each section.
|
||||
|
||||
</P>
|
||||
|
||||
|
||||
|
||||
<H3><A NAME="SEC9" HREF="gperf_toc.html#TOC9">3.1.1 <CODE>struct</CODE> Declarations and C Code Inclusion</A></H3>
|
||||
|
||||
<P>
|
||||
The keyword input file optionally contains a section for including
|
||||
arbitrary C declarations and definitions, as well as provisions for
|
||||
providing a user-supplied <CODE>struct</CODE>. If the <SAMP>`-t'</SAMP> option
|
||||
<EM>is</EM> enabled, you <EM>must</EM> provide a C <CODE>struct</CODE> as the last
|
||||
component in the declaration section from the keyfile file. The first
|
||||
field in this struct must be a <CODE>char *</CODE> or <CODE>const char *</CODE>
|
||||
identifier called <SAMP>`name'</SAMP>, although it is possible to modify this
|
||||
field's name with the <SAMP>`-K'</SAMP> option described below.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
Here is a simple example, using months of the year and their attributes as
|
||||
input:
|
||||
|
||||
</P>
|
||||
|
||||
<PRE>
|
||||
struct months { char *name; int number; int days; int leap_days; };
|
||||
%%
|
||||
january, 1, 31, 31
|
||||
february, 2, 28, 29
|
||||
march, 3, 31, 31
|
||||
april, 4, 30, 30
|
||||
may, 5, 31, 31
|
||||
june, 6, 30, 30
|
||||
july, 7, 31, 31
|
||||
august, 8, 31, 31
|
||||
september, 9, 30, 30
|
||||
october, 10, 31, 31
|
||||
november, 11, 30, 30
|
||||
december, 12, 31, 31
|
||||
</PRE>
|
||||
|
||||
<P>
|
||||
<A NAME="IDX8"></A>
|
||||
Separating the <CODE>struct</CODE> declaration from the list of keywords and
|
||||
other fields are a pair of consecutive percent signs, <SAMP>`%%'</SAMP>,
|
||||
appearing left justified in the first column, as in the UNIX utility
|
||||
<CODE>lex</CODE>.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
<A NAME="IDX9"></A>
|
||||
<A NAME="IDX10"></A>
|
||||
Using a syntax similar to GNU utilities <CODE>flex</CODE> and <CODE>bison</CODE>, it
|
||||
is possible to directly include C source text and comments verbatim into
|
||||
the generated output file. This is accomplished by enclosing the region
|
||||
inside left-justified surrounding <SAMP>`%{'</SAMP>, <SAMP>`%}'</SAMP> pairs. Here is
|
||||
an input fragment based on the previous example that illustrates this
|
||||
feature:
|
||||
|
||||
</P>
|
||||
|
||||
<PRE>
|
||||
%{
|
||||
#include <assert.h>
|
||||
/* This section of code is inserted directly into the output. */
|
||||
int return_month_days (struct months *months, int is_leap_year);
|
||||
%}
|
||||
struct months { char *name; int number; int days; int leap_days; };
|
||||
%%
|
||||
january, 1, 31, 31
|
||||
february, 2, 28, 29
|
||||
march, 3, 31, 31
|
||||
...
|
||||
</PRE>
|
||||
|
||||
<P>
|
||||
It is possible to omit the declaration section entirely. In this case
|
||||
the keyfile begins directly with the first keyword line, e.g.:
|
||||
|
||||
</P>
|
||||
|
||||
<PRE>
|
||||
january, 1, 31, 31
|
||||
february, 2, 28, 29
|
||||
march, 3, 31, 31
|
||||
april, 4, 30, 30
|
||||
...
|
||||
</PRE>
|
||||
|
||||
|
||||
|
||||
<H3><A NAME="SEC10" HREF="gperf_toc.html#TOC10">3.1.2 Format for Keyword Entries</A></H3>
|
||||
|
||||
<P>
|
||||
The second keyfile format section contains lines of keywords and any
|
||||
associated attributes you might supply. A line beginning with <SAMP>`#'</SAMP>
|
||||
in the first column is considered a comment. Everything following the
|
||||
<SAMP>`#'</SAMP> is ignored, up to and including the following newline.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
The first field of each non-comment line is always the key itself. It
|
||||
can be given in two ways: as a simple name, i.e., without surrounding
|
||||
string quotation marks, or as a string enclosed in double-quotes, in
|
||||
C syntax, possibly with backslash escapes like <CODE>\"</CODE> or <CODE>\234</CODE>
|
||||
or <CODE>\xa8</CODE>. In either case, it must start right at the beginning
|
||||
of the line, without leading whitespace.
|
||||
In this context, a "field" is considered to extend up to, but
|
||||
not include, the first blank, comma, or newline. Here is a simple
|
||||
example taken from a partial list of C reserved words:
|
||||
|
||||
</P>
|
||||
|
||||
<PRE>
|
||||
# These are a few C reserved words, see the c.gperf file
|
||||
# for a complete list of ANSI C reserved words.
|
||||
unsigned
|
||||
sizeof
|
||||
switch
|
||||
signed
|
||||
if
|
||||
default
|
||||
for
|
||||
while
|
||||
return
|
||||
</PRE>
|
||||
|
||||
<P>
|
||||
Note that unlike <CODE>flex</CODE> or <CODE>bison</CODE> the first <SAMP>`%%'</SAMP> marker
|
||||
may be elided if the declaration section is empty.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
Additional fields may optionally follow the leading keyword. Fields
|
||||
should be separated by commas, and terminate at the end of line. What
|
||||
these fields mean is entirely up to you; they are used to initialize the
|
||||
elements of the user-defined <CODE>struct</CODE> provided by you in the
|
||||
declaration section. If the <SAMP>`-t'</SAMP> option is <EM>not</EM> enabled
|
||||
these fields are simply ignored. All previous examples except the last
|
||||
one contain keyword attributes.
|
||||
|
||||
</P>
|
||||
|
||||
|
||||
<H3><A NAME="SEC11" HREF="gperf_toc.html#TOC11">3.1.3 Including Additional C Functions</A></H3>
|
||||
|
||||
<P>
|
||||
The optional third section also corresponds closely with conventions
|
||||
found in <CODE>flex</CODE> and <CODE>bison</CODE>. All text in this section,
|
||||
starting at the final <SAMP>`%%'</SAMP> and extending to the end of the input
|
||||
file, is included verbatim into the generated output file. Naturally,
|
||||
it is your responsibility to ensure that the code contained in this
|
||||
section is valid C.
|
||||
|
||||
</P>
|
||||
|
||||
|
||||
<H2><A NAME="SEC12" HREF="gperf_toc.html#TOC12">3.2 Output Format for Generated C Code with <CODE>gperf</CODE></A></H2>
|
||||
<P>
|
||||
<A NAME="IDX11"></A>
|
||||
|
||||
</P>
|
||||
<P>
|
||||
Several options control how the generated C code appears on the standard
|
||||
output. Two C function are generated. They are called <CODE>hash</CODE> and
|
||||
<CODE>in_word_set</CODE>, although you may modify their names with a command-line
|
||||
option. Both functions require two arguments, a string, <CODE>char *</CODE>
|
||||
<VAR>str</VAR>, and a length parameter, <CODE>int</CODE> <VAR>len</VAR>. Their default
|
||||
function prototypes are as follows:
|
||||
|
||||
</P>
|
||||
<P>
|
||||
<DL>
|
||||
<DT><U>Function:</U> unsigned int <B>hash</B> <I>(const char * <VAR>str</VAR>, unsigned int <VAR>len</VAR>)</I>
|
||||
<DD><A NAME="IDX12"></A>
|
||||
By default, the generated <CODE>hash</CODE> function returns an integer value
|
||||
created by adding <VAR>len</VAR> to several user-specified <VAR>str</VAR> key
|
||||
positions indexed into an <STRONG>associated values</STRONG> table stored in a
|
||||
local static array. The associated values table is constructed
|
||||
internally by <CODE>gperf</CODE> and later output as a static local C array
|
||||
called <SAMP>`hash_table'</SAMP>; its meaning and properties are described below
|
||||
(see section <A HREF="gperf_9.html#SEC22">7 Implementation Details of GNU <CODE>gperf</CODE></A>). The relevant key positions are specified via
|
||||
the <SAMP>`-k'</SAMP> option when running <CODE>gperf</CODE>, as detailed in the
|
||||
<EM>Options</EM> section below(see section <A HREF="gperf_6.html#SEC14">4 Invoking <CODE>gperf</CODE></A>).
|
||||
</DL>
|
||||
|
||||
</P>
|
||||
<P>
|
||||
<DL>
|
||||
<DT><U>Function:</U> <B>in_word_set</B> <I>(const char * <VAR>str</VAR>, unsigned int <VAR>len</VAR>)</I>
|
||||
<DD><A NAME="IDX13"></A>
|
||||
If <VAR>str</VAR> is in the keyword set, returns a pointer to that
|
||||
keyword. More exactly, if the option <SAMP>`-t'</SAMP> was given, it returns
|
||||
a pointer to the matching keyword's structure. Otherwise it returns
|
||||
<CODE>NULL</CODE>.
|
||||
</DL>
|
||||
|
||||
</P>
|
||||
<P>
|
||||
If the option <SAMP>`-c'</SAMP> is not used, <VAR>str</VAR> must be a NUL terminated
|
||||
string of exactly length <VAR>len</VAR>. If <SAMP>`-c'</SAMP> is used, <VAR>str</VAR> must
|
||||
simply be an array of <VAR>len</VAR> characters and does not need to be NUL
|
||||
terminated.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
The code generated for these two functions is affected by the following
|
||||
options:
|
||||
|
||||
</P>
|
||||
<DL COMPACT>
|
||||
|
||||
<DT><SAMP>`-t'</SAMP>
|
||||
<DD>
|
||||
<DT><SAMP>`--struct-type'</SAMP>
|
||||
<DD>
|
||||
Make use of the user-defined <CODE>struct</CODE>.
|
||||
|
||||
<DT><SAMP>`-S <VAR>total-switch-statements</VAR>'</SAMP>
|
||||
<DD>
|
||||
<DT><SAMP>`--switch=<VAR>total-switch-statements</VAR>'</SAMP>
|
||||
<DD>
|
||||
<A NAME="IDX14"></A>
|
||||
Generate 1 or more C <CODE>switch</CODE> statement rather than use a large,
|
||||
(and potentially sparse) static array. Although the exact time and
|
||||
space savings of this approach vary according to your C compiler's
|
||||
degree of optimization, this method often results in smaller and faster
|
||||
code.
|
||||
</DL>
|
||||
|
||||
<P>
|
||||
If the <SAMP>`-t'</SAMP> and <SAMP>`-S'</SAMP> options are omitted, the default action
|
||||
is to generate a <CODE>char *</CODE> array containing the keys, together with
|
||||
additional null strings used for padding the array. By experimenting
|
||||
with the various input and output options, and timing the resulting C
|
||||
code, you can determine the best option choices for different keyword
|
||||
set characteristics.
|
||||
|
||||
</P>
|
||||
|
||||
|
||||
<H2><A NAME="SEC13" HREF="gperf_toc.html#TOC13">3.3 Use of NUL characters</A></H2>
|
||||
<P>
|
||||
<A NAME="IDX15"></A>
|
||||
|
||||
</P>
|
||||
<P>
|
||||
By default, the code generated by <CODE>gperf</CODE> operates on zero
|
||||
terminated strings, the usual representation of strings in C. This means
|
||||
that the keywords in the input file must not contain NUL characters,
|
||||
and the <VAR>str</VAR> argument passed to <CODE>hash</CODE> or <CODE>in_word_set</CODE>
|
||||
must be NUL terminated and have exactly length <VAR>len</VAR>.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
If option <SAMP>`-c'</SAMP> is used, then the <VAR>str</VAR> argument does not need
|
||||
to be NUL terminated. The code generated by <CODE>gperf</CODE> will only
|
||||
access the first <VAR>len</VAR>, not <VAR>len+1</VAR>, bytes starting at <VAR>str</VAR>.
|
||||
However, the keywords in the input file still must not contain NUL
|
||||
characters.
|
||||
|
||||
</P>
|
||||
<P>
|
||||
If option <SAMP>`-l'</SAMP> is used, then the hash table performs binary
|
||||
comparison. The keywords in the input file may contain NUL characters,
|
||||
written in string syntax as <CODE>\000</CODE> or <CODE>\x00</CODE>, and the code
|
||||
generated by <CODE>gperf</CODE> will treat NUL like any other character.
|
||||
Also, in this case the <SAMP>`-c'</SAMP> option is ignored.
|
||||
|
||||
</P>
|
||||
<P><HR><P>
|
||||
|
||||
Reference in New Issue
Block a user