mirror of https://git.savannah.gnu.org/git/gperf.git synced 2025-12-02 13:09:22 +00:00

Implement backtracking.

Bruno Haible
2003-02-22 00:19:28 +00:00
parent f1da37e04b
commit 76575063ea
6 changed files with 120 additions and 156 deletions

@@ -7,7 +7,7 @@
@c some day we should @include version.texi instead of defining
@c these values at hand.
@set UPDATED 16 November 2002
@set UPDATED 20 November 2002
@set EDITION 2.7.2
@set VERSION 2.7.2
@c ---------------------
@@ -993,27 +993,14 @@ through a search that minimizes the number of byte positions.
@itemx --duplicates
@cindex Duplicates
Handle keywords whose selected byte sets hash to duplicate values.
Duplicate hash values can occur for two reasons:
@itemize @bullet
@item
Since @code{gperf} does not backtrack it is possible for it to process
all your input keywords without finding a unique mapping for each word.
However, frequently only a very small number of duplicates occur, and
the majority of keywords still require one probe into the table. To
overcome this problem, the option @samp{-m 50} should be used.
@item
Sometimes a set of keywords may have the same names, but possess different
attributes. With the -D option @code{gperf} treats all these keywords as
Duplicate hash values can occur if a set of keywords has the same names, but
possesses different attributes, or if the selected byte positions are not well
chosen. With the -D option @code{gperf} treats all these keywords as
part of an equivalence class and generates a perfect hash function with
multiple comparisons for duplicate keywords. It is up to you to completely
disambiguate the keywords by modifying the generated C code. However,
@code{gperf} helps you out by organizing the output.
@end itemize
Option @samp{-D} is extremely useful for certain large or highly
redundant keyword sets, e.g., assembler instruction opcodes.
Using this option usually means that the generated hash function is no
longer perfect. On the other hand, it permits @code{gperf} to work on
keyword sets that it otherwise could not handle.
@@ -1025,7 +1012,7 @@ Generate the perfect hash function ``fast''. This decreases
table-size. The iteration amount represents the number of times to
iterate when resolving a collision. `0' means iterate by the number of
keywords. This option is probably most useful when used in conjunction
with options @samp{-D} and/or @samp{-S} for @emph{large} keyword sets.
with option @samp{-o} for @emph{large} keyword sets.
@item -m @var{iterations}
@itemx --multiple-iterations=@var{iterations}
@@ -1067,7 +1054,7 @@ produce more minimal perfect hash functions. The reason for this is
that the reordering helps prune the search time by handling inevitable
collisions early in the search process. On the other hand, in practice,
a decreased search time also means a less minimal hash function, and a
higher probability of duplicate hash values. Furthermore, if the
higher frequency of backtracking. Furthermore, if the
number of keywords is @emph{very} large using @samp{-o} may
@emph{increase} @code{gperf}'s execution time, since collisions will
begin earlier and continue throughout the remainder of keyword
@@ -1080,8 +1067,7 @@ Utilizes randomness to initialize the associated values table. This
frequently generates solutions faster than using deterministic
initialization (which starts all associated values at 0). Furthermore,
using the randomization option generally increases the size of the
table. If @code{gperf} has difficultly with a certain keyword set try using
@samp{-r} or @samp{-D}.
table.
@item -s @var{size-multiple}
@itemx --size-multiple=@var{size-multiple}
@@ -1154,16 +1140,6 @@ work efficiently on much larger keyword sets (over 15,000 keywords).
When processing large keyword sets it helps greatly to have over 8 megs
of RAM.
However, since @code{gperf} does not backtrack no guaranteed solution
occurs on every run. On the other hand, it is usually easy to obtain a
solution by varying the option parameters. In particular, try the
@samp{-r} option, and also try changing the default arguments to the
@samp{-s} and @samp{-j} options. To @emph{guarantee} a solution, use
the @samp{-D} and @samp{-S} options, although the final results are not
likely to be a @emph{perfect} hash function anymore! Finally, use the
@samp{-f} option if you want @code{gperf} to generate the perfect hash
function @emph{fast}, with less emphasis on making it minimal.
@item
The size of the generate static keyword array can get @emph{extremely}
large if the input keyword file is large or if the keywords are quite
@@ -1171,7 +1147,7 @@ similar. This tends to slow down the compilation of the generated C
code, and @emph{greatly} inflates the object code size. If this
situation occurs, consider using the @samp{-S} option to reduce data
size, potentially increasing keyword recognition time a negligible
amount. Since many C compilers cannot correctly generated code for
amount. Since many C compilers cannot correctly generate code for
large switch statements it is important to qualify the @var{-S} option
with an appropriate numerical argument that controls the number of
switch statements generated.
@@ -1192,19 +1168,11 @@ module is essential independent from other program modules. Additional
worthwhile improvements include:
@itemize @bullet
@item
Make the algorithm more robust. At present, the program halts with an
error diagnostic if it can't find a direct solution and the @samp{-D}
option is not enabled. A more comprehensive, albeit computationally
expensive, approach would employ backtracking or enable alternative
options and retry. It's not clear how helpful this would be, in
general, since most search sets are rather small in practice.
@item
Another useful extension involves modifying the program to generate
``minimal'' perfect hash functions (under certain circumstances, the
current version can be rather extravagant in the generated table size).
Again, this is mostly of theoretical interest, since a sparse table
This is mostly of theoretical interest, since a sparse table
often produces faster lookups, and use of the @samp{-S} @code{switch}
option can minimize the data size, at the expense of slightly longer
lookups (note that the gcc compiler generally produces good code for