On Tue, Apr 11, 2017 at 3:08 AM, rohit0908 <[EMAIL PROTECTED]> wrote:

Every call to $hits->next requires deserializing an entire document.  It may
be possible, depending on how your application is structured, to reduce or
avoid the cost of deserialization.

If you don't need any fields other than `title` and you currently have
other fields which are `stored`, then you could try changing the FieldType for
those other fields so that they are no longer `stored`.  That will reduce
the cost of deserializing a document.
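
As a rough sketch of that first option, assuming a schema with a `title`
field and one other field, here called `content` (the field names, the
analyzer, and the language are placeholders, not taken from your application):

```perl
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::EasyAnalyzer;

my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new(language => 'en');

# `title` stays stored so it can still be retrieved from hits.
my $stored_type = Lucy::Plan::FullTextType->new(
    analyzer => $analyzer,
);

# Hypothetical `content` field: still indexed and searchable, but
# no longer stored, so it is not deserialized with each document.
my $unstored_type = Lucy::Plan::FullTextType->new(
    analyzer => $analyzer,
    stored   => 0,
);

$schema->spec_field(name => 'title',   type => $stored_type);
$schema->spec_field(name => 'content', type => $unstored_type);
```

I'd expect you would need to rebuild the index for a change like this to
take effect.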

Another possibility might be to spend memory to avoid i/o, and cache all the
titles in a Perl array on Searcher initialization with indices corresponding
to Lucy doc IDs.  Then you could use a BitCollector, avoiding the
deserialization that $hits->next does.  Something like this:

    # Build the title cache once, at Searcher initialization.
    my $searcher = Lucy::Search::IndexSearcher->new(index => $index);
    my @titles;
    my $doc_max = $searcher->doc_max;
    for my $doc_id (1 .. $doc_max) {    # Lucy doc IDs start at 1
        my $doc = $searcher->fetch_doc($doc_id);
        $titles[$doc_id] = $doc->{title};
    }

    # Per query: record matching doc IDs in a bit vector; no docs are fetched.
    my $bit_vec = Lucy::Object::BitVector->new(
        capacity => $doc_max + 1,
    );
    my $bit_collector = Lucy::Search::Collector::BitCollector->new(
        bit_vector => $bit_vec,
    );
    $searcher->collect(
        collector => $bit_collector,
        query     => $query,
    );

    # next_hit returns the first set bit at or after its argument, or -1.
    my $last_id = 0;
    while (1) {
        my $doc_id = $bit_vec->next_hit($last_id + 1);
        last if $doc_id == -1;
        $last_id = $doc_id;
        print $titles[$doc_id] . "\n"; # or whatever
    }

Lucy is single-threaded, and there is not a practical way to parallelize
$hits->next at this time.  I've hacked some process-based parallelism using
unsupported private APIs but the approach wasn't ready for prime-time.

Marvin Humphrey