Ryan McKinley 2007-08-08, 19:56
In the trunk code, the DocumentBuilder is not handling null values well.
SolrInputDocument doc = new SolrInputDocument();
doc.addField( "id", "hello", 1.0f );
doc.addField( "name", null, 1.0f );

Document out = DocumentBuilder.toDocument( doc, core.getSchema() );
throws an exception:
"unknown field 'name'"
Fixing it is easy, but I'm not clear what the semantics of indexing a 'null' value is intended to be. It looks like FieldTypes are given a chance to deal with 'null' values with the toInternal().
I have not looked into it, but I think the StAX parser would make both:
  <field name="name" />
  <field name="name" ></field>
into: doc.addField( "name", "", 1.0f );
To me, it makes the most sense to just skip fields that don't have any value. This change passes all tests and fixes the 'unknown field' error, but I'm not sure if it changes any undocumented/untested assumptions:
Index: src/java/org/apache/solr/update/DocumentBuilder.java
===================================================================
--- src/java/org/apache/solr/update/DocumentBuilder.java (revision 564002)
+++ src/java/org/apache/solr/update/DocumentBuilder.java (working copy)
@@ -188,8 +188,10 @@
       SchemaField[] destArr = schema.getCopyFields(name);
 
       // load each field value
+      boolean hasField = false;
       for( Object v : field ) {
         String val = null;
+        hasField = true;
 
         // TODO!!! HACK -- date conversion
         if( sfield != null && v instanceof Date && sfield.getType() instanceof DateField ) {
@@ -232,7 +234,7 @@
       }
 
       // make sure the field was used somehow...
-      if( !used ) {
+      if( !used && hasField ) {
         throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,"ERROR:unknown field '" + name + "'");
       }
     }
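For readers without the Solr source at hand, the effect of the hasField guard can be modelled in plain Java. This is only a simplified sketch, not the actual DocumentBuilder code: the class and method names are hypothetical, the "schema" is reduced to a set of known field names, and it assumes (as the patch suggests) that a field added with a null value ends up with nothing to iterate over.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the guard in the patch above; NOT the real DocumentBuilder.
final class HasFieldGuardSketch {

    // Hypothetical stand-in for the schema: just a set of known field names.
    static final Set<String> KNOWN_FIELDS = new HashSet<>(Arrays.asList("id", "name"));

    // Returns how many values were accepted for the field.
    // Throws only when at least one value was supplied and none could be used.
    static int loadField(String name, List<Object> values) {
        boolean used = false;
        boolean hasField = false;
        int accepted = 0;
        for (Object v : values) {
            hasField = true;               // the loop body ran, so a value exists
            if (KNOWN_FIELDS.contains(name)) {
                used = true;
                accepted++;
            }
        }
        // Without "&& hasField", a field with no values would land here
        // with used == false and be reported as unknown.
        if (!used && hasField) {
            throw new IllegalArgumentException("ERROR:unknown field '" + name + "'");
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(loadField("id", Arrays.<Object>asList("hello")));
        System.out.println(loadField("name", Collections.<Object>emptyList()));
    }
}
```

One consequence visible in this model: a field with no values is skipped silently even when its name is not in the schema, which may be one of the untested assumptions worth checking.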
Re: indexing null values?
Yonik Seeley 2007-08-08, 20:09
On 8/8/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> In the trunk code, the DocumentBuilder is not handling null values well.
>
> SolrInputDocument doc = new SolrInputDocument();
> doc.addField( "id", "hello", 1.0f );
> doc.addField( "name", null, 1.0f );
>
> Document out = DocumentBuilder.toDocument( doc, core.getSchema() );
>
> throws an exception:
>
> "unknown field 'name'"
>
> Fixing it is easy, but I'm not clear what the semantics of indexing a
> 'null' value is intended to be.
Don't index it.
> It looks like FieldTypes are given a
> chance to deal with 'null' values with the toInternal()
>
> I have not looked into it, but I think the StAX parser would make both:
> <field name="name" />
> <field name="name" ></field>
> into: doc.addField( "name", "", 1.0f );
That's partially an XML issue. There is no way to distinguish between those two cases, and the latter is the most reasonable way to represent a zero length string... hence "null" is not expressible in the XML. But we do have a way to express that a field has no value: just leave it out.
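The point about XML can be checked with the StAX API that ships with the JDK. The sketch below is an illustration only: it uses javax.xml.stream directly rather than Solr's update handler, and the class and method names are hypothetical. It reads the text of the first <field> element and returns null when no such element is present.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration with the JDK's StAX parser; not Solr's XML update handler.
final class EmptyElementSketch {

    // Returns the text content of the first <field> element,
    // or null if the document contains no <field> element at all.
    static String firstFieldText(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "field".equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return null; // the field was simply left out
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String selfClosed = firstFieldText("<doc><field name=\"name\" /></doc>");
        String openClose = firstFieldText("<doc><field name=\"name\" ></field></doc>");
        String leftOut = firstFieldText("<doc></doc>");
        System.out.println(selfClosed.equals(openClose)); // both forms give the same text
        System.out.println(leftOut == null);              // only omission is distinguishable
    }
}
```

Both element forms produce the same zero length string, so the only way left to signal "no value" in the XML is to omit the element.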
> To me, it makes the most sense to just skip fields that don't have any > value.
A zero length string is a legal value for a string.
-Yonik
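On the client side, "leave it out" can be as simple as filtering a document's fields before they are sent. The following is a minimal sketch under stated assumptions: the document is modelled as a plain Map rather than a SolrInputDocument, and the class and method names are hypothetical. Null values are dropped, while zero length strings are kept because they are legal string values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-in for a document; not the SolrInputDocument API.
final class SkipNullFieldsSketch {

    // Copies the fields in order, omitting any whose value is null.
    // Empty strings are preserved on purpose.
    static Map<String, Object> withoutNulls(Map<String, Object> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (e.getValue() != null) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "hello");
        doc.put("name", null);
        doc.put("comment", "");
        System.out.println(withoutNulls(doc).keySet());
    }
}
```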
Re: indexing null values?
Pieter Berkel 2007-08-09, 04:17
From a theoretical IR standpoint, there is no reason to index null values, or even empty strings for that matter. However, in practice I've encountered plenty of cases where it is necessary to obtain a list of documents in which a particular field is null (i.e. hasn't been specified at index time) or an empty string.
For example, you may need to generate a list of products contained in your index that do not have a part number. A dirty, ugly workaround for this problem that we've used in the past is to replace null or unset values at index time with a special token value like "__null__" that (hopefully) won't appear in normal indexed data. This then allows you to perform a query something like part_number:"__null__" to obtain all documents without a part number. This approach has worked in the past for string fields, though I'm not sure how effective it would be for numeric field types.
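A rough sketch of that sentinel workaround, for illustration only. The class and method names are hypothetical, and only the token "__null__" and the example query come from the description above.

```java
// Sketch of the sentinel-token workaround; names are illustrative only.
final class NullSentinelSketch {

    // Token assumed never to occur in real data.
    static final String NULL_TOKEN = "__null__";

    // Applied at index time: a null value is replaced by the sentinel.
    static String toIndexedValue(String raw) {
        return raw == null ? NULL_TOKEN : raw;
    }

    // Applied at query time: builds a query matching documents
    // whose field was indexed with the sentinel.
    static String missingValueQuery(String field) {
        return field + ":\"" + NULL_TOKEN + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toIndexedValue(null));
        System.out.println(missingValueQuery("part_number"));
    }
}
```

The weak point is the assumption baked into NULL_TOKEN: if real data ever contains that string, those documents are indistinguishable from ones with no value.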
Ultimately, this leads to the situation where you are using Lucene (and Solr) as an RDBMS, which it clearly is not. While I'd love to have support for querying null / empty string fields, I don't think it's going to happen in the near future.
Piete
Re: indexing null values?
Yonik Seeley 2007-08-09, 05:27
On 8/9/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> From a theoretical IR standpoint, there is no reason to index null values,
> or even empty strings for that matter. However in practice there are plenty
> of cases that I've encountered where it is necessary to obtain a list of
> documents where a particular field is null (i.e. hasn't been specified at
> index time) or an empty string.
>
> For example, you may need to generate a list of products contained in your
> index that do not have a part number. A dirty, ugly hack work-around to
> this problem that we've used in the past is to replace null or unset values
> at index time with a special token value like "__null__" that (hopefully)
> won't appear in normal indexed data.

A null field (meaning no value) can be indexed by leaving it out, and searched with a negative filter or query clause:
  -field:[* TO *]
-Yonik
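As a small illustration of the negative clause, the sketch below builds the filter string for a given field and URL-encodes it as an fq (filter query) request parameter. The class and method names are hypothetical, and the field name part_number is borrowed from the earlier example; how the parameter is sent to a server is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds the "field has no value" filter; helper names are illustrative.
final class MissingFieldFilterSketch {

    // Negated open-ended range: matches documents with no value in the field.
    static String missingFieldFilter(String field) {
        return "-" + field + ":[* TO *]";
    }

    // The same clause, URL-encoded as an fq request parameter.
    static String asFilterParam(String field) {
        try {
            return "fq=" + URLEncoder.encode(missingFieldFilter(field), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(missingFieldFilter("part_number"));
        System.out.println(asFilterParam("part_number"));
    }
}
```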
Re: indexing null values?
Pieter Berkel 2007-08-09, 07:27
On 09/08/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> A null field (meaning no value) can be indexed by leaving it out, and
> searched with a negative filter or query clause:
> -field:[* TO *]
>
> -Yonik

Ah, that's a much more elegant solution, is this query syntax specific to Solr? (I don't recall seeing it in the Lucene query parser syntax documentation). Also, is there a simple method to search for fields containing an empty string?
Thanks, Piete
Re: indexing null values?
Yonik Seeley 2007-08-09, 20:03
On 8/9/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> On 09/08/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > A null field (meaning no value) can be indexed by leaving it out, and
> > searched with a negative filter or query clause:
> > -field:[* TO *]
> >
> > -Yonik
>
> Ah, that's a much more elegant solution, is this query syntax specific to
> Solr?

Lucene can do
  +something -field:[* TO *]
But being able to do a pure negative query is a recent addition to Solr (it's really most useful for filters).
Which reminds me, it doesn't look like that's been added to http://wiki.apache.org/solr/SolrQuerySyntax

> (I don't recall seeing it in the Lucene query parser syntax
> documentation). Also, is there a simple method to search for fields
> containing an empty string?

Wouldn't foo:"" work for a string (untokenized) field?
For a tokenized field that typically eliminates whitespace, no zero length strings would be indexed anyway.

-Yonik
Re: indexing null values?
Yonik Seeley 2007-08-09, 20:07
On 8/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Lucene can do
> +something -field:[* TO *]
I take that back... being able to specify open ends on the range query via the parser is also another Solr specific thing.
-Yonik
Re: indexing null values?
Pieter Berkel 2007-08-10, 02:07
On 10/08/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I take that back... being able to specify open ends on the range query
> via the parser is also another Solr specific thing.

This is one of the reasons why we're moving from a custom-built Lucene solution to Solr.

> Wouldn't foo:"" work for a string (untokenized) field?
> For a tokenized field that typically eliminates whitespace, no zero
> length strings would be indexed anyway.

Unfortunately the Lucene query parser doesn't accept empty string values in the query string:

SEVERE: org.apache.lucene.queryParser.ParseException: Cannot parse 'article_type:""': Lexical error at line 1, column 16. Encountered: <EOF> after : "\"\""
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:153)
    at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:93)
    at org.apache.solr.handler.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:115)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:723)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:194)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:162)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
    at java.lang.Thread.run(Thread.java:619)

While it would be a useful feature to have, I'm not sure if it's worth pursuing the matter further.
Piete
RE: indexing null values?
Luke Tan 2007-08-10, 03:53
Yonik,
Am I right to say that using a RangeFilter for this purpose might be less efficient for large indexes than indexing as "__null__" since RangeFilter uses TermEnum and TermDocs and iterates through every term in the index?
Luke
Re: indexing null values?
Yonik Seeley 2007-08-10, 04:59
On 8/9/07, Luke Tan <[EMAIL PROTECTED]> wrote:
> Am I right to say that using a RangeFilter for this purpose might be
> less efficient for large indexes than indexing as "__null__" since
> RangeFilter uses TermEnum and TermDocs and iterates through every term
> in the index?
Correct. But if used often, as filters of this type normally are, it will almost always be cached (and pre-cached via autowarming).
-Yonik
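Why caching softens the cost can be seen in a toy model. The sketch below is not Solr's filter cache: it is a small LRU map from a filter string to a set of matching document ids, with a hypothetical scan function standing in for the expensive term iteration. The class name, the eviction policy, and the scan counter are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy LRU cache of filter results; an illustration, not Solr's implementation.
final class FilterCacheSketch {

    private final Function<String, BitSet> scan; // stands in for the costly term scan
    private final Map<String, BitSet> cache;
    private int scanCount = 0;

    FilterCacheSketch(final int maxEntries, Function<String, BitSet> scan) {
        this.scan = scan;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the matching documents, running the scan only on a cache miss.
    BitSet docsMatching(String filter) {
        BitSet docs = cache.get(filter);
        if (docs == null) {
            scanCount++;
            docs = scan.apply(filter);
            cache.put(filter, docs);
        }
        return docs;
    }

    int scanCount() {
        return scanCount;
    }
}
```

In this model the scan runs once per distinct filter while the entry stays cached, so a frequently used filter pays the iteration cost only on its first use or after eviction.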