filter
public NutchDocument filter(NutchDocument doc,
Parse parse,
Text urlText,
CrawlDatum datum,
Inlinks inlinks)
throws IndexingException
Adds fields or otherwise modifies the document that will be indexed for a
parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
urlText - page URL
datum - crawl datum for the page
inlinks - page inlinks
- Returns:
modified (or a new) document instance, or null (meaning the document
should be discarded)
- Throws:
IndexingException
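The return contract described above (return the document to keep it, return null to discard it) can be illustrated with a minimal sketch. To keep the example compilable without Nutch on the classpath, it does not use the real API: `Doc` and the three-argument `filter` are hypothetical stand-ins for `NutchDocument` and for the `IndexingFilter.filter` method, and the field names `url` and `contentLength` are made up for illustration. A real filter would implement `IndexingFilter` and take the five parameters listed above.

```java
import java.util.HashMap;
import java.util.Map;

class FilterContractSketch {

    /** Hypothetical stand-in for NutchDocument: a simple bag of named fields. */
    public static class Doc {
        public final Map<String, String> fields = new HashMap<>();
    }

    /**
     * Hypothetical, simplified filter that mirrors the documented contract:
     * it adds fields to the document, or returns null to drop the document
     * from indexing. Here a page with no parsed text is treated as unwanted.
     */
    public static Doc filter(Doc doc, String parsedText, String url) {
        if (parsedText == null || parsedText.trim().isEmpty()) {
            return null; // null means "discard this document"
        }
        doc.fields.put("url", url);
        doc.fields.put("contentLength", String.valueOf(parsedText.length()));
        return doc; // the modified instance goes on to be indexed
    }

    public static void main(String[] args) {
        Doc kept = filter(new Doc(), "hello", "http://example.com/");
        Doc dropped = filter(new Doc(), "   ", "http://example.com/empty");

        // Callers must check for null before using the result.
        System.out.println("kept: " + (kept != null));
        System.out.println("dropped: " + (dropped == null));
    }
}
```

The point of the sketch is the null check on the caller's side: because null is a valid return value with a defined meaning, code that chains several filters has to stop as soon as one of them returns null rather than pass it on.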