Hello!

I am indexing web documents and need to extract each document's top-level URL (scheme plus hostname) into a separate field. I have had some success with the PatternTokenizerFactory (relevant schema bits at the bottom), but the behavior appears to be inconsistent. Most of the time the top-level URL is extracted just fine, but for some documents it is cut off partway through the hostname.

Examples:

URL                                                Extracted URL                      Comment
http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf  http://www.calgaryarb.ca           Success
http://www.calgarymlc.ca/about-cmlc/               http://www.calgarymlc.ca           Success
http://www.calgarypolicecommission.ca/reports.php  http://www.calgarypolicecommissio  Fail
https://attainyourhome.com/                        https://attai                      Fail
https://liveandplay.calgary.ca/DROPIN/page/dropin  https://livea                      Fail

Relevant schema:
<copyField dest="hostname" source="SolrId"/>

<field name="hostname" type="hostnameType" stored="true" indexed="false" multiValued="false"/>

<fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
               group="0"/>
  </analyzer>
</fieldType>

I have tested the regex on regex101 and it matches as expected there. Please see https://regex101.com/r/wN6cZ7/358.
So it appears that I have a gap in my understanding of how the Solr PatternTokenizerFactory works. I would appreciate any insight on the issue. The hostname field will be used in facet queries.
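
To help narrow this down outside Solr, here is a minimal sketch that runs the pattern through java.util.regex, which is what PatternTokenizer uses, and takes group 0 as in the schema. The class name HostnamePatternCheck and its members are hypothetical names made up for this test. The second pattern is only an assumption for comparison: it is what the expression would look like if the backslashes before "n" and "." had been lost somewhere between regex101 and the schema.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostnamePatternCheck {
    // Pattern copied character for character from the schema above.
    // Inside a character class, "/n" means the two literal characters '/' and 'n'.
    static final Pattern AS_WRITTEN =
        Pattern.compile("^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)");

    // Assumed variant for comparison only: newline and dot are escaped.
    static final Pattern ESCAPED =
        Pattern.compile("^https?://(?:[^@/\\n]+@)?(?:www\\.)?([^:/\\n]+)");

    // Returns the whole match (group 0, as in group="0"), or null if there is no match.
    static String extract(Pattern pattern, String url) {
        Matcher m = pattern.matcher(url);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
            "http://www.calgarypolicecommission.ca/reports.php",
            "https://attainyourhome.com/",
            "https://liveandplay.calgary.ca/DROPIN/page/dropin"
        };
        for (String url : urls) {
            System.out.println(url);
            System.out.println("  as written: " + extract(AS_WRITTEN, url));
            System.out.println("  escaped:    " + extract(ESCAPED, url));
        }
    }
}
```

In this sketch the as-written pattern returns the same truncated values listed in the examples above, each one stopping just before the first letter "n" in the hostname, while the two URLs that succeed have no "n" in their hostnames. The escaped variant returns the full scheme and hostname for all four URLs. This does not prove where the backslashes went missing, only that the pattern text as shown in the schema excludes the letter "n".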

Thank you!
Harinder
