-
RE: solr 3.5 and indexing performance
Agnieszka Kukałowicz 2012-03-13, 15:42

Hi,

I did some more tests for Hunspell in Solr 3.4 and 4.0:

Solr 3.4, full import of 489017 documents:

StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import of 489017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec

Server specification and Java settings are the same as before.

Cheers
Agnieszka
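The docs/sec figures in the message above follow directly from the document count and the elapsed time. A minimal sketch in Python that recomputes them; the dictionary keys and the function name are labels chosen for this illustration, and the numbers are the ones reported in the message:

```python
# Recompute indexing throughput from the figures reported in the message.
DOCS = 489017  # documents in the full import

# (Solr version, stem filter) -> elapsed seconds, as reported
elapsed = {
    ("3.4", "Stempel"): 2908,
    ("3.4", "Hunspell"): 3922,
    ("4.0", "Stempel"): 3016,
    ("4.0", "Hunspell"): 44580,
}


def docs_per_sec(docs, seconds):
    """Average indexing throughput in documents per second."""
    return docs / seconds


rates = {key: docs_per_sec(DOCS, secs) for key, secs in elapsed.items()}

for (version, stemmer), rate in sorted(rates.items()):
    print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

# Slowdown of Hunspell between 3.4 and 4.0, measured on the same import
hunspell_slowdown = elapsed[("4.0", "Hunspell")] / elapsed[("3.4", "Hunspell")]
print(f"Hunspell 4.0 vs 3.4: {hunspell_slowdown:.1f}x slower")
```

The rounded rates agree with the reported 168, 125, 162 and 11 docs/sec. The ratio makes the regression explicit: Stempel is nearly unchanged between versions, while Hunspell takes roughly eleven times longer on the same import.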
-
Re: solr 3.5 and indexing performance
Jan Høydahl 2012-03-13, 23:54

Hi,

Thanks a lot for your detailed problem description. It definitely is an error. Would you be so kind as to register it as a bug ticket, including your descriptions from this email?
http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8-bug_tracker.29

Please also attach your Polish Hunspell dictionaries to the issue. Then we'll try to reproduce the error.

I wonder if this performance decrease is also seen for English dictionaries?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
-
RE: solr 3.5 and indexing performance
Agnieszka Kukałowicz 2012-03-14, 13:36

Bug ticket created:
https://issues.apache.org/jira/browse/SOLR-3245

I also ran the test you asked for with an English dictionary. The results are in the ticket.

Agnieszka
-
solr 3.5 and indexing performance
mizayah 2012-02-22, 14:44

Hello,

I wanted to switch to a newer version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed.
I'm using 3.1, and after a few tests I discovered that 3.4 indexes a lot faster than 3.5.

My schema is really simple: a few fields using the "text" field type.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

All data and configuration are the same: same schema, same solrconfig, same Jetty.

Solr 3.5:

    Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy onInit
    INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=/vol/home/mciurla/proj/solr/accordion3.5/example/solr/data/index,segFN=segments_bl,version=1329831219365,generation=417,filenames=[_a5.fdx, _52.fdx, _aq.frq, _a5.fdt, _cr.nrm, _52.fnm, _a5.prx, segments_bl, _52.fdt, _7k.tii, _cr.frq, _a5.tis, _cr.fdt, _a5.nrm, _cr.prx, _cp.prx, _cr.fdx, _cn.nrm, _52.tvf, _cp.fnm, _co.tii, _52.tvd, _8o.tvx, _co.tis, _8o.tii, _a5.fnm, _8o.tvd, _7k.tis, _8o.tvf, _bb.tis, _7k.fdx, _7k.fdt, _7k.frq, _bb.tii, _cn.frq, _co.prx, _aq.tii, _cq.fdx, _52.tii, _cm.tis, _cq.fdt, _aq.tis, _52.tis, _aq.tvx, _co.nrm, _bb.prx, _cm.tii, _cr.fnm, _aq.tvf, _bb_3.del, _aq.tvd, _cm.frq, _cp.nrm, _cq.tis, _52.prx, _cn.tis, _8o.fnm, _cl.nrm, _cl.fnm, _a5.tii, _cn.tii, _cq.tii, _cp.tis, _cp.fdt, _cl.fdt, _cl.prx, _aq.fdt, _cl.fdx, _cr.tis, _co.frq, _7k.fnm, _cq.frq, _bb.fnm, _cr.tii, _cp.fdx, _cp.tii, _aq.fdx, _cq.tvd, _8o.fdt, _cq.tvf, _52.nrm, _8o.nrm, _aq.fnm, _8o.prx, _co.tvd, _cq.tvx, _52.frq, _bb.nrm, _bb.fdt, _cp.tvf, _a5.tvx, _cp.tvd, _cn.tvx, _7k.nrm, _bb.fdx, _cm.tvx, _cm.fdx, _cl.tvf, _cp.tvx, _co.fdx, _cl.tvd, _cn.tvf, _a5.frq, _cm.fdt, _a5.tvf, _co.fdt, _a5.tvd, _cp.frq, _cn.fdt, _cm.nrm, _7k_d.del, _cn.fdx, _52_1e.del, _7k.prx, _8o.fdx, _cn.prx, _cl.tis, _cq.nrm, _7k.tvx, _cq.prx, _cn.tvd, _cl.tii, _cm.fnm, _7k.tvd, _cm.prx, _8o.tis, _cm.tvf, _52.tvx, _7k.tvf, _cl.tvx, _cm.tvd, _a5_9.del, _bb.tvf, _bb.tvd, _cr.tvd, _co.tvf, _bb.tvx, _cr.tvf, _co.fnm, _aq.prx, _cl.frq, _cq.fnm, _aq_9.del, _bb.frq, _8o.frq, _aq.nrm, _co.tvx, _8o_t.del, _cr.tvx, _cn.fnm, _cl_6.del]
    Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
    INFO: newest commit = 1329831219365
    Feb 22, 2012 3:40:47 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[2271874, 2271875, 2271876, 2271877, 2271878, 2271879, 2271880, 2271881, ... (100 adds)]} 0 14213
    Feb 22, 2012 3:40:47 PM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/update params={} status=0 QTime=14213

whereas on Solr 3.4:

    Feb 22, 2012 3:42:56 PM org.apache.solr.core.SolrDeletionPolicy onInit
    INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=/vol/home/mciurla/proj/solr/accordion3.4/example/solr/data/index,segFN=segments_29,version=1329918470592,generation=81,filenames=[_2b.tvf, _2c.tvx, _2d.tvf, _2f.tvx, _2d.tvd, _15.prx, _15.frq, _2b.tvd, _2c.nrm, _20.fnm, _2b.tvx, _2c.fdx, _2c.prx, _2f.tii, _2f.tvf, _20.tvx, _2b.fnm, _2c.fdt, _2d.tis, _15.fdt, _20.frq, _2d.tvx, _2f.tvd, _15.fdx, _15.fnm, _2c.tvf, _2e.frq, _2e.prx, _2c.tvd, _2b.frq, _20.tvd, _2c.fnm, _20.tvf, _2e.tvf, _2e.nrm, _20.tis, _2b.prx, _20.tii, _2e.tvd, _15.tis, _2f.frq, _15.tii, _2e.tvx, _2e.tii, _2c.tis, _2c.frq, _2e.fdx, _2f.prx, _2f.fnm, _15.tvx, _2e.fdt, _15.tvf, _2b.tis, _2c.tii, _2d.prx, _2d.fnm, _20.fdx, _2b.tii, _2e.tis, _20.fdt, _2d.frq, _2b.nrm, _15.tvd, _15_b.del, _2b.fdt, _2f.nrm, _2d.fdx, segments_29, _2d.fdt, _2b.fdx, _20_2.del, _15.nrm, _2f.tis, _2d.tii, _2d.nrm, _20.prx, _20.nrm, _2e.fnm, _2f.fdt, _2f.fdx]
    Feb 22, 2012 3:42:56 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
    INFO: newest commit = 1329918470592
    Feb 22, 2012 3:42:56 PM org.apache.solr.update.processor.LogUpdateProcessor finish
    INFO: {add=[2269393, 2269394, 2269395, 2269396, 2269397, 2269398, 2269399, 2269400, ... (100 adds)]} 0 145
    Feb 22, 2012 3:42:56 PM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/update params={} status=0 QTime=145

Any idea what is going on?

View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3766653.html
Sent from the Solr - User mailing list archive at Nabble.com.
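Each of the two update requests in the logs above added 100 documents, so the QTime values give a rough per-batch throughput. A minimal sketch in Python, assuming a single batch is representative of the whole run; QTime is reported by Solr in milliseconds, the log lines are copied from the message, and the helper name is made up for this illustration:

```python
import re

# The SolrCore "execute" lines from the two runs, copied from the logs.
LOG_LINES = {
    "3.5": "INFO: [] webapp=/solr path=/update params={} status=0 QTime=14213",
    "3.4": "INFO: [] webapp=/solr path=/update params={} status=0 QTime=145",
}
BATCH_SIZE = 100  # "(100 adds)" in the LogUpdateProcessor lines

QTIME_RE = re.compile(r"QTime=(\d+)")


def qtime_ms(line):
    """Extract the QTime value (milliseconds) from a Solr request log line."""
    match = QTIME_RE.search(line)
    if match is None:
        raise ValueError(f"no QTime in line: {line!r}")
    return int(match.group(1))


qtimes = {version: qtime_ms(line) for version, line in LOG_LINES.items()}

# Documents per second for one batch: batch size divided by seconds taken
throughput = {v: BATCH_SIZE / (ms / 1000.0) for v, ms in qtimes.items()}

for version in sorted(throughput):
    print(f"Solr {version}: {qtimes[version]} ms per batch, "
          f"{throughput[version]:.1f} docs/sec")

slowdown = qtimes["3.5"] / qtimes["3.4"]
print(f"3.5 is about {slowdown:.0f}x slower on this batch")
```

On these two batches the difference is about two orders of magnitude, which is far larger than anything commit or merge settings would normally explain and points at per-document analysis cost.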
-
Re: solr 3.5 and indexing performance
Ahmet Arslan 2012-02-22, 15:49

> I wanted to switch to new version of solr, exactelly to 3.5
> but im getting big drop of indexing speed.

Could it be the <autoCommit> configuration in solrconfig.xml?
-
Re: solr 3.5 and indexing performance
mizayah 2012-02-22, 20:53

I have it all commented out in updateHandler, and I'm pretty sure there is no default autocommit:

    <updateHandler class="solr.DirectUpdateHandler2">

iorixxx wrote
>> I wanted to switch to new version of solr, exactelly to 3.5
>> but im getting big drop of indexing speed.
>
> Could it be <autoCommit> configuration in solrconfig.xml?
-
Re: solr 3.5 and indexing performance
mizayah 2012-02-23, 09:19

OK, I found it.

It's because of Hunspell, which is now included in Solr. Somehow, when I use it on my own in 3.4, it is a lot faster than the one bundled with 3.5.

I don't know what the differences are, but is there any way I can use my old Google Hunspell jar?
-
RE: solr 3.5 and indexing performance
Agnieszka Kukałowicz 2012-03-12, 15:42

Hi guys,

I have hit the same problem with Hunspell.
Running a few tests on 500 000 documents, I got:

Hunspell from http://code.google.com/p/lucene-hunspell/ with Solr 3.4 - 125 documents per second
Hunspell built from 4.0 trunk - 11 documents per second

All the tests were run on an 8-core CPU with 32 GB RAM and the index on SSD disks.
For Solr 3.5 I tried changing the JVM heap size, RAM buffer size and merge factor, but the indexing speed stayed at about 10-20 documents per second.

Is it possible that there is a performance bug in Solr 4.0? According to the previous post, the problem exists in version 3.5.

Best regards
Agnieszka Kukałowicz
-
Re: solr 3.5 and indexing performance
Jan Høydahl 2012-03-13, 08:47

Hi,

Have you confirmed that disabling Hunspell in solrconfig gets you back to normal speed?
What Hunspell configuration and dictionaries do you have?
Can you share more about your environment and documents?
Do you have a chance to run a profiler on your Solr instance? Try e.g. VisualVM and run the profiler to see which part of the code takes up the time: http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
-
RE: solr 3.5 and indexing performance
Agnieszka Kukałowicz 2012-03-13, 09:39

Hi,

Yes, I confirmed that without Hunspell indexing runs at normal speed.
I ran the tests in Solr 4.0 with Hunspell and with the Polish stemmer.
With StempelPolishStemFilterFactory the speed is normal.

My schema is quite simple. For Hunspell I have one text field that I copy 14 text fields into:

    <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>

    <copyField source="field1" dest="text"/>
    <copyField source="field2" dest="text"/>
    <copyField source="field3" dest="text"/>
    <copyField source="field4" dest="text"/>
    <copyField source="field5" dest="text"/>
    <copyField source="field6" dest="text"/>
    <copyField source="field7" dest="text"/>
    <copyField source="field8" dest="text"/>
    <copyField source="field9" dest="text"/>
    <copyField source="field10" dest="text"/>
    <copyField source="field11" dest="text"/>
    <copyField source="field12" dest="text"/>
    <copyField source="field13" dest="text"/>
    <copyField source="field14" dest="text"/>

The "text_pl_hunspell" configuration:

    <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>

I use a Polish dictionary, pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are the same files I used with version 3.4.

For the Polish stemmer the only difference is in the definition of the text field:

    <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>

    <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>

One document has 23 fields:
- 14 text fields copied to one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

The size of one document is 3-4 kB.
So I think this is not a very complicated schema.

My environment is:
- Linux, RedHat 6.2, kernel 2.6.32
- 2 physical Xeon 5606 CPUs (4 cores each)
- 32 GB RAM
- 2 SSD disks in RAID 0
- java version:
    java -version
    java version "1.6.0_26"
    Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
    Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
- java is running with -server -Xms4096M -Xmx4096M (I tried a lot of other settings and always got the same effect)
- Solr has the default configuration except for the RAM buffer size (128 MB)
- Solr 4.0 from nightly builds (2012-02-21 build)

If you need more information, please let me know. I will also try to use a profiler to see what happens.

Agnieszka