Aaron Bains 2011-08-31, 04:01
Hello,
What is the best way to remove duplicate values from the output? I am using the following query:
/solr/select/?q=wrt54g2&version=2.2&start=0&rows=10&indent=on&fl=productid
And I get the following results:
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1013033708</int> </doc>
<doc> <int name="productid">1013033708</int> </doc>
<doc> <int name="productid">1013033708</int> </doc>
But I don't want those results because there are duplicates. I am looking for results like the ones below:
<doc> <int name="productid">1011630553</int> </doc>
<doc> <int name="productid">1013033708</int> </doc>
I know there is deduplication and field collapsing but I am not sure if they are applicable in this situation. Thanks for your help!
-
Re: Duplication of Output
Erick Erickson 2011-08-31, 22:06
The first question I'd ask is "why are there duplicates in your index in the first place?" If you're denormalizing, that would account for it. Mostly, I'm just asking to be sure that you expect duplicate product IDs. If you make productid your <uniqueKey>, there will only be one document per product ID.
You'll have to re-index if you make this change though.
But grouping/field collapsing would, indeed, apply to this problem.
Deduplication isn't applicable, since you know exactly what the duplicates are. Deduplication is more for "fuzzy" removal of near-duplicates.
Hope this helps,
Erick
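For readers following along, the grouping/field collapsing suggestion above can be sketched as a change to the original query string. This is only a sketch: it assumes a Solr release that ships result grouping and that the `group`, `group.field`, and `group.main` request parameters are available there. The helper name `grouped_select_url` is hypothetical and is not part of Solr.

```python
from urllib.parse import urlencode


def grouped_select_url(query, field, base="/solr/select/"):
    """Build a select URL that collapses results on `field`.

    Hypothetical helper: it only assembles the query string and does
    not contact a Solr server.
    """
    params = [
        ("q", query),
        ("version", "2.2"),
        ("start", 0),
        ("rows", 10),
        ("indent", "on"),
        ("fl", field),
        ("group", "true"),        # turn on result grouping
        ("group.field", field),   # one group per distinct field value
        ("group.main", "true"),   # return the top doc of each group as a flat list
    ]
    return base + "?" + urlencode(params)


url = grouped_select_url("wrt54g2", "productid")
print(url)
```

With `group.main=true` the response should keep the familiar flat list of <doc> elements, one per distinct productid, instead of the nested grouped format.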
-
Re: Duplication of Output
Aaron Bains 2011-08-31, 22:18
Thanks! I appreciate your input. You are right: yesterday I actually denormalized my index using multivalued fields. Now I am using Solr the way it was designed, and I am happy. Everything seems to work great.
-- Aaron Bains, Ivey HBA +1 519.868.0820 (Mobile) [EMAIL PROTECTED]
-
Re: Duplication of Output
Markus Jelsma 2011-08-31, 22:28
> deduplication isn't applicable, since you know exactly what
> duplicates are. deduplication is more for "fuzzy" removal
> of near-duplicates..
That's only the case if you use Nutch's TextProfileSignature; MD5 and Lookup3 are meant for exact matching. I don't know whether Lookup3Signature works on non-string/text values, but I see no reason why it should not.
It might be an improvement to allow deduplication that skips creating a signature field and instead dedups directly on the non-string values.
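To illustrate the distinction drawn above, here is a minimal sketch of exact-match deduplication by signature. It is not Solr's implementation: it simply hashes the chosen fields with MD5 and lets a later document overwrite an earlier one that has the same signature. The function names, the sample documents, and the `name` field are all made up for illustration, and converting values with `str()` is an assumption about how non-string fields such as an integer productid could be handled.

```python
import hashlib


def signature(doc, fields):
    """Return an MD5 hex digest over the selected field values."""
    # Assumption: non-string values are stringified before hashing.
    joined = "\x1f".join(str(doc[f]) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def dedupe(docs, fields):
    """Keep one document per signature; later documents overwrite earlier ones."""
    kept = {}
    for doc in docs:
        kept[signature(doc, fields)] = doc
    return list(kept.values())


# Made-up sample data mirroring the product IDs in the thread.
docs = [
    {"productid": 1011630553, "name": "first copy"},
    {"productid": 1011630553, "name": "second copy"},
    {"productid": 1013033708, "name": "other product"},
    {"productid": 1013033708, "name": "other product again"},
]

unique = dedupe(docs, ["productid"])
print([d["productid"] for d in unique])
```

Because the signature is an exact hash, only documents with identical field values collapse together; a fuzzy signature would be needed to catch near-duplicates.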