Mahout, mail # dev - Dirichlet - NormalModel.pdf() calculation problem


Derek O'Callaghan 2010-09-21, 17:57
Jeff Eastman 2010-09-21, 18:23
Re: Dirichlet - NormalModel.pdf() calculation problem
Derek O'Callaghan 2010-09-22, 09:45

>  Oh that's brilliant! I have seen the same situation before
> too but never found the reason for it. Personally, I'd prefer to
> detect the divide by zero explicitly; something like:
>
> if (stdDev > 0)
>     return ex / (stdDev * SQRT2PI);
> else
>     return 0;
>

Yep, that looks better than what I had, I'll use that instead.
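
For reference, a minimal sketch of the explicit check, reduced to one dimension. The class name and the scalar signature are made up for illustration (the real model works on vectors), and the exponent formula is an assumption, since only the final division is quoted in this thread. Double.MIN_VALUE itself compares greater than 0, so the sketch also tests the squared value, which is the quantity reported to underflow to 0.

```java
// Sketch only: a one-dimensional stand-in for the pdf() discussed in this thread.
// GuardedNormalPdf and its scalar signature are hypothetical, not Mahout code.
final class GuardedNormalPdf {
    private static final double SQRT2PI = Math.sqrt(2.0 * Math.PI);

    static double pdf(double x, double mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        // Double.MIN_VALUE is > 0, but its square underflows to 0.0,
        // so check both stdDev and sd2 before dividing by either.
        if (!(stdDev > 0.0 && sd2 > 0.0)) {
            return 0.0;
        }
        double diff = x - mean;
        // Assumed exponent of a normal density: -(x - mean)^2 / (2 * sd2).
        double ex = Math.exp(-(diff * diff) / (2.0 * sd2));
        return ex / (stdDev * SQRT2PI);
    }

    public static void main(String[] args) {
        System.out.println(pdf(1.0, 1.0, Double.MIN_VALUE)); // degenerate cluster
        System.out.println(pdf(0.0, 0.0, 1.0));              // standard normal at its mean
    }
}
```

Returning 0 for a degenerate cluster follows the suggestion quoted above; whether 0 is the right density for a single-point cluster is a separate modelling question.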
 
> On the AbstractCluster point, since all Clusters (being
> themselves Models in the latest refactoring) can now be used
> directly by Dirichlet, the GaussianCluster subclass (which is
> now equivalent to AsymmetricSampledNormalModel if you check)
> will have the same pdf problem. Check also
> DistanceMeasureClusterDistribution which instantiates
> DistanceMeasureClusters (equivalent to L1Models) for models and
> GaussianClusterDistribution. Once these bake out a little I plan
> to deprecate most of the current Dirichlet models (which were
> experimental anyway and kind of a learning experience). There
> are already unit tests for the new hierarchy that produce
> equivalent results afaict.
>

Yeah, I had the same problem when using GaussianClusterDistribution, and I can see that its pdf() will also generate NaN, as you say. I might hold off on Dirichlet for the moment, or I'll just use it with the workaround. I was only trying it out to see what kind of results I get.

Now I want to take a look at that clean eigenvectors problem :)
 
> On 9/21/10 1:57 PM, Derek O'Callaghan wrote:
> >Hi Jeff,
> >
> >I mentioned this issue in my last mail to the CDbw thread, but
> >I thought I'd create a separate thread for it as it's a
> >different problem (although similar).
> >
> >When s0 is 1, NormalModel.computeParameters() will set stdDev
> >to Double.MIN_VALUE. However, this causes a problem in
> >subsequent calls to pdf() from
> >DirichletState.adjustedProbability(). In such a case, the
> >statement "double sd2 = stdDev * stdDev;" will set sd2 to 0, which
> >causes pdf() to return NaN. This means that the call to
> >UncommonDistribution.rMultinom() will return 0, and so (I think)
> >all subsequent points will be assigned to cluster 0.
> >
> >FYI I was able to work around this by changing the following in
> >NormalModel.pdf():
> >return ex / (stdDev * SQRT2PI);
> >
> >to:
> >
> >double pdf = ex / (stdDev * SQRT2PI);
> >if (Double.isNaN(pdf)) {
> >     pdf = 0.0;
> >}
> >return pdf;
> >
> >
> >As you mentioned in the other thread,
> >AbstractCluster.computeParameters() will also set the radius to
> >Double.MIN_VALUE when s0 is 1, although I'm not sure if that's
> >used anywhere that'll cause a similar problem as in pdf() above.
> >
> >
> >Derek
> >
>
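
To illustrate the failure described in the original report, here is a small standalone sketch of the arithmetic. It is not the Mahout code: the class name and scalar signature are hypothetical, and the exponent is assumed to be -(x - mean)^2 / (2 * sd2), which the thread does not spell out. Under that assumption, the NaN shows up for a point that coincides with the mean, where the exponent becomes -0.0 / 0.0.

```java
// Sketch only: reproduces the underflow described in the report in one dimension.
// MinValueUnderflowDemo is a hypothetical name; nothing here is Mahout API.
final class MinValueUnderflowDemo {
    private static final double SQRT2PI = Math.sqrt(2.0 * Math.PI);

    // Unguarded density, following the steps named in the report.
    static double unguardedPdf(double x, double mean, double stdDev) {
        double sd2 = stdDev * stdDev;                   // MIN_VALUE * MIN_VALUE underflows to 0.0
        double diff = x - mean;
        double exponent = -(diff * diff) / (2.0 * sd2); // -0.0 / 0.0 is NaN when x == mean
        double ex = Math.exp(exponent);                 // exp(NaN) is NaN
        return ex / (stdDev * SQRT2PI);                 // NaN propagates through the division
    }

    public static void main(String[] args) {
        double stdDev = Double.MIN_VALUE;
        System.out.println(stdDev * stdDev == 0.0);                        // prints true
        System.out.println(Double.isNaN(unguardedPdf(3.0, 3.0, stdDev)));  // prints true
    }
}
```

Because every comparison against NaN evaluates to false, any later code that compares such a probability against a threshold can take an unintended branch, which may be consistent with the rMultinom() behaviour reported above.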
Ted Dunning 2010-09-21, 18:26
Jeff Eastman 2010-09-21, 19:09