Mahout, mail # dev - Goals for Mahout 0.7


Re: Goals for Mahout 0.7
Jeff Eastman 2012-02-14, 18:46
+1 I think this is an excellent goal. The current code base does not
approach its Java APIs in a uniform manner nor are we where we had hoped
to be on the CLI API uniformity. There's a lot to do here in both areas.

In the Java API area, we do have some notable successes, with the
recommender APIs truly being designed for this kind of invocation. In
the clustering drivers, we have tried to support native Java access as
well, though there are a lot of arguments required for most invocations.
Other drivers have really only been written for CLI access, as you note,
and a large amount of fairly simple refactoring would be required to
present a usable Java API.
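
As a rough sketch of the kind of refactoring meant here, a CLI-only
driver could keep its main method as a thin argument parser that
delegates to a typed static method. The class, method and parameter
names below (ExampleDriver, run, input, output, maxIterations) are
hypothetical and are not taken from any existing Mahout driver.

```java
// Hypothetical driver for illustration only; not an actual Mahout class.
final class ExampleDriver {

    // Typed entry point that Java callers can invoke directly.
    static String run(String input, String output, int maxIterations) {
        if (maxIterations < 1) {
            throw new IllegalArgumentException("maxIterations must be >= 1");
        }
        // A real driver would launch its job here; this stub only echoes its arguments.
        return "input=" + input + " output=" + output + " maxIterations=" + maxIterations;
    }

    // CLI entry point: parses the string arguments, then delegates to run().
    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: ExampleDriver <input> <output> <maxIterations>");
            return;
        }
        System.out.println(run(args[0], args[1], Integer.parseInt(args[2])));
    }
}
```

With this shape, the command line and the Java API share one code path,
and string parsing stays confined to main.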

The challenge here is that the Java API must account for all of the
optional CLI arguments of every algorithm. This either leads to ~37
typed arguments (hyperbole) or a set of helper methods which provide
useful defaults for use in common situations. Another approach is to
implement configuration beans which contain all the argument values
required for full specification.
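
To make the two options concrete, here is a minimal sketch that combines
them: a configuration bean carrying defaults for the optional arguments,
plus a helper overload for the common case. Every name and default value
here (ExampleClusteringConfig, ExampleClusterer, maxIterations,
convergenceDelta, runSequential) is an assumption made for illustration,
not part of Mahout.

```java
// Hypothetical configuration bean; names and defaults are assumptions.
final class ExampleClusteringConfig {
    private int maxIterations = 10;
    private double convergenceDelta = 0.001;
    private boolean runSequential = false;

    // Setters return this so that calls can be chained.
    ExampleClusteringConfig setMaxIterations(int value) { maxIterations = value; return this; }
    ExampleClusteringConfig setConvergenceDelta(double value) { convergenceDelta = value; return this; }
    ExampleClusteringConfig setRunSequential(boolean value) { runSequential = value; return this; }

    int getMaxIterations() { return maxIterations; }
    double getConvergenceDelta() { return convergenceDelta; }
    boolean isRunSequential() { return runSequential; }
}

// Hypothetical algorithm facade; it only reports what it was asked to do.
final class ExampleClusterer {

    // Helper for the common case: every optional argument keeps its default.
    static String cluster(String input, String output) {
        return cluster(input, output, new ExampleClusteringConfig());
    }

    // Full specification: the bean carries all optional arguments.
    static String cluster(String input, String output, ExampleClusteringConfig config) {
        return input + " -> " + output
                + " maxIterations=" + config.getMaxIterations()
                + " runSequential=" + config.isRunSequential();
    }
}
```

A caller who needs only one non-default value could then write
ExampleClusterer.cluster("in", "out", new ExampleClusteringConfig().setMaxIterations(25))
and leave everything else alone.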

In the current clustering refactoring under way to utilize the
ClusterClassifier, arguments are to be provided in ClusteringPolicy
objects, so I'm biased towards the latter approach. We ought to agree
upon which style to use in taking this goal forward, but I am 100% behind it.

Jeff

On 2/13/12 10:31 AM, John Conwell wrote:
> From my perspective, I'd really like to see the Mahout API migrate away
> from the command-line-centric design it currently uses, and move more
> towards a library-centric API design.  I think this would go a long way in
> getting Mahout adopted into real-life commercial applications.
>
> While there might be a few algorithm drivers that you interact with by
> creating an instance of a class, and calling some method(s) on the instance
> to interact with it (I haven't actually seen one like that, but there might
> be a few), many algorithms are invoked by calling some static function on a
> class that takes ~37 typed arguments.  But what's worse, many drivers are
> invoked by having to create a String array with ~37 arguments as string
> values, and calling the static main function on the class.
>
> Now I'm not saying that having a static main function available to invoke
> an algorithm from the command line isn't useful.  It is, when you're testing
> an algorithm.  But once you want to integrate the algorithm into a
> commercial workflow it kind of sucks.
>
> For example, imagine if the API for invoking Math.max were designed the way
> many of the Mahout algorithms currently are.  You'd have something like
> this:
>
> String[] args = new String[3];
> args[0] = "max";
> args[1] = "7";
> args[2] = "4";
> int max = Math.main(args);
>
> It makes your code a horrible mess and very hard to maintain, as well as
> very prone to bugs.
>
> When I see a bunch of static main functions as the only way to interact
> with a library, no matter what the quality of the library is, my initial
> impression is that this has to be some minimally supported effort by a few
> PhD candidates still in academia, who will drop the project as soon as they
> graduate.  And while this might not be the case, it is one of the first
> impressions it gives, and can lead a company to drop the library from
> consideration before they do any due diligence into its quality and utility.
>
> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> API re-design will become more and more necessary, especially if you want a
> higher Mahout integration rate into commercial applications and workflows.
>
> Also, I hope I don't sound too negative.  I'm very impressed with Mahout and
> its capabilities.  I really like that there is a well-thought-out class
> library of primitives for designing new serial and distributed machine
> learning algorithms.  And I think it has a high utility for integration
> into highly visible commercial projects.  But its high-level public API
> really is a barrier to entry when trying to design commercial applications.