Tuesday, February 7, 2012

Bayes algorithm with Apache Mahout

Apache Mahout in action, using the Java API and the Bayes algorithm.

Simple Bayes implementation

For readers who don't know what Apache Mahout is: mahout
Here we'll implement the Bayes algorithm for text classification.
What does the example do?
  • Trains Apache Mahout on labeled content so that it can recognize similar content later.
  • Automatically classifies new content.
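
To make the two steps concrete before involving Mahout, here is a minimal sketch of naive Bayes text classification in plain Java. It is not Mahout's implementation: the class TinyNaiveBayes, its methods, and the simple letter/digit tokenization are assumptions made for illustration, and add-one smoothing is used so that unseen words do not zero out a label.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical, minimal multinomial naive Bayes; for illustration only. */
class TinyNaiveBayes {
    // label -> (word -> number of occurrences in that label's training text)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // label -> total number of words seen for that label
    private final Map<String, Integer> totalWords = new HashMap<>();
    // label -> number of training documents
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumption: lowercase and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    /** Step 1: training. Count the words of a document under its label. */
    void train(String label, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Step 2: classification. Return the label with the highest log-probability. */
    String classify(String text, String defaultLabel) {
        if (totalDocs == 0 || vocabulary.isEmpty()) {
            return defaultLabel; // nothing learned yet
        }
        String best = defaultLabel;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            // log prior: share of training documents carrying this label
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) {
                    continue;
                }
                // add-one smoothing over the vocabulary
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("english", "mahout is a good product.");
        nb.train("french", "mahout est un bon produit.");
        System.out.println(nb.classify("this is a good library", "other"));
    }
}
```

Mahout does the same two things at a larger scale: training persists the counts in its datastore, and classification reads them back to score each label.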

Dependencies

For the example we only need the mahout-core dependency. We use version 0.5.
Maven
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.5</version>
</dependency>
Ivy
<dependency name="mahout-core" org="org.apache.mahout" rev="0.5" />

Training

Training Mahout to recognize our content:
/**
  * Trains Apache Mahout on a labeled file.
  * @param label the label associated with the file.
  * @param fileToClassify the file associated with the specified label.
  * @param charset the charset of the file content.
  * @param outputDir the directory where the transformed file will be stored.
  * @param databaseOutputDir the directory of the Apache Hadoop database, which contains the reference data for future classification.
  * @throws IOException
  */
 public void training(String label, String fileToClassify, String charset, String outputDir,
   String databaseOutputDir) throws IOException {

  /*
   * Take the document and associate it with a label inside a file that
   * respects the Apache Mahout input format:
   * [LABEL] _TAB_ [TEXT]
   * example:
   * english _TAB_ mahout is a good product.
   * french _TAB_ mahout est un bon produit.
   * Note the analyzer: this is a Lucene analyzer. Apache Mahout provides
   * one by default, and it is the one used here.
   * In short, the analyzer defines how the words are extracted from your file.
   */
  BayesFileFormatter.format(label, new DefaultAnalyzer(), new File(fileToClassify), Charset.forName(charset),
    new File(outputDir));

  

  /*
   * Here we build the Bayes parameters object, which defines how the
   * training data is stored. Mahout uses Apache Hadoop in the background
   * to save the classification data.
   * See the Hadoop documentation to learn more about this object.
   * Just take care to specify the classifierType and the basePath.
   */
  BayesParameters bayesParameters = buildBayesParam(charset, databaseOutputDir);

  /*
   * Start the training!
   */
  TrainClassifier.trainNaiveBayes(new Path(outputDir), new Path(databaseOutputDir), bayesParameters);

 }

 private BayesParameters buildBayesParam(String charset, String databaseOutputDir) {
  BayesParameters bayesParameters = new BayesParameters();
  bayesParameters.setGramSize(1);
  bayesParameters.set("verbose", "true"); // Set to true to see what happens.
  bayesParameters.set("classifierType", "bayes");
  bayesParameters.set("defaultCat", "other"); // The default category returned when no label is found for the given text.
  bayesParameters.set("encoding", charset);
  bayesParameters.set("alpha_i", "1.0");
  bayesParameters.set("dataSource", "hdfs");
  bayesParameters.set("basePath", databaseOutputDir);
  return bayesParameters;
 }
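
The [LABEL] _TAB_ [TEXT] input format mentioned in the training code can be illustrated without Mahout. The sketch below is an assumption-laden stand-in, not what BayesFileFormatter actually does: the class LabeledLineFormatter and its formatLine method are hypothetical, and lowercasing plus splitting on non-letter characters only approximates a Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that builds one training line: label, tab, tokens. */
class LabeledLineFormatter {

    static String formatLine(String label, String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: a crude analyzer that lowercases and keeps letters and digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        // The label and the text are separated by a single tab character.
        return label + "\t" + String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(formatLine("english", "Mahout is a good product."));
        System.out.println(formatLine("french", "Mahout est un bon produit."));
    }
}
```

Each line of the resulting file then carries exactly one labeled document, which is the shape of input the training step reads.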

  

Classification

This code asks Mahout to find the right label for the given content:
/**
  * Asks Mahout to find the right label for the specified content.
  * 
  * @param contentToClassify
  *            the content to classify.
  * @param charset
  *            the charset of the content.
  * @param databaseOutputDir
  *            the Mahout database directory.
  * @return the label retrieved by Mahout.
  * @throws InvalidDatastoreException
  * @throws IOException
  */
 public String searchLabel(String contentToClassify, String charset, String databaseOutputDir)
   throws InvalidDatastoreException, IOException {
  // Define the algorithm to use.
  Algorithm algorithm = new BayesAlgorithm();
  // Specify the Mahout datastore to use (the path of the Hadoop database).
  Datastore datastore = new InMemoryBayesDatastore(buildBayesParam(charset, databaseOutputDir));
  // Initialize the Mahout context.
  ClassifierContext context = new ClassifierContext(algorithm, datastore);
  context.initialize();
  
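  // Run the search. classifyDocument takes the document as an array of
  // tokens; the whole content is passed here as a single token, so
  // multi-word content should be split into words first.
  ClassifierResult classifyResult = context.classifyDocument(new String[] { contentToClassify }, "other");
  
  // Return the label found, or "other" if none matched.
  return classifyResult.getLabel();
 }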
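
The classification call above hands classifyDocument a String[], and the parameters set a gram size of 1. Assuming each array entry is meant to be one token, the content has to be split before the call. The sketch below shows one possible way to do that in plain Java. The class NGramTokens, its generate method, and the underscore used to join multi-word grams are hypothetical choices for illustration, not Mahout's own tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer: whitespace-separated words grouped into n-grams. */
class NGramTokens {

    /**
     * Returns every gram of 1 up to gramSize consecutive words, in order of
     * starting position. With gramSize = 1 this is simply the list of words.
     */
    static String[] generate(String text, int gramSize) {
        String trimmed = text.trim();
        if (trimmed.isEmpty() || gramSize < 1) {
            return new String[0];
        }
        String[] words = trimmed.split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int len = 1; len <= gramSize && start + len <= words.length; len++) {
                if (len > 1) {
                    gram.append('_'); // assumption: underscore joins the words of a gram
                }
                gram.append(words[start + len - 1]);
                grams.add(gram.toString());
            }
        }
        return grams.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : generate("mahout is a good product", 1)) {
            System.out.println(token);
        }
    }
}
```

Whatever tokenization is used at classification time should mirror the one applied during training (analyzer and gram size), otherwise the tokens will not match the stored features.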
I strongly recommend reading the source of the mahout-utils dependency; it is very useful for understanding the project.
References
Apache Mahout official site
Another example of using the Bayes algorithm with Mahout

6 comments:

  1. Nice post. Is this working, and is there a confusion matrix you are checking it against?

    1. What do you mean by confusion matrix? I'm not very skilled in mathematical concepts, sorry :).

      I just tested it on sample data and it worked. To be sure it works well, it would be necessary to test on large data sets to judge the relevance of the API.

      There is an example with a country predictor for Wikipedia articles:
      https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example

  2. Hello Nicolas,
    Really useful post. I was wondering whether it is possible to train the classifier incrementally?

    1. Hi,

      Do you mean enriching the classification over several passes?
      If so, it is possible: you can persist the learning data in the datastore. I advise you to use a big data store like Hadoop, which is supported by Apache Mahout.



  3. Hi, is Hadoop necessary for classification?
    Is there any way to do it without Hadoop?
    If so, can you please tell me?

  4. Hi Suri,

    Unfortunately, as far as I know, the Bayes classification implementation is based on Hadoop. But there is the SGD (logistic regression) algorithm, which is very configurable. Its drawback is that it is sequential, but it is very fast.
    For example, you can work in memory, with a database, or with your own implementation.

    You can see the details here:
    http://mahout.apache.org/users/classification/logistic-regression.html
