Document Classification Open Source Software by alexey, Aug 16, 2016

Document Classification

A classifier is an algorithm that distinguishes between a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples. MALLET includes implementations of several classification algorithms, including Naïve Bayes, Maximum Entropy, and Decision Trees. In addition, MALLET provides tools for evaluating classifiers. Training Maximum Entropy document classifiers using Generalized Expectation Criteria is described in this separate tutorial. To get started with classification, first load your data into MALLET format as described in the importing data section.

Importing data

MALLET represents data as lists of "instances". All MALLET instances include a data object. An instance can also include a name and (in classification contexts) a label. For example, if the application is guessing the language of web pages, an instance might consist of a vector of word counts (data), the URL of the page (name), and the language of the page (label). For information about the MALLET data import API, see the data import developer's guide.

There are two primary methods for importing data into MALLET format: first, when the source data consists of many separate files, and second, when the data is contained in a single file with one instance per line.

One instance per file: After downloading and building MALLET, change to the MALLET directory. Assume that text-only (.txt) versions of English web pages are in files in a directory called sample-data/web/en and text-only versions of German pages are in sample-data/web/de (download sample data). Now run this command:
bin/mallet import-dir --input sample-data/web/* --output web.mallet
MALLET will use the directory names as labels and the filenames as instance names. Note: make sure you are in the mallet directory, not the mallet/bin directory; otherwise you will get a ClassNotFoundException.

One file, one instance per line: Assume the data is in the following format:
[URL]  [language]  [text of the page...]
After downloading and building MALLET, change to the MALLET directory and run the following command:
bin/mallet import-file --input /data/web/data.txt --output web.mallet
In this case, the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens. Note that the data in this case will be a vector of feature/value pairs, where each feature is a distinct word type and its value is the number of times that word occurs in the text.
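This feature/value representation (word type mapped to occurrence count) can be illustrated with a short Python sketch; this is just the concept, not MALLET code:

```python
from collections import Counter

def to_feature_vector(text):
    """Map a text to {word type: occurrence count} pairs,
    mirroring the word-count representation described above."""
    tokens = text.lower().split()
    return dict(Counter(tokens))

vec = to_feature_vector("the cat sat on the mat")
# 'the' occurs twice, every other word type once
```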

Chapter 5. Document Categorizer

Table of Contents

Document Categorizer Tool
Document Categorizer API
Training Tool
Training API


The OpenNLP Document Categorizer can classify text into pre-defined categories. It is based on the maximum entropy framework. For someone interested in Gross Margin, the sample text below could be classified as GMDecrease

Major acquisitions that have a lower gross margin than the existing network
also had a negative impact on the overall gross margin, but it should improve
following the implementation of its integration strategies.

and the text below could be classified as GMIncrease

The upward movement of gross margin resulted from amounts pursuant to 
adjustments to obligations towards dealers.

To be able to classify a text, the document categorizer needs a model. Because the categories are requirements-specific, the OpenNLP project does not ship a pre-built model for the document categorizer.

Document Categorizer Tool

The easiest way to try out the document categorizer is the command line tool. The tool is only intended for demonstration and testing. The following command shows how to use the document categorizer tool.

$ opennlp Doccat model

The input is read from standard input and output is written to standard output, unless they are redirected or piped. As with most components in OpenNLP, the document categorizer expects input that is segmented into sentences.

nltk.classify package


nltk.classify.api module

Interfaces for labeling tokens with category labels (or “class labels”).

ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.
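As a rough illustration of that interface contract (a sketch of the idea, not NLTK's actual source), a single-category classifier exposes a fixed label set and a method that maps each input to exactly one label:

```python
from abc import ABC, abstractmethod

class ClassifierI(ABC):
    """Sketch of a single-category classification interface:
    a fixed, finite label set; each input gets exactly one label."""

    @abstractmethod
    def labels(self):
        """Return the list of category labels this classifier can assign."""

    @abstractmethod
    def classify(self, featureset):
        """Return exactly one label from self.labels() for the input."""

class LengthClassifier(ClassifierI):
    """Toy implementation: label texts 'short' or 'long' by word count."""
    def labels(self):
        return ["short", "long"]

    def classify(self, featureset):
        return "short" if featureset["num_words"] < 10 else "long"

clf = LengthClassifier()
clf.classify({"num_words": 3})   # "short"
```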

What is Text Classification?

Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications--typically generated by a human--and learns how to classify further documents using what it learned with language models. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.

20 Newsgroups Demo

A publicly available data set to work with is the 20 newsgroups data available from the

20 Newsgroups Home Page

4 Newsgroups Sample

We have included a sample of 4 newsgroups with the LingPipe distribution in order to allow you to run the tutorial out of the box. You may also download and run over the entire 20 newsgroup dataset. LingPipe's performance over the whole data set is state of the art.

Quick Start

Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:

> cd demos/tutorial/classify

You may then run the demo from the command line (placing all of the code on one line):

On Windows:

-cp "../../../lingpipe-4.1.0.jar;

On Linux, Mac OS X, and other Unix-like operating systems:

-cp "../../../lingpipe-4.1.0.jar:

or through Ant:

ant classifyNews

The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/fourNewsGroups/4news-test/. The results of scoring are printed to the command line and explained in the rest of this tutorial.

The Code

List of 25+ Natural Language Processing APIs

by Chris Ismael on April 26, 2013

Natural Language Processing API

Note: Check out our latest API collections page for the list of updated APIs.

Natural Language Processing, or NLP, is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

Here are useful APIs that help bridge the human-computer interaction:

Text Classification

MeaningCloud (Private API) - http://www.meaningcloud.com - Created: October 2013

Automatic multilingual text classification according to pre-established categories defined in a model. The algorithm used combines statistical classification with rule-based filtering, which makes it possible to obtain a high degree of precision for very...

Endpoint: class-1.1 (automatic classification of multilingual texts)

URL Parameters

- Category prefix filter: a list of prefixes of the codes of the categories to which the classification is limited, with values separated by '|'. Categories that do not start with any of the listed prefixes are not taken into account in the classification. For example, to restrict classification to the human interest category, the prefix used would be 0800.
- Classification model to use. It defines the categories into which the text may be classified. Possible values are: IPTC_es, IPTC_en, IPTC_ca, IPTC_pt, IPTC_it, IPTC_fr, EUROVOC_es_ca, BusinessRep_es, BusinessRepShort_es.
- Output format: xml or json.
- Descriptive title of the content. This is an optional field; it can be plain text, HTML, or XML, always using UTF-8 encoding. Terms relevant to the classification that appear in the title have more influence on the classification than the same terms in the text.
- Input text ('txt'). It can be plain text, HTML, or XML, always using UTF-8 encoding. (Required if 'doc' and 'url' are empty.)
- URL ('url') with the content to classify. Currently only non-authenticated HTTP and FTP are supported; the supported content types for URL contents are listed in the API documentation. (Required if 'txt' and 'doc' are empty.)
- Verbose mode. Shows additional information about the classification.
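Putting these parameters together, a request could be assembled as follows. This is a sketch using only Python's standard library: the 'key' parameter name and the exact host are assumptions, while 'model' and 'txt' come from the parameter descriptions above; the request is built but deliberately not sent.

```python
from urllib.parse import urlencode

# Hypothetical request assembly -- 'key' and the host are placeholders,
# 'model' and 'txt' are parameters described in the API documentation.
params = {
    "key": "YOUR_API_KEY",
    "model": "IPTC_en",
    "txt": "Gross margin improved after the acquisition.",
}
query = urlencode(params)
url = "https://api.meaningcloud.com/class-1.1?" + query
```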

Request Headers

X-Mashape-Key: sign up to consume this API
A Naive Bayesian Classifier written in Python

Naive Bayesian Classifier

This is an implementation of a Naive Bayesian Classifier written in Python. The utility uses statistical methods to classify documents, based on the words that appear within them. A common application for this type of software is in email spam filters.

The utility must first be 'trained' using large numbers of pre-classified documents. During the training phase, a database is populated with information about how often certain words appear in each type of document. Once training is complete, unclassified documents can be submitted to the classifier, which will return a value between 0 and 1 indicating the probability that the document belongs to one class of document rather than another.


To train the utility, use the following command:

python learn <doctype> <file> <count>
  • The doctype argument can be any non-empty value - this is just the name you have chosen for the type of document that you are showing to the classifier
  • The file argument indicates the location of the file containing the training data that you wish to use
  • The count argument is a numeric value indicating the number of separate documents contained in the training data file

For example:

python learn spam all_my_spam.txt 10000
python learn ham inbox.txt 10000


Once training is complete, classification is performed using this command:

python classify <file> <doctype> <doctype>
  • The file argument indicates the location of the file containing the document to be classified
  • The two doctype arguments are the names of the document types against which the input file will be compared

For example:

python classify nigerian_finance_email.txt spam ham
> Probability that document is spam rather than ham is 0.98
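The word-count approach this README describes can be sketched as standalone Python. This is a conceptual illustration with toy data, not the repository's actual code; it uses add-one smoothing and equal class priors.

```python
import math
from collections import Counter

def train(docs_by_class):
    """Count word occurrences per class from lists of training texts."""
    return {c: Counter(w for d in docs for w in d.lower().split())
            for c, docs in docs_by_class.items()}

def prob_first_class(counts, class_a, class_b, text):
    """Return the probability that `text` belongs to class_a rather than
    class_b, using naive Bayes with add-one smoothing and equal priors."""
    vocab = set(counts[class_a]) | set(counts[class_b])
    log_odds = 0.0
    for w in text.lower().split():
        pa = (counts[class_a][w] + 1) / (sum(counts[class_a].values()) + len(vocab))
        pb = (counts[class_b][w] + 1) / (sum(counts[class_b].values()) + len(vocab))
        log_odds += math.log(pa) - math.log(pb)
    return 1 / (1 + math.exp(-log_odds))   # value between 0 and 1

counts = train({"spam": ["win money now", "free money offer"],
                "ham":  ["meeting agenda attached", "lunch tomorrow"]})
p = prob_first_class(counts, "spam", "ham", "free money")
# p > 0.5: the text looks more like spam than ham
```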


Identifying which category an object belongs to.

Applications: Spam detection, Image recognition.

Algorithms: SVM, nearest neighbors, random forest, ...
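As a minimal illustration of one of these algorithm families, a one-nearest-neighbour classifier can be written in a few lines of plain Python. This is a toy sketch on made-up two-dimensional points; a library such as scikit-learn provides the production counterpart.

```python
import math

def nearest_neighbor(train_points, query):
    """Classify `query` with the label of the closest training point
    (1-NN under Euclidean distance)."""
    _, label = min(train_points,
                   key=lambda pl: math.dist(pl[0], query))
    return label

# Hypothetical training data: (feature vector, label) pairs.
train_points = [((0.0, 0.0), "ham"), ((5.0, 5.0), "spam")]
nearest_neighbor(train_points, (0.5, 1.0))   # "ham"
nearest_neighbor(train_points, (4.0, 6.0))   # "spam"
```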