bin/mallet import-dir --input sample-data/web/* --output web.mallet

MALLET will use the directory names as labels and the filenames as instance names. Note: make sure you are in the mallet directory, not the mallet/bin directory; otherwise you will get a ClassNotFoundException.

One file, one instance per line: Assume the data is in the following format:
[URL] [language] [text of the page...]

After downloading and building MALLET, change to the MALLET directory and run the following command:
bin/mallet import-file --input /data/web/data.txt --output web.mallet

In this case, the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens. Note that the data in this case will be a vector of feature/value pairs, where each feature is a distinct word type and the value is the number of times that word occurs in the text.
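The feature-vector representation described above can be sketched in plain Python. This is a simplified illustration of the one-instance-per-line format, not MALLET's actual implementation (the tokenization here is plain whitespace splitting):

```python
from collections import Counter

def parse_line(line):
    """Parse one '[name] [label] [text...]' line into an instance.

    Mirrors, in simplified form, what `mallet import-file` does: the
    remaining text becomes feature/value pairs where each feature is a
    word type and the value is its count.
    """
    name, label, *words = line.split()
    features = Counter(words)  # feature = word type, value = occurrence count
    return name, label, features

name, label, features = parse_line(
    "http://example.com/page en the cat sat on the mat"
)
print(name, label, dict(features))
```

Here `features["the"]` would be 2, since "the" occurs twice in the text portion of the line.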
The OpenNLP Document Categorizer can classify text into pre-defined categories. It is based on the maximum entropy framework. For someone interested in Gross Margin, the sample text given below could be classified as GMDecrease
Major acquisitions that have a lower gross margin than the existing network also had a negative impact on the overall gross margin, but it should improve following the implementation of its integration strategies.
and the text below could be classified as GMIncrease
The upward movement of gross margin resulted from amounts pursuant to adjustments to obligations towards dealers.
To classify a text, the document categorizer needs a model. The classifications are requirements-specific, so the OpenNLP project does not provide a pre-built model for the document categorizer.
The easiest way to try out the document categorizer is the command line tool. The tool is only intended for demonstration and testing. The following command shows how to use the document categorizer tool.
$ opennlp Doccat model
The input is read from standard input and the output is written to standard output, unless they are redirected or piped. As with most components in OpenNLP, the document categorizer expects input that is segmented into sentences.
Interfaces for labeling tokens with category labels (or “class labels”).
ClassifierI is a standard interface for “single-category classification”, in which the set of categories is known, the number of categories is finite, and each text belongs to exactly one category.
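A toy implementation makes the contract concrete. The class below is illustrative only (it is not part of NLTK): it exposes a known, finite label set and assigns exactly one label to every input, here simply the most frequent label seen in training:

```python
from collections import Counter

class MostFrequentLabelClassifier:
    """Toy classifier following the single-category contract:
    a fixed, finite label set, and exactly one label per input."""

    def __init__(self, labeled_featuresets):
        # "Train" by remembering the most common training label.
        counts = Counter(label for _, label in labeled_featuresets)
        self._labels = sorted(counts)
        self._best = counts.most_common(1)[0][0]

    def labels(self):
        """The known, finite set of categories."""
        return self._labels

    def classify(self, featureset):
        """Return exactly one category for the input."""
        return self._best

train = [({"w": "buy"}, "spam"), ({"w": "hi"}, "ham"), ({"w": "win"}, "spam")]
clf = MostFrequentLabelClassifier(train)
print(clf.classify({"w": "anything"}))  # → spam
```

Real classifiers differ only in how `classify` uses the feature set; the interface shape stays the same.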
Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications (typically generated by a human) and learns how to classify further documents using what it learned with language models. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.
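The idea of classifying with language models can be sketched as follows: train one language model per category and pick the category whose model assigns the text the highest likelihood. The sketch below uses unigram models with add-one smoothing for brevity (LingPipe itself uses character n-gram models, so this only illustrates the principle):

```python
import math
from collections import Counter

def train_lm(texts):
    """Unigram language model: word counts plus their total."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return counts, sum(counts.values())

def log_likelihood(text, counts, total, vocab_size):
    # Add-one (Laplace) smoothing so unseen words keep nonzero probability.
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )

def classify(text, models):
    """Return the category whose language model best explains the text."""
    vocab = {w for counts, _ in models.values() for w in counts}
    v = len(vocab) + 1
    return max(models, key=lambda c: log_likelihood(text, *models[c], v))

models = {
    "hockey": train_lm(["the goalie made a save", "he scored a goal"]),
    "space": train_lm(["the rocket reached orbit", "nasa launched a probe"]),
}
print(classify("the rocket launch", models))  # → space
```

The category labels and training sentences here are made up for illustration; a real run would train on a corpus such as the newsgroup data described next.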
A publicly available data set to work with is the 20 newsgroups data available from the
We have included a sample of 4 newsgroups with the LingPipe distribution in order to allow you to run the tutorial out of the box. You may also download and run over the entire 20 newsgroup dataset. LingPipe's performance over the whole data set is state of the art.
Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:
> cd demos/tutorial/classify
You may then run the demo from the command line (placing all of the code on one line). On Windows:

java -cp "../../../lingpipe-4.1.0.jar;classifyNews.jar" ClassifyNews
On Linux, Mac OS X, and other Unix-like operating systems:
java -cp "../../../lingpipe-4.1.0.jar:classifyNews.jar" ClassifyNews
or through Ant:
The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/fourNewsGroups/4news-test/. The results of scoring are printed to the command line and explained in the rest of this tutorial.
Natural Language Processing, or NLP, is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
Here are useful APIs that help bridge the human-computer interaction:
Automatic multilingual text classification according to pre-established categories defined in a model. The algorithm used combines statistical classification with rule-based filtering, which makes it possible to obtain a high degree of precision for very...
List of prefixes of the codes of the categories to which the classification is limited. Each value is separated by '|'. Categories whose code does not start with one of the specified prefixes will not be taken into account in the classification. For example, to limit classification to the human interest category, the prefix used would be 0800.

model (STRING)
Classification model to use. It defines the categories into which the text may be classified. Possible values are: IPTC_es, IPTC_en, IPTC_ca, IPTC_pt, IPTC_it, IPTC_fr, EUROVOC_es_ca, BusinessRep_es, BusinessRepShort_es

of (STRING)
Output format: xml or json

title (STRING)
Descriptive title of the content. It is an optional field, and it can be plain text, HTML or XML, always using UTF-8 encoding. The terms relevant for the classification process found in the title will have more influence on the classification than if they were in the text.

txt (STRING)
Input text. It can be plain text, HTML or XML, always using UTF-8 encoding. (Required if 'doc' and 'url' are empty)

url (STRING)
URL with the content to classify. Currently only non-authenticated HTTP and FTP are supported. The content types supported for URL contents can be found at https://textalytics.com/core/supported-formats. (Required if 'txt' and 'doc' are empty)

verbose (STRING)
Verbose mode. Shows additional information about the classification.
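A call using these parameters might be assembled as in the sketch below. The endpoint URL and the 'key' auth parameter are assumptions for illustration; consult the provider's documentation for the real URL and credentials. Only 'txt', 'model', and 'of' come from the parameter list above:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; check the provider's docs for the real URL.
ENDPOINT = "https://textalytics.com/core/class-1.1"

def build_payload(text, api_key, model="IPTC_en", output_format="json"):
    """Assemble the documented parameters: 'txt' carries the input,
    'model' selects the category set, and 'of' picks xml or json output.
    The 'key' field is an assumed authentication parameter."""
    return {"key": api_key, "txt": text, "model": model, "of": output_format}

def classify_text(text, api_key, **kwargs):
    data = urlencode(build_payload(text, api_key, **kwargs)).encode("utf-8")
    with urlopen(Request(ENDPOINT, data=data)) as resp:  # POST request
        return resp.read().decode("utf-8")
```

Passing 'url' or 'title' instead of 'txt' would follow the same pattern, subject to the required-field rules described above.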
This is an implementation of a Naive Bayesian Classifier written in Python. The utility uses statistical methods to classify documents, based on the words that appear within them. A common application for this type of software is in email spam filters.
The utility must first be 'trained' using large numbers of pre-classified documents. During the training phase, a database is populated with information about how often certain words appear in each type of document. Once training is complete, unclassified documents can be submitted to the classifier, which will return a value between 0 and 1 indicating the probability that the document belongs to one class of document rather than another.
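The scheme just described can be sketched in a few lines of Python. This is a minimal in-memory illustration of the same idea, not the repository's bayes.py (which stores its word counts in a database); the class and method names here are made up:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal sketch of a word-count Naive Bayesian classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # doctype -> word counts
        self.doc_counts = Counter()              # doctype -> docs seen

    def learn(self, doctype, text):
        """Training phase: record how often words appear per doctype."""
        self.word_counts[doctype].update(text.lower().split())
        self.doc_counts[doctype] += 1

    def classify(self, text, a, b):
        """Probability (0..1) the text is doctype `a` rather than `b`."""
        vocab = set(self.word_counts[a]) | set(self.word_counts[b])
        scores = {}
        for c in (a, b):
            total = sum(self.word_counts[c].values())
            log_p = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
            for w in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                log_p += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(vocab) + 1)
                )
            scores[c] = log_p
        # Turn the two log scores into P(a) over the pair {a, b}.
        return 1 / (1 + math.exp(scores[b] - scores[a]))

nb = NaiveBayes()
nb.learn("spam", "win money now claim your prize")
nb.learn("ham", "meeting notes attached see you tomorrow")
print(nb.classify("claim your money", "spam", "ham"))
```

With this tiny training set the classifier leans heavily toward "spam" for spam-like words; the real utility gains its accuracy from the large pre-classified corpora described above.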
To train the utility, use the following command:
python bayes.py learn <doctype> <file> <count>
python bayes.py learn spam all_my_spam.txt 10000
python bayes.py learn ham inbox.txt 10000
Once training is complete, classification is performed using this command:
python bayes.py classify <file> <doctype> <doctype>
python bayes.py classify nigerian_finance_email.txt spam ham
> Probability that document is spam rather than ham is 0.98