

Part-of-speech tagging, or POS tagging, is a technique that is often performed in Natural Language Processing. It allows you to disambiguate words by lexical category, such as nouns, verbs, adjectives, and so on. This is useful in many cases, for example to filter large text corpora for certain word categories only. It is also often a prerequisite for lemmatization.

For English texts, POS tagging is implemented in the pos_tag() function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from the CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library only being available for Python 2.x at the time of writing, its accuracy is suboptimal: only 84% for German-language texts.

Another approach is to use supervised classification for POS tagging, which means that a tagger is trained with a large text corpus as training data, such as the TIGER corpus from the Institute for Natural Language Processing / University of Stuttgart. It contains a large set of annotated and POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% with the mentioned corpora.

In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger with your texts. Furthermore, I’ll show how to save the trained tagger and load it from disk so that you don’t have to re-train it every time you need it.

Obtaining and loading the training corpus

In order to load a training corpus into NLTK, we need to obtain it in a format that NLTK understands. Fortunately, NLTK can read corpora in a wide variety of formats, as the list of corpus submodules shows.
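To make "a format that NLTK understands" a bit more concrete: NLTK's taggers are trained on a list of sentences, where each sentence is a list of (word, tag) tuples. The following is only a minimal, pure-Python sketch of how a column-based (CoNLL-style) corpus file maps onto that shape. The sample text, the function name `read_tagged_sents` and the column positions are assumptions made for illustration and are not taken from the TIGER corpus release; within NLTK itself, a corpus reader class such as `ConllCorpusReader` does this job.

```python
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]

# Hypothetical sample in a CoNLL-style layout: one token per line,
# whitespace-separated columns, and a blank line between sentences.
# The column layout (index 1 = word, index 4 = POS tag) is an assumption.
SAMPLE = """\
1 Das das _ ART
2 Haus Haus _ NN
3 ist sein _ VAFIN
4 alt alt _ ADJD

1 Sie sie _ PPER
2 liest lesen _ VVFIN
"""


def read_tagged_sents(lines: Iterable[str],
                      word_col: int = 1,
                      pos_col: int = 4) -> List[TaggedSentence]:
    """Turn column-based lines into a list of sentences of (word, tag) tuples."""
    sents: List[TaggedSentence] = []
    current: TaggedSentence = []
    for line in lines:
        cols = line.split()
        if not cols:
            # A blank line closes the current sentence.
            if current:
                sents.append(current)
                current = []
            continue
        current.append((cols[word_col], cols[pos_col]))
    # Keep the last sentence if the input does not end with a blank line.
    if current:
        sents.append(current)
    return sents


tagged_sents = read_tagged_sents(SAMPLE.splitlines())
print(tagged_sents[0])
```

The resulting `tagged_sents` list has the same structure that an NLTK corpus reader returns from its `tagged_sents()` method, so it is the kind of data a tagger can be trained on.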
