PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but with a focus on the Thai language.
For details in Thai, see README_TH.MD.
Version | Description | Status |
---|---|---|
2.3.1 | Stable | Change Log |
dev | Release Candidate for 2.4 | Change Log |
PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.
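As a minimal sketch of the word segmentation mentioned above, the snippet below calls `word_tokenize` on a short Thai phrase. It assumes PyThaiNLP is installed; the helper name `tokenize_demo`, the sample text, and the fallback to `None` when the package is missing are illustrative choices, not part of the library.

```python
import importlib.util


def tokenize_demo(text):
    """Split `text` into words with PyThaiNLP's default tokenizer.

    Returns None when PyThaiNLP is not installed, so the sketch can be
    run safely in any environment. (Hypothetical helper, for illustration.)
    """
    if importlib.util.find_spec("pythainlp") is None:
        return None
    # Imported lazily so the module loads even without PyThaiNLP.
    from pythainlp.tokenize import word_tokenize

    return word_tokenize(text)


tokens = tokenize_demo("ภาษาไทยง่ายนิดเดียว")
print(tokens)  # a list of word strings, or None without PyThaiNLP
```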
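List of Features
- Convenient character and word classes, such as Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`), comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentation based on Thai Character Cluster (`subword_tokenize`)
- Part-of-speech tagging (`pos_tag`)
- Spelling suggestion and correction (`spell` and `correct`)
- Transliteration (`transliterate`)
- Soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
- Collation, i.e. sorting in dictionary order (`collate`)
- Reading numbers out as Thai words (`bahttext`, `num_to_thaiword`)
- Date and time formatting (`thai_strftime`)
- Correction of text typed with the wrong Thai/English keyboard layout (`eng_to_thai`, `thai_to_eng`)
- Command-line interface for basic functions (run `thainlp` in your shell)

pip install --upgrade pythainlp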
This will install the latest stable release of PyThaiNLP.
Install different releases:
- Stable release: `pip install --upgrade pythainlp`
- Pre-release: `pip install --upgrade --pre pythainlp`
- Development version: `pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip`
Some functionality, like Thai WordNet, may require extra packages. To install those requirements, specify a set of `[name]` immediately after `pythainlp`:
pip install pythainlp[extra1,extra2,...]
List of possible extras
- `full` (install everything)
- `attacut` (to support attacut, a fast and accurate tokenizer)
- `benchmarks` (for word tokenization benchmarking)
- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
- `ml` (to support ULMFiT models for classification)
- `thai2fit` (for Thai word vector)
- `thai2rom` (for machine-learnt romanization)
- `wordnet` (for Thai WordNet API)

For dependency details, look at the `extras` variable in `setup.py`.
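The extras syntax above can also be assembled programmatically, for example when generating install instructions. The sketch below is a hypothetical helper (`pip_requirement` is not part of PyThaiNLP); it checks the requested names against the extras listed above and builds the requirement string that would be passed to `pip install`.

```python
# Extras named in the list above; setup.py remains the authoritative source.
KNOWN_EXTRAS = {
    "full", "attacut", "benchmarks", "icu", "ipa",
    "ml", "thai2fit", "thai2rom", "wordnet",
}


def pip_requirement(extras=()):
    """Build a requirement string such as 'pythainlp[attacut,icu]'."""
    extras = list(extras)
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return "pythainlp"
    # Preserve the caller's order while dropping duplicates.
    unique = list(dict.fromkeys(extras))
    return f"pythainlp[{','.join(unique)}]"


print("pip install " + pip_requirement(["attacut", "icu"]))
# prints: pip install pythainlp[attacut,icu]
```

In shells that treat square brackets as glob patterns (zsh, for instance), quote the requirement: `pip install "pythainlp[attacut,icu]"`.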
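- PyThaiNLP keeps its data files under `~/pythainlp-data` by default.
- The data directory can be changed by setting the environment variable `PYTHAINLP_DATA_DIR`.
- See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus

Some PyThaiNLP functionality can be used from the command line with the `thainlp` command.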
For example, displaying a catalog of datasets:
thainlp data catalog
Showing usage help:
thainlp help
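The same commands can be driven from a script. The sketch below is an illustration under stated assumptions rather than PyThaiNLP's own code: `resolve_data_dir` mirrors the rule described earlier (default `~/pythainlp-data`, overridden by `PYTHAINLP_DATA_DIR`), and `run_thainlp` shells out to the `thainlp` command only if it is found on `PATH`. Both function names are hypothetical.

```python
import os
import shutil
import subprocess


def resolve_data_dir(environ=None):
    """Return PYTHAINLP_DATA_DIR if set, else ~/pythainlp-data."""
    environ = os.environ if environ is None else environ
    override = environ.get("PYTHAINLP_DATA_DIR")
    if override:
        return override
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")


def run_thainlp(*args, timeout=60):
    """Run `thainlp <args>`; return its stdout, or None if it is not installed."""
    exe = shutil.which("thainlp")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, *args], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout


print(resolve_data_dir())
```

For example, `run_thainlp("data", "catalog")` corresponds to the `thainlp data catalog` command shown above.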
Material | License |
---|---|
PyThaiNLP Source Code and Notebooks | Apache Software License 2.0 |
Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0) |
Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by) |
Other corpora and models that may be included with PyThaiNLP | See Corpus License |
If you use PyThaiNLP in your project or publication, please cite the library as follows:
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354
or use the BibTeX entry:
@misc{pythainlp,
author = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
title = {{PyThaiNLP: Thai Natural Language Processing in Python}},
month = Jun,
year = 2016,
doi = {10.5281/zenodo.3519354},
publisher = {Zenodo},
url = {http://doi.org/10.5281/zenodo.3519354}
}
Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.
Author: PyThaiNLP
Official Website: https://github.com/PyThaiNLP/pythainlp
License: Apache-2.0