
OpenNMT Tokenizer as TensorFlow Operations

Project description


OpenNMT Tokenizer TensorFlow Ops

DISCLAIMER: This package is not published by the OpenNMT authors.
Full credit for the OpenNMT Tokenizer and OpenNMT-tf goes to their respective authors.

This project aims to wrap OpenNMT Tokenizer into TensorFlow Ops.

It is primarily intended to be used as an addition to the OpenNMT-tf framework, in order to remove the need to apply tokenization and/or detokenization outside of a serving environment (e.g. TensorFlow Serving).

Compatibility

  • TensorFlow 2.1, 2.2
  • OpenNMT-tf >= 2.6.0 for usage in conjunction with OpenNMT-tf

Installation

Prerequisites:

  • A Linux environment (manylinux2014 eligible)
  • Python 3.5, 3.6, 3.7 or 3.8

Install the package with pip:

pip install tensorflow-onmttok-ops
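
To verify that the package and its native ops load correctly, a quick sanity check such as the following can be used (a minimal sketch; the sample sentence is arbitrary, and tokenize is described in the Usage section below):

import tensorflow_onmttok as tf_onmttok

# Importing the package loads the native TensorFlow ops.
# A small call confirms that they are registered and usable.
print(tf_onmttok.tokenize(["Hello world!"], mode="conservative"))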

Usage

Available Tokenizer options

The majority of the OpenNMT Tokenizer options are available.
However, providing BPE or SentencePiece models is not supported, and by extension, setting the tokenizer mode to none is not supported.

You therefore cannot use the following options:

  • bpe_model_path
  • sp_model_path
  • sp_nbest_size
  • sp_alpha
  • vocabulary_path
  • vocabulary_threshold

Note: Tokenizer options are defined at graph construction time and are constants.

Tokenization

import tensorflow_onmttok as tf_onmttok

# Tokenize a batch of strings using the "conservative" mode.
tokens = tf_onmttok.tokenize(["Hello, how are you?"], mode="conservative")
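
Other supported options can be passed as keyword arguments and, as noted above, become constants of the graph once it is traced. A minimal sketch (the joiner_annotate name mirrors the OpenNMT Tokenizer option; that it is accepted as a keyword argument here is an assumption):

import tensorflow as tf
import tensorflow_onmttok as tf_onmttok

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def tokenize_fn(text):
    # The tokenization options are captured at tracing time and stored as
    # constant attributes of the op; they cannot be fed at runtime.
    # joiner_annotate mirrors the OpenNMT Tokenizer option of the same name
    # (assumed to be accepted as a keyword argument by this wrapper).
    return tf_onmttok.tokenize(text, mode="aggressive", joiner_annotate=True)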

Detokenization

import tensorflow_onmttok as tf_onmttok

# Merge a sequence of tokens back into a single string using the "space" mode.
text = tf_onmttok.detokenize(["How", "are", "you", "?"], mode="space")

With OpenNMT-tf

Usage with OpenNMT-tf is straightforward.
This package comes with a built-in tokenizer that makes use of these ops.

  1. Before training your model, register the tokenizer as follows:

    from tensorflow_onmttok import register_opennmt_in_graph_tokenizer
    
    register_opennmt_in_graph_tokenizer()
    

    See the complete example

  2. Now that the tokenizer is registered, you can use the OpenNMTInGraphTokenizer class instead of OpenNMTTokenizer in your tokenization configuration files (see the data configuration sketch after this list), e.g.:

    type: OpenNMTInGraphTokenizer
    params:
      mode: conservative
      case_feature: true
    
  3. That's it! You can now train your model as usual. Your exported model will now expect a text input instead of tokens and length.

    Note: Tokenization resources will not be exported to the assets.extra directory.
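
In practice, the tokenization configuration file from step 2 is referenced from the data section of your OpenNMT-tf configuration, assuming the standard source_tokenization and target_tokenization data keys. A minimal sketch, with the step 2 snippet saved under a hypothetical file name:

data:
  # onmttok.yml is a hypothetical name for the configuration file from step 2
  source_tokenization: onmttok.yml
  target_tokenization: onmttok.yml

The rest of the training and export workflow is unchanged.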

Build TF Serving with these Ops

This guide shows how to build TensorFlow Serving with these ops.

Prerequisites

  • You have already cloned the TF Serving >= 2.1.0 repository and installed all the tools required to build it
  • You have installed CMake 3.1.0 or newer

Building

Add the Ops sources

First, download the release of your choice.

Inside the TF Serving sources folder, create a directory named custom_ops and copy the content of the tensorflow_onmttok directory into it.

$ cd <tf_serving_sources>
$ mkdir tensorflow_serving/custom_ops
$ cp -r <op_sources>/tensorflow_onmttok tensorflow_serving/custom_ops

Reference the Ops

Edit tensorflow_serving/model_servers/BUILD to reference the Ops build target:

SUPPORTED_TENSORFLOW_OPS = [
    ...
    "//tensorflow_serving/custom_ops/tensorflow_onmttok:onmttok_ops"
]

Build OpenNMT Tokenizer from sources

The last step is to build a static version of the OpenNMT Tokenizer library.
This repository provides a shell script that will build it with CMake.

$ cd <op_sources>
$ chmod +x build_tokenizer.sh && ./build_tokenizer.sh

Note: Pass the sudo argument to the build_tokenizer.sh script to run the make install step with sudo.

Build TensorFlow Serving

You can now build TensorFlow Serving as usual.
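
For reference, the standard Bazel invocation from the TensorFlow Serving documentation (nothing specific to this package) is:

$ cd <tf_serving_sources>
$ bazel build -c opt tensorflow_serving/model_servers:tensorflow_model_server

The resulting tensorflow_model_server binary will then include the OpenNMT Tokenizer ops.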

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions


File hashes

Hashes for tensorflow_onmttok_ops-0.4.0-cp38-cp38-manylinux2014_x86_64.whl

Algorithm   Hash digest
SHA256      75eb8962f0af155244724c64e1dd48e985abe27dd773fd2762d782bcccdfdde8
MD5         ecedc11d1438f9799a6205b76fe4c427
BLAKE2b-256 19685818031172da3dce2558be7dbca0863b92de4c3151edf8a8f3dc81df4836

Hashes for tensorflow_onmttok_ops-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl

Algorithm   Hash digest
SHA256      fc9dc0a31d9a9786bd869246c87d5efdfe0abd84ef5afacd395d97a38d566fd8
MD5         93db0ecdc824b2b9cae71f2a4651661b
BLAKE2b-256 708608c8768f449aed80983641d30e3dd93d7e220dab315b1a2b6ce17a870bbf

Hashes for tensorflow_onmttok_ops-0.4.0-cp36-cp36m-manylinux2014_x86_64.whl

Algorithm   Hash digest
SHA256      1c32c2d23e17a48fb7338359a42919da80f201c4dd305a6e804a4563d0457012
MD5         8bca973f4bc33264c15aa81b7d945000
BLAKE2b-256 ee263c07030c7adb4cd33d1162c5f7a18b0ff431f409cbb81c6c218d29d3f1a8

Hashes for tensorflow_onmttok_ops-0.4.0-cp35-cp35m-manylinux2014_x86_64.whl

Algorithm   Hash digest
SHA256      3aa028aab720c7dde021e89394a7c15be9655e7992ce38122a0eb3a750aeea37
MD5         5ebdb34ad6469b967b0b168098305d9f
BLAKE2b-256 eaed8fed6a5c4ed31c1dd32fe8a70b5814a7ad70ff6eeb078c3e633f27a9bbfd
