Scaling Monolingual NLP with Trans-tokenization

Building and deploying monolingual NLP systems for low-resource languages presents unique challenges, particularly in handling diverse scripts and optimizing for production-scale environments.

This session explores trans-tokenization, an approach that maps a model's tokens from a source language onto a target language's vocabulary, as a way to adapt large language models for monolingual use.

Using parallel corpora such as English-Hindi, we’ll demonstrate how the Unsloth fine-tuning library and Mistral models can be used to fine-tune for non-Latin scripts effectively.
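The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common formulation of trans-tokenization: each target-language token's embedding is initialized as a weighted average of the source-language token embeddings it aligns with in a parallel corpus. The function name `trans_tokenize`, the toy embeddings, and the alignment counts are all hypothetical and chosen for illustration; a real pipeline would derive alignments from an English-Hindi corpus and write the result into the model's embedding matrix.

```python
from collections import defaultdict


def trans_tokenize(src_emb, alignments):
    """Initialize target-token embeddings from aligned source tokens.

    src_emb:    dict mapping source token -> list of floats (its embedding)
    alignments: iterable of (target_token, source_token, count) triples,
                assumed to come from word/token alignment of a parallel corpus
    Returns a dict mapping target token -> count-weighted average embedding.
    """
    dim = len(next(iter(src_emb.values())))
    sums = defaultdict(lambda: [0.0] * dim)
    totals = defaultdict(float)

    for tgt, src, count in alignments:
        if src not in src_emb:
            continue  # skip source tokens the model has no embedding for
        vec = src_emb[src]
        acc = sums[tgt]
        for i in range(dim):
            acc[i] += count * vec[i]
        totals[tgt] += count

    # Normalize each accumulated sum by the total alignment count.
    return {tgt: [v / totals[tgt] for v in sums[tgt]] for tgt in sums}


# Toy data (hypothetical): 2-dimensional English embeddings and
# alignment counts for two Hindi tokens.
src_emb = {"water": [1.0, 0.0], "drink": [0.0, 1.0]}
alignments = [
    ("पानी", "water", 3),
    ("पानी", "drink", 1),
    ("पीना", "drink", 2),
]
tgt_emb = trans_tokenize(src_emb, alignments)
```

In this toy case the Hindi token aligned mostly with "water" ends up close to the "water" embedding, which is the intuition behind reusing a source model's knowledge before any fine-tuning takes place.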

A key focus will be leveraging Kubernetes to scale these monolingual NLP systems. Attendees will learn how Kubernetes facilitates efficient resource allocation, supports distributed training, and simplifies model deployment in production environments. Topics include managing workloads for parallel corpora processing, optimizing GPU utilization, and ensuring high availability of NLP services at scale.

Suvrakamal Das

Software Engineer @Mattoboard

San Francisco, California, United States
