Persian Language Related Research | Amir Hossein Kargaran

The Persian language holds a special place in my heart; it’s a passion that I deeply cherish. Here, I’d like to outline my work on the Persian language and other Iranian languages:

A small project of mine that unexpectedly grew larger is Parstdex, a highly scalable library for extracting Persian time and date markers. An equivalent, and better, version of this work for English was done in 2012 (ten years ago). We subsequently expanded this effort by incorporating transformers, which led to its publication (the Hengam models) in the Asia chapter of the ACL. You can test Hengam live here.
Glot500 is another project, in which our objective is to develop a multilingual masked language model that supports many low-resource languages. The dataset for this model includes a variety of languages, such as Iranian Persian, South Azerbaijani, Kurdish, Tajik, and Dari.
GlotLID is another project in which we have scaled language identification to cover more than 1600 languages. The dataset for this model includes a variety of languages, such as Iranian Persian, South Azerbaijani, Balochi, Gilaki, Mazandarani, Kurdish, Tajik, Luri, and Dari. You can test GlotLID live here.
GlotSparse is a list of data sources, along with the data downloaded from them, for low-resource languages. The dataset includes South Azerbaijani, Balochi, Gilaki, Gurani, and Southern Kurdish.