Amir Hossein Kargaran


I’m a Computer Science Ph.D. student at Munich University, advised by Prof. Hinrich Schütze. I’m also affiliated as a junior member with the Munich Center for Machine Learning.

My current research focuses on Multilingual NLP, specifically on scaling NLP technologies to include more languages:

  • GlotCC: CommonCrawl corpus for more than 1,000 languages.
  • MaskLID: Code-switching language identification for any setup.
  • GlotWeb: Web-indexing service for very low-resource languages.
  • GlotLID: Language identification for more than 2,000 languages.
  • GlotScript: Tool and resource for writing system identification.
  • Glot500: Extended version of XLM-R, including a corpus of 500 languages.

Before starting my Ph.D., I completed my M.Sc. degree in Computer Engineering at Sharif University of Technology in 2022. I earned two B.Sc. degrees in Electrical Engineering and Computer Engineering from Isfahan University of Technology in 2020. Additionally, I was a research intern with the User Interfaces group at Aalto University during the summers of 2021 and 2022. I contributed to the development of the Aalto Interface Metrics (AIM) project, a service and codebase for computational GUI evaluation.


Jun 12, 2024 📦 GlotCC v1.0, covering more than 1,000 languages, is released!
May 20, 2024 🧳 I will be attending LREC-COLING from 20 to 25 May 2024.
May 16, 2024 📌 MaskLID: Code-Switching Language Identification through Iterative Masking is accepted at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL’24). Check out our repo here.

See all the news here.