Research Interests
My research focuses on information retrieval and natural language processing, with a particular emphasis on multilingual and multicultural scenarios,
aiming to advance techniques that serve people of different languages and cultures on an equal footing.
- Multilingual Data:
High-quality training data are fundamental for building multilingual models, and evaluation data are essential for understanding their capabilities.
We constructed large-scale training and evaluation datasets for neural retrieval models that cover a wide range of languages
(MIRACL and Mr. TyDi),
as well as datasets built specifically for low-resource languages (AfriCLIRMatrix, CIRAL).
- Training Strategies for Multilingual Retrieval:
We conduct systematic studies on best practices for training multilingual dense retrievers, covering a broad spectrum of scenarios with varying levels of training data and language model support. More recently, we have developed multimodal retrieval models that operate over text, audio, images, and video, such as OmniEmbed, which achieved first place at the MAGMaR workshop.
- Understanding Multilingual Mechanisms in LMs:
We study how multilingual models internally represent meaning across languages. This includes analyzing shared semantic structures in multilingual language models (Tomato) and examining the sources of cross-lingual transfer, such as the impact of incidental multilingual text in training (Impact of Incidental Multilingual Text).
I'm happy to connect with people who share similar research interests; feel free to reach out!
Also, if you are an undergraduate student at the University of Waterloo looking for research opportunities or advice, I'm happy to chat.
News