Crystina Xinyu Zhang
 

Hi! I am Crystina, currently in the final year of my PhD at the University of Waterloo. I'm honored and fortunate to be advised by Professor Jimmy Lin. I received my Bachelor's degree in Computer Science from the Hong Kong University of Science and Technology (HKUST). I also had a wonderful time interning at Google DeepMind, Cohere, Max Planck Institut für Informatik, and NAVER CLOVA. During my undergraduate studies, I was an exchange student at the University of California, Los Angeles, and the University of Waterloo.

I'm on the faculty and postdoc job market, and I'm also considering research scientist positions. For any discussion, please reach out to me via this email!

News

Research Interests

My current research interests lie at the intersection of Information Retrieval and Natural Language Processing. I'm particularly interested in multilinguality-related topics. I built the first two large-scale multilingual datasets for monolingual retrieval, MIRACL and Mr. TyDi, and hosted the WSDM Cup 2023 competition on multilingual dense retrieval to foster research in this area. I also conducted a comprehensive investigation into best practices for training multilingual dense models. Beyond building high-quality datasets and optimizing training strategies, I'm also interested in understanding multilingual language models: my recent work, Tomato, Tomahto, Tomate, is the first to measure the role of shared semantics among subwords in multilingual language models.

Publications

Topics: Multilingual NLP / LLM Reranking / pretrained-LM Reranking / Others (*: Equal Contribution)

Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models
Xinyu Zhang, Jing Lu, Vinh Q. Tran, Tal Schuster, Donald Metzler, Jimmy Lin

Preprint. arXiv

Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation [Outstanding Paper Award]
Raphael Tang, Xinyu Zhang, Lixinyu Xu, Yao Lu, Wenyan Li, Pontus Stenetorp, Jimmy Lin, Ferhan Ture

EMNLP 2024. Paper / arXiv / Website

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich, Desmond Elliott

EMNLP 2024. Paper / arXiv

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation
Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

EMNLP 2024. Paper / arXiv / Dataset

CELI: Simple yet Effective Approach to Enhance Out-of-Domain Generalization of Cross-Encoders
Xinyu Zhang*, Minghan Li*, and Jimmy Lin

NAACL 2024 Paper

Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
Raphael Tang*, Xinyu Zhang*, Xueguang Ma, Jimmy Lin, Ferhan Ture

NAACL 2024 Paper / arXiv / Code

CIRAL: A Test Collection for CLIR Evaluations in African Languages [Best Paper Nomination]
Mofetoluwa Adeyemi, Akintunde Oladipo, Xinyu Zhang, Jimmy Lin, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Boxing Chen, ... (17 authors)

SIGIR 2024 Paper / Dataset

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
Xinyu Zhang*, Nandan Thakur*, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

TACL 2023 Paper / arXiv / Code / Website

Towards Best Practices for Training Multilingual Dense Retrieval Models
Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin

TOIS 2023 Paper / Poster

Evaluating Embedding APIs for Information Retrieval
Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, and Jimmy Lin

ACL 2023 Paper

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, and Jimmy Lin

ACL 2023 Paper

Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin

Preprint. arXiv

What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations
Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture

Preprint. arXiv

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution
Ehsan Kamalloo*, Aref Jafari*, Xinyu Zhang, Nandan Thakur, Jimmy Lin

Preprint. arXiv / Website / Dataset

Zero-Shot Listwise Document Reranking with a Large Language Model
Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin

Preprint. arXiv

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
Xinyu Zhang*, Nandan Thakur*, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

Preprint. arXiv

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin

Preprint. arXiv

Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking
Minghan Li*, Xinyu Zhang*, Ji Xin, Hongyang Zhang, Jimmy Lin

EMNLP 2022 Paper / Code

AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages
Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, and Jimmy Lin

EMNLP 2022 Paper / Dataset

Squeezing water from a stone: A bag of tricks for further improving cross-encoder effectiveness for reranking
Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin

ECIR 2022 Paper / Code

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin

EMNLP 2021 Workshop MRL Paper / Code / Dataset

Approach Zero and Anserini at the CLEF-2021 ARQMath Track: Applying Substructure Search and BM25 on Operator Tree Path Tokens
Wei Zhong, Xinyu Zhang, Ji Xin, Richard Zanibbi, Jimmy Lin

CLEF 2021 Paper

Bag-of-Words Baselines for Semantic Code Search
Xinyu Zhang, Ji Xin, Andrew Yates, and Jimmy Lin

ACL-IJCNLP 2021 Workshop NLP4Prog Paper

Comparing Score Aggregation Approaches for Pretrained Neural Language Models
Xinyu Zhang, Andrew Yates, and Jimmy Lin

ECIR 2021 Paper

A Little Bit Is Worse Than None: Ranking with Limited Training Data
Xinyu Zhang, Andrew Yates, and Jimmy Lin

EMNLP 2020 Workshop SustainNLP Paper

Flexible IR Pipelines with Capreolus
Andrew Yates, Kevin Martin Jose, Xinyu Zhang, Jimmy Lin

CIKM 2020 Paper / Code

Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval
Andrew Yates, Siddhant Arora, Xinyu Zhang, Wei Yang, Kevin Martin Jose, Jimmy Lin

WSDM 2020 Paper / Code