Crystina Xinyu Zhang
 

Hi, I am Crystina, a fourth-year PhD student at the University of Waterloo, where I'm honored to be advised by Professor Jimmy Lin. I received my Bachelor's degree in Computer Science from the Hong Kong University of Science and Technology (HKUST). I'm currently a part-time Student Researcher at Google Research. Previously, I interned at Cohere, Max Planck Institut für Informatik, and NAVER (네이버). During my undergraduate studies, I was an exchange student at the University of California, Los Angeles and the University of Waterloo.

Research Interests

My current research interests lie at the intersection of Information Retrieval and Natural Language Processing. I'm particularly interested in multilinguality-related topics. Representative work includes two large-scale multilingual datasets, MIRACL and Mr. TyDi, and a comprehensive investigation of best practices for training multilingual dense retrieval models.

Publications

Topics: Multilingual NLP / LLM Reranking / pretrained-LM Reranking / Others (*: Equal Contribution)

CELI: Simple yet Effective Approach to Enhance Out-of-Domain Generalization of Cross-Encoders
Xinyu Zhang*, Minghan Li*, and Jimmy Lin

NAACL 2024 Paper (to be published soon)

Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
Raphael Tang*, Xinyu Zhang*, Xueguang Ma, Jimmy Lin, Ferhan Ture

NAACL 2024 Paper (to be published soon) / arXiv / Code

CIRAL: A Test Collection for CLIR Evaluations in African Languages
Mofetoluwa Adeyemi, Akintunde Oladipo, Xinyu Zhang, Jimmy Lin, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Boxing Chen, ... (17 authors)

SIGIR 2024 Paper (to be published soon) / Dataset

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation.
Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

Preprint. arXiv / Dataset

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages.
Xinyu Zhang*, Nandan Thakur*, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

TACL 2023 Paper / arXiv / Code / Website

Towards Best Practices for Training Multilingual Dense Retrieval Models.
Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin

TOIS 2023 Paper

Evaluating Embedding APIs for Information Retrieval
Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, and Jimmy Lin

ACL 2023 Paper

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration.
Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, and Jimmy Lin

ACL 2023 Paper

Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin

Preprint. arXiv

What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations
Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture

Preprint. arXiv

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution
Ehsan Kamalloo*, Aref Jafari*, Xinyu Zhang, Nandan Thakur, Jimmy Lin

Preprint. arXiv / Website / Dataset

Zero-Shot Listwise Document Reranking with a Large Language Model
Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin

Preprint. arXiv

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages.
Xinyu Zhang*, Nandan Thakur*, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

Preprint. arXiv

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers.
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin

Preprint. arXiv

Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking
Minghan Li*, Xinyu Zhang*, Ji Xin, Hongyang Zhang, Jimmy Lin

EMNLP 2022 Paper / Code

AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages.
Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, and Jimmy Lin

EMNLP 2022 Paper / Dataset

Squeezing water from a stone: A bag of tricks for further improving cross-encoder effectiveness for reranking
Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin

ECIR 2022 Paper / Code

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin

EMNLP 2021 Workshop MRL Paper / Code / Dataset

Approach Zero and Anserini at the CLEF-2021 ARQMath Track: Applying Substructure Search and BM25 on Operator Tree Path Tokens.
Wei Zhong, Xinyu Zhang, Ji Xin, Richard Zanibbi, Jimmy Lin

CLEF 2021 Paper

Bag-of-Words Baselines for Semantic Code Search.
Xinyu Zhang, Ji Xin, Andrew Yates, and Jimmy Lin

ACL-IJCNLP 2021 Workshop NLP4Prog Paper

Comparing Score Aggregation Approaches for Pretrained Neural Language Models
Xinyu Zhang, Andrew Yates, and Jimmy Lin

ECIR 2021 Paper

A Little Bit Is Worse Than None: Ranking with Limited Training Data.
Xinyu Zhang, Andrew Yates, and Jimmy Lin

EMNLP 2020 Workshop SustainNLP Paper

Flexible IR Pipelines with Capreolus.
Andrew Yates, Kevin Martin Jose, Xinyu Zhang, Jimmy Lin

CIKM 2020 Paper / Code

Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval
Andrew Yates, Siddhant Arora, Xinyu Zhang, Wei Yang, Kevin Martin Jose, Jimmy Lin

WSDM 2020 Paper / Code