Clippers 9/24: Amy Chun on Linguistic Age Prediction

Children’s language development is a critical factor in creating engaging and age-appropriate interactions in conversational AI systems. As children grow, their communication evolves in sentence complexity, vocabulary use, and conversational style. However, many current AI-driven systems struggle to adjust dynamically to these developmental changes, especially in interactive environments like the COSI Museum, where engaging, personalized conversations can foster learning and curiosity. In this talk, I will discuss how our research aims to bridge this gap by predicting a child’s age from linguistic features, so that the system can tailor its interactions accordingly.
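To make the idea concrete, here is a minimal sketch of feature-based age prediction. The two features (mean utterance length and type-token ratio), the ridge regressor, and the toy data are illustrative assumptions for exposition, not the features, model, or data from the talk.

```python
# Illustrative sketch of predicting age from simple linguistic features.
# Features, model, and data below are assumptions, not the talk's system.
from sklearn.linear_model import Ridge

def extract_features(utterances):
    """Map a child's utterances to simple developmental proxies."""
    tokens = [tok for utt in utterances for tok in utt.lower().split()]
    mean_utt_len = len(tokens) / max(len(utterances), 1)       # sentence-complexity proxy
    type_token_ratio = len(set(tokens)) / max(len(tokens), 1)  # vocabulary-diversity proxy
    return [mean_utt_len, type_token_ratio]

# Toy training data (hypothetical): utterance samples paired with ages in years.
children = [
    (["want juice", "doggie go"], 2.5),
    (["I want some juice please", "the dog ran outside"], 5.0),
    (["Could I have some juice?", "our dog sprinted across the yard today"], 8.0),
]
X = [extract_features(utts) for utts, _ in children]
y = [age for _, age in children]

model = Ridge(alpha=1.0).fit(X, y)
print(model.predict([extract_features(["me go park", "big truck"])]))
```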

Clippers 9/17: Discussion of Self-Taught Evaluators paper

Michael White will lead a discussion of Meta’s Self-Taught Evaluators paper.


Self-Taught Evaluators

Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

https://arxiv.org/abs/2408.02666

Model-based evaluation is at the heart of successful model development — as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
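As a reading aid, here is a schematic sketch of the iterative scheme the abstract describes. Every function name below (generate, corrupt_instruction, judge_pair, finetune) is a placeholder standing in for a component of the paper's pipeline, not its actual code.

```python
# Schematic sketch of the self-taught evaluator loop described above.
# All helper functions are placeholders, not real APIs.

def self_taught_evaluator(base_model, unlabeled_instructions, num_iterations):
    judge = base_model  # e.g., a strong instruct model such as Llama3-70B-Instruct
    for _ in range(num_iterations):
        train_examples = []
        for instruction in unlabeled_instructions:
            # Build a contrasting pair: a response to the instruction, and a
            # response to a perturbed instruction, which should be worse.
            good = generate(base_model, instruction)
            bad = generate(base_model, corrupt_instruction(instruction))
            # The current judge emits a reasoning trace and a final verdict.
            trace, verdict = judge_pair(judge, instruction, good, bad)
            # Keep only judgments that agree with the known synthetic
            # preference, so the next iteration trains on improved predictions.
            if verdict == "first_is_better":
                train_examples.append((instruction, good, bad, trace))
        judge = finetune(judge, train_examples)
    return judge
```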

Clippers 9/10: Alyssa Allen on GEM Data-to-Text Shared Task

This week in Clippers, I’ll be workshopping a poster that I’m bringing to INLG later this month. Ash, Yi-Chien, Tomiris, Mike, and I participated in the GEM Data-to-Text shared task, which involved generating text for triple sets in which each triple has the form Subject | Property | Object. This was done for factual, counterfactual, and fictional triple sets. We experimented with English, Spanish, Chinese, and Russian, and ultimately submitted outputs for English and Spanish. I would appreciate any feedback on the content and layout of the poster, but (perhaps more importantly) I’d like to know what questions I’m likely to be asked at the conference based on our work.
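For readers unfamiliar with the task format, here is a small illustration of how a triple set might be rendered into a generation prompt. The example triples and prompt wording are hypothetical, not drawn from the shared task data or our submitted system.

```python
# Illustrative only: toy triples in the task's Subject | Property | Object
# format, and one hypothetical way to prompt a model to verbalize them.
toy_triples = [
    ("Ohio_State_University", "location", "Columbus,_Ohio"),
    ("Ohio_State_University", "foundingYear", "1870"),
]

def build_prompt(triples, language="English"):
    facts = "\n".join(f"{s} | {p} | {o}" for s, p, o in triples)
    return (f"Write one fluent {language} paragraph expressing all of the "
            f"following facts, and no others:\n{facts}")

print(build_prompt(toy_triples, language="Spanish"))
```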

Clippers 9/3: Christian Clark on predicting human reading times using a Transformer model with a recency bias

Abstract:

Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown that surprisal values from Transformer models align with reading times better than those from alternative models such as RNNs. However, standard Transformers retain a lossless representation of the entire preceding linguistic context, a feature that makes them somewhat implausible as models of human cognition. To address this limitation, I test a Transformer variant that includes ALiBi (Attention with Linear Biases), a recency bias added to attention scores. Surprisal estimates from the ALiBi model show an improved fit to human reading times compared to a standard Transformer baseline.
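For concreteness, here is a minimal sketch of the ALiBi bias computation, following the published formulation (Press et al., 2022). This is an illustrative re-implementation, not the code used in the talk.

```python
# Minimal sketch of ALiBi (Attention with Linear Biases): a head-specific
# linear penalty on attention to distant tokens, added to attention scores.
# Illustrative re-implementation, not the talk's actual code.
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Biases of shape (num_heads, seq_len, seq_len) to add to attention scores."""
    # Head-specific slopes form a geometric sequence (Press et al., 2022);
    # for 8 heads these are 1/2, 1/4, ..., 1/256.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # Distance from query position i back to key position j (zero for j >= i;
    # attention to future positions is handled by the usual causal mask).
    distance = (pos.view(-1, 1) - pos.view(1, -1)).clamp(min=0)
    return -slopes.view(-1, 1, 1) * distance

# Usage inside attention, with scores of shape (num_heads, seq_len, seq_len):
#   scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(seq_len, num_heads)
```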