Ling 8800 — Seminar in Computational Linguistics (Spring ’25)

Spring ’25, MW 11:10–12:30, Oxley Hall 122
Instructor: Michael White

Synthetic Data in Natural Language Processing

With real data for pre-training large language models (LLMs) evidently becoming scarce, there has been increasing interest in using synthetic data for post-training LLMs. For many years, crowdsourcing was viewed as the most practical way to obtain data to train NLP systems for specific tasks, but with the remarkable role-playing abilities of LLMs, researchers have moved towards generating synthetic data that can be equally or even more valuable. This is especially the case for so-called “reasoning models”, which aim to “think step-by-step” before responding.

Description

This course will dig into the technical details of these models, emphasizing useful tasks such as response generation in dialogue systems, question answering, summarization, simplification, data augmentation and explanation generation. Topics are expected to include bootstrapping conversational systems from synthetic dialogues, simulating different kinds of users in this setting, methods for enhancing factuality and safety in this context, methods for adapting to different kinds of users, and recognizing implicit feedback. Methods of interest include self-training, knowledge distillation, automated evaluation and data cleaning, and reinforcement learning.

Students are encouraged to pursue related research interests for their term project. Topics will be finalized based on the interests of the participants.

Expectations

Students will be expected to actively participate in the discussion and research carried out in the seminar. As detailed below, students will be required to facilitate discussions and post questions on the readings in advance, as well as locate relevant background/tutorial materials. Additionally, students taking the course for 3 credits will be required to carry out a class project on a topic related to the seminar; alternatively, for students already working on a related topic, integrating their focus into the seminar will be an option.

Prerequisites

Ling 5802 or equivalent, or permission of the instructor.

Carmen

We’ll use Carmen to schedule discussion facilitators and post advance questions and comments on the readings, as well as links to background/tutorial materials. We’ll also use it for submitting project documents.

Hypothesis

This semester we’ll be trying out the Hypothesis social annotation tool in Carmen for posting comments and questions on the readings. This tool allows you to annotate papers with remarks linked to highlighted text in the paper (much like Google Docs), with both public and private annotation options. Here is a brief introduction to using Hypothesis in a learning management system (LMS) like Carmen.

Requirements

Class participation (25%)

We are aiming for a dynamic discussion of papers, not death by PowerPoint. Thus, we plan on taking a page from Eric Fosler-Lussier’s playbook and requiring everyone (this includes you!) to post at least two questions, comments or replies using Hypothesis on Carmen by 8 p.m. the day before each reading will be discussed. Participants should also feel free to share their (initial) thoughts and views of the papers in their posts. In particular, questions of the type “What did they mean by X?” or “Why did they do X instead of Y?” are encouraged. Remember that most of the papers are targeted to people who are already experts in the area, so you shouldn’t expect to always understand everything. Airing such questions can help everyone gain a better understanding of the paper, even those who thought they understood it!

Facilitating discussions (25%)

Each meeting where we discuss a paper will have a discussion facilitator. For the main readings, the facilitator should look over the posted questions and choose a subset for discussion. In class, the facilitator should start the session with a brief, five-to-ten-minute summary of the paper, including the highlights and lowlights. Following the opening summary, the facilitator is responsible for managing the discussion and ensuring that as many viewpoints are heard as possible. Finally, the facilitator is also tasked with keeping track of the potentially most useful background papers for the reading.

Students will be required to facilitate at least one session during the course (ideally more). If the discussion does not take up the entire class period, the remaining time may be used to (informally) discuss class projects.

We will also dedicate various sessions to reading background papers that we expect to be useful to better understand the main papers of interest. The sessions and papers dedicated to background readings will be determined collaboratively.

Term project (50%)

As noted above, students taking the course for 3 credits will be required to carry out a term project, either alone or in a team. A project sketch must be presented informally in class for brainstorming by the fourth week, followed by a project proposal by the eighth week, a presentation during the last week of class, and a final report by the day the final exam would be held (if there were one).

For students taking the course for 1 credit, no project will be required, with the other requirements scaled accordingly.

Use of AI tools

AI tools such as Copilot, Gemini and so forth may be used in the course, but their use must be disclosed. When using them to inform posts on Hypothesis, a simple one-sentence (or even one-phrase) acknowledgment will suffice, while term project items should have a dedicated section on AI use (if applicable).

Topics

The topics and readings we expect to cover are listed below; these will be refined as the course progresses.

Training Reasoning Models

Learning from Implicit Conversational Feedback

Creating Useful Synthetic Data: Self-Training and Knowledge Distillation

AI Evaluation

Persona-Based Generation

Uncertainty Estimation

Background

Policy on Academic Misconduct

It is the responsibility of the Committee on Academic Misconduct to investigate or establish procedures for the investigation of all reported cases of student academic misconduct. The term “academic misconduct” includes all forms of student academic misconduct wherever committed; illustrated by, but not limited to, cases of plagiarism and dishonest practices in connection with examinations. Instructors shall report all instances of alleged academic misconduct to the committee (Faculty Rule 3335-5-48.7 (B)). For additional information, see the Code of Student Conduct.

Students with Disabilities

The university strives to maintain a healthy and accessible environment to support student learning in and out of the classroom. If you anticipate or experience academic barriers based on your disability (including mental health, chronic, or temporary medical conditions), please let me know immediately so that we can privately discuss options. To establish reasonable accommodations, I may request that you register with Student Life Disability Services. After registration, make arrangements with me as soon as possible to discuss your accommodations so that they may be implemented in a timely fashion.
If you are ill and need to miss class, including if you are staying home and away from others while experiencing symptoms of a viral infection or fever, please let me know immediately. In cases where illness interacts with an underlying medical condition, please consult with Student Life Disability Services to request reasonable accommodations. You can connect with them at slds@osu.edu; 614-292-3307; or slds.osu.edu.

This course requires the use of a digital social annotation tool called Hypothes.is. If you encounter an issue with access to this tool, please contact your instructor at their name.#@osu.edu and ascode@osu.edu. Accommodation and assistance will be arranged for you to complete any work required with this tool free of penalty.

Religious Accommodations

Ohio State has had a longstanding practice of making reasonable academic accommodations for students’ religious beliefs and practices in accordance with applicable law. In 2023, Ohio State updated its practice to align with new state legislation. Under this new provision, students must be in early communication with their instructors regarding any known accommodation requests for religious beliefs and practices, providing notice of specific dates for which they request alternative accommodations within 14 days after the first instructional day of the course. Instructors in turn shall not question the sincerity of a student’s religious or spiritual belief system in reviewing such requests and shall keep requests for accommodations confidential.

With sufficient notice, instructors will provide students with reasonable alternative accommodations with regard to examinations and other academic requirements with respect to students’ sincerely held religious beliefs and practices by allowing up to three absences each semester for the student to attend or participate in religious activities. Examples of religious accommodations can include, but are not limited to, rescheduling an exam, altering the time of a student’s presentation, allowing make-up assignments to substitute for missed class work, or flexibility in due dates or research responsibilities. If concerns arise about a requested accommodation, instructors are to consult their tenure initiating unit head for assistance.

A student’s request for time off shall be provided if the student’s sincerely held religious belief or practice severely affects the student’s ability to take an exam or meet an academic requirement and the student has notified their instructor, in writing during the first 14 days after the course begins, of the date of each absence. Although students are required to provide notice within the first 14 days after a course begins, instructors are strongly encouraged to work with the student to provide a reasonable accommodation if a request is made outside the notice period. A student may not be penalized for an absence approved under this policy.

If students have questions or disputes related to academic accommodations, they should contact their course instructor, and then their department or college office. For questions or to report discrimination or harassment based on religion, individuals should contact the Civil Rights Compliance Office (civilrights@osu.edu). (Policy: Religious Holidays, Holy Days and Observances)

Intellectual Diversity

Ohio State is committed to fostering a culture of open inquiry and intellectual diversity within the classroom. This course will cover a range of information and may include discussions or debates about controversial issues, beliefs, or policies. Any such discussions and debates are intended to support understanding of the approved curriculum and relevant course objectives rather than promote any specific point of view. Students will be assessed on principles applicable to the field of study and the content covered in the course. Preparing students for citizenship includes helping them develop critical thinking skills that will allow them to reach their own conclusions regarding complex or controversial matters.

Disclaimer

This syllabus is subject to change. All important changes will be made in writing (email), with ample time for adjustment.