Spring ’20, TTh 9:35–10:55, Enarson Hall 206
Instructor: Michael White
Though essential to knowledge transmission and social interaction, human language poses unique challenges for computational processing. In this course, students will learn the basics of probabilistic modeling and machine learning for natural language processing (NLP) and computational linguistics (CL). Along the way, students will gain experience with using the Python programming language to analyze corpus data and the PyTorch deep learning toolkit to develop and train NLP programs.
The course will make use of the nascent third edition of Jurafsky and Martin’s textbook, Speech and Language Processing, as well as slides from various ACL tutorials and other sources.
Students in the course will have the opportunity to:
- become familiar with the principal terminology, concepts and techniques of probabilistic modeling and machine learning for natural language processing;
- develop an understanding of the interplay between algorithms and models in computational linguistics;
- gain exposure to current ethical issues in deploying NLP;
- learn to design, build and evaluate data-driven NLP programs;
- see the messy side of a real research project; and
- develop an appreciation of the awesome complexity and richness of
Ling 5801 or equivalent, or permission of the instructor. The course is open to advanced undergraduate and graduate students.
Topics will include:
- probability and language modeling
- classification with linear models
- word embeddings
- classification with feedforward neural models
- sequence models: hidden Markov models and recurrent neural nets
- sequence-to-sequence models, attention and contextual embeddings
- topics TBA, depending on your project proposals
We’ll use Carmen for the detailed schedule, for homework and project assignments, and for distributing slides and links to background/tutorial materials. There will also be discussion forums for posting advance questions on the readings.
It is up to you to make sure you have access to the appropriate hardware and software for doing the assignments. As some assignments may have runtimes on the order of hours, you will have access to an Ohio Supercomputer Center (OSC) classroom account. We will go over how to use OSC resources in class.
Letter grades will be assigned using the standard OSU scale based on class participation, homework assignments and the group project.
Class participation (10%)
You will be expected to keep up with the readings and actively participate in class discussions and activities, which naturally requires regular attendance.
In addition, as we will be reading some papers from the primary literature, we will tackle those papers as in advanced seminars. Accordingly, we will take a page from Eric Fosler-Lussier’s playbook and require everyone (this includes you!) to post at least one question to the discussion forum on Carmen by 8 p.m. the evening before the reading will be discussed, so that the discussion facilitator can organize the class session around key questions. Participants should also feel free to share their (initial) thoughts and views of the papers in their posts. In particular, questions of the type “What did they mean by X?” or “Why did they do X instead of Y?” are encouraged. Remember that most of the papers are targeted to people who are already expert in the area, so you shouldn’t expect to always understand everything. Airing such questions can help everyone gain a better understanding of the paper, even those who thought they understood it!
Homework assignments (60%)
Homework assignments play a central role in achieving the course’s learning objectives. There will be five regular homework assignments, with the lowest score dropped in calculating the grade. Homework assignments are due by the beginning of class, in the Carmen dropbox. No late homeworks will be accepted.
Each assignment has a programming component and an analytical writeup. In general, I will not grade your code directly; you are expected to provide evidence in your writeup that your code works (or explain how and why it does not work). If your code does not work, I will give feedback if I can.
To encourage learning, you will be allowed to resubmit up to two homework assignments within two weeks of the original due date, which will be regraded for up to full credit. However, resubmission will only be allowed in cases where you submitted a credible attempt by the original due date.
Collaborative discussion of the homeworks is encouraged, but each student should turn in their own programs and write-ups. Copying of notes or code snippets is strictly disallowed, and you must fully understand what you turn in. You may use all standard Python and PyTorch libraries, but you are not allowed to use arbitrary code from GitHub, library example code or other code which you did not write yourself to complete the homework assignments.
Group project (30%)
You will be expected to apply machine learning techniques to an interesting NLP problem as part of a group of 2–4 students. Where feasible, you are encouraged to pursue a project related to your own research interests.
Note that the group project is not required to be novel research. Instead, it is expected to require roughly the same level of effort as one of the homework assignments. However, as the project requires both design and presentation activities, it carries double the weight of one of the homework assignments. A typical project might involve attempting to replicate the results reported in the textbook or the research literature on an NLP task using your own implementation.
You should declare your group on Carmen by the end of week 4 of the semester. Preliminary proposals for projects are due in the 8th week of the semester, and should include the choice of a related research paper that the class will read. (Revisions to the project proposal may be made based on feedback.) Groups will lead the discussion of the selected research paper towards the end of the course, and present the project results during the last week of the course. Taking into account feedback from the presentations, project write-ups will be due by the day the final exam would be held (if there were one).
When facilitating the discussion of the selected reading, groups should look over the questions posted on Carmen and choose a subset for discussion. In class, facilitators should start the session with a brief, five- to ten-minute summary of the paper, including the highlights and lowlights. Following the opening summary, facilitators are responsible for managing the discussion and ensuring that as many viewpoints are heard as possible.
Points for the group project will be distributed as follows (when submitted on time):
- group declaration (2%)
- project proposal (3%)
- selected paper discussion (5%)
- project presentation (10%)
- project write-up (10%)
Additionally, a best project bonus (3%) will be given to the project voted to be the best following the project presentations.
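To see how these weights fit together, here is a minimal Python sketch of the overall grade arithmetic. It is only an illustration, not the official grading procedure: the function name `course_percentage`, the convention that every score is a fraction between 0 and 1, and the equal weighting of the four homework scores that remain after the lowest is dropped are all assumptions made for this example.

```python
# Weights are taken from the syllabus. How partial credit is combined
# (fractions between 0 and 1, equal weight per counted homework) is an assumption.
PARTICIPATION_WEIGHT = 10.0
HOMEWORK_WEIGHT = 60.0
PROJECT_WEIGHTS = {
    "group declaration": 2.0,
    "project proposal": 3.0,
    "selected paper discussion": 5.0,
    "project presentation": 10.0,
    "project write-up": 10.0,
}
BEST_PROJECT_BONUS = 3.0


def course_percentage(participation, homeworks, project, best_project=False):
    """Return an overall percentage (hypothetical helper, not an official tool).

    participation: fraction between 0 and 1
    homeworks: list of five fractions between 0 and 1
    project: dict mapping each key of PROJECT_WEIGHTS to a fraction between 0 and 1
    best_project: True if the group's project was voted best
    """
    if len(homeworks) != 5:
        raise ValueError("expected five homework scores")
    # Drop the lowest of the five homework scores and average the remaining four.
    kept = sorted(homeworks)[1:]
    homework_part = HOMEWORK_WEIGHT * sum(kept) / len(kept)
    # Each project component contributes its own share of the 30% project total.
    project_part = sum(weight * project[name] for name, weight in PROJECT_WEIGHTS.items())
    bonus = BEST_PROJECT_BONUS if best_project else 0.0
    return PARTICIPATION_WEIGHT * participation + homework_part + project_part + bonus
```

Under these assumptions, each counted homework is worth 15% and the project components sum to 30%, which matches the statement that the project carries double the weight of one homework assignment.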
Policy on Academic Misconduct
As with any class at this university, students are required to follow the Ohio State Code of Student Conduct. In particular, note that students are not allowed to, among other things, submit plagiarized (copied but unacknowledged) work for credit. If any violation occurs, the instructor is required to report the violation to the Council on Academic Misconduct.
Students with Disabilities
Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations. I rely on the Office for Disability Services for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted the Office for Disability Services are encouraged to do so (292-3307; http://www.ods.ohio-state.edu).
This syllabus is subject to change. All important changes will be made in writing (email), with ample time for adjustment.