Resources

Some of the software and datasets listed here are still very much useful; some of the older systems (especially the coherence tools) are provided mostly for reference, since they have been superseded by newer methods.

  • ‘Intlis grammar: A descriptive grammar of a Klingon-English creole, created jointly with Julia Papke for Ling 3502, “The Linguistics of Constructed Languages”. [PDF]
  • Neural segmentation: Speech and text segmentation with autoencoders; Python code built on the Keras neural network package. Joint work with Cory Shain, described in EMNLP-17. (A generic autoencoder sketch appears after this list.)
    [Github]
  • Aligned child-directed speech: Forced alignments for part of Demuth’s Providence Corpus, plus some hand-corrected alignments, as described in Interspeech-17 with Kiwako Ito.
    [Open Science Foundation]
  • Object array data: Transcripts, audio and eye-tracking data for the object array referring expression experiment from CogSci-17, plus R and Python analysis code.
    [Open Science Foundation]
  • Beamseg: Joint model of word segmentation and phonetic learning from EMNLP-13, including code (C++), analysis scripts (Python) and sample output.
    [Bitbucket]
  • Wally Referring Expressions Corpus (WREC): Rohde, Clarke and Elsner; see Frontiers-13 for details. This distribution includes the text, bounding boxes and dataframe files, but not the images themselves, for copyright reasons. We will distribute the images on request to people who can prove they own the relevant Where’s Wally books.
    [Edinburgh datashare]
  • Sentiment visualization tool: A web application that displays sentiment trajectories for fictional works, powered by Saif Mohammad’s emotion lexicon. Developed by Robert Ang as part of an M.Sc. co-sponsored by Jon Oberlander and Leverhulme Trust Writer-in-residence Victoria Adams.
    [Manual][Javascript application (doesn’t work on Internet Explorer 8)]
  • Pronunciation-varied Bernstein-Ratner corpus: dataset used in ACL-12. The Bernstein-Ratner corpus used by Brent for word segmentation, with added pronunciation variation from the Buckeye corpus.
    [tgz]
  • Brown Coherence Toolkit: software for a variety of local coherence models, now including the extended entity grid, and test applications for ordering and chat disentanglement (C++).
    New version 1.0 as of 2011!
    [Bitbucket]
  • Sentence Fusion Software: software for preprocessing, training and running our English sentence fusion system (C++/Python). By me and Deepak Santhanam.
    [Bitbucket]
  • Waterworks: Python utility package, including the ClusterMetrics library for evaluating clusterings. Mostly by David McClosky.
    [Python Package Index]
  • Correlation Clustering System: a framework for creating and analyzing datasets (Python), plus heuristic solvers and LP, ILP and SDP bounding systems (C++). (A generic greedy heuristic is sketched after this list.)
    This is the release version; the evaluation code requires Waterworks.
    README, [tgz]
    You may also want the data matrices we constructed for the 20 Newsgroups dataset: [tgz]
  • Unsupervised Pronoun Anaphora System: EM learner, pre-trained model (newswire) and pronoun resolver (C++).
    Eugene wrote this software, so although I’m pleased to answer questions on it, I don’t know the gory innards in detail.
    [tgz]
    If you’re planning to use this software, consider running Shane Bergsma’s NADA non-referential pronoun detector as a preprocessing step; its reported results show significant improvements over our built-in non-referential detector.
  • IRC Chat Data and Disentanglement Model: annotated IRC chat data, annotation software (Java), plus analysis scripts and the disentanglement model (Python).
    README, [tgz]
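
For readers curious about the autoencoder idea behind the neural segmentation code above, here is a minimal, generic reconstruction autoencoder written against the Keras API. It is purely illustrative: the window size, feature dimension and architecture are assumptions made for the sake of a runnable sketch, and the actual EMNLP-17 model is the code in the Github repository.

    # Generic reconstruction autoencoder over fixed-length feature windows.
    # All dimensions here are hypothetical, chosen only for illustration.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    window, feat_dim, latent_dim = 10, 40, 16   # assumed sizes

    inputs = keras.Input(shape=(window, feat_dim))
    x = layers.Flatten()(inputs)
    code = layers.Dense(latent_dim, activation="relu")(x)   # bottleneck encoding
    x = layers.Dense(window * feat_dim)(code)
    outputs = layers.Reshape((window, feat_dim))(x)

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")

    # Train on random stand-in data; real use would pass windows of acoustic
    # or text features and inspect the learned bottleneck representations.
    frames = np.random.rand(256, window, feat_dim).astype("float32")
    autoencoder.fit(frames, frames, epochs=2, batch_size=32, verbose=0)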
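
And here is a sketch of the kind of greedy heuristic the correlation clustering entry refers to, using the standard pivot strategy over a matrix of pairwise weights. This is a generic illustration under assumed inputs, not the released C++ solvers or their LP/ILP/SDP bounding code.

    # Greedy pivot heuristic for correlation clustering (illustrative only).
    # weight[i][j] > 0 means items i and j prefer to share a cluster;
    # weight[i][j] < 0 means they prefer to be separated.
    import random

    def pivot_cluster(n_items, weight, seed=0):
        rng = random.Random(seed)
        order = list(range(n_items))
        rng.shuffle(order)
        clusters = []
        while order:
            pivot = order.pop(0)
            cluster, rest = [pivot], []
            for item in order:
                (cluster if weight[pivot][item] > 0 else rest).append(item)
            order = rest
            clusters.append(cluster)
        return clusters

    # Toy example: items 0 and 1 attract each other; item 2 repels both.
    toy = [[0, 2, -1], [2, 0, -1], [-1, -1, 0]]
    print(pivot_cluster(3, toy))   # two clusters: {0, 1} and {2}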