Tools & Resources

The Paul Davis Moment

Each week at our computational linguistics discussion group, Clippers, we spend the first few minutes discussing interesting, time-saving, or just plain nifty tools that we’ve found. Here is a listing of shared resources:

Introduction to Large Language Models

Machine Learning

  • Mosaic ML. Mosaic ML offers a PyTorch library, composer, for adding algorithmic improvements to your models for free. Some methods involve an accuracy-time tradeoff, while others are speedups without performance degradation. The composer library has a functional API that lets you load models from Hugging Face. Links: composer and methods. (11/2022)
  • Weights & Biases. W&B is a platform for experiment tracking, hyperparameter tuning, model versioning and visualization. (11/2022)
  • LLM.int8(). Using LLM.int8(), a two-part quantization procedure, we can significantly reduce hardware requirements and still preserve performance levels for large language models. Introduction: article 1, article 2, paper. (09/2022)
  • Surge Data Labeling Platform. SurgeAI provides high-quality labelled data in more than 35 languages, https://www.surgehq.ai/ (09/2022).
  • YouTube channel with podcast for papers. https://www.youtube.com/c/YannicKilcher/videos (Mar 2022)
  • Machine Learning Bootcamp. Video lectures with synchronized slides. The main topics covered are:
    • Basic Math and TCS for Machine Learning
    • Useful existing software for Machine Learning
    • Introduction to Machine Learning
    • Theoretical frameworks and foundations
    • Experimental Machine Learning
    • Feature extraction and model selection
    • Graphical models
    • Kernel methods and linear predictors
    • Clustering
    • General view of application areas
    • Machine learning in vision
    • Machine learning in user interfaces
    • Machine learning for data mining

    (Jan 08)

  • Andrew Ng’s Machine Learning MOOC. Stanford massive open online course covering:
    • Linear Regression
    • Logistic Regression
    • Regularization
    • Naive Bayes

Neural Networks

  • Microsoft’s CNTK. The MS Computational Network Toolkit is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.
  • Stanford Deep Learning for Natural Language Processing. Stanford’s Deep Learning for NLP course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long-short-term-memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.
  • NVIDIA GPU Grant Program. NVIDIA’s Academic Programs Team is dedicated to empowering and collaborating with professors and researchers at universities worldwide with: small scale GPU grants, graduate fellowships, free teaching materials, and NVIDIA developer access.
  • TensorFlow. TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. There is also a probabilistic programming framework within TensorFlow called TensorFlow Probability. (Sept 2021)
  • Torch. Torch is a scientific computing framework with wide support for machine learning algorithms. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.
  • PyTorch. PyTorch is the successor of Torch for Python. There is also a probabilistic programming framework within PyTorch called Pyro. (Sept 2021)
  • Theano. Python library for neural networks that supports GPU operations. (Link updated 09/2022)

NLP Libraries

  • spaCy. spaCy is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you’re a small company doing NLP, we want spaCy to seem like a minor miracle.
  • SpacyAMR. amrlib facilitates abstract meaning representation (AMR) parsing, generation and visualization. (11/2022)
  • OpenFST. An open source finite state toolkit from the same folks who brought us the AT&T finite state toolkit. Has many of the same features, some new ones, and searchable source code! Find it here. (March 08)
  • Machine Learning Toolkit. YALE: Yet Another Learning Engine. Available on SourceForge, among other things, it can do word vector processing. (Mar 07)
  • Finite State Software. Been using the AT&T Finite State Toolkit? Looking for a similar product with the option of looking at the source code? Try the MIT Finite State Toolkit, which is open source, and has many of the same functionalities as the former.
  • GEM-benchmark, GEM-metrics. GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics (description from website). Links: main website, github repo. (Sept 2021, updated 09/2022)

Version Control

  • git-annex. git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space.
  • Version Control. There are a few new programs out there that improve on software like Subversion for doing distributed version control. For instance, Git and Mercurial offer some features that make collaboration easier, have better branching capabilities, and more intuitive command line incantations. For advice, contact Jon Dehdari. (Feb 2009)

Supercomputing

  • OSU Supercomputer Center. For heavy lifting. https://www.osc.edu/ (Oct 2016)
  • Hadoop. For those wishing to try out MapReduce code on a computing cluster, a la Google, Hadoop is set up on the Slate machines. You can learn about MapReduce by reading this paper, or, if you want something more in-depth, you can watch this lecture series. Hadoop’s distributed file system is called HDFS. When you’re ready to get started, Chris or Ilana can show you where to go on our system. (January 09, links updated 09/2022)
  • Apache Mahout. This software implements several machine learning algorithms using the MapReduce framework. Includes Naive Bayes, KNN, others. Should work on our Hadoop setup. Read more here. (January 09)

Natural Language Generation

  • Arria. Arria is the leader in real-time data storytelling. Our core product is known as the Arria NLG Platform, a form of artificial intelligence software that specializes in extracting information from complex data sources and communicating that information in natural language. We configure the Platform for a wide range of client needs; and we also offer its technology as pre-packaged SaaS Products and as a Software Development Kit (SDK) with APIs that allow developers to add NLG functionality to their own applications.

Lab Notebook and Teaching

  • Colaboratory. Colaboratory is a Google research project created to help disseminate machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud. (01/2018, link updated 09/2022)
  • LabArchives. (OSU Only) LabArchives is a cloud-based Electronic Lab Notebook application which allows users to electronically create, store, share and manage data. (01/2018, link updated 09/2022)

Paper Reading and Writing

  • Choosing the right graph. A 15-minute video on data visualization by Jean-luc Doumont. (02/2018)
  • Semantic Scholar. Semantic Scholar lets you cut through the clutter and home in on key publications, citations, and results.
  • Overleaf. Overleaf is the new collaborative writing and publishing system developed by the team behind the popular writeLaTeX editor. Overleaf is designed to make the whole process of writing, editing and producing scientific papers much quicker for both authors and publishers.
  • Mendeley. “iTunes for research papers.” Mendeley provides a nice GUI for interacting with your collection of research papers. The parallel to a playlist is a collection: create collections sorted however you choose, and put papers in multiple collections. Mendeley does a decent job of importing the correct metadata for many papers, especially if they exist in a public archive or are text-based (as opposed to scanned pages). It is possible to take notes on the PDFs and in a separate comment, and it is also possible to highlight text. You can also sync your library with the Mendeley website to have access to your papers from anywhere. (January 2011, link updated 09/2022)
  • PGF and TikZ. “A TeX macro package for generating graphics” When you need to show an MT model or generate any other graphics for your papers or presentations, PGF and TikZ will help you out. See the TeXample.net page to see some example images generated using PGF and TikZ. (October 2010)
  • beamerposter. A LaTeX package for creating scientific posters. beamerposter allows you to create beautiful posters for your conference presentations. (October 2010)
  • Looglefight. A tool to help you find the right phrasing for your comp ling papers. Takes two words or phrases input by the user and returns their frequencies in the ACL Reference Corpus to help you determine which phrasing works “better”. (October 2010)
  • PDFMiner. Extracts meaningful information out of PDF documents. PDFMiner is written in Python, has support for preserving layout, and could be useful the next time you’re processing PDFs. (January 2010)
  • Zotero. Collect, manage, and cite your research resources. Watch the video featured on the main page for a quick introduction to the tool. Their beta product is web based for easy accessibility. (October 2009, link updated 09/2022)
  • Bibtex Citations. IEEE Xplore, which is available to OSU students on campus or if you log in to the library from off campus, now allows you to download the bibtex citation of the articles you’re looking at. Check for the option on the menu on the left. (February 09)
  • Octet. Handy add-on for emacs that inserts latex code as you write. Helps you avoid leaving off that table end tag and that sort of thing. Puts keystroke bindings with common latex tags. Ask Crystal for help using this handy feature. (January 09)
  • LaPrint. Save your Matlab figures in a way that makes them show up nicer in your latex documents, including adding latex tags to the text (labels, axes, titles) on the figure.
  • Pseudocode in Latex. The style file clrscode.sty works well and produces very pretty pseudocode, in the same style as the Introduction to Algorithms book. Get the code and the documentation here.
  • Arabtex. It’s also possible to type in Arabic using Latex. Correct right-to-left formatting is included in the arabtex package. It’s a biggie, and complicated, so you’re best off using the package that’s already installed on bardolph.

Teaching

  • Digital Union. For teachers and students at OSU, check out the Digital Union for all kinds of technologically enhanced classroom supplies or study aids. Computers, speakers, cameras, and pedagogical tools you’d never thought of. (Feb 2009)
  • WebVectors. A visualization tool for word vectors, which may be a useful teaching tool. Website (Mar 2022, edited 09/2022)

Unix Tools

  • Encodings. If you have a document in UTF-8 and need it to be in Latin-1 encoding, use unix’s utf2latin1. If you need to go in the other direction, use iconv. (January 09)
  • rename. A unix command that will change the extension of a bunch of files all at once. For instance, “rename .raw .au *.raw” will change all of the files within the current directory that have a .raw extension to have a .au extension. Quicker than writing a script. (March 08)
  • Syncing Your Files. Along with version control, it is a good idea to keep the many files you may have on the various computers in your life synced up. You can use programs such as unison or rsync to help you do this. Keep your home directory at school and at home looking the same, and avoid duplicating your own work or overwriting your own files. Also helpful if a server goes down: you have your work, ready to use, elsewhere. (Jan 07)
  • Web Download Tool. The Unix tool ‘wget’ will download each page and all included content from a specified website. This can be helpful if, for instance, a corpus is available for download only as a large series of small files. The tool will drill down through all links from the specified start page with the ‘-r’ option. Example: $> wget -r http://www.ling.ohio-state.edu will download everything from the linguistics website (not recommended).
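For those who would rather stay in Python than remember which Unix tool goes in which direction, the Encodings and rename entries above can be approximated with the standard library. This is only a rough sketch: the function names convert_encoding and change_extension are made up for illustration, and the behavior is an approximation of the tools above, not a drop-in replacement.

```python
from pathlib import Path


def convert_encoding(src, dst, src_enc="utf-8", dst_enc="latin-1"):
    """Re-encode a text file (hypothetical helper, not a standard tool).

    Raises UnicodeEncodeError if the text contains a character that the
    target encoding cannot represent.
    """
    text = Path(src).read_text(encoding=src_enc)
    Path(dst).write_text(text, encoding=dst_enc)


def change_extension(directory, old, new):
    """Rename every file in `directory` ending in `old` so it ends in `new`.

    Both extensions include the leading dot, e.g. ".raw" and ".au".
    Returns the new paths in sorted order.
    """
    renamed = []
    # sorted() collects the matches up front, before any file is renamed.
    for path in sorted(Path(directory).glob(f"*{old}")):
        target = path.with_suffix(new)
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling change_extension(".", ".raw", ".au") mirrors the rename example above, and swapping the src_enc and dst_enc arguments converts in the other direction.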

Corpora

  • LiveJournal on SLaTe. For those wishing to work with blog data, we have zipped-up versions of three months’ worth of LiveJournal webpages on the slate server. This is a standard data set for working with blogs. Talk to Eric or Chris if you’re interested in getting started with it. (January 09)
  • Google N-gram search. First off, the Google English n-gram data is available to those with access to the ling dep’t server. Find it at /home/corpora/EN/WebIT. Some software that searches these n-grams efficiently on the web is also available. I lost that reference, but will update when it’s found. (Dec 07)
  • Penn Discourse Treebank. This is also currently available on the linguistic corpora server. (Dec 07)
  • TigerSearch. This software for searching through syntactically annotated corpora is now available on the Mac portion of the ling machines in Oxley 201. It has a java interface, and allows you to search for examples of general or specific syntactic constructions within many corpora. Ask Detmar or Adriane if you need help. (Oct 07)
  • Wikipedia Downloads. It is possible to download all of Wikipedia, or various portions of it, for use in NLP tasks. The website can be a bit hard to find, so Adriane found it for us: Get Wikipedia here (April 07)
  • RSS News Feeds. If you wish to work with current news documents, and are looking for a standard, uniform format in which to work, RSS is a good choice. To obtain a news article in RSS format, you can use URLs of the form:
    • http://news.google.com/news?q=Ohio+State&output=rss

    Here “Ohio State” is the search term. To restrict results to specific news sites, use the “source:” operator, e.g.

    • http://news.google.com/news?q=Ohio+State+source:new_york_times&output=rss

    Other formats are available. (April 07)

  • BRENT corpus. Available at /home/corpora/EN/childes/Brent. Ask Anton for details on using this corpus, or the Stephanie corpus.
  • Semantic Annotation. RST Tool, available from wagsoft, is a pointy-clicky, slightly non-intuitive but easy to install tool for doing semantic annotation according to the discourse theory of your choice, especially Rhetorical Structure Theory. Also installed on /home/compling (Mar 07)
  • Picture Naming Database. The International Picture Naming Project at CRL-UCSD contains a database of black-and-white drawings along with norms for what names they are given, in a variety of languages. Also given are norms for things including naming time. It contains some pictures published in an earlier set collected by Snodgrass & Vanderwart, which is used in a lot of studies, so you might want to use those pictures to duplicate prior results. If any of those pictures are used, the following paper should be cited (this is their condition of use):
    • Snodgrass, J.G., & Vanderwart, M. (1980). JEP: Human Learning and Memory, 6:3, 174-215.

    The S&V pictures are black and white. If you use the colored versions, you need to cite both Snodgrass & Vanderwart, and Rossion & Pourtois, who modified them to make them in full color:

    • Rossion, B. & Pourtois, G. (2001). Revisiting Snodgrass and Vanderwart’s Object database: Color and Texture improve Object Recognition. 1st Vision Conference, Sarasota, FL.
  • Enron Corpus. Interested in naturally occurring language in the electronic domain? Search inter-office emails sent by employees of the Enron Corporation before the company’s downfall. Scripts are available that filter out emails repeated throughout the corpus. On the linguistics server, see /home/corpora/EN/enron. (Feb 06)
  • Corpora Search Tool. The tools xkwic and cqp, both found on the linguistics department computers, are useful for decoding corpora such as the BNC, and for running complex queries on them. Ask Adriane for details on how to use these effectively (adriane @ ling). (Jan 06)
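If you want to generate many feed URLs of the form shown in the RSS News Feeds entry above, a small helper can build them for you. This is a sketch only: the function name news_rss_url is invented for illustration, the base URL is simply the one quoted in that entry, and there is no guarantee the service still answers at that address.

```python
from urllib.parse import urlencode

# URL pattern copied from the RSS News Feeds entry; it may no longer be served.
BASE_URL = "http://news.google.com/news"


def news_rss_url(query, source=None):
    """Build a feed URL for a search term, optionally restricted to one source.

    `source` is the value that follows the "source:" operator,
    e.g. "new_york_times".
    """
    terms = query if source is None else f"{query} source:{source}"
    # safe=":" keeps the colon of the "source:" operator unescaped.
    return BASE_URL + "?" + urlencode({"q": terms, "output": "rss"}, safe=":")
```

With these definitions, news_rss_url("Ohio State") and news_rss_url("Ohio State", source="new_york_times") reproduce the two example URLs listed in that entry.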

Programming Languages

  • Higher Order Perl. A new book is now available about programming elegantly in everyone’s favorite scripting language. Order it or download it for free. (January 09)

Statistics

  • Bootstrap Method Tutorial. Easy introduction for running bootstrap analysis for significance testing. (SP 17)
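To give a flavor of what a bootstrap significance test looks like in practice, here is a minimal sketch of paired bootstrap resampling for comparing two systems scored on the same test items. The linked tutorial may well use a different variant; the function name paired_bootstrap and its parameters are assumptions made for this illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap resampling over per-item scores (illustrative sketch).

    Returns (observed mean difference A - B, fraction of resampled test
    sets on which A does not beat B). Under this scheme the fraction can
    be read as an approximate one-sided p-value.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists paired item by item")
    rng = random.Random(seed)
    n = len(scores_a)
    # Work with per-item differences so each resample keeps items paired.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # draw n items with replacement
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

A small returned fraction suggests that the advantage of system A is unlikely to be an accident of which test items happened to be sampled. Fixing the seed keeps the result reproducible between runs.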

Other

  • arxiv-utils. A web extension for arxiv that adds some nice features for reading articles there, https://github.com/j3soon/arxiv-utils (Sept 2021)
  • overleave. A web extension for Overleaf that allows for rendering pdf on a separate screen, https://github.com/shreyashankar/overleave (Feb 2022)
  • SQLite. This is a good database system to use because it is portable, keeps your data in a single file, works in the user space, and has good software carpentry, that is, it was built intelligently so that you can build on top of it. (May 07).
  • Website User Authentication. If you are building an OSU website for which you wish to require users to identify and/or authenticate themselves before accessing the material, you can use the library’s proxy service to accomplish this. Ask Detmar for details.
  • CL Olympiad. High school students nationwide are encouraged to participate in the Computational Linguistics Olympiad. Students are given traditional linguistic problems, and problems involving computational thinking and issues regarding natural language processing. As of Feb 2, the organization is looking for suggestions for contest problems. (Feb 07, link updated 09/2022)
  • ICE. For inter-process communication, collaborating on projects across universities, etc. This is also called middleware. Read more about ICE here. Competing software includes OAA: Open Agent Architecture, and Multiplatform: Multiple Language / Target Integration Platform for Modules. (Jan 07)
  • Permanent URLs. A permanent URL will allow your website to retain a single, simple address, regardless of whether you change your employment or web-hosting position. purl provides a good service for this. tinyurl.com has a slightly different service, allowing you to create a very short URL that links to a website you may have with a long address. (Jan 07, updated 09/2022)
  • SVG. Scalable Vector Graphics are a great idea if you think your graphics might be seen on a wide variety of monitors – there is no distortion in size when going from movie screen to cell phone screen. Use SVG to build representations of xml documents, or any other node-based structure. See croczilla.com for examples. (Jan 07, link updated 09/2022)
  • Assessment in Academic Pursuits. When it becomes necessary to list your achievements in the academic arena, it is useful to have some information on hand that goes beyond publication titles and dates. Other data to collect as your publication list lengthens:
    • Acceptance rate of papers at each venue/journal. Available in front matter of conference proceedings, journal issues.
    • Your percentage of contribution to a paper. Include actual research/project work, amount of writing, and creative input when you make this calculation.
    • Citation rate. Consult Google, ISI Database, Citeseer for information on how often your papers have been cited. These resources use different metrics for determining citation rates, so you may need to defend the actual citation rate that you choose to report.
    • Relative impact of the venue. Ratings given in ISI database (available in OSCAR).

    Also, let colleagues know what you’re doing, what you’ve published and where, and make sure your publications get into the hands of people who you think should read them. Be annoying if necessary. (Jan 06)

  • Publication Strategy. To get the ball rolling on getting published, consider taking the advice in Publication, Publication by Gary King. Main ideas:
    • Build on someone else’s previous research by making one change.
    • Be able to defend the reasons for that change, and the impact it makes.
    • Clearly describe exactly the work that you did.
    • Make your data available. (Jan 06)
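As a footnote to the SQLite entry above, here is a small sketch of what "keeps your data in a single file" looks like from Python, using the sqlite3 module from the standard library. The table layout and the function names store_token_counts and top_tokens are invented for this illustration.

```python
import sqlite3
from collections import Counter


def store_token_counts(db_path, tokens):
    """Write token counts into a single-file SQLite database (sketch)."""
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS counts "
                "(token TEXT PRIMARY KEY, n INTEGER NOT NULL)"
            )
            # INSERT OR REPLACE overwrites any earlier count for a token.
            conn.executemany(
                "INSERT OR REPLACE INTO counts (token, n) VALUES (?, ?)",
                Counter(tokens).items(),
            )
    finally:
        conn.close()


def top_tokens(db_path, k):
    """Return the k most frequent tokens as (token, count) pairs."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT token, n FROM counts ORDER BY n DESC, token ASC LIMIT ?",
            (k,),
        ).fetchall()
    finally:
        conn.close()
    return rows
```

The whole database lives in the one file named by db_path, so copying that file to another machine is all it takes to move the data.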