Tools & Resources

The Paul Davis Moment

Each week at our computational linguistics discussion group, Clippers, we spend the first few minutes discussing interesting, time-saving, or just plain nifty tools that we’ve found. Here is a listing of shared resources:

Neural Networks

  • Microsoft’s CNTK. The MS Computational Network Toolkit is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.
  • Stanford Deep Learning for Natural Language Processing. Stanford’s Deep Learning for NLP course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long-short-term-memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.
  • NVIDIA GPU Grant Program. NVIDIA’s Academic Programs Team is dedicated to empowering and collaborating with professors and researchers at universities worldwide with: small scale GPU grants, graduate fellowships, free teaching materials, and NVIDIA developer access.
  • Tensorflow. Tensorflow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
  • Torch. Torch is a scientific computing framework with wide support for machine learning algorithms. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.
  • Theano. Python library for neural networks that supports GPU operations.
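
To make the data flow graph idea from the Tensorflow entry concrete, here is a toy sketch in plain Python. It does not use TensorFlow, CNTK, Torch, or Theano, and the names Node, constant, add, and scale are invented for this illustration. Nodes hold operations, and the values passed between them stand in for tensors.

```python
class Node:
    """One operation in the graph; `inputs` are the edges feeding it."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate upstream nodes first, then apply this node's operation.
        values = [node.run() for node in self.inputs]
        return self.op(*values)


def constant(values):
    # A source node with no inputs: it just produces its data.
    return Node(lambda: list(values))


def add(a, b):
    # Element-wise addition of the arrays produced by two nodes.
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], a, b)


def scale(a, factor):
    # Multiply every element produced by node `a` by a fixed factor.
    return Node(lambda x: [factor * p for p in x], a)


x = constant([1.0, 2.0, 3.0])
y = constant([10.0, 20.0, 30.0])
graph = scale(add(x, y), 0.5)  # builds the graph; nothing is computed yet
result = graph.run()           # walks the graph and computes the values
print(result)
```

Building the graph and running it are separate steps here, which is the property that lets real toolkits decide where each operation should execute.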

Version Control

  • git-annex. git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space.
  • Version Control. There are a few newer programs that improve on software like Subversion by offering distributed version control. For instance, Git and Mercurial offer features that make collaboration easier, have better branching capabilities, and use more intuitive command line incantations. For advice, contact Jon Dehdari. (Feb 2009)
  • Subversion. If you missed Scott’s presentation on version control using SVN, or if you’d like to see it again, you can access his slides via the LCC tutorials webpage or Scott’s webpage (May 07).

Supercomputing

  • OSU Supercomputer Center. For heavy lifting. https://www.osc.edu/ (Oct 2016)
  • Hadoop. For those wishing to try out MapReduce code on a computing cluster, a la Google, Hadoop is set up on the Slate machines. You can learn about MapReduce by reading this paper, or if you want something more in-depth, you can watch this lecture series. Hadoop’s distributed file system is called HDFS. When you’re ready to get started, Chris or Ilana can show you where to go on our system. (January 09)
  • Apache Mahout. This software implements several machine learning algorithms using the MapReduce framework. Includes Naive Bayes, KNN, others. Should work on our Hadoop setup. Read more here. (January 09)
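
For anyone who has not seen MapReduce before, the sketch below walks through the three stages (map, shuffle, reduce) for a word count, in plain Python on a single machine. It does not use Hadoop or Mahout. The function names are invented for this illustration, and a real cluster would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```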

Natural Language Generation

  • Arria. Arria is the leader in real-time data storytelling. Our core product is known as the Arria NLG Platform, a form of artificial intelligence software that specializes in extracting information from complex data sources and communicating that information in natural language. We configure the Platform for a wide range of client needs; and we also offer its technology as pre-packaged SaaS Products and as a Software Development Kit (SDK) with APIs that allow developers to add NLG functionality to their own applications.

Paper Reading and Writing

  • Kami PDF Reader. PDF reader w/annotation. Chrome plugin. (Oct 16)
  • Semantic Scholar. Semantic Scholar lets you cut through the clutter and home in on key publications, citations, and results.
  • Overleaf. Overleaf is the new collaborative writing and publishing system developed by the team behind the popular writeLaTeX editor. Overleaf is designed to make the whole process of writing, editing and producing scientific papers much quicker for both authors and publishers.
  • Mendeley. “iTunes for research papers” Mendeley provides a nice GUI for interacting with your collection of research papers. The parallel to a playlist is a collection: create collections sorted howsoever you choose, and put papers in multiple collections. Mendeley does an alright job of importing the correct metadata for many papers, especially if they exist in a public archive or are text-based (as opposed to scanned pages). It is possible to take notes on the PDFs and in a separate comment, and it is also possible to highlight text. You can also sync your library with the Mendeley website to have access to them from anywhere. (January 2011)
  • PGF and TikZ. “A TeX macro package for generating graphics” When you need to show an MT model or generate any other graphics for your papers or presentations, PGF and TikZ will help you out. See the TeXample.net page to see some example images generated using PGF and TikZ. (October 2010)
  • beamerposter. A LaTeX package for creating scientific posters. beamerposter allows you to create beautiful posters for your conference presentations. (October 2010)
  • Looglefight. A tool to help you find the right phrasing for your comp ling papers. Takes two words or phrases input by the user and returns their frequencies in the ACL Reference Corpus to help you determine which phrasing works “better”. (October 2010)
  • PDFMiner. Extracts meaningful information out of PDF documents. PDFMiner is written in Python, has support for preserving layout, and could be useful the next time you’re processing PDFs. (January 2010)
  • Zotero. Collect, manage, and cite your research resources. Watch the video featured on the main page for a quick introduction to the tool. Their beta product is web based for easy accessibility. (October 2009)
  • Bibtex Citations. IEEE Xplore, which is available to OSU students on campus or if you log in to the library from off campus, now allows you to download the bibtex citation of the articles you’re looking at. Check for the option on the menu on the left. (February 09)
  • Octet. Handy add-on for emacs that inserts latex code as you write. Helps you avoid leaving off that table end tag and that sort of thing. Puts keystroke bindings with common latex tags. Ask Crystal for help using this handy feature. (January 09)
  • LaPrint. Save your Matlab figures in a way that makes them show up nicer in your latex documents, including adding latex tags to the text (labels, axes, titles) on the figure.
  • Semantic. If you’d like to be able to create your own math symbols in latex, specifically those with ligatures, try installing this package. (Dec 07)
  • Anti-Word. If you use a linux machine pretty much exclusively, but get email attachments from people who use Windows products, then you might be interested in Anti-Word, which will convert .doc files to plain text. (Nov 07)
  • PrimoPdf. You can make PDFs of your MS Office documents for free with this nifty app. Get it here. (Oct 07)
  • Pseudocode in Latex. The style file clrscode.sty works well and produces very pretty pseudocode, same as in the Introduction to Algorithms book. Get the code and the documentation here.
  • Arabtex. It’s also possible to type in Arabic using Latex. Correct right-to-left formatting is included in the arabtex package. It’s a biggie, and complicated, so you’re best off using the package that’s already installed on bardolph.
  • Text Editing. The creator of the vim text editor gave a talk to the Google folks on efficient text editing: how to identify when you’re doing things inefficiently, and how to fix that. Emacs users can benefit, too. Find the talk at Google Video. (Mar 07)
  • HeVeA. A utility for converting very simple tex files into webpages. Appropriate for text-heavy, graphics-poor websites like online syllabi, course descriptions, etc. Already installed on the Linguistics department computers. (Jan 07)
  • Prefuse. A Java visualization toolkit. This software can help you make web-ready graphics of parse trees, etc. Could be useful for teaching parsing, grammar, syntax, etc. Find it at www.prefuse.org, and similar tools at graphviz.org. (Jan 07)
  • bibdesk. A point-and-click interface for creating your very own BibTex file. Reduces typos. Find it on SourceForge, at least for Mac. (Jan 07)
  • latex2rtf. Have a latex file and need a Windows document? Try this resource, which works with fair accuracy. Another option is to use OpenOffice, from which documents can be directly exported to pdf, or presentations to Flash or .ppt – but use with caution, fonts can get messy. (Jan 07)
  • BibTex Yourself. When you list a citation to one of your own papers on your website, be sure to put a BibTex entry right next to it. That way, others won’t mis-cite your work. (Jan 07)
  • Google’s BibTex resource. If you use Google Scholar to find academic articles, change the Preferences to have it provide a BibTex entry for the various resources it finds. Use with caution – a quick sample done in our meeting showed some errors – but it’s a good start. (Jan 07)
  • pdflatex. This is an easy way to embed pdf files within your own latex files. Find details in this document. Or, try Googling ‘pdfpages’. (October 06)
  • yab2web. This facility allows easy publication of bibtex entries into html, ideal for listing your publication list on your website. See Donna Byron’s website for an example. (March 06)
  • IPA in HTML. The two following websites will get you started in publishing pages on the web with IPA fonts included:

NLP Libraries

  • Spacy. Spacy is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you’re a small company doing NLP, we want spaCy to seem like a minor miracle.
  • OpenFST. An open source finite state toolkit from the same folks who brought us the AT&T finite state toolkit. Has many of the same features, some new ones, and searchable source code! Find it here. (March 08)
  • Machine Learning Toolkit. YALE: Yet Another Learning Engine. Available on SourceForge, among other things, it can do word vector processing. (Mar 07)
  • Finite State Software. Been using the AT&T Finite State Toolkit? Looking for a similar product with the option of looking at the source code? Try the MIT Finite State Toolkit, which is open source, and has many of the same functionalities as the former.

Privacy

  • Recaptcha. Know how when you buy from Ticketmaster, you have to type in the words that appear all squirrely in the picture? Now you can use that same technology to hide your own email address on your webpage. This can help stop spam. (February 09)

Teaching

  • Digital Union. For teachers and students at OSU, check out the Digital Union for all kinds of technologically enhanced classroom supplies or study aids. Computers, speakers, cameras, and pedagogical tools you’d never thought of. (Feb 2009)

Unix Tools

  • Encodings. If you have a document in UTF-8, and need it to be in Latin encoding, use unix’s utf2latin1. But if you need to go back the other direction, use iconv. (January 09)
  • rename. A unix command that will change the extension of a bunch of files all at once. For instance, “rename .raw .au *.raw” will change all of the files within the current directory that have a .raw extension to have a .au extension. Quicker than writing a script. (March 08)
  • sshfs. This unix application allows you to mount a remote filesystem over SSH, which makes it easier to access your ling files from home. This website has details. It should be available on most linux installations: try ‘apt-get install sshfs’. (May 07)
  • Syncing Your Files. Along with version control, it is a good idea to keep the many files you may have on the various computers in your life synced up. You can use programs such as unison or rsync to help you do this. Keep your home directory at school and at home looking the same, and avoid reduplicating your own work, or overwriting your own files. Also helpful if a server goes down – you have your work, ready to use, elsewhere. (Jan 07)
  • Web Download Tool. The Unix tool ‘wget’ will download each page and all included content from a specified website. This can be helpful, if, for instance, a corpus is available for download only as a large series of small files. The tool will drill down through all links from the specified start page with the ‘-r’ option. Example: $> wget -r http://www.ling.ohio-state.edu will download everything from the linguistics website (not recommended).
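
If you would rather do the jobs from the Encodings and rename entries inside a script, the following Python sketch shows one way. It is not a wrapper around iconv or rename, and the helper names convert_encoding and change_extension are invented for this illustration. The demo runs in a temporary directory so nothing in your real files is touched.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, from_enc, to_enc):
    """Re-encode a text file, roughly what `iconv -f FROM -t TO` does.

    Raises UnicodeEncodeError if the text contains characters that the
    target encoding cannot represent.
    """
    text = Path(src).read_text(encoding=from_enc)
    Path(dst).write_text(text, encoding=to_enc)


def change_extension(directory, old_ext, new_ext):
    """Rename every file ending in old_ext in `directory` to use new_ext."""
    renamed = []
    for path in sorted(Path(directory).glob(f"*{old_ext}")):
        target = path.with_suffix(new_ext)
        path.rename(target)
        renamed.append(target.name)
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)

    # UTF-8 to Latin-1: the accented letter shrinks from two bytes to one.
    (work / "utf8.txt").write_text("caf\u00e9", encoding="utf-8")
    convert_encoding(work / "utf8.txt", work / "latin1.txt", "utf-8", "latin-1")
    latin1_bytes = (work / "latin1.txt").read_bytes()

    # Same effect as: rename .raw .au *.raw
    (work / "a.raw").touch()
    (work / "b.raw").touch()
    renamed = change_extension(work, ".raw", ".au")

print(latin1_bytes, renamed)
```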

Corpora

  • LiveJournal on SLaTe. For those wishing to work with blog data, we have zipped-up versions of three months’ worth of LiveJournal webpages on the slate server. This is a standard data set for working with blogs. Talk to Eric or Chris if you’re interested in getting started with it. (January 09)
  • Google N-gram search. First off, the Google English n-gram data is available to those with access to the ling dep’t server. Find it at /home/corpora/EN/WebIT. There is also available some software that searches these n-grams efficiently on the web. I lost that reference, but will update when it’s found. (Dec 07)
  • Penn Discourse Treebank. This is also currently available on the linguistic corpora server. (Dec 07)
  • TigerSearch. This software for searching through syntactically annotated corpora is now available on the Mac portion of the ling machines in Oxley 201. It has a java interface, and allows you to search for examples of general or specific syntactic constructions within many corpora. Ask Detmar or Adriane if you need help. (Oct 07)
  • Wikipedia Downloads. It is possible to download all of Wikipedia, or various portions of it, for use in NLP tasks. The website can be a bit hard to find, so Adriane found it for us: Get Wikipedia here (April 07)
  • RSS News Feeds. If you wish to work with current news documents, and are looking for a standard, uniform format in which to work, RSS is a good choice. To obtain a news article in RSS format, you can use URLs of the form:
    • http://news.google.com/news?q=Ohio+State&output=rss

    Where “Ohio State” was the search term; to restrict it to specific news sites, use the “source:” operator, i.e.

    • http://news.google.com/news?q=Ohio+State+source:new_york_times&output=rss

    Other formats are available. (April 07)

  • BRENT corpus is available at /home/corpora/EN/childes/Brent. Ask Anton for details on using this corpus, or the Stephanie corpus.
  • Semantic Annotation. RST Tool, available from wagsoft, is a pointy-clicky, slightly non-intuitive but easy to install tool for doing semantic annotation according to the discourse theory of your choice, especially Rhetorical Structure Theory. Also installed on /home/compling (Mar 07)
  • Picture Naming Database. The International Picture Naming Project at CRL-UCSD contains a database of black-and-white drawings along with norms for what names they are given, in a variety of languages. Also given are norms for things including naming time. It contains some pictures published in an earlier set collected by Snodgrass & Vanderwart, which is used in a lot of studies, so you might want to use those pictures to duplicate prior results. If any of those pictures are used, the following paper should be cited (this is their condition of use):
    • Snodgrass, J.G., & Vanderwart, M. (1980). JEP: Human Learning and Memory, 6:3, 174-215.

    The S&V pictures are black and white. If you use the colored versions, you need to cite both Snodgrass & Vanderwart, and Rossion & Pourtois, who modified them to make them in full color:

    • Rossion, B. & Pourtois, G. (2001). Revisiting Snodgrass and Vanderwart’s Object database: Color and Texture improve Object Recognition. 1st Vision Conference, Sarasota, FL.
  • Enron Corpus. Interested in naturally occurring language in the electronic domain? Search inter-office emails sent by employees of the Enron Corporation before the company’s downfall. Scripts are available that filter out emails repeated throughout the corpus. On the linguistics server, see /home/corpora/EN/enron. (Feb 06)
  • General Language Ontology. There is a recently acquired ontology on the ling server that may be useful for those who need a basic semantic representation of general concepts. Read more about the ontology, what concepts it encodes, and what it may be useful for at http://research.cyc.com, and find the resource itself at /home/corpora/EN/cyc. You can also contact Stacey (s.bailey @ ling) for further information. (Feb 06)
  • Corpora Search Tool. The tools xkwic and cqp, both found on the linguistics department computers, are useful for decoding corpora such as the BNC, and for running complex queries on them. Ask Adriane for details on how to use these effectively (adriane @ ling). (Jan 06)
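
For the RSS News Feeds entry above, the URLs can be built programmatically rather than by hand. This is a minimal sketch that only constructs the URL strings in the form shown in that entry. The helper name news_rss_url is invented here, and the endpoint is copied from the 2007 entry, so it may no longer respond.

```python
from urllib.parse import urlencode

# Endpoint as given in the RSS News Feeds entry; it may be out of date.
BASE_URL = "http://news.google.com/news"


def news_rss_url(search_terms, source=None):
    """Build a news search URL that asks for RSS output."""
    query = search_terms
    if source is not None:
        # Restrict results to one news site with the "source:" operator.
        query = f"{search_terms} source:{source}"
    # safe=":" leaves the colon unescaped, matching the URLs in the entry.
    params = urlencode({"q": query, "output": "rss"}, safe=":")
    return f"{BASE_URL}?{params}"


plain_url = news_rss_url("Ohio State")
nyt_url = news_rss_url("Ohio State", source="new_york_times")
print(plain_url)
print(nyt_url)
```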

Programming Languages

  • Higher Order Perl. A new book is now available about programming elegantly in everyone’s favorite scripting language. Order it or download it for free. (January 09)

Machine Learning

  • Machine Learning Bootcamp. Various video lectures with synchronized slides that some people might be interested in. The main topics covered are:
    • Basic Math and TCS for Machine Learning
    • Useful existing software for Machine Learning
    • Introduction to Machine Learning
    • Theoretical frameworks and foundations
    • Experimental Machine Learning
    • Feature extraction and model selection
    • Graphical models
    • Kernel methods and linear predictors
    • Clustering
    • General view of application areas
    • Machine learning in vision
    • Machine learning in user interfaces
    • Machine learning for data mining

    (Jan 08)

  • Machine Learning Slides. UC Berkeley’s RAD Lab has made slides and videos available on the web from a recent two-day short course on applied machine learning for its industrial affiliates: (Nov 07)
    Video
    Slides
  • Andrew Ng’s Machine Learning MOOC. Stanford massive open online course covering:
    • Linear Regression
    • Logistic Regression
    • Regularization
    • Naive Bayes

Statistics

  • Bootstrap Method Tutorial. Easy introduction for running bootstrap analysis for significance testing.  (SP 17)
  • Statistics Primer. A good introductory text to basic statistics can be found at http://faculty.vassar.edu/lowry/webtext.html. If you follow the link for VassarStats, you will find tools for calculating various statistics. (Sept 06)
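
As a rough illustration of what a bootstrap significance test looks like in code, here is a sketch of paired bootstrap resampling for comparing two systems scored on the same test items. This is one common variant and may differ from the procedure in the tutorial listed above. The function name, the number of resamples, and the example scores are all made up for this illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=2000, seed=0):
    """Return the fraction of resamples in which A does not beat B.

    scores_a and scores_b are per-item scores for two systems on the same
    test items. A small return value suggests that A's advantage is
    unlikely to be an accident of which items were in the test set.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(num_samples):
        # Draw n items with replacement and total the score differences.
        sample_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if sample_total <= 0:
            not_better += 1
    return not_better / num_samples


# Made-up scores in which system A wins on every item.
system_a = [0.80, 0.70, 0.90, 0.60, 0.75]
system_b = [0.70, 0.65, 0.80, 0.50, 0.70]
p_value = paired_bootstrap(system_a, system_b)
print(p_value)
```

With only five items the estimate is not meaningful in practice; real comparisons would use the full test set.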

Other

  • IR Systems. Two IR systems that are available for research purposes are Galago and Terrier. Each has its ups and downs, both are worth exploring. Talk to Chris for more info. (Feb 09)
  • Speed Reading. To practice speed reading, find a freely available program called RSVP. It will take any webpage or document and present it to you, word by word, at the speed you set. Then increase the speed as you get better. (Feb 08)
  • Stinkpot. A repository of helpful hints on all kinds of tools we tend to use to do our work: Emacs, Python, Latex, Matlab… it’s a personal blog of a grad student at MIT who works on silly things like evolution. His version of the Paul Davis moment is something you might find helpful. (Dec 07)
  • MIT Workshop on Syntax. It’s not up as of this writing, but check on mitworld.mit.edu for a video of their one day workshop titled “Where Does Syntax Come From? Have We All Been Wrong?”, with guest speakers Sandiway Fong, Chris Manning, and Noam Chomsky, among others. (Nov 07)
  • SQLite. This is a good database system to use because it is portable, keeps your data in a single file, works in the user space, and has good software carpentry, that is, it was built intelligently so that you can build on top of it. (May 07).
  • Website Accessibility. In constructing a website, it’s recommended (required at OSU, in fact), to make it accessible to the disabled. That means to make sure that vision-impaired folks will be able to get your information by using a screen reader. To make sure your website is compliant, use a tool like Fangs to get an idea of what your website “sounds” like. (April 07)
  • Website User Authentication. If you are building an OSU website for which you wish to require users to identify and/or authenticate themselves before accessing the material, you can use the library’s proxy service to accomplish this. Ask Detmar for details.
  • Carmen Tip. Keep backups. The system can go down, and it can take you with it. Exporting and importing is relatively simple. (Feb 07)
  • Google Books. With a Google account, you can use their service to search through many books. You can’t necessarily read them from cover to cover, but it can be a helpful resource if you need to search for particular topics within a text. (Feb 07)
  • CL Olympiad. High school students nationwide are encouraged to participate in the Computational Linguistics Olympiad. Students are given traditional linguistic problems, and problems involving computational thinking and issues regarding natural language processing. As of Feb 2, the organization is looking for suggestions for contest problems. (Feb 07)
  • ICE. For inter-process communication, collaborating on projects across universities, etc. This is also called middleware. Read more about ICE here. Competing software is OAA: Open Agent Architecture, and Multiplatform: Multiple Language / Target Integration Platform for Modules (Jan 07).
  • Firefox browser. Version 2 supports many standards, incl. SVG and there are nice, free extensions available, including:
    • Webdeveloper (live editing of html, css, etc.)
    • Aardvark (modify what’s displayed on any webpage, for doing screenshots etc.)
    • Greasemonkey: various neat user scripts
    • Firebug (Debugger and network traffic profiler)
  • mechanize. This perl module will fill in form values in html documents automatically. (Jan 07)
  • Anonymous Feedback. Teachers might find it useful to allow their students to send them anonymous feedback. See Detmar’s example, and if you’d like, copy his on your own website. To do that, copy the entire directory on our department network: ~dm/public_html/feedback . Don’t forget to change all instances of the name and email address! (Jan 07)
  • Permanent URLs. A permanent URL will allow your website to retain a single, simple address, regardless of whether you change your employment or web-hosting position. purl.org provides a good service for this. tinyurl.com has a slightly different service, allowing you to create a very short URL that links to a website you may have with a long address. For an example of purl, you can find the OSU ICALL group and its projects at http://purl.org/net/icall. (Jan 07)
  • AJAX. Not just a cleaning solution, it can solve your messy, slow, database-driven web page problems as well. For an overview, examples, and tutorial of how to use AJAX, see Scott’s slides (Jan 07).
  • SVG. Scalable Vector Graphics are a great idea if you think your graphics might be seen on a wide variety of monitors – there is no distortion in size when going from movie screen to cell phone screen. Use SVG to build representations of xml documents, or any other node-based structure. See croczilla.com for examples. (Jan 07)
  • CCG Parser. A new CCG parser and supertagger is available from Clark and Curran. You can find the software and related literature at: The CCG site. (September 06)
  • Internships. It’s high time to start thinking about summer internships in CL. If you’re interested in working someplace like Microsoft or elsewhere on the West Coast, have a chat with Chris (cbrew @ ling), Eric (fosler @ ling), or Donna (dbyron @ ling) for information and contacts. (Jan 06)
  • Assessment in Academic Pursuits. When it becomes necessary to list your achievements in the academic arena, it is useful to have some information on hand that goes beyond publication titles and dates. Other data to collect as your publication list lengthens:
    • Acceptance rate of papers at each venue/journal. Available in front matter of conference proceedings, journal issues.
    • Your percentage of contribution to a paper. Include actual research/project work, amount of writing, and creative input when you make this calculation.
    • Citation rate. Consult Google, ISI Database, Citeseer for information on how often your papers have been cited. These resources use different metrics for determining citation rates, so you may need to defend the actual citation rate that you choose to report.
    • Relative impact of the venue. Ratings given in ISI database (available in OSCAR).

    Also, let colleagues know what you’re doing, what you’ve published and where, and make sure your publications get into the hands of people who you think should read them. Be annoying if necessary. (Jan 06)

  • Publication Strategy. To get the ball rolling on getting published, consider taking the advice in Publication, Publication by Gary King. Main ideas:
    • Build on someone else’s previous research by making one change.
    • Be able to defend the reasons for that change, and the impact it makes.
    • Clearly describe exactly the work that you did.
    • Make your data available. (Jan 06)
  • Idea Solicitation. What kind of project management tools would you like to see in the Linguistics department? How can we make group projects more manageable? Bring your suggestions to Clippers, or email the CL list. (Sept 05)
  • text2onto. Automatically extracts a candidate concept hierarchy and instances from a corpus of plain text. Not fully functional, but possibly handy for small projects, or getting started with ontologies. See the website for more details. (Oct 05)
  • Language Generator. This website includes Perl code that will randomly generate ‘pointy-hair-boss mission statements’, as well as a link (near the end) to similar random language generators. (Oct 05)
  • Boolistic. Not just another search engine, this website may come in handy for those teaching boolean logic: www.boolistic.com. Enter your search terms, then click on different parts of the Venn diagram to alter the search query. (Sept 05)
  • Corpora Mailing List. Sign up here to receive email regarding new corpora and corpus tools.
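
To show what the SQLite entry above means by keeping your data in a single file and working in the user space, here is a small sketch using Python's built-in sqlite3 module. The file name tools.db, the table layout, and the rows are hypothetical. No server process or administrator setup is involved.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "tools.db"  # hypothetical file name

    # Connecting creates the database file if it does not exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, category TEXT)")
    conn.executemany(
        "INSERT INTO tools VALUES (?, ?)",
        [("wget", "unix"), ("iconv", "unix"), ("Zotero", "papers")],
    )
    conn.commit()

    # Parameterized query: the ? placeholder is filled in safely.
    rows = conn.execute(
        "SELECT name FROM tools WHERE category = ? ORDER BY name", ("unix",)
    )
    unix_tools = [row[0] for row in rows]
    conn.close()

    # The whole database lives in this one ordinary file.
    database_is_single_file = db_path.is_file()

print(unix_tools, database_is_single_file)
```

Copying that one file to another machine is enough to move the database, which is the portability the entry refers to.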