Code

Since I was pretty young, programming has been a consistent source of joy in my life. A nice thing about being a graduate student for the last few years has been having the time to develop and contribute to various open-source projects. Here are some of the ones I found most important or memorable:

embzip

A library for lossy compression of embedding vectors using product quantization.
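
For readers unfamiliar with product quantization, here is a minimal pure-Python sketch of the general idea. It is not embzip's API; the function names (`pq_train`, `pq_encode`, `pq_decode`) and all parameters are hypothetical and chosen for illustration. Each vector is split into `n_sub` equal chunks, a small k-means codebook is learned per chunk, and a vector is stored as one centroid index per chunk. The sketch assumes the vector dimension is divisible by `n_sub`.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float lists."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def pq_train(vectors, n_sub, k):
    """Learn one codebook of k centroids for each of the n_sub chunks."""
    d = len(vectors[0]) // n_sub
    return [kmeans([v[i * d:(i + 1) * d] for v in vectors], k, seed=i)
            for i in range(n_sub)]

def pq_encode(v, codebooks):
    """Compress a vector to one centroid index per chunk."""
    d = len(v) // len(codebooks)
    return [nearest(v[i * d:(i + 1) * d], cb) for i, cb in enumerate(codebooks)]

def pq_decode(codes, codebooks):
    """Reconstruct an approximation by concatenating the chosen centroids."""
    out = []
    for c, cb in zip(codes, codebooks):
        out.extend(cb[c])
    return out

# Illustrative usage on random data (hypothetical sizes)
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = pq_train(vectors, n_sub=4, k=16)
codes = pq_encode(vectors[0], codebooks)   # 4 small integers instead of 8 floats
approx = pq_decode(codes, codebooks)       # lossy reconstruction
```

The compression is lossy because each chunk is replaced by its nearest centroid; larger codebooks or more chunks trade storage for reconstruction accuracy.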

cde

Train models in PyTorch using contrastive learning. Includes the usual tricks, such as hard negative mining, clustering, gradient caching, multi-GPU training, and contextual embeddings.
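
As a rough illustration of the contrastive objective, here is a small pure-Python sketch of an InfoNCE-style loss with in-batch negatives. It is a generic textbook formulation, not code from cde, and it leaves out every trick listed above; the function name `info_nce` and the temperature value are assumptions made for the example.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, docs, temperature=0.05):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    q_norm = [_normalize(q) for q in queries]
    d_norm = [_normalize(d) for d in docs]
    loss = 0.0
    for i, q in enumerate(q_norm):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in d_norm]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)

# Matched pairs should score a lower loss than mismatched pairs
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```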

gptzip

A library for lossless text compression using language models.
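
The underlying idea can be sketched with arithmetic coding, where a predictive model assigns each next symbol a probability and likely symbols cost fewer bits. The example below is a toy under stated assumptions, not gptzip's implementation or API: the `model` function is a simple adaptive character-count model standing in for a real language model, the alphabet is restricted to lowercase letters and space, and exact fractions are used instead of a practical fixed-precision coder. The text length is assumed to be sent alongside the code.

```python
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def model(prefix):
    """Toy stand-in for a language model: Laplace-smoothed character
    frequencies of the prefix, as (symbol, probability) pairs."""
    total = len(prefix) + len(ALPHABET)
    return [(ch, Fraction(prefix.count(ch) + 1, total)) for ch in ALPHABET]

def encode(text):
    """Narrow [low, low + width) by each symbol's probability slice,
    then return an integer code and the number of bits it occupies."""
    low, width = Fraction(0), Fraction(1)
    for i, ch in enumerate(text):
        for sym, p in model(text[:i]):
            if sym == ch:
                width *= p
                break
            low += width * p
        else:
            raise ValueError(f"symbol {ch!r} is outside the toy alphabet")
    n_bits = 0
    while Fraction(1, 2 ** n_bits) > width:  # need 2^-n_bits <= width
        n_bits += 1
    code = -((-low.numerator * 2 ** n_bits) // low.denominator)  # ceil(low * 2^n_bits)
    return code, n_bits

def decode(code, n_bits, length):
    """Replay the same model and pick the slice containing the coded value."""
    value = Fraction(code, 2 ** n_bits)
    low, width = Fraction(0), Fraction(1)
    out = ""
    for _ in range(length):
        for sym, p in model(out):
            if value < low + width * p:
                out += sym
                width *= p
                break
            low += width * p
    return out

text = "hello world"
code, n_bits = encode(text)
restored = decode(code, n_bits, len(text))  # identical to text
```

The better the model predicts the text, the wider the final interval and the fewer bits are needed, which is why a strong language model makes a strong compressor.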

bm25_pt

PyTorch-native implementation of the BM25 algorithm that can run on GPU.
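
For reference, here is a small CPU-only sketch of BM25 scoring in plain Python. It is not bm25_pt's API; the function name `bm25_scores`, the whitespace tokenizer, the default `k1` and `b` values, and the particular IDF variant (the smoothed, always-positive form) are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "a quick brown fox"]
scores = bm25_scores("cat", docs)  # only the first document contains "cat"
```

Every step here is a sum or product over term counts, so the computation can be expressed as tensor operations, which is what makes a GPU implementation natural.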

diffgif

A neat way to visualize iterative edits to a text sequence. Used to visualize the method for my vec2text work.

vec2text

Tools for recovering text from sentence embeddings and language model outputs. Includes lots of pretrained models. The work also won an outstanding paper award at EMNLP.

synthviz

Visualization software for MIDI files that makes it look as if they're being played on a piano keyboard.

language_tool_python

Grammar checker for Python. Really just a wrapper around the Java-based LanguageTool software.

TextAttack

Lots of utilities for attacking text-based language models. Built a framework and supporting modules that, when combined, could reimplement 20 or so NLP papers from 2017-2021.