Generating spectrograms the hard way with numpy.

A spectrogram is a convenient visualization of the frequencies present in an audio clip. Generating one involves obtaining the frequency components of each window of the audio via a Discrete Fourier Transform (DFT) of its waveform. While tools are available to both generate spectrograms and compute DFTs, I thought it would be fun to implement both myself in my language of choice, Python.

In the following, I will discuss computing a DFT (the hard way), processing a WAV file, and rendering a spectrogram (all in Python). If you’re impatient and just want to see the code, you can find it on GitHub.

Continue reading Generating spectrograms the hard way with numpy.

Integrating GitLab and Google Calendar.

Zeall, like many other software startups, uses GitLab for version control and issue management. We also use the ever-popular Google Calendar to handle meetings, reminders, and deadlines. For several months, we’ve been looking for a way to automatically push GitLab issue deadlines into Google Calendar, and until now it seemed impossible. Only after a recent migration from our own private mailserver to G Suite did we find a solution — or rather, figure out how to feasibly build one.

Continue reading Integrating GitLab and Google Calendar.

Adding custom fields to packets in ndnSIM 2.3 without forking the entire repository.

The recommended way to build something on top of ndnSIM is to fork its scenario template repository and work inside there. You still need to download and compile the actual framework, however you will simply install it into /usr/local and link to it instead of actually working inside the main repository.

It turns out that this workflow actually makes certain tasks a lot more difficult. You might think a network simulator would make it easy to add new header fields to packets. Well, think again.

Continue reading Adding custom fields to packets in ndnSIM 2.3 without forking the entire repository.

An idiot’s guide to fulltext search in PostgreSQL.

I love PostgreSQL. It’s probably the most powerful open-source database system out there. Recent features to handle JSON and geospatial data are allowing it to supplant specialized database systems and become closer to a one-DB-fits-all solution. One feature that I’ve recently been able to exploit is its fulltext search engine. It allowed me to easily move from a terrible search implementation (using regular expressions) to one that actually meets users’ expectations.

In this article, I will walk through a basic fulltext search configuration, as well as highlight a few potential improvements that can be made if you’re so inclined.

Many of the features discussed in this post are only available as of PostgreSQL 9.6. Earlier versions have some rudimentary fulltext functionality, but a lot of the more powerful tools we’ll be using are fairly new.

Continue reading An idiot’s guide to fulltext search in PostgreSQL.

Fun with integer division optimizations.

I recently stumbled across a post about some crazy optimization that clang does to divisions by a constant. If you aren’t interested in reading it yourself, the summary is as follows:

  • Arbitrary integer division is slow.
  • Division by powers of 2 is fast.
  • Given a divisor n, the compiler finds some a, b such that a/2b approximates 1/n.
  • This approximation gives exact results for any 32-bit integer.

I was interested in seeing just how much faster this seemingly-magic optimization was than the regular div instruction, so I set up a simple test framework:

Continue reading Fun with integer division optimizations.

The problem with Python’s datetime class.

This might sound like a strong opinion, but I’m just going to put it out there: Python should make tzinfo mandatory on all datetime objects.

To be fair, that’s just an overzealous suggestion prompted by my frustration after spending two full days debugging timestamp misbehaviors. There are plenty of practical reasons to keep timezone-agnostic datetimes around. Some projects will never need timestamp localization, and requiring them to use tzinfo everywhere will only needlessly complicate things. However, if you think you might ever need to deal with timezones in your application, then you must plan to deal with them from the start. My real proposition is that a team should assess its needs and set internal standards regarding the use of timestamps before beginning a project. That’s more reasonable, I think.

Continue reading The problem with Python’s datetime class.

Using bcache to back a SSD with a HDD on Ubuntu.

Recently, another student asked me to set up a PostgreSQL instance that they could use for some data mining. I initially put the instance on a HDD, but the dataset was quite large and the import was incredibly slow. I installed the only SSD I had available (120 GB), and it sped up the import for the first few tables. However, this turned out to not be enough space.

I did not want to move the database permanently back to the HDD, as this would mean slow I/O. I also was not about to go buy another SSD. I had heard of bcache, a Linux kernel module that lets a SSD act as a cache for a larger HDD. This seemed like the most appropriate solution — most of the data would fit in the SSD, but the backing HDD would be necessary for the rest of it. This article explains how to set up a bcache instance in this scenario. This tutorial is written for Ubuntu Desktop 16.04.1 (Xenial), but it likely applies to more recent versions as well as Ubuntu Server.

Continue reading Using bcache to back a SSD with a HDD on Ubuntu.

Parallelizing single-threaded batch jobs using Python’s multiprocessing library.

Suppose you have to run some program with 100 different sets of parameters. You might automate this job using a bash script like this:

ARGS=("-foo 123" "-bar 456" "-baz 789")
for a in "${ARGS[@]}"; do
  my-program $a
done

The problem with this type of construction in bash is that only one process will run at a time. If your program isn’t already parallel, you can speed up execution by running multiple jobs at a time. This isn’t easy in bash, but fortunately Python’s multiprocessing library makes it quite simple.

Continue reading Parallelizing single-threaded batch jobs using Python’s multiprocessing library.

The fruits of some recent Arduino mischief.

I recently consulted on a project involving embedded devices. Like most early-stage embedded endeavors, it currently consists of an Arduino and a bunch of off-the-shelf peripherals. During the project, I developed two small libraries (unrelated to the main focus of the project) which I’m open-sourcing today.

Continue reading The fruits of some recent Arduino mischief.

A simple recommender system in Python.

Inspired by this post I found about clustering analysis over a dataset of Scotch tasting notes, I decided to try my hand at writing a recommender that works with the same dataset. The dataset conveniently rates each whisky on a scale from 0 to 4 in each of 12 flavor categories.

Continue reading A simple recommender system in Python.