Using bcache to back an SSD with an HDD on Ubuntu.

Recently, another student asked me to set up a PostgreSQL instance that they could use for some data mining. I initially put the instance on an HDD, but the dataset was quite large and the import was incredibly slow. I installed the only SSD I had available (120 GB), and it sped up the import for the first few tables. However, 120 GB turned out not to be enough space for the whole dataset.

I did not want to move the database permanently back to the HDD, as this would mean slow I/O, and I was not about to buy another SSD. I had heard of bcache, a Linux kernel module that lets an SSD act as a cache for a larger HDD. This seemed like the most appropriate solution: most of the data would fit on the SSD, and the backing HDD would hold the rest. This tutorial explains how to set up bcache in this scenario. It is written for Ubuntu Desktop 16.04.1 (Xenial), but it likely applies to more recent versions as well as to Ubuntu Server.


Parallelizing single-threaded batch jobs using Python’s multiprocessing library.

Suppose you have to run some program with 100 different sets of parameters. You might automate this job using a bash script like this:

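# Each entry holds the arguments for one run of the program.
ARGS=("-foo 123" "-bar 456" "-baz 789")
for a in "${ARGS[@]}"; do
  # $a is intentionally unquoted so that it splits into separate arguments.
  my-program $a
done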

The problem with this type of construction in bash is that only one process will run at a time. If your program isn’t already parallel and your machine has multiple cores, you can speed up execution by running several jobs at once. Managing a pool of concurrent jobs isn’t easy in plain bash, but fortunately Python’s multiprocessing library makes it quite simple.
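As a rough idea of what this can look like, here is a minimal sketch rather than the code from the full post. It assumes each job is an external command, so it uses the thread-backed pool from multiprocessing (multiprocessing.pool.ThreadPool): each worker only waits on a child process, and the child processes are what actually run in parallel. The PROGRAM command is a hypothetical stand-in for my-program that simply echoes its arguments, and run_job and run_all are names made up for this example.

```python
import shlex
import subprocess
import sys
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for "my-program": a Python one-liner that echoes its arguments.
PROGRAM = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]

# The same argument sets as in the bash loop.
ARGS = ["-foo 123", "-bar 456", "-baz 789"]


def run_job(arg_string):
    """Run one job as a child process and return (exit code, captured stdout)."""
    cmd = PROGRAM + shlex.split(arg_string)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def run_all(arg_strings, workers=3):
    """Run up to `workers` jobs at a time; results come back in input order."""
    with ThreadPool(processes=workers) as pool:
        return pool.map(run_job, arg_strings)


if __name__ == "__main__":
    for code, output in run_all(ARGS):
        print(code, output)
```

If the work were a Python function rather than an external program, multiprocessing.Pool offers the same map interface with worker processes instead of threads.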


The fruits of some recent Arduino mischief.

I recently consulted on a project involving embedded devices. Like most early-stage embedded endeavors, it currently consists of an Arduino and a bunch of off-the-shelf peripherals. During the project, I developed two small libraries (unrelated to the main focus of the project) which I’m open-sourcing today.


A simple recommender system in Python.

Inspired by this post I found about clustering analysis over a dataset of Scotch tasting notes, I decided to try my hand at writing a recommender that works with the same dataset. The dataset conveniently rates each whisky on a scale from 0 to 4 in each of 12 flavor categories.
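To make the data layout concrete, here is a small sketch of one common approach, nearest-neighbor lookup by Euclidean distance. It is not necessarily the method used in the full post. The whisky names and scores below are made up for illustration and do not come from the real dataset; the only thing taken from the text is the shape of the data (12 scores per whisky, each from 0 to 4).

```python
import math

# Made-up flavor profiles: 12 scores per whisky, each between 0 and 4.
PROFILES = {
    "Whisky A": [2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2],
    "Whisky B": [3, 3, 1, 0, 0, 4, 3, 2, 2, 3, 3, 2],
    "Whisky C": [4, 1, 4, 4, 0, 0, 2, 0, 1, 2, 1, 0],
    "Whisky D": [2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 1, 2],
}


def distance(a, b):
    """Euclidean distance between two flavor profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(liked, profiles, k=1):
    """Return the k whiskies whose profiles are closest to the liked one."""
    target = profiles[liked]
    ranked = sorted(
        (distance(target, scores), name)
        for name, scores in profiles.items()
        if name != liked
    )
    return [name for _, name in ranked[:k]]


if __name__ == "__main__":
    print(recommend("Whisky A", PROFILES, k=2))
```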


Optimizing MySQL and Apache for a low-memory VPS.

Diagnosing the problem.

My last post had a plug about the migration of our WordPress instance to a new server. However, it didn’t go completely smoothly. The site went down a few times during the first day after the migration, with WordPress throwing “Error establishing a database connection.” Sure enough, MySQL had gone down. A simple restart of MySQL would bring the site back up, but what caused the crash in the first place?


Information-centric networking for laymen.

The design of the current Internet is based on the concept of connections between “hosts”, or individual computers. For example, when you visit a website, your computer (a host) always connects to a particular server (another host) and retrieves content through a session-oriented pipe. However, the amount of content hosted on the Internet and the number of connected devices are both growing. This is a crisis scenario for the current Internet architecture — it won’t scale.

Several Next-Generation Network (NGN) architectures have been proposed in recent years, aimed at better handling immense amounts of traffic and orders of magnitude more pairwise connections. Information-Centric Networking (ICN) is one NGN paradigm which eschews the concept of connections entirely, removing the host as the basic “unit” of the network and replacing it with content objects.

In other words, the defining feature of an ICN is that instead of asking the network to connect you to a particular server (where you hope to find the content you want), you ask the network for the content itself.
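The difference can be illustrated with a toy model. This is only a sketch and does not follow any particular ICN proposal: the Node class, its methods, and the content name are all hypothetical. The point is that the requester names the content, not a host, so any node that holds a copy can answer.

```python
class Node:
    """A toy network node that stores content objects by name."""

    def __init__(self, label, upstream=None):
        self.label = label
        self.upstream = upstream  # next node to ask if we lack the content
        self.store = {}

    def publish(self, content_name, data):
        """Make a content object available at this node."""
        self.store[content_name] = data

    def request(self, content_name):
        """Return (data, label of the node that served it)."""
        if content_name in self.store:
            return self.store[content_name], self.label
        if self.upstream is None:
            raise KeyError(content_name)
        data, served_by = self.upstream.request(content_name)
        self.store[content_name] = data  # keep a copy for later requests
        return data, served_by


origin = Node("origin")
router = Node("router", upstream=origin)
origin.publish("/videos/cat.mp4", b"meow")

# The first request travels to the origin; the second is served from the router's copy.
first = router.request("/videos/cat.mp4")
second = router.request("/videos/cat.mp4")
```

Neither request mentions the origin by name, which is the property the paragraph above describes.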


Why are tuples greater than lists?

I pose this question in quite a literal sense. Why does Python 2.7 have this behavior?

>>> (1,) > [2]
True

No matter what the tuple, and no matter what the list, the tuple will always be considered greater. On the other hand, Python 3 gives us an error, which actually makes a bit more sense:

>>> (1,) > [2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: tuple() > list()

The following post is a journey into some CPython internals, with a goal of finding out why 2.7 gives us such a weird comparison result.
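As a preview of where that journey ends up, the Python 2 result can be imitated in Python 3. The sketch below is a deliberate simplification, not CPython's actual code: it assumes the fallback for mismatched, non-numeric types is to order them by type name, and it ignores the special cases for numbers and None. The function name is made up for this example.

```python
def py2_style_greater(a, b):
    """Roughly imitate Python 2's `a > b` for non-numeric operands."""
    if type(a) is type(b):
        # Same type: the normal comparison applies.
        return a > b
    # Different types: fall back to comparing the type names as strings.
    return type(a).__name__ > type(b).__name__


if __name__ == "__main__":
    print(py2_style_greater((1,), [2]))
```

Under that assumption every tuple beats every list simply because the string "tuple" sorts after the string "list".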


Quick postfix & dovecot config with virtual hosts (Ubuntu 16.04)

This morning, I received an email from my VPS host notifying me that they would no longer accept PayPal. Instead, my only payment option would be Bitcoin. Not willing to go through this trouble, I decided to migrate from this host (which I had been using for my personal servers for about five years) to DigitalOcean (which fortunately accepts normal forms of payment).

Part of my server migration was to move email for two of my domains: le1.ca and lo.calho.st. Setting up a new mailserver is a notoriously arduous task, so I’m documenting the process in this post — mostly for my future reference, but also to benefit anyone who might stumble upon my blog in their own confusion.

Since I’m serving mail for two domains, I will be using a simple “virtual hosts” configuration. I’ll talk about the process in four parts: local setup, postfix, dovecot, and DNS configuration.


An easy way to visualize git activity

Today, I wrote gitply — a fairly simple Python script for visualizing the weekly activity of each contributor to a git repository.

It started out as a run-once script to get some statistics for one of my projects, but I ended up improving it incrementally until it turned into something friendly enough for other people to use.
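The core idea of counting commits per contributor per week can be sketched in a few lines. This is not gitply's actual code. It assumes the input looks like the output of git log --format='%an|%aI' (author name, then strict ISO 8601 author date), and the sample log below is made up.

```python
from collections import Counter
from datetime import datetime


def weekly_activity(log_text):
    """Count commits per (author, ISO year, ISO week) from 'name|date' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Split on the last "|" so author names containing "|" still work.
        author, stamp = line.rsplit("|", 1)
        year, week, _weekday = datetime.fromisoformat(stamp).isocalendar()
        counts[(author, year, week)] += 1
    return counts


# Made-up sample in the assumed "author|ISO date" format.
SAMPLE = """\
alice|2016-10-03T09:15:00+00:00
alice|2016-10-05T17:40:00+00:00
bob|2016-10-06T11:05:00+02:00
alice|2016-10-11T08:00:00+00:00
"""

if __name__ == "__main__":
    for (author, year, week), n in sorted(weekly_activity(SAMPLE).items()):
        print(author, year, week, n)
```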


Adventures in image glitching

Databending is a type of glitch art wherein image files are intentionally corrupted in order to produce an aesthetic effect. Traditionally, these effects are produced by manually manipulating the compressed data in an image file. As a result, this is a trial-and-error process; often, edits will result in the file being completely corrupted and unopenable.

Someone recently asked me whether I knew why databending different types of image files produces different effects, and particularly why PNG glitches are the most interesting. I didn’t know the answer, but the question inspired me to do a little research (mostly reading the Wikipedia articles about the compression techniques used in different image formats). I discovered that most compression techniques are not all that different. Most of them just employ some kind of run-length encoding or dictionary encoding, followed by a prefix-free coding step. The subtle differences between the compression algorithms could not explain the wildly different effects we observed (except perhaps in JPEGs, since their compression is done in the frequency domain). However, PNG uses a pre-filtering step that makes it stand out.
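To see why a pre-filtering step matters for glitching, consider PNG's Sub filter, which stores each byte as the difference from the byte to its left. The sketch below works on a single made-up scanline with one byte per pixel and skips the compression stage entirely, so it shows the filter's behavior in isolation and is not a description of how a real glitched file decodes. The function names are made up for this example.

```python
def sub_filter(row, bpp=1):
    """Apply the PNG Sub filter: each byte minus the byte bpp positions left, mod 256."""
    out = bytearray(len(row))
    for i, value in enumerate(row):
        left = row[i - bpp] if i >= bpp else 0
        out[i] = (value - left) % 256
    return bytes(out)


def sub_unfilter(filtered, bpp=1):
    """Reverse the Sub filter by adding back the already-reconstructed left byte."""
    out = bytearray(len(filtered))
    for i, value in enumerate(filtered):
        left = out[i - bpp] if i >= bpp else 0
        out[i] = (value + left) % 256
    return bytes(out)


# A made-up scanline: a smooth gradient, one byte per pixel.
row = bytes([10, 12, 14, 16, 18, 20])
filtered = sub_filter(row)

# Corrupt a single filtered byte, then decode.
glitched = bytearray(filtered)
glitched[2] ^= 0x40
decoded = sub_unfilter(bytes(glitched))

if __name__ == "__main__":
    print(list(row))
    print(list(decoded))
```

Because each reconstructed byte depends on the one before it, the single corrupted byte shifts every later value in the scanline instead of damaging just one pixel.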
