Imagine I have a Postgres database with ~3000 columns storing boolean “flags” for ~1 billion rows. The “on” flag (1) is quite sparse, say occurring less than 0.1% of the time in each column. Now, I want to execute a query of the form “select all rows with flags 1, 2, 5, and 101”.

1. How much space would it take to store such a database? That is, will Postgres by default do any sort of compression because of how sparse the “on” flag is?
2. Does Postgres by default make queries of this form “fast”? If not, are there well-known Postgres extensions for this type of query? If not, I might write my own, but I just want to know whether this has already been done.

This blog post contains my thoughts on this. First, we parameterize some things:

Definition.

- \(n\): number of rows
- \(L\): number of columns (categories)
- \(\delta\): density in each column
- \(q\): query size
- \(w\): word size (we can do \(O(1)\)-time operations on words)
- \(p\): number of processors available (we can do \(p\) units of work per unit of time)

Storage

Way 1: Have \(n\) rows, with \(L\) bits, and thus \(L/w\) words, per row. Space: \(\frac{nL}{w}\) words.
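As a rough sanity check, here is a back-of-the-envelope calculation with the question’s parameters (assuming \(n = 10^9\), \(L = 3000\), and 64-bit words; these numbers come from the question, not from measurement):

```python
# Back-of-the-envelope for "way 1": one dense bitmap per row.
# Parameters assumed from the question: n = 1e9 rows, L = 3000 flag columns, w = 64.
n, L, w = 10**9, 3000, 64

words_per_row = (L + w - 1) // w              # ceil(L / w) = 47 words per row
total_bytes = n * words_per_row * (w // 8)    # ~3.8e11 bytes
print(f"{words_per_row} words/row, ~{total_bytes / 1e9:.0f} GB of raw bitmaps")
```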

Way 2: For each of the \(L\) categories, store the row indices of the \(\sim \delta n\) rows in this category; each index takes \(\sim \log n\) bits. Overall this requires \[\sim \delta n L \,\frac{\log n}{w}\] words of space.
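A minimal sketch of this layout with toy parameters (the dictionary-of-lists representation is purely illustrative, not how Postgres or any extension actually stores this):

```python
import random

# Toy "way 2" layout: for each category, keep a sorted list of the row indices
# whose flag is on.  With density delta, each list has ~delta * n entries,
# and each entry costs ~log2(n) bits.
# Tiny parameters chosen for illustration only.
n, L, delta = 100_000, 20, 0.001

index = {
    cat: sorted(random.sample(range(n), int(delta * n)))
    for cat in range(L)
}
print(len(index[0]))   # ~delta * n = 100 row ids in category 0
```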

Comparison: as long as the density (or should I say sparsity?) satisfies \(\delta < \frac{1}{\log n}\), way 2 is the clear winner here. For instance, if \(n \sim 10^{9}\), then \(\log n \approx 30\), so way 2 wins whenever \(\delta < \frac{1}{30}\). The question’s density of under 0.1% clears this bar easily.

Query

With “way 1” the query can be handled by partitioning the rows into \(p\) groups; each processor scans its \(n/p\) rows and tests the queried flags by ANDing each row’s words against the query’s bit mask.

This requires time: \[\frac{n}{p} \cdot \frac{L}{w}.\]
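Here is a rough single-machine sketch of that scan, with row bitmaps represented as Python ints; the conjunctive reading of the query (“all queried flags are on”) is my assumption about what the question means:

```python
# "Way 1" query sketch: scan dense per-row bitmaps and keep the rows where
# every queried flag is set.  Each bitmap stands in for the row's L/w words.
def query_dense(bitmaps, flags):
    mask = 0
    for f in flags:
        mask |= 1 << f                      # build the q-bit query mask once
    return [i for i, bits in enumerate(bitmaps) if bits & mask == mask]

# Example with the question's flags 1, 2, 5, and 101:
rows = [
    (1 << 1) | (1 << 2) | (1 << 5) | (1 << 101),   # matches
    (1 << 1) | (1 << 5),                            # missing flags 2 and 101
]
print(query_dense(rows, [1, 2, 5, 101]))            # -> [0]
```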

With “way 2” the query can again be handled by partitioning the rows into \(p\) groups, but this time we should be careful to partition the rows “randomly” somehow. It probably doesn’t need to be a completely random partition, because that sounds expensive; it should suffice to, e.g., assign each row independently to a processor chosen uniformly at random (probability \(\frac{1}{p}\) each). Note that this must be done before the query: you modify the storage structure so that each processor has a list of \(\sim n\delta/p\) elements per category. Now, each processor can answer its part of the query in time \[\approx \frac{q n \delta}{p} \cdot \frac{\log n}{w}.\]
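A minimal sketch of what one processor does on its shard, assuming a conjunctive query and a naive set-based intersection of the sorted row-id lists (the names here are illustrative, not from any existing library):

```python
# "Way 2" query sketch on one processor's shard: intersect the q posting lists
# (sorted row-id lists) of the queried flags, each of expected length ~n*delta/p.
def query_sparse(shard, flags):
    lists = sorted((shard[f] for f in flags), key=len)   # shortest list first
    result = set(lists[0])
    for ids in lists[1:]:
        result &= set(ids)                                # keep rows in every list
    return sorted(result)

# Example shard: flag -> sorted row ids held by this processor.
shard = {1: [3, 8, 20], 2: [3, 20], 5: [3, 5, 20], 101: [3, 20, 99]}
print(query_sparse(shard, [1, 2, 5, 101]))                # -> [3, 20]
```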

So, dividing the two running times, way 2 is better as long as \[\frac{q}{L / \log n} < \frac{1}{\delta},\] i.e., \(q\,\delta \log n < L\). Which kind of makes sense: way 2 wins unless the query touches a large fraction of the categories or the columns aren’t actually sparse.
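Plugging in my reading of the question’s rough numbers (\(q = 4\), \(L = 3000\), \(n = 10^9\), \(\delta = 10^{-3}\); all assumptions, not measurements):

```python
import math

# Check the "way 2 is better" condition q / (L / log n) < 1 / delta with the
# question's rough numbers (assumed, not measured).
q, L, n, delta = 4, 3000, 10**9, 1e-3

lhs = q / (L / math.log2(n))   # ~0.04
rhs = 1 / delta                # 1000.0
print(lhs, rhs, lhs < rhs)     # way 2 wins by a wide margin
```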