Let S be a search data structure (such as a red-black tree) that performs insert, delete and search in \(O(\log n)\) time, where \(n\) is the number of elements stored. An empty data structure S can be created in \(O(1)\) time. We would like to construct a static data structure with \(n\) elements that is statically optimal in total access time, given the number of times an element is accessed in an access sequence. Recall that if item \(i\) is accessed \(p_i m\) times in a sequence of \(m\) operations then the static- optimal access time is \(O(m \sum p_i \log 1/p_i )\).

The data structure is constructed as follows. Search data structure \(S_k\) holds the \(2^{2^k}\) most frequently occurring items in the access sequence (note this means items may be in multiple trees). A search on \(v\) is done on \(S_0 , S_1,\ldots,\) until an \(S_i\) holding \(v\) is encountered. Notice that all elements in \(S_i\) are held in \(S_{i+1}\). (a) Show that the above data structure is asymptotically comparable to the optimal static tree in terms of the total time to process the access sequence. (b) Make the data structure capable of insert operations. Assume that the number of searches to be done on v is provided when v is inserted. The cost of insert should be \(O(\log n)\) amortized time, and total cost of searches should still be optimal (non-amortized). (c) Improve your solution to work even if the frequency of access is not given during the insert. Your data structure now matches the static optimality theorem of splay trees (but of course, had to be built special case). (d) Make your data structure satisfy the working set theorem on splay trees. Ignore the static optimality condition.

data structures to get static optimality

(a)

Consider the items indexed in order of decreasing frequency, i.e. such that \(p_1\geq p_2\geq \cdots \geq p_n.\) We see that \(p_i \leq 1/i\), so \(\log i \leq \log 1/p_i\). The run time of a search for element \(i\) is \[\sum_{k\leq \left\lceil \lg\lg i \right\rceil} \log 2^{2^k} \leq 2\cdot 2^{\left\lceil \lg\lg i \right\rceil} \leq 4\cdot \log i \leq 4\log 1/p_i.\] We search for element \(i\) \(mp_i\) times. Overall, the time for performing all the necessary searches is \[\sum_{i=1}^n mp_i \cdot 4\log 1/p_i,\] which is asymptotically optimal.

(b)

First, we add to our data-structure trees \(F_i\), which contains the elements of \(S_i\) keyed by the number of searches which must be performed on them. \(F_i\) will support insert, delete, but also \((i)\), which returns the \(i\)-th most frequent element in \(F_i\). This “in-order element \(i\)” query is supported in time \(O(\log |F_i|)\) by many types of trees, it just requires augmenting the nodes in a balanced tree with the subtree sizes. Note that such trees also can support a getminfreqelt opperation to return the element of minimum frequency out of all elements in \(F_i\) by simply running getfreqidx\((|F_i|)\). \(F_i\) also supports a opperation which returns the frequency of the minimum frequency element in \(F_i\).

When we insert an element \(v\) with frequency \(freq(v)\), we first check to find the lowest \(i\) such that \(freq(v) \geq \text{\textbf{getminfreq}}(F_i)\). Next, we itterate over \(j\geq i\) and repeatedly insert \(v\) into \(S_j, F_j\) and then find \(w\gets \text{\textbf{getminfreqelt}}(F_j)\) and then deletes \(w\) from both \(S_j\) and \(F_j\). We keep increasing \(j\) until the last group. If the last group is full, we will then need to create a new group. Otherwise, on this last step we skip the deletion of \(w\) step, because there is extra space in the final \(S_j\). This chain of insertions and deletions has total cost of
\[\sum_{j\leq \left\lfloor \lg\lg n \right\rfloor} O(2^j) \leq O(\log n).\]

Very rarely, we have to create a new \(S_i, F_i\). This is expensive, it takes time \(n\log n\). However, it also happens very rarely. If a total of \(n\) elements are inserted into the data-structure then the total cost of building all the \(S_i\) and \(F_i\) is \[O(n\lg n) + \sum_{j < \left\lfloor \lg\lg n \right\rfloor} 2^{2^j} 2^j \leq O(n\lg n).\] So, while individual inserts may be expensive, e.g. \(\Omega(n\log n)\), in an ammortized sense, inserts are still only \(O(\log n)\) time.

On the other hand, we are always maintaining the data-structure with \(S_i\) containing the \(2^{2^i}\) most common elements. We have shown in (a) that such a data-structure has an asymptotically optimal cost for searches.

(c)

Insert works much as before, we simply insert an element with only \(1\) as its frequency; in fact insert is easier: we don’t have to do the chain of insertions and deletions, it suffices to insert at the largest \(S_i\).

But now, each time a opperation is performed on an element \(x\), we must increase the observed frequency of \(x\), which may necessitate moving \(x\) to a smaller \(S_i\), and then also removing other elements from the smaller \(S_i\)’s to make space for \(x\). This is accomplished in essentially the same way as chains of insertion and deletion were handled for inserts: we identify the new \(S_i\) where our element \(x\) must end up, and then we have a chain of insertions and deletions as we bring \(x\) to smaller \(S_i\)’s and delete these \(S_i\)’s least frequent elements to make room for \(x\). By the same analysis as the previous problem, each such opperation has work \(O(\log n)\). However, we aim to show something stronger: namely that the total cost of the find opperations is within a constant factor of that of the optimal static data-structure for this problem! We can tighten our analysis as follows: Consider a time in the opperation when \(M\) accesses have been and we must perform a search on an element \(x\) which has been searched for \(f\) times in the past. Consider the rank of \(x\): there are at most \(M/f\) elements which have been accessed more than \(x\). Thus, we can actually achieve a tighter bound on the cost of the updates that we perform when searching for \(x\): namely, \[\sum_{j \leq \left\lceil \lg\lg M/f \right\rceil}O(2^j) \leq O(\lg M/f).\]

Now we consider the total cost of all the accesses on elements \(1,\ldots, n\): say that by the end we have accessed elements \(x_1,\ldots, x_n\) times, with \(m = \sum x_i\). Let \(m_{i,j}\) be the number of accesses that have been performed before the \(j\)-th access to the \(i\)-th element. The cost of this access will then be \(O(\log m_{i,j}/j).\) Now, let \[T = \sum_{i=1}^n \sum_{j=1}^{x_i} \lg m_{i,j}/j.\] The total cost will be \(O(T)\). We proceed to bound \(T\).

We notice that \(m_{i,j}\) will achieve each value on \(\left\{ 1,\ldots, m\right\}\) exactly once. Re-arranging our expression for \(T\), we get: \[T = \sum_{i=1}^m \lg i - \sum_{i=1}^n \sum_{j=1}^{x_i} \lg j.\] Using Stirling’s approximation, which says that \[N\lg N/2 \leq \sum_{i=1}^N\lg i = \lg N! \leq N\log N,\] we upper bound \(T\) as \[T \leq m\lg m - \sum_{i=1}^n x_i (\lg x_i -1).\] Re-arranging some more we have: \[T \leq \sum_{i=1}^n x_i \cdot (\lg m /x_i +1).\] Now, we substitute \(p_i = x_i/m\), and accept a factor of \(2\) to pay for the \(+1\), to get \[T \leq \sum_{i=1}^n mp_i \cdot 2 \lg 1/p_i.\] As \(T\) is asymptotically optimal and the total cost is \(O(T)\), we have that this data-structure, despite not knowing the search frequencies in advance, still is within \(O(1)\) multiple of the optimal access time based on element frequencies.

(d)

We build a very similar data-structure. But now, instead of being arranged based on access frequencies, the elements are arranged based on how recently they have been accessed. That is, sets \(S_i,F_i\) will now contain the \(2^{2^i}\) most recently accessed elements. Instead of being keyed by access frequency, \(F_i\) is keyed by how recently the element was accessed. Now, when an element is inserted we place it in all \(S_i\)’s and “make room” for this new element by removing the least recent element from each \(S_i\) (except the last). Again, every once and a while inserting will necessitate creating a new \(S_i\). But again, this happens very infrequently and will not affect the ammortized costs of opperations. Very similarly, whenever a is performed we move the searched element to the front, and perform a chain of inserts and deletes. Precisely, if the element \(x\) searched for started in \(S_i\), then for all \(S_j\) with \(j < i\) we must insert \(x\) into \(S_j\) and delete the least recently accessed element in \(S_j\) to make room for \(x\).

With identical analysis to the previous part, we see that the ammortized costs of the insert opperations will be \(O(\log n).\)

Now, we consider the cost of executing a sequence of searches and inserts for elements \(x_1,\ldots, x_n\), where \(t_i\) reffers to the numer of elements since the last time that \(x_i\) was accessed. We see that the cost of searching for an element that was searched among the last \(t_i\) searches is at most \[\sum_{j\leq \left\lceil \log\log t_i \right\rceil} 2^j \leq O(\log t_i).\] Thus, our sequences of accesses and inserts will have cost bounded by \(O\left( n\log n + \sum \log t_i \right)\), as desired.