Reviews (DaMoN 2023)
Reviewer #1
Is the topic of the paper relevant to and important to the DaMoN community?
Yes
Is the paper readable and well organized?
No - there are substantial presentation flaws
Are the research contributions substantial and novel enough to warrant acceptance?
Mostly - the contributions are above the bar
Are the paper's methodology, assumptions, models, and arguments free of serious flaws?
Mostly so - there are some minor issues, but none of them are fatal
Overall rating. Papers with Reject or Strong Reject ratings should have at least one negative score on Q1-Q4.
Accept
Short justification of your Overall Rating.
This paper proposes an annotation-based partitioning algorithm that strikes a balance between hash-based partitioning and software-managed buffers. While the idea is sound, serious presentation issues make the material inaccessible.
Detailed comments
I think the paper proposes a sound idea and the results seem positive. However, many details remained unclear to me, which kept me from understanding what the technique is and how it addresses the main trade-offs. I hope my comments below will be useful in addressing those issues and making the final version of the paper more accessible.
D1. Section 2.2, parag. 1: "We argue that additional copying ...": the "additional copying from buffer to memory" is not clear here. The previous sentence says "the overhead of the extra memory stores becomes prohibitive". Are you referring to these memory stores? If so, are they "prohibitive" or "expendable"? These two sentences would then contradict each other. If not, what does it refer to?
D2. Same paragraph: "As an alternative" is not clear. An alternative to what? If something is "expendable", then you don't need to propose an alternative; you can just use it. To justify an alternative, you need to point to a problem. But the previous sentence (the one before the "As an alternative" sentence) refers to something expendable, which I assume is not a problem...
D3. Same paragraph: "we introduce micro-partitioning to avoid TLB thrashing". In the last paragraph of Section 2.1, you describe software-managed buffers as being TLB-friendly. Now you propose micro-partitioning to avoid TLB thrashing, which software-managed buffers already resolve. Why use micro-partitioning, then?
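For concreteness, this is roughly the software-managed-buffer scheme I refer to in D1-D3 (my own sketch with invented names and sizes, not the paper's code): scattered partition writes are staged in small cache-resident buffers, and only full buffers are flushed to the output, so few output pages are hot at any time and TLB pressure stays low.

    #include <cstdint>
    #include <vector>

    // My sketch of software-managed buffers (invented names/sizes, not
    // the paper's code). out must be pre-sized to kFanout entries.
    constexpr int kFanout    = 1 << 8;  // e.g. 8 radix bits -> 256 partitions
    constexpr int kBufTuples = 8;       // 8 x 16-byte tuples per flush

    struct Tuple { uint64_t key, payload; };

    void partition(const std::vector<Tuple>& in,
                   std::vector<std::vector<Tuple>>& out) {
        struct Buf { Tuple staged[kBufTuples]; int n = 0; } buf[kFanout];
        for (const Tuple& t : in) {
            int p = int(t.key & (kFanout - 1));   // radix = low 8 key bits
            Buf& b = buf[p];
            b.staged[b.n++] = t;                  // store into cached buffer
            if (b.n == kBufTuples) {              // flush one full buffer
                out[p].insert(out[p].end(), b.staged, b.staged + kBufTuples);
                b.n = 0;
            }
        }
        for (int p = 0; p < kFanout; ++p)         // drain partial buffers
            out[p].insert(out[p].end(), buf[p].staged,
                          buf[p].staged + buf[p].n);
    }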
D4. Section 2.2, parag. 3: "We allocate the next from the tuple ...": how can allocating be as simple as an increment operation? It can only be that simple if the space is already allocated. But then it is not really just incrementing a variable: you are actually allocating more memory than necessary up front, which brings a trade-off.
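If "allocating" really is just an increment, I imagine something like the following bump allocator (hypothetical sketch; the names and the 2KB size are my assumptions), which makes the over-allocation trade-off explicit:

    #include <atomic>
    #include <cstdlib>

    // Hypothetical bump allocator (my reading of "as simple as an
    // increment"): a large region is reserved once, so handing out the
    // next 2KB micro-partition is a single fetch_add. The catch, as
    // noted above, is that the region must be sized for the worst case,
    // i.e. memory is over-allocated up front.
    struct BumpAllocator {
        char*               base;
        std::atomic<size_t> offset{0};
        size_t              capacity;

        explicit BumpAllocator(size_t cap)
            : base(static_cast<char*>(std::malloc(cap))), capacity(cap) {}

        // Returns a micro-partition, or nullptr once the reserve is spent.
        void* next_micro_partition(size_t bytes = 2048) {
            size_t off = offset.fetch_add(bytes);  // the entire "allocation"
            return off + bytes <= capacity ? base + off : nullptr;
        }
    };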
D5. Section 2.3, parag. 3: I couldn't understand Figure 4, which is a fundamental figure. I understand that radix-partitioning is based on radix bits. But how come your algorithm also depends on radix bits? What is the relationship between your algorithm and the number of radix bits? As far as I understand, you simply partition the data into small chunks (each being 128 or 256 bytes), but it is not clear how this is connected to radix bits. The same applies to other similar figures, such as Figures 7 and 10.
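For reference, in the classic scheme the radix bits directly fix the fan-out, as in this sketch (my illustration of standard radix partitioning, not the paper's algorithm); how the fixed 128/256-byte chunks relate to this parameter is exactly what I could not extract from Figure 4:

    #include <cstdint>

    // Classic radix partitioning: b radix bits of the key (or its hash)
    // select one of 2^b partitions, so the bits determine the fan-out.
    inline uint32_t radix_of(uint64_t key, int radix_bits) {
        return static_cast<uint32_t>(key & ((1ull << radix_bits) - 1));
    }
    // fan-out = 1 << radix_bits; what a fixed 128/256-byte chunk size
    // contributes on top of this parameter is the unclear part.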
D6. Section 3.2, parag. 1: I am not sure why "the resultant code grows more complicated" if the fragment directory were implemented naively. In the same sentence, I do not know what "tailor-made" means.
D7. The terminology is really confusing: squad queue, worker queue, fragments, data object, morsel... They all refer to something, but it is not clear what. For example, in parag. 4, the first sentence says "all tasks required to access a data object in bulk". What is a data object? Are you talking about all the fragments in the same page? "Data object" is very ambiguous terminology; it can be anything. A tuple is also a data object. Similarly for morsel: do you mean a micro-fragment is a morsel? If not, what is a morsel? I know it is terminology from a previous paper, but what does it mean here? I really couldn't follow this section. My best guess (see the sketch below) is that you schedule the tasks such that each page is scanned contiguously, in order to mitigate the random accesses that would otherwise come from accessing multiple fragments spread around memory. This brings its own trade-off in terms of latency vs. throughput: tasks whose execution is deferred suffer increased latency, though the overall throughput will be higher.
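To make my guess checkable, here is the scheduling I believe Section 3.2 describes (entirely my reconstruction with invented names; possibly wrong):

    #include <cstdint>
    #include <map>
    #include <vector>

    // My reconstruction of Section 3.2 (invented names, possibly wrong):
    // group pending fragment tasks by the page their fragment lives on,
    // then execute all tasks of a page back to back, so each page is
    // scanned contiguously instead of being revisited for scattered
    // fragments. Deferred tasks pay latency; the batch gains throughput.
    struct Task { std::uintptr_t fragment_addr; };
    constexpr std::uintptr_t kPageMask = ~std::uintptr_t(4096 - 1);

    void run_page_contiguous(const std::vector<Task>& pending,
                             void (*execute)(const Task&)) {
        std::map<std::uintptr_t, std::vector<Task>> by_page;
        for (const Task& t : pending)
            by_page[t.fragment_addr & kPageMask].push_back(t);
        for (auto& [page, tasks] : by_page)   // one contiguous pass per page
            for (const Task& t : tasks)
                execute(t);
    }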
If this is a full paper and it is rejected, would you support its acceptance as a short (2 page) paper?
Yes
Reviewer #2
Is the topic of the paper relevant to and important to the DaMoN community?
Yes
Is the paper readable and well organized?
Definitely - very clear
Are the research contributions substantial and novel enough to warrant acceptance?
Definitely - a significant advancement
Are the paper's methodology, assumptions, models, and arguments free of serious flaws?
Mostly so - there are some minor issues, but none of them are fatal
Overall rating. Papers with Reject or Strong Reject ratings should have at least one negative score on Q1-Q4.
Accept
Short justification of your Overall Rating.
The paper presents a nice approach to in-memory partitioning combined with dynamic load balancing (task management) and offers both competitive performance and very nice insights in the experimental results.
Detailed comments
Listing strong points (S1, S2) and opportunities (O1, O2, O3) below.
S1. Combining partitioning with task assignment is a great idea. Micro-partitions are good enough to avoid TLB thrashing, and I like how the book-keeping of list traversal is essentially delegated to the book-keeping for the list of tasks. Overall, a good way to abstract the algorithm while retaining the hardware-friendly performance.
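The delegation I have in mind can be pictured like this (my sketch with assumed field names and an assumed 2KB fragment size): each micro-partition doubles as a node of its partition's fragment list and as a schedulable unit of work, so there is no separate traversal state to track.

    #include <cstdint>

    // My picture of the delegation praised in S1 (assumed field names,
    // assumed 2KB fragment size): a micro-partition is both a node in
    // its partition's fragment list and a unit of work, so traversing
    // the list *is* consuming the task list.
    struct MicroPartition {
        MicroPartition* next;        // fragment-list link == task-list link
        std::uint32_t   count;       // tuples currently in this fragment
        unsigned char   data[2048];  // assumed 2KB payload
    };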
S2. The experimental evaluation of memory access patterns is very good. The idea of interleaving tasks across micro-partitions is great since it allows for a linear load and store of the input (as shown in Figures 8 and 9).
O1. The experimental setup of 8-byte keys and 8-byte pointers is of limited use in practice unless you show the cost of materialization as well. While the authors do compare against earlier work that uses the same setup, the task assignment makes this paper a tad more realistic than the radix join program. I think the authors could generalize this to executing actual joins with payload materialization and reconcile their results with recent work in the area.
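For readers outside the area, the layout in question is the narrow benchmark layout of the earlier radix-join work, roughly (my sketch):

    #include <cstdint>

    // The narrow benchmark layout O1 refers to (my sketch): an 8-byte
    // key plus an 8-byte row pointer. Joining such 16-byte tuples
    // measures the partitioning cost but hides the cost of
    // materializing the real payload rows, which is the generalization
    // suggested above.
    struct BenchTuple {
        std::uint64_t key;
        std::uint64_t row_ptr;  // pointer/RID standing in for the payload
    };
    static_assert(sizeof(BenchTuple) == 16, "two 8-byte fields");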
O2. Some of the results are a bit difficult to interpret given that the layouts here are quite dynamic. For example, Figure 4 does not seem to be an entirely fair comparison given that the baseline uses a histogram that produces a sequential output per partition while the micro-partitions do not.
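For comparison, the baseline behavior I mean is the standard two-pass histogram scheme, roughly as below (my sketch, not the paper's baseline code); each partition ends up as one contiguous output range, a guarantee a chain of micro-partitions does not give.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Standard two-pass histogram partitioning (my sketch of the
    // baseline behavior): count per partition, prefix-sum the counts
    // into offsets, then scatter. Each partition becomes one contiguous
    // range of the output.
    void histogram_partition(const std::vector<std::uint64_t>& keys,
                             int radix_bits,
                             std::vector<std::uint64_t>& out) {
        const std::size_t fanout = std::size_t(1) << radix_bits;
        std::vector<std::size_t> offset(fanout + 1, 0);
        for (auto k : keys) ++offset[(k & (fanout - 1)) + 1];       // count
        for (std::size_t p = 1; p <= fanout; ++p)                   // prefix sum
            offset[p] += offset[p - 1];
        out.resize(keys.size());
        for (auto k : keys) out[offset[k & (fanout - 1)]++] = k;    // scatter
    }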
O3. There might be some implicit hardware assumptions that need to be discussed and evaluated further. The original partitioning idea was limited both by the TLB and the L2 cache size, yet the authors do not mention the L2 cache size in Section 2.2. Without an L2 four times larger than that of the older work (1MB vs. 256KB), the authors might not have been able to fit a micro-partition larger than a single cache line per partition. The authors should clarify whether the larger L2 is a requirement for going from a single cache line per partition to a 2KB micro-partition, and for this approach to work at all. A good way to show this is to run their code on a CPU with 256KB of L2 per core, which matches the hardware of the baseline work.
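The arithmetic behind this concern, under my assumption that one 2KB micro-partition per output partition must stay cache-resident during the scatter pass (nothing here is measured):

    #include <cstdio>
    #include <initializer_list>

    // Back-of-the-envelope for the L2 capacity question (my assumption:
    // one resident 2KB micro-partition per output partition).
    int main() {
        const int micro_kb = 2;  // assumed 2KB micro-partition
        for (int bits : {6, 7, 8, 9}) {
            int fanout       = 1 << bits;
            int footprint_kb = fanout * micro_kb;
            std::printf("%d bits -> %4d partitions -> %5d KB "
                        "(fits 256KB L2: %s, fits 1MB L2: %s)\n",
                        bits, fanout, footprint_kb,
                        footprint_kb <= 256 ? "yes" : "no",
                        footprint_kb <= 1024 ? "yes" : "no");
        }
    }

Under this assumption, fan-outs above 128 partitions overflow a 256KB L2 but still fit a 1MB one, which is why the hardware match matters.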
If this is a full paper and it is rejected, would you support its acceptance as a short (2 page) paper?
Yes
Reviewer #3
Is the topic of the paper relevant to and important to the DaMoN community?
Yes
Is the paper readable and well organized?
Mostly - the presentation has minor issues, but is acceptable
Are the research contributions substantial and novel enough to warrant acceptance?
Mostly - the contributions are above the bar
Are the paper's methodology, assumptions, models, and arguments free of serious flaws?
Mostly so - there are some minor issues, but none of them are fatal
Overall rating. Papers with Reject or Strong Reject ratings should have at least one negative score on Q1-Q4.
Accept
Short justification of your Overall Rating.
The paper proposes a new memory partitioning algorithm called "micro partition," which is designed to optimize memory access patterns by considering the specific characteristics of the system hardware and workload. The micro partition algorithm ensures a data access pattern that is both TLB-friendly and cache-friendly, reducing performance bottlenecks caused by TLB misses and cache thrashing in multi-threaded applications.
Detailed comments
Strong Points:
S1. The paper provides a clear rationale for incorporating hardware awareness into memory partitioning algorithms to create a more generalizable solution. By considering the specific hardware characteristics of the system, the proposed algorithm can optimize memory access patterns and improve performance in a wide range of multi-threaded applications.
S2. The proposed micro partition algorithm effectively addresses the trade-off between TLB misses and cache thrashing in existing memory partitioning algorithms by taking into account the memory density of the system, resulting in significant performance improvements compared to existing methods.
S3. The experimental results support the claim that the micro partition algorithm outperforms state-of-the-art memory partitioning methods.
Weak Points:
W1. The paper mentions that tuples are partitioned into fixed-size micro-partitions, but it does not say whether these sizes are required to be divisible by the tuple size. If the fixed size is not divisible by the tuple size, the remaining space must either be filled by splitting a tuple or left empty. If a tuple is split, it may cause extra performance degradation during tuple lookups, because the system may need to access multiple partitions to retrieve a single tuple. On the other hand, leaving the space empty results in memory waste, which affects the trade-off between TLB and cache fitting.
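The impact of the divisibility question can be quantified quickly (hypothetical sizes; the paper does not pin them down for this case):

    #include <cstdio>
    #include <initializer_list>

    // Quantifying W1 with hypothetical sizes: if the fragment size is
    // not a multiple of the tuple size, each fragment either wastes the
    // remainder or must split a tuple across two fragments.
    int main() {
        const int fragment = 2048;  // assumed fragment size in bytes
        for (int tuple : {16, 24, 48}) {
            int fit   = fragment / tuple;  // whole tuples per fragment
            int waste = fragment % tuple;  // leftover bytes if not split
            std::printf("tuple=%2dB: %3d tuples/fragment, "
                        "%2dB wasted (%.1f%%)\n",
                        tuple, fit, waste, 100.0 * waste / fragment);
        }
    }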
W2. The paper claims that the micro partition algorithm is hardware-aware, which makes it interesting to evaluate its performance under various hardware settings. Testing the algorithm on different hardware configurations, such as various CPU architectures, memory sizes, and cache sizes, would enable readers to better understand micro partition’s performance under different conditions and how it can be optimized for different hardware environments. This may demonstrate the algorithm's ability to adapt to different hardware characteristics, supporting its claim of being hardware-aware.
If this is a full paper and it is rejected, would you support its acceptance as a short (2 page) paper?
Yes