SIGMOD 2021 Reviews
Reviews for Paper MxTasks: How to Make Efficient Synchronization and Prefetching Easy, submitted to SIGMOD 2021 (Research Track).
Reviewer #1
1. Overall evaluation
Major Revision
3. Novelty
Novel problem and solution
5. Summary of contribution (in a few sentences)
This paper proposes an interesting framework that uses a task-based execution model and various annotations, promising to ease the implementation of data-intensive systems by delegating complexity to the framework. The ambitious vision of a runtime framework with enhanced operating-system support appears promising in the early comparison against pthreads and the Threading Building Blocks (TBB) framework.
6. Describe in detail all strong points, labeled S1, S2, S3, etc.
S1. The proposed task-based execution model is very promising and might enable the development and exploration of many more interesting data structures and algorithms for data-intensive operations.
S2. Rich annotations enable data prefetching, which delivers excellent throughput in the read-mostly workloads evaluated in the experimental section while matching the baseline implementations on other workloads.
7. Describe in detail all opportunities for improvement, labeled O1, O2, O3, etc.
O1. The evaluation scope is very limited; it would be valuable to the reader to include a direct comparison to a state-of-the-art data-structure implementation, such as reference [46], which is used as a target workload, or [28], which is hinted at as a comparable data structure.
O2. The XKernel vision is very interesting; however, Section 6 is written very vaguely and so contributes little to the description of the framework, which is the core contribution of this paper. It would be very valuable either to include an evaluation of a preliminary XKernel implementation or to devote the additional space to characterizing further aspects of the XTasking framework.
O3. Evaluating a single workload on a single data structure is underwhelming for a proposal of a general framework. The paper would be much stronger if it included another example data structure or workload, especially one whose operations involve more complex control flow, such as joins.
8. Please rank the three most critical strong points and/or opportunities for improvement that led to your overall evaluation rating, e.g., "{S1, O7, O4}"
O1, S1, S2
9. Inclusive writing: does the paper follow the SIGMOD inclusive writing guidelines with respect to language, examples, figures? Please provide suggestions to the authors on how to improve the writing of the paper according to the guidelines.
Yes
11. Additional remarks. Use this field to describe remarks that are not critical strong points or opportunities for improvement, e.g., to highlight typos, formatting problems, or minor technical issues.
A1. The fourth paragraph of the introduction starts by discussing "entire parallelism". It is not clear what this phrase refers to; please consider rephrasing to make it well defined.
A2. The footnote on page 3 implies that the XTasking framework currently supports Linux and XKernel, while it is only evaluated on Linux. It would be good to add an evaluation on XKernel if possible, or at least to discuss which modifications were necessary to adapt the XTasking design to Linux.
A3. Figure 3 description has a typo: "ibbjects"
A4. Figure 4 description has a typo: "has not to deal" -> "does not have to deal"
A5. In the measurement presented in Figure 5, where does the improvement in the "application" portion of the multi-level configuration come from? It would be good to quantify it, since it is a noticeable improvement.
A6. Figure 7 does not strengthen the argument in Section 5.3, since the setup of the experiment is not described at all and the discussion merely notes the behavior of the different configurations without deeper analysis. Enhancing the paragraph describing Figure 7 would meaningfully improve confidence in the flexibility of the task-size parameter and the framework's ability to adapt to different sizes.
A7. The discussion of prefetching efficiency in Section 7.2 and scalability in Section 7.3 suggests that memory-access efficiency plays a decisive role in performance. In such scenarios, especially for a YCSB workload operating on individual keys, taking advantage of locality in the workload is very important for performance. It would be good to include a clear description of how requests are distributed to threads in the pthread and TBB configurations.
15. Comments on the revised paper (if appropriate)
Thank you for addressing our revision items and significantly enhancing the paper. I think the ideas are very interesting, and the revised paper clearly describes the strengths of the approach and the limitations of the current implementation, and puts them in the context of related work for the community to build on.
Minor suggestion: the phrase "catch two birds with one stone" on page 4 sounds very strange to me, and even the common one, "kill two birds with one stone," doesn't quite fit the flow of Section 3, so my suggestion would be to rephrase that part of the description of the annotation-based prefetching mechanism.
16. Recommended decision for the revised paper (if appropriate)
Accept
Reviewer #2
1. Overall evaluation
Major Revision
3. Novelty
Novel application of existing solution to a new problem
5. Summary of contribution (in a few sentences)
This paper proposes XTasking/XTasks, an asynchronous programming approach to improving the performance of concurrent data structures while reducing their implementation difficulty. The idea is for application developers to break functionality (e.g., tree operations) into smaller tasks (XTasks) that are executed by worker threads, which collect multiple tasks in pools for better scheduling and coordination. To do this, the developer "annotates" each XTask with desired properties such as the data objects to access, locality, access frequency, etc., for the scheduler to use. The paper proposes two specific use cases of XTasking, prefetching and synchronization, and uses a Blink-tree structure to show the potential of this approach. XTasking resembles prior approaches such as data-oriented transaction execution but generalizes them for parallel programming, on top of which it explores further opportunities enabled by the task-based asynchronous programming paradigm (specifically prefetching and synchronization) to build concurrent data structures.
6. Describe in detail all strong points, labeled S1, S2, S3, etc.
S1. Targets an important problem: the performance and implementation difficulty of concurrent data structures.
S2. Explores opportunities such as prefetching and auto-inserting synchronization primitives enabled by a task-based execution model.
7. Describe in detail all opportunities for improvement, labeled O1, O2, O3, etc.
O1. Lock-free vs. XTasking: XTasking is positioned (in the introduction and abstract) as an alternative to lock-free programming that is easier to use and competitive in terms of performance. But it is unclear whether the proposed approach actually delivers on this:
- Programmers still need to annotate the code. Although this should be easier than lock-free programming, in many cases it still requires the programmer to know the underlying system architecture (e.g., NUMA topology) well for XTasking to schedule tasks properly.
- XTasking will automatically insert synchronization primitives. A critical question is: how would one know that a data structure built using this approach would be competitive with, or even beat, a lock-free counterpart?
Ideally these questions should be answered by evaluation. I'd suggest comparing with state-of-the-art lock-free data structures (e.g., the Bw-Tree as a baseline lock-free tree structure; see also O8), and some real examples showing how annotation is done at the code level would also help clarify.
O2. Each XTask is expected to be a "small unit of work" that is guaranteed to "run uninterruptedly to completion." Database systems typically also need to support persistence and involve I/O; how is this handled by XTasking, for example when a task needs to perform I/O operations?
O3. It is not clear what the application/XTasking interface looks like and how the annotation process works exactly. Does the application programmer annotate the code with keywords, with an extra compiler pass to generate code? Or does the programmer use an "XTask" class/structure, filling in "annotations" by setting structure fields and supplying a function to execute? Please clarify this in the paper, although from the provided code it seems to be the latter approach.
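If the latter style is indeed what the code uses, a minimal sketch of how such an interface might look is below. Every name here (Annotation, XTask, run_task, demo_annotation) is a placeholder of my own, not the paper's actual API:

```cpp
#include <functional>

// Hypothetical "fill in a structure" annotation style; not the paper's API.
struct Annotation {
    const void* object    = nullptr;  // data object the task will access
    int         numa_node = 0;        // locality hint for the scheduler
    bool        read_only = true;     // lets the runtime pick a sync primitive
};

struct XTask {
    Annotation            annotation;  // scheduling/synchronization hints
    std::function<void()> execute;     // the actual unit of work
};

// Trivial stand-in for a worker: honors nothing but the execute function.
inline void run_task(XTask& t) { t.execute(); }

// Small self-check: build one task and run it.
inline int demo_annotation() {
    int counter = 0;
    XTask t;
    t.annotation.object = &counter;
    t.execute = [&counter] { ++counter; };
    run_task(t);
    return counter;  // 1 if the task executed
}
```

Spelling out which of these fields the programmer must supply, and which the runtime infers, would answer the question above.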
O4. XTasking offers three types of synchronization (serializing accesses to the same data objects on a single worker, optimistic synchronization, and latches). How does the system determine exactly which one to use? Is the strategy baked in, or specified by the programmer?
O5. Some important details about the prefetching approach are missing; I'd suggest elaborating on the following interrelated questions:
- How is the prefetching distance determined? Does it require tuning?
- Section 3 mentions that prefetching instructions will be inserted "in-between" task executions. Does this mean XTasking analyzes each task and issues prefetches for all objects belonging to that task?
- For prefetching to work, the worker thread also needs to know exactly how much data (i.e., what size) to prefetch. Is this gathered by the XTasking scheduler or passed in by the programmer through annotations? What about cases where the size isn't known at compile time and depends on runtime results?
- Each task may access a different number of objects, and tasks can involve arbitrary access patterns. This may affect scheduling policies; e.g., how does XTasking handle irregular access patterns with a mix of long and short tasks? Once a set of prefetches is issued, when should we switch back to the task that needed the prefetching, so that the data is still in the cache when we do?
- Tasks may further spawn tasks; would this again impact prefetching and scheduling policies? E.g., a newly spawned task may cause the scheduler to switch back to a previous task (for which it issued prefetches) later, which may waste the prefetching effort if the newly spawned task trashed the cache.
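For concreteness, the generic batch-prefetching pattern these questions concern can be sketched as follows. The distance (kDistance) is exactly the tuning knob asked about above; nothing here is taken from the paper's implementation:

```cpp
#include <cstddef>
#include <vector>

// Issue a software prefetch a fixed distance ahead of the access stream,
// so the "task" body runs on (hopefully) cache-resident data.
constexpr std::size_t kDistance = 8;  // hypothetical prefetch distance

long sum_with_prefetch(const std::vector<long>& data,
                       const std::vector<std::size_t>& idx) {
    long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + kDistance < idx.size())
            __builtin_prefetch(&data[idx[i + kDistance]]);  // warm cache ahead
        sum += data[idx[i]];  // work on the element prefetched earlier
    }
    return sum;
}
```

Even in this toy form, picking kDistance too small wastes the prefetch and too large risks eviction before use, which is why the paper should state how the distance is chosen.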
O6. The current design and implementation largely ignores memory reclamation, but this can be non-trivial in terms of both system design and performance. For example, in "normal" parallel programming one may use epoch-based reclamation or RCU, which in some cases essentially wraps the data-access code with epoch-enter and epoch-exit calls that are assumed to happen on the *same* thread. With XTasking, such data-access functions may be broken into multiple tasks, which may get executed on *different* threads. This may cause bugs. How would this be adapted to work in XTasking? I'd urge the authors to complete this and evaluate its impact on performance. Also, Section 4.1 mentions that hazard pointers and "memory reclamation [20]" are two major approaches. This is not accurate: [20] discusses RCU and other particular approaches, whereas "memory reclamation" is the general concept.
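To make the thread-affinity concern concrete, here is a minimal sketch of the enter/exit bracketing described above. The names are illustrative; this is not the API of any particular EBR library:

```cpp
#include <atomic>

// Epoch-based reclamation relies on a per-thread invariant: the enter/exit
// pair brackets the data access on ONE thread.
struct EpochGuard {
    static thread_local int depth;   // per-thread nesting counter
    EpochGuard()  { ++depth; }       // epoch-enter
    ~EpochGuard() { --depth; }       // epoch-exit: must run on the same thread
};
thread_local int EpochGuard::depth = 0;

int read_under_epoch(const std::atomic<int>& shared) {
    EpochGuard g;            // if this scope were split into two XTasks run on
    return shared.load();    // different threads, `depth` would be unbalanced
}                            // on both of them

// Self-check: after a guarded read, this thread's epoch depth is back to zero.
inline int demo_epoch() {
    std::atomic<int> value{7};
    int r = read_under_epoch(value);
    return (EpochGuard::depth == 0) ? r : -1;
}
```

A task that enters the epoch and a continuation task that exits it would each see their own thread-local counter, which is precisely the bug class the paper should address.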
O7. XKernel appears very similar to what unikernels/microkernels do. Some comparison and discussion would be helpful. It is also not clear whether it is implemented at all. I feel the discussion of XKernel is largely unnecessary.
O8. It is not clear how the application communicates with XTasking components. Section 4.1 mentions that task spawning is "cheap" using "assembly atomics," and Section 7 mentions that pushing tasks to pools involves interconnect traffic. Some details (e.g., an algorithm listing) would help clarify the mechanism. The evaluation seems to indicate that XTasking itself consumes quite a few CPU cycles (Figure 12, more than the actual work conducted by the data structure); this probably indicates high inter-thread communication overhead and makes XTasking less attractive.
O9. The evaluation has a lot of room for improvement:
- Results show quite moderate improvement over a Blink-tree implementation, especially for prefetching (~50 Mops/s to ~80 Mops/s for read-only in Figure 10). This is quite low compared to what has been reported in the literature for B-tree-like structures [1].
- Experiments are limited to a Blink-tree structure and miss comparisons with state-of-the-art data structures and other approaches.
- The comparison against the TBB-based Blink-tree doesn't seem fair: the XTasks implementation utilizes prefetching while the thread and TBB-task implementations do not.
- The paper motivates its approach with the difficulty of lock-free programming, but it does not compare with any lock-free structure to prove that XTasking is actually competitive in terms of performance.
It would be informative to know the annotation effort needed by the application programmer. I'd also suggest adding comparisons to a lock-free tree structure, e.g., the Bw-Tree, or other state-of-the-art and widely used structures such as concurrent ART and Masstree. In addition, it would be informative to compare with recent C++20 coroutines, which are known to be very effective at hiding cache misses through prefetching while simplifying the implementation, as described in:
[1] Exploiting Coroutines to Attack the “Killer Nanoseconds”, VLDB 2018
[2] Interleaving with Coroutines: A Practical Approach for Robust Index Joins, VLDB 2018
[3] Cache Craftiness for Fast Multicore Key-Value Storage, EuroSys 2013
O10. The paper largely focuses on concurrent data structures. I'd be curious to know how this approach can be adapted to work in a full database system (e.g., with an index structure implemented using XTasking).
8. Please rank the three most critical strong points and/or opportunities for improvement that led to your overall evaluation rating, e.g., "{S1, O7, O4}"
O1, O5, O9
9. Inclusive writing: does the paper follow the SIGMOD inclusive writing guidelines with respect to language, examples, figures? Please provide suggestions to the authors on how to improve the writing of the paper according to the guidelines.
Yes.
11. Additional remarks. Use this field to describe remarks that are not critical strong points or opportunities for improvement, e.g., to highlight typos, formatting problems, or minor technical issues.
A1. Writing in general needs an overhaul. There are some typos and some wordings are a bit strange:
- "sequencing accesses" -> typically we say "serializing accesses"
- "straight-line code" -> sequential code?
- "to supply mutual exclusion" -> to support mutual exclusion
- In Section 4.1, "that delte a shared object" -> delete?
- Section 5.3, "short-living" -> "short-lived"
- Section 3, "Such spawning will happen asynchronously and lightweight" -> missing "is" before "lightweight."
A2. Prefetching is first mentioned only at the end of Section 1, when discussing the paper organization. It would be clearer if this problem were introduced earlier in the introduction.
A3. Section 2.2 mentions the paper would use tree navigation as a "running example," yet it didn't appear again until Section 5.
A4. Footnote 1 mentions XKernel without explaining much about what it is. I'd suggest adding a forward reference to Section 6 for it.
A5. Figure 11 would be clearer if the different approaches were compared in the same graph using the same type of operation. E.g., in Figure 11(a), instead of fixing one approach per subfigure, it would be better to show one operation (e.g., read) per figure with three lines (Scheduling+Prefetching, RW-lock+Prefetching, Optimistic).
12. Response to author feedback (please answer "N/A" if author feedback was not provided)
Thank you for the feedback.
13. List required changes for a revision, if appropriate, by identifying subset of previously specified opportunities for improvement (e.g., O1, O3, O6).
Please address O1, O3, O4, O5, O6, O8, O9 (coroutine-related evaluation is good to have but not a must).
15. Comments on the revised paper (if appropriate)
Thank you for addressing our revision requests. The paper is significantly improved! A few minor issues:
- I suggest focusing on DBMSs and thus removing the mention of OSs in the introduction (first sentence).
- Section 5.1 - "simultaneously hold" hold->held?
- Section 5.1 also mentions "comparable task-based solutions." If there are other similar approaches, please consider providing proper citations.
Finally, if possible, please consider open-sourcing this! I look forward to seeing it being picked up by others.
16. Recommended decision for the revised paper (if appropriate)
Accept
Reviewer #3
1. Overall evaluation
Major Revision
3. Novelty
Novel implementation and evaluation of previous solutions to existing problem
5. Summary of contribution (in a few sentences)
XTask is a framework for the easy development of concurrent data-intensive applications. The main techniques in XTasking are user-defined annotations and tasks, which are used for efficient data pre-fetching and synchronization. While XTask is a user-space application, the long-term vision is to integrate the XTask scheduler into the OS kernel for DBMS/OS co-design.
6. Describe in detail all strong points, labeled S1, S2, S3, etc.
S1. The paper addresses an important problem. Programming on new and increasingly complex hardware is a difficult task. It is thus useful to develop techniques that support parallel and concurrent programming.
S2. The paper is well structured, and generally easy to follow. However, significant details have been left out from the description of XTasks (see opportunities for improvement).
S3. I appreciated seeing a glimpse of the future work ideas in the XKernel section. I think this path is worth exploring further!
7. Describe in detail all opportunities for improvement, labeled O1, O2, O3, etc.
I believe the paper has good potential. However, there are a number of opportunities for improvement that I strongly encourage the authors to consider for a future version of the work.
O1: Motivation. The most important opportunity for improvement is to strengthen the motivation, by addressing the following questions:
- What makes XTasks/XKernel suitable for DBMS/OS co-design specifically? The paper claims that a task-driven model is better than a thread-driven model in data-intensive applications, but it does not clearly explain why.
- Why do annotations and tasks make synchronization easier? How is the “ease” measured? The argument the paper makes is that annotations make writing parallel programs easier because they give programmers more freedom to leverage specific characteristics of the hardware and to specify synchronization needs. To me, it sounds like annotations actually make programmers’ jobs more difficult, since they now need to be aware of the hardware specifics, and data access frequency. Moreover, similarly to transactions, tasks rely on the programmers deciding on the right granularity and content of a task, which is a difficult question. Since tasks are not preemptible, a task that accesses slow I/O can block one worker thread/core, slowing down the entire system.
O2: API. Another important point that needs clarification is how XTasks and annotations are used.
- What is the API for making annotations? Even though the annotations API is available in the code, it needs to appear in the paper as well.
- Are dependencies allowed between tasks? If yes, how are they handled? Figure 2 explains the order in which tasks are handled for pre-fetching, but I find it confusing. It looks like the worker performs data pre-fetching, task execution, and task pre-fetching in parallel. I think a timeline or pseudocode would be a better visualization.
- How are tasks different from transactions? The related work section could clarify this aspect.
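As one way to address the visualization point above, a pseudocode-style worker loop could make the overlap explicit. This is purely my reading of Figure 2, and all names (Task, drain, demo_drain) are hypothetical:

```cpp
#include <deque>
#include <functional>

// Minimal task with the data object it touches (for prefetching) and its body.
struct Task {
    const void*           object;
    std::function<void()> run;
};

// Worker loop: prefetch the NEXT task's object, then execute the current one,
// so the memory fetch overlaps with useful work.
inline int drain(std::deque<Task>& queue) {
    int executed = 0;
    while (!queue.empty()) {
        if (queue.size() > 1)
            __builtin_prefetch(queue[1].object);  // overlap fetch with work
        queue.front().run();                      // execute the current task
        queue.pop_front();
        ++executed;
    }
    return executed;
}

// Self-check: two tasks, each bumps a shared counter; both should execute.
inline int demo_drain() {
    int hits = 0;
    int a = 0, b = 0;
    std::deque<Task> q;
    q.push_back(Task{&a, [&hits] { ++hits; }});
    q.push_back(Task{&b, [&hits] { ++hits; }});
    return drain(q) + hits;  // executed count plus hits
}
```

A figure or listing at roughly this level of detail would resolve the ambiguity about what happens in parallel versus in sequence.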
O3: Examples. It is great that the paper provides a concrete example with the task-based B-tree. The example would, however, be clearer if it mentioned all the tasks necessary for the B-tree implementation. I could see the tree implementation in the code; however, I think these details are useful in the paper to understand how to concretely use tasks.
Similarly, the task granularity example provided in Figure 7 is useful, but it would have been more helpful to see the granularity effect on more complex queries such as joins.
8. Please rank the three most critical strong points and/or opportunities for improvement that led to your overall evaluation rating, e.g., "{S1, O7, O4}"
{O1, O2, O3}
9. Inclusive writing: does the paper follow the SIGMOD inclusive writing guidelines with respect to language, examples, figures? Please provide suggestions to the authors on how to improve the writing of the paper according to the guidelines.
Yes.
11. Additional remarks. Use this field to describe remarks that are not critical strong points or opportunities for improvement, e.g., to highlight typos, formatting problems, or minor technical issues.
Typos: Page 4 - “focussing”, Page 5 - “ibbjects"
12. Response to author feedback (please answer "N/A" if author feedback was not provided)
Thank you for the author feedback. We are really glad you found our feedback helpful!
13. List required changes for a revision, if appropriate, by identifying subset of previously specified opportunities for improvement (e.g., O1, O3, O6).
Please include the evaluation of the memory management piece, and a comparison with state-of-the-art concurrent data structures. In terms of writing and presentation, please explain clearly how to use the XTasks API (mentioned in my O2).
15. Comments on the revised paper (if appropriate)
Thank you for taking into account all our revision items and significantly improving this work!
16. Recommended decision for the revised paper (if appropriate)
Accept