Reviews for paper Analyzing Memory Accesses With Modern Processors, submitted to DaMoN 2020.
Overall Rating: accept
Yes
Definitely - very clear
Yes - the contributions are above the bar
Yes
Strong Accept
By leveraging the PEBS mechanism available in recent Intel CPUs, a hardware-based sampling feature, the authors implemented a low-overhead memory tracing tool and demonstrated that the tool is practical (i.e., the run-time overhead for tracing memory accesses is acceptable) under realistic database workloads. Although it would be desirable to present more convincing use cases, this type of tool will be of great help in understanding memory access patterns and, furthermore, in finding unknown but interesting performance bottlenecks, tuning points, and research problems. Many people will be happy if the tool is made publicly available soon.
This paper is well organized and well written, and thus very easy to follow.
In addition, if the authors can find a few novel use cases (e.g., like those explained in the related work section), this paper is a good candidate for invitation to a special VLDB journal issue.
Yes
Yes
Yes
Definitely - very clear
Yes - the contributions are above the bar
Yes
Accept
The paper addresses the lack of means for lightweight memory tracing that can be applied to various memory-access-related issues, such as identifying hotspots or poor choices of data structure/layout. The authors present their extension to perf for such memory tracing, modifying the kernel subsystem and leveraging Intel's new PEBS counters/events. A few benefits of the tool, along with example insights that could be obtained with reasonable overhead, are presented and discussed in sufficient detail.
Thank you for submitting your work to DaMoN. This profiling/tracing tool is a great addition to our community, and I am sure it will be welcomed by many practitioners who try to optimize their systems.
The paper is well motivated, positioned with respect to state-of-the-art and reads well. The included examples with detailed explanation are certainly helpful to understand the intrinsics of the mechanism and the benefits of using the tool.
A minor remark for the paper style:
- The graphs are difficult to read when the paper is printed (in black and white). Can you please ensure that they are clearer in the camera-ready version?
Yes
No
Yes
Definitely - very clear
Definitely - a significant advance
Yes
Strong Accept
The paper presents a low-overhead tool that performs profiling while also collecting memory access traces for a specific hardware event (memory loads).
The implementation uses a combination of Intel's PEBS, perf, and custom code added to the Linux kernel.
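For context, upstream perf already exposes PEBS-backed memory-load sampling through its `perf mem` front end; a minimal usage sketch with stock perf commands (not the authors' extension, and `./workload` is a placeholder) might look like:

```shell
# Record hardware-sampled memory loads for a workload using PEBS
# (stock perf interface; the paper's tool extends this kernel path)
perf mem record -- ./workload

# Summarize the sampled loads (e.g., by symbol and memory level)
perf mem report
```

The paper's contribution lies in the kernel-side modifications layered on this path to make full memory tracing practical.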
Such profiling mechanisms are essential for the memory characterization of systems and can guide developers and researchers in choosing optimizations.
Collecting memory access traces is known to incur a high overhead on the system being analyzed.
Any mechanism with a much lower-overhead is very valuable.
The paper also demonstrates the effectiveness of the tool across different use cases that involve real, complex systems.
I enjoyed reading this paper very much.
I would have liked to see more discussion for multi-threaded scenarios as they are known to increase the overheads of profiling especially when memory tracing is involved.
That would be my only complaint about this paper.
The paper talks about how the tool is implemented targeting a multicore environment.
However, if I understand correctly, the analyzed use cases are from single-threaded experiments.
If I misunderstood it, it would be good to clarify this in the paper.
If not, how do the overheads increase in the multi-threaded case?
Tools like Pin are notoriously bad for memory tracing in multi-threaded scenarios, especially because the slow-downs they impose on the program alter its contention behavior during real runs.
What contributes the most to the tool presented in the paper being much faster than existing tools like Pin or Valgrind?
Is it the sampling? Is it the added kernel code? Or something else? Or do many things contribute similarly?
It would be nice to add more intuition about the relative benefits of the different optimizations / components.
If you have numbers for the overhead (e.g., without sampling), those would be great to mention.
Because if it is mainly the sampling, maybe people can change their Pin tools to do something similar.
What is the size of the memory traces you collect for the use cases presented in the paper?
How long does the post-processing take roughly?
Some other related work that augments profiling with memory traces (based on Pin):
Tozun et al. OLTP in Wonderland, DaMoN 2013 (uses memory traces and a hardware simulator to map cache misses to different code parts) and ADDICT, PVLDB 2014 (uses pin traces to detect data/instruction reuse across transactions)
Section 4.1, 2nd paragraph: "... we display the addresses in byte" -- I find this sentence a bit confusing, since the referenced graph (in Figure 3) shows things at 4K granularity, as the next sentence mentions.
Finally, I am not sure whether you will release the codebase and usage instructions for this. It would be a great contribution to the community.
Minor: Figure 3: Can you add a legend explaining what the different colors mean, as on the other figures? It is a bit confusing if you just look at the graph for a quick understanding.
Yes
Yes