To stride or not to stride the memory access?

Reviews (DIMES 2025)

Review A

Overall merit
3. Weak accept

Reviewer expertise
3. Knowledgeable

Paper summary
The paper presents experiments that sum up array elements using sequential, sequential-SIMD, and strided access patterns with various stride sizes. The authors show that on two platforms and three kinds of memory, the strided access pattern outperforms sequential SIMD by up to 25%, but only for certain stride sizes. Only a few stride sizes among all possible ones outperform SIMD-sequential; most others perform much worse. The authors analyze the reasons behind this phenomenon.
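For concreteness, the two scalar access patterns under discussion look roughly as follows (my own sketch for fellow reviewers; the function names and the stride parameter are illustrative, not taken from the paper):

```c
#include <stddef.h>
#include <stdint.h>

/* Sequential AggSum: a single linear pass over the array. */
int64_t aggsum_sequential(const int64_t *a, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Strided AggSum: the array is traversed in `stride` interleaved passes
 * (0, stride, 2*stride, ..., then 1, stride+1, ...); every element is
 * still visited exactly once. */
int64_t aggsum_strided(const int64_t *a, size_t n, size_t stride) {
    int64_t sum = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            sum += a[i];
    return sum;
}
```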

Comments for authors
I found the paper very well structured and easy to read. I love the annotations on the figure with Observation numbers. I was also very impressed by your thorough and skillful performance analysis.
I am uncertain whether the paper passes the bar for the workshop for a couple of reasons:

1. While you did a great job explaining why certain stride sizes performed worse than others, I didn't quite understand why certain stride sizes performed better. True, you explained that with the best strides the TLB misses, prefetcher activity, and conflict misses are all at their best, but the same holds for sequential and SIMD-sequential. Why is strided better? Is it because we achieve higher memory-level parallelism? But I thought that modern memory systems interleave consecutive 64-byte cache lines across banks and channels, so shouldn't we observe pretty good parallelism for sequential access as well? This phenomenon would be really nice to explain!

2. Your future work talks about automating stride selection, but what I was hoping to see in this paper is some intuition about how this might be done. Perhaps a simple formula based on cache parameters?
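To illustrate the kind of formula I mean, here is a purely hypothetical heuristic (my own guess, not anything from the paper); even a rough starting point like this would be valuable to state and then refine:

```c
#include <stddef.h>

/* Purely hypothetical stride heuristic, NOT from the paper: jump one page
 * per access so consecutive loads hit distinct DRAM pages/banks, and add
 * one cache line so successive passes do not map to the same cache sets. */
size_t pick_stride_bytes(size_t page_size /* e.g. 4096 */,
                         size_t line_size /* e.g. 64 */) {
    return page_size + line_size;
}
```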
Regardless of the PC discussion outcome, I thank you for submitting your work. I enjoyed reading it and learned something new!

Review B

Overall merit
3. Weak accept

Reviewer expertise
3. Knowledgeable

Paper summary
The paper presents an experimental study comparing sequential and strided access on two Intel processors using an AggSum benchmark. The study compares Sapphire Rapids with Cascade Lake across DDR, HBM, and CXL memory technologies in local and remote setups. The analysis investigates the impact of the TLB, cache associativity, and hardware prefetchers. They find that an optimal stride size can outperform sequential access on the newer Intel processor.

Comments for authors
The paper provides a timely and comprehensive experimental study of two access patterns on new Intel processors. The paper is well written, with extensive measurement results presented.
The work focuses on the performance of a single-core/single-thread execution of the AggSum operation in database applications, yet each processor has 40-60 cores per socket. When multiple cores execute the AggSum benchmark, how would the findings and conclusions change compared to the single-threaded experiments?
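Concretely, I would expect a parallel variant like the following OpenMP sketch (mine, not the paper's code); with enough threads, the memory controllers rather than a single core become the bottleneck, which may change the sequential-vs-strided comparison:

```c
#include <stddef.h>
#include <stdint.h>

/* Multi-threaded AggSum sketch (compile with -fopenmp): each thread sums a
 * contiguous chunk; per-thread results are combined by the reduction. */
int64_t aggsum_parallel(const int64_t *a, size_t n) {
    int64_t sum = 0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```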

When compiling with GCC's -O3 optimization flag, have you checked whether the compiler transforms the code to use gather instructions? In general, does your experimental study consider the impact of the processor's hardware implementation of gather or strided-load instructions on the achievable memory bandwidth?
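For reference, a gather-based version of one strided pass might look like the AVX2 sketch below (my code, not the paper's; requires -mavx2). A full AggSum would run this pass once for each start in 0 ... stride-1. It would be interesting to know whether -O3 ever emits something similar:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sums a[start], a[start+stride], a[start+2*stride], ... for indices < n,
 * gathering four elements per instruction. Note that Intel CPUs execute
 * gathers as multiple internal cache-line accesses, so a gather is not
 * automatically faster than scalar strided loads. */
static int64_t strided_pass_gather(const int64_t *a, size_t n,
                                   size_t start, size_t stride) {
    __m256i acc = _mm256_setzero_si256();
    __m256i idx = _mm256_set_epi64x(3 * (long long)stride,
                                    2 * (long long)stride,
                                    (long long)stride, 0);
    size_t i = start;
    for (; i + 3 * stride < n; i += 4 * stride) {
        /* Loads a[i], a[i+stride], a[i+2*stride], a[i+3*stride]. */
        __m256i v = _mm256_i64gather_epi64((const long long *)(a + i), idx, 8);
        acc = _mm256_add_epi64(acc, v);
    }
    int64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i += stride)          /* scalar tail of this pass */
        sum += a[i];
    return sum;
}
```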

It is unclear how to leverage the findings in this paper to optimize existing database applications. The paper touches on adapting memory allocation to transparently enable strided access at the physical level, but lacks a discussion of the implications this may have for hardware and system software design. The paper could be improved by adding such a discussion.

Review C

Overall merit
4. Accept

Reviewer expertise
2. Some familiarity

Paper summary
The prevailing assumption in the state of the art is that sequential access yields the best performance, particularly when data resides in contiguous memory locations. This paper challenges that assumption by demonstrating that, for data stored contiguously, a strided access pattern with an appropriately chosen stride size can significantly outperform sequential access. The paper provides in-depth experiments to support this claim and to explore why the right stride boosts performance.

Comments for authors
Thank you for submitting this paper to DIMES! I enjoyed reading this paper. Overall, the main insight that striding at specific strides can be better than sequential access (and even a bit better than SIMD parallelization) is surprising. I also found the explanations very thorough and convincing. For instance, the recommendation to use misaligned starting addresses to avoid the negative effects of cache associativity was surprising, as it goes against common practice.
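If it helps readers, the misaligned-start trick could be illustrated with a snippet along these lines (my reading of the idea, not the authors' code):

```c
#include <stdint.h>
#include <stdlib.h>

/* My reading of the misaligned-start idea, NOT the authors' code: if all
 * partitions start page-aligned, the lines touched at each step of a
 * round-robin strided walk map to the same L1/L2 set and evict each other.
 * Padding each partition by one extra cache line staggers the starts
 * across sets. */
int64_t *alloc_staggered(size_t parts, size_t part_bytes, void **raw) {
    const size_t line = 64;   /* assumed cache-line size */
    if (posix_memalign(raw, 4096, parts * (part_bytes + line)) != 0)
        return NULL;
    /* Partition p begins at byte offset p * (part_bytes + line), i.e. one
     * cache line further from page alignment than partition p - 1. */
    return (int64_t *)*raw;
}
```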

To improve the paper even more, I would include some examples of applications that would benefit from striding and provide concrete guidelines on how to choose the stride (perhaps something like a flow chart with all the parameters that influence the stride size, such as TLB size, cache-line size, cache associativity, and working-set size).

You say that "According to the state-of-the-art, it is still assumed that a sequential access pattern provides the best performance, especially if the data to be processed is stored in adjacent memory locations (contiguous memory) [4, 17]." However, these references are more than 10 years old (from 2008 and 2015). Can you find more recent work that also states that sequential access to contiguous memory provides the best performance?

Review D

Overall merit
3. Weak accept

Reviewer expertise
3. Knowledgeable

Paper summary
The paper demonstrates that on Intel systems from a few years ago (SPR), strided access with certain (low) partition counts can outperform sequential access in terms of bandwidth on a single core when using 4 KiB pages.

Comments for authors
While the experiments are curious and there's some interesting data, the scenarios studied feel disconnected from realistic workloads and the practical implications are unclear.

### Suggestions and more detail ###

#### Motivation and practical relevance ####
The paper does not motivate why single-core, single-threaded throughput over a 1 GiB sequentially laid-out array in 4 KiB pages is an important scenario. In practice, most applications would (a) use multiple cores, making the bottleneck the UPI links or memory controllers rather than single-core throughput, and (b) employ huge pages, eliminating the TLB issues you highlight. Please explain which real workloads motivate this investigation.
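For comparison, point (b) is nearly a one-liner on Linux (illustrative sketch); with 2 MiB pages, the dTLB effects you analyze largely disappear:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Back the working array with transparent huge pages (Linux, best-effort):
 * a 1 GiB array then needs ~512 dTLB entries at 2 MiB instead of ~256k
 * entries at 4 KiB. */
void *alloc_huge(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, bytes, MADV_HUGEPAGE);   /* request THP; may be ignored */
    return p;
}
```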

#### CXL setup accuracy ####
You describe the use of CXL 2.0 memory expansion cards on Sapphire Rapids, but SPR only officially supports CXL 1.1 and lacks Type-3 memory expansion support. In fact, SPR emulates CXL.mem using CXL.cache flits, which leads to rather different performance characteristics. Please clarify which protocol version was actually negotiated, and avoid implying features that weren't exercised.

#### Remote DRAM experiment limitations ####
The remote DRAM results are also single-core. With one core, the bottleneck is latency, not bandwidth; in practice, with many cores, I would expect UPI bandwidth to be the limiting factor. Please discuss behavior under realistic load.

#### Explanation quality ####
The first place to look in the existing literature should be the DRAM side, e.g., bank-level parallelism (BLP) vs. row-buffer locality. Note that DDR5 essentially doubled the amount of bank-level parallelism (due to subchannels) at the same channel count, and SPR has 8 channels vs. 6 channels for Cascade Lake, so (8/6) x 2 ≈ 2.7x more parallelism!
(And modern Intel CPUs like GNR have 12 channels, with 16 channels on the horizon!)

See the classical line of papers following Rixner et al., “Memory Access Scheduling” (ISCA 2000):
- Sequential access patterns tend to maximize row buffer hits but underutilize banks.
- Strided / interleaved access can increase BLP (even at the cost of row hits), and memory controllers exploit this trade-off.

I had to reread multiple paragraphs many times and was still confused about exactly what you meant, e.g., regarding the effects of the TLB. For example, Fig. 5 lacks any note of which architecture it ran on, unlike Fig. 4. Please add architecture, memory placement, and prefetcher settings so the results are more accessible.

#### Claims on modern Intel CPUs ####
The CPUs need to be put into context. Most cloud providers launched Intel EMR VM SKUs a while ago, so that is the dominant platform now. GNR was also released a year ago, so at least most bare-metal deployments have moved to it (VM SKUs will take a few more months). I am not sure that SPR ever reached a scale of deployment that made it that relevant. Also, Sierra Forest CPUs are out and Clear Water Forest CPUs are coming. There are lots of differences between Intel CPUs.

Comment by Reviewer D

Thank you for your careful microbenchmarking study. Your paper includes interesting and discussion-worthy observations, while leaving significant opportunities for improvement. Please try to address the following areas in your presentation and final version.
Several reviewers felt the paper needs clearer context: why do single-core, 4 KiB-page experiments matter for real applications, when most systems use multiple cores and often huge pages? Please help readers see when and where this effect is practically relevant.
Finally, your explanations are partly speculative. Alongside TLBs, prefetchers, and cache sets, please connect your findings to known effects like bank-level parallelism versus row-buffer locality, to better position what is new here.