CIDR 2020 Reviews
Reviews for paper Shared Load(ing): Efficient Bulk Loading into Optimized Storage, submitted to CIDR 2020.
Overall rating: accept
Reviewer 1
Overall Recommendation
Weak Accept
Summary of Paper and Rationale for Recommendation
This paper proposes a distributed bulk loading mechanism that enables dynamically offloading deserialization and data compression/transformation to the client machine. It provides a concise but comprehensive experiment with two datasets in different environment settings, showing the potential improvement in data loading and the impact on queries for a contended server that must load data and process queries concurrently. While the ideas are interesting and show promise, the paper's scope may be too narrow for CIDR and the heuristics might be too simple for real-world use cases.
Three Strong Points
S1. Data loading is important and consumes a considerable amount of execution time. The problem is well introduced, motivated and justified through the analysis.
S2. The idea of shifting some data loading tasks from the server side to the client side is reasonable and sound.
S3. Making decisions based on workload is reasonable.
S4. Comprehensive experiments in different environment settings.
Three Weak Points
W1. More details should be provided on the heuristic function used in Sec 4.3. See detailed comments.
W2. The experiment simulates a distributed environment instead of running in a real one, making the results less convincing. How does the simulated network environment differ from real protocols and configurations? What about mixed computational resources?
W3. The experiments do not show the superiority of Dynamic Offloading (SL_S); Client-Centric loading (SL_A) seems better in many situations. An analysis and understanding of why this is the case would be helpful.
W4. Section 2 mentions that the HANA system relies on an order-preserving dictionary, but there is no detail on how the dictionary of a new partition is merged into the existing dictionary in Dynamic Offloading or Server-Centric mode. If recoding on the server side is always inevitable anyway, then compression on the client side has nothing to do with workload relocation except for saving bandwidth (see the sketch after this list). How would alternative compression schemes compare? Is the dictionary globally or locally ordered? What are the recoding costs?
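To make the recoding concern in W4 concrete, here is a minimal Python sketch of the kind of order-preserving dictionary merge being asked about. It is hypothetical (not taken from the paper, and all names are illustrative): it shows why existing codes must shift whenever a new partition's values interleave with the existing dictionary.

```python
# Hypothetical sketch (not from the paper): merging a new partition's
# dictionary into an existing order-preserving dictionary. Because codes
# must preserve sort order, inserting new values in the middle shifts the
# codes of existing values, i.e. the old column must be recoded.

def merge_dictionaries(server_dict, client_dict):
    """Merge two sorted dictionaries; return the merged dictionary plus
    code-remapping tables for the existing column and the new partition."""
    merged = sorted(set(server_dict) | set(client_dict))
    new_code = {value: code for code, value in enumerate(merged)}
    # Old codes -> new codes: every existing value may receive a new code.
    server_remap = [new_code[v] for v in server_dict]
    # Client-side (local) codes -> global codes for the loaded partition.
    client_remap = [new_code[v] for v in client_dict]
    return merged, server_remap, client_remap

# Example: the new partition interleaves with existing values, so the
# existing column's codes shift even though the client pre-encoded its data.
server_dict = ["apple", "cherry", "melon"]  # existing codes 0, 1, 2
client_dict = ["banana", "cherry"]          # local codes 0, 1
merged, server_remap, client_remap = merge_dictionaries(server_dict, client_dict)
print(merged)        # ['apple', 'banana', 'cherry', 'melon']
print(server_remap)  # [0, 2, 3]  -> 'cherry' and 'melon' must be recoded
print(client_remap)  # [1, 2]     -> client codes are rewritten on merge
```

Whether this recoding cost dominates depends on whether the dictionary is globally or locally ordered, which is exactly what W4 asks.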
Detailed Comments
D1. In 4.3, the heuristic should take network bandwidth into account, and the paper should give examples or experiments showing how runtime information is used to make the decision; this is not described in any detail (see the sketch after these comments).
D2. The experiments should be done in real distributed environments, especially considering that clients often have less powerful CPUs than the server (assuming clients are diverse while HANA runs on a heavyweight server). The heuristic should also take this difference in CPU power into account.
D3. Section 4.1 mentions that Client-Centric mode "allows the DBMS to merge a partition directly into optimized storage." Isn't recoding always needed on the server side?
D4. It would be good to show the query tail latency without any bulk loading for reference.
D5. The heuristics used in the experiments are naive, as they depend only on the file itself while other system information is not leveraged. This is likely an artifact of the simulated environment.
D6. What are the ramifications of using fixed-size strings in the experiments? If sorting were made more expensive, would this hamper potentially slower clients?
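To illustrate what D1 and D2 are asking for, here is a minimal sketch of a cost model that folds network bandwidth and the client/server CPU gap into the offloading decision. All names, constants, and rates are hypothetical placeholders, not the paper's actual heuristic.

```python
# Hypothetical sketch (not from the paper): a cost-based offloading decision
# that accounts for network bandwidth and the client/server CPU gap, as
# suggested in D1 and D2. All rates are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class LoadJob:
    raw_bytes: int          # size of the input file (e.g., CSV)
    encoded_bytes: int      # estimated size after dictionary compression
    transform_cost: float   # CPU-seconds to parse + encode on the server

def estimated_load_time(job, client_speedup, bandwidth_bytes_per_s,
                        server_busy_factor):
    """Return (client_centric_seconds, server_centric_seconds)."""
    # Client-centric: the client parses/encodes (scaled by its relative CPU
    # power), then ships the smaller encoded partition over the network.
    client = (job.transform_cost / client_speedup
              + job.encoded_bytes / bandwidth_bytes_per_s)
    # Server-centric: the raw file crosses the network, and the server does
    # the transformation while competing with concurrent queries for CPU.
    server = (job.raw_bytes / bandwidth_bytes_per_s
              + job.transform_cost * server_busy_factor)
    return client, server

job = LoadJob(raw_bytes=10 * 2**30, encoded_bytes=3 * 2**30,
              transform_cost=120.0)
client_t, server_t = estimated_load_time(
    job,
    client_speedup=0.5,                 # client CPU assumed half as fast (D2)
    bandwidth_bytes_per_s=1.25 * 2**30, # ~1.25 GiB/s link
    server_busy_factor=2.0,             # server contended by concurrent queries
)
print("offload to client" if client_t < server_t else "keep on server")
```

With these illustrative numbers, a contended server tips the decision toward client-side offloading, while a fast network and an idle server would tip it back; this is the runtime information the review argues the heuristic should consume.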
Reviewer 2
Overall Recommendation
Weak Reject
Summary of Paper and Rationale for Recommendation
This paper presents a technique to speed up data loading into a database instance. The key idea is to offload some of the data transformation work to the client, allowing most of the critical work, such as encoding the data and converting it into columnar formats, to be done outside of the database server.
While the idea is practical, it is one that is often used in databases; as such, the research contributions of this paper are not very significant. In addition, generating a binary format outside of the database poses its own set of challenges, which the paper does not address.
Three Strong Points
S1. Fast data loading into a database remains an important problem. Any improvements to it will be beneficial in reducing the time to insights.
Three Weak Points
W1. The research contribution of this paper is low, given that the idea of offloading the work to servers other than the database server is often used in practice.
W2. Generating data in the database engine's native format on a client, which is often not tightly controlled by the engine, poses some interesting challenges.
Detailed Comments
D1. I do not doubt the practicality of this approach. In fact, many systems actually recommend such preprocessing. However, the research contributions of this paper are not significant.
Examples of systems that preprocess data to minimize load overhead:
* Prepping data for a cloud data warehouse or a data lake. The use of data preparation tools (e.g., Spark) or ETL tools such as AWS Glue ETL, Databricks, or Azure Data Factory is common practice for transforming data for data lakes. Many cloud data warehouses, such as Snowflake, Redshift, or Azure SQL DW, allow fast querying on external tables or even fast loading from pre-transformed external tables.
* Many mature databases also support partition switching, where partitions of a table are pre-created with the specified layout and transformations, effectively offloading much of the work to another server, while the switch itself is purely a data copy plus metadata operations.
D2. Exposing the database's internal storage format to clients poses some additional challenges. For instance, it is common in practice that clients run older versions of the software, or that client and server versions do not upgrade in sync. As a result, client tools often have to be only loosely coupled with the database server's version and software. Pre-processing data in the database server's internal format creates a tight coupling between the client and the server, which could easily result in complications and many practical challenges.
D3. How does the proposed technique compare with recent related approaches such as:
Dong Xie, Badrish Chandramouli, Yinan Li, Donald Kossmann: FishStore: Faster Ingestion with Subset Hashing. SIGMOD Conference 2019: 1711-1728
Since that paper is so recent, I am not expecting this paper to compare with it experimentally. However, I bring it up because these approaches are exploring new innovations to speed up data loading, which the paper needs to be contrasted with.
Reviewer 3
Overall Recommendation
Accept
Summary of Paper and Rationale for Recommendation
This paper argues that there is substantial CPU overhead in deserialization and transformation operations during bulk loads to analytics-optimized databases, such as SAP HANA. To overcome this issue, the authors propose a distributed bulk loading mechanism that can dynamically offload the deserializations and transformations to the input machines. The evaluation shows that the proposed mechanism improves performance and, more importantly, isolates the performance of concurrently running queries, achieving up to 2x higher loading throughput than the naive approaches.
Three Strong Points
S1. Bulk loading is a very important operation of modern data warehousing and one of the main comparison criteria among data warehousing solutions.
S2. The idea of offloading deserialization and transformations is a trend that is gaining traction in the industry, for example with the emergence of cloud services like S3 SELECT.
S3. Very interesting breakdown of where the time goes when loading the TPC-H Lineitem table in SAP HANA.
S4. All the experiments are done in a commercial-grade system, SAP HANA, and that adds credibility to the paper. Further, the evaluation considers concurrently running queries, which is also realistic: in production systems, no bulk loading happens in isolation.
S5. Well-written paper that is easy to follow.
Three Weak Points
N/A
Detailed Comments
D1. The paper deals with a really critical issue for commercial data warehouses: the bulk loading (also known as COPY/LOAD) of plain data into analytics-optimized columnar stores. The observation that the compression/encoding cost of the ingested data is substantial is very interesting.
D2. Figure 1, which shows where time goes during bulk loading in a commercial-grade analytics-optimized system such as SAP HANA, is very interesting. This is one of the first papers that focuses on the transformations, and it contradicts a bit the previous academic papers in the area, which focused mostly on deserialization (e.g., CSV parsing).
D3. The authors propose a hybrid between client-centric and server-centric loading that dynamically decides which option to use. The decision depends on an estimate of the size of the produced analytics-optimized (dictionary-compressed) columns (a back-of-the-envelope sketch of such an estimate follows these comments).
D4. A large fraction of bulk loading in modern analytics-optimized systems takes place from files stored in object stores, such as Amazon S3. In these passive stores, it is quite difficult to perform the proposed optimization.
D5. The evaluation considers concurrently running queries, which is realistic, as in production systems no bulk loading happens in isolation.
D6. The paper is well written and easy to follow.
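To illustrate the kind of estimate mentioned in D3, here is a back-of-the-envelope calculation of the size of a dictionary-compressed column, assuming standard bit-packed dictionary encoding; this is a sketch of the general technique, not necessarily the paper's exact cost model, and all names and numbers are illustrative.

```python
# Hypothetical sketch (not from the paper): estimating the size of a
# dictionary-compressed column, the quantity the dynamic decision in D3
# reportedly depends on. Standard bit-packed dictionary encoding is assumed.

import math

def dict_compressed_size(n_rows, n_distinct, avg_value_bytes):
    """Estimated bytes: bit-packed code vector + the dictionary itself."""
    bits_per_code = max(1, math.ceil(math.log2(max(2, n_distinct))))
    code_vector = n_rows * bits_per_code / 8
    dictionary = n_distinct * avg_value_bytes
    return code_vector + dictionary

# Example: a TPC-H Lineitem-like string column, 6M rows, low cardinality.
est = dict_compressed_size(n_rows=6_000_000, n_distinct=2_500,
                           avg_value_bytes=25)
print(f"~{est / 2**20:.1f} MiB")  # ~8.6 MiB vs ~143 MiB raw (6M * 25 bytes)
```

The large gap between raw and encoded size is what makes client-side encoding attractive bandwidth-wise, and an estimate of this kind is cheap enough to compute per partition at load time.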
Related Information
- submission (PDF)
- final paper (PDF) — published at CIDR 2020