Title
Shared Load(ing): Efficient Bulk Loading into Optimized Storage
Authors
Stefan Noll, Jens Teubner, Norman May, and Alexander Böhm
Published
Proc. of the 10th Annual Conference on Innovative Data Systems Research (CIDR), Amsterdam, January 2020.
Abstract
Bulk loading into the optimized storage of a database system is a performance-critical task for data analysis, replication, and system integration. Depending on the storage layout, it may entail complex data transformations, making it also an expensive task that can disturb other workloads running in parallel.
In this work, we demonstrate that for a commercial, in-memory columnar system with compression-optimized storage, data transformation dominates the cost of bulk loading. The transformations may cause resource contention on a stressed system, resulting in poor and unpredictable performance for both bulk loading and query processing. To mitigate this problem, we propose Shared Loading, a distributed bulk loading mechanism that enables dynamically offloading deserialization and data transformation to the machine where the input data resides. In our evaluation we demonstrate that, for different network bandwidths and data sets, Shared Loading accelerates bulk loading into compression-optimized storage and improves the performance and predictability of queries running concurrently.
Project
Real-Time Analysis and Storage for High-Volume Data in Particle Physics (SFB 876, C5)
Publication Log
December 2019
camera-ready for CIDR 2020
August 2019
submission to CIDR 2020 (accepted)
- submission (PDF)
- reviews (results: weak accept, weak reject, accept)