Shared Load(ing): Efficient Bulk Loading into Optimized Storage
Stefan Noll, Jens Teubner, Norman May, and Alexander Böhm
Proc. of the 10th Annual Conference on Innovative Data Systems Research (CIDR), Amsterdam, January 2020.
Bulk loading into the optimized storage of a database system is a performance-critical task for data analysis, replication, and system integration. Depending on the storage layout, it may entail complex data transformations, which also makes it an expensive task that can disturb other workloads running in parallel.
In this work, we demonstrate that for a commercial, in-memory columnar system with compression-optimized storage, data transformation dominates the cost of bulk loading. The transformations may cause resource contention on a stressed system, resulting in poor and unpredictable performance for both bulk loading and query processing. To mitigate this problem, we propose Shared Loading, a distributed bulk loading mechanism that enables dynamically offloading deserialization and data transformation to the machine where the input data resides. In our evaluation, we demonstrate that, for different network bandwidths and data sets, Shared Loading accelerates bulk loading into compression-optimized storage and improves the performance and predictability of queries running concurrently.
Real-Time Analysis and Storage for High-Volume Data in Particle Physics (SFB 876, C5)
camera-ready for CIDR 2020
submission to CIDR 2020 (accepted)