Scalable Correlated Sampling for Join Query Estimations on Big Data

10 pages • Published: September 26, 2019

Abstract

Estimating query results within limited time constraints is a challenging problem in big data management research. Query estimation based on simple random samples performs well for simple selection queries; however, it returns results with extremely high relative errors for complex join queries. Existing methods only work well with foreign key joins, and the sample size can grow dramatically as the dataset gets larger. This research implements a scalable sampling scheme in a big data environment, namely correlated sampling in map-reduce, that can speed up query size estimation, give precise join query estimations, and minimize storage costs when presented with big data. Extensive experiments with large TPC-H datasets in Apache Hive show that our sampling method produces fast and accurate query estimations on big data.

Keyphrases: query approximation, query size estimation, sampling

In: Frederick Harris, Sergiu Dascalu, Sharad Sharma and Rui Wu (editors). Proceedings of 28th International Conference on Software Engineering and Data Engineering, vol 64, pages 41-50.
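
The abstract names correlated sampling but does not spell out the scheme, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes both tables are filtered with the same hash function on the join attribute, keeping a row when its key hashes below a sampling rate p, and that the size of the join of the two samples is scaled by 1/p to estimate the full join size. All names (`unit_hash`, `correlated_sample`, `estimate_join_size`, the `seed` parameter) are hypothetical. Because the filter looks at one row at a time, it could in principle run as a map step in a map-reduce job.

```python
import hashlib
from collections import Counter


def unit_hash(key, seed="demo-seed"):
    """Map a join-key value to a deterministic pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    # 48 bits fit exactly in a float, so the result is always strictly below 1.
    return int.from_bytes(digest[:6], "big") / 2**48


def correlated_sample(rows, key_index, p, seed="demo-seed"):
    """Keep every row whose join key hashes below p (a per-row filter)."""
    return [row for row in rows if unit_hash(row[key_index], seed) < p]


def join_size(left, right, left_key, right_key):
    """Exact equi-join size, computed from per-key counts of the right side."""
    right_counts = Counter(row[right_key] for row in right)
    return sum(right_counts[row[left_key]] for row in left)


def estimate_join_size(left, right, left_key, right_key, p, seed="demo-seed"):
    """Assumed estimator: join the two correlated samples and scale by 1/p."""
    left_sample = correlated_sample(left, left_key, p, seed)
    right_sample = correlated_sample(right, right_key, p, seed)
    return join_size(left_sample, right_sample, left_key, right_key) / p


# Toy tables (illustrative data only): the join key is column 0.
left_rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
right_rows = [(1, "x"), (2, "y"), (2, "z"), (4, "w")]
full_estimate = estimate_join_size(left_rows, right_rows, 0, 0, p=1.0)
```

Since the same hash decides inclusion in both tables, a key value is either kept on both sides or dropped on both sides, which is what distinguishes this from sampling each table independently. With p = 1.0 nothing is filtered out and the estimate coincides with the exact join size.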